June 2009
American states commonly name one or more flowers as their state flower; for example, the California Poppy is the state flower of California. Harvesting state names and flowers from DBpedia illustrates several important strategies for the web-page developer who has the ambition to make an AJAX call to DBpedia, unpack the payload and patch it into a web page.
DBpedia provides Web access to the structured information that appears in the sidebar "infobox" feature of many Wikipedia articles. An infobox style guide comments: "In theory, the fields in an infobox should be consistent across every article using it; in practice, however, this is rarely the case, for a number of reasons." While this comment focuses on the unpredictable presence or absence of fields in infoboxes, the developer must also be concerned about variations in the structure of the content of the fields. This variation is illustrated by the XML payload listing state flowers discussed below.
The blog "Meow meow meow" uses the following SPARQL query to list American states and their flowers.
SELECT ?state ?flower WHERE {
?state skos:subject <http://dbpedia.org/resource/Category:States_of_the_United_States> .
?state dbpedia2:flower ?flower
}
Here is the raw XML payload received when the MIME type is set to "application/sparql-results+xml":
An examination of this XML output reveals (1) some complete state/flower repetition (e.g., California and the California Poppy appear six times), (2) some partial state/flower repetition (e.g., Alabama has two state flowers), (3) some embedded hex HTML escape codes (see Florida's entry), (4) some embedded HTML code (see West Virginia's entry), and (5) some <literal> elements mixed in with <uri> elements.
No claim is made here that this list of irregularities is exhaustive, since DBpedia's content comes from community input, over which there seems to be little control.
If the developer wishes to glean from this payload the pattern of state name/state flower, then both the state and flower values must be reduced to bare names, the repetition in the data must be controlled, and non-text characters must be transformed. The loop through the payload that accomplishes this cannot assume the uniform presence of the <uri> element.
The "application/sparql-results+xml" MIME type produces data structured like this:
<result>
<binding name="state">
<uri>http://dbpedia.org/resource/Alaska</uri>
</binding>
<binding name="flower">
<uri>http://dbpedia.org/resource/Forget-me-not</uri>
</binding>
</result>
The "application/sparql-results+json" MIME type produces data structured like this:
{
"state": { "type": "uri", "value": "http://dbpedia.org/resource/Alaska" } ,
"flower": { "type": "uri", "value": "http://dbpedia.org/resource/Forget-me-not" }
}
Unpacking either of these structures requires some strenuous node traversal. For example, unpacking the XML requires cycling through all the <result> nodes, finding the <binding> child node whose "name" attribute has the value "state", and then reaching inside for its "childNodes[1].firstChild.nodeValue". Applying this method to the flower nodes is slightly more complex, as they may occur as <uri> elements or <literal> elements. Unpacking the JSON results is not appreciably simpler.
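As a rough illustration of the JSON traversal, here is a minimal sketch. It assumes the full payload follows the W3C SPARQL JSON results layout, in which rows like the one shown above sit inside "results.bindings". The sample payload is hypothetical: the Alaska row comes from the listing above, while the second row is invented solely to show a literal value. The name unpackBindings is likewise an assumption made for this sketch.

```javascript
// Hypothetical sample payload in the W3C SPARQL JSON results layout.
// The second row is invented to illustrate a "literal" flower value.
var payload = {
  head: { vars: ["state", "flower"] },
  results: {
    bindings: [
      { state:  { type: "uri", value: "http://dbpedia.org/resource/Alaska" },
        flower: { type: "uri", value: "http://dbpedia.org/resource/Forget-me-not" } },
      { state:  { type: "uri", value: "http://dbpedia.org/resource/Example_State" },
        flower: { type: "literal", value: "Example flower" } }
    ]
  }
};

// Return one { state, flower, flowerIsUri } object per result row.
function unpackBindings(json) {
  var rows = [];
  var bindings = json.results.bindings;
  for (var i = 0; i < bindings.length; i++) {
    var b = bindings[i];
    // Skip rows that lack either variable instead of throwing.
    if (!b.state || !b.flower) { continue; }
    rows.push({
      state: b.state.value,
      flower: b.flower.value,
      // The "type" field distinguishes URIs from literals.
      flowerIsUri: b.flower.type === "uri"
    });
  }
  return rows;
}

var rows = unpackBindings(payload);
```

Because every value carries its own "type" field, the <uri>/<literal> distinction becomes a simple comparison here, but the nesting still has to be walked row by row.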
The "text/html" MIME type produces a much simpler structure that immediately finesses the <uri>/<literal> problem:
<tr>
<td>http://dbpedia.org/resource/Alaska</td>
<td>http://dbpedia.org/resource/Forget-me-not</td>
</tr>
The remainder of this example uses the "text/html" MIME type, and presumes that a Greasemonkey script will be running in a Firefox browser. For reference on using Greasemonkey and making GM_xmlhttpRequest calls to DBpedia, see Wikipedia on a web page.
The targeted SPARQL endpoint is "http://dbpedia.org/sparql".
The "Meow meow meow" SPARQL query above is appropriate for the SPARQL Explorer but needs to be elaborated with an explicit namespace for "dbpedia2" to work in a Greasemonkey GM_xmlhttpRequest:
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?state ?flower
WHERE
{
?state skos:subject <http://dbpedia.org/resource/Category:States_of_the_United_States> .
?state dbpedia2:flower ?flower
}
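The request itself might be assembled along the following lines. This is only a sketch: it assumes the query travels in the standard "query" URL parameter and that the "text/html" MIME type is requested through an Accept header. The names buildRequest, ENDPOINT and QUERY are hypothetical and used only for illustration.

```javascript
var ENDPOINT = "http://dbpedia.org/sparql";

// The query from the text, joined into a single string.
var QUERY = [
  "PREFIX dbpedia2: <http://dbpedia.org/property/>",
  "SELECT ?state ?flower",
  "WHERE",
  "{",
  "?state skos:subject <http://dbpedia.org/resource/Category:States_of_the_United_States> .",
  "?state dbpedia2:flower ?flower",
  "}"
].join("\n");

// Build the details object that GM_xmlhttpRequest expects.
// buildRequest is a hypothetical helper, not part of Greasemonkey.
function buildRequest(endpoint, query, onLoad) {
  return {
    method: "GET",
    // encodeURIComponent escapes "?", "<", ">", ":" and whitespace in the query.
    url: endpoint + "?query=" + encodeURIComponent(query),
    // Assumption: the result format is selected via the Accept header.
    headers: { "Accept": "text/html" },
    onload: onLoad
  };
}

var details = buildRequest(ENDPOINT, QUERY, function (responseDetails) {
  // responseDetails.responseText would hold the HTML table of results.
});

// Inside a Greasemonkey script the call would then be:
// GM_xmlhttpRequest(details);
```

Keeping the details object separate from the call makes it possible to inspect the generated URL before any request is sent.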
The aim is to create a nodelist of <tr> nodes. Placing the responseDetails.responseText inside a newly created element permits the use of an XPath query to build that nodelist.
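// Parse the HTML response inside a detached <table> element
var newTable = document.createElement("table");
newTable.innerHTML = responseDetails.responseText;
// './/tr' searches below newTable; an ordered snapshot keeps the heading row at index 0
var tr = document.evaluate('.//tr', newTable, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);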
Looping through the nodelist of <tr> nodes reveals the state and flower information with a minimum of node traversal. Two things are worth noting about this loop. It begins at "1" to skip the very first <tr> element in the raw data, which contains the table headings. And childNodes "1" and "3" hold the state and flower information, since the whitespace between the tags occupies text nodes "0" and "2".
for (var i = 1; i < tr.snapshotLength; i++)
{
  var state = tr.snapshotItem(i).childNodes[1].firstChild.nodeValue; // state
  var flower = tr.snapshotItem(i).childNodes[3].firstChild.nodeValue; // flower
  ...
}
The JavaScript lastIndexOf() and substring() can be used to isolate state and flower names.
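A minimal sketch of that step follows. The helper name resourceName is hypothetical, and replacing underscores with spaces is an added assumption based on the way DBpedia resource URIs such as "Category:States_of_the_United_States" are written.

```javascript
// Return the part of a resource URI after the final "/",
// with underscores turned into spaces.
// resourceName is a hypothetical helper used for illustration.
function resourceName(uri) {
  var slash = uri.lastIndexOf("/");
  // lastIndexOf returns -1 when no "/" is present,
  // so substring(0) then keeps the whole string.
  var name = uri.substring(slash + 1);
  return name.replace(/_/g, " ");
}

resourceName("http://dbpedia.org/resource/Alaska");        // "Alaska"
resourceName("http://dbpedia.org/resource/Forget-me-not"); // "Forget-me-not"
```

A value that arrives as plain text rather than as a URI passes through unchanged apart from the underscore replacement, so the same helper can be applied to both kinds of cell.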
An associative array indexed on state name holds the flower data. At present, a hack accommodates a state with two flower names.
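One possible way to organize such an array, which is not necessarily the approach used in the script described here, is to store a list of flower names under each state, so that repeated rows are ignored and a second flower is simply appended. The helper addFlower is hypothetical, and the two Alabama flower names are placeholders rather than real data.

```javascript
// Add a flower to the list kept under a state name, ignoring exact repeats.
// addFlower is a hypothetical helper; flowersByState is the associative array.
function addFlower(flowersByState, state, flower) {
  if (!Object.prototype.hasOwnProperty.call(flowersByState, state)) {
    flowersByState[state] = [];
  }
  var list = flowersByState[state];
  for (var i = 0; i < list.length; i++) {
    if (list[i] === flower) { return; } // complete repetition: nothing to add
  }
  list.push(flower); // first flower, or an additional one for the same state
}

var flowersByState = {};
addFlower(flowersByState, "California", "California Poppy");
addFlower(flowersByState, "California", "California Poppy"); // repeated row
addFlower(flowersByState, "Alabama", "placeholder flower A"); // placeholder name
addFlower(flowersByState, "Alabama", "placeholder flower B"); // placeholder name
```

With this layout a state with one flower and a state with two are handled by the same code path, at the cost of always reading the flowers back as a list.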