计算机科学论坛--SPARQL and relational databases: getting started

http://www.snee.com/bobdc.blog/2008/10/sparql_and_relational_database.html

SPARQL and relational databases: getting started
Asking about tables and columns and doing a simple join.

In an [URL=http://www.snee.com/xml/xml2006/owlrdbms.html]earlier project[/URL] I did querying relational data with SPARQL, I wanted to demonstrate how adding OWL metadata made it possible to answer useful queries that couldn't have been answered with just the original data. I did this using two simple databases of one table each, but recently, while helping someone who was also using the [URL=http://sites.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/]D2RQ[/URL] interface to provide access to some relational data, I decided to get more comfortable with using SPARQL to query data from multi-table relational databases.

I wanted to find ways to:

[URL=http://www.flickr.com/photos/toasty/1540997910/][/URL]
list a database's tables

list the columns in those tables

run a query that joined data from at least two of the tables

It was all easier than I though it would be.

For a database to query, I went to the sample database section of [URL=http://dev.mysql.com/doc/#sampledb]MySQL's documentation page[/URL] and got the "world" database. This database has a country table with columns for country names, country codes unique within the table, and other information about the country such as its population and head of state. Another table, named countrylanguage, has a column for a country code to link to the country table, a column for a language spoken in that country, another for a figure showing the percentage of the country's residents who speak that language, and a fourth column for a boolean IsOfficial value.

The following SQL query against that database lists the name of each country, a language spoken there, and what percentage of the population speak that language, with the rows sorted from high percentages to low:

SELECT Name, Language, Percentage FROM country, countrylanguage
WHERE country.Code = countrylanguage.CountryCode
ORDER BY Percentage DESC;

Here's how I developed a SPARQL query doing the same thing.

As I described in the writeup of my earlier project, the first step of using D2RQ with a particular database is pointing its generate-mapping utility at a database installed in a running copy of MySQL (or one of the other relational database managers that D2RQ supports) to generate an SQL-to-SPARQL mapping file for that database. Next, you start up the d2r-server program with that mapping file as a parameter to run a server that provides a SPARQL endpoint for that database.

You could then send SPARQL queries to that server's SPARQL endpoint [URL=http://www.snee.com/bobdc.blog/2008/09/querying_wikidbpedia_for_presi.html#id203795]using Curl[/URL]. For exploring the data, though, I prefer the SNORQL interface, because I can enter a query on a web form, click the Go button, and browse the result. (The SNORQL interface in the latest official release of D2RQ [URL=http://sourceforge.net/mailarchive/forum.php?thread_name=48F7C172.3040003%40snee.com&forum_name=d2rq-map-devel]doesn't get along with Firefox 3.0[/URL]—apparently there's already a fixed version in D2RQ's cvs tree—so I used Chrome for this.)

When exploring a relational database, I first want to know what its tables are and what columns are in those tables. To do this with SPARQL queries, the D2RQ mapping file that I generated earlier provided some good clues. (Keep in mind that there may be other SPARQL interfaces to relational databases in the future, and they may not all map the relational structures to RDF the same way.)

First: what are the tables in the database? D2RQ treats each table row as a resource of type tablename. For example, it treats the country table's row for France as a resource identified as http://localhost:2020/resource/country/FRA, which has a http://www.w3.org/1999/02/22-rdf-syntax-ns#type of http://localhost:2020/resource/vocab/country. So, by asking for a list of all the types with the following query (with DISTINCT added so that each only shows up once),

SELECT DISTINCT ?o WHERE {
?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o
}

SNORQL's browse mode gives me this list (SNORQL defines "vocab" as the prefix for the http://localhost:2020/resource/vocab/ URI), which corresponds to the tables declared in the world database:

vocab:country
vocab:countrylanguage
vocab:city

What are the columns of these tables? For an example of one column's name, the world database's country table has a row for France with a figure of 59225700 in the Population column. D2RQ represents the column name as a property named tablename_columnname, so the triple describing France's population tells us that http://localhost:2020/resource/country/FRA has a http://localhost:2020/resource/vocab/country_Population of 59225700. How do we list all these predicate names? As I mentioned in [URL=http://www.snee.com/bobdc.blog/2008/08/how_you_can_explore_a_new_set.html]How you can explore a new set of linked data[/URL], the following query lists all the predicates in an RDF-based data collection:

SELECT DISTINCT ?p WHERE {?s ?p ?o}

Displaying all predicates in the world database, though, includes a few extras along with the ones representing the table columns. The table columns are all in the http://localhost:2020/resource/vocab namespace, so the best way I could think of to query for predicates that were in that namespace was like this:

SELECT DISTINCT ?p
WHERE { ?s ?p ?o.
FILTER(regex(str(?p),"http://localhost:2020/resource/vocab/")).
}

It's pretty kludgy, so I'd love to hear of a better alternative. It selects all the predicates in the data set and then only passes along the ones whose URLs, when converted to a string, have "http://localhost:2020/resource/vocab/" in them. If another namespace used http://localhost:2020/resource/vocab/foo as its URL, this FILTER would select predicates in that namespace as well, because it has that "http://localhost:2020/resource/vocab/" substring in there, but I couldn't find a better way to do this. (Although SPARQL implements a subset of XPath, the [URL=http://www.w3.org/TR/1999/REC-xpath-19991116#function-namespace-uri]namespace-uri[/URL] function is not part of this subset.)

I won't show you all 24 predicate names that that this query returns, but for the SPARQL version of the SQL query above that listed country names, languages, and percentages, I picked out the tablename_columnname predicates I needed from the list of 24 and used them to create this query:

SELECT ?name ?language ?percentage
WHERE { ?s1 vocab:country_Name ?name;
            vocab:country_Code ?ccode.
        ?s2 vocab:countrylanguage_CountryCode ?ccode;
            vocab:countrylanguage_Language ?language;
            vocab:countrylanguage_Percentage ?percentage.
}
ORDER BY DESC(?percentage)

It's not a very fancy query. The most interesting part is the use of the ccode variable to connect information in triples from the world database's country table with information in triples from the database's countrylanguage table. When I first started this I had no idea how I was going to do this SPARQL equivalent of an SQL join, but once I sat down and tried it was intuitive enough. I won't show you all 984 results rows, but here's a selection from the middle to give you the flavor of what's there:

"Réunion"        "Creole French"  91.5
"Germany"        "German"         91.3
"Seychelles"     "Seselwa"        91.3
"Romania"        "Romanian"       90.7
"American Samoa" "Samoan"         90.6
"Syria"          "Arabic"         90.0
"Swaziland"      "Swazi"          89.9
"Bahamas"        "Creole English" 89.7
"Chile"          "Spanish"        89.7
"Sweden"         "Swedish"        89.5

Now I feel more confident about forging ahead to explore sets of relational data that I can access with SPARQL queries using the D2RQ interface.

Posted by Bob DuCharme on October 20, 2008 5:24 PM | [URL=http://www.snee.com/bobdc.blog/2008/10/sparql_and_relational_database.html]Permalink[/URL]

TrackBack
TrackBack URL for this entry:
http://www.snee.com/cgi-sys/cgiwrap/bobd/managed-mt/mt-tb.cgi/447

Comments
(Note: I usually close comments for an entry a few weeks after posting it to avoid comment spam.)

Bob,

Queries like SELECT DISTINCT ?p WHERE {?s ?p ?o} are OK for RDF stores because of their specialized indexing strategies, but I don't think this would work well with D2RQ against a relational database.

It would be OK if the RDBMS had a very small amount of data. But on a database of any size, such queries are not practical. I suspect D2RQ will just try to get all the data in the entire database and then try to figure out predicates. Same for query you are suggesting for getting the table names.

D2RQ translates database schema in the following way - all tables get represented as owl:Class, all columns are exposed as properties connected to the classes using rdfs:domain statements.

So, a much less expensive way to get all tables is simply by using SELECT ?table WHERE {?table rdf:type owl:Class}

Then to get columns you can do ?column rdfs:domain ?table. In fact, this single query will get you all tables with the corresponding columns

SELECT ?table ?column WHERE{?column rdfs:domain ?table}

Regards,

Irene

Posted by: Irene Polikoff | [URL=http://www.snee.com/bobdc.blog/2008/10/sparql_and_relational_database.html#comment-2135]October 20, 2008 9:18 PM[/URL]

Thanks Irene! That all makes sense to me, but
I didn't see any use of the [URL=http://www.w3.org/2002/07/owl#]http://www.w3.org/2002/07/owl#[/URL] namespace (or even the string "owl") in the mapping file generated by D2RQ, and

SELECT ?table WHERE {?table rdf:type }

got me no results. This also got me no results:

SELECT ?table ?column WHERE{?column rdfs:domain ?table}


	W 3 C h i n a ( since 2003 ) 旗下站点苏ICP备05006046号《全国人大常委会关于维护互联网安全的决定》《计算机信息网络国际联网安全保护管理办法》	62.500ms