Archive

Posts Tagged ‘SPARQL’

First release of SemanticXO!

Here it is: the first fully featured release of SemanticXO! Use it in your activities to store and share any kind of structured information with other XOs. The installation procedure is easy and only requires an XO-1 running operating system version 12.1.0. Go to the Git repository and download the files “setup.sh” and “semanticxo.tar.gz” somewhere on the XO (these files are in the directory “patch_my_xo”). Then, log in as root and execute “sh setup.sh setup”. The installation package will copy the API onto the XO, set up the triple store and install two demo activities. Once the procedure is complete, reboot the XO to activate everything.

The XO after the installation of SemanticXO

There are two demo activities, which are described in more detail on the project page. Under the hood, SemanticXO provides an API to store named graphs containing the description of one or more resources. These named graphs are marked with an author name, a modification date and, optionally, a list of other devices (identified by their URI) to share the graph with. This data is used by a graph replication daemon which, every 5 minutes, browses the network using Avahi, finds other triple stores, and downloads a copy of the graphs that are shared with it. The data backend of the mailing activity provides a good example of how the API is used.
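
To give an idea of how such graphs can be laid out, here is a minimal sketch using rdflib. This is not the actual SemanticXO API: the “olpc:” namespace, the property names (author, modified, sharedWith) and all URIs are placeholders made up for the illustration.

from datetime import datetime, timezone
from rdflib import Dataset, Literal, Namespace, URIRef

OLPC = Namespace("http://example.org/olpc#")  # placeholder namespace

ds = Dataset()

# One named graph per piece of shared information
graph_uri = URIRef("http://example.org/graph/note-1")
g = ds.graph(graph_uri)

# The payload: the description of one resource
note = URIRef("http://example.org/resource/note-1")
g.add((note, OLPC.text, Literal("Remember to water the plants")))

# Metadata attached to the graph itself: author, modification date and the
# devices (identified by their URI) the graph is shared with
ds.add((graph_uri, OLPC.author, Literal("amina")))
ds.add((graph_uri, OLPC.modified, Literal(datetime.now(timezone.utc).isoformat())))
ds.add((graph_uri, OLPC.sharedWith, URIRef("http://example.org/device/xo-42")))

print(ds.serialize(format="trig"))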

Exposing API data as Linked Data

The Institute of Development Studies (IDS) is a UK-based institute specialised in development research, teaching and communications. As part of their activities, they provide an API to query their knowledge services data set comprising more than 32k abstracts or summaries of development research documents related to 8k development organisations, almost 30 themes and 225 countries and territories.

A month ago, Victor de Boer and I received a grant from IDS to investigate exposing their data as RDF and building some client applications making use of the enriched data. We aimed to use the API as it is and create 5-star Linked Data by linking the created resources to other resources on the Web. The outcome is the IDSWrapper, which is now freely accessible, both as HTML and as RDF. Although this is still work in progress, the wrapper already shows some of the advantages of publishing the data as Linked Data.

Enriched data through linkage

When you query for a document, the API tells you the language in which the document is written, for instance “English”. The wrapper replaces this information by a reference to the matching resource in Lexvo. The property “language” is also replaced by the equivalent property defined in Dublin Core, commonly used to denote the language a given document is written in. For the data consumer, Lexvo provides alternate spellings of the language name in different languages. Instead of just knowing that the language is named “English”, the data consumer, after dereferencing the data from Lexvo, will know that this language is also known as “Anglais” in French or “Engelsk” in Danish.
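
To make that concrete, here is a minimal rdflib sketch of the rewriting. The document URI is made up for the example; the Lexvo URI for English and the Dublin Core “language” term are the real ones.

from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
doc = URIRef("http://example.org/ids/document/A12345")  # hypothetical document URI

# Instead of the literal "English", point to the Lexvo resource for the
# language and use the Dublin Core "language" property
g.add((doc, DCTERMS.language, URIRef("http://lexvo.org/id/iso639-3/eng")))

print(g.serialize(format="turtle"))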

Part of the description of a document

Links can also be established with other resources to enrich the results provided. For instance, the information provided by IDS about countries is enriched with a link to their equivalents in GeoNames. That provides localised names for the countries as well as geographical coordinates.

Part of the description of the resource "Gambia"

Similarly, the description of each theme is linked to its equivalent in DBpedia to benefit from the structured information extracted from its Wikipedia page. Thanks to that link, the data consumer gets access to some extra information, such as pointers to related documents.

Part of the description of the theme "Food security"

Besides, the resources exposed are also internally linked. The API provides an identifier for the region a given document is related to. In the wrapper, this identifier is turned into the URI corresponding to the relevant resource.

Example of internal link in the description of a document

Integration on the data publisher side

All of these links are established by the wrapper, using either SPARQL queries (for DBpedia) or calls to data APIs (for Lexvo and GeoNames). This is something any client application could do, obviously, but one advantage of publishing Linked Data is that part of the data integration work is done server side, by the person who knows the most about what the data is about. A data consumer just has to use the links already there instead of having to figure out how to establish them himself.
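
For the DBpedia case, such a lookup can be as simple as a SPARQL query matching a theme’s label. Here is a rough sketch with SPARQLWrapper; the actual query used by the wrapper may well be different.

from SPARQLWrapper import SPARQLWrapper, JSON

# Look up the DBpedia resource whose English label matches an IDS theme name.
# Illustration only: the wrapper's real linking query may differ.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?resource WHERE {
      ?resource rdfs:label "Food security"@en .
    } LIMIT 1
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["resource"]["value"])  # e.g. http://dbpedia.org/resource/Food_security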

A single data model

Another advantage for a data consumer is that all the data published by the wrapper, as well as all the connected data sets, are published in RDF. That is one single data model to consume. A simple HTTP GET asking for RDF content returns structured data for the content exposed by the wrapper, and for the data from DBpedia, Lexvo and GeoNames. There is no need to worry about different data formats returned by different APIs.
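
In practice that single GET is just content negotiation. Here is a minimal sketch with requests and rdflib; the wrapper URI below is a placeholder, and resources from DBpedia, Lexvo or GeoNames can be dereferenced the same way.

import requests
from rdflib import Graph

uri = "http://example.org/idswrapper/document/A12345"  # placeholder URI

# Ask for RDF instead of HTML and parse whatever comes back
response = requests.get(uri, headers={"Accept": "text/turtle"})
g = Graph()
g.parse(data=response.text, format="turtle")

print(f"{len(g)} triples retrieved")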

Next steps

We are implementing more linking services and also working on making the code more generic. Our goal, which is only partially fulfilled for now, is to have a generic tool that only requires an ontology for the data set in order to expose it as Linked Data. The code is freely available on GitHub; watch us to stay up to date with the evolution of the project 😉

Updates about SemanticXO

With the last post about SemanticXO dating back to April, it’s time for an update, isn’t it? 😉

A lot of things have happened since April. First, a paper about the project was accepted for presentation at the First International Conference on e-Technologies and Networks for Development (ICeND2011). Then, I spoke about the project during the symposium of the Network Institute as well as during SugarCamp #2. Lastly, a first release of a triple-store-powered Journal is now available for testing.

Publication

The paper entitled “SemanticXO: connecting the XO with the World’s largest information network” is available from Mendeley. It explains what the goal of the project is and then reports on a performance assessment and a first test activity. Most of the information it contains has actually been blogged before here (cf. there and there), but if you want a global overview of the project, this paper is still worth a read. The conference itself was very nice and I did some networking. I came back with a lot of business cards and the hope of keeping in touch with the people I met there. The slides from the presentation are available from SlideShare.

Presentations

The Network Institute of Amsterdam organised on May 10 a one-day symposium to strengthen the ties between its members and to stimulate further collaboration. This institute is a long-term collaboration between groups from the Department of Computer Science, the Department of Mathematics, the Faculty of Social Sciences and the Faculty of Economics and Business Administration. I presented a poster about SemanticXO and an abstract went into the proceedings of the event.

More recently, I spent the 10th and 11th of September in Paris for SugarCamp #2 organised by OLPC France. Bastien managed to get me a bit of time on Sunday afternoon to re-do the presentation from ICeND2011 (thanks again for that!) and get some feedback. This was a very well organised event held at a cool location (“La cité des sciences”), and it was also the first time I met so many other people working on Sugar. I could finally put some faces on the names I had seen so many times on the mailing lists and in the Git logs 🙂

First SemanticXO prototype

The project development effort is split into 3 parts: a common layer hiding the complexity of SPARQL, a new implementation of the journal datastore, and the coding of diverse activities making use of the new semantic capabilities. All three are going more or less in parallel, at different speeds, as, for instance, the work on activities directs what the common layer will contain. I’ve focused my efforts on the journal datastore to get something ready to test. It’s a very first prototype that has been coded starting from the genuine datastore 0.92 and replacing the part in charge of the metadata. The code taking care of the files remains the same. This new datastore is available from Gitorious but, because installing the triple store and replacing the journal is a tricky manual process, I bundled all of that 😉

Installation

The installation bundle consists of two files, an archive “semanticxo.tgz” and a script “patch-my-xo.sh”. To install SemanticXO, you need to download the two, put them in the same location somewhere on your machine and then type (as root):

sh ./patch-my-xo.sh setup

This will install a triple store, add it to the daemons started at boot time and replace the default journal with one using the triple store. Be careful to have backups if needed, as this will remove all the content previously stored in the journal! Once the script has been executed, reboot the machine to start using the new software.

The bundle has been tested on an XO-1 running the software release 11.2.0 but it should work with any software release on both the XO-1 and XO-1.5. This bundle won’t work on the XO-1.75 as it contains a binary (the triple store) not compiled for ARM.

What now?

Now that you have the thing installed, open the browser and go to “http://127.0.0.1:8080”. You will see the web interface of the triple store, which allows you to issue SPARQL queries and see which named graphs are stored. If you are not fluent in SPARQL, the named graph interface is the most interesting part to play with. Every entry in the journal gets its own named graph; after having populated the journal with some entries, you will see this list of named graphs grow. Click on one of them and the content of the journal entry will be displayed. Note that this web interface is also accessible from any other machine on the same network as the XO. This yields new opportunities in terms of backup and information gathering: a teacher can query the journal of any XO directly from a school server, or another XO.
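
The SPARQL end point behind that web interface can also be queried programmatically. Here is a small sketch with SPARQLWrapper that lists the named graphs; note that the “/sparql” path below is an assumption, so use whatever query URL the web interface actually advertises.

from SPARQLWrapper import SPARQLWrapper, JSON

# List the named graphs stored on the XO (one per journal entry).
# The "/sparql" path is a guess; adjust it to the store's actual query URL.
sparql = SPARQLWrapper("http://127.0.0.1:8080/sparql")
sparql.setQuery("SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["g"]["value"])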

Removing

The patch script comes with an uninstall function if you want to revert the XO to its original setup. To use it, simply type (as root):

sh ./patch-my-xo.sh remove

and then reboot the machine.

Clustering activity for the XO

In the past few years, many data sets have been published and made public in what is now often called the Web of Linked Data, making a step towards the “Web 3.0”: a Web combining a network of documents and data suitable for both human and machine processing. In this Web 3.0, programs are expected to give more precise answers to queries as they will be able to associate a meaning (the semantics) with the information they process. Sugar, the graphical environment found on the XO, is currently Web 2.0 enabled – it can browse web sites – but has no dedicated tools to interact with the Web 3.0. The goal of the SemanticXO project introduced earlier on this blog is to make Sugar Web 3.0 ready by adding semantic software to the XO.

One cornerstone of this project is to get a triple store, the software in charge of storing the semantic data, running on the limited hardware of the machine (in our case, an XO-1). As this proved to be feasible, we can now go further and start building activities making use of it. And to begin with, a simple clustering activity: the goal is to sort items into boxes using drag & drop. The user can create as many boxes as he needs, and the items may be moved between boxes. Here is a screenshot of the application, showing Amerindian items:

Prototype of the clustering activity

The most interesting aspect of this activity is actually under its hood and is not visible on the screenshot. Here are some of the triples generated by the application (note that the URIs have been shortened for readability):

subject                  predicate          object
olpc:resource/a05864b4   rdf:type           olpc:Item
olpc:resource/a05864b4   olpc:name          “image114”
olpc:resource/a05864b4   olpc:hasDepiction  “image114.jpg”
olpc:resource/a82045c2   rdf:type           olpc:Box
olpc:resource/a82045c2   olpc:hasItem       olpc:resource/a05864b4
olpc:resource/78cbb1f0   rdf:type           olpc:Box

It is relevant to note here the flexibility of that data model: the assignment of one item to a box is stated by a triple using the predicate “hasItem”; one of the boxes is empty because there is no such statement linking it to an item. A varying number of similar triples can be used, without any constraint, and the same goes for essentially all the triples in the system. There is no requirement for a set of predicates that all the items must have. Let’s see the usage that can be made of this data through three different SPARQL queries, introduced from the simplest to the most sophisticated:

  • List the URIs of all the boxes and the items they contain
  • SELECT ?box ?item WHERE {
    ?box rdf:type olpc:Box.
    ?box olpc:hasItem ?item.
    }
    
  • List the items and their attributes
  • SELECT ?item ?property ?val WHERE {
      ?item rdf:type olpc:Item.
      ?item ?property ?val.
    }
    
  • List the items that are not in a box
  • SELECT ?item WHERE {
      ?item rdf:type olpc:Item.
      OPTIONAL {
        ?box rdf:type olpc:Box.
        ?box olpc:hasItem ?item.
      }
      FILTER (!bound(?box))
    }
    

These three queries are just some examples; the really nice thing about this query mechanism is that (almost) anything can be asked through SPARQL. There is no need to define a set of API calls to cover a list of anticipated needs: as soon as the SPARQL end point is made available, every activity may ask whatever it wants to ask! 🙂
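
If you want to play with these queries without an XO at hand, here is a small sketch with rdflib that rebuilds the sample triples above in memory and runs the third query. The expanded “olpc:” namespace is a placeholder made up for the example; on the XO the same query would simply be sent to the local SPARQL end point.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

OLPC = Namespace("http://example.org/olpc#")  # placeholder namespace

g = Graph()
item = URIRef("http://example.org/olpc/resource/a05864b4")
box1 = URIRef("http://example.org/olpc/resource/a82045c2")
box2 = URIRef("http://example.org/olpc/resource/78cbb1f0")

# The sample triples from the table above
g.add((item, RDF.type, OLPC.Item))
g.add((item, OLPC.name, Literal("image114")))
g.add((item, OLPC.hasDepiction, Literal("image114.jpg")))
g.add((box1, RDF.type, OLPC.Box))
g.add((box1, OLPC.hasItem, item))
g.add((box2, RDF.type, OLPC.Box))

# Third query: items that are not in any box
query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX olpc: <http://example.org/olpc#>
SELECT ?item WHERE {
  ?item rdf:type olpc:Item.
  OPTIONAL {
    ?box rdf:type olpc:Box.
    ?box olpc:hasItem ?item.
  }
  FILTER (!bound(?box))
}
"""
for row in g.query(query):
    print(row.item)  # prints nothing here: the only item is already in a box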

We are not done yet, as there is still a lot to develop to finish the application (game mechanics, sharing of items, …). If you are interested in knowing more about the clustering prototype, feel free to drop a comment on this post and/or follow this activity on GitHub. You can also find more information in this technical report about the current achievements of SemanticXO and the ongoing work.

One SPARQL end point per dataset, One end point to query them all

LOD Around The Clock (LATC) logo

Although it is commonly depicted as one giant graph, the Web of Data is not a single entity that can be queried. Instead, it is a distributed architecture made of different datasets, each providing some triples (see the LOD Cloud picture and CKAN.net). Each of these data sources can be queried separately, most often through an end point understanding the SPARQL query language. Looking for answers making use of information spanning different data sets is a more challenging task, as the mechanisms used internally to query one data set (database-like joins, query planning, …) do not scale easily over several data sources.

When you want to combine information from, say, DBpedia and the Semantic Web Dog Food site, the easiest and quickest workaround is to download the content of the two datasets, possibly filtering out the triples you don’t need, and load the retrieved content into a single data store. This approach has some limitations: you must have a store running somewhere (which may require a significantly powerful machine to host it), the downloaded data must be updated from time to time, and the data you need may not be available for download in the first place.
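
For completeness, here is a minimal sketch of that workaround with rdflib; the dump file names are placeholders, and for data sets of any real size you would load into a proper triple store rather than an in-memory graph.

from rdflib import Graph

# The "download and merge" workaround: load dumps from two data sets into a
# single local graph and query them together. File names are placeholders.
g = Graph()
g.parse("dbpedia-extract.nt", format="nt")
g.parse("dogfood-dump.nt", format="nt")

print(f"{len(g)} triples loaded into one local store")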

When used along with a SPARQL data layer, eRDF offers you a solution when one of these limitations prevents you from executing your SPARQL query over several datasets. The application runs on a low-end laptop and can query, and combine the results from, several SPARQL end points. eRDF is a novel RDF query engine making use of evolutionary computing to search for the solution. Instead of the traditional resolution mechanism, an iterative trial-and-error process is used to progressively find some answers to the query (more information can be found in the published papers listed on erdf.nl and in this technical report). It is a versatile optimisation tool that can run over different kinds of data layers, and the SPARQL data layer offers an abstraction over a set of SPARQL end points.

Let’s suppose you want to find some persons and the capital of the country they live in:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX db: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?person ?first ?last ?home ?capital WHERE {
	?person  rdf:type         foaf:Person.
	?person  foaf:firstName   ?first.
	?person  foaf:family_name ?last.
	OPTIONAL {
	?person  foaf:homepage    ?home.
	}
	?person  foaf:based_near  ?country.
	?country rdf:type         db:Country.
	?country db:capital       ?capital.
	?capital rdf:type         db:Place.
}
ORDER BY ?first

Such a query can be answered by combining data from the Dog Food server and DBpedia. More data sets may also contain lists of people, but let’s focus on researchers as a start. We have to tell eRDF which end points to query; this is done with a simple CSV listing:

DBpedia;http://dbpedia.org/sparql
Semantic Web Dog Food;http://data.semanticweb.org/sparql

Assuming the query is saved into a “people.sparql” file and the end point list goes into an “endpoints.csv”, the query engine is called like this:

java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.SPARQLEngine -q people.sparql -s endpoints.csv -t 5

The query will first be scanned for its basic graph patterns, all of which will be grouped and sent to the eRDF optimiser as a set of constraints to solve. Then, eRDF will look for solutions matching as many of these constraints as possible and push all the relevant triples found back into an RDF model. After some time (set with the parameter “t”), eRDF is stopped and Jena is used to issue the query over the model that was just populated. The answers are then displayed, along with a list of the data sources that contributed to finding them.

If you don’t know which end points are likely to contribute to the answers, you can just query all of the WoD and see what happens… 😉
The package comes with a tool to fetch a list of SPARQL end points from CKAN, test them, and create a configuration file. It is called like this:

java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.GetEndPointsFromCKAN

After a few minutes, you will get a “ckan-endpoints.csv” allowing you to query the WoD from your laptop.

The source code, along with a package including all the dependencies, is available on GitHub. Please note that this is the first public release of the tool, still in snapshot state, so bugs are expected to show up. If you spot some, report them and help us improve the software. Comments and suggestions are also very welcome 🙂


The work on eRDF is supported by the LOD Around-The-Clock (LATC) Support Action funded under the European Commission FP7 ICT Work Programme, within the Intelligent Information Management objective (ICT-2009.4.3).
