Archive

Posts Tagged ‘Semantic Web’

Data export VS Faceted expressivity

Bakfiets by Anomalily on Flickr

Anyone visiting the Netherlands will inevitably stumble upon a “bakfiets” in the streets. This Dutch speciality, which looks like the result of cross-breeding a pick-up truck with a bicycle, can be used for many things, from getting the kids around to moving a fridge.

Now, let’s consider a Dutch bike shop that sells some bakfietsen among other things. In its information system these items will surely be labelled as “bakfiets”, because that is just what they are. The information system can also be expected to be filled throughout with inputs and semantics (table names, field names, …) in Dutch. If that bike shop wants to start selling its items outside of the Netherlands, the data will need to be exported into some international standard so that other sellers can re-import it into their own information systems. This is where things get problematic…

What happens to the “bakfiets” during the export? As it does not make sense to define a class “bakfiets” at the international level – it can be translated to “freight bicycle” – every shop item of type “bakfiets” will most certainly be exported as an item of type “bike”. If the Dutch shop owner is lucky, the standard may let him indicate through a comment property that, no, this is not really just a standard two-wheeled bike. But even if the importer is able to use that comment (which is not guaranteed), the information is lost: when going international, every “bakfiets” becomes a regular bike. Even more worrying is that, besides the information loss itself, there is no indication of how much of it is gone.

When the data is exported from one system and re-imported into another, specificity is lost

Semantic Web technologies can help here by enabling the qualification of shop items with facets rather than strict types: assigning labels or tags to things instead of putting items into boxes. The Dutch shop will be able to express in its knowledge system that its bikes with a box are both of the specific type “bakfiets”, which makes sense only in the Netherlands, and instances of the international type “bike”. An additional piece of information in the knowledge base connects the two types, stating that the former is a specialisation of the latter. The resulting information export flow is as follows:

  1. The Dutch shop assigns all the box-bikes to the class “bakfiets” and the regular bikes to the class “bike”.
  2. A “reasoner” infers that, because “bakfiets” is a specialisation of “bike”, all these items are also of type “bike”.
  3. Another, non-Dutch, shop asking the Dutch shop for instances of “bike” will get a complete list of all the bikes and see that some of them are actually of type “bakfiets”.
  4. If its own knowledge system does not let it store facets, the importer will have to flatten the data to one class, but it will have received the complete information and will know how much of it is lost by removing the facets (see the sketch after the figure below).
The data shared has different facets out of which the data importer can make a choice
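
To make this flow concrete, here is a minimal sketch using Python and rdflib. The namespace, class names and items are hypothetical, and the “reasoner” step is emulated with a SPARQL 1.1 property path instead of a full inference engine:

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://shop.example.nl/terms#")  # hypothetical namespace

g = Graph()
# The knowledge base states that "bakfiets" is a specialisation of "bike"
g.add((EX.Bakfiets, RDFS.subClassOf, EX.Bike))

# The shop assigns the box-bike to "bakfiets" and a regular bike to "bike"
g.add((EX.item1, RDF.type, EX.Bakfiets))
g.add((EX.item2, RDF.type, EX.Bike))

# A non-Dutch shop asks for all bikes; the subClassOf* path plays the role
# of the reasoner, so every bakfiets is returned with its specific type kept
query = """
SELECT ?item ?type WHERE {
  ?item a ?type .
  ?type rdfs:subClassOf* ex:Bike .
}
"""
for item, item_type in g.query(query, initNs={"ex": EX, "rdfs": RDFS}):
    print(item, item_type)
# ...item1 ...Bakfiets  (the facet is preserved)
# ...item2 ...Bike
```

The importer is then free to keep both types or to flatten everything down to “bike”, knowing exactly what is being thrown away.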

Beyond this illustrative example, data export presents real issues in many cases. Everyone usually wants to express their data using the semantics that apply to them, and has to force the information into some other conceptualisation framework when the data is shared. A more detailed case for research data can be found in the following preprint article:

  • Christophe Guéret, Tamy Chambers, Linda Reijnhoudt, Frank van der Most, Andrea Scharnhorst, “Genericity versus expressivity – an exercise in semantic interoperable research information systems for Web Science”, arXiv preprint http://arxiv.org/abs/1304.5743, 2013

One year of PiLOD project

Yesterday was the closing event of the Pilot Linked Open Data (PiLOD) project. A sizeable crowd of politicians, civil servants, hackers, SME owners, open data activists and researchers gathered in the very nice building of the RCE in Amersfoort to hear about what has been done within this one-year project led by Erwin Folmer. Not only that: the participants also got some more insight into Linked Data usage outside of the project, and a guided tour through the RCE. More information, photos and links to the slides can be found in the report about the event.

Oliver Bartlett and John Walker gave two keynotes explaining how Linked Data is put to use at the BBC and at NXP respectively. Both companies are using this technology to better describe their content and to interconnect separate data sources. A shared objective, besides better and more efficient internal processes, is to provide better services to customers. Thanks to the harmonisation and linkage of the data, these customers can expect to get more coherent data about the things they care about, be it a chip or a football player. The two presentations also highlighted two important facts about Linked Data: 1) it is versatile enough to be applied to two very different business domains such as media and chip manufacturing, and 2) the data does not have to be open to benefit from Semantic Web technologies – as of now, a lot of data at the BBC is becoming Linked Data, but none of it is Linked Open Data.

My activity within the project revolved around chatting (a lot, as I usually do :-p), writing two book chapters (“Publishing Open Data on the Web” and “How-to: Linking resources from two datasets”) and giving a hand on the “HuisKluis” work package managed by Paul Francissen. I spoke a bit about the latter, showing a demo and some slides to explain how data is managed in the back-end. In short, the “HuisKluis” is a place where information about a house is found and shared. See the following video for a better introduction:

The prototype can be found at http://pilod-huiskluis.appspot.com/. It works only for houses in the Netherlands, but there are a few examples that can be used too.


Here are a few slides giving more details about the implementation:

If you want to really know everything about how things work, feel free to just look at the source code.

This PiLOD project was a pleasant and enriching experience; I’m very much looking forward to a PiLOD2 for a second year of LOD brainstorming and hacking together with Marcel, Arjen, Erwin, Paul, Lieke, Hans, Bart, Dimitri, … and the rest of the (rather big) group :-) This post is also a good opportunity to thank the Network Institute again for having supported this collaboration with a generous research voucher. Thanks!

Why use the Web of Data?

A few days ago I had the pleasure, and the chance, to take part in the series of webinars organised by AIMS. The goal I had set for my presentation (in French), entitled “Clarifier le sens de vos données publiques avec le Web de données” (“Clarify the meaning of your public data with the Web of Data”), was to demonstrate the advantage of using the Web of Data from the point of view of the data provider, through to the consumer. Giving a presentation without any feedback from the audience was an interesting experience that I would gladly repeat if a new opportunity arises, especially with Imma and Christophe at the controls! Thanks to them everything was perfectly organised and the webinar went without a hitch :-)

If you want to see whether this presentation achieves its goal, the slides are available on Slideshare:

Another copy of this presentation is available on the AIMS SlideShare account.

Behind the scenes of a Linked Data tutorial

Last week, on the afternoon of November 22, I co-organised a tutorial about Linked Data aimed at researchers from the digital humanities. The objective was to give a basic introduction to the core principles, and to do it in a very hands-on setting so that everyone could get concrete experience with publishing Linked Data.

Everyone listening to Clement speaking about Linked Data and RDFa

To prepare this event, I teamed up with Clement Levallois (@seinecle) from the Erasmus University in Rotterdam. He is a historian of science with interests in network analysis, text processing and other corners of the digital humanities. He had only heard of Linked Data and was eager to learn more about it. We started off by preparing together a presentation skeleton and the setup for the hands-on session. During this, he shouted every time I used a word he deemed too complex (“dereferencing”, “ontology”, “URI”, “reasoning”, …). In the end, “vocabulary” and “resource” are most probably the two most technical concepts that made it through. I then took care of writing the slides, and he simplified them again before the tutorial. He is also the one who presented them; I was just standing on the side the whole time.

The result: a researcher from the digital humanities explaining to a full room of fellow researchers what Linked Data is and how it can be useful to them. Everyone was very interested and managed to annotate some HTML pages with RDFa, thereby creating a social network of foaf:knows relations among the individuals they described :-) We concluded the tutorial by plotting that network using a tool that Clement developed.
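
For those curious about what such annotations boil down to, here is a minimal sketch in Python with rdflib of the kind of foaf:knows graph the RDFa annotations expressed; the people and URIs are made up for the illustration:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
g.bind("foaf", FOAF)

# Two hypothetical participants, described as foaf:Person resources
alice = URIRef("http://example.org/people/alice")
bob = URIRef("http://example.org/people/bob")
for person, name in [(alice, "Alice"), (bob, "Bob")]:
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(name)))

# The annotation that builds the social network
g.add((alice, FOAF.knows, bob))

# In the tutorial the same statements were embedded in HTML pages as RDFa
# rather than serialised to a standalone Turtle document
print(g.serialize(format="turtle"))
```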

This was a very efficient and interesting collaboration! For those interested in what we did, all the material is available on Dropbox and the presentation is on Slideshare:

5-star Linked Open Data pays more than Open Data

Let’s assume you are the owner of a CSV file with some valuable data. You derive some revenue from it by selling it to consumers that do traditional data integration: they take your file, import it into their own data storage solution (for instance, a relational database) and deploy applications on top of this data store.

Traditional data integration

Data integration is not easy, and you’ve been told that Linked Open Data facilitates it, so you want to publish your data as 5-star Linked Data. The problem is that the first star speaks about an “Open license” (follow this link for an extensive description of the 5-star scheme), and that sounds orthogonal to the idea of making money from selling the data :-/

If you publish your CSV as-is, under an open license, you get 3 stars but don’t make money out of serving it. Trying to get 4 or 5 stars means more effort from you as a data publisher and will cost you some money, still without earning you anything back…

Well, let’s look at this 4th star again. Going from 3 stars to 4 means publishing descriptions of the entities on the Web: all your data items get a Web page of their own with the structured data associated with them. For instance, if your dataset contains a list of cities with their associated population, every one of these cities gets its own URI with the population indicated in it. From that point, you get the 5th star by linking these pages to other pages published as Linked Open Data.

Roughly speaking, your CSV file is turned into a Web site, and this is how you can make money out of it. As with any website, visitors can look at individual pages and do whatever they want with them. They cannot, however, dump the entire web site onto their machine. Those interested in getting all the data can still buy it from you, either as a CSV or as an RDF dump.
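
As an illustration, here is a minimal sketch of what such a per-entity publication could look like, using Python with Flask and rdflib. The CSV columns, URIs and vocabulary are invented for the example:

```python
import csv
from flask import Flask, Response, abort
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import OWL, RDFS, XSD

app = Flask(__name__)
EX = Namespace("http://data.example.org/city/")  # hypothetical base URI
DBO = Namespace("http://dbpedia.org/ontology/")

# Load the valuable CSV once; assumed columns: name, population, dbpedia_uri
cities = {}
with open("cities.csv", newline="") as f:
    for row in csv.DictReader(f):
        cities[row["name"].lower()] = row

@app.route("/city/<name>")
def city_page(name):
    row = cities.get(name.lower())
    if row is None:
        abort(404)
    g = Graph()
    city = EX[name.lower()]  # each data item gets its own URI (4th star)
    g.add((city, RDF.type, DBO.City))
    g.add((city, RDFS.label, Literal(row["name"])))
    g.add((city, DBO.populationTotal,
           Literal(row["population"], datatype=XSD.integer)))
    # The 5th star: link the resource to an external Linked Open Data URI
    g.add((city, OWL.sameAs, URIRef(row["dbpedia_uri"])))
    return Response(g.serialize(format="turtle"), mimetype="text/turtle")

if __name__ == "__main__":
    app.run()
```

Each city is then available at its own URI (e.g. /city/amsterdam), while the complete dump stays behind the paywall.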

Users of your data then have the choice between two data usage processes: use parts of the data through the Linked Open Data access, or buy it all and integrate it. They are free to choose the solution that best fits their needs and resources.

Using Linked Open Data

Some side bonuses of going 5-star instead of sticking at 3:

  • Because part of the data is open for free, you can expect to get more users screening it and reporting back errors;
  • Other data publishers can easily link their data sets with yours by re-using the URIs of the data items, which increases the value of the data;
  • In its RDF format, it is possible to add links within the data set, thereby doing part of the data integration work on behalf of the data consumers – who will be grateful for it;
  • Users can deploy a variety of RDF-enabled tools to consume your data in various ways.

Sounds good, doesn’t it? So, why not publish all your 3-star data as 5-star right away? ;-)

Take home messages from ePSIplatform Conference

Open Data stickers

On March 16, 2012, the European Public Sector Information Platform organised the ePSIplatform Conference 2012 on the theme “Taking re-use to the next level!”. It was a very well organised and interesting event, and also a good opportunity to meet new people and put faces on names previously only seen in mails and teleconferences :-)

The program was intense: 3 plenary sessions, 12 break-out sessions and project presentations during the lunch break. That was a lot to talk about and a lot to listen to. I left Rotterdam with a number of take-home messages and food for thought. What follows is a mix of my own opinions and things said by some of the many participants/speakers at the event.

We need to think more about data re-use

It’s a fact: Open Data has gathered momentum and more and more data portals are being created. DataCatalogs currently lists 216 sources for Open Data. There could be something around a million Open Data data sets available by now, but how many applications? Maybe around 100k, at most. Furthermore, most of these applications do not really address “real problems” (e.g. helping decision makers make educated choices by providing them with the right data at the right time, or optimising food distribution processes). Even if the definition of a “real problem” is open to discussion, there is surely something to think about.

This low number of applications could be explained by a lack of problems to tackle as much as by a lack of motivated developers. The ePSI platform has just started a survey on story sharing. Reading about the (positive) experiences of others is likely to trigger some vocations among readers and get more developers on board. The upcoming W3C event about using Open Data will also be a good place to share such stories and spot the things to do next to foster an ecosystem of data and developers.

Open Data should be interactive

We have Open Data and we have Open Data consumers that happily take the data, process it and eventually re-publish it. Fine, but we do poorly when it comes to capturing the added metadata from these users. If one of them spots an error in an open data set, or if missing data is identified, there is hardly any way to communicate this information to the data publisher. Most, if not all, data portals are “read only”, and the occasional feedback they receive may not find a matching processing pipeline. Open source software solved this issue with open bug trackers that allow for reporting bugs and feature requests and facilitate dispatching the issues to the persons in charge of implementing them. Using such bug trackers to keep the data users in the loop sounds like a good plan. This is something we have started to look at, in a slightly different way, for the projects CEDA_R and Data2Semantics. One of the use cases of these projects is the Dutch historical census data (from 1795 onwards), which has to be harmonised and debugged (a lot of manual work was involved in converting the paper reports into digital form). Only historians can take care of this, and they need to inform the data publisher about their findings – preferably using something even easier than the average bug tracker.

Open (messy) Data is a valuable business

Economic issues are common when speaking about Open Data; they could even be seen as the main obstacle to it, the other obstacles – technical, legal and societal/political – being easier to address. So the trick is to convince data owners that, yes, they will lose the money they currently get in access fees, but they will get more out of the Open Data in an indirect way, through the businesses created. In fact, there is no market for the Open Data itself. Instead, this Open Data has to be seen as part of the global data market, of which DataPublica and OpenCorporates are two examples. In this market, curating and integrating data is a service clients can be charged for: data companies transform the data into information and put a price tag on the process. For this matter, having to publish an integrated data set as Open Data because it includes pieces of another Open Data set licensed under a GPL-like license will break the process. Open Data is easier to consume when licensed under more BSD-like licenses.

If there is a market for messy open data, one can wonder whether Linked Data is going against businesses or helping them. Linked Data allows for doing data integration at the publication level, and Open Data exposed using these principles is richer and easier to consume. This means less work for the consumer, who may spare himself the cost of hiring someone to integrate the data. But Linked Data facilitates the job of data companies too: they could invest the time saved into the development of visualisation tools, for instance. So, in the end, it may not be such a bad idea to continue promoting Linked Data ;-)

Open Data initiatives need to become more consistent

Besides the definition given on OpenDefinition, and the 5-star scheme of Tim Berners-Lee for Linked Data, there is not much out there to tell people what Open Data is and how to publish it. Data portals can be created from scratch or use CKAN, and may expose the metadata about the data sets they contain in different ways (e.g. using DCAT or something else). The data itself can be published in a large spectrum of formats, ranging from XLS sheets to PDFs to RDF. Besides this, data portals can be created at the scale of a city, a region, an entire country or an entity such as the EU. These different scales are related to each other, and the current situation can be seen as the result of a lack of coordination. Directories are important as a way to know what data is out there, and also what data is missing. If everyone takes initiatives at different scales, the outcome of this indexing process will be fuzzy and quite confusing for data users looking for open data sets. On the other hand, self-organisation is often the best solution to build and maintain complex systems (c.f. “Making Things Work” by Y. Bar-Yam). So maybe things are good as they are, but we should still avoid ending up with too many data portals partially overlapping and incompatible with each other.

As far as the data is concerned, PDF, XLS, CSV, TSV, … are all different ways to create data silos that just provide a single view over the data – a non machine-readable one in the case of many PDFs. RDF is here to improve consistency across data sets with a unique, graph-based data model that facilitates sharing data across data sets. It is not the only solution for that – the Dataset Publishing Language (DSPL) from Google is another one – but it is the only one based on W3C standards. This guarantees the openness of the data format and constant support, just as for the standards that make the Web (HTML, HTTP, CSS, …).
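
To give a small taste of how the graph-based model facilitates this sharing, here is a hypothetical sketch in Python with rdflib where two independent publishers describe the same resource simply by re-using the same URI; the URIs and values are invented for the example:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS, XSD

# A URI minted by a (hypothetical) statistics office; re-using it is all
# that is needed for two data sets to share the same node in the graph
city = URIRef("http://stats.example.org/resource/Amsterdam")

# Publisher A describes the population of the city
DBO = Namespace("http://dbpedia.org/ontology/")
pop = Graph()
pop.add((city, RDFS.label, Literal("Amsterdam")))
pop.add((city, DBO.populationTotal, Literal(800000, datatype=XSD.integer)))

# Publisher B, independently, describes its air quality
AIR = Namespace("http://air.example.org/terms#")
air = Graph()
air.add((city, AIR.pm10Level, Literal("22.5", datatype=XSD.decimal)))

# Integrating the two data sets is a plain graph union, no schema mapping
merged = pop + air
print(merged.serialize(format="turtle"))
```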

Don’t underestimate the “wow” effect

During one of the break-out sessions, I was intrigued to hear one of the panel speakers say he would like to see more DSPL around than RDF. After some (peaceful) discussion, we agreed on the following points: RDF is more expressive than DSPL, while DSPL comes with an easy-to-use suite of plug&play tools for playing with the data. It seems that if you want to re-use Open Data to make some plots, possibly for data journalism use-cases, you are better off using DSPL. It is simpler and, through the data explorer, allows anyone to build graphs in a few clicks. Users prefer having buttons and sliders to play with simpler data rather than knowing that they hold the most powerful knowledge representation scheme with which they could do anything – but finally doing nothing with it because of the high learning curve involved. I’m all in favour of Open Data and I try to motivate people, and myself sometimes, to use Linked Data to publish data sets. Still, I think we have a major issue there: our data model is better, but we do not yet compete on the usability side of the story.

Another manifestation of the “wow” effect: the most impressive visualisation shown at the event was part of the video documentary series “The Netherlands from above”, and its matching interactive data explorers. It is a very nicely done job, but the interesting bit is that not only was the data not linked, it was also not open! Yet, even at an event about re-use of Open Data, nobody seemed to care much. The data was acquired for free from different providers (with some difficulties for some), had to be curated and transcoded, and could not be shared. But the movies are very nice, and the sliders on the interactive pages fun to play with…

We must not rest on our laurels

Finally, and this was also the final message of the event, we should not rest on our laurels. Open Data is well received. Many are adopting the “open unless” way of thinking, but some others set up an Open Data portal just because it is trendy, and trash it after some months. We need to continue explaining to data owners why they should open their data, and why Linked Data is a good technical solution to implement. Then we need to find more active users for the data because, in the end, if the data is used, nobody will even dare shut down the portal serving it. Having these active users may be our only guarantee that data published as Open Data will remain so for the years to come.

1 minute video about SemanticXO

The VU is making short one-minute videos to highlight some of the research being done within its walls. This is the video for SemanticXO, made by Pepijn Borgwat and presented by Laurens Rietveld.

The script is in Dutch and is as follows:

  • Ik ben laurens rietveld en ik doe onderzoek aan de vrije universiteit naar semantische netwerken.
  • Ik wil iets vertellen over onderzoek van Christophe Gueret dat zich richt op laptops die in ontwikkelingslanden gebruikt worden.
  • Dit is de XO laptop, het is een goedkope stevige laptop die onderwijs bij kinderen moet bevorderen.
  • Op de laptop draait sugar, dat is een constructieve leeromgeving speciaal ontworpen voor jonge leerlingen.
  • Op dit moment blijven alle gegevens die gegenereerd worden in de leeromgeving, in de xo laptop. Als een gesloten kleine data doos.
  • Met dit onderzoek willen we data uitwisseling verbeteren door gebruik te maken van principes van het semantic web.
  • Op die manier kan de data, zoals berichten of tekeningen, gemakkelijk binnen kleine lokale netwerken worden verspreid.
  • Zodra 1 laptop met het netwerk verbonden is kan die lokale data delen met de buitenwereld.
  • Andersom kunnen gegevens van de rest van het internet, ook binnen het lokale netwerk worden gedeeld.

In case you don’t speak Dutch, you may find the following translation to be useful ;-)

  • My name is Laurens Rietveld and I do research on semantic networks at the Free University of Amsterdam.
  • I will tell you about the research of Christophe Guéret, which concerns laptops used in developing countries.
  • This is the XO laptop, a cheap and robust laptop meant to support the education of kids.
  • The laptop runs Sugar, a constructionist learning environment especially designed for young learners.
  • Currently, all the data generated within the learning environment stays in the XO laptop, as if it were a little closed data box.
  • With this research we aim at improving data sharing by using Semantic Web principles.
  • In doing so, the data (for instance, messages or drawings) can easily be shared within small local networks.
  • As soon as one laptop is connected to the network, it can share the local data with the outside world.
  • Vice versa, data from the rest of the Internet can also be shared within the local network.