“A growing data commons from meaningful bits and pieces”

19.08.2008

Richard Cyganiak

Richard Cyganiak, a researcher at the Digital Enterprise Research Institute (DERI) in Galway, talks about the benefits of semantic web indices, data dictionaries and the similarities between the Linked Data initiative and the Semantic Web's original impetus. The interview was conducted by Andreas Blumauer.

INTERVIEW

Richard, you are working for DERI Galway, the world's largest research institute in the area of the Semantic Web. Tell us a bit about ongoing projects - what are the major efforts at the moment in Galway?

There are lots of fascinating projects going on in DERI, too many to enumerate here, like the Nepomuk project that develops a Semantic Desktop infrastructure that is becoming part of the KDE desktop environment, or the Okkam project which aims at developing an “Entity naming system” for worldwide identifier management, just as we have the Domain Name System (DNS) now.

Semantic Web Indices and Search Engines

You are involved in a project called Sindice. Sindice is "the" semantic web index. Considering all the different "semantic search engines" out there at the moment, end users might get confused by this term. What is the difference between Sindice and semantic search engines like Hakia or Powerset?

Hakia and Powerset try to address an old problem in a new way: How to find information in a large number of text documents? It's the same problem that Google deals with; but while Google relies on keywords and massive scale, Hakia and Powerset look at the text in a deeper way, trying to “understand” its meaning and extracting facts from the sentences. That's why they can answer natural-language questions, while Google just shows documents that contain certain keywords.

Sindice works differently. We don't look much at the words in web pages. We look at bits and pieces of data that publishers have embedded into their pages. This has become possible thanks to technologies such as microformats, RDF/XML and RDFa. At the moment, only a small fraction of web pages contain these bits of data, but the size of this “Web of Data” is growing. Collecting this data allows us to build a database of assertions published by many different parties. This approach involves much less guesswork than the natural-language approach, but at the same time it limits us to the data explicitly marked up using microformats and RDF.
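To make the idea of "bits and pieces of data embedded into pages" concrete, here is a minimal sketch of pulling RDFa-style statements out of an HTML snippet. It is not Sindice's implementation. The class name `TinyRDFaExtractor`, the sample page, and the `ex:affiliation` property are assumptions made for illustration, and the parser deliberately ignores most of RDFa (prefix resolution, `rel`, `typeof`, datatypes) and assumes well-formed markup without void elements.

```python
from html.parser import HTMLParser


class TinyRDFaExtractor(HTMLParser):
    """Collects (subject, property, literal) triples from `about`/`property` attributes."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []      # one (tag, subject) entry per currently open element
        self._pending = None  # [subject, property, depth, text parts] being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element inherits its subject from the parent unless it sets `about`.
        parent_subject = self._stack[-1][1] if self._stack else ""
        subject = attrs.get("about", parent_subject)
        self._stack.append((tag, subject))
        if "property" in attrs and self._pending is None:
            self._pending = [subject, attrs["property"], len(self._stack), []]

    def handle_data(self, data):
        if self._pending is not None:
            self._pending[3].append(data)

    def handle_endtag(self, tag):
        # Emit the triple when the element that carried `property` closes.
        if self._pending is not None and len(self._stack) == self._pending[2]:
            subject, prop, _, parts = self._pending
            self.triples.append((subject, prop, "".join(parts).strip()))
            self._pending = None
        if self._stack:
            self._stack.pop()


# Hypothetical page fragment; the URI and ex:affiliation are made up.
page = """
<div about="http://example.org/people/richard">
  <span property="foaf:name">Richard Cyganiak</span> works at
  <span property="ex:affiliation">DERI Galway</span>.
</div>
"""

extractor = TinyRDFaExtractor()
extractor.feed(page)
extractor.close()
for triple in extractor.triples:
    print(triple)
```

The point of the sketch is the contrast drawn above: the statements come from explicit markup rather than from guessing at the meaning of the surrounding sentence.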

Another difference is that Sindice is not aimed at end users. We give application developers access to our database by using different APIs. We want them to build exciting new applications using the data that is out there. Our contribution is that we save them the hassle of finding and collecting all this data. At the same time, we analyse the Web of Data to better understand its structure and the technical challenges created by its growth.

Reasoning and Data Dictionaries

Sindice performs sophisticated reasoning which dramatically enhances data reusability, search precision, and recall. When do you think these kinds of technologies will become mainstream and will be part of major search engines or of the enterprise stack?

Reasoning is important in an open system where we cannot force everyone to agree on a single schema. When publishers put data in RDF on the Web, they usually also publish their RDF Schema, or data dictionary. Or they re-use an existing, community-developed data dictionary. These dictionaries often state how their terms relate to those in other dictionaries. For example, a person's NAME can be considered a general LABEL for the entity. We use reasoning to automatically translate data based on these annotations. This allows users to formulate queries using the terms from either dictionary. In Sindice we support the RDF Schema semantics but also part of the OWL semantics (OWL Horst).
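The NAME/LABEL example can be sketched in a few lines. The code below is an illustration only, not how Sindice performs reasoning. It applies two RDF Schema entailment rules for `rdfs:subPropertyOf` (inheritance of statements and transitivity) to a set of plain tuples until nothing new is derived. The function name `infer_subproperties`, the prefixed names written as strings, and the `ex:richard` identifier are assumptions.

```python
SUB_PROPERTY_OF = "rdfs:subPropertyOf"


def infer_subproperties(triples):
    """Forward-chain the rdfs:subPropertyOf rules until a fixed point is reached."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub_props = {(s, o) for s, p, o in inferred if p == SUB_PROPERTY_OF}
        new = set()
        for s, p, o in inferred:
            for sub, sup in sub_props:
                # Inheritance: a statement using `sub` also holds for `sup`.
                if p == sub:
                    new.add((s, sup, o))
                # Transitivity: s subPropertyOf sub, sub subPropertyOf sup.
                if p == SUB_PROPERTY_OF and o == sub:
                    new.add((s, SUB_PROPERTY_OF, sup))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Illustrative data: one statement plus one dictionary annotation.
triples = {
    ("ex:richard", "foaf:name", "Richard Cyganiak"),
    ("foaf:name", SUB_PROPERTY_OF, "rdfs:label"),
}

closure = infer_subproperties(triples)

# A query phrased with the more general term now finds the data as well.
labels = [o for s, p, o in closure if s == "ex:richard" and p == "rdfs:label"]
print(labels)
```

A real system would work on a proper RDF store and a much larger rule set, but the translation step is the same in spirit: the annotation in the dictionary lets a query using either term match the published data.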

When will technologies like this become mainstream? I don't know. Semantic Web technologies are good at solving particular problems, and maybe they will never take hold in other areas. Their strength is in integrating data from different places and different publishers, especially in scenarios where the publisher cannot foresee all potential applications of the data. I expect to see successful deployments of Semantic Web technologies in areas where this kind of serendipitous re-use can have high benefit.

Linked Data and the Semantic Web

It seems like "Linked Data" is "The Emperor's New Clothes": How would you describe the relation between "Linked Data" and the "Semantic Web"? Is "Linked Data" only a vehicle for the commercialization of the "Semantic Web" or is it something else?

“Linked Data” has indeed become quite a buzzword over the last couple of months. This has surprised me, because it has a precise technical definition (Tim Berners-Lee's Linked Data principles), and Linked Data is actually just a way of using well-known technologies -- HTTP, RDF -- that allows them to play to their strengths.

The Semantic Web community has always included many people interested in direct and straightforward application of new technologies to the Web. Many of them have now rallied around the Linked Data principles, maybe in part because the term “Semantic Web” has become associated with the more logic-heavy and theoretical side of the community.

To some extent, I see Linked Data as a return to what the Semantic Web should have been about all along -- facilitating exchange and re-use of data on a world-wide scale.

Missing Links in the Semantic Web Infrastructure

One of the projects you are working on is called "Neologism". It serves as an "easy vocabulary publishing tool". Thinking of a "complete" Semantic Web infrastructure - what is still missing?

On the top of my wish list would be a really good data browser. The current crop of data browsers for RDF, such as Tabulator, Disco and the OpenLink browser, are still very basic and geeky. I hope for some sort of “Excel for Web data”, an application that allows me to browse through different datasets, find the bits that are relevant to my problem, and lets me slice and dice and correlate the data in different ways. I think such an app would be key to the kind of serendipitous reuse I mentioned earlier.

On the developer side, I think that a good software library for detecting correspondences between items in different RDF datasets would be really useful. It could support the automatic creation of links between different RDF-enabled websites.
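As a rough idea of what such a correspondence-detection library might do at its simplest, the sketch below compares the labels of items from two datasets and proposes candidate `owl:sameAs` links when the labels are nearly identical. This is an assumption-laden toy, not an existing library: the function name `propose_links`, the threshold of 0.85, and both datasets are invented for illustration, and label similarity alone would be far too weak a signal in practice.

```python
from difflib import SequenceMatcher


def normalise(label):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(label.lower().split())


def propose_links(dataset_a, dataset_b, threshold=0.85):
    """Return candidate (id_a, 'owl:sameAs', id_b) links based on label similarity."""
    links = []
    for id_a, label_a in dataset_a.items():
        for id_b, label_b in dataset_b.items():
            ratio = SequenceMatcher(None, normalise(label_a), normalise(label_b)).ratio()
            if ratio >= threshold:
                links.append((id_a, "owl:sameAs", id_b))
    return links


# Two made-up datasets mapping item identifiers to labels.
dataset_a = {"a:1": "Richard Cyganiak", "a:2": "DERI Galway"}
dataset_b = {"b:9": "richard cyganiak ", "b:7": "Sindice"}

links = propose_links(dataset_a, dataset_b)
print(links)
```

A usable library would need more evidence than a label (types, shared property values, existing links) and a way to scale beyond comparing every pair, which is part of why it remains a wish-list item here.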

The infrastructure of the Web is not just the software and hardware, but also the standards and agreements that enable the different parts to talk to each other. Much remains to be done in that area as well. For example, data publishers need to provide better metadata, such as who created the data, the time when it is valid, and the license under which it can be used. This will help us figure out if a dataset is adequate for our needs and why we should trust it.

What the Semantic Web Is All About

On your blog you regularly write about your encounters with people you meet at conferences like ESWC or people you work with. It seems like the Semantic Web Community is growing fast. If you explained to someone who is an "outsider" to this community what the semantic web is all about - what would you focus on?

The traditional World Wide Web is all about documents -- the web pages. The Semantic Web, to me, is all about the things described in those pages, and how they relate to each other. The traditional Web is a graph of interconnected documents. This enables quick and simple navigation between those documents, which is great. But moving beyond that, to a Web of interconnected descriptions of things, is much more powerful, because it enables automatic analysis of their relationships: How are these two things connected? Which things depend on that other thing? How do the things in that collection compare to each other?
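The question "How are these two things connected?" becomes a simple graph search once descriptions of things are available as statements. The sketch below is only an illustration: the function name `find_connection`, the `ex:` identifiers, and the property names are made up, and the statements are treated as undirected edges for the purpose of finding a shortest chain between two things.

```python
from collections import deque


def find_connection(triples, start, goal):
    """Breadth-first search for a shortest chain of statements linking two things."""
    edges = {}
    for triple in triples:
        subject, _, obj = triple
        # Follow statements in both directions when looking for a connection.
        edges.setdefault(subject, []).append((obj, triple))
        edges.setdefault(obj, []).append((subject, triple))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbour, triple in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [triple]))
    return None  # no chain of statements links the two things


# Illustrative statements with invented identifiers and properties.
triples = [
    ("ex:Richard", "ex:memberOf", "ex:SindiceTeam"),
    ("ex:SindiceTeam", "ex:partOf", "ex:DERI"),
    ("ex:Nepomuk", "ex:developedAt", "ex:DERI"),
    ("ex:Hakia", "ex:type", "ex:SearchEngine"),
]

connection = find_connection(triples, "ex:Richard", "ex:Nepomuk")
for statement in connection:
    print(statement)
```

On a Web of documents the same question would require a person to read the pages; on a Web of interconnected descriptions it reduces to traversing the graph.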

This is important. To make good decisions, we need good data. What I want to achieve with my work is to contribute to the creation of a data commons. A distributed, resilient, worldwide database that is open to all parties at low or zero cost. This would put entire new classes of applications in our reach.

About Richard Cyganiak

Richard is a research assistant at DERI Galway in Ireland. At DERI, he's a member of the Data Intensive Infrastructure cluster and the Sindice team. He gave four talks at the European Semantic Web Conference in Tenerife, Spain, earlier this year, where he presented the above-mentioned projects Sindice and Neologism (together with Sergio Fernández), a proposal for a standard for semantic site maps, and a so-called lightning talk in which he, in his own words, "complained bitterly about the inability of Semantic Web developers to properly handle characters outside of the usual US-ASCII charset in their apps."

Richard Cyganiak is going to be one of the keynote speakers during the Web of Data Practitioners Days, taking place in Vienna, Oct 22-23 2008.

For more information about the event, see webofdata.info


References

Richard Cyganiak's Website

DERI Galway

Sindice

Nepomuk Social Semantic Desktop

Okkam - Large-Scale Integrating Project
