Chris Bizer: "Within the corporate market, there is interest in using Linked Data as a lightweight, pay-as-you-go data integration technology."
17.04.2009
So far, there is little awareness of the commercial opportunities of Linked Data. Andreas Blumauer (SWC) talked to Chris Bizer, mastermind behind the DBpedia project and advocate of the Linking Open Data philosophy, about the emerging market for deep web applications, its value for corporate purposes, and the need for information accountability and privacy awareness.
Chris, you are one of the key players and insiders of the Semantic Web community. From your perspective, what are the most interesting scenarios where Linked Data technologies will be used in the next few months or years?
The deployment of Linked Data technologies is mostly content-driven: more and more people realize that there are many interesting datasets available as Linked Data on the Web and want to use these datasets or interlink new datasets with them. I therefore think that the topics of the datasets in the Linking Open Data cloud are already a good indicator of further development: there are lots of life sciences datasets, there is lots of data related to publications and the library world, and there is an increasing amount of geographic information. The W3C Semantic Web for Health Care and Life Sciences Interest Group is doing really good outreach work towards the life sciences community, and the first Linked Data pilot projects are running within pharma companies like Eli Lilly. The growing interest in the library community is shown by Linked Data being a main topic of this year’s Dublin Core conference in Seoul and by the new Open Archives Object Reuse and Exchange (OAI-ORE) standard, which is built on the Linked Data principles. Other domains in which we will soon see increasing Linked Data deployment are, in my opinion, the media industry, the sharing of scientific data in general, and easier access to government and other public sector data. Initiatives in this last direction are currently being pushed by the Obama administration and the UK Office of Public Sector Information. Within the media industry, the BBC and Thomson Reuters already use Linked Data technologies: the BBC to interlink its content with external data sources like Musicbrainz and DBpedia, and Reuters to annotate documents with concepts from the Web of Data via the Calais service.
Can you see a market for linked data applications yet?
When talking about markets, we should distinguish between applications that rely on the public Web of Data and applications inside companies. I think we will see a growing number of applications that use data from the public Web as background knowledge to offer better search capabilities and to augment local content with additional content from the Web of Data. Interesting developments in this direction are currently happening in the search engine market, where the classic search engines try to become smarter and develop into question answering machines by incorporating structured data. Yahoo, for example, has started to crawl RDFa and microformats from the Web and offers access to the crawled data via the BOSS API. Google is getting interested in crawling the Deep Web and will, in my opinion, ultimately go down the same route as Yahoo. As it is easy to crawl Linked Data from the Web and as the number of datasets that are published as Linked Data is constantly growing, I think it is just a question of time before the major search engines start using the data.
Besides the classic search engines, there might also be market opportunities for new search engines that specialize in Linked Data. With Falcons, Sindice, Watson, SWSE and Swoogle, there are already various prototypes around. I think the next generation of these search engines will focus on further integrating, fusing and cleansing Linked Data from various sources. This will allow them to sell access to cleaned views on the Data Web and to become central components within Linked Data applications.
Within the corporate market, there is interest in using Linked Data as a lightweight, pay-as-you-go data integration technology. In contrast to classic data warehouses, which require a big upfront investment for modeling a global schema, Linked Data technologies allow companies to set up data spaces with relatively little effort. As these data spaces are being used, the companies can invest step by step in establishing data links, shared vocabularies or schema mappings between the sources to allow closer integration. An interesting user story that we heard from a pharma company is that, within this company, different information integration teams repeatedly went through the same data integration and identity resolution steps in order to combine datasets into data marts. By using Linked Data technologies to make the relations between entities within different datasets explicit, the company hopes to streamline these integration processes, as things only have to be done once and other teams can use the data links and mappings afterwards.
Many companies are starting to build their own "corporate semantic web", and one of the first questions regarding the technical architecture is which triple store should be chosen. Can you recommend a method to pick the right one?
The performance of triple stores was a bottleneck a while ago, but things have improved a lot over the last two years. There are cluster editions of several triple stores now, and when deployed on a proper server farm or cloud infrastructure, the stores scale very well. An indicator that might be helpful for choosing a store is the results of the Berlin SPARQL Benchmark, which compares the query performance of various triple stores and SPARQL-to-SQL rewriters.
Compared to other technologies like "triplify", what are the advantages of your D2RQ mapping technology? What’s new in your latest release?
Triplify is a lightweight relational database to RDF mapping technology that aims at making it easy to extend existing Web applications with a Linked Data interface. In contrast, D2RQ also covers more complex mappings and provides SPARQL access to non-RDF databases. The latest release extended D2RQ with the ability to publish schema mappings between different data sources and significantly increased the performance of the D2RQ engine by using an improved SPARQL-to-SQL rewriting algorithm.
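To make the idea of a relational-database-to-RDF mapping more concrete, here is a minimal sketch in Python. It is neither D2RQ's mapping language nor Triplify's configuration format; it only illustrates the general principle of turning each row into a resource and each column value into a triple. The base URI, the `drug` table, its columns and the `rows_to_triples` helper are all hypothetical and chosen purely for illustration.

```python
import sqlite3

# Hypothetical base URI for the generated resources and vocabulary terms.
BASE = "http://example.org/"


def rows_to_triples(conn, table, key, columns):
    """Yield one (subject, predicate, object) triple per non-null cell.

    Each row becomes a resource named after the table and its key value;
    each column becomes a property in a made-up vocabulary namespace.
    """
    query = f"SELECT {key}, {', '.join(columns)} FROM {table} ORDER BY {key}"
    for row in conn.execute(query):
        subject = f"<{BASE}{table}/{row[0]}>"
        for column, value in zip(columns, row[1:]):
            if value is None:
                continue  # NULL cells produce no triple
            # Escape backslashes and quotes so the literal stays well-formed.
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            yield subject, f"<{BASE}vocab/{column}>", f'"{literal}"'


# Illustrative in-memory database (assumed schema, not from the interview).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drug (id INTEGER PRIMARY KEY, name TEXT, target TEXT)")
conn.execute("INSERT INTO drug VALUES (1, 'Aspirin', 'COX-1')")
conn.execute("INSERT INTO drug VALUES (2, 'Placebo', NULL)")

triples = list(rows_to_triples(conn, "drug", "id", ["name", "target"]))
for s, p, o in triples:
    print(s, p, o, ".")  # N-Triples-style output
```

A dump like this corresponds to the simplest kind of mapping. Answering SPARQL queries directly against the database, as described above for D2RQ, additionally requires rewriting each query into SQL instead of materializing all triples up front.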
You are also working on issues like trust, privacy and security. How do you perceive the increased threat to privacy by the interlinking of personal data? Do you see a need for action?
Sure, I do. The ultimate goal of Linked Data is to use the Web like a single global database. The realization of this vision will provide benefits in many areas but will of course also lead to new dangers in others. I think that dealing with the emerging privacy problems will require a combination of technical and legal means, together with greater awareness among users about what data to provide in which context. For new ideas around the topic, I really like Daniel Weitzner’s work on the privacy paradox and the recent work by the TAMI project on information accountability.
One of the big challenges on the Semantic Web is to enhance users' control over their data and content. What technological measures does the Semantic Web provide to enhance customer sovereignty and service providers' accountability?
I think the first and most important step is to support users in expressing what they want to allow their data to be used for and what not. The
Last question: the Web 2.0 hype is over and public Web 3.0 discussions are starting now. Can you think of the next level already?
I think many people have learned from the success of Web 2.0 APIs that you can do really cool things by mashing up content from different Web data sources. The general problem with Web 2.0 APIs is that you always implement your application against a fixed set of data sources, so your application does not take advantage of new data sources that become available on the Web. Web 2.0 APIs thus kind of slice the Web into different data silos. In contrast, Linked Data realizes the idea of extending the Web with a single global data commons. I think that more and more people will realize that it is beneficial for their applications to operate on top of this unbound data space and automatically become smarter and more comprehensive as the data space grows. Therefore, my guess for the next development step of the Web is the coalescence of distinct Web data sources into a global data space.
About Chris Bizer
Prof. Dr. Chris Bizer is the head of the Web-based Systems research group at Freie Universität







