Enterprise Search goes Open Source
19.02.2009
The open source initiative SMILA ("SeMantic Information Logistics Architecture") was jointly launched at the end of 2007 by empolis, the German Research Center for Artificial Intelligence (DFKI) and brox IT Solutions GmbH to create a standardized infrastructure for information management. The project is embedded in the work of the Eclipse community, which consists of more than 180 companies and research institutions that share the objective of developing open and standardized platforms. Since June 2008, SMILA has had the status of an official Eclipse project, and it is co-funded through the THESEUS project by the German Federal Ministry of Economics and Technology.
Andreas Blumauer interviews Mario Lenz
Andreas Blumauer: Can you explain in a few sentences what the goals of the SMILA project are? And how does it fit into THESEUS, which officially focuses rather on the WWW?
Mario Lenz: We are all familiar with the well-known and widely applied structures of Relational Database Management Systems and SQL. Even if it is more complex to understand, organize and structure data within Enterprise Resource Planning systems, we found - with SMILA - a way to do so. However, an estimated 80% of all mission-critical information is only available in unstructured form, i.e. as documents, web pages, emails etc. Despite that fact, a standardized way of representing, accessing and managing this unstructured data does not exist today. Rather, each vendor ships its own proprietary solution. SMILA's goals are to define and implement such a standard infrastructure framework and to establish a community that brings it forward. THESEUS is about "New Technologies for the Internet of Services", which means it differs from the WWW in general, as we clearly focus on specific application scenarios.
Andreas Blumauer: To have a "standardized industrial strength framework to build search solutions to access unstructured information in enterprises" could sound like a nightmare for some technology providers in this market. How are you going to convince such companies to participate in this initiative?
Mario Lenz: Well, actually some of those vendors have already expressed their interest in SMILA, even though they still hesitate to (formally) join in. Remember: SMILA is about the platform, the infrastructure and the standardization. Assume a vendor has an ingenious approach to understanding natural language and/or search - why should this vendor not implement its solution on top of SMILA and, thus, profit from the platform aspects? In other words, do you think the standardization of SQL has been a disadvantage to commercial RDBMS vendors? See, the same should hold for SMILA in the unstructured world...
But we agree: standardization might become a nightmare, as you call it, for vendors, because it gives users the ability to exchange one solution for another. Different offerings will thus be more comparable and exchangeable.
Andreas Blumauer: Just recently Attensity, a leading provider of text technologies in the U.S., announced that it will build on the SMILA framework. What was the reason for doing so in particular?
Mario Lenz: Just see above: they have superior technology in text analytics, in extracting semantics from unstructured data sources such as web content. However, to implement a complete and fully integrated solution for their customers, Attensity also needs aspects such as data source management, monitoring, an open and extensible layer for applications etc. This is exactly where SMILA comes in, and it will help Attensity to serve their customers with a more complete solution.
Andreas Blumauer: Two partners of this project, you and brox, plan to make your products "brox EIF" and "empolis e:IAS" open source. Will those two platforms be aligned with SMILA? What does it mean for the interested client who plans to start an Enterprise Search project?
Mario Lenz: I cannot speak for brox, but for empolis, I wouldn't agree with that statement. Empolis' e:IAS is superior in certain aspects of information extraction and semantic indexing. Those components will not be open sourced, but will rather be changed so that they are aligned with and shipped on top of SMILA, as opposed to the current proprietary platform. For clients, this implies secured investments, because future implementations of e:IAS could be more easily replaced with other SMILA-compliant solutions.
Andreas Blumauer: Could you tell us a bit more about the architecture? What's the innovation compared to existing architectures in the enterprise search domain?
Mario Lenz: As of today, there are probably as many architectures as there are vendors in the domain. Attempts such as Open Search just scratch the surface. In fact, technology differs so much from vendor to vendor that a comparison is hardly possible, not to speak of replacing one with the other.
Hence, we decided to establish an open architecture that is very flexible and at the same time scalable. We also committed to using well-established and recently emerged technology standards such as OSGi, SCA, BPEL, JMS and JMX as much as possible. By designing clear APIs, we give application developers the possibility of easily plugging in their own components (e.g. data source connectors and processing services) or even fully replacing the implementation of core components, if desired.
Andreas Blumauer: What exactly makes this architecture a "SeMantic Information Logistics Architecture"? For example, are there any clear statements or even specifications regarding the structure of the index store or the data store?
Mario Lenz: Admittedly, the semantic part of SMILA is currently still too weak. We know that we need to address those aspects even more in the future - and we are quite willing to do so. However, we need a more fundamental platform first. As a consequence, SMILA comes with storage for Semantic Web data (RDF) and a number of additional components that will be able to utilize it. Also, some specific examples and use cases will be implemented shortly. These will demonstrate the usage of Semantic Web technology on top of SMILA.
Having said this, what chances do you have to incorporate Semantic Web technology in today's proprietary enterprise search solutions? What chances are there to feed in your specific domain knowledge and utilize expert knowledge to drive search? Virtually none. This is where SMILA can provide added value already today. As an example, we have already implemented a solution in which a specialized component for chemical structures was integrated, so that standard indexing mechanisms could take the knowledge about those formulas and structures into account.
Andreas Blumauer: Are there any existing implementations yet?
Mario Lenz: Currently the project is going through the incubation phase at Eclipse. Because we are reusing an immense amount of other open source software, our IP process has grown considerably in size and complexity and has slowed down the publication of current project achievements more than we expected. We plan to release our first milestone right before the eclipse convention 2009 on March 23rd.
Andreas Blumauer: Is SMILA built on existing frameworks like Aperture or Lucene, or can it be implemented (at least in theory) from scratch?
Mario Lenz: One of our project goals is to provide ready-to-use framework components (data source connectors and processing services). There are some excellent open source projects out there that we have been able to profit from. For example, we use parts of Aperture, Sesame and Lucene to implement those components.
Andreas Blumauer: How do advanced technologies like fact extraction fit into this architecture?
Mario Lenz: More advanced text analytics functions, such as entity and fact extraction, should definitely be incorporated into SMILA. Depending on resources and priorities, some open source solutions might be implemented and shipped with one of the next releases of SMILA. As mentioned earlier, B2B vendors have already decided to base their technology on SMILA.
Andreas Blumauer: With Version 0.5 M1 you are planning to introduce a "semantic layer". What exactly will this layer do?
Mario Lenz: See the answer above on the semantic part of SMILA. In the near future, we will integrate some Semantic Web technology and implement some specific examples of semantic technology in SMILA. This will be about organizing a domain based on existing knowledge (i.e. an ontology), about extracting entities and/or facts from unstructured data, and about utilizing those for organizing and representing the relevant information. As also stated above, this clearly is only a first step and much more remains to be done. But we simply need to start somewhere...
Andreas Blumauer: What reasons could a company have to build its enterprise search on an open source framework like SMILA? How does this fit into the vision of the WWW Semantic Web?
Mario Lenz: I think we need to distinguish three different types of companies here:
- Firstly, small and medium-sized companies can use SMILA to build their search applications by simply combining ready-to-use components that have been delivered with the framework.
- Secondly, enterprise customers will most likely still want to buy from a commercial vendor, such as empolis. Large enterprises that want to keep control over their vast amount of unstructured data can integrate their own components in SMILA and/or buy some enterprise-ready components from specialized vendors (like empolis). In this case, our solution will consist of specialized components on top of SMILA. Thus, the enterprise customer will get the best of both worlds: a mature solution based on an open platform, combined with professional expertise in implementing such complex solutions, including aspects around data source integration, security, and specific business processes.
In all cases, SMILA as an open, extensible and standardized platform will put those customers in a position where such solutions are more easily replaceable and, hence, investments are better secured.
- Thirdly, vendors active in the area will profit from the open platform by being able to focus on their specific expertise rather than having to invest the vast majority of their resources in infrastructure aspects. One example is living-e AG, which recently joined the initiative. With the current approach, SMILA goes even further than most search vendors in directly incorporating Semantic Web technology.
Andreas Blumauer: Thank you for the opportunity to get an inside view of SMILA.
About: Dr. Mario Lenz
Dr. Mario Lenz completed his doctorate in Computer Science at the Humboldt University in Berlin, majoring in Knowledge Management and minoring in Psychology. During his studies, he took part in numerous industrial cooperation projects. In 1999 Mario became Director of Development at tec:inno. After the integration of tec:inno into empolis and several management positions in product development, Dr. Lenz was appointed CTO and became a member of the empolis management board in 2006.







