

Werner Bailer: "A picture is worth a thousand words."

07. October 2009, by Tassilo Pellegrini


With increasing broadband diffusion we have witnessed the blossoming of numerous picture and video search platforms. What have been the most interesting developments in the last few years from the perspective of media semantics?

As Susanne Boll pointed out in an article in IEEE MultiMedia in 2007, most of the media-related Web 2.0 sites did not initially use any research results from areas such as multimedia content analysis or semantic content classification. It is exciting to see that many of these platforms are now adding media semantics in small steps, in the typical dynamic way they evolve. Features such as structured annotation, annotation of time segments or regions, face detection and similarity search are now available on many of these platforms.

Service providers like Google, Jinni or Pixolu are increasingly coupling content-related information with structural indicators. How would you describe the current development in content-based multimedia retrieval?

These services put into practice the lesson learned from more than a decade of research in content-based multimedia retrieval: there are a number of very interesting methods, but in isolation they do not provide real benefit to the end user. What we see now is that these technologies are used in combination with other descriptive metadata and with the user in the loop to provide relevance feedback and iteratively refine the query, which leverages their potential.
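The feedback loop described above can be illustrated with a small sketch. It assumes that the query and the media items are represented as plain feature vectors, and it uses the classic Rocchio update as a stand-in; the interview does not say which method any of the named services actually use. The function name and the weights alpha, beta and gamma are illustrative choices.

```python
def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Refine a query vector from user relevance feedback (Rocchio update).

    query        -- current query as a list of floats
    relevant     -- feature vectors of items the user marked as relevant
    non_relevant -- feature vectors of items the user marked as not relevant
    The weights are illustrative defaults, not tuned values.
    """
    dim = len(query)

    def centroid(vectors):
        # Mean vector of the marked items; a zero vector if none were marked.
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)

    # Move the query towards relevant items and away from non-relevant ones.
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, non_c)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    refined = rocchio_update(query, relevant=[[0.0, 1.0]], non_relevant=[])
    print(refined)
```

Each round of feedback feeds the refined vector back into the search, so the result list is narrowed iteratively rather than in a single step.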

While it is relatively easy to automatically grab the semantics from textual data, algorithmic analysis of image data is still in its infancy. What are the benefits and boundaries of algorithmic content-based image analysis?

There is some truth in saying that a picture is worth a thousand words. Automatic analysis algorithms can currently decode just a few of them, mainly those related to what is actually depicted, while humans associate many concepts with a picture based on their context and experience. Although a lot of progress has been made in automatic concept classification, the results of the TRECVID benchmark still show that there is a difference of about a factor of 10 in the achieved precision between visually well represented concepts and more abstract ones. We can expect this to improve with the ongoing work on large-scale concept classification (several thousand classes), which also makes use of semantic relations between different concepts.

With GWAP (ESP, Verbosity, Squigl etc.) Google demonstrated an impressive game-oriented approach to collaborative image tagging done by non-experts. What is the role of human users in annotating image data? Where will machines support them?

It is interesting to see that the annotations created with different approaches are complementary. There is a Dutch video labeling game that allows users to annotate broadcast archive content. It turns out that non-expert users annotate different aspects than professional archivists do, just as automatic tools can provide annotations that are difficult for humans to create, and vice versa. The key is to intelligently combine the strengths of these approaches, for example by using automatic methods to apply existing annotations to similar content or by using linked open data to enrich annotations.
 

What about multimedia on the Semantic Web? Are there Semantic Web standards and vocabularies in place (or in development) that support the structured annotation and ontological management of multimedia data?

There have been a number of proposals for multimedia ontologies and mappings of multimedia vocabularies (cf. the excellent report from the W3C MM Semantics XG), differing in complexity and expressivity. The W3C has therefore chartered a working group to develop an ontology and API for multimedia content on the Web. The group is developing a lightweight core set of metadata properties and an API specification for accessing these properties, which may come from metadata documents in different standards. For this reason, mappings to many relevant standards have also been specified. The set of metadata properties will be formalized for interoperability with the Semantic Web. A W3C Recommendation is expected in 2010.
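The idea of a small core property set with mappings from existing standards can be sketched roughly as follows. This is only an illustration: the mapping table, the format keys and the function name are hypothetical and do not reproduce the working group's actual mapping tables or API.

```python
# Hypothetical mapping table: source-format field name -> core property name.
# The entries are illustrative assumptions, not the working group's mappings.
CORE_MAPPINGS = {
    "dublin_core": {
        "dc:title": "title",
        "dc:creator": "creator",
        "dc:date": "date",
    },
    "exif": {
        "ImageDescription": "description",
        "Artist": "creator",
        "DateTimeOriginal": "date",
    },
}


def to_core_properties(source_format, record):
    """Translate a metadata record into the shared core property names.

    Fields without a mapping are dropped, so callers see the same small
    set of property names regardless of the source standard.
    """
    mapping = CORE_MAPPINGS.get(source_format)
    if mapping is None:
        raise ValueError(f"no mapping defined for format: {source_format}")
    return {mapping[key]: value
            for key, value in record.items()
            if key in mapping}


if __name__ == "__main__":
    exif_record = {"Artist": "Jane Doe", "DateTimeOriginal": "2009:10:07 12:00:00"}
    print(to_core_properties("exif", exif_record))
```

An application written against the core names can then read creator or date in the same way whether the underlying document uses one metadata standard or another.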

About Werner Bailer

Werner Bailer is a researcher at the Institute of Information Systems of JOANNEUM RESEARCH and works in the areas of audiovisual archiving, digital cinema production, digital film restoration and quality analysis, and interactive TV. He is interested in image and video processing algorithms and metadata modelling for audiovisual content, and is currently a member of the W3C Media Annotations WG. Since 2007 he has been working on a PhD thesis on the topic of multimedia content abstraction at the Technical University of Graz.

SAMT 2009 - 4th International Conference on Semantic and Digital Media Technologies

2-4 December 2009 Graz, Austria

The 4th International Conference on Semantic and Digital Media Technologies (SAMT '09) aims to narrow the large disparity between the low-level descriptors that can be computed automatically from multimedia content and the richness and subjectivity of semantics in user queries and human interpretations of audiovisual media: the Semantic Gap.

