
A WEB SYSTEM FOR ONTOLOGY-BASED MULTIMEDIA ANNOTATION, BROWSING AND SEARCH

M. Bertini, G. Becchi, A. Del Bimbo, A. Ferracani, D. Pezzatini

University of Florence - MICC, Firenze, Italy


ABSTRACT

In this paper we present a complete system for semantic and syntactic annotation, browsing and search of multimedia data, based on a service-oriented architecture, with web-based interfaces developed following the Rich Internet Application paradigm. The system has been designed to be: i) flexible and extendable, allowing users to select only the services they need or to add their own tools to the multimedia processing pipelines; ii) distributed, with services that can be executed in a cloud computing infrastructure and accessed through web applications; iii) user-friendly, with interfaces that are uniform on every platform and that provide an interaction level similar to that of desktop applications. Extensive user trials in a real-world setup, performed by archive and broadcaster professionals, have shown the efficacy and usability of the proposed solution.

Index Terms: Multimedia database, multimedia authoring, content analysis, content-based retrieval.

1. INTRODUCTION

Recently, two surveys to gather user requirements for video annotation and search systems have been conducted within the EU funded research projects VidiVideo1 and IM3I2. More than 50 professionals working in broadcasters, national video archives, photographic archives and cultural heritage organizations participated in the surveys. One of the main outcomes is that multimedia annotation and management systems have to be web-based. In fact, this requirement was deemed mandatory by 75% of the interviewees and desirable by another 20% [1, 2]. Other interesting results are that controlled lexicons and ontologies are widely used, by 64% and 39% of the interviewees respectively, and that 71% of users requested the possibility to combine search mechanisms that account for structured data (e.g. metadata, controlled lexicons and ontologies) and unstructured data (e.g. free text and transcriptions).
However, most of the annotation and search systems developed by the scientific multimedia community are desktop applications [3, 4, 5, 6, 7, 8] whose search and browsing tools are designed for participation in scientific competitions, such as TRECVID and VideoOlympics, rather than for end-users such as broadcaster and video archive professionals. Recently, some video search engines have been designed as web applications [9, 10, 11] because of the convenience of using browsers as clients that access a common search engine. To satisfy the needs expressed by the surveys we have developed a system that offers an integrated service-oriented environment for
1 http://www.vidivideo.info
2 http://www.im3i.eu

processing, analysing, indexing, tagging, and searching multimedia content at the syntactic and semantic level.

2. THE SYSTEM

The system presented in this paper3 provides a service-oriented architecture (SOA) that allows for multiple viewpoints of multimedia data inside repositories, providing better ways to reuse, repurpose and share rich media. This paves the way for a multimedia information management platform that is more flexible, adaptable, and customizable. In fact, a SOA provides methods for systems development and integration by packaging system functionalities as interoperable services, which are the building blocks of the system. A SOA infrastructure allows different applications to communicate with one another, in a loosely coupled way, by passing data in a shared format or by orchestrating the activity of the services. One outcome of this architectural choice is that deploying the system in existing infrastructures and workflows does not require redesigning them, since it becomes possible to simply complement them, adding only the services that are required. This latter point is particularly important for organizations such as broadcasters or national video archives, which cannot completely redesign their existing systems.

An overview of the system architecture, composed of four main layers, is shown in Fig. 1: the Interface and Authoring layers, the Architecture layer and the Analysis layer. Communication between the analysis and interface layers is routed through the architecture layer, which also takes care of the main repository functions. The Analysis layer is responsible for extracting low-level features and semantic annotations from media files, through a series of processing pipelines that can be executed in a cloud of servers, orchestrated by dedicated services. The Interface and Authoring layers are composed of several components, ranging from specialized interfaces for annotation and search to basic UI widgets.
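As a minimal sketch of the loosely coupled, shared-format communication described above: each service consumes and produces a message in a common JSON format, so an orchestrator can chain services without knowing their internals. All service names and message fields here are hypothetical illustrations, not the system's actual API.

```python
import json

# Hypothetical analysis services: each accepts and returns a JSON-encoded
# message in a shared format, so callers need no knowledge of internals.
def segmentation_service(message: str) -> str:
    task = json.loads(message)
    # Stand-in for shot segmentation: attach two detected shots.
    task["shots"] = [{"start": 0.0, "end": 12.5}, {"start": 12.5, "end": 30.0}]
    return json.dumps(task)

def annotation_service(message: str) -> str:
    task = json.loads(message)
    # Stand-in for concept annotation: label each detected shot.
    for shot in task.get("shots", []):
        shot["concepts"] = ["outdoor"]
    return json.dumps(task)

def orchestrate(media_uri: str, services) -> dict:
    """Pass one shared-format message through a sequence of services."""
    message = json.dumps({"media": media_uri})
    for service in services:
        message = service(message)
    return json.loads(message)

result = orchestrate("archive/video001.mp4",
                     [segmentation_service, annotation_service])
print(result["shots"][0]["concepts"])  # ['outdoor']
```

Because every service speaks the same message format, swapping a service for another implementation, or inserting a new one into the chain, does not require changes to the rest of the system, which is the loose coupling the architecture relies on.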
A main component in the figure is the authoring layer. This component is dedicated to the composition and creation of search, browsing, and editing interfaces for end-users, combining ready-made interface building blocks. Automatic multimedia annotation is performed by user-definable processing pipelines; the system provides a number of services for syntactic and semantic audio and video content annotation. These services can be combined in processing pipelines to create more complex services. For example, a video annotation pipeline that can be created, modified and managed using some of the services provided by this system is shown in Fig. 2. Annotation of visual content is performed using an implementation of the Bag-of-Visual-Words paradigm, based on a fusion of MSER [12], SURF [13] and SIFT [14] features and the Pyramid
3 Available for testing at: URL hidden for double blind review

[Fig. 1 diagram: Interface layer (Corporate CMS); Authoring layer (authoring environment, end-user interfaces); SOA Architecture layer (system repository, system SOA, file storage); Local SOA architecture with Analysis layer (video/image analysis pipeline, audio analysis pipeline, user-defined pipeline, analysis database, media storage, semantic and syntactic search/browse).]
Fig. 1. Overall view of the system architecture.

[Fig. 2 diagram: Ingestion/transcoding, Segmentation, BoW annotation, CBIR indexing, Video streaming transcoding, Audio extraction, leading to the audio processing pipeline.]

Fig. 3. Screenshots of some of the annotation tools: top) AJAX tool for tagging, ontology-based annotation and audio transcription of videos; bottom) adding geographical metadata to concept annotations.

Fig. 2. Example of the automatic annotation pipeline for videos, built using services provided by the system. Users can create their own processing pipelines by combining the services provided by the system, or other existing pipelines.
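A rough illustration of how such user-defined pipelines might be composed from simpler services (the step names and data fields here are hypothetical; in the actual system these steps are web services in the analysis layer):

```python
from typing import Callable, Dict, List

# A pipeline step is any function mapping a task dict to a task dict.
Step = Callable[[Dict], Dict]

def ingest(task: Dict) -> Dict:
    # Stand-in for ingestion/transcoding.
    task["transcoded"] = True
    return task

def segment(task: Dict) -> Dict:
    # Stand-in for shot segmentation: frame ranges of detected shots.
    task["segments"] = [(0, 100), (100, 250)]
    return task

def extract_audio(task: Dict) -> Dict:
    # Stand-in for audio extraction, feeding the audio pipeline.
    task["audio_track"] = task["media"] + ".wav"
    return task

def make_pipeline(steps: List[Step]) -> Step:
    """Compose simple services into a more complex one. The result is
    itself a step, so pipelines can be nested into larger pipelines."""
    def run(task: Dict) -> Dict:
        for step in steps:
            task = step(task)
        return task
    return run

video_pipeline = make_pipeline([ingest, segment, extract_audio])
out = video_pipeline({"media": "clip.mp4"})
print(out["segments"])  # [(0, 100), (100, 250)]
```

The point of `make_pipeline` returning a `Step` is that a composed pipeline can itself be used as a building block, mirroring the paper's statement that combined services form more complex services and existing pipelines can be reused in new ones.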

Matching Kernel [15]. Audio annotation is based on a fusion of timbral texture features such as ZCR, MFCCs, chroma and spectral features, and SVM classifiers. CBIR is performed using rhythm and pitch features for audio and MPEG-7 features for visual data, in particular a combination of the Scalable Color, Color Layout and Edge Histogram descriptors. To address the problem of scalability in large-scale archives, these features are indexed using the approximate similarity search approach presented in [16]. Semantic-level search and browsing is performed through a search engine that uses the ontology design presented in [17]; ontology-based reasoning using concept relations, subsumption and WordNet synonyms is employed for query expansion. The graph of the ontology concepts is also used to browse the media archives (Fig. 4). The search engine also works with free-text annotations and transcriptions, and can be used as a web service or through specific web applications. Other services and specialized interfaces allow tagging and syntactic-level content-based retrieval.

Fig. 4. Screenshot of the browse application: the concept cloud is used to start browsing, and the graph shows a reduced view of the ontology around a selected concept. Users can inspect instances of the concept stored in the system or search them in other repositories such as YouTube and Flickr.

Publishing functionalities are provided by a set of services and interfaces of the authoring platform. This platform allows users to import and publish existing media repositories and to author web-based environments that let end-users interact with the repositories. Authors can create elaborate workflow patterns and search interfaces that can be embedded in a variety of commercial CMS systems. Using AJAX, Flash/Flex, Silverlight and other Rich Internet Application (RIA) technologies [18] makes it possible to develop web applications that are highly responsive [19] and allow more advanced interaction. The quality of interaction, made essential for users by modern desktop applications and operating systems, is achieved by means of drag&drop, advanced widgets and advanced multimedia

support, which is not available in traditional web-based applications [20]. Other benefits are improved server performance, because part of the computational burden is distributed to the client, and easy distribution of new versions of the application, which is downloaded by the clients every time it is used. All the web applications of the system have been developed according to the RIA paradigm. In particular, the applications of the Interface and Authoring layers are developed in AJAX and Flash/Flex, while data is exchanged using SOAP, RSS and JSON for metadata and RTMP for video streaming. Figs. 3, 4 and 5 show some screenshots of the manual annotation (to check automatic annotations, add metadata or create ground-truth annotations to train new automatic concept detectors), browse, search (using different modalities) and tagging/CBIR tools.

Fig. 5. Screenshots of some of the search tools; top) advanced ontology-based video search (Google-like search is also available), bottom) CBIR and image tagging. Video keyframes can be used to select visually similar videos and images.

3. EXPERIMENTS

The system presented in this paper has been thoroughly tested in several field trials with the participation of 19 multimedia archive and broadcaster professionals in The Netherlands, Hungary, Italy and Germany. The system was running on our servers while users were at the premises of their organization, using the same PCs they use for their daily work. The goal of the field trials was to assess the usability of the system, in particular letting the users interact with the search engine and its interfaces to pose semantic- and syntactic-level queries, but also to annotate, automatically and manually, some videos. The methodology used follows the practices defined in the ISO 9241 standard, and considered the following four factors: usability, effectiveness, efficiency and satisfaction. A set of activities involving the various interfaces was selected. These activities allowed testing of both the automatic and manual annotation system (this activity was performed by a subset of 6 users) and the features of the search/browse engines using different search modalities (structured/unstructured/similarity-based). The trials were followed by a debriefing of the users, who had to fill in a questionnaire to evaluate their impressions of the system and its perceived effectiveness and usability. Given that such systems are not yet in widespread use, and that their interfaces may require understanding the meaning and scope of various widgets, a short user manual was prepared to give the testers a basic understanding of the system. In addition to the short manual, a simple system walkthrough (about 10 minutes long) was presented to the users by test monitors. These monitors also took observational notes and recorded verbal feedback from users during the tests. These notes and the questionnaires have been considered in a second stage of system design to improve the overall usability, considering interface and workflow design.

Fig. 6. Overview of usability evaluation for the search tests: overall usability of the system, usability of the combination of search modalities.

The overall experience is very positive and the system proved to be easy to use, despite the objective difficulty of interacting with a complex system for which the testers received only very limited training. Fig. 6 reports two results for the search activities. Users appreciated the combination of different interfaces and functions. The search modality that proved most suitable for the majority of the users is the advanced interface, because it allows building queries with Boolean/temporal relations between concepts and concept relations, and because of the possibility to use geographical and video metadata, which is appealing for professional archivists.

Fig. 7. Overview of usability evaluation for the annotation tests: overall usability of the automatic annotation system, usability of the manual annotation tool.

The usability of the annotation components, both automatic and manual, was also satisfactory, although some concerns remain regarding the precision of automatic annotation, which is still too low for the high standards of archivists. Fig. 7 reports two results for the automatic and manual annotation tools. None of the users had any previous work experience with automatic video annotation systems, but they had been trained in using a manual annotation tool developed within their organization. In general, the comments recorded during the trials and those gathered with the anonymous questionnaire have shown a high degree of satisfaction with the system, and have provided interesting hints for further improvement of the interfaces that, in part, have already been taken into account in further development.

4. CONCLUSIONS

In this paper we have presented a system, based on a SOA back-end and a RIA front-end, that has been jointly designed by industrial and academic partners of EU funded research projects. The system architecture makes it easily deployable, also in organizations that have a well-established multimedia management workflow. The system provides functionalities for the management of automatic multimedia analysis pipelines, manual annotation tools, searching and browsing tools, and authoring interfaces. It has been thoroughly tested in a real-world setup by industry professionals, with good results, and is still under active development within the scope of an EU funded technology transfer project.
5. REFERENCES

[1] "Deliverable D7.6 - validation of the user interface of the VIDI-Video system," Tech. Rep., VidiVideo consortium, 2009.
[2] "Deliverable D2.1 - initial user requirements study," Tech. Rep., IM3I consortium, 2009.
[3] J. Pickens, J. Adcock, M. Cooper, and A. Girgensohn, "FXPAL interactive search experiments for TRECVID 2008," in Proc. of the TRECVID Workshop, 2008.
[4] A. Natsev, W. Jiang, M. Merler, J.R. Smith, J. Tešić, L. Xie, and R. Yan, "IBM Research TRECVid-2008 video retrieval system," in Proc. of the TRECVID Workshop, 2008.
[5] J. Cao, Y.-D. Zhang, B.-L. Feng, L. Bao, L. Pang, and J.-T. Li, "TRECVID 2009 of MCG-ICT-CAS," in Proc. of the TRECVID Workshop, 2009.
[6] C.G.M. Snoek, K.E.A. van de Sande, O. de Rooij, B. Huurnink, J.R.R. Uijlings, M. van Liempt, M. Bugalho, I. Trancoso, F. Yan, M.A. Tahir, K. Mikolajczyk, J. Kittler, M. de Rijke, J.-M. Geusebroek, T. Gevers, M. Worring, D.C. Koelma, and A.W.M. Smeulders, "The MediaMill TRECVID 2009 semantic video search engine," in Proc. of the TRECVID Workshop, Gaithersburg, USA, November 2009.
[7] O. de Rooij and M. Worring, "Browsing video along multiple threads," IEEE Transactions on Multimedia (TMM), vol. 12, no. 2, pp. 121-130, 2010.
[8] Y.-T. Zheng, S.-Y. Neo, X. Chen, and T.-S. Chua, "VisionGo: towards true interactivity," in Proc. of CIVR, 2009.
[9] M. Bertini, G. D'Amico, A. Ferracani, M. Meoni, and G. Serra, "Sirio, Orione and Pan: an integrated web system for ontology-based video search and annotation," in Proc. of ACM MM, 2010.
[10] W. Bailer, W. Weiss, G. Kienast, G. Thallinger, and W. Haas, "A video browsing tool for content management in postproduction," International Journal of Digital Multimedia Broadcasting, 2010.
[11] S. Vrochidis, A. Moumtzidou, P. King, A. Dimou, V. Mezaris, and I. Kompatsiaris, "VERGE: A video interactive retrieval engine," in Proc. of CBMI, 2010.
[12] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, no. 10, pp. 761-767, 2004.
[13] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.
[14] D.G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision (IJCV), vol. 60, no. 2, pp. 91-110, 2004.
[15] K. Grauman and T. Darrell, "The pyramid match kernel: Efficient learning with sets of features," Journal of Machine Learning Research (JMLR), vol. 8, pp. 725-760, 2007.
[16] G. Amato and P. Savino, "Approximate similarity search in metric spaces using inverted files," in Proc. of InfoScale, 2008.
[17] L. Ballan, M. Bertini, A. Del Bimbo, and G. Serra, "Video annotation and retrieval using ontologies and rule learning," IEEE MultiMedia, vol. 17, no. 4, pp. 80-88, Oct.-Dec. 2010.
[18] P. Fraternali, G. Rossi, and F. Sanchez-Figueroa, "Rich internet applications," IEEE Internet Computing, vol. 14, pp. 9-12, 2010.
[19] T. Leighton, "Improving performance on the internet," Communications of the ACM, vol. 52, pp. 44-51, February 2009.
[20] P. Fraternali, S. Comai, A. Bozzon, and G.T. Carughi, "Engineering rich internet applications with a model-driven approach," ACM Transactions on the Web (TWEB), vol. 4, pp. 7:1-7:47, April 2010.
