Feature Article

0.6 for a concept such as ‘‘face.’’ Despite the fact that performance improvements have been reported in the last years, and a large effort has been devoted to extend the number of different concept classifiers, important questions remain, such as how many concept detectors are really useful3 and how reliable they should be.4 Moreover, concept classifiers are usually drawn from a particular domain, but an important question is how well they can generalize across different domains. Exploiting the semantic relationships between concepts is receiving a large amount of attention from the scientific community because doing so can improve the detection accuracy of concepts and obtain a richer semantic annotation of a video. To this end, ontologies are expected to improve the capability of computer systems to automatically detect even complex concepts and events from visual data with higher reliability. Ontologies consist of concepts, concept properties, and relationships between concepts. They organize semantic heterogeneity of information, using a formal representation, and provide a common vocabulary that encodes semantics and supports reasoning. There have been few attempts to integrate high-level semantic concepts provided by an ontology with their visual representation. In the most common approach, the ontology provides the conceptual view of the domain at the schema level, and appropriate concept detectors play the role of observers of the realworld sources, classifying an observed entity or event in the nearest concept of the ontology. In this way, concept detectors have the responsibility of implementing invariance with respect to several conditions while, once the observations are classified, the ontology is exploited to have a more complete semantic annotation, establishing links to other concepts and disambiguating the results of classification. In real applications, there is need to detect and recognize complex concepts and situations where multiple elementary concepts are in mutual relation in time and space. Therefore, ontologies have to be extended to define these higher-level concepts, adding sets of rules that encode spatiotemporal relationships among individual concepts. As the number of these concepts grows, the number of rules for their detection increases. Thus, the definition of rules by human experts is not practical; the

Video Annotation and Retrieval Using Ontologies and Rule Learning
Lamberto Ballan, Marco Bertini, Alberto Del Bimbo, and Giuseppe Serra University of Florence, Italy
An approach for automatic annotation and retrieval of video content uses semantic concept classifiers and ontologies to permit expanded queries to synonyms and concept specializations. hile understanding the semantic meaning of video content is immediate for humans, it’s far from immediate for a computer. This discrepancy is commonly referred to as the semantic gap. A recent trend in the effort to bridge this gap is to define a large set of semantic concept detectors, each of which automatically detects the presence of a semantic concept such as ‘‘indoor,’’ ‘‘face,’’ ‘‘person,’’ or ‘‘airplane flying.’’ Typically these detectors learn the mapping between a set of low-level visual features, such as local descriptors, color and texture, and a concept from examples. Approaches to bridge this gap (see the ‘‘Related Work’’ sidebar) have been implemented in several systems designed for visualconcept recognition challenges, such as the Pascal Video Object Classes Challenge1 and the Text Retrieval Conference Video (Trecvid) retrieval evaluation.2 Generally, however, semantic concepts are still difficult to detect accurately, so their detection in video remains a challenging problem. The accuracy of state-of-the-art detection can range from less than 0.1 (measured by average precision) for semantic concepts such as ‘‘people marching’’ or ‘‘fire weapon’’ to above

W

80

1070-986X/10/$26.00 c 2010 IEEE 

Published by the IEEE Computer Society

their performance in terms of mean average precision. http:// www-nlpir. they addressed the imbalance problem of positive and negative examples in the case of rare events and concepts using data-mining techniques. pp. Multimedia. using a set of 363 concept detectors. Jiang. 2007.’’ Proc. 10. pp. vol.. pp. 2008.. pp. IEEE Int’l Conf. 2.gov/projects/tvpubs/tv8. pp. representative of these patterns.’’ Proc.3 Presently.. L.’’ IEEE Trans. C.12 All these approaches expect that rules are created by human experts. 16. 85-94. Worring. References 1. proposed by Bertini et al. ACM Press.1 Snoek and Worring confirmed the positive correlation between the number of concept detectors and video retrieval performance. M.-Y. have proposed two semantic spaces. ‘‘Selection of Concept Detectors Using Ontology-Enriched Semantic Space. 7. 1966-1969. vol. to facilitate the selection and fusion of concept detectors for video search. varies in a range from 0. and J. 3. X.-H. Hollink et al.papers/ibm.’’ IEEE Trans. vol.-W. 2007. 6. Z. IEEE Press. 2008. 2008. 6th Trecvid Workshop. 91-98. Hollink. 252-259. 2. 9. vol. Multimedia. ‘‘Building a Comprehensive Ontology to Refine Video Concept Detection.’’ IEEE MultiMedia.pdf.13 These authors proposed a method to annotate rare events and concepts based on a set of rules that use lowlevel and middle-level features. Int’l Conf. Int’l Machine Vision and Image Processing Conf. 9.. Wei et al. 4. Image and Video Retrieval. Multimedia. Yang and A.. 6th Trecvid Workshop. Ontologies and concept relations have been recently proposed to improve the performance of the concept detectors.19 to 0. ‘‘Association and Temporal Rule Mining for Post-Filtering of Semantic Concept Detection in Video.-J. 2008. Zha et al. A decision-tree algorithm is applied to the rule-learning process. 14. and Y.pdf. as well as the improvement of the pairwise combination of detectors. no. Multimedia.-G.9 In a different approach. 5. ACM Int’l Workshop Multimedia Information Retrieval.13.. http:// www-nlpir. 117-124. IEEE Press.’’ IEEE MultiMedia. K. 958-966. Hunter. pp. http:// www-nlpir. ACM Int’l Conf. the performance of video search engines is still far from acceptable. A. proposed a method to enhance the accuracy of semantic concept detection using association-mining techniques to imply the presence of a concept from the co-occurrence of other high-level concepts. 240-251. no.papers/columbia. ‘‘IBM Research Trecvid-2008 Video Retrieval System. they are not practical for the definition of a large set of rules. M. Shyu et al.14 However. 5. C.1 mean average precision is likely to provide high accuracy in news video retrieval. thus.000 concepts detected) with a minimal 0. Naphade et al. 227-236. ‘‘Dynamic Pictorially Enriched Ontologies for Digital Video Libraries. pp. report that concept-based video retrieval (with fewer than 5.nist. vol. A. 2009. S. ‘‘Video Semantic Content Analysis Based on Ontology. J.gov/projects/tvpubs/tv8. 1085-1096. these methods have shown to be insufficiently expressive to describe composite concepts and events because they don’t take into account spatiotemporal relations between individual concepts.nist. Ngo. 86-91. Automatic learning of rules has been proposed by Shyu et al. Liu et al. Snoek and M.nist. 6th Trecvid Workshop. 3. Hauptmann.8 using pairwise correlations between concepts and hierarchical relationships to refine concept detection of support vector machine classifiers. ‘‘Are Concept Detector Lexicons Effective for Video Search?’’ Proc.’’ Proc. other authors have explored the use of rule-based reasoning over objects and events in different domains. Little.pdf.. identifying their spatiotemporal patterns.11 Bai et al. ‘‘Video Semantic Event/Concept Detection Using a Subspace-Based Multimedia Data Mining Framework. 81 .7 defined an ontology to provide a simple structure to LSCOM. ontologyenriched semantic space and ontology-enriched orthogonal semantic space. 2. ‘‘Columbia University/Vireo-CityU/IRIT Trecvid2008 High-Level Feature Extraction and Interactive Video Search. no. 8. Multimedia & Expo. pp. with temporal description logic. Hauptmann et al. Chang et al. M. Snoek et al.’’ Proc.. Bertini et al. 2007. 2008. ACM Press. ‘‘Evaluating the Application of Semantic Inferencing Rules to Image Annotation. no. the ontology includes visual data instances related to high-level concepts. ‘‘(Un)reliability of Video Concept Detection. defined a set of SWRL rules to perform semiautomatic annotation of images of pancreatic cells. S. no. 2005. 10.. L.’’ IEEE Trans.1-3 Hauptmann et al. ‘‘The MediaMill Trecvid 2008 Semantic Video Search Engine.4-6 In fact. C. to perform event annotation in soccer videos. 6. 13..F. ‘‘Large-Scale Concept Ontology for Multimedia. no. visual prototypes. pp. are then defined and used for automatic annotation. Zha et al.papers/mediamill.’’ Proc. 10..’’ IEEE Trans.’’ Proc.. Knowledge Capture. 13.10 To obtain richer annotations. 2006.gov/projects/tvpubs/tv7. Liu et al. pp. Bai et al. 2.Related Work The usefulness of the construction of large sets of automatic video concept classifiers and the evaluation of the number of detectors needed for effective video retrieval has been studied in several works. 2008. ACM Press. Wei.-L. Natsev et al. 2007. 11. obtained in the Trecvid 2008 evaluation using 20 concepts from the Large Scale Concept Ontology for Multimedia (LSCOM) lexicon. vol. pp. 2008. 12.’’ Proc. defined a soccer ontology and applied temporal reasoning. ‘‘Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study with Broadcast News. 10. 42-51. Moreover.

while sBox and pBox are their bounding boxes. has part. or when no more negative examples are excluded for a certain number of iterations l. we present an approach for automatic annotation and retrieval of video content based on ontologies and semanticconcept classifiers. . exploiting the knowledge embedded in the ontology. the ontology schema (based on concepts detected by semantic classifiers and their linguistic relations as encoded in WordNet). sBoxÞ ^ BoxIsInBox ðpBox. variables. and adding literals one at a time to specialize the rule until it avoids all negative examples. _:Ln. automatic determination of semantic linguistic relations between concepts (is a. This is a general-to-specific search through the space of hypotheses. pBoxÞ ^ HasBoundingBox ðs. to define the ontology schema. and :L1 _ :L2 . consider the sentence that describes the composite concept: ‘‘a person is in a secured area’’ (IF a person and a secured area instances occur in a shot and the bounding box of that person is in the bounding box of that secured area THEN that person is in secured area). using WordNet. and. . the target composite concept) and an empty or initial body. _:Ln are negative literals. . a new algorithm obtained as an adaptation of the first-order inductive learner (FOIL5) technique to ontologies and Semantic Web technologies. As an example. the initial rule for the composite concept ‘‘a person enters in a secured area’’ could be: Person(p) 6 SecuredArea(s) ! PersonEntersSecuredArea(p). . the concept detectors are then linked to the corresponding concepts in the ontology. sBoxÞ ! PersonIsInSecuredArea ðpÞ where p and s are variables that can be bound to any person and any secured area respectively. rule-based method for automatic semantic annotation of composite concepts and events in videos. for each concept. as in: H _ :L1 _ :L2 . The difference between predicates and functions is that predicates (in the following written with an upper-case first letter) can assume only Boolean values. while the literal H that forms the postcondition. All the expressions are composed of constants. A clause is any disjunction of literals. We propose a novel. As an example of the Horn clause. is called head. we present a Web video search engine that relies on ontologies and permits queries using a composition of Boolean and temporal relations between concepts. where all variables are assumed to be universally quantified. a set of the concept instances that have been observed. First of all. A Horn clause is a clause containing at most one positive literal. Ln) is called body.appropriate solution is to learn a set of rules automatically for each composite concept to be detected. A literal is any predicate. otherwise it’s a positive literal. 6 Ln) ! H. The Horn clause precondition (L1 6 L2 . any variable. which is equivalent to ‘‘IF (L1 _ L2 . This system exploits the ontology structure and permits. This sentence can be translated in the following fragment in first-order logic: Person ðpÞ ^ SecuredArea ðsÞ ^ HasBoundingBox ðp. In this article. beginning with the most general preconditions possible (the empty or initial precondition). predicate symbols. expanded queries to synonyms and concept specializations. If a literal contains a negation symbol (:). applied to any term. where H is the positive literal. or its negation. A term is any constant. The hypotheses learned by Foils are sets of rules that are Horn clauses. . Rules are learned using first-order inductive learner for SWRL (Foils). first-order logic rules defined in SWRL are automatically learned from the knowledge that is embedded in the ontology. it’s called negative literal. Automatic rule learning with first-order logic In our approach. the concepts’ relationship of co-occurrence and the temporal consistency of video data are used to improve the performance of individual concept detectors. . is part of) is performed. Our algorithm learns rules expressed in Semantic Web Rules Language (SWRL) automatically. Ln) THEN H’’. . . The algorithm iterates searching new literals that have to be added to the body. composed of the head (that is. Our ontology contains abstract concepts. whereas functions (in the following written in lowercase) might have any constant as their value. and an ontology with a set of instances that are positive and negative examples of the target concept. written in SWRL. Moreover. The algorithm starts with an initial rule. and function symbols. Finally. function. for example. A schema of the algorithm is shown Figure 1. . . It is equivalent to: (L1 6 L2 . or any IEEE MultiMedia 82 .

. 1] weights the contribution of the mutual information and M is a matrix. all concepts and concepts relations).1g where a 2 [0. A special literal. . . The evaluation function used to estimate the utility of adding a new literal is based on the number of positive and negative bindings covered before and after adding this new literal. oi is a conceptdependent weighting coefficient (with Sioi ¼ 1) that measures the contribution from the shot that is temporally i shots apart from St. expressed using mutual information that measures the dependence of a concept pair. the current rule Ri being considered is (L1 6 L2 . x2. First-order inductive learner for SWRL (Foils) algorithm. This performance can be improved considering the probability of contemporary presence of individual concept pairs. the algorithm considers the performance of the rule over the instances stored in the ontology. x2. . . . while d is the maximum temporal distance within which the shots are considered. P(Ct ¼ 1j S) is the confidence score of a detector for the concept C in shot St. The evaluation function is defined as  Rule GainðLiþ1 . Given Pi ¼ P(Ci ¼ 1j S) the confidence score of a detector for the concept Ci in a video shot S and P ¼ [P1. Let us consider a rule Ri and a candidate literal Liþ1 that might be added to the body of the rule.Rule) Add Best_literal to Rule preconditions Pos ← subset of Positive examples that satisfy Rule Neg ← subset of Negative examples that does not satisfy Rule until Neg is empty or no more Neg examples are excluded for i iterations L where MI(Ci. P(Cj ¼ l). and P(Ci ¼ k. .l2f0. 6 Ln) ! H(x1. Ri Þ  t log2  p1 p0 Àlog2 p1 þ n1 p0 þn0 Pos ← Positive examples Neg ← Negative examples Rule ← Initial rule repeat Candidate literals ← Generating hypothesis candidates Best literal ← argmax Rule Gain(L. we included in the ontology the relation of concepts co-occurrence. 1} are computed from ground truth. . where (L1 6 L2 . xk) is the head. Cj ¼ lÞ ¼ PðCi ¼ k.Two issues have to be addressed: the generation of hypothesis candidates and the choice of the most promising candidate. for each concept we reevaluate its confidence values at each shot using: PT ðCt ¼ 1j St Þ ¼ d X i¼Àd !i PðCt ¼ 1j CtÀi ¼ 1ÞPðCtÀi ¼ 1j StÀi Þ ð2Þ October—December 2010 where P(Ct ¼ 1j CtÀi ¼ 1) are probabilities estimated from ground-truth annotations. . To select the most promising literal from the candidates generated at each step. To this end.6 it’s possible to exploit the mutual information to refine the confidence values of the detected concept instances. where at least a variable already exists in the rule. because variables created at different iterations could have the same meaning. can be considered. Cj) is the mutual information between concept Ci and Cj. The probability values P(Ci ¼ k).7 In particular. Cj ¼ l for k. The performance of composite concept annotation is tightly related to the reliability of the semantic classifiers. Suppose that at the ith iteration. Finally. . it’s possible to refine the confidence scores Pþ with Pþ ¼ ð1 À ÞP þ MP ð1Þ Figure 1. Cj ¼ lÞ PðCi ¼ kÞPðCj ¼ lÞ k. 1} is the probability of the presence or absence of Ci in the videos. t is the number of positive rule R bindings that are still covered after adding literal Liþ1 to Ri. . . where p0 and n0 are the number of positive and negative bindings of Ri. Pn]T the confidence score vector for all concepts in S. Cj Þ X PðCi ¼ k. The value of P(Ci ¼ k) for k 2 {0. Equal(xj. . This quantity is computed from the analysis of the concept instances as: MIðCi . . 6 Ln) are literals forming the current rule preconditions and H(x1. while p1 and n1 are the number of positive and negative bindings of the new rule Riþ1 (resulting from the addition of Liþ1). The confidence score of a detector for concept C can be improved considering the fact that the presence of a semantic concept generally spans multiple consecutive shots. xk) where xj and xk are variables already present in the rule. with the diagonal elements set to 0 to avoid self-reinforcement. xk). . as well as their temporal consistency. Following the approach introduced elsewhere. Foils generates candidate specializations of this rule by considering as new literals Liþ1 any predicate occurring in the ontology (that is. . whose entries have been computed using the mutual information. 83 .l 2 {0.

Temporal: before (?a. this doesn’t negatively affect the performance of the rule. to automatically extend the video annotation with instances of the airplane take-off event. to detect and localize objects. Spatial : BoxOverlapsBox(?tas. As an example.’’ and ‘‘ground’’ detectors.9 have a bounding box (aBox.’’ ‘‘sky. considering the performance of the rules over the training data. the ontology contains instances resulting from the detection of ‘‘airplane. it is applied to the ontology. In this case.’’ and ‘‘ground’’ detection and tracking in a Trecvid video sequence. such as Airplaneð?aÞ ^ Skyð?sÞ ^ Groundð?gÞ ! AirplaneIsTakingOff ð?aÞ The algorithm enriches an initial rule with spatiotemporal relations. ?aBox.’’ In some spectively) were chosen in preliminary cases. ?sBox) ∧ HasBoundingBox(?g. ?tag) ∧ MovingObject(?a) Experimental results We evaluated how much our method → AirplaneIsTakingOff (?a) improves the performance of individual concept This rule can be translated in the following detectors. based on an improved version of the particle filter. ?s)) and the spatial properties used to encode the relative positions between concepts (for example. We built an ontolinstances (a. Once the rule is learned. ?aBox) ∧ HasBoundingBox(?s. which are used by the tracker to identify each instance in a video sequence. gBox) AND following the method described. HasBoundingBox(?s. The interval tag AND the airplane is a moving object. the result of the Foils algorithm is Airplane (?a) ∧ Sky(?s) ∧ Ground(? g)∧ HasBoundingBox(?a.’’ ‘‘sky. Figure 2 shows a sequence of ‘‘airplane take-off’’ with results of concept detectors. to be able to evalground AND the time interval tas is after of the uate the effects of time consistency.’’ and a simple initial rule. Then.’’ is added to the rule even if it is not necessary. ?aBox. As an example of rule learning using the Foils algorithm. it was divided using a fourAND for a time interval tag the bounding box fold approach. Thus. Examples of ‘‘airplane.’’ and ‘‘ground’’ and temporal consistency. ?sBox). we can observe that Foils adds some literals that are not necessary for the event representation. maintaining groups of consecof the airplane is on the bounding box of the utive shots in the same fold. At each step. ?aBox. the bounding box of used for this experiment is the training set the airplane is on the bounding box of the sky of Trecvid 2005. until the recognition performance does not improve. ?gBox) ∧ Spatial : BoxOverlapsBox(?tas.Figure 2. parameters of Equations 1 and 2 (a and d. However. exploiting concept co-occurrence sentence: IF ‘‘airplane. IEEE MultiMedia 84 . These detectors have been created using the Viola and Jones algorithm (provided by OpenCV) and color-based pixel classification with a support vector machine. the spatiotemporal evolution of the appearance of concepts is determined using a tracker. consider the event. g) occur in a shot AND they ogy from the MediaMill detectors thesaurus. reTHEN that airplane is ‘‘taking off. In this example. ?sBox) ∧ Spatial : BoxIsInBox(?tag. The dataset for a time interval tas. sBox. The literal candidates considered by the algorithm are all the classes and properties defined in the ontology domain (for example.’’ ‘‘sky. the moving-object concept.8 Concept instances are associated with color and luminance histograms. ‘‘airplane take-off. ?gBox) ∧ Temporal : After(?tas. using a training set. that in our ontology is a hypernym of ‘‘airplane. s. the most promising literal is added. ?sBox)). the temporal properties used to encode Allen’s logic (for example. which contains the instances obtained by the semantic classifiers.

showing comparison of baseline with the proposed refinement approaches: co-occurrence only and the combination of temporal consistency with co-occurrence. we computed the temporal consistency refinement for all concepts (setting d ¼ 15 in Equation 2).14 0. like politics (for example.00 0.’’ We checked the capability of our system to detect composite concepts using semantic rules.’’ ‘‘airplane landing. considering those containing the LSCOM concepts ‘‘airplane take-off. in two video domains: broadcast news and surveillance.35 0. Co-occurrence + temporal consistency 0.01 0.37 percent.60 Concept Airplane Anchor Animal Arafat Basketball Bird Boat Building Bus Bush Jr Car Cartoon Chair Cloud Desert Entertainment Explosion Female Fire weapon Food Golf Government leader Grass House Indoor Baseline 0.inf.49 0.planestv.64 percent. as shown in Figure 3 (next page). using co-occurrence and temporal consistency.01 0. The mean average precision (MAP) computed for all the concepts improved by 4. ‘‘soccer.16 0.48 0.20 0.29 0.46 0.uk/rbf/CaviarDATA1/) surveillance videos.29 0.62 0.18 0.’’ and ‘‘boat’’). in front of it.04 0.27 0.27 0.’’ ‘‘airplane take-off. Table 1 shows the performance of the baseline detectors and the results of the two refinement techniques in terms of average precision. In the training phase.’’ and ‘‘airplane flying. After the co-occurrence refinement.05 0.84 0. see http://homepages. We refer to this set in the following as the Web dataset.’’ The dataset used for the news domain consists of 65 Trecvid 2005 videos and 100 videos containing airplane events taken from YouTube.25 0. Small improvements are obtained for detectors with high performance.06 0.46 0.’’ ‘‘airplane landing. is 17.07 0.’’ We inspected all the videos annotated with the ‘‘airplane’’ concept to select those that contain the ‘‘airplane taxiing’’ event because this concept is not used in LSCOM. obtained by the combination of the two techniques.12 0. ‘‘Arafat.14 0.12 0. Average precision of 50 concepts selected from the 101 MediaMill thesaurus.’’ ‘‘people. is 17.unifi.77 0.ac.84 0.com/ planestv. and computed the refined confidence score only for them (based on Equation 1. to determine when a person is in the shop.05 0. such as ‘‘anchor.’’ and ‘‘people marching. The concepts whose detectors have a low performance.60 0.62 The Trecvid videos were selected from the Trecvid development set.02 0.60 0.21 0.07 0.01 0.14 0.41 0. We report only the 50 concepts that obtained the largest variations. ed. with a ¼ 0. In the experiments.html).06 0.60 0. The videos used for the second domain are the Context Aware Vision using Image-based Active Recognition (Caviar.30 0.20 0.19 0.49 0. the scene framed was divided into four parts.01 0.11 0.’’ ‘‘Bush Jr.28 0.experiments on the training data. For the first domain.41 0.12 0.05 0. The use of temporal consistency greatly improves the performance of certain concepts that are related to topics often shown in consecutive shots within news videos.06 0.’’ are improved by use of co-occurrence that exploits the results of more robust detectors.06 0.21 0.02 0. Alice Video (see http://dailymotion. and Yahoo! video.it/dome. This set is available from http://www. PlanesTV (see http://www. we considered four events selected from the Large Scale Concept Ontology for Multimedia (LSCOM) events and activities:10 ‘‘airplane flying.’’ ‘‘basketball.00 0.05 0.20 0.’’ The other set of events is related to the video surveillance of shopping malls: ‘‘person enters a shop’’ and ‘‘person exits a shop. ‘‘desert.60 0.08 0.micc.09 0.08 0. like ‘‘airplane’’.35 0.05 0.07 0.21 0.82 0. October—December 2010 85 .’’ and ‘‘outdoor.1). Table 1.64 percent.08 0. The overall improvement of MAP.’’ ‘‘explosion. selected from the front view of the second set. These videos were filmed from a fixed-position camera that frames a mall shop and the area in front of the shop.’’ and ‘‘airplane taxiing.47 0.16 0.16 0. The overall improvement for all the concepts. it).’’ and ‘‘government leader’’) or sports (for example.48 0. automatically learned from concept instances.19 0. we identified the concepts that took advantage of the use of the co-occurrence relation.02 0.alice.62 Co-occurrence 0.

40 0. to learn the rules. especially when superimposed graphics and text covered the appearance of the airplane.32 0. (b) Example of person detector and tracking in a video sequence. Investigation of the cases in which the rules fail has shown that the main cause of failure is due to the performance of the sky and ground detectors. semantic. Data set Trecvid 2005 Trecvid 2005 Trecvid 2005 Trecvid 2005 Web Web Web Web Web + Trecvid 2005 Web + Trecvid 2005 Web + Trecvid 2005 Web + Trecvid 2005 Caviar Caviar Action/event Airplane flying Airplane take-off Airplane landing Airplane taxiing Airplane flying Airplane take-off Airplane landing Airplane taxiing Airplane flying Airplane take-off Airplane landing Airplane taxiing Person enters the shop Person leaves the shop Precision 0. supporting multiple ontologies (for different video domains).72 0.96 0. Precision and recall of actions and events for different datasets.69 0. The overall results 86 . Rich Internet applications can avoid the usual slow and synchronous loop for user interactions.69 0.76 0.it/vidivideo.94 0. which occurred mostly in Trecvid videos. leading to good performance from the person detector and the tracker. Table 2.89 for all the rules are extremely promising. The performance of the rules depends on the performance of the detectors and tracker. This fact is reflected by the different performance in the two datasets.93 0. The search engine is a Web application written in Java and executed in an Apache Tomcat application server.76 0. which provides integrated support for Booleantemporal. We used these rules to annotate the videos.94 0.55 0.81 0. In particular.78 0. Sirio search engine Browsing and searching video archives is performed by exploiting the ontology with a Web-based prototype system called Sirio (see http://www.52 0. as shown in Table 2. the fault was the airplane detector.93 0. (a) Caviar surveillance video dataset: view of the mall shop areas. This has allowed us to implement a visual-query mechanism that exhibits a look and feel approaching that of a desktop environment. The results of the recognition of videosurveillance actions show good performance in precision and recall.84 0.micc.94 0. In a few cases. The fixed camera and lighting conditions reduce the variability of the appearance of the observed events and objects. with the fast response that is expected by users.’’ This is due to the fact that the rules modeling those events are simpler.unifi. typical of Web environments that use only the HTML widgets available in standard browsers. The system is based on the richInternet-application paradigm. contact article authors to obtain access passwords). The performance of ‘‘airplane flying’’ and ‘‘airplane taxiing’’ is better than that of ‘‘airplane landing’’ and ‘‘airplane take-off.78 0. The performance of the rules mainly depends on the errors of the tracker. which sometimes happened with multiple persons’ trajectories overlapping.(a) (b) Figure 3. ontology reasoning services.92 0. evaluating the results in terms of precision and recall.60 0. these detectors are affected by the low quality of the images and the presence of superimposed graphics.81 0.78 0.92 0.95 Recall 0.96 0. and query-by-example queries. and W3C SPARQL Protocol and IEEE MultiMedia or in front of the showcase.78 0. The two datasets were divided using a three-fold approach.

designed for different types of users. (b) Sirio search interfaces: GUI query builder. so users can formulate their queries naturally. presented as a graph. visual prototypes for query-by-example. inspecting the annotated concept instances (that is. The user can select a concept from a tag cloud that shows the concepts with the largest number of instances or navigate the ontology following the concept relations. There is a GUI to build composite queries that include Boolean-temporal operators (based on Allen’s logic). These approaches are suitable for novice users because they don’t require them to specify complex queries or broadcast metadata. Figure 4a shows the browser interface. checking the instances of video clips annotated with the selected concept. as shown in Figure 4b. it’s possible to extend user queries through subsumption and meronymy. natural language search. broadcast dates.Figure 4. video clips). (a) Andromeda browsing interface: the ontology graph view is used to explore parts of the full ontology. The GUI is a Flash application. (a) (b) RDF Query Language (SPARQL) queries. The prototype provides different search modalities. Videos. the user navigates the ontology structure. written in Flex and executed in a client-side Flash Virtual Machine. Using the ontology relations and reasoning. and video metadata (such as broadcaster and program names. All the instances of a concept are visible as streaming video clips. and Google-like search. without being forced to select terms from a lexicon. are streamed using the Real-Time Messaging Protocol. The natural-language and Google-like interface require another form of query expansion. using synonym relations based on WordNet. To browse an archive. returned as query results. October—December 2010 87 . A free-text interface for Google-like searches and a natural language interface lets users compose queries with Boolean-temporal operators. and so on) for professional users.

Contact him at delbimbo@dsi. 2008. and video understanding based on statistical pattern recognition and ontologies. ‘‘Learning Logical Definitions from Relations.html. and machine learning. Marco Bertini is an assistant professor in the Department of Systems and Informatics at the University of Florence. Z. Multimedia.it. Liu et al. 5. pp. Contact him at bertini@dsi. References 1. 10. 421-430. pp. M. 1990. ‘‘The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. ‘‘Building a Comprehensive Ontology to Refine Video Concept Detection. Revision of LSCOM Event/Activity Annotations.. http://www.org/challenges/VOC/ voc2009/workshop/index. pp. 3.’’ IEEE Trans. 2007. ACM Multimedia. ACM Press. Kraai. Quinlan. ACM Press. Italy. 2006. self-calibration and 3D reconstruction. Columbia Univ. University of Florence. L. 239-266. Alberto Del Bimbo is a full professor of computer engineering at the University of Florence. 958-966. Multimedia.pascal-network. P. pp. multimedia databases. Acknowledgment This work is partially supported by the European Information Society Technologies VidiVideo Project (contract FP6-045547) and the IM3I Project (contract FP7-222267). His research interests include pattern recognition. pattern recognition. A.-H. Contact him at serra@dsi.unifi.unifi. Italy. Smeaton. His research interests include contentbased indexing and retrieval of videos and Semantic Web technologies. Image and Video Retrieval. Italy. Zha et al.’’ Proc. Serra has a laurea degree in computer engineering from the University of Florence. 8. no.Conclusions Field trials and tests with professional archivists of the search and browsing engine have shown the potential of the use of ontologies for annotation and retrieval. 9. ACM Press. A. J. 227-236.’’ Proc.unifi. 3. vol. 2007.. J. ACM Int’l Workshop Multimedia Information Retrieval. where he is also the director of the master in Multimedia Content Design. Springer Verlag. and W. 240-251. pp. no. C. Hauptmann. High-Level Feature Detection from Video in Trecvid: A 5-Year Retrospective of Achievements.F. A. 2. Advent technical report #221-2006-7. ‘‘Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study with Broadcast News. ACM Int’l Conf. ‘‘The PASCAL Visual Object Classes Challenge 2009 (VOC) Results. ‘‘Improving the Robustness IEEE MultiMedia of Particle Filter-Based Visual Trackers Using Online Parameter Adaptation. University of Florence.it. 5. 218-223. ‘‘Association and Temporal Rule Mining for Post-Filtering of Semantic Concept Detection in Video. vol. Italy. Contact him at ballan@dsi. IEEE Press. 10. ‘‘(Un)reliability of Video Concept Detection. Over. In addition. vol. MM 9. Ballan has a laurea degree in computer engineering from the University of Florence. computer vision.. 2008. 5. unifi. Giuseppe Serra is a PhD student at the Visual Information and Media Lab at the Media Integration and Communication Center. 85-94. Lamberto Ballan is a PhD student at the Visual Information and Media Lab at the Media Integration and Communication Center. IEEE Int’l Conf. His research interests include multimedia information retrieval.. Kennedy. Del Bimbo has a laurea degree in electronic engineering from the University of Florence. 88 . Bagdanov et al. 2006. 2007..D. we plan to investigate the use of fuzzy temporal Horn logic to overcome the expressivity limitations of SWRL.’’ 2009. Yang and A. 6. pp. Snoek et al. 2. and human—computer interaction. Bertini has a PhD in electronic engineering from the University of Florence. His research interests include multiple-view geometry.. Everingham et al. 2009.’’ IEEE Trans. Hauptmann et al. 4.’’ Proc. Advanced Video and Signal Based Surveillance.-J. K. pp. 7. Our future work will deal with the learning of rules that cope with uncertainty and using fuzzy ontology reasoning that can exploit detector confidence scores.’’ Machine Learning..R. no. DTO Challenge Workshop on Large Scale Concept Ontology for Multimedia.’’ Proc.it.it.

Sign up to vote on this title
UsefulNot useful