/  14
 
Beyond Nouns: Exploiting Prepositions andComparative Adjectives for Learning VisualClassifiers
Abhinav Gupta and Larry S. Davis
Department of Computer ScienceUniversity of Maryland, College Park
(agupta,lsd)@cs.umd.edu
Abstract.
Learning visual classifiers for object recognition from weaklylabeled data requires determining correspondence between image re-gions and semantic object classes. Most approaches use co-occurrenceof “nouns” and image features over large datasets to determine the cor-respondence, but many correspondence ambiguities remain. We furtherconstrain the correspondence problem by exploiting additional languageconstructs to improve the learning process from weakly labeled data.We consider both “prepositions” and “comparative adjectives” whichare used to express relationships between objects. If the models of suchrelationships can be determined, they help resolve correspondence ambi-guities. However, learning models of these relationships requires solvingthe correspondence problem. We simultaneously learn the visual fea-tures defining “nouns” and the differential visual features defining such“binary-relationships” using an EM-based approach.
1 Introduction
There has been recent interest in learning visual classifiers of objects from imageswith text captions. This involves establishing correspondence between image re-gions and semantic object classes named by the nouns in the text. There existsignificant ambiguities in correspondence of visual features and object classes.For example, figure 1 contains an image which has been annotated with the nouns“car” and “street”. It is difficult to determine which regions of the image corre-spond to which word unless additional images are available containing “street”but not “car” (and vice-versa). A wide range of automatic image annotationapproaches use such co-occurrence relationships to address the correspondenceproblem.Some words, however, almost always occur in fixed groups, which limits theutility of co-occurrence relationships, alone, to reduce ambiguities in correspon-dence. For example, since cars are typically found on streets, it is difficult toresolve the correspondence using co-occurrence relationships alone. While such
The authors would like to thank Kobus Barnard for providing the Corel-5k dataset.The authors would also like to acknowledge VACE for supporting the research.
 
2 Gupta and Davis
confusion is not a serious impediment for image annotation, it is a problem if localization is a goal
1
.We describe how to reduce ambiguities in correspondence by exploiting nat-ural relationships that exists between objects in an image. These relationshipscorrespond to language constructs such as “prepositions” (e.g. above, below) and“comparative adjectives” (e.g. brighter, smaller). If models for such relationshipswere known and images were annotated with them, then they would constrainthe correspondence problem and help resolve ambiguities. For example, in fig-ure 1, consider the binary relationship
on
(
car,street
). Using this relationship,we can trivially infer that the green region corresponds to “car” and the magentaregion corresponds to “street”.The size of the vocabulary of binary relationships is very small compared tothe vocabulary of nouns/objects. Therefore, human knowledge could be tappedto specify rules which can act as classifiers for such relationships (for example,a binary relationship
above
(
s
1
,p
1
)
s
1
.y < p
1
.y
). Alternatively, models canbe learned from annotated images. Learning such binary relationships from aweakly-labeled dataset would be “straight forward” if we had a solution to thecorrespondence problem at hand. This leads to a chicken-egg problem, wheremodels for the binary relationships are needed for solving the correspondenceproblem, and the solution of the correspondence problem is required for ac-quiring models of the binary relationships. We utilize an EM-based approachto simultaneously learn visual classifiers of objects and “differential” models of common prepositions and comparative binary relationships.The significance of the work is threefold: (1) It allows us to learn classifiers(i.e models) for a vocabulary of prepositions and comparative adjectives. Theseclassifiers are based on differential features extracted from pairs of regions in animage. (2) Simultaneous learning of nouns and relationships reduces correspon-dence ambiguity and leads to better learning performance. (3) Learning priorson relationships that exist between nouns constrains the annotation problem andleads to better labeling and localization performance on the test dataset.
2 Related Work
Our work is clearly related to prior work on relating text captions and imagefeatures for automatic image annotation [2–4]. Many learning approaches havebeen used for annotating images which include translation models [5], statisticalmodels [2,6], classification approaches [7–9] and relevance language models [10,11].Classification based approaches build classifiers without solving the corre-spondence problem. These classifiers are learned on positive and negative exam-ples generated from captions. Relevance language models annotate a test imageby finding similar images in the training dataset and using the annotation wordsshared by them.
1
It has also been argued [1] that for accurate retrieval, understanding image semantics(spatial localization) is critical
 
Beyond Nouns 3
Fig.1.
An example of how our approach can resolve ambiguities. In the case of co-occurrence based approaches, it is hard to correspond the magenta/green regions to‘car’/‘street’. ‘Bear’, ‘water’ and ‘field’ are easy to correspond. However, the correctcorrespondences of ‘bear’ and ‘field’ can be used to acquire a model for the relation‘on’. We can then use that model to classify the green region as belonging to ‘car’ andthe magenta one to ‘street’, since only this assignment satisfies the binary relationship.
Statistical approaches model the joint distribution of nouns and image fea-tures. These approaches use co-occurrence counts between nouns and imagefeatures to predict the annotation of a test image [12,13]. Barnard et. al [6]presented a generative model for image annotation that induces hierarchicalstructure from the co-occurrence data. Srikanth et. al [14] proposed an approachto use the hierarchy induced by WordNet for image annotation. Duygulu et.al [5] modeled the problem as a standard machine translation problem. The im-age is assumed to be a collection of blobs (vocabulary of image features) andthe problem becomes analogous to learning a lexicon from aligned bi-text. Otherapproaches such as [15] also model word to word correlations where predictionof one word induces a prior on prediction of other words.All these approaches use co-occurrence relationships between nouns and im-age features; but they cannot, generally, resolve all correspondence ambiguities.They do not utilize other constructs from natural language and speech taggingapproaches [16,17]. As a trivial example, given the annotation “pink flower”and a model of the adjective “pink”, one would expect a dramatic reduction inthe set of regions that would be classified as a flower in such an image. Otherlanguage constructs, such as “prepositions” or “comparative adjectives”, whichexpress relationships between two or more objects in the image, can also resolveambiguities.Our goal is to learn models, in the form of classifiers, for such language con-structs. Ferrari et. al [18] presented an approach to learn visual attributes froma training dataset of positive and negative images using a generative model.However, collecting a dataset for all such visual attributes is cumbersome. Ide-ally we would like to use the original training dataset with captions to learn

Share & Embed

More from this user

Add a Comment

Characters: ...

Edgepoint130left a comment

Big List of Free Downloadable Ebooks -->http://is.gd/aDAPD

Edgepoint130left a comment

Huge Collection of Free Ebooks Here -> http://kl.am/8yTy