You are on page 1of 17

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO.

2, FEBRUARY 2009 161

Learning Image-Text Associations


Tao Jiang and Ah-Hwee Tan, Senior Member, IEEE

Abstract—Web information fusion can be defined as the problem of collating and tracking information related to specific topics on the
World Wide Web. Whereas most existing work on web information fusion has focused on text-based multidocument summarization,
this paper concerns the topic of image and text association, a cornerstone of cross-media web information fusion. Specifically, we
present two learning methods for discovering the underlying associations between images and texts based on small training data sets.
The first method based on vague transformation measures the information similarity between the visual features and the textual
features through a set of predefined domain-specific information categories. Another method uses a neural network to learn direct
mapping between the visual and textual features by automatically and incrementally summarizing the associated features into a set of
information templates. Despite their distinct approaches, our experimental results on a terrorist domain document set show that both
methods are capable of learning associations between images and texts from a small training data set.

Index Terms—Data mining, multimedia data mining, image-text association mining.

1 INTRODUCTION

T HE diverse and distributed nature of the information


published on the World Wide Web has made it difficult
to collate and track information related to specific topics.
Note that learning image-text association is similar to
the task of automatic annotation [5] but has important
differences. Whereas image annotation concerns annotat-
Although web search engines have reduced information ing images using a set of predefined keywords, image-text
overloading to a certain extent, the information in the association links images to free text segments in natural
retrieved documents still contains a lot of redundancy. language. The methods for image annotation are thus not
Techniques are needed in web information fusion, involving directly applicable to the problem of identifying image-text
filtering of irrelevant and redundant information, collating associations.
of information according to themes, and generation of A key issue of using a machine learning approach to
coherent presentation. As a commonly used technique for
image-text associations is the lack of large training data sets.
information fusion, document summarization has been
However, learning from small training data sets poses the
discussed in a large body of literatures. Most document
new challenge of handling implicit associations. Referring
summarization methods, however, focus on summarizing
text documents [1], [2], [3]. As an increasing amount of to Fig. 2, two associated image-text pairs (I-T pairs) not only
nontext content, namely images, video, and sound, is share partial visual (smokes and fires) and textual features
becoming available on the web, collating and summarizing (“attack”) but also have different visual and textual
multimedia information has posed a key challenge in the contents. As the two I-T pairs are actually on similar topics
web information fusion. (describing scenes of terror attacks), the distinct parts, such
Existing literatures of hypermedia authoring and cross- as the visual content (“black smoke” which can be
document text summarization [4], [1] have suggested that represented using low-level color and texture features) of
understanding the interrelation between information blocks the image in I-T pair 2 and the term “Blazing” (underlined)
is essential in information fusion and summarization. In this in I-T pair 1, could be potentially associated. We call such
paper, we focus on the problem of learning to identify useful associations, which convey the information patterns
relations between multimedia components, in particular, in the domain but are not represented by the training data
image and text associations, for supporting cross-media web set, implicit associations. We can imagine that the smaller the
information fusion. An image-text association refers to a pair data set is, the more useful association patterns cannot be
of image and text segment that is semantically related to covered by the data samples and the more implicit
each other in a web page. A sample image-text association is associations exist. Therefore, we need an algorithm which
shown in Fig. 1. Identifying such associations enables one to is capable of generalizing the data samples in small data set
provide a coherent multimedia summarization of the web to induce the missing implicit associations.
documents. In this paper, we present two methods, following the
multilingual retrieval paradigm [6], [7] for learning image-
text associations. The first method is a textual-visual
. T. Jiang is with ecPresence Technology Pte Ltd., 18 Boon Lay Way,
#07-97 TradeHub21, Singapore 609966. E-mail: jian0006@ntu.edu.sg.
similarity model [8] with the use of a statistical vague
. A.-H. Tan is with the School of Computer Engineering, Nanyang transformation technique for extracting associations be-
Technological University, 50 Nanyang Avenue, Singapore 639798. tween images and texts. As vague transformation typically
E-mail: asahtan@ntu.edu.sg. requires large training data sets and tends to be computa-
Manuscript received 24 May 2007; revised 17 Jan. 2008; accepted 25 Jan. tionally intensive, we employ a set of domain-specific
2008; published online 15 July 2008. information categories for indirectly matching the textual
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-2007-05-0233. and visual information at the semantic level. With a small
Digital Object Identifier no. 10.1109/TKDE.2008.150. number of domain information categories, the training data
1041-4347/09/$25.00 ß 2009 IEEE Published by the IEEE Computer Society
162 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

Fig. 2. An illustration of implicit associations between visual and textual


features.

methods consistently enhance the overall performance in


image-text associations.
Fig. 1. An associated text and image pair. The rest of this paper is organized as follows: Section 2
introduces related work. Section 3 describes our approaches
sets for vague transformation need not be large and the of using vague transformation and fusion ART for learning
computation cost can be minimized. In addition, as each image-text associations. Section 4 reports our experiments
information category summarizes a set of data samples, based on the terrorist domain data set. Concluding remarks
implicit image-text associations can be captured (see are given in Section 5.
Section 3.3 for more details). As information categories
may be inconsistently embedded in the visual and textual 2 RELATED WORK
information spaces, we further employ a visual space
2.1 Automatic Hypermedia Authoring for
projection method to transform the visual feature space into Multimedia Content Summarization
a new space, in which the similarities between the
Automatic hypermedia authoring aims to automatically
information categories are comparable to those in the
organize related multimedia components and generate
textual information space. Our experiments show that
coherent hypermedia representations [13], [14]. However,
employing visual space projection can further improve the
most of automatic hypermedia authoring systems assume
performance of the vague transformation.
that multimedia objects used for authoring are well
Considering that indirectly matching the visual and
textual information using an intermediate tier of informa- annotated. Usually, the annotation tasks are done manu-
tion categories may result in a loss of information, we ally with the assistance of annotation tools. Manual
develop another method based on an associative neural annotation can be very tedious and time consuming.
network model called fusion Adaptive Resonance Theory Identifying links (relations) within information is an
(ART) [9], a direct generalization of ART model [10] from important subtask for automatic hypermedia authoring.
one feature field to multiple pattern channels. Even with In [13], authors defined a set of rules for determining the
relatively small data sets where implicit associations tend to relations between multimedia objects. For example, a rule
appear, fusion ART is able to automatically learn a set of can be like “if there is an image A whose subject is
prototypical image-text pairs and therefore can achieve a equivalent to the title of a text document B, the relation
good performance. This is consistent with the prior findings between the image and the text document is depictðA; BÞ.”
that ART models can efficiently learn useful patterns from However, this method requires that the multimedia objects
small training data sets for text categorization [11]. In (the image A and the text document B in the example)
addition, fusion ART can directly learn the associations have well annotated metadata based on which the rules
between the features in the visual and textual channels can be defined and applied. As most of web contents are
without using a fixed set of information categories. There- raw multimedia data without any well-structured annota-
fore, the information loss might be minimal. tion, existing techniques for automatic hypermedia author-
The two proposed models have been evaluated on a ing cannot be directly applied to fusion or summarization
multimedia document set in the terrorist domain collected of the multimedia contents on the web.
from the BBC and CNN news web sites. The experi- For annotating images with semantic concepts or key-
mental results show that both vague transformation and words, various techniques have been proposed. These
fusion ART outperform a baseline method based on an techniques can mainly be classified into three categories,
existing state-of-the-art image annotation method known including image classification, image association rules, and
as Cross-Media Relevance Model (CMRM) [12] in learn- statistical-based image annotations, reviewed in the follow-
ing image-text associations from a small training data set. ing sections.
We have also combined the proposed methods with a
pure text-matching-based method matching image cap- 2.2 Image Classification for Image Annotation
tions with text segments. We find that though the text- Classification is a data mining technique used to predict
based method is fairly reliable, the proposed cross-media group membership for data instances [15]. For example,
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 163

classification can be used to predict whether a person is [20], each pixel is treated as a transaction, while the set
infected by dengue disease. In the multimedia domain, ranges of the pixel’s spectral bands and auxiliary concept
classification has been used for annotation purposes, i.e., labels (e.g., crop yields) are considered as items. Then, pixel-
predicting whether certain semantic concepts appear in level association rules of the form “Band 1 in the range [a, b]
media objects. and Band 2 in the range [c, d] are likely to imply crop yield
An early work in this area is to classify the indoor- E.” However, Tesic et al. [21] pointed out that using
outdoor scenarios of the video frames [16]. In this work, a individual pixel as transaction may lose the context
video frame or image is modeled as sequences of image information of surrounding locations which are usually
segments, each of which is represented by a set of color very useful for determine the image semantics. This
histograms. A group of 1D hidden Markov models (HMMs) motivated them to use image and rectangular image regions
are first trained to capture the patterns of image segment as transactions and items. Image regions are first repre-
sequences and then used to predict the indoor-outdoor sented using Gabor texture features and then clustered using
categories of new images. self-organizing map (SOM) [22] and LVQ to form a visual
Recently, many efforts aim to classify and annotate thesaurus. The thesaurus is used to provide the perceptual
images with more concrete concepts. In [17], a decision tree labeling of the image regions. Then, the first- and second-
is used to learn the classification rules that associate color order spatial predicate associations among regions are
features, including global color histograms and local tabulated in spatial event cubes (SECs), based on which
dominant colors, with semantic concepts such as sunset, higher-order association rules are determined using Apriori
marine, arid images, and nocturne. In [18], a learning vector algorithm [23]. For example, a third-order item set is in the
quantization (LVQ)-based neural network is used to classify form of “If a region with label uj is a right neighbor of a ui
images into outdoor-domain concepts, such as sky, road, region, it is likely that there is a uk region on the right side of
and forest. Image features are extracted via Haar wavelet uj .” More recently, Teredesai et al. [24] proposed a frame-
transformation. Another approach using vector quantiza- work to learn multirelational visual-textual associations for
tion for image classification was presented in [19]. In this image annotation. Within this framework, keywords and
method, images are divided into a number of image blocks. image visual features (including color saliency maps,
Visual information of the image blocks is represented using orientation, and intensity contrast maps) are extracted and
HSV colors. For each image category, a concept-specific stored separately in relational tables in a database. Then, a
codebook is extracted based on training images. Each FP-Growth algorithm [25] is used for extracting multi-
codebook contains a set of codewords, i.e., representative relational associations between the visual features and
color feature vectors for the concept. New image classifica- keywords from the database tables. The extracted rules,
tion is performed based on finding most similar codewords such as “4 Yellow ! EARTH, GROUND,” can be subse-
for its image blocks. The new image will be assigned to the quently used for annotating new images.
category whose codebook provides the most number of the In [26], the author proposed a method to use associations
similar codewords. of visual features to discriminate high-level semantic
At current stage, image classification mainly works for concepts. To avoid combinatory explosion during the
discriminating images into a relevant small set of categories association extraction, a clustering method is used to
that are visually separable. It is not suitable for linking organize the large number of color and texture features into
images with free texts, in which tens of thousands of the a visual dictionary where similar visual features are grouped
different terms exists. On one hand, the concepts represented together. Then, each image can be represented using a
by those terms, such as “sea” and “sky,” may not be easily relevant small set of representative visual feature groups.
separable be the visual features. On the other hand, training Then, for each specific image category (i.e., semantic
classifier for each of these terms would need a large amount concept), a set of associations is extracted as a visual
of training data, which is usually unavailable, and the knowledge base featuring the image category. When a new
training process would also be extremely time consuming. image comes in, it considered related to an image category if
it globally verifies the associations associated with that
2.3 Learning Association Rules between Image image category. In this method, associations were only
Content and Semantic Concepts learned among visual feature groups, not between visual
Association rule mining (ARM) is originally used for features and semantic concepts or keywords.
discovering association patterns in transaction databases. Due to the pattern combinatory explosion problem, the
An association rule is an implication of the form X ) Y , performance of learning association rule is highly dependent
where X, Y  I (called item sets or patterns) and X \ Y ¼ ;. on the number of items (e.g., image features and the number
In the domain of market-basket analysis, such an association of lexical terms). Although existing methods that learning
rule indicates that the customers who buy the set of items X association rules between image features and high-level
are also likely to buy the set of items Y . Mining association semantic concepts are applicable for small set of concepts/
rule from multimedia data is usually a straightforward keywords, they may encounter problems when mining
extension of ARM in transaction databases. association rules on images and free texts where a large
In this area, many efforts are conducted to extract the amount of different terms exist. This may not only cause
association between low-level visual features and high-level significant increasing in the learning times but also result in
semantic concepts for image annotation. Ding et al. [20] a great number of association rules which may also lower the
presented a pixel-based approach to deduce associations performance during the process of annotating images as
between pixels’ spectral features and semantic concepts. In more rules need to be considered and consolidated.
164 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

2.4 Image Annotation Using Statistical Model 3 IDENTIFYING IMAGE-TEXT ASSOCIATIONS


There have been many prior work on image annotation 3.1 A Similarity-Based Model for Discovering
using statistical modeling approaches [5]. Barnard and Image-Text Associations
Forsyth [27] proposed a method to encode the correlations The task of identifying image-text associations can be cast
of words and visual features using a co-occurrence model. into an information retrieval (IR) problem. Within a web
The learned model can be used to predict words for images document d containing images and text segments, we treat
based on the observed visual features. Another related work each image I in d as a query to find a text segment T S that is
[12] by Jeon et al presented a CMRM for annotating images most semantically related to I, i.e., T S is most similar to I
by estimating the conditional probability of observing a according to a predefined similarity measure function among
term w given the observed visual content of an image. In all text segments in d. In many cases, an image caption can be
Duygulu et al. [28], showed that image annotation could be obtained along with an image in a web document. Therefore,
considered as a machine translation problem to find the we suppose that each image I can be represented as a visual
correspondence between the keywords and the image feature vector, denoted as vI ¼ ðvI1 ; vI2 ; . . . ; vIm Þ, together with
regions. Experiments conducted using IBM translation a textual feature vector representing the image caption,
models illustrated promising results. In Li and Wang [29], denoted as tI ¼ ðtI1 ; tI2 ; . . . ; tIn Þ. For calculating the similarity
presented an approach that trained hundreds of 2D multi- between an image I and a text segment T S represented by a
resolution HMMs (2D MHMMs), each of which encoded the textual feature vector tTS ¼ ðtT1 S ; tT2 S ; . . . ; tTn S Þ, we need to
statistics of visual features in the images related to a specific define a similarity measure simd ð< vI ; tI >; tTS Þ.
concept category. For annotating an image, the concept of To simplify the problem, we assume that, given an
which the 2D MHMMs generates that image with the image I and a text segment T S, the similarity between vI and
highest probability will be chosen. In Xing et al. [30], tTS and the similarity between tI and tTS are independent.
propose a dual-wing harmonium (DWH) model for learn- Therefore, we can calculate simd ð< vI ; tI >; tTS Þ with the use
ing the statistical relationships between the semantic of a linear mixture model as follows:
categories and images and texts. The DWH is an extension  
of the original basic harmonium which models a random simd < vI ; tI >; tTS ¼   simttd ðtI ; tTS Þ
ð1Þ
field represented by the joint probabilities of a set of input þ ð1  Þ  simvt I TS
d ðv ; t Þ:
variables (image or text features) and a set of hidden
variables (semantic categories). As the harmonium model is In the subsequent sections, we first introduce our
undirected, the inferencing can be in two directions. method used for measuring the textual similarities between
Therefore, DWH model can be used for annotating an image captions and text segments (Section 3.2). Then, two
cross-media similarity measures based on vague transfor-
image by first inferencing the semantic categories based on
mation (Section 3.3) and fusion ART (Section 3.4) are
its visual features and then predicting the text keywords
presented.
based on the inferred semantic categories.
A major deficiency of the existing machine learning and 3.2 Text-Based Similarity Measure
statistic-based automatic multimedia annotation methods Matching between text-based features is relatively straight-
is that they usually assign a fixed number of keywords or forward. We use the cosine distance
concepts to a media object. Therefore, there will inevitably
Pn I T S
be some media objects assigned with unnecessary annota- tk  tk
simttd ðtI ; tTS Þ ¼ k¼1 ð2Þ
tions and some others assigned with insufficient annota- ktI kktT S k
tions. In addition, image annotation typically uses a
relatively small set of domain-specific key terms (class to measure the similarity between the textual features of an
image caption and a text segment. The cosine measure is
labels, concepts, or categories) for labeling the images. Our
used as it has been proven to be insensitive to the length of
task of discovering semantic image-text associations from
text documents.
Web documents however do not assume such a set of
selected domain specific key terms. In fact, although our 3.3 Vague-Transformation-Based Cross-Media
research focuses on image and text contents from the Similarity Measure
terrorist domain, both domain-specific and general-domain Measuring the similarity between visual and textual
information (e.g., general-domain terms “people” or features is similar to the task of measuring relevance of
“bus”) are incorporated in our learning paradigm. With documents in the field of multilingual retrieval for selecting
the above considerations, it is clear that the existing image documents in one language based on queries expressed in
annotation methods are not directly applicable to the task another [6]. For multilingual retrieval, transformations are
of image-text association. Furthermore, model evolution is usually needed for bridging the gap between different
not well supported by the above methods. Specifically, representation schemes based on different terminologies.
after the machine learning and statistic models are trained, An open problem is that there is usually a basic
they are difficult to update. Moreover, the above methods distinction between the vocabularies of different languages,
usually treat the semantic concepts in media objects as i.e., word senses may not be organized with words in the
separate individuals without considering relationships same way in different languages. Therefore, an exact
between the concepts for multimedia content representa- mapping from one language to another language may not
tion and annotation. exist. This is known as the vague problem [7], which is even
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 165

Fig. 3. An illustration of bilingual vague transformation adopted from [7].


Fig. 4. An illustration of cross-media transformation with information
more challenging in visual-textual transformation. For bottleneck.
individual image regions, they can hardly convey any
meaningful semantics without considering their contexts. identifying image-text associations. However, this requires
For example, a yellow region can be a petal of a flower or a large set of training data which is usually unavailable.
can be a part of a flame. On the contrary, words in natural Therefore, we consider incorporate high-level information
languages usually have a more precise meaning. In most categories to summarize and generalize useful information.
cases, image regions can hardly be directly and precisely
mapped to words because of the ambiguity. As vague 3.3.2 Single-Direction Vague Transformation for
transformations [31], [7] have been proven useful in the Cross-Media Information Retrieval
field of multilingual retrieval, in this paper, we borrow the In our cross-media IR model described in Section 3.1, the
idea from statistical vague transformation methods for images are considered as the queries for retrieving the
cross-media similarity measure. relevant text segments. Referring to the field of multilingual
retrieval, transformation of the queries into the representa-
3.3.1 Statistical Vague Transformation in Multilingual tion schema of the target document collection seems to be
Retrieval more efficient [7]. Therefore, we first investigate a single-
A statistical vague transformation method is first presented direction vague transformation of the visual information into
in [31] for measuring the similarity of the terms belonging the textual information space.
to two languages. In this method, each term t is represented A drawback of the existing methods for vague transfor-
by a feature vector in the training document space, in which mation, such as those presented in [32] and [31], is that they
each training document represents a feature dimension and require a large training set to build multilingual the-
sauruses. In addition, as the construction of the multilingual
the feature value, known as an association factor, is the
thesauruses requires calculating an association factor for
conditional probability that given the term t, t belongs to
each pair of words picked from two languages, it may be
this training document, i.e. computationally formidable. To overcome these limitations,
hðt; dÞ we introduce an intermediate layer for the transformation.
zðt; dÞ ¼ ; ð3Þ This intermediate layer is a set of domain information
fðtÞ
categories that can be seen as another vocabulary of a
where hðt; dÞ is the number of times that the term t appears smaller size for describing domain information. For exam-
in document d; and fðtÞ is the total number of times that the ple, in the terror attack domain, information categories may
term t appears. include Attack Details, Impacts, and Victims. Therefore, our
As each document also exists in the corpus in the second cross-media transformation is in fact a concatenation of two
language, a document feature vector representing a term in subtransformations, i.e., from visual feature space to
the first language can be used to calculate the correspond- domain information categories and then to textual feature
space (see Fig. 4). This is actually known as the information
ing term vector for the second language. As a result, given a
bottleneck method [33], which has been used for information
query term in one language for multilingual retrieval, it is
generalization, clustering and retrieval [34]. For each
not translated into a single word using a dictionary, but subtransformation, as the number of domain information
rather transformed into a weighted vector of terms in the categories is small, the size of the training data set for
second language representing the query term in the thesaurus construction needs not be large and the construc-
document collection (Fig. 3). tion cost can be affordable. As discussed in Section 2.4, for
Suppose that there is a set of identified associated image- an associated pair of image and text, their contents may not
text pairs. We can treat such an image-text pair as a be an exact match. However, we believe that they can
document represented in both visual and textual languages. always be mapped in terms of general domain information
Then, statistical vague transformation can be applied for categories.
166 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

Based on the above observation, we build two the- domain information category Attack Details, and two text
sauruses in the form of transformation matrices, each of
segments T S1 and T S2 , represented by the textual feature
which corresponds to a subtransformation. Suppose the
visterm space V is of m dimensions, the textual feature vectors tTS1 and tTS2 , belonging to the categories of Attack
space T is of n dimensions, and the cardinality of the set of Details and Victims, respectively. If the two categories Attack
high-level domain information categories C is l. Based on V,
Details and Victims share many common words (such as kill,
T , and C, we define the two following transformation
matrices: die, and injure), the vague transformation result of vI might be
0 VC 1 similar to both tTS1 and tTS2 . To reduce the influence of
m11 mVC 12 : : mVC 1l
B mVC mVC : : mVC C common terms on different categories and utilize the
B 21 22 2l C
M VC ¼ B
B : : : CC ð4Þ strength of the distinct words, we consider another transfor-
@ : : : A mation from the word space to the visterm space. Similarly,
mVC mVC : : mVC
m1 m2 ml
we define a pair of transformation matrices M T C ¼ fmTkjC gnl
and and M CV ¼ fmCV lm
, where mTkjC ¼ P ðcj jtmk Þ 
Nðcj ;tmk Þ
ji g Nðtmk Þ and
0 1 Nðvi ;cj Þ
mCT
11 mCT
12 : : : mCT
1n
mCV
ji ¼ P ðvi jcj Þ  Nðcj Þ (i ¼ 1; 2; . . . ; m, j ¼ 1; 2; . . . ; l, and
B mCT mCT : : : C mCT
M CT B 21
¼@ 22 C; 2n ð5Þ k ¼ 1; 2; . . . ; n). Here, Nðtmk Þ is the number of text segments
: : : A
mCT mCT : : : mCT containing the term tmk ; Nðcj ; tmk Þ, Nðvi ; cj Þ, and Nðcj Þ are
l1 l2 ln
same as those in (6) and (7). Then, the similarity between a
where mVC ij represents the association factor between the
visual feature vi and the information category cj , and mCT text segment represented by the textual feature vector tTS
jk
represents the association factor between the information and the visual content of an image vI can be defined as
category cj and the textual feature tk . In our current system,
mVC CT ðtTS ÞT M T C M CV vI
ij and mjk are calculated by simT V ðtTS ; vI Þ ¼  T  : ð9Þ
 TS T 
 ðt Þ M T C M CV kvI k
Nðvi ; cj Þ  
mVC
ij ¼ P ðcj jvi Þ  ð6Þ
Nðvi Þ
Finally, we can define a cross-media similarity measure
and based on the dual-direction transformation which is the
arithmetic mean of simVT ðvI ; tTS Þ and simT V ðtTS ; vI Þ
Nðcj ; tmk Þ
mCT
jk ¼ P ðtmk jcj Þ  ; ð7Þ given by
Nðcj Þ
where Nðvi Þ is the number of images containing the visual simVT ðvI ; tTS Þ þ simT V ðtTS ; vI Þ
simvt I TS
d ðv ; t Þ¼ : ð10Þ
feature vi , Nðvi ; cj Þ is the number of images containing vi 2
and belonging to the information category cj , Nðcj Þ is the 3.3.4 Vague Transformation with Visual Space
number of text segments belonging to the category cj , and
Projection
Nðcj ; tmk Þ is the number of text segments belonging to cj
A problem in the reversed cross-media (text-to-visual)
and containing the textual feature (term) tmk .
transformation in dual-direction transformation is that the
For calculating mVC CT
ij and mjk in (6) and (7), we build a
intermediate layer, i.e., information categories, may be
training data set of texts and images that have been
embedded differently in the textual feature space and the
manually classified into domain information categories
visterm space. For example, in Fig. 5, two information
(see Section 4 for details).
categories, “Terrorist Suspects” and “Victims,” may contain
Based on (4) and (5), we can define the similarity
quite different text descriptions but somewhat similar
between the visual part of an image vI and a text segment
images, e.g., human faces. Suppose we translate a term
represented by tTS as ðvI ÞT M VC M CT tTS . For embedding
vector of a text segment into the visual feature space using a
into (1), we use its normalized form
cross-media transformation. Transforming a term vector in
ðvI ÞT M VC M CT tTS “Victims” category or a term vector in “Terrorist Suspects”
simVT ðvI ; tTS Þ ¼ 
 T 
 : ð8Þ category may result in a similar visual feature vector as
 ðvI ÞT M VC M CT ktTS k these two information categories have similar representa-
 
tion in the visual feature space. In such a case when there
3.3.3 Dual-Direction Vague Transformation are text segments belonging to the two categories in the
same web page, we may not be able to select a proper text
Equation (8) calculates the cross-media similarity using a segment for an image about “Terrorist Suspects” or
single-direction transformation from visual feature space to “Victims” based on the text-to-visual vague transformation.
textual feature space. However, it may still have the vague For solving this problem, we need to consolidate the
differences in the similarities between the information
problem. For example, suppose there is a picture I, categories in the textual feature space and the visual feature
represented by the visual feature vector vI , belonging to a space. We assume that text can more precisely represent the
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 167

Fig. 6. Bipartite graph of classified text segments and information


categories.
X
st ðci ; cj Þ ¼ wtðci ; T Sk Þ  wtðcj ; T Sk Þ; ð12Þ
Fig. 5. Visual space projection. T Sk 2ci \cj

semantic similarities of the information categories. There- where


fore, we project the visual feature space into a new space in
1=jT Sk j
which the information category similarities defined in the wtðci ; T Sk Þ ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
P ; ð13Þ
2
textual feature space can be preserved. T Sl 2ci ð 1=jT S l jÞ
We treat each row in the transformation matrix M CV in
(9) as a visual feature representation of an information where jT Sk j and jT Sl j represent the sizes of the text
category. We use a m  m matrix X to transform the segments T Sk and T Sl , respectively.
visterm space into a new space, wherein the similarity . Using category-to-text transformation matrix. We
matrix of information categories can be represented as also attempt another approach that utilizes the
T
ðM CV XT ÞðXM CV Þ ¼ fsv ðci ; cj Þgll , where sv ðci ; cj Þ repre- category-to-text vague transformation matrix M CT
sents the similarity between the information categories ci in (5) for calculating the similarity matrix of the
and cj . In addition, we suppose D ¼ fst ðci ; cj Þgll is the information categories. We treat each row in M CT as
similarity matrix of the information categories in the textual a textual feature space representation of an informa-
feature space, where st ðci ; cj Þ is the similarity between the tion category. Then, we calculate the similarity
information categories ci and cj . Our objective is to matrix D in the textual feature space by
minimize the differences between the information category T
similarities in the new space and the textual feature space. D ¼ M CT  M CT : ð14Þ
This can be formulated as an optimization problem of
With the similarity matrix D of the information
 
 T 2 categories calculated above, the visual space projection
minX D  M CV XT XM CV  : ð11Þ
matrix X can be solved based on (11). By incorporating X,
The similarity matrix D in the textual feature space is (9) can be redefined as
calculated based on our training data set, in which texts and
ðtTS ÞT M T C M CV XT XvI
images are manually classified into the domain specific simT V ðtTS ; vI Þ ¼ 
 TS T T 
 : ð15Þ
information categories. Currently, two different methods  ðt Þ M T C M CV X T X kvI k
 
have been explored for this purpose as follows:

. Using bipartite graph of the classified text Using this refined equation in the dual-direction trans-
segments. For constructing the similarity matrix of formation, we expect that the performance of discovering
the information categories in the textual feature the image-text associations can be improved. However,
space, we utilize the bipartite graph of the classified solving (11) is a nonlinear optimization problem of a
text segments and the information categories as very large scale because X is a m  m matrix, i.e., there
shown in Fig. 6. are m2 variables to tune. Fortunately, from (15) we can
The underlying idea is that the more text segments see that we do not need to get the exact matrix X.
that two information categories share, the more Instead, we only need to solve a simple linear equation
similar they are. We borrow the similarity measure T
D ¼ M CV X T XM CV to obtain a matrix
used in [31] for calculating the similarity between
information categories which is originally used for 1 1 T
A ¼ XT X ¼ M CV DM CV ; ð16Þ
calculating term similarity based on bipartite graphs
CV 1
of terms and text documents. Therefore, any st ðci ; cj Þ where M is the pseudo-inverse of the transformation
in D can be calculated as matrix M CV .
168 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

similar “natural frequencies,” strong resonance occurs.


The strength of the resonance can be computed by a
resonance function.
Given a set of multimedia information objects (associated
image-text pairs) for training, the fusion ART learns a set of
multimedia information object templates, or object templates in
short. Each object template, recorded by a category node in
the F2c field, represents a group of information objects that
have similar “natural frequencies” and can strongly resonate
with each other. Initially, no object template (category node)
Fig. 7. Fusion ART for learning image-text associations. exists in the F2c field. When information objects (associated
image-text pairs) are presented one at a time to the F1c1 and
Then, we can substitute X T X in (15) with A for F1c2 fields, the object templates are incrementally captured
calculating simT V ðts; iv Þ, i.e., the similarity between a text and encoded in the F2c field. The process of learning object
segment represented by tTS and the visual content of an templates using fusion ART can be summarized in the
image represented by vI . following stages:
3.4 Fusion-ART-Based Cross-Media Similarity 1. Code activation. A bottom-up propagation process
Measure first takes place when an information object is
In the previous section, we present the method for presented to the F1c1 and F1c2 fields. For each category
measuring the similarity between visual and textual node (multimedia information object template) in the
information using a vague transformation technique. For F2c field, a resonance score is computed using an ART
the vague transformation technique to work on small data choice function. The ART choice function varies with
sets, we employ an intermediate layer of information respect to different ART models, including ART 1
categories to map the visual and textual information in an [10], ART 2 [37], ART 2-A [38], and fuzzy ART [39].
indirect manner. A deficiency of this method is that for We adopt the ART 2 choice function based on the
training the transformation matrix, additional work on cosine similarity which has been proven to be
manual categorization of images and text segments is
effective for measuring vector similarities and
required. In addition, matching visual and textual informa-
insensitive to the vector lengths. Given a pair of
tion based on a small number of information categories may
visual and textual information feature vectors v and
cause a loss of detailed information. Therefore, it would be
t, for each F2c category node j with a visual
appealing if we can find a method that learns direct
associations between the visual and textual information. In information template vcj and a textual information
this section, we present a method based on the fusion ART template tcj , the resonance score Tj is calculated by
network, a generalization of ART model [10], for discovering v  vcj t  tc j
direct mappings between visual and textual features. Tj ¼  þ ð1  Þ ; ð17Þ
c
kvkkv k j ktkktcj k
3.4.1 A Similarity Measure Based on Adaptive where  is the factor for weighing the visual and
Resonance Theory textual information channels. For giving the equal
As discussed, small data set does not have enough data weight to the visual and textual information chan-
samples, and thus many useful association patterns may vvcj
nels, we set the  value to 0.5. kvkkv ttcj
and ktkkt are
cj cj
appear in the data set implicitly. Those implicit associations k k
actually the ART 2 choice function for visual and
may not be reflected in individual data samples but can be
extracted by summarizing a group of data samples. textual information channels.
For learning implicit associations, we employ a method 2. Code competition. A code competition process follows
based on the fusion ART network. Fusion ART can be under which the F2c node j with the highest
seen as multiple overlapping ART models [35], each of resonance score is identified.
which corresponds to an individual information channels. 3. Template matching. Before the node j can be used for
Fig. 7 shows a two-channel fusion ART (also known as learning, a template matching process checks that for
Adaptive Resonance Associative Map [36]) for learning each (visual and text) information channel, the object
associations between images and texts. The model consists template of node j is sufficiently similar to the input
of a F2c field, and two input pattern fields, namely F1c1 for object with respect to the norm of the input object.
representing visual information channel of the images and The similarity value is computed by using the ART 2
F1c2 for representing textual information channel of text match function [37]. The ART 2 match function is
segments. Such a fusion ART network can be seen as a defined as
simulation of a physical resonance phenomenon where each v  vc j t  tc j
associated image-text pair can be seen as an information Tj ¼  þ ð1  Þ : ð18Þ
kvkkvk ktkktk
“object” that has a “natural frequency” in either visual or
textual information channel represented by the visual Resonance can occur if for each (visual and textual)
feature vector v ¼ ðv1 ; v2 ; . . . ; vm Þ or the textual feature channel, the match function value meets a vigilance
vector t ¼ ðt1 ; t2 ; . . . ; tn Þ. If two information objects have criterion. At current stage, the vigilance criterion is
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 169

an image and a text segment according to their resonance


score (see (18)) in a trained fusion ART. Such resonance-
based similarity measure has been used in existing work
for data terrain analysis [40].

3.4.2 Image Annotation Using Fusion ART


In the above discussion, we define a cross-media similarity
measure based on the resonance function of the fusion ART.
Based on this similarity measure, we can identify an
associated text segment represented by textual feature
vector t that is considered most “similar” to an image
represented by the visual feature vector v. At the same time,
we will identify a F2c object template j with a textual
information template tcj having the strongest resonance
with the image-text pair represented by < v; t > . Based on
the t and tcj , we can extract a set of keywords for annotating
the image. The process is described as follows:
Fig. 8. Training samples for fusion ART.
1. For the kth dimension of the textual information
c
an experience value manually tested and defined vectors t and tcj , if minftk ; tkj g > 0, extract the term
by us. tmk corresponding to the kth dimension of the
4. Template learning. Once a node j is selected, its object textual information vectors for annotating the
template will be updated by linearly combining the image.
object template with the input information object 2. When tmk is extracted, a confidence value of
c
according to a predefined learning rate [37]. The minftk ; tkj g is assigned to tmk based on which all
equation for updating the object template is defined extracted keywords can be ranked for annotating the
as follows: image.
We can see that the image annotation task is possible
vcj ¼ ð1  v Þ  vcj þ v  v ð19Þ because the fusion ART can capture the direct associations
and between the visual and textual information. The perfor-
mance evaluation of using fusion ART for image annotation
tcj ¼ ð1  t Þ  tcj þ t  t; ð20Þ is beyond the scope of this paper. However, the fusion-
ART-based image annotation method has a unique advan-
where v and t are learning rates for visual and
tage over the existing image annotation methods, namely it
textual information channels.
does not require a predefined set of keywords. A set of
5. New code creation. When no category node is
examples of using the fusion-ART-based method for image
sufficiently similar to the new input information
annotation will be presented in the next section.
object, a new category node is added to the F2c field.
Fusion ART thus expands its network architecture 3.4.3 Handling Noisy Text
dynamically in response to incoming information
Web text data may contain various noises such as wrongly
objects.
spelled words. Existing research shows that the web text
The advantage of using fusion ART is that its object noise may lower the performance of the data mining tasks
templates are learned by incrementally combining and on web text data [41], [42]. However, the ART-based
merging new information objects with previously learned systems have shown advantages on handling noisy web
object templates. Therefore, a learned object template is a text and have been successfully used in many applications
“summarization” of characteristics of the training objects. for web/text mining tasks [43], [44], [41], [45], [46]. In
For example, the three training I-T pairs, as shown in Fig. 8, normal practices, the ART-based systems may handle noisy
can be summarized by fusion ART into one object template
data through two ways:
and thereby the implicit associations across the I-T pairs can
be captured. In particular, the implicit associations between . First, noisy reduction can be performed in the data
frequently occurred visual and textual contents (such as the preprocessing stage. For example, in the web/text
visual content “black smoke” which can be represented classification systems, feature reduction is usually
using low-level visual features and the term “blazing” in used for eliminating the features that do not have
Fig. 8) can be learned for predicting new image-text strong association with the class labels [11]. How-
associations. ever, for image-text associations, the information
Suppose a trained fusion ART can capture all the typical object templates are automatically learned by the
multimedia information object templates. Given a new pair fusion ART and thus no class labels exist. Therefore,
of image and text segment that are semantically relevant, in our data preprocessing stage, we use another
their information object should be able to strongly resonate commonly used method that employs the notion of
with an object template in the trained fusion ART. In other term frequency to prune the terms with very high
words, we can measure the similarity or relevance between (above 80 percent) or very low frequencies (below
170 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

1 percent). Such a feature pruning method has been TABLE 1


effective in removing irrelevant features (including Statistics of the Data Set
noises) in web/text mining tasks.
. In addition, fusion ART can control the influence of
the noises through varying the learning rates.
Referring to (19) and (20), when using small learning
rates, noisy data with low frequencies will not cause
large changes to information object templates. With Ni denotes the number of images, Nw denotes the number of web pages
where the images and texts are extracted, Nc denotes the number of
sufficient patterns, the influence of the noises will be
images with captions, Avg. Nt denotes the average number of texts
kept low. segments appearing along with an image in a web page, and Ntm
denotes the number of text features (terms) extracted for representing
web texts.
4 EXPERIMENTAL RESULTS
4.1 Data Set NðdÞ
tTk S ¼ tfðT S; tmk Þ  log ; ð21Þ
The experiments are conducted on a web page collection, tsfðd; tmk Þ
containing 300 images related to terrorist attacks, down-
loaded from the CNN and BBC news web sites. For each where tfðT S; tmk Þ denotes the frequency of tmk in the text
image, we extract its caption (if available) and long text segment T S, NðdÞ is the total number of text segments in
paragraphs (with more than 15 words) from the web page the web document d, and tsfðd; tmk Þ is the text segment
containing the image. frequency of term tmk in d. Here, we use the text segment
For applying vague-transformation-based methods that frequency for measuring the importance of a term for a web
utilize an intermediate layer of information categories, document.
we manually categorize about 1,500 text segments and After a textual feature vector tTS ¼ ðtT1 S ; tT2 S ; . . . ; tTn S Þ is
300 images into 15 predefined domain information cate- extracted, L1-normalization is applied for normalizing the
term weights into a range of [0, 1]:
gories, i.e., Anti-Terror, Attack Detail, After Attack, Ceremony,
Government Emergency Preparation, Government Response,  TS TS 
TS t1 ; t2 ; . . . ; tTn S
Impact, Investigation, Rescue, Simulation, Attack Target, Terror- t ¼  S ; ð22Þ
max tTi¼1...n
ist Claim, Terrorist Suspect, Victim, and Others. We note that
the semantic contents of images are usually more concen- where n is the number of textual features (i.e., terms).
trating and easy to be classified into an information category.
However, for texts, a text segment may be a mixture of 4.1.2 Visual Feature Extraction from Images
information belonging to different information categories. In Our visual feature extraction method is inspired by those
this case, we assign the text segment into multiple informa- used in the existing image annotation approaches [5], [27],
tion categories. In addition, a PhD student in the School of [12], [28], [47]. During image preprocessing, each image is
Computer Engineering, Nanyang Technological University first segmented into 10  10 rectangular regions arranged in
manually inspects the web pages where the 300 images are a grid. There are two reasons why we use rectangular
extracted to identify the associated text segments for the regions instead of employing more complex image seg-
images. This forms our ground truth for evaluating the mentation techniques to obtain image regions. The first
correctness of the image-text pairs extracted by the proposed reason is that existing work on the image annotation shows
image-text association learning methods. As there are the that using rectangular regions can provide performance
text captions associated with the images which can provide a gain compared with using regions obtained by automatic
lot of clues, the text segments associated with the images image segmentation techniques [47]. The second reason is
usually can be identified from the web pages without that using rectangular regions is much less time consuming
difficulties. Table 1 lists the statistics of the data set and the and, therefore, suitable for the online multimedia web
detailed data preprocessing methods are described as information fusion task.
follows. For each image region, we extract a visual feature vector,
consisting of six color features and 60 gabor texture
4.1.1 Textual Feature Extraction features. The color features are the means and variances
We treat each text paragraph as a text segment. Preproces- of the RGB color spaces. The texture features are extracted
sing the text segments includes tokenizing the text segments, by calculating the means and variations of the Gabor
part-of-speech tagging, stop word filtering, stemming, filtered image regions in six orientations at five different
removing unwanted terms (retaining only nouns, verbs scales. Such a set of color and textual features has been
and adjectives), and generating the textual feature vectors proven to be useful for image classification and annotation
where each dimension corresponds to a remaining term after tasks [48], [49].
the preprocessing. After the feature vectors are extracted, all image regions
For calculating the term weights of the textual feature are clustered using the k-means algorithm. The purpose of
vectors, we use a model, named term frequency and inverted the clustering is to discretize the continuous color and
text segment frequency (TF-ITSF), similar to the traditional texture features [27], [28], [47]. The generated clusters,
TF-IDF model. For a text segment T S in a web document d, called visterms represented by fvt1 ; vt2 ; . . . ; vtk g, are treated
we use the following equation to weight the kth term tmk in as a vocabulary for describing the visual content of the
T S’s textual feature vector tTS ¼ ðtT1 S ; tT2 S ; . . . ; tTn S Þ: images. An image is described by a visterm vtj if it contains
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 171

a region belonging to the jth cluster. For the terrorist


domain data set, the visterm vocabulary is enriched with a
high-level semantic feature, extracted by a face detection
model provided by the OpenCV. In total, a visterm vector of
k þ 1 features is extracted for each image. The weight of
each feature is the corresponding visterm frequency
normalized with the use of L1-normalization.
A problem of using visterms is how we can determine a
proper number of visterms k (i.e., the number of clusters for
k-means clustering). Note that the images have been
manually categorized into the 15 information categories
that reflect the images’ semantic meanings. Therefore,
images belonging to different information categories should Fig. 9. Information gains of clustering the images based on a varying
have different patterns in their visterm vectors. If we cluster number of visterms.
images into different clusters based on their visterm vectors,
in the most ideal case, images belonging to different 4.2 Performance Evaluation Method
categories should be assigned into different clusters. Based We adopt a fivefold cross validation to test the performance
on the above consideration, we can measure the mean- of our methods. In each experiment, we use four folds of the
ingfulness of the visterm sets with different k values by data (240 images) for training and one fold (60 images) for
calculating the information gains of the image clustering testing. The performance is measured in terms of precision
results. The definition of the information gain given defined by
below is similar to the one used in the Decision Tree for
Nc
measuring partitioning results with respect to different data precision ¼ ; ð25Þ
N
attributes.
Given a set S of images belonging to m information where Nc is the number of correctly identified image-text
categories, the information need for classifying of the images associations, and N is the total number of images. We
in S is measured by experimented with different  values for our cross-media
retrieval models (see (1)) to find the best balance point of
Xm

si si weighting the impact of textual and visual features.
IðSÞ ¼  log ; ð23Þ However, we should note that in principle, the best  could
i¼1
kSk kSk
be obtained by using an algorithm such as expectation-
where kSk is the total number of images, and si is the maximization (EM) [50].
number of images belonging to the ith information Note that whereas most IR tasks use both precision and
category. recall to evaluate the performance, we only use precision in
Suppose we cluster an image collection S into n clusters, our experiment. This is simply due to the fact that for each
i.e., S1 ; S2 ; . . . ; Sn . The information gain can be calculated as image, there is only one text segment considered to be
follows: semantically relevant. In addition, we also extract only one
associated text segment for each image using the various
X
n
kSj k models. Therefore, in our experiments, the precision and
Gain ¼ IðSÞ  IðSj Þ; ð24Þ
j¼1
kDk recall values are actually the same.

where kSj k is the number of images in the jth cluster. 4.3 Evaluation of Cross-Media Similarity Measure
Fig. 9 shows the information gains obtained by clustering Based on Visual Features Only
our image collection based on visterm sets with a varying We first evaluate the performance of the cross-media
number of visterms. We can see no matter how many similarity measures, defined in Sections 3.3 and 3.4, by
clusters of images we generate, the largest information gain setting  ¼ 0 in our linear mixture similarity model (see
is always achieved when k is around 400. Based on this (1)), i.e., using only visual contents of images (without
observation, we generate 400 visterms for the image visterm image captions) for measuring image-text associations. As
there has been no prior work on image-text association
vectors.
learning, we implement two baseline methods for evalua-
Note that we employ information gain depending on
tion and comparison. The first method is based on the
information category to determine the number of visterms
CMRM proposed by Jeon et al [12]. The CMRM is designed
to use for representing image contents. This is an optimiza-
for image annotation by estimating the conditional prob-
tion for the data preprocessing stage, the benefit of which ability of observing a term w given the observed visual
will be employed by all the learning models in our content of an image. Another baseline method is based on
experiments. Therefore, we should see that it is not conflict the DWH model proposed by Xing et al [30]. As described
with our statement that the fusion ART does not depend on in Section 2, a trained DWH can also be used to estimate
the predefined information categories for learning the the conditional probability of seeing a term w given the
image-text associations. observed image visual features. As our objective is to
172 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

TABLE 2
The Seven Cross-Media Models in Comparison

associate an entire text segment to an image, we extend similarity matrix calculated based on bipartite graphs
CMRM and DWH model to calculate the average condi- cannot really reflect the semantic similarity between the
tional probability of observing terms in a text segment domain information categories. We shall revisit this issue in
given the visual content of an image. The reason of using the next section. Note that although DDT outperforms SDT
the average conditional probability, instead of the joint in most of the folds, there is a significant performance
conditional probability, is that we need to minimize the reduction in the fold 4. The reason could be that the reverse
influence of the length of the text segments. Note that the vague transformation results of certain text segments in the
longer a text segment is, the smaller the joint conditional fold 4 are difficult to be discriminated due to the reason
probability tends to be. Table 2 summarizes the seven described in Section 3.3.4. Therefore, the reverse vague
methods that we experimented for discovering image-text transformation based on text data may even lower the
associations based on pure visual contents of images. The overall performance of the DDT. On the other hand,
first four methods are vague-transformation-based cross- DDT_VP_CT performs much stable than DDT by incorpor-
media similarity measures that we define in Section 3.3. ating the visual space projection.
The fifth method is the fusion-ART (object resonance)- For evaluating the impact of the size of training data on
based similarity measure. The last two methods are the learning performance, we also experiment with differ-
baseline methods based on the CMRM and the DWH ent data sizes for training and testing. As DWH model has
model, respectively. Fig. 10 shows the performance of been shown cannot be trained properly for this data set, we
using the various models for extracting image-text associa- leave it out in the rest of the experiments. Fig. 11 shows the
tions based on a fivefold cross validation. performance of the six cross-media similarity models with
We see that among the seven methods, DDT_VP_CT and respect to training data of various sizes. We can see that
fusion ART provide the best performance. They outperform when the size of the training data decreases, the precision of
SDT and DDT which have a similar performance. All of the CMRM drops dramatically. In contrast, the performance
these four models perform much better than DWH model, of vague transformation and fusion ART drop less than
CMRM and DDT_VP_BG. We can see that DWH model 10 percent in terms of average precision. It shows that our
always obtains a precision of 0 percent and therefore cannot methods also provide better performance stability on small
predict the correct image-text association for this particular data sets compared with the statistical-based CMRM.
experiment. The reason could be that the training data set is
too small, and on the contrary, the data dimensions are 4.4 Evaluation of Linear Mixture Similarity Model
quite large (501 for visual features and 8,747 for textual In this section, we study the effect of using both textual and
features) to train a effective DWH model using Gibbs visual features in the linear mixture similarity model for
sampling [30]. It is surprising that DDT_VP_BG is the worst discovering image-text associations. Referring to the experi-
method other than the DWH model, hinting that the mental results in Table 3, we see that textual information is

Fig. 10. Comparison of cross-media models for discovering image-text Fig. 11. Performance comparison of cross-media models with respect to
associations. different training data sizes.
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 173

TABLE 3
The Average Precision Scores (in Percentages) for Image-Text Association Extraction

fairly reliable in identifying image-text associations. In fact, A sample set of the extracted image-text associations is
the pure text similarity measure ð ¼ 1:0Þ outperforms the shown in Fig. 12. We find that cross-media models are
pure cross-media similarity measure ð ¼ 0:0Þ by 20.7 per- usually good in associating general domain keywords with
cent to 24.4 percent in terms of average precision. images. Referring to the second image in Fig. 12, cross-
However, the best result is achieved by the linear media models can associate an image depicting the attack
mixture model using both the text-based and the cross- scene of 911 attack with a text segment containing the word
media similarity measures. DDT_VP_CT with  ¼ 0:7 can “attack” which is commonly used for describing terrorist
achieve an average precision of 62.6 percent, while the attack scenes. However, for the word “wreckage” that is a
fusion ART with  ¼ 0:6 can achieve an average precision of more specific word, cross-media models usually cannot
62.0 percent. On the average, the mixture similarity models identify it correctly. For such cases, using image captions
can outperform the pure text similarity measure by about may be helpful. On the other hand, as discussed before,
5 percent. This shows that visual features are also useful in image captions may not always reflect the image content
the identification of image-text associations. In addition, we accurately. For example, the caption of the third image
observe that combining cross-media and text-based simi- contains the word “man,” which is a very general term, not
larity measures improves the performance of pure text quite relevant to the terrorist event. For such cases, cross-
similarity measure on each fold of the experiment. There- media models can be useful to find the proper domain-
fore, such improvement is stable. In fact, the keywords specific textual information based on the visual features of
extracted from the captions of the images sometimes may the images.
be inconsistent with the contents of the images. For Fig. 13 shows a sample set of the results by using fusion
example, an image on the 911 attack scene may have a ART for image annotation. We can see that such annota-
caption on the ceremony of 911 attack, such as “Victims’ tions can reflect the direct associations between the visual
families will tread Ground Zero for the first time.” In such a and textual features in the images and texts. For example,
case, the visual features can compensate the imprecision in the visual cue of “debris” in the images may be associated
the textual features. with words, such as “bomb” and “terror” in the text
Among the vague transformation methods, dual- segments. Discovering such direct associations is an
direction transformation achieves almost the same perfor- advantage of the fusion-ART-based method.
mance as single-direction transformation. However, visual
space projection with dual-direction transformation can 4.5 Discussions and Comparisons
slightly improve the average precision. We can also see that In Table 4, we provide a summary of the key characteristics
the bipartite-graph-based similarity matrix D for visual of the two proposed methods. First of all, we note that the
space projection does not improve the image-text associa- underlying ideas of the two approaches are quite different.
tion results. By examining the classified text segments, we Given a pair of image and text segment, the vague-
notice that only a small number of text segments belong to transformation-based method translates features from one
more than one category and contribute to category simila- information space into another information space so that
rities. This may have resulted in an inaccurate similarity features of different spaces can be compared. The fusion-
matrix and a biased visual space projection. ART-based method, on the other hand, learns a set of
On the other hand, the performance of fusion ART is prototypical image-text associations and then predicts the
comparable with that of vague transformation with visual degree of association between an incoming pair of image
space projection. Nevertheless, when using the pure cross- and text segment by comparing it with the learned
media model ð ¼ 0:0Þ, fusion ART can actually outperform associations. During the prediction process, the visual and
vague-transformation-based methods by about 1 percent to textual information is first compared in their respective
3 percent. Looking into each fold of the experiment, we see spaces and the results consolidated based on a multimedia
that the fusion-ART-based method is much more stable object resonance function (ART choice function).
than the vague-transformation-based methods in the Vague transformation is a statistic-based method which
sense that the best results are almost always achieved with calculates the conditional probabilities in one information
 ¼ 0:6 or 0.7. For the vague-transformation-based meth- space given some observations in the other information
ods, the best result of each experiment fold is obtained with space. To calculate such conditional probabilities, we need
rather different  values. This suggests that the vague- to perform batch learning on a fixed set of training data.
transformation-based methods are more sensitive to the Once the transformation matrices are trained, they cannot
training data. be updated without building from scratch. In contrast, the
174 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

Fig. 12. A sample set of image-text associations extracted with similarity scores (SC). The correctly identified associated texts are bolded.

fusion-ART-based method adopts an incremental competi- A fixed number of domain-specific information categories
tive learning paradigm. The trained fusion ART can always are used to reduce the information complexity. Instead of
be updated when new training data are available. using predefined information categories, the fusion-ART-
The vague-transformation-based method encodes the based method can automatically organize multimedia
learned conditional probabilities in transformation matrices. information objects into typical categories. The character-
istics of an object category are encoded by a multimedia
information object template. There are usually more cate-
gory nodes learned by the fusion ART. Therefore, the
information in the fusion ART is less compact than that in
the transformation matrices. In our experiments, around 70
to 80 categories are learned by the fusion ART on a data set
containing 300 images (i.e., 240 images are used for training
in our fivefold cross-validation).
In terms of efficiency, the vague-transformation-based
method runs much faster than the fusion-ART-based
method during both training and testing. However, the
fusion-ART-based method produces a more stable perfor-
mance than that of the vague-transformation-based method
(see discussions in Section 4.4).

5 CONCLUSION
We have presented two distinct methods for learning and
Fig. 13. Samples of image annotations using fusion ART. extracting associations between images and texts from
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 175

TABLE 4
Comparison of the Vague-Transformation- and the Fusion-ART-Based Methods

multimedia web documents. The vague-transformation- REFERENCES


based method utilizes an intermediate layer of information [1] D. Radev, “A Common Theory of Information Theory from
categories for capturing indirect image-text associations. Multiple Text Sources, Step One: Cross-Document Structure,”
The fusion-ART-based method learns direct associations Proc. First ACL SIGdial Workshop Discourse and Dialogue,
2000.
between image and text features by employing a resonance [2] R. Barzilay, “Information Fusion for Multidocument Summariza-
environment of the multimedia objects. The experimental tion: Paraphrasing and Generation,” PhD dissertation, 2003.
results suggest that both methods are able to efficiently [3] H. Alani, S. Kim, D.E. Millard, M.J. Weal, W. Hall, P.H. Lewis, and
N.R. Shadbolt, “Automatic Ontology-Based Knowledge Extraction
learn image-text associations from a small training data set. from Web Documents,” IEEE Intelligent Systems, vol. 18, no. 1,
Most notably, they both perform significantly better than pp. 14-21, 2003.
the baseline performance provided by a typical image [4] A. Ginige, D. Lowe, and J. Robertson, “Hypermedia Authoring,”
IEEE Multimedia, vol. 2, no. 4, pp. 24-35, 1995.
annotation model. In addition, while the text-matching- [5] S.-F. Chang, R. Manmatha, and T.-S. Chua, “Combining Text and
based method is still more reliable than the cross-media Audio-Visual Features in Video Indexing,” Proc. IEEE Int’l Conf.
similarity measures, combining visual and textual features Acoustics, Speech, and Signal Processing (ICASSP ’05), pp. 1005-1008,
2005.
provides the best overall performance in discovering cross- [6] D.W. Oard and B.J. Dorr, “A Survey of Multilingual Text
media relationships between components of multimedia Retrieval,” technical report, College Park, MD, USA, 1996.
documents. [7] T. Mandl, “Vague Transformations in Information Retrieval,”
Proc. Sixth Int’l Symp. für Informationswissenschaft (ISI ’98),
Our proposed methods so far have been tested on a pp. 312-325, 1998.
terrorist domain data set. It is necessary to extend our [8] T. Jiang and A.-H. Tan, “Discovering Image-Text Associations
for Cross-Media Web Information Fusion,” Proc. Int’l Workshop
experiments to other domain data sets to obtain a more Parallel Data Mining (PKDD/ECML ’06), pp. 561-568, 2006.
accurate assessment of the systems’ performance. In addi- [9] A.-H. Tan, G.A. Carpenter, and S. Grossberg, “Intelligence
tion, as both methods are based on a similarity-based through Interaction: Towards a Unified Theory for Learning,”
Proc. Int’l Symp. Neural Networks (ISNN ’07), D. Liu et al., eds.,
multilingual retrieval paradigm, using advanced similarity vol. 4491, pp. 1098-1107, 2007.
calculation methods with visterm and term taxonomies may [10] G.A. Carpenter and S. Grossberg, “A Massively Parallel
Architecture for a Self-Organizing Neural Pattern Recognition
result in a better performance. These will form part of our Machine,” Computer Vision, Graphics, and Image Processing,
future work. vol. 37, pp. 54-115, 1987.
[11] J. He, A.-H. Tan, and C.L. Tan, “On Machine Learning Methods
for Chinese Document Categorization,” Applied Intelligence,
ACKNOWLEDGMENTS vol. 18, no. 3, pp. 311-322, 2003.
[12] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic Image
The authors would like to thank the anonymous reviewers Annotation and Retrieval Using Cross-Media Relevance Models,”
for providing many invaluable comments to the previous Proc. 26th Ann. Int’l ACM SIGIR Conf. Research and Development in
versions of this paper. Special thanks to Dr. Eric P. Xing Information Retrieval (SIGIR ’03), pp. 119-126, 2003.
[13] S. Little, J. Geurts, and J. Hunter, “Dynamic Generation of
and Dr. Jun Yang at the Carnegie Mellon University for Intelligent Multimedia Presentations through Semantic Infer-
providing Matlab source code of the DWH model. The encing,” Proc Sixth European Conf. Research and Advanced
reported work is supported in part by the I2 R-SCE Joint Technology for Digital Libraries (ECDL ’02), pp. 158-175, 2002.
Lab on Intelligent Media. Tao Jiang was with the School of [14] J. Geurts, S. Bocconi, J. van Ossenbruggen, and L. Hardman,
“Towards Ontology-Driven Discourse: From Semantic Graphs to
Computer Engineering, Nanyang Technological Univer- Multimedia Presentations,” Proc. Second Int’l Semantic Web Conf.
sity, Singapore. (ISWC ’03), pp. 597-612, 2003.
176 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 2, FEBRUARY 2009

[15] J. Han, Data Mining: Concepts and Techniques. Morgan Kaufmann, [37] G.A. Carpenter and S. Grossberg, “ART 2: Self-Organization of
2005. Stable Category Recognition Codes for Analog Input Patterns,”
[16] H.H. Yu and W.H. Wolf, “Scenic Classification Methods for Applied Optics, vol. 26, pp. 4919-4930, 1987.
Image and Video Databases,” Proc. SPIE, vol. 2606, no. 1, [38] G.A. Carpenter, S. Grossberg, and D.B. Rosen, “ART 2-A: An
pp. 363-371, http://link.aip.org/link/?PSI/2606/363/1, 1995. Adaptive Resonance Algorithm for Rapid Category Learning
[17] I.K. Sethi, I.L. Coman, and D. Stan, “Mining Association Rules and Recognition,” Neural Networks, vol. 4, no. 4, pp. 493-504,
between Low-Level Image Features and High-Level Concepts,” 1991.
Proc. SPIE, vol. 4384, no. 1, pp. 279-290, http://link.aip.org/link/ [39] G.A. Carpenter, S. Grossberg, and D.B. Rosen, “Fuzzy ART: Fast
?PSI/4384/279/1, 2001. Stable Learning and Categorization of Analog Patterns by an
[18] M. Blume and D.R. Ballard, “Image Annotation Based on Adaptive Resonance System,” Neural Networks, vol. 4, no. 6,
Learning Vector Quantization and Localized Haar Wavelet pp. 759-771, 1991.
Transform Features,” Proc. SPIE, vol. 3077, no. 1, pp. 181-190, [40] W. Li, K.-L. Ong, and W.K. Ng, “Visual Terrain Analysis of
http://link.aip.org/link/?PSI/3077/181/1, 1997. High-Dimensional Datasets,” Proc Ninth European Conf. Princi-
[19] A. Mustafa and I.K. Sethi, “Creating Agents for Locating ples and Practice of Knowledge Discovery in Databases (PKDD ’05),
Images of Specific Categories,” Proc. SPIE, vol. 5304, no. 1, A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, eds.,
pp. 170-178, http://link.aip.org/link/?PSI/5304/170/1, 2003. vol. 3721, pp. 593-600, 2005.
[20] Q. Ding, Q. Ding, and W. Perrizo, “Association Rule Mining [41] F. Chu, Y. Wang, and C. Zaniolo, “An Adaptive Learning
on Remotely Sensed Images Using P-Trees,” Proc. Sixth Pacific- Approach for Noisy Data Streams,” Proc. Fourth IEEE Int’l Conf.
Asia Conf. Advances in Knowledge Discovery and Data Mining Data Mining (ICDM ’04), pp. 351-354, 2004.
(PAKDD ’02), pp. 66-79, 2002. [42] D. Shen, Q. Yang, and Z. Chen, “Noise Reduction through
[21] J. Tesic, S. Newsam, and B.S. Manjunath, “Mining Image Summarization for Web-Page Classification,” Information Pro-
Datasets Using Perceptual Association Rules,” Proc. SIAM Sixth cessing and Management, vol. 43, no. 6, pp. 1735-1747, 2007.
Workshop Mining Scientific and Eng. Datasets in conjunction with the [43] A. Tan, H. Ong, H. Pan, J. Ng, and Q. Li, “FOCI: A Personalized
Third SIAM Int’l Conf. (SDM ’03), http://vision.ece.ucsb.edu/ Web Intelligence System,” Proc. IJCAI Workshop Intelligent
publications/03SDMJelena.pdf, May 2003. Techniques for Web Personalization (ITWP ’01), pp. 14-19,
[22] T. Kohonen, Self-Organizing Maps, T. Kohonen, M.R. Schroeder, Aug. 2001.
and T.S. Huang, eds., Springer-Verlag New York, Inc., 2001. [44] A.-H. Tan, H.-L. Ong, H. Pan, J. Ng, and Q.-X. Li, “Towards
[23] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Personalised Web Intelligence,” Knowledge and Information Systems,
Association Rules in Large Databases,” Proc. 20th Int’l Conf. vol. 6, no. 5, pp. 595-616, 2004.
Very Large Data Bases (VLDB ’94), J.B. Bocca, M. Jarke, and [45] E.W.M. Lee, Y.Y. Lee, C.P. Lim, and C.Y. Tang, “Application of a
C. Zaniolo, eds., pp. 487-499, 1994. Noisy Data Classification Technique to Determine the Occurrence
[24] A.M. Teredesai, M.A. Ahmad, J. Kanodia, and R.S. Gaborski, of Flashover in Compartment Fires,” Advanced Eng. Informatics,
“Comma: A Framework for Integrated Multimedia Mining Using vol. 20, no. 2, pp. 213-222, 2006.
Multi-Relational Associations,” Knowledge and Information Systems, [46] A.M. Fard, H. Akbari, R. Mohammad, and T. Akbarzadeh, “Fuzzy
vol. 10, no. 2, pp. 135-162, 2006. Adaptive Resonance Theory for Content-Based Data Retrieval,”
[25] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent Patterns Proc. Third IEEE Int’l Conf. Innovations in Information Technology
without Candidate Generation: A Frequent-Pattern Tree (IIT ’06), pp. 1-5, Nov. 2006.
Approach,” Data Mining and Knowledge Discovery, vol. 8, [47] S. Feng, R. Manmatha, and V. Lavrenko, “Multiple Bernoulli
no. 1, pp. 53-87, 2004. Relevance Models for Image and Video Annotation,” Proc. IEEE
Computer Soc. Conf. Computer Vision and Pattern Recognition
[26] C. Djeraba, “Association and Content-Based Retrieval,” IEEE
(CVPR ’04), pp. 1002-1009, 2004.
Trans. Knowledge and Data Eng., vol. 15, no. 1, pp. 118-135, Jan./
[48] M. Sharma, “Performance Evaluation of Image Segmentation and
Feb. 2003.
Texture Extraction Methods in Scene Analysis,” master’s thesis,
[27] K. Barnard and D. Forsyth, “Learning the Semantics of Words and
1998.
Pictures,” Proc. Eighth Int’l Conf. Computer Vision (ICCV ’01), vol. 2,
[49] P. Duygulu, O.C. Ozcanli, and N. Papernick, “Comparison of
pp. 408-415, 2001.
Feature Sets Using Multimedia Translation,” LNCS, 2869th ed.,
[28] P. Duygulu, K. Barnard, J.F.G. de Freitas, and D.A. Forsyth, 2003.
“Object Recognition as Machine Translation: Learning a Lexicon [50] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Like-
for a Fixed Image Vocabulary,” Proc. Seventh European Conf. lihood from Incomplete Data via the EM Algorithm,” J. Royal
Computer Vision (ECCV ’02), pp. 97-112, 2002. Statistical Soc. Series B (Methodological ’77), vol. 39, no. 1, pp. 1-38,
[29] J. Li and J.Z. Wang, “Automatic Linguistic Indexing of Pictures 1977.
by a Statistical Modeling Approach,” IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1075-1088,
Sept. 2003.
[30] E.P. Xing, R. Yan, and A.G. Hauptmann, “Mining Associated
Text and Images with Dual-Wing Harmoniums,” Proc. 21st
Ann. Conf. Uncertainty in Artificial Intelligence (UAI ’05), p. 633,
2005.
[31] P. Sheridan and J.P. Ballerini, “Experiments in Multilingual
Information Retrieval Using the Spider System,” Proc. 19th Ann.
Int’l ACM SIGIR Conf. Research and Development in Information
Retrieval (SIGIR ’96), pp. 58-65, 1996.
[32] P. Biebricher, N. Fuhr, G. Lustig, M. Schwantner, and G. Knorz,
“The Automatic Indexing System Air/Phys—From Research to
Applications,” Proc. 11th Ann. Int’l ACM SIGIR Conf. Research
and Development in Information Retrieval (SIGIR ’88), pp. 333-342,
1988.
[33] N. Tishby, F. Pereira, and W. Bialek, “The Information Bottle-
neck Method,” Proc. 37th Ann. Allerton Conf. Comm., Control
and Computing, pp. 368-377, http://citeseer.ist.psu.edu/
tishby99information.html, 1999.
[34] H. Hsu, L.S. Kennedy, and S.-F. Chang, “Video Search Reranking
via Information Bottleneck Principle,” Proc. 14th Ann. ACM Int’l
Conf. Multimedia (MULTIMEDIA ’06), pp. 35-44, 2006.
[35] G. Carpenter and S. Grossberg, Pattern Recognition by Self-
Organizing Neural Networks. MIT Press, 1991.
[36] A.-H. Tan, “Adaptive Resonance Associative Map,” Neural
Networks, vol. 8, no. 3, pp. 437-446, 1995.
JIANG AND TAN: LEARNING IMAGE-TEXT ASSOCIATIONS 177

Tao Jiang received the BS degree in computer Ah-Hwee Tan received the BS (first class
science and technology from Peking University honors) and MS degrees in computer and
in 2000 and the PhD degree from the Nanyang information science from the National Univer-
Technological University. Since October 2007, sity of Singapore in 1989 and 1991, respec-
he has been with ecPresence Technology Pte tively, and the PhD degree in cognitive and
Ltd., Singapore, where he is currently a project neural systems from Boston University, Boston,
manager. From July 2000 to May 2003, he in 1994. He is currently an associate professor
was with Found Group, one of the biggest IT and the head of the Division of Information
companies in China, earlier as a software Systems, School of Computer Engineering,
engineer and later as a technical manager. He Nanyang Technological University, Singapore.
also currently serves as a coordinator of “vWorld Online Community” He was the founding director of the Emerging Research Laboratory, a
project supported and sponsored by Multimedia Development Authority research center for incubating interdisciplinary research initiatives.
(MDA), Singapore. His research interests include data mining, machine Prior to joining NTU, he was a research manager at the A*STAR
learning, and multimedia information fusion. Institute for Infocomm Research (I2R), responsible for the Text Mining
and Intelligent Agents research groups. His current research interests
include cognitive and neural systems, information mining, machine
learning, knowledge discovery, document analysis, and intelligent
agents. He is the holder of five patents and has published more than
80 technical papers in books, international journals, and conferences.
He is an editorial board member of Applied Intelligence and a member
of the ACM. He is a senior member of the IEEE.

. For more information on this or any other computing topic,


please visit our Digital Library at www.computer.org/publications/dlib.

You might also like