
2004 IEEE International Conference on Multimedia and Expo (ICME)

Automatic Image Captioning*


Jia-Yu Pan†, Hyung-Jeong Yang†, Pinar Duygulu‡ and Christos Faloutsos†
†Computer Science Department, Carnegie Mellon University, Pittsburgh, U.S.A.
‡Department of Computer Engineering, Bilkent University, Ankara, Turkey
†{jypan, hjyang, christos}@cs.cmu.edu, ‡duygulu@cs.bilkent.edu.tr

Abstract

In this paper, we examine the problem of automatic image captioning. Given a training set of captioned images, we want to discover correlations between image features and keywords, so that we can automatically find good keywords for a new image. We experiment thoroughly with multiple design alternatives on large datasets of various content styles, and our proposed methods achieve up to a 45% relative improvement on captioning accuracy over the state of the art.

1. Introduction and related work

"Given a large image database, find images that have tigers. Given an unseen image, find terms which best describe its content." These are some of the problems that many image/video indexing and retrieval systems deal with (see [4][5][10] for recent surveys).

Content-based image retrieval systems, which match images based on visual similarities, have some limitations due to the missing semantic information. Manually annotated words could provide semantic information; however, manual annotation is time consuming and error-prone. Several automatic image annotation (captioning) methods have been proposed for better indexing and retrieval of large image databases [1][2][3][6][7].

We are interested in the following problem: "Given a set of images, where each image is captioned with a set of terms describing the image content, find the association between the image features and the terms." Furthermore, "with the association found, caption an unseen image." Previous works caption an image by captioning its constituent regions, via a mapping from image regions to terms. Mori et al. [10] use co-occurrence statistics of image grids and words for modeling the association. Duygulu et al. [3] view the mapping as a translation of image regions to words, and learn the mapping between region groups and words by an EM algorithm. Recently, probabilistic models such as the cross-media relevance model [6] and latent semantic analysis (LSA) based models [11] have also been proposed for captioning.

In this study, we experiment thoroughly with multiple design alternatives (better clustering decisions; weighting of image features and keywords; dimensionality reduction for noise suppression) for a better association model. The proposed methods achieve a 45% relative improvement on captioning accuracy over the result of [3], on large datasets of various content styles.

The paper is organized as follows: Section 2 describes the data set used in the study. Section 3 describes an adaptive method for obtaining image region groups. The proposed uniqueness weighting scheme and correlation-based image captioning methods are given in Sections 4 and 5. Section 6 presents the experimental results and Section 7 concludes the paper.

2. Input representation

We learn the association between image regions and words from manually annotated images (examples are shown in Figure 1).

[Figure 1. Top: annotated images with their captions (e.g., "sea, sun, sky, waves" and "cat, forest, grass, tiger"); bottom: corresponding blob-tokens and word tokens.]

* This material is based upon work supported by the National Science Foundation under Grants No. IRI-9817496, IIS-9988876, IIS-0113089, IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, EF-0331657, by the Pennsylvania Infrastructure Technology Alliance (PITA) Grant No. 22-901-0001, and by the Defense Advanced Research Projects Agency under Contract No. N66001-00-1-8936. Additional funding was provided by donations from Intel, and by a gift from Northrop-Grumman Corporation.

An image region is represented by a vector of features regarding its color, texture, shape, size and position. These feature vectors are clustered into B clusters and each region is assigned the label of the closest cluster center, as in [3]. These labels are called blob-tokens.

Formally, let I = {I1, ..., IN} be a set of annotated images, where each image Ii is annotated with a set of terms Wi = {wi,1, ..., wi,Li} and a set of blob-tokens Bi = {bi,1, ..., bi,Mi}, where Li is the number of words and Mi is the number of regions in image Ii. The goal is to construct a model that captures the association between the terms and the blob-tokens, given the Wi's and Bi's.
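As a concrete illustration of this representation, the following is a minimal Python/NumPy sketch (not part of the original system; names are ours) of how region feature vectors are mapped to blob-token labels by nearest-centroid assignment.

import numpy as np

def assign_blob_tokens(region_features, cluster_centers):
    """Assign each region the label of its closest cluster center.

    region_features : (M, d) array, one feature vector per image region
                      (color, texture, shape, size and position features).
    cluster_centers : (B, d) array of centroids found by clustering all
                      regions in the training collection.
    Returns an (M,) array of blob-token labels in {0, ..., B-1}.
    """
    # Squared Euclidean distance from every region to every center.
    dists = ((region_features[:, None, :] - cluster_centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)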
3. Adaptive blob-token generation

The quality of the blob-tokens affects the accuracy of image captioning. In [3], the blob-tokens are generated using the K-means algorithm on the feature vectors of all image regions in the image collection, with the number of blob-tokens, B, set at 500. However, the choice of B=500 is by no means optimal.

In this study, we determine the number of blob-tokens B adaptively using the G-means algorithm [12]. G-means clusters the data set starting from a small number of clusters, B, and increases B iteratively if some of the current clusters fail a Gaussianity test (e.g., the Kolmogorov-Smirnov test). In our work, the blob-tokens are the labels of the clusters adaptively found by G-means. The numbers of blob-tokens generated for the 10 training sets are all less than 500, ranging from 339 to 495, mostly around 400.
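For illustration, the following is a minimal sketch of this adaptive scheme, not the exact G-means implementation of [12]: each cluster is projected onto its principal direction and split if it does not look Gaussian, so B grows until every cluster passes the test. The use of scikit-learn's KMeans and SciPy's Anderson-Darling test (in place of the Kolmogorov-Smirnov test mentioned above) and all parameter names are assumptions of this sketch.

import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def adaptive_blob_tokens(X, b_init=2, max_b=500, alpha_index=2):
    """Grow the number of clusters B until every cluster looks Gaussian."""
    centers = KMeans(n_clusters=b_init, n_init=10).fit(X).cluster_centers_
    while len(centers) < max_b:
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
        new_centers = []
        for k in range(len(centers)):
            pts = X[km.labels_ == k]
            if len(pts) < 8:                     # too few points to test
                new_centers.append(km.cluster_centers_[k])
                continue
            # Project the cluster onto its principal direction and test normality
            # (Anderson-Darling; critical_values[2] is the 5% significance level).
            centered = pts - pts.mean(axis=0)
            direction = np.linalg.svd(centered, full_matrices=False)[2][0]
            proj = centered @ direction
            result = anderson(proj)
            if result.statistic <= result.critical_values[alpha_index]:
                new_centers.append(pts.mean(axis=0))        # keep as one cluster
            else:                                           # split into two
                offset = direction * proj.std()
                new_centers.extend([pts.mean(axis=0) + offset,
                                    pts.mean(axis=0) - offset])
        if len(new_centers) == len(centers):     # no cluster was split: done
            return km.labels_, km.cluster_centers_
        centers = np.array(new_centers)
    km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
    return km.labels_, km.cluster_centers_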
4. Weighting by uniqueness

If there are W possible terms and B possible blob-tokens, the entire annotated image set of N images can be represented by a data matrix D of size N-by-(W+B). We now define two matrices, one unweighted and one uniqueness-weighted, as the initial data representation.

Definition 1 (Unweighted data matrix) Given an annotated image set I = {I1, ..., IN} with a set of terms W and a set of blob-tokens B, the unweighted data matrix D0 = [DW0|DB0] is an N-by-(W+B) matrix, where the (i,j)-element of the N-by-W matrix DW0 is the count of term wj in image Ii, and the (i,j)-element of the N-by-B matrix DB0 is the count of blob-token bj in image Ii.

We weight the counts in the data matrix D according to the "uniqueness" of each term/blob-token. If a term appears only once in the image set, say with image I1, then we will use that term for captioning only when we see the blob-tokens of I1 again, which is a small set of blob-tokens. The more common a term is, the more blob-tokens it is associated with, and the uncertainty of finding the correct term/blob-token association goes up. The idea is to give higher weights to terms (blob-tokens) which are more "unique" in the training set, and low weights to noisy, common terms (blob-tokens).

Definition 2 (Uniqueness-weighted data matrix) Given an unweighted data matrix D0 = [DW0|DB0], let zj (yj) be the number of images which contain the term wj (the blob-token bj). The weighted data matrix D = [DW|DB] is constructed from D0, where the (i,j)-element of DW (DB), dW(i,j) (dB(i,j)), is

    dW(i,j) = dW0(i,j) × log(N / zj),    dB(i,j) = dB0(i,j) × log(N / yj),    (3)

where N is the total number of images in the set, and dW0(i,j) (dB0(i,j)) is the (i,j)-element of DW0 (DB0).

In the following, whenever we mention the data matrix D, it will always be the weighted data matrix.
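A minimal sketch of Definitions 1 and 2 (assuming NumPy; variable names are ours): build the per-image count matrices and apply the log(N/z) uniqueness weighting, which is analogous to the IDF weighting used in text retrieval.

import numpy as np

def build_weighted_matrices(term_counts, blob_counts):
    """term_counts: (N, W) counts of each term per image,
       blob_counts: (N, B) counts of each blob-token per image.
       Returns the uniqueness-weighted matrices DW and DB of Definition 2."""
    N = term_counts.shape[0]
    # z[j]: number of images containing term j; y[j]: images containing blob-token j.
    z = (term_counts > 0).sum(axis=0)
    y = (blob_counts > 0).sum(axis=0)
    # Guard against log(N/0) for terms/blob-tokens that never occur.
    DW = term_counts * np.log(N / np.maximum(z, 1))
    DB = blob_counts * np.log(N / np.maximum(y, 1))
    return DW, DB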
5. Proposed methods for image captioning

We propose 4 methods (Corr, Cos, SvdCorr, SvdCos) to estimate a W-by-B translation table T, whose (i,j)-element can be viewed as p(wi|bj), the probability that we caption with the term wi, given that we see a blob-token bj.

Definition 3 (Method Corr) Let Tcorr,0 = DW^T DB. The correlation-based translation table Tcorr is defined by normalizing each column of Tcorr,0 such that each column sums up to 1. Note that the (i,j)-element of Tcorr can be viewed as an estimate of p(wi|bj).

Tcorr measures the association between a term and a blob-token by the co-occurrence counts. Another possible measure is to see how similar the overall occurrence patterns (over the training images) of a term and a blob-token are. Such occurrence patterns are in fact the columns of DW and DB, and the similarity can be taken as the cosine value between pairs of column vectors.

Definition 4 (Method Cos) Let the i-th column of the matrix DW (DB) be dW,i (dB,i). Let cos(i,j) be the cosine of the angle between the column vectors dW,i and dB,j, and let Tcos,0 be a W-by-B matrix whose (i,j)-element is Tcos,0(i,j) = cos(i,j). Normalize the columns of Tcos,0 such that each column sums up to 1, and we get the cosine-similarity translation table Tcos.
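A minimal NumPy sketch of Definitions 3 and 4 (not the authors' code; the small-epsilon guards are ours):

import numpy as np

def translation_tables(DW, DB):
    """DW: (N, W) weighted term matrix, DB: (N, B) weighted blob-token matrix.
       Returns the column-normalized tables Tcorr and Tcos (both W-by-B)."""
    def normalize_columns(T):
        col_sums = T.sum(axis=0)
        return T / np.where(col_sums > 0, col_sums, 1.0)

    # Method Corr: co-occurrence matrix DW^T DB, each column normalized to sum to 1.
    T_corr = normalize_columns(DW.T @ DB)

    # Method Cos: cosine similarity between term columns and blob-token columns.
    DW_n = DW / np.maximum(np.linalg.norm(DW, axis=0, keepdims=True), 1e-12)
    DB_n = DB / np.maximum(np.linalg.norm(DB, axis=0, keepdims=True), 1e-12)
    T_cos = normalize_columns(DW_n.T @ DB_n)
    return T_corr, T_cos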

Singular Value Decomposition (SVD) decomposes a given matrix X (of size I-by-J) into a product of three matrices U, Λ and V^T. That is, X = U Λ V^T, where U = [u1, ..., uJ] and V = [v1, ..., vJ] have orthonormal columns ui and vi, and Λ is a diagonal matrix, Λ = diag(σ1, ..., σmin(I,J)), with σj > 0 for j ≤ rank(X) and σj = 0 for j > rank(X).

Previous work [14] shows that by setting the small σj's to zero, yielding an optimal low-rank representation, SVD can be used to clean up noise, reveal the informative structure in a matrix X, and achieve better performance in information retrieval applications. We propose to use SVD to suppress the noise in the data matrix D before learning the association. Following the general rule-of-thumb, we keep the first r singular values σj, which preserve 90% of the variance of the distribution, and set the others to zero. In the following, we denote the data matrix after SVD as Dsvd = [DW,svd|DB,svd].

Definition 5 (Methods SvdCorr and SvdCos) Methods SvdCorr and SvdCos generate the translation tables Tcorr,svd and Tcos,svd following the procedures outlined in Definitions 3 and 4, respectively. The only difference is that instead of starting with the weighted data matrix D, the matrix Dsvd is used here.
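A minimal sketch of the 90%-variance truncation (assuming NumPy; measuring "variance" by squared-singular-value mass is our interpretation of the rule-of-thumb above):

import numpy as np

def svd_denoise(D, energy=0.90):
    """Keep the leading singular values that preserve `energy` of the
       squared-singular-value mass and return the low-rank matrix D_svd."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(cumulative, energy)) + 1   # smallest rank reaching 90%
    return (U[:, :r] * s[:r]) @ Vt[:r, :]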
Algorithm 1 (Captioning) Given a translation table T of size W-by-B (W: total number of terms; B: total number of blob-tokens), and the number of captioning terms m for an image, an image I' with l blob-tokens B' = {b'1, ..., b'l} can be captioned as follows. First, form a query vector q = [q1, ..., qB]^T, where qi is the count of the blob-token bi in the set B'. Then, compute the term-likelihood vector p = Tq, where p = [p1, ..., pW]^T; pi is viewed as the predicted likelihood of the term wi. Finally, we caption the image I' with the m terms corresponding to the m highest values pi in the vector p.
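A minimal Python/NumPy sketch of Algorithm 1 (the vocabulary argument and names are ours):

import numpy as np

def caption_image(T, blob_tokens, vocabulary, m=4):
    """T: (W, B) translation table; blob_tokens: list of blob-token ids
       observed in the test image; vocabulary: list of the W terms.
       Returns the m terms with the highest predicted likelihood."""
    W, B = T.shape
    q = np.bincount(blob_tokens, minlength=B).astype(float)  # query vector
    p = T @ q                                                 # term likelihoods
    top = np.argsort(p)[::-1][:m]
    return [vocabulary[i] for i in top]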
6. Experimental results

The experiments are performed on 10 Corel image data sets. Each data set contains about 5,200 training images and 1,750 test images. The sets cover a variety of themes, ranging from urban scenes to natural scenes, and from artificial objects like jets and planes to animals. Each image has on average 3 captioning terms and 9 blobs.

We apply G-means and the "uniqueness" weighting to show the effects of clustering and weighting. We compare our proposed methods, namely Corr, Cos, SvdCorr and SvdCos, with the state-of-the-art machine translation approach [3] as the comparison baseline. Each method constructs a translation table as the estimated conditional probability p(wi|bj) of a term wi given that a blob-token bj is seen. These translation tables are then used in Algorithm 1 for captioning.

The captioning accuracy S on a test image is measured as the percentage of correctly captioned words [1]. That is, S = mcorr/m, where mcorr is the number of correctly captioned terms and m is the number of ground-truth terms. The overall performance is expressed by the average accuracy over all images in a (test) set.

Figure 2(a) compares the proposed methods with the baseline algorithm [3], which is denoted as EM-B500-UW or simply EM (meaning that EM is applied to an unweighted matrix, denoted UW, in which the number of blob-tokens is 500, denoted B500). For the proposed methods, blob-tokens are generated adaptively (denoted AdaptB) and the uniqueness weighting (denoted W) is applied. The proposed methods achieve an improvement of around 12% in absolute accuracy (45% relative improvement) over the baseline.

The proposed adaptive blob-token generation can also improve the baseline EM method. Figure 2(b) shows that the adaptively generated blob-tokens improve the captioning accuracy of the EM algorithm. The improvement is around 7.5% in absolute accuracy (34.1% relative improvement) over the baseline method (whose accuracy is about 22%). In fact, we found that the improvement holds not only for the EM method, but also for our proposed methods. When the number of blob-tokens is set at 500, the proposed methods are 9% less accurate (detailed figures not shown). This suggests that the correct size of the blob-token set is not 500, since all methods perform worse when the size is set at 500.

Before applying the "uniqueness" weighting, the 4 proposed methods perform similarly to the baseline EM method (the accuracy difference is less than 3%). The uniqueness weighting improves the performance of all proposed methods except the Cos method, which stays put. We also observed that weighting does not affect the result of EM. Due to the lack of space, we do not show detailed figures here.
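A minimal sketch of the accuracy measure described above (names are ours; predicted and ground-truth captions are assumed to be collections of terms):

def caption_accuracy(predicted_terms, truth_terms):
    """S = m_corr / m: fraction of ground-truth terms that were predicted."""
    truth = set(truth_terms)
    return len(truth & set(predicted_terms)) / len(truth)

def average_accuracy(predictions, truths):
    """Average S over all images in a test set."""
    scores = [caption_accuracy(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)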

[Figure 2. Captioning accuracy improvement: (a) proposed methods vs. the baseline EM-B500-UW, (b) adaptive blob-token generation on EM vs. the baseline, (c) proposed methods vs. EM when the adaptively generated blob-tokens are used.]

Another measure of the performance of a method is the recall and precision value for each word (Figure 3). Given a word w, let the set Rw contain the r test images captioned with the word w by the method we are evaluating. Let r* be the actual number of test images that have the word w (the set R*w), and r' be the size of the intersection of Rw and R*w. Then, the precision of word w is r'/r, and the recall is r'/r*.
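A minimal sketch of this per-word precision and recall (names are ours; predictions and ground truths are assumed to be per-image sets of terms):

def word_precision_recall(word, predictions, truths):
    """Returns (precision, recall) of `word`, or 0.0 where it is never
       predicted / never appears in the ground truth."""
    R = {i for i, p in enumerate(predictions) if word in p}   # images captioned with word
    R_star = {i for i, t in enumerate(truths) if word in t}   # images truly containing word
    hits = len(R & R_star)
    precision = hits / len(R) if R else 0.0
    recall = hits / len(R_star) if R_star else 0.0
    return precision, recall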
Note that some words have zero precision and recall, if they are never used or are always used for the wrong images (un-"predictable" words). We prefer a method that has fewer unpredictable words, since it could generalize better to unseen images. Table 1 shows that the proposed methods have two to three times more predictable words on average than the baseline EM does. EM captions frequent words with high precision and recall, but misses many other words. That is, EM is biased to the training set.
Table 1. Average recall and precision values and the number of predictable words for each method.

    #predicted    36       57       72       56       132
    Avg. prec.    0.0411   0.1131   0.1445   0.1197   0.2079

[Figure 3. Recall and precision of the top 20 frequent words in the test set. SvdCorr (white bars) gives more general performance than baseline EM (black bars).]

As an example of how well the captioning works: for the image in Figure 1(a), EM-B500-UW and SvdCorr-AdaptB-W both give "sky", "cloud", "sun" and "water". For the image in Figure 1(b), EM-B500-UW gives "grass", "rocks", "sky" and "snow", while SvdCorr-AdaptB-W gives "grass", "cat", "tiger", and "water". Although the captions do not match the truth (in the figure) perfectly, they describe the content quite well. This indicates that the "truth" caption may be just one of many ways to describe the image.

7. Conclusion

In this paper, we studied the problem of automatic image captioning and proposed new methods (Corr, Cos, SvdCorr and SvdCos) that consistently outperform the state-of-the-art EM approach (45% relative improvement) in captioning accuracy. Specifically:

- We do thorough experiments on 10 large datasets of different image content styles, and examine all possible combinations of the proposed techniques for improving captioning accuracy.
- The proposed "uniqueness" weighting scheme on terms and blob-tokens boosts the captioning accuracy.
- Our improved, "adaptive" blob-token generation consistently leads to performance gains.
- The proposed methods are less biased to the training set and are more general in terms of retrieval precision and recall.
- The proposed methods can be applied to other areas, such as building an image glossary of different cell types from figures in medical journals [13].

References

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, M. Jordan, "Matching words and pictures", Journal of Machine Learning Research, 3:1107-1135, 2003.
[2] D. Blei, M. Jordan, "Modeling annotated data", 26th Annual Int. ACM SIGIR Conf., Toronto, Canada, 2003.
[3] P. Duygulu, K. Barnard, N. de Freitas, D. Forsyth, "Object recognition as machine translation: learning a lexicon for a fixed image vocabulary", Seventh European Conference on Computer Vision (ECCV), Vol. 4, pp. 97-112, 2002.
[4] D. A. Forsyth and J. Ponce, "Computer Vision: A Modern Approach", Prentice-Hall, 2002.
[5] A. Goodrum, "Image information retrieval: An overview of current research", Informing Science, 3(2), 2000.
[6] J. Jeon, V. Lavrenko, R. Manmatha, "Automatic Image Annotation and Retrieval using Cross-Media Relevance Models", 26th Annual Int. ACM SIGIR Conference, Toronto, Canada, 2003.
[7] J. Li and J. Z. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach", IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(10), 2003.
[8] M. Markkula, E. Sormunen, "End-user searching challenges indexing practices in the digital newspaper photo archive", Information Retrieval, vol. 1, 2000.
[9] Y. Rui, T. S. Huang, S.-F. Chang, "Image Retrieval: Past, Present, and Future", Journal of Visual Communication and Image Representation, 10:1-23, 1999.
[10] Y. Mori, H. Takahashi, R. Oka, "Image-to-word transformation based on dividing and vector quantizing images with words", First Int. Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
[11] F. Monay, D. Gatica-Perez, "On Image Auto-Annotation with Latent Space Models", Proc. ACM Int. Conf. on Multimedia (ACM MM), Berkeley, 2003.
[12] G. Hamerly and C. Elkan, "Learning the k in k-means", Proc. of NIPS, 2003.
[13] M. Velliste and R. F. Murphy, "Automated Determination of Protein Subcellular Locations from 3D Fluorescence Microscope Images", Proc. 2002 IEEE Int. Symp. on Biomedical Imaging (ISBI 2002), pp. 867-870, 2002.
[14] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter and K. E. Lochbaum, "Information retrieval using a singular value decomposition model of latent semantic structure", Proc. of the 11th ACM SIGIR Conf., pp. 465-480, 1988.

