DOI 10.3233/HIS-170246
IOS Press
Abstract. Automatic captioning of images has been explored extensively in the past 10 to 15 years. It is one of the elementary
problems in Computer Vision and Natural Language Processing and has a vast array of real-world applications. In this survey,
we study the different approaches used for generating image captions in chronological order, from the basic template-based
caption generation model to neural networks combined with external world knowledge. We review existing models in detail,
highlighting the methodologies involved and the improvements made to them over time. We give an overview of the standard
image datasets and of the evaluation measures developed to assess the quality of generated image captions. Apart from the
basic benchmarks, we also note speed and accuracy improvements across the different approaches. Finally, we investigate
further possibilities in automatic image caption generation.
Keywords: Computer Vision, image captioning, deep learning, object recognition, Natural Language Processing
1448-5869/17/$35.00
© 2017 – IOS Press. All rights reserved
124 A. Kumar and S. Goel / A survey of evolution of image captioning techniques
Therefore, the pressing need for image caption generation cannot go unnoticed, and many research papers and conferences around the world are devoted to it.

At the heart of this technology is the power of neural networks, which makes it seem like magic. While the Convolutional Neural Network (CNN) layers perform the prime task of identifying the salient features of an image, the Recurrent Neural Network makes it possible to construct (almost) meaningful captions from the identified visual features. We will see how the technology advanced from basic tree parsing to neural networks that use hundreds of millions of features to make the results competent enough to be compared with human-generated captions. We go through these technologies briefly, one by one, in this survey.

The aim of this survey is to present, analyze and compare the research of the last decade in this ground-breaking field in a matter of a few pages. We aim to identify chronologically the technological developments, the approaches involved, their drawbacks and how well they perform on various metrics, and to investigate the future scope of automatic image caption generation.

2. Evolution of image captioning techniques

The paper is structured chronologically: we start from the basic techniques, discuss their methodology and shortcomings, and then move to a newer approach that solved the problems of the previous one. Figure 1 shows the timeline we have developed to help keep track of the advancements in this field and to better understand the need for, and the basis of, the development of newer models.

The years 2005 to 2010 saw the birth of the major approaches dealing with computer vision and with mapping the detected objects to words and weaving them into a meaningful or stylish description. Related work includes that of Li et al. [5], Farhadi et al. [9] and Li et al. [6]. This period also saw approaches that add knowledge from the corresponding domain which ordinary computer vision cannot see (Yang et al. [16]), as well as approaches that reuse existing captions of locally similar (Ordonez et al. [17]) and globally similar (Farhadi et al. [7]) images to get rid of the need to structure self-made sentences.

In 2010 and 2011 some work still focused on primitive techniques for text generation; e.g., Kulkarni et al. [15] used template-based description generation, and Farhadi et al. [7] grouped the computer vision detections into triplets and then used them to generate descriptions based on templates. Li et al. [14] generated descriptions by merging the objects detected by computer vision using proper semantic relationships.

The year 2012 was a highlight in the era of automated image captioning techniques, as ImageNet classification with a deep convolutional neural network (CNN) was achieved using 60 million parameters and half a million neurons; the network consisted of five convolutional layers.

From 2013 onwards, techniques involving recurrent networks started gaining momentum. Mao et al. [28], Vinyals et al. [48], Kiros et al. [25], Fang et al. [30] and Chen et al. [47] used a recurrent neural network, while Kiros et al. [23] used a feedforward one. Kiros et al. [25] also proposed creating a “multimodal embedding space” by using a vision model and an LSTM to encode text.

The years 2014 and 2015 saw the evolution of rich feature hierarchies for accurate, high-quality object detection and of going deeper with convolutions. Fast progress in object detection [18,31–33,49] was identified with models that labeled multiple regions of an image in image captioning.

The years 2015 and 2016 saw the evolution of very “deep convolutional networks” for large-scale image recognition, along with multimodal R-CNN and various other techniques [24,34–40,48]. We discuss in detail 8 prominent categories of approaches belonging to different time periods, starting with one of the very first techniques and ending with a very recent one.

2.1. Template Based (Tree parser)

The template-based technique is discussed using the work of Margaret Mitchell et al. [21], which was published in 2012.

2.1.1. Problems faced till now

Template-based caption generation was used earlier by Kulkarni et al. [15]. The system of Yang et al. [16] substitutes probable prepositions, verbs and interjections on parsing the UIUC Pascal-VOC dataset (Farhadi et al. [9]), and chooses head nouns and their dependents using maximum likelihood calculated by taking the ratio of their individual logs. However, only predictable, consistent sentences can be generated using template-based techniques, not novel captions. Ordonez et al. [17] matched the query image from a much larger
set of existing captioned photographs, followed by local reordering. Although natural, these captions are not true to the image: they mainly describe the similar images and may miss unique features of the query image.

2.1.2. Overview

Midge uses syntactic knowledge of the probability distribution of the next words that should appear after a given sequence of words. The generator uses constraints to filter out the noisy output of the vision system and generates syntactic trees to describe what computer vision detected in the image.

2.1.3. Model/Methodology

For training, 700,000 images from the Flickr dataset, with their respective descriptions, were taken from the dataset used in Ordonez et al. [17]. Before parsing, the descriptions were normalized. Parsing was done using the Berkeley parser (Petrov [12]). Once a head noun was selected, probabilities were calculated for determiners (the, a, an) and pre-nominal modifiers/adjectives in order to formulate a description.

Head nouns were identified, and physical objects were distinguished, from the detections of the vision system using WordNet (Miller [1]). A maximum of 3 objects was kept in a single-sentence description.

The caption generation process was treated as a problem of growing a syntactically and semantically informed tree based on the detected object nouns. Tree growth was achieved using lexicalized syntactic derivations anchored on the head nouns detected above.

A three-step growth process was followed (Reiter and Dale [3]). Content determination was used for grouping and then ordering the object nouns, generating their local subtrees, and filtering irregular detections. Micro-planning was done to generate full syntactic trees around the detected noun objects. In the surface realisation step, modifiers were selected and classified as post-nominal or pre-nominal, and the final outputs were chosen. The system followed an approach where multiple trees are grown and the best one is then chosen as the result. Contexts (nouns) for adjectives were weighted using Point-wise Mutual Information, and for any adjective only the best 1000 nouns were selected.

In micro-planning, fully grown trees are generated by taking the intersection of the subtrees created in content determination. Subtrees surrounding a noun in position 1 are directly merged with subtrees surrounding a noun in position 2 because the nouns are ordered. In the surface realisation stage, the system chooses the single most probable tree from all the generated trees, and the mark-up is removed to produce a final string. Different strings may be generated depending on different specifications from the user. The final string is then the one with the most words.

Fig. 2. Tree generated using Midge approach during the tree growth process.

2.1.4. Evaluation metrics

Human judgments on a 5-point Likert scale, collected using Amazon's Mechanical Turk (Amazon, 2011), were used as evaluation metrics. Midge was also evaluated against the Kulkarni et al. system, the Yang et al. system, and human-generated descriptions on the same dataset (images). The criteria were grammar, main aspects, correctness, order and human likeness. Results were analysed using the non-parametric Wilcoxon Signed-Rank test, in which the parameter for comparing different systems is the median value.

2.1.5. Dataset

Training dataset: 700,000 images with their associated descriptions from the Flickr dataset in Ordonez et al. [17]. Testing dataset: 840 PASCAL images.

2.1.6. Results

Midge performed better than all earlier automatic approaches on the criteria of correctness and order. It additionally performed better than Yang et al. on the criterion of how close the sentences are to human-generated ones.

Figure 3 shows captions generated by this method.

2.2. Encoder – Decoder based Caption Generation

This technique is discussed using the work of Kiros, Salakhutdinov, and Zemel – Unifying visual semantic embeddings with multimodal neural language models [25], which was published in 2014.

2.2.1. Problems faced till now

Descriptions produced by earlier strategies were machine-like in nature and failed to match the fluidity of captions written by humans. The Bleu and Rouge [22] evaluation measures were unreliable and did not match human perceptions.

language model (SC-NLM), where the sentence structure was disentangled to its content and an encoder which created the conditioning on distributed representations. Sensible image captions were generated if sampling from SC-NLM was used. The decoder generated new captions from base.

The problem of ranking pictures and captions was used as an alternative to generation. Optimising this task would lead to an improvement in the generation technique, because any generation system needs a scoring function to analyse how well a caption and a picture match.

2.2.4. Evaluation metrics

Med r; R@K.

2.2.5. Dataset

Flickr30K and Flickr8K.
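To make the ranking metrics of Section 2.2.4 concrete, the following minimal Python sketch computes R@K (the percentage of queries whose first correct item appears within the top K results) and Med r (the median rank of that item). The function names, the similarity scores and the example ranks are illustrative assumptions and are not taken from [25].

```python
from statistics import median


def first_hit_rank(scores, relevant):
    """Return the 1-based rank of the best-scored relevant candidate.

    scores   -- similarity score of each candidate for one query (hypothetical values)
    relevant -- set of candidate indices that are ground truth for this query
    """
    # Sort candidate indices from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for position, idx in enumerate(order, start=1):
        if idx in relevant:
            return position
    raise ValueError("no relevant candidate was scored")


def ranking_metrics(ranks, ks=(1, 5, 10)):
    """Compute R@K (in percent) for each K, and the median rank (Med r)."""
    n = len(ranks)
    recall_at_k = {k: 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    return recall_at_k, median(ranks)


# Illustrative ranks for five queries (made-up numbers, not results from any paper).
example_ranks = [1, 3, 12, 7, 2]
recall_at_k, med_r = ranking_metrics(example_ranks)
print(recall_at_k, med_r)
```

A higher R@K and a lower Med r both indicate better retrieval, which is how the recall tables later in this survey should be read.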
Fig. 3. Captions generated as in [21] L to R: The bus by the road with a clear blue sky; People with a bottle at the table; A person in black with a
black dog by potted plants.
Fig. 4. Captions generated as in [25] L to R: A parked car while driving down the road; A little boy with a bunch of friends on the street; There
is a cat sitting on a shelf.
2.3.5. Dataset
PASCAL 1K, Flickr 8K and 30K, MS COCO.
2.3.6. Results
Figure 6 shows captions generated by this method.
Fig. 6. Captions generated as in [47] L to R: A train is stopped at a train station; A group of people standing on a snow covered slope; A group
of people that are standing in front of a building.
Fig. 7. Captions generated as in [48] L to R: A red motorcycle parked on the side of the road; A group of young people playing a game of frisbee;
Two dogs play in the grass.
Fig. 8. Captions generated as in [28] L to R: A square with burning street lamps and a street in the foreground; Tourists are sitting at a long table
with a white table cloth and are eating; A blue sky in the background.
Fig. 9. Overview of fully convolutional localization network for dense captioning. The localization layer proposes regions and smoothly extracts a
batch of corresponding activations with the help of bilinear interpolation.
ity distribution. However, the word with the maximum probability was picked instead, since this method performed slightly better than sampling. After that, the picked word was input to the model, and the process continued until the end sign ##END## was output by the model.

For image retrieval, the top-ranked images were the output, where ranking was done on the basis of their perplexity with the query sentence. Sentence retrieval used a normalized probability for each sentence.

2.5.4. Evaluation metrics

Sentence perplexity and BLEU scores (B-1, B-2, B-3), R@K (K = 1, 5, 10) and Med r. For IAPR TC-12: recall accuracy curve, R@K (K = 1, 5, 10) and Med r.

2.5.5. Dataset

Flickr 8K; Flickr 30K; IAPR TC-12.

2.5.6. Results

This was the first work in which an RNN was incorporated in a deep multimodal architecture.

Figure 8 shows captions generated by this method.

2.6. Object Detection (R-CNN) + Localization Layer + Caption Generation Model (RNN)

This technique is discussed using the work of Johnson, Karpathy, Li – DenseCap: Fully Convolutional Localization Networks for Dense Captioning [46], which was published in 2016.

2.6.1. Problems faced till now

Predictions based on earlier region CNN-RNN models did not include context outside of each region. They were also inefficient, as each region had to be forwarded independently. The localization layer was proposed to address these difficulties.

2.6.2. Overview

The paper brings together work on object detection, image captioning, and the processing of particular regions of the image.

2.6.3. Model/Methodology

The Fully Convolutional Localization Network for dense captioning was based on CNN-RNN models for image captioning, but also included a differentiable localization layer that could be inserted in the neural network to enable localized predictions of the region proposals.

The CNN consisted of 13 layers of 3 × 3 convolutions and formed the input to the localization layer. The localization layer classified convolutional anchor boxes according to their transformation and confidence scores, and aligned the region proposals made at the start of object detection to the ground-truth boxes. Positive proposals were the ones that were matched, and hence increased confidence scores during training, while negative proposals decreased the confidence scores.

The recognition network processed the features of each region from the localization layer. The features of each region were flattened into a vector and then passed through fully connected layers. The position was redefined and confidence scores were proposed for each region.

2.6.4. Evaluation metrics

METEOR, mean Average Precision (AP).

2.6.5. Dataset

MSCOCO, YFCC100M, Visual Genome (VG) Dataset.

2.6.6. Results

The FCLN model performed better than the Region RNN in both ranking and localization under all metrics, in a way that median rank reduced from 7 to 5 and localization recall from 0.5 IoU to 0.153.

Figure 10 shows how captions are generated by this method.

2.7. Semantic Alignment Models (R-CNN and B-RNN) + Description Generation Model (M-RNN)

This technique is discussed using the work of Karpathy and Li – Deep Visual-Semantic Alignments for Generating Image Descriptions [41], which was published in 2015.

2.7.1. Problems faced till now

The focus of most work so far has been on condensing the elaborate visual content of an image into a single sentence. However, this requirement is an unnecessary restriction.
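Returning briefly to the region matching of Section 2.6.3 and the IoU-based localization scores of Section 2.6.6: both rest on the intersection-over-union (IoU) overlap between a proposed box and a ground-truth box. The sketch below shows one common way to compute IoU and to label proposals as positive or negative. The (x1, y1, x2, y2) box format, the function names and the 0.7 / 0.3 thresholds are assumptions chosen for illustration; they are not stated in this survey and should not be read as the exact settings of [46].

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, ground_truth, pos_thresh=0.7, neg_thresh=0.3):
    """Label each proposal by its best overlap with any ground-truth box.

    pos_thresh and neg_thresh are illustrative assumptions, not values from the survey.
    """
    labels = []
    for proposal in proposals:
        best = max((iou(proposal, gt) for gt in ground_truth), default=0.0)
        if best >= pos_thresh:
            labels.append("positive")
        elif best < neg_thresh:
            labels.append("negative")
        else:
            labels.append("ignored")  # ambiguous overlap, used for neither class
    return labels


# Hypothetical boxes: a perfect match, a disjoint box, and a partial overlap.
gt_boxes = [(0, 0, 2, 2)]
proposals = [(0, 0, 2, 2), (10, 10, 12, 12), (0, 0, 2, 3)]
labels = label_proposals(proposals, gt_boxes)
print(labels)
```

Under this kind of scheme, positive proposals push confidence scores up during training and negative ones push them down, which matches the description of the localization layer given in Section 2.6.3.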
Fig. 10. The sequence of images shows Dense image captioning task using a model that generates rich and dense captions.
Fig. 12. Captions generated as in [41] L to R: A man in black shirt is playing guitar; Two young girls are playing with lego toy; Construction
worker in orange safety vest is working on road.
2.8.1. Problems faced till now

Previous work did not take external knowledge into account when generating captions. The importance of introducing an intermediate attribute prediction layer was also neglected by almost all previous work.

2.8.2. Overview

An intermediate attribute prediction layer, which was neglected by almost all previous work, is introduced into the predominant CNN-LSTM framework.

2.8.3. Model/Methodology

Attributes predicted by the CNN-based attribute prediction model were used to generate the captions for the image. In image captioning, the gaps in the caption templates were filled by the attributes predicted by the model. The caption generation model was trained by maximizing the probability of the correct description of the image. The semantic attribute prediction value Vatt was used rather than the image features directly. The predicted attributes and generated captions were combined with the external knowledge in the knowledge base, and then fed to the LSTM to provide answers to various questions. Figure 13 shows captions generated by the attribute-based captioning model, along with mined external knowledge.

2.8.4. Evaluation metrics

BLEU, METEOR and CIDER.

2.8.5. Dataset

Flickr8k, Flickr30k and Microsoft COCO.

2.8.6. Results

Att-RegionCNN + LSTM is so far one of the most suitable approaches for generating image captions. Figure 14 shows question answering examples using the method of [42].

3. Datasets

3.1. PASCAL 1K [43]

The images in this dataset are a subset of the images collected for the PASCAL VOC Challenge. It has 20 categories of images; for each category, 50 images are chosen at random as a sample, along with their descriptions, which are generated using Amazon's Mechanical Turk.
Fig. 13. Image Caption Generated: A man with bat readies to swing at the pitch while the umpire looks on. External Knowledge: A pitch is a
place used to play various sports such as cricket. The umpire is present to review the match.
Fig. 14. Examples where the Att-RegionCNN + LSTM gives the most appropriate answer while the baseline model gives a wrong answer.
Figure from paper [42].
3.2. Flickr8K & 30K [10]

The Flickr 8K and 30K datasets contain 8000 and 31,783 images respectively, gathered from Flickr. The majority of these images show human beings taking part in various activities. Every image has 5 sentences describing it. These datasets are split for training, testing and validation following some approved standards.

3.3. MS COCO [27]

The Microsoft COCO dataset contains 82,000 training images and 40,000 validation images, each complemented by 5 sentences describing it. These images are sourced from Flickr by searching for common/famous object categories, and they generally contain a variety of objects with important information pertaining to their context.

3.4. IAPR TC [4]

This dataset contains 20,000 still natural pictures from various locations all around the world. The pictures come from various categories such as sports, actions, cities and shapes, animals, people and many other aspects of modern life. Every image has captions in 3 languages: English, German and Spanish. These 20,000 images are of high resolution, and strict image selection rules are followed while choosing images for this dataset.

3.5. VISUAL GENOME (VG) [52]

It is a dataset built by experts mainly from Stanford and Yahoo. It is a knowledge base which is basically a persistent effort to relate image concepts to their natural language descriptions in a structured manner. It is currently the largest dataset of image-based questions and answers, with approximately 1,700,000 question-answer pairs. Every image is supplemented with an average of 17 question-answer pairs.

4. Evaluation and ranking metrics

We have prepared tables for various datasets, in chronological order, showing the approaches used for the task of description generation and their corresponding scores using a variety of different metrics
Table 1
Image captioning techniques and their scores on IAPRTC12 dataset
Approaches Year B-1 B-2 B-3 PPL
IAPRTC12
BACK-OFF GT2 2007 32.3 14.5 5.9 55.4
BACK-OFF GT3 2007 31.2 13.1 5.9 55.6
LBL 2007 32.7 14.4 1 20.1
Gupta et al. [56] 2012 15 6 7
Gupta & Mannem [19] 2012 33 18 9.8
MLBL-B-DeCAF [13] 2014 37.3 18.7 9.8 24.7
MLBL-F-DeCAF [13] 2014 36.1 17.6 9.2 21.8
m-RNN Baseline [28] 2014 31.34 11.68 8.03 7.77
m-RNN [28] 2014 39.51 18.28 13.11 6.92
Table 2
Image captioning techniques and their scores on PASCAL dataset
Approaches Year B-1 B-2 B-3 PPL
PASCAL
BabyTalk [15] 2011 25 0.49 9.69
Im2Text [17] 2011 25
LBL 2012 32.7 14.4 1 20.1
Midge [21] 2012 2.89 8.80
RNN 2013 2.79 10.08 36.79
RNN + IF 2013 10.16 16.43 30.04
RNN + IF + FT 2013 10.18 16.45 29.43
Tri5Sem [22] 2013 25
Microsoft 2014 10.48 16.69 27.97
Bidirectional
Retrieval
Microsoft 2014 10.77 16.87 26.95
Bidirectional
Retrieval + FT
TreeTalk [26] 2014 25
m-RNN [28] 2014 25
NIC [48] 2015 59
HUMAN 2015 70
Table 3
Image captioning techniques and their scores on Flickr8k dataset
Approaches Year B-1 B-2 B-3 B-4 BLEU M PPL
Flickr8k
RNN 2013 4.86 11.81 21.88
RNN + IF 2013 12.04 17.10 20.43
Tri5Sem [22] 2013 48
Microsoft Bidirectional Retrieval 2014 14.10 17.97 19.24
m-RNN [28] 2014 58
MNLM [25] 2014 51
Mao et al. [28] 2014 58 28 23 24.39
Google NIC [48] 2014 63 41 27
Chen and Zitnik [47] 2014 14.1
NIC [48] 2015 63
Karpathy et al. [Deep Visual Semantic] 2015 57.9 38.3 24.5 16.0
Xu et al. (Hard-Attention) 2015 67 46 31 21
Att-SVM + LSTM 2016 73 53 38 26 12.63
Att-GlobalCNN + LSTM 2016 72 53 38 27 12.63
Att-RegionCNN + LSTM 2016 74 54 38 27 12.60
Att-GT + LSTM 2016 76 57 41 29 12.52
HUMAN 2016 70 22.51 26.31
Table 4
Image captioning techniques and their scores on Flickr30k dataset
Approaches Year B-1 B-2 B-3 B-4 BLEU M PPL
Flickr30k
RNN 2013 6.29 12.34 26.94
RNN + IF 2013 12.59 15.56 23.74
Microsoft Bidirectional Retrieval 2014 12.60 16.42 22.51
m-RNN [28] 2014 55
MNLM [25] 2014 56
Mao et al. [28] 2014 55 24 20 35.11
Google NIC [48] 2014 66.3 42.3 27.7 18.3
Chen and Zitnik [47] 2014 12.6
LRCN [35] 2014 58.8 39.1 25.1 16.5
NIC [48] 2015 66
Karpathy and Li [41] 2015 57.3 36.9 24 15.7
Xu et al. (Hard-Attention) 2015 67 44 30 20
Att-SVM + LSTM 2016 68 49 33 23
Att-GlobalCNN + LSTM 2016 70 50 35 27
Att-RegionCNN + LSTM 2016 73 55 40 28
Att-GT + LSTM 2016 78 57 42 30
HUMAN 2016 68 19.62 23.76
Table 5
Image captioning techniques and their scores on MSCOCO dataset
Approaches Year B-1 B-2 B-3 B-4 BLEU M C PPL
MSCOCO
Random 2005 4.6 9.0 5.1
Nearest Neighbour 2006 48.0 28.1 16.6 9.9 15.7 36.5
RNN 2013 4.63 11.47 18.96
RNN + IF 2013 16.60 19.24 15.39
RNN + IF + FT 2013 16.77 19.41 14.90
Microsoft Bidirectional Retrieval 2014 18.35 20.04 14.23
Microsoft Bidirectional Retrieval + FT 2014 18.99 20.42 13.98
Google NIC [47] 2014 66.6 46.1 32.9 24.6
Chen and Zitnik [47] 2014 19 20.4
LRCN [35] 2014 62.8 44.2 30.4
NIC [48] 2015 27.7 23.7 85.5
Karpathy and Li [41] 2015 62.5 45 32.1 23 19.5 66.0
Xu et al. (Hard-Attention) 2015 72 50 36 25 23
Att-SVM + LSTM 2016 69 52 38 28 23 82 12.62
Att-GlobalCNN + LSTM 2016 72 54 40 30 25 83 11.39
Att-RegionCNN + LSTM 2016 74 56 42 31 26 94 10.49
Att-GT + LSTM 2016 80 64 50 40 28 107 9.6
HUMAN 2016 21.7 20.19 24.94 85.4
such as Bleu 1, 2, 3, 4, Meteor, Cider [45] etc. Results compiled in this manner allow us to see clearly how image captioning techniques have evolved over the years, and to observe the large positive change in the evaluated scores.

BLEU (bilingual evaluation understudy) [43] and METEOR (Metric for Evaluation of Translation with Explicit Ordering) [44] are metrics generally used for the evaluation of machine translation output. R@K is the recall rate of a ground-truth sentence or image among the first K retrieved results. Some cells in the tables are left empty, as the corresponding scores were not calculated. Tables 1, 2, 3, 4, 5, 6 and 7 show the comparison of scores for various datasets using different techniques in chronological order.

5. Challenges

The possibility of developing intelligent computer programs that can correctly interpret and caption photos has intrigued machine learning experts for decades. However, it was only a few years ago that significant progress was made in this field. We have come a long way from template-based techniques to deep learning ones with attention models, but
Table 6
Image captioning techniques and their recall scores on Flickr8k dataset
Flickr8k Image Annotations Image Search
APPROACHES YEAR R@1 R@5 R@10 med r R@1 R@5 R@10 med r
Random 2005 0.1 0.5 1 631 0.1 0.5 1 500
DeFrag [24] 2014 13 33 44 14 10 30 43 15
m-RNN [28] 2014 15 37.2 49 11 12 31 42 15
MNLM [25] 2014 18 55 8 13 52 10
Socher-decaf [53] 2014 4.5 18 28.6 32 6.1 18.5 29 29
Socher-avg-rcnn [53] 2014 6 22.7 34.0 23 6.6 21.6 31.7 25
DeViSE-avg-rcnn [55] 2014 4.8 16.5 27.3 28 5.9 20.1 29.6 29
DeepFE decaf [54] 2014 5.9 19.2 27.3 34 5.2 17.6 26.5 32
DeepFE-rcnn [24] 2014 12.6 32.9 44 14 9.7 29.6 42.5 15
m-RNN-decafe [28] 2014 14.5 37.2 48.5 11 11.5 31.0 42.4 15
NIC [48] 2015 20 61 6 19 64 5
MNLM [25] 2014 13.5 36.2 45.7 13 10.4 31 43.7 14
MNLM [25] (oxford-net) 2014 18 40.9 55 8 12.5 37 51.5 10
Table 7
Image captioning techniques and their recall scores on Flickr30k dataset
Flickr30k Image Annotations Image Search
APPROACHES YEAR R@1 R@5 R@10 med r R@1 R@5 R@10 med r
Random 2005 0.1 0.6 1.1 631 0.1 0.5 1 500
DeFrag [24] 2014 16 40.2 55 8 10 31.4 45 13
m-RNN [28] 2014 18 40.2 51 10 13 31.2 42 16
MNLM [25] 2014 23 63 5 17 57 8
DeViSE-avg-rcnn [54] 2014 4.8 16.5 27.3 28 5.9 20.1 29.6 29
DeepFE-rcnn [24] 2014 16.4 40.2 54.7 8 10.3 31.4 44.5 13
m-RNN-decafe [28] 2014 18.4 40.2 50.9 10 12.6 31.2 41.5 16
SDT-RNN (Socher et al. [57]) 2014 9.6 29.8 41.1 16 8.9 29.8 41.1 16
Kiros et al. [25] 2014 14.8 39.2 50.9 10 11.8 34.0 46.3 13
Donahue et al. [35] 2014 17.5 40.3 50.8 9
Vinyals et al. 2014 23 63 5 17 57 8
NIC [48] 2015 17 56 7 17 57 7
DeFrag (Karpathy et al. [24]) 2015 19.2 44.5 58.0 6 12.9 35.4 47.5 10.8
DepTree edges(Karpathy et al.) 2015 20 46.6 59.4 5.4 15 36.5 48.2 10.4
BRNN (Karpathy et al.) 2015 22.2 48.2 61.4 4.8 15.2 37.7 50.5 9.2
MNLM [25] 2014 14.8 39.2 50.9 10 11.8 34 46.3 13
MNLM [25](oxfordnet) 2014 23 50.7 62.9 5 16.8 42 56.5 8
still there are a lot of challenges that need to be overcome.

One of the challenges is the prudent use of an attention system which would describe individual components of an image, rather than just the image as a whole, in order to create a holistic description of the complete picture. The challenge here is to incorporate more knowledge than just what the model is trained on. This includes understanding the context of the image and incorporating world knowledge while generating captions, just as humans would do. Only in the last year have a few researchers started working on this issue; however, significant improvements have not yet surfaced. Better performance can be expected by choosing a superior image encoder, fine-tuning it and setting up ensemble models.

The performance of a system can be judged better if we have better evaluation and ranking metrics. While most of the approaches discussed above use BLEU scores to compare their results with the ground truth, which suggests that this metric is a benchmark for evaluation with some obvious advantages, a number of shortcomings have been noticed. It has been noted that BLEU cannot deal with languages lacking word boundaries. Another problem is its bias towards shorter translations. We could use other automated metrics involving human effort, such as HyTER; however, these are still just an approximation.

6. Future scope

The field of image captioning has been researched for decades, as we just saw. However, there is immense scope still left to explore. Though most of the recent studies have been fairly successful in describing the image correctly, human-level accuracy and descriptiveness still seem far-fetched.

This all boils down to one thing: knowledge. Humans, while thinking of a caption, use the entire knowledge base that they have been acquiring for years. Hence the emotions, the extra world knowledge and the power of expression which humans possess are sufficient for any human to beat a machine at this so-simple-for-human task.

So the need of the future is to have an excellent knowledge base and the hardware power to train a model to use that entire knowledge feasibly, in order for the machine to develop an entire multi-dimensional context, so that any open-ended question related to the image can be answered irrespective of the attributes simply detected by a computer vision system.

This is the reason why the major search engine corporations, like Google and Microsoft (Bing), hold the best cards: they can utilize the power of humongous databases, turn them into knowledge bases and realize the future of this technology. Microsoft's “CAPTION BOT” is an excellent example of this initiative; it uses the power of emotions, computer vision and, most importantly, Bing to give really fantastic results.

7. Conclusion

We classified and discussed 8 major approaches used for image captioning according to the order in which they developed. We discussed how and why each approach evolved to solve the shortcomings of the previous one. We then explained each of the approaches in detail with the help of a particular study and lastly compared the results of various experiments conducted so far using popular metrics such as BLEU scores, METEOR, CIDER etc. We were able to observe clearly the large positive difference in the scores.

Acknowledgments

Our very sincere thanks to Shubham Thakkar, Saumya Gupta and Shubham Singh, who helped us throughout the formulation of this survey paper. This would not have been possible without their constant support.

References

[1] G.A. Miller, WordNet: A lexical database for English, Communications of the ACM 38(11) (1995), 39–41.
[2] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997.
[3] E. Reiter and R. Dale, Building Natural Language Generation Systems, Cambridge University Press, 2000.
[4] M. Grubinger, P. Clough, H. Müller and T. Deselaers, The IAPR TC-12 benchmark: A new evaluation resource for visual information systems.
[5] L.-J. Li and L. Fei-Fei, What, where and who? Classifying events by scene and object recognition, ICCV, 2007.
[6] L.-J. Li, R. Socher and F.-F. Li, Towards total scene understanding: Classification, annotation and segmentation in an automatic framework, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 2036–2043.
[7] A. Farhadi, I. Endres, D. Hoiem and D. Forsyth, Describing objects by their attributes, Proceedings of CVPR, 2009.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and F.-F. Li, ImageNet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 248–255.
[9] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier and D. Forsyth, Every picture tells a story: Generating sentences from images, In ECCV, 2010.
[10] C. Rashtchian, P. Young, M. Hodosh and J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010, pp. 139–147.
[11] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky and S. Khudanpur, Recurrent neural network based language model, In INTERSPEECH, 2010.
[12] S. Petrov, Berkeley parser, GNU General Public License v.2, 2010.
[13] R. Kiros, R. Zemel and R. Salakhutdinov, Multimodal neural language models, In NIPS Deep Learning Workshop, 2013.
[14] S. Li, G. Kulkarni, T.L. Berg, A.C. Berg and Y. Choi, Composing simple image descriptions using web-scale n-grams, In Conference on Computational Natural Language Learning, 2011.
[15] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A.C. Berg and T.L. Berg, Baby talk: Understanding and generating simple image descriptions, In CVPR, 2011.
[16] Y. Yang, C.L. Teo, H. Daume III and Y. Aloimonos, Corpus-guided sentence generation of natural images, In EMNLP, 2011.
[17] V. Ordonez, G. Kulkarni and T.L. Berg, Im2text: Describing images using 1 million captioned photographs, In NIPS, 2011.
[18] Y. Feng and M. Lapata, Automatic Caption Generation for News Images, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(4) (April 2013), 797–812. doi: 10.1109/TPAMI.2012.118.
[19] A. Gupta and P. Mannem, From image annotation to image description, In Neural Information Processing, Springer, 2012.
[20] A. Krizhevsky, I. Sutskever and G.E. Hinton, Imagenet classification with deep convolutional neural networks, In NIPS, 2012.
[21] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos and H. Daume III, Midge: Generating image descriptions from computer vision detections, In EACL, Association for Computational Linguistics, 2012, pp. 747–756.
[22] M. Hodosh, P. Young and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, JAIR 47 (2013).
[23] R. Kiros, R. Zemel and R. Salakhutdinov, Multimodal neural language models, In NIPS Deep Learning Workshop, 2013.
[24] A. Karpathy, A. Joulin and F.-F. Li, Deep fragment embeddings for bidirectional image sentence mapping, NIPS, 2014.
[25] R. Kiros, R. Salakhutdinov and R.S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, arXiv:1411.2539 (2014).
[26] P. Kuznetsova, V. Ordonez, T. Berg and Y. Choi, Treetalk: Composition and compression of trees for image descriptions, ACL 2(10) (2014).
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C.L. Zitnick, Microsoft COCO: Common objects in context, arXiv:1405.0312 (2014).
[28] J. Mao, W. Xu, Y. Yang, J. Wang and A. Yuille, Explain images with multimodal recurrent neural networks, arXiv:1410.1090 (2014).
[29] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning and A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL, 2014.
[30] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt et al., From captions to visual concepts and back, arXiv preprint arXiv:1411.4952 (2014).
[31] C. Szegedy, S. Reed, D. Erhan and D. Anguelov, Scalable, high-quality object detection, arXiv preprint arXiv:1412.1441 (2014).
[32] R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg and F.-F. Li, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV), April 2015, p. 142.
[34] X. Chen and C.L. Zitnick, Mind's Eye: A Recurrent Visual Representation for Image Caption Generation, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[35] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko and T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[36] J. Mao, W. Xu, Y. Yang, J. Wang and A. Yuille, Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), in Proc. Int. Conf. Learn. Representations, 2015.
[37] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle and A. Courville, Describing videos by exploiting temporal structure, in Proc. IEEE Int. Conf. Comp. Vis., 2015.
[38] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig and M. Mitchell, Language models for image captioning: The quirks and what works, arXiv preprint arXiv:1505.01809 (2015).
[39] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proc. Int. Conf. Learn. Representations, 2015.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, Going deeper with convolutions, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[41] A. Karpathy and F.-F. Li, Deep Visual-Semantic Alignments for Generating Image Descriptions, arXiv:1412.2306v2 (2015).
[42] Q. Wu, P. Wang, C. Shen, A. Dick and A.V.D. Hengel, Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[43] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, Bleu: A method for automatic evaluation of machine translation, In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[44] S. Banerjee and A. Lavie, Meteor: An automatic metric for MT evaluation with improved correlation with human judgments, In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[45] R. Vedantam, C.L. Zitnick and D. Parikh, Cider: Consensus-based image description evaluation, CVPR, 2015.
[46] J. Johnson, A. Karpathy and F.-F. Li, DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CoRR abs/1511.07571 (2015).
[47] X. Chen and C.L. Zitnick, Mind's eye: A recurrent visual representation for image caption generation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2422–2431.
[48] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, Show and tell: A neural image caption generator, arXiv preprint arXiv:1411.4555 (2014).
[49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun, OverFeat: Integrated recognition, localization and detection using convolutional networks, ICLR, 2014.
[50] T. Mikolov, A. Deoras, D. Povey, L. Burget and J. Cernocky, Strategies for training large scale neural network language models, In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, IEEE, 2011, pp. 196–201.
[51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, Caffe: Convolutional architecture for fast feature embedding, arXiv preprint arXiv:1408.5093 (2014).
[52] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, M. Bernstein and F.-F. Li, Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, arXiv:1602.07332 (2016).
[53] R. Socher, Q. Le, C. Manning and A. Ng, Grounded compositional semantics for finding and describing images with sentences, In NIPS Deep Learning Workshop, 2013.
[54] A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., Devise: A deep visual-semantic embedding model, In Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[55] A. Karpathy, A. Joulin and F.-F. Li, Deep fragment embeddings for bidirectional image sentence mapping, arXiv preprint arXiv:1406.5679 (2014).
[56] A. Gupta, Y. Verma and C. Jawahar, Choosing linguistics over vision to describe images, In AAAI, 2012.
[57] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning and A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL, 2014.
Copyright of International Journal of Hybrid Intelligent Systems is the property of IOS Press
and its content may not be copied or emailed to multiple sites or posted to a listserv without
the copyright holder's express written permission. However, users may print, download, or
email articles for individual use.