International Journal of Hybrid Intelligent Systems 14 (2017) 123–139 123
DOI 10.3233/HIS-170246
IOS Press
A survey of evolution of image captioning techniques
Akshi Kumar (a,1,∗) and Shivali Goel (b,1)
(a) Faculty of Computer Science and Engineering, Delhi Technological University, Shahbad Daulatpur, Delhi – 42, India
(b) Department of Information Technology, Delhi Technological University, Shahbad Daulatpur, Delhi – 42, India
Abstract. Automatic captioning of images has been explored extensively in the past 10 to 15 years. It is one of the elementary problems in Computer Vision and Natural Language Processing and has a vast array of real-world applications. In this survey, we study the different approaches used for generating image captions in chronological order, starting from the basic template-based caption generation model and ending with neural networks combined with external world knowledge. We review existing models in detail, highlighting the methodologies involved and the improvements made to them over time. We give an overview of the standard image datasets and of the evaluation measures developed to assess the quality of generated image captions. Apart from the basic benchmarks, we also note speed and accuracy improvements across the different approaches. Finally, we investigate further possibilities in automatic image caption generation.
Keywords: Computer Vision, image captioning, deep learning, object recognition, Natural Language Processing
1 These authors contributed equally.
∗ Corresponding author: Akshi Kumar, Faculty of Computer Science and Engineering, Delhi Technological University, Shahbad Daulatpur, Main Bawana Road, Delhi – 42, India. E-mail: akshikumar@dce.ac.in.
1448-5869/17/$35.00 © 2017 – IOS Press. All rights reserved

1. Introduction

The last two decades have seen great progress and enthusiasm in the fields of Computer Vision and Natural Language Processing. The main goals of these fields include generating automatic text descriptions and comprehending images and videos. These problems have their roots in AI and ML, which are themselves in an early phase considering their vast potential. Moreover, the two fields have largely been investigated separately, which makes it important to study their combined scope and to investigate the possibilities further.

Image captioning has long been viewed as a challenging problem because it requires identifying the different segments of an image correctly, identifying the connections between them, and finally weaving them together in a syntactically and semantically correct manner to describe the most salient aspect of the image. To accomplish this task, we need technology that can fully understand the image and can apply external world knowledge to generate a description that is most true to the image. In other words, it is equivalent to mimicking the human capability to compress the most salient features of an image into the most descriptive, yet true-to-the-image, description. This is a herculean task compared with conventional computer vision systems that simply identify what is present in the image.

However, with most of the world's population connecting online, the need for an image-caption generation system has grown sharply. A large amount of multi-modal data arrives online every second in the form of millions of tagged photographs on Facebook and in cloud storage. The image classification and story-making feature of Google Photos is a good example. Also, with the abundance of video content, the need for automatic subtitles for videos has risen. Automatic image captioning will help in organizing the millions of unstructured and unclassified images on the Internet and, on humanitarian grounds, will also help the visually impaired to perceive images. Therefore, the pressing need for image-caption generation cannot go unnoticed, and hence many research papers and conferences on the topic are appearing throughout the world.

At the heart of this technology is the power of neural networks, which makes it seem like magic. While the Convolutional Neural Network (CNN) layers perform the prime task of identifying the salient features of an image, the Recurrent Neural Network makes it possible to construct (almost) meaningful captions from the identified visual features. We will see how the technology advanced from basic tree parsing to neural networks that use hundreds of millions of features to produce results competent enough to be compared with human-generated captions. We unwind these technologies one by one, in brief, in this survey.

Our aim in this survey is to present, analyze and compare the research of the last decade in this ground-breaking field within a few pages. We aim to identify chronologically the technological developments, the approaches involved, their drawbacks and how well they perform on various metrics, and to investigate the future scope of automatic image caption generation.

2. Evolution of image captioning techniques

The paper is structured chronologically: we start from the basic techniques, discuss their methodology and shortcomings, and then move to a newer approach that solved the problems of the previous one. Figure 1 shows the timeline we have developed to help keep track of the advancements in this field and to better understand the need for, and basis of, the development of newer models.

The years 2005 to 2010 saw the birth of the major approaches dealing with computer vision, mapping the detected objects to words, and weaving them into a meaningful or stylish description. Related work includes that of Li et al. [5], Farhadi et al. [9] and Li et al. [6]. This period also saw approaches that add knowledge from the corresponding domain which ordinary computer vision cannot see (Yang et al. [16]), as well as approaches that reuse existing captions of locally (Ordonez et al. [17]) and globally (Farhadi et al. [7]) similar images to get rid of the need for structuring self-made sentences.

In 2010 and 2011 some work still focused on primitive techniques for text generation. For example, Kulkarni et al. [15] used template-based description generation, Farhadi et al. [7] grouped the computer vision detections into triplets and then used them to generate descriptions based on templates, and Li et al. [14] generated descriptions by merging the objects detected by computer vision using proper semantic relationships.

The year 2012 was a highlight in the era of automated image captioning techniques, as ImageNet classification was performed with a deep CNN (convolutional neural network) that had 60 million parameters, half a million neurons and five convolutional layers.

From 2013 onwards, techniques involving recurrent networks started gaining momentum. Mao et al. [28], Vinyals et al. [48], Kiros et al. [25], Fang et al. [30] and Chen et al. [47] used a recurrent NN, and Kiros et al. [23] used a feedforward one. Kiros et al. [25] also proposed to create a "multimodal embedding space" by using a vision model and an LSTM to encode text.

The years 2014 and 2015 saw the evolution of rich feature hierarchies for accurate, high-quality object detection and of going deeper with convolutions. The fast progress in object detection [18,31–33,49] was reflected in image captioning models that labeled multiple regions of an image.

The years 2015 and 2016 saw the evolution of "very deep convolutional networks" for large-scale image recognition, together with multimodal R-CNN and various other techniques [24,34–40,48]. We discuss in detail 8 prominent categories of approaches belonging to different time periods, starting with one of the very first techniques and ending with a very recent one.

2.1. Template Based (Tree parser)

The template-based technique is discussed using the work of Margaret Mitchell et al. [21], which was published in 2012.

2.1.1. Problems faced till now
Template-based caption generation was used earlier by Kulkarni et al. [15]. The system of Yang et al. [16] substitutes probable prepositions, verbs and interjections after parsing the UIUC Pascal-VOC dataset (Farhadi et al. [9]), choosing head nouns and their dependents by maximum likelihood, calculated from the ratio of their individual logs. However, template-based techniques can generate only predictable, consistent sentences, not novel captions. Ordonez et al. [17] matched the query image from a much larger
set of existing captioned photographs, followed by local reordering. Although natural, these captions are not true to the image: they mainly describe the similar images and may miss unique features of the query image.

Fig. 1. Timeline indicating the progress in automated image captioning.

2.1.2. Overview
Midge uses syntactic knowledge of the probability distribution of the next words that should appear after a given sequence of words. The generator uses constraints to filter out the noisy output of the vision system and to generate syntactic trees that describe the image content detected by computer vision.

2.1.3. Model/Methodology
For training, 700,000 images from the Flickr dataset were taken, with their descriptions, from the dataset used in Ordonez et al. [17]. Before parsing, the descriptions were normalized. Parsing was done using the Berkeley parser (Petrov [12]). Once a head noun was selected for formulating a description, probabilities were calculated for determiners (the, a, an) and pre-nominal modifiers/adjectives.

Head nouns were identified, and physical objects were distinguished using WordNet (Miller [1]), from the detections of the vision system. A maximum of 3 objects was kept in a single-sentence description. Caption generation was treated as a problem of growing a syntactically and semantically informed tree based on the detected object nouns. Tree growth was achieved through lexicalized syntactic derivations anchored on the head nouns detected above.

A three-step growth process was followed (Reiter and Dale [3]). Content determination was used for grouping and then ordering the object nouns, generating their local subtrees, and filtering irregular detections. Micro-planning was done to generate full syntactic trees around the detected noun objects. In the surface realisation step, modifiers were selected and classified as post-nominal or pre-nominal, and the final outputs were chosen. The system followed an approach in which multiple trees are grown and the best one is then chosen as the result. Contexts (nouns) for adjectives were weighted using Point-wise Mutual Information, and for any adjective only the best 1000 nouns were selected.

In micro-planning, fully grown trees are generated by taking the intersection of the subtrees created in content determination. Subtrees surrounding a noun in position 1 are merged directly with subtrees surrounding a noun in position 2, because the nouns are ordered. In the surface realisation stage, the system chooses the single most probable tree from all the generated trees and removes the mark-up to produce a final string. Different strings may be generated depending on the specifications given by the user. The final string is then the one with the most words.

2.1.4. Evaluation metrics
A 5-point Likert scale was used, with human judgments collected through Amazon's Mechanical Turk (Amazon, 2011). The system was also evaluated against the Kulkarni et al. system, the Yang et al. system, and human-generated descriptions on the same images. The evaluation criteria were grammar, main aspects, correctness, order and human-likeness. The results were analysed using the non-parametric Wilcoxon Signed-Rank test, in which the parameter for comparing different systems is the median value.

2.1.5. Dataset
Training dataset: 700,000 images with their associated descriptions from the Flickr dataset in Ordonez et al. [17]. Testing dataset: 840 PASCAL images.

2.1.6. Results
Midge performed better than all earlier automatic approaches on the criteria of correctness and order. It additionally performed better than Yang et al. on the
criterion of how close its sentences are to human-generated ones.
Figure 3 shows captions generated by this method.

Fig. 2. Tree generated using Midge approach during the tree growth process.
Fig. 3. Captions generated as in [21] L to R: The bus by the road with a clear blue sky; People with a bottle at the table; A person in black with a black dog by potted plants.

2.2. Encoder – Decoder based Caption Generation

This technique is discussed using the work of Kiros, Salakhutdinov, and Zemel, "Unifying visual-semantic embeddings with multimodal neural language models" [25], which was published in 2014.

2.2.1. Problems faced till now
Descriptions produced by earlier strategies were mechanical in nature and failed to match the fluency of captions written by humans. The BLEU and ROUGE [22] evaluation measures were unreliable and did not match human perception.

2.2.2. Overview
The encoder (LSTM) ranks captions and pictures and develops sensible scoring functions, and the decoder (SC-NLM) optimises the scoring functions as a way of generating and scoring new descriptions via the learnt representations.

2.2.3. Model/Methodology
Sentences were encoded using Long Short-Term Memory (LSTM) recurrent neural networks [2]. Image features from a deep CNN were projected into the embedding space of the LSTM hidden states. A joint image-sentence embedding was learnt by minimizing a pairwise ranking loss, so as to learn to rank pictures and their descriptions. Images and descriptions were then ranked. For decoding, a structure-content neural language model (SC-NLM) was used, in which the structure of a sentence was disentangled from its content, conditioned on the distributed representations created by the encoder. Sensible image captions were generated by sampling from the SC-NLM. The decoder thus generated new captions from scratch.

The problem of ranking pictures and captions was used as a proxy for generation. Optimising this task should lead to improvements in generation, because any generation system needs a scoring function that measures how well a caption and a picture match.

2.2.4. Evaluation metrics
Median rank (Med r) and Recall@K (R@K).

2.2.5. Dataset
Flickr30K and Flickr8K.

2.2.6. Results
The methods described in this paper generated descriptions of higher quality than the then state-of-the-art methods, which were based on composition-based strategies. The authors also worked on attention-based models that could learn to align parts of captions to pictures and use these alignments to determine where to attend next, thus dynamically modifying the decoder's conditioning vectors. Figure 4 shows captions generated by this method.

Fig. 4. Captions generated as in [25] L to R: A parked car while driving down the road; A little boy with a bunch of friends on the street; There is a cat sitting on a shelf.

2.3. Extracting Visual Features (RNN) + Maximum Entropy Language Model-Caption Generation

The approach is discussed using "Mind's Eye: A Recurrent Visual Representation for Image Caption Generation" [47].
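Looking back at the encoder of Section 2.2.3, the pairwise ranking loss used to learn the joint image-sentence embedding can be illustrated with a small sketch. This is not the authors' code: it assumes cosine similarity as the scoring function, a hypothetical margin of 0.2, and that every non-matching item in a batch serves as a contrastive example. The function names and toy vectors are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_ranking_loss(image_embs, sentence_embs, margin=0.2):
    """Sum of hinge penalties over all mismatched image-sentence pairs.

    image_embs[i] and sentence_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a contrastive example.
    """
    n = len(image_embs)
    scores = [[cosine(img, sen) for sen in sentence_embs] for img in image_embs]
    loss = 0.0
    for i in range(n):
        for k in range(n):
            if k == i:
                continue
            # Image i should score higher with its own sentence than with sentence k.
            loss += max(0.0, margin - scores[i][i] + scores[i][k])
            # Sentence i should score higher with its own image than with image k.
            loss += max(0.0, margin - scores[i][i] + scores[k][i])
    return loss

# Toy 2-D embeddings (hypothetical values, not taken from any dataset).
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_sentences = [[1.0, 0.0], [0.0, 1.0]]
swapped_sentences = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = pairwise_ranking_loss(aligned_images, aligned_sentences)
loss_swapped = pairwise_ranking_loss(aligned_images, swapped_sentences)
```

In this toy setting the correctly aligned pairs incur no penalty, while swapping the sentences produces a positive loss, which is the signal that pushes matching images and captions together in the embedding space.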
2.3.1. Problems faced till now
Many previous papers experimented with projecting image features and their associated descriptions into a common space [6,7,13], which is useful for image search or for ranking image captions. Various approaches were used to learn these projections: Kernel Canonical Correlation Analysis (KCCA) [22], recursive neural networks [29] and deep neural networks [24]. While these techniques projected both the visual features and the associated semantics into a joint embedding, they failed to perform the inverse projection. That is, they could not generate fresh sentences or visual depictions from those joint embeddings.

2.3.2. Overview
This paper explored the bi-directional mapping between images and their sentence-based descriptions using a recurrent neural network. A new recurrent visual memory was deployed that automatically learned to remember long-term visual concepts, to help in both sentence generation and visual feature reconstruction.

2.3.3. Model/Methodology
To accomplish the bi-directional mapping, a set of latent variables $U_{t-1}$ was introduced that encoded the visual interpretation of the previously read/generated words $W_{t-1}$. The latent variable $U$ acted as a long-term visual memory of the previously generated/read words (which was the heart of this paper). $U$ was used to calculate $P(w_t \mid V, W_{t-1}, U_{t-1})$ and $P(V \mid W_{t-1}, U_{t-1})$. Combining these two probabilities, the authors aimed to maximize

$$P(w_t, V \mid W_{t-1}, U_{t-1}) = P(w_t \mid V, W_{t-1}, U_{t-1}) \, P(V \mid W_{t-1}, U_{t-1}).$$

That is, given the previous words and their visual interpretation, the authors aimed to maximize the likelihood of the word $w_t$ and the observed visual features $V$.

Language Model: The language model worked with 3000 to 20000 words using the word classing approach [11]: $P(w_t \mid \cdot) = P(c_t \mid \cdot) \, P(w_t \mid c_t, \cdot)$, where $P(w_t \mid \cdot)$ is the probability of the word and $P(c_t \mid \cdot)$ is the probability of the class. Words were categorized into classes by clustering on their frequency distribution. To overcome the uncertainty about which word to generate, the authors combined the output of the RNN model with the output of a Maximum Entropy language model [50], which was trained simultaneously. The context was kept short by limiting the number of words the Maximum Entropy model looks back to three in all experiments.

Learning: The weights were updated online using the BPTT (Back-propagation Through Time) algorithm. The activation function used for all units was the sigmoid function $\sigma(s) = 1/(1 + \exp(-s))$ with clipping, and soft-max was used for word predictions. The authors used the open-source RNN code of [11] and the Caffe framework [51] to implement this model. They used a pre-trained 1000-class ImageNet [8] model rather than training from scratch, to prevent overfitting.

The recurrent hidden state $s$ supplies context on the basis of the previously observed words. $v$ represents the set of observed visual features (assumed to be constant). These visual features help in making an informed selection of words. For example, if a girl was detected, the probability of the word "girl" appearing next automatically gets higher. $u$ represents the hidden recurrent layer used to reconstruct the visual features from the given words (so that the bi-directional mapping can be facilitated). $w_t$ is the next word predicted using this visual hidden layer, i.e. the network uses its visual memory $u$, along with the currently observed visual features $v$, to predict the next word. Before any words are observed, $u$ makes purely random guesses of the visual features. However, as more words are predicted, the visual feature probabilities are revised to predict words that more closely depict the actual scene.

Fig. 5. 1. Part of the model needed for generating sentences from visual features and vice versa. 2. Sentences to Visual Features. 3. Visual Features to sentence.

2.3.4. Evaluation metrics
Perplexity (PPL), BLEU, METEOR (METR), human subjects, Recall@1,5,10.

2.3.5. Dataset
PASCAL 1K, Flickr 8K and 30K, MS COCO.

2.3.6. Results
Figure 6 shows captions generated by this method.

2.4. Object Detection (CNN) + Caption Generation Model (RNN)

This technique is discussed using the work of Vinyals, Toshev, Bengio, and Erhan, "Show and tell: A neural image caption generator" [48], which was published in 2014.

2.4.1. Problems faced till now
Text generation in previous works was rigid and excessively handcrafted. It could not create descriptions of previously unobserved arrangements of objects, even if the individual objects had been seen in the training set.

2.4.2. Overview
An end-to-end system was proposed that combined state-of-the-art sub-networks for object detection and caption generation. This neural network was trained extensively using stochastic gradient descent and described the subject matter of an image in well-formed English sentences.

2.4.3. Model/Methodology
The model was based on a neural, probabilistic architecture that produces a caption given an image as input, applying the principle of translation to generate the description (similar to how text is translated between two languages).

The model first uses the following formula to maximize the probability of the correct description:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta) \tag{1}$$

Here, $\theta$ represents the model parameters, $I$ the input image and $S$ its correct description. The chain rule is then applied to calculate the joint probability over all words
in the sentence $S_0, \ldots, S_N$:

$$\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}) \tag{2}$$

The variable number of words conditioned on up to $t-1$ is expressed by a fixed-length hidden state $h_t$, which is updated using a non-linear function whenever a new input $x_t$ is seen:

$$h_{t+1} = f(h_t, x_t) \tag{3}$$

This non-linear function $f$ was specified by an LSTM network, which took words and images as inputs $x_t$ and was trained to predict one word of the sentence at a time, given the observed image and all the preceding words, as $p(S_t \mid I, S_0, \ldots, S_{t-1})$.

The loss was the sum of the negative log-likelihoods of the correct word at each time step, as given below, and it was minimized with respect to all parameters of the LSTM network, the word embeddings $W_e$ and the top layer of the convolutional neural network.

$$L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t) \tag{4}$$

This paper used the beam search approach with a beam of size 20. The authors also experimented with greedy search, i.e. a beam of size 1, only to find that it degraded the results by an average of 2 BLEU points (the other technique explored was sampling).

2.4.4. Evaluation metrics
BLEU-4, METEOR, CIDEr. Ranking metric: Recall@k (@1 and @10).

2.4.5. Dataset
Pascal VOC 2008, Flickr8k, Flickr30k, MSCOCO, SBU.

2.4.6. Results
NIC performed better than various other approaches, e.g. Tri5Sem, Im2Text, BabyTalk and the previous state of the art (SOTA), and was quite close to the ground truth. Figure 7 shows captions generated by this method.

Fig. 6. Captions generated as in [47] L to R: A train is stopped at a train station; A group of people standing on a snow covered slope; A group of people that are standing in front of a building.
Fig. 7. Captions generated as in [48] L to R: A red motorcycle parked on the side of the road; A group of young people playing a game of frisbee; Two dogs play in the grass.

2.5. Deep RNN + Deep CNN + multimodal layer interactions

This technique is discussed using the work of Mao et al., "Explain images with multimodal recurrent neural networks" [28], which was published in 2014.
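The difference between beam search and greedy search reported in Section 2.4.3 can be made concrete with a minimal sketch. It assumes a hypothetical `next_word_probs` function standing in for the trained LSTM, and a toy probability table invented purely for illustration; it is not the decoding code of [48].

```python
import math

def beam_search(next_word_probs, beam_size, end_token="<end>", max_len=10):
    """Return the highest-scoring sentence found with a beam of `beam_size`.

    `next_word_probs(prefix)` is a hypothetical stand-in for the trained LSTM:
    it maps a tuple of previous words to a dict {word: probability > 0}.
    """
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, prob in next_word_probs(prefix).items():
                candidates.append((prefix + (word,), score + math.log(prob)))
        # Keep only the `beam_size` best partial sentences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == end_token:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached the end token.
    best_prefix, best_score = max(finished or beams, key=lambda item: item[1])
    return list(best_prefix), best_score

def toy_model(prefix):
    """Toy next-word distribution (made-up numbers) where greedy is suboptimal."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    return table.get(prefix, {"<end>": 1.0})

greedy_words, greedy_score = beam_search(toy_model, beam_size=1)
beam_words, beam_score = beam_search(toy_model, beam_size=2)
```

With a beam of size 1 the search commits to the locally best first word and ends with a lower-probability sentence, whereas a beam of size 2 keeps the alternative first word alive long enough to find the better caption. The same effect, at much larger scale, is one plausible reading of the BLEU drop the authors observed with greedy search.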
2.5.1. Problems faced till now
Earlier works extracted features for sentences and pictures and mapped them into a common semantic embedding space. These strategies addressed tasks such as retrieving sentences given an image, or retrieving images given a sentence, but only when the items already existed in the database; they lacked the flexibility to caption new pictures containing objects and scenes not seen before.

2.5.2. Overview
The model contains 2 sub-networks: a deep RNN (Recurrent Neural Network) for sentences and a deep CNN (Convolutional Neural Network) [20] for images. These two sub-networks interact with each other in a multimodal layer, and the complete model is known as the m-RNN model. It models the probability distribution of generating a word given the previous words and the picture, and image descriptions are generated by sampling from this distribution.

2.5.3. Model/Methodology
There are 6 layers in each time frame: the first is the input word layer, the next two are word embedding layers, then comes the recurrent layer, then the multimodal layer where the connection is made, and the last is the softmax layer.

$$\log_2 \mathrm{PPL}(w_{1:L} \mid I) = -\frac{1}{L} \sum_{n=1}^{L} \log_2 P(w_n \mid w_{1:n-1}, I) \tag{5}$$

$$c = \frac{1}{N} \sum_{i=1}^{N} L \cdot \log_2 \mathrm{PPL}(w_{1:L}^{(i)} \mid I^{(i)}) + \|\theta\|_2^2 \tag{6}$$

where PPL is the perplexity of the sentence, $\theta$ denotes the model parameters and $c$ is the cost calculated for the model.

Sentence generation started from the start sign ##START##: the model calculated the probability distribution of the next word given the previous words and the picture. The next word could then be picked by sampling from this distribution; in practice, however, the word with the maximum probability was selected, since this performed slightly better than sampling. The picked word was then fed back into the model, and the process continued until the model output the end sign ##END##.

For image retrieval, the top-ranked images were returned, where the ranking was based on the perplexity of the query sentence given each image. Sentence retrieval used a normalized probability for each sentence.

2.5.4. Evaluation metrics
Sentence perplexity and BLEU scores (B-1, B-2, B-3), R@K (K = 1, 5, 10) and Med r. For IAPR TC-12: recall accuracy curve, R@K (K = 1, 5, 10) and Med r.

2.5.5. Dataset
Flickr 8K; Flickr 30K; IAPR TC-12.

2.5.6. Results
This was the first work to incorporate an RNN in a deep multimodal architecture. Figure 8 shows captions generated by this method.

Fig. 8. Captions generated as in [28] L to R: A square with burning street lamps and a street in the foreground; Tourists are sitting at a long table with a white table cloth and are eating; A blue sky in the background.

2.6. Object Detection (R-CNN) + Localization Layer + Caption Generation Model (RNN)

This technique is discussed using the work of Johnson, Karpathy and Li, "DenseCap: Fully Convolutional Localisation Networks for Dense Captioning" [46], which was published in 2014.

2.6.1. Problems faced till now
Predictions of earlier region-based CNN-RNN models did not include context outside each region. They were also inefficient, as each region had to be forwarded independently. The localization layer was proposed to address these difficulties.

2.6.2. Overview
The paper brings together work on object detection, image captioning and the processing of particular regions of the image.

2.6.3. Model/Methodology
The Fully Convolutional Localization Network (FCLN) for dense captioning was based on CNN-RNN models for image captioning, but also included a differentiable localization layer that could be inserted into the neural network to enable localized predictions for the region proposals.

The CNN consisted of 13 layers of 3 × 3 convolutions and formed the input to the localization layer. The localization layer classified convolutional anchor boxes according to their transformations and confidence scores, and aligned the region proposals made at the start of object detection to the ground-truth boxes. Positive proposals were those that matched a ground-truth box, and their confidence scores were increased during training, while the confidence scores of negative proposals were decreased.

The recognition network processed the features of each region from the localization layer. The features of each region were flattened into a vector and passed through fully connected layers. The position of each region was refined and a confidence score was output for it.

Fig. 9. Overview of the fully convolutional localization network for dense captioning. The localization layer proposes regions and smoothly extracts a batch of corresponding activations using bilinear interpolation.

2.6.4. Evaluation metrics
METEOR, mean Average Precision (AP).

2.6.5. Dataset
MSCOCO, YFCC100M, Visual Genome (VG) dataset.

2.6.6. Results
The FCLN model performed better than the Region RNN in both ranking and localization under all metrics, such that the median rank was reduced from 7 to 5 and localization recall from 0.5IoU to 0.153.
Figure 10 shows how captions are generated by this method.

Fig. 10. The sequence of images shows Dense image captioning task using a model that generates rich and dense captions.

2.7. Semantic Alignment Models (R-CNN and B-RNN) + Description Generation Model (M-RNN)

This technique is discussed using the work of Karpathy and Li, "Deep Visual-Semantic Alignments for Generating Image Descriptions" [41], which was published in 2015.

2.7.1. Problems faced till now
Most works so far have focused on condensing the elaborate visual content of an image into just one single sentence. However, this requirement is an unnecessary restriction.
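As a worked illustration of Eq. (5) in Section 2.5.3 and of perplexity-based image retrieval, the following sketch computes the log2 perplexity of a query sentence under several candidate images and ranks the images by it. The per-word probabilities are hypothetical placeholders for what the trained m-RNN would output, and the image names are invented.

```python
import math

def log2_perplexity(word_probs):
    """Eq. (5): negative mean log2-probability of the words of one sentence.

    word_probs[n] stands for P(w_n | w_1:n-1, I), which the trained model
    would supply; here the values are simply passed in.
    """
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def rank_images(per_image_word_probs):
    """Order image ids from lowest to highest perplexity of the query sentence."""
    scored = {
        image_id: log2_perplexity(probs)
        for image_id, probs in per_image_word_probs.items()
    }
    return sorted(scored, key=scored.get)

# Hypothetical probabilities of a 3-word query sentence under two images.
query_probs = {
    "beach.jpg": [0.5, 0.25, 0.5],
    "office.jpg": [0.125, 0.125, 0.25],
}
ranking = rank_images(query_probs)
```

A lower perplexity means the sentence is less surprising given the image, so the image under which the query words are most probable is ranked first.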
2.7.2. Overview
This approach consists of two separate models: an alignment model that infers the latent alignment between contiguous groups of words in a sentence and the regions of the image they correspond to, and a second model that is trained on the inferred correspondences.

2.7.3. Model/Methodology
To detect objects in an input image, the alignment model used a Region Convolutional Neural Network (RCNN). The CNN was pre-trained on images from the ImageNet dataset and then fine-tuned on the 200 classes of the ImageNet Challenge. In addition to the whole image, the top 19 detected locations were used. The objects were represented based on the pixels inside each bounding box. The model also used a Bidirectional Recurrent Neural Network (BRNN) to compute word representations in the sentence. An image-sentence score, $S_{kl}$, aligned every word of a sentence to its single best image region.

The ultimate goal was to associate snippets of text, instead of single words, with each bounding box. Therefore, a Markov Random Field (MRF) with latent alignment variables was used to generate a number of image regions annotated with segments of text (e.g. "wooden table" for a table, "messy pile of documents" for documents).

The M-RNN model, trained on the dataset of region-level annotations produced by the previous model, took as inputs a series of input vectors and the image I. It then computed a series of hidden states, and consequently a series of outputs, using a recurrence relation, thereby generating dense descriptions of images.

2.7.4. Evaluation metrics
BLEU-1,2,3,4, METEOR, CIDEr. Ranking metrics: Recall@1,5,10 and Med r.

2.7.5. Dataset
Flickr8k, Flickr30k, MSCOCO.

2.7.6. Results
This model used very few hard-coded assumptions to formulate captions of individual image regions using conventional datasets of images and sentences. Figure 12 shows captions generated by this method.

Fig. 11. Flowchart of proposed description generation model.
Fig. 12. Captions generated as in [41] L to R: A man in black shirt is playing guitar; Two young girls are playing with lego toy; Construction worker in orange safety vest is working on road.

2.8. Object Detection (CNN) + Description Generation Model (RNN) + External Knowledge

This technique is discussed using the work of Wu et al., "Image Captioning and Visual Question Answering Based on Attributes and External Knowledge" [42], which was published in 2016.
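The image-sentence score $S_{kl}$ of Section 2.7.3, in which every word is aligned to its single best region, can be sketched as follows. The sketch assumes that the score is the sum over words of the largest region-word dot product, which is our reading of the simplified form in [41]; the vectors and the function name are hypothetical.

```python
def image_sentence_score(region_vecs, word_vecs):
    """Sum, over the words, of each word's best dot product with any region.

    Returns the score and, for every word, the index of its best region.
    """
    total = 0.0
    alignment = []
    for word in word_vecs:
        sims = [sum(r * w for r, w in zip(region, word)) for region in region_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        alignment.append(best)
        total += sims[best]
    return total, alignment

# Made-up 2-D embeddings: two image regions and three words of a sentence.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
score, alignment = image_sentence_score(regions, words)
```

The returned alignment assigns each word independently to a region; the MRF step described in the text is what would then smooth these per-word choices into contiguous text snippets per bounding box.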
2.8.1. Problems faced till now
Previous papers did not take external knowledge into account when generating captions. The importance of introducing an intermediate attribute prediction layer was also neglected by almost all previous work.

2.8.2. Overview
An intermediate attribute prediction layer, which was neglected by almost all previous work, is introduced into the predominant CNN-LSTM framework.

2.8.3. Model/Methodology
Attributes predicted by the CNN-based attribute prediction model were used to generate the captions for the image. In image captioning, the gaps in the caption templates were filled by the attributes predicted by the model. The caption generation model was trained by maximizing the probability of the correct description of the image. The semantic attribute prediction value $V_{att}$ was used instead of the image features directly. The predicted attributes and generated captions were combined with external knowledge from the knowledge base and then fed to the LSTM to answer various questions. Figure 13 shows captions generated by the attribute-based captioning model together with the mined external knowledge.

Fig. 13. Image Caption Generated: A man with bat readies to swing at the pitch while the umpire looks on. External Knowledge: A pitch is a place used to play various sports such as cricket. The umpire is present to review the match.

2.8.4. Evaluation metrics
BLEU, METEOR and CIDEr.

2.8.5. Dataset
Flickr8k, Flickr30k and Microsoft COCO.

2.8.6. Results
Att-RegionCNN + LSTM is so far one of the most suitable approaches for generating image captions. Figure 14 shows question answering examples using the method of [42].

Fig. 14. Examples where the attribute-Region-CNN + LSTM gives the most appropriate answer while the baseline model gives wrong answer. Figure from paper [42].

3. Datasets

3.1. PASCAL 1K [43]

The images in this dataset are a subset of the images collected for the PASCAL VOC Challenge. It has 20 categories of images, for each of which 50 images were chosen at random, along with their descriptions, which were generated through Amazon's Mechanical Turk.
3.2. Flickr8K & 30K [10]

There are 8,000 and 31,783 images in the Flickr8K and Flickr30K datasets respectively, all gathered from Flickr. The majority of these images show human beings taking part in various tasks. Every image has 5 sentences describing it. These datasets are split into training, testing and validation sets following standard splits.

3.3. MS COCO [27]

This is the Microsoft COCO dataset, which contains 82,000 training images and 40,000 validation images, each complemented by 5 sentences describing it. These images are sourced from Flickr by searching for common object categories, and they generally contain a variety of objects with important information pertaining to their context.

3.4. IAPR TC [4]

In this dataset, there are 20,000 still natural pictures from various locations all around the world. The pictures come from various categories such as sports, actions, cities and shapes, animals, people and many other aspects of modern life. There are captions associated with every image in 3 languages: English, German and Spanish. These 20,000 images are of high resolution, and strict image selection rules are followed while choosing images for this dataset.

3.5. VISUAL GENOME (VG) [52]

It is a dataset built by experts mainly from Stanford and Yahoo. It is a knowledge base that represents a sustained effort to relate image concepts to their natural language descriptions in a structured manner. It is currently the largest dataset of image-based questions and answers, with over 1.7 million question-answer pairs. Every image is supplemented with an average of 17 question-answer pairs.

4. Evaluation and ranking metrics

We have prepared a table for each dataset, in chronological order, showing the various approaches used for the task of description generation and their corresponding scores on a variety of metrics.
Table 1
Image captioning techniques and their scores on IAPRTC12 dataset
Approaches Year B-1 B-2 B-3 PPL
IAPRTC12
BACK-OFF GT2 2007 32.3 14.5 5.9 55.4
BACK-OFF GT3 2007 31.2 13.1 5.9 55.6
LBL 2007 32.7 14.4 1 20.1
Gupta et al. [56] 2012 15 6 7
Gupta & Mannem [19] 2012 33 18 9.8
MLBL-B-DeCAF [13] 2014 37.3 18.7 9.8 24.7
MLBL-F-DeCAF [13] 2014 36.1 17.6 9.2 21.8
m-RNN Baseline [28] 2014 31.34 11.68 8.03 7.77
m-RNN [28] 2014 39.51 18.28 13.11 6.92
Table 2
Image captioning techniques and their scores on PASCAL dataset
Approaches Year B-1 B-2 B-3 PPL
PASCAL
BabyTalk [15] 2011 25 0.49 9.69
Im2Text [17] 2011 25
LBL 2012 32.7 14.4 1 20.1
Midge [21] 2012 2.89 8.80
RNN 2013 2.79 10.08 36.79
RNN + IF 2013 10.16 16.43 30.04
RNN + IF + FT 2013 10.18 16.45 29.43
Tri5Sem [22] 2013 25
Microsoft Bidirectional Retrieval 2014 10.48 16.69 27.97
Microsoft Bidirectional Retrieval + FT 2014 10.77 16.87 26.95
TreeTalk [26] 2014 25
m-RNN [28] 2014 25
NIC [48] 2015 59
HUMAN 2015 70
Table 3
Image captioning techniques and their scores on Flickr8k dataset
Approaches Year B-1 B-2 B-3 B-4 BLEU M PPL
Flickr8k
RNN 2013 4.86 11.81 21.88
RNN + IF 2013 12.04 17.10 20.43
Tri5Sem [22] 2013 48
Microsoft Bidirectional Retrieval 2014 14.10 17.97 19.24
m-RNN [28] 2014 58
MNLM [25] 2014 51
Mao et al. [28] 2014 58 28 23 24.39
Google NIC [48] 2014 63 41 27
Chen and Zitnick [47] 2014 14.1
NIC [48] 2015 63
Karpathy et al. [Deep Visual Semantic] 2015 57.9 38.3 24.5 16.0
Xu et al. (Hard-Attention) 2015 67 46 31 21
Att-SVM + LSTM 2016 73 53 38 26 12.63
Att-GlobalCNN + LSTM 2016 72 53 38 27 12.63
Att-RegionCNN + LSTM 2016 74 54 38 27 12.60
Att-GT + LSTM 2016 76 57 41 29 12.52
HUMAN 2016 70 22.51 26.31
Table 4
Image captioning techniques and their scores on Flickr30k dataset
Approaches Year B-1 B-2 B-3 B-4 BLEU M PPL
Flickr30k
RNN 2013 6.29 12.34 26.94
RNN + IF 2013 12.59 15.56 23.74
m-RNN [28] 2014 55
MNLM [25] 2014 56
Mao et al. [28] 2014 55 24 20 35.11
Google NIC [48] 2014 66.3 42.3 27.7 18.3
Chen and Zitnick [47] 2014 12.6
LRCN [35] 2014 58.8 39.1 25.1 16.5
NIC [48] 2015 66
Karpathy and Li [41] 2015 57.3 36.9 24 15.7
Xu et al. (Hard-Attention) 2015 67 44 30 20
Att-SVM + LSTM 2016 68 49 33 23
Att-GlobalCNN + LSTM 2016 70 50 35 27
Att-RegionCNN + LSTM 2016 73 55 40 28
Att-GT + LSTM 2016 78 57 42 30
HUMAN 2016 68 19.62 23.76
Table 5
Image captioning techniques and their scores on MSCOCO dataset
Approaches Year B-1 B-2 B-3 B-4 BLEU M C PPL
MSCOCO
Random 2005 4.6 9.0 5.1
Nearest Neighbour 2006 48.0 28.1 16.6 9.9 15.7 36.5
RNN 2013 4.63 11.47 18.96
RNN + IF 2013 16.60 19.24 15.39
RNN + IF + FT 2013 16.77 19.41 14.90
Microsoft Bidirectional Retrieval + FT 2014 18.99 20.42 13.98
Google NIC [48] 2014 66.6 46.1 32.9 24.6
Chen and Zitnick [47] 2014 19 20.4
LRCN [35] 2014 62.8 44.2 30.4
NIC [48] 2015 27.7 23.7 85.5
Karpathy and Li [41] 2015 62.5 45 32.1 23 19.5 66.0
Xu et al. (Hard-Attention) 2015 72 50 36 25 23
Att-SVM + LSTM 2016 69 52 38 28 23 82 12.62
Att-GlobalCNN + LSTM 2016 72 54 40 30 25 83 11.39
Att-RegionCNN + LSTM 2016 74 56 42 31 26 94 10.49
Att-GT + LSTM 2016 80 64 50 40 28 107 9.6
HUMAN 2016 21.7 20.19 24.94 85.4
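The B-1 to B-4 columns in the tables above report BLEU computed with n-grams up to the given order. As a rough illustration, the sketch below follows the standard definition from [43]: clipped n-gram precision, combined by a geometric mean and multiplied by a brevity penalty. It works on a single sentence with no smoothing, and the function names and the example caption are assumptions made for illustration; it is not the evaluation script used by any of the surveyed papers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0
        # Each candidate n-gram is credited at most as often as it occurs
        # in the single reference that contains it most often (clipping).
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    # Reference length closest to the candidate length (shorter on ties).
    r = min((len(ref) for ref in references), key=lambda n_ref: (abs(n_ref - c), n_ref))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "a man in black shirt is playing guitar".split()
candidate = "a man is playing guitar".split()
b1 = bleu(candidate, [reference], max_n=1)
b2 = bleu(candidate, [reference], max_n=2)
```

Here every unigram of the short candidate appears in the reference, so the unigram precision is perfect and the score is reduced only by the brevity penalty. This penalty is the term intended to offset the tendency of n-gram precision to reward short outputs.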
These metrics include BLEU-1, 2, 3 and 4, METEOR and CIDEr [45]. The results compiled in this manner allow us to see clearly how image captioning techniques have evolved over the years, and to observe the large improvement in evaluated scores.
BLEU (bilingual evaluation understudy) [43] and METEOR (Metric for Evaluation of Translation with Explicit Ordering) [44] are metrics generally used for the evaluation of machine translation output. R@K denotes the recall rate of a ground truth sentence or image among the first K retrieved results. Some cells in the tables are left empty, as the corresponding scores were not calculated. Tables 1, 2, 3, 4, 5, 6 and 7 compare the scores obtained on the various datasets using different techniques, in chronological order.

5. Challenges

The possibility of developing intelligent computer programs that could correctly interpret and caption photos has intrigued machine learning experts for decades. However, it is only in the last few years that significant progress has been made in this field. We have come a long way from template-based techniques to deep learning ones with attention models.
Table 6
Image captioning techniques and their recall scores on Flickr8k dataset
Flickr8k Image Annotations Image Search
APPROACHES YEAR R@1 R@5 R@10 med r R@1 R@5 R@10 med r
Random 2005 0.1 0.5 1 631 0.1 0.5 1 500
DeFrag [24] 2014 13 33 44 14 10 30 43 15
m-RNN [28] 2014 15 37.2 49 11 12 31 42 15
MNLM [25] 2014 18 55 8 13 52 10
Socher-decaf [53] 2014 4.5 18 28.6 32 6.1 18.5 29 29
Socher-avg-rcnn [53] 2014 6 22.7 34.0 23 6.6 21.6 31.7 25
DeViSE-avg-rcnn [55] 2014 4.8 16.5 27.3 28 5.9 20.1 29.6 29
DeepFE decaf [54] 2014 5.9 19.2 27.3 34 5.2 17.6 26.5 32
DeepFE-rcnn [24] 2014 12.6 32.9 44 14 9.7 29.6 42.5 15
m-RNN-decafe [28] 2014 14.5 37.2 48.5 11 11.5 31.0 42.4 15
NIC [48] 2015 20 61 6 19 64 5
MNLM [25] 2014 13.5 36.2 45.7 13 10.4 31 43.7 14
MNLM [25] (oxford-net) 2014 18 40.9 55 8 12.5 37 51.5 10
Table 7
Image captioning techniques and their recall scores on Flickr30k dataset
Flickr30k Image Annotations Image Search
APPROACHES YEAR R@1 R@5 R@10 med r R@1 R@5 R@10 med r
Random 2005 0.1 0.6 1.1 631 0.1 0.5 1 500
DeFrag [24] 2014 16 40.2 55 8 10 31.4 45 13
m-RNN [28] 2014 18 40.2 51 10 13 31.2 42 16
MNLM [25] 2014 23 63 5 17 57 8
DeViSE-avg-rcnn [54] 2014 4.8 16.5 27.3 28 5.9 20.1 29.6 29
DeepFE-rcnn [24] 2014 16.4 40.2 54.7 8 10.3 31.4 44.5 13
m-RNN-decafe [28] 2014 18.4 40.2 50.9 10 12.6 31.2 41.5 16
SDT-RNN (Socher et al. [57]) 2014 9.6 29.8 41.1 16 8.9 29.8 41.1 16
Kiros et al. [25] 2014 14.8 39.2 50.9 10 11.8 34.0 46.3 13
Donahue et al. [35] 2014 17.5 40.3 50.8 9
Vinyals et al. 2014 23 63 5 17 57 8
NIC [48] 2015 17 56 7 17 57 7
DeFrag (Karpathy et al. [24]) 2015 19.2 44.5 58.0 6 12.9 35.4 47.5 10.8
DepTree edges(Karpathy et al.) 2015 20 46.6 59.4 5.4 15 36.5 48.2 10.4
BRNN (Karpathy et al.) 2015 22.2 48.2 61.4 4.8 15.2 37.7 50.5 9.2
MNLM [25] 2014 14.8 39.2 50.9 10 11.8 34 46.3 13
MNLM [25](oxfordnet) 2014 23 50.7 62.9 5 16.8 42 56.5 8
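The R@1, R@5, R@10 and med r columns of Tables 6 and 7 can be read with the help of a short sketch. It assumes that, for every query (an image or a sentence), we already know the 1-based rank at which the first correct item was retrieved, and that recall is reported as a percentage, which the values in the tables suggest. The function names and the rank list are made up for illustration.

```python
from statistics import median


def recall_at_k(ranks, k):
    """Percentage of queries whose first correct result is in the top k."""
    return 100.0 * sum(1 for rank in ranks if rank <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the first correct result (lower is better)."""
    return median(ranks)


# Hypothetical ranks of the first ground-truth match for eight queries.
ranks = [1, 3, 7, 12, 2, 40, 5, 9]
r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
r_at_10 = recall_at_k(ranks, 10)
med_r = median_rank(ranks)
```

Higher recall and lower median rank both indicate better retrieval, which is why the two kinds of columns move in opposite directions for the stronger models in the tables.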
Still, there are a lot of challenges that need to be overcome.
One of the challenges is the prudent use of an attention system that describes individual components of an image, rather than just the image as a whole, in order to create a holistic description of the complete picture. The challenge here is to incorporate more knowledge than just what the model is trained on. This includes understanding the context of the image and incorporating world knowledge while generating captions, just as humans would do. Only since last year have a few researchers started working on this issue; however, significant improvements have not yet surfaced. Better performance can be expected by choosing a superior image encoder, fine-tuning it and setting up ensemble models.
The performance of a system can be judged better if we have better evaluation and ranking metrics. Most of the approaches discussed above use BLEU scores to compare their results to the ground truth, which suggests that this metric is a benchmark for evaluation, and it has some obvious advantages; nevertheless, a number of shortcomings have been noticed. It has been noted that BLEU cannot deal with languages lacking word boundaries. Another problem is its bias towards shorter translations. We could use other metrics involving human effort, such as HyTER; however, these are still just an approximation.

6. Future scope

The field of image captioning has been researched for decades, as we just saw. However, there is immense scope still left to explore. Though most of the recent studies have been fairly successful in describing the image correctly, human-level accuracy and descriptiveness still seem a far-fetched idea.
This all boils down to one thing: knowledge. Humans, while thinking of a caption, use the entire knowledge base that they have been acquiring for years. Hence, the emotions, the extra world knowledge and the power of expression which humans possess are enough for any human to beat a machine in this so-simple-for-human task.
So, the need of the future is to have an excellent knowledge base and the hardware power to train a model to use that entire knowledge feasibly, in order for the machine to develop an entire multi-dimensional context, so that any open-ended question related to the image can be answered, irrespective of the attributes simply detected using any computer vision system.
This is the reason why the major search engine corporations, like Google and Microsoft (Bing), hold the best cards: they can utilize the power of humongous databases, turn them into knowledge bases and realize the future of this technology. Microsoft's "CAPTION BOT" is an excellent example of this initiative, which uses the power of Emotions, Computer Vision and, most importantly, the power of Bing to give really fantastic results.

7. Conclusion

We classified and discussed 8 major approaches used for image captioning according to the order in which they developed. We discussed how and why an approach evolved so as to solve the shortcomings of the previous one. We then explained each of the approaches in detail with the help of a particular study and, lastly, compared the results of various experiments conducted so far using various popular metrics such as BLEU, METEOR and CIDEr. We were able to observe clearly the large positive difference in the scores.

Acknowledgments

A very sincere thanks to Shubham Thakkar, Saumya Gupta and Shubham Singh, who helped us throughout the formulation of this survey paper. This would not have been possible without their constant support.

References

[1] G.A. Miller, WordNet: A lexical database for English, Communications of the ACM 38(11) (1995), 39–41.
[2] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, 1997.
[3] E. Reiter and R. Dale, Building Natural Language Generation Systems, Cambridge University Press, 2000.
[4] M. Grubinger, P. Clough, H. Müller and T. Deselaers, The iaprtc-12 benchmark: A new evaluation resource for visual information systems.
[5] L.-J. Li and L. Fei-Fei, What, where and who? classifying events by scene and object recognition, ICCV, 2007.
[6] L.-J. Li, R. Socher and F.-F. Li, Towards total scene understanding: Classification, annotation and segmentation in an automatic framework, in: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, IEEE, 2009, pp. 2036–2043.
[7] A. Farhadi, I. Endres, D. Hoiem and D. Forsyth, Describing objects by their attributes, Proceedings of CVPR, 2009.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and F.-F. Li, Imagenet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, IEEE, 2009, pp. 248–255.
[9] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier and D. Forsyth, Every picture tells a story: Generating sentences from images, In ECCV, 2010.
[10] C. Rashtchian, P. Young, M. Hodosh and J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010, pp. 139–147.
[11] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky and S. Khudanpur, Recurrent neural network based language model, In INTERSPEECH, 2010.
[12] l. Petrov, Berkeley parser, GNU General Public License v.2, 2010.
[13] R. Kiros and R.Z.R. Salakhutdinov, Multimodal neural language models, In NIPS Deep Learning Workshop, 2013.
[14] S. Li, G. Kulkarni, T.L. Berg, A.C. Berg and Y. Choi, Composing simple image descriptions using web-scale n-grams, In Conference on Computational Natural Language Learning, 2011.
[15] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A.C. Berg and T.L. Berg, Baby talk: Understanding and generating simple image descriptions, In CVPR, 2011.
[16] Y. Yang, C.L. Teo, H. Daume III and Y. Aloimonos, Corpus-guided sentence generation of natural images, In EMNLP, 2011.
[17] V. Ordonez, G. Kulkarni and T.L. Berg, Im2text: Describing images using 1 million captioned photographs, In NIPS, 2011.
[18] Y. Feng and M. Lapata, Automatic Caption Generation for News Images, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(4) (April 2013), 797–812. doi: 10.1109/TPAMI.2012.118.
[19] A. Gupta and P. Mannem, From image annotation to image description, In Neural information processing, Springer, 2012.
[20] A. Krizhevsky, I. Sutskever and G.E. Hinton, Imagenet classification with deep convolutional neural networks, In NIPS, 2012.
[21] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos and H. Daume III, Midge: Generating image descriptions from computer vision detections, In EACL, Association for Computational Linguistics, 2012, pp. 747–756.
[22] M. Hodosh, P. Young and J. Hockenmaier, Framing image description as a ranking task: Data, models and evaluation metrics, JAIR 47 (2013).
[23] R. Kiros and R.Z.R. Salakhutdinov, Multimodal neural language models, In NIPS Deep Learning Workshop, 2013.
[24] A. Karpathy, A. Joulin and F.-F. Li, Deep fragment embeddings for bidirectional image sentence mapping, NIPS, 2014.
[25] R. Kiros, R. Salakhutdinov and R.S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, In arXiv:1411.2539(2014).
[26] P. Kuznetsova, V. Ordonez, T. Berg and Y. Choi, Treetalk: Composition and compression of trees for image descriptions, ACL 2(10) (2014).
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C.L. Zitnick, Microsoft coco: Common objects in context, arXiv:1405.0312(2014).
[28] J. Mao, W. Xu, Y. Yang, J. Wang and A. Yuille, Explain images with multimodal recurrent neural networks, In arXiv:1410.1090(2014).
[29] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning and A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL, 2014.
[30] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. Platt et al., From captions to visual concepts and back, arXiv preprint arXiv:1411.4952(2014).
[31] C. Szegedy, S. Reed, D. Erhan and D. Anguelov, Scalable, high-quality object detection, arXiv preprint arXiv:1412.1441(2014).
[32] R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg and F.-F. Li, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV), April 2015, p. 142.
[34] X. Chen and C. Lawrence Zitnick, Mind's Eye: A Recurrent Visual Representation for Image Caption Generation, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[35] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko and T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[36] J. Mao, W. Xu, Y. Yang, J. Wang and A. Yuille, Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), in Proc. Int. Conf. Learn. Representations, 2015.
[37] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle and A. Courville, Describing videos by exploiting temporal structure, in Proc. IEEE Int. Conf. Comp. Vis., 2015.
[38] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig and M. Mitchell, Language models for image captioning: The quirks and what works, arXiv preprint arXiv:1505.01809(2015).
[39] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proc. Int. Conf. Learn. Representations, 2015.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, Going deeper with convolutions, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[41] A. Karpathy and F.-F. Li, Deep Visual-Semantic Alignments for Generating Image Descriptions, arXiv:1412.2306v2 (2015).
[42] Q. Wu, P. Wang, C. Shen, A. Dick and A.V.D. Hengel, Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources, in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[43] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, Bleu: A method for automatic evaluation of machine translation, In Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[44] S. Banerjee and A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[45] R. Vedantam, C.L. Zitnick and D. Parikh, Cider: Consensus-based image description evaluation, CVPR, 2015.
[46] J. Johnson, A. Karpathy and F.-F. Li, DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CoRR, abs/1511.07571(2015).
[47] X. Chen and C.L. Zitnick, Mind's eye: A recurrent visual representation for image caption generation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2422–2431.
[48] O. Vinyals, A. Toshev, S. Bengio and D. Erhan, Show and tell: A neural image caption generator, arXiv preprint arXiv:1411.4555(2014).
[49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun, OverFeat: Integrated recognition, localization and detection using convolutional networks, ICLR, 2014.
[50] T. Mikolov, A. Deoras, D. Povey, L. Burget and J. Cernocky, Strategies for training large scale neural network language models, In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, IEEE, 2011, pp. 196–201.
[51] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, Caffe: Convolutional architecture for fast feature embedding, arXiv preprint arXiv:1408.5093(2014).
[52] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, M. Bernstein and F.-F. Li, Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, arXiv:1602.07332(2016).
[53] R. Socher, Q. Le, C. Manning and A. Ng, Grounded compositional semantics for finding and describing images with sentences, In NIPS Deep Learning Workshop, 2013.
[54] A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., Devise: A deep visual-semantic embedding model, In Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[55] A. Karpathy, A. Joulin and F.-F. Li, Deep fragment embeddings for bidirectional image sentence mapping, arXiv preprint arXiv:1406.5679(2014).
[56] A. Gupta, Y. Verma and C. Jawahar, Choosing linguistics over vision to describe images, In AAAI, 2012.
[57] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning and A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL, 2014.
Copyright of International Journal of Hybrid Intelligent Systems is the property of IOS Press
and its content may not be copied or emailed to multiple sites or posted to a listserv without
the copyright holder's express written permission. However, users may print, download, or
email articles for individual use.
