
REVIEW 1

IMAGE CAPTIONING USING
NEURAL NETWORKS

SUBMITTED BY –

AISHNA MISHRA 17BEE0323 (EMAIL – aishnamishra29@gmail.com , Contact – 7999891718)

KISLAY SINHA 17BEE0113 (EMAIL- kislaysinha99@gmail.com, Contact- 9131887993)

MELVIN 17BEE

GUIDED BY Prof. SHARMILA A


IMAGE CAPTIONING
USING
TENSORFLOW
Introduction
Through our project, we aim to develop code for image-to-sentence
generation. Artificial neural networks have enabled computers to automatically
generate captions for images. In our project, we focus on an application of
neural networks that bridges vision and natural language. Using natural
language processing technologies, we can describe in words what the computer
sees in an image. Image captioning not only identifies the different types of
objects present in an image, but also expresses the relationships between them
in a natural language such as English.
What does Image Captioning Entail?
What do you think about when you see this picture?

It could be –
1. A man and a girl sitting.
2. A man and a girl eating.
3. A man in a black shirt and a girl in an orange dress eating while sitting on a
sidewalk.
The human brain can generate numerous captions for such an image within a few seconds.
When this is done with the help of neural networks and deep learning, we call
it image captioning. However, it is not simple; it requires a combination of
complex and advanced techniques.
WHY IMAGE CAPTIONING USING DEEP
NEURAL NETWORKS?
Older methods, such as template-based and retrieval-based methods, have been
used to solve the problem of image captioning. Nevertheless, these methods fail
to generalize because they caption images using a fixed set of visual
categories and hence cannot describe new images well. This suggests that such
methods have limited applications compared to a human being, who can generate
numerous captions for the same image.
We need new methods that can generate captions and descriptions for any image.
With the introduction of machine learning in the arena of neural networks,
computers have advanced in visual and language processing beyond the
traditional methods. Several models have been used to implicitly learn a common
embedding by encoding and decoding the two modalities directly:
1. Convolutional neural networks (CNN)
2. Long short-term memory networks (LSTM)
3. Recurrent neural networks (RNN)
Various datasets have been used to test new methods (see the loading sketch after this list):

1. Flickr 8k – consists of 8,000 images with 5 captions for each image.
2. Flickr 30k – consists of 31,783 images with 5 full sentence-level captions for each image.
3. MSCOCO – consists of 82,783 images with 5 captions for each image.
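
To make the structure of these datasets concrete, the short Python sketch below shows one way the caption annotations could be loaded. It assumes the Flickr8k-style text format in which each line pairs an image name and caption index with one caption (as in Flickr8k.token.txt); the exact file name and layout may differ between dataset releases.

```python
from collections import defaultdict

def load_flickr8k_captions(token_path):
    """Parse a Flickr8k-style caption file into {image_id: [captions]}.

    Assumes each line looks like:  <image>.jpg#<n><TAB><caption text>
    (the usual layout of Flickr8k.token.txt); adjust if your copy differs.
    """
    captions = defaultdict(list)
    with open(token_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_tag, caption = line.split("\t", 1)
            image_id = image_tag.split("#")[0]  # drop the "#0".."#4" caption index
            captions[image_id].append(caption.lower())
    return captions

# Example usage (the file path is illustrative):
# captions = load_flickr8k_captions("Flickr8k.token.txt")
# print(len(captions), "images,", sum(len(c) for c in captions.values()), "captions")
```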
MODELS FOR IMAGE CAPTIONING
The history of image captioning goes back many years. Early attempts were
mostly based on detection: they first detect visual concepts (e.g. objects and
their attributes) and then generate a caption by template filling or by
nearest-neighbour retrieval. With the development of neural networks came the
encoder-decoder framework, which later became the basic model. Most models use
a CNN to represent the input image as a feature vector, and then apply an LSTM
network on top of it to generate the words.

‘Based on the encoder-decoder framework, many variants have been proposed, among
which the attention mechanism appears to be the most effective add-on.
Specifically, the attention mechanism replaces the single feature vector with a
set of feature vectors, such as features from different regions or features
computed under different conditions. It also uses the LSTM network to generate
words one by one; the difference is that at each step a mixed guiding feature
over the whole feature set is dynamically computed. In recent years, there have
also been approaches combining the attention mechanism with detection. Instead
of attending over features, they attend over a set of detected visual concepts,
such as attributes and objects.’
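
As an illustration of the attention mechanism described above, the minimal PyTorch sketch below scores a set of region feature vectors against the current LSTM hidden state and returns their softmax-weighted mixture (the dynamically computed guiding feature). The dimensions and the additive scoring function are illustrative choices, not the exact formulation of any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Compute a weighted mixture of region features given the decoder state."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_state = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        energy = torch.tanh(self.proj_feat(features) + self.proj_state(hidden).unsqueeze(1))
        alpha = F.softmax(self.score(energy).squeeze(-1), dim=1)  # attention weights
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)     # mixed guiding feature
        return context, alpha

# Example usage with illustrative sizes:
attn = AdditiveAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
regions = torch.randn(4, 49, 2048)  # e.g. a 7x7 CNN feature map flattened to 49 regions
state = torch.randn(4, 512)         # current LSTM hidden state
context, alpha = attn(regions, state)
print(context.shape, alpha.shape)   # torch.Size([4, 2048]) torch.Size([4, 49])
```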
LITERATURE REVIEW
Image captioning can be generally divided into two categories: top-
down and bottom-up. Bottom-up approaches are the “classical”
ones, which start with visual concepts, objects, attributes, words and
phrases, and combine them into sentences using language models.
[12] and [19] detect concepts and use templates to obtain
sentences, while [23] pieces together detected concepts. [9] and
[20] use more powerful language models. [11] and [22] are the latest
attempts along this direction and they achieve close to the state-of-
the-art performance on various image captioning benchmarks. Top-
down approaches are the “modern” ones, which formulate image
captioning as a machine translation problem [29, 2, 5, 27]. Instead of
translating between different languages, these approaches translate
from a visual representation to a language counterpart. The visual
representation comes from a convolutional neural network which is
often pretrained for image classification on large-scale datasets [18].
Translation is accomplished through recurrent neural network based
language models. The main advantage of this approach is that
the entire system can be trained end to end, i.e., all the
parameters can be learned from data. Representative works include
[24, 26, 16, 8, 16, 25]. The differences between the various approaches
often lie in what kind of recurrent neural network is used. Top-down
approaches represent the state of the art for this problem.
Visual attention has long been known in psychology and neuroscience,
but has only recently been studied in computer vision and related areas.
In terms of models, [21, 13] approach it with Boltzmann machines,
while [28] does so with recurrent neural networks. In terms of
applications, [6] studies it for image tracking, [1] studies it for image
recognition of multiple objects, and [15] uses it for image generation.
Finally, attention has also been considered for image captioning before:
in [30], Xu et al. propose a spatial attention model for image captioning.

SOME EXAMPLES OF MODERN TECHNIQUES USED IN
NEURAL NETWORKS FOR IMAGE CAPTIONING

EXAMPLE 1 – DATASET Flickr8k


The image below is a schematic diagram of the model involving RNN and CNN

This method uses a probabilistic framework to caption images with a neural network. It is
trained in an "end-to-end" fashion to maximize the probability of the correct caption given the image.

The notation is as follows –

θ – the parameters of the model
I – the image
S – the correct transcription (caption)

The log probability of a sentence is the sum of the log probabilities of its words, where S = (S0, ..., SN) is the word sequence and N is its length:
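
Written out, a standard chain-rule decomposition consistent with this notation (as used in encoder-decoder captioning models) is:

$$\log p(S \mid I; \theta) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1}; \theta)$$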

Further, a recurrent neural network (RNN) is used to model each word's probability, conditioning
on the input image I and the previous t − 1 words through a fixed-length hidden state. The recurrence can be written as ht+1 = fLSTM(ht, xt),

where ht is the memory (hidden state) of the RNN and xt is its input at step t (the image feature at the first step, and word embeddings thereafter).

The function fLSTM is defined according to the following:

where Wi, Wf, Wo, Wc ∈ R^(Dh×Dx) and Ui, Uf, Uo, Uc ∈ R^(Dh×Dh).
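
The gate equations themselves are not shown above; a standard LSTM formulation consistent with these weight matrices (bias terms omitted) would be:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1}) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1}) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1}) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where σ is the logistic sigmoid and ⊙ denotes element-wise multiplication.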
EXAMPLE 2 – CONTRASTIVE LEARNING
Contrastive Learning (CL) is a method for image captioning that encourages
distinctiveness while maintaining the overall quality of the generated
captions. The method is generic and can be used with models of various
structures. By employing a state-of-the-art model as a reference, the proposed
method is able to maintain the optimality of the target model while
encouraging it to learn distinctiveness, which is an important property of
high-quality captions. On two challenging datasets, namely MSCOCO and
InstaPIC-1.1M, the proposed method improves the target model by significant
margins and achieves state-of-the-art results across multiple metrics. In
comparative studies, the proposed method extends well to models with different
structures, which clearly shows its generalization ability.

“In Contrastive Learning (CL), we learn a target image captioning model pm(·; θ)
with parameters θ by constraining its behaviors relative to a reference model
pn(·; φ) with parameters φ. The learning procedure requires two sets of data:
(1) the observed data X, which is a set of ground-truth image-caption pairs
((c1, I1), (c2, I2), ..., (cTm, ITm)) and is readily available in any image
captioning dataset, and (2) the noise set Y, which contains mismatched pairs
((c/1, I1), (c/2, I2), ..., (c/Tn, ITn)) and can be generated by randomly
sampling c/t ∈ C/It for each image It, where C/It is the set of all ground-truth
captions except the captions of image It. We refer to X as positive pairs and Y
as negative pairs. For any pair (c, I), the target model and the reference model
respectively give their estimated conditional probabilities pm(c|I, θ) and
pn(c|I, φ). We wish pm(ct|It, θ) to be greater than pn(ct|It, φ) for any positive
pair (ct, It), and vice versa for any negative pair (c/t, It). Following this
intuition, our initial attempt was to define D((c, I); θ, φ), the difference
between pm(c|I, θ) and pn(c|I, φ), as

D((c, I); θ, φ) = pm(c|I, θ) − pn(c|I, φ),


The loss function is”
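
In the Contrastive Learning paper this passage quotes, the difference D is refined into a log-probability difference passed through a logistic function, and the resulting loss (to be maximized over θ) takes roughly the following noise-contrastive form; the exact normalization and saturation details may differ from this sketch:

$$
\begin{aligned}
G((c,I);\theta,\phi) &= \log p_m(c \mid I, \theta) - \log p_n(c \mid I, \phi), \\
h((c,I);\theta,\phi) &= \sigma\big(G((c,I);\theta,\phi)\big), \\
\mathcal{L}(\theta) &= \frac{1}{T}\sum_{t=1}^{T}\Big[\log h((c_t,I_t);\theta,\phi) + \log\big(1 - h((c_{/t},I_t);\theta,\phi)\big)\Big],
\end{aligned}
$$

with T = Tm = Tn and σ the logistic sigmoid.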
EXAMPLE 3 – Pytorch
Let us look at a simple implementation of image captioning in PyTorch. We will
take an image as input and predict its description using a deep learning model.

The code for this example can be found on GitHub.

A pre-trained ResNet-152 model is used as the encoder, and the decoder is an LSTM network.
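
A minimal sketch of such an encoder-decoder pair is given below. It assumes a recent torchvision for the pre-trained ResNet-152, uses illustrative embedding and hidden sizes, and only shows the overall wiring (image → feature vector → LSTM over caption tokens); it is not the full training pipeline of the GitHub example.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with ResNet-152."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
        for p in self.backbone.parameters():
            p.requires_grad = False                                   # keep the CNN frozen
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                    # images: (batch, 3, 224, 224)
        feats = self.backbone(images).flatten(1)  # (batch, 2048)
        return self.fc(feats)                     # (batch, embed_size)

class DecoderRNN(nn.Module):
    """Generate a caption word by word with an LSTM, seeded by the image feature."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):        # captions: (batch, seq_len) of word ids
        word_embeds = self.embed(captions)
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), word_embeds], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                    # (batch, seq_len + 1, vocab_size)

# Illustrative forward pass:
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=5000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 5000, (2, 12))
scores = decoder(encoder(images), captions)
print(scores.shape)                               # torch.Size([2, 13, 5000])
```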

EXAMPLE 4 – COMBINING TOP-DOWN AND BOTTOM-UP
APPROACHES USING SEMANTIC ATTENTION
This paper proposes a new image captioning approach that combines the
top-down and bottom-up approaches through a semantic attention model (see
Figure 1 of the paper for an overview of the algorithm). In image captioning,
semantic attention is defined as the ability to provide a detailed, coherent
description of semantically important objects exactly when they are needed. In
particular, the semantic attention model has the following properties: 1) it is
able to attend to a semantically important concept or region of interest in an
image, 2) it is able to weight the relative strength of attention paid to
multiple concepts, and 3) it is able to switch attention among concepts
dynamically according to task status. Specifically, it detects semantic
concepts or attributes as candidates for attention using a bottom-up approach,
and employs a top-down visual feature to guide where and when attention should
be activated. The model is built on top of a Recurrent Neural Network (RNN),
whose initial state captures global information from the top-down feature. As
the RNN state transitions, it gets feedback and interaction from the bottom-up
attributes via an attention mechanism enforced on both the network state and
the output nodes. This feedback allows the algorithm not only to predict new
words more accurately, but also to infer more robustly the semantic gap between
existing predictions and image content. In this way, external image data can be
leveraged for training visual concepts and external text data for learning
semantics between words.
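
As a rough illustration of the input-attention idea described above, the sketch below scores a small set of detected attribute embeddings against the previous word embedding with a bilinear form and mixes the weighted result into the decoder input. It is a deliberately simplified toy version with made-up dimensions, not the exact model of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeInputAttention(nn.Module):
    """Mix detected attribute embeddings into the RNN input (simplified input attention)."""

    def __init__(self, embed_size):
        super().__init__()
        self.bilinear = nn.Parameter(torch.randn(embed_size, embed_size) * 0.01)

    def forward(self, prev_word_embed, attr_embeds):
        # prev_word_embed: (batch, embed_size); attr_embeds: (batch, num_attrs, embed_size)
        scores = torch.einsum("be,ef,baf->ba", prev_word_embed, self.bilinear, attr_embeds)
        alpha = F.softmax(scores, dim=1)                      # attention over attributes
        attr_context = (alpha.unsqueeze(-1) * attr_embeds).sum(dim=1)
        return prev_word_embed + attr_context                 # combined decoder input

# Illustrative usage: 5 detected attribute words per image, embedding size 256.
attend = AttributeInputAttention(embed_size=256)
x_prev = torch.randn(4, 256)
attrs = torch.randn(4, 5, 256)
x_t = attend(x_prev, attrs)
print(x_t.shape)  # torch.Size([4, 256])
```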
References
[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. ICLR, 2015.

[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2014.

[3] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.

[4] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431, 2015.

[5] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.

[6] M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184, 2012.

[7] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.

[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2626–2634, 2015.

[9] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, pages 1292–1302, 2013.

[10] V. Escorcia, J. C. Niebles, and B. Ghanem. On the relationship between visual attributes and convolutional networks. In CVPR, pages 1256–1264, 2015.

[11] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. In CVPR, pages 1473–1482, 2015.

[12] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15–29. Springer, 2010.

[13] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for multilabel image annotation. ICLR, 2014.

[14] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, pages 529–545. Springer, 2014.

[15] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, June 2015.

[17] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of Intelligence, pages 115–141. Springer, 1987.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[19] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.

[20] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, pages 359–368, 2012.

[21] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pages 1243–1251, 2010.

[22] R. Lebret, P. O. Pinheiro, and R. Collobert. Simple image description generator via a linear phrase-based approach. ICLR, 2015.

[23] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, pages 220–228, 2011.

[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, June 2015.

[25] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In ICCV, 2015.

[26] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.

[27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.

[28] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, pages 2204–2212, 2014.

[29] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. EMNLP, 12:1532–1543, 2014.

[30] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
