SUBMITTED BY –
MELVIN 17BEE
Given an image, possible captions could be –
1. A man and a girl sitting.
2. A man and a girl eating.
3. A man in a black shirt and a girl in an orange dress eating while sitting on a sidewalk.
The human brain can generate numerous captions for an image within a few seconds. When this is done with the help of neural networks and deep learning, we call it image captioning. However, it is not simple; it requires a combination of complex and advanced techniques.
WHY IMAGE CAPTIONING USING DEEP NEURAL NETWORKS?
Older approaches, such as template-based and retrieval-based methods, have been used to solve the problem of image captioning. However, these methods fail to generalize: they caption images using a fixed set of visual categories and hence do not recognise new images. This suggests that such methods have limited applications compared to a human being, who can generate numerous captions for the same image.
We need methods that can generate captions and descriptions for any image. With the introduction of machine learning into the field of neural networks, computers have advanced in visual and language processing beyond the traditional methods. Several models have been used to implicitly learn a common embedding of images and sentences by encoding one modality and decoding the other:
1. Convolutional neural network (CNN)
2. Long short-term memory (LSTM)
3. Recurrent neural network (RNN)
Several datasets have been used to test new methods (a small caption-loading sketch follows this list):
1. Flickr 8k – Consists of 8,000 images with 5 captions for each image.
2. Flickr 30k – Consists of 31,783 images with 5 full sentence-level captions for each image.
3. MSCOCO – Consists of 82,783 images with 5 captions for each image.
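As a concrete illustration, the Flickr 8k captions are commonly distributed as a plain-text file with one "image.jpg#index<TAB>caption" line per caption. The sketch below parses such a file into a dictionary of image name to its five captions; the file name and line layout are assumptions about the common distribution, not details taken from this report.

```python
from collections import defaultdict

def load_captions(path="Flickr8k.token.txt"):
    """Group the 5 reference captions of each image under its file name.
    Assumes the common 'image.jpg#idx<TAB>caption' line format."""
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t", 1)
            image_name = image_id.split("#")[0]  # strip the '#0'..'#4' caption index
            captions[image_name].append(caption)
    return dict(captions)
```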
MODELS FOR IMAGE CAPTIONING
The history of image captioning goes back many years. Early attempts were mostly based on detections: they first detect visual concepts (e.g. objects and their attributes) and then generate a caption by template filling or by nearest-neighbour retrieval. With the development of neural networks came the encoder-decoder framework, which later became the basic model. Most of these models use a CNN to represent the input image as a vector, and then apply an LSTM network on top of it to generate the words, as sketched in the code below.
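A minimal sketch of this CNN-encoder / LSTM-decoder design, assuming PyTorch and a ResNet-18 backbone; all layer names and sizes are illustrative choices, not taken from the report.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """CNN encoder + LSTM decoder sketch (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: CNN backbone whose classifier is replaced by a projection
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # Decoder: word embeddings + LSTM + projection to the vocabulary
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature is fed as the first "word" of the sequence
        feat = self.encoder(images).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions[:, :-1])          # (B, T-1, E)
        inputs = torch.cat([feat, words], dim=1)      # (B, T, E)
        hidden, _ = self.lstm(inputs)                 # (B, T, H)
        return self.fc(hidden)                        # logits over the vocabulary
```

Training such a model would minimize the cross-entropy between these logits and the target words, which corresponds to the log-probability objective described next.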
The method proposes the use of a probabilistic framework to caption an image with a neural network. It generates captions by maximizing the probability of the correct description in an "end-to-end" fashion:

θ* = arg max_θ Σ_(I, S) log p(S | I; θ)

where
I – the image
S – the correct transcription, a word sequence S_0, ..., S_N of length N.

The log-probability of the sentence is the sum of the log-probabilities of its words, each conditioned on the image and the preceding words:

log p(S | I) = Σ_{t=0}^{N} log p(S_t | I, S_0, ..., S_{t−1})
Further, a recurrent neural network (RNN) is used to compute the probability of each word from the input image I and the t − 1 words generated so far. The recurrence is modelled with an LSTM, whose standard gate equations are:

i_t = σ(W_i x_t + U_i h_{t−1})
f_t = σ(W_f x_t + U_f h_{t−1})
o_t = σ(W_o x_t + U_o h_{t−1})
c̃_t = tanh(W_c x_t + U_c h_{t−1})
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where W_i, W_f, W_o, W_c ∈ R^(D_h × D_x) and U_i, U_f, U_o, U_c ∈ R^(D_h × D_h).
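A small NumPy sketch of one such LSTM step, using the same weight shapes; bias terms are added for completeness, and everything here is illustrative rather than the report's exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W holds (D_h, D_x) input weights, U holds (D_h, D_h)
    recurrent weights, b holds (D_h,) biases, each keyed by gate name."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell
    c_t = f * c_prev + i * c_tilde                              # new cell state
    h_t = o * np.tanh(c_t)                                      # new hidden state
    return h_t, c_t
```

At each step, h_t would be passed through a softmax layer to give p(S_t | I, S_0, ..., S_{t−1}).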
EXAMPLE 2 – CONTRASTIVE LEARNING
Contrastive Learning (CL) is a method for image captioning that encourages distinctiveness while maintaining the overall quality of the generated captions. The method is generic and can be applied to models with various structures. By employing a state-of-the-art model as a reference, it is able to maintain the optimality of the target model while encouraging it to learn distinctiveness, which is an important property of high-quality captions. On two challenging datasets, namely MSCOCO and InstaPIC-1.1M, the proposed method improves the target model by significant margins and achieves state-of-the-art results across multiple metrics. In comparative studies, the method also extends well to models with different structures, which clearly shows its generalization ability.
“In Contrastive Learning (CL), we learn a target image captioning model p_m(·; θ) with parameters θ by constraining its behaviour relative to a reference model p_n(·; φ) with parameters φ. The learning procedure requires two sets of data: (1) the observed data X, which is a set of ground-truth image-caption pairs ((c_1, I_1), (c_2, I_2), ..., (c_Tm, I_Tm)) and is readily available in any image captioning dataset, and (2) the noise set Y, which contains mismatched pairs ((c_/1, I_1), (c_/2, I_2), ..., (c_/Tn, I_Tn)) and can be generated by randomly sampling c_/t ∈ C_/It for each image I_t, where C_/It is the set of all ground-truth captions except the captions of image I_t. We refer to X as positive pairs and Y as negative pairs. For any pair (c, I), the target model and the reference model respectively give their estimated conditional probabilities p_m(c | I, θ) and p_n(c | I, φ). We wish that p_m(c_t | I_t, θ) is greater than p_n(c_t | I_t, φ) for any positive pair (c_t, I_t), and vice versa for any negative pair (c_/t, I_t). Following this intuition, our initial attempt was to define D((c, I); θ, φ), the difference between p_m(c | I, θ) and p_n(c | I, φ), as

D((c, I); θ, φ) = p_m(c | I, θ) − p_n(c | I, φ)”
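A minimal PyTorch-style sketch of this idea. The `.log_prob(caption, image)` model interface and the logistic surrogate loss are assumptions made for illustration, not the paper's API or exact objective; log-probabilities are used instead of raw probabilities purely for numerical stability.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(target_model, reference_model, pos_pairs, neg_pairs):
    """CL-style objective sketch: drive the target model's caption score above
    the reference model's on positive (matched) pairs and below it on
    negative (mismatched) pairs.

    Both models are assumed to expose .log_prob(caption, image) -> scalar
    tensor log p(c | I); this interface is a placeholder."""
    def diff(pairs):
        # D((c, I); theta, phi), computed on log-probabilities for stability;
        # the reference model is fixed, so its term is detached.
        return torch.stack([
            target_model.log_prob(c, img)
            - reference_model.log_prob(c, img).detach()
            for c, img in pairs
        ])

    d_pos = diff(pos_pairs)   # should become positive
    d_neg = diff(neg_pairs)   # should become negative
    # Simple logistic surrogate around those two constraints.
    return -(F.logsigmoid(d_pos).mean() + F.logsigmoid(-d_neg).mean())
```

Negative pairs can be built exactly as the quotation describes, by pairing each image with a randomly sampled ground-truth caption belonging to a different image.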