

Generating Captions for Underwater Images Using Deep
Learning Models

Hardik Gourisaria, Shylaja S S, Rama Devi P, Akhilarka Jayanthi, Tanay Gangey


Department of Computer Science and Engineering, PES University, Bangalore, India
hardik.g@outlook.com, shylaja.sharath@pes.edu,
pramadevi@pes.edu, akhiljay99@gmail.com,
tanay.gangey@gmail.com

Abstract. Image captioning is a small yet important domain within the vast subject of scene understanding. Our work focuses on generating natural language annotations for underwater images. A new underwater image dataset, namely PESEmphasis5k, has been created and captioned. Our models, built from different variations of CNNs, LSTMs, and GRUs and trained on this new dataset, generate captions with good accuracy. Alongside this, we introduce the Parameter Wise Tally Checker as a new evaluation metric for analyzing and evaluating models and demonstrate its use on our captioning model.

Keywords: Deep Learning, Convolutional Neural Networks, Recurrent Neural Network, Long Short-Term Memory, Gated Recurrent Units, Image Captioning

1 Introduction

Sight is one of the primary senses for humans, and we describe what we see through language. For machines to emulate humans, the ability to perceive images and describe them in a language understandable by humans is paramount. Computer Vision, Machine Learning, and Natural Language Processing are among the fields that enable machines to perform the task of describing images. Image captioning is one small but important step toward the bigger picture of creating machines with human-like capabilities.
With several datasets available, such as MSCOCO [1] and Visual Genome [2], it is possible to train such systems with high precision. The accuracy of a system, however, depends on the types of images used during training. In certain cases, such as underwater images and images with motion blur, machines cannot yet reliably perform to an acceptable standard. Our work addresses the problem of captioning underwater images. The reason we have been unable to caption underwater images so far is that commonly used datasets such as Flickr8k, MSCOCO [1] and Visual Genome [2] lack enough underwater images to train models to caption them. Hence, for this purpose, a new dataset called PESEmphasis5k has been put together, containing both underwater and terrestrial images. A model trained on this dataset is able to caption underwater as well as terrestrial images with good accuracy.

There exist several evaluation methods to measure the accuracy of the models. The BLEU score is one such method. However, this metric is not a reliable measure of accuracy where human judgment is involved [3]. To address this concern, we introduce the Parameter Wise Tally Checker (PWTC), which currently involves human evaluation. The method can be further improved and automated in the future, and it offers broad flexibility and ease of use.
With this paper we accomplish two objectives:

1. Generating accurate captions for underwater images


2. Introducing the Parameter Wise Tally Checker as a new analysis and evaluation
method for deep learning models.

Fig. 1. Captions generated by our model trained on the PESEmphasis5k Dataset for images
which were not present in the dataset

2 Related Work

2.1 Deep Neural Network for image processing:

For image processing, Deep Neural Networks are powerful models. Our work relates to many previous works in that it shares the goal of generating a dense annotation for the contents of images. Various other studies [3-5] have been conducted in the field; however, their primary objective is to correctly label objects, scenes, and regions with a predefined set of categories, while our objective is to generate well-understood captions that describe the image.

2.2 Generating descriptions:

Work related to image captioning has already been explored, and various approaches have been developed. Karpathy et al. [6] developed a Recurrent Neural Network (RNN) model which obtains the probability distribution of the next word in the caption based on previously generated words. However, deep learning models trained on the commonly used datasets fail to generate correct captions for underwater images. The problem does not lie in the deep learning models themselves, but rather in the datasets on which they were trained. Our model, when trained on the PESEmphasis5k dataset with underwater images using the encoder-decoder mechanism, produced good captions for underwater images.

2.3 Model evaluators:

Creating the architecture for a model is an important starting step; however, evaluating the accuracy of the created architecture is even more important. If an inaccurate architecture is used in real-life applications, the results can be disastrous. A lot of work has gone into developing accurate and reliable evaluation methods. The BLEU score is one such technique, and it yields accurate results in various applications. However, when used in natural language processing applications like image captioning, it fails to produce reliable results. The BLEU score relies on n-gram matching to compute the accuracy of the model, checking whether the word sequences generated by the model are the same as those present in the dataset. It does not consider synonyms or the numerous ways an image can be perceived by different observers. Our newly introduced PWTC takes these differences in perception into account and is hence a more reliable measure of the accuracy of the model. Currently, the evaluation must be done manually, but the process can be automated in the future.

3 Method

The approach to solving the problem of captioning underwater images was divided into several simple stages, which are explained in detail below:

3.1 Dataset Construction


As mentioned before, after examining the various commonly used datasets, the common observation was a lack of underwater images, which is why existing models fail to generate acceptable captions for such images. Hence, our first step was to obtain a dataset of underwater images.

1000 underwater images were collected manually, and another 4000 images from the Flickr8k dataset [7] were added to them to obtain a total dataset size of 5000 images. This was done so that the model could generate captions for underwater images while retaining the ability to caption terrestrial images. Obtaining captions for the image dataset was naturally the next step. The captions for the 4000 images taken from the Flickr8k dataset were adopted as they were, and the remaining 1000 underwater images were captioned manually, with 5 captions for each image [1]. We obtained the captions with the help of a small team of 5, along with some friends and relatives. The two sets of captions for terrestrial and underwater images were then merged to obtain the captions for the dataset. Manual captioning was done by various groups of people, including children, teenagers, adults, teachers, and students. This was done to obtain a different perspective on each image, leading to a greater diversity of words and language in the dictionary of the dataset.

3.2 Feature extraction

After the creation of the dataset, features are extracted from the images in the dataset. This is done by feeding each image to a Convolutional Neural Network (CNN) model [8-9]. Before feeding the image to the CNN, its size is standardized to 224 x 224, both to maintain uniformity in the code and because the CNN accepts images of only specific preset dimensions. There are many popular, well-developed CNN models available, a few of which are InceptionV3 [10], VGG [11], and ResNet [12]. Normally these models directly provide the class the image belongs to, i.e. they are used for image classification. However, after removing the last classifying layer of the CNN model, the feature vector of each image can be obtained. These feature vectors are then saved.
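
For illustration, the sketch below shows one way this stage could be implemented in Keras, using VGG16 with its final softmax layer removed. The file paths, and the choice of VGG16 rather than InceptionV3 or a ResNet variant, are assumptions for the example, not the exact code used in this work.

# Minimal sketch: obtain image features from a pre-trained CNN with the
# last classifying layer removed. Paths and model choice are illustrative.
import pickle
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16()                                                   # pre-trained on ImageNet
cnn = Model(inputs=base.inputs, outputs=base.layers[-2].output)  # drop the softmax layer

def extract_feature(image_path):
    img = load_img(image_path, target_size=(224, 224))           # standardize to 224 x 224
    x = img_to_array(img).reshape((1, 224, 224, 3))
    x = preprocess_input(x)
    return cnn.predict(x, verbose=0)[0]                          # 4096-dimensional feature vector

# features = {image_id: extract_feature(path) for image_id, path in image_paths.items()}
# pickle.dump(features, open('features.pkl', 'wb'))              # save the feature vectors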

3.3 Data-preprocessing

After extracting the features from the images in the dataset, the captions for each image are processed. Processing includes cleaning the captions and reducing them to a simpler form for the computer. This can be done in several ways. Here, punctuation is removed from the captions, and single-letter words as well as words containing numbers are removed. This reduces the complexity of the captions, making it easier for the computer to understand and learn. The cleaned captions for each individual image are then stored in a dictionary, and the contents of this dictionary are saved.
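
A minimal sketch of this cleaning step is given below; the function name and any cleaning rules beyond those described above are assumptions for illustration.

import string

def clean_caption(caption):
    # lower-case, strip punctuation, drop single-letter words and
    # words containing numbers, as described above
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in caption.lower().split()]
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return ' '.join(words)

# descriptions[image_id] = [clean_caption(c) for c in raw_captions[image_id]]
# the resulting dictionary is then saved, e.g. with pickle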

3.4 Model Definition


This is one of the most important stages in the machine learning process. Here, our deep learning model is trained to fit our dataset, which helps it identify the features in the image and link them to our RNN [13-14]. Different combinations and variations of LSTM [15] and GRU [16] layers were used and evaluated with our deep learning model.
These included a single LSTM layer, two LSTM layers, a single GRU layer, two GRU layers, an LSTM layer followed by a GRU layer, and twice alternating layers of LSTM and GRU.
The processed captions are provided to the tokenizer, and the image features saved in the feature extraction stage are fed into the deep learning model, which is then trained. Weights for the model are obtained by training on the training dataset, and the model is tested on the testing dataset. This is repeated in every epoch: after each epoch the weights are modified and an updated model is obtained, until the weights reach a saturation point and no longer change significantly.
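
As an illustration, the sketch below outlines an encoder-decoder model of the kind described here, with the twice alternating LSTM and GRU layers. The embedding size, number of hidden units, and dropout rate are assumptions for the example, not the exact hyperparameters used in this work.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, GRU, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length, feature_dim=4096):
    # image branch: compress the saved CNN feature vector
    img_in = Input(shape=(feature_dim,))
    img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

    # caption branch: twice alternating LSTM and GRU layers over the word sequence
    seq_in = Input(shape=(max_length,))
    x = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
    x = LSTM(256, return_sequences=True)(x)
    x = GRU(256, return_sequences=True)(x)
    x = LSTM(256, return_sequences=True)(x)
    x = GRU(256)(x)

    # decoder: merge both branches and predict the next word
    merged = Dense(256, activation='relu')(add([img_vec, x]))
    out = Dense(vocab_size, activation='softmax')(merged)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    model.compile(loss='categorical_crossentropy', optimizer='adam')  # Adam optimizer [18]
    return model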

3.5 Model Evaluation

Here, the trained deep learning model with the saved weights is used, and the accuracy of the model is determined using the BLEU score [17]. The model generates captions for the test images, these captions are compared to the pre-defined captions in the dataset, and the accuracy is evaluated. The pre-processed captions are tokenized by encoding them and saved. The BLEU scores of the model are also obtained to express the accuracy in numbers.
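
A sketch of how the four BLEU scores could be computed with NLTK is shown below; the variable names are illustrative, and the references and candidates are assumed to be tokenized word lists.

from nltk.translate.bleu_score import corpus_bleu

# references: one list of tokenized ground-truth captions per test image
# candidates: one tokenized generated caption per test image
def report_bleu(references, candidates):
    print('BLEU-1: %.3f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %.3f' % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %.3f' % corpus_bleu(references, candidates, weights=(0.33, 0.33, 0.33, 0)))
    print('BLEU-4: %.3f' % corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))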

3.6 Using the model for generating captions

This is the final step in determining the accuracy of our deep learning model and generating captions for input images. Here, the pre-trained model weights are loaded into the deep learning model along with the tokenized data that was saved in the model evaluation stage. Using these, captions of a fixed maximum length are generated for each newly input image. The input images are resized to 224 x 224 to maintain uniformity. The features of the image are extracted using the CNN, and the feature vector and the tokenizer are then used to generate the captions. This is done by decoding the tokenizer output and linking the tokens to the features extracted from the input image. The input images are specially selected such that they were not present in the PESEmphasis5k dataset.
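
The sketch below illustrates this decoding loop using greedy word-by-word prediction; the 'startseq'/'endseq' markers and function names are assumptions for the example.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    # repeatedly predict the most probable next word until the end
    # marker is produced or the fixed caption length is reached
    in_text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([np.array([photo_feature]), seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None or word == 'endseq':
            break
        in_text += ' ' + word
    return in_text.replace('startseq', '').strip()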

3.7 Developing new manual evaluation method

The BLEU score is not a reliable measure of accuracy for models where natural language is involved. Hence, a new manual evaluation method called PWTC was developed. This method is needed because human verification is an essential part of confirming that the model works properly and generates the desired output, i.e. correct captions for input images that are not present in the PESEmphasis5k dataset. For evaluating our underwater image captioning model, 5 different parameters were selected. The captions were checked manually to determine which parameters are fully satisfied by the generated captions. However, the number of parameters can vary based on the application and is left to the discretion of the user. This is explained in detail in the evaluation section.

4 Experimentation

Initially, the Flickr8k dataset was used, but on careful analysis of the images present in the dataset, it was found that it did not contain enough underwater images. Hence, a blue color filter was applied over 2000 of the Flickr8k images and the word 'underwater' was appended to the respective captions, in an attempt to help the models caption underwater images. This was done because, on general observation, underwater images have a characteristic blue tinge to them. However, the model still did not caption new input images well. Hence our new dataset, PESEmphasis5k, was created.
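
For reference, a blue tint of this kind could be produced as in the sketch below; the blend strength and blue shade are assumptions, since the exact filter used in the experiment is not specified here.

from PIL import Image

def add_blue_tint(in_path, out_path, strength=0.35):
    # blend the photograph with a solid blue layer to mimic the
    # characteristic blue tinge of underwater images
    img = Image.open(in_path).convert('RGB')
    blue = Image.new('RGB', img.size, (0, 60, 160))
    Image.blend(img, blue, strength).save(out_path)
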
With our newly created dataset, a comparative study was done between the captions generated by the deep learning models when trained on Flickr8k and on PESEmphasis5k. It was observed that the model generated correct captions when trained on PESEmphasis5k rather than on Flickr8k. However, further improvement in the accuracy of the generated captions was desirable, so further work was done on improving the captioning model.
Various combinations of existing CNN models such as InceptionV3, VGG16, VGG19, ResNet50 and InceptionResnetV2, combined with variations of LSTM [15] and GRU [16] layers, were trained and evaluated. A few of the variations tried were InceptionV3 with alternating layers of LSTM and GRU, InceptionResnetV2 with alternating layers of LSTM and GRU, InceptionV3 with LSTM layers followed by GRU layers, and InceptionResnetV2 with LSTM layers followed by GRU layers. Besides these, VGG16 and VGG19 models with GRU and LSTM layers were also used. GRU, which is relatively new, has two gates as opposed to LSTM, which has three. GRU was used because it is more efficient and works better on smaller datasets.
Different numbers of hidden units were used for each iteration, and their performance was measured and compared.
To optimize the deep learning model, the Adam optimizer [18] was used. It was observed that after increasing the number of GRU and LSTM layers beyond three to four, the performance of the model worsened. Hence, our models were restricted to this number, as the captions generated were already accurate.
A different method for evaluating the accuracy of the model was required, as the BLEU score was not reliable. For this, PWTC was created as a new evaluation method, and parameters that greatly influence the accuracy of the generated captions were selected. On careful examination, 5 such parameters were found. The parameters and the method of evaluation using them are explained in detail in the evaluation section.
Once the different variations of the deep learning model had been evaluated using the BLEU score and PWTC, a comparative study was performed to determine which model generated the best captions for the input images. The twice alternating layers of LSTM and GRU were found to be the best RNN.

Fig. 2. Starting with an input image of dimension 3 x 224 x 224 fed into the CNN, the features obtained and the tokenizer are passed to the RNN, which finally outputs the captions for the entered image.

Fig. 2 shows a simple flow diagram of the process of generating captions for an input image. It is divided into the stages of inputting the image, obtaining the feature vector using the CNN, and feeding this feature vector along with the tokenizer to the RNN to obtain the final caption for the input image.

5 Evaluation

As mentioned in the method, the model was first evaluated using the BLEU score, and the scores were printed to visualize the accuracy.

5.1 Evaluation using BLEU score

The evaluation of the model was done using the BLEU score [17]. After training the model, the trained weights were loaded into the evaluator and 4 BLEU scores were obtained. However, the BLEU score is not a reliable estimate of the accuracy of models where natural language is involved, because the way it is computed does not correlate strongly with human judgment. The model may perceive the image correctly, but from a different perspective than a human, and generate captions which are correct yet different from those present in the dataset. This leads to a low BLEU score even for a good model. The BLEU score is also not a reliable comparative measure between two models when the difference in their scores is very small, although it works well when models with a large difference in BLEU scores are compared. To tackle these issues, PWTC was introduced.

5.2 Parameter Wise Tally Checker

PWTC is a new evaluation and analysis method developed to suit our captioning model; however, it can be used in various other applications due to its flexibility. PWTC works by selecting parameters against which the accuracy of the output is determined. After the output is obtained, we check whether it satisfies each defined parameter. It mimics the working of a computer in the sense that each parameter can only take a True or a False value: a True value is assigned if the output satisfies the parameter, otherwise a False value is assigned.
Once the parameters are tallied, the ratio of the number of parameters assigned a True value to the total number of parameters is computed. This ratio gives the accuracy of the output. In the case of multiple images, the accuracies of the individual outputs are averaged.
Further, PWTC can also be used to analyze the outputs and locate where a problem lies in the model or the dataset. The parameters that are satisfied less often can be identified, which helps in figuring out the source of the problem and fixing it to improve performance further.
The number of parameters can vary based on the application. There is no fixed number; for each application, the important parameters that influence the accuracy of the output can be defined. The evaluation time is directly proportional to the number of parameters, but so is the level of detail obtained. If the number of parameters is increased, the number of checks naturally increases; however, the evaluation results provide more detail that can be analyzed to help improve performance.
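
A minimal sketch of the PWTC accuracy computation is given below; the parameter names match those used later for the captioning model, and the True/False values are assumed to come from a human evaluator.

def pwtc_accuracy(evaluations):
    # evaluations: one dict per generated output, mapping each
    # parameter name to True (satisfied) or False (not satisfied)
    per_output = [sum(e.values()) / len(e) for e in evaluations]
    return sum(per_output) / len(per_output)

# example with the five parameters used for the captioning model
scores = [
    {'object': True, 'detail': True, 'context': True, 'action': True, 'grammar': True},
    {'object': True, 'detail': False, 'context': True, 'action': True, 'grammar': True},
]
print(pwtc_accuracy(scores))   # 0.9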

5.3 Using Parameter Wise Tally Checker as a performance measure for the
Underwater Image Captioning Model

PWTC was used to evaluate and analyze the underwater image captioning model. After analyzing the parameters that most affect the accuracy of the generated captions, 5 parameters were selected: Object, Detail, Context, Action, and Grammar. These terms are explained in detail below:

• Object is the general description of the subject of the image. It checks whether the object in the image is correctly identified, such as a fish, man, or dog.
• Detail is the extra, fine-grained information given for the object, for example the gender of the person in the image, or the breed or color of the fish.
• Context checks whether the generated caption places the object in the right location, such as underwater, above water, or near the seabed.
• Action is the action being performed by the object in the image.
• Grammar is the grammatical correctness and understandability of the generated caption.

The image captioning model was evaluated manually against these parameters using PWTC. We determined the accuracy of each generated caption by counting the number of parameters it satisfies, for input images not present in the dataset. Once this count was obtained for many different images, the average over all images gave a final measure of the number of parameters satisfied by the captions. This method of evaluation also helped in identifying which parameters were more prone to being incorrectly deduced.

6 Result and Conclusion

The results obtained after careful evaluation of the models are discussed below. Various combinations of CNNs and RNNs were evaluated. The evaluation was done using the BLEU score and human checking of the final captions generated for test images that were not present in the PESEmphasis5k dataset. Below is a comparative study of the various models tested after training them on different datasets.

6.1 The model trained on original Flickr8k vs Model trained on the modified
Flickr8k with blue filter

On training the InceptionV3 with 2 LSTM layers model on the original and the modified Flickr8k datasets, it was observed that the captions generated by the two models were different due to the change in the dataset, but no improvement was observed. The captions generated by both models were incorrect, and hence it was concluded that modifying the Flickr8k dataset was not helpful in generating better captions for the input images.

6.2 The model trained on Flickr8k vs model trained on the PESEmphasis5k dataset

In Table 1, the captions generated by the InceptionV3 with 2 LSTM layers model when trained on the different datasets are listed and compared.
As can be observed in the table, the model trained on the PESEmphasis5k dataset, which has 1000 underwater images, generated almost correct captions, as compared to the model trained on the Flickr8k dataset, which generated incorrect captions. Hence, it can be concluded that creating the new PESEmphasis5k dataset and using it to train the captioning models yielded better results than training the model on an existing dataset.

Table 1. Captions generated by the InceptionV3 + 2 LSTM layers model, when trained on Flickr8k and PESEmphasis5k

Image 1:
  Flickr8k: "Man in red shirt is standing on the beach"
  PESEmphasis5k: "An underwater image with yellow fish"

Image 2:
  Flickr8k: "Dog running on the grass"
  PESEmphasis5k: "Turtle swimming near the seabed"

Image 3:
  Flickr8k: "Man in red shirt is standing in front of building"
  PESEmphasis5k: "Woman in black dress swimming in the ocean"

6.3 InceptionV3 with 2 LSTM vs InceptionV3 with 2 LSTM and 2 GRU when
trained on the PESEmphasis5k dataset

InceptionV3 with 2 LSTM layers was the first model tried, and the best model obtained so far is InceptionV3 with 2 LSTM and 2 GRU layers. These two have been trained on PESEmphasis5k and are compared below.

Table 2. BLEU scores obtained after evaluation of first model and best model

Model                                        BLEU-1  BLEU-2  BLEU-3  BLEU-4
InceptionV3 + 2 LSTM Layers                  0.544   0.300   0.216   0.110
InceptionV3 + 2 LSTM Layers + 2 GRU Layers   0.586   0.337   0.242   0.125

As can be observed, the BLEU scores obtained for the two models differ significantly, so we can conclude that InceptionV3 with 2 LSTM and 2 GRU layers is the better model. Note that this conclusion can be drawn only when the difference in BLEU scores is significant. The captions generated by the two models are also compared below:

Table 3. Captions generated by the first model vs the best model

Image 1:
  InceptionV3 + 2 LSTM Layers: "An underwater image with yellow fish"
  InceptionV3 + 2 LSTM Layers + 2 GRU Layers: "School of fish swimming near the seabed"

Image 2:
  InceptionV3 + 2 LSTM Layers: "Turtle swimming near the seabed"
  InceptionV3 + 2 LSTM Layers + 2 GRU Layers: "Turtle swimming near the seabed"

Image 3:
  InceptionV3 + 2 LSTM Layers: "Woman in black dress swimming in the ocean"
  InceptionV3 + 2 LSTM Layers + 2 GRU Layers: "Man wearing black clothes swimming underwater"

As can be observed, the quality, detail, and accuracy of the captions generated by InceptionV3 with 2 LSTM and 2 GRU layers are much better than those of the first model.

6.4 Comparison between different models tried out after training on the
PESEmphasis5k dataset

Table 4 lists the BLEU scores of a few of our models trained on the PESEmphasis5k dataset.

Table 4. BLEU scores obtained after evaluation of a few of our best captioning models

Model                                              BLEU-1  BLEU-2  BLEU-3  BLEU-4
InceptionV3 + 4 GRU Layers                         0.499   0.284   0.206   0.102
InceptionResnetV2 + 4 GRU Layers                   0.528   0.317   0.235   0.123
InceptionV3 + 2 LSTM Layers                        0.544   0.300   0.216   0.110
InceptionV3 + 4 LSTM Layers                        0.551   0.327   0.241   0.127
InceptionV3 + 1 GRU Layer                          0.557   0.321   0.233   0.120
InceptionResnetV2 + 2 GRU Layers + 2 LSTM Layers   0.562   0.321   0.223   0.106
InceptionV3 + 2 GRU Layers + 2 LSTM Layers         0.566   0.324   0.243   0.128
InceptionV3 + 1 LSTM Layer + 1 GRU Layer           0.568   0.345   0.255   0.137
InceptionV3 + 2 LSTM Layers + 2 GRU Layers         0.586   0.337   0.242   0.125
InceptionResnetV2 + 4 LSTM Layers                  0.590   0.354   0.259   0.140

As shown in Table 4, the best BLEU-1 score is 0.590, for the InceptionResnetV2 CNN followed by four LSTM layers. However, on practical evaluation with PWTC, the best output captions are obtained with the deep learning model consisting of InceptionV3 followed by 2 alternating layers of LSTM and GRU, which has a BLEU-1 score of 0.586. This model produced the best captions when tested and compared with the captions of the other models. This also shows that the BLEU score is not a reliable measure for evaluating a model involving natural language processing, and that the accuracy of such a model must be judged by the human understandability and correctness of the output captions for images not present in the PESEmphasis5k dataset.
Fig. 3 shows the architecture of the RNN component, comprising twice alternating layers of LSTM and GRU.

Fig. 3. Architecture of the RNN with two alternating layers of LSTM and GRU

Table 5 provides the evaluation of a few of the test images used to check the accuracy of the best model. The generated caption is given alongside each image, together with the details of the parameters satisfied by that caption. The accuracy of the captions is also calculated; it is computed as the ratio of the number of satisfied parameters to the total number of parameters. The analysis of the model is provided as well.

Table 5. Evaluation of the best model by PWTC gives 92% accuracy on the selected 5-image set. Detail is the least accurate parameter; the other parameters are identified perfectly.

Image 1: "Clownfish swimming near coral reef underwater"
  Object ✓  Detail ✓  Action ✓  Context ✓  Grammar ✓

Image 2: "Turtle swimming near the seabed"
  Object ✓  Detail ✓  Action ✓  Context ✓  Grammar ✓

Image 3: "Woman in red dress swimming underwater"
  Object ✓  Detail ✗  Action ✓  Context ✓  Grammar ✓

Image 4: "Woman swimming underwater"
  Object ✓  Detail ✗  Action ✓  Context ✓  Grammar ✓

Image 5: "Brown dog is running through the grass"
  Object ✓  Detail ✓  Action ✓  Context ✓  Grammar ✓

7 Ease of Use

The model developed is easy to use and has various applications. The model needs to be trained only once on the dataset, after which the weights for the model are saved. Once the trained weights and the tokenizer file are created, all that needs to be done is to load these saved files in the captioning part of the model and generate captions for newly input images. The model can also process multiple images at a time and generate captions for them. The images to be captioned can be provided in various ways, such as directly entering the name of the image, capturing a live image from a camera and inputting it into the model, or passing a whole folder of images to the model.
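
As a usage illustration, the sketch below captions every image in a folder by reusing the helper functions sketched in the earlier sections; the function and file names are assumptions rather than the released code.

import os

def caption_folder(folder, model, tokenizer, max_length):
    # generate a caption for every image file found in the folder,
    # reusing extract_feature and generate_caption from the sketches above
    captions = {}
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(('.jpg', '.jpeg', '.png')):
            feature = extract_feature(os.path.join(folder, name))
            captions[name] = generate_caption(model, tokenizer, feature, max_length)
    return captions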

8 Significance

This work can serve as a base for future development of image captioning models and can be extended to paragraph generation. It can be useful research material for researchers working in similar domains, particularly underwater image analysis. For example, automated reports can be generated for underwater scenery by generating captions for images taken underwater at regular intervals.

9 Limitations and Future Work

9.1 Limitations

─ The dataset used (PESEmphasis5k) is currently limited to 5000 images and can be considered imperfect. Improving the quality of the images and captions in the dataset can further improve the quality of the generated captions.
─ PWTC is currently a manual evaluation method and is yet to be automated.

9.2 Future Work


─ Improvements are to be made to the PESEmphasis5k dataset to provide more diversity in the images and their captions.
─ Further research on PWTC can be performed, with the aim of automating it.

10 Acknowledgements

We would like to extend hearty thanks to our family and friends who helped us greatly in the process of creating the dataset. We would also like to thank our teachers at PES University, who provided us valuable insight on how to tackle the problem; PES University also provided us with the resources required for the project. Last but not least, we extend our hearty thanks to the authors of the research papers we have referred to, as they provided valuable information and made the research process a lot easier.

References
1. Tsung-Yi Lin et al. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.
2. Ranjay Krishna et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations". In: International Journal of Computer Vision 123.1 (2017), pp. 32–73.
3. Micah Hodosh, Peter Young, and Julia Hockenmaier. "Framing image description as a ranking task: Data, models and evaluation metrics". In: Journal of Artificial Intelligence Research 47 (2013), pp. 853–899.
4. Girish Kulkarni et al. "Baby talk: Understanding and generating image descriptions". In: Proceedings of the 24th CVPR. 2011.
5. Oriol Vinyals et al. "Show and tell: A neural image caption generator". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3156–3164.
6. Andrej Karpathy and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3128–3137.
7. Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. "Im2Text: Describing images using 1 million captioned photographs". In: Advances in Neural Information Processing Systems. 2011, pp. 1143–1151.
8. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
9. Olga Russakovsky et al. "ImageNet large scale visual recognition challenge". In: International Journal of Computer Vision 115.3 (2015), pp. 211–252.
10. Christian Szegedy et al. "Going deeper with convolutions". In: CoRR abs/1409.4842 (2014). URL: http://arxiv.org/abs/1409.4842.
11. Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
12. Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
13. Ilya Sutskever, James Martens, and Geoffrey E. Hinton. "Generating text with recurrent neural networks". In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011, pp. 1017–1024.
14. Zachary C. Lipton, John Berkowitz, and Charles Elkan. "A critical review of recurrent neural networks for sequence learning". In: arXiv preprint arXiv:1506.00019 (2015).
15. Sepp Hochreiter and Jürgen Schmidhuber. "Long short-term memory". In: Neural Computation 9.8 (1997), pp. 1735–1780.
16. Junyoung Chung et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling". In: arXiv preprint arXiv:1412.3555 (2014).
17. Kishore Papineni et al. "BLEU: a method for automatic evaluation of machine translation". In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002, pp. 311–318.
18. Diederik P. Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980 (2014).
