
Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 2017, 10–14 July 2017

IMAGE CAPTIONING WITH DEEP LSTM BASED ON SEQUENTIAL RESIDUAL

Kaisheng Xu^{1,2}, Hanli Wang^{1,2,*}, Pengjie Tang^{1,2}

^{1} Department of Computer Science and Technology, Tongji University, Shanghai, P. R. China
^{2} Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, P. R. China

^{*} Corresponding author (H. Wang, E-mail: hanliwang@tongji.edu.cn). This work was supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, and the Program for Professor of Special Appointment (Eastern Scholar) at the Shanghai Institutions of Higher Learning under Grant GZ2015005.

ABSTRACT

Image captioning is a fundamental task which requires semantic understanding of images and the ability to generate description sentences with a proper and correct structure. In consideration of the problem that language models are usually shallow in modern image captioning frameworks, a deep residual recurrent neural network is proposed in this work with the following two contributions. First, an easy-to-train deep stacked Long Short Term Memory (LSTM) language model is designed to learn the residual function of output distributions by adding identity mappings to multi-layer LSTMs. Second, in order to overcome the over-fitting problem caused by the larger-scale parameters of deeper LSTM networks, a novel Temporal Dropout method is proposed for LSTM. The experimental results on the benchmark MSCOCO and Flickr30K datasets demonstrate that the proposed model achieves state-of-the-art performances with 101.1 in CIDEr on MSCOCO and 22.9 in B-4 on Flickr30K, respectively.

Index Terms— Image captioning, residual network, recurrent neural network, LSTM, regularization.

1. INTRODUCTION

Image captioning is a challenging task in computer vision which translates an image into natural language with a proper structure and rich semantic information according to its visual content. In order to describe an image by natural language, it is required to understand the image contents correctly, and then to generate the corresponding sentences with correct words, grammar and structure.

A number of approaches have been proposed for image captioning, including the template based model [1], the semantic transfer based approach [2], and the neural network based methods [3–5]. In the template based method [1], objects, actions and scenes are first detected and filled into a template which is given in advance, and then the corresponding words related to those objects, actions and scenes are generated to form description sentences. However, the sentences yielded by template based methods are often rigid and lack rich styles. Regarding semantic transfer based approaches such as [2], image retrieval is usually applied first given a target description image. Once the target image is queried, several similar images are retrieved from an image data repository, and the descriptions of these images are then transferred to the target caption. Compared to template based methods, transfer based approaches can generate more human-like and flexible sentences, but their performance is highly dependent on the training dataset and they are of high computational complexity.

The recent works on image captioning are mostly built upon Deep Neural Networks (DNN) with the encoder-decoder pipeline and optimized in an end-to-end manner. In general, an image is encoded into a fixed-length feature vector by a Convolutional Neural Network (CNN), and then the encoded feature vector is decoded by a language model such as a Recurrent Neural Network (RNN) to produce sentences. In [3], the probability distribution of the words in an image is modeled by an effective multimodal representation with the log-bilinear neural language model. In [4], the Long Short Term Memory (LSTM) network is applied as the language model, and the CNN features extracted from the GoogLeNet [6] model are employed for visual modeling. The region-based CNN (RCNN) is introduced in [5] to make use of the semantic information of the candidate image, by which the objects are first detected and then the relation between the vision model and the language model is established.

In view of the efficiency and flexibility of neural network based methods, we follow this pipeline in this work. However, in previous researches, only one-layer or two-layer RNN models are designed to construct the language model, and it has been verified in [7] that it is difficult to train a stacked LSTM network for visual computing. This is mainly because deeper LSTM models are harder to optimize: the errors from higher layers cannot be effectively transferred to lower layers with back propagation, and thus the parameters in lower layers are not optimized sufficiently. Inspired by ResNet [8], in which shortcut connections are added in very deep CNN models, a novel LSTM language model named ResLSTM is

proposed in this work to construct deeper LSTM models to improve the quality of description sentences for image captioning. On the other hand, over-fitting is an emerging issue for DNN models due to the huge number of model parameters. Several regularization methods such as Dropout [9] and DisturbLabel [10] are designed to address this problem. In this work, we propose a regularization technique named Temporal Dropout which is suitable to be applied to LSTM networks.

There are two major contributions of this work. First, in order to settle the problem of gradient vanishing in LSTM networks with multiple layers, the easy-to-train deep residual LSTM language model ResLSTM is proposed for image captioning. Second, the Temporal Dropout method is designed to prevent over-fitting for deep stacked LSTM network based language models. The rest of this paper is organized as follows. Section 2 presents the related works of image captioning, residual representation and model regularization. The proposed ResLSTM model with Temporal Dropout is detailed in Section 3. The experimental results are demonstrated in Section 4. Finally, Section 5 concludes this work.

2. RELATED WORK

High-level image understanding has drawn much attention in recent years. With the success of deep learning, the encoder-decoder framework based on CNN and LSTM has become the mainstream to bridge the gap between image and text for image captioning. A number of approaches are proposed to achieve the goal of image description with flexible structure and rich semantic information. Mao et al. [11] propose the m-RNN model, in which the previous words and the features extracted from the VGG net [12] are integrated to predict the next word of the generated sentence. In [7], the Long-term Recurrent Convolutional Network (LRCN) model is designed to caption images and videos, and the factorization method is employed to build the language model with LSTM. Vinyals et al. [4] argue that embedding CNN features into each time step of the RNN makes the model prone to over-fitting; therefore they only feed CNN features into the LSTM network at the first time step and obtain promising performances.

In addition to the aforementioned researches, several elaborate models are developed with advanced computer vision methods. In [5], RCNN is employed to encode regions rather than the whole image, and a Bi-directional LSTM is implemented to decode the region features and generate semantic sentences for an image. Xu et al. [13] introduce the attention mechanism into visual modeling and achieve astonishing performances on evaluation metrics such as BLEU [14], METEOR [15] and CIDEr [16]. Different from [5] and [13], Wu et al. [17] focus on the correspondence between image regions and their attributes, and establish the relationship between visual information and text. However, these approaches ignore the fact that the language model is too shallow to provide adequate representation for language features.

It is well known that it is difficult to train DNN models even though they can enhance the representation and generalization ability of features. To address this issue, Srivastava et al. propose the highway network [18, 19], which introduces bypasses into deep models so that several layers can be skipped and the information from one layer is fed to the layer following the skipped layers. Recently, He et al. [8] use an identity mapping to make higher layers fit the residual function between lower layers and the target function in ResNet. Inspired by [8], the bypass idea is introduced into LSTM networks to build the proposed ResLSTM language model.

On the other hand, DNN usually suffers from over-fitting caused by large-scale model parameters. Several effective regularization methods are proposed to address this difficulty, including Dropout [9], DropConnect [20], DisturbLabel [10], Swapout [21], and so on. Dropout [9] is a widely used regularization approach which randomly disables a subset of neurons in a layer to force some information to be abandoned during model training. As a generalization of Dropout, DropConnect [20] randomly drops connections between layers rather than neurons. Instead of disturbing neurons, the DisturbLabel [10] method deliberately adds small errors to labels to prevent the network from fitting special types of data. In [21], the Swapout method inhibits co-adaptation of units not only within a layer but also across network layers. In this work, we propose the Temporal Dropout method to regularize the temporally sequential LSTM for language model construction.

3. PROPOSED RESLSTM MODEL WITH TEMPORAL DROPOUT

3.1. Fundamentals

For image captioning, let F denote the feature vector of an image generated from a CNN model, and let S = ⟨s_1, s_2, ..., s_N⟩ be the reference sentence of the image, where N is the number of words in the reference sentence and s_i is the i-th word. The target is to approximate the sentence distribution log p(S|F). By applying the chain rule of probability, the log distribution can be represented in a joint distribution form as

    \log p(S \mid F) = \sum_{t=1}^{N} \log p(s_t \mid F, S_{1,\cdots,t-1}),    (1)

where each word s_i in S is represented in the form of a word embedding, and S_{1,...,t-1} = ⟨s_1, s_2, ..., s_{t-1}⟩ is the slice of the first t-1 elements of S.

To model log p(s_t | F, S_{1,...,t-1}), an RNN can be employed, in which the output is calculated as

    h_t = f(s_t, h_{t-1}),    (2)

where f denotes the RNN cell and h_{t-1} is the output at the previous time step t-1. An RNN owns the ability of learning sequential information such as sentences; however, it cannot

achieve long-term dependence due to gradient vanishing. As a variant of RNN, LSTM is able to remember information for long periods of time. The core of LSTM is its memory, which contains the knowledge at each time step and is controlled by three different kinds of gates: the forget gate, the input gate and the output gate. These gates are guided by h_{t-1} and s_t via the cell function f. In this work, we employ LSTM as the basic module to build the proposed language model.

The basic pipeline of the proposed model includes the following key steps. First, image features are extracted from a CNN model and fed into the first time step of the language model; then the embedding feature of each word s_i is input to the language model at each subsequent time step. Afterwards, the hidden state h_t, which compactly contains all the previous information, is estimated. Finally, Softmax is employed to predict the distribution of s_{t+1}. For training convenience, a special word called Start of Sentence (SoS) is added at the beginning of each sentence. The cross entropy loss function is employed for back propagation, and the sentence with the highest confidence score is the final winner for image captioning.
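To make this pipeline concrete, the following is a minimal PyTorch sketch of such a decoder. The module and hyper-parameter names (feat_dim, embed_dim, hidden_dim) and the choice of word id 0 for SoS are illustrative assumptions, not the authors' implementation, which builds on the NeuralTalk2 framework (cf. Section 4.2):

# Minimal sketch of the captioning pipeline of Section 3.1 (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # CNN feature -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings; id 0 = SoS (assumption)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)    # cell function f of Eq. (2)
        self.out = nn.Linear(hidden_dim, vocab_size)      # logits for the Softmax over the vocabulary

    def forward(self, feat, words):
        """feat: (B, feat_dim) image feature; words: (B, N) reference word ids.
        Returns -log p(S | F) of Eq. (1), averaged over the N words."""
        h = feat.new_zeros(feat.size(0), self.cell.hidden_size)
        c = feat.new_zeros(feat.size(0), self.cell.hidden_size)
        h, c = self.cell(self.img_proj(feat), (h, c))  # image fed at the first time step only [4]
        sos = torch.zeros_like(words[:, :1])           # prepend SoS before the reference words
        inputs = torch.cat([sos, words[:, :-1]], dim=1)
        loss = 0.0
        for t in range(words.size(1)):
            h, c = self.cell(self.embed(inputs[:, t]), (h, c))           # Eq. (2): h_t = f(s_t, h_{t-1})
            loss = loss + F_nn.cross_entropy(self.out(h), words[:, t])  # cross entropy per time step
        return loss / words.size(1)

At test time, the same loop is unrolled autoregressively: the word predicted by the Softmax (or a beam of candidates) is fed back as the input of the next time step, and the sentence with the highest accumulated confidence score is returned.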

3.2. Residual LSTM (ResLSTM)

As aforementioned, language models are usually shallow, employing only one or two RNN or LSTM layers, which leads to a poor ability of representation and discrimination for language features. A shallow LSTM network can hardly approximate the function of the language model, while multi-layer LSTM networks such as the stacked LSTM usually suffer from the problem of gradient vanishing. This phenomenon is also found in [7], where the two-layer stacked LSTM achieves the best performance and the model performance degrades with more than two LSTM layers.

In order to deepen LSTM based language models while improving model performances, we propose the ResLSTM model with the details given below. First, let g^0 stand for the LSTM model which aims to approach the RNN cell function f depending on s_t and the hidden state h^0_{t-1}, which can be formulated as

    h^0_t = g^0(s_t, h^0_{t-1}).    (3)

Inspired by ResNet, we use another shallow LSTM model g^1 to fit the residual function with the output of g^0 and h^1_{t-1} as

    h^1_t = f(s_t, h_{t-1}) - g^0(s_t, h^0_{t-1}) \triangleq g^1(h^0_t, h^1_{t-1}) = g^1(g^0(s_t, h^0_{t-1}), h^1_{t-1}).    (4)

As g^1(h^0_t, h^1_{t-1}) + g^0(s_t, h^0_{t-1}) forms a new approximation of f(s_t, h_{t-1}), this residual approximation step can be applied in a recursive manner, and the general form of the residual approximation function at the l-th layer can be written as

    g^l(h^{l-1}_t, h^l_{t-1}) = f(s_t, h_{t-1}) - g^0(s_t, h^0_{t-1}) - \sum_{i=1}^{l-1} g^i(h^{i-1}_t, h^i_{t-1}).    (5)

The family of functions G = ⟨g^0, g^1, ...⟩ constitutes an approximation of the actual function f. Considering the effectiveness of the two-layer stacked LSTM, we build up our model by adding one or two layers to the original two-layer stacked model to generate a three-layer and a four-layer ResLSTM, termed ResLSTM-3 and ResLSTM-4, respectively. Figure 1 illustrates the proposed ResLSTM-4 architecture.

Fig. 1. Illustration of the proposed ResLSTM-4 architecture.
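Rearranged, Eq. (5) states that the sum g^0 + g^1 + ... + g^l approximates f(s_t, h_{t-1}): each layer receives the output of the layer below and only has to fit what the lower layers have not yet explained. A minimal PyTorch sketch of a stack wired this way follows; it reflects our reading of Eqs. (3)-(5), and the class and argument names are assumptions, not the authors' released model:

# Sketch of a ResLSTM stack: stacked LSTM cells whose hidden outputs are
# summed through identity mappings, so that layer l fits the residual left
# over by layers 0..l-1 (illustrative, per Eqs. (3)-(5)).
import torch
import torch.nn as nn

class ResLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers=4):
        super().__init__()
        self.cells = nn.ModuleList(
            [nn.LSTMCell(input_dim if l == 0 else hidden_dim, hidden_dim)
             for l in range(num_layers)])

    def init_states(self, batch_size, device):
        # One zero-initialized (h, c) pair per layer.
        return [(torch.zeros(batch_size, cell.hidden_size, device=device),
                 torch.zeros(batch_size, cell.hidden_size, device=device))
                for cell in self.cells]

    def step(self, x_t, states):
        """x_t: (B, input_dim) input at time t. Returns the summed
        approximation of f(s_t, h_{t-1}) and the updated per-layer states."""
        out, inp, new_states = 0.0, x_t, []
        for l, cell in enumerate(self.cells):
            h, c = cell(inp, states[l])   # h^l_t = g^l(h^{l-1}_t, h^l_{t-1})
            new_states.append((h, c))
            out = out + h                 # identity mapping: accumulate the residuals
            inp = h                       # feed h^l_t to layer l+1
        return out, new_states

Because each layer only fits a residual, the identity summation gives gradients a direct path to the lower layers, which is what makes three- and four-layer stacks trainable where a plain stacked LSTM degrades [7].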
3.3. Temporal Dropout

As the layer number of the stacked LSTM grows, over-fitting easily occurs, especially when training on small-scale datasets. As aforementioned, Dropout is one of the popular and effective regularization techniques to prevent model training from over-fitting; it randomly discards the activations of a number of chosen neurons. In this work, we propose the Temporal Dropout method, which is carried out between two time steps within each LSTM layer in the proposed ResLSTM model.

Inspired by Dropout, the proposed Temporal Dropout method randomly drops a batch of connections in the hidden states, e.g., h_{t-1}. Moreover, because the information related to the long-term memory contained in LSTM is quite important for language modeling, we do not apply any drop operations to these long-term memory units. Specifically, the hidden state fed to time step t is updated as

    r_{t-1} \sim \mathrm{Bernoulli}(p),    (6)
    h'_{t-1} = h_{t-1} \odot r_{t-1},    (7)

where Bernoulli(p) generates a vector of independent Bernoulli random variables, each of which has probability p of being 1, h_{t-1} and h'_{t-1} represent the original hidden state and the updated hidden state at time step t-1, and ⊙ denotes the element-wise multiplication operation.
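A minimal PyTorch sketch of this operation is given below; it follows Eqs. (6)-(7) as written (in particular, no 1/p rescaling is applied, since none appears in Eq. (7)), and the function name and interface are illustrative assumptions:

# Sketch of Temporal Dropout, Eqs. (6)-(7): only the hidden state h passed
# between time steps is masked; the cell state c (the long-term memory) is
# left intact. Illustrative code, not the authors' implementation.
import torch

def temporal_dropout(h, p, training=True):
    """h: (B, hidden_dim) hidden state h_{t-1}; p: probability that a unit
    is kept (i.e., that r_{t-1} = 1). Returns h'_{t-1} of Eq. (7)."""
    if not training:
        return h                                 # no masking at test time
    r = torch.bernoulli(torch.full_like(h, p))   # r_{t-1} ~ Bernoulli(p), Eq. (6)
    return h * r                                 # h'_{t-1} = h_{t-1} ⊙ r_{t-1}, Eq. (7)

# Inside an unrolled LSTM, only h is dropped before being fed to time step t:
#   h = temporal_dropout(h, p, training=model.training)
#   h, c = cell(x_t, (h, c))    # c crosses time steps unmasked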
4. EXPERIMENTS

4.1. Experimental Setup

Two benchmark public datasets, MSCOCO [22] and Flickr30K [23], are employed to evaluate the proposed ResLSTM and Temporal Dropout. The MSCOCO dataset contains 123287 images, and at least 5 reference sentences annotated by humans are provided for each image. The Flickr30K dataset contains 31783 images in total, and each image has 5 reference sentences.

We follow the conventional split used in [5] for training, validation and test on both MSCOCO and Flickr30K. In particular, for MSCOCO, 113287 images and their reference sentences are employed for training, and the remaining images are split into two parts, with one part of 5000 images for validation and the other part for test. Regarding Flickr30K, 29000 images are used for training, another 1000 images for test, and the rest for validation. Due to the small size of Flickr30K, over-fitting occurs easily; as a remedy, the DisturbLabel technique [10] is employed for regularization with the disturbing ratio set to 0.05 (see the sketch at the end of this subsection). The criteria of BLEU [14], METEOR [15] and CIDEr [16] are utilized to evaluate the performances of the comparative approaches.

As for the underlying network architecture to deploy the proposed ResLSTM and Temporal Dropout, three powerful DNN models are employed for visual feature extraction: VGG16 [12], ResNet101 [8] and ResNet200 [8].
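As a concrete illustration of the DisturbLabel regularization mentioned above, the following hedged PyTorch sketch shows one way word-level targets can be disturbed before the loss is computed. Applying the disturbance to target word ids, as well as the function name and the uniform resampling, are our assumptions based on [10], not the paper's code:

# Hedged sketch of DisturbLabel [10] at the word level: with probability
# alpha (the disturbing ratio, 0.05 in this setup), a target word id is
# replaced by one drawn uniformly from the vocabulary (assumption).
import torch

def disturb_labels(words, vocab_size, alpha=0.05):
    """words: (B, N) ground-truth word ids; returns a disturbed copy."""
    mask = torch.rand(words.shape, device=words.device) < alpha
    random_words = torch.randint(vocab_size, words.shape, device=words.device)
    return torch.where(mask, random_words, words)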

4.2. Comparison with Baselines

Neuraltalk2 [5] is employed as the underlying framework to perform image captioning, with VGG16 [12] or ResNet [8] for CNN feature extraction. The performances of these original models are regarded as the baselines for comparison with the proposed ResLSTM model with Temporal Dropout.

The performance comparison with different layer configurations is presented in Table 1 and Table 2 on MSCOCO and Flickr30K, respectively, where ResLSTM-i indicates that there are i LSTM layers, and 'B' and 'M' are short for the criteria of BLEU and METEOR, respectively.

Table 1. Comparison with baselines on MSCOCO.
Method                   CIDEr   B-3    B-4    M
ResLSTM-1(VGG16)          90.5   38.7   27.9   23.9
ResLSTM-3(VGG16)          93.5   39.1   28.6   24.5
ResLSTM-4(VGG16)          93.8   39.3   28.9   24.9
ResLSTM-1(ResNet101)      94.9   39.4   28.4   24.7
ResLSTM-3(ResNet101)      97.2   40.4   29.4   25.0
ResLSTM-4(ResNet101)     100.6   41.1   30.3   25.5
ResLSTM-1(ResNet200)      95.9   39.6   28.7   24.7
ResLSTM-3(ResNet200)      99.0   41.9   31.8   25.3
ResLSTM-4(ResNet200)     101.1   42.0   31.9   25.7

Table 2. Comparison with baselines on Flickr30K.
Method                   CIDEr   B-3    B-4    M
ResLSTM-1(VGG16)          35.7   26.8   18.1   18.6
ResLSTM-3(VGG16)          36.5   28.0   19.0   18.7
ResLSTM-4(VGG16)          39.0   28.6   19.6   18.8
ResLSTM-1(ResNet101)      37.0   27.9   19.1   18.6
ResLSTM-3(ResNet101)      41.9   31.8   22.3   18.4
ResLSTM-4(ResNet101)      44.1   32.5   22.9   19.0

From the results, it can be observed that the performances are continuously improved as more LSTM layers are employed, regardless of the underlying DNN model (i.e., VGG16 or ResNet). This experiment demonstrates the effectiveness of the proposed ResLSTM plus Temporal Dropout against the baseline model (i.e., the model denoted as 'ResLSTM-1'). Note that ResNet200 is not tested on the Flickr30K dataset, because over-fitting easily happens on the small-scale Flickr30K dataset if the ResNet200 model is employed, due to its larger-scale model parameters. Moreover, several typical image captioning examples achieved by ResLSTM-4(ResNet200) on MSCOCO are shown in Fig. 2.

Fig. 2. Examples of image captioning achieved by ResLSTM-4(ResNet200) on MSCOCO.

4.3. Comparison with State-of-the-Art Methods

In addition, we compare the performances of the proposed ResLSTM plus Temporal Dropout with other state-of-the-art methods. As shown in Table 3, the proposed approach with

the ResNet200 model achieves the best results on the MSCOCO dataset, with 101.1 and 31.9 on CIDEr and B-4, respectively, outperforming the Att+CNN+LSTM [17] model by 7.1 and 0.9 on the CIDEr and B-4 metrics. On the Flickr30K dataset, as shown in Table 4, the proposed method also achieves the best performances, outperforming the recent emb-gLSTM [24] method by 2.3 and 1.1 on the B-4 and METEOR metrics, respectively.

Table 3. Comparison with state-of-the-art methods on MSCOCO.
Method                   CIDEr   B-3    B-4    M
multimodal RNN [5]        66.0   32.1   23.0   19.5
Google NIC [4]               –   32.9   24.6      –
LRCN-CaffeNet [7]            –   30.4   21.0      –
m-RNN [11]                   –   35.0   25.0      –
Soft-Attention [13]          –   34.4   24.3   23.9
Hard-Attention [13]          –   35.7   25.0   23.0
emb-gLSTM [24]            81.3   35.8   26.4   22.7
Att+CNN+LSTM [17]         94.0   42.0   31.0   26.0
ResLSTM-4(ResNet200)     101.1   42.0   31.9   25.7

Table 4. Comparison with state-of-the-art methods on Flickr30K.
Method                   CIDEr   B-3    B-4    M
LogBilinear [3]              –   25.4   17.1   16.9
multimodal RNN [5]        24.7   24.0   15.7   15.3
Google NIC [4]               –   27.7   18.3      –
LRCN-CaffeNet [7]            –   25.1   16.5      –
m-RNN [11]                   –   28.0   19.0      –
Soft-Attention [13]          –   28.8   19.1   18.5
Hard-Attention [13]          –   29.6   19.9   18.5
emb-gLSTM [24]               –   30.5   20.6   17.9
ResLSTM-4(ResNet101)      44.1   32.5   22.9   19.0

4.4. Effect of Temporal Dropout

Besides the comprehensive evaluation of the proposed ResLSTM with Temporal Dropout presented above, we further explore the effectiveness of the proposed Temporal Dropout method on its own. Due to the space limit, only the investigation with the ResNet101 model on the Flickr30K dataset is presented. As shown in Fig. 3, the performances on CIDEr, METEOR and BLEU are all improved with the proposed Temporal Dropout (TDropout) method as compared to ResLSTM only (i.e., with Temporal Dropout disabled). Note that the performances shown in Fig. 3 differ from those presented in Tables 2 and 4 because the beam search technique is not employed in this experiment.

Fig. 3. Illustration of the effect of Temporal Dropout: (a) CIDEr; (b) METEOR; (c) BLEU-1; (d) BLEU-2; (e) BLEU-3; (f) BLEU-4.

5. CONCLUSION

In this work, a deep residual LSTM language model named ResLSTM is proposed for image captioning. By constructing the proposed ResLSTM to fit residuals, it becomes easy to train deep stacked LSTM models. Moreover, the Temporal Dropout regularization technique is designed into the proposed ResLSTM language model to increase the robustness of residual LSTM networks. The experimental results on the benchmark MSCOCO and Flickr30K datasets demonstrate that the proposed ResLSTM with Temporal Dropout is superior to a number of state-of-the-art methods for image captioning.

6. REFERENCES

[1] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in Proc. ECCV’10, Sept. 2010, pp. 15–29.

[2] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, “TreeTalk: Composition and compression of trees for image descriptions,” Trans. Assoc. Comput. Linguistics, vol. 2, pp. 351–362, Oct. 2014.

[3] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Multimodal neural language models,” in Proc. ICML’14, vol. 14, Jun. 2014, pp. 595–603.

[4] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. CVPR’15, Jun. 2015, pp. 3156–3164.

[5] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proc. CVPR’15, Jun. 2015, pp. 3128–3137.

[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. CVPR’15, Jun. 2015, pp. 1–9.

[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proc. CVPR’15, Jun. 2015, pp. 2625–2634.

[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR’16, Jun. 2016, pp. 770–778.

[9] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Machine Learning Research, vol. 15, pp. 1929–1958, Jun. 2014.


[10] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian, “DisturbLabel: Regularizing CNN on the loss layer,” in Proc. CVPR’16, Jun. 2016, pp. 4753–4762.

[11] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-RNN),” in Proc. ICLR’15, Dec. 2015.

[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR’14, Apr. 2014.

[13] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. ICML’15, Jul. 2015, pp. 2048–2057.

[14] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proc. ACL’02, Jul. 2002, pp. 311–318.

[15] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proc. ACL Workshop IEEMMTS’05, vol. 29, Jun. 2005, pp. 65–72.

[16] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proc. CVPR’15, Jun. 2015, pp. 4566–4575.

[17] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, “What value do explicit high level concepts have in vision to language problems?” in Proc. CVPR’16, Jun. 2016, pp. 203–212.

[18] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.

[19] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proc. NIPS’15, Dec. 2015, pp. 2377–2385.

[20] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using DropConnect,” in Proc. ICML’13, Jun. 2013, pp. 1058–1066.

[21] S. Singh, D. Hoiem, and D. Forsyth, “Swapout: Learning an ensemble of deep architectures,” in Proc. NIPS’16, Dec. 2016, pp. 28–36.

[22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. ECCV’14, Sept. 2014, pp. 740–755.

[23] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” J. Artificial Intelligence Research, vol. 47, pp. 853–899, Aug. 2013.

[24] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, “Guiding the long-short term memory model for image caption generation,” in Proc. ICCV’15, Dec. 2015, pp. 2407–2415.

