Image Captioning With Deep LSTM Based On Sequential Residual
978-1-5090-6067-2/17/$31.00 ©2017 IEEE ICME 2017
proposed in this work to construct deeper LSTM models to improve the quality of description sentences for image captioning. On the other hand, over-fitting is an emerging issue for DNN models due to the huge number of model parameters. Several regularization methods, such as Dropout [9] and DisturbLabel [10], have been designed to address this problem. In this work, we propose a regularization technique named Temporal Dropout which is well suited to LSTM networks.

There are two major contributions of this work. First, in order to settle the problem of gradient vanishing in LSTM networks with multiple layers, the easy-to-train deep residual LSTM language model ResLSTM is proposed for image captioning. Second, the Temporal Dropout method is designed to prevent over-fitting in deep stacked LSTM based language models. The rest of this paper is organized as follows. Section 2 presents related work on image captioning, residual representation and model regularization. The proposed ResLSTM model with Temporal Dropout is detailed in Section 3. The experimental results are presented in Section 4. Finally, Section 5 concludes this work.

2. RELATED WORK

High-level image understanding has drawn much attention in recent years. With the success of deep learning, the encoder-decoder framework based on CNN and LSTM has become the mainstream way to bridge the gap between image and text for image captioning. A number of approaches have been proposed to achieve the goal of image description with flexible structure and rich semantic information. Mao et al. [11] propose the m-RNN model, in which the previous words and features extracted from the VGG net [12] are integrated to predict the next word of the generated sentence. In [7], the Long-term Recurrent Convolutional Network (LRCN) model is designed to caption images and videos, and a factorization method is employed to build the language model with LSTM. Vinyals et al. [4] argue that embedding CNN features into each time step of the RNN makes the model prone to over-fitting; therefore, they feed CNN features into the LSTM network only at the first time step and obtain promising performance.

In addition to the aforementioned research, several elaborate models have been developed with advanced computer vision methods. In [5], an RCNN is employed to encode regions rather than the whole image, and a bi-directional LSTM is implemented to decode the region features and generate semantic sentences for an image. Xu et al. [13] introduce the attention mechanism into visual modeling and achieve strong performance on evaluation metrics such as BLEU [14], METEOR [15] and CIDEr [16]. Different from [5] and [13], Wu et al. [17] focus on the correspondence between image regions and their attributes, and establish the relationship between visual information and text. However, these approaches ignore the fact that such shallow language models cannot provide adequate representations for language features.

It is well known that DNN models are difficult to train even though they can enhance the representation and generalization ability of features. To address this issue, Srivastava et al. propose the highway network [18, 19], which introduces bypass paths into deep models so that several layers can be skipped and the information from one layer is fed directly to the layer following the skipped layers. Recently, He et al. [8] use an identity mapping in ResNet to make higher layers fit the residual between lower layers and the target function. Inspired by [8], the bypass idea is introduced into LSTM networks to build the proposed ResLSTM language model.

On the other hand, DNNs usually suffer from over-fitting caused by their large number of parameters. Several effective regularization methods have been proposed to address this difficulty, including Dropout [9], DropConnect [20], DisturbLabel [10], Swapout [21], and so on. Dropout [9] is a widely used regularization approach which randomly disables a subset of neurons in a layer, forcing some information to be discarded during model training. As a generalization of Dropout, DropConnect [20] randomly drops connections between layers rather than neurons. Instead of disturbing neurons, the DisturbLabel [10] method deliberately adds small errors to labels to prevent the network from fitting particular types of data. In [21], the Swapout method inhibits co-adaptation of units not only within a layer but also across network layers. In this work, we propose the Temporal Dropout method to regularize the temporally sequential LSTM for language model construction.

3. PROPOSED RESLSTM MODEL WITH TEMPORAL DROPOUT

3.1. Fundamentals

For image captioning, let ℱ denote the feature vector of an image generated by a CNN model, and let S = ⟨s_1, s_2, …, s_N⟩ be the reference sentence of the image, where N is the number of words in the reference sentence and s_i is the i-th word. The target is to approximate the log-likelihood of the sentence, log p(S|ℱ). By applying the chain rule of probability, the log-likelihood can be decomposed as

    log p(S|ℱ) = ∑_{t=1}^{N} log p(s_t | ℱ, S_{1,…,t−1}),    (1)

where each word s_i in S is represented by its word embedding, and S_{1,…,t−1} = ⟨s_1, s_2, …, s_{t−1}⟩ is the slice of the first t−1 elements of S.

To model log p(s_t | ℱ, S_{1,…,t−1}), an RNN can be employed, in which the output is calculated as

    h_t = f(s_t, h_{t−1}),    (2)

where f denotes the RNN cell and h_{t−1} is the output at the previous time step t−1. An RNN is able to learn sequential information such as sentences; however, it cannot
achieve long-term dependencies due to gradient vanishing. As a variant of the RNN, the LSTM is able to remember information over long periods of time. The core of the LSTM is its memory cell, which contains the knowledge at each time step and is controlled by three kinds of gates: the forget gate, the input gate and the output gate. These gates are guided by h_{t−1} and s_t via the cell function f. In this work, we employ the LSTM as the basic module to build the proposed language model.

The basic pipeline of the proposed model includes the following key steps. First, image features are extracted by a CNN model and fed into the first time step of the language model; then the embedding feature of each word s_i is input to the language model at each subsequent time step. Afterwards, the hidden state h_t, which compactly encodes all the previous information, is estimated. Finally, Softmax is employed to predict the distribution of s_{t+1}. For training convenience, a special word called Start of Sentence (SoS) is added at the beginning of each sentence. The cross-entropy loss function is employed for back-propagation, and the sentence with the highest confidence score is the final winner for image captioning.

… the l-th layer can be written as

    g^l(h_t^{l−1}, h_{t−1}^l) = f(s_t, h_{t−1}) − g^0(s_t, h_{t−1}^0) − ∑_{i=1}^{l−1} g^i(h_t^{i−1}, h_{t−1}^i).    (5)

The family of functions 𝒢 = ⟨g^0, g^1, …⟩ constitutes an approximation of the actual function f. Considering the effectiveness of the two-layer stacked LSTM, we build our model by adding one or two layers to the original stacked model to generate three-layer and four-layer ResLSTM variants, termed ResLSTM-3 and ResLSTM-4, respectively. Figure 1 illustrates the proposed ResLSTM-4 architecture.
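The residual stacking of Eq. (5) can be sketched in code: each LSTM layer fits a residual g^l, and an identity bypass adds the layer input back to the layer output. The following minimal NumPy sketch is our own illustration, not the paper's implementation; the function names, random weights, and the choice of equal embedding and hidden sizes (so the bypass is a plain addition) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM step: forget, input and output gates guided by h_{t-1} and the input."""
    z = W @ np.concatenate([x, h_prev])
    H = h_prev.size
    f = 1.0 / (1.0 + np.exp(-z[:H]))        # forget gate
    i = 1.0 / (1.0 + np.exp(-z[H:2*H]))     # input gate
    o = 1.0 / (1.0 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                    # candidate memory
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

D = H = 8   # embedding size == hidden size, so input + output adds cleanly
L = 4       # number of layers, as in ResLSTM-4
Ws = [rng.standard_normal((4 * H, 2 * H)) * 0.1 for _ in range(L)]

def reslstm_step(x, hs, cs):
    """One time step of a stacked LSTM with identity bypasses:
    each layer l produces a residual h, and the layer output is input + h."""
    inp = x
    new_hs, new_cs = [], []
    for l in range(L):
        h, c = lstm_cell(inp, hs[l], cs[l], Ws[l])
        new_hs.append(h)
        new_cs.append(c)
        inp = inp + h   # identity bypass: output = input + residual g^l
    return inp, new_hs, new_cs

hs0 = [np.zeros(H) for _ in range(L)]
cs0 = [np.zeros(H) for _ in range(L)]
y, hs, cs = reslstm_step(rng.standard_normal(D), hs0, cs0)
print(y.shape)   # (8,)
```

Because every layer only has to model a residual on top of the identity path, gradients can flow through the additions even when the stack is deep, which is the property the paper relies on to train ResLSTM-3 and ResLSTM-4.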
where Bernoulli(p) generates a vector of independent Bernoulli random variables, each of which takes the value 1 with probability p; h_{t−1} and h′_{t−1} represent the original and the updated hidden state at time step t−1, and ⊙ denotes element-wise multiplication.

4. EXPERIMENTS

4.1. Experimental Setup

Two public benchmark datasets, MSCOCO [22] and Flickr30K [23], are employed to evaluate the proposed ResLSTM and Temporal Dropout. The MSCOCO dataset contains 123287 images, and at least 5 human-annotated reference sentences are provided for each image. The Flickr30K dataset contains 31783 images in total, and each image has 5 reference sentences.

We follow the conventional split used in [5] for training, validation and test on both MSCOCO and Flickr30K. In particular, for MSCOCO, 113287 images and their reference sentences are employed for training; the remaining images are split into two parts, with 5000 images for validation and the rest for test. Regarding Flickr30K, 29000 images are used for training, another 1000 images for test, and the rest for validation. Due to the small size of Flickr30K, over-fitting occurs easily; as a remedy, the DisturbLabel technique [10] is employed for regularization with the disturbing ratio set to 0.05. The BLEU [14], METEOR [15] and CIDEr [16] criteria are utilized to evaluate the performance of the compared approaches.

Regarding the underlying network architecture used to deploy the proposed ResLSTM and Temporal Dropout, three powerful DNN models are employed for visual feature extraction: VGG16 [12], ResNet101 [8] and ResNet200 [8].

Table 1. Comparison with baselines on MSCOCO.
Method                  CIDEr   B-3    B-4    M
ResLSTM-1(VGG16)         90.5   38.7   27.9   23.9
ResLSTM-3(VGG16)         93.5   39.1   28.6   24.5
ResLSTM-4(VGG16)         93.8   39.3   28.9   24.9
ResLSTM-1(ResNet101)     94.9   39.4   28.4   24.7
ResLSTM-3(ResNet101)     97.2   40.4   29.4   25.0
ResLSTM-4(ResNet101)    100.6   41.1   30.3   25.5
ResLSTM-1(ResNet200)     95.9   39.6   28.7   24.7
ResLSTM-3(ResNet200)     99.0   41.9   31.8   25.3
ResLSTM-4(ResNet200)    101.1   42.0   31.9   25.7

Table 2. Comparison with baselines on Flickr30K.
Method                  CIDEr   B-3    B-4    M
ResLSTM-1(VGG16)         35.7   26.8   18.1   18.6
ResLSTM-3(VGG16)         36.5   28.0   19.0   18.7
ResLSTM-4(VGG16)         39.0   28.6   19.6   18.8
ResLSTM-1(ResNet101)     37.0   27.9   19.1   18.6
ResLSTM-3(ResNet101)     41.9   31.8   22.3   18.4
ResLSTM-4(ResNet101)     44.1   32.5   22.9   19.0

The ResNet200 model is not tested on the Flickr30K dataset, because over-fitting easily occurs on this small-scale dataset when ResNet200 is employed, due to its larger number of model parameters. Moreover, several typical image captioning examples produced by ResLSTM-4(ResNet200) on MSCOCO are shown in Fig. 2.
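The Temporal Dropout masking used throughout these experiments (described in Section 3; its full update equation is not reproduced in this excerpt) amounts to gating the recurrent hidden state with an independent Bernoulli(p) mask before it is passed to the next time step. A minimal sketch, assuming an illustrative function name and keep probability:

```python
import numpy as np

rng = np.random.default_rng(42)

def temporal_dropout(h_prev, p=0.8, train=True):
    """Temporal Dropout sketch: mask the hidden state passed between time
    steps with independent Bernoulli(p) variables (1 with probability p).
    Only the masking described in the paper's text is shown here; the
    function name and default p are illustrative assumptions."""
    if not train:
        return h_prev          # no masking at test time
    mask = rng.binomial(1, p, size=h_prev.shape)   # Bernoulli(p) vector
    return mask * h_prev                           # element-wise product ⊙

h = np.ones(6)                       # hidden state h_{t-1}
h_dropped = temporal_dropout(h, p=0.5)   # updated state h'_{t-1}
print(h_dropped)
```

Unlike standard Dropout on layer activations, the mask here is applied along the temporal recurrence, so the LSTM must learn not to rely on any single hidden unit persisting across time steps.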
the ResNet200 model achieves the best results on the MSCOCO dataset, with 101.1 and 31.9 on CIDEr and B-4, respectively, outperforming the Att+CNN+LSTM [17] model by 7.1 and 0.9 on the CIDEr and B-4 metrics. On the Flickr30K dataset, as shown in Table 4, our proposed method also achieves the best performance, outperforming the recent emb-gLSTM [24] method by 2.3 and 1.1 on the B-4 and METEOR metrics, respectively.

Table 3. Comparison with state-of-the-art methods on MSCOCO.
Method                  CIDEr   B-3    B-4    M
multimodal RNN [5]       66.0   32.1   23.0   19.5
Google NIC [4]            –     32.9   24.6    –
LRCN-CaffeNet [7]         –     30.4   21.0    –
m-RNN [11]                –     35.0   25.0    –
Soft-Attention [13]       –     34.4   24.3   23.9
Hard-Attention [13]       –     35.7   25.0   23.0
emb-gLSTM [24]           81.3   35.8   26.4   22.7
Att+CNN+LSTM [17]        94.0   42.0   31.0   26.0
ResLSTM-4(ResNet200)    101.1   42.0   31.9   25.7

5. CONCLUSION

In this work, a deep residual LSTM language model, ResLSTM, is proposed for image captioning. By constructing ResLSTM to fit residuals, deep stacked LSTM models become easy to train. Moreover, the Temporal Dropout regularization technique is built into the proposed ResLSTM language model to increase the robustness of residual LSTM networks. Experimental results on the benchmark MSCOCO and Flickr30K datasets demonstrate that the proposed ResLSTM with Temporal Dropout is superior to a number of state-of-the-art methods for image captioning.

6. REFERENCES

[1] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in Proc. ECCV'10, Sept. 2010, pp. 15–29.

[2] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, "TreeTalk: Composition and compression of trees for image descriptions," Trans. Assoc. Comput. Linguistics, vol. 2, pp. 351–362, Oct. 2014.
[Figure: panels (a) CIDEr, (b) METEOR, (c) BLEU-1]
[10] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian, "DisturbLabel: Regularizing CNN on the loss layer," in Proc. CVPR'16, Jun. 2016, pp. 4753–4762.

[11] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proc. ICLR'15, Dec. 2015.

[12] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. ICLR'14, Apr. 2014.

[13] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. ICML'15, Jul. 2015, pp. 2048–2057.

[14] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proc. ACL'02, Jul. 2002, pp. 311–318.

[15] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. ACL Workshop IEEMMTS'05, vol. 29, Jun. 2005, pp. 65–72.

[16] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. CVPR'15, Jun. 2015, pp. 4566–4575.

[17] Q. Wu, C. Shen, L. Liu, A. Dick, and A. v. d. Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proc. CVPR'16, Jun. 2016, pp. 203–212.

[18] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Highway networks," arXiv preprint arXiv:1505.00387, 2015.

[19] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Proc. NIPS'15, Dec. 2015, pp. 2377–2385.

[20] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using DropConnect," in Proc. ICML'13, Jun. 2013, pp. 1058–1066.

[21] S. Singh, D. Hoiem, and D. Forsyth, "Swapout: Learning an ensemble of deep architectures," in Proc. NIPS'16, Dec. 2016, pp. 28–36.

[22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. ECCV'14, Sept. 2014, pp. 740–755.

[23] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," J. Artificial Intelligence Research, vol. 47, pp. 853–899, Aug. 2013.

[24] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, "Guiding the long-short term memory model for image caption generation," in Proc. ICCV'15, Dec. 2015, pp. 2407–2415.