You are on page 1of 13

Computer Vision and Image Understanding 221 (2022) 103453

Contents lists available at ScienceDirect

Computer Vision and Image Understanding


journal homepage: www.elsevier.com/locate/cviu

Video captioning using Semantically Contextual Generative Adversarial


Network
Hemalatha Munusamy a,b ,∗, Chandra Sekhar C. a
a Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai 600036, India
b Department of Information Technology, MIT campus, Anna University, Chennai 600025, India

ARTICLE INFO ABSTRACT


Communicated by Nikos Paragios In this work, we propose a Semantically Contextual Generative Adversarial Network (SC-GAN) for video
captioning. The semantic features extracted from a video are used in the discriminator to weigh the word
Keywords:
embedding vectors. The weighted word embedding vectors along with the visual features are used to
Video captioning
Generative adversarial network
discriminate the ground truth descriptions from the descriptions generated by the generator. The manager
Reinforcement learning in the generator uses the features from the discriminator to generate a goal vector for the worker. The worker
Generator is trained using: a goal based reward and a semantics based reward in generating the description. The semantics
Discriminator based reward ensures that the worker generates descriptions that incorporate the semantic features. The goal
based reward calculated from discriminator features ensures the generation of descriptions similar to the
ground truth descriptions. We have used MSVD and MSR-VTT datasets to demonstrate the effectiveness of
the proposed approach to video captioning.

1. Introduction learning based approaches were proposed for generating more specific
descriptions. The adversarial Long Short Term Memory (LSTM) based
Video captioning involves the generation of descriptions for video approach proposed for video captioning in Yang et al. (2018) uses a
clips. It plays a significant role in robotics, visual understanding, aids soft-argmax function proposed in Luvizon et al. (2019) to generate a
for visually challenged persons, automatic subtitling of videos, creating sentence. The soft-argmax function overcomes the non-differentiability
instructions from videos, etc. There are many approaches proposed problem of argmax.
for image and video captioning (Xu et al., 2015b,a; Aakur et al., In order to generate a generalized description many approaches
2017; Baraldi et al., 2017; You et al., 2016; Kulkarni et al., 2011; have been proposed using the generative adversarial network (Good-
Dai et al., 2017; Deshpande et al., 2019; Chen et al., 2019; Pasunuru
fellow et al., 2014) for image and video captioning (Chen et al., 2019;
and Bansal, 2017; Venugopalan et al., 2015a; Guo et al., 2016; Hori
Wang et al., 2018; Dai et al., 2017). The GAN was originally proposed
et al., 2017; Gan et al., 2017; Krishna et al., 2017; Venugopalan et al.,
to generate synthetic images. Later, it was adopted in domains such
2017; Pan et al., 2017; Xu et al., 2019; Yu et al., 2016; Yuan et al.,
as text generation, security, and captioning. The main challenge of a
2018). The field of image and video captioning has experienced a
text GAN is the back-propagation of gradients from the discriminator
tremendous change from the initial template-based approaches to the
recent transformer based approaches. The major challenge in video to the generator. In an image generation GAN, the generator generates
captioning is the stochastic nature of the videos and the multi-modal an image given as input to the discriminator. The discriminator tries to
information available in the video. In order to capture the stochastic discriminate between the original images and the fake images gener-
information from the videos, many existing approaches use a Recurrent ated by the generator. While training the model, the image generated
Neural Network (RNN) based encoder to extract the video features. by the generator is directly given as input to the discriminator. As
These features are then provided to a decoder to generate a description. the image generated by the generator is continuous, the gradients are
Many variants of this architecture have been proposed in the literature, propagated from the discriminator to the generator. In text generation,
as discussed in Section 2. the output is a sentence, and the generator uses an argmax function to
The traditional video captioning model uses Maximum Likelihood select one word at a time. Only a sentence can be given as input to
Error for training the model to generate descriptions similar to the the discriminator. Consequently, the gradients cannot be propagated
ground truth descriptions. This model learns more generalized descrip- directly from the discriminator to the generator. Techniques such as
tions that are most repeated in the training videos. The reinforcement policy gradient (Silver et al., 2014), REINFORCE (Williams, 1992), and

∗ Corresponding author at: Department of Information Technology, MIT campus, Anna University, Chennai 600025, India.
E-mail address: hemalatham.ch@gmail.com (H. Munusamy).

https://doi.org/10.1016/j.cviu.2022.103453
Received 20 May 2021; Received in revised form 17 March 2022; Accepted 11 May 2022
Available online 20 May 2022
1077-3142/© 2022 Elsevier Inc. All rights reserved.
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

• The worker is trained to use the semantics based reward in


addition to the goal based reward for generating description. This
allows the worker to generate a description that includes key
concepts present in the video.

The rest of the paper is organized as follows. Section 2 presents


the approaches that use GAN for text generation and the approaches
to image and video captioning. Section 3 presents the proposed video
captioning model using a hierarchical GAN. Section 4 presents details
of the datasets used to evaluate the proposed approach. It also presents
the evaluation metrics used for evaluating the proposed approach.
Section 5 presents the experimental studies and results.
Fig. 1. Overview of the proposed semantically contextual GAN based approach to video
captioning.
2. Related work

2.1. Approaches to video captioning

Gumbel-Softmax approximation (Kusner and Hernández-Lobato, 2016)


The template-based approaches Xu et al. (2015b), Rohrbach et al.
were proposed to address this issue.
(2013) find the objects in a video and fit the object names in the
The text GAN has been introduced for image captioning in Chen Subject, Verb, and Object parts of a sentence. This limits the generated
et al. (2019), Dai et al. (2017), Yan et al. (2018), and for video cap- descriptions only to the templates known by the model. In order to
tioning in Yang et al. (2018), Wang et al. (2018). The conditional GAN overcome this disadvantage, the end-to-end approaches are proposed
based approach to image captioning (Chen et al., 2019) uses self-critical in Venugopalan et al. (2015a), Pan et al. (2015), Yang et al. (2018)
sequence training for reinforcement based training. The RNN and CNN which use RNNs for both encoder and decoder. The sequence to se-
based discriminators are implemented, and it was found that the CNN quence, video to text (S2VT) approach (Venugopalan et al., 2015a) uses
based discriminator is better than the RNN based discriminator. The a stack of two LSTM models for end-to-end video captioning. Another
conditional GAN for image captioning in Dai et al. (2017) uses a policy approach (Nabati and Behrad, 2020) uses boosted and parallel LSTM
gradient algorithm for training the generator and the discriminator. for generating captions. The boosted algorithm uses a form of Adaboost
It uses the 2D-CNN features as the conditional input in generating algorithm for training the model.
descriptions. The hierarchical reinforcement learning-based approach In a video, different frames may have different contribution in gen-
to video captioning (Wang et al., 2018) uses a manager and a worker. erating a description of the video. The attention-based approach in Xu
This approach derives a contextual vector from LSTM based encoder. et al. (2015a) uses different attention to different parts of an image in
The manager uses this context vector to generate a goal that is provided generating the caption. The Hierarchical Recurrent Neural Network (H-
to the worker for generating descriptions. RNN) based approach to video captioning (Yu et al., 2016), provides
The video captioning model that uses Maximum Likelihood Error spatial attention to different parts of the frames in the video. For
always selects a word with maximum probability. This forces the model video captioning, attention-based approach applies different attentional
to choose more generalized words for the descriptions. For example, the weights to the multi-modal entities such as visual, motion, semantic and
model generates the same descriptions for many cooking videos like ‘‘a audio (Hori et al., 2017; Wu et al., 2018; Qi et al., 2020; Jin et al., 2019;
man is cooking’’ or ‘‘a woman is cooking’’. Similarly, it mostly generates Yu et al., 2016; Yan et al., 2020; Chen and Jiang, 2019; Li et al., 2018).
The attention mechanism is also applied to the decoder (Bahdanau
descriptions like ‘‘a man is playing’’ for sports videos. To avoid these
et al., 2014). The approaches in Guo et al. (2016), Yang et al. (2018),
generalized descriptions, we propose a semantically contextual GAN
Gao et al. (2020), Yan et al. (2018) use the Bahdanau attention in the
based model for video captioning. The semantics based rewards force
decoder.
the model to use the semantic keywords from semantic scores given as
The attention based approach provides different attention to differ-
input for generating descriptions.
ent parts of the video. The use of semantic features explicitly detected
Fig. 1 shows the block diagram of the proposed semantically con- from a video using multi-label classification models played an impor-
textual GAN based approach to video captioning. In this approach, the tant role in the image and video captioning models. Approaches based
leaked features from the discriminator are used to train the generator. on the semantic features have been proposed in Xu et al. (2019), Yuan
The manager in the generator uses the leaked features from the dis- et al. (2018), Gan et al. (2017), Aakur et al. (2017), Liu et al. (2017),
criminator to generate a goal vector. The worker uses the goal vector Bin et al. (2017). One such approach is the Semantic Compositional
and the visual and semantic features to generate the descriptions. The Network (SCN) (Gan et al., 2017), in which semantic features are
discriminator distinguishes the description generated by the generator extracted from an image or a video using a multi-layer perceptron.
from the ground truth description. The generator and the discriminator The semantic features are multiplied by the internal weights of the
are trained using different reward functions. LSTM layer to ensure that the important semantics are considered in
The important contributions of the paper are as follows: generating the description. The semantic features play an important
role in determining the probability of the next word to be generated.
• The discriminator is trained with the word embedding vectors
multiplied by weights derived from the semantic features. So the 2.2. GAN based approaches to captioning
discriminator includes the semantical context for discriminating
the generated descriptions from ground truth descriptions. The video captioning model discussed till now uses Maximum Like-
• The discriminator is trained with both the visual and semantic lihood Error based training which generates more generalized de-
features. This enables the discriminator to determine whether scriptions. To overcome this problem, the adversarial training based
semantic features are related to visual features extracted from the approaches were proposed by Yang et al. (2018). Later, many ap-
video. proaches have been proposed to use adversarial training and GANs.
• The manager is trained to generate a goal based on discrimina- The GAN has been successful in generating good quality images. Many
tor features. The worker uses this goal to generate descriptions variants of GAN like CGAN, StackGAN, and DiscoGAN have been
similar to the ground truth descriptions. proposed for generating images. The conditional GAN (CGAN) (Mirza

2
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

and Osindero, 2014) generates images from conditional information two losses. The CIDER based reward helps the model in generating
like the type of the image, keywords, and sentences describing the descriptions that provide good CIDER scores. However, the generated
image. Another variant of GAN, the StackGAN (Zhang et al., 2017) gen- descriptions do not perform well in other evaluation metrics.
erates high-quality images from the text description using two layers The end-to-end RL based approach (Li and Gong, 2019) performs
of generation: the low-level generator generates low-resolution images, attribute prediction and video captioning. For both the tasks, the
and the high-level generator generates high-resolution images from encoder and decoder are two-layer LSTMs. Here the CIDER score is
the low-resolution images. The DiscoGAN discovers the relation be- used as a reward in training the model. Another approach (Wang
tween different domains, which is used for style transfer in generating et al., 2018) uses the manager and worker LSTMs in the generator. The
images (Kim et al., 2017). model uses REINFORCE (Williams, 1992) for training the worker and a
The GAN has been later adapted for text generation in Lin et al. deterministic policy for training the manager. In our work, we propose
(2017), Yu et al. (2017), Guo et al. (2018), Nie et al. (2019). The the SC-GAN based architecture for generating video descriptions. The
sequential GAN (SeqGAN) (Yu et al., 2017) uses the gradient pol- discriminator in the proposed approach discriminates the descriptions
icy updates directly from discriminator to generator. It uses REIN- generated by the generator from the ground truth descriptions. The
FORCE (Williams, 1992) algorithm for training the discriminator and use of visual and semantic features in the discriminator enhances the
the generator. The adversarial rank GAN (Lin et al., 2017) ranks the discriminator’s ability to differentiate the correct video and sentence
human-generated and machine-generated descriptions with different pair from the wrong video and sentence pairs. In this work, we use two
scores. Instead of using traditional binary scores, the objective functions types of rewards for training: the goal based reward and the semantics
of the generator and the discriminator use the rank scores. Another based reward for the worker. The goal based reward is used to train
GAN for text generation RelGAN (Nie et al., 2019) uses the Gumbel- the generator to generate descriptions similar to the ground truth. The
Softmax approximation to overcome the non-differentiability problem semantics based reward ensures that the generator incorporates the
of the generator output. semantics while generating description.
The LeakGAN (Guo et al., 2018) uses the features leaked from the
discriminator to the generator. These features are extracted from the 3. Proposed SC-GAN based approach
pre-final layer of the discriminator. The LeakGAN is suitable for long
text generation. The generator in LeakGAN uses two modules: The man- 3.1. Overview
ager module receives the features from the discriminator and produces
a sub-goal vector. The worker module accepts the text representation The architecture of the proposed semantically contextual GAN is
of the previous word and creates an intermediate embedding. The shown in Fig. 2. The proposed approach uses a discriminator which dif-
intermediate embedding vector is multiplied with the sub-goal vector ferentiates between ground truth description and generated description
to obtain the probability of the next word in the sentence. for a given video and a generator that generates description for a given
The Text-GAN has been later adapted for image and video caption- video. There are two modules in the generator: (1) A manager which
ing (Dai et al., 2017; Wei et al., 2020; Chen et al., 2019; Wang et al., generates a goal based on discriminator features. (2) A worker which
2018). The image captioning model proposed in Chen et al. (2019) uses generates descriptions using the goal, visual features, and semantic
a language evaluator based loss function to train the discriminator and features.
the generator. The loss function measures the similarity of sentences Generating a caption for a given video is a stochastic process which
generated by the model with the ground truth sentences by applying generates one word at each time step. The worker plays the role
language evaluation scores such as CIDER and SPICE. The discriminator of decoder. Let 𝑥𝑡 be the 𝑡th word generated given the previously
and the language evaluator jointly compute the reward function for generated words (𝑥0 , 𝑥1 , … , 𝑥𝑡−1 ) and the features 𝐯 = (𝐩, 𝐪, 𝐬𝑛 ). Here 𝐩
the generator. The approach in Ren et al. (2017) uses visual-semantic are the visual features extracted using 2D-CNN, 𝐪 are the visual features
embedding rewards for training the image captioning model. Here, the extracted using 3D-CNN and 𝐬 are the semantic features extracted using
semantic and visual features are embedded and projected into a higher a multi-layer perceptron (MLP) as in Hemalatha and Sekhar (2020).
dimensional space to calculate rewards. The proposed approach uses The semantic keywords are extracted by processing the ground truth
two kinds of rewards: semantics based reward and goal based reward descriptions and obtaining the most frequent keywords representing the
in the worker for generating captions. Multi-attention based image objects and actions in the video. Let 𝑀 be the number of semantic
captioning (Wei et al., 2020) uses multiple attentions in the decoder keywords obtained from the ground truth descriptions. The ground
for generating captions. The RNN based discriminator uses attention to truth semantic label vector for a video is given by 𝐯 = [𝑣1 , 𝑣2 , … , 𝑣𝑀 ],
the output of the RNNs. where 𝑥𝑖 is 1 if the semantic keyword is present in the descriptions
The video captioning using adversarial LSTM (Yang et al., 2018) of the video. A Multi-layer Perceptron (MLP) is trained as a multi-label
is the first approach proposed to generate captions for videos using classifier with 𝑥 as the target vector for the video in the training dataset.
reinforcement learning (RL). It uses an end-to-end approach to video The output of the MLP 𝐬 = [𝑠1 , 𝑠2 , … , 𝑠𝑀 ] is the semantic feature vector
captioning, where both the encoder and the decoder are the LSTMs. of the video.
Soft-argmax function is used in the final layer of the generator. The We use reinforcement learning to learn the generator and the dis-
soft-argmax function (Luvizon et al., 2019) multiplies the exponential criminator parameters. The CNN based discriminator discriminates the
term in the soft-max function with a very large constant value. The generated descriptions and the human provided descriptions. The inter-
multiplication makes the largest value in an array approximately as 1 mediate features 𝐜𝑡 from the discriminator CNN are directly provided
and the other smaller values approximately to 0. Then the resultant to the manager at each time step 𝑡. The manager module generates a
array values are multiplied with the array index and summed, which goal vector 𝐠𝑡 based on the discriminator features at each time step. The
provides the index of the array with maximum value in the array, worker generates one word 𝑥𝑡 at each time step. The generator and the
thus performing the arg-max on the output. In order to overcome the discriminator are explained in the following subsections.
problem of gradient propagation, we use the goal based training in
the proposed approach. Here the manager generates the goal which is 3.2. Discriminator
satisfied by the worker.
In Pasunuru and Bansal (2017), the RL based approach proposed for The discriminator should differentiate the human given descriptions
video captioning uses the CIDER based reward for training the model. from the descriptions generated by the generator. In this work, we also
The model is trained using both cross-entropy loss and CIDER based want to differentiate between the correct descriptions and the incorrect
loss, where a hyper-parameter determines the trade-off between the descriptions for a video. So we provide both the visual features and

3
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

Fig. 2. Block diagram of the proposed semantically contextual GAN. The visual and semantic features are extracted from the video. The generator module has a contextual manager
and a conditional worker.

Fig. 3. The CNN based discriminator. The outputs of the convolutional layers and pooling layers are concatenated and given as input to a fully connected layer. The final output
indicates whether the input description is valid or invalid for the input video.

the text features as input to the discriminator. Here we use a CNN embedding vector by a score reduces the values in the word embedding
based text classification model as the discriminator. Many approaches vector. Hence, we multiply the word embedding vector by 1 + 𝑠𝑥𝑡 value
use 1D-CNN for sentence classification (Veselỳ et al., 2013; Kim, 2014; if the word 𝑥𝑡 is a semantic keyword. We multiply the word embedding
Zhang et al., 2016). In this work, we use a multi-channel 1D-CNN vector by 1 + 𝑠𝑎𝑣𝑔 for the words that are not the semantic keywords.
as the discriminator. Fig. 3 shows the architecture of the discrimina- Larger weights are given to the word embedding vector for semantic
tor. Each word in the description is represented using the Word2Vec keywords present in the sentence than the other words.
embedding (Mikolov et al., 2013). The semantic weighted word embedding vector, the 2D-CNN feature
The scores for semantic keywords obtained from the MLP are in the vector 𝐩 and 3D-CNN feature vector 𝐪 are concatenated and provided
range of 0.0 and 1.0. Let 𝐞𝑥𝑡 be the word embedding vector of 𝑡th word. as input to a dense layer. The outputs of the dense layers are combined
The semantic weighted embedding vector 𝐞̄ 𝑥𝑡 is obtained as follows: to form the feature matrix 𝐹 ∈ R𝑑×𝑇 that represents the caption of
{ a video along with the 2D-CNN, 3D-CNN, and semantic features. The
𝐞𝑥𝑡 ⋅ (1 + 𝑠𝑥𝑡 ), if 𝑥𝑡 is a semantic keyword
𝐞̄ 𝑥𝑡 = (1) caption can be either the ground truth caption or the caption generated
𝐞𝑥𝑡 ⋅ (1 + 𝑠𝑎𝑣𝑔 ), otherwise
by the generator. Here 𝑇 is the length of the description, and 𝑑 is the
where 𝑠𝑥𝑡 is the score of the word 𝑥𝑡 and 𝑠𝑎𝑣𝑔 is the average of all the dimension of the word embedding vector. Let 𝑙 denote the size of the
scores. As the scores are in the range of 0.0 and 1.0, multiplying a word kernel and 𝑘 denote the number of kernels in the 1 − 𝑑 CNN. Each 1 − 𝑑

4
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

Fig. 4. The generator has a contextual manager and a conditional worker. The contextual manager provides a semantically contextual goal vector. The conditional worker takes
the semantically contextual goal vector as input and generates one word at each time step.

Table 1 3.3. Generator


Size of the kernel (l) and number of kernels (k) in different 1D convolutional layers in
the discriminator.
The generator performs the process of generating a sentence describ-
Dataset (l, k)
ing the video. It tries to generate a sentence similar to the ground truth
MSVD (1, 75), (3, 100), (5, 75), (7, 100)
description given by humans. We use a hierarchical architecture that
(sentence (8, 100), (9, 100), (10, 120), (12, 150)
length - 30) (14, 100), (17, 100), (20, 150), (25, 100)
uses a manager to set a goal and a worker to generate a sentence. The
(30, 200) architecture of the hierarchical generator is presented in Fig. 4.
MSR-VTT (1, 75), (3, 100), (5, 75), (7, 100) At step 𝑡 the discriminator takes the sentence 𝑥1∶𝑡−1 generated up to
(sentence (8, 100), (9, 100), (10, 120), (12, 150) time 𝑡 − 1, and generates the vector 𝐜𝑡 using Monte-Carlo rollout. The
length - 40) (14, 100), (17, 100), (20, 150), (25, 100) manager receives the leaked feature vector 𝐜𝑡 from the discriminator
(30, 200), (35, 100), (40, 200) and generates the goal vector 𝐠𝑡 . Let ℎ𝑀 𝑡 be the hidden state of the
manager LSTM. We apply Bahdanau’s attention (Bahdanau et al., 2014)
on the output of the LSTM to generate 𝐠̄ 𝑡 , that is normalized to generate
convolution layer applied on 𝐹 produces an output matrix of size 𝑇 × 𝑘. the goal vector 𝐠𝑡 .
Then we apply a pooling layer on the output of each convolution layer 𝐠𝑡 = 𝐠̄𝑡 ∕ ‖ ‖
‖𝐠̄𝑡 ‖ (6)
to obtain a feature vector 𝐟 of size 𝑘.
The worker takes the previously generated words 𝑥1 , 𝑥2 , … , 𝑥𝑡−1 ,
We use 𝑚 number of parallel convolution layers and pooling layers
visual features 𝐩 and 𝐪, semantic features 𝐬, and the previous hidden
on the feature matrix 𝐹 . The size and number of kernels used in the
state ℎ𝑊
𝑡−1
along with the goal vector 𝐠𝑡 to generate next word in the
MSVD and MSR-VTT datasets are shown in Table 1. Let 𝐟1 , 𝐟2 , … , 𝐟𝑚 be
sentence. We apply Bahdanau’s attention to the output of the LSTM
the feature vectors extracted using 𝑚 number of the 1-D convolutional
at the worker. Then the output is passed on to a dense layer with a
and pooling layers. The concatenated feature vector 𝐟 = [𝐟1 , 𝐟2 , … , 𝐟𝑚 ] is soft-max activation function to obtain the final score vector 𝐳𝑡 as
given as input to the highway network. The highway network is similar
to a residual network which increases the speed of convergence of the 𝐳𝑡 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝐴𝝎̄ 𝑡 + 𝐛) (7)
model. The highway network uses a transport gate 𝑈 (𝐟) and a carry gate Here 𝝎̄ 𝑡 is the output from the attention layer of the worker, 𝐴 and 𝐛
1 − 𝑈 (𝐟). The output of the highway network is given by the vector are the weights and bias of the dense layer. Then argmax is applied
𝐜 = 𝐻(𝐟) ⋅ 𝑈 (𝐟) + 𝐟 ⋅ (1 − 𝑈 (𝐟 )) (2) on 𝐳𝑡 to obtain the index of the word with a maximum score that is
considered as 𝐱𝑡 .
where
3.4. Training generator and discriminator
𝑈 (𝐟) = 𝑎𝑈 (𝑊𝑈𝑇 𝐟 + 𝐛𝑈 ) (3)
𝐻(𝐟) = 𝑎𝐻 (𝑊𝐻𝑇 𝐟 + 𝐛𝐻 ) (4) We initially used the Maximum Likelihood Estimation (MLE) based
training followed by the Reinforcement Learning (RL) based training.
Here 𝐻(𝐟) is the fully connected layer applied on the feature vector 𝐟,
𝑊𝑈 and 𝑊𝐻 are the weight matrices, 𝐛𝑈 and 𝐛𝐻 are the bias vectors, 3.4.1. Training discriminator
𝑎𝑈 and 𝑎𝐻 are the nonlinear activation functions. Here ⋅ is element-wise Let 𝜙 denote the discriminator model where 𝜙 denotes the pa-
multiplication to obtain the final context vector 𝐜. The vector 𝐜 is then rameters of the model. The function of the discriminator is given by
passed to a fully connected layer to obtain the classification score given 𝜙 (𝐩, 𝐪, 𝐬, 𝑋), where 𝐩 are the 2D-CNN features, 𝐪 are the 3D-CNN
by features, 𝐬 are the semantic features, and 𝑋 is the sentence given as
input to the discriminator. Let X𝐻 be the set of ground truth sentences,
𝑦 = 𝜎(𝑊 𝑇 𝐜 + 𝐛) (5) X𝐺 be the set of sentences generated by the generator, and X𝑊 be the
set of inappropriate sentences. We generated X𝑊 from the ground truth
Here 𝑊 is the weight matrix, 𝐛 is a bias vector, and 𝜎 is a nonlinear descriptions for videos by randomly pairing the video and descriptions
activation. The discriminator classifies the captions generated by the available in the datasets. The discriminator is trained to discriminate
generator and the ground truth captions. This vector 𝐜 is leaked to the the ground truth sentence for a video from the sentence generated by
manager in the generator for adversarial training. the generator and also to discriminate an appropriate ground truth

5
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

where 𝑟𝚆𝑡 is the reward for the worker given by


𝚆 𝚆
𝑟𝚆𝑡 = 𝜆𝑟𝑡 𝐺 + (1 − 𝜆)𝑟𝑡 𝑆 (13)

where 𝜆 is the empirical parameter which determines the trade-off


between goal based reward and semantics based reward. The goal
based reward is given by

1∑
𝑘
𝚆
𝑟𝑡 𝐺 = ‖𝐜 − 𝐜𝑡−𝑖 , 𝐠𝑡−𝑖 ‖𝑐𝑜𝑠 (14)
𝑘 𝑖=1 𝑡
The dimension of the goal vector is chosen to be the same as the
Fig. 5. Reward estimation for manager using Monte-Carlo rollout. dimension of the vector 𝐜 obtained from the discriminator. The reward
is calculated based on the k recent goals of the manager. Thus, it
ensures that the generator follows the goal vector generated by the
manager. The semantics based reward is calculated from the output of
description from an inappropriate description. The discriminator loss
the worker as in (15). Here 𝑗 is the index of the word 𝑥𝑡 generated at
function is given by
∑ time step 𝑡.
 = 𝑙𝑜𝑔𝜙 (𝐩, 𝐪, 𝐬, 𝑋)
𝚆 ∑
𝑡
𝑋∈X𝐻 𝑟𝑡 𝑆 = 𝜋𝑗 ⋅ (𝑧𝑗 − 𝑠𝑗 ) (15)

+ 𝜇⋅ 𝑙𝑜𝑔(1 − 𝜙 (𝐩, 𝐪, 𝐬, 𝑋)) {
𝑖=1
𝑋∈X𝐺 1, if 𝑥𝑡 is a semantic keyword
∑ 𝜋𝑗 = (16)
+ 𝜂⋅ 𝑙𝑜𝑔(1 − 𝜙 (𝐩, 𝐪, 𝐬, 𝑋)) (8) 0, otherwise
𝑋∈X𝑊

where 𝜇 and 𝜂 are the empirically chosen parameters. 3.4.3. Training procedure
The generator and the discriminator are initially trained for a fixed
3.4.2. Training generator number of epochs using the maximum likelihood (ML) method. We
The manager is trained to generate the goal vector 𝐠𝑡 , which will be do not use the rewards during ML training to calculate the gradients.
used by the worker in sentence generation. The gradient of the manager While training the generator, the parameters of the discriminator are
fixed. Similarly, while training the discriminator, the parameters of
with parameters 𝜃 is computed by
the generator are fixed. The generator is trained for 𝑔 number of
∇𝜃 𝙼 = −𝑟𝙼𝑡 ∇𝜃 ‖ ‖
‖(𝐜𝑡+𝑘 − 𝐜𝑡 ), 𝐠𝑡 ‖𝑐𝑜𝑠 (9) epochs, and the discriminator is trained for 𝑑 number of epochs. The
manager and worker are trained in an interleaved manner. The training
In (9), 𝑟𝙼𝑡 is the reward for the manager calculated using (10).
procedure for the generator and discriminator is shown in Algorithm 1.
‖𝛼, 𝛽‖𝑐𝑜𝑠 = 𝛼 𝑇 𝛽∕ ‖𝛼‖ ‖𝛽‖ is the cosine similarity between the two
vectors. The discriminator generates the reward only for a complete 4. Experimental studies
sentence. Hence for the intermediate steps the expected reward is
generated using Monte-Carlo search. Fig. 5 shows the Monte-Carlo roll 4.1. Datasets
out performed using the worker. Let 𝑋1∶𝑡 be the sentence generated
up to time step 𝑡. The Monte-Carlo roll out generates sentences by 4.1.1. MSVD
generating words 𝑥̂ 𝑡+1 to 𝑥̂ 𝑇 following multinomial distribution. The Microsoft Research Video Description Corpus (MSVD) (Chen
Let 𝑋̂ 1 , 𝑋̂ 2 , … , 𝑋̂ 𝑁 be the sentences obtained using the rollout and Dolan, 2011) dataset is one of the most commonly used datasets
policy 𝑀𝐶 𝚆𝜐 . When 𝑡 is less than 𝑇 , the discriminator takes all the 𝑁 for video captioning. It consists of 1970 video clips from YouTube.
sentences generated by the Monte-Carlo rollout policy as input and gen- Each video clip has a duration of 3 s to 10 s. Every video has human-
erates the output scores. The output scores of the discriminator for all annotated descriptions of approximately 30 to 40 annotations. The data
the 𝑁 sentences are averaged to calculate the reward for the manager. split is as follows: 1200 clips for training, 100 clips for validation, and
This produces the discriminator scores for the sentence generated up to 670 clips for testing. The limitation is that the dataset has a limited
time step t. When 𝑡 = 𝑇 , the sentence is complete, and the discriminator number of videos compared to the other video captioning datasets.
directly generates a score which is given as a reward to the manager.
{ ∑ 4.1.2. MSR-VTT
1 𝑁 ̂𝑖
𝑖=1 𝐷𝜙 (𝐩, 𝐪, 𝐬, 𝑋 ) if 𝑡 < 𝑇
𝑟𝙼𝑡 = 𝑁 (10) The MSR Video-to-Text (MSR-VTT) dataset consists of 10,000 vid-
𝐷𝜙 (𝐩, 𝐪, 𝐬, 𝑋) if 𝑡 = 𝑇 eos. Every video has 20 human-annotated descriptions. The split for
At each step, 𝑡 the worker 𝚆 predicts the word 𝑥𝑡 in the sentence. The training, test, and validation is done based on the split-up proposed
worker is trained using the REINFORCE (Williams, 1992) algorithm as by Multimedia MSR video to language challenge. We have used 6513
in Yu et al. (2017). Let 𝐴1∶𝑡−1 be the states of the worker from time videos for training, 2990 videos for testing, and 497 videos for valida-
1 to 𝑡 − 1 and 𝑎0 is the initial state of the worker. 𝑝(𝐴1∶𝑡−1 |𝑎0 ) is the tion. The MSR-VTT dataset has around 20 categories of videos such as
probability of reaching the state 𝐴𝑡−1 from the initial state 𝑎0 . The cooking, sports, music, and animals.
gradient of the objective function 𝑌 (𝜏) of the worker with parameters
𝜏 can be written as 4.2. Evaluation metrics

1 ∑
𝑇 ∑ ( )
∇𝜏 𝑌 (𝜏) = 𝑝(𝐴1∶𝑡−1 |𝑎0 )𝑟𝚆𝑡 ⋅ ∇𝜏 𝑙𝑜𝑔 𝑊𝜏 (𝑥𝑡 |𝑥1∶𝑡−1 ) (11) For evaluating the quality of the generated video captions, we use
𝑇 𝑡=1 𝐴1∶𝑡−1 BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014),
CIDER (Vedantam et al., 2015) and ROUGE-L (Lin, 2004) metrics. The
Let E𝑥𝑡 ∼𝑊𝜏 (𝑥𝑡 |𝑋1∶𝑡−1 ) be the expectation of worker in generating word 𝑥𝑡
results are given as percentage(%) scores. The discussion in Vedantam
from the words 𝑥1∶𝑡−1 generated up to time step 𝑡 − 1. The Eq. (11) can
et al. (2015) shows that the METEOR score evaluates the correctness
be written as
of captions better compared to the BLEU and ROUGE-L scores. The
1 ∑
𝑇
( ) Microsoft COCO evaluation server (Chen et al., 2015) implementation
∇𝜏 𝑌 (𝜏) = E 𝑟𝚆 ⋅ ∇𝜏 𝑙𝑜𝑔 𝑊𝜏 (𝑥𝑡 |𝑥1∶𝑡−1 ) (12)
𝑇 𝑡=1 𝑥𝑡 ∼𝑊𝜏 (𝑥𝑡 |𝑥1∶𝑡−1 ) 𝑡 is used to calculate the scores of the metrics.

6
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

Table 2
Ablation study of SC-GAN on MSVD and MSRVTT datasets using BLEU-4, METEOR, CIDER and ROUGE-L evaluation metrics.
Method MSVD MSR-VTT
BLEU-4 METEOR CIDER ROUGE-L BLEU-4 METEOR CIDER ROUGE-L
GAN—ML train, Discriminator-No semantics (Baseline) 46.9 32.5 61.2 59.1 33.4 23.8 43.3 56.8
GAN—ML train, Discriminator-With semantics 48.1 33.2 72.9 65.2 40.1 27.3 47.1 60.3
GAN—Goal based reward 51.1 33.5 69.6 68.1 43.6 28.4 47.6 61.5
GAN—Semantics based reward 52.0 33.8 78.4 70.2 44.0 28.9 48.4 62.9
SC-GAN—Goal and semantics based rewards 53.1 34.5 82.3 71.4 46.1 29.8 50.4 63.1

Algorithm 1 Training procedure for SC-GAN We train the manager and worker using the maximum likelihood
Input: Video features and their ground truth captions method for 20 epochs. Then the discriminator model is trained for 10
Output: Manager 𝙼𝜃 , Worker 𝚆𝜏 and discriminator 𝜙 models epochs. We alternately train the manager, worker, and discriminator
//Initialization : for another 20 epochs using RL. The learning rate of the manager and
1: Initialize the parameters for Manager 𝙼𝜃 , Worker 𝚆𝜏 and discrimi- the worker is 0.001. The other hyperparameters in the model are tuned
nator 𝜙 randomly using the validation data.
2: Initially train 𝜙 using the samples X𝐻 and X𝑊
3: Initially train Manager 𝙼𝜃 and Worker 𝑊𝜏 using ML training
5. Results
//Reinforcement learning training :
4: for each iteration do
5: Generate a batch of training samples from videos and descrip- 5.1. Ablation study
tions
// Training the generator
Table 2 shows the result of an ablation study on the proposed se-
6: for i =1 to 𝑔 epochs do
mantically contextual GAN based video captioning model. The different
7: // Training the manager
methods listed in the table are as follows:
8: Extract vector c𝑡 from discriminator
9: Generate batch of training samples for manager
• GAN—ML train, Discriminator-No semantics : The basic video
10: Train manager using (9), (10)
captioning model is trained using the maximum likelihood (ML)
11: Update the parameters of the manager
// Training the worker method. While training the discriminator, we do not use semantic
12: Generate goal vector 𝑔𝑡 by using (6) features.
13: Generate batch of training samples for the worker • GAN—ML train, Discriminator-With semantics: The basic video
14: Train the worker using (12) captioning model is trained using the maximum likelihood (ML)
15: Update the parameters of the worker method. While training the discriminator, we multiply the word
16: end for embedding vector of each word by the semantic scores as in (1).
// Training the discriminator • GAN—Goal based reward: Video captioning model is trained
17: for j=1 to 𝑑 epochs do using RL training. The worker is trained using only the goal based
18: Generate a batch of training samples from the ground truth reward obtained as in (14).
descriptions and the corresponding descriptions generated by • GAN—Semantics based reward: Video captioning model is
the worker trained using RL training. Here the worker is trained using only
19: Train the discriminator using (8) the semantics based reward obtained as in (15).
20: Update the discriminator parameters • SC-GAN—Goal and semantics based rewards: Video caption-
21: end for ing model is trained using RL training. The worker is trained using
22: end for
both the goal and semantics based rewards as in (13).

The following observations are made based on the results presented


4.3. Experimental setup in Table 2:

• The proposed SC-GAN based video captioning using RL training


We have used Keras environment with Tensorflow background to
performs better than the other models.
implement the proposed GAN based video captioning model. For both
• The GAN model trained using the ML method generates lower
the datasets, we extract features from every 10th frame in the video.
scores than the model trained using the RL method. This shows
We use the pre-trained ResNet-152 (He et al., 2016) architecture to
that the goal based reward helps the generator to generate better
extract the 2D-CNN features. The 3D-CNN features are extracted from
frames of each video using the C3D (Tran et al., 2015) with an overlap descriptions similar to the ground truth.
of 8 frames. We have converted all the characters of the descriptions • The use of the semantics based reward provides better scores
to lower case before processing the descriptions. The length of the when used along with the goal based reward. But the semantics
descriptions is truncated to 30 words for the MSVD dataset and 40 based reward alone is not enough to generate good descriptions.
words for the MSR-VTT dataset. All the special characters present in the • The use of the goal based reward alone provides a lower score
descriptions are deleted while processing the descriptions. We extract than the model trained using both the goal and semantics based
the important keywords from the training descriptions and use them rewards. The goal based reward ensures that the descriptions are
as the semantic keywords of the videos. We then train a multi-layer similar to the ground truth descriptions. The semantics based
perceptron to predict the semantic scores for each video from the input reward ensures that the semantic words are given greater impor-
visual features. For the MSVD dataset, we choose the 300 most frequent tance while generating each word of the description.
words. For the MSR-VTT dataset, we choose the 400 most frequent • The model trained without semantics gives lower scores irrespec-
words as the semantic keywords. tive of the training method.

7
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

Fig. 6. Performance of the proposed SC-GAN and other models. The semantics based reward increases the scores of semantic keywords in the generated sentence. The sample
videos and their ground truth captions, and the captions generated by the baseline model and the SC-GAN models are shown. The scores generated for each word by the models
are also shown.

5.1.1. Analysis of scores with and without semantics based rewards and reinforcement learning are shown. After ML training, we train the
The semantics based rewards play a significant role in the proposed manager and worker using RL training.
SC-GAN video captioning model. Fig. 6 presents an analysis of the role Fig. 7(a) shows the loss of the worker, and Fig. 7(b) shows the
of the semantics based reward in sentence generation. The generated accuracy of the worker with respect to the epochs of training. The RL
descriptions of four videos and their respective scores are given in training shows much faster convergence after 20 epochs of initial ML
Fig. 6. The baseline model uses only visual features without semantic training and then stabilizes towards the minimum. Thus, the proposed
features and is trained using the ML method. The second model is RL training method converges faster than the model trained using only
trained by applying reinforcement learning with goal based reward but ML training.
without semantics based reward. The third model applies reinforcement
learning, where the manager is provided with discriminator rewards
5.2. Analysis of results on MSVD dataset
generated using Monte-Carlo rollout. It also uses the goal based reward
and the semantics based reward from the worker. From Fig. 6, it is
seen that the use of the semantics based reward enhances the scores of In this section, we compare the performance of the proposed ap-
semantic keywords in sentence generation. proach and the other state-of-the-art approaches for the MSVD dataset.
The first example in Fig. 6 shows that all three models generate The proposed SC-GAN approach provides 53.1% score on the BLEU-
sentences similar to the ground truth. But the proposed SC-GAN model 4 metric, 34.5% score on METEOR metric, 82.3% score of CIDER
produces higher scores compared to the other two models. The other metric, and 71.4% score on ROUGE-L metric. In Tables 3 and 4 the
examples show that the SC-GAN model with RL training produces better various features used in the models are denoted as follows VGG-16(V),
sentences compared to the other two models. ResNet(R), Inception(I), GoogLeNet(G), C3D(C), I3D(I3), Audio(A), Se-
mantic features(S) and Object features (O). The following observations
5.1.2. Plot of loss and accuracy are made from Table 3.
Fig. 7 shows the plot of the loss and accuracy of the worker. The
loss and accuracy are calculated for the worker model while predicting 1. LSTM-YT (Venugopalan et al., 2015b) and S2VT (Venugopalan
the words at each step. The loss and accuracy of both ML training et al., 2015a) approaches use pre-trained CNN as encoder and

8
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

Fig. 7. The loss and accuracy of the worker using ML and RL methods for training.

Table 3
Comparison of performance on MSVD dataset compared to other state-of-the-art approaches using BLEU-4, METEOR, CIDER and ROUGE-L
evaluation metrics.
Approach BLEU-4 METEOR CIDER ROUGE-L
LSTM-YT (V) Venugopalan et al. (2015b) 33.3 29.1 – –
S2VT (V+OF) Venugopalan et al. (2015a) – 29.8 – –
LSTM-E (V+C) Pan et al. (2015) 45.3 31.0 – –
BP-LSTM (R) Nabati and Behrad (2020) 42.9 32.0 62.2 68.3
HMVC (R) Liu et al. (2017) 44.3 32.1 68.4 68.9
MM-Att (V+C+A) Hori et al. (2017) 53.9 32.2 67.4 –
h-RNN (V+C+OF) Yu et al. (2016)] 49.9 32.6 65.8 –
aLSTMs (I) Guo et al. (2016) 50.8 33.3 74.8 –
STAT (G+C+O) Yan et al. (2020) 52.0 33.3 73.8 –
Less-is-more (R) Chen et al. (2018)] 52.3 33.3 76.5 69.6
SCN-LSTM (R+C+S) Gan et al. (2017)] 51.1 33.5 77.7 –
Att-Mot-Rep (V+C+O) Qi et al. (2020) 50.9 33.5 70.3 –
hLSTMat (R) Gao et al. (2020) 54.3 33.5 72.8 –
E2E-RL (I) Li and Gong (2019) 48.0 33.6 86.5 70.5
Stochastic -RNN (R+G) Song et al. (2019) 53.3 33.8 74.8 –
RecNet (I) Zhang et al. (2020) 52.3 34.1 80.3 69.8
Topic-Guid (I+C+A) Chen et al. (2019) 49.2 34.2 77.6 71.0
EtENet (I3+G) Olivastri et al. (2019) 50.0 34.3 86.6 70.2
OSTG (R+O) Zhang and Peng (2020) 57.5 36.8 92.1 –
CIDEnt-RL (I) Pasunuru and Bansal (2017) 54.4 34.9 88.6 72.2
SC-GAN (R+C) 53.1 34.5 82.3 71.4

Table 4
Comparison of performance on MSR-VTT dataset with other state-of-the-art approaches using BLEU-4, METEOR, CIDER and ROUGE-L evaluation
metrics.
Approach BLEU-4 METEOR CIDER ROUGE-L
MM-Att (V+C+A) Hori et al. (2017) 39.7 25.5 40.0 –
aLSTMs (I) Guo et al. (2016) 38.0 26.1 43.2 –
Stochastic -RNN (R+G) Song et al. (2019) 39.8 26.1 40.9 59.3
advLSTM (V) Yang et al. (2018) 36.0 26.1 – –
hLSTMat (R) Gao et al. (2020) 39.7 27.0 42.1 –
BP-LSTM (R) Nabati and Behrad (2020) 36.6 27.0 40.5 58.7
RecNet (I) Zhang et al. (2020) 39.2 27.5 48.7 60.3
STA-FG (R+C) Gao et al. (2020) 40.8 27.4 – –
Less-is-more (R) Chen et al. (2018) 41.3 27.7 44.1 59.8
EtENet (I+G) Olivastri et al. (2019) 40.5 27.7 47.6 60.6
MGSA (I+C+A) Chen and Jiang (2019) 45.4 28.6 50.1 –
OSTG (R+O) Zhang and Peng (2020) 41.9 28.6 48.2 –
HRL (R) Wang et al. (2018) 41.3 28.7 48.0 61.7
DS-RNN (G+S) Xu et al. (2019) 42.3 29.4 46.1 62.3
CIDEnt-RL (I) Pasunuru and Bansal (2017) 40.5 28.4 51.7 61.4
E2E-RL (I) Li and Gong (2019) 40.4 27.0 48.3 61.0
SC-GAN (R+C) 46.1 29.8 50.4 63.1

9
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

LSTM as decoder. Only spatial features are used for video rep- metric compared to aLSTMs (Guo et al., 2016). It also provides
resentation. The other approaches that use multi-modal features better performance compared to the multi-modal approach in
achieve better results than the models that only use spatial fea- HMVC (Liu et al., 2017).
tures. The proposed approach provides a relative improvement 3. The proposed approach gives a relative gain of 17.6% on the
of 59% score on the BLEU-4 metric and 16% improvement in the BLEU-4 metric, 8.4% on METEOR metric, and 4.6% on ROUGE-
score of METEOR metric compared to the LSTM-YT approach. L metric compared to the RL based model RecNet (Zhang et al.,
It also provides better score compared to the other end-to-end 2020) that uses the reconstruction of videos.
approach BP-LSTM (Nabati and Behrad, 2020). 4. Compared to object-aware spatio-temporal correlation model
2. LSTM-E (Pan et al., 2015), MM+Att (Hori et al., 2017), h- OSTG (Zhang and Peng, 2020) the SC-GAN performs better. It
RNN (Yu et al., 2016), aLSTMs (Guo et al., 2016) are the produces a relative increase of 10% on the BLEU-4 metric and
attention-based approaches to video captioning. The proposed 4.2% on the METEOR metric.
approach gives a better performance than the h-RNN (Yu et al., 5. The proposed approach gives a comparable performance with
2016) approach by 6.4% relative increase in score of BLEU- that of on DS-RNN (Xu et al., 2019), which uses visual and
4 metric, 5.8% increase in the score of METEOR metric, 25% semantic features. The DS-RNN uses separate attentional weights
increase in the score of CIDER metric. The MM-ATT approach while merging these features for generating captions. The pro-
presents a better score of the BLEU-4 metric than the proposed posed approach also uses visual and semantic features. The
method, but the scores of other metrics are lower. increase in performance is due to the use of SC-GAN for video
3. The SCN (Gan et al., 2017) uses semantic features in the LSTM captioning.
decoder trained using MLE. SC-GAN provides a 3.9% relatively 6. AdvLSTM (Yang et al., 2018) uses soft-argmax function for prop-
higher score in the BLEU-4 metric and 3% relatively higher score agating gradients. In this work, we use the hierarchical approach
in METEOR metric. The score of the CIDER metric in the SCN for propagating the gradients. The proposed approach gives a
method is higher than the proposed SC-GAN, but the SC-GAN relative increase of 28% in the BLEU-4 metric and 14.2% in
performs better in all the other evaluation metrics. the METEOR metric on the MSR-VTT dataset. The comparison
4. The proposed approach produces better scores on BLEU-4 and shows that the proposed approach performs better than the
ROUGE-L evaluation metrics compared to the topic-based ap- AdvLSTM (Yang et al., 2018) approach.
proach (Chen et al., 2019). However, it provides a lower score 7. HRL (Wang et al., 2018) generates goal vector using the feature
for the METEOR evaluation metric. The lower BLEU-4 and vector extracted from the frames using LSTM based encoder.
ROUGE-L score are due to the additional topics based knowledge In the proposed approach, we generate the goal vector based
of the video given as input to the model. on the semantically contextual features from the discriminator.
5. SC-GAN provides a better performance in terms of score of the The proposed approach provides an 11.6% relatively higher
METEOR evaluation metric compared to STAT (Yan et al., 2020), score in the BLEU-4 metric and 3.8% in the METEOR metric
PickNet based approach (Chen et al., 2018), end to end approach on the MSR-VTT dataset, which shows the effectiveness of the
EtENet (Olivastri et al., 2019), hierarchical LSTM based ap- proposed method of generating goal vector from the features of
proach (Gao et al., 2020), stochastic RNN based approach (Song the discriminator.
et al., 2019).
6. The multi-task based approach (Li and Gong, 2019) uses RL
training for captioning. CIDER based rewards are used for RL 5.4. Comparison with other GAN based approaches
training in the E2E-RL approach. The E2E-RL approach provides
better CIDER scores compared to the proposed approach due The existing approaches to video captioning using reinforcement
to the use of CIDER based rewards. However, the proposed learning are adversarial LSTM based approach (Yang et al., 2018),
approach provides better scores for the BLEU-4 and METEOR hierarchical RL based approach (Wang et al., 2018), Reconstruction
evaluation metrics compared to the E2E-RL method. based approach (Zhang et al., 2020) and End-to-End Multi-task RL
based approach (Li and Gong, 2019). The other RL based models in Li
5.3. Analysis of results on MSR-VTT dataset and Gong (2019), Wang et al. (2018), and Yang et al. (2018) use
LSTM based encoder for obtaining a representation of the video. In this
Table 4 shows the performance of the proposed approach and the work, we use the pre-trained 2D-CNN and 3D-CNN to extract the video
other state-of-the-art methods using BLEU-4, METEOR, CIDER and features. The approach in Yang et al. (2018) uses soft-argmax function
ROUGE-L metrics on MSR-VTT dataset respectively. The following proposed in Luvizon et al. (2019) to overcome the non-differentiability
inferences are made from Table 4. problem of argmax at Generator. In this work, we use the contextual
features from the discriminator to generate a goal, thus avoiding the
1. The SC-GAN gives a score of 29.8% on the METEOR metric, non-differentiability problem at the generator. In Yang et al. (2018),
which is higher than the scores of other state-of-the-art ap- the model is trained to minimize the log-likelihood error. In this work,
proaches. It gives 46.1% of BLEU-4 score, 50.4% of CIDER score, we use the goal based reward and the semantics based reward to train
and 63.1% of ROUGE-L score. Comparing the SC-GAN model the generator.
with other models like EtENet (Olivastri et al., 2019), less-is- The hierarchical model in Wang et al. (2018) uses 2D-CNN features
more (Chen et al., 2018), Stochastic- RNN (Song et al., 2019), extracted from the frames and LSTM encoder to derive the goal. In the
RecNet (Zhang et al., 2020), MGSA (Chen and Jiang, 2019), BP- SC-GAN model, we generate the reward from the semantic features.
LSTM (Nabati and Behrad, 2020) it produces better scores in The end-to-end approach proposed in Li and Gong (2019) uses rein-
terms of the evaluation metrics for MSR-VTT dataset. Thus the forcement learning for training the multi-tasking model. The model
study demonstrates the effectiveness of the proposed SC-GAN is trained for both attribute prediction and caption generation. The
approach. caption generation is similar to the model proposed in S2VT (Venu-
2. Compared to the attention based approaches MM-Att (Hori et al., gopalan et al., 2015a). However, we use a hierarchical architecture
2017) and aLSTMs (Guo et al., 2016) the proposed SC-GAN with a generator and discriminator for generating descriptions in this
gives higher scores for all the evaluation metrics. The pro- work. The model in Li and Gong (2019) is trained using binary cross-
posed approach provides an relative improvement of 21.3% on entropy loss function for attribute prediction and CIDER based reward
BLEU-4 metric, 14.2% on METEOR metric, 16.7% on CIDER for generating captions. We have used a goal based reward from the

10
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

Fig. 8. Video frames and captions generated using different approaches.

discriminator and also a semantic feature based reward for training the CIDER based reward for generating the descriptions. It is seen that the
generator of the proposed SC-GAN model. From Tables 3 and 4, we reinforcement learning based models which use CIDER based rewards
can see that the proposed approach provides better evaluation scores provide higher CIDER scores compared to the other evaluation metrics
compared to the other RL based models for video captioning. like BLEU, METEOR, and ROUGE-L metrics. It can also be seen that the
The approach in Zhang et al. (2020) uses a reconstruction based proposed approach provides better scores for all the evaluation metrics.
video captioning algorithm. It performs both videos to sentence conver- It can be seen that the proposed approach provides better results for
sion and sentence to video conversion. Here the model is trained using the MSR-VTT dataset, which has more number of objects and actions

11
H. Munusamy and Chandra Sekhar C. Computer Vision and Image Understanding 221 (2022) 103453

compared to the MSVD dataset, which has a limited number of videos. References
This is due to the various objects and actions present in the MSR-VTT
dataset. Thus the semantic based rewards provide better results for the Aakur, S., d. Souza, F.D.M., Sarkar, S., 2017. Towards a knowledge-based approach
for generating video descriptions. In: Proceedings of the 2017 14th Conference on
dataset, which has videos with a larger number of objects and actions.
Computer and Robot Vision. CRV. pp. 24–31.
The CIDER based reward used in literature provides better results Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning
as it is a language based reward, which is based on the entire generated to align and translate. arXiv e-prints arXiv:1409.0473.
sentence. However, the CIDER based rewards are completely language Baraldi, L., Grana, C., Cucchiara, R., 2017. Hierarchical boundary-aware neural encoder
for video captioning. In: Proceedings of the 2017 IEEE Conference on Computer
oriented, thus aiming only to generate descriptions similar to the
Vision and Pattern Recognition. CVPR. pp. 3185–3194.
ground truth. The proposed SC-GAN uses both the goal based reward Bin, Y., Yang, Y., Zhou, J., Huang, Z., Shen, H.T., 2017. Adaptively attending to visual
and the semantic based reward to train the generator. The goal based attributes and linguistic knowledge for captioning. In: Proceedings of the 25th ACM
reward is obtained from the discriminator, thus it helps the generator International Conference on Multimedia. pp. 1345–1353.
Chen, D.L., Dolan, W.B., 2011. Collecting highly parallel data for paraphrase evaluation.
to generate descriptions similar to ground truth. The semantics based
In: Proceedings of the 49th Annual Meeting of the Association for Computational
reward at the generator ensures that the semantic keywords are in- Linguistics: Human Language Technologies. pp. 190–200.
cluded in the generated description. Thus, the generator will be able Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L., 2015.
to generate descriptions similar to the ground truth descriptions and Microsoft COCO captions: Data collection and evaluation server. CoRR arXiv:
1504.00325.
semantically rich.
The RL based algorithms require more time to train the model than the traditional maximum likelihood estimation (MLE) based training strategies, mainly when Monte-Carlo rollout is used for training the model. However, the performance of the RL based algorithms is higher than that of the MLE based training. This can be seen in the results produced by the SC-GAN and also by the other RL based algorithms in the literature.
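As a rough illustration of the cost of the Monte-Carlo rollout, the sketch below estimates the reward of a partial caption by completing it several times with the generator policy and averaging a sequence-level score; the helper names sample_next_word and score_caption and the number of rollouts are placeholders rather than the actual SC-GAN implementation.

```python
# Illustrative Monte-Carlo rollout sketch (SeqGAN-style); not the exact SC-GAN code.
import random

def rollout_reward(partial_caption, sample_next_word, score_caption,
                   max_len=20, n_rollouts=16):
    """Estimate the reward of a partial caption by completing it n_rollouts
    times with the generator policy and averaging the sequence-level scores."""
    total = 0.0
    for _ in range(n_rollouts):
        caption = list(partial_caption)
        while len(caption) < max_len and caption[-1] != "<eos>":
            caption.append(sample_next_word(caption))  # one generator step
        total += score_caption(caption)                # e.g. a discriminator score
    return total / n_rollouts

# Dummy generator and scorer used only to show the call pattern.
vocab = ["a", "man", "is", "playing", "guitar", "<eos>"]
sample_next_word = lambda cap: random.choice(vocab)
score_caption = lambda cap: 1.0 if "guitar" in cap else 0.0
print(rollout_reward(["a", "man"], sample_next_word, score_caption))
```

Since such an estimate is recomputed for every time step of every sampled caption, the number of generator and discriminator evaluations grows with both the caption length and the number of rollouts, which accounts for most of the additional training time.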
Fig. 8 shows sample video frames from the MSVD and MSR-VTT datasets, along with the captions generated by the different models. It is seen that the proposed SC-GAN model generates a larger number of semantic keywords in the description compared to the model trained using MLE. Thus, the proposed approach, which uses the semantics based reward, improves the ability of the worker in predicting the semantic keywords of the descriptions.
6. Conclusion

The SC-GAN model proposed in this paper for video captioning uses the semantics based reward while training the worker. The discriminator is trained using the word embedding vectors weighted by the semantic scores, along with the visual features. The manager is trained based on the reward from the discriminator to set a goal for the worker. This enables the worker to generate descriptions similar to the ground truth descriptions. The worker is also trained using the goal based reward, which allows it to incorporate the goal generated by the manager. The proposed GAN based approach provides a significant improvement over the existing approaches to video captioning. The results obtained using the semantics based reward show that the SC-GAN model generates descriptions with a larger number of semantic keywords.
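As an illustration of the semantic weighting summarized above, the sketch below scales each word embedding by the semantic score of the corresponding keyword and concatenates a pooled text representation with the visual feature. The per-word scaling, the default score of 1.0 for non-keyword words, and the mean pooling (a stand-in for the network the discriminator would actually apply over the weighted embedding sequence) are assumptions made for illustration only.

```python
# Illustrative sketch of semantically weighted discriminator inputs; the actual
# SC-GAN weighting scheme may differ in detail.
import numpy as np

def weight_embeddings(caption_tokens, embed, semantic_scores, default_score=1.0):
    """Scale each word embedding by the semantic score of that word for the
    video (assumed to be 1.0 for words that are not semantic keywords)."""
    rows = []
    for w in caption_tokens:
        score = semantic_scores.get(w, default_score)
        rows.append(score * embed[w])
    return np.stack(rows)                  # (T, d) weighted embedding matrix

def discriminator_input(weighted_embeddings, visual_feature):
    """Concatenate pooled weighted word embeddings with the visual feature
    before passing them to the discriminator network."""
    pooled_text = weighted_embeddings.mean(axis=0)
    return np.concatenate([pooled_text, visual_feature])

# Dummy example: 4-dimensional embeddings and a 3-dimensional visual feature.
embed = {w: np.random.rand(4) for w in ["a", "man", "playing", "guitar"]}
scores = {"man": 0.9, "playing": 0.7, "guitar": 0.8}
x = discriminator_input(weight_embeddings(["a", "man", "playing", "guitar"],
                                           embed, scores),
                        np.random.rand(3))
print(x.shape)  # (7,)
```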
CRediT authorship contribution statement

Hemalatha Munusamy: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing – original draft, Writing – review & editing. Chandra Sekhar C.: Conception and design of study, Writing – original draft, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

All authors approved the version of the manuscript to be published.

References

Aakur, S., d. Souza, F.D.M., Sarkar, S., 2017. Towards a knowledge-based approach for generating video descriptions. In: Proceedings of the 2017 14th Conference on Computer and Robot Vision. CRV. pp. 24–31.
Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning to align and translate. arXiv e-prints arXiv:1409.0473.
Baraldi, L., Grana, C., Cucchiara, R., 2017. Hierarchical boundary-aware neural encoder for video captioning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 3185–3194.
Bin, Y., Yang, Y., Zhou, J., Huang, Z., Shen, H.T., 2017. Adaptively attending to visual attributes and linguistic knowledge for captioning. In: Proceedings of the 25th ACM International Conference on Multimedia. pp. 1345–1353.
Chen, D.L., Dolan, W.B., 2011. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 190–200.
Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L., 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR arXiv:1504.00325.
Chen, S., Jiang, Y.-G., 2019. Motion guided spatial attention for video captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 8191–8198.
Chen, S., Jin, Q., Chen, J., Hauptmann, A.G., 2019. Generating video descriptions with latent topic guidance. IEEE Trans. Multimed. 21 (9), 2407–2418.
Chen, C., Mu, S., Xiao, W., Ye, Z., Wu, L., Ma, F., Ju, Q., 2019. Improving image captioning with conditional generative adversarial nets. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 2852–2858.
Chen, Y., Wang, S., Zhang, W., Huang, Q., 2018. Less is more: Picking informative frames for video captioning. In: Proceedings of the 2018 European Conference on Computer Vision. ECCV. pp. 367–384.
Dai, B., Fidler, S., Urtasun, R., Lin, D., 2017. Towards diverse and natural image descriptions via a conditional GAN. In: 2017 IEEE International Conference on Computer Vision. ICCV. pp. 2989–2998.
Denkowski, M., Lavie, A., 2014. Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. pp. 376–380.
Deshpande, A., Aneja, J., Wang, L., Schwing, A.G., Forsyth, D.A., 2019. Fast, diverse and accurate image captioning guided by part-of-speech. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 10695–10704.
Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L., Deng, L., 2017. Semantic compositional networks for visual captioning. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 1141–1150.
Gao, L., Li, X., Song, J., Shen, H.T., 2020. Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans. Pattern Anal. Mach. Intell. 42 (5), 1112–1131.
Gao, L., Wang, X., Song, J., Liu, Y., 2020. Fused GRU with semantic-temporal attention for video captioning. Neurocomputing 395, 222–228.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27. pp. 2672–2680.
Guo, Z., Gao, L., Song, J., Xu, X., Shao, J., Shen, H.T., 2016. Attention-based LSTM with semantic consistency for video captioning. In: Proceedings of the 24th ACM International Conference on Multimedia. pp. 357–361.
Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., Wang, J., 2018. Long text generation via adversarial training with leaked information. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 5141–5148.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 770–778.
Hemalatha, M., Sekhar, C.C., 2020. Domain-specific semantics guided approach to video captioning. In: 2020 IEEE Winter Conference on Applications of Computer Vision. WACV. pp. 1576–1585.
Hori, C., Hori, T., Lee, T., Zhang, Z., Harsham, B., Hershey, J.R., Marks, T.K., Sumi, K., 2017. Attention-based multimodal fusion for video description. In: Proceedings of the 2017 IEEE International Conference on Computer Vision. ICCV. pp. 4203–4212.
Jin, T., Li, Y., Zhang, Z., 2019. Recurrent convolutional video captioning with global and local attention. Neurocomputing 370, 118–127.
Kim, Y., 2014. Convolutional neural networks for sentence classification. CoRR arXiv:1408.5882.
Kim, T., Cha, M., Kim, H., Lee, J.K., Kim, J., 2017. Learning to discover cross-domain relations with generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning. pp. 1857–1865.
Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C., 2017. Dense-captioning events in videos. In: 2017 IEEE International Conference on Computer Vision. ICCV. pp. 706–715.
Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L., 2011. Baby talk: Understanding and generating simple image descriptions. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 1601–1608.
Kusner, M.J., Hernández-Lobato, J.M., 2016. GANs for sequences of discrete elements with the Gumbel-softmax distribution.
Li, L., Gong, B., 2019. End-to-end video captioning with multitask reinforcement learning. In: 2019 IEEE Winter Conference on Applications of Computer Vision. WACV. pp. 339–348.
Li, W., Guo, D., Fang, X., 2018. Multimodal architecture for video captioning with memory networks and an attention mechanism. Pattern Recognit. Lett. 105, 23–29.
Lin, C.-Y., 2004. ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the ACL Workshop on Text Summarization Branches Out. pp. 74–81.
Lin, K., Li, D., He, X., Zhang, Z., Sun, M.-t., 2017. Adversarial ranking for language generation. In: Advances in Neural Information Processing Systems, vol. 30. pp. 3155–3165.
Liu, A.-A., Xu, N., Wong, Y., Li, J., Su, Y.-T., Kankanhalli, M., 2017. Hierarchical & multimodal video captioning: Discovering and transferring multimodal knowledge for vision to language. Comput. Vis. Image Underst. 163, 113–125, Language in Vision.
Luvizon, D.C., Tabia, H., Picard, D., 2019. Human pose regression by combining indirect part detection and contextual information. Comput. Graph. 85, 15–22.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space.
Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets.
Nabati, M., Behrad, A., 2020. Video captioning using boosted and parallel long short-term memory networks. Comput. Vis. Image Underst. 190, 102840.
Nie, W., Narodytska, N., Patel, A., 2019. RelGAN: Relational generative adversarial networks for text generation. In: International Conference on Learning Representations. pp. 1–20.
Olivastri, S., Singh, G., Cuzzolin, F., 2019. End-to-end video captioning. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop. ICCVW. pp. 1474–1482.
Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y., 2015. Jointly modeling embedding and translation to bridge video and language. CoRR arXiv:1505.01861.
Pan, Y., Yao, T., Li, H., Mei, T., 2017. Video captioning with transferred semantic attributes. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 984–992.
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., 2002. BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. ACL. pp. 311–318.
Pasunuru, R., Bansal, M., 2017. Reinforced video captioning with entailment rewards. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 979–985.
Qi, M., Wang, Y., Li, A., Luo, J., 2020. Sports video captioning via attentive motion representation and group relationship modeling. IEEE Trans. Circuits Syst. Video Technol. 30 (8), 2617–2633.
Ren, Z., Wang, X., Zhang, N., Lv, X., Li, L., 2017. Deep reinforcement learning-based image captioning with embedding reward. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 1151–1159.
Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B., 2013. Translating video content to natural language descriptions. In: Proceedings of the 2013 IEEE International Conference on Computer Vision. ICCV. pp. 433–440.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M., 2014. Deterministic policy gradient algorithms. In: Proceedings of the 31st International Conference on Machine Learning. pp. 387–395.
Song, J., Guo, Y., Gao, L., Li, X., Hanjalic, A., Shen, H.T., 2019. From deterministic to generative: Multimodal stochastic RNNs for video captioning. IEEE Trans. Neural Netw. Learn. Syst. 30 (10), 3047–3058.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. ICCV. pp. 4489–4497.
Vedantam, R., Zitnick, C.L., Parikh, D., 2015. CIDEr: Consensus-based image description evaluation. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 4566–4575.
Venugopalan, S., Hendricks, L.A., Rohrbach, M., Mooney, R., Darrell, T., Saenko, K., 2017. Captioning images with diverse objects. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 1170–1178.
Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K., 2015a. Sequence to sequence – video to text. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. ICCV. pp. 4534–4542.
Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K., 2015b. Translating videos to natural language using deep recurrent neural networks. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1494–1504.
Veselý, K., Ghoshal, A., Burget, L., Povey, D., 2013. Sequence-discriminative training of deep neural networks. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. pp. 2345–2349.
Wang, X., Chen, W., Wu, J., Wang, Y., Wang, W.Y., 2018. Video captioning via hierarchical reinforcement learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4213–4222.
Wei, Y., Wang, L., Cao, H., Shao, M., Wu, C., 2020. Multi-attention generative adversarial network for image captioning. Neurocomputing 387, 91–99.
Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8 (3), 229–256.
Wu, C., Wei, Y., Chu, X., Weichen, S., Su, F., Wang, L., 2018. Hierarchical attention-based multimodal fusion for video captioning. Neurocomputing 315, 362–370.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y., 2015a. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning. pp. 2048–2057.
Xu, N., Liu, A., Wong, Y., Zhang, Y., Nie, W., Su, Y., Kankanhalli, M., 2019. Dual-stream recurrent neural network for video captioning. IEEE Trans. Circuits Syst. Video Technol. 29 (8), 2482–2493.
Xu, R., Xiong, C., Chen, W., Corso, J.J., 2015b. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 2346–2352.
Yan, C., Tu, Y., Wang, X., Zhang, Y., Hao, X., Zhang, Y., Dai, Q., 2020. STAT: Spatial-temporal attention mechanism for video captioning. IEEE Trans. Multimed. 22 (1), 229–241.
Yan, S., Wu, F., Smith, J.S., Lu, W., Zhang, B., 2018. Image captioning based on a hierarchical attention mechanism and policy gradient optimization. CoRR arXiv:1811.05253.
Yan, S., Wu, F., Smith, J.S., Lu, W., Zhang, B., 2018. Image captioning using adversarial networks and reinforcement learning. In: 2018 24th International Conference on Pattern Recognition. ICPR. pp. 248–253.
Yang, Y., Zhou, J., Ai, J., Bin, Y., Hanjalic, A., Shen, H.T., Ji, Y., 2018. Video captioning by adversarial LSTM. IEEE Trans. Image Process. 27 (11), 5600–5611.
You, Q., Jin, H., Wang, Z., Fang, C., Luo, J., 2016. Image captioning with semantic attention. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 4651–4659.
Yu, H., Wang, J., Huang, Z., Yang, Y., Xu, W., 2016. Video paragraph captioning using hierarchical recurrent neural networks. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. pp. 4584–4593.
Yu, L., Zhang, W., Wang, J., Yu, Y., 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 2852–2858.
Yuan, J., Tian, C., Zhang, X., Ding, Y., Wei, W., 2018. Video captioning with semantic guiding. In: Proceedings of the IEEE Fourth International Conference on Multimedia Big Data. BigMM. pp. 1–5.
Zhang, J., Peng, Y., 2020. Video captioning with object-aware spatio-temporal correlation and aggregation. IEEE Trans. Image Process. 29, 6209–6222.
Zhang, Y., Roller, S., Wallace, B.C., 2016. MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1522–1527.
Zhang, W., Wang, B., Ma, L., Liu, W., 2020. Reconstruct and represent video contents for captioning via reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell. 42 (12), 3088–3101.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D., 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: 2017 IEEE International Conference on Computer Vision. ICCV. pp. 5908–5916.

Hemalatha M. received her B.Tech. degree in Information Technology from Panimalar Engineering College, India, in 2011. She received her M.E. degree in Communication and Networking from MIT Campus, Anna University in the year 2013. She is currently a doctoral student at Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras. She is also working as Assistant Professor at Department of Information Technology, Madras Institute of Technology, Anna University, Chennai, India from December 2014. Her research interests are deep learning, computer vision, artificial neural networks, image processing and pattern recognition.

Chandra Sekhar C. received his B.Tech. degree in Electronics and Communication Engineering from Sri Venkateswara University, Tirupati, India, in 1984. He received his M.Tech. degree in Electrical Engineering and Ph.D. degree in Computer Science and Engineering from Indian Institute of Technology (IIT) Madras in 1986 and 1997, respectively. He was a Lecturer from 1989 to 1997, an Assistant Professor from 1997 to 2002, an Associate Professor from 2004 to 2010, and a Professor since 2010 in the Department of Computer Science and Engineering at IIT Madras, India. He was a Japanese Society for Promotion of Science (JSPS) post-doctoral fellow at Center for Integrated Acoustic Information Research, Nagoya University, Nagoya, Japan, from May 2000 to May 2002. His current research interests are in machine learning, deep learning and distance metric learning.
