
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
CENTRAL CAMPUS PULCHOWK

PROJECT CONCEPT NOTE


ON
SEMANTICALLY CONSISTENT VIDEO CAPTIONING

By:
Anish Bhusal / 072BCT505
Avishekh Shrestha / 072BCT507
Ramesh Pathak / 072BCT527
Saramsha Dotel / 072BCT534

Submitted To:
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL

16 December, 2018
ACKNOWLEDGEMENTS

We would like to express our sincerest gratitude to the Department of Electronics and
Computer Engineering for providing us students with the opportunity to undertake a project of
this caliber, one that allows us to explore new dimensions in academic research as well as
software product development. We are indebted to all our teachers and seniors who guided us
in selecting a project that best addresses current trends and practices in Computer Engineering.
We would also like to acknowledge with humble gratitude all those who helped us develop our
idea from a simple concept into something concrete.

We believe that throughout this project's development, a plethora of knowledge and skills can
be amassed through the practical implementation of the theoretical concepts taught as part of
the course syllabus, and we appreciate the effort the department has put into standardizing it
as a course requirement.

Any suggestions for improvement and refinement on the project idea and implementation will
be highly appreciated.

ABSTRACT
This project deals with the automatic summarization of video content in natural language
sentences while maintaining semantic consistency across the different events in a video. With
the amount of video content increasing exponentially, automatic captioning of video events
helps filter videos that depict a particular action or genre and gives the visually impaired an
opportunity to explore the visual information videos have to impart. Through this project, we
propose to employ deep neural networks, namely Convolutional Neural Networks and
Recurrent Neural Networks, to extract the visual features of a video, identify temporal
segments of interest and describe each in natural language sentences.

TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS / ABBREVIATIONS

1. INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Objectives
1.4 Scope of Work

2. LITERATURE REVIEW

3. METHODOLOGY
3.1 System Block Diagram
3.2 Description
3.3 Training Method
3.4 Evaluation Metrics

4. PROJECT REQUIREMENTS

5. EXPECTED OUTCOME

6. PROJECT SCHEDULE

7. BUDGET ESTIMATE

8. REFERENCES

LIST OF FIGURES

Figure: Block diagram of the proposed system

Figure: 3D convolution network

Figure: An LSTM cell

Figure: Results obtained from the current state-of-the-art video captioning system

Figure: Gantt chart


LIST OF TABLES

Table: Project schedule

Table: Expected budget estimate


LIST OF SYMBOLS / ABBREVIATIONS

LSTM: Long Short-Term Memory

GPU: Graphics Processing Unit

ReLU: Rectified Linear Unit

CNN: Convolutional Neural Network

MSVD: Microsoft Video Description Corpus

MSR-VTT: Microsoft Research Video-to-Text

SVO: Subject Verb Object

BLEU: Bilingual Evaluation Understudy Score

METEOR: Metric for Evaluation of Translation with Explicit ORdering

IDE: Integrated Development Environment

1. INTRODUCTION
1.1 Background

Automatically describing the contents of an image is a fundamental problem in artificial
intelligence that connects computer vision and natural language processing. Recent advances
have enabled machines to describe an image with sentences. The surge in performance of
image captioning techniques has motivated extensive research into its extension: the automatic
captioning of videos with natural language sentences. Today, with online video platforms like
YouTube and Netflix, and a smartphone in nearly everyone's pocket, people are creating and
streaming more video content than ever. Universities and businesses have expanded their use
of video too, creating hundreds of new applications for it. Among the fastest-rising uses are
teaching, training and communicating. With more and more educators and businesses using
video as a professional tool for sharing instructional and informational content on demand,
video captioning has become a more important topic than ever before.

In this information age, where a tremendous amount of visual data is generated every day,
automatic captioning of videos would greatly help users filter what is interesting to them
among the sheer number of videos on YouTube. Video captioning refers to automatically
describing video contents with natural language sentences. It is the next step beyond image
captioning: a sentence or a paragraph is generated for a video clip that captures its visual
semantics.

1.2 Problem Statement

In the past few years, there has been a major shift towards video content as the primary form
of media. Considering this momentum, the importance of video captioning has increased.
When it comes to captioning media projects, most companies work with offsite people, usually
networks of freelancers (e.g. Amazon Mechanical Turk workers) based worldwide. However,
delivering video information at such a scale is challenging for media and entertainment
companies, as captions are costly to create and the manual undertaking can be burdensome to
production teams. As we see an explosion in content produced across the globe, providing a
solution that is secure, flexible and collaborative is becoming increasingly important.

Hence, the proposed system aims to solve the video captioning problem, where a sentence is
generated to describe a video clip in a way that captures its visual semantics.

1.3 Objectives
The main objectives of Semantically Consistent Video Captioning are:

1. To capture the temporal regions of interest depicting events in a video clip.

2. To generate sentences or paragraphs describing these temporal regions, and hence
the whole video, with semantic consistency.

1.4 Scope of Work

The proposed project sees massive scope in the modern age of extravagant digital media
consumption. The project aims to bridge the gap between media content (videos) and its
review or synopsis (a text summary). With the development of an automatic video captioning
system that closely resembles human understanding of video content, it becomes possible to
computationally generate descriptions of the events inside a video without having to view the
video at all. This can prove particularly useful when videos depicting certain events have to
be identified across a massive corpus of video data: manually browsing through each of the
videos is a cumbersome task, while retrieving a video by analysing its caption is
computationally tractable. The project is also relevant to online video streaming services like
Netflix and YouTube, which can leverage this technology to give users the option of filtering
videos based on their content. The advertising sector can also benefit, as advertisements that
match the content of specific segments of a video can be inserted at those points. Above all,
this project has an important role to play in assisting visually impaired people, giving them the
opportunity to consume the huge amount of video content available on the internet through a
textual summary of the events occurring in each video. Another important area where this
project can be leveraged is the censorship of video content: videos that violate community
guidelines can be identified and blocked through the system's visual understanding mechanism.

2. LITERATURE REVIEW

Recognition of image and video is a fundamental and challenging problem in computer vision.
Rapid progress has been made in the past few years, especially in image feature learning and
various pre-trained CNN models have been proposed. However, such image-based deep
features cannot be directly applied to process videos due to the lack of dynamic information.
D. Tran et al. [1] propose to learn spatio-temporal features using deep 3D CNN and show good
performance in various video analysis tasks. The recent emergence of LSTM [2] has made it
easier to model sequence data and learn patterns with a wider range of temporal dependencies.
Donahue et al. [3] integrate CNN and LSTM to learn spatio-temporal information from videos:
their model extracts 2D CNN features from video frames, which are then fed into an LSTM
network to encode the videos' temporal information.

In an effort to start describing videos, methods in video summarization aimed to aggregate
segments of videos that contain important or interesting visual information [4]. These methods
attempted to use low-level features such as colour and motion, or to model objects and their
relationships. While such summaries provide a means of finding important segments, these
methods are limited by small vocabularies and do not evaluate how well visual events can be
explained.

Early attempts at video captioning by S. Venugopalan et al. [5] simply mean-pooled video
frame features and used a pipeline inspired by the success of image captioning, proposing an
end-to-end LSTM-based model for video-to-text generation. However, this approach works
only for short clips with a single major event. In order to translate video to language efficiently,
approaches should take both temporal and spatial information into account. Inspired by this, an
end-to-end sequence-to-sequence model [6] was proposed to generate captions for videos. It
incorporates a stacked LSTM which first reads the sequence of CNN outputs and then
generates a sequence of words.

Pan et al. [7] propose a novel approach, the Hierarchical Recurrent Neural Encoder (HRNE),
which exploits multiple-timescale abstractions of the temporal information with a two-layer
LSTM network. Pan et al. [8] propose a framework which explores the learning of LSTMs and
aims to locally maximize the probability of the next word given the previous words and visual
content features. To obtain the most representative and high-quality description for a video,
Li et al. [9] propose a summarization-based video captioning method, which constructs an
adjacency graph over the generated candidate sentences and uses this graph to re-rank them.
The most recent works have shifted focus to generating paragraph-level or dense captioning
for a video. The key framework proposed by Yu et al. [10] is a hierarchical RNN (h-RNN) for
describing a long video with a paragraph consisting of multiple sentences. This framework
consists of two generators: a sentence generator and a paragraph generator.

Finally, we build upon the recent work on video captioning [10], which integrates an attention
mechanism with LSTM to capture the salient structures of a video and explores the correlation
between multi-modal representations for generating sentences with rich semantic content.
Experiments on benchmark datasets demonstrate that this method, using a single feature, can
achieve competitive or even better results than state-of-the-art baselines for video captioning
in both BLEU and METEOR. Inspired by this work, we aim to design a semantically consistent
video captioning model by incorporating temporal context.

3. METHODOLOGY

3.1 System Block Diagram


Figure. Block Diagram of Semantically Consistent Video Captioning system

3.2 Description

The process of video captioning involves both the visual understanding of scenes depicted in
the video frames and the ability to generate the best possible natural language description for
those scenes. Moreover, due to the temporal nature of the input in the case of videos (i.e. a
sequence of video frames rather than a single image), spatial as well as temporal attention
mechanisms are required. A brief overview of the proposed steps to achieve the mentioned
objective is given below:

3.2.1 Input:

The input to our system is a sequence of video frames v = {v_t}, where t ∈ {0, ..., T − 1}
indexes the frames in temporal order.

3.2.2 Video Features Extraction:

To properly encode the salient features of a video, we need to represent both its spatial and its
temporal characteristics. These spatio-temporal features can be extracted with the help of 3D
Convolutional Neural Networks (Conv3D or C3D), a variant of the popular Convolutional
Neural Network with 3D inputs and outputs. A technical description of the neural network
architecture is provided below:

Convolution Neural Networks (CNN)


In machine learning, a convolutional neural network (CNN or ConvNet) is a class of deep,
feed-forward artificial neural networks most commonly applied to analyzing visual imagery.
CNNs are therefore well suited to the task of image classification.
A ConvNet is basically a combination of two parts: it begins with convolutional layers (the
feature extractor) and terminates with a fully-connected (FC) layer (the classifier).
• Feature Extractor
The convolutional layers are often present in conjunction with activation layers and pooling
layers, which produce features of reduced dimension. These sets of layers are often repeated.
An intuitive understanding of the features produced by each set of layers is that earlier filters
detect simple features such as lines and curves, while later filters detect more complex features
such as shapes.
• Classifier
The conventional design for a ConvNet is to terminate the model with a fully-connected layer.
This layer is fed the down-sampled features from the previous layers, which are then classified
accordingly.

Layers in CNN
a. Convolutional Layers
The convolutional layer is the core building block of a CNN. The layer's parameters consist of
a set of learnable filters (or kernels), which have a small receptive field, but extend through the
full depth of the input volume. During the forward pass, each filter is convolved across the
width and height of the input volume, computing the dot product between the entries of the
filter and the input and producing a 2-dimensional activation map of that filter. As a result, the
network learns filters that activate when it detects some specific type of feature at some spatial
position in the input. Stacking the activation maps for all filters along the depth dimension
forms the full output volume of the convolution layer. Every entry in the output volume can
thus also be interpreted as an output of a neuron that looks at a small region in the input and
shares parameters with neurons in the same activation map.
Three hyper-parameters control the size of output volume of the convolutional layer:
● Depth
The depth of the output volume controls the number of neurons in a layer that connect to the
same region of the input volume. These neurons learn to activate for different features in the
input. For example, if the first convolutional layer takes the raw image as input, then different
neurons along the depth dimension may activate in the presence of various oriented edges, or
blobs of color.
● Stride
Stride corresponds to the number of pixels by which the filter shifts or slides over the input.
When the stride is 1, the filter shifts by 1 pixel. It is uncommon to use a stride greater than 2.
Larger strides reduce the spatial dimension of the output.
● Zero-padding
Zero-padding refers to the process of symmetrically adding zeros to the boundary of the input
matrix. It is a hyper-parameter that allows the spatial dimension of the output to be adjusted,
and it is most commonly used when the output of the convolutional layer must have the same
spatial dimensions as the input. The effect of stride and padding on the output shape is
illustrated in the sketch after this list.
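As a concrete check of these hyper-parameters, the output width of a convolution follows
O = floor((W − F + 2P) / S) + 1 for input width W, filter size F, padding P and stride S. The
short PyTorch sketch below (a hypothetical 32×32 input; the layer sizes are illustrative and not
part of the proposed system) shows the two common settings.

import torch
import torch.nn as nn

# Hypothetical 32x32 RGB input with batch size 1.
x = torch.randn(1, 3, 32, 32)

# Output width: O = floor((W - F + 2P) / S) + 1
same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)    # floor((32 - 3 + 2) / 1) + 1 = 32
halved = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)  # floor((32 - 3 + 2) / 2) + 1 = 16

print(same(x).shape)    # torch.Size([1, 16, 32, 32])
print(halved(x).shape)  # torch.Size([1, 16, 16, 16])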

b. Pooling layers
Another important concept of CNNs is pooling, which is a form of non-linear down-sampling.
There are several non-linear functions to implement pooling among which max pooling is the
most common. It partitions the input image into a set of non-overlapping rectangles and, for
each such sub-region, outputs the maximum. The intuition is that the exact location of a feature
is less important than its rough location relative to other features. The pooling layer serves to
progressively reduce the spatial size of the representation, to reduce the number of parameters
and amount of computation in the network, and hence to also control overfitting. It is common
to periodically insert a pooling layer between successive convolutional layers in a CNN
architecture. The pooling operation provides another form of translation invariance.
c. ReLU layer
ReLU is the abbreviation of Rectified Linear Units. This layer applies the non-saturating
activation function f(x)=max(0,x). It increases the nonlinear properties of the decision function
and of the overall network without affecting the receptive fields of the convolution layer. Other
functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent
f(x) = tanh(x) or f(x) = |tanh(x)|, and the sigmoid function f(x) = 1 / (1 + e^(-x)). ReLU is often
preferred to other functions because it trains the neural network several times faster without a
significant penalty to generalization accuracy.

d. Fully Connected Layer


Finally, after several convolutional and max-pooling layers, the high-level reasoning in the
neural network is done via fully connected layers. Neurons in a fully connected layer have
connections to all activations in the previous layer, as in a regular neural network.
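To make the roles of these layers concrete, the following minimal PyTorch sketch (an
illustrative toy network, not the architecture proposed in this project) chains convolution,
ReLU and pooling as a feature extractor and terminates with a fully-connected classifier.

import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learnable filters
            nn.ReLU(),                                    # non-saturating activation
            nn.MaxPool2d(2),                              # non-linear down-sampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes a 32x32 input

    def forward(self, x):
        x = self.features(x)         # (N, 32, 8, 8) for a 32x32 input
        x = x.flatten(start_dim=1)   # flatten for the fully-connected layer
        return self.classifier(x)

logits = TinyConvNet()(torch.randn(4, 3, 32, 32))   # -> shape (4, 10)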

Conv3D:

3D convolutions apply a 3-dimensional filter to the data, and the filter moves in three directions
(x, y, t) to compute low-level feature representations. Their output is a 3-dimensional volume
such as a cube or cuboid. They are helpful for event detection in videos, and they are not limited
to 3D inputs but can also be applied to 2D inputs such as images.
Figure. 3D convolution network
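As an illustrative sketch only (a single 3D convolution block, not the full C3D architecture;
the clip shape and channel counts are assumptions), the PyTorch snippet below shows how a
clip tensor laid out as (batch, channels, frames, height, width) is convolved jointly over space
and time. In practice, a network pretrained on a large video dataset would be used and its
intermediate activations taken as the clip features.

import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # a hypothetical 16-frame RGB clip

conv3d_block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),   # the filter slides over x, y and t
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool spatially, keep the temporal length
)

features = conv3d_block(clip)
print(features.shape)   # torch.Size([1, 64, 16, 56, 56])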

3.2.3 Event Localization Module:

The features extracted from the video by the Convolutional Neural Network (C3D in our case)
will be fed at varying strides into an LSTM unit to generate a set of temporal events, each
marked by its start time and end time. Each such temporal event also has a hidden
representation, which will then be fed into the captioning module for caption generation.
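A rough sketch of this proposal idea is given below (the interface, feature dimension and
anchor count are assumptions rather than the final design): an LSTM reads the C3D features
and, at each time step, scores a fixed set of K anchor lengths as candidate events ending at that
step, while the corresponding hidden states serve as the event representations passed to the
captioning module.

import torch
import torch.nn as nn

class EventProposal(nn.Module):
    def __init__(self, feat_dim=500, hidden_dim=512, num_anchors=16):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, num_anchors)   # one confidence per anchor length

    def forward(self, feats):                  # feats: (batch, time, feat_dim) C3D features
        hidden, _ = self.lstm(feats)           # hidden: (batch, time, hidden_dim)
        scores = torch.sigmoid(self.score(hidden))
        return scores, hidden                  # proposal confidences + event representations

scores, hidden = EventProposal()(torch.randn(2, 40, 500))
# scores[b, t, k] ~ confidence that an event of anchor length k ends at step t.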

3.2.4 Event Captioning Module:

One captioning approach could be to treat each event individually and use a captioning LSTM
network to describe each one. However, most events in a video are correlated, and to capture
such correlations we design our captioning module to incorporate context from neighbouring
events. Long Short-Term Memory (LSTM) is a popular variant of the Recurrent Neural
Network (RNN); combined with visual attention, it generates one word at every time step
conditioned on a context vector, the previous hidden state and the previously generated words.
The visual attention mechanism allows the model to focus on the important parts of the video
frame over others.

LSTM:
Figure. An LSTM Cell

The figure shows the LSTM implementation which we shall be using in our model. The square
dots in the LSTM cell imply projections with learnt weight vectors. The LSTM cell takes input
through the input gate and uses that input to modulate the memory. The forget gate erases the
memory cell and the output gate decides how this memory should be emitted. The contribution
made by each gate depends on the learnt weights. The LSTM state updates can be written as
affine transformations of the inputs with learned parameters, followed by nonlinearities:

i_t = σ(W_i E y_(t-1) + U_i h_(t-1) + Z_i z_t + b_i)
f_t = σ(W_f E y_(t-1) + U_f h_(t-1) + Z_f z_t + b_f)
o_t = σ(W_o E y_(t-1) + U_o h_(t-1) + Z_o z_t + b_o)
g_t = tanh(W_g E y_(t-1) + U_g h_(t-1) + Z_g z_t + b_g)
c_t = f_t ° c_(t-1) + i_t ° g_t
h_t = o_t ° tanh(c_t)

Here, i_t, f_t, c_t, o_t and h_t are the input gate, forget gate, memory cell, output gate and
hidden state of the LSTM, respectively, and y_(t-1) is the previously generated word. The
vector z_t is the context vector capturing the visual information associated with a particular
input location, E is the embedding matrix, and the W, U, Z and b terms are learned parameters.
Let m and n denote the dimensionality of the embedding and of the LSTM, respectively. σ
denotes the logistic sigmoid function and ° denotes element-wise multiplication.
The context vector z_t represents the relevant part of the visual input at time t. It can be
calculated using the attention model from the annotation vectors a_i, i = 1, ..., L, corresponding
to the features extracted from the CNN. We shall be using the soft attention model [13], which
assigns a relative importance to each location i when blending the a_i's together. A weight α_ti
is defined for each annotation vector using the attention model f_att, for which we use a
multilayer perceptron conditioned on the previous hidden state h_(t-1):

e_ti = f_att(a_i, h_(t-1))

α_ti = exp(e_ti) / Σ_(k=1..L) exp(e_tk)

Once the weights are computed, the context vector z_t is computed as

z_t = ϕ({a_i}, {α_ti})

The ϕ function returns a single vector given the set of annotation vectors and the corresponding
weights. For the soft attention model we have

ϕ({a_i}, {α_ti}) = Σ_(i=1..L) α_ti a_i

This corresponds to feeding a soft-weighted context into the LSTM, so the LSTM gives priority
to annotation vectors with higher weights while generating word sequences.
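A minimal sketch of this soft attention step is given below (the module structure and
dimensions are ours, chosen for illustration): f_att is realised as a small multilayer perceptron
over each annotation vector a_i and the previous hidden state, the weights α are obtained with
a softmax over the L locations, and the context z is their weighted sum.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, annot_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(annot_dim, attn_dim)   # projects each annotation vector a_i
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # projects the previous hidden state
        self.score = nn.Linear(attn_dim, 1)            # f_att: produces one score per location

    def forward(self, a, h_prev):   # a: (batch, L, annot_dim), h_prev: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_a(a) + self.proj_h(h_prev).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)        # attention weights over the L locations
        z = (alpha.unsqueeze(-1) * a).sum(dim=1)       # context vector z = sum_i alpha_i * a_i
        return z, alpha

z, alpha = SoftAttention(annot_dim=512, hidden_dim=512)(torch.randn(2, 28, 512), torch.randn(2, 512))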

3.2.5 Output:

Our output is a set of sentences s_i ∈ S, where s_i = (t_start, t_end, {v_j}) consists of the start
and end times of the described event together with the words v_j ∈ V that make up the sentence
(with differing lengths for each sentence), and V is our vocabulary set.
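For illustration, each element of this output could be held in a simple container such as the
following (a hypothetical structure, not a required format):

from dataclasses import dataclass
from typing import List

@dataclass
class CaptionedEvent:
    t_start: float     # event start time in seconds
    t_end: float       # event end time in seconds
    words: List[str]   # the sentence as a sequence of vocabulary words

result = [CaptionedEvent(0.0, 12.5, ["a", "man", "is", "playing", "a", "guitar"])]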

3.3 Training Method

The developed model shall be trained using stochastic gradient descent with adaptive
learning-rate algorithms, on laptops and computers equipped with GPUs and high
computational power.
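A schematic training step is sketched below; `model`, the dataloader yielding (clip, caption)
pairs and the padding index are assumed placeholders, and Adam is used here as one of the
adaptive-learning-rate variants of stochastic gradient descent.

import torch
import torch.nn as nn

def train_epoch(model, dataloader, optimizer, device="cuda"):
    criterion = nn.CrossEntropyLoss(ignore_index=0)       # index 0 assumed to be the padding token
    model.train()
    for clips, captions in dataloader:                    # captions: (batch, max_len) word indices
        clips, captions = clips.to(device), captions.to(device)
        logits = model(clips, captions[:, :-1])           # teacher forcing: predict the next word
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         captions[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Usage: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)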

Many large-scale benchmark datasets are available for training the model to learn the video
captioning task. Early datasets were analogous to image captioning datasets and failed to
exploit the temporal structure of video, whereas recent datasets provide multiple semantically
coherent sentences per video:

1. Microsoft Video Description Corpus (MSVD) [11]

MSVD, also referred to as the YouTube Dataset in early works, is one of the earliest open-
world datasets. It is a collection of YouTube clips gathered on Mechanical Turk by asking
workers to pick short clips depicting a single activity. As a result, each clip lasts between
10 and 25 seconds, with fairly constant semantics and little temporal structure. It has 1,970
video clips in total and covers a wide range of topics such as sports, animals and music.
Each clip comes with multiple parallel and independent sentences labelled by different
Amazon Mechanical Turk workers in a number of languages. For English specifically, it
has roughly 40 parallel sentences per video, resulting in a total of about 80k clip-description
pairs. It has a vocabulary of 16k unique words, and each sentence contains 8 words on
average.

2. ActivityNet Captions [12]

This is a recently released large-scale benchmark dataset specific to dense-captioning
events. It contains 20k videos amounting to 849 video hours. The videos are collected
from a video search engine, covering a wide range of categories. On average, each video
contains 3.65 temporally localized sentences, resulting in a total of 100k sentences.
Each sentence covers a unique segment of the video and describes an event that occurs
over a varying span of time. On average, each sentence has a length of 13.48 words and
describes 36 seconds and 31% of its respective video. The rich annotation enables
explicit exploration of temporal structures.

3. MSR Video-to-Text (MSR-VTT) [13]

It is a recently released large-scale video captioning benchmark and is by far the largest
video captioning dataset in terms of the number of sentences and the size of the
vocabulary. It contains 10k video clips crawled from a video search engine, covering the
20 most representative categories of video search, including news, sports, etc. The duration
of each clip is between 10 and 30 seconds, while the total duration is 41.2 hours. Each
video clip is annotated with 20 parallel and independent sentences by multiple Amazon
Mechanical Turkers, which provide a good coverage of the semantics of a video clip.
There are in total 200K clip-sentence pairs with a vocabulary of 29,316 unique words.

3.4 Evaluation Metrics

Video captioning results are evaluated based on their correctness as natural language and the
relevance of their semantics to the respective video. The following are widely used evaluation
metrics that address these aspects:

1. SVO Accuracy [5]

It is used in early works to measure whether the generated SVO (Subject, Verb, Object)
triplets cohere with the ground truth. The purpose of this evaluation metric is to focus on
matching broad semantics while ignoring visual and language details.

2. BLEU [14]

It is one of the most popular metrics in the field of machine translation. The idea is to
measure the numerical closeness between two sentences by computing the geometric mean
of n-gram match counts. As a result, it is sensitive to positional mismatches of words. It
may also favor shorter sentences, which makes it hard to adapt to complex content.

3. METEOR [14]

It is computed based on the alignment between a given hypothesis sentence and a set
of candidate references. METEOR compares exact token matches, stemmed tokens and
paraphrase matches, as well as semantically similar matches using WordNet synonyms.
This semantic aspect distinguishes METEOR from the other metrics. The literature reports
that METEOR consistently performs better than BLEU and ROUGE and outperforms
CIDEr when the number of references is small. (A scoring sketch using BLEU and
METEOR follows this list.)
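As a rough illustration of how these scores could be computed during development (not a full
evaluation pipeline), the snippet below uses NLTK; note that recent NLTK versions expect
tokenized input for METEOR and require the WordNet corpus to be downloaded, and the
captions shown are made-up examples.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score   # needs the WordNet corpus

references = [
    "a man is playing a guitar on stage".split(),
    "someone plays the guitar".split(),
]
hypothesis = "a man plays a guitar".split()

bleu = sentence_bleu(references, hypothesis, smoothing_function=SmoothingFunction().method1)
meteor = meteor_score(references, hypothesis)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")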

4. PROJECT REQUIREMENTS

4.1 Hardware Requirement

The following hardware will be used to implement the system:

a. NVIDIA Titan X GPU (through cloud services)
b. NVIDIA GeForce GTX 1060
c. Video-camera module

4.2 Software Requirement

The following software will be used to implement the system:

a. PyTorch Library
b. PyCharm IDE
5. EXPECTED OUTCOME

The video captioning model is expected to generate grammatically and semantically consistent
sentences describing a given video. The system is expected to temporally localize events in a
video and provide their textual descriptions, thereby replicating results of state-of-the-art
systems for short video clips with a single major event and generating results with acceptable
metric scores for longer video clips with multiple correlated events. The following are results
obtained from preliminary research; our system is expected to generate similar descriptions.
Figure. Results obtained from the current State of the art Video Captioning system

6. PROJECT SCHEDULE

The project is scheduled to finish on the first week of August 2019.


Figure. Gantt chart

Table. Project schedule

Task Name                        Start Date    End Date     Duration (days)
Rough Sketch and Design          12/7/2018     12/17/2018   10
Collection of Study Material     12/18/2018    1/1/2019     14
Algorithm Research               1/2/2019      1/20/2019    18
Basic Implementation             1/12/2019     3/17/2019    64
Training Simulation              3/18/2019     4/1/2019     14
Testing and Debugging            4/2/2019      4/17/2019    15
Advancing the Prototype          4/17/2019     7/7/2019     81
Documentation                    12/10/2018    7/7/2019     209

7. BUDGET ESTIMATE
Table. Expected Budget Estimate

S.N.   Item                                              Estimated Cost (USD)   Estimated Cost (NRs*)
1.     Cloud Services for Training the Neural Network    150                    17,262.00
2.

* Based on an exchange rate of 1 USD = NRs 115.08, as of 15 December 2018

8. REFERENCES

[1] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal
features with 3D convolutional networks," in Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.

[2] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation,
vol. 9, no. 8, 1997, pp. 1735–1780.

[3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko,
and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and
description," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2015, pp. 2625–2634.

[4] M. Gygli, H. Grabner, and L. Van Gool, "Video summarization by learning submodular
mixtures of objectives," in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2015, pp. 3090–3098.

[5] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko,
"Translating videos to natural language using deep recurrent neural networks," arXiv preprint
arXiv:1412.4729, 2014.

[6] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko,
"Sequence to sequence - video to text," in Proceedings of the IEEE International Conference
on Computer Vision (ICCV), 2015, pp. 4534–4542.

[7] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, "Hierarchical recurrent neural encoder for
video representation with application to captioning," in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1029–1038.

[8] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, "Jointly modeling embedding and translation to
bridge video and language," in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2016, pp. 4594–4602.

[9] G. Li, S. Ma, and Y. Han, "Summarization-based video caption via deep neural networks,"
in Proceedings of ACM Multimedia, 2015, pp. 1191–1194.

[10] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. Shen, "Video captioning with attention-based
LSTM and semantic consistency," IEEE Transactions on Multimedia, 2017, pp. 2045–2055.

[11] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation,"
in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies - Volume 1, 2011, pp. 190–200.

[12] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, "Dense-captioning events in
videos," in Proceedings of the IEEE International Conference on Computer Vision (ICCV),
2017, vol. 1, p. 6.

[13] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for
bridging video and language," in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2016, pp. 5288–5296.

[14] J. Park, C. Song, and J.-H. Han, "A study of evaluation metrics and datasets for video
captioning," in Proceedings of the IEEE International Conference on Intelligent Informatics
and Biomedical Sciences (ICIIBMS), 2017, pp. 172–175.
