INSTITUTE OF ENGINEERING
CENTRAL CAMPUS PULCHOWK
By:
Anish Bhusal / 072BCT505
Avishekh Shrestha / 072BCT507
Ramesh Pathak / 072BCT527
Saramsha Dotel / 072BCT534
Submitted To:
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL
16 December, 2018
ACKNOWLEDGEMENTS
We would like to express our sincerest gratitude to the Department of Electronics and
Computer Engineering for providing us students with the golden opportunity to take part in a
project of this caliber, which will allow us to explore new dimensions in academic research as
well as software product development. We are indebted to all our teachers and seniors who
guided us in selecting a project that best addresses the current trends and practices in Computer
Engineering. We would also like to acknowledge with humble gratitude all those who helped
us develop our idea well above the level of simplicity into something concrete.
We believe that throughout this project development, a plethora of knowledge and skills can
be amassed through the practical implementation of the theoretical concepts taught as part of
the course syllabus, and we appreciate the effort the department has put in to standardize it as
a course requirement.
Any suggestions for improvement and refinement on the project idea and implementation will
be highly appreciated.
ABSTRACT
This project deals with the automatic summarization of video content in natural language
sentences while maintaining semantic consistency across the different events in a video. With
the amount of video content increasing exponentially, the automatic captioning of video
events helps to filter videos that depict a particular action or genre, and provides an
opportunity for the visually impaired to explore the visual information videos have to impart.
Through this project, we propose to employ deep neural networks, such as Convolutional
Neural Networks and Recurrent Neural Networks, to extract the visual features of a video,
identify temporal segments of interest, and describe each in natural language sentences.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
1. INTRODUCTION
   1.1 Background
   1.3 Objectives
2. LITERATURE REVIEW
3. METHODOLOGY
   3.2 Description
4. PROJECT REQUIREMENTS
5. EXPECTED OUTCOME
6. PROJECT SCHEDULE
7. BUDGET ESTIMATE
8. REFERENCES
LIST OF FIGURES
1. INTRODUCTION
1.1 Background
In this information age, where a tremendous amount of visual data is generated every day,
automatic captioning of videos would greatly help users filter what interests them among the
sheer number of videos on YouTube. Video captioning refers to automatically describing
video content with natural language sentences. It is the next step after image captioning: a
sentence or a paragraph is generated to describe a video clip in a way that captures its visual
semantics.
In the past few years, there has been a major shift towards video content as the primary form
of media. Considering this momentum, the importance of video captioning has increased.
When it comes to providing captioning of media projects, most companies work with offsite
people, usually networks of freelancers (e.g. Amazon Mechanical Turkers) that are based
worldwide. However, delivering video information at such a scale is challenging for media
and entertainment companies: captions are costly to create, and the manual undertaking can
burden production teams. As we see an explosion in content produced across the globe,
providing a solution that is secure, flexible, and collaborative becomes increasingly
important.
Hence, the proposed system aims to solve the video captioning problem, in which a sentence
is generated to describe a video clip in a way that captures its visual semantics.
1.3 Objectives
The main objectives of Semantically Consistent Video Captioning are:
● To extract the visual features of a video using deep neural networks
● To identify temporal segments of interest within the video
● To describe each segment with semantically consistent natural language sentences
The proposed project sees massive scope in the modern day and age of extravagant digital
media consumption. The project is targeted to bridge the gap between the media content
(videos) and the media review or synopsis (text summary). With the development of an
automatic video captioning system that closely resembles human understanding of the video
content, it becomes possible to computationally generate the description of the events inside
the video without having to view it at all. This can prove particularly useful when videos
depicting certain events have to be identified across a massive corpus of video data: manually
browsing through each video is a cumbersome task, while retrieving a video by analyzing its
caption is computationally tractable. The project also has scope in online video streaming
services like Netflix and YouTube, which can leverage this technology to offer their users the
option to filter videos based on content. The advertisement sector can also benefit from this
project, as advertisements that match the content of specific segments of a video can be
inserted at those segments. Above all, this project has an important role to play in assisting
visually impaired people, giving them the opportunity to consume the huge amount of video
content available on the internet through textual summaries of the events occurring in each
video. Another important area where this project can be leveraged is the censorship of video
content: videos that violate community guidelines can be identified and blocked through the
system's visual understanding mechanism.
2. LITERATURE REVIEW
Image and video recognition is a fundamental and challenging problem in computer vision.
Rapid progress has been made in the past few years, especially in image feature learning, and
various pre-trained CNN models have been proposed. However, such image-based deep
features cannot be directly applied to process videos due to the lack of dynamic information.
D. Tran et al. [1] propose to learn spatio-temporal features using deep 3D CNN and show good
performance in various video analysis tasks. The recent emergence of LSTM [2] has made it
easier to model sequence data and learn patterns with a wider range of temporal dependencies.
Donahue et al. [3] integrate CNN and LSTM to learn spatio-temporal information from
videos: their model extracts 2D CNN features from video frames, which are then fed into an
LSTM network to encode the videos' temporal information.
Early attempts at video captioning by S. Venugopalan et al. [5] simply mean-pooled video
frame features and used a pipeline inspired by the success of image captioning, proposing an
end-to-end LSTM-based model for video-to-text generation. However, this approach works
only for short clips with a single major event. To translate video to language efficiently,
approaches should take both temporal and spatial information into account. Inspired by this,
an end-to-end sequence-to-sequence model [6] was proposed to generate captions for videos.
It incorporates a stacked LSTM that first reads the sequence of CNN outputs and then
generates a sequence of words.
Pan et al. [7] propose a novel approach, the Hierarchical Recurrent Neural Encoder (HRNE),
which exploits multiple timescale abstractions of the temporal information with a two-layer
LSTM network. Pan et al. [8] propose a framework that explores the learning of LSTMs and
aims to locally maximize the probability of the next word given the previous words and
visual content features. To obtain the most representative and high-quality description for a
video, Li et al. [9] propose a summarization-based video captioning method, which
constructs an adjacency graph over the generated candidate sentences and then uses this
graph to re-rank them. The most recent works have shifted focus to generating paragraph or
dense captioning for a video. The key framework proposed by Yu et al. [10] is hierarchical
RNN (h-RNN) for describing a long video with a paragraph consisting of multiple sentences.
This framework consists of two generators: sentence generator and paragraph generator.
Finally, we build upon the recent work on video captioning [10], which integrates an
attention mechanism with LSTM to capture the salient structure of a video and explores the
correlation between multi-modal representations to generate sentences with rich semantic
content. Experiments on benchmark datasets demonstrate that this method, using a single
feature, achieves competitive or even better results than the state-of-the-art baselines for
video captioning in both BLEU and METEOR. Inspired by this work, we aim to design a
semantically consistent video captioning model by incorporating temporal context.
3. METHODOLOGY
The process of video captioning involves both the visual understanding of scenes depicted in
the video frames as well as the ability to generate the best possible natural language description
for those scenes. Moreover, due to the temporal nature of the input in the case of videos (i.e.
a sequence of video frames rather than a single image), both spatial and temporal attention
mechanisms are required. A brief overview of the proposed steps to achieve the stated
objective is given below:
3.2.1 Input:
The input to our system is a sequence of video frames v = {v_t}, where t ∈ {0, ..., T − 1}
indexes the frames in temporal order.
To properly encode the salient features of a video, we need to represent both the spatial as well
as temporal characteristics of the video. This spatio-temporal feature can be extracted with the
help of 3D Convolutional Neural Networks (Conv3D or C3D). These are a variant of the
popular Convolutional Neural Network with a 3D input and 3D output. The technical
description of the neural network architecture is provided below:
Layers in CNN
a. Convolutional Layers
The convolutional layer is the core building block of a CNN. The layer's parameters consist of
a set of learnable filters (or kernels), which have a small receptive field, but extend through the
full depth of the input volume. During the forward pass, each filter is convolved across the
width and height of the input volume, computing the dot product between the entries of the
filter and the input and producing a 2-dimensional activation map of that filter. As a result,
the network learns filters that activate when they detect some specific type of feature at some
spatial position in the input. Stacking the activation maps of all filters along the depth
dimension forms the full output volume of the convolutional layer. Every entry in the output
volume can thus also be interpreted as the output of a neuron that looks at a small region of
the input and shares parameters with neurons in the same activation map.
Three hyper-parameters control the size of output volume of the convolutional layer:
● Depth
The depth of the output volume controls the number of neurons in a layer that connect to the
same region of the input volume. These neurons learn to activate for different features in the
input. For example, if the first convolutional layer takes the raw image as input, then different
neurons along the depth dimension may activate in the presence of various oriented edges, or
blobs of color.
● Stride
Stride corresponds to the number of pixels by which the filter shifts or slides over the input.
When the stride is 1, the filter shifts by 1 pixel; it is uncommon to use a stride greater than 2.
Strides larger than 1 reduce the spatial dimension of the output.
● Zero-padding
Zero-padding refers to the process of symmetrically adding zeros in the boundary of the input
matrix. It is a hyper-parameter that allows adjustment of spatial dimension of output. It is most
commonly used when the dimension of output of the convolutional layer is required to be
identical to the input.
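As a quick illustration of how these hyper-parameters interact, the output spatial size of a convolutional layer follows a standard formula. The helper below is an illustrative sketch (the function name is ours, not from any framework):

```python
# Output spatial size of a convolutional layer: O = (W - F + 2P)/S + 1,
# where W is the input size, F the filter size, P the zero-padding,
# and S the stride (integer division assumes the sizes divide evenly).
def conv_out_size(w, f, p, s):
    return (w - f + 2 * p) // s + 1

# 3x3 filter with padding 1 and stride 1 preserves the spatial size:
print(conv_out_size(224, 3, 1, 1))  # 224
# 5x5 filter with no padding shrinks a 32x32 input to 28x28:
print(conv_out_size(32, 5, 0, 1))   # 28
# Stride 2 roughly halves the spatial dimension:
print(conv_out_size(7, 3, 0, 2))    # 3
```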
b. Pooling layers
Another important concept of CNNs is pooling, which is a form of non-linear down-sampling.
There are several non-linear functions to implement pooling among which max pooling is the
most common. It partitions the input image into a set of non-overlapping rectangles and, for
each such sub-region, outputs the maximum. The intuition is that the exact location of a feature
is less important than its rough location relative to other features. The pooling layer serves to
progressively reduce the spatial size of the representation, to reduce the number of parameters
and amount of computation in the network, and hence to also control overfitting. It is common
to periodically insert a pooling layer between successive convolutional layers in a CNN
architecture. The pooling operation provides another form of translation invariance.
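To make the operation concrete, the sketch below applies 2×2 max pooling with stride 2 to a small feature map. It is a toy plain-Python illustration rather than the pooling routine of any particular framework:

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over a 2-D list of numbers."""
    h, w = len(x), len(x[0])
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

# Each non-overlapping 2x2 block is reduced to its maximum value,
# halving both spatial dimensions.
fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 8],
        [3, 1, 4, 7]]
print(max_pool_2x2(fmap))  # [[6, 2], [3, 9]]
```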
c. ReLU layer
ReLU is the abbreviation of Rectified Linear Units. This layer applies the non-saturating
activation function f(x)=max(0,x). It increases the nonlinear properties of the decision function
and of the overall network without affecting the receptive fields of the convolution layer. Other
functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent
f(x) = tanh(x) or f(x) = |tanh(x)|, and the sigmoid function f(x) = (1 + e^(-x))^(-1). ReLU is often preferred
to other functions because it trains the neural network several times faster without a significant
penalty to generalization accuracy.
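The activation functions above can be compared directly; this snippet is purely illustrative:

```python
import math

def relu(x):
    """Rectified Linear Unit: passes positives, zeroes out negatives."""
    return max(0.0, x)

def sigmoid(x):
    """Saturating logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

xs = [-2.0, -0.5, 0.0, 1.5]
print([relu(x) for x in xs])  # [0.0, 0.0, 0.0, 1.5]
# tanh and sigmoid saturate for large inputs while ReLU does not:
print(math.tanh(100.0), sigmoid(100.0), relu(100.0))
```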
Conv3D:
3D convolutions apply a 3-dimensional filter to the data, and the filter moves in three
directions (x, y, t) to compute low-level feature representations. The output is a
3-dimensional volume such as a cube or cuboid. 3D convolutions are helpful for event
detection in videos; they are not limited to 3D data and can also be applied to stacks of 2D
inputs such as images.
Figure. 3D Convolution Network
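The core operation can be sketched in a few lines of plain Python. This is a minimal single-channel, stride-1, valid-padding illustration only; frameworks such as PyTorch provide the full batched, multi-channel version as nn.Conv3d:

```python
# Single-channel 3-D convolution over a video volume indexed (t, y, x):
# a 3-D sliding dot product between the kernel and each sub-volume.
def conv3d(volume, kernel):
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                s = sum(volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                        for dt in range(kt)
                        for dy in range(kh)
                        for dx in range(kw))
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out

# A 3x3x3 volume of ones convolved with a 2x2x2 kernel of ones:
# every output entry sums 8 ones, and the output shape is 2x2x2.
video = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
print(conv3d(video, kernel))
```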
The features extracted from the video by the Convolutional Neural Network (C3D in our
case) will be fed at varying strides into an LSTM unit to generate a set of temporal events
marked by their start and end times. Each such temporal event also has a hidden
representation, which will then be fed into the captioning module for caption generation.
One captioning approach could be to treat each event description individually and use a
captioning LSTM network to describe each one. However, most events in a video are correlated
and to capture such correlations, we design our captioning module to incorporate the context
from its neighbouring events. Long Short-Term Memory (LSTM) is one of the popular
variants of the Recurrent Neural Network (RNN); with visual attention, it generates one word
at each time step conditioned on a context vector, the previous word, and the previously
generated words. The visual attention mechanism allows the model to focus on the important
parts of a video frame over others.
LSTM:
Figure. An LSTM Cell
The figure shows the LSTM implementation which we shall be using in our model. The square
dots in the LSTM cell imply projections with learnt weight vectors. The LSTM cell takes input
through the input gate and uses that input to modulate the memory. The forget gate erases the
memory cell and the output gate decides how this memory should be emitted. The contribution
made by each gate depends on the learnt weights. The LSTM can be represented using a simple
affine transformation with learned parameters as,
i_t = σ(W_i E y_{t-1} + U_i h_{t-1} + Z_i z_t + b_i)
f_t = σ(W_f E y_{t-1} + U_f h_{t-1} + Z_f z_t + b_f)
o_t = σ(W_o E y_{t-1} + U_o h_{t-1} + Z_o z_t + b_o)
g_t = tanh(W_g E y_{t-1} + U_g h_{t-1} + Z_g z_t + b_g)
c_t = f_t ∘ c_{t-1} + i_t ∘ g_t
h_t = o_t ∘ tanh(c_t)
Here, i_t, f_t, c_t, o_t, and h_t are the input gate, forget gate, memory cell, output gate, and
hidden state of the LSTM, respectively, and g_t is the candidate memory. The vector z_t is
the context vector capturing the visual information associated with a particular input
location, y_{t-1} is the previously generated word, and E is the embedding matrix. Let m and
n denote the dimensionality of the embedding and the LSTM, respectively. σ denotes the
logistic sigmoid function and ∘ denotes element-wise multiplication.
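To illustrate how the gates interact, the following scalar sketch implements one LSTM step. Real implementations use weight matrices over vectors (and, in our setting, the context vector z_t as an additional input); all names and weight values here are hypothetical, chosen only to keep the gate arithmetic visible:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step; w maps each gate name to (w_x, w_h, b)."""
    i = sigmoid(w['i'][0] * x + w['i'][1] * h_prev + w['i'][2])    # input gate
    f = sigmoid(w['f'][0] * x + w['f'][1] * h_prev + w['f'][2])    # forget gate
    o = sigmoid(w['o'][0] * x + w['o'][1] * h_prev + w['o'][2])    # output gate
    g = math.tanh(w['g'][0] * x + w['g'][1] * h_prev + w['g'][2])  # candidate memory
    c = f * c_prev + i * g        # forget gate erases, input gate writes
    h = o * math.tanh(c)          # output gate decides what is emitted
    return h, c

# Toy weights: every gate uses (w_x, w_h, b) = (0.5, 0.5, 0.0).
w = {k: (0.5, 0.5, 0.0) for k in 'ifog'}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=w)
print(round(h, 3), round(c, 3))
```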
The context vector z_t represents the relevant part of the visual input at time t. It can be
computed with an attention model from the annotation vectors a_i, i = 1, ..., L,
corresponding to the features extracted from the CNN. We shall be using the soft attention
model [13], which assigns a relative importance to each location i when blending the a_i
together. A weight α_ti is defined for each element of the annotation set using the attention
model f_att, for which we use a multilayer perceptron conditioned on the previous hidden
state h_{t-1}.
e_ti = f_att(a_i, h_{t-1})
α_ti = exp(e_ti) / Σ_{k=1}^{L} exp(e_tk)
Once the weights are computed, the context vector z_t is computed as,
z_t = φ({a_i}, {α_ti})
The φ function returns a single vector given the set of annotation vectors and their
corresponding weights. Under soft attention, φ takes the expectation z_t = Σ_{i=1}^{L} α_ti a_i.
This corresponds to feeding a softly weighted context into the LSTM, so the LSTM gives
priority to the locations with the highest attention weights.
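The attention computation above amounts to a softmax over relevance scores followed by a weighted average of the annotation vectors. A minimal sketch, with made-up scores and annotations:

```python
import math

def soft_attention(scores, annotations):
    """Softmax the relevance scores e_ti, then blend the annotation
    vectors a_i into a single context vector z_t (soft-attention phi:
    an expectation over the annotations)."""
    m = max(scores)                      # subtract max for numerical stability
    exp_e = [math.exp(e - m) for e in scores]
    total = sum(exp_e)
    alphas = [e / total for e in exp_e]  # attention weights, sum to 1
    dim = len(annotations[0])
    z = [sum(a * ann[d] for a, ann in zip(alphas, annotations))
         for d in range(dim)]
    return alphas, z

# Three annotation vectors; the second gets the highest relevance score,
# so it dominates the blended context vector.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, z = soft_attention([0.1, 2.0, 0.1], annotations)
print([round(a, 3) for a in alphas])
print([round(v, 3) for v in z])
```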
3.2.5 Output:
Our output is a set of sentences s_i ∈ S, where s_i = (t_start, t_end, {v_j}) consists of the
start and end times of the sentence together with its words v_j ∈ V, where sentences have
differing lengths and V is our vocabulary set.
The model shall be trained using stochastic gradient descent with adaptive learning-rate
algorithms on machines equipped with GPUs and high computational power.
Many large-scale benchmark datasets are available for training the model on the video
captioning task. Early datasets were analogous to image captioning datasets and failed to
exploit the temporal structure of video, whereas recent datasets provide multiple semantically
coherent sentences per video:
MSVD [11], also referred to as the YouTube Dataset in early works, is one of the earliest
open-world datasets. It is a collection of YouTube clips collected on Amazon Mechanical
Turk by requesting workers to pick short clips depicting a single activity. As a result, each
clip lasts between 10 and 25 seconds, with fairly constant semantics and little temporal
structure complexity. It has 1,970 video clips in total and covers a wide range of topics such
as sports, animals, and music. Each clip comes with multiple parallel and independent
sentences labelled by different Amazon Mechanical Turkers in a number of languages. For
English specifically, it has roughly 40 parallel sentences per video, resulting in a total of 80k
clip-description pairs. It has a vocabulary of 16k unique words, and each sentence contains
8 words on average.
MSR-VTT [13] is a recently released large-scale video captioning benchmark and is by far the largest
video captioning dataset in terms of the number of sentences and the size of the
vocabulary. It contains 10k video clips crawled from a video search engine from 20
most representative categories of video search, including news, sports etc. The duration
of each clip is between 10 and 30 seconds, while the total duration is 41.2 hours. Each
video clip is annotated with 20 parallel and independent sentences by multiple Amazon
Mechanical Turkers, which provide a good coverage of the semantics of a video clip.
There are in total 200K clip-sentence pairs with a vocabulary of 29,316 unique words.
Video captioning results are evaluated based on their correctness as natural language and the
relevance of their semantics to the respective video. The following widely used evaluation
metrics address these aspects:
1. BLEU [14]
It is one of the most popular metrics in the field of machine translation. The idea is to
measure the numerical translation closeness between two sentences by computing the
geometric mean of n-gram match counts. As a result, it is sensitive to position
mismatches of words. It may also favor shorter sentences, which makes it hard to
adapt to complex content.
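A simplified sentence-level BLEU can be sketched as follows. Official evaluations use corpus-level BLEU with multiple references and smoothing, so this toy single-reference version is for intuition only:

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # no n-gram overlap at this order
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("a man is playing a guitar", "a man is playing a guitar"))  # 1.0
print(bleu("a man plays guitar", "a man is playing a guitar"))
```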
2. METEOR [14]
It is computed based on the alignment between a given hypothesis sentence and a set
of candidate references. METEOR compares exact token matches, stemmed tokens,
and paraphrase matches, as well as semantically similar matches using WordNet
synonyms. This semantic aspect distinguishes METEOR from the other metrics. It
has been shown in the literature that METEOR consistently outperforms BLEU and
ROUGE, and outperforms CIDEr when the number of references is small.
4. PROJECT REQUIREMENTS
a. PyTorch Library
b. PyCharm IDE
5. EXPECTED OUTCOME
The video captioning model is expected to generate grammatically and semantically
consistent sentences describing a given video. The system is expected to temporally localize
events in a video and provide textual descriptions of them, thereby replicating the results of
state-of-the-art systems on short video clips with a single major event and achieving
acceptable metric scores on longer video clips with multiple correlated events. The following
are results obtained from preliminary research; our system is expected to generate similar
descriptions.
Figure. Results obtained from the current state-of-the-art video captioning system
6. PROJECT SCHEDULE
Table. Project Schedule
7. BUDGET ESTIMATE
Table. Expected Budget Estimate
8. REFERENCES
[1] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal
features with 3D convolutional networks," in ICCV, 2015, pp. 4489–4497.
[2] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation,
vol. 9, no. 8, 1997, pp. 1735–1780.
[3] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko,
and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and
description," in CVPR, 2015, pp. 2625–2634.
[5] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko,
"Translating videos to natural language using deep recurrent neural networks," in
NAACL-HLT, 2015, pp. 1494–1504.
[6] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko,
"Sequence to sequence – video to text," in ICCV, 2015, pp. 4534–4542.
[7] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for
video representation with application to captioning,” in CVPR, 2016, pp. 1029–1038.
[8] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to
bridge video and language,” in CVPR, 2016, pp. 4594– 4602
[9] G. Li, S. Ma, and Y. Han, “Summarization-based video caption via deep neural networks,”
in ACM Multimedia, 2015, pp. 1191–1194.
[10] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. Shen, “Video Captioning with Attention-based
LSTM and Semantic Consistency”, IEEE Transactions on Multimedia, 2017, pp. 2045-2055
[11] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase
evaluation," in Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies – Volume 1, 2011, pp. 190–200.
[12] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles., “Dense-captioning events in
videos.”, In Proceedings of the IEEE International Conference on Computer Vision, 2017,
volume 1, page 6.
[13] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for
bridging video and language," in IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016, pp. 5288–5296.
[14] J. Park, C. Song, and J.-h. Han., “A study of evaluation metrics and datasets for video
captioning.”, IEEE International Conference on Intelligent Informatics and Biomedical
Sciences (ICIIBMS), 2017 , pp. 172–175.