
A Machine-Learning Pipeline for Semantic-Aware and Contexts-Rich Video Description Method


Y. Aun
Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia
aunyc@utar.edu.my

J. Y. M. Khaw
Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia
khawym@utar.edu.my

M. L. Gan*
Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia
ganml@utar.edu.my

L. T. Tin
Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia
avocafe1989@gmail.com

ABSTRACT
Video description (VD) methods use machine learning to automatically generate sentences that describe video content. Global-description-based VD (gVD) methods generate a global description that provides the big picture of a video scene, but they lack finer-grained entity information. Meanwhile, modern entity-based VD (eVD) methods use deep learning models, such as an object model (YOLOv3), a human activity model (CNN) and location tracking (DeepSORT), to resolve the individual entities that make up complete sentences. However, existing eVD methods are limited in the types of entities they support, so they generate sentences that are context-deprived and too incomplete to clearly describe video scenes. In addition, the entities resolved by eVD are isolated, since they are inferred from different ML models, resulting in sentences that are not semantically, contextually or grammatically cohesive. In this paper, a two-stage ML pipeline (teVD) is proposed for holistic and semantic-aware VD sentence generation. First, an ML pipeline is designed to aggregate several high-performing ML models for resolving fine-grained entities, improving the accuracy of the resolved entities. Second, the components in the entity set are 'stitched' together using an entity trimming method to (1) remove shadow entities and (2) re-arrange entities based on linguistic rules, generating video descriptions that are context-aware and less ambiguous. The experimental results show that teVD improves the quality of the sentences generated for short videos, achieving a BLEU score of 48.01 and a METEOR score of 32.80 on the MSVD dataset.

KEYWORDS
Semantic, NLP, auto captioning, CNN, activity recognition

ACM Reference Format:
Y. Aun, J. Y. M. Khaw, M. L. Gan*, and L. T. Tin. 2022. A Machine-Learning Pipeline for Semantic-Aware and Contexts-Rich Video Description Method. In 2022 4th International Conference on Pattern Recognition and Intelligent Systems (PRIS 2022), July 29–31, 2022, Wuhan, China. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3549179.3549182

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
PRIS 2022, July 29–31, 2022, Wuhan, China
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9608-0/22/07. . . $15.00
https://doi.org/10.1145/3549179.3549182

1 INTRODUCTION
Video captioning or video description (VD) is the automated process of generating textual/natural-language sentences that describe the content of a video. VD is often the composite product of multiple machine learning models that are used to resolve the entities in the generated sentences. Recently, VD has been gaining momentum thanks to the proliferation of deep learning techniques and models that are more accurate, robust and diverse (covering more classes). For example, objects in short videos can be accurately classified using YOLOv3, and actions using activity models trained with ResNet, VGGNet, Inception or GoogleNet. VD takes video inputs and passes them through a series of ML models that make predictions and resolve the entities needed to form informative sentences describing the video content. The outcome of VD is well received in niche domains such as decoding surveillance videos, generating more relatable YouTube descriptions that can attract views, and helping under-privileged communities to 'see' videos.

Generally, there are two types of VD: global-description based (gVD) and entity-based (eVD). gVD is useful for describing a video with multiple subjects, actions and scenes collectively, giving a high-level overview of the events. However, because of its simplicity, gVD skims over finer-grained details and deprives the sentence of second-level context information. 'Second-level' information here refers to micro-level information, for example 'Alex' in the 'User' class, or 'Sunny' in the 'Weather' class. gVD generalises video description well and can 'caption' a wider range of scenes, but the generated sentences can be overly vague and generic. Consider the following two sentences: "A man is walking a dog in the park" and "A young man (named Bob) wearing a blue dotted t-shirt is walking a dog in the park happily". The first sentence is what most current video description methods generate; it is not wrong, but it is not informative enough. The second sentence is the ideal sentence, generated with rich semantic context. At the other end of the spectrum, eVD is designed to resolve more contexts to generate rich sentences. eVD leverages modern ML architectures and uses the predictions of ML models to 'fill up the slots' in a sentence schema. eVD is useful for painting a scene-specific description, but its accuracy is determined by the choice of ML models used. Moreover, combining predictions from several ML models means that the entities are resolved in isolation, resulting in sentences that lack coherence and are disjointed.


In this paper, we propose a semantic-aware video description method that keeps the context richness of eVD while improving the coherency among entities to form grammatically correct sentences. An ML pipeline is designed to aggregate state-of-the-art pretrained ML models to resolve entities in short videos. This project is motivated by the flexibility of pipelining the models: a component can always be replaced with a better one, and more components can be added. This flexibility allows semantically richer sentences to be crafted easily instead of training a special-purpose deep neural network for video captioning. We add an 'entity trimming' method to better model human-human, human-object and object-object interactions/relationships and to improve the clarity and structure of the generated sentences. The outcome of this work is a video captioning method that resolves not only more contexts, but more useful contexts, from short videos for semantic-aware sentence generation. From the experimental results, the proposed work achieved a BLEU score of 48.01 and a METEOR score of 32.80 on the MSVD benchmarking dataset.

2 RELATED WORKS
The components of a video, in all their complexity, can be described modularly based on individual elements. Video description (VD) uses applied ML to automatically describe events based on the video content. Two types of ML models are the most widely used: object detection and recognition, and Natural Language Processing (NLP). Object detection and recognition has been a popular field for quite some time. Object recognition is the task of recognising/identifying objects in digital images. There is a difference between object detection and object recognition: object detection locates an object and encloses it with a bounding box, whereas object recognition classifies the detected object. Owing to the popularity of deep learning, there are currently several popular models for object detection and recognition, for example the Region-Based Convolutional Neural Network (R-CNN), You Only Look Once (YOLO) and the Single Shot Detector (SSD). In this paper, YOLOv3 is used for object detection and recognition [1]. Meanwhile, Natural Language Processing is the task of enabling computers to understand human language. NLP has a wide array of applications, such as language translation (e.g. Google Translate) and personal assistants (e.g. Siri, Cortana). In the video description field, NLP technology helps to formulate the generated sentence so that it is grammatically correct and makes sense. NLP is also used in entity trimming to generate linguistically correct sentences.

2.1 Human Focused Video Description
In [2], high-level features (HLFs) in video frames are first identified. Features such as objects, movements (e.g. action, emotion) and properties (e.g. gender) are defined as HLFs where the main subject is a human. According to the authors' results on the 2007 and 2008 TREC video summarisation task dataset, the following categories are identified as human-related information: actions, emotions, age, gender, body parts, grouping, identity and dressing. Among these features, dressing, emotions (e.g. happy, sad, fearful, surprised, angry, disgusted) and gender are the most noticeable by humans. Non-HLFs such as organic/inorganic objects, landmarks, colours and environment were also considered; of these, location was highlighted as important. Template-based sentence generation (SVO triplets) with simple NLG was implemented in [3]. The HLFs are extracted from the video frame, and a part-of-speech (POS) tag is created and associated with each HLF; singular, plural, present tense, NN (noun), NNS (plural noun), VB (verb) and VBZ (verb) are among the associated tags, produced with the POS tagger from the NLTK toolkit. The HLF information is extracted frame-wise, with samples spanning up to eight frames, and a Hidden Markov Model (HMM) is used to train the activity recognition model. This work provides insight into the semantics required to understand and form an informative sentence, and the recognised sentences can span a few events in a single video. However, the approach is event-localised and does not form any correlation with surrounding events; there is also no context understanding between events.

2.2 Sequence To Sequence – Video To Text
In [4], the authors use a deep learning network for video description. The work focuses on open-domain video description, which is sensitive to temporal structure. Specifically, a Long Short-Term Memory (LSTM) network, a variant of the Recurrent Neural Network (RNN), is used. The diverse set of objects, actions, scenes and attributes makes it hard to generate descriptions for open-domain videos; therefore, video clips paired with sentence descriptions from large datasets are fed into the network to train the model. The authors state that LSTMs were chosen because of their great success on tasks such as speech recognition and machine translation, which have a sequence-to-sequence nature; video and language share this property, so they suit LSTM training well. Raw frames in RGB and optical-flow format are fed into CNN networks, and the resulting features act as the input to the first LSTM layer. The stacked LSTM in the paper consists of two layers: the first handles the visual feature inputs, and the second handles the language model (with given text input) together with the hidden representation of the video sequence from the first layer (see Figure 1). Each LSTM consists of 1000 hidden units. Raw frames fed into the CNN generate a fixed-length vector. Optical-flow images are also used, because flow CNN models have proved to give good results for activity recognition. The feature vectors are then fed into the decoder to generate a sequence of words. According to the authors, although an RNN could be used to decode the vector, the long-term dependencies of a video sequence may lead to inferior results; therefore, LSTMs were deployed.

Figure 1: Overall system design view of the S2VT implementation.

2.3 Dense-Captioning Events In Videos
The work in [5] performs video description using dense video captioning. The proposed method is claimed to be able to identify all the events in a video in a single pass; therefore, multiple sentence descriptions are generated, describing different events happening at different time spans in the video. In addition, the proposed model uses context from the surrounding events to caption each event: it uses context from past and future sequences to generate the caption for an event.


The model can also operate on videos as long as 10 minutes. To describe multiple events in a long video, action localisation is needed; the model uses an action proposal and social human tracking method to achieve this. The input video frames are U = {u_t}, t ∈ {0, ..., T − 1}, representing the frames in temporal order. The output of the model is a set of sentences s_i = (t_start, t_end, {v_j}), where t_start indicates the sentence start time, t_end the end time, and v_j ∈ V is a set of words of varying length drawn from a vocabulary V. The pipeline starts by extracting C3D frame features and feeding them through the action proposal module, which generates a set of proposals P = {(t_i^start, t_i^end, score_i, h_i)}, where score_i is the proposal score. Proposals scoring higher than a set threshold have their hidden representation h_i used as input to the LSTM captioning module. To capture video context from surrounding events, the authors group all events into two categories relative to the current event: events that have already occurred (past) and events that are yet to occur (future). For a video event with hidden representation h_i and start and end times [t_i^start, t_i^end] exported from the proposal module, the past and future representations are calculated as

h_i^{past} = \frac{1}{Z^{past}} \sum_{j \neq i} \mathbb{1}[t_j^{end} < t_i^{end}] \, a_{ij} h_j    (1)

h_i^{future} = \frac{1}{Z^{future}} \sum_{j \neq i} \mathbb{1}[t_j^{end} \geq t_i^{end}] \, a_{ij} h_j    (2)

where h_j is the hidden representation of another proposed event, a_{ij} is the attention weight that determines how relevant event j is to event i, and Z is a normalisation term. a_{ij} is calculated as

w_i = w_a h_i + b_a, \qquad a_{ij} = w_i h_j    (3)

where w_i is the annotation vector, computed from the learnt weights w_a and bias b_a. Lastly, the hidden representations are concatenated as (h_i^{past}, h_i, h_i^{future}) and fed into the LSTM to generate the sentences.

2.4 Automated Textual Descriptions For A Wide Range Of Video Events With 48 Human Actions
In [6], the authors demonstrate textual video description driven by human actions using a hybrid method. The method consists of a description generator and an action classifier: the description generator uses a rule-based method that infers the entities involved in an action, while a bag-of-features classifier deduces the actual action on display. The core elements of the proposed method are the fusion engine, visual processing, action recognition, the description generator and the event descriptor. Initially, a moving object (e.g. human, animal, non-organic matter) is classified by visual processing; human classifications carry additional feature descriptions such as body parts and pose. Next, the tracked objects are fused into entities by the fusion engine. Low-level object features are then extracted by the event descriptor, and descriptions based on an entity, a relationship or the overall picture are produced through the abstract descriptor. The action classifier can deduce up to 48 human actions. Finally, a hypothesis is derived by the description generator from the data obtained by the event descriptor, and textual descriptions are produced according to the actions, objects and entities (see Figure 2). This method, however, is self-limiting, as it can only describe scenes with short, simple verbs and without semantic comprehension.

3 MACHINE LEARNING PIPELINE FOR SEMANTIC-AWARE VIDEO DESCRIPTION
We designed an ML pipeline that aggregates three ML models for entity resolution. The predictions from the ML models are fed into a sentence generator to build a semantic-aware sentence describing the video events in fine detail. The components of the pipeline are shown in Figure 3.
Firstly, the video frames are extracted and passed into the different components of the pipeline for context extraction. The object detector, based on the You-Only-Look-Once v3 (YOLOv3) model, detects the positions of humans and objects in each frame. At the same time, video frames sampled at 10-frame intervals are passed into the scene recognition module to recognise the scene in the frame, and an action recognition model is used to recognise the overall (globally active) event. Each followed entity (mainly detected humans) is then further processed through several feature extraction steps: face detection, face/identity recognition, gender detection, age detection, emotion recognition, clothing colour and pattern detection, crowd counting, and localised action recognition (details discussed below). After that, all the captured features are aggregated and filtered, and a preliminary sentence is generated in a fill-in-the-blanks manner using a predefined template. The sentence is then passed to the next stage, where a language masking model performs entity trimming over all the objects detected in the scene. Lastly, before taking its final shape, the sentence is passed into a grammar checking model to correct any grammatical errors, and the result is output and displayed.
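To make the flow in Figure 3 concrete, the sketch below chains the stages as plain Python functions. All function and variable names are ours — simple stubs standing in for the actual YOLOv3, scene, action, feature-extraction, language-masking and grammar-checking components — so this is a minimal illustration of the control flow under those assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Stub predictors: each stands in for one trained model in the pipeline
# (YOLOv3 detector, Places365 scene model, I3D action model, etc.).
def detect_objects(frame) -> List[Dict]:
    return [{"label": "person", "bbox": (0, 0, 10, 10), "conf": 0.9}]

def recognise_scene(frame) -> str:
    return "park"

def recognise_action(frames) -> str:
    return "walking"

def extract_human_features(frame, det) -> Dict:
    return {"identity": "unknown", "emotion": "happy", "apparel": "blue t-shirt"}

def fill_template(ctx: Dict) -> str:
    # Stage 1: fill a predefined template from the aggregated context.
    subject = ctx["humans"][0] if ctx["humans"] else {"emotion": "", "apparel": "clothes"}
    return (f"A {subject.get('emotion', '')} person wearing "
            f"{subject.get('apparel', 'clothes')} is {ctx['action']} in the {ctx['scene']}.")

def trim_entities(sentence: str, objects: List[str]) -> str:
    return sentence  # placeholder for the language-masking (entity trimming) step

def correct_grammar(sentence: str) -> str:
    return sentence  # placeholder for the grammar-checking step

def describe_video(frames) -> str:
    """Chain the pipeline stages over a list of decoded frames."""
    detections, humans = [], []
    for frame in frames:
        for det in detect_objects(frame):
            detections.append(det)
            if det["label"] == "person":
                humans.append(extract_human_features(frame, det))
    context = {
        "scene": recognise_scene(frames[len(frames) // 2]),   # sampled frame
        "action": recognise_action(frames),
        "objects": [d["label"] for d in detections],
        "humans": humans,
    }
    sentence = fill_template(context)                          # stage 1
    sentence = trim_entities(sentence, context["objects"])     # stage 2
    return correct_grammar(sentence)

print(describe_video(frames=[object()] * 30))
```

Swapping any stub for a real model keeps the rest of the chain unchanged, which is the flexibility the pipeline design is aiming for.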


Figure 2: The processing pipeline of the system (with 48 Human Actions).

Figure 3: Proposed system design.

4 SUPPORTED ENTITIES
The proposed pipeline attempts to resolve object types, humans (identity), scenes and activities, which form the core entities in the generated sentences. This section discusses the individual ML models used to make these predictions.

4.1 Videos Frames
Video frames are captured using the OpenCV function cv2.VideoCapture(video), where video is the file path to the video file. The frames are sampled at their original frame rate and resolution; frames can be sampled at a lower rate for faster performance. Each input frame consists of three channels: Red, Green and Blue (RGB). Frames are extracted into a temporary folder as individual Joint Photographic Experts Group (JPEG) images for easier processing in later stages.

4.2 Object And Human Detection
In this module, the You-Only-Look-Once version 3 (YOLOv3) model is used to perform detection [1]. YOLO uses a CNN (specifically Darknet-53) as its deep neural network for object detection and achieves good accuracy in detecting and classifying objects; as stated by its authors, YOLOv3 has accuracy on par with the Single-Shot Detector (SSD) but is three times faster. For each frame, the output is the bounding box coordinates of each detected item, the item label and the corresponding confidence level. The confidence threshold for detection is set at 0.4 to avoid too many unnecessary detections. As this project is human-oriented, the details of each detected human are saved separately for tracking and other purposes (discussed below). The details are output in the format (frame, id, bbox_x, bbox_y, width, height, conf, x, y, z), where frame indicates the frame in which the human was detected; id indicates the tracking id (-1 is used as padding since tracking has not yet been initiated); bbox_x and bbox_y indicate the top-left corner coordinates of the bounding box; width and height represent the width and height of the bounding box; conf indicates the confidence of the detected human; and x, y, z are used for 3D tracking (also padded with -1 in this case). The number of humans detected in each frame is also recorded and reduced with a simple max-occurrence count to get a rough figure (simple crowd counting) of how many humans are in the scene.

4.3 Scene Recognition
For scene recognition, Places365-CNN is used as the model to predict scene classes [7]. The model is trained on 10 million scene photographs, with almost 5000 images for each of its 400+ categories; a modified CNN (Places-CNN) was used in the end-to-end training process. After detection, information such as the environment type (indoor/outdoor), the scene category, the scene attributes (the attributes of the scene that led to the classification) and the activation map can be acquired. To keep performance reasonable, frames are sampled at 10-frame intervals and each detection is stored in an array; the result with the maximum occurrence is then used as the final result. As output, the environment type (indoor/outdoor) and the scene category are acquired.
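As a concrete illustration of this sampling-and-voting scheme, the sketch below decodes a video with OpenCV, keeps every 10th frame, and reduces the per-frame predictions to a single label by max occurrence. The scene classifier is stubbed out (classify_scene is our placeholder, not the Places365 API), so only the sampling and aggregation logic is shown.

```python
from collections import Counter
import cv2  # OpenCV, as used in Section 4.1

def classify_scene(frame) -> str:
    """Placeholder for a Places365-style classifier; returns a scene label."""
    return "park"

def sample_frames(video_path: str, interval: int = 10):
    """Yield every `interval`-th frame of the video as a BGR array."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            yield frame
        index += 1
    cap.release()

def recognise_scene(video_path: str) -> str:
    """Classify the sampled frames and keep the label with the max occurrence."""
    votes = Counter(classify_scene(f) for f in sample_frames(video_path))
    if not votes:
        return "unknown"
    label, _count = votes.most_common(1)[0]
    return label

# Example: print(recognise_scene("demo.mp4"))
```

The same max-occurrence reduction is reused elsewhere in the pipeline (crowd counting, emotion, apparel colour), so in practice it can be factored into a shared helper.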


4.4 Action Recognition
A series of video frames is processed into RGB frames and optical-flow frames (by applying the TV-L1 optical flow algorithm). Most videos acquired nowadays have a 16:9 aspect ratio rather than the older 4:3, but the action recognition model requires input frames of 224x224 pixels (a 1:1 ratio). Therefore, the video frames are resized and centre-cropped with the original ratio maintained. It can be safely assumed that most videos keep their most important information at the centre of the scene, so there is a high chance that no important information is lost by cropping the frames. Since most videos have a 16:9 ratio, the height is fixed to 224 pixels and the width is adjusted accordingly; the frame is then cropped to 224x224 pixels from the centre. Cropped frames are pre-processed before being used for action recognition. For the RGB stream, pixel values are normalised between -1 and 1 for all three RGB channels. Each frame is appended to an array whose final size is (batch_size, frame_count, frame_width, frame_height, channels), where frame_count is the total number of frames extracted from the video, frame_width and frame_height are the frame size (224x224 in this case), and channels is the number of frame channels (3 for RGB). For the optical-flow stream, the frames are first converted to grayscale and the optical flow is calculated using the TV-L1 algorithm. The pixel values are truncated between -20 and 20 to eliminate large movements and then normalised between -1 and 1 for both channels; the resulting array has the same shape as the RGB array except that it has only 2 channels. Both arrays are then fed into the Two-Stream Inflated 3D ConvNet (I3D) model [8] for action recognition. As output, the detected action classes are produced along with their confidence levels; the action with the highest confidence is chosen as the occurring event, and this information is passed on to the next stage.

4.5 Human And Object Tracking
In this module, the DeepSORT model [9] is used to track human locations. The model can track each individual even when occlusions happen in between. It improves on the Simple Online and Realtime Tracking (SORT) model and uses the detection information obtained earlier by the YOLOv3 model. As a result, all humans are tracked together with their respective bounding box coordinates. The results are output in the format (frame, id, bbox_x, bbox_y, width, height, conf, x, y, z), where frame indicates the frame in which the human was detected; id indicates the tracking id of the human (each individual is associated with an ID); bbox_x and bbox_y indicate the top-left corner coordinates of the bounding box; width and height represent the width and height of the bounding box; conf indicates the confidence of the detected human; and x, y, z are used for 3D tracking (padded with -1 in this case). For convenience of processing in later stages, the detection info is saved in the NumPy array file format (.npy). Currently, DeepSORT is only trained to track humans' positions; if other objects need to be tracked, the model has to be retrained with those objects' features.

5 FEATURE EXTRACTION
After acquiring each human tracking bounding box, each unique entity can be segmented/cropped out to go through the feature extraction process. In this paper, we demonstrate the use of applied ML on face features and clothing.

5.1 Face Feature Extractions
Feature extraction methods such as face recognition (identity recognition) [10], gender and age detection [11] and emotion detection are used to achieve entity smoothing. All the mentioned methods use dlib (King, 2019) as the underlying library/toolkit; the module uses a pre-compiled library by [10] for ease of development. For face recognition, previously acquired/saved face encodings are compared against unknown faces. If known faces are detected (confidence threshold: 0.4, based on face distance, where a lower distance means higher confidence), the detected label with the highest confidence level is used to resolve the entity's identity. If more than one identity is detected along the sequence, the label with the highest frequency is selected; this caters for issues such as mis-identification caused by viewing angle or image resolution. If no known face is detected, the gender and age of the person are estimated to resolve the entity into a more meaningful context. All detected faces also go through an emotion recognition pipeline to recognise each individual's facial expression on a per-frame basis; the available emotion classes are [angry, happy, sad, neutral, disgust, fear, surprise], and all detections are recorded and reduced by max occurrence. To improve the detection success rate of the features mentioned above, face detection is employed to find the exact locations and bounding boxes of the faces, and the detected faces are cropped out according to those coordinates for feature extraction. If no face is detected, the pipeline carries on without performing the feature extraction described above.
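A minimal sketch of the identity-resolution step described above, using the face_recognition package from [10]. The 0.4 distance threshold and the majority vote over frames follow the text; the known-encodings dictionary, file paths and helper names are our own illustrative assumptions rather than the authors' code.

```python
from collections import Counter
import numpy as np
import face_recognition  # pre-compiled dlib-based library [10]

def resolve_identity(frame_paths, known_encodings, threshold=0.4):
    """Vote an identity over several cropped face frames.

    known_encodings: dict mapping a person's name to one saved 128-d face encoding.
    Returns the majority-vote name, or None if no known face passes the threshold.
    """
    names = list(known_encodings.keys())
    if not names:
        return None
    stored = np.array([known_encodings[n] for n in names])
    votes = Counter()
    for path in frame_paths:
        image = face_recognition.load_image_file(path)
        for encoding in face_recognition.face_encodings(image):
            distances = face_recognition.face_distance(stored, encoding)
            best = int(np.argmin(distances))
            if distances[best] <= threshold:   # lower distance = higher confidence
                votes[names[best]] += 1
    if not votes:
        return None  # fall back to age/gender estimation in the main pipeline
    return votes.most_common(1)[0][0]
```

The same frame-wise voting pattern is applied to emotion recognition, with the emotion classifier simply replacing the distance check.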


5.2 Clothes / apparel recognition
This module can be separated into two parts: clothes/apparel detection and clothes/apparel attribute recognition. For the first part, a custom YOLOv3 detector trained by [13] on the ModaNet dataset [12] is used. ModaNet contains around 55,000 street-fashion images with polygon annotations covering 13 meta fashion categories/classes. Given the images of detected humans, the custom YOLOv3 detector detects the category of apparel worn by each human; the bounding box coordinates and labels are output and saved in an array. In the meantime, each detected piece of apparel is cropped according to its bounding box and saved as a JPEG image for further processing. In the second part, the cropped apparel frames are input into a clothes attribute detector [14] to generate fashion-attribute detections. The fashion attribute detector is trained on the DeepFashion dataset [15], which contains over 800,000 diverse fashion images; the detector is trained on a subset of 300,000 clothes images with 1000 clothing attributes. Each detected attribute is associated with a clothes category in a Python dictionary. Besides that, the clothes' base colour is also computed: the cropped image goes through K-Means clustering (grouping the pixels of the frame into three clusters) to get the major colour of the frame (in RGB format). The RGB colour is then converted into the Lab (CIELAB) colour space for easier comparison, and the colour distance is computed using delta-E (specifically deltaE_cie76) against a predefined colour palette containing 22 distinct colours. The colour with the shortest delta-E distance is chosen as the base colour of the detected apparel. As before, each detected colour is saved in the same Python dictionary structure. With all the detected clothes categories, attributes and colours saved in a dictionary array, each is reduced with the max-occurrence method, and finally a clean list of clothes categories with their attributes and colours is output (in the format {clothes_category, attribute, colour}).

5.3 Sentence generation, entity trimming and grammar correction
The generated features/semantic context are then fused and filtered to generate dictionaries describing the global scene as well as localised human attributes. All the detected and captured context is arranged in a fill-in-the-blanks manner based on a pre-defined template to generate a preliminary text. Although the generated sentence contains the basic and important context from the detections, it is still far from perfect. A language masking model is therefore used to perform entity trimming on the detected objects/entities. The model used is based on a deep learning model, Bidirectional Encoder Representations from Transformers (BERT) [16], packaged into a library called FitBERT [17] for ease of use. BERT is a state-of-the-art model widely used in NLP applications for tasks such as next-sentence prediction and masked language modelling (predicting a masked token). In this project, the language masking model is used to predict the best and most suitable entity to insert at a given point in the sentence. Most of the crucial information, such as subject, action and scene, is provided in the masked sentence to aid entity prediction/trimming. For example, given the list of detected objects [car, handphone, cat, dog, glass, ball] and the masked sentence "A man is talking on a ***mask*** in a football stadium", where ***mask*** is the masked entity to be predicted, the ideal sentence after prediction would be "A man is talking on the handphone in a football stadium". After entity trimming, the last part of the pipeline performs grammar correction on the sentence to ensure its correctness. The method used calls the unofficial Application Programming Interface (API) [18] of a popular grammar checking tool, Grammarly [19]. Unlike all the previous modules, this API is only available in JavaScript, so NodeJS is used as an external link from the main Python program. The generated sentence is packed and sent to Grammarly's server, and the corrected sentence is returned as the result. Finally, the sentences/descriptions for the scene are generated. They split into two parts: the first is a global/overview sentence describing the scene as a whole (with less detail); the second is a set of localised sentences describing each detected human with their attributes and actions attached. Figures 4, 5, 6 and 7 show some experimental results.
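To make the entity-trimming step of Section 5.3 concrete, the sketch below reproduces the handphone example with FitBERT [17]. The call pattern follows FitBERT's documented usage (FitBert(), rank() and fitb() with the ***mask*** token); exact behaviour depends on the installed version and downloaded BERT weights, so treat this as an illustrative sketch rather than the authors' exact code.

```python
from fitbert import FitBert  # BERT-based fill-in-the-blanks ranker [17]

# Candidate entities for the slot, as produced by the vision models.
detected_objects = ["car", "handphone", "cat", "dog", "glass", "ball"]

# Preliminary template sentence with the entity slot masked out.
masked_sentence = "A man is talking on a ***mask*** in a football stadium"

fb = FitBert()  # downloads a pretrained BERT model on first use

# Rank the candidates by how well they fit the masked slot ...
ranked = fb.rank(masked_sentence, options=detected_objects)
print(ranked[0])  # best-fitting entity, expected to be "handphone"

# ... or fill the blank directly to obtain the trimmed sentence.
print(fb.fitb(masked_sentence, options=detected_objects))
# e.g. "A man is talking on a handphone in a football stadium"
```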


Figure 4: Screenshot of an output from the proposed system.

Figure 5: Screenshot of an output from the proposed system.

Figure 6: Screenshot of an output from the proposed system.

Figure 7: Screenshot of an output from the proposed system.


6 EXPERIMENTAL SETUP
We train and run the ML pipeline on an Intel i5 with 32 GB of DDR3 RAM and an Nvidia 2080Ti. Standard computer-vision libraries are used: Python 3.6.9 (language), OpenCV 4.1.0, the CUDA toolkit, TensorFlow and Keras. The model is evaluated using the Bilingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores. The BLEU score is one of the popular metrics used to measure the quality or accuracy of computer-generated text [20]. The metric calculates the overlap of words, ranging from unigrams (single words) up to n-grams (n consecutive words), between the predicted sentence and one or more ground-truth/reference sentences; a high-scoring description requires the generated sentence to match the length and words of the ground truth. This project uses BLEU-4, in which scores are calculated up to 4-grams. BLEU is calculated as shown in Table 1(a), where l_r is the length of the reference corpus, l_c is the length of the candidate sentence, N is the total number of n-gram orders used, w_n is the positive weight for each order, and p_n is the modified n-gram precision (combined as a geometric average). The METEOR score, on the other hand, computes how well a generated sentence aligns with the reference sentence [21]; it was proposed to curb the weaknesses of BLEU. What differentiates METEOR from BLEU is that METEOR uses a lexical database of the language (WordNet by default) and matches words semantically instead of exactly, accounting for exact word matches, synonym matches, stemmed word matches and paraphrase matches. The dataset chosen for evaluation is the Microsoft Research Video Description Corpus (MSVD), one of the popular videos-in-the-wild datasets, containing 1970 YouTube video clips. The annotations are hand-labelled by humans (~40 annotations per clip), and each sample ranges between 10 and 25 seconds and displays one major event. 30 videos were chosen at random for evaluation with both metrics, and the obtained scores are compared with past works to evaluate the performance of the project. The proposed method generates two types of description: an overview/global sentence describing the main event, and multiple localised sentences describing individual humans. Therefore, both BLEU and METEOR are aggregated as shown in Table 1(b): the average global detection score (Average_Gscore) is the sum of each global detection score (the n-th G_score) divided by the number of samples evaluated (N). The average localised score is calculated as shown in Table 1(c): for each video sample, the scores of the generated localised sentences (l_score) are summed and divided by the number of sentences generated for that sample (N_l) to obtain the sample's L_score; the average localised detection score (Average_Lscore) is then the sum of each L_score divided by the number of samples evaluated (N). Finally, an overall score is calculated as in Table 1(d).

Table 1: The equations for the BLEU and METEOR score calculations

(a) $\log BLEU = \min\left(1 - \frac{l_r}{l_c},\, 0\right) + \sum_{n=1}^{N} w_n \log p_n$
(b) $Average\_G_{score} = \frac{1}{N} \sum_{n=1}^{N} G_{score}^{(n)}$
(c) $L_{score} = \frac{1}{N_l} \sum_{n=1}^{N_l} l_{score}^{(n)}, \qquad Average\_L_{score} = \frac{1}{N} \sum_{n=1}^{N} L_{score}^{(n)}$
(d) $Overall\ Score = \frac{Average\_G_{score} + Average\_L_{score}}{2}$
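A small sketch of the scoring scheme in Table 1(b)–(d), assuming a per-sentence metric is available as a callable; here NLTK's sentence_bleu stands in for that metric, and the per-sample structure (a list of reference sentences, one global sentence and a list of localised sentences) mirrors the description above. The authors' own scoring code is not given, so names and structure here are illustrative.

```python
from statistics import mean
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu4(references, candidate) -> float:
    """Sentence-level BLEU-4; any per-sentence metric (e.g. METEOR) could be plugged in."""
    return sentence_bleu([r.split() for r in references], candidate.split(),
                         smoothing_function=smooth)

def overall_score(samples, metric=bleu4) -> float:
    """samples: list of dicts with keys 'references', 'global', 'localized'."""
    g_scores, l_scores = [], []
    for s in samples:
        g_scores.append(metric(s["references"], s["global"]))            # Table 1(b)
        if s["localized"]:
            l_scores.append(mean(metric(s["references"], sent)           # Table 1(c)
                                 for sent in s["localized"]))
        else:
            l_scores.append(0.0)  # no sentence generated => score of zero (cf. Section 7)
    average_g = mean(g_scores)    # Average_Gscore
    average_l = mean(l_scores)    # Average_Lscore
    return (average_g + average_l) / 2                                    # Table 1(d)

# Example:
# samples = [{"references": ["a man is walking a dog in the park"],
#             "global": "a man walks a dog in a park",
#             "localized": ["a young man in a blue shirt is walking a dog"]}]
# print(overall_score(samples))
```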

Table 2: Comparison of benchmarking result with other works on MSVD dataset

Techniques/Models/Methods BLEU Scores METEOR Scores


S2VT - 29.8
h-RNN 49.9 32.6
MM-VDN 37.6 29.0
GloVe+Deep Fusion Ensemble 42.1 31.4
S2FT - 29.9
HRNE 43.8 33.1
GRU-RCN 43.3 31.6
LSTM-E 45.3 31.0
SCN-LSTM 51.1 33.5
LSTM-TSA 52.8 33.5
TDDF 45.8 33.3
BAE 42.5 32.4
PickNet 46.1 33.1
M3-IC 52.8 33.3
RecNetlocal 52.3 34.1
TSA-ED 51.7 34.0
GRU-EVE 47.9 35.0
Current Project 48.01 32.80

Table 3: Breakdown of score obtained from each component

Metric Global Prediction Score Localized Prediction Score


BLEU 50.55 45.47
METEOR 34.39 31.30

7 PERFORMANCE EVALUATION
The proposed model is evaluated against various models developed over the years [22]. Table 2 shows the comparison with methods/models benchmarked on the MSVD dataset. From Table 2, the proposed pipeline achieves a BLEU score that surpasses many of the compared models and methods, but it does not perform as well on the METEOR metric. This can be explained by a few hypotheses. The score breakdown is shown in Table 3. Firstly, both the BLEU and METEOR scores are better for global prediction, which suggests that our model captures the global and main-event context fairly well. However, when it comes to localised prediction, the scores drop considerably. This may be because detecting and describing events at a finer level still poses a challenge. It may also suggest that the reference sentences do not match the context anticipated by the inferred sentences, since the annotations in the dataset mostly contain limited context (without apparel, emotions, scene and so on). Besides that, the MSVD dataset does not only contain annotations describing human actions but also non-human subjects, such as animals and vehicles; for those videos/samples with no human involved, the system performs poorly with a score of zero, as no sentence is generated. In fact, if the scores for non-human scenes are excluded, the BLEU and METEOR scores are actually quite good, at 62.01 and 43.58 respectively.

8 FINDINGS AND CONCLUSION
Currently available video description methods tend to produce short and concise sentences; some involve semantic information, but most lack it. This project introduces a video description method and pipeline that incorporates semantic context, which also includes audio information, to provide a richer and more accurate video description. The pipeline resolves entities for better information conveyance using a slot-filling approach. At this stage, the contexts that can be resolved in this project are: actions, identity (age, gender, name), emotions, objects, apparel and its attributes, and scene. The scores achieved for BLEU and METEOR are 48.01 and 32.80 respectively. Despite this achievement, many problems and challenges were encountered during the development of the project. The extraction of semantics is affected by the occlusion of entities or objects, the clarity of the videos and the limitations of some feature extraction methods, and these errors propagate down the captioning pipeline.


In conclusion, this paper explores the possibilities of pipelining models to achieve semantically and contextually rich detection and to generate descriptions for a video. Methods such as language masking and human identity recognition provide ways to perform entity smoothing and trimming. Going forward, new modules can be added to further enhance the accuracy or context richness of the pipeline, such as dense video auto-captioning, human intention prediction, audio information extraction, speech and emotion recognition, and information extraction from video subtitles.

REFERENCES
[1] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv preprint, 2018.
[2] M. U. G. Khan, L. Zhang and G. Y., "Human Focused Video Description," in Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
[3] A. Gatt and E. Reiter, "SimpleNLG: A Realisation Engine for Practical Applications," in Proceedings of the 12th European Workshop on Natural Language Generation (ENLG '09), Stroudsburg, PA, USA, 2009.
[4] S. Venugopalan, M. Rohrbach, R. Mooney, T. Darrell and K. Saenko, "Sequence to Sequence – Video to Text," in Proceedings of ICCV, 2015.
[5] R. Krishna, K. Hata, F. Ren, L. Fei-Fei and J. C. Niebles, "Dense-Captioning Events in Videos," arXiv preprint arXiv:1705.00754, 2017.
[6] P. Hanckmann, K. Schutte and G. J. Burghouts, "Automated textual descriptions for a wide range of video events with 48 human actions," in Proceedings of the European Conference on Computer Vision Workshops and Demonstrations, 2012.
[7] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE TPAMI, 2017.
[8] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Computer Vision and Pattern Recognition (CVPR), 2017.
[9] N. Wojke, A. Bewley and D. Paulus, "Simple online and realtime tracking with a deep association metric," CoRR abs/1703.07402, 2017.
[10] A. Geitgey, "face-recognition 1.3.0," 20 February 2020. [Online]. Available: https://pypi.org/project/face-recognition/. [Accessed 20 November 2020].
[11] G. Levi and T. Hassner, "Age and gender classification using convolutional neural networks," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2015.
[12] S. Zheng, F. Yang, M. H. Kiapour and R. Piramuthu, "ModaNet: A large-scale street fashion dataset with polygon annotations," in ACM Multimedia, 2018.
[13] Simaiden, "Clothing detection using YOLOv3, RetinaNet, Faster RCNN in ModaNet and DeepFashion2 dataset," 12 January 2020. [Online]. Available: https://github.com/simaiden/Clothing-Detection. [Accessed 20 November 2020].
[14] X. Liu, J. Li, J. Wang and Z. Liu, "MMFashion: An Open-Source Toolbox for Visual Fashion Analysis," arXiv preprint arXiv:2005.08847, 2020.
[15] Z. Liu, P. Luo, S. Qiu, X. Wang and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in CVPR, 2016.
[16] J. Devlin, M. W. Chang, K. Lee and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[17] R. Biljana, S. Aneta, Q. Jenkins, J. and S. Havens, "fitbert 0.9.0," 22 May 2020. [Online]. Available: https://pypi.org/project/fitbert/. [Accessed 20 November 2020].
[18] Stewartmcgown, "Unofficial Grammarly API Client," 9 September 2019. [Online]. Available: https://github.com/stewartmcgown/grammarly-api. [Accessed 20 November 2020].
[19] M. Lytvyn, A. Shevchenko and D. Lider, "Grammarly," 2020. [Online]. Available: https://www.grammarly.com. [Accessed 20 November 2020].
[20] K. Papineni, S. Roukos, T. Ward and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," IBM Research Report RC22176 (W0109-022), 2001.
[21] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[22] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani and M. Shah, "Video description: A survey of methods, datasets and evaluation metrics," arXiv preprint arXiv:1806.00186, 2018.

