
Neurocomputing 445 (2021) 121–133


A comparative study of language transformers for video question answering

Zekun Yang a,*, Noa Garcia b, Chenhui Chu b, Mayu Otani c, Yuta Nakashima b, Haruo Takemura a

a Graduate School of Information Science and Technology, Osaka University, Japan
b Institute of Datability Science, Osaka University, Japan
c CyberAgent, Inc., Japan

* Corresponding author. E-mail addresses: yang.zekun@lab.ime.cmc.osaka-u.ac.jp (Z. Yang), noagarcia@ids.osaka-u.ac.jp (N. Garcia), chu@ids.osaka-u.ac.jp (C. Chu), otani_mayu@cyberagent.co.jp (M. Otani), n-yuta@ids.osaka-u.ac.jp (Y. Nakashima), takemura@ime.cmc.osaka-u.ac.jp (H. Takemura).

Article history: Received 29 June 2020; Revised 23 November 2020; Accepted 22 February 2021; Available online 10 March 2021. Communicated by Zidong Wang.

Keywords: Video question answering; Language representation; Language transformers

Abstract: With the goal of correctly answering questions about images or videos, visual question answering (VQA) has quickly developed in recent years. However, current VQA systems mainly focus on answering questions about a single image and face many challenges in answering video-based questions. VQA in video not only has to understand the evolution between video frames but also requires a certain understanding of the corresponding subtitles. In this paper, we propose a language Transformer-based video question answering model to encode the complex semantics from video clips. Different from previous models, which represent visual features by recurrent neural networks, our model encodes visual concept sequences with a pre-trained language Transformer. We investigate the performance of our model using four language Transformers over two different datasets. The results demonstrate outstanding improvements compared to previous work.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

A visual question answering (VQA) system aims to answer questions related to images, which is an interesting but challenging task for an intelligent system. It usually takes an image and a related question as input and predicts a correct answer as output [46,1]. During question answering, the semantics in both the language (the question) and the visual (the image) content is retrieved and represented by high-dimensional vectors as visual and language features. Attention mechanisms are also adopted to find the parts of the image that are helpful for answering the question [46,47,28]. VQA has been developing quickly in recent years, and many strong attempts have been made [13,3,36,1] to create better VQA systems.

There are many branches of VQA tasks, including general VQA [1,13,3], which answers a question related to an image; video question answering (video-QA) [38,54,18], which addresses question answering in video sequences; embodied question answering (embodied-QA) [6,43,51], where an agent is placed at a random position in a 3D environment and needs to answer questions related to this environment; and text VQA [15,37,31], which focuses on answering questions related to the text in the image. We believe video-QA can be a proxy for embodied-QA and text-QA tasks because video is a medium with multiple modalities, i.e., a sequence of frames, audio tracks, and speech content (or subtitles). Video-QA systems need to combine these modalities in videos and try to understand what is happening in order to find the correct answer.

Recently, some techniques have been proposed to handle video-QA tasks. The inputs of video-QA vary by task, e.g., video only [10,52], video and subtitles [22,17,23], video and audio [29], etc. Video and subtitles contain rich information and have been widely applied in many similar tasks [22,17,23], so we take the video frame sequences, the corresponding subtitles, a question, and candidate answers as inputs, as shown in Fig. 1.

Video-QA systems face two main challenges. First, the visual input of video-QA for each question is a sequence of video frames instead of a single image, making it necessary for the system to understand the evolution of the video [56,32]. Second, understanding the subtitles is also important, because they contain the transcription of what the characters are saying and convey complementary information that is not shown in the video frames [22,18,11].
In video-QA models, visual and language elements are usually handled by individual flows [38,22,18]. The visual flow retrieves and processes the semantics from video frames, and the language flow obtains and processes the semantics from subtitles in parallel. Recurrent neural networks (RNNs), especially long short-term memory (LSTM) networks [14], are used in each flow to encode those elements. As the inputs of video-QA tasks often contain long sequences, RNN-based networks may have difficulty grasping the semantics in them and can thus be a bottleneck to the improvement of accuracy in the video-QA task.

In the natural language processing community, the Transformer [40], a novel architecture that aims at solving sequence-to-sequence tasks while handling long-range dependencies, has been proposed and used extensively [2,9,25,7]. Transformers rely on self-attention to compute representations of the input and output without using RNNs, and they have been reported to outperform RNNs in several natural language processing tasks.

We propose to improve video-QA by capturing the visual and language semantics from video clips more accurately. Different from previous models, which represent visual features by RNNs, our model encodes the visual elements in the input with a pre-trained language Transformer [40]. For this, we follow [22] and adopt a visual concept detector [35] to represent each detected concept with a corresponding word/phrase. We fine-tune the network, including the pre-trained Transformer, in an end-to-end manner for the video-QA task.

One key question about our new video-QA model is which Transformer should be used for video-QA tasks among the different pre-trained variants. To explore this, we refer to a reading comprehension task called SQuAD [34], because video-QA needs to understand the subtitles to predict the correct answer, which is similar to reading comprehension. We assume that variants of Transformers that work well on SQuAD also perform well on video-QA tasks. The SQuAD leaderboard (https://rajpurkar.github.io/SQuAD-explorer/) shows that BERT [8], XLNet [48], RoBERTa [26], and ALBERT [21] were the best performers at the time of this research. Therefore, we select them and compare their performance in our experiments.

Our contribution is threefold: i) We propose an improved model for video-QA. Our model uses a pre-trained language Transformer to encode a sequence of visual concepts in addition to subtitles. ii) We evaluate our model with four commonly applied Transformers on two video-QA datasets and find that BERT works better than the other variants. iii) We discuss how different pre-training strategies and configurations of Transformers affect video-QA performance.

The remainder of this paper is arranged as follows: Section 2 introduces related work on video-based question answering and language representations. Section 3 gives a brief introduction to the Transformers we studied. Section 4 describes our proposed model in detail. Section 5 presents the experimental evaluation. Section 6 discusses the results, and Section 7 concludes the paper.

Fig. 1. Overview of our work. We use variants of Transformers to encode the complex semantics from video clips in the video question answering task.

2. Related work

2.1. Video-based question answering

In the past few years, some video-QA related work has been proposed to discuss and handle its technical aspects. For example, Wu et al. [45] present a passage retrieval algorithm to extend the text question answering task to videos; Zhao et al. [54] propose a context-aware question understanding network to deal with video-QA over several rounds of dialogue; and Wu et al. [44] present a cross-language video-QA model that handles English questions and finds answers in Chinese videos.

Different video-QA tasks may handle the combinations of input modalities in different ways. For example, in [10,52], video frames themselves are taken as input to answer related questions. The tasks in [22,17] take video frames and subtitles as input. Moreover, video frames and audio are used as inputs in [29] to deal with questions related to speech. Our target task involves sequences of video frames and the corresponding subtitles.

Representation of video is also one of the main concerns in video-QA tasks, and there are a number of techniques related to the input combination of our target task alone. For instance, Lei et al. [22], Kim et al. [16], and Zeng et al. [53] mainly use LSTMs to represent the visual (sequences of frames) and language (subtitles) features. Lei et al. [23] adopt BERT for language representation and LSTM for visual representation. Garcia et al. [11] use object detectors to detect features in video frames and represent the sequences of frames as a bag of concepts, while they use BERT for subtitle representation. Owing to its rich capability, we used BERT to represent both visual and language features in [49].

This work shares a similar idea with Zeng et al. [53] in that both handle visual elements linguistically. Our work is distinct from Zeng et al. [53] for two reasons: 1) We use a sequence of words/phrases to represent a sequence of frames, which is much simpler than the sequences adopted in [53]. 2) Besides the visual concept features, we also include the subtitles, which contain the transcription of what the characters are saying and act complementarily to the video frames.

This work is also different from our previous work [49], in which we used BERT for visual and language representations. Before carrying out this project, it was not clear which Transformer performs best on the video-QA task, or why. This work not only shows the performance of three other Transformers on the video-QA task, but also empirically shows how different pre-training strategies and configurations of Transformers affect video-QA performance.

2.2. Language representation

In question answering models, language representations are used to convert words or sentences into vectors for further processing. Language representation methods handle language features at the word level or the sentence level and have developed quickly in recent years. For example, GloVe [33] leverages statistical information at the word level by training only on the non-zero elements of a word-word co-occurrence matrix.
Skip-thought vectors [19] generate language representations at the sentence level, providing a generic, distributed encoder to reconstruct the surrounding sentences of an encoded passage. Besides these methods, many Transformers have been proposed for sentence feature representation. For example, BERT [8] gives deep bidirectional representations of unlabeled text; XLNet [48] enables bidirectional context learning by maximizing the expected likelihood over all permutations of the factorization order; RoBERTa [26] is an improved recipe for training BERT models; and ALBERT [21] lowers the memory consumption and increases the training speed of BERT. Some other language representations [30,4,39,27] have also been proposed, aiming at representing language features in words or sentences better.

3. Introduction to transformers

Transformers [40] are novel architectures that rely on self-attention to compute representations of the input and output without using RNNs. They transform one sequence into another while keeping the number of sequential operations constant, so that longer sequences can be processed more easily. To represent language features better, many variants of Transformers have been proposed in recent years.

In Transformers, the complex language features are converted into vectors by input representations so that they can be handled easily in computer systems. For a given input word token sequence, the input representation is the combination of the corresponding token embeddings, segment embeddings, and position embeddings. Token embeddings denote the embeddings of the input tokens, segment embeddings denote the sentence that each token belongs to (e.g., A: the former sentence; B: the latter sentence), and position embeddings denote the position of each token within the input sequence. In the token sequence, the first token is a [CLS] mark, which is used to obtain the output in classification tasks. Then, the former sentence follows. A [SEP] token is added between the former and latter sentences and at the end of the latter sentence to indicate the separation between different sentences, so the latter sentence goes between two [SEP] tokens.
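As a concrete illustration of this input format, the short sketch below encodes a former/latter sentence pair with the BERT tokenizer from the Hugging Face transformers library. The example sentences, the library choice, and the variable names are ours for illustration; the other studied Transformers expose analogous tokenizers with their own special tokens.

```python
from transformers import BertTokenizer

# Load the vocabulary of the uncased BERT-base model (see Table 1).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

former = "man black hair blue shirt what is the man wearing"  # e.g. the former sentence [v; q]
latter = "a blue shirt"                                        # e.g. the latter sentence, a candidate answer

encoding = tokenizer(former, latter)

# The token sequence is [CLS] former [SEP] latter [SEP];
# token_type_ids holds the segment ids (0: former sentence, 1: latter sentence).
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["token_type_ids"])
```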
We investigate the best performing Transformers on the SQuAD dataset [34] at the time of this research, which are BERT [8], XLNet [48], RoBERTa [26], and ALBERT [21]. These Transformers are mainly distinguished from each other by their pre-training settings. We use them to model the semantics captured from video clips, including visual concept features and subtitles. We first introduce each studied Transformer, and then we evaluate their performance via experiments.

3.1. BERT

BERT is a language representation model designed to extract pre-trained deep bidirectional representations [8]. It uses bidirectional Transformers [40], meaning that every word attends to the context on both sides in every layer of the network. Pre-trained BERT representations can be fine-tuned to achieve state-of-the-art performance in a wide range of tasks [8,25,12].

To obtain a deep bidirectional representation, BERT uses masked language modeling during its pre-training, which masks some of the input tokens at random and then predicts those masked tokens. To understand the relation between two sentences, a binarized next sentence prediction task is also included in the pre-training, predicting whether a certain sentence is the next sentence of another sentence.

3.2. XLNet

XLNet is a generalized method that leverages both auto-regressive language modeling and auto-encoding [48]. It integrates the segment recurrence and relative encoding scheme of Transformer-XL [5] into pre-training and improves the performance for tasks with longer text sequences. Experiments show that XLNet also has good performance on a wide range of tasks [48,7,55].

Instead of masked language modeling, XLNet uses permutation language modeling during pre-training, which learns to utilize contextual features from all positions to capture bidirectional contexts.

3.3. RoBERTa

RoBERTa is an improved recipe for training BERT models that measures the impact of hyper-parameters and data size [26]. Compared with standard BERT, the pre-trained RoBERTa model is created by longer training with bigger batches over more data. RoBERTa can match or exceed the performance of BERT and other similar methods [26].

During pre-training, dynamic masking is used: unlike BERT, which performs masking during data pre-processing, RoBERTa generates the masking pattern when a sequence is fed into the model.

3.4. ALBERT

ALBERT is a language representation method [21] proposed to reduce memory consumption and training time. It has fewer parameters than BERT-large and achieves significantly better performance [21].

ALBERT employs factorized embedding parameterization and cross-layer parameter sharing to reduce the number of parameters. The large vocabulary embeddings are decomposed into two smaller matrices: tokens are first projected into a lower-dimensional embedding space, which is then projected to the hidden layers. Moreover, all parameters are shared across layers to improve parameter efficiency.

ALBERT uses sentence order prediction instead of BERT's next sentence prediction, which focuses on modeling the coherence between different sentences. This setup forces the model to learn finer-grained distinctions about discourse-level coherence properties. Consequently, ALBERT improves the performance on multi-sentence encoding tasks.

4. The proposed model

We propose a video-QA model to answer multiple-choice video questions, which is illustrated in Fig. 2. In our model, two flows, one for visual and one for language predictions, first work separately. Then, their results are summed to obtain the joint answer prediction. In the visual flow, we represent the visual semantics of each video frame as a set of objects and attributes that appear in the scene, which we call visual concept features. In the language flow, the language semantics is extracted from the subtitles. In each flow, the visual concept features or subtitles are handled together with the question and each candidate answer by the same variant of language Transformer, but with different instances.

Fig. 2. Proposed model for answering multiple choice video questions. Note that V, Q, and S denote visual concept, question, and subtitle, respectively.

4.1. Visual representations

Recent work [22,50] found that using detected object labels as input gives comparable or better performance than directly using features from convolutional neural networks in image captioning and video-QA tasks. Thus, we use visual concept features, which are the labels of the objects detected with Faster-RCNN [35], to represent the content of a video scene semantically.
Here, the visual concept features contain both objects and attributes, which are shown in the format of adjective + noun or just noun, such as man, black hair, and blue shirt. Fig. 3 shows an example of such features.

Fig. 3. Example of visual concept features detected using Faster-RCNN.

We extract visual concept features from each video frame using Faster-RCNN [35] pre-trained on the Visual Genome dataset [20], as in [1]. The video frames are extracted at 3 fps. In every extracted frame, the visual concept features are represented by the corresponding nouns or phrases. We aggregate the visual concepts from all the frames and remove the duplicated ones to obtain the unique visual concept features of a whole scene. Then, the unique visual concept features v, the question q, and each candidate answer item a_i (i = 0, 1, 2, 3, 4) are concatenated and rearranged into a single string c_i. Each rearranged string is tokenized to obtain the sequence T_ci:

c_i = [v; q; a_i]  (1)

T_ci = tokenize(c_i)  (2)

Here, the concatenation of v and q, [v; q], is set as the former sentence for the Transformer, and [a_i] is set as the latter sentence. When c_i is longer than the maximum number of tokens L, the numbers of tokens in the former ([v; q]) and latter ([a_i]) sentences are measured, and the last token(s) are truncated from the longer sentence until the number of tokens in T_ci is no more than L.
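The snippet below is a minimal sketch of this preprocessing step, assuming the per-frame concept labels have already been produced by the detector. It aggregates unique concepts over the frames, builds the pair ([v; q], a_i) for each candidate answer, and relies on the tokenizer's longest-first truncation, which closely mirrors the truncation rule described above. Function and variable names are ours, not the authors' code.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
MAX_TOKENS = 128  # L in the paper

def build_visual_inputs(frame_concepts, question, answers):
    """frame_concepts: one list of concept labels per frame sampled at 3 fps."""
    # Aggregate the concepts of all frames and drop duplicates (order preserved).
    seen, unique_concepts = set(), []
    for frame in frame_concepts:
        for concept in frame:              # e.g. "man", "black hair", "blue shirt"
            if concept not in seen:
                seen.add(concept)
                unique_concepts.append(concept)
    v = " ".join(unique_concepts)

    encodings = []
    for a_i in answers:                    # the five candidate answers
        former = f"{v} {question}"         # [v; q] is the former sentence
        enc = tokenizer(
            former, a_i,                   # a_i is the latter sentence
            max_length=MAX_TOKENS,
            truncation="longest_first",    # trim tokens from the longer segment until the pair fits
            padding="max_length",
            return_tensors="pt",
        )
        encodings.append(enc)
    return encodings
```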
Next, T_ci is fed into the language Transformer to obtain an output V_ci, a matrix containing the vector representation of each word in the input sentence. The output vector corresponding to the [CLS] token, V_ci^0, is fed into a fully connected layer to obtain the visual flow representation R_ci for answer a_i, where F_c is a trainable parameter:

V_ci = Transformer_c(T_ci)  (3)

R_ci = F_c V_ci^0  (4)

4.2. Language representations

Similarly, in the language flow, we concatenate the subtitles s, the question q, and the candidate answer items a_i (i = 0, 1, 2, 3, 4) to form the string w_i. The concatenation of s and q, [s; q], is set as the former sentence, and [a_i] is set as the latter sentence. The last token(s) are truncated from the longer sentence until the number of tokens in T_wi is no more than L. The rearranged string is tokenized to form the sequence of tokens T_wi, which is fed into the Transformer to obtain V_wi. Then, V_wi^0 is fed into a fully connected layer to obtain the language flow representation R_wi for answer a_i, where F_w is a trainable parameter:

w_i = [s; q; a_i]  (5)

T_wi = tokenize(w_i)  (6)

V_wi = Transformer_w(T_wi)  (7)

R_wi = F_w V_wi^0  (8)

4.3. Prediction

Finally, the representations of the visual and language flows for each question are summed to obtain R_pi. Softmax is used to convert the summed vector into the answer scores R_f:

R_pi = R_ci + R_wi  (9)

R_p = [R_p0; R_p1; R_p2; R_p3; R_p4]  (10)

R_f = softmax(R_p)  (11)
The answer with the maximum score is selected as the final predicted answer a_p with

a_p = argmax(R_f)  (12)
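The sketch below shows one way to realize Eqs. (3)-(12) with PyTorch and the Hugging Face transformers API: each flow has its own Transformer instance, the [CLS] vector is projected to a scalar score per candidate answer, and the scores of the two flows are summed before the softmax. This is our reading of the equations rather than the authors' released implementation, and all names are illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TwoFlowVideoQA(nn.Module):
    def __init__(self, model_name="bert-base-uncased", hidden_size=768):
        super().__init__()
        self.visual_transformer = BertModel.from_pretrained(model_name)    # Transformer_c
        self.language_transformer = BertModel.from_pretrained(model_name)  # Transformer_w
        self.fc_visual = nn.Linear(hidden_size, 1)    # F_c in Eq. (4)
        self.fc_language = nn.Linear(hidden_size, 1)  # F_w in Eq. (8)

    def _score(self, transformer, fc, enc):
        out = transformer(**enc)                      # V_ci or V_wi
        cls_vec = out.last_hidden_state[:, 0]         # output vector at the [CLS] position
        return fc(cls_vec).squeeze(-1)                # R_ci or R_wi

    def forward(self, visual_encodings, language_encodings):
        # One tokenized pair per candidate answer (five in TVQA and Pororo).
        scores = []
        for v_enc, w_enc in zip(visual_encodings, language_encodings):
            r_c = self._score(self.visual_transformer, self.fc_visual, v_enc)
            r_w = self._score(self.language_transformer, self.fc_language, w_enc)
            scores.append(r_c + r_w)                  # Eq. (9)
        return torch.stack(scores, dim=1)             # Eq. (10): per-answer scores R_p

def predict(model, visual_encodings, language_encodings):
    r_p = model(visual_encodings, language_encodings)
    r_f = torch.softmax(r_p, dim=1)                   # Eq. (11)
    return r_f.argmax(dim=1)                          # Eq. (12): index of the predicted answer
```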
5. Experiments

In our experiments, the learning rate is set to 2 × 10^-5, the number of epochs is set to 10, the training batch size is set to 8, and the inferring (i.e., validation and test) batch size is set to 16. Detailed information about the different pre-trained Transformer models we use is shown in Table 1. L, the maximum number of tokens per sequence, can be 512 at most. To save memory, L is set to 128 in our training. The Adam optimizer is used with its weight decay set to 1 × 10^-5. Cross-entropy loss is applied in our prediction.

Table 1. Detailed information about the different pre-trained Transformer models we use.
Transformer | Model name | Layers | Hidden size | Attention heads | Parameters
BERT | bert-base-uncased | 12 | 768 | 12 | 110 M
XLNet | xlnet-base-cased | 12 | 768 | 12 | 110 M
RoBERTa | roberta-base | 12 | 768 | 12 | 125 M
ALBERT | albert-base-v1 | 12 | 768 | 12 | 11 M
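Under these settings, the optimization step could be set up roughly as follows; the data loader is assumed to yield the tokenized pairs of both flows and the index of the correct answer, and PyTorch's cross-entropy loss applies the softmax of Eq. (11) internally.

```python
import torch
import torch.nn as nn

model = TwoFlowVideoQA()                   # sketch from Section 4
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()

NUM_EPOCHS = 10
TRAIN_BATCH_SIZE, INFER_BATCH_SIZE = 8, 16  # used when building the (assumed) DataLoaders

for epoch in range(NUM_EPOCHS):
    model.train()
    for visual_enc, language_enc, labels in train_loader:   # assumed DataLoader
        optimizer.zero_grad()
        scores = model(visual_enc, language_enc)             # R_p, one score per candidate answer
        loss = criterion(scores, labels)                      # labels: index of the correct answer
        loss.backward()
        optimizer.step()
```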
Datasets: Two video-QA datasets are adopted in our experiments: the TVQA dataset [22] and the Pororo dataset [18]. The TVQA dataset is based on six TV shows with 152,500 question-answer pairs (Q/A pairs) from 21,800 clips, while the Pororo dataset is based on a children's cartoon video series called Pororo with 8834 Q/A pairs from 171 episodes. In both datasets, each question is in multiple-choice style with one correct answer out of five candidate answers. For each question, the subtitles corresponding to the video scene are also provided. To correctly answer these questions, video-QA systems need to achieve a joint understanding of both visual and language features. The TVQA dataset provides two modes for prediction: one uses time stamp annotated elements, whose inputs are the visual and language features corresponding to the question; the other uses full-length elements, which takes all the visual and language features as inputs. The subtitles in each video clip contain several rounds of dialogue. The Pororo dataset provides only full-length elements for prediction; its subtitles are composed of narrations and a few rounds of dialogue. Besides the videos and subtitles, descriptions of the video scenes are also given. To compare our results with those of previous models, we report the results of TVQA [22], STAGE [23], and MDAM [17].

Input sequence: To test the performance of the Transformers, the input tokens containing visual and language features are rearranged in three different ways:

1) V/S + Q + A
2) V/S + . + Q + A  (13)
3) V/S + [SEP] + Q + A

where V represents the visual concepts, which correspond to v in Section 4, S represents the subtitles, which correspond to s, Q represents the question, which corresponds to q, and A represents the answer, which combines all a_i. V/S indicates that both visual concepts and subtitles are taken in the visual and language flows. Ablation studies are conducted by removing either the visual concepts (S + Q + A) or the subtitles (V + Q + A). We report the highest accuracy of each Transformer along with its corresponding word rearrangement.
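The three rearrangements only differ in the separator inserted between the context (V, S, or V/S) and the question before tokenization; a small helper like the following (our own illustration) covers all of them. For Transformers whose separator token is not literally "[SEP]", the tokenizer's sep_token string can be substituted.

```python
def rearrange(context, question, answer, style="plain"):
    """context: visual concepts (V), subtitles (S), or both (V/S) as one string."""
    if style == "plain":      # 1) V/S + Q + A
        former = f"{context} {question}"
    elif style == "period":   # 2) V/S + . + Q + A
        former = f"{context} . {question}"
    elif style == "sep":      # 3) V/S + [SEP] + Q + A
        former = f"{context} [SEP] {question}"
    else:
        raise ValueError(f"unknown rearrangement: {style}")
    return former, answer     # (former sentence, latter sentence) passed to the tokenizer
```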

5.1. Results on the TVQA dataset

Results with time stamp annotated elements: Results on the TVQA dataset with time stamp annotated elements are presented in Table 2. The TVQA dataset has no answer labels in the test set, so we submit the predictions for the test set to the test server. As the number of total submissions to the test server is limited, we split 15,253 Q/A pairs from the training set to form a test* set, while the validation set is kept the same. For comparison, we report the results of the TVQA model [22], which uses a shared LSTM for both visual and language representations and two unique LSTMs for joint modeling; the results from Thomas et al. [42], which takes the TVQA model's framework but uses either a shared LSTM or BERT for comparison and enlarges the output dimension of the unique LSTM to 600; and STAGE [23], which uses LSTM for visual and BERT for language representation. We also report some results on the official test set.

Table 2. Accuracy (in %) of the proposed model and baselines on the TVQA dataset with time stamp annotations. Note that one only has limited chances to submit results to the test server for evaluation; thus, we only show representative results of our model.
Input | Name | Model | Rearrangements | Val | Test* | Test
V + Q + A | TVQA [56] | GloVe + LSTM | - | 45.03 | - | 45.44
V + Q + A | Thomas [42] | GloVe + LSTM | - | 45.39 | - | -
V + Q + A | Thomas [42] | BERT | - | 43.44 | - | -
V + Q + A | Ours | BERT | V+.+Q+A | 48.95 | 49.23 | -
V + Q + A | Ours | XLNet | V+.+Q+A | 48.17 | 48.53 | -
V + Q + A | Ours | RoBERTa | V+Q+A | 27.99 | 27.66 | -
V + Q + A | Ours | ALBERT | V+Q+A | 47.01 | 47.53 | -
S + Q + A | TVQA [56] | GloVe + LSTM | - | 65.15 | - | 66.36
S + Q + A | Thomas [42] | GloVe + LSTM | - | 66.07 | - | -
S + Q + A | Thomas [42] | BERT | - | 68.30 | - | -
S + Q + A | Ours | BERT | S+[SEP]+Q+A | 70.65 | 70.22 | -
S + Q + A | Ours | XLNet | S+.+Q+A | 68.21 | 68.22 | -
S + Q + A | Ours | RoBERTa | S+.+Q+A | 69.32 | 68.92 | -
S + Q + A | Ours | ALBERT | S+[SEP]+Q+A | 67.53 | 66.57 | -
V/S + Q + A | TVQA [56] | GloVe + LSTM | - | 67.70 | - | 68.48
V/S + Q + A | STAGE [23] | LSTM + BERT | - | 70.50 | - | 70.23
V/S + Q + A | Ours | BERT | V/S+.+Q+A | 72.41 | 72.23 | 72.71
V/S + Q + A | Ours | XLNet | V/S+[SEP]+Q+A | 70.28 | 70.33 | 70.64
V/S + Q + A | Ours | RoBERTa | V/S+.+Q+A | 69.09 | 69.11 | 68.82
V/S + Q + A | Ours | ALBERT | V/S+.+Q+A | 70.32 | 69.65 | 70.35

From the results on the test set, when we use V/S + Q + A as inputs, the question answering accuracies of all the Transformers are better than that of the TVQA model, where GloVe + LSTM is used. They are also better than STAGE, except when RoBERTa is adopted. When we use V + Q + A as inputs, the validation accuracy of RoBERTa is 17.04% lower than the TVQA baseline and 17.40% lower than Thomas et al. using GloVe + LSTM. Among these Transformers, BERT performs the best in answer prediction. The results show that when we use both visual and subtitle representations, our model using BERT obtains an accuracy up to 4.23% higher than that obtained with the TVQA baseline and up to 2.48% higher than that with STAGE.

It is also seen that the results of using V + Q + A are about 20% lower than those of using S + Q + A, while the results of using S + Q + A are close to those of using V/S + Q + A. We think the reason lies in the bias of the TVQA dataset. Thomas et al. [42] have shown that the TVQA dataset tends to concentrate on the information in the subtitles while suppressing the information in the video during training. This means that a stronger natural language processing framework will contribute more to the improvement of accuracy. In Table 2, both the LSTM-based and the Transformer-based methods are natural language processing frameworks, making the prediction of subtitle-based questions better.

BERT is the worst performing Transformer among the four variants on SQuAD but is the best in our experiments on the TVQA dataset. To study the reason why BERT performs the best here, we consider the different pre-training settings of the different Transformers, which mainly refers to the language modeling strategy in each Transformer.

The language modeling strategies of the different Transformers influence how they predict the answer according to the visual and language features. When modeling language features, Transformers temporarily cover some words in the input sentence and try to predict them based on the nearby context. In our case, most of the answers are implied in the given context (V/S). When we use BERT, masked language modeling masks some words in the sentences randomly and uses the context on both sides to predict each word during pre-training. In XLNet, when the sentence is short, there are fewer permutations of the words, giving only limited information for the permutation language modeling to predict the words while keeping its long memory.
Our model takes no more than 128 tokens in training, which is only a quarter of that in the experiments with XLNet (512). The information for word prediction becomes limited, and the performance is not as good as BERT's. When we use RoBERTa, the masking pattern is generated when a sequence is fed into the model. This means RoBERTa also needs to fine-tune the masks and fit the context structure while training. Because SQuAD is a reading comprehension task, the input in SQuAD is a paragraph with several sentences, the tie between different sentences is strong, and the masks fit the context structure well. However, the subtitles in our task are mainly dialogues, the tie between sentences is weaker than that within paragraphs, and the masks cannot fit the context structure as well as in SQuAD, making the performance of RoBERTa insufficient. In ALBERT, the large embeddings are factorized into smaller ones, which breaks some word embeddings and cannot lead to a better prediction compared with BERT.

When we use V + Q + A, the accuracy of RoBERTa is much lower than that of the other Transformers. We believe this is also because of the dynamic masking in RoBERTa. When we use RoBERTa, it needs to learn how to predict the words in the masks during training. With subtitles and questions, dynamic masking can learn from a certain sentence structure, so the accuracy of answer prediction is only slightly lower than that of BERT. However, when we use visual concept features, which are just word sequences, the dynamic masking cannot learn from a certain sentence structure and gives many wrong predictions.

We also consider the role of next sentence prediction in our task, as it is designed to find the next sentence (A) according to the previous sentences (V/S + Q) and hence might be responsible for the drop of accuracy in V + Q + A when we use RoBERTa. We find that only BERT has such a mechanism (XLNet and RoBERTa do not), while ALBERT has improved next sentence prediction to sentence order prediction. Previous work [48,26,21] has shown that next sentence prediction is almost ineffective because it lacks difficulty compared with language modeling. Checking the results, the performance of XLNet is close to that of BERT, meaning that next sentence prediction is not the main reason for such a drop.

5.1.1. Results with full-length elements

Besides using time stamp annotated elements, we also want to test the performance of our model with full-length visual concept features and subtitles. We use full-length elements (i.e., visual concepts and subtitles without time stamp annotations) instead of the time stamp annotated elements in the TVQA dataset. The maximum number of tokens per input, L, is set to 128. Before inputting the data, some pre-processing is necessary, because full-length elements contain many tokens and some of them may be truncated before being conveyed into our model directly. Thus, we use an 80-token, 10-step sliding window (i.e., there are at maximum 80 tokens in each segment, and it moves by 10 tokens to obtain the next segment) in the language flow to select the tokens that best match the question and the five candidate answers. To make our model concentrate on the main components of the question and answers, we concatenate the question and the five candidate answers together and remove the stop words from them. The cosine similarity of the TF-IDF index between the tokens in the window and the question-answer pair is used as a metric for selecting the input tokens.

To implement this idea, we first generate a vocabulary of approximately 44,000 words with all the tokens in the TVQA training set that appear at least 5 times. Then, we segment the tokenized subtitles into 80-token segments and compute the cosine similarity between the TF-IDF representations of each segment and the question. Finally, we select the segment with the highest cosine similarity as the input subtitle. The results of each model are listed in Table 3.
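A sketch of this segment selection with scikit-learn is given below. The vocabulary (the roughly 44,000 training-set words occurring at least 5 times) is assumed to be built beforehand and passed in, and the window and stride follow the description above; the function name is ours.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

WINDOW, STRIDE = 80, 10

def select_subtitle_segment(subtitle_tokens, query_tokens, vocabulary):
    """Pick the 80-token subtitle window closest to the question/answer tokens."""
    # Windows of at most 80 tokens, moved forward 10 tokens at a time.
    last_start = max(1, len(subtitle_tokens) - WINDOW + 1)
    segments = [subtitle_tokens[i:i + WINDOW] for i in range(0, last_start, STRIDE)]

    vectorizer = TfidfVectorizer(vocabulary=vocabulary)
    segment_vectors = vectorizer.fit_transform([" ".join(seg) for seg in segments])
    query_vector = vectorizer.transform([" ".join(query_tokens)])  # question (+ answers), stop words removed

    similarities = cosine_similarity(segment_vectors, query_vector).ravel()
    return segments[int(np.argmax(similarities))]   # segment with the highest cosine similarity
```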
Table 3. Accuracy (in %) of the proposed model and baselines using full-length elements on the TVQA dataset.
Input | Name | Model | Rearrangements | Val | Test
V/S + Q + A | TVQA [56] | LSTM | - | 64.42 | 66.46
V/S + Q + A | PAMN [16] | LSTM | - | - | 66.77
V/S + Q + A | STAGE [23] | LSTM + BERT | - | 68.56 | 69.67
V/S + Q + A | HERO [24] | Transformers | - | - | 71.25
V/S + Q + A | Proposed | BERT | V/S+.+Q+A | 65.07 | 66.44
V/S + Q + A | Proposed | XLNet | V/S+[SEP]+Q+A | 61.40 | 61.68
V/S + Q + A | Proposed | RoBERTa | V/S+.+Q+A | 63.86 | 64.23
V/S + Q + A | Proposed | ALBERT | V/S+.+Q+A | 63.69 | 63.56

From the table, it can be seen that the best test accuracy of our model is 0.02% lower than that of the TVQA baseline. There may be two main reasons for this. First, full-length elements contain too many words to be covered by the embeddings. Our model embeds the visual concepts/subtitles, the question, and the candidate answers altogether and truncates the remaining words when the input sequence is longer than L. However, in [22,23,16], the questions, answers, subtitles, and visual features are embedded independently, enabling the tokenized input to contain more words and convey more information. Second, our work does not use attention to find the parts of the visual/subtitle elements related to the question. In the TVQA model [22], context matching modules are used to build context-aware vectors, which is helpful for prediction. In STAGE [23], guided attention is applied to match the words in the questions and answers to the visual concepts and subtitles. In PAMN [16], dual memory embedding is adopted to enable the pinpointing of different temporal parts for each module. In HERO [24], a cross-modal Transformer is used to fuse each QA pair with the local textual context, and then the QA pairs are fused with the global video context in a temporal Transformer module.
We would like to explore attention mechanisms to continue improving the prediction with full-length elements.

5.2. Results on the Pororo dataset

Results on the Pororo dataset are listed in Table 4. For a fair comparison, we only use the video scenes and subtitles, and we do not use the video scene descriptions provided. As STAGE only performs experiments on the TVQA dataset and its pre-trained BERT model is not available for the Pororo dataset, we compare our model against the MDAM [17] model and the TVQA [22] framework (i.e., we use the model provided by TVQA but change the dataset to Pororo). We obtain an accuracy up to 6.73% higher than that of MDAM and up to 11.26% higher than that of TVQA. As the Pororo dataset has no time stamp annotations, we use full-length elements directly. Note that we do not report a comparison with [18], as it also uses the video scene descriptions to find the correct answer.

Table 4. Accuracy (in %) of the proposed model and baselines on the Pororo dataset.
Input | Name | Model (Rearrangements) | Val | Test
V + Q + A | MDAM [17] | LSTM | - | 42.10
V + Q + A | TVQA [22] | GloVe + LSTM | 34.87 | 33.77
V + Q + A | Ours | BERT (V+Q+A) | 40.75 | 40.03
V + Q + A | Ours | XLNet (V+Q+A) | 38.14 | 41.28
V + Q + A | Ours | RoBERTa (V+Q+A) | 31.70 | 35.25
V + Q + A | Ours | ALBERT (V+.+Q+A) | 37.83 | 38.93
S + Q + A | MDAM [17] | LSTM | - | 42.50
S + Q + A | TVQA [22] | GloVe + LSTM | 37.60 | 33.90
S + Q + A | Ours | BERT (S+[SEP]+Q+A) | 52.51 | 54.16
S + Q + A | Ours | XLNet (S+Q+A) | 50.61 | 52.24
S + Q + A | Ours | RoBERTa (S+Q+A) | 53.07 | 54.01
S + Q + A | Ours | ALBERT (S+.+Q+A) | 52.04 | 53.64
V/S + Q + A | MDAM [17] | LSTM | - | 48.90
V/S + Q + A | TVQA [22] | GloVe + LSTM | 37.78 | 42.53
V/S + Q + A | Ours | BERT (V/S+.+Q+A) | 52.35 | 55.26
V/S + Q + A | Ours | XLNet (V/S+Q+A) | 51.38 | 54.89
V/S + Q + A | Ours | RoBERTa (V/S+[SEP]+Q+A) | 37.17 | 39.88
V/S + Q + A | Ours | ALBERT (V/S+.+Q+A) | 51.48 | 55.63

From the results, when we use both visual and language features, the accuracies obtained with the Transformers are improved compared with those obtained with LSTM. Among these Transformers, ALBERT has the best accuracy on the Pororo test set; in detail, it increases the accuracy by 6.73% compared with MDAM [17]. We also find that when we use RoBERTa, the accuracy of V + Q + A is lower than that of the other Transformers, and the accuracy of V/S + Q + A drops by approximately 14.13% compared with S + Q + A on the test set. We believe the reason for such a drop also lies in the structure of the visual concept features.

We find that the best accuracy of V + Q + A on the Pororo test set using Transformers is still 0.82% lower than MDAM [17], and the validation accuracies of BERT, RoBERTa, and ALBERT when using V/S + Q + A are lower than those using S + Q + A. We think the reason lies in the representation method of the visual features.

In MDAM [17], the visual features are directly retrieved as tensors from ResNet-152 pre-trained on the ImageNet set, while in our proposed model, the visual features are first retrieved as tensors, then classified, and finally represented as words by Faster-RCNN pre-trained on the Visual Genome set. Because Visual Genome is a real-world based set, while Pororo is a cartoon based dataset, there might be some classification errors in our visual concept features, making them not sufficiently accurate to give correct answer predictions.

To illustrate this clearly, we show some detected features from TVQA and Pororo frames in Fig. 4, where the features in the top line are from the TVQA dataset and those in the bottom line are from the Pororo dataset. From the figure, it can be seen that even if the scene contains more features, Faster-RCNN detects the features in the TVQA dataset accurately. In the Pororo dataset, it detects the photo album pages (bottom left) as window and the cartoon figure (bottom right) as yellow toy. Thus, our Faster-RCNN cannot retrieve cartoon features as accurately as real-world features, which should be responsible for the drop in validation accuracy when using V/S + Q + A.

Fig. 4. Examples of detected features in TVQA and Pororo datasets. The top line shows frames from the TVQA dataset, and the bottom line shows frames from the Pororo dataset.

6. Discussions

6.1. Training time and inferring time

We want to compare which model runs faster during training and inferring. Thus, we report the training time per epoch and estimate the inferring time per question by dividing the total time needed for inferring by the total number of questions to infer. Our experiments are performed on a computer with a Core i7 8700K CPU (3.70 GHz), 32 GB RAM, and an Nvidia TITAN RTX GPU. We keep the other configurations the same and only change the model we use. The training time per epoch and the inferring time per question with different Transformers on the TVQA dataset are listed in Table 5.

Table 5. Training time (in h:mm, per epoch) and inferring time (in ms, per question) for different Transformers on the TVQA dataset.
Input | Model | Training time | Inferring time
V + Q + A | TVQA [22] | 2:14 | 5.70
V + Q + A | BERT | 2:57 | 21.29
V + Q + A | XLNet | 4:13 | 30.46
V + Q + A | RoBERTa | 2:50 | 18.21
V + Q + A | ALBERT | 2:42 | 17.49
S + Q + A | TVQA [22] | 2:16 | 5.70
S + Q + A | BERT | 2:59 | 21.36
S + Q + A | XLNet | 4:15 | 30.46
S + Q + A | RoBERTa | 2:52 | 17.95
S + Q + A | ALBERT | 2:45 | 17.36
V/S + Q + A | TVQA [22] | 4:15 | 10.09
V/S + Q + A | BERT | 5:27 | 32.69
V/S + Q + A | XLNet | 7:45 | 54.77
V/S + Q + A | RoBERTa | 5:22 | 30.01
V/S + Q + A | ALBERT | 4:55 | 33.87

From the table, we can see that all the Transformer-based models need more time for training and inferring compared with the TVQA baseline [22]. This means the Transformers are more time- and memory-consuming. Also, the training time of V + Q + A is a bit shorter than that of S + Q + A, while the inferring time of V + Q + A is no shorter than that of S + Q + A, except when XLNet is used. Usually, the best validation accuracy can be obtained within 3 epochs in both flows of our model and the TVQA model.

6.2. Attention weights in different flows

We would like to take a deeper look into the mechanisms of answer prediction with Transformers. We select BERT, the best-performing Transformer on the TVQA dataset, and visualize its attention weights via BERTviz [41]. We pick a question as an example and visualize the attention weights of the visual and language flows at layers 1, 7, and 12 in Fig. 5, respectively. The answer to this question can be found in the language flow. Note that the height of the full attention visualization is very large, so we only show the focus (i.e., where the attention assembles) here. From the figure, the attention weights tend to assemble at the [CLS] mark at layer 1 of both flows, meaning BERT starts from [CLS] for sentence classification. They tend to assemble at the [SEP] mark at layer 7 of both flows, implying BERT is trying to find the correspondence between the former sentence (S/V + Q) and the latter sentence (A). However, the attention weights tend to assemble in different ways in the two flows at layer 12: they tend to assemble at the punctuation in the language flow while still at the [SEP] mark in the visual flow. This implies that, in the language flow, BERT has found the correspondence between the sentences and tends to know how the sentence is structured, but in the visual flow, it did not find the correspondence and is still trying to find it.

Fig. 5. Attention weights of the BERT model in the visual and language flow. (a) Visual flow, Layer 1, (b) Visual flow, Layer 7, (c) Visual flow, Layer 12, (d) Language flow, Layer 1, (e) Language flow, Layer 7, (f) Language flow, Layer 12. Note that the height of the full attention visualization is very large, so we only show the focus (i.e., where the attention assembles) here.
6.3. Evaluation with different sequence lengths

We also explore how different sequence lengths influence the video-QA accuracy.
We calculate the maximum number of words (Max), the minimum number of words (Min), the average number of words (Avg), and the percentage of sequences with more than 128, 256, and 512 words for the visual and language sequences of the TVQA test* set with time stamp annotations, as shown in Table 6.

Table 6. Statistics of the input sequences on the TVQA test* set.
Flow | Max | Min | Avg | >128 | >256 | >512
Visual flow | 527 | 20 | 135.73 | 45.11% | 3.94% | 0.01%
Language flow | 684 | 18 | 95.58 | 15.30% | 3.30% | 0.75%

From the table, 45.11% of the sequences have more than 128 words in the visual flow, and more than 15% of the sequences have more than 128 words in the language flow. We also find that the percentage of sequences with more than 512 words in both flows is less than 1%.
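The statistics in Table 6 amount to simple counting over the tokenized inputs of each flow, roughly as follows (the token lists are assumed to come from the same tokenizer used for training):

```python
def sequence_length_stats(token_lists):
    """token_lists: one list of tokens per question for a given flow."""
    lengths = [len(tokens) for tokens in token_lists]
    n = len(lengths)
    share = lambda limit: 100.0 * sum(1 for length in lengths if length > limit) / n
    return {
        "Max": max(lengths),
        "Min": min(lengths),
        "Avg": sum(lengths) / n,
        ">128": share(128), ">256": share(256), ">512": share(512),
    }
```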
is less than 1%.
We evaluate our model with three different values of L (128, 256, and 512) on the TVQA test* set with time stamp annotations using the different Transformers. The results are listed in Table 7.

Table 7. Accuracy results (in %) on the TVQA test* set using different L values (128, 256, and 512).
L | BERT | XLNet | RoBERTa | ALBERT
128 | 72.23 | 70.33 | 69.11 | 69.65
256 | 72.68 | 72.45 | 66.47 | 70.20
512 | 72.79 | 72.98 | 66.53 | 70.45

From the table, when L increases, the test* accuracies increase when using BERT, XLNet, and ALBERT. The reason is that L determines the amount of information conveyed into the network: when we increase L, the information conveyed into the network increases, making the network predict better. However, when we use RoBERTa, the accuracy drops by 2.64% when we increase L from 128 to 256 and by 2.58% when we increase L from 128 to 512. According to the statistics and previous results, about 45.11% of sequences have more than 128 words in the visual flow, which is approximately three times the percentage in the language flow. As RoBERTa is sensitive to the word sequences (i.e., the visual semantics), when we increase L to 256, the increased length of the word sequences may play an important role in the drop of accuracy. As only 3.94% of sequences have more than 256 words in the visual flow, which is slightly higher than the percentage in the language flow (3.30%), when we increase L to 512, the accuracy slightly increases compared with that when L is 256.
From the results above, even though the accuracy of XLNet is 2.54% lower than that of BERT when L = 128, when we increase L to 512, the performance of XLNet is slightly higher than that of BERT, meaning XLNet is more suitable for longer sequences. For RoBERTa, the performance might be influenced by the visual concept features, which are composed of words or phrases instead of whole sentences. For ALBERT, the performance is stable regardless of the length of the word sequences and the change in L.

Overall, we believe that there is no single Transformer suitable for all kinds of tasks. It depends on the detailed requirements to select the proper language Transformer to help the model perform better.
6.4. Qualitative results

Fig. 6. Successful and unsuccessful predictions on the TVQA and Pororo datasets. (1) to (4) are examples from the TVQA dataset; (5) and (6) are examples from the Pororo dataset. Note that the bounding boxes or words in blue are hints to the answer, which are annotated by us manually.

Finally, we show some examples of successful and unsuccessful predictions of the different Transformers on the TVQA and Pororo datasets in Fig. 6. The bounding boxes or the words in blue are hints to the correct predictions, which are annotated by us manually. To categorize by dataset, questions (1) to (4) are from the TVQA dataset, and questions (5) and (6) are from the Pororo dataset.
To categorize by type, questions (1), (3), and (5) are related to the visual features, while questions (2), (4), and (6) are related to the subtitles. Note that the subtitles in (4), (5), and (6) are long, so we cut off some unrelated parts. As the original subtitles in the Pororo dataset have no punctuation, we also add some necessary punctuation to make them easy to read.

From the figure, when we use BERT, XLNet, and ALBERT, our model gets six, five, and four correct predictions, respectively. When we use RoBERTa, our model only gets two correct predictions. We now discuss each case in detail.

In question (1), the scene is set in a park, and a couple is sitting on a bench. When using BERT, XLNet, and ALBERT, our model captures the visual features related to park and bench in the video scene, so it gives correct predictions; RoBERTa and the TVQA baseline cannot deal with the video scene correctly and hence give incorrect predictions. In question (2), the answer can be found in the subtitles, which explain the difference between comics and comic books. This question spans approximately 12 s and is quite challenging. All four Transformers give correct predictions according to the subtitles, but the TVQA baseline fails to give a correct prediction. In question (3), beer bottles are on the desk while the characters are talking in the video frame; BERT, XLNet, and RoBERTa can predict the answer correctly, but ALBERT and the TVQA baseline give wrong predictions. In question (4), Chandler says he is wiped, and another person says they should get going (leave). When BERT, XLNet, and the TVQA baseline are adopted, they relate get going to leave and give correct predictions, while RoBERTa and ALBERT cannot relate them, thereby generating incorrect predictions. In question (5), although the subtitle is saying Hold on. I will come save you, it is difficult to know who is to be saved (Tongtong). When looking at the video frames, we find it is Tutu. To answer this question correctly, the proposed model needs to understand the relationship between characters, which is not considered in our model. We believe BERT, XLNet, and ALBERT make use of the previous sentence (Tu tu, ugh) to find the answer. In question (6), the subtitle says they almost won some luck, so the answer is they are lucky. BERT and ALBERT can connect these two phrases and predict well, while XLNet and RoBERTa cannot connect them and hence give wrong predictions.

From these examples, our model is able to solve both visual and language related questions that cannot be solved by LSTM. The reason for this may be that the language Transformers we adopt use a self-attention bidirectional structure, making every word attend to its context on both sides, while in LSTM, the follow-up words in a long sentence may attend only weakly to words far before them. However, when the answer is not explicit in the video frames or the subtitles, or requires outside knowledge sources, our model gives bad predictions.

7. Conclusion

In this paper, we proposed an improved model for video-QA tasks. We used Transformer-based language representations for visual concept features and subtitles to capture the semantics in both video scenes and subtitles more accurately. Experiments were conducted to test the performance of our model by considering different Transformers, different input arrangements, subtitles with/without time stamp annotations, and different maximum lengths. Results showed that BERT performed the best on the TVQA dataset, improving the accuracy by 4.23%, while ALBERT performed the best on the Pororo dataset, improving the accuracy by 6.73% compared with previous models. We also found that the accuracy differs when we use different Transformers on the same dataset. The reason behind such differences is related to the characteristics of our task and the different pre-training settings of the different Transformers. As future work, we want to improve the visual feature representations so that the accuracy can be further improved.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Zekun Yang: Methodology, Software, Investigation, Writing - original draft. Noa Garcia: Investigation, Methodology, Writing - review & editing. Chenhui Chu: Methodology, Writing - review & editing. Mayu Otani: Methodology, Writing - review & editing. Yuta Nakashima: Supervision, Writing - review & editing. Haruo Takemura: Supervision.

Acknowledgements

This work was supported by JSPS KAKENHI No. 18H03264, China Scholarship Council and ACT-I, JST.

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[2] A. Burns, R. Tan, K. Saenko, S. Sclaroff, B.A. Plummer, Language features matter: effective language representations for vision-language tasks, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7474–7483.
[3] R. Cadene, H. Ben-Younes, M. Cord, N. Thome, Murel: multimodal relational reasoning for visual question answering, Proc. CVPR (2019) 1989–1998.
[4] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 670–680.
[5] Z. Dai, Z. Yang, Y. Yang, J.G. Carbonell, Q. Le, R. Salakhutdinov, Transformer-xl: attentive language models beyond a fixed-length context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2978–2988.
[6] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, D. Batra, Embodied question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2054–2063.
[7] P. Dasigi, N.F. Liu, A. Marasovic, N.A. Smith, M. Gardner, Quoref: a reading comprehension dataset with questions requiring coreferential reasoning, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5927–5934.
[8] J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
[9] M.A. Di Gangi, M. Negri, M. Turchi, Adapting transformer to end-to-end spoken language translation, in: INTERSPEECH 2019, International Speech Communication Association (ISCA), 2019, pp. 1133–1137.
[10] J. Gao, R. Ge, K. Chen, R. Nevatia, Motion-appearance co-memory networks for video question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6576–6585.
[11] N. Garcia, M. Otani, C. Chu, Y. Nakashima, KnowIT VQA: answering knowledge-based questions about videos, in: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
Z. Yang, N. Garcia, C. Chu et al. Neurocomputing 445 (2021) 121–133

[12] L. Gong, D. He, Z. Li, T. Qin, L. Wang, T. Liu, Efficient training of BERT by progressively stacking, International Conference on Machine Learning (2019) 2337–2346.
[13] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, in: Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[14] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.
[15] R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative answer prediction with pointer-augmented multimodal transformers for textvqa, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9992–10002.
[16] J. Kim, M. Ma, K. Kim, S. Kim, C.D. Yoo, Progressive attention memory network for movie story question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8337–8346.
[17] K.M. Kim, S.H. Choi, J.H. Kim, B.T. Zhang, Multimodal dual attention memory for video story question answering, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 673–688.
[18] K.M. Kim, M.O. Heo, S.H. Choi, B.T. Zhang, Deepstory: video story qa by deep embedded memory networks, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, AAAI Press, 2017, pp. 2016–2022.
[19] R. Kiros, Y. Zhu, R.R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, S. Fidler, Skip-thought vectors, Advances in Neural Information Processing Systems (2015) 3294–3302.
[20] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, J. Carlos Niebles, Dense-captioning events in videos, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 706–715.
[21] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: a lite bert for self-supervised learning of language representations, in: International Conference on Learning Representations, 2019.
[22] J. Lei, L. Yu, M. Bansal, T. Berg, Tvqa: localized, compositional video question answering, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1369–1379.
[23] J. Lei, L. Yu, T.L. Berg, M. Bansal, Tvqa+: spatio-temporal grounding for video question answering, 2019. arXiv preprint arXiv:1904.11574.
[24] L. Li, Y.C. Chen, Y. Cheng, Z. Gan, L. Yu, J. Liu, Hero: hierarchical encoder for video+language omni-representation pre-training, 2020. arXiv preprint arXiv:2005.00200.
[25] Z. Li, X. Ding, T. Liu, Story ending prediction by transferable bert, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, AAAI Press, 2019, pp. 1800–1806.
[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: a robustly optimized bert pretraining approach, 2019. arXiv preprint arXiv:1907.11692.
[27] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, Adv. Neural Inf. Process. Syst. (2019) 13–23.
[28] P. Lu, L. Ji, W. Zhang, N. Duan, M. Zhou, J. Wang, R-vqa: learning visual relation facts with semantic attention for visual question answering, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, 2018, pp. 1880–1889.
[29] D. Mamgai, S. Brodiya, R. Yadav, M. Dua, An improved automated question answering system from lecture videos, in: Proceedings of 2nd International Conference on Communication, Computing and Networking, Springer, 2019, pp. 653–659.
[30] B. McCann, J. Bradbury, C. Xiong, R. Socher, Learned in translation: contextualized word vectors, Adv. Neural Inf. Process. Syst. (2017) 6294–6305.
[31] A. Mishra, S. Shekhar, A.K. Singh, A. Chakraborty, Ocr-vqa: visual question answering by reading text in images, in: 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2019, pp. 947–952.
[32] J. Mun, P. Hongsuck Seo, I. Jung, B. Han, Marioqa: answering questions by watching gameplay videos, Proc. ICCV (2017) 2867–2875.
[33] J. Pennington, R. Socher, C. Manning, Glove: global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543.
[34] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswerable questions for squad, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 784–789.
[35] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. (2015) 91–99.
[36] A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, D. Parikh, Pythia-a platform for vision & language research, in: SysML Workshop, NeurIPS, 2018.
[37] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, M. Rohrbach, Towards vqa models that can read, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8317–8326.
[38] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, S. Fidler, Movieqa: understanding stories in movies through question-answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4631–4640.
[39] J. Turian, L. Ratinov, Y. Bengio, Word representations: a simple and general method for semi-supervised learning, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 384–394.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Proc. NIPS (2017) 5998–6008.
[41] J. Vig, A multiscale visualization of attention in the transformer model, 2019. arXiv preprint arXiv:1906.05714. URL: https://arxiv.org/abs/1906.05714.
[42] T. Winterbottom, S. Xiao, A. McLean, N. Al Moubayed, On modality bias in the tvqa dataset, 2020.
[43] Y. Wu, L. Jiang, Y. Yang, Revisiting embodiedqa: a simple baseline and beyond, IEEE Trans. Image Process. 29 (2020) 3984–3992.
[44] Y.C. Wu, C.H. Chang, Y.S. Lee, Clvq: cross-language video question/answering system, in: IEEE Sixth International Symposium on Multimedia Software Engineering, IEEE, 2014, pp. 294–301.
[45] Y.C. Wu, J.C. Yang, A robust passage retrieval algorithm for video question answering, IEEE Trans. Circuits Syst. Video Technol. 18 (2008) 1411–1421.
[46] H. Xu, K. Saenko, Ask, attend and answer: exploring question-guided spatial attention for visual question answering, in: European Conference on Computer Vision, Springer, 2016, pp. 451–466.
[47] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: neural image caption generation with visual attention, Proc. ICML (2015) 2048–2057.
[48] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R.R. Salakhutdinov, Q.V. Le, Xlnet: generalized autoregressive pretraining for language understanding, Advances in Neural Information Processing Systems (2019) 5754–5764.
[49] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, H. Takemura, Bert representations for video question answering, The IEEE Winter Conference on Applications of Computer Vision (2020) 1556–1565.
[50] X. Yin, V. Ordonez, Obj2text: generating visually descriptive language from object layouts, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 177–187.
[51] L. Yu, X. Chen, G. Gkioxari, M. Bansal, T.L. Berg, D. Batra, Multi-target embodied question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6309–6318.
[52] Y. Yu, H. Ko, J. Choi, G. Kim, End-to-end concept word detection for video captioning, retrieval, and question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3165–3173.
[53] K.H. Zeng, T.H. Chen, C.Y. Chuang, Y.H. Liao, J.C. Niebles, M. Sun, Leveraging video descriptions to learn video question answering, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[54] Z. Zhao, X. Jiang, D. Cai, J. Xiao, X. He, S. Pu, Multi-turn video question answering via multi-stream hierarchical attention context network, IJCAI (2018) 3690–3696.
[55] W. Zhong, J. Xu, D. Tang, Z. Xu, N. Duan, M. Zhou, J. Wang, J. Yin, Reasoning over semantic-level graph for fact checking, 2019. arXiv preprint arXiv:1909.03745.
[56] L. Zhu, Z. Xu, Y. Yang, A.G. Hauptmann, Uncovering the temporal context for video question answering, Int. J. Comput. Vision 124 (2017) 409–421.

Zekun Yang received his B.S. in Computer Science and Technology from Xidian University in 2015, and M.S. in Circuit and System from Lanzhou University in 2018, respectively. He is now a Ph.D. student at Graduate School of Information Science and Technology, Osaka University. His research interests are computer vision and natural language processing.

Noa Garcia received her B.S. degree in Telecommunication Engineering from Universitat Politècnica de Catalunya, Spain, in 2012, and her Ph.D. degree in Computer Science from Aston University, United Kingdom, in 2019. She is currently a researcher at Osaka University, Japan. Her research interests include applications on high-level visual understanding at the intersection of computer vision, natural language processing, and machine learning.


Chenhui Chu received his B.S. in Software Engineering from Chongqing University in 2008, and M.S. and Ph.D. in Informatics from Kyoto University in 2012 and 2015, respectively. He is currently a research assistant professor at Osaka University. His research interests center on natural language processing, particularly machine translation and multimodal machine learning.

Yuta Nakashima received the B.E. and M.E. degrees in communication engineering and the Ph.D. degree in engineering from Osaka University, Osaka, Japan, in 2006, 2008, and 2012, respectively. From 2012 to 2016, he was an Assistant Professor at the Nara Institute of Science and Technology. He is currently an Associate Professor at the Institute for Datability Science, Osaka University. His research interests include computer vision and machine learning and their applications. His main research includes video content analysis using machine learning approaches. He is a member of ACM, IEICE, and IPSJ.

Mayu Otani received the B.S. degree from Kyoto University in 2013, and the M.S. and Ph.D. degrees in engineering from the Nara Institute of Science and Technology in 2015 and 2018. She is currently a Research Scientist at CyberAgent, Inc. Her research interests include video understanding and multimodal machine learning.

Haruo Takemura received his B.E., M.E., and Ph.D. degrees from Osaka University in 1982, 1984, and 1987, respectively. He has been a Professor at the Cybermedia Center, Osaka University, since 2001. His research interests include Interactive Computer Graphics, Human–Computer Interaction, and Mixed Reality. He is a member of IEEE, ACM, IPSJ, and IEICE.
