
Image and Vision Computing 110 (2021) 104165


Visual question answering model based on graph neural network and contextual attention

Himanshu Sharma ⁎, Anand Singh Jalal
Department of Computer Engineering and Applications, GLA University, Mathura, India

⁎ Corresponding author. E-mail addresses: himanshu.sharma@gla.ac.in (H. Sharma), asjalal@gla.ac.in (A.S. Jalal).
https://doi.org/10.1016/j.imavis.2021.104165

Article history: Received 30 August 2020; Received in revised form 10 January 2021; Accepted 26 March 2021; Available online 29 March 2021

Keywords: Visual question answering; Computer vision; Natural language processing; Attention

Abstract

Visual Question Answering (VQA) has recently emerged as a hot research area at the intersection of computer vision and natural language processing. A VQA model uses both image and question features and fuses them to predict an answer to a given natural language question about an image. However, most VQA approaches using an attention mechanism mainly concentrate on extracting visual information from regions of interest for answer prediction and ignore the relations between the regions of interest as well as the reasoning among these regions. Apart from this limitation, VQA approaches also ignore the regions that were previously attended for answer generation, even though these previously attended regions can guide the selection of the subsequent regions of attention. In this paper, a novel VQA model is presented and formulated that utilizes this relationship between the regions and employs visual-context-based attention that takes into account the previously attended visual content. Experimental results demonstrate that the proposed VQA model boosts the accuracy of answer prediction on the publicly available datasets VQA 1.0 and VQA 2.0.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Due to recent progress in deep learning approaches, computer vision and natural language processing, the two subfields of Artificial Intelligence (AI), are growing tremendously. Visual question answering (VQA) requires fusing both image features and question features to predict an answer to a natural language question related to an image. Thus, VQA involves both computer vision and natural language processing. In particular, various VQA models have used attention-based mechanisms focusing on visually important content. There are two main limitations of these attention-based VQA approaches. First, these VQA methods attend only to particular image regions or salient semantic objects present in an image for answer prediction, but they usually ignore the relationships among those objects or regions, which could be used for more accurate answer prediction. Second, the majority of attention-based VQA models ignore the previously attended objects or image regions while predicting the current answer word by attending to the most relevant image region or object at the current time step. Thus, these VQA models may focus on the same image region multiple times and are not able to predict accurate answers.

To handle these limitations, the proposed work uses the Graph Neural Network (GNN) [1] for capturing the hidden visual relationships between semantic objects present in a given input image or between two or more regions of importance in an image. Also, we employ an attention mechanism that uses previously attended image regions to guide the answer prediction. The proposed model uses a deep CNN [2] to extract visual features from an input image. A GNN is used to build a visual relationship between all nodes (each region of interest), which are fully connected in an undirected way. Messages are sent between all nodes through all edges, generating visual relationship representations among the nodes of the graph. Word embeddings are used to represent the vector corresponding to each word in a question. Further, an attention model is used that keeps track of previously attended regions together with the current regions/objects of interest using a contextual Long Short Term Memory (LSTM) [3]. Then, we fuse both these attention weights to generate a combined attention model. In the final stage, the proposed model uses an LSTM-based answer generation model to predict the subsequent answer word using the previously generated answer and the visual relationship representations chosen by the attention model.

The key highlights of the proposed model can be listed as follows:

• The proposed VQA model uses the implicit relationship among semantic objects or significant image regions by employing a GNN model. Thus, the proposed model obtains fine-grained visual content/information and better representations of it.
• The proposed VQA model also employs an attention model that remembers the previously attended visual information together with the currently attended objects or regions of interest by using a contextual LSTM.
• The proposed VQA model is assessed on two well-known datasets: VQA 1.0 and VQA 2.0. The results demonstrate that the proposed VQA model predicts more accurate answers and surpasses the current state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 presents the work related to VQA, attention mechanisms and visual relationship models. Section 3 presents the proposed VQA model framework. Section 4 presents the datasets used, the experiments conducted, and the analysis of the results (both quantitative and qualitative). Finally, Section 5 presents the conclusion and future avenues.

2. Related works

2.1. Visual question answering

Visual Question Answering includes sub-problems from computer vision, natural language processing and relational reasoning. Many VQA models solve this problem by merging visual and textual features extracted from the image and the question respectively. These VQA models then predict the answer and thus treat VQA as a classification problem. Joint embedding methods that build representations in a common feature space are used by many image captioning models [4,5] and image annotation [6] tasks. These early methods use a CNN for image feature extraction and an RNN to obtain question features. Then these two feature vectors are merged to generate a combined image-question feature representation in the common space.

Zhou et al. suggested a straightforward VQA model named "iBOWIMG" [7] using a joint embedding approach. They extracted visual features by using pre-trained CNNs and textual features by using bag-of-words. Then both features are concatenated and given as input to a classification model to guess an answer. "VIS + LSTM" [8] was proposed by Ren et al., which used an encoder LSTM that takes both visual and textual features as input to produce a fixed-size image-question embedding representation.

The VQA methods discussed above include irrelevant or noisy information because they use a global feature for image representation. So, attention mechanisms have been adopted by recent VQA models [9–12]. Thus, these VQA models obtain improved visual and textual information by reducing noisy information with the help of the attention mechanism. Hence, these VQA models form a class of fine-grained joint embedding methods. Also, a better-fused image-question representation is obtained and thus improved answer prediction results are achieved.

Han et al. [13] proposed a new dataset known as PlotGraphs containing graph-based external knowledge about movies. They presented a model that uses the movie clip, subtitle and graph-based external knowledge to answer questions related to a movie. A module known as the Layered Memory Network (LMN) is used to represent the content of a movie. Another module, named the Plot Graph Representation Network (PGRN), is used to encode the meaningful information and relationships in the form of a graph. Wu et al. [14] proposed an encoder-decoder based video captioning model (ConvRS) by constructing a novel convolutional sequence model that captures sequential dependencies by encoding temporal sequences. Xi et al. [15] proposed a VQA model (MOVRD) that detects multiple relationships among objects. Their model utilizes the word vector similarity concept to represent the relationships among objects. The word mover's distance algorithm is applied to compute the relationship between word vectors, and a question-guided attention method is applied to focus on the relevant regions of the image. Hosseinabad et al. [16] presented a novel multiple-answer VQA model (EMQA) using a sliding window to generate the answer for a given question corresponding to different image regions. "ICon Question Answering (ICQA)" is the new dataset created for training and evaluating the EMQA model. To perform fine-grained multimodal fusion, Zhang et al. [17] gave a novel VQA model known as the Multimodal Deep Fusion Network (MDFNet). A Graph Reasoning and Fusion Layer (GRFL) is used to encode the semantic and spatial relationships among the objects and merge both relations effectively. Zhong et al. [18] discussed a new encoder-decoder pipeline employing a Self-Adaptive Neural Module Transformer (SANMT) as a replacement for the feed-forward encoder-decoder pipeline.

2.2. Attention mechanism

Bahdanau et al. [19] first used attention modules to improve the neural machine translation (NMT) task. Many computer vision and natural language processing tasks [20–23] have used attention mechanisms. Attention has been used in VQA models and has become a vital part of them [9–12]. Previous works in VQA mostly used top-down image attention, in which image regions are primarily attended using question guidance [24–26]. These models merge both the image and question features and represent them as vector representations. Each element of this fused vector conveys the significance of an image region. The final attended visual features are generated by calculating the average of all the attended visual features. Many VQA models employed an attention mechanism guided by images on question words and hence produce a co-attention mechanism. Nam et al. [27] presented dual attention networks (DANs) that focus on specific question words and corresponding image regions to collect necessary information from both feature vectors. Lu et al. used a hierarchical attention mechanism on both question and image for VQA [10]. A multilevel attention mechanism was proposed by Yu et al. [28] for extracting spatial information and a language model of the image. Gao et al. [29] suggested a VQA model that captures the semantics of the question, exhaustive object information and the correspondence between these two modalities; the model is known as Question-Led Object Attention (QLOB). Both bottom-up and top-down attention mechanisms are used by [30]. Objects in an image are detected using the bottom-up mechanism, while attention maps are calculated using the top-down mechanism over all detection boxes. The Transformer model was introduced by [31], first for the machine translation task, and has since been employed in many natural language processing applications. The core component of this model is the scaled dot-product attention mechanism. The scaled dot-product attention uses a query q ∈ ℝ^d, keys k_t ∈ ℝ^d and values v_t ∈ ℝ^d, where t ∈ {1, 2, 3, …, n} indexes the set of key-value pairs and d represents the dimension of all input features. The model computes the dot product of the query with all keys, divides each dot product by d^{1/2} and uses a softmax function to generate the attention weights. The attention function on the set of keys K = [k_1, k_2, …, k_n] ∈ ℝ^{n×d}, the set of values V = [v_1, v_2, …, v_n] ∈ ℝ^{n×d} and the set of queries Q = [q_1, q_2, …, q_m] ∈ ℝ^{m×d} is given by:

F_att = A(Q, K, V) = softmax(QK^T / d^{1/2}) V                                  (1)

where F_att ∈ ℝ^{m×d} represents the features obtained by the attention mechanism corresponding to the queries Q.

Apart from the scaled dot-product attention component, the Transformer uses a multi-head attention mechanism together with feed-forward networks (FFN). The multi-head attention (MHA) module attends to information from diverse representation subspaces. Further, the FFN module takes the output of the MHA module as input and transforms it by applying a ReLU activation function and dropout:

FFN(y) = FC(Dropout(ReLU(FC(y))))                                  (2)

A serious drawback of Transformer-based models arises with large-scale feature vectors: the attention weights computed in such cases fail to capture everything, creating biased attention or combinations of incorrect possibilities. Models such as MFH [32] and MCAN [33] that use Transformer-inspired attention predicted wrong answers, as these models may not be able to distinguish the keywords in questions. Also, these models may fail to recognize or classify some visual content of the image even though the question is easy to understand, and thus predict wrong answers.
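
To make Eqs. (1) and (2) concrete, the following is a minimal PyTorch sketch of scaled dot-product attention and the position-wise feed-forward block. The tensor shapes, hidden sizes and dropout rate are illustrative assumptions and do not reproduce the configuration of any of the cited models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (m, d) queries, K: (n, d) keys, V: (n, d) values, as in Eq. (1).
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # similarity of each query with all keys
    weights = F.softmax(scores, dim=-1)           # attention weights over the n key-value pairs
    return weights @ V                            # F_att: (m, d) attended features

class FeedForward(nn.Module):
    # FFN(y) = FC(Dropout(ReLU(FC(y)))), as in Eq. (2).
    def __init__(self, d_model=512, d_hidden=2048, p_drop=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, y):
        return self.fc2(self.dropout(F.relu(self.fc1(y))))

if __name__ == "__main__":
    Q, K, V = torch.randn(5, 512), torch.randn(36, 512), torch.randn(36, 512)
    print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([5, 512])
    print(FeedForward()(torch.randn(5, 512)).shape)      # torch.Size([5, 512])
```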


2.3. Visual relationship

The concept of visual relationship was investigated prior to the advancement of deep learning approaches. The methods discussed in [34–37] used relationships among objects (for example, the co-occurrence of two or more objects [34], or location and dimension [38]) for re-scoring detected objects. In order to improve the image segmentation task, [39,40] discussed spatial relationships (for example, "above", "around", "below", "inside", etc.) among two or more objects. In many computer vision tasks, visual relationships have played an important role, such as in generating captions for an image, better image search and finding objects in an image [41–43]. They have also been used to answer questions related to the synthetic images in the CLEVR dataset. Apart from spatial relationships, semantic relations are also utilized in the VQA task [44–46]. Neural networks have been designed to represent the visual relationships between objects in an image [47–49].

2.4. Graph neural network

A Graph Neural Network is a category of neural network which works directly on the graph structure. GNNs are used in node classification problems. Each node's previous hidden state and the information obtained from neighboring nodes are given as input to the GNN model, which outputs each node's updated hidden state at each time step. For each node, GNNs employ multi-layer perceptrons (MLPs) to update the current hidden state [1]. For graph-data learning tasks, the Gated Graph Neural Network (GGNN) proposed by [50] updates each node's hidden state by using gated recurrent units (GRU) trained with backpropagation. For graph data classification, [51,52] use CNNs that take node features together with the neighboring graph structure as input. Wang et al. [53] performed the action recognition task by using graph convolutional networks with reasoning based on correspondence and spatial–temporal relationships. Graph attention networks using self-attentional layers handle the limitations of the graph-convolution-dependent approaches. In this paper, the implicit visual relationship between salient objects or image regions is captured using a GNN.

3. Proposed model for visual question answering

The overall framework of the proposed VQA model is illustrated in Fig. 1. The proposed VQA model mainly consists of five modules, named Image Representation, Visual Relationship Representation, Question Representation, Attention Mechanism and Answer Generation. In the first module, ResNet101 [54] pre-trained on ImageNet is employed to obtain the image representation. In the second module, the GNN model F_GNN uses this image representation corresponding to different spatial locations to capture the visual relations among the semantic objects or regions of interest. For initializing each node in the graph, the spatial representations are given as input to the GNN model and, further, each node's information is updated recurrently using the hidden states of the other nodes, thus obtaining visual representations R = {r_1, r_2, …, r_n | r_i ∈ ℝ^m}. In the third module, the length of each question is first trimmed to 14 words and word embeddings are used to represent each word in vector form. Further, a Gated Recurrent Unit (GRU) [55] takes these word embeddings as input, and the final hidden state of the GRU is used to represent the input question. In the fourth module, the obtained question vector and the visual representations R are given as input to the visual context-aware attention model F_ATT. This context-conscious attention model uses an LSTM unit to consider the previously attended visual content at every time step t, which is used in selecting further unexplored visual content. In the last module, an LSTM-based answer generation model F_LSTM uses the hidden state (h_{t−1}) at time step t−1, the answer word (x_t) generated at time step t−1 and v_t (the output of F_ATT), and thus produces the hidden state (h_t) at time t to generate the subsequent answer word.

Fig. 1. Block diagram of the proposed VQA model. It consists of a CNN-based visual feature extraction module, a GNN-based visual relationship representation module, a visual context-aware attention module and an LSTM-based answer generation model. The visual relationship module generates three words (subject-relationship-object) by plotting the relationship adjacency matrix and the attention weight distribution over the set of nodes.


Thus, the steps involved in the proposed VQA model can be formulated as follows:

V = CNN(I)                                                        (3)

R = F_GNN(V)                                                      (4)

V′ = ReLU(W_v · R + b_R) + ReLU(W_Q · Q + b_Q)                    (5)

v_t = F_ATT(R, h_{t−1}, p_{t−1})                                  (6)

h_t = F_LSTM(h_{t−1}, x_t, v_t, q_t)                              (7)

ans_t = argmax_s softmax(W_o h_t + b_o)                           (8)

where the time step is denoted by t, ans_t represents the answer word generated at time t according to the highest softmax probability, W_o is a shared learned weight and b_o is the learned bias. At time t = 0, h_0 is set to zero. For the sake of simplicity, we have combined both the image and question representations in a single module.
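
The following is a schematic PyTorch sketch of the per-time-step decoding flow in Eqs. (6)–(8). The attention and relation modules are reduced to simple stand-ins (mean pooling, an LSTMCell and a linear classifier), and the dimensions, answer vocabulary size and start-token id are assumptions; only the control flow "attend → update the LSTM state → take the arg max of the softmax" follows the formulation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_rel, d_h, d_emb, n_answers, max_len = 2048, 1024, 300, 3129, 5

def attend(R, h_prev, p_prev):          # stand-in for F_ATT, Eq. (6)
    return R.mean(dim=0)                # (d_rel,)

lstm = nn.LSTMCell(d_emb + d_rel, d_h)  # stand-in for F_LSTM, Eq. (7)
w_o = nn.Linear(d_h, n_answers)         # classifier of Eq. (8)
embed = nn.Embedding(n_answers, d_emb)  # embedding of the previous answer word x_t

R = torch.randn(100, d_rel)             # relation-aware features from the GNN
h, c = torch.zeros(1, d_h), torch.zeros(1, d_h)   # h_0 = 0, as stated in the text
x = torch.zeros(1, dtype=torch.long)    # assumed <start> token id

answer = []
for t in range(max_len):
    v_t = attend(R, h, None).unsqueeze(0)                   # Eq. (6)
    h, c = lstm(torch.cat([embed(x), v_t], dim=1), (h, c))  # Eq. (7)
    ans_t = F.softmax(w_o(h), dim=1).argmax(dim=1)          # Eq. (8)
    answer.append(ans_t.item())
    x = ans_t                                               # feed the predicted word back in
print(answer)
```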

3.1. Image representation

The proposed VQA model uses the ResNet101 CNN-based model [54], pre-trained on ImageNet, to obtain the image representation from its last convolutional layer as a set of nonlinear activations denoted as V = {v_1, v_2, …, v_n | v_i ∈ ℝ^m}. To make the output dimension of each image the same, a spatially adaptive average pooling method is used. Thus, 10 × 10 × 2048 is the dimension of the final convolutional layer of ResNet-101. Therefore, the input image is divided into 100 spatial-location indices.
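
A minimal sketch of this step, assuming a torchvision ResNet-101 backbone: the convolutional feature map is adaptively pooled to a 10 × 10 grid so that each image yields 100 spatial locations of 2048-dimensional activations. Image preprocessing is omitted and may differ from the authors' setup.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-101 (newer torchvision versions use the weights= argument instead
# of pretrained=True to obtain the ImageNet weights used in the paper).
resnet = models.resnet101(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc, keep conv stack
pool = nn.AdaptiveAvgPool2d((10, 10))                      # spatially adaptive average pooling

image = torch.randn(1, 3, 448, 448)                        # a dummy pre-processed RGB image
with torch.no_grad():
    feat = pool(backbone(image))                           # (1, 2048, 10, 10)
V = feat.flatten(2).transpose(1, 2)                        # (1, 100, 2048): v_1 ... v_100
print(V.shape)
```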

3.2. Visual relationship representation

A Graph Neural Network is a category of neural network which works directly on the graph structure. Inspired by the GGNN [50], which is used for graph-data learning tasks and updates each node's hidden state by using Gated Recurrent Units (GRU), we expand the idea to discover the inherent visual relationships among objects in images. For a given image, the GNN model is used to instantiate a graph G containing N nodes representing the spatial locations of the image extracted using a deep CNN. To fully extract the relationships among the nodes of the graph, the proposed model uses a graph that is completely connected and computes the edge strength between neighboring graph nodes. An adjacency matrix A is formed from these edge strengths, which are given by the probabilities of a relationship occurring between any two nodes of the graph. The edge strength S_{p,q} between any two nodes v_p and v_q is given as:

S_{p,q} = σ(Conv_edge(|v_p − v_q|))                               (9)

Here Conv_edge represents a convolutional layer with kernel size 1. The input to Conv_edge is the absolute difference of the node features, and a sigmoid function is then applied.

A non-linear transformation is applied to trim down the size of the image representations, and the proposed model uses the transformed vector to initialize each node's hidden state:

v_a^t = φ(W_a v_a + b_a)                                          (10)

h_a^0 = β(v_a^t)                                                  (11)

where φ and β are tanh activation functions. Each node's initial hidden state is denoted by h_a^0, where a is a node of the graph. The vector associated with a spatial location is represented by v_a ∈ V. The learned weights and bias are denoted by W_a and b_a respectively.

The messages sent by the hidden states of the neighboring nodes of a given node a are collected at each time step t:

x_a^t = ∑_{(d,a)∈X} (W_g h_d^{t−1} + b_g)                         (12)

where the learned shared weights over all nodes are represented by W_g, the learned bias by b_g, and X is the set of neighboring nodes taken from the adjacency matrix A.

The GNN uses a Gated Recurrent Unit (GRU) after collecting all incoming messages. For updating each node's hidden state, the GRU includes a reset gate r and an update gate z, as follows:

z_a^t = σ(W_z x_a^t + U_z h_a^{t−1} + b_z)                        (13)

r_a^t = σ(W_r x_a^t + U_r h_a^{t−1} + b_r)                        (14)

h̃_a^t = ρ(W_h x_a^t + U_h (r_a^t ⊙ h_a^{t−1}) + b_h)             (15)

h_a^t = (1 − z_a^t) ⊙ h_a^{t−1} + z_a^t ⊙ h̃_a^t                  (16)

where W and U represent learned shared weights, b represents the bias term, σ denotes the element-wise logistic sigmoid function, ⊙ denotes element-wise multiplication between matrices and ρ is a non-linear activation function (tanh). To control the information from the previous as well as the current hidden state, we employ the reset gate r together with the update gate z.
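
The following sketch, under assumed dimensions and number of propagation steps, illustrates Eqs. (9)–(16): edge strengths from a kernel-size-1 convolution over absolute feature differences, followed by message passing and a GRU-style node update (torch.nn.GRUCell stands in for the gated update of Eqs. (13)–(16)).

```python
import torch
import torch.nn as nn

class RelationGNN(nn.Module):
    """Sketch of the visual-relationship module of Eqs. (9)-(16); dimensions and
    the number of propagation steps are illustrative assumptions."""
    def __init__(self, d_in=2048, d_h=1024, steps=3):
        super().__init__()
        self.transform = nn.Linear(d_in, d_h)               # Eqs. (10)-(11): shrink and initialise h_a^0
        self.conv_edge = nn.Conv2d(d_in, 1, kernel_size=1)  # Conv_edge of Eq. (9)
        self.message = nn.Linear(d_h, d_h)                  # W_g, b_g of Eq. (12)
        self.gru = nn.GRUCell(d_h, d_h)                     # gated update of Eqs. (13)-(16)
        self.steps = steps

    def forward(self, V):                                   # V: (n, d_in) spatial features
        diff = (V.unsqueeze(1) - V.unsqueeze(0)).abs()      # (n, n, d_in) pairwise |v_p - v_q|
        A = torch.sigmoid(
            self.conv_edge(diff.permute(2, 0, 1).unsqueeze(0))
        ).squeeze()                                         # (n, n) edge strengths S_{p,q}
        h = torch.tanh(self.transform(V))                   # h_a^0, Eq. (11)
        for _ in range(self.steps):
            x = A @ self.message(h)                         # messages weighted by edge strengths (soft Eq. (12))
            h = self.gru(x, h)                              # reset/update gates, Eqs. (13)-(16)
        return h                                            # relation-aware representations R

if __name__ == "__main__":
    print(RelationGNN()(torch.randn(100, 2048)).shape)      # torch.Size([100, 1024])
```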

3.3. Question representation

To enhance effectiveness, the length of each question is first trimmed to 14 words. As suggested by [56], only 0.25% of the questions in any VQA dataset exceed 14 words. Questions shorter than 14 words are padded with zero vectors, and for questions longer than 14 words the extra words are discarded. Each word in the question is represented by a 300-dimensional vector using pre-trained GloVe word embeddings [57]. These word embeddings are passed through a GRU with a d_q-dimensional hidden state. The proposed model represents the input question q by the final hidden state Q ∈ ℝ^{d_q}. The purpose of using a GRU for question representation is that it uses fewer parameters for training. Also, a GRU uses less memory, and trains and executes faster than other question representation approaches.
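
A minimal sketch of the question encoder described above; the toy vocabulary and randomly initialised embedding table are assumptions (in the actual model the table would be filled with pre-trained GloVe vectors).

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM, HID_DIM = 14, 300, 1024
vocab = {"<pad>": 0, "what": 1, "is": 2, "the": 3, "man": 4, "holding": 5, "?": 6}

embedding = nn.Embedding(len(vocab), EMB_DIM, padding_idx=0)
gru = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)

def encode_question(tokens):
    ids = [vocab.get(t, 0) for t in tokens][:MAX_LEN]   # truncate beyond 14 words
    ids += [0] * (MAX_LEN - len(ids))                   # zero-pad shorter questions
    emb = embedding(torch.tensor(ids).unsqueeze(0))     # (1, 14, 300)
    _, h_final = gru(emb)                               # final hidden state of the GRU
    return h_final.squeeze(0)                           # Q: (1, 1024)

print(encode_question("what is the man holding ?".split()).shape)
```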


3.4. Attention mechanism

The proposed model uses the LSTM-based answer generation model with previous hidden state h_{t−1}, with p_{t−1} as the visual content attended at time step t−1 and R as the implicit relationship representation obtained using the GNN; the model computes the attention weight (att_t) between (0, 1) for the image signal R using a non-linear activation function (the softmax function):

z_t = w_ATT^T tanh(U_ATT R + W_ATT h_{t−1} + M_ATT p_{t−1} + b_ATT)   (17)

att_t = softmax(z_t)                                                  (18)

where U_ATT, W_ATT and M_ATT are the learned shared weights and b_ATT represents the bias term. Further, the attention model uses the visual representations R obtained using the GNN, so the attention model focuses on the implicit visual relationships at each time step. Also, the current normalized weight att_t is fused with the previously generated weight att_{t−1} using an interpolation gate ig_t:

ig_t = σ(W_k h_{t−1} + b_k)                                           (19)

att_t = ig_t · att_t + (1 − ig_t) · att_{t−1}                         (20)

where the learned shared weight is denoted by W_k, the bias term by b_k, and σ represents the element-wise logistic sigmoid function. When the value of the gate ig_t is equal to zero, the model uses the previously generated weight and ignores the current normalized weight. When the value of the gate ig_t is equal to one, the model ignores the last generated weight and employs the current normalized weight to choose the appropriate visual information. The model computes the attended visual signal v_t by combining all relation-aware visual representations:

v_t = ∑_{i=1}^{n} att_{i,t} (R)_i                                     (21)

When the attended visual information is obtained, it is forwarded to the LSTM model, which can memorize the visual information chosen by the proposed attention model. This context information is used to guide the attention weight choice in the subsequent time step:

p_t = qLSTM(p_{t−1}, v_t)                                             (22)
pt ¼ qLSTM ðpt−1 , vt Þ ð22Þ datasets contain more than 50% of the questions that are of other
categories.
If the occurrence of a particular answer is above 9 times in the train-
3.5. Answer generation ing data, it becomes a candidate answer for both the datasets: VQA 1.0
and VQA 2.0. The total number of candidate answers available in both
The proposed VQA model has the capability to generate multi-word VQA 1.0 and VQA 2.0 are 2185 and 3129 respectively. Thus, the pro-
answers, thus the process of answer generation can be formulated as a posed VQA model corresponds to a classifier having 2185- labels or
sentence generation task. Assume that a question Q is of length m, Q 3129-labels. The proposed model is trained on training and validation
= {q1, q2, ……qm}, and answer A is of length n, A = {a1, a2……, an}. For splits. The results are computed on both test-dev and test-standard
the generated answer sentence, log-likelihood can be formulated as: data partitions. Accuracy (Acc) of a given answer α is computed as:

n number ðα Þ
log probðA=I, Q Þ ¼ ∑ log probðak =a1:k−1 , I, Q Þ ð23Þ Accðα Þ ¼ min ,1 ð31Þ
3
t¼1

here, number (α): answer α selected by distinct annotators.


Here, prob(ak/a1:k−1, I, Q) denotes the probability of producing ak,
when given previously generated answer words a1:k−1, image I and
question Q.
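
A sketch of the answer-generation cell of Eqs. (24)–(30), with assumed dimensions: a standard LSTM cell extended with an adaptive gate ag_t that modulates how much of the attended visual signal v_t enters the gate and memory computations. Packing the four gates into one linear layer is an implementation convenience, not part of the formulation.

```python
import torch
import torch.nn as nn

class VisualGateLSTMCell(nn.Module):
    """Sketch of the answer-generation cell in Eqs. (24)-(30)."""
    def __init__(self, d_x, d_v, d_h):
        super().__init__()
        self.gates = nn.Linear(d_x + d_h, 4 * d_h)     # W, U terms for i_t, f_t, o_t, m~_t
        self.visual = nn.Linear(d_v, 4 * d_h)          # M_i, M_f, M_o, M_c terms
        self.adaptive = nn.Linear(d_x + d_h, d_h)      # adaptive gate ag_t, Eq. (24)

    def forward(self, x_t, v_t, state):
        h_prev, m_prev = state
        xh = torch.cat([x_t, h_prev], dim=1)
        ag = torch.sigmoid(self.adaptive(xh))                          # Eq. (24)
        pre = self.gates(xh) + ag.repeat(1, 4) * self.visual(v_t)      # ag_t ⊙ M v_t in Eqs. (25)-(28)
        i, f, o, m_hat = pre.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o) # Eqs. (25)-(27)
        m_hat = torch.tanh(m_hat)                                      # Eq. (28)
        m = i * m_hat + f * m_prev                                     # Eq. (29)
        h = o * torch.tanh(m)                                          # Eq. (30)
        return h, (h, m)

if __name__ == "__main__":
    cell = VisualGateLSTMCell(d_x=300, d_v=1024, d_h=1024)
    x, v = torch.randn(1, 300), torch.randn(1, 1024)
    h, _ = cell(x, v, (torch.zeros(1, 1024), torch.zeros(1, 1024)))
    print(h.shape)   # torch.Size([1, 1024])
```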
To generate the answer sentence, the proposed model uses a variant
of LSTM. Different from previous VQA models, an adaptive gate agt is Table 1
employed to decide whether visual information can be given to the Setting of hyperparameters.
LSTM iterations. The proposed VQA model uses a variation of the basic Hyperparameters Value
LSTM by including additional visual gate unit apart from three conven-
The hidden layer dimension in GRU 1024
tional gates (input gate it, forget unit ft and output gate ot) and a mem- Dimension for encoding 300
ory cellmt. The input to the variation of LSTM is xt: word embedding, ht Rate of learning 0.0002
−1: the hidden state at time t-1 and vt : the visual signal attended. The The learning rate decreasing interval 3
mathematical formulation of the LSTM-based answer-sentence genera- Factor by which learning rate is decreased 0.8
Dropout Factor 0.5
tion model at time t is given as follows:
Batch size 512
  Epochs (Maximum Number) 15
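
A minimal implementation of the accuracy of Eq. (31); note that the official VQA evaluation additionally averages this score over subsets of nine annotators, which this sketch omits.

```python
from collections import Counter

def vqa_accuracy(predicted, human_answers):
    """Accuracy of Eq. (31): full credit if at least 3 of the 10 annotators gave the answer."""
    count = Counter(a.lower().strip() for a in human_answers)[predicted.lower().strip()]
    return min(count / 3.0, 1.0)

print(vqa_accuracy("tennis", ["tennis"] * 4 + ["baseball"] * 6))    # 1.0
print(vqa_accuracy("baseball", ["tennis"] * 8 + ["baseball"] * 2))  # 0.666...
```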

Table 1
Setting of hyperparameters.

Hyperparameter                               Value
The hidden layer dimension in GRU            1024
Dimension for encoding                       300
Rate of learning                             0.0002
The learning rate decreasing interval        3
Factor by which learning rate is decreased   0.8
Dropout factor                               0.5
Batch size                                   512
Epochs (maximum number)                      15

4.2. Experimental setup

The proposed model is implemented with the PyTorch library. The proposed model uses the Adamax solver with a mini-batch size of 256. A warm-up approach is applied to set the learning rate. The learning rate is initialized to 0.002 for the first training epoch and increases at each epoch until 15 epochs. After this, it decays every 3 epochs by a factor of 0.8. Dropout (with a dropout factor of 0.5) is employed after every fully connected layer to avoid over-fitting. Questions are encoded using a vector of size 300. We set the hidden state of the GRU to 1024 and the batch size to 512. We have applied the non-linear activation function, the rectified linear unit: ReLU(p) = max(p, 0). The summary of the most important hyperparameters is shown in Table 1.
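
A sketch of the optimisation schedule described above, assuming a linear warm-up shape (the paper only states that the learning rate increases each epoch until epoch 15) and an arbitrary stand-in model; the training-loop body is omitted.

```python
import torch

model = torch.nn.Linear(10, 2)                   # stand-in for the VQA network
base_lr, warmup_epochs, total_epochs = 0.002, 15, 30

optimizer = torch.optim.Adamax(model.parameters(), lr=base_lr)

def lr_at(epoch):
    if epoch < warmup_epochs:                    # assumed linear warm-up from base_lr
        return base_lr * (epoch + 1)
    # decay by 0.8 every 3 epochs after the warm-up phase
    return base_lr * warmup_epochs * (0.8 ** ((epoch - warmup_epochs) // 3 + 1))

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... one training epoch with cross-entropy loss and dropout 0.5 would run here ...
```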

4.3. Ablation study

The proposed VQA model is a combination of various modules and requires the setting of important hyper-parameters. To assess the contribution of the different modules, an ablation test is conducted. For both datasets, variant versions of the proposed VQA model are trained on the training split and their accuracy is assessed on the validation split. The different variations of the proposed VQA model are as follows:

• Baseline model: It only uses image features and question features. Both feature vectors are passed through a fully-connected layer. Further, they are merged using element-wise multiplication.
• Baseline + Rel (Visual Relationship): It uses visual features, question features and the visual relationship between objects or image regions.
• Baseline + CAA (Context-Aware Attention): It uses visual features, question features and the context-aware attention mechanism.
• Baseline + Rel + CAA: This is the full version of the proposed VQA model, which uses the visual relationship and context-aware attention modules for generating the answer sentence.

Table 2 shows the ablation study of each module used and the hyperparameters employed by the proposed model. Dim(I,Q) represents the dimension of the vector obtained by fusing the image and question representations, dim(Rel) represents the dimension of the visual relationship module and dim(CAA) denotes the dimension of the contextual-attention module. We have tried dim(I,Q) = 512, dim(I,Q) = 1024 and dim(I,Q) = 2048 for our Baseline model. The best performance is obtained for dim(I,Q) = 1024. Similarly, for the Baseline + Rel model, dim(Rel) = 256 gives the best performance. For the Baseline + CAA model, the gain in performance is observed when the dimension of the contextual attention module, i.e. dim(CAA), is equal to 3072. We have used the element-wise multiplication strategy for fusion as it outperforms the addition fusion strategy. Hence, we have used dim(I,Q) = 1024, dim(Rel) = 256 and dim(CAA) = 3072 in our final proposed VQA model.

Fig. 2 shows the training losses of our Baseline model (dim(I,Q) = 1024), Baseline + Rel model (dim(I,Q) = 1024, dim(Rel) = 256), Baseline + CAA model (dim(I,Q) = 1024, dim(CAA) = 3072) and Baseline + Rel + CAA model (dim(I,Q) = 1024, dim(Rel) = 256, dim(CAA) = 3072). Fig. 3 demonstrates the validation accuracies for the Baseline model, Baseline + Rel model, Baseline + CAA model and Baseline + Rel + CAA model with the same hyperparameter values respectively.

Fig. 2. Training loss (cross-entropy) versus epochs for the proposed VQA model.

Fig. 3. Accuracy versus epochs for the proposed VQA model.

Table 2
Hyperparameters and ablation study of each module of the proposed model. Bold rows indicate the hyperparameters or module used in the final proposed model.

Model                                          Accuracy   Size of model
Baseline
  dim(I,Q) = 512                               57.31      15.6 M
  dim(I,Q) = 1024                              58.12      21.7 M
  dim(I,Q) = 2048                              57.96      48.7 M
Baseline + Rel (dim(I,Q) = 1024)
  dim(Rel) = 128                               63.11      22.2 M
  dim(Rel) = 192                               63.32      22.5 M
  dim(Rel) = 256                               63.61      22.8 M
  dim(Rel) = 384                               63.43      23.7 M
Baseline + CAA (dim(I,Q) = 1024)
  dim(CAA) = 1024                              64.22      24.7 M
  dim(CAA) = 2048                              65.56      27.7 M
  dim(CAA) = 3072                              65.91      30.7 M
  dim(CAA) = 4096                              64.96      33.7 M
Baseline + Rel + CAA (dim(I,Q) = 1024, dim(Rel) = 256, dim(CAA) = 3072)
  Fusion using addition                        67.36      33.8 M
  Fusion using element-wise multiplication     67.82      33.8 M

4.4. Comparison with state-of-the-art VQA models

Table 3 demonstrates the accuracy of the proposed VQA model on the VQA 1.0 dataset and compares it with recent VQA models. A single model is used to obtain these results on the training + validation split. The results contained in Table 3 are divided into four sections:

• First section: the VQA models [7,59] that do not use an attention mechanism.
• Second section: the VQA models [9–11,25,27,28,59] that use an attention mechanism.
• Third section: the VQA models [12,32,60–62,64] that use an attention mechanism with pre-trained word embeddings like GloVe and Skip-thought vectors.
• Last section: the proposed model.

The main observations from Table 3 can be listed as follows.

First, the baseline version of the proposed VQA model is compared to the Baseline + Rel model. The results demonstrate that when we use the visual relationship module with the baseline model, it enhances the answer prediction capability of the VQA model by a significant margin. To be more specific, the proposed Baseline + Rel model improves the overall performance on the test-dev split by 3.42 and on the test-standard split by 3.57 when compared to the baseline model.
the baseline + Rel model. The results demonstrate that when we use


Table 3
Comparison of the proposed VQA model with the state-of-the-art models on VQA 1.0.

                         test-dev                               test-standard
Model                    Number   Yes/No   Other   Overall      Number   Yes/No   Other   Overall
Zhou et al. [7]          35.03    76.55    42.6    55.72        34.98    76.76    42.62   55.89
Antol et al. [58]        36.77    80.5     43.08   57.75        36.53    80.57    43.73   58.16
Yang et al. [9]          37.32    80.87    43.12   58.7         37.53    80.8     43.48   58.24
Illievski et al. [25]    36.16    79.3     45.77   59.24        –        –        –       58.9
Lu et al. [10]           38.7     79.7     51.7    61           –        –        –       62.1
Nam et al. [27]          39.1     83       53.9    64.3         38.1     82.8     54      64.2
Kazemi et al. [11]       39.1     82.2     55.2    64.5         39.1     82       55.2    64.6
Yu et al. [28]           40.2     83.8     53.7    64.6         40.9     83.7     53.7    64.8
Wang et al. [59]         38.4     81.5     53      63.1         38.2     81.4     53.2    63.3
Fukui et al. [60]        37.6     82.5     55.6    64.7         –        –        –       –
Kim et al. [61]          38.21    84.14    54.87   65.08        37.9     84.02    54.77   65.07
Yu et al. [62]           39.8     84       56.2    65.9         38.9     83.8     56.3    65.8
Yu et al. [32]           39.7     85       57.4    66.8         39.5     85       57.4    66.9
Nguyen et al. [12]       41.66    84.48    57.44   66.83        41.27    84.61    56.83   66.66
Zhang et al. [64]        41.43    84.36    58.71   67.37        41.33    84.18    58.58   67.33
Baseline                 39.63    82.11    50.35   63.6         39.31    82.32    51.34   64.34
Baseline + Rel           40.66    83.56    56.76   66.72        40.14    83.12    57.12   67.91
Baseline + CAA           41.21    83.78    58.12   67.96        40.87    83.78    58.23   68.11
Baseline + Rel + CAA     42.2     84.12    58.81   68.21        41.43    84.16    59.12   68.35

Table 4
Comparison of performance of our models with the state-of-the-art methods on VQA 2.0.

                                    test-dev                               test-standard
Model                               Number   Yes/No   Other   Overall      Number   Yes/No   Other   Overall
Goyal et al. [63] (Prior)           –        –        –       –            0.36     61.20    1.17    25.98
Goyal et al. [63] (Language only)   –        –        –       –            31.55    67.01    27.37   44.26
Goyal et al. [63] (LSTM+CNN)        –        –        –       –            35.18    73.46    41.83   54.22
Goyal et al. [63] (MCB)             –        –        –       –            38.28    78.82    53.36   62.27
Nguyen et al. [12]                  46.60    83.50    56.72   66.60        46.93    83.89    56.90   67.00
Teney et al. [56]                   44.21    81.82    56.05   65.32        43.90    82.20    56.26   65.67
Zhang et al. [64]                   45.51    83.31    58.41   67.20        44.96    83.39    58.49   67.34
Baseline                            41.21    80.12    50.79   62.23        39.23    79.67    51.34   62.76
Baseline + Rel                      42.34    82.23    56.23   66.78        42.34    81.87    56.78   66.76
Baseline + CAA                      44.64    83.34    57.28   67.32        43.56    81.98    58.11   67.11
Baseline + Rel + CAA                46.12    84.12    58.13   67.96        45.21    83.34    58.87   67.98

Fig. 4. Attention map visualization. The first column depicts the original image. The second column shows the top-three attended regions. The third column shows the answer obtained by our visual relationship and context-aware attention VQA model.

Second, the comparison between the baseline version of the proposed VQA model and the Baseline + CAA model is presented. Again, the results indicate that the use of the context-aware attention mechanism boosts the accuracy of the baseline version of the proposed VQA model by a considerable margin. To be more specific, the proposed Baseline + CAA model improves the performance on the test-dev partition and the test-standard partition by 4.36 and 3.77 respectively as compared to the proposed baseline model. This can also be verified by comparing the models shown in the first and second sections of the table, which supports the fact that models employing an attention mechanism gain better performance over those without an attention mechanism.


Fig. 5. The visual relationship module generates three words (subject-relationship-object) by plotting the relationship adjacency matrix and the attention weight distribution over the set of nodes.

Fig. 6. Examples of the proposed VQA model results on VQA 1.0 dataset. Row 1 shows the original image. Row 2 shows the attended objects in bounding boxes obtained by our visual
relationship and context-aware attention model.

Third, the proposed VQA model that uses both the visual relationship module and the context-aware attention mechanism outperforms all the previous VQA models. As shown in Table 3, the complete VQA model obtains an overall accuracy of 68.21 on the test-dev partition and 68.35 on the test-standard partition.

In Table 4, the proposed model is compared with the current state-of-the-art on the VQA 2.0 dataset. In this table, all compared VQA models are partitioned into two sections. The first section shows the results of the VQA models that do not use a visual relationship module. The second section contains the results of models employing a visual relationship module. The proposed VQA model (Baseline + Rel + CAA) outperforms all the previous models, with or without the visual relationship concept. The proposed VQA model achieves better accuracy on all the question types: Number by 3.2%, Yes/No by 3.4% and Other by 4.6%.


Fig. 7. Examples of the proposed VQA model results on VQA 2.0 dataset. Row 1 shows the original image. Row 2 shows the attended objects in bounding boxes obtained by our visual
relationship and context-aware attention model.

Fig. 8. Comparison of the proposed model with the MCAN model. The MCAN model fails to distinguish the keywords in the question (the word 'left' in the left example and the word 'catcher' in the right example).

It is also observed that the proposed VQA model gains improved performance on both datasets, VQA 1.0 and VQA 2.0. The proposed VQA model generates answers more accurately for questions belonging to the category 'other'. Such questions start with 'which', 'what', 'where', 'why' or 'who'. The results for the 'Number' and 'Yes/No' category questions are also encouraging and comparable. It can be argued that the questions belonging to the category 'other' require the visual relationship modeling concept to predict the answers, which is why the proposed model includes the visual relationship and context-aware attention modules for generating the answers. It is also worth noticing that half of the questions included in both datasets are of the category 'other'. This is the reason why the proposed model gains better accuracy over the state-of-the-art models in spite of the fact that it does not gain much better performance on the questions belonging to the categories 'Number' or 'Yes/No'.

Fig. 4 shows the attention maps generated by the attention module. The model is able to identify the most significant objects based on the visual relationship reasoning module. It can be observed that our model is able to extract rich visual relationships between objects, as captured by humans.

In Fig. 5, we have plotted the adjacency matrix in order to understand and visualize the visual relationships. The probabilities of the relationships occurring between the graph nodes are represented by the learned edge strengths. We have also plotted the attention weight distribution over a set of nodes to understand the learned relationships for producing the three words (subject-relation-object). In Fig. 6, for example, man-raising-bat demonstrates the tight relationships among the three typically attended graph nodes. Thus, our relationship module is able to capture visual relationships similar to those captured by humans.

Fig. 6 shows the results on the VQA 1.0 dataset and Fig. 7 shows the results on the VQA 2.0 dataset. Our Baseline + CAA model attends to the image regions guided by the question words and gives them the highest weight, but fails to focus on the right objects that could be helpful in predicting the accurate answer. Our Baseline + Rel + CAA model, employing both the visual relationship and context-aware attention, is able to attend to the right objects and the relationship between those objects to answer the question. Thus, it can be observed that the proposed VQA model can extract rich visual relationships as captured by humans. Models such as MFH [32] and MCAN [33] that use Transformer-inspired attention predicted wrong answers, as these models may not be able to distinguish the keywords in questions. Also, these models may fail to recognize or classify some visual contents of the image even though the question is easy to understand, and thus predict wrong answers. In Figs. 8 and 9, we compare our model with the MCAN and MFH models.


Fig. 9. Comparison of the proposed model with the MFH model. The MFH model fails to distinguish the keywords in the question (the word 'sitting' in the left example) and makes a wrong classification (the word 'meat' in the right example).

Fig. 10. Examples of incorrect answers generated by the proposed model due to its inability to read text present in images (left) and its failure to establish the correct relationship (right).

Fig. 10 shows failure cases of our model due to its inability to read text present in images. Sometimes the model may also establish a wrong relationship between the objects, as shown in the second column of Fig. 10. Sharma et al. [65] used external knowledge for the task of image captioning and were motivated to use it for the task of VQA in the future.

5. Conclusion

In this paper, we have presented a novel VQA model which uses the implicit relationship among semantic objects or significant regions in an image by employing a GNN model. Thus, the proposed model obtains fine-grained visual content/information and better representations of it. The proposed VQA model also employs an attention model that remembers the previously attended visual information together with the currently attended objects or regions of interest by using a contextual LSTM. The proposed VQA model is assessed on two well-known datasets: VQA 1.0 and VQA 2.0. The results demonstrate that the proposed VQA model predicts more accurate answers and performs better than the current state-of-the-art methods. In the future, explicit visual relationships can also be integrated into the proposed VQA model. We can also use the proposed VQA model for answering questions related to videos, which needs detailed analysis and investigation.

Compliance with ethical standards

Conflict of interest: The authors declare that they have no conflict of interest.

Credit author statement

Himanshu Sharma: Conceptualization, Formal analysis, Writing - original draft.
Anand Singh Jalal: Supervision, Writing - review & editing.

Declaration of Competing Interest

None.

References

[1] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Trans. Neural Netw. 20 (1) (2009) 61–80.
[2] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Learning Representations (ICLR), 2015.
[3] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[4] J. Donahue, A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 677–691.
[5] J. Mao, W. Xu, Y. Yang, J. Wang, A. Yuille, K. Murphy, Deep captioning with multimodal recurrent neural networks, ICLR, 2015.
[6] W. Zhang, H. Hu, H. Hu, Training visual-semantic embedding network for boosting automatic image annotation, Neural Process. Lett. 48 (3) (2018) 1503–1519.
[7] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, R. Fergus, Simple baseline for visual question answering, 2015, arXiv:1512.02167v2.
[8] M. Ren, R. Kiros, R. Zemel, Image question answering: a visual semantic embedding model and a new dataset, NIPS, 2015, pp. 2953–2961.


[9] Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks for image question answering, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 21–29.
[10] J. Lu, J. Yang, D. Batra, D. Parikh, Hierarchical question-image co-attention for visual question answering, NIPS, 2016, pp. 289–297.
[11] V. Kazemi, A. Elqursh, Show, ask, attend, and answer: a strong baseline for visual question answering, 2017, arXiv:1704.03162v2.
[12] D. Nguyen, T. Okatani, Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6087–6096.
[13] Y. Han, B. Wang, R. Hong, F. Wu, Movie question answering via textual memory and plot graph, IEEE Trans. Circuits Syst. Video Technol. 30 (3) (2020) 875–887.
[14] A. Wu, Y. Han, Y. Yang, Q. Hu, F. Wu, Convolutional reconstruction-to-sequence for video captioning, IEEE Trans. Circuits Syst. Video Technol. 30 (11) (2020) 4299–4308.
[15] Yuling Xi, Yanning Zhang, Songtao Ding, Shaohua Wan, Visual question answering model based on visual relationship detection, Signal Process. Image Commun. 80 (2020) 115648.
[16] Sayedshayan Hashemi Hosseinabad, Mehran Safayani, Abdolreza Mirzaei, Multiple answers to a question: a new approach for visual question answering, Vis. Comput. (2020) 1–13.
[17] W. Zhang, J. Yu, Y. Wang, W. Wang, Multimodal deep fusion for image question answering, Knowl.-Based Syst. 106639 (2020).
[18] Z. Huasong, J. Chen, C. Shen, H. Zhang, J. Huang, X.S. Hua, Self-adaptive neural module transformer for visual question answering, IEEE Trans. Multimedia (2020).
[19] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, ICLR, 2015.
[20] L.Z. Zhang, S. Zhang, X. Yang, Cross-modality interactive attention network for multispectral pedestrian detection, Info. Fusion 50 (2019) 20–29.
[21] D.C. Kim, L. Hoang, A.M. Rush, Structured attention networks, ICLR, 2017.
[22] S.N. Vaswani, N. Parmar, Attention is all you need, Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5998–6008.
[23] B.J. Xu, R. Kiros, et al., Show, attend and tell: neural image caption generation with visual attention, International Conference on Machine Learning (ICML), 2015, pp. 2048–2057.
[24] K.J. Shih, S. Singh, D. Hoiem, Where to look: focus regions for visual question answering, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4613–4621.
[25] I. Ilievski, S. Yan, J. Feng, A focused dynamic attention model for visual question answering, 2016, arXiv:1604.01485.
[26] Z. Chen, Z. Yanpeng, H. Shuaiyi, T. Kewei, M. Yi, Structured attentions for visual question answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1300–1309.
[27] H. Nam, J.W. Ha, J. Kim, Dual attention networks for multimodal reasoning and matching, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2156–2164.
[28] D. Yu, J. Fu, Y. Rui, T. Mei, Multi-level attention networks for visual question answering, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4187–4195.
[29] L. Gao, L. Cao, X. Xu, J. Shao, J. Song, Question-led object attention for visual question answering, Neurocomputing 391 (2020) 227–233.
[30] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and VQA, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6077–6086.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
[32] Z. Yu, J. Yu, J. Fan, D. Tao, Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering, IEEE Transactions on Neural Networks and Learning Systems, vol. 99, no. 12, 2018.
[33] Z. Yu, J. Yu, Y. Cui, D. Tao, Q. Tian, Deep modular co-attention networks for visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6281–6290.
[34] S.K. Divvala, D. Hoiem, J.H. Hays, A.A. Efros, M. Hebert, An empirical study of context in object detection, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[35] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, PAMI, 2010.
[36] M.J. Choi, A. Torralba, A.S. Willsky, A tree-based context model for object recognition, PAMI, 2012.
[37] R. Mottaghi, X. Chen, X. Liu, N.G. Cho, S.W. Lee, S. Fidler, R. Urtasun, A. Yuille, The role of context for object detection and semantic segmentation in the wild, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[38] I. Biederman, R.J. Mezzanotte, J.C. Rabinowitz, Scene perception: detecting and judging objects undergoing relational violations, Cogn. Psychol. 14 (1982).
[39] C. Galleguillos, A. Rabinovich, S. Belongie, Object categorization using co-occurrence, location and appearance, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[40] S. Gould, J. Rodgers, D. Cohen, G. Elidan, D. Koller, Multi-class segmentation with relative location prior, IJCV, 2008.
[41] H. Fang, S. Gupta, F. Iandola, R.K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J.C. Platt, From captions to visual concepts and back, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[42] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. Forsyth, Every picture tells a story: generating sentences from images, Proceedings of the European Conference on Computer Vision (ECCV), 2010.
[43] T. Yao, Y. Pan, Y. Li, T. Mei, Exploring visual relationship for image captioning, Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[44] V. Ramanathan, C. Li, J. Deng, W. Han, Z. Li, K. Gu, Y. Song, S. Bengio, C. Rosenberg, L. Fei-Fei, Learning semantic relationships for better action retrieval in images, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[45] M.A. Sadeghi, A. Farhadi, Recognition using visual phrases, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[46] S.K. Divvala, A. Farhadi, C. Guestrin, Learning everything about anything: webly-supervised visual concept learning, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[47] C. Lu, R. Krishna, M. Bernstein, L. Fei-Fei, Visual relationship detection with language priors, Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[48] B. Dai, Y. Zhang, D. Lin, Detecting visual relationships with deep relational networks, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[49] H. Zhang, Z. Kyaw, S.F. Chang, T.S. Chua, Visual translation embedding network for visual relation detection, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[50] Y. Li, D. Tarlow, M. Brockschmidt, R. Zemel, Gated graph sequence neural networks, ICLR, 2016.
[51] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, ICLR, 2017.
[52] R. Li, S. Wang, F. Zhu, J. Huang, Adaptive graph convolutional neural networks, AAAI, 2018.
[53] X. Wang, A. Gupta, Videos as space-time region graphs, Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 399–417.
[54] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[55] M.B. Cho, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, EMNLP, 2014.
[56] D. Teney, P. Anderson, X. He, A. Hengel, Tips and tricks for visual question answering: learnings from the 2017 challenge, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4223–4232.
[57] S.R. Pennington, C. Manning, GloVe: global vectors for word representation, EMNLP, 2014.
[58] A.A. Antol, J. Lu, M. Mitchell, VQA: visual question answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[59] W.Q. Wang, C. Shen, The VQA-machine: learning how to use existing vision algorithms to answer new questions, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[60] A. Fukui, D.H. Park, D. Yang, A. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, EMNLP, 2016.
[61] O.K. Kim, W. Lim, Hadamard product for low-rank bilinear pooling, ICLR, 2017.
[62] Z. Yu, J. Yu, J. Fan, D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[63] K.T. Goyal, D. Summers-Stay, D. Batra, D. Parikh, Making the V in VQA matter: elevating the role of image understanding in visual question answering, Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[64] W. Zhang, J. Yu, H. Hu, H. Hu, Z. Qin, Multimodal feature fusion by relational reasoning and attention for visual question answering, Info. Fusion 55 (2020).
[65] H. Sharma, A.S. Jalal, Incorporating external knowledge for image captioning using CNN and LSTM, Modern Phys. Lett. B 34 (28) (2020) 2050315, 12 pages.

