https://doi.org/10.1007/s10462-020-09832-7
Abstract
Visual question answering (VQA) is a task that has received considerable attention
from two major research communities: computer vision and natural language processing.
It has recently been widely accepted as an AI-complete task that can serve as an alternative
to the Visual Turing Test. In its most common form, it is a challenging multi-modal task
in which a computer must provide the correct answer to a natural language question
asked about an input image. It has attracted many deep learning researchers following the
remarkable achievements of deep learning in text, voice and vision technologies. This review
extensively and critically examines the current status of VQA research in terms of step-by-step
solution methodologies, datasets and evaluation metrics. Finally, this paper discusses future
research directions for each of these aspects of VQA separately.
1 Introduction
Matt King of Facebook says, “But from my perspective as a blind user, going from essentially
zero percent satisfaction from a photo to somewhere in the neighborhood of half … is a huge
jump”, commenting on Facebook’s effort to automatically caption photos for blind users. This
suggests that it would be of great value if machines were intelligent enough to understand
image contents and communicate this understanding as effectively as humans. VQA is a
stepping stone to this Artificial Intelligence-dream (AI-dream) of Visual Dialogue. In the most
common form of Visual Question Answering (VQA), the computer is presented with an image and
a textual question about this image. The machine’s task is then to generate the correct answer,
typically a few words or a short phrase. That is, VQA is a task guided by mature research in
computer vision (CV) and natural language processing (NLP), both of which fall under the
domain of AI. In the words of Devi Parikh, a VQA researcher, it is a great combination of
pictures, words and common sense as shown in Fig. 1.
* Sruthy Manmadhan
sruthym.88@gmail.com
Binsu C. Kovoor
binsukovoor@gmail.com
1 Division of Information Technology, Cochin University of Science and Technology, Kochi, Kerala,
India
Compared to other vision-language tasks such as image captioning and text-to-image
retrieval, VQA is more challenging because: (1) The questions are not predetermined. In
other tasks, the question to be answered is fixed, and so are the operations required to
answer it; only the image changes. (2) The supporting visual information is very high
dimensional. (3) VQA necessitates solving many computer vision sub-tasks. Some are given in
Table 1, with a representative question in the second column. In this respect, VQA can be used
as an alternative to the Visual Turing Test for computer vision systems (Geman et al. 2015);
in other words, VQA is an “AI-complete” task, as it demands multi-modal knowledge beyond
a single domain.
After the striking success of convolutional neural networks such as AlexNet (Krizhevsky
et al. 2012), ZFNet (Zeiler and Fergus 2014), VGGNet (Simonyan and Zisserman 2014),
GoogLeNet (Szegedy et al. 2015) and ResNet (He et al. 2016) in the ImageNet challenge, and
the well-known results produced by RNNs in text processing (Cho et al. 2014; Sutskever
et al. 2014; Prakash et al. 2019), many researchers became enthusiastic about working on
VQA.
Another factor that attracts researchers is the vast number of potential applications
of VQA. One of the most socially relevant and direct applications is helping blind users to
communicate with pictures. VQA can also be used to improve image retrieval, which online
shopping sites can exploit commercially to attract customers by returning exact results
for their search queries. Incorporation of VQA may increase the popularity of online
Visual question answering: a state‑of‑the‑art review 5707
1. This survey serves as a pocket reference for experts and as a beginner’s guide for learners
who are interested in understanding VQA problem, its intelligent solutions, and available
datasets and evaluation methods.
2. This is the first phase-wise review on VQA which provides extensive analysis and
comparison of different methodologies utilized in completion of different steps of VQA
including image featurization, question featurization and joint comprehension of image
and question features to generate answer.
3. It discusses and analyzes various publicly available VQA datasets, categorizes them
based on the nature of the images and questions they contain, and identifies the limitations
of each dataset to draw researchers’ attention to the need for new datasets.
4. It summarizes the available evaluation metrics for VQA solutions together with their
formulas and also discusses the possibility of using some new metrics.
5. Finally, this review lists possible and required future research directions in this area to
attain the AI-dream.
The rest of this paper is organized as follows. Sections 2 and 3 give detailed
explanations of the phase I tasks, image featurization and question featurization,
respectively. Section 4 covers methods, from the simplest to the most sophisticated, for
combining the multi-modal features extracted in phase I. Section 5 gives an overview of the
datasets available to researchers for testing their models, along with their characteristics.
For any problem with multiple solutions, performance evaluation is critical, and different
metrics for evaluating VQA solutions are discussed in Sect. 6. Finally, Sect. 7 identifies
gaps in the existing literature and points to future research directions.
2 Image featurization
One of the two preparatory tasks of VQA is image featurization. An image feature describes
an image as a numerical vector so that mathematical operations can be applied to it easily.
There are many ways to do this explicitly, such as a simple RGB vector, Scale-Invariant
Feature Transform (SIFT) (Lowe 1999), HAAR Transform (Lienhart and Maydt 2002) and
Histogram of Oriented Gradients (HOG) (Dalal and Triggs 2005). In the deep learning era,
explicit featurization is no longer required, since features are learned by the deep neural
network itself. Training deep learning models from scratch requires large datasets and
significant computational resources. Using pre-trained deep neural network models to extract
relevant features from images makes this task much easier.
One of the best neural networks for image featurization is the convolutional neural network
(CNN). Table 2 gives details of some prominent and widely accepted state-of-the-art CNN
variants for this task: year of introduction, depth of the model in number of layers, size of
the input image, size of the feature vector extracted from the last fully connected layer,
and the reported error. These models were initially trained and benchmarked on the ImageNet
dataset, where they outperformed their competitors by a large margin.
Most state-of-the-art VQA models use these successful CNNs with their last layer removed,
sometimes followed by normalization (Kafle and Kanan 2016; Saito et al. 2017; Fukui et al.
2016) and dimensionality reduction (Kafle and Kanan 2016; Ma et al. 2016; Antol et al. 2015),
to represent visual content as a numerical vector. Table 3 maps VQA models to the ImageNet
winners they use; the columns list five CNN models and the rows indicate major VQA systems.
Overall statistics of the pre-trained CNNs used for VQA image featurization are shown in
Fig. 3. From this, researchers in this field can quickly see that VGGNet and ResNet have been
widely used in VQA systems. One reason people prefer VGGNet is that it extracts features that
are slightly more general and more effective for datasets other than ImageNet, on which these
models are trained. Other reasons include quick convergence on fine-tuning and a simple
implementation compared with the Inception-style architecture of GoogLeNet and the residual
connections of ResNet. Readers can easily notice a trend of migrating from VGG to ResNet in
recent papers, because sufficient computational resources are now available at a reasonable cost.
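The post-processing steps mentioned above (normalization followed by dimensionality reduction) can be illustrated with a minimal sketch. It assumes the CNN activations have already been extracted; here they are replaced by random numbers, and the helper names `l2_normalize` and `pca_reduce`, the feature size 4096 and the target size 4 are illustrative choices, not taken from any of the cited systems. PCA is used as one possible reduction method.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each row (one image's feature vector) to unit L2 norm."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

def pca_reduce(features, n_components):
    """Project row vectors onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in for pre-extracted CNN activations: 8 images, 4096 values each (random, hypothetical).
rng = np.random.default_rng(0)
cnn_features = rng.standard_normal((8, 4096))

normalized = l2_normalize(cnn_features)
reduced = pca_reduce(normalized, n_components=4)
print(normalized.shape, reduced.shape)
```

In a real pipeline the random matrix would be replaced by the activations of the chosen pre-trained network, and the projection would be fitted on training images only.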
Table 2 Overview of predominant CNN models trained on ImageNet
Columns: Successful CNN models; Year; Number of layers; Input dimension; Output dimension
(number of features); Reported error
Rows (model names only): ZFNet; VGGNet; GoogLeNet
3 Question featurization
The task of VQA demands word embeddings for question representation because most
of the Machine Learning algorithms and almost all Deep Learning Architectures are inca-
pable of processing strings or plain text in their raw form.
Primarily, word or text embedding methods can be categorized into three groups: (1) count
based methods, (2) prediction based methods and (3) hybrid methods. All embedding
methods take as input an extensive text collection from the Internet or from publications,
which forms a corpus. The set of all unique words in a corpus forms its vocabulary (V), and
the result of embedding is a representation for every word in V.
Simplest of all is one-hot encoding of words, which results in a vector of size |V|. Example
of one-hot encoding is shown in Fig. 4 where the top portion gives details of the used cor-
pus and then gives the embedding vector for each word in the corpus.
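One-hot encoding can be sketched in a few lines. The example below assumes the single-sentence corpus of Fig. 4 (lower-cased here for simplicity) and orders the vocabulary by first appearance; the function name `one_hot` is illustrative. It also computes the Euclidean distance and cosine similarity between two distinct words, which come out the same for every such pair.

```python
import numpy as np

corpus = "visual question answering is a task of answering a question based on a given image"

# Vocabulary in order of first appearance (dict.fromkeys drops repeats, keeps order).
vocab = list(dict.fromkeys(corpus.split()))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

a, b = one_hot("question"), one_hot("image")
euclidean = np.linalg.norm(a - b)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(len(vocab), euclidean, cosine)
```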
The main drawback of this embedding is that it does not capture the notion of similarity:
the Euclidean distance between any two distinct words represented using one-hot embedding
is always √2 and the cosine similarity is always 0. This motivated researchers in the field
to follow the quote of Firth, J. R. (1957): “You shall know a word by the company it
keeps”. This leads to distributional-similarity-based word embeddings built on a
co-occurrence matrix (Miller and Charles 1991). A co-occurrence matrix is a terms × terms
matrix that captures the number of times a term appears in the context of another term. The
context is defined as a window of k words around the term, so this can be viewed as a
word × context matrix. For example, assume the same corpus given in Fig. 4. Then the
co-occurrence matrix will look like Fig. 5. Each row (column) of the co-occurrence matrix
gives a vector representation of the corresponding word (context).
Some words (say, stop words or other uninformative words) can be excluded from the set of
contexts, since their counts would otherwise be very high; the number of columns will then be
less than the number of rows. Such a table is highly sparse, as most frequencies are
equal to zero.
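The construction of a word × context co-occurrence matrix, including the optional removal of stop words from the contexts, can be sketched as follows. The window size of 2, the stop-word set and the function name `cooccurrence_matrix` are assumptions made for illustration; the sketch is not claimed to reproduce Fig. 5 exactly.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2, stop_words=()):
    """Count how often each context word appears within `window` positions of each word."""
    vocab = list(dict.fromkeys(tokens))                 # rows: all unique words
    contexts = [w for w in vocab if w not in stop_words]  # columns: contexts kept
    row = {w: i for i, w in enumerate(vocab)}
    col = {w: j for j, w in enumerate(contexts)}
    counts = np.zeros((len(vocab), len(contexts)), dtype=int)
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            ctx = tokens[ctx_pos]
            if ctx_pos != pos and ctx in col:
                counts[row[word], col[ctx]] += 1
    return vocab, contexts, counts

tokens = "visual question answering is a task of answering a question based on a given image".split()

# Full word x word matrix (symmetric for a symmetric window).
vocab, _, full = cooccurrence_matrix(tokens, window=2)

# Dropping a hypothetical stop-word list from the contexts leaves fewer columns than rows.
_, contexts, reduced = cooccurrence_matrix(tokens, window=2, stop_words={"is", "a", "of", "on"})
print(full.shape, reduced.shape)
```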
In practice, the co-occurrence counts are converted to probabilities. Still, the size of this
matrix grows with the size of V. A simple mathematical solution is to use a low-rank
approximation of the original co-occurrence matrix given by singular value decomposition
(SVD) (Eckart and Young 1936). SVD provides the best rank-k approximation of the
Corpus: Visual Question Answering is a task of answering a question based on a given image. |V| = 11
1 2 3 4 5 6 7 8 9 10 11
visual 1 0 0 0 0 0 0 0 0 0 0
question 0 1 0 0 0 0 0 0 0 0 0
answering 0 0 1 0 0 0 0 0 0 0 0
is 0 0 0 1 0 0 0 0 0 0 0
a 0 0 0 0 1 0 0 0 0 0 0
task 0 0 0 0 0 1 0 0 0 0 0
of 0 0 0 0 0 0 1 0 0 0 0
based 0 0 0 0 0 0 0 1 0 0 0
on 0 0 0 0 0 0 0 0 1 0 0
given 0 0 0 0 0 0 0 0 0 1 0
image 0 0 0 0 0 0 0 0 0 0 1
1 2 3 4 5 6 7 8 9 10 11
visual 1 1 1 0 0 0 0 0 0 0 0
question 1 0 2 1 1 0 0 1 1 0 0
answering 1 2 0 1 1 1 1 0 0 0 0
is 0 1 1 0 1 1 0 0 0 0 0
a 0 1 2 1 0 1 2 1 0 0 0
task 0 0 1 1 1 0 1 0 0 0 0
of 0 0 1 0 2 1 0 0 0 0 0
based 0 1 0 0 2 0 0 0 1 0 0
on 0 1 0 0 1 0 0 1 0 1 0
given 0 0 0 0 0 0 0 1 1 0 1
image 0 0 0 0 1 0 0 0 0 0 0
original data (Co-occurrence matrix: A) as given in Eq. 1 and further described in Fig. 6.
This is done by keeping only the top k left singular vectors (k columns of $U$), the top k
singular values (k rows and k columns of $S$) and the top k right singular vectors (k rows of
$V^{T}$, so that $V_k^{T}$ has size $k \times n$). It discovers latent semantics in the corpus.
$$A_k = U_k S_k V_k^{T} \qquad (1)$$
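The truncation in Eq. 1 can be tried out with NumPy. This is a minimal sketch on a random stand-in for the co-occurrence matrix; the function name `truncated_svd` and the choice of using the rows of $U_k S_k$ as k-dimensional word vectors are assumptions (a common convention, not something stated in the text).

```python
import numpy as np

def truncated_svd(matrix, k):
    """Return the rank-k approximation U_k S_k V_k^T and k-dimensional row embeddings."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u_k, s_k, vt_k = u[:, :k], s[:k], vt[:k, :]
    approx = u_k @ np.diag(s_k) @ vt_k
    word_vectors = u_k * s_k  # one k-dimensional row per word (assumed convention)
    return approx, word_vectors

# Hypothetical 11 x 11 count matrix standing in for a real co-occurrence matrix.
rng = np.random.default_rng(0)
counts = rng.integers(0, 3, size=(11, 11)).astype(float)

approx, word_vectors = truncated_svd(counts, k=3)
error = np.linalg.norm(counts - approx)  # Frobenius norm of the residual
print(word_vectors.shape, error)
```

By the Eckart and Young result cited above, the Frobenius error of this approximation equals the square root of the sum of the squared discarded singular values.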
The methods discussed so far are called count based models because they use the
co-occurrence counts of words. The next category includes techniques that learn word
representations directly; these are called (direct) prediction based models.
These models use a neural network as their basic component.
To create a distributed representation of words, Xu and Rudnicky (2000) built a neural
network model that is considered the starting point of this line of work. However,
the classic model in this category is the three-layer network introduced by Bengio et al.
(2003), shown in Fig. 7. Plenty of later works build on this model.
Fig. 7 Neural architecture: $f(i, w_{t-1}, \ldots, w_{t-n+1}) = g(i, C(w_{t-1}), \ldots, C(w_{t-n+1}))$,
where g is the neural network and C(i) is the i-th word feature vector (Bengio et al. 2003)
Mikolov et al. (2013a) proposed two state-of-the-art prediction based word embedding
models, continuous bag-of-words (CBOW) and skip-gram. Google subsequently released
them as an open-source tool named word2vec, which is widely used. In CBOW, the problem of
predicting the nth word given the previous (n − 1) words is modeled as multi-class
classification using a feed-forward neural network. For example, assume the above-mentioned
single-sentence corpus and consider the task of predicting the word ‘answering’ (the nth
word) given the words ‘visual question’ (the previous n − 1 words). The input to the network
is the concatenated one-hot representation of the context words (visual, question) and the
output is a probability distribution over all |V| possible words (classes). The basic
structure of the CBOW network for this example is given in Fig. 8a. In short, it predicts an
output word given a bag of context words. After training, the weight matrix between the
hidden layer and the output layer (Wword) is taken as the word vector representation, where
each of its |V| columns represents one word vector. Since CBOW can use many context words to
predict a single target word, it smooths over much of the distributional information, which
makes it suitable mainly when the input data is small.
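The forward pass of the CBOW example above (context ‘visual question’, target ‘answering’) can be sketched as follows. The weights are random and untrained, the hidden size of 5 is arbitrary, and the names `W_context` and `cbow_forward` are illustrative; only the overall structure (concatenated one-hot input, hidden layer, softmax over |V| classes) follows the description in the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

vocab = ["visual", "question", "answering", "is", "a", "task",
         "of", "based", "on", "given", "image"]
index = {w: i for i, w in enumerate(vocab)}
V, n_context, hidden = len(vocab), 2, 5  # hidden size is an arbitrary choice

rng = np.random.default_rng(0)
W_context = rng.standard_normal((n_context * V, hidden)) * 0.1  # input -> hidden
W_word = rng.standard_normal((hidden, V)) * 0.1                 # hidden -> output

def cbow_forward(context_words):
    """Concatenate the one-hot context vectors and return a distribution over V."""
    x = np.zeros(n_context * V)
    for slot, word in enumerate(context_words):
        x[slot * V + index[word]] = 1.0
    h = x @ W_context
    return softmax(h @ W_word)

probs = cbow_forward(["visual", "question"])
loss = -np.log(probs[index["answering"]])  # cross-entropy for the true next word
print(probs.shape, loss)
```

Training would adjust both weight matrices to reduce this loss over all (context, target) pairs in the corpus; after training, column j of `W_word` would serve as the vector of word j.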
Skip-gram (Fig. 8b) models just the opposite situation: it predicts the context words
on both sides (say, ‘visual’ and ‘answering’) of the input word (say, ‘question’).
In a later paper, Mikolov et al. (2013b) proposed various extensions to the
basic skip-gram model to avoid expensive operations at the output layer. One widely
accepted extension is negative sampling, which is used in word2vec. A study
by Levy et al. (2015) reveals that the factors which help prediction based models achieve
good results can be transferred to traditional distributional models, yielding similar
gains.
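A minimal sketch of the two ideas above: generating (input word, context word) training pairs for skip-gram, and scoring one pair with a negative sampling loss instead of a full softmax over the vocabulary. The function names, the window size of 1 and the random vectors are illustrative assumptions; how negatives are sampled is left out of the sketch.

```python
import numpy as np

def skipgram_pairs(tokens, window=1):
    """List (center, context) pairs for every word and its neighbours within the window."""
    pairs = []
    for pos, center in enumerate(tokens):
        for ctx_pos in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
            if ctx_pos != pos:
                pairs.append((center, tokens[ctx_pos]))
    return pairs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Reward a high score for the true context and low scores for sampled negatives."""
    positive = -np.log(sigmoid(context_vec @ center_vec))
    negative = -np.log(sigmoid(-(negative_vecs @ center_vec))).sum()
    return positive + negative

pairs = skipgram_pairs(["visual", "question", "answering"], window=1)

# Random stand-ins for the embeddings of one center word, its context and 3 negatives.
rng = np.random.default_rng(0)
center, context = rng.standard_normal(4), rng.standard_normal(4)
negatives = rng.standard_normal((3, 4))
loss = negative_sampling_loss(center, context, negatives)
print(pairs, loss)
```

The loss touches only one positive and a handful of negative output vectors per pair, which is what avoids the expensive normalization over all |V| words.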
3.3 Hybrid methods
Count based methods rely on global co-occurrence counts from the corpus for computing
word representations. Prediction based methods learn word representations from local
co-occurrence information (context windows). Pennington et al. (2014) proposed global vectors
(GloVe), which combines both to produce a word embedding. They proposed a weighted least
squares model trained on global information from the co-occurrence matrix (only on the
nonzero elements, rather than on the entire sparse matrix or on individual context windows in
a large corpus).
A comparative study of all these techniques can be found in Table 4, which states the merits
and demerits of each of the models described in Sects. 3.1, 3.2 and 3.3.
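The weighted least squares idea behind GloVe can be made concrete with a small sketch that evaluates the objective on the nonzero entries of a toy co-occurrence matrix. The weighting function and its constants (`x_max=100`, `alpha=0.75`) follow the form commonly attributed to Pennington et al. (2014) and should be treated as assumptions here, as should the function names and the toy data.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the weight of very frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(counts, W, W_ctx, b, b_ctx):
    """Weighted squared error between w_i . w_j + b_i + b_j and log X_ij, nonzero X_ij only."""
    rows, cols = np.nonzero(counts)
    x = counts[rows, cols]
    pred = np.sum(W[rows] * W_ctx[cols], axis=1) + b[rows] + b_ctx[cols]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)

# Toy symmetric co-occurrence matrix with two nonzero entries (hypothetical data).
counts = np.array([[0.0, 2.0], [2.0, 0.0]])
W = np.zeros((2, 3))       # word vectors
W_ctx = np.zeros((2, 3))   # context vectors
zero_bias = np.zeros(2)

loss = glove_loss(counts, W, W_ctx, zero_bias, zero_bias)
print(loss)
```

Training would minimize this loss with respect to the vectors and biases, for example by stochastic gradient descent over the nonzero entries.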
In the deep learning era, VQA researchers have also used the convolutional neural network
(CNN) (Krizhevsky et al. 2012), long short-term memory (LSTM) (Hochreiter and
Schmidhuber 1997) and the gated recurrent unit (GRU) (Cho et al. 2014) to extract the
question representation.
A CNN used for question feature extraction (Kim 2014) takes as input the concatenated
vector representations of all ‘n’ words of the question. It then applies multiple
convolutional filters followed by max-pooling operations. The resulting feature maps are
flattened to form the penultimate layer, which can be used as the question vector.
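The convolution and max-pooling steps just described can be sketched with plain NumPy. This is an untrained toy version: the filter widths (2 and 3), the number of filters, the ReLU non-linearity and the function name `conv1d_max_pool` are assumptions chosen for illustration, and the word embeddings are random.

```python
import numpy as np

def conv1d_max_pool(embeddings, filters, bias):
    """Slide each filter over consecutive words, then keep its maximum response.

    embeddings: (n_words, dim); filters: (n_filters, width, dim); bias: (n_filters,)
    """
    n_words, _ = embeddings.shape
    n_filters, width, _ = filters.shape
    n_positions = n_words - width + 1
    feature_maps = np.empty((n_filters, n_positions))
    for pos in range(n_positions):
        window = embeddings[pos:pos + width]  # (width, dim) slice of the question
        feature_maps[:, pos] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU (assumed non-linearity)
    return feature_maps.max(axis=1)               # max-over-time pooling

rng = np.random.default_rng(0)
question = rng.standard_normal((6, 8))  # 6 words, 8-dimensional embeddings (random)

pooled = []
for width in (2, 3):                    # two filter widths, 4 filters each
    filters = rng.standard_normal((4, width, 8)) * 0.1
    pooled.append(conv1d_max_pool(question, filters, np.zeros(4)))
question_vector = np.concatenate(pooled)
print(question_vector.shape)
```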
LSTM is a recurrent neural network (Elman 1990) designed to mitigate the exploding and
vanishing gradient problem. The LSTM layer stores context information in its
memory cells and serves as the bridge among the words in a sequence (e.g. a question). To
model long-term dependencies in the data more effectively, LSTM adds three gate nodes
to the traditional RNN structure: the input gate, the output gate and the forget gate. The
input gate and output gate regulate write and read access to the LSTM memory cells,
respectively. The forget gate resets the memory cells when their contents are out of date.
The output state vector from the last time step can be used as the question feature. Figure 9
explains the basic flow of information in an LSTM network.
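The role of the three gates can be seen in a minimal sketch of one LSTM time step and of encoding a question as the last hidden state. The stacked weight layout, the function names and the random inputs are assumptions for illustration; a practical system would use a deep learning library and trained weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:hidden])                 # input gate: what to write to the cell
    f = sigmoid(z[hidden:2 * hidden])       # forget gate: what to keep from the old cell
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: what to read from the cell
    g = np.tanh(z[3 * hidden:])             # candidate cell content
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def encode_question(word_vectors, W, U, b, hidden):
    """Run the LSTM over the words and return the last hidden state."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in word_vectors:
        h, c = lstm_step(x, h, c, W, U, b)
    return h

rng = np.random.default_rng(0)
dim, hidden = 8, 6
W = rng.standard_normal((4 * hidden, dim)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
question = rng.standard_normal((5, dim))  # 5 word vectors (random stand-ins)

question_feature = encode_question(question, W, U, b, hidden)
print(question_feature.shape)
```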
Table 4 Comparison of word embedding techniques
Columns: Word embedding models; Merits; Demerits
The gated recurrent unit (GRU) was proposed by Cho et al. (2014) to make each recurrent
unit adaptively capture dependencies of different time scales. Similarly to the LSTM
unit, the GRU has gating units that modulate the flow of information inside the unit, but
without separate memory cells. The difference from LSTM is that the GRU merges the
functionality of the input and forget gates into a single update gate and has no separate
output gate. Here also, the last hidden state representation can be used as the question
feature.
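For comparison with the LSTM, one GRU time step can be sketched as below. Biases are omitted for brevity, the weights are random, and the sign convention for the update gate (here `z` weights the new candidate) is one of two conventions found in the literature; all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step without biases; there is no separate memory cell."""
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde        # interpolate old state and candidate

rng = np.random.default_rng(0)
dim, hidden = 8, 6
Wz, Wr, Wh = (rng.standard_normal((hidden, dim)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((hidden, hidden)) * 0.1 for _ in range(3))

h = np.zeros(hidden)
for x in rng.standard_normal((5, dim)):  # 5 word vectors (random stand-ins)
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
question_feature = h                      # last hidden state as the question feature
print(question_feature.shape)
```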
Most of the word embeddings discussed above have been explored by various VQA algorithms
to create a feature vector for the given natural language question. Usage statistics of
these word embedding methods for VQA can be found in Table 5 (where the columns list
seven promising word embedding models and the rows indicate major VQA systems proposed
in the literature) and Fig. 10. This statistical study shows that LSTM (generally, the RNN
family) is preferred by VQA researchers, in line with Young et al. (2018), who state that
sequence based models like RNNs do better than word-order-independent methods like
word2vec. However, these models do not stand alone without traditional embeddings, because
word vectors created using any of the models listed in Table 4 are fed as input to the LSTM
or GRU. They also suffer from the fact that a large amount of labeled data is required for
training.
In their study, Shih et al. (2016) stated that a simple bag-of-words (BoW) embedding is
enough for VQA, with no need to train and use an LSTM or GRU. Although co-occurrence with
SVD has been shown to perform well in capturing latent semantics, it is not regularly used in
VQA for question featurization. SVD also helps in reducing dimensionality. Levy and Goldberg
(2014) show that exact factorization with SVD can achieve solutions that are at least as good
as those of skip-gram with negative sampling (SGNS) for word similarity tasks.
This section describes some noteworthy use cases of text embeddings hand-picked from
VQA literature.
• Antol et al. (2015) utilized the idea of creating BoW with top 1000 words in the ques-
tions of the dataset. They have also exploited the strong correlation between words that
start a question and answer by creating another BoW by picking top 10 first, second
and third words of the question and finally concatenate this to the first representation.
Table 5 Mapping of VQA models to word embedding techniques (One-hot embedding, CBOW, Skip-gram/
word2vec, GloVe, CNN, LSTM, GRU)
Columns: VQA model; 1; 2; 3; 4; 5; 6; 7
Fig. 10 (bar chart; vertical axis: % of use, 0 to 25; bar labels: GRU, word2vec, GloVe, CBOW,
CNN, One-hot)
defines the visual idea whose presence is to be verified to answer the question. It will
guide the visual feature extraction process.
• Shih et al. (2016) tried binning of questions to yield a simplified fixed length repre-
sentation of important concepts from variable-length questions. They have parsed the
question into four different bins: (1) Type of question using first two words, (2) Nomi-
nal subject, (3) All noun words and (4) All remaining words. Also a fifth bin is desig-
nated for candidate answers since the model performs multiple-choice VQA task.
• Yu et al. (2018a, b) leverage a tree-LSTM network to capture the linguistic structure of
the language. They map each question in the dataset to a semantic tree where
each node refers to a single LSTM unit and the root node represents the whole sequence.
This compositional semantic representation technique can break down questions into
logical expressions to improve reasoning ability.
• Recently Toor et al. (2019) reported exciting results of their VQA model by using two
novel concepts: (1) Question Action Relevance (QAR) and (2) Question Action Editing
(QAE). QAR identifies irrelevant question action words from generated image caption
and QAE edits the question to map irrelevant words to relevant actions.
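Of the use cases listed above, the bag-of-words representation built from the most frequent question words (as in the Antol et al. 2015 item) is the simplest to sketch. The example below uses three made-up questions and a vocabulary of the top 5 words instead of the top 1000; the tokenization and the function names are illustrative assumptions, and the additional first-words BoW is not shown.

```python
from collections import Counter
import numpy as np

def tokenize(question):
    """Lower-case, drop the trailing question mark and split on whitespace."""
    return question.lower().rstrip("?").split()

def build_vocab(questions, top_k):
    """Keep the top_k most frequent words over all questions."""
    counts = Counter(word for q in questions for word in tokenize(q))
    return [word for word, _ in counts.most_common(top_k)]

def bow_vector(question, vocab):
    """Count how often each vocabulary word occurs in the question."""
    tokens = tokenize(question)
    return np.array([tokens.count(word) for word in vocab], dtype=float)

# Hypothetical questions standing in for a VQA dataset.
questions = ["What color is the cat?", "Is the cat sleeping?", "What is on the table?"]
vocab = build_vocab(questions, top_k=5)
vec = bow_vector("Is the cat on the table?", vocab)
print(vocab, vec)
```

Words outside the top-k vocabulary are simply ignored, so the question vector has a fixed length regardless of question length.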
In VQA, the image and the question are processed independently to obtain separate vector
representations. Different methods for doing this are detailed in the previous two sections.
In the next step of VQA, these features are mapped to a joint space, then combined and fed
to the answer generation stage. This literature review identified a wide assortment of
techniques for combining image and question features, ranging from simple concatenation
to complex joint attention networks.
Baseline methods include concatenation (Zhou et al. 2015; Jabri et al. 2016; Yu et al.
2018a, b; Huang et al. 2018), element-wise addition and element-wise multiplication
(Antol et al. 2015; Zhang et al. 2016; Goyal et al. 2017; Lin and Parikh 2016; Teney and
Hengel 2018), where the last two require the feature vectors to have the same dimension.
Malinowski et al. (2017) tried all three multimodal fusion methods and found
that element-wise multiplication gives higher accuracy. Another important finding is that L2
normalization of visual features has a significant impact on fusion performance, especially
for concatenation and summation. According to their results, summation achieves high accuracy
after normalization.
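The three baseline fusion operators, together with L2 normalization of the visual feature, can be sketched as follows. The vectors are random stand-ins of equal size and the function names `l2_normalize` and `fuse` are illustrative.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    return v / max(np.linalg.norm(v), eps)

def fuse(image_vec, question_vec, method):
    """Combine two feature vectors by concatenation, summation or element-wise product."""
    if method == "concat":
        return np.concatenate([image_vec, question_vec])
    if image_vec.shape != question_vec.shape:
        raise ValueError("sum and product need vectors of the same dimension")
    if method == "sum":
        return image_vec + question_vec
    if method == "product":
        return image_vec * question_vec
    raise ValueError(f"unknown fusion method: {method}")

rng = np.random.default_rng(0)
image_vec = l2_normalize(rng.standard_normal(16))  # normalized visual feature
question_vec = rng.standard_normal(16)

fused = {m: fuse(image_vec, question_vec, m) for m in ("concat", "sum", "product")}
print({m: v.shape for m, v in fused.items()})
```

Concatenation doubles the dimension and works with vectors of different sizes, whereas summation and multiplication keep the dimension but need both inputs projected to a common size first.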
Shih et al. (2016) use the dot product of region-wise visual features and question
embeddings. Saito et al. (2017) introduced a hybrid way of multimodal fusion. They integrate
element-wise summation and element-wise multiplication by implementing a polynomial
function given as Eq. 2
$$\left(1 + x_1 + x_2 + x_3 + \cdots + x_d\right)\left(1 + y_1 + y_2 + y_3 + \cdots + y_d\right) \qquad (2)$$
where $x_i$ and $y_i$ indicate the ith dimension of an image and a question feature, respectively.
The intuition behind this integration is that features from multiplication and summation
will be substantially different. Lioutas et al. (2018) proposed an ensemble attention model
where multimodal fusion is done via concatenation of question, answer and image features
where question and answer features are first multiplied element-wise.
Another classic method for finding the relationship between two vectors, with its roots
in statistics, is Canonical Correlation Analysis (CCA), which has been used for multimodal
fusion by VQA researchers. CCA finds a joint representation between two multi-dimensional
variables, in the case of VQA the image and question vectors. One scalable extension of
CCA is normalized CCA (nCCA), proposed by Gong et al. (2014), which applies an explicit
kernel mapping followed by dimensionality reduction. Yu et al. (2015) and Tommasi et al.
(2019) trained both CCA and nCCA models for VQA and found that nCCA performs very well,
especially in the case of multiple-choice questions.
Here, researchers train end-to-end deep neural networks for VQA task with specific layers
for joint comprehension of image and question features. The structure and functioning of
this layer may be different for different proposed VQA end-to-end models.
Gao et al. (2015) implemented image question fusion as an additional layer with non-
linear activation function termed, scaled hyperbolic tangent function given as
$$g(x) = 1.7159 \cdot \tanh\left(\tfrac{2}{3} x\right) \qquad (3)$$
where x is the question, image and answer embeddings combined using element-wise
addition.
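Equation 3 and the element-wise addition that feeds it can be written directly in NumPy. The three embeddings below are random stand-ins of equal size, and the function name `scaled_tanh` is illustrative.

```python
import numpy as np

def scaled_tanh(x):
    """Scaled hyperbolic tangent of Eq. 3; its output lies in (-1.7159, 1.7159)."""
    return 1.7159 * np.tanh((2.0 / 3.0) * x)

rng = np.random.default_rng(0)
question_emb, image_emb, answer_emb = rng.standard_normal((3, 10))  # hypothetical embeddings

# Element-wise addition of the three embeddings, followed by the non-linearity.
fused = scaled_tanh(question_emb + image_emb + answer_emb)
print(fused.shape, scaled_tanh(1.0))
```

With these constants the function maps an input of 1 to approximately 1, so inputs of unit scale stay at roughly unit scale.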
Andreas et al. (2016) describe a system for building and learning neural module networks
(NMN), which compose collections of jointly-trained neural “modules” into deep networks
for VQA. Their approach breaks down questions into their semantic substructures and
uses these structures to dynamically instantiate modular networks (with reusable parts for
perceiving objects, classifying colors, and so on). The resulting compound networks are
jointly trained.
Fukui et al. (2016) proposed an end-to-end model with a Multimodal Compact Bilinear
Pooling (MCB) layer (see Fig. 11) for joint representation of image and question features.
Their intuition behind using bilinear pooling for multimodal fusion is that the outer
product of feature vectors is more expressive than simple baseline methods like
concatenation. For a complete description of the operation portrayed in Fig. 11, readers may
refer to the source cited in the figure title.
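Since the text defers the details of MCB to the original paper, the following is only a sketch of the general compact bilinear pooling idea as commonly described: each vector is projected with a Count Sketch (random bucket and random sign per input dimension), and the two sketches are combined by circular convolution computed with the FFT, which corresponds to a sketch of the outer product without ever forming it. The sizes, the function names and the random hash assignments are assumptions for illustration.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dimensions: entry i is added, with sign s[i], to bucket h[i]."""
    sketch = np.zeros(d)
    np.add.at(sketch, h, s * x)  # unbuffered add, so colliding buckets accumulate
    return sketch

def mcb_fuse(image_vec, question_vec, h_img, s_img, h_q, s_q, d):
    """Fuse two vectors by circular convolution of their count sketches."""
    sk_img = count_sketch(image_vec, h_img, s_img, d)
    sk_q = count_sketch(question_vec, h_q, s_q, d)
    # Element-wise product in the frequency domain equals circular convolution.
    return np.fft.irfft(np.fft.rfft(sk_img) * np.fft.rfft(sk_q), n=d)

rng = np.random.default_rng(0)
n_img, n_q, d = 6, 5, 16  # toy sizes; real systems use far larger dimensions
image_vec = rng.standard_normal(n_img)
question_vec = rng.standard_normal(n_q)

# Hash buckets and signs are drawn once and then kept fixed.
h_img, h_q = rng.integers(0, d, n_img), rng.integers(0, d, n_q)
s_img, s_q = rng.choice([-1.0, 1.0], n_img), rng.choice([-1.0, 1.0], n_q)

fused = mcb_fuse(image_vec, question_vec, h_img, s_img, h_q, s_q, d)
print(fused.shape)
```

The fused vector has d entries, whereas the explicit outer product would have n_img × n_q entries; this is what makes the bilinear interaction affordable for high-dimensional features.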
Noh et al. (2016) presented a VQA solution using a single CNN in which one of the fully
connected layers is a dynamic parameter layer whose weights are determined by a specially
designed Dynamic Parameter Prediction Network (DPPN). The DPPN consists of GRU cells to
embed the question words and a hashing function to assign weights to the CNN layer.
Ma et al.’s (2016) solution also consists of CNNs. They built a multimodal convolution
layer for joint embedding of image and question features. Here, the multimodal convolution
is performed on the image vector and two consecutive semantic components from the question
side, and is expected to capture the interactions between the two multimodal
inputs.
Another distinctive type of deep network is the deep residual network (He et al. 2016),
which works on the intuition that a deeper version of a good shallow network should do at
least as well by learning identity transformations in the new layers. Thus, an identity
connection from the input allows a deep residual network to retain a copy of the input. A
basic pictorial representation of the concept can be seen in Fig. 12. However, this idea may
not be applied as such in multimodal learning, because the modalities may have correlations.
So, Kim et al. (2016a) defined a joint residual function as the non-linear
mapping, which leads to the Multimodal Residual Network (MRN) for the VQA task. Lao et al.
(2018) introduced another end-to-end model influenced by residual learning. They used
a Cross-modal Multistep Fusion (CMF) network between feature representation and
final answer prediction. It performs multiple fusions by generating various multimodal
features, fusing features at every step rather than waiting for the final step. Different
layers of CMF share parameters to optimize the use of computational resources.
Gao et al. (2018) identified a main limitation of taking image feature as a sin-
gle dimensional vector from second last layer of any pre-trained CNN for end-to-end
VQA network. This trend misses detailed information like spatial relationships between
objects in the image. To avoid this loss of information, they have proposed a new form
of fusion, question-guided convolution. That is, a series of kernels are designed based
on the question features to convolve with image features. The main idea is to do multi-
modal fusion at an early stage of VQA model to retain more information.
Bai et al. (2018) extended basic MCB model with a deep attention neural tensor network
(DA-NTN) module as last stage of VQA model in order to check the similarity between
fused multi-modal feature vector (question and image features) and answer embedding.
Narasimhan and Schwing (2018) proposed an end-to-end system using a multi-layer
perceptron (MLP) to combine image and question features obtained from a CNN and an
LSTM, respectively. They also retrieve a related ‘fact’ from a knowledge base as supporting
material to answer the question. The output of the MLP is then passed along with the
fact embedding to a score function. Here, the function computes the cosine similarity
between its inputs to assess the usefulness of the fact for answering the image-question pair.
4.2.1 Encoder–decoder architecture
Here, the question together with the visual representation is fed into the decoder, which is
usually an LSTM network that is trained to produce the correct answer. Two general
architectures for doing this are shown in Fig. 13. Figure 13a shows one way of fusing the two
feature vectors, which is to treat the image encoding as the first (or, optionally, the last)
word of the question (Ren et al. 2015a, b; Zhu et al. 2016). Figure 13b shows a more explicit
way, which is to supply the image at every time step of the decoding LSTM (Malinowski et al.
2017).
Ruwa et al. (2018) use triples as input to the decoder LSTM, consisting of the image
embedding, the question embedding and a question mood embedding, to generate an emotional
adjective along with the answer. Wu et al. (2018) proposed a model similar to the one in
Fig. 13a, but the input to the first LSTM cell is a combination of three vectors: the visual
feature vector, the image caption embedding, and the vector embedding of the knowledge
extracted from external sources based on the question. This is especially good for
open-ended questions, typically ‘why?’ questions.
Typical image and question attention have been explored extensively in VQA. However,
they are not optimal in the sense that they ignore the semantic relationship between image
attention and question attention. This limitation motivated Cao et al. (2017) to use semantic
cross-modal correlation along with attention to solve the VQA problem. Peng et al. (2019)
identified another limitation of the standard attention mechanism, in which the entire natural
language question is used to guide the visual encoding process, whereas in practice only the
keywords of the question are needed to identify the relevant image regions or features.
This idea is used in their Word-to-Region Attention Network (WRAN), which
fills the existing gap between the question keywords and image regions.
Lu et al. (2016) proposed a technique that jointly reasons about visual and question
attention, called co-attention. The common theme of co-attention multimodal fusion is that
the image representation is used to guide the question attention and the question
representation is used to guide the image attention. Their model extracts word-, phrase- and
question-level embeddings for a question and applies co-attention to both the image and the
question at each level. The final answer is based on all the co-attended image and question
features.
Chen et al. (2015) described a Question-guided Attention Map (QAM), which is generated
by scanning for image regions that correspond to the input question’s semantics in the
spatial image feature map created as part of image featurization. To accomplish this,
they designed a configurable convolution kernel (CCK) and convolve the image feature
map with it. The CCK is created by mapping the question encoding from the linguistic
space into the visual space, so that it contains the visual information determined by the
intent of the question.
Xu and Saenko (2016) use two types of attention in parallel, spatial and semantic
attention. Spatial attention weights are determined from a correlation matrix in which
each value measures the similarity between a word and a location’s visual feature.
Along with this, semantic attention is performed based on an evidence embedding that
detects the presence of semantic concepts or objects; the embedding results are multiplied
by the spatial attention weights and summed over all locations to generate the visual
evidence vector.
Yu et al. (2017) used joint attention learning with three components. First, semantic
attention attends to high-level image features to identify the significant concepts in the
image needed to answer a question. Second, spatial attention, as context-aware visual
attention, is used to infer the image regions that the question should attend to. Third, joint
learning integrates the attended regions, attended concepts and question feature vector by
element-wise multiplication.
Shi et al. (2018) showed that question type information is crucial in answering a
question, whether or not it is asked about an image. They therefore replaced the popular
question-guided attention mechanism with Question Type-guided Attention (QTA) to help
image feature extraction. However, this is directly useful only on datasets with many
categories of questions that are correctly labeled.
One of the recent innovations in VQA research is the use of ‘hard attention’ to build a
VQA system by Malinowski et al. (2018). In simple words, soft attention (the commonly
used attention) is ‘selective boosting’, whereas hard attention is ‘filtering’ that discards
unwanted information, which is one of the least explored areas of computer vision. They
used it in the VQA pipeline to filter out unwanted or less important elements of the
fused multi-modal feature vector (image-question vector) before further processing. However,
the main problem with hard attention is that it is non-differentiable, which makes it
awkward to train with gradient-based methods.
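The contrast between ‘selective boosting’ and ‘filtering’ can be illustrated with a minimal sketch. It assumes that each fused feature vector already has a scalar relevance score and that hard attention keeps the top `k` vectors; the selection rule actually used by Malinowski et al. (2018) may differ, and the function names are hypothetical.

```python
import numpy as np

def soft_attention(features, scores):
    """Selective boosting: every feature vector contributes, weighted by softmax(scores)."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ features                  # one blended vector, shape (d,)

def hard_attention(features, scores, k):
    """Filtering: keep only the k highest-scoring feature vectors, discard the rest."""
    keep = np.argsort(scores)[-k:]       # indices of the k largest scores
    return features[np.sort(keep)]       # shape (k, d), original order preserved

features = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]])
scores = np.array([0.1, 3.0, 2.0, -1.0])
soft = soft_attention(features, scores)
hard = hard_attention(features, scores, k=2)   # keeps rows 1 and 2
```

The top-k selection is a discrete choice: a small change in a score either leaves the output unchanged or swaps a whole vector, which is why no useful gradient flows through the selection step.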
A comprehensive overview of different methods for joint comprehension, along with their
identified merits and demerits, can be found in Table 6.
5 Datasets
This section presents a detailed discussion of the various publicly available datasets for
validating VQA models, together with their characteristics. General requirements of a
benchmark VQA dataset include:
Table 6 Comparison of different methods used for joint embedding of image and question features
Joint comprehension method | Merits | De-merits
I. Baseline models
Concatenation | Ability to identify importance of each single word in the question to the predicted answer | Good performance only on synthesized VQA dataset
Element-wise multiplication | Easy to implement because these operations are inbuilt with many advanced language interfaces | Feature vectors need to be first transformed into same dimensional space
Element-wise summation | | Feature vectors need to be first transformed into same dimensional space
II. End-to-end models
Nonlinear activation (hyperbolic tangent function) | It's zero-centered; training will be fast because of quick convergence (large derivative compared to sigmoid) | Fails to answer questions, if the targeted object is too small
Dynamic parameter layer | Take care of various tasks by allowing adaptive weight assignment in the dynamic parameter layer | Fails in counting type questions
Multimodal convolution layer | Offer scalable and joint end-to-end training which can be interpreted as learning commonsense knowledge | Works well only if the answer is given by the co-occurrence of a particular combination of features extracted from an image and a question
Bilinear pooling layer | Uses outer product, which is expressive enough to capture the complex associations between the two different modalities fully | Sharp increase of the learning parameters and computation resources
III. Encoder–decoder architecture | Correctly matched to the task of VQA, so most exploited | Cannot really make use of image features unless the question is about the largest object in the image; cannot exploit complicated inter-modal relationships; the effect of image will vanish at each time step of LSTM
IV. Joint attention models | Exploit the intent of queries to focus on different regions in an image | Do not have any explicit notion of object position, and do not support the computation of intermediate results based on spatial attention
Most of the existing datasets contain triples made of an image, a question and its correct
answer. Some publicly available datasets, on the other hand, provide extra information
such as image captions, image regions represented as bounding boxes or multiple-choice
candidate answers. General capabilities required to answer questions in a VQA dataset
correctly include:
The available VQA datasets can be categorized based on three factors: type of images,
question–answer format, and use of external knowledge. The types of images are again of
three categories: natural, clip-art and synthetic images. The image type that forms each
VQA dataset can be found in Table 7. Similarly, question–answer formats are of open-
ended and multiple-choice, which are represented as OE and MC, respectively in Table 7.
Also the source of images which forms each dataset and limitations are described in
Table 7. For sample images from various VQA datasets, see Table 8.
DAQUAR (DAtaset for QUestion Answering on Real-world images) was one of the earliest
datasets on image question answering. It is a dataset of human question-answer pairs
about images. COCO-QA is built on Microsoft COCO (Common Objects in COntext).
The questions are automatically generated from COCO image captions and are of four
types: object, number, color and location. All answers are single words.
The VQA Dataset consists of both real images and abstract cartoon/clipart scenes and
can be named as VQA-real and VQA-abstract respectively. Both contain three questions
per image/scene and ten ground-truth answers per question. Motivation for constructing
VQA-abstract is to minimize language abstract which is high for real image-question pairs.
The abstract scenes are made of 20 ‘paper-doll’ human models (Antol et al. 2014) span-
ning genders, races and ages with 8 different expressions. The set contains over 100 objects
and 31 animals in various poses.
FM-IQA (Freestyle Multilingual-Image Question Answering) consists of COCO images
and a freestyle, interesting and diversified set of questions that require considerable
reasoning ability to answer. The questions are categorized into 8 types, e.g., questions on
object actions, object classes and others. Each image has at least two question-answer pairs
as annotations. The Visual Madlibs dataset is collected using automatically generated
fill-in-the-blank templates intended to gather focused descriptions about people and objects,
their appearances, activities and interactions, and also inferences about the general scene or
its broader context. The Visual7W dataset was inspired by the long-standing idea of the W
questions in journalism used to describe a complete story. It uses the question words ‘what’,
‘where’, ‘when’, ‘who’, ‘why’, ‘how’ and ‘which’. The Visual7W dataset features richer
questions and longer answers than the VQA dataset. TDIUC (Task Directed
Image Understanding Challenge) is a dataset developed to avoid some reported
limitations of publicly available VQA datasets, such as (1) imbalance across question
types, (2) questions that can be answered without looking at the image, and (3) a difficult
evaluation process. The authors of TDIUC divide VQA into 12 constituent tasks (i.e. 12 question
Table 7 VQA datasets in a nutshell
Dataset | Image type | OE/MC | Creation | Limitations
DAQUAR (Malinowski and Fritz 2014) | Natural | OE | Uses images from NYU-DepthV2 dataset | Too small; contains only indoor scenes; extreme light conditions which make answering very difficult
COCO-QA (Ren et al. 2015a, b, c) | Natural | OE | Uses images from COCO where questions are created using an NLP algorithm | Awkwardly phrased questions with grammatical errors due to flaws in NLP algorithm
VQA Dataset (Antol et al. 2015) | Natural, Clip-art | OE, MC | VQA-real images are taken from COCO; VQA-abstract images are synthetic cartoon images | Presence of language biases so that questions can be answered without using image; contains subjective and opinion seeking questions which do not have a single correct answer
FM-IQA (Gao et al. 2015) | Natural | OE | Human generated questions and answers; collected in Chinese and translated to English | Automatic performance evaluation is intractable
Visual Madlibs (Yu et al. 2015) | Natural | OE | Uses images from COCO; fill in the blank type questions generated using COCO image captions | Declarative sentence based questions which can be answered easily
Visual 7W (Zhu et al. 2016) | Natural | MC | Images with multiple choices for answer; two types of questions: telling (requires textual answer) and pointing (which requires selection of image region) | No yes/no questions
Visual genome (Krishna et al. 2017) | Natural | OE | Collected images from YFCC100M and COCO; very large; greater answer diversity | Lengthy answers which makes evaluation challenging
TDIUC (Kafle and Kanan 2017a) | Natural | OE | Collected images from YFCC100M and COCO; 12 types of questions to test various vision understanding capabilities | Contains too many similar questions, specifically, questions regarding color which usually leads to wrong question type prediction
SHAPES (Andreas et al. 2016) | Synthetic | MC | Images using various shapes in various colors | Only yes/no questions
KB-VQA (Wang et al. 2015) | Natural | OE | Questions require knowledge from DBpedia | Small scale dataset
FVQA (Wang et al. 2018) | Natural | OE | Questions involve external information collected from DBpedia, ConceptNet and Webchild | Long training time
Diagrams (Kembhavi et al. 2016) | Synthetic | MC | Images of school science topics like digestive system | Require high level of visual reasoning
CLEVR (Johnson et al. 2017) | Synthetic | OE | Simple images with different shapes and complex questions which exclusively test visual reasoning ability of solution | Cannot be generalized to a real-world setting; not easily extendible
FigureQA (Kahou et al. 2017) | Synthetic | MC | Questions about graphical plots and figures; it can be extended iteratively | All questions are yes/no type; does not contain questions which require numerical values as answers; it has fixed labels for bars across different figures
DVQA (Kafle et al. 2018) | Synthetic | OE | Questions about graphical plots | Consider only bar charts
VizWiz (Gurari et al. 2018) | Natural | OE | Consists of visual questions asked by blind people who were seeking answers to their daily visual questions | Images are often of poor quality; questions suffer from audio recording issues; most of the questions are unanswerable
VQA-Med (Hasan et al. 2018) | Natural | OE | Consist of medical images extracted from Pub- | Too small for exploring advanced VQA capabilities
Table 8 Sample images from various VQA datasets (panels: Visual7W, SHAPES, VisualGenome, CLEVR, FigureQA, VizWiz)
types including absurd questions) that make it easier to evaluate and compare performance
of VQA models.
SHAPES is a dataset of synthetic images. It is complementary to other VQA datasets, as it
consists of shapes of varying arrangements, types and colors rather than natural scenes,
which poses a distinctive challenge for VQA researchers. The dataset consists of complex
questions about spatial and logical reasoning among multiple shapes. It thus avoids the risk
of learning biases, a significant deficiency in most VQA datasets. CLEVR (Compositional
Language and Elementary Visual Reasoning) is a dataset of synthetic images similar to
SHAPES, but with simple 3D shapes. It consists of questions that test a range of visual
reasoning abilities with minimal biases, and it also comes with supporting annotations
describing the kind of reasoning required to answer each question. The authors claim that
CLEVR facilitates in-depth analysis of the visual reasoning abilities of a solution model,
which is intractable with other datasets.
The KB-VQA (Knowledge Base-VQA) dataset has been built for assessing the performance
of VQA models on questions requiring higher-level knowledge and explicit
reasoning about visual contents using external information. The authors have attached
a specific label to each question which reveals the human-estimated level of knowledge
required to answer it correctly. The labels are “visual” (can be answered directly using
visual concepts), “common-sense” (does not require an external source) and “KB-Knowledge”
(expected to require Wikipedia or similar). FVQA (Fact-based VQA) is similar to
KB-VQA in that it requires external information to answer. The distinction is that, here, the
authors provide an additional supporting fact for each question-answer pair in the form of a
structural triplet, such as <Cat, CapableOf, ClimbingTrees>.
FigureQA is the first VQA dataset based on graphical plots and figures including five
classes: line plots, dot-line plots, vertical and horizontal bar graphs and pie charts. It con-
sists of 15 question types which seek various relationships between objects in graph and
examine characteristics like the minimum, the maximum, area-under-the-curve, smooth-
ness, and intersection. DVQA (Data Visualization Question Answering) is a dataset con-
currently developed with FigureQA, but tests only various aspects of bar charts. It contains
three types of questions: (1) structure understanding (e.g. How many bars are there?), (2)
data retrieval (e.g. what is the label of the third bar from left?), and (3) reasoning (e.g.
which algorithm has the highest accuracy?).
Diagrams (AI2D) is a dataset that aims to evaluate the task of diagram interpretation.
It consists of diagrams representing topics from grade school science, each annotated with
constituent segmentation, their relationships to each other and relationships to the diagram
canvas. This is a new direction for vision research, which has generally concentrated on
natural image understanding.
VizWiz is the first goal-oriented VQA dataset to capture the real interests of real users of a
VQA system. It is also the first publicly available vision dataset originating from blind
people. Distinguishing characteristics of VizWiz include: (1) images are captured by blind
people and so are usually of poor quality, (2) questions are spoken and so are more
conversational or suffer from audio recording imperfections, and (3) many of the questions
cannot be answered because blind people cannot verify that their images capture the visual
content they are asking about.
VQA-Med is a dataset created as a first step towards medical-domain VQA. It consists
of medical images with clinically relevant question–answer pairs. Success in this task
will improve patient engagement in interpreting medical images and will also serve as a
second opinion for doctors on complex medical images. The question-answer pairs were
generated using a semi-automatic method: they were first produced by rule-based question
generation and then manually verified by human experts.
6 Performance evaluation
The real dream of computer vision research is to build models which are close to humans
in image understanding capability. For evaluation of these models, Geman et al. (2015)
have introduced a Visual Turing Test for computer vision systems. Most of the current
papers suggest that VQA can be considered as an alternative for Visual Turing Test, or in
other words, it is an ‘AI-complete’ task. One of the important criteria for a task to be ‘AI-
complete’ is to have a well-defined quantitative evaluation metric to track progress. In this
section, the principal VQA evaluation metrics are presented in a nutshell.
From Table 7, it is clear that VQA datasets present two types of questions: open-ended
and multiple-choice. In multiple-choice setting, there is just a single right answer for every
question. So the assessment of a proposed solution is straightforward since one can easily
quantify the mean accuracy over test questions. In open-ended setting, there exists a pos-
sibility of having multiple right answers for a question due to synonyms and paraphrasing.
Table 9 Calculation formulas of major VQA metrics and supporting datasets
BLEU (supporting dataset: VizWiz):
\[ \mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} W_n \log P_n \right) \]
So, the assessment is nontrivial. To make it manageable, most of the VQA datasets restrict
the answers to contain only a few words (usually one to three), or select answer from a
closed set of answers thereby converting open-ended setting to multiple-choice setting.
Table 9 shows the calculation formula of major VQA metrics along with the names of the
supporting datasets and Fig. 14 shows the usage statistics of these metrics across various
available VQA datasets so that researchers can identify the trend of evaluation.
Although simple accuracy is enough to evaluate a VQA model in the multiple-choice
setting, it is too rigid for the open-ended setting, because an answer is accepted as correct
only if it exactly matches the ground truth. For example, if the question is ‘What animals are
in the photo?’ and a model outputs ‘dog’ instead of the ground truth ‘dogs’, the answer is
treated as wrong. Due to these limitations, several alternatives have been proposed.
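The rigidity of exact matching is easy to see in a minimal sketch. The function name and the lower-casing and whitespace stripping are assumptions made for illustration; individual benchmarks may normalize answers differently.

```python
def exact_match_accuracy(predictions, ground_truths):
    """Fraction of predictions that are string-identical to the ground truth."""
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)

preds = ["dog", "2", "red"]
truth = ["dogs", "2", "red"]
acc = exact_match_accuracy(preds, truth)   # "dog" vs "dogs" counts as an error
```

Here two of the three answers match, so the near-miss ‘dog’ is penalized exactly as heavily as an unrelated answer would be.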
Wu-Palmer Similarity (WUPS) (Wu and Palmer 1994) was used as an evaluation metric
for VQA in Malinowski and Fritz (2014). It is inspired by fuzzy set theory, which leads
to a softer measure than accuracy. It tries to measure how much a predicted answer differs
from the ground truth based on the difference in their semantic meaning. To avoid giving
a high score to distant concepts, they also proposed a thresholded WUPS score, in which a
score below a threshold is scaled down by a factor. Significant shortcomings of
WUPS include: (1) it produces high scores for answers that are lexically related but have
different meanings, and (2) it does not work with phrasal or sentence answers.
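A small sketch can make the thresholding concrete. It computes Wu-Palmer similarity on a hypothetical six-node taxonomy (`PARENT`) instead of WordNet, and it assumes a threshold of 0.9 and a down-scaling factor of 0.1 as default parameters; the set-level score takes the smaller of the two directed products of best matches.

```python
# Hypothetical toy taxonomy: child -> parent ("entity" is the root).
PARENT = {"animal": "entity", "dog": "animal", "cat": "animal",
          "furniture": "entity", "chair": "furniture"}

def path_to_root(concept):
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path                                  # e.g. ["dog", "animal", "entity"]

def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b)), root depth = 1."""
    pa, pb = path_to_root(a), path_to_root(b)
    lcs = next(c for c in pa if c in pb)         # deepest common ancestor
    depth = lambda c: len(path_to_root(c))
    return 2 * depth(lcs) / (depth(a) + depth(b))

def thresholded_wup(a, b, threshold=0.9, factor=0.1):
    s = wup(a, b)
    return s if s >= threshold else factor * s   # scale down distant concepts

def wups(predicted, truth, threshold=0.9):
    """Score for one question; predicted and truth are lists of answer words."""
    def directed(xs, ys):
        score = 1.0
        for x in xs:
            score *= max(thresholded_wup(x, y, threshold) for y in ys)
        return score
    return min(directed(predicted, truth), directed(truth, predicted))
```

In this toy taxonomy `wup("dog", "cat")` is 2/3 even though the two answers are clearly different, which illustrates the first shortcoming listed above; with the 0.9 threshold the same pair scores only about 0.07.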
Another alternative is the consensus measure, used in Antol et al. (2015), which is based
on multiple independently collected ground truth answers from annotators for each question.
It is mainly of two types: average and min consensus. In the average version, the final
score is the weighted average over the answers from the annotators; in the min version,
the answer needs to agree with at least one annotator. The most widely used version (given
in Table 9) awards a full score for a question if the algorithm agrees with three or more
annotators. Limitations of this approach include: (1) it allows multiple correct answers for
a question, (2) the expense of collecting ground truth, and (3) difficulty due to lack of consensus.
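The widely used variant can be written as min(number of agreeing annotators / 3, 1). The sketch below assumes answers have already been normalized to comparable strings; the function name is hypothetical.

```python
def consensus_accuracy(predicted, human_answers):
    """Full credit when at least three annotators gave the predicted answer,
    partial credit (one third per annotator) otherwise."""
    agreeing = sum(1 for answer in human_answers if answer == predicted)
    return min(agreeing / 3.0, 1.0)

# Ten ground-truth answers for one question, as in the VQA dataset.
humans = ["dog"] * 8 + ["dogs"] * 2
full = consensus_accuracy("dog", humans)       # 8 annotators agree
partial = consensus_accuracy("dogs", humans)   # only 2 annotators agree
none = consensus_accuracy("cat", humans)       # nobody agrees
```

Unlike exact matching against a single label, the minority answer ‘dogs’ still earns two thirds of the credit here.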
The authors of the FM-IQA dataset (Gao et al. 2015) suggested manual evaluation of VQA
models by human judges. This works well for single-word as well as phrase or sentence
answers. However, the judgment process is costly in time, resources and expense.
There is also a risk that individual judges give subjective opinions.
One of the main problems of VQA datasets is the skewed distribution of question types.
In such cases, simple accuracy does not work well, especially for rarer question types.
Kafle and Kanan (2017a) therefore proposed new performance metrics to compensate for
unbalanced question-type distribution. The metric is named mean-per-type (MPT) and is
calculated as the arithmetic or harmonic mean of the accuracy per question type. They also
use normalized metrics, namely arithmetic normalized MPT and harmonic normalized MPT,
to compensate for bias in the distribution of answers for each question type.
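The effect of MPT on a skewed test set can be shown with a short sketch. It covers only the unnormalized arithmetic and harmonic variants; the record format and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def mean_per_type(records):
    """records: iterable of (question_type, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for qtype, ok in records:
        totals[qtype] += 1
        hits[qtype] += int(ok)
    acc = {t: hits[t] / totals[t] for t in totals}   # accuracy per question type
    n = len(acc)
    arithmetic = sum(acc.values()) / n
    # The harmonic mean is 0 as soon as any type has zero accuracy.
    harmonic = 0.0 if any(a == 0 for a in acc.values()) else n / sum(1 / a for a in acc.values())
    return acc, arithmetic, harmonic

# Skewed toy test set: 100 color questions (90 correct), 4 counting questions (1 correct).
records = ([("color", True)] * 90 + [("color", False)] * 10
           + [("counting", True)] * 1 + [("counting", False)] * 3)
simple = sum(ok for _, ok in records) / len(records)   # 91 / 104 = 0.875
acc, arithmetic_mpt, harmonic_mpt = mean_per_type(records)
```

Simple accuracy is 0.875 because the frequent type dominates, whereas arithmetic MPT is 0.575 and harmonic MPT is about 0.39, so weak performance on the rare type is no longer hidden.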
BLEU (BiLingual Evaluation Understudy) and METEOR (Metric for Evaluation of
Translation with Explicit ORdering), proposed by Papineni et al. (2002) and Denkowski
and Lavie (2014) respectively, are metrics for automatic evaluation of machine translation.
Gurari et al. (2018) suggested that these can be used as evaluation metrics for VQA and
experimented with them on their VizWiz dataset. BLEU analyzes the co-occurrences of
n-grams between the predicted answer and the ground truth label. BLEU usually fails on
short sentences. METEOR is calculated from an alignment between the words in the
predicted and ground truth answers, aiming at a one-to-one correspondence, but it is
sometimes intractable to find such an alignment.
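Why BLEU struggles with short answers can be seen from a minimal single-reference sketch of the score BP · exp(Σ W_n log P_n). It assumes uniform weights W_n = 1/N, whitespace tokenization and no smoothing; production implementations add smoothing and support multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped matches
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0                     # log P_n is undefined, the score collapses to 0
        log_sum += (1.0 / max_n) * math.log(overlap / total)
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))     # brevity penalty
    return bp * math.exp(log_sum)

short = bleu("a white dog", "a white dog")               # no 4-grams exist
short_n2 = bleu("a white dog", "a white dog", max_n=2)   # only 1- and 2-grams
long_match = bleu("the cat sat on the mat", "the cat sat on the mat")
```

A perfectly correct three-word answer scores 0 with the usual N = 4 because it contains no 4-grams at all, while the same answer scores 1 when only unigrams and bigrams are counted.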
According to Kafle and Kanan (2017b), the ideal approach to assess a VQA framework
is still an open problem. The newly introduced first goal-oriented VQA dataset VizWiz
opened a new chapter by using automatic machine translation evaluation metrics for VQA.
Every assessment strategy has its own strengths and shortcomings. The choice of technique
depends on how the dataset was built, the level of bias within it and the available
resources. Parallel to the development of new VQA models, significant work should be
done to create unique and efficient metrics to evaluate their performance.
VQA is a recently introduced task that attracts many researchers due to its potential
applications and AI-completeness. Prerequisites for solving this complex task include
knowledge of mature research in the fundamental tasks of Computer Vision and Natural
Language Processing.
As shown in Fig. 15, a vast amount of research has been carried out in this area, which
has led to meteoric improvements in the performance of VQA algorithms and also to
the development of new datasets. Still, VQA research has miles to go to reach the goal of
human-equivalent performance in answering image-based questions. Readers should not
draw conclusions about the performance of VQA systems from the ‘bars’ for the
SHAPES and CLEVR datasets in Fig. 15, because these are synthetic datasets created for
study purposes; the focus should instead be on realistic datasets like VizWiz, for which the
performance is still below 50%.
Next, Table 10 and Fig. 16 present the results of study done on VQA dataset on two test
splits of the dataset named ‘test-dev’ and ‘test-std’. VQA is a relatively large dataset con-
sisting of 265,016 images and open-ended questions about images. These questions require
an understanding of vision, language and commonsense knowledge to answer. Maximum
reported accuracy on this is around 70% which highlights a need for improvement.
The reasons for the gap that still exists between VQA performance and human performance
have not been established. This section outlines promising future work needed in different
aspects of VQA to fill the gap between machine and human intelligence in image understanding.
Before that, readers are directed to Table 11 for a critical comparison of
selected VQA literature, where each solution is dissected into its different phases,
including limitations and future scope.
7.1 VQA phases
Until now, the image featurization part has been almost frozen to one of the models that
emerged from the ImageNet challenge. They work well in the classical computer vision tasks
of object detection by splitting the input image into uniform bounding boxes. However,
VQA requires more natural image features obtained by detecting all objects in an image with
Table 11 Critical comparison of a relevant subset of VQA solutions

Antol et al. (2015)
  Image featurization: VGGNet
  Question featurization: BoW
  Joint comprehension: Element-wise multiplication
  Answering approach: Classification using MLP & Generation using LSTM
  Limitations: Element-wise product is not expressive enough to conquer the complex associations between the multiple modalities especially in situation where image information plays little role in finding answer (binary/how many type questions)
  Scope for improvement: Incorporate regional attention to avoid unimportant image features before fusion

Gao et al. (2015)
  Image featurization: GoogLeNet
  Question featurization: LSTM & CNN
  Joint comprehension: Encoder–decoder
  Answering approach: Generation
  Limitations: Fails when (1) the commonsense reasoning through background images is incorrect, (2) the question focus object is too small or looks similar to other objects, (3) the question needs knowledge from experience, (4) there exists OOV problem
  Scope for improvement: Use image features extracted from pooling layers of CNN instead of last fully connected layer to get more fine grained features; use improved word embeddings to handle OOV

Lu et al. (2016)
  Image featurization: Last pooling layer of VGGNet & ResNet
  Question featurization: Three levels: word, phrase (convolution) and question (LSTM)
  Joint comprehension: Encoder–decoder with co-attention in which question and image guiding each other (simultaneously or alternately)
  Answering approach: Classification
  Limitations: This model performs attention to the image at every time step even though some words do not have corresponding image signal which may diminish the aim of question attention
  Scope for improvement: Can use adaptive attention (Lu et al. 2017) to decide when to rely on image features and when to rely on language features

Fukui et al. (2016)
  Image featurization: ResNet152 followed by L2 normalization
  Question featurization: LSTM
  Joint comprehension: Compact bilinear pooling which computes outer product between two vectors
  Answering approach: Classification
  Limitations: Image and question features need to be projected to very high dimensional space to guarantee robust performance, leading to huge memory usage
  Scope for improvement: Try to incorporate the method of multi-modal low-rank bilinear pooling (Kim et al. 2016b) into the VQA system

Malinowski et al. (2017)
  Image featurization: ResNet152
  Question featurization: LSTM
  Joint comprehension: Encoder–decoder
  Answering approach: Classification/generation
  Limitations: Global image feature vector fed into each LSTM unit thereby giving irrelevant (sometimes) information into answer prediction stage and also become difficult due the high dimensionality of global image feature
  Scope for improvement: Incorporate visual attention to get local important features

Ben-Younes et al. (2017)
  Image featurization: ResNet
  Question featurization: GRU initialized with Skip-thought vectors
  Joint comprehension: Tucker decomposition of correlation tensor
  Answering approach: Classification
  Limitations: Insufficient to capture complex interactions among different modalities especially when the input images are noisy. This makes it unsuitable for realistic application domains of VQA like Viz-Wiz where the images taken by blind people will be noisy
  Scope for improvement: Improve the building of correlation tensor via (1) co-attention (2) early fusion and (3) improved individual feature embeddings, to surpass limitations in dataset

Anderson et al. (2018)
  Image featurization: Faster R-CNN with ResNet 101 as backbone with thresholding to eliminate irrelevant image features
  Question featurization: GRU initialized with GloVe vectors
  Joint comprehension: Element-wise product
  Answering approach: Classification
  Limitations: Fails to learn structured, multi-step visual reasoning required to answer questions about 3D shapes in CLEVR dataset
  Scope for improvement: Incorporate question guided visual attention; the image featurization step can easily be adapted to many of the baseline VQA models to improve performance on real image datasets

Yu et al. (2018)
  Image featurization: Convolution layer res5c of ResNet
  Question featurization: LSTM
  Joint comprehension: Multi-modal factorized high-order pooling approach (MFH) with co-attention
  Answering approach: Classification
  Limitations: MFH cannot infer correlation between each image region and each question word because both question and image attention is being performed independently and fused later; entire question is used to attend image, but only some words may be actively involved in locating image regions
  Scope for improvement: Method for fine-grained co-attention is required; word-to-region attention provided by Peng et al. (2019) can be effectively utilized

Yu et al. (2019)
  Image featurization: Faster R-CNN with ResNet 101 as backbone
  Question featurization: LSTM initialized with GloVe vectors
  Joint comprehension: Encoder–decoder with modular co-attention layers
  Answering approach: Classification
  Limitations: Fails in distinguishing the key words in questions if other objects are brighter in the image
  Scope for improvement: Improve image feature extraction by combining features from multiple sources

Shrestha et al. (2019)
  Image featurization: Faster R-CNN
  Question featurization: GRU initialized with GloVe vectors
  Joint comprehension: Concatenation of question features with regional image features followed by aggregated bimodal embedding using bidirectional GRU
  Answering approach: Classification
  Limitations: Less efficient if dataset consists of questions about unseen compositions of seen concepts like C-VQA (Agrawal et al. 2017)
  Scope for improvement: Make new VQA models with strong visual grounding and to circumvent language prior. This opens a wide green area for future VQA researchers to work with; develop new datasets with reduced biases for benchmarking of AI
semantic segmentation. With this in mind, Peng et al. (2019) used the Region-based
CNN (R-CNN) proposed by Girshick et al. (2014) for the visual feature extraction step
of VQA. R-CNN works by using selective search to identify a manageable number of
bounding-box object region candidates (“region of interest” or “RoI”) and then extracts
CNN features from each region independently. This attempt should encourage VQA
researchers to revisit the image featurization part and explore the descendants of R-CNN:
Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2015a, b, c) and the more recent
Mask R-CNN (He et al. 2017).
It is high time to shake up this frozen phase of VQA in order to obtain adequately
fine-grained image feature extraction. For answering compositional questions based on
images, especially low-quality images as in real VQA datasets like VizWiz, features
extracted from a single network would not be sufficient. Rich internal features can be
learned by combining information from multiple sources through feature fusion, which
opens a wide area for research.
Converting natural language questions to fixed-length dense vectors is an essential part
of any VQA system. Of the many possible ways to embed a natural language question,
word2vec and GloVe attracted VQA researchers in the initial days, after which the focus
shifted to deep learning models like LSTM and GRU. The point to be noted is that NLP
research moved far beyond the word2vec baseline around 2017–2018. Three of the newer
models that are found to be beneficial for VQA are FastText (Bojanowski et al. 2017),
ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018).
FastText is an extension of the original word2vec in which the main difference is the
inclusion of character n-grams. This allows word embeddings to be generated for words that
did not appear in the training data, which addresses the OOV (out-of-vocabulary) problem.
OOV words can be more frequent in goal-oriented VQA datasets like VizWiz, which are going
to be expanded and explored more in future. FastText has also been shown to be very fast to train.
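How character n-grams give an embedding to an unseen word can be sketched as follows. The n-gram extraction with `<` and `>` boundary markers follows the FastText idea; everything else is an assumption for illustration: the table of n-gram vectors is random rather than trained, `zlib.crc32` stands in for the library's hash function, and the bucket count and dimension are tiny.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the word wrapped in boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

BUCKETS, DIM = 1000, 8
rng = np.random.default_rng(0)
NGRAM_TABLE = rng.normal(size=(BUCKETS, DIM))   # stand-in for trained n-gram vectors

def oov_vector(word):
    """Embed any word, seen or unseen, as the mean of its hashed n-gram vectors."""
    rows = [zlib.crc32(g.encode("utf-8")) % BUCKETS for g in char_ngrams(word)]
    return NGRAM_TABLE[rows].mean(axis=0)

trigrams = char_ngrams("where", 3, 3)   # ['<wh', 'whe', 'her', 'ere', 're>']
vec = oov_vector("vizwizz")             # a misspelled, out-of-vocabulary word
```

Because a misspelled or rare word shares most of its n-grams with related known words, its vector ends up close to theirs once the n-gram table has been trained.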
For those who prefer an end-to-end deep learning VQA system, ELMo (Embeddings from
Language Models) and BERT (Bidirectional Encoder Representations from Transformers) are
two strong choices. Both can handle the OOV problem, and both are contextualized word
embedding models that generate different embeddings for a word depending on its context.
ELMo word embeddings are computed from the internal states of a two-layer bidirectional
language model (biLSTM): the forward and backward states of each layer are concatenated,
and the layers are then combined through a learned weighted sum. Different layers of a
language model capture different kinds of information about a word; thus, combining them
yields richer word representations. BERT uses the Transformer, an attention-based model with
positional encodings to represent word positions. Another promising fact is that BERT is a
pre-trained model, which gives feature extraction a head start in the form of transfer learning.
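As a rough illustration of how ELMo merges its layers, Peters et al. (2018) describe a task-specific combination in which softmax-normalized layer weights are applied to the layer representations and the result is scaled by a scalar. The sketch below shows that combination for a single token; the function names, the toy layer vectors and the weight values are assumptions made for illustration only.

```python
import math
from typing import List


def softmax(weights: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw weights."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers: List[List[float]], raw_weights: List[float],
               gamma: float = 1.0) -> List[float]:
    """Combine per-layer vectors of ONE token: gamma * sum_j softmax(w)_j * h_j."""
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s_j * h[d] for s_j, h in zip(s, layers))
            for d in range(dim)]


# Three hypothetical layer outputs for one token: the token layer plus
# two biLSTM layers (each already the concatenation of both directions).
token_layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Equal raw weights reduce the mix to a plain average of the layers.
mixed = scalar_mix(token_layers, raw_weights=[0.0, 0.0, 0.0])
```

In a trained system the raw weights and `gamma` would be learned together with the downstream VQA model, so the model can decide which layer matters most for question understanding.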
Another thought-provoking point is that there is still potential for better use of
concepts from NLP in tackling the challenges of VQA. One is to move a step further from
word embeddings to sentence embeddings, i.e. a numerical representation of a full sentence
that encapsulates rich meaning and is also sensitive to word order, tense and context. Only
a small portion of VQA researchers, such as Kafle and Kanan (2016) and Noh et al. (2016),
have noticed this possibility, using Skip-thought vectors (Kiros et al. 2015) for question
featurization. Their results open the door to exploiting more advanced sentence embedding
models such as Quick-thought vectors (Logeswaran and Lee 2018), InferSent (Conneau et al.
2017) and Google’s Universal Sentence Encoder (Cer et al. 2018) for vectorizing the natural
language questions in VQA.
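Why sensitivity to word order matters for questions can be seen in a toy sketch. It compares a plain average of word vectors with a very simple recurrent encoder on two questions that contain the same words in a different order. The word vectors, the update rule and the function names are all hypothetical; this is not Skip-thought or any of the sentence encoders cited above.

```python
import math
from typing import Dict, List

# Hypothetical 2-dimensional word vectors, chosen only for illustration.
VECTORS: Dict[str, List[float]] = {
    "is": [0.1, 0.1], "the": [0.2, 0.2], "on": [0.3, 0.3],
    "dog": [1.0, 0.0], "cat": [0.0, 1.0],
}


def average_embedding(tokens: List[str]) -> List[float]:
    """Bag-of-words sentence vector: the mean of the word vectors."""
    vecs = [VECTORS[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def recurrent_embedding(tokens: List[str]) -> List[float]:
    """Order-sensitive sentence vector: h_t = tanh(0.5 * h_{t-1} + x_t)."""
    h = [0.0, 0.0]
    for t in tokens:
        h = [math.tanh(0.5 * h_d + x_d) for h_d, x_d in zip(h, VECTORS[t])]
    return h


q1 = "is the dog on the cat".split()
q2 = "is the cat on the dog".split()

# The averages coincide, so the two questions become indistinguishable.
avg_gap = max(abs(a - b) for a, b in zip(average_embedding(q1), average_embedding(q2)))
# The recurrent encoding keeps the two questions apart.
rec_gap = max(abs(a - b) for a, b in zip(recurrent_embedding(q1), recurrent_embedding(q2)))
```

The two questions call for opposite answers on the same image, so a question representation that cannot tell them apart limits what any downstream fusion or attention module can do.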
On the other hand, for short sentences such as the single-line questions in VQA,
contextual embeddings may fail to perform well because of the limited amount of contextual
information available. This leaves room for character-level embeddings to improve
performance. Besides, character embeddings naturally deal with the common OOV problem.
7.2 Datasets
Another limitation of VQA research is the lack of goal-oriented datasets.
Most of the research in this field has tackled the earliest benchmark datasets, DAQUAR and
VQA. Models developed on them cannot readily be extended to real applications of VQA such as
supporting visually impaired people, helping a data analyst extract relevant information from
a pool of data, instructing a child playing a game on a touch screen, and interacting with a
robot. For VQA research to reach its goal, more publicly available application-oriented
datasets are needed.
Till now, there are only two publicly available datasets for goal-oriented VQA
research, namely VizWiz and VQA-Med. In the VQA literature, only one paper has addressed
the VizWiz dataset (Gurari et al. 2018), with a best accuracy below 50%, and according to
Hasan et al. (2018) only five teams addressed the VQA-Med dataset, with a best BLEU (refer
Sect. 6) score of 0.162. In future work, we plan to build a benchmark VQA-Med dataset for
interested researchers, with the aim of improving public health by supporting doctors’
decisions.
7.3 Evaluation
Most state-of-the-art VQA systems were evaluated using classic accuracy, which is
sufficient for the multiple-choice version of VQA. Evaluation methods for the open-ended VQA
setting need to be explored more in future research. As a starting point, the results of the
two latest VQA challenges, VizWiz and VQA-Med, have identified the possibility of using
metrics for automatic Machine Translation (MT) evaluation in VQA system evaluation.
They have tried only classic MT metrics such as BLEU and METEOR (for more details see
Sect. 6). Even though BLEU is the most popular metric for MT, reports show that it fails
with short sentences. In VQA systems, most answers are short. This property
suggests the use of NEVA (Ngram EVAluation) (Forsbom 2003). NEVA is similar in all
aspects to BLEU, but adapted for short sentences.
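The short-sentence problem can be made concrete with a small sketch. An unsmoothed BLEU-style score takes the geometric mean of the 1-gram to 4-gram precisions, so any answer shorter than four tokens scores zero even when it matches the reference exactly. The second function shows one possible adaptation, capping the n-gram order at the answer length and averaging arithmetically; it is a hypothetical illustration of the idea of adapting the metric to short answers and should not be read as the exact NEVA definition. The brevity penalty is omitted, and all function names are assumptions.

```python
import math
from collections import Counter
from typing import List


def ngram_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0  # the candidate is too short to contain any n-gram of this order
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def bleu_like(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed geometric mean of 1..max_n precisions (no brevity penalty)."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


def short_answer_score(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Hypothetical adaptation: cap n at the shorter length, average arithmetically."""
    n_cap = max(1, min(max_n, len(candidate), len(reference)))
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, n_cap + 1)]
    return sum(precisions) / n_cap


answer = "red umbrella".split()
truth = "red umbrella".split()
bleu_score = bleu_like(answer, truth)              # zero despite an exact match
adapted_score = short_answer_score(answer, truth)  # full credit for the exact match
```

Standard toolkits usually offer smoothing options that soften this effect, but the sketch shows why a metric designed around sentence-length translations needs adjusting before it is applied to one- or two-word VQA answers.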
7.4 Others
This section discusses a few particular lessons learned from this study that can guide
further research on VQA. One path for researchers is to explore solutions for visual
questions which require external knowledge to answer. There are three notable works in this
area with acceptable results, but with limitations. The first (Wang et al. 2015), named
‘Ahab’, is limited mainly by the available question templates (only 23) to which a natural
language question must be mapped. With this limitation, they obtained an overall accuracy
of 69.6%. They pointed out that one of the major drawbacks of VQA systems modeled using
LSTM is the inability of LSTM to reason explicitly in many situations. The second work
worth mentioning is by Wu et al. (2016), which combines image features with knowledge
extracted from an external source within the decoder portion of their encoder
(CNN)–decoder (LSTM) network. The best result obtained was on the COCO-QA dataset, with
an accuracy of 73.66%. The limitation is that they generated all knowledge-base queries
based only on visual content, neglecting question content. The third and latest work, by
Wang et al. (2018), introduced a new dataset, FVQA, and a VQA algorithm which works by
finding a supporting fact from a knowledge base for each question. This path of research
therefore needs a new model that is open to any type of question and is also capable of
explicit reasoning with good performance.
Another parallel research direction comes from the power of yes/no questions in the
VQA task. The advantages of binary questions include: (1) they are easy to evaluate and
(2) they can represent a wide range of tasks. A very difficult VQA task can be posed by
balancing binary questions, as shown by Zhang et al. (2016) using the VQA-Abstract
dataset. Even though there exists an argument against yes/no questions, namely that those
asked on natural images by human annotators are biased towards the “yes” answer, their
power can be explored in medical VQA.
Another significant branch of VQA systems can be opened up by the idea of adding an
emotional adjective (Ruwa et al. 2018) to the predicted answer. This can directly provide
an extension to the VizWiz challenge, where the questions are asked about photographs
taken by visually impaired people using their mobile phones. Answers with an emotional
adjective could give an added advantage to such a community and help reduce accessibility
barriers. In the real world, people tend to have partial or mixed emotions at any time,
which makes ‘fuzzy systems’ a well-equipped candidate for mood detection networks. This was
recently tried by Chaturvedi et al. (2019) for sentiment analysis, with exciting results.
So the integration of a fuzzy mood detector into a VQA network may give rise to a promising
era of affective question answering systems.
8 Conclusion
References
Agrawal A, Kembhavi A, Batra D, Parikh D (2017) C-vqa: A compositional split of the visual question
answering (vqa) v1.0 dataset. arXiv preprint arXiv:1704.08243
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down
attention for image captioning and visual question answering. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp 6077–6086
Andreas J, Rohrbach M, Darrell T, Klein D (2015) Deep compositional question answering with neural
module networks. arXiv preprint arXiv:1511.02799
Andreas J, Rohrbach M, Darrell T, Klein D (2016) Neural module networks. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. pp 39–48
Antol S, Zitnick CL, Parikh D (2014) Zero-shot learning via visual abstraction. In: European conference on
computer vision. Springer, Cham, pp 401–416
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) Vqa: Visual question
answering. In: Proceedings of the IEEE international conference on computer vision. pp 2425–2433
Bai Y, Fu J, Zhao T, Mei T (2018) Deep attention neural tensor network for visual question answering. In:
Computer vision–ECCV 2018: 15th European conference, Munich, Germany, September 8–14, 2018,
Proceedings. Springer, vol 11216, p 20
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn
Res 3(2):1137–1155
Ben-Younes H, Cadene R, Cord M, Thome N (2017) Mutan: multimodal tucker fusion for visual question
answering. In: Proceedings of the IEEE international conference on computer vision. pp 2612–2620
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information.
Trans Assoc Comput Linguist 5:135–146
Cao L, Gao L, Song J, Xu X, Shen HT (2017) Jointly learning attentions with semantic cross-modal correla-
tion for visual question answering. In: Australasian database conference. Springer, Cham, pp 248–260
Cer D, Yang Y, Kong SY, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C,
Sung YH (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175
Chaturvedi I, Satapathy R, Cavallari S, Cambria E (2019) Fuzzy common-sense reasoning for multimodal
sentiment analysis. Pattern Recognit Lett 125:264–270
Chen K, Wang J, Chen LC, Gao H, Xu W, Nevatia R (2015) ABC-CNN: An attention based convolutional
neural network for visual question answering. arXiv preprint arXiv:1511.05960
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning
phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence
representations from natural language inference data. arXiv preprint arXiv:1705.02364
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE computer society
conference on computer vision and pattern recognition. CVPR 2005. IEEE, vol 1, pp 886–893
Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target lan-
guage. In: Proceedings of the ninth workshop on statistical machine translation. pp 376–380
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika
1(3):211–218
Elman JL (1990) Finding structure in time. Cognit Sci 14(2):179–211
Fang Z, Liu J, Li Y, Qiao Y, Lu H (2019) Improving visual question answering using dropout and enhanced
question encoder. Pattern Recognit 90:404–414
Feng Y, Zhu X, Li Y, Ruan Y, Greenspan M (2018) Learning capsule networks with images and
text. In: Advances in neural information processing systems
Forsbom E (2003) Training a super model look-alike: featuring edit distance, n-gram occurrence, and one
reference translation. In: Proceedings of the workshop on machine translation evaluation: towards
systemizing MT evaluation. pp 29–36
Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pool-
ing for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847
Gao H, Mao J, Zhou J, Huang Z, Wang L, Xu W (2015) Are you talking to a machine? Dataset and meth-
ods for multilingual image question. In: Advances in neural information processing systems, pp
2296–2304
Gao P, Li H, Li S, Lu P, Li Y, Hoi SC, Wang X (2018) Question-guided hybrid convolution for visual ques-
tion answering. In: Computer vision—ECCV 2018 lecture notes in computer science. pp 485–501
Geman D, Geman S, Hallonquist N, Younes L (2015) Visual turing test for computer vision systems. In:
Proceedings of the national academy of sciences. pp 201422953
Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp
1440–1448
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and
semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern rec-
ognition. pp 580–587
Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images,
tags, and their semantics. Int J Comput Vis 106(2):210–233
Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D (2017) Making the V in VQA matter: elevating the
role of image understanding in visual question answering. In: CVPR. vol 1(2), p 3
Gurari D, Li Q, Stangl AJ, Guo A, Lin C, Grauman K, Luo J, Bigham JP (2018) VizWiz grand challenge:
answering visual questions from blind people. arXiv preprint arXiv:1802.08218
Hasan SA, Ling Y, Farri O, Liu J, Lungren M, Müller H (2018) Overview of the ImageCLEF 2018 medical
domain visual question answering task. In CLEF2018 working notes. CEUR Workshop proceedings,
Avignon, France
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. pp 770–778
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: 2017 IEEE international conference on com-
puter vision (ICCV). IEEE, pp 2980–2988
Malinowski M, Rohrbach M, Fritz M (2017) Ask your neurons: a deep learning approach to visual question
answering. Int J Comput Vis 125(1–3):110–135
Malinowski M, Doersch C, Santoro A, Battaglia P (2018) Learning visual question answering by bootstrap-
ping hard attention. In: Computer vision—ECCV 2018 lecture notes in computer science. pp 3–20
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and
phrases and their compositionality. In: Advances in neural information processing systems. pp
3111–3119
Miller GA, Charles WG (1991) Contextual correlates of semantic similarity. Lang Cognit Process 6(1):1–28
Narasimhan M, Schwing AG (2018) Straight to the facts: learning knowledge base retrieval for factual vis-
ual question answering. In: Proceedings of the European conference on computer vision (ECCV). pp
451–468
Noh H, Hongsuck Seo P, Han B (2016) Image question answering using convolutional neural network with
dynamic parameter prediction. In: Proceedings of the IEEE conference on computer vision and pat-
tern recognition. pp 30–38
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine trans-
lation. In: Proceedings of the 40th annual meeting on association for computational linguistics. Asso-
ciation for Computational Linguistics, pp 311–318
Peng L, Yang Y, Bin Y, Xie N, Shen F, Ji Y, Xu X (2019) Word-to-region attention network for visual ques-
tion answering. Multimedia Tools Appl 78(3):3843–3858
Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of
the 2014 conference on empirical methods in natural language processing (EMNLP). pp 1532–1543
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized
word representations. arXiv preprint arXiv:1802.05365
Prakash BS, Sanjeev KV, Prakash R, Chandrasekaran K (2019) A survey on recurrent neural network
architectures for sequential learning. In: Soft computing for problem solving. Springer,
Singapore, pp 57–66
Ren H, Lu H (2018) Compositional coding capsule network with k-means routing for text classification.
arXiv preprint arXiv:1810.09177
Ren M, Kiros R, Zemel R (2015a) Image question answering: a visual semantic embedding model and a
new dataset. Proc Adv Neural Inf Process Syst 1(2):5
Ren M, Kiros R, Zemel R (2015b) Exploring models and data for image question answering. In: Advances
in neural information processing systems. pp 2953–2961
Ren S, He K, Girshick R, Sun J (2015c) Faster r-cnn: Towards real-time object detection with region pro-
posal networks. In: Advances in neural information processing systems. pp 91–99
Ruwa N, Mao Q, Wang L, Dong M (2018) Affective visual question answering network. In: 2018 IEEE con-
ference on multimedia information processing and retrieval (MIPR)
Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. In: Advances in neural
information processing systems. pp 3856–3866
Saito K, Shin A, Ushiku Y, Harada T (2017) Dualnet: domain-invariant network for visual question answer-
ing. In: 2017 IEEE international conference on multimedia and expo (ICME). IEEE, pp 829–834
Shah M, Chen X, Rohrbach M, Parikh D (2019) Cycle-consistency for robust visual question answering. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6649–6658
Shi Y, Furlanello T, Zha S, Anandkumar A (2018) Question type guided attention in visual question answer-
ing. In: Computer vision—ECCV 2018 lecture notes in computer science. pp 158–175
Shih KJ, Singh S, Hoiem D (2016) Where to look: focus regions for visual question answering. In: Proceed-
ings of the IEEE conference on computer vision and pattern recognition. pp 4613–4621
Shrestha R, Kafle K, Kanan C (2019) Answer them all! toward universal visual question answering models.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 10472–10481
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in
neural information processing systems. pp 3104–3112
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A (2015) Going deeper with convo-
lutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 1–9
Teney D, Hengel AV (2018) Visual question answering as a meta learning task. In: Computer vision—
ECCV 2018 lecture notes in computer science. pp 229–245
Tommasi T, Mallya A, Plummer B, Lazebnik S, Berg AC, Berg TL (2019) Combining multiple cues for
visual madlibs question answering. Int J Comput Vis 127(1):38–60
Toor AS, Wechsler H, Nappi M (2019) Question action relevance and editing for visual question answering.
Multimedia Tools Appl 78(3):2921–2935
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you
need. In: Advances in neural information processing systems. pp 5998–6008
Wang P, Wu Q, Shen C, Hengel AVD, Dick A (2015) Explicit knowledge-based reasoning for visual ques-
tion answering. arXiv preprint arXiv:1511.02570
Wang P, Wu Q, Shen C, Dick A, van den Hengel A (2018) Fvqa: fact-based visual question answering.
IEEE Trans Pattern Anal Mach Intell 40(10):2413–2427
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting
on association for computational linguistics. Association for Computational Linguistics, pp 133–138
Wu Q, Wang P, Shen C, Dick A, van den Hengel A (2016) Ask me anything: free-form visual question
answering based on knowledge from external sources. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp 4622–4630
Wu Q, Shen C, Wang P, Dick A, van den Hengel A (2018) Image captioning and visual question answering
based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367–1381
Xu W, Rudnicky A (2000) Can artificial neural networks learn language models? In: Sixth international
conference on spoken language processing
Xu H, Saenko K (2016) Ask, attend and answer: Exploring question-guided spatial attention for visual ques-
tion answering. In: European conference on computer vision. Springer, Cham, pp 451–466
Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 21–29
Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural language
processing. IEEE Comput Intell Mag 13(3):55–75
Yu L, Park E, Berg AC, Berg TL (2015) Visual madlibs: fill in the blank description generation and question
answering. In: Proceedings of the IEEE international conference on computer vision. pp 2461–2469
Yu D, Fu J, Mei T, Rui Y (2017) Multi-level attention networks for visual question answering. In: 2017
IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 4187–4195
Yu D, Gao X, Xiong H (2018a) Structured semantic representation for visual question answering. In: 2018
25th IEEE international conference on image processing (ICIP). IEEE, pp 2286–2290
Yu Z, Yu J, Xiang C, Fan J, Tao D (2018b) Beyond bilinear: generalized multimodal factorized high-order
pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947–5959
Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6281–6290
Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: European confer-
ence on computer vision. Springer, Cham, pp 818–833
Zhang P, Goyal Y, Summers-Stay D, Batra D, Parikh D (2016) Yin and yang: balancing and answering
binary visual questions. In: Proceedings of the IEEE conference on computer vision and pattern rec-
ognition. pp 5014–5022
Zhao W, Peng H, Eger S, Cambria E, Yang M (2019) Towards scalable and reliable capsule networks for
challenging NLP applications. arXiv preprint arXiv:1906.02829
Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015) Simple baseline for visual question answering.
arXiv preprint arXiv:1512.02167
Zhu Y, Zhang C, Ré C, Fei-Fei L (2015) Building a large-scale multimodal knowledge base system for
answering visual queries. arXiv preprint arXiv:1507.05670
Zhu Y, Groth O, Bernstein M, Fei-Fei L (2016) Visual7w: Grounded question answering in images. In: Pro-
ceedings of the IEEE conference on computer vision and pattern recognition. pp 4995–5004
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.