
Artificial Intelligence Review (2020) 53:5705–5745

https://doi.org/10.1007/s10462-020-09832-7

Visual question answering: a state‑of‑the‑art review

Sruthy Manmadhan1 · Binsu C. Kovoor1

Published online: 8 April 2020


© Springer Nature B.V. 2020

Abstract
Visual question answering (VQA) is a task that has received immense attention from two
major research communities: computer vision and natural language processing. Recently,
it has been widely accepted as an AI-complete task that can serve as an alternative to the
Visual Turing Test. In its most common form, it is a challenging multi-modal task in which
a computer is required to provide the correct answer to a natural language question asked
about an input image. It has attracted many deep learning researchers following their
remarkable achievements in text, speech and vision technologies. This review extensively
and critically examines the current status of VQA research in terms of step-by-step solution
methodologies, datasets and evaluation metrics. Finally, this paper also discusses future
research directions for each of the above-mentioned aspects of VQA separately.

Keywords Visual question answering · Computer vision · Natural language processing · Deep learning

1 Introduction

Matt King of Facebook says, “But from my perspective as a blind user, going from essentially
zero percent satisfaction from a photo to somewhere in the neighborhood of half … is a huge
jump”, commenting on Facebook's attempt to automatically caption photos for blind users.
The inference is that it would be invaluable if machines were intelligent enough to understand
image content and communicate this understanding as effectively as humans do. VQA is a
stepping stone to this Artificial Intelligence dream (AI-dream) of Visual Dialogue. In the most
common form of Visual Question Answering (VQA), the computer is presented with an image
and a textual question about this image. The machine's task is then to generate the correct
answer, typically a few words or a short phrase. That is, VQA is a task guided by mature
research in computer vision (CV) and natural language processing (NLP), both of which fall
under the domain of AI. In the words of

* Sruthy Manmadhan
sruthym.88@gmail.com
Binsu C. Kovoor
binsukovoor@gmail.com
1 Division of Information Technology, Cochin University of Science and Technology, Kochi, Kerala, India


Fig. 1  Definition of VQA

Table 1  Computer vision sub-tasks required to be solved by VQA

CV task                                   Representative VQA question
Object recognition                        What is in the image?
Object detection                          Are there any dogs in the picture?
Attribute classification                  What color is the umbrella?
Scene classification                      Is it raining?
Counting                                  How many people are there in the image?
Activity recognition                      Is the child crying?
Spatial relationships among objects       What is between the cat and the sofa?
Commonsense reasoning                     Does this person have 20/20 vision?
Knowledge-base reasoning                  Is this a vegetarian pizza?

Devi Parikh, a VQA researcher, it is a great combination of pictures, words and common
sense, as shown in Fig. 1.
Compared to other vision-language tasks such as image captioning and text-to-image
retrieval, VQA is more challenging because: (1) the questions are not predetermined; in
other tasks the question to be answered is fixed, so the operations required to answer it
are fixed and only the image changes. (2) The supporting visual information is very high
dimensional. (3) VQA necessitates solving many computer vision sub-tasks, some of which
are given in Table 1 with a representative question in the second column. In this respect,
VQA can be used as an alternative to the Visual Turing Test for computer vision systems
(Geman et al. 2015); in other words, VQA is an "AI-complete" task, as it demands
multi-modal knowledge beyond a single domain.
After the resounding success of convolutional neural networks such as AlexNet (Krizhevsky
et al. 2012), ZFNet (Zeiler and Fergus 2014), VGGNet (Simonyan and Zisserman 2014),
GoogleNet (Szegedy et al. 2015) and ResNet (He et al. 2016) in the ImageNet challenge, and
the renowned results produced by RNNs in the field of text processing (Cho et al. 2014;
Sutskever et al. 2014; Prakash et al. 2019), many researchers became enthusiastic about
working on VQA.
Another factor that attracts researchers is the vast number of potential applications of
VQA. One of the most socially relevant and direct applications is helping blind users
interact with pictures. VQA can also be used to improve image retrieval, which can be
exploited commercially by online shopping sites to attract customers by giving exact
results for their search queries. Incorporation of VQA may increase the popularity of
online educational services by allowing learners to interact with images. Another
application of VQA is in the field of surveillance data analysis, where VQA can help the
analyst summarize the available visual data. As mentioned previously, VQA can be used
to measure and demonstrate the AI capabilities of a system. The successor of VQA, Visual
Dialogue, can even be used to give natural language instructions to robots.
Many researchers have proposed solutions or algorithms for the task of VQA, which can
generally be visualized as a three-phase process, as seen in Fig. 2. VQA research has its
roots in the deep learning era because most successful VQA solutions utilize deep learning
models: CNNs for image featurization (Phase I) and RNNs and their variants, LSTM and
GRU, for question featurization (Phase I). The major phase of exploration is Phase II, in
which the two processed features are combined to answer the question about an image.
The literature offers many methods for doing this, ranging from simple concatenation to
complex attention mechanisms, which are explored and compared in Sect. 4. Current
state-of-the-art approaches model the final phase either as a classification problem, where
a classifier is learned over a predetermined set of candidate answers, or as a generation
problem, where a free-form sentence is generated as the answer, which is the actual goal
of VQA. Kafle and Kanan (2016) tried a hybrid of these two in their work.
Major contributions of this paper include:

1. This survey serves as a pocket reference for experts and as a beginner's guide for learners
   who are interested in understanding the VQA problem, its intelligent solutions, and the
   available datasets and evaluation methods.
2. This is the first phase-wise review of VQA, providing extensive analysis and comparison
   of the different methodologies used in the different steps of VQA, including image
   featurization, question featurization and the joint comprehension of image and question
   features to generate the answer.
3. It discusses and analyzes the various publicly available datasets, categorizing them
   based on the nature of the images and questions they contain and identifying the
   limitations of each dataset, to alert researchers to the need for new datasets.
4. It summarizes all available evaluation metrics for VQA solutions, with formulas, and
   also discusses the possibility of using some new metrics.
5. Finally, this review lists possible and required future research directions in this area to
   attain the AI-dream.

The rest of this paper is organized as follows. Sections 2 and 3 consist of detailed
explanations of phase I tasks, image featurization and question featurization respec-
tively. Section 4 includes details of the most simple to sophisticated methods for com-
bining multi-modal features extracted in Phase I. Section 5 gives an overview of the
different datasets available for researchers to validate their models, along with their
characteristics. For any problem with multiple solutions, performance evaluation is
critical; the different metrics for evaluating VQA solutions are discussed in Sect. 6.
Finally, Sect. 7 points to future research directions by identifying gaps in the existing
literature.

Fig. 2  General VQA algorithm phases

2 Image featurization

One of the two preparatory tasks of VQA is image featurization. An image feature
describes an image as a numerical vector so that different mathematical operations can be
easily applied. There are many ways to do this explicitly, such as simple RGB vectors, the
Scale-Invariant Feature Transform (SIFT) (Lowe 1999), Haar-like features (Lienhart and
Maydt 2002) and the Histogram of Oriented Gradients (HOG) (Dalal and Triggs 2005). In
the deep learning era, explicit featurization is not required, since the features are learned
by the deep neural network itself. Training deep learning models from scratch requires
large datasets and significant computational resources; using pre-trained deep neural
network models to extract relevant features from images makes this task much easier.
One of the most successful neural network architectures for image featurization is the
convolutional neural network (CNN). Table 2 gives details of some prominent and widely
accepted state-of-the-art CNN variants for this task, along with the year of introduction,
the depth of the model as a number of layers, the size of the input image, the size of the
feature vector extracted from the last fully connected layer, and the reported error on the
ImageNet dataset, on which these models were initially trained and benchmarked and
out-performed their competitors by a large margin.
Most state-of-the-art VQA models use these successful CNNs with their last layer
removed, sometimes followed by normalization (Kafle and Kanan 2016; Saito et al.
2017; Fukui et al. 2016) and dimensionality reduction (Kafle and Kanan 2016; Ma et al.
2016; Antol et al. 2015), to represent the visual content as a numerical vector. Table 3
maps VQA models to these ImageNet winners based on usage, where the columns list the
five CNN models and the rows indicate major VQA systems. Overall statistics of the
pre-trained CNNs used for VQA image featurization are shown in Fig. 3. From this,
researchers in the field can quickly see that VGGNet and ResNet have been the most
widely used in VQA systems. One reason people prefer VGGNet is that it extracts features
that are slightly more general and more effective for datasets other than ImageNet, on
which these models are trained. Other reasons include quick convergence on fine-tuning
and simple implementation compared to the inception-style architecture of GoogLeNet
and the residual connections of ResNet. Readers can easily notice a trend of migrating
from VGG to ResNet in recent papers, as sufficient computational resources are now
available at a reasonable cost.
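As a concrete illustration of this common pipeline, the sketch below shows one way to turn a pre-trained torchvision VGG-19 into a feature extractor by dropping its final classification layer, followed by the L2 normalization mentioned above. The model choice, input size and normalization step are illustrative assumptions, not the prescription of any particular surveyed VQA paper.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a VGG-19 pre-trained on ImageNet and drop its last classification layer,
# exposing the 4096-d output of the penultimate fully connected layer.
# (Assumes a recent torchvision; older versions use pretrained=True instead.)
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

# Standard ImageNet preprocessing: resize to 224 x 224 and normalize channels.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(path: str) -> torch.Tensor:
    """Return an L2-normalized 4096-d feature vector for the image at `path`."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = vgg(img).squeeze(0)        # shape: (4096,)
    return feat / feat.norm(p=2)          # L2 normalization of the visual feature

# Example usage (hypothetical file name):
# v = image_feature("kitchen_scene.jpg")
```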

3 Question featurization

Word embedding is a mapping of words or phrases from a vocabulary to numerical vectors
so that computers can easily handle them. It is mainly used for language modeling and
feature learning in natural language processing (NLP). The basic idea behind all word
embedding methods is to capture as much of the semantic, morphological or contextual
information as possible. Choosing the best embedding for a particular application always
needs a trial-and-error approach, because there are different training algorithms and text
corpora, both of which have a disparate impact on the generated word embeddings.

Table 2  Overview of predominant CNN models trained on ImageNet

Successful CNN models                     Year   Number of layers   Input dimension   Output dimension (number of features)   Reported error
AlexNet (Krizhevsky et al. 2012)          2012   8                  227 × 227         4096                                    16.4
ZFNet (Zeiler and Fergus 2014)            2013   8                  227 × 227         4096                                    11.7
VGGNet (Simonyan and Zisserman 2014)      2014   19                 224 × 224         4096                                    7.3
GoogleNet (Szegedy et al. 2015)           2014   22                 229 × 229         1024                                    6.7
ResNet (He et al. 2016)                   2015   152                224 × 224         2048                                    3.57 (better than human)

Table 3  Mapping of VQA models to CNN models


VQA model AlexNet ZFNet VGGNet GoogleNet ResNet

Image_QA (Ren et al. 2015a) ✓


Talk_to_Machine (Gao et al. 2015) ✓
VQA (Antol et al. 2015) ✓
Vis_Madlibs (Yu et al. 2015) ✓ ✓
VIS + LSTM (Ren et al. 2015b) ✓
Multimodal KB (Zhu et al. 2015) ✓
Ahab (Wang et al. 2015) ✓
ABC-CNN (Chen et al. 2015) ✓
Comp_QA (Andreas et al. 2015) ✓
DPPNet (Noh et al. 2016) ✓
Answer_CNN (Ma et al. 2016) ✓
VQA-Caption (Lin and Parikh 2016) ✓
Re_Baseline (Jabri et al. 2016) ✓
MCB (Fukui et al. 2016) ✓
SMem-VQA (Xu and Saenko 2016) ✓
Region_VQA (Shih et al. 2016) ✓
Vis7W (Zhu et al. 2016) ✓
Ask_Neuron (Malinowski et al. 2017) ✓ ✓ ✓ ✓
SCMC (Cao et al. 2017) ✓
HAN (Malinowski et al. 2018) ✓
StrSem (Yu et al. 2018a, b) ✓
AVQAN (Ruwa et al. 2018) ✓
CMF (Lao et al. 2018) ✓
EnsAtt (Lioutas et al. 2018) ✓
MetaVQA (Teney and Hengel 2018) ✓
DA-NTN (Bai et al. 2018) ✓
QGHC (Gao et al. 2018) ✓
QTA (Shi et al. 2018) ✓
WRAN (Peng et al. 2019) ✓
QAR (Toor et al. 2019) ✓

Fig. 3  Statistical analysis on usage of pre-trained CNNs for image featurization

The task of VQA demands word embeddings for question representation because most
machine learning algorithms and almost all deep learning architectures are incapable of
processing strings or plain text in their raw form.


Primarily, word or text embedding methods can be categorized into three groups: (1)
count-based methods; (2) prediction-based methods and (3) hybrid methods. All embedding
methods take as input an extensive text collection from the Internet or from publications,
which forms the corpus. The set of all unique words in a corpus forms its vocabulary (V),
and the result of embedding is a representation for every word in V.

3.1 Count based methods

The simplest of all is one-hot encoding of words, which results in a vector of size |V|. An
example of one-hot encoding is shown in Fig. 4, where the top portion gives details of the
corpus used, followed by the embedding vector for each word in the corpus.
The main drawback of this embedding is that it does not capture the notion of similarity:
the Euclidean distance between any two words represented using one-hot embeddings is
√2 and their cosine similarity is 0. Researchers in this field were motivated by the quote
of Firth (1957), "You shall know a word by the company it keeps". This led to
distributional-similarity-based word embeddings built on a co-occurrence matrix (Miller
and Charles 1991). A co-occurrence matrix is a terms × terms matrix that captures the
number of times a term appears in the context of another term. The context is defined as
a window of k words around the terms, so it can be viewed as a word × context matrix.
For example, assume the same corpus given in Fig. 4; the co-occurrence matrix will then
look like Fig. 5. Each row (column) of the co-occurrence matrix gives a vector
representation of the corresponding word (context).
Some words (say, stop words or other unnecessary words) may be excluded from the
context, so that the number of columns is smaller than the number of rows; otherwise,
their counts will be very high. Such a table is highly sparse, as most frequencies are equal
to zero.
In practice, the co-occurrence counts are converted to probabilities. Still, the size of this
matrix grows with the size of V. A simple mathematical solution is to use a low-rank
approximation of the original co-occurrence matrix given by singular value decomposition
(SVD) (Eckart and Young 1936). SVD provides the best rank-k approximation of the

Corpus: Visual Question Answering is a task of answering a question based on a given image. |V| = 11

1 2 3 4 5 6 7 8 9 10 11
visual 1 0 0 0 0 0 0 0 0 0 0
question 0 1 0 0 0 0 0 0 0 0 0
answering 0 0 1 0 0 0 0 0 0 0 0
is 0 0 0 1 0 0 0 0 0 0 0
a 0 0 0 0 1 0 0 0 0 0 0
task 0 0 0 0 0 1 0 0 0 0 0
of 0 0 0 0 0 0 1 0 0 0 0
based 0 0 0 0 0 0 0 1 0 0 0
on 0 0 0 0 0 0 0 0 1 0 0
given 0 0 0 0 0 0 0 0 0 1 0
image 0 0 0 0 0 0 0 0 0 0 1

Fig. 4  One-hot embedding


1 2 3 4 5 6 7 8 9 10 11
visual 1 1 1 0 0 0 0 0 0 0 0
question 1 0 2 1 1 0 0 1 1 0 0
answering 1 2 0 1 1 1 1 0 0 0 0
is 0 1 1 0 1 1 0 0 0 0 0
a 0 1 2 1 0 1 2 1 0 0 0
task 0 0 1 1 1 0 1 0 0 0 0
of 0 0 1 0 2 1 0 0 0 0 0
based 0 1 0 0 2 0 0 0 1 0 0
on 0 1 0 0 1 0 0 1 0 1 0
given 0 0 0 0 0 0 0 1 1 0 1
image 0 0 0 0 1 0 0 0 0 0 0

Fig. 5  Co-occurrence matrix

original data (the co-occurrence matrix A), as given in Eq. 1 and further described in
Fig. 6. This is done by keeping only the top k left singular vectors (k columns of U), the
top k singular values (k rows and k columns of S) and the top k right singular vectors
(k rows of V^T). It discovers latent semantics in the corpus.

A_k = U_k S_k V_k^T    (1)
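A minimal NumPy sketch of this idea is given below: it builds a window-based co-occurrence matrix for the toy corpus of Fig. 4 and keeps a rank-k SVD approximation as the dense word vectors. The window size and k are arbitrary choices made only for illustration.

```python
import numpy as np

corpus = ("visual question answering is a task of answering "
          "a question based on a given image").split()
vocab = sorted(set(corpus))            # |V| = 11, as in Fig. 4
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a symmetric window of k_window words.
k_window = 2
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - k_window), min(len(corpus), i + k_window + 1)):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1

# Rank-k approximation via SVD: rows of U_k * S_k serve as dense word vectors.
k = 4
U, S, Vt = np.linalg.svd(C, full_matrices=False)
word_vectors = U[:, :k] * S[:k]        # shape: (|V|, k)

print(word_vectors[idx["question"]])
```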

3.2 Prediction based methods

The methods discussed so far are called count-based models because they use the
co-occurrence counts of words. The next category includes techniques that directly learn
word representations (these are called (direct) prediction-based models). These models
use a neural network as their basic component.
With the goal of creating a distributed representation of words, Xu and Rudnicky (2000)
created a neural network model which is considered the starting point of this line of work.
However, the classic model in this category is the three-layer network introduced by
Bengio et al. (2003), shown in Fig. 7. Plenty of later works build on this model.

Fig. 6  SVD notation


Fig. 7  Neural architecture: f(i, w_{t−1}, …, w_{t−n+1}) = g(i, C(w_{t−1}), …, C(w_{t−n+1})), where g is the
neural network and C(i) is the i-th word feature vector (Bengio et al. 2003)

Mikolov et al. (2013a) proposed two state-of-the-art prediction-based word embeddings,
the continuous bag-of-words (CBOW) and skip-gram models. Subsequently, Google
released the skip-gram model as an open-source project named word2vec, which is widely
used. In CBOW, the problem of predicting the nth word given the previous (n − 1) words
is modeled as multi-class classification using a feed-forward neural network. For example,
assume the above-mentioned corpus with a single sentence and consider the task of
predicting the word 'answering' (the nth word), given the words 'visual question' (the
previous n − 1 words). The input to the network is the concatenated one-hot representation
of the context words (visual, question) and the output is a probability distribution over all
|V| possible words (classes). The basic structure of the CBOW network in the context of
this example is given in Fig. 8a. In short, it predicts an output word given a bag of context
words. After training, the weight matrix between the hidden layer and the output layer
(W_word) is taken as the word vector representation, where each column represents a
word vector of size [1 × |V|]. Since CBOW can use many context words to predict the one
target word, it can substantially smooth out over the distribution, which is suitable only if
the input data is small.
The skip-gram model (Fig. 8b) does just the opposite: it predicts the context words on
both sides (say, 'visual' and 'answering') of the input word (say, 'question').
In a later paper, Mikolov et al. (2013b) proposed various extensions to the basic
skip-gram model to avoid expensive operations at the output layer. One widely accepted
extension is negative sampling, which is used in word2vec. A study by Levy et al. (2015)
reveals that the factors which help prediction-based models achieve good results can be
transferred to traditional distributional models, yielding similar gains.
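For completeness, the snippet below sketches how skip-gram with negative sampling can be trained with the gensim word2vec implementation (assuming the gensim 4.x API). The toy corpus and hyper-parameters are placeholders, not values used by any surveyed VQA model.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized questions (illustrative only).
questions = [
    "what color is the umbrella".split(),
    "how many people are there in the image".split(),
    "is the child crying".split(),
]

# sg=1 selects the skip-gram objective; negative=5 enables negative sampling,
# the output-layer approximation discussed above.
model = Word2Vec(questions, vector_size=100, window=2, min_count=1,
                 sg=1, negative=5, epochs=50, seed=1)

vec = model.wv["umbrella"]     # 100-d embedding of a single word
print(vec.shape)
```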


Fig. 8  a CBOW network architecture, b skip-gram architecture

3.3 Hybrid methods

Count-based methods rely on global co-occurrence counts from the corpus to compute
word representations, while prediction-based methods learn word representations from
local co-occurrence information. Pennington et al. (2014) proposed global vectors (GloVe),
which combines both to produce a word embedding. They proposed a weighted
least-squares model trained on global information from the co-occurrence matrix (only
on the nonzero elements, rather than on the entire sparse matrix or on individual context
windows in a large corpus).
A comparative study of all these techniques can be found in Table 4, which states the
merits and demerits of each of the models described in Sects. 3.1, 3.2 and 3.3.
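In practice, pre-trained GloVe vectors are usually loaded from the publicly released text files; the sketch below parses such a file and averages the word vectors of a question as a crude fixed-length representation. The file name assumes the standard 300-dimensional release, and the averaging step is only an illustration, not a method prescribed by the surveyed papers.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Assumes the standard 300-d release; any GloVe text file is parsed the same way.
glove = load_glove("glove.6B.300d.txt")

def question_vector(question, dim=300):
    """Average the GloVe vectors of the question words (unknown words skipped)."""
    tokens = question.lower().replace("?", "").split()
    words = [glove[w] for w in tokens if w in glove]
    return np.mean(words, axis=0) if words else np.zeros(dim, dtype=np.float32)

q = question_vector("What color is the umbrella?")
```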

3.4 Recent trends in text embedding

In the advanced era of deep learning, VQA researchers have also used convolutional
neural networks (CNN) (Krizhevsky et al. 2012), long short-term memory (LSTM)
(Hochreiter and Schmidhuber 1997) and gated recurrent units (GRU) (Cho et al. 2014) to
extract question representations.
The CNN used for question feature extraction (Kim 2014) takes as input the concatenated
vector representations of all n words of the question. It then applies multiple convolutional
filters followed by max-pooling operations. The resulting feature maps are flattened to
form the penultimate layer, which can be used as the question vector.
LSTM is a recurrent neural network (Elman 1990) designed to address the gradient
explosion and vanishing problems. The LSTM layer stores context information in its
memory cells and serves as a bridge among the words in a sequence (e.g. a question). To
model long-term dependencies in the data more effectively, the LSTM adds three gate
nodes to the traditional RNN structure: the input gate, the output gate and the forget gate.
The input gate and output gate regulate read and write access to the LSTM memory cells,
while the forget gate resets the memory cells when their contents are out of date. The
output state vector from the last time step can be used as a question feature. Figure 9
explains the basic flow of information in an LSTM network.
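A minimal PyTorch sketch of such an LSTM question encoder, whose last hidden state serves as the question feature, is shown below. The vocabulary size and dimensions are placeholders rather than settings from any surveyed model.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embed question tokens and summarize them with the last LSTM hidden state."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)            # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                 # question feature: (batch, hidden_dim)

encoder = QuestionEncoder()
dummy_question = torch.randint(1, 10000, (2, 8))   # two questions of 8 token ids
q_feat = encoder(dummy_question)                   # shape: (2, 512)
```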

Table 4  Comparison of word embedding techniques

Count based models
  One-hot vector
    + Simple to implement and interpret
    − Size of the embedding is equal to the size of the vocabulary, which is very large
    − It does not capture the notion of similarity
  Co-occurrence matrix
    + It preserves the semantic relationship between words
    − Size grows with the size of the vocabulary
  Co-occurrence matrix + SVD
    + Dimensionality reduction by rank-k approximation
    + Discovers latent semantics in the corpus
    − Does not work well for the word analogy task
Prediction based models
  CBOW
    + Being probabilistic, these are supposed to perform better than deterministic methods (generally)
    + These are low on memory
    − The softmax used at the output layer is expensive
    − Since CBOW can use many context words to predict the one target word, it can essentially smooth out over the distribution, which is suitable only if the input data is small
  Skip-gram
    − It gives fine-grained embeddings only when the amount of training data is high enough
    − Poorly utilizes the statistics of the corpus, since it trains on separate local context windows instead of on global co-occurrence counts
Hybrid models
  GloVe
    + More accurate than skip-gram because the global corpus statistics are captured directly by the model
    + Once computed, GloVe can reuse the co-occurrence matrix to factorize with any dimensionality quickly
    − It has a larger memory footprint, which may be an issue for a large corpus

Fig. 9  Basic architecture of LSTM question feature extraction

A gated recurrent unit (GRU) was proposed by Cho et al. (2014) to make each recurrent
unit adaptively capture dependencies at different time scales. Similar to the LSTM unit,
the GRU has gating units that modulate the flow of information inside the unit, but without
separate memory cells. The difference from the LSTM is that the functionalities of the
input and forget gates are merged into a single update gate. Here also, the last hidden
state representation can be used as the question feature.

3.5 VQA question embeddings

Most of the word embeddings discussed above have been explored by various VQA
algorithms to create a feature vector for the given natural language question. Usage
statistics of these word embedding methods for VQA can be found in Table 5 (where the
columns list seven promising word embedding models and the rows indicate major VQA
systems proposed in the literature) and Fig. 10. This statistical study shows that LSTM
(and, generally, the RNN family) is preferred by VQA researchers, which is consistent
with the claim of Young et al. (2018) that sequence-based models like RNNs do better
than word-order-independent methods like word2vec. However, these models do not have
an independent existence without traditional embeddings, because word vectors created
using any of the models listed in Table 4 are fed as input to the LSTM or GRU. At the
same time, they suffer from the fact that a large amount of labeled data is required for
training.
In a study by Shih et al. (2016), the authors stated that a simple bag-of-words (BoW)
embedding is enough for VQA, with no need to train and use an LSTM or GRU.
Interestingly, even though co-occurrence with SVD has been shown to perform well in
capturing latent semantics, it is not regularly used in VQA for question featurization,
although SVD also helps in reducing dimensionality. Levy and Goldberg (2014) show that
exact factorization with SVD can achieve solutions that are at least as good as those of
skip-gram with negative sampling (SGNS) for word similarity tasks.
This section describes some noteworthy use cases of text embeddings hand-picked from
the VQA literature.

• Antol et al. (2015) utilized the idea of creating a BoW with the top 1000 words in the
  questions of the dataset. They also exploited the strong correlation between the words
  that start a question and the answer by creating another BoW from the top 10 first,
  second and third words of the questions and concatenating it to the first representation.


Table 5  Mapping of VQA models to word embedding techniques (One-hot embedding, CBOW, Skip-gram/
word2vec, GloVe, CNN, LSTM, GRU)
VQA model 1 2 3 4 5 6 7

Image_QA (Ren et al. 2015a) ✓


Talk_to_Machine (Gao et al. 2015) ✓
VQA (Antol et al. 2015) ✓
Vis_Madlibs (Yu et al. 2015) ✓
VIS + LSTM (Ren et al. 2015b) ✓
ABC-CNN (Chen et al. 2015) ✓
Comp_QA (Andreas et al. 2015) ✓
DPPNet (Noh et al. 2016) ✓
Answer_CNN (Ma et al. 2016) ✓
VQA-Caption (Lin and Parikh 2016) ✓
Re_Baseline (Jabri et al. 2016) ✓
MCB (Fukui et al. 2016) ✓
SMem-VQA (Xu and Saenko 2016) ✓
Region_VQA (Shih et al. 2016) ✓
Vis7W (Zhu et al. 2016) ✓
Ask_Neuron (Malinowski et al. 2017) ✓ ✓ ✓ ✓
SCMC (Cao et al. 2017) ✓
HAN (Malinowski et al. 2018) ✓
StrSem (Yu et al. 2018a, b) ✓
AVQAN (Ruwa et al. 2018) ✓
CMF (Lao et al. 2018) ✓ ✓
EnsAtt (Lioutas et al. 2018) ✓
MetaVQA (Teney and Hengel 2018) ✓ ✓
DA-NTN (Bai et al. 2018) ✓
QGHC (Gao et al. 2018) ✓
WRAN (Peng et al. 2019) ✓
QAR (Toor et al. 2019) ✓
1 One-hot embedding, 2 CBOW, 3 Skip-gram/word2vec, 4 GloVe, 5 CNN, 6 LSTM, 7 GRU

• Noh et al. (2016) incorporated word embeddings as part of a parameter prediction
  network, which includes GRU cells, to produce candidate weights to be mapped to the
  dynamic parameter layer.
• With a new balanced binary dataset for VQA, Zhang et al. (2016) proposed a strategy
  that summarizes the information of the question in a tuple form (a PRS tuple, where P
  and S are noun phrases and R is a verb representing the relation between P and S) that
  concisely

Fig. 10  Usage statistics of different word embedding methods

defines the visual idea whose presence is to be verified to answer the question. It will
guide the visual feature extraction process.
• Shih et al. (2016) tried binning of questions to yield a simplified fixed-length
  representation of important concepts from variable-length questions. They parsed each
  question into four different bins: (1) the type of question, using the first two words,
  (2) the nominal subject, (3) all noun words and (4) all remaining words. A fifth bin is
  designated for candidate answers, since the model performs the multiple-choice VQA task.
• Yu et al. (2018a, b) leverage a tree-LSTM network to capture the linguistic structure of
  the language. They map each question in the dataset to a semantic tree where each node
  refers to a single LSTM unit and the root node is set to represent the whole sequence.
  This compositional semantic representation technique can break questions down into
  logical expressions to improve reasoning ability.
• Recently, Toor et al. (2019) reported exciting results for their VQA model by using two
  novel concepts: (1) Question Action Relevance (QAR) and (2) Question Action Editing
  (QAE). QAR identifies irrelevant question action words from the generated image
  caption, and QAE edits the question to map irrelevant words to relevant actions.

4 Joint comprehension of image and text

In VQA, the image and the question are processed independently to obtain separate vector
representations; the different methods for doing this are detailed in the previous two
sections. In the next step of VQA, these features are mapped to a joint space, combined,
and fed to the answer generation stage. This literature review identified a wide assortment
of techniques for consolidating image and question features, ranging from simple
concatenation to complex joint attention networks.

4.1 Baseline fusion models

Baseline methods include concatenation (Zhou et al. 2015; Jabri et al. 2016; Yu et al.
2018a, b; Huang et al. 2018), element-wise addition and element-wise multiplication
(Antol et al. 2015; Zhang et al. 2016; Goyal et al. 2017; Lin and Parikh 2016; Teney and
Hengel 2018), where the last two require compatible feature vector dimensions. In
Malinowski et al. (2017), all three of these multimodal fusion methods were tried, and
element-wise multiplication was found to give the best accuracy. Another important
finding is that L2 normalization of visual features has a significant impact on fusion
performance, especially for concatenation and summation; according to their results,
summation achieves high accuracy after normalization.
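The snippet below sketches these three baseline fusion operations, together with the L2 normalization of visual features discussed above. It assumes both features have already been projected to a common dimensionality and is a generic illustration, not the implementation of any specific surveyed system.

```python
import torch
import torch.nn.functional as F

def fuse(img_feat, q_feat, mode="mul"):
    """Baseline multimodal fusion. Sum and product assume equal dimensionality."""
    img_feat = F.normalize(img_feat, p=2, dim=-1)   # L2-normalize the visual feature
    if mode == "concat":
        return torch.cat([img_feat, q_feat], dim=-1)
    if mode == "sum":
        return img_feat + q_feat
    if mode == "mul":
        return img_feat * q_feat
    raise ValueError(mode)

img = torch.randn(2, 512)        # image features projected to 512-d (placeholder)
q = torch.randn(2, 512)          # question features of the same size (placeholder)
joint = fuse(img, q, mode="mul") # element-wise product baseline
```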
Shih et al. (2016) use the dot product of region-wise visual features and question
embeddings. Saito et al. (2017) introduced a hybrid way of multimodal fusion: they
integrate element-wise summation and element-wise multiplication by implementing a
polynomial function given as Eq. 2,

(1 + x_1 + x_2 + x_3 + ⋯ + x_d)(1 + y_1 + y_2 + y_3 + ⋯ + y_d)    (2)

where x_i and y_i indicate the ith dimension of an image feature and a question feature,
respectively. The intuition behind this integration is that features from multiplication and
summation will be substantially different. Lioutas et al. (2018) proposed an ensemble
attention model where multimodal fusion is done via concatenation of question, answer
and image features, with the question and answer features first multiplied element-wise.
Another classic method for finding the relationship between two vectors, with its roots in
statistics, is Canonical Correlation Analysis (CCA), which has been used for multimodal
fusion by VQA researchers. CCA finds a joint representation between two
multi-dimensional variables, in the case of VQA the image and question vectors. One
scalable extension of CCA is normalized CCA (nCCA), proposed by Gong et al. (2014),
which applies an explicit kernel mapping followed by dimensionality reduction. Yu et al.
(2015) and Tommasi et al. (2019) trained both CCA and nCCA models for VQA and
found that nCCA has excellent performance, especially for multiple-choice questions.

4.2 End‑to‑end neural network models

Here, researchers train end-to-end deep neural networks for VQA task with specific layers
for joint comprehension of image and question features. The structure and functioning of
this layer may be different for different proposed VQA end-to-end models.
Gao et al. (2015) implemented image-question fusion as an additional layer with a
non-linear activation function, the scaled hyperbolic tangent, given as

g(x) = 1.7159 ⋅ tanh((2∕3) ⋅ x)    (3)

where x is the combination of the question, image and answer embeddings obtained using
element-wise addition.
Andreas et al. (2016) describe a system for building and learning neural module networks
(NMN), which compose collections of jointly trained neural "modules" into deep networks
for VQA. Their approach breaks questions down into their semantic substructures and uses
these structures to dynamically instantiate modular networks (with reusable components
for recognizing objects, classifying colors, and so on). The resulting compound networks
are jointly trained.
Fukui et al. (2016) proposed an end-to-end model with a Multimodal Compact Bilinear
Pooling (MCB) layer (see Fig. 11) for the joint representation of image and question
features. Their intuition behind using bilinear pooling for multimodal fusion is that the
outer product of feature vectors is more expressive than simple baseline methods like
concatenation.


Fig. 11  Multimodal Compact bilinear pooling (Fukui et al. 2016)

For a complete description of the operation portrayed in Fig. 11, readers may refer to the
source cited in the figure caption.
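For intuition, the NumPy sketch below shows the count-sketch plus FFT approximation of the outer product on which MCB builds. For brevity the random hash parameters are regenerated on every call, whereas a real model samples them once and keeps them fixed; the output dimension is an arbitrary choice, and this is a simplified sketch rather than the authors' released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch_params(n, d):
    """Random hash h: [n] -> [d] and signs s in {-1, +1}."""
    return rng.integers(0, d, size=n), rng.choice([-1.0, 1.0], size=n)

def count_sketch(x, h, s, d):
    psi = np.zeros(d)
    np.add.at(psi, h, s * x)        # psi[h[i]] += s[i] * x[i]
    return psi

def mcb(x, y, d=8000):
    """Approximate the compressed outer product of x and y in a d-dim space."""
    h1, s1 = count_sketch_params(len(x), d)
    h2, s2 = count_sketch_params(len(y), d)
    fx = np.fft.rfft(count_sketch(x, h1, s1, d))
    fy = np.fft.rfft(count_sketch(y, h2, s2, d))
    # Element-wise product in the frequency domain = circular convolution of sketches.
    return np.fft.irfft(fx * fy, n=d)

img_feat = rng.normal(size=2048)    # e.g. a pooled CNN image feature (placeholder)
q_feat = rng.normal(size=2048)      # question feature (placeholder)
joint = mcb(img_feat, q_feat)       # 8000-d joint representation
```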
Noh et al. (2016) presented a VQA solution using a single CNN, in which one of the fully
connected layers is a dynamic parameter layer whose weights are determined by a
specially designed Dynamic Parameter Prediction Network (DPPN). The DPPN consists
of GRU cells to embed the question words and a hashing function to assign weights to the
CNN layer.
Ma et al.'s (2016) solution also consists of CNNs. They built a multimodal convolution
layer for the joint embedding of image and question features: based on the image vector
and two consecutive semantic components from the question side, a multimodal
convolution is performed, which is expected to capture the interactions between the two
multimodal inputs.
Another notable type of deep network is the deep residual network (He et al. 2016), built
on the intuition that a deeper version of a good shallow network should perform at least
as well by learning identity transformations in the new layers. The identity connection
from the input thus allows a deep residual network to retain a copy of its input; a basic
pictorial representation of the concept can be seen in Fig. 12. However, this idea cannot
be applied as such in multimodal learning, because the modalities may have correlations.
So, Kim et al. (2016a) cleverly defined a joint residual function as the non-linear
mapping, which leads to the Multimodal Residual Network (MRN) for the VQA task. Lao
et al. (2018) introduced another end-to-end model influenced by residual learning. They
used a Cross-modal Multistep Fusion (CMF) network between feature representation and
final answer prediction. It performs multiple fusions by generating various multimodal
features; this method fuses features at every step rather than waiting for the final step,
and the different layers of CMF share parameters to optimize the use of computational
resources.

Fig. 12  Residual learning
Gao et al. (2018) identified a main limitation of taking the image feature as a single
vector from the second-to-last layer of a pre-trained CNN in an end-to-end VQA network:
this practice misses detailed information such as the spatial relationships between objects
in the image. To avoid this loss of information, they proposed a new form of fusion,
question-guided convolution, in which a series of kernels is designed based on the
question features to convolve with the image features. The main idea is to perform
multimodal fusion at an early stage of the VQA model to retain more information.
Bai et al. (2018) extended the basic MCB model with a deep attention neural tensor
network (DA-NTN) module as the last stage of the VQA model, in order to measure the
similarity between the fused multi-modal feature vector (question and image features)
and the answer embedding.
Narasimhan and Schwing (2018) proposed an end-to-end system using a multi-layer
perceptron (MLP) to combine image and question features obtained from a CNN and an
LSTM respectively. They also retrieve a related 'fact' from a knowledge base as supporting
material to answer the question. The output of the MLP is then passed along with the fact
embedding to a score function, which calculates the cosine similarity between the inputs
to assess the utility of the fact for answering the image-question pair.

4.2.1 Encoder–decoder architecture

Here, the question together with the visual representation is fed into a decoder, usually
an LSTM network, which is trained to produce the correct answer. Two general
architectures for doing this are shown in Fig. 13. Figure 13a shows one way of fusing the
two feature vectors, which is to treat the image encoding as the first (or, optionally, the
last) word of the question (Ren et al. 2015a, b; Zhu et al. 2016). Figure 13b shows a more
explicit way, which is to supply the image at every time step of the decoding LSTM
(Malinowski et al. 2017).
Ruwa et al. (2018) use triples as input to the decoder LSTM: an image embedding and a
question embedding, along with a question mood embedding, to generate an emotional
adjective along with the answer. Wu et al. (2018) proposed a model similar to the one in
Fig. 13a, but the input to the first LSTM cell is a combination of three vectors: the visual
feature vector, an image caption embedding, and the vector embedding of knowledge
extracted from external sources based on the question. This is especially effective for
open-ended questions, typically 'why?' questions.
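A minimal PyTorch sketch of the Fig. 13a variant, in which the projected image feature is treated as the first 'word' of the question sequence and the last LSTM state feeds an answer classifier, is given below. All dimensions and the classifier head are illustrative placeholders rather than settings from the cited papers.

```python
import torch
import torch.nn as nn

class ImageAsFirstWordVQA(nn.Module):
    """Fig. 13a-style fusion: treat the projected image feature as the first 'word'."""

    def __init__(self, vocab_size=10000, num_answers=1000,
                 img_dim=4096, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, embed_dim)   # map image into word space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feat, question_ids):
        img_token = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, embed_dim)
        words = self.word_embed(question_ids)               # (B, T, embed_dim)
        seq = torch.cat([img_token, words], dim=1)          # image first, then words
        _, (h_n, _) = self.lstm(seq)
        return self.classifier(h_n.squeeze(0))              # answer scores

model = ImageAsFirstWordVQA()
scores = model(torch.randn(2, 4096), torch.randint(1, 10000, (2, 10)))
```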

4.3 Joint attention models

Typical image and question attention have been explored extensively for VQA. However,
they are not optimal, in the sense that they ignore the semantic relationship between
image attention and question attention. This limitation motivated Cao et al. (2017) to use
semantic cross-modal correlation along with attention to solve the VQA problem. Peng
et al. (2019) identified another limitation of the standard attention mechanism, in which
the entire natural language question is used to guide the visual encoding process; in the
actual scenario, only the keywords of the question need to be used to identify relevant
image regions or features. This idea was utilized in their Word-to-Region Attention
Network (WRAN), which fills the existing gap between question keywords and image
regions.


Fig. 13  Encoder–decoder architecture types


Lu et al. (2016) proposed a technique that jointly reasons about visual and question
attention, called co-attention. The common theme of co-attention multimodal fusion is
that the image representation is used to guide the question attention and the question
representation is used to guide the image attention. Their model extracts word-, phrase-
and question-level embeddings for a question and, at each level, applies co-attention to
both the image and the question; the final answer is based on all the co-attended image
and question features.
Chen et al. (2015) described a question-guided attention map (QAM), generated by
scanning for image regions that correspond to the input question's semantics in the spatial
image feature map created as part of image featurization. To accomplish this, they
designed a configurable convolution kernel (CCK) and convolve the image feature map
with it. The CCK is created by mapping the question encoding from the linguistic space
into the visual space, so that it contains visual information dictated by the intent of the
question.
Xu and Saenko (2016) use two types of attention in parallel: spatial and semantic.
Spatial attention weights are determined from a correlation matrix, where each value
measures the similarity between a word and the visual feature at a location. In parallel,
semantic attention is performed based on an evidence embedding that detects the presence
of semantic concepts or objects; the embedding results are multiplied by the spatial
attention weights and summed over all locations to generate the visual evidence vector.
Yu et al. (2017) used joint attention learning with three components. First, semantic
attention attends on high-level image features to identify significant concepts from the
image needed to answer a question. Second, spatial attention as context-aware visual atten-
tion is used to infer image regions which can be attended by questions. Third, joint learning
integrates attended regions, attended concepts and question feature vector by element-wise
multiplication.
Shi et al. (2018) showed that question-type information is crucial for answering a
question, whether or not it is asked about an image. So, they replaced the popular
question-guided attention mechanism with Question Type-guided Attention (QTA) to
help image feature extraction. However, this is directly useful only on datasets with many
correctly labeled categories of questions.
One of the recent innovations in VQA research is the use of 'hard attention' to build a
VQA system by Malinowski et al. (2018). In simple words, soft attention (the commonly
used form) is 'selective boosting', whereas hard attention is 'filtering' that discards
unwanted information, a least explored area of computer vision. They used this in the
VQA pipeline to filter out unwanted or unimportant elements of the fused multi-modal
feature vector (the image-question vector) before further processing. However, the main
problem with hard attention is that it is non-differentiable, which makes it harder to train
with gradient-based methods.
A comprehensive overview of the different methods for joint comprehension can be found
in Table 6, along with their identified merits and demerits.

5 Datasets

This section presents a detailed discussion of the various publicly available datasets for
validating VQA models, along with their characteristics. General requirements of a
benchmark VQA dataset include:


Table 6  Comparison of different methods used for joint embedding of image and question features

I. Baseline models
  Concatenation
    Merits: ability to identify the importance of each single word in the question to the predicted answer
    Demerits: good performance only on synthesized VQA datasets
  Element-wise multiplication
    Merits: easy to implement, because these operations are built into many advanced language interfaces
    Demerits: feature vectors need to be first transformed into the same dimensional space
  Element-wise summation
    Demerits: feature vectors need to be first transformed into the same dimensional space
II. End-to-end models
  Nonlinear activation (hyperbolic tangent function)
    Merits: it is zero-centered; training is fast because of quick convergence (large derivative compared to sigmoid)
    Demerits: fails to answer questions if the targeted object is too small
  Dynamic parameter layer
    Merits: takes care of various tasks by allowing adaptive weight assignment in the dynamic parameter layer
    Demerits: fails on counting-type questions
  Multimodal convolution layer
    Merits: offers scalable, joint end-to-end training which can be interpreted as learning commonsense knowledge
    Demerits: works well only if the answer is given by the co-occurrence of a particular combination of features extracted from an image and a question
  Bilinear pooling layer
    Merits: uses the outer product, which is expressive enough to fully capture the complex associations between the two modalities
    Demerits: sharp increase in learning parameters and computation resources
III. Encoder–decoder architecture
    Merits: correctly matched to the task of VQA, so most exploited
    Demerits: cannot really make use of image features unless the question is about the largest object in the image; cannot exploit complicated inter-modal relationships; the effect of the image vanishes at each time step of the LSTM
IV. Joint attention models
    Merits: exploit the intent of queries to focus on different regions in an image
    Demerits: do not have any explicit notion of object position, and do not support the computation of intermediate results based on spatial attention

• It should be large enough.
• It must be able to capture the variability within questions, images and concepts.
• It must support a fair evaluation scheme for the validation of different VQA models.
• It must be minimally biased.

Most of the existing datasets contain triples made of an image, a question and its correct
answer. Some publicly available datasets additionally provide extra information such as
image captions, image regions represented as bounding boxes, or multiple-choice
candidate answers. The general capabilities required to answer the questions in a VQA
dataset correctly include:

• The ability to recognize objects, attributes and spatial relationships.
• The ability to count, perform logical inference and make comparisons.
• The ability to leverage commonsense world knowledge.

The available VQA datasets can be categorized based on three factors: the type of images,
the question-answer format, and the use of external knowledge. The images are of three
categories: natural, clip-art and synthetic. The image type of each VQA dataset can be
found in Table 7. Similarly, question-answer formats are open-ended or multiple-choice,
represented as OE and MC respectively in Table 7. The source of the images forming
each dataset and its limitations are also described in Table 7. For sample images from
various VQA datasets, see Table 8.
DAQUAR (DAtaset for QUestion Answering on Real-world images) was one of the
earliest datasets for image question answering. It is a dataset of human question-answer
pairs about images. COCO-QA has its roots in Microsoft COCO (Common Objects in
COntext). Its questions are automatically generated from COCO image captions and are
of 4 different types: object, number, color and location. All answers consist of a single word.
The VQA Dataset consists of both real images and abstract cartoon/clipart scenes, referred
to as VQA-real and VQA-abstract respectively. Both contain three questions per image
or scene and ten ground-truth answers per question. The motivation for constructing
VQA-abstract is to minimize the language bias that is high for real image-question pairs.
The abstract scenes are made of 20 'paper-doll' human models (Antol et al. 2014) spanning
genders, races and ages, with 8 different expressions. The set contains over 100 objects
and 31 animals in various poses.
FM-IQA (Freestyle Multilingual Image Question Answering) consists of COCO images
and a freestyle, interesting and diversified set of questions that require considerable
reasoning ability to answer. The questions are categorized into 8 types, e.g. questions on
object actions, object classes and others. Each image has at least two question-answer
pairs as annotations. The Visual Madlibs dataset is collected using automatically generated
fill-in-the-blank templates intended to gather focused descriptions about people and
objects, their appearances, activities and interactions, as well as inferences about the
general scene or broader context. The Visual7W dataset was inspired by the long-standing
idea of the 'W' questions in journalism for telling a complete story; it uses the question
words 'what', 'where', 'when', 'who', 'why', 'how' and 'which'. Visual7W features richer
questions and longer answers than the VQA Dataset. TDIUC (Task Directed Image
Understanding Challenge) is a dataset developed to avoid some reported limitations of
publicly available VQA datasets, such as (1) imbalance across question types, (2) questions
that can be answered while ignoring the image, and (3) a difficult evaluation process. The
authors of TDIUC divide VQA into 12 constituent tasks (i.e. 12 question

Table 7  VQA datasets in a nutshell

DAQUAR (Malinowski and Fritz 2014)
  Image type: Natural; Format: OE
  Creation: uses images from the NYU-DepthV2 dataset
  Limitations: too small; contains only indoor scenes; extreme light conditions which make answering very difficult
COCO-QA (Ren et al. 2015a, b, c)
  Image type: Natural; Format: OE
  Creation: uses images from COCO; questions are created using an NLP algorithm
  Limitations: awkwardly phrased questions with grammatical errors due to flaws in the NLP algorithm
VQA Dataset (Antol et al. 2015)
  Image type: Natural, Clip-art; Format: OE, MC
  Creation: VQA-real images are taken from COCO; VQA-abstract images are synthetic cartoon images
  Limitations: presence of language biases, so that questions can be answered without using the image; contains subjective and opinion-seeking questions which do not have a single correct answer
FM-IQA (Gao et al. 2015)
  Image type: Natural; Format: OE
  Creation: human-generated questions and answers, collected in Chinese and translated to English
  Limitations: automatic performance evaluation is intractable
Visual Madlibs (Yu et al. 2015)
  Image type: Natural; Format: OE
  Creation: uses images from COCO; fill-in-the-blank type questions generated using COCO image captions
  Limitations: declarative-sentence-based questions which can be answered easily
Visual7W (Zhu et al. 2016)
  Image type: Natural; Format: MC
  Creation: images with multiple choices for the answer; two types of questions: telling (requires a textual answer) and pointing (requires selection of an image region)
  Limitations: no yes/no questions
Visual Genome (Krishna et al. 2017)
  Image type: Natural; Format: OE
  Creation: images collected from YFCC100M and COCO; very large; greater answer diversity
  Limitations: lengthy answers which make evaluation challenging
TDIUC (Kafle and Kanan 2017a)
  Image type: Natural; Format: OE
  Creation: images collected from YFCC100M and COCO; 12 types of questions to test various vision understanding capabilities
  Limitations: contains too many similar questions, specifically questions regarding color, which usually leads to wrong question-type prediction
SHAPES (Andreas et al. 2016)
  Image type: Synthetic; Format: MC
  Creation: images using various shapes in various colors
  Limitations: only yes/no questions
KB-VQA (Wang et al. 2015)
  Image type: Natural; Format: OE
  Creation: questions require knowledge from DBpedia
  Limitations: small-scale dataset
FVQA (Wang et al. 2018)
  Image type: Natural; Format: OE
  Creation: questions involve external information collected from DBpedia, ConceptNet and WebChild
  Limitations: long training time
Diagrams (Kembhavi et al. 2016)
  Image type: Synthetic; Format: MC
  Creation: images of school science topics like the digestive system
  Limitations: requires a high level of visual reasoning
CLEVR (Johnson et al. 2017)
  Image type: Synthetic; Format: OE
  Creation: simple images with different shapes and complex questions which exclusively test the visual reasoning ability of a solution
  Limitations: cannot be generalized to a real-world setting; not easily extendible
FigureQA (Kahou et al. 2017)
  Image type: Synthetic; Format: MC
  Creation: questions about graphical plots and figures; it can be extended iteratively
  Limitations: all questions are yes/no type; does not contain questions which require numerical values as answers; it has fixed labels for bars across different figures
DVQA (Kafle et al. 2018)
  Image type: Synthetic; Format: OE
  Creation: questions about graphical plots
  Limitations: considers only bar charts
VizWiz (Gurari et al. 2018)
  Image type: Natural; Format: OE
  Creation: consists of visual questions asked by blind people who were seeking answers to their daily visual questions
  Limitations: images are often of poor quality; questions suffer from audio recording issues; most of the questions are unanswerable
VQA-Med (Hasan et al. 2018)
  Image type: Natural; Format: OE
  Creation: consists of medical images extracted from PubMed Central articles; question-answer pairs are generated from captions of the ImageCLEF 2017 caption prediction task
  Limitations: too small for exploring advanced VQA capabilities

Table 8  Example images from various VQA datasets (images omitted; representative question-answer pairs shown where available)

DAQUAR, COCO-QA, VQA-real, VQA-abstract, FM-IQA, Visual Madlibs, KB-VQA, FVQA, Visual7W, SHAPES
Diagrams: Q: How many stages of growth does the diagram feature?
Visual Genome: Q: Who is under the umbrella? A: Two women.
CLEVR: Q: Are there an equal number of large things and metal spheres?
FigureQA: Q: Is Coral the minimum? A: Yes
DVQA: Q: How many bars are there?
TDIUC: Q1: Is there a traffic light in the photo? Q2: What is the weather like? Q3: How many dogs are there? Q4: What is the dog doing?
VQA-Med: Q: What does the CT scan of thorax show? A: bilateral multiple pulmonary nodules.

types including absurd questions) that make it easier to evaluate and compare performance
of VQA models.
SHAPES is a dataset of synthetic images. It is complementary to other VQA datasets, as
it consists of shapes in varying arrangements, types and colors rather than natural scenes,
which poses a distinctive challenge for VQA researchers. The dataset consists of complex
questions about spatial and logical reasoning among multiple shapes, thus avoiding the
risk of learning biases, a significant deficiency of most VQA datasets. CLEVR
(Compositional Language and Elementary Visual Reasoning) is a dataset of synthetic
images similar to SHAPES, but with simple 3D shapes. It consists of questions that test a
range of visual reasoning abilities, which leads to minimal bias, and it also comes with
supporting annotations describing the kind of reasoning required to answer each question.
The authors claim that CLEVR facilitates in-depth analysis of the visual reasoning abilities
of a solution model, which is intractable with other datasets.
The KB-VQA (Knowledge Base VQA) dataset was built for assessing the performance of
VQA models on questions requiring a higher level of knowledge and explicit reasoning
about visual contents using external information. The authors attached a specific label to
each question, revealing the human-estimated level of knowledge required to answer it
correctly. The labels are "visual" (can be answered directly using


visual concepts), "common-sense" (does not require an external source) and
"KB-knowledge" (expected to require Wikipedia or a similar source). FVQA (Fact-based
VQA) is similar to KB-VQA in requiring external information to answer. The distinction
is that here the authors provide an additional supporting fact for each question-answer
pair in the form of a structural triplet, such as <Cat, CapableOf, ClimbingTrees>.
FigureQA is the first VQA dataset based on graphical plots and figures, including five
classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. It
consists of 15 question types which probe various relationships between objects in a graph
and examine characteristics such as the minimum, the maximum, area-under-the-curve,
smoothness and intersection. DVQA (Data Visualization Question Answering) is a dataset
developed concurrently with FigureQA, but it tests only various aspects of bar charts. It
contains three types of questions: (1) structure understanding (e.g. how many bars are
there?), (2) data retrieval (e.g. what is the label of the third bar from the left?), and (3)
reasoning (e.g. which algorithm has the highest accuracy?).
Diagrams (AI2D) is a dataset that aims to evaluate the task of diagram interpretation. It
consists of diagrams representing topics from grade-school science, each annotated with
constituent segmentations, their relationships to each other and their relationships to the
diagram canvas. It represents a new direction for vision research, which generally
concentrates on natural image understanding.
VizWiz is the first goal-oriented VQA dataset that captures the real interests of real users
of a VQA system. It is also the first publicly available vision dataset originating from
blind people. Distinguishing characteristics of VizWiz include: (1) the images are captured
by blind people and so are usually of poor quality, (2) the questions are spoken and so are
more conversational or suffer from audio recording imperfections, and (3) many of the
questions cannot be answered, because blind photographers cannot verify that their
images capture the visual content they are asking about.
VQA-Med is a dataset created as a first step towards VQA in the medical domain. It
consists of medical images with clinically relevant question-answer pairs. Success in this
task would improve patient engagement in interpreting medical images and could also
serve as a second opinion for doctors on complex medical images. The question-answer
pairs were generated using a semi-automatic method: rule-based question generation
followed by manual verification by human experts.

6 Performance evaluation

The real dream of computer vision research is to build models which are close to humans
in image understanding capability. For evaluation of these models, Geman et al. (2015)
have introduced a Visual Turing Test for computer vision systems. Most of the current
papers suggest that VQA can be considered as an alternative for Visual Turing Test, or in
other words, it is an ‘AI-complete’ task. One of the important criteria for a task to be ‘AI-
complete’ is to have a well-defined quantitative evaluation metric to track progress. In this
section, the various prominent VQA evaluation metrics are presented in a nutshell.
From Table 7, it is clear that VQA datasets present two types of questions: open-ended
and multiple-choice. In multiple-choice setting, there is just a single right answer for every
question. So the assessment of a proposed solution is straightforward since one can easily
quantify the mean accuracy over test questions. In open-ended setting, there exists a pos-
sibility of having multiple right answers for a question due to synonyms and paraphrasing.


Table 9  Overview of VQA evaluation metrics

Accuracy
  Formula: $\frac{\#\,\text{correctly answered}}{\#\,\text{total questions}}$
  Supporting datasets: DAQUAR, COCO-QA, VQA-abstract, Visual Genome and Madlibs, FVQA, Visual7W, Shapes, CLEVR, Diagrams, DVQA, VizWiz

WUPS (see notes a–d)
  Formula: $\frac{1}{N}\sum_{i=1}^{N}\min\Big\{\prod_{a\in A^{i}}\max_{t\in T^{i}}\mathrm{WUP}(a,t),\ \prod_{t\in T^{i}}\max_{a\in A^{i}}\mathrm{WUP}(a,t)\Big\}\cdot 100$
  Supporting datasets: COCO-QA, DAQUAR

Consensus
  Formula: $\mathrm{Accuracy}_{\mathrm{VQA}}=\min\left(\frac{n}{3},\,1\right)$
  Supporting datasets: VQA-real, DAQUAR, VizWiz

Human judgment
  Formula: manual evaluation by human judges
  Supporting datasets: FM-IQA, KB-VQA, FigureQA

MPT (see notes e–f)
  Formula: arithmetic $\frac{1}{T}\sum_{t=1}^{T}A_{t}$ or harmonic $T\Big/\sum_{t=1}^{T}A_{t}^{-1}$
  Supporting datasets: TDIUC

BLEU (see notes g–i)
  Formula: $BP\cdot\exp\Big(\sum_{n=1}^{N}w_{n}\log P_{n}\Big)$
  Supporting datasets: VizWiz

METEOR (see note j)
  Formula: $(1-\mathrm{Pen})\cdot F_{mean}$
  Supporting datasets: VizWiz

Notes:
a  N = total number of questions
b  A^i = set of predicted answers for question i
c  T^i = set of ground-truth answers for question i
d  WUP(a, b) returns the position of the words 'a' and 'b' in the taxonomy relative to the position of their Least Common Subsumer
e  T = total number of question types
f  A_t = accuracy over question type t
g  BP = brevity penalty (Papineni et al. 2002)
h  w_n = positive weights summing to one
i  P_n = precision score of the entire corpus (Papineni et al. 2002)
j  For the calculation of Pen and F_mean refer to Chen et al. (2015)

Fig. 14  Usage statistics of evaluation metrics

So, the assessment is nontrivial. To make it manageable, most of the VQA datasets restrict
the answers to contain only a few words (usually one to three), or require the answer to be
selected from a closed set, thereby converting the open-ended setting into a multiple-choice
setting. Table 9 shows the calculation formulas of the major VQA metrics along with the
names of the supporting datasets, and Fig. 14 shows the usage statistics of these metrics
across the available VQA datasets so that researchers can identify the trend of evaluation.
Though in the multiple-choice setting simple accuracy is enough to evaluate a given VQA
model, in the open-ended setting it is too rigid to accept an answer as correct only if it
exactly matches the ground truth. For example, if the question is ‘What animals are in
the photo?’ and a model outputs ‘dog’ instead of the ground truth ‘dogs’, it is treated as
wrong. Due to these limitations, several alternatives have been proposed.
Wu-Palmer Similarity (WUPS) (Wu and Palmer 1994) was used as an evaluation metric
for VQA in Malinowski and Fritz (2014). It is inspired by fuzzy set theory, which leads
to a softer measure than accuracy. It tries to measure how much a predicted answer differs
from the ground truth based on the difference in their semantic meaning. To avoid awarding
high scores to distant concepts, the authors also proposed a thresholded WUPS score, where
a score below a threshold is scaled down by a factor. Significant shortcomings of WUPS
include: (1) it produces high scores for answers that are lexically related but have different
meanings, and (2) it does not work with phrasal or sentence answers.
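To make the computation concrete, the following is a minimal sketch (not the evaluation code released with any dataset) of a thresholded WUPS score for set-valued answers, using WordNet through NLTK; the 0.9 threshold and 0.1 down-scaling factor are the values commonly reported in the literature.

# Minimal WUPS sketch; assumes NLTK with the WordNet corpus (nltk.download('wordnet')).
import math
from nltk.corpus import wordnet as wn

def wup(word_a, word_b, threshold=0.9, scale=0.1):
    """Max Wu-Palmer similarity over all synset pairs, scaled down below the threshold."""
    synsets_a, synsets_b = wn.synsets(word_a), wn.synsets(word_b)
    if not synsets_a or not synsets_b:
        return float(word_a == word_b)          # fall back to exact match for unknown words
    score = max((sa.wup_similarity(sb) or 0.0) for sa in synsets_a for sb in synsets_b)
    return score if score >= threshold else scale * score

def wups(predicted, ground_truth, threshold=0.9):
    """Corpus-level WUPS; predicted and ground_truth are lists of answer-word sets."""
    scores = []
    for pred, gt in zip(predicted, ground_truth):
        prod_pred = math.prod(max(wup(a, t, threshold) for t in gt) for a in pred)
        prod_gt = math.prod(max(wup(a, t, threshold) for a in pred) for t in gt)
        scores.append(min(prod_pred, prod_gt))
    return 100.0 * sum(scores) / len(scores)

print(wups([{"dog"}], [{"dogs"}]))              # lexically close answers score high
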
Another alternative is the consensus measure, which is based on multiple independently col-
lected ground truth answers from annotators for each question and was used in Antol
et al. (2015). It comes mainly in two types: average and min consensus. For the average
version, the final score is the weighted average of the answers from annotators, and for the
min version, the answer needs to agree with at least one annotator. The most widely used
version (given in Table 9) awards a full score for a question if the algorithm agrees with three
or more annotators. Limitations of this approach include: (1) allowing multiple correct
answers for a question, (2) the expense of collecting ground truth, and (3) difficulty due to
lack of consensus.
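A minimal sketch of the min(n/3, 1) consensus accuracy from Table 9, ignoring the answer normalization and the averaging over annotator subsets performed by the official VQA evaluation script:

# Consensus accuracy: full credit when the prediction matches at least three human answers.
def vqa_consensus_accuracy(predicted, human_answers):
    """predicted: str; human_answers: list of (typically 10) annotator answer strings."""
    matches = sum(1 for a in human_answers if a.strip().lower() == predicted.strip().lower())
    return min(matches / 3.0, 1.0)

humans = ["dogs", "dog", "dogs", "two dogs", "dogs", "dogs", "dog", "dogs", "dogs", "puppies"]
print(vqa_consensus_accuracy("dogs", humans))     # 1.0  -- agrees with >= 3 annotators
print(vqa_consensus_accuracy("puppies", humans))  # 0.33 -- agrees with only one annotator
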
The authors of the FM-IQA dataset (Gao et al. 2015) suggested manual evaluation of the
VQA model with human judges. This works well for both single-word and phrase or sentence
answers. However, the judgment process costs time, resources and money, and there is also
a chance that individual judges introduce subjective opinions.
One of the main problems of VQA datasets is the skewed distribution of question types.
In such cases, simple accuracy will not work well, especially for rarer question types. So
Kafle and Kanan (2017a) proposed new performance metrics to compensate for the unbal-
anced question-type distribution. The metric is named mean-per-type (MPT), indicating the
arithmetic or harmonic mean accuracy calculated per question type. They also use normalized
metrics, namely arithmetic normalized MPT and harmonic normalized MPT, to compensate
for bias in the distribution of answers within each question type.
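A minimal sketch of arithmetic and harmonic mean-per-type accuracy; the per-question record format (question type, correctness) is an illustrative assumption rather than the format of any specific dataset.

# Mean-per-type (MPT): average the per-type accuracies instead of pooling all questions.
from collections import defaultdict

def mean_per_type(records, harmonic=False):
    correct, total = defaultdict(int), defaultdict(int)
    for qtype, is_correct in records:
        total[qtype] += 1
        correct[qtype] += int(is_correct)
    per_type = [correct[t] / total[t] for t in total]
    if harmonic:
        # The harmonic mean heavily penalizes question types with near-zero accuracy.
        return len(per_type) / sum(1.0 / max(a, 1e-12) for a in per_type)
    return sum(per_type) / len(per_type)

records = [("counting", True), ("counting", False), ("color", True), ("color", True)]
print(mean_per_type(records))                 # arithmetic MPT = 0.75
print(mean_per_type(records, harmonic=True))  # harmonic MPT  = 0.667
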
BLEU (BiLingual Evaluation Understudy) and METEOR (Metric for Evaluation of
Translation with Explicit ORdering), proposed by Papineni et al. (2002) and Denkowski
and Lavie (2014), are metrics for automatic evaluation of machine translation. Gurari et al.
(2018) suggested that these can be used as evaluation metrics for VQA and experimented
with their VizWiz dataset. BLEU analyzes the co-occurrence of n-grams between the
predicted answer and the ground truth label, and it usually fails on short sentences. METEOR
is calculated from an alignment between the words in the predicted and ground truth answers,
aiming at a one-to-one correspondence, but sometimes it is intractable to find such an
alignment.
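The following illustrative sketch computes BLEU for a short answer with NLTK and shows why smoothing (or a short-sentence variant) matters; the tokenization and n-gram weights are assumptions made only for the example.

# BLEU on short VQA answers: without smoothing, missing higher-order n-grams drive the score to ~0.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "black", "dog"]]   # tokenized ground-truth answer(s)
hypothesis = ["black", "dog"]          # tokenized predicted answer

print(sentence_bleu(references, hypothesis, weights=(0.5, 0.5)))   # bigram BLEU, ~0.61
print(sentence_bleu(references, hypothesis,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1))  # smoothed 4-gram BLEU
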
According to Kafle and Kanan (2017b), the ideal approach to assess a VQA framework
is still an open problem. The newly introduced first goal-oriented VQA dataset, VizWiz,
opened a new chapter by using automatic machine translation evaluation metrics for VQA.
Every assessment strategy has its own strengths and shortcomings. The technique to use
depends on how the dataset was built, the level of bias within it and the available resources.
In parallel with the development of new VQA models, significant work should be done to
create unique and efficient metrics to evaluate their performance.

7 Discussions and future directions

VQA is a recently introduced task which attracts many researchers due to its potential
applications and AI-completeness. Prerequisites for solving this complex task include
knowledge of the mature research in the fundamental tasks of Computer Vision and Natural
Language Processing.
As shown in Fig. 15, a vast amount of research has happened in this area, leading to
meteoric improvements in the performance of VQA algorithms and also to the development
of new datasets. Still, VQA research has miles to go to reach the goal of human-equivalent
performance in answering image-based questions. Readers should not draw conclusions
about the performance of VQA systems from the ‘bars’ for the SHAPES and CLEVR
datasets in Fig. 15, because these are synthetic datasets created for study purposes; instead,
focus on realistic datasets like Viz-Wiz, for which the performance is still below 50%.
Next, Table 10 and Fig. 16 present the results of studies on the VQA dataset for its two test
splits, named ‘test-dev’ and ‘test-std’. VQA is a relatively large dataset consisting of 265,016
images and open-ended questions about those images. These questions require an under-
standing of vision, language and commonsense knowledge to answer. The maximum
reported accuracy on this dataset is around 70%, which highlights the need for improvement.
The reasons for the gap that still exists between VQA performance and human perfor-
mance are not fully understood. This section reveals promising future work needed on dif-
ferent aspects of VQA to fill the gap between machine and human intelligence in image
understanding.

Fig. 15  Current state-of-the-art results across datasets


Table 10  Accuracy on VQA dataset

VQA model                               test-dev    test-std
VQA (Antol et al. 2015)                 53.74       54.06
Comp_QA (Andreas et al. 2015)           54.8        55.1
DPPNet (Noh et al. 2016)                –           57.22
Re_Baseline (Jabri et al. 2016)         –           65.2
MCB (Fukui et al. 2016)                 66.7        66.5
SAN (Yang et al. 2016)                  58.7        58.9
SMem-VQA (Xu and Saenko 2016)           58          58.2
Region_VQA (Shih et al. 2016)           62.44       62.43
Ask_Neuron (Malinowski et al. 2017)     58.4        58.4
SCMC (Cao et al. 2017)                  60.96       61.16
MLAN (Yu et al. 2017)                   65.3        58.9
TDAN (Li et al. 2018)                   –           63.94*
CMF (Lao et al. 2018)                   66.4        –
DA-NTN (Bai et al. 2018)                67.56*      67.94*
WRAN (Peng et al. 2019)                 –           66.5*
MCAN (Yu et al. 2019)                   70.63*      70.9*
Cycle (Shah et al. 2019)                69.87*      –
AnswerAll (Shrestha et al. 2019)        65.96*      –
Dropout (Fang et al. 2019)              66.92*      –

*Accuracies marked with an asterisk indicate the use of the VQA2.0 dataset

Fig. 16  Accuracies on test-dev and test-std splits of VQA dataset

Before that, readers are directed to Table 11 for a critical comparison of selected VQA
literature, where each solution is dissected into its different phases, including limitations and
future scope.

Table 11  Critical comparison of a relevant subset of VQA solutions

Antol et al. (2015)
  Image featurization: VGGNet
  Question featurization: BoW
  Joint comprehension: Element-wise multiplication
  Answering approach: Classification using MLP and generation using LSTM
  Limitations: Element-wise product is not expressive enough to capture the complex associations between the multiple modalities, especially in situations where image information plays little role in finding the answer (binary/how-many type questions)
  Scope for improvement: Incorporate regional attention to avoid unimportant image features before fusion

Gao et al. (2015)
  Image featurization: GoogLeNet
  Question featurization: LSTM and CNN
  Joint comprehension: Encoder–decoder
  Answering approach: Generation
  Limitations: Fails when the commonsense reasoning through background images is incorrect, when the object in question focus is too small or looks similar to other objects, or when the question needs knowledge from experience; there also exists an OOV problem
  Scope for improvement: Use image features extracted from pooling layers of the CNN instead of the last fully connected layer to get more fine-grained features; use improved word embeddings to handle OOV

Lu et al. (2016)
  Image featurization: Last pooling layer of VGGNet and ResNet
  Question featurization: Three levels: word, phrase (convolution) and question (LSTM)
  Joint comprehension: Encoder–decoder with co-attention, in which question and image guide each other (simultaneously or alternately)
  Answering approach: Classification
  Limitations: The model attends to the image at every time step even though some words have no corresponding image signal, which may diminish the aim of question attention
  Scope for improvement: Can use adaptive attention (Lu et al. 2017) to decide when to rely on image features and when to rely on language features

Fukui et al. (2016)
  Image featurization: ResNet152 followed by L2 normalization
  Question featurization: LSTM
  Joint comprehension: Compact bilinear pooling, which computes the outer product between two vectors
  Answering approach: Classification
  Limitations: Image and question features need to be projected to a very high dimensional space to guarantee robust performance, leading to huge memory usage
  Scope for improvement: Try to incorporate the method of multi-modal low-rank bilinear pooling (Kim et al. 2016b) into the VQA system

Malinowski et al. (2017)
  Image featurization: ResNet152
  Question featurization: LSTM
  Joint comprehension: Encoder–decoder
  Answering approach: Classification/generation
  Limitations: The global image feature vector fed into each LSTM unit gives (sometimes) irrelevant information to the answer prediction stage, which also becomes difficult due to the high dimensionality of the global image feature
  Scope for improvement: Incorporate visual attention to get locally important features

Ben-Younes et al. (2017)
  Image featurization: ResNet
  Question featurization: GRU initialized with Skip-thought vectors
  Joint comprehension: Tucker decomposition of the correlation tensor
  Answering approach: Classification
  Limitations: Insufficient to capture complex interactions among different modalities, especially when the input images are noisy; this makes it unsuitable for realistic application domains of VQA like Viz-Wiz, where the images taken by blind people will be noisy
  Scope for improvement: Improve the building of the correlation tensor via (1) co-attention, (2) early fusion and (3) improved individual feature embeddings, to surpass limitations in the dataset

Anderson et al. (2018)
  Image featurization: Faster R-CNN with ResNet 101 as backbone, with thresholding to eliminate irrelevant image features
  Question featurization: GRU initialized with GloVe vectors
  Joint comprehension: Element-wise product
  Answering approach: Classification
  Limitations: Fails to learn the structured, multi-step visual reasoning required to answer questions about 3D shapes in the CLEVR dataset
  Scope for improvement: Incorporate question-guided visual attention; the image featurization step can easily be adapted to many of the baseline VQA models to improve performance on real image datasets

Yu et al. (2018)
  Image featurization: Convolution layer res5c of ResNet
  Question featurization: LSTM
  Joint comprehension: Multi-modal factorized high-order pooling approach (MFH) with co-attention
  Answering approach: Classification
  Limitations: MFH cannot infer the correlation between each image region and each question word because question and image attention are performed independently and fused later; the entire question is used to attend to the image, but only some words may be actively involved in locating image regions
  Scope for improvement: A method for fine-grained co-attention is required; the word-to-region attention provided by Peng et al. (2019) can be effectively utilized

Yu et al. (2019)
  Image featurization: Faster R-CNN with ResNet 101 as backbone
  Question featurization: LSTM initialized with GloVe vectors
  Joint comprehension: Encoder–decoder with modular co-attention layers
  Answering approach: Classification
  Limitations: Fails in distinguishing the key words in questions if other objects are brighter in the image
  Scope for improvement: Improve image feature extraction by combining features from multiple sources

Shrestha et al. (2019)
  Image featurization: Faster R-CNN
  Question featurization: GRU initialized with GloVe vectors
  Joint comprehension: Concatenation of question features with regional image features, followed by an aggregated bimodal embedding using a bidirectional GRU
  Answering approach: Classification
  Limitations: Less efficient if the dataset consists of questions about unseen compositions of seen concepts, as in C-VQA (Agrawal et al. 2017)
  Scope for improvement: Build new VQA models with strong visual grounding that circumvent language priors, which opens a wide green area for future VQA researchers; develop new datasets with reduced biases for benchmarking of AI

7.1 VQA phases

Till now, the image featurization part has been almost frozen to one of the models that came
out of the ImageNet challenge. These models work well in the classical computer vision task
of object detection by splitting the input image into uniform bounding boxes. However, VQA
requires more natural image features, obtained by detecting all objects in an image with
semantic segmentation. With this in mind, Peng et al. (2019) have used the Region-based
CNN (R-CNN) proposed by Girshick et al. (2014) for the visual feature extraction step
of VQA. R-CNN works by using selective search to identify a manageable number of
bounding-box object region candidates (“regions of interest” or “RoIs”) and then extracts
CNN features from each region independently. This attempt should encourage VQA
researchers to revisit the image featurization part and explore the descendants of R-CNN:
Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2015a, b, c) and the latest Mask
R-CNN (He et al. 2017).
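As a hedged illustration of this direction (not the exact pipeline of the cited papers), the sketch below uses a pre-trained Faster R-CNN from torchvision to obtain confident object regions, from which region-level features can then be pooled for a VQA model; the score threshold and random test tensor are assumptions for the example.

# Region proposals with a pre-trained Faster R-CNN (torchvision detection model zoo).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

image = torch.rand(3, 480, 640)          # stand-in for a real RGB image scaled to [0, 1]
with torch.no_grad():
    detections = model([image])[0]       # dict with 'boxes', 'labels', 'scores'

keep = detections["scores"] > 0.5        # keep only confident regions as candidate objects
boxes = detections["boxes"][keep]        # (num_regions, 4) bounding boxes for ROI pooling
print(boxes.shape)
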
It is high time to shake up this frozen phase of VQA to obtain adequately fine-grained image
feature extraction. For answering compositional questions about images, especially low-
quality images as in real VQA datasets like Viz-Wiz, features extracted from a single network
would not be sufficient. Rich internal features can be learned by combining information from
multiple sources through feature fusion, which opens a wide green area for research.
Converting natural language questions to fixed-length dense vectors is an essential part
of any VQA system. Of the many possible ways to embed a natural language question,
word2vec and GloVe attracted VQA researchers during the initial days; attention then shifted
to deep learning models like LSTM and GRU. The point to be noted is that NLP research
has moved far beyond the word2vec baseline around 2017–2018. Three of the new models
which are found to be beneficial for VQA are FastText (Bojanowski et al. 2017), ELMo
(Peters et al. 2018) and BERT (Devlin et al. 2018).
FastText is an extension of the original word2vec where the main difference is the inclu-
sion of character n-grams, which allows generating word embeddings for words that did not
appear in the training data, addressing what is known as the OOV (out-of-vocabulary)
problem. OOV words can be more frequent in goal-oriented VQA datasets like VizWiz,
which are going to be expanded and explored more in the future. FastText is also proved to
be very fast to train.
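A minimal sketch of FastText's OOV behaviour using gensim; the toy question corpus and hyper-parameters are purely illustrative.

# FastText composes vectors for unseen words from their character n-grams.
from gensim.models import FastText

questions = [
    ["what", "color", "is", "the", "umbrella"],
    ["how", "many", "people", "are", "in", "the", "image"],
    ["is", "the", "child", "crying"],
]
model = FastText(sentences=questions, vector_size=50, window=3, min_count=1, epochs=10)

# "umbrellas" never appears in the corpus, yet a vector is still composed from its n-grams.
print(model.wv["umbrellas"].shape)                    # (50,)
print(model.wv.similarity("umbrella", "umbrellas"))   # high similarity to the seen form
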
For those who prefer an end-to-end deep learning VQA system, ELMo (Embeddings from
Language Models) and BERT (Bidirectional Encoder Representations from Transformers)
are two great choices. Both can handle the OOV problem and are contextualized word
embedding models which can generate different embeddings for a word depending on its
context. ELMo word embeddings are created by concatenating the activations of the internal
states of a two-layer bidirectional language model, a biLSTM. Different layers of a language
model capture different kinds of information about a word; thus, concatenating them yields
more natural word representations. BERT uses the Transformer, an attention-based model
with positional encodings to represent word positions. Another promising fact is that BERT
is a pre-trained model which provides a kick-start for feature extraction in the form of
transfer learning.
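As a hedged example of contextual question featurization (one of several possible toolchains), the sketch below embeds a question with a pre-trained BERT from the HuggingFace transformers library and mean-pools the token vectors; the pooling choice is an assumption for illustration only.

# Contextual question features from a pre-trained BERT encoder.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

question = "What color is the umbrella?"
inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state    # (1, seq_len, 768) contextual word vectors
question_vector = token_embeddings.mean(dim=1)  # simple pooled question representation
print(question_vector.shape)                    # torch.Size([1, 768])
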
Another research-provoking thought is that there still exists potential for better use of
concepts from NLP for tackling challenges in VQA. One is to move a step further from
word embeddings to sentence embeddings: numerical representations of a full sentence
that encapsulate rich meaning and are also sensitive to word order, tense and context. Only
a small portion of VQA researchers (Kafle and Kanan 2016; Noh et al. 2016) have noticed
this possibility, by using Skip-thought vectors (Kiros et al. 2015) for question featurization.
Their results open the door to the viability of exploiting advanced sentence embedding
models like Quick-thought vectors (Logeswaran and Lee 2018), InferSent (Conneau et al.
2017) and Google’s Universal Sentence Encoder (Cer et al. 2018) for vectorizing the natural
language question associated with VQA.
But the other side of this coin is that, for short sentences like the single-line questions in
VQA, contextual embeddings may fail to perform well because of the limited amount of
contextual information available. This gives character-level embeddings a place to perform.
Besides, character embeddings naturally deal with the common problem of OOV.
The introduction of Capsule Networks (CapsNet), a new sensation in deep learning pro-
posed by Sabour et al. (2017), has shaken the use of CNNs for image understanding. The
authors pointed out two main flaws of the CNN architecture. First, no spatial information
is used anywhere in the CNN architecture. Second, the pooling function that is used to con-
nect layers is really inefficient as it loses valuable spatial information. CapsNet solves this
by the use of a dynamic routing process. This could form a neat replacement for the CNN
image feature extraction in VQA systems, especially on datasets like SHAPES and CLEVR,
where a number of questions focus on testing spatial reasoning abilities.
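For reference, a compact sketch of the squashing non-linearity and routing-by-agreement from Sabour et al. (2017); the capsule counts and dimensions are illustrative and not tied to any particular VQA model.

# Squash non-linearity and dynamic routing-by-agreement between capsule layers.
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Short vectors shrink toward zero; long vectors saturate just below unit length.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, iterations=3):
    # u_hat: predictions from lower capsules, shape (batch, in_caps, out_caps, out_dim)
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits
    for _ in range(iterations):
        c = F.softmax(b, dim=2)                              # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)             # weighted sum over input capsules
        v = squash(s)                                        # (batch, out_caps, out_dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)         # agreement update
    return v

u_hat = torch.randn(2, 32, 10, 16)   # 32 lower capsules routed to 10 higher capsules
print(dynamic_routing(u_hat).shape)  # torch.Size([2, 10, 16])
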
Though less explored, one recent attempt by Ren and Lu (2018) successfully used
CapsNet with compositional coding (CC) to reduce the number of parameters used in word
embeddings. The output from the CC capsule layer proposed by them forms a promising
candidate for question featurization in VQA. Feng et al. (2018) also give an insight into
the task of learning multi-modal capsules with images and text, which directly satisfies the
requirements of a VQA model.
Zhao et al. (2019) proposed a unified NLP-Capsule framework for challenging NLP
applications. The authors unraveled the main obstacles that hinder the use of capsule
networks for NLP tasks, such as poor scalability and less reliable routing processes. They
proved the effectiveness of their framework on the task of textual question answering by
generating separate capsules for the question and answer in a pair and computing a rel-
evance score as the cosine similarity between them. This can directly be extended to the task
of VQA by considering the input data as (image, question, answer) triples.
An immensely explored phase of the VQA pipeline is the one which combines or matches
the extracted visual and language features for answer generation. One of the winning candi-
dates is the encoder–decoder architecture, whose template of CNNs and RNNs along with
an attention mechanism naturally fits the task of VQA. One of the limitations suffered by
these architectures is the time-consuming sequential encoding step. A remarkable solution
to this was proposed by Vaswani et al. (2017), termed Transformers, with stacked encoder
and decoder layers providing parallelized attention. A bright future of VQA lies with the
use of Transformers, especially in time-critical domains such as VQA-Med (Hasan et al.
2018) and Viz-Wiz (Gurari et al. 2018).
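A hedged sketch of how a Transformer encoder could fuse region and question features in parallel; the dimensions, layer counts, answer-vocabulary size and mean pooling are illustrative assumptions, not taken from any cited system.

# Joint self-attention over concatenated image-region and question-token features.
import torch
import torch.nn as nn

d_model = 512
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
fusion_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
classifier = nn.Linear(d_model, 3000)          # 3000 = assumed answer-vocabulary size

region_feats = torch.randn(1, 36, d_model)     # e.g. 36 detected image regions
question_feats = torch.randn(1, 14, d_model)   # e.g. 14 question tokens

joint = torch.cat([region_feats, question_feats], dim=1)   # one joint sequence
fused = fusion_encoder(joint)                              # parallelized self-attention
logits = classifier(fused.mean(dim=1))                     # pooled answer scores
print(logits.shape)                                        # torch.Size([1, 3000])
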

7.2 Datasets

One of the other limitations of VQA research is the lack of goal-oriented datasets.
Most of the research in this field has tackled the earliest benchmark datasets, DAQUAR and
VQA. These models cannot be extended to the real applications of VQA, like supporting
visually impaired people, helping a data analyst to extract relevant information from a pool
of data, instructing a kid playing a game on a touch screen, and interacting with a robot.
To keep VQA research aligned with its goal, more publicly available, application-oriented
datasets are needed.
Till now, there are only two publicly available datasets for goal-oriented VQA research,
namely VizWiz and VQA-Med. In the VQA literature, only one paper has addressed the
VizWiz dataset (Gurari et al. 2018), with a best accuracy below 50%, and according to
Hasan et al. (2018) only five teams have addressed the VQA-Med dataset, with a best BLEU
(refer to Sect. 6) score of 0.162. In future work, we are planning to build a benchmark VQA-
Med dataset for interested researchers, which will improve public health as a whole by
aiding the decisions of doctors.


7.3 Evaluation

Most of the state-of-the-art VQA systems were evaluated using classic accuracy, which is
enough for the multiple-choice version of VQA. Evaluation methods for the open-ended VQA
setting need to be explored more in future research. As a starting point, the results of the
two latest VQA challenges, VizWiz and VQA-Med, have identified the possibility of using
metrics for automatic Machine Translation (MT) evaluation for VQA system evaluation.
They have tried only classic MT metrics like BLEU and METEOR (for more details see
Sect. 6). Even though BLEU is the most popular metric for MT, reports show that it fails
with short sentences. In VQA systems, most of the answers will be short. This property
suggests the usage of NEVA (Ngram EVAluation) (Forsbom 2003), which is similar in all
aspects to BLEU but adapted for short sentences.

7.4 Others

This section discusses two peculiar lessons learned from this study which can guide fur-
ther research on VQA. One path for researchers is to explore solutions for visual questions
which require external knowledge to answer. There are three notable works in this area
with acceptable results, but with limitations. The first (Wang et al. 2015), named ‘Ahab’, is
limited mainly by the available question templates (only 23) to which the natural language
question must be mapped. With this limitation, they obtained an overall accuracy of 69.6%.
They pointed out one of the major drawbacks of VQA systems modeled using LSTMs: the
incapability of LSTMs to perform explicit reasoning in many situations. The second work
worth mentioning was done by Wu et al. (2016), who combine image features with knowl-
edge extracted from an external source within the decoder portion of their encoder (CNN)–
decoder (LSTM) network. The best result obtained was on the COCO-QA dataset, with an
accuracy of 73.66%. The limitation is that they generated all knowledge-base queries only
from visual content, neglecting the question content. The third and latest work, by Wang
et al. (2018), developed a new dataset, FVQA, and a VQA algorithm which works by finding
a supporting fact from a knowledge base for each question. So, this path of research needs
a new model which is open to any type of question and also capable of explicit reasoning
with good performance.
Another parallel and possible research thought comes from the power of yes/no ques-
tions in the VQA task. The advantages of binary questions include: (1) they are easy to
evaluate and (2) they can represent a wide range of tasks. A very difficult VQA task can be
proposed by balancing binary questions, as proved by Zhang et al. (2016) using the
VQA-Abstract dataset. Even though there exists an argument against yes/no questions, that
when asked about natural images by human annotators they are biased towards the “yes”
answer, their power can be explored in medical VQA.
Another significant branch of VQA systems can be activated based on the idea of adding
an emotional adjective (Ruwa et al. 2018) to the predicted answer. This can directly provide
an extension to the Viz-Wiz challenge, where the questions are asked about photographs
taken by visually impaired people using their mobiles. Answers with an emotional adjective
certainly give an added advantage to such a community and will help them to eliminate
their accessibility barriers. In the real world, people tend to have partial or mixed emotions
at any time, which makes ‘fuzzy systems’ a well-equipped candidate for mood detection
networks. This was recently tried by Chaturvedi et al. (2019) for sentiment analysis
with exciting results. So the integration of a fuzzy mood detector into a VQA network will
definitely give rise to a promising era of affective question answering systems.

8 Conclusion

Visual Question Answering is an interesting combination of Computer Vision and Natural
Language Processing which is growing by utilizing the power of deep learning methods.
It is a very challenging task that requires solving many subtasks like object detection,
activity recognition, spatial relationships between objects, and commonsense reasoning.
VQA, now accepted as an AI-complete task, would be an essential step towards the AI
dream of visual dialogue.
This paper critically examines the existing solutions, datasets, and evaluation methods
and presents comparisons of similar methodologies in a user-friendly manner, which will
help the entire range of VQA researchers, from beginners to experts. The identified limi-
tations of current practices and the promising research directions revealed in this paper
will lead to great success for VQA and also its descendants, video question answering and
visual dialogue.
The main obstacle in the journey of VQA models towards the AI dream is that it is not
clear what the source of improvement is and to what extent a model has understood the
visual-language concepts.
In future research on VQA, researchers should focus on creating efficient, abundant,
unbiased, and goal-oriented datasets for testing the important characteristics of VQA.
VQA algorithms need to be improved by utilizing the power of Convolutional Neural
Networks to extract more natural image features, together with the latest word embedding
models, to provide meaningful multi-modal fusion, for which joint attention mechanisms
have proved their ability.

References
Agrawal A, Kembhavi A, Batra D, Parikh D (2017) C-vqa: A compositional split of the visual question
answering (vqa) v1. 0 dataset. arXiv preprint arXiv​:1704.08243​
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down
attention for image captioning and visual question answering. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp 6077–6086
Andreas J, Rohrbach M, Darrell T, Klein D (2015) Deep compositional question answering with neural
module networks. arXiv preprint. arXiv preprint arXiv​:1511.02799​
Andreas J, Rohrbach M, Darrell T, Klein D (2016) Neural module networks. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. pp. 39–48
Antol S, Zitnick CL, Parikh D (2014) Zero-shot learning via visual abstraction. In: European conference on
computer vision. Springer, Cham, pp 401–416
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) Vqa: Visual question
answering. In: Proceedings of the IEEE international conference on computer vision. pp 2425–2433
Bai Y, Fu J, Zhao T, Mei T (2018) Deep attention neural tensor network for visual question answering. In:
Computer vision–ECCV 2018: 15th European conference, Munich, Germany, September 8–14, 2018,
Proceedings. Springer, vol 11216, p 20
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn
Res 3(2):1137–1155
Ben-Younes H, Cadene R, Cord M, Thome N (2017) Mutan: multimodal tucker fusion for visual question
answering. In: Proceedings of the IEEE international conference on computer vision. pp 2612–2620
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information.
Trans Assoc Comput Linguist 5:135–146
Cao L, Gao L, Song J, Xu X, Shen HT (2017) Jointly learning attentions with semantic cross-modal correla-
tion for visual question answering. In: Australasian database conference. Springer, Cham, pp 248–260
Cer D, Yang Y, Kong SY, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C,
Sung YH (2018) Universal sentence encoder. arXiv preprint arXiv​:1803.11175​
Chaturvedi I, Satapathy R, Cavallari S, Cambria E (2019) Fuzzy common-sense reasoning for multimodal
sentiment analysis. Pattern Recognit Lett 125:264–270
Chen K, Wang J, Chen LC, Gao H, Xu W, Nevatia R (2015) ABC-CNN: An attention based convolutional
neural network for visual question answering. arXiv preprint arXiv​:1511.05960​
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning
phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint
arXiv​:1406.1078
Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence
representations from natural language inference data. arXiv preprint arXiv​:1705.02364​
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE computer society
conference on computer vision and pattern recognition. CVPR 2005. IEEE, vol 1, pp 886–893
Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target lan-
guage. In: Proceedings of the ninth workshop on statistical machine translation. pp 376–380
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv​:1810.04805​
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika
1(3):211–218
Elman JL (1990) Finding structure in time. Cognit Sci 14(2):179–211
Fang Z, Liu J, Li Y, Qiao Y, Lu H (2019) Improving visual question answering using dropout and enhanced
question encoder. Pattern Recognit 90:404–414
Feng Y, Zhu X, Li Y, Ruan Y, Greenspan M (2018) Learning capsule networks with images and text. In:
Advances in neural information processing systems
Forsbom E (2003) Training a super model look-alike: featuring edit distance, n-gram occurrence, and one
reference translation. In: Proceedings of the workshop on machine translation evaluation: towards
systemizing MT evaluation. pp 29–36
Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal compact bilinear pool-
ing for visual question answering and visual grounding. arXiv preprint arXiv​:1606.01847​
Gao H, Mao J, Zhou J, Huang Z, Wang L, Xu W (2015) Are you talking to a machine? Dataset and meth-
ods for multilingual image question. In: Advances in neural information processing systems, pp
2296–2304
Gao P, Li H, Li S, Lu P, Li Y, Hoi SC, Wang X (2018) Question-guided hybrid convolution for visual ques-
tion answering. In: Computer vision—ECCV 2018 lecture notes in computer science. pp 485–501
Geman D, Geman S, Hallonquist N, Younes L (2015) Visual turing test for computer vision systems. In:
Proceedings of the national academy of sciences. pp 201422953
Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp
1440–1448
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and
semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern rec-
ognition. pp 580–587
Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images,
tags, and their semantics. Int J Comput Vis 106(2):210–233
Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D (2017) Making the V in VQA matter: elevating the
role of image understanding in visual question answering. In: CVPR. vol 1(2), p 3
Gurari D, Li Q, Stangl AJ, Guo A, Lin C, Grauman K, Luo J, Bigham JP (2018) VizWiz grand challenge:
answering visual questions from blind people. arXiv preprint arXiv​:1802.08218​
Hasan SA, Ling Y, Farri O, Liu J, Lungren M, Müller H (2018) Overview of the ImageCLEF 2018 medical
domain visual question answering task. In CLEF2018 working notes. CEUR Workshop proceedings,
Avignon, France
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. pp 770–778
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: 2017 IEEE international conference on com-
puter vision (ICCV). IEEE, pp 2980–2988
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Huang LC, Kulkarni K, Jha A, Lohit S, Jayasuriya S, Turaga P (2018) CS-VQA: visual question answering
with compressively sensed images. arXiv preprint arXiv​:1806.03379​
Jabri A, Joulin A, van der Maaten L (2016) Revisiting visual question answering baselines. In: European
conference on computer vision. Springer, Cham, pp 727–739
Johnson J, Hariharan B, van der Maaten L, Fei-Fei L, Lawrence Zitnick C, Girshick R (2017) Clevr: A
diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. pp 2901–2910
Kafle K, Kanan C (2016) Answer-type prediction for visual question answering. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. pp 4976–4984
Kafle K, Kanan C (2017a) An analysis of visual question answering algorithms. In: 2017 IEEE international
conference on computer vision (ICCV). IEEE, pp 1983–1991
Kafle K, Kanan C (2017b) Visual question answering: datasets, algorithms, and future challenges. Comput
Vis Image Underst 163:3–20
Kafle K, Price B, Cohen S, Kanan C (2018) DVQA: Understanding data visualizations via question
answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp
5648–5656
Kahou SE, Michalski V, Atkinson A, Kadar A, Trischler A, Bengio Y (2017) Figureqa: An annotated figure
dataset for visual reasoning. arXiv preprint arXiv​:1710.07300​
Kembhavi A, Salvato M, Kolve E, Seo M, Hajishirzi H, Farhadi A (2016) A diagram is worth a dozen
images. In: European conference on computer vision. Springer, Cham, pp 235–251
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 con-
ference on empirical methods in natural language processing (EMNLP)
Kim JH, Lee SW, Kwak D, Heo MO, Kim J, Ha JW, Zhang BT (2016a) Multimodal residual learning for
visual qa. In: Advances in neural information processing systems pp 361–369
Kim JH, On KW, Lim W, Kim J, Ha JW, Zhang BT (2016b) Hadamard product for low-rank bilinear pool-
ing. arXiv preprint arXiv​:1610.04325​
Kiros R, Zhu Y, Salakhutdinov RR, Zemel R, Urtasun R, Torralba A, Fidler S (2015) Skip-thought vectors.
In: Advances in neural information processing systems, pp. 3294–3302
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Bernstein MS (2017) Visual genome: connecting
language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural net-
works. In: Advances in neural information processing systems. pp 1097–1105
Lao M, Guo Y, Wang H, Zhang X (2018) Cross-modal multistep fusion network with co-attention for visual
question answering. IEEE Access
Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Advances in neural
information processing systems pp 2177–2185
Levy O, Goldberg Y, Dagan I (2015) Improving distributional similarity with lessons learned from word
embeddings. Trans Assoc Comput Linguist 3:211–225
Li M, Gu L, Ji Y, Liu C (2018) Text-guided dual-branch attention network for visual question answering. In:
Pacific rim conference on multimedia. Springer, Cham, pp 750–760
Lienhart R, Maydt J (2002) An extended set of haar-like features for rapid object detection. In: Proceedings.
2002 international conference on image processing. IEEE, vol 1, pp I–I
Lin X, Parikh D (2016) Leveraging visual question answering for image-caption ranking. In: European con-
ference on computer vision. Springer, Cham, pp 261–277
Lioutas V, Passalis N, Tefas A (2018) Explicit ensemble attention learning for improving visual question
answering. Pattern Recognit Lett 111:51–57
Logeswaran L, Lee H (2018) An efficient framework for learning sentence representations. arXiv preprint
arXiv​:1803.02893​
Lowe DG (1999) Object recognition from local scale-invariant features. In: The proceedings of the seventh
IEEE international conference on computer vision, 1999. IEEE, vol 2, pp 1150–1157
Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical question-image co-attention for visual question answer-
ing. In: Advances in neural information processing systems. pp 289–297
Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: adaptive attention via a visual sentinel for
image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recogni-
tion. pp 375–383
Ma L, Lu Z, Li H (2016) Learning to answer questions from image using convolutional neural network. In:
AAAI. vol. 3(7), p 16
Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes
based on uncertain input. In: Advances in neural information processing systems. pp 1682–1690
Malinowski M, Rohrbach M, Fritz M (2017) Ask your neurons: a deep learning approach to visual question
answering. Int J Comput Vis 125(1–3):110–135
Malinowski M, Doersch C, Santoro A, Battaglia P (2018) Learning visual question answering by bootstrap-
ping hard attention. In: Computer vision—ECCV 2018 lecture notes in computer science. pp 3–20
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space.
arXiv preprint arXiv​:1301.3781
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and
phrases and their compositionality. In: Advances in neural information processing systems. pp
3111–3119
Miller GA, Charles WG (1991) Contextual correlates of semantic similarity. Lang Cognit Process 6(1):1–28
Narasimhan M, Schwing AG (2018) Straight to the facts: learning knowledge base retrieval for factual vis-
ual question answering. In: Proceedings of the European conference on computer vision (ECCV). pp
451–468
Noh H, Hongsuck Seo P, Han B (2016) Image question answering using convolutional neural network with
dynamic parameter prediction. In: Proceedings of the IEEE conference on computer vision and pat-
tern recognition. pp 30–38
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine trans-
lation. In: Proceedings of the 40th annual meeting on association for computational linguistics. Asso-
ciation for Computational Linguistics, pp 311–318
Peng L, Yang Y, Bin Y, Xie N, Shen F, Ji Y, Xu X (2019) Word-to-region attention network for visual ques-
tion answering. Multimedia Tools Appl 78(3):3843–3858
Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of
the 2014 conference on empirical methods in natural language processing (EMNLP). pp 1532–1543
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized
word representations. arXiv preprint arXiv:1802.05365
Prakash BS, Sanjeev KV, Prakash R, Chandrasekaran K (2019) A survey on recurrent neural network archi-
tectures for sequential learning. In: Soft computing for problem solving. Springer, Singapore, pp
57–66
Ren H, Lu H (2018) Compositional coding capsule network with k-means routing for text classification.
arXiv preprint arXiv​:1810.09177​
Ren M, Kiros R, Zemel R (2015a) Image question answering: a visual semantic embedding model and a
new dataset. Proc Adv Neural Inf Process Syst 1(2):5
Ren M, Kiros R, Zemel R (2015b) Exploring models and data for image question answering. In: Advances
in neural information processing systems. pp 2953–2961
Ren S, He K, Girshick R, Sun J (2015c) Faster r-cnn: Towards real-time object detection with region pro-
posal networks. In: Advances in neural information processing systems. pp 91–99
Ruwa N, Mao Q, Wang L, Dong M (2018) Affective visual question answering network. In: 2018 IEEE con-
ference on multimedia information processing and retrieval (MIPR)
Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. In: Advances in neural informa-
tion processing systems. pp 3856–3866
Saito K, Shin A, Ushiku Y, Harada T (2017) Dualnet: domain-invariant network for visual question answer-
ing. In: 2017 IEEE international conference on multimedia and expo (ICME). IEEE, pp 829–834
Shah M, Chen X, Rohrbach M, Parikh D (2019) Cycle-consistency for robust visual question answering. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6649–6658
Shi Y, Furlanello T, Zha S, Anandkumar A (2018) Question type guided attention in visual question answer-
ing. In: Computer vision—ECCV 2018 lecture notes in computer science. pp 158–175
Shih KJ, Singh S, Hoiem D (2016) Where to look: focus regions for visual question answering. In: Proceed-
ings of the IEEE conference on computer vision and pattern recognition. pp 4613–4621
Shrestha R, Kafle K, Kanan C (2019) Answer them all! toward universal visual question answering models.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 10472–10481
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv​:1409.1556
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in
neural information processing systems. pp 3104–3112
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A (2015) Going deeper with convo-
lutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 1–9
Teney D, Hengel AV (2018) Visual question answering as a meta learning task. In: Computer vision—
ECCV 2018 lecture notes in computer science. 229–245
Tommasi T, Mallya A, Plummer B, Lazebnik S, Berg AC, Berg TL (2019) Combining multiple cues for
visual madlibs question answering. Int J Comput Vis 127(1):38–60
Toor AS, Wechsler H, Nappi M (2019) Question action relevance and editing for visual question answering.
Multimedia Tools Appl 78(3):2921–2935
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you
need. In: Advances in neural information processing systems. pp 5998–6008
Wang P, Wu Q, Shen C, Hengel AVD, Dick A (2015) Explicit knowledge-based reasoning for visual ques-
tion answering. arXiv preprint arXiv​:1511.02570​
Wang P, Wu Q, Shen C, Dick A, van den Hengel A (2018) Fvqa: fact-based visual question answering.
IEEE Trans Pattern Anal Mach Intell 40(10):2413–2427
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting
on association for computational linguistics. Association for Computational Linguistics, pp 133–138
Wu Q, Wang P, Shen C, Dick A, van den Hengel A (2016) Ask me anything: free-form visual question
answering based on knowledge from external sources. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp 4622–4630
Wu Q, Shen C, Wang P, Dick A, van den Hengel A (2018) Image captioning and visual question answering
based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367–1381
Xu W, Rudnicky A (2000) Can artificial neural networks learn language models?. In: sixth international
conference on spoken language processing
Xu H, Saenko K (2016) Ask, attend and answer: Exploring question-guided spatial attention for visual ques-
tion answering. In: European conference on computer vision. Springer, Cham, pp 451–466
Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 21–29
Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural language
processing. IEEE Comput Intell Mag 13(3):55–75
Yu L, Park E, Berg AC, Berg TL (2015) Visual madlibs: fill in the blank description generation and question
answering. In: Proceedings of the ieee international conference on computer vision. pp 2461–2469
Yu D, Fu J, Mei T, Rui Y (2017) Multi-level attention networks for visual question answering. In: 2017
IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 4187–4195
Yu D, Gao X, Xiong H (2018a) Structured semantic representation for visual question answering. In: 2018
25th IEEE international conference on image processing (ICIP). IEEE, pp 2286–2290
Yu Z, Yu J, Xiang C, Fan J, Tao D (2018b) Beyond bilinear: generalized multimodal factorized high-order
pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947–5959
Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6281–6290
Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: European confer-
ence on computer vision. Springer, Cham, pp 818–833
Zhang P, Goyal Y, Summers-Stay D, Batra D, Parikh D (2016) Yin and yang: balancing and answering
binary visual questions. In: Proceedings of the IEEE conference on computer vision and pattern rec-
ognition. pp 5014–5022
Zhao W, Peng H, Eger S, Cambria E, Yang M (2019) Towards scalable and reliable capsule networks for
challenging NLP applications. arXiv preprint arXiv​:1906.02829​
Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015) Simple baseline for visual question answering.
arXiv preprint arXiv​:1512.02167​
Zhu Y, Zhang C, Ré C, Fei-Fei L (2015) Building a large-scale multimodal knowledge base system for
answering visual queries. arXiv preprint arXiv​:1507.05670​
Zhu Y, Groth O, Bernstein M, Fei-Fei L (2016) Visual7w: Grounded question answering in images. In: Pro-
ceedings of the IEEE conference on computer vision and pattern recognition. pp 4995–5004

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
