
Pattern Recognition Letters 151 (2021) 325–331


Visual question answering: Which investigated applications?


Silvio Barra a, Carmen Bisogni b, Maria De Marsico c, Stefano Ricciardi d,∗
a University of Naples Federico II, Naples, Italy
b University of Salerno, Salerno, Italy
c Sapienza University of Rome, Rome, Italy
d University of Molise, Campobasso, Italy

∗ Corresponding author. E-mail addresses: silvio.barra@unina.it (S. Barra), stefano.ricciardi@unimol.it (S. Ricciardi).
https://doi.org/10.1016/j.patrec.2021.09.008

Article info

Article history:
Received 16 June 2020
Revised 16 July 2021
Accepted 7 September 2021
Available online 16 September 2021
Edited by Dr. S. Wang

MSC: 41A05, 41A10, 65D05, 65D17

Keywords: Visual question answering, Real-world VQA, VQA for medical applications, VQA for assistive applications, VQA for context awareness, VQA in cultural heritage and education

Abstract

Visual Question Answering (VQA) is an extremely stimulating and challenging research area where Computer Vision (CV) and Natural Language Processing (NLP) have recently met. In image captioning and video summarization, the semantic information is completely contained in still images or video dynamics, and it has only to be mined and expressed in a human-consistent way. Differently from this, in VQA the semantic information in the same media must be compared with the semantics implied by a question expressed in natural language, doubling the artificial intelligence-related effort. Some recent surveys about VQA approaches have focused on the methods underlying either the image-related processing or the verbal-related one, or on the way to consistently fuse the conveyed information. Possible applications are only suggested and, in fact, most cited works rely on general-purpose datasets that are used to assess the building blocks of a VQA system. This paper rather considers the proposals that focus on real-world applications, possibly using as benchmarks suitable data bound to the application domain. The paper also reports about some recent challenges in VQA research.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Visual Question Answering (VQA) is at present one of the most interesting joint applications of Artificial Intelligence (AI) to Computer Vision (CV) and Natural Language Processing (NLP). Its purpose is to achieve systems capable of answering different types of questions expressed in natural language and regarding any image. To this aim, a VQA system relies on algorithms of different nature that jointly take as input an image and a natural language question about it and generate a natural language answer as output. Humans naturally succeed in this, except for special conditions, and AI aims at reproducing this ability. The role of NLP in solving this multi-disciplinary problem is to understand the question and, of course, to generate the answer according to the results obtained by CV. Text-based Q&A is a longer studied problem in NLP. The difference with VQA is that both search and reasoning regard the content of an image. A classification of the CV tasks entailed by VQA can be found in the recent survey in [37] and is summarized in the right part of Fig. 1, where we also show the NLP tasks involved in this case. In the left part of the figure, the overall workflow of a classical VQA framework is summarized with a proposed example.

VQA research crosses several AI areas, including CV, NLP and also Knowledge Representation & Reasoning (KR), the latter being able to reason and extract semantics from the processed media. Several surveys discuss VQA approaches from different points of view. Among the most recent ones, [37] propose an extensive analysis and comparison, among other aspects, of the different methodologies underlying the different steps of VQA, including featurization of both image and question (Phase I) and joint comprehension of image and question features to generate a correct answer (Phase II). [58] devote special attention to the fusion techniques adopted in Phase II, distinguishing between fusion techniques for image QA and for video QA. Similarly, [56] classify methods by their strategy to connect the visual and textual modalities and, in particular, examine the approach of combining convolutional and recurrent neural networks to map images and questions onto a common feature space. Looking at these surveys, it is possible to observe that their attention is focused on methodological proposals, generally neglecting the possible application domains. The latter are only shortly listed as those where VQA can be useful. The set mentioned by [37] includes: helping blind users to communicate through pictures, attracting customers of online shopping sites by giving "semantically" satisfying results for their search queries, allowing learners engaged in educational services to interact with images, and helping analysts in surveillance data analysis to summarize the available visual data. The authors also hypothesize that Visual Dialogue, envisaged as a successor of VQA, can even be used to give natural language instructions to robots. A similar list is presented in [58]: blind person assistance (the most popular according to the citations achieved), autonomous driving, smart camera processing on food images, implementation of robot tutors with the function of automatic math problem solvers, and execution of trivial tasks such as "spotting an empty picnic table at the park, or locating the restroom at the other end of the room without having to ask." A more general use is mentioned by [26] for advanced image retrieval. Without using image meta-data or tags, it could be possible, e.g., to find all images taken in a certain setting: one might simply ask "Is it raining?" for all images in the dataset without using image annotations. The general-purpose nature of most VQA-related literature is also reflected by the datasets exploited as benchmarks: even in the review papers, the surveyed datasets are mostly general-purpose ones, where images are possibly classified as natural, clip art or synthetic as in [56], with no reference to a specific source domain. Only the work in [26] focuses, among the other topics, on criticizing some current popular datasets with regard to their ability to properly train and assess VQA algorithms, and on proposing new features for future VQA benchmarks. Among the others, the authors mention a larger size and also a lower amount of bias, since current VQA systems are considered more dependent on the questions and how they are phrased than on the image content. The additional requirement explored here is that the dataset used for performance evaluation should also be related to the VQA domain, if this presents specific conditions. As a matter of fact, most works (relatively fewer than the general-purpose ones) tackling a specific application domain also propose suitable related datasets.
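To make the two phases just described concrete, the following minimal sketch (our own illustration, not the implementation of any surveyed system) shows the skeleton most of the reviewed approaches share: a convolutional image encoder and a recurrent question encoder (Phase I), followed by a fusion step and a classifier over a fixed answer vocabulary (Phase II). A real system would replace the toy CNN with a pre-trained backbone such as ResNet or VGG, and would typically use a richer fusion than the element-wise product used here.

    import torch
    import torch.nn as nn

    class SimpleVQA(nn.Module):
        """Illustrative two-phase VQA skeleton (not the model of any surveyed paper)."""

        def __init__(self, vocab_size=10000, num_answers=1000, embed_dim=300, hidden_dim=1024):
            super().__init__()
            # Phase I (image): toy CNN standing in for a pre-trained backbone (ResNet, VGG, ...)
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.img_proj = nn.Linear(128, hidden_dim)
            # Phase I (question): word embeddings + LSTM encoder
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            # Phase II: fusion + classification over a fixed answer vocabulary
            self.classifier = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, num_answers))

        def forward(self, image, question_tokens):
            v = torch.relu(self.img_proj(self.cnn(image).flatten(1)))  # (B, hidden_dim)
            _, (h, _) = self.lstm(self.embed(question_tokens))
            q = h[-1]                                                  # (B, hidden_dim)
            fused = v * q                                              # element-wise fusion
            return self.classifier(fused)                              # answer scores

    # Toy usage: a batch of 2 images and 2 tokenized questions of length 12
    model = SimpleVQA()
    logits = model(torch.randn(2, 3, 224, 224), torch.randint(1, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 1000])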

Fig. 1. An example of the workflow of a VQA method (left) and some of the CV and NLP tasks involved in VQA (right).

Table 1
A summary of the application domains tackled by the VQA literature, with the main features of the datasets used as benchmarks. The adopted approaches are also reported, with the best results for each dataset. A legend for interpreting the table follows. Dataset: C: Classes, L: Labels. Dataset Size: AS: Audio Scenes, C: Classes, T: Text, HoV: Hours of Video, I: Images, Ic: Icons, V: Videos. Notes: A: Answers, At: Attributes, C: Classes, Cap: Caption, I: Images, Imp: Impression, Q: Questions. Approaches: AM: Attention Mechanism, BM: BERT Model, CM: Classifier Model, DN: DenseNet, ED: Encoder-Decoder, F: Fusion, FRCNN: Faster R-CNN, GM: Generative Model, IA: Image Attention, IN: ImageNet, IV2: Inception V2, MLA: Multi-Level Attention, MMFN: Multi-Step Modality Fusion Network, QCM: Question Classifier Model, RM: Reasoning Module, RN: ResNet, TE: Text Embedding, WE: Word Embedding, W2V: Word2Vec. Best Results: ACC: Accuracy, BLEU: Bilingual Evaluation Understudy score, TDA: Task-Dependent Accuracy, WBSS: Word-Based Semantic Similarity.

Domain Reference(s) Dataset Dataset Size Notes Approach Best Result

[53] ImageCLEF 2019 3200 I 12,792 Q&A RN152+BM (AM+F) [42]


[42] RN152+TE (GM) ACC: 64%
[7] DN121+LSTM (ED) BLEU: 65,9%
[11] VQACM (Seq-to-Seq)
[5] QCM+VGG
[12] RN50+W2V (ED)
[36] IN+NLP
[52] RN152+BM (AM)
Medical [57] VGG16+Enc (Co-AM)
[46] RN152+LSTM (Co-AM+MFH)
[28] ED+RM
[59], ImageCLEF 2018 2866 I 6413 Q&A IV2+BiLSTM (AM) [8]
[6] VGG16+BDLSTM BLEU: 0,188
[8] VGG16+GRU (ED) WBSS: 0,209
[1] VGG16+LSTM (SAN)
[41] RN+LSTM (Co-AM+MFP F)
[34] IN+LSTM (Co-AM)
[39] VQA-RAD 315 I 3515 Q&A CNNEnc.+WE (Co-AM) ACC: 74,1%
[19] VizWiz-Priv 5537 I 1403 Q&A Several Approaches -
Visually [20] VizWiz 31K I 31,000 Q 10 A each Several Approaches -
impaired [9] VQAv2.0 1.1M I 1.1M Q&A FRCNN+RN101+LSTM ACC: 70,34%
[55] SEVN Simulator - - RL-based Approach ACC: 74,8%
Video [33] RAP 587 HoV 84,928 I 72 At - -
Surveillance [49] BOAR + BTV ∼45K I 5 V ∼23K Q 101 C RN50+BiLSTM ACC: 61,36%
[51] Homemade 2V 5 Cap each - -
Cultural [16] Annotated Artpedia 30 I ∼120 Q&A FRCNN+BM ACC:25,1%
Heritage [45] Homemade 16 I, 444 T 805 Q&A - -
Advertising [25] Homemade 64,832 I, 3477 V ∼273k Q&A RN+LSTM TDA
[40] Homemade 3747 I 500k Imp MMFN -
[17] Homemade 1490 I in 360◦ 16,945 Q&A Tucker&Diffusion (MLA) ACC: 58,66%
[44] Homemade 10,209 I 9156 T 9267 Q&A CNN enc.(I&Q) ACC: 39,63%
[54] TACoS-QA 185 V 21,310 Q&A 3DCNN+LSTM ACC: 24,82%
[54] MSR-VTT-QA 3852 V 19,748 Q&A
Misc [3] Homemade 7500 AS 300,000 Q&A FiLM Network MAC Network ACC: 90,3% ACC: 44,8%
[24] Homemade 85,321 Ic 429,654 Q&A CNN+LSTM ACC: 25,78%
[13] VQA 2.0+VizWIz+L 44,955 I 224,775 Q&A CNN+GRU ACC: 44,55%
[50] COCO-A+VG+C 19,431 I, 342 C ∼80K Q&A GloVe+BiLSTM ACC: 79,32%
[48] VQA-CPv2+COCO 325,721 I ∼1.5M Q ∼15M A IA+WE ACC: 34,25%


The aim of the present paper is to survey VQA proposals from a novel point of view, and to investigate to what extent different application domains inspire different kinds of questions and call for different benchmarks and/or approaches. An extensive literature search reveals that relatively few papers have tackled specific domains, indeed. The following sections will focus on this aspect. According to the literature, the most diffused specific application of VQA is the support to automatic intelligent medical diagnosis, which, consequently, deserves a large section. It encompasses different kinds of problems, characterized by different kinds of image capture technologies and image content types. The aid to blind and visually impaired individuals follows, enabling them to get information about images both on the web and in the real world, e.g., in advanced domotics. A kind of anticipation, though without implementation, is already envisaged in [38]. It is worth noticing that these two domains have inspired not only the collection of specific ad-hoc datasets, but also their use as benchmarks in domain-related international challenges. A much lower number of devoted works deal with unattended surveillance, with proposals for systems able to relieve a human operator from the burden of continuous attention and to raise an alarm in anomalous situations. Social and cultural purposes inspire systems addressing advanced education and personalized fruition of cultural heritage, and smart and customer-tailored advertisement.

The paper will finally report about some very recent works focusing on the novelty of either the kind of data taken into account or of the new approaches to questioning/answering.

The remainder of this paper is organized as follows. Sections 2 to 6 present recent works in the domains of medical VQA, support for blind people, video surveillance, education and cultural heritage, and advertising; Section 7 presents emerging approaches for new kinds of data and new questioning/answering strategies. Section 8 briefly points out some concluding remarks.

Fig. 2. Examples of images from datasets collected to evaluate VQA applications: (1) Medical image from ImageCLEF 2018 VQA-Med ([2]); (2) Image processed for visually impaired people, from the VizWiz dataset ([20]); (3) Image from the BOAR dataset ([50]); (4) Image of a painting from Artpedia ([47]); (5) Advertising image from [25].

2. Medical VQA

AI-based medical image understanding and the related medical question answering (from here on, med-VQA) is recently attracting increasing interest by researchers. In fact, this topic is opening new scenarios for supporting medical staff in taking clinical decisions, as well as for enhanced diagnosis through a computer-based "second opinion". However, the experimentation of any approach is conditioned by the availability of a dedicated database including medical images, possibly of a specific type, and related QA pairs. These requirements have been first addressed by the ImageCLEF 2018 evaluation campaign for the Medical Domain Visual Question Answering pilot task, as described in [21]. The related first med-VQA public dataset included a total of 2866 medical images, 2278 of which used for training, 264 for testing and 324 for validation, along with 6413 QA pairs. In the following ImageCLEF 2019 edition ([2]), a larger dataset containing 4200 radiology images as well as 15,992 QA pairs was released, with a wide variety of imaging modalities, types of organs and pathologies. All the works reviewed in the following have based the reported experiments on one of the aforementioned datasets.

More recently, the introduction of two new datasets, namely VQA-RAD presented in [31] and PathVQA described in [23], promises to further improve the variety and specificity of training and test samples for this challenging declination of VQA. A broad range of deep frameworks has been proposed to address the requirements of med-VQA. The authors of [53] propose a med-VQA deep learning approach exploiting a multimodal question-centric strategy that fuses together the image and the written question in the query, assigning a greater fusion weight to the latter. The fusion mechanism combines question and image features to achieve maximum adherence to the query sentence. The answer to the query can be of different types, ranging from binary answers and numbers to short sentences. The achieved accuracy exceeds the state-of-the-art.
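The question-centric fusion idea can be sketched in a few lines: both modalities are projected to a common space and combined with a weight that favours the question. This is only an illustration of the general mechanism, with arbitrary dimensions and a fixed weight; it does not reproduce the actual fusion network of [53].

    import torch
    import torch.nn as nn

    class QuestionWeightedFusion(nn.Module):
        """Toy fusion that biases the joint representation towards the question (alpha > 0.5)."""

        def __init__(self, img_dim=2048, ques_dim=1024, fused_dim=1024, num_answers=500, alpha=0.7):
            super().__init__()
            self.alpha = alpha                               # fixed fusion weight for the question
            self.img_proj = nn.Linear(img_dim, fused_dim)
            self.ques_proj = nn.Linear(ques_dim, fused_dim)
            self.classifier = nn.Linear(fused_dim, num_answers)

        def forward(self, img_feat, ques_feat):
            v = torch.tanh(self.img_proj(img_feat))          # projected image features
            q = torch.tanh(self.ques_proj(ques_feat))        # projected question features
            fused = self.alpha * q + (1.0 - self.alpha) * v  # convex combination favouring q
            return self.classifier(fused)                    # scores over candidate answers

    fusion = QuestionWeightedFusion()
    scores = fusion(torch.randn(4, 2048), torch.randn(4, 1024))
    print(scores.shape)  # torch.Size([4, 500])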


In [42], a novel method is presented to break the complex med-VQA problem down into multiple simpler problems through a classification and generative model. To this aim, the proposed model uses data augmentation as well as text tokenization, switching between classification and generative models by changing both the output layer and the loss function while retaining the core component. The generative model is built by masking position by position instead of using an encoder-decoder framework.

Transfer learning and multi-task learning within a modular pipeline architecture are used in [28] to cope with the wide variety of images in the ImageCLEF 2019 dataset by extracting its inherent domain knowledge. The proposed Cross Facts Network basically exploits upstream tasks to cross-utilize information useful to increase the precision on more complex downstream tasks. This results in a clear score improvement on the validation set. On a similar line of research, the authors of [34] propose ETM-Trans, a deep transfer learning approach based on embedded topic modelling applied to textual questions, through which topic labels are associated to medical images for fine-tuning the pre-trained ImageNet model. A co-attention mechanism is also exploited, where residual networks are used to provide fine-grained contextual information for answer derivation. In [59] a CNN-based Inception-ResNet-v2 model is used to extract image features, along with an RNN-based Bi-LSTM model to encode questions. The concatenation of image features and encoded questions is then used to generate the answers. A normalization step, including both image enhancement techniques and question lemmatization, is performed beforehand.

The shortage of large labeled datasets to effectively train deep learning models for med-VQA is the focus of [39]. The authors explore the use of an unsupervised denoising auto-encoder to leverage the availability of large quantities of unlabelled medical images to achieve trained weights that can be more easily adapted to the med-VQA domain. Moreover, they also exploit supervised meta-learning to learn meta-weights which can adapt to the domain of interest requiring only a small labeled training set. The authors of [36] present a convolutional neural-network-based med-VQA system, aimed at providing answers according to input image modalities such as X-ray, computer tomography, magnetic resonance, ultrasound, positron emission tomography, etc., where the image modality can also be identified by the system. On a similar line of research, in [12] a CNN is used to process medical image queries, along with an RNN encoder-decoder model to encode image and question input vectors and to decode the states required to predict target answers. These are generated in natural language as output by means of a greedy search algorithm. An encoder-decoder model is also at the core of [7]. Here a pre-trained CNN model is used in the encoding step along with an LSTM model to embed textual data. Another deep learning inspired approach is the one proposed in [6], where a combination of CNN and bi-directional Long Short Term Memory (LSTM) coupled with a decision tree classifier is used to address the med-VQA problem in terms of a multi-label classification task. According to this method, each label is associated to a unique word among those included in the answer dictionary previously built upon the training set. Similarly, the authors of [41] exploit residual networks of a deep learning framework to extract visual features from the input image as a result of its interaction with the LSTM representation of the question. The goal is to achieve small-granularity context data useful to derive the answer. Efficient visual-textual feature integration is achieved through Multi-modal Factorized High-order as well as Multi-modal Factorized Bilinear pooling.

LSTM and Support Vector Machine (SVM) are explored in [46]. LSTM is used for extracting textual features from questions, along with image features thanks to transfer learning and a co-attention mechanism. An SVM-based model is trained to predict what category a question belongs to, providing an additional feature. All the resulting features are then efficiently integrated by means of a multi-modal factorized high-order pooling technique. [5] propose a med-VQA model based on differently specialized sub-models, each optimized to answer a specific class of questions. The considered image classification sub-models include "modality", "abnormality", "organ systems" and "plane", and are defined through a pre-trained VGG16 network. Since the questions related to each type are repetitive, the approach does not rely on them to predict the answers; rather, they are used to choose the best suited model to produce the answers and their format. A CNN based on the VGG16 network is also exploited in [57], along with a global average pooling strategy, to extract medical image features by means of a small-size training set. A BERT model is used to encode the semantics behind the input question, and then a co-attention mechanism enhanced by jointly learned attention is used for feature fusion. A bilinear model aimed at grouping and synthesizing the extracted image and question features for med-VQA is proposed by [52]. This model exploits an attention-based scheme to focus on the relevant input context, instead of relying on additional training data. Additionally, the method is also boosted by an ensemble of trained models.

On a different line of research, image captioning and machine translation are explored by the authors of [8], aiming at generating an answer to the image-question pair in terms of a sequence of words. Image captioning requires an accurate image understanding, and similarly machine translation requires an accurate comprehension of the input sequence to effectively translate it. This approach provided the highest accuracy scores for the ImageCLEF 2018 challenge. Stacked Attention Network (SAN) and Multi-modal Compact Bilinear Pooling (MCB) VQA models are used in [1]. In this approach, both models rely on CNNs for image processing, respectively VGG-16 for SAN and ResNet-152 for MCB, while LSTMs are exploited for question processing. Their final hidden layer provides the question vector extraction. In [11], the proposed med-VQA pipeline partitions questions into two classes. The first one requires answers to come from a fixed pool of categories, while the second one requires to generate answers based on abnormal visual features in the input image. The first class is addressed by using Universal Sentence Encoder question embeddings and ResNet image embeddings, feeding an attention-based mechanism to generate answers. The second class uses the same ResNet image embedding along with word embeddings from a Word2Vec model. This is pre-trained on PubMed data and used as input to a sequence-to-sequence model which generates descriptions of abnormalities.

3. VQA for visually impaired people

Assistance to blind people is among the objectives of several VQA applications proposed in recent years. This is mainly due to the ability of automatic VQA to answer daily questions, which may help visually impaired people to live without visual barriers. During the last ten years there has been a quite fast evolution, which led from needing the aid of volunteers or workers paid for answering blind users' questions [14,30], to the automatic analysis of images and related questions to extract and generate the proper answers. To this aim, the dataset proposed and described in [20] contains 31,000 visual questions originated by blind people who took a photo with their mobile phone and recorded a spoken question about it. Each image is labelled with 10 crowdsourced answers. A "privacy preserving" version of the dataset is released in [19], where image private regions are removed (credit card numbers, subject information on medical prescriptions, etc.). An iPhone application, named VizWiz [15], allows asking a visual question and obtaining an answer in nearly real time.
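Datasets of this kind, where every visual question comes with 10 crowdsourced answers, are usually scored with the consensus accuracy popularized by the VQA benchmarks: an answer counts as fully correct if at least three annotators gave it. The snippet below illustrates a common simplified form of that metric; it is not code taken from the surveyed works.

    def vqa_consensus_accuracy(predicted, human_answers):
        """min(#annotators agreeing with the prediction / 3, 1): the usual 10-answer consensus score."""
        matches = sum(1 for a in human_answers if a.strip().lower() == predicted.strip().lower())
        return min(matches / 3.0, 1.0)

    # Toy example with 10 crowdsourced answers for one visual question
    answers = ["yes"] * 7 + ["no"] * 2 + ["unanswerable"]
    print(vqa_consensus_accuracy("yes", answers))  # 1.0
    print(vqa_consensus_accuracy("no", answers))   # 0.666...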


The authors of [9] propose a combined bottom-up and top-down attention mechanism in which the question is analyzed by means of a GRU and the image is processed by a CNN. The vectors are then combined to produce a set of scores over the candidate answers. The bottom-up mechanism is based on a Faster R-CNN, which submits to the model the image regions together with the labels, while the top-down mechanism weighs the image features by applying a weighted sum with the GRU output. Another interesting application aimed at helping blind people is described in [55], in which the authors exploit a reinforcement learning model in order to help a blind person to navigate the street.

4. VQA in video surveillance scenarios

The adoption of a VQA approach in video surveillance scenarios may help operators to enhance the understanding of a scene, thus helping them to take fair and faster decisions. The authors of [33] propose a complete platform called ISEE for parsing large video surveillance data. The platform is organized in three modules, which are distributed on both a CPU and a GPU cluster: (i) detection and tracking module, (ii) attribute recognition module, and (iii) re-identification module. The first module exploits a Gaussian Mixture Model for the analysis on the CPU and a Single Shot multibox detector with a Faster R-CNN for the detection on the GPU. Both use the Nearest Neighbor-based tracker. The second module exploits DeepMAR and LSPR_attr for the attribute recognition; the third exploits LSPR_ReId and MSCAN for re-identification. The system has been tested over the RAP dataset [32]. The authors of [49] mostly focus on the soft biometric aspects of the Q&A in video surveillance, thus proposing C2VQA-BOARS (Context and Collaborative Visual Question Answering for Biometric Object-Attribute Relevance and Surveillance). The system answers a question by fusing information from the question itself with the caption obtained by an analysis of the image. Three models are proposed: (i) C2VQA-All uses a set of BiLSTMs to encode question and caption; four equally weighted training objectives are used to train the model: question relevance, type of object in the question, the attributes of the object, and the final classification for the relevance of the object; (ii) C2VQA-Image takes a set of GloVe word embeddings and uses a 2-layer LSTM for question encoding; the question is combined with the dense vector obtained by feeding a pre-trained ResNet-50 model with the considered image; (iii) C2VQA-Rel is similar to C2VQA-All, but only takes the binary relevance of the question and the final classification of the object-attribute relevance. More conceptual approaches can be found in [27] and in [51]. In the former, the authors propose a system for supporting the question answering operation about moving objects in videos, by filtering the trajectory information of the objects (people and vehicles) and representing the movements by means of a structured annotation. These annotations can be easily navigated to obtain answers to movement-related questions ("Which direction is the red car going?", "Did any cars leave the garage?", ...). The latter work, instead, proposes an ontology/graph based taxonomy schema for describing events in the video and the associated captions. A probabilistic generative model is then used for capturing the relations between the input video and the input question.

5. VQA for education and cultural heritage

One of the main aspects of VQA is its high correlation to human perception. Even if a VQA system can focus the attention on different parts of an image compared to humans, it is proven that devising a VQA architecture "interested" in the same image parts is possible [18]. The inverse process can also be carried out. [22] developed and tested an educational robot using VQA to formulate questions and start an educational dialog, taking inspiration from the surrounding environment and using a Faster R-CNN. This system shows a great ability to improve the children's desire to explore. VQA can improve such desire for adults too, in particular for cultural heritage. [16] propose to explore museums and art galleries using VQA to interact with an audio-guide. The authors use many classical VQA datasets and a cultural dataset named Artpedia [47] to feed two BERT modules, for question classification and answering, and a Faster R-CNN for the VQA module. They annotated 30 images of Artpedia with 3 or 5 Q&A to perform tests. As a result, the user can directly ask the questions he/she is interested in, avoiding long descriptions and freely navigating through the elements of the painting or the sculpture. This way of exploring art can replace static audio-guides, and the growing interest in this goal has inspired the construction of a dedicated dataset [45]. This dataset is focused on the old-Egyptian Amarna period and contains 16 artworks, 805 questions, 204 documents, 101 related documents and 139 related paragraphs in English. This dataset is very specific and quite limited, but it is an interesting starting point to apply VQA in the service of art.

6. VQA and advertising

Advertising is strongly related to image understanding. A user looking at an advertisement not only sees the objects inside the scene, but also the related text and the relations among the objects, and interprets all such information within a precise cultural context. An advertisement must be quite simple, to be understandable by the greatest number of people, and at the same time interesting and eye-catching. No surprise, then, that VQA can find a challenging field of application in advertising. The first task to complete is using VQA to understand the advertisement and, in particular, the underlying communicative strategy. [25] present two datasets for this purpose, one with images and one with videos. The image dataset contains 64,832 ads. For each advertisement there is a set of Q&A about what the client is led to do by it, for a total of 202,090 elements with 38 topics, 30 sentiments and 221 symbols. The video dataset has 3477 elements, with 3 or 5 questions per video and the same symbols, sentiments and topics as the images. The authors use a two-layer LSTM and VGGNet to decode image ads and 152-layer ResNets to decode video ads. Once an automatic system is able to understand the meaning of an advertisement, it is natural to ask whether it is possible to automatically choose which ads to show. The authors of [40] focus their research on predicting the users' preferences by understanding what impresses them most. They built Real-ad, a dataset of 3747 images with 40 attributes, and collected about 500 million impressions from the users. They include VQA in a low-level fusion using LSTM, followed by an attention mechanism and a high-level fusion to emphasize the relations between visual and auxiliary information. The result is an attention heatmap that shows the parts of the images by which the potential client is attracted. Finally, the most challenging step will be to automatically find the most successful ads to help advertising designers.

[60] propose a way to use VQA to extract relevant information about past campaigns from multi-source data such as texts and images. The authors use 64,000 images from the dataset in [25] and extract information, also searching online, analyse keyphrases and generate new possible advertisements. To this aim, they use a cross-modality encoder architecture followed by a feed-forward network. Even if not fully explored yet, this task shows an interesting research direction.
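A recurring ingredient in the approaches above, from the top-down weighting of Faster R-CNN regions in [9] to the attention heatmaps used for ad assessment in [40], is question-guided attention over image regions. The sketch below shows only that shared mechanism, with arbitrary dimensions and names of our own choosing; it is not the implementation of any cited system.

    import torch
    import torch.nn as nn

    class RegionAttention(nn.Module):
        """Toy question-guided attention over K image regions; the weights form a per-region 'heatmap'."""

        def __init__(self, region_dim=2048, ques_dim=1024, hidden=512):
            super().__init__()
            self.score = nn.Sequential(
                nn.Linear(region_dim + ques_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, 1))

        def forward(self, regions, ques):
            # regions: (B, K, region_dim), e.g. detector region features; ques: (B, ques_dim)
            B, K, _ = regions.shape
            q = ques.unsqueeze(1).expand(B, K, ques.shape[-1])           # repeat the question per region
            att = torch.softmax(self.score(torch.cat([regions, q], dim=-1)).squeeze(-1), dim=1)
            attended = (att.unsqueeze(-1) * regions).sum(dim=1)          # question-weighted sum of regions
            return attended, att                                         # att can be rendered as a heatmap

    layer = RegionAttention()
    pooled, weights = layer(torch.randn(2, 36, 2048), torch.randn(2, 1024))
    print(pooled.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])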


7. Emerging challenges/Misc

Emerging lines in VQA mainly focus on new inputs or new kinds of Q&A.

Concerning the data, [17] explore the use of 360° images, in which the information can be located anywhere in the field of view. They used 1490 indoor images from two 3D datasets and generated 16,945 Q&A. A cubemap-based architecture extracts visual features and, then, a multi-level attention network aggregates the features. Besides the image, a VQA system may be required to extract information also from text. [44] built a dataset to test modalities in which an image-text joint inference is required. The dataset is composed of educational, web and other VQA dataset resources, for a total of 10,209 images, 9156 different texts and 9267 questions. They also tested this new dataset using existing VQA models, discovering that the latter do not fit well these new data. Video question answering is also a well-known field; however, long-video QA is unexplored. [54] manage this kind of data in their work, building two datasets of long videos, TACoS-QA and MSR-VTT-QA, containing 187 and 3852 videos, respectively. The number of Q&A is about 20,000 per dataset. Based on that, they develop a matching-guided attention model for video and question embedding, question-related video content localization and answer prediction. The VQA know-how is also extending to Acoustic Question Answering (AQA). [3] show how to create auditory scenes and related Q&A. Their 7500 scenes and 300,000 Q&A were tested using two neural networks: FiLM, based on Conditional Batch Normalization, and MAC, based on LSTM models.

Concerning the new challenges that can emerge for Q&As, the most intuitive is the possibility of multiple answers. Both [24] and [13] explore this possibility. The former start this research by building a dataset of 100 simple icons randomly located in 85,321 images, for a total of 429,654 Q&A. They tested this new dataset by building an LSTM-based neural network. The latter focus on the reason why a question can have more than one answer. They built a dataset, starting from two popular datasets, VQA 2.0 [10] and VizWiz [20], for a total of 44,955 images and 224,775 annotations, labelling 9 reasons for different answers. They built an answer prediction network based on a CNN and an attention model, in which not only the answer is predicted but also a probable reason for the difference. Studying the properties of the questions is gaining popularity. [50] built not only a system to detect when a question is not related to the image, but also a method to edit this question. Starting from the COCO-A [43] and the VG [29] datasets, they obtained 19,431 images and 55,738 Q&A for the question relevance and 22,172 questions about 342 classes for the question editing. Their method is then based on a BiLSTM embedded in a neural network architecture. From all those works it is clear that a huge amount of Q&A is required to build a well generalising system; however, collecting them is time-consuming. For this reason, [48] propose a new VQA system in which the ability to answer an unknown question is obtained by fusing known questions and external data. For this purpose they focus on weight adaptation of a basic VQA model, using the VQA-CP v2 dataset [4] and the COCO captioning dataset.

Finally, to evaluate the effectiveness of a VQA system, [35] propose an Inverse VQA. They use a question encoder that encodes the image and question, and a decoder that, from the features of the image, generates a visual question. The question encoder is based on an LSTM architecture, and the authors show how to use this method to perform VQA diagnosis.

8. Concluding notes

Two elements are interesting to underline. First, in most cases, even though specific datasets are collected, methods are mostly inherited, though sometimes adapted. In general, except for med-VQA, there is no attempt to boost optimization following a stronger domain characterization, except for new kinds of data. It can be hypothesized that when the image features are very different than usual, as in medical imaging, this could lead to a more effective design of the entailed processing. A second consideration concerns the size of the datasets. While this problem is underlined in the general case in [26], ad-hoc datasets seldom reach the huge amount of information requested for a robust generalizability of the obtained performance. These two aspects definitely deserve attention from the VQA community in order to finally reach performance able to boost real-world domain-specific applications.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] A.B. Abacha, S. Gayen, J.J. Lau, S. Rajaraman, D. Demner-Fushman, NLM at ImageCLEF 2018 visual question answering in the medical domain, CLEF (Working Notes), 2018.
[2] A.B. Abacha, S.A. Hasan, V.V. Datla, J. Liu, D. Demner-Fushman, H. Müller, VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019, in: CLEF 2019 Working Notes, CEUR Workshop Proceedings, 2019, pp. 09–12.
[3] J. Abdelnour, G. Salvi, J. Rouat, From visual to acoustic question answering, arXiv preprint arXiv:1902.11280 (2019).
[4] A. Agrawal, D. Batra, D. Parikh, A. Kembhavi, Don't just assume; look and answer: overcoming priors for visual question answering, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018) 4971–4980.
[5] A. Al-Sadi, B. Talafha, M. Al-Ayyoub, Y. Jararweh, F. Costen, JUST at ImageCLEF 2019 visual question answering in the medical domain, Working Notes of CLEF (2019).
[6] I. Allaouzi, M.B. Ahmed, Deep neural networks and decision tree classifier for visual question answering in the medical domain, CLEF (Working Notes), 2018.
[7] I. Allaouzi, B. Benamrou, M. Ahmed, An encoder-decoder model for visual question answering in the medical domain, Working Notes of CLEF (2019).
[8] R. Ambati, C.R. Dudyala, A sequence-to-sequence model approach for ImageCLEF 2018 medical domain visual question answering, in: 2018 15th IEEE India Council International Conference (INDICON), IEEE, 2018, pp. 1–6.
[9] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[10] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C.L. Zitnick, D. Parikh, VQA: Visual question answering, in: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, IEEE Computer Society, USA, 2015, pp. 2425–2433, doi:10.1109/ICCV.2015.279.
[11] M. Bansal, T. Gadgil, R. Shah, P. Verma, Medical visual question answering at ImageCLEF 2019 VQA-Med, CLEF (Working Notes), 2019.
[12] A. Bghiel, Y. Dahdouh, I. Allaouzi, M.B. Ahmed, A.A. Boudhir, Visual question answering system for identifying medical images attributes, in: The Proceedings of the Third International Conference on Smart City Applications, Springer, 2019, pp. 483–492.
[13] N. Bhattacharya, Q. Li, D. Gurari, Why does a visual question have different answers? in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4271–4280.
[14] J.P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R.C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, T. Yeh, VizWiz: Nearly real-time answers to visual questions, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, UIST '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 333–342.
[15] J.P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R.C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al., VizWiz: nearly real-time answers to visual questions, in: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010, pp. 333–342.
[16] P. Bongini, F. Becattini, A.D. Bagdanov, A.D. Bimbo, Visual question answering for cultural heritage, arXiv abs/2003.09853 (2020).
[17] S.-H. Chou, W.-L. Chao, W.-S. Lai, M. Sun, M.-H. Yang, Visual question answering on 360° images, arXiv abs/2001.03339 (2020).
[18] A. Das, H. Agrawal, L. Zitnick, D. Parikh, D. Batra, Human attention in visual question answering: do humans and deep networks look at the same regions? Comput. Vision Image Understanding 163 (2017) 90–100 (Language in Vision).
[19] D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, J.P. Bigham, VizWiz-Priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 939–948.
[20] D. Gurari, Q. Li, A. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, J. Bigham, VizWiz grand challenge: Answering visual questions from blind people, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3608–3617.


[21] S.A. Hasan, Y. Ling, O. Farri, J. Liu, H. Müller, M. Lungren, Overview of ImageCLEF 2018 medical domain visual question answering task, CLEF (Working Notes), 2018.
[22] B. He, M. Xia, X. Yu, P. Jian, H. Meng, Z. Chen, An educational robot system of visual question answering for preschoolers, in: 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE), 2017, pp. 441–445.
[23] X. He, Y. Zhang, L. Mou, E. Xing, P. Xie, PathVQA: 30,000+ questions for medical visual question answering, arXiv preprint arXiv:2003.10286 (2020).
[24] S.H. Hosseinabad, M. Safayani, A. Mirzaei, Multiple answers to a question: a new approach for visual question answering, Vis Comput (2020) 1–13.
[25] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka, Automatic understanding of image and video advertisements, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] K. Kafle, C. Kanan, Visual question answering: datasets, algorithms, and future challenges, Comput. Vision Image Understanding 163 (2017) 3–20.
[27] B. Katz, J.J. Lin, C. Stauffer, W.E.L. Grimson, Answering questions about moving objects in surveillance videos, New Directions in Question Answering, 2003.
[28] T. Kornuta, D. Rajan, C. Shivade, A. Asseman, A.S. Ozcan, Leveraging medical visual question answering with supporting facts, arXiv preprint arXiv:1905.12008 (2019).
[29] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. Shamma, M. Bernstein, F.F. Li, Visual Genome: connecting language and vision using crowdsourced dense image annotations, Int J Comput Vis 123 (2016).
[30] W.S. Lasecki, P. Thiha, Y. Zhong, E. Brady, J.P. Bigham, Answering visual questions with conversational crowd assistants, in: Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, ASSETS '13, Association for Computing Machinery, New York, NY, USA, 2013.
[31] J.J. Lau, S. Gayen, A.B. Abacha, D. Demner-Fushman, A dataset of clinically generated visual questions and answers about radiology images, Sci Data 5 (1) (2018) 1–10.
[32] D. Li, Z. Zhang, X. Chen, K. Huang, A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios, IEEE Trans. Image Process. 28 (4) (2019) 1575–1590.
[33] D. Li, Z. Zhang, K. Yu, K. Huang, T. Tan, ISEE: an intelligent scene exploration and evaluation platform for large-scale visual surveillance, IEEE Trans. Parallel Distrib. Syst. 30 (12) (2019) 2743–2758.
[34] F. Liu, Y. Peng, M.P. Rosen, An effective deep transfer learning and information fusion framework for medical visual question answering, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2019, pp. 238–247.
[35] F. Liu, T. Xiang, T.M. Hospedales, W. Yang, C. Sun, Inverse visual question answering: a new benchmark and VQA diagnosis tool, IEEE Trans Pattern Anal Mach Intell 42 (2) (2020) 460–474.
[36] A. Lubna, S. Kalady, A. Lijiya, MoBVQA: A modality based medical image visual question answering system, in: TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON), IEEE, 2019, pp. 727–732.
[37] S. Manmadhan, B.C. Kovoor, Visual question answering: a state-of-the-art review, Artif Intell Rev (2020) 1–41.
[38] C. Muñoz, D. Arellano, F.J. Perales, G. Fontanet, Perceptual and intelligent domotic system for disabled people, in: Proceedings of the 6th IASTED International Conference on Visualization, Imaging and Image Processing, 2006, pp. 70–75.
[39] B.D. Nguyen, T.-T. Do, B.X. Nguyen, T. Do, E. Tjiputra, Q.D. Tran, Overcoming data limitation in medical visual question answering, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 522–530.
[40] K.-W. Park, J. Lee, S. Kwon, J.-W. Ha, K.-M. Kim, B.-T. Zhang, Which ads to show? Advertisement image assessment with auxiliary information via multi-step modality fusion, arXiv preprint arXiv:1910.02358 (2019).
[41] Y. Peng, F. Liu, M.P. Rosen, UMass at ImageCLEF medical visual question answering (Med-VQA) 2018 task, CLEF (Working Notes), 2018.
[42] F. Ren, Y. Zhou, CGMVQA: a new classification and generative model for medical visual question answering, IEEE Access 8 (2020) 50626–50636.
[43] M.R. Ronchi, P. Perona, Describing common human visual actions in images, in: M.W.J. Xianghua Xie, G.K.L. Tam (Eds.), Proceedings of the British Machine Vision Conference (BMVC), BMVA Press, 2015, pp. 52.1–52.12.
[44] S. Sampat, Y. Yang, C. Baral, Diverse visuo-lingustic question answering (DVLQA) challenge, arXiv preprint arXiv:2005.00330 (2020).
[45] S. Sheng, L.V. Gool, M.-F. Moens, A dataset for multimodal question answering in the cultural heritage domain, LT4DH@COLING, 2016.
[46] L. Shi, F. Liu, M.P. Rosen, Deep multimodal learning for medical visual question answering, Working Notes of CLEF (2019).
[47] M. Stefanini, M. Cornia, L. Baraldi, M. Corsini, R. Cucchiara, Artpedia: A new visual-semantic dataset with visual and contextual sentences in the artistic domain, in: E. Ricci, S. Rota Bulò, C. Snoek, O. Lanz, S. Messelodi, N. Sebe (Eds.), ICIAP 2019, Springer International Publishing, Cham, 2019, pp. 729–740.
[48] D. Teney, A.v.d. Hengel, Actively seeking and learning from live data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1940–1949.
[49] A.S. Toor, H. Wechsler, M. Nappi, Biometric surveillance using visual question answering, Pattern Recognit Lett 126 (2019) 111–118 (Robustness, Security and Regulation Aspects in Current Biometric Systems).
[50] A.S. Toor, H. Wechsler, M. Nappi, Question action relevance and editing for visual question answering, Multimed Tools Appl 78 (3) (2019) 2921–2935.
[51] K. Tu, M. Meng, M.W. Lee, T.E. Choe, S. Zhu, Joint video and text parsing for understanding events and answering queries, IEEE Multimedia 21 (2) (2014) 42–70.
[52] M. Vu, R. Sznitman, T. Nyholm, T. Löfstedt, Ensemble of streamlined bilinear visual question answering models for the ImageCLEF 2019 challenge in the medical domain, CLEF 2019, volume 2380, 2019.
[53] M.H. Vu, T. Löfstedt, T. Nyholm, R. Sznitman, A question-centric model for visual question answering in medical imaging, IEEE Trans Med Imaging (2020).
[54] W. Wang, Y. Huang, L. Wang, Long video question answering: a matching-guided attention model, Pattern Recognit 102 (2020) 107248.
[55] M. Weiss, S. Chamorro, R. Girgis, M. Luck, S.E. Kahou, J.P. Cohen, D. Nowrouzezahrai, D. Precup, F. Golemo, C. Pal, Navigation agents for the visually impaired: a sidewalk simulator and experiments, arXiv preprint arXiv:1910.13249 (2019).
[56] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, A. van den Hengel, Visual question answering: a survey of methods and datasets, Comput. Vision Image Understanding 163 (2017) 21–40.
[57] X. Yan, L. Li, C. Xie, J. Xiao, L. Gu, Zhejiang University at ImageCLEF 2019 visual question answering in the medical domain, Working Notes of CLEF (2019).
[58] D. Zhang, R. Cao, S. Wu, Information fusion in visual question answering: a survey, Information Fusion 52 (2019) 268–280.
[59] Y. Zhou, X. Kang, F. Ren, Employing Inception-ResNet-v2 and Bi-LSTM for medical domain visual question answering, CLEF (Working Notes), 2018.
[60] Y. Zhou, S. Mishra, M. Verma, N. Bhamidipati, W. Wang, Recommending themes for ad creative design via visual-linguistic representations, in: Proceedings of The Web Conference 2020, 2020, pp. 2521–2527.

