
Medical Visual Question Answering: A Survey

Zhihong Lin (a), Donghao Zhang (b), Qingyi Tao (c), Danli Shi (d), Gholamreza Haffari (e), Qi Wu (f),
Mingguang He (g) and Zongyuan Ge (b,h,i,∗)
(a) Faculty of Engineering, Monash University, Clayton, VIC 3800, Australia
(b) eResearch Center, Monash University, Clayton, VIC 3800, Australia
(c) NVIDIA AI Technology Center, 038988, Singapore
(d) State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou, 510060, China
(e) Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
(f) Australian Centre for Robotic Vision, The University of Adelaide, Adelaide, SA 5005, Australia
(g) Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC 3002, Australia
(h) Airdoc Research, Melbourne, VIC 3000, Australia
(i) Monash-NVIDIA AI Tech Centre, Melbourne, VIC 3000, Australia

ARTICLE INFO

Keywords:
Visual Question Answering
Medical Image Interpretation
Computer Vision
Natural Language Processing

ABSTRACT

Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and popular VQA challenges. Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer. Although general-domain VQA has been extensively studied, medical VQA still needs specific investigation and exploration due to its task features. In the first part of this survey, we collect and discuss the publicly available medical VQA datasets to date, covering their data sources, data quantities, and task features. In the second part, we review the approaches used in medical VQA tasks, summarizing and discussing their techniques, innovations, and potential improvements. In the last part, we analyze some medical-specific challenges for the field and discuss future research directions. Our goal is to provide comprehensive information for researchers interested in medical artificial intelligence.

∗ Corresponding author
Email addresses: zhihong.lin@monash.edu (Z. Lin); donghao.zhang@monash.edu (D. Zhang); qtao002@e.ntu.edu.sg (Q. Tao); shidli@mail2.sysu.edu.cn (D. Shi); gholamreza.haffari@monash.edu (G. Haffari); qi.wu01@adelaide.edu.au (Q. Wu); mingguang.he@unimelb.edu.au (M. He); zongyuan.ge@monash.edu (Z. Ge)
ORCID(s): 0000-0001-8499-4097 (Z. Lin)

1. Introduction

Visual Question Answering (VQA) [10] is a multidisciplinary problem that incorporates computer vision (CV) and natural language processing (NLP). The VQA system is expected to answer an image-related question according to the image content. Inspired by VQA research in the general domain, the recent exploration of medical VQA has attracted great interest. The medical VQA system is expected to assist in clinical decision making and improve patient engagement [31, 44]. Unlike other medical AI applications, which are often restricted to pre-defined diseases or organ types, medical VQA can understand free-form questions in natural language and provide reliable and user-friendly answers.

In recent research, medical VQA has been assigned several “jobs”. The first one is the diagnostic radiologist, who acts as an expert consultant to the referring physician. A workload study [58] shows that the average radiologist has to interpret one CT or MRI image in 3 to 4 seconds. Besides the long queue of imaging studies, a radiologist must also answer an average of 27 phone calls per day from physicians and patients [20], leading to further inefficiencies and disruptions in the workflow. A medical VQA system can potentially answer the physician’s questions, helping to relieve the burden on the healthcare system and improve medical professionals’ efficiency. Another application matching the advantages of VQA is to act as a pathologist who examines body tissues and helps other healthcare providers make diagnoses [33].

In addition to the health professional role, the medical VQA system can also serve as a knowledgeable assistant. For example, the “second opinion” from the VQA system can support the clinicians’ opinion in interpreting medical images and decrease the risk of misdiagnosis at the same time [79].

Ultimately, a mature and complete medical VQA system can directly review patients’ images and answer any kind of question. In some situations, such as a fully automated health examination where medical professionals may not be available, a VQA system can provide equivalent consultation. After a hospital visit, patients often search for further information online; the irregular and misleading information from a search engine might result in inappropriate answers. Alternatively, a medical VQA system can be integrated into an online consultation system to provide reliable answers anytime and anywhere.

Medical VQA is technically more challenging than general-domain VQA because of the following factors. Firstly, creating a large-scale medical VQA dataset is challenging because expert annotation is expensive due to its high requirement of professional knowledge, and QA pairs cannot be synthetically generated directly from images. Secondly, answering questions according to a medical image also demands a specific design of the VQA model. The task needs to focus on a fine-grained scale because a lesion can be microscopic; hence, segmentation techniques may be required to locate the region of interest precisely. Finally, a question can be very professional, which requires the model to be trained with a medical knowledge base rather than a general language database.
Since the first medical VQA challenge was organized in 2018 [31], an increasing number of organizations and researchers have joined to expand the tasks and propose new datasets and approaches, which has made medical VQA an active and inspiring field. To provide a comprehensive retrospective of these efforts, we conduct the first survey (to our best knowledge) of medical VQA.

In the first part of this survey, we overview the publicly available medical VQA datasets to date. To collect the most complete information, we search for medical VQA datasets and collect a total of 10 papers. As two datasets have multiple editions, 8 datasets are consequently included in this survey. Among them, three datasets are proposed as ImageCLEF (https://www.imageclef.org/) competitions. The selected datasets are diverse in image modality and question categories. The imaging modalities of those datasets cover chest X-ray, CT, MRI, and pathology. The questions include close-ended questions (such as Yes/No questions) and open-ended questions on a variety of topics. We also compare the data sources, the question-answer pair creation methods, and the metrics for evaluation. These will be inspiring for researchers who are interested in designing new tasks.

In the second part of this survey, we review the published approaches to medical VQA. We gather the work notes describing the approaches used in the ImageCLEF VQA-Med competitions and collect 32 papers in total. However, the work notes are mainly rapid solutions because of the time-limited situation. We also search for technical papers from conferences or journals that aim at the current pain points, and we collect 13 papers. The papers are published not only in community conferences such as Medical Image Computing and Computer Assisted Intervention (MICCAI) and Association for Computing Machinery Multimedia (ACM-MM) but also in influential journals such as the IEEE Transactions on Medical Imaging. By reviewing those papers, we find that the current approaches mostly follow a framework of four components: image encoder, language encoder, feature fusion, and answering. We discuss the research in seven parts covering the framework, the four components, other techniques, and an overall discussion. The review of existing approaches will help researchers to identify the key problems in previous research and the potential hypotheses for future research.

Finally, we discuss four medical-specific challenges for the field. Medical-domain VQA is a more application-oriented problem compared with general-domain VQA. It has a real application scenario that produces practical challenges. In this work, we analyze the needs at the application end and raise four significant challenges: question diversity, extra medical information, interpretability, and the integration into the medical workflow. The proposed challenges will inspire researchers and eventually promote medical VQA to the clinical scenario.

2. Datasets and performance metrics

2.1. Datasets
To the best of our knowledge, there are 8 publicly available medical VQA datasets to date: VQA-MED-2018 [31], VQA-RAD [45], VQA-MED-2019 [14], RadVisDial [44], PathVQA [33], VQA-MED-2020 [13], SLAKE [53], and VQA-MED-2021 [15] (in chronological order). The dataset details are summarized in Table 1. In the following paragraphs, we provide an overview of the QA pair collection.

2.1.1. VQA-Med-2018
VQA-Med-2018 [31] is a dataset proposed in ImageCLEF 2018 (https://www.imageclef.org/2018), and it is the first publicly available dataset in the medical domain. The QA pairs were generated from captions by a semi-automatic approach. First, a rule-based question generation (QG) system (http://www.cs.cmu.edu/~ark/mheilman/questions/) automatically generated possible QA pairs by sentence simplification, answer phrase identification, question generation, and candidate question ranking. Then, two expert human annotators (including one expert in clinical medicine) manually checked all generated QA pairs in two passes: one pass ensures semantic correctness, and the other ensures clinical relevance to the associated medical images.

2.1.2. VQA-RAD
VQA-RAD [45] is a radiology-specific dataset proposed in 2018. The image set is a balanced one containing samples of the head, chest, and abdomen from MedPix (https://medpix.nlm.nih.gov/home). To investigate the questions asked in a realistic scene, the authors presented the images to clinicians to collect unguided questions. The clinicians were required to produce questions in both free-form and template structures. Afterward, the QA pairs were validated and classified manually to analyze the clinical focus. The answer types are either close-ended or open-ended. Although not large in quantity, the VQA-RAD dataset has acquired essential information about what a medical VQA system should be able to answer as an AI radiologist.

2.1.3. VQA-Med-2019
VQA-Med-2019 [14] is the second edition of VQA-Med and was published during the ImageCLEF 2019 challenge. Inspired by VQA-RAD [45], VQA-Med-2019 addresses the four most frequent question categories: modality, plane, organ system, and abnormality. For each category, the questions follow the patterns of hundreds of questions naturally asked and validated in VQA-RAD [45]. The first three categories (modality, plane, and organ system) can be tackled as classification tasks, while the fourth category (abnormality) presents an answer generation problem.


Table 1
Overview of the medical VQA datasets and their main characteristics. The VQA 2.0 is a general-domain VQA dataset listed here for comparison. The datasets are presented in chronological order.

- VQA 2.0 [29] (for comparison): 204K images, 614K QA pairs. Source of images and content: Microsoft COCO [51]. QA creation: manually selected. Question categories: object, color, sport, count, etc.
- VQA-Med-2018 [31]: 2,866 images, 6,413 QA pairs. Source: PubMed Central articles. QA creation: synthetical. Question categories: location, finding, yes/no questions, other questions.
- VQA-RAD [45]: 315 images, 3,515 QA pairs. Source: MedPix database (head axial single-slice CTs or MRIs, chest X-rays, abdominal axial CTs). QA creation: natural. Question categories: modality, plane, organ system, abnormality, object/condition presence, positional reasoning, color, size, attribute other, counting, other.
- VQA-Med-2019 [14]: 4,200 images, 15,292 QA pairs. Source: MedPix database (various in 36 modalities, 16 planes, and 10 organ systems). QA creation: synthetical. Question categories: modality, plane, organ system, abnormality.
- RadVisDial [44] (silver-standard): 91,060 images, 455,300 QA pairs. Source: MIMIC-CXR [37] (chest X-ray posterior-anterior (PA) view). QA creation: synthetical. Question category: abnormality.
- RadVisDial [44] (gold-standard): 100 images, 500 QA pairs. Source: MIMIC-CXR [37] (chest X-ray posterior-anterior (PA) view). QA creation: natural. Question category: abnormality.
- PathVQA [33]: 4,998 images, 32,799 QA pairs. Source: electronic pathology textbooks, PEIR Digital Library. QA creation: synthetical. Question categories: color, location, appearance, shape, etc.
- VQA-Med-2020 [13]: 5,000 images, 5,000 QA pairs. Source: MedPix database. QA creation: synthetical. Question category: abnormality.
- SLAKE [53]: 642 images, 14K QA pairs. Source: Medical Segmentation Decathlon [75], NIH Chest X-ray [88], CHAOS [40] (chest X-rays/CTs, abdomen CTs/MRIs, head CTs/MRIs, neck CTs, pelvic cavity CTs). QA creation: natural. Question categories: organ, position, knowledge graph, abnormality, modality, plane, quality, color, size, shape.
- VQA-Med-2021 [15]: 5,000 images, 5,000 QA pairs. Source: MedPix database. QA creation: synthetical. Question category: abnormality.

2.1.4. RadVisDial
RadVisDial [44] is the first publicly available dataset for visual dialog in radiology. A visual dialogue consists of multiple QA pairs and is considered a more practical and complicated task for a radiology AI system than VQA. The images are selected from MIMIC-CXR [37]. For each image, MIMIC-CXR provides a well-structured relevant report with annotations for 14 labels (13 abnormalities and one No Findings label). RadVisDial consists of two datasets: a silver-standard dataset and a gold-standard dataset. In the silver-standard group, the dialogues are synthetically created using the plain-text reports associated with each image. Each dialogue contains five questions randomly sampled from 13 possible questions; the corresponding answer is automatically extracted from the source data and limited to four choices (yes, no, maybe, not mentioned in the report). In the gold-standard group, the dialogues are collected from two expert radiologists’ conversations following detailed annotation guidelines to ensure consistency. Only 100 random images are labeled with the gold standard. The RadVisDial dataset explores a real-world task for AI in the medical domain. Moreover, the team compared the synthetic dialogues to the real-world dialogues and conducted experiments to reflect the importance of context information: introducing the medical history of the patient led to better accuracy.

2.1.5. PathVQA
PathVQA [33] is a dataset exploring VQA for pathology. The images with captions are extracted from digital resources (electronic textbooks and online libraries). The authors developed a semi-automated pipeline to convert the captions into QA pairs, and the generated QA pairs are manually checked and revised. The questions can be divided into seven categories: what, where, when, whose, how, how much/how many, and yes/no. The open-ended questions account for 50.2% of all questions. For the close-ended “yes/no” questions, the answers are balanced with 8,145 “yes” and 8,189 “no”. The questions are designed according to the pathologist certification examination of the American Board of Pathology (ABP); therefore, it is an exam to verify the “AI Pathologist” in decision support. The PathVQA dataset demonstrates that medical VQA can be applied to various scenes.


Table 2
Samples of question-answer pairs from the mentioned datasets (images omitted). Q = Question, A = Answer. The datasets are presented in chronological order.

- VQA-Med-2018 [31]: Q: What does the ct scan of thorax show? A: bilateral multiple pulmonary nodules. | Q: Is the lesion associated with a mass effect? A: no.
- VQA-RAD [45]: (Organ System) Q: What is the organ system? A: Gastrointestinal. | (Object/Condition Presence) Q: Is there gastric fullness? A: yes. | (Positional) Q: What is the location of the mass? A: head of the pancreas.
- VQA-Med-2019 [14]: (Modality) Q: what imaging method was used? A: us-d - doppler ultrasound. | (Plane) Q: which plane is the image shown in? A: axial.
- RadVisDial [44]: Q: Airspace opacity? A: Yes. | Q: Lung lesion? A: No. | Q: Fracture? A: Not in report. | Q: Pneumonia? A: Yes.
- PathVQA [33]: Q: What have been stripped from the bottom half of each specimen to show the surface of the brain? A: meninges. | Q: Is remote kidney infarct replaced by a large fibrotic scar? A: yes.
- VQA-Med-2020 [13]: Q: what abnormality is seen in the image? A: ovarian torsion. | Q: what is abnormal in the ct scan? A: partial anomalous pulmonary venous return.
- SLAKE [53]: Q: Does the image contain left lung? A: Yes. | Q: What is the function of the rightmost organ in this picture? A: Breathe.
- VQA-Med-2021 [15]: Q: What is most alarming about this mri? A: focal nodular hyperplasia. | Q: What abnormality is seen in the image? A: Enhancing lesion right parietal lobe with surrounding edema.


2.1.6. VQA-Med-2020
VQA-Med-2020 [13] is the third edition of VQA-Med and was published in the ImageCLEF 2020 challenge. The images are selected with the limitation that the diagnosis was made according to the image content. The questions specifically address abnormalities. A list of 330 abnormality problems is selected, and each problem needs to occur at least ten times in the dataset. The QA pairs are generated from created question patterns.

In VQA-Med-2020, the visual question generation (VQG) task is first introduced to the medical domain. The VQG task is to generate natural language questions relevant to the image content. The medical VQG dataset includes 1,001 radiology images and 2,400 associated questions. The ground-truth questions are generated with a rule-based approach according to the image captions and then manually revised.

2.1.7. SLAKE
SLAKE [53] is a comprehensive dataset with both semantic labels and a structured medical knowledge base. The images are selected from three open-source datasets [75, 88, 40] and annotated by experienced physicians. The semantic labels for images provide masks (segmentation) and bounding boxes (detection) for visual objects. The medical knowledge base is provided in the form of a knowledge graph. The knowledge graph is extracted from OwnThink and manually reviewed. The entries are in the form of triplets (e.g., <Heart, Function, Promote blood flow>). The dataset contains 2,603 triplets in English and 2,629 triplets in Chinese. The introduction of a knowledge graph allows external knowledge-based questions such as organ function and disease prevention. The questions are collected from experienced doctors by selecting from pre-defined questions or rewriting questions. Then the questions are categorized by their types and balanced to avoid bias.

2.1.8. VQA-Med-2021
VQA-Med-2021 [15] was published in the ImageCLEF 2021 challenge. VQA-Med-2021 is created under the same principles as VQA-Med-2020. The training set is the same dataset as used in VQA-Med-2020, while the validation set and test set are new and manually reviewed by medical doctors.

2.1.9. Discussion
In the above sections, we list the 8 medical VQA datasets along with their quantity, data source, QA creation, and question categories. The image quantity ranges from 315 to 91,060, while the QA pair quantity ranges from 1 QA pair per image to 10 QA pairs per image. The imaging modalities include chest X-ray, CT, MRI, and pathology. However, the quantity is significantly smaller than the general VQA dataset VQA 2.0. The QA creation can be natural or synthetic. The number of question categories ranges from 1 to 11.

By comparing the datasets, we find that the nature of the data source is the key factor behind the variation. One type of data source is images with a description such as a caption or medical report. VQA-Med-2018 is the first exploration of a medical VQA dataset; it utilizes images in articles so that the images have a corresponding textual description. With automatic transformation and manual checking, the textual descriptions are converted from declarative sentences to question-answer pairs. PathVQA uses images and text from textbooks and a digital library. For these two datasets, the key problem is how to perform an accurate transformation.

Another type of data source is images with categorical attributes. VQA-RAD, VQA-Med-2019, VQA-Med-2020, and VQA-Med-2021 are all sourced from the MedPix database, which provides images with attributes. Therefore, the key problem becomes how to acquire better questions given images and answers. VQA-RAD starts from manual creation and collects unguided questions from clinicians. The question patterns collected from VQA-RAD are leveraged in the construction of VQA-Med-2019, VQA-Med-2020, and VQA-Med-2021.

Notably, RadVisDial uses the MIMIC-CXR dataset, which has both medical reports and abnormality labels (extracted from the reports). However, its categories are limited to abnormality presence, and the dialogues are built with 5 separate QA pairs about different abnormalities rather than context-based and iterative QA pairs. The information contained in the reports could be further digested to create better dialogues. Another special dataset is SLAKE. It provides more modalities, including segmentation, detection, and a knowledge graph. This feature can increase the complexity of tasks and allow more question categories. It also raises new problems for approach researchers, as more modalities expand the method complexity. Although manual annotation limits its quantity, the gold-standard annotation is expected to benefit the community and future research.

As medical VQA research is still at an early stage, the current datasets only cover radiology and pathology as data subjects. There are more fields to explore, such as ophthalmology and dermatology, which are also popular in medical AI research and already have existing databases from which potential VQA tasks could be created. Besides dataset works, there is also exploration addressing data collection. The MVQAS [11] builds an online system providing self-collection and annotation tools to allow users to upload data and semi-automatically generate VQA triplets.

2.2. Performance Metrics
The performance metrics used in the proposed medical VQA tasks can be categorized into classification-based metrics and language-based metrics. The classification-based metrics are the general metrics used in classification tasks, such as accuracy and F1 score. They treat the answer as a classification result and calculate the exact-match accuracy, precision, recall, etc. All seven tasks in this paper use the classification-based metrics as part of their performance metrics. The language-based metrics are the general metrics for sentence evaluation tasks (e.g., translation, image captioning). The tasks using language-based metrics include VQA-Med-2018, VQA-Med-2019, PathVQA, VQA-Med-2020, and VQA-Med-2021. All of those five tasks use BLEU [61], which measures the similarity of phrases (n-grams) between two sentences. BLEU was originally a metric for machine translation. Besides the general metrics, there are also custom metrics designed for a task. For example, the WBSS (Word-based Semantic Similarity) and CBSS (Concept-based Semantic Similarity) are created in VQA-Med-2018 [31] as new language metrics. However, the metric is not a fixed component of the dataset and the approach; researchers can alternatively use more suitable metrics to evaluate their results. For example, some researchers [71] introduce the AUC-ROC (Area under the ROC Curve) as their classification metric to better evaluate the measure of separability.
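For illustration, the snippet below is a minimal sketch of the two metric families, assuming predictions and references are available as plain answer strings. The hand-rolled BLEU (clipped n-gram precision with a brevity penalty and simple smoothing) is only a compact stand-in for the official evaluation scripts used by the challenges.

```python
# Minimal sketch of the two metric families used in medical VQA evaluation.
# `predictions` and `references` are hypothetical lists of answer strings.
from collections import Counter
import math


def exact_match_accuracy(predictions, references):
    """Classification-style metric: a prediction counts only if it matches exactly."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)


def bleu(prediction, reference, max_n=4):
    """Language-style metric: clipped n-gram precision with a brevity penalty."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((pred_ngrams & ref_ngrams).values())
        total = max(sum(pred_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smoothed to avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(pred), 1)))
    return brevity * geo_mean


predictions = ["bilateral multiple pulmonary nodules", "no"]
references = ["bilateral pulmonary nodules", "no"]
print(exact_match_accuracy(predictions, references))
print(bleu(predictions[0], references[0]))
```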
3. Methods
To investigate the features of the approaches used in medical VQA tasks, we reviewed the published papers evaluated on the datasets. We search for the method papers with two strategies. For the ImageCLEF competitions, we use the corresponding overview papers [1, 14, 13, 15] to identify the participating teams and collect their work notes. This strategy helps us collect a total of 32 papers. Then we search Google Scholar with the citation search function to find 217 papers that cite medical VQA datasets, and collect a total of 12 papers. For a detailed count of the papers included and excluded at each stage, refer to Fig. 1. Finally, we select 45 published papers including 32 work notes and 13 conference/journal papers. The 45 papers describe 46 approaches, and their performance and characteristics are shown in Table 3. The following sub-sections discuss the medical VQA approaches by framework, components, other techniques, and an overall discussion.

3.1. Framework
Among the 46 approaches, 39 of them can be attributed to a common framework in the general VQA domain, joint embedding [10]. It was proposed as the baseline method for the VQA v1 dataset [10] and is referred to as “LSTM Q+I”. As illustrated in Fig. 2, the framework includes four components: an image encoder, a question encoder, a feature fusing algorithm, and an answering component according to the task requirement. Respectively, the image feature extractor can be a well-developed convolutional neural network (CNN) backbone such as VGG Net [74] or ResNet [32], and the question encoder can be a prevalent language encoding model such as LSTM [35] or the Transformer [82]. The feature encoding models are often initialized with pre-trained weights, and they can be either frozen or fine-tuned in an end-to-end manner during the training of the VQA task. The answering component is usually a neural network classifier or a recurrent neural network language generator. In LSTM Q+I, the question features and image features are fused via element-wise multiplication. Researchers have since developed innovative fusing algorithms and introduced the popular attention mechanism into the system to further increase performance.

Besides the joint embedding framework, the other 7 approaches choose to give up the question feature. They find that the semantic space of the questions is simple due to the nature of the data, and use only the image feature to produce the answer. This framework has outstanding performance in VQA-Med-2020 and VQA-Med-2021, where the questions are only about abnormality and only of the “Yes/No” or “What” types.

Compared to the general VQA domain, the architectures that appear in medical VQA research are less diverse. There are also other architectures in the general VQA domain, such as compositional models like the Neural Module Network [9]. However, no compositional model has been adopted in current medical VQA approaches, and there is potential to introduce other frameworks to medical VQA.
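The following is a minimal PyTorch-style sketch of this joint embedding framework, not the implementation of any reviewed approach: a CNN backbone with global average pooling as the image encoder, an LSTM as the question encoder, element-wise multiplication as the fusion step (as in LSTM Q+I), and a classifier over a fixed answer vocabulary. The vocabulary size, hidden size, and answer count are arbitrary placeholders, and in practice the backbone would be initialized with pre-trained weights.

```python
# Minimal PyTorch sketch of the joint-embedding VQA framework (image encoder +
# question encoder + fusion + answer classifier). Backbone choice, hidden sizes,
# and the answer vocabulary size are illustrative placeholders.
import torch
import torch.nn as nn
import torchvision


class JointEmbeddingVQA(nn.Module):
    def __init__(self, vocab_size=5000, num_answers=500, hidden=1024):
        super().__init__()
        # Image encoder: a CNN backbone; in practice it is usually initialised
        # with pre-trained weights and either frozen or fine-tuned end to end.
        backbone = torchvision.models.resnet34()
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep spatial feature maps
        self.img_proj = nn.Linear(512, hidden)
        # Question encoder: an LSTM over word embeddings (a Transformer/BERT
        # encoder is the common alternative in the reviewed works).
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        # Answering component: a classifier over a fixed answer vocabulary.
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, image, question_ids):
        feat_map = self.cnn(image)                      # (B, 512, H, W)
        img = feat_map.mean(dim=(2, 3))                 # global average pooling
        img = torch.tanh(self.img_proj(img))            # (B, hidden)
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = torch.tanh(h[-1])                           # (B, hidden)
        fused = img * q                                 # element-wise fusion ("LSTM Q+I" style)
        return self.classifier(fused)


model = JointEmbeddingVQA()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 500])
```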

3.2. Image Encoder
In terms of the image encoder in the public challenges, the VGG Net [74] is the most popular choice. As shown in Fig. 3, the participants using VGG Net as the image encoder represent a significant proportion in all three VQA-Med challenges. Meanwhile, a pre-trained image encoder is often used in both public challenges and conference/journal works.

Figure 1: The number of papers included/excluded in the literature review of conference/journal papers.

Figure 2: A schematic diagram of the mainstream medical VQA framework illustrating the main components: image feature extraction (common choices include ResNet, VGGNet, etc.), question feature extraction (common choices include BERT, LSTM/BiLSTM, GRU, etc.), feature fusion (common choices include concatenation, element-wise operation, bilinear pooling, attention mechanism, etc.), and the answering component (a classifier or a generator). In the illustrated example, the question “What does the CT scan of thorax show?” yields the answer “Bilateral multiple pulmonary nodules”.

According to Table 3, more than half of the teams (29 of 46) directly used a model pre-trained on ImageNet [67]. However, ImageNet has different content compared with medical VQA images. Using ImageNet pre-training is not an ideal practice, but it is a workable option when low data quantity and lack of labels both limit pre-training on medical datasets.

Finding better pre-training methods is a popular topic in medical VQA research as well as in medical AI research as a whole. The solutions include using other pre-trained models, using extra datasets for pre-training, contrastive learning, multi-task pre-training, and meta-learning. The LIST team [6] utilized an image encoder pre-trained on the CheXpert [88] dataset. The MMBERT [41] team uses an auxiliary dataset named ROCO [63] (images and captions) to perform pre-training in a token-masking manner. The CPRD [52] team introduces contrastive learning to conduct pre-training with unlabeled images in a self-supervised scheme. Self-supervised training has the advantage that it does not need image labels, which are expensive to acquire for medical images. The MTPT-CSMA [27] team uses an auxiliary dataset of segmentation tasks. On the other hand, researchers also try to further digest the original data. The MEVF [60] team proposed the Mixture of Enhanced Visual Features (MEVF) component, which utilizes Model-Agnostic Meta-Learning (MAML) and the Convolutional Denoising Auto-Encoder (CDAE) to initialize the model weights for the image encoder to overcome the data limitation in quantity.

Notably, the auxiliary datasets selected by the three teams are of different types: a captioning dataset, unlabeled images, and a segmentation dataset. This indicates that pre-training on various kinds of extra data can improve image encoder performance. Hence, exploring methods to pre-train the image encoder will be a prospective task. Another feature is that all image encoders in the reviewed papers are CNN classification models rather than detection ones. This restricts the application of detection-based methods such as Up-Down [8], which are popular in the general VQA domain.

3.3. Language Encoder
The language encoders in the reviewed works include LSTM [35] (18 of 46), Bi-LSTM [70] (5 of 46), GRU [19] (3 of 38), the Transformer [82] (including BERT [22] and BioBERT [46]) (12 of 46), and others (8 of 46). Notably, all of the top five teams use BERT in the VQA-Med-2019 challenge. As shown in Fig. 3, participating teams have been selecting BERT/BioBERT instead of LSTM/Bi-LSTM/GRU over time. This indicates the advantage of the Transformer model and BERT pre-training despite the corpus differences between the general domain and the medical domain. Meanwhile, the recurrent neural network (LSTM/Bi-LSTM/GRU) users are also adopting pre-trained word embedding components (12 of 26). This indicates that researchers tend to use a pre-trained model to process the questions. However, only the MMBERT team [41] conducts customized pre-training with a self-supervised method, the token-masking strategy, on extra data. Compared with the image encoder, the language encoder has not received much research attention. There is potential to bring more NLP pre-training methods or vision-plus-language pre-training methods to medical VQA.

Notably, 7 teams process the questions without a deep learning model but with keyword or template matching. The reason is that keyword or template matching has been powerful in their tasks. Hence, a light language encoder can be a practical choice in a medical VQA application, as the task may not have a large number of question categories.

3.4. Fusion Algorithm
The fusion stage gathers the extracted visual feature and language feature and then models the hidden relationship between them. It is the core component of VQA methods. The typical fusion algorithms include the attention mechanism and the pooling module.
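As a concrete illustration of these fusion options, the sketch below contrasts element-wise multiplication, concatenation, and a single-glimpse question-guided attention over image regions in the spirit of SAN. All dimensions and the linear scoring layer are illustrative assumptions rather than settings from any reviewed system.

```python
# Sketch of common fusion operators over an image feature v and question feature q,
# plus a single-glimpse question-guided attention over region features, in the
# spirit of stacked attention. Dimensions are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, R = 2, 1024, 49           # batch, feature dim, number of image regions
v = torch.randn(B, D)           # pooled image feature
q = torch.randn(B, D)           # question feature
regions = torch.randn(B, R, D)  # spatial region features (e.g. a 7x7 CNN grid)

# 1) Element-wise product and 2) concatenation: the simplest pooling choices.
fused_mul = v * q
fused_cat = torch.cat([v, q], dim=1)

# 3) Question-guided attention: score each region against the question, then
#    take an attention-weighted sum of regions as the attended image feature.
att_proj = nn.Linear(2 * D, 1)
scores = att_proj(torch.cat([regions, q.unsqueeze(1).expand(-1, R, -1)], dim=2))  # (B, R, 1)
weights = F.softmax(scores, dim=1)
attended = (weights * regions).sum(dim=1)   # (B, D)
fused_att = attended + q                    # SAN-style refined query

print(fused_mul.shape, fused_cat.shape, fused_att.shape)
```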



Table 3: The Results and Characteristics of Approaches in medical VQA Tasks.

Team/Method Image En- Pre-trained Language Pre-trained Attention Fusion Output Mode Other Technique(s) Language Classification
coder (Image) Encoder (Language) (Fu- Score(BLEU) Accuracy
sion)
VQA-MED-2018
Chakri* [7] VGG16 ImageNet GRU None No Element-wise Generation 0.188
multiplica- (GRU)
tion
UMMS [64] ResNet-152 ImageNet LSTM Pre-trained Yes MFB with Classification Embedding based topic 0.162
Word Em- Co-attention model
bedding



TU [99] Inception- ImageNet Bi-LSTM Not men- Yes Attention Classification Output rules 0.135
Resnet-v2 tioned mechanism
HQA* [30] Inception- ImageNet Bi-LSTM Pre-trained No Concatenation Classification Question segregation 0.132
Resnet-v2 Word Em-
bedding
NLM [1] VGG16 ImageNet LSTM None Yes SAN Classification 0.121
NLM ResNet-152 ImageNet LSTM None No MCB Classification Pre-training with extra data 0.085
JUST [77] VGGNet ImageNet LSTM Not men- No Concatenation Generation 0.061
tioned (LSTM)
FSTT [5] VGG16 ImageNet Bi-LSTM Not men- No Concatenation Classification Decision Tree Classifier 0.054
tioned
VQA-MED-2019
KEML* [97] VGG16 ImageNet Transformer BERT No BLOCK Classification Global Average Pooling, 0.912 0.938
Meta-learning,
Gated Graph Neural Net-
works,

Knowledge-Based Repre-
sentation Learning
MedFuseNet* [71]ResNet-152 ImageNet LSTM BERT Em- Yes MedFuseNet Classification/ Image Attention, 0.27 0.789
bedding Generation Image-Question Co- (Subset) (Subset)
(LSTM) Attention
MMBERT* [41] ResNet-152 ROCO[63] Transformer Yes Yes Multi-head Generation Pre-training with extra data 0.69 0.672
attention (Transformer)
(Trans-
former)
CGMVQA* [66] ResNet-152 ImageNet Transformer BERT Yes Concatenation Classification Global Average Pooling 0.659 0.64
Generation
(Transformer)
Hanlin [91] VGG16 ImageNet Transformer BERT Yes MFB with Classification Global Average Pooling 0.644 0.62
Co-attention

Team/Method Image Pre-trained Language Pre-trained Attention Fusion Output Mode Other Technique(s) Language Classification
Encoder (Image) Encoder (Language) (Fusion) Score Accuracy
(BLEU)
minhvu [83] ResNet-152 ImageNet Transformer BERT Yes MLB, MU- Classification Ensemble, 0.634 0.616
TAN with Skip-thought
attention
TUA1 [100] Inception- ImageNet Transformer BERT No Concatenation Classification Sub-models, 0.633 0.606
Resnet-v2 Generation Question classifier
(LSTM)
QC- ResNet-152 ImageNet Skip- Yes Yes QC-MLB Classification Question-centric fusion 0.603
MLB* [84] thought
vectors



UMMS [72] ResNet-152 ImageNet Bi-LSTM Pre-trained Yes MFH with Classification SVM Question classifier 0.593 0.566
Word Em- Co-attention
bedding
IBM Research VGG16 ImageNet LSTM Pre-trained Yes Attention Question classi- Image size encoder 0.582 0.558
AI [43] Word Em- mechanism fier
bedding Classification
LIST [6] DenseNet- CheXpert LSTM Pre-trained No Concatenation Generation 0.583 0.556
121 Word Em- (LSTM)
bedding
Turner.JCE [80] VGG19 Not men- LSTM Not men- No Concatenation Classification Sub-models, 0.572 0.536
tioned tioned Question classifier
JUST19 [4] VGG16 ImageNet None None No None Classification/ Sub-models, 0.591 0.534
Generation Question classifier
(LSTM)
Team_PwC_- ResNet-50 ImageNet LSTM Pre-trained Yes Attention Classification/ Sub-models, 0.534 0.488
Med [12] Word Em- mechanism Generation Question classifier
bedding (LSTM)

Techno [16] VGG16 Not men- LSTM Not men- No Concatenation Classification 0.486 0.462
tioned tioned
Gasmi* [26] EfficientNet ImageNet Bi-LSTM Not men- No Concatenation Classification Optimal parameter selec- 0.42 0.391
tioned tion
Dear Xception Not men- GRU Not men- Yes Attention Classification 0.393 0.21
stranger [55] tioned tioned mechanism
abhishek- VGG19 ImageNet LSTM Pre-trained No Element-wise Generation 0.462 0.16
thanki [78] DenseNet- Word Em- multiplica- (LSTM)
121 bedding tion
VQA-MED-2020
AIML [49] Ensemble Not men- None Not men- No None Classification Multi-scale and multi- 0.542 0.496
CNNs tioned tioned architecture ensemble
TheInception- VGG16 Not men- None None No None Classification Sub-models 0.511 0.48
Team [3] tioned

Team/Method Image Pre-trained Language Pre-trained Attention Fusion Output Mode Other Technique(s) Language Classification
Encoder (Image) Encoder (Language) (Fusion) Score Accuracy
(BLEU)
bumjun_- VGG16 ImageNet Transformer BioBERT Yes MFH with Classification Global Average Pooling 0.502 0.466
jung [38] Co-attention
HCP-MIC [18] BBN- Not men- Transformer BioBERT No None Classification 0.462 0.426
ResNeSt-50 tioned
NLM [68] ResNet-50 ImageNet None Not men- No None Classification 0.441 0.4
tioned
HARENDRA- VGG16 Not men- Transformer BERT Yes MFB Generation 0.439 0.378
KV [39] tioned (LSTM)
Shengyan [54] VGG16 ImageNet GRU Not men- No None Generation 0.412 0.376



tioned (GRU)
kdevqa [81] VGG16 Not men- Transformer BERT No GLU Classification 0.35 0.314
tioned
VQA-MED-2021
SYSU- ResNets, ImageNet None None No None Classification Hierarchically adaptive 0.416 0.382
HCP [28] VGGNet, and global average pooling,
plus HAGAP Model ensemble,
Mixup Augment,
Curriculum learning,
Label smoothing
Yunnan Uni- VGG16 ImageNet Transformer BioBERT Yes MFH with Classification Global Average Pooling 0.402 0.362
versity [90] Co-attention
TeamS [24] ResNeSt50 Not men- None None No None Classification Bilateral-Branch Networks 0.391 0.348
tioned
Lijie [47] VGG8 ImageNet Transformer BioBERT Yes MFH with Classification 0.352 0.316
Co-attention

IALab_PUC [69] DenseNet- ImageNet None None No None Classification 0.276 0.236
121
TAM [48] Modified Not men- LSTM GLOVE Yes MFB with a Classification 0.255 0.222
ResNet-34 tioned word em- co-attention
beddings
Sheerin [76] VGGNet ImageNet LSTM Not men- No Element-wise Generation(LSTM) 0.227 0.196
tioned multiplica-
tion
VQA-RAD
MTPT- Multi- Multi-task LSTM Pre-trained Yes CSMA Classification Cross-modal self-attention, 0.732
CMSA [27] ResNet-34 Word Em- Multi-task pre-training with
bedding extra data

Team/Method Image Pre-trained Language Pre-trained Attention Fusion Output Mode Other Technique(s) Language Classification
Encoder (Image) Encoder (Language) (Fusion) Score Accuracy
(BLEU)
CPRD [52] ResNet-8 Contrastive LSTM Pre-trained Yes BAN Classification Contrastive Learning, 0.727
Word Em- Knowledge Distillation
bedding
MMBERT [41] ResNet-152 ROCO Transformer Yes Yes Multi-head Generation Pretraining with extra data 0.72
attention (Transformer)
(Trans-
former)
QCR [96] MEVF None LSTM Not men- Yes BAN/SAN Classification Question-Conditioned Rea- 0.716
tioned soning,



Type-Conditioned Reason-
ing
MMQ [23] MMQ None LSTM Pre-trained Yes BAN/SAN Classification Multiple Meta-model 0.67
Word Em- Quantifying
bedding
MEVF [60] MEVF None LSTM Not men- Yes BAN/SAN Classification Model-Agnostic Meta- 0.439/0.751
tioned Learning, (0.627)
Convolutional Denoising
Auto-Encoder
HQS [30] Inception- ImageNet Bi-LSTM Pre-trained No Concatenation Classification Question Segregation 0.411
Resnet-v2 Word Em-
bedding
PathVQA
MedFuseNet [71] ResNet-152 ImageNet LSTM BERT Em- Yes MedFuseNet Classification/ Image Attention, 0.605 0.636
bedding Generation Image-Question Co- (Subset) (Subset)
(LSTM) Attention

MMQ [23] MMQ None LSTM Pre-trained Yes BAN/SAN Classification Multiple Meta-model 0.488
Word Em- Quantifying
bedding
SLAKE
CPRD [52] ResNet-8 Contrastive LSTM Pre-trained Yes BAN Classification Contrastive Learning, 0.821
Word Em- Pre-training with extra data,
bedding Knowledge Distillation
The “*” means the approach is not proposed during the public challenge period.


Figure 3: The encoders used in the VQA-Med challenges. (a) Composition of various network architectures as image encoders. (b) Composition of various network architectures as language encoders.

Attention mechanisms are widely used in vision-and-language tasks. Among the medical VQA approaches, 23 of 46 apply attention mechanisms in the fusion stage. Stacked Attention Networks (SAN) [92] is a typical attention algorithm. It uses the question feature as a query to rank the answer-related image regions. With a multiple-layer structure, the SAN can query an image several times to infer the answer progressively. The SAN introduced an iterative attention mechanism and is used as a baseline for many datasets. Besides the SAN, some other works employ attention mechanisms such as the Bilinear Attention Networks (BAN) [42] and the Hierarchical Question-Image Co-Attention (HieCoAtt) [56]. Notably, the popular multi-head attention methods (e.g., the Transformer [82], the Modular Co-Attention Network (MCAN) [93]) are seldom applied to medical VQA.

Multi-modal pooling is another important technique used for fusing visual and language features. The basic practices include concatenation, sum, and element-wise product. In the reviewed papers, direct concatenation is the most widely used fusion method (10 of 46) but shows average performance. As the outer product may be computationally expensive when the input vectors have high dimensionality, researchers have proposed more efficient pooling methods. Multi-modal Compact Bilinear (MCB) pooling [25] is a typical method: it aggregates visual and text features by embedding the image and text features into higher-dimensional vectors and then convolving them with multiplications in the Fourier space for efficiency. The attention mechanism can also be used in the pooling module. Within the multi-modal pooling family, there are other works such as Multi-modal Factorized High-Order (MFH) pooling [95] and Multi-modal Factorized Bilinear (MFB) pooling [94]. Pooling with attention is the second most adopted method (8 of 46). Both the winners of VQA-Med-2018 and VQA-Med-2019 used MFB pooling with attention solutions.

Among the 46 approaches, QC-MLB [84] and MedFuseNet [71] are the only two works that propose innovative fusion algorithms. QC-MLB uses a multi-glimpse attention mechanism to let the question feature select the proper image regions to answer. MedFuseNet uses two attention modules, Image Attention and Image-Question Co-Attention, to let the image feature and question feature interact twice. QC-MLB and MedFuseNet both show performance improvements. However, research on fusion schemes specific to medical VQA is still scarce.

3.5. Answering Component
Among the 46 reviewed approaches, 33 choose the classification mode for output, and 8 choose the generation mode. Some methods [4, 66, 12, 100, 71] use a switching strategy to adopt both classification and generation. The selection between the classification method and the generation method reflects whether the length distribution of ground-truth answers is dominated by single phrases. The classification mode has an advantage within a small answer space. However, it can become difficult if the answer candidates are longer and carry more complex information, such as lesion descriptions.

3.6. Other Techniques
Other than research on the basic components, some task-specific strategies are also applied to improve performance. The sub-task strategy divides the overall task into several sub-tasks and assigns branch models, as sketched below. In these cases, an extra task classification module is applied to select the corresponding model for a particular question category or type of image modality. It is often used [43, 4, 100, 80, 72, 49, 3, 18], especially in VQA-Med-2019 (where questions are only in four categories) and VQA-Med-2020 (where questions are all in one category but of two types). The reason is that the questions have few and distinct categories that can be divided easily. These multiple-model approaches may improve single-model effectiveness but introduce another risk: they may suffer from inappropriate model allocation due to erroneous results from the sub-task classification module. Another problem is that these approaches cannot be generalized to other tasks if the question categories are different.
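The sketch below is a schematic illustration of this sub-task routing under the assumption of a naive keyword-based question classifier; the category names, keyword rules, and `branch_models` interface are hypothetical placeholders, whereas the reviewed systems used their own trained classifiers or template matching for this step.

```python
# Schematic sketch of the sub-task strategy: a question classifier picks a
# category, and a dedicated branch model answers questions of that category.
# Keyword rules and branch names are hypothetical; substring matching is naive
# and only meant to show where routing errors (the noted risk) can creep in.

CATEGORY_KEYWORDS = {
    "modality": ("modality", "mri", "x-ray", "ultrasound", "imaging method"),
    "plane": ("plane",),
    "organ": ("organ",),
    "abnormality": ("abnormal", "wrong", "alarming"),
}


def classify_question(question: str) -> str:
    q = question.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            return category
    return "abnormality"  # fall-back branch


def answer(question: str, image, branch_models: dict) -> str:
    category = classify_question(question)
    # A misclassified category sends the question to the wrong branch, and the
    # routing rules do not transfer to datasets with different categories.
    return branch_models[category].predict(image, question)


print(classify_question("which plane is the image shown in?"))  # -> "plane"
print(classify_question("is there an airspace opacity?"))       # -> "abnormality" (fall-back)
```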


Another frequently applied technique is Global Average Pooling (4/38). Global Average Pooling [50] uses the average of the feature maps to replace the last fully connected layers. It is believed to produce a better image representation. Other techniques include the embedding-based topic model, question-conditioned reasoning, and an image size encoder.

3.7. Overall
In the method survey, we have reviewed 45 papers on medical VQA approaches, including 32 challenge work notes and 13 conference/journal papers. Although medical VQA research only started in 2018, various methods have already been proposed and explored. Since the challenges only allow restricted time for problem-solving, the work notes tend to make direct application of well-verified deep learning models. They have a high proportion of pre-trained components. On the other hand, they also develop some task-oriented techniques according to the nature of the data. Notably, two teams [18, 24] observed the long-tailed distribution in their tasks and introduced the Bilateral-Branch Network (BBN) [98] to deal with the imbalanced data. Although the challenge teams do not have much time to make innovations, their observations and strategies are still valuable and inspiring.

The conference/journal papers play the role of explorers of problems and solutions. According to the 13 conference/journal papers, the current research is more focused on the image encoder than on the other components. Researchers introduce popular ideas from the CV field, such as meta-learning and contrastive learning, to enhance the image encoder. Pre-training is a common topic in both image encoder and language encoder improvement. Auxiliary data is a simple but efficient solution.

Another point that appears in the technical papers is generality. Among the 13 papers, only 5 are aware of showing their generality and evaluate their approaches on multiple datasets. And only one team [84] evaluates their approach not only on a medical VQA dataset but also on general VQA datasets. It should become a future standard that an approach is evaluated on multiple datasets. Interpretability is also a demand in medical AI research. Among the 13 papers, 5 have illustrated model visualization with techniques such as GradCAM and attention visualization. However, the evaluation of evidence has not appeared despite there being an available dataset (SLAKE) to support it.

4. Challenge and Future Works
After reviewing medical VQA datasets and approaches, we identify some existing problems. Compared to general-domain VQA, the specific requirements and practical implementation scenarios also lead to challenges. Besides, the works on general-domain VQA provide some inspiration for future research directions.

4.1. Question Diversity
Question diversity is one of the most significant challenges of medical VQA. The VQA-RAD [45] investigates the natural questions in clinical conversation. The questions can be categorized into modality, plane, organ system, abnormality, object/condition presence, positional reasoning, color, size, attribute other, counting, and other. In other datasets such as VQA-Med-2019 [14] and PathVQA [33], the question categories tend to be less diverse than in the VQA-RAD dataset. In RadVisDial [44], VQA-Med-2020 [13], and VQA-Med-2021 [15], the question category is reduced to abnormality presence only. However, most questions about an abnormality in existing datasets are about presence, without further inquiry such as the location of tumors or tumor size. Therefore, questions remain to be added to diversify the medical VQA datasets.

To identify prospective question categories, we should research the following future directions. First, the data source can be expanded. The synthesized QA pairs in current medical VQA datasets are usually created from image captions and medical reports. However, the information in captions and reports is restricted to specific topics. Therefore, more data sources, such as textbooks, should be considered to provide a larger textual corpus. Second, revisiting the natural data already collected, such as VQA-RAD [45] and SLAKE [53], will help us better understand real-world conversations. Third, VQA is restricted by the concept that the question must be relevant to the image content. However, in a realistic clinical scene, the conversation content often exceeds the presented image content, for example, explaining the future risks of an abnormality or predicting disease progression. Overall, a VQA model trained on a medical VQA dataset with various question categories will demonstrate better generalization capability and be better suited to realistic clinical practice scenes.

4.2. Integrating the Extra Medical Information
Another challenge for medical VQA is to integrate extra information into the inference procedure. For example, Kovaleva et al. found that incorporating the medical history of the patient leads to better performance in answering questions [44].

4.2.1. EHR
An electronic health record (EHR) contains a patient’s medical history, diagnoses, medications, treatment plans, etc. In the medical AI domain, the EHR has been proven useful for disease prediction [73]. Hence it should also be helpful in medical VQA. To support the research of EHR in medical VQA, a new dataset containing original or synthetic EHRs is required. Furthermore, the VQA model should also be modified, because the EHR contains many numeric variables that should be treated as numbers rather than natural language. As far as we know, no previous research has investigated combining EHR and CV. Using numeric variables as VQA input, and the corresponding fusion algorithm, have also not been researched. Therefore, encoding EHR in VQA is meaningful and worth studying.
in medical VQA. To support the research of EHR in medical


4.2.2. Multiple Images
In the medical scene, especially radiology, it is common that a study is based on multiple images rather than a single image. Multiple images can contain different scan planes and sequential slices to support the decision-making of medical professionals. For example, in MIMIC-CXR [37], a radiology report is often associated with two images, a postero-anterior view and a lateral view. However, in RadVisDial [44], the authors keep only the postero-anterior view, although the abandoned images are also informative. The reason may be that common VQA or VisDial models only support single-image input. Therefore, VQA datasets with multiple input images are required to support the model research. Future research should also be devoted to developing VQA models that take multiple images as input.

4.3. Interpretability and Reliability
Interpretability is a long-standing problem of deep learning. As VQA models are commonly based on deep learning methods, medical VQA has to deal with interpretability. For a medical VQA system, interpretability determines the reliability of the predicted answer, and it is more important than in the general domain because a wrong decision may lead to catastrophic consequences. General-domain VQA researchers have addressed this problem and investigated several directions to evaluate the inference ability of a model.

4.3.1. Unimodal Bias of VQA Models
In the VQA field, unimodal bias means that the model may answer the question based on statistical regularities from one modality without considering the other modality. In particular, the bias towards the language input is also called language prior. Researchers have noticed this bias since the first VQA dataset was proposed: they tested the question-only model “LSTM Q”, which has only a slightly lower performance compared with the standard model “LSTM Q+I” [10], which indicates the language prior. To better measure the effect of the language prior problem of VQA models, Agrawal et al. presented new splits of the VQA v1 and VQA v2 datasets with changing priors (respectively VQA-CP v1 and VQA-CP v2). They also proposed a Grounded Visual Question Answering model (GVQA) to prevent the model from “cheating” by primarily relying on priors [2]. Alternatively, Ramakrishnan et al. proposed a novel regularization scheme that poses training as an adversarial game between the VQA model and a question-only model to discourage the VQA model from capturing language biases [65]. Cadene et al. proposed RUBi to reduce biases in VQA models. It minimizes the importance of the most biased examples and implicitly forces the VQA model to use the two input modalities instead of relying on statistical regularities between the question and the answer [17].

The unimodal bias in general-domain VQA has been discussed in many aspects, such as benchmarks and training schemes. These provide a good starting point for further work in medical VQA. Benchmarks similar to VQA-CP are required to investigate the unimodal bias problem of models. As the quantity of medical VQA datasets is not as large as general-domain VQA datasets, the solution for unimodal bias also needs research.
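One inexpensive way to probe this language prior on a medical VQA dataset is a question-only baseline. The sketch below predicts, for every test question, the most frequent training answer observed for that question string while ignoring the image entirely; this is a simpler proxy than the LSTM Q model mentioned above, and the 'question'/'answer' field names are hypothetical.

```python
# Sketch of a question-only baseline for probing language prior: predict, for
# each question, the most frequent training answer seen for that question text,
# ignoring the image entirely. A large gap between this baseline and the full
# model suggests the model is not merely exploiting question statistics.
from collections import Counter, defaultdict


def fit_question_only_baseline(train_samples):
    """train_samples: iterable of dicts with hypothetical 'question'/'answer' keys."""
    answers_per_question = defaultdict(Counter)
    global_counter = Counter()
    for s in train_samples:
        answers_per_question[s["question"].lower()][s["answer"]] += 1
        global_counter[s["answer"]] += 1
    fallback = global_counter.most_common(1)[0][0]
    return {q: c.most_common(1)[0][0] for q, c in answers_per_question.items()}, fallback


def question_only_accuracy(test_samples, lookup, fallback):
    hits = sum(lookup.get(s["question"].lower(), fallback) == s["answer"] for s in test_samples)
    return hits / len(test_samples)


train = [{"question": "Is there gastric fullness?", "answer": "yes"},
         {"question": "Is there gastric fullness?", "answer": "no"},
         {"question": "Is there gastric fullness?", "answer": "yes"}]
test = [{"question": "Is there gastric fullness?", "answer": "no"}]
lookup, fallback = fit_question_only_baseline(train)
print(question_only_accuracy(test, lookup, fallback))  # 0.0 for this toy split
```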
4.3.2. External Knowledge
In general-domain VQA, “adult-level common sense” is required to support the cognition and inference in question answering [89]. One approach is incorporating structured knowledge bases into the VQA datasets. Wang et al. proposed an approach named “Ahab” to provide explicit knowledge-base reasoning [85]. They also provide a dataset, KB-VQA, and a protocol to evaluate knowledge-based methods. Furthermore, Wang et al. proposed the FVQA dataset, providing a supporting fact that is critical for answering each visual question [86]. On the other hand, to test VQA methods’ ability to retrieve relevant facts without a structured knowledge base, Marino et al. proposed the OK-VQA dataset, including only questions that require external knowledge resources. They also proposed a knowledge-based baseline named “ArticleNet”, which retrieves articles from Wikipedia for each question-image pair and then trains a network to find the answer in the retrieved articles.

The challenge in medical VQA is that the knowledge requirement is much higher than “adult-level common sense”. Future research includes identifying questions that require external knowledge, exploiting existing medical knowledge bases such as [59], and building a medical VQA dataset with a structured or unstructured knowledge base. Additional studies are also required to determine whether medical knowledge-based VQA needs specific approaches.

4.3.3. Evidence Verification
For methods with an attention mechanism, an attention map can be visualized to show the image regions that lead to the answer [92]. By comparing the attention map with annotated human visual attention, researchers can verify whether the VQA methods use the correct evidence to get the answer. Evidence verification is also required in medical VQA. Visualizing the evidence of the abnormality will gain the trust of medical professionals and patients. Meanwhile, it will allow professional users to evaluate and validate the answer. In general-domain VQA, Das et al. proposed the VQA-HAT dataset containing annotated human visual attention to evaluate the attention maps both qualitatively (via visualizations) and quantitatively (via rank-order correlation) [21].

4.3.2. External Knowledge
Some questions cannot be answered from the image content alone and require external knowledge. In the general domain, Marino et al. proposed the OK-VQA dataset, including only questions that require external knowledge resources [57]. They also proposed a knowledge-based baseline named “ArticleNet”, which retrieved articles from Wikipedia for each question–image pair and then trained a network to find the answer in the retrieved articles.
The challenge in medical VQA is that the knowledge requirement is much higher than “adult-level common sense”. Future research includes identifying the questions that require external knowledge, exploiting existing medical knowledge bases such as [59], and building a medical VQA dataset with a structured or unstructured knowledge base. Additional studies are also required to determine whether medical knowledge-based VQA needs specific approaches.
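As a concrete (and deliberately simplified) sketch of the retrieve-then-answer idea, the snippet below ranks passages from a hypothetical medical knowledge base against the question, optionally enriched with text describing the image; it is our own illustration rather than the ArticleNet implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_passages(question, image_text, kb_passages, top_k=3):
    """Return the top-k knowledge-base passages most similar to the query.

    `image_text` stands in for any textual description of the image
    (e.g., modality and organ tags); `kb_passages` is a hypothetical
    list of strings drawn from a medical knowledge base.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    passage_matrix = vectorizer.fit_transform(kb_passages)
    query_vector = vectorizer.transform([question + " " + image_text])
    scores = cosine_similarity(query_vector, passage_matrix).ravel()
    top_idx = scores.argsort()[::-1][:top_k]
    return [(kb_passages[i], float(scores[i])) for i in top_idx]

# Hypothetical usage; a reader network would then search these passages:
# hits = retrieve_passages("Is there a pneumothorax?", "frontal chest x-ray", kb_passages)
```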
4.3.3. Evidence Verification
For methods with an attention mechanism, the attention map can be visualized to show the image regions that lead to the answer [92]. By comparing the attention map with annotated human visual attention, researchers can verify whether a VQA method uses the correct evidence to reach the answer. Evidence verification is also required in medical VQA: visualizing the evidence of an abnormality will gain the trust of medical professionals and patients, and it will allow professional users to evaluate and validate the answer. In general-domain VQA, Das et al. proposed the VQA-HAT dataset containing annotated human visual attention to evaluate attention maps both qualitatively (via visualizations) and quantitatively (via rank-order correlation) [21]. To extend verification to textual evidence, Park et al. proposed the VQA-X dataset containing both human-annotated textual explanations and visual explanations [62]. For text-based question–answer pairs, Wang et al. proposed the EST-VQA dataset, which annotates the bounding boxes of the correct text clues [87]. Alternatively, Jiang et al. proposed the IQVA dataset, which annotates eye-tracking data as human visual attention [36].
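The quantitative comparison used in VQA-HAT-style evaluation can be illustrated with a short sketch (our own example, assuming the model and human attention maps have already been resampled to the same resolution):

```python
import numpy as np
from scipy.stats import spearmanr

def attention_rank_correlation(model_attention, human_attention):
    """Spearman rank-order correlation between two attention maps.

    Both inputs are 2-D arrays of the same shape; higher values mean the
    model attends to regions similar to those a human annotator marked.
    """
    model_flat = np.asarray(model_attention, dtype=float).ravel()
    human_flat = np.asarray(human_attention, dtype=float).ravel()
    rho, _ = spearmanr(model_flat, human_flat)
    return rho

# Hypothetical usage with a 14x14 model attention map and a resized human map:
# rho = attention_rank_correlation(model_map, human_map)
```

Averaging this correlation over a test set yields a single evidence-verification score that could be reported alongside answer accuracy.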
Although those studies have established a complete scheme including annotated evidence, reasoning models, and evaluation metrics, several problems remain to be explored when applying evidence verification to medical VQA. First, previous studies used different annotation modes, such as bounding boxes and eye-tracking, and it is unclear which mode is the most suitable for medical VQA. Second, manual annotation for a medical task is expensive because it requires the professional skill of radiologists. Possible solutions for medical VQA include automated annotation methods and training schemes that maximize the utility of a small quantity of annotated data.
These three directions examine different reasoning procedures of VQA models and provide corresponding evaluation methods. To pass these “exams”, a VQA model has to look into the image, use external knowledge, and find the correct evidence. To the best of our knowledge, there are only a few such works in medical VQA. If a medical VQA system can be verified with respect to its inference ability, it will become a more convincing and reliable tool, and it is also helpful to present the knowledge and evidence used in inference explicitly. Hence, the evaluation of inference ability should be regarded as an even more important benchmark than answer accuracy. Hopefully, medical VQA will be able to answer the question “why” in the near future.
4.4. Integration in Medical Workflow
The community has a long history of trying to integrate AI decision-support systems into the clinical workflow. Integrating the medical VQA system into clinical practice would provide efficient communication and serve as an effective assistant. Hekler et al. found that the combination of human and artificial intelligence achieved results superior to either of the two working independently in skin cancer classification [34]. Furthermore, Tschandl et al. found that physicians with AI-based support outperformed either AI or physicians alone, and that the least experienced clinicians gained the most from AI-based support [79]. Tschandl et al. also experimented with the accuracy improvements of human-computer collaboration in skin cancer recognition and found that multiclass probabilities outperformed both content-based image retrieval (CBIR) and malignancy probability in the mobile-technology setting [79]. These findings indicate that many factors must be considered for the successful integration of a medical AI support system, such as clinicians' cognitive style, cognitive errors, personality, experience, and acceptance of AI.
Moreover, most AI decision-support systems used in the clinical setting are good at addressing close-ended (yes/no) questions only. In contrast, the ultimate goal of a medical VQA system is to handle open-ended questions (what, which, etc.). This goal will introduce more complexity and uncontrolled factors under the “human-in-the-loop” hypothesis. More efforts and studies need to be conducted in this field for a successful deployment of medical VQA into the clinical pipeline. Improving workflow efficiency and service quality also requires studying multiple aspects such as time spent, collaborative answer accuracy, and user satisfaction.
5. Conclusion
This article presented a survey of the datasets and approaches of medical VQA. We collected information on 8 medical VQA datasets and 45 papers describing medical VQA approaches, and gave a comprehensive discussion of dataset creation, approach frameworks, approach components, specific techniques, and an overview of the approaches. In addition to the descriptive review, we pinpointed some challenges worth exploring in future research. The medical VQA system mainly faces the following five challenges: first, how to make the system answer a more comprehensive range of question categories; second, how to combine medical features with the task; third, how to verify the evidence of an answer to make it more convincing; fourth, how to keep the system from being biased towards any single modality; and finally, how to maximize the benefit of medical VQA in the workflow. For future work, we propose the following directions. Investigating potential implementation in real-world scenarios is necessary. Furthermore, we should pay attention to the conversation between medical professionals and non-professionals, as the advantage of VQA is to understand natural language questions and help non-professionals. We should also introduce meaningful ideas from the general domain, such as evidence verification, bias analysis, and external knowledge bases. These will help us build a practicable and convincing medical VQA system towards medical AI's ultimate goal.

6. Declaration of Interests
We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
[1] Abacha, A.B., Gayen, S., Lau, J.J., Rajaraman, S., Demner-Fushman, D., 2018. NLM at ImageCLEF 2018 visual question answering in the medical domain., in: CLEF (Working Notes).
[2] Agrawal, A., Batra, D., Parikh, D., Kembhavi, A., 2018. Don't just assume; look and answer: Overcoming priors for visual question answering, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 4971–4980.
[3] Al-Sadi, A., Al-Theiabat, H., Al-Ayyoub, M., 2020. The inception team at VQA-Med 2020: Pretrained VGG with data augmentation for medical vqa and vqg, in: CLEF 2020 Working Notes.
[4] Al-Sadi, A., Talafha, B., Al-Ayyoub, M., Jararweh, Y., Costen, F., 2019. JUST at ImageCLEF 2019 visual question answering in the medical domain., in: CLEF (Working Notes).
[5] Allaouzi, I., Ahmed, M.B., 2018. Deep neural networks and decision tree classifier for visual question answering in the medical domain., in: CLEF (Working Notes).

[6] Allaouzi, I., Ahmed, M.B., Benamrou, B., 2019. An encoder-decoder model for visual question answering in the medical domain., in: CLEF (Working Notes).
[7] Ambati, R., Reddy Dudyala, C., 2018. A sequence-to-sequence model approach for imageclef 2018 medical domain visual question answering, in: 2018 15th IEEE India Council International Conference (INDICON), pp. 1–6. doi:10.1109/INDICON45594.2018.8987108.
[8] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-up and top-down attention for image captioning and visual question answering, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 6077–6086.
[9] Andreas, J., Rohrbach, M., Darrell, T., Klein, D., 2016. Neural module networks, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 39–48.
[10] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C., Parikh, D., 2015. VQA: Visual question answering, in: 2015 IEEE International Conference on Computer Vision (ICCV), IEEE Computer Society, Los Alamitos, CA, USA. pp. 2425–2433.
[11] Bai, H., Shan, X., Huang, Y., Wang, X., 2021. MVQAS: A Medical Visual Question Answering System. Association for Computing Machinery, New York, NY, USA. p. 4675–4679.
[12] Bansal, M., Gadgil, T., Shah, R., Verma, P., 2019. Medical visual question answering at Image CLEF 2019-VQA Med, in: CLEF (Working Notes).
[13] Ben Abacha, A., Datla, V.V., Hasan, S.A., Demner-Fushman, D., Müller, H., 2020. Overview of the VQA-Med task at ImageCLEF 2020: Visual question answering and generation in the medical domain, in: CLEF 2020 Working Notes, CEUR-WS.org, Thessaloniki, Greece.
[14] Ben Abacha, A., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H., 2019. VQA-Med: Overview of the medical visual question answering task at imageclef 2019, in: CLEF2019 Working Notes, CEUR-WS.org, Lugano, Switzerland.
[15] Ben Abacha, A., Sarrouti, M., Demner-Fushman, D., Hasan, S.A., Müller, H., 2021. Overview of the VQA-Med task at ImageCLEF 2021: Visual question answering and generation in the medical domain, in: CLEF 2021 Working Notes, CEUR-WS.org, Bucharest, Romania.
[16] Bounaama, R., Abderrahim, M.E.A., 2019. Tlemcen University at ImageCLEF 2019 visual question answering task., in: CLEF (Working Notes).
[17] Cadene, R., Dancette, C., Cord, M., Parikh, D., et al., 2019. RUBi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems 32, 841–852.
[18] Chen, G., Gong, H., Li, G., 2020. HCP-MIC at VQA-Med 2020: Effective visual representation for medical visual question answering, in: CLEF 2020 Working Notes.
[19] Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar. pp. 1724–1734. doi:10.3115/v1/D14-1179.
[20] Cross, N.M., Wildenberg, J., Liao, G., Novak, S., Bevilacqua, T., Chen, J., Siegelman, E., Cook, T.S., 2020. The voice of the radiologist: Enabling patients to speak directly to radiologists. Clinical imaging 61, 84–89.
[21] Das, A., Agrawal, H., Zitnick, C.L., Parikh, D., Batra, D., 2016. Human attention in visual question answering: Do humans and deep networks look at the same regions?, in: Conference on Empirical Methods in Natural Language Processing.
[22] Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
[23] Do, T., Nguyen, B.X., Tjiputra, E., Tran, M., Tran, Q.D., Nguyen, A., 2021. Multiple meta-model quantifying for medical visual question answering, in: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Springer International Publishing, Cham. pp. 64–74.
[24] Eslami, S., de Melo, G., Meinel, C., 2021. TeamS at VQA-Med 2021: BBN-Orchestra for long-tailed medical visual question answering. Working Notes of CLEF 201.
[25] Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M., 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas. pp. 457–468.
[26] Gasmi, K., Ltaifa, I.B., Lejeune, G., Alshammari, H., Ammar, L.B., Mahmood, M.A., 2021. Optimal deep neural network-based model for answering visual medical question. Cybernetics and Systems 0, 1–22. doi:10.1080/01969722.2021.2018543.
[27] Gong, H., Chen, G., Liu, S., Yu, Y., Li, G., 2021a. Cross-modal self-attention with multi-task pre-training for medical visual question answering, in: Proceedings of the 2021 International Conference on Multimedia Retrieval, Association for Computing Machinery, New York, NY, USA. p. 456–460. doi:10.1145/3460426.3463584.
[28] Gong, H., Huang, R., Chen, G., Li, G., 2021b. SYSU-HCP at VQA-Med 2021: A data-centric model with efficient training methodology for medical visual question answering. Working Notes of CLEF 201.
[29] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D., 2017. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, in: Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Gupta, D., Suman, S., Ekbal, A., 2021. Hierarchical deep multimodal network for medical visual question answering. Expert Systems with Applications 164, 113993. doi:https://doi.org/10.1016/j.eswa.2020.113993.
[31] Hasan, S.A., Ling, Y., Farri, O., Liu, J., Müller, H., Lungren, M.P., 2018. Overview of ImageCLEF 2018 medical domain visual question answering task., in: CLEF (Working Notes).
[32] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 770–778.
[33] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P., 2020. PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286.
[34] Hekler, A., Utikal, J.S., Enk, A.H., Hauschild, A., Weichenthal, M., Maron, R.C., Berking, C., Haferkamp, S., Klode, J., Schadendorf, D., et al., 2019. Superior skin cancer classification by the combination of human and artificial intelligence. European Journal of Cancer 120, 114–121.
[35] Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation 9, 1735–1780.
[36] Jiang, M., Chen, S., Yang, J., Zhao, Q., 2020. Fantastic answers and where to find them: Immersive question-directed visual attention, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 2977–2986.
[37] Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S., 2019. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.
[38] Jung, B., Gu, L., Harada, T., 2020. bumjun_jung at VQA-Med 2020: VQA model based on feature extraction and multi-modal feature fusion, in: CLEF 2020 Working Notes.
[39] K. Verma, H., Ramachandran S., S., 2020. HARENDRAKV at VQA-Med 2020: Sequential VQA with attention for medical visual question answering, in: CLEF 2020 Working Notes.

[40] Kavur, A.E., Selver, M.A., Dicle, O., Barış, M., Gezer, N.S., 2019. CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge Data. doi:10.5281/zenodo.3362844.
[41] Khare, Y., Bagal, V., Mathew, M., Devi, A., Priyakumar, U.D., Jawahar, C., 2021. Mmbert: Multimodal bert pretraining for improved medical vqa, in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1033–1036. doi:10.1109/ISBI48211.2021.9434063.
[42] Kim, J., Jun, J., Zhang, B., 2018. Bilinear attention networks, in: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, Montréal, Canada. pp. 1571–1581.
[43] Kornuta, T., Rajan, D., Shivade, C., Asseman, A., Ozcan, A.S., 2019. Leveraging medical visual question answering with supporting facts, in: CLEF (Working Notes).
[44] Kovaleva, O., Shivade, C., Kashyap, S., Kanjaria, K., Wu, J., Ballah, D., Coy, A., Karargyris, A., Guo, Y., Beymer, D.B., et al., 2020. Towards visual dialog for radiology, in: Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pp. 60–69.
[45] Lau, J.J., Gayen, S., Abacha, A.B., Demner-Fushman, D., 2018. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5, 1–10.
[46] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J., 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240.
[47] Li, J., Liu, S., 2021. Lijie at ImageCLEFmed VQA-Med 2021: Attention model based on efficient interaction between multimodality. Working Notes of CLEF 201.
[48] Li, Y., Yang, Z., Hao, T., 2021. TAM at VQA-Med 2021: A hybrid model with feature extraction and fusion for medical visual question answering. Working Notes of CLEF 201.
[49] Liao, Z., Wu, Q., Shen, C., van den Hengel, A., Verjans, J., 2020. AIML at VQA-Med 2020: Knowledge inference via a skeleton-based sentence mapping approach for medical domain visual question answering, in: CLEF 2020 Working Notes.
[50] Lin, M., Chen, Q., Yan, S., 2013. Network in network. arXiv preprint arXiv:1312.4400.
[51] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context, in: European conference on computer vision, Springer. pp. 740–755.
[52] Liu, B., Zhan, L.M., Wu, X.M., 2021a. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images, in: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Springer International Publishing, Cham. pp. 210–220.
[53] Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M., 2021b. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. arXiv:2102.09542.
[54] Liu, S., Ding, H., Zhou, X., 2020. Shengyan at VQA-Med 2020: An encoder-decoder model for medical domain visual question answering task, in: CLEF 2020 Working Notes.
[55] Liu, S., Ou, X., Che, J., Zhou, X., Ding, H., 2019. An Xception-GRU model for visual question answering in the medical domain., in: CLEF (Working Notes).
[56] Lu, J., Yang, J., Batra, D., Parikh, D., 2016. Hierarchical question-image co-attention for visual question answering, in: Advances in Neural Information Processing Systems, pp. 289–297.
[57] Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R., 2019. OK-VQA: A visual question answering benchmark requiring external knowledge, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 3190–3199.
[58] McDonald, R.J., Schwartz, K.M., Eckel, L.J., Diehn, F.E., Hunt, C.H., Bartholmai, B.J., Erickson, B.J., Kallmes, D.F., 2015. The effects of changes in utilization and technological advancements of cross-sectional imaging on radiologist workload. Academic radiology 22, 1191–1198.
[59] Müller, L., Gangadharaiah, R., Klein, S.C., Perry, J., Bernstein, G., Nurkse, D., Wailes, D., Graham, R., El-Kareh, R., Mehta, S., et al., 2019. An open access medical knowledge base for community driven diagnostic decision support system development. BMC medical informatics and decision making 19, 93.
[60] Nguyen, B.D., Do, T.T., Nguyen, B.X., Do, T., Tjiputra, E., Tran, Q.D., 2019. Overcoming data limitation in medical visual question answering, in: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.T., Khan, A. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Springer International Publishing, Cham. pp. 522–530.
[61] Papineni, K., Roukos, S., Ward, T., Zhu, W.J., 2002. BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA. pp. 311–318.
[62] Park, D., Hendricks, L., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., Rohrbach, M., 2018. Multimodal explanations: Justifying decisions and pointing to the evidence, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 8779–8788.
[63] Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M., 2018. Radiology objects in COntext (ROCO): a multimodal image dataset, in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Springer, pp. 180–189.
[64] Peng, Y., Liu, F., Rosen, M.P., 2018. UMass at ImageCLEF medical visual question answering (Med-VQA) 2018 task, in: CLEF (Working Notes).
[65] Ramakrishnan, S., Agrawal, A., Lee, S., 2018. Overcoming language priors in visual question answering with adversarial regularization, in: Advances in Neural Information Processing Systems, pp. 1541–1551.
[66] Ren, F., Zhou, Y., 2020. CGMVQA: A new classification and generative model for medical visual question answering. IEEE Access 8, 50626–50636.
[67] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252.
[68] Sarrouti, M., 2020. NLM at VQA-Med 2020: Visual question answering and generation in the medical domain, in: CLEF 2020 Working Notes.
[69] Schilling, R., Messina, P., Parra, D., Lobel, H., 2021. PUC chile team at VQA-Med 2021: approaching vqa as a classfication task via fine-tuning a pretrained cnn. Working Notes of CLEF 201.
[70] Schuster, M., Paliwal, K.K., 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45, 2673–2681.
[71] Sharma, D., Purushotham, S., Reddy, C.K., 2021. Medfusenet: An attention-based multimodal deep learning model for visual question answering in the medical domain. Scientific Reports 11, 1–18.
[72] Shi, L., Liu, F., Rosen, M.P., 2019. Deep multimodal learning for medical visual question answering., in: CLEF (Working Notes).
[73] Shickel, B., Tighe, P.J., Bihorac, A., Rashidi, P., 2018. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics 22, 1589–1604. doi:10.1109/JBHI.2017.2767063.
[74] Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition, in: Proceedings of the 3rd International Conference on Learning Representations.
[75] Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., van Ginneken, B., Kopp-Schneider, A., Landman, B.A., Litjens, G., Menze, B., Ronneberger, O., Summers, R.M., Bilic, P., Christ, P.F., Do, R.K.G., Gollub, M., Golia-Pernicka, J., Heckers, S.H., Jarnagin, W.R., McHugo, M.K., Napel, S., Vorontsov, E., Maier-Hein, L., Cardoso, M.J., 2019. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv:1902.09063.

[76] Sitara, N.M.S., Kavitha, S., 2021. SSN MLRG at VQA-Med 2021: An approach for VQA to solve abnormality related queries using improved datasets. Working Notes of CLEF 201.
[77] Talafha, B., Al-Ayyoub, M., 2018. JUST at VQA-Med: A VGG-Seq2Seq model, in: CLEF (Working Notes).
[78] Thanki, A., Makkithaya, K., 2019. MIT Manipal at ImageCLEF 2019 visual question answering in medical domain, in: CLEF (Working Notes).
[79] Tschandl, P., Rinner, C., Apalla, Z., Argenziano, G., Codella, N., Halpern, A., Janda, M., Lallas, A., Longo, C., Malvehy, J., et al., 2020. Human-computer collaboration for skin cancer recognition. Nature Medicine 26, 1229–1234.
[80] Turner, A., Spanier, A., 2019. LSTM in VQA-Med, is it really needed? JCE study on the ImageCLEF 2019 dataset, in: CLEF (Working Notes).
[81] Umada, H., Aono, M., 2020. kdevqa at VQA-Med 2020: focusing on GLU-based classification, in: CLEF 2020 Working Notes.
[82] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Advances in Neural Information Processing Systems, pp. 5998–6008.
[83] Vu, M., Sznitman, R., Nyholm, T., Löfstedt, T., 2019. Ensemble of streamlined bilinear visual question answering models for the ImageCLEF 2019 challenge in the medical domain, in: CLEF 2019.
[84] Vu, M.H., Löfstedt, T., Nyholm, T., Sznitman, R., 2020. A question-centric model for visual question answering in medical imaging. IEEE Transactions on Medical Imaging 39, 2856–2868. doi:10.1109/TMI.2020.2978284.
[85] Wang, P., Wu, Q., Shen, C., Dick, A., van den Hengel, A., 2017a. Explicit knowledge-based reasoning for visual question answering, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 1290–1296.
[86] Wang, P., Wu, Q., Shen, C., Dick, A., van den Hengel, A., 2018. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis & Machine Intelligence 40, 2413–2427.
[87] Wang, X., Liu, Y., Shen, C., Ng, C., Luo, C., Jin, L., Chan, C., van den Hengel, A., Wang, L., 2020. On the general value of evidence, and bilingual scene-text visual question answering, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 10123–10132.
[88] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M., 2017b. ChestX-Ray8: Hospital-scale chest X-Ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 3462–3471. doi:10.1109/CVPR.2017.369.
[89] Wu, Q., Teney, D., Wang, P., Shen, C., Dick, A., van den Hengel, A., 2017. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding 163, 21–40. Language in Vision.
[90] Xiao, Q., Zhou, X., Xiao, Y., Zhao, K., 2021. Yunnan university at VQA-Med 2021: Pretrained BioBERT for medical domain visual question answering. Working Notes of CLEF 201.
[91] Yan, X., Li, L., Xie, C., Xiao, J., Gu, L., 2019. Zhejiang University at ImageCLEF 2019 visual question answering in the medical domain, in: CLEF (Working Notes).
[92] Yang, Z., He, X., Gao, J., Deng, L., Smola, A., 2016. Stacked attention networks for image question answering, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 21–29.
[93] Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q., 2019. Deep modular co-attention networks for visual question answering, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA. pp. 6274–6283.
[94] Yu, Z., Yu, J., Fan, J., Tao, D., 2017. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, in: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE Computer Society, Los Alamitos, CA, USA. pp. 1839–1848.
[95] Yu, Z., Yu, J., Xiang, C., Fan, J., Tao, D., 2018. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems 29, 5947–5959. doi:10.1109/TNNLS.2018.2817340.
[96] Zhan, L.M., Liu, B., Fan, L., Chen, J., Wu, X.M., 2020. Medical visual question answering via conditional reasoning, in: Proceedings of the 28th ACM International Conference on Multimedia (MM '20), ACM.
[97] Zheng, W., Yan, L., Wang, F.Y., Gou, C., 2020. Learning from the guidance: Knowledge embedded meta-learning for medical visual question answering, in: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (Eds.), Neural Information Processing, Springer International Publishing, Cham. pp. 194–202.
[98] Zhou, B., Cui, Q., Wei, X.S., Chen, Z.M., 2020. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition, 1–8.
[99] Zhou, Y., Kang, X., Ren, F., 2018. Employing Inception-Resnet-v2 and Bi-LSTM for medical domain visual question answering, in: CLEF (Working Notes).
[100] Zhou, Y., Kang, X., Ren, F., 2019. TUA1 at ImageCLEF 2019 vqa-med: a classification and generation model based on transfer learning, in: CLEF (Working Notes).
