
Visual Question Answering Using Deep Learning

Arvind Kumar
M Tech (IT)
IIIT-A
Prayagraj, India
mhc2022016@iiita.ac.in

Abstract—Visual Question Answering (VQA) is a challenging field that has in recent times received an outsized interest from the areas of Natural Language Processing and Computer Vision. VQA aims to establish an intelligent system that predicts the answers to natural language questions raised about an image.

Index Terms—Natural Language Processing, Long Short-Term Memory, CNN

I. INTRODUCTION

VQA is a multi-discipline research problem that gained popularity in academic communities such as natural language processing and computer vision because humans, when looking at images, tend to see objects and understand how they interact with their properties and states. Visual question answering (VQA) is fascinating as it probes whether models truly recognize what they see [3]. The current approaches in VQA or visual grounding rely on concatenating vectors or applying element-wise sum or product. Researchers have utilized many techniques to build VQA models, and deep learning methods can attain advanced results on challenging tasks such as face recognition.

II. RELATED WORKS
A. Visual Question Answering based on multimodal triplet knowledge accumulation

In this work, given a picture and a question related to it, the knowledge-based visual question answering task is to predict the corresponding answer, which requires additional practical knowledge beyond the content provided by the picture and the text. The paper takes accumulated multimodal knowledge as external knowledge and infers the answer directly. First, based on an unstructured knowledge graph, text embeddings and visual embeddings are extracted from the pre-trained language model LXMERT.
B. Natural Language Processing based Visual Question Answering Efficient: an EfficientDet Approach

This approach is built from the following components (a sketch follows the list):
• A preprocessing layer that converts the dataset into a suitable form by processing questions and images.
• A bidirectional LSTM layer for extraction of temporal relationships between the words of the question, which acts on word embeddings generated using GloVe [10].
• An EfficientDet-based image feature extraction component to efficiently process images.
• A final layer to combine the question and image features and compile the model.
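To make the layered design above concrete, here is a minimal Keras sketch of such a pipeline. It is not the cited authors' implementation: the EfficientDet detector is abstracted as a precomputed image-feature vector, and the vocabulary size, sequence length, feature dimensions, and answer count are illustrative assumptions.

    # Hypothetical sketch of the BiLSTM + EfficientDet VQA pipeline (not the paper's code).
    from tensorflow.keras import layers, models

    VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 10000, 25, 300   # assumed sizes
    IMG_FEAT_DIM, NUM_ANSWERS = 1024, 1000            # assumed sizes

    # Question branch: GloVe-style embeddings followed by a bidirectional LSTM.
    q_in = layers.Input(shape=(SEQ_LEN,), name="question_tokens")
    q = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(q_in)  # would be seeded with GloVe vectors
    q = layers.Bidirectional(layers.LSTM(256))(q)

    # Image branch: stands in for features exported by an EfficientDet backbone.
    v_in = layers.Input(shape=(IMG_FEAT_DIM,), name="image_features")
    v = layers.Dense(512, activation="tanh")(v_in)

    # Final layer: combine both modalities and classify over candidate answers.
    h = layers.concatenate([q, v])
    out = layers.Dense(NUM_ANSWERS, activation="softmax")(h)

    model = models.Model([q_in, v_in], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")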
C. Visual question answering algorithm based on image caption

This paper starts from the VQA-E model, which attains a certain accuracy in explaining VQA answers but ignores the information contained in the explanation itself. To this end, the method first uses image processing to extract the target information of the image, combines it with the text information, and uses a co-attention mechanism in the process of combination, instead of the image-only attention of the VQA-E model. The explanation is then combined with the question information and both are input into an LSTM, whose function is to enrich the text information available for visual question answering, on the premise that more text information will improve the accuracy of the answer. A toy sketch of one co-attention step is given below.
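As an illustration only (the cited paper's exact formulation is not reproduced here), a minimal NumPy sketch of one co-attention step between question-word features and image-region features might look as follows; the dimensions and the bilinear affinity form are assumptions.

    # Toy co-attention between question words and image regions (illustrative only).
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    T, N, D = 12, 36, 512              # words, image regions, feature size (assumed)
    Q = np.random.randn(T, D)          # question word features
    V = np.random.randn(N, D)          # image target (region) features
    W = np.random.randn(D, D) * 0.01   # bilinear weight, learned in a real model

    C = Q @ W @ V.T                    # word-region affinity matrix, shape (T, N)
    att_v = softmax(C, axis=1) @ V     # image content attended per question word, (T, D)
    att_q = softmax(C, axis=0).T @ Q   # question content attended per region, (N, D)

    # The attended features would then be fused with the caption/explanation text
    # and passed to the LSTM described above.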
III. METHODOLOGY

The main aim of the proposed VQA system is that, when given an input picture and a human-understandable question that is free-form and open-ended about the picture, it should produce spontaneous, accurate, concise, free-form, natural language answers. Answering any possible question about an image is a strenuous task which needs proper semantic scene understanding. I have selected the base paper [3]. The proposed system undergoes the following steps to extract the output (each step is sketched in the subsection below):
• Image Feature Extraction
• Word Embedding for Question
• Answer Prediction
A. A brief description of selected model

The processing steps are given below. First, the features of the image are extracted. Here, a feature describes the content of an image: typically the underlying objects and background, i.e., whether a definite region of the image has definite properties. Features may be particular structures in the image, such as edges, objects, or points. The image is passed through the well-known VGG16 architecture, a deep convolutional neural network. The final SoftMax classification layers are popped off so that the 4096-dimensional activation of the second-to-last layer is extracted as the image feature; a sketch of this step follows.
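A minimal Keras sketch of this extraction step, assuming the standard keras.applications VGG16 with ImageNet weights (the input file name is a placeholder):

    # Extract the 4096-d fc2 activation from VGG16, with the SoftMax head removed.
    import numpy as np
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.models import Model
    from tensorflow.keras.preprocessing import image

    base = VGG16(weights="imagenet")                 # full network, SoftMax included
    feat_model = Model(inputs=base.input,
                       outputs=base.get_layer("fc2").output)  # stop before the classifier

    img = image.load_img("example.jpg", target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    img_features = feat_model.predict(x)             # shape: (1, 4096)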
After image feature extraction, word embedding is adopted: a technique in Natural Language Processing for representing words as dense vectors in a deep learning environment, so that the question can be fed to the network. A sketch of a simple question encoder is given below.
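A sketch of this step under assumed hyper-parameters (the vocabulary size, sequence length, and the use of an LSTM sized to match the 4096-d image features are illustrative choices, not specified in this report):

    # Embed the tokenized question and summarize it into a single feature vector.
    from tensorflow.keras import layers, models

    VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 10000, 25, 300   # assumed sizes

    q_in = layers.Input(shape=(SEQ_LEN,), name="question_tokens")
    q = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(q_in)
    q_features = layers.LSTM(4096)(q)                 # sized to match the image features

    question_encoder = models.Model(q_in, q_features)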
The final step is to predict the answer. In order to get the output, the image features and the question features are combined using pointwise multiplication, and the fused vector is classified over the candidate answers, as sketched below.
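A minimal fusion-and-prediction sketch that continues the two encoders above (the size of the answer set is an assumption; since COCO-QA answers are single words, classification over a fixed answer vocabulary is a natural fit):

    # Fuse image and question features by pointwise multiplication, then classify.
    from tensorflow.keras import layers, models

    NUM_ANSWERS = 430                                 # assumed answer-vocabulary size

    v_in = layers.Input(shape=(4096,), name="image_features")     # from VGG16 fc2
    q_in = layers.Input(shape=(4096,), name="question_features")  # from the LSTM encoder

    fused = layers.Multiply()([v_in, q_in])           # pointwise product of modalities
    h = layers.Dense(1024, activation="tanh")(fused)
    out = layers.Dense(NUM_ANSWERS, activation="softmax")(h)

    vqa_head = models.Model([v_in, q_in], out)
    vqa_head.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])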
IV. DATASET DETAIL

In this method I am using the COCO-QA dataset, which uses images from COCO; its questions are created automatically with an NLP algorithm, and all of its answers are a single word.
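For illustration, a loading sketch assuming the commonly distributed plain-text layout of COCO-QA, with parallel files of questions, one-word answers, and COCO image ids; the file names are an assumption about the release, not guaranteed:

    # Hypothetical loader for a plain-text COCO-QA split (file names assumed).
    from pathlib import Path

    def load_cocoqa(split_dir):
        """Return parallel (question, one-word answer, COCO image id) triples."""
        root = Path(split_dir)
        questions = root.joinpath("questions.txt").read_text().splitlines()
        answers = root.joinpath("answers.txt").read_text().splitlines()
        img_ids = [int(i) for i in root.joinpath("img_ids.txt").read_text().splitlines()]
        assert len(questions) == len(answers) == len(img_ids)
        return list(zip(questions, answers, img_ids))

    # Example: train = load_cocoqa("cocoqa/train")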
V. CONCLUSION

Visual Question Answering is useful for visually impaired people, and it connects visual data to the text that is on the web: it can leverage multimodal information on the web to summarize visual data for analysts. The proposed system has been experimented with on various open-ended questions to show the robustness of the system. VQA finds its application in various real-world scenarios such as self-driving cars and guiding visually impaired people.

A. Work progress

The literature survey is completed [1-9]. The base paper on which I have to work has been selected, and a brief understanding of the method has been taken. In this semester, work will be done only on the basis of text data; the remaining work will be done in the subsequent semesters.

VI. ACKNOWLEDGMENT

I would like to express my gratitude to Prof. Anupam Agarwal for giving me this topic, as it enhanced my knowledge in the domains of Machine Learning, Deep Learning, Natural Language Processing (NLP) and Computer Vision, and their application to real-world problems.

VII. REFERENCES

[1] H. Sharma and A. S. Jalal, "Visual question answering model based on graph neural network and contextual attention", Image and Vision Computing, vol. 102, pp. 223-230, March 2021.
[2] A. Jamshed and M. M. Fraz, "NLP Meets Vision for Visual Interpretation - A Retrospective Insight and Future Directions", in International Conference on Digital Futures and Transformative Technologies, vol. 3, pp. 178-185, May 2021.
[3] S. Barra, C. Bisogni, M. De Marsico, and S. Ricciardi, "Visual Question Answering: which investigated applications", pp. 5673-5679, Feb 2021.
[4] R. Bernardi and S. Pezzelle, "Linguistic issues behind visual question answering", Language and Linguistics Compass, 2021.
[5] V. Kazemi and A. Elqursh, "Show, ask, attend, and answer: A strong baseline for visual question answering", 2017.
[6] S. Shah, A. Mishra, N. Yadati and P. P. Talukdar, "Knowledge-aware visual question answering", in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 8876-8884, July 2019.
[7] G. KV and A. Mittal, "Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder", vol. 111, pp. 463-473, 2020.
[8] Z. Liu, J. Wu, L. Fu, Y. Majeed, Y. Feng, R. Li, and Y. Cui, "Improved kiwifruit detection using pre-trained VGG16 with RGB and NIR information fusion", 2019.
[9] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, "COCO-Text: Dataset and benchmark for text detection and recognition in natural images", 2016.
[10] M. Malinowski and M. Fritz, "A multi-world approach to question answering about real-world scenes based on uncertain input", in Proc. Advances in Neural Inf. Process. Syst., pp. 1682-1690, 2014.
