
VISION-BASED QUERY AND IMAGE-DRIVEN TEXT PROCESSING FOR CHATBOTS

Group Members:
YUSUF SODAWALA: 2021300127
KRISHNAN SUBRAMANIAN: 2021300130
SAAD SURVE: 2021300131

Mentor: Prof. Jignesh Sisodia
INTRODUCTION
Objective: Develop image-processing capabilities for chatbots using visual question answering (VQA).
Aim: Enhance chatbots' understanding of visual content.
Focus: Extracting meaning from images to support natural-language interaction.
Importance: Improves user engagement and interaction.
Components:
Theoretical underpinnings of VQA
Image processing techniques
Model development and evaluation
PROBLEM DEFINITION, SCOPE & OBJECTIVES OF THE PROJECT
Scope of the Project:

Utilization of Advanced Image Processing:
Leverage state-of-the-art algorithms and frameworks (e.g., TensorFlow, PyTorch).
Explore CNNs, attention mechanisms, and transfer learning for enhanced image understanding.

Focus on Accessibility and Inclusivity:
Prioritize VQA model development for accessibility.
Create multimodal interfaces that visually impaired users can interact with effectively.
LITERATURE SURVEY
Paper Title: Learning content and context with language bias for visual question answering.
Author(s): Chao Yang, Su Feng, Dongsheng Li, Huawei Shen, Guoqing Wang, and Bin Jiang.
Methodology: Develops two architectural branches, Content and Context, to deal with bias learning.
Inference: Improved model performance by addressing bias in content and context.

Paper Title: Beyond accuracy: A consolidated tool for visual question answering benchmarking.
Author(s): Dirk Väth, Pascal Tilli, and Ngoc Thang Vu.
Methodology: Introduces a novel digital framework to analyze, evaluate, and test VQA models and datasets.
Inference: Provides a structured approach for evaluating and testing VQA models and datasets.

Paper Title: Multimodal continuous visual attention mechanisms.
Author(s): A. Farinhas, A. T. Martins, and P. Q. Aguiar.
Methodology: Models the attention mechanisms of VQA models as multimodal feature functions for human attention.
Inference: Aims to enhance interpretability and performance by mimicking human attention patterns.
LITERATURE SURVEY (CONTINUED)

Paper Title: Medical visual question answering via conditional reasoning.
Author(s): Li-Ming Zhan, Bo Liu, Lu Fan, Jiaxin Chen, and Xiao-Ming Wu.
Methodology: Question-conditioned reasoning module for medical VQA models.
Inference: Enhances reasoning capabilities in medical VQA tasks.

Paper Title: Debiased visual question answering from feature and sample perspectives.
Author(s): Shiquan Wen, Guanghui Xu, Mingkui Tan, Qingyao Wu, and Qi Wu.
Methodology: Technique to recognize and alleviate negative bias effects during training.
Inference: Addresses bias issues in VQA model training to improve performance.

Paper Title: Question-controlled text-aware image captioning.
Author(s): Anwen Hu, Shizhe Chen, and Qin Jin.
Methodology: Expands VQA tasks to include scene text awareness for image captioning.
Inference: Extends image captioning to be question-controlled and aware of scene text.
ARCHITECTURE

ViLT: Vision and Language Transformer. ViLT is a single-stream transformer that feeds text tokens and linearly embedded image patches into one shared encoder, with no separate convolutional visual backbone.
DESIGN AND METHODOLOGY

[Figure: Main Block - Overview of the System]
DESIGN AND METHODOLOGY

Processing components: OCR, Color Recognition, Object Recognition, Scene Recognition (illustrative routing sketch below).
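To make the composition of these components concrete, here is a minimal, hypothetical routing sketch in Python. The function names (make_query_router, run_ocr, run_vqa) and the keyword-based dispatch rule are illustrative assumptions, not the project's actual design; it shows how a chatbot front end might send text-oriented queries to the OCR component and all other queries to the VQA component.

# Hypothetical routing layer; names and the keyword heuristic are
# illustrative assumptions, not the project's actual design.
from typing import Callable

def make_query_router(run_ocr: Callable[[str], str],
                      run_vqa: Callable[[str, str], str]) -> Callable[[str, str], str]:
    """Build a handler that sends text-oriented questions to OCR, the rest to VQA."""
    text_keywords = ("read", "written", "text", "say")

    def handle_query(image_path: str, question: str) -> str:
        # Simple keyword dispatch; a real system could use an intent classifier.
        if any(word in question.lower() for word in text_keywords):
            return run_ocr(image_path)
        return run_vqa(image_path, question)

    return handle_query

The run_ocr and run_vqa callables can be backed by the EasyOCR and ViLT sketches under the implementation details that follow.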
IMPLEMENTATION DETAILS

ViLT model for VQA:
ViLT model implemented via Hugging Face's Transformers library.
Integrates visual and textual data for tasks like visual question answering.
Utilizes pre-trained transformer architectures.
During inference, processes input images and questions jointly.
Employs multi-modal fusion techniques.
Enables accurate predictions through contextual understanding (minimal inference sketch below).
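The inference flow above can be realized in a few lines; here is a minimal sketch assuming Hugging Face Transformers is installed. The checkpoint "dandelin/vilt-b32-finetuned-vqa" is the publicly available ViLT model fine-tuned on VQAv2; the image path and question are placeholders.

# Minimal ViLT VQA inference sketch (Hugging Face Transformers).
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("example.jpg").convert("RGB")  # placeholder image
question = "What color is the car?"               # placeholder question

# The processor tokenizes the question and patch-embeds the image into one
# batch, so fusion happens inside a single transformer encoder.
inputs = processor(image, question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The classification head scores a fixed answer vocabulary; pick the top one.
print("Answer:", model.config.id2label[logits.argmax(-1).item()])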
OCR model for text extraction:
Implemented OCR model using the EasyOCR library.
Designed to extract text from images or documents.
Leverages pre-trained deep learning models for character recognition.
Capable of handling multiple languages and scripts.
Provides fast and accurate text extraction.
Supports common image formats such as JPEG and PNG.
Offers straightforward integration into existing applications or workflows.
Suitable for tasks like document digitization and text extraction from images (usage sketch below).
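A corresponding minimal EasyOCR sketch; the language list and file name are placeholders (PDFs would first need conversion to images).

# Minimal EasyOCR text-extraction sketch.
import easyocr

# Loads pre-trained detection and recognition models (downloaded on first run);
# several languages can be requested at once, e.g. ['en', 'hi'].
reader = easyocr.Reader(['en'])

# readtext returns one (bounding box, text, confidence) triple per region.
for bbox, text, confidence in reader.readtext('document.jpg'):
    print(f"{text}  (confidence: {confidence:.2f})")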
TECHNOLOGY STACK
PROJECT PLAN AND TIMELINE

Phase 0 (8/1/24): Problem identification.
Phase 1 (27/2/23): Completion of ideation: solution, objectives, technology required; 25% completion.
Phase 1.5 (27/3/23): 35% completion.
Phase 2 (23/3/23): 50% completion.
Phase 3: Next semester and beyond.
REFERENCES
1. Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. "Medical visual
question answering: A survey," 2022.
2. Himanshu Sharma and Anand Singh Jalal. "A survey of methods, datasets and evaluation metrics for visual question answering."
*Image and Vision Computing*, 116:104327, 2021.
3. Sruthy Manmadhan and Binsu C Kovoor. "Visual question answering: a state-of-the-art review." *Artificial Intelligence Review*,
53:5705–5745, 2020.
4. Charulata Patil and Manasi Patwardhan. "Visual question generation: The state of the art." *ACM Comput. Surv.*, 53(3), May 2020.
5. Yeyun Zou and Qiyu Xie. "A survey on VQA: Datasets and approaches." In *2020 2nd International Conference on Information
Technology and Computer Application (ITCA)*. IEEE, Dec 2020.
6. Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. "Towards VQA
models that can read." In *2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, pages 8309–8318, 2019.
7. Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marcal Rusinol, Ernest Valveny, C.V. Jawahar, and Dimosthenis Karatzas.
"Scene text visual question answering." In *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*, October
2019.
8. Chao Yang, Su Feng, Dongsheng Li, Huawei Shen, Guoqing Wang, and Bin Jiang. "Learning content and context with language bias for visual
question answering." In *2021 IEEE International Conference on Multimedia and Expo (ICME)*, pages 1–6, July 2021.
9. Dirk Väth, Pascal Tilli, and Ngoc Thang Vu. "Beyond accuracy: A consolidated tool for visual question answering benchmarking."
In *Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 114–123,
Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
10. A. Farinhas, A. T. Martins, and P. Q. Aguiar. "Multimodal continuous visual attention mechanisms." In *2021 IEEE/CVF
International Conference on Computer Vision Workshops (ICCVW)*, pages 1047–1056, Los Alamitos, CA, USA, Oct 2021. IEEE Computer
Society.
11. Corentin Kervadec, Grigory Antipov, Moez Baccouche, and Christian Wolf. "Roses are red, violets are blue... but should VQA
expect them to?" In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 2776–2785, 2021.
12. D. Teney, E. Abbasnejad, and A. van den Hengel. "Unshuffling data for improved generalization in visual question answering." In
*2021 IEEE/CVF International Conference on Computer Vision (ICCV)*, pages 1397–1407, Los Alamitos, CA, USA, Oct 2021. IEEE Computer
Society.
13. Chenyu Gao, Qi Zhu, Peng Wang, Hui Li, Yuliang Liu, Anton van den Hengel, and Qi Wu. "Structured multimodal attentions for
TextVQA." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 2021.
14. Zhuo Chen, Jiaoyan Chen, Yuxia Geng, Jeff Z Pan, Zonggang Yuan, and Huajun Chen. "Zero-shot visual question answering using
knowledge graph." In *International Semantic Web Conference*, pages 146–162. Springer, 2021.
15. Li-Ming Zhan, Bo Liu, Lu Fan, Jiaxin Chen, and Xiao-Ming Wu. "Medical visual question answering via conditional reasoning." In
*Proceedings of the 28th ACM International Conference on Multimedia, MM ’20*, pages 2345–2354, New York, NY, USA, 2020. Association
for Computing Machinery.
