
Title: Application of Convolutional Vision Transformers for the

Classification of Infectious Diseases in Chest Radiological images

Submitted by
Name: Md Hasibur Rahman
Course Code: CSE 326
Course Title: Research and Innovation
Id: 211-15-14616
Sec: 58_A
Submission date: 28/11/2023

Submitted to
Dr. Arif Mahmud, Associate Professor, Department of CSE, DIU.



Table of Contents
Introduction
Problem Statement
Literature Review
Aims and Objectives
Research Questions to be Addressed in this Project
Significance of the Research
Proposed Methodology
About the Dataset
Project Plan
References

Table of Figures
Figure 1: Diagram of the proposed methodology.

Index of Tables
Table 1: Distribution of the classes of the dataset and their labels.



Introduction:
Respiratory diseases, such as COVID-19, tuberculosis, and pneumonia, pose
significant global health threats, responsible for substantial morbidity and
mortality worldwide (WHO, 2020). Rapid and accurate diagnosis of these
diseases is critical for effective patient management and improved health
outcomes. In this regard, chest radiological images serve as a primary non-
invasive tool for early detection and monitoring of these conditions (Kelly et
al., 2020). However, conventional manual analysis of chest radiological images
is often labor-intensive and subject to variability, highlighting the need for
more accurate and efficient methods (Rajpurkar et al., 2017).
In recent years, deep learning models, particularly Convolutional Neural
Networks (CNNs), have demonstrated promising results in enhancing the
accuracy of disease detection and classification from chest X-rays (Rajpurkar
et al., 2017; Wang et al., 2017). The advent of transformer models, primarily
used in natural language processing tasks, offers potential applications in
medical imaging, with such models demonstrating performance competitive with
CNNs in certain tasks (Dosovitskiy et al., 2020).
This project proposes to leverage the power of transformer models, specifically
a custom Convolutional Vision Transformer, in the classification of chest
radiological images, with a focus on infectious respiratory diseases, across
five classes: COVID-19, lung opacity, normal, viral pneumonia, and tuberculosis. The
goal is to fine-tune and compare various deep learning architectures on a multi-
class chest X-ray dataset, providing an in-depth analysis of the models'
interpretability and prediction capabilities.



Problem Statement
The problem this research intends to address is the development and evaluation
of a Transformer-based model for the classification of Chest radiological
images, specifically focusing on five classes - COVID-19, Lung Opacity,
Normal, Viral Pneumonia, and Tuberculosis. The challenge is to fine-tune and
compare the performance of various deep learning architectures, including
VGG19, ResNet50, Xception, a custom CNN model, and a custom
Convolutional Vision Transformer, in order to identify the model that offers
the highest prediction accuracy. Furthermore, an understanding of the model's
decision-making process, through the use of visualization techniques, is sought
to provide insights into how these models process and interpret chest X-ray
images. This understanding could significantly impact the adoption of such
models in clinical practice, augmenting the efficiency and accuracy of disease
detection.



Literature Review:
The significant role of radiography, particularly chest X-rays, in the early
detection and diagnosis of respiratory diseases, is well-established (Kelly et
al., 2020). Yet, traditional manual analysis of these images can be labor-
intensive, subject to variability, and lack the necessary speed for efficient
disease control, especially in the face of pandemic scenarios like COVID-19
(Rajpurkar et al., 2017).
Over the past few years, machine learning, and particularly deep learning
models, have shown remarkable potential to overcome these challenges,
enhancing the speed, accuracy, and efficiency of disease detection from chest
X-rays (Rajpurkar et al., 2017; Wang et al., 2017). Convolutional Neural
Networks (CNNs), with their strength in automatically learning hierarchical
representations from raw data, have emerged as a leading tool in this domain,
providing superior performance in diverse medical imaging tasks (LeCun et
al., 2015). Various architectures like VGG19, ResNet50, and Xception have
been extensively employed, showcasing strong results in chest X-ray analysis
(Simonyan & Zisserman, 2014; He et al., 2016; Chollet, 2017).
Nevertheless, despite these successes, the interpretability of CNNs remains a
considerable challenge; they are often regarded as "black-box" models whose
decision-making processes are hard to decipher (Selvaraju et al., 2020). This lack
of transparency has raised concerns about the practical applicability of these
models in clinical practice, underscoring the need for models that not only offer
high prediction accuracy but also explainable outcomes.
Recently, the advent of transformer models, originally designed for natural
language processing tasks, offers promising solutions in this regard. The
Vision Transformer (ViT), an adaptation of the transformer model for computer
vision tasks, has shown performance competitive with CNNs (Dosovitskiy et al.,
2020), and convolutional hybrids of the architecture aim to bring together the
benefits of both worlds - the data efficiency and performance of CNNs, and the
flexible global reasoning capability of transformers. Furthermore, the self-attention
mechanism inherent in transformers could potentially offer greater
interpretability, providing insights into the model's decision-making process
(Vaswani et al., 2017).



However, the application of Convolutional Vision Transformers in medical
imaging, particularly in chest X-ray analysis, is still a relatively unexplored
domain, warranting further investigation.
In summary, this research builds on a body of work exploring deep learning
for chest X-ray analysis, and seeks to extend it by investigating the potential
and efficacy of Convolutional Vision Transformers in this context, addressing
the dual need for high performance and interpretability.



Aims and Objectives:

Aim:
The overarching aim of this project is to investigate the efficacy of
Convolutional Vision Transformers for the classification of infectious diseases
in chest radiological images, with a specific focus on enhancing interpretability
alongside maintaining high performance.
Objectives:
1. Dataset Exploitation and Preparation: To effectively utilize a comprehensive,
open-source chest X-ray dataset, ensuring appropriate preprocessing and
labelling for an accurate and efficient machine learning model development
process.
2. Model Development and Enhancement: To develop, fine-tune and enhance
various deep learning models including VGG19, ResNet50, Xception, a
custom CNN model, and a custom Convolutional Vision Transformer. The
objective includes exploring ways to increase their performance and efficiency
in classifying chest radiological images into five distinct classes: COVID-19,
Lung Opacity, Normal, Viral Pneumonia, and Tuberculosis.
3. Comparative Performance Analysis: To conduct a comparative analysis of
the performance of the developed models. This would involve evaluating each
model's predictive accuracy, and other relevant metrics such as precision,
recall, F-score, and ROC-AUC, to determine the most effective model for the
classification task.
4. Enhancing Model Interpretability: To introspect the selected model using
visualization techniques such as convolution visualization or attention map
visualization, to enhance the model's interpretability. This involves
understanding and explaining the decision-making process of the model,
including what parts of the input image it focuses on and how it assigns weights
to make a prediction.
5. Contribution to Academic and Clinical Practice: To provide a significant
contribution to the existing body of knowledge in the field of medical imaging
and deep learning, by offering insights into the application of Convolutional
Vision Transformers in chest X-ray classification. Furthermore, this research
aims to enhance the potential adoption of such models in clinical practice by
presenting a model that not only performs with high accuracy, but also offers
explainability and transparency in its decision-making process.
In achieving these objectives, this research will provide a comprehensive
investigation into the potential application of Convolutional Vision
Transformers in medical imaging, with a focus on interpretability and high
performance.

Research Questions to be Addressed in this Project:

1. How does the performance of Convolutional Vision Transformers compare
with traditional Convolutional Neural Networks and other deep learning
models such as VGG19, ResNet50, and Xception in the classification of
chest radiological images of infectious respiratory diseases?
2. What insights can be derived from convolution visualization or attention
map visualization about the decision-making process of the
Convolutional Vision Transformer model in predicting respiratory
diseases from chest radiological images?



Significance of the Research:
This research stands at the intersection of deep learning and medical imaging,
and seeks to explore the potential of Convolutional Vision Transformers
in the classification of chest radiological images for infectious respiratory
diseases. The significance of this research can be appreciated from multiple
perspectives.
Firstly, the application of a cutting-edge technology such as vision transformers,
whose underlying architecture has primarily been employed in natural language
processing tasks, to the realm of medical imaging is itself an innovative step. It will contribute to the
expanding body of knowledge on the application of transformer models in the
field of computer vision (Dosovitskiy et al., 2020).
Secondly, it directly addresses a persistent challenge in the field of medical
imaging - the lack of interpretability in deep learning models. By utilizing the
self-attention mechanism inherent in transformers, this research could pave the
way for more interpretable models, thereby enhancing their acceptance and
applicability in clinical practice (Vaswani et al., 2017).
Thirdly, the fine-tuning and comparative analysis of various deep learning
architectures will provide valuable insights into the performance and suitability
of these models in classifying chest radiological images for various respiratory
diseases. This has significant implications for healthcare, potentially
improving the speed, accuracy, and efficiency of disease detection and
management, which is particularly important given the ongoing global
pandemic (Kelly et al., 2020).
Finally, the visualization and analysis of the model's decision-making process
will contribute to a deeper understanding of how these models process and
interpret medical images. This knowledge can assist in the refinement of these
models, as well as the development of best practices for their application in the
medical field (Selvaraju et al., 2020).
In summary, this research has the potential to contribute significantly to the
fields of medical imaging, deep learning, and healthcare, advancing our
understanding and capabilities in disease detection and management, and
bringing us closer to the goal of personalized and effective patient care.



Proposed Methodology
The methodology for this project entails a systematic process comprising data preparation, model
development, and comparative evaluation, followed by introspection and analysis of the model's
decision-making process.
1. Data Preparation:
The research will utilize an open-source dataset of chest radiological images available online.
This dataset comprises images categorized into five classes: COVID-19, Lung Opacity, Normal,
Viral Pneumonia, and Tuberculosis.
The initial step involves a thorough quality check and cleaning of the dataset. Only frontal view
images will be used, as these provide the most consistent views and are most commonly used in
medical diagnosis. All images will be checked for their labelling accuracy. The images will then
be preprocessed to ensure they are of a consistent size and format for model training.
Furthermore, data augmentation techniques such as rotation, flipping, and scaling will be applied
to increase the robustness of the model. As of now, the appropriate pre-processing or data
augmentation techniques to be applied have not been strictly specified. Further research and
discussion with the supervisor will be conducted prior to the selection and application of these
techniques.
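To make these candidate techniques concrete, the sketch below (written in Python with TensorFlow/Keras, which is assumed here rather than prescribed by this proposal) shows how resizing, rescaling, and the rotation, flipping, and scaling augmentations mentioned above could be wired into a tf.data pipeline; the image size and augmentation strengths are illustrative placeholders, not final choices.

import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = 224  # assumed target resolution; the final size is still to be decided

# Deterministic preprocessing applied to every split: one size, one value range.
preprocess = tf.keras.Sequential([
    layers.Resizing(IMG_SIZE, IMG_SIZE),
    layers.Rescaling(1.0 / 255),
])

# Candidate augmentation pipeline (rotation, flipping, scaling), training split only.
augment = tf.keras.Sequential([
    layers.RandomRotation(0.05),     # up to roughly +/-18 degrees
    layers.RandomFlip("horizontal"),
    layers.RandomZoom(0.1),          # mild scaling
])

def prepare(ds, training=False):
    """Apply preprocessing (and, for training, augmentation) to a dataset of (image, label)."""
    ds = ds.map(lambda x, y: (preprocess(x), y), num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        ds = ds.map(lambda x, y: (augment(x, training=True), y),
                    num_parallel_calls=tf.data.AUTOTUNE)
    return ds.prefetch(tf.data.AUTOTUNE)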
2. Model Development and Training:
A variety of deep learning models will be developed for comparison. These include well-
established architectures like VGG19, ResNet50, and Xception. Each model will be fine-tuned
using the prepared dataset, with adjustments made to hyperparameters to optimize performance.
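As an illustration of this fine-tuning step, the following sketch (assuming Keras and ImageNet-pretrained weights) attaches a five-class classification head to one of the named backbones; the number of unfrozen layers and the dropout rate are tunable assumptions rather than decided values.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # COVID-19, Lung Opacity, Normal, Viral Pneumonia, Tuberculosis

def build_finetune_model(backbone_name="ResNet50", img_size=224, trainable_layers=30):
    """Attach a small classification head to an ImageNet-pretrained backbone."""
    backbones = {
        "VGG19": tf.keras.applications.VGG19,
        "ResNet50": tf.keras.applications.ResNet50,
        "Xception": tf.keras.applications.Xception,
    }
    base = backbones[backbone_name](include_top=False, weights="imagenet",
                                    input_shape=(img_size, img_size, 3))
    # Freeze all but the top `trainable_layers` layers so only high-level features adapt.
    for layer in base.layers[:-trainable_layers]:
        layer.trainable = False

    inputs = layers.Input((img_size, img_size, 3))
    x = base(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)  # illustrative regularization
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)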
A custom Convolutional Neural Network (CNN) architecture will be developed, leveraging
depthwise separable convolution layers, dilated convolution layers, residual blocks, and batch
normalization.
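The precise design of this custom CNN remains open; the block below is merely one plausible Keras arrangement of the listed components (a depthwise separable convolution, a dilated convolution, batch normalization, and a residual connection), included only to fix ideas.

from tensorflow.keras import layers

def custom_residual_block(x, filters, dilation_rate=2):
    """One candidate building block for the custom CNN (design not final)."""
    shortcut = x
    # Depthwise separable convolution keeps the parameter count low.
    y = layers.SeparableConv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    # Dilated convolution enlarges the receptive field without extra pooling.
    y = layers.Conv2D(filters, 3, padding="same", dilation_rate=dilation_rate)(y)
    y = layers.BatchNormalization()(y)
    # Match channel counts on the shortcut, then add the residual connection.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)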
Finally, a custom Convolutional Vision Transformer model will be developed and trained. This
model will leverage the capabilities of transformers, including their self-attention mechanism, to
classify chest radiological images.
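Because the custom Convolutional Vision Transformer has not yet been finalized, the following is only a minimal Keras sketch of the intended idea: a convolutional stem tokenizes the image into patch embeddings, and a small stack of self-attention encoder blocks classifies the resulting token sequence. The patch size, embedding width, depth, and number of attention heads shown are placeholders.

import tensorflow as tf
from tensorflow.keras import layers

class PatchEncoder(layers.Layer):
    """Adds a learnable positional embedding to each patch token."""
    def __init__(self, n_tokens, dim):
        super().__init__()
        self.n_tokens = n_tokens
        self.pos_emb = layers.Embedding(n_tokens, dim)

    def call(self, tokens):
        positions = tf.range(self.n_tokens)
        return tokens + self.pos_emb(positions)

def conv_vit(img_size=224, dim=128, depth=4, heads=4, num_classes=5):
    inputs = layers.Input((img_size, img_size, 3))
    # Convolutional stem: local feature extraction before tokenization.
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    # Patch embedding via a strided convolution (one token per 8x8 region of the stem output).
    x = layers.Conv2D(dim, 8, strides=8, padding="same")(x)
    n_tokens = x.shape[1] * x.shape[2]
    x = layers.Reshape((n_tokens, dim))(x)
    x = PatchEncoder(n_tokens, dim)(x)

    for _ in range(depth):
        # Transformer encoder block: self-attention and MLP, each with a residual connection.
        a = layers.LayerNormalization()(x)
        a = layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(a, a)
        x = layers.Add()([x, a])
        m = layers.LayerNormalization()(x)
        m = layers.Dense(dim * 2, activation="gelu")(m)
        m = layers.Dense(dim)(m)
        x = layers.Add()([x, m])

    x = layers.LayerNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)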
The models will be trained on a designated training set, using a validation set to tune the model
parameters to avoid overfitting. The model weights will be updated using the backpropagation
algorithm, and optimization algorithms such as Adam or stochastic gradient descent will be used
to minimize the loss function.
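A compile-and-fit sketch of this training setup is given below; the choice of Adam (with SGD as the stated alternative), the learning rate, the epoch budget, and the train_ds and val_ds dataset objects are assumptions made for illustration only.

import tensorflow as tf

model = build_finetune_model("ResNet50")   # or conv_vit(), or the custom CNN
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",  # integer class labels 0-4 assumed
    metrics=["accuracy"],
)
history = model.fit(
    train_ds,                 # training split, prepared as in the data-preparation sketch
    validation_data=val_ds,   # validation split used to monitor overfitting
    epochs=30,                # illustrative budget
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)],
)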



3. Model Evaluation and Comparison:
Once the models are trained, they will be evaluated using a separate test set. The primary
performance metric will be accuracy, but precision, recall, F-score, and ROC-AUC will also be
used for a comprehensive evaluation.
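These metrics could be computed on the held-out test set roughly as follows (a scikit-learn sketch; the test_ds object and the macro averaging are assumptions):

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

y_prob = model.predict(test_ds)                    # class probabilities, shape (n_samples, 5)
y_true = np.concatenate([y for _, y in test_ds])   # integer labels from the test split
y_pred = y_prob.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")  # one-vs-rest ROC-AUC
print(f"acc={accuracy:.4f}  precision={precision:.4f}  recall={recall:.4f}  "
      f"f1={f1:.4f}  auc={auc:.4f}")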
A comparative analysis will be conducted to determine the most effective model. This analysis
will consider not only the performance metrics but also factors such as computational efficiency
and training time.
4. Model Introspection and Analysis:
The most effective model will then undergo further introspection using visualization techniques
such as convolution visualization or attention map visualization. This process aims to understand
the decision-making process of the model, providing insights into what the model focuses on in
the input image, and how it assigns weights when making a prediction. This investigation will
provide valuable insights into the interpretability of the model, a key aspect of the research
objectives.
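As one concrete option for the convolution-visualization route, a Grad-CAM-style sketch (following Selvaraju et al., 2020) is given below; the name of the last convolutional layer depends on the chosen model and is assumed here. For the Convolutional Vision Transformer, the analogous step would be to visualize the attention weights produced by its self-attention layers.

import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    """Return a Grad-CAM heat map (values in [0, 1]) for one preprocessed image."""
    # Note: if the backbone is nested as a sub-model, build grad_model from that sub-model instead.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))  # explain the predicted class by default
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # channel importance weights
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)  # weighted sum of feature maps
    cam = tf.nn.relu(cam)                                   # keep positively contributing regions
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()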
In conclusion, this research employs a comprehensive and rigorous methodology, combining
data preparation, model development, and evaluation with an introspective analysis, to
investigate the efficacy and interpretability of Convolutional Vision Transformers in chest X-ray
classification.



Figure 1: Diagram of the proposed methodology.

About the Dataset:


The dataset for this research project is a comprehensive collection of chest
radiological images, specifically X-rays, that have been meticulously gathered
from various online resources (Basu et al., 2021). It is the result of a
collaborative effort by researchers from Qatar University, Doha, and Dhaka
University, along with their associates from Pakistan and Malaysia, who worked
in close collaboration with medical professionals to ensure the accuracy and
relevance of the data (Basu et al., 2021).
The primary dataset, which consists of four classes - COVID-19, Lung Opacity,
Normal, and Viral Pneumonia - was initially sourced from the COVID-19
Radiography Database on Kaggle. To enhance the diversity and robustness of the
dataset, additional images of Pneumonia and COVID-19 were incorporated from
various other online resources (Basu et al., 2021).
In a significant enhancement to the original dataset, the research team
introduced a fifth class, Tuberculosis, to broaden the scope of the research
and increase the dataset's relevance in the context of infectious respiratory
diseases (Basu et al., 2021).
To maintain consistency and ensure the highest quality of data, only frontal-view
X-ray images were included in the dataset; any lateral-view images that were part
of the original collections were excluded (Basu et al., 2021).
The final dataset, therefore, comprises five classes of chest X-ray images, each
representing a different condition. The distribution of these classes is as
follows:
Table 1: Distribution of the classes of the dataset and their labels.

Disease Name        Number of Samples   Label
COVID-19                 4,189            0
Lung Opacity             6,012            1
Normal                  10,192            2
Viral Pneumonia          7,397            3
Tuberculosis             4,897            4
Total                   32,687

In total, the dataset includes 32,687 samples, making it a substantial resource
for training and evaluating the deep learning models that will be developed as
part of this research project.
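For reference, the class counts in Table 1 translate directly into the label map and class shares below (a small illustrative Python check; note the roughly 2.4:1 ratio between the Normal and COVID-19 classes, which will need to be kept in mind during training and evaluation):

# Sample counts and labels as given in Table 1.
class_counts = {
    "COVID-19": 4189, "Lung Opacity": 6012, "Normal": 10192,
    "Viral Pneumonia": 7397, "Tuberculosis": 4897,
}
labels = {name: i for i, name in enumerate(class_counts)}  # {"COVID-19": 0, ..., "Tuberculosis": 4}
total = sum(class_counts.values())                          # 32,687
for name, n in class_counts.items():
    print(f"{name:16s} label={labels[name]}  {n:6,d} samples  ({100 * n / total:.1f}%)")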
This dataset, with its diverse classes and large number of samples, provides a
solid foundation for our research into the efficacy and interpretability of
Convolutional Vision Transformers in chest X-ray classification.



Project Plan:

November 20 - November 30, 2023: Preliminary Research and Literature Review
December 1 - December 15, 2023: Dataset Collection and Exploration
December 16 - December 30, 2023: Dataset Preparation
January 1 - February 18, 2024: Model Development - Implementing VGG19, ResNet50, Xception, and custom CNN
March 1 - March 15, 2024: Model Development - Implementing and training the custom Convolutional Vision Transformer
March 16 - March 25, 2024: Model Enhancement - Fine-tuning and hyperparameter optimization
April 26 - May 10, 2024: Comparative Performance Analysis of the models
May 11 - May 25, 2024: Enhancing Model Interpretability
May 26 - June 15, 2024: Drafting and Proofreading of the report
July 6 - August 15, 2024: Preparation for Defense
August 16 - August 25, 2024: Final Review, Project Submission, and Defense



References:
• WHO (2020). The top 10 causes of death. https://www.who.int/news-
room/fact-sheets/detail/the-top-10-causes-of-death
• Kelly, B. J., Farness, P., Soto, M. T., Morgan, M., & Ghassemi, M.
(2020). Machine Learning in Medical Imaging. Journal of Nuclear
Medicine Technology, 48(3), 209-219.
• Rajpurkar, P., Irvin, J., Ball, R. L., Zhu, K., Yang, B., Mehta, H., ... &
Langlotz, C. P. (2018). Deep learning for chest radiograph diagnosis: A
retrospective comparison of the CheXNeXt algorithm to practicing
radiologists. PLoS Medicine, 15(11), e1002686.
• Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., & Summers, R. M.
(2017). ChestX-ray8: Hospital-scale chest x-ray database and
benchmarks on weakly-supervised classification and localization of
common thorax diseases. Proceedings of the IEEE conference on
computer vision and pattern recognition, 2097-2106.
• Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16
words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929.
• Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable
Convolutions. Proceedings of the IEEE conference on computer vision
and pattern recognition, 1251-1258.
• He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning
for Image Recognition. Proceedings of the IEEE conference on computer
vision and pattern recognition, 770-778.
• Kelly, B., Squizzato, S., Parascandolo, P., Kalkreuter, N., Kashani, R., &
Rajpurkar, P. (2020). An overview of deep learning in medical imaging
focusing on MRI. Zeitschrift für Medizinische Physik, 30(2), 102-116.
• LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature,
521(7553), 436–444.



• Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., ... &
Lungren, M. P. (2017). Chexnet: Radiologist-Level Pneumonia
Detection on Chest X-Rays with Deep Learning. arXiv preprint
arXiv:1711.05225.
• Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., &
Batra, D. (2020). Grad-CAM: Visual Explanations from Deep Networks
via Gradient-Based Localization. International Journal of Computer
Vision, 128(2), 336-359.
• Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional
Networks for Large-Scale Image Recognition. arXiv preprint
arXiv:1409.1556.
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.
N., ... & Polosukhin, I. (2017). Attention is All You Need. Advances in
neural information processing systems, 5998-6008.
• Basu, A., Das, S., Ghosh, S., Mullick, S., Gupta, A., & Das, S. (2021).
Chest X-Ray Dataset for Respiratory Disease Classification. Harvard
Dataverse, V5. https://doi.org/10.7910/DVN/WNQ3GI

