
2022 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT)

CHRIST (Deemed to be University), Pune Lavasa Campus, India. Oct 14-15, 2022

An effective Approach for Pneumonia Detection using Convolution Vision Transformer

2022 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT) | 978-1-6654-5361-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/TQCEBT54229.2022.10041662

Somya Gupta, CHRIST (Deemed to be University), India, somya.gupta@msds.christuniversity.in
Andrea Rodrigues, CHRIST (Deemed to be University), India, andrea.rodrigues@christuniversity.in
Parul Joshi, CHRIST (Deemed to be University), India, parul.joshi@msds.christuniversity.in
Jossy George, CHRIST (Deemed to be University), India, frjossy@christuniversity.in

Abstract – Early detection of pneumonia in patients through effective medical imaging may enable timely remedial measures and reduce the severity of the infection. There has been an increase in cases among newborns, teenagers, and people with underlying health issues in recent years. The COVID-19 pandemic also revealed the major impact pneumonia has on the lungs and the consequences of delayed detection. The presence of the infection in the lungs is examined through chest X-ray images; however, for an early diagnosis of the infection, this paper proposes an automated model as a more effective alternative. The Convolutional Vision Transformer (CVT), which achieves an accuracy of 97.13% and is a robust combination of Convolution and the Vision Transformer (ViT), is suggested in this paper as a potential model for detecting pneumonia early in patients.

Keywords – Chest X-ray, Pneumonia, Detection, Vision Transformer, Convolutional Neural Network, Convolutional Vision Transformer, Deep Learning.

I. INTRODUCTION

Developing countries have registered an increase in cases of pneumonia. Excess pollution, poor living conditions, population growth, and scarcity of medical infrastructure have contributed to the increasing rate of infection. Overcoming these setbacks is possible by looking at how technology can be used on a wide scale to detect the condition before it reaches a point of severity. Patients diagnosed early will be able to receive specialized treatments. Because pneumonia can be detected only from chest X-ray images, there is a setback: assessing the images is a challenging task and comes with high health risk [1].

Increasing the chance of survival for pneumonia patients is possible through timely examination. The method of diagnosing the infection with chest X-ray images is risky and subject to several health concerns. An automated algorithm is required which will improve the quality of healthcare and help in better, more accurate diagnosis of the infection. Such an algorithm will prove to be both time- and life-saving for the stakeholders [2]. Among the growing technological advancements, computer vision is a precise way to identify infection with the help of deep learning frameworks.

As per the findings of studies in the health sector, Convolutional Neural Networks (CNNs) have been instrumental in the examination of chest X-ray images for the identification of the infection. However, the accuracy obtained with them is not satisfactory. Other deep learning models used for the same purpose are VGG-19 and ViT. ViT is an upcoming model for image processing which is still under development. ViT has lately surpassed CNN performance for image classification; however, the expense of pre-training is still exorbitant given the huge outside datasets [3].

A combination of the Vision Transformer with convolutional and pooling layers has been employed in this study to examine chest X-ray images. The CVT system combines two widely used architectures, Convolutional Neural Networks (CNNs) and Transformer-based models, to overcome some important limitations of each approach on its own. By leveraging both techniques, the vision system can outperform existing architectures, especially at low data levels, while achieving similar performance in the large-dataset regime. This occurs without compromising on accuracy or speed [4].

II. LITERATURE REVIEW

Debaditya Shome et al. [5] described the Vision Transformer (ViT), which supports deep learning for COVID-19 prediction from chest X-ray images. A total of 30K chest X-ray images were gathered and used; the dataset for the study was composed by mixing different open-source datasets. Very good accuracy scores reaching 98% and 99% were noted. A self-operating pneumonia detection model works effectively for early prediction of pneumonia in many places, as reported by Khushal Tyagi et al. [6]. They employed three different models, a Convolutional Neural Network (CNN), VGG16, and the Vision Transformer (ViT), and compared the outcomes of all three. The outcomes suggested that ViT can recognize pneumonia with 96.45% accuracy. This model can be used to prevent unfortunate situations in far-off places.

As explained by Boyuan Wang et al. [7], early detection of COVID-19 through deep learning could help curb the spread of the virus to some extent and also decrease hospital costs. The Swin Transformer (ST) is



chosen as the second network to build a model for the same purpose. This model was then compared with the Vision Transformer (ViT) and a Convolutional Neural Network (CNN), and it outshone both of the other models. When pre-trained on large data and transferred to many mid-size or small image recognition benchmarks, the Vision Transformer (ViT) attains excellent outcomes compared with Convolutional Neural Networks (CNNs) while requiring substantially fewer computational resources to train, as shown in the paper by Alexey Dosovitskiy et al. [8].

Sangjoon Park et al. [9] described the situation in which the chest X-ray images taken yearly in hospitals are not suitable for use because there is little manual labelling by experienced people, mainly in underprivileged areas. They demonstrated a deep learning framework that makes use of knowledge distillation and self-supervised learning and training, which is proof that the functionality of a model developed with a smaller number of labels may eventually improve with increased unlabelled data. Test results show that the suggested model attains better robustness than existing approaches and can be applied easily to the diagnosis of varied infections and conditions, including tuberculosis (TB), COVID-19, and pneumothorax. As explained by Tathagat Banerjee et al. [10], various types of pneumonia have similar and difficult symptoms. In their paper, they assessed the problem and attempted to devise an architecture using an Unsupervised Generative Transfusion Network (UGTN). This architecture is drafted with the help of state-of-the-art Transformers and eight-layer decoders and encoders to arrive at accuracies of 92% and 81%, which are equal to, and in some cases higher than, the predecessor networks. Finally, the researchers suggest a firm UGTN architecture of an alternative style with the features of Generative Adversarial Networks (GANs).

Xiaohong Gao et al. [11] described vision transformer architectures built upon attention models that are scalable when compared with CNN-based models. In comparison with the CNN-based model DenseNet for COVID-CT, the ViT model appears to perform better with 76.6% accuracy, whereas DenseNet realized an accuracy of 73.7%. Since chest CT images are 3D volumes, it is natural to process these data in 3D form. As addressed previously, the lesioned regions are proportionally small compared to the whole volume, which might constitute the main reason that 3D-based systems perform far worse than 2D-based models. Koushik Sivarama Krishnan and Karthik Sivarama Krishnan [12] explained that the lungs of people infected with pneumonia swell up as the infection progresses. Examining chest X-rays is thus a way to identify pneumonia. In their paper they employed an automated and effective technique for distinguishing COVID-19 from pneumonia, normal chest X-rays, and lung opacity. If a large version of the Vision Transformer (ViT) is used with a better and bigger dataset, it is possible to reach higher performance metrics.

Nam Nguyen and J. Morris Chang [13] proposed a better model for pneumonia assessment which is an amalgamation of model-centric and data-centric approaches. They begin by formulating data-focused pre-training for the very scarce data conditions of the research dataset. After that, they presented two hybrid models employing Dense Associative Memory (Modern Hopfield Network) and Convolution-Attention Neural Architectures, which preserve the self-attention of the Vision Transformer (ViT). Their approach results in a more effective model in comparison to the conventional baseline approach. The better-performing model in their study achieved R2 = 0.85 ± 0.05 and Pearson correlation coefficient ρ = 0.92 ± 0.02 in geographic extent, and R2 = 0.72 ± 0.09, ρ = 0.85 ± 0.06 in blur prediction. A deep learning-based framework was described by Mohamad Mahmoud Al Rahhal et al. [14] for the detection of COVID-19 in Computed Tomography (CT) and X-ray images. The ViT model was employed in a pipeline which used a Siamese encoder; the Siamese encoder enabled distillation and the class token. They used atrous convolution at varied rates to extract deeper features from multi-scale feature maps. Enlarging the dataset was possible with the help of data augmentation, which generated adversarial examples that enhanced the performance and results of the model.

III. PROPOSED FRAMEWORK

A. Dataset

The dataset for this study is procured from Kaggle and consists of 5856 images divided into two sets, train and test, each of which has two categories, "Normal" and "Pneumonia", as shown in Fig. 1. In the training set, the number of chest X-ray images of patients who do not have pneumonia is 1346, and the number of X-rays of patients who have pneumonia is 3883. In the test set, the division is 234 X-rays indicating no pneumonia and 390 showing pneumonia.

Fig. 1. Two Categories of the Dataset

1) Data Resizing and Augmentation

Fig. 2 shows that the dataset is not distributed in a balanced way. This imbalance may result in biased predictions or produce overfitting problems. Data augmentation was applied on the training set to remove this unevenness.

Data augmentation is a technique which helps in producing synthetic data from already existing data. The data is altered using various transformations such as scaling, cropping, or rotation, but the meaning of the data remains intact. This technique helps in increasing the size of the dataset as well as balancing the imbalanced classes by adding synthetic data. Hence this technique has been used to deal with the problem of imbalanced data. The following augmentations have been applied on the data:

• Randomly rotated some training images by two degrees.

• Randomly zoomed some training images by 2%.

• Randomly flipped images horizontally.
Fig 2: Imbalanced Dataset

2) Methodology

The model presented in this paper is a combination of two types of algorithms: a) Convolutional Neural Networks and b) the Vision Transformer.

In this model the Vision Transformer is used with some changes: convolution layers are added into the Vision Transformer algorithm.

The pipeline, or flow, of the model is depicted in Fig. 3. This flowchart gives a basic understanding of the model from the initial process of data aggregation to deployment.

Fig. 3: Structure of Convolutional Vision Transformer

B. Convolutional Neural Networks

CNNs are deep learning models which consist of feed-forward networks and different layers such as convolutional layers and pooling layers. Each of these layers has its own functionality, and together they form the architecture of a CNN, which is widely used for image recognition and classification.

• Convolutional Layers – In this layer, filters are applied to extract information about various features of the image. The output of this layer is termed a feature map, and it provides information about the edges and corners of the image.

• Pooling Layer – The main purpose of this layer is to decrease the dimension or size of the feature map by summarizing the features, converting it into a smaller matrix with the significant or relevant features.

• Fully Connected Layer – This is the last layer of the architecture and, as the name suggests, it consists of fully connected layers in which all neurons are linked to each other. The output of the pooling layer is flattened and passed on, and then the classification process commences.

C. Vision Transformer

The concept of transformers was originally taken from Natural Language Processing (NLP). Transformers have been used for years in the NLP domain and have attained popularity for their simplicity and high performance.

The Vision Transformer model was introduced in a research paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." A total of twelve scholars from the Google Research Brain Team, including Neil Houlsby and Alexey Dosovitskiy, developed this model.

The input image of ViT is broken into patches of the same size. The patch size is chosen by trial and error. From the patch size and the image size, the number of patches is calculated with the formula:

number_of_patches = (image_size / patch_size)^2

These patches, also known as tokens, are flattened into vector format. Then a layer of position embedding is added to the patches to preserve positional information [5]. Finally, these tokens pass through the transformer encoder consisting of three layers:

• Multi-Head Self-Attention (MSA) Layer

• Layer Norm (LN)

• Multi-Layer Perceptron (MLP) Layer

The following equations have been used for the positional embedding of the patches:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
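The patch-count formula and the sinusoidal position-embedding equations of the Vision Transformer can be sketched in NumPy. This is an illustrative sketch using the final settings reported later in Table I (image size 72, patch size 6, projection dimension 64), not the authors' code:

```python
import numpy as np

def sinusoidal_position_embedding(num_tokens: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same)."""
    pos = np.arange(num_tokens)[:, None]      # token (patch) index
    i = np.arange(d_model // 2)[None, :]      # dimension-pair index
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.empty((num_tokens, d_model))
    pe[:, 0::2] = np.sin(angle)               # even embedding dimensions
    pe[:, 1::2] = np.cos(angle)               # odd embedding dimensions
    return pe

image_size, patch_size, d_model = 72, 6, 64
num_patches = (image_size // patch_size) ** 2  # (72 / 6)^2 = 144
pe = sinusoidal_position_embedding(num_patches, d_model)
print(pe.shape)  # (144, 64)
```

Each of the 144 patch tokens thus receives a distinct 64-dimensional position vector that is added to its embedding before the transformer encoder.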
D. Convolution Vision Transformer (CVT)

The model proposed in this paper consists of a Vision Transformer along with added convolutions.

Instead of adding the full CNN architecture, only a convolution layer and a pooling layer are added before the positional embedding layer of the transformer. The image patches are first passed to the convolutional and pooling layers, and only then embedded, followed by the transformer encoder and finally the output (as shown in Fig. 4).

Fig. 4: Structure of Convolution Vision Transformer

The convolutions have been placed after the patch extraction with the intention of obtaining important features from each patch, instead of passing the whole image through the convolutions and then dividing the obtained output into patches. This setup captures the importance of each patch, which is then passed on to the transformer encoder and helps the model learn the features efficiently, even on a smaller dataset. Notably, this setup also gives the best model accuracy.

IV. EXPERIMENTAL SETUP

After setting up the algorithm, the model was tried and tested with different parameters, and each time the accuracy and loss graphs were plotted to check for overfitting. Fig. 5 depicts the initial stage of the model, which performed extremely well on the training set but proved not to be as efficient on the testing set, indicating the presence of bias and model overfitting. Hence, to remove it, the model was executed several times before the final model was achieved, which produced promising findings and showed no overfitting (as shown in Fig. 6).

Fig. 5: Model Accuracy and Loss (overfitting results)

Fig. 6: Model Accuracy and Loss (without overfitting)

Table I shows the initial and final values of the parameters which were modified to achieve the final model.

TABLE I. DIFFERENT PARAMETERS OF THE MODEL

Parameters                 Initial Value    Final Value
Learning Rate              0.0001           0.00001
No. of Epochs              75               120
Batch Size                 128              256
No. of Heads               4                4
Patch Size                 9                6
Projection Dimension       64               64
Image Size                 90               72
No. of Transformer Layers  8                8
Number of Patches          100              144
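The final settings in Table I are internally consistent: with a 72x72 input and 6x6 patches, (72/6)^2 = 144 patches. They can be collected in a configuration dictionary (the key names below are our own illustration, not taken from the authors' code):

```python
# Final hyperparameters from Table I, gathered as a config sketch.
config = {
    "learning_rate": 1e-5,
    "epochs": 120,
    "batch_size": 256,
    "num_heads": 4,
    "patch_size": 6,
    "projection_dim": 64,
    "image_size": 72,
    "transformer_layers": 8,
}

# The number of patches is derived, not free: it must match Table I.
num_patches = (config["image_size"] // config["patch_size"]) ** 2
print(num_patches)  # 144
```

The same check applied to the initial values (image size 90, patch size 9) gives (90/9)^2 = 100, again matching the table.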

V. RESULTS AND DISCUSSION

After training and testing the algorithm, the model's efficiency can be stated based on four performance metrics, namely Precision, Recall, F1 score, and Accuracy.

• Accuracy – It gives the extent to which the model is able to make correct predictions. The formula for accuracy is:

Accuracy = Correct_predictions / Total_predictions

• Precision – It estimates how many values are really positive out of all the values or classes predicted as positive by the model.

• Recall – It estimates, out of all the positives in the data, how many the model is able to predict correctly.

• F1 Score – The F1 score is another evaluation metric used to assess a model's performance. Usually, the F1 score is used instead of precision and recall for evaluating a model built on an imbalanced dataset. This is because the F1 score combines precision and recall, hence bias is eliminated from this evaluator:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Firstly, the training and testing accuracy obtained by the model is 97.13% and 93.43% respectively, which indicates that the model is performing very well on both training and testing data. Both accuracies have been plotted and are shown in Fig. 7.

Fig. 7: Train and Test Accuracy

The confusion matrix is plotted for evaluation and is shown in Fig. 8.

Fig. 8: Confusion Matrix of the Model

The confusion matrix is essential as it aids in computing the False Positives (FP), True Positives (TP), False Negatives (FN), and True Negatives (TN) of the prediction, where:

FP indicates that the prediction is positive but in reality the case is negative.

TP means that the case is positive both in reality and in the prediction.

FN indicates that the prediction is negative but in reality the case is positive.

TN indicates that the prediction is negative and the same is the case in reality.

Here the two classes defined are 'Pneumonia' and 'Normal' instead of Positive and Negative. The following can be inferred from the plotted confusion matrix:

• 368 images have been correctly predicted as Pneumonia.

• 22 images have been predicted as Normal but actually belong to the Pneumonia class.

• 215 images have been correctly predicted as Normal.

• 19 images have been incorrectly predicted as Pneumonia.
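The four counts above can be cross-checked against the reported metrics. A short sketch (treating 'Pneumonia' as the positive class, an assumption consistent with the test-set totals of 390 pneumonia and 234 normal images; this is our own verification, not the authors' code):

```python
# Confusion-matrix counts reported above.
tp, fn = 368, 22   # pneumonia images: correctly / incorrectly classified
tn, fp = 215, 19   # normal images: correctly / incorrectly classified

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_pneumonia = 2 * precision * recall / (precision + recall)

# Same metrics with 'Normal' treated as the positive class (class 1 in Table II).
precision_n = tn / (tn + fn)
recall_n = tn / (tn + fp)
f1_normal = 2 * precision_n * recall_n / (precision_n + recall_n)

print(round(accuracy, 4), round(f1_pneumonia, 2), round(f1_normal, 2))
# 0.9343 0.95 0.91
```

These values reproduce the 93.43% test accuracy and the per-class F1 scores of 95% and 91% reported in Table II, confirming the counts are consistent.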

Finally, to obtain the values of precision, recall, and F1 score, a classification report has been employed. The report obtained for the model is shown in Table II.

TABLE II. F1 SCORE, RECALL AND PRECISION

Here, 0 refers to the 'Pneumonia' class and 1 refers to the 'Normal' class.

The F1 scores for the two classes, 0 and 1, are 95% and 91% respectively, which is a very good score and it

indicates that the model is able to predict and deal with both classes effectively.

Table III depicts the comparison of the model with other existing models. It can be seen that CVT outperforms all the models on every evaluation metric. Also, as mentioned before, CVT can perform well even with less data and therefore requires less computational power, which makes CVT better than the others.

TABLE III. COMPARISON ACROSS MODELS

VI. CONCLUSION

Employing the Convolution Vision Transformer (CVT) to examine chest X-ray images was proposed to arrive at a timely detection of pneumonia. The CVT is a combination of the Vision Transformer with convolutional and pooling layers. This hybrid model not only outperforms both CNN and ViT, but also overcomes the limitations of both approaches. The final train and test accuracy obtained by the model is 97.13% and 93.43% respectively. Pneumonia is an extremely critical health issue, and early detection can be life-saving for some patients. This model will help in faster and more accurate examination for pneumonia and will help the healthcare sector to function effectively.

REFERENCES

[1] R. Kundu, R. Das, Z. W. Geem, G. T. Han and R. Sarkar, "Pneumonia detection in chest X-ray images using an ensemble of deep learning models," PLOS ONE, 2021.
[2] B. Almaslukh, "A Lightweight Deep Learning-Based Pneumonia Detection Approach for Energy-Efficient Medical Systems," Wireless Communications and Mobile Computing, p. 14, 2021.
[3] S. d'Ascoli, H. Touvron, M. L. Leavitt, A. S. Morcos, G. Biroli and L. Sagun, "ConViT: Improving vision transformers with soft convolutional inductive biases," International Conference on Machine Learning, pp. 2286-2296, 2021.
[4] A. Razzaq, "Marktechpost.com," 20 July 2021. [Online].
[5] D. Shome, T. Kar, S. N. Mohanty, P. Tiwari, K. Muhammad, A. AlTameem and A. K. J. Saudagar, "COVID-Transformer: Interpretable COVID-19 detection using vision transformer for healthcare," International Journal of Environmental Research and Public Health, 2021.
[6] K. Tyagi, G. Pathak, R. Nijhawan and A. Mittal, "Detecting Pneumonia using Vision Transformer and comparing with other techniques," 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA), pp. 12-16, 2021.
[7] B. Wang, D. Zhang and Z. Tian, "STCovidNet: Automatic Detection Model of Novel Coronavirus Pneumonia Based on Swin Transformer," 2022.
[8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner and N. Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," 2020.
[9] S. Park, G. Kim, Y. Oh, J. B. Seo, S. M. Lee, J. H. Kim and J. C. Ye, "AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation," 2022.
[10] T. Banerjee, S. Karthikeyan, A. Sharma, K. Charvi and S. Raman, "Attention-Based Discrimination of Mycoplasma Pneumonia," Proceedings of International Conference on Computational Intelligence and Data Engineering, pp. 29-41, 2022.
[11] X. Gao, Y. Qian and A. Gao, "COVID-VIT: Classification of Covid-19 from CT chest images based on vision transformer models," p. 6, 2021.
[12] K. Sivarama Krishnan and K. Sivarama Krishnan, "Vision Transformer based COVID-19 Detection using Chest X-rays," p. 5, 2021.
[13] N. Nguyen and J. M. Chang, "COVID-19 Pneumonia Severity Prediction using Hybrid Convolution-Attention Neural Architectures," 2021.
[14] M. M. Al Rahhal, Y. Bazi, R. M. Jomaa, A. AlShibli, N. Alajlan, M. L. Mekhalfi and F. Melgani, "COVID-19 Detection in CT/X-ray Imagery Using Vision Transformers," Journal of Personalized Medicine, 2022.

