

Gastrointestinal image classification based on
VGG16 and transfer learning
Benkessirat Amina, LRDSI Laboratory, Blida 1 University, Blida, Algeria (aminabks-gsi@outlook.com)
Benblidia Nadjia, LRDSI Laboratory, Blida 1 University, Blida, Algeria (benblidia@yahoo.com)
Beghdadi Azeddine, L2TI Laboratory, Institute of Galilee, Sorbonne Paris North University, Paris, France (azeddine.beghdadi@univ-paris13.fr)

Abstract—Investigational procedures and medical diagnosis can be greatly improved by automatically detecting abnormalities and anatomical landmarks in medical images. However, this remains a challenging and still largely unexplored task. This paper investigates the capabilities of a pre-trained deep convolutional neural network, the VGG-16 model, used with transfer learning to categorize images containing anatomical landmarks, pathological findings and endoscopic procedures. Data augmentation is also performed to highlight the importance of data size for deep models. The accuracies achieved before and after data augmentation are 96.9% and 98.8% respectively.

Index Terms—VGG-16, Transfer Learning, classification, categorization, data augmentation, endoscopy

I. INTRODUCTION

When healthy cells in the lining of the gastrointestinal (GI) tract grow out of normal control, something abnormal happens, such as rectal bleeding, inflammation or mass formation [1]. When abnormalities are detected early, a cure is often possible [2]. Therefore, to save the patient's life, the affected cells must be found at an early stage. Every year, roughly 2.8 million new cases of stomach, colorectal, and esophageal cancer are diagnosed, with 1.8 million deaths [3]. The gold standard for GI tract investigation is endoscopic examination, which requires both high-tech equipment and highly skilled staff. In this process, the interior of the GI tract is examined in real-time video using high-quality digital endoscopes, and all normal and abnormal findings must be reported [3], which can be a time-consuming task for the physician, who is expected to visually analyse all the sequences. To significantly lower the cost of these investigational procedures, machine learning (ML) algorithms could provide an automatic classification system.

Tremendous efforts have been put into the field of medical image classification, even though images are acquired in different ways, which gives them a different nature. Furthermore, different types of deterioration, such as blur, noise, and contrast flaws, affect them [4]. Since traditional medical image classification methods are based on visual features and their combinations [5], such as colour, shape and texture, classification performance is strongly affected by intrusive visual elements. Furthermore, visual features fall into a low-level feature space, which lacks the ability to represent high-level problem-domain concepts and generalizes rather poorly. Consequently, traditional medical image classification methods have been unable to meet the needs of real medical applications and to deal with medical image problems. Deep learning (DL) techniques are now used to solve difficult tasks and to improve the performance of existing traditional techniques; they make it possible for computers to learn extremely complicated mathematical models for data representation. The computational model is a classical Artificial Neural Network (ANN) [6], [7] with many hidden layers of neurons; this large number of layers inspires the name DL [6]. DL techniques are characterized by a strong generalization ability: once a DL model is trained for a particular task, it can perform that task accurately on a variety of testing data [6]. Convolutional neural networks (CNNs) are the most popular image processing models to date [7]. The dataset used in our work (Kvasir) became accessible in the fall of 2017 as part of MediaEval's Medical Multimedia Challenge, a benchmarking project that assigns challenges to the scientific community [3]. The main contribution of this study is to improve on the performance of the MediaEval challenge methods by implementing a transfer learning CNN for the automated classification of each Kvasir class. The advantage of our solution is that it achieves high performance without requiring hand-crafted attribute extraction. Also, to maximize accuracy and avoid overfitting, emphasis is given to data augmentation. The remainder of this paper is organized as follows. The proposed solution is presented in Section 3 after a review of related work in Section 2. Section 4 describes the experimental results on the Kvasir dataset, and Section 5 closes with a broad conclusion as well as research recommendations.

II. RELATED WORK

As part of the MediaEval challenge, researchers developed systems that are efficient in terms of categorization rate and execution time, which makes them reliable. Agrawal et al. [8] built three classification models based on the baseline features provided with the dataset and on features extracted by Inception-v3 and VGGNet.
The first approach uses the baseline features and the features extracted by Inception-v3. The second approach uses the baseline features and the features extracted by VGGNet. The third combines the features extracted by both networks as well as the baseline features. In all three cases the classification was carried out with an SVM model, and they reached classification accuracies of 95.9%, 95.3% and 96.1% respectively. The disadvantage is that the feature extraction step is an additional, necessary step compared to DL techniques. Pogorelov et al. [9] evaluated three different classification approaches: classification based on global features (GF), classification using a CNN, and classification using transfer learning in DL (TFL). In the first approach, the GF were extracted with the Lire open-source software, and the classification was performed with the Random Forest (RF) and Logistic Model Tree (LMT) classifiers from Weka; two variants were carried out, one using all the extracted GF and the second using only two GF. In the second approach, two models were built from scratch, the first with 3 and the second with 6 convolution layers, using the ReLU activation function for the convolutional layers and max pooling for the pooling layers; a dropout of 0.5 was included in all layers, the last two dense layers for classification used the activation functions ReLU and sigmoid respectively, and both networks were trained for 200 epochs with the Adam optimizer. The third approach is based on transfer learning, re-training and fine-tuning the pre-trained Inception v3 model. They reached classification accuracies of 96.4%, 95.9% and 95.7% for the three approaches respectively. Naqvi et al. [10] used the six baseline features provided, in addition to two extracted texture features. The model was tested with 10-fold cross-validation. It was noticed that some features performed very poorly, so they were removed. Using logistic regression and kernel discriminant analysis, they formed a separate model for each feature; the overall technique was then applied for prediction. They also used K-means clustering to obtain a reduced dataset representing the entire distribution. They reached a classification accuracy of 94.2%. The required feature extraction step is again a disadvantage. Petscharnig et al. [11] proposed a CNN architecture based on GoogLeNet and already existing CNN models; the prominent architectural feature of their proposal is the inception module. They reached a classification accuracy of 93.9%. Liu et al. [12] developed a two-stage model: the first stage learns a data representation based on bidirectional marginal Fisher analysis (BMFA), and the second performs image categorization with an SVM. They reached a classification accuracy of 92.6%. Apart from the methods developed in the context of the challenge, Cogan et al. [13] propose an automatic preprocessing framework that applies filtering, edge removal, color mapping, scaling and contrast enhancement to each image in the dataset. They perform data augmentation before training several predefined neural network architectures. Using Inception-v4, Inception-ResNet-v2, and NASNet, the obtained accuracies were 98.45%, 98.48%, and 97.35%, respectively. Their work seems original in the existing literature, as it incorporates a preprocessing framework; however, the contribution of their proposal is not fully justified, since they do not report the accuracies obtained before preprocessing.

All these methods have shown their effectiveness and have achieved acceptable classification rates. In the following, we present a flexible model that outperforms the MediaEval challenge methods.

III. CNN MODEL AND ARCHITECTURE

Neural networks, particularly Convolutional Neural Networks (CNNs), are among the most important technologies for digital image analysis, including image classification. A CNN is typically modelled on the mammalian visual cortex; consequently, it is successfully applied to vision recognition tasks [6], [7]. Classifying WCE images is a challenging task, as several factors produce negative changes in the images, such as specular reflections (light saturation), uneven distribution of pixels (vignetting), blurred areas and dark areas [14], [15]. CNNs have the ability to learn features directly from the input image dataset. In what follows we define the general structure of a CNN.

A. Convolutional Neural Network Model

CNNs are similar to ANNs in that they are constituted of neuron layers that are optimised automatically via learning. It has been shown that CNNs are strongly discriminative for real-world pattern recognition, and that they can learn both the global and local structures of images [14], [15]. They are among the strongest learning algorithms for recognizing image information and have shown outstanding success in segmentation, classification, detection and retrieval tasks [15]. Similar to standard ANNs, CNNs are multilayer networks, except that these layers are not simple perceptrons [6] [16]. CNN layers fall into four categories [7] [6] [16] [17]: convolutional layers, activation layers, pooling layers, and fully connected layers.

1) Convolutional layers: The convolution operations are done with so-called convolutional kernels (or filters) [6]. Convolutional operations extract useful features from locally correlated data points, while preserving the spatial relationships between pixels [15]. The convolution result serves as input to the activation layer.

2) Activation layers: They consist of a non-linear processing unit called the activation function [15]. This non-linear unit introduces non-linearity into the feature space and contributes to learning abstractions, thus enabling the network to learn differences between images at the semantic level. The activation result serves as input to the pooling layer.
3) Pooling layers: This layer performs subsampling, to summarize the results and to make the input invariant to geometrical distortions [16] [15]. The principle is to calculate one output value v for an n×n grid of the activation map, where v is the maximum (max-pooling) or average (average-pooling) value of that grid in the activation map [6]. The popular VGG-16 network considers the pooling layer to be part of the convolutional layer.
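As a toy numeric illustration of this principle (ours, not from the paper), max-pooling and average-pooling reduce one 2×2 grid of an activation map to a single value:

```python
# Toy illustration: pooling a single 2x2 grid of an activation map.
import numpy as np

grid = np.array([[1.0, 3.0],
                 [2.0, 4.0]])

print(grid.max())   # max-pooling:     v = 4.0
print(grid.mean())  # average-pooling: v = 2.5
```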
4) Fully connected layers: Usually used at the end of a CNN, dedicated to performing classification [15]. This layer serves to represent the input signal (e.g. images) compactly [6].

5) Other layers: There exist other, regulatory layers besides the mapping layers mentioned above (batch normalization, dropout, ...). These layers are incorporated to optimize CNN performance [6] [15]. Batch normalization is incorporated to address problems due to internal covariate shift (a change in the distribution of hidden units' values) within feature maps [6] [15]. Dropout is incorporated to address the overfitting problem by introducing regularization within the network, improving generalization by randomly skipping some units or connections with a certain probability. Figure 1 demonstrates the structure of a CNN model.

Fig. 1. Structure of a CNN
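To make the four layer categories concrete, the following minimal sketch (illustrative only; the layer counts and sizes are arbitrary and are not those of any model in this paper) assembles them with the Keras API:

```python
# Minimal CNN illustrating the four layer categories discussed above.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),          # input image
    layers.Conv2D(32, (3, 3), padding="same"),  # convolutional layer
    layers.Activation("relu"),                  # activation layer
    layers.MaxPooling2D(pool_size=(2, 2)),      # pooling layer (max-pooling)
    layers.Flatten(),
    layers.Dense(64, activation="relu"),        # fully connected layer
    layers.Dense(8, activation="softmax"),      # classification output
])
model.summary()
```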
B. Transfer learning in CNN

To train a deep CNN, a huge amount of labelled training data is required to ensure high classification performance. Nevertheless, medical image acquisition and annotation is a challenging task. When such difficulties arise, the use of pre-trained CNNs such as VGG-16 [18] via transfer learning has been proven efficient for image analysis [18]. VGG-16 is a network pre-trained on the ImageNet dataset [18]; it has a remarkable feature extraction capability, which leads to a high image classification rate. Figure 2 demonstrates the structure of the VGG-16 model.

All hidden layers are equipped with the Rectified Linear Unit (ReLU) function, which computes f(x) = max(x, 0). As mentioned above, the first convolutional layer receives an image of size 224x224. The convolutional layers use a stride of 1 pixel. The max pooling layers use a stride of 2 pixels (for spatial pooling). Three fully connected layers follow the set of convolutional layers; their channel sizes are 4096, 4096 and 1000 respectively. Finally, a softmax layer performs the classification. VGG-16 is widely and efficiently used because of its ability to generalize to other datasets [18].

Fig. 2. Structure of VGG-16 model

C. Proposed Architecture

Only the last three layers of VGG-16 were fine-tuned in our work. This decision was motivated by the following observations [19]:
• The first layers of a pre-trained CNN contain information about edges and color.
• The later layers offer information about class specifics.

VGG-16 was trained on more than 1 million images and was designed to categorize them into 1000 classes [18]. The last three VGG-16 layers perform classification considering these 1000 classes; consequently, these are the layers that need to be fine-tuned for a new categorization task [19] [20]. The concept is therefore to keep all layers as they are, except the last three, and to configure three new layers, namely a fully connected layer, a softmax layer, and a classification output layer, to perform the new classification task (a sketch of this adaptation is given after Eq. (1) below). The size of the fully connected layer must equal the number of classes to be distinguished [20]; in our current work this size is 8, which corresponds to the number of classes, as described in the next section.

In the training step, we used the RMSprop optimizer; the momentum was 0.9, the initial learning rate was 1e-1 and the final learning rate was 1e-4. This was done to achieve the best accuracy and to minimise the binary cross-entropy loss function E, which is given by:

E = -\frac{1}{M} \sum_{m=1}^{M} \left[ y_m \log(h_{\hat{\theta}}(x_m)) + (1 - y_m) \log(1 - h_{\hat{\theta}}(x_m)) \right]    (1)

where M is the number of training instances, y_m is the target label for training instance m, x_m is the input for training instance m, and h_{\hat{\theta}} is the model with neural network weights \hat{\theta}.
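The following sketch illustrates this adaptation as we read it from the description above; it is not the authors' released code, and the decay schedule is an assumption chosen to take the learning rate from 1e-1 towards 1e-4:

```python
# Illustrative VGG-16 transfer-learning setup for 8 Kvasir classes
# (a sketch of the described approach, not the authors' code).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained convolutional base; the original 1000-class head is dropped.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # early layers (edge/color features) are kept as-is

# New head: a fully connected layer sized to the 8 classes, with softmax.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(8, activation="softmax"),
])

# RMSprop with momentum 0.9; the decay steps/rate are assumptions that
# move the learning rate from 1e-1 towards 1e-4 over training.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-1, decay_steps=1000, decay_rate=0.5)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule, momentum=0.9)

# The paper states the cross-entropy loss of Eq. (1); for an 8-way
# softmax head, the categorical form is the usual multi-class analogue.
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```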
In a second series of experiments, to alleviate the overfitting problem caused by the small training set size, we carried out dataset augmentation. This was performed by reflecting images at random along both the vertical and horizontal axes. In effect, the advantage of mirroring is that it helps the network learn the anatomical structure viewed from different perspectives [21].
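A minimal sketch of this augmentation step, assuming a standard Keras image pipeline (the directory path, normalization and batch size are hypothetical):

```python
# Random horizontal and vertical reflections, as described above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    horizontal_flip=True,  # random mirror across the vertical axis
    vertical_flip=True,    # random mirror across the horizontal axis
    rescale=1.0 / 255,     # assumed pixel normalization
)

train_iter = augmenter.flow_from_directory(
    "kvasir/train",          # hypothetical path to the training images
    target_size=(224, 224),  # VGG-16 input size
    batch_size=32,           # assumed batch size
    class_mode="categorical",
)
```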
D. Motivation for Choosing VGG-16

The choice of the pre-trained VGG-16 network over the AlexNet and InceptionNet networks was motivated by [22]:
• VGG-16 is deeper than AlexNet, so it has better feature learning.
• VGG-16 has a stronger generalization ability than InceptionNet, since it is simpler.

IV. EXPERIMENTS AND RESULTS

We start this section by introducing the Kvasir dataset. Then we discuss the results.

A. Kvasir Dataset

The Kvasir dataset [3], [23] contains images from inside the gastrointestinal (GI) tract, collected using endoscopic equipment at Vestre Viken Health Trust in Norway. This dataset was made accessible in the fall of 2017 as part of MediaEval's Medical Multimedia Challenge, a benchmarking program that assigns challenges to the scientific community [3]. The image resolution varies from 720x576 up to 1920x1072 pixels. Three major anatomical landmarks, three clinically significant findings, and two endoscopic procedures are all represented in the images. The images have been annotated and verified by specialists. On some of the images, the left-most quarter is devoted to annotations and image information, while the anatomical view takes up the remaining three quarters. A green box in the bottom left corner of several images depicts the endoscope's position in the GI tract. Angle of view, size, luminance, zoom, and center point all vary. In Figure 3, a randomly chosen image of each class in the collection is shown.

Fig. 3. An image of each class of the Kvasir dataset

1) Anatomical landmarks: A recognizable feature of the GI tract that can be seen through the endoscope. They may also be typical sites of pathology, such as ulcers or inflammation. The anatomical landmarks included in the Kvasir collection are the Z-line, pylorus and cecum.

2) Pathological findings: An abnormal characteristic of the gastrointestinal tract. Endoscopically, it is visible as damage to, or an alteration of, the normal mucosa. The finding could be a symptom of a developing illness or a precursor to one. The pathologies ulcerative colitis, polyps, and esophagitis are included in the Kvasir dataset.

3) Endoscopic procedures: Two sets of images related to the removal of lesions are provided, namely dyed and lifted polyps and dyed resection margins.

B. Evaluation Criteria

To validate the efficiency of the VGG-16 model with transfer learning, the Accuracy (Acc), Precision (Pr), Recall (R), and F1 score are used. These metrics are widely used in medical imaging [24].
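For reference, these metrics have the standard definitions below, in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), computed per class and then averaged:

```latex
\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
Pr = \frac{TP}{TP + FP}, \quad
R = \frac{TP}{TP + FN}, \quad
F1 = \frac{2 \cdot Pr \cdot R}{Pr + R}
```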
C. Results

The classification results on the Kvasir dataset are presented in this section. We compare the pre-trained VGG-16 model with the results of the MediaEval challenge on the Kvasir dataset. Before performing data augmentation, our model reached 96.9% accuracy. Figure 4 depicts the confusion matrix comparing the inter-class performance of the second series of experiments, where the diagonal represents the correctly predicted labels. The results of our proposals are competitive with the MediaEval challenge results, as shown in Table I, especially after performing data augmentation (Series 2). However, this comparison is not entirely fair, because the sets used to test and validate the models are not the same. Nonetheless, the proposed models perform well on every provided metric.

TABLE I
PERFORMANCE OF OUR MODELS AND MEDIAEVAL'S CHALLENGE MODELS

Author                    Year  Acc   Pr    R     F1
This study (Series 1)     2021  96.9  84.9  84.8  83.2
This study (Series 2)     2021  98.8  87.3  88.6  87.3
Agrawal et al. [8]        2017  96.1  84.7  85.2  84.7
Pogorelov et al. [9]      2017  95.7  82.9  82.9  82.6
Naqvi et al. [10]         2017  94.2  76.7  77.2  76.7
Petscharnig et al. [11]   2017  93.9  75.5  75.5  75.5
Liu et al. [12]           2017  92.6  70.3  70.3  70.3

Fig. 4. Confusion matrix demonstrating interclass accuracy.

According to the confusion matrix, our proposal performs considerably well for the Polyp and Dyed-Resection-Margins classes. We can also see that the Normal z-line (nzl) and Esophagitis (es) classes cause the most confusion. This can be justified by an observation made in [3]: both classes are captured from a similar anatomical location, except that Esophagitis belongs to an unhealthy area and the Normal z-line to a healthy one. Furthermore, Table I shows that our solution outperforms the MediaEval challenge results.

V. CONCLUSION

The use of a pre-trained VGG-16 deep CNN model with transfer learning for the classification of images from inside the gastrointestinal tract is examined in this work. The VGG-16 model with transfer learning was successful in providing a 98.8% classification rate. Our proposed model has demonstrated performance that is competitive with the MediaEval challenge results. Our next challenge is to push the accuracy as close to 100% as possible. Our aim is to investigate the confusion between the Normal z-line (nzl) and Esophagitis classes in order to reduce the error rate. Future work will be dedicated to investigating image quality enhancement, to show the impact of image quality on deep classification, starting with the preprocessing framework presented in [13]. Furthermore, other pre-trained models with transfer learning, as well as deeper versions of the VGG network, can be explored for the classification task.
REFERENCES

[1] H. Q. Ontario et al., "Colon capsule endoscopy for the detection of colorectal polyps: an evidence-based analysis," Ontario Health Technology Assessment Series, vol. 15, no. 14, p. 1, 2015.
[2] C. E. Board. (2020) Colorectal cancer: statistics. [Online]. Available: https://www.cancer.net/cancer-types/colorectal-cancer/statistics
[3] K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. de Lange, D. Johansen, C. Spampinato, D.-T. Dang-Nguyen, M. Lux, P. T. Schmidt et al., "Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection," in Proceedings of the 8th ACM on Multimedia Systems Conference, 2017, pp. 164–169.
[4] Z. Al-Ameen, G. Sulong, and M. G. M. Johar, "Enhancing the contrast of CT medical images by employing a novel image size dependent normalization technique," International Journal of Bio-Science and Bio-Technology, vol. 4, no. 3, pp. 63–68, 2012.
[5] Z. Lai and H. Deng, "Medical image classification based on deep features extracted by deep model and statistic feature fusion with multilayer perceptron," Computational Intelligence and Neuroscience, vol. 2018, 2018.
[6] F. Altaf, S. M. Islam, N. Akhtar, and N. K. Janjua, "Going deep in medical image analysis: Concepts, methods, challenges, and future directions," IEEE Access, vol. 7, pp. 99540–99572, 2019.
[7] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[8] T. Agrawal, R. Gupta, S. Sahu, and C. Y. Espy-Wilson, "SCL-UMD at the Medico task-MediaEval 2017: Transfer learning based classification of medical images," in MediaEval, 2017.
[9] K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. de Lange, D. Johansen, C. Spampinato, D.-T. Dang-Nguyen, M. Lux, P. T. Schmidt et al., "Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection," in Proceedings of the 8th ACM on Multimedia Systems Conference, 2017, pp. 164–169.
[10] S. S. A. Naqvi, S. Nadeem, M. Zaid, and M. A. Tahir, "Ensemble of texture features for finding abnormalities in the gastro-intestinal tract," in MediaEval, 2017.
[11] S. Petscharnig, K. Schöffmann, and M. Lux, "An inception-like CNN architecture for GI disease and anatomical landmark classification," in MediaEval, 2017.
[12] Y. Liu, Z. Gu, and W. K. Cheung, "HKBU at MediaEval 2017 Medico: Medical multimedia task," 2017.
[13] T. Cogan, M. Cogan, and L. Tamil, "MAPGI: Accurate identification of anatomical landmarks and diseased tissue in gastrointestinal tract using deep learning," Computers in Biology and Medicine, vol. 111, p. 103351, 2019.
[14] B. Sdiri, F. A. Cheikh, K. Dragusha, and A. Beghdadi, "Comparative study of endoscopic image enhancement techniques," in 2015 Colour and Visual Computing Symposium (CVCS). IEEE, 2015, pp. 1–5.
[15] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, "A survey of the recent architectures of deep convolutional neural networks," Artificial Intelligence Review, vol. 53, no. 8, pp. 5455–5516, 2020.
[16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, 2016.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[19] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[20] S. Lu, Z. Lu, and Y.-D. Zhang, "Pathological brain detection based on AlexNet and transfer learning," Journal of Computational Science, vol. 30, pp. 41–47, 2019.
[21] M. Kim, J. Zuallaert, and W. De Neve, "Towards novel methods for effective transfer learning and unsupervised deep learning for medical image analysis," in Doctoral Consortium (DCBIOSTEC 2017), 2017, pp. 32–39.
[22] Q. Guan, Y. Wang, B. Ping, D. Li, J. Du, Y. Qin, H. Lu, X. Wan, and J. Xiang, "Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: a pilot study," Journal of Cancer, vol. 10, no. 20, p. 4876, 2019.
[23] Simula, "The Kvasir dataset," https://datasets.simula.no/kvasir/, 2020.
[24] A. Benkessirat and N. Benblidia, "Fundamentals of feature selection: An overview and comparison," in 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA). IEEE, 2019, pp. 1–6.
