Cervical Cancer Classification Using Machine Learning With Feature Importance and Model Explainability
Mahmudul Hasan, Department of CSE, HSTU, Dinajpur-5200, Bangladesh, mahmudulmoon123@gmail.com
Priyanka Roy, Department of CSE, HSTU, Dinajpur-5200, Bangladesh, priyanka.roy1202@gmail.com
Adiba Mahjabin Nitu, Department of CSE, HSTU, Dinajpur-5200, Bangladesh, nitu.hstu@gmail.com
Abstract—Cervical cancer is still one of the most common gynecological cancers in the world. The rate at which cervical cancer develops is the fourth highest among common female disorders. It is one of the diseases that threaten women’s health worldwide, and early symptoms are notoriously hard to see. In the fields of gynecology and computer science, there has been a dearth of studies focusing on the diagnosis of cervical cancer based on machine learning. Early-stage prediction of cervical cancer can be a great solution, as it also makes women aware of the need to manage their lifestyles. This study classifies cervical cancer from secondary data using machine learning algorithms and identifies the top features responsible for cervical cancer. Support Vector Machine, Random Forest (RF), and Logistic Regression are used as classifiers, and the Boruta feature selection technique is used to find the best features to train the model. Two model explainability tools, Explain like I’m 5 (eli5) and SHapley Additive exPlanations (SHAP), are used to rank the top features, and their effects on the model are also analyzed. RF performs better than the other classifiers when the Synthetic Minority Oversampling Technique with Tomek links (SMOTETomek) data balancing technique is applied, which greatly impacts model accuracy. It shows 99.85% accuracy with 100% precision, recall and F1 score. The proposed paradigm will help people in the medical domain to predict the early stage of cervical cancer more accurately and to explain the reasoning behind the prediction.

Index Terms—Cervical cancer, machine learning, model explainability, feature importance

I. INTRODUCTION

Cancer is one of the most serious diseases and is difficult to overcome, particularly when diagnosed as cervical cancer, which refers to the abnormal growth of tissue cells in the cervix (the entrance to the uterus from the vagina) of women [1]. It is considered the second most dangerous type of cancer after breast cancer and has a high mortality rate. Every year, hundreds of thousands of women around the world are diagnosed with cervical cancer, and the death rate is high. Globally, cervical cancer is the fourth most common cancer in women. It is estimated that 604,000 new cases were diagnosed in 2020 alone. Among the estimated 342,000 deaths from cervical cancer in that year, about 90% of the victims were from low- and middle-income countries [2]. This is due to the lack of early-stage diagnosis of cervical cancer, as it may not show any noticeable symptoms before the disease reaches a severe stage. The lifespan of a patient with cervical cancer mostly depends on early-stage detection of the disease. This rationale has led researchers all over the world to build many machine learning models. Although the aim is to detect cancer accurately, only a few of the models meet their goal with precision. Another factor with a fairly high impact on the accuracy of the proposed models is selecting the features correctly. Since many factors influence cervical cancer, an important question is whether we can discover and filter the combination of top features that increase the risk of cervical cancer. In this study, we analyze the effect of data balancing on the performance of the classifiers. Also, two XAI tools (eli5 and SHAP) are used to explain the best performing model and the top features. The effects of the top features on the model performance are also analyzed.

II. REVIEW OF LITERATURE

Different types of methods have been applied to detect cervical cancer from image screening and numerical datasets. A study proposed a model based on a graph cut-based segmentation scheme to detect cervical cancer efficiently [3]. The whole experimental investigation process was performed on the Herlev dataset and improved accuracy by 5.24% comparatively [4]. Statistical analysis was performed to detect the early stage of cervical cancer using five different methods, namely IBK, K-Star, SPAARC, RT and RF [5]. The dataset used in that study was collected from the UCI ML Repository and Kaggle. They achieved around 98.33% accuracy with RF. In another study, a novel DGCA-RCNN framework was introduced [6]. It was intended to detect cancer cells in the cervix and used the cervical cytology image dataset from the “Digital Human Body” (DHB) Vision Challenge-Intelligent Diagnosis of Cervical Cancer Risk, which is made available by Alibaba Cloud TianChi Company. Different ML approaches were used to predict cervical cancer using Cytokine gene variants combined with socio-demographic characteristics as predictors, and Logistic Regression was found to have the highest accuracy of 82.25% while the highest sensitivity (85%) was achieved by the GNB [7]. In another approach, Raman spectroscopy along with
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on March 14,2024 at 13:46:40 UTC from IEEE Xplore. Restrictions apply.
A. Effect of data balancing on the accuracy of the classifiers

Using different machine learning algorithms, we get accuracy above 98% on the imbalanced data, but the precision, recall, and F1 score vary for class 0 and class 1. Using SVC, both the recall and the F1 score are zero (= 0.0), whereas those for LR and RF are just above 0.8 and 0.9. This is due to the uneven distribution of data samples across the classes, where class 0 (majority class) contains X of the samples and class 1 (minority class) contains only Y. The recall and F1 scores therefore need to be improved, since these are the measures that determine whether a model is making correct predictions.

TABLE I
PERFORMANCE OF THE CLASSIFIERS ON IMBALANCE DATA

verify any outcome before agreeing. ML model explainability follows the same concept: we explain the reasons why our model is superior and how it makes its predictions correctly. Initially, despite the higher accuracy, the recall was very poor. This indicates that the performance of the model is not stable across classes. Before applying the sampling techniques, our model’s predictions were biased, which degraded the model’s accuracy. After applying the sampling techniques, we achieved a much higher recall, which helped us confirm how our model performs on minority class values as well as on the majority class.

TABLE III
TOP FEATURE OF CERVICAL CANCER CLASSIFICATION USING ELI5 AND SHAP
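The gap between high accuracy and zero minority-class recall described in this section can be reproduced with simple arithmetic. The sketch below is illustrative only: the class counts (980 versus 20) are hypothetical, since the text leaves the actual proportions as X and Y, and the always-majority predictor is a stand-in for the behavior reported for SVC, not the authors' model. The helper name binary_metrics is also an assumption introduced for this example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set (assumed counts, not taken from the paper):
# 980 samples of class 0 (majority) and 20 samples of class 1 (minority).
y_true = [0] * 980 + [1] * 20

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

minority = binary_metrics(y_true, y_pred, positive=1)
majority = binary_metrics(y_true, y_pred, positive=0)

for name, scores in (("class 1", minority), ("class 0", majority)):
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

Under these assumed counts the overall accuracy is 0.98 even though the recall and F1 score for class 1 are both 0.0, which is why per-class recall and F1, rather than accuracy alone, are the quantities to watch before and after resampling.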
Fig. 3. PDP plot of Feature Schiller

Fig. 4. PDP plot of Feature Age

Fig. 5. PDP plot of Feature Hormonal Contraceptives

Fig. 6. PDP plot of Feature Hinselmann

while at 30, the risk is almost constant. In Fig. 5, the probability of cervical cancer is highest at a Hormonal Contraceptives value of 0.7: the probability increases from 0.0 to 0.7 but goes down once the value goes above 0.7. As with Schiller, Fig. 6 shows that the probability of cancer increases with Hinselmann as the value moves from 0 toward 1.

Fig. 7. Explanation of probability of cancer of a sample

In Fig. 7, the features that push the prediction higher are shown in pink, and their visual size shows the magnitude of their effect on the prediction. Feature values responsible for decreasing the prediction are shown in blue. Furthermore, we see that Schiller = 0 has the biggest impact on the outcome of our proposed model, and it increases the chance of not having cancer. In addition, Cytology = 1 has a very strong effect in decreasing the prediction value. In short, this figure explains that while Schiller = 0 favored the chances of not having cancer, Cytology = 1 escalated the chances of cancer.

V. CONCLUSION AND FUTURE WORK

Cervical cancer is a life-threatening disease for women. This study identifies the key factors associated with this disease and classifies cervical cancer from secondary data using machine learning algorithms. The findings show that the Schiller, Hinselmann, Age, and Hormonal Contraceptives features are the top features for this disease and are mainly responsible for the model performance. Also, data balancing has a great impact on the model accuracy as well as on the performance for each individual class.

In future work, we will divide cervical cancer into multiple stages and perform multi-class classification. We will explore how the features impact each stage, and how the features interrelate, to find an optimal solution in that case.

REFERENCES

[1] J. S. Lea and K. Y. Lin, “Cervical cancer,” Obstetrics and Gynecology Clinics, vol. 39, no. 2, pp. 233–253, 2012.
[2] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: A Cancer Journal for Clinicians, vol. 71, no. 3, pp. 209–249, 2021.
[3] M. A. Devi, J. Sheeba, and K. S. Joseph, “Neutrosophic graph cut-based segmentation scheme for efficient cervical cancer detection,” Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 1, pp. 1352–1360, 2022.
[4] Y. Marinakis, M. Marinaki, G. Dounias, J. Jantzen, and B. Bjerregaard, “Intelligent and nature inspired optimization methods in medicine: the pap smear cell classification problem,” Expert Systems, vol. 26, no. 5, pp. 433–457, 2009.
[5] M. M. Ali, K. Ahmed, F. M. Bui, B. K. Paul, S. M. Ibrahim, J. M. Quinn, and M. A. Moni, “Machine learning-based statistical analysis for early stage detection of cervical cancer,” Computers in Biology and Medicine, vol. 139, p. 104985, 2021.
[6] X. Li, Z. Xu, X. Shen, Y. Zhou, B. Xiao, and T.-Q. Li, “Detection of cervical cancer cells in whole slide images using deformable and global context aware Faster RCNN-FPN,” Current Oncology, vol. 28, no. 5, pp. 3585–3601, 2021.
[7] M. Kaushik, R. C. Joshi, A. S. Kushwah, M. K. Gupta, M. Banerjee, R. Burget, and M. K. Dutta, “Cytokine gene variants and socio-demographic characteristics as predictors of cervical cancer: a machine learning approach,” Computers in Biology and Medicine, vol. 134, p. 104559, 2021.
[8] H. Zhang, C. Chen, R. Gao, Z. Yan, Z. Zhu, B. Yang, C. Chen, X. Lv, H. Li, and Z. Huang, “Rapid identification of cervical adenocarcinoma and cervical squamous cell carcinoma tissue based on Raman spectroscopy combined with multiple machine learning algorithms,” Photodiagnosis and Photodynamic Therapy, vol. 33, p. 102104, 2021.
[9] M. M. Rahaman, C. Li, Y. Yao, F. Kulwa, X. Wu, X. Li, and Q. Wang, “DeepCervix: a deep learning-based framework for the classification of cervical cells using hybrid deep feature fusion techniques,” Computers in Biology and Medicine, vol. 136, p. 104649, 2021.
[10] M. E. Plissiti, P. Dimitrakopoulos, G. Sfikas, C. Nikou, O. Krikoni, and A. Charchanti, “SIPaKMeD: A new dataset for feature and image based classification of normal and pathological cervical cells in pap smear images,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3144–3148.
[11] K. Fernandes, J. S. Cardoso, and J. Fernandes, “Transfer learning with partial observability applied to cervical cancer screening,” in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2017, pp. 243–250.
[12] M. A. Sahid, M. Hasan, N. Akter, and M. M. R. Tareq, “Effect of imbalance data handling techniques to improve the accuracy of heart disease prediction using machine learning and deep learning,” in 2022 IEEE Region 10 Symposium (TENSYMP), 2022, pp. 1–6.