
2022 4th International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE), 29-31 December 2022, Rajshahi-6204, Bangladesh
979-8-3503-2054-1/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICECTE57896.2022.10114548

Cervical Cancer Classification using Machine Learning with Feature Importance and Model Explainability
Mahmudul Hasan, Department of CSE, HSTU, Dinajpur-5200, Bangladesh, mahmudulmoon123@gmail.com
Priyanka Roy, Department of CSE, HSTU, Dinajpur-5200, Bangladesh, priyanka.roy1202@gmail.com
Adiba Mahjabin Nitu, Department of CSE, HSTU, Dinajpur-5200, Bangladesh, nitu.hstu@gmail.com

Abstract: Cervical cancer is still one of the most common gynecological cancers in the world. Its incidence is the fourth highest among common female disorders. It is one of the diseases that threaten women's health worldwide, and early symptoms are notoriously hard to detect. In the fields of gynecology and computer science, there has been a dearth of studies focusing on the diagnosis of cervical cancer based on machine learning. Early-stage prediction of cervical cancer can be a great help, as it also makes women aware of the need to manage their lifestyles. This study classifies cervical cancer from secondary data using machine learning algorithms and identifies the top features responsible for cervical cancer. Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR) are used as classifiers, and the Boruta feature selection technique is used to find the best features to train the model. Two model explainability tools, Explain Like I'm 5 (eli5) and SHapley Additive exPlanations (SHAP), are used to rank the top features, and their effect on the model is also analyzed. RF performs better than the other classifiers when the Synthetic Minority Oversampling Technique with Tomek links (SMOTETomek) is used for data balancing, which greatly impacts model accuracy. It achieves 99.85% accuracy with 100% precision, recall, and F1 score. The proposed approach will help medical practitioners predict early-stage cervical cancer more accurately and explain the reasoning behind the predictions.

Index Terms: Cervical cancer, machine learning, model explainability, feature importance

I. INTRODUCTION

Cancer is one of the most serious diseases and is difficult to overcome, particularly when diagnosed as cervical cancer, which refers to the abnormal growth of tissue cells in the cervix (the entrance to the uterus from the vagina) of women [1]. It is considered the second most dangerous type of cancer after breast cancer and has a high mortality rate. Every year, hundreds of thousands of women around the world are diagnosed with cervical cancer, and the death rate is high. Globally, cervical cancer is the fourth most common cancer in women. An estimated 604,000 new cases were diagnosed in 2020 alone. Among the estimated 342,000 deaths from cervical cancer in that year, about 90% of the victims were from low- and middle-income countries [2]. This is due to the lack of early-stage diagnosis, as cervical cancer may not produce any noticeable symptoms before the disease reaches a severe stage. The lifespan of a patient with cervical cancer mostly depends on early-stage detection of the disease. This rationale has led researchers all over the world to build many machine learning models.

Although the aim is to detect cancer accurately, only a few of the models meet their goal with precision. Another factor with a fairly high impact on the accuracy of the proposed models is selecting the features correctly. Although there are many factors influencing cervical cancer, an important question is whether we can discover and filter the combination of top features that increase the risk of cervical cancer. In this study, we analyze the effect of data balancing on the performance of the classifiers. Also, two XAI tools (eli5 and SHAP) are used to explain the best-performing model and the top features. The effects of the top features on the model performance are also analyzed.

II. REVIEW OF LITERATURE

Different types of methods have been applied to detect cervical cancer from image screening and numerical datasets. A study proposed a model based on a graph cut-based segmentation scheme to detect cervical cancer efficiently [3]. The whole experimental investigation was performed on the Herlev dataset and improved accuracy by 5.24% comparatively [4]. Statistical analysis was performed to detect the early stage of cervical cancer using five different methods, namely IBK, K-Star, SPAARC, RT and RF [5]. The dataset used in that study was collected from the UCI ML Repository and Kaggle. They achieved around 98.33% accuracy with RF. In another study, a novel DGCA-RCNN framework was introduced [6]. It was intended to detect cancer cells in the cervix, and the authors used the cervical cytology image dataset from the "Digital Human Body" (DHB) Vision Challenge-Intelligent Diagnosis of Cervical Cancer Risk, which is made available by Alibaba Cloud TianChi Company. Different ML approaches were used to predict cervical cancer using cytokine gene variants combined with socio-demographic characteristics as predictors; Logistic Regression was found to have the highest accuracy of 82.25%, while the highest sensitivity (85%) was achieved by GNB [7]. In another approach, Raman spectroscopy along with
pattern recognition was applied for the speedy identification of adenocarcinoma and squamous cell carcinoma tissue in the cervix area of women [8]. This model achieved an accuracy rate of 96.3%. The sample data were collected from the First Affiliated Hospital of Xinjiang Medical University. Similarly, a deep learning-based framework has been advocated for the classification of cervical cells with the help of hybrid deep feature fusion techniques [9]. The authors claimed to achieve the highest classification accuracy on the SIPaKMeD dataset, which exhibits the potential of better and improved cervical cancer diagnostic methods [10].

III. METHODOLOGY

A. Overview of the Proposed Methodology

The proposed methodology includes data preprocessing, data balancing, cancer classification, model explainability, and feature importance. Fig. 1 shows the overview of the proposed methodology.

Fig. 1. Block diagram of the proposed methodology

We take the dataset from UCI [11], then preprocess the data and select the important features to train the model using the relevant data. We use the SMOTETomek data balancing technique to balance the dataset [12]. We split the dataset 80:20, as in the traditional ML process. Then we perform the analysis using the classifiers and rank the features, including an explanation of the best features. All the performance measurements are tabulated and different graphs are used to visualize the outcome.

B. Feature Selection and Feature Ranking Techniques

Boruta: Feature selection is one of the most important and crucial phases in developing any machine learning model. Boruta is an alternative solution to automate this time-consuming process. Boruta is an algorithm designed to take the "all-relevant" approach to feature selection. It is not a stand-alone algorithm: it sits on top of RF.

Eli5: Eli5 makes it possible to visualize and debug machine learning models using a unified API. It provides aids to explain black-box machine learning models through a locally fit, simple, interpretable classifier. Here, Eli5 provides the model explainability by ranking feature importance and explaining the predictions of RF.

SHAP: SHapley Additive exPlanations, mostly known as SHAP, is a mathematical tool to explain black-box machine learning models and to increase their transparency and interpretability. Many features contribute to a machine learning model, but only the top-ranked features substantially affect the output of the model. SHAP shows the contribution, or importance, of each feature to the prediction of the model. In brief, SHAP is a cooperative game theory based tool that helps explain the output of any machine learning model by calculating the contribution of each feature to the prediction.

C. Description of the Classifiers

Support Vector Machine (SVM): For cervical cancer classification, this study employs a support vector classifier technique based on the characteristics of the dataset. In two dimensions, the function of the separating line is $y = ax + b$. Renaming $x$ as $x_1$ and $y$ as $x_2$, we get $a x_1 - x_2 + b = 0$. If we define $\mathbf{x} = (x_1, x_2)$ and $\mathbf{w} = (a, -1)$, we get $\mathbf{w} \cdot \mathbf{x} + b = 0$. The equation is derived here from two-dimensional vectors, but it also works for any number of dimensions.

Random Forest (RF): RF is a bagging ensemble algorithm consisting of several decision tree models, each dividing the training dataset into branches which then split into more branches until a leaf node is reached. It can work on large, high-dimensional datasets, and it is less prone to over-fitting than a single tree, which can lead to higher accuracy than other ML models.

Logistic Regression (LR): LR is a simple statistical method to calculate or predict the probability of a binary (yes/no) event occurring. Like RF, it is a type of supervised learning that is used when the dependent variable (target) is categorical.

D. Performance Measure Techniques

We use accuracy, precision, recall, and F1 score to evaluate the models. Accuracy is the percentage of correctly labeled data instances relative to the total number of data instances; it is one of the primary metrics used to evaluate a model's ability to classify data. Precision measures how often the model is correct when it predicts a given class. Recall measures how many of the actual instances of a class the model retrieves. The F1 score is the harmonic mean of precision and recall, so both are given equal weight in the overall score.

IV. RESULT AND DISCUSSION

This study shows the effect of data balancing on the accuracy of the classifiers and also shows the explainability of the highest-performing algorithm, RF. The top features that are responsible for cervical cancer are also found, and the effect of each feature is described by a swarm plot and some Partial Dependence Plots (PDPs). The results are described in turn.
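The four measures defined in Section III-D can be illustrated with a minimal pure-Python sketch. The label vectors are hypothetical (they are not the study's data), and `per_class_metrics` and `accuracy` are illustrative helper names rather than functions from any library used by the authors. The example shows how a classifier that always predicts the majority class can reach high accuracy while recall and F1 for the minority class are zero.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, F1) with `positive` treated as the positive class."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(positive, positive)]
    fp = sum(n for (t, p), n in pairs.items() if p == positive and t != positive)
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced test set: 98 negatives (class 0), 2 positives (class 1)
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)
class0 = per_class_metrics(y_true, y_pred, positive=0)
class1 = per_class_metrics(y_true, y_pred, positive=1)
print(f"accuracy={acc:.2f}")
print("class 0 (precision, recall, F1):", class0)
print("class 1 (precision, recall, F1):", class1)
```

Reporting the per-class values alongside accuracy, as the tables in this section do, makes this kind of failure on the minority class visible.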

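The SMOTETomek balancing step named in the methodology can be sketched in pure Python as follows. This is an illustrative reimplementation under simplifying assumptions, not the code used in the study: the function names (`smote`, `tomek_links`, `smote_tomek`), the toy points, and the choice to drop both members of every Tomek link are assumptions, and the minority class is assumed to contain at least two samples.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: sq_dist(base, m))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def tomek_links(X, y):
    """Index pairs (i, j) of opposite-class points that are mutual nearest neighbours."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: sq_dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity, then drop both points of each Tomek link."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max((len(X) - len(minority)) - len(minority), 0)
    X_bal = list(X) + smote(minority, n_new, k, seed)
    y_bal = list(y) + [minority_label] * n_new
    drop = {idx for pair in tomek_links(X_bal, y_bal) for idx in pair}
    keep = [i for i in range(len(X_bal)) if i not in drop]
    return [X_bal[i] for i in keep], [y_bal[i] for i in keep]

# Hypothetical toy data: six majority points (class 0), two minority points (class 1)
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
     (1.0, 1.0), (1.1, 0.9)]
y = [0] * 6 + [1] * 2
X_res, y_res = smote_tomek(X, y)
print(y_res.count(0), y_res.count(1))
```

In this toy example the two classes are well separated, so no Tomek links are found and the classes end up the same size. On overlapping data the cleaning step would remove borderline pairs, so the final class counts need not be exactly equal.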
A. Effect of data balancing on the accuracy of the classifiers

Using different machine learning algorithms, we get accuracy above 98% on the imbalanced data, but the precision, recall, and F1 score vary between class 0 and class 1. Using SVC, we see that both the recall and the F1 score for class 1 are zero (= 0.0), whereas those for LR and RF are just above 0.8 and 0.9, respectively. This is due to the uneven distribution of data samples across the classes, where class 0 (majority class) contains X of the samples and class 1 (minority class) contains only Y. The recall and F1 scores therefore need to be improved, as these determine whether a model is able to make correct predictions.

TABLE I
PERFORMANCE OF THE CLASSIFIERS ON IMBALANCED DATA

Algorithm  Class  Precision  Recall  F1    Avg. Acc (%)
SVC        0      0.98       1.00    0.99  98.26
SVC        1      0.0        0.0     0.0
LR         0      1.00       1.00    1.00  99.70
LR         1      1.00       0.82    0.92
RF         0      1.00       1.00    1.00  99.71
RF         1      1.00       0.83    0.91

On the other hand, in Table II we see improved precision, recall and F1 score for all the ML models that we used, with values of nearly 1.0. The results for our best model, RF, with clearly increased recall, indicate that this ML model predicts accurately for both the majority and minority classes. This improvement is the result of balancing our imbalanced dataset by applying the Synthetic Minority Oversampling Technique (SMOTE) along with Tomek links. Also, RF is an ensemble algorithm that reduces variance; it explores many candidate solutions and finds the best one for classification.

TABLE II
PERFORMANCE OF THE CLASSIFIERS ON BALANCED DATA

Algorithm  Class  Precision  Recall  F1    Avg. Acc (%)
SVC        0      0.97       1.00    0.99  98.66
SVC        1      1.00       0.97    0.99
LR         0      1.00       0.99    0.99  99.40
LR         1      0.99       1.00    0.99
RF         0      1.00       1.00    1.00  99.85
RF         1      1.00       1.00    1.00

B. Explanation of the top features and effect on the accuracy of RF classifier

We have applied two feature ranking tools, namely eli5 and SHAP, to select the most impactful features in our dataset. As we can see from Table III, with eli5 we got Schiller, Age, Hormonal Contraceptives, Hinselmann and First Sexual Intercourse as the top 5 features. SHAP, on the other hand, finds Cytology as the 5th feature instead of First Sexual Intercourse. Merging both results, we mark all six of these features as the most impactful features in our dataset.

TABLE III
TOP FEATURES OF CERVICAL CANCER CLASSIFICATION USING ELI5 AND SHAP

eli5 Feature              eli5 Weight      SHAP Feature             SHAP Value
Schiller                  0.0642±0.0133    Schiller                 +0.09
Age                       0.0047±0.0000    Hinselmann               +0.01
Hormonal Contraceptives   0.0047±0.0059    Age                      +0.01
Hinselmann                0.0009±0.0037    Hormonal Contraceptives  +0.01
First sexual intercourse  0.0028±0.0046    Cytology                 +0.01

Fig. 2. Distribution of SHAP values to show the impact of the features

Fig. 2 shows the feature-versus-impact plot, where the dominating features are denoted in pink and the less impactful features are in blue. We always tend to question and verify any outcome before agreeing with it. ML model explainability serves the same purpose: we explain why our model is superior and how it makes correct predictions. Initially, despite the high accuracy, the recall was poor, which indicates that the performance of the model was not stable across classes. Before applying the sampling techniques, our model's predictions were biased toward the majority class, which degraded its per-class performance. After applying the sampling techniques, we achieved a much higher recall, which confirms how our model performs on the minority class as well as the majority class.

Our model's key components have already been described, but how each feature influences the model's predictions is still unknown. Consider the 'Age' feature as an illustration. We are aware that age is a major factor, but it is unclear whether the risk of cancer increases or decreases with age. To determine how a single feature affects our prediction, we use another technique called Partial Dependence Plots (PDP). A PDP is calculated after the model is fitted. We start from a single test row and, rather than making a single prediction, make a series of predictions by iteratively changing one variable in that row. For the cervical cancer model, we randomly sample rows from the test data and change a single variable value, such as age, before making a series of predictions. We do this for multiple rows, then plot the average predicted outcome on the vertical axis.

The PDP of Schiller in Fig. 3 indicates that the probability of cancer increases with Schiller. It rises as the Schiller value changes from 0 to 1. Also, the PDP of Age in Fig. 4 indicates that the most dangerous period for developing cervical cancer is age 19 to 21
while at 30, the risk is almost constant. The probability of cervical cancer is high at a Hormonal Contraceptives value of 0.7 in Fig. 5. From 0.0 to 0.7 the probability increases, but it goes down once the value rises above 0.7. Like Schiller, the probability of cancer increases with Hinselmann as the value goes from 0 toward 1 in Fig. 6.

Fig. 3. PDP plot of Feature Schiller

Fig. 4. PDP plot of Feature Age

Fig. 5. PDP plot of Feature Hormonal Contraceptives

Fig. 6. PDP plot of Feature Hinselmann

Fig. 7. Explanation of probability of cancer of a sample

From Fig. 7 we see that the features that push the prediction higher are denoted in pink, and their visual size shows the magnitude of their effect on the prediction. Feature values responsible for decreasing the prediction are shown in blue. Furthermore, we see that Schiller = 0 has the biggest impact on the outcome of our proposed model, and it increases the chance of not having cancer. Besides, Cytology = 1 has a very strong effect in decreasing the prediction value. In short, this figure explains that while Schiller = 0 favored the chances of not having cancer, Cytology = 1 escalated the chances of cancer.

V. CONCLUSION AND FUTURE WORK

Cervical cancer is a life-threatening disease for women. This study identifies the features most strongly associated with this disease and classifies cervical cancer using machine learning algorithms from secondary data. The findings show that the Schiller, Hinselmann, Age, and Hormonal Contraceptives features are the top features for this disease and are mainly responsible for the model performance. Also, data balancing has a great impact on the model accuracy as well as on the performance for each individual class.

In future work, we will divide cervical cancer into multiple stages and perform multi-class classification. We will explore how the features impact each stage, and how the features interrelate, to find an optimal solution in that case.

REFERENCES

[1] J. S. Lea and K. Y. Lin, “Cervical cancer,” Obstetrics and Gynecology Clinics, vol. 39, no. 2, pp. 233–253, 2012.
[2] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians, vol. 71, no. 3, pp. 209–249, 2021.
[3] M. A. Devi, J. Sheeba, and K. S. Joseph, “Neutrosophic graph cut-based segmentation scheme for efficient cervical cancer detection,” Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 1, pp. 1352–1360, 2022.
[4] Y. Marinakis, M. Marinaki, G. Dounias, J. Jantzen, and B. Bjerregaard, “Intelligent and nature inspired optimization methods in medicine: the pap smear cell classification problem,” Expert Systems, vol. 26, no. 5, pp. 433–457, 2009.
[5] M. M. Ali, K. Ahmed, F. M. Bui, B. K. Paul, S. M. Ibrahim, J. M. Quinn, and M. A. Moni, “Machine learning-based statistical analysis for early stage detection of cervical cancer,” Computers in Biology and Medicine, vol. 139, p. 104985, 2021.
[6] X. Li, Z. Xu, X. Shen, Y. Zhou, B. Xiao, and T.-Q. Li, “Detection of cervical cancer cells in whole slide images using deformable and global context aware faster rcnn-fpn,” Current Oncology, vol. 28, no. 5, pp. 3585–3601, 2021.
[7] M. Kaushik, R. C. Joshi, A. S. Kushwah, M. K. Gupta, M. Banerjee, R. Burget, and M. K. Dutta, “Cytokine gene variants and socio-demographic characteristics as predictors of cervical cancer: a machine learning approach,” Computers in Biology and Medicine, vol. 134, p. 104559, 2021.
[8] H. Zhang, C. Chen, R. Gao, Z. Yan, Z. Zhu, B. Yang, C. Chen, X. Lv, H. Li, and Z. Huang, “Rapid identification of cervical adenocarcinoma and cervical squamous cell carcinoma tissue based on raman spectroscopy combined with multiple machine learning algorithms,” Photodiagnosis and Photodynamic Therapy, vol. 33, p. 102104, 2021.
[9] M. M. Rahaman, C. Li, Y. Yao, F. Kulwa, X. Wu, X. Li, and Q. Wang, “Deepcervix: a deep learning-based framework for the classification of cervical cells using hybrid deep feature fusion techniques,” Computers in Biology and Medicine, vol. 136, p. 104649, 2021.
[10] M. E. Plissiti, P. Dimitrakopoulos, G. Sfikas, C. Nikou, O. Krikoni, and A. Charchanti, “Sipakmed: A new dataset for feature and image based classification of normal and pathological cervical cells in pap smear images,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3144–3148.
[11] K. Fernandes, J. S. Cardoso, and J. Fernandes, “Transfer learning with partial observability applied to cervical cancer screening,” in Iberian conference on pattern recognition and image analysis. Springer, 2017, pp. 243–250.
[12] M. A. Sahid, M. Hasan, N. Akter, and M. M. R. Tareq, “Effect of imbalance data handling techniques to improve the accuracy of heart disease prediction using machine learning and deep learning,” in 2022 IEEE Region 10 Symposium (TENSYMP), 2022, pp. 1–6.
