Artificial Intelligence in Disease Diagnostics:
Rethinking Risk Factors for Cervical Cancer

Naresh Kumar Trivedi
Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
nareshk.trivedi@chitkara.edu.in

Raj Gaurang Tiwari
Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
rajgaurang@chitkara.edu.in

Vinay Gautam
Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
vinay.gautam@chitkara.edu.in

Vikrant Sharma
Department of Computer Science and Engineering, Graphic Era Hill University, Dehradun, India
vsharma@gehu.ac.in

Anuj Kumar Jain
Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India
anuj.jain@chitkara.edu.in

Abstract—As machine learning evolves, many individuals and businesses use a range of algorithms to evaluate massive data sets and create actionable insights that aid in predicting behavior. This type of technology is increasingly employed in the medical industry to forecast the early stages of certain severe diseases, such as cervical cancer. A significant amount of research has been conducted on cervical cancer in recent years. Studies have focused on topics such as risk factors for cervical cancer, early detection and screening, and the effectiveness of different treatment options. In this study, we conduct an in-depth comparison of several machine learning methods, discussing their relative merits and shortcomings in terms of accuracy and overall performance. Stacking, an ensemble machine learning approach, emerges as the best approach for cervical cancer classification.

Keywords— Classification, Cervical cancer, machine learning, Neural Network, Stacking

I. INTRODUCTION

Cervical cancer is a growing health problem for women. Among women, the rate of malignant tumors is greatest in developing countries and third highest in industrialized countries. Nearly a third (28.8 percent) of the world's cervical cancer cases are found in China each year, where 140,000 new cases are identified annually. In China, it is the most common malignancy of the female reproductive system [1]. In its early stages, the disease can be difficult to identify. Cervical cancer cannot be detected in any way other than routine screening. Magnetic resonance [2] and diffusion-weighted imaging [3] are ineffective in detecting the disease in its early stages. Human papillomavirus (HPV) infection, first sexual intercourse at a young age, and smoking are among the factors that increase the chance of developing cervical cancer. HPV is the single greatest cause of cervical cancer [4]. Researchers have shown that, compared with nonsmokers, those who are infected with HPV and also smoke have a threefold increased risk of developing cervical cancer [5]. Today, early disease detection makes use of a wide range of methods, some of which are based on machine learning.

Vapnik [6] developed the support vector machine (SVM), now one of the most used classification methods, based on the idea of using a separating hyperplane to split a sample space into subspaces corresponding to different classes. It has been successfully applied in a wide variety of fields, including image recognition, text classification, and data forecasting [7].

Decision trees are widely used for data classification in many fields, including business, healthcare, and education. Their defining feature is that samples are categorized by successively dividing them according to their characteristics. Many decision tree-based ensemble methods, such as Random Forests and XGBoost, employ decision trees as their core classifier, reaping the benefits of both their superior accuracy and generalizability. XGBoost is a powerful and efficient approach for designing classifiers.

For labeling large datasets with many attributes, the Random Forest method is a widely used choice. It employs a tree-structured method of analysis for making categorizations and predicting outcomes. This research compares the risk features of the dataset from the UCI repository [8] using three machine learning-based approaches (XGBoost, SVM, and Random Forest).

This paper is organized in five sections. Section II reviews related work. Section III outlines the methodologies employed in data analysis. Section IV assesses the features and quality of the cervical cancer dataset and discusses the results and effective strategies. Section V concludes the study.

II. RELATED WORKS

Studies have been conducted to determine what variables, such as human papillomavirus (HPV) infection, smoking, and a compromised immune system, increase a person's chance of developing cervical cancer. The early identification and screening of cervical cancer, using tools like the Pap test and HPV testing, is another field of study [9], [10].

Surgery, radiation therapy, and chemotherapy are only some of the treatment options for cervical cancer, and their
efficacy has been the subject of many studies. Research on immunotherapy and targeted therapy for the treatment of cervical cancer has also been performed.

In addition, there has been an increasing focus on the development of new technologies, such as liquid-based cytology, to improve cervical cancer screening and to develop new biomarkers to detect cervical cancer early [11][12].

A classifier for cervical cancer was coupled with the revised segmentation using machine learning methods; random forests were shown to be the most effective [14]. To handle the plethora of images that are produced by the recovered objects, unsupervised learning and robust refinement approaches are used, such as AdaBoost detectors [15], SVMs [16], and Gaussian mixture models [17]. A new method of cell segmentation based on the Markov random field was developed and implemented; it makes use of superpixels to distinguish between cells that do not overlap. To identify cervical cells, the authors use the multifilter SVM with the settings described in [19]. Artificial neural networks (ANN) were developed and evaluated and found to classify cervical cells with 78 percent accuracy [20]. An unsupervised method [21] was used to evaluate the quality of the medical evidence for each subtype of cervical cancer. When compared to other simple classification models, PSO with KNN membership values fared better [22]. Cervical cancer cells are segmented and classified using a Gabor-based technique based on their form and texture. The percentage of correctly identified cancerous cells rose to 89 percent [23]. One of the recommended model's reference components, the least-square support vector machine (LSSVM), was used to categorize the CNN features, with even better results. Comparatively, RBF-based SVMs outperformed logistic regression and random forest methods [25]. Accuracy based on features was predicted to be between 90% and 95%.

Overall, research on cervical cancer is ongoing and continues to focus on improving early detection, developing new treatment options, and increasing understanding of the disease.

III. MATERIAL AND METHOD

A. Dataset
The cervical cancer dataset [13] has been used in this study.

B. Model Diagram
The proposed method is shown in the model diagram (Fig. 1) below. The diagram illustrates how the model was developed.

C. Brief Description of Algorithms Used

1) k-Nearest Neighbors (k-NN): k-NN is a popular machine learning approach for solving classification and regression problems. To make a prediction, the k-NN algorithm looks for the k nearest training instances in the feature space. It uses a distance metric, such as Euclidean distance, to measure the similarity between the input data and the training examples. Because k-NN makes no assumptions about the data distribution, it falls under the category of non-parametric algorithms. It is simple to implement and does not require a lot of computational resources, making it a popular choice for many applications.

2) Neural networks: Neural networks are machine learning algorithms modeled on the structure and function of the human brain. They consist of layers of interconnected nodes, known as artificial neurons, that work together to analyze and transmit data. Before reaching the output layer, the input is transformed by several layers of these artificial neurons. The weights of the connections between the neurons are adjusted during training to reduce the gap between the expected and actual output. Neural networks are used extensively in many disciplines for tasks such as speech recognition, language processing, and image recognition. They are considered powerful algorithms, as they can learn to model highly non-linear and complex relationships between inputs and outputs [26].

Fig. 1. Proposed Model Diagram
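
The k-NN procedure described in Section III-C can be illustrated with a minimal Python sketch, assuming Euclidean distance and a simple majority vote among the k nearest training points. The helper names `euclidean` and `knn_predict` and the toy two-dimensional points are hypothetical and introduced here only for illustration; they are not the features or code used in this study.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Return the majority label among the k training samples nearest to query."""
    # Sort training samples by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    # Majority vote over the neighbor labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "D" and "S".
TRAIN: List[Sample] = [
    ((0.0, 0.0), "D"), ((0.0, 1.0), "D"), ((1.0, 0.0), "D"),
    ((5.0, 5.0), "S"), ((5.0, 6.0), "S"), ((6.0, 5.0), "S"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, (0.5, 0.5), k=3))
    print(knn_predict(TRAIN, (5.2, 5.1), k=3))
```

Because every prediction scans the full training set, this sketch is practical only for small datasets; larger datasets usually call for an indexed neighbor search.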

3) Stochastic Gradient Descent (SGD): SGD is an optimization algorithm used for training machine learning models, such as linear regression and neural networks. The algorithm starts with a random set of weights and iteratively updates them by moving in the direction of the negative gradient of the loss function. In each iteration, the algorithm uses a small set of training examples, called a batch or mini-batch, to calculate the gradient, instead of using the whole dataset. This makes the algorithm computationally efficient and able to handle large datasets. The algorithm is sensitive to the learning rate, which controls the step size of the update, so the choice of a suitable learning rate is critical for its performance. SGD is widely used because it is simple to implement and can be used to train models on large datasets on distributed computing architectures.

4) Stacking: Stacking combines multiple models into one and is a common machine learning strategy for boosting model accuracy. With stacking, the predictions of several base models are fed into a single meta-model as features. The final prediction is made by the meta-model using the outputs of the base models as input.

There are two main types of stacking: homogeneous and heterogeneous. In homogeneous stacking, all the base models are of the same type, while in heterogeneous stacking, the base models are of different types. Stacking's primary value is that it may boost model performance by using the strengths of many models while mitigating their weaknesses. Additionally, stacking might enhance the model's interpretability by revealing which base models are most influential in producing the final prediction inside the meta-model. Stacking is a strong ensemble approach that may improve the performance of machine learning models in a wide range of applications, including image classification, natural language processing, and time series forecasting.

D. K-fold cross-validation
K-fold cross-validation is a common way to gauge a model's performance in machine learning. The procedure partitions the data into k "folds" of equal size; in each round, k-1 of the folds serve as training data and the remaining fold serves as test data. After results have been collected from the k test folds, the model's performance is determined by averaging the performance measure.

Since all the data is used for both training and testing, k-fold cross-validation yields a more accurate estimate of the model's performance. It may be useful for detecting overfitting, which happens when a model fits its training data closely but performs poorly when presented with novel data.

The size of the dataset is the primary factor in deciding what value of k to use. The model's performance can be estimated more accurately with a larger k, but this requires more training time than a smaller k. In machine learning, k-fold cross-validation is used for model comparison, model evaluation, and hyperparameter optimization. The technique can be applied to supervised and unsupervised, linear and non-linear models to gain a more precise idea of how well they will perform [11].

E. Performance evaluation metrics
Performance evaluation metrics are used to evaluate the predictions made by a machine learning model. The problem and the data at hand should guide the selection of a performance measure. A variety of indicators are used to assess model performance.
• Precision is the fraction of the cases predicted as positive by the model that are actually positive.
• Accuracy is the fraction of all cases that the model predicts correctly.
• Recall (sensitivity) is the fraction of actual positive cases that the model correctly predicts as positive.
• F1-score is the harmonic mean of precision and recall and balances the two.
• The area under the receiver operating characteristic curve (ROC-AUC) is a metric for evaluating a binary classifier's ability to separate positive and negative data.

IV. RESULT AND DISCUSSION

This article investigates three different approaches to machine learning and makes use of ensemble learning in order to make a prediction about cervical cancer. The five characteristics that were used are outlined in the study's methodology and materials. A total of 1015 samples were collected after absenteeism and incidence statistics were examined independently. With 20-fold cross-validation, 20 different data sets were employed for testing and training.

The following are the five factors used to determine cervical cancer.

Dyskeratotic refers to cells or tissue that have abnormal, irregular, or degenerative keratinization. In other words, it is a term used to describe cells that have abnormal changes in the process of forming the protein keratin, which is a key component of the outer layer of skin and other structures in the body. Dyskeratotic changes can be seen in a variety of diseases and conditions and can be an indication of cell or tissue dysfunction.

Koilocytotic refers to cells that have a characteristic appearance known as "koilocytes" when viewed under a microscope. Koilocytes are cervical cells that have a characteristic "halo" appearance, which is caused by changes in the cell's cytoplasm and nucleus. They are often seen in cervical infections caused by certain types of human papillomavirus (HPV) and are considered a sign of cervical dysplasia, which is a precancerous condition. Koilocytosis can also be caused by other factors such as irritation or trauma.

Metaplastic refers to a type of cell or tissue transformation in which one cell type is replaced by another. This process is also known as transdifferentiation. For example, metaplasia is the transformation of one type of epithelial cell into another type, such as squamous cells becoming glandular cells. Metaplasia occurs as a response to certain types of injury or
chronic irritation and can be seen in different organs and tissues. In some cases, metaplasia can be a precursor to malignancy, but not always. It is important to note that metaplasia is a physiological process that can occur under normal circumstances as well as pathological conditions.

Parabasal cells are a type of cell found in the epithelium, which is the lining of certain organs and tissues in the body. These cells are located at the basal layer of the epithelium, which is the layer closest to the underlying tissue. They are typically larger and more irregular in shape than the more mature cells that are found in the upper layers of the epithelium. Parabasal cells can be seen in a variety of conditions, such as cervical dysplasia (precancerous changes in the cervix) and vaginal atrophy (thinning of the vaginal walls). They may also be seen in biopsy specimens taken from the cervix or vagina because of infection, inflammation, or other types of injury. High numbers of parabasal cells may indicate a need for further evaluation or treatment.

A superficial-intermediate cell is a type of cell found in the epidermis, the outermost layer of the skin. Superficial cells are in the uppermost layer of the epidermis and are responsible for forming the protective barrier of the skin. Intermediate cells are in the middle layers of the epidermis and play a role in the formation of new skin cells. Both superficial and intermediate cells are constantly shed and replaced by new cells produced in the lower layers of the epidermis [13].

Figs. 2-5 display confusion matrices for the k-nearest-neighbor (kNN), neural network, stochastic gradient descent, and stacking models, respectively. Here, the Dyskeratotic, Koilocytotic, Metaplastic, Parabasal, and Superficial-Intermediate target classes have been abbreviated as D, K, M, P, and S, respectively.

                      Predicted
             D     K     M     P     S     ∑
Actual  D  242     2     1     4     2   251
        K   36   156    20     7     6   225
        M    1     6   157     5     0   169
        P    0     0     0   163     0   163
        S    2     0     0     0   205   207
        ∑  281   164   178   179   213  1015
Fig. 2. k-nearest-neighbor

                      Predicted
             D     K     M     P     S     ∑
Actual  D  244     6     1     0     0   251
        K   18   192    11     0     4   225
        M    0     6   162     1     0   169
        P    1     0     0   161     1   163
        S    0     0     0     0   207   207
        ∑  263   204   174   162   212  1015
Fig. 3. neural network

                      Predicted
             D     K     M     P     S     ∑
Actual  D  242     7     1     1     0   251
        K   15   183    15     5     7   225
        M    0     7   159     3     0   169
        P    0     0     0   163     0   163
        S    0     0     0     0   207   207
        ∑  257   197   175   172   214  1015
Fig. 4. stochastic gradient descent

                      Predicted
             D     K     M     P     S     ∑
Actual  D  243     7     1     0     0   251
        K   14   197     9     1     4   225
        M    0     6   162     1     0   169
        P    0     0     0   163     0   163
        S    0     0     0     0   207   207
        ∑  257   210   172   165   211  1015
Fig. 5. Stack

It is evident from the confusion matrices, Table I, and Fig. 6 that kNN and SGD have the lowest classification accuracy. The stacking model outperformed all other machine learning techniques with a maximum classification accuracy of 95.7%.

TABLE I. COMPARISON OF DIFFERENT MACHINE LEARNING ALGORITHMS

Model            AUC    CA     F1     Precision  Recall
kNN              0.988  0.909  0.905  0.913      0.909
Stack            0.994  0.957  0.957  0.957      0.957
SGD              0.962  0.939  0.938  0.939      0.939
Neural Network   0.996  0.952  0.951  0.952      0.952
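
As an illustrative check, the metrics of Section III-E can be recomputed directly from a confusion matrix. The Python sketch below uses the stacking matrix of Fig. 5 (rows are actual classes, columns are predicted classes). The helper names `accuracy`, `per_class_metrics`, and `weighted_average` are hypothetical, and averaging the per-class scores weighted by class support is an assumption about how the aggregate scores were obtained, not a description of the tooling used in this study.

```python
from typing import Dict, List

CLASSES = ["D", "K", "M", "P", "S"]

# Stacking confusion matrix from Fig. 5: rows = actual, columns = predicted.
STACK: List[List[int]] = [
    [243, 7, 1, 0, 0],
    [14, 197, 9, 1, 4],
    [0, 6, 162, 1, 0],
    [0, 0, 0, 163, 0],
    [0, 0, 0, 0, 207],
]


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of all samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """One-vs-rest precision, recall, and F1 for each class."""
    n = len(cm)
    results = []
    for i in range(n):
        tp = cm[i][i]
        actual = sum(cm[i])                          # row sum: true members
        predicted = sum(cm[r][i] for r in range(n))  # column sum: predicted
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": float(actual)})
    return results


def weighted_average(metrics: List[Dict[str, float]], key: str) -> float:
    """Average of a per-class metric weighted by class support (assumed)."""
    total = sum(m["support"] for m in metrics)
    return sum(m[key] * m["support"] for m in metrics) / total


if __name__ == "__main__":
    scores = per_class_metrics(STACK)
    print("CA:", round(accuracy(STACK), 3))
    for name, m in zip(CLASSES, scores):
        print(name, {k: round(v, 3) for k, v in m.items()})
```

For the Fig. 5 matrix the diagonal sums to 972 of 1015 samples, an accuracy of about 0.958, which is in line with the CA of 0.957 reported for the stacking model in Table I. The support-weighted recall equals the accuracy by construction.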
Fig. 6. Performance comparison (bar chart of the AUC, CA, F1, Precision, and Recall scores for the kNN, Stack, SGD, and Neural Network models; x-axis: Models, y-axis: score)

A. ROC for Performance Evaluation
The ROC is a curve generated by plotting the false positive rate (x-axis) against the true positive rate (y-axis). When training data for both classes fluctuates over time, ROC becomes beneficial. The closer the area under the ROC curve is to 1, the more accurate the classifier. According to Fig. 7, stacking surpasses all other methods for determining cervical cancer.

Fig. 7. ROC curve for different machine algorithms

V. CONCLUSION

Lifesaving and treatment decisions depend on an accurate cervical cancer diagnosis. This research shows that cervical cancer may be detected using machine learning methods. With ensemble learning and 20-fold cross-validation, the stacking model built on our neural network and SGD algorithms achieves an accuracy of 95.7%, and our other classifiers also achieve a respectable degree of accuracy. This research used a classification approach to identify risk factors for cervical cancer. In the future, we aim to improve classification accuracy by extracting more features from the present dataset. We will also employ a wider array of machine learning and deep learning approaches, including transfer learning.

REFERENCES
[1] Parkin DM, Bray F, Ferlay J, Pisani P (2001) Estimating the world cancer burden: Globocan 2000. Int J Cancer 94: 153–156.
[2] M. Exner et al., "Value of diffusion-weighted MRI in diagnosis of uterine cervical cancer: A prospective study evaluating the benefits of DWI compared to conventional MR sequences in a 3T environment," Acta Radiologica, vol. 57, no. 7, pp. 869–877, 2016.
[3] P. Z. Mcveigh, A. M. Syed, M. Milosevic, A. Fyles, and M. A. Haider, "Diffusion-weighted MRI in cervical cancer," Eur. Radiol., vol. 18, no. 5, pp. 1058–1064, 2008.
[4] A. Gadducci, C. Barsotti, S. Cosio, L. Domenici, and A. G. Riccardo, "Smoking habit, immune suppression, oral contraceptive use, and hormone replacement therapy use and cervical carcinogenesis: A review of the literature," Gynecol. Endocrinol., vol. 27, no. 8, pp. 597–604, 2011.
[5] P. Luhn, J. Walker, M. Schiffman, and R. E. Zuna, "The role of co-factors in the progression from human papillomavirus infection to cervical cancer," Gynecol. Oncol., vol. 128, no. 2, pp. 265–270, 2013.
[6] Cortes, C. and V. N. Vapnik (1995). "Support vector networks." Machine Learning, 20(3):273-297.
[7] C. Sommer, Quantitative characterization, classification and reconstruction of oocyst shapes of Eimeria species from cattle, Parasitology 116, 21-28, 1998.
[8] Khullar, Vikas, Sachin Ahuja, Raj Gaurang Tiwari, and Ambuj Kumar Agarwal. "Investigating efficacy of deep trained soil classification system with augmented data." In 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), pp. 1-5. IEEE, 2021.
[9] Lilhore, U.K., Poongodi, M., Kaur, A., Simaiya, S., Algarni, A.D., Elmannai, H., Vijayakumar, V., Tunze, G.B. and Hamdi, M., 2022. Hybrid model for detection of cervical cancer using causal analysis and machine learning techniques. Computational and Mathematical Methods in Medicine, 2022.
[10] Koundal, Deepika, Savita Gupta, and Sukhwinder Singh. "Computer aided thyroid nodule detection system using medical ultrasound images." Biomedical Signal Processing and Control 40 (2018): 117-130.
[11] Trivedi, N.K., Kumar, S., Jain, S., Maheshwari, S. (2021). KFCM-Based Direct Marketing. In: Rathore, V.S., Dey, N., Piuri, V., Babo, R., Polkowski, Z., Tavares, J.M.R.S. (eds) Rising Threats in Expert Applications and Solutions. Advances in Intelligent Systems and Computing, vol 1187. Springer, Singapore. https://doi.org/10.1007/978-981-15-6014-9_57
[12] N. K. Trivedi, V. Gautam, H. Sharma, A. Anand and S. Agarwal, "Diabetes Prediction using Different Machine Learning Techniques," 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Greater Noida, India, 2022, pp. 2173-2177, doi: 10.1109/ICACITE53722.2022.9823640.
[13] https://www.kaggle.com/datasets/prahladmehandiratta/cervical-cancer-largest-dataset-sipakmed
[14] Q. Meng, "Machine learning to predict local recurrence and distant metastasis of cervical cancer after definitive radiotherapy," International Journal of Radiation Oncology • Biology • Physics, vol. 108, no. 3, article e767, 2020.
[15] J. Shan, R. Jiang, X. Chen et al., "Machine learning predicts lymph node metastasis in early-stage oral tongue squamous cell carcinoma," Journal of Oral and Maxillofacial Surgery, vol. 78, no. 12, pp. 2208–2218, 2020.
[16] S. K. Saini, V. Bansal, R. Kaur, and M. Juneja, "ColpoNet for automated cervical cancer screening using colposcopy images," Machine Vision and Applications, vol. 31, no. 3, pp. 1–15, 2020.
[17] P. Sanyal, P. Ganguli, and S. Barui, "Performance characteristics of an artificial intelligence based on convolutional neural network for screening conventional Papanicolaou-stained cervical smears," Medical Journal, Armed Forces India, vol. 76, no. 4, pp. 418–424, 2020.
[18] B. R. Jany, A. Janas, and F. Krok, "Automatic microscopic image analysis by moving window local Fourier transform and machine learning," Micron, vol. 130, article 102800, 2020.
[19] V. Karunakaran, V. N. Saritha, M. M. Joseph et al., "Diagnostic spectro-cytology revealing differential recognition of cervical cancer lesions by label-free surface enhanced Raman fingerprints and chemometrics," Biologie et Médecine, vol. 29, p. 102276, 2020.
[20] F. A. D. Jia, S. B. Zhengyi, and T. C. C. Zhang, "CNN-SVM network abstract," Neurocomputing, 2020.
[21] A. Dongyao Jia, B. Zhengyi Li, and C. Chuanwang Zhang, "Detection of cervical cancer cells based on strong feature CNN-SVM network," Neurocomputing, vol. 411, pp. 112–127, 2020.
[22] J. Ren, A. Zhang, and X. Wang, "Jo ur na l P re," Pharmacological Research, no. article 104743, 2020.
[23] M. F. Ijaz, M. Attique, and Y. Son, "Data-driven cervical cancer prediction model with outlier detection and over-sampling methods," Sensors, vol. 20, no. 10, pp. 2809–2822, 2020.
[24] E. M. L. Ruiz, T. Niu, M. Zerfaoui et al., "A novel gene panel for prediction of lymph-node metastasis and recurrence in patients with thyroid cancer," Surgery, vol. 167, no. 1, pp. 73–79, 2020.
[25] D. Stelzle, L. F. Tanaka, K. K. Lee et al., "Estimates of the global burden of cervical cancer associated with HIV," The Lancet Global Health, vol. 9, no. 2, pp. e161–e169, 2021.
[26] Tiwari, Raj Gaurang, Alok Misra, Vikas Khullar, Ambuj Kumar Agarwal, Shubhi Gupta, and Arun Pratap Srivastava. "Identifying microscopic augmented images using pre-trained deep convolutional neural networks." In 2021 International Conference on Technological Advancements and Innovations (ICTAI), pp. 32-37. IEEE, 2021.
