
Improving the Prediction of Cervical Cancer Risk through Machine Learning Models: An Analysis of XGBoost, Hybrid, and Stacking Techniques


Abstract: This study investigates the use of artificial intelligence, specifically the XGBoost algorithm and several other machine learning models, to predict cervical cancer risk in 858 participants from a number of important variables, including age, previous medical history, smoking status, and pregnancy history. After a methodology that included data preprocessing, exploratory data analysis, advanced feature selection techniques, and the creation of hybrid and stacking models, the Stacking Model and Hybrid Model 3 were found to outperform the other models in terms of accuracy, precision, recall, and F1 score. These results demonstrate the effectiveness of these models in differentiating between positive and negative cases, underscoring the potential of machine learning in detecting and predicting cervical cancer. To achieve accurate and widely applicable cervical cancer risk assessment, it is recommended that future work concentrate on improving feature selection methods and data handling techniques, and on the ethical considerations and practicability of integrating predictive models into healthcare systems.

Key Words: Artificial intelligence, XGBoost, Machine learning Models, Data Preprocessing, Feature Selection Techniques, Hybrid and Stacking Models.

Introduction:
Cervical cancer is estimated to cause about 4000 deaths in the US and about 300,000 deaths globally [1], making it a serious global health concern. Early detection and diagnosis are critical to lowering the death rate from cervical cancer, which has led to research into the use of artificial intelligence and machine learning in imaging. The increase in global case numbers is expected to be attributable to the lack of diagnostic facilities and HPV vaccines in low-income countries. Its computational efficiency and robustness make the Extreme Gradient Boosting (XGBoost) algorithm a useful tool for regression and classification tasks in the medical domain [1], and it has gained popularity in a number of scientific domains. Focusing on the need for novel analytical techniques that take local constraints and limitations into account, researchers have concentrated on understanding the variables that contribute to elevated cervical cancer risk, such as elevated sexual activity and Human Papillomavirus (HPV). The objective of this research is to use the XGBoost algorithm to predict cervical cancer in 858 individuals based on factors such as age, number of pregnancies, smoking behaviour, and medical history. The study focuses not only on predicting biopsy results but also on identifying important factors that increase the risk of cervical cancer. It contributes to the field, in which cervical cancer is a prevalent women's health problem, by utilising sophisticated predictive methods that enable intelligent diagnosis and may aid in the diagnosis of various other cancers.

Worldwide, cervical cancer mortality rates are among the highest due to its widespread threat to women's health. Thanks to technological advancements and the collection of substantial cancer data, the medical research community can now investigate machine learning models for the prediction and analysis of cervical cancer. However, overfitting, improper use of dimensionality reduction techniques, and skewed data handling methods are just a few of the drawbacks that plague existing models. By emphasising sensitivity over total accuracy, this study attempts to overcome these constraints and create a more complete machine learning model that can forecast biopsy results for patients with cervical cancer. This paper, which is structured into five sections, covers the literature on cervical cancer prediction in detail. It also discusses the research methodology used to select the dataset, the implementation of the model using the XGBoost algorithm, and a thorough summary of the research findings.

Cervical cancer is one of the many women's cancers that pose a major global health risk and take a considerable toll on lives. Cervical cancer death rates can be greatly decreased by early detection and prevention. However, challenges with early detection and routine exams are partly caused by lack of knowledge, lack of access to healthcare resources, and financial constraints in certain nations. Cervical cancer diagnosis and prediction have shown promise with machine
learning techniques, especially when using algorithms like XGBoost. This paper's research emphasises the value of machine learning in healthcare and how it can improve patient outcomes. Additionally, it attempts to increase the diagnostic precision of deep learning algorithms in medical imaging and to develop a predictive model for cervical cancer diagnosis. This study examines how advanced machine learning techniques, text mining, and econometric tools can be used to diagnose and predict cervical cancer.

Literature Review:
One study proposed an efficient feature selection and prediction model for cervical cancer datasets that combines Boruta analysis with an SVM. Boruta analysis, an improvement on the random forest method, discovers the feature subsets in the data source that are significant to the assigned classification task. The proposed model's primary aim is to determine the importance of cervical cancer screening factors for classifying high-risk patients based on the findings. The work analyses cervical cancer and various risk factors to help detect the disease. Boruta with SVM and various popular ML models were implemented in Python and compared on accuracy, precision, F1-score, and recall, and the proposed Boruta analysis with SVM outperformed the existing methods.

Another study examined data mining classification techniques on cervical cancer risk factor datasets. Naive Bayes (NB), C4.5 Decision Tree (C4.5), k-Nearest Neighbors (kNN), Sequential Minimal Optimization (SMO), Random Forest Decision Tree (RF), Multilayer Perceptron (MLP) Neural Network and Simple Logistic Regression (SLR) were used to classify each record as a healthy or a cancer result for cervical cancer diagnosis. The dataset required an intensive pre-processing phase because it is imbalanced and has many missing values. Classification performance was evaluated using 10-fold cross-validation, with accuracy, precision and recall measured from the confusion matrix for every technique.

A third study used both exploratory data analysis (EDA) and machine learning algorithms to gain a thorough understanding of the risk factors associated with cervical cancer and to develop a dependable approach for detecting the disease. EDA is an essential step in understanding the patterns and relationships that exist within a dataset. In the case of cervical cancer, it can help identify the factors most likely to be responsible for the development of the disease, including age, sexual activity, family history, and exposure to the human papillomavirus (HPV). The KNeighbors Classifier with n_neighbors = 2 and normalisation achieved the highest accuracy, 0.9992.

A fourth study presents diverse classification techniques and shows the advantage of feature selection approaches for predicting cervical cancer. The data has thirty-two attributes and eight hundred and fifty-eight samples, and it suffers from missing values and class imbalance. Over-sampling, under-sampling and embedded over- and under-sampling were therefore used. Dimensionality reduction was also required to improve the accuracy of the classifier, so feature selection methods from two distinct categories, filters and wrappers, were studied. The results show that age, first sexual intercourse, number of pregnancies, smokes, hormonal contraceptives and STDs: genital herpes are the main predictive features, with a high accuracy of 97.5%. The Decision Tree classifier was shown to be advantageous in handling the classification task, with excellent performance.

In a further study, very deep residual learning based networks were designed to perform cervical cancer screening. The work highlights the importance of the activation function for the performance of a residual network (ResNet): three residual networks of the same structure were built with different activation functions. The models were trained and tested using a dataset of colposcopy cervical images, and the experimental results showed that
the designed residual networks with leaky and parametric rectified linear unit (Leaky-ReLU and PReLU) activation functions performed almost equally in terms of accuracy, reaching accuracies of 90.2 and 100%, respectively. This accuracy was compared with the results of related works and outperformed them in screening pre-cancerous and healthy colposcopy cervical images. Such an early and accurate diagnosis may help in preventing cervical cancer transformation.

Another paper proposes a method of cervical biopsy tissue image classification based on the least absolute shrinkage and selection operator (LASSO) and an ensemble learning-support vector machine (EL-SVM). Using the LASSO algorithm for feature selection reduced the average optimization time by 35.87 seconds while preserving classification accuracy, and serial fusion was then performed. The EL-SVM classifier was used to identify and classify 468 biopsy tissue images, and the receiver operating characteristic (ROC) curve and error curve were used to evaluate the generalisation ability of the classifier. Experiments show classification accuracies of 99.64% for normal-cervical cancer, 84.25% for normal-low-grade squamous intraepithelial lesion (LSIL), 87.40% for normal-high-grade squamous intraepithelial lesion (HSIL), 76.34% for LSIL-HSIL, 91.88% for LSIL-cervical cancer, and 81.54% for HSIL-cervical cancer.

In another study, the Raman spectrum signal of precancerous cervical tissue was collected, and the PLS and Relief methods were used to extract the signal characteristics of the spectrum. KNN and ELM classification models were then established and compared to achieve the diagnosis of cervical cancer. The experiment designed a novel feature fusion method for feature extraction, fusing the first and second derivative features, which reflect more peak details of the original spectrum. The accuracy of KNN is 88.17% without feature fusion and 93.55% after fusion; the accuracy of ELM is 90.81% without feature fusion and 93.51% after fusion. The results show that feature fusion improved accuracy to a certain extent, and the method is expected to be used as a new method of spectral data fusion.

In a further study, a model was developed to predict the risk of cervical cancer based on lifestyle choices. Important features were identified using the Extreme Gradient Boosting (XGBoost) Classifier. After oversampling, the data was fed into the model for training and testing. The Gradient Boost model was chosen and reached an accuracy of 98.9%. This model can be effective in associating risk factors with cervical cancer prediction, which can help in the effective prevention and management of cervical cancer.

A further proposed method, IBGC-CRF-SPSST, combines the benefits of constraint association among pixels and superpixel edge data for accurate determination of the nuclei and cytoplasmic boundaries, so as to ensure efficient differentiation of healthy and unhealthy cells. The pixel-level forecasting potential of Conditional Random Fields is then included to enhance semantic segmentation accuracy. The experimental results reported for IBGC-CRF-SPSST are an accuracy of 99.78%, a mean processing time of 2.18 sec, a precision of 96%, a sensitivity of 98.92%, and a specificity of 99.32%, which is considered excellent and on par with the existing detection techniques used for early investigation of cervical cancer.
Methodology:
Dataset Acquisition and Description:
The study's dataset, which includes 858 cases with 32 characteristics, was taken from the UCI repository (https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors). Missing values are a significant feature of this dataset and result from the decision made by some women to withhold critical information. The data is also notably imbalanced, with the majority of cases being non-cancerous. The dataset has four target variables (Hinselmann, Schiller, Cytology, and Biopsy), each of which corresponds to a different kind of cervical cancer test, and the imbalance affects all of them. Handling the missing data and resolving the class imbalance are therefore the main obstacles for research and modelling in this study.

Data Preprocessing:
Dataset Overview and Cleaning:
After the cervical cancer risk factors dataset was retrieved from the UCI repository, it was evaluated in an initial exploratory phase. Descriptive statistics were used to better understand the features of the dataset, which revealed missing values denoted by the symbol "?". Each column was then analysed to measure the amount of missing data. The missing values were consistently replaced with NaN to standardise data handling, which created the foundation for a methodical approach to the missing entries in the dataset.
Imputation and Transformation:
The dataset was converted to a numeric data type, a necessary step to enable computational analysis. The NaN values were then replaced with the column medians, a technique chosen because the median is robust to extreme values. This procedure filled in every missing value so that the dataset was uniform and complete, and ready for more in-depth analysis without jeopardising its coherence and reliability.

Exploratory Data Analysis (EDA):
A thorough analysis of the dataset was conducted using a variety of visualisations and descriptive statistics. The statistical summaries gave a comprehensive overview of the properties and distributions of the dataset. In addition, a variety of graphical displays, such as scatter plots and histograms, were produced to look at important characteristics like "Age," "Smokes," "Number of sexual partners," and "Number of pregnancies." Combining numerical summaries with graphical representations allowed a deeper understanding of the dataset's nuances and supported a more thorough analysis of the patterns and relationships within the data.
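The cleaning and imputation steps described in this section (treating "?" as missing, converting every column to a numeric type, and filling each gap with the column median) can be illustrated with a short sketch. It uses only the Python standard library so that it runs without extra dependencies; the paper does not state which tools were used, and in a pandas workflow the same steps are commonly written with `DataFrame.replace`, `pandas.to_numeric`, and `DataFrame.fillna`. The function names and the toy rows are hypothetical, and filling an entirely empty column with 0.0 is an assumption made only to keep the sketch total.

```python
from statistics import median

MISSING = "?"  # marker the dataset uses for withheld answers


def to_float_or_none(value):
    """Convert one raw cell to float; the "?" marker becomes None (missing)."""
    text = str(value).strip()
    return None if text == MISSING else float(text)


def impute_with_column_medians(rows):
    """Return numeric rows in which each missing cell holds its column median."""
    parsed = [{col: to_float_or_none(val) for col, val in row.items()} for row in rows]
    medians = {}
    for col in parsed[0]:
        observed = [r[col] for r in parsed if r[col] is not None]
        # Assumption: a column with no observed values at all is filled with 0.0.
        medians[col] = median(observed) if observed else 0.0
    return [
        {col: (medians[col] if val is None else val) for col, val in r.items()}
        for r in parsed
    ]


# Hypothetical toy rows; the column names mirror attributes named in the text.
raw_rows = [
    {"Age": "18", "Smokes": "0", "Number of pregnancies": "1"},
    {"Age": "35", "Smokes": "?", "Number of pregnancies": "?"},
    {"Age": "52", "Smokes": "1", "Number of pregnancies": "4"},
    {"Age": "?", "Smokes": "0", "Number of pregnancies": "2"},
]
clean_rows = impute_with_column_medians(raw_rows)
```

The median is computed from the observed values only, so a missing cell never influences the value used to fill it.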
Feature Selection and Preparation:
Feature importance analysis:
A feature importance analysis was carried out using the XGBoost Classifier to identify the factors that most affect the 'Biopsy' target variable. The top eight features were selected on the basis of their importance scores, which gave a clear picture of the main determinants of the biopsy result. A bar plot showing the relative importance and ranking of these features was used to improve the interpretability of the results. In addition to identifying the most important characteristics, the analysis therefore produced a straightforward visual representation of the main factors influencing the 'Biopsy' variable.

Dataset Refinement and Resampling:
Following the feature importance analysis, the dataset was refined to the top eight attributes. To address the class imbalance already present in the dataset, the resampling technique known as Random Oversampling was applied. This technique focuses on the minority class, increasing its representation so that the two classes are more evenly distributed. The more balanced dataset was intended to improve the accuracy and dependability of the later analyses and modelling.

Data Splitting into Training and Testing Data:
The prepared dataset was partitioned into separate training and testing subsets, using a split ratio of 80% for training data and 20% for testing data. Assigning the larger share to training allowed the models to learn patterns from a sizable portion of the dataset, while the smaller testing subset offered an independent set for assessing the models' performance. This split was intended to limit overfitting and to allow a trustworthy assessment of the models' predictive accuracy and generalizability.

Machine Learning Models:
Hybrid Model 1:
To improve the accuracy of classification for risk factors associated with cervical cancer, a hybrid model was built by combining the predictive capabilities of two different classifiers: Support Vector Machine (SVM) and XGBoost. Both models were trained on the training dataset, and class probabilities were predicted for the testing data. The hybrid model combined the individual predictions using a voting method, with the combined output expressed as the average of the two classifiers' predicted probabilities. To determine the efficacy of the hybrid model, performance evaluation metrics such as accuracy, precision, recall, and the F1 score were calculated. By combining SVM and XGBoost, the hybrid model attempted to leverage the predictive power of both classifiers to produce a more robust and accurate classification of cancer risk factors. The metrics obtained provide insight into the model's overall predictive capabilities as well as its potential to improve classification accuracy in the field of cervical cancer risk assessment.
Hybrid Model 2:
By combining the advantages of the CatBoost and LightGBM classifiers, a unique hybrid model was
created. The classifiers were first trained on the training dataset, and each was then used to predict class 1 probabilities on the testing data. To create a single output, the predicted probabilities from CatBoost and LightGBM were combined and averaged; this average is the hybrid model's prediction. The performance of the hybrid model was then evaluated using accuracy, precision, recall, and the F1 score. This fusion model combined the predictive capabilities of LightGBM and CatBoost in order to maximise their combined strengths for increased classification accuracy in the field of cervical cancer risk assessment. The computed metrics shed light on the hybrid model's effectiveness and promise for improving predictive accuracy when assessing risk factors for cervical cancer.
Hybrid Model 3:
Two different classifiers, Decision Tree and Random Forest, were combined to create a hybrid model that draws on the predictive strengths of both. Each classifier was trained individually on the training dataset, and predictions were made for the testing data. The hybrid model was created by combining the Random Forest and Decision Tree predictions and averaging the results to produce a single prediction. Performance metrics for the hybrid model were calculated, including accuracy, precision, recall, and the F1 score. By utilising the combined predictive powers of Random Forest and Decision Tree, this fusion model aimed to improve the precision and stability of classification in the field of cervical cancer risk assessment. The metrics acquired are indicators of the overall effectiveness of the hybrid model and its potential to improve predictive accuracy when evaluating cervical cancer risk factors.
Stacking Model:
A StackingClassifier was created by combining XGBoost, CatBoost, Random Forest, and Decision Tree classifiers as base models. The StackingClassifier was trained on the given dataset using Logistic Regression as the meta-classifier. Predictions were then made using the testing data to assess the performance of the model. By combining various base classifiers, the StackingClassifier attempted to combine their distinct predictive powers to provide a strong combined predictive result for evaluating cervical cancer risk factors. The calculated metrics (precision, recall, accuracy, and F1 score) act as performance markers for the StackingClassifier, offering information about its effectiveness and potential to improve cervical cancer risk factor prediction accuracy.

Result:
Hybrid Model 1:

Accuracy    0.9286
Precision   0.8614
Recall      1
F1 Score    0.9256

Performance metrics for hybrid model 1 are displayed in the table. With an accuracy of 0.9286, it demonstrated a generally high degree of prediction accuracy. The precision of 0.8614 signifies that most positive predictions were correct. A recall of 1, which is perfect, indicates that the model identified all real positive instances. With an F1 score of 0.9256, the trade-off between recall and precision is well-balanced. All things considered, the model performs well, identifying every positive instance with relatively few false positives. For the model to be applied, however, the particular problem's context and the most suitable balance between recall and precision must be taken into account.

Confusion Matrix:
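The metrics reported for Hybrid Model 1 can be cross-checked against the confusion-matrix counts given in the text for that model (156 true negatives, 23 false positives, 0 false negatives, 143 true positives). The sketch below applies the standard definitions of accuracy, precision, recall, and F1; the function name `classification_metrics` is illustrative, and the zero-division guards are an addition for robustness rather than something the paper describes.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts reported in the text for Hybrid Model 1 on the 322 test instances.
hybrid_1 = classification_metrics(tn=156, fp=23, fn=0, tp=143)
rounded = {name: round(value, 4) for name, value in hybrid_1.items()}
print(rounded)
# {'accuracy': 0.9286, 'precision': 0.8614, 'recall': 1.0, 'f1': 0.9256}
```

The rounded values agree with the table for Hybrid Model 1, and the same function reproduces the figures for the other models from their reported counts.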
The assessment of hybrid model 1's performance.

According to the matrix, 156 of the 179 class 0 instances were correctly predicted (true negatives), while 23 were incorrectly classified as class 1 (false positives). The model also correctly predicted all 143 class 1 instances (true positives), with none incorrectly labelled as class 0 (false negatives). For hybrid model 1, this matrix shows a strong ability to identify class 1 instances together with relatively few misclassifications of class 0 instances.
Hybrid Model 2:

Accuracy    0.9627
Precision   0.9226
Recall      1
F1 Score    0.9597

With an accuracy of 0.9627 and a precision of 0.9226 for positive predictions, hybrid model 2 exhibits high accuracy. With a perfect recall of 1, it identifies all positive instances. With an F1 score of 0.9597, the trade-off between recall and precision is balanced. All things considered, the model performs admirably, identifying every positive instance while retaining strong precision. For the application of this model, however, contextual factors are essential in choosing the right trade-off between recall and precision.
Confusion Matrix:

The hybrid model 2 evaluation results are shown in the provided confusion matrix. The model correctly predicted 167 of the 179 class 0 cases (true negatives) and incorrectly classified 12 as class 1 (false positives). It accurately predicted all 143 class 1 instances (true positives) and classified none of them as class 0 (no false negatives). This matrix shows how well the model identifies both classes, with a high degree of accuracy for class 1 instances and a comparatively low misclassification rate for class 0 instances, and it highlights hybrid model 2's strong performance in differentiating between the two classes.
Hybrid Model 3:

Accuracy    0.9783
Precision   0.9533
Recall      1
F1 Score    0.9761
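The hybrid models are described in the methodology as averaging the outputs of their two base classifiers (Decision Tree and Random Forest in the case of Hybrid Model 3). A minimal sketch of that combination step follows, assuming each base classifier supplies a class 1 probability per test instance and that the averaged probability is turned into a label with a 0.5 threshold. The threshold is an assumption, since the text does not state one, and the function names and probability values are hypothetical.

```python
def average_probabilities(probs_a, probs_b):
    """Element-wise mean of two classifiers' class 1 probabilities."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both classifiers must score the same instances")
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


def to_labels(probs, threshold=0.5):
    """Label an instance as class 1 when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical class 1 probabilities for four test instances.
tree_probs = [0.0, 1.0, 1.0, 0.0]    # e.g. from a decision tree
forest_probs = [0.2, 0.9, 0.4, 0.7]  # e.g. from a random forest

hybrid_probs = average_probabilities(tree_probs, forest_probs)
hybrid_labels = to_labels(hybrid_probs)
```

In the third instance the two base classifiers disagree, and the average (0.7) resolves the disagreement in favour of class 1; in the fourth the average (0.35) resolves it in favour of class 0.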
With an accuracy of 0.9783 and a precision of 0.9533 for positive predictions, hybrid model 3 exhibits exceptional performance. With a perfect recall of 1, it identifies all positive instances. With a score of 0.9761, the F1 exhibits a well-balanced recall and precision trade-off. Overall, the model performs robustly, identifying every positive instance while maintaining high precision. For the application of this model, it is necessary to take into account the particular context in order to choose the right trade-off between recall and precision.
Confusion Matrix:

The results of the hybrid model 3 evaluation are displayed in the provided confusion matrix. Of the 179 class 0 cases, 172 were correctly predicted as class 0 (true negatives) and seven were incorrectly classified as class 1 (false positives). The model accurately predicted all 143 class 1 cases (true positives), and none of them were mistakenly classified as class 0 (no false negatives). This matrix highlights the model's remarkable accuracy in both classes: it detects class 0 instances with very few misclassifications and class 1 instances with high accuracy. It highlights the model's excellent performance in differentiating between the two classes, emphasising hybrid model 3's precision and accuracy in classification tasks.
Stacking Model:

Accuracy    0.9783
Precision   0.9533
Recall      1
F1 Score    0.9761

With a precision of 0.9533 and an accuracy of 0.9783 for positive predictions, the stacking model demonstrates a high level of accuracy. With a recall of 1, it identifies every positive instance. At 0.9761, the F1 score exhibits a well-balanced recall versus precision trade-off. Overall, the model performs robustly, identifying every positive instance while maintaining high precision. For the application of this model, it is necessary to take into account the particular context in order to choose the right trade-off between recall and precision.
Confusion Matrix:
The stacking model is highly accurate, with 172 true negatives and 143 true positives. The predictions for both classes were accurate, with 7 false positives and 0 false negatives. The model thus successfully differentiates between the classes, which underlines its accuracy and precision in classification tasks.
Comparative Study:

Model             Accuracy
Hybrid Model 1    0.9286
Hybrid Model 2    0.9627
Hybrid Model 3    0.9783
Stacking Model    0.9783

The accuracy ratings of the various models are shown in the table. Hybrid Model 1 attained an accuracy of 0.9286, while Hybrid Model 2 showed a higher accuracy of 0.9627. Hybrid Model 3 demonstrated an even higher accuracy of 0.9783, matching the Stacking Model, which also attained 0.9783. This comparison shows that accuracy increased steadily from Hybrid Model 1 to Hybrid Model 3 until it reached the same level as the Stacking Model. It demonstrates how the predictive power of the hybrid models improves from one to the next until it reaches a level of performance comparable to that of the advanced Stacking Model, and how these models have been refined to handle the provided data more effectively.
Conclusion and Future Work:
The goal of the study was to predict cervical cancer risk factors by utilising the capabilities of the XGBoost algorithm and different machine learning models. The methodology of the study included careful preprocessing of the dataset, exploratory data analysis, thorough feature selection, and the development of hybrid and stacking models. The outcomes showed that, in comparison to the other models, Hybrid Model 3 and the Stacking Model demonstrated remarkable accuracy, precision, recall, and F1 scores. Both consistently showed resilience in detecting risk factors for cervical cancer and excellent accuracy in categorising positive cases. By demonstrating the models' ability to distinguish between positive and negative cases, these results highlight how machine learning models, especially ensemble methods, can greatly aid in the precise detection and prediction of cervical cancer.

To advance this research, future work should focus on improving feature selection methodologies for a deeper comprehension of key predictors of cervical cancer risk. The models' dependability would be further increased by improving data handling techniques, particularly the handling of missing data and unequal classes. Ensuring the practical applicability and efficacy of these models in medical settings will require investigating a greater range of model ensembles and concentrating on real-world clinical validation. To protect patient privacy, equity, and usability in a variety of healthcare settings, it is also necessary to take ethical considerations and global adaptability into account when integrating these predictive models into healthcare systems. The development of more precise and broadly applicable predictive models for cervical cancer risk assessment will require ongoing progress and thorough validation.
