Table of Contents
Abstract......................................................................................................................................................
1 Introduction........................................................................................................................................
1.1 Traditional Methods Used to Diagnose Breast Cancer in the Medical Domain..........................
1.2 Role of Machine Learning Techniques in Predicting Breast Cancer...........................................
1.3 Hybrid Machine Learning Models for Predicting Breast Cancer.................................................
1.4 Scope for Research...........................................................................................................................
1.5 Aim of the research...........................................................................................................................
1.6 Research question.............................................................................................................................
1.7 Research Hypothesis.........................................................................................................................
1.8 Objectives.........................................................................................................................................
1.9 Limitations........................................................................................................................................
1.10 Ethical Issues.....................................................................................................................................
1.11 Project Planning and Timescales.......................................................................................................
1.12 Risk Analysis......................................................................................................................................
1.13 Ethical Approval................................................................................................................................
2 Literature Review..............................................................................................................................
2.1 Feature selection and dimensionality reduction usage in breast cancer predictions........................
2.2 Dimensionality Reduction Techniques for Feature Selection and Feature Extraction.......................
2.3 Breast cancer prediction using feature selection..............................................................................
2.4 Analysis and Conclusions..................................................................................................................
3 Methodology....................................................................................................................................
3.1 Choice of Methods............................................................................................................................
3.2 Dataset Description...........................................................................................................................
3.3 Feature selection Methods...............................................................................................................
3.3.1 Chi-square...........................................................................................................................
3.3.2 L1-based feature selection..................................................................................................
3.3.3 Recursive Feature Elimination.............................................................................................
3.4 Dimensionality reduction techniques...............................................................................................
3.4.1 Principal Component Analysis (PCA)...................................................................................
3.4.2 Linear Discriminant Analysis (LDA).....................................................................
3.5 Machine Learning algorithm.............................................................................................................
3.5.1 SVM classifier......................................................................................................................
3.5.2 Random Forest classifier.....................................................................................................
3.5.3 MLP.....................................................................................................................................
3.5.4 Passive Aggressive classifier (PAC)......................................................................................
3.6 Evaluation metrics.............................................................................................................................
3.6.1 Confusion Report................................................................................................................
3.6.2 Time taken for training, validating & testing the data........................................................
4 Experiments & results.......................................................................................................................
4.1 Breast Cancer Data Collection...........................................................................................................
4.2 Breast Cancer Data Description........................................................................................................
4.3 Breast Cancer Data Preprocessing....................................................................................................
4.4 Breast Cancer Data Visualization.......................................................................................................
4.5 Label Encoding Process.....................................................................................................................
4.6 Splitting the Breast Cancer Data........................................................................................................
4.7 ML model implementation...............................................................................................................
4.7.1 SVM model..........................................................................................................................
4.7.2 Random Forest Classifier.....................................................................................................
4.7.3 Decision Tree.......................................................................................................................
4.7.4 MLP Classifier......................................................................................................................
4.7.5 Passive Aggressive Classifier.............................................................................
5 Results & conclusion..........................................................................................................................
5.1 Technical Challenges Faced & Their Solutions....................................................................
5.1.1 Name error.........................................................................................................................
5.1.2 Attribute Error.....................................................................................................................
5.2 Interpretation of Results...................................................................................................................
5.2.1 Accuracy.............................................................................................................................
5.3 Critical Analysis.................................................................................................................................
5.3.1 Research Findings...............................................................................................................
5.3.2 Comparison with other research work................................................................................
5.4 Conclusion.........................................................................................................................................
5.4.1 Addressing Research Questions..........................................................................................
5.5 Future Enhancement........................................................................................................................
6 References.........................................................................................................................
Appendix..................................................................................................................................................
List of tables
Table 1 Parameters of SVM.........................................................................................................46
Table 2 Results of SVM................................................................................................................46
Table 3 Parameters of random forest.........................................................................................47
Table 4 Results of random forest................................................................................................48
Table 5 Parameters of decision tree...........................................................................................49
Table 6 Results of decision tree..................................................................................................49
Table 7 Parameters of MLP.........................................................................................................50
Table 8 Results of MLP................................................................................................................51
Table 9 Parameters of passive classifier......................................................................................52
Table 10 Results of passive classifier...........................................................................................52
Table 11 Overall results...............................................................................................................55
List of figures
Figure 1 Support Vector Machine Classifier [19].........................................................................33
Figure 2 Architecture of the Random Forest algorithm [24].......................................................34
Figure 3 Architecture of the Decision Tree algorithm [24]..........................................................35
Figure 4 Multi-Layer Perceptron [31]..........................................................................................37
Figure 5 Passive Aggressive Classifier [36]..................................................................................38
Figure 6 Visualization of the first 10 rows..................................................................42
Figure 7 Data visualization..........................................................................................................43
Figure 8 Data visualization of radius and perimeter....................................................................44
Figure 9 Label encoding code......................................................................................................44
Figure 10 Label encoding graph..................................................................................................45
Figure 11 Data splitting...............................................................................................................45
Abstract
With the advancement of biomedical and computer technologies, a vast amount of data
on various clinical factors related to breast cancer have been collected, providing a new
opportunity for accurate predictions of the disease. The problem with using high-
dimensional medical data to predict breast cancer is that this data can be difficult to
interpret and analyze. Due to its complexity, traditional techniques such as logistic
regression and decision trees may not be able to accurately capture the underlying
relationships between the various clinical factors. Furthermore, the prediction accuracy
of such models can be limited due to the high-dimensional nature of the data, which can
lead to over- or under-fitting of the model. As such, new methods must be developed to
effectively utilize the high-dimensional medical data in order to improve the accuracy of
breast cancer predictions. This article presents research on the usefulness of feature
selection and dimensionality reduction strategies in enhancing the precision of breast
cancer prediction. Five machine learning (ML) classifiers, namely the support vector
machine, random forest, decision tree, passive aggressive classifier, and multi-layer
perceptron (MLP), were studied and contrasted, together with three feature selection
techniques (L1-based feature selection, recursive feature elimination (RFE), and
wrapper-based forward selection and backward elimination) and two dimensionality
reduction methods (principal component analysis and linear discriminant analysis).
The study provides insights into the importance of feature selection and dimensionality
reduction for obtaining better accuracy in breast cancer predictions.
1 Introduction
The focus of this study is on contrasting the usefulness of feature selection and
dimensionality reduction methods for enhancing the precision of breast cancer forecasts.
Machine learning methods, namely the support vector machine, random forest, decision
tree, passive aggressive classifier, and multi-layer perceptron (MLP), will be tested to
see which ones work best, together with the feature selection techniques of L1-based
feature selection, recursive feature elimination (RFE), and wrapper-based forward
selection and backward elimination, and with principal component analysis and linear
discriminant analysis as two examples of dimensionality reduction techniques.
Breast cancer is the most common form of cancer in women worldwide, accounting for
about 30 percent of all occurrences of cancer in females. More than 1.5 million women
are diagnosed with breast cancer yearly, and 500,000 women lose their lives to this
disease across the world. Breast cancer is also regarded as a multifactorial disease. The
condition has become more prevalent during the last three decades, despite a concurrent
decline in the mortality rate. On the other hand, it is anticipated that mammography
screening will result in a 20% decrease in fatalities, while improvements in cancer
therapy will contribute a further 60% improvement [1].
All women should have a mammogram at regular intervals, either once every year or
once every two years, since this kind of screening for breast cancer is essential and
effective. However, a fixed, one-size-fits-all screening programme is inefficient at
recognising malignancy at the individual level and has the potential to undermine the
effectiveness of screening initiatives. At the other end of the spectrum, medical
specialists are of the opinion that a more precise diagnosis for women who are at risk
may be reached by taking other risk factors into consideration alongside mammography
screening. Accurate risk prediction through modelling can aid the recognition of
patients who are at the greatest risk of developing the disease, and may also help
radiation oncologists organise personalised screening for patients and advise them to
take part in early-detection programmes [3].
In recent years, machine learning has gained traction in the healthcare industry for
disease prediction. This modelling approach involves gathering details from data and
finding hidden relationships. While some research has relied only on demographic risk
indicators (such as lifestyle and laboratory data) to make breast cancer predictions,
other studies have included mammographic findings or data from patient biopsies, and
still others have demonstrated the prediction of breast cancer using genetic
information [4].
One of the most challenging components of breast cancer prediction is creating a model
that takes into consideration all of the known characteristics that increase the
likelihood of developing breast cancer. The most recent prediction models may focus
exclusively on mammographic pictures or demographic risk variables, leaving other
important aspects out of account. In addition, models that are not precise enough in
identifying women at high risk might lead to frequent screenings and invasive sampling
using magnetic resonance imaging (MRI) and ultrasound, leaving patients to bear both
the financial and psychological burdens of the situation [5].
In order to accurately forecast a woman's chance of developing breast cancer, a number
of criteria, such as demographic, laboratory, and mammographic risk factors, need to be
considered. As a result, multifactorial models, which incorporate a number of potential
risk variables into their assessment, have the potential to be helpful in accurately
estimating the likelihood of developing breast cancer [6].
It is possible for breast cancer to form in either the fatty tissue or the fibrous connective
tissue of the breast. Cancer of the breast is a deadly malignant tumor that rapidly spreads
throughout the body. Breast cancer scores highly in terms of fatality rates among
females. Risk for breast cancer can be increased by a number of other variables,
including being older and having a family history of the disease. The medical
community faces a significant hurdle when attempting to make a correct diagnosis of
breast cancer. The wide variety of tests available adds unnecessary complexity to
diagnosis and makes it harder to draw meaningful conclusions. It is therefore necessary
to implement computational diagnostic approaches with the help of AI and ML. However,
there are challenges associated with using high dimensional data for breast cancer
prediction using machine learning. These include data sparsity, class imbalance, and
overfitting. Additionally, the high number of features can make it difficult to interpret
the models and identify the most important features for predicting cancer. The intelligent
system has to perform dimensionality reduction by mapping the high-dimensional input
into a low-dimensional subspace so it can assess the depth of the relationships involved.
Reducing the high dimensional medical data through feature ranking and dimensionality
reduction techniques may be useful in dealing with high dimensionality. As a result,
these techniques can boost the effectiveness of classification algorithms in terms of, for
example, projected accuracy, speed of prediction, and clarity of results, all while
reducing operational expenses.
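As a minimal sketch of the idea described above, the following Python fragment maps the 30-dimensional input into a low-dimensional subspace with PCA. It uses scikit-learn and its bundled copy of the Wisconsin breast cancer data; the choice of five components is an illustrative assumption rather than a tuned value.

```python
# Sketch: reducing high-dimensional medical data with PCA before classification.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)    # 569 samples, 30 features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA(n_components=5)                     # assumed component count
X_low = pca.fit_transform(X_scaled)           # map into a 5-D subspace
print(X.shape, "->", X_low.shape)
print("variance retained: %.2f" % pca.explained_variance_ratio_.sum())
```

Even this untuned projection retains most of the variance of the original 30 features, which is why such mappings can keep the essential relationships while shrinking the input.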
The analysis of data is undergoing a sea change at breakneck speed. Analysis may start
with a basic study of the raw data and then move on to employing intelligent methods.
Common approaches used to develop understanding of a dataset include machine
learning, fuzzy logic, and artificial intelligence, to name just a few. In order to
learn from the data, a technique that is based on machine learning uses
both supervised and unsupervised methods. Once the data have been learnt, the
suggested technique is able to make predictions about those data with a high level of
accuracy [7]. Methods based on ML may be used to make predictions about the data
depending on the label of the dataset. If the label of the data is categorical, then
classification techniques will be used; on the other hand, regression techniques may be
used if the data being analyzed are continuous. The data may also be grouped by the
application of clustering algorithms, which allow the prediction of new instances on
the basis of the approach that is used [8]. Classification techniques are broken down
into linear and non-linear categories according to whether they produce linear or
non-linear decision boundaries when applied to data sets. With a linear model, one is
able to make predictions based on data that is spread linearly; on the other hand,
methods such as decision trees, neural networks, SVM, and KNN may be used for
processing non-linear data. During the classification phase, some of the most common
methods utilized include decision trees, neural networks, support vector machines, and
k-nearest neighbours [9]. The neural network-based approach offers significant untapped
potential; in light of the benefits it offers, it has also grown into an advanced neural
network-based learning approach known as the deep learning method. The phenomenon
known as the dimensionality curse is one that often occurs with gene expression data,
where the number of dimensions might range anywhere from a few hundred to thousands
[10]. When faced with such obstacles, the application of any model becomes not only
difficult but also time-consuming. It is not possible to sketch the most important
features with the assistance of visualization, and no model can present such a vast
number of dimensions in a clear and concise manner. Under these circumstances, it is
necessary to cut down on the number of dimensions without jeopardizing the fundamental
properties of the dataset. Feature selection and dimensionality reduction techniques
may be used to accomplish this goal successfully; when they are applied to a dataset,
the essential qualities of the dataset are not lost [11].
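A minimal illustration of feature selection that keeps a subset of the original attributes intact (rather than transforming them, as PCA does) is a chi-square ranking with scikit-learn's SelectKBest; the value k=10 is assumed purely for illustration.

```python
# Sketch: chi-square feature ranking keeps the 10 highest-scoring original
# attributes, so the retained columns remain directly interpretable.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

data = load_breast_cancer()
# chi2 requires non-negative inputs; these measurements are all non-negative.
selector = SelectKBest(chi2, k=10).fit(data.data, data.target)
kept = data.feature_names[selector.get_support()]
print("kept features:", list(kept))
```

Because the surviving columns are original measurements, a clinician can still read them directly, which is one practical sense in which "the qualities of the dataset are not lost."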
The majority of breast cancers are curable if caught early; however, many women are
nonetheless diagnosed with advanced forms of the disease. In addition to aiding in the
management and prevention of cancer recurrence, improved diagnostic methods play a
crucial role in patient-specific treatment plans. An accurate breast tumour
classification system that can distinguish between malignant and benign breast tumours
is necessary. When diagnosing and prognosticating breast cancer, doctors typically
compile information from a variety of sources, including patient histories, laboratory
results, and research on the disease. Due to the sheer volume of data, managing and
analysing high-dimensional medical data can be difficult and lead to a number of
problems. One of the biggest issues with high-dimensional medical data is the curse of
dimensionality. This leads to overfitting, which is when the model is overly complex
and only works well on the training data but not on unseen data. Another problem with
high-dimensional medical data is the difficulty of selecting relevant features. Due to
the sheer number of variables, it can be hard to identify which ones are important and
which are not, and results can be biased if only an arbitrary subset of features is
used for analysis. Feature selection and dimensionality reduction techniques can help
to address these issues. Selecting the most relevant features from a dataset is the
goal of feature selection methods such as L1-based feature selection, RFE, and
wrapper-based forward selection and backward elimination. Dimensionality reduction
methods lower the number of dimensions by combining variables; principal component
analysis (PCA) and linear discriminant analysis (LDA) are two examples. These methods
have the potential to simplify the dataset and boost the reliability of the resulting
model.
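Two of the feature selection methods named above, L1-based selection and RFE, can be sketched in scikit-learn as follows. The regularisation strength C=0.1 and the target of 10 features are illustrative assumptions, not values used in this study.

```python
# Sketch: L1-based feature selection vs recursive feature elimination (RFE).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives uninformative coefficients to exactly zero, so the
# surviving (non-zero) features form the selected subset.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_sel = SelectFromModel(l1_model).fit(X, y)

# RFE repeatedly refits the model and drops the weakest feature until 10 remain.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

print("L1 kept:", l1_sel.get_support().sum(), "features")
print("RFE kept:", rfe.support_.sum(), "features")
```

Note the design difference: L1 selection decides the subset size implicitly through the penalty strength, while RFE is told the target size explicitly.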
1.5 Aim of the research
This research is focused on developing an efficient model that can predict breast
cancer based on medical data. To build the model successfully, the research mainly
focuses on feature selection methods and dimensionality reduction techniques applied to
the data to reduce its dimension, and will verify the efficiency of these two
approaches by applying the dimension-reduced data to machine learning algorithms.
1.6 Research question
What are the relative performance differences between feature selection and
dimensionality reduction techniques in improving the accuracy of breast cancer
predictions?
1.7 Research Hypothesis
Improved breast cancer prediction accuracy may be achieved via the use of
dimensionality reduction techniques rather than feature selection methods.
1.8 Objectives
To evaluate the results with and without feature selection methods and
dimensionality reduction techniques.
1.9 Limitations
One of the study's potential flaws is that it doesn't take into consideration that there are
numerous varieties of breast cancer, and that each type may call for a unique approach to
therapy. There are likely more variables, such as those related to one's lifestyle and the
surrounding environment, that contribute to cancer development, but these were not
explored in this research. The study does not address how to ensure that the model is
able to handle highly imbalanced data sets.
1.10 Ethical Issues
Data accuracy, completeness, and timeliness should be prioritised first. Medical records
may include private information about patients, so it is important that they are kept
secure and that the data is not used without the patient’s consent. In addition, it is
crucial to verify that the data is not utilised in a manner that might result in injury or
prejudice.
Second, it is important to ensure that the algorithms used are fair and unbiased.
Algorithms can be biased if they are built on data that has been collected in a way that is
biased. It is essential to gather data in a manner that is representative of the whole
community.
Risk: Feature selection and dimensionality reduction techniques not applied correctly.
Likelihood: Medium. Impact: High. Mitigation: Feature selection and dimensionality
reduction techniques must be applied correctly to ensure the best performance of the
machine learning algorithms.
1.13 Ethical Approval
The ethical considerations discussed above do not necessarily require ethical approval,
as the data is being collected from an open-source website, Kaggle, and the
programming language used is Python. As the data is freely available, it does not require
the consent of the patients for its use. In addition, the Kaggle data set is collected from
reliable sources and is regularly updated. Furthermore, Python is a general purpose
programming language and does not require any special permissions for its use. Thus,
the use of Kaggle and Python ensures that the data is accurate, complete, and up-to-date.
Moreover, Python programming can be used to create algorithms that are designed to be
fair and unbiased, by avoiding the use of any data that may be biased. Therefore, no
ethical approval is required in this study.
2 Literature Review
This literature review aims to analyse existing research on the use of feature selection
and dimensionality reduction techniques for effective breast cancer predictions. This
review will discuss the findings of the existing research and draw conclusions on
which methods need to be explored in this study for accurate prediction of breast cancer.
The primary goal of this study [12] was to employ correlation analysis and the variance
of input features as feature selection strategies, then feed the relevant features to a
classification algorithm. In order to enhance breast cancer categorization, the authors
adopted an ensemble approach. The publicly available Wisconsin Breast Cancer Dataset
(WBCD) was used to test the suggested method. Dimensionality reduction
was accomplished by correlation analysis and principal component analysis. Many
machine learning techniques were tested, and their results compared and contrasted: LR,
SVM, NB, KNN, RF, DT and SGD. The effectiveness of the classifiers was enhanced
by tweaking their hyper-parameters. Two distinct voting methods were used in
combination with the top performing classification algorithms. The class chosen by the
majority of voters is the one predicted by a hard vote, whereas the class chosen by the
highest probability is the one predicted by a soft vote. The suggested technique exceeded
the state-of-the-art by a wide margin, with an accuracy of 98.24%, high precision of
99.29%, and recall of 95.89%.
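The hard/soft voting scheme described above can be sketched as follows. The three base classifiers and their default hyper-parameters are illustrative assumptions; this sketch will not reproduce the exact figures reported in [12].

```python
# Sketch: hard voting (majority class) vs soft voting (argmax of averaged
# class probabilities) over three base classifiers.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

estimators = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
    ("rf", RandomForestClassifier(random_state=0)),
]
hard = VotingClassifier(estimators, voting="hard").fit(X_tr, y_tr)
soft = VotingClassifier(estimators, voting="soft").fit(X_tr, y_tr)
print("hard-vote accuracy: %.3f" % hard.score(X_te, y_te))
print("soft-vote accuracy: %.3f" % soft.score(X_te, y_te))
```

Soft voting requires base classifiers that expose calibrated probabilities (hence `probability=True` for the SVC), whereas hard voting needs only class predictions.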
In this study [13], two machine learning algorithms, the SVM and the Extreme Gradient
Boosting approach, were evaluated against one another in a classification setting. To
make classification easier, PCA is used to extract features from the raw data and limit
the number of data attributes. In addition to PCA, K-Means is employed for
dimensionality reduction as a clustering technique. In this work, the results are examined
by applying four distinct models to the Wisconsin Breast Cancer Dataset, each of which
makes use of a different dimensionality reduction technique and one of two different
classifiers. Accuracy, sensitivity, and specificity metrics evaluated from the confusion
matrices will be used to make the comparison. Results from experiments demonstrate
that the less popular K-Means approach for dimensionality reduction is on par with the
more popular Principal Component Analysis.
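The idea of using K-Means for dimensionality reduction, as compared with PCA in [13], can be sketched as follows: `KMeans.transform` replaces the 30 original attributes with distances to k cluster centres. The value k=5 is an assumption for illustration only.

```python
# Sketch: K-Means vs PCA as dimensionality reduction steps. Both map the
# 30 original features down to 5 derived features.
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# KMeans.transform yields, for each sample, its distance to each centroid.
X_km = KMeans(n_clusters=5, n_init=10, random_state=0).fit_transform(X)
X_pca = PCA(n_components=5).fit_transform(X)
print("K-Means features:", X_km.shape, " PCA features:", X_pca.shape)
```

Either 5-column matrix can then be fed to a downstream classifier in place of the original 30 columns.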
In order to predict breast cancer, the authors of this study [14] offer a hybrid model
with a collection of features optimised by a genetic algorithm (CHFS-BOGA). In place of
chance and random selection, the authors propose OGA, enhancing the initialization
generation and genetic algorithms with the C4.5 decision tree classifier as
gathered from Wisconsin UCI machine learning. Weka, an open-source data mining
programme, was used to conduct an evaluation of the dataset using its explorer module.
The results demonstrate that when compared to the single-filter techniques and PCA, the
suggested hybrid feature selection approach provides superior results, and the selected
features help to improve the predictions. Previous iterations of the proposed system
(CHFS-BOGA) utilising support vector machine (SVM) classifiers reached an accuracy of
97.3 percent.
Using (CHFS-BOGA-SVM), they achieved a best-in-class 98.25% accuracy on a data
set composed of 70.0% training data and 30.0% testing data, and a perfect 100%
accuracy on the whole training set. Not only that, but the ROC curve had a value of 1.0.
The findings demonstrated that the suggested (CHFS-BOGA-SVM) system successfully
distinguished between malignant and benign breast tumours.
Dimensionality reduction, feature ranking, fuzzy logic, and an artificial neural network
are all used in this study [15] to develop a new approach to data classification. The
purpose of this research is to evaluate the current integrated methods to breast cancer
detection and prognosis and to draw conclusions about their relative merits. The best
diagnostic classification accuracy is provided by principal component analysis (PCA)
using a neural network (NN), however gain ratio and chi-square also perform well
(85.78 percent). These findings pave the way for the creation of a Medical Expert
System model, which may be utilised for the automated diagnosis of additional diseases
and high-dimensional medical datasets.
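The PCA-plus-neural-network combination reported in [15] can be sketched with scikit-learn's MLPClassifier as below. The component count and network size are illustrative assumptions, and the resulting accuracy will differ from the cited 85.78 percent since the data and preprocessing differ.

```python
# Sketch: dimensionality reduction (PCA) feeding a neural network classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(
    StandardScaler(),                # scale before PCA and the network
    PCA(n_components=10),            # assumed component count
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
clf.fit(X_tr, y_tr)
print("test accuracy: %.3f" % clf.score(X_te, y_te))
```

Wrapping the three steps in one pipeline ensures the PCA projection is fitted only on training data, avoiding leakage into the held-out evaluation.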
Medical disease analysis, racial profiling and gene classification are just some of the
topics covered in this paper [16]. In addition, for each feature selection and feature
extraction method, the authors summarise the techniques/algorithms, datasets, classifier
approaches, and achieved findings relating to the accuracy and computational time.
Around half of the examined studies relied on various optimisation strategies, a
tendency the authors observed when examining how researchers reduced dimensionality
using feature selection methods. Both the SVM and the KNN classifiers are quite common;
however, the SVM achieves better accuracy. Still, 7 of the analysed research
methodologies rely heavily on CNN and DNN algorithms for feature
extraction. In feature extraction, principal component analysis (PCA) is still widely
utilised, having been implemented in 8 different approaches so far. In addition, the
optimised PCA had the potential for enhanced efficiency in terms of both computational
time and the number of features that were eliminated in the process.
The goal of this research [17] is to create a system for early-stage breast cancer
prediction using the minimal amount of features that can be extracted from the clinical
information. The planned experiment has been carried out using the Wisconsin breast
cancer dataset (WBCD). Using the most predictive factors, the KNN classifier has been
found to produce the highest classification accuracy of 99.28%. Detecting breast cancer
at an early stage using the suggested method drastically reduces medical costs and
improves quality of life.
The goal of this effort [18] is to categorise breast cancer patients as having a recurrence
or not. In this study, the authors used a breast cancer categorization dataset to identify
the optimal feature set for prediction. As feature selection methods, Chi-squared and the
Mutual Information method have been employed. Afterward, the Logistic Regression
model utilised the decided-upon features to arrive at its conclusion. Specifically, it was
shown that the Mutual Information method was more effective and yielded more reliable
forecasts.
Recently, researchers have increased their efforts to work on datasets with a large
number of attributes, known as Big Data. This is a direct result of the revolution in
technology and the growth of the field of data science. Such data, which consist of
many different factors, may be analysed using the efficient, effective, and influential
methodologies of dimension reduction technology. The value of the
technology known as "data processing, pattern recognition, machine learning, and data
mining" rests in a variety of industries, including those listed above. This study
examines the similarities and differences between two important techniques for reducing
dimensionality—namely, feature extraction and feature selection—both of which are
used often in machine learning models. The authors used a variety of classifiers, such as
Support vector machines, k-nearest neighbours, Decision tree, and Naive Bayes, on the
data from the anthropometric survey of US Army personnel (ANSUR 2) in order to
categorise the data and test the relevance of features by determining particular
characteristics in USA Army personnel. The results showed that k-nearest neighbours
achieved high accuracy (83%) in prediction after the dimensions were reduced using a
number of techniques. The findings of this study make it abundantly
evident that the effectiveness of strategies for dimension reduction is going to vary
depending on the kind of data being dealt with. When it comes to text data, some
methods are more effective than others, but other methods perform better when dealing
with photos [19].
Based on the reviewed literature, it appears that only a small fraction of studies use
both feature selection and dimensionality reduction techniques to boost the accuracy
of their breast cancer prediction models. The techniques used include correlation
analysis, principal component analysis (PCA), the Chi-squared method, the Mutual
Information method, gain ratio and K-Means clustering. Unfortunately, studies comparing feature
selection and dimensionality reduction methods to better breast cancer forecasts are
limited. L1-based selection, Recursive Feature Elimination (RFE), and wrapper-based
forward selection and backward elimination have all been studied insufficiently, and
there are few other feature selection approaches that have been thoroughly investigated.
Most studies have focused on PCA as a dimensionality reduction technique, while none
have explored LDA. In addition, logistic regression, SVM, NB, KNN, RF, DT, and
stochastic gradient descent learning, as well as fuzzy logic and ANN, are just some of
the machine learning techniques put to the test in these investigations. However, none of
these studies investigate the potential of algorithms like passive classifiers and MLP for
breast cancer prediction. The goal of this study is to compare the performance of the
multilayer perceptron (MLP) and the passive classifier (PC) in breast cancer prediction
with that of other more established methods, such as the support vector machine, the
random forest, and the decision tree. Moreover, this study analyses the effectiveness of
feature selection techniques, namely L1-based feature selection, recursive feature
elimination (RFE), and wrapper-based forward selection and backward elimination,
together with dimensionality reduction methods such as PCA and LDA; their performances
will be compared to determine which is more effective in predicting breast cancer.
3 Methodology
The following techniques and algorithms were chosen for designing effective breast
cancer prediction based on their advantages, as discussed below.
The passive-aggressive classifier algorithm can handle big datasets and adjust its
weights as new data comes in [26].
Follow the link below to access the Breast Cancer Wisconsin (Diagnostic) Data Set, a
database containing detailed information about breast cancer in Wisconsin.
https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
Expert medical opinion on the benign or malignant nature of a breast mass's cell nuclei
is included. The first portion of the data set contains features computed from digitised
images of a fine needle aspirate (FNA) of a breast lump, and the remaining part consists
of the cancer diagnosis. In total there are 569 records: 357 benign (noncancerous) cases
and 212 malignant (cancerous) ones.
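As a minimal sketch of working with this data, the copy of the same Breast Cancer Wisconsin (Diagnostic) data set bundled with scikit-learn can be loaded instead of downloading the Kaggle CSV (an assumption made here for convenience; the contents are identical apart from the ID column):

```python
# Load the WDBC data set via scikit-learn's bundled copy rather than the
# Kaggle CSV; it holds the same 569 FNA-image records with 30 features.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target          # 569 samples, 30 numeric features

n_benign = int((y == 1).sum())         # scikit-learn encodes benign as 1
n_malignant = int((y == 0).sum())      # and malignant as 0
print(X.shape, n_benign, n_malignant)  # (569, 30) 357 212
```

The class counts match the 357 benign and 212 malignant cases described above.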
Due to its size and diversity, this dataset is ideal for research into feature selection and
dimensionality reduction strategies for accurate breast cancer forecasts. The dataset can
be mined for information that could prove useful in deciding whether a given case of
breast cancer is malignant or not. Furthermore, the dataset can be utilized to create
algorithms that efficiently cut down on the requisite number of attributes for accurate
patient diagnostic prediction. The diagnostic procedure for breast cancer may benefit
from this.
During the process of developing a predictive model, this activity reduces the number of
input characteristics required. Selecting features can resolve problems with processing
speed and complexity, and it also reduces the amount of memory needed for storage. To
put it another way, feature selection eliminates features that provide insufficient
information in order to narrow the dimensions down to a secondary set. As a result, the
feature selection process identifies the best possible subset of characteristics from
the extensive dataset [29]. During feature selection, the feature set shrinks while
still maintaining most of the information from the source dataset. Because of this, we
should consider deleting
redundant features and obtaining important information instead. Two things are
important for the selection procedure to take into account: (1) there should be no impact
on the accuracy and performance, and (2) the final subset should be comparable to the
primary dataset. The optimum feature selection criteria, which support data
visualisation and data interpretation while also cutting down on storage space, consist
of two primary steps. The first, feature creation, constructs a subset from the massive
quantity of data that was collected. The second, feature evaluation, assesses the subset
that was produced against the criteria [30].
The technique for selecting features consists of a mix of search procedures, each of
which chooses a distinct subset of characteristics, and an evaluation metric, which
assigns scores to each of the distinct subsets of features. This needs a lot of processing
resources, and depending on the kind of machine learning model being used, a different
subset of characteristics could yield the best results. This indicates that there is not a
single ideal collection of features but rather several optimal features set depending on
the ML technique that is meant to be used. There is a distinction to be made between
supervised and unsupervised feature selection approaches [31]. This categorization is
accomplished by taking into account, or not taking into account, the target variable
throughout the method of feature selection. When reducing redundant variables using
the correlation approach, unsupervised methods do not take into consideration the
variable that is being targeted. Supervised methods are employed for the purpose of
excluding the variables and elements that are not connected to the one being studied.
Filter methods and wrapper methods are well-known examples of supervised approaches [32].
The selection of variables in filter techniques is determined by the properties of the
features themselves. They do not use any kind of machine learning method, instead
relying only on the qualities of the features themselves. Filter techniques are not
dependent on any particular model and require little to no computing effort. However,
they often result in performance that is poorer when compared to that produced by
different feature selection approaches. Moreover, these are outstanding tools for quick
filtering and the rapid exclusion of components from a collection of data that are not
associated with the topic that is now being addressed [33].
The embedded approach selects features to be included in the model during the process of
creating the model itself. The models are equipped with built-in feature selection
methods, which choose and include the variables that result in the highest possible
level of precision. The embedded technique takes into account the interaction that
occurs between the model and its features. In contrast to the feature selection
techniques mentioned earlier, which involve removing variables from the dataset, the
dimensionality reduction approach involves the creation of a new projection of the data
together with a whole new set of variables [35].
Chi-square, L1-based feature selection and Recursive Feature Elimination are the feature
selection methods chosen for this study.
3.3.1 Chi-square
Chi-square(t_k, c_i) = Σ (Observed values − Expected values)² / Expected values
Both the input variables and the target variable in the breast cancer dataset are
categorical data. The input variables are broken down into subcategories. The problem
that we are dealing with is one of classification predictive modelling, in which the
system attempts to classify the data as belonging to either the recurrent or the
non-recurrent class. Using the statistical approach known as Pearson's chi-squared test,
one may determine whether or not the variable serving as an input also influences the
variable serving as an output [37].
A test statistic is computed by drawing a comparison between the observed values and
the theoretical ones. The chi-squared test is carried out in order to determine whether
or not the distributions of the observed categorical variables and the expected
variables differ from one another. In the event that the value exceeds a certain
threshold, the null hypothesis is
rejected, and the significance of the finding is established. This means that the variable
is seen as being reliant on the outcome. If this is not the case, the value being reported is
not significant; hence, the null hypothesis should not be rejected, and the variable in
question should be considered independent [38].
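A minimal sketch of chi-squared feature scoring, assuming scikit-learn's `SelectKBest` with the `chi2` scoring function is used on the bundled WDBC data (whose features are all non-negative, as `chi2` requires); keeping the 10 highest-scoring features is an illustrative choice, not a tuned one:

```python
# Chi-squared feature selection: score each feature against the class
# labels and keep the k features with the highest chi-squared statistic.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=10)  # keep 10 best-scoring features
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (569, 10)
```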
The regularisation denoted by "L1" is also known as "lasso" regularisation, short for
Least Absolute Shrinkage and Selection Operator. This method has the capability of
bringing some of the coefficients down to zero: lasso can punish a feature and make its
coefficient 0 if it determines that the feature is not essential. This indicates that
when attempting to forecast the target
variable, some of the attributes will have a value of zero. As a result of the fact that these
characteristics will not be included into the model's final predictions, it is possible to
exclude them. Consequently, this results in a streamlined collection of characteristics for
the completed model [39].
L1 is a regularization constraint added to the target function of linear models to stop the
prediction model from being overfit to the data. The L1-based feature selection method
uses the sparse solutions produced by punishing the model with the L1 norm. Linear
SVC is used as a classifier in this linear model [40].
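A sketch of L1-based selection under the assumptions that scikit-learn's `LinearSVC` with an L1 penalty serves as the linear model, the features are standardised first, and `C=0.01` is an illustrative (untuned) regularisation strength; `SelectFromModel` then keeps only the features with non-zero coefficients:

```python
# L1-based feature selection: an L1-penalised LinearSVC yields sparse
# coefficients, and SelectFromModel drops the zeroed-out features.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)       # put features on one scale
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=5000).fit(X_std, y)
selector = SelectFromModel(lsvc, prefit=True)   # keep non-zero-coefficient features
X_new = selector.transform(X_std)
print(X_new.shape[1], "of", X.shape[1], "features kept")
```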
Recursive feature elimination, often known as RFE, is a feature selection technique that
attempts to choose the feature subset that provides the greatest level of classification
accuracy based on the learnt model. After constructing a classification model,
traditional RFE works by first identifying and then systematically removing the poorest
feature, i.e. the one responsible for any decrease in classification accuracy. Recent
developments have led to a novel approach to RFE, which analyses feature (variable)
importance rather than classification accuracy using a support vector machine (SVM)
model, and then selects the least significant features for deletion [41].
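The SVM-based variant described above can be sketched with scikit-learn's `RFE` and a linear-kernel `SVC`, which ranks features by the magnitude of the learnt coefficients and prunes one feature per iteration; the target of 10 retained features is an illustrative assumption:

```python
# Recursive feature elimination with a linear SVM: repeatedly fit the
# model and discard the least important feature until 10 remain.
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)
estimator = SVC(kernel="linear")        # RFE needs coef_ or feature_importances_
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe.fit(X, y)
print(int(rfe.support_.sum()), "features selected")  # boolean mask over 30 features
```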
Dimensionality reduction is used to increase classification accuracy and performance by removing redundant and
superfluous data. This is accomplished by incorporating a new pixel-based classification
algorithm. When doing data analysis on a huge amount of characteristics, it is also used
to decrease the dimension of the data to a lower dimension. This is because smaller
dimensions are easier to analyse [44]. Principal component analysis, often known as
PCA, is a useful pre-processing technique for constructive dimensionality reduction
that determines the association between the feature variables. PCA is a technique that
has been developed and utilised in various domains, such as natural language processing,
audio recognition, geography, bioinformatics, and computer vision, to name a few.
Because the task of image processing exposes the challenges of both computation and
memory consumption, PCA is frequently employed in image processing research. However,
variable selection methods are ineffective when all of the variables are correlated,
though they perform effectively when working with informative variables [45].
On the other hand, the identification of the class levels of informativeness is also a very
significant notion. In the past, many strategies for the extraction of features were
researched in an effort to enhance categorisation via feature selection. The feature
extraction process makes use of local features rather than global features to depict
unique attributes of relevant information more correctly. This is possible because
local features are picked separately depending on the characteristics of the dataset. It
begins by gleaning features from each of the feature component sets, after which it
applies those features to the input data in order to convert it into a new collection of
informative features. The impact of dimension reduction using principal component
analysis (PCA) alone is inferior to the effect of using PCA in conjunction with entropy.
Because of this, a number of academics have focused their attention on feature
weighting in order to enhance feature selection techniques that include weighting with
inter-class and intra-class distance [46]. To enhance classification and clustering
while maintaining the features of a class problem, the classes used in
number of academics have proposed a class-weighted measurement by a comparable
distance that represents the qualities that are associated with classes. In addition, the
number of dataset features that have a significant distinction may be minimized by using
strategies for feature selection as well as extraction. The process of picking and
eliminating certain aspects without altering them is referred to as feature selection,
whereas the process of transforming data into a lower dimension is referred to as
dimensionality reduction. Finding a dimensionality reduction approach that uses
principal component analysis (PCA) that can carry out feature extraction and feature
selection without sacrificing picture quality and without losing important features due to
dominating class characteristics is one of the challenges that must be overcome [47].
Dimensionality reduction (DR) involves selecting useful traits and discarding irrelevant
and superfluous ones. Reducing the dimensionality of the input can increase performance
by reducing learning time and model complexity, or by enhancing generalisation capacity
and classification accuracy. Selecting acceptable characteristics also reduces
measurement expense and improves problem understanding [42].
Dimensionality reduction on a dataset has many benefits: (i) data storage space
decreases as dimensions decrease; (ii) less computing time is required; (iii) redundant,
unnecessary, and noisy data are removed, which improves data quality; (iv) some
algorithms fail in higher dimensions, so reducing dimensions improves algorithm
efficiency and accuracy; (v) higher-dimensional data are difficult to visualise, so
lowering dimensions may help us design and analyse patterns; and (vi) it streamlines and
enhances categorisation [49].
PCA is a popular Dimensionality Reduction (DR) technology. Finding the sweet spot
between information variance and vector dimensionality reduction is what PCA is all
about. PCA is an unsupervised learning technique that can help to make sense of the
data. PCA was developed to reduce the number of dimensions a dataset has: it is a method
for decreasing the number of dimensions needed to describe a dataset in order to focus
on the most informative subset of data [50]. PCA is an orthogonal statistical method for
transforming a dataset of correlated observations into a set of linearly uncorrelated
values. Face recognition and applications like medical data correlation are two of the
general uses of principal component analysis (PCA), but it is also utilised in other
sectors such as quantitative finance and spike-triggered covariance analysis in
neuroscience [22].
Principal Component Analysis (PCA) has a wide variety of applications, some of which
include machine learning, the processing of images and voice, computer vision, text
mining, visualisation, biometrics, and robotic sensor data. Facial recognition is another
one [52].
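A minimal PCA sketch with scikit-learn, under the assumptions that the WDBC features are standardised first (PCA is scale-sensitive) and that projecting onto two principal components is an illustrative choice:

```python
# PCA: standardise the 30 features, then project onto the first two
# principal components, which capture the largest share of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape, pca.explained_variance_ratio_)
```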
The purpose of principal component analysis (PCA) is to find the principal components
(PCs), a collection of characteristics that are not correlated with one another. The
first PC carries the largest amount of variation in the data set, and subsequent PCs
follow in order of decreasing variance. Despite the fact that PCA is a reliable approach
for dimension reduction, it does have certain restrictions. In spite of its broad use,
the PCA transformation is based on second-order statistical analysis. Because the
principal components may be statistically highly dependent on one another while having
no correlation with one another, principal component analysis (PCA) may not be successful
in finding the data's most concise description. Because PCA physically portrays the data
as a hyperplane embedded in an ambient space, it can only represent the data faithfully
when the components are linearly related; capturing non-linear correlations requires a
higher-dimensional representation that must be disclosed by a non-linear method. As a
result, non-linear alternatives to principal component analysis have been developed.
Moreover, because they utilise estimation approaches based on least squares, PCA methods
are not robust to outliers, which are ubiquitous in practical training sets [53].
In the pre-processing phase of applications involving data mining and machine learning,
LDA is another prominent dimensionality reduction method that may be used. The primary
objective of linear discriminant analysis (LDA) is to map a dataset that has a vast
number of characteristics onto a space that has fewer dimensions and a high degree of
class separability. Computational cost decreases as a result. The strategy that LDA uses
is quite similar to that of PCA; however, in addition to maximising the variance of the
data (as PCA does), LDA maximises the separation between multiple classes. The objective
of linear discriminant analysis is to project a dimension space onto a more compact
subspace while maintaining the integrity of the class information [55].
The linear combination of features is used as a linear classifier in LDA, which allows for
the extraction of features and the decrease in dimension. Through the translation of
characteristics into a space with fewer dimensions, it ensures the greatest possible
degree of class separability by optimizing the ratio of the variation between classes to
that of the variance within classes. The capacity of LDA to combine the information from
the features in order to build a new axis, which in turn minimises the variance within
the variables and maximises the distance between the classes, is one of the benefits of
using this statistical approach [56].
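An LDA sketch with scikit-learn's `LinearDiscriminantAnalysis` on the bundled WDBC data; note that with two classes LDA can produce at most one discriminant axis (n_components ≤ number of classes − 1), so the data are projected onto a single dimension:

```python
# LDA is supervised: it uses the class labels to find the axis that best
# separates benign from malignant, yielding one component for two classes.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)   # project 30 features onto 1 discriminant axis
print(X_lda.shape)  # (569, 1)
```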
Despite the fact that the LDA represents one of the data reduction techniques that is used
on a regular basis, it has a number of downsides that should be taken into consideration.
The linear discriminant analysis (LDA) is unable to discover the lower dimensional
space when the dimensions are significantly larger than the total quantity of samples
included in the data matrix. This results in the singularization of the within-class matrix.
This challenge is sometimes referred to as the small sample size (SSS) problem, and a
number of possible solutions have been recommended. The first strategy eliminates the
null space of the within-class matrix. The second converts the data through an
intermediate subspace, such as PCA, so that the within-class matrix becomes a full-rank
matrix. The third, also well-known, strategy applies regularisation to the handling of
singular linear systems. A further issue is linearity: when various classes cannot be
separated linearly from one another, linear discriminant analysis (LDA) is unable to
differentiate between these classes; this issue can be solved by using kernel
functions [57].
ML is the process of solving problems by analysing the patterns in the available data,
as opposed to being explicitly programmed to do so. ML models are used in various
activities, including classification, regression, clustering, anomaly detection,
ranking, recommendation, and forecasting, to name a few [58].
The process of determining which category a certain piece of data belongs to is referred
to as classification. Classification algorithms work on data that has been labelled, with
each label serving to define a different class or category that the data may be placed in.
There are two possible approaches to classification: binary classification and
multi-class classification. The first predicts that the unlabelled data will fall into
one of the two available classes, whereas the second predicts that the data will fall
into one of N different groups or categories. Regression, in turn, is the process of
attempting to identify a label that is a continuous value based on a collection of
characteristics that are connected to one another. The regression process is applied to
a collection of labelled features, and a function is used to estimate the value of
unlabelled data based on the labelled features [59].
The process of organizing individual instances of data into distinct groups determined by
the degree to which they are alike is referred to as clustering. Clusters are the distinct
groupings, and members of a cluster have comparable traits that are unique to that
cluster. Clusters may be further broken down into subclusters. An anomaly is a rare
occurrence or observation that is deceptive and otherwise different from the other
observations; anomalies may occur at any frequency. Anomaly detection makes it easier to
detect fraudulent transactions, anomalous clusters, patterns that suggest network
infiltration, outliers, and other irregularities. During the ranking process, the data
with labels are organised into instances
and given scores. The ranker then uses these scores to determine rankings for the
examples that have not yet been observed. The process of making suggestions about
items or services to a user on the basis of that user's previous activity is referred to as
"recommendation." Predicting is the act of looking into the past and making assumptions
about the future [60].
just the essential characteristics and excluding the highly associated feature. This may be
done without the risk of losing critical information [62].
SVM is an important approach utilised in the area of ML. The goal of the algorithm is to
locate an N-dimensional hyperplane that may be used to categorise the data; the core of
this technique is locating the optimal plane in terms of margin. N depends on the total
number of features. With two features the evaluation is simple, but this is not always
the case when there are many features to classify. Increasing the margin leads to more
precise predictions [63]. SVM
is depicted graphically in the figure.
The degree of the polynomial kernel function (kernel "poly") is controlled by one of the
hyperparameters; all other kernels ignore this parameter. The "tol" parameter defines
the tolerance for the stopping criterion [68].
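A sketch of an SVM classifier wiring up the hyperparameters just described; the standardisation step, the 70/30 split and the specific values (`degree=3`, `tol=1e-3`) are illustrative assumptions, not tuned settings:

```python
# SVC with a polynomial kernel: `degree` is only used when kernel="poly",
# and `tol` sets the stopping-criterion tolerance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_tr)            # SVMs are scale-sensitive
clf = SVC(kernel="poly", degree=3, tol=1e-3)
clf.fit(scaler.transform(X_tr), y_tr)
acc = clf.score(scaler.transform(X_te), y_te)
print(f"test accuracy: {acc:.3f}")
```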
The most widely used supervised ML technique is the Random Forest algorithm. The
technique is flexible enough to solve both classification and regression problems. It is
an ensemble learning technique in which a group of individually weak learners works
together to produce a more robust representation of the world. This method generates a
forest of trees. The key benefit of utilising this approach is that, unlike other
algorithms, it can deal with missing values and outliers [69].
The classification technique is used by the RF in order to take a non-parametric
approach. The RF applies a vast number of decision trees to each individual data set
after performing categorization at a high rate for each data set. A random number of
input variables is used in each tree, and then the results of all of the trees are integrated
to get a more accurate conclusion based on the variables [71].
Criterion: The function that determines how good a split actually is. Supported criteria
are "gini", "log_loss" and "entropy".
max_features: The number of features considered when looking for the best split.
min_samples_split: The minimum number of observations required in a node for that node
to be split.
min_samples_leaf: The minimum number of samples that must remain in each leaf node after
a node is split into two separate nodes [73].
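The hyperparameters listed above can be wired into scikit-learn's `RandomForestClassifier` as follows; the values shown (100 trees, `max_features="sqrt"`, and so on) and the 70/30 split are illustrative assumptions rather than tuned choices:

```python
# Random forest on the WDBC data, setting the hyperparameters discussed
# above explicitly so each one is visible.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",      # split-quality function: gini / log_loss / entropy
    max_features="sqrt",   # features considered at each split
    min_samples_split=2,   # minimum observations needed to split a node
    min_samples_leaf=1,    # minimum samples required in each leaf
    random_state=42,
)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```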
Figure 3 Architecture of the Decision Tree algorithm [70]
In order to construct the tree, a DT first divides the data into two or more groups. The
ratio of information gain to entropy is used to determine the partitioning. Attributes
are prioritised for investigation based on the knowledge they promise to impart. Each
variable's entropy can be obtained using the following equation, and from that, the
information gain can be derived.
Entropy(S) = −p log2(p) − q log2(q)
Where S is a collection of both successful and unsuccessful occurrences, p represents
the proportion of samples from positive classes found in S, and q represents the
proportion of samples from negative classes found in S. From this, the information gain
associated with the attribute V can be determined [75].
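The entropy formula can be checked with a small worked example: an evenly mixed node has maximal entropy (1 bit), while a pure node has entropy 0 (taking 0·log2(0) as 0 by convention):

```python
# Worked example of the two-class entropy formula used by decision trees.
from math import log2

def entropy(p: float, q: float) -> float:
    """Entropy(S) = -p*log2(p) - q*log2(q), with 0*log2(0) taken as 0."""
    terms = [x * log2(x) for x in (p, q) if x > 0]
    return -sum(terms)

e_mixed = entropy(0.5, 0.5)  # evenly mixed node: 1.0 bit (maximum)
e_pure = entropy(1.0, 0.0)   # pure node: 0.0 bits
print(e_mixed, e_pure)
```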
Splitter: It is used to determine the strategy for splitting at each node. Its
acceptable values are either "best" or "random".
Criterion: Quality of a split is measured by this parameter. Acceptable criteria are “gini”,
“log_loss” and “entropy” [75].
class_weight: This hyperparameter determines the weight associated with the classes,
i.e. the weight provided for each output class.
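The decision-tree parameters described above can be sketched with scikit-learn's `DecisionTreeClassifier`; the chosen values (`criterion="entropy"`, `class_weight="balanced"`) and the 70/30 split are illustrative assumptions:

```python
# Decision tree on the WDBC data using the splitter, criterion and
# class_weight parameters discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
dt = DecisionTreeClassifier(
    splitter="best",          # or "random"
    criterion="entropy",      # gini / log_loss / entropy
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=42,
)
dt.fit(X_tr, y_tr)
acc = dt.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```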
3.5.3 MLP
MLP stands for multi-layer perceptron. This is a highly effective modelling technique
that employs a supervised training approach, making use of data samples whose outcomes
are already known. A model of a nonlinear function is produced
as a result of this technique. This model permits the prediction of output data based on
input data that is provided. The result of one layer is the input of the following layer in
MLP design, and so on. The first and last layers of a neural network are commonly
referred to as the input and output layers; the other levels are known as the hidden
layers [76]. The MLP is a kind of multilayer neural network called a feedforward neural
network because data flows in one direction, from the input layer through the hidden
layers to the output layer, without feedback connections. Every individual connection
made between neurons carries its own unique weight. Perceptrons that belong to the same
layer share the same activation function; in a broad sense, this is a sigmoid function
for the hidden layers. Depending on the requirements of the application, the output
layer might take the shape of a sigmoid or a linear function [77].
In the forward pass, signals travel from the input nodes to the output nodes. In the
event that there is an error in the output, that error must be transferred back from the
output to the input so that the weights can be corrected. The backpropagation algorithm
is the approach most often used for this purpose.
An MLP with a single hidden layer is capable of approximating a nonlinear function,
albeit with reduced precision. Overfitting the training data is more likely to occur in
networks that have more hidden layers than in other types of networks. The learning rate
and the momentum are the primary determinants of the speed and performance of the
learning process. MLPs are able to address problems that
cannot be linearly separated and are meant to provide an approximation of any
continuous function. Pattern categorization, recognition, prediction, and approximation
are the primary applications of MLP [80].
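The MLP described above can be sketched with scikit-learn's MLPClassifier; this is a minimal illustration rather than the thesis implementation, assuming one hidden layer with a sigmoid (logistic) activation and using scikit-learn's bundled copy of the WDBC data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale inputs: MLP training is sensitive to feature magnitudes.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# One hidden layer with a logistic (sigmoid) activation, as described above.
mlp = MLPClassifier(hidden_layer_sizes=(30,), activation='logistic',
                    max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)
acc = mlp.score(X_test, y_test)
```

The hidden-layer width of 30 is an arbitrary illustrative choice, not a value from the grid search reported later.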
The Passive Aggressive Classifier (PAC) is an effective online ML algorithm: it remains
in its passive state as long as classification outcomes are accurate, but switches to its
aggressive state whenever a sample is misclassified. It is one of the particularly useful
and successful algorithms in ML, with its principal use in settings where a vast amount
of data must be processed sequentially. PACs are related to the perceptron framework
in the sense that they do not require the specification of a learning-rate attribute [82].
If the offered prediction turns out to be accurate, the algorithm does not attempt to
update or improve the model in any way; the sample is not adequate for triggering any
modification, so the algorithm appears passive to the system. If an input results in an
inaccurate prediction, the model is modified to account for it; such samples are the
most important for providing effective corrections to the model currently being
learned [83].
The Passive Aggressive Classifier exposes the following main parameters:
C: the maximum step size used for regularization; the default value is 1.0.
max_iter: the maximum number of passes over the training data.
validation_fraction: the fraction of the training data set aside as the validation set for
early stopping; it must lie between 0 and 1 and is only used if the early_stopping flag
has been set to True [85].
loss: the loss function to be used, either hinge or squared hinge.
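The parameters listed above map directly onto scikit-learn's PassiveAggressiveClassifier; the sketch below uses them at illustrative values (the defaults, which are assumptions rather than the grid-searched values reported later) on scikit-learn's bundled WDBC copy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)

# The parameters described above: C, max_iter, loss and validation_fraction.
# validation_fraction only takes effect when early_stopping=True.
pac = PassiveAggressiveClassifier(C=1.0, max_iter=1000, loss='hinge',
                                  validation_fraction=0.1, random_state=0)
pac.fit(scaler.transform(X_train), y_train)
acc = pac.score(scaler.transform(X_test), y_test)
```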
This study uses the confusion matrix, accuracy, F1-score, precision and recall, as well
as the time taken for training, validating and testing the data, as the metrics for
assessing the efficiency of the models.
Accuracy is the percentage of instances in the data set that are classified correctly.
Precision reflects how well a classification model filters out irrelevant instances, i.e.
how many of its positive predictions are actually positive. Recall is a model's ability to
locate all relevant instances within a data set. Rather than measuring overall
performance the way accuracy does, the F1-score, the harmonic mean of precision and
recall, evaluates how well the model performs within each individual class [88].
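The metrics above can be computed with scikit-learn; a minimal sketch on a small invented label vector (not thesis data):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical true and predicted labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # fraction classified correctly
prec = precision_score(y_true, y_pred)  # how many predicted positives are real
rec = recall_score(y_true, y_pred)      # how many real positives were found
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)   # [[TN, FP], [FN, TP]]
```

Here there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four scores equal 0.75.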
3.6.2 Time taken for training, validating & testing the data
The training set is used to fit the model and give it the ability to detect features and
trends that had not been noticed before. The validation set is a separate collection of
data used to check how well the model is doing throughout its training phase. After the
training phase has finished, the model is evaluated on data from the test set, which
helps ensure that the model's accuracy generalizes. At this point, the total amount of
time necessary for training, validating, and testing is measured [89].
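Timing each phase can be done with Python's perf_counter; a minimal sketch (the model and split here are illustrative assumptions, not the thesis configuration):

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

model = DecisionTreeClassifier(random_state=1)

start = time.perf_counter()
model.fit(X_train, y_train)           # training phase
train_time = time.perf_counter() - start

start = time.perf_counter()
val_acc = model.score(X_val, y_val)   # validation phase
val_time = time.perf_counter() - start
```

The same pattern is repeated around the test-set evaluation to obtain the testing time.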
4 Experiments & results
Early detection and treatment are essential for improving the chances of survival. ML
models have been shown to be effective in predicting breast cancer, and they have the
potential to be used as a screening tool. In this section, we discuss the Breast Cancer
Wisconsin (Diagnostic) Dataset, its description, and the data preprocessing and
visualization techniques used. The chapter then details the results of implementing the
ML models with and without feature selection and dimensionality reduction
techniques, together with a critical analysis.
The Breast Cancer Wisconsin (Diagnostic) Data Set is a publicly available dataset that
contains data about breast cancer tumors. It was collected by Dr. William H. Wolberg,
W. Nick Street, and Olvi L. Mangasarian at the University of Wisconsin Hospitals,
Madison, Wisconsin, USA. The dataset contains 569 observations, each of which
represents a breast cancer tumor, described by 30 features. The Breast Cancer
Wisconsin (Diagnostic) Data Set can be accessed at the link below.
https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
Each record includes an expert medical diagnosis of whether the breast mass is benign
or malignant. The first portion of each record consists of features computed from a
digitized image of a fine needle aspirate (FNA) of a breast lump, and the remaining
part is the cancer diagnosis. Of the 569 records, 357 are benign and 212 are malignant.
Due to its size and diversity, this dataset is ideal for research into feature selection and
dimensionality reduction strategies for accurate breast cancer forecasts. The dataset can
be mined for information that could prove useful in deciding whether a given case of
breast cancer is malignant or not.
The Breast Cancer Wisconsin (Diagnostic) Dataset has been imported using the pandas
package by reading the 'data.csv' file. After importing the dataset, the first 10 rows are
displayed as shown below.
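The thesis reads the Kaggle 'data.csv' file; since that file is not reproduced here, the sketch below substitutes scikit-learn's bundled copy of the same WDBC data (an assumption) and previews the first 10 rows the same way `pd.read_csv('data.csv').head(10)` would:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Stand-in for pd.read_csv('data.csv'): the bundled WDBC copy as a DataFrame.
data = load_breast_cancer(as_frame=True)
df = data.frame            # 569 rows, 30 features plus the target column

first_ten = df.head(10)    # preview the first 10 rows
```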
The preprocessing steps involved in the Breast Cancer Wisconsin (Diagnostic) Dataset
are as follows:
Removing null values: the Breast Cancer Wisconsin (Diagnostic) Dataset contains no
null values, so no further processing is required.
Removing duplicates: the dataset contains no duplicate rows, so no further processing
is required.
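The two checks above can be performed with pandas; a minimal sketch, again using scikit-learn's bundled copy of the dataset in place of 'data.csv':

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame

null_count = df.isnull().sum().sum()   # expected to be 0 for this dataset
dup_count = df.duplicated().sum()      # expected to be 0 for this dataset

# Both are no-ops here, kept for safety on other inputs.
df = df.dropna().drop_duplicates()
```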
The Breast Cancer Wisconsin (Diagnostic) Dataset has been visualized using count
plots, dist plots, line plots and box plots. The output column 'cancer_target' has been
visualized using a count plot, and the input columns 'area_mean' and 'value' have been
visualized using dist plots. The visualizations of these input and output columns are
shown below.
Figure 8 Data visualization of radius and perimeter
Based on the above visualizations, the radius mean and perimeter mean of the cell
nucleus are directly proportional. It can also be inferred that the severity of the
concave-portions value is lower for malignant cancer patients.
Figure 10 Label encoding graph
The breast cancer data has been split for the purposes of training, validation and
testing. In this research, 341 records are used for training, 114 for validation, and 114
for testing. The code for splitting the dataset is shown below.
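The 341/114/114 split described above can be reproduced with two calls to scikit-learn's train_test_split; a minimal sketch, assuming the bundled WDBC copy stands in for the imported CSV:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # 569 samples in total

# First carve off the 114-sample test set, then split the remaining 455
# samples into 341 training and 114 validation samples.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=114, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=114, random_state=42)
```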
For effective breast cancer prediction, the ML models SVM, RF, DT, MLP Classifier
and Passive Classifier are implemented in this section. Before implementing any ML
model, the best-fit parameters are determined using the grid search method.
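Grid search as used here can be sketched with scikit-learn's GridSearchCV; the grid below is a small illustrative assumption (the thesis searches larger grids per model, reported in the tables that follow):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# Small illustrative grid; the thesis searches kernel, gamma, C, degree and tol.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__kernel': ['linear', 'poly'], 'svc__C': [0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)
best = search.best_params_   # the best-fit parameters found by the search
```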
This research work aims to predict breast cancer via the use of dimensionality reduction
techniques and feature selection methods and also investigates which ML model along
with dimensionality reduction techniques and feature selection methods helps in
effectively predicting breast cancer.
4.7.1 SVM model
For predicting breast cancer, the SVM model has been implemented with the
parameters kernel, gamma, C, degree and tol, whose values were chosen through the
grid search method. Based on the grid search, parameter values were chosen for
implementing SVM with the feature selection techniques Chi-square FS, L1-based FS
and Recursive Feature Elimination, and with the dimensionality reduction techniques
PCA and LDA. The parameters and values chosen for each method are as follows.
Parameters | Without FS | Chi-square FS | L1 based FS | Recursive Feature Elimination | PCA | LDA
kernel | poly | poly | linear | poly | poly | poly
degree | 7 | 7 | 7 | 3 | 3 | 3
Based on the above parameters and chosen values, the Support Vector Machine model
has been implemented and evaluated with the feature selection techniques Chi-square
FS, L1-based FS and Recursive Feature Elimination, and with the dimensionality
reduction techniques PCA and LDA. The assessed results are tabulated below.
Algorithm | Validation accuracy | Testing accuracy | Training time (s) | Validation time (s) | Testing time (s)
Without FS | 0.90 | 0.94 | 0.00 | 0.06 | 0.04
Recursive Feature Elimination | 0.93 | 0.96 | 0.01 | 0.19 | 0.06
Based on the evaluated results, the SVM model that selects the most essential features
using Recursive Feature Elimination gives the maximum validation accuracy of 93%
and testing accuracy of 96%, but its validation and testing times are comparatively
long, at 0.19 s and 0.06 s respectively. The combination of RFE and SVM is a powerful
technique for improving the accuracy of ML models: RFE removes features that are
not important, while SVM identifies the optimal hyperplane that separates the two
classes in the dataset. Comparing performance in terms of time taken, the SVM model
with all features and the SVM model with LDA dimensionality reduction take the least
time to validate and test the model, at 0.06 s each.
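The RFE-plus-SVM combination discussed above can be sketched as follows; the choice of 10 retained features is an illustrative assumption, and a linear-kernel SVM drives RFE because it needs coefficient-based feature rankings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# RFE repeatedly drops the weakest feature until 10 remain.
selector = RFE(SVC(kernel='linear'), n_features_to_select=10).fit(X_train, y_train)

# Train the final SVM on the selected feature subset only.
clf = SVC(kernel='linear').fit(selector.transform(X_train), y_train)
acc = clf.score(selector.transform(X_test), y_test)
```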
The Random Forest model for predicting breast cancer has been constructed with
values for the parameters criterion, max_features, n_estimators, min_samples_split
and min_samples_leaf selected using the grid search method. The parameters and
their chosen values for the various methods are as follows.
Parameters | Without FS | Chi-square FS | L1 based FS | Recursive Feature Elimination | PCA | LDA
max_features | sqrt | sqrt | sqrt | sqrt | sqrt | sqrt
min_samples_split | 2 | 2 | 2 | 2 | 2 | 2
min_samples_leaf | 1 | 1 | 1 | 1 | 1 | 1
The Random Forest model has been built and assessed with the feature selection
techniques Chi-square FS, L1-based FS and Recursive Feature Elimination, and with
the dimensionality reduction techniques PCA and LDA, using the aforementioned
parameters and their chosen values. The evaluated results are tabulated below.
Algorithm | Validation accuracy | Testing accuracy | Training time (s) | Validation time (s) | Testing time (s)
Recursive Feature Elimination | 0.96 | 0.95 | 0.24 | 0.08 | 0.06
By selecting the most important features with L1-based FS, the RF model achieves the
highest validation accuracy of 95% and the highest testing accuracy of 96%, although
its training, validation and testing times are relatively long compared to the other
configurations, at 0.48 s, 0.21 s and 0.07 s respectively. L1-based feature selection is
effective at picking out the most crucial characteristics because it penalizes the
absolute values of the feature coefficients, driving uninformative ones to zero. In terms
of time, the RF model without any feature selection technique requires the least
amount of time to validate and test the model, at 0.04 s and 0.05 s respectively.
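L1-based feature selection feeding a random forest, as evaluated above, can be sketched with SelectFromModel; the L1 penalty strength C=0.1 is an illustrative assumption:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# The L1 penalty drives some coefficients to exactly zero; keep the rest.
l1 = LinearSVC(C=0.1, penalty='l1', dual=False, max_iter=5000).fit(X_train, y_train)
selector = SelectFromModel(l1, prefit=True)

rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
rf.fit(selector.transform(X_train), y_train)
acc = rf.score(selector.transform(X_test), y_test)
```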
The Decision Tree model for predicting breast cancer has been constructed with values
for the parameters splitter, criterion, max_features, class_weight and
min_samples_split selected using the grid search method. The parameters and values
selected for each technique are detailed below.
Parameters | Without FS | Chi-square FS | L1 based FS | Recursive Feature Elimination | PCA | LDA
min_samples_split | 5 | 7 | 5 | 11 | 8 | 9
The Decision Tree ML model was implemented and evaluated using feature selection
techniques including Chi-square FS, L1-based FS, Recursive feature elimination
techniques, and dimensionality reduction techniques including PCA and LDA
techniques. The tabulated results of the evaluation are shown below.
Algorithm | Validation accuracy | Testing accuracy | Training time (s) | Validation time (s) | Testing time (s)
Recursive Feature Elimination | 0.95 | 0.96 | 0.01 | 0.08 | 0.06
Based on the evaluated results, the DT model that selects the most important features
using Chi-Square FS and the Recursive feature elimination technique provides the
highest testing accuracy of 96%. However, the validation accuracy of Chi-square FS is
significantly lower than that of RFE, at 90%. The combination of RFE and DT is a
potent technique for increasing the accuracy of ML models, as RFE can be used to
eliminate unimportant features and DT can be utilized to determine the optimal decision
tree for the model. In terms of time, the DT model without any feature selection
techniques takes the least amount of time to validate and evaluate the model, 0.04
seconds for each.
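A chi-square selection step feeding a decision tree, as evaluated above, can be sketched with SelectKBest; keeping k=10 features is an illustrative assumption. Note that the chi-square test requires non-negative inputs, which the raw WDBC features satisfy:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # all features non-negative
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Score each feature against the target with the chi-square statistic.
selector = SelectKBest(chi2, k=10).fit(X_train, y_train)

dt = DecisionTreeClassifier(criterion='gini', random_state=0)
dt.fit(selector.transform(X_train), y_train)
acc = dt.score(selector.transform(X_test), y_test)
```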
For predicting breast cancer, the MLP Classifier model has been implemented with the
parameters activation, solver, learning_rate, learning_rate_init and tol, whose values
were chosen through the grid search method. The parameters and values chosen for
each method are as follows.
Parameters | Without FS | Chi-square FS | L1 based FS | Recursive Feature Elimination | PCA | LDA
learning_rate_init | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001
tol | 0.0001 | 0.0001 | 0.0001 | 1e-06 | 1e-07 | 1e-05
Based on the above parameters and chosen values, the MLP Classifier model has been
implemented and evaluated with the feature selection techniques Chi-square FS, L1-
based FS and Recursive Feature Elimination, and with the dimensionality reduction
techniques PCA and LDA. The assessed results are tabulated below.
Algorithm | Validation accuracy | Testing accuracy | Training time (s) | Validation time (s) | Testing time (s)
Recursive Feature Elimination | 0.94 | 0.96 | 0.70 | 0.14 | 0.08
LDA | 0.93 | 0.90 | 0.15 | 0.22 | 0.10
Based on the assessed results, the MLP Classifier model gives the maximum validation
accuracy of 94% and testing accuracy of 96% when the most important features are
selected with Recursive Feature Elimination, with PCA dimensionality reduction, or
with no feature selection at all, because the MLP Classifier is an effective ML algorithm
that can learn complex relationships between the features and the target variable.
Comparing efficiency in terms of time taken, the MLP Classifier with Chi-square
feature selection takes the least time to train, validate and test the model, at 0.06 s,
0.07 s and 0.08 s respectively.
The Passive Classifier model for predicting breast cancer has been implemented with
GridSearch-determined parameter values for C, max_iter, validation_fraction, loss, and
tol. The parameters and their values selected for each distinct method using Grid search
are as follows:
Parameters | Without FS | Chi-square FS | L1 based FS | Recursive Feature Elimination | PCA | LDA
validation_fraction | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2
tol | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001
The Passive Classifier model was implemented and evaluated using feature selection
techniques including Chi-square FS, L1-based FS, Recursive feature elimination
techniques, and dimensionality reduction techniques including PCA and LDA
techniques. The tabulated results of the evaluation are shown below.
Algorithm | Validation accuracy | Testing accuracy | Training time (s) | Validation time (s) | Testing time (s)
Recursive Feature Elimination | 0.84 | 0.82 | 0.02 | 0.22 | 0.06
Based on the evaluated results, the Passive Classifier model using PCA dimensionality
reduction achieves the highest validation and testing accuracy of 94%. This is because
PCA reduces the dimensionality of the dataset without losing much information, which
makes the model easier to train and can improve its accuracy, while also requiring the
least time to train, validate and test the model, at 0.00 s, 0.07 s and 0.07 s respectively.
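The PCA-plus-Passive-Classifier combination discussed above can be sketched as follows; retaining 10 principal components is an illustrative assumption:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)

# Project the 30 standardized features onto their first 10 principal components.
pca = PCA(n_components=10).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

pac = PassiveAggressiveClassifier(C=1.0, max_iter=1000, random_state=0)
pac.fit(Z_train, y_train)
acc = pac.score(Z_test, y_test)
```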
5 Results & conclusion
The technological difficulties encountered and their resolutions are described in the
following sections.
5.2 Interpretation of Results
5.2.1 Accuracy
Based on the results, the accuracies of the ML models SVM, RF, DT, MLP Classifier
and Passive Classifier, combined with the feature selection techniques Chi-square FS,
L1-based FS and RFE and the dimensionality reduction techniques PCA and LDA, are
compared to analyze the effectiveness of these models in predicting breast cancer. The
comparison results are shown below.
Model | Without FS | Chi-square FS | L1 based FS | Recursive Feature Elimination | PCA | LDA
Random Forest | 0.95 | 0.91 | 0.96 | 0.95 | 0.90 | 0.93
Decision Tree | 0.95 | 0.96 | 0.94 | 0.96 | 0.82 | 0.95
MLP Classifier | 0.96 | 0.79 | 0.95 | 0.96 | 0.96 | 0.90
Passive Classifier | 0.92 | 0.90 | 0.92 | 0.82 | 0.94 | 0.92
Among all the feature selection techniques, Recursive Feature Elimination (RFE) gives
the maximum accuracy of 96%, the highest of all the feature selection techniques. RFE
outperforms the other methods because it works by recursively removing features that
are not important: a model is built iteratively, and the features that have the least
impact on its performance are removed at each step.
Among all the dimensionality reduction techniques, the LDA dimensionality reduction
technique with SVM model gives the maximum accuracy of 97%, which is higher
among all other dimensionality reduction techniques. LDA is better at finding the
directions in the dataset that separate the two classes the best and hence it gives more
accuracy when compared to the PCA technique in predicting breast cancer.
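The best-performing combination above, LDA followed by an SVM, can be sketched as follows; with two classes LDA yields a single discriminant direction, and the polynomial kernel settings are illustrative assumptions based on the grid-search table earlier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)

# LDA finds the direction that best separates the two classes (one component).
lda = LinearDiscriminantAnalysis(n_components=1).fit(
    scaler.transform(X_train), y_train)
Z_train = lda.transform(scaler.transform(X_train))
Z_test = lda.transform(scaler.transform(X_test))

svm = SVC(kernel='poly', degree=3, coef0=1.0).fit(Z_train, y_train)
acc = svm.score(Z_test, y_test)
```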
Among all ML models, the SVM model with LDA has the highest accuracy, 97%. It is
followed by the SVM model with RFE, the RF model with L1-based feature selection,
the DT model with Chi-square FS and RFE, and the MLP Classifier with RFE and
PCA, each of which gives a maximum accuracy of 96% in predicting breast cancer.
The detailed analysis of this study is broken down into two sections, titled "Research
Findings" and "Comparison with other research work," the specifics of which are
provided below.
Why does SVM combined with the LDA technique give the maximum accuracy in
predicting breast cancer?
SVM combined with LDA gives the maximum accuracy in predicting breast cancer
because linear discriminant analysis projects the data onto the directions most
relevant to the classification task. This reduces the dimensionality of the data, which
can enhance the accuracy of the model. The SVM model with LDA has the highest
accuracy of 97%, which is greater than the accuracy of the other models.
Why were the MLP Classifier with Chi-square FS, the Decision Tree with PCA, and
the Passive Classifier with RFE not able to predict breast cancer well?
MLP Classifier, Decision Tree, and Passive Classifier are all supervised learning
models, which means that they learn from labeled data. However, it is possible that the
features selected by the feature selection techniques are not informative enough for these
models to learn from.
Why does RFE combined with SVM, DT and the MLP Classifier give a maximum
accuracy of 96% in predicting breast cancer?
The research work of (Kumari et al., 2018) created a system for early-stage breast
cancer prediction using the minimal number of features that can be extracted from
clinical information. The planned experiment was carried out on the Wisconsin breast
cancer dataset (WBCD). Using the most predictive factors, the KNN classifier was
found to produce the highest classification accuracy, 99.28%. Detecting breast cancer
at an early stage using the suggested method drastically reduces medical costs and
improves quality of life.
Dimensionality reduction, feature ranking, fuzzy logic, and an artificial neural network
are all used in this study (Gupta et al., 2019) to develop a new approach to data
classification. The purpose of the research is to evaluate current integrated methods
for breast cancer detection and prognosis and to draw conclusions about their relative
merits. The best diagnostic classification accuracy is provided by principal component
analysis (PCA) with a neural network (NN), while gain ratio and chi-square also
perform well (85.78 percent). These findings pave the way for a Medical Expert
System model that could be utilised for the automated diagnosis of additional diseases
and high-dimensional medical datasets.
The main objective of this study (Ibrahim et al., 2021) was to select feature selection
strategies using correlation analysis and the variance of the input features and to feed
the result to a classification algorithm; an ensemble technique then improved breast
cancer categorization. The proposed method was tested on the public WBCD dataset.
Correlation and principal component analysis reduced the dimensionality. LR, SVM,
NB, KNN, RF, DT and SGD classifiers were examined and compared, and tuning the
hyper-parameters improved classification performance. Two voting approaches were
applied to the top classification algorithms: hard voting takes the majority class,
whereas soft voting takes the class with the highest average probability. With 98.24%
accuracy, 99.29% precision, and 95.89% recall, the proposed method outperformed
the current standard.
This research work (Karimi et al., 2022) uses a genetic-algorithm-optimized hybrid
model (CHFS-BOGA) to predict breast cancer. The optimized genetic algorithm
replaces chance and random selection by using initialization generation and genetic
operators, with the C4.5 decision tree classifier as the fitness function. The Wisconsin
UCI machine learning dataset provided 569 rows and 32 columns, and Weka, an
open-source data mining tool, evaluated the dataset using its explorer module. The
hybrid feature selection approach outperforms single-filter approaches and PCA.
CHFS-BOGA iterations based on a support vector machine (SVM) classifier reach
97.3 percent accuracy, and with a 70.0% training / 30.0% testing split the combined
(CHFS-BOGA-SVM) system achieved a best-in-class 98.25% accuracy, with 100%
accuracy on the entire training set and an ROC-curve area of 1.0. The CHFS-BOGA-
SVM system distinguished malignant from benign breast tumours.
This research work focuses on predicting breast cancer using feature selection and
dimensionality reduction techniques. For this purpose, the Breast Cancer Wisconsin
(Diagnostic) Dataset was chosen, collected and imported into a pandas data frame.
The data was then preprocessed, visualized and split before implementing the ML
models. The ML models SVM, RF, DT, MLP Classifier and Passive Classifier were
implemented and evaluated without any feature selection, with the feature selection
techniques Chi-square, L1-based FS and RFE, and with the dimensionality reduction
techniques PCA and LDA. Comparing the performance of the ML models with feature
selection and dimensionality reduction techniques, the SVM model with LDA has the
highest accuracy of 97% in predicting breast cancer. Overall, both feature selection
and dimensionality reduction techniques can be effective in enhancing the accuracy of
breast cancer predictions.
5.4 Conclusion
Breast cancer is a disease in which breast cells grow uncontrollably, and there are
different varieties of breast cancer depending on which breast cells become cancerous.
Through blood vessels and lymph vessels, breast cancer can spread beyond the breast.
early detection of this fatal disease reduces the mortality rate and increases the survival
period of breast cancer patients. ML models are capable of autonomously learning and
adjusting actions for breast cancer prediction based on historical data without requiring
human intervention. When using high-dimensional medical data to predict breast cancer,
it can be difficult to interpret and analyze these data. Due to their complexity, traditional
methods may be incapable of effectively capturing the underlying relationships between
the various clinical factors. Due to the high-dimensionality of the data, which can result
in over- or under-fitting of the model, the predictive accuracy of such models may also
be limited. Thus, the focus of this research is on predicting breast cancer through the use
of feature selection and dimensionality reduction techniques. Breast Cancer Wisconsin
(Diagnostic) Dataset has been collected and imported into the pandas data frame for this
research project. In addition, the data is preprocessed, visualized, and split before ML
models are implemented. Then, ML models such as SVM, RF, DT, MLP Classifier, and
Passive Classifier were implemented and evaluated with and without feature selection
techniques, including Chi-square, L1-based FS and RFE, and dimensionality reduction
techniques including PCA and LDA. Comparing the efficacy of ML models with feature
selection and dimensionality reduction techniques, the SVM model with LDA feature
selection has the highest predictive accuracy for breast cancer, at 97%. Overall, both
feature selection and dimensionality reduction can improve the accuracy of breast cancer
predictions.
RQ 1: What are the relative performance differences between feature selection and
dimensionality reduction techniques in improving the accuracy of breast cancer
predictions?
Feature selection approaches and dimensionality reduction strategies both have the
potential to increase the accuracy of breast cancer predictions. Feature selection
techniques do this by eliminating unnecessary characteristics from the dataset, while
dimensionality reduction methods reduce the number of features present in the
dataset. When evaluating the performance of the different models, the SVM model
that uses LDA has the greatest accuracy, at 0.97. This shows that, for this specific
dataset, the LDA dimensionality reduction approach is more successful than the other
feature selection strategies at boosting the accuracy of breast cancer predictions. In
general, both feature selection and dimensionality reduction can be useful in
improving the accuracy of breast cancer forecasts.
The results analysis shows that the integration of multidimensional data with different
classification, feature selection, and dimensionality reduction strategies might give
useful inference instruments for this area. There is a need for further study in this area to
better understand how to hyper-tune model parameters to increase the accuracy of
classification methods. In the future, it is anticipated that multiple datasets and deep
learning models will be utilized to attain high precision.
Future research could concentrate on devising personalized treatment plans based on the
characteristics and genetic profile of the patient's tumor. Once breast cancer has been
identified, it is essential to devise an individualized treatment plan for each patient.
Currently, the majority of breast cancer detection models only utilize a single data type,
such as tissue samples or medical images. Using multimodal data, such as tissue
samples, medical images, and patient history, could enhance detection accuracy.
6 References
[12] S. Ibrahim et al., “Feature selection using correlation analysis and principal
component analysis for accurate breast cancer diagnosis,” J. Imaging, vol. 7, no. 11, p.
225, 2021.
[13] A. Jamal et al., “Dimensionality reduction using pca and k-means clustering for
breast cancer prediction,” LKJITI, vol. 9, no. 3, pp. 192-201, 2018.
[14] K. Karimi et al., “Two new feature selection methods based on learn-heuristic
techniques for breast cancer prediction: A comprehensive analysis,” Ann. Oper. Res.,
pp. 1-36, 2022.
[15] K. Gupta and R. R. Janghel, “Dimensionality reduction-based breast cancer
classification using machine learning” in Computational Intelligence: Theories,
Applications and Future Directions, vol. I. Singapore: Springer, 2019, pp. 133-146.
[16] R. Zebari et al., “A comprehensive review of dimensionality reduction
techniques for feature selection and feature extraction,” J. Appl. Sci. Technol. Trends,
vol. 1, no. 2, pp. 56-70, 2020.
[17] M. Kumari and V. Singh, “Breast cancer prediction system,” Procedia Comput.
Sci., vol. 132, pp. 371-376, 2018.
[18] L. Nesamani and S. N. S. Rajini, “Predictive modeling for classification of breast
cancer dataset using feature selection techniques” in Research Anthology on Medical
Informatics in Breast and Cervical Cancer. IGI Global, 2023, pp. 166-177.
[19] H. K. Malik et al., “Comparison of feature selection and feature extraction role in
dimensionality reduction of big data,” J. Tech., vol. 5, no. 1, pp. 184-192, 2023.
[20] G. Kou et al., “Evaluation of feature selection methods for text classification
with small datasets using multiple criteria decision-making methods,” Applied Soft
Computing, vol. 86, p. 105836, 2020.
[21] U. Das et al., “Accurate recognition of coronary artery disease by applying
machine learning classifiers” in 23rd International Conference on Computer and
Information Technology (ICCIT), Dec. 2020, 2020, pp. 1-6.
[22] B. M. S. Hasan and A. M. Abdulazeez, A Review of Principal Component
Analysis Algorithm for Dimensionality Reduction, 2021, pp. 20-30.
[23] H. Jelodar et al., “Latent Dirichlet allocation (LDA) and topic modeling: Models,
applications, a survey,” Multimed. Tools Appl., vol. 78, no. 11, pp. 15169-15211, 2019.
[24] I. Ibrahim and A. Abdulazeez, “The role of machine learning algorithms for
diagnosing diseases,” JASTT, vol. 2, no. 1, pp. 10-19, 2021.
[25] S. Buschjager et al., “Realization of random forest for real-time evaluation
through tree framing” in IEEE International Conference on Data Mining (ICDM), Nov.
2018, 2018, pp. 19-28.
[26] H. Taud and J. F. Mas, “Multilayer Perceptron (MLP)” in Geomatic Approaches
for Modeling Land Change Scenarios, 2018, pp. 451-455.
[27] B. Remeseiro and V. Bolon-Canedo, “A review of feature selection methods in
medical applications,” Comput. Biol. Med., vol. 112, p. 103375, 2019.
[28] B. Venkatesh and J. Anuradha, “A review of feature selection and its methods,”
Cybernetics and Information Technologies, vol. 19, no. 1, pp. 3-26, 2019.
[29] V. Jaiswal et al., “A breast cancer risk predication and classification model with
ensemble learning and big data fusion,” Decis. Anal. J., p. 100298, 2023.
[30] H. K. Malik et al., “Comparison of feature selection and feature extraction role in
dimensionality reduction of big data,” J. Tech., vol. 5, no. 1, pp. 184-192, 2023.
[31] A. García-Domínguez et al., “Diabetes detection models in Mexican patients by
combining machine learning algorithms and feature selection techniques for clinical and
paraclinical attributes: A comparative evaluation,” J. Diabetes Res., vol. 2023, 9713905,
2023.
[32] X. Sun and A. Qourbani, “Combining ensemble classification and integrated
filter-evolutionary search for breast cancer diagnosis,” J. Cancer Res. Clin. Oncol., pp.
1-17, 2023.
[33] L. Guo et al., “Breast cancer prediction model based on clinical and biochemical
characteristics: Clinical data from patients with benign and malignant breast tumors
from a single center in South China,” J. Cancer Res. Clin. Oncol., pp. 1-13, 2023.
[34] S. Rostamzadeh et al., “A comparative investigation of machine learning
algorithms for predicting safety signs comprehension based on socio-demographic
factors and cognitive sign features,” Sci. Rep., vol. 13, no. 1, p. 10843, 2023.
[35] L. Nesamani and S. N. S. Rajini, “Predictive modeling for classification of breast
cancer dataset using feature selection techniques” in Research Anthology on Medical
Informatics in Breast and Cervical Cancer. IGI Global, 2023, pp. 166-177.
[36] A. Y. Ikram and C. Loqman, “Arabic text classification in the legal domain,” in
Third International Conference on Intelligent Computing in Data Sciences (ICDS),
Oct. 2019, 2019, pp. 1-6.
[37] E. Strelcenia and S. Prakoonwit, “Effective feature engineering and classification
of breast cancer diagnosis: A comparative study,” BioMedInformatics, vol. 3, no. 3,
pp. 616-631, 2023.
[38] S. S. Travers et al., “Breast cancer brain metastases localization and risk of
hydrocephalus: A single institution experience,” J. Neurooncol., vol. 163, no. 1, pp. 115-
121, 2023.
[39] M. M. Hassan et al., “A comparative assessment of machine learning algorithms
with the Least Absolute Shrinkage and Selection Operator for breast cancer detection
and prediction,” Decis. Anal. J., vol. 7, p. 100245, 2023.
[40] R. Shang et al., “Local discriminative based sparse subspace learning for feature
selection,” Pattern Recognition, vol. 92, pp. 219-230, 2019.
[41] J. Abdollahi et al., Diabetes Data Classification Using Deep Learning Approach
and Feature Selection Based on Genetic, 2023.
[42] R. Lamba et al., “A hybrid feature selection approach for Parkinson’s detection
based on mutual information gain and recursive feature elimination,” Arab. J. Sci. Eng.,
vol. 47, no. 8, pp. 10263-10276, 2022.
[43] P. Misra and A. S. Yadav, Improving the Classification Accuracy Using
Recursive Feature Elimination with Cross-Validation, 2020, pp. 659-665.
[44] N. Tikher, “Brain tumor detection model using digital image processing and
transfer learning,” Doctoral dissertation, St. Mary’s University, 2023.
[45] R. Hu et al., “Evaluation of customs supervision competitiveness using principal
component analysis,” Sustainability, vol. 15, no. 3, p. 1833, 2023.
[46] F. Abbas et al., Assessing the Dimensionality Reduction of the Geospatial
Dataset Using Principal Component Analysis (PCA) and Its Impact on the Accuracy and
Performance of Ensembled and Non-Ensembled Algorithm, 2023.
[47] J. P. Bharadiya, “A tutorial on principal component analysis for dimensionality
reduction in machine learning,” Int. J. Innov. Sci. Res. Technol., vol. 8, no. 5, pp. 2028-
2032, 2023.
[48] S. Ayesha et al., “Overview and comparative study of dimensionality reduction
techniques for high dimensional data,” Information Fusion, vol. 59, pp. 44-58, 2020.
[49] R. Zebari et al., “A comprehensive review of dimensionality reduction
techniques for feature selection and feature extraction,” JASTT, vol. 1, no. 2, pp. 56-70,
2020.
[50] L. Zhang et al., “Hyperspectral dimensionality reduction based on multiscale
superpixelwise kernel principal component analysis,” Remote Sensing, vol. 11, no. 10,
p. 1219, 2019.
[51] J. P. Bharadiya, “A tutorial on principal component analysis for dimensionality
reduction in machine learning,” Int. J. Innov. Sci. Res. Technol., vol. 8, no. 5, pp. 2028-
2032, 2023.
[52] V. Tomar et al., “Single sample face recognition using deep learning: A survey,”
Artif. Intell. Rev., pp. 1-49, 2023.
[53] I. Babikir et al., “Evaluation of principal component analysis for reducing
seismic attributes dimensions: Implication for supervised seismic facies classification of
a fluvial reservoir from the Malay Basin, offshore Malaysia,” J. Petrol. Sci. Eng., vol.
217, p. 110911, 2022.
[54] C. Schwarz, “ldagibbs: A command for topic modeling in Stata using latent
Dirichlet allocation,” The Stata Journal, vol. 18, no. 1, pp. 101-117, 2018.
[55] P. N. Thotad et al., “Diabetes disease detection and classification on Indian
demographic and health survey data using machine learning methods,” Diabetes Metab.
Syndr., vol. 17, no. 1, p. 102690, 2023.
[56] C. Schwarz, “ldagibbs: A command for topic modeling in Stata using latent
Dirichlet allocation,” The Stata Journal, vol. 18, no. 1, pp. 101-117, 2018.
[57] W. Jia et al., “Feature dimensionality reduction: A review,” Complex Intell.
Syst., vol. 8, no. 3, pp. 2663-2693, 2022.
[58] A. K. Tyagi and P. Chahal, “Artificial intelligence and machine learning
algorithms” in Research Anthology on Machine Learning Techniques, Methods, and
Applications. IGI Global, 2022, pp. 188-219.
[59] I. Izonin et al., “A two-step data normalization approach for improving
classification accuracy in the medical diagnosis domain,” Mathematics, vol. 10, no. 11,
p. 1942, 2022.
[60] S. Chaudhury et al., “Effective image processing and segmentation-based
machine learning techniques for diagnosis of breast cancer,” Comp. Math. Methods
Med., vol. 2022, 6841334, 2022.
[61] A. D. Sendek et al., “Machine learning modeling for accelerated battery
materials design in the small data regime,” Adv. Energy Mater., vol. 12, no. 31, p.
2200553, 2022.
[62] W. Zheng et al., “Interpretability application of the Just-in-Time software defect
prediction model,” J. Syst. Softw., vol. 188, p. 111245, 2022.
[63] D. Jalal and T. Ezzedine, “Decision tree and support vector machine for anomaly
detection in water distribution networks,” in 2020 International Wireless
Communications and Mobile Computing (IWCMC), Jun. 2020, pp. 1320-1323.
[64] M. F. Ak, “A comparative analysis of breast cancer detection and diagnosis
using data visualization and machine learning applications,” Healthcare (Basel), vol. 8,
no. 2, p. 111, 2020.
[65] Y. Wang et al., “Deep learning-based socio-demographic information
identification from smart meter data,” IEEE Trans. Smart Grid, vol. 10, no. 3, pp. 2593-
2602, 2018.
[66] M. C. Gomes et al., “Tool wear monitoring in micromilling using support vector
machine with vibration and sound sensors,” Precision Engineering, vol. 67, pp. 137-
151, 2021.
[67] S. Liang, “Comparative analysis of SVM, XGBoost and neural network on hate
speech classification,” RESTI, vol. 5, no. 5, pp. 896-903, 2021.
[68] M. Schonlau and R. Y. Zou, “The random forest algorithm for statistical
learning,” The Stata Journal, vol. 20, no. 1, pp. 3-29, 2020.
[69] V. Kadam et al., “Enhancing surface fault detection using machine learning for
3D printed products,” ASI, vol. 4, no. 2, p. 34, 2021.
[70] D. Ramayanti and U. Salamah, Text Classification on Dataset of Marine and
Fisheries Sciences Domain Using Random Forest Classifier, 2018, pp. 1-7.
[71] L. Ren et al., “An adaptive Laplacian weight random forest imputation for
imbalance and mixed-type data,” Inf. Syst., vol. 111, p. 102122, 2023.
[72] C. Faria et al., “A tree-based approach to forecast the total nitrogen in
wastewater treatment plants” in International Symposium on Distributed Computing and
Artificial Intelligence, Sept. 2021, pp. 137-147.
[73] B. Charbuty and A. Abdulazeez, “Classification based on decision tree algorithm
for machine learning,” JASTT, vol. 2, no. 1, pp. 20-28, 2021.
[74] L. Abhishek, “Optical character recognition using ensemble of SVM, MLP and
extra trees classifier” in International Conference for Emerging Technology (INCET),
Jun. 2020, 2020, pp. 1-4.
[75] J. Anmala and V. Turuganti, “Comparison of the performance of decision tree
(DT) algorithms and extreme learning machine (ELM) model in the prediction of water
quality of the Upper Green River watershed,” Water Environ. Res., vol. 93, no. 11, pp.
2360-2373, 2021.
[76] A. H. Fath et al., Implementation of Multilayer Perceptron (MLP) and Radial
Basis Function (RBF) Neural Networks to Predict Solution Gas-Oil Ratio of Crude Oil
Systems, 2020, pp. 80-91.
[77] L. E. McCoubrey et al., “Machine learning uncovers adverse drug effects on
intestinal bacteria,” Pharmaceutics, vol. 13, no. 7, p. 1026, 2021.
[78] A. N. Ahmed et al., Machine Learning Methods for Better Water Quality
Prediction, 2019, p. 124084.
[79] A. Al Bataineh et al., “Multi-layer Perceptron training optimization using nature
inspired computing,” IEEE Access, vol. 10, pp. 36963-36977, 2022.
[80] S. Baressi Šegota et al., “Frigate speed estimation using CODLAG propulsion
system parameters and multilayer Perceptron,” Naše More, vol. 67, no. 2, pp. 117-125,
2020.
[81] P. N. Kumar, Detection of Textual Propaganda Using Passive Aggressive
Classifiers, 2023, pp. 73-79.
[82] K. Varada Rajkumar et al., “Detection of fake news using natural language
processing techniques and passive aggressive classifier” in Intelligent Systems and
Sustainable Computing, Proc. ICISSC 2021, Singapore: Springer Nature Singapore,
2022, pp. 593-601.
[83] S. A. Krishnan et al., SQL Injection Detection Using Machine Learning, p. 11.
[84] S. M. T. S., P. S. Sreeja, and R. P. Ram, “Fake news article classification using
Random Forest, Passive Aggressive, and Gradient Boosting,” in 2022 International
Conference on Connected Systems & Intelligence (CSI), Aug. 2022, pp. 1-6.
[85] S. Sievert et al., “Better and faster hyperparameter optimization with Dask,” in
Proc. 18th Python in Science Conference, Jul. 2019, pp. 118-125.
[86] L. V. Von Krannichfeldt et al., “Online ensemble approach for probabilistic wind
power forecasting,” IEEE Trans. Sustain. Energy, vol. 13, no. 2, pp. 1221-1233, 2021.
[87] D. Chicco and G. Jurman, “The advantages of the Matthews correlation
coefficient (MCC) over F1 score and accuracy in binary classification evaluation,”
BMC Genomics, vol. 21, no. 1, p. 6, 2020.
[88] R. Yacouby and D. Axman, “Probabilistic extension of precision, recall, and f1
score for more thorough evaluation of classification models” in Proc. First Workshop on
Evaluation and Comparison of NLP Systems, Nov. 2020, pp. 79-91.
[89] M. Mudassir et al., “Time-series forecasting of Bitcoin prices using high-
dimensional features: A machine learning approach,” Neural Comput. Appl., pp. 1-15,
2020.
Appendix
Effe_cancerbre__w.filterwarnings("ignore")
Effe_cancerbre = Effe_cancerbre__p.read_csv('data.csv')
Effe_cancerbre.head(n=10)
Effe_cancerbre.tail(n=10)
Effe_cancerbre.shape
Effe_cancerbre.mean()
Effe_cancerbre.max()
Effe_cancerbre=Effe_cancerbre.rename(columns={"diagnosis":"cancer_target"})
Effe_cancerbre.select_dtypes(include=['object']).dtypes
** output: 'cancer_target' is the only column of object type.
Effe_cancerbre.info()
Effe_cancerbre_bn.distplot(Effe_cancerbre['area_mean'], color='orange')
Effe_cancerbre_oy.ylabel("value")
Effe_cancerbre_oy.show()
***** The mean cell-nucleus area differs from patient to patient.
****** The mean radius and mean perimeter of the cell nucleus are directly proportional.
**** The severity of concave portions is lower for malignant cancer patients.
Effe_cancerbre_ingk = Effe_cancerbre_ing.LabelEncoder()
Effe_cancerbre['cancer_target']=
Effe_cancerbre_ingk.fit_transform(Effe_cancerbre['cancer_target'])
Effe_cancerbre['cancer_target']
Effe_cancerbre.to_csv('Effe_cancerbre.csv', index=False)
Effe_cancerbre
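The cells above rename `diagnosis` to `cancer_target` and label-encode it, but the listing's imports are elided. A minimal self-contained sketch of the same step, using hypothetical values in place of the real dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the breast-cancer data (values are illustrative).
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"],
                   "radius_mean": [17.9, 12.1, 11.4, 20.3]})

# Rename the label column as the notebook does.
df = df.rename(columns={"diagnosis": "cancer_target"})

# LabelEncoder sorts the classes, so 'B' -> 0 and 'M' -> 1.
enc = LabelEncoder()
df["cancer_target"] = enc.fit_transform(df["cancer_target"])
```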
Effe_cancerbre = Effe_cancerbre__p.read_csv('Effe_cancerbre.csv')
Effe_cancerbre
Effe_cancerbre['cancer_target'].value_counts()
Effe_cancerbre__X = Effe_cancerbre.drop('cancer_target',axis=1)
Effe_cancerbre__Y = Effe_cancerbre['cancer_target']
Effe_cancerbre__X
Effe_cancerbre__Y
tst_s=0.4
rdm_ste=55
tst_s1=0.5
print("train-size : ", Effe_cancerbre__Xtrn.shape[0])
Effe_cancerbre__Xtrn
Effe_cancerbre__Xvld
Effe_cancerbre__Xtst
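The settings `tst_s=0.4` and `tst_s1=0.5` together with the `Xtrn`/`Xvld`/`Xtst` names suggest a two-stage `train_test_split` yielding a 60/20/20 train/validation/test split. A sketch under that assumption (the synthetic `X`, `y` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2
tst_s, tst_s1, rdm_ste = 0.4, 0.5, 55

# First split: 60% train, 40% held out.
X_trn, X_hold, y_trn, y_hold = train_test_split(
    X, y, test_size=tst_s, random_state=rdm_ste)

# Second split: halve the held-out 40% into validation and test (20% each).
X_vld, X_tst, y_vld, y_tst = train_test_split(
    X_hold, y_hold, test_size=tst_s1, random_state=rdm_ste)
```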
'gamma': ['auto','scale'],
'C':[1.0,1.5,2.0,2.5,3.0,3.5,4.0],
'degree':[3,5,7,8,10,13],
'tol':[1e-3,1e-5,1e-7,1e-9]}
cncr_hyp_Vb = Effe_cancerbresvvm(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul1.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
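The SVM cells above are extracted without their imports or the definitions behind the aliases (`Effe_cancerbresvvm` appears to be an `SVC`, `Effe_cancerbregidr` a `GridSearchCV`). A self-contained sketch of the same tuning step, with synthetic data and a grid trimmed for speed (both assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the breast-cancer features.
X, y = make_classification(n_samples=80, n_features=6, random_state=55)

# Trimmed version of the notebook's SVM grid.
cncr_hyp = {"gamma": ["auto", "scale"],
            "C": [1.0, 2.0],
            "tol": [1e-3, 1e-5]}
search = GridSearchCV(SVC(random_state=55), cncr_hyp, cv=2, verbose=0)
search.fit(X, y)

# Evaluate the tuned model (on the training data here, purely for illustration).
pred = search.best_estimator_.predict(X)
print(classification_report(y, pred))
cm = confusion_matrix(y, pred)
```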
'max_features': ['sqrt','log2',None],
'n_estimators':[100,150,200,250,300,350,400],
'min_samples_split':[2,4,6,8,10,12,14],
'min_samples_leaf':[1,2,3,4,5,6]}
cncr_hyp_Vb = Effe_cancerbrerdmrt(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(5,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(5,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul2.fit(Effe_cancerbre__Xtrn.sample(50,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(50,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
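For the random-forest step (`Effe_cancerbrerdmrt` is presumably a `RandomForestClassifier`), the same grid-search pattern can be sketched end to end; the grid below is a trimmed, illustrative subset of the notebook's ranges:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=80, n_features=6, random_state=55)

# Trimmed version of the notebook's random-forest grid.
cncr_hyp = {"max_features": ["sqrt", "log2", None],
            "n_estimators": [50, 100],
            "min_samples_split": [2, 4],
            "min_samples_leaf": [1, 2]}
search = GridSearchCV(RandomForestClassifier(random_state=55),
                      cncr_hyp, cv=2, verbose=0)
search.fit(X, y)
best_rf = search.best_estimator_
```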
cncr_hyp = { 'splitter': ['best','random'],
'criterion': ['gini','entropy','log_loss'],
'max_features':['auto','sqrt','log2'],
'class_weight':[None,'balanced'],  # the literal string 'dict' is not a valid value
'min_samples_split':[5,6,7,8,9,10,11]}
cncr_hyp_Vb = Effe_cancerbredcinre(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(100,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(100,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul3.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
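For the decision-tree step (`Effe_cancerbredcinre` is presumably a `DecisionTreeClassifier`), a runnable sketch of the grid search; note two corrections relative to the listed grid: the literal string `'dict'` is not a valid `class_weight` value, and `'auto'` has been removed from `max_features` in recent scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=80, n_features=6, random_state=55)

cncr_hyp = {"splitter": ["best", "random"],
            "criterion": ["gini", "entropy"],
            "max_features": ["sqrt", "log2"],    # 'auto' removed in recent scikit-learn
            "class_weight": [None, "balanced"],  # 'dict' is not a valid value
            "min_samples_split": [5, 7, 9]}
search = GridSearchCV(DecisionTreeClassifier(random_state=55),
                      cncr_hyp, cv=2, verbose=0)
search.fit(X, y)
```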
cncr_hyp = { 'activation': ['identity','tanh','relu','logistic'],
'solver': ['lbfgs','sgd','adam'],
'learning_rate':['constant','invscaling','adaptive'],
'learning_rate_init':[0.001,0.0001,0.00001],
'tol':[1e-4,1e-5,1e-6,1e-7]}
cncr_hyp_Vb = Effe_cancerbremllp(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul4.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
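For the multilayer-perceptron step (`Effe_cancerbremllp` is presumably an `MLPClassifier`), a self-contained sketch with the grid cut down so it runs quickly; the reduced ranges are an assumption, not the notebook's exact search space:

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore")  # the tiny grid below may not fully converge

X, y = make_classification(n_samples=80, n_features=6, random_state=55)

cncr_hyp = {"activation": ["relu", "tanh"],
            "solver": ["adam"],
            "learning_rate_init": [0.001, 0.0001],
            "tol": [1e-4]}
search = GridSearchCV(MLPClassifier(random_state=55, max_iter=200),
                      cncr_hyp, cv=2, verbose=0)
search.fit(X, y)
```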
from sklearn.linear_model import PassiveAggressiveClassifier as Effe_cancerbrepsvea
'loss':['hinge','squared_hinge'],  # 'huber' is not a valid PassiveAggressiveClassifier loss
'tol':[1e-4,1e-5,1e-6,1e-7,1e-3]}
cncr_hyp_Vb = Effe_cancerbrepsvea(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul5.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
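The passive-aggressive cells can be sketched the same way. One fix relative to the listed grid: `PassiveAggressiveClassifier` only supports the losses `'hinge'` and `'squared_hinge'`, so `'huber'` would raise an error at fit time:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=80, n_features=6, random_state=55)

cncr_hyp = {"loss": ["hinge", "squared_hinge"],  # 'huber' is not supported
            "tol": [1e-3, 1e-4]}
search = GridSearchCV(PassiveAggressiveClassifier(random_state=55),
                      cncr_hyp, cv=2, verbose=0)
search.fit(X, y)
```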
from sklearn.feature_selection import chi2 as Effe_cancerbre__lc2
Effe_cancerbre__fsfs_M = Effe_cancerbre__fsfs_k.fit(Effe_cancerbre__X,
Effe_cancerbre__Y)
Effe_cancerbre__fsfs_sr =
Effe_cancerbre__p.DataFrame(Effe_cancerbre__fsfs_M.scores_)
Effe_cancerbre__fsfs_co = Effe_cancerbre__p.DataFrame(Effe_cancerbre__X.columns)
Effe_cancerbre__N = Effe_cancerbre__p.concat([Effe_cancerbre__fsfs_sr,
Effe_cancerbre__fsfs_co],axis=1)
Effe_cancerbre__N[:]
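The chi-squared cells import `chi2` and pair each score with its column name; the selector object itself is not shown, so this sketch assumes a `SelectKBest(chi2)` over synthetic non-negative features (chi2 requires non-negative inputs; the column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(55)
X = pd.DataFrame(rng.uniform(0, 10, size=(60, 4)),
                 columns=["radius_mean", "area_mean",
                          "perimeter_mean", "smoothness_mean"])
y = (X["area_mean"] > 5).astype(int)

selector = SelectKBest(score_func=chi2, k="all").fit(X, y)

# Pair each chi2 score with its column name, as the notebook's concat does.
ranking = pd.concat([pd.DataFrame(selector.scores_),
                     pd.DataFrame(X.columns)], axis=1)
```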
Effe_cancerbre__X
tst_s=0.4
rdm_ste=55
tst_s1=0.5
Effe_cancerbre__Xtrn, Effe_cancerbre__Xtst, Effe_cancerbre__Ytrn,
Effe_cancerbre__Ytst = Effe_cancerbretrits(Effe_cancerbre__X, Effe_cancerbre__Y,
test_size=tst_s, random_state=rdm_ste)
Effe_cancerbre__Xtrn
Effe_cancerbre__Xvld
Effe_cancerbre__Xtst
'gamma': ['auto','scale'],
'C':[1.0,1.5,2.0,2.5,3.0,3.5,4.0],
'degree':[3,5,7,8,10,13],
'tol':[1e-3,1e-5,1e-7,1e-9]}
cncr_hyp_Vb = Effe_cancerbresvvm(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul1.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'max_features': ['sqrt','log2',None],
'n_estimators':[100,150,200,250,300,350,400],
'min_samples_split':[2,4,6,8,10,12,14],
'min_samples_leaf':[1,2,3,4,5,6]}
cncr_hyp_Vb = Effe_cancerbrerdmrt(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(5,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(5,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul2.fit(Effe_cancerbre__Xtrn.sample(50,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(50,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'criterion': ['gini','entropy','log_loss'],
'max_features':['auto','sqrt','log2'],
'class_weight':[None,'balanced'],  # the literal string 'dict' is not a valid value
'min_samples_split':[5,6,7,8,9,10,11]}
cncr_hyp_Vb = Effe_cancerbredcinre(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(100,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(100,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul3.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'solver': ['lbfgs','sgd','adam'],
'learning_rate':['constant','invscaling','adaptive'],
'learning_rate_init':[0.001,0.0001,0.00001],
'tol':[1e-4,1e-5,1e-6,1e-7]}
cncr_hyp_Vb = Effe_cancerbremllp(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul4.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'loss':['hinge','squared_hinge'],  # 'huber' is not a valid PassiveAggressiveClassifier loss
'tol':[1e-4,1e-5,1e-6,1e-7,1e-3]}
cncr_hyp_Vb = Effe_cancerbrepsvea(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul5.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
print("\n period of validation data:", cncr_Pdcct2-cncr_Pdcct1,"\n")
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
Effe_cancerbre__fsfs_k= Effe_cancerbre__fsfs(estimator=Effe_cancerbre__lisk)
Effe_cancerbre__fsfs_M = Effe_cancerbre__fsfs_k.fit(Effe_cancerbre__X,
Effe_cancerbre__Y)
Effe_cancerbre__fsfs_M.transform(Effe_cancerbre__X)
Effe_cancerbre__FS = [Effe_cancerbre__X.columns[o] for o in
range(len(Effe_cancerbre__fsfs_M.get_support())) if
Effe_cancerbre__fsfs_M.get_support()[o] == True]
Effe_cancerbre__X=Effe_cancerbre__X[Effe_cancerbre__FS]
Effe_cancerbre__X
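The cells above fit an embedded selector around an aliased estimator (`Effe_cancerbre__fsfs`, `Effe_cancerbre__lisk`) and keep the supported columns. This sketch assumes `SelectFromModel` wrapping a `LogisticRegression`; both choices are assumptions, since the aliases' definitions are not shown:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

Xa, y = make_classification(n_samples=100, n_features=8, random_state=55)
X = pd.DataFrame(Xa, columns=[f"f{i}" for i in range(8)])

# Fit the selector, then keep only the columns whose mask entry is True,
# mirroring the notebook's get_support() loop.
sfm = SelectFromModel(estimator=LogisticRegression(max_iter=1000)).fit(X, y)
mask = sfm.get_support()
selected = [X.columns[i] for i in range(len(mask)) if mask[i]]
X_sel = X[selected]
```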
tst_s=0.4
rdm_ste=55
tst_s1=0.5
Effe_cancerbre__Xtrn
Effe_cancerbre__Xvld
Effe_cancerbre__Xtst
'gamma': ['auto','scale'],
'C':[1.0,1.5,2.0,2.5,3.0,3.5,4.0],
'degree':[3,5,7,8,10,13],
'tol':[1e-3,1e-5,1e-7,1e-9]}
cncr_hyp_Vb = Effe_cancerbresvvm(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul1.fit(Effe_cancerbre__Xtrn.sample(20,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(20,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'max_features': ['sqrt','log2',None],
'n_estimators':[100,150,200,250,300,350,400],
'min_samples_split':[2,4,6,8,10,12,14],
'min_samples_leaf':[1,2,3,4,5,6]}
cncr_hyp_Vb = Effe_cancerbrerdmrt(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(5,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(5,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul2.fit(Effe_cancerbre__Xtrn.sample(50,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(50,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'criterion': ['gini','entropy','log_loss'],
'max_features':['auto','sqrt','log2'],
'class_weight':[None,'balanced'],  # the literal string 'dict' is not a valid value
'min_samples_split':[5,6,7,8,9,10,11]}
cncr_hyp_Vb = Effe_cancerbredcinre(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(100,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(100,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul3.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'solver': ['lbfgs','sgd','adam'],
'learning_rate':['constant','invscaling','adaptive'],
'learning_rate_init':[0.001,0.0001,0.00001],
'tol':[1e-4,1e-5,1e-6,1e-7]}
cncr_hyp_Vb = Effe_cancerbremllp(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul4.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'loss':['hinge','squared_hinge'],  # 'huber' is not a valid PassiveAggressiveClassifier loss
'tol':[1e-4,1e-5,1e-6,1e-7,1e-3]}
cncr_hyp_Vb = Effe_cancerbrepsvea(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul5.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
eva_met='accuracy'
Effe_cancerbre__fsfs_k = Effe_cancerbre__rfc(Effe_cancerbredcinre(random_state=4),
step=3, cv=Effe_cancerbre__strkfd(2), scoring=eva_met)
Effe_cancerbre__fsfs_k.fit(Effe_cancerbre__X, Effe_cancerbre__Y)
print('count: {}'.format(Effe_cancerbre__fsfs_k.n_features_))
Effe_cancerbre__X.drop(Effe_cancerbre__X.columns[Effe_cancerbredcinmy.where(Eff
e_cancerbre__fsfs_k.support_ == False)[0]], axis=1, inplace=True)
Effe_cancerbre__X
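The recursive-elimination cells resolve to `RFECV` over a `DecisionTreeClassifier` with `step=3`, `StratifiedKFold(2)` and accuracy scoring, followed by dropping the unsupported columns. A self-contained version on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

Xa, y = make_classification(n_samples=100, n_features=8,
                            n_informative=3, random_state=4)
X = pd.DataFrame(Xa, columns=[f"f{i}" for i in range(8)])

rfecv = RFECV(DecisionTreeClassifier(random_state=4), step=3,
              cv=StratifiedKFold(2), scoring="accuracy")
rfecv.fit(X, y)
print("count: {}".format(rfecv.n_features_))

# Drop the columns RFECV did not retain, as the notebook does.
X = X.drop(X.columns[np.where(rfecv.support_ == False)[0]], axis=1)
```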
tst_s=0.4
rdm_ste=55
tst_s1=0.5
Effe_cancerbre__Xtrn, Effe_cancerbre__Xtst, Effe_cancerbre__Ytrn,
Effe_cancerbre__Ytst = Effe_cancerbretrits(Effe_cancerbre__X, Effe_cancerbre__Y,
test_size=tst_s, random_state=rdm_ste)
Effe_cancerbre__Xtrn
Effe_cancerbre__Xvld
Effe_cancerbre__Xtst
'gamma': ['auto','scale'],
'C':[1.0,1.5,2.0,2.5,3.0,3.5,4.0],
'degree':[3,5,7,8,10,13],
'tol':[1e-3,1e-5,1e-7,1e-9]}
cncr_hyp_Vb = Effe_cancerbresvvm(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul1.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'max_features': ['sqrt','log2',None],
'n_estimators':[100,150,200,250,300,350,400],
'min_samples_split':[2,4,6,8,10,12,14],
'min_samples_leaf':[1,2,3,4,5,6]}
cncr_hyp_Vb = Effe_cancerbrerdmrt(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(5,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(5,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul2.fit(Effe_cancerbre__Xtrn.sample(50,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(50,random_state=rdm_ste))
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'criterion': ['gini','entropy','log_loss'],
'max_features':['auto','sqrt','log2'],
'class_weight':[None,'balanced'],  # the literal string 'dict' is not a valid value
'min_samples_split':[5,6,7,8,9,10,11]}
cncr_hyp_Vb = Effe_cancerbredcinre(random_state=rdm_ste)
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(100,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(100,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul3.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
'solver': ['lbfgs','sgd','adam'],
'learning_rate':['constant','invscaling','adaptive'],
'learning_rate_init':[0.001,0.0001,0.00001],
'tol':[1e-4,1e-5,1e-6,1e-7]}
cncr_hyp_Vb = Effe_cancerbremllp(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul4.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'loss':['hinge', 'huber'],
'tol':[1e-4,1e-5,1e-6,1e-7,1e-3]}
cncr_hyp_Vb = Effe_cancerbrepsvea(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn.sample(10,random_state=rdm_ste),
Effe_cancerbre__Ytrn.sample(10,random_state=rdm_ste))
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul5.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
print("\n period of validation data:", cncr_Pdcct2-cncr_Pdcct1,"\n")
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
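The blocks above all follow one pattern: a cheap grid search on a subsample, a timed refit on the full training set, then a classification report and confusion matrix on the held-out data. A minimal, self-contained sketch of that pattern, written with the standard scikit-learn names the aliased identifiers appear to wrap (GridSearchCV, classification_report, confusion_matrix); the dataset, parameter values, and the assumption that each `cncr_moul` model is the search's best estimator are illustrative, not the author's exact setup:

```python
import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.4, random_state=55)

# grid search on a small subsample to keep the search cheap
param_grid = {'gamma': ['auto', 'scale'], 'C': [1.0, 1.5, 2.0]}
search = GridSearchCV(SVC(random_state=55), param_grid, cv=2, verbose=1)
search.fit(X_trn[:100], y_trn[:100])

# refit the tuned model on the full training set, timing the fit
model = search.best_estimator_
t1 = time.time()
model.fit(X_trn, y_trn)
t2 = time.time()
print("training time:", t2 - t1)

# evaluate: per-class report plus a confusion matrix
y_pred = model.predict(X_tst)
print(classification_report(y_tst, y_pred))
cm = confusion_matrix(y_tst, y_pred)
print(cm)
```

The `R_op` lines in the listing correspond to `ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1]).plot()`, which renders `cm` as a heatmap when matplotlib is available.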
Effe_cancerbre__X = Effe_cancerbre__dimep_k.fit_transform(Effe_cancerbre__X)
Effe_cancerbre__X
Effe_cancerbre__X.shape
tst_s=0.4
rdm_ste=55
tst_s1=0.5
Effe_cancerbre__Xtrn
Effe_cancerbre__Xvld
Effe_cancerbre__Xtst
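The values `tst_s=0.4` and `tst_s1=0.5` imply a 60/20/20 train/validation/test split: the first split holds out 40% of the data, and the second splits that 40% in half. A sketch of that two-stage split (the dataset here is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
tst_s, tst_s1, rdm_ste = 0.4, 0.5, 55

# 60% train / 40% held out
X_trn, X_hold, y_trn, y_hold = train_test_split(
    X, y, test_size=tst_s, random_state=rdm_ste)
# split the held-out 40% into 20% validation / 20% test
X_vld, X_tst, y_vld, y_tst = train_test_split(
    X_hold, y_hold, test_size=tst_s1, random_state=rdm_ste)

print(len(X_trn), len(X_vld), len(X_tst))
```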
cncr_hyp = {'gamma': ['auto','scale'],
'C':[1.0,1.5,2.0,2.5,3.0,3.5,4.0],
'degree':[3,5,7,8,10,13],
'tol':[1e-3,1e-5,1e-7,1e-9]}
cncr_hyp_Vb = Effe_cancerbresvvm(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul1.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'max_features': ['sqrt','log2',None],
'n_estimators':[100,150,200,250,300,350,400],
'min_samples_split':[2,4,6,8,10,12,14],
'min_samples_leaf':[1,2,3,4,5,6]}
cncr_hyp_Vb = Effe_cancerbrerdmrt(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:5], Effe_cancerbre__Ytrn[:5])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul2.fit(Effe_cancerbre__Xtrn[:50], Effe_cancerbre__Ytrn[:50])
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'criterion': ['gini','entropy','log_loss'],
'max_features':['auto','sqrt','log2'],
'class_weight':['dict','balanced'],
'min_samples_split':[5,6,7,8,9,10,11]}
cncr_hyp_Vb = Effe_cancerbredcinre(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:100], Effe_cancerbre__Ytrn[:100])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul3.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'solver': ['lbfgs','sgd','adam'],
'learning_rate':['constant','invscaling','adaptive'],
'learning_rate_init':[0.001,0.0001,0.00001],
'tol':[1e-4,1e-5,1e-6,1e-7]}
cncr_hyp_Vb = Effe_cancerbremllp(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul4.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'loss':['hinge', 'huber'],
'tol':[1e-4,1e-5,1e-6,1e-7,1e-3]}
cncr_hyp_Vb = Effe_cancerbrepsvea(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul5.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
Effe_cancerbre__X = Effe_cancerbre__dimep_k.fit_transform(Effe_cancerbre__X)
Effe_cancerbre__X
Effe_cancerbre__X.shape
## 60% train / 20% validation / 20% test split
tst_s=0.4
rdm_ste=55
tst_s1=0.5
Effe_cancerbre__Xtrn
Effe_cancerbre__Xvld
Effe_cancerbre__Xtst
cncr_hyp = {'gamma': ['auto','scale'],
'C':[1.0,1.5,2.0,2.5,3.0,3.5,4.0],
'degree':[3,5,7,8,10,13],
'tol':[1e-3,1e-5,1e-7,1e-9]}
cncr_hyp_Vb = Effe_cancerbresvvm(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul1.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul1.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'max_features': ['sqrt','log2',None],
'n_estimators':[100,150,200,250,300,350,400],
'min_samples_split':[2,4,6,8,10,12,14],
'min_samples_leaf':[1,2,3,4,5,6]}
cncr_hyp_Vb = Effe_cancerbrerdmrt(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:5], Effe_cancerbre__Ytrn[:5])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul2.fit(Effe_cancerbre__Xtrn[:50], Effe_cancerbre__Ytrn[:50])
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul2.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'criterion': ['gini','entropy','log_loss'],
'max_features':['auto','sqrt','log2'],
'class_weight':['dict','balanced'],
'min_samples_split':[5,6,7,8,9,10,11]}
cncr_hyp_Vb = Effe_cancerbredcinre(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:100], Effe_cancerbre__Ytrn[:100])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul3.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul3.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'solver': ['lbfgs','sgd','adam'],
'learning_rate':['constant','invscaling','adaptive'],
'learning_rate_init':[0.001,0.0001,0.00001],
'tol':[1e-4,1e-5,1e-6,1e-7]}
cncr_hyp_Vb = Effe_cancerbremllp(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul4.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul4.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_hyp = {'loss':['hinge', 'huber'],
'tol':[1e-4,1e-5,1e-6,1e-7,1e-3]}
cncr_hyp_Vb = Effe_cancerbrepsvea(random_state=rdm_ste)
cncr_hyp_Vb = Effe_cancerbregidr(cncr_hyp_Vb, cncr_hyp,
cv=2, verbose=1)
cncr_hyp_Vb.fit(Effe_cancerbre__Xtrn[:10], Effe_cancerbre__Ytrn[:10])
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_moul5.fit(Effe_cancerbre__Xtrn, Effe_cancerbre__Ytrn)
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xvld)
print(Effe_cancerbreclfctr(Effe_cancerbre__Yvld, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Yvld,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
cncr_Pdcct1 = Effe_cancerbretiim.time()
cncr_pdt= cncr_moul5.predict(Effe_cancerbre__Xtst)
print(Effe_cancerbreclfctr(Effe_cancerbre__Ytst, cncr_pdt))
P_os = Effe_cancerbrecfusx(Effe_cancerbre__Ytst,cncr_pdt)
R_op = Effe_cancerbrecfmrd(confusion_matrix = P_os, display_labels = [0,1])
R_op.plot()
cncr_Pdcct2 = Effe_cancerbretiim.time()
Effe_cancerbre_R = {'LDA_dim_red':{'SVM':77, "RF":93, 'DT':95, 'MLP':90, 'PAC':92}}
Effe_cancerbre_Re = Effe_cancerbre__p.DataFrame(Effe_cancerbre_R)
Effe_cancerbre_Re.plot.bar()
Effe_cancerbre__oty.xticks()
Effe_cancerbre__oty.ylabel('Result')
Effe_cancerbre__oty.legend(loc='lower left')
Effe_cancerbre__oty.show()
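The chart above can be sketched with standard pandas/matplotlib calls, which the aliased `Effe_cancerbre__p` / `Effe_cancerbre__oty` names appear to wrap. Only the LDA column survives in the listing; its accuracy figures are reproduced here, and the headless backend and output filename are assumptions for a runnable script:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# accuracy per classifier for the LDA-reduced run, as reported above
results = {'LDA_dim_red': {'SVM': 77, 'RF': 93, 'DT': 95, 'MLP': 90, 'PAC': 92}}
df = pd.DataFrame(results)

ax = df.plot.bar()          # one bar per classifier
plt.ylabel('Result')
plt.legend(loc='lower left')
plt.tight_layout()
plt.savefig('model_comparison.png')  # plt.show() in an interactive session
```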