
Analysis of Impact of Principal Component Analysis and Feature Selection for Detection of Breast Cancer Using Machine Learning Algorithms

Chitra Desai
Department of Computer Science, National Defence Academy, Pune

Abstract: Dimensionality reduction for medical data is a challenging task for datasets that are large in dimension and carry critical information. Feature selection and compression techniques can be applied to high-dimensional datasets to reduce the number of features. In feature selection, a subset of the existing features is selected and the rest are ignored based on certain feature selection criteria. In compression, new features are created from the existing ones while retaining the important information of the original features. However, feature selection, particularly on medical datasets, can lead to the loss of important information if the data is not understood correctly during exploratory data analysis. Several techniques exist for feature selection and for compression-based dimensionality reduction. This paper applies principal component analysis (PCA), a compression technique, to a breast cancer dataset and performs detection using machine learning algorithms. It also presents feature selection using one of the machine learning algorithms.
Initially, pre-processing is performed on the dataset, followed by exploratory data analysis. An in-depth study of the data, its characteristics and its distribution is carried out. With the help of data visualization, attempts are made to gain insight into the data with reference to the standards in the domain of breast cancer. The data is cleaned and standardized before applying PCA. Using box plots, the dataset is checked for outliers, since the aim is to standardize the data by removing the mean and scaling each feature to unit variance.
The selection of the number of principal components plays an important role, as it impacts the accuracy of the machine learning algorithms. Using a scree plot, an attempt is made to select an appropriate number of principal components. The information captured in the low-dimensional space is represented using bivariate scatter plots to gain an understanding of the data. Experiments with and without PCA using different machine learning algorithms for detection of breast cancer are demonstrated. The dataset is split 80:20 into training and testing data. The machine learning algorithms applied here for detection of breast cancer are logistic regression, support vector machine, decision tree and random forest. Feature selection is demonstrated using random forest.
The impact of PCA is analysed by computing the cost function for each machine learning algorithm with and without PCA. The confusion matrix in each case is plotted separately for training and testing data. The values of true positives, true negatives, false positives and
false negatives obtained from the confusion matrix play a significant role for medical data. Based on these values, the training and testing reports for precision, recall, F1 score and support are generated. Precision-recall curves are plotted to gain insight into the average precision and average recall on training and testing data. The performance of all models is evaluated, model tuning is performed on individual models as required, and the models are ranked accordingly.
Keywords: Principal Component Analysis, Feature Selection, Machine Learning Algorithm,
Breast Cancer Detection, Confusion Matrix, Data Visualization, Exploratory Data Analysis,
Data Pre-Processing

1. Introduction
The widespread use of computer-based systems in health care and in various diagnostic equipment has resulted in a huge volume of data being generated. This data is useful for prediction and classification using machine learning algorithms, to identify patterns, perform data mining, extract knowledge (knowledge discovery), detect anomalies and perform clustering [1,2,3]. Data in health care is generated in the form of medical records, administrative reports, important findings for setting benchmarks [4], clinical trials, health insurance, surveys and more. The data can take several forms, such as text or images, structured or unstructured. It can be data from blood or tissue samples, x-rays, CT scans, mammograms, MRI scans, results obtained from health devices such as the electrocardiogram (ECG) and electroencephalogram (EEG), or data in speech format, for example clinical conversations. The huge and complex nature of this data leads to several challenges: dealing with noisy data, high dimensionality, security aspects of health care data, integrating data from various sources, selecting an appropriate tool set, issues related to growing data volumes (particularly storage and retrieval), and a lack of professionals to handle the data. Of these challenges, the critical one addressed here is the high dimensionality of medical data.
Medical datasets contain many features and instances, due to which achieving good classification accuracy using machine learning algorithms becomes a challenging task [5]. It is difficult to visualize a training set with a huge number of features [6], and difficult to determine which instance or which feature will have what impact on classification. Features that are redundant or carry poor-quality input also hamper the predictive capability of the model. Thus, using all the features may lead to the curse of dimensionality and eventually impact computational complexity and classification performance [7]. This gives rise to the need for dimensionality reduction.
Dimensionality reduction for medical data is a challenging task for datasets that are large in dimension and carry critical information. Feature selection and compression (that is, feature extraction) techniques can be applied to high-dimensional datasets to reduce the features. This paper discusses the analysis of the impact of dimensionality reduction on the breast cancer detection dataset from the UCI machine learning repository. The description of the dataset is presented in section 2. Before applying any feature selection or feature extraction technique, it is essential that the data be pre-processed; exploratory analysis also needs to be performed to gain insight into the data. Section 3 presents pre-processing and exploratory data analysis for the breast cancer detection dataset. Section 4 presents the standardization of the data.


The classifiers used in this paper for detection of breast cancer are logistic regression, support vector machine, decision tree and random forest. A brief introduction to these classifiers is presented in section 5.
In feature selection, a subset of features is selected from the existing features and the remaining are ignored based on certain feature selection criteria. In feature selection, the original features (a subset) are preserved, which effectively defines the dataset [8]. However, feature selection, particularly on medical datasets, can lead to the loss of important information if the data is not understood correctly during exploratory data analysis. It is therefore essential to understand which feature selection technique should be chosen. Feature selection techniques can be supervised or unsupervised. The supervised techniques include wrapper, filter and embedded methods [9]. Optimal feature selection for medical datasets can also be achieved using nature-inspired algorithms such as the genetic algorithm [10], particle swarm optimization [11,12] and the artificial bee colony [13]. This paper demonstrates feature selection on a medical dataset using the random forest classifier in section 6. The random forest classifier belongs to the class of embedded methods, which combine the qualities of the wrapper and filter classes for feature selection. Embedded methods consider dependencies between features, have better computational complexity than wrapper methods, have higher performance accuracy than filter methods, and are less prone to overfitting [14].
Feature extraction, i.e. compression of features, is less prone to overfitting compared to feature selection [14]. In compression, new features are created from the existing ones while retaining the important information of the original features. Here, a new reduced set of features is created from the existing ones by applying an algebraic transformation based on some optimization criterion [15,16]. There are several feature extraction techniques, such as principal component analysis (PCA), kernel principal component analysis (KPCA), independent component analysis (ICA) [17] and linear discriminant analysis (LDA) [18]. LDA, using a statistical technique, reduces the dimensionality while preserving as much of the class-discriminatory information as possible. PCA is a linear transformation method for feature extraction, KPCA is a nonlinear PCA developed using the kernel method, and ICA linearly transforms the original features into statistically independent features. This paper applies principal component analysis (PCA), a compression technique, to the breast cancer dataset and performs detection using machine learning algorithms. PCA is discussed in section 7. Section 8 presents the results and conclusion.

2. Data Set
The dataset for breast cancer diagnostics [19] analysed in this paper consists of 569 instances across 32 attributes. Of the 32 attributes, one is an ID, another is the diagnosis, which is the target variable, and the remaining 30 attributes are the feature vectors. The features are computed from a digitized image of a fine needle aspirate of a breast mass and represent the characteristics of the cell nuclei present in the image. A detailed description of each of these feature vectors can be found in [20,21,22,23]. The feature values are recorded up to four significant digits and stored using the float64 data type. The features are: radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean,
symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave_points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave_points_worst, symmetry_worst, fractal_dimension_worst.
It is observed that there are no null values in the dataset. The target variable, diagnosis, is categorical, making this a binary classification problem. The value of the diagnosis attribute is either M or B, where M = malignant and B = benign. As the data type of the diagnosis attribute is object, encoding is applied to replace ‘B’ with ‘0’ and ‘M’ with ‘1’. The number of observations for benign (0) is 357 and for malignant (1) is 212; that is, 62.7% of observations are non-cancerous cells and 37.3% are cancerous cells.
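As a minimal sketch, the loading and encoding steps described above can be reproduced as follows. This is illustrative rather than the author's code: the local file path "wdbc.data" and the column-name construction are assumptions (the UCI file [19] ships without a header).

```python
# A minimal sketch of loading and encoding the WDBC data; the local file
# path "wdbc.data" and the column-name construction are assumptions.
import pandas as pd

base = ["radius", "texture", "perimeter", "area", "smoothness", "compactness",
        "concavity", "concave_points", "symmetry", "fractal_dimension"]
cols = (["id", "diagnosis"]
        + [f"{b}_mean" for b in base]     # fields 3-12: means
        + [f"{b}_se" for b in base]       # fields 13-22: standard errors
        + [f"{b}_worst" for b in base])   # fields 23-32: worst values

df = pd.read_csv("wdbc.data", header=None, names=cols)

# Encode the target: benign 'B' -> 0, malignant 'M' -> 1
df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})

print(df.shape)                        # (569, 32)
print(df["diagnosis"].value_counts())  # 357 benign (0), 212 malignant (1)
```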

3. Exploratory Data Analysis


Cancer has been a great cause of concern: despite advancements in health care, around ten million deaths were observed in 2020 [24]. According to the World Health Organization, among the most common cancers is breast cancer (2.26 million cases), leading to 685,000 deaths. The mortality rate can be reduced by early detection of cancer. There are several clinical parameters that are useful in the diagnosis of cancer. In this section, breast cancer sample parameters (feature vectors) have been analysed for demonstration, along with data visualization to gain insight into the main characteristics of the data.
The available dataset has no null values and no duplicates. Table 1 presents an overview of the dataset. There are a few variables with high correlation and some with zero values; the variable id is unique. The details are shown in figure 1. The details in table 1 and figure 1 help one decide which variables to focus on. Although certain variables contain zero values, this does not mean the values are absurd; what is important to note is whether these variables can legitimately take the value zero, and if so, whether the zeros can be kept or should be replaced with appropriate values such as the mean, median or mode during pre-processing. For example, in the breast cancer dataset, concavity represents the severity of the concave portions of the contour and can genuinely be zero, so these zeros need not be treated and can continue to be used as meaningful values.
Each feature vector can be analysed to gain insight into the data and identify how it impacts classification into benign or malignant. Consider the first feature vector, radius_mean, which is highly correlated with perimeter_mean, area_mean, perimeter_worst and radius_worst. Figure 2 shows the swarm plot for radius_mean. The mean of radius_mean is 12.14 for benign and 17.46 for malignant.
The minimum and maximum values of radius_mean are 6.98 and 17.85 for benign, and 10.95 and 28.11 for malignant. From these values, if one wants to infer the probability of being benign or malignant given the value of radius_mean, then, for example, any value above 17.85 suggests a malignant tumour.


Table 1 Overview of Breast Cancer Dataset

Dataset Statistics              Count
Number of variables             32
Number of observations          569
Missing cells                   0
Duplicate rows                  0
Total size in memory            142.4 KiB
Average record size in memory   256.2 B

Figure 1 High Correlation between variables and variables with zero Value


Figure 2 Swarm Plot for radius_mean
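The per-class statistics quoted above, and a swarm plot like the one in figure 2, can be obtained with a short sketch, assuming the DataFrame df from the loading sketch in section 2:

```python
# A sketch of the per-class statistics behind figure 2 and the swarm plot
# itself, assuming the DataFrame df from the loading sketch in section 2.
import seaborn as sns
import matplotlib.pyplot as plt

stats = df.groupby("diagnosis")["radius_mean"].agg(["mean", "min", "max"])
print(stats)  # benign: mean 12.14, range 6.98-17.85; malignant: 17.46, 10.95-28.11

sns.swarmplot(x="diagnosis", y="radius_mean", data=df, size=3)
plt.xlabel("diagnosis (0 = benign, 1 = malignant)")
plt.show()
```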

Outliers significantly impact the power of statistical tests. Using a box plot it is possible to detect the outliers for the variables. Figure 3 shows the box plot for area_mean. The outliers of area_mean on the maximum side for malignant cases are 2501, 2499, 2250, 2010 and 1828. The mean of area_mean for malignant cases is 978.37, computed including these outliers, which indicates how the outliers have affected the mean of area_mean for malignant instances.

Figure 3 Box plot for area_mean (Indicating Outliers)
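A sketch of this outlier check follows, assuming df from section 2. The conventional 1.5×IQR whisker rule is an assumption here, as the paper does not state how the box-plot outliers were flagged:

```python
# A sketch of the outlier check behind figure 3; the 1.5*IQR whisker rule
# is an assumption, as the paper does not state how outliers were flagged.
import seaborn as sns
import matplotlib.pyplot as plt

malignant = df.loc[df["diagnosis"] == 1, "area_mean"]
q1, q3 = malignant.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
print(sorted(malignant[malignant > upper], reverse=True))  # 2501.0, 2499.0, ...

sns.boxplot(x="diagnosis", y="area_mean", data=df)
plt.show()
```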


Figure 4 Scatter Plot showing spread of malignant and benign with respect to radius_mean
and concavity_mean

Figure 4 presents a scatter plot across two variables, concavity_mean and radius_mean, which clearly shows how the two variables together help distinguish benign and malignant cases. The correlation between the different variables is shown by the heatmap in figure 5. Thus, all variables can be explored to the extent possible with the help of statistical values and visualization.

Figure 5 Heatmap
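A correlation heatmap like figure 5 can be drawn as a short sketch, again assuming df from section 2:

```python
# A sketch of the correlation heatmap in figure 5, assuming df from section 2.
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.drop(columns=["id"]).corr()   # correlations among numeric variables
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```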

Let us assume that X holds the feature vectors and y holds the target variable for the discussion that follows.


4. Standardization and Train Test Split


The dataset, as mentioned above, consists of 32 attributes. The values of the 31 numeric attributes (feature vectors) lie in varying ranges; figure 6 gives insight into these ranges. PCA is very sensitive to the variance of the initial attributes: attributes with large ranges are likely to dominate attributes with small ranges, which would result in biased outcomes. To avoid this problem, the data is standardized. Here the data is standardized to a Gaussian with zero mean and unit variance: StandardScaler is imported from sklearn.preprocessing [25] and X is transformed with it. Let us assume Xs holds the transformed values. Figure 7 shows sample results after standardization (Xs).
PCA is applied to the transformed values (Xs). There are two approaches to the PCA transformation: either apply PCA to the entire scaled feature vector (Xs) and then split it into train and test data, or split the scaled feature vector (Xs) into train and test data and apply PCA to each individually. Experiments were performed using both approaches, and no significant difference was observed between them for the breast cancer dataset. The data (Xs) is transformed using PCA and the first four components are selected, as explained in section 7. Let us say Xs is transformed to Xs_pca; the dimension of Xs_pca is 569 instances (rows) by four components (columns). Xs_pca is split into 80% training data and 20% testing data: the training data consists of 455 instances and the test data of 114 instances.
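A sketch of these steps follows, assuming X and y as defined in section 3; the random seed is an assumption, as the paper does not report one:

```python
# A sketch of the standardization, PCA transformation and 80:20 split
# described above; the random seed is an assumption (the paper reports none).
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

Xs = StandardScaler().fit_transform(X)           # zero mean, unit variance
Xs_pca = PCA(n_components=4).fit_transform(Xs)   # 569 rows x 4 components

X_train, X_test, y_train, y_test = train_test_split(
    Xs_pca, y, test_size=0.2, random_state=42)   # 455 train / 114 test
```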


Figure 6 Statistical insight into the Feature Vectors showing Varying Range of Values across
each Feature Vector

Figure 7 Sample Output after Standardization (Xs)

5. Machine Learning Algorithms


Machine learning is commonly described as follows: we have a data source, we feed this data to a machine learning algorithm, and the algorithm learns patterns from the data. It is essential to
understand learning from both the machine and the human perspective [26]. Humans learn by observing or by getting directly involved in a task; with experience, humans tend to improve their performance at the task. Machine learning emulates human learning and improves with experience, gaining that experience with the help of an algorithm. Machine learning models can be parametric or non-parametric. A model is termed parametric if it learns the data based on an assumed distribution of the data, for example logistic regression. The support vector machine (SVM), decision tree and random forest are non-parametric, as they make no assumptions about the distribution of the data. For all four classification models (logistic regression, SVM, decision tree and random forest), performance evaluation can be done using the confusion matrix [27].
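As a sketch, the four classifiers can be trained and their confusion matrices inspected as follows, assuming the train/test split from section 4; the hyperparameters shown are scikit-learn defaults, since the paper does not list its settings:

```python
# A sketch of training the four classifiers and inspecting their confusion
# matrices, assuming the train/test split from section 4; the hyperparameters
# are scikit-learn defaults, since the paper does not list its settings.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(confusion_matrix(y_train, model.predict(X_train)))  # training data
    print(confusion_matrix(y_test, model.predict(X_test)))    # testing data
```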
In the confusion matrix the values are represented as true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), considering the actual and predicted values. TP means the model predicted positive and the actual value was positive; TN means the model predicted negative and the actual value was negative; FP means the model predicted positive but the actual value was negative; FN means the model predicted negative but the actual value was positive. The obtained values of TP, TN, FP and FN are used to compute the indicators precision, recall and F1-score [28]. Precision is given by equation (1), recall by equation (2) and F1-score by equation (3).
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \qquad (3)$$

For probabilistic forecasts in binary classification, the ROC curve and the precision-recall curve are helpful tools. It is observed from figure 2 that the two classes ‘B’ and ‘M’ are imbalanced, so the precision-recall curve is preferred for analysing the performance of the models; the ROC curve is best suited when the classes of the target variable are balanced. For binary classification on imbalanced data, the precision-recall curve is more informative than the ROC curve [29].
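A sketch of such a precision-recall analysis follows, assuming a fitted classifier that exposes predict_proba (for example, logistic regression):

```python
# A sketch of the precision-recall analysis, assuming a fitted classifier
# that exposes predict_proba (for example, logistic regression).
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

scores = model.predict_proba(X_test)[:, 1]   # probability of class 1 ('M')
precision, recall, _ = precision_recall_curve(y_test, scores)
print("AP:", average_precision_score(y_test, scores))

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```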

5.1 Logistic Regression


Logistic regression is a widely used algorithm for cancer detection, as the dichotomous nature of the target variable makes the problem suitable for classification; applications of logistic regression to cancer detection are widely reported [30,31,32,33]. Logistic regression is a basic linear classifier which does not move the decision boundary but only adjusts the gradient of the decision boundary according to the separation of the classes. Let x be the input variable and y the target variable. The sigmoid function f(x) [34] is given by equation (4), where e is the base of the natural logarithm.
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (4)$$

Figure 8 shows an example of the sigmoid function f(x) for input x; the graph illustrates the suitability of logistic regression for binary classification. As mentioned above, in our breast cancer dataset we have represented ‘B’ with ‘0’ and ‘M’ with ‘1’. If the value returned by the model for input x is greater than or equal to the threshold 0.5, then ‘M’ (1) is assigned; otherwise ‘B’ (0) is assigned.

Figure 8 Sigmoid Function

The model is trained using the logistic regression algorithm; the resulting confusion matrix and precision-recall curve are shown in figure 9.

Figure 9 Confusion Matrix and Precision-Recall Curve for Logistic Regression


The model gives an accuracy of 97.36%. The average precision (AP) is 0.93 for the training data and 0.98 for the test data, while the average recall (AR) is 0.67 for the training data and 0.56 for the test data.

5.2 Support Vector Machine


The support vector machine (SVM) [35] belongs to the supervised learning class of machine learning algorithms and is based on the concept of a hyperplane that separates the features into different domains (classes). SVM can be used for both classification and regression. SVM is a linear classifier that uses the principle of margin maximization; it transforms the data using the kernel trick, which is then used to find an optimal boundary between the possible outputs. The data points closest to the hyperplane are called support vectors, and the distance of these vectors from the hyperplane is called the margin.

Figure 10 Confusion Matrix and Precision Recall for SVM

Figure 10 shows the confusion matrix and precision-recall curve for the breast cancer detection dataset using SVM. It is observed that the average precision (AP) is 0.92 for the training data and 0.98 for the test data, and the average recall (AR) is 0.68 for the training data and 0.57 for the test data. Compared to logistic regression, a drop of 0.01 in average precision on the training data and an increase of 0.01 in average recall on both training and testing data are observed for SVM.


5.3 Decision Tree


The decision tree [36,37] is a supervised learning algorithm used for both classification and regression. A decision tree is an inverted tree structure with the root node at the top, followed by branch and leaf nodes. Each internal node of the decision tree represents a feature (attribute) and each leaf node represents an outcome. At each branching node, a feature from the feature vector is examined, and the tree learns to partition on the basis of the attribute value. Algorithms such as ID3 and C4.5, and measures such as the Gini index, are the attribute selection methods used by decision trees to select the best attribute on which to split the instances. The tree is recursively partitioned into smaller subsets until no more attributes or instances are left.

Figure 11 Confusion Matrix and Precision Recall for Decision Tree

Figure 11 shows the confusion matrix and precision-recall curve for the breast cancer detection dataset using the decision tree. It is observed that the average precision (AP) is 1.00 for the training data and 0.77 for the test data, and the average recall (AR) is 0.50 for the training data and 0.65 for the test data.
5.4 Random Forest
In a decision tree there is a single tree, whereas a random forest [38] is an accumulation of trees which together form the forest. In a random forest, multiple decision trees are created randomly. It uses ensemble learning, training a number of individual models in parallel, where each model is trained on a random subset of the data. Using a larger number of such subsets helps manage bias in the distribution of the data. Figure 12 shows the confusion matrix and precision-recall curve for the breast cancer detection dataset using the random forest.

Figure 12 Confusion Matrix and Precision recall curve for Random Forest

6. Feature Selection using Random Forest Classifier


Using the feature selection technique based on the random forest classifier, the following ten features were selected: ['radius_mean', 'perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'radius_worst', 'perimeter_worst', 'area_worst', 'concavity_worst', 'concave points_worst'].
The feature importance graph with the relative importances, based on which the above ten features were selected, is shown in figure 13 below. The threshold is set to 0.03.
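As a sketch under stated assumptions (df from section 2; the estimator settings are not given in the paper), the importance-based selection with a 0.03 threshold can be written as follows. Importances are read directly from the fitted forest, since the paper does not name the exact API used:

```python
# A sketch of importance-based selection with a 0.03 threshold, assuming df
# from section 2; the estimator settings are assumptions, and importances are
# read directly from the fitted forest rather than via any particular API.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=["id", "diagnosis"])
y = df["diagnosis"]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)

selected = importances[importances > 0.03].sort_values(ascending=False)
print(list(selected.index))            # expected to include radius_mean, ...

importances.sort_values().plot.barh(figsize=(8, 10))  # cf. figure 13
plt.show()
```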


Figure 13 Relative Feature Importance for Features in Breast Cancer Dataset

The AP is 1.00 for the training data and 0.88 for the testing data; the AR is 0.85 for the training data and 0.79 for the testing data. It is observed that the AR for feature selection using the random forest classifier is the highest compared to the AR of logistic regression, SVM, decision tree and random forest (without feature selection).


Figure 14 Confusion Matrix and Precision-Recall Curve for Feature Selection using Random Forest Classifier

7. Principal Component Analysis


Karl Pearson invented the technique called principal component analysis (PCA) for dimensionality reduction in 1901 [39]. PCA is useful for reducing noise and eliminating redundant data while preserving important information. It helps summarize the data into indices called principal components, which are uncorrelated. PCA aims at reducing the dimensions of the data while retaining as much information as possible. PCA defines a new coordinate system in which the first principal component points in the direction of greatest variation in the data, the second principal component is orthogonal to the first and points in the direction of the second-greatest variation, the third principal component is orthogonal to the first and second, and so on [40].
Datasets with a large number of features hamper the speed of machine learning algorithms. In principal component analysis, the large number of features is reduced to a small number of principal components, thereby increasing the speed of the machine learning algorithms. The application of PCA to various medical datasets has been observed and found effective. Multi-class tumour classification [41], classification of breast cancer from original white blood cell images [42], cancer detection with up to 100% accuracy using PCA and SVM [43] and ECG classification [44] are a few examples showing the effectiveness of PCA on medical data.
In the present breast cancer detection dataset there are 31 feature vectors, that is, 31 dimensions. When PCA is applied, a principal component matrix of the same dimension as the original (thirty-one) is computed, explaining how all the variables relate to each other; each component has both a direction and a magnitude. The number of principal components is then selected based on how much variance each component explains, or on the eigenvalue criterion. A scree plot supports the decision on the number of principal components, as shown in figure 15 below for our dataset.

Figure 15 Scree Plot for Detecting Number of Principal Components
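A sketch producing such a scree plot, together with the eigenvalues and eigenvectors reported in table 2 and figure 16, assuming the standardized data Xs from section 4:

```python
# A sketch of the scree plot (figure 15) and the eigenvalue/eigenvector
# inspection (table 2, figure 16), assuming the standardized data Xs.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA().fit(Xs)                    # keep all components initially
ratios = pca.explained_variance_ratio_
plt.plot(range(1, len(ratios) + 1), ratios, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()

print(pca.explained_variance_[:4])     # eigenvalues λ1..λ4 (table 2)
print(pca.components_[:4])             # eigenvectors e1..e4 (figure 16)
```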

While selecting the number of principal components, it is essential to ensure that the chosen components explain the maximum variance, so that maximum information is retained. Table 2 shows the variance explained by each component and its ratio. As four components were chosen, there are four eigenvectors and four eigenvalues (one for each eigenvector). Figure 16 shows the values of the four eigenvectors e1, e2, e3, e4 and the eigenvalues λ1, λ2, λ3, λ4 for the corresponding eigenvectors.

Table 2 Explained Variance and its Ratio for Principal Components

Principal Component   Explained Variance   Explained Variance Ratio
PC1                   13.31145188          0.42864701
PC2                   5.70683496           0.18376792
PC3                   2.84038694           0.09146436
PC4                   1.98484548           0.06391475


Figure 16 Eigen Vectors and Eigen Values

On observing the elbow in the scree plot (figure 15), it was decided to select four principal components. The scree plot shows the maximum variability explained by the first principal component; components 2 and 3 explain moderate variability, and the curve starts flattening from the fourth component onwards. All components before the curve flattens are selected. For the data under study, a single elbow was obtained, simplifying the choice of principal components; however, there can be situations where more than one elbow is observed, making it harder to decide on the number of components. Considering the eigenvectors and eigenvalues, it is also observed that the eigenvalues are in descending order, that is, the first eigenvector corresponds to the first principal component PC1, and so on. Thus, using PCA, the number of dimensions of the breast cancer dataset has been reduced from thirty-one to four. That is, the principal components obtained with principal component analysis are used further as the feature vectors. These feature vectors are formed from the eigenvectors by re-expressing the data from the original axes in the new axes represented by the principal components. The pair plot for the four components is presented in figure 17.


Figure 17 Scatter Plot for the Four Principal Components
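A pair plot like figure 17 can be produced as follows, assuming Xs_pca and y from the earlier sketches:

```python
# A sketch of the pair plot in figure 17, assuming Xs_pca and y from above.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pcs = pd.DataFrame(Xs_pca, columns=["PC1", "PC2", "PC3", "PC4"])
pcs["diagnosis"] = pd.Series(y).values
sns.pairplot(pcs, hue="diagnosis", corner=True)
plt.show()
```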

Figure 18 Confusion Matrix and PR Curve for LR with PCA


Figure 19 Confusion Matrix and PR Curve for SVM with PCA

Figure 20 Confusion Matrix and PR Curve for Decision Tree with PCA


Figure 21 Confusion Matrix and PR Curve for Random Forest with PCA

With the four principal components obtained, a new feature set is available to give to the classification algorithms. Using these four principal components, the confusion matrices and precision-recall curves are plotted. Figures 18 to 21 show the confusion matrix and precision-recall curve for logistic regression, SVM, decision tree and random forest, each with PCA.
It is observed from these precision-recall curves that random forest with PCA obtains the highest rank for prediction.

8. Results and Conclusion


Table 3 presents the AP and AR values for all the algorithms under consideration, with and without PCA, as well as for feature selection using the random forest classifier. The objective of these experiments was to test how feature selection and feature reduction impact the predictive capability of the models.


Table 3 AP and AR for All Models

Model                          Training AP   Training AR   Testing AP   Testing AR
Logistic Regression            0.93          0.67          0.98         0.56
SVM                            0.92          0.68          0.98         0.57
Decision Tree                  1.00          0.50          0.77         0.65
Random Forest                  1.00          0.83          0.90         0.77
Logistic Regression (PCA)      0.87          0.73          0.95         0.62
SVM (PCA)                      0.89          0.70          0.95         0.63
Decision Tree (PCA)            1.00          0.50          0.76         0.61
Random Forest (PCA)            1.00          0.83          0.87         0.77
Feature Selection Using RFC    1.00          0.85          0.88         0.79

Four machine learning algorithms were considered for the study: logistic regression, support vector machine, decision tree and random forest. Prediction with these algorithms was carried out with and without PCA. Using PCA, the 31 features were reduced to 4 components. To analyse the impact of feature selection on prediction, feature selection using the random forest classifier was also performed: of the 31 features, 10 were selected by computing relative feature importance.
The performance of the models was evaluated using the confusion matrix and the precision-recall curve. The AP and AR values in table 3 show that, with feature selection and feature reduction, the performance of the prediction algorithms is enhanced, along with the benefits that feature selection and feature reduction offer in terms of resource utilization, as discussed above.
Using feature selection (with the random forest classifier) and feature reduction (PCA), the performance of the prediction algorithms improved and was found suitable for breast cancer detection compared to models without PCA and without feature selection.

Reference list

1. S. H. Liao, P. H. Chu, and P. Y. Hsiao, “Data mining techniques and applications - A


decade review from 2000 to 2011,” Expert Syst. Appl., vol. 39, no. 12, pp. 11303–11311,
2012.
2. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge
discovery in databases,” AI Mag., pp. 37–54, 1996.
3. R. Veloso, F. Portela, M. F. Santos, Á. Silva, F. Rua, A. Abelha, and J. Machado, “A
Clustering Approach for Predicting Readmissions in Intensive Medicine,” Procedia Technol.,
vol. 16, pp. 1307–1316, 2014.
4. N. Wickramasinghe, S. K. Sharma, and J. N. D. Gupta, “Knowledge Management in
Healthcare,” vol. 63, pp. 5–18, 2005.


5. Eva Tuba, Ivana Strumberger, Timea Bezdan, Nebojsa Bacanin, Milan Tuba,
Classification and Feature Selection Method for Medical Datasets by Brain Storm
Optimization Algorithm and Support Vector Machine, Procedia Computer Science, Volume
162,2019, Pages 307-315, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2019.11.289
6. M. Verleysen and D. François, "The curse of dimensionality in data mining and time
series prediction," in International Work-Conference on Artificial Neural Networks, pp. 758-
770: Springer, 2005.
7. S. N. Katole and S. P. Karmore, "A New Approach of Microarray Data Dimension
Reduction For Medical Applications," 2015 2nd International Conference on Electronics and
Communication Systems (ICECS), pp. 409-413, 2015 doi: 10.1109/ECS.2015.7124936.
8. D. L. Padmaja and B. Vishnuvardhan, "Comparative Study of Feature Subset
Selection Methods For Dimensionality Reduction On Scientific Data," in IEEE 6th
International Conference on Advanced Computing (IACC), pp. 31-34: IEEE, 2016
9. Girish Chandrashekar, Ferat Sahin, “A Survey On Feature Selection Methods”,
Computers & Electrical Engineering, Volume 40, Issue 1,Pages 16-28,ISSN 0045-7906, 2014
https://doi.org/10.1016/j.compeleceng.2013.11.024 .
10. T. Santhanam, M. Padmavathi, “Application of K-Means and Genetic Algorithms for Dimension Reduction by Integrating SVM for Diabetes Diagnosis”, Procedia Computer Science 47, pp. 76–83, 2015.
11. H.H.Inbarani,A.T.Azar,G.Jothi, “Supervised Hybrid Feature Selection Based On PSO
And Rough Sets For Medical Diagnosis”, Computer Methods And Programs In Biomedicine
113(1), pp-175–185, 2014.
12. S.M. Vieira, L.F. Mendonça, G.J. Farinha, J.M. Sousa, “Modified Binary PSO for Feature Selection Using SVM Applied to Mortality Prediction of Septic Patients”, Applied Soft Computing 13(8), pp. 3494–3504, 2013.
13. M.S. Uzer, N. Yilmaz, O. Inan, “Feature Selection Method Based on Artificial Bee Colony Algorithm and Support Vector Machines for Medical Datasets Classification”, The Scientific World Journal, Article ID 419187, pp. 1–10, 2013.
14. Zebari et al., “A Comprehensive Review of Dimensionality Reduction Techniques for Feature Selection and Feature Extraction”, Journal of Applied Science and Technology Trends, Vol. 01, No. 02, pp. 56–70, 2020.
15. M. K. Elhadad, K. M. Badran, and G. I. Salama, "A novel approach for ontology-
based dimensionality reduction for web text document classification," International Journal
of Software Innovation (IJSI), vol. 5, no. 4, pp. 44-58, 2017.
16. D. A. Zebari, H. Haron, S. R. Zeebaree, and D. Q. Zeebaree, "Enhance the
Mammogram Images for Both Segmentation and Feature Extraction Using Wavelet
Transform," in 2019 International Conference on Advanced Science and Engineering
(ICOASE), 2019, pp. 100-105: IEEE
17. L. J. Cao and W. K. Chong, "Feature extraction in support vector machine: a comparison of PCA, KPCA and ICA," Proceedings of the 9th International Conference on Neural Information Processing (ICONIP '02), 2002, pp. 1001-1005 vol. 2, doi: 10.1109/ICONIP.2002.1198211.


18. Youness Aliyari Ghassabeh, Frank Rudzicz, Hamid Abrishami Moghaddam, Fast
incremental LDA feature extraction, Pattern Recognition, Volume 48, Issue 6,2015,Pages
1999-2012,ISSN 0031-3203,https://doi.org/10.1016/j.patcog.2014.12.012.
19. Street W. N, Wolberg W. H. and Mangasarian O.L, https://ftp.cs.wisc.edu/math-
prog/cpo-dataset/machine-learn/cancer/WDBC/
20. O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August
1995.
21. W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine
learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative
Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.
22. W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized
breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery
1995;130:511-516.
23. W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived
nuclear features distinguish malignant from benign breast cytology.Human Pathology,
26:792--796, 1995.
24. Ferlay J, Ervik M, Lam F, Colombet M, Mery L, Piñeros M, et al. Global Cancer
Observatory: Cancer Today. Lyon: International Agency for Research on Cancer; 2020
(https://gco.iarc.fr/today, accessed February 2021).
25. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830,
2011
26. Underwood, T. (2020). Machine Learning and Human
Perspective. PMLA/Publications of the Modern Language Association of America, 135(1),
92-109. doi:10.1632/pmla.2020.135.1.92
27. Ting K.M. (2017) Confusion Matrix. In: Sammut C., Webb G.I. (eds) Encyclopedia of
Machine Learning and Data Mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-
4899-7687-1_50
28. Goutte C., Gaussier E. (2005) A Probabilistic Interpretation of Precision, Recall and
F-Score, with Implication for Evaluation. In: Losada D.E., Fernández-Luna J.M. (eds)
Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol
3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_25
29. Saito T, Rehmsmeier M (2015) The Precision-Recall Plot Is More Informative than
the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE
10(3): e0118432. https://doi.org/10.1371/journal.pone.0118432
30. Choney Zangmo and Montip Tiensuwan, Application of logistic regression models to
cancer patients: a case study of data from Jigme Dorji Wangchuck National Referral
Hospital (JDWNRH) in Bhutan, 2018 J. Phys.: Conf. Ser. 1039 012031
31. Breslow NE, Day NE, Heseltine E. Statistical methods in cancer research. Lyon:
International Agency for Research on Cancer; 1980.
32. Ayer, Turgay, Jagpreet Chhatwal, Oguzhan Alagoz, Charles E. Kahn Jr, Ryan W.
Woods, and Elizabeth S. Burnside. "Comparison of logistic regression and artificial neural
network models in breast cancer risk estimation." Radiographics 30, no. 1 (2010): 13-22.


33. Liu, Lei. "Research on logistic regression algorithm of breast cancer diagnose data
by machine learning." In 2018 International Conference on Robots & Intelligent System
(ICRIS), pp. 157-160. IEEE, 2018.
34. Andriy Burkov, The Hundred-Page Machine Learning Book, Publisher: Andriy
Burkov (2019)
35. Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995;20(3):273–97.
36. Quinlan, J.R. Induction of decision trees. Mach Learn 1, 81–106 (1986).
https://doi.org/10.1007/BF00116251
37. A. Navada, A. N. Ansari, S. Patil and B. A. Sonkamble, "Overview of use of decision
tree algorithms in machine learning," 2011 IEEE Control and System Graduate Research
Colloquium, 2011, pp. 37-42, doi: 10.1109/ICSGRC.2011.5991826.
38. A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R
News 2(3), 18--22.
39. Bartholomew, D. J. (2010). Principal components analysis, Int. Encycl. Educ., pp.
374–377, doi: 10.1016/B978-0-08-044894-7.01358-0
40. Andriy Burkov, The Hundred-Page Machine Learning Book, Publisher: Andriy
Burkov (2019)
41. Saraswathi, V. and Gupta, D. (2019), Classification of Brain Tumor using PCA-RF in
MR Neurological Images, 2019 11th Int. Conf. Commun. Syst. Networks, COMSNETS 2019,
vol. 2061, pp. 440–443, doi: 10.1109/COMSNETS.2019.8711010
42. A. A., Ripmiatin,E. and Effendi,Y. (2018). Dimensionality Reduction using PCA and
K-Means Clustering for Breast Cancer Prediction, Lontar Komput. J. Ilm.Teknol. Inf., vol. 9,
no. 3, p. 192, 2018 doi: 10.24843/lkjiti.2018.v09.i03.p08
43. Astuti, W.and Adiwijaya, “Support Vector Machine And Principal Component
Analysis For Microarray Data Classification”, J. Phys. Conf. Ser., vol. 971, no. 1, 2018
doi:10.1088/1742-6596/971/1/012003
44. Yang, W., Si, Y., Wang, D., & Guo, B., “Automatic Recognition Of Arrhythmia Based
On Principal Component Analysis Network And Linear Support Vector Machine”,
Computers In Biology And Medicine, 101, 22–32, 2018
https://doi.org/10.1016/j.compbiomed.2018.08.003
