
Machine Learning Algorithm

CIA - 2
MBA

SUBMITTED TO:
Dr. Durgansh Sharma

SUBMITTED BY:
Jeffrey Williams (20221013)

INSTITUTE OF MANAGEMENT
CHRIST (DEEMED TO BE UNIVERSITY)
DELHI NCR.
8th November 2020
Introduction
Problem Identification
Breast cancer is the second most common cancer and the second leading cause of
cancer deaths among women in the United States. According to the American
Cancer Society, about 1 in 8 women in the United States will develop breast
cancer in her lifetime, and about 2.6% will die from it.
One of the warning signs of breast cancer is the development of a tumor in
the breast. A tumor, however, can be either benign or malignant.

Objective:
This project aims to predict whether an individual has breast cancer and to
determine which cytological attributes are significant in distinguishing benign
from malignant tumors. To achieve this, I applied four different machine learning
classification models, namely Logistic Regression, Decision Tree,
Random Forest, and Gradient Boosting Machine, together with Principal Component
Analysis, on a dataset obtained from the UCI Machine Learning Repository.
Business Problem:
This dataset was created by Dr. William H. Wolberg from the University of
Wisconsin, who took a digital scan of the fine-needle aspirates from patients
with solid breast masses. Then, he used a graphical computer program called
Xcyt to calculate ten cytological characteristics present in each digitized image.

PCA on medical data (breast cancer predictions)


Data Analysis
Principal component analysis (PCA) is a statistical method for dimensionality
reduction. It is mostly used on high-dimensional datasets with many variables,
with the goal of creating a simpler representation of the data. After this step,
the data are represented by components that capture most of the variance of the
original dataset; typically, the first few components explain the largest share
of the variance.
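
As a minimal illustration of this idea (a sketch on toy data, not the project dataset), prcomp() on standardized variables returns components ordered by the variance they explain, and keeping only the first few columns of the score matrix gives the simpler representation:

set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100)            # two underlying signals
X <- cbind(x1, x2,
           x1 + 0.1 * rnorm(100),             # noisy copies of the signals
           x2 + 0.1 * rnorm(100),
           x1 - x2 + 0.1 * rnorm(100))
pca_toy <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca_toy)                              # the first two components carry most of the variance
X_reduced <- pca_toy$x[, 1:2]                 # simpler, two-column representation of the data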
The dataset consists of 116 observations: 64 patients with breast cancer and 52
healthy controls. Its main purpose is to build a predictive model, but this
project focuses on PCA, checking whether it can help obtain better results and
provide insights about the data. The dataset contains 10 variables:

• Age (years)
• BMI (kg/m²)
• Glucose (mg/dL)
• Insulin (µU/mL)
• HOMA
• Leptin (ng/mL)
• Adiponectin (µg/mL)
• Resistin (ng/mL)
• MCP-1 (pg/dL)
• Classification (1 = Healthy controls, 2 = Patients with cancer)

Data Exploration:
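
The code snippets below assume that the required R packages are loaded and that the dataset is already available as a data frame called data. A minimal setup sketch (the CSV file name is an assumption, not taken from the original source):

library(dplyr)        # select()
library(GGally)       # ggpairs()
library(corrplot)     # corrplot()
library(factoextra)   # fviz_eig(), fviz_pca_var(), fviz_contrib(), get_pca_ind()
library(gridExtra)    # grid.arrange()
library(ggbiplot)     # ggbiplot()
library(caret)        # train(), trainControl(), resamples(), confusionMatrix(); attaches lattice for bwplot()

data <- read.csv("dataR2.csv")   # assumed file name; kernlab must also be installed for method="svmPoly"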

df <- select(data, -Classification)

head(df, 3)

  Age      BMI Glucose Insulin      HOMA  Leptin Adiponectin Resistin
1  48 23.50000      70   2.707 0.4674087  8.8071    9.702400  7.99585
2  83 20.69049      92   3.115 0.7068973  8.8438    5.429285  4.06405
3  82 23.12467      91   4.498 1.0096511 17.9393   22.432040  9.27715
    MCP.1
1 417.114
2 468.786
3 554.697

summary(df)
Age BMI Glucose Insulin
Min. :24.0 Min. :18.37 Min. : 60.00 Min. : 2.432
1st Qu.:45.0 1st Qu.:22.97 1st Qu.: 85.75 1st Qu.: 4.359
Median :56.0 Median :27.66 Median : 92.00 Median : 5.925
Mean :57.3 Mean :27.58 Mean : 97.79 Mean :10.012
3rd Qu.:71.0 3rd Qu.:31.24 3rd Qu.:102.00 3rd Qu.:11.189
Max. :89.0 Max. :38.58 Max. :201.00 Max. :58.460
HOMA Leptin Adiponectin Resistin
Min. : 0.4674 Min. : 4.311 Min. : 1.656 Min. : 3.210
Median : 1.3809 Median :20.271 Median : 8.353 Median :10.828
Mean : 2.6950 Mean :26.615 Mean :10.181 Mean :14.726
3rd Qu.: 2.8578 3rd Qu.:37.378 3rd Qu.:11.816 3rd Qu.:17.755
Max. :25.0503 Max. :90.280 Max. :38.040 Max. :82.100
MCP.1
Min. : 45.84
1st Qu.: 269.98
Median : 471.32
Mean : 534.65
3rd Qu.: 700.09
Max. :1698.44

To deepen our understanding of the data, a pairs plot is very useful. It shows
the distributions and correlations of the variables, as well as boxplots
comparing patients with and without breast cancer. The plot reveals an age
difference between the two groups, which may indicate that the control group was
not perfectly chosen. Moreover, all the indicators are higher for patients with
cancer.

ggpairs(data)
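
Since the pairs plot itself is not reproduced here, a quick group-wise summary (a sketch using base R's aggregate(), assuming data still contains the Classification column) makes it easy to compare the group means of each indicator numerically:

aggregate(df, by = list(Group = data$Classification), FUN = mean)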
To get a better view of the correlations, a correlation plot was produced. The
correlation between Glucose, HOMA, and Leptin is now clearly visible.

corr_df = cor(df,method='pearson')
corrplot(corr_df)
Principal Component Analysis (PCA):
Kaiser’s Stopping Rule is a method to decide how many components should be
retained: components with an eigenvalue higher than 1 are kept. It is closely
related to the scree test, in which eigenvalues are plotted on the vertical axis
and components on the horizontal axis. The components are ordered from the
largest to the smallest eigenvalue, and based on the elbow rule we choose the
number of components: where the line of eigenvalues levels off, we stop adding
components. Another approach is to look at the percentage of variance explained;
a common guideline is to retain enough components to explain 70–90% of the variance.
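
Because each eigenvalue of the correlation matrix equals the variance captured by the corresponding component, the share of explained variance can be derived directly from the eigenvalues. A minimal sketch (assuming df holds the nine numeric predictors, as above):

eig <- eigen(cor(df))$values
prop_var <- eig / sum(eig)      # proportion of variance per component
cum_var <- cumsum(prop_var)     # cumulative proportion of variance
round(rbind(prop_var, cum_var), 3)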

pca <- prcomp(df, center=TRUE, scale=TRUE)

eigen(cor(df))$values

[1] 3.05853968 1.52220124 1.16754381 1.10569811 0.72254235 0.65727917
[7] 0.44155050 0.29262393 0.03202122

fviz_eig(pca, choice='eigenvalue')
Based on Kaiser’s rule, 4 components should be chosen, because their
eigenvalues are higher than 1. The scree plot gives the same result.

fviz_eig(pca)
summary(pca)

Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.7489 1.2338 1.0805 1.0515 0.85002 0.81073 0.66449
Proportion of Variance 0.3398 0.1691 0.1297 0.1229 0.08028 0.07303 0.04906
Cumulative Proportion 0.3398 0.5090 0.6387 0.7615 0.84184 0.91487 0.96393
PC8 PC9
Standard deviation 0.54095 0.17894
Proportion of Variance 0.03251 0.00356
Cumulative Proportion 0.99644 1.00000

Looking at the plot of components and the variance they explain, in my opinion
we should choose 4 or 5 components: the fifth component does not explain much
variance on its own, but 4 components explain only 76% of the variance.
Therefore, the fifth component will also be taken into consideration in the
analysis.

Analysis of components:

fviz_pca_var(pca, col.var = "steelblue")

Based only on the plot, a few components are clearly interpretable. The first
consists of HOMA, Insulin and Glucose. The second is probably mainly BMI and
Adiponectin, and the third is basically Age. To determine which variables
dominate components 4 and 5, a plot of the contribution of the variables to
each component is needed.
pca$rotation[,1:5]

PC1 PC2 PC3 PC4 PC5
Age 0.1245739 0.06626166 -0.20670086 0.821387749 0.2529717
BMI 0.2604297 0.49933879 0.42573480 -0.070921590 -0.2324569
Glucose 0.4390227 -0.18594179 -0.13088534 0.125615405 0.1995197
Insulin 0.4439787 -0.38631334 0.09371342 -0.059771821 -0.2976113
HOMA 0.4928517 -0.37472788 -0.01219108 -0.005644021 -0.1391521
Leptin 0.3314935 0.23364275 0.58320150 0.058353947 0.2878059
Adiponectin -0.1726096 -0.48059810 0.28212343 -0.276866882 0.5292842
Resistin 0.2817413 0.30361887 -0.28892346 -0.302703643 0.5975961
MCP.1 0.2546307 0.21044940 -0.49675794 -0.359469872 -0.1188700
var <- get_pca_var(pca)
a<-fviz_contrib(pca, "var",axes = 1)
b<-fviz_contrib(pca, "var",axes = 2)
c<-fviz_contrib(pca, "var",axes = 3)
d<-fviz_contrib(pca, "var",axes = 4)
e<-fviz_contrib(pca, "var",axes = 5)
grid.arrange(a,b,c,d,e,top='Contribution to the Principal Components')

As mentioned before, the components consist of:

• HOMA, Insulin, Glucose
• BMI and Adiponectin, but also Insulin and HOMA
• Age

And the fourth and fifth components:

• Leptin, MCP.1, BMI
• Resistin, Adiponectin

Advanced visualisations:

The results of PCA can also be shown on a biplot with a distinction between the
classes (1 = Healthy controls, 2 = Patients with cancer).

ggbiplot(pca, obs.scale=1, var.scale=1,
         groups=as.factor(data$Classification), ellipse=TRUE, circle=TRUE)

In this case, the observations of group 2 are more spread out, and the variance
in this group is larger than in group 1. Importantly, observations on the right
side of the plot have higher values of HOMA, Insulin, and Glucose (PC1 consists
mainly of these variables), so those variables may be good indicators of having
cancer.

SVM on PCA:
As it was said at the beggining the main purpose of this dataset is to build a
model to predict whether a patient does or does not have cancer. It could be also
used to a faster detection of possible cancer. Beacuse of that, model based on
the whole dataset will be compared with models based on 4 and 5 components.
The model that will be used is Support vector machine (SVM). It is supervised
method for classification and regression. The algorithm divides the space into
number of classes that we predict, so the algortihm finds the hiperplane that
differentiates two classes in the best possible way. In this case, the two classes
will be predicted (1=Healthy controls, 2=Patients (with cancer)).
Dataset will be split in 80/20 proportions for train/test datasets.

smp_size <- floor(0.8 * nrow(data))
set.seed(123)

# Model data based on all original (scaled) variables
data0 <- as.data.frame(scale(df))
data0$Classification <- as.factor(data$Classification)  # factor target so caret treats this as classification
train_ind <- sample(seq_len(nrow(data0)), size = smp_size)
data_train <- data0[train_ind, ]
data_test <- data0[-train_ind, ]

# Model data based on the first 4 principal components
data1 <- as.data.frame(get_pca_ind(pca)$coord)
data1 <- data1[c(1:4)]
data1$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data1)), size = smp_size)
data_train_4PC <- data1[train_ind, ]
data_test_4PC <- data1[-train_ind, ]

# Model data based on the first 5 principal components
data2 <- as.data.frame(get_pca_ind(pca)$coord)
data2 <- data2[c(1:5)]
data2$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data2)), size = smp_size)
data_train_5PC <- data2[train_ind, ]
data_test_5PC <- data2[-train_ind, ]
The models will be trained on the training set with k-fold cross-validation,
where k = 3. The results will be measured by the accuracy of each model, and the
best model will be used to predict on the test set.

Training the model and Summarizing the results:

train_control <- trainControl(method="repeatedcv", number=3)

model <- train(Classification~., data=data_train, trControl=train_control,
               method="svmPoly")
model_4PC <- train(Classification~., data=data_train_4PC,
                   trControl=train_control, method="svmPoly")
model_5PC <- train(Classification~., data=data_train_5PC,
                   trControl=train_control, method="svmPoly")
results <- resamples(list(SVM_data=model, SVM_data_4PC=model_4PC,
                          SVM_data_5PC=model_5PC))
bwplot(results, metric="Accuracy")
The model based on all the variables works better than the models based on 4
and 5 components, so the prediction on the test set will be carried out with
this model.

predictions <- predict(model, data_test[1:9])

conf_max <- confusionMatrix(table(predictions, data_test$Classification))
print(conf_max$table)

predictions  1  2
          1  7  3
          2  2 12

print(conf_max$overall['Accuracy'])

Accuracy
0.7916667

The results on the test set are rather satisfying: out of 24 observations, 19
were classified correctly, which gives an accuracy of 79%.
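
In a medical context, accuracy alone can hide which type of error dominates, so it is worth inspecting sensitivity and specificity as well. These class-wise statistics are already stored in the caret confusionMatrix object; a short sketch, assuming conf_max from above:

print(conf_max$byClass[c('Sensitivity', 'Specificity')])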

Conclusions:
In this project PCA was done on rather small dataset. The analysis helps to get
better understanding of the data and dependencies between variables. The
predictive model were built on original data but it is rather the fact of not high
dimensional dataset. It is always good to check, whether such analysis can
improve the model. In this project the way of ‘how to use PCA on
classification’ problem was covered, so this can be used on problems, in which
dataset consists of more variables.
