CIA - 2
MBA
SUBMITTED TO:
Dr. Durgansh Sharma
SUBMITTED BY:
Jeffrey Williams (20221013)
INSTITUTE OF MANAGEMENT
CHRIST (DEEMED TO BE UNIVERSITY)
DELHI NCR.
8th November 2020
Introduction
Problem Identification
Breast cancer is the second most common cancer and the second leading cause of
cancer deaths among women in the United States. According to the American
Cancer Society, about 1 in 8 women in the United States will develop breast
cancer in her lifetime, and about 2.6% will die from it. One of the warning
signs of breast cancer is the development of a tumor in the breast. A tumor,
however, can be either benign or malignant.
Objective:
This project aims to predict whether an individual has breast cancer and to
determine which attributes are significant in distinguishing healthy controls
from patients. To achieve this, I performed Principal Component Analysis on a
dataset obtained from the UCI Machine Learning Repository and built Support
Vector Machine classifiers on the original variables and on the retained
components.
Business Problem:
The dataset used here is the Breast Cancer Coimbra data set from the UCI
Machine Learning Repository. It contains anthropometric data and parameters
that can be gathered from a routine blood analysis for 64 breast-cancer
patients and 52 healthy controls. Each observation has the following ten
attributes:
Age (years)
BMI (kg/m2)
Glucose (mg/dL)
Insulin (µU/mL)
HOMA
Leptin (ng/mL)
Adiponectin (µg/mL)
Resistin (ng/mL)
MCP-1 (pg/dL)
Classification (1=Healthy controls, 2=Patients (with cancer))
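Before the exploration below, the data must be read into a data frame df. A minimal sketch, assuming the UCI CSV has been saved locally as dataR2.csv (the file name is an assumption; the guard only keeps the snippet runnable when the file is absent):

```r
# Load the data into the data frame used throughout the analysis.
path <- "dataR2.csv"
if (file.exists(path)) {
  df <- read.csv(path)
  str(df)   # one row per participant, ten columns
}
```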
Data Exploration:
summary(df)
Age BMI Glucose Insulin
Min. :24.0 Min. :18.37 Min. : 60.00 Min. : 2.432
1st Qu.:45.0 1st Qu.:22.97 1st Qu.: 85.75 1st Qu.: 4.359
Median :56.0 Median :27.66 Median : 92.00 Median : 5.925
Mean :57.3 Mean :27.58 Mean : 97.79 Mean :10.012
3rd Qu.:71.0 3rd Qu.:31.24 3rd Qu.:102.00 3rd Qu.:11.189
Max. :89.0 Max. :38.58 Max. :201.00 Max. :58.460
HOMA Leptin Adiponectin Resistin
Min. : 0.4674 Min. : 4.311 Min. : 1.656 Min. : 3.210
Median : 1.3809 Median :20.271 Median : 8.353 Median :10.828
Mean : 2.6950 Mean :26.615 Mean :10.181 Mean :14.726
3rd Qu.: 2.8578 3rd Qu.:37.378 3rd Qu.:11.816 3rd Qu.:17.755
Max. :25.0503 Max. :90.280 Max. :38.040 Max. :82.100
MCP.1
Min. : 45.84
1st Qu.: 269.98
Median : 471.32
Mean : 534.65
3rd Qu.: 700.09
Max. :1698.44
To deepen our understanding of the data, a plot of variable pairs is very
useful. It shows the distribution of each variable, pairwise correlations, and
boxplots contrasting patients with and without breast cancer. The plot reveals
an age difference between the two groups, which may indicate that the control
group was not perfectly chosen. Moreover, all the indicators are higher for
patients with cancer.
library(GGally)
ggpairs(df)
To get a clearer view of the correlations, a correlation plot was produced.
The correlation between Glucose, HOMA, and Leptin is now clearly visible.
library(corrplot)
corr_df <- cor(df, method = 'pearson')
corrplot(corr_df)
Principal Component Analysis (PCA):
Kaiser’s Stopping Rule is one method for deciding which components should be
retained: components with an eigenvalue greater than 1 are kept. It is closely
related to the scree test, in which eigenvalues are plotted on the vertical
axis against component number on the horizontal axis. The components are
ordered from the largest eigenvalue to the smallest and, following the elbow
rule, we retain components up to the point where the line of eigenvalues
levels off. Another approach is to look at the percentage of variance
explained; retaining components that together explain 70-90% of the variance
is usually considered sufficient.
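Applying this can be sketched as follows. The synthetic stand-in only makes the snippet self-contained when the real data frame is not in scope; pca is the object the plots below are drawn from:

```r
# Stand-in data when the real data frame is not in scope.
if (!exists("df")) {
  set.seed(1)
  df <- as.data.frame(matrix(rnorm(9 * 116), ncol = 9))
  names(df) <- c("Age", "BMI", "Glucose", "Insulin", "HOMA",
                 "Leptin", "Adiponectin", "Resistin", "MCP.1")
  df$Classification <- sample(1:2, 116, replace = TRUE)
}

# PCA on the nine standardised measurements; the class label is excluded.
pca <- prcomp(df[, setdiff(names(df), "Classification")],
              center = TRUE, scale. = TRUE)

# Kaiser's rule: the eigenvalue of component k is pca$sdev[k]^2;
# retain the components whose eigenvalue exceeds 1.
eigenvalues <- pca$sdev^2
sum(eigenvalues > 1)
```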
library(factoextra)
fviz_eig(pca, choice = 'eigenvalue')
Based on Kaiser’s rule, 4 components should be retained, because their
eigenvalues are greater than 1. The scree plot gives the same result.
fviz_eig(pca)
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.7489 1.2338 1.0805 1.0515 0.85002 0.81073 0.66449
Proportion of Variance 0.3398 0.1691 0.1297 0.1229 0.08028 0.07303 0.04906
Cumulative Proportion 0.3398 0.5090 0.6387 0.7615 0.84184 0.91487 0.96393
PC8 PC9
Standard deviation 0.54095 0.17894
Proportion of Variance 0.03251 0.00356
Cumulative Proportion 0.99644 1.00000
Analysis of components:
Based on the plot alone, a few components are clearly interpretable. The first
consists of HOMA, Insulin, and Glucose. The second is probably mainly BMI and
Adiponectin, and the third is essentially Age. To determine which variables
contribute most to components 4 and 5, a plot of the contribution of variables
to the components is needed.
pca$rotation[,1:5]
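Such a contribution plot can be drawn with factoextra's fviz_contrib, where the axes argument selects the component. A sketch, guarded so it degrades gracefully when the package or the pca object is not in scope:

```r
# Stand-in PCA object when the real one is not in scope.
if (!exists("pca")) {
  set.seed(1)
  pca <- prcomp(matrix(rnorm(9 * 116), ncol = 9), scale. = TRUE)
}

# Bar plot of how much each variable contributes to component 4.
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(factoextra::fviz_contrib(pca, choice = "var", axes = 4))
}
```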
Advanced visualisations:
The results of PCA can also be shown on a biplot that distinguishes the two
classes (1 = healthy controls, 2 = patients with cancer).
In this case, the observations of group 2 are more spread out, and the
variance in this group is larger than in group 1.
Importantly, observations on the right side of the plot have higher values of
HOMA, Insulin, and Glucose (PC1 consists mainly of these variables), so these
variables may be good indicators of having cancer.
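A class-coloured biplot of this kind can be produced with factoextra's fviz_pca_biplot, where habillage colours the observations by class. The names below are assumptions; the stand-ins only make the snippet self-contained:

```r
# Stand-in data and PCA object when the real ones are not in scope.
if (!exists("df")) {
  set.seed(1)
  df <- as.data.frame(matrix(rnorm(9 * 116), ncol = 9))
  df$Classification <- sample(1:2, 116, replace = TRUE)
}
if (!exists("pca")) {
  pca <- prcomp(df[, setdiff(names(df), "Classification")], scale. = TRUE)
}

# Biplot with observations coloured by class (1 = healthy, 2 = patient).
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(factoextra::fviz_pca_biplot(pca,
                                    habillage = factor(df$Classification),
                                    label     = "var"))
}
```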
SVM on PCA:
As stated at the beginning, the main purpose of this dataset is to build a
model that predicts whether or not a patient has cancer; such a model could
also support faster detection of possible cancer. Because of that, a model
based on the whole dataset will be compared with models based on the first 4
and 5 principal components. The model used is the Support Vector Machine
(SVM), a supervised method for classification and regression. The algorithm
finds the hyperplane that separates the predicted classes in the best possible
way. In this case, two classes will be predicted (1 = healthy controls,
2 = patients with cancer).
The dataset will be split 80/20 into training and test sets.
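The split and fit can be sketched with the e1071 package. The seed and the 80/20 draw are illustrative, and the guards keep the snippet runnable where e1071 or the real data frame are missing:

```r
# Stand-in data when the real data frame is not in scope.
if (!exists("df")) {
  set.seed(1)
  df <- as.data.frame(matrix(rnorm(9 * 116), ncol = 9))
  names(df) <- c("Age", "BMI", "Glucose", "Insulin", "HOMA",
                 "Leptin", "Adiponectin", "Resistin", "MCP.1")
  df$Classification <- sample(1:2, 116, replace = TRUE)
}

if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(42)
  df$Classification <- factor(df$Classification)

  # 80/20 train/test split.
  idx   <- sample(nrow(df), size = round(0.8 * nrow(df)))
  train <- df[idx, ]
  test  <- df[-idx, ]

  # SVM fit (radial kernel by default) and evaluation on the held-out set.
  fit  <- e1071::svm(Classification ~ ., data = train)
  pred <- predict(fit, newdata = test)

  print(table(predictions = pred, actual = test$Classification))
  print(mean(pred == test$Classification))   # test accuracy
}
```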
            actual
predictions    1    2
          1    7    3
          2    2   12
print(conf_max$overall['Accuracy'])
Accuracy
0.7916667
The results on the test set are rather satisfying: of the 24 observations, 19
were classified correctly, which gives an accuracy of 79%.
Conclusions:
In this project, PCA was performed on a rather small dataset. The analysis
helps to build a better understanding of the data and the dependencies between
variables. The best predictive model was built on the original data, but that
is largely because the dataset is not high-dimensional. It is always worth
checking whether such an analysis can improve the model. This project covered
how to use PCA for a classification problem, so the same approach can be
applied to problems where the dataset contains many more variables.