CIA - 3
MBA
SUBMITTED TO:
Dr. Durgansh Sharma
SUBMITTED BY:
Jeffrey Williams (20221013)
INSTITUTE OF MANAGEMENT
CHRIST (DEEMED TO BE UNIVERSITY)
DELHI NCR.
4th December 2020
Introduction
Problem Identification
Breast cancer is the second most common cancer and the second leading cause of
cancer deaths in women in the United States. According to the American Cancer
Society, about 1 in 8 women in the United States will develop breast cancer in
her lifetime, and about 2.6% will die from it. One of the warning signs of
breast cancer is the development of a tumor in the breast; a tumor, however,
can be either benign or malignant.
Objective:
This project aims to predict whether an individual has breast cancer and to
determine which cytological attributes are significant in distinguishing benign
from malignant tumors. To achieve this, I applied four classification models,
namely Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting
Machine, as well as Principal Component Analysis, to a dataset obtained from
the UCI Machine Learning Repository.
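As a sketch of the PCA step, the snippet below runs prcomp() on standardised predictors. The data frame here is a random stand-in so the snippet runs on its own; with the real data, `mydata` would be the nine numeric columns loaded in the Data Exploration section.

```r
# PCA sketch on a random stand-in for the nine numeric predictors.
set.seed(1)
mydata <- as.data.frame(matrix(rnorm(116 * 9), nrow = 116))
names(mydata) <- c("Age", "BMI", "Glucose", "Insulin", "HOMA",
                   "Leptin", "Adiponectin", "Resistin", "MCP.1")

# Centre and scale each variable, then extract the components.
pca <- prcomp(mydata, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component:
summary(pca)

# Scores of the first observations on the first two components:
head(pca$x[, 1:2])
```

With standardised variables, the component variances sum to the number of predictors, which makes the "proportion of variance" row in summary() easy to read.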
Business Problem:
This dataset was created by Dr. William H. Wolberg of the University of
Wisconsin, who took digital scans of fine-needle aspirates from patients with
solid breast masses. He then used a graphical computer program called Xcyt to
compute ten cytological characteristics from each digitized image.
Clustering:
The two groups of observations show a very clear clustering pattern, which
suggests testing whether the clusters correspond to the true class labels. For
both the Gaussian Mixture Model and k-means, the clustering results are
projected onto the first two principal components.
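The projection described above can be sketched as follows; random stand-in data keep the snippet self-contained, and with the real data `x` would be the scaled predictor matrix.

```r
# Sketch: project k-means clusters onto the first two principal
# components (random stand-in data).
set.seed(123)
x   <- scale(matrix(rnorm(116 * 9), nrow = 116))
pca <- prcomp(x)
km  <- kmeans(x, centers = 2, nstart = 25)

# Observations in PC1/PC2 space, coloured by cluster; on the real
# data the two tumour groups should separate along these axes.
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "k-means clusters on the first two PCs")
```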
The dataset consists of 116 observations: 64 patients with breast cancer and
52 healthy controls. Its main purpose is to support a predictive model, but
this project focuses mainly on PCA, on checking whether it helps to get better
results, and on gaining insights about the data. The dataset consists of 10
variables:
Age (years)
BMI (kg/m2)
Glucose (mg/dL)
Insulin (µU/mL)
HOMA
Leptin (ng/mL)
Adiponectin (µg/mL)
Resistin (ng/mL)
MCP-1 (pg/dL)
Classification (1=Healthy controls, 2=Patients (with cancer))
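Before modelling, the 1/2 codes in Classification are easier to work with as a labelled factor. A minimal sketch (the commented read line assumes the UCI file dataR2.csv; a tiny stand-in column is used so the snippet runs on its own):

```r
# mydata <- readr::read_csv("dataR2.csv")   # assumed source file
mydata <- data.frame(Classification = c(1, 1, 2, 2, 1))  # stand-in

# Recode 1 = healthy control, 2 = cancer patient as a factor.
mydata$Classification <- factor(mydata$Classification,
                                levels = c(1, 2),
                                labels = c("Healthy", "Cancer"))
table(mydata$Classification)
```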
Data Exploration:
> View(dataR2)
> mydata = dataR2
> str(mydata)
Output:
spec_tbl_df [116 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Age : num [1:116] 48 83 82 68 86 49 89 76 73 75 ...
$ BMI : num [1:116] 23.5 20.7 23.1 21.4 21.1 ...
$ Glucose : num [1:116] 70 92 91 77 92 92 77 118 97 83 ...
$ Insulin : num [1:116] 2.71 3.12 4.5 3.23 3.55 ...
$ HOMA : num [1:116] 0.467 0.707 1.01 0.613 0.805 ...
$ Leptin : num [1:116] 8.81 8.84 17.94 9.88 6.7 ...
$ Adiponectin : num [1:116] 9.7 5.43 22.43 7.17 4.82 ...
$ Resistin : num [1:116] 8 4.06 9.28 12.77 10.58 ...
$ MCP.1 : num [1:116] 417 469 555 928 774 ...
$ Classification: num [1:116] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "spec")=
.. cols(
.. Age = col_double(),
.. BMI = col_double(),
.. Glucose = col_double(),
.. Insulin = col_double(),
.. HOMA = col_double(),
.. Leptin = col_double(),
.. Adiponectin = col_double(),
.. Resistin = col_double(),
.. MCP.1 = col_double(),
.. Classification = col_double()
.. )
- attr(*, "problems")=<externalptr>
> head(mydata)
# A tibble: 6 x 10
    Age   BMI Glucose Insulin  HOMA Leptin Adiponectin Resistin MCP.1 Classification
  <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl>       <dbl>    <dbl> <dbl>          <dbl>
1    48  23.5      70    2.71 0.467   8.81        9.70     8.00  417.             1
2    83  20.7      92    3.12 0.707   8.84        5.43     4.06  469.             1
3    82  23.1      91    4.50 1.01   17.9        22.4      9.28  555.             1
4    68  21.4      77    3.23 0.613   9.88        7.17    12.8   928.             1
5    86  21.1      92    3.55 0.805   6.70        4.82    10.6   774.             1
6    49  22.9      92    3.23 0.732   6.83       13.7     10.3   530.             1
Input:
> pairs(mydata[2:9])
> plot(Age ~ Glucose, data = mydata)
> # text() takes x/y coordinates, not a formula
> with(mydata, text(Glucose, Age, labels = Age, pos = 4))
>
> z <- mydata[,-1]   # drop the Age column
> means <- apply(z,2,mean)
> sds <- apply(z,2,sd)
> nor <- scale(z,center=means,scale=sds)
> distance = dist(nor)
> mydata.hclust = hclust(distance)
> plot(mydata.hclust)
> plot(mydata.hclust,labels=mydata$Age,main='Default from hclust')
> plot(mydata.hclust,hang=-1)
> member = cutree(mydata.hclust,3)
> table(member)
Output:
member
1 2 3
111 2 3
Input:
> aggregate(nor,list(member),mean)
Output:
  Group.1         BMI    Glucose    Insulin        HOMA      Leptin Adiponectin   Resistin       MCP.1 Classification
1       1 -0.01920753 -0.1247992 -0.0710255 -0.10946131 -0.06601427  0.03302801 -0.1388204 -0.06046165    -0.02239071
2       2  0.78125364  0.2089617  0.1898051  0.09411453  2.18293165 -0.87009709  4.1115657 -0.79470060    -0.10355701
3       3  0.18984296  4.4782615  2.5014066  3.98732534  0.98724024 -0.64197181  2.3953097  2.76688146     0.89749412
Input:
> aggregate(mydata[,-1], list(member), mean)
Output:
  Group.1      BMI   Glucose   Insulin      HOMA   Leptin Adiponectin Resistin     MCP.1 Classification
1       1 27.48569  94.98198  9.297018  2.296325 25.34871   10.406896 13.00589  513.7325       1.540541
2       2 31.50411 102.50000 11.923000  3.037757 68.49090    4.226503 65.67092  259.7500       1.500000
3       3 28.53515 198.66667 35.195667 17.216999 45.55360    5.787642 44.40540 1491.7463       2.000000
Input:
> library(cluster)
> plot(silhouette(cutree(mydata.hclust,3), distance))
> wss <- (nrow(nor)-1)*sum(apply(nor,2,var))
> for (i in 2:20) wss[i] <- sum(kmeans(nor, centers=i)$withinss)
> plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups
sum of squares")
> set.seed(123)
> kc<-kmeans(nor,4)
> library(mclust)
> mc <- Mclust(mydata)
Input:
> set.seed(1234)
> kc<-kmeans(nor,4)
> kc
Output:
K-means clustering with 4 clusters of sizes 36, 5, 37, 38
Cluster means:
         BMI     Glucose    Insulin       HOMA     Leptin Adiponectin   Resistin      MCP.1 Classification
1  0.9277127  0.22499318  0.3741834  0.2341775  1.0498497 -0.18822718  0.1677903  0.1031171     0.06328484
2  0.1303320  3.02803139  3.0966263  3.6621604  0.5503111 -0.32967666  1.3308653  1.5524237     0.89749412
3 -0.2834197 -0.52835188 -0.4914887 -0.4417129 -0.4759841  0.19908470 -0.4849215 -0.2011023    -1.10460814
4 -0.6200733 -0.09712873 -0.2833856 -0.2736266 -0.6035456  0.02785337  0.1380873 -0.1061461     0.89749412
Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1
[21] 3 3 1 3 3 1 1 1 1 3 1 1 1 1 3 3 3 1 3 3
[41] 3 3 3 3 1 3 1 3 3 3 1 3 4 4 4 4 4 4 4 4
[61] 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 1 4 2 1
[81] 4 1 1 4 4 1 4 2 2 1 2 1 1 1 1 1 4 4 1 4
[101] 4 4 1 4 1 1 4 4 1 1 1 1 4 1 4 1
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"
Input:
> kc$cluster
Output:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1
[21] 3 3 1 3 3 1 1 1 1 3 1 1 1 1 3 3 3 1 3 3
[41] 3 3 3 3 1 3 1 3 3 3 1 3 4 4 4 4 4 4 4 4
[61] 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 1 4 2 1
[81] 4 1 1 4 4 1 4 2 2 1 2 1 1 1 1 1 4 4 1 4
[101] 4 4 1 4 1 1 4 4 1 1 1 1 4 1 4 1
Input:
> cm=table(mydata$Age,kc$cluster)
> cm
Output:
1 2 3 4
24 0 0 1 0
25 0 0 1 0
28 1 0 0 0
29 1 0 1 0
32 1 0 0 0
34 0 0 2 1
35 1 0 1 0
36 0 0 2 0
38 0 0 1 1
40 0 1 0 1
41 1 0 0 0
42 0 0 0 2
43 1 0 0 2
44 0 1 1 2
45 2 0 0 5
46 0 0 0 3
47 0 0 1 0
48 1 1 1 1
49 2 0 1 2
50 1 0 0 0
51 0 0 1 3
52 1 0 0 0
53 1 0 0 0
54 1 0 1 1
55 1 0 0 0
57 1 0 0 0
58 1 0 0 0
59 0 0 0 2
60 1 0 1 0
61 1 0 0 0
62 0 0 0 2
64 0 0 1 1
65 3 0 0 0
66 2 0 2 1
67 0 0 1 0
68 1 0 1 1
69 1 0 2 2
71 2 0 0 1
72 0 0 0 3
73 1 0 1 0
74 0 0 0 1
75 2 0 2 0
76 1 0 3 0
77 0 0 2 0
78 0 0 1 0
81 1 0 0 0
82 1 0 1 0
83 0 0 1 0
85 0 1 1 0
86 1 1 1 0
89 0 0 1 0
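Since the stated aim is to check whether the clusters match the real diagnosis, a more direct check than tabulating by Age is to cross-tabulate cluster membership against Classification. A sketch with short stand-in vectors (on the real data these would be kc$cluster and mydata$Classification):

```r
# Stand-in vectors for illustration.
cluster        <- c(1, 1, 1, 2, 2, 2, 1, 2)
classification <- c(1, 1, 1, 2, 2, 2, 2, 2)  # 1 = healthy, 2 = cancer

cm <- table(cluster, classification)
cm

# Agreement rate, assuming cluster k corresponds to class k:
sum(diag(cm)) / sum(cm)
```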
Input:
> plot(mydata[c("Age", "Glucose")])
> plot(mydata[c("Age", "Glucose")],
+ col = kc$cluster)
> plot(mydata[c("Age", "Glucose")],
+ col = kc$cluster,
+ main = "K-means with 4 clusters")
> kc$centers
Output:
         BMI     Glucose    Insulin       HOMA     Leptin Adiponectin   Resistin      MCP.1 Classification
1  0.9277127  0.22499318  0.3741834  0.2341775  1.0498497 -0.18822718  0.1677903  0.1031171     0.06328484
2  0.1303320  3.02803139  3.0966263  3.6621604  0.5503111 -0.32967666  1.3308653  1.5524237     0.89749412
3 -0.2834197 -0.52835188 -0.4914887 -0.4417129 -0.4759841  0.19908470 -0.4849215 -0.2011023    -1.10460814
4 -0.6200733 -0.09712873 -0.2833856 -0.2736266 -0.6035456  0.02785337  0.1380873 -0.1061461     0.89749412
Input:
> y_kc = kc$cluster
> clusplot(mydata[, c("Age", "Glucose")],
+ y_kc,
+ lines = 0,
+ shade = TRUE,
+ color = TRUE,
+ labels = 2,
+ plotchar = FALSE,
+ span = TRUE,
+ main = paste("Cluster mydata"),
+ xlab = 'Age',
+ ylab = 'Glucose')
Output:
Cluster Analysis:
[Figures: dendrogram, silhouette plot, elbow plot, and cluster plot produced by the code above]
Scatter plot:
[Figure: scatter plot of Age vs. Glucose coloured by cluster]
Association Rule Mining:
Association Rule Mining (ARM) is a data mining technique that searches for
associations between itemsets for use in further applications. We use the
arules package for rule mining and the arulesViz package for rule visualization.
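Before running apriori(), it may help to recall the three measures it reports. A hand-computed toy example (the grocery items are purely illustrative and not from the dataset):

```r
# Five toy "transactions" illustrating support, confidence and lift
# for the rule {bread} => {milk}.
trans <- list(c("bread", "milk"),
              c("bread", "butter"),
              c("bread", "milk", "butter"),
              c("milk"),
              c("bread", "milk"))

has <- function(item) sapply(trans, function(t) item %in% t)

supp <- mean(has("bread") & has("milk"))  # P(bread and milk)
conf <- supp / mean(has("bread"))         # P(milk | bread)
lift <- conf / mean(has("milk"))          # confidence / P(milk)
round(c(support = supp, confidence = conf, lift = lift), 4)
```

A lift above 1 means the two itemsets occur together more often than independence would predict; apriori() computes exactly these quantities over all frequent itemsets.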
Input:
library(arules)
# apriori() expects discrete items, so the numeric columns are assumed
# to have been discretised (e.g. with discretizeDF()) beforehand
rules <- apriori(mydata)
rules <- apriori(mydata, parameter = list(minlen=2, maxlen=10, supp=.7, conf=.8))
rules <- apriori(mydata,
                 parameter = list(minlen=2, maxlen=5, supp=.1, conf=.5),
                 appearance = list(rhs=c("BMI=Yes"),
                                   lhs=c("Insulin=Yes", "Leptin=Yes", "Resistin=Yes"),
                                   default="none"))
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
redundant <- is.redundant(rules, measure="confidence")
which(redundant)
rules.pruned <- rules[!redundant]
rules.pruned <- sort(rules.pruned, by="lift")
inspect(rules.pruned)
library(arulesViz)
plot(rules)
plot(rules, method="grouped")
plot(rules, method="graph")
Output:
[Figures: scatter, grouped-matrix, and graph visualizations of the rules]
Conclusion:
Association rule mining can be extremely useful for recommendation and for
establishing behaviour patterns. It is a powerful technique for extracting
useful information and observations from a dataset that is hard to read or too
vast in its dimensions. The algorithm transforms the provided dataset into
readable, useful output that is comprehensible even to a beginner, which can
be a great advantage in a business environment, where explaining conclusions
and results to someone unfamiliar with how such algorithms work can be a
tedious procedure. Because association rules rest on ordinary probability
theory and its statistics, anyone can understand their implications.
The measure can be modified and applied in many ways, depending on the user's
interest. A deeper look into the output can establish additional rules for a
more detailed analysis.