
Machine Learning Algorithm

CIA - 3
MBA

SUBMITTED TO:
Dr. Durgansh Sharma

SUBMITTED BY:
Jeffrey Williams (20221013)

INSTITUTE OF MANAGEMENT
CHRIST (DEEMED TO BE UNIVERSITY)
DELHI NCR.
4th December 2020
Introduction
Problem Identification
Breast cancer is the second most common cancer and the second leading cause of
cancer deaths among women in the United States. According to the American
Cancer Society, on average 1 in 8 women in the United States will develop
breast cancer in her lifetime and about 2.6% will die from it. One of the
warning signs of breast cancer is the development of a tumor in the breast.
A tumor, however, can be either benign or malignant.

Objective:
This project aims to predict whether an individual has breast cancer and to
determine which cytological attributes are significant in distinguishing benign
from malignant tumors. To achieve this, I applied four different classification
models, namely Logistic Regression, Decision Tree, Random Forest, and Gradient
Boosting Machine, together with Principal Component Analysis, on a dataset
obtained from the UCI Machine Learning Repository.
Business Problem:
This dataset was created by Dr. William H. Wolberg from the University of
Wisconsin, who took digital scans of fine-needle aspirates from patients with
solid breast masses. He then used a graphical computer program called Xcyt to
calculate ten cytological characteristics present in each digitized image.

Clustering:
Two groups of observations show a very clear clustering pattern, which prompts
us to test whether the clustering results correspond to the real classification.
For the Gaussian Mixture Model and k-means, the clustering results are projected
onto the first two principal components.
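The projection itself is not reproduced in the transcript; a minimal sketch of how it could be drawn with prcomp(), assuming the scaled feature matrix nor and the k-means fit kc that are built later in this report:

pca <- prcomp(nor)                      # PCA on the standardised features
proj <- pca$x[, 1:2]                    # scores on the first two components
plot(proj, col = kc$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Clusters projected onto the first two components")
legend("topright", legend = sort(unique(kc$cluster)),
       col = sort(unique(kc$cluster)), pch = 19, title = "Cluster")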

The dataset consists of 116 observations, of which 64 patients have breast
cancer and 52 form a healthy control group. The main purpose of this data is to
build a predictive model, but this project is mainly focused on PCA, on checking
whether it can help to get better results, and on gaining insights about the
data. The dataset consists of 10 variables:
• Age (years)
• BMI (kg/m²)
• Glucose (mg/dL)
• Insulin (µU/mL)
• HOMA
• Leptin (ng/mL)
• Adiponectin (µg/mL)
• Resistin (ng/mL)
• MCP-1 (pg/dL)
• Classification (1 = healthy controls, 2 = patients with breast cancer)

Data Exploration:
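The transcript below assumes the data file has already been read into the object dataR2; a minimal sketch of that step with readr (the file name dataR2.csv is an assumption based on the object name):

library(readr)
dataR2 <- read_csv("dataR2.csv")   # assumed file name for the UCI download
table(dataR2$Classification)       # 1 = healthy controls, 2 = patients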
> View(dataR2)
> mydata = dataR2
> str(mydata)

Output:
spec_tbl_df [116 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Age : num [1:116] 48 83 82 68 86 49 89 76 73 75 ...
$ BMI : num [1:116] 23.5 20.7 23.1 21.4 21.1 ...
$ Glucose : num [1:116] 70 92 91 77 92 92 77 118 97 83 ...
$ Insulin : num [1:116] 2.71 3.12 4.5 3.23 3.55 ...
$ HOMA : num [1:116] 0.467 0.707 1.01 0.613 0.805 ...
$ Leptin : num [1:116] 8.81 8.84 17.94 9.88 6.7 ...
$ Adiponectin : num [1:116] 9.7 5.43 22.43 7.17 4.82 ...
$ Resistin : num [1:116] 8 4.06 9.28 12.77 10.58 ...
$ MCP.1 : num [1:116] 417 469 555 928 774 ...
$ Classification: num [1:116] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "spec")=
.. cols(
.. Age = col_double(),
.. BMI = col_double(),
.. Glucose = col_double(),
.. Insulin = col_double(),
.. HOMA = col_double(),
.. Leptin = col_double(),
.. Adiponectin = col_double(),
.. Resistin = col_double(),
.. MCP.1 = col_double(),
.. Classification = col_double()
.. )
- attr(*, "problems")=<externalptr>
> head(mydata)
# A tibble: 6 x 10
    Age   BMI Glucose Insulin  HOMA Leptin Adiponectin Resistin MCP.1 Classification
  <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl>       <dbl>    <dbl> <dbl>          <dbl>
1    48  23.5      70    2.71 0.467   8.81        9.70     8.00  417.              1
2    83  20.7      92    3.12 0.707   8.84        5.43     4.06  469.              1
3    82  23.1      91    4.50 1.01   17.9        22.4      9.28  555.              1
4    68  21.4      77    3.23 0.613   9.88        7.17    12.8   928.              1
5    86  21.1      92    3.55 0.805   6.70        4.82    10.6   774.              1
6    49  22.9      92    3.23 0.732   6.83       13.7     10.3   530.              1

Input:

> pairs(mydata[2:9])
> plot(mydata$Age~mydata$Glucose, data = mydata)
> with(mydata,text(mydata$Age~mydata$Glucose, labels=mydata$Age,pos=4))
>
> z <- mydata[,-1]                                # drop the Age column
> means <- apply(z,2,mean)
> sds <- apply(z,2,sd)
> nor <- scale(z,center=means,scale=sds)          # standardise each variable
> distance = dist(nor)                            # Euclidean distance matrix
> mydata.hclust = hclust(distance)                # complete linkage by default
> plot(mydata.hclust)
> plot(mydata.hclust,labels=mydata$Age,main='Default from hclust')
> plot(mydata.hclust,hang=-1)
> member = cutree(mydata.hclust,3)                # cut the dendrogram into 3 clusters
> table(member)

Output:

member
1 2 3
111 2 3
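A quick way to check how well these three clusters line up with the true labels is to cross-tabulate them against Classification; a minimal sketch, not part of the original transcript:

table(member, mydata$Classification)   # hierarchical clusters vs. 1 = healthy, 2 = cancer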

Input:
> aggregate(nor,list(member),mean)

Output:
  Group.1         BMI    Glucose    Insulin        HOMA      Leptin Adiponectin   Resistin       MCP.1 Classification
1       1 -0.01920753 -0.1247992 -0.0710255 -0.10946131 -0.06601427  0.03302801 -0.1388204 -0.06046165    -0.02239071
2       2  0.78125364  0.2089617  0.1898051  0.09411453  2.18293165 -0.87009709  4.1115657 -0.79470060    -0.10355701
3       3  0.18984296  4.4782615  2.5014066  3.98732534  0.98724024 -0.64197181  2.3953097  2.76688146     0.89749412

Input:
> aggregate(mydata[,-c(1,1)],list(member),mean)

Output:
  Group.1      BMI   Glucose   Insulin      HOMA   Leptin Adiponectin Resistin     MCP.1 Classification
1       1 27.48569  94.98198  9.297018  2.296325 25.34871   10.406896 13.00589  513.7325       1.540541
2       2 31.50411 102.50000 11.923000  3.037757 68.49090    4.226503 65.67092  259.7500       1.500000
3       3 28.53515 198.66667 35.195667 17.216999 45.55360    5.787642 44.40540 1491.7463       2.000000

Input:
> library(cluster)
> plot(silhouette(cutree(mydata.hclust,3), distance))    # silhouette plot for the 3-cluster solution
> wss <- (nrow(nor)-1)*sum(apply(nor,2,var))             # total within-SS for k = 1
> for (i in 2:20) wss[i] <- sum(kmeans(nor, centers=i)$withinss)
> plot(1:20, wss, type="b", xlab="Number of Clusters",
+      ylab="Within groups sum of squares")              # elbow plot
> set.seed(123)
> kc <- kmeans(nor,4)
> library(mclust)
> mc <- Mclust(mydata)    # model-based (Gaussian mixture) clustering; the function is Mclust(), not mclust()
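The mixture-model output is not shown in the transcript; a minimal sketch of how the fit could be inspected and compared with the true labels, assuming mc was fitted as above:

summary(mc)                                        # selected model and number of components
table(GMM = mc$classification, diagnosis = mydata$Classification)
plot(mc, what = "classification")                  # component assignments in feature space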

Input:
> set.seed(1234)
> kc<-kmeans(nor,4)
> kc

Output:
K-means clustering with 4 clusters of sizes 36, 5, 37, 38

Cluster means:
BMI Glucose Insulin
1 0.9277127 0.22499318 0.3741834
2 0.1303320 3.02803139 3.0966263
3 -0.2834197 -0.52835188 -0.4914887
4 -0.6200733 -0.09712873 -0.2833856
HOMA Leptin Adiponectin
1 0.2341775 1.0498497 -0.18822718
2 3.6621604 0.5503111 -0.32967666
3 -0.4417129 -0.4759841 0.19908470
4 -0.2736266 -0.6035456 0.02785337
Resistin MCP.1 Classification
1 0.1677903 0.1031171 0.06328484
2 1.3308653 1.5524237 0.89749412
3 -0.4849215 -0.2011023 -1.10460814
4 0.1380873 -0.1061461 0.89749412

Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1
[21] 3 3 1 3 3 1 1 1 1 3 1 1 1 1 3 3 3 1 3 3
[41] 3 3 3 3 1 3 1 3 3 3 1 3 4 4 4 4 4 4 4 4
[61] 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 1 4 2 1
[81] 4 1 1 4 4 1 4 2 2 1 2 1 1 1 1 1 4 4 1 4
[101] 4 4 1 4 1 1 4 4 1 1 1 1 4 1 4 1

Within cluster sum of squares by cluster:
[1] 245.22004  80.03784 129.19497 149.15961
 (between_SS / total_SS = 41.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

Input:
> kc$cluster

Output:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1
[21] 3 3 1 3 3 1 1 1 1 3 1 1 1 1 3 3 3 1 3 3
[41] 3 3 3 3 1 3 1 3 3 3 1 3 4 4 4 4 4 4 4 4
[61] 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 1 4 2 1
[81] 4 1 1 4 4 1 4 2 2 1 2 1 1 1 1 1 4 4 1 4
[101] 4 4 1 4 1 1 4 4 1 1 1 1 4 1 4 1

Input:
> cm=table(mydata$Age,kc$cluster)
> cm

Output:
1 2 3 4
24 0 0 1 0
25 0 0 1 0
28 1 0 0 0
29 1 0 1 0
32 1 0 0 0
34 0 0 2 1
35 1 0 1 0
36 0 0 2 0
38 0 0 1 1
40 0 1 0 1
41 1 0 0 0
42 0 0 0 2
43 1 0 0 2
44 0 1 1 2
45 2 0 0 5
46 0 0 0 3
47 0 0 1 0
48 1 1 1 1
49 2 0 1 2
50 1 0 0 0
51 0 0 1 3
52 1 0 0 0
53 1 0 0 0
54 1 0 1 1
55 1 0 0 0
57 1 0 0 0
58 1 0 0 0
59 0 0 0 2
60 1 0 1 0
61 1 0 0 0
62 0 0 0 2
64 0 0 1 1
65 3 0 0 0
66 2 0 2 1
67 0 0 1 0
68 1 0 1 1
69 1 0 2 2
71 2 0 0 1
72 0 0 0 3
73 1 0 1 0
74 0 0 0 1
75 2 0 2 0
76 1 0 3 0
77 0 0 2 0
78 0 0 1 0
81 1 0 0 0
82 1 0 1 0
83 0 0 1 0
85 0 1 1 0
86 1 1 1 0
89 0 0 1 0
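Tabulating the clusters against Age (above) mainly shows how the ages spread across the clusters; a more direct check would cross-tabulate the cluster vector against the diagnosis. A minimal sketch, not part of the original transcript:

table(kc$cluster, mydata$Classification)   # k-means clusters vs. 1 = healthy, 2 = cancer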

Input:
> plot(mydata[c("Age", "Glucose")])
> plot(mydata[c("Age", "Glucose")],
+ col = kc$cluster)
> plot(mydata[c("Age", "Glucose")],
+ col = kc$cluster,
+ main = "K-means with 4 clusters")    # kc was fitted with 4 clusters
> kc$centers

Output:
BMI Glucose Insulin
1 0.9277127 0.22499318 0.3741834
2 0.1303320 3.02803139 3.0966263
3 -0.2834197 -0.52835188 -0.4914887
4 -0.6200733 -0.09712873 -0.2833856
HOMA Leptin Adiponectin
1 0.2341775 1.0498497 -0.18822718
2 3.6621604 0.5503111 -0.32967666
3 -0.4417129 -0.4759841 0.19908470
4 -0.2736266 -0.6035456 0.02785337
Resistin MCP.1 Classification
1 0.1677903 0.1031171 0.06328484
2 1.3308653 1.5524237 0.89749412
3 -0.4849215 -0.2011023 -1.10460814
4 0.1380873 -0.1061461 0.89749412

Input:
> y_kc = kc$cluster
> clusplot(mydata[, c("Age", "Glucose")],
+ y_kc,
+ lines = 0,
+ shade = TRUE,
+ color = TRUE,
+ labels = 2,
+ plotchar = FALSE,
+ span = TRUE,
+ main = paste("Cluster mydata"),
+ xlab = 'Age',
+ ylab = 'Glucose')
Output:
[Cluster analysis plots 01-05 and scatter plot omitted]

Hierarchical agglomerative clustering:

The goal of this exercise is to perform hierarchical clustering of the observations.
This type of clustering does not assume in advance the number of natural groups
that exist in the data.
As part of the preparation for hierarchical clustering, the distances between all
pairs of observations are computed. Furthermore, there are different ways to link
clusters together, with single, complete, and average being the most common
linkage methods.
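Since the linkage choice changes how the distance between two clusters is measured, a minimal sketch comparing the three common linkages on the distance matrix computed earlier:

hc_single   <- hclust(distance, method = "single")     # nearest-neighbour linkage
hc_complete <- hclust(distance, method = "complete")   # farthest-neighbour linkage (the default)
hc_average  <- hclust(distance, method = "average")    # mean pairwise distance
par(mfrow = c(1, 3))
plot(hc_single,   main = "Single linkage",   labels = FALSE, hang = -1)
plot(hc_complete, main = "Complete linkage", labels = FALSE, hang = -1)
plot(hc_average,  main = "Average linkage",  labels = FALSE, hang = -1)
par(mfrow = c(1, 1))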
There are two main types of clustering used here: hierarchical and k-means.
We created a k-means clustering model on the breast cancer data and compared its
results to the actual diagnoses and to the hierarchical clustering model. Take
some time to see how each clustering model performs in terms of separating the
two diagnoses and how the clustering models compare to each other.
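A minimal sketch of that comparison, assuming the objects built earlier (member from the hierarchical model and kc from k-means):

table(hclust = member, kmeans = kc$cluster)   # how the two partitions overlap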

Association Rule Mining:
Association Rule Mining (ARM) is a data mining technique that focuses on finding
associations between itemsets for further applications. We will use the arules
package for ARM and arulesViz for visualizing the rules.
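Because apriori() works on transactions of discrete items rather than raw numeric columns, the continuous variables first have to be discretized; a minimal sketch of that preparation step (the resulting item labels, such as "Glucose=[70,92)", depend on the binning chosen):

library(arules)
mydata_disc <- discretizeDF(as.data.frame(mydata))   # bin the numeric columns into factors
trans <- as(mydata_disc, "transactions")             # coerce to a transactions object
summary(trans)                                       # item labels and their frequencies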

Input:
library(arules)
# Note: apriori() needs factor/transaction data, so the numeric columns have to be
# discretized first (see the sketch above); item labels such as "BMI=Yes" depend on
# how the variables are recoded.
rules <- apriori(mydata)
rules <- apriori(mydata, parameter = list(minlen=2, maxlen=10, supp=.7, conf=.8))
rules <- apriori(mydata, parameter = list(minlen=2, maxlen=5, supp=.1, conf=.5),
                 appearance = list(rhs=c("BMI=Yes"),
                                   lhs=c("Insulin=Yes", "Leptin=Yes", "Resistin=Yes"),
                                   default="none"))
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
redundant <- is.redundant(rules, measure="confidence")
which(redundant)
rules.pruned <- rules[!redundant]
rules.pruned <- sort(rules.pruned, by="lift")
inspect(rules.pruned)
library(arulesViz)
plot(rules)                        # scatter plot of the rules (support vs. confidence)
plot(rules, method="grouped")      # grouped matrix plot
plot(rules, method="graph")        # graph-based visualization
Output: [rule listing and association-rule plots omitted]

The eclat() function takes a transactions object and returns the most frequent
itemsets in the data, based on the support you provide to the supp argument. The
maxlen parameter defines the maximum number of items in each frequent itemset.
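A minimal sketch of an eclat() call on the transactions object built above (the supp and maxlen values are illustrative):

itemsets <- eclat(trans, parameter = list(supp = 0.1, maxlen = 5))
inspect(head(sort(itemsets, by = "support"), 10))   # ten most frequent itemsets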
Generating Rules:
We set the support and confidence parameters for rule mining and use lift for
interestingness evaluation. Support indicates how frequently an itemset appears
in the dataset; for example, an item that appears in only 1 of 2 transactions
has a support of 1/2. Confidence is the proportion of transactions containing
the left-hand side of a rule that also contain its right-hand side.
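A small worked illustration of these measures with hypothetical counts (the numbers are for arithmetic only, not taken from this dataset):

n_total <- 100   # total number of transactions (hypothetical)
n_lhs   <- 40    # transactions containing the left-hand side A
n_rhs   <- 50    # transactions containing the right-hand side B
n_both  <- 25    # transactions containing both A and B
support    <- n_both / n_total               # 0.25
confidence <- n_both / n_lhs                 # 0.625
lift       <- confidence / (n_rhs / n_total) # 1.25
c(support = support, confidence = confidence, lift = lift)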
Visualizing Association Rules:
Package arulesViz supports visualization of association rules with scatter plot,
balloon plot, graph, parallel coordinates plot, etc.
Comment: We can now see that the rules with the highest lift establish a
sequel/prequel or same-universe connection between the inputs. That was rather
predictable, as series are very often watched as a whole. However, we also
looked at the first 150 observations to look for some other associations.

Conclusion:
Implementing association rule mining can be extremely useful in recommendation
and in establishing behaviour patterns. It is a great technique for extracting
useful information and insights from a dataset that is hard to read or too vast
in its dimensions. The algorithm transforms the provided dataset into readable
and useful outputs, comprehensible even to a beginner, which can be a major
advantage in the business environment, where presenting the conclusions and
outputs to someone who is not familiar with the workings of such algorithms can
sometimes be a tedious procedure. Association rules use traditional probability
theory and its statistics, so anyone can understand their implications.
The measure can be modified and implemented in many ways, depending on the
user’s interest. A deeper look into the outputs can establish additional rules
for a more detailed analysis.
