Machine learning
  Unsupervised learning
    Clustering
      K-means
      K-medoids
      Hierarchical
    Association analysis
  Supervised learning
    Classification
      Decision tree
      K-Nearest neighbor
      Naive Bayesian
      Support vector machines
      Neural network

Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to
one another and different from (or unrelated to) the objects in other groups.

Similarity
Numerical measure of how alike two data objects are. It is higher when
objects are more alike.
Standardization is necessary if scales differ.

Euclidean Distance

       P1      P2      P3      P4
P1             2.84    3.16    5.09
P2     2.84            1.41    3.16
P3     3.16    1.41
P4     5.09    3.16
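A distance matrix of this kind can be computed in R with dist(). The sketch below is only an illustration: the coordinates of the four points are hypothetical (the slides do not give them), so the printed values are not meant to reproduce the table exactly.

```r
# Four hypothetical 2-D points (coordinates are an assumption, not from the slides)
pts <- matrix(c(0, 2,
                2, 0,
                3, 1,
                5, 1),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("P1", "P2", "P3", "P4"), c("x", "y")))

# Euclidean distance between P1 and P2, written out by hand
d12 <- sqrt(sum((pts["P1", ] - pts["P2", ])^2))

# Full pairwise Euclidean distance matrix
dmat <- as.matrix(dist(pts, method = "euclidean"))
print(round(dmat, 2))

# When the columns are on different scales, standardize them first
dmat_z <- as.matrix(dist(scale(pts)))
```

dist() returns a compact lower-triangle object, so as.matrix() is used here to get the full symmetric table.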


K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
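The points above can be sketched as a small loop that alternates between assigning points to the closest centroid and recomputing the centroids. This is a minimal teaching sketch, not the algorithm used inside R's kmeans(); the function name simple_kmeans and its arguments are hypothetical.

```r
# Minimal k-means sketch (simple_kmeans is a hypothetical helper name)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Pick k random data points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # Assign each point to the cluster with the closest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when the assignment no longer changes
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each centroid to the mean of its points (empty clusters keep their centroid)
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centroids)
}

set.seed(42)
res <- simple_kmeans(iris[, 1:4], k = 3)
print(table(res$cluster))
```

Because the initial centroids are random, different seeds can give different clusterings.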

K-means in R

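# Work on a copy of iris without the class label
iris2 <- iris
iris2$Species <- NULL
# K-means with K = 3; the result depends on the random initial centroids
kmeans.result <- kmeans(iris2, 3)
# Compare the clusters with the true species
table(iris$Species, kmeans.result$cluster)
# Plot the first two dimensions, coloured by cluster
plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)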

Optimal value of K (number of clusters)

Total variation in the data (TSS) = v1^2 + v2^2 + v3^2 + ...
where v1, v2, v3, ... are the distances of the data points from the overall mean

Total variation within the clusters (WSS) = variation for cluster 1 + variation for cluster 2 + ...
K-means minimizes WSS

Variation between the clusters (BSS)

TSS = WSS + BSS
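The identity TSS = WSS + BSS can be checked numerically on a kmeans() result, which stores the three quantities as totss, tot.withinss and betweenss. A short sketch, assuming the iris measurements as data and K = 3:

```r
x <- iris[, 1:4]
set.seed(1)
km <- kmeans(x, centers = 3)

# TSS computed by hand: squared distances of all points to the overall mean
tss <- sum(scale(x, scale = FALSE)^2)
wss <- km$tot.withinss   # total within-cluster sum of squares
bss <- km$betweenss      # between-cluster sum of squares

print(c(TSS = tss, WSS = wss, BSS = bss))
print(all.equal(tss, wss + bss))
```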

K-means in R
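# For K = 1 the within-groups sum of squares equals the total sum of squares
wss <- (nrow(iris2)-1)*sum(apply(iris2,2,var))
# Within-groups sum of squares for K = 2..15
for (i in 2:15) wss[i] <- sum(kmeans(iris2,centers=i)$withinss)
# Look for the "elbow" where the curve flattens
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")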

Scaling (normalization)

Customer   Marital status   House   Car   Salary
A                                         25000
B                                         20000
C          0                0       0     25000

Before normalization:
d(A,B) = 5000
d(A,C) = 1.3
d(B,C) > 5000
A and C appear similar, so they will be in one cluster.

After normalization, A and B will be in one cluster.
# scale() standardizes each column to mean 0 and standard deviation 1
iris2_z <- as.data.frame(lapply(iris2, scale))

Normalization = (Value - min) / (Max - min)
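The min-max formula can be written as a one-line R function and applied column by column. A minimal sketch; the helper name min_max is hypothetical, and it assumes every column has at least two distinct values (otherwise the denominator is zero).

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
# (min_max is a hypothetical helper, not a base R function)
min_max <- function(v) (v - min(v)) / (max(v) - min(v))

iris2 <- iris[, 1:4]
iris2_mm <- as.data.frame(lapply(iris2, min_max))

# Every column now runs from 0 to 1
print(sapply(iris2_mm, range))
```

Unlike scale(), which centres each column at 0 with unit standard deviation, this keeps all values between 0 and 1.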

K-means limitations
K-means has problems when the data contains outliers
Finding optimum number of clusters K is difficult
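The outlier problem comes from the centroid being a mean. A small sketch with made-up one-dimensional values shows how a single outlier drags the mean away from the bulk of the data, while a medoid (the data point with the smallest total distance to all others) stays put:

```r
# Made-up values with one outlier (100)
x <- c(1, 2, 3, 4, 100)

# Centroid as used by k-means: the mean
centroid <- mean(x)

# Medoid: the data point with the smallest total distance to all other points
total_dist <- sapply(x, function(m) sum(abs(x - m)))
medoid <- x[which.min(total_dist)]

print(c(centroid = centroid, medoid = medoid))
```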

K-medoids clustering
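library(fpc)
iris2 <- iris
iris2$Species <- NULL
# pamk() runs PAM (partitioning around medoids) and estimates the number of clusters
pamk.result <- pamk(iris2)
# Compare the clusters with the true species
table(pamk.result$pamobject$clustering, iris$Species)
# Show the two PAM plots side by side
layout(matrix(c(1,2),1,2))
plot(pamk.result$pamobject)
layout(matrix(1))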
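When the number of clusters is already known, pam() from the cluster package can be called directly with a fixed k instead of letting pamk() estimate it. A short sketch, assuming k = 3 for the iris data:

```r
library(cluster)   # provides pam()

iris2 <- iris[, 1:4]
pam.result <- pam(iris2, k = 3)

# Medoids are actual rows of the data, not averages
print(pam.result$medoids)
print(pam.result$id.med)    # row indices of the medoids

# Compare the clusters with the true species
print(table(pam.result$clustering, iris$Species))
```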

Hierarchical clustering

Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
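The steps above can be sketched as a naive loop in R. This is an illustration only: it assumes single linkage (cluster distance = smallest pairwise distance), which the slides do not specify, and the function name naive_hclust is hypothetical. In practice hclust() should be used.

```r
# Naive agglomerative clustering with single linkage (hypothetical helper).
# Returns the distance at which each merge happened.
naive_hclust <- function(x) {
  d <- as.matrix(dist(x))                 # compute the proximity matrix
  diag(d) <- Inf                          # ignore self-distances
  clusters <- as.list(seq_len(nrow(d)))   # each data point starts as a cluster
  heights <- numeric(0)
  while (length(clusters) > 1) {          # repeat until a single cluster remains
    # Find the two closest clusters
    pos <- which(d == min(d), arr.ind = TRUE)[1, ]
    i <- min(pos)
    j <- max(pos)
    heights <- c(heights, d[i, j])
    # Merge cluster j into cluster i
    clusters[[i]] <- c(clusters[[i]], clusters[[j]])
    clusters[[j]] <- NULL
    # Update the proximity matrix: single linkage keeps the smaller distance
    d[i, ] <- pmin(d[i, ], d[j, ])
    d[, i] <- d[i, ]
    d[i, i] <- Inf
    d <- d[-j, -j, drop = FALSE]
  }
  heights
}

x <- iris[1:20, 1:4]
merge_heights <- naive_hclust(x)
# Should agree with the merge heights reported by hclust() for single linkage
print(all.equal(merge_heights, hclust(dist(x), method = "single")$height))
```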

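Hierarchical clustering in R

# Draw a random sample of 40 rows so the dendrogram stays readable
idx <- sample(1:dim(iris)[1], 40)
irisSample <- iris[idx,]
irisSample$Species <- NULL
# Average-linkage clustering on Euclidean distances
hc <- hclust(dist(irisSample), method="average")
plot(hc, hang = -1, labels=iris$Species[idx])
# Mark 3 clusters on the dendrogram and extract the cluster memberships
rect.hclust(hc, k=3)
groups <- cutree(hc, k=3)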
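The cluster memberships returned by cutree() can be cross-tabulated against the true species, in the same way as for K-means. A self-contained sketch; the seed value is an arbitrary assumption added only to make the random sample repeatable.

```r
set.seed(123)   # arbitrary seed, only for a repeatable sample
idx <- sample(nrow(iris), 40)
irisSample <- iris[idx, 1:4]

hc <- hclust(dist(irisSample), method = "average")
groups <- cutree(hc, k = 3)

# Rows: cluster number, columns: true species of the sampled rows
print(table(groups, iris$Species[idx]))
```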

Clustering depending on type of dataset
Should K-means be avoided for datasets with outliers?
Should hierarchical clustering be avoided for large datasets?

Thank You