
Clustering methods

April 2018

K-means clustering.

We will use the kmeans() function in the stats package.

Iris data

We shall first use the iris data.


rm(list=ls())
library(ggplot2) # needed for the ggplot() call below
data(iris)
summary(iris)

##  Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
##  Min.   :4.300  Min.   :2.000  Min.   :1.000  Min.   :0.100
##  1st Qu.:5.100  1st Qu.:2.800  1st Qu.:1.600  1st Qu.:0.300
##  Median :5.800  Median :3.000  Median :4.350  Median :1.300
##  Mean   :5.843  Mean   :3.057  Mean   :3.758  Mean   :1.199
##  3rd Qu.:6.400  3rd Qu.:3.300  3rd Qu.:5.100  3rd Qu.:1.800
##  Max.   :7.900  Max.   :4.400  Max.   :6.900  Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
Remove the class label (Species) so that only the four numeric features are used for clustering:
iris.features<-iris[,-5]

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point()

[Figure: scatter plot of Sepal.Width against Sepal.Length, coloured by Species (setosa, versicolor, virginica).]

Perform K-means on the data:


set.seed(5342)
kmeans.results<-kmeans(iris.features,3,nstart=20)

Check clustering results against class labels (Species)


table(iris$Species, kmeans.results$cluster)

##
## 1 2 3
## setosa 0 50 0
## versicolor 2 0 48
## virginica 36 0 14
Note that the class "setosa" is easily separated from the other two clusters, while versicolor and virginica overlap.
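
To put a single number on this agreement, one option is the adjusted Rand index, which is 1 for perfect agreement and close to 0 for a random partition. A minimal sketch, assuming the mclust package (not used elsewhere in this document) is installed:

library(mclust)
# Adjusted Rand index between the k-means partition and the true species labels
adjustedRandIndex(kmeans.results$cluster, iris$Species)
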
Plot the clusters and their centres. Note that there are four dimensions in the data and that only the first
two dimensions are used to draw the plot below. Some black points close to the green centre (asterisk) are
actually closer to the black centre in the four-dimensional space.

plot(iris.features[c("Sepal.Length", "Sepal.Width")], col=kmeans.results$cluster)


points(kmeans.results$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=8, cex=2)

[Figure: Sepal.Width against Sepal.Length, points coloured by k-means cluster and cluster centres marked with asterisks.]

The elbow method for determining the number of clusters with K-means

Compute and plot the total within-cluster sum of squares (wss) for k = 1 to k = 15:


k.max <- 15 # Maximal number of clusters
wss <- sapply(1:k.max, function(k){kmeans(iris.features, k, nstart = 10)$tot.withinss})
plot(1:k.max, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

[Figure: total within-clusters sum of squares against the number of clusters K (elbow plot for the iris features).]

According to the scree plot above, we could argue that the optimal number of clusters is 2 or 3.
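
The same elbow curve can also be produced in one call with fviz_nbclust(); this is only a sketch and assumes the factoextra package (not used elsewhere in this document) is installed:

library(factoextra)
# Total within-cluster sum of squares for k = 1 to 10 (the default k.max)
fviz_nbclust(iris.features, kmeans, method = "wss", nstart = 10)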

The average silhouette width method for determining the number of clusters with K-means

Compute the average silhouette width for k = 2 to k = 15


library(cluster)
k.max <- 15
data <- iris.features
sil <- rep(0, k.max)

for(i in 2:k.max){
  km.res <- kmeans(data, centers = i, nstart = 25, iter.max = 20)
  ss <- silhouette(km.res$cluster, dist(data))
  sil[i] <- mean(ss[, 3])
}

Plot the average silhouette width

plot(1:k.max, sil, type = "b", pch = 19,
     frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)

[Figure: average silhouette width (sil) against the number of clusters k; the dashed line marks the maximum.]

According to the average silhouette method, two clusters are chosen as optimal.
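
To inspect how well individual observations sit in those two clusters, we can draw the silhouette plot itself. A minimal sketch (km2 below is a new two-cluster fit, not computed elsewhere in this document):

km2 <- kmeans(iris.features, 2, nstart = 20)
# Silhouette plot of the two-cluster k-means solution
plot(silhouette(km2$cluster, dist(iris.features)), col = 2:3,
     main = "Silhouette plot, k-means with k = 2")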

US Arrests

# Load the data set
data("USArrests")
# Remove any missing values (i.e., NA values for not available)
# that might be present in the data
df <- na.omit(USArrests)
# Standardise the variables
df <- scale(df)
# View the first 6 rows of the data
head(df, n = 6)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
# K-means clustering
set.seed(123)
km.out <- kmeans(df, 2, nstart = 25)
# k-means group number of each observation
km.out$cluster

## Alabama Alaska Arizona Arkansas California
## 2 2 2 1 2
## Colorado Connecticut Delaware Florida Georgia
## 2 1 1 2 2
## Hawaii Idaho Illinois Indiana Iowa
## 1 1 2 1 1
## Kansas Kentucky Louisiana Maine Maryland
## 1 1 2 1 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1 2 1 2 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 2 1 1
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 2 1 1
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 1 1 2
## South Dakota Tennessee Texas Utah Vermont
## 1 2 2 1 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 1 1 1 1
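
A convenient way to visualise this two-cluster solution on the first two principal components is fviz_cluster(); a minimal sketch, assuming the factoextra package is installed:

library(factoextra)
# Cluster plot of the k-means solution on the first two principal components
fviz_cluster(km.out, data = df)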

Partitioning Around Medoids (PAM). We will use the pam() function in the cluster package.

# PAM clustering
pam.out <- pam(df, 2)
pam.out$cluster

## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
plot(pam.out)

[Figure: clusplot of pam(x = df, k = 2). These two components explain 86.75 % of the point variability.]

[Figure: silhouette plot of pam(x = df, k = 2), n = 50, 2 clusters: cluster 1 (20 observations, average silhouette width 0.37), cluster 2 (30 observations, 0.43); overall average silhouette width 0.41.]
This table shows how to interpret the average silhouette width (SC):

Range of SC     Interpretation
0.71 - 1.00     A strong structure has been found
0.51 - 0.70     A reasonable structure has been found
0.26 - 0.50     The structure is weak and could be artificial
< 0.25          No substantial structure has been found

According to the table, with an average silhouette width of 0.41 the structure found here is weak.
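
The same silhouette numbers can be read off the fitted object rather than the plot; pam() stores them in the silinfo component:

# Overall average silhouette width of the PAM fit
pam.out$silinfo$avg.width
# Per-cluster average silhouette widths
pam.out$silinfo$clus.avg.widths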

Clustering with CLARA.


The R function for computing CLARA, clara(), is found in the cluster package.
clarax <- clara(df, 2, samples=10)
# Silhouette plot
plot(silhouette(clarax), col = 2:3, main = "Silhouette plot")

[Figure: silhouette plot of the CLARA fit on its best sample, n = 44, 2 clusters: cluster 1 (18 observations, average silhouette width 0.40), cluster 2 (26 observations, 0.44); overall average silhouette width 0.42.]
The overall average silhouette width is 0.42, meaning that the structure found is weak (see the table above
for the ranges of the silhouette width and the corresponding interpretations).
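
The fitted clara object can be inspected in the same way; for example, the medoids and a per-cluster summary are stored in its medoids and clusinfo components:

# Medoids (representative observations) of the two clusters
clarax$medoids
# Cluster sizes and dissimilarity summaries
clarax$clusinfo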

Methods for determining the number of clusters

Elbow method for k-means clustering

set.seed(123)
# Compute and plot wss for k = 1 to k = 15
k.max <- 15 # Maximal number of clusters
df.out <- df
wss <- sapply(1:k.max,
              function(k){kmeans(df.out, k, nstart = 10)$tot.withinss})

plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
abline(v = 3, lty =2)

[Figure: total within-clusters sum of squares against the number of clusters K for the scaled USArrests data; the dashed line marks K = 3.]

According to the elbow method, the optimal number of clusters suggested for the K-means algorithm is 3 or 4.

Average silhouette method for k-means clustering

k.max <- 15
df.out <- df
sil <- rep(0, k.max)

# Compute the average silhouette width for k = 2 to k = 15
for(i in 2:k.max){
  km.res <- kmeans(df.out, centers = i, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(df.out))
  sil[i] <- mean(ss[, 3])
}

# Plot the average silhouette width
plot(1:k.max, sil, type = "b", pch = 19,
     frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)

[Figure: average silhouette width against the number of clusters k for k-means on the scaled USArrests data; the dashed line marks the maximum.]

According to the silhouette method, the optimal number of clusters suggested for the K-means algorithm is 2.

Average silhouette method for PAM clustering

k.max <- 15
sil <- rep(0, k.max)

# Compute the average silhouette width for k = 2 to k = 15
for(i in 2:k.max){
  pam.res <- pam(df, i)
  ss <- silhouette(pam.res$cluster, dist(df))
  sil[i] <- mean(ss[, 3])
}

# Plot the average silhouette width
plot(1:k.max, sil, type = "b", pch = 19,
     frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)

[Figure: average silhouette width against the number of clusters k for PAM on the scaled USArrests data; the dashed line marks the maximum.]

Gap statistic for k-means clustering

# Compute the gap statistic
gap_stat_km <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# Plot the result
plot(gap_stat_km, frame = FALSE, xlab = "Number of clusters k")
abline(v = 4, lty = 2)

[Figure: gap statistic (Gap_k) against the number of clusters k for clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25); the dashed line marks k = 4.]

According to the gap statistic, the optimal number of clusters chosen for the K-means algorithm is 4.
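
The same choice can be made programmatically: maxSE() in the cluster package applies the standard-error rules to the gap curve, so the result depends on the rule chosen and need not coincide with the visual choice of 4. A minimal sketch:

# Optimal k under the default "firstSEmax" rule
maxSE(gap_stat_km$Tab[, "gap"], gap_stat_km$Tab[, "SE.sim"])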

Gap statistic for PAM clustering

# Compute the gap statistic
gap_stat_pam <- clusGap(df, FUN = pam, K.max = 10, B = 50)
# Plot the result
plot(gap_stat_pam, frame = FALSE, xlab = "Number of clusters k")
abline(v = 4, lty = 2)

[Figure: gap statistic (Gap_k) against the number of clusters k for clusGap(x = df, FUNcluster = pam, K.max = 10, B = 50); the dashed line marks k = 4.]

According to the gap statistic, the optimal number of clusters chosen for the PAM algorithm is also 4.

Using the NbClust package, which uses a vote across many indices to choose the number of clusters.
The following example determines the number of clusters using all statistics:

library(NbClust)
res.nb <- NbClust(df, distance = "euclidean", min.nc = 2, max.nc = 10,
                  method = "complete", index = "all")

[Figure: Hubert statistic values and Hubert statistic second differences against the number of clusters.]

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##

[Figure: Dindex values and Dindex second differences against the number of clusters.]

## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 4 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
When all statistics in the NbClust package are allowed to vote, the majority (in this case 9 out of 23) propose
that the ‘optimal’ number of clusters should be 2.

hist(res.nb$Best.nc[1,], breaks= 0:10, col= 'lightblue', xlab = "Optimal number of clusters")

[Figure: histogram of res.nb$Best.nc[1, ], the optimal number of clusters proposed by each index.]

Clustering with DBSCAN

To illustrate the application of DBSCAN we will use a very simple artificial data set of four slightly overlapping
Gaussians in two-dimensional space with 100 points each. We set the random number generator to make
the results reproducible and create the data set as shown below. The function dbscan() is found in the fpc
package.
set.seed(2)
n <- 400
# Four cluster centres from runif() are recycled across the n normal deviates
x <- cbind(
  x = runif(4, 0, 1) + rnorm(n, sd = 0.1),
  y = runif(4, 0, 1) + rnorm(n, sd = 0.1)
)
true_clusters <- rep(1:4, times = 100)
plot(x, col = true_clusters, pch = true_clusters)

[Figure: the simulated data, points coloured and shaped by their true cluster.]

# To apply DBSCAN, we need to decide on the neighborhood radius eps
# and the density threshold minPts.
# The rule of thumb for minPts is to use at least the number of
# dimensions of the data set plus one. In our case, this is 3.
db <- fpc::dbscan(x, eps = .05, MinPts = 3)
# Plot DBSCAN results
plot(db, x, main = "DBSCAN", frame = FALSE)

[Figure: DBSCAN clustering result plotted on the two simulated features.]

DBSCAN has found three clusters in the data.
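
The result is sensitive to the choice of eps. A common way to pick it is to plot the distance of each point to its k-th nearest neighbour (with k = MinPts) and look for a knee. A minimal sketch, assuming the dbscan package (a separate package from fpc) is installed:

library(dbscan)
# Sorted distances to the 3rd nearest neighbour; a knee suggests a value for eps
kNNdistplot(x, k = 3)
abline(h = 0.05, lty = 2) # the eps value used above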
