K-means clustering.
Iris data
[Figure: scatter plot of Sepal.Width against Sepal.Length for the iris data, coloured by Species (setosa, versicolor, virginica)]
##
##                1  2  3
##   setosa       0 50  0
##   versicolor   2  0 48
##   virginica   36  0 14
Note that the species setosa is easily separated into its own cluster, while versicolor and virginica are partly mixed across the other two clusters.
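A minimal sketch of how such a table can be produced; the seed and the use of the four numeric iris columns are assumptions:
set.seed(123)                                   # assumed seed; not given in the source
km.res <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(iris$Species, km.res$cluster)             # rows: species, columns: cluster labels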
Plot the clusters and their centres. Note that the data have four dimensions and that only the first two are used to draw the plot below, so some black points that appear close to the green centre (asterisk) are actually closer to the black centre in the four-dimensional space.
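A minimal sketch of such a plot, reusing the km.res fit from the sketch above:
# Plot the data in the first two dimensions, coloured by cluster assignment,
# and mark the cluster centres with asterisks (pch = 8)
plot(iris[, 1:2], col = km.res$cluster)
points(km.res$centers[, 1:2], col = 1:3, pch = 8, cex = 2)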
[Figure: k-means clusters in the Sepal.Length / Sepal.Width plane, with the cluster centres marked by asterisks]
[Figure: scree plot of the total within-clusters sum of squares against the number of clusters K]
According to the scree plot above, we could argue that the optimal number of clusters is 2 or 3.
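The code behind the scree plot is not shown; a minimal sketch, assuming data holds the four numeric iris columns:
k.max <- 15                          # maximal number of clusters
wss <- sapply(1:k.max,
              function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)
plot(1:k.max, wss, type = "b", pch = 19,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")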
The average silhouette width method for determining the number of clusters with K-means:
library(cluster)                       # for silhouette()
k.max <- 15                            # maximal number of clusters
sil <- rep(0, k.max)                   # average silhouette width for each k
for (i in 2:k.max) {
  km.res <- kmeans(data, centers = i, nstart = 25, iter.max = 20)
  ss <- silhouette(km.res$cluster, dist(data))
  sil[i] <- mean(ss[, 3])              # mean silhouette width over all points
}
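The resulting curve, shown below, can then be plotted, for example:
plot(2:k.max, sil[2:k.max], type = "b", pch = 19,
     xlab = "Number of clusters k", ylab = "sil")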
[Figure: average silhouette width (sil) against the number of clusters k]
US Arrests
## Alabama Alaska Arizona Arkansas California
## 2 2 2 1 2
## Colorado Connecticut Delaware Florida Georgia
## 2 1 1 2 2
## Hawaii Idaho Illinois Indiana Iowa
## 1 1 2 1 1
## Kansas Kentucky Louisiana Maine Maryland
## 1 1 2 1 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1 2 1 2 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 2 1 1
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 2 1 1
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 1 1 2
## South Dakota Tennessee Texas Utah Vermont
## 1 2 2 1 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 1 1 1 1
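Only the printed cluster assignment appears above; a minimal sketch of how it might have been produced, where the scaling step, the seed, the choice of k = 2 and the object name km.us are assumptions:
df <- scale(USArrests)               # assumed: standardise the four crime variables
set.seed(123)                        # assumed seed
km.us <- kmeans(df, centers = 2, nstart = 25)
km.us$cluster                        # cluster label for each state, as printed above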
Partitioning Around Medoids (PAM). We will use the pam() function from the cluster package.
# PAM clustering
pam.out <- pam(df, 2)
pam.out$cluster
clusplot(pam(x = df, k = 2))
[Figure: clusplot of the PAM solution, Component 2 against Component 1]
These two components explain 86.75 % of the point variability.
[Figure: silhouette plot of pam(x = df, k = 2); n = 50, 2 clusters: cluster 1 with n1 = 20 and average silhouette width 0.37, cluster 2 with n2 = 30 and average silhouette width 0.43]
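A sketch of how the silhouette plot above can be drawn from the fitted PAM object:
plot(pam.out, which.plots = 2)       # silhouette panel of the PAM solution
# equivalently: plot(silhouette(pam.out))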
[Figure: silhouette plot; n = 44, 2 clusters: cluster 1 with n1 = 18 and average silhouette width 0.40, cluster 2 with n2 = 26 and average silhouette width 0.44]
set.seed(123)
# Compute and plot wss for k = 1 to k = 15
k.max <- 15   # maximal number of clusters
df.out <- df
wss <- sapply(1:k.max,
              function(k){kmeans(df.out, k, nstart = 10)$tot.withinss})
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
abline(v = 3, lty = 2)
[Figure: total within-clusters sum of squares against the number of clusters K, with a dashed vertical line at K = 3]
According to the elbow method, the optimal number of clusters suggested for the K-means algorithm is 3 or 4.
k.max <- 15                 # maximal number of clusters
data.out <- df
sil <- rep(0, k.max)        # average silhouette width for each k
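The loop that fills sil is not reproduced on this page; a minimal sketch, assuming the same silhouette loop as for the iris data is applied to data.out:
for (i in 2:k.max) {
  km.res <- kmeans(data.out, centers = i, nstart = 25, iter.max = 20)
  sil[i] <- mean(silhouette(km.res$cluster, dist(data.out))[, 3])
}
plot(2:k.max, sil[2:k.max], type = "b", pch = 19,
     xlab = "Number of clusters k", ylab = "sil")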
[Figure: average silhouette width (sil) against the number of clusters k]
According to the silhouette method, the optimal number of clusters suggested for the K-means algorithm is 2.
k.max <- 15                 # maximal number of clusters
sil <- rep(0, k.max)        # average silhouette width for each k
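The clustering call behind this second silhouette curve is not shown in the excerpt; a sketch under the assumption that the computation is repeated with PAM instead of K-means (pam.res is a hypothetical name):
for (i in 2:k.max) {
  pam.res <- pam(df, k = i)                    # assumption: PAM on the same data
  sil[i] <- pam.res$silinfo$avg.width          # average silhouette width for k = i
}
plot(2:k.max, sil[2:k.max], type = "b", pch = 19,
     xlab = "Number of clusters k", ylab = "sil")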
[Figure: average silhouette width (sil) against the number of clusters k]
[Figure: gap statistic Gap(k) against the number of clusters k, titled clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)]
According to the gap statistic, the optimal number of clusters chosen for the K-means algorithm is 4.
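The call used is visible in the figure title; a sketch of running and plotting it (the seed and the object name gap_km are assumptions):
set.seed(123)                        # assumed seed
gap_km <- clusGap(x = df, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
plot(gap_km, main = "Gap statistic (k-means)")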
[Figure: gap statistic Gap(k) against the number of clusters k, titled clusGap(x = df, FUNcluster = pam, K.max = 10, B = 50)]
According to the gap statistic, the optimal number of clusters chosen for the PAM algorithm is also 4.
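The analogous call for PAM, again taken from the figure title (gap_pam is a hypothetical name):
gap_pam <- clusGap(x = df, FUNcluster = pam, K.max = 10, B = 50)
plot(gap_pam, main = "Gap statistic (PAM)")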
Using the NbClust package, which chooses the number of clusters by a vote across many validity indices. The following example determines the number of clusters using all available indices:
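A sketch of such a call; the distance measure, the range of k and the clustering method are assumptions, while the object name res.nb is the one used later with hist():
library(NbClust)
res.nb <- NbClust(df, distance = "euclidean", min.nc = 2, max.nc = 10,
                  method = "kmeans", index = "all")
res.nb$Best.nc[1, ]     # number of clusters proposed by each index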
[Figure: NbClust diagnostic plots of the Hubert index values and their second differences against the number of clusters (2 to 10)]
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
[Figure: NbClust diagnostic plots of the Dindex values and their second differences against the number of clusters (2 to 10)]
hist(res.nb$Best.nc[1,], breaks= 0:10, col= 'lightblue', xlab = "Optimal number of clusters")
[Figure: histogram of res.nb$Best.nc[1, ], the optimal number of clusters proposed by each index]
To illustrate the application of DBSCAN we will use a very simple artificial data set of four slightly overlapping Gaussians in two-dimensional space, with 100 points each. We set the random number generator seed to make the results reproducible and create the data set as shown below. The function dbscan() is found in the fpc package.
set.seed(2)
n <- 400
# Four cluster means (recycled across the n points) plus Gaussian noise
x <- cbind(
  x = runif(4, 0, 1) + rnorm(n, sd = 0.1),
  y = runif(4, 0, 1) + rnorm(n, sd = 0.1)
)
true_clusters <- rep(1:4, times = 100)
plot(x, col = true_clusters, pch = true_clusters)
[Figure: the simulated data, y against x, with colours and plotting symbols showing the four true clusters]
DBSCAN
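The dbscan() call that produced the clustering below is not visible in this excerpt; a minimal sketch, where the eps and MinPts values are assumptions (in practice they are tuned, for example with a k-nearest-neighbour distance plot):
library(fpc)
db <- dbscan(x, eps = 0.05, MinPts = 5)     # eps and MinPts are assumed values
plot(x, col = db$cluster + 1L, pch = db$cluster + 1L,
     main = "DBSCAN")                        # cluster label 0 marks noise points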
[Figure: DBSCAN clustering of the simulated data, y against x]