Statistics: Module 8
Cluster Analysis
Clustering methods
• Hierarchical methods
→ Agglomerative
→ Divisive
• Partitioning methods
→ Nonparametric
→ Model-based
Scope of application
• Some methods only work on data matrices with
Euclidean (k-means) or Mahalanobis distances
(model-based)
Agglomerative hierarchical clustering
• Does not lead to a particular number of clusters
Intercluster dissimilarity
• At each step join the two closest or most similar
clusters
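The joining step can be sketched in a few lines of Python. This is a naive illustration only: the function names (`agglomerate`, `single_linkage`) and the toy points are assumptions made here, and single linkage with Euclidean distance is used merely as one possible intercluster dissimilarity.

```python
from itertools import combinations
from math import dist

def single_linkage(q, r):
    """Smallest pairwise Euclidean distance between clusters q and r."""
    return min(dist(a, b) for a in q for b in r)

def agglomerate(points, linkage=single_linkage):
    """Naive agglomerative clustering; returns [(height, merged_cluster), ...]."""
    clusters = [(p,) for p in points]  # start: each object is its own cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest intercluster dissimilarity
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        height = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # replace the two joined clusters by their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((height, merged))
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]  # toy data
merges = agglomerate(points)
for height, cluster in merges:
    print(round(height, 3), cluster)
```

The recorded heights are what a dendrogram or banner displays. The loop always runs until one cluster remains, which is why the method does not by itself lead to a particular number of clusters.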
Single linkage
Consider two clusters Q and R.
The single linkage intercluster dissimilarity between
clusters Q and R:
\[ d(Q, R) = \min_{i \in Q,\, j \in R} d(i, j) \]
Complete linkage
Consider two clusters Q and R.
The complete linkage intercluster dissimilarity
between clusters Q and R:
\[ d(Q, R) = \max_{i \in Q,\, j \in R} d(i, j) \]
Group average
Consider two clusters Q and R.
The group average intercluster dissimilarity between
clusters Q and R:
\[ d(Q, R) = \frac{1}{n_Q n_R} \sum_{i \in Q,\, j \in R} d(i, j) \]
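As a small worked example, the three intercluster dissimilarities can be computed side by side: single linkage as the minimum, complete linkage as the maximum, and group average as the mean of all pairwise dissimilarities d(i, j) with i in Q and j in R. The sketch assumes Euclidean distance for d(i, j); the helper names and the two toy clusters are illustrative only.

```python
from math import dist

def pairwise(q, r):
    """All dissimilarities d(i, j) with i in q and j in r (Euclidean here)."""
    return [dist(a, b) for a in q for b in r]

def single_linkage(q, r):
    return min(pairwise(q, r))

def complete_linkage(q, r):
    return max(pairwise(q, r))

def group_average(q, r):
    return sum(pairwise(q, r)) / (len(q) * len(r))

Q = [(0.0, 0.0), (0.0, 3.0)]  # toy cluster Q
R = [(4.0, 0.0), (4.0, 3.0)]  # toy cluster R

print(single_linkage(Q, R), complete_linkage(Q, R), group_average(Q, R))
```

The four pairwise distances here are 4, 5, 5 and 4, so the group average always lies between the single and complete linkage values.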
Centroid linkage
• Based on Euclidean distances
Ward’s method
• Based on Euclidean distances
\[ d^2(Q, R) = \frac{2 n_Q n_R}{n_Q + n_R} \, d_E^2(\bar{x}(Q), \bar{x}(R)) \]
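A minimal sketch of the two centroid-based dissimilarities follows. It assumes the usual definition of centroid linkage as the Euclidean distance between the cluster means, and computes Ward's squared dissimilarity as 2 n_Q n_R / (n_Q + n_R) times the squared Euclidean distance between the means. Function names and data are illustrative.

```python
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(q, r):
    """Euclidean distance between the two cluster means."""
    return dist(centroid(q), centroid(r))

def ward_squared(q, r):
    """Ward's squared dissimilarity: size-weighted squared centroid distance."""
    nq, nr = len(q), len(r)
    return 2 * nq * nr / (nq + nr) * dist(centroid(q), centroid(r)) ** 2

Q = [(0.0, 0.0), (0.0, 2.0)]  # mean (0, 1)
R = [(4.0, 0.0), (4.0, 2.0)]  # mean (4, 1)

print(centroid_linkage(Q, R), ward_squared(Q, R))
```

With this scaling, Ward's squared dissimilarity is twice the increase in the within-cluster sum of squares caused by merging Q and R (here the increase is 16).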
Displaying a hierarchical clustering
• Dendrogram: clustering tree
• Banner
Dendrogram
• Root: all objects in one cluster
Example: Life expectancy
[Figure: dendrograms of the life expectancy data ("life") under single linkage, complete linkage, group average, and Ward’s method; the axis shows the merge height.]
Banner
• Left: Each object is its own cluster
Example: Life expectancy
[Figure: banner plots of the life expectancy data under single linkage (height axis 0 to 21.9), complete linkage (0 to 60), group average (0 to 46), and Ward’s method (0 to 88).]
K-means
• Based on Euclidean distances
→ Standardize the variables
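Standardizing matters because Euclidean distance is dominated by variables measured on large scales. The sketch below uses z-scores (subtract the mean, divide by the sample standard deviation); this is one common choice and an assumption here, since the slides do not say which scaling is meant. The function name and data are illustrative.

```python
from statistics import mean, stdev

def standardize(rows):
    """Return the rows with every variable centred and scaled to unit sd."""
    columns = list(zip(*rows))
    centres = [mean(c) for c in columns]
    scales = [stdev(c) for c in columns]  # sample standard deviation
    return [tuple((x - m) / s for x, m, s in zip(row, centres, scales))
            for row in rows]

# toy data: the second variable is on a much larger scale than the first
data = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
z = standardize(data)
print(z)
```

After standardizing, both variables contribute on the same scale to the Euclidean distances used by k-means.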
K-means objective
Minimize the sum of squared Euclidean distances between
the objects and their group mean over all possible
partitions of the data in k groups C_1, ..., C_k:
\[ \min_{C_1, \ldots, C_k} \sum_{j=1}^{k} \sum_{i \in C_j} d_E^2(x_i, \bar{x}(C_j)) \]
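The objective can be evaluated directly, and a local minimum can be found by alternating between assigning objects to the nearest group mean and recomputing the means. The sketch below uses Lloyd's algorithm, which is an assumption made here: the slides state the objective but not the algorithm. Starting centres, names and data are illustrative.

```python
from math import dist

def mean_point(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def within_ss(groups):
    """K-means objective: squared Euclidean distances to each group mean."""
    total = 0.0
    for g in groups:
        if not g:
            continue  # an empty group contributes nothing
        centre = mean_point(g)
        total += sum(dist(x, centre) ** 2 for x in g)
    return total

def lloyd(points, centres, max_iter=100):
    """Lloyd's algorithm from given starting centres (finds a local minimum)."""
    centres = list(centres)
    groups = []
    for _ in range(max_iter):
        # assignment step: each object goes to its nearest centre
        groups = [[] for _ in centres]
        for x in points:
            nearest = min(range(len(centres)), key=lambda j: dist(x, centres[j]))
            groups[nearest].append(x)
        # update step: recompute means (an empty group keeps its old centre)
        new_centres = [mean_point(g) if g else c for g, c in zip(groups, centres)]
        if new_centres == centres:
            break
        centres = new_centres
    return groups, centres

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
groups, centres = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
print(centres, within_ss(groups))
```

Running this for several values of k and plotting `within_ss` against k gives the kind of curve used to choose the number of groups. The result depends on the starting centres, so in practice several starts are usually compared.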
Example: Pottery
Data with chemical composition of 45 specimens of
Romano-British pottery, determined by atomic
absorption spectrophotometry for nine oxides.
Example: Number of clusters
[Figure: within group sum of squares against number of groups (1 to 6)]
→ 3 clusters
Example: Three clusters
[Figure: CLUSPLOT(pottery.data), Component 2 against Component 1]
These two components explain 74.75% of the point variability.
Example: cluster means
Clus 1 Clus 2 Clus 3
K-medoids
• Partitions the observations in k groups
K-medoid objective
Minimize the sum of dissimilarities between the
objects and the closest medoid over all possible
choices of k medoids m_1, ..., m_k from the data:
\[ \min_{m_1, \ldots, m_k} \sum_{i=1}^{n} \min_{t = 1, \ldots, k} d(x_i, m_t) \]
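To make the objective concrete, the sketch below evaluates it for a given set of medoids and finds the best medoids by exhaustive search over all subsets of size k. Exhaustive search is used only for illustration and is feasible only for tiny data sets; PAM itself relies on a build-and-swap heuristic instead. Euclidean distance, the function names and the data are assumptions.

```python
from itertools import combinations
from math import dist

def medoid_objective(points, medoids, d=dist):
    """Sum over all objects of the dissimilarity to the closest medoid."""
    return sum(min(d(x, m) for m in medoids) for x in points)

def best_medoids(points, k, d=dist):
    """Try every choice of k medoids from the data (tiny data sets only)."""
    return min(combinations(points, k),
               key=lambda ms: medoid_objective(points, ms, d))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
medoids = best_medoids(points, 2)
print(medoids, medoid_objective(points, medoids))
```

Because only dissimilarities d(x_i, m_t) enter the objective, any dissimilarity function can be passed as `d`; the medoids are always actual observations.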
Example: Pottery
[Figure: PAM objective function against number of groups (2 to 6)]
→ 3 clusters
Example: Three clusters
[Figure: CLUSPLOT(pottery.data), Component 2 against Component 1]
These two components explain 74.75% of the point variability.
Example: cluster medoids
Clus 1 Clus 2 Clus 3
Displaying a partitioning clustering
Silhouette plot
Example: Pottery
Silhouette plot of pam(x = pottery.data, k = 3, stand = T)
n = 45, 3 clusters C_j

j   n_j   ave_{i in C_j} s_i
1   20    0.53
2   11    0.42
3   14    0.38
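The per-cluster averages in a silhouette plot are means of the silhouette widths s_i. A minimal sketch of the standard definition follows: s_i = (b_i - a_i) / max(a_i, b_i), where a_i is the average dissimilarity of object i to the other members of its own cluster and b_i is the smallest average dissimilarity to any other cluster. Euclidean distance, the function name and the toy clusters are assumptions, and the code does not reproduce the pottery results.

```python
from math import dist

def silhouette_widths(clusters, d=dist):
    """Silhouette width of every object; needs at least two clusters."""
    widths = []
    for j, own in enumerate(clusters):
        row = []
        for x in own:
            if len(own) == 1:
                row.append(0.0)  # usual convention for a singleton cluster
                continue
            # a: average dissimilarity to the other members (d(x, x) = 0)
            a = sum(d(x, y) for y in own) / (len(own) - 1)
            # b: smallest average dissimilarity to another cluster
            b = min(sum(d(x, y) for y in other) / len(other)
                    for l, other in enumerate(clusters) if l != j)
            row.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        widths.append(row)
    return widths

clusters = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
widths = silhouette_widths(clusters)
averages = [sum(row) / len(row) for row in widths]
print(averages)
```

Widths close to 1 indicate objects that sit well inside their cluster; widths near 0 indicate objects lying between two clusters.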
Model-based clustering
• Assume the data come from a mixture of k
subpopulations
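Written out, and assuming Gaussian subpopulations (an assumption consistent with the mention of Mahalanobis distances for model-based methods), the mixture density can be sketched as follows. The symbols \pi_j, \mu_j and \Sigma_j are notation introduced here, not taken from the slides.

```latex
f(x) = \sum_{j=1}^{k} \pi_j \, \phi(x;\, \mu_j, \Sigma_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{k} \pi_j = 1,
\qquad
\phi(x;\, \mu_j, \Sigma_j) \propto
\exp\!\left( -\tfrac{1}{2} (x - \mu_j)^{\top} \Sigma_j^{-1} (x - \mu_j) \right)
```

Here \pi_j is the proportion of subpopulation j, and the exponent is minus one half of the squared Mahalanobis distance from x to \mu_j. Each object is typically assigned to the component with the largest value of \pi_j \phi(x; \mu_j, \Sigma_j).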
Example: Planet data
Measurements on 3 variables for 101 exoplanets
(outside the solar system) discovered up to October
2002.
Example: Planets
[Figure: scatterplot matrix of Mass, Period, and Eccentricity]
Example: Planets
[Figure: BIC against number of components (2 to 8) for the models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, and VVV]
→ 3 clusters
Example: Planets
[Figure: scatterplot matrix of Mass, Period, and Eccentricity]
Example: cluster centers
Clus 1 Clus 2 Clus 3
Determining the number of clusters
• Compare results for different values of k
• Subjective!