Data Mining and Machine Learning:
Fundamental Concepts and Algorithms

Mohammed J. Zaki (1) and Wagner Meira Jr. (2)

1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 14: Hierarchical Clustering 1 / 16
Hierarchical Clustering
The clusters in the hierarchy range from the fine-grained to the coarse-grained:
the lowest level of the tree (the leaves) consists of each point in its own
cluster, whereas the highest level (the root) consists of all points in one
cluster.
Hierarchical Clustering: Nested Partitions
Hierarchical Clustering Dendrogram
(Figure: a dendrogram over the points A, B, C, D, E, showing a sequence of
nested clusterings Ct−1 ⊂ Ct for t = 2, . . . , 5. We assume that A and B are
merged before C and D.)
Number of Hierarchical Clusterings
The total number of different dendrograms with n leaves is given as:

    ∏_{m=1}^{n−1} (2m − 1) = 1 × 3 × 5 × · · · × (2n − 3) = (2n − 3)!!
(Figure: the possible dendrograms for (a) n = 1, (b) n = 2, and (c) n = 3
leaves: 1, 1, and 3 dendrograms, respectively.)
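The count grows quickly with n. A small sketch evaluating the product directly (the helper name `num_dendrograms` is mine, not from the text):

```python
from functools import reduce

def num_dendrograms(n):
    """Number of distinct dendrograms with n leaves: prod_{m=1}^{n-1} (2m - 1) = (2n - 3)!!"""
    return reduce(lambda acc, m: acc * (2 * m - 1), range(1, n), 1)

for n in range(1, 7):
    print(n, num_dendrograms(n))  # 1, 1, 3, 15, 105, 945
```

Already at n = 6 there are 945 distinct dendrograms, which is why agglomerative methods build one greedily rather than searching over all of them.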
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering Algorithm
AgglomerativeClustering(D, k):
1 C ← {Ci = {xi} | xi ∈ D}              // Each point in a separate cluster
2 ∆ ← {δ(xi, xj) : xi, xj ∈ D}          // Compute the distance matrix
3 repeat
4     Find the closest pair of clusters Ci, Cj ∈ C
5     Cij ← Ci ∪ Cj                     // Merge the clusters
6     C ← (C \ {Ci, Cj}) ∪ {Cij}        // Update the clustering
7     Update the distance matrix ∆ to reflect the new clustering
8 until |C| = k
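The pseudocode above can be sketched in a few lines of Python. This is a minimal brute-force illustration (the function name and pair search are mine), not an efficient implementation:

```python
import math

def agglomerative(points, k, link=min):
    """Agglomerative clustering sketch: start with singleton clusters and
    repeatedly merge the closest pair until only k clusters remain.
    `link` aggregates point-pair distances: min = single link, max = complete link."""
    clusters = [[i] for i in range(len(points))]  # each point in a separate cluster

    def dist(ci, cj):  # cluster distance under the chosen linkage
        return link(math.dist(points[a], points[b]) for a in ci for b in cj)

    while len(clusters) > k:
        # find the closest pair of clusters (brute force over all pairs)
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge Ci and Cj
        del clusters[j]                          # update the clustering
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 0.0)]
print(agglomerative(points, 2))  # -> [[0, 1, 2, 3], [4]]
```

Recomputing all pairwise cluster distances on every iteration is O(n^3) or worse; the Lance–Williams updates discussed later avoid this by updating the distance matrix incrementally.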
Distance between Clusters
Single, Complete, and Average

Single Link: The minimum distance between points in the two clusters:

    δ(Ci, Cj) = min{‖x − y‖ | x ∈ Ci, y ∈ Cj}

Complete Link: The maximum distance between points in the two clusters:

    δ(Ci, Cj) = max{‖x − y‖ | x ∈ Ci, y ∈ Cj}

Group Average: The average distance over all pairs of points from the two
clusters, where ni = |Ci| and nj = |Cj|:

    δ(Ci, Cj) = ( Σx∈Ci Σy∈Cj ‖x − y‖ ) / (ni · nj)
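A brute-force sketch of the two linkage functions (helper names are mine):

```python
import math

def single_link(Ci, Cj):
    """Minimum distance between any pair of points across the two clusters."""
    return min(math.dist(x, y) for x in Ci for y in Cj)

def complete_link(Ci, Cj):
    """Maximum distance between any pair of points across the two clusters."""
    return max(math.dist(x, y) for x in Ci for y in Cj)

Ci = [(0.0, 0.0), (1.0, 0.0)]
Cj = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(Ci, Cj))    # closest pair (1,0)-(4,0) -> 3.0
print(complete_link(Ci, Cj))  # farthest pair (0,0)-(6,0) -> 6.0
```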
Distance between Clusters: Mean and Ward’s
Mean Distance: The distance between two clusters is defined as the squared
distance between the means or centroids of the two clusters:

    δ(Ci, Cj) = ‖µi − µj‖²

Ward's Measure: The distance between two clusters is defined as the increase in
SSE when the two clusters are merged:

    δ(Ci, Cj) = (ni · nj)/(ni + nj) · ‖µi − µj‖²
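Ward's measure can be checked numerically against its definition as the increase in SSE upon merging. A small sketch (helper names are mine) computes both sides:

```python
def sse(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    d = len(cluster[0])
    mu = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    return sum((p[i] - mu[i]) ** 2 for p in cluster for i in range(d))

def wards(Ci, Cj):
    """Ward's distance: (ni*nj)/(ni+nj) * ||mu_i - mu_j||^2."""
    ni, nj = len(Ci), len(Cj)
    d = len(Ci[0])
    mi = [sum(p[k] for p in Ci) / ni for k in range(d)]
    mj = [sum(p[k] for p in Cj) / nj for k in range(d)]
    gap = sum((mi[k] - mj[k]) ** 2 for k in range(d))
    return ni * nj / (ni + nj) * gap

Ci = [(0.0, 0.0), (2.0, 0.0)]
Cj = [(5.0, 1.0), (7.0, 1.0), (6.0, 4.0)]
delta_sse = sse(Ci + Cj) - sse(Ci) - sse(Cj)
print(wards(Ci, Cj), delta_sse)  # the two values agree (both 34.8 here)
```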
Single Link Example

Starting from the pairwise distances over the points A, B, C, D, E, we
repeatedly merge the closest pair of clusters and update the distance matrix:

    δ    B  C  D  E
    A    1  3  2  4
    B       3  2  3
    C          1  3
    D             5

Merging A and B (at distance 1):

    δ    C  D  E
    AB   3  2  3
    C       1  3
    D          5

Merging C and D (at distance 1):

    δ    CD  E
    AB   2   3
    CD       3

Merging AB and CD (at distance 2):

    δ      E
    ABCD   3

The final merge, of ABCD and E, occurs at distance 3, yielding the dendrogram
over A, B, C, D, E with merge heights 1, 1, 2, and 3.
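The merge sequence in this example can be replayed in a few lines of Python (a sketch with my own variable names), confirming the merge heights 1, 1, 2, 3:

```python
# Single-link clustering on the example points A..E: repeatedly merge the
# closest pair and update distances via delta(C_ij, C_r) = min(d_ir, d_jr).
dist = {frozenset(p): d for p, d in {
    ("A", "B"): 1, ("A", "C"): 3, ("A", "D"): 2, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 2, ("B", "E"): 3,
    ("C", "D"): 1, ("C", "E"): 3, ("D", "E"): 5}.items()}
clusters = ["A", "B", "C", "D", "E"]
heights = []  # distance at which each merge happens

while len(clusters) > 1:
    ci, cj = min(((a, b) for a in clusters for b in clusters if a < b),
                 key=lambda p: dist[frozenset(p)])
    heights.append(dist[frozenset((ci, cj))])
    merged = ci + cj
    clusters = [c for c in clusters if c not in (ci, cj)] + [merged]
    for cr in clusters[:-1]:  # single-link update to every remaining cluster
        dist[frozenset((merged, cr))] = min(dist[frozenset((ci, cr))],
                                            dist[frozenset((cj, cr))])
    print(f"merged {ci} and {cj}; clusters now {clusters}")

print(heights)  # -> [1, 1, 2, 3]
```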
Lance–Williams Formula
Whenever two clusters Ci and Cj are merged into Cij, we need to update the
distance matrix by recomputing the distances from the newly created cluster Cij
to all other clusters Cr (r ≠ i and r ≠ j).

The Lance–Williams formula provides a general equation to recompute the
distances for all of the cluster proximity measures:

    δ(Cij, Cr) = αi · δ(Ci, Cr) + αj · δ(Cj, Cr) + β · δ(Ci, Cj) + γ · |δ(Ci, Cr) − δ(Cj, Cr)|
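As a sketch, the general formula can be written as a single function (name and signature are mine); choosing the single-link and complete-link coefficients recovers the minimum and maximum:

```python
def lance_williams(d_ir, d_jr, d_ij, ai, aj, beta, gamma):
    """General Lance-Williams update:
    delta(C_ij, C_r) = ai*d(C_i,C_r) + aj*d(C_j,C_r) + beta*d(C_i,C_j)
                       + gamma*|d(C_i,C_r) - d(C_j,C_r)|"""
    return ai * d_ir + aj * d_jr + beta * d_ij + gamma * abs(d_ir - d_jr)

# Single link (ai = aj = 1/2, beta = 0, gamma = -1/2) reduces to the minimum:
print(lance_williams(3.0, 2.0, 1.0, 0.5, 0.5, 0.0, -0.5))  # -> 2.0
# Complete link (gamma = +1/2) reduces to the maximum:
print(lance_williams(3.0, 2.0, 1.0, 0.5, 0.5, 0.0, 0.5))   # -> 3.0
```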
Lance–Williams Formulas for Cluster Proximity
    Measure          αi                    αj                    β                     γ
    Single link      1/2                   1/2                   0                     −1/2
    Complete link    1/2                   1/2                   0                     1/2
    Group average    ni/(ni+nj)            nj/(ni+nj)            0                     0
    Mean distance    ni/(ni+nj)            nj/(ni+nj)            −(ni·nj)/(ni+nj)²     0
    Ward's measure   (ni+nr)/(ni+nj+nr)    (nj+nr)/(ni+nj+nr)    −nr/(ni+nj+nr)        0
Lance–Williams Formulas for Cluster Proximity
Single link: Arithmetical trick to find the minimum.

    δ(Cij, Cr) = δ(Ci, Cr)/2 + δ(Cj, Cr)/2 − |δ(Ci, Cr) − δ(Cj, Cr)|/2

Complete link: Arithmetical trick to find the maximum.

    δ(Cij, Cr) = δ(Ci, Cr)/2 + δ(Cj, Cr)/2 + |δ(Ci, Cr) − δ(Cj, Cr)|/2

Group average: Weight the distance by the cluster size.

    δ(Cij, Cr) = ni/(ni+nj) · δ(Ci, Cr) + nj/(ni+nj) · δ(Cj, Cr)
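The group-average update can be verified numerically: the weighted combination of the two old distances equals the group-average distance computed directly from the merged cluster (a sketch with my own helpers):

```python
import math
from itertools import product

def avg_link(Ci, Cj):
    """Group average: mean pairwise distance between the two clusters."""
    return sum(math.dist(x, y) for x, y in product(Ci, Cj)) / (len(Ci) * len(Cj))

Ci = [(0.0, 0.0)]
Cj = [(2.0, 0.0), (2.0, 2.0)]
Cr = [(5.0, 0.0), (5.0, 3.0), (6.0, 1.0)]

ni, nj = len(Ci), len(Cj)
update = ni / (ni + nj) * avg_link(Ci, Cr) + nj / (ni + nj) * avg_link(Cj, Cr)
direct = avg_link(Ci + Cj, Cr)
print(update, direct)  # the weighted update matches the direct computation
```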
Lance–Williams Formulas for Cluster Proximity
Mean distance: The new centroid lies on the line defined by µi and µj, and its
distance to µr must be adjusted by the term with coefficient −(ni·nj)/(ni+nj)².

    δ(Cij, Cr) = ni/(ni+nj) · δ(Ci, Cr) + nj/(ni+nj) · δ(Cj, Cr) − (ni·nj)/(ni+nj)² · δ(Ci, Cj)

Ward's measure: The ∆SSE of the new cluster is a weighted sum of the ∆SSEs of
the original clusters, adjusted by the fact that nr was considered twice.

    δ(Cij, Cr) = (ni+nr)/(ni+nj+nr) · δ(Ci, Cr) + (nj+nr)/(ni+nj+nr) · δ(Cj, Cr) − nr/(ni+nj+nr) · δ(Ci, Cj)
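Using the mean distance δ(Ci, Cj) = ‖µi − µj‖², the update can be checked against a direct centroid computation (helper names are mine):

```python
def centroid(C):
    """Component-wise mean of the points in the cluster."""
    d = len(C[0])
    return [sum(p[i] for p in C) / len(C) for i in range(d)]

def mean_dist(Ci, Cj):
    """Mean distance: squared distance between the cluster centroids."""
    mi, mj = centroid(Ci), centroid(Cj)
    return sum((a - b) ** 2 for a, b in zip(mi, mj))

Ci = [(0.0, 0.0), (2.0, 0.0)]
Cj = [(4.0, 2.0)]
Cr = [(1.0, 5.0), (3.0, 5.0)]

ni, nj = len(Ci), len(Cj)
update = (ni / (ni + nj) * mean_dist(Ci, Cr)
          + nj / (ni + nj) * mean_dist(Cj, Cr)
          - ni * nj / (ni + nj) ** 2 * mean_dist(Ci, Cj))
direct = mean_dist(Ci + Cj, Cr)
print(update, direct)  # the Lance-Williams update matches the direct computation
```

Note that the β adjustment is exact only because the mean distance is the squared centroid distance; with the unsquared distance the identity would not hold.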
Iris Dataset: Complete Link Clustering
(Figure: the Iris dataset projected onto its first two principal components u1
and u2; the three clusters found by complete-link clustering are plotted as
circles, triangles, and squares.)
Contingency Table:

                     iris-setosa   iris-virginica   iris-versicolor
    C1 (circle)          50               0                 0
    C2 (triangle)         0               1                36
    C3 (square)           0              49                14
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info