
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki¹    Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 14: Hierarchical Clustering

Hierarchical Clustering

The goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram.

The clusters in the hierarchy range from the fine-grained to the coarse-grained –
the lowest level of the tree (the leaves) consists of each point in its own cluster,
whereas the highest level (the root) consists of all points in one cluster.

Agglomerative hierarchical clustering methods work in a bottom-up manner.


Starting with each of the n points in a separate cluster, they repeatedly merge the
most similar pair of clusters until all points are members of the same cluster.

Hierarchical Clustering: Nested Partitions

Given a dataset D = {x1, ..., xn}, where xi ∈ R^d, a clustering C = {C1, ..., Ck} is a partition of D.

A clustering A = {A1, ..., Ar} is said to be nested in another clustering B = {B1, ..., Bs} if and only if r > s, and for each cluster Ai ∈ A there exists a cluster Bj ∈ B such that Ai ⊆ Bj.

Hierarchical clustering yields a sequence of n nested partitions C1, ..., Cn, where the clustering Ct−1 is nested in the clustering Ct. The cluster dendrogram is a rooted binary tree that captures this nesting structure, with an edge between cluster Ci ∈ Ct−1 and cluster Cj ∈ Ct if Ci is nested in Cj, that is, if Ci ⊂ Cj.
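The nestedness condition is easy to check directly; below is a minimal Python sketch (the function name is illustrative, not from the text).

def is_nested(A, B):
    """True if clustering A is nested in clustering B: A has more clusters
    than B, and every cluster of A is contained in some cluster of B."""
    A = [set(a) for a in A]
    B = [set(b) for b in B]
    return len(A) > len(B) and all(any(a <= b for b in B) for a in A)

# The clusterings C3 = {AB, CD, E} and C4 = {ABCD, E} from the example that follows:
C3 = [{"A", "B"}, {"C", "D"}, {"E"}]
C4 = [{"A", "B", "C", "D"}, {"E"}]
print(is_nested(C3, C4))   # True
print(is_nested(C4, C3))   # False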

Hierarchical Clustering Dendrogram

The dendrogram represents the following sequence of nested partitions:

Clustering   Clusters
C1           {A}, {B}, {C}, {D}, {E}
C2           {AB}, {C}, {D}, {E}
C3           {AB}, {CD}, {E}
C4           {ABCD}, {E}
C5           {ABCDE}

with Ct−1 ⊂ Ct for t = 2, ..., 5. We assume that A and B are merged before C and D.

[Figure: the dendrogram over the points A, B, C, D, E, with internal nodes AB, CD, ABCD, and ABCDE.]

Number of Hierarchical Clusterings
The total number of different dendrograms with n leaves is given as:
∏_{m=1}^{n−1} (2m − 1) = 1 × 3 × 5 × 7 × ··· × (2n − 3) = (2n − 3)!!

[Figure: all possible dendrograms on (a) n = 1, (b) n = 2, and (c) n = 3 labeled leaves: there are 1, 1, and 3 distinct dendrograms, respectively.]
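As a quick illustration of how fast this count grows, here is a small Python sketch of the product above (the helper name is illustrative).

def num_dendrograms(n):
    """Number of distinct dendrograms with n leaves: (2n - 3)!! for n >= 2."""
    count = 1
    for m in range(1, n):          # product of (2m - 1) for m = 1, ..., n - 1
        count *= 2 * m - 1
    return count

for n in (1, 2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 1, 3, 15, 105, 34459425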

Agglomerative Hierarchical Clustering

In agglomerative hierarchical clustering, we begin with each of the n points in a separate cluster. We repeatedly merge the two closest clusters until all points are members of the same cluster.

Given a set of clusters C = {C1, C2, ..., Cm}, we find the closest pair of clusters Ci and Cj and merge them into a new cluster Cij = Ci ∪ Cj.

Next, we update the set of clusters by removing Ci and Cj and adding Cij, as follows: C = (C \ {Ci, Cj}) ∪ {Cij}.

We repeat the process until C contains only one cluster. If a value k is specified, we can instead stop the merging process when exactly k clusters remain.

Agglomerative Hierarchical Clustering Algorithm

AgglomerativeClustering(D, k):
1 C ← {Ci = {xi} | xi ∈ D} // Each point in separate cluster
2 ∆ ← {δ(xi, xj) : xi, xj ∈ D} // Compute distance matrix
3 repeat
4     Find the closest pair of clusters Ci, Cj ∈ C
5     Cij ← Ci ∪ Cj // Merge the clusters
6     C ← (C \ {Ci, Cj}) ∪ {Cij} // Update the clustering
7     Update distance matrix ∆ to reflect new clustering
8 until |C| = k
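Below is a minimal Python sketch of this algorithm, assuming Euclidean distances and single- or complete-link merging; it recomputes cluster distances naively from the point distance matrix rather than maintaining the update of line 7, and all names are illustrative.

import numpy as np

def agglomerative(D, k, linkage="single"):
    """Naive O(n^3) agglomerative clustering on an (n, d) data array D.

    Returns a list of k clusters, each a list of point indices.
    `linkage` may be "single" or "complete"; both use Euclidean distance."""
    n = len(D)
    # Pairwise point distances (the distance matrix Delta).
    delta = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    clusters = [[i] for i in range(n)]          # each point in its own cluster

    def cluster_dist(A, B):
        pair = delta[np.ix_(A, B)]              # all point-to-point distances
        return pair.min() if linkage == "single" else pair.max()

    while len(clusters) > k:
        # Find the closest pair of clusters.
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[i] + clusters[j]      # Cij = Ci ∪ Cj
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)                 # update the clustering
    return clusters

# Example: five 1-D points; k = 2 gives the clusters {0, 1, 2} and {3, 4} under single link.
X = np.array([[0.0], [1.0], [2.0], [6.0], [7.0]])
print(agglomerative(X, k=2))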

Distance between Clusters
Single, Complete and Average

A typical distance between two points is the Euclidean distance or L2-norm:

‖x − y‖2 = ( Σ_{i=1}^{d} (xi − yi)² )^{1/2}

Single Link: The minimum distance between a point in Ci and a point in Cj:

δ(Ci, Cj) = min{ ‖x − y‖ | x ∈ Ci, y ∈ Cj }

Complete Link: The maximum distance between points in the two clusters:

δ(Ci, Cj) = max{ ‖x − y‖ | x ∈ Ci, y ∈ Cj }

Group Average: The average pairwise distance between points in Ci and Cj:

δ(Ci, Cj) = ( Σ_{x ∈ Ci} Σ_{y ∈ Cj} ‖x − y‖ ) / (ni · nj)
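These three measures can be computed directly from the pairwise point distances; the following is a minimal NumPy sketch (the function name is illustrative).

import numpy as np

def linkage_distances(Ci, Cj):
    """Single-, complete-, and average-link distances between two clusters,
    given as (ni, d) and (nj, d) NumPy arrays of points (Euclidean metric)."""
    # ni x nj matrix of all pairwise point distances ||x - y||.
    pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)
    return {
        "single": pair.min(),     # min over x in Ci, y in Cj
        "complete": pair.max(),   # max over x in Ci, y in Cj
        "average": pair.mean(),   # sum of pairwise distances / (ni * nj)
    }

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[3.0, 0.0], [4.0, 0.0]])
print(linkage_distances(Ci, Cj))   # {'single': 2.0, 'complete': 4.0, 'average': 3.0}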

Distance between Clusters: Mean and Ward’s
Mean Distance: The distance between two clusters is defined as the distance between the means or centroids of the two clusters:

δ(Ci, Cj) = ‖µi − µj‖

Minimum Variance or Ward's Method: The distance between two clusters is defined as the increase in the sum of squared errors (SSE) when the two clusters are merged:

δ(Ci, Cj) = ∆SSEij = SSEij − SSEi − SSEj

where the SSE for a given cluster Ci is given as SSEi = Σ_{x ∈ Ci} ‖x − µi‖². After simplification, we get:

δ(Ci, Cj) = ( ni · nj / (ni + nj) ) · ‖µi − µj‖²

Ward's measure is therefore a weighted version of the mean distance measure.
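A small numeric check of the simplified formula, as a Python sketch (helper names are illustrative): merging {(0,0), (2,0)} with {(5,0)} increases the SSE by 32/3, which matches ni · nj / (ni + nj) · ‖µi − µj‖².

import numpy as np

def sse(C):
    """Sum of squared errors of a cluster (rows of C) about its centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

def ward_distance(Ci, Cj):
    """Ward's distance as the increase in SSE when Ci and Cj are merged."""
    return sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)

Ci = np.array([[0.0, 0.0], [2.0, 0.0]])   # centroid (1, 0), SSE = 2
Cj = np.array([[5.0, 0.0]])               # centroid (5, 0), SSE = 0
mu_i, mu_j = Ci.mean(axis=0), Cj.mean(axis=0)
ni, nj = len(Ci), len(Cj)

print(ward_distance(Ci, Cj))                               # 32/3 ≈ 10.667
print(ni * nj / (ni + nj) * np.sum((mu_i - mu_j) ** 2))    # same value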


Single Link Agglomerative Clustering
The figure below shows single-link agglomerative clustering of five points A, B, C, D, E. The distance matrix is updated after each merge:

Initial distances:
      B  C  D  E
A     1  3  2  4
B        3  2  3
C           1  3
D              5

After merging A and B (at distance 1):
      C  D  E
AB    3  2  3
C        1  3
D           5

After merging C and D (at distance 1):
      CD  E
AB     2  3
CD        3

After merging AB and CD (at distance 2):
       E
ABCD   3

[Figure: the corresponding dendrogram, in which A and B merge at height 1, C and D at height 1, AB and CD at height 2, and finally ABCD and E at height 3.]
Lance–Williams Formula

Whenever two clusters Ci and Cj are merged into Cij, we need to update the distance matrix by recomputing the distances from the newly created cluster Cij to all other clusters Cr (r ≠ i and r ≠ j).

The Lance–Williams formula provides a general equation to recompute the distances for all of the cluster proximity measures:

δ(Cij, Cr) = αi · δ(Ci, Cr) + αj · δ(Cj, Cr) + β · δ(Ci, Cj) + γ · |δ(Ci, Cr) − δ(Cj, Cr)|

The coefficients αi, αj, β, and γ differ from one measure to another.

Lance–Williams Formulas for Cluster Proximity

Measure           αi                        αj                        β                        γ
Single link       1/2                       1/2                       0                        −1/2
Complete link     1/2                       1/2                       0                        1/2
Group average     ni/(ni + nj)              nj/(ni + nj)              0                        0
Mean distance     ni/(ni + nj)              nj/(ni + nj)              −(ni · nj)/(ni + nj)²    0
Ward's measure    (ni + nr)/(ni + nj + nr)  (nj + nr)/(ni + nj + nr)  −nr/(ni + nj + nr)       0
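The table translates directly into code; below is a minimal Python sketch of the Lance–Williams update with these coefficients (the function name and measure keys are illustrative).

def lance_williams(d_ir, d_jr, d_ij, ni, nj, nr, measure="single"):
    """Distance from the merged cluster Cij to another cluster Cr,
    via the Lance-Williams update with the coefficients from the table."""
    coeffs = {
        # measure: (alpha_i, alpha_j, beta, gamma)
        "single":   (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0,  0.5),
        "average":  (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
        "mean":     (ni / (ni + nj), nj / (ni + nj),
                     -ni * nj / (ni + nj) ** 2, 0.0),
        "ward":     ((ni + nr) / (ni + nj + nr), (nj + nr) / (ni + nj + nr),
                     -nr / (ni + nj + nr), 0.0),
    }
    ai, aj, beta, gamma = coeffs[measure]
    return ai * d_ir + aj * d_jr + beta * d_ij + gamma * abs(d_ir - d_jr)

# Example from the single-link slides: delta(AB, C) from delta(A,C) = 3,
# delta(B,C) = 3, delta(A,B) = 1 with single link gives min(3, 3) = 3.
print(lance_williams(3, 3, 1, ni=1, nj=1, nr=1, measure="single"))   # 3.0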

Lance–Williams Formulas for Cluster Proximity

Single link: Arithmetic trick to find the minimum:

δ(Cij, Cr) = δ(Ci, Cr)/2 + δ(Cj, Cr)/2 − |δ(Ci, Cr) − δ(Cj, Cr)|/2

Complete link: Arithmetic trick to find the maximum:

δ(Cij, Cr) = δ(Ci, Cr)/2 + δ(Cj, Cr)/2 + |δ(Ci, Cr) − δ(Cj, Cr)|/2

Group average: Weight the distance by the cluster size:

δ(Cij, Cr) = ( ni/(ni + nj) ) · δ(Ci, Cr) + ( nj/(ni + nj) ) · δ(Cj, Cr)

Lance–Williams Formulas for Cluster Proximity

Mean distance: The new centroid lies on the line defined by µi and µj, and its distance to µr has to be adjusted by the factor (ni · nj)/(ni + nj)²:

δ(Cij, Cr) = ( ni/(ni + nj) ) · δ(Ci, Cr) + ( nj/(ni + nj) ) · δ(Cj, Cr) − ( (ni · nj)/(ni + nj)² ) · δ(Ci, Cj)

Ward's measure: The ∆SSE of the new cluster is a weighted sum of the ∆SSEs of the original clusters, adjusted by the fact that nr was considered twice:

δ(Cij, Cr) = ( (ni + nr)/(ni + nj + nr) ) · δ(Ci, Cr) + ( (nj + nr)/(ni + nj + nr) ) · δ(Cj, Cr) − ( nr/(ni + nj + nr) ) · δ(Ci, Cj)

Iris Dataset: Complete Link Clustering
[Figure: scatter plot of the Iris dataset in the (u1, u2) plane; circles, triangles, and squares mark the three clusters found by complete-link clustering.]

Contingency Table:

                 iris-setosa   iris-virginica   iris-versicolor
C1 (circle)           50              0                 0
C2 (triangle)          0              1                36
C3 (square)            0             49                14
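For reference, a clustering and contingency table of this kind can be reproduced with scikit-learn, as in the sketch below; the slide's figure uses a 2-D projection of the data, so the exact counts may differ from the table above depending on preprocessing.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
# Complete-link agglomerative clustering into k = 3 clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(iris.data)

# Contingency table: cluster label vs. true species.
species = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))
print(pd.crosstab(pd.Series(labels, name="cluster"), species.rename("species")))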
