
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki¹    Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 14: Hierarchical Clustering

Hierarchical Clustering

The goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram.

The clusters in the hierarchy range from the fine-grained to the coarse-grained –
the lowest level of the tree (the leaves) consists of each point in its own cluster,
whereas the highest level (the root) consists of all points in one cluster.

Agglomerative hierarchical clustering methods work in a bottom-up manner.


Starting with each of the n points in a separate cluster, they repeatedly merge the
most similar pair of clusters until all points are members of the same cluster.

Hierarchical Clustering: Nested Partitions

Given a dataset D = {x1, ..., xn}, where xi ∈ R^d, a clustering C = {C1, ..., Ck} is a partition of D.

A clustering A = {A1, ..., Ar} is said to be nested in another clustering B = {B1, ..., Bs} if and only if r > s, and for each cluster Ai ∈ A there exists a cluster Bj ∈ B such that Ai ⊆ Bj.

Hierarchical clustering yields a sequence of n nested partitions C1, ..., Cn, where the clustering Ct−1 is nested in the clustering Ct. The cluster dendrogram is a rooted binary tree that captures this nesting structure, with an edge between cluster Ci ∈ Ct−1 and cluster Cj ∈ Ct if Ci is nested in Cj, that is, if Ci ⊂ Cj.
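The nestedness condition is easy to check directly; below is a minimal Python sketch (the function name is illustrative, not from the text).

def is_nested(A, B):
    """True if clustering A is nested in clustering B: A has more clusters
    than B, and every cluster of A is contained in some cluster of B."""
    A = [set(a) for a in A]
    B = [set(b) for b in B]
    return len(A) > len(B) and all(any(a <= b for b in B) for a in A)

# The clusterings C3 = {AB, CD, E} and C4 = {ABCD, E} from the example that follows:
C3 = [{"A", "B"}, {"C", "D"}, {"E"}]
C4 = [{"A", "B", "C", "D"}, {"E"}]
print(is_nested(C3, C4))   # True
print(is_nested(C4, C3))   # False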

Hierarchical Clustering Dendrogram

The dendrogram represents the following sequence of nested partitions:

Clustering   Clusters
C1           {A}, {B}, {C}, {D}, {E}
C2           {AB}, {C}, {D}, {E}
C3           {AB}, {CD}, {E}
C4           {ABCD}, {E}
C5           {ABCDE}

with Ct−1 ⊂ Ct for t = 2, ..., 5. We assume that A and B are merged before C and D.

[Figure: the dendrogram over the points A, B, C, D, E, with internal nodes AB, CD, ABCD, and ABCDE.]

Number of Hierarchical Clusterings
The total number of different dendrograms with n leaves is given as:
∏_{m=1}^{n−1} (2m − 1) = 1 × 3 × 5 × 7 × ··· × (2n − 3) = (2n − 3)!!

[Figure: all possible dendrograms on (a) n = 1, (b) n = 2, and (c) n = 3 labeled leaves: there are 1, 1, and 3 distinct dendrograms, respectively.]
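As a quick illustration of how fast this count grows, here is a small Python sketch of the product above (the helper name is illustrative).

def num_dendrograms(n):
    """Number of distinct dendrograms with n leaves: (2n - 3)!! for n >= 2."""
    count = 1
    for m in range(1, n):          # product of (2m - 1) for m = 1, ..., n - 1
        count *= 2 * m - 1
    return count

for n in (1, 2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 1, 3, 15, 105, 34459425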

Agglomerative Hierarchical Clustering

In agglomerative hierarchical clustering, we begin with each of the n points in a separate cluster. We repeatedly merge the two closest clusters until all points are members of the same cluster.

Given a set of clusters C = {C1, C2, ..., Cm}, we find the closest pair of clusters Ci and Cj and merge them into a new cluster Cij = Ci ∪ Cj.

Next, we update the set of clusters by removing Ci and Cj and adding Cij, as follows: C = (C \ {Ci, Cj}) ∪ {Cij}.

We repeat the process until C contains only one cluster. If a value k is specified, we can instead stop the merging process when exactly k clusters remain.

Agglomerative Hierarchical Clustering Algorithm

AgglomerativeClustering(D, k):
1 C ← {Ci = {xi} | xi ∈ D} // Each point in separate cluster
2 ∆ ← {δ(xi, xj) : xi, xj ∈ D} // Compute distance matrix
3 repeat
4     Find the closest pair of clusters Ci, Cj ∈ C
5     Cij ← Ci ∪ Cj // Merge the clusters
6     C ← (C \ {Ci, Cj}) ∪ {Cij} // Update the clustering
7     Update distance matrix ∆ to reflect new clustering
8 until |C| = k
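Below is a minimal Python sketch of this algorithm, assuming Euclidean distances and single- or complete-link merging; it recomputes cluster distances naively from the point distance matrix rather than maintaining the update of line 7, and all names are illustrative.

import numpy as np

def agglomerative(D, k, linkage="single"):
    """Naive O(n^3) agglomerative clustering on an (n, d) data array D.

    Returns a list of k clusters, each a list of point indices.
    `linkage` may be "single" or "complete"; both use Euclidean distance."""
    n = len(D)
    # Pairwise point distances (the distance matrix Delta).
    delta = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    clusters = [[i] for i in range(n)]          # each point in its own cluster

    def cluster_dist(A, B):
        pair = delta[np.ix_(A, B)]              # all point-to-point distances
        return pair.min() if linkage == "single" else pair.max()

    while len(clusters) > k:
        # Find the closest pair of clusters.
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[i] + clusters[j]      # Cij = Ci ∪ Cj
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)                 # update the clustering
    return clusters

# Example: five 1-D points; k = 2 gives the clusters {0, 1, 2} and {3, 4} under single link.
X = np.array([[0.0], [1.0], [2.0], [6.0], [7.0]])
print(agglomerative(X, k=2))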

Distance between Clusters
Single, Complete and Average

A typical distance between two points is the Euclidean distance or L2-norm:

‖x − y‖2 = ( Σ_{i=1}^{d} (xi − yi)² )^{1/2}

Single Link: The minimum distance between a point in Ci and a point in Cj:

δ(Ci, Cj) = min{ ‖x − y‖ | x ∈ Ci, y ∈ Cj }

Complete Link: The maximum distance between points in the two clusters:

δ(Ci, Cj) = max{ ‖x − y‖ | x ∈ Ci, y ∈ Cj }

Group Average: The average pairwise distance between points in Ci and Cj:

δ(Ci, Cj) = ( Σ_{x ∈ Ci} Σ_{y ∈ Cj} ‖x − y‖ ) / (ni · nj)
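These three measures can be computed directly from the pairwise point distances; the following is a minimal NumPy sketch (the function name is illustrative).

import numpy as np

def linkage_distances(Ci, Cj):
    """Single-, complete-, and average-link distances between two clusters,
    given as (ni, d) and (nj, d) NumPy arrays of points (Euclidean metric)."""
    # ni x nj matrix of all pairwise point distances ||x - y||.
    pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)
    return {
        "single": pair.min(),     # min over x in Ci, y in Cj
        "complete": pair.max(),   # max over x in Ci, y in Cj
        "average": pair.mean(),   # sum of pairwise distances / (ni * nj)
    }

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[3.0, 0.0], [4.0, 0.0]])
print(linkage_distances(Ci, Cj))   # {'single': 2.0, 'complete': 4.0, 'average': 3.0}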

Distance between Clusters: Mean and Ward’s
Mean Distance: The distance between two clusters is defined as the distance between the means or centroids of the two clusters:

δ(Ci, Cj) = ‖µi − µj‖

Minimum Variance or Ward's Method: The distance between two clusters is defined as the increase in the sum of squared errors (SSE) when the two clusters are merged:

δ(Ci, Cj) = ∆SSEij = SSEij − SSEi − SSEj

where the SSE for a given cluster Ci is given as SSEi = Σ_{x ∈ Ci} ‖x − µi‖². After simplification, we get:

δ(Ci, Cj) = ( ni · nj / (ni + nj) ) · ‖µi − µj‖²

Ward's measure is therefore a weighted version of the mean distance measure.
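A small numeric check of the simplified formula, as a Python sketch (helper names are illustrative): merging {(0,0), (2,0)} with {(5,0)} increases the SSE by 32/3, which matches ni · nj / (ni + nj) · ‖µi − µj‖².

import numpy as np

def sse(C):
    """Sum of squared errors of a cluster (rows of C) about its centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

def ward_distance(Ci, Cj):
    """Ward's distance as the increase in SSE when Ci and Cj are merged."""
    return sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)

Ci = np.array([[0.0, 0.0], [2.0, 0.0]])   # centroid (1, 0), SSE = 2
Cj = np.array([[5.0, 0.0]])               # centroid (5, 0), SSE = 0
mu_i, mu_j = Ci.mean(axis=0), Cj.mean(axis=0)
ni, nj = len(Ci), len(Cj)

print(ward_distance(Ci, Cj))                               # 32/3 ≈ 10.667
print(ni * nj / (ni + nj) * np.sum((mu_i - mu_j) ** 2))    # same value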


Single Link Agglomerative Clustering
The figure below shows single-link agglomerative clustering of five points A, B, C, D, E. The distance matrix is updated after each merge:

Initial distances:
      B  C  D  E
A     1  3  2  4
B        3  2  3
C           1  3
D              5

After merging A and B (at distance 1):
      C  D  E
AB    3  2  3
C        1  3
D           5

After merging C and D (at distance 1):
      CD  E
AB     2  3
CD        3

After merging AB and CD (at distance 2):
       E
ABCD   3

[Figure: the corresponding dendrogram, in which A and B merge at height 1, C and D at height 1, AB and CD at height 2, and finally ABCD and E at height 3.]
Lance–Williams Formula

Whenever two clusters Ci and Cj are merged into Cij, we need to update the distance matrix by recomputing the distances from the newly created cluster Cij to all other clusters Cr (r ≠ i and r ≠ j).

The Lance–Williams formula provides a general equation to recompute the distances for all of the cluster proximity measures:

δ(Cij, Cr) = αi · δ(Ci, Cr) + αj · δ(Cj, Cr) + β · δ(Ci, Cj) + γ · |δ(Ci, Cr) − δ(Cj, Cr)|

The coefficients αi, αj, β, and γ differ from one measure to another.

Lance–Williams Formulas for Cluster Proximity

Measure           αi                        αj                        β                        γ
Single link       1/2                       1/2                       0                        −1/2
Complete link     1/2                       1/2                       0                        1/2
Group average     ni/(ni + nj)              nj/(ni + nj)              0                        0
Mean distance     ni/(ni + nj)              nj/(ni + nj)              −(ni · nj)/(ni + nj)²    0
Ward's measure    (ni + nr)/(ni + nj + nr)  (nj + nr)/(ni + nj + nr)  −nr/(ni + nj + nr)       0
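The table translates directly into code; below is a minimal Python sketch of the Lance–Williams update with these coefficients (the function name and measure keys are illustrative).

def lance_williams(d_ir, d_jr, d_ij, ni, nj, nr, measure="single"):
    """Distance from the merged cluster Cij to another cluster Cr,
    via the Lance-Williams update with the coefficients from the table."""
    coeffs = {
        # measure: (alpha_i, alpha_j, beta, gamma)
        "single":   (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0,  0.5),
        "average":  (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
        "mean":     (ni / (ni + nj), nj / (ni + nj),
                     -ni * nj / (ni + nj) ** 2, 0.0),
        "ward":     ((ni + nr) / (ni + nj + nr), (nj + nr) / (ni + nj + nr),
                     -nr / (ni + nj + nr), 0.0),
    }
    ai, aj, beta, gamma = coeffs[measure]
    return ai * d_ir + aj * d_jr + beta * d_ij + gamma * abs(d_ir - d_jr)

# Example from the single-link slides: delta(AB, C) from delta(A,C) = 3,
# delta(B,C) = 3, delta(A,B) = 1 with single link gives min(3, 3) = 3.
print(lance_williams(3, 3, 1, ni=1, nj=1, nr=1, measure="single"))   # 3.0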

Lance–Williams Formulas for Cluster Proximity

Single link: Arithmetic trick to find the minimum:

δ(Cij, Cr) = δ(Ci, Cr)/2 + δ(Cj, Cr)/2 − |δ(Ci, Cr) − δ(Cj, Cr)|/2

Complete link: Arithmetic trick to find the maximum:

δ(Cij, Cr) = δ(Ci, Cr)/2 + δ(Cj, Cr)/2 + |δ(Ci, Cr) − δ(Cj, Cr)|/2

Group average: Weight the distance by the cluster size:

δ(Cij, Cr) = ( ni/(ni + nj) ) · δ(Ci, Cr) + ( nj/(ni + nj) ) · δ(Cj, Cr)

Lance–Williams Formulas for Cluster Proximity

Mean distance: The new centroid lies on the line defined by µi and µj, and its distance to µr has to be adjusted by the factor (ni · nj)/(ni + nj)²:

δ(Cij, Cr) = ( ni/(ni + nj) ) · δ(Ci, Cr) + ( nj/(ni + nj) ) · δ(Cj, Cr) − ( (ni · nj)/(ni + nj)² ) · δ(Ci, Cj)

Ward's measure: The ∆SSE of the new cluster is a weighted sum of the ∆SSEs of the original clusters, adjusted by the fact that nr was considered twice:

δ(Cij, Cr) = ( (ni + nr)/(ni + nj + nr) ) · δ(Ci, Cr) + ( (nj + nr)/(ni + nj + nr) ) · δ(Cj, Cr) − ( nr/(ni + nj + nr) ) · δ(Ci, Cj)

Iris Dataset: Complete Link Clustering
[Figure: scatter plot of the Iris dataset in the (u1, u2) plane; circles, triangles, and squares mark the three clusters found by complete-link clustering.]

Contingency Table:

                 iris-setosa   iris-virginica   iris-versicolor
C1 (circle)           50              0                 0
C2 (triangle)          0              1                36
C3 (square)            0             49                14
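For reference, a clustering and contingency table of this kind can be reproduced with scikit-learn, as in the sketch below; the slide's figure uses a 2-D projection of the data, so the exact counts may differ from the table above depending on preprocessing.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
# Complete-link agglomerative clustering into k = 3 clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(iris.data)

# Contingency table: cluster label vs. true species.
species = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))
print(pd.crosstab(pd.Series(labels, name="cluster"), species.rename("species")))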
