
HIERARCHICAL CLUSTERING
CONCEPT

• A store collected the ages of 9 of its customers, labelled C1-C9.
• It also recorded the amount each of them spent at the store in the last month.
CONCEPT
• Now the store wants to identify segments or clusters of its customers to
better understand them and their needs.
• CLUSTER CREATION AND DENDROGRAMS

• We start by making every single data point a cluster.
• This forms 9 clusters.
CONCEPT

• Take the two closest clusters and make them one cluster.
• Since C2 and C3 are closest, they form a cluster.
• This gives us a total of 8 clusters.
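The merge step above can be sketched in Python as follows. The (age, spend) values are made up for illustration, since the slides do not list the store's actual numbers; Euclidean distance and the helper name `closest_pair` are also assumptions.

```python
from itertools import combinations
from math import dist

# Hypothetical (age, monthly spend) values; not the store's real data.
customers = {
    "C1": (20, 15), "C2": (23, 18), "C3": (24, 18),
    "C4": (35, 60), "C5": (38, 66), "C6": (39, 65),
    "C7": (55, 20), "C8": (60, 22), "C9": (63, 17),
}

def closest_pair(points):
    """Return the two labels whose points have the smallest Euclidean distance."""
    return min(combinations(points, 2),
               key=lambda pair: dist(points[pair[0]], points[pair[1]]))

a, b = closest_pair(customers)
print(a, b, dist(customers[a], customers[b]))  # C2 C3 1.0
```

With these made-up values C2 and C3 are the closest pair, so they would be merged first, leaving 8 clusters.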
CONCEPT
• DENDROGRAM: a diagram that shows the hierarchical relationship between objects.
• It is most commonly created as an output from hierarchical clustering

• The x-axis represents the points (customers in our case), and the y-axis is the distance between the clusters.
CONCEPT
• Keep repeating the process: take the two closest clusters (C5 and C6), make them one cluster, and plot this on the dendrogram (7 clusters).
CONCEPT
6 CLUSTERS

5 CLUSTERS
CONCEPT

4 CLUSTERS

3 CLUSTERS
CONCEPT

2 CLUSTERS

1 CLUSTER
Dendrogram
• A binary tree that shows how clusters are merged/split hierarchically
• Each node on the tree is a cluster; each leaf node is a singleton cluster

Dendrogram
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

How to Merge Clusters?
• How to measure the distance between clusters?
• Single-link
• Complete-link
• Average-link
• Centroid distance

Hint: Distance between clusters is usually defined on the basis of distance between objects.

How to Define Inter-Cluster Distance

• Single-link

$$d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$$

The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters.
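A minimal Python sketch of the single-link distance, assuming Euclidean distance for d(p, q). The two clusters are hypothetical 2-D points chosen so the result is easy to check by hand.

```python
from math import dist

def single_link(ci, cj):
    """Distance of the closest pair (p, q) with p in ci and q in cj."""
    return min(dist(p, q) for p in ci for q in cj)

# Two hypothetical clusters of 2-D points.
ci = [(0, 0), (0, 4)]
cj = [(3, 0), (3, 4)]
print(single_link(ci, cj))  # 3.0
```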
How to Define Inter-Cluster Distance

• Complete-link

$$d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$$

The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters.
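A minimal Python sketch of the complete-link distance, again assuming Euclidean distance for d(p, q) and two hypothetical clusters of 2-D points.

```python
from math import dist

def complete_link(ci, cj):
    """Distance of the farthest pair (p, q) with p in ci and q in cj."""
    return max(dist(p, q) for p in ci for q in cj)

# Two hypothetical clusters of 2-D points.
ci = [(0, 0), (0, 4)]
cj = [(3, 0), (3, 4)]
print(complete_link(ci, cj))  # 5.0
```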
How to Define Inter-Cluster Distance

• Average-link

$$d_{\text{avg}}(C_i, C_j) = \operatorname*{avg}_{p \in C_i,\, q \in C_j} d(p, q) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$

The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters.
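A minimal Python sketch of the average-link distance, assuming Euclidean distance for d(p, q). The clusters are hypothetical 2-D points.

```python
from math import dist

def average_link(ci, cj):
    """Mean distance over all pairs (p, q) with p in ci and q in cj."""
    pair_distances = [dist(p, q) for p in ci for q in cj]
    return sum(pair_distances) / len(pair_distances)

# Two hypothetical clusters of 2-D points.
ci = [(0, 0), (0, 4)]
cj = [(3, 0), (3, 4)]
print(average_link(ci, cj))  # 4.0
```

Here the four pairwise distances are 3, 5, 5 and 3, so their average is 4.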
How to Define Inter-Cluster Distance

• Centroid distance

$$d_{\text{mean}}(C_i, C_j) = d(m_i, m_j)$$

where $m_i$, $m_j$ are the means of $C_i$, $C_j$.

The distance between two clusters is represented by the distance between the means of the clusters.
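A minimal Python sketch of the centroid distance, assuming Euclidean distance between the means. The helper name `centroid` and the two clusters are hypothetical.

```python
from math import dist

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_distance(ci, cj):
    """Distance between the means m_i and m_j of the two clusters."""
    return dist(centroid(ci), centroid(cj))

# Two hypothetical clusters of 2-D points.
ci = [(0, 0), (0, 4)]
cj = [(3, 0), (3, 4)]
print(centroid(ci), centroid(cj))   # (0.0, 2.0) (3.0, 2.0)
print(centroid_distance(ci, cj))    # 3.0
```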

Hierarchical Clustering: Comparison
[Figure: four panels comparing the clusterings produced by Single-link, Complete-link, Average-link, and Centroid distance]
Hierarchical Clustering

• Agglomerative approach (bottom-up)
Initialization: Each object is a cluster.
Iteration: Merge the two clusters which are most similar to each other, until all objects are merged into a single cluster.

[Figure: objects a, b, c, d, e merged from Step 0 to Step 4 into ab, de, cde, and finally abcde]
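The bottom-up loop can be sketched in plain Python as below. This is a naive illustration rather than an efficient implementation: it assumes single-link distance with Euclidean d(p, q), the function names are hypothetical, and the customer values are made up because the slides do not list them.

```python
from itertools import combinations
from math import dist

def single_link(points, ci, cj):
    """Closest-pair distance between two clusters given as tuples of labels."""
    return min(dist(points[p], points[q]) for p in ci for q in cj)

def agglomerate(points, linkage=single_link):
    """Bottom-up clustering; returns the merges as (distance, cluster_a, cluster_b)."""
    clusters = [(label,) for label in points]   # initialization: each object is a cluster
    merges = []
    while len(clusters) > 1:                    # until all objects are in one cluster
        ci, cj = min(combinations(clusters, 2),
                     key=lambda pair: linkage(points, *pair))
        merges.append((linkage(points, ci, cj), ci, cj))
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci + cj]
    return merges

# Hypothetical (age, monthly spend) values; not the store's real data.
customers = {
    "C1": (20, 15), "C2": (23, 18), "C3": (24, 18),
    "C4": (35, 60), "C5": (38, 66), "C6": (39, 65),
    "C7": (55, 20), "C8": (60, 22), "C9": (63, 17),
}

merges = agglomerate(customers)
for step, (d, ci, cj) in enumerate(merges, start=1):
    print(f"{len(customers) - step} clusters: merged {ci} + {cj} at distance {d:.2f}")
```

The recorded merge distances are the heights at which the corresponding branches would join in a dendrogram.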

Hierarchical Clustering

• Divisive approach (top-down)
Initialization: All objects stay in one cluster.
Iteration: Select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object.

[Figure: the same tree read from Step 4 down to Step 0, splitting abcde into ab and cde, then cde into c and de, down to the single objects a, b, c, d, e]
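The slides do not say how a cluster is selected or split, so the Python sketch below fills both gaps with assumed rules: it always splits the cluster with the largest diameter, and it splits around the two farthest-apart members, assigning every other object to the nearer of the two. The coordinates for a to e are hypothetical and assumed distinct.

```python
from itertools import combinations
from math import dist

def diameter(points, cluster):
    """Largest pairwise distance inside a cluster (0 for a singleton)."""
    return max((dist(points[p], points[q]) for p, q in combinations(cluster, 2)),
               default=0.0)

def split(points, cluster):
    """Split around the two farthest-apart members (an assumed rule)."""
    s1, s2 = max(combinations(cluster, 2),
                 key=lambda pair: dist(points[pair[0]], points[pair[1]]))
    left = tuple(p for p in cluster
                 if dist(points[p], points[s1]) <= dist(points[p], points[s2]))
    right = tuple(p for p in cluster if p not in left)
    return left, right

def divide(points):
    """Top-down clustering; returns the splits in the order they were made."""
    clusters = [tuple(points)]                 # initialization: one cluster with everything
    splits = []
    while any(len(c) > 1 for c in clusters):   # until every leaf holds one object
        target = max(clusters, key=lambda c: diameter(points, c))
        left, right = split(points, target)
        splits.append((left, right))
        clusters = [c for c in clusters if c != target] + [left, right]
    return splits

# Hypothetical 2-D coordinates for the five objects in the figure.
objects = {"a": (1, 1), "b": (2, 1), "c": (6, 5), "d": (8, 6), "e": (9, 7)}
splits = divide(objects)
for left, right in splits:
    print(left, "|", right)
```

With these coordinates the first split separates ab from cde, matching the top of the tree in the figure.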

CONCEPT: FINAL DENDROGRAM

CONCEPT
• We don't want clusters with distances greater than 3.
• So we draw a threshold line at 3, which leaves us with only the 3 clusters below the threshold line.

CONCEPT

THRESHOLD SET TO 5: WE GET 2 CLUSTERS
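Cutting at a threshold can be sketched in Python by replaying only the merges whose height is at or below the threshold. The merge heights below are hypothetical, chosen only so that thresholds 3 and 5 give 3 and 2 clusters as on the slides; which groups join at each height is also an assumption, as is the function name `cut`.

```python
def cut(labels, merges, threshold):
    """Apply every merge at or below the threshold and return the resulting clusters."""
    parent = {x: x for x in labels}

    def find(x):
        # Follow parent links up to the representative of x's cluster.
        while parent[x] != x:
            x = parent[x]
        return x

    for height, a, b in merges:
        if height <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for x in labels:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

labels = [f"C{i}" for i in range(1, 10)]
# Hypothetical merge history: (height, member of one cluster, member of the other).
merges = [
    (0.5, "C2", "C3"), (0.7, "C5", "C6"), (1.0, "C1", "C2"),
    (1.2, "C7", "C8"), (1.5, "C8", "C9"), (2.0, "C4", "C5"),
    (4.5, "C1", "C7"), (6.0, "C1", "C4"),
]
print(cut(labels, merges, 3))       # 3 clusters
print(len(cut(labels, merges, 5)))  # 2
```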

OPTIMAL NUMBER OF CLUSTERS

We can figure out the optimal number of clusters by finding the longest vertical line in the dendrogram that doesn't cross any extended horizontal line. A threshold drawn through that line gives the number of clusters.
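One common reading of this rule is to look for the largest gap between consecutive merge heights and cut inside it. The Python sketch below follows that reading; the merge heights are hypothetical and the function name is an assumption.

```python
def clusters_at_largest_gap(n_points, heights):
    """Cut inside the largest gap between consecutive merge heights."""
    heights = sorted(heights)
    gaps = [hi - lo for lo, hi in zip(heights, heights[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)  # gap between heights[i] and heights[i + 1]
    threshold = (heights[i] + heights[i + 1]) / 2    # midpoint of the largest gap
    n_clusters = n_points - (i + 1)                  # i + 1 merges lie below the cut
    return n_clusters, threshold

# Hypothetical merge heights for 9 points (8 merges).
heights = [0.5, 0.7, 1.0, 1.2, 1.5, 2.0, 4.5, 6.0]
print(clusters_at_largest_gap(9, heights))  # (3, 3.25)
```

With these heights the largest gap lies between 2.0 and 4.5, so a cut there leaves 3 clusters.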

THRESHOLD

• The resulting segments: younger customers who don't spend much, older customers who also spend less, and the middle segment that spends a lot.