Multivariate Statistics

Clustering
222BDA15 - Yobah Bertrand Yonkou
What is Clustering?
Clustering is the process of grouping data points based on similarity.

/ˈklʌstə/
Types of Clustering
1. Centroid-based clustering
2. Density-based clustering
3. Distribution-based clustering
4. Hierarchical clustering
Hierarchical clustering
Creates a hierarchy of clusters.
Each cluster is nested as a subtree of another cluster (the parent cluster).
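Types of hierarchical clustering: agglomerative (bottom-up, repeatedly merging clusters) and divisive (top-down, repeatedly splitting clusters).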
Dendrograms

A dendrogram is a tree-like representation of the similarity relationships among different groups of entities.

/ˈdɛndrə(ʊ)ɡram/ den.druh.gram
Features

x-axis: Data points

y-axis: Distance (or dissimilarity) at which clusters are merged

Granularity and cluster size have an inverse relationship: cutting the tree lower gives more, smaller clusters.

Linkage methods
How do you conclude that two clusters should be merged?

Linkage methods:
Single linkage: the minimum distance between any two points, one taken from each cluster.
Complete linkage: the maximum distance between any two points, one taken from each cluster.
Others: average, centroid, and Ward linkage.

The choice of linkage can greatly affect the final result, so it should be based on the nature of your dataset and the goal of your analysis.
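
To make the linkage definitions concrete, here is a minimal sketch that measures the distance between two clusters in three ways. The function names and the two small clusters are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
from itertools import product
from math import dist


def pairwise(c1, c2):
    """All point-to-point Euclidean distances between two clusters."""
    return [dist(p, q) for p, q in product(c1, c2)]


def single_linkage(c1, c2):
    # Distance between the closest pair of points, one from each cluster.
    return min(pairwise(c1, c2))


def complete_linkage(c1, c2):
    # Distance between the farthest pair of points, one from each cluster.
    return max(pairwise(c1, c2))


def average_linkage(c1, c2):
    # Mean of all pairwise distances between the two clusters.
    d = pairwise(c1, c2)
    return sum(d) / len(d)


# Hypothetical clusters lying on a line (not taken from the slides).
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (6.0, 0.0)]

print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 6.0
print(average_linkage(left, right))   # 4.0
```

The same two clusters are 2, 6, or 4 units apart depending on the linkage, which is why the merge order can change with the method.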

How to construct a dendrogram?
Given n data points, create a distance matrix for the n data points. This matrix
quantifies the distance between the points.

Initialize each data point as a separate cluster.

Combine the two closest clusters into one cluster and update the similarity or
distance matrix. Do this until all points form a single cluster.

Plot the dendrogram by showing the sequences of merges that took place during
the clustering process.

The height at which two branches join in the dendrogram shows the distance (or dissimilarity) between the two clusters being merged.
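
The construction steps above can be sketched as a short, naive procedure. This is an illustration under stated assumptions, not an optimized implementation: it uses single linkage and recomputes distances from the raw points on every pass. The function name and the demo points are hypothetical.

```python
from math import dist


def single_linkage_clustering(points):
    """Naive agglomerative clustering with single linkage.

    points: dict mapping a label to an (x, y) tuple.
    Returns the merges, in order, as (cluster_a, cluster_b, distance).
    """
    # Start with every point in its own cluster.
    clusters = [(label,) for label in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two closest clusters under single linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Replace the two merged clusters with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical demo points (not from the slides).
demo = {"P": (0, 0), "Q": (1, 0), "R": (5, 0), "S": (5, 3)}
for a, b, d in single_linkage_clustering(demo):
    print("".join(a), "+", "".join(b), "at", d)
# P + Q at 1.0
# R + S at 3.0
# PQ + RS at 4.0
```

The recorded merge distances are the heights at which branches join when the dendrogram is plotted.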

Example 1 - Using single linkage (minimum distance)

Point   X      Y
A       0.07   0.83
B       0.85   0.14
C       0.66   0.89
D       0.49   0.64
E       0.80   0.46

1. Create a Euclidean distance matrix for the points A to E and select the smallest distance:

     A         B         C         D         E
A    0
B    1.041393  0
C    0.593043  0.773692  0
D    0.460977  0.616117  0.302324  0
E    0.818413  0.323883  0.452217  0.358469  0

Using single linkage, we group points C and D because they have the smallest distance (0.302324).
To update the distance matrix, the distance between any point X and the cluster CD is the
minimum of the distances CX and DX.
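
As a check on this step, the following sketch recomputes the distances from the coordinates of Example 1, finds the closest pair, and applies the single-linkage update for the new cluster CD. The helper name `d` and the dictionary layout are illustrative choices, not part of the original slides.

```python
from itertools import combinations
from math import dist

points = {"A": (0.07, 0.83), "B": (0.85, 0.14), "C": (0.66, 0.89),
          "D": (0.49, 0.64), "E": (0.80, 0.46)}

# Pairwise Euclidean distances, keyed by a pair of labels.
distances = {(p, q): dist(points[p], points[q])
             for p, q in combinations(points, 2)}

# The pair with the smallest distance is merged first.
closest = min(distances, key=distances.get)
print(closest, round(distances[closest], 6))  # ('C', 'D') 0.302324


def d(p, q):
    """Look up a distance regardless of the order of the two labels."""
    return distances[(p, q)] if (p, q) in distances else distances[(q, p)]


# Single-linkage update: distance from each remaining point X to CD
# is the smaller of d(X, C) and d(X, D).
for x in "ABE":
    print(x, round(min(d(x, "C"), d(x, "D")), 6))
# A 0.460977
# B 0.616117
# E 0.358469
```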
2. Update the distance matrix and select the pair with the smallest distance:

     A         B         CD        E
A    0
B    1.041393  0
CD   0.460977  0.616117  0
E    0.818413  0.323883  0.358469  0

From the table above, points B and E have the smallest distance (0.323883), so we
merge them.
3. Update the distance matrix and select the pair with the smallest distance:

     A         BE        CD
A    0
BE   0.818413  0
CD   0.460977  0.358469  0

From the table above, CD and BE have the smallest distance (0.358469), so CD and BE will
be grouped.
4. Update the distance matrix:

       A         CDBE
A      0
CDBE   0.460977  0

Only two clusters remain, A and CDBE. Their single-linkage distance is 0.460977, the
smaller of the A to CD distance (0.460977) and the A to BE distance (0.818413). Merging
them leaves a single cluster, which ends the process.

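Clustering steps (merge distance in parentheses):
C and D form CD (0.302324)
B and E form BE (0.323883)
CD and BE form CDBE (0.358469)
CDBE and A form CDBEA (0.460977)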
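
The matrix updates in this example can also be automated. The sketch below works directly on a distance matrix rather than on coordinates and applies the single-linkage rule (take the minimum) at each merge. The function name and the `frozenset` keys are assumptions made for illustration. Merged clusters are named by their sorted member letters, so CDBE appears as BCDE.

```python
def single_linkage_from_matrix(dist):
    """dist: dict mapping frozenset({a, b}) of cluster names to a distance.

    Returns the merges, in order, as (new_cluster_name, merge_distance).
    """
    dist = dict(dist)
    names = set().union(*dist)
    merges = []
    while len(names) > 1:
        # Closest pair of clusters in the current matrix.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        height = dist[pair]
        new = "".join(sorted(a + b))
        names -= {a, b}
        # Single-linkage update: distance to the merged cluster is the
        # minimum of the distances to its two parts.
        for x in names:
            dist[frozenset((new, x))] = min(dist[frozenset((a, x))],
                                            dist[frozenset((b, x))])
        # Drop every entry that still refers to the old clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        names.add(new)
        merges.append((new, height))
    return merges


# Distance matrix of Example 1 (values from step 1).
raw = {("A", "B"): 1.041393, ("A", "C"): 0.593043, ("B", "C"): 0.773692,
       ("A", "D"): 0.460977, ("B", "D"): 0.616117, ("C", "D"): 0.302324,
       ("A", "E"): 0.818413, ("B", "E"): 0.323883, ("C", "E"): 0.452217,
       ("D", "E"): 0.358469}
matrix = {frozenset(k): v for k, v in raw.items()}

for name, height in single_linkage_from_matrix(matrix):
    print(name, height)
# CD 0.302324
# BE 0.323883
# BCDE 0.358469
# ABCDE 0.460977
```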
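5. Drawing a dendrogram
[Figure: dendrogram showing the clusters CD, BE, CDBE, CDBEA]

Allocating observations to clusters
[Figure: the same dendrogram cut by two horizontal lines, L1 and L2; each branch crossed by a line forms one cluster]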
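
Cutting a dendrogram with a horizontal line amounts to keeping only the merges that happen at or below that height. The sketch below illustrates this for Example 1. The cut heights 0.33 and 0.40 are hypothetical, since the positions of L1 and L2 cannot be read from the text, and the function name is an illustrative choice.

```python
def cut_dendrogram(labels, merges, height):
    """Return the clusters obtained by cutting the tree at `height`.

    merges: list of (members_a, members_b, distance). Every merge with
    distance <= height is applied; merges above the cut are ignored.
    """
    parent = {x: x for x in labels}

    def find(x):
        # Follow parent links up to the representative of x's group.
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b, d in merges:
        if d <= height:
            # Join the two groups via one member of each.
            parent[find(a[0])] = find(b[0])

    groups = {}
    for x in labels:
        groups.setdefault(find(x), []).append(x)
    return sorted("".join(sorted(g)) for g in groups.values())


# Merges of Example 1 with their single-linkage distances.
merges = [("C", "D", 0.302324), ("B", "E", 0.323883),
          ("CD", "BE", 0.358469), ("CDBE", "A", 0.460977)]

print(cut_dendrogram("ABCDE", merges, 0.33))  # ['A', 'BE', 'CD']
print(cut_dendrogram("ABCDE", merges, 0.40))  # ['A', 'BCDE']
```

A lower cut gives more, smaller clusters; a higher cut gives fewer, larger ones.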
Example 2 - Using single linkage (minimum distance)

Point   X     Y
A       1     7
B       5     6
C       6.5   8
D       8     5
E       8.5   1

1. Create a Euclidean distance matrix for the points A to E.

     A         B         C         D         E
A    0
B    4.123106  0
C    5.59017   2.5       0
D    7.28011   3.162278  3.354102  0
E    9.604686  6.103278  7.28011   4.031129  0


2. Select the points that have the smallest distance: B and C, at a distance of 2.5 (see the matrix in step 1).

Now that B and C are grouped together, using single linkage, the distance between
any point X and the cluster BC is the minimum of the distances BX and CX.
3. Update the distance matrix:

     A         BC        D         E
A    0
BC   4.123106  0
D    7.28011   3.162278  0
E    9.604686  6.103278  4.031129  0


4. Select the clusters that have the smallest distance: BC and D, at a distance of 3.162278 (see the matrix in step 3).

BC and D therefore form the cluster BCD. The distance between any point X and BCD is
the minimum of the distances BX, CX and DX.
5. Update the distance matrix and select the smallest distance:

      A         BCD       E
A     0
BCD   4.123106  0
E     9.604686  4.031129  0

From the distance matrix above, BCD and E have the smallest distance (4.031129), so BCDE
will form a new cluster.
6. Update the distance matrix:

       A         BCDE
A      0
BCDE   4.123106  0

Only A and BCDE remain. Merging them at a distance of 4.123106 gives a single cluster, so we are done.

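Clustering steps (merge distance in parentheses):
B and C form BC (2.5)
BC and D form BCD (3.162278)
BCD and E form BCDE (4.031129)
BCDE and A form BCDEA (4.123106)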
7. Drawing a dendrogram
[Figure: dendrogram showing the clusters BC, BCD, BCDE and BCDEA]
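
A text-only way to see the shape of this dendrogram is to build the hierarchy as nested tuples, where each merged cluster contains its two parts as subtrees. This is a minimal sketch; the function names are hypothetical, and the merge steps are the cluster names from Example 2.

```python
def build_tree(merge_steps):
    """Turn merge steps such as ("B", "C") into a nested-tuple tree.

    Each step names the two clusters being merged; a merged cluster is
    referred to later by the concatenation of the two names.
    """
    nodes = {}  # cluster name -> subtree built so far
    tree = None
    for left, right in merge_steps:
        # Use the existing subtree for a cluster, or the bare label for a point.
        tree = (nodes.get(left, left), nodes.get(right, right))
        nodes[left + right] = tree
    return tree


def to_text(tree):
    """Render the nested tuples in parenthesised form."""
    if isinstance(tree, str):
        return tree
    return "(" + ", ".join(to_text(t) for t in tree) + ")"


# Merge order of Example 2: BC, BCD, BCDE, BCDEA.
steps = [("B", "C"), ("BC", "D"), ("BCD", "E"), ("BCDE", "A")]
tree = build_tree(steps)
print(to_text(tree))  # ((((B, C), D), E), A)
```

The nesting depth mirrors the order of the merges: B and C join first, and A joins last.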
THANK YOU
