
Hierarchical Clustering

Two Types of Clustering


• Partitional algorithms: Construct various partitions and then evaluate them by some
criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using
some criterion

[Figure: an example of a hierarchical clustering (dendrogram) and of a partitional clustering.]
(How-to) Hierarchical Clustering
The number of possible dendrograms with n leaves is

(2n − 3)! / [2^(n−2) (n − 2)!]

Number of Leaves    Number of Possible Dendrograms
2                   1
3                   3
4                   15
5                   105
...                 ...
10                  34,459,425

Since we cannot test all possible trees, we will have to perform a heuristic search over the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all


the data in a single cluster, consider
every possible way to divide the cluster
into two. Choose the best division and
recursively operate on both sides.
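
As a quick check of the count formula above, a minimal MATLAB sketch (my own, not from the slides) that tabulates the counts:

% Number of possible dendrograms with n leaves: (2n-3)! / [2^(n-2) (n-2)!]
for n = 2:10
    count = factorial(2*n - 3) / (2^(n - 2) * factorial(n - 2));
    fprintf('%2d leaves -> %d possible dendrograms\n', n, count);
end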
We begin with a distance matrix which
contains the distances between every
pair of objects in our database.

    0   8   8   7   7
        0   2   4   4
            0   5   5
                0   3
                    0

Two example lookups from the slide (the objects appear there as images): D( , ) = 8 is an entry in the first row, and D( , ) = 3 is the distance between the fourth and fifth objects.
A generic technique for measuring similarity
To measure the similarity between two objects, transform one
of the objects into the other, and measure how much effort it
took. The measure of effort becomes the distance measure.

The distance between Patty and Selma.


Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty,Selma) = 3

The distance between Marge and Selma.


Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge,Selma) = 5

This is called the "edit distance" or the "transformation distance".
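
The same idea applied to character strings gives the classic Levenshtein edit distance. A minimal MATLAB sketch (my own illustration; the helper name edit_distance is hypothetical):

function d = edit_distance(s, t)
    % Minimum number of single-character insertions, deletions,
    % and substitutions needed to turn string s into string t.
    m = numel(s); n = numel(t);
    D = zeros(m + 1, n + 1);
    D(:, 1) = (0:m)';                 % deleting all of s
    D(1, :) = 0:n;                    % inserting all of t
    for i = 2:m + 1
        for j = 2:n + 1
            cost = s(i - 1) ~= t(j - 1);              % 0 if characters match
            D(i, j) = min([D(i - 1, j) + 1, ...       % deletion
                           D(i, j - 1) + 1, ...       % insertion
                           D(i - 1, j - 1) + cost]);  % substitution
        end
    end
    d = D(m + 1, n + 1);
end

For example, edit_distance('marge', 'selma') returns 5 (character edits here, not the feature edits counted above).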
Agglomerative clustering algorithm

• Most popular hierarchical clustering technique

• Basic algorithm
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains

• Key operation is the computation of the distance between two clusters
– Different definitions of the distance between clusters lead to different algorithms
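
A minimal from-scratch sketch of steps 1–6 in MATLAB (naive, single-linkage cluster distance; names are my own, and the example points are taken from the MATLAB section at the end):

X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];   % example data points
n = size(X, 1);
D = squareform(pdist(X));      % 1. compute the distance matrix
D(1:n+1:end) = inf;            %    mask the zero self-distances
clusters = num2cell(1:n);      % 2. each data point is its own cluster
while numel(clusters) > 1      % 3./6. repeat until one cluster remains
    [~, k] = min(D(:));        % 4. the two closest clusters...
    [i, j] = ind2sub(size(D), k);
    if i > j, t = i; i = j; j = t; end
    clusters{i} = [clusters{i}, clusters{j}];   %    ...are merged
    clusters(j) = [];
    D(i, :) = min(D(i, :), D(j, :));   % 5. update the distance matrix
    D(:, i) = D(i, :)';                %    (single linkage keeps the minimum)
    D(i, i) = inf;
    D(j, :) = [];
    D(:, j) = [];
end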
[Figure: a sequence of animation slides illustrating bottom-up (agglomerative) clustering; at each step, consider all possible merges and choose the closest pair to cluster, repeating until all clusters are fused together.]
We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is non-obvious.

• Single linkage (nearest neighbor): In this method the distance between two
clusters is determined by the distance of the two closest objects (nearest
neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the distances
between clusters are determined by the greatest distance between any two
objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two
different clusters.
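
These three criteria map directly onto method options of MATLAB's linkage function; a sketch, using the example points from the MATLAB section below:

X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Y = pdist(X);                   % pairwise distances
Zs = linkage(Y, 'single');      % nearest neighbor
Zc = linkage(Y, 'complete');    % furthest neighbor
Za = linkage(Y, 'average');     % group average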
[Figure: dendrograms of the same 30 objects under single linkage and under average linkage; the two criteria produce different trees and leaf orderings.]
Summary of Hierarchical Clustering Methods

• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.
Hierarchical Clustering Matlab
Given the following points:
(1, 2) (2.5, 4.5) (2, 2) (4, 1.5) (4, 2.5)
Build a hierarchical clustering of these points.

Answer:
The matrix X stores these points.
Next, compute the distance between points 1 and 2, points 1 and 3, and so on, until the distance between every pair of points is known. The MATLAB function for this is pdist.
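
The slide shows this step as a screenshot; a sketch of what it presumably contains:

X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5]
Y = pdist(X)
% Y holds the 10 pairwise Euclidean distances, ordered (1,2), (1,3), ..., (4,5):
% Y ≈ 2.9155  1.0000  3.0414  3.0414  2.5495  3.3541  2.5000  2.0616  2.0616  1.0000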

To make the distance vector Y easier to read, it can be transformed into a square matrix (element (1,1) is the distance from point 1 to point 1, i.e. 0, and so on).
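
The transformation described here is presumably MATLAB's squareform:

D = squareform(Y)
% D is the symmetric 5-by-5 distance matrix, with D(1,1) = 0, D(1,2) ≈ 2.9155, etc.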
Next, perform hierarchical clustering with the linkage function.
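
A sketch of the call shown on the slide:

Z = linkage(Y)   % the default method is single linkage
% Each row of Z records one merge: two cluster indices and the distance between them.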

The resulting matrix Z is read as follows:

Row 1: objects 4 and 5, at distance 1, are clustered.
Row 2: objects 1 and 3, at distance 1, are clustered.
Row 3: object 6 (the cluster from row 1) and object 7 (the cluster from row 2) are clustered; they are at distance 2.0616.
Row 4: object 2 and object 8 (the cluster from row 3) are clustered; they are at distance 2.5.

This can be seen more clearly in the plot above.
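
As an aside (not on the original slides), the tree in Z can be cut into a fixed number of flat clusters with MATLAB's cluster function:

T = cluster(Z, 'maxclust', 2)
% T(i) is the flat cluster label of point i when the tree is cut into 2 clusters.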


Create the dendrogram from the result matrix Z:

dendrogram(Z)

which produces the following dendrogram figure. [Figure: dendrogram of the five points: 4 and 5 merge, 1 and 3 merge, those two clusters merge, and point 2 joins last.]
