
Hierarchical Clustering

Two Types of Clustering


• Partitional algorithms: Construct various partitions and then evaluate them by some
criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using
some criterion

[Figure: an example of a hierarchical clustering (dendrogram) and of a partitional clustering.]
(How-to) Hierarchical Clustering
The number of possible dendrograms with n leaves is

(2n − 3)! / [2^(n−2) (n − 2)!]

Number of Leaves    Number of Possible Dendrograms
2                   1
3                   3
4                   15
5                   105
...                 ...
10                  34,459,425

Since we cannot test all possible trees, we will have to perform a heuristic search over the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all


the data in a single cluster, consider
every possible way to divide the cluster
into two. Choose the best division and
recursively operate on both sides.
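
As a quick check of the count formula above, a minimal MATLAB sketch (my own, not from the slides) that tabulates the counts:

% Number of possible dendrograms with n leaves: (2n-3)! / [2^(n-2) (n-2)!]
for n = 2:10
    count = factorial(2*n - 3) / (2^(n - 2) * factorial(n - 2));
    fprintf('%2d leaves -> %d possible dendrograms\n', n, count);
end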
We begin with a distance matrix which
contains the distances between every
pair of objects in our database.

    0   8   8   7   7
        0   2   4   4
            0   5   5
                0   3
                    0

Two example lookups from the slide (the objects appear there as images): D( , ) = 8 is an entry in the first row, and D( , ) = 3 is the distance between the fourth and fifth objects.
A generic technique for measuring similarity
To measure the similarity between two objects, transform one
of the objects into the other, and measure how much effort it
took. The measure of effort becomes the distance measure.

The distance between Patty and Selma.


Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty,Selma) = 3

The distance between Marge and Selma.


Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge,Selma) = 5

This is called the "edit distance" or the "transformation distance".
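
The same idea applied to character strings gives the classic Levenshtein edit distance. A minimal MATLAB sketch (my own illustration; the helper name edit_distance is hypothetical):

function d = edit_distance(s, t)
    % Minimum number of single-character insertions, deletions,
    % and substitutions needed to turn string s into string t.
    m = numel(s); n = numel(t);
    D = zeros(m + 1, n + 1);
    D(:, 1) = (0:m)';                 % deleting all of s
    D(1, :) = 0:n;                    % inserting all of t
    for i = 2:m + 1
        for j = 2:n + 1
            cost = s(i - 1) ~= t(j - 1);              % 0 if characters match
            D(i, j) = min([D(i - 1, j) + 1, ...       % deletion
                           D(i, j - 1) + 1, ...       % insertion
                           D(i - 1, j - 1) + cost]);  % substitution
        end
    end
    d = D(m + 1, n + 1);
end

For example, edit_distance('marge', 'selma') returns 5 (character edits here, not the feature edits counted above).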
Agglomerative clustering algorithm

• Most popular hierarchical clustering technique

• Basic algorithm
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains

• Key operation is the computation of the distance between two clusters
– Different definitions of the distance between clusters lead to different algorithms
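
A minimal from-scratch sketch of steps 1–6 in MATLAB (naive, single-linkage cluster distance; names are my own, and the example points are taken from the MATLAB section at the end):

X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];   % example data points
n = size(X, 1);
D = squareform(pdist(X));      % 1. compute the distance matrix
D(1:n+1:end) = inf;            %    mask the zero self-distances
clusters = num2cell(1:n);      % 2. each data point is its own cluster
while numel(clusters) > 1      % 3./6. repeat until one cluster remains
    [~, k] = min(D(:));        % 4. the two closest clusters...
    [i, j] = ind2sub(size(D), k);
    if i > j, t = i; i = j; j = t; end
    clusters{i} = [clusters{i}, clusters{j}];   %    ...are merged
    clusters(j) = [];
    D(i, :) = min(D(i, :), D(j, :));   % 5. update the distance matrix
    D(:, i) = D(i, :)';                %    (single linkage keeps the minimum)
    D(i, i) = inf;
    D(j, :) = [];
    D(:, j) = [];
end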
[Figure: a sequence of animation slides illustrating bottom-up (agglomerative) clustering; at each step, consider all possible merges and choose the closest pair to cluster, repeating until all clusters are fused together.]
We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is non-obvious.

• Single linkage (nearest neighbor): In this method the distance between two
clusters is determined by the distance of the two closest objects (nearest
neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the distances
between clusters are determined by the greatest distance between any two
objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is
calculated as the average distance between all pairs of objects in the two
different clusters.
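
These three criteria map directly onto method options of MATLAB's linkage function; a sketch, using the example points from the MATLAB section below:

X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Y = pdist(X);                   % pairwise distances
Zs = linkage(Y, 'single');      % nearest neighbor
Zc = linkage(Y, 'complete');    % furthest neighbor
Za = linkage(Y, 'average');     % group average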
[Figure: dendrograms of the same 30 objects under single linkage and under average linkage; the two criteria produce different trees and leaf orderings.]
Summary of Hierarchical Clustering Methods

• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.
Hierarchical Clustering Matlab
Given the following points:
(1, 2) (2.5, 4.5) (2, 2) (4, 1.5) (4, 2.5)
Build a hierarchical clustering of these points.

Answer:
The matrix X stores these points.
Next, compute the distance between points 1 and 2, points 1 and 3, and so on, until the distance between every pair of points is known. The MATLAB function for this is pdist.
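
The slide shows this step as a screenshot; a sketch of what it presumably contains:

X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5]
Y = pdist(X)
% Y holds the 10 pairwise Euclidean distances, ordered (1,2), (1,3), ..., (4,5):
% Y ≈ 2.9155  1.0000  3.0414  3.0414  2.5495  3.3541  2.5000  2.0616  2.0616  1.0000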

To make the distance vector Y easier to read, it can be transformed into a square matrix (element (1,1) is the distance from point 1 to point 1, i.e. 0, and so on).
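
The transformation described here is presumably MATLAB's squareform:

D = squareform(Y)
% D is the symmetric 5-by-5 distance matrix, with D(1,1) = 0, D(1,2) ≈ 2.9155, etc.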
Next, perform hierarchical clustering with the linkage function.
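
A sketch of the call shown on the slide:

Z = linkage(Y)   % the default method is single linkage
% Each row of Z records one merge: two cluster indices and the distance between them.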

The resulting matrix Z is read as follows:

Row 1: objects 4 and 5, at distance 1, are clustered.
Row 2: objects 1 and 3, at distance 1, are clustered.
Row 3: object 6 (the cluster from row 1) and object 7 (the cluster from row 2) are clustered; they are at distance 2.0616.
Row 4: object 2 and object 8 (the cluster from row 3) are clustered; they are at distance 2.5.

This can be seen more clearly in the plot above.
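
As an aside (not on the original slides), the tree in Z can be cut into a fixed number of flat clusters with MATLAB's cluster function:

T = cluster(Z, 'maxclust', 2)
% T(i) is the flat cluster label of point i when the tree is cut into 2 clusters.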


Create the dendrogram from the result matrix Z:

dendrogram(Z)

which produces the following dendrogram figure. [Figure: dendrogram of the five points: 4 and 5 merge, 1 and 3 merge, those two clusters merge, and point 2 joins last.]
