Hierarchical Clustering
• Produces a nested sequence of clusters (a tree), also called a dendrogram.
Hierarchical clustering
• The aim of hierarchical clustering is to build a hierarchy of clusters
• We do not commit to the number of clusters beforehand;
• instead we obtain a tree-based representation of the observations known as a dendrogram (see the sketch below)
• See also ISLR 10.3.2
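A minimal sketch (not from the slides) of how such a dendrogram can be produced in Python with SciPy, assuming a small illustrative 2-D dataset X:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative toy data: two loose groups of 2-D points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# Build the hierarchy; 'average' is one of several linkage choices discussed later.
Z = linkage(X, method="average", metric="euclidean")

# Plot the tree of nested merges (the dendrogram).
dendrogram(Z)
plt.xlabel("observation index")
plt.ylabel("merge distance")
plt.show()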
Clustering and Hierarchical Clustering
[Figure: the same examples E1–E8 grouped by flat clustering (left panel) and by hierarchical clustering (right panel)]
Types of hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from
the bottom level, and
• merges the most similar (or nearest) pair of clusters
• stops when all the data points are merged into a single cluster (i.e., the root cluster).
• Divisive (top down) clustering: It starts with all data points in one cluster, the
root.
• Splits the root into a set of child clusters. Each child cluster is recursively divided further
• stops when only singleton clusters of individual data points remain, i.e., each cluster with
only a single point
Types of hierarchical clustering (continued)
• Agglomerative or bottom-up clustering where we start with the observations in
n clusters – the leaves of the tree – and then merge clusters – forming branches
– until there is only 1 cluster, the trunk of the tree
• Divisive or top-down clustering where we start with the observations in 1 cluster
and then split clusters until we reach the leaves
• We will focus on agglomerative clustering as it is generally much more efficient
than divisive clustering
[Figure: a dendrogram read bottom-up for agglomerative clustering and top-down for divisive clustering]
Agglomerative clustering
Agglomerative clustering is more popular than divisive methods.
• At the beginning, each data point forms a cluster (also called a node).
• Merge the nodes/clusters that have the smallest distance.
• Keep merging.
• Eventually all nodes belong to one cluster
Agglomerative clustering algorithm
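The algorithm listing from this slide is not reproduced here; the following is a minimal from-scratch sketch of the generic agglomerative procedure, assuming single-link (closest-pair) distance and stopping at a chosen number of clusters. Both choices are illustrative, not prescribed by the slides.

import numpy as np
from itertools import combinations

def agglomerative(X, num_clusters=1):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(X))]          # each point starts as its own cluster
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for a, b in combinations(range(len(clusters)), 2):
            d = min(np.linalg.norm(X[i] - X[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]       # merge the closest pair
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])
print(agglomerative(X, num_clusters=2))               # prints the index lists of the 2 remaining clusters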
An example: working of the algorithm
Measuring the distance of two clusters
• There are several ways to measure the distance between two clusters.
• Each results in a different variation of the algorithm.
• Single link
• Complete link
• Average link
• Centroids
• …
Single link method
• The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
• It can find arbitrarily shaped clusters, but
• it may suffer from the undesirable “chain effect” caused by noisy points
Complete link method
• The distance between two clusters is the distance between the two furthest data points in the two clusters.
• It is sensitive to outliers because they are far away
Average link and centroid methods
• Average link: A compromise between
• the sensitivity of complete-link clustering to outliers and
• the tendency of single-link clustering to form long chains that do not correspond
to the intuitive notion of clusters as compact, spherical objects.
• In this method, the distance between two clusters is the average of all pair-wise distances between the data points in the two clusters.
• Centroid method: In this method, the distance between two clusters is the distance between their centroids (a sketch comparing these linkage criteria follows below)
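A short sketch (not from the slides) that runs the same toy data through these linkage criteria, using SciPy's standard names for them:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative toy data: two tight groups plus one far-away point.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.2, 3.9], [8.0, 0.5]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                     # centroid linkage assumes Euclidean distance
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)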
The complexity
• All the algorithms are at least O(n²), where n is the number of data points.
• Single link can be done in O(n²).
• Complete and average link can be done in O(n² log n).
• Due to the complexity, these methods are hard to use for large data sets.
• Sampling
• Scale-up methods (e.g., BIRCH).
Distance functions
• Key to clustering; “similarity” and “dissimilarity” are also commonly used terms.
• There are numerous distance functions for
• Different types of data
• Numeric data
• Nominal data
• Different specific applications
Distance functions for numeric attributes
• Most commonly used functions are
• Euclidean distance and
• Manhattan (city block) distance
• We denote distance with: dist(xi, xj), where xi and xj are data points
(vectors)
• They are special cases of the Minkowski distance, where h is a positive integer:
dist(x_i, x_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ir} - x_{jr}|^h \right)^{1/h}
Euclidean distance and Manhattan distance
• If h = 2, it is the Euclidean distance; if h = 1, it is the Manhattan (city block) distance
Squared distance and Chebychev distance
• Squared Euclidean distance: places progressively greater weight on data points that are further apart:
dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2
• Chebychev distance: the maximum difference over all attributes, dist(x_i, x_j) = \max_k |x_{ik} - x_{jk}| (a sketch computing these distances follows below)
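A short sketch (not from the slides) computing these distance functions with NumPy for two example vectors:

import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((xi - xj) ** 2))            # Minkowski with h = 2
manhattan = np.sum(np.abs(xi - xj))                    # Minkowski with h = 1
minkowski = np.sum(np.abs(xi - xj) ** 3) ** (1 / 3)    # general Minkowski, here h = 3
squared   = np.sum((xi - xj) ** 2)                     # squared Euclidean distance
chebychev = np.max(np.abs(xi - xj))                    # Chebychev: limit of Minkowski as h grows
print(euclidean, manhattan, minkowski, squared, chebychev)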
Hierarchical clustering
• Agglomerative Clustering
• Start with single-instance clusters
• At each step, join the two closest clusters
• Design decision: distance between clusters
• Divisive Clustering
• Start with one universal cluster
• Find two clusters
• Proceed recursively on each subset
• Can be very fast
• Both methods produce a dendrogram
[Figure: dendrogram over items a–k produced by hierarchical clustering]
Divisive Clustering
• The divisive clustering algorithm is a top-down approach: initially all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.
• Steps of Divisive Clustering:
• Initially, all points in the dataset belong to one single cluster.
• Partition the cluster into the two least similar clusters
• Proceed recursively to form new clusters until the desired number of
clusters is obtained.
1st image: all the data points belong to one cluster.
2nd image: one cluster is separated from the previous single cluster.
3rd image: another cluster is separated from the previous set of clusters.
Sample dataset separated into 4 clusters
• In the sample dataset above, three clusters are clearly separated from each other, so we stop after obtaining 3 clusters.
• Even if we keep splitting into more clusters, the result shown below is obtained.
How to choose which cluster to split?
• Check the sum of squared errors of each
cluster and choose the one with the
largest value.
• In the 2-dimensional dataset below, the data points are currently separated into 2 clusters. To split further and form a 3rd cluster, find the sum of squared errors (SSE) of the points in the red cluster and in the blue cluster.
Sample dataset separated into 2 clusters.
• The cluster with the largest SSE value is split into 2 clusters, hence forming a new cluster. In the image above, the red cluster has the larger SSE, so it is split into 2 clusters, giving 3 clusters in total.
How to split the above-chosen cluster?
Once we have decided which cluster to split, the question is how to split it into 2 clusters. One way is to use Ward’s criterion and choose the split that gives the largest reduction in the SSE. A sketch of this SSE-driven splitting follows below.
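A minimal sketch (not from the slides) of this splitting step: compute the SSE of each current cluster, pick the one with the largest SSE, and split it into two. A plain 2-means split is used here as a stand-in for a Ward-style split; the helper names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    """Sum of squared distances of the points to their centroid."""
    return np.sum((points - points.mean(axis=0)) ** 2)

def split_largest_sse(clusters):
    """Split the cluster with the largest SSE into two; returns the new list of clusters."""
    worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
    pts = clusters[worst]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pts)
    rest = clusters[:worst] + clusters[worst + 1:]
    return rest + [pts[labels == 0], pts[labels == 1]]

# Illustrative data: three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0.0, 0.0), 0.3, (20, 2)),
               rng.normal((5.0, 5.0), 0.3, (20, 2)),
               rng.normal((5.0, 0.0), 0.3, (20, 2))])

clusters = [X]                        # start with everything in one cluster
while len(clusters) < 3:              # stop at the desired number of clusters
    clusters = split_largest_sse(clusters)
print([len(c) for c in clusters])     # sizes of the resulting clusters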
Summary
• Clustering has a long history and is still an active research area
• There are a huge number of clustering algorithms
• More are still coming every year.
• We only introduced several main algorithms. There are many others, e.g.,
• density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate, but very useful in practice. This partially explains
why there are still a large number of clustering algorithms being devised every
year.
• Clustering is highly application dependent and to some extent subjective.
THANK YOU