
University Institute of Engineering
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Bachelor of Engineering (Computer Science & Engineering)
Artificial Intelligence and Machine Learning (21CSH-316)
Prepared by:
Sitaram Patel (E13285)
Topic: Hierarchical Clustering

DISCOVER . LEARN . EMPOWER


Course Outcomes
CO-1: Understand the fundamental concepts and techniques of artificial intelligence and machine learning.

CO-2: Apply the basics of Python programming to various problems related to AI & ML.

CO-3:

CO-4:

CO-5: Apply various unsupervised machine learning models and evaluate their performance.
Course Objectives

• To make familiar with the domain and fundamentals of Artificial Intelligence.
• To study learning processes: supervised and unsupervised, deterministic and statistical knowledge of Machine learners, and ensemble learning.
• To provide a comprehensive foundation to Machine Learning and Optimization methodology with applications.
• To understand modern techniques and practical trends of Machine Learning.
Hierarchical Clustering
• Produces a nested sequence of clusters, organised as a tree, also called a dendrogram.
Hierarchical clustering
• The aim of hierarchical clustering is to produce a hierarchy of clusters.
• We do not commit to the number of clusters beforehand;
• instead we obtain a tree-based representation of the observations, known as a dendrogram.
• See also ISLR 10.3.2
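As a minimal illustration (assuming SciPy and Matplotlib; the six 2-D points and the E1–E6 labels are invented for the example), a dendrogram can be built and plotted like this:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# A small made-up 2-D dataset, purely for illustration.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1],
              [5.2, 4.9], [9.0, 1.0], [9.1, 1.2]])

# Agglomerative clustering with average linkage; Z encodes the sequence of merges.
Z = linkage(X, method="average", metric="euclidean")

# Plot the tree-based representation (dendrogram) of the observations.
dendrogram(Z, labels=["E1", "E2", "E3", "E4", "E5", "E6"])
plt.ylabel("merge distance")
plt.show()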
Clustering and Hierarchical Clustering
[Figure: side-by-side panels "Clustering" and "Hierarchical Clustering" illustrating the two approaches on the same points E1–E8.]
Types of hierarchical clustering
• Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level, and
  • merges the most similar (or nearest) pair of clusters at each step,
  • stops when all the data points have been merged into a single cluster (i.e., the root cluster).
• Divisive (top-down) clustering: starts with all data points in one cluster, the root.
  • Splits the root into a set of child clusters; each child cluster is divided recursively,
  • stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
Types of hierarchical clustering (contd.)
• Agglomerative or bottom-up clustering: we start with the observations in n clusters – the leaves of the tree – and then merge clusters – forming branches – until there is only 1 cluster, the trunk of the tree.
• Divisive or top-down clustering: we start with the observations in 1 cluster and then split clusters until we reach the leaves.
• We will focus on agglomerative clustering, as it is generally much more efficient than divisive clustering.

[Figure: the same tree read in both directions – agglomerative clustering works upward from the leaves, divisive clustering works downward from the root.]
Agglomerative clustering
• It is more popular than divisive methods.
• At the beginning, each data point forms its own cluster (also called a node).
• Merge the pair of nodes/clusters with the smallest distance between them.
• Keep merging in this way.
• Eventually all nodes belong to one cluster.
Agglomerative clustering algorithm
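In outline, the algorithm starts with every point in its own cluster and repeatedly merges the two nearest clusters until the desired number of clusters remains. Below is a minimal from-scratch sketch of that idea (plain NumPy, naive single-link merging, illustrative names; not an optimized implementation):

import numpy as np

def agglomerative_single_link(X, n_clusters=1):
    # Naive single-link agglomerative clustering, O(n^3): for illustration only.
    clusters = [[i] for i in range(len(X))]   # each data point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None                           # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link distance: closest pair of points, one from each cluster.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the two nearest clusters
        del clusters[b]
    return clusters

# Tiny invented example: point indices grouped into 2 clusters.
X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 5.2]])
print(agglomerative_single_link(X, n_clusters=2))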

An example: working of the algorithm

[Figure: a worked example of the algorithm's successive merges on a small dataset.]
Measuring the distance of two clusters
• There are several ways to measure the distance between two clusters.
• Each choice results in a different variant of the algorithm:
  • Single link
  • Complete link
  • Average link
  • Centroids
  • …
Single link method
• The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
• It can find arbitrarily shaped clusters, but
• it may suffer from the undesirable "chain effect" caused by noisy points.

[Figure: two natural clusters are split into two.]
Complete link method
• The distance between two clusters is the distance between the two furthest data points in the two clusters.
• It is sensitive to outliers because they are far away.
Average link and centroid methods
• Average link: a compromise between
  • the sensitivity of complete-link clustering to outliers, and
  • the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
• In this method, the distance between two clusters is the average of all pair-wise distances between the data points in the two clusters.
• Centroid method: in this method, the distance between two clusters is the distance between their centroids.
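A small sketch of how these variants can be compared in practice (assuming SciPy; the method names follow scipy.cluster.hierarchy.linkage, and the tiny dataset is invented):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Tiny illustrative dataset: two tight groups plus one far-away point.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.1, 3.9], [10.0, 0.0]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # merge tree under this cluster distance
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 flat clusters
    print(method, labels)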
The complexity
• All the algorithms are at least O(n^2), where n is the number of data points.
• Single link can be done in O(n^2).
• Complete and average link can be done in O(n^2 log n).
• Due to this complexity, hierarchical methods are hard to use for large data sets. Common workarounds:
  • sampling
  • scale-up methods (e.g., BIRCH).
Distance functions
• Distance functions are key to clustering; "similarity" and "dissimilarity" are also commonly used terms.
• There are numerous distance functions for
  • different types of data
    • numeric data
    • nominal data
  • different specific applications.
Distance functions for numeric attributes
• The most commonly used functions are
  • Euclidean distance and
  • Manhattan (city block) distance.
• We denote the distance between two data points (vectors) x_i and x_j by dist(x_i, x_j).
• Both are special cases of the Minkowski distance, where h is a positive integer:

dist(x_i, x_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ir} - x_{jr}|^h \right)^{1/h}
Euclidean distance and Manhattan distance
• If h = 2, it is the Euclidean distance:

dist(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2}

• If h = 1, it is the Manhattan distance:

dist(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|

• Weighted Euclidean distance:

dist(x_i, x_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2}
Squared distance and Chebychev distance
• Squared Euclidean distance: places progressively greater weight on data points that are further apart.

dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2

• Chebychev distance: used when one wants to define two data points as "different" if they differ on any one of the attributes.

dist(x_i, x_j) = \max\left( |x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}| \right)
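As an illustrative sketch (NumPy; the helper names and example points are my own, not from the slides), these distances can be computed directly:

import numpy as np

def minkowski(xi, xj, h):
    # General Minkowski distance; h = 1 gives Manhattan, h = 2 gives Euclidean.
    return np.sum(np.abs(xi - xj) ** h) ** (1.0 / h)

def chebychev(xi, xj):
    # Chebychev distance: the largest difference over any single attribute.
    return np.max(np.abs(xi - xj))

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])

print("Euclidean:", minkowski(xi, xj, 2))            # sqrt(9 + 4 + 0) ~ 3.61
print("Manhattan:", minkowski(xi, xj, 1))            # 3 + 2 + 0 = 5
print("Squared Euclidean:", np.sum((xi - xj) ** 2))  # 9 + 4 + 0 = 13
print("Chebychev:", chebychev(xi, xj))               # max(3, 2, 0) = 3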

Hierarchical clustering
• Agglomerative Clustering
• Start with single-instance clusters
• At each step, join the two closest clusters
• Design decision: distance between clusters
• Divisive Clustering
• Start with one universal cluster
• Find two clusters
• Proceed recursively on each subset
• Can be very fast
• Both methods produce a dendrogram

[Figure: example dendrogram over the points a–k.]
Divisive Clustering
• The divisive clustering algorithm is a top-down clustering approach: initially, all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.
• Steps of divisive clustering:
  • Initially, all points in the dataset belong to one single cluster.
  • Partition the cluster into the two least similar clusters.
  • Proceed recursively to form new clusters until the desired number of clusters is obtained.
• 1st image: all the data points belong to one cluster.
• 2nd image: one cluster is separated from the previous single cluster.
• 3rd image: a further cluster is separated from the previous set of clusters.
Sample dataset separated into 4 clusters
• In the sample dataset above, it is observed that there are 3 clusters that are far separated from each other, so we stop after obtaining 3 clusters.
• Even if we keep separating it into more clusters, the result shown below is what we obtain.
How to choose which cluster to split?
• Check the sum of squared errors (SSE) of each cluster and choose the one with the largest value.
• In the 2-dimensional dataset below, the data points are currently separated into 2 clusters; to split further and form a 3rd cluster, compute the SSE of the points in the red cluster and in the blue cluster.
Sample dataset separated into 2 clusters.
• The cluster with the largest SSE value is split into 2 clusters, hence forming a new cluster.
• In the image above, the red cluster is observed to have the larger SSE, so it is split into 2 clusters, giving 3 clusters in total.
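A minimal sketch of this split-selection idea (plain NumPy; a simple 2-means loop stands in for the split step, and the function names, threshold, and tiny dataset are invented for illustration):

import numpy as np

def sse(points):
    # Sum of squared errors of a cluster around its own centroid.
    centroid = points.mean(axis=0)
    return float(np.sum((points - centroid) ** 2))

def split_in_two(points, n_iter=20, seed=0):
    # Split one cluster into two with a basic 2-means loop (illustrative stand-in).
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        centers = np.array([points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                            for k in (0, 1)])
    return points[labels == 0], points[labels == 1]

def divisive_clustering(X, n_clusters=3, min_size=2):
    # Top-down clustering: repeatedly split the cluster with the largest SSE.
    # min_size acts as a simple threshold so very small (possibly noisy) clusters are not split.
    clusters = [X]
    while len(clusters) < n_clusters:
        idx = max(range(len(clusters)),
                  key=lambda i: sse(clusters[i]) if len(clusters[i]) >= min_size else -1.0)
        left, right = split_in_two(clusters.pop(idx))
        clusters.extend([left, right])
    return clusters

# Tiny invented 2-D dataset with three visible groups.
X = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 5.2], [9, 1], [9.2, 0.8]], dtype=float)
for c in divisive_clustering(X, n_clusters=3):
    print(c)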

How to split the above-chosen cluster?
Once we have decided which cluster to split, the question arises of how to split the chosen cluster into 2 clusters. One way is to use Ward's criterion and look for the split that yields the largest reduction in the SSE criterion.

How to handle noise or outliers?

The presence of an outlier or noisy point can cause it to form a cluster of its own. To handle noise in the dataset, use a threshold in the termination criterion, i.e., do not generate clusters that are too small.
Summary
• Clustering has a long history and is still an active research area.
• There are a huge number of clustering algorithms, and more appear every year.
• We only introduced several main algorithms; there are many others, e.g.,
  • density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate but very useful in practice. This partially explains why a large number of new clustering algorithms are still being devised every year.
• Clustering is highly application dependent and, to some extent, subjective.
References
• Books and Journals
  • Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
  • Osman Omer, Introduction to Machine Learning – the Wikipedia Guide.
• Video Links
  • https://www.youtube.com/watch?v=3M1wUK2zCKY
  • https://www.youtube.com/watch?v=7enWesSofhg
• Web Links
  • https://en.wikipedia.org/wiki/Hierarchical_clustering
  • https://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-clustering-1.html
  • https://link.springer.com/10.1007%2F978-1-4419-9863-7_1371
THANK YOU
