• Classification and clustering are two methods of pattern identification with some similarities and dissimilarities.
• Find a model for the class attribute as a function of the values of the other
attributes
• Use the training set to construct the model and the test set to estimate the
accuracy of the model
Iteration 1: Find the best attribute for the 0th level by computing the gain of each attribute
a. Age attribute:
• m: number of classes or labels = 2
• $p_i$: probability that a tuple belongs to class $i$

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

$$\mathrm{Info}_{age}(D) = \frac{5}{14}\Big(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\Big) + \frac{4}{14}\Big(-\frac{4}{4}\log_2\frac{4}{4}\Big) + \frac{5}{14}\Big(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\Big) = \mathbf{0.694}$$

The three terms measure the class variation within the Youth, Middle_aged and Senior partitions respectively.

$$\mathrm{Gain}(age) = \mathrm{Info}(D) - \mathrm{Info}_{age}(D) = 0.940 - 0.694 = 0.246$$
Age has the highest information gain among all the attributes, and hence is selected as the splitting attribute.
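For concreteness, here is a minimal Python sketch (not part of the original slides) that reproduces the Info_age(D) and Gain(age) arithmetic above; the per-partition class counts (2/3, 4/0, 3/2) are the ones from the classic buys_computer example this slide follows.

```python
from math import log2

def entropy(counts):
    """Info of a partition from its per-class tuple counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (yes, no) counts per age value: Youth, Middle_aged, Senior
partitions = [(2, 3), (4, 0), (3, 2)]
n = sum(sum(p) for p in partitions)              # 14 tuples in total

info_age = sum(sum(p) / n * entropy(p) for p in partitions)
info_d = entropy((9, 5))                         # 9 yes / 5 no overall

print(f"Info_age(D) = {info_age:.3f}")           # 0.694
print(f"Gain(age)   = {info_d - info_age:.3f}")  # 0.246
```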
Level 0 Decision Tree using 1st Splitting
Attribute Age
• Tuples falling on age = middle_aged all belong to the same class "YES", and hence a leaf can be created for that branch.
Example:
• If (Age == youth) AND (Student == no) then buys_computer = "no"
• If (Age == youth) AND (Student == yes) then buys_computer = "yes"
• If (Age == middle_aged) then buys_computer = "yes"
• If (Age == senior) AND (credit_rating == excellent) then buys_computer = "no"
• If (Age == senior) AND (credit_rating == fair) then buys_computer = "yes"
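Transcribed as code, the five rules become a simple top-to-bottom lookup; this is a hedged sketch with illustrative lowercase attribute values.

```python
def buys_computer(age, student, credit_rating):
    """The five extracted rules, applied top to bottom."""
    if age == "youth":
        return "yes" if student == "yes" else "no"
    if age == "middle_aged":
        return "yes"
    if age == "senior":
        return "no" if credit_rating == "excellent" else "yes"
    raise ValueError(f"unknown age value: {age}")

print(buys_computer("youth", "yes", "fair"))       # yes
print(buys_computer("senior", "no", "excellent"))  # no
```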
Drawback of Information gain as attribute selection measure
• The information gain measure is biased towards tests with many outcomes, i.e. it prefers to select attributes having a large number of values.
• For instance, for this dataset, information gain prefers to split on the income attribute (treating each distinct value as its own outcome), which would result in a large number of pure partitions, each containing just one tuple.
Info_income(D) = 0 [variation = 0]
=> Gain_income(D) = Info(D) − Info_income(D) = Info(D)
• Hence, information gain w.r.t. income is maximal.
Classifier -II
Naive Bayesian Classifier
Classification by Naive Bayesian Classifier
• A Bayesian classifier is a statistical classifier based on Bayes' theorem which can predict the probability that a given tuple belongs to a particular class.
• The Naive Bayesian classifier is a simple Bayesian classifier that exhibits high accuracy and speed compared to decision tree and neural network classifiers.
• It can solve problems involving both categorical and continuous-valued attributes.
A prior probability is the probability that an observation will fall into a group before one collects the data or before the experiment happens.
Example:
1. John conducts a single coin toss. What is the a priori probability of landing a head?
A priori probability of landing a head = 1 / 2 = 50%.
Bayes' theorem: $P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}$
Where A, B = events
P(B|A): the probability of B occurring given that A has happened
P(A) & P(B): the prior probabilities of event A and event B
Probability Basics: Example Calculating Posterior Probability
A forest is composed of 20% Oak trees and 80% Maple trees. Suppose it is known that
90% of the Oak trees are healthy while just 50% of the Maple trees are healthy.
Q:- Suppose that from a distance one can tell that a particular tree is healthy, what is the
probability that the tree is an Oak tree?
=>
P(Oak): probability that a tree is an Oak = 0.2 (20%)
P(Healthy): probability that a given tree is healthy = (0.2 × 0.9) + (0.8 × 0.5) = 0.58
P(Healthy|Oak): probability that a tree is healthy given that it is an Oak tree = 0.9 (90%)
P(Oak|Healthy) = P(Healthy|Oak) × P(Oak) / P(Healthy) = (0.9 × 0.2) / 0.58 = 0.3103
Probability Basics: Posterior Probability contd...
For an intuitive understanding of this probability, suppose the following grid represents this
forest with 100 trees. Exactly 20 of the trees are Oak trees (20%) and 18 (90%) of them are
healthy. The other 80 (80%) trees are Maple and 40 (50%) of them are healthy.
(O = Oak, M = Maple, Green = Healthy, Red = Unhealthy)
Out of all 100 trees, exactly 58 are healthy, and 18 of these healthy ones are Oak trees. Thus, if a healthy tree has been selected, the probability that it is an Oak tree is 18/58 = 0.3103.
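A few lines of Python confirm the arithmetic of this example:

```python
p_oak, p_maple = 0.2, 0.8
p_healthy_given_oak, p_healthy_given_maple = 0.9, 0.5

# Total probability that a randomly chosen tree is healthy.
p_healthy = p_oak * p_healthy_given_oak + p_maple * p_healthy_given_maple

# Bayes' theorem: P(Oak | Healthy) = P(Healthy | Oak) * P(Oak) / P(Healthy)
p_oak_given_healthy = p_healthy_given_oak * p_oak / p_healthy
print(f"P(Healthy) = {p_healthy:.2f}")                   # 0.58
print(f"P(Oak | Healthy) = {p_oak_given_healthy:.4f}")   # 0.3103
```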
Bayes’ Theorem [useful in finding posterior probability]
Given:
• X: data tuple described by measurements made on a set of n attributes
• H: some hypothesis that the data tuple X belongs to class C
Problem: Determine P(H|X): given the evidence/observed data tuple X, find the probability that the hypothesis H holds, i.e. the probability that tuple X belongs to class C.
P(H|X): posterior probability of class H given the attribute values <x1, ..., xn>
P(H): prior probability of class H, regardless of the attribute values <x1, ..., xn>
P(X|H): posterior probability of the attribute values <x1, ..., xn> given class H
P(X): prior probability of the attribute values <x1, ..., xn>, regardless of class H
An Example of Bayes’ Theorem
Determine: $P(H|X) = \dfrac{P(X|H)\,P(H)}{P(X)}$, where P(X) acts as the normalization constant.
An Example of Bayes’ Theorem contd...
P(H|X): posterior probability that the customer will buy a computer = "yes", given that age = 35, income = 40K and credit_rating = fair.
P(H): prior probability of class "yes", regardless of the attribute values <age, income, credit_rating>.
P(X): prior probability of the attribute values <age = 35, income = 40K, credit_rating = fair>, regardless of the class.
Naive Bayes Classifier
• Given: D: a set of tuples
X = <x1, ..., xn>, where xi is the value of attribute Ai
m: the number of classes C1, ..., Cm
• The classifier predicts that X belongs to class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. Under the naive assumption of class-conditional independence, P(X|Ci) = ∏k P(xk|Ci).
Hence, the Naive Bayesian classifier predicts buys_computer = "yes" for the given tuple X.
Problem
For the given training dataset, let the new customer X have the values below. Find the probability that golf will be played or not, i.e. determine which class the query X belongs to.
X = (Outlook = Rainy, Temp = Hot, Humidity = High, Windy = False)
Partial Solution
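The slide's worked solution is an image, so here is a hedged sketch of the computation. The 14-row training table below is the standard play-golf dataset that usually accompanies this exercise; treat it as an assumption, not the slide's exact table.

```python
from collections import Counter

data = [  # (outlook, temp, humidity, windy, play) -- assumed standard dataset
    ("rainy", "hot", "high", "false", "no"),
    ("rainy", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("sunny", "mild", "high", "false", "yes"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("sunny", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("rainy", "mild", "high", "false", "no"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("sunny", "mild", "high", "true", "no"),
]

x = ("rainy", "hot", "high", "false")       # the query tuple X

class_counts = Counter(row[-1] for row in data)
for c in class_counts:
    rows = [r for r in data if r[-1] == c]
    p = class_counts[c] / len(data)         # prior P(C)
    for i, v in enumerate(x):               # class-conditional P(x_i | C)
        p *= sum(1 for r in rows if r[i] == v) / len(rows)
    print(c, round(p, 5))
# The class with the larger score is the prediction for X
# (with this assumed table, "no" wins: golf is not played).
```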
Advantages
• Simple, fast and easy to implement. Needs less training data. Highly scalable.
• Handles continuous and discrete data.
• Easily updatable if the training dataset grows.
Uses
• Text Classification
• Spam Filtering
• Hybrid Recommender
• Online Application
Disadvantages
• Assumption: class-conditional independence, which can cause a loss of accuracy.
• Practically, dependencies exist among variables.
Example: hospitals, patient profiles, family history, disease symptoms, etc.
Major Issues
• Violation of the independence assumption: it assumes that the effect of an attribute value on a given class is independent of the values of the other attributes.
Solution: the Bayesian belief network classifier allows the representation of dependencies among subsets of attributes.
• Zero conditional probability problem: if a given class and feature value never occur together in the training set, then the frequency-based probability estimate will be zero. This is problematic since it wipes out all the information in the other probabilities when they are multiplied.
Solution: incorporate a small-sample (Laplacian) correction into every probability estimate, e.g. count each option at least once, so that no estimate is exactly zero; with a large training set the effect of this correction is negligible.
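A small sketch of the Laplacian (add-one) correction described above; the helper name cond_prob is illustrative.

```python
def cond_prob(value_count, class_count, n_values, smooth=True):
    """Estimate P(value | class); with smoothing, add 1 to the count and
    the number of distinct feature values to the denominator."""
    if smooth:
        return (value_count + 1) / (class_count + n_values)
    return value_count / class_count

# A feature value never seen with this class (0 of 5 tuples, 3 possible values):
print(cond_prob(0, 5, 3, smooth=False))  # 0.0   -> zeroes the whole product
print(cond_prob(0, 5, 3))                # 0.125 -> small but non-zero
```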
K-Nearest Neighbor
(Lazy Learners or Learning from your Neighbor)
Eager Learners Vs. Lazy Learners
Eager Learner
• Constructs a generalized classification model just after receiving the training tuples (well before receiving the new/test tuple), and is ready and eager to classify the new unseen tuple.
• More work when a training tuple is received and less work when making a classification.
• Less expensive but does not support incremental learning.
• e.g. Decision Tree, Bayesian Classifier

Lazy Learner
• Stores the received training tuples and waits until the last minute, the arrival of the new/test tuple. Only when it receives the test tuple does it construct the generalized classification model, based on the similarity between the test tuple and the training tuples.
• Less work (only storage) when a training tuple is received and more work when making a classification.
• More expensive but supports incremental learning.
• e.g. K-Nearest Neighbor, case-based learning
K-Nearest Neighbor Classifier
• The problem of over- or under-fitting can be addressed by tuning the value of the k (number of neighbors) parameter. As we increase the number of neighbors k, the model starts to generalize well, but increasing the value too much would again drop the performance.
• Therefore, it is important to find an optimal value of k such that the model is able to classify the test data set with minimum error.
• Setting k to the square root of the number of training tuples, or to half of that square root, can lead to better results.
• For an even number of classes (e.g. 2 or 4) it is a good idea to choose an odd value of k to avoid ties; conversely, use an even k when you have an odd number of classes.
How to Measure Distance
• Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Here n is the number of dimensions (attributes) and $x_i$ and $y_i$ are the i-th attributes (components) of data objects x and y.
Euclidean distance is often preferred as it gives equal importance to each feature.
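A minimal sketch of both distance measures and a plain majority-vote k-NN classifier; all names are illustrative, and the tiny training set is made up for the demonstration.

```python
from collections import Counter

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def knn_predict(train, query, k=3, dist=euclidean):
    """train: list of (point, label); majority label of the k nearest points."""
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 2), "A"), ((3, 5), "A"), ((2, 0), "B"), ((4, 5), "B")]
print(knn_predict(train, (2, 1), k=3))  # 'A' (two of the 3 nearest are A)
```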
Proximity Measure for Numeric Attributes
[Dissimilarity: Euclidean Distance]

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Scatter plot of x1..x4 omitted.]

Dissimilarity Matrix (with Euclidean distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.10  0
x4    4.24  1     5.39  0

e.g. d(x1, x2) = √((3 − 1)² + (5 − 2)²) = √13 ≈ 3.61.
Proximity Measure for Numeric Attributes
[Dissimilarity: Manhattan Distance]

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Scatter plot of x1..x4 omitted.]

Dissimilarity Matrix (with Manhattan distance)
      x1  x2  x3  x4
x1    0
x2    5   0
x3    3   6   0
x4    6   1   7   0
Proximity Measure for Numeric Attributes
[Dissimilarity: Minkowski Distance]

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Minkowski distance: $d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$, where p is a real number ≥ 1 (p = 1 gives the Manhattan distance, p = 2 the Euclidean distance).
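As a sanity check, the three dissimilarity matrices above can be recomputed with SciPy (assumed available); "cityblock" is SciPy's name for the Manhattan distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])       # x1..x4

print(np.round(squareform(pdist(X, "euclidean")), 2))
print(squareform(pdist(X, "cityblock")))             # Manhattan
print(np.round(squareform(pdist(X, "minkowski", p=3)), 2))
```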
Disadvantages:
• Computationally expensive, with higher space complexity.
• Output depends on the chosen k value, which can reduce accuracy for some values of k.
• Does not work well with large data sets and high dimensionality.
• Sensitive to noisy, missing and outlier data.
• Needs normalization.
Application
• Used in classification and regression
• Used in pattern recognition
• Used in gene expression analysis, protein-protein interaction prediction, and obtaining the 3D structure of proteins
• Used to measure document similarity
Support Vector Machine
Support Vector Machines
Support Vector Machine (SVM) is an algorithm which can be used for both classification and regression challenges. However, it is mostly used in two-group classification problems. In
the SVM algorithm, each data item is plotted as a point in n-dimensional space (where n is
number of features) with the value of each feature being the value of a particular coordinate.
Then, classification is performed by finding the hyperplane that differentiates the two
classes.
What is a hyperplane?
A hyperplane is a generalization of a plane.
In one dimension, it is a point.
In two dimensions, it is a line.
In three dimensions, it is a plane.
In more dimensions, one can call it a hyperplane.
The following figure represents data points in one dimension, where the point L is a separating hyperplane.
Support Vector Machines cont…
How does SVM work?
Let's imagine we have two tags, red and blue, and our data has two features, x and y. We want a classifier that, given a pair of (x, y) coordinates, outputs whether it is red or blue. We plot our already-labeled training data on a plane.
A support vector machine takes these data points and outputs the hyperplane (which in two dimensions is simply a line) that best separates the tags. This line is the decision boundary: anything that falls to one side of it we will classify as blue, and anything that falls to the other side as red.
In 2D, the best hyperplane is simply a line.
Support Vector Machines cont…
But what exactly is the best hyperplane? For SVM, it is the one that maximizes the margin to both tags. In other words: the hyperplane (in 2D, a line) whose distance to the nearest element of each tag is the largest.
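A hedged scikit-learn sketch of the red/blue example: fitting a linear SVM yields the maximum-margin line, and the support vectors are exactly the nearest elements of each tag. The six training points are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Six made-up, linearly separable training points.
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array(["red", "red", "red", "blue", "blue", "blue"])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.predict([[3, 3], [7, 5]]))  # ['red' 'blue']
print(clf.support_vectors_)           # the points that define the margin
```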
Support Vector Machines cont…
(Scenario-1) Identification of the right hyperplane: here we have three hyperplanes (A, B and C). Now the job is to identify the right hyperplane to classify stars and circles. Remember a rule of thumb for identifying the right hyperplane: "Select the hyperplane which segregates the two classes better." In this scenario, hyperplane "B" performs this job excellently.
Support Vector Machines cont…
(Scenario-2) Identify the right hyperplane: here we have three hyperplanes (A, B and C), and all segregate the classes well. Now, how can we identify the right hyperplane? Maximizing the distance between the nearest data point (of either class) and the hyperplane will help us decide. This distance is called the margin.
Support Vector Machines cont…
Below, you can see that the margin for hyperplane C is high compared to both A and B. Hence, we name C as the right hyperplane. Another compelling reason for selecting the hyperplane with the higher margin is robustness: if we select a hyperplane having a low margin, there is a high chance of misclassification.
Support Vector Machines cont…
(Scenario-3) Identify the right hyperplane: Use the rules as discussed in previous section
to identify the right hyperplane.
Some of you may have selected hyperplane B, as it has a higher margin than A. But here is the catch: SVM selects the hyperplane which classifies the classes accurately prior to maximizing the margin. Here, hyperplane B has a classification error, while A has classified everything correctly. Therefore, the right hyperplane is A.
Support Vector Machines cont…
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
The star at the other end is like an outlier for the star class. The SVM algorithm has the ability to ignore outliers and find the hyperplane with the maximum margin. Hence, we can say that SVM classification is robust to outliers.
Support Vector Machines cont…
(Scenario-5) Find the hyperplane to segregate two classes: in this scenario we cannot have a linear hyperplane between the two classes, so how does SVM classify them? Till now, we have only looked at linear hyperplanes.
What is Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
Clustering is unsupervised learning, where we do not know the class labels or the number of labels beforehand.
A clustering algorithm forms groups of objects on the basis of the similarity and dissimilarity between them.
Clustering Example
Imagine a number of objects in a basket. Each item has a distinct set of features (size, shape, color, etc.). Now the task is to group the objects in the basket. A natural first question to ask is: on what basis should these objects be grouped? Perhaps size, shape, or color.
Clustering Application
Marketing: It can be used to characterize & discover customer segments for marketing
purposes.
Biology: It can be used for classification among different species of plants and animals.
Libraries: It is used in clustering different books on the basis of topics and information.
Insurance: It is used to understand customers and their policies, and to identify frauds.
City Planning: It is used to make groups of houses and to study their values based on
their geographical locations and other factors present.
Earthquake studies: By learning the earthquake affected areas, the dangerous zones can
be determined.
Healthcare: It can be used in identifying and classifying the cancerous gene.
Search Engine: Clustering is a backbone of search engines, which try to group similar objects in one cluster and keep dissimilar objects far from each other.
Education: It can be used to monitor students' academic performance. Based on their scores, students are grouped into different clusters, with each cluster denoting a different level of performance.
Requirements of Clustering in Data Mining
Scalability − Highly scalable clustering algorithms are needed to deal with large databases.
Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only small spherical clusters.
High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
Ability to deal with noisy data − Datasets contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may produce poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and usable.
What is Good Clustering?
• A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity.
• The quality of a clustering result also depends on both the similarity measure used by the
method and its implementation.
• The quality of a clustering method is measured by its ability to discover some or all of the
hidden patterns.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
1. Step-1: Select the number K to decide the number of
clusters.
2. Step-2: Select K random points as centroids (they may be points other than those in the input dataset).
3. Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
4. Step-4: Calculate the variance and place a new centroid for each cluster.
5. Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid of each cluster.
6. Step-6: If any reassignment occurs, then go to step-4
else go to step-7.
7. Step-7: The model is ready.
8. End
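The steps above translate into a short sketch; this is a minimal implementation assuming 2-D points stored as tuples, with illustrative helper names (dist2, mean).

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    """Centroid (center of gravity) of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)              # Step 2: random centroids
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # Step 3: nearest centroid
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                          # Steps 5-6: converged
            break
        centroids = new                               # Step 4: recompute
    return centroids, clusters
```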
Working of K-Means Algorithm
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables
is given below:
Working of K-Means Algorithm cont…
Let's take the number k of clusters, i.e., K = 2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.
We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So here we are selecting the below two points as K points, which are not part of the dataset. Consider the below image:
Working of K-Means Algorithm cont…
Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by calculating the distance between two points. So, we will draw a
median between both the centroids. Consider the below image:
Working of K-Means Algorithm cont…
From the image, it is clear that points on the left side of the line are nearer to K1, the blue centroid, and points to the right of the line are closer to the yellow centroid, i.e. K2. Let's color them blue and yellow for clear visualization.
Working of K-Means Algorithm cont…
As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster and find the new centroids as below:
Working of K-Means Algorithm cont…
Next, we will reassign each data point to its new closest centroid. For this, we will repeat the same process of finding a median line. The median line will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So these three points will be assigned to new centroids.
Working of K-Means Algorithm cont…
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
Working of K-Means Algorithm cont…
We will repeat the process by finding the new centers of gravity and drawing a new median line. This time no data point changes its side of the line, which means our model is formed. Consider the below image:
Working of K-Means Algorithm cont…
As our model is ready, we can now remove the assumed centroids; the two final clusters will be as shown in the below image.
K-Means Clustering Example
10 data points with 2 attributes are given. The task is to divide the data points into K (let K = 3) clusters.

Sr. No.  X  Y
1        2  4
2        2  6
3        5  6
4        4  7
5        8  3
6        6  6
7        5  2
8        5  7
9        6  3
10       4  4
K-Means Clustering Example cont..
First plot the data points and find the centroid of the data points. Assume initially that all the data points belong to a single/same cluster.
New Cluster
G-1 = {1, 2, 4} G-2 = {7, 9, 10} G-3 = {3, 5, 6, 8}
K-Means Clustering Example cont..

Iteration-3
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (5, 3)
Centroid of Cluster 3: C3 = (6, 5.5)
[Distance table for points 1-10 omitted.]

New Cluster
G-1 = {1, 2, 4}  G-2 = {5, 7, 9, 10}  G-3 = {3, 6, 8}
K-Means Clustering Example cont..

Iteration-4
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (5.75, 3)
Centroid of Cluster 3: C3 = (5.33, 6.33)
[Distance table for points 1-10 omitted.]

New Cluster
G-1 = {1, 2}  G-2 = {5, 7, 9, 10}  G-3 = {3, 4, 6, 8}
K-Means Clustering Example cont..

Iteration-5
Centroid of Cluster 1: C1 = (2, 5)
Centroid of Cluster 2: C2 = (5.75, 3)
Centroid of Cluster 3: C3 = (5, 6.5)

New Cluster
G-1 = {1, 2}  G-2 = {5, 7, 9, 10}  G-3 = {3, 4, 6, 8}
The clusters are unchanged from Iteration-4, so the algorithm has converged.
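For reference, running library K-Means (scikit-learn, assumed available) on the same 10 points with k = 3 converges to the same grouping as Iteration-5, although random initialization may number the clusters differently.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 4], [2, 6], [5, 6], [4, 7], [8, 3],
              [6, 6], [5, 2], [5, 7], [6, 3], [4, 4]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index of points 1..10
print(km.cluster_centers_)  # compare with C1, C2, C3 above
```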
K-Means Clustering Example cont..
Advantages
• Simple and understandable.
• Items are automatically assigned to clusters.
Disadvantages
• Must pick the number of clusters beforehand.
• Often terminates at a local optimum.
• All items are forced into a cluster.
• Too sensitive to outliers.
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled
examples.
• Recursive application of a standard clustering algorithm can produce a
hierarchical clustering.
Hierarchical Clustering
1. Bottom-up (agglomerative)
• Start with single-instance clusters.
• At each step, join the two closest clusters.
• Design decision: the distance between clusters, e.g. the two closest instances in the clusters.
2. Top-down (divisive)
• Start with one universal cluster.
• Find the two clusters which are farthest apart.
• Proceed recursively on each subset.
• Can be very fast.
3. Both methods produce a dendrogram
Types of Hierarchical Clustering
Divisive hierarchical clustering | Agglomerative hierarchical clustering
Types of Hierarchical Clustering...
Agglomerative Hierarchical Clustering
Each point is assigned to an individual cluster in this technique. Suppose there are 4 data
points, so each of these points would be assigned to a cluster and hence there would be 4
clusters in the beginning.
Then, at each iteration, closest pair of clusters are merged and this step is repeated until only a
single cluster is left.
Divisive hierarchical clustering works in the opposite way. Instead of starting with n clusters,
we start with a single cluster and assign all the points to that cluster.
Next at each iteration, farthest point in the cluster is split and this process is repeated until
each cluster only contains a single point.
Agglomerative Hierarchical Clustering Example
Suppose a faculty wants to divide the students into different groups. The faculty has the marks
scored by each student in an assignment and based on these marks, he/she wants to segment
them into groups. There’s no fixed target as to how many groups to have. Since the faculty
does not know what type of students should be assigned to which group, it cannot be solved as
a supervised learning problem. So, hierarchical clustering is applied to segment the students
into different groups. Let’s take a sample of 5 students.
Creating a Proximity Matrix

Roll No  Mark
1        10
2        7
3        28
4        20
5        35

First, a proximity matrix is created which tells the distance between each pair of points (marks). Since the distance of each point from every other point is calculated, a square matrix of shape n × n (where n is the number of observations) is obtained. Let's build the 5 × 5 proximity matrix for this example.
Agglomerative Hierarchical Clustering Example..
Proximity Matrix

Roll No  1   2   3   4   5
1        0   3   18  10  25
2        3   0   21  13  28
3        18  21  0   8   7
4        10  13  8   0   15
5        25  28  7   15  0
LOOP (Steps 2 & 3): repeat until a single cluster is obtained.
Step 2: Look for the smallest distance in the proximity matrix and merge the points with the smallest distance. Here the smallest distance is 3, between points 1 and 2.

Roll No  1   2   3   4   5
1        0   3   18  10  25
2        3   0   21  13  28
3        18  21  0   8   7
4        10  13  8   0   15
5        25  28  7   15  0

Here, we have taken the maximum of the two marks (7, 10), i.e. 10, to replace the marks for this cluster. [Instead of the maximum, the minimum or the average can also be used.]
Roll No Mark
(1, 2) 10
3 28
4 20
5 35
Steps to Perform Agglomerative Hierarchical
Clustering...
LOOP (Steps 2 & 3): repeat until a single cluster is obtained.
Step 2: Look for the smallest distance in the updated proximity matrix and merge the points with the smallest distance. Here the smallest distance is 7, between points 3 and 5.

Roll No  (1, 2)  3   4   5
(1, 2)   0       18  10  25
3        18      0   8   7
4        10      8   0   15
5        25      7   15  0
Steps to Perform Agglomerative Hierarchical
Clustering...
Step 3: Update the clusters and accordingly update the distance matrix.
Here, we have taken the maximum of the two marks (35, 28) of the merged cluster (5, 3), i.e. 35, to replace the marks for this cluster. [Instead of the maximum, the minimum or the average can also be used.]
Roll No Mark
(1, 2) 10
(5, 3) 35
4 20
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): repeat until a single cluster is obtained.
Step 2: Look for the smallest distance in the proximity matrix and merge the clusters with the smallest distance; then update the proximity matrix. Here the smallest distance is 10, between clusters (1, 2) and (4).
Here, we have taken the maximum of the two marks (10, 20), i.e. 20, to replace the marks for this cluster. [Instead of the maximum, the minimum or the average can also be used.]

Roll No        Mark
((1, 2), (4))  20
(5, 3)         35
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): repeat until a single cluster is obtained.
Step 2: The smallest (and only) remaining distance is 15, between ((1, 2), (4)) and (5, 3), so they are merged into a single cluster and the loop terminates.

Roll No                  Mark
(((1, 2), (4)), (5, 3))  35
Steps to Perform Agglomerative Hierarchical
Clustering cont…
Here, we can see that we have merged samples 1 and 2. The vertical line represents the distance between these samples.
Dendrogram cont…
Iteration-1
Roll  1   2   3   4   5
1     0   3   18  10  25
2     3   0   21  13  28
3     18  21  0   8   7
4     10  13  8   0   15
5     25  28  7   15  0

Iteration-2
Roll    (1, 2)  3   4   5
(1, 2)  0       18  10  25
3       18      0   8   7
4       10      8   0   15
5       25      7   15  0
Dendrogram cont…
Iteration-3
Roll    (1, 2)  (5, 3)  4
(1, 2)  0       25      10
(5, 3)  25      0       15
4       10      15      0

Iteration-4
Roll No        ((1, 2), (4))  (5, 3)
((1, 2), (4))  0              15
(5, 3)         15             0
We can clearly visualize the steps of hierarchical clustering. The longer the vertical lines in the dendrogram, the greater the distance between those clusters.
Dendrogram cont…
Now, we can set a threshold distance and draw a horizontal line. (Generally, the threshold is set in such a way that it cuts the tallest vertical line.) Let's set this threshold to 12 and draw a horizontal line.
The number of clusters will be the number of vertical lines intersected by the line drawn at the threshold. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will have the samples (1, 2, 4) and the other will have the samples (3, 5).
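The whole agglomerative example can be replayed with a short sketch of the slides' own procedure, in which a merged cluster is represented by the maximum of its marks (standard linkage rules in libraries such as SciPy update distances differently, so their merge heights may differ).

```python
marks = {"1": 10, "2": 7, "3": 28, "4": 20, "5": 35}

while len(marks) > 1:
    # Step 2: find the closest pair of cluster representatives.
    d, a, b = min((abs(marks[a] - marks[b]), a, b)
                  for a in marks for b in marks if a < b)
    print(f"merge {a} and {b} at distance {d}")
    # Step 3: the merged cluster is represented by the maximum mark.
    marks[f"({a},{b})"] = max(marks.pop(a), marks.pop(b))
# merges happen at heights 3, 7, 10, 15 -- matching the dendrogram
```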
Divisive Hierarchical Clustering: Example
Starts from one big/master cluster (considering all data points as a single cluster) and splits it into as many small clusters as required.
If the requirement is to form K clusters from n data points, stop splitting once K clusters are reached; otherwise, keep splitting the master cluster until solo clusters are obtained (one data point per cluster).
1. From the proximity matrix, construct the master cluster as a minimum-cost spanning tree using Kruskal's method.
2. Split the cluster into sub-clusters using any flat clustering method (e.g. separate the data points/clusters joined by the larger distance).
3. Repeat step 2 until the desired clusters are achieved or solo clusters are formed.
Divisive Hierarchical Clustering: Example
Apply divisive hierarchical clustering on the 5 data points (set A = {a, b, c, d, e}) with the following distance matrix to output 4 clusters.
Dist a b c d e
a 0
b 1 0
c 2 2 0
d 2 4 1 0
e 3 3 5 3 0
Divisive Hierarchical Clustering: Example...
Step 1: [Initialization]
Construct the minimum-cost spanning tree (MST) using Kruskal's method.
Distances between data points/nodes in increasing order:

Edge  AB  CD  AC  AD  BC  AE  BE  DE  BD  CE
Dist  1   1   2   2   2   3   3   3   4   5

MST edges: A-B (1), C-D (1), A-C (2), A-E (3).
Initial master cluster: {A B C D E}
Divisive Hierarchical Clustering: Example...
Step 2: [LOOP]
Break the link with the highest value to form sub-clusters (a disconnected graph) until the required clusters are achieved.

Iteration-1: break the maximum distance, edge AE with distance 3.
Clusters: {A B C D} {E}

Iteration-2: break the maximum remaining distance, edge AC with distance 2.
Clusters: {A B} {C D} {E}
Divisive Hierarchical Clustering: Example...
Step 2: [LOOP] cont..

Iteration-3: break edge CD with distance 1 (tied with AB), giving the required 4 clusters.
Clusters: {A B} {C} {D} {E}

Resulting hierarchy:
{A B C D E} → {A B C D} and {E}
{A B C D} → {A B} and {C D}
{C D} → {C} and {D}
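A hedged sketch of this divisive procedure: build the MST with Kruskal's method, then delete the heaviest remaining MST edges until the requested number of clusters appears. Function and variable names are illustrative.

```python
def divisive(nodes, edges, k):
    """edges: {(u, v): dist}. Returns a list of clusters (sets of nodes)."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    mst = []
    for (u, v), d in sorted(edges.items(), key=lambda e: e[1]):  # Kruskal
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((d, u, v))

    mst.sort(reverse=True)          # heaviest MST edges first
    kept = mst[k - 1:]              # dropping k-1 edges leaves k components

    # Connected components of the remaining forest are the clusters.
    parent = {n: n for n in nodes}
    for d, u, v in kept:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

edges = {("a","b"): 1, ("a","c"): 2, ("a","d"): 2, ("a","e"): 3,
         ("b","c"): 2, ("b","d"): 4, ("b","e"): 3, ("c","d"): 1,
         ("c","e"): 5, ("d","e"): 3}
print(divisive("abcde", edges, 4))  # e.g. [{'a','b'}, {'c'}, {'d'}, {'e'}]
```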
Closeness of two clusters
The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters; the primary ones are:
• Euclidean distance
• Squared Euclidean distance
• Manhattan distance
• Maximum distance
• Mahalanobis distance