
Classification & Clustering

Classification and Clustering

• Classification and clustering are two methods of pattern identification that share some similarities but also differ in important ways.

• Classification uses predefined classes in which objects are assigned.

• Clustering identifies similarities between objects and, based on those shared characteristics, forms groups that are distinct from other groups of objects. These groups are known as "clusters".
Classification and Clustering: Example
Classification
• Classification is the Data mining technique used to predict group membership for
data instances.
• There are two ways to assign a new object to a given class:
 Crisp Classification
 Probabilistic Classification
Crisp Classification: Given an input, the classifier returns exactly one class label.
Probabilistic Classification: Given an input, the classifier returns its probability of belonging to each class. This is useful when some mistakes can be more costly than others, for example:
• Give an answer only when the predicted probability is > 90%
• Assign the object to the class with the highest probability
• Assign the object to a class only if its probability is > 40%
Clustering
• Clustering is a technique for organising data into classes or clusters such that objects within a cluster have high similarity to one another, while objects of two different clusters are dissimilar to each other.
• In clustering, the similarity as well as the distance between two objects is measured by a similarity function. Generally, intra-cluster distances are less than inter-cluster distances.
Classification Vs Clustering
Parameter | Classification | Clustering
Type | Used for supervised learning | Used for unsupervised learning
Basic | Process of classifying the input instances based on their corresponding class labels | Grouping the instances based on their similarity without the help of class labels
Need | Has labels, so a training and testing dataset is needed to verify the created model | No need of a training and testing dataset
Complexity | More complex | Less complex
Example Algorithms | Logistic regression, Naive Bayes classifier, Support vector machines, KNN, Decision Tree, etc. | k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.
Application | Detection of unsolicited email; Recognition of the face; Approval of a bank loan | Investigation of social networks; Segmentation of an image; Recommendation engines
Supervised vs. Unsupervised Learning
Supervised and unsupervised learning are the two main techniques of ML. They are used in different scenarios and with different datasets.
Supervised vs. Unsupervised Learning
Supervised learning (classification) –
In supervised learning, models are trained using labeled data.
In supervised learning, models need to find the mapping function that maps the input variable (X) to the output variable (Y):
Y = f(X)
Supervised learning needs supervision to train the model, similar to how a student learns things in the presence of a teacher.
Supervised learning can be used for two types of problems: Classification and Regression.
Unsupervised learning (clustering) –
In unsupervised learning, patterns are inferred from unlabeled input data.
The goal of unsupervised learning is to find the structure and patterns in the input data.
Unsupervised learning does not need any supervision. Instead, it finds patterns from the data on its own.
How Supervised Learning Works?
How Unsupervised Learning Works?
Training Dataset Vs. Test Dataset
• The training data is the biggest subset (>= 60%) of the original dataset, which is used to train or fit the model. Firstly, the training data is fed to the algorithms, which lets them learn how to make predictions for the given task. The quality of the training data that we provide to the model is highly responsible for the model's accuracy and prediction ability: the better the quality of the training data, the better the performance of the model.
• The test dataset is another well-organized subset (20%-40%) of the original data, which is independent of the training dataset. However, it has a similar distribution of features and class probabilities, and it is used as a benchmark for model evaluation once the model training is completed.
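A minimal sketch of creating such a split with scikit-learn's train_test_split is shown below; the 70/30 ratio and the toy feature matrix and labels are illustrative assumptions, not data from these slides.

```python
# Hold out an independent test subset for model evaluation (illustrative data).
from sklearn.model_selection import train_test_split

X = [[25, 20000], [35, 60000], [45, 80000], [20, 18000], [52, 110000], [23, 25000]]
y = ["no", "yes", "yes", "no", "yes", "no"]

# 70% training data, 30% test data; stratify keeps the class proportions similar
# in both subsets, mirroring the "similar class probability distribution" above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(X_train), "training tuples,", len(X_test), "test tuples")
```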
Classification Learning: Definition

• Given a collection of records (training set)
–Each record contains a set of attributes; one of the attributes is the class
• Find a model for the class attribute as a function of the values of the other attributes
• Use the training set to construct the model and use the test set to estimate the accuracy of the model
• Goal: previously unseen records should be assigned a class as accurately as possible

Illustrating Classification Learning
Classification - A Two-Step Process
Model construction: describing a set of predetermined classes
 Building the Classifier or Model
 Each tuple/sample is assumed to belong to a predefined class, as defined by
the class label attribute
 The set of tuples used for model construction: training set
 The model is represented as classification rules / decision trees / mathematical
formulae
Model usage: for classifying future or unknown objects
 Using the classifier for classification
 Estimating the accuracy of the model
 The known label of a test sample is compared with the classified result from the model
 Accuracy rate is the percentage of test-set samples that are correctly classified by the model
 The test set is independent of the training set; otherwise over-fitting will occur
Classification Techniques
• Decision Trees
• Bayesian Classifiers
• K-Nearest Neighbour
• Support Vector Machines
• Neural Networks
• Linear Regression
• Fuzzy Algorithm
• Genetic Algorithm
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
– The topmost node in the tree is the root node.
• Decision tree generation consists of two phases
–Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
–Tree pruning
Identify and remove branches that reflect noise or outliers
– Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Decision Tree for Cheat or not Cheat
Attribute Selection Measure (How to choose Attribute?)
• An attribute selection measure finds the best attribute at each step so as to give the smallest decision tree. It provides a splitting rule that determines how the tuples at a given node can best be split.
• Attribute selection measure or Splitting Rules:
• Determine how the tuples at a given node are to be split
• Provide ranking for each attribute describing the tuples
• The attribute with highest score is chosen
• Determine split point or split subset
• Ideally, each resulting partition would be pure [A pure partition is a partition containing tuples
that all belong to the same class ]
• Three popular attribute selection measure:
1. Information Gain 2. Gain Ratio 3. Gini Index
Three Possible Partition Scenario
Information Gain Method

• Entropy/Information is the probabilistic measure of the uncertainty/variance/randomness present in the dataset/attribute.
• Information Gain is the decrease in Entropy when a node is split. It is the measure of the reduction of uncertainty/variance.
• Based on the computed values of Entropy and Information Gain, the best attribute at the respective level is chosen.
Information Gain Method
To find the best feature that serves as the root node in terms of information gain:
1. Find the Expected Entropy of the Dataset: the average amount of information needed to identify the class label of a tuple in D.
   Info(D) = - Σ_{i=1..m} p_i log2(p_i)
   • m: number of classes or labels
   • p_i: probability of class i
2. Find the Entropy of each individual attribute A:
   Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
   • v: number of distinct values (variations) of attribute A
   • D_j: partition of D containing the j-th value of A
3. Find the information gain of each individual attribute:
   Gain(A) = Info(D) - Info_A(D)
   • A is the particular attribute
4. Choose the attribute with the highest information gain as the splitting attribute.
Repeat this process to construct the complete decision tree.
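The following is a small Python sketch of Steps 1-4 for one categorical attribute, assuming the 14-tuple buys_computer data that appears on the next slides (9 "Yes" and 5 "No" tuples); the function and variable names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted entropy
    of the partitions induced by the attribute's values."""
    total = len(labels)
    partitions = {}
    for v, label in zip(values, labels):
        partitions.setdefault(v, []).append(label)
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

age = ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
       "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(entropy(buys), 3))          # Info(D)   ~= 0.940
print(round(info_gain(age, buys), 3))   # Gain(age) ~= 0.246
```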
Which Attribute to Select
Class-labeled training tuples (class: Buys Computer):

RID | Age | Income | Student | Credit_Rating | Buys Computer
1 | youth | high | no | fair | no
2 | youth | high | no | excellent | no
3 | middle_aged | high | no | fair | yes
4 | senior | medium | no | fair | yes
5 | senior | low | yes | fair | yes
6 | senior | low | yes | excellent | no
7 | middle_aged | low | yes | excellent | yes
8 | youth | medium | no | fair | no
9 | youth | low | yes | fair | yes
10 | senior | medium | yes | fair | yes
11 | youth | medium | yes | excellent | yes
12 | middle_aged | medium | no | excellent | yes
13 | middle_aged | high | yes | fair | yes
14 | senior | medium | no | excellent | no

[Figure: candidate split on AGE with branches youth, middle_aged and senior, showing the class labels of the tuples falling into each branch.]
Which Attribute to Select
[Figure: the same training tuples, with a candidate split on INCOME (branches high, medium and low) showing the class labels of the tuples falling into each branch.]
Which Attribute to Select
[Figure: the same training tuples, with a candidate split on STUDENT (branches no and yes) showing the class labels of the tuples falling into each branch.]
Which Attribute to Select
[Figure: the same training tuples, with a candidate split on CREDIT_RATING (branches fair and excellent) showing the class labels of the tuples falling into each branch.]
Information Gain Approach

Step 1: Find the Expected Entropy/Information of the Dataset:
• m: number of classes or labels = 2
• p_i: probability of class i

Class YES has 9 tuples in D and class NO has 5 tuples in D, so
Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Information Gain Approach
Step 2: Iteratively find the best attribute for each level of the decision tree.

Iteration 1: Find the best attribute for the 0th level by computing the Gain of each attribute.
a. Age attribute:
• m: number of classes or labels = 2
• p_i: probability of class i

Entropy_Age(D) = (5/14) × ( -(2/5) log2(2/5) - (3/5) log2(3/5) )     [class variation in youth]
               + (4/14) × ( -(4/4) log2(4/4) )                       [class variation in middle_aged]
               + (5/14) × ( -(3/5) log2(3/5) - (2/5) log2(2/5) )     [class variation in senior]
               = 0.694

Gain(Age) = Entropy(D) - Entropy_Age(D) = 0.940 - 0.694 = 0.246
Information Gain Approach
Step 2: Iteratively find the best attribute for each level of the decision tree.

Iteration 1: Find the best attribute for the 0th level (cont...)

Similarly, find the gain of the attributes "income", "student" and "credit_rating":

b. Gain(income) = 0.029
c. Gain(student) = 0.151
d. Gain(credit_rating) = 0.048

Age has the highest information gain among all the attributes, and hence is selected as the splitting attribute.
Level 0 Decision Tree using the 1st Splitting Attribute: Age
• Tuples falling on age = middle_aged all belong to the same class "YES", and hence a leaf can be created for that branch.

• Continue the tree construction by finding the best gain for the data partitions age = youth and age = senior.
Final Decision Tree
Example of Majority Voting
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf.
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction

Example:
• If (Age == youth) AND (Student == no) then buys_computer = "no"
• If (Age == youth) AND (Student == yes) then buys_computer = "yes"
• If (Age == middle_aged) then buys_computer = "yes"
• If (Age == senior) AND (Credit_Rating == excellent) then buys_computer = "no"
• If (Age == senior) AND (Credit_Rating == fair) then buys_computer = "yes"
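As a hedged illustration, the sketch below fits a decision tree on the same 14-tuple buys_computer table with scikit-learn and prints its root-to-leaf paths, which read like the IF-THEN rules above. scikit-learn needs numeric inputs, so the categorical attributes are one-hot encoded, and the learned tree may differ in detail from the hand-built one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age": ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
            "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"],
    "income": ["high", "high", "high", "medium", "low", "low", "low",
               "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student": ["no", "no", "no", "no", "yes", "yes", "yes",
                "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
})
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

encoder = OneHotEncoder()                       # one-hot encode the categorical attributes
X = encoder.fit_transform(data)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, buys)

# Each printed root-to-leaf path corresponds to one IF-THEN classification rule.
print(export_text(tree, feature_names=list(encoder.get_feature_names_out())))
```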
Drawback of Information gain as attribute selection measure
• The information gain measure is biased towards tests with many outcomes, i.e. it prefers to select attributes having a large number of distinct values.
• For instance, for the dataset shown in the accompanying table, information gain prefers to split on the attribute "income", which would result in a large number of pure partitions, each containing only one tuple.
Info_income(D) = 0   [each partition is pure, so its entropy is 0]
=> Gain_income(D) = Info(D) - Info_income(D) = Info(D)
• Hence, the information gain w.r.t. income is maximal.
Classifier -II
Naive Bayesian Classifier
Classification by Naive Bayesian Classifier
• Bayesian Classifier is a statistical classifier based on Bayes Theorem which can predict the
probability that a given tuple belongs to a particular class.

• Naive Bayesian Classifier is a simple Bayesian classifier that exhibits high accuracy and
speed compared to Decision tree & Neural Network classifier.

• Naive Assumption [Class-Conditional Independence]: the effect of an attribute value on a particular class is independent of the values of the other attributes.

• Supervised learning method.

• Can solve problems involving both categorical and continuous-valued attributes.

• The naive assumption simplifies the computation, hence the name "naive".

Probability Basics: Prior Probability

A prior probability is the probability that an observation will fall into a group before one collects the data or before the experiment happens.

Example:
1. John conducts a single coin toss. What is the a priori probability of landing a head?
A priori probability of landing a head = 1 / 2 = 50%.

2. A six-sided fair dice is rolled. What is the a priori probability of rolling a 2, 4, or 6, in


a dice roll?
The number of desired outcomes is 3 (rolling a 2, 4, or 6), and there are 6 outcomes
in total.
A priori probability of rolling a 2, 4, or 6 = 3 / 6 = 50%.
Probability Basics: Posterior Probability
A posterior probability / conditional probability is the probability of an event occurring given that another event has already occurred.
For example, we might be interested in finding the probability of some event "A" occurring after we account for some event "B" that has just occurred.
The formula to calculate the posterior probability of A occurring given that B occurred is Bayes' Theorem:

P(A|B) = P(B|A) × P(A) / P(B)

Where, A, B = Events
P(B|A): the probability of B occurring given that A has happened
P(A) & P(B): the prior probabilities of event A and event B
Probability Basics: Example Calculating Posterior Probability
A forest is composed of 20% Oak trees and 80% Maple trees. Suppose it is known that
90% of the Oak trees are healthy while just 50% of the Maple trees are healthy.
Q:- Suppose that from a distance one can tell that a particular tree is healthy, what is the
probability that the tree is an Oak tree?
=>
P(Oak): probability that a tree is an Oak is 0.2 (20%)
P(Healthy): probability that a given tree is healthy = (0.2 × 0.9) + (0.8 × 0.5) = 0.58
P(Healthy|Oak): the probability that a tree is healthy given that it is an Oak tree is 0.9 (90%)

P(Oak|Healthy) = P(Healthy|Oak) × P(Oak) / P(Healthy) = (0.9 × 0.2) / 0.58 = 0.3103
Probability Basics: Posterior Probability contd...
For an intuitive understanding of this probability, suppose the following grid represents this
forest with 100 trees. Exactly 20 of the trees are Oak trees (20%) and 18 (90%) of them are
healthy. The other 80 (80%) trees are Maple and 40 (50%) of them are healthy.
(O = Oak, M = Maple, Green = Healthy, Red = Unhealthy)

Out of all 100 trees, exactly 58 are healthy and 18 of these healthy ones are Oak trees. Thus, if a healthy tree is selected, the probability that it is an Oak tree is 18/58 = 0.3103.
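A short Python check of this calculation, using only the numbers given above:

```python
# Bayes' theorem on the forest example: P(Oak | Healthy).
p_oak, p_maple = 0.2, 0.8                  # priors
p_healthy_given_oak = 0.9                  # likelihoods
p_healthy_given_maple = 0.5

p_healthy = p_oak * p_healthy_given_oak + p_maple * p_healthy_given_maple   # 0.58
p_oak_given_healthy = p_healthy_given_oak * p_oak / p_healthy
print(round(p_oak_given_healthy, 4))       # 0.3103
```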
Bayes' Theorem [useful in finding the posterior probability]
Given:
• X: data tuple described by measurements made on a set of n attributes
• H: some hypothesis that the data tuple "X" belongs to class C
Problem: Determine P(H|X): given the evidence/observed data tuple "X", find the probability that the hypothesis H holds, i.e. the probability that tuple "X" belongs to class C.

According to Bayes' Theorem:
P(H|X) = P(X|H) × P(H) / P(X)

P(H|X): posterior probability of class "H" given the attribute values <x1, ..., xn>
P(H): prior probability of class "H" regardless of the attribute values <x1, ..., xn>
P(X|H): posterior probability of the attribute values <x1, ..., xn> given class "H"
P(X): prior probability of the attribute values <x1, ..., xn> regardless of class "H"
An Example of Bayes' Theorem

Example: Let customers be described only by the attributes age, income and credit_rating.

 X: a 35-year-old customer with an income of 40K and fair credit_rating
 H: hypothesis that the customer will buy a computer

Determine:
P(H|X) = P(X|H) × P(H) / P(X)
where P(X|H) is the likelihood, P(H) is the prior and P(X) is the normalization constant.
An Example of Bayes' Theorem contd...

According to Bayes' Theorem:
P(H|X) = P(X|H) × P(H) / P(X)

Let hypothesis H be buys_computer = "YES"

P(H|X): posterior probability that the customer will buy a computer ("YES") given that age = 35, income = 40K and credit_rating = fair.
P(H): prior probability of class "YES" regardless of the attribute values <age, income, credit_rating>.
P(X|H): posterior probability of the attribute values <age = 35, income = 40K, credit_rating = fair> given that buys_computer = "YES".
P(X): prior probability of the attribute values <age = 35, income = 40K, credit_rating = fair> regardless of the class buys_computer.
Naive Bayes Classifier
• Given: D: - Set of tuples.
X: -<x1, ...xn> where xi is the value of attribute Ai
m: Number of classes - C1,..Cm

• The Bayesian classifier predicts that X belongs to class Ci iff

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m and j ≠ i

or, equivalently, maximize the posterior hypothesis:
P(Ci|X) = P(X|Ci) × P(Ci) / P(X)
Naive Bayes Classifier
To maximize the posterior hypothesis, three values are needed: 1. P(X)  2. P(Ci)  3. P(X|Ci)

Naive Assumption [Class-Conditional Independence]: the effect of an attribute value on a particular class is independent of the values of the other attributes.

1. P(X) = P(x1, ..., xn) is the same for every class, so it acts only as a normalization constant and can be ignored in the maximization.
2. P(Ci): probability that class i occurs in the dataset, P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of tuples of class Ci in D.
3. P(X|Ci) = P(x1, ..., xn | Ci). Due to conditional independence, P(X|Ci) = P(x1|Ci) × ... × P(xn|Ci).

Since P(X) is constant for all classes, it is sufficient to maximize P(X|Ci) × P(Ci).
Problem
For the given training dataset, let the new customer X be described as follows. Find the probability that the customer will buy a computer or not, i.e. find which class the query X belongs to.

X = (Age = youth, Income = medium, Student = yes, Credit_Rating = fair)

RID | Age | Income | Student | Credit_Rating | Buys Computer
1 | youth | high | no | fair | no
2 | youth | high | no | excellent | no
3 | middle_aged | high | no | fair | yes
4 | senior | medium | no | fair | yes
5 | senior | low | yes | fair | yes
6 | senior | low | yes | excellent | no
7 | middle_aged | low | yes | excellent | yes
8 | youth | medium | no | fair | no
9 | youth | low | yes | fair | yes
10 | senior | medium | yes | fair | yes
11 | youth | medium | yes | excellent | yes
12 | middle_aged | medium | no | excellent | yes
13 | middle_aged | high | yes | fair | yes
14 | senior | medium | no | excellent | no
Problem
Step-1: Find the class prior probability P(Ci) for i = 1, 2
P(C1) = P(buys_computer = yes) = 9/14 = 0.643
P(C2) = P(buys_computer = no) = 5/14 = 0.357
Step-2: Find the conditional probability of the query tuple X given each class label.

Conditional probability of X w.r.t. class Yes:
P(X|Yes) = P(x1: age=youth|Yes) × P(x2: income=medium|Yes) × P(x3: student=yes|Yes) × P(x4: credit_rating=fair|Yes)
         = 2/9 × 4/9 × 6/9 × 6/9 ≈ 0.044

Conditional probability of X w.r.t. class No:
P(X|No) = P(x1: age=youth|No) × P(x2: income=medium|No) × P(x3: student=yes|No) × P(x4: credit_rating=fair|No)
        = 3/5 × 2/5 × 1/5 × 2/5 ≈ 0.019
Problem
Step-3: Find the class that maximizes P(X|Ci) × P(Ci)

Max{ P(X|buys_comp=Yes) × P(buys_computer = Yes),
     P(X|buys_comp=No) × P(buys_computer = No) }
= Max{ 0.044 × 0.643, 0.019 × 0.357 }
= Max{ 0.028, 0.007 } = 0.028   (class "Yes" maximizes the posterior for X)

Hence, the Naive Bayesian classifier predicts buys_computer = yes for the given tuple X.
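The short sketch below reproduces this calculation by counting directly in the 14-tuple table; it is a minimal illustration, not a library implementation.

```python
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),             ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),      ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),           ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),            ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),     ("senior", "medium", "no", "excellent", "no"),
]
X = ("youth", "medium", "yes", "fair")               # the query tuple

class_counts = Counter(r[-1] for r in rows)          # 9 "yes" tuples, 5 "no" tuples
scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(rows)                          # P(Ci)
    likelihood = 1.0
    for i, value in enumerate(X):                    # product of P(x_k | Ci)
        likelihood *= sum(1 for r in rows if r[-1] == c and r[i] == value) / n_c
    scores[c] = prior * likelihood                   # P(X|Ci) * P(Ci)

print(scores)                          # 'yes' ~= 0.028, 'no' ~= 0.007
print(max(scores, key=scores.get))     # 'yes'
```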
Problem
For the given training dataset (the weather/golf data), let the new sample X be as follows. Find the probability that golf will be played or not, i.e. find which class the query X belongs to.

X = (Outlook = Rainy, Temp = Hot, Humidity = High, Windy = False)

Partial Solution
[Figure: the weather dataset and the partial computation of the class priors and conditional probabilities.]
Advantages
• Simple, fast and easy to implement. Needs less training data. Highly scalable.
• Handles continuous and discrete data.
• Easily updatable if the training dataset grows.
Uses
• Text classification
• Spam filtering
• Hybrid recommender systems
• Online applications
Disadvantages
• Assumption of class-conditional independence, therefore loss of accuracy.
• Practically, dependencies exist among variables.
Example: hospitals - patient profile, family history, disease symptoms, etc.
Major Issues
• Violation of the independence assumption: it assumes that the effect of an attribute value on a given class is independent of the values of the other attributes.
Solution: a Bayesian belief network classifier allows the representation of dependencies among subsets of attributes.
• Zero conditional probability problem: if a given class and feature value never occur together in the training set, then the frequency-based probability estimate will be zero. This is problematic since it will wipe out all information in the other probabilities when they are multiplied.
Therefore it is often desirable to incorporate a small-sample correction into all probability estimates, e.g. add 1 to every count (Laplacian correction), which has a negligible effect on the comparison of the maxima.
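A tiny sketch of the Laplacian (add-1) correction mentioned above; the counts below are hypothetical, used only to show the effect.

```python
def smoothed_probability(value_count, class_count, n_distinct_values):
    """P(value | class) with add-1 (Laplacian) smoothing: no estimate is exactly zero."""
    return (value_count + 1) / (class_count + n_distinct_values)

# Suppose an attribute value was never observed together with a class of 5 tuples,
# and the attribute has 3 distinct values:
print(smoothed_probability(0, 5, 3))   # 0.125 instead of 0.0
print(smoothed_probability(4, 5, 3))   # 0.625 (was 0.8 without smoothing)
```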
K-Nearest Neighbor
(Lazy Learners or Learning from your Neighbor)
Eager Learners Vs. Lazy Learners
Eager Learner | Lazy Learner
Constructs a generalized classification model just after receiving the training tuples (well before receiving the new/test tuple), and is ready and eager to classify new unseen tuples. | Stores the received training tuples and waits until the last minute, i.e. until the new/test tuple is received. Only then does it construct the generalized classification model, based on the similarity between the test tuple and the stored training tuples.
More work when a training tuple is received and less work when making a classification. | Less work (only storage) when a training tuple is received and more work when making a classification.
Less expensive but does not support incremental learning. | More expensive but supports incremental learning.
e.g. Decision Tree, Bayesian Classifier | e.g. K-Nearest Neighbor, case-based learning
K-Nearest Neighbor Classifier

Nearest based classifiers are based on learning by analogy i.e. by comparing a


given test tuples with traing tuples that are similar to it.
• Let the training tuples are described by n attributes. All instances of training
tuples are correspond to points in the n-dimentional space and accordingly each
tuple represents a point in an n-dimensional space
• A k-nearest-neighbor classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple.
Prediction for test data is done on the basis of its neighbour
Properties of KNN

 How many neighbors should be considered? i.e. what is K?
 How do we measure distance?
 Should all points be weighted equally, or should some points have more influence than others?

 How many neighbors should be considered? (what is K)
• At K = 1, the KNN model tends to closely follow the training data and thus shows a high training score and low test score, indicating overfitting.

• The problem can be solved by tuning the value of the k_neighbors parameter. As we increase the number of neighbors K, the model starts to generalize well, but increasing the value too much drops the performance again.

• Therefore, it is important to find an optimal value of K, such that the model is able to classify the test data set with minimum error.

• Setting K to the square root of the number of training tuples, or half of that square root, can lead to better results.
• For an even number of classes (e.g. 2 or 4) it is a good idea to choose an odd value of K to avoid ties; conversely, use an even K when you have an odd number of classes.
How to Measure Distance

DISTANCE FOR NUMERIC ATTRIBUTES

Proximity Measure for Numeric Attributes [Dissimilarity]
Commonly used distance measures for computing dissimilarity are:
• Euclidean Distance: d(x, y) = sqrt( Σ_{i=1..n} (x_i - y_i)^2 )

• Manhattan Distance: d(x, y) = Σ_{i=1..n} |x_i - y_i|

• Minkowski Distance (generalization of Euclidean and Manhattan):
  d(x, y) = ( Σ_{i=1..n} |x_i - y_i|^p )^(1/p)
  [When p = 1 it is Manhattan; when p = 2 it is Euclidean]

Here n is the number of dimensions (attributes) and x_i and y_i are the ith attributes (components) of data objects x and y.
Euclidean distance is often preferred as it gives equal importance to each feature.
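A compact NumPy sketch of these measures (Minkowski with p = 1 and p = 2 giving Manhattan and Euclidean respectively), checked on two of the points used on the next slide:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: (sum_i |x_i - y_i|^p) ** (1/p)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

x1, x2 = [1, 2], [3, 5]
print(minkowski(x1, x2, p=1))            # Manhattan distance  = 5.0
print(round(minkowski(x1, x2, p=2), 2))  # Euclidean distance ~= 3.61
```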
Proximity Measure for Numeric Attributes [Dissimilarity - Euclidean Distance]

Data Matrix
point | attribute1 | attribute2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

[Figure: scatter plot of the four points.]

Dissimilarity Matrix (with Euclidean Distance)
   | x1   | x2   | x3   | x4
x1 | 0    |      |      |
x2 | 3.61 | 0    |      |
x3 | 2.24 | 5.1  | 0    |
x4 | 4.24 | 1    | 5.39 | 0
Proximity Measure for Numeric Attributes [Dissimilarity - Manhattan Distance]

Data Matrix
point | attribute1 | attribute2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

[Figure: scatter plot of the four points.]

Dissimilarity Matrix (with Manhattan Distance)
   | x1 | x2 | x3 | x4
x1 | 0  |    |    |
x2 | 5  | 0  |    |
x3 | 3  | 6  | 0  |
x4 | 6  | 1  | 7  | 0
Proximity Measure for Numeric Attributes [Dissimilarity - Minkowski Distance]

Data Matrix
point | attribute1 | attribute2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

Minkowski Distance = ( Σ_{i=1..n} |x_i - y_i|^p )^(1/p), where p is a real number >= 1.

Minkowski Distance with p = 1 is the Manhattan Distance:
   | x1 | x2 | x3 | x4
x1 | 0  |    |    |
x2 | 5  | 0  |    |
x3 | 3  | 6  | 0  |
x4 | 6  | 1  | 7  | 0

Minkowski Distance with p = 2 is the Euclidean Distance:
   | x1   | x2   | x3   | x4
x1 | 0    |      |      |
x2 | 3.61 | 0    |      |
x3 | 2.24 | 5.1  | 0    |
x4 | 4.24 | 1    | 5.39 | 0
Example: 1
• Given below is data from a questionnaire survey with two attributes (X1: acid durability and X2: strength), used to classify whether a special paper tissue is good or not. Classify the new paper tissue with features X1 = 3 and X2 = 7.

X1: Acid Durability | X2: Strength | Y: Class
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good
Example-1: Solution
Step 1: Initialize and define K.
K = 1 leads to an overfitting problem.
The number of tuples is even, hence consider an odd K. Let K = 3 for this problem.
Step 2: Compute the distance between the input sample and the training samples using the Euclidean distance measure.

X1: Acid Durability | X2: Strength | Y: Class | Distance from (3, 7)
7 | 7 | Bad | sqrt((7-3)^2 + (7-7)^2) = 4
7 | 4 | Bad | sqrt((7-3)^2 + (4-7)^2) = 5
3 | 4 | Good | sqrt((3-3)^2 + (4-7)^2) = 3
1 | 4 | Good | sqrt((1-3)^2 + (4-7)^2) = 3.6
Example-1: Solution
Step 3: Determine the K nearest neighbors (those at minimum distance) and rank them.

X1: Acid Durability | X2: Strength | Y: Class | Distance from (3, 7) | Rank
7 | 7 | Bad | 4 | 3
7 | 4 | Bad | 5 | 4
3 | 4 | Good | 3 | 1
1 | 4 | Good | 3.6 | 2

Step 4: Apply simple majority voting.
Among the 3 nearest neighbors there are 2 "Good" and 1 "Bad". Thus the new paper tissue is classified as Good by simple majority.
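A hedged sketch that reproduces Example 1 with scikit-learn's KNeighborsClassifier (k = 3, Euclidean distance), assuming scikit-learn is available:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]     # (acid durability, strength)
y_train = ["Bad", "Bad", "Good", "Good"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_train, y_train)

print(knn.predict([[3, 7]]))       # ['Good'] by a 2-to-1 majority
print(knn.kneighbors([[3, 7]]))    # distances and indices of the 3 nearest neighbors
```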
Example 2
Given are customer details (age, income, number of credit cards) and their corresponding class. Find the class label for the new tuple/sample:
Customer: RRR, Age: 37, Income: 50K, Credit Cards: 2, Class: ?

Customer | Age | Income | No. of Credit Cards | Class
XXX | 35 | 35K | 3 | No
YYY | 22 | 50K | 2 | Yes
ZZZ | 63 | 200K | 1 | No
PPP | 59 | 170K | 1 | No
QQQ | 25 | 40K | 4 | Yes
RRR | 37 | 50K | 2 | ?
Example-2: Solution
Find the Euclidean distance of the new tuple from each training tuple (with K = 3).

Customer | Age | Income | No. of Credit Cards | Class | Distance
XXX | 35 | 35K | 3 | No | sqrt((35-37)^2 + (35-50)^2 + (3-2)^2) = 15.16
YYY | 22 | 50K | 2 | Yes | sqrt((22-37)^2 + (50-50)^2 + (2-2)^2) = 15
ZZZ | 63 | 200K | 1 | No | sqrt((63-37)^2 + (200-50)^2 + (1-2)^2) = 152.2
PPP | 59 | 170K | 1 | No | sqrt((59-37)^2 + (170-50)^2 + (1-2)^2) = 122
QQQ | 25 | 40K | 4 | Yes | sqrt((25-37)^2 + (40-50)^2 + (4-2)^2) = 15.74
RRR | 37 | 50K | 2 | ? |

The 3 nearest neighbors are YYY (Yes), XXX (No) and QQQ (Yes). The majority of {Yes, No, Yes} is Yes; hence the class of the new tuple is Yes.
Should all attributes be weighted equally, or should some attributes have more influence than others?
Some large-range attributes (like income or loan amount) outweigh smaller-range attributes (like binary, categorical, or numeric attributes such as age), and thus the result obtained can be biased.
Solution: Normalization. Max-min normalization is generally used.
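A minimal sketch of max-min normalization to the [0, 1] range, applied to the attributes of Example 2 before computing distances (NumPy assumed):

```python
import numpy as np

X = np.array([[35, 35000, 3],      # age, income, number of credit cards
              [22, 50000, 2],
              [63, 200000, 1],
              [59, 170000, 1],
              [25, 40000, 4]], dtype=float)

x_min, x_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - x_min) / (x_max - x_min)   # v' = (v - min) / (max - min), per attribute
print(X_norm.round(2))                   # every attribute now lies in [0, 1]
```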
KNN Classifier Algorithm
Advantages:
• No training period: there is no discriminative function or model built at the training stage.
• As there is no training period, any new data can be added seamlessly without impacting the accuracy of the result.
• Easy to implement.
• Best used where the probability distribution is unknown.
• Useful for non-linear data as no assumption about the data is required.

Disadvantages:
• Computationally expensive with higher space complexity.
• Output depends on the chosen K value, which can reduce accuracy for some K values.
• Does not work well with large data sets and high dimensions.
• Sensitive to noisy, missing and outlier data.
• Needs normalization.
Applications
• Used in classification and regression
• Used in pattern recognition
• Used in gene expression analysis, protein-protein interaction prediction and protein 3D-structure prediction
• Used to measure document similarity
Support Vector Machine
Support Vector Machines
Support Vector Machine (SVM) is an algorithm which can be used for both classification and regression challenges. However, it is mostly used in two-group classification problems. In the SVM algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then, classification is performed by finding the hyperplane that best differentiates the two classes.
What is a hyperplane?
A hyperplane is a generalization of a plane.
 In one dimension, it is called a point.
 In two dimensions, it is a line.
 In three dimensions, it is a plane.
 In more dimensions, it is called a hyperplane.
The following figure represents data points in one dimension, where the point L is a separating hyperplane.
Support Vector Machines cont…
How does SVM works?
 Let’s imagine we have two tags: red and blue, and our
data has two features: x and y. We want a classifier
that, given a pair of (x, y) coordinates, outputs if it’s
either red or blue. We plot our already labeled training
data on a plane.
 A support vector machine takes these data points and outputs the hyperplane (which in two dimensions is simply a line) that best separates the tags. This line is the decision boundary: anything that falls on one side of it is classified as blue, and anything that falls on the other side as red.
 In 2D, the best hyperplane is simply a line.
Support Vector Machines cont…
But, what exactly is the best hyperplane? For SVM, it’s the one that maximizes the
margins from both tags. In other words: the hyperplane (in 2D, it's a line) whose distance
to the nearest element of each tag is the largest.
Support Vector Machines cont…
(Scenario-1) Identification of the right hyperplane: Here, we have three hyperplanes (A, B
and C). Now, the job is to identify the right hyperplane to classify star and circle.
Remember a thumb rule - identify the right hyperplane: “Select the hyperplane which
segregates the two classes better”. In this scenario, hyperplane “B” has excellently
performed this job.
Support Vector Machines cont…
(Scenario-2) Identify the right hyperplane: Here, we have three hyperplanes (A, B and C)
and all are segregating the classes well. Now, How can we identify the right hyperplane?

Here, maximizing the distances between nearest data point (either class) and hyperplane
will help to decide the right hyperplane. This distance is called as margin.
Support Vector Machines cont…

Below, you can see that the margin for hyperplane C is high compared to both A and B. Hence, we name C as the right hyperplane. Another important reason for selecting the hyperplane with the higher margin is robustness: if we select a hyperplane with a low margin, then there is a high chance of misclassification.
Support Vector Machines cont…
(Scenario-3) Identify the right hyperplane: Use the rules as discussed in previous section
to identify the right hyperplane.

Some of you may have selected the hyperplane B as it has higher margin compared to A.
But, here is the catch, SVM selects the hyperplane which classifies the classes accurately
prior to maximizing margin. Here, hyperplane B has a classification error and A has
classified all correctly. Therefore, the right hyperplane is A.
Support Vector Machines cont…
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

One star at the other end is like an outlier for the star class. The SVM algorithm has a feature to ignore outliers and find the hyperplane with the maximum margin. Hence, we can say that SVM classification is robust to outliers.
Support Vector Machines cont…
(Scenario-5) Find the hyperplane that segregates the two classes: In this scenario, we cannot have a linear hyperplane between the two classes, so how does SVM classify them? Till now, we have only looked at linear hyperplanes.

SVM can solve this problem easily! It does so by introducing an additional feature. Here, we can add a new feature z = x^2 + y^2 and plot the data points on the x and z axes:

In the above plot, the points to consider are:

 All values of z will be positive, as z is the squared sum of both x and y.
 In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
Support Vector Machines cont…
 In the SVM classifier, it is now easy to have a linear hyperplane between these two classes. But another question arises: do we need to add this feature manually to obtain such a hyperplane? No, the SVM algorithm has a technique called the kernel trick. The SVM kernel takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems. Simply put, it performs some fairly complex data transformations and then finds the way to separate the data based on the labels or outputs you have defined.
 When we look at this hyperplane in the original input space, it looks like a circle.
Types of SVM
 Linear SVM: Linear SVM is used for data that are linearly separable, i.e. a dataset that can be categorized into two classes by a single straight line. Such data points are termed linearly separable data, and the classifier used is described as a linear SVM classifier.
 Non-linear SVM: Non-linear SVM is used for data that are not linearly separable, i.e. a straight line cannot be used to classify the dataset. For this, we use the kernel trick, which maps the data points into a higher dimension where they can be separated using planes or other mathematical functions. Such data points are termed non-linear data, and the classifier used is termed a non-linear SVM classifier.
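A hedged sketch of the kernel trick with scikit-learn: on ring-shaped data (similar to Scenario-5), a linear SVM performs poorly while an RBF-kernel SVM separates the classes well. The make_circles dataset and the chosen parameters are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A two-class dataset where one class forms a ring around the other.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)            # linear SVM
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)   # non-linear SVM

print("linear kernel accuracy:", round(linear_svm.score(X_test, y_test), 2))
print("RBF kernel accuracy:   ", round(rbf_svm.score(X_test, y_test), 2))
```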
CLUSTERING
What is Cluster
 A cluster is a subset of objects which are “similar”
 A subset of objects such that the distance between any two objects in the cluster is less
than the distance between any object in the cluster and any object not located inside it.

What is Clustering
 Clustering is the task of dividing the population or
data points into a number of groups such that data
points in the same groups are more similar to other
data points in the same group and dissimilar to the
data points in other groups.
 Clustering is unsupervised learning, where we do not know the class labels or the number of labels beforehand.
 Clustering algorithm forms the groups of objects on
the basis of similarity and dissimilarity between them.
Clustering Example
Imagine a number of objects in a basket. Each item has a distinct set of features (size, shape, color, etc.). Now the task is to group the objects in the basket. A natural first question to ask is, 'on what basis should these objects be grouped?' Perhaps by size, shape, or color.
Clustering Application
 Marketing: It can be used to characterize & discover customer segments for marketing
purposes.
 Biology: It can be used for classification among different species of plants and animals.
 Libraries: It is used in clustering different books on the basis of topics and information.
 Insurance: It is used to analyse the customers and their policies and to identify frauds.
 City Planning: It is used to make groups of houses and to study their values based on
their geographical locations and other factors present.
 Earthquake studies: By learning the earthquake affected areas, the dangerous zones can
be determined.
 Healthcare: It can be used in identifying and classifying the cancerous gene.
 Search Engine: It is the backbone behind the search engines. Search engines try to group
similar objects in one cluster and the dissimilar objects far from each other.
 Education: It can be used to monitor students' academic performance. Based on their scores, students are grouped into different clusters, where each cluster denotes a different level of performance.
Requirements of Clustering in Data Mining
 Scalability − Highly scalable clustering algorithms are needed to deal with large databases.
 Ability to deal with different kinds of attributes - Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
 Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only spherical clusters of small size.
 High dimensionality - The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
 Ability to deal with noisy data − Datasets may contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and usable.
What is Good Clustering?
• A good clustering method will produce high quality clusters in which:

 The intra-class (that is, intra-cluster) similarity is high.

 The inter-class similarity is low.

• The quality of a clustering result also depends on both the similarity measure used by the
method and its implementation.

• The quality of a clustering method is measured by its ability to discover some or all of the
hidden patterns.

• However, objective evaluation is problematic: usually done by human / expert inspection.


Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion: K-Mean Algorithm
• Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or
objects) using some criterion: Divisive and Agglomerative Methods of hierarchical
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to the data
Distinction between Partitional and Hierarchical sets of
clusters
• Partitional Clustering
• It divides data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. It constructs a partition of a database D of n objects into a set of k (given) clusters.
• Heuristic methods: k-means and k-medoids algorithms
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
• There are mainly two types of hierarchical clustering:
 Agglomerative hierarchical clustering
 Divisive Hierarchical clustering
K- Means Clustering Algorithm
 It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and points within a group have similar properties.
 It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters. A centroid is a data point that lies at the center of a cluster.
 The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k
should be predetermined in this algorithm.
 The k-means clustering algorithm mainly performs two tasks:
 Determines the best value for k center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
 Hence each cluster has datapoints with some commonalities, and away from other clusters.
K- Means Algorithm cont…

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
1. Step-1: Select the number K to decide the number of
clusters.
2. Step-2: Select random K points or centroids. (It can be
other from the input dataset).
3. Step-3: Assign each data point to their closest
centroid, which will form the predefined K clusters.
4. Step-4: Calculate the variance and place a new
centroid of each cluster.
5. Step-5: Repeat the third step, i.e. reassign each datapoint to the new closest centroid of each cluster.
6. Step-6: If any reassignment occurs, then go to step-4
else go to step-7.
7. Step-7: The model is ready.
8. End
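A minimal NumPy sketch of the loop in Steps 1-7 (a hand-rolled illustration, not a library implementation): assign every point to its nearest centroid, recompute the centroids, and stop when the assignment no longer changes. It assumes no cluster ever becomes empty.

```python
import numpy as np

def k_means(X, centroids, max_iter=100):
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Steps 3/5: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # Step 6: no reassignment, model is ready
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids
```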
Working of K-Means Algorithm
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables
is given below:
Working of K-Means Algorithm cont…
 Let's take the number k of clusters, i.e., K = 2, to identify the clusters in the dataset and put the points into different clusters. It means here we will try to group these data points into two different clusters.
 We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the two points shown below as the K centroids, which are not part of our dataset. Consider the below image:
Working of K-Means Algorithm cont…
 Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by calculating the distance between two points. So, we will draw a
median between both the centroids. Consider the below image:
Working of K-Means Algorithm cont…
 From the image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points on the right of the line are closer to the yellow centroid, i.e. K2. Let's color them blue and yellow for clear visualization.
Working of K-Means Algorithm cont…
 As we need to find the closest cluster centers, we will repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points currently assigned to each cluster, and obtain new centroids as below:
Working of K-Means Algorithm cont…
 Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
Working of K-Means Algorithm cont…

 As reassignment has taken place, so we will again go to the step-4, which is finding
new centroids or K-points.
Working of K-Means Algorithm cont…

 We will repeat the process by finding the center of gravity of centroids, so the new

centroids will be as shown in the below image:


Working of K-Means Algorithm cont…
 As we got the new centroids so again will draw the median line and reassign the data

points. So, the image will be:


Working of K-Means Algorithm cont…
 We can see in the previous image; there are no dissimilar data points on either side of the

line, which means our model is formed. Consider the below image:
Working of K-Means Algorithm cont…
 As our model is ready, so we can now remove the assumed centroids, and the two final

clusters will be as shown in the below image:


K-Means Clustering Example

10 data points with 2 attributes are given. The task is to divide the data points into K (let K = 3) clusters.

Sr. No. | X | Y
1 | 2 | 4
2 | 2 | 6
3 | 5 | 6
4 | 4 | 7
5 | 8 | 3
6 | 6 | 6
7 | 5 | 2
8 | 5 | 7
9 | 6 | 3
10 | 4 | 4
K-Means Clustering Example cont..
First, plot the data points and find the centroid of the data points. Assume initially that all the data points belong to a single/same cluster.



K-Means Clustering Example cont..
Let K = 3, i.e. the problem is to divide the data points into 3 clusters.
For this, select 3 random points/seed points as the initial centroids of the 3 clusters. Let these be (1, 5), (4, 1) & (8, 4).



K-Means Clustering Example cont..
(Calculate the distances of all data points from the 3 centroids.)

[Table: distance of each of the 10 data points from the centroids (1, 5), (4, 1) and (8, 4).]

• Compare the 3 distances of each individual data point from the corresponding centroids and find the minimum distance.
• The data point belongs to the cluster whose centroid is closest.

Clusters
G-1 = {1, 2, 4}   G-2 = {7, 10}   G-3 = {3, 5, 6, 8, 9}
K-Means Clustering Example cont..
K-Means Clustering Example cont..
Iteration-2
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (4.5, 3)
Centroid of Cluster 3: C3 = (6, 5)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2, 4}   G-2 = {7, 9, 10}   G-3 = {3, 5, 6, 8}
K-Means Clustering Example cont..
K-Means Clustering Example cont..
Iteration-3
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (5, 3)
Centroid of Cluster 3: C3 = (6, 5.5)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2, 4}   G-2 = {5, 7, 9, 10}   G-3 = {3, 6, 8}
K-Means Clustering Example cont..
Iteration-4
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (5.75, 3)
Centroid of Cluster 3: C3 = (5.33, 6.33)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2}   G-2 = {5, 7, 9, 10}   G-3 = {3, 4, 6, 8}
K-Means Clustering Example cont..
Iteration-5
Centroid of Cluster 1: C1 = (2, 5)
Centroid of Cluster 2: C2 = (5.75, 3)
Centroid of Cluster 3: C3 = (5, 6.5)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2}   G-2 = {5, 7, 9, 10}   G-3 = {3, 4, 6, 8}
Since the cluster assignment did not change from Iteration-4, the algorithm stops here.
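As a hedged cross-check, the sketch below runs scikit-learn's KMeans on the 10 points with the same three seed centroids (1, 5), (4, 1) and (8, 4); n_init=1 keeps exactly that initialization. The cluster numbering may differ from the G-1/G-2/G-3 labels used above.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 4], [2, 6], [5, 6], [4, 7], [8, 3],
                   [6, 6], [5, 2], [5, 7], [6, 3], [4, 4]], dtype=float)
seeds = np.array([[1, 5], [4, 1], [8, 4]], dtype=float)

km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(points)
print(km.labels_)            # cluster index assigned to each of the 10 points
print(km.cluster_centers_)   # final centroids after convergence
```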
K-Means Clustering Example cont..
Advantages
• Simple, understandable
• Items automatically assigned to clusters

Disadvantages
• Must pick the number of clusters beforehand
• Often terminates at a local optimum.
• All items forced into a cluster
• Too sensitive to outliers
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled
examples.
• Recursive application of a standard clustering algorithm can produce a
hierarchical clustering.
Hierarchical Clustering
1. Bottom up (agglomerative)
• Start with single-instance clusters
• At each step, join the two closest clusters
• Design decision: distance between clusters, e.g. the distance between the two closest instances in the clusters
2. Top down (divisive)
• Start with one universal cluster
• Find two clusters which are far apart
• Proceed recursively on each subset
• Can be very fast
3. Both methods produce a dendrogram
Types of Hierarchical Clustering
Divisive Hierarchical Clustering          Agglomerative Hierarchical Clustering
Types of Hierarchical Clustering...
Agglomerative Hierarchical Clustering
 Each point is assigned to an individual cluster in this technique. Suppose there are 4 data
points, so each of these points would be assigned to a cluster and hence there would be 4
clusters in the beginning.
 Then, at each iteration, closest pair of clusters are merged and this step is repeated until only a
single cluster is left.

Divisive Hierarchical clustering

 Divisive hierarchical clustering works in the opposite way. Instead of starting with n clusters,
we start with a single cluster and assign all the points to that cluster.
 Next at each iteration, farthest point in the cluster is split and this process is repeated until
each cluster only contains a single point.
Agglomerative Hierarchical Clustering Example

Suppose a faculty wants to divide the students into different groups. The faculty has the marks
scored by each student in an assignment and based on these marks, he/she wants to segment
them into groups. There’s no fixed target as to how many groups to have. Since the faculty
does not know what type of students should be assigned to which group, it cannot be solved as
a supervised learning problem. So, hierarchical clustering is applied to segment the students
into different groups. Let’s take a sample of 5 students.
Creating a Proximity Matrix

Roll No | Mark
1 | 10
2 | 7
3 | 28
4 | 20
5 | 35

First, a proximity matrix is created which tells the distance between each pair of these points (marks). Since the distance is calculated from each point to each of the other points, a square matrix of shape n × n (where n is the number of observations) is obtained. Let's make the 5 × 5 proximity matrix for this example.
Agglomerative Hierarchical Clustering Example..
Proximity Matrix

Roll No | 1 | 2 | 3 | 4 | 5
1 | 0 | 3 | 18 | 10 | 25
2 | 3 | 0 | 21 | 13 | 28
3 | 18 | 21 | 0 | 8 | 7
4 | 10 | 13 | 8 | 0 | 15
5 | 25 | 28 | 7 | 15 | 0

The diagonal elements of this matrix are always 0, as the distance of a point from itself is always 0. The Euclidean distance formula is used to calculate the rest of the distances. So, to calculate the distance between
Point 1 and 2: √((10-7)^2) = √9 = 3
Point 1 and 3: √((10-28)^2) = √324 = 18, and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.
Steps to Perform Agglomerative Hierarchical Clustering
Step 1: First, every point is assigned to an individual cluster. Different colors here represent different clusters. Hence, there are 5 different clusters for the 5 points in the data.

LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance.

Roll No | 1 | 2 | 3 | 4 | 5
1 | 0 | 3 | 18 | 10 | 25
2 | 3 | 0 | 21 | 13 | 28
3 | 18 | 21 | 0 | 8 | 7
4 | 10 | 13 | 8 | 0 | 15
5 | 25 | 28 | 7 | 15 | 0

Here, the smallest distance is 3, and hence points 1 and 2 are merged.
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, we have taken the maximum of the two marks (7, 10), i.e. 10, to represent this cluster. [Instead of the maximum, the minimum or the average value can also be used.]

Roll No | Mark
(1, 2) | 10
3 | 28
4 | 20
5 | 35
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance. Then the proximity matrix is updated.

Roll No | (1, 2) | 3 | 4 | 5
(1, 2) | 0 | 18 | 10 | 25
3 | 18 | 0 | 8 | 7
4 | 10 | 8 | 0 | 15
5 | 25 | 7 | 15 | 0

Here, the smallest distance is 7, and hence points 3 and 5 are merged.
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, we have taken the maximum of the marks of rolls 5 and 3 (35, 28), i.e. 35, to represent this cluster. [Instead of the maximum, the minimum or the average value can also be used.]

Roll No | Mark
(1, 2) | 10
(5, 3) | 35
4 | 20
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance. Then the proximity matrix is updated.

Roll No | (1, 2) | (5, 3) | 4
(1, 2) | 0 | 25 | 10
(5, 3) | 25 | 0 | 15
4 | 10 | 15 | 0

Here, the smallest distance is 10, and hence cluster (1, 2) and point 4 are merged.
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, we have taken the maximum of the marks of clusters (1, 2) and (4), i.e. 20, to represent this cluster. [Instead of the maximum, the minimum or the average value can also be used.]

Roll No | Mark
((1, 2), (4)) | 20
(5, 3) | 35
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance. Then the proximity matrix is updated.

Roll No | ((1, 2), (4)) | (5, 3)
((1, 2), (4)) | 0 | 15
(5, 3) | 15 | 0
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, only one option remains: merge the two clusters.

Roll No | Mark
(((1, 2), (4)), (5, 3)) | 35
Steps to Perform Agglomerative Hierarchical
Clustering cont…

We started with 5 clusters and finally have a single cluster.


Dendrogram
 To get the number of clusters for hierarchical clustering, we make use of the concept
called a Dendrogram.
 A dendrogram is a tree-like diagram that records the sequences of merges or splits.
 Let's get back to the faculty-student example. Whenever we merge two clusters, the dendrogram records the distance between these clusters and represents it in graph form.

We have the samples of the dataset on the


x-axis and the distance on the y-axis.
Whenever two clusters are merged, we
will join them in this dendrogram and the
height of the join will be the distance
between these points (Distance can be
obtained from Proximity Matrix).
Dendrogram cont…
We started by merging sample 1 and 2 and the
distance between these two samples was 3.
Let’s plot this in the dendrogram.

Here, we can see that we have merged sample 1 and 2. The vertical line represents the
distance between these samples.
Dendrogram cont…
Iteration-1
Roll No | 1 | 2 | 3 | 4 | 5
1 | 0 | 3 | 18 | 10 | 25
2 | 3 | 0 | 21 | 13 | 28
3 | 18 | 21 | 0 | 8 | 7
4 | 10 | 13 | 8 | 0 | 15
5 | 25 | 28 | 7 | 15 | 0

Iteration-2
Roll No | (1, 2) | 3 | 4 | 5
(1, 2) | 0 | 18 | 10 | 25
3 | 18 | 0 | 8 | 7
4 | 10 | 8 | 0 | 15
5 | 25 | 7 | 15 | 0
Dendrogram cont…
Iteration-3
Roll No | (1, 2) | (5, 3) | 4
(1, 2) | 0 | 25 | 10
(5, 3) | 25 | 0 | 15
4 | 10 | 15 | 0

Iteration-4
Roll No | ((1, 2), (4)) | (5, 3)
((1, 2), (4)) | 0 | 15
(5, 3) | 15 | 0

We can clearly visualize the steps of hierarchical clustering. The longer the vertical lines in the dendrogram, the greater the distance between those clusters.
Dendrogram cont…
Now, we can set a threshold distance and draw a horizontal line (Generally, the
threshold is set in such a way that it cuts the tallest vertical line). Let’s set this
threshold as 12 and draw a horizontal line:

The number of clusters will be the number of vertical lines intersected by the line drawn at the threshold. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will contain the samples (1, 2, 4) and the other will contain the samples (3, 5).
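A hedged SciPy sketch of the same example is shown below. It uses complete linkage as one standard choice; this reproduces the merge order of the worked example ((1, 2), then (3, 5), then ((1, 2), 4)), but its merge heights differ from the max-mark simplification used above, so the cut that yields two clusters is taken at a threshold of 15 rather than 12.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

marks = np.array([[10], [7], [28], [20], [35]])     # roll numbers 1..5
Z = linkage(marks, method="complete", metric="euclidean")

# Cutting the tree at distance 15 leaves two clusters: {1, 2, 4} and {3, 5}.
print(fcluster(Z, t=15, criterion="distance"))
# dendrogram(Z, labels=[1, 2, 3, 4, 5])   # optional plot (needs matplotlib)
```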
Divisive Hierarchical Clustering: Example

Starts from a big/master cluster (by considering all data points as a single cluster) and splits it into as many small clusters as required.
If the requirement is to form K clusters from n data points, then stop splitting once K clusters are reached; otherwise, keep splitting the master cluster until solo clusters are obtained (one data point per cluster).

Steps in the Divisive Algorithm:

1. From the proximity matrix, construct the master cluster as a minimum-cost spanning tree using Kruskal's method.
2. Split the cluster into sub-clusters using any flat clustering criterion (e.g. break the edge with the largest distance).
3. Repeat step 2 until the desired number of clusters is achieved or solo clusters are formed.
Divisive Hierarchical Clustering: Example

Apply divisive hierarchical clustering on the 5 data points (set A = {a, b, c, d, e}) with the following distance matrix to output 4 clusters.

Dist | a | b | c | d | e
a | 0 |   |   |   |
b | 1 | 0 |   |   |
c | 2 | 2 | 0 |   |
d | 2 | 4 | 1 | 0 |
e | 3 | 3 | 5 | 3 | 0
Divisive Hierarchical Clustering: Example...
Step 1: [Initialization]
Construct the minimum cost spanning tree (MST) using Kruskal's method.
Distances among data points/nodes in increasing order:

Edge | AB | CD | AC | AD | BC | AE | BE | DE | BD | CE
Dist | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 4 | 5

[Figure: the MST over {A, B, C, D, E} with edges A-B (1), A-C (2), C-D (2) and A-E (3).]
Initial master cluster: {A B C D E}
Divisive Hierarchical Clustering: Example...
Step 2: [LOOP]
Break the link with the highest value to form sub-clusters (a disconnected graph) until the required number of clusters is achieved.

Iteration-1: Break the edge with the maximum distance: edge AE with distance 3.
[Figure: remaining tree A-B (1), A-C (2), C-D (2), with E isolated.]
Clusters: {A B C D} {E}

Iteration-2: Break the edge with the maximum distance: edge AC with distance 2.
[Figure: remaining edges A-B (1) and C-D (2), with E isolated.]
Clusters: {A B} {C D} {E}
Divisive Hierarchical Clustering: Example...
Step 2: [LOOP] cont..
Iteration-3: Break the edge with the maximum distance: edge CD with distance 2.
[Figure: only edge A-B (1) remains; C, D and E are isolated.]

Clusters: {A B} {C} {D} {E}
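A hedged SciPy sketch of this MST-based divisive procedure: build the minimum spanning tree of the distance matrix, then repeatedly remove the heaviest remaining edge until the requested number of connected components (clusters) is reached. Tie-breaking among equal-length edges may differ from the hand-worked example above.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

labels = ["a", "b", "c", "d", "e"]
D = np.array([[0, 1, 2, 2, 3],        # the 5x5 distance matrix of set A, as printed above
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

mst = minimum_spanning_tree(D).toarray()        # Step 1: master cluster as an MST
n_clusters = 4
for _ in range(n_clusters - 1):                 # Step 2: cut the longest remaining edges
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0

n, comp = connected_components(mst, directed=False)
print(n, [{l for l, c in zip(labels, comp) if c == k} for k in range(n)])
```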


Dendrogram of Divisive Hierarchical Clustering: Example...

[Figure: dendrogram of the divisive clustering, splitting {A B C D E} into {A B C D} and {E}, then {A B C D} into {A B} and {C D}, and finally {C D} into {C} and {D}.]
Closeness of Two Clusters

The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters; the primary ones are:
 Euclidean distance
 Squared Euclidean distance
 Manhattan distance
 Maximum distance
 Mahalanobis distance
