
Classification & Clustering

Classification and Clustering

• Classification and clustering are two methods of pattern identification that share some similarities but also differ in important ways.

• Classification uses predefined classes in which objects are assigned.

• Clustering identifies similarities between objects and, based on those shared characteristics, forms groups that are distinct from other groups of objects. These groups are known as "clusters".
Classification and Clustering: Example
Classification
• Classification is the Data mining technique used to predict group membership for
data instances.
• There are two ways to assign a new object to a given class:
 Crisp Classification
 Probabilistic Classification
Crisp Classification: Given an input, the classifier returns exactly one class label.
Probabilistic Classification: Given an input, the classifier returns its probability of belonging to each class. This is useful when some mistakes can be more costly than others, for example:
• Give an answer only when the predicted probability is > 90%
• Assign the object to the class with the highest probability
• Assign the object to a class only if its probability is > 40%
Clustering
• Clustering is a technique for organising data into classes or clusters such that objects within a cluster have high similarity to one another, while objects of two different clusters are dissimilar to each other.
• In clustering, the similarity as well as the distance between two objects is measured by a similarity function. Generally, intra-cluster distances are less than inter-cluster distances.
Classification Vs Clustering
Parameter | Classification | Clustering
Type | Used for supervised learning | Used for unsupervised learning
Basic | Process of classifying the input instances based on their corresponding class labels | Grouping the instances based on their similarity without the help of class labels
Need | Has labels, so a training and testing dataset is needed to verify the created model | No need of a training and testing dataset
Complexity | More complex | Less complex
Example Algorithms | Logistic regression, Naive Bayes classifier, Support vector machines, KNN, Decision Tree, etc. | k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.
Application | Detection of unsolicited email; Recognition of the face; Approval of a bank loan | Investigation of social networks; Segmentation of an image; Recommendation engines
Supervised vs. Unsupervised Learning
Supervised and unsupervised learning are the two main techniques of ML. They are used in different scenarios and with different datasets.
Supervised vs. Unsupervised Learning
Supervised learning (classification) –
In supervised learning, models are trained using labeled data.
In supervised learning, models need to find the mapping function that maps the input variable (X) to the output variable (Y):
Y = f(X)
Supervised learning needs supervision to train the model, similar to how a student learns things in the presence of a teacher.
Supervised learning can be used for two types of problems: Classification and Regression.
Unsupervised learning (clustering) –
In unsupervised learning, patterns are inferred from unlabeled input data.
The goal of unsupervised learning is to find the structure and patterns in the input data.
Unsupervised learning does not need any supervision. Instead, it finds patterns from the data on its own.
How Supervised Learning Works?
How Unsupervised Learning Works?
Training Dataset Vs. Test Dataset
• The training data is the biggest subset (>= 60%) of the original dataset, which is used to train or fit the model. Firstly, the training data is fed to the algorithms, which lets them learn how to make predictions for the given task. The quality of the training data that we provide to the model is highly responsible for the model's accuracy and prediction ability: the better the quality of the training data, the better the performance of the model.
• The test dataset is another well-organized subset (20%-40%) of the original data, which is independent of the training dataset. However, it has a similar distribution of features and class probabilities, and it is used as a benchmark for model evaluation once the model training is completed.
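A minimal sketch of creating such a split with scikit-learn's train_test_split is shown below; the 70/30 ratio and the toy feature matrix and labels are illustrative assumptions, not data from these slides.

```python
# Hold out an independent test subset for model evaluation (illustrative data).
from sklearn.model_selection import train_test_split

X = [[25, 20000], [35, 60000], [45, 80000], [20, 18000], [52, 110000], [23, 25000]]
y = ["no", "yes", "yes", "no", "yes", "no"]

# 70% training data, 30% test data; stratify keeps the class proportions similar
# in both subsets, mirroring the "similar class probability distribution" above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(X_train), "training tuples,", len(X_test), "test tuples")
```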
Classification Learning: Definition

• Given a collection of records (training set)
–Each record contains a set of attributes; one of the attributes is the class
• Find a model for the class attribute as a function of the values of the other attributes
• Use the training set to construct the model and use the test set to estimate the accuracy of the model
• Goal: previously unseen records should be assigned a class as accurately as possible

Illustrating Classification Learning
Classification - A Two-Step Process
Model construction: describing a set of predetermined classes
 Building the Classifier or Model
 Each tuple/sample is assumed to belong to a predefined class, as defined by
the class label attribute
 The set of tuples used for model construction: training set
 The model is represented as classification rules / decision trees / mathematical
formulae
Model usage: for classifying future or unknown objects
 Using the classifier for classification
 Estimating the accuracy of the model
 The known label of a test sample is compared with the classified result from the model
 Accuracy rate is the percentage of test-set samples that are correctly classified by the model
 The test set is independent of the training set; otherwise over-fitting will occur
Classification Techniques
• Decision Trees
• Bayesian Classifiers
• K-Nearest Neighbour
• Support Vector Machines
• Neural Networks
• Linear Regression
• Fuzzy Algorithm
• Genetic Algorithm
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
– The topmost node in the tree is the root node.
• Decision tree generation consists of two phases
–Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
–Tree pruning
Identify and remove branches that reflect noise or outliers
– Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Decision Tree for Cheat or not Cheat
Attribute Selection Measure (How to choose Attribute?)
• An attribute selection measure finds the best attribute at each step so as to give the smallest decision tree. It provides a splitting rule that determines how the tuples at a given node can best be split.
• Attribute selection measure or Splitting Rules:
• Determine how the tuples at a given node are to be split
• Provide ranking for each attribute describing the tuples
• The attribute with highest score is chosen
• Determine split point or split subset
• Ideally, each resulting partition would be pure [A pure partition is a partition containing tuples
that all belong to the same class ]
• Three popular attribute selection measure:
1. Information Gain 2. Gain Ratio 3. Gini Index
Three Possible Partition Scenario
Information Gain Method

• Entropy/Information is the probabilistic measure of the uncertainty/variance/randomness present in the dataset/attribute.
• Information Gain is the decrease in Entropy when a node is split. It is the measure of the reduction of uncertainty/variance.
• Based on the computed values of Entropy and Information Gain, the best attribute at the respective level is chosen.
Information Gain Method
To find the best feature that serves as the root node in terms of information gain:
1. Find the Expected Entropy of the Dataset: the average amount of information needed to identify the class label of a tuple in D.
   Info(D) = - Σ_{i=1..m} p_i log2(p_i)
   • m: number of classes or labels
   • p_i: probability of class i
2. Find the Entropy of each individual attribute A:
   Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
   • v: number of distinct values (variations) of attribute A
   • D_j: partition of D containing the j-th value of A
3. Find the information gain of each individual attribute:
   Gain(A) = Info(D) - Info_A(D)
   • A is the particular attribute
4. Choose the attribute with the highest information gain as the splitting attribute.
Repeat this process to construct the complete decision tree.
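The following is a small Python sketch of Steps 1-4 for one categorical attribute, assuming the 14-tuple buys_computer data that appears on the next slides (9 "Yes" and 5 "No" tuples); the function and variable names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted entropy
    of the partitions induced by the attribute's values."""
    total = len(labels)
    partitions = {}
    for v, label in zip(values, labels):
        partitions.setdefault(v, []).append(label)
    info_a = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

age = ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
       "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(entropy(buys), 3))          # Info(D)   ~= 0.940
print(round(info_gain(age, buys), 3))   # Gain(age) ~= 0.246
```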
Which Attribute to Select
Class-labeled training tuples (class: Buys Computer):

RID | Age | Income | Student | Credit_Rating | Buys Computer
1 | youth | high | no | fair | no
2 | youth | high | no | excellent | no
3 | middle_aged | high | no | fair | yes
4 | senior | medium | no | fair | yes
5 | senior | low | yes | fair | yes
6 | senior | low | yes | excellent | no
7 | middle_aged | low | yes | excellent | yes
8 | youth | medium | no | fair | no
9 | youth | low | yes | fair | yes
10 | senior | medium | yes | fair | yes
11 | youth | medium | yes | excellent | yes
12 | middle_aged | medium | no | excellent | yes
13 | middle_aged | high | yes | fair | yes
14 | senior | medium | no | excellent | no

[Figure: candidate split on AGE with branches youth, middle_aged and senior, showing the class labels of the tuples falling into each branch.]
Which Attribute to Select
[Figure: the same training tuples, with a candidate split on INCOME (branches high, medium and low) showing the class labels of the tuples falling into each branch.]
Which Attribute to Select
[Figure: the same training tuples, with a candidate split on STUDENT (branches no and yes) showing the class labels of the tuples falling into each branch.]
Which Attribute to Select
[Figure: the same training tuples, with a candidate split on CREDIT_RATING (branches fair and excellent) showing the class labels of the tuples falling into each branch.]
Information Gain Approach

Step 1: Find the Expected Entropy/Information of the Dataset:
• m: number of classes or labels = 2
• p_i: probability of class i

Class YES has 9 tuples in D and class NO has 5 tuples in D, so
Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Information Gain Approach
Step 2: Iteratively find the best attribute for each level of the decision tree.

Iteration 1: Find the best attribute for the 0th level by computing the Gain of each attribute.
a. Age attribute:
• m: number of classes or labels = 2
• p_i: probability of class i

Entropy_Age(D) = (5/14) × ( -(2/5) log2(2/5) - (3/5) log2(3/5) )     [class variation in youth]
               + (4/14) × ( -(4/4) log2(4/4) )                       [class variation in middle_aged]
               + (5/14) × ( -(3/5) log2(3/5) - (2/5) log2(2/5) )     [class variation in senior]
               = 0.694

Gain(Age) = Entropy(D) - Entropy_Age(D) = 0.940 - 0.694 = 0.246
Information Gain Approach
Step 2: Iteratively find the best attribute for each level of the decision tree.

Iteration 1: Find the best attribute for the 0th level (cont...)

Similarly, find the gain of the attributes "income", "student" and "credit_rating":

b. Gain(income) = 0.029
c. Gain(student) = 0.151
d. Gain(credit_rating) = 0.048

Age has the highest information gain among all the attributes, and hence is selected as the splitting attribute.
Level 0 Decision Tree using the 1st Splitting Attribute: Age
• Tuples falling on age = middle_aged all belong to the same class "YES", and hence a leaf can be created for that branch.

• Continue the tree construction by finding the best gain for the data partitions age = youth and age = senior.
Final Decision Tree
Example of Majority Voting
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf.
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction

Example:
• If (Age == youth) AND (Student == no) then buys_computer = "no"
• If (Age == youth) AND (Student == yes) then buys_computer = "yes"
• If (Age == middle_aged) then buys_computer = "yes"
• If (Age == senior) AND (Credit_Rating == excellent) then buys_computer = "no"
• If (Age == senior) AND (Credit_Rating == fair) then buys_computer = "yes"
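As a hedged illustration, the sketch below fits a decision tree on the same 14-tuple buys_computer table with scikit-learn and prints its root-to-leaf paths, which read like the IF-THEN rules above. scikit-learn needs numeric inputs, so the categorical attributes are one-hot encoded, and the learned tree may differ in detail from the hand-built one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age": ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
            "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"],
    "income": ["high", "high", "high", "medium", "low", "low", "low",
               "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student": ["no", "no", "no", "no", "yes", "yes", "yes",
                "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
})
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

encoder = OneHotEncoder()                       # one-hot encode the categorical attributes
X = encoder.fit_transform(data)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, buys)

# Each printed root-to-leaf path corresponds to one IF-THEN classification rule.
print(export_text(tree, feature_names=list(encoder.get_feature_names_out())))
```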
Drawback of Information gain as attribute selection measure
• The information gain measure is biased towards tests with many outcomes, i.e. it prefers to select attributes having a large number of distinct values.
• For instance, for the dataset shown in the accompanying table, information gain prefers to split on the attribute "income", which would result in a large number of pure partitions, each containing only one tuple.
Info_income(D) = 0   [each partition is pure, so its entropy is 0]
=> Gain_income(D) = Info(D) - Info_income(D) = Info(D)
• Hence, the information gain w.r.t. income is maximal.
Classifier -II
Naive Bayesian Classifier
Classification by Naive Bayesian Classifier
• Bayesian Classifier is a statistical classifier based on Bayes Theorem which can predict the
probability that a given tuple belongs to a particular class.

• Naive Bayesian Classifier is a simple Bayesian classifier that exhibits high accuracy and
speed compared to Decision tree & Neural Network classifier.

• Naive Assumption [Class-Conditional Independence]: the effect of an attribute value on a particular class is independent of the values of the other attributes.

• Supervised learning method.

• Can solve problems involving both categorical and continuous-valued attributes.

• The naive assumption simplifies the computation, hence the name "naive".

Probability Basics: Prior Probability

A prior probability is the probability that an observation will fall into a group before one collects the data or before the experiment happens.

Example:
1. John conducts a single coin toss. What is the a priori probability of landing a head?
A priori probability of landing a head = 1 / 2 = 50%.

2. A six-sided fair dice is rolled. What is the a priori probability of rolling a 2, 4, or 6, in


a dice roll?
The number of desired outcomes is 3 (rolling a 2, 4, or 6), and there are 6 outcomes
in total.
A priori probability of rolling a 2, 4, or 6 = 3 / 6 = 50%.
Probability Basics: Posterior Probability
A posterior probability / conditional probability is the probability of an event occurring given that another event has already occurred.
For example, we might be interested in finding the probability of some event "A" occurring after we account for some event "B" that has just occurred.
The formula to calculate the posterior probability of A occurring given that B occurred is Bayes' Theorem:

P(A|B) = P(B|A) × P(A) / P(B)

Where, A, B = Events
P(B|A): the probability of B occurring given that A has happened
P(A) & P(B): the prior probabilities of event A and event B
Probability Basics: Example Calculating Posterior Probability
A forest is composed of 20% Oak trees and 80% Maple trees. Suppose it is known that
90% of the Oak trees are healthy while just 50% of the Maple trees are healthy.
Q:- Suppose that from a distance one can tell that a particular tree is healthy, what is the
probability that the tree is an Oak tree?
=>
P(Oak): probability that a tree is an Oak is 0.2 (20%)
P(Healthy): probability that a given tree is healthy = (0.2 × 0.9) + (0.8 × 0.5) = 0.58
P(Healthy|Oak): the probability that a tree is healthy given that it is an Oak tree is 0.9 (90%)

P(Oak|Healthy) = P(Healthy|Oak) × P(Oak) / P(Healthy) = (0.9 × 0.2) / 0.58 = 0.3103
Probability Basics: Posterior Probability contd...
For an intuitive understanding of this probability, suppose the following grid represents this
forest with 100 trees. Exactly 20 of the trees are Oak trees (20%) and 18 (90%) of them are
healthy. The other 80 (80%) trees are Maple and 40 (50%) of them are healthy.
(O = Oak, M = Maple, Green = Healthy, Red = Unhealthy)

Out of all 100 trees, exactly 58 are healthy and 18 of these healthy ones are Oak trees. Thus, if a healthy tree is selected, the probability that it is an Oak tree is 18/58 = 0.3103.
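A short Python check of this calculation, using only the numbers given above:

```python
# Bayes' theorem on the forest example: P(Oak | Healthy).
p_oak, p_maple = 0.2, 0.8                  # priors
p_healthy_given_oak = 0.9                  # likelihoods
p_healthy_given_maple = 0.5

p_healthy = p_oak * p_healthy_given_oak + p_maple * p_healthy_given_maple   # 0.58
p_oak_given_healthy = p_healthy_given_oak * p_oak / p_healthy
print(round(p_oak_given_healthy, 4))       # 0.3103
```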
Bayes' Theorem [useful in finding the posterior probability]
Given:
• X: data tuple described by measurements made on a set of n attributes
• H: some hypothesis that the data tuple "X" belongs to class C
Problem: Determine P(H|X): given the evidence/observed data tuple "X", find the probability that the hypothesis H holds, i.e. the probability that tuple "X" belongs to class C.

According to Bayes' Theorem:
P(H|X) = P(X|H) × P(H) / P(X)

P(H|X): posterior probability of class "H" given the attribute values <x1, ..., xn>
P(H): prior probability of class "H" regardless of the attribute values <x1, ..., xn>
P(X|H): posterior probability of the attribute values <x1, ..., xn> given class "H"
P(X): prior probability of the attribute values <x1, ..., xn> regardless of class "H"
An Example of Bayes' Theorem

Example: Let customers be described only by the attributes age, income and credit_rating.

 X: a 35-year-old customer with an income of 40K and fair credit_rating
 H: hypothesis that the customer will buy a computer

Determine:
P(H|X) = P(X|H) × P(H) / P(X)
where P(X|H) is the likelihood, P(H) is the prior and P(X) is the normalization constant.
An Example of Bayes' Theorem contd...

According to Bayes' Theorem:
P(H|X) = P(X|H) × P(H) / P(X)

Let hypothesis H be buys_computer = "YES"

P(H|X): posterior probability that the customer will buy a computer ("YES") given that age = 35, income = 40K and credit_rating = fair.
P(H): prior probability of class "YES" regardless of the attribute values <age, income, credit_rating>.
P(X|H): posterior probability of the attribute values <age = 35, income = 40K, credit_rating = fair> given that buys_computer = "YES".
P(X): prior probability of the attribute values <age = 35, income = 40K, credit_rating = fair> regardless of the class buys_computer.
Naive Bayes Classifier
• Given: D: - Set of tuples.
X: -<x1, ...xn> where xi is the value of attribute Ai
m: Number of classes - C1,..Cm

• The Bayesian classifier predicts that X belongs to class Ci iff

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m and j ≠ i

or, equivalently, maximize the posterior hypothesis:
P(Ci|X) = P(X|Ci) × P(Ci) / P(X)
Naive Bayes Classifier
To maximize the posterior hypothesis, three values are needed: 1. P(X)  2. P(Ci)  3. P(X|Ci)

Naive Assumption [Class-Conditional Independence]: the effect of an attribute value on a particular class is independent of the values of the other attributes.

1. P(X) = P(x1, ..., xn) is the same for every class, so it acts only as a normalization constant and can be ignored in the maximization.
2. P(Ci): probability that class i occurs in the dataset, P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of tuples of class Ci in D.
3. P(X|Ci) = P(x1, ..., xn | Ci). Due to conditional independence, P(X|Ci) = P(x1|Ci) × ... × P(xn|Ci).

Since P(X) is constant for all classes, it is sufficient to maximize P(X|Ci) × P(Ci).
Problem
For the given training dataset, let the new customer X be described as follows. Find the probability that the customer will buy a computer or not, i.e. find which class the query X belongs to.

X = (Age = youth, Income = medium, Student = yes, Credit_Rating = fair)

RID | Age | Income | Student | Credit_Rating | Buys Computer
1 | youth | high | no | fair | no
2 | youth | high | no | excellent | no
3 | middle_aged | high | no | fair | yes
4 | senior | medium | no | fair | yes
5 | senior | low | yes | fair | yes
6 | senior | low | yes | excellent | no
7 | middle_aged | low | yes | excellent | yes
8 | youth | medium | no | fair | no
9 | youth | low | yes | fair | yes
10 | senior | medium | yes | fair | yes
11 | youth | medium | yes | excellent | yes
12 | middle_aged | medium | no | excellent | yes
13 | middle_aged | high | yes | fair | yes
14 | senior | medium | no | excellent | no
Problem
Step-1: Find the class prior probability P(Ci) for i = 1, 2
P(C1) = P(buys_computer = yes) = 9/14 = 0.643
P(C2) = P(buys_computer = no) = 5/14 = 0.357
Step-2: Find the conditional probability of the query tuple X given each class label.

Conditional probability of X w.r.t. class Yes:
P(X|Yes) = P(x1: age=youth|Yes) × P(x2: income=medium|Yes) × P(x3: student=yes|Yes) × P(x4: credit_rating=fair|Yes)
         = 2/9 × 4/9 × 6/9 × 6/9 ≈ 0.044

Conditional probability of X w.r.t. class No:
P(X|No) = P(x1: age=youth|No) × P(x2: income=medium|No) × P(x3: student=yes|No) × P(x4: credit_rating=fair|No)
        = 3/5 × 2/5 × 1/5 × 2/5 ≈ 0.019
Problem
Step-3: Find the class that maximizes P(X|Ci) × P(Ci)

Max{ P(X|buys_comp=Yes) × P(buys_computer = Yes),
     P(X|buys_comp=No) × P(buys_computer = No) }
= Max{ 0.044 × 0.643, 0.019 × 0.357 }
= Max{ 0.028, 0.007 } = 0.028   (class "Yes" maximizes the posterior for X)

Hence, the Naive Bayesian classifier predicts buys_computer = yes for the given tuple X.
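The short sketch below reproduces this calculation by counting directly in the 14-tuple table; it is a minimal illustration, not a library implementation.

```python
from collections import Counter

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("youth", "high", "no", "fair", "no"),             ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),      ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),           ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"), ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),            ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),     ("senior", "medium", "no", "excellent", "no"),
]
X = ("youth", "medium", "yes", "fair")               # the query tuple

class_counts = Counter(r[-1] for r in rows)          # 9 "yes" tuples, 5 "no" tuples
scores = {}
for c, n_c in class_counts.items():
    prior = n_c / len(rows)                          # P(Ci)
    likelihood = 1.0
    for i, value in enumerate(X):                    # product of P(x_k | Ci)
        likelihood *= sum(1 for r in rows if r[-1] == c and r[i] == value) / n_c
    scores[c] = prior * likelihood                   # P(X|Ci) * P(Ci)

print(scores)                          # 'yes' ~= 0.028, 'no' ~= 0.007
print(max(scores, key=scores.get))     # 'yes'
```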
Problem
For the given training dataset (the weather/golf data), let the new sample X be as follows. Find the probability that golf will be played or not, i.e. find which class the query X belongs to.

X = (Outlook = Rainy, Temp = Hot, Humidity = High, Windy = False)

Partial Solution
[Figure: the weather dataset and the partial computation of the class priors and conditional probabilities.]
Advantages
• Simple, fast and easy to implement. Needs less training data. Highly scalable.
• Handles continuous and discrete data.
• Easily updatable if the training dataset grows.
Uses
• Text classification
• Spam filtering
• Hybrid recommender systems
• Online applications
Disadvantages
• Assumption of class-conditional independence, therefore loss of accuracy.
• Practically, dependencies exist among variables.
Example: hospitals - patient profile, family history, disease symptoms, etc.
Major Issues
• Violation of the independence assumption: it assumes that the effect of an attribute value on a given class is independent of the values of the other attributes.
Solution: a Bayesian belief network classifier allows the representation of dependencies among subsets of attributes.
• Zero conditional probability problem: if a given class and feature value never occur together in the training set, then the frequency-based probability estimate will be zero. This is problematic since it will wipe out all information in the other probabilities when they are multiplied.
Therefore it is often desirable to incorporate a small-sample correction into all probability estimates, e.g. add 1 to every count (Laplacian correction), which has a negligible effect on the comparison of the maxima.
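A tiny sketch of the Laplacian (add-1) correction mentioned above; the counts below are hypothetical, used only to show the effect.

```python
def smoothed_probability(value_count, class_count, n_distinct_values):
    """P(value | class) with add-1 (Laplacian) smoothing: no estimate is exactly zero."""
    return (value_count + 1) / (class_count + n_distinct_values)

# Suppose an attribute value was never observed together with a class of 5 tuples,
# and the attribute has 3 distinct values:
print(smoothed_probability(0, 5, 3))   # 0.125 instead of 0.0
print(smoothed_probability(4, 5, 3))   # 0.625 (was 0.8 without smoothing)
```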
K-Nearest Neighbor
(Lazy Learners or Learning from your Neighbor)
Eager Learners Vs. Lazy Learners
Eager Learner | Lazy Learner
Constructs a generalized classification model just after receiving the training tuples (well before receiving the new/test tuple), and is ready and eager to classify new unseen tuples. | Stores the received training tuples and waits until the last minute, i.e. until the new/test tuple is received. Only then does it construct the generalized classification model, based on the similarity between the test tuple and the stored training tuples.
More work when a training tuple is received and less work when making a classification. | Less work (only storage) when a training tuple is received and more work when making a classification.
Less expensive but does not support incremental learning. | More expensive but supports incremental learning.
e.g. Decision Tree, Bayesian Classifier | e.g. K-Nearest Neighbor, case-based learning
K-Nearest Neighbor Classifier

Nearest based classifiers are based on learning by analogy i.e. by comparing a


given test tuples with traing tuples that are similar to it.
• Let the training tuples are described by n attributes. All instances of training
tuples are correspond to points in the n-dimentional space and accordingly each
tuple represents a point in an n-dimensional space
• A k-nearest-neighbor classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple.
Prediction for test data is done on the basis of its neighbour
Properties of KNN

 How many neighbors should be considered? i.e. what is K?
 How do we measure distance?
 Should all points be weighted equally, or should some points have more influence than others?

 How many neighbors should be considered? (what is K)
• At K = 1, the KNN model tends to closely follow the training data and thus shows a high training score and low test score, indicating overfitting.

• The problem can be solved by tuning the value of the k_neighbors parameter. As we increase the number of neighbors K, the model starts to generalize well, but increasing the value too much drops the performance again.

• Therefore, it is important to find an optimal value of K, such that the model is able to classify the test data set with minimum error.

• Setting K to the square root of the number of training tuples, or half of that square root, can lead to better results.
• For an even number of classes (e.g. 2 or 4) it is a good idea to choose an odd value of K to avoid ties; conversely, use an even K when you have an odd number of classes.
How to Measure Distance

DISTANCE FOR NUMERIC ATTRIBUTES

Proximity Measure for Numeric Attributes [Dissimilarity]
Commonly used distance measures for computing dissimilarity are:
• Euclidean Distance: d(x, y) = sqrt( Σ_{i=1..n} (x_i - y_i)^2 )

• Manhattan Distance: d(x, y) = Σ_{i=1..n} |x_i - y_i|

• Minkowski Distance (generalization of Euclidean and Manhattan):
  d(x, y) = ( Σ_{i=1..n} |x_i - y_i|^p )^(1/p)
  [When p = 1 it is Manhattan; when p = 2 it is Euclidean]

Here n is the number of dimensions (attributes) and x_i and y_i are the ith attributes (components) of data objects x and y.
Euclidean distance is often preferred as it gives equal importance to each feature.
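A compact NumPy sketch of these measures (Minkowski with p = 1 and p = 2 giving Manhattan and Euclidean respectively), checked on two of the points used on the next slide:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance: (sum_i |x_i - y_i|^p) ** (1/p)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

x1, x2 = [1, 2], [3, 5]
print(minkowski(x1, x2, p=1))            # Manhattan distance  = 5.0
print(round(minkowski(x1, x2, p=2), 2))  # Euclidean distance ~= 3.61
```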
Proximity Measure for Numeric Attributes [Dissimilarity - Euclidean Distance]

Data Matrix
point | attribute1 | attribute2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

[Figure: scatter plot of the four points.]

Dissimilarity Matrix (with Euclidean Distance)
   | x1   | x2   | x3   | x4
x1 | 0    |      |      |
x2 | 3.61 | 0    |      |
x3 | 2.24 | 5.1  | 0    |
x4 | 4.24 | 1    | 5.39 | 0
Proximity Measure for Numeric Attributes [Dissimilarity - Manhattan Distance]

Data Matrix
point | attribute1 | attribute2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

[Figure: scatter plot of the four points.]

Dissimilarity Matrix (with Manhattan Distance)
   | x1 | x2 | x3 | x4
x1 | 0  |    |    |
x2 | 5  | 0  |    |
x3 | 3  | 6  | 0  |
x4 | 6  | 1  | 7  | 0
Proximity Measure for Numeric Attributes [Dissimilarity - Minkowski Distance]

Data Matrix
point | attribute1 | attribute2
x1 | 1 | 2
x2 | 3 | 5
x3 | 2 | 0
x4 | 4 | 5

Minkowski Distance = ( Σ_{i=1..n} |x_i - y_i|^p )^(1/p), where p is a real number >= 1.

Minkowski Distance with p = 1 is the Manhattan Distance:
   | x1 | x2 | x3 | x4
x1 | 0  |    |    |
x2 | 5  | 0  |    |
x3 | 3  | 6  | 0  |
x4 | 6  | 1  | 7  | 0

Minkowski Distance with p = 2 is the Euclidean Distance:
   | x1   | x2   | x3   | x4
x1 | 0    |      |      |
x2 | 3.61 | 0    |      |
x3 | 2.24 | 5.1  | 0    |
x4 | 4.24 | 1    | 5.39 | 0
Example: 1
• Given below is data from a questionnaire survey with two attributes (X1: acid durability and X2: strength), used to classify whether a special paper tissue is good or not. Classify the new paper tissue with features X1 = 3 and X2 = 7.

X1: Acid Durability | X2: Strength | Y: Class
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good
Example-1: Solution
Step 1: Initialize and define K.
K = 1 leads to an overfitting problem.
The number of tuples is even, hence consider an odd K. Let K = 3 for this problem.
Step 2: Compute the distance between the input sample and the training samples using the Euclidean distance measure.

X1: Acid Durability | X2: Strength | Y: Class | Distance from (3, 7)
7 | 7 | Bad | sqrt((7-3)^2 + (7-7)^2) = 4
7 | 4 | Bad | sqrt((7-3)^2 + (4-7)^2) = 5
3 | 4 | Good | sqrt((3-3)^2 + (4-7)^2) = 3
1 | 4 | Good | sqrt((1-3)^2 + (4-7)^2) = 3.6
Example-1: Solution
Step 3: Determine the K nearest neighbors (those at minimum distance) and rank them.

X1: Acid Durability | X2: Strength | Y: Class | Distance from (3, 7) | Rank
7 | 7 | Bad | 4 | 3
7 | 4 | Bad | 5 | 4
3 | 4 | Good | 3 | 1
1 | 4 | Good | 3.6 | 2

Step 4: Apply simple majority voting.
Among the 3 nearest neighbors there are 2 "Good" and 1 "Bad". Thus the new paper tissue is classified as Good by simple majority.
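A hedged sketch that reproduces Example 1 with scikit-learn's KNeighborsClassifier (k = 3, Euclidean distance), assuming scikit-learn is available:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[7, 7], [7, 4], [3, 4], [1, 4]]     # (acid durability, strength)
y_train = ["Bad", "Bad", "Good", "Good"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_train, y_train)

print(knn.predict([[3, 7]]))       # ['Good'] by a 2-to-1 majority
print(knn.kneighbors([[3, 7]]))    # distances and indices of the 3 nearest neighbors
```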
Example 2
Given are customer details (age, income, number of credit cards) and their corresponding class. Find the class label for the new tuple/sample:
Customer: RRR, Age: 37, Income: 50K, Credit Cards: 2, Class: ?

Customer | Age | Income | No. of Credit Cards | Class
XXX | 35 | 35K | 3 | No
YYY | 22 | 50K | 2 | Yes
ZZZ | 63 | 200K | 1 | No
PPP | 59 | 170K | 1 | No
QQQ | 25 | 40K | 4 | Yes
RRR | 37 | 50K | 2 | ?
Example-2: Solution
Find the Euclidean distance of the new tuple from each training tuple (with K = 3).

Customer | Age | Income | No. of Credit Cards | Class | Distance
XXX | 35 | 35K | 3 | No | sqrt((35-37)^2 + (35-50)^2 + (3-2)^2) = 15.16
YYY | 22 | 50K | 2 | Yes | sqrt((22-37)^2 + (50-50)^2 + (2-2)^2) = 15
ZZZ | 63 | 200K | 1 | No | sqrt((63-37)^2 + (200-50)^2 + (1-2)^2) = 152.2
PPP | 59 | 170K | 1 | No | sqrt((59-37)^2 + (170-50)^2 + (1-2)^2) = 122
QQQ | 25 | 40K | 4 | Yes | sqrt((25-37)^2 + (40-50)^2 + (4-2)^2) = 15.74
RRR | 37 | 50K | 2 | ? |

The 3 nearest neighbors are YYY (Yes), XXX (No) and QQQ (Yes). The majority of {Yes, No, Yes} is Yes; hence the class of the new tuple is Yes.
Should all attributes be weighted equally, or should some attributes have more influence than others?
Some large-range attributes (like income or loan amount) outweigh smaller-range attributes (like binary, categorical, or numeric attributes such as age), and thus the result obtained can be biased.
Solution: Normalization. Max-min normalization is generally used.
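A minimal sketch of max-min normalization to the [0, 1] range, applied to the attributes of Example 2 before computing distances (NumPy assumed):

```python
import numpy as np

X = np.array([[35, 35000, 3],      # age, income, number of credit cards
              [22, 50000, 2],
              [63, 200000, 1],
              [59, 170000, 1],
              [25, 40000, 4]], dtype=float)

x_min, x_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - x_min) / (x_max - x_min)   # v' = (v - min) / (max - min), per attribute
print(X_norm.round(2))                   # every attribute now lies in [0, 1]
```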
KNN Classifier Algorithm
Advantages:
• No training period: there is no discriminative function or model built at the training stage.
• As there is no training period, any new data can be added seamlessly without impacting the accuracy of the result.
• Easy to implement.
• Best used where the probability distribution is unknown.
• Useful for non-linear data as no assumption about the data is required.

Disadvantages:
• Computationally expensive with higher space complexity.
• Output depends on the chosen K value, which can reduce accuracy for some K values.
• Does not work well with large data sets and high dimensions.
• Sensitive to noisy, missing and outlier data.
• Needs normalization.
Applications
• Used in classification and regression
• Used in pattern recognition
• Used in gene expression analysis, protein-protein interaction prediction and protein 3D-structure prediction
• Used to measure document similarity
Support Vector Machine
Support Vector Machines
Support Vector Machine (SVM) is an algorithm which can be used for both classification and regression challenges. However, it is mostly used in two-group classification problems. In the SVM algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then, classification is performed by finding the hyperplane that best differentiates the two classes.
What is a hyperplane?
A hyperplane is a generalization of a plane.
 In one dimension, it is called a point.
 In two dimensions, it is a line.
 In three dimensions, it is a plane.
 In more dimensions, it is called a hyperplane.
The following figure represents data points in one dimension, where the point L is a separating hyperplane.
Support Vector Machines cont…
How does SVM works?
 Let’s imagine we have two tags: red and blue, and our
data has two features: x and y. We want a classifier
that, given a pair of (x, y) coordinates, outputs if it’s
either red or blue. We plot our already labeled training
data on a plane.
 A support vector machine takes these data points and outputs the hyperplane (which in two dimensions is simply a line) that best separates the tags. This line is the decision boundary: anything that falls on one side of it is classified as blue, and anything that falls on the other side as red.
 In 2D, the best hyperplane is simply a line.
Support Vector Machines cont…
But, what exactly is the best hyperplane? For SVM, it’s the one that maximizes the
margins from both tags. In other words: the hyperplane (in 2D, it's a line) whose distance
to the nearest element of each tag is the largest.
Support Vector Machines cont…
(Scenario-1) Identification of the right hyperplane: Here, we have three hyperplanes (A, B
and C). Now, the job is to identify the right hyperplane to classify star and circle.
Remember a thumb rule - identify the right hyperplane: “Select the hyperplane which
segregates the two classes better”. In this scenario, hyperplane “B” has excellently
performed this job.
Support Vector Machines cont…
(Scenario-2) Identify the right hyperplane: Here, we have three hyperplanes (A, B and C)
and all are segregating the classes well. Now, How can we identify the right hyperplane?

Here, maximizing the distances between nearest data point (either class) and hyperplane
will help to decide the right hyperplane. This distance is called as margin.
Support Vector Machines cont…

Below, you can see that the margin for hyperplane C is high compared to both A and B. Hence, we name C as the right hyperplane. Another important reason for selecting the hyperplane with the higher margin is robustness: if we select a hyperplane with a low margin, then there is a high chance of misclassification.
Support Vector Machines cont…
(Scenario-3) Identify the right hyperplane: Use the rules as discussed in previous section
to identify the right hyperplane.

Some of you may have selected the hyperplane B as it has higher margin compared to A.
But, here is the catch, SVM selects the hyperplane which classifies the classes accurately
prior to maximizing margin. Here, hyperplane B has a classification error and A has
classified all correctly. Therefore, the right hyperplane is A.
Support Vector Machines cont…
(Scenario-4) Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

One star at the other end is like an outlier for the star class. The SVM algorithm has a feature to ignore outliers and find the hyperplane with the maximum margin. Hence, we can say that SVM classification is robust to outliers.
Support Vector Machines cont…
(Scenario-5) Find the hyperplane that segregates the two classes: In this scenario, we cannot have a linear hyperplane between the two classes, so how does SVM classify them? Till now, we have only looked at linear hyperplanes.

SVM can solve this problem easily! It does so by introducing an additional feature. Here, we can add a new feature z = x^2 + y^2 and plot the data points on the x and z axes:

In the above plot, the points to consider are:

 All values of z will be positive, as z is the squared sum of both x and y.
 In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
Support Vector Machines cont…
 In the SVM classifier, it is now easy to have a linear hyperplane between these two classes. But another question arises: do we need to add this feature manually to obtain such a hyperplane? No, the SVM algorithm has a technique called the kernel trick. The SVM kernel takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems. Simply put, it performs some fairly complex data transformations and then finds the way to separate the data based on the labels or outputs you have defined.
 When we look at this hyperplane in the original input space, it looks like a circle.
Types of SVM
 Linear SVM: Linear SVM is used for data that are linearly separable, i.e. a dataset that can be categorized into two classes by a single straight line. Such data points are termed linearly separable data, and the classifier used is described as a linear SVM classifier.
 Non-linear SVM: Non-linear SVM is used for data that are not linearly separable, i.e. a straight line cannot be used to classify the dataset. For this, we use the kernel trick, which maps the data points into a higher dimension where they can be separated using planes or other mathematical functions. Such data points are termed non-linear data, and the classifier used is termed a non-linear SVM classifier.
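A hedged sketch of the kernel trick with scikit-learn: on ring-shaped data (similar to Scenario-5), a linear SVM performs poorly while an RBF-kernel SVM separates the classes well. The make_circles dataset and the chosen parameters are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A two-class dataset where one class forms a ring around the other.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)            # linear SVM
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)   # non-linear SVM

print("linear kernel accuracy:", round(linear_svm.score(X_test, y_test), 2))
print("RBF kernel accuracy:   ", round(rbf_svm.score(X_test, y_test), 2))
```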
CLUSTERING
What is Cluster
 A cluster is a subset of objects which are “similar”
 A subset of objects such that the distance between any two objects in the cluster is less
than the distance between any object in the cluster and any object not located inside it.

What is Clustering
 Clustering is the task of dividing the population or
data points into a number of groups such that data
points in the same groups are more similar to other
data points in the same group and dissimilar to the
data points in other groups.
 Clustering is unsupervised learning, where we do not know the class labels or the number of labels beforehand.
 Clustering algorithm forms the groups of objects on
the basis of similarity and dissimilarity between them.
Clustering Example
Imagine a number of objects in a basket. Each item has a distinct set of features (size, shape, color, etc.). Now the task is to group the objects in the basket. A natural first question to ask is, 'on what basis should these objects be grouped?' Perhaps by size, shape, or color.
Clustering Application
 Marketing: It can be used to characterize & discover customer segments for marketing
purposes.
 Biology: It can be used for classification among different species of plants and animals.
 Libraries: It is used in clustering different books on the basis of topics and information.
 Insurance: It is used to analyse the customers and their policies and to identify frauds.
 City Planning: It is used to make groups of houses and to study their values based on
their geographical locations and other factors present.
 Earthquake studies: By learning the earthquake affected areas, the dangerous zones can
be determined.
 Healthcare: It can be used in identifying and classifying the cancerous gene.
 Search Engine: It is the backbone behind the search engines. Search engines try to group
similar objects in one cluster and the dissimilar objects far from each other.
 Education: It can be used to monitor students' academic performance. Based on their scores, students are grouped into different clusters, where each cluster denotes a different level of performance.
Requirements of Clustering in Data Mining
 Scalability − Highly scalable clustering algorithms are needed to deal with large databases.
 Ability to deal with different kinds of attributes - Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
 Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only spherical clusters of small size.
 High dimensionality - The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
 Ability to deal with noisy data − Datasets may contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and usable.
What is Good Clustering?
• A good clustering method will produce high quality clusters in which:

 The intra-class (that is, intra-cluster) similarity is high.

 The inter-class similarity is low.

• The quality of a clustering result also depends on both the similarity measure used by the
method and its implementation.

• The quality of a clustering method is measured by its ability to discover some or all of the
hidden patterns.

• However, objective evaluation is problematic: usually done by human / expert inspection.


Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate
them by some criterion: K-Mean Algorithm
• Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or
objects) using some criterion: Divisive and Agglomerative Methods of hierarchical
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to the data
Distinction between Partitional and Hierarchical sets of
clusters
• Partitional Clustering
• It divides data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. It constructs a partition of a database D of n objects into a set of k (given) clusters.
• Heuristic methods: k-means and k-medoids algorithms
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
• There are mainly two types of hierarchical clustering:
 Agglomerative hierarchical clustering
 Divisive Hierarchical clustering
K- Means Clustering Algorithm
 It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and points within a group have similar properties.
 It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters. A centroid is a data point that lies at the center of a cluster.
 The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k
should be predetermined in this algorithm.
 The k-means clustering algorithm mainly performs two tasks:
 Determines the best value for k center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
 Hence each cluster has datapoints with some commonalities, and away from other clusters.
K- Means Algorithm cont…

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
1. Step-1: Select the number K to decide the number of
clusters.
2. Step-2: Select random K points or centroids. (It can be
other from the input dataset).
3. Step-3: Assign each data point to their closest
centroid, which will form the predefined K clusters.
4. Step-4: Calculate the variance and place a new
centroid of each cluster.
5. Step-5: Repeat the third step, i.e. reassign each datapoint to the new closest centroid of each cluster.
6. Step-6: If any reassignment occurs, then go to step-4
else go to step-7.
7. Step-7: The model is ready.
8. End
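A minimal NumPy sketch of the loop in Steps 1-7 (a hand-rolled illustration, not a library implementation): assign every point to its nearest centroid, recompute the centroids, and stop when the assignment no longer changes. It assumes no cluster ever becomes empty.

```python
import numpy as np

def k_means(X, centroids, max_iter=100):
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Steps 3/5: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # Step 6: no reassignment, model is ready
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids
```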
Working of K-Means Algorithm
Suppose we have two variables x and y. The x-y axis scatter plot of these two variables
is given below:
Working of K-Means Algorithm cont…
 Let's take the number k of clusters, i.e., K = 2, to identify the clusters in the dataset and put the points into different clusters. It means here we will try to group these data points into two different clusters.
 We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the two points shown below as the K centroids, which are not part of our dataset. Consider the below image:
Working of K-Means Algorithm cont…
 Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by calculating the distance between two points. So, we will draw a
median between both the centroids. Consider the below image:
Working of K-Means Algorithm cont…
 From the image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points on the right of the line are closer to the yellow centroid, i.e. K2. Let's color them blue and yellow for clear visualization.
Working of K-Means Algorithm cont…
 As we need to find the closest cluster centers, we will repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points currently assigned to each cluster, and obtain new centroids as below:
Working of K-Means Algorithm cont…
 Next, we will reassign each datapoint to the new centroid. For this, we will repeat the
same process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
Working of K-Means Algorithm cont…

 As reassignment has taken place, so we will again go to the step-4, which is finding
new centroids or K-points.
Working of K-Means Algorithm cont…

 We will repeat the process by finding the center of gravity of centroids, so the new

centroids will be as shown in the below image:


Working of K-Means Algorithm cont…
 As we got the new centroids so again will draw the median line and reassign the data

points. So, the image will be:


Working of K-Means Algorithm cont…
 We can see in the previous image; there are no dissimilar data points on either side of the

line, which means our model is formed. Consider the below image:
Working of K-Means Algorithm cont…
 As our model is ready, so we can now remove the assumed centroids, and the two final

clusters will be as shown in the below image:


K-Means Clustering Example

10 data points with 2 attributes are given. The task is to divide the data points into K (let K = 3) clusters.

Sr. No. | X | Y
1 | 2 | 4
2 | 2 | 6
3 | 5 | 6
4 | 4 | 7
5 | 8 | 3
6 | 6 | 6
7 | 5 | 2
8 | 5 | 7
9 | 6 | 3
10 | 4 | 4
K-Means Clustering Example cont..
First, plot the data points and find the centroid of the data points. Assume initially that all the data points belong to a single/same cluster.



K-Means Clustering Example cont..
Let K = 3, i.e. the problem is to divide the data points into 3 clusters.
For this, select 3 random points/seed points as the initial centroids of the 3 clusters. Let these be (1, 5), (4, 1) & (8, 4).



K-Means Clustering Example cont..
(Calculate the distances of all data points from the 3 centroids.)

[Table: distance of each of the 10 data points from the centroids (1, 5), (4, 1) and (8, 4).]

• Compare the 3 distances of each individual data point from the corresponding centroids and find the minimum distance.
• The data point belongs to the cluster whose centroid is closest.

Clusters
G-1 = {1, 2, 4}   G-2 = {7, 10}   G-3 = {3, 5, 6, 8, 9}
K-Means Clustering Example cont..
K-Means Clustering Example cont..
Iteration-2
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (4.5, 3)
Centroid of Cluster 3: C3 = (6, 5)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2, 4}   G-2 = {7, 9, 10}   G-3 = {3, 5, 6, 8}
K-Means Clustering Example cont..
K-Means Clustering Example cont..
Iteration-3
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (5, 3)
Centroid of Cluster 3: C3 = (6, 5.5)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2, 4}   G-2 = {5, 7, 9, 10}   G-3 = {3, 6, 8}
K-Means Clustering Example cont..
Iteration-4
Centroid of Cluster 1: C1 = (2.66, 5.66)
Centroid of Cluster 2: C2 = (5.75, 3)
Centroid of Cluster 3: C3 = (5.33, 6.33)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2}   G-2 = {5, 7, 9, 10}   G-3 = {3, 4, 6, 8}
K-Means Clustering Example cont..
Iteration-5
Centroid of Cluster 1: C1 = (2, 5)
Centroid of Cluster 2: C2 = (5.75, 3)
Centroid of Cluster 3: C3 = (5, 6.5)

[Table: distance of each of the 10 data points from the new centroids C1, C2 and C3.]

New Clusters
G-1 = {1, 2}   G-2 = {5, 7, 9, 10}   G-3 = {3, 4, 6, 8}
Since the cluster assignment did not change from Iteration-4, the algorithm stops here.
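As a hedged cross-check, the sketch below runs scikit-learn's KMeans on the 10 points with the same three seed centroids (1, 5), (4, 1) and (8, 4); n_init=1 keeps exactly that initialization. The cluster numbering may differ from the G-1/G-2/G-3 labels used above.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[2, 4], [2, 6], [5, 6], [4, 7], [8, 3],
                   [6, 6], [5, 2], [5, 7], [6, 3], [4, 4]], dtype=float)
seeds = np.array([[1, 5], [4, 1], [8, 4]], dtype=float)

km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(points)
print(km.labels_)            # cluster index assigned to each of the 10 points
print(km.cluster_centers_)   # final centroids after convergence
```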
K-Means Clustering Example cont..
Advantages
• Simple, understandable
• Items automatically assigned to clusters

Disadvantages
• Must pick the number of clusters beforehand
• Often terminates at a local optimum.
• All items forced into a cluster
• Too sensitive to outliers
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled
examples.
• Recursive application of a standard clustering algorithm can produce a
hierarchical clustering.
Hierarchical Clustering
1. Bottom up (agglomerative)
• Start with single-instance clusters
• At each step, join the two closest clusters
• Design decision: distance between clusters, e.g. the distance between the two closest instances in the clusters
2. Top down (divisive)
• Start with one universal cluster
• Find two clusters which are far apart
• Proceed recursively on each subset
• Can be very fast
3. Both methods produce a dendrogram
Types of Hierarchical Clustering
Divisive Hierarchical Clustering          Agglomerative Hierarchical Clustering
Types of Hierarchical Clustering...
Agglomerative Hierarchical Clustering
 Each point is assigned to an individual cluster in this technique. Suppose there are 4 data
points, so each of these points would be assigned to a cluster and hence there would be 4
clusters in the beginning.
 Then, at each iteration, closest pair of clusters are merged and this step is repeated until only a
single cluster is left.

Divisive Hierarchical clustering

 Divisive hierarchical clustering works in the opposite way. Instead of starting with n clusters,
we start with a single cluster and assign all the points to that cluster.
 Next at each iteration, farthest point in the cluster is split and this process is repeated until
each cluster only contains a single point.
Agglomerative Hierarchical Clustering Example

Suppose a faculty wants to divide the students into different groups. The faculty has the marks
scored by each student in an assignment and based on these marks, he/she wants to segment
them into groups. There’s no fixed target as to how many groups to have. Since the faculty
does not know what type of students should be assigned to which group, it cannot be solved as
a supervised learning problem. So, hierarchical clustering is applied to segment the students
into different groups. Let’s take a sample of 5 students.
Creating a Proximity Matrix

Roll No | Mark
1 | 10
2 | 7
3 | 28
4 | 20
5 | 35

First, a proximity matrix is created which tells the distance between each pair of these points (marks). Since the distance is calculated from each point to each of the other points, a square matrix of shape n × n (where n is the number of observations) is obtained. Let's make the 5 × 5 proximity matrix for this example.
Agglomerative Hierarchical Clustering Example..
Proximity Matrix

Roll No | 1 | 2 | 3 | 4 | 5
1 | 0 | 3 | 18 | 10 | 25
2 | 3 | 0 | 21 | 13 | 28
3 | 18 | 21 | 0 | 8 | 7
4 | 10 | 13 | 8 | 0 | 15
5 | 25 | 28 | 7 | 15 | 0

The diagonal elements of this matrix are always 0, as the distance of a point from itself is always 0. The Euclidean distance formula is used to calculate the rest of the distances. So, to calculate the distance between
Point 1 and 2: √((10-7)^2) = √9 = 3
Point 1 and 3: √((10-28)^2) = √324 = 18, and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.
Steps to Perform Agglomerative Hierarchical Clustering
Step 1: First, every point is assigned to an individual cluster. Different colors here represent different clusters. Hence, there are 5 different clusters for the 5 points in the data.

LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance.

Roll No | 1 | 2 | 3 | 4 | 5
1 | 0 | 3 | 18 | 10 | 25
2 | 3 | 0 | 21 | 13 | 28
3 | 18 | 21 | 0 | 8 | 7
4 | 10 | 13 | 8 | 0 | 15
5 | 25 | 28 | 7 | 15 | 0

Here, the smallest distance is 3, and hence points 1 and 2 are merged.
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, we have taken the maximum of the two marks (7, 10), i.e. 10, to represent this cluster. [Instead of the maximum, the minimum or the average value can also be used.]

Roll No | Mark
(1, 2) | 10
3 | 28
4 | 20
5 | 35
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance. Then the proximity matrix is updated.

Roll No | (1, 2) | 3 | 4 | 5
(1, 2) | 0 | 18 | 10 | 25
3 | 18 | 0 | 8 | 7
4 | 10 | 8 | 0 | 15
5 | 25 | 7 | 15 | 0

Here, the smallest distance is 7, and hence points 3 and 5 are merged.
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, we have taken the maximum of the marks of rolls 5 and 3 (35, 28), i.e. 35, to represent this cluster. [Instead of the maximum, the minimum or the average value can also be used.]

Roll No | Mark
(1, 2) | 10
(5, 3) | 35
4 | 20
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance. Then the proximity matrix is updated.

Roll No | (1, 2) | (5, 3) | 4
(1, 2) | 0 | 25 | 10
(5, 3) | 25 | 0 | 15
4 | 10 | 15 | 0

Here, the smallest distance is 10, and hence cluster (1, 2) and point 4 are merged.
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, we have taken the maximum of the marks of clusters (1, 2) and (4), i.e. 20, to represent this cluster. [Instead of the maximum, the minimum or the average value can also be used.]

Roll No | Mark
((1, 2), (4)) | 20
(5, 3) | 35
Steps to Perform Agglomerative Hierarchical Clustering...
LOOP (Steps 2 & 3): (until a single cluster is obtained)
Step 2: Compute the proximity matrix, look at the smallest distance in it, and merge the points with the smallest distance. Then the proximity matrix is updated.

Roll No | ((1, 2), (4)) | (5, 3)
((1, 2), (4)) | 0 | 15
(5, 3) | 15 | 0
Steps to Perform Agglomerative Hierarchical Clustering...
Step 3: Update the clusters and accordingly update the proximity matrix.

Here, only one option remains: merge the two clusters.

Roll No | Mark
(((1, 2), (4)), (5, 3)) | 35
Steps to Perform Agglomerative Hierarchical
Clustering cont…

We started with 5 clusters and finally have a single cluster.


Dendrogram
 To get the number of clusters for hierarchical clustering, we make use of the concept
called a Dendrogram.
 A dendrogram is a tree-like diagram that records the sequences of merges or splits.
 Let's get back to the faculty-student example. Whenever we merge two clusters, the dendrogram records the distance between these clusters and represents it in graph form.

We have the samples of the dataset on the


x-axis and the distance on the y-axis.
Whenever two clusters are merged, we
will join them in this dendrogram and the
height of the join will be the distance
between these points (Distance can be
obtained from Proximity Matrix).
Dendrogram cont…
We started by merging sample 1 and 2 and the
distance between these two samples was 3.
Let’s plot this in the dendrogram.

Here, we can see that we have merged sample 1 and 2. The vertical line represents the
distance between these samples.
Dendrogram cont…
Iteration-1
Roll No | 1 | 2 | 3 | 4 | 5
1 | 0 | 3 | 18 | 10 | 25
2 | 3 | 0 | 21 | 13 | 28
3 | 18 | 21 | 0 | 8 | 7
4 | 10 | 13 | 8 | 0 | 15
5 | 25 | 28 | 7 | 15 | 0

Iteration-2
Roll No | (1, 2) | 3 | 4 | 5
(1, 2) | 0 | 18 | 10 | 25
3 | 18 | 0 | 8 | 7
4 | 10 | 8 | 0 | 15
5 | 25 | 7 | 15 | 0
Dendrogram cont…
Iteration-3
Roll No | (1, 2) | (5, 3) | 4
(1, 2) | 0 | 25 | 10
(5, 3) | 25 | 0 | 15
4 | 10 | 15 | 0

Iteration-4
Roll No | ((1, 2), (4)) | (5, 3)
((1, 2), (4)) | 0 | 15
(5, 3) | 15 | 0

We can clearly visualize the steps of hierarchical clustering. The longer the vertical lines in the dendrogram, the greater the distance between those clusters.
Dendrogram cont…
Now, we can set a threshold distance and draw a horizontal line (Generally, the
threshold is set in such a way that it cuts the tallest vertical line). Let’s set this
threshold as 12 and draw a horizontal line:

The number of clusters will be the number of vertical lines intersected by the line drawn at the threshold. In the above example, since the red line intersects 2 vertical lines, we will have 2 clusters. One cluster will contain the samples (1, 2, 4) and the other will contain the samples (3, 5).
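A hedged SciPy sketch of the same example is shown below. It uses complete linkage as one standard choice; this reproduces the merge order of the worked example ((1, 2), then (3, 5), then ((1, 2), 4)), but its merge heights differ from the max-mark simplification used above, so the cut that yields two clusters is taken at a threshold of 15 rather than 12.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

marks = np.array([[10], [7], [28], [20], [35]])     # roll numbers 1..5
Z = linkage(marks, method="complete", metric="euclidean")

# Cutting the tree at distance 15 leaves two clusters: {1, 2, 4} and {3, 5}.
print(fcluster(Z, t=15, criterion="distance"))
# dendrogram(Z, labels=[1, 2, 3, 4, 5])   # optional plot (needs matplotlib)
```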
Divisive Hierarchical Clustering: Example

Starts from a big/master cluster (by considering all data points as a single cluster) and splits it into as many small clusters as required.
If the requirement is to form K clusters from n data points, then stop splitting once K clusters are reached; otherwise, keep splitting the master cluster until solo clusters are obtained (one data point per cluster).

Steps in the Divisive Algorithm:

1. From the proximity matrix, construct the master cluster as a minimum-cost spanning tree using Kruskal's method.
2. Split the cluster into sub-clusters using any flat clustering criterion (e.g. break the edge with the largest distance).
3. Repeat step 2 until the desired number of clusters is achieved or solo clusters are formed.
Divisive Hierarchical Clustering: Example

Apply divisive hierarchical clustering on the 5 data points (set A = {a, b, c, d, e}) with the following distance matrix to output 4 clusters.

Dist | a | b | c | d | e
a | 0 |   |   |   |
b | 1 | 0 |   |   |
c | 2 | 2 | 0 |   |
d | 2 | 4 | 1 | 0 |
e | 3 | 3 | 5 | 3 | 0
Divisive Hierarchical Clustering: Example...
Step 1: [Initialization]
Construct the minimum cost spanning tree (MST) using Kruskal's method.
Distances among data points/nodes in increasing order:

Edge | AB | CD | AC | AD | BC | AE | BE | DE | BD | CE
Dist | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 4 | 5

[Figure: the MST over {A, B, C, D, E} with edges A-B (1), A-C (2), C-D (2) and A-E (3).]
Initial master cluster: {A B C D E}
Divisive Hierarchical Clustering: Example...
Step 2: [LOOP]
Break the link with the highest value to form sub-clusters (a disconnected graph) until the required number of clusters is achieved.

Iteration-1: Break the edge with the maximum distance: edge AE with distance 3.
[Figure: remaining tree A-B (1), A-C (2), C-D (2), with E isolated.]
Clusters: {A B C D} {E}

Iteration-2: Break the edge with the maximum distance: edge AC with distance 2.
[Figure: remaining edges A-B (1) and C-D (2), with E isolated.]
Clusters: {A B} {C D} {E}
Divisive Hierarchical Clustering: Example...
Step 2: [LOOP] cont..
Iteration-3: Break the edge with the maximum distance: edge CD with distance 2.
[Figure: only edge A-B (1) remains; C, D and E are isolated.]

Clusters: {A B} {C} {D} {E}
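A hedged SciPy sketch of this MST-based divisive procedure: build the minimum spanning tree of the distance matrix, then repeatedly remove the heaviest remaining edge until the requested number of connected components (clusters) is reached. Tie-breaking among equal-length edges may differ from the hand-worked example above.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

labels = ["a", "b", "c", "d", "e"]
D = np.array([[0, 1, 2, 2, 3],        # the 5x5 distance matrix of set A, as printed above
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

mst = minimum_spanning_tree(D).toarray()        # Step 1: master cluster as an MST
n_clusters = 4
for _ in range(n_clusters - 1):                 # Step 2: cut the longest remaining edges
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0

n, comp = connected_components(mst, directed=False)
print(n, [{l for l, c in zip(labels, comp) if c == k} for k in range(n)])
```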


Dendrogram of Divisive Hierarchical Clustering: Example...

[Figure: dendrogram of the divisive clustering, splitting {A B C D E} into {A B C D} and {E}, then {A B C D} into {A B} and {C D}, and finally {C D} into {C} and {D}.]
Closeness of Two Clusters

The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters; the primary ones are:
 Euclidean distance
 Squared Euclidean distance
 Manhattan distance
 Maximum distance
 Mahalanobis distance
