
Unit V

Clustering and Classification



Taught by Prof. Datta Deshmukh
Unit V
Clustering and Classification:
Clustering: basics of clustering; similarity/dissimilarity
measures; clustering criteria; minimum within-cluster
distance criterion; K-means algorithm; DBSCAN
(Density-Based Spatial Clustering of Applications with
Noise).

Basics of Clustering
Unsupervised Learning

Unsupervised learning has no explicit output/target
variable, i.e. it works without supervision.
It discovers knowledge, hidden structures or
relationships in unlabelled data. For example, it can
learn to group or organize data in such a way that
similar objects are in the same group.

Similarly, news articles can be grouped together based
on topics such as sports, business, technology, politics,
etc. This approach is known as Clustering.

Clustering is the task of dividing the population or
data points into a number of groups such that data
points in the same group are more similar to each
other and dissimilar to the data points in other groups.
It is, in essence, a grouping of objects on the basis of
the similarity and dissimilarity between them.
For example, the data points in the figure below that are
clustered together can be treated as a single group. We
can distinguish the clusters and identify that there are 3
clusters in the picture.

[Figure: scatter plot of data points forming 3 clusters]
Clustering Methods:
Density-Based Methods:
These methods treat clusters as dense regions of the
data space separated by regions of lower density.
They offer good accuracy and the ability to merge two
clusters. Examples: DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), OPTICS
(Ordering Points to Identify the Clustering Structure), etc.

Clustering Methods
Hierarchical-Based Methods: The clusters formed in
this method have a tree-type structure based on the
hierarchy. New clusters are formed using the
previously formed ones. It is divided into two categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives),
BIRCH (Balanced Iterative Reducing and Clustering
using Hierarchies), etc.

Clustering Methods
Partitioning Methods:
These methods partition the objects into k clusters and
each partition forms one cluster. This method is used
to optimize an objective criterion similarity function
such as when the distance is a major parameter
example K-means, CLARANS (Clustering Large
Applications based upon Randomized Search), etc.

Grid-based Methods: In this method, the data space
is divided into a finite number of cells that form a
grid-like structure. All the clustering operations done
on these grids are fast and independent of the number
of data objects. Examples: STING (Statistical
Information Grid), WaveCluster, CLIQUE (CLustering
In QUEst), etc.

Clustering Algorithm
K-Means Clustering is an unsupervised machine
learning algorithm that groups an unlabelled dataset
into different clusters.

Applications of Clustering in different
fields:
 Marketing: It can be used to characterize & discover customer segments for
marketing purposes.
 Biology: It can be used for classification among different species of plants
and animals.
 Libraries: It is used in clustering different books on the basis of topics and
information.
 Insurance: It is used to understand customers and their policies and
to identify fraud.
 City Planning: It is used to make groups of houses and to study their values
based on their geographical locations and other factors present.
 Earthquake studies: By studying earthquake-affected areas, we can
determine the dangerous zones.
 Image Processing: Clustering can be used to group similar images together,
classify images based on content, and identify patterns in image data.

 Genetics: Clustering is used to group genes that have similar
expression patterns and identify gene networks that work together in
biological processes.
 Finance: Clustering is used to identify market segments based on
customer behavior, identify patterns in stock market data, and analyze
risk in investment portfolios.
 Customer Service: Clustering is used to group customer inquiries and
complaints into categories, identify common issues, and develop
targeted solutions.
 Manufacturing: Clustering is used to group similar products together,
optimize production processes, and identify defects in manufacturing
processes.
 Medical diagnosis: Clustering is used to group patients with similar
symptoms or diseases, which helps in making accurate diagnoses and
identifying effective treatments.
 Fraud detection: Clustering is used to identify suspicious patterns or
anomalies in financial transactions, which can help in detecting fraud
or other financial crimes.
 Traffic analysis: Clustering is used to group similar patterns of
traffic data, such as peak hours, routes, and speeds, which can
help in improving transportation planning and infrastructure.
 Social network analysis: Clustering is used to identify
communities or groups within social networks, which can help in
understanding social behavior, influence, and trends.
 Cybersecurity: Clustering is used to group similar patterns of
network traffic or system behavior, which can help in detecting
and preventing cyberattacks.
 Climate analysis: Clustering is used to group similar patterns of
climate data, such as temperature, precipitation, and wind, which
can help in understanding climate change and its impact on the
environment.

Measures of Distance

Clustering consists of grouping objects that are
similar to each other; distance measures can be used
to decide whether two items are similar or dissimilar
in their properties.
The similarity measure is a distance with dimensions
describing object features.
That means that if the distance between two data points
is small, then there is a high degree of similarity
between the objects, and vice versa.

Measures of Distance

Similarity is subjective and depends heavily on the
context and application. For example, similarity
among vegetables can be determined from their taste,
size, colour, etc. Most clustering approaches use
distance measures to assess the similarities or
differences between a pair of objects. The most popular
distance measures are:

Euclidean Distance
Euclidean distance is considered the traditional metric
for problems with geometry.
 It can be simply explained as the ordinary
distance between two points.
It is one of the most widely used distance measures in
cluster analysis.
 One of the algorithms that uses this measure is
K-means.
Mathematically, it computes the square root of the sum
of squared differences between the coordinates of two
objects.

d(P, Q) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xN - yN)^2)

for points P = (x1, x2, ..., xN) and Q = (y1, y2, ..., yN).
Minkowski distance:
It is the generalized form of the Euclidean and
Manhattan Distance Measure. In an N-dimensional
space, a point is represented as,
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)

D(P1, P2) = (|X1 - Y1|^p + |X2 - Y2|^p + ... + |XN - YN|^p)^(1/p)

where p = 1 gives the Manhattan distance and p = 2 gives the
Euclidean distance.
Manhattan Distance
This determines the absolute difference between the pair
of coordinates.
Suppose we have two points P and Q. To determine the
distance between these points, we simply have to
calculate the perpendicular distances of the points from
the X-axis and Y-axis. In a plane with P at coordinate
(x1, y1) and Q at (x2, y2):

Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|
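
To make these measures concrete, here is a minimal Python sketch
(the function and the sample points are illustrative, not from the
slides):

    def minkowski(p1, p2, p=2):
        # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
        return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

    P, Q = (1.0, 2.0), (4.0, 6.0)
    print(minkowski(P, Q, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
    print(minkowski(P, Q, p=1))  # Manhattan: |1-4| + |2-6| = 7.0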

A similarity measure for two objects (i, j) returns 1 if
they are similar and 0 if they are dissimilar.
A dissimilarity measure works the opposite way: it
returns 1 if the objects are dissimilar and 0 if they are
similar.
Similarity and dissimilarity measures help remove
outliers: potential outliers show up as objects that are
highly dissimilar to all others, so redundant data can
be eliminated quickly.
The measure of similarity and dissimilarity is referred
to as proximity.
A measure of similarity can often be expressed as a
function of a measure of dissimilarity.

Dissimilarity matrix

The dissimilarity matrix is an n x n table that stores d(i, j),
the measured dissimilarity between objects i and j, for every
pair of objects. It is symmetric, with d(i, j) = d(j, i) and
d(i, i) = 0, so often only the lower triangle is stored.
Calculating proximity measures

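As an illustration, a pairwise dissimilarity matrix for a small
dataset can be computed with SciPy (assumed available; the data
values are made up):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Four objects, each described by two continuous attributes.
    X = np.array([[1.0, 2.0],
                  [1.5, 1.8],
                  [8.0, 8.0],
                  [9.0, 7.5]])

    # Condensed pairwise Euclidean distances, expanded into the
    # symmetric n x n dissimilarity matrix with a zero diagonal.
    D = squareform(pdist(X, metric="euclidean"))
    print(np.round(D, 2))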
Clustering criteria
Clustering criteria define how data points are grouped
into clusters.
Centroid-Based Criteria:
K-Means Clustering: This is one of the most popular
centroid-based clustering methods. It aims to partition
data into K clusters where each data point belongs to the
cluster with the nearest mean (centroid).
Centroid: In K-means, the centroid of a cluster is the
mean vector of all the data points in that cluster. It
represents the "center" of the cluster.

Density-Based Criteria:
DBSCAN (Density-Based Spatial Clustering of
Applications with Noise): DBSCAN is a density-based
clustering algorithm that groups together points that are
closely packed, based on a specified minimum number of
points (MinPts) within a specified distance (Epsilon).
Core Points, Border Points, and Noise: DBSCAN
distinguishes between core points (points with at least
MinPts within Epsilon), border points (points within
Epsilon of a core point but with fewer than MinPts within
Epsilon), and noise points (points that are neither core
nor border points).
Minimum within cluster distance criterion
The minimum within-cluster distance criterion, also
known as the within-cluster sum of squares (WCSS)
criterion, is a measure used to evaluate the compactness
of clusters in clustering algorithms.
It calculates the sum of squared distances between each
data point and the centroid of its assigned cluster. The
goal is to minimize this sum, indicating that data points
within each cluster are close to their cluster center.
The objective of K-Means clustering is to minimize the
total intra-cluster distance (squared error).

By Prof.Datta Deshmukh
Objective function

WCSS = Σ (j = 1 to k) Σ (x in Cj) ||x - μj||^2

where μj is the centroid (mean) of cluster Cj.
Minimizing WCSS
Clustering algorithms aim to minimize the WCSS
across all clusters.
This minimization process involves iteratively
updating cluster assignments and centroids until
convergence, where data points are grouped into
clusters such that the total WCSS is minimized.
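
As a sketch, the WCSS for a given assignment can be computed as
follows (assuming NumPy; the helper name wcss is mine):

    import numpy as np

    def wcss(X, labels, centroids):
        # Sum of squared distances from each point to its assigned centroid.
        total = 0.0
        for j, c in enumerate(centroids):
            members = X[labels == j]            # points assigned to cluster j
            total += np.sum((members - c) ** 2)
        return total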

Application Example

Suppose you have a dataset with various engineering
measurements, and you apply K-means clustering to
group similar data points.
After clustering, you calculate the WCSS for each
cluster. Lower WCSS values indicate that data points
within clusters are closer to their centroids, indicating
better clustering quality.

k-Means Algorithm:
Input: D, a dataset containing n objects; k, the
number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster
centroids.
2. For each of the objects in D:
 Compute the distance between the current object and the k cluster
centroids.
 Assign the current object to the cluster to which it is closest.
3. Compute the "cluster centers" of each cluster. These
become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop.
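
The steps above can be sketched in a few lines of NumPy (a minimal
illustration, not a production implementation; for instance, it does
not handle clusters that become empty):

    import numpy as np

    def k_means(D, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: randomly choose k objects from D as the initial centroids.
        centroids = D[rng.choice(len(D), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each object to its closest centroid (L2 distance).
            dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its cluster.
            new_centroids = np.array([D[labels == j].mean(axis=0)
                                      for j in range(k)])
            # Step 4: converged when no centroid moves between iterations.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids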
k-Means: Note Points
1. Objects are defined in terms of a set of attributes A = {A1,
A2, ..., Am}, where each Ai is of a continuous data type.
2. Distance computation: any distance such as L1, L2, L3 or
cosine similarity.
3. Minimum distance is the measure of closeness between an
object and a centroid.
4. Mean calculation: a centroid is the mean value of each
attribute over all objects in the cluster.
5. Convergence criteria: any one of the following can serve as
the termination condition of the algorithm:
 The maximum permissible number of iterations is reached.
 No change of centroid values in any cluster.
 Zero (or no significant) movement of objects
from one cluster to another.
 Cluster quality reaches a certain level of
acceptance.
Quiz 1
Choose the correct statement(s) w.r.t K-Means
clustering.

● It is often used for unlabelled data.
● It can be used to segment customers based on
their past behavior/characteristics.
● It puts two dissimilar points in the same cluster.
● All of the above

Quiz 2
K in K-Means clustering stands for -
● Number of nearest neighbors
● Number of samples in each cluster
● Minimum distance between the
clusters
● Number of clusters

Quiz 3
Which of the following can act as a termination
criterion in K-Means?
● Fixed number of iterations
● Stationary centroids appear between
successive iterations.
● The distance between the clusters is
minimum.
● None of the above

Finding optimal value of k
Clustering data points is a subjective decision, as there
is no ground truth available. Domain knowledge or
better business understanding may help in getting an
intuition for the right number of clusters.
There are methods that help in selecting the optimal
value of k. The most commonly used are:
1. Elbow method
2. Average silhouette method

Elbow Method
 The Elbow Method is one of the most popular methods to
determine this optimal value of k.
● It looks at the inertia for different values of k.
Inertia (or within-cluster sum of squared distances, or
intra-cluster distance) is the sum of squared distances of
samples to their closest cluster centroid.
Step 1. Perform k-means clustering for different
values of k.
Step 2. For each k, calculate the inertia.
Step 3. Plot the curve of inertia against the
number of clusters k.
Step 4. Choose the k at which the inertia stops
decreasing abruptly (the "elbow").
NOTE - As k increases, the inertia tends towards zero.
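
A sketch of the procedure with scikit-learn and matplotlib (assumed
available; the toy data is made up):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(300, 2))   # toy data

    inertias = []
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)                     # WCSS for this k

    plt.plot(range(1, 11), inertias, marker="o")         # look for the "elbow"
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia (WCSS)")
    plt.show()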
[Figure: inertia plotted against the number of clusters k; the
"elbow" marks the optimal k.]
Homework
What are the limitations of K-Means, and how can they be
overcome?

Density-Based Spatial Clustering Of
Applications With Noise (DBSCAN)

Clusters are dense regions in the data space, separated
by regions of lower point density.
The DBSCAN algorithm is based on this intuitive
notion of "clusters" and "noise".
The key idea is that for each point of a cluster, the
neighborhood of a given radius has to contain at least a
minimum number of points.

Why DBSCAN?

Partitioning methods (K-means, PAM clustering) and
hierarchical clustering work well for finding
spherical-shaped or convex clusters.
In other words, they are suitable only for compact and
well-separated clusters.
Moreover, they are also severely affected by the
presence of noise and outliers in the data.
Real-life data may contain irregularities: clusters
can be of arbitrary shape, and the data may contain
noise.

Parameters Required For
DBSCAN Algorithm
eps: It defines the neighborhood around a data point, i.e. if
the distance between two points is lower than or equal to 'eps'
then they are considered neighbors. If the eps value is chosen
too small, then a large part of the data will be considered
outliers. If it is chosen very large, then the clusters will
merge and the majority of the data points will end up in the
same cluster. One way to find the eps value is based on the
k-distance graph.
MinPts: The minimum number of neighbors (data points) within
the eps radius. The larger the dataset, the larger the value of
MinPts that should be chosen. As a general rule, the minimum
MinPts can be derived from the number of dimensions D in the
dataset as MinPts >= D+1. The minimum value of MinPts must be
at least 3.
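
For the k-distance graph mentioned above, a common sketch (assuming
scikit-learn; variable names are mine) sorts each point's distance to
its k-th nearest neighbor and reads eps off the "knee" of the curve:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.random.default_rng(0).normal(size=(200, 2))   # toy data
    k = 4                                # often chosen equal to MinPts

    # +1 because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    k_dist = np.sort(dists[:, -1])   # distance to the k-th neighbor, sorted
    # Plotting k_dist and locating the knee suggests a value for eps.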
In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has at least
MinPts points within eps.
Border Point: A point that has fewer than MinPts points
within eps but is in the neighborhood of a core
point.
Noise or outlier: A point that is neither a core point
nor a border point.
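
These three definitions can be checked directly in code; the
following brute-force sketch is illustrative (here a point counts
itself among its eps-neighbors, one of several conventions):

    import numpy as np

    def classify_points(X, eps, min_pts):
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighbors = dist <= eps                  # n x n boolean matrix
        core = neighbors.sum(axis=1) >= min_pts  # core: >= MinPts within eps
        labels = []
        for i in range(len(X)):
            if core[i]:
                labels.append("core")
            elif neighbors[i][core].any():       # within eps of some core point
                labels.append("border")
            else:
                labels.append("noise")
        return labels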

Steps Used In DBSCAN Algorithm
1. Find all the neighbor points within eps and identify the core
points, i.e. those with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster,
create a new cluster.
3. Recursively find all of its density-connected points and assign
them to the same cluster as the core point.
Points a and b are said to be density-connected if there exists
a point c which has a sufficient number of points in its
neighborhood and both a and b are within eps distance of it.
This is a chaining process: if b is a neighbor of c, c is a
neighbor of d, and d is a neighbor of e, which in turn is a
neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset.
Those points that do not belong to any cluster are noise.
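
In practice these steps are rarely hand-rolled; a typical
scikit-learn usage sketch (parameter values are illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two crescent-shaped clusters that K-means would split incorrectly.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    labels = db.labels_          # cluster index per point; -1 marks noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("clusters:", n_clusters, "noise points:", int((labels == -1).sum()))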
When Should We Use DBSCAN Over
K-Means In Clustering Analysis?
DBSCAN (Density-Based Spatial Clustering of
Applications with Noise) and K-Means are both
clustering algorithms that group together data with
similar characteristics. However, they work on
different principles and are suitable for different types
of data. We prefer DBSCAN when the data is not
spherical in shape or the number of clusters is not
known beforehand.

Homework
What is the difference between DBSCAN and K-Means?
