Taught by
Prof. Datta Deshmukh
Unit V
Clustering and Classification:
Clustering: basics of clustering; similarity/dissimilarity measures; clustering criteria; minimum within-cluster distance criterion; K-means algorithm; DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Basics of Clustering
Unsupervised Learning
For example, news articles can be grouped together based on topics such as sports, business, technology, politics, etc.
This approach is known as clustering.
Clustering Methods:
Density-Based Methods:
These methods treat clusters as dense regions of the data space, separated from regions of lower density. They achieve good accuracy and have the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
Clustering Methods
Hierarchical-Based Methods: The clusters formed by these methods form a tree-like structure based on the hierarchy. New clusters are formed from previously formed ones. The approach falls into two categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
Examples: CURE (Clustering Using REpresentatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
Clustering Methods
Partitioning Methods:
These methods partition the objects into k clusters, each partition forming one cluster. They optimize an objective criterion similarity function, for example when distance is the major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon RANdomized Search), etc.
Grid-Based Methods: In these methods, the data space is divided into a finite number of cells that form a grid-like structure. All the clustering operations performed on these grids are fast and independent of the number of data objects. Examples: STING (STatistical INformation Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
Clustering Algorithm
K-Means Clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into different clusters.
Applications of Clustering in different
fields:
Marketing: It can be used to characterize & discover customer segments for
marketing purposes.
Biology: It can be used for classification among different species of plants
and animals.
Libraries: It is used in clustering different books on the basis of topics and
information.
Insurance: It is used to understand customers and their policies, and to identify fraud.
City Planning: It is used to make groups of houses and to study their values
based on their geographical locations and other factors present.
Earthquake studies: By studying earthquake-affected areas, we can identify dangerous zones.
Image Processing: Clustering can be used to group similar images together,
classify images based on content, and identify patterns in image data.
Genetics: Clustering is used to group genes that have similar
expression patterns and identify gene networks that work together in
biological processes.
Finance: Clustering is used to identify market segments based on
customer behavior, identify patterns in stock market data, and analyze
risk in investment portfolios.
Customer Service: Clustering is used to group customer inquiries and
complaints into categories, identify common issues, and develop
targeted solutions.
Manufacturing: Clustering is used to group similar products together,
optimize production processes, and identify defects in manufacturing
processes.
Medical diagnosis: Clustering is used to group patients with similar
symptoms or diseases, which helps in making accurate diagnoses and
identifying effective treatments.
Fraud detection: Clustering is used to identify suspicious patterns or
anomalies in financial transactions, which can help in detecting fraud
or other financial crimes.
Traffic analysis: Clustering is used to group similar patterns of
traffic data, such as peak hours, routes, and speeds, which can
help in improving transportation planning and infrastructure.
Social network analysis: Clustering is used to identify
communities or groups within social networks, which can help in
understanding social behavior, influence, and trends.
Cybersecurity: Clustering is used to group similar patterns of
network traffic or system behavior, which can help in detecting
and preventing cyberattacks.
Climate analysis: Clustering is used to group similar patterns of
climate data, such as temperature, precipitation, and wind, which
can help in understanding climate change and its impact on the
environment.
Measures of Distance
Euclidean Distance
Euclidean distance is the traditional metric for geometric problems.
It can be explained simply as the ordinary straight-line distance between two points.
It is one of the most widely used metrics in cluster analysis.
One algorithm that uses this formula is K-means.
Mathematically, it computes the square root of the sum of squared differences between the coordinates of two objects.
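The definition above can be sketched in plain Python (the function name is illustrative, not from the slides):

```python
import math

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Between (1, 2) and (4, 6): sqrt(3^2 + 4^2) = 5.0
d = euclidean_distance((1, 2), (4, 6))
```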
Minkowski Distance:
It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
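A minimal sketch of the formula in Python (the function name is illustrative): the Minkowski distance of order r is the r-th root of the sum of the r-th powers of the absolute coordinate differences.

```python
def minkowski_distance(p, q, r):
    # r = 1 gives the Manhattan distance, r = 2 gives the Euclidean distance
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)
```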
Manhattan Distance
This is the sum of the absolute differences between the coordinates of a pair of points.
Suppose we have two points P and Q. To determine the distance between them, we sum the absolute differences of their coordinates along each axis, in a plane with P at coordinate (x1, y1) and Q at (x2, y2).
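As a sketch in Python (illustrative helper name):

```python
def manhattan_distance(p, q):
    # Sum of absolute coordinate differences along each axis
    return sum(abs(a - b) for a, b in zip(p, q))

# |1 - 4| + |2 - 6| = 7
d = manhattan_distance((1, 2), (4, 6))
```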
A similarity measure for two objects (i, j) returns 1 if they are similar and 0 if they are dissimilar.
A dissimilarity measure works the opposite way: it returns 1 if the objects are dissimilar and 0 if they are similar.
Similarity and dissimilarity measures help remove outliers. Their use quickly eliminates redundant data, since they identify potential outliers as objects highly dissimilar to the others.
The measure of similarity or dissimilarity is referred to as proximity.
A measure of similarity can often be expressed as a function of a measure of dissimilarity.
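For binary-valued attributes, one concrete choice is the simple matching coefficient, with dissimilarity obtained from similarity as d = 1 - s (a sketch; the function names are illustrative):

```python
def simple_matching_similarity(x, y):
    # Fraction of attribute positions on which the two objects agree
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def simple_matching_dissimilarity(x, y):
    # Dissimilarity expressed as a function of similarity: d = 1 - s
    return 1 - simple_matching_similarity(x, y)
```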
Dissimilarity matrix
Calculating proximity measures
Clustering criteria
Clustering criteria determine how data points are grouped into clusters.
Centroid-Based Criteria:
K-Means Clustering: This is one of the most popular centroid-based clustering methods. It aims to partition the data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid).
Density-Based Criteria:
Objective function
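The objective minimized by K-means, the within-cluster sum of squares (WCSS), can be written as:

```latex
\mathrm{WCSS} = \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

where C_j is the j-th cluster and mu_j is its centroid.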
Minimizing WCSS
Clustering algorithms aim to minimize the WCSS
across all clusters.
This minimization process involves iteratively
updating cluster assignments and centroids until
convergence, where data points are grouped into
clusters such that the total WCSS is minimized.
Application Example
k-Means Algorithm:
Input: D, a dataset containing n objects; k, the number of clusters
Output: A set of k clusters
Steps:
1. Randomly choose k objects from D as the initial cluster centroids.
2. For each of the objects in D:
   Compute the distance between the current object and the k cluster centroids.
   Assign the current object to the cluster to which it is closest.
3. Compute the "cluster centers" of each cluster. These become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop.
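The steps above can be sketched in plain Python (a minimal illustration with made-up helper names; a real application would use an optimized library implementation):

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    # Mean of each attribute value over all objects in the cluster
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)              # step 1: random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in data:                           # step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = [mean_point(c) if c else centroids[j]   # step 3: recompute
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:           # step 4: converged, no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters
```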
k-Means: Note Points
1. Objects are defined in terms of a set of attributes A = {A1, A2, ..., Am}, where each Ai is of continuous data type.
2. Distance computation: any distance such as L1, L2, L3, or cosine similarity.
3. Minimum distance is the measure of closeness between an object and a centroid.
4. Mean calculation: the mean of each attribute value over all objects in a cluster.
5. Convergence criteria: any one of the following is a termination condition for the algorithm:
   Maximum permissible number of iterations reached.
   No change of centroid values in any cluster.
   Zero (or no significant) movement of objects from one cluster to another.
   Cluster quality reaches a certain level of acceptance.
Quiz 1
Choose the correct statement(s) w.r.t K-Means
clustering.
Quiz 2
K in K-Means clustering stands for -
● Number of nearest neighbors
● Number of samples in each cluster
● Minimum distance between the
clusters
● Number of clusters
Quiz 3
Which of the following can act as a termination
criterion in K-Means?
● Fixed number of iterations
● Stationary centroids appear between
successive iterations.
● The distance between the clusters is
minimum.
● None of the above
Finding the optimal value of k
Clustering data points is a subjective decision, as there is no ground truth available. Domain knowledge or better business understanding may help build intuition about the right number of clusters.
There are also methods that help in selecting the optimal value of k. The most commonly used are:
1. Elbow method
2. Average silhouette method
Elbow Method
The Elbow Method is one of the most popular methods to determine the optimal value of k.
● It looks at the inertia for different values of k.
Inertia (or within-cluster sum of squared distances, or intra-cluster distance) is the sum of squared distances of samples to their closest cluster centroid.
Step 1. Perform k-means clustering for different values of k.
Step 2. For each k, calculate the inertia.
Step 3. Plot the curve of inertia against the number of clusters k.
Step 4. Choose the k where the inertia stops decreasing abruptly.
NOTE: As k increases, the inertia tends towards zero.
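To make inertia concrete, here is a small hand-worked illustration in Python (toy data and manually chosen partitions, for this sketch only): with one cluster the inertia is large; splitting into the two natural groups drops it sharply; splitting further only inches it toward zero. The sharp drop is the "elbow".

```python
def inertia(clusters, centroids):
    # Sum of squared distances of samples to their cluster centroid
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for cluster, c in zip(clusters, centroids)
        for p in cluster
    )

data = [(0, 0), (0, 2), (10, 0), (10, 2)]

# k = 1: one cluster, centroid (5, 1)
i1 = inertia([data], [(5, 1)])
# k = 2: the two natural groups, centroids (0, 1) and (10, 1)
i2 = inertia([data[:2], data[2:]], [(0, 1), (10, 1)])
# k = 4: every point its own cluster, inertia reaches 0
i4 = inertia([[p] for p in data], list(data))
```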
Homework
What are the K-Means Limitations and How to
overcome it?
Density-Based Spatial Clustering Of
Applications With Noise (DBSCAN)
Why DBSCAN?
Parameters Required For DBSCAN Algorithm
eps: It defines the neighborhood around a data point, i.e. if the distance between two points is lower than or equal to 'eps', they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.
MinPts: Minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1. The minimum value of MinPts must be at least 3.
In this algorithm, there are 3 types of data points.
Core Point: A point is a core point if it has at least MinPts points within eps.
Border Point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: A point that is neither a core point nor a border point.
Steps Used In DBSCAN Algorithm
1. Find all the neighbor points within eps and identify the core points, i.e. points with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are within the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
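The steps above can be sketched in plain Python (points as tuples; a minimal illustration, not the original algorithm's full bookkeeping):

```python
def region_query(data, p, eps):
    # All points within eps of p (p counts as its own neighbor)
    return [q for q in data
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]

def dbscan(data, eps, min_pts):
    labels = {}                      # point -> cluster id, or -1 for noise
    cluster_id = 0
    for p in data:
        if p in labels:
            continue
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = -1           # tentatively noise; may later become a border point
            continue
        cluster_id += 1              # step 2: new cluster from an unassigned core point
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:                 # step 3: expand via density-connected points
            q = seeds.pop()
            if labels.get(q) == -1:
                labels[q] = cluster_id    # noise reachable from a core point: border point
            if q in labels:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(data, q, eps)
            if len(q_neighbors) >= min_pts:   # q is itself a core point: keep expanding
                seeds.extend(q_neighbors)
    return labels                    # step 4: points still labelled -1 are noise
```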
When Should We Use DBSCAN Over K-Means In Clustering Analysis?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-Means are both clustering algorithms that group together data with similar characteristics. However, they work on different principles and are suitable for different types of data. We prefer DBSCAN when the data is not spherical in shape or the number of classes is not known beforehand.
Homework
What is the difference between DBSCAN and K-Means?