Professional Documents
Culture Documents
Presented by –
Rohit Paul
Disadvantages of Partitioning Method
Choosing k manually
Clustering data of varying sizes and density
Sensitive to outliers
Figure 1 Figure 2
Density-based Method
Cluster in a data space is a contiguous region of high
point density
Separated by lower density of points
Density within the areas of noise is assumed to be
lower
DBSCAN
Density Based Spatial Clustering of Applications
with Noise
How is the density estimated?
Estimates the density by counting the number of points
in a fixed-radius neighborhood
KEY IDEA: For each point of a cluster the
neighborhood of a given radius has to contain at least
a minimum number of points
Two input parameters required:
Epsilon(ε)
MinPts
ε-Neighborhood: Objects within a radius ε from an
object
Three types of points based on ε-Neighborhood:
Core Point
Border Point
Noise Point
Terminology used:
Directly density-reachable
o A point q is directly density-reachable from a core
point p if q is within the Eps-neighborhood of p.
Density-reachable
o A point q is density reachable
from p if there are a set of core
points leading from p to q.
Density-connected
o Two points p and q are density connected if there
are a core point o, such that both p and q are density
reachable from o
A cluster satisfy the following conditions:
If p be a core point and the set of all point which
are density-reachable from p be O. Then, this set
O is a cluster with respect to Eps and Minpts.
For all p and q in cluster C, p is density-
connected to q with respect to Eps and Minpts.
Noise, a set of points which do not belongs to any
clusters
Pseudo code of DBSCAN Algorithm
Scikit-learn implementation
Reference
https://developers.google.com/machine-
learning/clustering/algorithm/advantages-disadvantages
https://towardsdatascience.com/dbscan-clustering-
explained-97556a2ad556
https://scikit-
learn.org/stable/modules/generated/sklearn.cluster.DBS
CAN.html
Martin Ester, Hans-Peter Kriegel, J¨org Sander, and
Xiaowei Xu. A density-based algorithm for discovering
clusters in large spatial databases with noise.
Thank You