

Clustering - Advanced Topics: OPTICS

SIGKDD, ACM Student Chapter @ HITK

Presenter: Dr. Partha Basuchowdhuri


Assistant Professor, Dept of CSE
Heritage Institute of Technology,
Chowbaga Road, Anandapur,
Kolkata, INDIA

March 10, 2017


Outline for Part I

1 Density-based Clustering
Recap - DBSCAN
Extended Terminologies
Algorithm

2 Experiments: OPTICS
Reachability Plots
Sensitivity Analysis
Conclusion


Recap - Existing Terminologies (DBSCAN)


Two parameters:
Eps (ε): Maximum radius of the neighbourhood
MinPts (µ): Minimum number of points in the ε-neighbourhood of a point
N_ε(q) = {p ∈ D | dist(p, q) ≤ ε}


Definition 1: Directly density-reachable

A point p is directly density-reachable from a point q w.r.t. ε, µ if
1. p belongs to N_ε(q), and
2. the core-point condition holds: |N_ε(q)| ≥ µ
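
As an illustration of Definition 1, here is a minimal Python sketch. It assumes Euclidean distance, points stored as tuples, and that a point counts as a member of its own ε-neighbourhood. The function names and the toy data are made up for this example and do not come from the slides.

```python
import math

def eps_neighborhood(points, q, eps):
    """Indices of all points within distance eps of points[q] (q itself included)."""
    return [i for i, p in enumerate(points) if math.dist(p, points[q]) <= eps]

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if points[p] is directly density-reachable from points[q]."""
    neighbors = eps_neighborhood(points, q, eps)
    # Condition 1: p lies in the eps-neighborhood of q.
    # Condition 2: q satisfies the core-point condition.
    return p in neighbors and len(neighbors) >= min_pts

# Toy data (hypothetical): four points, only index 2 has three neighbours within eps.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0)]
print(directly_density_reachable(points, 3, 2, eps=1.0, min_pts=3))  # True
print(directly_density_reachable(points, 2, 3, eps=1.0, min_pts=3))  # False
```

The two calls show that the relation is not symmetric: point 3 is directly density-reachable from the core point 2, but not the other way round, because point 3 is not a core point.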


Definition 2: Density-reachable (Asymmetric relation)

A point p is density-reachable from a point q w.r.t. ε, µ in D, if there is a chain of points p1, p2, ..., pn with p1 = q and pn = p such that pi ∈ D and pi+1 is directly density-reachable from pi w.r.t. ε and µ.


Definition 3: Density-connected (Symmetric relation)

A point p is density-connected to a point q w.r.t. ε, µ in D, if there is a point o ∈ D such that both p and q are density-reachable from o w.r.t. ε and µ in D.
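
Density-reachability and density-connectivity can be checked with a breadth-first search over core points. The sketch below is only an illustration under stated assumptions: Euclidean distance, brute-force neighbour search, and a point counted in its own ε-neighbourhood. All function names and the toy data are hypothetical.

```python
import math
from collections import deque

def neighbors(points, q, eps):
    """Indices of all points within eps of points[q], q included."""
    return [i for i, p in enumerate(points) if math.dist(p, points[q]) <= eps]

def reachable_set(points, q, eps, min_pts):
    """Indices of all points density-reachable from points[q] (empty if q is not core)."""
    reached = set()
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # a non-core point cannot extend the chain
        for n in nbrs:
            if n not in reached:
                reached.add(n)
                queue.append(n)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some point o has both p and q in its density-reachable set."""
    return any({p, q} <= reachable_set(points, o, eps, min_pts)
               for o in range(len(points)))

# Toy data (hypothetical): a chain of four points and one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(3 in reachable_set(points, 1, 1.0, 3))    # True: chain 1 -> 2 -> 3
print(1 in reachable_set(points, 3, 1.0, 3))    # False: point 3 is not a core point
print(density_connected(points, 0, 3, 1.0, 3))  # True: both are reachable from point 1
```

The border points 0 and 3 are not density-reachable from each other, yet they are density-connected through the core point 1, which is why the cluster definition relies on the symmetric relation.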



Definition 4: Cluster and Noise
If D is the set of points, a cluster C w.r.t. ε and µ in D is a non-empty subset of D satisfying the following conditions:
1. Maximality: ∀ p, q ∈ D, if p ∈ C and q is density-reachable from p w.r.t. ε and µ, then also q ∈ C.
2. Connectivity: ∀ p, q ∈ C, p is density-connected to q w.r.t. ε and µ in D.
Noise is the set of points in D that do not belong to any cluster.


Recap - DBSCAN is sensitive to parameter choices


DBSCAN cannot recognize clusters with varying densities


Basic targets for an improved DBSCAN




Finding the right parameter values should not be a constraint.

Furthermore, the algorithm should not be parameter sensitive, i.e., the results should not change drastically with a change in parameter values (which unfortunately happens in DBSCAN).

On the other hand, density-based cluster structures are successful when the parameters are right, so we would like to preserve them.

It would be nice to have cluster structures that are density-based, yet hierarchical. Can we do that?

Features of OPTICS


Without producing clusters explicitly, OPTICS generates an ordering of the data objects that represents the density-based cluster structure.

This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings.

Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure.

The ordering can be represented graphically or with visualization techniques, giving the user an intuitive and simple interpretation.


OPTICS - Extended Terminologies


Definition 5: Core-distance (cDist(p))
Let p be a point in D, ε be a distance value, N_ε(p) be the ε-neighborhood of p, µ be a natural number and µ-distance(p) be the distance from p to its µ-th nearest neighbor. Then the core-distance of p is defined as

\[
\mathrm{cDist}_{\varepsilon,\mu}(p) =
\begin{cases}
\text{UNDEFINED}, & \text{if } |N_{\varepsilon}(p)| < \mu \\
\mu\text{-distance}(p), & \text{otherwise}
\end{cases}
\]
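
A small Python sketch of the core-distance may make the two cases concrete. It assumes Euclidean distance, models UNDEFINED as positive infinity, and counts p itself as its first neighbour, in line with the ε-neighbourhood used in the DBSCAN recap. The function name and the toy data are hypothetical.

```python
import math

UNDEFINED = math.inf  # assumption: UNDEFINED is modelled as +infinity

def core_distance(points, p, eps, min_pts):
    """Distance from points[p] to its min_pts-th neighbour, or UNDEFINED if p is not core."""
    # Distances from p to every point, p itself included (distance 0.0).
    within = sorted(d for d in (math.dist(points[p], x) for x in points) if d <= eps)
    if len(within) < min_pts:
        return UNDEFINED
    return within[min_pts - 1]

# Toy data (hypothetical): four points on a line and one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(core_distance(points, 0, eps=2.0, min_pts=3))  # 2.0
print(core_distance(points, 1, eps=2.0, min_pts=3))  # 1.0
print(core_distance(points, 4, eps=2.0, min_pts=3))  # inf (not a core point)
```

Point 1 sits in the middle of the group and therefore has a smaller core-distance than the end point 0, while the isolated point 4 has no defined core-distance at all.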




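Definition 6: Reachability-distance (rDist(p,o))
Let p and o be points in D, N_ε(o) be the ε-neighborhood of o and µ be a natural number. Then the reachability-distance of p with respect to o is defined as

\[
\mathrm{rDist}_{\varepsilon,\mu}(p, o) =
\begin{cases}
\text{UNDEFINED}, & \text{if } |N_{\varepsilon}(o)| < \mu \\
\max\bigl(\mathrm{cDist}_{\varepsilon,\mu}(o),\ \mathrm{distance}(o, p)\bigr), & \text{otherwise}
\end{cases}
\]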
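
The reachability-distance can be sketched on top of a core-distance helper. As before, this is an illustration only: Euclidean distance, UNDEFINED modelled as positive infinity, and a point counted in its own ε-neighbourhood are assumptions, and the function names and toy data are hypothetical.

```python
import math

UNDEFINED = math.inf  # assumption: UNDEFINED is modelled as +infinity

def core_distance(points, o, eps, min_pts):
    """Distance from points[o] to its min_pts-th neighbour within eps, else UNDEFINED."""
    within = sorted(d for d in (math.dist(points[o], x) for x in points) if d <= eps)
    return within[min_pts - 1] if len(within) >= min_pts else UNDEFINED

def reachability_distance(points, p, o, eps, min_pts):
    """Reachability-distance of points[p] with respect to points[o]."""
    c_dist = core_distance(points, o, eps, min_pts)
    if c_dist == UNDEFINED:
        return UNDEFINED  # o is not a core point
    return max(c_dist, math.dist(points[o], points[p]))

# Toy data (hypothetical): four points on a line and one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(reachability_distance(points, 0, 1, eps=2.0, min_pts=3))  # 1.0
print(reachability_distance(points, 1, 0, eps=2.0, min_pts=3))  # 2.0
print(reachability_distance(points, 0, 4, eps=2.0, min_pts=3))  # inf
```

The second call shows the effect of the max: point 1 is only 1.0 away from point 0, but the core-distance of point 0 is 2.0, so the reachability-distance cannot drop below 2.0.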


Algorithm - Prelude


Observation: For a fixed µ, denser clusters (smaller ε) are completely contained in less dense clusters (larger ε).
Idea: Process objects in the “right” order and keep track of point density in their neighborhood.
Parameters: “generating” distance ε, fixed value µ.
Core-distance(o): smallest distance such that o is a core point. The upper bound of cDist(o) is ε; cDist for a non-core point is UNDEFINED.
Reachability-distance(p, o): smallest distance such that p is directly density-reachable from o. The upper bound of rDist(p, o) is ε; rDist for a point p that is not reachable from o is UNDEFINED.

Algorithm - How OPTICS works

Order points by shortest reachability distance to guarantee that clusters w.r.t. higher density are finished first (for a constant µ, higher density requires lower ε).

Data structure: a list that memorizes the shortest reachability distances seen so far (“distance of a jump to that point”).

Visit each point. Always perform the shortest jump.

Output:
order of points.
core-distance of points.
reachability-distance of points.

Algorithm 1: OPTICS(SetOfObjects, ε, MinPts, OrderedFile)

begin
    OrderedFile.open()
    foreach Object ∈ SetOfObjects do
        if not Object.Processed then
            ExpandClusterOrder(SetOfObjects, Object, ε, MinPts, OrderedFile)
    OrderedFile.close()


Algorithm 2: ExpandClusterOrder(SetOfObjects, Object, ε, MinPts, OrderedFile)

begin
    neighbors ← SetOfObjects.neighbors(Object, ε)
    Object.Processed ← TRUE
    Object.reachability_distance ← UNDEFINED
    Object.setCoreDistance(neighbors, ε, MinPts)
    OrderedFile.write(Object)
    if Object.core_distance ≠ UNDEFINED then
        OrderSeeds.update(neighbors, Object)
        while |OrderSeeds| > 0 do
            currentObject ← OrderSeeds.next()
            neighbors ← SetOfObjects.neighbors(currentObject, ε)
            currentObject.Processed ← TRUE
            currentObject.setCoreDistance(neighbors, ε, MinPts)
            OrderedFile.write(currentObject)
            if currentObject.core_distance ≠ UNDEFINED then
                OrderSeeds.update(neighbors, currentObject)


Algorithm 3: OrderSeeds::update(neighbors, CenterObject)

begin
    c_dist ← CenterObject.core_distance
    foreach Object ∈ neighbors do
        if not Object.Processed then
            new_r_dist ← max{c_dist, CenterObject.dist(Object)}
            if Object.reachability_distance = UNDEFINED then
                Object.reachability_distance ← new_r_dist
                insert(Object, new_r_dist)
            else
                /* Object already in OrderSeeds */
                if new_r_dist < Object.reachability_distance then
                    Object.reachability_distance ← new_r_dist
                    decrease(Object, new_r_dist)
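
Algorithms 1 to 3 can be condensed into a short Python sketch. This is not the presenter's implementation; it is an illustration under several assumptions: Euclidean distance, a brute-force neighbourhood query, UNDEFINED modelled as positive infinity, results returned as lists instead of being written to an OrderedFile, and a binary heap with lazy deletion standing in for the insert/decrease operations of OrderSeeds.

```python
import heapq
import math

UNDEFINED = math.inf  # assumption: UNDEFINED is modelled as +infinity

def optics(points, eps, min_pts):
    """Return (order, reach, core): the cluster-ordering and per-point distances."""
    n = len(points)
    processed = [False] * n
    reach = [UNDEFINED] * n
    core = [UNDEFINED] * n
    order = []

    def neighbors(i):
        # Brute-force eps-neighbourhood query, O(n) per call.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def update(seeds, nbrs, center):
        # Counterpart of OrderSeeds::update.
        for j in nbrs:
            if processed[j]:
                continue
            new_r = max(core[center], math.dist(points[center], points[j]))
            if new_r < reach[j]:  # covers both "insert" and "decrease"
                reach[j] = new_r
                heapq.heappush(seeds, (new_r, j))  # stale entries are skipped on pop

    def process(i, seeds):
        nbrs = neighbors(i)
        processed[i] = True
        if len(nbrs) >= min_pts:  # setCoreDistance
            core[i] = sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]
        order.append(i)  # stands in for OrderedFile.write
        if core[i] != UNDEFINED:
            update(seeds, nbrs, i)

    for start in range(n):
        if processed[start]:
            continue
        seeds = []
        process(start, seeds)
        while seeds:
            _, cur = heapq.heappop(seeds)
            if not processed[cur]:
                process(cur, seeds)
    return order, reach, core

# Toy data (hypothetical): two tight groups of three points, 8 units apart.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
order, reach, core = optics(points, eps=10.0, min_pts=2)
print(order)  # [0, 1, 2, 3, 4, 5]
print(reach)  # [inf, 1.0, 1.0, 8.0, 1.0, 1.0]
```

In the toy run the reachability value of 8.0 at point 3 marks the jump from the first group to the second. The lazy-deletion heap keeps the code short, at the cost of holding outdated entries until they are popped.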


Algorithm - Properties
“Flat” density-based clusters w.r.t. ε* ≤ ε and µ can be extracted afterwards:
A cluster starts with an object o where cDist(o) ≤ ε* and rDist(o) > ε*.
It continues while rDist(o) ≤ ε*.

Performance: O(n · runtime(ε-neighborhood query))
Without spatial index support (worst case): O(n²)
Tree-based spatial index support: O(n log n) (R*-tree, X-tree)
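
The extraction rule for flat clusters can be sketched as a single pass over the cluster-ordering. The function name, the noise label, and the hard-coded ordering below are assumptions made for illustration; UNDEFINED is again modelled as positive infinity.

```python
import math

NOISE = -1  # assumption: label used for noise points

def extract_flat_clusters(order, reach, core, eps_star):
    """Assign a cluster id (or NOISE) to every point for a threshold eps_star <= eps."""
    labels = {}
    cluster_id = -1
    for o in order:
        if reach[o] > eps_star:
            # o is not reachable at eps_star from the points before it.
            if core[o] <= eps_star:
                cluster_id += 1  # o is a core point: start a new cluster
                labels[o] = cluster_id
            else:
                labels[o] = NOISE
        else:
            labels[o] = cluster_id  # continue the current cluster
    return labels

# Hypothetical cluster-ordering for six points (two groups with a jump of 8.0).
order = [0, 1, 2, 3, 4, 5]
reach = [math.inf, 1.0, 1.0, 8.0, 1.0, 1.0]
core = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

print(extract_flat_clusters(order, reach, core, eps_star=2.0))
# {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(extract_flat_clusters(order, reach, core, eps_star=9.0))
# {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
```

The same ordering yields two clusters at ε* = 2.0 and one cluster at ε* = 9.0, which is the sense in which one OPTICS run covers a range of DBSCAN parameter settings.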

Outline for Part II

1 Density-based Clustering
Recap - DBSCAN
Extended Terminologies
Algorithm

2 Experiments: OPTICS
Reachability Plots
Sensitivity Analysis
Conclusion


OPTICS - Visualization


Reachability Plots
Represents the density-based clustering structure
Easy to analyze
Independent of the dimension of the data
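
A reachability plot is a bar chart of reachability distances in cluster-order, so it can be approximated in plain text. The sketch below is a hypothetical helper, not part of the slides: UNDEFINED values (modelled as infinity) are drawn at full height, and the input ordering is made up.

```python
import math

def reachability_plot(order, reach, width=16):
    """Render one text bar per point, in cluster-order; valleys indicate clusters."""
    finite = [reach[o] for o in order if reach[o] != math.inf]
    top = max(finite) if finite and max(finite) > 0 else 1.0
    lines = []
    for o in order:
        r = min(reach[o], top)  # UNDEFINED (inf) is clipped to full height
        lines.append(f"{o:>3} " + "#" * round(width * r / top))
    return "\n".join(lines)

# Hypothetical cluster-ordering for six points (two groups with a jump of 8.0).
order = [0, 1, 2, 3, 4, 5]
reach = [math.inf, 1.0, 1.0, 8.0, 1.0, 1.0]
print(reachability_plot(order, reach))
```

Each run of short bars between two tall bars is a valley, i.e. a group of points that are mutually close. Because only the ordering and the distances are plotted, the picture looks the same whatever the dimension of the data.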


Testing parameter sensitivity

Relatively insensitive to change in parameters.


Good results when parameters are just large enough.


Example
Neighboring objects stay close to each other in a linear sequence.


Good for clusters with varying density (as promised)


Bad for clusters with uniform density


Slower than DBSCAN. DENCLUE can be a solution.


Thanks for listening.


Questions?
