
Clustering

• Clustering is the process of grouping a set of abstract objects into classes of similar objects.
• A cluster of data objects can be treated as one group.
Clustering Methods
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
• Clustering high-dimensional data
• Constraint-based clustering
Partitioning Method
• Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’ partitions of the data.
• Each partition represents a cluster, and k ≤ n. That is, the data are classified into k groups that satisfy the following requirements −
– Each group contains at least one object.
– Each object must belong to exactly one group.
Hierarchical Methods
• This method creates a hierarchical
decomposition of the given set of data objects.
• We can classify hierarchical methods on the
basis of how the hierarchical decomposition is
formed.
• There are two approaches:
– Agglomerative Approach
– Divisive Approach
Agglomerative Approach
• This approach is also known as the bottom-up
approach.
• We start with each object forming a separate group.
• It keeps on merging the objects or groups that are close to one another.
• It keeps doing so until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
• This approach is also known as the top-down
approach.
• We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters.
• This continues until each object is in its own cluster or until the termination condition holds.
• This method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Density-based Method
• This method is based on the notion of density.
• The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Grid-based Method
• In this method, the objects together form a grid.
• The object space is quantized into a finite number of cells that form a grid structure.
Model-based methods
• In this method, a model is hypothesized for each
cluster to find the best fit of data for a given
model.
• This method locates the clusters by clustering the
density function.
• It reflects spatial distribution of the data points.
• This method also provides a way to automatically
determine the number of clusters based on
standard statistics, taking outlier or noise into
account.
• It therefore yields robust clustering methods.
Constraint-based Method
• In this method, the clustering is performed by
the incorporation of user or application-
oriented constraints.
• A constraint refers to the user expectation or
the properties of desired clustering results.
• Constraints provide us with an interactive way
of communication with the clustering process.
• Constraints can be specified by the user or the
application requirement.
K-means Clustering Method:
• If k is given, the K-means algorithm can be executed in
the following steps:
• Partition the objects into k non-empty subsets.
• Identify the cluster centroids (mean points) of the current partition.
• Assign each point to a specific cluster: compute the distance from each point to each centroid and allot the point to the cluster whose centroid is nearest.
• After re-allotting the points, find the centroids of the newly formed clusters, and repeat the assignment step until the clusters no longer change.
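As a rough illustration of these steps, here is a minimal NumPy sketch of k-means; the function name, the random initialization, and the iteration cap are illustrative choices, not part of the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the cluster whose centroid is nearest (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points currently assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```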
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the medoids
by one of the non-medoids if it improves the total distance of the resulting
clustering
– PAM works effectively for small data sets, but does not scale well for large data
sets

• CLARA (Kaufmann & Rousseeuw, 1990): draws multiple random samples of the data set and applies PAM to each sample, which makes it more scalable than PAM for large data sets
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
Typical k-medoids algorithm (PAM)
[Figure: PAM illustrated on a small two-dimensional data set with K = 2.]
• Arbitrarily choose k objects as the initial medoids.
• Assign each remaining object to the nearest medoid (in the figure, Total Cost = 20).
• Randomly select a non-medoid object, O_random.
• Compute the total cost of swapping a current medoid with O_random (in the figure, the candidate swap yields Total Cost = 26).
• If the quality is improved, perform the swap.
• Repeat the loop until no change occurs.
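The swap loop sketched above can be written as a small brute-force routine. The sketch below assumes Euclidean distance and a cost equal to the sum of distances from each object to its nearest medoid; as the slides note for PAM, this is only practical for small data sets.

```python
import numpy as np
from itertools import product

def pam(X, k, seed=0):
    """Simplified PAM sketch: swap a medoid with a non-medoid whenever it lowers the total cost."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # arbitrary initial medoids

    def total_cost(meds):
        # Cost = sum over all objects of the distance to their nearest medoid.
        d = np.linalg.norm(X[:, None, :] - X[meds][None, :, :], axis=2)
        return d.min(axis=1).sum()

    improved = True
    while improved:                          # "repeat until no change"
        improved = False
        for m, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m] = o                 # try swapping medoid m with non-medoid o
            if total_cost(candidate) < total_cost(medoids):
                medoids = candidate          # keep the swap only if quality improves
                improved = True
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels
```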
Hierarchical Methods
• A hierarchical clustering method works by grouping
data objects into a tree of clusters.
• Hierarchical clustering methods can be further
classified as either agglomerative or divisive,
depending on whether the hierarchical decomposition is
formed in a bottom-up (merging) or top-down
(splitting) fashion.
• The quality of a pure hierarchical clustering method
suffers from its inability to perform adjustment once a
merge or split decision has been executed.
• That is, if a particular merge or split decision later turns
out to have been a poor choice, the method cannot
backtrack and correct it.
• Agglomerative hierarchical clustering:
• This bottom-up strategy starts by placing each
object in its own cluster and then merges these
atomic clusters into larger and larger clusters,
until all of the objects are in a single cluster or
until certain termination conditions are satisfied.
• Most hierarchical clustering methods belong to
this category.
• They differ only in their definition of intercluster
similarity.
• Divisive hierarchical clustering:
• This top-down strategy does the reverse of
agglomerative hierarchical clustering by starting
with all objects in one cluster.
• It subdivides the cluster into smaller and smaller
pieces, until each object forms a cluster on its
own or until it satisfies certain termination
conditions, such as a desired number of clusters is
obtained or the diameter of each cluster is within
a certain threshold.
• Consider applying AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}.
• Initially, AGNES places each object into a
cluster of its own.
• The clusters are then merged step-by-step
according to some criterion
• For example, clusters C1 and C2 may be merged if an
object in C1 and an object in C2 form the minimum
Euclidean distance between any two objects from
different clusters.
• This is a single-linkage approach in that each cluster is
represented by all of the objects in the cluster, and the
similarity between two clusters is measured by the
similarity of the closest pair of data points belonging to
different clusters.
• The cluster merging process repeats until all of the
objects are eventually merged to form one cluster
• In DIANA, all of the objects are used to form one
initial cluster.
• The cluster is split according to some principle,
such as the maximum Euclidean distance between
the closest neighboring objects in the cluster.
• The cluster splitting process repeats until,
eventually, each new cluster contains only a
single object.
• In either agglomerative or divisive hierarchical
clustering, the user can specify the desired
number of clusters as a termination condition.
• A tree structure called a dendrogram is
commonly used to represent the process of
hierarchical clustering.
• It shows how objects are grouped together step
by step
• At l = 1, objects a and b are grouped together
to form the first cluster, and they stay together
at all subsequent levels.
• We can also use a vertical axis to show the
similarity scale between clusters.
• For example, when the similarity of two
groups of objects, {a, b} and {c, d, e} is
roughly 0.16, they are merged together to form
a single cluster.
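A small example of how such a dendrogram can be produced in practice, here with SciPy's single-linkage clustering; the five 2-D coordinates standing in for objects a–e are made up purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Five illustrative 2-D points standing in for objects a, b, c, d, e.
points = np.array([[1.0, 1.0],   # a
                   [1.2, 1.1],   # b
                   [5.0, 5.0],   # c
                   [5.2, 5.3],   # d
                   [5.5, 4.8]])  # e

# Single linkage: clusters are merged on the minimum distance between their members,
# mirroring the AGNES example in the text.
Z = linkage(points, method="single", metric="euclidean")
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()
```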
• Distance Measures between clusters
• When an algorithm uses the minimum
distance, dmin(Ci, Cj), to measure the distance
between clusters, it is sometimes called a
nearest-neighbor clustering algorithm.
• Moreover, if the clustering process is
terminated when the distance between nearest
clusters exceeds an arbitrary threshold, it is
called a single-linkage algorithm
• When an algorithm uses the maximum
distance, dmax(Ci, Cj), to measure the distance
between clusters, it is sometimes called a
farthest-neighbor clustering algorithm.
• If the clustering process is terminated when the
maximum distance between nearest clusters
exceeds an arbitrary threshold, it is called a
complete-linkage algorithm.
• The use of mean or average distance is a
compromise between the minimum and maximum
distances and overcomes the outlier sensitivity
problem.
• Whereas the mean distance is the simplest to
compute, the average distance is advantageous in
that it can handle categoric as well as numeric
data.
• The computation of the mean vector for categoric
data can be difficult or impossible to define.
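For numeric data, the measures discussed above can be computed directly. A small NumPy sketch (function and key names are illustrative):

```python
import numpy as np

def inter_cluster_distances(Ci, Cj):
    """Ci, Cj: (n, d) arrays holding the points of two clusters."""
    pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        "d_min": pairwise.min(),    # nearest-neighbor / single-linkage distance
        "d_max": pairwise.max(),    # farthest-neighbor / complete-linkage distance
        "d_mean": np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)),  # distance between cluster means
        "d_avg": pairwise.mean(),   # average of all pairwise distances
    }
```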
Difficulties with Hierarchical Clustering
• Difficulties regarding the selection of merge or split
points.
• Merge or split decisions are critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.
• It will neither undo what was done previously nor
perform object swapping between clusters.
• Low-quality clusters are formed when wrong split or merge decisions are made.
• The method does not scale well, because each decision
to merge or split requires the examination and
evaluation of a good number of objects or clusters.
Density-based Clustering
• Density-based clustering works by detecting areas where the data points are concentrated and where they are separated by areas that are empty or sparse.
• Points that are not part of a cluster are labeled as noise.
• To discover clusters with arbitrary shape, density-based clustering
methods have been developed.
• Three most commonly used density based clustering algorithms are
listed as follows:
– DBSCAN - Grows clusters according to a density-based connectivity
analysis.
– OPTICS - Produce a cluster ordering obtained from a wide range of
parameter settings.
– DENCLUE - Clusters objects based on a set of density distribution
functions.
DBSCAN
• DBSCAN is a density based clustering
algorithm.
• The algorithm grows regions with sufficiently
high density into clusters and discovers
clusters of arbitrary shape in spatial databases
with noise.
• It defines a cluster as a maximal set of density-
connected points
• The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.
• If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
• Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.
• An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i < n, pi ∈ D.
• An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.
• Density reachability is the transitive closure of
direct density reachability, and this relationship is
asymmetric. Only core objects are mutually
density reachable. Density connectivity, however,
is a symmetric relation.
• Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood containing at least three points.
• q is directly density-reachable from m. m is directly
density-reachable from p and vice versa.
• q is (indirectly) density-reachable from p because q is
directly density-reachable from m and m is directly density-
reachable from p. However, p is not density-reachable from
q because q is not a core object. Similarly, r and s are
density-reachable from o, and o is density-reachable from r.
• o, r, and s are all density-connected.
• A density-based cluster is a set of density-connected objects
that is maximal with respect to density-reachability. Every
object not contained in any cluster is considered to be noise
DBSCAN: Finding Clusters
• DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database.
• If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merging of a few density-reachable clusters.
• The process terminates when no new point can be
added to any cluster.
• If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects.
• Otherwise, it is O(n²). With appropriate settings of the user-defined parameters ε and MinPts, the algorithm is effective at finding arbitrary-shaped clusters.
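For reference, a typical way to run DBSCAN in practice is via scikit-learn, where `eps` and `min_samples` play the roles of ε and MinPts; the data below are random and purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)       # illustrative 2-D data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)      # eps ~ ε, min_samples ~ MinPts
labels = db.labels_                             # cluster ids; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", np.sum(labels == -1), "noise points")
```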
• Advantages
• Does not require a-priori specification of number
of clusters.
• Able to identify noise data while clustering.
• The DBSCAN algorithm is able to find arbitrarily sized and arbitrarily shaped clusters.
• Disadvantages
• The DBSCAN algorithm fails when clusters have widely varying densities.
• It also fails on neck-type data sets, in which clusters are connected by a thin bridge of points.
Review of Concepts
• Is an object o in a cluster or an outlier?
– Is o a core object?
– Is o density-reachable by some core object (directly, or indirectly through a chain)?
• Are objects p and q in the same cluster?
– Are p and q density-connected?
– Are p and q density-reachable by some object o?
OPTICS: Ordering Points to
Identify the Clustering Structure
• Ordering points to identify the clustering structure
(OPTICS) is an algorithm for finding density-
based clusters in spatial data.
• Its basic idea is similar to DBSCAN, but it
addresses one of DBSCAN's major weaknesses:
the problem of detecting meaningful clusters in
data of varying density.
• To do so, the points of the database are (linearly)
ordered such that spatially closest points become
neighbors in the ordering.
• Additionally, a special distance is stored for
each point that represents the density that must
be accepted for a cluster so that both points
belong to the same cluster.
OPTICS (cont'd)
• Produces a special order of the database with respect to its density-based clustering structure
• This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
• Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
• Can be represented graphically or using visualization
techniques
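A minimal usage sketch with scikit-learn's OPTICS implementation; the attributes used below come from that library, and the data are illustrative.

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(0).rand(300, 2)   # illustrative 2-D data
opt = OPTICS(min_samples=5).fit(X)           # max_eps defaults to infinity
order = opt.ordering_                        # the cluster ordering of the points
reach = opt.reachability_[order]             # reachability distance stored for each point
# Valleys in `reach` correspond to dense clusters; plotting it gives the reachability plot.
```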
Density-Based Hierarchical Clustering
• Observation: dense clusters are completely contained by less dense clusters
• Idea: process the objects in the “right” order and keep track of the point density in their neighborhood
[Figure: nested clusters C1 and C2 contained in less dense clusters, illustrated with MinPts = 3]
Core- and Reachability Distance
• Parameters: “generating” distance ε, fixed value MinPts
• core-distance_ε,MinPts(o): the smallest distance such that o is a core object (if that distance is ≤ ε; “?” otherwise)
• reachability-distance_ε,MinPts(p, o): the smallest distance such that p is directly density-reachable from o (if that distance is ≤ ε; “?” otherwise)
[Figure: an example with MinPts = 5 showing points p, q, o together with core-distance(o), reachability-distance(p, o), and reachability-distance(q, o)]
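The two definitions translate almost directly into code. In this sketch the undefined value “?” is represented by None, Euclidean distance is assumed, and an object's own zero distance is counted in its neighborhood; p and o are taken to be rows of the data array X.

```python
import numpy as np

def core_distance(o, X, eps, min_pts):
    """Smallest distance such that o is a core object, or None ('?') if o is not core within eps."""
    d = np.sort(np.linalg.norm(X - o, axis=1))   # distances from o to all objects (o itself included)
    return d[min_pts - 1] if d[min_pts - 1] <= eps else None

def reachability_distance(p, o, X, eps, min_pts):
    """Smallest distance such that p is directly density-reachable from o, or None ('?')."""
    cd = core_distance(o, X, eps, min_pts)
    if cd is None:
        return None                               # o is not a core object within eps
    return max(cd, np.linalg.norm(p - o))         # at least the core-distance, else the actual distance
```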
DENCLUE: using density functions

• DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
• Major features
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
– Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
– But needs a large number of parameters
Denclue: Technical Essence
• Model density by the notion of influence
• Each data object exerts influence on its neighborhood.
• The influence decreases with distance
• Example:
– Consider each object as a radio: the closer you are to the object, the louder the sound
• Key: influence is represented by a mathematical function
Denclue: Technical Essence
• Influence functions (influence of y on x; σ is a user-given constant):
– Square: f_square^y(x) = 0 if dist(x, y) > σ, and 1 otherwise
– Gaussian: f_Gaussian^y(x) = e^{−d(x, y)² / (2σ²)}
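The two influence functions written out for concreteness; `sigma` is the user-given constant σ, and the function names are illustrative.

```python
import numpy as np

def square_influence(x, y, sigma):
    """f_square^y(x): 1 if y lies within distance sigma of x, else 0."""
    return 1.0 if np.linalg.norm(x - y) <= sigma else 0.0

def gaussian_influence(x, y, sigma):
    """f_Gaussian^y(x) = exp(-d(x, y)^2 / (2 * sigma^2))."""
    d = np.linalg.norm(x - y)
    return np.exp(-(d ** 2) / (2 * sigma ** 2))
```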
Density Function

• The density function is defined as the sum of the influence functions of all data points:
f_Gaussian^D(x) = Σ_{i=1..N} e^{−d(x, x_i)² / (2σ²)}
Gradient: The steepness of a slope

• Example (Gaussian influence, density, and gradient):
f_Gaussian(x, y) = e^{−d(x, y)² / (2σ²)}
f_Gaussian^D(x) = Σ_{i=1..N} e^{−d(x, x_i)² / (2σ²)}
∇f_Gaussian^D(x, x_i) = Σ_{i=1..N} (x_i − x) · e^{−d(x, x_i)² / (2σ²)}
Denclue: Technical Essence
• Clusters can be determined mathematically by
identifying density attractors.
• Density attractors are local maxima of the overall density function.
Density Attractor
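One way to locate the density attractor of a point is a simple gradient hill-climb toward higher density; the step size `delta` and the stopping tolerances below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def find_density_attractor(x, data, sigma, delta=0.05, n_steps=200):
    """Hill-climb along the Gaussian density gradient toward a density attractor."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_steps):
        d2 = np.sum((data - x) ** 2, axis=1)
        weights = np.exp(-d2 / (2 * sigma ** 2))
        grad = np.sum((data - x) * weights[:, None], axis=0)   # gradient of the density at x
        norm = np.linalg.norm(grad)
        if norm < 1e-9:
            break                        # (numerically) at a local maximum of the density
        x = x + delta * grad / norm      # take a fixed-size step uphill
    return x                             # approximate density attractor for the starting point
```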
Features of DENCLUE
• Major features
– Solid mathematical foundation
• Compact definition for density and cluster
• Flexible for both center-defined clusters and arbitrary-shape
clusters
– But needs a large number of parameters
• σ: parameter to calculate density
• ξ: density threshold
• a parameter used in locating the density attractors
