
Lecture 10, 11 – Clustering

Anaum Hamid

https://sites.google.com/site/anaumhamid/data-mining/lectures
Gentle Reminder

“Switch Off” your Mobile Phone or switch your
Mobile Phone to “Silent Mode”
Contents

1. What is Cluster Analysis?

2. k-Means Clustering

3. Hierarchical Clustering

4. Density-Based Clustering

5. Grid-Based Clustering


What is Cluster Analysis?
v  Cluster: A collection of data objects
§  similar (or related) to one another within the same group
§  dissimilar (or unrelated) to the objects in other groups
v  Cluster analysis (or clustering, data segmentation, …)
§  Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
v  Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
v  Typical applications
§  As a stand-alone tool to get insight into data distribution
§  As a preprocessing step for other algorithms

What is Cluster Analysis?

v Illustrative figures: the same objects can be grouped in different
ways, e.g., by family, by age, or by gender
Applications of Cluster Analysis

v Data reduction
§  Summarization: Preprocessing for regression, PCA,
classification, and association analysis
§  Compression: Image processing: vector quantization
v Hypothesis generation and testing
v Prediction based on groups
§  Cluster & find characteristics/patterns for each group
v Finding K-nearest Neighbors
§  Localizing search to one or a small number of clusters
v Outlier detection: Outliers are often viewed as those “far
away” from any cluster

Clustering: Application Examples
v Biology: taxonomy of living things: kingdom, phylum,
class, order, family, genus and species
v Information retrieval: document clustering
v Land use: Identification of areas of similar land use in
an earth observation database
v Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
v City-planning: Identifying groups of houses according
to their house type, value, and geographical location
v Earthquake studies: Observed earthquake
epicenters should be clustered along continent faults
v Climate: understanding Earth's climate, finding patterns in
atmospheric and ocean data
v Economic Science: market research
Earthquakes (figure)

Image Segmentation (figure)
Basic Steps to Develop a Clustering
Task
v  Feature selection
§  Select info concerning the task of interest
§  Minimal information redundancy
v  Proximity measure
§  Similarity of two feature vectors
v  Clustering criterion
§  Expressed via a cost function or some rules
v  Clustering algorithms
§  Choice of algorithms
v  Validation of the results
§  Validation test (also, clustering tendency test)
v  Interpretation of the results
§  Integration with applications
Quality: What Is Good Clustering?

v  A good clustering method will produce high quality clusters


§  high intra-class similarity: cohesive within clusters
§  low inter-class similarity: distinctive between clusters

v  The quality of a clustering method depends on


§  the similarity measure used by the method
§  its implementation, and
§  its ability to discover some or all of the hidden patterns

(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized)
Measure the Quality of
Clustering
v  Dissimilarity/Similarity metric
§  Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
§  The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
§  Weights should be associated with different variables
based on applications and data semantics
v  Quality of clustering:
§  There is usually a separate “quality” function that
measures the “goodness” of a cluster.
§  It is hard to define “similar enough” or “good enough”
•  The answer is typically highly subjective
Data Structures
v Clustering algorithms typically operate on either
of the following two data structures:
v Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

v Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Data Matrix
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

v This represents n objects, such as persons, with
p variables (measurements or attributes), such
as age, height, weight, gender, and so on.
v  The structure is in the form of a relational table,
or n-by-p matrix (n objects × p variables)
Dissimilarity matrix
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

v  It is often represented by an n-by-n matrix where d(i, j) is the
measured difference or dissimilarity between objects i and j.
v  In general, d(i, j) is a nonnegative number that is
§  close to 0 when objects i and j are highly similar or “near” each
other
§  becomes larger the more they differ
v  Where d(i, j) = d(j, i), and d(i, i) = 0
Type of data in clustering
analysis
v Interval-scaled variables
v Binary variables
v Nominal, ordinal, and ratio variables
v Variables of mixed types
Interval-valued variables

v Standardize data
§  Calculate the mean absolute deviation:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

§  Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

v Using the mean absolute deviation is more robust
than using the standard deviation
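A minimal NumPy sketch (illustrative, not from the slides) of the standardization above for one interval-scaled variable; the column of ages is made-up example data.

```python
import numpy as np

def mad_standardize(x):
    """Z-score using the mean absolute deviation instead of the standard deviation."""
    x = np.asarray(x, dtype=float)
    m_f = x.mean()                    # m_f: mean of the variable
    s_f = np.abs(x - m_f).mean()      # s_f: mean absolute deviation
    return (x - m_f) / s_f            # z_if = (x_if - m_f) / s_f

ages = [25, 30, 35, 60]               # hypothetical interval-scaled column
print(mad_standardize(ages))
```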
Similarity and Dissimilarity
Between Objects
v Distances are normally used to measure the
similarity or dissimilarity between two data
objects

v Some popular ones include:


§  Minkowski distance
§  Manhattan distance
§  Euclidean distance
Minkowski and Manhattan
Distance
v Minkowski distance

$$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$

§  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
two p-dimensional data objects, and q is a positive
integer

v If q = 1, d is the Manhattan distance

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

(Figure: for xi = (1, 7) and xj = (7, 1), the Manhattan distance is 6 + 6 = 12)
Euclidean Distance
v If q = 2, d is the Euclidean distance:

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

(Figure: for Xi = (1, 7) and Xj = (7, 1), the Euclidean distance is ≈ 8.48)

§  Properties
•  d(i,j) ≥ 0
•  d(i,i) = 0
•  d(i,j) = d(j,i)
•  d(i,j) ≤ d(i,k) + d(k,j)  (triangle inequality)

v One can also use a weighted distance, or other
dissimilarity measures.
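As a quick check of the figures above, a small NumPy sketch (illustrative, not part of the slides) computing the Minkowski distance of order q for the points (1, 7) and (7, 1); q = 1 gives the Manhattan distance and q = 2 the Euclidean distance.

```python
import numpy as np

def minkowski(x, y, q):
    """General Minkowski distance of order q (q=1: Manhattan, q=2: Euclidean)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, q=1))   # Manhattan: 6 + 6 = 12.0
print(minkowski(xi, xj, q=2))   # Euclidean: sqrt(72) ≈ 8.49
```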
Binary Variables
v A contingency table for binary data (object i vs. object j):

                    Object j
                 1      0     sum
  Object i   1   a      b     a+b
             0   c      d     c+d
           sum  a+c    b+d     p

v Simple matching coefficient (invariant, if the binary
variable is symmetric):

$$d(i, j) = \frac{b + c}{a + b + c + d}$$

v Jaccard coefficient (noninvariant if the binary
variable is asymmetric):

$$d(i, j) = \frac{b + c}{a + b + c}$$
Dissimilarity between Binary
Variables
v  Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

§  gender is a symmetric attribute


§  the remaining attributes are asymmetric binary
§  let the values Y and P be set to 1, and the value N be set to 0

$$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
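A small Python sketch (illustrative only) reproducing the Jack/Mary/Jim values above with the asymmetric-binary (Jaccard-style) dissimilarity, coding Y/P as 1 and N as 0 over fever, cough, and tests 1–4.

```python
def asymmetric_binary_dissim(x, y):
    """Jaccard-style dissimilarity: d = (b + c) / (a + b + c), ignoring 0-0 matches."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# attribute order: fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissim(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissim(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissim(jim, mary), 2))   # 0.75
```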
Nominal Variables

v A generalization of the binary variable in that it


can take more than 2 states, e.g., red, yellow,
blue, green
v Method 1: Simple matching
§  m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$

v Method 2: use a large number of binary variables


§  creating a new binary variable for each of the M
nominal states
Ordinal Variables

v An ordinal variable can be discrete or continuous


v Order is important, e.g., rank
v Can be treated like interval-scaled variables
§  replace xif by its rank rif ∈ {1, …, Mf}
§  map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

§  compute the dissimilarity using
methods for interval-scaled variables
Ratio-Scaled Variables

v Ratio-scaled variable: a positive measurement on a


nonlinear scale, approximately at an exponential scale,
such as $Ae^{Bt}$ or $Ae^{-Bt}$
v Methods:
§  treat them like interval-scaled variables— not a good
choice! (why?—the scale can be distorted)
§  apply logarithmic transformation
yif = log(xif)
§  treat them as continuous ordinal data and treat their
ranks as interval-scaled
Variables of Mixed Types

v A database may contain all the six types of variables


§  symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio
v One may use a weighted formula to combine their effects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

§  f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
§  f is interval-based: use the normalized distance
§  f is ordinal or ratio-scaled
•  compute the rank rif, set zif = (rif − 1)/(Mf − 1),
•  and treat zif as interval-scaled
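A compact Python sketch (illustrative only, with made-up records) of the weighted mixed-type formula above, covering one nominal, one asymmetric-binary, and one interval-scaled attribute; the attribute names and the normalization range are assumptions for this example.

```python
def mixed_dissim(x, y, types, ranges=None):
    """Weighted mixed-type dissimilarity: sum(delta_f * d_f) / sum(delta_f)."""
    num, den = 0.0, 0.0
    for f, (xf, yf, t) in enumerate(zip(x, y, types)):
        # delta_ij(f) = 0: skip meaningless comparisons (asymmetric binary 0-0 match)
        if t == "asym_binary" and xf == 0 and yf == 0:
            continue
        if t in ("nominal", "asym_binary"):
            d = 0.0 if xf == yf else 1.0       # d_f = 0 if values match, else 1
        elif t == "interval":
            lo, hi = ranges[f]                 # normalize by the attribute's range
            d = abs(xf - yf) / (hi - lo)
        num += d
        den += 1.0
    return num / den if den else 0.0

types = ["nominal", "asym_binary", "interval"]   # e.g., color, test result, age
ranges = {2: (0, 100)}                           # assumed value range of the age attribute
print(mixed_dissim(["red", 1, 30], ["blue", 0, 45], types, ranges))   # ≈ 0.72
```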
Considerations for Cluster Analysis

v Partitioning criteria
§  Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
v Separation of clusters
§  Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than
one class)
v Similarity measure
§  Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
v Clustering space
§  Full space (often when low dimensional) vs. subspaces (often
in high-dimensional clustering)
Requirements and Challenges
v Scalability
§  Clustering all the data instead of only on samples
v Ability to deal with different types of attributes
§  Numerical, binary, categorical, ordinal, linked, and mixture of
these
v Constraint-based clustering
§  User may give inputs on constraints
§  Use domain knowledge to determine input parameters
v Interpretability and usability
v Others
§  Discovery of clusters with arbitrary shape
§  Ability to deal with noisy data
§  Incremental clustering and insensitivity to input order
§  High dimensionality
Major Clustering Approaches 1
v Partitioning approach:
§  Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
errors
§  Typical methods: k-means, k-medoids, CLARANS
v Hierarchical approach:
§  Create a hierarchical decomposition of the set of data
(or objects) using some criterion
§  Typical methods: Diana, Agnes, BIRCH, CHAMELEON
v Density-based approach:
§  Based on connectivity and density functions
§  Typical methods: DBSCAN, OPTICS, DENCLUE
v Grid-based approach:
§  based on a multiple-level granularity structure
§  Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches 2
v Model-based:
§  A model is hypothesized for each of the clusters, and the goal is
to find the best fit of the data to the hypothesized models
§  Typical methods: EM, SOM, COBWEB
v Frequent pattern-based:
§  Based on the analysis of frequent patterns
§  Typical methods: p-Cluster
v User-guided or constraint-based:
§  Clustering by considering user-specified or application-specific
constraints
§  Typical methods: COD (obstacles), constrained clustering
v Link-based clustering:
§  Objects are often linked together in various ways
§  Massive links can be used to cluster objects: SimRank, LinkClus
Partitioning Algorithms: Basic
Concept
v Partitioning method: Partitioning a database D of n
objects into a set of k clusters, such that the sum of
squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left(d(p, c_i)\right)^2$$
v Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
§  Global optimal: exhaustively enumerate all partitions
§  Heuristic methods: k-means and k-medoids algorithms
§  k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
§  k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method

v Given k, the k-means algorithm is implemented in four


steps:
1.  Partition objects into k nonempty subsets
2.  Compute seed points as the centroids of the clusters
of the current partitioning (the centroid is the center,
i.e., mean point, of the cluster)
3.  Assign each object to the cluster with the nearest
seed point
4.  Go back to Step 2, stop when the assignment does
not change
An Example of K-Means
Clustering (K = 2)

(Figure: the initial data set is arbitrarily partitioned into k groups, the cluster
centroids are updated, and objects are reassigned to their nearest centroid,
looping until membership no longer changes)

n  Partition objects into k nonempty subsets
n  Repeat
n  Compute the centroid (i.e., mean point) of each partition
n  Assign each object to the cluster of its nearest centroid
n  Until no change
k-Means Algorithms steps
1.  Select the desired number of clusters k
2.  Initialize the k cluster centers (centroids) at random
3.  Assign each object to the nearest cluster. The proximity of
two objects is determined by their distance; the distance
used in the k-means algorithm here is the Euclidean distance (d)

$$d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where x = (x1, x2, …, xn) and y = (y1, y2, …, yn), and n is the number
of attributes (columns) of the two records
4.  Recalculate cluster centers with the current cluster membership.
Cluster center is the average (mean) of all the data, or objects in
a particular cluster
5.  Reassign each object using the new cluster centers. If the
cluster centers no longer change, the clustering process is
complete. Otherwise, go back to step 3 and repeat until the
cluster centers stop changing (are stable) or there is no
significant decrease in the SSE (Sum of Squared Errors)
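A minimal NumPy sketch of the five steps above (an illustration, not the slides' own code): random initial centroids, Euclidean assignment, and mean updates until the centroids stop moving. The data set X is the eight points A–H used in the exercise that follows.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means: random initial centroids, Euclidean assignment, mean update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 2: random centroids
    for _ in range(max_iter):
        # step 3: assign each object to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its members
        # (sketch only: assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # step 5: centroids stable
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()                   # Sum of Squared Errors
    return labels, centroids, sse

X = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]  # points A–H
labels, centroids, sse = kmeans(X, k=2)
print(labels, centroids, sse)
```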
Exercise – Iteration 1

[STEP 1] Determine the number of clusters k = 2


[STEP 2] Choose the initial centroids at random, e.g., from the data
points: m1 = (1,1), m2 = (2,1)
[STEP 3] Assign each object to the closest cluster centroid based on the
smallest distance (the distances tabulated here are Manhattan distances).
Instances   x   y   m1 (1,1)   m2 (2,1)   Cluster Membership
A 1 3 2 3 m1
B 3 3 4 3 m2
C 4 3 5 4 m2
D 5 3 6 5 m2
E 1 2 1 2 m1
F 4 2 4 3 m2
G 1 1 0 1 m1
H 2 1 1 0 m2
Exercise – Iteration 1

[STEP 4] The resulting members are cluster1 = {A, E, G} and
cluster2 = {B, C, D, F, H}. The SSE (Sum of Squared Errors) and
MSE (Mean Squared Error) values are:

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1)² + (3-1)²] + [(1-1)² + (2-1)²] + [(1-1)² + (1-1)²]
    + [(3-2)² + (3-1)²] + [(4-2)² + (3-1)²] + [(5-2)² + (3-1)²]
    + [(4-2)² + (2-1)²] + [(2-2)² + (1-1)²]
SSE = 4 + 1 + 0 + 5 + 8 + 13 + 5 + 0 = 36
MSE = 36/8 = 4.5

[STEP 5] The new centroids are
cluster1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2) and
cluster2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)
Exercise – Iteration 2

[STEP 3] Assign each object to the closest cluster centroid based on the
smallest distance.

Instances   x   y   m1 (1,2)   m2 (3.6,2.4)   Cluster Membership
A 1 3 1 3.2 m1
B 3 3 3 1.2 m2
C 4 3 4 1 m2
D 5 3 5 2 m2
E 1 2 0 3 m1
F 4 2 3 0.8 m2
G 1 1 1 4 m1
H 2 1 2 3 m1
Exercise – Iteration 2

[STEP 4] The resulting members are cluster1 = {A, E, G, H} and
cluster2 = {B, C, D, F}. The new SSE and MSE values are:

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1)² + (3-2)²] + [(1-1)² + (2-2)²] + [(1-1)² + (1-2)²] + [(2-1)² + (1-2)²]
    + [(3-3.6)² + (3-2.4)²] + [(4-3.6)² + (3-2.4)²] + [(5-3.6)² + (3-2.4)²]
    + [(4-3.6)² + (2-2.4)²]
SSE = 1 + 0 + 1 + 2 + 0.72 + 0.52 + 2.32 + 0.32 = 7.88
MSE = 7.88/8 = 0.985

[STEP 5] The new centroid for


cluster1 = [(1 + 1 + 1+2)/4, (3 + 2 + 1+1)/4] = (1.25,1.75) and
cluster2 = [(3 + 4 + 5 + 4 )/4, (3 + 3 + 3 + 2 )/4] = (4, 2.75)
Exercise – Iteration 3

[STEP 3] Assign each object to the closest cluster centroid based on the
smallest distance.

Instances   x   y   m1 (1.25,1.75)   m2 (4,2.75)   Cluster Membership
A 1 3 1.5 3.25 m1
B 3 3 3 1.25 m2
C 4 3 4 0.25 m2
D 5 3 5 1.25 m2
E 1 2 0.5 3.75 m1
F 4 2 3 0.75 m2
G 1 1 1 4.75 m1
H 2 1 1.5 3.75 m1
Exercise – Iteration 3

[STEP 4] The resulting members are cluster1 = {A, E, G, H} and
cluster2 = {B, C, D, F}. The new SSE and MSE values are:

$$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1.25)² + (3-1.75)²] + [(1-1.25)² + (2-1.75)²] + [(1-1.25)² + (1-1.75)²]
    + [(2-1.25)² + (1-1.75)²] + [(3-4)² + (3-2.75)²] + [(4-4)² + (3-2.75)²]
    + [(5-4)² + (3-2.75)²] + [(4-4)² + (2-2.75)²]
SSE = 1.625 + 0.125 + 0.625 + 1.125 + 1.0625 + 0.0625 + 1.0625 + 0.5625 = 6.25
MSE = 6.25/8 = 0.78125

[STEP 5] The new centroid for


cluster1 = [(1 + 1 + 1+2)/4, (3 + 2 + 1+1)/4] = (1.25,1.75) and
cluster2 = [(3 + 4 + 5 + 4 )/4, (3 + 3 + 3 + 2 )/4] = (4, 2.75)
Final Result

v As can be seen in the table, cluster membership no
longer changes in any cluster, so the algorithm stops
§  The final results are:
§  cluster1 = {A, E, G, H}, and
§  cluster2 = {B, C, D, F}
§  With SSE value = 6.25 and MSE value = 0.78125
§  Number of iterations: 3
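The short NumPy check below (an illustrative sketch, not from the slides) recomputes the final SSE and MSE. Note that the assignment tables in the exercise list Manhattan distances to the centroids, while the SSE/MSE values are computed from squared Euclidean distances.

```python
import numpy as np

points = {"A": (1, 3), "B": (3, 3), "C": (4, 3), "D": (5, 3),
          "E": (1, 2), "F": (4, 2), "G": (1, 1), "H": (2, 1)}
clusters = [["A", "E", "G", "H"], ["B", "C", "D", "F"]]   # final membership

sse = 0.0
for members in clusters:
    pts = np.array([points[name] for name in members], dtype=float)
    centroid = pts.mean(axis=0)                # (1.25, 1.75) and (4.0, 2.75)
    sse += ((pts - centroid) ** 2).sum()       # squared Euclidean distance to the centroid

print(sse, sse / len(points))                  # 6.25 0.78125
```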

Comments on the K-Means Method

v  Strength:
§  Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
§  Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
v  Comment: Often terminates at a local optimal
v  Weakness
§  Applicable only to objects in a continuous n-dimensional space
•  Using the k-modes method for categorical data
•  In comparison, k-medoids can be applied to a wide range of data
§  Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al., 2009,
and the sketch after this list)
§  Sensitive to noisy data and outliers
§  Not suitable to discover clusters with non-convex shapes
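As flagged in the weakness list above, k must be chosen in advance. A common heuristic (not covered in the slides; this is an illustrative scikit-learn sketch on the exercise's eight points) is the elbow method: run k-means for several values of k and pick the point where the SSE (inertia) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([(1, 3), (3, 3), (4, 3), (5, 3),
              (1, 2), (4, 2), (1, 1), (2, 1)], dtype=float)

# SSE (inertia) for k = 1..6; look for the "elbow" where the curve flattens
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```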

Variations of the K-Means Method

v Most variants of the k-means method differ in


§  Selection of the initial k means
§  Dissimilarity calculations
§  Strategies to calculate cluster means

v Handling categorical data: k-modes


§  Replacing means of clusters with modes
§  Using new dissimilarity measures to deal with
categorical objects
§  Using a frequency-based method to update modes
of clusters
§  A mixture of categorical and numerical data: k-
prototype method
What Is the Problem of the K-Means
Method?
v The k-means algorithm is sensitive to outliers!
§  Since an object with an extremely large value may
substantially distort the distribution of the data
v K-Medoids:
§  Instead of taking the mean value of the object in a cluster as
a reference point, medoids can be used, which is the most
centrally located object in a cluster
(Figure: two example scatter plots illustrating how a single extreme value can
pull a mean-based cluster center away from the bulk of the data, which
motivates using a medoid instead)
PAM: A Typical K-Medoids
Algorithm

(Figure, K = 2 on ten objects: arbitrarily choose k objects as initial medoids and
assign each remaining object to its nearest medoid; then repeatedly select a random
non-medoid object Orandom, compute the total cost of swapping a medoid with
Orandom, and perform the swap if it improves the quality; loop until no change. The
two configurations shown have total costs 20 and 26.)
The K-Medoid Clustering Method
v K-Medoids Clustering: Find representative objects
(medoids) in clusters
v PAM (Partitioning Around Medoids, Kaufmann &
Rousseeuw 1987)
§  Starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it
improves the total distance of the resulting clustering
§  PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
v Efficiency improvement on PAM
§  CLARA (Kaufmann & Rousseeuw, 1990): PAM on
samples
§  CLARANS (Ng & Han, 1994): Randomized re-sampling
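A simplified k-medoids sketch (illustrative only; this is the alternating/Voronoi-style variant, not full PAM swapping): assign every object to its nearest medoid, then re-pick each cluster's medoid as the member with the smallest total distance to the other members.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Alternating k-medoids (not full PAM): nearest-medoid assignment + medoid update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distance matrix
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)                   # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]                  # sketch: assumed nonempty
            # new medoid = member minimizing total distance to the rest of the cluster
            new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if set(new_medoids) == set(medoids):                    # stop when medoids stable
            break
        medoids = new_medoids
    return labels, medoids

X = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1), (25, 25)]
print(k_medoids(X, k=2))    # the cluster centers stay on actual data objects
```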

Hierarchical Clustering
Dendrogram: Shows How Clusters
are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected
component then forms a cluster

(Figure: dendrogram over six objects, with merge heights from 0 to 0.25 on the
vertical axis)
Distance between Clusters

v Single link: smallest distance between an element in one


cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
v Complete link: largest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
v Average: avg distance between an element in one cluster and
an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
v Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
v Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = dist(Mi, Mj)
§  Medoid: a chosen, centrally located object in the cluster
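An illustrative SciPy sketch (not from the slides) of agglomerative clustering on the exercise's points; the method argument selects the inter-cluster distance from the list above ("single", "complete", "average", "centroid").

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([(1, 3), (3, 3), (4, 3), (5, 3),
              (1, 2), (4, 2), (1, 1), (2, 1)], dtype=float)

Z = linkage(X, method="single")                    # single link; try "complete", "average", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)
```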
Density-Based Clustering Methods

v Clustering based on density (local cluster criterion), such


as density-connected points
v Major features:
§  Discover clusters of arbitrary shape
§  Handle noise
§  One scan
§  Need density parameters as termination condition
v Several interesting studies:
§  DBSCAN: Ester, et al. (KDD’96)
§  OPTICS: Ankerst, et al (SIGMOD’99).
§  DENCLUE: Hinneburg & D. Keim (KDD’98)
§  CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-
based)
Density-Based Clustering: Basic
Concepts
v Two parameters:
§  Eps: Maximum radius of the neighbourhood
§  MinPts: Minimum number of points in an Eps-
neighbourhood of that point
v NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
v Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
§  p belongs to NEps(q)
§  core point condition: |NEps(q)| ≥ MinPts

(Figure: point p inside the Eps-neighbourhood of q, with Eps = 1 cm and MinPts = 5)
Density-Reachable and Density-
Connected
v Density-reachable:
§  A point p is density-reachable
from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …,
pn, p1 = q, pn = p such that pi+1 is
directly density-reachable from pi

v Density-connected:
§  A point p is density-connected to
a point q w.r.t. Eps, MinPts if
there is a point o such that both p
and q are density-reachable from
o w.r.t. Eps and MinPts
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
v Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
v Discovers clusters of arbitrary shape in spatial
databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5)
DBSCAN: The Algorithm

1.  Arbitrarily select a point p


2.  Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
3.  If p is a core point, a cluster is formed
4.  If p is a border point, no points are density-reachable from
p and DBSCAN visits the next point of the database
5.  Continue the process until all of the points have been
processed

If a spatial index is used, the computational complexity of DBSCAN is


O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²)
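An illustrative scikit-learn sketch (not part of the slides) of DBSCAN on a toy data set; eps and min_samples play the roles of Eps and MinPts above, and the specific values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 3), (3, 3), (4, 3), (5, 3),
              (1, 2), (4, 2), (1, 1), (2, 1), (25, 25)], dtype=float)  # (25, 25) is isolated

db = DBSCAN(eps=1.5, min_samples=3).fit(X)   # eps ~ Eps, min_samples ~ MinPts
print(db.labels_)                            # cluster ids; the label -1 marks noise points
```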

DBSCAN: Sensitive to
Parameters

http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
Density-Based Clustering Demo
http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/
Grid-Based Clustering Method

v Using multi-resolution grid data structure


v Several interesting methods
§  STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
§  WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
•  A multi-resolution clustering approach using
wavelet method
§  CLIQUE: Agrawal, et al. (SIGMOD’98)
•  Both grid-based and subspace clustering
Measuring Clustering Quality

v  Two methods: extrinsic vs. intrinsic


v  Extrinsic: supervised, i.e., the ground truth is available
§  Compare a clustering against the ground truth using
certain clustering quality measure
§  Ex. BCubed precision and recall metrics
v  Intrinsic: unsupervised, i.e., the ground truth is unavailable
§  Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are
§  Ex. Silhouette coefficient
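As an example of an intrinsic measure, an illustrative scikit-learn sketch (not from the slides) computing the silhouette coefficient for a k-means clustering of the toy points used earlier; values close to 1 indicate compact, well-separated clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([(1, 3), (3, 3), (4, 3), (5, 3),
              (1, 2), (4, 2), (1, 1), (2, 1)], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # intrinsic quality: higher is better (max 1)
```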

Measuring Clustering Quality:
Extrinsic Methods
v  Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
v  Q is good if it satisfies the following 4 essential criteria
§  Cluster homogeneity: the purer, the better
§  Cluster completeness: should assign objects belonging
to the same category in the ground truth to the same
cluster
§  Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
§  Small cluster preservation: splitting a small category into
pieces is more harmful than splitting a large category
into pieces
Summary
v  Cluster analysis groups objects based on their similarity and has
wide applications
v  Measure of similarity can be computed for various types of data
v  Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
v  K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
v  Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
v  DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
v  STING and CLIQUE are grid-based methods, where CLIQUE is
also a subspace clustering algorithm
v  Quality of clustering results can be evaluated in various ways
Summary
General characteristics of the major clustering methods:

Partitioning Methods
o Find mutually exclusive clusters of spherical shape
o Distance-based
o May use mean or medoid (etc.) to represent cluster center
o Effective for small to medium size data sets

Hierarchical Methods
o Clustering is a hierarchical decomposition (i.e., multiple levels)
o Cannot correct erroneous merges or splits
o May incorporate other techniques like microclustering or consider
object “linkages”

Density-based Methods
o Can find arbitrarily shaped clusters
o Clusters are dense regions of objects in space that are separated
by low-density regions
o Cluster density: Each point must have a minimum number of
points within its “neighborhood”
o May filter out outliers

Grid-based Methods
o Use a multiresolution grid data structure
o Fast processing time (typically independent of the number of data
objects, yet dependent on grid size)