
Lecture 10, 11 – Clustering

Muhammad Tariq Siddique

https://sites.google.com/site/mtsiddiquecs/dm
Gentle Reminder

“Switch Off” your Mobile Phone or Switch Mobile Phone to “Silent Mode”
Contents

1. What is Cluster Analysis?

2. k-Means Clustering

3. Hierarchical Clustering

4. Density-based Clustering

5. Grid-based Clustering
What is Cluster Analysis?
™ Cluster: A collection of data objects
ƒ similar (or related) to one another within the same group
ƒ dissimilar (or unrelated) to the objects in other groups
™ Cluster analysis (or clustering, data segmentation, …)
ƒ Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
™ Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
™ Typical applications
ƒ As a stand-alone tool to get insight into data distribution
ƒ As a preprocessing step for other algorithms

4
What is Cluster Analysis?

5
What is Cluster Analysis?

™By Family

6
What is Cluster Analysis?

™By Age

7
What is Cluster Analysis?

™By Gender

8
Applications of Cluster Analysis

™ Data reduction
ƒ Summarization: Preprocessing for regression, PCA,
classification, and association analysis
ƒ Compression: Image processing: vector quantization
™ Hypothesis generation and testing
™ Prediction based on groups
ƒ Cluster & find characteristics/patterns for each group
™ Finding K-nearest Neighbors
ƒ Localizing search to one or a small number of clusters
™ Outlier detection: Outliers are often viewed as those “far
away” from any cluster

9
Clustering: Application Examples
™ Biology: taxonomy of living things: kingdom, phylum,
class, order, family, genus and species
™ Information retrieval: document clustering
™ Land use: Identification of areas of similar land use in
an earth observation database
™ Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
™ City-planning: Identifying groups of houses according
to their house type, value, and geographical location
™ Earthquake studies: Observed earthquake epicenters should
be clustered along continent faults
™ Climate: understanding earth climate, finding patterns of
atmospheric and ocean behavior
™ Economic Science: market research
10
Earthquakes

11
Image Segmentation

12
Basic Steps to Develop a Clustering
Task
™ Feature selection
ƒ Select info concerning the task of interest
ƒ Minimal information redundancy
™ Proximity measure
ƒ Similarity of two feature vectors
™ Clustering criterion
ƒ Expressed via a cost function or some rules
™ Clustering algorithms
ƒ Choice of algorithms
™ Validation of the results
ƒ Validation test (also, clustering tendency test)
™ Interpretation of the results
ƒ Integration with applications
13
Quality: What Is Good Clustering?

™ A good clustering method will produce high quality clusters


ƒ high intra-class similarity: cohesive within clusters
ƒ low inter-class similarity: distinctive between clusters

™ The quality of a clustering method depends on


ƒ the similarity measure used by the method
ƒ its implementation, and
ƒ its ability to discover some or all of the hidden patterns

(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized)

14
Measure the Quality of
Clustering
™ Dissimilarity/Similarity metric
ƒ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
ƒ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
ƒ Weights should be associated with different variables
based on applications and data semantics
™ Quality of clustering:
ƒ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
ƒ It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
15
Data Structures
™Clustering algorithms typically operate on either
of the following two data structures:
™Data matrix
ƒ (two modes)

  $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

™Dissimilarity matrix
ƒ (one mode)

  $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
16
Data Matrix
  $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

™ This represents n objects, such as persons, with
p variables (measurements or attributes), such
as age, height, weight, gender, and so on.
™ The structure is in the form of a relational table,
or n-by-p matrix (n objects × p variables)
17
Dissimilarity Matrix
  $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

™ It is often represented by an n-by-n matrix where d(i, j) is the
measured difference or dissimilarity between objects i and j.
™ In general, d(i, j) is a nonnegative number that is
ƒ close to 0 when objects i and j are highly similar or “near” each other
ƒ becomes larger the more they differ
™ Note that d(i, j) = d(j, i) and d(i, i) = 0

18
Type of data in clustering
analysis
™Interval-scaled variables
™Binary variables
™Nominal, ordinal, and ratio variables
™Variables of mixed types
Interval-valued variables

™Standardize data
ƒ Calculate the mean absolute deviation:

  $$s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

  where $m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

ƒ Calculate the standardized measurement (z-score):

  $$z_{if} = \frac{x_{if} - m_f}{s_f}$$

™Using mean absolute deviation is more robust
than using standard deviation
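A minimal NumPy sketch of this standardization, assuming the data matrix is arranged with the n objects as rows and the interval-scaled variables as columns (the example values are made up):

```python
import numpy as np

def standardize_mad(X):
    """Standardize each column using the mean absolute deviation s_f
    instead of the standard deviation (more robust to outliers)."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                 # m_f: column means
    s = np.abs(X - m).mean(axis=0)     # s_f: mean absolute deviations
    return (X - m) / s                 # z_if = (x_if - m_f) / s_f

# Example: three objects measured on two interval-scaled variables
X = [[1.0, 200.0],
     [2.0, 300.0],
     [6.0, 1000.0]]
print(standardize_mad(X))
```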
Similarity and Dissimilarity
Between Objects
™Distances are normally used to measure the
similarity or dissimilarity between two data
objects

™Some popular ones include:


ƒ Minkowski distance
ƒ Manhattan distance
ƒ Euclidean distance
Minkowski and Manhattan
Distance
™Minkowski distance

  $$d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}$$

ƒ where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
two p-dimensional data objects, and q is a positive
integer

™If q = 1, d is the Manhattan distance:

  $$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

(Figure: for xi = (1, 7) and xj = (7, 1), the Manhattan distance is 6 + 6 = 12)
Euclidean Distance
™If q = 2, d is the Euclidean distance:

  $$d(i,j) = \sqrt{\,|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2\,}$$

(Figure: for Xi = (1, 7) and Xj = (7, 1), the Euclidean distance is √72 ≈ 8.48)

ƒ Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)

™Also one can use weighted distance, or other
dissimilarity measures.
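A small Python sketch of the Minkowski distance (Manhattan for q = 1, Euclidean for q = 2), checked against the (1, 7) / (7, 1) example above:

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points.
    q=1 gives the Manhattan distance, q=2 the Euclidean distance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, q=1))   # Manhattan: |1-7| + |7-1| = 12
print(minkowski(xi, xj, q=2))   # Euclidean: sqrt(36 + 36) ≈ 8.48
```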
Binary Variables
™ A contingency table for binary data

                    Object j
                  1      0      sum
  Object i   1    a      b      a+b
             0    c      d      c+d
           sum   a+c    b+d      p

™ Simple matching coefficient (invariant, if the binary
variable is symmetric):

  $$d(i,j) = \frac{b + c}{a + b + c + d}$$

™ Jaccard coefficient (noninvariant if the binary
variable is asymmetric):

  $$d(i,j) = \frac{b + c}{a + b + c}$$
Dissimilarity between Binary
Variables
™ Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

ƒ gender is a symmetric attribute
ƒ the remaining attributes are asymmetric binary
ƒ let the values Y and P be set to 1, and the value N be set to 0

  $$d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

  $$d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

  $$d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
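A small Python sketch that builds the a, b, c, d contingency counts and reproduces the three Jaccard dissimilarities above; the simple matching coefficient is included for comparison:

```python
def binary_dissimilarity(i, j, asymmetric=True):
    """Dissimilarity between two 0/1 vectors.
    a, b, c, d are the cells of the 2x2 contingency table.
    asymmetric=True  -> Jaccard:          (b + c) / (a + b + c)
    asymmetric=False -> simple matching:  (b + c) / (a + b + c + d)"""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return (b + c) / (a + b + c) if asymmetric else (b + c) / (a + b + c + d)

# Asymmetric attributes Fever..Test-4 with Y/P -> 1, N -> 0 (gender excluded)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(binary_dissimilarity(jim, mary), 2))   # 0.75
```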
Nominal Variables

™A generalization of the binary variable in that it


can take more than 2 states, e.g., red, yellow,
blue, green
™Method 1: Simple matching
ƒ m: # of matches, p: total # of variables

  $$d(i,j) = \frac{p - m}{p}$$

™Method 2: use a large number of binary variables


ƒ creating a new binary variable for each of the M
nominal states
Ordinal Variables

™ An ordinal variable can be discrete or continuous


™ Order is important, e.g., rank
™ Can be treated like interval-scaled variables
ƒ replace xif by its rank rif ∈ {1, …, Mf}
ƒ map the range of each variable onto [0, 1] by replacing
the i-th object in the f-th variable by

  $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

ƒ compute the dissimilarity using
methods for interval-scaled variables
27
Ratio-Scaled Variables

™ Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as Ae^{Bt} or Ae^{-Bt}
™ Methods:
ƒ treat them like interval-scaled variables: not a good
choice! (why? the scale can be distorted)
ƒ apply a logarithmic transformation: yif = log(xif)
ƒ treat them as continuous ordinal data and treat their ranks as
interval-scaled
28
Variables of Mixed Types

™ A database may contain all of the six types of variables
ƒ symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio
™ One may use a weighted formula to combine their
effects:

  $$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

ƒ f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
ƒ f is interval-based: use the normalized distance
ƒ f is ordinal or ratio-scaled:
• compute the ranks r_if, and
• treat z_if as interval-scaled, where

  $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
29
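A compact Python sketch of this weighted mixed-type dissimilarity; the function name, the hypothetical record (colour, height, satisfaction rank), and the ranges/M parameters are made-up illustrations, not part of the lecture:

```python
def mixed_dissimilarity(xi, xj, types, ranges=None, M=None):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f over variables of
    mixed types ('nominal', 'interval', 'ordinal').
    ranges[f]: max-min of interval variable f; M[f]: # of ordinal states."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        delta = 1.0                                 # indicator delta_ij^(f)
        if t == 'nominal':
            d = 0.0 if xi[f] == xj[f] else 1.0
        elif t == 'interval':
            d = abs(xi[f] - xj[f]) / ranges[f]      # normalized distance
        elif t == 'ordinal':
            zi = (xi[f] - 1) / (M[f] - 1)           # rank mapped onto [0, 1]
            zj = (xj[f] - 1) / (M[f] - 1)
            d = abs(zi - zj)
        num += delta * d
        den += delta
    return num / den

# Hypothetical record: (colour, height in cm, satisfaction rank out of 5)
types = ['nominal', 'interval', 'ordinal']
print(mixed_dissimilarity(('red', 170, 5), ('blue', 160, 3),
                          types, ranges={1: 50}, M={2: 5}))
```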
Considerations for Cluster Analysis

™ Partitioning criteria
ƒ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
™ Separation of clusters
ƒ Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than
one class)
™ Similarity measure
ƒ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
™ Clustering space
ƒ Full space (often when low dimensional) vs. subspaces (often
in high-dimensional clustering)
30
Requirements and Challenges
™ Scalability
ƒ Clustering all the data instead of only on samples
™ Ability to deal with different types of attributes
ƒ Numerical, binary, categorical, ordinal, linked, and mixture of
these
™ Constraint-based clustering
ƒ User may give inputs on constraints
ƒ Use domain knowledge to determine input parameters
™ Interpretability and usability
™ Others
ƒ Discovery of clusters with arbitrary shape
ƒ Ability to deal with noisy data
ƒ Incremental clustering and insensitivity to input order
ƒ High dimensionality
Major Clustering Approaches 1
™ Partitioning approach:
ƒ Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
errors
ƒ Typical methods: k-means, k-medoids, CLARANS
™ Hierarchical approach:
ƒ Create a hierarchical decomposition of the set of data
(or objects) using some criterion
ƒ Typical methods: Diana, Agnes, BIRCH, CHAMELEON
™ Density-based approach:
ƒ Based on connectivity and density functions
ƒ Typical methods: DBSCAN, OPTICS, DenClue
™ Grid-based approach:
ƒ Based on a multiple-level granularity structure
ƒ Typical methods: STING, WaveCluster, CLIQUE
32
Major Clustering Approaches 2
™ Model-based:
ƒ A model is hypothesized for each of the clusters, and the
method tries to find the best fit of the data to that model
ƒ Typical methods: EM, SOM, COBWEB
™ Frequent pattern-based:
ƒ Based on the analysis of frequent patterns
ƒ Typical methods: p-Cluster
™ User-guided or constraint-based:
ƒ Clustering by considering user-specified or application-specific
constraints
ƒ Typical methods: COD (obstacles), constrained clustering
™ Link-based clustering:
ƒ Objects are often linked together in various ways
ƒ Massive links can be used to cluster objects: SimRank, LinkClus
33
Partitioning Algorithms: Basic
Concept
™ Partitioning method: Partitioning a database D of n
objects into a set of k clusters, such that the sum of
squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)
  $$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left(d(p, c_i)\right)^2$$
™ Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
ƒ Global optimal: exhaustively enumerate all partitions
ƒ Heuristic methods: k-means and k-medoids algorithms
ƒ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
ƒ k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
34
The K-Means Clustering Method

™ Given k, the k-means algorithm is implemented in four


steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters
of the current partitioning (the centroid is the center,
i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest
seed point
4. Go back to Step 2; stop when the assignment does
not change
35
An Example of K-Means
Clustering
K = 2

(Figure: starting from the initial data set, arbitrarily partition the objects
into k groups, update the cluster centroids, reassign the objects, and loop
if needed until the assignment stabilizes)

„ Partition objects into k nonempty subsets
„ Repeat
„ Compute the centroid (i.e., mean point) of each partition
„ Assign each object to the cluster of its nearest centroid
„ Until no change
36
k-Means Algorithm Steps
1. Selecting the desired number of clusters k
2
2. Initialization k cluster centers (centroid) random
3. Place any data or object to the nearest cluster. The proximity of
the two objects is determined based on the distance. The
distance used in k-means algorithm
g is the Euclidean distance ((d))
n
d Euclidean ( x, y ) = (
∑ i i− ) 2
x y
i =1

x = x1, x2,. , , , Xn and y = y1, y2,. , , , Yn n is the number of attributes


(columns) between 2 records
4. Recalculate cluster centers with the current cluster membership.
Cluster center is the average (mean) of all the data
data, or objects in
a particular cluster
5. Assign each object again using the new cluster centers. If the
cluster centers have not changed
g again,
g , then the clustering g
process is completed. Or, go back to step 3 until the cluster
centers do not change anymore (stable) or no significant
decrease of the value of SSE (Sum of Squared Errors)
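A minimal NumPy sketch of these five steps, assuming Euclidean distance and random initial centroids drawn from the data; it is an illustration, not the exact code behind the exercise that follows:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: random initial centroids, Euclidean assignment,
    mean update, stop when the assignments no longer change."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 3: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):                 # step 5: stable
            break
        labels = new_labels
        # step 4: recompute each centroid as the mean of its members
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    sse = np.sum((X - centroids[labels]) ** 2)
    return labels, centroids, sse
```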
Exercise – Iteration 1

[STEP 1] Determine the number of clusters: k = 2
[STEP 2] Choose the initial centroids at random, e.g., from the data itself:
m1 = (1,1), m2 = (2,1)
[STEP 3] Assign each object to the closest cluster centroid based on
distance (here the city-block / Manhattan distance is used).

  Instance   x   y   d to m1 (1,1)   d to m2 (2,1)   Cluster membership
  A          1   3         2               3          m1
  B          3   3         4               3          m2
  C          4   3         5               4          m2
  D          5   3         6               5          m2
  E          1   2         1               2          m1
  F          4   2         4               3          m2
  G          1   1         0               1          m1
  H          2   1         1               0          m2
Exercise – Iteration 1

[STEP 4] The resulting members are cluster1 = {A, E, G} and
cluster2 = {B, C, D, F, H}, i.e., cluster1 contains the points (1,3), (1,2), (1,1)
and cluster2 contains (3,3), (4,3), (5,3), (4,2), (2,1). The SSE (Sum of
Squared Errors) and MSE (Mean Squared Error) values are:

  $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1)² + (3-1)²] + [(1-1)² + (2-1)²] + [(1-1)² + (1-1)²]
    + [(3-2)² + (3-1)²] + [(4-2)² + (3-1)²] + [(5-2)² + (3-1)²]
    + [(4-2)² + (2-1)²] + [(2-2)² + (1-1)²]
SSE = 4 + 1 + 0 + 5 + 8 + 13 + 5 + 0 = 36
MSE = 36/8 = 4.5

[STEP 5] The new centroids are
cluster1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2) and
cluster2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)
Exercise – Iteration 2

[STEP 3] Assign each object to the closest cluster centroid based on distance.

  Instance   x   y   d to m1 (1,2)   d to m2 (3.6,2.4)   Cluster membership
  A          1   3         1               3.2            m1
  B          3   3         3               1.2            m2
  C          4   3         4               1              m2
  D          5   3         5               2              m2
  E          1   2         0               3              m1
  F          4   2         3               0.8            m2
  G          1   1         1               4              m1
  H          2   1         2               3              m1
Exercise – Iteration 2

[STEP 4] The resulting members are cluster1 = {A, E, G, H} and
cluster2 = {B, C, D, F}. The new SSE and MSE values are:

  $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1)² + (3-2)²] + [(1-1)² + (2-2)²] + [(1-1)² + (1-2)²] + [(2-1)² + (1-2)²]
    + [(3-3.6)² + (3-2.4)²] + [(4-3.6)² + (3-2.4)²] + [(5-3.6)² + (3-2.4)²]
    + [(4-3.6)² + (2-2.4)²]
SSE = 1 + 0 + 1 + 2 + 0.72 + 0.52 + 2.32 + 0.32 = 7.88
MSE = 7.88/8 = 0.985

[STEP 5] The new centroids are
cluster1 = [(1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4] = (1.25, 1.75) and
cluster2 = [(3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4] = (4, 2.75)
Exercise – Iteration 3

[STEP 3] Assign each object to the closest cluster centroid based on distance.

  Instance   x   y   d to m1 (1.25,1.75)   d to m2 (4,2.75)   Cluster membership
  A          1   3          1.5                  3.25          m1
  B          3   3          3                    1.25          m2
  C          4   3          4                    0.25          m2
  D          5   3          5                    1.25          m2
  E          1   2          0.5                  3.75          m1
  F          4   2          3                    0.75          m2
  G          1   1          1                    4.75          m1
  H          2   1          1.5                  3.75          m1
Exercise – Iteration 3

[STEP 4] The resulting members are cluster1 = {A, E, G, H} and
cluster2 = {B, C, D, F}. The new SSE and MSE values are:

  $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1.25)² + (3-1.75)²] + [(1-1.25)² + (2-1.75)²] + [(1-1.25)² + (1-1.75)²]
    + [(2-1.25)² + (1-1.75)²] + [(3-4)² + (3-2.75)²] + [(4-4)² + (3-2.75)²]
    + [(5-4)² + (3-2.75)²] + [(4-4)² + (2-2.75)²]
SSE = 1.625 + 0.125 + 0.625 + 1.125 + 1.0625 + 0.0625 + 1.0625 + 0.5625 = 6.25
MSE = 6.25/8 = 0.78125

[STEP 5] The new centroids are
cluster1 = [(1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4] = (1.25, 1.75) and
cluster2 = [(3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4] = (4, 2.75)
Final Result

™ As the table shows, cluster memberships do not change
between iterations 2 and 3, so the algorithm stops.
ƒ The final results are:
ƒ cluster1 = {A, E, G, H}, and
ƒ cluster2 = {B, C, D, F}
ƒ with SSE = 6.25 and MSE = 0.78125
ƒ after 3 iterations

44
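As a sanity check (an addition, not part of the original exercise), the same result can be reproduced with scikit-learn's KMeans, assuming the library is installed; starting from the same initial centroids it converges to the same two clusters with SSE (inertia) = 6.25:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],    # A, B, C, D
              [1, 2], [4, 2], [1, 1], [2, 1]])   # E, F, G, H
init = np.array([[1, 1], [2, 1]])                # same starting centroids as the exercise

km = KMeans(n_clusters=2, init=init, n_init=1, max_iter=100).fit(X)
print(km.labels_)           # e.g. [0 1 1 1 0 1 0 0] -> {A, E, G, H} vs {B, C, D, F}
print(km.cluster_centers_)  # [[1.25 1.75] [4.   2.75]]
print(km.inertia_)          # 6.25  (sum of squared errors)
```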
Comments on the K-Means Method

™ Strength:
ƒ Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
ƒ Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
™ Comment: Often terminates at a local optimum
™ Weakness
ƒ Applicable only to objects in a continuous n-dimensional space
• Use the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
ƒ Need to specify k, the number of clusters, in advance (there are ways
to automatically determine the best k; see Hastie et al., 2009)
ƒ Sensitive to noisy data and outliers
ƒ Not suitable for discovering clusters with non-convex shapes

45
Variations of the K-Means Method

™ Most variants of k-means differ in


ƒ Selection of the initial k means
ƒ Dissimilarity calculations
ƒ Strategies to calculate cluster means

™ Handling categorical data: k-modes


ƒ Replacing means of clusters with modes
ƒ Using new dissimilarity measures to deal with
categorical objects
ƒ Using a frequency-based method to update modes
of clusters
ƒ A mixture of categorical and numerical data: the k-prototype method
46
What Is the Problem of the K-Means
Method?
™ The k-means algorithm is sensitive to outliers!
ƒ Since an object with an extremely large value may
substantially distort the distribution of the data
™ K-Medoids:
ƒ Instead of taking the mean value of the objects in a cluster as
a reference point, a medoid can be used, which is the most
centrally located object in a cluster

47
PAM: A Typical K-Medoids
Algorithm
(Figure: PAM with K = 2. Arbitrarily choose k objects as the initial medoids;
assign each remaining object to the nearest medoid (total cost = 20).
Randomly select a non-medoid object O_random, compute the total cost of
swapping a medoid with O_random (total cost = 26), and swap only if the
quality is improved. Repeat the loop until no change.)
48
The K-Medoid Clustering Method
™ K-Medoids Clustering: Find representative objects
(medoids) in clusters
™ PAM (Partitioning Around Medoids, Kaufmann &
Rousseeuw 1987)
ƒ Starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
ƒ PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
™ Efficiency improvement on PAM
ƒ CLARA (Kaufmann & Rousseeuw, 1990): PAM on
samples
ƒ CLARANS (Ng & Han, 1994): Randomized re-sampling

49
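A naive Python sketch of the PAM idea (greedy medoid/non-medoid swaps that reduce the total distance); this is an illustrative toy version, not the optimized PAM, CLARA, or CLARANS algorithms:

```python
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """Naive PAM sketch: start from random medoids and greedily accept the
    (medoid, non-medoid) swap that most reduces the total cost, i.e. the
    sum of distances of every object to its nearest medoid."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = o                       # try swapping medoid mi with object o
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:                            # no swap improves the clustering
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```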
Hierarchical Clustering
™ Use distance matrix as clustering criteria
™ This method does not require the number of clusters k
as an input, but needs a termination condition

(Figure: agglomerative clustering (AGNES) works from step 0 to step 4,
merging the objects a, b, c, d, e into ab, de, cde, and finally abcde;
divisive clustering (DIANA) runs the same steps in reverse, from step 4
back to step 0)
50
AGNES (Agglomerative Nesting)
™ Introduced in Kaufmann and Rousseeuw (1990)
™ Implemented in statistical packages, e.g., Splus
S l
™ Use the single-link method and the dissimilarity matrix
™ Merge nodes that have the least dissimilarity
™ Go on in a non-descending fashion
™ Eventually all nodes belong to the same cluster

51
Dendrogram: Shows How Clusters
are Merged

Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; then each connected
component forms a cluster
(Figure: a dendrogram over objects 3, 6, 4, 1, 2, 5, with merge heights
ranging from 0 up to about 0.25)
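A short SciPy sketch of agglomerative (AGNES-style) clustering on made-up 2-D points; linkage builds the merge tree and fcluster cuts the dendrogram at a chosen level:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy 2-D data; method='single' gives single-link (AGNES-style) merging
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
Z = linkage(X, method='single', metric='euclidean')

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)

# dendrogram(Z) draws the merge tree; cutting it at a height yields the clusters
```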
DIANA (Divisive Analysis)

™Introduced in Kaufmann and Rousseeuw (1990)


™Implemented in statistical analysis packages,
e.g., Splus
™Inverse order of AGNES
™Eventually each node forms a cluster on its own
53
Distance between Clusters

™ Single link: smallest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
™ Complete link: largest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
™ Average: avg distance between an element in one cluster and
an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
™ Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
™ Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = dist(Mi, Mj)
ƒ Medoid: a chosen, centrally located object in the cluster
54
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
™ Centroid: the “middle” of a cluster

  $$C_m = \frac{\sum_{i=1}^{N} t_i}{N}$$

™ Radius: square root of the average distance from any
point of the cluster to its centroid

  $$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_i - C_m)^2}{N}}$$

™ Diameter: square root of the average mean squared
distance between all pairs of points in the cluster

  $$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_i - t_j)^2}{N(N-1)}}$$
55
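A small NumPy sketch computing these three quantities for a toy cluster (the four example points are made up):

```python
import numpy as np

def centroid_radius_diameter(points):
    """Centroid, radius and diameter of a cluster of N numeric points,
    following the formulas above (RMS distances; diameter averaged over
    the N(N-1) ordered pairs)."""
    X = np.asarray(points, dtype=float)
    N = len(X)
    cm = X.mean(axis=0)                                          # centroid C_m
    Rm = np.sqrt(((X - cm) ** 2).sum(axis=1).mean())             # radius R_m
    diff = X[:, None, :] - X[None, :, :]
    Dm = np.sqrt((diff ** 2).sum(axis=2).sum() / (N * (N - 1)))  # diameter D_m
    return cm, Rm, Dm

print(centroid_radius_diameter([[1, 1], [2, 1], [1, 2], [2, 2]]))
```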
Extensions to Hierarchical
Clustering
™ Major weakness of agglomerative clustering methods
ƒ Can never undo what was done previously
ƒ Do not scale well: time complexity of at least O(n²), where
n is the total number of objects
™ Integration of hierarchical & distance-based clustering
ƒ BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
ƒ CHAMELEON (1999): hierarchical clustering using
dynamic modeling

56
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
™ Zhang, Ramakrishnan & Livny, SIGMOD’96
™ Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
ƒ Phase 1: scan DB to build an initial in-memory CF tree (a multi-
level compression of the data that tries to preserve the inherent
clustering structure of the data)
ƒ Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
™ Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
™ Weakness: handles only numeric data, and sensitive to the order of
the data records
57
Clustering Feature Vector in BIRCH

™ Clustering Feature (CF): CF = (N, LS, SS)
™ N: number of data points
™ LS: linear sum of the N points:

  $$LS = \sum_{i=1}^{N} X_i$$

™ SS: square sum of the N points:

  $$SS = \sum_{i=1}^{N} X_i^2$$

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))

58
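A tiny Python sketch that computes the CF triple for the five example points and illustrates the additivity property that lets BIRCH merge sub-clusters by simply adding their CFs:

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
N, LS, SS = clustering_feature(pts)
print(N, LS, SS)   # 5 [16. 30.] [ 54. 190.]  -- matches CF = (5, (16,30), (54,190))

# CFs are additive: merging two subclusters just adds their CF components,
# which is what lets BIRCH maintain the CF-tree incrementally
N1, LS1, SS1 = clustering_feature(pts[:2])
N2, LS2, SS2 = clustering_feature(pts[2:])
print(N1 + N2, LS1 + LS2, SS1 + SS2)   # same triple as above
```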
CF-Tree in BIRCH

™ Clustering feature:
ƒ Summary of the statistics for a given subcluster: the
0-th, 1st, and 2nd moments of the subcluster from the
statistical point of view
ƒ Registers crucial measurements for computing clusters
and utilizes storage efficiently
™ A CF tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering
ƒ A nonleaf node in a tree has descendants or “children”
ƒ The nonleaf nodes store sums of the CFs of their
children
™ A CF tree has two parameters
ƒ Branching factor: max # of children
ƒ Threshold: max diameter of sub-clusters stored at the leaf nodes
59
The CF Tree Structure
(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The
root and non-leaf nodes hold entries CF1 … CF6, each pointing to a child
node; the leaf nodes hold CF entries and are chained together with
prev/next pointers)
60
The Birch Algorithm

™ Cluster diameter:

  $$D = \sqrt{\frac{1}{n(n-1)} \sum (x_i - x_j)^2}$$
™ For each point in the input


ƒ Find closest leaf entry
ƒ Add point to leaf entry and update CF
ƒ If entry diameter > max_diameter, then split leaf, and possibly
parents
™ Algorithm is O(n)
™ Concerns
ƒ Sensitive to the insertion order of data points
ƒ Since we fix the size of leaf nodes, clusters may not be so natural
ƒ Clusters tend to be spherical given the radius and diameter
measures
61
CHAMELEON: Hierarchical
Clustering Using Dynamic Modeling
(1999)
™ CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
™ Measures the similarity based on a dynamic model
ƒ Two clusters are merged only if the interconnectivity and
closeness (proximity) between two clusters are high
relative to the internal interconnectivity of the clusters and
closeness of items within the clusters
™ Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into a
large number of relatively small sub
sub-clusters
clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining these
sub-clusters
62
KNN Graphs & Interconnectivity
™ k-nearest neighbor graphs from the original data in 2D:

™ EC{Ci,Cj}: the absolute inter-connectivity between Ci
and Cj: the sum of the weights of the edges that
connect vertices in Ci to vertices in Cj
™ Internal inter-connectivity of a cluster Ci: the size of its
min-cut bisector ECCi (i.e., the weighted sum of the edges
that partition the graph into two roughly equal parts)
™ Relative Inter-connectivity (RI): the absolute inter-connectivity
between Ci and Cj, normalized by the average internal
inter-connectivity of Ci and Cj
63
Relative Closeness & Merge of Sub-
Clusters
™ Relative closeness between a pair of clusters Ci and
Cj : the absolute closeness between Ci and Cj
normalized w.r.t. the internal closeness of the two
clusters Ci and Cj

™ The relative closeness is computed from the average weights of the
edges that belong to the min-cut bisectors of clusters Ci and Cj,
respectively, and the average weight of the edges that connect
vertices in Ci to vertices in Cj
™ Merge Sub-Clusters:
ƒ Merges only those pairs of clusters whose RI and RC are
both above some user-specified thresholds
ƒ Merge those maximizing the function that combines RI and
RC
64
Overall Framework of CHAMELEON
(Figure: the data set is first converted into a sparse K-NN graph, in which
p and q are connected if q is among the top k closest neighbors of p; the
graph is then partitioned into many small sub-clusters, which are merged
into the final clusters using relative interconnectivity (connectivity of c1
and c2 over their internal connectivity) and relative closeness (closeness
of c1 and c2 over their internal closeness))
65
CHAMELEON (Clustering Complex
Objects)

66
Density-Based Clustering Methods

™ Clustering based on density (local cluster criterion), such


as density-connected points
™ Major features:
ƒ Discover clusters of arbitrary shape
ƒ Handle noise
ƒ One scan
ƒ Need density parameters as termination condition
™ Several interesting studies:
ƒ DBSCAN: Ester, et al. (KDD’96)
ƒ OPTICS: Ankerst, et al. (SIGMOD’99)
ƒ DENCLUE: Hinneburg & D. Keim (KDD’98)
ƒ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
67
Density-Based Clustering: Basic
Concepts
™ Two parameters:
ƒ Eps: Maximum radius of the neighbourhood
ƒ MinPts: Minimum number of points in an Eps-
neighbourhood of that point
™ NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
™ Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
ƒ p belongs to NEps(q)
ƒ core point condition: |NEps(q)| ≥ MinPts

(Figure: point q with p inside its Eps-neighbourhood, Eps = 1 cm, MinPts = 5)
68
Density-Reachable and Density-
Connected
™ Density-reachable:
ƒ A point p is density-reachable
from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn,
p1 = q, pn = p such that pi+1 is
directly density-reachable from pi

™ Density-connected:
ƒ A point p is density-connected to
a point q w.r.t. Eps, MinPts if
there is a point o such that both p
and q are density-reachable from
o w.r.t. Eps and MinPts
69
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
™Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
™Discovers clusters of arbitrary shape in spatial
databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5)

70
DBSCAN: The Algorithm

1. Arbitrarily select a point p


2. Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from
p and DBSCAN visits the next point of the database
5. Continue the process until all of the points have been
processed

If a spatial index is used, the computational complexity of DBSCAN is


O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²)

71
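A brief usage sketch with scikit-learn's DBSCAN implementation (a library assumption, not part of the lecture, on made-up data); points labelled -1 are the noise/outlier points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points that should come out as noise
X = np.vstack([rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
               rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
               rng.uniform(-2, 7, size=(5, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # Eps and MinPts of the slides
print(set(db.labels_))                        # cluster ids; -1 marks noise/outliers
```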
DBSCAN: Sensitive to
Parameters

http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
72
OPTICS: A Cluster-Ordering Method
(1999)
™ OPTICS: Ordering Points To Identify the Clustering
Structure
ƒ Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
ƒ Produces a special order of the database wrt its
density-based clustering structure
ƒ This cluster-ordering contains info equiv to the
density-based clusterings corresponding to a broad
range of parameter settings
ƒ Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
ƒ Can be represented graphically or using visualization techniques
73
OPTICS: Some Extension from
DBSCAN
™ Index-based: k = # of dimensions, N: # of points
ƒ Complexity: O(N*logN)
™ Core distance of an object p: the smallest value ε′ such that
the ε′-neighborhood of p has at least MinPts objects.
Let Nε(p) be the ε-neighborhood of p, where ε is a distance value:
  Core-distance_{ε,MinPts}(p) =
    Undefined, if card(Nε(p)) < MinPts
    MinPts-distance(p), otherwise
™ Reachability distance of object p from core object q: the
minimum radius value that makes p density-reachable from q:
  Reachability-distance_{ε,MinPts}(p, q) =
    Undefined, if q is not a core object
    max(core-distance(q), distance(q, p)), otherwise
74
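A short usage sketch with scikit-learn's OPTICS implementation (a library assumption, on made-up data); the reachability values taken in the cluster order form the reachability plot shown on the following slides:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.3, (60, 2)),
               rng.normal((4, 4), 0.8, (60, 2))])

opt = OPTICS(min_samples=5).fit(X)
# reachability_[ordering_] is the reachability plot; valleys correspond to clusters
print(opt.reachability_[opt.ordering_][:10])
print(set(opt.labels_))
```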
Core Distance & Reachability
Distance

75
(Figure: reachability plot; the reachability-distance of the objects, taken in
cluster order, with the first value undefined and a threshold ε marked)
76
Density-Based Clustering: OPTICS &
Applications:
http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo

77
Grid-Based Clustering Method

™Using multi-resolution grid data structure


™Several interesting methods
ƒ STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
ƒ WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
• A multi-resolution clustering approach
using wavelet method
ƒ CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
78
STING: A Statistical Information Grid
Approach
™Wang, Yang and Muntz (VLDB’97)
™The spatial area is divided into rectangular cells
™There are several levels of cells corresponding
to different levels of resolution

79
The STING Clustering Method
™ Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
™ Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
™ Parameters of higher-level cells can be easily calculated from
the parameters of lower-level cells
ƒ count, mean, s, min, max
ƒ type of distribution: normal, uniform, etc.
™ Use a top-down approach to answer spatial data queries
™ Start from a pre-selected layer, typically with a small number
of cells
™ For each cell in the current level compute the confidence
interval

80
STING Algorithm and Its
Analysis
™ Remove the irrelevant cells from further consideration
™ When finished examining the current layer, proceed to the
next lower level
™ Repeat this process until the bottom layer is reached
™ Advantages:
ƒ Query-independent, easy to parallelize, incremental
update
ƒ O(K), where K is the number of grid cells at the lowest
level
™ Disadvantages:
ƒ All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
81
Assessing Clustering Tendency
™ Assess if non-random structure exists in the data by
measuring the probability that the data is generated by a
uniform data distribution
™ Test spatial randomness by a statistical test: the Hopkins statistic
ƒ Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in
the data space
ƒ Sample n points, p1, …, pn, uniformly from D. For each pi, find
its nearest neighbor in D: xi = min{dist (pi, v)} where v in D
ƒ Sample n points, q1, …, qn, uniformly from D. For each qi, find
its nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D
and v ≠ qi
ƒ Calculate the Hopkins statistic H from ∑xi and ∑yi (one common
form is sketched after this slide)

ƒ If D is uniformly distributed, ∑ xi and ∑ yi will be close to each


other and H is close to 0.5. If D is clustered, H is close to 1

82
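A Python sketch of one common form of the Hopkins statistic, chosen to match the slide's interpretation (H ≈ 0.5 for uniform data, H → 1 for clustered data): here the x_i come from points generated uniformly over the bounding box of the data and the y_i from sampled data points. This exact formulation is an assumption, since conventions differ in the literature:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=0):
    """One common form of the Hopkins statistic, H = sum(x) / (sum(x) + sum(y)):
    x_i = distance from a point drawn uniformly over the bounding box of D
          to its nearest neighbour in D
    y_i = distance from a sampled data point to its nearest other point in D
    H ~ 0.5 for uniformly distributed data, H -> 1 for clustered data."""
    D = np.asarray(D, dtype=float)
    n = min(n, len(D))
    rng = np.random.default_rng(seed)
    tree = cKDTree(D)

    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))       # artificial uniform points
    x = tree.query(P, k=1)[0]

    Q = D[rng.choice(len(D), size=n, replace=False)]    # sample of real data points
    y = tree.query(Q, k=2)[0][:, 1]                     # k=2: skip the point itself

    return x.sum() / (x.sum() + y.sum())
```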
Determine the Number of Clusters

™ Empirical method
ƒ # of clusters ≈ √(n/2) for a dataset of n points
™ Elbow method
ƒ Use the turning point in the curve of the sum of within-cluster variance
w.r.t. the # of clusters
™ Cross validation method
ƒ Divide a given data set into m parts
ƒ Use m – 1 parts to obtain a clustering model
ƒ Use the remaining part to test the quality of the clustering
• E.g., for each point in the test set, find the closest centroid,
and use the sum of squared distances between all points in
the test set and their closest centroids to measure how well
the model fits the test set
ƒ For any k > 0, repeat it m times, compare the overall quality
measure w.r.t. different k’s, and find the # of clusters that fits the
data the best
83
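A short scikit-learn sketch of the elbow method on made-up data: the within-cluster SSE (KMeans' inertia_) is computed for increasing k, and the turning point of the curve suggests the number of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three made-up blobs, so the elbow should appear around k = 3
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

for k in range(1, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))   # plot these values and look for the "elbow"
```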
Measuring Clustering Quality

™ Two methods: extrinsic vs. intrinsic


™ Extrinsic: supervised, i.e., the ground truth is available
ƒ Compare a clustering against the ground truth using
a certain clustering quality measure
ƒ Ex. BCubed precision and recall metrics
™ Intrinsic: unsupervised, i.e., the ground truth is unavailable
ƒ Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are
ƒ Ex. Silhouette coefficient

84
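A brief scikit-learn sketch of an intrinsic measure, the silhouette coefficient, computed on made-up data for several values of k (the library usage is an assumption, not part of the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

# The silhouette coefficient (in [-1, 1]) rewards clusterings that are
# compact and well separated; no ground truth is needed
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```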
Measuring Clustering Quality:
Extrinsic Methods
™ Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
™ Q is good if it satisfies the following 4 essential criteria
ƒ Cluster homogeneity: the purer, the better
ƒ Cluster completeness: should assign objects belonging to
the same category in the ground truth to the same
cluster
ƒ Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
ƒ Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
85
Summary
™ Cluster analysis groups objects based on their similarity and has
wide applications
™ Measure of similarity can be computed for various types of data
™ Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
™ K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
™ Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
™ DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
™ STING and CLIQUE are grid-based methods, where CLIQUE is
also a subspace clustering algorithm
™ Quality of clustering results can be evaluated in various ways
86
Summary
Clustering Method      General Characteristics

Partitioning Methods   o Find mutually exclusive clusters of spherical shape
                       o Distance-based
                       o May use mean or medoid (etc.) to represent cluster center
                       o Effective for small to medium size data sets

Hierarchical Methods   o Clustering is a hierarchical decomposition (i.e., multiple levels)
                       o Cannot correct erroneous merges or splits
                       o May incorporate other techniques like microclustering or
                         consider object “linkages”

Density-based Methods  o Can find arbitrarily shaped clusters
                       o Clusters are dense regions of objects in space that are
                         separated by low-density regions
                       o Cluster density: each point must have a minimum number of
                         points within its “neighborhood”
                       o May filter out outliers

Grid-based Methods     o Use a multiresolution grid data structure
                       o Fast processing time (typically independent of the number of
                         data objects, yet dependent on grid size)
87
Reference
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and
   Techniques, Third Edition, Elsevier, 2012
2. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical
   Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
3. Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining
   Use Cases and Business Analytics Applications, CRC Press
   Taylor & Francis Group, 2014
4. Daniel T. Larose, Discovering Knowledge in Data: An Introduction
   to Data Mining, John Wiley & Sons, 2005
5. Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT
   Press, 2014
6. Florin Gorunescu, Data Mining: Concepts, Models and Techniques,
   Springer, 2011
7. Oded Maimon and Lior Rokach, Data Mining and Knowledge
   Discovery Handbook, Second Edition, Springer, 2010
8. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances
   in Data Mining of Enterprise Data: Algorithms and Applications,
   World Scientific, 2007
88
