
Lecture 10, 11 – Clustering

Muhammad Tariq Siddique

https://sites.google.com/site/mtsiddiquecs/dm
Gentle Reminder

“Switch Off” your Mobile Phone or Switch Mobile Phone to “Silent Mode”
Contents

1. What is Cluster Analysis?

2. k-Means Clustering

3. Hierarchical Clustering

4. Density-based Clustering

5. Grid-based Clustering
What is Cluster Analysis?
™ Cluster: A collection of data objects
ƒ similar (or related) to one another within the same group
ƒ dissimilar (or unrelated) to the objects in other groups
™ Cluster analysis (or clustering, data segmentation, …)
ƒ Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
™ Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
™ Typical applications
ƒ As a stand-alone tool to get insight into data distribution
ƒ As a preprocessing step for other algorithms

4
What is Cluster Analysis?

5
What is Cluster Analysis?

™By Family

6
What is Cluster Analysis?

™By Age

7
What is Cluster Analysis?

™By Gender

8
Applications of Cluster Analysis

™ Data reduction
ƒ Summarization: Preprocessing for regression, PCA,
classification, and association analysis
ƒ Compression: Image processing: vector quantization
™ Hypothesis generation and testing
™ Prediction based on groups
ƒ Cluster & find characteristics/patterns for each group
™ Finding K-nearest Neighbors
ƒ Localizing search to one or a small number of clusters
™ Outlier detection: Outliers are often viewed as those “far
away” from any cluster

9
Clustering: Application Examples
™ Biology: taxonomy of living things: kingdom, phylum,
class, order, family, genus and species
™ Information retrieval: document clustering
™ Land use: Identification of areas of similar land use in
an earth observation database
™ Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
™ City-planning: Identifying groups of houses according
to their house type, value, and geographical location
™ Earthquake studies: Observed earthquake epicenters should
be clustered along continent faults
™ Climate: understanding earth climate, finding patterns of
atmospheric and ocean behavior
™ Economic Science: market research
10
Earthquakes

11
Image Segmentation

12
Basic Steps to Develop a Clustering
Task
™ Feature selection
ƒ Select info concerning the task of interest
ƒ Minimal information redundancy
™ Proximity measure
ƒ Similarity of two feature vectors
™ Clustering criterion
ƒ Expressed via a cost function or some rules
™ Clustering algorithms
ƒ Choice of algorithms
™ Validation of the results
ƒ Validation test (also, clustering tendency test)
™ Interpretation of the results
ƒ Integration with applications
13
Quality: What Is Good Clustering?

™ A good clustering method will produce high quality clusters


ƒ high intra-class similarity: cohesive within clusters
ƒ low inter-class similarity: distinctive between clusters

™ The quality of a clustering method depends on


ƒ the similarity measure used by the method
ƒ its implementation, and
ƒ its ability to discover some or all of the hidden patterns

(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized)

14
Measure the Quality of
Clustering
™ Dissimilarity/Similarity metric
ƒ Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
ƒ The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
ƒ Weights should be associated with different variables
based on applications and data semantics
™ Quality of clustering:
ƒ There is usually a separate “quality” function that
measures the “goodness” of a cluster.
ƒ It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
15
Data Structures
™Clustering algorithms typically operate on either
of the following two data structures:
™Data matrix
ƒ (two modes)

  $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

™Dissimilarity matrix
ƒ (one mode)

  $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
16
Data Matrix
  $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

™ This represents n objects, such as persons, with
p variables (measurements or attributes), such
as age, height, weight, gender, and so on.
™ The structure is in the form of a relational table,
or n-by-p matrix (n objects × p variables)
17
Dissimilarity Matrix
  $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

™ It is often represented by an n-by-n matrix where d(i, j) is the
measured difference or dissimilarity between objects i and j.
™ In general, d(i, j) is a nonnegative number that is
ƒ close to 0 when objects i and j are highly similar or “near” each other
ƒ becomes larger the more they differ
™ Note that d(i, j) = d(j, i) and d(i, i) = 0

18
Type of data in clustering
analysis
™Interval-scaled variables
™Binary variables
™Nominal, ordinal, and ratio variables
™Variables of mixed types
Interval-valued variables

™Standardize data
ƒ Calculate the mean absolute deviation:

  $$s_f = \tfrac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

  where $m_f = \tfrac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

ƒ Calculate the standardized measurement (z-score):

  $$z_{if} = \frac{x_{if} - m_f}{s_f}$$

™Using mean absolute deviation is more robust
than using standard deviation
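A minimal NumPy sketch of this standardization, assuming the data matrix is arranged with the n objects as rows and the interval-scaled variables as columns (the example values are made up):

```python
import numpy as np

def standardize_mad(X):
    """Standardize each column using the mean absolute deviation s_f
    instead of the standard deviation (more robust to outliers)."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                 # m_f: column means
    s = np.abs(X - m).mean(axis=0)     # s_f: mean absolute deviations
    return (X - m) / s                 # z_if = (x_if - m_f) / s_f

# Example: three objects measured on two interval-scaled variables
X = [[1.0, 200.0],
     [2.0, 300.0],
     [6.0, 1000.0]]
print(standardize_mad(X))
```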
Similarity and Dissimilarity
Between Objects
™Distances are normally used to measure the
similarity or dissimilarity between two data
objects

™Some popular ones include:


ƒ Minkowski distance
ƒ Manhattan distance
ƒ Euclidean distance
Minkowski and Manhattan
Distance
™Minkowski distance

  $$d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}$$

ƒ where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are
two p-dimensional data objects, and q is a positive
integer

™If q = 1, d is the Manhattan distance:

  $$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

(Figure: for xi = (1, 7) and xj = (7, 1), the Manhattan distance is 6 + 6 = 12)
Euclidean Distance
™If q = 2, d is the Euclidean distance:

  $$d(i,j) = \sqrt{\,|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2\,}$$

(Figure: for Xi = (1, 7) and Xj = (7, 1), the Euclidean distance is √72 ≈ 8.48)

ƒ Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)

™Also one can use weighted distance, or other
dissimilarity measures.
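A small Python sketch of the Minkowski distance (Manhattan for q = 1, Euclidean for q = 2), checked against the (1, 7) / (7, 1) example above:

```python
import numpy as np

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points.
    q=1 gives the Manhattan distance, q=2 the Euclidean distance."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, q=1))   # Manhattan: |1-7| + |7-1| = 12
print(minkowski(xi, xj, q=2))   # Euclidean: sqrt(36 + 36) ≈ 8.48
```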
Binary Variables
™ A contingency table for binary data

                    Object j
                  1      0      sum
  Object i   1    a      b      a+b
             0    c      d      c+d
           sum   a+c    b+d      p

™ Simple matching coefficient (invariant, if the binary
variable is symmetric):

  $$d(i,j) = \frac{b + c}{a + b + c + d}$$

™ Jaccard coefficient (noninvariant if the binary
variable is asymmetric):

  $$d(i,j) = \frac{b + c}{a + b + c}$$
Dissimilarity between Binary
Variables
™ Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

ƒ gender is a symmetric attribute
ƒ the remaining attributes are asymmetric binary
ƒ let the values Y and P be set to 1, and the value N be set to 0

  $$d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

  $$d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

  $$d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
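A small Python sketch that builds the a, b, c, d contingency counts and reproduces the three Jaccard dissimilarities above; the simple matching coefficient is included for comparison:

```python
def binary_dissimilarity(i, j, asymmetric=True):
    """Dissimilarity between two 0/1 vectors.
    a, b, c, d are the cells of the 2x2 contingency table.
    asymmetric=True  -> Jaccard:          (b + c) / (a + b + c)
    asymmetric=False -> simple matching:  (b + c) / (a + b + c + d)"""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return (b + c) / (a + b + c) if asymmetric else (b + c) / (a + b + c + d)

# Asymmetric attributes Fever..Test-4 with Y/P -> 1, N -> 0 (gender excluded)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(binary_dissimilarity(jim, mary), 2))   # 0.75
```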
Nominal Variables

™A generalization of the binary variable in that it


can take more than 2 states, e.g., red, yellow,
blue, green
™Method 1: Simple matching
ƒ m: # of matches, p: total # of variables

  $$d(i,j) = \frac{p - m}{p}$$

™Method 2: use a large number of binary variables


ƒ creating a new binary variable for each of the M
nominal states
Ordinal Variables

™ An ordinal variable can be discrete or continuous


™ Order is important, e.g., rank
™ Can be treated like interval-scaled variables
ƒ replace xif by its rank rif ∈ {1, …, Mf}
ƒ map the range of each variable onto [0, 1] by replacing
the i-th object in the f-th variable by

  $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

ƒ compute the dissimilarity using
methods for interval-scaled variables
27
Ratio-Scaled Variables

™ Ratio-scaled variable: a positive measurement on a
nonlinear scale, approximately at exponential scale,
such as Ae^{Bt} or Ae^{-Bt}
™ Methods:
ƒ treat them like interval-scaled variables: not a good
choice! (why? the scale can be distorted)
ƒ apply a logarithmic transformation: yif = log(xif)
ƒ treat them as continuous ordinal data and treat their ranks as
interval-scaled
28
Variables of Mixed Types

™ A database may contain all of the six types of variables
ƒ symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio
™ One may use a weighted formula to combine their
effects:

  $$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

ƒ f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
ƒ f is interval-based: use the normalized distance
ƒ f is ordinal or ratio-scaled:
• compute the ranks r_if, and
• treat z_if as interval-scaled, where

  $$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
29
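A compact Python sketch of this weighted mixed-type dissimilarity; the function name, the hypothetical record (colour, height, satisfaction rank), and the ranges/M parameters are made-up illustrations, not part of the lecture:

```python
def mixed_dissimilarity(xi, xj, types, ranges=None, M=None):
    """d(i,j) = sum_f delta_f * d_f / sum_f delta_f over variables of
    mixed types ('nominal', 'interval', 'ordinal').
    ranges[f]: max-min of interval variable f; M[f]: # of ordinal states."""
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        delta = 1.0                                 # indicator delta_ij^(f)
        if t == 'nominal':
            d = 0.0 if xi[f] == xj[f] else 1.0
        elif t == 'interval':
            d = abs(xi[f] - xj[f]) / ranges[f]      # normalized distance
        elif t == 'ordinal':
            zi = (xi[f] - 1) / (M[f] - 1)           # rank mapped onto [0, 1]
            zj = (xj[f] - 1) / (M[f] - 1)
            d = abs(zi - zj)
        num += delta * d
        den += delta
    return num / den

# Hypothetical record: (colour, height in cm, satisfaction rank out of 5)
types = ['nominal', 'interval', 'ordinal']
print(mixed_dissimilarity(('red', 170, 5), ('blue', 160, 3),
                          types, ranges={1: 50}, M={2: 5}))
```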
Considerations for Cluster Analysis

™ Partitioning criteria
ƒ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
™ Separation of clusters
ƒ Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than
one class)
™ Similarity measure
ƒ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
™ Clustering space
ƒ Full space (often when low dimensional) vs. subspaces (often
in high-dimensional clustering)
30
Requirements and Challenges
™ Scalability
ƒ Clustering all the data instead of only on samples
™ Ability to deal with different types of attributes
ƒ Numerical, binary, categorical, ordinal, linked, and mixture of
these
™ Constraint-based clustering
ƒ User may give inputs on constraints
ƒ Use domain knowledge to determine input parameters
™ Interpretability and usability
™ Others
ƒ Discovery of clusters with arbitrary shape
ƒ Ability to deal with noisy data
ƒ Incremental clustering and insensitivity to input order
ƒ High dimensionality
Major Clustering Approaches 1
™ Partitioning approach:
ƒ Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
errors
ƒ Typical methods: k-means, k-medoids, CLARANS
™ Hierarchical approach:
ƒ Create a hierarchical decomposition of the set of data
(or objects) using some criterion
ƒ Typical methods: Diana, Agnes, BIRCH, CHAMELEON
™ Density-based approach:
ƒ Based on connectivity and density functions
ƒ Typical methods: DBSCAN, OPTICS, DenClue
™ Grid-based approach:
ƒ Based on a multiple-level granularity structure
ƒ Typical methods: STING, WaveCluster, CLIQUE
32
Major Clustering Approaches 2
™ Model-based:
ƒ A model is hypothesized for each of the clusters, and the
method tries to find the best fit of the data to that model
ƒ Typical methods: EM, SOM, COBWEB
™ Frequent pattern-based:
ƒ Based on the analysis of frequent patterns
ƒ Typical methods: p-Cluster
™ User-guided or constraint-based:
ƒ Clustering by considering user-specified or application-specific
constraints
ƒ Typical methods: COD (obstacles), constrained clustering
™ Link-based clustering:
ƒ Objects are often linked together in various ways
ƒ Massive links can be used to cluster objects: SimRank, LinkClus
33
Partitioning Algorithms: Basic
Concept
™ Partitioning method: Partitioning a database D of n
objects into a set of k clusters, such that the sum of
squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)
  $$E = \sum_{i=1}^{k} \sum_{p \in C_i} \left(d(p, c_i)\right)^2$$
™ Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
ƒ Global optimal: exhaustively enumerate all partitions
ƒ Heuristic methods: k-means and k-medoids algorithms
ƒ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
ƒ k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
34
The K-Means Clustering Method

™ Given k, the k-means algorithm is implemented in four


steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters
of the current partitioning (the centroid is the center,
i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest
seed point
4. Go back to Step 2; stop when the assignment does
not change
35
An Example of K-Means
Clustering
K = 2

(Figure: starting from the initial data set, arbitrarily partition the objects
into k groups, update the cluster centroids, reassign the objects, and loop
if needed until the assignment stabilizes)

„ Partition objects into k nonempty subsets
„ Repeat
„ Compute the centroid (i.e., mean point) of each partition
„ Assign each object to the cluster of its nearest centroid
„ Until no change
36
k-Means Algorithm Steps
1. Selecting the desired number of clusters k
2
2. Initialization k cluster centers (centroid) random
3. Place any data or object to the nearest cluster. The proximity of
the two objects is determined based on the distance. The
distance used in k-means algorithm
g is the Euclidean distance ((d))
n
d Euclidean ( x, y ) = (
∑ i i− ) 2
x y
i =1

x = x1, x2,. , , , Xn and y = y1, y2,. , , , Yn n is the number of attributes


(columns) between 2 records
4. Recalculate cluster centers with the current cluster membership.
Cluster center is the average (mean) of all the data
data, or objects in
a particular cluster
5. Assign each object again using the new cluster centers. If the
cluster centers have not changed
g again,
g , then the clustering g
process is completed. Or, go back to step 3 until the cluster
centers do not change anymore (stable) or no significant
decrease of the value of SSE (Sum of Squared Errors)
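A minimal NumPy sketch of these five steps, assuming Euclidean distance and random initial centroids drawn from the data; it is an illustration, not the exact code behind the exercise that follows:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: random initial centroids, Euclidean assignment,
    mean update, stop when the assignments no longer change."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 3: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):                 # step 5: stable
            break
        labels = new_labels
        # step 4: recompute each centroid as the mean of its members
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    sse = np.sum((X - centroids[labels]) ** 2)
    return labels, centroids, sse
```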
Exercise – Iteration 1

[STEP 1] Determine the number of clusters: k = 2
[STEP 2] Choose the initial centroids at random, e.g., from the data itself:
m1 = (1,1), m2 = (2,1)
[STEP 3] Assign each object to the closest cluster centroid based on
distance (here the city-block / Manhattan distance is used).

  Instance   x   y   d to m1 (1,1)   d to m2 (2,1)   Cluster membership
  A          1   3         2               3          m1
  B          3   3         4               3          m2
  C          4   3         5               4          m2
  D          5   3         6               5          m2
  E          1   2         1               2          m1
  F          4   2         4               3          m2
  G          1   1         0               1          m1
  H          2   1         1               0          m2
Exercise – Iteration 1

[STEP 4] The resulting members are cluster1 = {A, E, G} and
cluster2 = {B, C, D, F, H}, i.e., cluster1 contains the points (1,3), (1,2), (1,1)
and cluster2 contains (3,3), (4,3), (5,3), (4,2), (2,1). The SSE (Sum of
Squared Errors) and MSE (Mean Squared Error) values are:

  $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1)² + (3-1)²] + [(1-1)² + (2-1)²] + [(1-1)² + (1-1)²]
    + [(3-2)² + (3-1)²] + [(4-2)² + (3-1)²] + [(5-2)² + (3-1)²]
    + [(4-2)² + (2-1)²] + [(2-2)² + (1-1)²]
SSE = 4 + 1 + 0 + 5 + 8 + 13 + 5 + 0 = 36
MSE = 36/8 = 4.5

[STEP 5] The new centroids are
cluster1 = [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2) and
cluster2 = [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4)
Exercise – Iteration 2

[STEP 3] Assign each object to the closest cluster centroid based on distance.

  Instance   x   y   d to m1 (1,2)   d to m2 (3.6,2.4)   Cluster membership
  A          1   3         1               3.2            m1
  B          3   3         3               1.2            m2
  C          4   3         4               1              m2
  D          5   3         5               2              m2
  E          1   2         0               3              m1
  F          4   2         3               0.8            m2
  G          1   1         1               4              m1
  H          2   1         2               3              m1
Exercise – Iteration 2

[STEP 4] The resulting members are cluster1 = {A, E, G, H} and
cluster2 = {B, C, D, F}. The new SSE and MSE values are:

  $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1)² + (3-2)²] + [(1-1)² + (2-2)²] + [(1-1)² + (1-2)²] + [(2-1)² + (1-2)²]
    + [(3-3.6)² + (3-2.4)²] + [(4-3.6)² + (3-2.4)²] + [(5-3.6)² + (3-2.4)²]
    + [(4-3.6)² + (2-2.4)²]
SSE = 1 + 0 + 1 + 2 + 0.72 + 0.52 + 2.32 + 0.32 = 7.88
MSE = 7.88/8 = 0.985

[STEP 5] The new centroids are
cluster1 = [(1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4] = (1.25, 1.75) and
cluster2 = [(3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4] = (4, 2.75)
Exercise – Iteration 3

[STEP 3] Assign each object to the closest cluster centroid based on distance.

  Instance   x   y   d to m1 (1.25,1.75)   d to m2 (4,2.75)   Cluster membership
  A          1   3          1.5                  3.25          m1
  B          3   3          3                    1.25          m2
  C          4   3          4                    0.25          m2
  D          5   3          5                    1.25          m2
  E          1   2          0.5                  3.75          m1
  F          4   2          3                    0.75          m2
  G          1   1          1                    4.75          m1
  H          2   1          1.5                  3.75          m1
Exercise – Iteration 3

[STEP 4] The resulting members are cluster1 = {A, E, G, H} and
cluster2 = {B, C, D, F}. The new SSE and MSE values are:

  $$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$$

SSE = [(1-1.25)² + (3-1.75)²] + [(1-1.25)² + (2-1.75)²] + [(1-1.25)² + (1-1.75)²]
    + [(2-1.25)² + (1-1.75)²] + [(3-4)² + (3-2.75)²] + [(4-4)² + (3-2.75)²]
    + [(5-4)² + (3-2.75)²] + [(4-4)² + (2-2.75)²]
SSE = 1.625 + 0.125 + 0.625 + 1.125 + 1.0625 + 0.0625 + 1.0625 + 0.5625 = 6.25
MSE = 6.25/8 = 0.78125

[STEP 5] The new centroids are
cluster1 = [(1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4] = (1.25, 1.75) and
cluster2 = [(3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4] = (4, 2.75)
Final Result

™ As the table shows, cluster memberships do not change
between iterations 2 and 3, so the algorithm stops.
ƒ The final results are:
ƒ cluster1 = {A, E, G, H}, and
ƒ cluster2 = {B, C, D, F}
ƒ with SSE = 6.25 and MSE = 0.78125
ƒ after 3 iterations

44
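As a sanity check (an addition, not part of the original exercise), the same result can be reproduced with scikit-learn's KMeans, assuming the library is installed; starting from the same initial centroids it converges to the same two clusters with SSE (inertia) = 6.25:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],    # A, B, C, D
              [1, 2], [4, 2], [1, 1], [2, 1]])   # E, F, G, H
init = np.array([[1, 1], [2, 1]])                # same starting centroids as the exercise

km = KMeans(n_clusters=2, init=init, n_init=1, max_iter=100).fit(X)
print(km.labels_)           # e.g. [0 1 1 1 0 1 0 0] -> {A, E, G, H} vs {B, C, D, F}
print(km.cluster_centers_)  # [[1.25 1.75] [4.   2.75]]
print(km.inertia_)          # 6.25  (sum of squared errors)
```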
Comments on the K-Means Method

™ Strength:
ƒ Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
ƒ Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
™ Comment: Often terminates at a local optimum
™ Weakness
ƒ Applicable only to objects in a continuous n-dimensional space
• Use the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
ƒ Need to specify k, the number of clusters, in advance (there are ways
to automatically determine the best k; see Hastie et al., 2009)
ƒ Sensitive to noisy data and outliers
ƒ Not suitable for discovering clusters with non-convex shapes

45
Variations of the K-Means Method

™ Most variants of k-means differ in


ƒ Selection of the initial k means
ƒ Dissimilarity calculations
ƒ Strategies to calculate cluster means

™ Handling categorical data: k-modes


ƒ Replacing means of clusters with modes
ƒ Using new dissimilarity measures to deal with
categorical objects
ƒ Using a frequency-based method to update modes
of clusters
ƒ A mixture of categorical and numerical data: the k-prototype method
46
What Is the Problem of the K-Means
Method?
™ The k-means algorithm is sensitive to outliers!
ƒ Since an object with an extremely large value may
substantially distort the distribution of the data
™ K-Medoids:
ƒ Instead of taking the mean value of the objects in a cluster as
a reference point, a medoid can be used, which is the most
centrally located object in a cluster

47
PAM: A Typical K-Medoids
Algorithm
(Figure: PAM with K = 2. Arbitrarily choose k objects as the initial medoids;
assign each remaining object to the nearest medoid (total cost = 20).
Randomly select a non-medoid object O_random, compute the total cost of
swapping a medoid with O_random (total cost = 26), and swap only if the
quality is improved. Repeat the loop until no change.)
48
The K-Medoid Clustering Method
™ K-Medoids Clustering: Find representative objects
(medoids) in clusters
™ PAM (Partitioning Around Medoids, Kaufmann &
Rousseeuw 1987)
ƒ Starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids if
it improves the total distance of the resulting clustering
ƒ PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
™ Efficiency improvement on PAM
ƒ CLARA (Kaufmann & Rousseeuw, 1990): PAM on
samples
ƒ CLARANS (Ng & Han, 1994): Randomized re-sampling

49
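A naive Python sketch of the PAM idea (greedy medoid/non-medoid swaps that reduce the total distance); this is an illustrative toy version, not the optimized PAM, CLARA, or CLARANS algorithms:

```python
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """Naive PAM sketch: start from random medoids and greedily accept the
    (medoid, non-medoid) swap that most reduces the total cost, i.e. the
    sum of distances of every object to its nearest medoid."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = o                       # try swapping medoid mi with object o
                c = cost(trial)
                if c < best:
                    best, medoids, improved = c, trial, True
        if not improved:                            # no swap improves the clustering
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, best
```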
Hierarchical Clustering
™ Use distance matrix as clustering criteria
™ This method does not require the number of clusters k
as an input, but needs a termination condition

(Figure: agglomerative clustering (AGNES) works from step 0 to step 4,
merging the objects a, b, c, d, e into ab, de, cde, and finally abcde;
divisive clustering (DIANA) runs the same steps in reverse, from step 4
back to step 0)
50
AGNES (Agglomerative Nesting)
™ Introduced in Kaufmann and Rousseeuw (1990)
™ Implemented in statistical packages, e.g., Splus
S l
™ Use the single-link method and the dissimilarity matrix
™ Merge nodes that have the least dissimilarity
™ Go on in a non-descending fashion
™ Eventually all nodes belong to the same cluster

51
Dendrogram: Shows How Clusters
are Merged

Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; then each connected
component forms a cluster
(Figure: a dendrogram over objects 3, 6, 4, 1, 2, 5, with merge heights
ranging from 0 up to about 0.25)
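A short SciPy sketch of agglomerative (AGNES-style) clustering on made-up 2-D points; linkage builds the merge tree and fcluster cuts the dendrogram at a chosen level:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy 2-D data; method='single' gives single-link (AGNES-style) merging
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
Z = linkage(X, method='single', metric='euclidean')

labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)

# dendrogram(Z) draws the merge tree; cutting it at a height yields the clusters
```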
DIANA (Divisive Analysis)

™Introduced in Kaufmann and Rousseeuw (1990)


™Implemented in statistical analysis packages,
e.g., Splus
™Inverse order of AGNES
™Eventually each node forms a cluster on its own
53
Distance between Clusters

™ Single link: smallest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
™ Complete link: largest distance between an element in one
cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
™ Average: avg distance between an element in one cluster and
an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
™ Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
™ Medoid: distance between the medoids of two clusters, i.e.,
dist(Ki, Kj) = dist(Mi, Mj)
ƒ Medoid: a chosen, centrally located object in the cluster
54
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
™ Centroid: the “middle” of a cluster

  $$C_m = \frac{\sum_{i=1}^{N} t_i}{N}$$

™ Radius: square root of the average distance from any
point of the cluster to its centroid

  $$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_i - C_m)^2}{N}}$$

™ Diameter: square root of the average mean squared
distance between all pairs of points in the cluster

  $$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_i - t_j)^2}{N(N-1)}}$$
55
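A small NumPy sketch computing these three quantities for a toy cluster (the four example points are made up):

```python
import numpy as np

def centroid_radius_diameter(points):
    """Centroid, radius and diameter of a cluster of N numeric points,
    following the formulas above (RMS distances; diameter averaged over
    the N(N-1) ordered pairs)."""
    X = np.asarray(points, dtype=float)
    N = len(X)
    cm = X.mean(axis=0)                                          # centroid C_m
    Rm = np.sqrt(((X - cm) ** 2).sum(axis=1).mean())             # radius R_m
    diff = X[:, None, :] - X[None, :, :]
    Dm = np.sqrt((diff ** 2).sum(axis=2).sum() / (N * (N - 1)))  # diameter D_m
    return cm, Rm, Dm

print(centroid_radius_diameter([[1, 1], [2, 1], [1, 2], [2, 2]]))
```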
Extensions to Hierarchical
Clustering
™ Major weakness of agglomerative clustering methods
ƒ Can never undo what was done previously
ƒ Do not scale well: time complexity of at least O(n²), where
n is the total number of objects
™ Integration of hierarchical & distance-based clustering
ƒ BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
ƒ CHAMELEON (1999): hierarchical clustering using
dynamic modeling

56
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
™ Zhang, Ramakrishnan & Livny, SIGMOD’96
™ Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
ƒ Phase 1: scan DB to build an initial in-memory CF tree (a multi-
level compression of the data that tries to preserve the inherent
clustering structure of the data)
ƒ Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
™ Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
™ Weakness: handles only numeric data, and sensitive to the order of
the data records
57
Clustering Feature Vector in BIRCH

™ Clustering Feature (CF): CF = (N, LS, SS)
™ N: number of data points
™ LS: linear sum of the N points:

  $$LS = \sum_{i=1}^{N} X_i$$

™ SS: square sum of the N points:

  $$SS = \sum_{i=1}^{N} X_i^2$$

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))

58
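A tiny Python sketch that computes the CF triple for the five example points and illustrates the additivity property that lets BIRCH merge sub-clusters by simply adding their CFs:

```python
import numpy as np

def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
N, LS, SS = clustering_feature(pts)
print(N, LS, SS)   # 5 [16. 30.] [ 54. 190.]  -- matches CF = (5, (16,30), (54,190))

# CFs are additive: merging two subclusters just adds their CF components,
# which is what lets BIRCH maintain the CF-tree incrementally
N1, LS1, SS1 = clustering_feature(pts[:2])
N2, LS2, SS2 = clustering_feature(pts[2:])
print(N1 + N2, LS1 + LS2, SS1 + SS2)   # same triple as above
```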
CF-Tree in BIRCH

™ Clustering feature:
ƒ Summary of the statistics for a given subcluster: the
0-th, 1st, and 2nd moments of the subcluster from the
statistical point of view
ƒ Registers crucial measurements for computing clusters
and utilizes storage efficiently
™ A CF tree is a height-balanced tree that stores the
clustering features for a hierarchical clustering
ƒ A nonleaf node in a tree has descendants or “children”
ƒ The nonleaf nodes store sums of the CFs of their
children
™ A CF tree has two parameters
ƒ Branching factor: max # of children
ƒ Threshold: max diameter of sub-clusters stored at the leaf nodes
59
The CF Tree Structure
(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The
root and non-leaf nodes hold entries CF1 … CF6, each pointing to a child
node; the leaf nodes hold CF entries and are chained together with
prev/next pointers)
60
The Birch Algorithm

™ Cluster diameter:

  $$D = \sqrt{\frac{1}{n(n-1)} \sum (x_i - x_j)^2}$$
™ For each point in the input


ƒ Find closest leaf entry
ƒ Add point to leaf entry and update CF
ƒ If entry diameter > max_diameter, then split leaf, and possibly
parents
™ Algorithm is O(n)
™ Concerns
ƒ Sensitive to the insertion order of data points
ƒ Since we fix the size of leaf nodes, clusters may not be so natural
ƒ Clusters tend to be spherical given the radius and diameter
measures
61
CHAMELEON: Hierarchical
Clustering Using Dynamic Modeling
(1999)
™ CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
™ Measures the similarity based on a dynamic model
ƒ Two clusters are merged only if the interconnectivity and
closeness (proximity) between two clusters are high
relative to the internal interconnectivity of the clusters and
closeness of items within the clusters
™ Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into a
large number of relatively small sub
sub-clusters
clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining these
sub-clusters
62
KNN Graphs & Interconnectivity
™ k-nearest neighbor graphs from the original data in 2D:

™ EC{Ci,Cj}: the absolute inter-connectivity between Ci
and Cj: the sum of the weights of the edges that
connect vertices in Ci to vertices in Cj
™ Internal inter-connectivity of a cluster Ci: the size of its
min-cut bisector ECCi (i.e., the weighted sum of the edges
that partition the graph into two roughly equal parts)
™ Relative Inter-connectivity (RI): the absolute inter-connectivity
between Ci and Cj, normalized by the average internal
inter-connectivity of Ci and Cj
63
Relative Closeness & Merge of Sub-
Clusters
™ Relative closeness between a pair of clusters Ci and
Cj : the absolute closeness between Ci and Cj
normalized w.r.t. the internal closeness of the two
clusters Ci and Cj

™ The relative closeness is computed from the average weights of the
edges that belong to the min-cut bisectors of clusters Ci and Cj,
respectively, and the average weight of the edges that connect
vertices in Ci to vertices in Cj
™ Merge Sub-Clusters:
ƒ Merges only those pairs of clusters whose RI and RC are
both above some user-specified thresholds
ƒ Merge those maximizing the function that combines RI and
RC
64
Overall Framework of CHAMELEON
(Figure: the data set is first converted into a sparse K-NN graph, in which
p and q are connected if q is among the top k closest neighbors of p; the
graph is then partitioned into many small sub-clusters, which are merged
into the final clusters using relative interconnectivity (connectivity of c1
and c2 over their internal connectivity) and relative closeness (closeness
of c1 and c2 over their internal closeness))
65
CHAMELEON (Clustering Complex
Objects)

66
Density-Based Clustering Methods

™ Clustering based on density (local cluster criterion), such


as density-connected points
™ Major features:
ƒ Discover clusters of arbitrary shape
ƒ Handle noise
ƒ One scan
ƒ Need density parameters as termination condition
™ Several interesting studies:
ƒ DBSCAN: Ester, et al. (KDD’96)
ƒ OPTICS: Ankerst, et al. (SIGMOD’99)
ƒ DENCLUE: Hinneburg & D. Keim (KDD’98)
ƒ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
67
Density-Based Clustering: Basic
Concepts
™ Two parameters:
ƒ Eps: Maximum radius of the neighbourhood
ƒ MinPts: Minimum number of points in an Eps-
neighbourhood of that point
™ NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
™ Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
ƒ p belongs to NEps(q)
ƒ core point condition: |NEps(q)| ≥ MinPts

(Figure: point q with p inside its Eps-neighbourhood, Eps = 1 cm, MinPts = 5)
68
Density-Reachable and Density-
Connected
™ Density-reachable:
ƒ A point p is density-reachable
from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn,
p1 = q, pn = p such that pi+1 is
directly density-reachable from pi

™ Density-connected:
ƒ A point p is density-connected to
a point q w.r.t. Eps, MinPts if
there is a point o such that both p
and q are density-reachable from
o w.r.t. Eps and MinPts
69
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
™Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
™Discovers clusters of arbitrary shape in spatial
databases with noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5)

70
DBSCAN: The Algorithm

1. Arbitrarily select a point p


2. Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from
p and DBSCAN visits the next point of the database
5. Continue the process until all of the points have been
processed

If a spatial index is used, the computational complexity of DBSCAN is


O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²)

71
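A brief usage sketch with scikit-learn's DBSCAN implementation (a library assumption, not part of the lecture, on made-up data); points labelled -1 are the noise/outlier points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points that should come out as noise
X = np.vstack([rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
               rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
               rng.uniform(-2, 7, size=(5, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # Eps and MinPts of the slides
print(set(db.labels_))                        # cluster ids; -1 marks noise/outliers
```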
DBSCAN: Sensitive to
Parameters

http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
72
OPTICS: A Cluster-Ordering Method
(1999)
™ OPTICS: Ordering Points To Identify the Clustering
Structure
ƒ Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
ƒ Produces a special order of the database wrt its
density-based clustering structure
ƒ This cluster-ordering contains info equiv to the
density-based clusterings corresponding to a broad
range of parameter settings
ƒ Good for both automatic and interactive cluster
analysis, including finding intrinsic clustering structure
ƒ Can be represented graphically or using visualization techniques
73
OPTICS: Some Extension from
DBSCAN
™ Index-based: k = # of dimensions, N: # of points
ƒ Complexity: O(N*logN)
™ Core distance of an object p: the smallest value ε′ such that
the ε′-neighborhood of p has at least MinPts objects.
Let Nε(p) be the ε-neighborhood of p, where ε is a distance value:
  Core-distance_{ε,MinPts}(p) =
    Undefined, if card(Nε(p)) < MinPts
    MinPts-distance(p), otherwise
™ Reachability distance of object p from core object q: the
minimum radius value that makes p density-reachable from q:
  Reachability-distance_{ε,MinPts}(p, q) =
    Undefined, if q is not a core object
    max(core-distance(q), distance(q, p)), otherwise
74
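A short usage sketch with scikit-learn's OPTICS implementation (a library assumption, on made-up data); the reachability values taken in the cluster order form the reachability plot shown on the following slides:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.3, (60, 2)),
               rng.normal((4, 4), 0.8, (60, 2))])

opt = OPTICS(min_samples=5).fit(X)
# reachability_[ordering_] is the reachability plot; valleys correspond to clusters
print(opt.reachability_[opt.ordering_][:10])
print(set(opt.labels_))
```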
Core Distance & Reachability
Distance

75
(Figure: reachability plot; the reachability-distance of the objects, taken in
cluster order, with the first value undefined and a threshold ε marked)
76
Density-Based Clustering: OPTICS &
Applications:
http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo

77
Grid-Based Clustering Method

™Using multi-resolution grid data structure


™Several interesting methods
ƒ STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
ƒ WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
• A multi-resolution clustering approach
using wavelet method
ƒ CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
78
STING: A Statistical Information Grid
Approach
™Wang, Yang and Muntz (VLDB’97)
™The spatial area is divided into rectangular cells
™There are several levels of cells corresponding
to different levels of resolution

79
The STING Clustering Method
™ Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
™ Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
™ Parameters of higher-level cells can be easily calculated from
the parameters of lower-level cells
ƒ count, mean, s, min, max
ƒ type of distribution: normal, uniform, etc.
™ Use a top-down approach to answer spatial data queries
™ Start from a pre-selected layer, typically with a small number
of cells
™ For each cell in the current level compute the confidence
interval

80
STING Algorithm and Its
Analysis
™ Remove the irrelevant cells from further consideration
™ When finished examining the current layer, proceed to the
next lower level
™ Repeat this process until the bottom layer is reached
™ Advantages:
ƒ Query-independent, easy to parallelize, incremental
update
ƒ O(K), where K is the number of grid cells at the lowest
level
™ Disadvantages:
ƒ All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
81
Assessing Clustering Tendency
™ Assess if non-random structure exists in the data by
measuring the probability that the data is generated by a
uniform data distribution
™ Test spatial randomness by a statistical test: the Hopkins statistic
ƒ Given a dataset D regarded as a sample of a random variable o,
determine how far away o is from being uniformly distributed in
the data space
ƒ Sample n points, p1, …, pn, uniformly from D. For each pi, find
its nearest neighbor in D: xi = min{dist (pi, v)} where v in D
ƒ Sample n points, q1, …, qn, uniformly from D. For each qi, find
its nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D
and v ≠ qi
ƒ Calculate the Hopkins statistic H from ∑xi and ∑yi (one common
form is sketched after this slide)

ƒ If D is uniformly distributed, ∑ xi and ∑ yi will be close to each


other and H is close to 0.5. If D is clustered, H is close to 1

82
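A Python sketch of one common form of the Hopkins statistic, chosen to match the slide's interpretation (H ≈ 0.5 for uniform data, H → 1 for clustered data): here the x_i come from points generated uniformly over the bounding box of the data and the y_i from sampled data points. This exact formulation is an assumption, since conventions differ in the literature:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=0):
    """One common form of the Hopkins statistic, H = sum(x) / (sum(x) + sum(y)):
    x_i = distance from a point drawn uniformly over the bounding box of D
          to its nearest neighbour in D
    y_i = distance from a sampled data point to its nearest other point in D
    H ~ 0.5 for uniformly distributed data, H -> 1 for clustered data."""
    D = np.asarray(D, dtype=float)
    n = min(n, len(D))
    rng = np.random.default_rng(seed)
    tree = cKDTree(D)

    lo, hi = D.min(axis=0), D.max(axis=0)
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))       # artificial uniform points
    x = tree.query(P, k=1)[0]

    Q = D[rng.choice(len(D), size=n, replace=False)]    # sample of real data points
    y = tree.query(Q, k=2)[0][:, 1]                     # k=2: skip the point itself

    return x.sum() / (x.sum() + y.sum())
```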
Determine the Number of Clusters

™ Empirical method
ƒ # of clusters ≈ √(n/2) for a dataset of n points
™ Elbow method
ƒ Use the turning point in the curve of the sum of within-cluster variance
w.r.t. the # of clusters
™ Cross validation method
ƒ Divide a given data set into m parts
ƒ Use m – 1 parts to obtain a clustering model
ƒ Use the remaining part to test the quality of the clustering
• E.g., for each point in the test set, find the closest centroid,
and use the sum of squared distances between all points in
the test set and their closest centroids to measure how well
the model fits the test set
ƒ For any k > 0, repeat it m times, compare the overall quality
measure w.r.t. different k’s, and find the # of clusters that fits the
data the best
83
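A short scikit-learn sketch of the elbow method on made-up data: the within-cluster SSE (KMeans' inertia_) is computed for increasing k, and the turning point of the curve suggests the number of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three made-up blobs, so the elbow should appear around k = 3
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

for k in range(1, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))   # plot these values and look for the "elbow"
```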
Measuring Clustering Quality

™ Two methods: extrinsic vs. intrinsic


™ Extrinsic: supervised, i.e., the ground truth is available
ƒ Compare a clustering against the ground truth using
a certain clustering quality measure
ƒ Ex. BCubed precision and recall metrics
™ Intrinsic: unsupervised, i.e., the ground truth is unavailable
ƒ Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are
ƒ Ex. Silhouette coefficient

84
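A brief scikit-learn sketch of an intrinsic measure, the silhouette coefficient, computed on made-up data for several values of k (the library usage is an assumption, not part of the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (60, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

# The silhouette coefficient (in [-1, 1]) rewards clusterings that are
# compact and well separated; no ground truth is needed
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```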
Measuring Clustering Quality:
Extrinsic Methods
™ Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
™ Q is good if it satisfies the following 4 essential criteria
ƒ Cluster homogeneity: the purer, the better
ƒ Cluster completeness: should assign objects belonging to
the same category in the ground truth to the same
cluster
ƒ Rag bag: putting a heterogeneous object into a pure
cluster should be penalized more than putting it into a
rag bag (i.e., “miscellaneous” or “other” category)
ƒ Small cluster preservation: splitting a small category
into pieces is more harmful than splitting a large
category into pieces
85
Summary
™ Cluster analysis groups objects based on their similarity and has
wide applications
™ Measure of similarity can be computed for various types of data
™ Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
™ K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
™ Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
™ DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
™ STING and CLIQUE are grid-based methods, where CLIQUE is
also a subspace clustering algorithm
™ Quality of clustering results can be evaluated in various ways
86
Summary
Clustering Method      General Characteristics

Partitioning Methods   o Find mutually exclusive clusters of spherical shape
                       o Distance-based
                       o May use mean or medoid (etc.) to represent cluster center
                       o Effective for small to medium size data sets

Hierarchical Methods   o Clustering is a hierarchical decomposition (i.e., multiple levels)
                       o Cannot correct erroneous merges or splits
                       o May incorporate other techniques like microclustering or
                         consider object “linkages”

Density-based Methods  o Can find arbitrarily shaped clusters
                       o Clusters are dense regions of objects in space that are
                         separated by low-density regions
                       o Cluster density: each point must have a minimum number of
                         points within its “neighborhood”
                       o May filter out outliers

Grid-based Methods     o Use a multiresolution grid data structure
                       o Fast processing time (typically independent of the number of
                         data objects, yet dependent on grid size)
87
Reference
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and
   Techniques, Third Edition, Elsevier, 2012
2. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical
   Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
3. Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining
   Use Cases and Business Analytics Applications, CRC Press
   Taylor & Francis Group, 2014
4. Daniel T. Larose, Discovering Knowledge in Data: An Introduction
   to Data Mining, John Wiley & Sons, 2005
5. Ethem Alpaydin, Introduction to Machine Learning, 3rd ed., MIT
   Press, 2014
6. Florin Gorunescu, Data Mining: Concepts, Models and Techniques,
   Springer, 2011
7. Oded Maimon and Lior Rokach, Data Mining and Knowledge
   Discovery Handbook, Second Edition, Springer, 2010
8. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances
   in Data Mining of Enterprise Data: Algorithms and Applications,
   World Scientific, 2007
88
