
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters

Cluster Validation
• Cluster validation
– Assess the quality and reliability of clustering results.

• Why validation?
– To avoid finding clusters formed by chance
– To compare clustering algorithms
– To choose clustering parameters
• e.g., the number of clusters in the K-means algorithm

Clusters found in Random Data
[Figure: four scatter plots on x and y axes from 0 to 1. Panels: Random Points, DBSCAN, K-means, Complete Link.]

K K Aggarwal, Dept of OR, DU 1


Aspects of Cluster Validation
• Comparing the clustering results to ground truth (externally known results).
– External Index
• Evaluating the quality of clusters without reference to external information.
– Use only the data
– Internal Index
• Determining the reliability of clusters.
– To what confidence level the clusters are not formed by chance
– Statistical framework

Cluster validation process
• Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion
– How to be “quantitative”: To employ the measures.
– How to be “objective”: To validate the measures!
[Figure: INPUT: DataSet(X) → Clustering Algorithm → Partitions P, Codebook C → Validity Index → m*, repeated for different numbers of clusters m]

Measuring clustering validity
Internal Index:
• Validate without external info
• Solve the number of clusters
External Index:
• Validate against ground truth
• Compare two clusterings (how similar they are)

Internal indexes



Internal indexes
• Choose the number of clusters that minimizes (or maximizes) an internal index:
– Rule of thumb
One simple rule of thumb sets the number to $k \approx \sqrt{n/2}$ with n as the number of objects (data points).
– Variances of within cluster and between clusters
– Rate-distortion method
– F-ratio
– Davies-Bouldin index (DBI)
– Bayesian Information Criterion (BIC)
– Silhouette Coefficient

Mean square error (MSE)
• The more clusters the smaller the MSE.
• Small knee-point near the correct value.
• But how to detect?
[Figure: MSE versus number of clusters (5 to 25) for dataset S2. Knee-point between 14 and 15 clusters.]
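One simple way to look for a knee-point, offered here only as a sketch and not as the method used in the slides, is to pick the number of clusters where the MSE curve bends most sharply, that is, where its second difference is largest. The function name `knee_by_second_difference` and the MSE values below are made up for illustration.

```python
def knee_by_second_difference(mse_by_k):
    """Return the k at which the MSE curve bends most sharply.

    mse_by_k maps a number of clusters k to the MSE obtained with k clusters.
    At least three consecutive values of k are needed. The bend is measured by
    the second difference MSE(k-1) - 2*MSE(k) + MSE(k+1).
    """
    ks = sorted(mse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = mse_by_k[prev_k] - 2 * mse_by_k[k] + mse_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical MSE values (not taken from dataset S2): the curve drops
# steeply until k = 4 and flattens afterwards.
toy_mse = {2: 10.0, 3: 6.0, 4: 2.0, 5: 1.8, 6: 1.7}
knee = knee_by_second_difference(toy_mse)
print(knee)
```

On noisy curves the second difference can be dominated by small fluctuations, which is one reason the indexes on the following slides are used instead of reading the knee directly.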

From MSE to cluster validity
• Minimize within cluster variance (MSE)
• Maximize between cluster variance
[Figure: Intra-cluster variance is minimized. Inter-cluster variance is maximized.]

Sum-of-squares based indexes
• $SSW / k$ ---- Ball and Hall (1965)
• $k^2 |W|$ ---- Marriot (1971)
• $\frac{SSB / (k - 1)}{SSW / (N - k)}$ ---- Calinski & Harabasz (1974)
• $\log(SSB / SSW)$ ---- Hartigan (1975)
• $d \log\left(\sqrt{SSW / (d N^2)}\right) + \log(k)$ ---- Xu (1997)
(d is the dimension of data; N is the size of data; k is the number of clusters)

SSW = Sum of squares within the clusters (=MSE)
SSB = Sum of squares between the clusters
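As a rough illustration, three of the indexes above can be evaluated directly once SSW and SSB are known. The function names and the input values (SSW = 1, SSB = 9, N = 4, k = 2) are chosen only for this sketch, and the natural logarithm is assumed for the Hartigan index because the slide does not state a base.

```python
import math


def ball_hall(ssw, k):
    """Ball and Hall (1965): SSW / k."""
    return ssw / k


def calinski_harabasz(ssw, ssb, n, k):
    """Calinski & Harabasz (1974): (SSB / (k - 1)) / (SSW / (N - k))."""
    return (ssb / (k - 1)) / (ssw / (n - k))


def hartigan(ssw, ssb):
    """Hartigan (1975): log(SSB / SSW). Natural log is an assumption here."""
    return math.log(ssb / ssw)


# Illustrative inputs only.
ssw, ssb, n, k = 1.0, 9.0, 4, 2
bh = ball_hall(ssw, k)
ch = calinski_harabasz(ssw, ssb, n, k)
ha = hartigan(ssw, ssb)
print(bh, ch, ha)
```

Note that the Calinski & Harabasz index needs k ≥ 2 and N > k, otherwise one of the denominators is zero.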



Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related are objects in a cluster
– Example: SSE
• Cluster Separation: Measure how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within cluster sum of squares (SSE)
$SSW = \sum_i \sum_{x \in C_i} (x - m_i)^2$
– Separation is measured by the between cluster sum of squares
$SSB = \sum_i |C_i| (m - m_i)^2$
– Where |Ci| is the size of cluster i
Total Variance = $\sigma(X) = SSW + SSB$

Internal Measures: Cohesion and Separation
• Example: SSE
– BSS + WSS = constant
[Figure: points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m1 = 1.5 and m2 = 4.5]
K=1 cluster:
$WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10$
$BSS = 4 \times (3 - 3)^2 = 0$
$Total = 10 + 0 = 10$
K=2 clusters:
$WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1$
$BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9$
$Total = 1 + 9 = 10$
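The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch for one-dimensional data; the helper names `mean` and `wss_bss` are introduced here for illustration only.

```python
def mean(values):
    return sum(values) / len(values)


def wss_bss(clusters):
    """Return (WSS, BSS) for a list of clusters of 1-D points."""
    all_points = [x for cluster in clusters for x in cluster]
    m = mean(all_points)  # overall mean
    # Cohesion: squared distance of every point to its own cluster mean.
    wss = sum((x - mean(cluster)) ** 2 for cluster in clusters for x in cluster)
    # Separation: squared distance of each cluster mean to the overall mean,
    # weighted by the cluster size |Ci|.
    bss = sum(len(cluster) * (m - mean(cluster)) ** 2 for cluster in clusters)
    return wss, bss


one_cluster = wss_bss([[1, 2, 4, 5]])     # K=1
two_clusters = wss_bss([[1, 2], [4, 5]])  # K=2
print(one_cluster)   # (10.0, 0.0)
print(two_clusters)  # (1.0, 9.0)
```

In both cases WSS + BSS equals 10, which is the constant total referred to in the example.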

F-ratio variance test
• Variance-ratio F-test
• Measures ratio of between-groups variance against the within-groups variance (original f-test)
• F-ratio (WB-index):
$F = \frac{k \cdot SSW}{SSB} = \frac{k \sum_{i=1}^{N} \| x_i - c_{p(i)} \|^2}{\sum_{j=1}^{k} n_j \| c_j - \bar{x} \|^2} = \frac{k \cdot SSW}{\sigma(X) - SSW}$

F-ratio for dataset S1
[Figure: F-ratio (x10^5) versus number of clusters (25 down to 5) for the PNN and IS algorithms, with the minimum marked]
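The F-ratio formula can be sketched in Python as follows. Here c_{p(i)} is read as the centroid of the cluster that point x_i is assigned to, n_j as the size of cluster j, and x̄ as the mean of all points. The function names and the four toy points are assumptions made for this illustration; they are not dataset S1.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def f_ratio(clusters):
    """F = k * SSW / SSB for a list of clusters (each a list of tuples)."""
    k = len(clusters)
    all_points = [p for cluster in clusters for p in cluster]
    x_bar = centroid(all_points)
    ssw = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    ssb = sum(len(c) * sq_dist(centroid(c), x_bar) for c in clusters)
    return k * ssw / ssb


# Toy data: the same four points split in a sensible and in a poor way.
good = [[(0, 0), (0, 1)], [(4, 0), (4, 1)]]
poor = [[(0, 0), (4, 0)], [(0, 1), (4, 1)]]
f_good = f_ratio(good)
f_poor = f_ratio(poor)
print(f_good, f_poor)  # 0.125 32.0
```

The sensible split gives the smaller value, which matches the use of the minimum of the F-ratio as the selection criterion.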



Davies-Bouldin index (DBI)
• Minimize intra cluster variance
• Maximize the distance between clusters
• Cost function combines the two as a ratio:
$R_{j,k} = \frac{MAE_j + MAE_k}{d(c_j, c_k)}$
$DBI = \frac{1}{M} \sum_{j=1}^{M} \max_{k \neq j} R_{j,k}$
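A minimal sketch of the DBI computation is given below. It assumes that MAE_j is the average Euclidean distance of the points of cluster j to its centroid c_j and that d is the Euclidean distance between centroids; the slide does not spell out either choice. The function names and the toy points are illustrative only.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def davies_bouldin(clusters):
    """DBI for a list of at least two clusters (each a list of tuples)."""
    cents = [centroid(c) for c in clusters]
    # Assumed meaning of MAE_j: mean distance of the points to their centroid.
    mae = [sum(math.dist(p, cj) for p in c) / len(c)
           for c, cj in zip(clusters, cents)]
    m = len(clusters)
    total = 0.0
    for j in range(m):
        # Worst-case (largest) ratio R_{j,k} over all other clusters k.
        total += max((mae[j] + mae[k]) / math.dist(cents[j], cents[k])
                     for k in range(m) if k != j)
    return total / m


good = [[(0, 0), (0, 1)], [(4, 0), (4, 1)]]
poor = [[(0, 0), (4, 0)], [(0, 1), (4, 1)]]
dbi_good = davies_bouldin(good)
dbi_poor = davies_bouldin(poor)
print(dbi_good, dbi_poor)
```

Compact, well-separated clusters give a small DBI, so the index is minimized. `math.dist` requires Python 3.8 or later.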

Silhouette coefficient [Kaufman&Rousseeuw, 1990]
• We need a quantitative method to assess the quality of a clustering...
• The silhouette value of a point is a measure of how similar a point is to points in its own cluster compared to points in other clusters
• Formal definition:
– a(i) is the average distance of the point i to the other points in its own cluster A
– d(i, C) is the average distance of the point i to the points in another cluster C
– b(i) is the minimal d(i, C) over all clusters other than A

Silhouette coefficient [Kaufman&Rousseeuw, 1990]
• Cohesion: measures how closely related are objects in a cluster
• Separation: measure how distinct or well-separated a cluster is from other clusters
[Figure: cohesion and separation]



Silhouette coefficient
[Figure: a point x with its cohesion a(x), the average distance in its own cluster, and its separation b(x), the minimal average distance to the other clusters]

Silhouette coefficient
• Cohesion a(x): average distance of x to all other vectors in the same cluster.
• Separation b(x): average distance of x to the vectors in other clusters. Find the minimum among the clusters.
• silhouette s(x):
$s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$
• $s(x) \in [-1, +1]$: -1 = bad, 0 = indifferent, 1 = good
• Silhouette coefficient (SC):
$SC = \frac{1}{N} \sum_{i=1}^{N} s(x_i)$
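The definitions of a(x), b(x), s(x) and SC translate fairly directly into code. The sketch below assumes Euclidean distance and at least two clusters, and it sets s(x) = 0 for a point that is alone in its cluster, a common convention that the slides do not mention. The function names and the toy points are illustrative.

```python
import math


def mean_dist(x, points):
    """Average Euclidean distance from x to the given points."""
    return sum(math.dist(x, p) for p in points) / len(points)


def silhouette_coefficient(clusters):
    """SC for a list of at least two clusters (each a list of tuples)."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for xi, x in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # assumed convention for singletons
                continue
            # Cohesion a(x): average distance to the other points of its cluster.
            a = mean_dist(x, cluster[:xi] + cluster[xi + 1:])
            # Separation b(x): smallest average distance to any other cluster.
            b = min(mean_dist(x, other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


good = [[(0, 0), (0, 1)], [(4, 0), (4, 1)]]
poor = [[(0, 0), (4, 0)], [(0, 1), (4, 1)]]
sc_good = silhouette_coefficient(good)
sc_poor = silhouette_coefficient(poor)
print(sc_good, sc_poor)
```

For these toy points the sensible split gives a clearly positive SC and the poor split a negative one, in line with the reading -1 = bad, 1 = good.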

Performance of Silhouette coefficient



Internal indexes
Soft partitions

Comparison of the indexes
K-means
