ClusterValidation PDF
Cluster Validity
For supervised classification we have a variety of
measures to evaluate how good our model is
Accuracy, precision, recall
[Figure: four scatter plots (x vs. y, both axes from 0 to 1) with panels Random Points, DBSCAN, K-means, and Complete Link.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e.,
distinguishing whether non-random structure actually exists
in the data.
2. Comparing the results of a cluster analysis to externally
known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the
data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster
analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
[Figure: two scatter plots (x vs. y, both axes from 0 to 1).]
[Figure: similarity matrix (Points vs. Points, color scale Similarity) beside the corresponding scatter plot (x vs. y).]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp
[Figure: DBSCAN on random data. Similarity matrix (Points vs. Points, color scale Similarity from 0 to 1) beside the scatter plot (x vs. y).]
[Figure: K-means on random data. Similarity matrix (Points vs. Points, color scale Similarity from 0 to 1) beside the scatter plot (x vs. y).]
[Figure: Complete Link on random data. Similarity matrix (Points vs. Points, color scale Similarity from 0 to 1) beside the scatter plot (x vs. y).]
[Figure: DBSCAN similarity matrix for a larger data set (axes from 500 to 3000 points, color scale from 0 to 1), shown with a plot whose regions are labeled 1 to 7.]
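The visual check described on these slides can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `ordered_similarity_matrix` is hypothetical, Euclidean distance is assumed as the proximity, and distances are rescaled to similarities with s = 1 - (d - min d) / (max d - min d), which is one common choice rather than something the slides above specify. The rows and columns are then sorted by cluster label so that well-separated clusters show up as bright diagonal blocks.

```python
import numpy as np

def ordered_similarity_matrix(X, labels):
    """Return the similarity matrix with rows/columns sorted by cluster label.

    Assumption: similarity is a min-max rescaling of Euclidean distance,
    so 1 means identical points and 0 means the farthest pair.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    sim = 1.0 - (dist - dist.min()) / (dist.max() - dist.min())
    order = np.argsort(labels, kind="stable")  # group points of the same cluster
    return sim[np.ix_(order, order)], labels[order]

# Hypothetical data: three tight blobs in the unit square, presented in shuffled order.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
points = np.vstack([rng.normal(c, 0.05, size=(30, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 30)
perm = rng.permutation(len(points))

sim_sorted, sorted_labels = ordered_similarity_matrix(points[perm], labels[perm])

# A crisp block-diagonal matrix has high similarity inside blocks, low outside.
same_cluster = sorted_labels[:, None] == sorted_labels[None, :]
within = sim_sorted[same_cluster].mean()
between = sim_sorted[~same_cluster].mean()
print(f"mean within-cluster similarity:  {within:.3f}")
print(f"mean between-cluster similarity: {between:.3f}")
```

Passing `sim_sorted` to an image plot (for example a heat map) gives the kind of picture shown in the figures; for random data the gap between the two means shrinks and the blocks blur.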
Internal Measures: SSE
Clusters in more complicated figures aren’t well separated
Internal Index: Used to measure the goodness of a
clustering structure without respect to external information
- SSE
SSE is good for comparing two clusterings or two clusters
(average SSE).
Can also be used to estimate the number of clusters
[Figure: scatter plot of a data set (left) and its SSE plotted against the number of clusters K, for K from 2 to 30 (right).]
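Using SSE to estimate the number of clusters amounts to running a clustering algorithm for several values of K and looking for the point where the SSE curve stops dropping sharply. The sketch below assumes a basic K-means (Lloyd's iterations) with a deterministic farthest-first initialization, chosen here only to keep the example reproducible; it is not the procedure used to produce the figure. The names `farthest_first_init` and `kmeans_sse` and the synthetic data are hypothetical.

```python
import numpy as np

def farthest_first_init(X, k):
    """Pick k starting centroids: X[0], then repeatedly the point farthest
    from all centroids chosen so far (an assumption made for determinism)."""
    chosen = [0]
    min_d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(min_d2.argmax())
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen].copy()

def kmeans_sse(X, k, n_iter=100):
    """Run basic K-means and return the SSE of the final clustering."""
    X = np.asarray(X, dtype=float)
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # SSE: squared distance of every point to its nearest centroid, summed.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).sum())

# Hypothetical data: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in centers])

sse_by_k = {k: kmeans_sse(X, k) for k in range(1, 7)}
for k, value in sse_by_k.items():
    print(f"K={k}: SSE={value:.2f}")
```

On data like this the SSE falls steeply up to K = 3 and only slowly afterwards, which is the "knee" one looks for in the SSE-versus-K plot. On less clean data the knee can be ambiguous.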
Internal Measures: SSE
SSE curve for a more complicated data set
[Figure: data set with labeled clusters.]
[Figure: scatter plot (x vs. y, both axes from 0 to 1) beside a histogram of SSE values ranging from about 0.016 to 0.034.]
Statistical Framework for Correlation
Correlation of incidence and proximity matrices for the
K-means clusterings of the following two data sets.
[Figure: scatter plots of the two data sets (x vs. y, both axes from 0 to 1).]
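The correlation between the incidence and proximity matrices can be sketched as follows. The incidence matrix has a 1 for every pair of points assigned to the same cluster and a 0 otherwise; the proximity matrix is assumed here to hold Euclidean distances, so a good clustering yields a strongly negative correlation. Both matrices are symmetric, so only the entries above the diagonal are correlated. The function name, the synthetic data, and the nearest-center labeling used as a stand-in for a K-means result are all hypothetical.

```python
import numpy as np

def incidence_proximity_correlation(X, labels):
    """Pearson correlation between the incidence matrix (1 if two points share
    a cluster) and the proximity matrix (Euclidean distance, by assumption)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    proximity = np.sqrt((diff ** 2).sum(axis=-1))
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    upper = np.triu_indices(len(X), k=1)  # each pair once, diagonal excluded
    return float(np.corrcoef(incidence[upper], proximity[upper])[0, 1])

rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])

# Data set 1: three tight blobs with their true labels.
blobs = np.vstack([rng.normal(c, 0.05, size=(30, 2)) for c in centers])
blob_labels = np.repeat([0, 1, 2], 30)

# Data set 2: uniform random points, labeled by nearest center
# (a stand-in for what a 3-cluster K-means might return).
uniform = rng.random((90, 2))
d2 = ((uniform[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
uniform_labels = d2.argmin(axis=1)

corr_blobs = incidence_proximity_correlation(blobs, blob_labels)
corr_uniform = incidence_proximity_correlation(uniform, uniform_labels)
print(f"clustered data: {corr_blobs:.3f}")
print(f"random data:    {corr_uniform:.3f}")
```

The clustered data should give a correlation much closer to -1 than the random data. If the proximity matrix held similarities instead of distances, the sign would flip.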
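BSS = \sum_i |C_i| (m - m_i)^2
where |C_i| is the number of points in cluster i, m_i is the centroid of cluster i, and m is the overall mean of the data.
[Figure: cohesion and separation.]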
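A small worked example makes the BSS formula concrete. The sketch below computes the within-cluster SSE (cohesion) and the between-cluster sum of squares BSS (separation) for a given labeling, and checks the standard identity that their sum equals the total sum of squares about the overall mean. The function name `sse_and_bss` and the four-point data set are illustrative assumptions.

```python
import numpy as np

def sse_and_bss(X, labels):
    """Return (SSE, BSS) for a labeling, using cluster means as centroids."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.mean(axis=0)  # overall mean
    sse = 0.0
    bss = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        m_i = members.mean(axis=0)  # centroid of cluster c
        sse += ((members - m_i) ** 2).sum()
        bss += len(members) * ((m - m_i) ** 2).sum()
    return float(sse), float(bss)

# Four points on a line: 1, 2, 4, 5, split into clusters {1, 2} and {4, 5}.
X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])

sse, bss = sse_and_bss(X, labels)
tss = float(((X - X.mean(axis=0)) ** 2).sum())
print(sse, bss, tss)  # 1.0 9.0 10.0
```

Here m = 3, m_1 = 1.5, and m_2 = 4.5, so SSE = 4 x 0.25 = 1 and BSS = 2 x 1.5^2 + 2 x 1.5^2 = 9, which add up to the total sum of squares of 10. Putting all four points in one cluster gives SSE = 10 and BSS = 0, the same total.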
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines the ideas of cohesion and
separation, and it can be computed for individual points as well as
for clusters and clusterings
For an individual point, i
Calculate a = average distance of i to the points in its cluster
Calculate b = min (average distance of i to points in another cluster)
The silhouette coefficient for a point is then given by