Applications of Cluster Analysis

• Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

  Discovered stock clusters (grouped by price movement):
  2: Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
  3: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
  4: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP

• Summarization
  – Reduce the size of large data sets

[Figure: Clustering precipitation in Australia]

What is not Cluster Analysis?

• Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
• Results of a query
  – Groupings are a result of an external specification
• Graph partitioning
  – Some mutual relevance and synergy, but areas are not identical
Types of Clusterings

• Partitional clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree

[Figure panels: Two Clusters, Four Clusters]
Partitional Clustering

[Figure: Original Points and A Partitional Clustering]

Hierarchical Clustering

[Figures: Traditional Hierarchical Clustering of points p1, p2, p3, p4 with its Traditional Dendrogram; Non-traditional Hierarchical Clustering with its Non-traditional Dendrogram]
Types of Clusters: Contiguity-Based

• Contiguous cluster (nearest neighbor or transitive)
  – A point in a cluster is closer to one or more other points in the cluster than to any point not in the cluster

Types of Clusters: Density-Based

• Density-based cluster
  – A cluster is a dense region of points, separated from other regions of high density by regions of low density
  – Used when the clusters are irregular or intertwined, and when noise and outliers are present
Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related problem in that domain (a graph-based sketch of this idea follows the next list)
  – Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points
  – Clustering is equivalent to breaking the graph into connected components, one for each cluster.
  – Want to minimize the edge weight between clusters and maximize the edge weight within clusters

Characteristics of the Input Data Are Important

• Type of proximity or density measure
  – This is a derived measure, but central to clustering
• Sparseness
  – Dictates type of similarity
  – Adds to efficiency
• Attribute type
  – Dictates type of similarity
• Type of Data
  – Dictates type of similarity
  – Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution
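To make the graph view concrete, here is a minimal Python sketch (the function name and the idea of a fixed distance cutoff `threshold` are illustrative assumptions, not from the slides): it turns a proximity matrix into an unweighted graph by keeping only sufficiently close pairs, then reads off connected components as clusters.

```python
import numpy as np

def graph_clusters(dist, threshold):
    """Cluster by treating points as graph nodes, connecting any pair
    closer than `threshold`, and extracting connected components.
    `dist` is an n x n symmetric distance matrix."""
    n = dist.shape[0]
    adj = dist < threshold            # weighted graph reduced to edges below the cutoff
    labels = -np.ones(n, dtype=int)
    cluster = 0
    for start in range(n):
        if labels[start] >= 0:        # already reached from an earlier component
            continue
        stack = [start]               # DFS over one connected component
        labels[start] = cluster
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if labels[v] < 0:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    return labels
```

Called as `graph_clusters(dist, 0.3)` on a Euclidean distance matrix, this behaves like the contiguity-based notion of clusters described earlier: components are chains of mutually close points.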
Clustering Algorithms

• K-means and its variants
• Hierarchical clustering
• Density-based clustering

K-means Clustering

• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple:
  1. Select K points as the initial centroids
  2. Repeat: form K clusters by assigning all points to the closest centroid, then recompute the centroid of each cluster
  3. Until the centroids don't change
K-means Clustering – Details

• Initial centroids are often chosen randomly
  – Clusters produced vary from one run to another
• The centroid is (typically) the mean of the points in the cluster
• Most of the convergence happens in the first few iterations
• Complexity is O( n * K * I * d )
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

[Figure: two different K-means clusterings of the same points, one optimal and one sub-optimal]
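As a companion to the details above, here is a minimal NumPy sketch of basic K-means (names and the convergence test are illustrative choices, not the book's code): alternate assigning points to the closest centroid and recomputing each centroid as the mean of its points, stopping when the centroids no longer change. Each iteration costs O(n * K * d), consistent with the O(n * K * I * d) total noted above.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means sketch: pick k random points as initial centroids,
    then alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        # (keep the old centroid if a cluster happens to go empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # stop when centroids no longer change
            break
        centroids = new
    return labels, centroids
```

Because the initial centroids are random, different seeds can converge to different clusterings, which is exactly the sensitivity illustrated in the iteration figures that follow.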
Importance of Choosing Initial Centroids

[Figure: snapshots of a K-means run at iterations 1 through 6, converging to the natural clustering]
Evaluating K-means Clusters

• Most common measure is Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster center
  – To get SSE, we square these errors and sum them:

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)

  – x is a data point in cluster Ci and mi is the representative point for cluster Ci (can show that mi corresponds to the center (mean) of the cluster)
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K

(A small SSE sketch follows the figure placeholder below.)

Importance of Choosing Initial Centroids …

[Figure: a second K-means run on the same data from different initial centroids, converging to a sub-optimal clustering]
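Tying the SSE formula above to code, here is a minimal sketch (function and argument names are assumed for illustration) that computes SSE given the points, their cluster labels, and the centroids returned by the earlier `kmeans` sketch:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared error: squared distance of every point to its
    cluster's representative point (centroid), summed over all clusters."""
    return sum(((X[labels == i] - m) ** 2).sum()
               for i, m in enumerate(centroids))
```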
Problems with Selecting Initial Points

• If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.
  – Chance is relatively small when K is large

[Figure: iterations 1 and 2 of a K-means run whose initial centroids fall in only some of the real clusters]
10 Clusters Example

[Figure: K-means iterations 1 through 4 on five pairs of clusters, starting with two initial centroids in one cluster of each pair of clusters]
10 Clusters Example

[Figure: K-means iterations 1 through 4 on five pairs of clusters, starting with some pairs of clusters having three initial centroids, while others have only one]
Bisecting K-means

• Bisecting K-means algorithm
  – Variant of K-means that can produce a partitional or a hierarchical clustering
  – Start with one all-inclusive cluster; repeatedly pick a cluster, bisect it with basic K-means (keeping the best of several trial bisections), until there are K clusters

Bisecting K-means Example

[Figure: a sequence of bisections on a sample data set]
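A sketch of the bisection loop just described, reusing the hypothetical `kmeans()` and `sse()` helpers from the earlier sketches. The rule for picking which cluster to split (largest within-cluster SSE) and the number of trial bisections are common choices, not mandated by the slides:

```python
import numpy as np

def bisecting_kmeans(X, k, trials=5):
    """Bisecting K-means sketch: repeatedly split the cluster with the
    largest SSE using 2-means, keeping the best of several trial bisections."""
    clusters = [np.arange(len(X))]            # start with one all-inclusive cluster
    while len(clusters) < k:
        # pick the cluster with the largest within-cluster squared error
        i_worst = max(range(len(clusters)),
                      key=lambda i: ((X[clusters[i]] - X[clusters[i]].mean(0)) ** 2).sum())
        worst = clusters.pop(i_worst)
        best = None
        for t in range(trials):               # several trial bisections, keep lowest SSE
            labels, cents = kmeans(X[worst], 2, seed=t)
            err = sse(X[worst], labels, cents)
            if best is None or err < best[0]:
                best = (err, labels)
        clusters += [worst[best[1] == 0], worst[best[1] == 1]]
    return clusters
```

Recording the splits in order would yield the hierarchical clustering the slide mentions; returning only the final list of index arrays gives the partitional one.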
Overcoming K-means Limitations

[Figure: Original Points vs. K-means Clusters, using many small clusters so that parts of the natural clusters can later be put back together]

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

[Figure: six nested clusters and the corresponding dendrogram with leaves ordered 1 3 2 5 4 6]
Strengths of Hierarchical Clustering

• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

Hierarchical Clustering

• Two main types of hierarchical clustering
  – Agglomerative:
    Start with the points as individual clusters
    At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    Start with one, all-inclusive cluster
    At each step, split a cluster until each cluster contains a point (or there are k clusters)
Agglomerative Clustering Algorithm

• More popular hierarchical clustering technique
• Basic algorithm is straightforward (a naive sketch follows the Starting Situation below):
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4. Merge the two closest clusters
  5. Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms

Starting Situation

• Start with clusters of individual points and a proximity matrix

[Figure: points p1, p2, p3, p4, p5, ... and the corresponding proximity matrix]
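The following is a deliberately naive Python sketch of the algorithm above (names are illustrative; real implementations update the proximity matrix incrementally rather than recomputing linkages from scratch each step):

```python
import numpy as np

def agglomerative(X, k, linkage=min):
    """Naive agglomerative clustering: start from singleton clusters and a
    distance matrix, then repeatedly merge the two closest clusters until
    k remain. `linkage=min` gives MIN (single link); pass `max` for MAX."""
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):             # search for the closest
            for b in range(a + 1, len(clusters)):  # pair of clusters
                d = linkage(dist[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)             # merge the closest pair
    return clusters
```

Recording the merge order (and the distance at each merge) is what a dendrogram plots; cutting the recorded sequence at a chosen distance yields any desired number of clusters, as noted earlier.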
Intermediate Situation

• After some merging steps, we have some clusters
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1 through C5 and their proximity matrix, with C2 and C5 about to be merged]
After Merging

• The question is "How do we update the proximity matrix?"

[Figure: proximity matrix after the merge, with a single row and column for C2 U C5 whose proximities to C1, C3, and C4 are marked "?"]

How to Define Inter-Cluster Similarity

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error
Cluster Similarity: MIN or Single Link

• Similarity of two clusters is based on the two most similar (closest) points in the different clusters
  – Determined by one pair of points, i.e., by one link in the proximity graph.

Hierarchical Clustering: MIN

[Figure: similarity matrix for points I1 through I5, the nested single-link clusters, and the corresponding dendrogram]
Strength of MIN

• Can handle non-elliptical shapes

[Figure: Original Points and the two clusters found by MIN]

Limitations of MIN

• Sensitive to noise and outliers

[Figure: Original Points and the two clusters found by MIN in the presence of noise]
Cluster Similarity: MAX or Complete Linkage

• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
  – Determined by all pairs of points in the two clusters

[Figure: similarity matrix for I1 through I5 (first row: I1 1.00 0.90 0.10 0.65 0.20) and the complete-link clusters and dendrogram]
Hierarchical Clustering: MAX

[Figure: nested complete-link clusters and the corresponding dendrogram with leaves ordered 3 6 4 1 2 5]

Cluster Similarity: Group Average

• Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

  proximity(C_i, C_j) = \frac{\sum_{p_i \in C_i, p_j \in C_j} proximity(p_i, p_j)}{|C_i| \cdot |C_j|}
Hierarchical Clustering: Group Average

• Compromise between Single and Complete Link
• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters

Cluster Similarity: Ward's Method

• Similarity of two clusters is based on the increase in squared error when two clusters are merged
  – Similar to group average if distance between points is distance squared
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of K-means
  – Can be used to initialize K-means
Hierarchical Clustering: Comparison

[Figure: the same data set clustered by MIN, MAX, Group Average, and Ward's Method, shown side by side]

Hierarchical Clustering: Time and Space Requirements

• O(N²) space since it uses the proximity matrix.
  – N is the number of points.
• O(N³) time in many cases
  – There are N steps and at each step the size N² proximity matrix must be updated and searched
  – Complexity can be reduced to O(N² log(N)) time for some approaches
Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone
• No objective function is directly minimized

MST: Divisive Hierarchical Clustering

• Build MST (Minimum Spanning Tree)
  – Start with a tree that consists of any point
  – In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not
  – Add q to the tree and put an edge between p and q
DBSCAN: Core, Border and Noise Points

[Figure: original points and their classification into core, border, and noise points]

When DBSCAN Works Well

• Resistant to noise
• Can handle clusters of different shapes and sizes

[Figure: Original Points and the clusters found by DBSCAN]

When DBSCAN Does NOT Work Well

• Varying densities
• High-dimensional data

[Figure: Original Points and the DBSCAN result (MinPts=4, Eps=9.92)]

DBSCAN: Determining EPS and MinPts

• Idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance
• Noise points have the k-th nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its k-th nearest neighbor and look for a knee
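A sketch of that heuristic using scikit-learn; the random data, the choice k = 4, and the 95th-percentile stand-in for visually reading the knee off the plot are all assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 2)   # placeholder data of shape (n_samples, n_features)

# k-distance curve: sorted distance of every point to its 4th nearest
# neighbor. A knee in this curve is a common heuristic for Eps.
k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: nearest neighbor is the point itself
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])

eps = k_dist[int(0.95 * len(k_dist))]  # crude stand-in for eyeballing the knee
labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)  # label -1 marks noise points
```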
Cluster Validity

• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms or clusterings

Clusters found in Random Data

[Figure: the same uniformly random points clustered by DBSCAN, K-means, and Complete Link; every algorithm imposes some cluster structure even on random data]
Different Aspects of Cluster Validation

1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
   – Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.

Measures of Cluster Validity

• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types.
  – External Index: Used to measure the extent to which cluster labels match externally supplied class labels.
    Example: Entropy
  – Internal Index: Used to measure the goodness of a clustering structure without respect to external information.
    Example: Sum of Squared Error (SSE)
  – Relative Index: Used to compare two different clusterings or clusters.
    Often an external or internal index is used for this function, e.g., SSE or entropy
• Sometimes these are referred to as criteria instead of indices
  – However, sometimes criterion is the general strategy and index is the numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation

• Two matrices
  – Proximity Matrix
  – "Incidence" Matrix
    One row and one column for each data point
    An entry is 1 if the associated pair of points belongs to the same cluster
    An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
  – Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated.
• High correlation indicates that points that belong to the same cluster are close to each other.
• Not a good measure for some density or contiguity based clusters.

• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: well-separated clusters (Corr = -0.9235) and uniformly random points (Corr = -0.5810)]
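A sketch of the computation (names assumed; `labels` is an integer cluster assignment per point). Note that when the proximity matrix holds distances, a good clustering gives a strongly negative correlation, which is why the values quoted above are negative:

```python
import numpy as np

def validity_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the cluster
    incidence matrix, using only the n(n-1)/2 upper-triangular entries
    since both matrices are symmetric."""
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)       # strictly upper triangle
    return np.corrcoef(dist[iu], incidence[iu])[0, 1]
```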
Using Similarity Matrix for Cluster Validation

• Order the similarity matrix with respect to cluster labels and inspect visually.

[Figure: points colored by cluster and the corresponding sorted similarity matrix; well-separated clusters produce a sharp block-diagonal pattern]
Using Similarity Matrix for Cluster Validation

• Clusters in random data are not so crisp

[Figure: sorted similarity matrices for clusterings of random points; the block-diagonal structure is much weaker than for well-separated clusters]
Using Similarity Matrix for Cluster Validation

[Figure: DBSCAN clustering of a more complex data set and its sorted similarity matrix]

Internal Measures: SSE

• SSE is good for comparing two clusterings or two clusters (average SSE).
• Can also be used to estimate the number of clusters
  – Plot SSE against K and look for a knee in the curve

[Figure: a ten-cluster data set and its SSE-versus-K curve, which bends at the natural numbers of clusters]
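A sketch of this estimate with scikit-learn, whose `inertia_` attribute is its name for the K-means SSE (the random data and the range of K values are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 2)   # stand-in for the ten-cluster data set

# SSE for each K; plotting these values against K and looking for a knee
# is the estimation heuristic described above.
sse_per_k = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
             for k in range(2, 16)}
```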
Statistical Framework for SSE

• Example
  – Compare the SSE of a clustering of the observed data against the distribution of SSE values obtained by clustering many sets of random points

[Figure: a three-cluster data set and a histogram of SSE values (roughly 0.016 to 0.034) for clusterings of random data]

Statistical Framework for Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: well-separated clusters (Corr = -0.9235) and uniformly random points (Corr = -0.5810)]
Internal Measures: Cohesion and Separation

• A proximity graph based approach can also be used for cohesion and separation.
  – Cluster cohesion is the sum of the weight of all links within a cluster.
  – Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster.

[Figure: cohesion (edges within a cluster) vs. separation (edges between clusters)]

Internal Measures: Silhouette Coefficient

• Silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
• For an individual point, i
  – Calculate a = average distance of i to the points in its cluster
  – Calculate b = min (average distance of i to points in another cluster)
  – The silhouette coefficient for a point is then given by s = 1 - a/b if a < b (and s = b/a - 1 if a >= b, though this is not the usual case)
  – Typically between 0 and 1.
  – The closer to 1 the better.
• Can calculate the Average Silhouette width for a cluster or a clustering
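A sketch using scikit-learn's silhouette utilities; scikit-learn computes s = (b - a) / max(a, b), which agrees with the 1 - a/b form above in the usual case a < b. The data and labels here are placeholders:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.rand(300, 2)              # placeholder data
labels = (X[:, 0] > 0.5).astype(int)    # placeholder two-cluster labeling

s = silhouette_samples(X, labels)       # per-point silhouette coefficients
print("average silhouette width:", silhouette_score(X, labels))
print("per-cluster averages:", [s[labels == c].mean() for c in np.unique(labels)])
```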
External Measures of Cluster Validity: Entropy and Purity

• For cluster j, let p_ij = m_ij / m_j be the proportion of its members that belong to class i
  – Entropy of cluster j: e_j = -\sum_i p_ij \log_2 p_ij; the total entropy is the size-weighted sum over clusters
  – Purity of cluster j: max_i p_ij

Final Comment on Cluster Validity

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." (Algorithms for Clustering Data, Jain and Dubes)
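Returning to the entropy and purity definitions above, a minimal sketch (names assumed; `true_classes` and `cluster_labels` are integer NumPy arrays of equal length):

```python
import numpy as np

def entropy_purity(true_classes, cluster_labels):
    """External validity sketch: per-cluster entropy and purity from the
    class distribution inside each cluster, combined as size-weighted totals."""
    total_e = total_p = 0.0
    n = len(cluster_labels)
    for c in np.unique(cluster_labels):
        classes = true_classes[cluster_labels == c]
        p = np.bincount(classes) / len(classes)       # p_ij for cluster j
        p = p[p > 0]                                  # avoid log2(0)
        total_e += len(classes) / n * -(p * np.log2(p)).sum()
        total_p += len(classes) / n * p.max()
    return total_e, total_p
```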