Cluster Analysis: Basic Concepts and Algorithms
A good clustering is one in which intra-cluster distances are minimized and inter-cluster distances are maximized.
Applications of Cluster Analysis

- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
- Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia).
[Figure: the same set of points grouped into two, four, or six clusters]
Types of Clusterings

- Partitional Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Partitional Clustering

[Figure: original points and a partitional clustering of them]
Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1–p4 with its traditional dendrogram, and a non-traditional hierarchical clustering of the same points with its non-traditional dendrogram]
In a non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or border points.
Clustering Algorithms

- K-means and its variants
- Hierarchical clustering
- Density-based clustering
K-means Clustering
Basic algorithm: choose K initial centroids; repeatedly assign every point to its closest centroid and recompute each centroid as the mean of the points in its cluster, until the centroids no longer change. Complexity is O(n * K * I * d), where n is the number of points, K the number of clusters, I the number of iterations, and d the number of attributes.
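The basic algorithm can be sketched as below. This is a minimal illustration of Lloyd's algorithm, not a production implementation (no smart initialization, no empty-cluster handling beyond keeping the old centroid); the function name and test points are mine.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """A minimal sketch of basic K-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick K initial centroids at random
    for _ in range(iters):             # up to I iterations
        # assign each of the n points to its closest centroid (K * d work each)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # recompute each centroid as the mean of its cluster
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:           # stop when centroids don't change
            break
        centroids = new
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, clusters = kmeans(pts, 2)   # two well-separated groups
```

The loop body makes the O(n * K * I * d) cost visible: the assignment step touches every point, every centroid, and every coordinate once per iteration.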
Evaluating K-means Clusters

The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to its nearest cluster centroid; to get SSE, we square these errors and sum them:

\[
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
\]

where $C_i$ is the $i$-th cluster and $m_i$ is its centroid (the mean of the points in $C_i$).
Given two sets of clusters, we can choose the one with the smallest error. One easy way to reduce SSE is to increase K, the number of clusters; however, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
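The SSE definition translates directly into code. A small sketch (function name and example points are mine) comparing a tight clustering against a poor one of the same five 1-D points:

```python
def sse(clusters, centroids):
    """SSE = sum over clusters C_i, and over points x in C_i, of the
    squared distance from x to the cluster centroid m_i."""
    total = 0.0
    for cluster, m in zip(clusters, centroids):
        for x in cluster:
            total += sum((a - b) ** 2 for a, b in zip(x, m))
    return total

# Two candidate clusterings of the same five 1-D points: the tighter
# grouping should have the lower SSE, so we would prefer it.
good = sse([[(0.0,), (1.0,)], [(9.0,), (10.0,), (11.0,)]],
           [(0.5,), (10.0,)])
bad = sse([[(0.0,), (1.0,), (9.0,)], [(10.0,), (11.0,)]],
          [(10.0 / 3,), (10.5,)])
print(good, bad)   # good < bad
```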
Two different K-means Clusterings

The result of K-means depends on the shape, density, and size of the clusters, and on the initial centroids.

[Figure: the same points clustered two ways by K-means: an optimal clustering and a sub-optimal clustering]
Importance of Choosing Initial Centroids

[Figures: two runs of K-means on the same data from different initial centroids, shown iteration by iteration; one run converges to a good clustering, the other to a poor one]
If there are K "real" clusters, the chance of randomly selecting one initial centroid from each cluster is small, and it shrinks rapidly as K grows.
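To see how small, a rough back-of-the-envelope calculation (my assumption: clusters of equal size, with each random pick landing in any given cluster with probability 1/K) gives a probability of about K!/K^K:

```python
from math import factorial

def prob_one_centroid_per_cluster(K):
    """If the K clusters have equal size, picking K initial centroids
    at random hits one point in each cluster with probability roughly
    K! / K**K (each pick lands in a given cluster with chance 1/K)."""
    return factorial(K) / K ** K

for K in (2, 5, 10):
    print(K, prob_one_centroid_per_cluster(K))
# for K = 10 the chance is already below 0.0004
```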
Solutions to the Initial Centroids Problem

- Multiple runs: helps, but the probability of a good outcome may still not be on your side for large K.
- Postprocessing: e.g., split loose clusters and merge close ones to reduce SSE.
- Bisecting K-means: not as susceptible to initialization problems.
Bisecting K-means

A variant of K-means that can produce a partitional or a hierarchical clustering: start with all points in one cluster, then repeatedly pick a cluster (e.g., the one with the highest SSE) and split it in two with basic K-means, until K clusters are obtained.
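A minimal sketch of this idea follows. For determinism the 2-means bisection is seeded with the farthest-apart pair of points (my simplification; practical implementations usually run several randomly seeded trial bisections and keep the split with the lowest SSE). All function names are mine.

```python
def centroid(cluster):
    return tuple(sum(v) / len(v) for v in zip(*cluster))

def cluster_sse(cluster):
    m = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in cluster)

def two_means(points, iters=20):
    """Bisect one cluster with 2-means, seeding the centroids with the
    farthest-apart pair of points so the sketch is deterministic."""
    c = max(((p, q) for i, p in enumerate(points) for q in points[i + 1:]),
            key=lambda pq: sum((a - b) ** 2 for a, b in zip(*pq)))
    c = list(c)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, ci)) for ci in c]
            groups[0 if d[0] <= d[1] else 1].append(p)
        c = [centroid(g) if g else c[i] for i, g in enumerate(groups)]
    return [g for g in groups if g]

def bisecting_kmeans(points, k):
    """Start with one all-inclusive cluster; repeatedly split the
    cluster with the highest SSE until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=cluster_sse)
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

pts = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
result = bisecting_kmeans(pts, 3)
```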
Limitations of K-means: Differing Sizes

[Figure: original points (three clusters of different sizes) and the K-means result with K = 3]
Limitations of K-means: Differing Density

[Figure: original points (three clusters of different densities) and the K-means result with K = 3]

Limitations of K-means: Non-globular Shapes

[Figure: original points (two non-globular clusters) and the K-means result with K = 2]
Overcoming K-means Limitations

One solution is to use many clusters: each small cluster then captures part of a natural cluster, and the pieces can be put back together afterwards.

[Figures: for the differing-sizes, differing-density, and non-globular examples above, K-means with many small clusters covers each natural cluster with several pieces]
Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: nested clusters of six points and the corresponding dendrogram with merge heights]
Strengths of Hierarchical Clustering

- No need to assume any particular number of clusters: any desired number can be obtained by cutting the dendrogram at the proper level.
- The clusters may correspond to meaningful taxonomies (e.g., in the biological sciences).
Hierarchical Clustering

Two main types:

- Agglomerative: start with the points as individual clusters and, at each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left.
- Divisive: start with one all-inclusive cluster and, at each step, split a cluster, until each cluster contains a single point (or there are k clusters).
Agglomerative Clustering

More popular hierarchical clustering technique. Basic algorithm:

1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the proximity matrix.
6. Until only a single cluster remains.

The key operation is the computation of the proximity of two clusters; different approaches to defining this proximity distinguish the different algorithms.
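The steps above can be sketched as follows, using MIN (single-link) proximity, defined later in this section. This is an O(n^3)-ish illustration, not an efficient implementation; stopping at k clusters instead of 1 corresponds to cutting the dendrogram at k branches. The function names and example points are mine.

```python
def agglomerative_min(points, k):
    """Agglomerative clustering with MIN (single-link) proximity:
    every point starts as its own cluster, and the two closest
    clusters are merged repeatedly until k clusters remain."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def proximity(c1, c2):
        # MIN: distance between the two closest points across clusters
        return min(dist2(p, q) for p in c1 for q in c2)

    clusters = [[p] for p in points]           # step 2: each point a cluster
    while len(clusters) > k:
        # steps 3-4: find and merge the two closest clusters
        _, i, j = min((proximity(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                        # step 5: "update the matrix"
    return clusters

pts = [(0.0,), (0.5,), (5.0,), (5.4,), (9.0,)]
print(agglomerative_min(pts, 3))
# [[(0.0,), (0.5,)], [(5.0,), (5.4,)], [(9.0,)]]
```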
Starting Situation

Start with clusters of individual points and a proximity matrix.

[Figure: individual points and the initial proximity matrix]
Intermediate Situation

After some merging steps, we have some clusters (C1–C5) and their proximity matrix.

[Figure: clusters C1–C5 and the current proximity matrix]
Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5 with C2 and C5 about to be merged]
After Merging

The question is "How do we update the proximity matrix?" The rows and columns for C2 and C5 are replaced by a single row and column for the merged cluster C2 U C5, whose proximities to C1, C3, and C4 must be defined.

[Figure: proximity matrix after merging, with "?" entries for C2 U C5]
How to Define Inter-Cluster Similarity

- MIN (single link)
- MAX (complete link)
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function (e.g., Ward's method uses squared error)
Cluster Similarity: MIN (Single Link)

The similarity of two clusters is based on the two most similar (closest) points in the different clusters. Example similarity matrix:

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Hierarchical Clustering: MIN

[Figure: nested single-link clusters of six points and the corresponding dendrogram]
Strength of MIN

[Figure: original points and the two clusters found by MIN]

MIN can handle non-elliptical shapes.
Limitations of MIN

[Figure: original points and the two clusters found by MIN]

MIN is sensitive to noise and outliers.
Hierarchical Clustering: MAX

MAX (complete link): the similarity of two clusters is based on the two least similar (most distant) points in the different clusters.

[Figure: nested complete-link clusters of six points and the corresponding dendrogram]
Strength of MAX

[Figure: original points and the two clusters found by MAX]

MAX is less susceptible to noise and outliers.
Limitations of MAX

[Figure: original points and the two clusters found by MAX]

MAX tends to break large clusters and is biased towards globular clusters.
Cluster Similarity: Group Average

The proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

\[
\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i,\; p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i|\,|\text{Cluster}_j|}
\]
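As a worked example, the formula applied to clusters {I1, I2} and {I3, I4, I5} from the I1–I5 similarity matrix shown earlier (function name and dictionary layout are mine):

```python
def group_average(ci, cj, proximity):
    """Group-average proximity: sum proximity(pi, pj) over every
    cross-cluster pair, then divide by |Ci| * |Cj|."""
    total = sum(proximity[pi][pj] for pi in ci for pj in cj)
    return total / (len(ci) * len(cj))

# entries taken from the I1..I5 similarity matrix shown earlier
sim = {"I1": {"I3": 0.10, "I4": 0.65, "I5": 0.20},
       "I2": {"I3": 0.70, "I4": 0.60, "I5": 0.50}}
print(group_average(["I1", "I2"], ["I3", "I4", "I5"], sim))
# the average of the 6 cross-pair similarities, 2.75 / 6
```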
Hierarchical Clustering: Group Average

[Figure: nested group-average clusters of six points and the corresponding dendrogram]
Hierarchical Clustering: Group Average

- A compromise between single link (MIN) and complete link (MAX).
- Strengths: less susceptible to noise and outliers.
- Limitations: biased towards globular clusters.
Hierarchical Clustering: Comparison

[Figure: the same six points clustered with MIN, MAX, Ward's Method, and Group Average, showing the different nested clusterings each produces]
Hierarchical Clustering: Problems and Limitations

- Once a decision is made to combine two clusters, it cannot be undone.
- No global objective function is directly minimized.
- Different schemes have problems with one or more of: sensitivity to noise and outliers, difficulty handling clusters of different sizes and non-globular shapes, breaking large clusters.
DBSCAN

DBSCAN is a density-based algorithm. Density is the number of points within a specified radius (Eps). A core point has at least MinPts points within Eps; a border point has fewer than MinPts points within Eps but is in the neighborhood of a core point; a noise point is any point that is neither a core point nor a border point.
DBSCAN Algorithm

1. Eliminate noise points.
2. Perform clustering on the remaining points.
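A minimal sketch of these ideas (my function name and example data; real implementations use spatial indexes rather than the brute-force neighborhood search below): core points are found first, clusters are grown outward from unlabeled core points, and whatever remains is noise.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: find core points (at least min_pts
    neighbors within eps, counting the point itself), grow a cluster
    from each unlabeled core point, and label leftovers as noise (-1)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}

    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if i not in core or labels[i] is not None:
            continue
        labels[i] = cluster_id
        frontier = [i]
        while frontier:                 # expand only through core points
            p = frontier.pop()
            for q in nbrs[p]:
                if labels[q] is None:   # unclaimed neighbor joins cluster
                    labels[q] = cluster_id
                    if q in core:
                        frontier.append(q)
        cluster_id += 1
    return [-1 if l is None else l for l in labels]   # -1 = noise

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))   # [0, 0, 0, 1, 1, 1, -1]
```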
When DBSCAN Works Well

[Figure: original points and the clusters found (MinPts = 4, Eps = 9.75)]

- Resistant to noise.
- Can handle clusters of different shapes and sizes.
When DBSCAN Does NOT Work Well

[Figure: original points and the clusters found (MinPts = 4, Eps = 9.92)]

- Varying densities.
- High-dimensional data.
DBSCAN: Determining Eps and MinPts

The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance, while noise points have their kth nearest neighbor at a farther distance. So, plot the sorted distance of every point to its kth nearest neighbor; Eps can be chosen around the point where this curve rises sharply.
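Computing the values behind that plot is straightforward; a small sketch (function name and example points are mine) on a tight four-point cluster plus one distant noise point:

```python
def kth_nn_distances(points, k):
    """Sorted distance of every point to its kth nearest neighbor
    (excluding the point itself): cluster points sit on the flat part
    of this curve, noise points on the sharp rise at the end."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    out = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(ds[k - 1])
    return sorted(out)

# four points in a tight cluster plus one far-away noise point:
# the last value jumps, suggesting an Eps between the two regimes
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
print(kth_nn_distances(pts, 2))
```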