History
• John Snow, a London physician, plotted the locations of cholera deaths during the outbreak of the 1850s.
• The locations pointed to polluted wells as the source.
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
Applications of Cluster Analysis
– Example: clusters of stocks with similar price movement
Cluster 3 (Financial-DOWN): MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
• Summarization
– Reduce the size of large data sets
[Figure: clustering precipitation in Australia]
Applications
• Computer Vision
– Image segmentation: identify head, hands, nose, etc. in an image of a person, or identify types of fruit in an image of a shop.
Types of Clustering
• Partitional clustering
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
• Density-based clustering
K-means Clustering
ID   Weight index (x)   pH (y)
A    1                  1
B    2                  1
C    4                  3
D    5                  4
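The four points above can be clustered with a minimal k-means sketch in pure Python; the choice of k = 2 and of A and C as the initial centroids is an assumption for illustration:

```python
# Minimal k-means on the four points above (k = 2).
# Initial centroids A and C are chosen for illustration only.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def dist2(p, q):
    # Squared Euclidean distance between two 2-D points.
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for name, p in points.items():
            i = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(name)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            (sum(points[n][0] for n in c) / len(c),
             sum(points[n][1] for n in c) / len(c))
            for c in clusters
        ]
    return clusters, centroids

clusters, centroids = kmeans(points, [points["A"], points["C"]])
print(clusters)  # [['A', 'B'], ['C', 'D']]
```

With this initialization the assignment stabilizes after one pass: {A, B} and {C, D}, with centroids (1.5, 1) and (4.5, 3.5).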
Hierarchical Clustering
• Produces a set of nested clusters organized as
a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits
[Figure: dendrogram of six points, with merge heights ranging from 0.05 to 0.2]
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k clusters)
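The agglomerative procedure above can be sketched in a few lines of Python; the 1-D sample points and the single-link (closest-member) proximity are assumptions for illustration:

```python
# Agglomerative clustering sketch: start with singleton clusters and
# repeatedly merge the closest pair until only k clusters remain.
def single_link(c1, c2):
    # Single-link (MIN) proximity: distance between the two closest members.
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, k):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > k:
        # Find the closest pair of clusters ...
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # ... and merge them.
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1, 2, 4, 5, 10], k=2))  # [[1, 2, 4, 5], [10]]
```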
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: clusters C1–C5 over points p1 … p12, with their proximity matrix]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1–C5 over points p1 … p12; C2 and C5 are about to be merged]
After Merging
• The question is: “How do we update the proximity matrix?”
[Figure: merged cluster C2 ∪ C5 alongside C1, C3, C4; the proximity-matrix entries for C2 ∪ C5 are marked “?”]
How to Define Inter-Cluster Similarity
[Figure: two clusters of points p1 … p5 and their proximity matrix]
• MIN (single link)
• MAX (complete link)
• Group average
• Distance between centroids
• Other methods driven by an objective function
– Ward’s method uses squared error
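The first four proximity definitions above can be written directly; 2-D Euclidean points and the two sample clusters are assumptions for illustration:

```python
from math import dist  # Euclidean distance between coordinate sequences

# Four common inter-cluster proximity definitions.
def min_link(c1, c2):         # MIN (single link): closest pair of points
    return min(dist(p, q) for p in c1 for q in c2)

def max_link(c1, c2):         # MAX (complete link): farthest pair of points
    return max(dist(p, q) for p in c1 for q in c2)

def group_average(c1, c2):    # average over all pairwise distances
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def centroid_distance(c1, c2):  # distance between the cluster centroids
    cen = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
    return dist(cen(c1), cen(c2))

c1, c2 = [(0, 0), (0, 2)], [(3, 0), (3, 2)]
print(min_link(c1, c2))          # 3.0
print(centroid_distance(c1, c2)) # 3.0
```

On these two clusters MIN and the centroid distance agree (3.0), while MAX is the diagonal sqrt(13) ≈ 3.61, showing how the choice of definition changes the merge decisions.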
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that one
point (p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
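A minimal sketch of both steps, assuming 1-D points for brevity: Prim's algorithm grows the MST as described above, and deleting the largest MST edges leaves connected pieces that serve as the clusters.

```python
# MST-based divisive clustering sketch (1-D points for simplicity).
def mst_edges(points):
    # Prim's algorithm: start from an arbitrary point and repeatedly
    # attach the closest point (q) not yet in the tree to a tree point (p).
    in_tree, rest = [points[0]], list(points[1:])
    edges = []
    while rest:
        p, q = min(
            ((p, q) for p in in_tree for q in rest),
            key=lambda pq: abs(pq[0] - pq[1]),
        )
        edges.append((p, q))
        in_tree.append(q)
        rest.remove(q)
    return edges

def divisive_clusters(points, k):
    # Drop the k-1 largest MST edges; the remaining components are clusters.
    edges = sorted(mst_edges(points), key=lambda e: abs(e[0] - e[1]))
    kept = edges[: len(edges) - (k - 1)]
    # Union the kept edges into connected components.
    comp = {p: {p} for p in points}
    for p, q in kept:
        merged = comp[p] | comp[q]
        for r in merged:
            comp[r] = merged
    return sorted({tuple(sorted(c)) for c in comp.values()})

print(divisive_clusters([1, 2, 4, 5, 10], k=2))  # [(1, 2, 4, 5), (10,)]
```

Cutting the single largest edge (5–10, length 5) isolates the outlying point 10, which is exactly the first divisive split.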
DBSCAN (Eps, MinPts)
– A core point has at least MinPts points within distance Eps.
– A border point has fewer than MinPts points within Eps, but is in the neighborhood (within distance Eps) of a core point.
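The core/border distinction above can be sketched as a point classifier; the 1-D points and the values Eps = 1.5, MinPts = 3 are assumptions for illustration (the neighborhood is taken to include the point itself, as is conventional):

```python
# Classify points as core, border, or noise (the DBSCAN point types).
def classify(points, eps, min_pts):
    def neighbors(p):
        # Eps-neighborhood of p, including p itself.
        return [q for q in points if abs(p - q) <= eps]
    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(p)):
            labels[p] = "border"  # not core, but near a core point
        else:
            labels[p] = "noise"   # neither core nor border
    return labels

print(classify([1, 2, 3, 10], eps=1.5, min_pts=3))
# {1: 'border', 2: 'core', 3: 'border', 10: 'noise'}
```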
Internal Measures: Silhouette Coefficient
• Combines ideas of both cohesion and separation: for a point i, s = (b − a) / max(a, b), where a is the average distance from i to points in its own cluster and b is the smallest average distance from i to points in another cluster.
Distance Measures
• Minkowski distance of order p (p > 0):
d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p)^(1/p)
– p = 1: Manhattan (L1) distance
– p = 2: Euclidean (L2) distance:
d(x, y) = sqrt(|x1 − y1|² + |x2 − y2|² + … + |xn − yn|²)
• Example: four 2-D points
Data Matrix:
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1
Distance Matrix for Manhattan (L1) Distance:
      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0
Distance Matrix for Euclidean (L2) Distance:
      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
[Figure: the four points plotted in the plane]
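A short pure-Python sketch of the Minkowski distance, reproducing a few entries of the matrices above:

```python
# Minkowski distance of order p; p = 1 gives L1 (Manhattan), p = 2 gives L2.
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

p1, p2, p3, p4 = (0, 2), (2, 0), (3, 1), (5, 1)
print(minkowski(p1, p2, 1))             # 4.0   (matches the L1 matrix)
print(round(minkowski(p1, p2, 2), 3))   # 2.828 (matches the L2 matrix)
print(round(minkowski(p1, p4, 2), 3))   # 5.099 (matches the L2 matrix)
```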
Distance Measures
• Cosine measure (similarity vs. distance)
For x = (x1, x2, …, xn) and y = (y1, y2, …, yn):
cos(x, y) = (x1·y1 + … + xn·yn) / (sqrt(x1² + … + xn²) · sqrt(y1² + … + yn²))
d(x, y) = 1 − cos(x, y)
– Property: 0 ≤ d(x, y) ≤ 2
– Nonmetric vector objects: keywords in documents, gene features in micro-arrays, …
– Applications: information retrieval, biological taxonomy, …
Distance Measures
• Example: cosine measure
x1 = (3, 2, 0, 5, 2, 0, 0), x2 = (1, 0, 0, 0, 1, 0, 2)
x1 · x2 = 3·1 + 2·0 + 0·0 + 5·0 + 2·1 + 0·0 + 0·2 = 5
||x1|| = sqrt(3² + 2² + 0² + 5² + 2² + 0² + 0²) = sqrt(42) ≈ 6.48
||x2|| = sqrt(1² + 0² + 0² + 0² + 1² + 0² + 2²) = sqrt(6) ≈ 2.45
cos(x1, x2) = 5 / (6.48 × 2.45) ≈ 0.32
d(x1, x2) = 1 − cos(x1, x2) ≈ 1 − 0.32 = 0.68
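The worked example above can be checked with a few lines of Python (the exact value is cos ≈ 0.315, which the slide rounds to 0.32):

```python
from math import sqrt

# Cosine similarity, and the derived distance d = 1 - cos.
def cos_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

x1 = (3, 2, 0, 5, 2, 0, 0)
x2 = (1, 0, 0, 0, 1, 0, 2)
print(round(cos_sim(x1, x2), 3))      # 0.315 (the slide rounds to 0.32)
print(round(1 - cos_sim(x1, x2), 3))  # 0.685 (the slide rounds to 0.68)
```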