
Unsupervised Learning

History
• John Snow, a London physician, plotted the locations of cholera deaths during the outbreak in the 1850s.
• The locations pointed to polluted wells.
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.
Applications of Cluster Analysis

• Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Discovered clusters of stocks with similar price movements, and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN (Technology1-DOWN)
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN (Technology2-DOWN)
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN (Financial-DOWN)
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP (Oil-UP)

• Summarization
– Reduce the size of large data sets (e.g., clustering precipitation data over Australia)
Applications

• Computer Vision
– Image Segmentation: identify head, hands, nose, etc. in an image of a person; identify types of fruit in an image of a shop.
Applications

• Bank/Internet Security: fraud/spam pattern discovery


• Biology: taxonomy of living things such as kingdom, phylum,
class, order, family, genus and species
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Climate change: understanding Earth's climate; finding patterns in atmospheric and ocean data
• Finance: stock clustering analysis to uncover correlations among shares
• Image Compression/segmentation: coherent pixels grouped
• Information retrieval/organisation: Google search, topic-based
news
• Land use: Identification of areas of similar land use in an earth
observation database
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Social network mining: automatic discovery of special-interest groups
Notion of a Cluster can be Ambiguous

[Figure: the same set of points grouped into two, four, or six clusters. How many clusters are there?]
Types of Clusterings

• A clustering is a set of clusters


• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering

[Figure: original points and a partitional clustering of them]


Clustering Algorithms
• K-means

• Hierarchical clustering

• Density-based clustering
K-means Clustering

• Partitional clustering approach


• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
Example

ID   Weight index (x)   pH (y)
A    1                  1
B    2                  1
C    4                  3
D    5                  4
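To make the loop concrete, here is a minimal NumPy sketch of basic K-means run on the four example points above with K = 2; the initial centroids (points A and B) are an assumed choice, since the slides do not specify an initialisation.

```python
import numpy as np

# Example data from the table above: (weight index, pH)
points = np.array([[1, 1],   # A
                   [2, 1],   # B
                   [4, 3],   # C
                   [5, 4]])  # D

K = 2
centroids = points[[0, 1]].astype(float)  # assumed initialisation: start from A and B

for _ in range(100):
    # Assign each point to the cluster with the closest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of the points assigned to it
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)     # [0 0 1 1] -> clusters {A, B} and {C, D}
print(centroids)  # centroids (1.5, 1.0) and (4.5, 3.5)
```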
Hierarchical Clustering
• Produces a set of nested clusters organized as
a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits

[Figure: a clustering of six points and the corresponding dendrogram, with merge heights on the vertical axis]
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k
clusters)

• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique


• Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms
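As an illustration, SciPy's `linkage` implements the same merge loop over a proximity matrix; the sketch below clusters five hypothetical 2-D points and then cuts the resulting tree into three flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points; any data with a proximity matrix works the same way
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Start from singleton clusters and repeatedly merge the two closest clusters.
# method='single' is MIN linkage; 'complete', 'average', and 'ward' correspond
# to the other proximity definitions discussed below.
Z = linkage(X, method='single')

# Cut the hierarchy to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```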
Starting Situation
• Start with clusters of individual points and a
proximity matrix
[Figure: individual points p1, p2, ..., p12 and the initial proximity matrix, with one row and one column per point]
Intermediate Situation
• After some merging steps, we have some clusters

[Figure: clusters C1 to C5 after some merging steps, together with the corresponding proximity matrix]
Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1 to C5 with C2 and C5 about to be merged, and the corresponding proximity matrix]
After Merging
• The question is “How do we update the proximity matrix?”

[Figure: clusters C1, C2 ∪ C5, C3, C4; the rows and columns of the proximity matrix that involve the merged cluster C2 ∪ C5 are marked "?"]
How to Define Inter-Cluster Similarity

[Figure: two clusters of points and the proximity matrix; which entries define the similarity between the clusters?]

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
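To see how these definitions differ, the small sketch below evaluates each of them on the same pair of clusters; the two point sets are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

D = cdist(A, B)  # all pairwise distances between the two clusters

print(D.min())    # MIN (single link): closest pair of points
print(D.max())    # MAX (complete link): farthest pair of points
print(D.mean())   # Group Average: mean of all pairwise distances
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # distance between centroids
```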
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that one
point (p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
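A minimal sketch of this idea, assuming Euclidean distances and hypothetical points: build the MST over the complete pairwise-distance graph, then repeatedly break the largest remaining edge, so that each break splits one cluster into two.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Hypothetical 2-D points
X = np.array([[0.0, 0.0], [0.3, 0.1], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])

# Build the MST over the complete graph of pairwise Euclidean distances
D = squareform(pdist(X))
mst = minimum_spanning_tree(D).toarray()

# Divisive step: break the largest remaining MST edge k-1 times;
# every removed edge splits one cluster into two
k = 3
for _ in range(k - 1):
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0

# The connected components of the pruned MST are the clusters
n_clusters, labels = connected_components(mst, directed=False)
print(labels)  # e.g. [0 0 1 1 2]
```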
DBSCAN(EPS, MinPts)

• DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are points that are in the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood (within distance Eps) of a core point
– A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
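In practice, one way to run DBSCAN is scikit-learn's implementation, where `eps` is the radius from the density definition and `min_samples` plays the role of MinPts; the points below are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point that should come out as noise
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],
              [9.0, 9.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Cluster labels per point; -1 marks noise points
print(db.labels_)                # e.g. [0 0 0 0 1 1 1 1 -1]
# Indices of the core points
print(db.core_sample_indices_)
```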
Cluster Validity
• For supervised classification (e.g., decision trees), we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Measuring Cluster Validity Via Correlation
• Two matrices
– Proximity Matrix
– “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation indicates that points that belong to the same
cluster are close to each other.
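A minimal sketch of this measure, assuming Euclidean distance as the proximity: compute the n(n-1)/2 pairwise distances, build the matching 0/1 incidence entries, and correlate the two. The points and labels are hypothetical; note that with distances (rather than similarities) a good clustering shows up as a strongly negative correlation.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical points and a clustering of them
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

# Condensed proximity "matrix": the n(n-1)/2 pairwise distances
prox = pdist(X)

# Condensed incidence "matrix": 1 if the pair is in the same cluster, else 0
n = len(labels)
incidence = np.array([1 if labels[i] == labels[j] else 0
                      for i in range(n) for j in range(i + 1, n)])

# Correlation between the two sets of entries
corr = np.corrcoef(prox, incidence)[0, 1]
print(corr)  # strongly negative here: same-cluster pairs have small distances
```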
Internal Measures: Cohesion and Separation
• A proximity graph based approach can also be used for cohesion and
separation.
– Cluster cohesion is the sum of the weight of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster and
nodes outside the cluster.

[Figure: cohesion = the weight of links within a cluster; separation = the weight of links between clusters]
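A small sketch of this graph-based view; the points, the labels, and the choice of link weights (here 1 / (1 + distance)) are all illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical points, a clustering of them, and a weighted proximity graph
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.8]])
labels = np.array([0, 0, 1, 1])
W = 1.0 / (1.0 + squareform(pdist(X)))  # assumed weighting: similarity = 1 / (1 + distance)
np.fill_diagonal(W, 0.0)

for k in (0, 1):
    inside = labels == k
    cohesion = W[np.ix_(inside, inside)].sum() / 2   # weight of links within the cluster
    separation = W[np.ix_(inside, ~inside)].sum()    # weight of links leaving the cluster
    print(k, cohesion, separation)
```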
Internal Measures: Silhouette Coefficient

• The Silhouette Coefficient combines ideas of both cohesion and separation, for individual points as well as for clusters and clusterings
• For an individual point, i
– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to points in another cluster)
– The silhouette coefficient for a point is then given by

s = 1 - a/b if a < b (or s = b/a - 1 if a ≥ b, not the usual case)

– Typically between 0 and 1; the closer to 1 the better.

• Can calculate the Average Silhouette width for a cluster or a clustering
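A minimal sketch using scikit-learn, which computes the equivalent form s = (b - a) / max(a, b) for each point; the points and labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical points and their cluster labels
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.1, 3.9], [3.9, 4.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Per-point silhouette coefficients: s = (b - a) / max(a, b),
# which equals 1 - a/b when a < b, as on the slide
print(silhouette_samples(X, labels))

# Average silhouette width over the whole clustering
print(silhouette_score(X, labels))
```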


External Measures of Cluster Validity: Entropy and Purity
Distance Measures
• Minkowski Distance (http://en.wikipedia.org/wiki/Minkowski_distance)
For $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$:

$d(\mathbf{x}, \mathbf{y}) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_n - y_n|^p \right)^{1/p}, \quad p > 0$

– p = 1: Manhattan (city block) distance
– p = 2: Euclidean distance

$d(\mathbf{x}, \mathbf{y}) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_n - y_n|^2}$

– Do not confuse p with n: all of these distances are defined over all n features (dimensions).
– A generic measure: use an appropriate p for the application at hand.
Distance Measures

• Example: Manhattan and Euclidean distances

Data matrix:
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix for Manhattan distance (L1):
      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

Distance matrix for Euclidean distance (L2):
      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
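The two matrices above can be reproduced in a couple of lines with SciPy; `cityblock` is the p = 1 (Manhattan) case and `euclidean` the p = 2 case of the Minkowski distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Data matrix from the example: p1..p4
P = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])

L1 = cdist(P, P, metric='cityblock')   # Minkowski with p = 1 (Manhattan)
L2 = cdist(P, P, metric='euclidean')   # Minkowski with p = 2 (Euclidean)

print(L1)                # matches the Manhattan distance matrix above
print(np.round(L2, 3))   # matches the Euclidean distance matrix above
```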
Distance Measures
• Cosine Measure (Similarity vs. Distance)
For $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$:

$\cos(\mathbf{x}, \mathbf{y}) = \dfrac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}$

$d(\mathbf{x}, \mathbf{y}) = 1 - \cos(\mathbf{x}, \mathbf{y})$

– Property: $0 \le d(\mathbf{x}, \mathbf{y}) \le 2$
– Nonmetric vector objects: keywords in
documents, gene features in micro-arrays, …
– Applications: information retrieval, biologic
taxonomy, ...
Distance Measures
• Example: Cosine measure
$\mathbf{x}_1 = (3, 2, 0, 5, 2, 0, 0), \quad \mathbf{x}_2 = (1, 0, 0, 0, 1, 0, 2)$

$\mathbf{x}_1 \cdot \mathbf{x}_2 = 3 \cdot 1 + 2 \cdot 0 + 0 \cdot 0 + 5 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 2 = 5$

$\|\mathbf{x}_1\| = \sqrt{3^2 + 2^2 + 0^2 + 5^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.48$

$\|\mathbf{x}_2\| = \sqrt{1^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2} = \sqrt{6} \approx 2.45$

$\cos(\mathbf{x}_1, \mathbf{x}_2) = \dfrac{5}{6.48 \times 2.45} \approx 0.32$

$d(\mathbf{x}_1, \mathbf{x}_2) = 1 - \cos(\mathbf{x}_1, \mathbf{x}_2) = 1 - 0.32 = 0.68$
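The same computation takes a few lines of NumPy; without the intermediate rounding used on the slide, the values come out as roughly 0.315 and 0.685, which the slide rounds to 0.32 and 0.68.

```python
import numpy as np

x1 = np.array([3, 2, 0, 5, 2, 0, 0])
x2 = np.array([1, 0, 0, 0, 1, 0, 2])

# cos(x, y) = (x . y) / (||x|| ||y||), and d(x, y) = 1 - cos(x, y)
cos = x1.dot(x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
d = 1 - cos

print(cos)  # ~0.315
print(d)    # ~0.685
```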
