
Unsupervised Learning

History
• John Snow, a London physician, plotted the locations of cholera deaths during the outbreak in the 1850s.
• The locations pointed to polluted wells.
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.
Applications of Cluster Analysis

• Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Discovered clusters of stocks with similar price movements, and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN (Technology1-DOWN)
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN (Technology2-DOWN)
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN (Financial-DOWN)
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP (Oil-UP)

• Summarization
– Reduce the size of large data sets (e.g., clustering precipitation data over Australia)
Applications

• Computer Vision
– Image Segmentation: identify head, hands, nose, etc. in an image of a person; identify types of fruit in an image of a shop.
Applications

• Bank/Internet Security: fraud/spam pattern discovery


• Biology: taxonomy of living things such as kingdom, phylum,
class, order, family, genus and species
• City-planning: Identifying groups of houses according to their
house type, value, and geographical location
• Climate change: understanding Earth's climate; finding patterns in atmospheric and ocean data
• Finance: stock clustering analysis to uncover correlations among shares
• Image Compression/segmentation: coherent pixels grouped
• Information retrieval/organisation: Google search, topic-based
news
• Land use: Identification of areas of similar land use in an earth
observation database
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Social network mining: automatic discovery of special-interest groups
Notion of a Cluster can be Ambiguous

[Figure: the same set of points grouped into two, four, or six clusters. How many clusters are there?]
Types of Clusterings

• A clustering is a set of clusters


• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering

[Figure: original points and a partitional clustering of them]


Clustering Algorithms
• K-means

• Hierarchical clustering

• Density-based clustering
K-means Clustering

• Partitional clustering approach


• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
Example

ID   Weight index (x)   pH (y)
A    1                  1
B    2                  1
C    4                  3
D    5                  4
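To make the loop concrete, here is a minimal NumPy sketch of basic K-means run on the four example points above with K = 2; the initial centroids (points A and B) are an assumed choice, since the slides do not specify an initialisation.

```python
import numpy as np

# Example data from the table above: (weight index, pH)
points = np.array([[1, 1],   # A
                   [2, 1],   # B
                   [4, 3],   # C
                   [5, 4]])  # D

K = 2
centroids = points[[0, 1]].astype(float)  # assumed initialisation: start from A and B

for _ in range(100):
    # Assign each point to the cluster with the closest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of the points assigned to it
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)     # [0 0 1 1] -> clusters {A, B} and {C, D}
print(centroids)  # centroids (1.5, 1.0) and (4.5, 3.5)
```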
Hierarchical Clustering
• Produces a set of nested clusters organized as
a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits

[Figure: a clustering of six points and the corresponding dendrogram, with merge heights on the vertical axis]
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k
clusters)

• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm

• The more popular hierarchical clustering technique


• Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

• Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms
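As an illustration, SciPy's `linkage` implements the same merge loop over a proximity matrix; the sketch below clusters five hypothetical 2-D points and then cuts the resulting tree into three flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D points; any data with a proximity matrix works the same way
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Start from singleton clusters and repeatedly merge the two closest clusters.
# method='single' is MIN linkage; 'complete', 'average', and 'ward' correspond
# to the other proximity definitions discussed below.
Z = linkage(X, method='single')

# Cut the hierarchy to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```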
Starting Situation
• Start with clusters of individual points and a
proximity matrix
[Figure: individual points p1, p2, ..., p12 and the initial proximity matrix, with one row and one column per point]
Intermediate Situation
• After some merging steps, we have some clusters

[Figure: clusters C1 to C5 after some merging steps, together with the corresponding proximity matrix]
Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1 to C5 with C2 and C5 about to be merged, and the corresponding proximity matrix]
After Merging
• The question is “How do we update the proximity matrix?”

[Figure: clusters C1, C2 ∪ C5, C3, C4; the rows and columns of the proximity matrix that involve the merged cluster C2 ∪ C5 are marked "?"]
How to Define Inter-Cluster Similarity

[Figure: two clusters of points and the proximity matrix; which entries define the similarity between the clusters?]

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward's Method uses squared error
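To see how these definitions differ, the small sketch below evaluates each of them on the same pair of clusters; the two point sets are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

D = cdist(A, B)  # all pairwise distances between the two clusters

print(D.min())    # MIN (single link): closest pair of points
print(D.max())    # MAX (complete link): farthest pair of points
print(D.mean())   # Group Average: mean of all pairwise distances
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # distance between centroids
```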
MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that one
point (p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
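A minimal sketch of this idea, assuming Euclidean distances and hypothetical points: build the MST over the complete pairwise-distance graph, then repeatedly break the largest remaining edge, so that each break splits one cluster into two.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Hypothetical 2-D points
X = np.array([[0.0, 0.0], [0.3, 0.1], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])

# Build the MST over the complete graph of pairwise Euclidean distances
D = squareform(pdist(X))
mst = minimum_spanning_tree(D).toarray()

# Divisive step: break the largest remaining MST edge k-1 times;
# every removed edge splits one cluster into two
k = 3
for _ in range(k - 1):
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0.0

# The connected components of the pruned MST are the clusters
n_clusters, labels = connected_components(mst, directed=False)
print(labels)  # e.g. [0 0 1 1 2]
```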
DBSCAN(EPS, MinPts)

• DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are points that are in the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood (within distance Eps) of a core point
– A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
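In practice, one way to run DBSCAN is scikit-learn's implementation, where `eps` is the radius from the density definition and `min_samples` plays the role of MinPts; the points below are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point that should come out as noise
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],
              [9.0, 9.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Cluster labels per point; -1 marks noise points
print(db.labels_)                # e.g. [0 0 0 0 1 1 1 1 -1]
# Indices of the core points
print(db.core_sample_indices_)
```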
Cluster Validity
• For supervised classification (e.g., decision trees), we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall

• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Measuring Cluster Validity Via Correlation
• Two matrices
– Proximity Matrix
– “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation indicates that points that belong to the same
cluster are close to each other.
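A minimal sketch of this measure, assuming Euclidean distance as the proximity: compute the n(n-1)/2 pairwise distances, build the matching 0/1 incidence entries, and correlate the two. The points and labels are hypothetical; note that with distances (rather than similarities) a good clustering shows up as a strongly negative correlation.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical points and a clustering of them
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

# Condensed proximity "matrix": the n(n-1)/2 pairwise distances
prox = pdist(X)

# Condensed incidence "matrix": 1 if the pair is in the same cluster, else 0
n = len(labels)
incidence = np.array([1 if labels[i] == labels[j] else 0
                      for i in range(n) for j in range(i + 1, n)])

# Correlation between the two sets of entries
corr = np.corrcoef(prox, incidence)[0, 1]
print(corr)  # strongly negative here: same-cluster pairs have small distances
```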
Internal Measures: Cohesion and Separation
• A proximity graph based approach can also be used for cohesion and
separation.
– Cluster cohesion is the sum of the weight of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster and
nodes outside the cluster.

[Figure: cohesion = the weight of links within a cluster; separation = the weight of links between clusters]
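A small sketch of this graph-based view; the points, the labels, and the choice of link weights (here 1 / (1 + distance)) are all illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical points, a clustering of them, and a weighted proximity graph
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.8]])
labels = np.array([0, 0, 1, 1])
W = 1.0 / (1.0 + squareform(pdist(X)))  # assumed weighting: similarity = 1 / (1 + distance)
np.fill_diagonal(W, 0.0)

for k in (0, 1):
    inside = labels == k
    cohesion = W[np.ix_(inside, inside)].sum() / 2   # weight of links within the cluster
    separation = W[np.ix_(inside, ~inside)].sum()    # weight of links leaving the cluster
    print(k, cohesion, separation)
```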
Internal Measures: Silhouette Coefficient

• The Silhouette Coefficient combines ideas of both cohesion and separation, for individual points as well as for clusters and clusterings
• For an individual point, i
– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to points in another cluster)
– The silhouette coefficient for a point is then given by

s = 1 - a/b if a < b (or s = b/a - 1 if a ≥ b, not the usual case)

– Typically between 0 and 1; the closer to 1 the better.

• Can calculate the Average Silhouette width for a cluster or a clustering
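A minimal sketch using scikit-learn, which computes the equivalent form s = (b - a) / max(a, b) for each point; the points and labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical points and their cluster labels
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.1, 3.9], [3.9, 4.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Per-point silhouette coefficients: s = (b - a) / max(a, b),
# which equals 1 - a/b when a < b, as on the slide
print(silhouette_samples(X, labels))

# Average silhouette width over the whole clustering
print(silhouette_score(X, labels))
```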


External Measures of Cluster Validity: Entropy and Purity
Distance Measures
• Minkowski Distance (http://en.wikipedia.org/wiki/Minkowski_distance)
For $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$:

$d(\mathbf{x}, \mathbf{y}) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_n - y_n|^p \right)^{1/p}, \quad p > 0$

– p = 1: Manhattan (city block) distance
– p = 2: Euclidean distance

$d(\mathbf{x}, \mathbf{y}) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_n - y_n|^2}$

– Do not confuse p with n: all of these distances are defined over all n features (dimensions).
– A generic measure: use an appropriate p for the application at hand.
Distance Measures

• Example: Manhattan and Euclidean distances

Data matrix:
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix for Manhattan distance (L1):
      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

Distance matrix for Euclidean distance (L2):
      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
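The two matrices above can be reproduced in a couple of lines with SciPy; `cityblock` is the p = 1 (Manhattan) case and `euclidean` the p = 2 case of the Minkowski distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Data matrix from the example: p1..p4
P = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])

L1 = cdist(P, P, metric='cityblock')   # Minkowski with p = 1 (Manhattan)
L2 = cdist(P, P, metric='euclidean')   # Minkowski with p = 2 (Euclidean)

print(L1)                # matches the Manhattan distance matrix above
print(np.round(L2, 3))   # matches the Euclidean distance matrix above
```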
Distance Measures
• Cosine Measure (Similarity vs. Distance)
For $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$:

$\cos(\mathbf{x}, \mathbf{y}) = \dfrac{x_1 y_1 + \cdots + x_n y_n}{\sqrt{x_1^2 + \cdots + x_n^2}\,\sqrt{y_1^2 + \cdots + y_n^2}}$

$d(\mathbf{x}, \mathbf{y}) = 1 - \cos(\mathbf{x}, \mathbf{y})$

– Property: $0 \le d(\mathbf{x}, \mathbf{y}) \le 2$
– Nonmetric vector objects: keywords in
documents, gene features in micro-arrays, …
– Applications: information retrieval, biologic
taxonomy, ...
Distance Measures
• Example: Cosine measure
$\mathbf{x}_1 = (3, 2, 0, 5, 2, 0, 0), \quad \mathbf{x}_2 = (1, 0, 0, 0, 1, 0, 2)$

$\mathbf{x}_1 \cdot \mathbf{x}_2 = 3 \cdot 1 + 2 \cdot 0 + 0 \cdot 0 + 5 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 2 = 5$

$\|\mathbf{x}_1\| = \sqrt{3^2 + 2^2 + 0^2 + 5^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.48$

$\|\mathbf{x}_2\| = \sqrt{1^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2} = \sqrt{6} \approx 2.45$

$\cos(\mathbf{x}_1, \mathbf{x}_2) = \dfrac{5}{6.48 \times 2.45} \approx 0.32$

$d(\mathbf{x}_1, \mathbf{x}_2) = 1 - \cos(\mathbf{x}_1, \mathbf{x}_2) = 1 - 0.32 = 0.68$
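The same computation takes a few lines of NumPy; without the intermediate rounding used on the slide, the values come out as roughly 0.315 and 0.685, which the slide rounds to 0.32 and 0.68.

```python
import numpy as np

x1 = np.array([3, 2, 0, 5, 2, 0, 0])
x2 = np.array([1, 0, 0, 0, 1, 0, 2])

# cos(x, y) = (x . y) / (||x|| ||y||), and d(x, y) = 1 - cos(x, y)
cos = x1.dot(x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
d = 1 - cos

print(cos)  # ~0.315
print(d)    # ~0.685
```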
