
Clustering

1
The Problem of Clustering
• Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.

2
Example
[Figure: a two-dimensional scatter of points (x's) that visually fall into a few natural clusters.]
3
Problems With Clustering
• Clustering in two dimensions looks easy.
• Clustering small amounts of data looks easy.
• The curse of dimensionality: many applications involve not 2 but 10, or even 10,000, dimensions.

4
Clustering Evaluation
• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
  – based on distance measures
  – high similarity within a cluster, low similarity across clusters

5
Distance Measures
• Each clustering problem is based on some kind of "distance" between points.
• Two major classes of distance measure:
  1. Euclidean
  2. Non-Euclidean

6
Euclidean Vs. Non-Euclidean
• A Euclidean space has some number of real-valued dimensions and "dense" points.
  – There is a notion of the "average" of two points.
  – A Euclidean distance is based on the locations of points in such a space.
• A non-Euclidean distance is based on properties of points, but not on their "location" in a space.

7
Some Euclidean Distances
• L2 norm: d(x,y) = the square root of the sum of the squares of the differences between x and y in each dimension.
  – The most common notion of "distance."
• L1 norm: the sum of the absolute differences in each dimension.
  – Manhattan distance = the distance if you could only travel along the coordinate axes.
  – (Both are sketched in code below.)
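A minimal sketch of these two distances in plain Python (the function names are illustrative, not from the slides):

```python
import math

def l2_distance(x, y):
    # Euclidean (L2) distance: square root of the sum of squared differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_distance(x, y):
    # Manhattan (L1) distance: sum of absolute differences per dimension.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(l2_distance([0, 0], [3, 4]))  # 5.0
print(l1_distance([0, 0], [3, 4]))  # 7
```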

8
Non-Euclidean Distances
• Jaccard distance for sets = 1 minus the ratio of the sizes of the intersection and union:
  Jaccard(x, y) = 1 − |x ∩ y| / |x ∪ y|
• Cosine distance = the angle between the vectors from the origin to the two points in question.
• Edit distance = the number of inserts and deletes needed to change one string into another.

9
Jaccard Distance for Bit-Vectors
• Example: p1 = 10111; p2 = 10011.
  – Size of intersection = 3; size of union = 4; Jaccard similarity (not distance) = 3/4.
• We need a distance function that satisfies the triangle inequality and the other distance axioms.
  – d(x,y) = 1 − (Jaccard similarity) works, as in the sketch below.
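A minimal sketch, treating each bit-vector as the set of positions holding a 1 (the function name is illustrative):

```python
def jaccard_distance(p1, p2):
    # Treat the bit-vectors as sets: a position is "in" the set if the bit is 1.
    a = {i for i, bit in enumerate(p1) if bit == "1"}
    b = {i for i, bit in enumerate(p2) if bit == "1"}
    similarity = len(a & b) / len(a | b)
    return 1 - similarity

print(jaccard_distance("10111", "10011"))  # 1 - 3/4 = 0.25
```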

10
Cosine Distance
• Think of a point as a vector from the origin (0,0,…,0) to its location.
• Two points' vectors make an angle, whose cosine is the normalized dot product of the vectors: p1·p2 / (|p1||p2|).
• Example: p1 = 00111; p2 = 10011.
  – p1·p2 = 2; |p1| = |p2| = √3, so |p1||p2| = 3.
  – cos θ = 2/3; θ is about 48 degrees (see the sketch below).

11
Edit Distance
• The edit distance of two strings is the number of inserts and deletes of characters needed to turn one into the other.
• Equivalently: d(x,y) = |x| + |y| − 2|LCS(x,y)|.
  – LCS = longest common subsequence = the longest string obtainable both by deleting characters from x and by deleting characters from y.

12
Example
• x = abcde; y = bcduve.
• Turn x into y by deleting a, then inserting u and v after d.
  – Edit distance = 3.
• Or: LCS(x,y) = bcde, so |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2·4 = 3 (verified in the sketch below).
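A minimal sketch using the LCS identity above (standard dynamic programming; the function names are illustrative):

```python
def lcs_length(x, y):
    # Classic dynamic program for the length of the longest common subsequence.
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def edit_distance(x, y):
    # Insert/delete-only edit distance via the identity on this slide.
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))  # 3
```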

13
Clustering Algorithms

k-Means Algorithms
Hierarchical Clustering

14
Methods of Clustering
• Point assignment (partitioning, "flat" algorithms):
  – Usually start with a random (partial) partitioning, i.e. maintain a set of clusters, and refine it iteratively.
  – Place points into their "nearest" cluster.
  – Examples: k-means/k-medoids clustering, model-based clustering.
• Hierarchical (agglomerative):
  – Initially, each point is a cluster by itself.
  – Repeatedly combine the two "nearest" clusters into one.

15
Partitional Clustering
• Also called flat clustering.
• The most famous algorithm is k-means.

16
k –Means Algorithm(s)
• Assumes a Euclidean space.
• Start by picking k, the number of clusters.
• Initialize the clusters by picking one point per cluster.
  – For instance, pick one point at random, then k−1 other points, each as far away as possible from the previously chosen points (sketched below).
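One way this heuristic might be coded; "as far away as possible" is read here as maximizing the minimum distance to the seeds picked so far, which is an assumption since the slide does not spell it out:

```python
import math
import random

def far_apart_seeds(points, k, rng=random.Random(0)):
    # Pick one point at random, then k-1 more, each with the largest
    # minimum distance to the seeds chosen so far.
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        def min_dist_to_seeds(p):
            return min(math.dist(p, s) for s in seeds)
        seeds.append(max(points, key=min_dist_to_seeds))
    return seeds

points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
print(far_apart_seeds(points, 3))
```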

17
Simple Clustering: K-means
Works with numeric data only.
1) Pick a number k of cluster centers (at random).
2) Assign every item to its nearest cluster center (e.g., using Euclidean distance).
3) Move each cluster center to the mean of its assigned items.
4) Repeat steps 2-3 until convergence (the change in cluster assignments falls below a threshold).
A minimal end-to-end sketch follows.
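A minimal, unoptimized sketch of this loop in plain Python; the convergence test here is "assignments stopped changing", a simple stand-in for the threshold mentioned in step 4, and the names are illustrative:

```python
import math
import random

def kmeans(points, k, max_iters=100, rng=random.Random(0)):
    # 1) Pick k initial centers at random.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # 2) Assign every point to its nearest center (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # 4) Stop when the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # 3) Move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```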

18
K-means example, step 1

Pick 3 initial cluster centers (randomly).
[Figure: scatter of points in the (X, Y) plane with the three random centers marked k1, k2, k3.]
19
K-means example, step 2

Assign each point to the closest cluster center.
[Figure: the same scatter; every point is linked to whichever of k1, k2, k3 is nearest.]
20
K-means example, step 3

Move each cluster center to the mean of its cluster.
[Figure: k1, k2, k3 shift from their old positions to the means of their assigned points.]
21
K-means example, step 4

Reassign points that are now closest to a different (new) cluster center.
Q: Which points are reassigned?
[Figure: the scatter with the updated centers k1, k2, k3.]
22
K-means example, step 4 …

A: the three points highlighted in the original slide's animation change cluster.
[Figure: the three reassigned points, now attached to their new centers.]
23
K-means example, step 4b

Re-compute the cluster means after the reassignment.
[Figure: the new means of the three clusters.]
24
K-means example, step 5

Move the cluster centers to the new cluster means.
[Figure: k1, k2, k3 moved to the cluster means.]
25
Discussion

What can be the problems with k-means clustering?

26
Issue 1: How Many Clusters?
• The number of clusters k must be given in advance: we partition the n documents into a predetermined number of clusters.
• But finding the "right" number of clusters is itself part of the problem.

27
Getting k Right
• Try different values of k, looking at the change in the average distance to the centroid as k increases.
• The average falls rapidly until the right k, then changes little (see the sketch after the plot).

[Plot: average distance to centroid versus k; the curve drops steeply and then flattens around the best value of k.]
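A quick way to trace that curve; scikit-learn is used here purely for illustration and is not part of the slides (its inertia_ is the sum of squared distances to the nearest centroid, so dividing by n gives an average):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 2)  # toy data

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    avg_sq_dist = km.inertia_ / len(X)  # average squared distance to nearest centroid
    print(k, round(avg_sq_dist, 4))     # look for the "elbow" where the drop flattens
```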
28
Example
Too few clusters: many long distances to the centroid.
[Figure: the scatter with too few centroids, so many points lie far from their centroid.]
29
Example
Just right: distances to the centroid are rather short.
[Figure: the scatter with the right number of centroids; every point is close to its centroid.]
30
Example
Too many clusters: little improvement in the average distance.
[Figure: the scatter with extra centroids that barely reduce the average distance.]
31
Issue 2: Choice of Seeds
• Results can vary significantly depending on the initial choice of seeds (their number and position).
• The algorithm can get trapped in a local minimum.
[Figure: an example set of instances with two unlucky initial cluster centers.]
• Q: What can be done?


32
Seed Choice
• Results can vary based on random seed selection; some seeds give a poor convergence rate, or convergence to sub-optimal clusterings.
• Remedies:
  – Select good seeds using a heuristic (e.g., pick the item least similar to any existing mean).
  – Try out multiple starting points.
  – Initialize with the results of another clustering method.
  – To increase the chance of finding the global optimum: restart with different random seeds.
• Example showing sensitivity to seeds: for the points A–F in the original figure, starting with B and E as centroids converges to {A,B,C} and {D,E,F}; starting with D and F converges to {A,B,D,E} and {C,F}.
33
K-means issues, variations, etc.
• Recomputing the centroid after every single assignment (rather than after all points have been reassigned) can speed up the convergence of k-means.

34
K-means clustering: outliers?
What can be done about outliers?

35
K-means clustering summary
Advantages:
• Simple, understandable.
• Items are automatically assigned to clusters.
Disadvantages:
• Must pick the number of clusters beforehand.
• All items are forced into a cluster.
• Too sensitive to outliers.

36
Clustering Algorithms

• Hierarchical algorithms
  – Bottom-up (agglomerative)
  – Top-down (divisive)

37
Hierarchical Clustering
• Two important questions:
  1. How do you determine the "nearness" of clusters?
  2. How do you represent a cluster of more than one point?

38
Hierarchical Clustering --- (2)
• Key problem: as you build clusters, how do you represent the location of each cluster, so you can tell which pair of clusters is closest?
• Euclidean case: each cluster has a centroid = the average of its points.
  – Measure intercluster distances by the distances between centroids.

39
Example
[Figure: points o at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3), with intermediate centroids marked x: (1,2) and (2,1) merge with centroid (1.5,1.5), then (0,0) joins giving centroid (1,1); (4,1) and (5,0) merge with centroid (4.5,0.5), then (5,3) joins giving centroid (4.7,1.3).]

40
And in the Non-Euclidean Case?
• The only "locations" we can talk about are the points themselves.
  – I.e., there is no "average" of two points.
• Approach 1: clustroid = the point "closest" to the other points.
  – Treat the clustroid as if it were a centroid when computing intercluster distances.

41
“Closest” Point?
• Possible meanings of "closest":
  1. Smallest maximum distance to the other points.
  2. Smallest average distance to the other points.
  3. Smallest sum of squares of distances to the other points.
  4. Etc.
A small sketch of picking a clustroid under such criteria follows.
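A minimal sketch of choosing a clustroid (the helper name and the sample cluster are illustrative, not from the slides):

```python
import math

def clustroid(points, score=lambda dists: sum(d * d for d in dists)):
    # Return the cluster member minimizing the chosen score of its distances
    # to the other members. The default is the sum of squared distances;
    # pass score=max (smallest maximum) or an average for the other criteria.
    def score_of(p):
        return score([math.dist(p, q) for q in points if q is not p])
    return min(points, key=score_of)

cluster = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(clustroid(cluster))             # sum-of-squares clustroid
print(clustroid(cluster, score=max))  # smallest maximum distance
```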

42
Example

[Figure: two clusters of numbered points (1-6); the clustroid of each cluster is marked, and the intercluster distance is measured between the two clustroids.]

43
*Hierarchical clustering

• Bottom up
  – Start with single-instance clusters.
  – At each step, join the two closest clusters.
  – Design decision: the distance between clusters, e.g. the two closest instances in the clusters vs. the distance between their means.
• Top down
  – Start with one universal cluster.
  – Split it into two clusters.
  – Proceed recursively on each subset.
  – Can be very fast.
• Both methods produce a dendrogram.
[Figure: dendrogram over items g, a, c, i, e, d, k, b, j, f, h.]
44
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
[Example taxonomy: animal → vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean).]
• One option is the recursive application of a partitional clustering algorithm to produce a hierarchical clustering (top down).

45
Hierarchical Agglomerative
Clustering (HAC)
• Assumes a similarity function for determining the similarity of two instances.
• Starts with every instance in a separate cluster and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

46
A Dendrogram: Hierarchical Clustering

• Dendrogram: decomposes the data objects into several levels of nested partitioning (a tree of clusters).
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

47
HAC Algorithm
Start with all instances, each in its own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
A minimal sketch of this loop follows.
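A minimal, naive sketch of this loop (quadratic search for the closest pair each round; the centroid-based distance assumes the Euclidean case discussed earlier, and all names are illustrative):

```python
import math

def hac(points, cluster_distance):
    # Start with every point in its own cluster; repeatedly merge the two
    # closest clusters until one remains. Returns the merge history.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters ci, cj with the smallest distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

def centroid_distance(ci, cj):
    # Euclidean case: distance between the clusters' centroids.
    centroid = lambda c: tuple(sum(col) / len(c) for col in zip(*c))
    return math.dist(centroid(ci), centroid(cj))

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for a, b in hac(points, centroid_distance):
    print("merge", a, "with", b)
```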

48
Hierarchical Clustering algorithms
• Agglomerative (bottom-up):
  – Start with each item as its own cluster.
  – Eventually all items belong to the same cluster.
• Divisive (top-down):
  – Start with all items in the same cluster.
  – Eventually each item forms a cluster of its own.
• Does not require the number of clusters k in advance.
• Needs a termination/readout condition.

49
Dendrogram: Document Example
• As clusters agglomerate, documents are likely to fall into a hierarchy of "topics" or concepts.
[Figure: dendrogram over documents d1–d5: d1 and d2 merge, d4 and d5 merge, then d3 joins {d4,d5}.]

50
“Closest pair” of clusters
• Many variants for defining the closest pair of clusters (each pairwise variant is sketched in code below):
  – "Center of gravity": merge the clusters whose centroids (centers of gravity) are most similar (e.g., most cosine-similar).
  – Single-link: similarity of the most similar pair; merge the clusters with the smallest minimum pairwise distance.
  – Complete-link: similarity of the "furthest" pair; merge the clusters with the smallest maximum pairwise distance.
  – Average-link: average similarity between all pairs of elements.
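Minimal sketches of the three pairwise variants, phrased as distances; any of these functions could be passed as the cluster distance in the HAC sketch above (names are illustrative):

```python
import math

def single_link(ci, cj):
    # Distance of the closest pair of points, one from each cluster.
    return min(math.dist(p, q) for p in ci for q in cj)

def complete_link(ci, cj):
    # Distance of the furthest pair of points, one from each cluster.
    return max(math.dist(p, q) for p in ci for q in cj)

def average_link(ci, cj):
    # Average distance over all cross-cluster pairs.
    return sum(math.dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

a, b = [(0, 0), (0, 1)], [(3, 0), (4, 4)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))
```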

51
Major issue - labeling
• After the clustering algorithm finds clusters, how can they be made useful to the end user?
• We need a pithy label for each cluster.

52
How to Label Clusters
• Show the titles of typical documents.
  – Titles are easy to scan; authors create them for quick scanning!
  – But you can only show a few titles, which may not fully represent the cluster.
• Show words/phrases prominent in the cluster.
  – More likely to fully represent the cluster.
  – Use distinguishing words/phrases (differential labeling).
  – But harder to scan.
53
Evaluation of clustering

• Perhaps the most substantive issue in data mining in general: how do you measure goodness?
• Most measures focus on computational efficiency (time and space).
• For the application of clustering to search: measure retrieval effectiveness.

54
Approaches to evaluating
• Anecdotal
• User inspection
• Ground "truth" comparison
• Cluster retrieval
• Purely quantitative measures (e.g., average distance between cluster members)
• Microeconomic / utility

55
Anecdotal evaluation
• Probably the commonest (and surely the easiest) approach.
• "I wrote this clustering algorithm and look what it found!"
• No benchmarks, so no comparison is possible.
• Any clustering algorithm will pick up the easy stuff, like partitioning by language.
• Generally of unclear scientific value.

56
User inspection
• Induce a set of clusters or a navigation tree.
• Have subject-matter experts evaluate the results and score them.
  – Some degree of subjectivity.
• Often combined with search-results clustering.
• Not clear how reproducible the scores are across tests.
• Expensive / time-consuming.

57
Ground “truth” comparison
• Take a union of docs from a taxonomy (Yahoo!, ODP, newspaper sections, …) and cluster them.
• Compare the clustering results to the baseline, e.g., "80% of the clusters found map cleanly to taxonomy nodes."
  – How would we measure this?
• But is the taxonomy the "right" answer?
  – There can be several equally right answers.

58
Microeconomic viewpoint
• Anything, including clustering, is only as good as the economic utility it provides.
• For clustering: the net economic gain produced by an approach (vs. another approach).
• Examples: recommendation systems.

59
Other Clustering Approaches
• EM: probability-based clustering
• Bayesian clustering
• SOM: self-organizing maps
• …

60
Soft Clustering

• Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster.
  – This does not allow uncertainty in class membership, or an instance that belongs to more than one cluster.
• Soft clustering gives the probabilities that an instance belongs to each of a set of clusters.
  – Each instance is assigned a probability distribution across the set of discovered categories (the probabilities of all categories must sum to 1); a toy sketch follows.
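The slides do not prescribe how the probabilities are produced (EM and other mixture models, mentioned on the previous slide, are the usual route); the sketch below simply turns distances to cluster centers into a normalized distribution, purely to illustrate the idea:

```python
import math

def soft_assignment(point, centers, stiffness=1.0):
    # Convert distances to each center into probabilities that sum to 1
    # (a softmax over negative distances; illustrative only, not EM).
    weights = [math.exp(-stiffness * math.dist(point, c)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

centers = [(0, 0), (5, 5), (10, 0)]
print(soft_assignment((1, 1), centers))  # highest probability for the nearest center
```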

61
