
Clustering: Overview and K-means algorithm

K-means illustrations thanks to 2006 student Martin Makowiecki

Informal goal

•  Given a set of objects and a measure of similarity between them, group similar objects together
•  What do we mean by “similar”?
•  What is a good grouping?
•  Computation time / quality tradeoff

General types of clustering

•  “Soft” versus “hard” clustering
   –  Hard: partition the objects
      •  each object in exactly one partition
   –  Soft: assign a degree to which each object is in a cluster
      •  view as a probability or score
•  flat versus hierarchical clustering
   –  hierarchical = clusters within clusters

Applications: Many

–  biology
–  astronomy
–  computer-aided design of circuits
–  information organization
–  marketing
–  …

Clustering in information search and analysis

•  Group information objects
   ⇒  discover topics
   ?  other groupings desirable
•  Clustering versus classifying
   –  classifying: have pre-determined classes with example members
   –  clustering:
      –  get groups of similar objects
      –  added problem of labeling clusters by topic
      –  e.g. common terms within a cluster of docs.

Example applications in search

•  Query evaluation: cluster pruning (§7.1.6)
   –  cluster all documents
   –  choose a representative for each cluster
   –  evaluate the query w.r.t. the cluster reps.
   –  evaluate the query for docs in the cluster(s) having the most similar cluster rep.(s)
•  Results presentation: labeled clusters
   –  cluster only the query results
   –  e.g. Yippy.com (metasearch)

(hard / soft? flat / hier?)
Issues

•  What attributes represent items for clustering purposes?
•  What is the measure of similarity between items?
   •  General objects and a matrix of pairwise similarities
   •  Objects with specific properties that allow other specifications of the measure
      –  Most common: objects are d-dimensional vectors
         »  Euclidean distance
         »  cosine similarity
•  What is the measure of similarity between clusters?

Issues continued

•  Cluster goals?
   –  Number of clusters?
   –  flat or hierarchical clustering?
   –  cohesiveness of clusters?
•  How to evaluate cluster results?
   –  relates to the measure of closeness between clusters
•  Efficiency of clustering algorithms
   –  large data sets => external storage
•  Maintain clusters in a dynamic setting?
•  Clustering methods? - MANY!
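The two vector-model measures named above can be sketched in plain Python (a minimal sketch; the function names are mine, not from the slides):

```python
import math

def euclidean_distance(x, y):
    # square root of the sum of squared coordinate differences;
    # smaller distance = more similar
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    # dot product divided by the product of vector lengths;
    # 1.0 means same direction, 0.0 means orthogonal
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)

a, b = (1.0, 0.0), (0.0, 1.0)
print(euclidean_distance(a, b))   # → 1.4142135623730951 (sqrt(2))
print(cosine_similarity(a, b))    # → 0.0 (orthogonal unit vectors)
```

Note that cosine similarity ignores vector length, which is why it is common for document vectors, while Euclidean distance does not.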

Quality of clustering methods

•  In applications, the quality of a clustering depends on how well it solves the problem at hand
•  An algorithm uses a measure of quality that can be optimized, but that may or may not do a good job of capturing the application needs
•  Underlying graph-theoretic problems usually NP-complete
   –  e.g. graph partitioning
•  Usually the algorithm is not finding an optimal clustering

General types of clustering

•  constructive versus iterative improvement
   –  constructive: decide in which cluster each object belongs and don’t change
      •  often faster
   –  iterative improvement: start with a clustering and move objects around to see if the clustering can be improved
      •  often slower but better

K-means overview

•  Well known, well used
•  Flat clustering
•  Number of clusters picked ahead of time
•  Iterative improvement
•  Uses notion of centroid
•  Typically uses Euclidean distance

Vector model: K-means algorithm

•  Choose k points among the set to cluster
   –  Call them the k centroids
•  For each point not selected, assign it to its closest centroid
   –  All assignments give the initial clustering
•  Until “happy” do:
   –  Recompute the centroids of the clusters
      •  New centroids may not be points of the original set
   –  Reassign all points to the closest centroid
      •  Updates clusters
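The loop above, as a minimal pure-Python sketch (Euclidean distance, stopping when no centroid moves; all names here are my own, not from the slides):

```python
import random

def dist2(p, q):
    # squared Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    # centroid of a non-empty list of points (may not be an original point)
    d = len(pts[0])
    return tuple(sum(p[j] for p in pts) / len(pts) for j in range(d))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # choose k points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # assign every point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[i].append(p)
        # recompute centroids; keep the old one if a cluster went empty
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:               # "happy": nothing moved
            break
        centroids = new
    return centroids, clusters

# two obviously separated groups
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
cents, clusters = kmeans(pts, 2)
```

On this toy input, the loop settles on one centroid near (0.1, 0.1) and one near (5.0, 5.0) regardless of which two points are sampled initially.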
An Example

(figures: start: choose centroids and cluster → recompute centroids → re-cluster around new centroids → 2nd recompute centroids and re-cluster → 3rd (final) recompute and re-cluster)

Details for K-means

•  Need a definition of the centroid
      c_i = (1/|C_i|) ∑_{x ∈ C_i} x   for the ith cluster C_i containing objects x
   –  notion of sum of objects?
•  Need a definition of distance to (similarity to) the centroid
•  Typically the vector model with Euclidean distance
•  minimizing the sum of squared distances of each point to its centroid = Residual Sum of Squares:
      RSS = ∑_{i=1}^{K} ∑_{x ∈ C_i} dist(c_i, x)²
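The two formulas above, sketched in plain Python for the vector model (names are mine; the distance is squared Euclidean, matching the RSS definition):

```python
def centroid(cluster):
    # c_i = (1/|C_i|) * componentwise sum of the vectors x in C_i
    d = len(cluster[0])
    return tuple(sum(x[j] for x in cluster) / len(cluster) for j in range(d))

def rss(clusters):
    # RSS = sum over clusters C_i of sum over x in C_i of dist(c_i, x)^2
    total = 0.0
    for cl in clusters:
        ci = centroid(cl)
        total += sum(sum((a - b) ** 2 for a, b in zip(ci, x)) for x in cl)
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
print(centroid(clusters[0]))   # → (1.0, 0.0)
print(rss(clusters))           # → 2.0  (1 + 1 from the pair, 0 from the singleton)
```

The "sum of objects" only makes sense when objects are vectors, which is why K-means is usually stated in the vector model.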
K-means performance

•  Can prove RSS decreases with each iteration, so the algorithm converges
•  Can achieve a local optimum
   –  No change in centroids
•  Running time depends on how demanding the stopping criteria are
•  Works well in practice
   –  speed
   –  quality

Time Complexity of K-means

•  Let tdist be the time to calculate the distance between two objects
•  Each iteration has time complexity O(K*n*tdist), where n = number of objects
•  Bound the number of iterations by I, giving O(I*K*n*tdist)
•  for m-dimensional vectors: O(I*K*n*m)
   –  m large and centroids not sparse
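The K*n distance computations per iteration can be checked directly by instrumenting the distance function (a toy sketch; names are mine):

```python
calls = 0

def dist2(p, q):
    # squared Euclidean distance, counting how often it is invoked
    global calls
    calls += 1
    return sum((a - b) ** 2 for a, b in zip(p, q))

n, K = 100, 5
points = [(float(i), float(i % 7)) for i in range(n)]
centroids = points[:K]

# one assignment pass: each of the n points is compared with all K centroids
assignment = [min(range(K), key=lambda j: dist2(p, centroids[j])) for p in points]
print(calls)   # → 500, i.e. K*n
```

Recomputing centroids adds only O(n*m) work per iteration, so the assignment pass dominates.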

Space Complexity of K-means

•  Store the points and centroids
   –  vector model: O((n + K)m)
•  External algorithm versus internal?
   –  store the k centroids in memory
   –  run through the points each iteration

Choosing Initial Centroids

•  Bad initialization leads to poor results

(figure: an optimal versus a non-optimal clustering arising from different initial centroids)

Choosing Initial Centroids

Many people have spent much time examining how to choose seeds:
•  Random
   –  Fast and easy, but often poor results
•  Run random multiple times, take the best
   –  Slower, and still no guarantee of results
•  Pre-conditioning
   –  remove outliers
•  Choose seeds algorithmically
   –  run hierarchical clustering on a sample of the points and use the resulting centroids
   –  Works well on small samples and for few initial centroids

K-means weakness

Non-globular clusters (figure)
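The "run random multiple times, take the best" strategy can be sketched as repeated runs of a minimal K-means, keeping the result with the lowest RSS (pure Python; all names are my own):

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    return tuple(sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0])))

def kmeans_once(points, k, rng, max_iters=50):
    # one K-means run from one random seeding; returns (RSS, centroids)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    rss = sum(dist2(centroids[i], p) for i in range(k) for p in clusters[i])
    return rss, centroids

def kmeans_restarts(points, k, runs=10, seed=0):
    # keep the best (lowest-RSS) clustering over several random starts
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(runs))

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
best_rss, best_centroids = kmeans_restarts(pts, 2)
```

Each restart costs a full K-means run, which is why this is slower; and since every start can still land in a local optimum, there is no guarantee, as the slide notes.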

K-means weakness

Wrong number of clusters (figure)

K-means weakness

Outliers and empty clusters (figure)

Real cases tend to be harder

•  Different attributes of the feature vector have vastly different sizes
   –  size of star versus color
•  Can weight different features
   –  how you weight greatly affects the outcome
•  Difficulties can be overcome
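One standard remedy for the "vastly different sizes" problem is to standardize each feature to zero mean and unit variance before clustering, so that no single attribute dominates the Euclidean distance (a sketch under that assumption; names are mine):

```python
import math

def standardize(points):
    # rescale each coordinate to mean 0 and standard deviation 1
    d, n = len(points[0]), len(points)
    means = [sum(p[j] for p in points) / n for j in range(d)]
    stds = [math.sqrt(sum((p[j] - means[j]) ** 2 for p in points) / n)
            for j in range(d)]
    return [tuple((p[j] - means[j]) / stds[j] for j in range(d)) for p in points]

# feature 0 spans millions (e.g. star size), feature 1 spans about 1
# (e.g. a color index): raw Euclidean distance would be dominated by feature 0
raw = [(1.0e6, 0.1), (2.0e6, 0.9), (3.0e6, 0.5)]
scaled = standardize(raw)
```

Weighting features differently then amounts to multiplying each standardized coordinate by a chosen weight, and, as the slide warns, that choice greatly affects the outcome.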