(2013)
Email: yangjw@pku.edu.cn
Clustering

Clustering: given a data set X = {X1, X2, ..., Xn}, partition it into k clusters C1, C2, ..., Ck, where each cluster Ci = {Xj1, Xj2, ..., Xjni} (i = 1, ..., k) is a subset of X, such that
C1 ∪ C2 ∪ ... ∪ Ck = X, and Ci ∩ Cj = ∅ for i ≠ j.

Document Clustering (DC) is partitioning a set of documents into groups or clusters.
Clusters should be computed to:
- contain similar documents;
- separate dissimilar documents as much as possible.
For instance, if similarity between documents is defined to capture semantic relatedness, the documents in a cluster should deal with the same topics, and the topics of different clusters should differ.

Applications of document clustering:
- Guiding the organization of a document collection (e.g. [Sahami 98]): progressively clustering groups of documents allows building a topic hierarchy.
- Supporting browsing and interactive retrieval: grouping retrieved documents to allow a faster selection of relevant documents; used by some search engines (Vivisimo).
- Pre-processing for other tasks, e.g. detecting the main semantic dimensions of the text collection to support a kind of concept indexing [Karypis & Han 00], or text summarization.

As with other text processing tasks, DC has several steps:
- Document representation
- Dimensionality reduction
- Applying a clustering algorithm
- Evaluating the effectiveness of the process

(Illustration: each document is mapped to a term-weight vector.)


Evaluation measures:
P, precision
R, recall
F-measure: the harmonic mean of precision and recall,
1/F1 = (1/2) (1/P + 1/R), i.e. F1 = 2PR / (P + R)

For each cluster/class i, with precision Pi and recall Ri:
F1i = 2 Pi Ri / (Pi + Ri)   (equivalently, 1/F1i = (1/2)(1/Pi + 1/Ri))

Macro-averaged F:
Macro-F = (1/m) Σ_{i=1}^{m} Fi

Micro-averaged F (weighted by the cluster sizes ni):
Micro-F = Σ_{i=1}^{m} (ni · Fi) / Σ_{i=1}^{m} ni
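The following Python sketch computes F1, macro-F and micro-F exactly as defined above; the input format (a list of per-cluster precision, recall and size triples) is an illustrative choice, not part of the original slides.

```python
def f1(p, r):
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def macro_micro_f(clusters):
    """clusters: list of (precision_i, recall_i, size_ni), one entry per cluster/class."""
    f_scores = [f1(p, r) for p, r, _ in clusters]
    sizes = [n for _, _, n in clusters]
    macro_f = sum(f_scores) / len(f_scores)                               # unweighted mean
    micro_f = sum(n * f for n, f in zip(sizes, f_scores)) / sum(sizes)    # size-weighted, as on the slide
    return macro_f, micro_f

print(macro_micro_f([(0.9, 0.8, 50), (0.6, 0.7, 150)]))
```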

The sum-of-squared-error criterion:
Je = Σ_{i=1}^{k} Σ_{X∈Ci} |X − mi|²
where X is a sample belonging to cluster Ci and mi is the mean of Ci. A good partition is one that minimizes Je.





Document representation: the Vector Space Model.
- M index terms ti (words / phrases / ...).
- Each document dj is represented as a vector (a1j, a2j, ..., aMj).
- N documents give the M×N term-document matrix A = (aij).
- Document similarity is usually measured by the cosine of the angle between vectors.
Example with three terms T1, T2, T3:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
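A minimal sketch of the cosine similarity used in the vector space model, applied to the three-term example above (the values in the comments are approximate):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(cosine(D1, Q))  # ~0.81: D1 is the document more similar to the query
print(cosine(D2, Q))  # ~0.13
```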


Distances between two clusters Gp and Gq, denoted Dpq (d(xi, xj) is the distance between samples xi ∈ Gp and xj ∈ Gq):

Single linkage (nearest neighbor): Dpq = min d(xi, xj)

Complete linkage (farthest neighbor): Dpq = max d(xi, xj)

Centroid method: Dpq = d(x̄p, x̄q), the distance between the cluster means

Average linkage: Dpq = (1 / (n1 n2)) Σ_{xi∈Gp} Σ_{xj∈Gq} d(xi, xj)

Ward's method (sum of squared deviations): with
D1 = Σ_{xi∈Gp} (xi − x̄p)'(xi − x̄p),
D2 = Σ_{xj∈Gq} (xj − x̄q)'(xj − x̄q),
D12 = Σ_{xk∈Gp∪Gq} (xk − x̄)'(xk − x̄),
the between-cluster distance is Dpq = D12 − D1 − D2.
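The inter-cluster distances above can be written down directly; the sketch below (NumPy, Euclidean d) is one possible formulation under those definitions, not code from the slides.

```python
import numpy as np

def pairwise(Gp, Gq):
    """All pairwise Euclidean distances d(xi, xj), xi in Gp, xj in Gq."""
    return np.linalg.norm(Gp[:, None, :] - Gq[None, :, :], axis=2)

def single_link(Gp, Gq):    return pairwise(Gp, Gq).min()
def complete_link(Gp, Gq):  return pairwise(Gp, Gq).max()
def average_link(Gp, Gq):   return pairwise(Gp, Gq).mean()   # (1/(n1*n2)) * double sum
def centroid_dist(Gp, Gq):  return np.linalg.norm(Gp.mean(0) - Gq.mean(0))

def ward_dist(Gp, Gq):
    """D12 - D1 - D2: the increase in within-cluster sum of squares caused by merging."""
    wss = lambda G: ((G - G.mean(0)) ** 2).sum()
    return wss(np.vstack([Gp, Gq])) - wss(Gp) - wss(Gq)
```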


A generic partitioning (seed-based) clustering procedure for a document set D = {d1, ..., di, ..., dn}:
1. Choose the number of clusters k.
2. Select k seeds S = {s1, ..., sj, ..., sk}.
3. For each document di in D, compute its similarity sim(di, sj) to every seed sj.
4. Assign di to the cluster Cj whose seed maximizes arg max sim(di, sj), yielding a clustering C = {c1, ..., ck} of D.
5. Recompute the seeds and repeat steps 2-4 until the clustering no longer changes.

k-means (MacQueen, 1967):
1. Choose k objects as the initial cluster centers.
2. Assign each object to the cluster with the nearest center.
3. Recompute the mean of each cluster as its new center.
4. Repeat steps 2 and 3 until the k centers no longer change.

k-means illustration (K = 2): arbitrarily assign each object to one of the k initial clusters, update the cluster means, reassign each object to its nearest mean, update the means again, and repeat until no reassignment occurs.

k-means worked example: five samples X1, ..., X5 with
X1 = (0, 2), X2 = (0, 0), X3 = (1.5, 0), X4 = (5, 0), X5 = (5, 2)
and k = 2. Start from the initial partition C1 = {X1, X2, X4}, C2 = {X3, X5}.

Compute the cluster means M1 and M2:
M1 = ((0+0+5)/3, (2+0+0)/3) = (1.66, 0.66)
M2 = ((1.5+5)/2, (0+2)/2) = (3.25, 1.00)

Using Je = Σ_{i=1}^{k} Σ_{X∈Ci} |X − mi|², the within-cluster squared errors are
e1² = [(0−1.66)² + (2−0.66)²] + [(0−1.66)² + (0−0.66)²] + [(5−1.66)² + (0−0.66)²] = 19.36
e2² = 8.12
E² = e1² + e2² = 19.36 + 8.12 = 27.48

Reassign each sample to the nearer of the two centers M1 and M2:
d(M1, X1) = (1.66² + 1.34²)^(1/2) = 2.14, d(M2, X1) = 3.40  ==>  X1 ∈ C1
d(M1, X2) = 1.79, d(M2, X2) = 3.40  ==>  X2 ∈ C1
d(M1, X3) = 0.83, d(M2, X3) = 2.01  ==>  X3 ∈ C1
d(M1, X4) = 3.41, d(M2, X4) = 2.01  ==>  X4 ∈ C2
d(M1, X5) = 3.60, d(M2, X5) = 2.01  ==>  X5 ∈ C2

The new clusters are C1 = {X1, X2, X3} and C2 = {X4, X5}.

Recompute the means: M1 = (0.5, 0.67), M2 = (5.0, 1.0).
Now e1² = 4.17, e2² = 2.00, so E² = 6.17.
The total squared error has dropped from 27.48 to 6.17.
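The worked example can be reproduced with a short k-means sketch. The function below starts from a given initial assignment, which is a simplification (a real implementation would also handle empty clusters and random initialization).

```python
import numpy as np

def kmeans(X, labels, k, n_iter=100):
    """Plain k-means starting from an initial assignment; empty-cluster handling omitted."""
    for _ in range(n_iter):
        # Update the cluster means from the current assignment.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Reassign each point to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Sum-of-squared-error criterion Je from the slides.
    sse = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, sse

# X1..X5 with the initial partition C1 = {X1, X2, X4}, C2 = {X3, X5}.
X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]], dtype=float)
labels, centers, sse = kmeans(X, np.array([0, 0, 1, 0, 1]), k=2)
print(labels, sse)   # -> [0 0 0 1 1], Je ~ 6.17, matching the worked example
```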

Properties of k-means:
- Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
- For comparison: PAM is O(k(n−k)²) per iteration, CLARA is O(ks² + k(n−k)).
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Variants of k-means: the k-modes method replaces means with modes to handle categorical data, and k-prototypes combines k-means and k-modes to cluster mixed numeric and categorical data.

k-medoids methods:
PAM (Partitioning Around Medoids), Kaufmann and Rousseeuw, 1987: instead of the cluster mean, each cluster is represented by its most centrally located object (the medoid), and the data are partitioned around k medoids.

PAM (1987) — a runnable sketch follows below:
1. Arbitrarily choose k objects as the initial medoids.
2. repeat
3.   Assign each remaining object to the cluster of its nearest medoid.
4.   Randomly select a non-medoid object Orandom.
5.   Compute the total cost S of swapping a medoid Oj with Orandom.
6.   If S < 0, swap Oj with Orandom to form the new set of k medoids.
7. Until no change.
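A brute-force sketch of the PAM swap loop described above. The cost function simply sums distances to the nearest medoid, and the exhaustive search over swaps is for clarity, not efficiency.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))       # step 1: arbitrary initial medoids
    for _ in range(n_iter):
        improved = False
        for o_random in range(len(X)):                          # candidate non-medoid objects
            if o_random in medoids:
                continue
            for j in range(k):                                  # try swapping medoid j with o_random
                trial = medoids.copy()
                trial[j] = o_random
                if total_cost(X, trial) < total_cost(X, medoids):   # swap only if the cost drops
                    medoids, improved = trial, True
        if not improved:
            break
    return medoids
```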

Illustration of a typical k-medoids algorithm (PAM) with K = 2: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object Orandom and compute the total cost of swapping a medoid with Orandom (here the cost would rise from 20 to 26, so the swap is not made); swap only if the quality is improved, and loop until no change.

Properties of PAM:
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
- PAM works efficiently for small data sets but does not scale well to large data sets: O(k(n−k)²) per iteration, where n is the number of objects and k the number of clusters.
- Sampling-based method: CLARA (Clustering LARge Applications).

CLARA (Clustering LARge Applications), Kaufmann and Rousseeuw, 1990:
- Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output.
- Strength: deals with larger data sets than PAM.
- Weaknesses: efficiency depends on the sample size; a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

CLARANS (A Clustering Algorithm based on Randomized Search), Ng and Han, 1994:
- Draws a sample of neighbors dynamically.
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids.
- If a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum.
- It is more efficient and scalable than both PAM and CLARA.

Hierarchical clustering works in two directions: agglomerative clustering (AGNES) proceeds bottom-up (Step 0 to Step 4), merging the singletons a, b, c, d, e into ab and de, then cde, and finally abcde; divisive clustering (DIANA) proceeds top-down (Step 4 to Step 0) in the reverse order.


A generic agglomerative clustering procedure for D = {d1, ..., di, ..., dn} (a sketch in code follows below):
1. Start with each document di in its own cluster Ci = {di}, so the initial clustering of D is C = {c1, ..., ci, ..., cn}.
2. For every pair of clusters (ci, cj) in C, compute the similarity sim(ci, cj).
3. Select the most similar pair, arg max sim(ci, cj), and merge ci and cj into a new cluster ck = ci ∪ cj, so that C becomes {c1, ..., cn-1}.
4. Repeat steps 2 and 3 on C until a single cluster remains (or a stopping criterion is met).
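A naive sketch of this bottom-up procedure, using the single-link distance as the merge criterion (maximizing similarity is replaced by minimizing distance). Real implementations cache the distance matrix instead of recomputing it.

```python
import numpy as np

def single_link_agglomerative(X, n_clusters=1):
    """AGNES-style bottom-up merging with the single-link (minimum) distance."""
    clusters = [[i] for i in range(len(X))]          # start: each object is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])  # single link: closest pair of members
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]      # merge the two closest clusters
        del clusters[b]
    return clusters
```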

AGNES (Agglomerative Nesting), Kaufmann and Rousseeuw (1990):
- Uses the single-link method and the dissimilarity matrix.
- Merges the nodes that have the least dissimilarity.
- Goes on in a non-descending fashion.
- Eventually all nodes belong to the same cluster.

Common choices of inter-cluster distance:
1. Single linkage (nearest neighbor): Dmin(Ci, Cj)
2. Complete linkage (farthest neighbor): Dmax(Ci, Cj)
3. Average linkage: Davg(Ci, Cj)

Single link (illustration)

Complete link (illustration)

Single-link example: points 3 and 4 are merged first, at distance D34 = 5.0, giving the clusters {1}, {2}, {3,4}, {5}.

Update the distances to the new cluster (34):
D(1,(34)) = min(D(1,3), D(1,4)) = min(20.6, 22.4) = 20.6;
D(2,(34)) = min(D(2,3), D(2,4)) = min(14.1, 11.2) = 11.2;
D(5,(34)) = min(D(3,5), D(4,5)) = min(25.0, 25.5) = 25.0.

Among the clusters {1}, {2}, {(34)}, {5}, the smallest remaining distance is D(1,5) = 7.07, so points 1 and 5 are merged next.


Major weaknesses of agglomerative clustering methods:
- They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
- They can never undo what was done previously.

Integrations of hierarchical clustering with other techniques:
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
- CURE (1998): represents a cluster by multiple well-scattered representative points.
- ROCK: clustering categorical data using links.
- Chameleon (1999): hierarchical clustering using dynamic modeling.

BIRCH (1996): Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan and Livny (SIGMOD'96).
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering.
- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.

CF-Tree in BIRCH:
- CF is a compact storage for data points in a cluster; it has enough information to calculate the intra-cluster distances.
- It is a summary of the statistics for a given subcluster.
- The additivity theorem allows us to merge sub-clusters.

Clustering Feature: CF = (N, LS, SS)
- N: number of data points
- LS: Σ_{i=1}^{N} Xi (linear sum)
- SS: Σ_{i=1}^{N} Xi² (square sum)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).
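A small sketch of the clustering feature and its additivity theorem; the class interface is illustrative, not BIRCH's actual data structure.

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) as defined above."""
    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def from_points(cls, points):
        P = np.asarray(points, dtype=float)
        return cls(len(P), P.sum(axis=0), (P ** 2).sum(axis=0))

    def __add__(self, other):
        # Additivity theorem: merging two sub-clusters adds their CF entries component-wise.
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.N

cf = CF.from_points([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf.N, cf.LS, cf.SS)          # -> 5 [16. 30.] [54. 190.], matching the slide's example
```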

CF-tree structure: the root and non-leaf nodes hold entries CF1, CF2, ..., CFk, each with a pointer child1, child2, ..., childk to a child node; leaf nodes hold CF entries for subclusters and are chained together by prev/next pointers.

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.

BIRCH:
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
- Weakness: handles only numeric data, and is sensitive to the order of the data records.

CURE (Clustering Using REpresentatives), by Guha, Rastogi & Shim, 1998:
- Shrinks the multiple representative points towards the gravity center by a fraction α.
- Multiple representatives capture the shape of the cluster.

Outline of CURE: draw a random sample S of the data; partition the sample into p partitions, each of size s/p; partially cluster each partition into s/pq clusters; then cluster the partial clusters to obtain the final clusters.

CURE example (2-D data): with s = 50 sampled points and p = 2 partitions, each partition has s/p = 25 points and is partially clustered into s/pq = 5 clusters.

CURE shrinks the multiple representative points towards the gravity center by a fraction α; the multiple representatives capture the shape of the cluster.

Properties of CURE: it can identify clusters with non-spherical shapes and is less sensitive to outliers than centroid-based methods; the implementation uses a k-d tree (and a heap) to speed up the nearest-cluster searches.

DIANA (Divisive Analysis), Kaufmann and Rousseeuw (1990):
- Inverse order of AGNES.
- Eventually each node forms a cluster on its own.



Density-based clustering: clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discovers clusters of arbitrary shape
- Handles noise
- One scan
- Needs density parameters as a termination condition
Several interesting studies:
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg & D. Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)


Basic definitions (with parameters Eps and MinPts, e.g. MinPts = 5, Eps = 1 cm):
1. Eps-neighborhood: the set of points within distance Eps of a given point.
2. Core point: a point whose Eps-neighborhood contains at least MinPts points.
3. Directly density-reachable: a point p is directly density-reachable from a point q if p lies in the Eps-neighborhood of q and q is a core point.
4. Density-reachable: p is density-reachable from q if there is a chain of points p1, p2, ..., pn in D with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi (with respect to Eps and MinPts).
5. Density-connected: p and q are density-connected if there exists a point o in D such that both p and q are density-reachable from o (with respect to Eps and MinPts).

DBSCAN (KDD-96): Density-Based Spatial Clustering of Applications with Noise, Martin Ester et al., KDD-96.

DBSCAN illustration with Eps = 1 cm and MinPts = 5.

The DBSCAN algorithm (a runnable sketch follows below):
1. Arbitrarily select a point p.
2. Retrieve the Eps-neighborhood N(p) of p and the points density-reachable from p with respect to Eps and MinPts.
3. If |N(p)| ≥ MinPts, p is a core point and a cluster containing these points is formed.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue the process until all points have been processed.
(Example parameters: Eps = 1 cm, MinPts = 5.)
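A compact, textbook-style sketch of DBSCAN following the steps above; it uses linear-scan neighborhood queries (so O(n²)), whereas a spatial index would be used in practice.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label core-point neighborhoods, expand clusters, and mark noise as -1."""
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise (or not yet assigned)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region(p):
        return [q for q in range(n) if np.linalg.norm(X[p] - X[q]) <= eps]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region(p)
        if len(neighbors) < min_pts:        # not a core point (may later become a border point)
            continue
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = region(q)
                if len(q_neighbors) >= min_pts:     # q is also a core point: keep expanding
                    seeds.extend(q_neighbors)
            if labels[q] == -1:
                labels[q] = cluster_id              # q is density-reachable from p
        cluster_id += 1
    return labels
```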

DBSCAN requires two parameters (Eps and MinPts). Its basic time complexity is O(n²); with a spatial index such as an R-tree, the neighborhood queries bring it down to O(n log n).

OPTICS (SIGMOD'99): Ordering Points To Identify the Clustering Structure, by Mihael Ankerst et al., ACM SIGMOD'99.
It extends DBSCAN by producing an ordering of the points rather than one explicit clustering, from which clusterings for a range of parameter settings can be extracted.

Definitions used by OPTICS (for a point p in D, with parameters Eps and MinPts, and N(p) the Eps-neighborhood of p):
- Core distance of p: the smallest distance ε' such that the ε'-neighborhood of p contains at least MinPts points; it is undefined if |N(p)| < MinPts.
- Reachability distance of q with respect to p: max(core-distance(p), d(p, q)); it is defined only when p is a core object, i.e. |N(p)| ≥ MinPts.

The OPTICS algorithm:
1. Select a point p.
2. Determine the core distance of p (with respect to Eps and MinPts) and the reachability distances of the points in its neighborhood.
3. Output the points in an ordering, annotated with their core and reachability distances, from which the clustering structure can be extracted.
Complexity: O(kN²).

Reachability-plot illustration (parameters Eps and MinPts).

CLIQUE (Clustering In QUEst), 1998: a grid- and density-based method for finding clusters in subspaces of high-dimensional data.

Outline of CLIQUE (a sketch of the dense-unit step follows below):
1. Partition each dimension into intervals, so that the data space is divided into non-overlapping rectangular units, and identify the dense units (those containing at least a threshold number of points).
2. Determine the subspaces that contain clusters: candidate dense units in k dimensions are generated from the dense units in k−1 dimensions, Apriori-style, since a unit can only be dense in k dimensions if its projections onto the (k−1)-dimensional subspaces are dense.
3. Identify clusters as maximal sets of connected dense units.
4. Generate a minimal description for each cluster.
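A sketch of the dense-unit identification step on one chosen subspace; the grid parameters and helper names are illustrative, and the Apriori-style candidate generation across subspaces is not shown.

```python
import numpy as np
from collections import Counter

def dense_units(X, dims, n_bins=10, tau=3):
    """Grid the chosen subspace into equal-width cells and keep cells with >= tau points."""
    S = X[:, dims]
    lo, hi = S.min(axis=0), S.max(axis=0)
    width = (hi - lo) / n_bins              # assumes each selected dimension has a nonzero range
    idx = np.minimum(((S - lo) / width).astype(int), n_bins - 1)   # cell index for every point
    cells = Counter(map(tuple, idx))
    return {cell for cell, count in cells.items() if count >= tau}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print(len(dense_units(X, dims=[0, 1], n_bins=5, tau=3)))
```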

Illustration (threshold τ = 3): dense units are found on the age-salary grid (salary in units of 10,000, age from 20 to 60) and on the age-vacation grid (vacation in weeks); intersecting them yields a candidate cluster in the (age, salary, vacation) subspace.


On-line clustering

Web search results: multiple sub-topics are mixed together for a given query. Organizing Web search results into clusters facilitates the users' quick browsing.

Zamir and Etzioni ('98) argue that a clustering algorithm for Web search results should:
- take the document snippets instead of the whole documents as input, since downloading the original documents is time-consuming;
- be fast enough for online calculation;
- generate clusters with readable descriptions for quick browsing by users.
Zamir and Etzioni presented Suffix Tree Clustering (STC) ("Web Document Clustering: A Feasibility Demonstration", SIGIR'98), which
- first identifies sets of documents that share common phrases,
- then creates clusters according to these phrases.

Suffix Tree Clustering (STC):
- STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases.
- STC satisfies the key requirements:
  - STC treats a document as a string, making use of proximity information between words.
  - STC is an incremental, O(n)-time algorithm.
  - STC succinctly summarizes cluster contents for users.
  - It is quick because it works on a smaller set of data (the snippets), and it is incremental.

The suffixes of the string "banana" are: banana, anana, nana, ana, na, a (each terminated with the end-of-string symbol $). A suffix tree over these suffixes can be built in O(n) time.
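Listing the suffixes themselves is trivial; building the actual suffix tree in O(n) (e.g. with Ukkonen's algorithm) is beyond this small sketch.

```python
# All suffixes of "banana" with the terminal symbol $, as listed above.
s = "banana"
suffixes = [s[i:] + "$" for i in range(len(s))]
print(suffixes)   # ['banana$', 'anana$', 'nana$', 'ana$', 'na$', 'a$']
```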


Suffix tree of "banana" (illustration).

The STC algorithm:
Step 1: document "cleaning"
- HTML -> plain text
- word stemming
- mark sentence boundaries
- remove non-word tokens
Step 2: identifying base clusters
Step 3: combining base clusters
* STC treats a document as a set of strings.

Example: a suffix tree of strings.
Document 1: "cat ate cheese"
Document 2: "mouse ate cheese too"
Document 3: "cat ate mouse too"
Each leaf is labeled with (document number, term number).

Base clusters: the base clusters correspond to the suffix tree nodes (a node's phrase together with the set of documents containing it).

Base cluster graph:
- Node: a base cluster.
- Edge: connects two base clusters whose similarity is 1 (see the similarity definition below).
(What if "ate" is in the stop-word list?)

Base cluster score:
s(B) = |B| * f(|P|)
- |B| is the number of documents in base cluster B.
- |P| is the number of words in the phrase P that have a non-zero score.
- Zero-score words: stopwords, and words that appear in too few (< 3) or too many (> 40% of) documents.

Step 3: combining base clusters.
- Merge base clusters with a high overlap in their document sets (documents may share multiple phrases).
- Similarity of Bm and Bn (0.5 is a parameter):
  similarity = 1 iff |Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5;
  similarity = 0 otherwise.
A sketch of the scoring and merging steps is given below.
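A simplified sketch of the scoring and merging rules above. The phrase-quality factor f(|P|) is replaced here by a plain count of non-zero-score words, and the greedy single-pass merge only approximates the connected-component merging used by STC.

```python
def score(base_cluster, phrase_words, stopwords):
    """s(B) = |B| * (number of non-zero-score words in the phrase), a simplification of f(|P|)."""
    effective = [w for w in phrase_words if w not in stopwords]
    return len(base_cluster) * len(effective)

def similar(Bm, Bn, threshold=0.5):
    """Binary similarity: 1 iff both overlap ratios exceed the 0.5 threshold."""
    overlap = len(Bm & Bn)
    return overlap / len(Bm) > threshold and overlap / len(Bn) > threshold

def merge_base_clusters(base_clusters):
    """Greedily union base clusters that are connected in the base-cluster graph."""
    merged = []
    for B in base_clusters:
        for group in merged:
            if similar(B, group):
                group |= B
                break
        else:
            merged.append(set(B))
    return merged

b1, b2, b3 = {1, 2, 3}, {2, 3}, {7, 8}          # document-id sets of three base clusters
print(merge_base_clusters([b1, b2, b3]))        # -> [{1, 2, 3}, {7, 8}]
```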

Evaluation: precision (figure)

Snippets vs. whole documents (figure)

Execution time (figure)

Summary of STC:
- STC is an incremental, O(n)-time clustering algorithm.
- STC allows cluster overlap:
  - a document often covers more than one topic;
  - STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents;
  - clusters that are too similar, however, are merged into one.
- It is a clustering algorithm designed for Web search engine results.

A paper on clustering Web search results (2004):
"Learning to Cluster Web Search Results", Hua-Jun Zeng et al. (Microsoft Research Asia), SIGIR'04, 2004.
Motivation: traditional clustering techniques generate clusters with poorly readable names.

Zeng's method reformulates the clustering problem as a salient phrase ranking problem:
- First, it extracts and ranks salient phrases as candidate cluster names, based on a regression model learned from human-labeled training data.
- The documents are assigned to relevant salient phrases to form candidate clusters.
- The final clusters are generated by merging these candidate clusters.

Salient phrase extraction:
Denote the current phrase (an n-gram) as w, and the set of documents that contain w as D(w).
Five properties are calculated during document parsing:
- Phrase Frequency / Inverted Document Frequency
- Phrase Length
- Intra-Cluster Similarity
- Cluster Entropy
- Phrase Independence

Salient phrase properties:
- Phrase Frequency / Inverted Document Frequency (TFIDF).
- Phrase Length: the count of words in a phrase; a longer name is preferred for users' browsing. LEN = n.
- Intra-Cluster Similarity (ICS): calculated as the average cosine similarity between the documents in D(w) and their centroid.

- Cluster Entropy (CE): represents the distinctness of a phrase.
- Phrase Independence (IND): a phrase is independent when the entropy of its context is high (i.e. the left and right contexts are random enough).
A small sketch of two of these properties follows below.
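An illustrative computation of two of the five properties (TFIDF and LEN) for a candidate phrase over a snippet collection; the variable names and example snippets are mine, not the paper's.

```python
import math

def tfidf(w, docs):
    """Phrase frequency weighted by inverted document frequency for phrase w."""
    n_docs = len(docs)
    df = sum(1 for d in docs if w in d)          # number of snippets containing the phrase
    tf = sum(d.count(w) for d in docs)           # total occurrences of the phrase
    return tf * math.log(n_docs / df) if df else 0.0

def phrase_length(w):
    return len(w.split())                        # LEN = n, the number of words in the n-gram

docs = ["learning to cluster web search results",
        "cluster web search results into readable groups",
        "salient phrase ranking for search results"]
print(tfidf("web search", docs), phrase_length("web search"))
```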

Experimental results -- property comparison (figure)

Experimental results -- learning methods comparison: the coefficients of one of the linear regression models, as follows (table).

Example Queries