(2013)
Email: yangjw@pku.edu.cn
Clustering

Clustering: given a data set X = {X1, X2, ..., Xn}, partition it into k clusters C1, C2, ..., Ck, where each cluster Ci = {Xj1, Xj2, ..., Xjni} (i = 1, ..., k) is a subset of X, such that
C1 ∪ C2 ∪ ... ∪ Ck = X, and Ci ∩ Cj = ∅ for i ≠ j.

Document Clustering (DC) is partitioning a set of documents into groups or clusters.
Clusters should be computed to:
- contain similar documents;
- separate dissimilar documents as much as possible.
For instance, if similarity between documents is defined to capture semantic relatedness, the documents in a cluster should deal with the same topics, and the topics of different clusters should differ.

Applications of document clustering:
- Guiding the organization of a document collection (e.g. [Sahami 98]): progressively clustering groups of documents allows building a topic hierarchy.
- Supporting browsing and interactive retrieval: grouping retrieved documents to allow a faster selection of relevant documents; used by some search engines (Vivisimo).
- Pre-processing for other tasks, e.g. detecting the main semantic dimensions of the text collection to support a kind of concept indexing [Karypis & Han 00], or text summarization.

As with other text processing tasks, DC has several steps:
- Document representation
- Dimensionality reduction
- Applying a clustering algorithm
- Evaluating the effectiveness of the process

(Illustration: each document is mapped to a term-weight vector.)


Evaluation measures:
P, precision
R, recall
F-measure: the harmonic mean of precision and recall,
1/F1 = (1/2) (1/P + 1/R), i.e. F1 = 2PR / (P + R)

For each cluster/class i, with precision Pi and recall Ri:
F1i = 2 Pi Ri / (Pi + Ri)   (equivalently, 1/F1i = (1/2)(1/Pi + 1/Ri))

Macro-averaged F:
Macro-F = (1/m) Σ_{i=1}^{m} Fi

Micro-averaged F (weighted by the cluster sizes ni):
Micro-F = Σ_{i=1}^{m} (ni · Fi) / Σ_{i=1}^{m} ni
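The following Python sketch computes F1, macro-F and micro-F exactly as defined above; the input format (a list of per-cluster precision, recall and size triples) is an illustrative choice, not part of the original slides.

```python
def f1(p, r):
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def macro_micro_f(clusters):
    """clusters: list of (precision_i, recall_i, size_ni), one entry per cluster/class."""
    f_scores = [f1(p, r) for p, r, _ in clusters]
    sizes = [n for _, _, n in clusters]
    macro_f = sum(f_scores) / len(f_scores)                               # unweighted mean
    micro_f = sum(n * f for n, f in zip(sizes, f_scores)) / sum(sizes)    # size-weighted, as on the slide
    return macro_f, micro_f

print(macro_micro_f([(0.9, 0.8, 50), (0.6, 0.7, 150)]))
```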

The sum-of-squared-error criterion:
Je = Σ_{i=1}^{k} Σ_{X∈Ci} |X − mi|²
where X is a sample belonging to cluster Ci and mi is the mean of Ci. A good partition is one that minimizes Je.





Document representation: the Vector Space Model.
- M index terms ti (words / phrases / ...).
- Each document dj is represented as a vector (a1j, a2j, ..., aMj).
- N documents give the M×N term-document matrix A = (aij).
- Document similarity is usually measured by the cosine of the angle between vectors.
Example with three terms T1, T2, T3:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
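A minimal sketch of the cosine similarity used in the vector space model, applied to the three-term example above (the values in the comments are approximate):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

D1, D2, Q = (2, 3, 5), (3, 7, 1), (0, 0, 2)
print(cosine(D1, Q))  # ~0.81: D1 is the document more similar to the query
print(cosine(D2, Q))  # ~0.13
```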


Distances between two clusters Gp and Gq, denoted Dpq (d(xi, xj) is the distance between samples xi ∈ Gp and xj ∈ Gq):

Single linkage (nearest neighbor): Dpq = min d(xi, xj)

Complete linkage (farthest neighbor): Dpq = max d(xi, xj)

Centroid method: Dpq = d(x̄p, x̄q), the distance between the cluster means

Average linkage: Dpq = (1 / (n1 n2)) Σ_{xi∈Gp} Σ_{xj∈Gq} d(xi, xj)

Ward's method (sum of squared deviations): with
D1 = Σ_{xi∈Gp} (xi − x̄p)'(xi − x̄p),
D2 = Σ_{xj∈Gq} (xj − x̄q)'(xj − x̄q),
D12 = Σ_{xk∈Gp∪Gq} (xk − x̄)'(xk − x̄),
the between-cluster distance is Dpq = D12 − D1 − D2.
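The inter-cluster distances above can be written down directly; the sketch below (NumPy, Euclidean d) is one possible formulation under those definitions, not code from the slides.

```python
import numpy as np

def pairwise(Gp, Gq):
    """All pairwise Euclidean distances d(xi, xj), xi in Gp, xj in Gq."""
    return np.linalg.norm(Gp[:, None, :] - Gq[None, :, :], axis=2)

def single_link(Gp, Gq):    return pairwise(Gp, Gq).min()
def complete_link(Gp, Gq):  return pairwise(Gp, Gq).max()
def average_link(Gp, Gq):   return pairwise(Gp, Gq).mean()   # (1/(n1*n2)) * double sum
def centroid_dist(Gp, Gq):  return np.linalg.norm(Gp.mean(0) - Gq.mean(0))

def ward_dist(Gp, Gq):
    """D12 - D1 - D2: the increase in within-cluster sum of squares caused by merging."""
    wss = lambda G: ((G - G.mean(0)) ** 2).sum()
    return wss(np.vstack([Gp, Gq])) - wss(Gp) - wss(Gq)
```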


A generic partitioning (seed-based) clustering procedure for a document set D = {d1, ..., di, ..., dn}:
1. Choose the number of clusters k.
2. Select k seeds S = {s1, ..., sj, ..., sk}.
3. For each document di in D, compute its similarity sim(di, sj) to every seed sj.
4. Assign di to the cluster Cj whose seed maximizes arg max sim(di, sj), yielding a clustering C = {c1, ..., ck} of D.
5. Recompute the seeds and repeat steps 2-4 until the clustering no longer changes.

k-means (MacQueen, 1967):
1. Choose k objects as the initial cluster centers.
2. Assign each object to the cluster with the nearest center.
3. Recompute the mean of each cluster as its new center.
4. Repeat steps 2 and 3 until the k centers no longer change.

k-means illustration (K = 2): arbitrarily assign each object to one of the k initial clusters, update the cluster means, reassign each object to its nearest mean, update the means again, and repeat until no reassignment occurs.

k-means worked example: five samples X1, ..., X5 with
X1 = (0, 2), X2 = (0, 0), X3 = (1.5, 0), X4 = (5, 0), X5 = (5, 2)
and k = 2. Start from the initial partition C1 = {X1, X2, X4}, C2 = {X3, X5}.

Compute the cluster means M1 and M2:
M1 = ((0+0+5)/3, (2+0+0)/3) = (1.66, 0.66)
M2 = ((1.5+5)/2, (0+2)/2) = (3.25, 1.00)

Using Je = Σ_{i=1}^{k} Σ_{X∈Ci} |X − mi|², the within-cluster squared errors are
e1² = [(0−1.66)² + (2−0.66)²] + [(0−1.66)² + (0−0.66)²] + [(5−1.66)² + (0−0.66)²] = 19.36
e2² = 8.12
E² = e1² + e2² = 19.36 + 8.12 = 27.48

Reassign each sample to the nearer of the two centers M1 and M2:
d(M1, X1) = (1.66² + 1.34²)^(1/2) = 2.14, d(M2, X1) = 3.40  ==>  X1 ∈ C1
d(M1, X2) = 1.79, d(M2, X2) = 3.40  ==>  X2 ∈ C1
d(M1, X3) = 0.83, d(M2, X3) = 2.01  ==>  X3 ∈ C1
d(M1, X4) = 3.41, d(M2, X4) = 2.01  ==>  X4 ∈ C2
d(M1, X5) = 3.60, d(M2, X5) = 2.01  ==>  X5 ∈ C2

The new clusters are C1 = {X1, X2, X3} and C2 = {X4, X5}.

Recompute the means: M1 = (0.5, 0.67), M2 = (5.0, 1.0).
Now e1² = 4.17, e2² = 2.00, so E² = 6.17.
The total squared error has dropped from 27.48 to 6.17.
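The worked example can be reproduced with a short k-means sketch. The function below starts from a given initial assignment, which is a simplification (a real implementation would also handle empty clusters and random initialization).

```python
import numpy as np

def kmeans(X, labels, k, n_iter=100):
    """Plain k-means starting from an initial assignment; empty-cluster handling omitted."""
    for _ in range(n_iter):
        # Update the cluster means from the current assignment.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Reassign each point to its nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Sum-of-squared-error criterion Je from the slides.
    sse = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return labels, centers, sse

# X1..X5 with the initial partition C1 = {X1, X2, X4}, C2 = {X3, X5}.
X = np.array([[0, 2], [0, 0], [1.5, 0], [5, 0], [5, 2]], dtype=float)
labels, centers, sse = kmeans(X, np.array([0, 0, 1, 0, 1]), k=2)
print(labels, sse)   # -> [0 0 0 1 1], Je ~ 6.17, matching the worked example
```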

Properties of k-means:
- Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
- For comparison: PAM is O(k(n−k)²) per iteration, CLARA is O(ks² + k(n−k)).
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Variants of k-means: the k-modes method replaces means with modes to handle categorical data, and k-prototypes combines k-means and k-modes to cluster mixed numeric and categorical data.

k-medoids methods:
PAM (Partitioning Around Medoids), Kaufmann and Rousseeuw, 1987: instead of the cluster mean, each cluster is represented by its most centrally located object (the medoid), and the data are partitioned around k medoids.

PAM (1987) — a runnable sketch follows below:
1. Arbitrarily choose k objects as the initial medoids.
2. repeat
3.   Assign each remaining object to the cluster of its nearest medoid.
4.   Randomly select a non-medoid object Orandom.
5.   Compute the total cost S of swapping a medoid Oj with Orandom.
6.   If S < 0, swap Oj with Orandom to form the new set of k medoids.
7. Until no change.
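A brute-force sketch of the PAM swap loop described above. The cost function simply sums distances to the nearest medoid, and the exhaustive search over swaps is for clarity, not efficiency.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))       # step 1: arbitrary initial medoids
    for _ in range(n_iter):
        improved = False
        for o_random in range(len(X)):                          # candidate non-medoid objects
            if o_random in medoids:
                continue
            for j in range(k):                                  # try swapping medoid j with o_random
                trial = medoids.copy()
                trial[j] = o_random
                if total_cost(X, trial) < total_cost(X, medoids):   # swap only if the cost drops
                    medoids, improved = trial, True
        if not improved:
            break
    return medoids
```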

Illustration of a typical k-medoids algorithm (PAM) with K = 2: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object Orandom and compute the total cost of swapping a medoid with Orandom (here the cost would rise from 20 to 26, so the swap is not made); swap only if the quality is improved, and loop until no change.

Properties of PAM:
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
- PAM works efficiently for small data sets but does not scale well to large data sets: O(k(n−k)²) per iteration, where n is the number of objects and k the number of clusters.
- Sampling-based method: CLARA (Clustering LARge Applications).

CLARA (Clustering LARge Applications), Kaufmann and Rousseeuw, 1990:
- Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output.
- Strength: deals with larger data sets than PAM.
- Weaknesses: efficiency depends on the sample size; a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.

CLARANS (A Clustering Algorithm based on Randomized Search), Ng and Han, 1994:
- Draws a sample of neighbors dynamically.
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids.
- If a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum.
- It is more efficient and scalable than both PAM and CLARA.

Hierarchical clustering works in two directions: agglomerative clustering (AGNES) proceeds bottom-up (Step 0 to Step 4), merging the singletons a, b, c, d, e into ab and de, then cde, and finally abcde; divisive clustering (DIANA) proceeds top-down (Step 4 to Step 0) in the reverse order.


A generic agglomerative clustering procedure for D = {d1, ..., di, ..., dn} (a sketch in code follows below):
1. Start with each document di in its own cluster Ci = {di}, so the initial clustering of D is C = {c1, ..., ci, ..., cn}.
2. For every pair of clusters (ci, cj) in C, compute the similarity sim(ci, cj).
3. Select the most similar pair, arg max sim(ci, cj), and merge ci and cj into a new cluster ck = ci ∪ cj, so that C becomes {c1, ..., cn-1}.
4. Repeat steps 2 and 3 on C until a single cluster remains (or a stopping criterion is met).
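A naive sketch of this bottom-up procedure, using the single-link distance as the merge criterion (maximizing similarity is replaced by minimizing distance). Real implementations cache the distance matrix instead of recomputing it.

```python
import numpy as np

def single_link_agglomerative(X, n_clusters=1):
    """AGNES-style bottom-up merging with the single-link (minimum) distance."""
    clusters = [[i] for i in range(len(X))]          # start: each object is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])  # single link: closest pair of members
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]      # merge the two closest clusters
        del clusters[b]
    return clusters
```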

AGNES (Agglomerative Nesting), Kaufmann and Rousseeuw (1990):
- Uses the single-link method and the dissimilarity matrix.
- Merges the nodes that have the least dissimilarity.
- Goes on in a non-descending fashion.
- Eventually all nodes belong to the same cluster.

Common choices of inter-cluster distance:
1. Single linkage (nearest neighbor): Dmin(Ci, Cj)
2. Complete linkage (farthest neighbor): Dmax(Ci, Cj)
3. Average linkage: Davg(Ci, Cj)

Single link (illustration)

Complete link (illustration)

Single-link example: points 3 and 4 are merged first, at distance D34 = 5.0, giving the clusters {1}, {2}, {3,4}, {5}.

Update the distances to the new cluster (34):
D(1,(34)) = min(D(1,3), D(1,4)) = min(20.6, 22.4) = 20.6;
D(2,(34)) = min(D(2,3), D(2,4)) = min(14.1, 11.2) = 11.2;
D(5,(34)) = min(D(3,5), D(4,5)) = min(25.0, 25.5) = 25.0.

Among the clusters {1}, {2}, {(34)}, {5}, the smallest remaining distance is D(1,5) = 7.07, so points 1 and 5 are merged next.


Major weaknesses of agglomerative clustering methods:
- They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
- They can never undo what was done previously.

Integrations of hierarchical clustering with other techniques:
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters.
- CURE (1998): represents a cluster by multiple well-scattered representative points.
- ROCK: clustering categorical data using links.
- Chameleon (1999): hierarchical clustering using dynamic modeling.

BIRCH (1996): Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan and Livny (SIGMOD'96).
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering.
- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.

CF-Tree in BIRCH:
- CF is a compact storage for data points in a cluster; it has enough information to calculate the intra-cluster distances.
- It is a summary of the statistics for a given subcluster.
- The additivity theorem allows us to merge sub-clusters.

Clustering Feature: CF = (N, LS, SS)
- N: number of data points
- LS: Σ_{i=1}^{N} Xi (linear sum)
- SS: Σ_{i=1}^{N} Xi² (square sum)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)).
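A small sketch of the clustering feature and its additivity theorem; the class interface is illustrative, not BIRCH's actual data structure.

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) as defined above."""
    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def from_points(cls, points):
        P = np.asarray(points, dtype=float)
        return cls(len(P), P.sum(axis=0), (P ** 2).sum(axis=0))

    def __add__(self, other):
        # Additivity theorem: merging two sub-clusters adds their CF entries component-wise.
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.N

cf = CF.from_points([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf.N, cf.LS, cf.SS)          # -> 5 [16. 30.] [54. 190.], matching the slide's example
```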

CF-tree structure: the root and non-leaf nodes hold entries CF1, CF2, ..., CFk, each with a pointer child1, child2, ..., childk to a child node; leaf nodes hold CF entries for subclusters and are chained together by prev/next pointers.

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.

BIRCH:
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
- Weakness: handles only numeric data, and is sensitive to the order of the data records.

CURE (Clustering Using REpresentatives), by Guha, Rastogi & Shim, 1998:
- Shrinks the multiple representative points towards the gravity center by a fraction α.
- Multiple representatives capture the shape of the cluster.

Outline of CURE: draw a random sample S of the data; partition the sample into p partitions, each of size s/p; partially cluster each partition into s/pq clusters; then cluster the partial clusters to obtain the final clusters.

CURE example (2-D data): with s = 50 sampled points and p = 2 partitions, each partition has s/p = 25 points and is partially clustered into s/pq = 5 clusters.

CURE shrinks the multiple representative points towards the gravity center by a fraction α; the multiple representatives capture the shape of the cluster.

Properties of CURE: it can identify clusters with non-spherical shapes and is less sensitive to outliers than centroid-based methods; the implementation uses a k-d tree (and a heap) to speed up the nearest-cluster searches.

DIANA (Divisive Analysis), Kaufmann and Rousseeuw (1990):
- Inverse order of AGNES.
- Eventually each node forms a cluster on its own.



Density-based clustering: clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discovers clusters of arbitrary shape
- Handles noise
- One scan
- Needs density parameters as a termination condition
Several interesting studies:
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg & D. Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)


Basic definitions (with parameters Eps and MinPts, e.g. MinPts = 5, Eps = 1 cm):
1. Eps-neighborhood: the set of points within distance Eps of a given point.
2. Core point: a point whose Eps-neighborhood contains at least MinPts points.
3. Directly density-reachable: a point p is directly density-reachable from a point q if p lies in the Eps-neighborhood of q and q is a core point.
4. Density-reachable: p is density-reachable from q if there is a chain of points p1, p2, ..., pn in D with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi (with respect to Eps and MinPts).
5. Density-connected: p and q are density-connected if there exists a point o in D such that both p and q are density-reachable from o (with respect to Eps and MinPts).

DBSCAN (KDD-96): Density-Based Spatial Clustering of Applications with Noise, Martin Ester et al., KDD-96.

DBSCAN illustration with Eps = 1 cm and MinPts = 5.

The DBSCAN algorithm (a runnable sketch follows below):
1. Arbitrarily select a point p.
2. Retrieve the Eps-neighborhood N(p) of p and the points density-reachable from p with respect to Eps and MinPts.
3. If |N(p)| ≥ MinPts, p is a core point and a cluster containing these points is formed.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue the process until all points have been processed.
(Example parameters: Eps = 1 cm, MinPts = 5.)
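A compact, textbook-style sketch of DBSCAN following the steps above; it uses linear-scan neighborhood queries (so O(n²)), whereas a spatial index would be used in practice.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label core-point neighborhoods, expand clusters, and mark noise as -1."""
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise (or not yet assigned)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region(p):
        return [q for q in range(n) if np.linalg.norm(X[p] - X[q]) <= eps]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = region(p)
        if len(neighbors) < min_pts:        # not a core point (may later become a border point)
            continue
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = region(q)
                if len(q_neighbors) >= min_pts:     # q is also a core point: keep expanding
                    seeds.extend(q_neighbors)
            if labels[q] == -1:
                labels[q] = cluster_id              # q is density-reachable from p
        cluster_id += 1
    return labels
```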

DBSCAN requires two parameters (Eps and MinPts). Its basic time complexity is O(n²); with a spatial index such as an R-tree, the neighborhood queries bring it down to O(n log n).

OPTICS (SIGMOD'99): Ordering Points To Identify the Clustering Structure, by Mihael Ankerst et al., ACM SIGMOD'99.
It extends DBSCAN by producing an ordering of the points rather than one explicit clustering, from which clusterings for a range of parameter settings can be extracted.

Definitions used by OPTICS (for a point p in D, with parameters Eps and MinPts, and N(p) the Eps-neighborhood of p):
- Core distance of p: the smallest distance ε' such that the ε'-neighborhood of p contains at least MinPts points; it is undefined if |N(p)| < MinPts.
- Reachability distance of q with respect to p: max(core-distance(p), d(p, q)); it is defined only when p is a core object, i.e. |N(p)| ≥ MinPts.

The OPTICS algorithm:
1. Select a point p.
2. Determine the core distance of p (with respect to Eps and MinPts) and the reachability distances of the points in its neighborhood.
3. Output the points in an ordering, annotated with their core and reachability distances, from which the clustering structure can be extracted.
Complexity: O(kN²).

Reachability-plot illustration (parameters Eps and MinPts).

CLIQUE (Clustering In QUEst), 1998: a grid- and density-based method for finding clusters in subspaces of high-dimensional data.

Outline of CLIQUE (a sketch of the dense-unit step follows below):
1. Partition each dimension into intervals, so that the data space is divided into non-overlapping rectangular units, and identify the dense units (those containing at least a threshold number of points).
2. Determine the subspaces that contain clusters: candidate dense units in k dimensions are generated from the dense units in k−1 dimensions, Apriori-style, since a unit can only be dense in k dimensions if its projections onto the (k−1)-dimensional subspaces are dense.
3. Identify clusters as maximal sets of connected dense units.
4. Generate a minimal description for each cluster.
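A sketch of the dense-unit identification step on one chosen subspace; the grid parameters and helper names are illustrative, and the Apriori-style candidate generation across subspaces is not shown.

```python
import numpy as np
from collections import Counter

def dense_units(X, dims, n_bins=10, tau=3):
    """Grid the chosen subspace into equal-width cells and keep cells with >= tau points."""
    S = X[:, dims]
    lo, hi = S.min(axis=0), S.max(axis=0)
    width = (hi - lo) / n_bins              # assumes each selected dimension has a nonzero range
    idx = np.minimum(((S - lo) / width).astype(int), n_bins - 1)   # cell index for every point
    cells = Counter(map(tuple, idx))
    return {cell for cell, count in cells.items() if count >= tau}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print(len(dense_units(X, dims=[0, 1], n_bins=5, tau=3)))
```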

Illustration (threshold τ = 3): dense units are found on the age-salary grid (salary in units of 10,000, age from 20 to 60) and on the age-vacation grid (vacation in weeks); intersecting them yields a candidate cluster in the (age, salary, vacation) subspace.


On-line clustering

Web search results: multiple sub-topics are mixed together for a given query. Organizing Web search results into clusters facilitates the users' quick browsing.

Zamir and Etzioni ('98) argue that a clustering algorithm for Web search results should:
- take the document snippets instead of the whole documents as input, since downloading the original documents is time-consuming;
- be fast enough for online calculation;
- generate clusters with readable descriptions for quick browsing by users.
Zamir and Etzioni presented Suffix Tree Clustering (STC) ("Web Document Clustering: A Feasibility Demonstration", SIGIR'98), which
- first identifies sets of documents that share common phrases,
- then creates clusters according to these phrases.

Suffix Tree Clustering (STC):
- STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases.
- STC satisfies the key requirements:
  - STC treats a document as a string, making use of proximity information between words.
  - STC is an incremental, O(n)-time algorithm.
  - STC succinctly summarizes cluster contents for users.
  - It is quick because it works on a smaller set of data (the snippets), and it is incremental.

The suffixes of the string "banana" are: banana, anana, nana, ana, na, a (each terminated with the end-of-string symbol $). A suffix tree over these suffixes can be built in O(n) time.
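Listing the suffixes themselves is trivial; building the actual suffix tree in O(n) (e.g. with Ukkonen's algorithm) is beyond this small sketch.

```python
# All suffixes of "banana" with the terminal symbol $, as listed above.
s = "banana"
suffixes = [s[i:] + "$" for i in range(len(s))]
print(suffixes)   # ['banana$', 'anana$', 'nana$', 'ana$', 'na$', 'a$']
```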


Suffix tree of "banana" (illustration).

The STC algorithm:
Step 1: document "cleaning"
- HTML -> plain text
- word stemming
- mark sentence boundaries
- remove non-word tokens
Step 2: identifying base clusters
Step 3: combining base clusters
* STC treats a document as a set of strings.

Example: a suffix tree of strings.
Document 1: "cat ate cheese"
Document 2: "mouse ate cheese too"
Document 3: "cat ate mouse too"
Each leaf is labeled with (document number, term number).

Base clusters: the base clusters correspond to the suffix tree nodes (a node's phrase together with the set of documents containing it).

Base cluster graph:
- Node: a base cluster.
- Edge: connects two base clusters whose similarity is 1 (see the similarity definition below).
(What if "ate" is in the stop-word list?)

Base cluster score:
s(B) = |B| * f(|P|)
- |B| is the number of documents in base cluster B.
- |P| is the number of words in the phrase P that have a non-zero score.
- Zero-score words: stopwords, and words that appear in too few (< 3) or too many (> 40% of) documents.

Step 3: combining base clusters.
- Merge base clusters with a high overlap in their document sets (documents may share multiple phrases).
- Similarity of Bm and Bn (0.5 is a parameter):
  similarity = 1 iff |Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5;
  similarity = 0 otherwise.
A sketch of the scoring and merging steps is given below.
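A simplified sketch of the scoring and merging rules above. The phrase-quality factor f(|P|) is replaced here by a plain count of non-zero-score words, and the greedy single-pass merge only approximates the connected-component merging used by STC.

```python
def score(base_cluster, phrase_words, stopwords):
    """s(B) = |B| * (number of non-zero-score words in the phrase), a simplification of f(|P|)."""
    effective = [w for w in phrase_words if w not in stopwords]
    return len(base_cluster) * len(effective)

def similar(Bm, Bn, threshold=0.5):
    """Binary similarity: 1 iff both overlap ratios exceed the 0.5 threshold."""
    overlap = len(Bm & Bn)
    return overlap / len(Bm) > threshold and overlap / len(Bn) > threshold

def merge_base_clusters(base_clusters):
    """Greedily union base clusters that are connected in the base-cluster graph."""
    merged = []
    for B in base_clusters:
        for group in merged:
            if similar(B, group):
                group |= B
                break
        else:
            merged.append(set(B))
    return merged

b1, b2, b3 = {1, 2, 3}, {2, 3}, {7, 8}          # document-id sets of three base clusters
print(merge_base_clusters([b1, b2, b3]))        # -> [{1, 2, 3}, {7, 8}]
```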

Evaluation: precision (figure)

Snippets vs. whole documents (figure)

Execution time (figure)

Summary of STC:
- STC is an incremental, O(n)-time clustering algorithm.
- STC allows cluster overlap:
  - a document often covers more than one topic;
  - STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents;
  - clusters that are too similar, however, are merged into one.
- It is a clustering algorithm designed for Web search engine results.

A paper on clustering Web search results (2004):
"Learning to Cluster Web Search Results", Hua-Jun Zeng et al. (Microsoft Research Asia), SIGIR'04, 2004.
Motivation: traditional clustering techniques generate clusters with poorly readable names.

Zeng's method reformulates the clustering problem as a salient phrase ranking problem:
- First, it extracts and ranks salient phrases as candidate cluster names, based on a regression model learned from human-labeled training data.
- The documents are assigned to relevant salient phrases to form candidate clusters.
- The final clusters are generated by merging these candidate clusters.

Salient phrase extraction:
Denote the current phrase (an n-gram) as w, and the set of documents that contain w as D(w).
Five properties are calculated during document parsing:
- Phrase Frequency / Inverted Document Frequency
- Phrase Length
- Intra-Cluster Similarity
- Cluster Entropy
- Phrase Independence

Salient phrase properties:
- Phrase Frequency / Inverted Document Frequency (TFIDF).
- Phrase Length: the count of words in a phrase; a longer name is preferred for users' browsing. LEN = n.
- Intra-Cluster Similarity (ICS): calculated as the average cosine similarity between the documents in D(w) and their centroid.

- Cluster Entropy (CE): represents the distinctness of a phrase.
- Phrase Independence (IND): a phrase is independent when the entropy of its context is high (i.e. the left and right contexts are random enough).
A small sketch of two of these properties follows below.
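An illustrative computation of two of the five properties (TFIDF and LEN) for a candidate phrase over a snippet collection; the variable names and example snippets are mine, not the paper's.

```python
import math

def tfidf(w, docs):
    """Phrase frequency weighted by inverted document frequency for phrase w."""
    n_docs = len(docs)
    df = sum(1 for d in docs if w in d)          # number of snippets containing the phrase
    tf = sum(d.count(w) for d in docs)           # total occurrences of the phrase
    return tf * math.log(n_docs / df) if df else 0.0

def phrase_length(w):
    return len(w.split())                        # LEN = n, the number of words in the n-gram

docs = ["learning to cluster web search results",
        "cluster web search results into readable groups",
        "salient phrase ranking for search results"]
print(tfidf("web search", docs), phrase_length("web search"))
```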

Experimental results -- property comparison (figure)

Experimental results -- learning methods comparison: the coefficients of one of the linear regression models, as follows (table).

Example Queries