Email:yangjw@pku.edu.cn
Cluster

Clustering: given a document set X = {X1, X2, …, Xn}, partition X into k clusters C1, C2, …, Ck (each Ci = {X_j1, X_j2, …, X_jni}) such that:
1. each Ci (i = 1, …, k) is a nonempty subset of X;
2. C1 ∪ C2 ∪ … ∪ Ck = X;
3. Ci ∩ Cj = ∅ for all i ≠ j.
Goal: put similar documents into the same cluster, and separate different documents into different clusters as much as possible.
The clustering process:
1. Document representation
2. Dimensionality reduction
3. Applying a clustering algorithm
4. Evaluating the effectiveness of the process
Evaluation measures:
P: precision
R: recall
F-measure: the harmonic mean of precision and recall,

    F1 = 2 / (1/P + 1/R) = 2PR / (P + R)

Per-category scores can be averaged in two ways. Macro-averaging computes F for each category i and takes the unweighted mean over the m categories:

    F1_i = 2 P_i R_i / (P_i + R_i)
    Macro_F = (1/m) sum_{i=1}^{m} F1_i

Micro-averaging weights each category by its size n_i:

    Micro_F = sum_{i=1}^{m} (n_i F1_i) / sum_{i=1}^{m} n_i
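As a quick check of the averaging formulas above, a small sketch (the precision/recall values and class sizes are made up for illustration):

```python
def f1(p, r):
    """Harmonic mean of precision and recall: F1 = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def macro_f(per_class_f):
    """Macro-averaged F: unweighted mean of the per-class F scores."""
    return sum(per_class_f) / len(per_class_f)

def micro_f(per_class_f, sizes):
    """Size-weighted average of per-class F, with class sizes n_i as weights."""
    return sum(n * f for n, f in zip(sizes, per_class_f)) / sum(sizes)

# Two classes with hypothetical precision/recall values.
f = [f1(0.8, 0.6), f1(0.5, 0.5)]   # about 0.686 and 0.5
print(macro_f(f))                   # simple average of the two
print(micro_f(f, sizes=[30, 10]))  # the larger class dominates
</imports>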
The sum-of-squared-error criterion:

    Je = sum_{i=1}^{k} sum_{X in Ci} |X - mi|^2

where X ranges over the documents of cluster Ci and mi is the mean of Ci; the clustering that minimizes Je is preferred.
The Vector Space Model: the collection is indexed by M terms ti. Each document dj is represented as a weight vector

    dj = (a1j, a2j, …, aMj)

and the N documents form the term-document matrix A_{M*N} = (a_ij). Similarity between documents is measured by the cosine of the angle between their vectors.
(Figure: a document such as D2 = 3T1 + 7T2 + T3 plotted in the space spanned by the terms T1, T2, T3.)
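A minimal sketch of cosine similarity, using the figure's document D2 = (3, 7, 1) and a made-up second vector:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# D2 = 3T1 + 7T2 + T3 from the figure; d1 is an invented comparison document.
d1 = [2, 3, 5]
d2 = [3, 7, 1]
print(round(cosine(d2, d2), 3))  # 1.0: identical direction
print(round(cosine(d1, d2), 3))  # 0.676
```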
Distance between clusters: for clusters Gp and Gq, the inter-cluster distance Dpq is defined from the point distances d(xi, xj), xi in Gp, xj in Gq.

Single linkage (nearest neighbor):

    Dpq = min d(xi, xj)

Complete linkage (furthest neighbor):

    Dpq = max d(xi, xj)

Centroid distance:

    Dpq = d(mp, mq)

where mp and mq are the centroids of Gp and Gq.

Average linkage:

    Dpq = (1 / (n1 n2)) sum_{xi in Gp} sum_{xj in Gq} d(xi, xj)

Ward's method (minimum increase of within-cluster scatter): with

    D1  = sum_{xi in Gp} (xi - mp)'(xi - mp)
    D2  = sum_{xj in Gq} (xj - mq)'(xj - mq)
    D12 = sum_{xk in Gp ∪ Gq} (xk - m)'(xk - m)

the cluster distance is Dpq = D12 - D1 - D2.
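The single-, complete-, and average-linkage definitions above can be sketched as follows (the two point sets are made up):

```python
from itertools import product

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def linkage_distances(Gp, Gq):
    """Dpq under single (min), complete (max), and average linkage."""
    d = [dist(x, y) for x, y in product(Gp, Gq)]
    return min(d), max(d), sum(d) / (len(Gp) * len(Gq))

Gp = [(0, 0), (0, 1)]
Gq = [(3, 0), (4, 0)]
single, complete, average = linkage_distances(Gp, Gq)
print(single, complete, average)  # single <= average <= complete always holds
```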
A generic partitional clustering procedure:
1. Choose the number of clusters k.
2. Select k seeds S = {s1, …, sj, …, sk}.
3. For each document di in D, compute sim(di, sj) for every seed sj.
4. Assign di to the cluster Cj whose seed maximizes the similarity, arg max sim(di, sj), yielding a clustering C = {c1, …, ck} of D.
5. Repeat steps 2-4 (recomputing the seeds) until the clustering stabilizes.
The k-means algorithm (MacQueen, 1967):
1. Choose k initial cluster centers.
2. Assign each object to the cluster with the nearest center.
3. Recompute each cluster center as the mean of its members.
4. Repeat steps 2 and 3 until the k centers no longer change.
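A compact sketch of the four steps above (plain k-means on 2-D points; the helper names and sample data are my own):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: assign to nearest mean, update means."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                     # step 1: initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: assignment
            j = min(range(k), key=lambda i: (p[0] - means[i][0]) ** 2
                                            + (p[1] - means[i][1]) ** 2)
            clusters[j].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
               for i, cl in enumerate(clusters)]      # step 3: update means
        if new == means:                              # step 4: converged
            break
        means = new
    return means, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
means, clusters = kmeans(pts, 2)
print(means)  # two centers, one near each blob
```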
(Figure: k-means with K = 2. Arbitrarily assign each object to one of k initial clusters; update the cluster means; reassign each object to the nearest mean; update the means again, repeating until the assignments stabilize.)
A worked k-means example: five points X1, …, X5 (the coordinates implied by the mean computations that follow are X1 = (0,2), X2 = (0,0), X3 = (1.5,0), X4 = (5,0), X5 = (5,2)), with k = 2 and initial clusters C1 = {X1, X2, X4}, C2 = {X3, X5}.
Compute the cluster means M1 and M2:

    M1 = ((0+0+5)/3, (2+0+0)/3) = (1.66, 0.66)
    M2 = ((1.5+5)/2, (0+2)/2) = (3.25, 1.00)

and the sum-of-squared error

    Je = sum_{i=1}^{k} sum_{X in Ci} |X - mi|^2
    E^2 = e1^2 + e2^2 = 19.36 + 8.12 = 27.48
Recompute the distance of every point to the two means M1 and M2 and reassign:

    d(M1,X1) = (1.66^2 + 1.34^2)^(1/2) = 2.14   d(M2,X1) = 3.40  ==> X1 in C1
    d(M1,X2) = 1.79                             d(M2,X2) = 3.40  ==> X2 in C1
    d(M1,X3) = 0.83                             d(M2,X3) = 2.01  ==> X3 in C1
    d(M1,X4) = 3.41                             d(M2,X4) = 2.01  ==> X4 in C2
    d(M1,X5) = 3.60                             d(M2,X5) = 2.01  ==> X5 in C2

The new clusters are C1 = {X1, X2, X3} and C2 = {X4, X5}.
With the new clusters, the means become M1 = (0.5, 0.67) and M2 = (5.0, 1.0), and the error drops from 27.48 to 6.17; the assignment no longer changes, so the algorithm stops.
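The error values in this example can be recomputed from the point coordinates implied by the mean calculations (the small difference from 27.48 comes from the slides rounding M1 to (1.66, 0.66)):

```python
def je(clusters):
    """Sum-of-squared-error: for each cluster, sum |X - mi|^2 about its mean."""
    total = 0.0
    for pts in clusters:
        mx = sum(p[0] for p in pts) / len(pts)
        my = sum(p[1] for p in pts) / len(pts)
        total += sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in pts)
    return total

X1, X2, X3, X4, X5 = (0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)
print(round(je([[X1, X2, X4], [X3, X5]]), 2))  # 27.46 (slides: 27.48)
print(round(je([[X1, X2, X3], [X4, X5]]), 2))  # 6.17
```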
Variants of k-means differ in the selection of the initial k means, in how similarity is computed, and in the strategy for calculating cluster means. For categorical data, the k-modes method replaces means with modes; k-prototypes combines k-means and k-modes to handle mixed numeric and categorical data.
k-medoids clustering: instead of the mean, use the most centrally located object in a cluster (the medoid) as its representative, which makes the method less sensitive to outliers.
PAM (Partitioning Around Medoids, 1987):
1. Arbitrarily select k objects as the initial medoids.
2. repeat
3.   Assign each remaining object to the cluster of its nearest medoid.
4.   Randomly select a non-medoid object O_random.
5.   Compute the total cost S of swapping a medoid Oj with O_random.
6.   if S < 0, swap Oj with O_random to form the new set of k medoids.
7. until no change.
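A sketch of the PAM swap test (steps 4-6 above), using Manhattan distance and made-up points:

```python
def total_cost(points, medoids):
    """Total cost: each point contributes its distance to the nearest medoid."""
    return sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
               for p in points)

def try_swap(points, medoids, out, rand):
    """Replace medoid `out` with candidate `rand` if the total cost drops."""
    new = [rand if m == out else m for m in medoids]
    s = total_cost(points, new) - total_cost(points, medoids)
    return (new, s) if s < 0 else (medoids, s)

pts = [(1, 1), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8)]
medoids = [(1, 1), (2, 2)]             # both initial medoids in the same blob
better, s = try_swap(pts, medoids, (2, 2), (8, 8))
print(better, s)                       # swap accepted: cost falls from 39 to 5
```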
(Figure: PAM with K = 2, Total Cost = 26. Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping a medoid O with O_random, and perform the swap if quality is improved; loop until no change.)
PAM works efficiently for small data sets but does not scale well to large data sets: each iteration costs O(k(n-k)^2), where n is the number of objects.
CLARA (Clustering LARge Applications, 1990): draws multiple samples of the data set, applies PAM to each sample, and returns the best resulting clustering. Weakness: efficiency depends on the sample size, and a good clustering of a sample need not represent a good clustering of the whole data set if the sample is biased.
CLARANS (1994): clustering based on randomized search, which improves on CLARA by not confining the search to a fixed sample.
(Figure: hierarchical clustering of five objects a, b, c, d, e. Agglomerative clustering (AGNES) merges bottom-up: {a, b} -> ab, {d, e} -> de, {c, de} -> cde, {ab, cde} -> abcde; divisive clustering (DIANA) proceeds top-down in the inverse order.)
A generic agglomerative procedure for a document set D = {d1, …, di, …, dn}:
1. Start with each document di in its own cluster Ci = {di}, giving DC = {c1, …, ci, …, cn}.
2. For each pair of clusters in DC, compute the similarity sim(ci, cj).
3. Merge the most similar pair, (ci, cj) = arg max sim(ci, cj), into a new cluster ck = ci ∪ cj, giving DC = {c1, …, c_{n-1}}.
4. Repeat steps 2 and 3 until the desired number of clusters C remains.
AGNES (Agglomerative Nesting), Kaufmann and Rousseeuw (1990): uses the single-link method and the dissimilarity matrix; merges the nodes that have the least dissimilarity; goes on in a non-descending fashion; eventually all nodes belong to the same cluster.
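A small single-link sketch of AGNES's merge loop (brute force, without the cached dissimilarity matrix; the data is made up):

```python
def single_link_agnes(points, target_k):
    """Bottom-up single-link clustering: repeatedly merge the two clusters
    whose closest pair of points is nearest, until target_k clusters remain."""
    clusters = [[p] for p in points]                  # each point starts alone
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)                # merge the closest pair
    return clusters

print(single_link_agnes([(0, 0), (0, 1), (5, 5), (5, 6)], 2))
```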
Cluster distance choices in AGNES:
1. Single-linkage (nearest neighbor): Dmin(Ci, Cj), the smallest distance between a point of Ci and a point of Cj.
2. Complete-linkage (furthest neighbor): Dmax(Ci, Cj).
3. Average-linkage: the average of all pairwise distances between Ci and Cj.
(Figures: single-link and complete-link clustering examples.)
(Worked example: objects 3 and 4 are the closest pair and are merged first, at D34 = 5.0, leaving the clusters {1}, {2}, {3,4}, {5}; merging continues from there.)
Other hierarchical methods:
BIRCH (1996): incrementally builds a CF-tree (clustering feature tree).
CURE (1998): represents each cluster by multiple well-scattered points.
ROCK: a hierarchical method for categorical data based on links.
Chameleon (1999): hierarchical clustering using dynamic modeling.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies, 1996):
Phase 1: scan the database to build an initial in-memory CF-tree, a multi-level compression of the data that tries to preserve its inherent clustering structure.
Phase 2: apply an arbitrary clustering algorithm to the leaf nodes of the CF-tree.
CF-Tree in BIRCH: a clustering feature summarizes a set of points as CF = (N, LS, SS), where N is the number of points, LS = sum_{i=1}^{N} Xi is their linear sum, and SS = sum_{i=1}^{N} Xi^2 is their square sum.

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),

    CF = (5, (16,30), (54,190))
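The CF triple for the example's five points can be verified directly:

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(2))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return n, ls, ss

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # (5, (16, 30), (54, 190)), as in the example
```

Clustering features are additive, which is what lets BIRCH merge subclusters by simply adding their CF entries.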
(Figure: CF-tree structure. A root node holds entries CF1 … CF6, each with a child pointer; non-leaf nodes hold entries CF1 … CF5 with child pointers; leaf nodes hold entries such as CF1 … CF6 and CF1 … CF4 and are chained together by prev/next pointers.)
BIRCH scales linearly: it finds a good clustering with a single scan of the data and improves the quality with a few additional scans. Weakness: it handles only numeric data and is sensitive to the order of the data records.
CURE (Clustering Using REpresentatives), by Guha, Rastogi & Shim, 1998: each cluster is represented by a fixed number of well-scattered points, which are shrunk towards the cluster's gravity center by a fraction α. Multiple representatives capture the shape of the cluster.
CURE on large data sets:
1. Draw a random sample S of size s.
2. Partition the sample into p partitions, each of size s/p.
3. Partially cluster each partition into s/pq clusters.
4. Cluster the partial clusters, eliminating outliers and shrinking the representative points.
(Figure: CURE partitioning example with s = 50, p = 2, s/p = 25, s/pq = 5. The panels show the sample in the x-y plane, the two partitions partially clustered, and the final merged clusters.)
CURE can discover clusters of non-spherical shape and is robust to outliers; it uses a k-d tree and a heap to speed up the nearest-representative computations.
DIANA (Divisive Analysis), Kaufmann and Rousseeuw (1990): works in the inverse order of AGNES, starting from a single all-inclusive cluster and splitting; eventually each node forms a cluster on its own.
Density-based clustering: clustering based on density (a local cluster criterion), such as density-connected points.
Major features: discovers clusters of arbitrary shape; handles noise; needs only one scan of the data; requires density parameters as a termination condition.
Basic definitions (with parameters Eps and MinPts; e.g. Eps = 1 cm, MinPts = 5):
1. Eps-neighborhood: the neighborhood N_Eps(p) of a point p within radius Eps.
2. Core point: a point whose Eps-neighborhood contains at least MinPts points.
3. Directly density-reachable: a point p is directly density-reachable from a point q (w.r.t. Eps, MinPts) if p is in N_Eps(q) and q is a core point.
4. Density-reachable: p is density-reachable from q if there is a chain of points p1, p2, …, pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi (w.r.t. Eps and MinPts).
5. Density-connected: p is density-connected to q if there is a point o such that both p and q are density-reachable from o (w.r.t. Eps and MinPts).
DBSCAN (Martin Ester et al., KDD-96): relies on this density-based notion of cluster; a cluster is defined as a maximal set of density-connected points.
(Figure: DBSCAN with Eps = 1 cm and MinPts = 5, showing core points, border points, and noise.)
The DBSCAN algorithm (e.g. Eps = 1 cm, MinPts = 5):
1. Arbitrarily select an unvisited point p.
2. Retrieve the points density-reachable from p, examining the neighborhood N(q) of each point reached.
3. If |N(p)| >= MinPts, p is a core point and a cluster is formed around it.
4. If p is a border point, no points are density-reachable from p; move on to the next point.
5. Continue until all points have been processed.
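A compact, unoptimized sketch of the procedure above (Euclidean distance, no spatial index; the function and variable names are my own):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow a cluster from each unlabeled core point;
    points that end up in no cluster keep the noise label -1."""
    labels = {p: None for p in points}

    def neighbors(p):
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps * eps]

    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        n = neighbors(p)
        if len(n) < min_pts:          # not core (may become a border point later)
            labels[p] = -1
            continue
        cluster += 1
        labels[p] = cluster
        seeds = list(n)
        while seeds:                  # expand the cluster via density-reachability
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster   # noise reclassified as a border point
            if labels[q] is None:
                labels[q] = cluster
                nq = neighbors(q)
                if len(nq) >= min_pts:
                    seeds.extend(nq)  # q is also a core point: keep growing
    return labels

labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], eps=1.5, min_pts=3)
print(labels[(10, 10)])  # -1: the isolated point is noise
```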
DBSCAN requires two parameters (Eps and MinPts). Its basic time complexity is O(n^2); with a spatial index such as an R*-tree, it drops to O(n log n).
OPTICS (SIGMOD '99): Ordering Points To Identify the Clustering Structure. Instead of producing one clustering for a fixed parameter setting, it computes an ordering of the points that encodes the density-based clustering structure for a range of parameter values.
OPTICS definitions (w.r.t. a distance ε and MinPts):
Core-distance of p: the smallest distance ε' such that the ε'-neighborhood of p contains at least MinPts points; undefined if |N_ε(p)| < MinPts.
Reachability-distance of q from p: max(core-distance(p), d(p, q)); undefined if |N_ε(p)| < MinPts.
(Figure: OPTICS reachability plot; valleys in the plot correspond to clusters.)
OPTICS
1.p
2.p
MinPts
3.
Complexity: O(kN2)
72
CLIQUE (Clustering In QUEst, 1998): a grid-based method for finding density-based clusters in subspaces of high-dimensional data.
CLIQUE
1.
2.
3.
4.
K
k-1
kk-1
77
(Figure: CLIQUE example with density threshold = 3. Dense units appear around ages 30-50 in the (age, salary) plane, with salary in units of 10,000, and in the (age, vacation) plane, with vacation in weeks; intersecting the two dense 2-D regions yields a candidate dense region in the 3-D (age, salary, vacation) space.)
On-line clustering
On-line clustering
Web Search Results: multiple sub-topics
are mixed together for the given query.
Organizing Web search results into clusters
facilitates users' quick browsing.
81
Suffix tree example: the suffixes of "banana" are banana, anana, nana, ana, na, and a (each terminated with the end marker $). A suffix tree over them can be constructed in O(n) time.
(Figure: the suffix tree of "banana".)
STC (Suffix Tree Clustering):
Step 1: document cleaning (strip HTML tags, punctuation, etc.).
Step 2: identify base clusters, the sets of documents sharing a common phrase, read off the suffix tree.
(Note: what if "ate" is in the stop word list?)
Each base cluster B is scored as

    s(B) = |B| * f(|P|)

where |B| is the number of documents in B and f(|P|) is a function of the length of the shared phrase P.
Similarity of base clusters Bm and Bn (0.5 is a parameter):

    sim(Bm, Bn) = 1  iff  |Bm ∩ Bn| / |Bm| > 0.5  and  |Bm ∩ Bn| / |Bn| > 0.5
    sim(Bm, Bn) = 0  otherwise
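The merge criterion above, written as a predicate on base clusters represented by their document-id sets (the ids are made up):

```python
def similar(bm, bn, threshold=0.5):
    """1 iff the overlap exceeds the threshold fraction of BOTH document sets."""
    overlap = len(bm & bn)
    return int(overlap / len(bm) > threshold and overlap / len(bn) > threshold)

# Hypothetical base clusters: sets of document ids sharing some phrase.
b1 = {1, 2, 3, 4}
b2 = {2, 3, 4, 5}
b3 = {7, 8}
print(similar(b1, b2))  # 1: the overlap of 3 covers more than half of each
print(similar(b1, b3))  # 0: no overlap
```

Note the condition must hold for both clusters: {1, 2} and {1, 2, 3, 4, 5} are not merged, since the overlap is only 40% of the larger set.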
(Figures: evaluation results on precision, and an execution time comparison.)
STC summary: STC is an incremental, O(n) time clustering algorithm, and it allows clusters to overlap (a document may belong to more than one cluster).
(Figures: Experimental Results -- Property Comparison; Experimental Results -- Learning Methods Comparison; Example Queries.)