Clustering Algorithms

3.1 Automatic Classification .............................................60
3.2 Measures of Association ..............................................61
3.3 Dissimilarity ........................................................64
3.4 Classification Methods ...............................................66
3.5 Cluster Hypothesis ...................................................68
3.6 ......................................................................70
3.7 Clustering Algorithms ................................................77
Exercises ................................................................101
References ...............................................................102
3.1 Automatic Classification

Automatic classification groups objects by their similarity. Besides document clustering in IR, it has been applied in many fields, for example pattern recognition and automatic medical diagnosis, as well as keyword clustering. In IR, clustering takes 2 forms: keyword clustering and document clustering. R.M. Hayes described a classification in terms of the items being classified, the logical relationships among those items, and the logical organization that those relationships induce.

Document clustering supports retrieval in two ways: 1. related documents are stored and retrieved together; 2. each cluster can be summarized by a group vector (cluster representative), so that a query is first matched against the group vectors, and only the documents in the best-matching clusters are then compared against the query in detail.

Two documents may be judged similar on several grounds: string matching / comparison of their texts, the same vocabulary used, the probability that the documents arise from the same model, or the same meaning of the text.
3.2 Measures of Association

A measure of association quantifies how similar two objects are. When the objects X and Y are represented as sets of keywords, common coefficients are:

1. Simple coefficient

  | X ∩ Y |

i.e. the number of keywords the two sets share. For example, if

  X = { 1 , 2 , 3 }
  Y = { 1 , 4 }

then X ∩ Y = { 1 }, so |X ∩ Y| = 1.

2. Dice's coefficient

  2 | X ∩ Y | / ( |X| + |Y| )

which normalizes the overlap by the sizes of the two sets.

3. Jaccard's coefficient

  | X ∩ Y | / | X ∪ Y |

4. Cosine coefficient

  | X ∩ Y | / ( |X|^(1/2) · |Y|^(1/2) )

5. Overlap coefficient

  | X ∩ Y | / min( |X| , |Y| )

The cosine correlation was proposed by Salton and used in Salton's SMART system. For weighted vectors X = (X1, ..., Xn) and Y = (Y1, ..., Yn) of n term weights it takes the form

  cosine(X, Y) = (X, Y) / ( ||X|| · ||Y|| )

where (X, Y) is the inner product and ||·|| the Euclidean norm, i.e. the cosine of the angle between the 2 vectors. In document form,

  sim(Di, Dj) = Σ_{k=1}^{t} w_ik · w_jk

where Di, Dj are documents represented by t normalized term weights

  w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1}^{t} [ tf_ik · log(N / n_k) ]² )

[Worked example lost: a table of term weights compared a query Q against documents D1 and D2 term by term; the computation gave SIM(Q, D1) = 0.74.]
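The set-based coefficients above can be sketched in Python; the function names and the example sets are illustrative, not from the text:

```python
def simple(x, y):
    """Simple coefficient: size of the intersection."""
    return len(x & y)

def dice(x, y):
    """Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|."""
    return len(x & y) / len(x | y)

def cosine(x, y):
    """Cosine coefficient: |X ∩ Y| / (|X|^(1/2) * |Y|^(1/2))."""
    return len(x & y) / (len(x) ** 0.5 * len(y) ** 0.5)

def overlap(x, y):
    """Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y))

X = {1, 2, 3}
Y = {1, 4}
print(simple(X, Y))  # 1, as in the worked example above
```

On the same pair, Dice gives 2·1/5 = 0.4 and Jaccard gives 1/4 = 0.25, showing how the normalizations differ.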
3.3 Dissimilarity

Instead of measuring how alike two objects are, we can measure how different they are. Let P be the set of objects. A dissimilarity coefficient D is a function on P × P satisfying:

(1) D(X, Y) ≥ 0 for all X, Y ∈ P
(2) D(X, X) = 0 for all X ∈ P
(3) D(X, Y) = D(Y, X) for all X, Y ∈ P
(4) D(X, Y) ≤ D(X, Z) + D(Y, Z) (the triangle inequality)

A function satisfying all 4 conditions is a metric. Examples:

1. The symmetric-difference dissimilarity

  | X Δ Y | / ( |X| + |Y| ),  where | X Δ Y | = | X ∪ Y | − | X ∩ Y |

This is the complement of Dice's coefficient 2|X ∩ Y| / ( |X| + |Y| ):

  Dissimilarity = 1 − ( 2 | X ∩ Y | / ( |X| + |Y| ) )
               = ( |X| + |Y| − 2 | X ∩ Y | ) / ( |X| + |Y| )
               = ( | X ∪ Y | − | X ∩ Y | ) / ( |X| + |Y| )
               = | X Δ Y | / ( |X| + |Y| )
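The identity just derived, that the symmetric-difference dissimilarity equals one minus Dice's coefficient, can be checked numerically; the example sets are arbitrary:

```python
def dice(x, y):
    """Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def sym_diff_dissim(x, y):
    """|X Δ Y| / (|X| + |Y|), with X Δ Y the symmetric difference."""
    return len(x ^ y) / (len(x) + len(y))

X, Y = {1, 2, 3}, {1, 4}
print(sym_diff_dissim(X, Y))  # 0.6, which equals 1 - dice(X, Y)
```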
2. Jaccard-based dissimilarity. Represent each object as a binary vector in which component i is 1 if keyword i is present and 0 otherwise. Then

  |X| = Σ_{i=1}^{N} X_i
  | X ∩ Y | = Σ_{i=1}^{N} X_i · Y_i

and the measure in example 1 expands to

  | X Δ Y | / ( |X| + |Y| ) = Σ_i ( X_i (1 − Y_i) + Y_i (1 − X_i) ) / Σ_i ( X_i + Y_i )

When the 2 objects are instead described by probability distributions over keyword occurrence, with P1(1), P1(0), P2(1), P2(0) the probabilities of a keyword being present (1) or absent (0), Jardine and Sibson defined a dissimilarity between the two distributions called the information radius.

3. The information radius attaches positive weights u and v to the two distributions; its expanded form is omitted here.
3.4 Classification Methods

The objects to be classified may be documents, keywords, hand-written characters, or species; each object carries a description, for example:
- a set of keywords
- a probability distribution (Probability Distributions)

Following Sparck Jones, classification methods can be distinguished in three ways:

1. Monothetic vs. polythetic. Let G = { f1 , f2 , f3 , ..., fn } be a set of attributes, and consider a table whose rows are the individuals and whose columns record which attributes of G each individual possesses. A class is monothetic when there is a fixed subset of attributes that every member possesses and that suffices for membership; it is polythetic when each member possesses many, but not necessarily all, of the class's attributes and no single attribute is possessed by every member. [Figure 3.2 lost: individuals 1, 2, 3, 4, 5 form a polythetic class, each possessing 3 of the attributes with no attribute common to all; the other individuals illustrate monothetic classes.]

2. Exclusive vs. overlapping. In an exclusive classification every individual belongs to exactly one class; in an overlapping classification a class may share individuals with other classes.

3. Ordered vs. unordered. An unordered classification is a flat set of classes; an ordered classification imposes structure on the classes, typically a hierarchical one (Hierarchical).
3.5 Cluster Hypothesis

The cluster hypothesis states that "closely associated documents tend to be relevant to the same requests." Document clustering is useful precisely to the extent that relevant documents can thereby be separated from non-relevant ones.

[Figure 3.3 lost: for a set of requests, it plots the distribution of association values between pairs of relevant documents (relevant-relevant, R-R) and between relevant and non-relevant documents (relevant-non-relevant, R-NR); the X axis is the strength of association, the Y axis the relative frequency. The separation between the two distributions supports the hypothesis.]
For document clustering to support retrieval well, a clustering method should satisfy several requirements:
- the clustering should not change drastically as further documents are added to the collection;
- small errors in the descriptions of documents should lead to only small changes in the clustering;
- the result should not depend on the order in which documents are processed;
- the method should be efficient enough for large collections.

Most clustering methods in practice are distance-based (Distance-based Clustering).

3.6 Clustering Methods

Clustering methods can be grouped according to what they operate on:
1. methods based on the measured associations between the objects themselves, e.g. graph-theoretic methods;
2. methods based on descriptions of the objects, which compare each object against cluster representatives.
1. Methods based on associations between objects

1.1 Graph Theoretic Method

Compute the association (similarity) between every pair of objects, and connect a pair by an edge whenever its similarity exceeds a chosen threshold. Varying the threshold produces different graphs: lowering it connects more pairs, raising it splits the graph into more components. [Figures 3.3-3.5 lost: an association graph over 6 documents, and the graphs obtained at 2 different threshold values.]

[Figure 3.6 lost: a dendrogram over {A, B, C, D, E}.] In Figure 3.6, at level L1 the clusters are {A, B}, {C}, {D}, {E}; at level L2 they are {A, B} and {C, D, E}; at level L3 everything joins into {A, B, C, D, E}. A dendrogram records the level at which each merge occurs.

Jardine and Sibson formalized the single-link method: given a dissimilarity coefficient (DC) between documents, thresholding the DC at successive levels yields a hierarchy of clusters; at each threshold, 2 documents belong to the same cluster if they are connected by a chain of links each below the threshold. Other hierarchic cluster methods include complete-link and average-link, which define the distance between clusters by the furthest pair and by the average over pairs respectively.

In retrieval, a matching function compares the query (request) against clusters at some level of the hierarchy, subject to a threshold. Searching at a low level gives high precision but low recall at a low-rank cut-off position; searching at a high level gives high recall but low precision.

Hierarchic methods are widely used because the hierarchy is computed once and then supports searching at several levels of granularity.

A single-link hierarchy can be obtained from the Minimum Spanning Tree (MST) of the dissimilarity graph. Build the MST of the graph; then, for each threshold, delete every edge of the MST whose weight exceeds the threshold. The connected components that remain are exactly the single-link clusters at that level, so thresholding the MST regenerates the whole single-link hierarchy.

[Figure lost: an example graph with weighted edges (including weights 200, 310, 410, 612, 800) over nodes including A, D, and E, together with its minimum spanning tree.]
2. Methods based on descriptions of objects

These heuristic methods do not compare all pairs of objects; instead each object is compared against a cluster representative, variously called a cluster profile, classification vector, or centroid. Desirable properties of such methods include:
- the clusters should reflect the structure of the data rather than the order of processing;
- thresholds on the matching function should control how easily objects join clusters;
- clusters may be allowed to overlap (Overlap);
- representatives should be cheap to compute and update.

2.1 Rocchio's clustering algorithm operates in 3 phases. Objects that fail to fit any cluster are collected into a "rag-bag" cluster, and thresholds on the matching function control cluster membership and the degree of overlap.

A simpler representative-based method is the Single-Pass algorithm:
- the first object becomes the representative of the first cluster;
- each subsequent object is matched against the representatives of all existing clusters;
- if the best matching function value exceeds a threshold, the object joins that cluster and the representative is recomputed;
- otherwise the object becomes a new cluster of its own.

The method is fast, but the result depends on the order in which objects are presented (which can be tested by re-running with a different order), and the thresholds are input parameters that must be chosen in advance.
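The single-pass steps above can be sketched as follows; the use of Dice's coefficient as the matching function, the threshold value, and the union-based representative update are illustrative assumptions, not prescribed by the text:

```python
def dice(x, y):
    """Dice's coefficient as the matching function (an assumed choice)."""
    total = len(x) + len(y)
    return 2 * len(x & y) / total if total else 0.0

def single_pass(docs, threshold=0.5):
    """docs: list of keyword sets. Returns a list of [representative, member_indices]."""
    clusters = []  # each entry: [representative_set, list of document indices]
    for idx, doc in enumerate(docs):
        # match the new document against every existing representative
        best, best_sim = None, 0.0
        for cl in clusters:
            s = dice(doc, cl[0])
            if s > best_sim:
                best, best_sim = cl, s
        if best is not None and best_sim >= threshold:
            best[1].append(idx)
            best[0] = best[0] | doc  # crude representative update: keyword union
        else:
            clusters.append([set(doc), [idx]])  # start a new cluster
    return clusters

clusters = single_pass([{"a", "b"}, {"a", "b", "c"}, {"x", "y"}], threshold=0.5)
print([c[1] for c in clusters])  # [[0, 1], [2]]
```

Because each document is compared only against the current representatives, the outcome depends on input order, exactly the drawback noted above.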
2.2 Dattola's algorithm is an iterative refinement method that can also be applied level by level to build a (hierarchic) classification.

Comparing the two families, heuristic approaches trade theoretical soundness for speed: graph-theoretic methods must compute the association measure between all pairs of objects, costing on the order of n² for n objects, whereas heuristic methods based on matching functions can run in about n log n.
3.7 Clustering Algorithms

K-means

[Figures lost: "K-means clustering 1-3" illustrated successive iterations of the algorithm on a small data set.]

The time complexity of K-means is O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. The algorithm usually terminates at a local optimum rather than the global optimum, and the number of clusters k must be specified in advance.

K-means is also sensitive to noise and outliers, and the mean is undefined for categorical data. Variants address this: K-modes replaces the mean by the mode and uses frequency-based dissimilarity for categorical attributes, and K-prototype handles mixed numeric and categorical data.

A related idea is to represent each cluster by its medoid, the most centrally located object, as in PAM (Partitioning Around Medoids, 1987):
- arbitrarily select k objects as the initial medoids;
- for each non-medoid object h and each medoid i, compute the total cost TCih of swapping i with h;
- if TCih < 0, replace medoid i by h;
- repeat until the medoids no longer change.

Since the best K is not known in advance, it can be searched for:
a. choose a value of K;
b. run the clustering;
c. evaluate the quality of the K clusters;
d. repeat steps b and c for other values of K, multiple times, and keep the best result.

Formally, given n feature vectors X1, X2, ..., Xn and K < n, let mi denote the mean of cluster i. A vector X is assigned to cluster i when ||X − mi|| is the smallest over all K means. The algorithm:
- choose initial means m1, m2, ..., mK
- Until there is no change in any mean:
  - assign each sample to the cluster whose mean is nearest
  - For i = 1 to K: recompute mi as the mean of the samples currently in cluster i
- end_until

The resulting m1, ..., mK define the K clusters.
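The loop above can be sketched for points in the plane; the sample data and initial means are invented for illustration:

```python
def kmeans(points, means, iterations=100):
    """points: list of (x, y); means: initial cluster means. Returns (means, labels)."""
    for _ in range(iterations):
        # assignment step: each point joins the cluster with the nearest mean
        labels = [min(range(len(means)),
                      key=lambda i: (p[0] - means[i][0]) ** 2 + (p[1] - means[i][1]) ** 2)
                  for p in points]
        # update step: recompute each mean from its current members
        new_means = []
        for i in range(len(means)):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                new_means.append((sum(p[0] for p in members) / len(members),
                                  sum(p[1] for p in members) / len(members)))
            else:
                new_means.append(means[i])  # keep an empty cluster's mean unchanged
        if new_means == means:  # no change in any mean: converged (a local optimum)
            break
        means = new_means
    return means, labels

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
means, labels = kmeans(pts, [(0, 0), (10, 10)])
print(labels)  # [0, 0, 1, 1]
```

Note that the result depends on the initial means, which is why the algorithm only reaches a local optimum, as stated above.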
K-medoid-based extensions of K-means for large data sets include CLARA (Clustering LARge Applications, 1990) and CLARANS (Ng & Han, 1994). Sanpawat Kantabutra and Alva Couch (http://www.cs.tufts.edu/~{sanpawat,couch}) proposed a parallel K-means algorithm for networks of workstations that distributes the distance computations across the machines, reducing the sequential cost by a factor on the order of K/2.
Spherical K-Means

Spherical K-means clusters full-text (unstructured text document) collections represented in the Vector Space Model (VSM). In the VSM, let wik be the weight of term k in document i; document i is then the vector

  Di = (wi1, wi2, ..., wit)

in t-dimensional space (for intuition, picture the 3-dimensional case). The similarity between two documents is measured with the cosine coefficient, which ranges from 0 to 1.

Weights use the tf*idf scheme (term frequency * inverse document frequency), where idf = log(N/df), N is the number of documents and df the number of documents containing the term. The vectors are normalized to unit length:

  wik = tfik · log(N / dfk) / sqrt( Σ_{k=1}^{t} [ tfik · log(N / dfk) ]² )

where
  tfik = frequency of term k in document i
  N = total number of documents
  dfk = number of documents containing term k

With ||Di|| = ||Dj|| = 1, the cosine reduces to the inner product sim(Di, Dj) = Σ_k wik · wjk.

A term that occurs in nearly every document has df close to N and idf close to 0, so it contributes almost nothing to the similarity and can be removed from the vocabulary.

[Figure 3.8 lost: documents depicted as unit vectors on a sphere.]

Spherical K-means is simply K-means on these unit-length vectors, with cosine similarity used in place of Euclidean distance.
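Under these definitions, tf·idf weighting plus unit-length normalization can be sketched as follows; the toy corpus is invented:

```python
import math

def tfidf_vectors(docs):
    """docs: list of term -> count dicts. Returns unit-length w_ik dicts."""
    N = len(docs)
    df = {}  # df_k: number of documents containing term k
    for d in docs:
        for term in d:
            df[term] = df.get(term, 0) + 1
    vecs = []
    for d in docs:
        # w_ik = tf_ik * log(N / df_k), then normalized to unit length
        w = {t: tf * math.log(N / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in w.values()))
        vecs.append({t: v / norm for t, v in w.items()} if norm else w)
    return vecs

def sim(di, dj):
    """Cosine on unit vectors reduces to the inner product: sum_k w_ik * w_jk."""
    return sum(wi * dj.get(t, 0.0) for t, wi in di.items())

docs = [{"a": 1, "b": 2}, {"b": 1, "c": 1}, {"c": 3}]
v = tfidf_vectors(docs)
```

A term occurring in every document gets idf = log(N/N) = 0 and thus zero weight, matching the remark above about removing such terms.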
In the reported experiment the collection contained 4800 Thai documents in 5 categories, of sizes 1146, 1653, 828, 47, and 1126. Words were segmented with the Longest Matching algorithm, yielding 32,675 distinct terms, and clustering quality was evaluated with the F-measure.
Fuzzy C-Means

K-means assigns each object to exactly one cluster, even when the object is almost equally correlated with several clusters. Fuzzy clustering instead lets an object belong to every cluster with a degree of membership. Fuzzy C-Means (FCM) was introduced by Dunn and generalized by Bezdek. Its outline:
- choose the number of clusters (c) and the fuzziness exponent (m);
- initialize the memberships (or the centroids);
- compute the centroids from the memberships, then update the memberships;
- repeat until the objective function changes by less than a tolerance.

The Objective Function is

  J = Σ_{i=1}^{c} Σ_{j=1}^{n} (μij)^m · d²(Xj, Zi)

where
  J = the objective function to be minimized
  X = {X1, X2, ..., Xn} = the n data samples
  c = the number of clusters
  m > 1 = the fuzziness (membership weighting) exponent
  μij = the membership of sample Xj in cluster i
  d(Xj, Zi) = the distance between sample Xj and centroid Zi

The centroids are updated by

  Zi = Σ_{j=1}^{n} (μij)^m · Xj / Σ_{j=1}^{n} (μij)^m

and the memberships by

  μij = [ 1 / d²(Xj, Zi) ]^{1/(m−1)} / Σ_{i=1}^{c} [ 1 / d²(Xj, Zi) ]^{1/(m−1)}
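A compact FCM sketch following the update formulas above; the 1-D data, the choice m = 2, and the min/max centroid initialization are illustrative assumptions:

```python
def fcm(xs, z, m=2.0, iters=50):
    """Fuzzy C-Means on 1-D samples xs with initial centroids z.
    Returns (centroids, memberships)."""
    c = len(z)
    u = []
    for _ in range(iters):
        # membership update: u_ij = (1/d^2)^(1/(m-1)) / sum_i (1/d^2)^(1/(m-1))
        u = []
        for x in xs:
            d2 = [max((x - zi) ** 2, 1e-12) for zi in z]  # guard against d = 0
            inv = [(1.0 / d) ** (1.0 / (m - 1)) for d in d2]
            s = sum(inv)
            u.append([v / s for v in inv])
        # centroid update: z_i = sum_j u_ij^m x_j / sum_j u_ij^m
        z = [sum(u[j][i] ** m * xs[j] for j in range(len(xs))) /
             sum(u[j][i] ** m for j in range(len(xs)))
             for i in range(c)]
    return z, u

data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids, memberships = fcm(data, [min(data), max(data)])
```

Unlike K-means labels, each row of `memberships` sums to 1 and spreads a sample's weight over both clusters.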
[Flowchart lost: initialize centroids Z1, Z2, ..., Zc; calculate the memberships and the objective function; if not yet converged (no), compute improved centroids and repeat; otherwise (yes) stop.]

The distance d may be the Euclidean distance

  EDji = sqrt( (Xj − Zi) (Xj − Zi)^T )

where EDji is the distance between sample Xj and centroid Zi and T denotes the matrix transpose, or the Mahalanobis distance

  MDji = sqrt( (Xj − Zi) A^{-1} (Xj − Zi)^T )

where A is the variance-covariance matrix

  A = Σ_{j=1}^{n} (Xj − Zi)^T (Xj − Zi) / (n − 1)
(Source: www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf)
Hierarchical Clustering

Agglomerative hierarchical clustering requires O(n²) time for n objects, which is expensive for large collections. Improved hierarchical algorithms include BIRCH (1996), which summarizes the data incrementally in a CF-tree, CURE (1998), and CHAMELEON (1999); BIRCH and CURE in particular target large data sets.

The basic agglomerative algorithm, given N items and an N*N distance matrix:
1. Start by assigning each item to its own cluster, so that there are N clusters, each containing a single item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are in one cluster of size N. To obtain K clusters instead, stop when only K clusters remain, i.e. after N − K merges.

Single-Linkage Clustering

In single-linkage the distance between two clusters is the minimum distance between any member of one and any member of the other. Let D = [d(i,j)] be the N*N proximity matrix. The clusterings are numbered 0, 1, ..., (n−1); L(k) is the level of the kth clustering; and d[(r),(s)] is the distance between clusters (r) and (s). The algorithm:

1. Begin with the disjoint clustering at level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters (r), (s): d[(r),(s)] = min over all pairs d[(i),(j)].
3. Increment the sequence number m = m + 1, merge clusters (r) and (s) into a single cluster, and set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix D by deleting the rows and columns for (r) and (s) and adding a row and column for the new cluster, denoted (r,s), with d[(k),(r,s)] = min( d[(k),(r)], d[(k),(s)] ) for every old cluster (k). If all objects are in one cluster, stop; otherwise go to step 2.
As an example, apply single-linkage to six objects BA, FI, MI, NA, RM, TO with the distance matrix:

          BA    FI    MI    NA    RM    TO
  BA       0   662   877   255   412   996
  FI     662     0   295   468   268   400
  MI     877   295     0   754   564   138
  NA     255   468   754     0   219   869
  RM     412   268   564   219     0   669
  TO     996   400   138   869   669     0

The nearest pair is MI and TO at distance 138, so they merge into the cluster MI/TO at level L(MI/TO) = 138, m = 1. Updating the matrix (each new entry is the minimum over the merged members, as single-linkage clustering requires):

          BA    FI  MI/TO   NA    RM
  BA       0   662   877   255   412
  FI     662     0   295   468   268
  MI/TO  877   295     0   754   564
  NA     255   468   754     0   219
  RM     412   268   564   219     0

Now the nearest pair is NA and RM at 219: merge them, with L(NA/RM) = 219, m = 2:

          BA    FI  MI/TO  NA/RM
  BA       0   662   877    255
  FI     662     0   295    268
  MI/TO  877   295     0    564
  NA/RM  255   268   564      0

Next BA joins NA/RM at level 255, m = 3:

            BA/NA/RM    FI  MI/TO
  BA/NA/RM       0     268   564
  FI           268       0   295
  MI/TO        564     295     0

Then FI joins at level 268, m = 4:

               BA/NA/RM/FI  MI/TO
  BA/NA/RM/FI        0       295
  MI/TO            295         0

Finally the last 2 clusters merge at level 295, m = 5, completing the hierarchy.

[Figure lost: the hierarchical tree (dendrogram) over BA, NA, RM, FI, MI, TO, with merge levels 138, 219, 255, 268, 295.]

Single-linkage clustering runs in O(n²) time for n items.
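The worked example can be reproduced with a short single-linkage implementation; the labels and distances are exactly those of the tables above:

```python
def single_linkage(labels, dist):
    """Agglomerative single-link clustering.
    dist maps frozenset({a, b}) -> distance. Returns the merge levels L(m)."""
    clusters = [frozenset([l]) for l in labels]

    def d(c1, c2):
        # single link: cluster distance = minimum over member pairs
        return min(dist[frozenset([a, b])] for a in c1 for b in c2)

    levels = []
    while len(clusters) > 1:
        # step 2: find the least dissimilar pair of clusters (r), (s)
        r, s = min(((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                   key=lambda p: d(*p))
        # step 3: record the level L(m) = d[(r),(s)] and merge
        levels.append(d(r, s))
        clusters = [c for c in clusters if c not in (r, s)] + [r | s]
    return levels

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = {("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255, ("BA", "RM"): 412,
     ("BA", "TO"): 996, ("FI", "MI"): 295, ("FI", "NA"): 468, ("FI", "RM"): 268,
     ("FI", "TO"): 400, ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
     ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669}
dist = {frozenset(k): v for k, v in D.items()}
print(single_linkage(cities, dist))  # [138, 219, 255, 268, 295]
```

The printed levels match the merge levels derived step by step in the tables.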
(4) Mixture of Gaussians

Mixture-of-Gaussians clustering is a model-based method: the data are assumed to be generated by a mixture of underlying probability distributions, where each component distribution may be, for example, a Gaussian or a Poisson. In a Mixture Model, each cluster corresponds to one component distribution. In the simplest case every component is a Gaussian with the same spherical covariance,

  P( x | ωi ) ~ N[ μi , σ² I ]

and a data point is generated by first choosing a component i and then sampling from it, so that

  P(X) = Σ_i P(ωi) · P( X | ωi, μ1, μ2, ..., μk )

The parameters are estimated with the EM (Expectation-Maximization) algorithm, which fits the mixture of Gaussians to the data.
As a warm-up for EM, consider estimating a single parameter μ by maximum likelihood. Suppose a grade Xk is awarded with probabilities

  X1 = grade A,  P(X1) = 0.5
  X2 = grade B,  P(X2) = μ
  X3 = grade C,  P(X3) = 2μ
  X4 = grade D,  P(X4) = 0.5 − 3μ

Case 1: full information. We observe
  X1 : a students
  X2 : b students
  X3 : c students
  X4 : d students

The log-likelihood is

  log P = b·log μ + c·log 2μ + d·log(0.5 − 3μ) + constant

Setting ∂(log P)/∂μ = 0:

  b/μ + c/μ − 3d/(0.5 − 3μ) = 0

which solves to

  μ = (b + c) / ( 6(b + c + d) )

With the observed counts of the original example this evaluates to μ = 1/10.

Case 2: hidden information. Suppose grades A and B cannot be distinguished, so we only observe
  x1 + x2 : h students
  x3 : c students
  x4 : d students

EM alternates 2 steps until μ converges:

E-step: given the current μ, split h into expected counts of A's and B's:

  a = ( 1/2 / (1/2 + μ) ) · h,   b = ( μ / (1/2 + μ) ) · h

M-step: re-estimate μ from the expected counts, exactly as in case 1:

  μ = (b + c) / ( 6(b + c + d) )
The general EM algorithm for a mixture of Gaussians:

Step 1: Initialize parameters:

  θ0 = { μ1, μ2, ..., μk, p1, p2, ..., pk }

Step 2: E-step — compute the posterior probability that sample Xk was generated by component j:

  p(ωj | Xk, θt) = p(Xk | ωj, θt) · p(ωj | θt) / p(Xk | θt)
                 = p(Xk | μj, σj²) · pj / Σ_i p(Xk | μi, σi²) · pi

Step 3: M-step — re-estimate the parameters from the weighted samples:

  μi^(t+1) = Σ_k p(ωi | Xk, θt) · Xk / Σ_k p(ωi | Xk, θt)

  pi^(t+1) = (1/N) · Σ_k p(ωi | Xk, θt)

Steps 2 and 3 are repeated until the parameters converge.
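The E-step/M-step alternation can be demonstrated on the grades example above, where only h = a + b, c, and d are observed; the counts used here are invented for illustration:

```python
def em_grades(h, c, d, mu=0.05, iters=200):
    """EM for P(A)=1/2, P(B)=mu, P(C)=2mu, P(D)=1/2-3mu,
    with the A and B counts collapsed into the single observed count h."""
    for _ in range(iters):
        # E-step: expected number of B's among the h indistinguishable students
        b = h * mu / (0.5 + mu)
        # M-step: maximum-likelihood re-estimate, mu = (b + c) / (6(b + c + d))
        mu = (b + c) / (6.0 * (b + c + d))
    return mu

mu = em_grades(h=20, c=9, d=10)
```

At convergence the two update equations are mutually consistent: the μ returned is a fixed point of the E-step followed by the M-step.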
Genetic Algorithms (Optimization)

Genetic algorithms are general optimization methods modeled on natural selection; in IR they can be used, for example, to evolve a better representation of a query so that it retrieves more of the relevant documents. A genetic algorithm has 5 main steps:
a. encode candidate solutions as chromosomes and create an initial population;
b. evaluate each chromosome with a fitness function;
c. select the fittest chromosomes to survive;
d. produce offspring by crossover;
e. introduce variation by mutation, then repeat from step b.
Example 3.5. Suppose the collection contains 5 documents:

DOC1 = {Database, Query, Data Retrieval, Computer Network, DBMS}
DOC2 = {Artificial Intelligence, Internet, Indexing, Natural Language Processing}
DOC3 = {Database, Expert System, Information Retrieval System, Multimedia}
DOC4 = {Fuzzy Logic, Neural Network, Computer Networks}
DOC5 = {Object-Oriented, DBMS, Query, Indexing}

The vocabulary consists of 16 terms:

Artificial Intelligence, Computer Network, Data Retrieval, Database, DBMS, Expert System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural Language Processing, Neural Network, Object Oriented, Query, Relational Database

Each document is encoded as a 16-bit binary vector, bit k being 1 when term k occurs in the document:

DOC1 = 0110100000000011
DOC2 = 1000000101010000
DOC3 = 0001010010100000
DOC4 = 0100001000001000
DOC5 = 0000100100000110

A query is likewise a 16-bit vector. The fitness of a candidate query is its similarity to the relevant documents, measured with the Dice coefficient, Cosine coefficient, or Jaccard coefficient; each gives a value between 0.0 and 1.0, with 1.0 the best possible match.
Chromosomes with higher fitness are more likely to survive into the next generation (Survival of the fittest). Crossover takes 2 parent strings, picks a crossover point, and exchanges their tails. For example, crossing the parents

101111110011101
100110011110000

after the 8th bit produces the offspring

101111111110000
100110010011101

Mutation flips a randomly chosen bit. Flipping bit 10 of

101111110011101

produces

101111110111101

The cycle of selection, crossover, and mutation repeats until the best fitness exceeds a threshold (or a maximum number of generations is reached).
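The crossover and mutation operators shown above, on 15-bit strings; the crossover point and mutated bit are fixed here to match the example, whereas a real GA would choose both at random:

```python
def crossover(p1, p2, point):
    """Single-point crossover: swap tails after `point` (1-indexed bit position)."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(s, pos):
    """Flip the bit at 1-indexed position `pos`."""
    flipped = "1" if s[pos - 1] == "0" else "0"
    return s[:pos - 1] + flipped + s[pos:]

o1, o2 = crossover("101111110011101", "100110011110000", 8)
print(o1)  # 101111111110000
print(o2)  # 100110010011101
print(mutate("101111110011101", 10))  # 101111110111101
```

The three printed strings reproduce the offspring and the mutated chromosome from the example.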
Exercises

1. Simple coefficient, Dice's coefficient, Jaccard's coefficient, Cosine coefficient, and Overlap coefficient
2. Jaccard's coefficient
3. Dice's coefficient
4. Monothetic and polythetic classification
7. Clustering
8. Clustering algorithms
9.
10. Graph Theoretic Method
11. Single Link Method
12. Rocchio's algorithm
13. K-means
14. PAM
15. Fuzzy C-Means (FCM)
16. Single-Linkage Clustering
17. Genetic algorithms
References

CS337 lecture notes, 2535 B.E.

Technical Journal, Vol. 11, No. 7, March-June 2000.

Sanpawat Kantabutra and Alva L. Couch, "Parallel K-means Clustering Algorithm on NOWs", Department of Computer Science, Tufts University, Medford, Massachusetts. www.nectec.or.tn/NTJ/n06/papers/No6_short_1.pdf

The Joint Conference on Computer Science and Software Engineering (JCSSE), November 17-18, 2005. (www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf)

"Spherical K-Means", Intelligent Information Retrieval and Database Laboratory, Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand.