
Chapter 3
Clustering Algorithms

Contents
3.1 Automatic Classification
3.2 Measures of Association
3.3 Dissimilarity
3.4 Classification Methods
3.5 The Cluster Hypothesis
3.6 Types of Clustering Algorithms
3.7 Clustering Algorithms
Review Questions
References

3.1 Automatic Classification

Automatic classification is used in many fields besides information retrieval, such as pattern recognition and automatic medical diagnosis. Within IR there are 2 main applications: keyword clustering and document clustering. R.M. Hayes described classification as the grouping of items according to some logical relationship between them, and this logical organization can be viewed at 2 levels:

1. the grouping of related items into classes, and
2. the organization of the classes themselves into a useful structure.

In document clustering, each group of documents is summarized by a group vector (a cluster representative). An incoming query is first matched against the group vectors rather than against every document, and only the documents in the best-matching clusters are then compared with the query in detail, which reduces the cost of retrieval.

The association between documents, or between a document and a query, can be judged at several levels: string matching or comparison, use of the same vocabulary, the probability that the documents arise from the same model, and sameness of the meaning of the text.
3.2 Measures of Association

Many clustering methods start from a measure of association (similarity) between the objects to be clustered. When each object, such as a document, is represented by the set of keywords that describes it, the association between two sets X and Y can be measured by coefficients such as the following.

1. Simple coefficient

    |X ∩ Y|

the number of keywords shared by X and Y. For example, if

    X = {1, 2, 3}
    Y = {1, 4}

then X ∩ Y = {1} and |X ∩ Y| = 1.

2. Dice's coefficient

    2|X ∩ Y| / (|X| + |Y|)

which normalizes the size of the intersection by the sizes of X and Y.

3. Jaccard's coefficient

    |X ∩ Y| / |X ∪ Y|

4. Cosine coefficient

The cosine correlation was used by Salton in the SMART system. For two n-dimensional vectors X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) it is defined as

    (X, Y) / (‖X‖ · ‖Y‖)

that is, the inner product (·,·) divided by the product of the vector lengths ‖·‖, which is the cosine of the angle between the 2 vectors. In set notation it becomes

    |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))

In the vector-space model, the similarity between documents Di and Dj is computed from term weights:

    sim(Di, Dj) = Σ_{k=1}^{t} w_ik · w_jk

where t is the number of index terms and w_ik is the weight of term k in document Di:

    w_ik = tf_ik · log(N/n_k) / √( Σ_{k=1}^{t} (tf_ik)² · [log(N/n_k)]² )

Because the term weights are normalized, the resulting similarity values lie between 0 and 1; the cosine measure is thus a normalized inner product.

5. Overlap coefficient

    |X ∩ Y| / min(|X|, |Y|)
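The set-based coefficients are easy to state in code. The following is a minimal sketch (not part of the original text) using Python sets of keywords:

    import math

    def simple(x, y):
        # Simple coefficient: size of the intersection
        return len(x & y)

    def dice(x, y):
        # Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)
        return 2 * len(x & y) / (len(x) + len(y))

    def jaccard(x, y):
        # Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|
        return len(x & y) / len(x | y)

    def cosine(x, y):
        # Cosine coefficient: |X ∩ Y| / (|X|^(1/2) |Y|^(1/2))
        return len(x & y) / math.sqrt(len(x) * len(y))

    def overlap(x, y):
        # Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)
        return len(x & y) / min(len(x), len(y))

    X, Y = {1, 2, 3}, {1, 4}
    print(simple(X, Y), dice(X, Y), jaccard(X, Y), cosine(X, Y), overlap(X, Y))
    # -> 1 0.4 0.25 0.4082482904638631 0.5

Applied to the example above, the five coefficients give 1, 0.4, 0.25, 0.41 and 0.5 respectively.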
Tables 3.1 and 3.2 illustrate the vector space: each document D and each query Q is represented by a vector of normalized term weights w.

In this space, a query Q is compared with two documents,

    D1 = (0.8, 0.3)
    D2 = (0.2, 0.3)

plotted as vectors along two term axes, Term A and Term B.

[Figure 3.1: the cosine of the angle, in degrees, between each document vector and the query vector.]

The cosine of the angle between each document vector and the query measures their similarity:

    cos θ1 = 0.74
    cos θ2 = 0.98

Since cos θ2 > cos θ1, document D2 points in nearly the same direction as the query Q, while D1 lies further away. For instance,

    SIM(Q, D1) = 0.56 / √0.58 ≈ 0.74

3.3 Dissimilarity

Instead of a similarity measure, many cluster methods are defined in terms of a dissimilarity coefficient. Let P be the set of objects to be clustered. A dissimilarity coefficient D is a function from P × P to the non-negative real numbers satisfying:

(1) D(X, Y) ≥ 0 for all X, Y in P
(2) D(X, X) = 0 for all X in P
(3) D(X, Y) = D(Y, X) for all X, Y in P
(4) D(X, Y) ≤ D(X, Z) + D(Y, Z)   (the triangle inequality)

A dissimilarity coefficient that also satisfies condition (4) is a metric. Some common dissimilarity coefficients follow.

1. The symmetric-difference measure

    |X Δ Y| / (|X| + |Y|),   where |X Δ Y| = |X ∪ Y| − |X ∩ Y|

This is the complement of Dice's coefficient 2|X ∩ Y| / (|X| + |Y|):

    Dissimilarity = 1 − 2|X ∩ Y| / (|X| + |Y|)
                  = (|X| + |Y| − 2|X ∩ Y|) / (|X| + |Y|)
                  = (|X ∪ Y| − |X ∩ Y|) / (|X| + |Y|)
                  = |X Δ Y| / (|X| + |Y|)

2. A measure related to Jaccard's coefficient. Represent each object as a binary vector in which position i is 1 if keyword i is present and 0 if it is absent. Then

    |X| = Σ_i X_i   (i = 1, ..., N)

where N is the total number of keywords, and

    |X Δ Y| = Σ_i ( X_i(1 − Y_i) + Y_i(1 − X_i) )

so that the measure in 1 can be written

    |X Δ Y| / (|X| + |Y|) = Σ_i ( X_i(1 − Y_i) + Y_i(1 − X_i) ) / Σ_i ( X_i + Y_i )

If instead each of the 2 objects is characterized by a probability distribution over the presence (1) and absence (0) of each keyword, with probabilities P1(1), P1(0), P2(1), P2(0), then Jardine and Sibson define a dissimilarity between the two distributions known as the information radius.

3. The information radius is computed with two positive weights u and v attached to the two distributions.

3.4 Classification Methods

A classification method groups objects, such as documents, keywords, hand-written characters or biological species, on the basis of their descriptions. The description of an object may take several forms, for example:

- the set of keywords assigned to it,
- numeric attribute values,
- probability distributions over its attributes.

Sparck Jones distinguished classifications along three dimensions.

1. Monothetic versus polythetic. Let G = {f1, f2, f3, ..., fn} be the set of features (properties) that the individuals of a class may possess, and consider a table whose rows are the individuals and whose columns are the features f in G. A class is monothetic if every individual in it possesses all of the defining features; it is polythetic if each individual possesses a large number of the class features and each feature is possessed by a large number of the individuals, without any single feature being either necessary or sufficient for membership.

[Table 3.2: a feature-by-individual matrix contrasting a monothetic class with a polythetic class.]

2. Exclusive versus overlapping. In an exclusive classification each individual belongs to exactly one class, so the classes are mutually disjoint. In an overlapping classification an individual may belong to more than one class. Overlapping classes can describe the individuals more faithfully, but exclusive classes are simpler to store and to search.

3. Ordered versus unordered. In an ordered classification the classes themselves are arranged in a structure, as in a hierarchical classification where small classes are nested inside larger ones. In an unordered classification the classes form a flat, unstructured set, as in the keyword classes of a thesaurus.
3.5 The Cluster Hypothesis

The use of clustering in information retrieval rests on the cluster hypothesis:

    "closely associated documents tend to be relevant to the same requests"

In other words, documents that are relevant to a request tend to be more similar to one another than they are to non-relevant documents.

[Figure 3.3: for a given request, the distributions of association values between pairs of relevant documents (R-R) and between relevant and non-relevant pairs (R-N-R); the x axis is the strength of association, the y axis the relative frequency.]

Figure 3.3 shows, for a request, the distribution of association values for relevant-relevant (R-R) pairs and for relevant-non-relevant (R-N-R) pairs of documents. The R-R distribution is shifted toward higher association values, and this separation between the two distributions supports the hypothesis: if relevant documents do cluster together, then document clustering can gather them into the same clusters and thereby both speed up retrieval and improve its effectiveness.

A clustering method suitable for document clustering should have certain properties:

- the clustering should be stable under growth, i.e. unlikely to change drastically when further documents are added;
- small errors in the descriptions of the documents should lead to only small changes in the clustering;
- the result should not depend on the initial ordering of the documents;
- the method should be efficient enough, in time and storage, to handle large collections.

Most clustering algorithms in practical use are distance-based: they start from a measure of distance (or similarity) between the objects being clustered.

3.6 Types of Clustering Algorithms

Clustering algorithms may be classified into 4 groups:

(1) Exclusive Clustering: each object belongs to exactly one cluster.

(2) Overlapping Clustering: an object may belong to two or more clusters, possibly with different degrees of membership.

(3) Hierarchical Clustering: the clusters are arranged in a hierarchy, obtained by successively merging (or splitting) clusters; at any single level the clustering may be exclusive or overlapping.

(4) Probabilistic Clustering: the clusters are defined by a probabilistic model, and each object is assigned a probability of belonging to each cluster.

The algorithms can also be divided into 2 broad groups according to how they work:

1. methods based on a measure of association between the objects themselves, and
2. heuristic methods that operate directly on the descriptions of the objects.
1. Methods based on a measure of association between objects

1.1 Graph Theoretic Method

[Figure 3.4: a similarity graph over 6 objects and the clusters obtained from it.]

In the graph-theoretic approach an association measure is computed for every pair of objects, and a graph is built whose vertices are the objects, with an edge joining 2 objects whenever their association exceeds a chosen threshold. Clusters are then read off the thresholded graph: the connected components give loosely connected clusters, while the maximal complete subgraphs (cliques), in which every pair of members is directly connected, give the most tightly knit clusters.

[Figure 3.5: clusters of keywords obtained by thresholding an association graph.]

Figure 3.5 illustrates keyword clustering of this kind, which was studied by Sparck Jones and Jackson, by Augustson and Minker, and by Vaswani and Cameron; in that work a cluster is defined either as a connected component or, more strictly, as a maximal complete subgraph of the threshold graph.
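As an illustration, here is a small sketch (not from the original text; it assumes a precomputed similarity matrix as input) that thresholds the matrix and extracts the connected components as clusters:

    def threshold_clusters(sim, threshold):
        # Clusters = connected components of the graph whose edges are
        # exactly the pairs with similarity >= threshold.
        n = len(sim)
        seen, clusters = set(), []
        for start in range(n):
            if start in seen:
                continue
            component, stack = [], [start]    # depth-first search
            seen.add(start)
            while stack:
                u = stack.pop()
                component.append(u)
                for v in range(n):
                    if v not in seen and sim[u][v] >= threshold:
                        seen.add(v)
                        stack.append(v)
            clusters.append(component)
        return clusters

    sim = [[1.0, 0.8, 0.1, 0.0],
           [0.8, 1.0, 0.2, 0.0],
           [0.1, 0.2, 1.0, 0.9],
           [0.0, 0.0, 0.9, 1.0]]
    print(threshold_clusters(sim, 0.5))   # -> [[0, 1], [2, 3]]

Finding the maximal complete subgraphs instead is considerably more expensive (clique enumeration is a hard combinatorial problem), which is one reason connected components are often preferred in practice.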

1.2 Single Link Hierarchic Cluster

The single-link method builds a hierarchy of clusters from the objects and a dissimilarity coefficient (DC) defined between them. The result is usually drawn as a dendrogram (a tree structure) in which the objects are the leaves and each internal node records the dissimilarity level at which two clusters merge.

[Figure 3.6: a dendrogram over the objects {A, B, C, D, E}.]

In the dendrogram of Figure 3.6, at level L1 the objects {A, B, C, D, E} form the clusters {A, B}, {C}, {D}, {E}; at level L2 the clusters are {A, B} and {C, D, E}; and at level L3 all the objects merge into the single cluster {A, B, C, D, E}. Cutting the dendrogram at any level therefore yields the clustering at that level.

Jardine and Sibson defined the single-link method in terms of the dissimilarity coefficient (DC): thresholding the DC at a sequence of levels produces the sequence of clusterings in the hierarchy (3 levels in the example above). Other hierarchic cluster methods, such as complete-link and average-link, differ only in how the dissimilarity between two clusters is defined.

To retrieve documents from such a hierarchy, a matching function compares the request with cluster representatives, and a threshold decides which clusters to expand. Retrieving from clusters low in the hierarchy (low level) gives high precision but low recall, with the cut-off at a low rank position; retrieving from clusters high in the hierarchy gives high recall but low precision.

[Figure 3.7: single-link clusters obtained by thresholding at successive levels.]

The single-link hierarchy is closely related to the minimum spanning tree (MST) of the objects. The single-link tree and the MST contain the same information: the single-link clusters at any level can be obtained by deleting from the MST every edge whose weight exceeds that level, after which the connected components that remain are exactly the single-link clusters at that level. The MST itself is the spanning tree that connects all the objects with the minimum possible total edge weight, so generating the MST is an efficient route to the whole single-link hierarchy.

[Figure: an example minimum spanning tree over a set of documents (nodes A, D, E, ...), with edge weights marking the dissimilarities.]
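The MST route to single-link clusters can be sketched as follows (illustrative code, not from the original text; it assumes a small symmetric dissimilarity matrix):

    def mst_edges(dist):
        # Prim's algorithm: repeatedly add the cheapest edge leaving the tree.
        n = len(dist)
        in_tree, edges = {0}, []
        while len(in_tree) < n:
            u, v = min(((i, j) for i in in_tree for j in range(n)
                        if j not in in_tree), key=lambda e: dist[e[0]][e[1]])
            edges.append((u, v, dist[u][v]))
            in_tree.add(v)
        return edges

    def single_link_clusters(dist, level):
        # Cut every MST edge heavier than `level`; the remaining connected
        # components are the single-link clusters at that level.
        n = len(dist)
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        for u, v, w in mst_edges(dist):
            if w <= level:
                parent[find(u)] = find(v)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    dist = [[0, 2, 9, 8],
            [2, 0, 7, 9],
            [9, 7, 0, 3],
            [8, 9, 3, 0]]
    print(single_link_clusters(dist, 4))   # -> [[0, 1], [2, 3]]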

2. Methods based on the descriptions of the objects

These heuristic methods do not compute the association between every pair of objects. Instead, each cluster is summarized by a cluster representative, also called a cluster profile, classification vector or centroid, and objects are compared with the representatives only. Typical input parameters controlled by such algorithms are:

- the number of clusters desired;
- the minimum and maximum size of each cluster;
- a threshold value on the matching function, below which an object is not assigned to a cluster;
- the amount of overlap allowed between clusters;
- an objective function whose value the iteration tries to optimize.

Two well-known algorithms of this kind operate on the descriptions of the objects.

1. Rocchio's clustering algorithm proceeds in 3 passes. Objects that fit no cluster are gathered into a "rag-bag" cluster, and thresholds on the matching function control both cluster membership and the overlap between clusters.

A simpler representative of this family is the single-pass algorithm (a code sketch follows at the end of this subsection):

- the object descriptions are processed serially;
- the first object becomes the representative of the first cluster;
- each subsequent object is matched against all cluster representatives existing at the time it is processed;
- the object is assigned to one cluster (or to several, if overlap is allowed) when the matching function value is high enough;
- when an object is assigned to a cluster, the representative of that cluster is recomputed;
- if an object fails the matching test for every existing cluster, it becomes the representative of a new cluster.

The outcome depends on the order in which the objects are processed and on the chosen input parameters.

2. Dattola's algorithm refines an initial assignment of objects to clusters iteratively, and can be applied at each level of a hierarchic classification.

Comparing the two families: graph-theoretic methods must compute the association measure between all pairs of objects, which takes on the order of n² operations for n objects, whereas heuristic methods based on matching functions can often be implemented in about n log n operations.
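The following sketch implements the single-pass procedure above (assumed details, not from the original text: cosine matching over keyword sets, a union-of-keywords cluster representative, and a fixed threshold):

    import math

    def cosine(x, y):
        return len(x & y) / math.sqrt(len(x) * len(y))

    def single_pass(docs, threshold=0.5):
        clusters = []                 # each cluster: {"members": [...], "rep": set}
        for doc in docs:              # descriptions are processed serially
            best, best_sim = None, 0.0
            for c in clusters:        # match against every existing representative
                s = cosine(doc, c["rep"])
                if s > best_sim:
                    best, best_sim = c, s
            if best is not None and best_sim >= threshold:
                best["members"].append(doc)
                best["rep"] |= doc    # recompute the cluster representative
            else:                     # fails the test: start a new cluster
                clusters.append({"members": [doc], "rep": set(doc)})
        return clusters

    docs = [{"database", "query"}, {"query", "index"}, {"neural", "network"}]
    for c in single_pass(docs):
        print(c["members"])

Running the sketch on the three example documents puts the two query-related documents in one cluster and the third in a cluster of its own; a different processing order or threshold can change the result, which is exactly the order-dependence noted above.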

3.7 Clustering Algorithms

This section describes several widely used clustering algorithms in more detail.

(1) K-means

K-means is a partitioning method: it divides n objects into K clusters, each represented by the mean (centroid) of its members. The basic procedure is:

a. choose K initial points to act as cluster centroids;
b. assign each object to the cluster whose centroid (mean) is closest;
c. when all objects have been assigned, recompute the centroid of each cluster;
d. repeat steps b and c until the assignments no longer change.

[Figure: three snapshots of K-means clustering, (1) to (3), showing the objects and the centroids over successive iterations.]

The time complexity of K-means is O(tkn), where n is the number of objects, k the number of clusters and t the number of iterations; normally k, t << n. The algorithm usually terminates at a local optimum of the squared-error criterion rather than the global optimum, and the result depends on the initial centroids. The number of clusters k must also be specified in advance.

K-means has further weaknesses. It is applicable only when the mean of a cluster is defined, so categorical data require variants such as k-modes, which replaces the mean by the mode and updates it with a frequency-based method, or k-prototype, which handles mixed numeric and categorical data. It is also sensitive to noise and outliers, because a few extreme values can distort the mean.

The k-medoids approach addresses this by using the medoid, the most centrally located object in a cluster, instead of the mean. PAM (Partitioning Around Medoids, 1987) works as follows:

- select k objects arbitrarily as the initial medoids;
- for each pair of a non-selected object h and a selected medoid i, compute the total swapping cost TC_ih;
- if TC_ih < 0, replace medoid i by h;
- assign every non-selected object to the most similar medoid;
- repeat until there is no change.



When the objects are feature vectors, the procedure can be stated as pseudocode. Given n objects X1, X2, ..., Xn and K < n clusters, let mi denote the mean (centroid) of cluster i; an object X is assigned to the cluster i that minimizes ‖X − mi‖ over all K means:

    make initial guesses for the means m1, m2, ..., mK
    until there are no changes in any mean
        classify the samples into clusters using the estimated means
        for i = 1 to K
            replace mi with the mean of all samples assigned to cluster i
        end_for
    end_until

The algorithm is usually run multiple times with different initial means to reduce the effect of local optima, and the same idea extends to fuzzy feature vectors, where each sample has a degree of membership in every cluster.
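Below is a direct transcription of this pseudocode into Python (a minimal sketch with plain lists, Euclidean distance and random initial means; none of these details come from the original text):

    import random

    def kmeans(points, k, iters=100):
        means = random.sample(points, k)      # initial guesses for the means
        clusters = []
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:                  # classify by the nearest mean
                i = min(range(k),
                        key=lambda i: sum((a - b) ** 2
                                          for a, b in zip(p, means[i])))
                clusters[i].append(p)
            # replace each mean with the mean of its cluster (keep it if empty)
            new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                         else means[i] for i, cl in enumerate(clusters)]
            if new_means == means:            # no change in any mean: stop
                break
            means = new_means
        return means, clusters

    pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
    means, clusters = kmeans(pts, 2)
    print(means)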

Several refinements exist for large data sets, such as CLARA (Clustering LARge Applications, 1990), which applies PAM to samples of the data, and CLARANS (Ng & Han, 1994), which performs a randomized search over sets of medoids. Sanpawat Kantabutra and Couch (http://www.cs.tufts.edu/~{sanpawat,couch}) proposed a parallel K-means algorithm for networks of workstations that distributes the distance computations across machines, with a reported speedup on the order of K/2, where K is the number of clusters.
Spherical K-Means for document clustering

One application of K-means to full text is the clustering of unstructured text documents, such as Thai-language documents, by their (global) content. Each document is represented with the Vector Space Model (VSM): letting w_ik be the weight of term k in document i, document Di becomes the vector

    Di = (w_i1, w_i2, ..., w_it)

in a t-dimensional space with one dimension per index term (a document described by three terms, for instance, is a point in a 3-dimensional space).

The similarity between two documents is measured by the cosine coefficient, which ranges from 0 to 1. The term weights are based on term frequency, using the tf*idf (term frequency * inverse document frequency) scheme, where idf = log(N/df), N is the number of documents in the collection and df is the number of documents containing the term. The weights are then normalized so that each document vector has length 1:

    w_ik = tf_ik · log(N/df_k) / √( Σ_{j=1}^{t} (tf_ij)² · [log(N/df_j)]² )

where tf_ik is the frequency of term k in document i, N is the total number of documents and df_k is the number of documents that contain term k. With ‖Di‖ = ‖Dj‖ = 1, the cosine similarity reduces to the inner product of the two vectors.

[Table 3.3: computing normalized tf·idf weights for three example documents D1, D2, D3.]

For Thai text the documents must first pass through word segmentation, since Thai is written without spaces between words; the segmented terms are then counted to obtain the tf, df and idf values, as in the example of Table 3.3.

Because each document contains only a small fraction of the vocabulary, most entries of a document vector are 0, so the VSM representation is sparse. These vectors are the input to the clustering step.

[Figure 3.8: the document-clustering pipeline, from word segmentation through the VSM representation to spherical K-means.]

Spherical K-means is K-means adapted to unit-length document vectors: instead of the Euclidean distance it uses the cosine similarity, and after each update the centroids are renormalized to unit length, so that documents and centroids alike lie on the unit sphere.
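As a small illustration (a hypothetical three-document collection, not the experiment described below), the weighting scheme can be computed as follows:

    import math
    from collections import Counter

    docs = [["database", "query", "query"],       # already word-segmented
            ["database", "index"],
            ["neural", "network"]]

    N = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}

    def tfidf_vector(doc):
        # w_k = tf_k * log(N / df_k), then normalize to unit length
        tf = Counter(doc)
        w = [tf[t] * math.log(N / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in w))
        return [x / norm for x in w] if norm else w

    vecs = [tfidf_vector(d) for d in docs]
    # for unit vectors the inner product equals the cosine similarity
    sim = sum(a * b for a, b in zip(vecs[0], vecs[1]))
    print(round(sim, 3))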

In the reported experiment the collection contained 4,800 Thai documents in 5 categories (1,146 / 1,653 / 828 / 47 / 1,126 documents respectively). After word segmentation with the Longest Matching algorithm, the vocabulary contained 32,675 distinct terms, and the quality of the clusters was evaluated with the F-measure against the known categories.

(2) Fuzzy C-means (FCM)

In hard clustering methods such as K-means, each object belongs to exactly one cluster. In fuzzy clustering, an object may belong to every cluster at once, with a degree of membership between 0 and 1 for each cluster; memberships near 1 indicate a strong association with that cluster. The distance between an object and a cluster center can be measured with the Euclidean distance or, to account for the scale and correlation of the data, the Mahalanobis distance. Like K-means, fuzzy clustering is sensitive to outliers, since extreme values still pull the cluster centers; unlike K-means, the soft memberships express the degree of correlation between an object and every cluster rather than a single hard assignment.

Fuzzy clustering was introduced by Dunn and generalized by Bezdek as the fuzzy C-means (FCM) algorithm, which proceeds as follows:

- fix the number of clusters c and the fuzziness exponent m (m > 1);
- initialize the membership matrix (or the cluster centers);
- compute the cluster centers from the current memberships;
- recompute the memberships and evaluate the objective function;
- repeat until the improvement of the objective function falls below a tolerance.

The objective function minimized by FCM is

    J = Σ_{i=1}^{c} Σ_{j=1}^{n} (μ_ij)^m · d²(X_j, Z_i)

where J is the objective function value, X = {X1, X2, ..., Xn} are the n data points, c is the number of clusters, m > 1 is the fuzziness exponent, μ_ij is the membership of point X_j in cluster i, and d(X_j, Z_i) is the distance between point X_j and cluster center Z_i.

At each iteration the cluster centers are updated by

    Z_i = Σ_{j=1}^{n} (μ_ij)^m · X_j / Σ_{j=1}^{n} (μ_ij)^m

and the memberships by

    μ_ij = [1 / d²(X_j, Z_i)]^{1/(m−1)} / Σ_{i=1}^{c} [1 / d²(X_j, Z_i)]^{1/(m−1)}

[Flow chart: the FCM iteration. Initialize the centroids Z1, Z2, ..., Zc; calculate the memberships from the given centroids; calculate new centroids; if the centroids improved, recalculate the memberships and the objective function and repeat; otherwise stop.]

The Euclidean distance between point X_j and center Z_i is

    ED_ji = √( (X_j − Z_i)(X_j − Z_i)^T )

where ^T denotes the transpose. The Mahalanobis distance is

    MD_ji = √( (X_j − Z_i) A⁻¹ (X_j − Z_i)^T )

where A is the variance-covariance matrix of the data:

    A = Σ_{j=1}^{n} (X_j − Z_i)^T (X_j − Z_i) / (n − 1)
An application of fuzzy clustering to Thai-language documents is reported in a JCSSE 2005 paper (www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf).
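The update loop can be sketched as follows (illustrative Python with Euclidean distance and m = 2; the random initialization and the stopping rule are simplifications, not details from the original text):

    import random

    def fcm(points, c, m=2.0, iters=100, eps=1e-6):
        n, dim = len(points), len(points[0])
        # random initial memberships, each row normalized over the c clusters
        u = [[random.random() for _ in range(c)] for _ in range(n)]
        u = [[v / sum(row) for v in row] for row in u]
        for _ in range(iters):
            # centers: Z_i = sum_j (u_ij)^m X_j / sum_j (u_ij)^m
            centers = []
            for i in range(c):
                w = [u[j][i] ** m for j in range(n)]
                centers.append([sum(w[j] * points[j][d] for j in range(n)) /
                                sum(w) for d in range(dim)])
            # memberships: u_ij proportional to (1 / d^2)^(1 / (m - 1))
            new_u = []
            for j in range(n):
                d2 = [max(sum((points[j][k] - centers[i][k]) ** 2
                              for k in range(dim)), eps) for i in range(c)]
                inv = [(1.0 / d) ** (1.0 / (m - 1)) for d in d2]
                s = sum(inv)
                new_u.append([v / s for v in inv])
            done = max(abs(new_u[j][i] - u[j][i])
                       for j in range(n) for i in range(c)) < eps
            u = new_u
            if done:
                break
        return centers, u

    pts = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]]
    centers, u = fcm(pts, 2)
    print([[round(x, 2) for x in z] for z in centers])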


(3) Hierarchical Clustering

Hierarchical methods build a hierarchical decomposition of the data from a distance matrix. They do not require the number of clusters k as input, but they do need a termination condition that says when to stop merging or splitting.

AGNES (AGglomerative NESting), introduced by Kaufmann and Rousseeuw (1990), is a bottom-up method. Using the single-link idea, it repeatedly merges the pair of clusters with the smallest dissimilarity and continues until all the objects end up in one cluster. The result is a dendrogram, a tree of clusters: a clustering of the data is obtained by cutting the dendrogram at the desired level, so that each connected component at that level becomes a cluster.

DIANA (DIvisive ANAlysis), also due to Kaufmann and Rousseeuw (1990), works in the inverse order: it starts with all the objects in one cluster and repeatedly splits clusters until each object stands alone. Single-link dissimilarities can again be used to decide the splits.

The main weaknesses of plain agglomerative methods are that they do not scale well, since the time complexity is at least O(n²) for n objects, and that a merge or split, once made, can never be undone. Later algorithms integrate hierarchical clustering with other techniques: BIRCH (1996) uses a CF-tree and incrementally adjusts the quality of sub-clusters; CURE (1998) represents a cluster by a set of well-scattered representative points; CHAMELEON (1999) uses dynamic modelling and improves on the cluster quality of BIRCH and CURE.

The basic agglomerative algorithm, given N items and an N×N distance matrix, is:

1. Start by assigning each item to its own cluster, so that there are N clusters, each containing just one item.
2. Find the closest pair of clusters and merge them into a single cluster, leaving one cluster less.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all the items are clustered into a single cluster of size N. (To obtain K clusters instead, stop after N − K merges.)

Hierarchical algorithms differ in how step 3 defines the distance between clusters: single-linkage uses the minimum distance between members of the two clusters, complete-linkage the maximum, and average-linkage the average.

Single-Linkage Clustering

The single-linkage algorithm works on an N×N proximity matrix D = [d(i, j)]. The clusterings are assigned sequence numbers 0, 1, ..., (n − 1); L(k) is the level of the kth clustering, m is the sequence number, and the proximity between clusters (r) and (s) is written d[(r), (s)]. The algorithm proceeds in the following steps.

Step 1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.

Step 2. Find the least dissimilar pair of clusters (r), (s) in the current clustering:

    d[(r), (s)] = min d[(i), (j)]

where the minimum is taken over all pairs of clusters in the current clustering.

Step 3. Increment the sequence number, m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m, and set the level of this clustering to

    L(m) = d[(r), (s)]

Step 4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster, denoted (r, s). The proximity between the new cluster (r, s) and an old cluster (k) is defined as

    d[(k), (r, s)] = min { d[(k), (r)], d[(k), (s)] }

Step 5. If all the objects are in one cluster, stop; otherwise go back to Step 2.

Steps 2-5 are repeated until the hierarchy is complete.
Example 3.4: hierarchical clustering with the single-linkage method. The objects are six Italian cities, BA, FI, MI, NA, RM and TO, and the proximities are the distances between them.

Input distance matrix (L = 0, m = 0):

          BA    FI    MI    NA    RM    TO
    BA     0   662   877   255   412   996
    FI   662     0   295   468   268   400
    MI   877   295     0   754   564   138
    NA   255   468   754     0   219   869
    RM   412   268   564   219     0   669
    TO   996   400   138   869   669     0

The nearest pair of cities is MI and TO, at distance 138. They are merged into a cluster called "MI/TO", with level L(MI/TO) = 138 and m = 1. Following the single-linkage rule, the distance from MI/TO to every other city is the minimum of the distances from MI and from TO:

            BA    FI   MI/TO   NA    RM
    BA       0   662    877   255   412
    FI     662     0    295   468   268
    MI/TO  877   295      0   754   564
    NA     255   468    754     0   219
    RM     412   268    564   219     0

Now min d(i, j) = d(NA, RM) = 219, so NA and RM are merged into the cluster NA/RM, with L(NA/RM) = 219 and m = 2:

            BA    FI   MI/TO  NA/RM
    BA       0   662    877    255
    FI     662     0    295    268
    MI/TO  877   295      0    564
    NA/RM  255   268    564      0

Now min d(i, j) = d(BA, NA/RM) = 255, so BA is merged with NA/RM into the cluster BA/NA/RM, with L(BA/NA/RM) = 255 and m = 3:

               BA/NA/RM   FI   MI/TO
    BA/NA/RM        0    268    564
    FI            268      0    295
    MI/TO         564    295      0

Now min d(i, j) = d(BA/NA/RM, FI) = 268, so FI is merged in to form BA/NA/RM/FI, with L(BA/NA/RM/FI) = 268 and m = 4:

                  BA/NA/RM/FI   MI/TO
    BA/NA/RM/FI        0         295
    MI/TO            295           0

Finally, the last 2 clusters are merged at level 295, and the whole process is summarized by the hierarchical tree (dendrogram) over

    BA  NA  RM  FI  MI  TO

in which BA, NA and RM join first, then FI; MI and TO join at the lowest level; and the two groups merge at level 295.

Like other agglomerative schemes, this algorithm requires O(n²) time and space for n objects, which limits it to collections of moderate size.
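For comparison, the same example can be reproduced with library routines (a sketch assuming SciPy is available; the cluster labels returned are arbitrary integers):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    labels = ["BA", "FI", "MI", "NA", "RM", "TO"]
    D = np.array([
        [  0, 662, 877, 255, 412, 996],
        [662,   0, 295, 468, 268, 400],
        [877, 295,   0, 754, 564, 138],
        [255, 468, 754,   0, 219, 869],
        [412, 268, 564, 219,   0, 669],
        [996, 400, 138, 869, 669,   0],
    ], dtype=float)

    # single-linkage merges at levels 138, 219, 255, 268, 295
    Z = linkage(squareform(D), method="single")
    clusters = fcluster(Z, t=270, criterion="distance")  # cut below the last merge
    print(dict(zip(labels, clusters)))   # BA/NA/RM/FI together, MI/TO together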

(4) Mixture of Gaussians

Mixture-of-Gaussians clustering is a model-based method: it assumes the data were generated by a mixture of underlying probability distributions, with each cluster represented by one component distribution, for example a Gaussian for continuous data or a Poisson for count data. The whole data set is then described by a mixture model, and clustering amounts to estimating the parameters of the components.

In the one-dimensional Gaussian case each component i is a normal distribution, P(x | i) ~ N[μ_i, σ²], and the mixture density is

    P(x) = Σ_i P(i) · P(x | i, μ_1, μ_2, ..., μ_k)

where P(i) is the prior probability (mixing weight) of component i. The parameters are estimated with the EM (Expectation-Maximization) algorithm, which alternates between computing expected cluster memberships for the observations X_k and re-estimating the parameters.
To see how EM works, consider first a simple discrete example. Suppose each observation takes one of four values, with probabilities that depend on a single unknown parameter μ:

    X1 = 30,   P(X1) = 0.5
    X2 = 18,   P(X2) = μ
    X3 = 0,    P(X3) = 2μ
    X4 = 23,   P(X4) = 0.5 − 3μ

Scenario 1: complete data. The number of students obtaining each outcome is observed directly:

    X1 : a students
    X2 : b students
    X3 : c students
    X4 : d students

The likelihood of the observed counts is

    P(a, b, c, d | μ) ∝ (0.5)^a · μ^b · (2μ)^c · (0.5 − 3μ)^d

Taking logarithms,

    log P = a·log(0.5) + b·log(μ) + c·log(2μ) + d·log(0.5 − 3μ)

and setting the derivative with respect to μ to zero,

    ∂(log P)/∂μ = b/μ + 2c/(2μ) − 3d/(0.5 − 3μ) = 0

gives the maximum-likelihood estimate

    μ = (b + c) / ( 6(b + c + d) )

For example, with a = 14, b = 6, c = 9, d = 10 this gives μ = 15/150 = 1/10.

Scenario 2: incomplete (hidden) data. Now suppose we are told only the combined number h of students who obtained X1 or X2, together with c and d:

    X1 + X2 : h students
    X3      : c students
    X4      : d students

The counts a and b are hidden, so μ cannot be estimated directly. EM alternates between 2 steps. In the E-step, given the current estimate of μ, the expected values of the hidden counts are

    a = h · (1/2) / (1/2 + μ),      b = h · μ / (1/2 + μ)

In the M-step, μ is re-estimated from the expected counts a, b using the complete-data formula:

    μ = (b + c) / ( 6(b + c + d) )

The 2 steps are repeated until μ converges.
The EM algorithm for a mixture of Gaussians follows the same pattern.

Step 1: Initialize the parameters:

    λ_0 = { μ_1, μ_2, ..., μ_k, p_1, p_2, ..., p_k }

where the μ_i are the component means and the p_i the mixing weights.

Step 2 (E-step): compute the probability that each observation X_k belongs to each component j, given the current parameters λ_t:

    p(j | X_k, λ_t) = p(X_k | j, λ_t) · p(j | λ_t) / p(X_k | λ_t)
                    = p(X_k | μ_j, σ_j²) · p_j / Σ_i p(X_k | μ_i, σ_i²) · p_i

Step 3 (M-step): re-estimate the parameters from the expected memberships:

    μ_i(t+1) = Σ_k p(i | X_k, λ_t) · X_k / Σ_k p(i | X_k, λ_t)

    p_i(t+1) = (1/N) Σ_k p(i | X_k, λ_t)

Steps 2 and 3 are repeated until the parameters converge.
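A minimal sketch of these two steps (one-dimensional data, two components with known equal variance σ = 1; a simplification for illustration, not from the original text):

    import math

    def normal_pdf(x, mu, sigma=1.0):
        return (math.exp(-0.5 * ((x - mu) / sigma) ** 2) /
                (sigma * math.sqrt(2 * math.pi)))

    def em_two_gaussians(xs, mu=(0.0, 1.0), p=(0.5, 0.5), iters=50):
        mu, p = list(mu), list(p)
        for _ in range(iters):
            # E-step: responsibility p(j | x_k) of each component for each point
            resp = []
            for x in xs:
                w = [p[j] * normal_pdf(x, mu[j]) for j in range(2)]
                s = sum(w)
                resp.append([v / s for v in w])
            # M-step: re-estimate the means and the mixing weights
            for j in range(2):
                r = [resp[k][j] for k in range(len(xs))]
                mu[j] = sum(r[k] * xs[k] for k in range(len(xs))) / sum(r)
                p[j] = sum(r) / len(xs)
        return mu, p

    xs = [-2.1, -1.9, -2.0, 3.9, 4.1, 4.0]
    print(em_two_gaussians(xs))   # means near -2 and 4, weights near 0.5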

(5) Genetic Algorithm

The genetic algorithm, introduced by John Holland in 1975, is a search and optimization technique inspired by natural selection. It has been applied to many optimization problems, and in information retrieval it can be used to find the group of documents that best matches a query. A genetic algorithm has 5 main components:

a. an encoding of candidate solutions as chromosomes (typically bit strings);
b. an initial population of chromosomes;
c. a fitness function that scores each chromosome;
d. genetic operators, namely selection, crossover and mutation, that breed new chromosomes;
e. parameter settings, such as the population size and the probabilities of crossover and mutation.

Example 3.5 illustrates the idea on a small document collection.

Example 3.5: Suppose the collection contains 5 documents, each described by a set of keywords:

DOC1 = {Database, Query, Data Retrieval, Computer Network, DBMS}
DOC2 = {Artificial Intelligence, Internet, Indexing, Natural Language Processing}
DOC3 = {Database, Expert System, Information Retrieval System, Multimedia}
DOC4 = {Fuzzy Logic, Neural Network, Computer Network}
DOC5 = {Object-Oriented, DBMS, Query, Indexing}

Altogether there are 16 distinct keywords:

Artificial Intelligence, Computer Network, Data Retrieval, Database, DBMS, Expert System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural Language Processing, Neural Network, Object-Oriented, Query, Relational Database

Each document is encoded as a 16-bit chromosome, one bit per keyword (1 = the keyword is present):

DOC1 = 0110100000000011
DOC2 = 1000000101010000
DOC3 = 0001010010100000
DOC4 = 0100001000001000
DOC5 = 0000100100000110


The query is likewise encoded as a 16-bit vector over the same keywords. The fitness of each chromosome is its similarity to the query, measured by a coefficient such as the Dice coefficient, the Cosine coefficient or the Jaccard coefficient, which gives a value between 0.0 and 1.0, with 1.0 meaning a perfect match. Following the principle of survival of the fittest, chromosomes with higher fitness values are preferentially selected to produce the next generation.

Crossover takes 2 parent chromosomes, cuts them at a random position, and exchanges their tails. For example, crossing the parents

    101111110011101
    100110011110000

after the eighth bit produces the offspring

    101111111110000
    100110010011101

Mutation flips individual bits at random with a small probability. For example, flipping bit 10 of

    101111110011101

produces

    101111110111101

Selection, crossover and mutation are repeated generation after generation until the fitness of the best chromosome exceeds a chosen threshold (or a maximum number of generations is reached); the fittest chromosomes then identify the documents, or groups of documents, that best match the query.
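To make the loop concrete, here is a small sketch (the fitness function, query bits and parameter values are hypothetical choices for illustration, not taken from the text):

    import random

    QUERY = "0110100000000010"                 # hypothetical query bits

    def fitness(chrom):
        # Jaccard coefficient between chromosome and query bit sets
        x = {i for i, b in enumerate(chrom) if b == "1"}
        y = {i for i, b in enumerate(QUERY) if b == "1"}
        return len(x & y) / len(x | y) if x | y else 0.0

    def crossover(p1, p2):
        cut = random.randint(1, len(p1) - 1)   # exchange tails after `cut`
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

    def mutate(chrom, rate=0.05):
        # flip each bit independently with probability `rate`
        return "".join(b if random.random() > rate else "10"[int(b)]
                       for b in chrom)

    def ga(pop, generations=50):
        for _ in range(generations):
            pop = sorted(pop, key=fitness, reverse=True)
            survivors = pop[: len(pop) // 2]   # survival of the fittest
            children = []
            while len(survivors) + len(children) < len(pop):
                c1, c2 = crossover(*random.sample(survivors, 2))
                children += [mutate(c1), mutate(c2)]
            pop = survivors + children[: len(pop) - len(survivors)]
        return max(pop, key=fitness)

    population = ["".join(random.choice("01") for _ in range(16))
                  for _ in range(20)]
    best = ga(population)
    print(best, fitness(best))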


Review Questions

1. Compare the Simple coefficient, Dice's coefficient, Jaccard's coefficient, the Cosine coefficient and the Overlap coefficient as measures of association.
2. Define the dissimilarity measure related to Jaccard's coefficient.
3. Show how Dice's coefficient gives rise to a dissimilarity coefficient.
4. Distinguish between monothetic and polythetic classifications, giving an example of each.
5. Distinguish between an exclusive class and an overlapping class.
6. What is an ordered classification? Give an example of its use.
7. Why is clustering useful in information retrieval?
8. Describe the main types of clustering algorithms.
9. …
10. Describe the graph-theoretic method of clustering.
11. Describe the single-link method.
12. Describe Rocchio's clustering algorithm.
13. Describe the K-means algorithm and its properties.
14. Describe the PAM algorithm.
15. Describe the Fuzzy C-means (FCM) algorithm.
16. Work through an example of single-linkage clustering.
17. Explain how a genetic algorithm can be applied to clustering.

References

- Course text for CS337 (in Thai), 2535 [1992].
- Technical Journal (in Thai), Vol. 11, No. 7, March-June 2000.
- Sanpawat Kantabutra and Alva L. Couch, "Parallel K-means Clustering Algorithm on NOWs", Department of Computer Science, Tufts University, Medford, Massachusetts. www.nectec.or.th/NTJ/n06/papers/No6_short_1.pdf
- Paper on fuzzy document clustering (in Thai), The Joint Conference on Computer Science and Software Engineering (JCSSE 2005), November 17-18, 2005. www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf
- Paper on Spherical K-Means document clustering (in Thai), Intelligent Information Retrieval and Database Laboratory, Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand.
