
Chapter 3
Clustering Algorithms

Contents
3.1 Automatic Classification
3.2 Measures of Association
3.3 Dissimilarity
3.4 Classification Methods
3.5 The Cluster Hypothesis
3.6 Types of Clustering Algorithms
3.7 Clustering Algorithms
Review Questions
References

3.1 Automatic Classification

Automatic classification is used in many fields besides information retrieval, such as pattern recognition and automatic medical diagnosis. Within IR there are 2 main applications: keyword clustering and document clustering. R.M. Hayes described classification as the grouping of items according to some logical relationship between them, and this logical organization can be viewed at 2 levels:

1. the grouping of related items into classes, and
2. the organization of the classes themselves into a useful structure.

In document clustering, each group of documents is summarized by a group vector (a cluster representative). An incoming query is first matched against the group vectors rather than against every document, and only the documents in the best-matching clusters are then compared with the query in detail, which reduces the cost of retrieval.

The association between documents, or between a document and a query, can be judged at several levels: string matching or comparison, use of the same vocabulary, the probability that the documents arise from the same model, and sameness of the meaning of the text.
3.2 Measures of Association

Many clustering methods start from a measure of association (similarity) between the objects to be clustered. When each object, such as a document, is represented by the set of keywords that describes it, the association between two sets X and Y can be measured by coefficients such as the following.

1. Simple coefficient

    |X ∩ Y|

the number of keywords shared by X and Y. For example, if

    X = {1, 2, 3}
    Y = {1, 4}

then X ∩ Y = {1} and |X ∩ Y| = 1.

2. Dice's coefficient

    2|X ∩ Y| / (|X| + |Y|)

which normalizes the size of the intersection by the sizes of X and Y.

3. Jaccard's coefficient

    |X ∩ Y| / |X ∪ Y|

4. Cosine coefficient

The cosine correlation was used by Salton in the SMART system. For two n-dimensional vectors X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_n) it is defined as

    (X, Y) / (‖X‖ · ‖Y‖)

that is, the inner product (·,·) divided by the product of the vector lengths ‖·‖, which is the cosine of the angle between the 2 vectors. In set notation it becomes

    |X ∩ Y| / (|X|^(1/2) · |Y|^(1/2))

In the vector-space model, the similarity between documents Di and Dj is computed from term weights:

    sim(Di, Dj) = Σ_{k=1}^{t} w_ik · w_jk

where t is the number of index terms and w_ik is the weight of term k in document Di:

    w_ik = tf_ik · log(N/n_k) / √( Σ_{k=1}^{t} (tf_ik)² · [log(N/n_k)]² )

Because the term weights are normalized, the resulting similarity values lie between 0 and 1; the cosine measure is thus a normalized inner product.

5. Overlap coefficient

    |X ∩ Y| / min(|X|, |Y|)
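The set-based coefficients are easy to state in code. The following is a minimal sketch (not part of the original text) using Python sets of keywords:

    import math

    def simple(x, y):
        # Simple coefficient: size of the intersection
        return len(x & y)

    def dice(x, y):
        # Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)
        return 2 * len(x & y) / (len(x) + len(y))

    def jaccard(x, y):
        # Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|
        return len(x & y) / len(x | y)

    def cosine(x, y):
        # Cosine coefficient: |X ∩ Y| / (|X|^(1/2) |Y|^(1/2))
        return len(x & y) / math.sqrt(len(x) * len(y))

    def overlap(x, y):
        # Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)
        return len(x & y) / min(len(x), len(y))

    X, Y = {1, 2, 3}, {1, 4}
    print(simple(X, Y), dice(X, Y), jaccard(X, Y), cosine(X, Y), overlap(X, Y))
    # -> 1 0.4 0.25 0.4082482904638631 0.5

Applied to the example above, the five coefficients give 1, 0.4, 0.25, 0.41 and 0.5 respectively.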
Tables 3.1 and 3.2 illustrate the vector space: each document D and each query Q is represented by a vector of normalized term weights w.

In this space, a query Q is compared with two documents,

    D1 = (0.8, 0.3)
    D2 = (0.2, 0.3)

plotted as vectors along two term axes, Term A and Term B.

[Figure 3.1: the cosine of the angle, in degrees, between each document vector and the query vector.]

The cosine of the angle between each document vector and the query measures their similarity:

    cos θ1 = 0.74
    cos θ2 = 0.98

Since cos θ2 > cos θ1, document D2 points in nearly the same direction as the query Q, while D1 lies further away. For instance,

    SIM(Q, D1) = 0.56 / √0.58 ≈ 0.74

3.3 Dissimilarity

Instead of a similarity measure, many cluster methods are defined in terms of a dissimilarity coefficient. Let P be the set of objects to be clustered. A dissimilarity coefficient D is a function from P × P to the non-negative real numbers satisfying:

(1) D(X, Y) ≥ 0 for all X, Y in P
(2) D(X, X) = 0 for all X in P
(3) D(X, Y) = D(Y, X) for all X, Y in P
(4) D(X, Y) ≤ D(X, Z) + D(Y, Z)   (the triangle inequality)

A dissimilarity coefficient that also satisfies condition (4) is a metric. Some common dissimilarity coefficients follow.

1. The symmetric-difference measure

    |X Δ Y| / (|X| + |Y|),   where |X Δ Y| = |X ∪ Y| − |X ∩ Y|

This is the complement of Dice's coefficient 2|X ∩ Y| / (|X| + |Y|):

    Dissimilarity = 1 − 2|X ∩ Y| / (|X| + |Y|)
                  = (|X| + |Y| − 2|X ∩ Y|) / (|X| + |Y|)
                  = (|X ∪ Y| − |X ∩ Y|) / (|X| + |Y|)
                  = |X Δ Y| / (|X| + |Y|)

2. A measure related to Jaccard's coefficient. Represent each object as a binary vector in which position i is 1 if keyword i is present and 0 if it is absent. Then

    |X| = Σ_i X_i   (i = 1, ..., N)

where N is the total number of keywords, and

    |X Δ Y| = Σ_i ( X_i(1 − Y_i) + Y_i(1 − X_i) )

so that the measure in 1 can be written

    |X Δ Y| / (|X| + |Y|) = Σ_i ( X_i(1 − Y_i) + Y_i(1 − X_i) ) / Σ_i ( X_i + Y_i )

If instead each of the 2 objects is characterized by a probability distribution over the presence (1) and absence (0) of each keyword, with probabilities P1(1), P1(0), P2(1), P2(0), then Jardine and Sibson define a dissimilarity between the two distributions known as the information radius.

3. The information radius is computed with two positive weights u and v attached to the two distributions.

3.4 Classification Methods

A classification method groups objects, such as documents, keywords, hand-written characters or biological species, on the basis of their descriptions. The description of an object may take several forms, for example:

- the set of keywords assigned to it,
- numeric attribute values,
- probability distributions over its attributes.

Sparck Jones distinguished classifications along three dimensions.

1. Monothetic versus polythetic. Let G = {f1, f2, f3, ..., fn} be the set of features (properties) that the individuals of a class may possess, and consider a table whose rows are the individuals and whose columns are the features f in G. A class is monothetic if every individual in it possesses all of the defining features; it is polythetic if each individual possesses a large number of the class features and each feature is possessed by a large number of the individuals, without any single feature being either necessary or sufficient for membership.

[Table 3.2: a feature-by-individual matrix contrasting a monothetic class with a polythetic class.]

2. Exclusive versus overlapping. In an exclusive classification each individual belongs to exactly one class, so the classes are mutually disjoint. In an overlapping classification an individual may belong to more than one class. Overlapping classes can describe the individuals more faithfully, but exclusive classes are simpler to store and to search.

3. Ordered versus unordered. In an ordered classification the classes themselves are arranged in a structure, as in a hierarchical classification where small classes are nested inside larger ones. In an unordered classification the classes form a flat, unstructured set, as in the keyword classes of a thesaurus.
3.5 The Cluster Hypothesis

The use of clustering in information retrieval rests on the cluster hypothesis:

    "closely associated documents tend to be relevant to the same requests"

In other words, documents that are relevant to a request tend to be more similar to one another than they are to non-relevant documents.

[Figure 3.3: for a given request, the distributions of association values between pairs of relevant documents (R-R) and between relevant and non-relevant pairs (R-N-R); the x axis is the strength of association, the y axis the relative frequency.]

Figure 3.3 shows, for a request, the distribution of association values for relevant-relevant (R-R) pairs and for relevant-non-relevant (R-N-R) pairs of documents. The R-R distribution is shifted toward higher association values, and this separation between the two distributions supports the hypothesis: if relevant documents do cluster together, then document clustering can gather them into the same clusters and thereby both speed up retrieval and improve its effectiveness.

A clustering method suitable for document clustering should have certain properties:

- the clustering should be stable under growth, i.e. unlikely to change drastically when further documents are added;
- small errors in the descriptions of the documents should lead to only small changes in the clustering;
- the result should not depend on the initial ordering of the documents;
- the method should be efficient enough, in time and storage, to handle large collections.

Most clustering algorithms in practical use are distance-based: they start from a measure of distance (or similarity) between the objects being clustered.

3.6 Types of Clustering Algorithms

Clustering algorithms may be classified into 4 groups:

(1) Exclusive Clustering: each object belongs to exactly one cluster.

(2) Overlapping Clustering: an object may belong to two or more clusters, possibly with different degrees of membership.

(3) Hierarchical Clustering: the clusters are arranged in a hierarchy, obtained by successively merging (or splitting) clusters; at any single level the clustering may be exclusive or overlapping.

(4) Probabilistic Clustering: the clusters are defined by a probabilistic model, and each object is assigned a probability of belonging to each cluster.

The algorithms can also be divided into 2 broad groups according to how they work:

1. methods based on a measure of association between the objects themselves, and
2. heuristic methods that operate directly on the descriptions of the objects.
1. Methods based on a measure of association between objects

1.1 Graph Theoretic Method

[Figure 3.4: a similarity graph over 6 objects and the clusters obtained from it.]

In the graph-theoretic approach an association measure is computed for every pair of objects, and a graph is built whose vertices are the objects, with an edge joining 2 objects whenever their association exceeds a chosen threshold. Clusters are then read off the thresholded graph: the connected components give loosely connected clusters, while the maximal complete subgraphs (cliques), in which every pair of members is directly connected, give the most tightly knit clusters.

[Figure 3.5: clusters of keywords obtained by thresholding an association graph.]

Figure 3.5 illustrates keyword clustering of this kind, which was studied by Sparck Jones and Jackson, by Augustson and Minker, and by Vaswani and Cameron; in that work a cluster is defined either as a connected component or, more strictly, as a maximal complete subgraph of the threshold graph.
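As an illustration, here is a small sketch (not from the original text; it assumes a precomputed similarity matrix as input) that thresholds the matrix and extracts the connected components as clusters:

    def threshold_clusters(sim, threshold):
        # Clusters = connected components of the graph whose edges are
        # exactly the pairs with similarity >= threshold.
        n = len(sim)
        seen, clusters = set(), []
        for start in range(n):
            if start in seen:
                continue
            component, stack = [], [start]    # depth-first search
            seen.add(start)
            while stack:
                u = stack.pop()
                component.append(u)
                for v in range(n):
                    if v not in seen and sim[u][v] >= threshold:
                        seen.add(v)
                        stack.append(v)
            clusters.append(component)
        return clusters

    sim = [[1.0, 0.8, 0.1, 0.0],
           [0.8, 1.0, 0.2, 0.0],
           [0.1, 0.2, 1.0, 0.9],
           [0.0, 0.0, 0.9, 1.0]]
    print(threshold_clusters(sim, 0.5))   # -> [[0, 1], [2, 3]]

Finding the maximal complete subgraphs instead is considerably more expensive (clique enumeration is a hard combinatorial problem), which is one reason connected components are often preferred in practice.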

1.2 Single Link Hierarchic Cluster

The single-link method builds a hierarchy of clusters from the objects and a dissimilarity coefficient (DC) defined between them. The result is usually drawn as a dendrogram (a tree structure) in which the objects are the leaves and each internal node records the dissimilarity level at which two clusters merge.

[Figure 3.6: a dendrogram over the objects {A, B, C, D, E}.]

In the dendrogram of Figure 3.6, at level L1 the objects {A, B, C, D, E} form the clusters {A, B}, {C}, {D}, {E}; at level L2 the clusters are {A, B} and {C, D, E}; and at level L3 all the objects merge into the single cluster {A, B, C, D, E}. Cutting the dendrogram at any level therefore yields the clustering at that level.

Jardine and Sibson defined the single-link method in terms of the dissimilarity coefficient (DC): thresholding the DC at a sequence of levels produces the sequence of clusterings in the hierarchy (3 levels in the example above). Other hierarchic cluster methods, such as complete-link and average-link, differ only in how the dissimilarity between two clusters is defined.

To retrieve documents from such a hierarchy, a matching function compares the request with cluster representatives, and a threshold decides which clusters to expand. Retrieving from clusters low in the hierarchy (low level) gives high precision but low recall, with the cut-off at a low rank position; retrieving from clusters high in the hierarchy gives high recall but low precision.

[Figure 3.7: single-link clusters obtained by thresholding at successive levels.]

The single-link hierarchy is closely related to the minimum spanning tree (MST) of the objects. The single-link tree and the MST contain the same information: the single-link clusters at any level can be obtained by deleting from the MST every edge whose weight exceeds that level, after which the connected components that remain are exactly the single-link clusters at that level. The MST itself is the spanning tree that connects all the objects with the minimum possible total edge weight, so generating the MST is an efficient route to the whole single-link hierarchy.

[Figure: an example minimum spanning tree over a set of documents (nodes A, D, E, ...), with edge weights marking the dissimilarities.]
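The MST route to single-link clusters can be sketched as follows (illustrative code, not from the original text; it assumes a small symmetric dissimilarity matrix):

    def mst_edges(dist):
        # Prim's algorithm: repeatedly add the cheapest edge leaving the tree.
        n = len(dist)
        in_tree, edges = {0}, []
        while len(in_tree) < n:
            u, v = min(((i, j) for i in in_tree for j in range(n)
                        if j not in in_tree), key=lambda e: dist[e[0]][e[1]])
            edges.append((u, v, dist[u][v]))
            in_tree.add(v)
        return edges

    def single_link_clusters(dist, level):
        # Cut every MST edge heavier than `level`; the remaining connected
        # components are the single-link clusters at that level.
        n = len(dist)
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        for u, v, w in mst_edges(dist):
            if w <= level:
                parent[find(u)] = find(v)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    dist = [[0, 2, 9, 8],
            [2, 0, 7, 9],
            [9, 7, 0, 3],
            [8, 9, 3, 0]]
    print(single_link_clusters(dist, 4))   # -> [[0, 1], [2, 3]]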

2. Methods based on the descriptions of the objects

These heuristic methods do not compute the association between every pair of objects. Instead, each cluster is summarized by a cluster representative, also called a cluster profile, classification vector or centroid, and objects are compared with the representatives only. Typical input parameters controlled by such algorithms are:

- the number of clusters desired;
- the minimum and maximum size of each cluster;
- a threshold value on the matching function, below which an object is not assigned to a cluster;
- the amount of overlap allowed between clusters;
- an objective function whose value the iteration tries to optimize.

Two well-known algorithms of this kind operate on the descriptions of the objects.

1. Rocchio's clustering algorithm proceeds in 3 passes. Objects that fit no cluster are gathered into a "rag-bag" cluster, and thresholds on the matching function control both cluster membership and the overlap between clusters.

A simpler representative of this family is the single-pass algorithm (a code sketch follows at the end of this subsection):

- the object descriptions are processed serially;
- the first object becomes the representative of the first cluster;
- each subsequent object is matched against all cluster representatives existing at the time it is processed;
- the object is assigned to one cluster (or to several, if overlap is allowed) when the matching function value is high enough;
- when an object is assigned to a cluster, the representative of that cluster is recomputed;
- if an object fails the matching test for every existing cluster, it becomes the representative of a new cluster.

The outcome depends on the order in which the objects are processed and on the chosen input parameters.

2. Dattola's algorithm refines an initial assignment of objects to clusters iteratively, and can be applied at each level of a hierarchic classification.

Comparing the two families: graph-theoretic methods must compute the association measure between all pairs of objects, which takes on the order of n² operations for n objects, whereas heuristic methods based on matching functions can often be implemented in about n log n operations.
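The following sketch implements the single-pass procedure above (assumed details, not from the original text: cosine matching over keyword sets, a union-of-keywords cluster representative, and a fixed threshold):

    import math

    def cosine(x, y):
        return len(x & y) / math.sqrt(len(x) * len(y))

    def single_pass(docs, threshold=0.5):
        clusters = []                 # each cluster: {"members": [...], "rep": set}
        for doc in docs:              # descriptions are processed serially
            best, best_sim = None, 0.0
            for c in clusters:        # match against every existing representative
                s = cosine(doc, c["rep"])
                if s > best_sim:
                    best, best_sim = c, s
            if best is not None and best_sim >= threshold:
                best["members"].append(doc)
                best["rep"] |= doc    # recompute the cluster representative
            else:                     # fails the test: start a new cluster
                clusters.append({"members": [doc], "rep": set(doc)})
        return clusters

    docs = [{"database", "query"}, {"query", "index"}, {"neural", "network"}]
    for c in single_pass(docs):
        print(c["members"])

Running the sketch on the three example documents puts the two query-related documents in one cluster and the third in a cluster of its own; a different processing order or threshold can change the result, which is exactly the order-dependence noted above.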

3.7 Clustering Algorithms

This section describes several widely used clustering algorithms in more detail.

(1) K-means

K-means is a partitioning method: it divides n objects into K clusters, each represented by the mean (centroid) of its members. The basic procedure is:

a. choose K initial points to act as cluster centroids;
b. assign each object to the cluster whose centroid (mean) is closest;
c. when all objects have been assigned, recompute the centroid of each cluster;
d. repeat steps b and c until the assignments no longer change.

[Figure: three snapshots of K-means clustering, (1) to (3), showing the objects and the centroids over successive iterations.]

The time complexity of K-means is O(tkn), where n is the number of objects, k the number of clusters and t the number of iterations; normally k, t << n. The algorithm usually terminates at a local optimum of the squared-error criterion rather than the global optimum, and the result depends on the initial centroids. The number of clusters k must also be specified in advance.

K-means has further weaknesses. It is applicable only when the mean of a cluster is defined, so categorical data require variants such as k-modes, which replaces the mean by the mode and updates it with a frequency-based method, or k-prototype, which handles mixed numeric and categorical data. It is also sensitive to noise and outliers, because a few extreme values can distort the mean.

The k-medoids approach addresses this by using the medoid, the most centrally located object in a cluster, instead of the mean. PAM (Partitioning Around Medoids, 1987) works as follows:

- select k objects arbitrarily as the initial medoids;
- for each pair of a non-selected object h and a selected medoid i, compute the total swapping cost TC_ih;
- if TC_ih < 0, replace medoid i by h;
- assign every non-selected object to the most similar medoid;
- repeat until there is no change.



When the objects are feature vectors, the procedure can be stated as pseudocode. Given n objects X1, X2, ..., Xn and K < n clusters, let mi denote the mean (centroid) of cluster i; an object X is assigned to the cluster i that minimizes ‖X − mi‖ over all K means:

    make initial guesses for the means m1, m2, ..., mK
    until there are no changes in any mean
        classify the samples into clusters using the estimated means
        for i = 1 to K
            replace mi with the mean of all samples assigned to cluster i
        end_for
    end_until

The algorithm is usually run multiple times with different initial means to reduce the effect of local optima, and the same idea extends to fuzzy feature vectors, where each sample has a degree of membership in every cluster.
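Below is a direct transcription of this pseudocode into Python (a minimal sketch with plain lists, Euclidean distance and random initial means; none of these details come from the original text):

    import random

    def kmeans(points, k, iters=100):
        means = random.sample(points, k)      # initial guesses for the means
        clusters = []
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:                  # classify by the nearest mean
                i = min(range(k),
                        key=lambda i: sum((a - b) ** 2
                                          for a, b in zip(p, means[i])))
                clusters[i].append(p)
            # replace each mean with the mean of its cluster (keep it if empty)
            new_means = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                         else means[i] for i, cl in enumerate(clusters)]
            if new_means == means:            # no change in any mean: stop
                break
            means = new_means
        return means, clusters

    pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
    means, clusters = kmeans(pts, 2)
    print(means)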

Several refinements exist for large data sets, such as CLARA (Clustering LARge Applications, 1990), which applies PAM to samples of the data, and CLARANS (Ng & Han, 1994), which performs a randomized search over sets of medoids. Sanpawat Kantabutra and Couch (http://www.cs.tufts.edu/~{sanpawat,couch}) proposed a parallel K-means algorithm for networks of workstations that distributes the distance computations across machines, with a reported speedup on the order of K/2, where K is the number of clusters.
Spherical K-Means for document clustering

One application of K-means to full text is the clustering of unstructured text documents, such as Thai-language documents, by their (global) content. Each document is represented with the Vector Space Model (VSM): letting w_ik be the weight of term k in document i, document Di becomes the vector

    Di = (w_i1, w_i2, ..., w_it)

in a t-dimensional space with one dimension per index term (a document described by three terms, for instance, is a point in a 3-dimensional space).

The similarity between two documents is measured by the cosine coefficient, which ranges from 0 to 1. The term weights are based on term frequency, using the tf*idf (term frequency * inverse document frequency) scheme, where idf = log(N/df), N is the number of documents in the collection and df is the number of documents containing the term. The weights are then normalized so that each document vector has length 1:

    w_ik = tf_ik · log(N/df_k) / √( Σ_{j=1}^{t} (tf_ij)² · [log(N/df_j)]² )

where tf_ik is the frequency of term k in document i, N is the total number of documents and df_k is the number of documents that contain term k. With ‖Di‖ = ‖Dj‖ = 1, the cosine similarity reduces to the inner product of the two vectors.

[Table 3.3: computing normalized tf·idf weights for three example documents D1, D2, D3.]

For Thai text the documents must first pass through word segmentation, since Thai is written without spaces between words; the segmented terms are then counted to obtain the tf, df and idf values, as in the example of Table 3.3.

Because each document contains only a small fraction of the vocabulary, most entries of a document vector are 0, so the VSM representation is sparse. These vectors are the input to the clustering step.

[Figure 3.8: the document-clustering pipeline, from word segmentation through the VSM representation to spherical K-means.]

Spherical K-means is K-means adapted to unit-length document vectors: instead of the Euclidean distance it uses the cosine similarity, and after each update the centroids are renormalized to unit length, so that documents and centroids alike lie on the unit sphere.
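As a small illustration (a hypothetical three-document collection, not the experiment described below), the weighting scheme can be computed as follows:

    import math
    from collections import Counter

    docs = [["database", "query", "query"],       # already word-segmented
            ["database", "index"],
            ["neural", "network"]]

    N = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}

    def tfidf_vector(doc):
        # w_k = tf_k * log(N / df_k), then normalize to unit length
        tf = Counter(doc)
        w = [tf[t] * math.log(N / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in w))
        return [x / norm for x in w] if norm else w

    vecs = [tfidf_vector(d) for d in docs]
    # for unit vectors the inner product equals the cosine similarity
    sim = sum(a * b for a, b in zip(vecs[0], vecs[1]))
    print(round(sim, 3))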

In the reported experiment the collection contained 4,800 Thai documents in 5 categories (1,146 / 1,653 / 828 / 47 / 1,126 documents respectively). After word segmentation with the Longest Matching algorithm, the vocabulary contained 32,675 distinct terms, and the quality of the clusters was evaluated with the F-measure against the known categories.

(2) Fuzzy C-means (FCM)

In hard clustering methods such as K-means, each object belongs to exactly one cluster. In fuzzy clustering, an object may belong to every cluster at once, with a degree of membership between 0 and 1 for each cluster; memberships near 1 indicate a strong association with that cluster. The distance between an object and a cluster center can be measured with the Euclidean distance or, to account for the scale and correlation of the data, the Mahalanobis distance. Like K-means, fuzzy clustering is sensitive to outliers, since extreme values still pull the cluster centers; unlike K-means, the soft memberships express the degree of correlation between an object and every cluster rather than a single hard assignment.

Fuzzy clustering was introduced by Dunn and generalized by Bezdek as the fuzzy C-means (FCM) algorithm, which proceeds as follows:

- fix the number of clusters c and the fuzziness exponent m (m > 1);
- initialize the membership matrix (or the cluster centers);
- compute the cluster centers from the current memberships;
- recompute the memberships and evaluate the objective function;
- repeat until the improvement of the objective function falls below a tolerance.

The objective function minimized by FCM is

    J = Σ_{i=1}^{c} Σ_{j=1}^{n} (μ_ij)^m · d²(X_j, Z_i)

where J is the objective function value, X = {X1, X2, ..., Xn} are the n data points, c is the number of clusters, m > 1 is the fuzziness exponent, μ_ij is the membership of point X_j in cluster i, and d(X_j, Z_i) is the distance between point X_j and cluster center Z_i.

At each iteration the cluster centers are updated by

    Z_i = Σ_{j=1}^{n} (μ_ij)^m · X_j / Σ_{j=1}^{n} (μ_ij)^m

and the memberships by

    μ_ij = [1 / d²(X_j, Z_i)]^{1/(m−1)} / Σ_{i=1}^{c} [1 / d²(X_j, Z_i)]^{1/(m−1)}

[Flow chart: the FCM iteration. Initialize the centroids Z1, Z2, ..., Zc; calculate the memberships from the given centroids; calculate new centroids; if the centroids improved, recalculate the memberships and the objective function and repeat; otherwise stop.]

The Euclidean distance between point X_j and center Z_i is

    ED_ji = √( (X_j − Z_i)(X_j − Z_i)^T )

where ^T denotes the transpose. The Mahalanobis distance is

    MD_ji = √( (X_j − Z_i) A⁻¹ (X_j − Z_i)^T )

where A is the variance-covariance matrix of the data:

    A = Σ_{j=1}^{n} (X_j − Z_i)^T (X_j − Z_i) / (n − 1)
An application of fuzzy clustering to Thai-language documents is reported in a JCSSE 2005 paper (www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf).
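The update loop can be sketched as follows (illustrative Python with Euclidean distance and m = 2; the random initialization and the stopping rule are simplifications, not details from the original text):

    import random

    def fcm(points, c, m=2.0, iters=100, eps=1e-6):
        n, dim = len(points), len(points[0])
        # random initial memberships, each row normalized over the c clusters
        u = [[random.random() for _ in range(c)] for _ in range(n)]
        u = [[v / sum(row) for v in row] for row in u]
        for _ in range(iters):
            # centers: Z_i = sum_j (u_ij)^m X_j / sum_j (u_ij)^m
            centers = []
            for i in range(c):
                w = [u[j][i] ** m for j in range(n)]
                centers.append([sum(w[j] * points[j][d] for j in range(n)) /
                                sum(w) for d in range(dim)])
            # memberships: u_ij proportional to (1 / d^2)^(1 / (m - 1))
            new_u = []
            for j in range(n):
                d2 = [max(sum((points[j][k] - centers[i][k]) ** 2
                              for k in range(dim)), eps) for i in range(c)]
                inv = [(1.0 / d) ** (1.0 / (m - 1)) for d in d2]
                s = sum(inv)
                new_u.append([v / s for v in inv])
            done = max(abs(new_u[j][i] - u[j][i])
                       for j in range(n) for i in range(c)) < eps
            u = new_u
            if done:
                break
        return centers, u

    pts = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 4.9]]
    centers, u = fcm(pts, 2)
    print([[round(x, 2) for x in z] for z in centers])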


(3) Hierarchical Clustering

Hierarchical methods build a hierarchical decomposition of the data from a distance matrix. They do not require the number of clusters k as input, but they do need a termination condition that says when to stop merging or splitting.

AGNES (AGglomerative NESting), introduced by Kaufmann and Rousseeuw (1990), is a bottom-up method. Using the single-link idea, it repeatedly merges the pair of clusters with the smallest dissimilarity and continues until all the objects end up in one cluster. The result is a dendrogram, a tree of clusters: a clustering of the data is obtained by cutting the dendrogram at the desired level, so that each connected component at that level becomes a cluster.

DIANA (DIvisive ANAlysis), also due to Kaufmann and Rousseeuw (1990), works in the inverse order: it starts with all the objects in one cluster and repeatedly splits clusters until each object stands alone. Single-link dissimilarities can again be used to decide the splits.

The main weaknesses of plain agglomerative methods are that they do not scale well, since the time complexity is at least O(n²) for n objects, and that a merge or split, once made, can never be undone. Later algorithms integrate hierarchical clustering with other techniques: BIRCH (1996) uses a CF-tree and incrementally adjusts the quality of sub-clusters; CURE (1998) represents a cluster by a set of well-scattered representative points; CHAMELEON (1999) uses dynamic modelling and improves on the cluster quality of BIRCH and CURE.

The basic agglomerative algorithm, given N items and an N×N distance matrix, is:

1. Start by assigning each item to its own cluster, so that there are N clusters, each containing just one item.
2. Find the closest pair of clusters and merge them into a single cluster, leaving one cluster less.
3. Compute the distances between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all the items are clustered into a single cluster of size N. (To obtain K clusters instead, stop after N − K merges.)

Hierarchical algorithms differ in how step 3 defines the distance between clusters: single-linkage uses the minimum distance between members of the two clusters, complete-linkage the maximum, and average-linkage the average.

Single-Linkage Clustering

The single-linkage algorithm works on an N×N proximity matrix D = [d(i, j)]. The clusterings are assigned sequence numbers 0, 1, ..., (n − 1); L(k) is the level of the kth clustering, m is the sequence number, and the proximity between clusters (r) and (s) is written d[(r), (s)]. The algorithm proceeds in the following steps.

Step 1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.

Step 2. Find the least dissimilar pair of clusters (r), (s) in the current clustering:

    d[(r), (s)] = min d[(i), (j)]

where the minimum is taken over all pairs of clusters in the current clustering.

Step 3. Increment the sequence number, m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m, and set the level of this clustering to

    L(m) = d[(r), (s)]

Step 4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster, denoted (r, s). The proximity between the new cluster (r, s) and an old cluster (k) is defined as

    d[(k), (r, s)] = min { d[(k), (r)], d[(k), (s)] }

Step 5. If all the objects are in one cluster, stop; otherwise go back to Step 2.

Steps 2-5 are repeated until the hierarchy is complete.
Example 3.4: hierarchical clustering with the single-linkage method. The objects are six Italian cities, BA, FI, MI, NA, RM and TO, and the proximities are the distances between them.

Input distance matrix (L = 0, m = 0):

          BA    FI    MI    NA    RM    TO
    BA     0   662   877   255   412   996
    FI   662     0   295   468   268   400
    MI   877   295     0   754   564   138
    NA   255   468   754     0   219   869
    RM   412   268   564   219     0   669
    TO   996   400   138   869   669     0

The nearest pair of cities is MI and TO, at distance 138. They are merged into a cluster called "MI/TO", with level L(MI/TO) = 138 and m = 1. Following the single-linkage rule, the distance from MI/TO to every other city is the minimum of the distances from MI and from TO:

            BA    FI   MI/TO   NA    RM
    BA       0   662    877   255   412
    FI     662     0    295   468   268
    MI/TO  877   295      0   754   564
    NA     255   468    754     0   219
    RM     412   268    564   219     0

Now min d(i, j) = d(NA, RM) = 219, so NA and RM are merged into the cluster NA/RM, with L(NA/RM) = 219 and m = 2:

            BA    FI   MI/TO  NA/RM
    BA       0   662    877    255
    FI     662     0    295    268
    MI/TO  877   295      0    564
    NA/RM  255   268    564      0

Now min d(i, j) = d(BA, NA/RM) = 255, so BA is merged with NA/RM into the cluster BA/NA/RM, with L(BA/NA/RM) = 255 and m = 3:

               BA/NA/RM   FI   MI/TO
    BA/NA/RM        0    268    564
    FI            268      0    295
    MI/TO         564    295      0

Now min d(i, j) = d(BA/NA/RM, FI) = 268, so FI is merged in to form BA/NA/RM/FI, with L(BA/NA/RM/FI) = 268 and m = 4:

                  BA/NA/RM/FI   MI/TO
    BA/NA/RM/FI        0         295
    MI/TO            295           0

Finally, the last 2 clusters are merged at level 295, and the whole process is summarized by the hierarchical tree (dendrogram) over

    BA  NA  RM  FI  MI  TO

in which BA, NA and RM join first, then FI; MI and TO join at the lowest level; and the two groups merge at level 295.

Like other agglomerative schemes, this algorithm requires O(n²) time and space for n objects, which limits it to collections of moderate size.
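For comparison, the same example can be reproduced with library routines (a sketch assuming SciPy is available; the cluster labels returned are arbitrary integers):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    labels = ["BA", "FI", "MI", "NA", "RM", "TO"]
    D = np.array([
        [  0, 662, 877, 255, 412, 996],
        [662,   0, 295, 468, 268, 400],
        [877, 295,   0, 754, 564, 138],
        [255, 468, 754,   0, 219, 869],
        [412, 268, 564, 219,   0, 669],
        [996, 400, 138, 869, 669,   0],
    ], dtype=float)

    # single-linkage merges at levels 138, 219, 255, 268, 295
    Z = linkage(squareform(D), method="single")
    clusters = fcluster(Z, t=270, criterion="distance")  # cut below the last merge
    print(dict(zip(labels, clusters)))   # BA/NA/RM/FI together, MI/TO together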

(4) Mixture of Gaussians

Mixture-of-Gaussians clustering is a model-based method: it assumes the data were generated by a mixture of underlying probability distributions, with each cluster represented by one component distribution, for example a Gaussian for continuous data or a Poisson for count data. The whole data set is then described by a mixture model, and clustering amounts to estimating the parameters of the components.

In the one-dimensional Gaussian case each component i is a normal distribution, P(x | i) ~ N[μ_i, σ²], and the mixture density is

    P(x) = Σ_i P(i) · P(x | i, μ_1, μ_2, ..., μ_k)

where P(i) is the prior probability (mixing weight) of component i. The parameters are estimated with the EM (Expectation-Maximization) algorithm, which alternates between computing expected cluster memberships for the observations X_k and re-estimating the parameters.
To see how EM works, consider first a simple discrete example. Suppose each observation takes one of four values, with probabilities that depend on a single unknown parameter μ:

    X1 = 30,   P(X1) = 0.5
    X2 = 18,   P(X2) = μ
    X3 = 0,    P(X3) = 2μ
    X4 = 23,   P(X4) = 0.5 − 3μ

Scenario 1: complete data. The number of students obtaining each outcome is observed directly:

    X1 : a students
    X2 : b students
    X3 : c students
    X4 : d students

The likelihood of the observed counts is

    P(a, b, c, d | μ) ∝ (0.5)^a · μ^b · (2μ)^c · (0.5 − 3μ)^d

Taking logarithms,

    log P = a·log(0.5) + b·log(μ) + c·log(2μ) + d·log(0.5 − 3μ)

and setting the derivative with respect to μ to zero,

    ∂(log P)/∂μ = b/μ + 2c/(2μ) − 3d/(0.5 − 3μ) = 0

gives the maximum-likelihood estimate

    μ = (b + c) / ( 6(b + c + d) )

For example, with a = 14, b = 6, c = 9, d = 10 this gives μ = 15/150 = 1/10.

Scenario 2: incomplete (hidden) data. Now suppose we are told only the combined number h of students who obtained X1 or X2, together with c and d:

    X1 + X2 : h students
    X3      : c students
    X4      : d students

The counts a and b are hidden, so μ cannot be estimated directly. EM alternates between 2 steps. In the E-step, given the current estimate of μ, the expected values of the hidden counts are

    a = h · (1/2) / (1/2 + μ),      b = h · μ / (1/2 + μ)

In the M-step, μ is re-estimated from the expected counts a, b using the complete-data formula:

    μ = (b + c) / ( 6(b + c + d) )

The 2 steps are repeated until μ converges.
The EM algorithm for a mixture of Gaussians follows the same pattern.

Step 1: Initialize the parameters:

    λ_0 = { μ_1, μ_2, ..., μ_k, p_1, p_2, ..., p_k }

where the μ_i are the component means and the p_i the mixing weights.

Step 2 (E-step): compute the probability that each observation X_k belongs to each component j, given the current parameters λ_t:

    p(j | X_k, λ_t) = p(X_k | j, λ_t) · p(j | λ_t) / p(X_k | λ_t)
                    = p(X_k | μ_j, σ_j²) · p_j / Σ_i p(X_k | μ_i, σ_i²) · p_i

Step 3 (M-step): re-estimate the parameters from the expected memberships:

    μ_i(t+1) = Σ_k p(i | X_k, λ_t) · X_k / Σ_k p(i | X_k, λ_t)

    p_i(t+1) = (1/N) Σ_k p(i | X_k, λ_t)

Steps 2 and 3 are repeated until the parameters converge.
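A minimal sketch of these two steps (one-dimensional data, two components with known equal variance σ = 1; a simplification for illustration, not from the original text):

    import math

    def normal_pdf(x, mu, sigma=1.0):
        return (math.exp(-0.5 * ((x - mu) / sigma) ** 2) /
                (sigma * math.sqrt(2 * math.pi)))

    def em_two_gaussians(xs, mu=(0.0, 1.0), p=(0.5, 0.5), iters=50):
        mu, p = list(mu), list(p)
        for _ in range(iters):
            # E-step: responsibility p(j | x_k) of each component for each point
            resp = []
            for x in xs:
                w = [p[j] * normal_pdf(x, mu[j]) for j in range(2)]
                s = sum(w)
                resp.append([v / s for v in w])
            # M-step: re-estimate the means and the mixing weights
            for j in range(2):
                r = [resp[k][j] for k in range(len(xs))]
                mu[j] = sum(r[k] * xs[k] for k in range(len(xs))) / sum(r)
                p[j] = sum(r) / len(xs)
        return mu, p

    xs = [-2.1, -1.9, -2.0, 3.9, 4.1, 4.0]
    print(em_two_gaussians(xs))   # means near -2 and 4, weights near 0.5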

(5) Genetic Algorithm

The genetic algorithm, introduced by John Holland in 1975, is a search and optimization technique inspired by natural selection. It has been applied to many optimization problems, and in information retrieval it can be used to find the group of documents that best matches a query. A genetic algorithm has 5 main components:

a. an encoding of candidate solutions as chromosomes (typically bit strings);
b. an initial population of chromosomes;
c. a fitness function that scores each chromosome;
d. genetic operators, namely selection, crossover and mutation, that breed new chromosomes;
e. parameter settings, such as the population size and the probabilities of crossover and mutation.

Example 3.5 illustrates the idea on a small document collection.

Example 3.5: Suppose the collection contains 5 documents, each described by a set of keywords:

DOC1 = {Database, Query, Data Retrieval, Computer Network, DBMS}
DOC2 = {Artificial Intelligence, Internet, Indexing, Natural Language Processing}
DOC3 = {Database, Expert System, Information Retrieval System, Multimedia}
DOC4 = {Fuzzy Logic, Neural Network, Computer Network}
DOC5 = {Object-Oriented, DBMS, Query, Indexing}

Altogether there are 16 distinct keywords:

Artificial Intelligence, Computer Network, Data Retrieval, Database, DBMS, Expert System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural Language Processing, Neural Network, Object-Oriented, Query, Relational Database

Each document is encoded as a 16-bit chromosome, one bit per keyword (1 = the keyword is present):

DOC1 = 0110100000000011
DOC2 = 1000000101010000
DOC3 = 0001010010100000
DOC4 = 0100001000001000
DOC5 = 0000100100000110


The query is likewise encoded as a 16-bit vector over the same keywords. The fitness of each chromosome is its similarity to the query, measured by a coefficient such as the Dice coefficient, the Cosine coefficient or the Jaccard coefficient, which gives a value between 0.0 and 1.0, with 1.0 meaning a perfect match. Following the principle of survival of the fittest, chromosomes with higher fitness values are preferentially selected to produce the next generation.

Crossover takes 2 parent chromosomes, cuts them at a random position, and exchanges their tails. For example, crossing the parents

    101111110011101
    100110011110000

after the eighth bit produces the offspring

    101111111110000
    100110010011101

Mutation flips individual bits at random with a small probability. For example, flipping bit 10 of

    101111110011101

produces

    101111110111101

Selection, crossover and mutation are repeated generation after generation until the fitness of the best chromosome exceeds a chosen threshold (or a maximum number of generations is reached); the fittest chromosomes then identify the documents, or groups of documents, that best match the query.
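To make the loop concrete, here is a small sketch (the fitness function, query bits and parameter values are hypothetical choices for illustration, not taken from the text):

    import random

    QUERY = "0110100000000010"                 # hypothetical query bits

    def fitness(chrom):
        # Jaccard coefficient between chromosome and query bit sets
        x = {i for i, b in enumerate(chrom) if b == "1"}
        y = {i for i, b in enumerate(QUERY) if b == "1"}
        return len(x & y) / len(x | y) if x | y else 0.0

    def crossover(p1, p2):
        cut = random.randint(1, len(p1) - 1)   # exchange tails after `cut`
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

    def mutate(chrom, rate=0.05):
        # flip each bit independently with probability `rate`
        return "".join(b if random.random() > rate else "10"[int(b)]
                       for b in chrom)

    def ga(pop, generations=50):
        for _ in range(generations):
            pop = sorted(pop, key=fitness, reverse=True)
            survivors = pop[: len(pop) // 2]   # survival of the fittest
            children = []
            while len(survivors) + len(children) < len(pop):
                c1, c2 = crossover(*random.sample(survivors, 2))
                children += [mutate(c1), mutate(c2)]
            pop = survivors + children[: len(pop) - len(survivors)]
        return max(pop, key=fitness)

    population = ["".join(random.choice("01") for _ in range(16))
                  for _ in range(20)]
    best = ga(population)
    print(best, fitness(best))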


Review Questions

1. Compare the Simple coefficient, Dice's coefficient, Jaccard's coefficient, the Cosine coefficient and the Overlap coefficient as measures of association.
2. Define the dissimilarity measure related to Jaccard's coefficient.
3. Show how Dice's coefficient gives rise to a dissimilarity coefficient.
4. Distinguish between monothetic and polythetic classifications, giving an example of each.
5. Distinguish between an exclusive class and an overlapping class.
6. What is an ordered classification? Give an example of its use.
7. Why is clustering useful in information retrieval?
8. Describe the main types of clustering algorithms.
9. …
10. Describe the graph-theoretic method of clustering.
11. Describe the single-link method.
12. Describe Rocchio's clustering algorithm.
13. Describe the K-means algorithm and its properties.
14. Describe the PAM algorithm.
15. Describe the Fuzzy C-means (FCM) algorithm.
16. Work through an example of single-linkage clustering.
17. Explain how a genetic algorithm can be applied to clustering.

References

- Course text for CS337 (in Thai), 2535 [1992].
- Technical Journal (in Thai), Vol. 11, No. 7, March-June 2000.
- Sanpawat Kantabutra and Alva L. Couch, "Parallel K-means Clustering Algorithm on NOWs", Department of Computer Science, Tufts University, Medford, Massachusetts. www.nectec.or.th/NTJ/n06/papers/No6_short_1.pdf
- Paper on fuzzy document clustering (in Thai), The Joint Conference on Computer Science and Software Engineering (JCSSE 2005), November 17-18, 2005. www.cs.buu.ac.th/~deptdoc/proceedings/JCSSE2005/pdf/a-315.pdf
- Paper on Spherical K-Means document clustering (in Thai), Intelligent Information Retrieval and Database Laboratory, Department of Computer Science, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand.
