Professional Documents
Culture Documents
Bai 5
Bai 5
D LIU &
NG DNG
GV: TS. NGUYN HONG T ANH
BI 5
GOM NHM
D LIU
NI DUNG
1. Gii
2.
3.
4.
5.
thiu
nh gi m hnh
GII THIU
1. Gom nhm l g:
Nhm/cm/lp: tp cc i tng DL
Gom nhm l qu trnh nhm cc i tng thnh
nhng nhm/cm/lp c ngha. Cc i tng
trong cng mt nhm c nhiu tnh cht chung v
c nhng tnh cht khc vi cc i tng
nhm khc.
Cho CSDL D={t1,t2,,tn} v s nguyn k, gom
nhm l bi ton xc nh nh x f : Dg{1,,k}
sao cho mi ti c gn vo mt nhm (lp) Kj,
1jk.
Traditional Clustering
Goal is to identify similar
groups of objects
Groups (clusters, new
classes) are discovered
Dataset consists of attributes
Unsupervised (class label
has to be learned)
Important: Similarity
assessment which derives a
distance function is critical,
because clusters are
discovered based on
distances/density.
Classification
Pre-defined classes
Datasets consist of attributes
and a class labels
Supervised (class label is
known)
Goal is to predict classes
from the object
properties/attribute values
Classifiers are learnt from
sets of classified examples
Important: classifiers need to
have a high accuracy
GII THIU
Kt qu gom nhm:
2 nhm/cm
6 nhm/cm
4 nhm/cm
GII THIU
ng dng:
Nhn dng
Phn tch d liu khng gian
X l nh
Khoa hc kinh t (c bit nghin cu tip
th)
WWW
Gom nhm ti liu lin quan d tm kim
Gom d liu Weblog thnh nhm tm cc
nhm c cng kiu truy cp
GII THIU
V d:
Discovered Clusters
Gom
gen v
protein c cng
chc nng
Nhm cc c
phiu
c
xu
hng gi dao
ng ging nhau
Nhm cc vng
theo lng ma
c
1
2
3
4
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
MBNA-Corp-DOWN,Morgan-Stanley-DOWN
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP
Industry Group
Technology1-DOWN
Technology2-DOWN
Financial-DOWN
Oil-UP
10
GII THIU
V d:
Tip th: pht hin cc nhm khch hng
trong CSDL khch hng xy dng
chng trnh tip th c mc tiu
t ai: xc nh cc vng t trng trt
ging nhau trong CSDL quan st tri t
Bo him: tm nhm khch hng c kh
nng hay gp tai nn
Nghin cu ng t: gom nhm cc tm
chn ng t quan st c theo vt nt
lc a
11
V D: Gom nhm
14
GII THIU
Cch biu din
cc nhm/cm
Phn chia bng
cc ng ranh
gii
Cc khi cu
I1
Theo xc sut
I2
S hnh cy
In
1 2 3
0.5 0.2 0.3
15
GII THIU
2. Tiu chun gom nhm:
Phng php gom nhm tt l phng php s to cc
nhm c cht lng:
S ging nhau gia i tng trong cng mt nhm cao.
Gia cc nhm th s ging nhau thp.
Khong cch bn
trong nhm l
min
Khong cch
gia cc
nhm l max
16
GII THIU
2. Tiu chun gom nhm (tt):
Cht lng ca kt qu gom nhm da
trn 2 yu t:
o s ging nhau dng trong phng
php gom nhm
Thut ton gom nhm.
Mt s o cht lng:
Bnh phng sai (Sum of Squared Error SSE)
Entropy
17
GII THIU
3. o khong cch:
o khong cch thng dng xc nh s
khc nhau hay ging nhau gia hai i tng.
Khong cch Minkowski:
d (i, j) q (| x x |q | x x |q ... | x x |q )
i1 j1
i2 j 2
ip jp
vi i = (xi1, xi2, , xip) v j = (xj1, xj2, , xjp) : hai i
d (i, j) | x x | | x x | ... | x x |
i1 j1 i2 j2
ip jp
18
GII THIU
3. o khong cch (tt)
Nu q=2, d l khong cch Euclide:
d (i, j) (| x x |2 | x x |2 ... | x x |2 )
i1 j1
i2 j 2
ip jp
d(i,j) 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) d(i,k) + d(k,j)
19
GII THIU
4. Cc kiu d liu
Cc kiu d liu khc nhau yu cu
o s khc nhau cng khc nhau.
10
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality
GII THIU
6. Mt s phng php gom nhm:
Phng php phn hoch
Phng php phn cp
Phng php da trn mt
Phng php da trn li
Phng php da trn m hnh
Phng php da trn tp ph bin
Phng php da trn rng buc
Phng php da trn lin kt
22
11
Model-based:
A model is hypothesized for each of the clusters and tries to find
the best fit of that model to each other
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus
NI DUNG
1.
Gii thiu
2.
3.
4.
5.
12
26
13
V d: K-means, Bc 1
k1
Y
Chn
3
trung tm
nhm bt
k : k1, k2,
k3
k2
k3
X
28
14
V d: K-means, Bc 2
k1
Y
Gn tng
im vo
nhm c
trung tm
nhm gn
nht
k2
k3
X
29
V d: K-means, Bc 3
k1
k1
Y
Di
chuyn
trung
tm
tng nhm
v
im
trung bnh
mi
ca
nhm
k2
k3
k2
k3
X
30
15
V d: K-means, Bc 4
k1
Y
Gn li cc
im cho gn
vi cc trung
tm nhm mi
k3
k2
Q: Cc im
no c gn
li ?
X
31
V d: K-means, Bc 4
k1
Y
3 im
c
gn li
k3
k2
X
32
16
V d: K-means, Bc 4
k1
Y
Tnh li
trung
bnh
nhm
k3
k2
X
33
V d: K-means, Bc 5
k1
Y
Di
chuyn
trung
tm
nhm v gi
tr TB nhm
mi,
k2
k3
X
34
17
V d: k-mean
Customer
Age Income
(K)
John
0.55
0.175
Rachel
0.34
0.25
Hannah
Tom
0.93
0.85
nellie
0.39
0.2
David
0.58
0.25
Income
Age
V d: k-mean
Bc 1: Chn Nellie v David l trung tm nhm/cm A
v B
Customer
Distance
from
David
Distance
from
Nellie
John
0.08
0.161
Rachel
0.24
0.07
Hannah
0.859
1.006
Tom
0.694
0.845
Nellie
David
Income
B
A
Age
18
V d: k-mean
B2: Tnh cc trung tm mi ca nhm/cm A v B
Trung tm ca Cluster A:
Age= 0.37,
Income=0.23
Trung tm ca Cluster B:
Age= 0.77,
Income=0.57
Income
Da trn cc trung tm cm
mi, gn cc khch hng
vo cc nhm/cm.
Age
V d: k-mean
Customer
Distance A
Distance B
Income
John
Rachel
Hannah
Tom
Nellie
David
0.19
0.04
1.00
0.45
0.53
0.49
0.84
0.04
0.33
0.53
0.22
0.37
Age
19
V d: k-mean
B3: Tnh cc trung tm mi ca nhm/cm A v B
Trung tm ca Cluster A:
Age= 0.47, Income=0.22
Trung tm ca Cluster B:
Age = 0.97, Income= 0.93
Vi cc trung tm nhm
mi ny, thnh phn ca
cc nhm khng thay i.
Thut ton dng.
Income
B
Age
20
21
10
10
Arbitrary
choose k
object as
initial
medoids
6
5
4
3
2
6
5
4
3
2
1
0
0
10
10
Assign
each
remainin
g object
to
nearest
medoids
7
6
5
4
3
2
1
0
0
K=2
10
Until no
change
If quality is
improved.
10
10
Swapping O
and Oramdom
Randomly select a
nonmedoid object,Oramdom
Total Cost = 26
Do loop
Compute
total cost of
swapping
8
7
6
9
8
7
6
0
0
10
10
43
Pht trin :
22
NI DUNG
2.
Gii thiu
Phng php phn hoch
3.
1.
4.
5.
nh gi m hnh
45
6
0.2
4
3
0.15
2
5
2
0.1
0.05
3
0
1
46
23
Thut ton:
AGNES, DIANA
BIRCH (Balance Iterative Reducing &
Hierachies)
CURE (Clustering Using Representative)
ROCK (Robust Clustering using linKs)
CHAMELEON
Clustering
using
47
24
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj)
= dist(Mi, Mj)
Step
1
Step
2
Step
3
Step
4
(agglomerative)
ab
abcde
cde
de
e
Step
4
Tch t
Step
3
Step
2
Step
1
Step
0
Chia nh
(divisive)
50
25
10
10
0
0
10
1
0
0
10
5110
10
10
0
0
10
10
10
52
26
im Ta x Ta y
P1
0.40
0.53
P2
0.22
0.38
P3
0.353
0.32
P4
0.26
0.19
P5
0.08
0.41
P6
0.45
0.30
53
P2
P3
P4
P5
P6
P1
P2
P3
P4
P5
P6
27
=min(dist(3,1),dist(6,1))
55
28
3
5
0.2
0.15
0.1
0.05
4
4
Cc nhm
(Single Link)
S hnh cy
58
29
Bi tp c nhn s 6
Thi gian: 20
Cho tp DL gm 6 im trong khng gian 2 chiu
vi ma trn khong cch nh sau.
S dng thut ton AGNES vi Complete link
gom nhm. V s hnh cy.
Da trn s hnh cy, xc nh 3 nhm thu
c.
60
30
Bi tp c nhn
Cho ma trn khong cch nh sau:
P1
P2
P3
P4
P5
P6
P1
P2
P3
P4
P5
P6
NI DUNG
2.
Gii thiu
Phng php phn hoch
3.
1.
4.
5.
62
31
32
MinPts = 5
65
MinPts = 5
Eps = 1 cm
66
33
MinPts = 5
Eps = 1 cm
p1
q
67
MinPts = 5
Eps = 1 cm
o
68
34
69
35
Cc i tng ban u
Cc loi i tng :
nng ct , bin v nhiu
71
Cc i tng ban u
Cc nhm
Resistant to Noise
Supports Outliers
Can handle clusters of different shapes and sizes
72
36
(MinPts=4, Eps=9.75).
Cc i tng ban u
Gp vn vi
Varying densities
High-dimensional data
73
(MinPts=4, Eps=9.92)
Nhc im:
Gp vn khi cc nhm c mt khc
nhau.
phc tp cao i vi d liu nhiu chiu.
Ph thuc vo gi tr Eps, MinPts.
74
37
V D: DBSCAN
75
NI DUNG
2.
Gii thiu
Phng php phn hoch
3.
1.
4.
5.
nh gi m hnh
76
38
V D: KT QU GOM NHM
1
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
DL BAN
U
1
0.9
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
DBSCAN
0.2
0.4
0.6
0.8
0.2
0.4
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
K-means
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0
0.6
0.8
Complete
Link
0.1
0
0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
77
39
Xc nh s nhm/cm
Empirical method
# of clusters n/2 for a dataset of n points
Elbow method
Use the turning point in the curve of sum of within cluster
variance w.r.t the # of clusters
Cross validation method
Divide a given data set into m parts
Use m 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
E.g., For each point in the test set, find the closest centroid,
and use the sum of squared distance between all points in the
test set and the closest centroids to measure how well the
model fits the test set
For any k > 0, repeat it m times, compare the overall quality
measure w.r.t. different ks, and find # of clusters that fits the
79
data the best
40
7
6
SSE
2
0
5
4
-2
3
2
-4
-6
10
15
10
15
20
25
30
Vi x l mt im DL trong nhm Ci v mi l
im i din cho nhm (im TB nhm
hoc im trung tm nhm), K-s nhm.
dist (): khong cch Euclide
41
pij
mij
mj
Trong :
mj l s mu ca cluster j;
mij l s mu ca class i thuc cluster j;
mj
j 1
ej
Trong :
L l s lp (classes);
K l s nhm(clusters);
mj l s mu ca cluster j;
m l tng s mu
42
mj
j 1
purity
purity j
Trong :
L l s lp (classes);
K l s nhm(clusters);
mj l s mu ca cluster j;
m l tng s mu
Entertainm
ent
Finan
cial
40
280
10
5
6
Tng
class
Tng
cluster
Natio
nal
Sports
506
96
27
677
1.2270 0.7474
29
39
361
1.1472 0.7756
671
685
0.1813 0.9796
162
119
73
369
1.7487 0.4390
331
22
70
13
23
464
1.3976 0.7134
358
12
212
48
13
648
1.5523 0.5525
354
555
341
943
273
738
3204
1.1450 0.7203
Foreign Metro
Entropy
Purity
V d pFinancial,2=7/361=0.0194
43
TM TT
Bi ton gom nhm l nhm cc i
tng da trn s ging nhau ca chng
v c ng dng rng ri.
o s ging nhau c th khc nhau i
vi tng kiu d liu .
Cc thut ton gom nhm chnh chia
thnh : phn hoch, phn cp, da trn
mt , da trn li v da trn m hnh
Bi ton xc nh c bit l mt ng dng
quan trng ca phn tch nhm.
nh gi cht lng nhm l lnh vc cn
tp trung nghin cu.
87
BI TP nhm
Thi gian: 30
Cho DL bn ( chun
ho) v k = 2.
S dng thut ton kmeans (k=2) xc nh
cc nhm.
Tnh o SSE cho
tng nhm vng lp
u tin v cui cng.
Customer
Hng
Mai
Lan
Thy
Tun
Minh
Thin
Ngc
Age
Income
No.
cards
0.05
0.25
0.58
0.00
0.33
1.00
0.42
0.12
0.12
0.06
0.41
0.08
0.15
1.00
0.76
0.06
0.33
0.00
0.33
0.67
0.41
0.11
0.00
1.00
44
http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
89
BI TP
1. Cho 2 i tng: (22,1,42,10) v (20,0,36,8)
a)
b)
c)
45
BI TP
Ta c 8 i tng sau (biu din thng qua ta (x,y)): A1(2,10),
A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
3.
a)
b)
c)
BI TP
Customer
Age
Income No.
(K)
cards
Tho
35
37
Hng
25
51
Gia
29
44
Thnh
45
100
Thy
20
30
33
57
Minh
65
200
Nhung
54
142
Nht
58
175
Tng
25
40
92
46
Bi tp
5. Cho tp DL gm 5
im trong khng
gian 2 chiu vi ma
trn khong cch
cho. S dng thut
ton AGNES ln lt
vi Single Link v
Complete link gom
nhm. V s hnh
cy.
Xc nh 3 nhm thu
c t s hnh
cy theo c 2 cch.
P1
P2
P3
P4
P5
P1
P2
P3
P4
P5
93
BI TP
6. Cho tp DL mt chiu: {6, 12, 18, 24, 30, 42, 48}
a)
b)
c)
m1 = 18, m2 = 45
m1 = 15, m2 = 40
47
BI TP
7. Cho ma trn hn lon sau. Hy tnh o
entropy v purity.
Enterta Finan
Natio Spor
Cluster/Class inment
cial Foreign Metro nal
ts Tng cluster
1
11
676
693
27
89
333
827
253
33
1562
326
465
105
16
29
949
Tng class
354
555
341
943
273
738
3204
95
BI TP
8. Cho CSDL bn di:
a) Chun ha CSDL v gom cm vi k-mean(k = 2)
(khng dng ct response). Tnh o SSE cho kt qu
gom nhm.
b) To ma trn hn lon (so vi ct response) v tnh o
entropy v purity cho 2 nhm to ra t cu a).
96
48
BI TP
Customer
Age
Income No.
(K)
cards
Response
Lm
35
35
Yes
Hng
22
50
No
Mai
28
40
Yes
Lan
45
100
No
Thy
20
30
Yes
Tun
34
55
No
Minh
63
200
No
Vn
55
140
No
Thin
59
170
No
Ngc
25
40
Yes
49