You are on page 1of 49

KHAI THC

D LIU &
NG DNG
GV: TS. NGUYN HONG T ANH

BI 5

GOM NHM
D LIU

NI DUNG
1. Gii
2.
3.
4.

5.

thiu

Phng php phn hoch


Phng php phn cp
Phng php da trn mt

nh gi m hnh

GII THIU
1. Gom nhm l g:
Nhm/cm/lp: tp cc i tng DL
Gom nhm l qu trnh nhm cc i tng thnh
nhng nhm/cm/lp c ngha. Cc i tng
trong cng mt nhm c nhiu tnh cht chung v
c nhng tnh cht khc vi cc i tng
nhm khc.
Cho CSDL D={t1,t2,,tn} v s nguyn k, gom
nhm l bi ton xc nh nh x f : Dg{1,,k}
sao cho mi ti c gn vo mt nhm (lp) Kj,
1jk.

Khng ging bi ton phn lp, cc


nhm/cm/lp khng c bit trc.
4

Clustering vs. Classification

Traditional Clustering
Goal is to identify similar
groups of objects
Groups (clusters, new
classes) are discovered
Dataset consists of attributes
Unsupervised (class label
has to be learned)
Important: Similarity
assessment which derives a
distance function is critical,
because clusters are
discovered based on
distances/density.

Classification

Pre-defined classes
Datasets consist of attributes
and a class labels
Supervised (class label is
known)
Goal is to predict classes
from the object
properties/attribute values
Classifiers are learnt from
sets of classified examples
Important: classifiers need to
have a high accuracy

PHN LP <> GOM NHM


Phn lp: hc c gim st (Supervised learning)
Tm phng php d on lp ca mu mi t
cc mu gn nhn lp (phn lp) trc

PHN LP <> GOM NHM


Gom nhm: hc khng gim st (Unsupervised
learning )
Tm cc nhm/cm/lp t nhin ca cc mu
cha c gn nhn

GII THIU
Kt qu gom nhm:

C bao nhiu nhm /cm?

2 nhm/cm

6 nhm/cm

4 nhm/cm

GII THIU
ng dng:
Nhn dng
Phn tch d liu khng gian
X l nh
Khoa hc kinh t (c bit nghin cu tip
th)
WWW
Gom nhm ti liu lin quan d tm kim
Gom d liu Weblog thnh nhm tm cc
nhm c cng kiu truy cp

Gim kch thc d liu ln

GII THIU
V d:
Discovered Clusters

Gom
gen v
protein c cng
chc nng
Nhm cc c
phiu
c
xu
hng gi dao
ng ging nhau
Nhm cc vng
theo lng ma
c

1
2
3
4

Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
MBNA-Corp-DOWN,Morgan-Stanley-DOWN
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP

Industry Group

Technology1-DOWN

Technology2-DOWN

Financial-DOWN
Oil-UP

10

GII THIU
V d:
Tip th: pht hin cc nhm khch hng
trong CSDL khch hng xy dng
chng trnh tip th c mc tiu
t ai: xc nh cc vng t trng trt
ging nhau trong CSDL quan st tri t
Bo him: tm nhm khch hng c kh
nng hay gp tai nn
Nghin cu ng t: gom nhm cc tm
chn ng t quan st c theo vt nt
lc a
11

V D: Gom nhm cc ngi nh

Da trn khong cch a l


12

V D: Gom nhm cc ngi nh

Da trn kch thc


13

V D: Gom nhm

14

GII THIU
Cch biu din
cc nhm/cm
Phn chia bng
cc ng ranh
gii
Cc khi cu
I1
Theo xc sut
I2

S hnh cy
In

1 2 3
0.5 0.2 0.3

15

GII THIU
2. Tiu chun gom nhm:
Phng php gom nhm tt l phng php s to cc
nhm c cht lng:
S ging nhau gia i tng trong cng mt nhm cao.
Gia cc nhm th s ging nhau thp.
Khong cch bn
trong nhm l
min

Khong cch
gia cc
nhm l max

16

GII THIU
2. Tiu chun gom nhm (tt):
Cht lng ca kt qu gom nhm da
trn 2 yu t:
o s ging nhau dng trong phng
php gom nhm
Thut ton gom nhm.

Mt s o cht lng:
Bnh phng sai (Sum of Squared Error SSE)
Entropy
17

GII THIU
3. o khong cch:
o khong cch thng dng xc nh s
khc nhau hay ging nhau gia hai i tng.
Khong cch Minkowski:

d (i, j) q (| x x |q | x x |q ... | x x |q )
i1 j1
i2 j 2
ip jp
vi i = (xi1, xi2, , xip) v j = (xj1, xj2, , xjp) : hai i

tng p-chiu v q l s nguyn dng

Nu q=1, d l khong cch Manhattan:

d (i, j) | x x | | x x | ... | x x |
i1 j1 i2 j2
ip jp

18

GII THIU
3. o khong cch (tt)
Nu q=2, d l khong cch Euclide:

d (i, j) (| x x |2 | x x |2 ... | x x |2 )
i1 j1
i2 j 2
ip jp

Tnh cht ca o khong cch

d(i,j) 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) d(i,k) + d(k,j)
19

GII THIU
4. Cc kiu d liu
Cc kiu d liu khc nhau yu cu
o s khc nhau cng khc nhau.

Cc bin t l theo khong: Khong


cch Euclide

Cc bin nh phn: h s so khp,


h s Jaccard

Cc bin tn, th t, t l: khong


cch Minkowski

Cc bin dng hn hp: cng thc


trng lng
20

10

5. Requirements and Challenges

Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes
Numerical, binary, categorical, ordinal, linked, and mixture of
these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
Ability to deal with noisy data
Incremental clustering and insensitivity to input order
High dimensionality

GII THIU
6. Mt s phng php gom nhm:
Phng php phn hoch
Phng php phn cp
Phng php da trn mt
Phng php da trn li
Phng php da trn m hnh
Phng php da trn tp ph bin
Phng php da trn rng buc
Phng php da trn lin kt

22

11

6. Phng php gom nhm

Model-based:
A model is hypothesized for each of the clusters and tries to find
the best fit of that model to each other
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: p-Cluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
Massive links can be used to cluster objects: SimRank, LinkClus

NI DUNG
1.

Gii thiu

2.

Phng php phn hoch

3.
4.
5.

Phng php phn cp


Phng php da trn mt
nh gi m hnh
24

12

PHNG PHP PHN HOCH


1. Khi nim c bn:
Phng php phn hoch : xy dng k (k<n) phn
hoch ca CSDL D gm n i tng. Mi phn hoch
1 nhm/cm
Cho s k, cn tm k nhm tha mn tiu chun phn
hoch chn ( v d o bnh phng sai - SSE
nh nht).
Biu din mi nhm bng gi tr trung bnh ca d
liu trong nhm : thut ton K-means (1967)
Biu din nhm bng mt i tng nm gn
trung tm ca nhm: thut ton k-medoids, PAM
(1987)
25

PHNG PHP PHN HOCH


1. Khi nim c bn (tt):
Cng thc tnh Bnh phng sai ( Sum of Squared
K
Error - SSE)

SSE dist 2 (mi , x )


i 1 xCi

Vi x l mt im DL trong nhm Ci v mi l im i din cho


nhm (im TB nhm hoc im trung tm nhm), K-s
nhm. dist (): khong cch Euclide

V d : ta c 2 nhm/cm vi cc trung tm tng ng


m1=3, m2=4
K1={2,3}, K2={4,10,12,20,30,11,25}

SSE = 12+0+0+62+82+162+262+72+212 =1523

26

13

PHNG PHP PHN HOCH


2. Thut ton k-means:
Cho s k, mi nhm c biu din bng gi tr TB ca DL
trong nhm

B1: Chn ngu nhin k i tng nh l nhng trung tm


ca cc nhm .

B2: Gn tng i tng cn li vo nhm c trung tm


nhm gn n nht (da trn o khong cch Euclide)

B3: Tnh li gi tr trung tm ca tng nhm

Di chuyn trung tm nhm v = gi tr TB mi ca nhm

Cho nhm Ki={ti1,ti2,,tim}, gi tr trung bnh ca nhm l


mi = (1/m)(ti1 + + tim)

B4: Nu cc trung tm nhm khng c g thay i th


dng, ngc li quay li B2.
27

V d: K-means, Bc 1
k1
Y
Chn
3
trung tm
nhm bt
k : k1, k2,
k3

k2

k3
X
28

14

V d: K-means, Bc 2

k1
Y
Gn tng
im vo
nhm c
trung tm
nhm gn
nht

k2

k3
X
29

V d: K-means, Bc 3

k1

k1

Y
Di
chuyn
trung
tm
tng nhm
v
im
trung bnh
mi
ca
nhm

k2
k3

k2
k3
X

30

15

V d: K-means, Bc 4

k1
Y
Gn li cc
im cho gn
vi cc trung
tm nhm mi

k3

k2

Q: Cc im
no c gn
li ?

X
31

V d: K-means, Bc 4

k1
Y

3 im
c
gn li

k3

k2

X
32

16

V d: K-means, Bc 4

k1
Y

Tnh li
trung
bnh
nhm

k3

k2

X
33

V d: K-means, Bc 5
k1
Y

Di
chuyn
trung
tm
nhm v gi
tr TB nhm
mi,

k2
k3

X
34

17

V d: k-mean
Customer

Age Income
(K)

John

0.55

0.175

Rachel

0.34

0.25

Hannah

Tom

0.93

0.85

nellie

0.39

0.2

David

0.58

0.25

Income

Age

V d: k-mean
Bc 1: Chn Nellie v David l trung tm nhm/cm A
v B
Customer

Distance
from
David

Distance
from
Nellie

John

0.08

0.161

Rachel

0.24

0.07

Hannah

0.859

1.006

Tom

0.694

0.845

Nellie
David

Income

B
A

Age

18

V d: k-mean
B2: Tnh cc trung tm mi ca nhm/cm A v B
Trung tm ca Cluster A:
Age= 0.37,
Income=0.23
Trung tm ca Cluster B:
Age= 0.77,
Income=0.57

Income

Da trn cc trung tm cm
mi, gn cc khch hng
vo cc nhm/cm.

Age

V d: k-mean
Customer

Distance A

Distance B

Income

John
Rachel
Hannah
Tom
Nellie
David

0.19
0.04
1.00

0.45
0.53
0.49

0.84
0.04

0.33
0.53

0.22

0.37

Age

19

V d: k-mean
B3: Tnh cc trung tm mi ca nhm/cm A v B
Trung tm ca Cluster A:
Age= 0.47, Income=0.22
Trung tm ca Cluster B:
Age = 0.97, Income= 0.93

Vi cc trung tm nhm
mi ny, thnh phn ca
cc nhm khng thay i.
Thut ton dng.

Income
B

Age

Thut ton K-means


u im:
n gin, d hiu, tng i hiu qu.
phc tp: O(tkn), Vi n l # objects, k l
# clusters, v t is # iterations (k, t << n). So
snh vi PAM: O(k(n-k)2 ), CLARA: O(ks2 +
k(n-k))
Cc i tng t ng gn vo cc nhm.
Thng t c ti u cc b.
40

20

Thut ton K-means


Nhc im:

Thuc tnh phi s? TT k-medoids


Cn xc nh s nhm (k) trc.
Tt c cc i tng phi gn vo cc
nhm.
Ph thuc vo vic chn cc nhm u tin.
Gp vn khi cc nhm c kch thc,
mt khc nhau hoc hnh dng khng
phi l hnh cu.
Nhy cm vi DL nhiu, c bit.
41

PHNG PHP PHN HOCH


3. Thut ton k-medoids : PAM
Cho s k, mi nhm c biu din bng mt
trong cc i tng gn trung tm nhm nht

B1: Chn ngu nhin k i tng nh l


nhng trng tm ca cc nhm .

B2 : gn tng i tng cn li vo nhm c


trng tm cm gn n nht .

B3 : Chn mt i tng bt k. Hon i n


vi trng tm ca nhm. Nu cht lng ca
cc nhm tng ln th quay li B2. Ngc li
tip tc thc hin B3 cho n khi khng cn c
thay i.
42

21

PAM: A Typical K-Medoids Algorithm


Total Cost = 20
10

10

10

Arbitrary
choose k
object as
initial
medoids

6
5
4
3
2

6
5
4
3
2
1

0
0

10

10

Assign
each
remainin
g object
to
nearest
medoids

7
6
5
4
3
2
1
0
0

K=2
10

Until no
change

If quality is
improved.

10

10

Swapping O
and Oramdom

Randomly select a
nonmedoid object,Oramdom

Total Cost = 26

Do loop

Compute
total cost of
swapping

8
7
6

9
8
7
6

0
0

10

10

43

PHNG PHP PHN HOCH


3. Thut ton k-medoids (tt):
Nhn xt :

Thut ton PAM hiu qu hn so vi k-means


khi c mt DL nhiu, c bit.

PAM hiu qu vi tp DL nh nhng khng co


dn tt vi tp DL ln.

Pht trin :

CLARA (Clustering LARge Applications): da trn


phng php ly mu (1990)
CLARANS(Clustering LARge Application based upon
RANdomized Search): ly mu ng (1994)
44

22

NI DUNG
2.

Gii thiu
Phng php phn hoch

3.

Phng php phn cp

1.

4.

5.

Phng php da trn mt

nh gi m hnh

45

PHNG PHP PHN CP


1. Gii thiu:
Phng php phn cp: xy dng cc nhm v t
chc nh cy phn cp.
Biu din di dng s hnh cy (dendrogram):
lu li qu trnh gom li / phn chia nhm
5

6
0.2

4
3

0.15

2
5
2

0.1

0.05

3
0

1
46

23

PHNG PHP PHN CP


1. Gii thiu (tt):
Khng cn xc nh trc s nhm k.
Xc nh s nhm cn thit bng vic ct ngang s
hnh cy ti mc thch hp.

Hai loi phn cp chnh:


Tch t: t di ln trn, mi i tng l mt nhm
Chia nh: t trn xung, tt c cc i tng l 1 nhm

Thut ton:
AGNES, DIANA
BIRCH (Balance Iterative Reducing &
Hierachies)
CURE (Clustering Using Representative)
ROCK (Robust Clustering using linKs)
CHAMELEON

Clustering

using

47

PHNG PHP PHN CP


1. Gii thiu (tt):
Cch xc nh khong cch gia cc nhm :
Single Link: khong cch gn nht gia hai i
tng thuc hai nhm

Complete Link: khong cch xa nht gia hai


i tng thuc hai nhm
48

24

Khong cch gia cc cm

Single link: smallest distance between an element in one cluster and


an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)

Complete link: largest distance between an element in one cluster


and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)

Average: avg distance between an element in one cluster and an


element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)

Centroid: distance between the centroids of two clusters, i.e., dist(Ki,


Kj) = dist(Ci, Cj)

Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj)
= dist(Mi, Mj)

Medoid: a chosen, centrally located object in the cluster


49

PHNG PHP PHN CP


Step
0

Step
1

Step
2

Step
3

Step
4

(agglomerative)
ab

abcde

cde

de

e
Step
4

Tch t

Step
3

Step
2

Step
1

Step
0

Chia nh
(divisive)
50

25

PHNG PHP PHN CP


2. Thut ton AGNES (Agglomerative
Nesting):
B1: Mi i tng l mt nhm.
B2: Hp nht cc nhm c khong cch gia
cc nhm l nh nht (Single Link/ Complete
Link)
B3: Nu thu c nhm ton b th dng,
ngc li quay li B2.
10

10

10

0
0

10

1
0
0

10

5110

PHNG PHP PHN CP


3. Thut ton DIANA (Divisive Analysis):
B1: Tt c cc i tng l mt nhm.
B2: Chia nh nhm c khong cch gia nhng
i tng trong nhm l ln nht.
B3: Nu mi nhm ch cha 1 i tng th
dng, ngc li quay li B2.
10

10

10

0
0

10

10

10

52

26

V D: THUT TON AGNES


Cho tp DL gm 6
im trong khng
gian 2 chiu. S
dng thut ton
AGNES vi Single
link (khong cch
gn nht gia 2
im ca 2 nhm
khc nhau) gom
nhm

im Ta x Ta y
P1

0.40

0.53

P2

0.22

0.38

P3

0.353

0.32

P4

0.26

0.19

P5

0.08

0.41

P6

0.45

0.30
53

V D: THUT TON AGNES


Xy dng ma trn khong cch ( o Euclide)
gia cc im
P1

P2

P3

P4

P5

P6

P1

0.00 0.23 0.22 0.37 0.34 0.24

P2

0.23 0.00 0.15 0.19 0.14 0.24

P3

0.22 0.15 0.00 0.16 0.29 0.10

P4

0.37 0.19 0.16 0.00 0.28 0.22

P5

0.34 0.14 0.29 0.28 0.00 0.39

P6

0.24 0.24 0.10 0.22 0.39 0.00


54

27

V D: THUT TON AGNES


S dng Single Link:
1. Bc 1: mi im l mt nhm
2. Bc 2:
Trong s cc nhm gm mt im th dist(3,6) - min
nn gp im P3 v P6 vi nhau thnh mt nhm

Thu c cc nhm: {1}, {4}, {2}, {5}, {3,6},


3. Quay li bc 2 do cha thu c nhm ton b.
4. Tnh khong cch gia cc nhm. V d:
Dist({3,6},{1})

=min(dist(3,1),dist(6,1))

=min(0.22, 0.24) = 0.22

55

V D: THUT TON AGNES


S dng Single Link:
5. dist(2,5) l nh nht nn gp P2 v P5. Ta c cc nhm
sau : {1}, {4}, {3,6}, {2,5}
6. Tnh khong cch gia cc nhm. V d:
dist({3,6},{2,5})
= min(dist(3,2),dist(6,2),dist(3,5),dist(6,5))
= min(0.15, 0.24, 0.28, 0.39) = 0.15
.
dist({3,6},{2,5}) nh nht nn gp cc nhm {3,6},
{2,5} thnh mt nhm.
56

Ta thu c cc nhm: {1},{4},{2,3,5,6}

28

V D : THUT TON AGNES


S dng Single Link:
7. Tip tc:
Tnh khong cch gia cc nhm.
Gp {4} vi {2,3,5,6} thu c cc nhm {1},
{2,3,4,5,6}.
8. Gp 2 nhm ny ta thu c nhm ton b
v thut ton dng.
57

V D: THUT TON AGNES


1

3
5

0.2

0.15

0.1

0.05

4
4

Cc nhm
(Single Link)

S hnh cy
58

29

PHNG PHP PHN CP


4. Nhc im:
Tnh co dn thp: phc tp l O(n2) vi n - s i
tng.
Khng th quay lui v bc trc .
Kh xc nh phng php tch t hay chia nh
Nhy cm vi nhiu, c bit.
Gp vn khi cc nhm c kch thc khc nhau v c
hnh dng li.
C xu hng phn chia cc nhm DL ln.

Tch hp phng php phn cp vi


phng php phn hoch (da trn khong
cch): BIRCH, CURE, CHAMELEON
59

Bi tp c nhn s 6
Thi gian: 20
Cho tp DL gm 6 im trong khng gian 2 chiu
vi ma trn khong cch nh sau.
S dng thut ton AGNES vi Complete link
gom nhm. V s hnh cy.
Da trn s hnh cy, xc nh 3 nhm thu
c.

60

30

Bi tp c nhn
Cho ma trn khong cch nh sau:
P1

P2

P3

P4

P5

P6

P1

0.00 0.23 0.22 0.37 0.34 0.24

P2

0.23 0.00 0.15 0.19 0.14 0.24

P3

0.22 0.15 0.00 0.16 0.29 0.10

P4

0.37 0.19 0.16 0.00 0.28 0.22

P5

0.34 0.14 0.29 0.28 0.00 0.39

P6

0.24 0.24 0.10 0.22 0.39 0.00


61

NI DUNG
2.

Gii thiu
Phng php phn hoch

3.

Phng php phn cp

1.

4.

5.

Phng php da trn


mt
nh gi m hnh

62

31

PHNG PHP DA TRN MT


1. Gii thiu:
Phng php da trn mt (density): m rng
cc nhm cho n khi mt ca i tng d
liu trong vng ln cn vt qua ngng.
Thut ton:
DBSCAN
OPTICS (Ordering Points To Identify the
Clustering structure)
DENCLUE (DENsity- based CLUstEring)
CLIQUE (Clustering In QUEst)
63

PHNG PHP DA TRN MT


2. Khi nim c bn:
Hai tham s do ngi dng xc nh
Eps: Bn knh ln nht ca vng ln cn
MinPts: S nh nht cc i tng trong vng ln cn
bn knh Eps

Mt : s i tng nm trong mt bn knh xc


nh Eps
K hiu Neps(p): tp {q thuc v D | dist(p,q)<=Eps}
i tng nng ct (core point): cha s i tng
nhiu hn ngng MinPts trong bn knh Eps.
i tng bin (border point): s i tng trong
bn knh Eps t hn MinPts nhng vn nm trong
vng ln cn ca i tng nng ct.
i tng nhiu (noise point): i tng khng
thuc c 2 dng trn.
64

32

PHNG PHP DA TRN MT

MinPts = 5
65

PHNG PHP DA TRN MT


2. Khi nim c bn:
Mt i tng p l i tng c mt t
c trc tip (directly density_reachable)
t i tng q theo Eps, MinPts nu :
p thuc Neps(q)
|Neps(q)| >= MinPts
p
q

MinPts = 5
Eps = 1 cm
66

33

PHNG PHP DA TRN MT


Mt t c (density-reachable)
i tng p l i tng c mt c th t
c t i tng q theo Eps, MinPts nu tn
ti mt dy chuyn cc i tng p1,,pn, vi
p1=q, pn=p sao cho pi+1 l i tng c mt
t c trc tip t pi.
p

MinPts = 5
Eps = 1 cm

p1
q
67

PHNG PHP DA TRN MT


Mt lin thng (density-connected):
i tng p l mt lin thng t i
tng q theo Eps, MinPts nu tn ti mt i
tng o sao cho c p, q l i tng c mt
c th t c t o theo Eps, MinPts.

MinPts = 5
Eps = 1 cm

o
68

34

PHNG PHP DA TRN MT

3. Thut ton DBSCAN (Density Based


Spatial Clustering of Application with
Noise)
Mt nhm c xc nh nh tp cc i
tng c mt lin thng ln nht .
Tm cc nhm hnh dng bt k trong
CSDL khng gian c nhiu

69

PHNG PHP DA TRN MT


3. Thut ton DBSCAN (tt):
B1: Chn ngu nhin i tng p
B2: Tm tt c cc i tng c mt
c th t c t p theo Eps, MinPts
B3: Nu p l i tng nng ct th hnh
thnh nhm
Nu p l i tng bin (khng c i tng
no l i tng c mt t c t p) th
DBSCAN xem xt i tng tip theo trong
CSDL

B4: Tip tc cho n khi tt c cc i


tng u c x l
70

35

PHNG PHP DA TRN MT

Cc i tng ban u

Cc loi i tng :
nng ct , bin v nhiu

Eps = 10, MinPts = 4

71

PHNG PHP DA TRN MT


Khi thut ton chy tt

Cc i tng ban u

Cc nhm

Resistant to Noise
Supports Outliers
Can handle clusters of different shapes and sizes

72

36

PHNG PHP DA TRN MT


Khi thut ton chy khng tt

(MinPts=4, Eps=9.75).
Cc i tng ban u

Gp vn vi
Varying densities
High-dimensional data
73

(MinPts=4, Eps=9.92)

PHNG PHP DA TRN MT


4. Nhn xt DBSCAN:
u im:

Lm vic tt vi d liu nhiu


C th gii quyt cc trng hp cc nhm
c hnh dng v knh thc khc nhau

Nhc im:
Gp vn khi cc nhm c mt khc
nhau.
phc tp cao i vi d liu nhiu chiu.
Ph thuc vo gi tr Eps, MinPts.
74

37

V D: DBSCAN

75

NI DUNG
2.

Gii thiu
Phng php phn hoch

3.

Phng php phn cp

1.

4.

5.

Phng php da trn mt

nh gi m hnh
76

38

V D: KT QU GOM NHM
1
0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

DL BAN
U

1
0.9

0.4

0.4

0.3

0.3

0.2

0.2
0.1

0.1
0

DBSCAN

0.2

0.4

0.6

0.8

0.2

0.4

0.9

0.9

0.8

0.8

0.7

0.7

0.6

0.6

0.5

0.5

K-means

0.4

0.4

0.3

0.3

0.2

0.2

0.1
0

0.6

0.8

Complete
Link

0.1
0

0.2

0.4

0.6

0.8

0.2

0.4

0.6

0.8

77

nh gi cht lng nhm


nh gi cht lng nhm l nhim v
kh khn v phc tp nht trong phn
tch nhm.
Cht lng nhm th hin qua:

Xc nh c xu hng gom nhm ca


DL.
So snh kt qu gom nhm vi kt qu/
thng tin bn ngoi c (v d so snh vi
cc nhn lp cho).
nh gi kt qu gom nhm khng dng
thng tin bn ngoi: ch s dng DL.
So snh kt qu ca 2 phng php gom
nhm khc nhau.
Xc nh chnh xc s nhm.

39

Xc nh s nhm/cm

Empirical method
# of clusters n/2 for a dataset of n points
Elbow method
Use the turning point in the curve of sum of within cluster
variance w.r.t the # of clusters
Cross validation method
Divide a given data set into m parts
Use m 1 parts to obtain a clustering model
Use the remaining part to test the quality of the clustering
E.g., For each point in the test set, find the closest centroid,
and use the sum of squared distance between all points in the
test set and the closest centroids to measure how well the
model fits the test set
For any k > 0, repeat it m times, compare the overall quality
measure w.r.t. different ks, and find # of clusters that fits the
79
data the best

nh gi cht lng nhm

o cht lng nhm:

External index: o mc cc nhn lp


tng ng vi cc nhn lp bn ngoi
cung cp sn.
Entropy
Internal Index: o cht lng ca cu trc
nhm khng dng cc thng tin bn ngoi
SSE
Relative Index: dng so snh 2 phng
php gom nhm hoc so snh cc nhm
SSE hoc entropy

40

nh gi cht lng nhm


Internal Index: SSE

SSE thng dng so snh 2


phng php gom nhm hoc 2 nhm
10
9

7
6

SSE

2
0

5
4

-2

3
2

-4

-6

10

15

10

15

20

25

30

nh gi cht lng nhm

Internal Index: SSE


Cng thc tnh Bnh phng sai ( Sum
of Squared Error - SSE)
K

SSE dist 2 (mi , x )


i 1 xCi

Vi x l mt im DL trong nhm Ci v mi l
im i din cho nhm (im TB nhm
hoc im trung tm nhm), K-s nhm.
dist (): khong cch Euclide

41

nh gi cht lng nhm


External

index: Entropy, Purity

i vi mi nhm j, xc nh pij l xc sut


mt mu thuc nhm j(cluster j) thuc v lp
i(class i)

pij

mij
mj

Trong :
mj l s mu ca cluster j;
mij l s mu ca class i thuc cluster j;

nh gi cht lng nhm


External

index: Entropy, Purity

Khi entropy ca cluster j : ej v tng entropy e


L

e j pij log 2 ( pij )


i 1

mj

j 1

ej

Trong :
L l s lp (classes);
K l s nhm(clusters);
mj l s mu ca cluster j;
m l tng s mu

42

nh gi cht lng nhm


External

index: Entropy, Purity

Khi purity ca cluster j : purityj v tng purity


K

mj

j 1

purity

purity j max pij


i

purity j

Trong :
L l s lp (classes);
K l s nhm(clusters);
mj l s mu ca cluster j;
m l tng s mu

nh gi cht lng nhm

V d: Kt qu gom nhm dng k-mean cho tp


DL bi bo LA
Cluster/
Class

Entertainm
ent

Finan
cial

40

280

10

5
6
Tng
class

Tng
cluster

Natio
nal

Sports

506

96

27

677

1.2270 0.7474

29

39

361

1.1472 0.7756

671

685

0.1813 0.9796

162

119

73

369

1.7487 0.4390

331

22

70

13

23

464

1.3976 0.7134

358

12

212

48

13

648

1.5523 0.5525

354

555

341

943

273

738

3204

1.1450 0.7203

Foreign Metro

Entropy

Purity

V d pFinancial,2=7/361=0.0194

43

TM TT
Bi ton gom nhm l nhm cc i
tng da trn s ging nhau ca chng
v c ng dng rng ri.
o s ging nhau c th khc nhau i
vi tng kiu d liu .
Cc thut ton gom nhm chnh chia
thnh : phn hoch, phn cp, da trn
mt , da trn li v da trn m hnh
Bi ton xc nh c bit l mt ng dng
quan trng ca phn tch nhm.
nh gi cht lng nhm l lnh vc cn
tp trung nghin cu.
87

BI TP nhm

Thi gian: 30

Cho DL bn ( chun
ho) v k = 2.
S dng thut ton kmeans (k=2) xc nh
cc nhm.
Tnh o SSE cho
tng nhm vng lp
u tin v cui cng.

Customer
Hng
Mai

Lan
Thy
Tun
Minh
Thin
Ngc

Age

Income

No.
cards

0.05
0.25
0.58
0.00
0.33
1.00
0.42
0.12

0.12
0.06
0.41
0.08
0.15
1.00
0.76
0.06

0.33
0.00
0.33
0.67
0.41
0.11
0.00
1.00

44

TI LIU THAM KHO


J.Han, M.Kamber, Chng 10, 11 Data
mining: Concepts and Techniques, 3rd edition.
2. P.-N. Tan, M. Steinbach, V. Kumar, Chng 8
- Introduction to Data Mining
1.

http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf

89

BI TP
1. Cho 2 i tng: (22,1,42,10) v (20,0,36,8)
a)
b)
c)

Tnh khong cch Euclide gia 2 i tng


Tnh khong cch Mahanttan gia 2 i tng
Tnh khong cch Minkowski gia 2 i tng vi
q=3

2. Th no l gom nhm?. Trnh by chi tit


phng php phn hoch, phn cp. Cho v
d c th tng phng php. So snh u,
khuyt im ca 2 phng php.
90

45

BI TP
Ta c 8 i tng sau (biu din thng qua ta (x,y)): A1(2,10),
A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

3.

S dng khong cch Euclide v gi s gn A1, B1, C1 l


cc trung tm ca cc nhm tng ng. S dng thut
ton k-means ( vi k=3) xc nh:

a)

Ba trung tm ca nhm sau vng lp thi hnh u tin. Tnh


o SSE cho cc nhm.
Ba nhm kt qu cui cng. Tnh o SSE cho cc nhm.

b)

c)

T chn 3 trung tm nhm bt k khng trng vi 8 i


tng cho. S dng k-mean (k=3) xc nh cc
nhm. So snh kt qu vi cu a)
S dng thut ton phn cp ln lt vi Single link v
Complete link xc nh 3 nhm t DL trn. V s
hnh cy tng ng
91

BI TP
Customer

Age

Income No.
(K)
cards

Tho

35

37

Hng

25

51

Gia

29

44

Thnh

45

100

Thy

20

30

33

57

Minh

65

200

Nhung

54

142

Nht

58

175

Tng

25

40

4. Cho CSDL bn:


a) S dng k-mean
gom cm vi k =3. Tnh
o SSE v so snh kt qu.
b) Chun ha CSDL v
gom cm vi k =3. So snh
kt qu vi cu a).

92

46

Bi tp
5. Cho tp DL gm 5
im trong khng
gian 2 chiu vi ma
trn khong cch
cho. S dng thut
ton AGNES ln lt
vi Single Link v
Complete link gom
nhm. V s hnh
cy.
Xc nh 3 nhm thu
c t s hnh
cy theo c 2 cch.

P1

P2

P3

P4

P5

P1

0.00 0.10 0.41 0.55 0.35

P2

0.10 0.00 0.64 0.47 0.98

P3

0.41 0.64 0.00 0.44 0.85

P4

0.55 0.47 0.44 0.00 0.76

P5

0.35 0.98 0.85 0.76 0.00

93

BI TP
6. Cho tp DL mt chiu: {6, 12, 18, 24, 30, 42, 48}
a)

Vi mi tp trung tm nhm sau, hnh thnh 2 nhm


u tin da k-mean(k=2). Tnh o SSE cho tng
tp 2 nhm . So snh kt qu

b)

c)

m1 = 18, m2 = 45
m1 = 15, m2 = 40

Nu tip tc chy thut ton k-mean(k=2) trn tp DL


trn vi cc trung tm nhm cho, c s thay i
nh th no ?
S dng thut ton Agnes vi Single Link xc nh
2 nhm t DL trn. So snh kt qu vi k-mean( k=2
v chn kt qu cho SSE tt nht)
94

47

BI TP
7. Cho ma trn hn lon sau. Hy tnh o
entropy v purity.
Enterta Finan
Natio Spor
Cluster/Class inment
cial Foreign Metro nal
ts Tng cluster
1

11

676

693

27

89

333

827

253

33

1562

326

465

105

16

29

949

Tng class

354

555

341

943

273

738

3204

95

BI TP
8. Cho CSDL bn di:
a) Chun ha CSDL v gom cm vi k-mean(k = 2)
(khng dng ct response). Tnh o SSE cho kt qu
gom nhm.
b) To ma trn hn lon (so vi ct response) v tnh o
entropy v purity cho 2 nhm to ra t cu a).

96

48

BI TP
Customer

Age

Income No.
(K)
cards

Response

Lm

35

35

Yes

Hng

22

50

No

Mai

28

40

Yes

Lan

45

100

No

Thy

20

30

Yes

Tun

34

55

No

Minh

63

200

No

Vn

55

140

No

Thin

59

170

No

Ngc

25

40

Yes

49

You might also like