
Digitized by the Learning Resource Center, Thai Nguyen University http://www.lrc-tnu.edu.vn

THAI NGUYEN UNIVERSITY
FACULTY OF INFORMATION TECHNOLOGY



Nguyễn Trung Sơn




CLUSTERING METHODS AND APPLICATIONS

Major: COMPUTER SCIENCE
Code: 60.48.01



MASTER'S THESIS IN COMPUTER SCIENCE




SCIENTIFIC SUPERVISOR
1. Assoc. Prof. Dr. Vũ Đức Thi






Thái Nguyên 2009




TABLE OF CONTENTS
ACKNOWLEDGMENTS
PREFACE
CHAPTER I: THEORETICAL OVERVIEW OF DATA CLUSTERING
1. Data clustering
1.1 Definition of data clustering
1.2 Some examples of data clustering
2. Some data types
2.1 Categorical data
2.2 Binary data
2.3 Transaction data
2.4 Symbolic data
2.5 Time series
3. Data transformation and standardization
3.1 Data standardization
3.2 Data transformation
3.2.1 Principal component analysis
3.2.2 SVD
3.2.3 The Karhunen-Loève transformation
CHAPTER II: DATA CLUSTERING ALGORITHMS
1. Hierarchical clustering algorithms
1.1 The BIRCH algorithm
1.2 The CURE algorithm
1.3 The AGNES algorithm
1.4 The DIANA algorithm
1.5 The ROCK algorithm
1.6 The Chameleon algorithm
2. Fuzzy clustering algorithms
2.1 The FCM algorithm
2.2 The FCM algorithm
3. Center-based clustering algorithms
3.1 The K-means algorithm
3.2 The PAM algorithm
3.3 The CLARA algorithm
3.4 The CLARANS algorithm
4. Search-based clustering algorithms
4.1 Genetic algorithms (GAs)
4.2 J-means
5. Grid-based clustering algorithms
5.1 STING
5.2 The CLIQUE algorithm
5.3 The WaveCluster algorithm
6. Density-based clustering algorithms
6.1 The DBSCAN algorithm
6.2 The OPTICS algorithm
6.3 The DENCLUE algorithm
7. Model-based clustering algorithms
7.1 The EM algorithm
7.2 The COBWEB algorithm
CHAPTER III: APPLICATIONS OF DATA CLUSTERING
1. Image segmentation
1.1 Definition of image segmentation
1.2 Image segmentation based on data clustering
2. Object and character recognition
2.1 Object recognition
2.2 Character recognition
3. Information retrieval
3.1 Pattern representation
3.2 Similarity measures
3.3 An algorithm for clustering book data
4. Data mining
4.1 Data mining approaches
4.2 Mining large structured data
4.3 Data mining in geological databases
4.4 Summary
CONCLUSION AND DIRECTIONS FOR FURTHER WORK
APPENDIX
REFERENCES

ACKNOWLEDGMENTS


I sincerely thank Assoc. Prof. Dr. Vũ Đức Thi for his dedicated scientific guidance, which helped me complete this graduation thesis.
I also thank the teachers who taught me and passed on their knowledge throughout my studies and research.

STUDENT
NGUYỄN TRUNG SƠN

PREFACE
In recent years the rapid development of information technology has greatly increased the capacity of information systems to collect and store information. At the same time, the fast and widespread computerization of production, business, and many other activities has produced an enormous volume of stored data. Millions of databases are in use in production, business, and management, and many of them are very large, on the order of gigabytes or even terabytes.
This explosion has created an urgent need for new techniques and tools that automatically turn such huge volumes of data into useful knowledge. As a result, data mining has become a topical field of information technology, both worldwide and in Vietnam. Data mining is widely applied in business and everyday life: marketing, finance, banking and insurance, science, health care, security, the internet, and so on. Many large organizations and companies around the world have applied data mining to their production and business activities and have gained substantial benefits.
Data mining techniques are usually divided into two main groups:
- Descriptive data mining: describes the general properties or characteristics of the data in an existing database.
- Predictive data mining: makes predictions based on inference from the current data.
This thesis presents several topics in data clustering, one of the basic techniques of data mining. It is a promising research direction that gives an outline for understanding and exploiting very large databases, discovering useful information hidden in the data, and understanding the practical meaning of the data.
The thesis consists of three chapters and an appendix:
Chapter 1: An overview of the theory of data clustering, data types, and data transformation and standardization.
Chapter 2: An introduction to, analysis of, and assessment of the algorithms used for data clustering.
Chapter 3: Some typical applications of data clustering.
Conclusion: A summary of the topics studied in the thesis and of related issues, with directions for further research.

CHAPTER I:
THEORETICAL OVERVIEW OF DATA CLUSTERING
1. Data clustering
1.1 Definition of data clustering
Data clustering, or simply clustering, also called cluster analysis, segmentation analysis, or taxonomy analysis, is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. In many applications a cluster of data objects can be treated as a single group.
1.2 Some examples of data clustering
1.2.1 Clustering of gene expression data
Clustering is one of the most frequently performed analyses on gene expression data (Yeung et al., 2003; Eisen et al., 1998). Gene expression data are a set of measurements taken from a DNA microarray (also called a DNA chip or gene chip), a glass or plastic slide on which DNA fragments are attached in microscopic rows. Researchers use such chips to screen biological samples and test for the presence of many sequences at once. The DNA fragments attached to the chip are called probes. Each spot on the chip holds thousands of probe molecules with identical sequences. A gene expression data set can be represented by a real-valued matrix:

$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix},$$

where:
- $n$ is the number of genes,
- $d$ is the number of samples or experimental conditions,
- $x_{ij}$ is the measured expression level of gene $i$ in sample $j$.

Because the original expression matrix contains noise, erroneous values, and systematic variation, preprocessing is required before clustering can be performed.














[Figure 1. Data mining tasks. Direct data mining: classification, estimation, prediction. Indirect data mining: clustering, association rules, description and visualization.]

Gene expression data can be clustered in two ways. The first is to group genes with similar expression patterns, that is, to cluster the rows of the matrix D. The other is to group different samples on the basis of their expression profiles, that is, to cluster the columns of the matrix D.
1.2.2 Data clustering in health psychology
Data clustering is applied in many areas of health psychology, including the promotion and maintenance of health, improvement of the health care system, and the prevention of illness and disability (Clatworthy et al., 2005). In the development of health care systems, clustering is used to identify groups of people who may benefit from specific services (Hodges and Wotring, 2000). In health promotion, cluster analysis is used to select the target groups most likely to benefit from specific health promotion campaigns and to facilitate the development of promotional material. In addition, clustering is used to identify groups of people at risk of developing medical conditions and those at risk of poor outcomes.
1.2.3 Data clustering in market research
In market research, data clustering is used to segment the market and determine target markets (Christopher, 1969; Saunders, 1980; Frank and Green, 1968). In market segmentation, clustering is used to divide a market into meaningful segments, for example men aged 21-30 and men over 51, where men over 51 tend not to buy new products.
1.2.4 Data clustering in image segmentation
Image segmentation is the decomposition of a gray-level or color image into homogeneous tiles (Comaniciu and Meer, 2002). In image segmentation, clustering is commonly used to detect the borders of objects in an image.
Data clustering is an essential tool of data mining, which is the process of exploring and analyzing large amounts of data in order to obtain useful information (Berry and Linoff, 2000). Data clustering is also a fundamental problem in pattern recognition. Figure 1 gives a schematic list of the various data mining tasks and shows the role of clustering in data mining.
In general, useful information can be discovered from a large amount of data by automatic or semiautomatic means (Berry and Linoff, 2000). In indirect data mining, no variable is singled out as a target, and the goal is to discover relationships among all the variables. In direct data mining, by contrast, some variables are singled out as targets. Data clustering is indirect data mining, because we do not know exactly which clusters we are looking for, what plays a role in forming those clusters, or how it does so.
The clustering problem has received wide attention, although there is no uniform definition of data clustering and there may never be one (Estivill-Castro, 2002; Dubes, 1987; Fraley and Raftery, 1998). Roughly speaking, data clustering means that, given a data set and a similarity measure, we regroup the data so that data points in the same group are similar and data points in different groups are dissimilar. This problem clearly arises in many applications, such as text mining, gene expression analysis, customer segmentation, and image processing.
2. Some data types
Data clustering algorithms are closely tied to data types. Understanding scale, normalization, and proximity is therefore very important for interpreting the results of a clustering algorithm. Data type refers to the degree of quantization in the data (Jain and Dubes, 1988; Anderberg, 1973): a single attribute can be typed as binary, discrete, or continuous. A binary attribute has exactly two values, such as true or false. A discrete attribute has a finite number of possible values, so the binary type is a special case of the discrete type (see Figure 2).
Data scales, which indicate the relative significance of numbers, are also an important issue in data clustering. Data scales can be divided into qualitative scales and quantitative scales. Qualitative scales include nominal scales and ordinal scales; quantitative scales include interval scales and ratio scales (Figure 3). The data types are examined in this section.
2.1 Categorical data
Categorical attributes are also called nominal attributes. They are simply used as names, such as car brands and the names of bank branches. Since we consider data sets with a finite number of data points, a nominal attribute of the data points in the data set can take only a finite number of values; the nominal type is therefore also a special case of the discrete type.

[Figure 2. Diagram of data types: discrete and continuous; discrete types include nominal and binary; binary types are symmetric or asymmetric.]

[Figure 3. Diagram of data scales: qualitative (nominal, ordinal) and quantitative (interval, ratio).]

This section introduces symbol tables, frequency tables, and some notation for categorical data sets.

Table 1. A sample categorical data set
Record   Values
x1       (A, A, A, A, B, B)
x2       (A, A, A, A, C, D)
x3       (A, A, A, A, D, C)
x4       (B, B, C, C, D, C)
x5       (B, B, D, D, C, D)

Let $D = \{x_1, x_2, \ldots, x_n\}$ be a categorical data set with $n$ records, described by $d$ categorical attributes $v_1, v_2, \ldots, v_d$. Let $\mathrm{DOM}(v_j)$ denote the domain of attribute $v_j$. In the categorical data set of Table 1, for example, the domains of $v_1$ and $v_4$ are $\mathrm{DOM}(v_1) = \{A, B\}$ and $\mathrm{DOM}(v_4) = \{A, C, D\}$, respectively.
For a given categorical data set $D$, suppose that $\mathrm{DOM}(v_j) = \{A_{j1}, A_{j2}, \ldots, A_{jn_j}\}$ for $j = 1, 2, \ldots, d$. We call $A_{jl}$ $(1 \le l \le n_j)$ a state of the categorical attribute $v_j$ in the data set $D$, and $n_j$ is the number of states of $v_j$. A symbol table $T_s$ of the data set is defined as

$$T_s = (s_1, s_2, \ldots, s_d), \tag{2.1}$$

where $s_j$ $(1 \le j \le d)$ is the vector

$$s_j = (A_{j1}, A_{j2}, \ldots, A_{jn_j})^T.$$

Because the states (values) of a variable can be listed in more than one order, a symbol table of a data set is usually not unique. For example, both Table 2 and Table 3 are symbol tables of the data set in Table 1.
A frequency table is computed with respect to a symbol table and has exactly the same dimensions as that symbol table. Let $C$ be a cluster. The frequency table $T_f(C)$ of the cluster $C$ is defined as

$$T_f(C) = (f_1(C), f_2(C), \ldots, f_d(C)), \tag{2.2}$$

where $f_j(C)$ is the vector

$$f_j(C) = (f_{j1}(C), f_{j2}(C), \ldots, f_{jn_j}(C))^T. \tag{2.3}$$
Table 2. One symbol table of the data set in Table 1 (one column per attribute)
v1   v2   v3   v4   v5   v6
A    A    A    A    B    B
B    B    C    C    C    C
          D    D    D    D

Table 3. Another symbol table of the data set in Table 1 (one column per attribute)
v1   v2   v3   v4   v5   v6
A    B    D    A    B    C
B    A    C    C    C    B
          A    D    D    D
In equation (2.3), $f_{jr}(C)$ $(1 \le j \le d,\ 1 \le r \le n_j)$ is the number of data points in cluster $C$ that take the value $A_{jr}$ on the $j$th attribute, that is,

$$f_{jr}(C) = |\{x \in C : x_j = A_{jr}\}|, \tag{2.4}$$

where $x_j$ is the $j$th component value of $x$.
For a given symbol table of the data set, the frequency table of each cluster is unique up to that symbol table. For example, for the data set in Table 1, let $C$ be the cluster $C = \{x_1, x_2, x_3\}$. If we use the symbol table in Table 2, the corresponding frequency table of $C$ is given in Table 4. If we use the symbol table in Table 3, the frequency table of $C$ is given in Table 5.
For a given categorical data set $D$, $T_f(D)$ is the frequency table computed over the whole data set. Suppose $D$ is partitioned into $k$ nonoverlapping clusters $C_1, C_2, \ldots, C_k$. Then

$$f_{jr}(D) = \sum_{i=1}^{k} f_{jr}(C_i) \tag{2.5}$$

for all $r = 1, 2, \ldots, n_j$ and $j = 1, 2, \ldots, d$.
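The definitions (2.1) to (2.5) can be checked on the data of Table 1 with a short sketch. The function names below are illustrative, and sorting the states alphabetically is only one possible choice of symbol table (it happens to coincide with Table 2).

```python
from collections import Counter

# Records of Table 1 (six categorical attributes each).
DATA = {
    "x1": ("A", "A", "A", "A", "B", "B"),
    "x2": ("A", "A", "A", "A", "C", "D"),
    "x3": ("A", "A", "A", "A", "D", "C"),
    "x4": ("B", "B", "C", "C", "D", "C"),
    "x5": ("B", "B", "D", "D", "C", "D"),
}

def symbol_table(records):
    """One sorted list of states per attribute (one of several valid symbol tables)."""
    d = len(next(iter(records.values())))
    return [sorted({rec[j] for rec in records.values()}) for j in range(d)]

def frequency_table(cluster, symbols, records):
    """f_jr(C): number of points in the cluster taking state A_jr on attribute j."""
    table = []
    for j, states in enumerate(symbols):
        counts = Counter(records[name][j] for name in cluster)
        table.append([counts[s] for s in states])
    return table

symbols = symbol_table(DATA)
f_c1 = frequency_table(["x1", "x2", "x3"], symbols, DATA)
f_c2 = frequency_table(["x4", "x5"], symbols, DATA)
f_d = frequency_table(list(DATA), symbols, DATA)

# Equation (2.5): frequencies over D are the sums over the clusters of a partition.
additive = all(
    f_d[j][r] == f_c1[j][r] + f_c2[j][r]
    for j in range(len(symbols))
    for r in range(len(symbols[j]))
)
```

Under these assumptions, `f_c1` holds the frequencies of the cluster C = {x1, x2, x3}, and `additive` confirms (2.5) for the partition {x1, x2, x3}, {x4, x5}.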
2.2 Binary data
A binary attribute is an attribute that has exactly two possible values, such as "true" or "false". Binary variables can be divided into two types: symmetric binary variables and asymmetric binary variables. In a symmetric binary variable the two values are equally important; an example is "male-female". A symmetric binary variable is a nominal variable. In an asymmetric variable, one of the values carries more importance than the other. For example, "yes" stands for the presence of a certain attribute and "no" stands for its absence.
A binary vector $x$ of dimension $d$ is defined as $(x_1, x_2, \ldots, x_d)$ (Zhang and Srihari, 2003), where $x_i \in \{0, 1\}$ $(1 \le i \le d)$ is the $i$th component value of $x$. The unit binary vector $I$ of dimension $d$ is the binary vector with every entry equal to 1. The complement of a binary vector $x$ is defined as $\bar{x} = I - x$, where $I$ is the unit binary vector with the same dimension as $x$.
Consider two binary vectors $x$ and $y$ in a $d$-dimensional space, and let $S_{ij}(x, y)$ $(i, j \in \{0, 1\})$ denote the number of positions at which $x$ has the value $i$ and $y$ has the value $j$, that is,

$$S_{ij}(x, y) = |\{k : x_k = i \text{ and } y_k = j,\ k = 1, 2, \ldots, d\}|. \tag{2.6}$$

With $S_{ij}(x, y)$ as defined in (2.6), the following equalities clearly hold:

$$S_{11}(x, y) = x \cdot y = \sum_{i=1}^{d} x_i y_i, \tag{2.7a}$$

$$S_{00}(x, y) = \bar{x} \cdot \bar{y} = \sum_{i=1}^{d} (1 - x_i)(1 - y_i), \tag{2.7b}$$

$$S_{01}(x, y) = \bar{x} \cdot y = \sum_{i=1}^{d} (1 - x_i)\, y_i, \tag{2.7c}$$

$$S_{10}(x, y) = x \cdot \bar{y} = \sum_{i=1}^{d} x_i (1 - y_i). \tag{2.7d}$$

We also have

$$d = S_{00}(x, y) + S_{01}(x, y) + S_{10}(x, y) + S_{11}(x, y). \tag{2.8}$$
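As a small illustration of (2.7a) to (2.7d) and of the identity (2.8), the four counts can be computed directly from the component products. The function name and the two example vectors are illustrative only.

```python
def match_counts(x, y):
    """Return (S00, S01, S10, S11) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    s11 = sum(a * b for a, b in zip(x, y))              # (2.7a)
    s00 = sum((1 - a) * (1 - b) for a, b in zip(x, y))  # (2.7b)
    s01 = sum((1 - a) * b for a, b in zip(x, y))        # (2.7c)
    s10 = sum(a * (1 - b) for a, b in zip(x, y))        # (2.7d)
    return s00, s01, s10, s11

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
s00, s01, s10, s11 = match_counts(x, y)
total = s00 + s01 + s10 + s11   # equals the dimension d by (2.8)
```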
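Table 4. Frequency table of the cluster C = {x1, x2, x3}, computed from the symbol table in Table 2
v1   v2   v3   v4   v5   v6
3    3    3    3    1    1
0    0    0    0    1    1
          0    0    1    1

Table 5. Frequency table of the cluster C = {x1, x2, x3}, computed from the symbol table in Table 3
v1   v2   v3   v4   v5   v6
3    0    0    3    1    1
0    3    0    0    1    1
          3    0    1    1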

2.3 Transaction data
Given a set of items $I = \{I_1, I_2, \ldots, I_m\}$, a transaction is a subset of $I$ (Yang et al., 2002b; Wang et al., 1999a; Xiao and Dunham, 2001). A transaction data set is a set of transactions, that is, $D = \{t_i : t_i \subseteq I,\ i = 1, 2, \ldots, n\}$. Transactions can be represented by binary vectors in which each entry indicates the presence or absence of the corresponding item. For example, a transaction $t_i$ can be represented by the binary vector $(b_{i1}, b_{i2}, \ldots, b_{im})$, where $b_{ij} = 1$ if $I_j \in t_i$ and $b_{ij} = 0$ if $I_j \notin t_i$. From this point of view, transaction data are a special case of binary data. The most common example of transaction data is market basket data. In a market basket data set, a transaction contains a subset of the total set of items that could be purchased. For example, the following are two transactions: {apple, cake} and {apple, dish, egg, fish}. In general, many transactions consist of sparsely distributed items. For example, a customer may buy only a few items from a store that stocks thousands. As pointed out by Wang et al. (1999a), for transactions made up of sparsely distributed items, pairwise similarity is neither necessary nor sufficient for judging whether a cluster of transactions is similar.
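The binary representation of a transaction described above can be sketched as follows. The item list and its ordering are assumptions made for the example.

```python
# The item set I, in a fixed (illustrative) order.
ITEMS = ["apple", "cake", "dish", "egg", "fish"]

def to_binary(transaction, items=ITEMS):
    """b_ij = 1 if item I_j belongs to the transaction, otherwise 0."""
    return [1 if item in transaction else 0 for item in items]

t1 = {"apple", "cake"}
t2 = {"apple", "dish", "egg", "fish"}
b1 = to_binary(t1)
b2 = to_binary(t2)
```

With thousands of items and only a few purchased per transaction, such vectors are almost entirely zeros, which is the sparsity the paragraph refers to.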
2.4 Symbolic data
Categorical data and binary data are classical data types, and symbolic data are an extension of the classical data types. In conventional data sets the objects are treated as individuals (first-order objects) (Malerba et al., 2001), whereas in symbolic data sets the objects are more "unified" by means of relationships. Symbolic data therefore describe more or less homogeneous classes or groups of individuals (second-order objects) (Malerba et al., 2001). Malerba et al. (2001) defined a symbolic data set as a class or group of individuals described by a number of set-valued or modal variables. A variable is called set-valued if it takes its values in the power set of its domain. A modal variable is a set-valued variable with a measure or distribution (frequency, probability, or weight) associated with each object.
Gowda and Diday (1992) summarized the differences between symbolic data and conventional data as follows:
- The objects in a symbolic data set need not all be defined on the same variables.
- Each variable may take more than one value, or even an interval of values.
- The variables in a complex symbolic data set may take values that include one or more elementary objects.
- The description of a symbolic object may depend on the relations that exist between other objects.
- The values taken by the variables may indicate frequency of occurrence, relative likelihood, level of importance of the values, and so on.
Symbolic data may be aggregated from conventional data for reasons such as privacy. In census data, for example, the data are made available in aggregate form to guarantee that data analysts cannot identify an individual or a single business establishment.
2.5 Time series
Time series are the simplest form of temporal data. Precisely, a time series is a sequence of real numbers that represent the measurements of a real variable at equal time intervals (Gunopulos and Das, 2000). For example, stock price movements, the temperature at a given place, and sales volume over time are all time series.
A time series is discrete if the variable is defined on a finite set of time points. Most of the time series encountered in cluster analysis are discrete. When a variable is defined at all points in time, the time series is continuous.
In general, a time series can be regarded as a mixture of the following four components (Kendall and Ord, 1990):
1. A trend, that is, the long-term movement;
2. Fluctuations about the trend of greater or less regularity;
3. A seasonal component;
4. A residual or random effect.
3. Data transformation and standardization
In many applications of data clustering, the raw data, or actual measurements, are not used directly unless a probabilistic model for pattern generation is available (Jain and Dubes, 1988). Preparing data for clustering requires some kind of transformation, such as data transformation and standardization. Some data transformation methods commonly used in clustering are discussed in this section. Some methods of data standardization are presented in Section 3.1.
For convenience, let $D^* = \{x_1^*, x_2^*, \ldots, x_n^*\}$ denote the $d$-dimensional raw data set. The data matrix is then the $n \times d$ matrix

$$(x_1^*, x_2^*, \ldots, x_n^*)^T = \begin{pmatrix} x_{11}^* & x_{12}^* & \cdots & x_{1d}^* \\ x_{21}^* & x_{22}^* & \cdots & x_{2d}^* \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}^* & x_{n2}^* & \cdots & x_{nd}^* \end{pmatrix}. \tag{4.1}$$
3.1 Data standardization
Standardization makes data dimensionless. It is useful for defining standardized indices. After standardization, all knowledge of the location and scale of the original data may be lost. Variables need to be standardized when the dissimilarity measure, such as the Euclidean distance, is sensitive to differences in the magnitudes or scales of the input variables (Milligan and Cooper, 1988). Approaches to standardizing variables are essentially of two types: global standardization and within-cluster standardization.
Global standardization standardizes the variables across all elements of the data set. Within-cluster standardization refers to standardization that takes place within clusters on each variable. Some forms of standardization can be used both globally and within clusters, but some can be used only globally.
Variables cannot be standardized within clusters directly, because the clusters are not known before standardization. Other approaches are needed to overcome this difficulty. Overall and Klett (1972) proposed an iterative approach that first obtains clusters from overall estimates and then uses those clusters to help determine the within-group variances for standardization in a second cluster analysis.
To standardize the raw data given in equation (4.1), we can subtract a location measure and divide by a scale measure for each variable. That is,

$$x_{ij} = \frac{x_{ij}^* - L_j}{M_j}, \tag{4.2}$$

where $x_{ij}$ denotes the standardized value, $L_j$ is the location measure, and $M_j$ is the scale measure.
Different standardization methods are obtained by choosing different $L_j$ and $M_j$ in equation (4.2). Some well-known choices are the mean, the standard deviation, the range, Huber's estimate, Tukey's biweight estimate, and Andrews's wave estimate.
Table 4.1 gives some forms of standardization, where $\bar{x}_j^*$, $R_j^*$, and $\sigma_j^*$ are the mean, range, and standard deviation of the $j$th variable, respectively, that is,

$$\bar{x}_j^* = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^*, \tag{4.3a}$$

$$R_j^* = \max_{1 \le i \le n} x_{ij}^* - \min_{1 \le i \le n} x_{ij}^*, \tag{4.3b}$$

$$\sigma_j^* = \left[ \frac{1}{n-1} \sum_{i=1}^{n} \left( x_{ij}^* - \bar{x}_j^* \right)^2 \right]^{1/2}. \tag{4.3c}$$
We now discuss several common forms of standardization and their properties in detail. The z-score is a form of standardization used to transform normal variates to standard score form. Given a set of raw data $D^*$, the z-score standardization formula is defined as

$$x_{ij} = Z_1(x_{ij}^*) = \frac{x_{ij}^* - \bar{x}_j^*}{\sigma_j^*}, \tag{4.4}$$

where $\bar{x}_j^*$ and $\sigma_j^*$ are the sample mean and standard deviation of the $j$th attribute, respectively.
The transformed variable has a mean of 0 and a variance of 1. The location and scale information of the original variable is lost. This transformation is also presented in (Jain and Dubes, 1988, p. 24). An important restriction of the standardization $Z_1$ is that it must be applied globally and not within clusters (Milligan and Cooper, 1988). To see why, consider two well-separated clusters in the data. If a sample lies at each of the two cluster centroids, within-cluster standardization maps both of these samples to the zero vector. Any clustering algorithm will group the two zero vectors together, which means that the two original samples are assigned to the same cluster. This produces a very misleading clustering result.
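The pitfall described above can be reproduced with a minimal sketch. The two small clusters are made-up data in which the first row of each cluster sits exactly at its centroid; the helper name is illustrative.

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Column-wise z-score (4.4) using the sample standard deviation (n - 1)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

# Two well-separated clusters; row 0 of each is the cluster centroid.
cluster_a = [[0.0, 0.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [1.0, 1.0]]
cluster_b = [[v + 100.0 for v in row] for row in cluster_a]

# Global standardization keeps the two centroids apart.
global_z = zscore_columns(cluster_a + cluster_b)

# Within-cluster standardization sends both centroids to the zero vector.
within_a = zscore_columns(cluster_a)
within_b = zscore_columns(cluster_b)
```

After within-cluster standardization, `within_a[0]` and `within_b[0]` coincide, so no distance-based algorithm can tell the two centroid samples apart, whereas `global_z[0]` and `global_z[5]` remain clearly separated.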
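Table 4.1. Some data standardization methods, where $\bar{x}_j^*$, $R_j^*$, and $\sigma_j^*$ are defined in equation (4.3)

Name      $L_j$                                              $M_j$
z-score   $\bar{x}_j^*$                                      $\sigma_j^*$
USTD      0                                                  $\sigma_j^*$
Maximum   0                                                  $\max_{1 \le i \le n} x_{ij}^*$
Mean      $\bar{x}_j^*$                                      1
Median    $x_{\frac{n+1}{2},j}^*$ if $n$ is odd;             1
          $\frac{1}{2}\left(x_{\frac{n}{2},j}^* + x_{\frac{n+2}{2},j}^*\right)$ if $n$ is even
          (values of variable $j$ taken in sorted order)
Sum       0                                                  $\sum_{i=1}^{n} x_{ij}^*$
Range     $\min_{1 \le i \le n} x_{ij}^*$                    $R_j^*$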
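The rows of Table 4.1 all plug into the single formula (4.2). The sketch below, with illustrative names, encodes each row as a function returning the pair $(L_j, M_j)$ for one column of raw values; it is meant only to make the table concrete.

```python
from statistics import mean, median, stdev

# Each entry returns (L_j, M_j) for one column of raw values, following Table 4.1.
FORMS = {
    "z-score": lambda col: (mean(col), stdev(col)),
    "USTD": lambda col: (0.0, stdev(col)),
    "maximum": lambda col: (0.0, max(col)),
    "mean": lambda col: (mean(col), 1.0),
    "median": lambda col: (median(col), 1.0),
    "sum": lambda col: (0.0, sum(col)),
    "range": lambda col: (min(col), max(col) - min(col)),
}

def standardize(column, form):
    """Apply x = (x* - L_j) / M_j, equation (4.2), to one variable."""
    location, scale = FORMS[form](column)
    return [(value - location) / scale for value in column]

raw = [2.0, 4.0, 6.0, 8.0]
by_range = standardize(raw, "range")   # values mapped into [0, 1]
by_sum = standardize(raw, "sum")       # values that sum to one
by_z = standardize(raw, "z-score")     # mean 0, sample variance 1
```

A scale measure of zero (for example a constant column under the range form) would cause a division by zero; the sketch leaves that case unhandled.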

The USTD (weighted uncorrected standard deviation) standardization is similar to the z-score standardization and is defined as

$$x_{ij} = Z_2(x_{ij}^*) = \frac{x_{ij}^*}{\sigma_j^*}, \tag{4.5}$$

where $\sigma_j^*$ is defined in equation (4.3c).
A variable transformed by $Z_2$ has a variance of 1. Because the scores are not centered by subtracting the mean, the location information between the scores is kept. The standardization $Z_2$ therefore does not suffer from the loss of information about the cluster centroids.
The third standardization method presented in Milligan and Cooper (1988) uses the maximum score on the variable:

$$x_{ij} = Z_3(x_{ij}^*) = \frac{x_{ij}^*}{\max_{1 \le i \le n} x_{ij}^*}. \tag{4.6}$$

A variable $X$ transformed by $Z_3$ has mean $\dfrac{\bar{X}}{\max(X)}$ and standard deviation $\dfrac{\sigma_X}{\max(X)}$, where $\bar{X}$ and $\sigma_X$ are the mean and standard deviation of the original variable. $Z_3$ is sensitive to the presence of outliers (Milligan and Cooper, 1988). If a single large observation is present on a variable, $Z_3$ standardizes the remaining values to near 0. $Z_3$ seems to be meaningful only when the variable is measured on a ratio scale (Milligan and Cooper, 1988).
Two standardizations that use the range of the variable are presented in (Milligan and Cooper, 1988):

$$x_{ij} = Z_4(x_{ij}^*) = \frac{x_{ij}^*}{R_j^*}, \tag{4.7a}$$

$$x_{ij} = Z_5(x_{ij}^*) = \frac{x_{ij}^* - \min_{1 \le i \le n} x_{ij}^*}{R_j^*}, \tag{4.7b}$$

where $R_j^*$ is the range of the $j$th attribute defined in equation (4.3b).
A variable $X$ transformed by $Z_4$ and $Z_5$ has means $\dfrac{\bar{X}}{\max(X) - \min(X)}$ and $\dfrac{\bar{X} - \min(X)}{\max(X) - \min(X)}$, respectively, and the same standard deviation $\dfrac{\sigma_X}{\max(X) - \min(X)}$. Both $Z_4$ and $Z_5$ are sensitive to the presence of outliers.

A standardization based on normalizing by the sum of the observations, presented in (Milligan and Cooper, 1988), is defined as

$$x_{ij} = Z_6(x_{ij}^*) = \frac{x_{ij}^*}{\sum_{i=1}^{n} x_{ij}^*}. \tag{4.8}$$

The transformation $Z_6$ normalizes the sum of the transformed values to one, and the transformed mean is $\frac{1}{n}$. The mean is therefore constant across all variables.

A very different approach to standardization, which converts the scores to ranks, is presented in (Milligan and Cooper, 1988) and is defined as

$$x_{ij} = Z_7(x_{ij}^*) = \mathrm{Rank}(x_{ij}^*), \tag{4.9}$$

where $\mathrm{Rank}(X)$ is the rank assigned to $X$.
A variable transformed by $Z_7$ has a mean of $\dfrac{n+1}{2}$ and a variance of $(n+1)\left(\dfrac{2n+1}{6} - \dfrac{n+1}{4}\right)$. The rank transformation reduces the impact of outliers in the data.
Conover and Iman (1981) proposed four types of rank transformation. In the first, the data are ranked from smallest to largest, with the smallest score given rank one, the second smallest rank two, and so on. Average ranks are assigned in the case of ties.
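The first rank transformation, with average ranks for ties, can be sketched in a few lines. The function name and the sample scores are illustrative.

```python
def average_ranks(values):
    """Rank from smallest (rank 1) to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

scores = [10.0, 3.0, 1000.0, 3.0, 7.0]
ranks = average_ranks(scores)
mean_rank = sum(ranks) / len(ranks)   # equals (n + 1) / 2
```

The outlier 1000.0 receives rank 5, only one step above its neighbor, which illustrates how the rank transformation limits the influence of outliers.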
3.2 Data transformation
Data transformation is related to data standardization but is more complicated. Data standardization concentrates on the variables, whereas data transformation concentrates on the whole data set. Data standardization can therefore be viewed as a special case of data transformation. This section presents some data transformation techniques that can be used in data clustering.
3.2.1 Principal component analysis
The main purpose of principal component analysis (PCA) (Ding and He, 2004; Jolliffe, 2002) is to reduce the dimensionality of a high-dimensional data set consisting of a large number of interrelated variables while retaining as much as possible of the variation present in the data set. The principal components (PCs) are new variables that are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables.
The PCs are defined as follows. Let $v = (v_1, v_2, \ldots, v_d)'$ be a vector of $d$ random variables, where $'$ denotes the transpose operation. The first step is to find a linear function $a_1' v$ of the elements of $v$ that has maximum variance, where $a_1$ is the $d$-dimensional vector $(a_{11}, a_{12}, \ldots, a_{1d})'$, so that

$$a_1' v = \sum_{i=1}^{d} a_{1i} v_i.$$

After finding $a_1' v, a_2' v, \ldots, a_{j-1}' v$, we look for a linear function $a_j' v$ that is uncorrelated with $a_1' v, a_2' v, \ldots, a_{j-1}' v$ and has maximum variance. After $d$ steps we obtain $d$ such linear functions. The $j$th derived variable $a_j' v$ is the $j$th PC. In general, most of the variation in $v$ is accounted for by the first few PCs.
To find the form of the PCs, we need to know the covariance matrix $\Sigma$ of $v$. In most practical cases $\Sigma$ is unknown and is replaced by a sample covariance matrix. For $j = 1, 2, \ldots, d$, it can be shown that the $j$th PC is given by $z_j = a_j' v$, where $a_j$ is an eigenvector of $\Sigma$ corresponding to the $j$th largest eigenvalue $\lambda_j$.
In fact, in the first step, $z_1 = a_1' v$ can be found by solving the following optimization problem:

$$\text{maximize } \operatorname{var}(a_1' v) \quad \text{subject to } a_1' a_1 = 1,$$

where $\operatorname{var}(a_1' v)$ is computed as

$$\operatorname{var}(a_1' v) = a_1' \Sigma a_1.$$

The technique of Lagrange multipliers can be used to solve this optimization problem. Let $\lambda$ be a Lagrange multiplier. We want to maximize

$$a_1' \Sigma a_1 - \lambda (a_1' a_1 - 1). \tag{4.10}$$

Differentiating (4.10) with respect to $a_1$ gives

$$\Sigma a_1 - \lambda a_1 = 0,$$

or

$$(\Sigma - \lambda I_d)\, a_1 = 0,$$

where $I_d$ is the $d \times d$ identity matrix.
Thus $\lambda$ is an eigenvalue of $\Sigma$ and $a_1$ is the corresponding eigenvector. Since

$$a_1' \Sigma a_1 = a_1' \lambda a_1 = \lambda,$$

$a_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$. In fact, it can be shown that the $j$th PC is $a_j' v$, where $a_j$ is an eigenvector of $\Sigma$ corresponding to its $j$th largest eigenvalue $\lambda_j$ (Jolliffe, 2002).
In (Ding and He, 2004), PCA is used to reduce the dimensionality of the data set and the K-means algorithm is then applied in the PCA subspace.
Other examples of applying PCA in cluster analysis can be found in (Yeung and Ruzzo, 2001). Performing PCA is equivalent to performing a singular value decomposition (SVD) on the covariance matrix of the data. ORCLUS uses the SVD technique (Kanth et al., 1998) to find arbitrarily oriented subspaces with good clustering.
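The eigenvector characterization above can be tried out numerically. The following is a minimal sketch using NumPy on synthetic data; the function name `pca`, the data, and the use of the sample covariance in place of the unknown $\Sigma$ are assumptions made for the illustration.

```python
import numpy as np

def pca(data, m):
    """Project rows of `data` onto the m leading eigenvectors of the sample covariance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs[:, :m]         # z_j = a_j' v for each sample
    return eigvals, eigvecs, scores

# Two strongly correlated variables plus a little noise (synthetic data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = np.hstack([latent, 2.0 * latent]) + 0.05 * rng.normal(size=(200, 2))

eigvals, eigvecs, scores = pca(data, m=1)
explained = eigvals[0] / eigvals.sum()         # share of variance kept by the first PC
```

In this setting the sample variance of the first score column equals the largest eigenvalue, in line with $a_1' \Sigma a_1 = \lambda$, and the first PC alone carries nearly all of the variation.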
3.2.2 SVD
SVD is a powerful technique in matrix computation and analysis, for example in solving systems of linear equations and in matrix approximation. SVD is also a well-known linear projection technique and is widely used in data compression and visualization (Andrews and Patterson, 1976a, b). This subsection gives a brief summary of the SVD method.
Let $D = \{x_1, x_2, \ldots, x_n\}$ be a numerical data set in a $d$-dimensional space. Then $D$ can be represented by an $n \times d$ matrix $X$,

$$X = (x_{ij})_{n \times d},$$

where $x_{ij}$ is the $j$th component value of $x_i$.
Let $\bar{\mu} = (\bar{\mu}_1, \bar{\mu}_2, \ldots, \bar{\mu}_d)$ be the column mean of $X$,

$$\bar{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \quad j = 1, 2, \ldots, d,$$

and let $e_n$ be a column vector of length $n$ with all elements equal to one. Then SVD expresses $X - e_n \bar{\mu}$ as

$$X - e_n \bar{\mu} = U S V^T, \tag{4.11}$$

where $U$ is an $n \times n$ orthogonal matrix, that is, $U^T U = I$ is the identity matrix; $S$ is an $n \times d$ diagonal matrix containing the singular values; and $V$ is a $d \times d$ unitary matrix, that is, $V^H V = I$, where $V^H$ is the conjugate transpose of $V$.

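The columns of the matrix $V$ are the eigenvectors of the covariance matrix $C$ of $X$; precisely,

$$C = \frac{1}{n} X^T X - \bar{\mu}^T \bar{\mu} = V \Lambda V^T. \tag{4.12}$$

Since $C$ is a $d \times d$ positive semidefinite matrix, it has $d$ nonnegative eigenvalues and $d$ orthonormal eigenvectors. Without loss of generality, let the eigenvalues of $C$ be arranged in decreasing order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Let $\sigma_j$ $(j = 1, 2, \ldots, d)$ be the standard deviation of the $j$th column of $X$, that is,

$$\sigma_j = \left( \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{\mu}_j)^2 \right)^{1/2}.$$

The trace of $C$ is invariant under rotation, that is,

$$\sum_{j=1}^{d} \sigma_j^2 = \sum_{j=1}^{d} \lambda_j.$$

Noting that $e_n^T X = n \bar{\mu}$ and $e_n^T e_n = n$, from equations (4.11) and (4.12) we have

$$\begin{aligned} V S^T S V^T &= V S^T U^T U S V^T \\ &= (X - e_n \bar{\mu})^T (X - e_n \bar{\mu}) \\ &= X^T X - \bar{\mu}^T e_n^T X - X^T e_n \bar{\mu} + \bar{\mu}^T e_n^T e_n \bar{\mu} \\ &= X^T X - n \bar{\mu}^T \bar{\mu} \\ &= n V \Lambda V^T. \end{aligned} \tag{4.13}$$

Since $V$ is an orthogonal matrix, equation (4.13) shows that the singular values are related to the eigenvalues by

$$s_j^2 = n \lambda_j, \quad j = 1, 2, \ldots, d.$$

The eigenvectors constitute the PCs of $X$, and uncorrelated features are obtained by the transformation $Y = (X - e_n \bar{\mu}) V$. PCA selects the features with the highest eigenvalues.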
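The relations $s_j^2 = n \lambda_j$, the trace identity, and the decorrelation by $Y = (X - e_n \bar{\mu})V$ can be checked numerically. This is a sketch on synthetic data using NumPy; the reduced SVD (`full_matrices=False`) is used, so `U` here is $n \times d$ rather than the full $n \times n$ matrix of (4.11).

```python
import numpy as np

rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(50, 3)) @ mixing      # synthetic n x d data matrix
n = X.shape[0]
mu = X.mean(axis=0)                        # column means

# Covariance as in (4.12): C = X^T X / n - mu^T mu.
C = X.T @ X / n - np.outer(mu, mu)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # decreasing order

# Singular values of the centered matrix X - e_n mu, as in (4.11).
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)

# Uncorrelated features Y = (X - e_n mu) V and their covariance.
Y = (X - mu) @ Vt.T
cov_Y = Y.T @ Y / n
```

With these definitions, `s**2` agrees with `n * eigvals`, the trace of `C` equals the sum of the eigenvalues, and `cov_Y` is diagonal with the eigenvalues on its diagonal.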
3.2.3 The Karhunen-Loève transformation
The Karhunen-Loève (K-L) transformation is concerned with explaining the structure of the data through a few linear combinations of the variables. Like PCA, the K-L transformation is an optimal way to project $d$-dimensional points onto lower-dimensional points such that the projection error (that is, the sum of squared distances (SSD)) is minimal (Fukunaga, 1990).
Let $D = \{x_1, x_2, \ldots, x_n\}$ be a data set in a $d$-dimensional space, and let $X$ be the corresponding $n \times d$ matrix, that is, $X = (x_{ij})_{n \times d}$, where $x_{ij}$ is the $j$th component value of $x_i$.
The $x_i$ $(i = 1, 2, \ldots, n)$ are $d$-dimensional vectors. They can be represented without error as a sum of $d$ linearly independent vectors:

$$x_i = \sum_{j=1}^{d} y_{ij} \phi_j^T = y_i \Phi^T,$$

or

$$X = Y \Phi^T, \tag{4.14}$$

where $y_i = (y_{i1}, y_{i2}, \ldots, y_{id})$ and

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \Phi = (\phi_1, \phi_2, \ldots, \phi_d).$$

The $d \times d$ matrix $\Phi$ is the basis matrix, and we may further assume that the vectors $\phi_j$ form an orthonormal set, that is,

$$\phi_i^T \phi_j = \begin{cases} 1 & \text{for } i = j, \\ 0 & \text{for } i \ne j, \end{cases}$$

or

$$\Phi^T \Phi = I_{d \times d},$$

where $I_{d \times d}$ is the $d \times d$ identity matrix.
Then, from equation (4.14), the components of $y_i$ can be computed by

$$y_i = x_i \Phi, \quad i = 1, 2, \ldots, n,$$

or

$$Y = X \Phi.$$

Thus $Y$ is simply an orthogonal transformation of $X$. $\phi_j$ is called the $j$th feature vector, and $y_{ij}$ is the $j$th component of the sample $x_i$ in this feature space. To reduce dimensionality, we choose only $m$ $(m < d)$ feature vectors that approximate $X$ well. The approximation can be obtained by replacing the remaining components of $y_i$ with preselected constants (Fukunaga, 1990, p. 402):

$$\hat{x}_i(m) = \sum_{j=1}^{m} y_{ij} \phi_j^T + \sum_{j=m+1}^{d} b_{ij} \phi_j^T,$$

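or, in matrix form,

$$\hat{X}(m) = Y(1, m) \begin{pmatrix} \phi_1^T \\ \vdots \\ \phi_m^T \end{pmatrix} + B \begin{pmatrix} \phi_{m+1}^T \\ \vdots \\ \phi_d^T \end{pmatrix},$$

where $Y(1, m)$ is the $n \times m$ matrix formed by the first $m$ columns of $Y$, that is, $Y(1, m) = (y_{ij})_{n \times m}$, and $B$ is an $n \times (d - m)$ matrix whose $(i, j)$th entry is $b_{i, m+j}$.
Without loss of generality, we assume that only the first $m$ components of each $y_i$ are computed. The error of the resulting approximation is then

$$\Delta x_i(m) = x_i - \hat{x}_i(m) = \sum_{j=m+1}^{d} (y_{ij} - b_{ij}) \phi_j^T,$$

or

$$\Delta X(m) = \big( Y(m+1, d) - B \big) \begin{pmatrix} \phi_{m+1}^T \\ \vdots \\ \phi_d^T \end{pmatrix},$$

where $Y(m+1, d)$ is the $n \times (d - m)$ matrix formed by the last $d - m$ columns of $Y$.
Note that $\hat{X}$ and $\Delta X$ are random matrices, so the squared error of the approximation is

$$\begin{aligned} \bar{\varepsilon}^2(m) &= E\{ \| \Delta X(m) \|^2 \} \\ &= E\{ \operatorname{Tr}( \Delta X(m)\, \Delta X^T(m) ) \} \\ &= E\{ \operatorname{Tr}( (Y(m+1, d) - B)(Y(m+1, d) - B)^T ) \} \\ &= \sum_{i=1}^{n} \sum_{j=m+1}^{d} E\{ (y_{ij} - b_{ij})^2 \}. \end{aligned} \tag{4.15}$$

Every choice of the constants $b_{ij}$ gives a value of $\bar{\varepsilon}^2(m)$. The optimal choice of $b_{ij}$ is the one that minimizes $\bar{\varepsilon}^2(m)$. From equation (4.15), the optimal choice of $b_{ij}$ satisfies

$$\frac{\partial}{\partial b_{ij}} \bar{\varepsilon}^2(m) = -2 \big[ E\{ y_{ij} \} - b_{ij} \big] = 0,$$

which gives

$$b_{ij} = E\{ y_{ij} \} = E\{ x_i \} \phi_j,$$

or

$$B = E\{ Y(m+1, d) \} = E\{ X \} (\phi_{m+1}, \ldots, \phi_d).$$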

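Let $\Sigma_X$ be the covariance matrix of $X$. Then we have

$$\begin{aligned} \Sigma_X &= (X - E\{X\})^T (X - E\{X\}) \\ &= \big[ (x_1^T, \ldots, x_n^T) - (E\{x_1\}^T, \ldots, E\{x_n\}^T) \big] \left[ \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} - \begin{pmatrix} E\{x_1\} \\ \vdots \\ E\{x_n\} \end{pmatrix} \right] \\ &= \sum_{i=1}^{n} (x_i - E\{x_i\})^T (x_i - E\{x_i\}). \end{aligned}$$

The squared error $\bar{\varepsilon}^2(m)$ can therefore be rewritten as

$$\begin{aligned} \bar{\varepsilon}^2(m) &= \sum_{i=1}^{n} \sum_{j=m+1}^{d} E\{ (y_{ij} - b_{ij})^2 \} \\ &= \sum_{i=1}^{n} \sum_{j=m+1}^{d} \phi_j^T (x_i - E\{x_i\})^T (x_i - E\{x_i\}) \phi_j \\ &= \sum_{j=m+1}^{d} \phi_j^T \left( \sum_{i=1}^{n} (x_i - E\{x_i\})^T (x_i - E\{x_i\}) \right) \phi_j \\ &= \sum_{j=m+1}^{d} \phi_j^T \Sigma_X \phi_j. \end{aligned} \tag{4.16}$$

It can be shown that the optimal choice of $\phi_j$ satisfies (Fukunaga, 1990)

$$\Sigma_X \phi_j = \lambda_j \phi_j.$$

That is, the $\phi_j$ are eigenvectors of $\Sigma_X$. Equation (4.16) then becomes

$$\bar{\varepsilon}^2(m) = \sum_{j=m+1}^{d} \lambda_j.$$

Since the covariance matrix $\Sigma_X$ of $X$ is positive semidefinite, it has $d$ nonnegative eigenvalues. If we choose the $m$ eigenvectors that correspond to the $m$ largest eigenvalues, the squared error is minimized.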
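The final result, that the truncation error equals the sum of the discarded eigenvalues, can be checked on a sample. In the NumPy sketch below the expectation $E\{x_i\}$ is replaced by the sample mean, which is an assumption of the illustration, and $\Sigma_X$ is formed as the sum of outer products exactly as written in the text; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 0.5, 0.1])
mean = X.mean(axis=0)                      # stands in for E{x_i} (assumption)
scatter = (X - mean).T @ (X - mean)        # Sigma_X as written in the text

eigvals, Phi = np.linalg.eigh(scatter)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, Phi = eigvals[order], Phi[:, order]

m = 2
Y = X @ Phi                                          # Y = X Phi
B = np.tile(mean @ Phi[:, m:], (X.shape[0], 1))      # b_ij = E{x_i} phi_j
X_hat = Y[:, :m] @ Phi[:, :m].T + B @ Phi[:, m:].T   # approximation with m features
squared_error = np.sum((X - X_hat) ** 2)
discarded = eigvals[m:].sum()                        # sum of the d - m smallest eigenvalues
```

Keeping the eigenvectors of the largest eigenvalues leaves only the smallest ones in `discarded`, which is why this choice minimizes the squared error.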

-28-
CHNG II
CC THUT TON PHN CM D LIU
1. Thut ton phn cum d liu da vo phn cm phn cp
1.1 Thut ton BIRCH
BIRCH is another clustering algorithm for large data sets. Its idea is that the data objects of the clusters need not all be kept in memory; only summary statistics are stored. The algorithm introduces two new concepts for keeping track of the clusters being formed: the clustering feature, which summarizes the information about a cluster, and the clustering feature tree (CF tree), a balanced tree used to store clustering features (that is, to describe the cluster summaries). A clustering feature is a triple (n, LS, SS), where n is the number of points in the subcluster, LS is the linear sum of the points, and SS is the sum of squares of the points. The CF tree is simply a balanced tree that stores these triples. It can be shown that standard statistics, such as distance measures, can be derived from the CF tree. Figure 4.10 below shows an example of a CF tree. All internal nodes store the sums of the clustering features of their child nodes, while leaf nodes store the clustering features of the data subclusters.
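
The additivity of the triple (n, LS, SS) is what lets BIRCH work from summaries alone. The following is a minimal sketch of that idea, not the thesis's implementation: the class name `ClusteringFeature` and its methods are hypothetical, and SS is assumed to be stored as a single scalar (the sum of squared norms of the points).

```python
import math


class ClusteringFeature:
    """CF triple (n, LS, SS) for points in a fixed number of dimensions."""

    def __init__(self, dim):
        self.n = 0                # number of points summarized
        self.ls = [0.0] * dim     # linear sum of the points
        self.ss = 0.0             # sum of squared norms of the points

    def add_point(self, x):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """CFs are additive: the CF of a union is the component-wise sum."""
        out = ClusteringFeature(len(self.ls))
        out.n = self.n + other.n
        out.ls = [a + b for a, b in zip(self.ls, other.ls)]
        out.ss = self.ss + other.ss
        return out

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # Root of the mean squared distance to the centroid: SS/n - ||LS/n||^2.
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self):
        # Root of the mean squared pairwise distance:
        # (2n*SS - 2||LS||^2) / (n(n-1)).
        if self.n < 2:
            return 0.0
        ls2 = sum(v * v for v in self.ls)
        value = (2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1))
        return math.sqrt(max(value, 0.0))
```

For example, summarizing the points (0, 0) and (2, 0) in separate CFs and merging them yields the same centroid, radius, and diameter as summarizing both points directly, without ever revisiting the raw data.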
A CF tree contains internal nodes and leaf nodes: an internal node has child nodes, and a leaf node has none. An internal node stores the sum of the clustering features (CF) of its children. A CF tree is characterized by two parameters:
- Branching factor B: the maximum number of children of an internal node of the tree.
- Threshold T: the maximum distance between any pair of objects in a leaf node of the tree; this distance is also called the diameter of the subclusters stored at the leaf nodes.
These two parameters affect the size of the CF tree. BIRCH runs in two phases:
Phase 1: BIRCH scans all objects in the database to build an initial CF tree, which is kept in memory. In this phase, the objects are inserted one by one into the closest leaf node of the CF tree (the leaf nodes play the role of subclusters); after each insertion, the information in the affected nodes of the CF tree is updated. If the diameter of the subcluster after insertion is larger than the threshold T, the leaf node is split. The process repeats until all objects have been inserted, and each object is read only once. To keep the whole CF tree in memory, its size must be controlled by adjusting the threshold T.
Phase 2: BIRCH selects a clustering algorithm (for example, a partitioning algorithm) to cluster the leaf nodes of the CF tree.

Figure 4.10: The CF tree used in BIRCH

The BIRCH algorithm proceeds through the following basic steps:
1. The data objects are inserted one by one into the CF tree; once all objects have been inserted, the initial CF tree is obtained. An object is inserted into the closest leaf node to form a subcluster. If the diameter of this subcluster is larger than T, the leaf node is split. When an object has been inserted into a leaf node, all nodes on the path to the root of the tree are updated with the necessary information.
2. If the current CF tree does not fit in memory, build a smaller CF tree: the size of the CF tree is controlled by the parameter T, so choosing a larger value for it merges some subclusters into a single cluster, which makes the CF tree smaller. This step does not require re-reading the data from the beginning, yet it still yields a smaller tree.
3. Perform clustering: the leaf nodes of the CF tree store the summary statistics of the subclusters. In this step, BIRCH uses these statistics to apply a clustering technique, for example K-means, and produces an initial clustering.
4. Redistribute the data objects using the centroids of the clusters found in step 3: this optional step rescans the data set and reassigns each data object to its nearest centroid. It labels the original data and removes outliers.

With the CF tree structure, BIRCH clusters data quickly and can be applied to large databases; it is also effective on data sets that grow over time. Its computational complexity is linear in the number of objects: BIRCH scans the whole data set once, with one optional additional scan (to recluster the leaf entries of the CF tree), so its running time is O(n), where n is the number of data objects. The algorithm merges nearby clusters and rebuilds the CF tree; however, each node of the CF tree can hold only a limited number of entries because of its size. BIRCH also has limitations. It may not perform well when the clusters are not spherical, because it uses the notion of radius or diameter to control cluster boundaries, and the quality of the discovered clusters is then poor. If BIRCH uses the Euclidean distance, it works well only on numerical data; moreover, the input parameter T strongly affects the natural size of the clusters. Forcing objects into subclusters in this way means that an object assigned to one cluster may really lie at the edge of another cluster, while objects close to each other may be absorbed into different clusters if they are presented to the algorithm in a different order. BIRCH is not suitable for high-dimensional data.
1.2 The CURE algorithm
Most algorithms produce clusters that are spherical and of similar size, and they are ineffective in the presence of outliers. CURE instead defines a fixed number of representative points, scattered throughout the data space, that are chosen to describe each cluster being formed. These points are generated by first selecting well-scattered objects within the cluster and then shrinking them, that is, moving them toward the cluster center by a shrinking factor. This process is repeated, so the growth of the cluster can be measured along the way. At each step of the algorithm, the two clusters with the closest pair of representative points (each point of the pair belonging to a different cluster) are merged.
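
The representative-point mechanism described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the scattered points are picked with a farthest-point heuristic, the points of a cluster are assumed to be distinct tuples, and `alpha` stands for the shrinking factor.

```python
import math


def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def scattered_points(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    center = centroid(points)
    # Start with the point farthest from the cluster center.
    reps = [max(points, key=lambda p: math.dist(p, center))]
    while len(reps) < min(c, len(points)):
        # Add the point whose nearest representative is farthest away.
        candidates = [p for p in points if p not in reps]
        reps.append(max(candidates,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps


def shrink(reps, center, alpha):
    """Move each representative a fraction alpha of the way to the center."""
    return [tuple(r[i] + alpha * (center[i] - r[i]) for i in range(len(r)))
            for r in reps]


def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: their closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)
```

At each merge step, the pair of clusters with the smallest `cluster_distance` would be merged, and the representatives of the merged cluster recomputed and shrunk again.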
Having more than one representative point per cluster allows CURE to discover clusters that are not spherical. Shrinking the representatives dampens the effect of outliers. The algorithm therefore handles outliers well and is effective for non-spherical shapes and widely varying cluster sizes. Moreover, it scales well to large databases without sacrificing clustering quality. Figure 3.14 below illustrates the processing of CURE.

Figure 3.14: Data clusters discovered by the CURE algorithm

To handle large databases, CURE uses random sampling and partitioning: a random sample is drawn and then partitioned, and each partition is partially clustered; the resulting clusters are then clustered a second time to obtain the desired clusters. A random sample, however, does not necessarily give a good description of the whole data set.


-32-
The CURE algorithm proceeds through the following basic steps:
1. Draw a random sample from the original data set.
2. Partition the sample into groups of equal size: the idea is to partition the sample into p groups of equal size, each of size n/p (n is the sample size).
3. Cluster the points of each group: cluster each group until it has been reduced to n/(pq) clusters (with q > 1).
4. Remove outliers: first, clusters are formed until the number of clusters has dropped to a fraction of the initial number of clusters. Then, where outliers were sampled together with the data during the sampling phase, the algorithm automatically removes the small groups.
5. Cluster the partial clusters: the representative objects of the clusters are moved toward the cluster center, that is, they are replaced by objects closer to the center.
6. Label the data with the corresponding cluster labels.

The computational complexity of CURE is $O(n^2 \log n)$. CURE is a reliable algorithm for discovering clusters of arbitrary shape, and it works well on data with outliers and on two-dimensional data sets. It is, however, very sensitive to parameters such as the number of representative objects and the shrinking ratio of the representatives.
1.3 The AGNES algorithm
AGNES is an agglomerative technique. It starts with each data object in a cluster of its own. Clusters are then merged according to some rule until only one cluster remains at the top of the hierarchy or a stopping condition is met. This form of hierarchical clustering is the bottom-up approach: it starts at the bottom, with the leaf nodes each in a separate cluster, and moves up the hierarchy to the root, where the single final cluster containing all data objects is found.
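
The bottom-up merging can be sketched in a few lines. The text does not fix the merging rule, so this sketch assumes single linkage (the distance between two clusters is the distance of their closest pair of points) and stops when k clusters remain; the function name `agnes` is used only for illustration.

```python
import math


def agnes(points, k):
    """Agglomerative clustering sketch: merge the two closest clusters
    (single linkage, an assumption) until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # one cluster per object
    merges = []                                   # history of merged pairs

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs,
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

Calling `agnes(points, 1)` records the full merge history, which corresponds to walking the hierarchy from the leaves up to the root.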

1.4 The DIANA algorithm
DIANA does the opposite of AGNES. It starts with all data objects in one large cluster and splits repeatedly, according to a rule-based similarity criterion, until every data object of the large cluster has been separated. This form of hierarchical clustering is the top-down approach: it starts at the top, at the root, with all data objects in one cluster, and moves down to the leaves at the bottom, where each data object is in a cluster of its own.
In both methods, the number of clusters at the different levels of the hierarchy can be obtained by moving up or down the tree. Each level may have a different number of clusters and therefore a different result. A major limitation of this approach is that once clusters have been merged or split, the decision cannot be undone, even if the merge or split turns out not to be appropriate at that level.
1.5 The ROCK algorithm
--------Main module---------
procedure cluster(S, k)
begin
    link := compute_links(S)
    for each s ∈ S do
        q[s] := build_local_heap(link, s)
    Q := build_global_heap(S, q)
    while size(Q) > k do {
        u := extract_max(Q)            // cluster with the best merge candidate
        v := max(q[u])                 // its best partner
        delete(Q, v)
        w := merge(u, v)
        for each x ∈ q[u] ∪ q[v] do {
            link[x, w] := link[x, u] + link[x, v]
            delete(q[x], u); delete(q[x], v)
            insert(q[x], w, g(x, w)); insert(q[w], x, g(x, w))
            update(Q, x, q[x])
        }
        insert(Q, w, q[w])
        deallocate(q[u]); deallocate(q[v])
    }
end

---------------------compute_links procedure-------------
procedure compute_links(S)
begin
    Compute nbrlist[i] for every point i in S
    Set link[i, j] to zero for all i, j
    for i := 1 to n do {
        N := nbrlist[i]
        for j := 1 to |N| - 1 do
            for l := j + 1 to |N| do
                link[N[j], N[l]] := link[N[j], N[l]] + 1
    }
end
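
The link computation can be made concrete with a short sketch. The listing does not say how neighbors are determined, so this example assumes that records are non-empty sets of categorical items, that two records are neighbors when their Jaccard similarity is at least a threshold `theta`, and that a record is not counted as its own neighbor; all names are illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def compute_links(records, theta):
    """link[(a, b)] = number of common neighbors of records a and b (a < b)."""
    n = len(records)
    # Neighbor list of each record: all other records with similarity >= theta.
    nbrlist = [[j for j in range(n)
                if j != i and jaccard(records[i], records[j]) >= theta]
               for i in range(n)]
    link = {}
    for nbrs in nbrlist:
        # Every pair of neighbors of the same record gains one link.
        for a, b in combinations(nbrs, 2):
            link[(a, b)] = link.get((a, b), 0) + 1
    return link
```

Pairs that never appear as keys of the returned dictionary have zero links.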

1.6 The Chameleon algorithm
Chameleon takes a different approach, using a dynamic model to determine which clusters are formed. Its first step is to build a sparse graph and then apply a graph partitioning algorithm to cluster the data into a large number of subclusters. Next, Chameleon performs agglomerative hierarchical clustering, like AGNES, merging the small subclusters according to two measures: the relative interconnectivity and the relative closeness of the subclusters. The algorithm therefore does not depend on user-supplied parameters in the way K-means does, and it can adapt.
The algorithm explores dynamic modeling in hierarchical clustering. Two clusters are merged if the interconnectivity and closeness between them are high relative to the internal interconnectivity and closeness of the objects within the clusters. The merging process readily discovers natural and homogeneous clusters, and it applies to all types of data as long as a similarity function is defined.
Chameleon overcomes weaknesses of CURE and ROCK. CURE and related schemes ignore information about the interconnectivity of the objects in two different clusters, whereas ROCK and related schemes ignore information about the closeness of two clusters and emphasize their interconnectivity instead.
Chameleon uses a graph partitioning algorithm to cluster the data objects into a relatively large number of small subclusters. It then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly merging subclusters. To determine the most similar pairs of subclusters, it takes into account both the interconnectivity and the closeness of the clusters, in particular the internal characteristics of the clusters being merged. It therefore does not depend on a static model and can adapt automatically to the internal characteristics of the clusters being merged. It is better able than CURE and DBSCAN to discover clusters of arbitrary shape with high quality, but the cost of processing high-dimensional data may reach $O(n^2)$ time for n objects in the worst case.
2. Fuzzy clustering algorithms
Fuzzy clustering is a clustering method that allows each data point to belong to two or more clusters through degrees of membership. Ruspini (1969) introduced the notion of a fuzzy partition to describe the cluster structure of a data set and proposed an algorithm for computing an optimal fuzzy partition. Dunn (1973) extended the clustering method and developed a fuzzy clustering algorithm. The idea of the algorithm is to build a fuzzy clustering method based on minimizing an objective function. Bezdek (1981) improved and generalized the fuzzy objective function by introducing the fuzziness exponent m, which led to the Fuzzy C-means (FCM) algorithm, and proved that the algorithm converges to a local minimum.
FCM has been applied successfully to a large number of clustering problems, for example in pattern recognition, image processing, and medicine. Its greatest weakness, however, is its sensitivity to noise and outliers in the data: the computed cluster centers may lie far from the true centers of the clusters. Many methods have been proposed to remedy this weakness of FCM, including possibilistic clustering (Keller, 1993), fuzzy noise clustering (Dave, 1991), clustering based on the $L_p$ norm (Kersten, 1999), and the ε-Insensitive Fuzzy C-means algorithm (εFCM).
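2.1 The FCM algorithm
The FCM algorithm consists of a sequence of iterations alternating between equations (5) and (6). FCM thus uses iteration to optimize the objective function, based on a weighted similarity measure between $x_k$ and the cluster center $v_i$; after each iteration, the algorithm computes and updates the elements $u_{ik}$ of the partition matrix U. The iteration stops when
\[
\max_{ij}\left\{\left|u_{ij}^{(k+1)} - u_{ij}^{(k)}\right|\right\} \le \varepsilon ,
\]
where $\varepsilon$ is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of $J_m(U, V)$. FCM computes the partition matrix U and the sizes of the clusters in order to obtain the fuzzy models from this matrix.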
The steps of the FCM algorithm are as follows:
Input: the number of clusters c and the fuzziness exponent m of the objective function J;
Output: c data clusters such that the objective function in (1) attains its minimum;
Begin
1. Input the parameters c (1 < c < n) and $m \in (1, +\infty)$. Initialize the matrix $V = [v_{ij}]$, $V^{(0)} \in R^{s \times c}$; set j = 0;
2. Repeat
      j := j + 1;
      Compute the fuzzy partition matrix $U^{(j)}$ according to formula (5);
      Update the centers $V^{(j)} = [v_1^{(j)}, v_2^{(j)}, \ldots, v_c^{(j)}]$ using (6) and $U^{(j)}$;
3. Until $\left\| U^{(j+1)} - U^{(j)} \right\|_F \le \varepsilon$;
4. Output the resulting clusters;
End

Here $\|*\|_F$ is the Frobenius norm, defined as
\[
\|U\|_F^2 = \mathrm{Tr}(UU^T) = \sum_{i} \sum_{k} u_{ik}^2 ,
\]
and the parameter $\varepsilon$ is given in advance.
The choice of the clustering parameters strongly affects the clustering result; they are usually chosen at random or heuristically.
As $m \to 1^+$, the algorithm becomes the crisp (hard) C-means algorithm. As $m \to \infty$, FCM becomes the fuzziest possible clustering, with
\[
u_{ik} = \frac{1}{c}.
\]
There is no rule for choosing m that guarantees effective clustering; m = 2 is the usual choice.
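
The alternating iteration can be sketched as follows. Formulas (5) and (6) are not reproduced in this section, so the sketch assumes the standard Bezdek updates, $u_{ik} = 1 / \sum_j (d_{ik}/d_{jk})^{2/(m-1)}$ and $v_i = \sum_k u_{ik}^m x_k / \sum_k u_{ik}^m$. The function name `fcm`, the explicit `init_centers` argument, and the `max_iter` safeguard are illustrative choices, not part of the original algorithm statement.

```python
import math


def fcm(data, init_centers, m=2.0, eps=1e-5, max_iter=100):
    """Fuzzy C-means sketch using the standard update formulas (assumed)."""
    centers = [list(v) for v in init_centers]
    c, n, dim = len(centers), len(data), len(data[0])
    u = [[0.0] * n for _ in range(c)]
    for _ in range(max_iter):
        # Membership update: u[i][k] is the degree to which x_k belongs to cluster i.
        new_u = [[0.0] * n for _ in range(c)]
        for k, x in enumerate(data):
            d = [math.dist(x, v) for v in centers]
            zeros = [i for i in range(c) if d[i] == 0.0]
            if zeros:
                # x coincides with a center: share full membership among those centers.
                for i in zeros:
                    new_u[i][k] = 1.0 / len(zeros)
            else:
                for i in range(c):
                    new_u[i][k] = 1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                            for j in range(c))
        # Center update: weighted mean of the data with weights u^m.
        for i in range(c):
            w = [new_u[i][k] ** m for k in range(n)]
            total = sum(w)
            centers[i] = [sum(w[k] * data[k][a] for k in range(n)) / total
                          for a in range(dim)]
        # Stop when the Frobenius norm of the change in U is at most eps.
        change = math.sqrt(sum((new_u[i][k] - u[i][k]) ** 2
                               for i in range(c) for k in range(n)))
        u = new_u
        if change <= eps:
            break
    return u, centers
```

Each column of the returned matrix sums to 1, so a crisp assignment can be obtained, if desired, by giving each point to the cluster with the largest membership.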
The complexity of FCM is comparable to that of K-means when the number of objects to be clustered is very large. In summary, FCM is an extension of K-means that aims to discover overlapping clusters; it nevertheless shares the weaknesses of K-means in handling outliers and noise in the data. The εFCM algorithm presented below is an extension of FCM designed to overcome these weaknesses.
2.2 The εFCM algorithm
Input: the number of clusters c and the parameters m and ε of the objective function J;
Output: the data clusters such that the objective function in (2) attains its minimum;
Begin
1. Input the parameters c (1 < c < n), $m \in (1, +\infty)$ and $\varepsilon > 0$. Initialize the matrix $V = [v_{ij}]$, $V^{(0)}$; set j = 0;
2. Repeat
      j := j + 1;
      Compute the fuzzy partition matrix $U^{(j)}$;
      Update the centers $V^{(j)} = [v_1^{(j)}, v_2^{(j)}, \ldots, v_c^{(j)}]$;
3. Until $\left\| U^{(j+1)} - U^{(j)} \right\|_F \le \varepsilon$;
4. Output the resulting clusters;
End.
3. Center-based clustering algorithms
3.1 The K-means algorithm
K-means is a clustering algorithm that defines clusters by the centers of their elements. The method is based on the distance between each data object and the mean of the objects in a cluster, which is regarded as the center of the cluster. It therefore needs an initial set of cluster centers and then iterates two steps: assign each object to the cluster whose center is nearest, and recompute the center of each cluster from the objects newly assigned to it. The process stops when the centers converge.


Figure 3.2: Settings that determine the initial cluster boundaries
In K-means, a value k is chosen, and then k centers are chosen at random from the data objects. The distance between each data object and the mean of each cluster is computed to find the most similar cluster, and the object is added to that cluster. From these assignments the new mean of each cluster is computed, and the process repeats until every data object belongs to one of the k clusters.
The goal of K-means is to produce k data clusters $\{C_1, C_2, \ldots, C_k\}$ from a data set of n objects in d-dimensional space, $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, $i = 1, \ldots, n$, such that the criterion function
\[
E = \sum_{i=1}^{k} \sum_{x \in C_i} D^2(x - m_i)
\]
attains its minimum, where $m_i$ is the centroid of cluster $C_i$ and D is the distance between two objects.
The centroid of a cluster is a vector in which each component is the arithmetic mean of the corresponding components of the data vectors in the cluster. The input parameter of the algorithm is the number of clusters k, and its output is the centroids of the clusters. The distance D between data objects is usually the Euclidean distance, because this distance model is easy to differentiate and its extrema are easy to locate. The criterion function and the distance measure can be specified differently depending on the application or on the user's point of view.



Figure 3.3: Computing the centroids of the new clusters
Basic steps of the K-means algorithm
Input: the number of clusters k and the cluster centroids $\{m_j\}_{j=1}^{k}$;
Output: the clusters $C_i$ ($1 \le i \le k$) and the minimum value of the criterion function E.
Begin
Step 1: Initialization
   Choose k initial centroids $\{m_j\}_{j=1}^{k}$ in the space $R^d$ (d is the dimensionality of the data). The choice may be random or based on experience.
Step 2: Compute distances
   For each point $X_i$ ($1 \le i \le n$), compute its distance to each centroid $m_j$ ($1 \le j \le k$). Then find the centroid nearest to the point.
Step 3: Update the centroids
   For each $1 \le j \le k$, update the cluster centroid $m_j$ as the arithmetic mean of the data vectors assigned to the cluster.
Stopping condition:
   Repeat steps 2 and 3 until the cluster centroids no longer change.
End.



K-means represents each cluster by the centroid of the objects in the cluster. The detailed K-means algorithm is as follows:
BEGIN
    Input n data objects
    Input k, the number of clusters
    MSE = +∞
    For i = 1 to k do
        m[i] = X[i + (i - 1) * ⌊n/k⌋];    // initialize the k centroids
    Do {
        OldMSE = MSE
        MSE' = 0
        For j = 1 to k do
            { m'[j] = 0; n'[j] = 0 }
        Endfor
        For i = 1 to n do
            For j = 1 to k do
                Compute the squared Euclidean distance D²(x[i], m[j])
            Endfor
            Find the centroid m[h] nearest to x[i]
            m'[h] = m'[h] + x[i]; n'[h] = n'[h] + 1
            MSE' = MSE' + D²(x[i], m[h])
        Endfor
        For j = 1 to k do
            { n[j] = max(n'[j], 1); m[j] = m'[j] / n[j] }
        Endfor
        MSE = MSE'
    } while (MSE < OldMSE)
END.
Where:
- MSE: the mean squared error, that is, the criterion function;
- D²(x[i], m[j]): the squared Euclidean distance from the i-th object to the j-th centroid;
- OldMSE, m'[j], n'[j]: temporary variables that hold intermediate values for the corresponding variables.
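
The listing above translates almost directly into code. The sketch below is one possible rendering, with a few labeled assumptions: the initial centroids are taken at evenly spaced positions `j * (n // k)` of the data (a simplification of the initialization formula in the listing), an empty cluster keeps its previous centroid instead of being reset, and `max_iter` is an added safeguard.

```python
import math


def kmeans(data, k, max_iter=100):
    """K-means sketch: returns (labels, centroids, criterion value)."""
    n, dim = len(data), len(data[0])
    # Assumption: evenly spaced data points serve as initial centroids.
    centers = [list(data[j * (n // k)]) for j in range(k)]
    old_mse = math.inf
    labels = [0] * n
    mse = math.inf
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        mse = 0.0
        for i, x in enumerate(data):
            # Squared Euclidean distance from x to every centroid.
            d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
            h = d2.index(min(d2))          # nearest centroid
            labels[i] = h
            counts[h] += 1
            sums[h] = [s + a for s, a in zip(sums[h], x)]
            mse += d2[h]
        for j in range(k):
            if counts[j] > 0:              # an empty cluster keeps its old centroid
                centers[j] = [s / counts[j] for s in sums[j]]
        if mse >= old_mse:                 # the criterion no longer decreases
            break
        old_mse = mse
    return labels, centers, mse
```

Because the result depends on the initial centroids, a common practice is to run such a routine several times from different starting points and keep the run with the smallest criterion value.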



Figure 3.6: Example of the shape of data clusters after clustering with K-means
The quality of K-means depends heavily on its input parameters, namely the number of clusters k and the k initial centroids. If the initial centroids are far from the natural cluster centroids, the clustering result of K-means is poor, that is, the discovered clusters deviate greatly from the actual clusters. In practice there is no general solution for choosing the input parameters; the most common approach is to experiment with different values of k and then choose the best solution.
3.2 The PAM algorithm
PAM is an extension of K-means designed to handle noisy data and outliers effectively. PAM uses medoid objects to represent the data clusters; a medoid is the most centrally located object within its cluster. Medoids are therefore less affected by objects far from the center, whereas the centroids of K-means are strongly affected by such distant points. Initially, PAM selects k medoids and assigns each remaining object to the cluster whose medoid it is most similar to.
Suppose $O_j$ is a non-medoid object and $O_m$ is a medoid. We say that $O_j$ belongs to the cluster represented by the medoid $O_m$ if $d(O_j, O_m) = \min_{O_e} d(O_j, O_e)$, where $d(O_j, O_e)$ is the dissimilarity between $O_j$ and $O_e$, and $\min_{O_e}$ is the minimum dissimilarity between $O_j$ and all medoids of the data clusters. The quality of each discovered cluster is assessed by the average dissimilarity between an object and the medoid of its cluster; that is, clustering quality is assessed through the quality of all the medoids.

Dissimilarity is determined by a distance measure, and PAM is applied to spatial data. To determine the medoids, PAM starts by selecting k arbitrary medoids. After each step, PAM tries to swap a medoid $O_m$ with a non-medoid object $O_p$, provided the swap improves the quality of the clustering; the process ends when the clustering quality no longer changes. Clustering quality is evaluated by a criterion function, and it is best when the criterion function reaches its minimum.
PAM computes the value $C_{jmp}$ for all objects $O_j$ as the basis for swapping $O_m$ and $O_p$:
$O_m$: the current medoid to be replaced;
$O_p$: the new medoid that replaces $O_m$;
$O_j$: a (non-medoid) data object that may move to another cluster;
$O_{j,2}$: the current medoid closest to $O_j$, excluding $O_m$.
Steps of the PAM algorithm
Input: a data set of n elements and the number of clusters k.
Output: k data clusters such that the quality of the partition is best.
BEGIN
1. Select k arbitrary medoids;
2. Compute $TC_{mp}$ for all pairs of objects $O_m$, $O_p$, where $O_m$ is a medoid and $O_p$ is a non-medoid object;
3. Select the pair $O_m$, $O_p$ with the minimum $TC_{mp}$. If this $TC_{mp}$ is negative, replace $O_m$ by $O_p$ and return to step 2. If it is not negative, go to step 4;
4. For each non-medoid object, find the medoid most similar to it and assign the cluster label accordingly.
END.
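
The swap loop of PAM can be sketched as below. For simplicity, this sketch computes $TC_{mp}$ as the difference between the total cost after and before the swap (which equals the sum of the individual contributions $C_{jmp}$), uses the Euclidean distance as the dissimilarity, and starts from the first k objects as the arbitrary initial medoids; these choices and the function names are assumptions made for illustration.

```python
import math


def total_cost(data, medoids):
    """Sum of distances from every object to its nearest medoid (by index)."""
    return sum(min(math.dist(x, data[m]) for m in medoids) for x in data)


def pam(data, k):
    """PAM sketch: repeatedly apply the best cost-reducing medoid swap."""
    n = len(data)
    medoids = list(range(k))               # arbitrary initial medoids
    while True:
        base = total_cost(data, medoids)
        best_delta, best_swap = 0.0, None
        for pos in range(k):               # medoid O_m to be replaced
            for p in range(n):             # candidate non-medoid O_p
                if p in medoids:
                    continue
                candidate = medoids[:pos] + [p] + medoids[pos + 1:]
                delta = total_cost(data, candidate) - base   # TC_mp
                if delta < best_delta:
                    best_delta, best_swap = delta, candidate
        if best_swap is None:              # no swap with negative TC_mp remains
            break
        medoids = best_swap
    # Label each object with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: math.dist(x, data[m])) for x in data]
    return medoids, labels
```

Recomputing the total cost for every candidate swap is simple but expensive, which illustrates why PAM becomes slow when n and k are large.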
3.3 The CLARA algorithm
CLARA was introduced to overcome the weakness of PAM when the values of k and n are large. CLARA draws a sample from the data set of n elements, applies PAM to this sample, and finds the medoids of the sample. If the sample is drawn at random, its medoids approximate the medoids of the whole original data set. To obtain a better approximation, CLARA draws several samples, clusters each of them, and keeps the best clustering result. For accuracy, the quality of the clusters is evaluated by the average dissimilarity of all data objects in the original data set, not only of those in the sample. Experiments show that 5 samples of size 40 + 2k give good results. The steps of CLARA are:
CLARA (5);
BEGIN
1. For i = 1 to 5 do
2. Draw a random sample of 40 + 2k data objects from the data set and apply PAM to this sample to find the medoids that represent the clusters.
3. For each object $O_j$ in the original data set, determine the most similar medoid among the k medoids.
4. Compute the average dissimilarity of the partition obtained in the previous step; if this value is smaller than the current minimum, use it as the new minimum, so that the set of k medoids found at this step is the best so far.
5. Return to step 1
END
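
The sampling loop can be sketched as follows. To keep the example self-contained, an exhaustive search over medoid sets stands in for PAM on each sample; this substitution is an assumption made for brevity and is only feasible for small samples and small k. The function names, the `seed` argument, and the use of the Euclidean distance are likewise illustrative.

```python
import math
import random
from itertools import combinations


def average_dissimilarity(data, medoids):
    """Average distance from each object to its nearest medoid."""
    return sum(min(math.dist(x, m) for m in medoids) for x in data) / len(data)


def best_medoids_on_sample(sample, k):
    """Stand-in for PAM: exhaustive search over all k-subsets of the sample."""
    return min(combinations(sample, k),
               key=lambda ms: average_dissimilarity(sample, ms))


def clara(data, k, num_samples=5, sample_size=None, seed=0):
    """CLARA sketch: cluster several samples, keep the best medoid set."""
    rng = random.Random(seed)
    size = min(len(data), sample_size or 40 + 2 * k)
    best, best_cost = None, math.inf
    for _ in range(num_samples):
        sample = rng.sample(data, size)
        medoids = best_medoids_on_sample(sample, k)
        # Quality is measured on the whole data set, not only on the sample.
        cost = average_dissimilarity(data, medoids)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return list(best), best_cost
```

The defaults mirror the values quoted in the text (5 samples of size 40 + 2k); when the data set is smaller than that, the whole data set is used as the sample.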
The medoid method (PAM) is inefficient on large data sets, which motivates the sample-based method called CLARA. Here, a small portion of the data is chosen as a representative of the data instead of using the whole data set, and the medoids are then chosen from this sample using PAM. If the sample is chosen at random, it should closely represent the original data set, and the representative objects (medoids) chosen from it will be similar to those that would have been chosen from the whole data set. CLARA draws several samples of the data set, applies PAM to each sample, and returns the best clustering as output; it can therefore handle larger data sets than PAM.
3.4 The CLARANS algorithm
CLARANS also uses the k-medoids approach; it combines PAM with a new heuristic search strategy. The basic idea of CLARANS is not to examine every possible replacement of a medoid by another object; instead, it replaces a medoid immediately if the replacement improves the clustering quality, without determining the optimal replacement.
CLARANS randomly picks one of the k medoids and tries to replace it with an object chosen at random from the remaining (n - k) objects. The clustering obtained after replacing a medoid is called a neighbor of the previous clustering. The number of neighbors examined is limited by the user-supplied parameter Maxneighbor, and the neighbors are selected completely at random. The parameter Numlocal lets the user set the number of local optima to search for. Not all neighbors are examined, only Maxneighbor of them. If a better neighbor is found, CLARANS moves to that neighbor's node and the process starts again; otherwise the current clustering is a local optimum. Once a local optimum has been found, CLARANS starts again from a new randomly selected node in search of a new local optimum.
CLARA has the drawback that it works on a small portion of the whole data set, chosen to represent it, and searches only within that portion. CLARANS does not confine the search space as CLARA does, and for the same amount of time the quality of the clusters it produces is higher than that of CLARA.
Some notions used in the CLARANS algorithm are defined as follows.
Let O be a set of n objects, $M \subseteq O$ the set of medoids, and NM = O - M the set of non-medoid objects. The data objects used in CLARANS are polyhedra. Each object is described by a set of edges, and each edge is defined by two points. Let $P \subseteq R^3$ be the set of all points. In general, the objects here are spatial data objects, and we define the center of an object as the arithmetic mean of all its vertices, also called its centroid:
\[
center: O \to P .
\]
Let dist be a distance function; the distance usually chosen here is the Euclidean distance:
\[
dist: P \times P \to R_0^+ .
\]
The distance function dist can be extended to polyhedra through the center function:
\[
dist: O \times O \to R_0^+ \quad \text{such that} \quad dist(o_i, o_j) = dist(center(o_i), center(o_j)).
\]
Each object is assigned to a medoid of a cluster if the distance from the centroid of the object to that medoid is the smallest. The medoid function is therefore defined as $medoid: O \to M$ such that
\[
medoid(o) = m_i, \quad m_i \in M, \quad \forall m_j \in M: dist(o, m_i) \le dist(o, m_j), \quad o \in O .
\]
Finally, the cluster of the medoid $m_i$ is defined as the subset of objects in O with $medoid(o) = m_i$.
Let $C_0$ be the set of all partitions of O. The function that evaluates the quality of a partition is defined as $total\_distance: C_0 \to R_0^+$ such that
\[
total\_distance(c) = \sum_{m_i \in M} \; \sum_{o \in cluster(m_i)} dist(o, m_i).
\]
The CLARANS algorithm in detail:
Input: O, k, dist, numlocal and maxneighbor;
Output: k data clusters;
CLARANS(int k, function dist, int numlocal, int maxneighbor)
BEGIN
    smallest_dist = +infinity;
    for (i = 1; i <= numlocal; i++) {
        current.create_randomly(k);          // random initial set of k medoids
        j = 1;
        while (j <= maxneighbor) {
            current.select_random(old, new); // random medoid and random non-medoid
            diff = current.calculate_distance_difference(old, new);
            if (diff < 0) {
                current.exchange(old, new);  // move to the better neighbor
                j = 1;
            }
            else j++; // end if
        } // end while
        total = current.calculate_total_distance();
        if (total < smallest_dist) {
            best = current;
            smallest_dist = total;
        } // end if
    } // end for
END.
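
A runnable counterpart of the listing is sketched below. It assumes point objects (rather than polyhedra) with the Euclidean distance, recomputes the total distance of each neighbor instead of computing the difference incrementally, and requires k < n; the function names and the `seed` argument are illustrative.

```python
import math
import random


def total_distance(data, medoids):
    """Sum of distances from every object to its nearest medoid (by index)."""
    return sum(min(math.dist(x, data[m]) for m in medoids) for x in data)


def clarans(data, k, numlocal=2, maxneighbor=20, seed=0):
    """CLARANS sketch: randomized search over medoid sets (requires k < n)."""
    rng = random.Random(seed)
    n = len(data)
    best, smallest = None, math.inf
    for _ in range(numlocal):
        current = rng.sample(range(n), k)          # random starting node
        cost = total_distance(data, current)
        j = 1
        while j <= maxneighbor:
            # A neighbor differs from the current node in exactly one medoid.
            pos = rng.randrange(k)
            new = rng.choice([i for i in range(n) if i not in current])
            neighbor = current[:pos] + [new] + current[pos + 1:]
            neighbor_cost = total_distance(data, neighbor)
            if neighbor_cost < cost:               # move and restart the count
                current, cost = neighbor, neighbor_cost
                j = 1
            else:
                j += 1
        if cost < smallest:                        # keep the best local optimum
            best, smallest = current, cost
    return best, smallest
```

Larger values of `maxneighbor` make each local search more thorough, while larger values of `numlocal` restart the search from more random nodes.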

4. Search-based clustering algorithms
4.1 Genetic algorithms (GAs)
Genetic algorithms (GAs), first proposed by Holland (1975), are a family of computational models inspired by the analogy of evolution and population genetics. GAs are inherently parallel and particularly suitable for solving complex optimization problems. Filho et al. (1994) presented a survey of GAs together with a simple GA written in the C language.
Usually, only two main components of a GA are problem dependent: the problem encoding and the evaluation function (for example, the objective function). Even for the same problem, different encodings can be used. For example, in the genetic k-means algorithm, Krishna and Narasimha (1999) employed the string-of-group-numbers encoding, while Maulik and Bandyopadhyay (2000) encoded the strings such that each string is a sequence of real numbers representing the cluster centers.
In GAs, the parameters of the search space are encoded in the form of strings called chromosomes. A GA maintains a population (set) of N coded strings for some fixed population size N and evolves over generations. During each generation, three genetic operators, namely natural selection, crossover, and mutation, are applied to the current population to produce a new population. Each string in the population is associated with a fitness value that depends on the value of the objective function. Based on the principle of survival of the fittest, a few strings in the current population are selected and each is assigned a number of copies; a new generation of strings is then produced by applying crossover and mutation to the selected strings.
In general, a typical GA has five basic components: encoding, initialization, selection, crossover, and mutation. Encoding depends on the problem under consideration. In the initialization phase, a population (set) of strings is generated at random. After the initialization phase comes an iteration over generations. The number of generations is specified by the user. In GAs, the best string obtained so far is stored in a separate location outside the population, and the final output is the best string among all strings inspected during the whole process.
Murthy and Chowdhury (1996) proposed a GA in an attempt to reach the optimal solution of the clustering problem. In this algorithm, the evaluation function is defined as the sum of the squared Euclidean distances of the data points from their respective cluster centers. In addition, single-point crossover (Michalewicz, 1992), that is, a crossover operator between two strings performed at one position, and the elitist strategy, that is, carrying the best string from the previous population to the next, are used.
Tseng and Yang (2001) proposed a genetic approach called CLUSTERING for the automatic clustering problem. CLUSTERING is suitable for clustering data with compact spherical clusters, and the number of clusters can be controlled indirectly by a parameter w. The algorithm produces a larger number of compact clusters for a small value of w and a smaller number of looser clusters for a large value of w. A genetic-based clustering algorithm aimed at finding nonspherical clusters was proposed by Tseng and Yang (2000).
Garai and Chaudhuri (2004) proposed a genetically guided hierarchical clustering algorithm that can find clusters of arbitrary shape. The algorithm consists of two phases. First, the original data set is decomposed into a number of fragmented groups in order to spread the GA search process of the second phase over the entire space. Then the hierarchical cluster merging algorithm (HCMA) is used. During the merging process, a technique called the adjacent cluster checking algorithm (ACCA) is used to test the adjacency of two segmented clusters so that they can be merged into one cluster.
Krishna and Narasimha (1999) and Bandyopadhyay and Maulik (2002) proposed two different clustering algorithms based on GAs and the popular k-means algorithm. In the genetic k-means algorithm (GKA), Krishna and Narasimha (1999) used the k-means operator instead of the crossover operator to accelerate convergence, while in KGA-clustering, Bandyopadhyay and Maulik (2002) used the single-point crossover operator.
Cowgill et al. (1999) proposed a genetic-based clustering algorithm called COWCLUS. In COWCLUS, the evaluation function is the variance ratio (VR), defined in terms of external cluster isolation and internal cluster homogeneity. The goal of the algorithm is to find the partition with maximum VR.
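4.2 J-means
Let $D = \{x_1, x_2, \ldots, x_n\}$ be a set of objects and let $S_D$ denote the set of all partitions of D. The clustering problem is
\[
\min_{P_D \in S_D} \sum_{i=1}^{k} \sum_{x \in C_i} \| x - z_i \|^2 ,
\]
where k is the number of clusters, $\|\cdot\|$ denotes the Euclidean norm, and $z_i$ is the centroid of cluster $C_i$,
\[
z_i = \frac{1}{|C_i|} \sum_{x \in C_i} x , \qquad i = 1, 2, \ldots, k.
\]
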
The J-means algorithm:

S1 (Initialization) Let $P_D = \{C_1, C_2, \ldots, C_k\}$ be an initial partition of D, let $z_i$ be the centroid of cluster $C_i$, and let $f_{opt}$ be the current value of the objective function;

S2 (Occupied points) Find the unoccupied points, that is, the points in D that do not coincide with a cluster centroid within a small tolerance;

S3 (Jump neighborhood) Find the best partition $P'_D$ and the corresponding objective function value $f'$ in the jump neighborhood of the current solution $P_D$:

S31 (Exploring the neighborhood) For each j (j = 1, 2, ..., n), repeat the following steps: (a) Relocation. Add a new cluster centroid $z_{k+1}$ at some unoccupied point $x_j$ and find the index i of the best centroid to delete; let $v_{ij}$ denote the change in the value of the objective function; (b) Keep the best. Keep the pair of indices $i'$ and $j'$ for which $v_{ij}$ is minimum;

S32 (Move application) Replace the centroid $z_{i'}$ by $x_{j'}$ and update the cluster memberships accordingly to obtain the new partition $P'_D$; set $f' := f_{opt} + v_{i'j'}$;

S4 (Termination or move) If $f' > f_{opt}$, stop; otherwise, move to the best neighboring solution $P'_D$: set $P'_D$ as the current solution and return to step S2.
5. Grid-based clustering algorithms
5.1 STING
STING is a grid-based multiresolution clustering technique in which the spatial area is divided into a finite number of rectangular cells; that is, grid cells are formed from smaller grid cells in order to perform clustering. There are several levels of rectangular cells corresponding to different levels of resolution in the grid structure, and these cells form a hierarchy: each cell at a high level is partitioned into a number of smaller cells at the next lower level. The data points are loaded from the database, and the values of the statistical parameters for the attributes of the data objects in each grid cell are computed from the data and stored; the parameters of a higher-level cell are obtained from the statistical parameters of its lower-level cells (this resembles the CF tree). The statistical parameters include the mean, the maximum (max), the minimum (min), the count, the standard deviation (s), and so on.
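
The way the statistics of a parent cell follow from those of its child cells can be sketched as below. The dictionary layout, the function names, and the use of the population standard deviation are assumptions made for this illustration; only a single numeric attribute is considered.

```python
import math


def cell_stats(values):
    """Statistics of a bottom-level cell computed directly from its values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    return {"count": n, "mean": mean, "std": math.sqrt(var),
            "min": min(values), "max": max(values)}


def merge_cells(children):
    """Statistics of a parent cell derived only from its children's statistics."""
    children = [c for c in children if c["count"] > 0]
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # For each child, E[x^2] = std^2 + mean^2; combine with count weights.
    ex2 = sum(c["count"] * (c["std"] ** 2 + c["mean"] ** 2)
              for c in children) / n
    return {"count": n, "mean": mean,
            "std": math.sqrt(max(ex2 - mean ** 2, 0.0)),
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}
```

Merging the statistics of two child cells gives the same result as computing the statistics over the union of their values, so higher levels of the hierarchy never need to touch the raw data.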
The data objects are inserted into the grid one by one, and the statistical parameters above are computed directly from these data objects. Spatial queries are answered by examining the relevant cells at each level of the hierarchy. A spatial query is defined as a retrieval of information about the spatial data and their relationships. STING is highly scalable, but because it uses a multiresolution approach, it depends strongly on the granularity of the lowest level. Multiresolution is the ability to decompose the data set into different levels of detail. When merging the cells of the grid structure to form clusters, STING does not consider the spatial relationship between the children and their neighboring cells (they correspond only to their parents), so the discovered clusters are isothetic: all cluster boundaries are horizontal or vertical, following the borders of the cells, and no diagonal boundary is detected.
The advantages of this approach over other clustering methods are:
- The grid-based computation is query-independent, because the statistical information stored in each cell summarizes the data in the cell rather than the actual data, and it does not depend on the query.
- The grid data structure facilitates parallel processing and incremental updating.
- The whole database is scanned once to compute the statistical parameters of the cells, so the method is efficient: the time complexity of generating the clusters is approximately O(n), where n is the total number of objects. After the hierarchy has been built, the processing time for queries is O(g), where g is the total number of grid cells at the lowest level (g << n).
The limitations of this algorithm are:
- Because a multiresolution approach is used for cluster analysis, the quality of STING clustering depends entirely on the granularity of the lowest level of the grid structure. If the granularity is very fine, the processing cost rises and the computation becomes more complex; if the bottom level is too coarse, the quality and accuracy of the cluster analysis may decrease.
The STING algorithm:
1. Determine a layer to begin with.
2. For each cell of this layer, compute the confidence interval (or estimated range) of the probability that the cell is relevant to the query.
3. From the confidence interval computed above, label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to step 6; otherwise go to step 5.
5. Go down the hierarchy by one level. Go to step 2 for the cells that form the relevant cells of the higher-level layer.
6. If the specification of the query is met, go to step 8; otherwise go to step 7.
7. Retrieve the data that fall into the relevant cells and process them further. Return the result that meets the requirement of the query. Go to step 9.
8. Find the regions of relevant cells. Return the regions that meet the requirement of the query. Go to step 9.
9. Stop.
5.2. The CLIQUE algorithm
In high-dimensional space, clusters may exist in subsets of the dimensions, also called subspaces. CLIQUE is a useful algorithm for clustering high-dimensional data in large databases into subspaces. It rests on the following ideas:
- Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. The method identifies the sparse and the dense regions of the data space, thereby discovering the overall distribution pattern of the data set.
- A unit is dense if the fraction of all data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.
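
The two definitions above (dense unit, and cluster as a maximal connected set of dense units) can be sketched for a single fixed subspace. The sketch assumes that all coordinates lie in [0, 1), that each dimension is cut into `xi` equal intervals, that `tau` is the density threshold, and that two units are connected when they share a face; the names are illustrative, and the bottom-up search across subspaces is not shown.

```python
from collections import Counter


def dense_units(points, dims, xi, tau):
    """Grid units of the subspace `dims` holding more than a fraction tau
    of all points. Assumes every coordinate lies in [0, 1)."""
    counts = Counter(tuple(int(p[d] * xi) for d in dims) for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}


def connected_clusters(units):
    """Group dense units that share a face into maximal connected sets."""
    units, clusters = set(units), []
    while units:
        stack, comp = [units.pop()], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            # Visit the two neighbors of u along every axis.
            for axis in range(len(u)):
                for step in (-1, 1):
                    v = u[:axis] + (u[axis] + step,) + u[axis + 1:]
                    if v in units:
                        units.remove(v)
                        stack.append(v)
        clusters.append(comp)
    return clusters
```

Passing `dims=(0,)` or `dims=(1,)` gives the one-dimensional dense units from which candidate higher-dimensional units would be built.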
Characteristics of CLIQUE
- It automatically finds subspaces of the high-dimensional space such that high-density clusters exist in those subspaces.
- It is insensitive to the order of the input data and does not presume any canonical data distribution.
- It scales linearly with the size of the input and scales well as the number of dimensions of the data increases.

CLIQUE partitions the data set into hyper-rectangular units and finds the dense ones, that is, the units that contain more than a given fraction of the data objects. The union of connected dense units forms the data clusters. CLIQUE, however, starts from a simple approach, so the accuracy of the clustering result may suffer, which can reduce the quality of the method.
The method first identifies the dense one-dimensional cells in the data space and examines the distribution of the data; CLIQUE then finds the dense 2-dimensional rectangles, the dense 3-dimensional ones, and so on, until the dense k-dimensional hyper-rectangles have been found. The computational complexity of CLIQUE is O(n).
5.3. The WaveCluster algorithm
WaveCluster is similar to STING, but it uses a wavelet transform to find dense regions in the space. The technique first summarizes the data by imposing a multidimensional grid structure on the data space. It then applies a wavelet transform to the original feature space and searches for dense regions in the transformed space. What makes this method more complex than the others is the transform itself.
Here, each grid cell summarizes the information of the group of points that map into the cell. This summary typically fits into main memory, where it is used by the multiresolution wavelet transform and the subsequent cluster analysis. A wavelet transform is a signal processing and image processing technique that decomposes a signal into frequency sub-bands. By applying a series of complex inverse transforms to this summary, the clusters in the data become more distinct. The clusters can then be identified by searching for dense regions in the new domain.
The method is complex, but it has the following advantages:
- It provides unsupervised clustering and removes the noise outside the cluster boundaries. Dense regions in the original feature space attract nearby points and inhibit points that are farther away. The clusters therefore stand out automatically and clear the regions around them, so outliers are removed automatically.
- The multiresolution property helps detect clusters at varying levels of accuracy.
- It is fast, with a complexity of O(n), where n is the number of objects in the database. The algorithm is also amenable to parallel processing.
- It handles large data sets efficiently, discovers clusters of arbitrary shape, handles outliers, is insensitive to the order of input, and does not depend on input parameters such as the number of clusters or a neighborhood radius.
6. Density-based clustering algorithms
6.1 The DBSCAN algorithm
DBSCAN grows sufficiently dense regions into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability; every object not contained in any cluster is considered noise. In practice, DBSCAN searches for clusters by checking for objects whose number of neighbors is at least a minimum threshold, that is, at least MinPts objects, such that for every object in a cluster there is another object of the same cluster within a distance Eps. Once all objects whose neighborhoods satisfy this condition are found, a cluster is the set of all objects that are density-connected through their neighborhoods. DBSCAN repeatedly collects the objects that are directly density-reachable from core objects, which may involve merging several density-reachable clusters. The process terminates when no new point can be added to any cluster.
DBSCAN can find clusters of arbitrary shape and, at the same time, is only slightly affected by the order of the input objects. Inserting an object affects only a specific neighborhood. On the other hand, DBSCAN uses the parameters Eps and MinPts to control the density of the clusters. It starts with an arbitrary point and builds the neighborhood that is reachable with respect to Eps and MinPts. DBSCAN therefore requires the user to specify the neighborhood radius Eps and the minimum number of neighbors MinPts. These parameters are hard to determine optimally and are usually chosen by trial or by experience. The complexity of DBSCAN is O(n^2), but if a spatial index is used to find the neighbors of a data object, the complexity improves to O(n log n). DBSCAN can be applied to large multidimensional spatial data sets; the Euclidean distance is used to measure the similarity between objects, but it is not effective for high-dimensional data [10][15].
- Definition 1: The Eps-neighborhood of a point p, denoted $N_{Eps}(p)$, is defined as $N_{Eps}(p) = \{q \in D \mid dist(p, q) \le Eps\}$, where D is the given data set.
For a point p to lie in a cluster C, $N_{Eps}(p)$ must contain at least MinPts points. Choosing the minimum number of points is itself a hard problem: if it is large, only points that lie well inside the cluster C meet the criterion, while points on the border of the cluster cannot. Conversely, if it is small, every point falls into some cluster.
Under this definition, only points in the interior of a cluster satisfy the condition for belonging to it. Points on the border of the cluster do not, because the Eps-neighborhood of a border point is usually smaller than that of a core point.
To avoid this, another criterion for cluster membership can be given: for a point p to belong to a cluster C, there must exist a point q such that $p \in N_{Eps}(q)$ and the number of points in $N_{Eps}(q)$ is at least the minimum number of points. This leads to three relations used to describe the data points, namely directly density-reachable, density-reachable, and density-connected, defined as follows:
- Definition 2: Directly density-reachable
A point p is directly density-reachable from a point q with respect to Eps and MinPts if:
1. $p \in N_{Eps}(q)$
2. $|N_{Eps}(q)| \ge MinPts$ (core point condition); the point q is called a core point.
Direct density-reachability is reflexive and symmetric for pairs of core points, and asymmetric if one of the two points is not a core point.
- Definition 3: Density-reachable
A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points $p_1, p_2, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$ for $1 \le i \le n - 1$.
Two border points of a cluster C may not be density-reachable from each other, because neither of them satisfies the core point condition.
- Definition 4: Density-connected
A point p is density-connected to a point q with respect to Eps and MinPts if there is a point o such that both p and q are density-reachable from o with respect to Eps and MinPts. Density-connectivity is symmetric and reflexive.
- Definition 5: Cluster
Let D be a set of data points. A non-empty subset C of D is called a cluster with respect to Eps and MinPts if it satisfies two conditions:
1. $\forall p, q \in D$: if $p \in C$ and q is density-reachable from p with respect to Eps and MinPts, then $q \in C$.
2. $\forall p, q \in C$: p is density-connected to q with respect to Eps and MinPts.
- Definition 6: Noise
Let $C_1, C_2, \ldots, C_k$ be the clusters of the data set D with respect to Eps and MinPts. Noise is the set of data points that do not belong to any of the clusters $C_1, C_2, \ldots, C_k$, that is, $N = \{p \in D \mid \forall i = 1, \ldots, k : p \notin C_i\}$.
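Given the two parameters Eps and MinPts, clusters can be discovered in two steps:
- Step 1: Choose an arbitrary point from the data set that satisfies the core point condition.
- Step 2: Retrieve all points that are density-reachable from the chosen core point to form the cluster.
Lemma 1: Let p be a point in D with $|N_{Eps}(p)| \ge MinPts$. Then the set $O = \{o \mid o \in D$ and o is density-reachable from p with respect to Eps and MinPts$\}$ is a cluster with respect to Eps and MinPts.
A cluster C is thus not tied to one particular core point: every point in C is density-reachable from any core point of C, so C contains exactly the points that are density-reachable from an arbitrary core point of C.
Lemma 2: Let C be a cluster with respect to Eps and MinPts, and let p be any point in C with $|N_{Eps}(p)| \ge MinPts$. Then C equals the set $O = \{o \mid o \in D$ and o is density-reachable from p with respect to Eps and MinPts$\}$.
Algorithm: DBSCAN starts with an arbitrary point p and retrieves all points that are density-reachable from p with respect to Eps and MinPts. If p is a core point, this procedure yields a cluster with respect to Eps and MinPts (Lemma 2). If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the data set.
If global values of Eps and MinPts are used, DBSCAN may merge two clusters (in the sense of Definition 5) into one cluster when two clusters of different density lie close to each other. The distance between two sets of points S1 and S2 is defined as $dist(S_1, S_2) = \min\{dist(p, q) \mid p \in S_1, q \in S_2\}$.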
The DBSCAN algorithm
-------------- Main module ----------------
DBSCAN(SetOfPoints, Eps, MinPts)
// SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  FOR i FROM 1 TO SetOfPoints.size DO
    Point := SetOfPoints.get(i);
    IF Point.ClId = UNCLASSIFIED THEN
      IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
        ClusterId := nextId(ClusterId)
      END IF
    END IF
  END FOR
END; // DBSCAN
-------- ExpandCluster procedure -------
ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  IF seeds.size < MinPts THEN // no core point
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  ELSE // all points in seeds are density-reachable from Point
    SetOfPoints.changeClId(seeds, ClId);
    seeds.delete(Point);
    WHILE seeds <> Empty DO
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      IF result.size >= MinPts THEN
        FOR i FROM 1 TO result.size DO
          resultP := result.get(i);
          IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
            IF resultP.ClId = UNCLASSIFIED THEN
              seeds.append(resultP);
            END IF;
            SetOfPoints.changeClId(resultP, ClId);
          END IF; // UNCLASSIFIED or NOISE
        END FOR;
      END IF; // result.size >= MinPts
      seeds.delete(currentP);
    END WHILE; // seeds <> Empty
    RETURN True;
  END IF;
END; // ExpandCluster
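For readers who prefer runnable code, the pseudocode can be rendered roughly as below. This is a minimal sketch under stated assumptions: points are tuples of numbers, the region query is a brute-force scan with Euclidean distance (so the whole run is O(n^2), with no spatial index), and the names `region_query` and `dbscan` as well as the label encoding (`None` for UNCLASSIFIED, `-1` for NOISE) are choices made for this illustration.

```python
import math

UNCLASSIFIED = None
NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNCLASSIFIED:
            continue
        seeds = region_query(points, i, eps)
        if len(seeds) < min_pts:  # not a core point
            labels[i] = NOISE
            continue
        # All seeds are directly density-reachable from the core point i.
        for j in seeds:
            labels[j] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            current = queue.pop(0)
            result = region_query(points, current, eps)
            if len(result) >= min_pts:  # current is itself a core point
                for j in result:
                    if labels[j] is UNCLASSIFIED:
                        queue.append(j)  # still has to be expanded
                        labels[j] = cluster_id
                    elif labels[j] == NOISE:
                        labels[j] = cluster_id  # border point found earlier
        cluster_id += 1
    return labels
```

A call such as `dbscan(points, eps=1.5, min_pts=3)` returns the cluster label of each point in input order.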
Here, SetOfPoints is either the whole data set or a cluster discovered in a previous run, and ClId (ClusterId) is the cluster label. The label of points marked as noise may be changed later if they are density-reachable from some other point of the database; this happens only for border points. The function SetOfPoints.get(i) returns the i-th element of SetOfPoints. The procedure SetOfPoints.regionQuery(Point, Eps) returns the list of data points in the Eps-neighborhood of Point in SetOfPoints. Apart from a few exceptional cases, the result of DBSCAN is independent of the order in which the data objects are visited. Eps and MinPts are two global parameters that are set manually or by experience. When Eps is small compared with the size of the data space, the average time complexity of a single region query is O(log n).
6.2. The OPTICS algorithm
This algorithm is an extension of DBSCAN that improves on it by reducing the dependence on input parameters. It does not produce an explicit clustering of the data points. Instead, it computes an ordering of the data points for automatic and interactive cluster analysis. This ordering describes the density-based clustering structure of the data. It contains information equivalent to the density-based clusterings obtained from a whole range of parameter settings, and while creating the ordering of the objects in the database it stores the core-distance and a suitable reachability-distance for each object. In addition, an algorithm has been proposed to extract clusters from this ordering. The stored information is sufficient to extract all density-based clusterings with respect to any distance ε' that is smaller than the distance ε used to generate the ordering.
The ordering is determined by two properties of the data points: the core-distance and the reachability-distance. These measures are quantities that also arise in DBSCAN, but here they are used to determine the order of the data points. The ordering starts from the data points with the smallest core-distance and proceeds in increasing order. What is distinctive about this method is that the user does not have to determine a suitable value of ε or MinPts.
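The two quantities named above can be illustrated with a short sketch. The text does not spell out their formulas, so the definitions used here are the usual ones from the OPTICS literature and should be read as an assumption: the core-distance of p is the distance to its MinPts-th nearest neighbor (counting p itself) if that distance is at most ε and is undefined otherwise, and the reachability-distance of o from p is the larger of the core-distance of p and dist(p, o). The function names are hypothetical.

```python
import math


def core_distance(points, p, eps, min_pts):
    """Distance from points[p] to its min_pts-th nearest point (p included),
    or None if fewer than min_pts points lie within eps."""
    dists = sorted(math.dist(points[p], q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]


def reachability_distance(points, o, p, eps, min_pts):
    """Reachability-distance of points[o] with respect to points[p],
    or None if points[p] is not a core point."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[o]))
```

OPTICS processes the points in order of increasing reachability-distance; the ordering loop itself is omitted here.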
The algorithm can cluster the given objects with input parameters such as ε and MinPts, but it still lets the user choose the parameter values that lead to acceptable clusters. Parameter settings are usually empirical and hard to determine, especially for high-dimensional data sets.
OPTICS has the same running-time complexity as DBSCAN, because its structure is equivalent to that of DBSCAN: O(n log n), where n is the size of the data set. The cluster ordering of a data set can be represented graphically, as illustrated in the following figure, in which three clusters can be seen; the value of ε determines the number of clusters.
6.3. The DENCLUE algorithm
DENCLUE takes a different approach from the earlier density-based clustering algorithms. It uses a mathematical formula to model how each data point influences its surroundings. This formula is called the influence function and can be seen as a function that describes the influence of a data point on its neighboring objects. Examples of influence functions are parabolic functions, the square wave function, and the Gaussian function.
DENCLUE is thus a method based on a set of density distribution functions, and it is built on the following main ideas:
- The influence of each data point can be modeled formally by a mathematical function, called the influence function, which describes the impact of the data point on its neighboring objects;
- The overall density of the data space can be modeled analytically as the sum of the influence functions of all data points;
- Clusters can be determined exactly by identifying the density attractors, which are the local maxima of the overall density function.
DENCLUE uses grid cells but keeps information only about the grid cells that actually contain data points. It manages these cells in a tree-based access structure and is therefore faster than some influential algorithms such as DBSCAN. However, the method requires a careful choice of the density parameter and the noise threshold, since the choice of these parameters strongly affects the quality of the clustering results.
Definition: Let x and y be two objects in a d-dimensional space denoted $F^d$. The influence function of an object $y \in F^d$ on an object x is a function $f_B^y : F^d \to R_0^+$ that is defined in terms of a basic influence function $f_B$:
$$f_B^y(x) = f_B(x, y).$$
The influence function can be an arbitrary function; basically, it is determined by the distance d(x, y) between the two vectors in the d-dimensional space, for example the Euclidean distance. The distance function is reflexive and symmetric. Examples of influence functions are:
- Square wave influence function:
$$f_{square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{if } d(x, y) \le \sigma \end{cases}$$
where σ is a threshold.
- Gaussian influence function:
$$f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
The density function at a point $x \in F^d$ is defined as the sum of the influence functions of all data points. Given n data objects described by a set of vectors $D = \{x_1, \ldots, x_n\} \subset F^d$, the density function is defined as:
$$f_B^D(x) = \sum_{i=1}^{n} f_B^{x_i}(x)$$
The density function based on the Gaussian influence function is:
$$f_{Gauss}^D(x) = \sum_{i=1}^{n} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
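The influence and density functions defined above translate almost directly into code. The following is a small sketch, assuming the Euclidean distance for d(x, y) and points given as tuples; the function names are chosen for this illustration only, and the hill-climbing search for density attractors is not shown.

```python
import math


def square_wave_influence(x, y, sigma):
    """1 if y lies within distance sigma of x, otherwise 0."""
    return 0.0 if math.dist(x, y) > sigma else 1.0


def gaussian_influence(x, y, sigma):
    """Gaussian influence of y on x, exp(-d(x, y)^2 / (2 sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))


def density(x, data, sigma, influence=gaussian_influence):
    """Overall density at x: the sum of the influences of all data points."""
    return sum(influence(x, xi, sigma) for xi in data)
```

For instance, with `data = [(0, 0), (1, 0)]` and `sigma = 1.0`, the Gaussian density at `(0, 0)` is `1 + exp(-0.5)`: the point's own influence plus that of its neighbor at distance 1.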


DENCLUE depends heavily on the noise threshold and the density parameter, but compared with other clustering algorithms it has the following main advantages:
- It has a solid mathematical foundation and generalizes other clustering methods, including hierarchical and partitioning methods.
- It has good clustering properties for data sets with large amounts of noise.
- It allows clusters of arbitrary shape in high-dimensional data sets to be described by a mathematical formula.
The computational complexity of DENCLUE is O(n log n). Density-based algorithms do not apply sampling to the data set as partitioning algorithms do, because sampling may increase the complexity: the density of the objects in the sample differs from the density of the whole data set.
7. Model-based clustering algorithms
7.1 The EM algorithm
The EM algorithm is regarded either as a model-based algorithm or as an extension of the K-means algorithm. EM assigns each object to the given clusters according to the probability that the object belongs to each component distribution. The probability distribution most often used is the Gaussian distribution, and the aim is to find good values for its parameters, using the log-likelihood of the data objects as the criterion function; this function is well suited to modeling the probability of the data objects. EM can discover clusters of many different shapes. However, because the algorithm needs many iterations to determine good parameters, its computational cost is rather high. Several improvements of EM have been proposed that exploit properties of the data: objects can be compressed, retained in memory, or discarded. In these improvements, an object is discarded once its cluster label is known with certainty, compressed if it is not discarded and belongs to a cluster that is too large for memory, and retained in all other cases.
The algorithm consists of two steps, and the process is repeated until the problem is solved:
- h b h a E

+
=
+
=
2
1
,
2
1
2
1
:
-
) ( 6
, :
d c b
b a
b a M
+ +
+
=
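1. Initialize the parameters:
$$\lambda_0 = \{\mu_1^{(0)}, \mu_2^{(0)}, \ldots, \mu_K^{(0)}, p_1^{(0)}, p_2^{(0)}, \ldots, p_K^{(0)}\}$$
2. E step:
$$P(\omega_j \mid x_k, \lambda_t) = \frac{P(x_k \mid \omega_j, \lambda_t)\, P(\omega_j \mid \lambda_t)}{P(x_k \mid \lambda_t)} = \frac{P(x_k \mid \omega_j, \mu_j^{(t)}, \sigma^2)\, p_j^{(t)}}{\sum_{i} P(x_k \mid \omega_i, \mu_i^{(t)}, \sigma^2)\, p_i^{(t)}}$$
3. M step:
$$\mu_i^{(t+1)} = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(\omega_i \mid x_k, \lambda_t)}$$
$$p_i^{(t+1)} = \frac{\sum_k P(\omega_i \mid x_k, \lambda_t)}{R}$$
4. Repeat steps 2 and 3 until the result is reached.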
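The four steps can be sketched for one-dimensional data as follows. This is only an illustration under explicit assumptions: all components share a known standard deviation `sigma` (so only the means and mixing proportions are re-estimated), a fixed number of iterations replaces a convergence test, and the function name and argument names are hypothetical. Because the variance is shared, the Gaussian normalizing constant cancels in the E step and is omitted.

```python
import math


def em_gaussian_mixture_1d(xs, mus, ps, sigma=1.0, iters=50):
    """Run EM for a 1-D Gaussian mixture with fixed, shared sigma.

    xs: data values; mus: initial means; ps: initial mixing proportions.
    Returns the re-estimated (means, proportions).
    """
    mus, ps = list(mus), list(ps)  # do not modify the caller's lists
    k, r = len(mus), len(xs)
    for _ in range(iters):
        # E step: responsibility of each component for each data value.
        resp = []
        for x in xs:
            w = [ps[i] * math.exp(-((x - mus[i]) ** 2) / (2 * sigma ** 2))
                 for i in range(k)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M step: responsibility-weighted means and proportions.
        # (Assumes every component keeps some responsibility mass.)
        for i in range(k):
            mass = sum(row[i] for row in resp)
            mus[i] = sum(row[i] * x for row, x in zip(resp, xs)) / mass
            ps[i] = mass / r
    return mus, ps
```

Starting from rough guesses for the means, the estimates move toward the centers of the groups present in the data.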

7.2 The COBWEB algorithm
COBWEB is an approach that represents data objects as attribute-value pairs. It works by building a classification tree, which is similar in concept to the tree used by BIRCH, although the tree structures differ. Each node of the classification tree represents a concept of the data objects, and all points beneath that node belong to it. COBWEB uses a classification utility measure to manage the tree structure. Clusters are formed on the basis of a similarity measure that weighs similarity against dissimilarity; both can be described by how attribute values are divided among the nodes of a class. The tree structure can also be merged or split when a new node is inserted into the tree. Two improved methods based on COBWEB are CLASSIT and AutoClass.
CHAPTER III
APPLICATIONS OF DATA CLUSTERING
1. Image segmentation
Image segmentation is a fundamental component of many computer vision applications and can be treated as a clustering problem (Rosenfeld and Kak 1982). The segmentation of the images presented to an image analysis system depends on the scene to be sensed, the imaging geometry, the configuration and the sensor used to convert the scene into a digital image, and ultimately on the desired output (goal) of the system.
The applicability of clustering methods to the image segmentation problem was recognized more than three decades ago, and the pioneering efforts remain the basis of what is used today. A recurring theme is to define a feature vector at every pixel, composed of functions of the image intensity and functions of the pixel location itself. This idea is depicted in Figure 25 below. It has been used very successfully for intensity images (with or without texture), range (depth) images, and multispectral images.

Figure 25. Feature representation for clustering. Image measurements and positions are transformed to features. Clusters in feature space correspond to image segments.
1.1. Definition of image segmentation
Image segmentation is usually understood as decomposing an input image into regions (separate classes of objects); each object is called a sub-image. To distinguish one object from another and to simplify the subsequent analysis steps, each object is assigned a label. In essence, image segmentation is a pattern matching operation. Each segmented sub-image carries attributes (intensity, color, texture).
Let
$$\mathcal{X} = \{x_{ij},\ i = 1 \ldots N_r,\ j = 1 \ldots N_c\}$$
be an input image with $N_r$ rows and $N_c$ columns and observed value $x_{ij}$ at pixel (i, j). The segmentation can then be expressed as
$$\mathcal{S} = \{S_1, \ldots, S_k\},$$
with the l-th segment $S_l$ consisting of a connected subset of the pixel coordinates. No two segments share any pixel location ($S_i \cap S_j = \emptyset\ \forall i \ne j$), and the union of all segments covers the entire image ($\bigcup_{i=1}^{k} S_i = \{1 \ldots N_r\} \times \{1 \ldots N_c\}$). Jain and Dubes [1981], following Fu and Mui [1981], identify three techniques for producing a segmentation from an input image: region-based techniques, edge-based techniques, and segmentation by clustering.
Consider the use of simple gray-level thresholding to segment a high-contrast intensity image. Figure 26(a) shows a grayscale image of the bar code of a textbook, scanned on a flatbed scanner. Part b shows the result of a simple thresholding operation designed to separate the dark and light regions in the bar code area. Binarization steps of this kind are often used in character recognition systems. Thresholding in effect clusters the image pixels into two groups based on the one-dimensional intensity measurement [Rosenfeld 1969; Dunn et al. 1974]. A postprocessing step separates the classes into connected regions. Simple gray-level thresholding is adequate in carefully controlled image acquisition environments, and much research has been devoted to suitable thresholding methods [Weszka 1978; Trier and Jain 1995], but complex images require more elaborate segmentation techniques.
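The view of thresholding as clustering pixels into two groups can be made concrete with a short sketch. The method below is an assumption introduced for illustration, not the procedure of the cited works: it treats the gray levels as one-dimensional data and runs a two-cluster K-means, placing the threshold halfway between the two cluster means. The function name is hypothetical.

```python
def two_means_threshold(intensities, max_iters=100):
    """Threshold that splits gray levels into a dark and a light cluster
    by running 1-D K-means with K = 2."""
    t = (min(intensities) + max(intensities)) / 2  # initial guess
    for _ in range(max_iters):
        dark = [v for v in intensities if v <= t]
        light = [v for v in intensities if v > t]
        if not dark or not light:
            break  # all pixels fell on one side; keep the current threshold
        # The midpoint of the two cluster means is the K-means boundary.
        new_t = (sum(dark) / len(dark) + sum(light) / len(light)) / 2
        if new_t == t:
            break  # assignments no longer change
        t = new_t
    return t
```

A binarized image is then obtained by testing each pixel against the returned threshold, for example `[v > t for v in intensities]`.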
Many segmenters use measurements that are both spectral (for example, from the multispectral scanners used in remote sensing) and spatial (based on the location of the pixel in the image plane). The measurement at each pixel therefore corresponds directly to the notion of a pattern.

Figure 26. Binarization by thresholding. (a): Original grayscale image. (b): Gray-level histogram. (c): Result of thresholding.
1.2 Image segmentation based on data clustering
The application of local feature clustering to the segmentation of gray-scale images was documented in Schachter et al. [1979]. That paper emphasizes the appropriate choice of features at each pixel rather than the clustering method, and proposes using the image plane coordinates (spatial information) as additional features in clustering-based segmentation. The goal of the clustering is to obtain a sequence of hyperellipsoidal clusters, starting with cluster centers placed at the locations of maximum density in the pattern space and growing the clusters around these centers until a $\chi^2$ goodness-of-fit test is violated. A range of features is discussed and applied to both gray-scale and color images.
An agglomerative clustering algorithm was applied by Silverman and Cooper [1988] to the problem of unsupervised learning of clusters of coefficient vectors for two image models that correspond to image segments. The first model is polynomial in the observed image measurements; the assumption is that the image is a collection of several adjoining graph surfaces, each a polynomial function of the image plane coordinates, sampled on the raster grid to produce the observed image. The algorithm proceeds by obtaining the vectors of coefficients of least-squares fits to the data in M disjoint image windows. An agglomerative clustering algorithm then merges, at each step, the two clusters with the minimum global between-cluster Mahalanobis distance. The same framework was applied to the segmentation of textured images, but for such images the polynomial model is inappropriate, and a parameterized Markov random field model was assumed instead.
Wu and Leahy [1993] describe the application of the principles of network flow to unsupervised classification, yielding a novel hierarchical algorithm for clustering. In essence, the technique views the unlabeled patterns as nodes of a graph, in which the weight of an edge (that is, its capacity) is a measure of similarity between the corresponding nodes. Clusters are identified by removing edges of the graph to produce disjoint connected subgraphs. In image segmentation, pixels that are 4-neighbors or 8-neighbors in the image plane share edges in the constructed graph, and the weight of a graph edge is based on the strength of a hypothesized image edge between the pixels involved (this strength is computed with simple derivative masks). This segmenter therefore works by finding closed contours in the image and is best labeled edge-based rather than region-based.
In Vinod et al. [1994], two neural networks are designed that together perform pattern clustering. A two-layer network operates on a multidimensional histogram of the data to identify "prototypes", which are used to classify the input patterns into clusters. These prototypes are passed to the classification network, another two-layer network operating on the histogram of the input data, which is trained to have weights that differ from those of the prototype selection network. In both networks, the histogram of the image is used to weight the contributions of the patterns neighboring the one under consideration to the location of the prototypes or to the final classification. The approach is therefore likely to be more robust than techniques that assume an underlying parametric density function for the pattern classes. The architecture was tested on gray-scale and color segmentation problems.
Jolion et al. [1991] describe a process for extracting clusters sequentially from the input pattern set by identifying hyperellipsoidal regions that contain a specified fraction of the unclassified points in the set. The extracted regions are compared with the best-fitting multivariate Gaussian density through a Kolmogorov-Smirnov test, and the quality of the fit is used as a figure of merit for selecting the "best" region at each iteration. The process continues until a stopping criterion is satisfied. The procedure was applied to the problems of threshold selection for multithreshold segmentation of intensity images and of segmentation of range images.
Clustering techniques have also been used successfully for the segmentation of range images, which are a popular source of input data for three-dimensional object recognition systems [Jain and Flynn 1993]. Range sensors typically return raster images in which the measured value at each pixel is the coordinates of a 3D location in space. These 3D positions can be understood as the locations where rays emerging from the image plane locations intersect the objects in front of the sensor.
The local feature clustering concept is particularly attractive for range image segmentation because (unlike intensity measurements) the measurements at each pixel all have the same unit (length). This makes ad hoc transformations or normalizations of the image features unnecessary if their goal is to impose equal scaling on those features. However, range image segmenters often add further measurements to the feature space, which removes this advantage.
The range image segmentation system described in Hoffman and Jain [1987] uses squared-error clustering in a six-dimensional feature space as the source of an "initial" segmentation, which is refined (typically by merging segments) into the output segmentation. The technique was enhanced in Flynn and Jain [1991] and used in a recent systematic comparison of range image segmenters [Hoover et al. 1996]. It is thus probably one of the longest-lived range image segmenters that performs well on a large variety of range images.
This segmenter works as follows. At each pixel (i, j) of the input range image, the corresponding 3D measurement is denoted $(x_{ij}, y_{ij}, z_{ij})$, where $x_{ij}$ is a linear function of j (the column number) and $y_{ij}$ is a linear function of i (the row number). A k × k neighborhood of (i, j) is used to estimate the 3D surface normal $n_{ij} = (n_{ij}^x, n_{ij}^y, n_{ij}^z)$ at (i, j), typically by finding the least-squares planar fit to the 3D points in the neighborhood. The feature vector for the pixel at (i, j) is the six-dimensional measurement $(x_{ij}, y_{ij}, z_{ij}, n_{ij}^x, n_{ij}^y, n_{ij}^z)$, and a candidate segmentation is found by clustering these feature vectors. For practical reasons, not every pixel's feature vector is used in the clustering procedure; typically 1,000 feature vectors are chosen by subsampling.
The CLUSTER algorithm [Jain and Dubes 1988] is used to obtain a segment label for each pixel. CLUSTER is an enhancement of the k-means algorithm; it can identify several clusterings of a data set, each with a different number of clusters. Hoffman and Jain [1987] also experimented with other clustering techniques (for example, complete-link, single-link, graph-theoretic, and other squared-error algorithms) and found that CLUSTER provides the best combination of performance and accuracy. A further advantage of CLUSTER is that it produces a sequence of output clusterings (that is, a 2-cluster solution up through a $K_{max}$-cluster solution, where $K_{max}$ is specified by the user and is typically 20 or so); each clustering in this sequence yields a clustering statistic that combines between-cluster separation and within-cluster scatter. The clustering that optimizes this statistic is chosen as the best one. Each pixel of the range image is assigned the segment label of the nearest cluster center. This minimum-distance classification step is not guaranteed to produce segments that are connected in the image plane; therefore, a connected-components labeling algorithm assigns new labels to disjoint regions that were placed in the same cluster. Subsequent operations include surface type tests, merging of adjacent patches using a test for the presence of crease or jump edges between adjacent segments, and surface parameter estimation.

Figure 27. Range image segmentation by clustering. (a): Input range image. (b): Surface normals for selected image pixels. (c): Initial segmentation (19-cluster solution) returned by CLUSTER using 1,000 six-dimensional samples from the image as the pattern set. (d): Final segmentation (8 segments) obtained from (c) after postprocessing.
Figure 27 shows this process applied to a range image. Part a of the figure shows the input range image; part b shows the distribution of surface normals. Part c shows the initial segmentation returned by CLUSTER and modified to guarantee connected segments. Part d shows the final segmentation, produced by merging adjacent patches that have no significant crease edge between them. The final clusters reasonably represent the distinct surfaces present in this complex object.
The analysis of textured images has interested researchers for several years. Texture segmentation techniques have been developed using a variety of texture models and image operations. In Nguyen and Cohen [1993], texture image segmentation is addressed by modeling the image as a hierarchy of two Markov random fields, extracting some simple statistics from each image block to form a feature vector, and clustering the blocks with a fuzzy K-means method. The clustering procedure is modified to jointly estimate the number of clusters and the fuzzy membership of each feature vector in the various clusters.
A system for segmenting texture images is described by Jain and Farrokhnia [1991]. There, Gabor filters are used to obtain a set of 28 orientation- and scale-selective features that characterize the texture in the neighborhood of each pixel. The 28 features are reduced to a smaller number by a feature selection procedure, and the resulting features are preprocessed and then clustered with the CLUSTER program.
An index statistic [Dubes 1987] is used to select the best clustering. Minimum-distance classification is used to label each pixel of the original image. The technique was tested on several texture mosaics, including natural Brodatz textures and synthetic images. Figure 28(a) shows an input texture mosaic consisting of four popular Brodatz textures [Brodatz 1966]. Part b shows the segmentation produced when the Gabor filter features are augmented with spatial information (pixel coordinates). This Gabor filter based technique has proven very powerful and has been extended to the automatic segmentation of text in documents [Jain and Bhattacharjee 1992] and the segmentation of objects in complex backgrounds [Jain et al. 1997].

Figure 28. Results of texture image segmentation. (a): Four-class texture mosaic. (b): Four-cluster solution produced by the CLUSTER algorithm with pixel coordinates included in the feature set.
Clustering can be used as a preprocessing stage to identify pattern classes for subsequent supervised classification. Taxt and Lundervold [1994] and Lundervold et al. [1996] describe a partitional clustering algorithm and a manual labeling technique used to identify material classes (for example, cerebrospinal fluid, white matter, striated muscle, tumor) in registered images of a human head obtained at five different magnetic resonance imaging channels (yielding a five-dimensional feature vector at each pixel). A number of clusterings were obtained and combined with domain knowledge (human expertise) to identify the different classes. Decision rules for supervised classification were then based on the classes obtained. Figure 29(a) shows one channel of an input multispectral image; part b shows the 9-cluster result. The K-means algorithm was applied to the segmentation of LANDSAT images in Solberg et al. [1996]. The initial cluster centers were chosen interactively by a trained operator and correspond to land-use classes such as urban areas, soil (vegetation-free) areas, forest, grassland, and water. Figure 30(a) shows the input image rendered in gray scale; part b shows the result of the clustering procedure.

Figure 29. Multispectral medical image segmentation. (a): A single channel of the input image. (b): 9-cluster segmentation.

Figure 30. LANDSAT image segmentation. (a): Original image (ESA / EURIMAGE / Sattelitbild). (b): Clustered scene.
2. Object and character recognition
2.1 Object recognition
The use of clustering to group views of 3D objects for the purpose of object recognition in range data is described in Dorai and Jain [1995]. The term view refers to a range image of an object obtained from an arbitrary viewpoint. The system under consideration uses a viewpoint-dependent (or view-centered) approach to the object recognition problem; each object to be recognized is represented by a library of range images of that object.
A 3D object has many possible views, and one goal of that work is to avoid matching an unknown input view against every image of every object. A common theme in the object recognition literature is indexing, in which the unknown view is used to select a subset of the views of a subset of the objects in the database for further comparison, and all other views of objects are rejected. One approach to indexing uses the notion of view classes; a view class is the set of qualitatively similar views of an object. In that work, the view classes are identified by clustering; the rest of this subsection outlines the technique.
Object views are grouped into classes based on the similarity of their shape spectral features. Each input image of an object viewed in isolation yields a feature vector that characterizes it. The feature vector contains the first ten central moments of the normalized shape spectral distribution, $\bar{H}(h)$, of an object view. The shape spectrum of an object view is obtained from its range data by constructing a histogram of shape index values (which are related to the surface curvature values) and accumulating all object pixels that fall into each bin. Normalizing the spectrum with respect to the total object area removes the scale (size) differences that may exist between different objects. The first moment $m_1$ is computed as the weighted mean of $\bar{H}(h)$:
$$m_1 = \sum_h h\, \bar{H}(h). \quad (1)$$
The other central moments, $m_p$, $2 \le p \le 10$, are defined as:
$$m_p = \sum_h (h - m_1)^p\, \bar{H}(h). \quad (2)$$
The feature vector is then denoted by $R = (m_1, m_2, \ldots, m_{10})$, with each moment lying in the range [-1, 1].
Let $O = \{O_1, O_2, \ldots, O_n\}$ be a collection of n 3D objects whose views are present in the model database $M_D$. The i-th view of the j-th object, $O_j^i$, in the database is represented by $\langle L_j^i, R_j^i \rangle$, where $L_j^i$ is the object label and $R_j^i$ is the feature vector.
Given a set of object representations $R^i = \{\langle L_1^i, R_1^i \rangle, \ldots, \langle L_m^i, R_m^i \rangle\}$ that describes m views of the i-th object, the goal is to derive a partition of the views, $P^i = \{C_1^i, C_2^i, \ldots, C_{k_i}^i\}$. Each cluster in $P^i$ contains those views of the i-th object that have been judged similar, based on the dissimilarity between the corresponding moment features of the shape spectra of the views. The measure of dissimilarity between $R_j^i$ and $R_k^i$ is defined as:
$$D(R_j^i, R_k^i) = \sum_{l=1}^{10} (R_{jl}^i - R_{kl}^i)^2 \quad (3)$$
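Equations (1) to (3) can be computed as in the sketch below. It assumes that the shape spectrum is given as a list of histogram counts together with the shape index value at the center of each bin; the function names and the argument layout are hypothetical choices for this illustration.

```python
def moment_features(hist, bin_centers, order=10):
    """Feature vector (m_1, ..., m_order) of a shape spectrum.

    hist: pixel count per histogram bin; bin_centers: shape index value per bin.
    """
    total = sum(hist)
    h_bar = [c / total for c in hist]  # normalize by total object area
    m1 = sum(h * w for h, w in zip(bin_centers, h_bar))  # equation (1)
    features = [m1]
    for p in range(2, order + 1):  # equation (2): central moments
        features.append(sum((h - m1) ** p * w for h, w in zip(bin_centers, h_bar)))
    return features


def dissimilarity(r_a, r_b):
    """Equation (3): sum of squared differences of two moment vectors."""
    return sum((a - b) ** 2 for a, b in zip(r_a, r_b))
```

The resulting pairwise dissimilarities are what a hierarchical clustering of the views would take as input.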
Clustering of views
A database containing about 3,200 range images of 10 different sculpted objects, with 320 views per object, is used [Dorai and Jain 1995].
The range images from 320 possible viewpoints (determined by the tessellation of the view sphere using an icosahedron) of the objects were synthesized. Figure 31 shows a subset of the collection of views of the Cobra object used in the experiment. The shape spectrum of each view is computed, and its feature vector is then determined. The views of each object are clustered, based on the dissimilarity measure D between their moment vectors, using the single-link hierarchical clustering scheme [Jain and Dubes 1988]. The hierarchical grouping obtained with the 320 views of the Cobra object is shown in Figure 32. The view grouping hierarchies of the other nine objects are similar to the dendrogram in Figure 32. This dendrogram is cut at a dissimilarity level of 0.1 or less to obtain compact and well-separated clusters. The clusterings obtained in this way demonstrate that the views of each object fall into several clearly distinguishable clusters. The centroid of each of these clusters is determined by computing the mean of the moment vectors of the views that fall into the cluster.





Figure 31. A subset of the views of the Cobra object, chosen from the set of 320 views.
Dorai and Jain [1995] demonstrate that this clustering-based view grouping procedure facilitates object matching in terms of classification accuracy and the number of matches needed for the correct classification of test views. Object views are grouped into compact and homogeneous view clusters, which demonstrates the power of the cluster-based scheme for view organization and efficient object matching.

Figure 32. Hierarchical grouping of the 320 views of a cobra sculpture.
2.2 Character recognition
A recognition technique based on clustering was developed by Connell and Jain [1998] to identify lexemes in handwritten text for the purpose of writer-independent handwriting recognition. The success of a handwriting recognition system depends vitally on its acceptance by potential users. Writer-dependent systems provide a higher level of recognition accuracy than writer-independent systems, but they require a large amount of training data. A writer-independent system, on the other hand, must be able to recognize a wide variety of writing styles in order to satisfy an individual user. As the variability of the writing styles that the system must capture increases, it becomes harder to discriminate between different classes because of the amount of overlap in the feature space. One solution to this problem is to separate the data from these disparate writing styles for each class into different subclasses, called lexemes. These lexemes represent portions of the data that are more easily separated from the data of classes other than the one to which the lexeme belongs.
In this system, handwriting is captured by digitizing the (x, y) position of the pen and the state of the pen tip (up or down) at a constant sampling rate. After some resampling, normalization, and smoothing, each pen stroke is represented as a variable-length string of points. A metric based on elastic template matching and dynamic programming is defined so that the distance between two strokes can be computed.
Using the distances computed in this way, a proximity matrix is built for
each class of digits (i.e., 0 through 9). Each matrix measures the intraclass
distances for one digit class. The digits in a class are clustered in an
attempt to find a small number of prototypes. Clustering is performed with the
CLUSTER program described above [Jain and Dubes 1988], in which the feature
vector of a digit consists of its N proximities to the digits of the same
class. CLUSTER attempts to produce the best clustering for each value of K
over some range, where K is the number of clusters into which the data is to
be partitioned. As expected, the mean squared error (MSE) decreases
monotonically as a function of K. The "optimal" value of K is chosen by
identifying a "knee" in the plot of MSE versus K. When a cluster of digits is
represented by a single prototype, the best on-line recognition results are
obtained by using the digit closest to the cluster center. With this scheme,
a correct recognition rate of 99.33% was obtained.
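The two selection steps described above, choosing K at the knee of the MSE
curve and representing a cluster by the member nearest its center, can be
sketched in Python as follows. The text does not say how the knee is located,
so scoring the bend by the second difference is an assumption of this sketch,
as are the function names `knee` and `prototype` and the sample MSE values.

```python
def knee(mse_by_k):
    """Return the K with the sharpest bend in the MSE-versus-K curve.

    mse_by_k maps consecutive integer values of K to the MSE obtained for that K.
    The bend is scored by the second difference (a heuristic assumed here).
    """
    ks = sorted(mse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = mse_by_k[prev] - 2 * mse_by_k[k] + mse_by_k[nxt]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


def prototype(members):
    """Return the member vector closest to the mean of the cluster."""
    dim = len(members[0])
    center = [sum(v[d] for v in members) / len(members) for d in range(dim)]
    return min(members, key=lambda v: sum((v[d] - center[d]) ** 2 for d in range(dim)))


# Made-up MSE values: the curve flattens after K = 3
print(knee({1: 100, 2: 60, 3: 30, 4: 27, 5: 25}))
print(prototype([(0, 0), (1, 1), (5, 5)]))
```

Picking an actual member as the prototype, rather than the mean itself, keeps
the prototype a real handwritten digit that the stroke metric can compare
against.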
3. Information retrieval
Information retrieval (IR) is concerned with the automatic storage and
retrieval of documents [Rasmussen 1992]. Many university libraries use IR
systems to provide access to books, journals, and other documents. Libraries
use the Library of Congress Classification (LCC) scheme for efficient storage
and retrieval of books. The LCC scheme consists of classes labeled A to Z
[LC Classification Outline 1990], which are used to characterize books on
different subjects. For example, label Q corresponds to books in the sciences,
and subclass QA is assigned to mathematics. Labels QA76 to QA76.8 are used to
classify books related to computers and other areas of computer science.
Several problems are associated with classifying books using the LCC scheme.
Some of them are listed below:
(1) When a user searches a library for books on a topic of interest, the LCC
number alone may not retrieve all the relevant books. This is because the
classification number assigned to a book, or the subject categories typically
entered in the database, do not contain enough information about all the
topics covered in the book. To illustrate this point, consider the book
Algorithms for Clustering Data by Jain and Dubes [1988]. Its LCC number is
'QA 278.J35'. In this LCC number, QA 278 corresponds to the topic 'cluster
analysis', J corresponds to the first author's name, and 35 is the serial
number assigned by the Library of Congress. The subject categories for this
book provided by the publisher (which are typically entered in a database to
facilitate search) are cluster analysis, data processing, and algorithms. One
chapter of this book [Jain and Dubes 1988] deals with computer vision, image
processing, and image segmentation. A user looking for literature on computer
vision, and image segmentation in particular, will therefore not be able to
reach this book by searching the database with either the LCC number or the
subject categories provided in the database. The LCC number for computer
vision books is TA 1632 [LC Classification 1990], which is very different from
the number QA 278.J35 assigned to this book.
(2) There is an inherent problem in assigning LCC numbers to books in a
rapidly developing area. Consider, for example, the area of neural networks.
Initially, category 'QP' of the LCC scheme was used to label books and
conference proceedings in this area. For example, the Proceedings of the
International Joint Conference on Neural Networks [IJCNN'91] was assigned the
number 'QP 363.3'. However, most recent books on neural networks are given a
number under the category label 'QA'; the Proceedings of IJCNN'92 [IJCNN'92]
is assigned the number 'QA 76.87'. Multiple labels for books dealing with the
same topic force them to be placed on different stacks in a library. Hence,
classification labels in an emerging discipline need to be updated from time
to time.
(3) Assigning a number to a new book is a difficult problem. A book may deal
with topics corresponding to two or more LCC numbers, so assigning a unique
number to such a book is difficult.
Murty and Jain [1995] describe a knowledge-based clustering scheme to group
representations of books, which are obtained using the ACM CR (Association for
Computing Machinery Computing Reviews) classification tree [ACM CR
Classifications 1994]. Authors contributing to the various ACM publications
use this tree to provide keywords in the form of ACM CR category labels. The
tree has 11 nodes at the first level, labeled A to K. Each node in the tree
has a label that is a string of one or more symbols, and these symbols are
alphanumeric characters. For example, I515 is the label of a fourth-level node
in the tree.
3.1 Pattern representation
Each book is represented as a generalized list [Sangal 1991] of these strings
using the ACM CR classification tree. For brevity of representation, the
fourth-level nodes in the ACM CR classification tree are labeled using the
numerals 1 to 9 and the characters A to Z. For example, the child nodes of
I.5.1 (models) are labeled I.5.1.1 to I.5.1.6. Here, I.5.1.1 corresponds to
the node labeled deterministic, and I.5.1.6 stands for the node labeled
structural. In a similar fashion, all the fourth-level nodes in the tree can
be labeled as needed. From now on, the dots between successive symbols are
omitted to simplify the representation. For example, I.5.1.1 is written as
I511.
We illustrate this representation process with the book by Jain and Dubes
[1988]. There are five chapters in this book. For simplicity of processing,
only the information in the chapter contents is considered. There is a single
entry in the table of contents for Chapter 1, 'Introduction', so no keywords
are extracted from it. Chapter 2, titled 'Data Representation', has section
titles that correspond to the labels of nodes in the ACM CR classification
tree [ACM CR Classifications 1994], which are given below:
(1a) I522 (feature evaluation and selection),
(2b) I532 (similarity measures), and
(3c) I515 (statistical).
Based on the above analysis, Chapter 2 of Jain and Dubes [1988] can be
characterized by the weighted disjunction ((I522 ∨ I532 ∨ I515) (1,4)). The
weights (1,4) denote that it is one of the four chapters that play a role in
the representation of the book. Based on the table of contents, one or more of
the strings I522, I532, and I515 can be used to represent Chapter 2. In the
same way, the other chapters of the book can be represented as weighted
disjunctions based on the table of contents and the ACM CR classification
tree. The representation of the entire book, the conjunction of all these
chapter representations, is given by (((I522 ∨ I532 ∨ I515) (1,4)) ∧
((I515 ∨ I531) (2,4)) ∧ ((I541 ∨ I46 ∨ I434) (1,4))).
Currently, these representations are generated by hand, by scanning the
tables of contents of books in computer science, since the ACM CR
classification tree provides knowledge about computer science books only.
Details of the collection of books used in this study are available in Murty
and Jain [1995].
3.2 Similarity measure
The similarity between two books is based on the similarity between the
corresponding strings. Two well-known distance functions between a pair of
strings are the Hamming distance and the edit distance [Baeza-Yates 1992].
Neither of these two distance functions can be used meaningfully in this
application, as the following example shows. Consider the three strings I242,
I233, and H242. These strings are the labels (predicate logic for knowledge
representation, logic programming, and distributed database systems) of three
fourth-level nodes in the ACM CR classification tree. Nodes I242 and I233 are
grandchildren of the node labeled I2 (artificial intelligence), and H242 is a
grandchild of the node labeled H2 (database management). The distance between
I242 and I233 should therefore be smaller than that between I242 and H242.
However, the Hamming distance and the edit distance [Baeza-Yates 1992] both
have a value of 2 between I242 and I233 and a value of 1 between I242 and
H242. This limitation motivates the definition of a new similarity measure
that correctly captures the similarity between these strings. The similarity
between two strings is defined as the ratio of the length of the longest
common prefix [Murty and Jain 1995] of the two strings to the length of the
first string. For example, the similarity between the strings I522 and I51 is
0.5. The proposed similarity measure is not symmetric, because the similarity
between I51 and I522 is 0.67. The minimum and maximum values of this
similarity measure are 0.0 and 1.0, respectively. Knowledge of the
relationships between nodes in the ACM CR classification tree is captured by
the string representation. For example, the node labeled pattern recognition
is represented by the string I5, whereas the string I53 corresponds to the
node labeled clustering. The similarity between these two nodes (I5 and I53)
is 1.0. A symmetric measure of similarity [Murty and Jain 1995] is used to
build a similarity matrix of size 100 x 100 for the 100 books used in the
experiments.
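The prefix-based similarity and the comparison with the Hamming distance can
be checked with a short Python sketch. The function names are chosen for this
example only; the symmetric variant used to build the 100 x 100 matrix is not
defined in the text and is therefore not reproduced here.

```python
def common_prefix_len(s, t):
    """Length of the longest common prefix of two strings."""
    n = 0
    for a, b in zip(s, t):
        if a != b:
            break
        n += 1
    return n


def similarity(s, t):
    """Longest common prefix length divided by the length of the first string."""
    return common_prefix_len(s, t) / len(s)


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


print(similarity("I522", "I51"))             # I5 is shared: 2 / 4
print(round(similarity("I51", "I522"), 2))   # same prefix, shorter first string: 2 / 3
print(similarity("I5", "I53"))               # a node compared with its descendant
print(hamming("I242", "I233"), hamming("I242", "H242"))
print(similarity("I242", "I233"), similarity("I242", "H242"))
```

The last two lines show the point made above: the Hamming distance ranks H242
closer to I242 than I233 is, whereas the prefix similarity ranks I233 (0.5)
above H242 (0.0), in agreement with the tree.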
3.3 An algorithm for clustering books
The clustering problem can be stated as follows. Given a collection B of
books, we need to obtain a set C of clusters. A proximity dendrogram (a tree
of clusters) [Jain and Dubes 1988], obtained with the complete-link
agglomerative clustering algorithm for the collection of 100 books, is shown
in Figure 33. Seven clusters are obtained by choosing a threshold (τ) value of
0.12. It is well known that different values of τ may give different
clusterings. This threshold value is chosen because the "gap" in the
dendrogram between the levels at which six and seven clusters are formed is
the largest. An examination of the subject areas of the books [Murty and Jain
1995] in these clusters revealed that the clusters obtained are indeed
meaningful. Each cluster is represented by a list of pairs of a string s and
a frequency sf, where sf is the number of books in the cluster in which s is
present. For example, cluster C1 contains 43 books on pattern recognition,
neural networks, artificial intelligence, and computer vision; part of its
representation R(C1) is given below.
R(C1) = ((B718,1), (C12,1), (D0,2),
(D311,1), (D312,2), (D321,1),
(D322,1), (D329,1), ... (I46,3),
(I461,2), (I462,1), (I463,3),
... (J26,1), (J6,1),
(J61,7), (J71,1))
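A cluster representation of this kind, a list of (string, frequency) pairs, can
be computed by counting in how many books of the cluster each label occurs.
The Python sketch below assumes that each book is given as a set of ACM CR
label strings; the function names and the three-book cluster are made up for
illustration and are not data from Murty and Jain [1995].

```python
from collections import Counter


def cluster_representation(books):
    """books maps a book id to the set of label strings found in that book.

    Returns sorted (label, number of books containing the label) pairs.
    """
    freq = Counter()
    for labels in books.values():
        freq.update(set(labels))   # count each label at most once per book
    return sorted(freq.items())


def clusters_containing(label, representations):
    """Names of the clusters whose representation contains the given label."""
    return [name for name, rep in representations.items()
            if any(s == label for s, _ in rep)]


toy_cluster = {"B1": {"I46", "I5"}, "B2": {"I46"}, "B3": {"J61", "I46"}}
rep = cluster_representation(toy_cluster)
print(rep)
print(clusters_containing("I46", {"C1": rep, "C2": [("H2", 4)]}))
```

A query such as image segmentation (I46) is then answered by selecting the
clusters whose representation contains that string.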


These clusters of books and the corresponding cluster descriptions can be
used as follows. If a user is searching for books on, say, image segmentation
(I46), then cluster C1 is selected because its representation alone contains
the string I46. Books B2 (Neurocomputing) and B18 (Neural Networks: Lateral
Inhibition) are both members of cluster C1 even though their LCC numbers are
quite different (B2 is QA76.5.H4442, B18 is QP363.3.N33).
Four additional books, labeled B101, B102, B103, and B104, were used to study
the problem of assigning classification numbers to new books. The LCC numbers
of these books are: (B101) Q335.T39, (B102) QA76.73.P356C57, (B103)
QA76.5.B76C.2, and (B104) QA76.9D5W44. These books are assigned to clusters by
nearest-neighbor classification. The nearest neighbor of B101, a book on
artificial intelligence, is B23, so B101 is assigned to cluster C1. The
assignment of these four books to their respective clusters is observed to be
meaningful, which demonstrates that knowledge-based clustering is useful in
solving problems associated with document retrieval.
4. Data mining
In recent years we have seen ever-increasing volumes of collected data of all
kinds. With so much data available, it is necessary to develop algorithms that
can extract meaningful information from these vast stores. Searching for
useful nuggets of information among huge amounts of data has become known as
the field of data mining.
Data mining can be applied to relational, transaction, and spatial databases,
as well as to large stores of unstructured data such as the World Wide Web.
Many data mining systems are in use today; applications include the U.S.
Treasury detecting money laundering, National Basketball Association coaches
detecting trends and patterns of play for individual players and teams, and
the categorization of patterns of children in the foster care system [Hedberg
1996]. Several journals have recently published special issues on data mining
[Cohen 1996, Cross 1996, Wah 1996].



4.1 Data mining approaches.
Data mining, like clustering, is an exploratory activity, so clustering
methods are well suited to data mining. Clustering is often an important
initial step among the several steps of the data mining process [Fayyad 1996].
Data mining approaches that use clustering include database segmentation,
predictive modeling, and visualization of large databases.
Segmentation. Clustering methods are used in data mining to segment databases
into homogeneous groups. This can serve the purpose of data compression
(working with the clusters rather than with individuals), or it can identify
characteristics of subpopulations that can be targeted for specific purposes
(e.g., marketing aimed at senior citizens).
The K-means clustering algorithm [Faber 1994] has been used to cluster pixels
in Landsat images [Faber et al. 1994]. Each pixel originally has 7 values from
different satellite bands, including infrared. These 7 values are difficult
for humans to assimilate and analyze without assistance. Pixels with the 7
feature values are clustered into 256 groups, and each pixel is then assigned
the value of its cluster centroid. The image can then be displayed with the
spatial information intact. A human viewer can look at a single picture,
identify a region of interest (e.g., a highway or a forest), and label it as a
concept. The system then identifies other pixels in the same cluster as
instances of that concept.
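The quantization step, clustering the pixel vectors and then replacing each
pixel by its cluster centroid, can be sketched in Python as follows. This is a
minimal K-means under stated assumptions, not the algorithm of Faber [1994]:
the first k points serve as initial centroids, the iteration count is fixed,
and the example uses four 2-value "pixels" and k = 2 in place of 7 bands and
256 groups. All names are chosen for this sketch.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain K-means; the first k points are the (naive) initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:   # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def quantize(points, k):
    """Replace every point by the centroid of the cluster it belongs to."""
    centroids, labels = kmeans(points, k)
    return [centroids[lab] for lab in labels]


pixels = [(0, 0), (10, 10), (0, 2), (10, 12)]
print(quantize(pixels, 2))
```

Since only the pixel values change and not their positions, the quantized
image keeps its spatial layout, which is what lets a viewer label a region and
have the system generalize to the rest of the cluster.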
Predictive modeling. Statistical methods of data analysis usually involve
hypothesis testing of a model that the analyst already has in mind. Data
mining can help the user discover potential hypotheses before statistical
tools are applied. Predictive modeling uses clustering to group items, then
infers rules that characterize the groups and suggest models. For example,
magazine subscribers can be clustered on a number of factors (age, sex,
income, etc.), and the resulting groups are then characterized in an attempt
to find a model that distinguishes subscribers who will renew their
subscriptions from those who will not [Simoudis 1996].
Visualization. Clusters in large databases can be used for visualization, to
help human analysts identify groups and subgroups with similar
characteristics. WinViz [Lee and Ong 1996] is a data mining visualization tool
in which derived clusters can be exported as new attributes, which the system
can then characterize. For example, breakfast cereals are clustered according
to calories, protein, fat, sodium, fiber, carbohydrate, sugar, potassium, and
vitamin content per serving. On seeing the resulting clusters, the user can
export them to WinViz as attributes. The system shows that one of the clusters
is characterized by high potassium content, and the human analyst recognizes
the members of that cluster as belonging to the "bran" cereal family, which
leads to the generalization that "bran cereals are high in potassium."
4.2 Mining large unstructured databases.
Data mining has usually been performed on transactional and relational
databases, which have well-defined fields that can be used as features, but
there has been recent research on large unstructured databases such as the
World Wide Web [Etzioni 1996].
Examples of recent attempts to classify Web documents using words, or
functions of words, as features include Maarek and Shaul [1996] and Chekuri et
al. [1999]. However, relatively small sets of labeled training samples and
very high dimensionality limit the ultimate success of automatic Web document
categorization based on words as features.
Rather than grouping documents in a word feature space, Wulfekuhler and Punch
[1997] cluster the words from a small collection of World Wide Web documents
in the document space. The sample data set consisted of 85 documents from the
manufacturing domain in 4 different user-defined categories (labor, legal,
government, and design). These 85 documents contained 5,190 distinct word
stems after common words (the, and, of) were removed. Since the words are
certainly not uncorrelated, they should fall into clusters in which words that
are used consistently across the document set have similar frequency values in
each document.
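Clustering words in the document space means that each word, not each
document, becomes a data point whose coordinates are its frequencies in the
individual documents. The Python sketch below builds such vectors. It splits on
whitespace and does no stemming, which simplifies what Wulfekuhler and Punch
[1997] describe; the function name and the three toy documents are assumptions
of this example.

```python
from collections import Counter


def word_vectors(documents, stop_words):
    """Map each word to its list of per-document frequencies.

    documents is a list of strings; words in stop_words are dropped.
    """
    counts = [Counter(w for w in doc.lower().split() if w not in stop_words)
              for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # One coordinate per document: the word's frequency in that document
    return {w: [c[w] for c in counts] for w in vocabulary}


docs = ["the patent file", "file the patent patent", "labor law"]
vectors = word_vectors(docs, stop_words={"the"})
for word, vec in vectors.items():
    print(word, vec)
```

Words that occur together in the same documents, such as "patent" and "file"
here, receive similar vectors and are therefore likely to fall into the same
cluster.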
K-means clustering was used to group the 5,190 words into 10 groups. One
surprising result was that, on average, 92% of the words fell into a single
cluster, which could then be discarded for data mining purposes. The smallest
clusters contained terms that seem semantically related to a human reader. The
7 smallest clusters from a typical run are shown in Figure 34.
Terms that are used in ordinary contexts, and unique terms that do not occur
often across the training document set, tend to cluster into the large group
of about 4,000 members. This takes care of spelling errors, infrequent proper
names, and terms that are used in the same way throughout the entire document
set. Terms used in specific contexts (such as file in the context of filing a
patent, rather than a computer file) appear in documents together with other
terms appropriate to that context (patent, invent) and thus tend to cluster
together. Among the groups of words, unique contexts stand out from the crowd.
After the largest cluster is discarded, the smaller set of features can be
used to construct queries that seek out other relevant documents on the Web
using standard Web search tools (e.g., Lycos, Alta Vista, Open Text).
Searching the Web with terms taken from the word clusters allows the discovery
of finer-grained topics (e.g., family medical leave) within the broadly
defined categories (e.g., labor).
4.3 Data mining in geological databases.
Database mining is a critical resource in oil exploration and production. It
is common knowledge in the oil industry that the typical cost of drilling a
new offshore well is in the range of $30-40 million, but the chance of that
site being an economic success is 1 in 10. More informed and systematic
drilling decisions can significantly reduce overall production costs.
Advances in drilling technology and data collection methods have led oil
companies and their ancillaries to collect large amounts of
geophysical/geological data from production wells and exploration sites, and
then to organize them into large databases. Data mining techniques have
recently been used to derive precise analytic relations between observed
phenomena and parameters. These relations can then be used to quantify oil and
gas reserves.

In qualitative terms, good recoverable reserves have high hydrocarbon
saturation, are trapped by highly porous sediments (reservoir porosity), and
are surrounded by hard bulk rock that prevents the hydrocarbon from leaking
away. A large volume of porous sediment is crucial to finding good recoverable
reserves, so developing reliable and accurate methods for estimating sediment
porosity from the collected data is key to estimating hydrocarbon potential.
The general rule of thumb that experts use for computing porosity is that it
is a quasi-exponential function of depth:
$$\text{Porosity} = K \cdot e^{-F(x_1, x_2, \ldots, x_m) \cdot \text{Depth}} \qquad (4)$$
A number of factors, such as rock types, structure, and cementation, enter
as parameters of the function F and confound this relationship. This makes it
necessary to define proper contexts in which to attempt the discovery of
porosity formulas. Geological contexts are expressed in terms of geological
phenomena, such as geometry, lithology, compaction, and subsidence, associated
with a region. It is well known that the geological context changes from basin
to basin (different geographical areas of the world) and also from region to
region within a basin [Allen and Allen 1990; Biswas 1995]. Furthermore, the
underlying features of contexts may vary greatly. Simple model-matching
techniques, which work in engineering domains where behavior is constrained by
man-made systems and well-established laws of physics, may not apply in the
hydrocarbon exploration domain. To address this, data clustering was used to
identify the relevant contexts, and equation discovery was then carried out
within each context. The goal was to derive the subset x1, x2, ..., xm from a
larger set of geological features, together with the functional relationship F
that best defines the porosity function in a region.
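To make the rule of thumb concrete: if, within one context, F is treated as a
single constant, then Porosity = K · e^(-F · Depth) becomes a straight line
after taking logarithms, ln(Porosity) = ln K - F · Depth, and K and F can be
estimated by ordinary least squares. The Python sketch below does this.
Treating F as a constant is a simplifying assumption of this example (in the
text F depends on the features x1, ..., xm), and the function name and the
synthetic measurements are made up.

```python
import math


def fit_porosity(depths, porosities):
    """Least-squares fit of porosity = K * exp(-F * depth) with constant F.

    Fits a straight line to (depth, ln porosity) and returns (K, F).
    """
    ys = [math.log(p) for p in porosities]
    n = len(depths)
    mean_x = sum(depths) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in depths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depths, ys))
    slope = sxy / sxx                        # equals -F
    k = math.exp(mean_y - slope * mean_x)    # intercept is ln K
    return k, -slope


# Synthetic, noise-free measurements generated with K = 0.5 and F = 0.001
depths = [0.0, 1000.0, 2000.0, 3000.0]
porosities = [0.5 * math.exp(-0.001 * d) for d in depths]
k, f = fit_porosity(depths, porosities)
print(round(k, 6), round(f, 6))
```

Fitting one such curve per context, instead of one curve for the whole basin,
is the reason the contexts are identified by clustering first.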
The overall methodology, illustrated in Figure 35, consists of two main
steps: (i) context definition using unsupervised clustering techniques, and
(ii) equation discovery by regression analysis [Li and Biswas 1995]. Real
exploration data collected from a region of the Alaska basin was analyzed
using the methodology developed. The data objects (patterns) are described in
terms of 37 geological features, such as porosity, permeability, grain size
density, sorting, the amounts of different mineral fragments present (e.g.,
quartz, chert, feldspar), the nature of the rock fragments, pore
characteristics, and cementation. All these feature values are numeric
measurements made on samples obtained from well logs during exploratory
drilling.
The K-means clustering algorithm was used to identify a set of homogeneous
primitive geological structures (g1, g2, ..., gm). These primitives were then
mapped onto the unit code versus stratigraphic unit map. Figure 36 depicts a
partial mapping for a set of wells and four primitive structures. The next
step in the discovery process identified sections of well regions that are
made up of the same sequence of geological primitives. Each sequence defines a
context Ci. From the partial mapping of Figure 36, the context
C1 = g2 · g1 · g2 · g3 was identified in two well regions (the 300 and 600
series). After the contexts were defined, the data points belonging to each
context were grouped together for equation derivation. The derivation
procedure employed regression analysis [Sen and Srivastava 1990].
This method was applied to a data set of about 2,600 objects corresponding to
sample measurements collected from wells in the Alaska basin. K-means
clustered this data set into seven groups. As an illustration, a set of 138
objects representing one context was selected for further analysis. The
features that best define this cluster were selected, and experts surmised
that the context represents a low-porosity region, which was modeled using the
regression procedure.
4.4 Summary
There are many applications in which decision making and exploratory pattern
analysis have to be performed on large data sets. For example, in document
retrieval, a set of relevant documents has to be found among several million
documents with a dimensionality of more than 1000. Such problems can be
handled if a useful abstraction of the data is obtained and used in decision
making, rather than using the entire data set directly. By data abstraction we
mean a simple and compact representation of the data. This simplicity helps a
machine process the data efficiently and helps a human comprehend the
structure in the data easily. Clustering algorithms are ideally suited to
achieving data abstraction.

In this work, we have examined the various steps in clustering: (1) pattern
representation, (2) similarity computation, (3) the grouping process, and (4)
cluster representation. We have also discussed statistical, fuzzy, neural,
evolutionary, and knowledge-based approaches to clustering, and we have
described four applications of clustering: (1) image segmentation, (2) object
recognition, (3) document retrieval, and (4) data mining.














Figure 36. Area code versus stratigraphic unit map for part of the studied
region.
Clustering is the process of grouping data items based on a measure of
similarity. Clustering is a subjective process: the same set of data items
often needs to be partitioned differently for different applications. This
subjectivity makes clustering difficult, because no single algorithm or
approach is adequate to solve every clustering problem. A possible solution
lies in reflecting this subjectivity in the form of knowledge. This knowledge
is used, implicitly or explicitly, in one or more phases of clustering.
Knowledge-based clustering algorithms use domain knowledge explicitly.
The most challenging step in clustering is feature extraction, or pattern
representation. Pattern recognition researchers conveniently avoid this step
by assuming that the pattern representations are available as input to the
clustering algorithm. For small data sets, pattern representations can be
obtained from the user's previous experience with the problem. For large data
sets, however, it is difficult for the user to keep track of the importance of
each feature in clustering. One solution is to make as many measurements on
the patterns as possible and to use them in the pattern representation. But a
large collection of measurements cannot be used directly in clustering because
of the computational cost. Several feature extraction/selection approaches
have therefore been designed to obtain linear or nonlinear combinations of
these measurements that can be used to represent patterns. Most of the schemes
proposed for feature extraction/selection are iterative in nature and cannot
be used on large data sets because of their computational cost.
The second step in clustering is similarity computation. A variety of schemes
have been used to compute the similarity between two patterns. They use
knowledge either implicitly or explicitly. Most knowledge-based clustering
algorithms use explicit knowledge in similarity computation. However, if the
patterns are not represented using proper features, a meaningful partition
cannot be obtained, irrespective of the quality and quantity of the knowledge
used in similarity computation. There is no universally acceptable scheme for
computing the similarity between patterns represented by a mixture of
qualitative and quantitative features. Dissimilarity between a pair of
patterns is represented by a distance measure that may or may not be a metric.
The next step in clustering is the grouping step. There are broadly two
kinds of grouping schemes: hierarchical and partitional. Hierarchical schemes
are more versatile, and partitional schemes are less expensive. Partitional
algorithms aim at minimizing the squared-error criterion function. Motivated
by the failure of squared-error partitional clustering algorithms to find the
optimal solution to this problem, a large collection of approaches have been
proposed and used to obtain the globally optimal solution. However, these
schemes are computationally prohibitive on large data sets. Clustering schemes
based on artificial neural networks (ANNs) are neural implementations of the
clustering algorithms, and they share the undesirable properties of those
algorithms. However, ANNs are able to normalize the data and extract features
automatically. An important observation is that even if a scheme can find the
optimal solution to the squared-error partitioning problem, it may still fall
short of the requirements because the clusters may be non-isotropic.
In some applications, for example document retrieval, it may be useful to
have a clustering that is not a partition, which means that the clusters
overlap. Fuzzy clustering and functional clustering are ideally suited to this
purpose. Fuzzy clustering algorithms can also handle mixed data types.
However, a major problem with fuzzy clustering is that the membership values
are difficult to obtain. A general approach may not work because of the
subjective nature of clustering. The clusters obtained need to be represented
in a suitable form to help the decision maker. Knowledge-based clustering
schemes generate intuitively appealing descriptions of clusters. They can be
used even when the patterns are represented by a combination of qualitative
and quantitative features, provided that knowledge linking a concept to the
mixed features is available. However, implementations of conceptual clustering
schemes are computationally expensive and are not suitable for grouping large
data sets.
The K-means algorithm and its neural implementation, the Kohonen net, are
the most successfully used on large data sets. This is because the K-means
algorithm is simple to implement and computationally attractive owing to its
linear time complexity. However, even this linear-time algorithm is not
feasible on very large data sets. Incremental algorithms such as leader, and
its neural implementation, the ART network, can be used to cluster large data
sets, but they tend to be order-dependent. Divide and conquer is a heuristic
that has been rightly exploited in computer algorithm design to reduce
computational cost. However, it should be used judiciously in clustering to
achieve meaningful results.
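The order dependence of incremental algorithms can be seen in a small Python
sketch of the leader algorithm: each point joins the first existing leader
within a threshold, or otherwise becomes a new leader. The restriction to
one-dimensional points, the threshold value, and the function name are choices
made for this example only.

```python
def leader(points, threshold):
    """Single-pass leader clustering of one-dimensional points.

    Returns (leaders, labels), where labels[i] is the index of the leader
    that points[i] was assigned to.
    """
    leaders, labels = [], []
    for p in points:
        for idx, q in enumerate(leaders):
            if abs(p - q) <= threshold:
                labels.append(idx)    # join the first leader that is close enough
                break
        else:
            leaders.append(p)         # no leader is close: p starts a new cluster
            labels.append(len(leaders) - 1)
    return leaders, labels


print(leader([0, 1, 2], threshold=1))   # 0 leads; 2 is too far away and leads too
print(leader([1, 0, 2], threshold=1))   # 1 leads; both other points join it
```

The same three points give two clusters in one presentation order and a
single cluster in the other. This is the price paid for reading the data only
once.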
In summary, clustering is an interesting, useful, and challenging problem. It
has great potential in applications such as object recognition, image
segmentation, and information filtering and retrieval. However, this potential
can be exploited only after several design choices have been made carefully.

CONCLUSION
Issues studied in the thesis
The thesis surveys and studies the basic theory and practical applications of
data clustering. With the explosive growth of information technology and of
the databases it produces, the need to study, refine, and apply data
clustering methods and techniques is pressing and of great significance.
Chapter 1 presents an overview of the theory of data clustering, together
with some theory directly related to data mining. Chapter 2 gives a general
introduction to data clustering algorithms; because there are very many of
them, the thesis covers only some common, widely used ones. Chapter 3
discusses some typical applications of data clustering: image segmentation,
object and character recognition, information retrieval, and data mining.

DEVELOPMENT DIRECTION OF THE TOPIC

Data clustering and its applications are a necessary and important research
direction. It is also a very broad area that covers many methods and
techniques and has formed many different groups of approaches.
While carrying out this thesis, I tried to focus on the research and
consulted many documents, articles, and scientific journals from Vietnam and
abroad. Nevertheless, my knowledge is still limited, so shortcomings and
limitations are unavoidable. I look forward to further guidance and comments
from my teachers and from other researchers.

DIRECTIONS FOR FURTHER RESEARCH

- Continue to study the theory of data clustering.
- Build and develop further techniques and applications of data clustering.

APPENDIX:
BUILDING A DATA CLUSTERING PROGRAM WITH
THE K-MEANS ALGORITHM IN VISUAL BASIC 6.0

Program interface:



* The user chooses the number of clusters and then clicks at arbitrary
positions in the frame to enter data points (X, Y).
The program forms clusters by minimizing the squared distance between each
data point and the centroid of its cluster. Each point represents one object,
and its coordinates (X, Y) describe two attributes of that object. The color
of a point and the numbered label indicate the cluster.
* The K-means clustering algorithm works as follows:
If the number of data points does not exceed the number of clusters, each
data point is made the centroid of its own cluster, and each centroid receives
a cluster number. If the number of data points exceeds the number of clusters,
then for each data point we compute the distance to all centroids and take the
minimum. The data point is said to belong to the cluster whose centroid is
nearest to it.
Since the positions of the centroids are not known for certain, they are
adjusted based on the data currently assigned to them. All data points are
then reassigned to the new centroids. This process is repeated until no data
point moves to another cluster. Mathematically, this loop can be proved to
converge.

Example: the program output after a run with the number of clusters set to 9




Program source code
Option Explicit

' Row 0 = cluster, 1 = X, 2 = Y; one data point per column
Private Data()
' Centroids (X and Y) of the clusters; number of clusters = number of columns
Private Centroid() As Single
Private totalData As Integer    ' Total number of data points (number of columns)
Private numCluster As Integer   ' Total number of clusters

' ##############################################################
' Form controls
' + Form_Load
' + cmdReset_Click
' + txtNumCluster_Change
' + Picture1_MouseDown
' + Picture1_MouseMove
' ##############################################################

Private Sub Form_Load()
    Dim i As Integer

    Picture1.BackColor = &HFFFFFF   ' Background color = white
    Picture1.DrawWidth = 10         ' Size of a plotted point
    Picture1.ScaleMode = 3          ' Pixels

    ' Read the number of clusters
    numCluster = Int(txtNumCluster)
    ReDim Centroid(1 To 2, 1 To numCluster)
    For i = 0 To numCluster - 1
        ' Create the centroid labels
        If i > 0 Then Load lblCentroid(i)
        lblCentroid(i).Caption = i + 1
        lblCentroid(i).Visible = False
    Next i
End Sub


Private Sub cmdReset_Click()
    ' Reset the data
    Dim i As Integer

    Picture1.Cls            ' Clear the picture
    Erase Data              ' Delete the data
    totalData = 0

    For i = 0 To numCluster - 1
        lblCentroid(i).Visible = False   ' Hide the labels
    Next i

    ' Allow the number of clusters to be changed again
    txtNumCluster.Enabled = True
End Sub

Private Sub txtNumCluster_Change()
    ' Change the number of clusters and reset the data
    Dim i As Integer

    For i = 1 To numCluster - 1
        Unload lblCentroid(i)
    Next i
    numCluster = Int(txtNumCluster)
    ReDim Centroid(1 To 2, 1 To numCluster)
    ' Call the cmdReset_Click event
    For i = 0 To numCluster - 1
        If i > 0 Then Load lblCentroid(i)
        lblCentroid(i).Caption = i + 1
        lblCentroid(i).Visible = False
    Next i
End Sub


Private Sub Picture1_MouseDown(Button As Integer, Shift As Integer, _
                               X As Single, Y As Single)
    ' Collect a data point and display the result
    Dim colorCluster As Integer
    Dim i As Integer

    ' Disable changing the number of clusters while data is present
    txtNumCluster.Enabled = False

    ' Store the new data point
    totalData = totalData + 1
    ReDim Preserve Data(0 To 2, 1 To totalData)   ' Note: rows start at 0
    Data(1, totalData) = X
    Data(2, totalData) = Y

    ' Run k-means clustering
    Call kMeanCluster(Data, numCluster)

    ' Display the result
    Picture1.Cls
    For i = 1 To totalData
        colorCluster = Data(0, i) - 1
        ' If the color is white (same as the background), use another color
        If colorCluster = 7 Then colorCluster = 12
        X = Data(1, i)
        Y = Data(2, i)
        Picture1.PSet (X, Y), QBColor(colorCluster)
    Next i

    ' Show the centroids
    For i = 1 To min2(numCluster, totalData)
        lblCentroid(i - 1).Left = Centroid(1, i)
        lblCentroid(i - 1).Top = Centroid(2, i)
        lblCentroid(i - 1).Visible = True
    Next i
End Sub


Private Sub Picture1_MouseMove(Button As Integer, Shift As Integer, _
                               X As Single, Y As Single)
    lblXYValue.Caption = X & "," & Y
End Sub

' ##############################################################
' FUNCTIONS
' + kMeanCluster: main clustering routine
' + dist: computes a distance
' + min2: returns the smaller of two numbers
' ##############################################################

Sub kMeanCluster(Data() As Variant, numCluster As Integer)
    ' Main routine: cluster the data into k clusters
    ' Input:  + Data matrix (0 To 2, 1 To totalData); row 0 = cluster, 1 = X,
    '           2 = Y; one data point per column
    '         + numCluster: number of clusters requested by the user
    '         + module-level variables: Centroid, totalData
    ' Output: o) updated centroids
    '         o) cluster number assigned to each data point (= row 0 of Data)
    Dim i As Integer
    Dim j As Integer
    Dim X As Single
    Dim Y As Single
    Dim min As Single
    Dim cluster As Integer
    Dim d As Single
    Dim sumXY()
    Dim isStillMoving As Boolean

    isStillMoving = True
    If totalData <= numCluster Then
        ' Fewer points than clusters: the new point becomes its own centroid
        Data(0, totalData) = totalData
        Centroid(1, totalData) = Data(1, totalData)   ' X
        Centroid(2, totalData) = Data(2, totalData)   ' Y
    Else
        ' Find the nearest centroid for the new data point
        min = 10 ^ 10   ' A large number
        X = Data(1, totalData)
        Y = Data(2, totalData)
        For i = 1 To numCluster
            d = dist(X, Y, Centroid(1, i), Centroid(2, i))
            If d < min Then
                min = d
                cluster = i
            End If
        Next i
        Data(0, totalData) = cluster

        Do While isStillMoving
            ' This loop is guaranteed to converge
            ' Compute the new centroids; 1 = sum of X, 2 = sum of Y, 3 = count
            ReDim sumXY(1 To 3, 1 To numCluster)
            For i = 1 To totalData
                sumXY(1, Data(0, i)) = Data(1, i) + sumXY(1, Data(0, i))
                sumXY(2, Data(0, i)) = Data(2, i) + sumXY(2, Data(0, i))
                sumXY(3, Data(0, i)) = 1 + sumXY(3, Data(0, i))
            Next i
            For i = 1 To numCluster
                ' Skip empty clusters to avoid a division by zero
                If sumXY(3, i) > 0 Then
                    Centroid(1, i) = sumXY(1, i) / sumXY(3, i)
                    Centroid(2, i) = sumXY(2, i) / sumXY(3, i)
                End If
            Next i

            ' Reassign all data points to the new centroids
            isStillMoving = False
            For i = 1 To totalData
                min = 10 ^ 10   ' A large number
                X = Data(1, i)
                Y = Data(2, i)
                For j = 1 To numCluster
                    d = dist(X, Y, Centroid(1, j), Centroid(2, j))
                    If d < min Then
                        min = d
                        cluster = j
                    End If
                Next j
                If Data(0, i) <> cluster Then
                    Data(0, i) = cluster
                    isStillMoving = True
                End If
            Next i
        Loop
    End If
End Sub
Function dist(X1 As Single, Y1 As Single, X2 As Single, Y2 As Single) As Single
    ' Compute the Euclidean distance
    dist = Sqr((Y2 - Y1) ^ 2 + (X2 - X1) ^ 2)
End Function

Private Function min2(num1, num2)
    ' Return the smaller of two numbers
    If num1 < num2 Then
        min2 = num1
    Else
        min2 = num2
    End If
End Function




REFERENCES

[1] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[2] B.S. Everitt, Cluster Analysis, Edward Arnold, co-published by Halsted Press, an
imprint of John Wiley & Sons Inc., 3rd edition, 1993.
[3] D. Fisher, Knowledge acquisition via incremental conceptual clustering, in
Machine Learning.
[4] H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, Journal
of Computational and Graphical Statistics, 15(2):265-286, 2006.
[5] P. Hall, H.G. Muller, and J.L. Wang, Properties of principal component methods for
functional and longitudinal data analysis, Ann. Statist., 34(3):1493-1517, 2006.
[6] F. Yao, H.G. Muller, A.J. Clifford, S.R. Dueker, J. Follett, Y. Lin, B.A. Buchholz,
and J.S. Vogel, Shrinkage Estimation for Functional Principal Component Scores
with Application to the Population Kinetics of Plasma Folate, Biometrics,
59:676-685, 2003.
[7] K.Y. Liang and S.L. Zeger, Longitudinal data analysis using generalized linear
models, Biometrika, 73(1):13-22, 1986.
[8] L.J.P. van der Maaten, E.O. Postma, and H.J. van den Herik, Dimensionality
reduction: A comparative review, 2007. Preprint published online.
http://www.cs.unimaas.nl/l.vandermaaten/dr/DR_draft.pdf
[9] J. Fan and I. Gijbels, Variable Bandwidth and Local Linear Regression Smoothers,
The Annals of Statistics, 20(4):2008-2036, 1992.
[10] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications,
2007.
