

UNIVERSITY .
FACULTY .

----------

Graduation Report

Topic:

VIETNAMESE BASE NOUN PHRASE CHUNKING USING CRFs MODELS























DECLARATION
I hereby declare that the results of this thesis are entirely the product of my own study and research. All reference materials are fully cited and annotated.

Student

Nguyễn Thanh Huyền



ACKNOWLEDGEMENTS
Throughout my studies and the completion of this thesis, my teachers gave me very useful knowledge and scientific research methods, and my family, workplace, colleagues, and friends gave me a great deal of care and encouragement.
First of all, I would like to thank the teachers of the Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi, for the valuable knowledge they passed on to me during my time at the university. In particular, I would like to express my deep gratitude to my supervisor, Assoc. Prof. Dr. Đoàn Văn Ban, who wholeheartedly advised and guided me on technical matters throughout the work on this thesis.
I would also like to thank the management board of the Hanoi Economic Secondary School, where I work, for creating favorable conditions for me during my studies and throughout the work on this graduation thesis.
Finally, I would like to thank my parents, brothers, sisters, husband, children, friends, and colleagues, whose constant support and encouragement let me concentrate on my research and complete the thesis. Throughout the work I tried to focus on studying the subject and consulting the related literature. However, because time was limited and I do not yet have much experience in scientific research, the thesis certainly still has many shortcomings. I would be grateful for the advice of my teachers and the comments of my friends and colleagues so that it can be improved.
Hanoi, 12 June 2011
Nguyễn Thanh Huyền



TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF SYMBOLS AND ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
Chapter 1 - OVERVIEW OF DATA MINING AND ROUGH SET THEORY
1.1. Introduction to data mining
1.1.1. Knowledge discovery
1.1.2. Data mining
1.2. Applications of data mining
1.3. Some common data mining methods
1.3.1. Classification
1.3.2. Clustering
1.3.3. Association rules
1.4. Rough set theory
1.4.1. Information systems
1.4.2. Decision tables
1.4.3. Indiscernibility relation
1.4.4. Set approximation
1.5. Conclusion of Chapter 1
Chapter 2 - DECISION TREES AND ALGORITHMS FOR BUILDING DECISION TREES
2.1. Overview of decision trees
2.1.1. Definition
2.1.2. Decision tree design
2.1.3. General method for building a decision tree
2.1.3. Applying decision trees in data mining
2.2. Decision tree building algorithm based on entropy
2.2.1. Criterion for selecting the classification attribute
2.2.2. The ID3 algorithm
2.2.3. Example of the ID3 algorithm
2.3. Decision tree building algorithm based on attribute dependency
2.3.1. Attribute dependency according to rough set theory
2.3.2. Exact dependency | according to rough set theory
2.3.3. Criterion for selecting the classification attribute
2.3.4. The ADTDA decision tree building algorithm
2.3.5. Example
2.4. Decision tree building algorithm based on entropy and attribute dependency
2.4.1. Criterion for selecting the classification attribute
2.4.2. The FID3 algorithm (Fixed Iterative Dichotomiser 3 [5])
2.4.3. Example
2.5. Conclusion of Chapter 2
Chapter 3 - APPLICATION, VERIFICATION AND EVALUATION
3.1. Introduction to the problem
3.2. Introduction to the database
3.3. Implementing the application
3.4. Results and evaluation of the algorithms
3.4.1. Decision tree models for the Bank_data dataset
3.4.2. Decision rules for the Bank_data dataset
3.4.3. Evaluation of the algorithms
3.4.4. Applying decision trees in data mining
3.5. Conclusion of Chapter 3
CONCLUSION
REFERENCES



LIST OF SYMBOLS AND ABBREVIATIONS
SYMBOLS:
$S = (U, A)$              Information system
$V_a$                     Value set of attribute $a$
$IND(B)$                  Equivalence relation of the attribute set $B$
$[u_i]_P$                 Equivalence class containing object $u_i$
$U/B$                     Partition of $U$ induced by the relation $IND(B)$
$DT = (U, C \cup D)$      Decision table
$\underline{B}(X)$        B-lower approximation of $X$
$\overline{B}(X)$         B-upper approximation of $X$
$POS_C(d)$                C-positive region of $d$
$|DT|$                    Total number of objects in DT
$|U|$                     Cardinality of the set $U$
$[U]_d$                   Partition of $U$ induced by the relation $IND(d)$

ABBREVIATIONS:
ADTDA   Algorithm for Building Decision Tree Based on Dependency of Attributes
FID3    Fixed Iterative Dichotomiser 3
ID3     Iterative Dichotomiser 3
IG      Information Gain

LIST OF TABLES

Table 1. A simple information system
Table 2. A decision table with C = {Age, LEMS} and D = {Walk}
Table 3. Training data
Table 4. Attributes of the Bank_data dataset
Table 5. Accuracy of the algorithms



LIST OF FIGURES

Figure 1. Data classification process - Model building step
Figure 2. Data classification process - Estimating model accuracy
Figure 3. Data classification process - Classifying new data
Figure 4. Approximation of the object set in Table 2 by the condition attributes Age and LEMS
Figure 5. General description of a decision tree
Figure 6. Example of a decision tree
Figure 7. Model for classifying new samples
Figure 8. Tree after selecting the attribute Humidity (ID3)
Figure 9. Tree after selecting the attribute Outlook (ID3)
Figure 10. Resulting tree (ID3)
Figure 11. Tree after selecting the attribute Humidity (ADTDA)
Figure 12. Tree after selecting the attribute Outlook (ADTDA)
Figure 13. Resulting tree (ADTDA)
Figure 14. Decision tree after selecting the attribute Humidity (FID3)
Figure 15. Decision tree after selecting the attribute Windy (FID3)
Figure 16. Resulting tree (FID3)
Figure 17. Form of the ID3 decision tree
Figure 18. Form of the ADTDA decision tree
Figure 19. Form of the FID3 decision tree
Figure 20. Some rules of the ID3 decision tree
Figure 21. Some rules of the ADTDA decision tree
Figure 22. Some rules of the FID3 decision tree
Figure 23. Application interface


INTRODUCTION
- Reasons for choosing the topic
In recent years, information technology has developed rapidly and made remarkable advances, and with it has come an explosion of information. Information that used to be organized on paper for transactions is gradually being digitized because of the advantages digital storage brings: long-term storage and fast updating, editing, and searching. This is why the volume of digitized information is now growing exponentially.
Today, no field can do without the support of information technology, and success in each field depends heavily on capturing useful information promptly and perceptively. Meeting this need with traditional manual operations alone gives low accuracy and takes a great deal of time. Discovering knowledge from the data held in large document collections therefore plays a very important role. Knowledge discovery has existed for a long time, but it has boomed only in recent years. The development of automatic data collection tools and database technologies has led to enormous amounts of data being stored in databases and in the information repositories of organizations and individuals. Knowledge discovery from data has consequently attracted, and continues to attract, considerable attention from researchers. An important and common problem in data mining is classification, which has been widely applied in commerce, health care, industry, and other fields.
Many classification methods have been proposed over the years, but no single approach is clearly superior to, or more accurate than, all the others; each method has its own advantages and disadvantages. One of the effective knowledge discovery tools today is the decision tree, which is used to find classification rules.
Rough set theory, proposed by Zdzislaw Pawlak in 1982, has been studied extensively in recent years and can be used for classification. It provides researchers and data analysts with many data mining techniques, such as characterizing concepts by means of a number of facts. Many researchers have used rough set theory in applications such as attribute discrimination, dimensionality reduction, knowledge discovery, and time-series data analysis. It is a relatively new mathematical tool in data mining that can be used to select the splitting attribute when building a decision tree, and there are various approaches to choosing the optimal splitting attribute so that the tree has the smallest height. For this reason, this thesis studies methods of building decision trees based on rough sets. The application of decision trees to data mining has been, and continues to be, studied. Wishing to study this field, I chose the topic "Applying decision trees in data mining" for my graduation thesis.
- Research objectives
The aim of the thesis is to study the basic issues of rough set theory, decision trees, and rough-set-based algorithms for building decision trees on complete information systems; to implement and evaluate the decision tree algorithms studied; and to take a first step in applying the resulting decision tree models to data mining (decision support for loan applications).
- Structure of the thesis
The thesis consists of 3 main chapters:
Chapter 1: Overview of knowledge discovery and rough set theory
This chapter presents an overview of data mining and rough set theory.
Chapter 2: Decision trees and algorithms for building decision trees.
This chapter introduces decision trees, the general method for building a decision tree, and three algorithms for building decision trees: ID3, ADTDA, and FID3.
Chapter 3: Experiments and evaluation.
This chapter states the problem, implements the application, and evaluates the results.


Chapter 1 - OVERVIEW OF DATA MINING AND ROUGH SET THEORY
1.1. Introduction to data mining
1.1.1. Knowledge discovery
In the era of the information technology boom, data storage technologies are developing rapidly, enabling organizations to collect more and better data. In business in particular, enterprises have recognized the importance of capturing and processing information, which helps business owners devise timely strategies that bring large profits. For all these reasons, agencies, organizations, and enterprises have each accumulated huge volumes of data, on the order of gigabytes or even terabytes. These data stores keep growing and hide much useful information. This explosion creates an urgent need for new techniques and tools that turn such huge data stores into concise, useful information. Knowledge Discovery from Data (KDD) emerged as a natural response to this need.
The process of knowledge discovery from data usually consists of the following main steps [2]-[7]:
Step 1: Define the problem and select data sources (Problem Understanding and Data Understanding)
In this stage, domain experts need to discuss with IT experts to determine what is to be discovered and to agree on an approach for the discovery process (for example, whether rules, classification, or clustering of the data is wanted). This stage is important because if the problem is defined incorrectly, the whole process fails and becomes useless.
Step 2: Data preparation
This step includes the following processes:
- Data gathering
- Data cleaning
- Data integration
- Data selection
- Data transformation
This is also a very important stage, because if the input data are inaccurate, the result obviously cannot be accurate either.
Step 3: Data mining
This step determines the data mining task and selects the data mining technique. Its result is the knowledge, models, or rules hidden in the data.
Step 4: Pattern evaluation
This step assesses whether the knowledge obtained is accurate and valuable; if not, the process can return to the previous steps. The evaluation is carried out mainly by domain experts and users, not by IT experts.
Step 5: Knowledge presentation and deployment
The discovered knowledge is presented in an explicit form that is friendly and useful to most users, and is then put to use in specific applications.
1.1.2. Data mining
Data mining is only one step in the process of knowledge discovery from databases. Data mining comprises the following stages [7]:
Stage 1: Gathering
This step collects the data to be mined from a database, a data warehouse, and even from Web application sources.
Stage 2: Selection
In this stage the data are selected or divided according to some criteria, for example selecting all people aged 25 to 35 who hold a university degree.
Stage 3: Cleansing, pre-processing and preparation
This third stage is often neglected, but in practice it is a very important step in the data mining process. A common fault when gathering data is a lack of rigor and logical consistency, so the data often contain meaningless values and cannot be linked together. Example: age = 673. This stage handles such inconsistent data, which are regarded as redundant information with no value. It is therefore a very important process: data that are not cleaned, pre-processed, and prepared will cause seriously misleading results.
Stage 4: Transformation
The data are transformed to suit the purpose of the mining task.
Stage 5: Pattern extraction and discovery
In this stage many different algorithms are used to extract patterns from the data. Commonly used techniques are classification rules, association rules, and sequential data models.
Stage 6: Evaluation of results
This is the final stage of the data mining process. The data patterns have been extracted by the data mining software, but not every pattern is useful, and some are even misleading. Evaluation criteria must therefore be prioritized in order to extract the knowledge that is actually needed.
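The selection and cleansing stages above can be illustrated with a minimal Python sketch. The records, the field names (`age`, `education`), and the plausibility bounds 0 to 120 are assumptions made for illustration only; they are not taken from the thesis data.

```python
# Hypothetical records; names, fields and values are illustrative assumptions.
records = [
    {"name": "A", "age": 28, "education": "university"},
    {"name": "B", "age": 673, "education": "university"},  # meaningless value
    {"name": "C", "age": 41, "education": "university"},
    {"name": "D", "age": 30, "education": "high school"},
]


def is_valid(record, min_age=0, max_age=120):
    """Cleansing rule (assumed bounds): reject implausible ages such as 673."""
    return min_age <= record["age"] <= max_age


def is_selected(record):
    """Selection rule from the text: aged 25 to 35 with a university degree."""
    return 25 <= record["age"] <= 35 and record["education"] == "university"


invalid = [r["name"] for r in records if not is_valid(r)]
cleaned = [r for r in records if is_valid(r)]
selected = [r["name"] for r in cleaned if is_selected(r)]
```

In practice the validity rules depend on the domain; the point is only that values such as age = 673 are removed before the mining step.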
1.2. Applications of data mining
Data mining techniques are now widely applied in many areas of business and daily life, such as marketing, finance, banking and insurance, science, health care, security, and the internet.
+ Medicine and health care: diagnosing diseases from test results. Data mining helped Australian health insurance detect many unjustified tests, saving 1 million $ per year.
+ Marketing: IBM Surf Aid applied data mining to analyze Web accesses to market-related pages in order to discover customer preferences, and from that to evaluate the effectiveness of Web marketing and improve the operation of websites. The online shopping site Amazon also increased its revenue by applying data mining to analyze customers' purchasing preferences.
+ Finance and the stock market: analyzing typical customer credit card usage, segmenting accounts receivable, analyzing financial investments such as stocks, certificates, and mutual funds, financial assessment, fraud detection, and forecasting stock prices on the stock market.
+ Insurance: analyzing the level of risk for each type of goods or service, or strategies for finding customers for insurance.
+ Manufacturing: applications that optimize resources such as machinery, personnel, and raw materials; optimal design of the production process, workshop layout, and product design, for example automation driven by customer requirements.
1.3. Some common data mining methods
The main tasks of data mining are description and prediction. Description expresses the general characteristics of the data in the database, while prediction performs inference on the existing data in order to draw predictive conclusions. The three most common methods are introduced below: data clustering, data classification, and association rules.
1.3.1. Classification
The goal of data classification is to predict class labels for data samples. The classification process usually consists of 2 steps:
Step 1: Building the model
In this step, a model is built by analyzing the available data samples. The input is a structured dataset described by attributes and made up of tuples of values of those attributes. Each tuple is called a sample. Each sample in the dataset is assumed to belong to a predefined class, where the class is the value of an attribute chosen as the class label attribute or decision attribute. The output of this step is usually a set of classification rules in the form of if-then rules, a decision tree, a logic formula, or a neural network. This process is illustrated in Figure 1.

Figure 1. Data classification process - Model building step
Step 2: Using the model to classify data
The first task in this step is to compute the accuracy of the model. If the accuracy is acceptable, the model is used to predict class labels for future data samples.
The predictive accuracy of the newly built classification model has to be estimated. Holdout is a simple technique for estimating it. It uses a test dataset whose samples already have class labels. These samples are chosen at random and are independent of the samples in the training dataset. The accuracy of the model on the given test dataset is the percentage of test samples that the model classifies correctly (compared with their actual labels).
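As an illustration of the holdout technique just described, the following Python sketch splits labelled samples at random into training and test sets and reports the percentage of test samples classified correctly. The data, the threshold "model", the 30% test fraction, and the function names are assumptions for illustration, not part of the thesis.

```python
import random


def holdout_split(samples, test_fraction=0.3, seed=0):
    """Randomly split labelled samples into disjoint training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model, test_set):
    """Percentage of test samples whose predicted label equals the true label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return 100.0 * correct / len(test_set)


# Hypothetical labelled data: (features, class label).
samples = [({"age": a}, "young" if a < 40 else "old") for a in range(20, 60, 2)]
train, test = holdout_split(samples)


def threshold_model(features):
    """Stand-in for a trained classifier (assumed rule, not learned here)."""
    return "young" if features["age"] < 40 else "old"


test_accuracy = accuracy(threshold_model, test)
```

Because the test samples are kept out of training, the measured percentage estimates how the model behaves on unseen data.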


Figure 2. Data classification process - Estimating model accuracy

Figure 3. Data classification process - Classifying new data
1.3.2. Clustering
The main goal of data clustering is to group similar objects in a dataset into clusters, so that objects in the same cluster are similar to each other while objects in different clusters are dissimilar. Data clustering is an example of unsupervised learning. With this method we cannot know in advance what the resulting clusters will look like, so a domain expert is needed to evaluate the clusters obtained.
Data clustering is widely used in applications such as market segmentation, customer segmentation, pattern recognition, and web page classification. It can also be used as a pre-processing step for other data mining algorithms.
1.3.3. Association rules
An association rule is a rule that reflects a close association among a set of objects in a database [2].
The goal of this method is to discover and report relationships between data values in the database. The output of the mining algorithm is the set of association rules found. Association rule mining is carried out in 2 steps:
Step 1: Find all frequent itemsets. An itemset is frequent if its support satisfies the minimum support.
Step 2: Generate strong association rules from the frequent itemsets. The rules must satisfy both the minimum support and the minimum confidence.
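The two steps above can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not an efficient algorithm such as Apriori: the transactions, thresholds, and function names are hypothetical, and support and confidence follow their usual definitions (fraction of transactions containing an itemset, and support of the whole rule divided by support of its antecedent).

```python
from itertools import combinations

# Hypothetical transactions (sets of items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Step 1: enumerate all itemsets whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


def strong_rules(transactions, min_support, min_confidence):
    """Step 2: rules from frequent itemsets that reach min_confidence."""
    rules = []
    for itemset in frequent_itemsets(transactions, min_support):
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                consequent = itemset - set(antecedent)
                conf = confidence(antecedent, consequent, transactions)
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules


frequent = frequent_itemsets(transactions, min_support=0.5)
rules = strong_rules(transactions, min_support=0.5, min_confidence=0.9)
```

Enumerating every candidate itemset is exponential in the number of items, so this form is only suitable for very small examples.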
1.4. Rough set theory
Rough set theory was first proposed by Z. Pawlak in the early 1980s. It plays a very important role in artificial intelligence and in fields related to cognition, particularly machine learning, knowledge acquisition, decision analysis, knowledge discovery from databases, expert systems, decision support systems, and inductive reasoning [1]-[8].
Application areas of rough sets include:
- Medical diagnosis
- Pharmacology research
- Stock market prediction and financial data analysis
- Banking
- Market research
- Information storage and retrieval systems
- Pattern recognition, including speech and handwriting recognition
- Control system design
- Image processing
- Digital logic design
The following sections study the basic concepts of rough set theory. This knowledge is essential for applying rough sets to the construction of decision trees.
1.4.1. Information systems
In most conventional database management systems, information is usually represented in tables, in which each row describes one object and each column describes one attribute of the objects. In the early 1980s Pawlak defined a new concept, the information system, based on the traditional concept of a table, as follows [4]:
Definition 1.1. [1]-[8] An information system is a pair $S = (U, A)$, where $U$ is a finite non-empty set of objects (called the universe) and $A$ is a finite non-empty set of attributes. For each $a \in A$ we denote by $V_a$ the value set of attribute $a$. If $x \in U$ and $a \in A$, we denote by $x(a)$ the value of attribute $a$ for object $x$.
Example 1.1. [8] The data table below is an information system with 7 objects and 2 attributes.

        Age     LEMS
x1      16-30   50
x2      16-30   0
x3      31-45   1-25
x4      31-45   1-25
x5      46-60   26-49
x6      16-30   26-49
x7      46-60   26-49

Table 1. A simple information system
1.4.2. Decision tables
To represent real data that include decision attributes, we consider a special case of the information system called a decision table, defined as follows:
Definition 1.2: A decision table (decision system) is a special form of information system in which the attribute set $A$ consists of two disjoint subsets: the set of condition attributes $C$ and the set of decision attributes $D$. A decision table is thus an information system of the form $DT = (U, C \cup D)$ with $C \cap D = \emptyset$ [1].
Example 1.2. Table 2 below shows a decision table in which the condition attributes are those of Table 1 and the decision attribute {Walk}, taking the two values Yes and No, has been added [8].

        Age     LEMS    Walk
x1      16-30   50      Yes
x2      16-30   0       No
x3      31-45   1-25    No
x4      31-45   1-25    Yes
x5      46-60   26-49   No
x6      16-30   26-49   Yes
x7      46-60   26-49   No

Table 2. A decision table with C = {Age, LEMS} and D = {Walk}
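Definition 1.3. Positive region
Given a decision table $DT = (U, C \cup D)$, the set
$$POS_C(D) = \bigcup_{X \in U/IND(D)} \underline{C}(X)$$
is called the C-positive region of D, where $\underline{C}(X)$ is the C-lower approximation of $X$ (see Section 1.4.4). In other words, $u \in POS_C(D)$ if and only if $u(C) = v(C)$ implies $u(D) = v(D)$ for every $v \in U$ [1].
An attribute $c \in C$ is said to be dispensable in DT if $POS_C(D) = POS_{C - \{c\}}(D)$. Otherwise, $c$ is indispensable in DT.
The decision table $DT = (U, C \cup \{d\})$ is said to be independent if every attribute $c \in C$ is indispensable in DT.
Definition 1.4. Consider the decision table $DT = (U, C \cup \{d\})$ and two objects $x, y \in U$. We say that $x$ and $y$ conflict with each other in DT if $x(C) = y(C)$ but $x(d) \neq y(d)$ [3].
An object $x$ is said to be consistent in DT if there is no other object $y$ that conflicts with $x$. DT is said to be consistent if every object $x \in U$ is consistent.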
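Definitions 1.3 and 1.4 can be checked on Table 2 with a short Python sketch. The representation of the table as a dictionary and all function names are assumptions for illustration. The positive region is computed here as the union of the C-equivalence classes whose objects all share one decision value, which is equivalent to the union of lower approximations in the definition.

```python
from collections import defaultdict
from itertools import combinations

# Table 2: object -> (Age, LEMS, Walk)
ROWS = {
    "x1": ("16-30", "50", "Yes"),
    "x2": ("16-30", "0", "No"),
    "x3": ("31-45", "1-25", "No"),
    "x4": ("31-45", "1-25", "Yes"),
    "x5": ("46-60", "26-49", "No"),
    "x6": ("16-30", "26-49", "Yes"),
    "x7": ("46-60", "26-49", "No"),
}
ATTRS = ("Age", "LEMS", "Walk")
TABLE = {obj: dict(zip(ATTRS, values)) for obj, values in ROWS.items()}


def partition(table, attrs):
    """Group objects that have identical values on all attributes in attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def positive_region(table, cond, dec):
    """Union of cond-classes whose objects all have the same decision value."""
    pos = set()
    for block in partition(table, cond):
        decisions = {tuple(table[o][d] for d in dec) for o in block}
        if len(decisions) == 1:
            pos |= block
    return pos


def conflicting_pairs(table, cond, dec):
    """Pairs (x, y) with x(C) = y(C) but different decision values."""
    return [
        (x, y)
        for x, y in combinations(sorted(table), 2)
        if all(table[x][a] == table[y][a] for a in cond)
        and any(table[x][d] != table[y][d] for d in dec)
    ]


C, D = ["Age", "LEMS"], ["Walk"]
pos = positive_region(TABLE, C, D)
conflicts = conflicting_pairs(TABLE, C, D)
# An attribute is indispensable if removing it changes the positive region.
indispensable = [
    c for c in C if positive_region(TABLE, [a for a in C if a != c], D) != pos
]
```

On Table 2 the only conflicting pair is x3 and x4, so the table is not consistent, and removing either Age or LEMS shrinks the positive region.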


1.4.3. Indiscernibility relation
One of the basic features of rough set theory is that it is used to store and process data that contain vagueness and indiscernibility. An information system as defined above may also contain objects that cannot be distinguished from one another.
Definition 1.5: Let $S = (U, A)$ be an information system. Each attribute set $B \subseteq A$ induces an equivalence relation, denoted $IND(B)$ [1]-[8]:
$$IND(B) = \{(x, x') \in U^2 \mid \forall a \in B,\ x(a) = x'(a)\}$$
$IND(B)$ is called the B-indiscernibility relation. If $(x, x') \in IND(B)$, the objects $x$ and $x'$ cannot be distinguished from each other by the attribute set $B$. For each object $x \in U$, the equivalence class of $x$ under $IND(B)$, denoted $[x]_B$, is the set of all objects related to $x$ by $IND(B)$.
The B-indiscernibility relation partitions the object set $U$ into equivalence classes, denoted $U/IND(B)$ or $U/B$; that is, $U/B = \{[x]_B \mid x \in U\}$.
Example 1.3. [8] Consider the information system in Table 1.
For the attribute set B = {LEMS}, the partition of $U$ induced by the equivalence relation $IND(B)$ is:
$$U/B = \{\{x_1\}, \{x_2\}, \{x_3, x_4\}, \{x_5, x_6, x_7\}\}$$
We then say that the pairs of objects $x_3, x_4$ and $x_5, x_6$ are indiscernible with respect to the attribute set {LEMS}, because they belong to the same equivalence class determined by $IND(B)$.
For B = {Age, LEMS}, we have:
$$U/B = \{\{x_1\}, \{x_2\}, \{x_3, x_4\}, \{x_5, x_7\}, \{x_6\}\}$$
In this case $x_5$ and $x_6$ are discernible with respect to the attribute set {Age, LEMS}, because they do not belong to the same equivalence class determined by $IND(B)$.
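The partitions in Example 1.3 can be reproduced with a minimal Python sketch. The dictionary encoding of Table 1 and the function names are assumptions for illustration.

```python
from collections import defaultdict

# Table 1: object -> attribute values
TABLE = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}


def partition(table, attrs):
    """U/B: equivalence classes of IND(B), grouping objects equal on all of B."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return {frozenset(c) for c in classes.values()}


def indiscernible(table, attrs, x, y):
    """True if (x, y) belongs to IND(B) for B = attrs."""
    return all(table[x][a] == table[y][a] for a in attrs)


by_lems = partition(TABLE, ["LEMS"])
by_age_lems = partition(TABLE, ["Age", "LEMS"])
```

Adding attributes to B can only refine the partition: the class {x5, x6, x7} under {LEMS} splits into {x5, x7} and {x6} under {Age, LEMS}.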
1.4.4. Set approximation
In Table 2, the concept Walk cannot be defined precisely by the two condition attributes Age and LEMS, because $x_3$ and $x_4$ belong to the same equivalence class induced by Age and LEMS but have different values for the attribute Walk. Therefore, if an object has (Age, LEMS) = (31-45, 1-25), we cannot know for certain whether its value of Walk is Yes or No, and the concept Walk is not described precisely. However, based on the attribute set {Age, LEMS} we can still identify some objects that certainly have Walk = Yes and some that certainly have Walk = No; the remaining objects lie on the boundary between the two values Yes and No. Specifically:
If an object has a value on the attribute set {Age, LEMS} in the set {(16-30, 50), (16-30, 26-49)}, then its Walk is Yes.
If an object has a value on the attribute set {Age, LEMS} in the set {(16-30, 0), (46-60, 26-49)}, then its Walk is No.
If an object has (Age, LEMS) = (31-45, 1-25), then its Walk is Yes or No.
This leads to the concept of set approximation:
Definition 1.6. [10] Let $DT = (U, C \cup D)$ be a decision system, $B \subseteq C$ an attribute set, and $X \subseteq U$ a set of objects. We can approximate the set $X$ using the attributes in $B$ by constructing the B-lower and B-upper approximations, defined as follows:
B-lower approximation of X: $\underline{B}X = \{x \in U \mid [x]_B \subseteq X\}$
B-upper approximation of X: $\overline{B}X = \{x \in U \mid [x]_B \cap X \neq \emptyset\}$

The set $\underline{B}X$ consists of the objects in $U$ that, using the attributes in $B$, can be known with certainty to be elements of $X$.

The set $\overline{B}X$ consists of the objects in $U$ that, using the attributes in $B$, can only be said to be possible elements of $X$.
The set $BN_B(X) = \overline{B}X \setminus \underline{B}X$ is called the B-boundary of $X$; it contains the objects for which, using the attributes of $B$, we cannot determine whether they belong to $X$ or not.
The set $U \setminus \overline{B}X$ is called the B-outside region of $X$; it consists of the objects that, using the attribute set $B$, are known with certainty not to belong to $X$.
A set is called rough if its boundary is non-empty; otherwise it is called crisp.
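Example 1.4. Consider the decision system in Table 2.
Let the object set be $X = \{x \in U \mid x(Walk) = Yes\} = \{x_1, x_4, x_6\}$ and the attribute set be B = {Age, LEMS}. Then we have [8]:
$$U/B = \{\{x_1\}, \{x_2\}, \{x_3, x_4\}, \{x_5, x_7\}, \{x_6\}\}$$
$$\underline{B}X = \{x \in U \mid [x]_B \subseteq X\} = \{x_1, x_6\}$$
$$\overline{B}X = \{x \in U \mid [x]_B \cap X \neq \emptyset\} = \{x_1, x_3, x_4, x_6\}$$
Hence $BN_B(X) = \{x_3, x_4\}$ is non-empty, so $X$ is rough with respect to $B$.

Figure 4. Approximation of the object set in Table 2 by the condition attributes Age and LEMS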
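The approximations of Example 1.4 can be verified with the following Python sketch, which applies the definitions of the lower approximation, upper approximation, boundary, and outside region directly to Table 2. The data encoding and function names are assumptions for illustration.

```python
from collections import defaultdict

# Table 2: object -> (Age, LEMS, Walk)
ROWS = {
    "x1": ("16-30", "50", "Yes"),
    "x2": ("16-30", "0", "No"),
    "x3": ("31-45", "1-25", "No"),
    "x4": ("31-45", "1-25", "Yes"),
    "x5": ("46-60", "26-49", "No"),
    "x6": ("16-30", "26-49", "Yes"),
    "x7": ("46-60", "26-49", "No"),
}
ATTRS = ("Age", "LEMS", "Walk")
TABLE = {obj: dict(zip(ATTRS, values)) for obj, values in ROWS.items()}


def partition(table, attrs):
    """Equivalence classes of IND(attrs)."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def lower_approximation(classes, X):
    """Objects whose equivalence class is entirely contained in X."""
    result = set()
    for c in classes:
        if c <= X:
            result |= c
    return result


def upper_approximation(classes, X):
    """Objects whose equivalence class has at least one element in X."""
    result = set()
    for c in classes:
        if c & X:
            result |= c
    return result


U = set(TABLE)
X = {obj for obj, row in TABLE.items() if row["Walk"] == "Yes"}
classes = partition(TABLE, ["Age", "LEMS"])

lower = lower_approximation(classes, X)
upper = upper_approximation(classes, X)
boundary = upper - lower      # B-boundary of X
outside = U - upper           # B-outside region of X
is_rough = bool(boundary)
```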
1.5. Conclusion of Chapter 1
+ This chapter gave an overview of data mining and its applications, and introduced some common data mining methods.
+ It also presented an overview of rough set theory, including information systems, the indiscernibility relation, rough sets, and decision tables, together with concrete examples illustrating these concepts.



Chapter 2 - DECISION TREES AND ALGORITHMS FOR BUILDING DECISION TREES
2.1. Overview of decision trees
A decision tree is a tree-structured tool for classifying data. Each decision tree represents a decision procedure for a class of facts. Each node in the tree is either the name of a class or a test on a particular attribute; the test partitions the state space of the facts at that node according to the possible outcomes of the test. Each subset produced by the test is a subspace of the facts and corresponds to a subproblem of the classification. Decision trees are used to support the decision-making process.
2.1.1. Definition
A decision tree is a tree in which each node is one of the following:
- A leaf node, also called an answer node, which represents a class of cases; its label is the name of the class.
- A non-leaf node, also called an internal node, which specifies a test on an attribute; its label is the name of the attribute, and it has one branch to a subtree for each possible outcome of the test. The label of each branch is the corresponding value of that attribute. The topmost node is called the root node.

Figure 5. General description of a decision tree
To classify an unknown data sample, the attribute values of the sample are tested against the decision tree. Each sample follows a path from the root to a leaf, and the leaf gives the predicted class of that sample.
[Figure 5 labels: root node, branches, internal nodes, leaf nodes]


Example 2.1: A decision tree

Figure 6. Example of a decision tree

2.1.2. Decision tree design
2.1.2.1. Data processing
In the real world, raw data almost always contain some level of noise. This has various causes, such as erroneous data or inaccurate measurements. We therefore usually pre-process (that is, "clean") the data to minimize or remove the noisy raw data. The pre-processing stages can also transform the raw data into a more useful representation, such as an information system. When pre-processing steps are applied effectively, they help improve classification performance.
The specific tasks of data pre-processing include:
- Filtering attributes: selecting the attributes that suit the model.
- Filtering samples: filtering the data samples (instances, patterns) for the model.
- Transformation: converting the data to suit the model, for example from numeric to nominal.
- Discretization: some algorithms (such as ID3 and ADTDA) apply only to discrete data, so continuous data must be discretized before they are used.
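As a small illustration of discretization, the following Python sketch maps a numeric age to one of the interval labels used for the Age attribute in Table 1. The cut points are read from that table; the function name and the handling of out-of-range values are assumptions.

```python
# Interval boundaries taken from the Age column of Table 1.
AGE_INTERVALS = [(16, 30), (31, 45), (46, 60)]


def discretize(value, intervals):
    """Return the label 'low-high' of the interval containing value, or None."""
    for low, high in intervals:
        if low <= value <= high:
            return f"{low}-{high}"
    return None  # value falls outside every interval


ages = [25, 45, 46, 70]
labels = [discretize(a, AGE_INTERVALS) for a in ages]
```

How the intervals are chosen (equal width, equal frequency, or a supervised criterion) is a separate design decision that this sketch does not address.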
[Figure 6 labels: nodes Outlook, Humidity, Temp; branch values Sunny, Overcast, Rainy, high, Normal, low, hot, mild; leaves TRUE, FALSE]


2.1.2.2. Building the tree
A decision tree is built by successively (recursively) dividing a dataset into subsets, each of which consists mainly of elements of the same class.
The non-leaf nodes are the branching points of the tree. Branching at a node can be based on testing one or more attributes to determine how the data are divided.
2.1.2.3. Splitting criterion
The key choice in decision-tree-based classification algorithms is which attribute to test at each node of the tree. We want to choose the attribute that classifies the sample set best, so we need a criterion for evaluating this. Many criteria have been proposed; those used here are:
+ Information Gain (IG), used in the ID3 algorithm of John Ross Quinlan [9].
+ The dependency of the decision attribute on the condition attributes, in the sense of the rough set theory of Zdzisław Pawlak [3]-[10].
These criteria are presented with the decision tree building algorithms in the sections below.
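To make the first criterion concrete, the sketch below computes entropy and information gain in their usual textbook form, $IG(S, a) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v)$, on the data of Table 2. The function names and the use of Table 2 as sample data are assumptions; the thesis's own formulation of the criterion may use different notation.

```python
from collections import Counter
from math import log2

# Rows of Table 2 as (Age, LEMS, Walk).
DATA = [
    ("16-30", "50", "Yes"), ("16-30", "0", "No"), ("31-45", "1-25", "No"),
    ("31-45", "1-25", "Yes"), ("46-60", "26-49", "No"),
    ("16-30", "26-49", "Yes"), ("46-60", "26-49", "No"),
]
ROWS = [dict(zip(("Age", "LEMS", "Walk"), r)) for r in DATA]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


ig_age = information_gain(ROWS, "Age", "Walk")
ig_lems = information_gain(ROWS, "LEMS", "Walk")
```

On this small table the two condition attributes happen to give the same gain, so a tie-breaking rule would be needed to pick the splitting attribute.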
2.1.2.4. Stopping criterion
This is an important part of the classification structure of a decision tree, since it governs whether a node is divided further into child nodes.
We focus on some of the most common stopping criteria used in decision trees. The traditional stopping criterion uses a test set: the decision tree is checked against the test set throughout its construction, and the algorithm stops when errors occur. Another method uses a predefined threshold to stop splitting nodes. The threshold can be set on quantities such as the reduction in impurity, the number of samples in a node, the proportion of samples in a node, or the depth of the tree.
2.1.2.5. Pruning the tree
During tree construction, the growth of the tree can be limited by a minimum number of records at each node, a maximum depth of the tree, or a minimum value of the information gain.
After the tree has been built, it can be pruned using the Minimum Description Length method or a minimum value of IG (during construction the minimum IG can be chosen small enough to let the tree grow fairly deep, and then raised afterwards to prune the tree).
2.1.3. General method for building a decision tree
The construction of a decision tree can start from a single unlabelled node that holds all training objects and proceed as follows [3]:
1. If all training objects at the current node belong to the same class, make the node a leaf labelled with that common class.
2. Otherwise, use a measure to choose the condition attribute that best splits the training samples at the node.
3. Create as many child nodes of the current node as the chosen attribute has distinct values. Assign one attribute value to each branch from the parent to a child, and distribute the training objects to the corresponding child nodes.
4. A child node t is called homogeneous, and becomes a leaf, if all sample objects at t belong to the same class. Repeat steps 1-3 for every node that is not yet homogeneous.
The basic tree-building algorithms accept only discrete-valued attributes, both for the attribute to be predicted and for the attributes tested at the nodes of the tree. Continuous-valued attributes can therefore be handled by discretization: the continuous value range of the attribute is partitioned into a set of disjoint intervals.
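As a minimal sketch of this discretization step, a continuous value can be mapped to the label of the interval it falls in. The cut points, the labels and the helper name `discretize` are hypothetical and chosen only for illustration; the thesis does not prescribe a particular discretization.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls in.

    cut_points must be sorted; labels needs len(cut_points) + 1 entries.
    Interval i covers [cut_points[i-1], cut_points[i]).
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]

# Hypothetical cut points (degrees Celsius) for a temperature attribute.
CUTS = [15.0, 25.0]
LABELS = ["cool", "mild", "hot"]

print([discretize(t, CUTS, LABELS) for t in (9.5, 15.0, 24.9, 31.0)])
```

Because the intervals are disjoint and cover the whole range, every continuous value receives exactly one discrete label, which is what the tree-building algorithms require.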
The tree is built recursively, from the root down to the leaves. At the node currently under consideration, if the stopping condition holds, the algorithm creates a leaf and assigns it a class label according to the stopping condition that was met. Otherwise, the algorithm chooses the best split according to a given criterion and partitions the current data by that split.
After the split, the algorithm iterates over all resulting subsets and calls itself recursively on each of them, starting again from the first step.
In the attribute-selection step (step 2), the criterion used to select the attribute can be understood as a measure of fit, a measure of homogeneity, or a rule for partitioning the training samples.
2.1.3. Applying decision trees in data mining
Once the decision tree has been built, the resulting model is put to use. This is the step of the classification approach to data mining in which the model is used to classify data or to extract knowledge.
2.1.3.1. Determining the class of new samples
Given the known values of the attributes $X_1, X_2, \ldots, X_n$ of a sample, we determine the decision (class) attribute $Y$ of that object. This technique can be used for pattern recognition, forecasting, and so on.

Figure 7. Model for classifying new samples
2.1.3.2. Extracting knowledge or rules from the tree
The main purpose and task of data mining is to discover rules and models in a database. From the model obtained, we extract knowledge either in the form of a tree or in the form of "If ... Then ..." rules. The two representations are equivalent and can be converted into each other.
[Figure 7 labels: training data -> decision tree; concrete data (Sunny, True, Cool, High) -> decision tree -> result?]
Example 2.2:
Some of the rules extracted from the tree in Example 2.1 are:
+ Rule 1:
IF (Humidity: high) AND (Outlook: rainy) THEN (=> Decision: False)
+ Rule 2:
IF (Humidity: high) AND (Outlook: sunny) THEN (=> Decision: False)
+ Rule 3:
IF (Humidity: high) AND (Outlook: overcast) THEN (=> Decision: True)

These rules can then be used to support decision making, prediction, and so on.
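To illustrate how such rules can be applied to a new sample, the three rules of Example 2.2 can be written as data and matched in order. This is only a sketch: the representation of a rule as a (conditions, decision) pair and the function name `decide` are assumptions made for the illustration, and the sample is the tuple (Sunny, True, Cool, High) shown in Figure 7.

```python
# Each rule: (conditions that must all hold, decision).
RULES = [
    ({"Humidity": "high", "Outlook": "rainy"}, False),     # Rule 1
    ({"Humidity": "high", "Outlook": "sunny"}, False),     # Rule 2
    ({"Humidity": "high", "Outlook": "overcast"}, True),   # Rule 3
]

def decide(sample, rules):
    """Return the decision of the first rule whose conditions all match,
    or None when no rule fires."""
    for conditions, decision in rules:
        if all(sample.get(attr) == value for attr, value in conditions.items()):
            return decision
    return None

sample = {"Outlook": "sunny", "Windy": True, "Temp": "cool", "Humidity": "high"}
print(decide(sample, RULES))  # Rule 2 fires
```

A sample that matches no rule yields `None`; with the full rule set extracted from a complete tree, every sample would match exactly one rule.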
2.2. Building a decision tree based on entropy
2.2.1. Criterion for choosing the splitting attribute
The criterion used to evaluate candidate splits is very important; it acts as the heuristic for partitioning the data. The idea behind such criteria is to make the resulting subsets as pure as possible (all tuples in a subset carry the same label).
The algorithm uses information gain (IG) to determine the split [9]. This measure is based on the information theory of the mathematician Claude Shannon and is defined as follows.
Consider a decision table $DT = (U, C \cup \{d\})$ in which the decision attribute $d$ takes $k$ possible values (class labels). The entropy of the set of objects in DT is defined by:
$$Entropy(U) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
where $p_i$ is the proportion of objects in DT with class label $i$.
Information gain (IG) is the reduction in entropy obtained when the set of objects in DT is partitioned by a condition attribute $c$. It is given by:
$$IG(U, c) = Entropy(U) - \sum_{v \in V_c} \frac{|U_v|}{|U|} Entropy(U_v)$$
where $V_c$ is the set of values of attribute $c$ and $U_v$ is the set of objects in DT whose value of $c$ equals $v$. John Ross Quinlan [9] uses $IG(U, c)$ as the measure for choosing the splitting attribute at each node in the ID3 tree-building algorithm: the attribute chosen is the one with the largest information gain.
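The two definitions translate almost directly into code. The sketch below assumes that a decision table is stored as a list of dictionaries; the function names and the four-object table are hypothetical and serve only to show the two extreme cases (an attribute that separates the classes perfectly and one that carries no information).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """IG(U, attr): entropy of U minus the weighted entropy of the
    blocks U_v obtained by partitioning U on attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

# Hypothetical table: "a" separates the classes perfectly, "b" not at all.
rows = [
    {"a": "x", "b": "p", "d": True},
    {"a": "x", "b": "q", "d": True},
    {"a": "y", "b": "p", "d": False},
    {"a": "y", "b": "q", "d": False},
]
print(information_gain(rows, "a", "d"), information_gain(rows, "b", "d"))
```

Here the entropy of the whole table is 1 bit, so the gain of `a` equals the full entropy and the gain of `b` is zero.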
2.2.2. The ID3 algorithm
ID3 (Iterative Dichotomiser 3) [9] is a decision-tree construction algorithm introduced by John Ross Quinlan. Its main idea is to build the tree top-down, starting from a set of objects and their attributes. At each node of the tree one attribute is tested, and the outcome of the test is used to partition the set of objects. This is repeated recursively until the set of objects in a subtree is homogeneous with respect to the classification criterion, that is, until all its objects belong to the same class. These classes become the labels of the leaves; the label of every non-leaf node is the name of the attribute with the largest IG among the attributes available for testing. IG is computed from the entropy function, so IG determines the priority with which attributes are chosen during tree construction.
Pseudocode of the ID3 algorithm:
Input: decision table DT = (U, C ∪ {d})
Output: decision tree model
Function Create_tree(U, C, {d})
Begin
  If all samples belong to the same class d_i then
    return a leaf node labelled d_i
  else if C = null then
    return a leaf node labelled d_j, the most common class in DT
  else begin
    bestAttribute := getBestAttribute(U, C);
    // choose the best attribute to split on
    C := C - {bestAttribute};
    // remove bestAttribute from the attribute set
    For each value v of bestAttribute
    begin
      U_v := [U]_v;
      // U_v is the block of the partition of U by bestAttribute
      // in which bestAttribute has value v
      ChildNode := Create_tree(U_v, C, {d});
      // create a child node
      Attach ChildNode to branch v;
    end
  end
End
Pseudocode of the function getBestAttribute:
Input: decision table DT = (U, C ∪ {d})
Output: the best condition attribute

Function getBestAttribute(U, C);
Begin
  maxIG := 0;
  For each c in C
  begin
    tg := IG(U, c);
    // compute the information gain IG(U, c)
    If (tg > maxIG) then
    begin
      maxIG := tg;
      kq := c;
    end
  end
  return kq;
  // returns the attribute with the largest information gain
End
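The pseudocode above can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the thesis implementation (which is written in Visual Basic): rows are dictionaries, a tree is either a class label or a pair (attribute, {value: subtree}), ties between attributes go to the first one listed, and the five-row table is hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def partition(rows, attr):
    """Group rows by their value of attr (the blocks U_v)."""
    blocks = {}
    for row in rows:
        blocks.setdefault(row[attr], []).append(row)
    return blocks

def info_gain(rows, attr, target):
    rest = sum(len(b) / len(rows) * entropy([r[target] for r in b])
               for b in partition(rows, attr).values())
    return entropy([r[target] for r in rows]) - rest

def create_tree(rows, attrs, target):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # all samples share one class
        return labels[0]
    if not attrs:                        # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    return (best, {v: create_tree(block, remaining, target)
                   for v, block in partition(rows, best).items()})

def classify(tree, sample):
    while isinstance(tree, tuple):       # descend until a leaf label is reached
        attr, children = tree
        tree = children[sample[attr]]
    return tree

# Hypothetical training table with two condition attributes.
rows = [
    {"a": "x", "b": "p", "d": "yes"},
    {"a": "x", "b": "q", "d": "yes"},
    {"a": "x", "b": "p", "d": "yes"},
    {"a": "y", "b": "p", "d": "no"},
    {"a": "y", "b": "q", "d": "yes"},
]
tree = create_tree(rows, ["a", "b"], "d")
print(tree)
print(classify(tree, {"a": "y", "b": "q"}))
```

Using `max` over the attribute list avoids the case in which no attribute has a strictly positive gain and the pseudocode variable kq would stay unassigned; the sketch then simply takes the first attribute.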


2.2.3. An example of the ID3 algorithm
Consider the following decision table DT = (U, C ∪ {d}):

No.  Outlook   Windy  Temp  Humidity  d
1    overcast  true   cool  high      TRUE
2    sunny     false  mild  high      FALSE
3    sunny     false  hot   high      FALSE
4    overcast  false  hot   normal    FALSE
5    sunny     true   hot   low       TRUE
6    rainy     false  mild  high      FALSE
7    rainy     false  hot   high      FALSE
8    rainy     false  hot   normal    FALSE
9    overcast  true   hot   low       TRUE
10   rainy     false  mild  normal    TRUE
11   rainy     true   hot   normal    FALSE
12   rainy     false  hot   high      FALSE

Table 3. Training data
Explanation of the data in Table 3:
Each sample describes the weather conditions through the attributes Outlook, Temp (temperature), Humidity and Windy, and has one decision attribute d (play tennis). The decision attribute takes only two values, TRUE and FALSE (play, do not play).
Each attribute has a finite set of values: Outlook has three values, Overcast, Rainy and Sunny; Temp has three values, Hot, Cool and Mild; Humidity has three values, High, Normal and Low; Windy has two values, True (windy) and False (not windy). These values are the symbols used to represent the problem.
The decision tree is built as follows.
First, a single node containing all samples 1 to 12 is created.
To find the best split, the IG of every condition attribute must be computed. We first compute the entropy of the whole training set U, which consists of four tuples {1, 5, 9, 10} with class label TRUE and eight tuples {2, 3, 4, 6, 7, 8, 11, 12} with class label FALSE:
$$Entropy(U) = -\frac{4}{12}\log_2\frac{4}{12} - \frac{8}{12}\log_2\frac{8}{12} = 0.918$$
IG is then computed for each attribute:
- Attribute Outlook has three values: Overcast, Sunny and Rainy. From the data table:
  + Overcast has two tuples {1, 9} with class label TRUE and one tuple {4} with class label FALSE.
  + Sunny has one tuple {5} with class label TRUE and two tuples {2, 3} with class label FALSE.
  + Rainy has one tuple {10} with class label TRUE and five tuples {6, 7, 8, 11, 12} with class label FALSE.
By the formula above, the information gain of Outlook on U is:
$$IG(U, Outlook) = Entropy(U) - \sum_{v \in V_{Outlook}} \frac{|U_v|}{|U|} Entropy(U_v)$$
$$= 0.918 - \left[\frac{3}{12}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) + \frac{3}{12}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{6}{12}\left(-\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6}\right)\right] = 0.134$$
In the same way we obtain:
$$IG(U, Windy) = 0.918 - \left[\frac{4}{12}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{8}{12}\left(-\frac{1}{8}\log_2\frac{1}{8} - \frac{7}{8}\log_2\frac{7}{8}\right)\right] = 0.285$$
$$IG(U, Temp) = 0.918 - \left[\frac{3}{12}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) + \frac{8}{12}\left(-\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8}\right)\right] = 0.148$$
$$IG(U, Humidity) = 0.918 - \left[\frac{6}{12}\left(-\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6}\right) + \frac{4}{12}\left(-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right)\right] = 0.323$$
(For Temp and Humidity, the blocks cool = {1} and low = {5, 9} contain a single class, so their entropy is 0 and their terms are omitted.)
Humidity therefore has the largest IG and is chosen as the splitting attribute. It becomes the label of the root node, and three branches are created, named high, normal and low.
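Moreover, the samples {5, 9} on branch low all belong to class TRUE, so a leaf labelled TRUE is created for that branch.
The split produces the following decision tree:

Humidity {1, 2, ..., 12}
  high   -> ID3(U_1, C - {Humidity}, {d}), U_1 = {1, 2, 3, 6, 7, 12}
  normal -> ID3(U_2, C - {Humidity}, {d}), U_2 = {4, 8, 10, 11}
  low    -> TRUE {5, 9}

Figure 8. Tree after choosing attribute Humidity (ID3)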
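The information gains computed for the root node can be reproduced with a short script. The sketch below encodes Table 3 as tuples (the encoding and the names `TABLE`, `ATTRS` and `info_gain` are assumptions of the illustration) and rounds the results to three decimals.

```python
from collections import Counter
from math import log2

ATTRS = ("Outlook", "Windy", "Temp", "Humidity")
# Table 3, samples 1..12: (Outlook, Windy, Temp, Humidity, d)
TABLE = [
    ("overcast", True,  "cool", "high",   True),
    ("sunny",    False, "mild", "high",   False),
    ("sunny",    False, "hot",  "high",   False),
    ("overcast", False, "hot",  "normal", False),
    ("sunny",    True,  "hot",  "low",    True),
    ("rainy",    False, "mild", "high",   False),
    ("rainy",    False, "hot",  "high",   False),
    ("rainy",    False, "hot",  "normal", False),
    ("overcast", True,  "hot",  "low",    True),
    ("rainy",    False, "mild", "normal", True),
    ("rainy",    True,  "hot",  "normal", False),
    ("rainy",    False, "hot",  "high",   False),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """IG of the attribute in column col; the decision d is the last column."""
    blocks = {}
    for row in rows:
        blocks.setdefault(row[col], []).append(row[-1])
    rest = sum(len(b) / len(rows) * entropy(b) for b in blocks.values())
    return entropy([row[-1] for row in rows]) - rest

gains = {name: round(info_gain(TABLE, i), 3) for i, name in enumerate(ATTRS)}
print(round(entropy([row[-1] for row in TABLE]), 3))
print(gains, max(gains, key=gains.get))
```

The script prints the entropy 0.918 and the gains 0.134, 0.285, 0.148 and 0.323, with Humidity as the maximum, in agreement with the hand calculation.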

Next, the algorithm is called recursively: ID3(U_1, C - {Humidity}, {d}).
As before, to find the best split in this call we compute the IG of the attributes Outlook, Windy and Temp.
- First we compute the entropy of the training set U_1, which consists of one tuple {1} with class label TRUE and five tuples {2, 3, 6, 7, 12} with class label FALSE:
$$Entropy(U_1) = -\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6} = 0.65$$
- Next we compute IG for Outlook, which has three values: Overcast, Sunny and Rainy. From the data table:
  + Overcast has a single tuple {1}, with class label TRUE.
  + Sunny has two tuples {2, 3}, both with class label FALSE.
  + Rainy has three tuples {6, 7, 12}, all with class label FALSE.
Hence the information gain of Outlook on U_1 is:
$$IG(U_1, Outlook) = 0.65 - \left[\frac{1}{6}\left(-\frac{1}{1}\log_2\frac{1}{1}\right) + \frac{2}{6}\left(-\frac{2}{2}\log_2\frac{2}{2}\right) + \frac{3}{6}\left(-\frac{3}{3}\log_2\frac{3}{3}\right)\right] = 0.65$$
- Similarly we obtain:
$$IG(U_1, Windy) = 0.65 - \left[\frac{1}{6}\left(-\frac{1}{1}\log_2\frac{1}{1}\right) + \frac{5}{6}\left(-\frac{5}{5}\log_2\frac{5}{5}\right)\right] = 0.65$$
$$IG(U_1, Temp) = 0.65 - \left[\frac{1}{6}\left(-\frac{1}{1}\log_2\frac{1}{1}\right) + \frac{2}{6}\left(-\frac{2}{2}\log_2\frac{2}{2}\right) + \frac{3}{6}\left(-\frac{3}{3}\log_2\frac{3}{3}\right)\right] = 0.65$$
The IG values of Outlook, Windy and Temp are equal, so any of the three attributes may be chosen for the split.
Suppose we choose Outlook. Outlook then labels the left node, attached to branch high.
This attribute has three values, Overcast, Sunny and Rainy, so three new branches are created:
  + Branch Overcast contains one sample {1} with decision value TRUE, so a leaf TRUE is created.
  + Branch Sunny contains two samples {2, 3} with the same decision value FALSE, so a leaf FALSE is created.
  + Branch Rainy contains three samples {6, 7, 12}, all with decision value FALSE, so a leaf FALSE is created.
After the recursive call ID3(U_1, C - {Humidity}, {d}) has finished, the tree is:

Humidity {1, 2, ..., 12}
  high   -> Outlook {1, 2, 3, 6, 7, 12}
              overcast -> TRUE {1}
              sunny    -> FALSE {2, 3}
              rainy    -> FALSE {6, 7, 12}
  normal -> ID3(U_2, C - {Humidity}, {d}), U_2 = {4, 8, 10, 11}
  low    -> TRUE {5, 9}

Figure 9. Tree after choosing attribute Outlook (ID3)

Next, the algorithm is called recursively: ID3(U_2, C - {Humidity}, {d}).
Proceeding as above, we have:
$$Entropy(U_2) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} = 0.811$$
$$IG(U_2, Outlook) = 0.811 - \left[\frac{1}{4}\left(-\frac{1}{1}\log_2\frac{1}{1}\right) + \frac{3}{4}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)\right] = 0.811 - 0.689 \approx 0.123$$
$$IG(U_2, Windy) = 0.811 - \left[\frac{3}{4}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{1}{4}\left(-\frac{1}{1}\log_2\frac{1}{1}\right)\right] = 0.811 - 0.689 \approx 0.123$$
$$IG(U_2, Temp) = 0.811 - \left[\frac{3}{4}\left(-\frac{3}{3}\log_2\frac{3}{3}\right) + \frac{1}{4}\left(-\frac{1}{1}\log_2\frac{1}{1}\right)\right] = 0.811 - 0 = 0.811$$

Temp has the largest IG, so it is chosen for the split. Temp then labels the right node, attached to branch normal.
In U_2 this attribute takes two values, hot and mild, so two new branches are created:
  + Branch hot contains three samples {4, 8, 11}, all with decision value FALSE, so a leaf FALSE is created.
  + Branch mild contains one sample {10} with decision value TRUE, so a leaf TRUE is created.
The final tree is:

Humidity {1, 2, ..., 12}
  high   -> Outlook {1, 2, 3, 6, 7, 12}
              overcast -> TRUE {1}
              sunny    -> FALSE {2, 3}
              rainy    -> FALSE {6, 7, 12}
  normal -> Temp {4, 8, 10, 11}
              hot  -> FALSE {4, 8, 11}
              mild -> TRUE {10}
  low    -> TRUE {5, 9}

Figure 10. Resulting tree (ID3)
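2.3. Building a decision tree based on attribute dependency
2.3.1. Attribute dependency in rough set theory
Consider a decision table $DT = (U, C \cup \{D\})$. We say that $D$ depends on $C$ to degree $k$ ($0 \le k \le 1$) [10], where
$$k = \gamma(C, D) = \frac{|POS_C(D)|}{|U|}$$
and $POS_C(D)$ is the C-positive region of $D$. It is easy to see that
$$\gamma(C, D) = \sum_{X \in U/IND(D)} \frac{|\underline{C}X|}{|U|}, \qquad 0 \le \gamma(C, D) \le 1$$
The dependency $\gamma(C, D)$ has the following properties:
- If $\gamma(C, D) = 1$, then $D$ depends totally on $C$.
- If $0 < \gamma(C, D) < 1$, then $D$ depends partially on $C$.
- If $\gamma(C, D) = 0$, then no object of $U$ can be classified correctly (into the classes of $D$) using the attribute set $C$.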
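A small sketch may help to make the positive region concrete. It assumes that objects are identified by keys and that the condition and decision values are obtained through two accessor functions; the function names and the four-object table are hypothetical.

```python
def partition(universe, value_of):
    """Equivalence classes of `universe` under equal value_of(x)."""
    blocks = {}
    for x in universe:
        blocks.setdefault(value_of(x), set()).add(x)
    return list(blocks.values())

def positive_region(universe, cond, dec):
    """POS_C(D): union of the C-classes contained in a single D-class."""
    dec_blocks = partition(universe, dec)
    pos = set()
    for block in partition(universe, cond):
        if any(block <= y for y in dec_blocks):
            pos |= block
    return pos

def dependency(universe, cond, dec):
    """gamma(C, D) = |POS_C(D)| / |U|."""
    return len(positive_region(universe, cond, dec)) / len(universe)

# Hypothetical table: object -> (condition value, decision value)
DATA = {1: ("a", "yes"), 2: ("a", "no"), 3: ("b", "no"), 4: ("c", "yes")}
U = set(DATA)
print(positive_region(U, lambda x: DATA[x][0], lambda x: DATA[x][1]))
print(dependency(U, lambda x: DATA[x][0], lambda x: DATA[x][1]))
```

Objects 1 and 2 share the condition value but not the decision, so they fall outside the positive region; objects 3 and 4 are classified unambiguously, which gives a dependency of 2/4 = 0.5 (partial dependency).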
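2.3.2. Precise dependency in rough set theory
Consider a decision table $DT = (U, C \cup \{D\})$ and a subset of condition attributes $B \subseteq C$. Suppose $U/D = \{Y_1, Y_2, \ldots, Y_m\}$ and $U/B = \{X_1, X_2, \ldots, X_n\}$. Define
$$\gamma_\beta(B, D) = \frac{\left|\left\{x \in U : \frac{|[x]_B \cap Y_j|}{|[x]_B|} \ge \beta \text{ for some } Y_j \in U/D\right\}\right|}{|U|}$$
$\gamma_\beta(B, D)$ is called the $\beta$-precise dependency. It measures the proportion of objects that are classified with precision at least $\beta$, where the value $\beta$ ($0.5 < \beta \le 1$) fixes the required proportion of correct classifications [10].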
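One possible reading of the β-precise dependency is sketched below: an equivalence class of B counts as classified when its most frequent decision value covers at least a fraction β of the class. This interpretation, the function name `beta_dependency` and the six-row table are assumptions made for the illustration; reference [10] should be consulted for the exact definition.

```python
from collections import Counter

def beta_dependency(rows, cond, dec, beta):
    """Fraction of objects whose cond-class has a decision value covering
    at least `beta` of the class (0.5 < beta <= 1)."""
    if not 0.5 < beta <= 1:
        raise ValueError("beta must lie in (0.5, 1]")
    blocks = {}
    for row in rows:
        blocks.setdefault(row[cond], []).append(row[dec])
    covered = 0
    for decisions in blocks.values():
        top = Counter(decisions).most_common(1)[0][1]
        if top / len(decisions) >= beta:
            covered += len(decisions)
    return covered / len(rows)

# Hypothetical table: class "a" is 75% "yes", class "b" is split evenly.
rows = (
    [{"c": "a", "d": "yes"}] * 3 + [{"c": "a", "d": "no"}]
    + [{"c": "b", "d": "yes"}, {"c": "b", "d": "no"}]
)
print(beta_dependency(rows, "c", "d", 0.75))
print(beta_dependency(rows, "c", "d", 1.0))
```

With β = 1 the measure reduces to the ordinary dependency, which is 0 for this table; lowering β to 0.75 admits the four objects of class "a".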
2.3.3. Criterion for choosing the splitting attribute
In the rough set approach, the dependency $\gamma(c, d)$ is used as the criterion for choosing the attribute tested at each node while the tree is grown: the attribute chosen is the attribute $c$ with the largest $\gamma(c, d)$ among the attributes remaining at that step [10]. If the dependency $\gamma(c, d)$ of every attribute is zero, the attribute chosen is the one with the largest $\beta$-precise dependency.
2.3.4. The ADTDA tree-building algorithm
ADTDA (Algorithm for Building Decision Tree Based on Dependency of Attributes) [10] is a decision-tree construction algorithm presented by Longjun Huang, Minghe Huang, Bin Guo and Zhiming Zhang. Its main idea is to build the tree top-down with a greedy strategy, testing each attribute on the given sets at every node of the tree. To choose the "best" attribute (so as to obtain an optimal tree, one of minimum depth), the dependency of the decision attribute on each condition attribute is computed, and the attribute with the largest dependency is chosen.
Pseudocode of the ADTDA algorithm:
Input: decision table DT = (U, C ∪ {d})
Output: decision tree model
Function Create_tree(U, C, {d})
Begin
  If all samples belong to the same class d_i then
    return a leaf node labelled d_i
  else if C = null then
    return a leaf node labelled d_j, the most common class in DT
  else begin
    bestAttribute := getBestAttribute(U, C);
    // choose the best attribute to split on
    Make bestAttribute the root;
    C := C - {bestAttribute};
    // remove bestAttribute from the attribute set
    For each value v of bestAttribute
    begin
      U_v := [U]_v;
      // U_v is the block of the partition of U by bestAttribute
      // in which bestAttribute has value v
      ChildNode := Create_tree(U_v, C, {d});
      // create a child node
      Attach ChildNode to branch v;
    end
  end
End
ADTDA is the same as ID3 except for the function getBestAttribute.
Pseudocode of the function getBestAttribute:
Input: decision table DT = (U, C ∪ {d})
Output: the best condition attribute

Function getBestAttribute(U, C);
Begin
  maxDependency := 0;
  For each c in C
  begin
    k := DependencyGama(U, c);
    // compute the dependency γ(c, d)
    If (k > maxDependency) then
    begin
      maxDependency := k;
      kq := c;
    end
  end
  return kq;
  // returns the attribute with the largest dependency γ(c, d)
End
2.3.5. Example
Consider the decision table given in Table 3.
To build the tree, we compute the dependency of the decision attribute d on every condition attribute.
- The decision attribute d has 4 samples {1, 5, 9, 10} with value TRUE and 8 samples {2, 3, 4, 6, 7, 8, 11, 12} with value FALSE, so:
$$[U]_d = \{\{1, 5, 9, 10\}, \{2, 3, 4, 6, 7, 8, 11, 12\}\}$$
- The attribute Outlook has three values: overcast with 3 samples {1, 4, 9}, sunny with 3 samples {2, 3, 5} and rainy with six samples {6, 7, 8, 10, 11, 12}, so:
$$[U]_{Outlook} = \{\{1, 4, 9\}, \{2, 3, 5\}, \{6, 7, 8, 10, 11, 12\}\}$$
Hence:
$$\gamma(Outlook, d) = \frac{|POS_{Outlook}(d)|}{|U|} = \frac{|\emptyset|}{|U|} = \frac{0}{12} = 0$$
Similarly:
$$[U]_{Windy} = \{\{1, 5, 9, 11\}, \{2, 3, 4, 6, 7, 8, 10, 12\}\}$$
$$\gamma(Windy, d) = \frac{|POS_{Windy}(d)|}{|U|} = \frac{|\emptyset|}{|U|} = \frac{0}{12} = 0$$
$$[U]_{Temp} = \{\{1\}, \{2, 6, 10\}, \{3, 4, 5, 7, 8, 9, 11, 12\}\}$$
$$\gamma(Temp, d) = \frac{|POS_{Temp}(d)|}{|U|} = \frac{|\{1\}|}{|U|} = \frac{1}{12}$$
$$[U]_{Humidity} = \{\{1, 2, 3, 6, 7, 12\}, \{4, 8, 10, 11\}, \{5, 9\}\}$$
$$\gamma(Humidity, d) = \frac{|POS_{Humidity}(d)|}{|U|} = \frac{|\{5, 9\}|}{|U|} = \frac{2}{12}$$
$\gamma(Humidity, d)$ is the largest value, so Humidity is chosen as the splitting attribute. The root node is labelled Humidity and three branches are created, named high, normal and low.
Moreover, the samples {5, 9} on branch low all belong to class TRUE, so a leaf labelled TRUE is created for that branch.
The split produces the following decision tree:

Humidity U = {1, 2, ..., 12}
  high   -> ADTDA(U_1, C - {Humidity}, {d}), U_1 = {1, 2, 3, 6, 7, 12}
  normal -> ADTDA(U_2, C - {Humidity}, {d}), U_2 = {4, 8, 10, 11}
  low    -> TRUE {5, 9}

Figure 11. Tree after choosing attribute Humidity (ADTDA)
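The dependency values of this example can be checked with a few lines of code. The sketch below encodes Table 3 as tuples and uses exact fractions; the encoding and the names `TABLE`, `ATTRS` and `dependency` are assumptions of the illustration. For a single condition attribute, the positive region is the union of the value blocks that contain only one decision value.

```python
from fractions import Fraction

ATTRS = ("Outlook", "Windy", "Temp", "Humidity")
# Table 3, samples 1..12: (Outlook, Windy, Temp, Humidity, d)
TABLE = [
    ("overcast", True,  "cool", "high",   True),
    ("sunny",    False, "mild", "high",   False),
    ("sunny",    False, "hot",  "high",   False),
    ("overcast", False, "hot",  "normal", False),
    ("sunny",    True,  "hot",  "low",    True),
    ("rainy",    False, "mild", "high",   False),
    ("rainy",    False, "hot",  "high",   False),
    ("rainy",    False, "hot",  "normal", False),
    ("overcast", True,  "hot",  "low",    True),
    ("rainy",    False, "mild", "normal", True),
    ("rainy",    True,  "hot",  "normal", False),
    ("rainy",    False, "hot",  "high",   False),
]

def dependency(rows, col):
    """gamma(c, d): share of rows lying in value blocks of column col
    whose decision (last column) is the same for every row of the block."""
    blocks = {}
    for row in rows:
        blocks.setdefault(row[col], []).append(row[-1])
    pos = sum(len(b) for b in blocks.values() if len(set(b)) == 1)
    return Fraction(pos, len(rows))

gammas = {name: dependency(TABLE, i) for i, name in enumerate(ATTRS)}
print(gammas)

# U_1: the samples on branch Humidity = high
U1 = [row for row in TABLE if row[3] == "high"]
print([dependency(U1, i) for i in range(3)])
```

`Fraction` reduces 2/12 to 1/6, so the value printed for Humidity is 1/6; on U_1 all three remaining attributes have dependency 1.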

Next, the algorithm is called recursively: ADTDA(U_1, C - {Humidity}, {d}).
We have:
$$[U_1]_d = \{\{1\}, \{2, 3, 6, 7, 12\}\}$$
$$[U_1]_{Outlook} = \{\{1\}, \{2, 3\}, \{6, 7, 12\}\}$$
Hence:
$$\gamma(Outlook, d) = \frac{|POS_{Outlook}(d)|}{|U_1|} = \frac{|\{1, 2, 3, 6, 7, 12\}|}{|U_1|} = \frac{6}{6} = 1$$
$$[U_1]_{Windy} = \{\{1\}, \{2, 3, 6, 7, 12\}\}$$
$$\gamma(Windy, d) = \frac{|POS_{Windy}(d)|}{|U_1|} = \frac{|\{1, 2, 3, 6, 7, 12\}|}{|U_1|} = \frac{6}{6} = 1$$
$$[U_1]_{Temp} = \{\{1\}, \{2, 6\}, \{3, 7, 12\}\}$$
$$\gamma(Temp, d) = \frac{|POS_{Temp}(d)|}{|U_1|} = \frac{|\{1, 2, 3, 6, 7, 12\}|}{|U_1|} = \frac{6}{6} = 1$$

The dependencies of the decision attribute d on the three attributes Outlook, Windy and Temp are equal, so any of them may be chosen for the split.
Choosing Outlook, as in the ID3 example, gives the following tree:

Humidity U = {1, 2, ..., 12}
  high   -> Outlook U_1 = {1, 2, 3, 6, 7, 12}
              overcast -> TRUE {1}
              sunny    -> FALSE {2, 3}
              rainy    -> FALSE {6, 7, 12}
  normal -> ADTDA(U_2, C - {Humidity}, {d}), U_2 = {4, 8, 10, 11}
  low    -> TRUE {5, 9}

Figure 12. Tree after choosing attribute Outlook (ADTDA)

Next, the algorithm is called recursively: ADTDA(U_2, C - {Humidity}, {d}).
Proceeding as above, we have:
$$[U_2]_d = \{\{10\}, \{4, 8, 11\}\}$$
$$[U_2]_{Outlook} = \{\{4\}, \{8, 10, 11\}\}$$
Hence:
$$\gamma(Outlook, d) = \frac{|POS_{Outlook}(d)|}{|U_2|} = \frac{|\{4\}|}{|U_2|} = \frac{1}{4}$$
$$[U_2]_{Windy} = \{\{4, 8, 10\}, \{11\}\}$$
$$\gamma(Windy, d) = \frac{|POS_{Windy}(d)|}{|U_2|} = \frac{|\{11\}|}{|U_2|} = \frac{1}{4}$$
$$[U_2]_{Temp} = \{\{4, 8, 11\}, \{10\}\}$$
$$\gamma(Temp, d) = \frac{|POS_{Temp}(d)|}{|U_2|} = \frac{|\{4, 8, 10, 11\}|}{|U_2|} = \frac{4}{4} = 1$$

$\gamma(Temp, d)$ is the largest value, so Temp is chosen for the split. Temp then labels the right node, attached to branch normal.
As in the ID3 example, the final tree is:

Humidity {1, 2, ..., 12}
  high   -> Outlook {1, 2, 3, 6, 7, 12}
              overcast -> TRUE {1}
              sunny    -> FALSE {2, 3}
              rainy    -> FALSE {6, 7, 12}
  normal -> Temp {4, 8, 10, 11}
              hot  -> FALSE {4, 8, 11}
              mild -> TRUE {10}
  low    -> TRUE {5, 9}

Figure 13. Resulting tree (ADTDA)
2.4. Building a decision tree based on entropy and attribute dependency
2.4.1. Criterion for choosing the splitting attribute
Consider a decision table $DT = (U, C \cup \{d\})$.
The fixed information gain $IG_{fix}$ [5] is a newer criterion for choosing the condition attribute $c$ on which to split. It is defined by:
$$IG_{fix}(U, c) = \sqrt{\gamma(c, d) \cdot \frac{IG(U, c)}{|c|}}$$
where:
+ $|c|$ is the number of distinct values of the condition attribute $c$;
+ $\gamma(c, d)$ is the dependency of $d$ on $c$;
+ $IG(U, c)$ is the information gain.
The fixed information gain of an attribute is used as the criterion for choosing the attribute tested at each node of the decision tree. The condition attribute with the largest fixed information gain is chosen from the reduced attribute set and used as the root of the tree.
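As a quick numeric illustration of the formula, the sketch below evaluates $IG_{fix}$ for the two attributes of the tennis data whose dependency is non-zero, using the values that appear in the worked example of Section 2.4.3. The function name `ig_fix` and its argument names are assumptions of the illustration.

```python
from math import sqrt

def ig_fix(gamma, ig, n_values):
    """Fixed information gain: sqrt(gamma(c, d) * IG(U, c) / |c|)."""
    if n_values <= 0:
        raise ValueError("an attribute needs at least one value")
    return sqrt(gamma * ig / n_values)

# gamma, IG and |c| for Temp and Humidity on the full tennis table
print(round(ig_fix(1 / 12, 0.148, 3), 3))   # Temp
print(round(ig_fix(2 / 12, 0.323, 3), 3))   # Humidity
```

Because the dependency enters as a factor, an attribute with zero dependency always receives a fixed information gain of zero, whatever its information gain.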
2.4.2. The FID3 algorithm (Fixed Iterative Dichotomiser 3 [5])
Input: decision table DT = (U, C ∪ {d})
Output: decision tree model
Function Create_tree(U, C, {d})
Begin
  If all samples belong to the same class d_i then
    return a leaf node labelled d_i
  else if C = null then
    return a leaf node labelled d_j, the most common class in DT
  else begin
    bestAttribute := getBestAttribute(U, C);
    // choose the best attribute to split on
    Make bestAttribute the root;
    C := C - {bestAttribute};
    // remove bestAttribute from the attribute set
    For each value v of bestAttribute
    begin
      U_v := [U]_v;
      // U_v is the block of the partition of U by bestAttribute
      // in which bestAttribute has value v
      ChildNode := Create_tree(U_v, C, {d});
      // create a child node
      Attach ChildNode to branch v;
    end
  end
End
FID3 is the same as ID3 except for the function getBestAttribute.

Pseudocode of the function getBestAttribute [5]:
Input: decision table DT = (U, C ∪ {d})
Output: the best condition attribute

Function getBestAttribute(U, C);
Begin
  C' := C;
  For each c in C
  begin
    k := DependencyGama(U, c);
    // compute the dependency γ(c, d)
    If (k = 0) then C' := C' - {c};
  end
  maxIGfix := 0;
  For each c in C'
  begin
    tg := IGfix(U, c);
    // compute the fixed information gain
    If (tg > maxIGfix) then
    begin
      maxIGfix := tg;
      kq := c;
    end
  end
  return kq;
  // returns the attribute with the largest fixed information gain IGfix(U, c)
End
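2.4.3. Example
Consider the decision table DT = (U, C ∪ {d}) given in Table 3.
In the ADTDA example above we obtained:
$$\gamma(Outlook, d) = 0, \quad \gamma(Windy, d) = 0, \quad \gamma(Temp, d) = \frac{1}{12}, \quad \gamma(Humidity, d) = \frac{2}{12}$$
The new attribute set is therefore C' = {Temp, Humidity}. We compute $IG_{fix}(U, Temp)$ and $IG_{fix}(U, Humidity)$.
In the ID3 example above we obtained:
$$IG(U, Temp) = 0.148, \qquad IG(U, Humidity) = 0.323$$
Hence:
$$IG_{fix}(U, Temp) = \sqrt{\gamma(Temp, d) \cdot \frac{IG(U, Temp)}{|Temp|}} = \sqrt{\frac{1}{12} \cdot \frac{0.148}{3}} = 0.064$$
$$IG_{fix}(U, Humidity) = \sqrt{\gamma(Humidity, d) \cdot \frac{IG(U, Humidity)}{|Humidity|}} = \sqrt{\frac{2}{12} \cdot \frac{0.323}{3}} = 0.134$$
$IG_{fix}(U, Humidity)$ is the largest value, so Humidity is chosen as the splitting attribute. As in the ID3 example, we obtain the following tree:

Humidity U = {1, 2, ..., 12}
  high   -> FID3(U_1, C - {Humidity}, {d}), U_1 = {1, 2, 3, 6, 7, 12}
  normal -> FID3(U_2, C - {Humidity}, {d}), U_2 = {4, 8, 10, 11}
  low    -> TRUE {5, 9}

Figure 14. Decision tree after choosing attribute Humidity (FID3)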
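The attribute selection of FID3 on this data can be sketched end to end: attributes with zero dependency are dropped, and the remaining ones are ranked by fixed information gain. The encoding of Table 3 and all function names are assumptions of the illustration; in particular, $|c|$ is taken as the number of values of the attribute in the full table, which is what the worked example appears to use.

```python
from collections import Counter
from math import log2, sqrt

ATTRS = ("Outlook", "Windy", "Temp", "Humidity")
# Table 3, samples 1..12: (Outlook, Windy, Temp, Humidity, d)
TABLE = [
    ("overcast", True,  "cool", "high",   True),
    ("sunny",    False, "mild", "high",   False),
    ("sunny",    False, "hot",  "high",   False),
    ("overcast", False, "hot",  "normal", False),
    ("sunny",    True,  "hot",  "low",    True),
    ("rainy",    False, "mild", "high",   False),
    ("rainy",    False, "hot",  "high",   False),
    ("rainy",    False, "hot",  "normal", False),
    ("overcast", True,  "hot",  "low",    True),
    ("rainy",    False, "mild", "normal", True),
    ("rainy",    True,  "hot",  "normal", False),
    ("rainy",    False, "hot",  "high",   False),
]

def blocks(rows, col):
    """Decision values of the rows, grouped by the value in column col."""
    out = {}
    for row in rows:
        out.setdefault(row[col], []).append(row[-1])
    return out

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    rest = sum(len(b) / len(rows) * entropy(b) for b in blocks(rows, col).values())
    return entropy([row[-1] for row in rows]) - rest

def dependency(rows, col):
    pure = sum(len(b) for b in blocks(rows, col).values() if len(set(b)) == 1)
    return pure / len(rows)

def ig_fix_scores(rows, cols):
    """IG_fix for every column in cols whose dependency is non-zero."""
    scores = {}
    for col in cols:
        gamma = dependency(rows, col)
        if gamma == 0:                    # dropped from C'
            continue
        n_values = len({row[col] for row in TABLE})   # |c| over the full table
        scores[ATTRS[col]] = sqrt(gamma * info_gain(rows, col) / n_values)
    return scores

root = ig_fix_scores(TABLE, [0, 1, 2, 3])
u1 = ig_fix_scores([r for r in TABLE if r[3] == "high"], [0, 1, 2])
print({k: round(v, 3) for k, v in root.items()}, max(root, key=root.get))
print({k: round(v, 3) for k, v in u1.items()}, max(u1, key=u1.get))
```

At the root only Temp and Humidity survive the dependency filter and Humidity wins; on the subset U_1 (branch high) all three remaining attributes have dependency 1, and the division by $|c|$ favours the two-valued attribute Windy.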

Next, the algorithm is called recursively: FID3(U_1, C - {Humidity}, {d}).
From the ADTDA example we have:
$$[U_1]_d = \{\{1\}, \{2, 3, 6, 7, 12\}\}$$
$$[U_1]_{Outlook} = \{\{1\}, \{2, 3\}, \{6, 7, 12\}\}, \qquad \gamma(Outlook, d) = \frac{|\{1, 2, 3, 6, 7, 12\}|}{|U_1|} = \frac{6}{6} = 1$$
$$[U_1]_{Windy} = \{\{1\}, \{2, 3, 6, 7, 12\}\}, \qquad \gamma(Windy, d) = \frac{|\{1, 2, 3, 6, 7, 12\}|}{|U_1|} = \frac{6}{6} = 1$$
$$[U_1]_{Temp} = \{\{1\}, \{2, 6\}, \{3, 7, 12\}\}, \qquad \gamma(Temp, d) = \frac{|\{1, 2, 3, 6, 7, 12\}|}{|U_1|} = \frac{6}{6} = 1$$
From the ID3 example we have:
$$IG(U_1, Windy) = 0.65, \quad IG(U_1, Outlook) = 0.65, \quad IG(U_1, Temp) = 0.65$$
Therefore:
$$IG_{fix}(U_1, Windy) = \sqrt{\gamma(Windy, d) \cdot \frac{IG(U_1, Windy)}{|Windy|}} = \sqrt{1 \cdot \frac{0.65}{2}} = 0.57$$
$$IG_{fix}(U_1, Outlook) = \sqrt{\gamma(Outlook, d) \cdot \frac{IG(U_1, Outlook)}{|Outlook|}} = \sqrt{1 \cdot \frac{0.65}{3}} = 0.465$$
$$IG_{fix}(U_1, Temp) = \sqrt{\gamma(Temp, d) \cdot \frac{IG(U_1, Temp)}{|Temp|}} = \sqrt{1 \cdot \frac{0.65}{3}} = 0.465$$
$IG_{fix}(U_1, Windy)$ is the largest value, so Windy is chosen as the splitting attribute.
Windy then labels the left node, attached to branch high.
This attribute has two values, true and false, so two new branches are created:
  + Branch true contains one sample {1} with decision value TRUE, so a leaf TRUE is created.
  + Branch false contains five samples {2, 3, 6, 7, 12} with the same decision value FALSE, so a leaf FALSE is created.
After the recursive call FID3(U_1, C - {Humidity}, {d}) has finished, the tree is:

Humidity {1, 2, ..., 12}
  high   -> Windy {1, 2, 3, 6, 7, 12}
              true  -> TRUE {1}
              false -> FALSE {2, 3, 6, 7, 12}
  normal -> FID3(U_2, C - {Humidity}, {d}), U_2 = {4, 8, 10, 11}
  low    -> TRUE {5, 9}

Figure 15. Decision tree after choosing attribute Windy (FID3)

Next, the algorithm is called recursively: FID3(U_2, C - {Humidity}, {d}).
From the ADTDA example we have:
$$[U_2]_d = \{\{10\}, \{4, 8, 11\}\}$$
$$[U_2]_{Outlook} = \{\{4\}, \{8, 10, 11\}\}, \qquad \gamma(Outlook, d) = \frac{|\{4\}|}{|U_2|} = \frac{1}{4}$$
$$[U_2]_{Windy} = \{\{4, 8, 10\}, \{11\}\}, \qquad \gamma(Windy, d) = \frac{|\{11\}|}{|U_2|} = \frac{1}{4}$$
$$[U_2]_{Temp} = \{\{4, 8, 11\}, \{10\}\}, \qquad \gamma(Temp, d) = \frac{|\{4, 8, 10, 11\}|}{|U_2|} = \frac{4}{4} = 1$$
From the ID3 example we have:
$$IG(U_2, Outlook) = 0.123, \quad IG(U_2, Windy) = 0.123, \quad IG(U_2, Temp) = 0.811$$
Therefore:
$$IG_{fix}(U_2, Windy) = \sqrt{\gamma(Windy, d) \cdot \frac{IG(U_2, Windy)}{|Windy|}} = \sqrt{\frac{1}{4} \cdot \frac{0.123}{2}} = 0.124$$
$$IG_{fix}(U_2, Outlook) = \sqrt{\gamma(Outlook, d) \cdot \frac{IG(U_2, Outlook)}{|Outlook|}} = \sqrt{\frac{1}{4} \cdot \frac{0.123}{3}} = 0.101$$
$$IG_{fix}(U_2, Temp) = \sqrt{\gamma(Temp, d) \cdot \frac{IG(U_2, Temp)}{|Temp|}} = \sqrt{1 \cdot \frac{0.811}{3}} = 0.520$$
$IG_{fix}(U_2, Temp)$ is the largest value, so Temp is chosen for the split.
As in the ID3 example, the final tree is:

Humidity {1, 2, ..., 12}
  high   -> Windy {1, 2, 3, 6, 7, 12}
              true  -> TRUE {1}
              false -> FALSE {2, 3, 6, 7, 12}
  normal -> Temp {4, 8, 10, 11}
              mild -> TRUE {10}
              hot  -> FALSE {4, 8, 11}
  low    -> TRUE {5, 9}

Figure 16. Resulting tree (FID3)
2.5. Conclusion of Chapter 2
This chapter presented the general method for building a decision tree; three tree-building algorithms, ID3, ADTDA and FID3; and worked examples that illustrate each algorithm step by step.
Chapter 3 - APPLICATION, VERIFICATION AND EVALUATION
3.1. Introduction to the problem
"We live in a world with too much information and too little knowledge" is a common observation in today's age of information explosion.
Using knowledge discovery from data to predict credit risk is a new approach to improving the credit quality of a bank.
Credit risk can be understood as the risk that a borrower is unable to repay the principal and/or interest by the agreed deadline.
At present, to guard against credit risk, bank experts collect, analyse and assess information about the customer, the collateral of the loan, and so on. This traditional approach has many limitations because it depends on the skill, the psychology and other subjective factors of the officers who appraise the customer's loan application. A tool that supports the appraisal and estimates credit quality objectively, on a scientific basis, is therefore both meaningful and necessary. Here the recommendation to lend or not is based on decision (classification) rules built from the decision trees studied above. These rules support credit officers in deciding whether to grant a loan to a customer.
Within the scope of this thesis, I focus on consumer credit, using the Bank_data data set. A decision tree model is built from Bank_data, and decision rules are extracted from the tree. With these rules, new data (data on customers applying for a consumer loan that have not yet been classified) can be classified, and the classified data then help credit officers decide whether to grant the loan.
3.2. The data set
In the experiments I use the Bank_data data set, taken from the collection of Professor Bamshad Mobasher of the School of Computing, College of Computing and Digital Media, DePaul University, USA (http://maya.cs.depaul.edu/classes/ect584/WEKA/data/bank-data.csv). The data set contains 600 objects. After preprocessing with Weka and saving as an Excel file, it consists of 600 objects, 10 condition attributes and the decision attribute "result", which states for each customer whether the loan is granted or not granted.
The attributes of Bank_data and their values are described in the following table:
No.  Attribute                   Values                        Explanation
1    Tuoi (age)                  Tre, Trung nien, Gia          Young, middle-aged, old
2    Gioi_tinh (gender)          Nam, Nu                       Male, female
3    Khu_vuc (area)              NT, TTran, Ngoai o, TP        Rural, town, suburb, city
4    Thu_nhap (income)           Thap, TB, Cao                 Low, medium, high
5    Ket_hon (married)           C, K                          Yes, no
6    Con (children)              0_Con, 1_con, 2_con, 3_con    No children, one, two, three children
7    Xe (car)                    C, K                          Yes, no
8    TKTK (savings account)      C, K                          Yes, no
9    TK_Htai (current account)   C, K                          Yes, no
10   The_chap (mortgage)         C, K                          Yes, no
11   RESULT (loan granted)       True, false                   Yes (True), no (False)

Table 4. Attributes of the Bank_data data set

3.3. Implementation
The application was written in Visual Basic in the Visual Studio 2008 environment. It focuses on building the trees and evaluating the accuracy of the algorithms presented in Chapter 2. The decision trees, and the decision rules extracted from them, support bank credit officers in deciding whether a customer is granted a loan.
3.4. Results and evaluation of the algorithms
3.4.1. Decision tree models for the Bank_data data set
- Decision tree produced by the ID3 algorithm

Figure 17. Decision tree produced by ID3
- Decision tree produced by the ADTDA algorithm

Figure 18. Decision tree produced by ADTDA
- Decision tree produced by the FID3 algorithm
During the experiments the author observed that when FID3 is applied to a large data set, the dependencies of the decision attribute on the condition attributes are all 0 at the first step of tree construction. The fixed information gain $IG_{fix}$ of every condition attribute is then 0 as well. In this case the algorithm picks an arbitrary attribute (the first one) as the splitting attribute, and the resulting tree is not optimal. The author therefore proposes a modification in the spirit of ADTDA: if the dependencies of the decision attribute on all condition attributes are 0, the fixed information gain $IG_{fix}$ is computed from the $\beta$-precise dependency, that is:
$$IG_{fix}(U, c) = \sqrt{\gamma_\beta(c, d) \cdot \frac{IG(U, c)}{|c|}}$$
With this modification, the decision tree produced by FID3 on the Bank_data data set is as follows:

Figure 19. Decision tree produced by FID3
3.4.2. Decision rules for the Bank_data data set
- Decision rules of the ID3 decision tree

Figure 20. Some rules of the ID3 decision tree
- Decision rules of the ADTDA decision tree

Figure 21. Some rules of the ADTDA decision tree
- Decision rules of the FID3 decision tree

Figure 22. Some rules of the FID3 decision tree
3.4.3. Evaluation of the algorithms
Evaluating the accuracy of the algorithms with 10-fold cross-validation on the tennis data set (Table 3) and on the Bank_data data set gives the following results:

Data set    Samples  Attributes  ID3     ADTDA   FID3
Bank_data   600      11          77.33%  78.57%  80.71%
Tennis      12       5           80%     80%     80%
Average                          78.67%  79.29%  80.36%

Table 5. Accuracy of the algorithms
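For readers unfamiliar with the procedure, k-fold cross-validation can be sketched as follows. This is a generic harness, not the evaluation code of the thesis: the function names, the fixed random seed and the majority-class learner used in the demonstration are all assumptions of the illustration, and the toy labels are not taken from the experiments.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(samples, labels, train, predict, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k, seed):
        if not held_out:                  # fewer samples than folds
            continue
        test = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in test]
        model = train([samples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / len(accuracies)

# Placeholder learner: always predicts the majority class of the training part.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

samples = list(range(12))                 # hypothetical sample ids
labels = ["no"] * 12                      # hypothetical, single-class labels
print(cross_validate(samples, labels, train_majority, predict_majority, k=10))
```

To evaluate a tree learner, `train` would build the tree from the training part and `predict` would classify one sample with it; the accuracy reported is the mean over the folds.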
3.4.4. Applying the decision tree in data mining
The application supports bank officers in deciding whether to grant a loan. For information about a loan applicant (the values of the condition attributes are known, but the record has not been classified), the class of the record (grant or do not grant the loan) is predicted from the decision tree model that was built. This supports the bank officer in the lending decision.
When the decision tree model is built, the application also evaluates the accuracy of each decision rule on the training data. As a result, the classification of a new sample comes with a confidence value.
For example, suppose the accuracy of rule 9 on the training data is 90%. When a sample is classified by rule 9, the confidence of the assigned class is 90%.
The confidence of the decision rules depends strongly on the training data: the larger the training set, the more reliable the rules. In this application, however, the decision tree is built from a training set of only 600 records, so the confidence values of the rules are illustrative only (their accuracy is limited).
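The confidence of a rule, as described above, is the share of training records matched by the rule whose class agrees with the rule's decision. The sketch below shows one way to compute it; the function name, the rule and the fifteen training rows are hypothetical and only reuse attribute names from Table 4.

```python
def rule_confidence(rows, conditions, decision, target="RESULT"):
    """Share of training rows matching `conditions` whose class is `decision`."""
    matched = [r for r in rows
               if all(r.get(a) == v for a, v in conditions.items())]
    if not matched:
        return None          # the rule never fires on the training data
    return sum(r[target] == decision for r in matched) / len(matched)

# Hypothetical training rows: 10 high-income customers (9 granted, 1 refused)
# and 5 low-income customers (all refused).
rows = (
    [{"Thu_nhap": "Cao", "RESULT": True}] * 9
    + [{"Thu_nhap": "Cao", "RESULT": False}]
    + [{"Thu_nhap": "Thap", "RESULT": False}] * 5
)
print(rule_confidence(rows, {"Thu_nhap": "Cao"}, True))
```

A new sample classified by this hypothetical rule would be reported with a confidence of 0.9, mirroring the 90% example in the text.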


Figure 23. User interface of the application
3.5. Conclusion of Chapter 3
This chapter stated the problem used to verify the tree-building algorithms of Chapter 2 on the sample data set Bank_data. It also described the implementation, evaluated the accuracy of each algorithm and of the extracted rules, and showed how new samples are classified with the decision tree model (the decision rules) that was built.
CONCLUSION
Data mining has attracted, and continues to attract, researchers because it makes it possible to discover knowledge in very large databases by intelligent methods. Research in this field requires combining results from many areas of computer science and applying them to the individual tasks of data mining.
After two years of study and research, and in particular during the period of writing this thesis, the author has completed the thesis in line with the objectives set at the outset. Specifically, the thesis has achieved the following:
- It presents the basic concepts of data mining and organizes the basic concepts of rough set theory that are applied to building decision trees.
- It introduces the general method for building a decision tree and presents three tree-building algorithms, ID3, ADTDA and FID3, together with examples that illustrate each method.
- It implements the three algorithms ID3, ADTDA and FID3 in Visual Basic on the sample data set Bank_data, and evaluates the accuracy of the algorithms and of each rule in the decision tree model.
Through this work the author has not only gained further knowledge but also improved her programming and application development skills.
The author considers that the thesis addresses the research questions that were set and supports them with concrete examples. Because of limited time, however, the thesis still has shortcomings, and some issues require further study.
Directions for future work:
Theory:
- Continue studying rough-set-based decision tree algorithms for data mining, such as ADTCCC (based on the CORE and the classification contribution of attributes) and ADTNDA (based on a new attribute dependency).
- Study methods for building decision trees on incomplete information systems and on continuous and uncertain data.
Demo program:
- Add more training data so that the decision tree model becomes more reliable and works more effectively.
- Continue developing the program towards a data mining tool for consumer credit that supports credit officers in deciding whether to grant a loan.
- Investigate practical requirements in order to improve the program, and reimplement the algorithms studied so that they work better with large databases and could lead to a product on the market.
REFERENCES
Vietnamese
[1] Hồ Thuần, Hoàng Thị Lan Giao (2005), "Một thuật toán tìm tập rút gọn sử dụng ma trận phân biệt được", Chuyên san các công trình nghiên cứu triển khai Viễn thông và CNTT, (15), pp. 83-87.
[2] Nguyễn Thanh Bình (2007), Ứng dụng cây quyết định trong bài toán phân lớp, Master's thesis, Trường Đại học Khoa học - Đại học Huế.
[3] Nguyễn Thanh Tùng (2009), "Một tiêu chuẩn mới chọn nút xây dựng cây quyết định", Tạp chí Khoa học và Công nghệ, 47(2), pp. 15-25.
English
[4] Andrzej Skowron, Ning Zhong (2000), Rough Sets in KDD, Tutorial Notes.
[5] Baoshi Ding, Yongqing Zheng, Shaoyu Zang (2009), "A New Decision Tree Algorithm Based on Rough Set Theory", Asia-Pacific Conference on Information Processing, (2), pp. 326-329.
[6] Cuiru Wang, Fangfang OU (2008), "An Algorithm for Decision Tree Construction Based on Rough Set Theory", International Conference on Computer Science and Information Technology, pp. 295-298.
[7] Ho Tu Hao, Knowledge Discovery and Data Mining Techniques and Practice, http://www.netnam.vn/unescocourse/knowledge.
[8] Jan Komorowski, Lech Polkowski, Andrzej Skowron, Rough Sets: A Tutorial, http://www/folli.loria.fr/cds/1999/library/pdf/skowron.pdf
[9] John Ross Quinlan (1990), "Decision trees and decision making", IEEE Transactions on Systems, Man and Cybernetics, (20), pp. 339-346.
[10] Longjun Huang, Minghe Huang, Bin Guo, Zhiming Zhang (2007), "A New Method for Constructing Decision Tree Based on Rough Set Theory", IEEE International Conference on Granular Computing, pp. 241-244.
[11] Ramadevi Yellasiri, C.R. Rao, Vivekchan Reddy (2007), "Decision Tree Induction Using Rough Set Theory - Comparative Study", Journal of Theoretical and Applied Information Technology, pp. 110-114.
[12] Sang Wook Han, Jae Yearn Kim (2007), "Rough Set-based Decision Tree using the Core Attributes Concept", Second International Conference on Innovative Computing, Information and Control, pp. 298-301.
[13] Weijun Wen (2009), "A New Method for Constructing Decision Tree Based on Rough Set Theory", Proceedings of the International Symposium on Intelligent Information Systems and Applications, Qingdao, China, pp. 416-419.
[14] Z. Pawlak (1998), "Rough Set Theory and Its Application to Data Analysis", Cybernetics and Systems: An International Journal, 29, pp. 661-688.