
Decision Tree Model

UNIVERSITY OF SCIENCE


FACULTY OF INFORMATION TECHNOLOGY


MACHINE LEARNING COURSE PROJECT
Graduate Class - Majors: Computer Science & Information Systems


DECISION TREE MODEL
(Decision Tree Learning)


Supervisor: Dr. Trn Thi Sn
Group members:
11 12 020 Nguyn Thy Ngc
11 11 013 Trn Thanh Hip
11 11 010 o Ngc Giang
11 11 055 L Nguyn Qunh Thy
11 11 027 Nguyn Mai Lnh


Ho Chi Minh City, April-June 2012
TASK ASSIGNMENT

Group member           Student ID   Assigned content
Nguyn Thy Ngc          11 12 020    Part 1. Introduction; Part 2. Measures
Nguyn Mai Lnh          11 11 027    Part 3. The ID3 algorithm
Trn Thanh Hip          11 11 013    Part 4. The C4.5 algorithm
L Nguyn Qunh Thy      11 11 055    Part 5. Issues in decision tree learning
o Ngc Giang            11 11 010    Part 6. Demo
TABLE OF CONTENTS

1. Introduction
1.1. Decision trees and some applications
1.2. Decision tree representation
2. Measures
2.1. Measures based on information theory
2.1.1. Information Gain
2.1.2. Gain Ratio
2.2. Gini Index
3. The ID3 algorithm
3.1. Introduction to the algorithm
3.2. Choosing the best classifying attribute
3.3. Hypothesis space search in ID3
3.4. Converting the tree to rules
3.5. Learning bias in decision trees
4. The C4.5 algorithm
4.1. Introduction to C4.5
4.2. Measures used in C4.5
4.2.1. Information Gain
4.2.2. Gain Ratio in C4.5
4.3. Characteristics of C4.5
4.3.1. C4.5 has its own mechanism for handling missing values
4.3.2. Avoiding overfitting the data
4.3.3. Converting a decision tree to rules
4.4. Remarks on C4.5
5. Issues in decision tree learning
5.1. Avoiding overfitting the data
5.1.1. Reduced-error pruning
5.1.2. Rule post-pruning
5.2. Incorporating continuous-valued attributes
5.2.1. Alternative measures for selecting attributes
5.2.2. Handling training examples with missing attribute values
5.2.3. Handling attributes with differing costs
6. Demo
6.1. Hardware requirements and sample data
6.2. Introduction to the program
Program workflow
6.3. Training process
Loading the training data
Building the decision tree
Extracting the rule set
6.4. Classification
6.5. Some evaluations
6.6. Links to the application and sample data
References

Foreword
The decision tree model is one of the most popular methods and is widely applied in practice to classification and prediction problems. [1] This chapter gives a general introduction to the decision tree model, some common measures used in the algorithms, the ID3 and C4.5 algorithms, and some extended issues of decision trees.

1. Introduction
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented as a decision tree. A learned decision tree can be re-expressed as if-then rules to improve readability. It is one of the most widely used inductive learning methods, and it has been applied quite successfully in medical and financial applications such as disease diagnosis and credit risk assessment. [1]

1.1. Decision trees and some applications
The decision tree is a classification method in the supervised learning family, alongside rule-based methods, naive Bayes, neural networks, SVM and others.
Decision trees are used for classification and prediction in applications such as:
- Weather forecasting (whether it will be sunny, rainy or overcast) based on factors such as temperature, wind strength and humidity.
- Business forecasting (whether next month's sales will rise or fall) based on factors such as consumption indices, social factors and events.
- Bank credit (a customer's ability to repay a loan).
- The stock market (whether the price of gold or of a stock will rise or fall).

1.2. Decision tree representation
Given the practical needs above, the goal is to build a decision tree for a decision attribute (the predefined classes, or the attribute to be predicted) based on the observed attributes. For example, in the weather forecasting problem the decision attribute is the weather, with the classes sunny, rainy and overcast, and the condition attributes are temperature, wind strength, humidity and so on.
Main components of a decision tree: [3]

- Leaf node $C_k$: the label of the k-th class (decision attribute C).
- Root node and internal nodes $A_i$: attribute $A_i$ (a condition attribute).
- Branch $V_{ij}$: the j-th case (a value or a range of values) of $A_i$. Ranges of values arise when the test is a comparison (>, <, >=, <=).
Thus, to build a decision tree, the attributes $A_i$ must be arranged into the nodes of the tree. The question is which attribute to choose as the root node and which as internal nodes at each step, and likewise for the following steps. This raises a further question: does the order in which the attributes are arranged in the tree affect the quality of the result?
The illustration below shows that choosing Humidity or Outlook as the root node yields two decision trees of different complexity.
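[Figure: generic decision tree structure with internal nodes $A_n$, $A_m$, $A_k$, $A_l$, branches $v_{n1}$, $v_{n2}$, $v_{n3}$, $v_{m1}$, $v_{m2}$, $v_{l1}$, $v_{l2}$, $v_{k1}$, $v_{k2}$, and leaf nodes C1, C2, C3]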


Measures are therefore needed to help select a suitable attribute at each step; these measures are introduced in the next section.
Depending on the type of label of the decision attribute $C_k$, decision trees fall into two kinds:
- Classification tree: the labels $C_k$ are nominal (categorical) data.
  - Example: predicting whether a match will be won or lost, or whether a student's academic result is excellent, good or average.
- Regression tree: the labels $C_k$ are quantitative data (numeric values).
  - Example: estimating the price of a house or the duration of a medical treatment.
The popular algorithms for each kind are ID3, C4.5 and CART (Classification And Regression Tree):
- CART handles problems that call for either classification trees or regression trees.
- ID3 and C4.5 handle problems that call for classification trees.
Each algorithm has its own measures for evaluating and selecting the attribute used to build the tree at each step.

2. Measures
How should the splitting attribute be chosen at each step of building a decision tree? Measures are introduced to answer this question.
The measure is the key part of an algorithm that evaluates which attribute will be chosen as the splitting attribute. A good splitting attribute ensures the least overlap and randomness among the partitions it creates. This approach minimizes the number of tests needed to classify an element.
Different decision tree algorithms and approaches use different measures. This section presents two widely used families: measures based on information theory (Information Gain and Gain Ratio, used in ID3 and C4.5) and the Gini measure (used in CART).
2.1. Measures based on information theory
The measures in this group derive from the information theory of Claude Shannon, which has its roots in probability theory and statistics. All of them rely on one concept: entropy.
Entropy is a quantity that characterizes the degree of disorder of the elements in a population S. Put differently, it is the amount of information needed to classify one element of S.
Remark: if the disorder of the population S is high, much information is needed to classify an element of S. Conversely, if the disorder is low, little information is needed to classify an element of S.
The figure below illustrates the disorder of the elements in a population S.

[Figure: two populations. In the first the elements are mixed chaotically, so much information is needed to classify one element and the entropy is high. In the second the elements are concentrated and less chaotic, so little information is needed to classify one element and the entropy is low.]
General formula for the entropy of a population S:
$$Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad p_i = \frac{|C_i|}{|S|}$$

where $p_i$ is the probability that an arbitrary element of S belongs to class $C_i$ (i = 1..m), and $C_i$ is a value of the decision attribute.
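Example 1: The data set S has 14 elements, and the decision attribute "Play Tennis?" takes the values Yes and No. The set can therefore be divided into 2 classes by the decision attribute, $C_1$: Yes (9 elements) and $C_2$: No (5 elements). Applying the formula, the entropy of S is:

$$Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$
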
The formula above is the general one for the case where S has m classes. When S has 2 classes, the entropy of S is:
$$Entropy(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$

When S has only 2 classes $C_1$, $C_2$:
- If $p_1 = 0$ or $p_2 = 0$, then entropy = 0: S is homogeneous (every element belongs to one class).
- If $p_1 = p_2 = 1/2$, then entropy = 1: the numbers of elements in $C_1$ and $C_2$ are equal.
- If $p_i \in (0,1)$ with $p_1 \neq p_2$, then entropy $\in (0,1)$: the numbers of elements in $C_1$ and $C_2$ differ.
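The entropy formula and the two-class special cases can be checked numerically. The following is a minimal sketch, not part of the original report; the function name `entropy` and its list-of-counts argument are illustrative assumptions.

```python
import math

def entropy(class_counts):
    """Entropy (in bits) of a population, given the number of elements in each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # by convention, 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy([9, 5]), 3))  # Example 1: 9 Yes and 5 No
print(entropy([7, 7]))            # two equally sized classes
print(entropy([14, 0]))           # homogeneous population
```

Working from class counts rather than probabilities keeps the sketch close to the definition $p_i = |C_i|/|S|$.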
The cases above can be illustrated by a graph or by examples.
The remaining question is this: besides the decision attribute $C_i$, the population S also has many condition attributes, such as A, A', A'', and so on. Which of them should be chosen as the splitting attribute?

[Figure: partition of S by attribute A, and partition of S by attribute A']
2.1.1. Information Gain
The Information Gain measure helps select the splitting attribute. It is the measure used in the ID3 algorithm. Each condition attribute divides the population S into partitions. Information Gain uses the entropy of information theory to indicate the degree of overlap among the partitions created, that is, whether a partition contains elements from a single class or from several classes.
An attribute A divides S into v partitions, denoted $\{S^A_1, S^A_2, \dots, S^A_v\}$, where v is the number of values of A. The Information Gain is:
$$Gain(S, A) = Entropy(S) - \sum_{j=1}^{v} \frac{|S^A_j|}{|S|}\, Entropy(S^A_j)$$

where $Entropy(S^A_j)$ is the degree of disorder of the partition $S^A_j$ created by attribute A.
Choose the attribute $A_j$ with the LARGEST $Gain(S, A_j)$ as the splitting attribute.

[Figure: entropy illustrations for the three two-class cases: $p_1 = 0$ or $p_2 = 0$; $p_1 = p_2 = 1/2$; $p_i \in (0,1)$]
Remark: Entropy(S) is the same constant for every attribute, and the disorder of the partitions created by attribute $A_j$ (the terms $Entropy(S^A_j)$) should be as small as possible. The result is therefore to choose the attribute $A_j$ whose $Gain(S, A_j)$ is largest.
Example 2: Consider the Play Tennis data set (the 14 examples listed in Section 3.1).

Consider the attribute Outlook. It has 3 values (Sunny, Overcast and Rain), so it creates 3 partitions. Computing the entropy of each partition with the formula from Section 2.1 gives:
$$Entropy(Sunny) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$
$$Entropy(Overcast) = 0$$
$$Entropy(Rain) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$$
Entropy(S) = 0.940 (computed in Example 1).
By the formula for Gain(S, Outlook):
$$Gain(S, Outlook) = 0.940 - \left(\frac{5}{14}\cdot 0.971 + \frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.971\right) = 0.246$$
Similarly, the Gain of the other attributes is:
Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temp) = 0.029
Outlook has the highest Gain, so Outlook is chosen as the splitting attribute.
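The Gain values of Example 2 can be reproduced with a short script. This is a minimal sketch under stated assumptions: the helper names are illustrative, and the (Yes, No) counts per attribute value were read off the Play Tennis table in Section 3.1.

```python
import math

def entropy(class_counts):
    """Entropy in bits from per-class counts (zero counts contribute nothing)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, partition_counts):
    """Gain(S, A) from the class counts of S and of each partition created by A."""
    total = sum(parent_counts)
    remainder = sum(sum(part) / total * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - remainder

# (Yes, No) counts for the whole set and for each attribute value.
S = (9, 5)
partitions = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rain
    "Humidity": [(3, 4), (6, 1)],          # High, Normal
    "Wind": [(6, 2), (3, 3)],              # Weak, Strong
    "Temp": [(2, 2), (4, 2), (3, 1)],      # Hot, Mild, Cool
}
gains = {a: information_gain(S, parts) for a, parts in partitions.items()}
best = max(gains, key=gains.get)
for attribute, value in gains.items():
    print(f"Gain(S, {attribute}) = {value:.3f}")
print("Splitting attribute:", best)
```

The script rounds instead of truncating, so the last printed digit may differ by one from the values quoted in the text.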
Example 3: Take the same Play Tennis data set as in Example 2, but add one attribute, When (the time of day at which tennis is played):

$$Gain(S, When) = 0.940 - \frac{4}{14}\cdot 1 - \frac{3}{14}\cdot 0.918 - \frac{3}{14}\cdot 0.918 = 0.261$$
(the remaining two partitions of When have entropy 0 and contribute nothing).
Remark: When now has the highest gain, 0.261 (larger than Gain(S, Outlook) = 0.246 from Example 2), so When would be chosen as the splitting attribute instead of Outlook. However, When has more values (5) than Outlook (3), and one of its partitions, When = 7pm, contains only 1 element. The entropy of the partitions created by When is therefore low, which leads to a high Gain(S, When).
The Information Gain measure thus tends to favor attributes with many values (trees with many branches). This harms the prediction results, so an improved measure is needed to address the problem.

2.1.2. Gain Ratio
The Gain Ratio measure was introduced to handle the situation in which an attribute creates a great many partitions, possibly with only 1 element each. It is used in the C4.5 algorithm and normalizes Information Gain by means of the Split Information. [1] The Split Information is computed as:
$$SplitInfo(S, A) = -\sum_{j=1}^{v} \frac{|S^A_j|}{|S|}\log_2\left(\frac{|S^A_j|}{|S|}\right)$$

Remark: the meaning of the split information is that the more values attribute A has, the larger SplitInfo(S, A) becomes. Dividing Gain(S, A) by SplitInfo(S, A) gives the measure GainRatio(S, A):
$$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$$

Choose the attribute $A_j$ with the LARGEST $GainRatio(S, A_j)$ as the splitting attribute.
Gain Ratio counters the tendency of Information Gain to favor attributes with many values. Suppose two attributes A and B have the same Gain (Gain(A) = Gain(B)). If A has more values than B, then SplitInfo of A is larger than SplitInfo of B. By the formula, the Gain Ratio of B is then larger than that of A, and B is chosen as the splitting attribute. In this way Gain Ratio normalizes away the bias of Information Gain toward many-valued attributes.
Example 4: For the problem of Example 3 above, recompute the Gain Ratio of the attributes Outlook and When.

$$SplitInfo(S, When) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{1}{14}\log_2\frac{1}{14} = 2.217$$

$$SplitInfo(S, Outlook) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.577$$

GainRatio(S, When) = Gain(S, When) / SplitInfo(S, When) = 0.261 / 2.217 = 0.118
GainRatio(S, Outlook) = Gain(S, Outlook) / SplitInfo(S, Outlook) = 0.246 / 1.577 = 0.156
Outlook has the larger Gain Ratio, so Outlook is chosen as the splitting attribute.
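The arithmetic of Example 4 can be reproduced as follows. This is a minimal sketch: the function names are illustrative, and the Gain values 0.261 and 0.246 are taken from the text rather than recomputed.

```python
import math

def split_info(partition_sizes):
    """SplitInfo(S, A) from the sizes of the partitions that attribute A creates."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    return gain / split_info(partition_sizes)

when_sizes = [4, 3, 3, 3, 1]   # partition sizes of When, as used in Example 4
outlook_sizes = [5, 4, 5]      # Sunny, Overcast, Rain

when_ratio = gain_ratio(0.261, when_sizes)
outlook_ratio = gain_ratio(0.246, outlook_sizes)
print(f"SplitInfo(S, When) = {split_info(when_sizes):.3f}")
print(f"SplitInfo(S, Outlook) = {split_info(outlook_sizes):.3f}")
print(f"GainRatio(S, When) = {when_ratio:.3f}")
print(f"GainRatio(S, Outlook) = {outlook_ratio:.3f}")
```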
Gain Ratio has its own problem, caused by the SplitInfo denominator:
- SplitInfo can be 0 or very small (when $|S^A_j| \approx |S|$). Gain Ratio is then undefined or very large.
Solution:
- Compute the Gain of every attribute and take the average (Gain_Average).
- Then apply Gain Ratio only to the attributes with Gain > Gain_Average.
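The two-step remedy above can be sketched as a small selection routine. The function name is hypothetical. Two choices are assumptions of this sketch rather than statements from the text: using `>=` instead of a strict `>` (so the candidate list is not empty when all gains are equal), and falling back to plain Gain when no candidate remains. The SplitInfo values for Humidity, Wind and Temp were derived from the Play Tennis table; the other numbers come from Examples 2 to 4.

```python
def choose_split_attribute(gains, split_infos):
    """Pick the attribute with the best gain ratio among those with at least average gain."""
    average_gain = sum(gains.values()) / len(gains)
    # Keep attributes whose gain is at least average and whose SplitInfo is non-zero.
    candidates = [a for a in gains
                  if gains[a] >= average_gain and split_infos[a] > 0]
    if not candidates:
        # Assumed fallback: use plain Information Gain.
        return max(gains, key=gains.get)
    return max(candidates, key=lambda a: gains[a] / split_infos[a])

gains = {"Outlook": 0.246, "Humidity": 0.151, "Wind": 0.048, "Temp": 0.029, "When": 0.261}
split_infos = {"Outlook": 1.577, "Humidity": 1.0, "Wind": 0.985, "Temp": 1.557, "When": 2.217}
chosen = choose_split_attribute(gains, split_infos)
print(chosen)
```

With these numbers the average gain is 0.147, so only Outlook, Humidity and When are compared by gain ratio.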

2.2. Gini Index
The Gini Index measure is used in the CART algorithm (Classification And Regression Tree). CART is used to build both classification trees and regression trees. [2]
A characteristic of the Gini measure is that tree construction creates a binary split for each attribute $A_j$ (a partition whose values belong to a chosen subset of the values of $A_j$, and a partition whose values do not).
- Given a training set S and the classes $\{C_1, C_2, \dots, C_n\}$, the Gini of the set S is:
$$Gini(S) = \sum_{i=1}^{n} p_i (1 - p_i) = \sum_{i=1}^{n} p_i - \sum_{i=1}^{n} p_i^2 = 1 - \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the probability that an example in S belongs to class $C_i$.
Gini(S) = 0 if all the examples in S belong to the same class.
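Some notation:
- $DOM(A) = \{a_1, a_2, \dots, a_v\}$ is the set of values of attribute A.
- $\emptyset \neq V_A \subset DOM(A)$: $V_A$ is a proper, non-empty subset of the values of A. There are $2^v - 2$ such subsets (the empty set and DOM(A) itself are excluded).
- $\overline{V_A} = DOM(A) \setminus V_A$
- $S_{V_A}$ consists of the elements of S whose value of attribute A is in $V_A$.
- $S_{\overline{V_A}}$ consists of the elements of S whose value of attribute A is in $\overline{V_A}$.
The Gini measure of the set S partitioned by attribute A (for a given subset $V_A$) is:
$$Gini(S, A) = \frac{|S_{V_A}|}{|S|}\, Gini(S_{V_A}) + \frac{|S_{\overline{V_A}}|}{|S|}\, Gini(S_{\overline{V_A}})$$

For each attribute, the subset $V_A$ giving the smallest value is used. Choose the attribute with the SMALLEST Gini(S, A) as the splitting attribute.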
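Example 5: using the Play Tennis data set above.
Consider the attribute Outlook:
A = Outlook; DOM(A) = {sunny, rain, overcast}
$V_{A1}$ = {sunny, rain}; $V_{A2}$ = {sunny, overcast}; $V_{A3}$ = {rain, overcast}
$V_{A4}$ = {overcast} = $\overline{V_{A1}}$; $V_{A5}$ = {rain} = $\overline{V_{A2}}$; $V_{A6}$ = {sunny} = $\overline{V_{A3}}$
For $V_{A1}$ and $V_{A4}$ (class counts [5+,5-] and [4+,0-]):
$$Gini(S, Outlook) = \frac{10}{14}\, Gini(S_{V_{A1}}) + \frac{4}{14}\, Gini(S_{V_{A4}}) = \frac{10}{14}\left(1 - \left(\tfrac{5}{10}\right)^2 - \left(\tfrac{5}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\tfrac{4}{4}\right)^2\right) = 0.357$$
For $V_{A2}$ and $V_{A5}$ (class counts [6+,3-] and [3+,2-]):
$$Gini(S, Outlook) = \frac{9}{14}\, Gini(S_{V_{A2}}) + \frac{5}{14}\, Gini(S_{V_{A5}}) = \frac{9}{14}\left(1 - \left(\tfrac{6}{9}\right)^2 - \left(\tfrac{3}{9}\right)^2\right) + \frac{5}{14}\left(1 - \left(\tfrac{3}{5}\right)^2 - \left(\tfrac{2}{5}\right)^2\right) = 0.457$$
For $V_{A3}$ and $V_{A6}$ (class counts [7+,2-] and [2+,3-]):
$$Gini(S, Outlook) = \frac{9}{14}\, Gini(S_{V_{A3}}) + \frac{5}{14}\, Gini(S_{V_{A6}}) = \frac{9}{14}\left(1 - \left(\tfrac{7}{9}\right)^2 - \left(\tfrac{2}{9}\right)^2\right) + \frac{5}{14}\left(1 - \left(\tfrac{2}{5}\right)^2 - \left(\tfrac{3}{5}\right)^2\right) = 0.394$$

Taking the smallest of the three values, Gini(S, Outlook) = 0.357.
Similarly:
Gini(S, Temp) = 0.45; Gini(S, Humidity) = 0.367; Gini(S, Wind) = 0.429
Outlook is chosen as the splitting attribute because it has the smallest Gini value.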
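The search over binary partitions in Example 5 can be sketched as follows. This is an illustrative sketch, not CART itself: the function names and the dictionary of (Yes, No) counts per Outlook value are assumptions made for the example, with the counts read off the Play Tennis table.

```python
from itertools import combinations

def gini(class_counts):
    """Gini(S) = 1 - sum of squared class probabilities."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def best_binary_split(value_counts):
    """Try every proper non-empty subset V_A of the attribute values.

    value_counts maps each attribute value to its per-class counts.
    Returns (smallest weighted Gini, subset V_A that achieves it).
    """
    values = sorted(value_counts)
    n_total = sum(sum(counts) for counts in value_counts.values())
    n_classes = len(next(iter(value_counts.values())))
    best = None
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            inside = [sum(value_counts[v][k] for v in subset)
                      for k in range(n_classes)]
            outside = [sum(value_counts[v][k] for v in values if v not in subset)
                       for k in range(n_classes)]
            score = (sum(inside) * gini(inside) + sum(outside) * gini(outside)) / n_total
            if best is None or score < best[0]:
                best = (score, set(subset))
    return best

outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
score, subset = best_binary_split(outlook)
print(f"Gini(S, Outlook) = {score:.3f} with V_A = {subset}")
```

The loop visits each binary partition twice (once as $V_A$ and once as its complement), which mirrors the $2^v - 2$ subsets counted in the text at the cost of some redundant work.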

3. The ID3 algorithm
3.1. Introduction to the algorithm
The ID3 algorithm learns a decision tree by building it top-down, starting with the question "which attribute should be tested first?", that is, which attribute should be the root of the tree. To answer it, each attribute is evaluated with a measure to decide whether it is the best attribute for classifying the examples. The best attribute becomes the root node of the tree. A child branch is created for each value of this attribute, and the training examples are sorted into the corresponding branches. The whole process is then repeated with the training examples associated with each child node, to choose the best attribute to test at that node. This is a greedy search: at each branch the algorithm never goes back to reconsider an earlier choice.
The task of ID3 is thus to learn a decision tree from a set of training examples, also called training data. In other words, the algorithm has:
- Input: a set of training examples. Each example consists of the attributes describing a situation or an object, together with its classification value.
- Output: a decision tree that correctly classifies the examples in the training set.
For example, consider the problem of deciding whether to play tennis (Play Tennis) in given weather conditions. ID3 learns a decision tree from the following set of examples:

Day   Outlook    Temp   Humidity   Wind     Play Tennis
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes
D8    Sunny      Mild   High       Weak     No
D9    Sunny      Cool   Normal     Weak     Yes
D10   Rain       Mild   Normal     Weak     Yes
D11   Sunny      Mild   Normal     Strong   Yes
D12   Overcast   Mild   High       Strong   Yes
D13   Overcast   Hot    Normal     Weak     Yes
D14   Rain       Mild   High       Strong   No

This data set contains 14 examples. Each example describes a weather condition through the attributes Outlook, Temp, Humidity and Wind, and each has a classification attribute Play Tennis (Yes, No). No means not playing tennis in that weather; Yes means the opposite. The classification value takes only two values (No, Yes); in other words, the examples of this concept are divided into two classes. The attribute Play Tennis is also called the target attribute.
Each attribute has a finite set of values. Outlook has three values (Sunny, Overcast, Rain), Temp has three values (Hot, Mild, Cool), Humidity has two values (High, Normal) and Wind has two values (Strong, Weak). These values are the symbols used to represent the problem.
From this training set, ID3 learns a decision tree that correctly classifies the examples in the set and that, hopefully, will also correctly classify future examples outside the set. One decision tree that ID3 can induce is the following:


Each internal node of the decision tree represents a test on some attribute, and each possible value of that attribute corresponds to a branch of the tree. The leaf nodes give the classification of the examples that reach that branch, that is, the value of the classification attribute.
Once the algorithm has induced the decision tree, the tree is used to classify all future examples, or instances. The tree does not change until ID3 is run again on a different training set.
For a given training set there are many decision trees that classify all of its examples correctly. The sizes of these trees differ, depending on the order of the attribute tests.
How, then, can we learn a decision tree that classifies all the training examples correctly? A simple approach is to memorize every example by building a tree with one leaf per example. Such a tree may well fail to classify unseen examples correctly. This is rote learning: the tree learns no generalization of the target concept. So what kind of decision tree should we learn?
Occam's razor, along with other arguments, holds that the most likely hypothesis is the simplest one consistent with all the observations: we should always accept the simplest answer that correctly fits our data. Here this means that learning algorithms try to produce the smallest decision tree that correctly classifies all the given examples. The following part describes ID3, a simple decision tree induction algorithm that meets these requirements.
[Figure: decision tree induced by ID3 for Play Tennis. Root: Outlook. Sunny leads to Humidity (Normal: Yes, High: No); Overcast leads to Yes; Rain leads to Wind (Weak: Yes, Strong: No)]
ID3 builds the decision tree top-down. Note that any attribute can be used to partition the set of training examples into disjoint subsets, such that all the examples in one partition share a common value of that attribute. ID3 chooses an attribute to test at the current node of the tree and uses this test to partition the set of examples; the algorithm then recursively builds a subtree for each partition. This continues until all members of a partition belong to the same class; that class becomes a leaf node of the tree.
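The recursive procedure just described can be sketched in a few lines. This is a minimal illustration, not the original ID3 implementation: the nested-dictionary tree representation, the function names and the majority-class fallback for an exhausted attribute list are assumptions of this sketch. The data are the 14 Play Tennis examples from the table above.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temp", "Humidity", "Wind", "Play"]
ROWS = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
DATA = [dict(zip(COLUMNS, line.split())) for line in ROWS.splitlines()]

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    return -sum(n / len(examples) * math.log2(n / len(examples))
                for n in counts.values())

def gain(examples, attribute, target):
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:      # all examples in one class: leaf node
        return labels[0]
    if not attributes:             # no attribute left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

tree = id3(DATA, COLUMNS[:-1], "Play")
print(tree)
```

Branches are created only for attribute values that occur in the examples reaching a node, so no partition is ever empty in this sketch.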
3.2. Choosing the best classifying attribute
The key point of the ID3 algorithm is choosing the best attribute to test at each node of the tree, that is, the attribute that classifies the examples most effectively. For this purpose we use Information Gain: ID3 uses it to select the best attribute among the candidate attributes at each step while growing the tree top-down.
Information Gain uses the entropy of information theory to indicate the degree of overlap among the partitions created, that is, whether a partition contains elements from a single class or from several classes.
The Information Gain of an attribute A, relative to a given set of examples S, is computed as:
$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$$

where Values(A) is the set of values of attribute A, and $S_v$ is the subset of S for which attribute A has value v (that is, $S_v = \{s \in S \mid A(s) = v\}$). Note that the first term is simply the entropy of the original set S, and the second term is the expected entropy after S is partitioned by attribute A. This expected entropy is the sum of the entropies of the subsets $S_v$, each weighted by the fraction $|S_v|/|S|$ of examples that belong to $S_v$. Gain(S, A) is therefore the expected reduction in entropy caused by knowing the value of attribute A. In other words, Gain(S, A) is the information provided about the target function value, given the value of attribute A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S by knowing the value of attribute A.
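Consider the example given at the beginning (Table 3.2).
S is a set of training examples described by attributes including Wind, which takes the value Weak or Strong. As before, S contains 14 examples, [9+,5-]. Of these, 8 have Wind = Weak, of which [6+,2-] (6 positive and 2 negative). The remaining 6 have Wind = Strong, of which [3+,3-]. The Information Gain of partitioning these 14 examples by Wind is computed as follows:

Values(Wind) = {Weak, Strong}
S = [9+,5-]; Entropy(S) = Entropy([9+,5-]) = 0.940
$S_{Weak}$: the child node for the value Weak, [6+,2-]
$S_{Strong}$: the child node for the value Strong, [3+,3-]

$$Gain(S, Wind) = Entropy(S) - \sum_{v \in \{Weak, Strong\}} \frac{|S_v|}{|S|}\, Entropy(S_v)$$
$$= Entropy(S) - \frac{8}{14}\, Entropy(S_{Weak}) - \frac{6}{14}\, Entropy(S_{Strong})$$
$$= 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.00) = 0.048$$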

Information Gain is the measure ID3 uses to select the best attribute at each step of growing the tree. A concrete evaluation compares two attributes, Humidity and Wind, to determine which one better partitions the initial set of examples.
In the same way, the Information Gain of the 4 attributes is:
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029

According to the Information Gain measure, Outlook gives the best prediction of the target attribute, PlayTennis. Outlook is therefore chosen as the decision attribute for the root node, and a branch is created below the root for each of its values: Sunny, Overcast and Rain.

Note that for the examples with Outlook = Overcast, every value of PlayTennis is Yes (marked +). This node is therefore a leaf node with the classification PlayTennis = Yes. In contrast, the branches for Outlook = Sunny and Outlook = Rain still have non-zero entropy, and the decision tree must be extended below them.
The process of choosing a new attribute and partitioning the training examples is repeated at each child node, this time using only the training examples associated with that node. Attributes already chosen higher in the tree are not considered again, so each attribute appears at most once along any path through the tree. This process continues at each new leaf node until one of two conditions holds: (1) every attribute has already been used along this path through the tree, or (2) the training examples associated with the leaf node all have the same target value (all Yes or all No).
[Figure: partition of the training examples by Outlook. All examples: + D3, D4, D5, D7, D9, D10, D11, D12, D13; - D1, D2, D6, D8, D14. Sunny: + D9, D11; - D1, D2, D8. Overcast: + D3, D7, D12, D13; - none. Rain: + D4, D5, D10; - D6, D14]
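The next step in growing the tree:

[Figure: partially learned tree. Root: Outlook. Sunny: {D1,D2,D8,D9,D11} [2+,3-], subtree still to be determined (?); Overcast: {D3,D7,D12,D13} [4+,0-], leaf Yes; Rain: {D4,D5,D6,D10,D14} [3+,2-], subtree still to be determined (?)]

$S_{Sunny}$ = {D1,D2,D8,D9,D11}
$$Gain(S_{Sunny}, Humidity) = 0.971 - \frac{3}{5}(0.0) - \frac{2}{5}(0.0) = 0.971$$
$$Gain(S_{Sunny}, Temperature) = 0.971 - \frac{2}{5}(0.0) - \frac{2}{5}(1.0) - \frac{1}{5}(0.0) = 0.571$$
$$Gain(S_{Sunny}, Wind) = 0.971 - \frac{2}{5}(1.0) - \frac{3}{5}(0.918) = 0.020$$
For Outlook = Sunny, the next attribute used to branch the tree is Humidity, because it has the largest Information Gain.
Proceeding in the same way for Outlook = Rain gives the complete tree.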

3.3. Hypothesis space search in ID3
Like other inductive learning methods, ID3 can be characterized as searching a space of hypotheses for one that fits the training examples. The hypothesis space searched by ID3 is the set of possible decision trees.
[Figure: the complete decision tree. Root: Outlook. Sunny: {D1,D2,D8,D9,D11} [2+,3-], tested on Humidity (High: {D1,D2,D8} [0+,3-], No; Normal: {D9,D11} [2+,0-], Yes). Overcast: {D3,D7,D12,D13} [4+,0-], Yes. Rain: {D4,D5,D6,D10,D14} [3+,2-], tested on Wind (Strong: {D6,D14} [0+,2-], No; Weak: {D4,D5,D10} [3+,0-], Yes)]

- ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space, beginning with the empty tree. ID3 searches for only one tree (not all trees) consistent with the training examples.
- ID3 performs no backtracking in its search:
+ It is guaranteed to find only a locally optimal solution, not a globally optimal one.
+ Once an attribute has been selected as the test attribute, ID3 never reconsiders this choice.
ID3 uses all the training examples at each step of the search to make statistically based decisions. As a result, the outcome is less affected by erroneous or noisy data.
3.4. Converting the tree to rules
The decision tree can be converted into a set of rules, which is convenient for implementation and use. For example, the decision tree for the training data above can be converted into the following rules:

- $R_1$: If (Outlook=Sunny) ∧ (Humidity=Normal) Then Play Tennis=Yes
- $R_2$: If (Outlook=Sunny) ∧ (Humidity=High) Then Play Tennis=No
- $R_3$: If (Outlook=Overcast) Then Play Tennis=Yes
- $R_4$: If (Outlook=Rain) ∧ (Wind=Weak) Then Play Tennis=Yes
- $R_5$: If (Outlook=Rain) ∧ (Wind=Strong) Then Play Tennis=No
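Each rule corresponds to one root-to-leaf path, so the conversion can be sketched as a tree traversal. The nested-dictionary encoding of the tree and the function names below are assumptions made for this illustration; the tree itself is the Play Tennis tree from Section 3.

```python
def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):   # leaf node: emit one rule
        return [(conditions, tree)]
    (attribute, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

def classify(rules, example):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(example[a] == v for a, v in conditions):
            return label
    return None

TREE = {"Outlook": {
    "Sunny": {"Humidity": {"Normal": "Yes", "High": "No"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
}}

rules = tree_to_rules(TREE)
for number, (conditions, label) in enumerate(rules, start=1):
    body = " AND ".join(f"({a}={v})" for a, v in conditions)
    print(f"R{number}: If {body} Then Play Tennis={label}")
```

Because the rules come from disjoint paths of one tree, at most one rule matches any example, so the order in which `classify` scans them does not matter here.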
3.5. Learning bias in decision trees
- For a given set of training examples, there may be more than one decision tree consistent with the examples.
- ID3 chooses the first consistent decision tree it finds during its search.
- The search strategy of ID3:
+ It prefers simple decision trees (of small depth).
+ It prefers decision trees in which attributes with higher Information Gain are tested at nodes closer to the root.

4. The C4.5 algorithm
4.1. Introduction to C4.5
C4.5 is an effective decision-tree-based classification algorithm that is popular in data mining applications on small databases. C4.5 keeps the data resident in memory; this characteristic, together with its mechanism of re-sorting the data at each node while growing the tree, makes C4.5 suitable only for small databases. C4.5 also includes a technique for re-expressing the decision tree as an ordered list of if-then rules (an easily understood form of classification rule). This technique reduces the size of the rule set and simplifies the rules, while keeping accuracy equivalent to that of the corresponding branches of the decision tree.
The tree-growing approach of C4.5 follows Hunt's method. C4.5 applies a depth-first strategy.
4.2. Measures used in C4.5
Most machine learning systems try to produce a tree that is as small as possible, because smaller trees are easier to understand and more readily achieve high predictive accuracy.
Since the minimality of a decision tree cannot be guaranteed, C4.5 relies on a heuristic: it chooses the split for which the attribute selection measure reaches its maximum.
The two measures used in C4.5 are information gain and gain ratio.
4.2.1. Information Gain
$$InfoGain(S, A) = Entropy(S) - \sum_{v \in Value(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$$

where Value(A) is the set of values of attribute A, and $S_v$ is the subset of S in which A takes the value v.
Example of computing information gain
Consider the following database (the computations below use the Play Tennis data of Section 3.1):
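Consider the attribute Outlook (O):
$$Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$
$$InfoGain(S, O) = 0.94 - \left[\frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) + 0 + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right)\right] \approx 0.246$$

Consider the attribute Temperature (T):
$$InfoGain(S, T) = 0.94 - \left[\frac{4}{14}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{4}{14}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{6}{14}\left(-\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\right)\right] \approx 0.029$$

Similarly, for Humidity (H) and Wind (W):
InfoGain(S, H) = 0.151
InfoGain(S, W) = 0.048
Therefore Outlook is chosen, because it has the largest InfoGain.

InfoGain is then computed for each branch that is not yet fully classified, growing the tree recursively.
Result obtained:
[Figure: the resulting decision tree]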
4.2.2. Gain Ratio in C4.5
The Information Gain measure favors attributes with many values. To address this, Gain Ratio adds the split information:
$$SplitInfo(S, A) = -\sum_{v \in Value(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}, \qquad GainRatio(S, A) = \frac{InfoGain(S, A)}{SplitInfo(S, A)}$$

Example of Gain Ratio
$$SplitInfo(S, O) = -\left(\frac{5}{14}\log_2\frac{5}{14} + \frac{4}{14}\log_2\frac{4}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) \approx 1.58$$
$$InfoGain(S, O) = 0.246$$
$$GainRatio(S, O) = 0.246 / 1.58 = 0.156$$

Similarly:
GainRatio(S, H) = 0.151/1 = 0.151
GainRatio(S, W) = 0.048/0.985 = 0.049
GainRatio(S, T) = 0.029/1.557 = 0.019
Therefore Outlook is chosen, because it has the largest Gain Ratio.
4.3. Characteristics of C4.5
4.3.1. C4.5 has its own mechanism for handling missing values
One improvement over ID3 is that C4.5 has its own mechanism for handling missing data.
Example: let $S_o$ be the set of examples whose value of the attribute Outlook is missing. The information gain of Outlook then decreases, because nothing can be learned from the examples in $S_o$:
$$InfoGain(S, O) = \frac{|S - S_o|}{|S|}\, InfoGain(S - S_o, O)$$

Correspondingly, the split information also changes:
$$SplitInfo(S, O) = -\frac{|S_o|}{|S|}\log_2\frac{|S_o|}{|S|} - \sum_{v \in Value(O)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$

These changes reduce the value of tests on attributes with a high proportion of missing values.
In the example above, if Outlook is chosen, C4.5 does not create a separate branch in the decision tree for $S_o$. Instead, the algorithm distributes the examples in $S_o$ among the subsets $S_i$ (the subsets with a known value of Outlook), with weight $|S_i| / |S - S_o|$.
4.3.2. Avoiding overfitting the data
Overfitting is a significant difficulty for decision tree learning and for other learning methods. It arises as follows: if there are no conflicting cases (cases with identical values for every attribute but different class values), the decision tree will classify every case in the training set correctly. The training data sometimes contains peculiarities of its own, so when the tree is applied to other data sets its accuracy is no longer as high.
There are several ways to avoid overfitting in decision trees:
+ Stop growing the tree earlier than usual, before it reaches the point where it classifies the training data perfectly. The challenge of this method is to estimate precisely when to stop growing the tree.
+ Allow the tree to overfit the data, and then prune it.
Although the first method seems more intuitive, the trees produced by the second method have been shown experimentally to be more successful in practice, because it allows potential interactions among attributes to be explored before deciding which results are worth keeping. C4.5 uses the second technique to avoid overfitting.
4.3.3. Converting a decision tree into rules
Converting a decision tree into if-then production rules yields classification rules that are easy to understand and apply. Classification models that represent concepts as production rules have proven useful in many different fields where both the accuracy and the comprehensibility of the model are required. A production rule set is therefore a sensible output format. However, the computing resources needed to generate a rule set from a training set that is large and contains many erroneous values are very large [12]. This claim will be demonstrated by experimental results on the C4.5 classification model.
The conversion from decision tree to rules consists of 4 steps:
Pruning:
The initial rules are the paths from the root to the leaves of the decision tree. A decision tree with l leaves corresponds to a production rule set with l initial rules. Each condition in a rule is examined and removed if it does not affect the accuracy of that rule.
The pruned rules are then added to an intermediate rule set, provided they do not duplicate rules already in it.
Selection:
The pruned rules are grouped by class value, forming subsets of rules per class. There are k rule subsets if the training set has k class values. Each subset is examined to select the rules that optimize the prediction accuracy for the class associated with that subset.
Sorting:
The k rule subsets produced in the previous step are sorted by error frequency. The default class is determined by identifying the training cases not covered by any current rule and choosing the most common class among those cases.
Estimation and evaluation:
The rule set is re-estimated on the whole training set to determine whether any rule reduces classification accuracy. If so, that rule is removed, and the estimation is repeated until no further improvement is possible.
4.4. Remarks on C4.5
C4.5 generates decision trees efficiently and rigorously by using gain ratio as the measure for selecting the best attribute.
C4.5 also provides mechanisms for handling erroneous and missing values and for countering overfitting, together with a tree pruning mechanism.
In addition, the C4.5 classification model includes the conversion from decision tree to if-then rules, which increases the accuracy and comprehensibility of the classification results. This is a very useful feature for users.
5. Issues in decision tree learning
5.1. Avoiding overfitting the data
The algorithm as described grows the tree just deeply enough to classify the training data perfectly. In practice this causes difficulties when the data contain noise or errors, or when the number of training examples is too small. In both cases, ID3 can produce a tree that overfits.
A hypothesis is said to overfit the training data if some other hypothesis that fits the training data less well actually performs better over the entire distribution of instances. This chapter introduces the most commonly used techniques for dealing with overfitting.
Definition:
Given a hypothesis space H, a hypothesis h in H is said to overfit the training data if there exists some hypothesis h' in H such that h has fewer errors than h' on the training data, but h' has fewer errors than h over the entire distribution of instances.
The figure below illustrates the effect of overfitting in a typical application of decision tree learning. In this example, ID3 is applied to a task in medical informatics.
- The horizontal axis shows the total number of nodes in the decision tree.
- The vertical axis shows the accuracy of the decision tree.
- The solid line shows the accuracy of the decision trees on the training data.
- The dashed line shows the accuracy measured on an independent set of test examples.
On the graph, the accuracy measured on the independent test examples first increases, then decreases. Once the tree size exceeds 25 nodes, continuing to grow the tree lowers its accuracy on the test data even though its accuracy on the training set keeps increasing.
Figure: Overfitting in decision tree learning.
How can tree h fit the training examples better than h' and yet perform worse on the test data?
One way this can happen is when the training examples contain noise or errors. To illustrate, consider adding the following incorrectly labeled example:
(Outlook = Sunny, Temperature = Hot, Humidity = Normal,
Wind = Strong, PlayTennis = No)
instead of
(Outlook = Sunny, Temperature = Hot, Humidity = Normal,
Wind = Strong, PlayTennis = Yes)
Building the decision tree then becomes considerably more complicated. Because the data contain an error, ID3 produces a different decision tree that is more complex than the original one.
Figure: The original decision tree.
Because of the erroneous example, ID3 may add 2 more nodes under one branch of this tree.
This example shows that any random noise or error in the training examples can lead to overfitting. In practice, overfitting can occur when the training data contain errors or when the training set is too small to be representative of the full data. Overfitting is a significant practical difficulty for decision tree learning and for many other learning methods. For example, in an experimental study of ID3 involving different learning tasks with noisy data, overfitting was found to reduce the accuracy of the learned decision trees by 10-25%.
There are several approaches to avoiding overfitting in decision tree learning. They can be grouped into two classes:
- Approaches that stop growing the tree early, before it reaches the point where it perfectly classifies the training data.
- Approaches that allow the tree to overfit the data, and then prune the tree.
Although the first approach may seem more direct, the second has been found to be more successful in practice. The reason is that the first approach must estimate precisely when to stop growing the tree: what should the final tree size be, and what criterion determines it? Approaches to answering this include:
- Use a separate set of examples, distinct from the training examples, to evaluate the effect of pruning nodes from the tree.
- Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. Example: the chi-square test.
- Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting the growth of the tree when this encoding size is minimized. This approach, based on the Minimum Description Length principle, is discussed in Quinlan and Rivest (1989) and Mehta et al. (1995).
The first of these approaches is the most common. Two main variants of it are discussed below. In this approach, the available data are split into two sets: a training set, used to form the learned hypothesis, and a separate validation set, used to evaluate the accuracy of this hypothesis on subsequent data and to evaluate the impact of pruning.
The motivation is this: even though the learner may be misled by random errors and coincidental regularities within the training set, the validation set is unlikely to exhibit the same random fluctuations. The validation set can therefore be expected to provide a safety check against overfitting the training set. It is important that the validation set be large enough to provide a statistically significant sample of the instances. Typically, one third of the available data is used for validation and two thirds for training.
5.1.1. Reduced-error pruning
How exactly can a validation set be used to prevent overfitting? One approach, called reduced-error pruning (Quinlan 1987), considers each decision node in the tree as a candidate for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples associated with that node. Nodes are removed only if the resulting pruned tree performs no worse than the original on the validation set. As a result, any leaf node added because of coincidental regularities in the training set is likely to be pruned, because the same coincidences are unlikely to occur in the validation set. Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree on the validation set. Pruning continues until further pruning is harmful (that is, until it decreases the accuracy of the tree on the validation set).
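The pruning loop described above can be sketched as follows. This is a hedged illustration, not the report's implementation: it assumes a hypothetical tree representation (nested dicts with an `attr`, a `majority` label recorded from the training examples at that node, and `children`; leaves are plain labels), and the small tree and validation set are invented for the example.

```python
import copy


def classify(tree, example):
    """Follow attribute tests until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(example.get(tree["attr"]), tree["majority"])
    return tree


def accuracy(tree, examples):
    return sum(classify(tree, x) == x["label"] for x in examples) / len(examples)


def decision_node_paths(tree, path=()):
    """Yield the branch-value path leading to every decision node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from decision_node_paths(child, path + (value,))


def pruned_copy(tree, path):
    """Copy the tree, replacing the node at `path` by its majority-class leaf."""
    if not path:
        return tree["majority"]
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["majority"]
    return new_tree


def reduced_error_prune(tree, validation):
    """Repeatedly apply the best single prune while validation accuracy does not drop."""
    while isinstance(tree, dict):
        base = accuracy(tree, validation)
        candidates = [pruned_copy(tree, p) for p in decision_node_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < base:
            break  # every further prune is harmful
        tree = best
    return tree


# Hypothetical overfitted tree: the Temperature test under Rain/Strong fits noise.
tree = {"attr": "Outlook", "majority": "Yes", "children": {
    "Sunny": {"attr": "Humidity", "majority": "No",
              "children": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"attr": "Wind", "majority": "Yes", "children": {
        "Weak": "Yes",
        "Strong": {"attr": "Temperature", "majority": "No",
                   "children": {"Mild": "No", "Cool": "Yes"}}}}}}

# Hypothetical validation examples, held out from training.
validation = [
    {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak", "Temperature": "Hot", "label": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak", "Temperature": "Cool", "label": "Yes"},
    {"Outlook": "Overcast", "Humidity": "High", "Wind": "Weak", "Temperature": "Hot", "label": "Yes"},
    {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak", "Temperature": "Mild", "label": "Yes"},
    {"Outlook": "Rain", "Humidity": "Normal", "Wind": "Strong", "Temperature": "Cool", "label": "No"},
    {"Outlook": "Rain", "Humidity": "High", "Wind": "Strong", "Temperature": "Mild", "label": "No"},
]

pruned = reduced_error_prune(tree, validation)
```

Copying the whole tree for every candidate is wasteful but keeps the sketch short; a practical implementation would evaluate each prune in place.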
The effect of reduced-error pruning on the accuracy of the decision tree is illustrated in the figure below. As in the overfitting example, the accuracy of the tree is measured on both the training examples and the test examples. The additional line in the figure shows that the accuracy of the tree increases as it is pruned. In this case, the available data are split into three subsets: the training examples, the validation examples used for pruning the tree, and a set of test examples used to provide a more accurate estimate of accuracy.

Figure: Effect of reduced-error pruning in decision tree learning.
Using a separate data set to guide pruning is an effective approach.
Its main drawback is that when data are limited, the number of examples in the training, validation and test sets is reduced.
The following section presents an alternative approach to pruning that has been found useful in many practical applications where data are limited. Many additional techniques have also been proposed, involving partitioning the available data several different times in multiple ways and then averaging the results. Empirical evaluations of alternative tree pruning methods are reported by Mingers (1989b) and Malerba et al. (1995).
5.1.2. Rule post-pruning
In practice, one successful method for finding high-accuracy hypotheses is a technique we shall call rule post-pruning. A variant of this pruning method is used by C4.5, which is an outgrowth of the ID3 algorithm. Rule post-pruning consists of the following steps:
- Build the decision tree from the training set, allowing overfitting to occur.
- Convert the learned tree into an equivalent set of rules. Each rule corresponds to one path from the root node to a leaf node of the decision tree.
- Prune each rule by removing any precondition whose removal improves the rule's estimated accuracy.
- Sort the pruned rules by their estimated accuracy, and consider them in this order when classifying subsequent instances.
To illustrate, consider again the following decision tree:

In rule post-pruning, one rule is generated for each leaf node by following the path from the root node to that leaf. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition), and the classification at the leaf node becomes the rule consequent (postcondition).
For example, the leftmost path of the tree in the figure above is translated into the rule:
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
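The conversion from tree to rules can be sketched as a recursive walk over the tree. This is a minimal illustration: the nested-dict tree below is an assumed encoding of the standard PlayTennis tree (the figure itself is not reproduced here), and `extract_rules` and `format_rule` are hypothetical helper names.

```python
def extract_rules(tree, conditions=()):
    """Return one (preconditions, classification) pair per root-to-leaf path."""
    if not isinstance(tree, dict):  # a leaf holds the class label
        return [(list(conditions), tree)]
    rules = []
    for value, subtree in tree["children"].items():
        rules += extract_rules(subtree, conditions + ((tree["attr"], value),))
    return rules


def format_rule(preconditions, label, target="PlayTennis"):
    """Render a rule in the IF ... THEN ... form used in the text."""
    tests = " AND ".join(f"({attr} = {value})" for attr, value in preconditions)
    return f"IF {tests} THEN {target} = {label}"


# Assumed encoding of the PlayTennis decision tree.
play_tennis_tree = {
    "attr": "Outlook",
    "children": {
        "Sunny": {"attr": "Humidity", "children": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attr": "Wind", "children": {"Strong": "No", "Weak": "Yes"}},
    },
}

rules = extract_rules(play_tennis_tree)
first_rule = format_rule(*rules[0])
```

Each rule keeps its preconditions as a list, so the later pruning step can drop individual preconditions without touching the other rules.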
Next, each such rule is pruned by removing any antecedent (precondition) whose removal does not worsen the rule's estimated accuracy. Given the rule above, for example, rule post-pruning would consider removing the preconditions (Outlook = Sunny) and (Humidity = High). It selects whichever of these pruning steps produces the greatest improvement in estimated rule accuracy. No pruning step is performed if it reduces the estimated rule accuracy.
As noted above, one method of estimating rule accuracy is to use a validation set of examples separate from the training set. Another method, used by C4.5, is to evaluate performance on the training set itself, using a pessimistic estimate to compensate for the fact that the training data give an estimate biased in favor of the rules. Specifically, C4.5 computes the rule accuracy over the training examples to which the rule applies, then computes the standard deviation of this estimated accuracy assuming a binomial distribution. For a given confidence level, the lower-bound estimate is then taken as the measure of rule performance (for example, for a 95% confidence level, rule accuracy is pessimistically estimated as the observed accuracy over the training set minus 1.96 times the estimated standard deviation). The net effect is that for large data sets the pessimistic estimate is very close to the observed accuracy (the standard deviation is very small), whereas it moves further from the observed accuracy as the size of the data set decreases.
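The pessimistic estimate described above amounts to a one-line formula. The sketch below follows that description literally (observed accuracy minus z binomial standard deviations); the exact computation inside C4.5 may differ, and the function name and sample counts are hypothetical.

```python
from math import sqrt


def pessimistic_accuracy(correct, covered, z=1.96):
    """Observed accuracy minus z estimated standard deviations (binomial)."""
    p = correct / covered
    std = sqrt(p * (1 - p) / covered)
    return p - z * std


# Same observed accuracy (80%), different numbers of covered examples.
small = pessimistic_accuracy(8, 10)
large = pessimistic_accuracy(800, 1000)
```

With only 10 covered examples the estimate falls well below the observed 80%, while with 1000 examples it stays close to it, which is the behavior described in the text.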
Converting the decision tree to rules before pruning has three advantages:
- Converting to rules allows distinguishing among the different contexts in which a decision node is used. Because each distinct path through a decision node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path. In contrast, if the tree itself were pruned, the only two choices would be to remove the decision node completely or to retain it in its original form.
- Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves. This avoids messy bookkeeping issues, such as how to reorganize the tree if the root node is pruned while part of the subtree below it is retained.
- Converting to rules improves readability. Rules are often easier for people to understand.

5.2. Incorporating continuous-valued attributes
The initial definition of ID3 is restricted to attributes that take on a discrete set of values:
- First, the target attribute, whose value is predicted by the decision tree, must be discrete-valued.
- Second, the attributes tested in the decision nodes of the tree must also be discrete-valued.
This second restriction can easily be removed so that continuous-valued attributes can be used in the decision tree. This is done by dynamically defining new discrete-valued attributes that partition the continuous values into a discrete set of intervals.
Specifically, for a continuous-valued attribute A, the algorithm can dynamically create a boolean attribute A_c that is true if A < c and false otherwise.
The remaining question is how to select the best value for the threshold c.
As an example, consider the continuous-valued attribute Temperature. Suppose the training examples associated with a particular node in the decision tree have the following values for Temperature and for the target attribute PlayTennis.

Candidate thresholds can be evaluated by computing the information gain associated with each. In the example above, there are two candidate thresholds, corresponding to the values of Temperature at which the value of PlayTennis changes: (48 + 60)/2 and (80 + 90)/2. The information gain can then be computed for each of the candidate attributes, Temperature > 54 and Temperature > 85. Because InformationGain(Temperature > 54) > InformationGain(Temperature > 85), Temperature > 54 is selected. This dynamically created boolean attribute can then compete with the other candidate attributes available for growing the decision tree.
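The threshold selection above can be sketched as follows. The six Temperature and PlayTennis values are an assumption, taken from the textbook example that this passage follows (only 48, 60, 80 and 90 are confirmed by the thresholds quoted in the text); the function names are illustrative.

```python
from collections import Counter
from math import log2

# Assumed sample for the node being split.
temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2 for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]


def gain_for_threshold(values, labels, c):
    """Information gain of the boolean test `value > c`."""
    above = [l for v, l in zip(values, labels) if v > c]
    below = [l for v, l in zip(values, labels) if v <= c]
    n = len(labels)
    remainder = len(above) / n * entropy(above) + len(below) / n * entropy(below)
    return entropy(labels) - remainder


thresholds = candidate_thresholds(temperature, play_tennis)
gains = {c: gain_for_threshold(temperature, play_tennis, c) for c in thresholds}
best = max(gains, key=gains.get)
```

Only boundaries where the class label changes are considered, which keeps the number of candidate thresholds small compared with testing every midpoint.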
Fayyad and Irani (1993) discuss an extension of this approach that splits continuous attributes into multiple intervals rather than just two intervals based on a single threshold. Utgoff and Brodley (1991) and Murthy et al. (1994) discuss approaches that define features as linear combinations of several continuous-valued attributes.
5.2.1. Alternative measures for selecting attributes
There is a natural bias in the information gain measure that favors attributes with many values over those with few values. As an example, consider the attribute Date, which has a very large number of possible values (for example, March 4, 1979). If we add this attribute to the example table for deciding whether to play tennis, it would have the highest information gain of any attribute.
This is because Date alone perfectly predicts the target function over the training data. So what is wrong with this attribute? It has so many possible values that it separates the training examples into very small subsets. Because of this, it has a very high information gain on the training data, despite being a very poor predictor of the target function over other instances. One way to avoid this difficulty is to select attributes based on some measure other than information gain. One measure that has been used successfully is the Gain Ratio. It penalizes attributes such as Date by incorporating a term called Split Information:

$$SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}$$

where S_1 through S_c are the c subsets of examples resulting from partitioning S by the c-valued attribute A. Note that SplitInformation is actually the entropy of S with respect to the values of attribute A. This is in contrast to the earlier uses of entropy, which considered only the entropy of S with respect to the target attribute whose value is to be predicted by the decision tree.
The Gain Ratio measure is defined in terms of the earlier Gain measure and this SplitInformation as follows:

$$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$$
5.2.2. Handling training examples with missing attribute values
In certain cases, the available data may be missing values for some attributes. For example, in a medical domain in which we wish to predict patient outcome based on various laboratory tests, a blood test result may be available only for a subset of the patients. In such cases, it is common to estimate the missing attribute value based on other examples for which this attribute has a known value.
Consider the situation in which Gain(S, A) is calculated at node n in the decision tree to evaluate whether the attribute A is the best attribute to test at this decision node. Suppose that (x, c(x)) is one of the training examples in S and that the value A(x) is unknown. One strategy for dealing with the missing attribute value is to assign it the value that is most common among the training examples at node n. A second, more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). These probabilities can be estimated from the observed frequencies of the various values of A among the examples at node n.
Example: suppose attribute A is a candidate test attribute at node n. How should we handle an example x that has no value for attribute A (that is, x_A is undefined)?
- Let S_n be the set of training examples at node n that have a value for attribute A.
- Solution 1: x_A is the most common value of attribute A among the examples in S_n.
- Solution 2: x_A is the most common value of attribute A among the examples in S_n that have the same class as x.
- Solution 3: compute the probability p_v of each possible value v of attribute A.
o Assign a fraction p_v of example x to the corresponding branch of node n.
o These fractional instances are used to compute Information Gain.
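Solution 3 can be sketched as follows. This is a minimal illustration under assumed inputs: a node with a binary attribute A where 4 examples have A = 1, 6 examples have A = 0, and one further example x has no value for A. The function name is hypothetical.

```python
from collections import Counter


def branch_fractions(known_values):
    """Estimate p_v for each value v from the examples whose value is known."""
    counts = Counter(known_values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Assumed node contents: 4 examples with A = 1 and 6 examples with A = 0.
known = [1] * 4 + [0] * 6
fractions = branch_fractions(known)

# Each branch receives its known examples with weight 1, plus a fraction of x.
branch_weights = {v: known.count(v) + fractions[v] for v in fractions}
```

The fractional weights then replace plain counts wherever |S_v| appears in the entropy and information gain formulas.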

Example: consider a binary (0/1) attribute A. Node n contains:
- One example x (whose value for A is missing)
- 4 examples whose value for A is 1
- 6 examples whose value for A is 0
P(x_A = 1) = 4/10 = 0.4 and P(x_A = 0) = 6/10 = 0.6, so a fraction 0.4 of x is passed down the branch A = 1 and a fraction 0.6 of x down the branch A = 0.

5.2.3. Handling attributes with differing costs
In some learning tasks the attributes of the instances have associated costs. Suppose a doctor needs to classify or diagnose a disease: the question is which examinations or laboratory tests the patient should undergo so that the cost is as low as possible. These examinations or tests are exactly the attributes that must be considered in the decision tree.
In that case, measures other than InformationGain are needed to select the test attributes:

$$\frac{Gain^2(S, A)}{Cost(A)}$$

$$\frac{2^{Gain(S, A)} - 1}{(Cost(A) + 1)^w}$$

(w ∈ [0, 1] is a constant that determines the relative importance of cost versus Information Gain)

In such cases, we prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when they are needed to produce reliable classifications. ID3 can be modified to take attribute costs into account by introducing a cost term into the attribute selection measure.
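The two cost-sensitive measures can be compared on a pair of attributes. This is a sketch only: the function names are hypothetical, and the gains and costs of the two tests are invented for illustration.

```python
def gain_squared_over_cost(gain, cost):
    """Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost


def discounted_gain(gain, cost, w):
    """(2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w


# Hypothetical attributes: (information gain, cost).
cheap_test = (0.15, 1.0)
costly_test = (0.25, 10.0)

# With w = 0 cost is ignored; with w = 1 it has its full effect.
ignore_cost = [discounted_gain(g, c, 0.0) for g, c in (cheap_test, costly_test)]
full_cost = [discounted_gain(g, c, 1.0) for g, c in (cheap_test, costly_test)]
squared = [gain_squared_over_cost(g, c) for g, c in (cheap_test, costly_test)]
```

In this example the costly test wins when cost is ignored, while the cheap test wins under both cost-sensitive settings.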
6. Demo
6.1. System requirements and sample data
The program is written in Visual C#. Requirements for running the program:
- A machine running the Windows operating system
- .NET Framework 3.5 or later
The group prepared 8 sample data files for testing. Four files are used for training (with the prefix training_):
training_4_rows_vi.xlsx
training_14_rows_vi.xlsx
training_32_rows_vi.xlsx
training_210_rows_en.xlsx
Four files are used for classification (after the tree has been built and the rules extracted), with the prefix data_:
data_10.000_rows_en.xlsx
data_210_rows_en.xlsx
data_320_rows_vi.xlsx
data_1050_rows_en.xlsx
6.2. Program overview
The program's interface consists of 4 main regions.

Region 1: action buttons, with the following functions
- Main menu button
o Load a training file into the program
o Load a file to be classified (only after a rule set exists)
o Exit the program
- Tree-building algorithm tab
o ID3: use the ID3 algorithm to build a decision tree from the training data loaded into the program
o C4.5: use the C4.5 algorithm to build a decision tree from the training data loaded into the program
o Extract rules: generate a rule set by extracting rules from the decision tree that was built
- Classification tab
o Classify: once a rule set exists, load the data to be classified and press this button to classify it
Region 2:
A grid that displays the data from the file loaded into the program. When classification is run, the grid also shows columns for the classification result and for the rule used to decide the class of each row.
Region 3:
The decision tree built from the training data once a tree-building algorithm has been run. The details displayed on this tree are described in a later section (Training process).
Region 4:
Displays the list of rules extracted from the decision tree shown in Region 3.
USING THE PROGRAM
The steps for using the demo program are, in order:
- Step 1: Load the training file
- Step 2: Build the decision tree with one of the 2 algorithms, ID3 or C4.5
- Step 3: Extract rules from the tree built in step 2
- Step 4: Load the file to be classified
- Step 5: Classify the data from step 4
While using the program, the button for some function may be grayed out because that function cannot be performed in the current context. For example, when no decision tree exists yet, no rule set can be extracted, so the Extract rules button is grayed out.
6.3. Training process
LOADING THE TRAINING DATA:
Choose the Main menu button -> Load training data

If loading succeeds, the data grid (Region 2) displays the data from the corresponding Excel file.
BUILDING THE DECISION TREE:
Choose the Tree-building algorithm tab -> press the ID3 or C4.5 button to build the tree.
If this succeeds, the decision tree is displayed in Region 3.

The figure above illustrates the decision trees built from the training data in the file training_210_rows_en.xlsx using the ID3 algorithm (left) and the C4.5 algorithm (right).
If the ID3 algorithm is used, the root node of the tree is labeled root id3. If the C4.5 algorithm is used, the root node is labeled root c4.5.
For child nodes (nodes other than the root), the node label has 2 main parts: the value of the parent node's attribute and the current attribute. For example, a node labeled [Rain] Humidity means that the value of the parent attribute is Rain and that the attribute chosen for the next split is Humidity.
Leaf nodes: a leaf node is a node with no children. Leaf nodes carry an additional part, [result = 1] or [result = 0], which is the classification result for that branch.
EXTRACTING THE RULE SET
Once a decision tree exists, rules can be extracted as follows: choose the Tree-building algorithm tab -> press the Extract rules button.
The list of rules extracted from the current decision tree is displayed (numbered in order) in Region 4, as in the following figure.

Figure: The rules obtained from the decision tree built by the ID3 algorithm.

Figure: The rules obtained from the decision tree built by the C4.5 algorithm.
6.4. Classification
Once the rule set has been extracted, the data to be classified can be loaded, and the program uses the current rule set to classify this new data.
Choose the Main menu button -> Load data to classify (as in the following figure)

After the data to be classified has been loaded, it is displayed on the grid (Region 2) as follows:

Figure: The data to be classified in the file data_210_rows_en.xlsx.
In this case the grid has an additional result column (Play Tennis) and a column for the rule used to classify the data on the corresponding row (Rule).
To classify the loaded data, choose the Classification tab -> press the Classify button. The program then performs the classification and displays the results on the grid, as in the figure.

If a data row cannot be classified, either because no rule applies to it or for some other reason, the result column and the rule column for that row are marked with the character x.
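The classification step, including the x marker for rows that no rule covers, can be sketched as follows. This is not the demo program's C# code: it is a hypothetical Python illustration in which a rule is a pair of conditions and a result, and the rules and rows are invented.

```python
def classify_row(row, rules, unmatched="x"):
    """Return (result, rule number) from the first rule whose conditions all hold."""
    for number, (conditions, result) in enumerate(rules, start=1):
        if all(row.get(attr) == value for attr, value in conditions.items()):
            return result, number
    return unmatched, unmatched  # no rule covers this row


# Hypothetical rule set; results use 1/0 as in the leaf labels of the demo.
rules = [
    ({"Outlook": "Sunny", "Humidity": "High"}, 0),
    ({"Outlook": "Sunny", "Humidity": "Normal"}, 1),
    ({"Outlook": "Overcast"}, 1),
]

rows = [
    {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"},
    {"Outlook": "Overcast", "Humidity": "High", "Wind": "Strong"},
    {"Outlook": "Rain", "Humidity": "Normal", "Wind": "Weak"},
]

results = [classify_row(row, rules) for row in rows]
```

Each row is compared against every rule in turn, so the running time grows with both the number of rows and the number and length of the rules.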
6.5. Some observations
The program implements the 3 basic steps of applying decision trees to classification: building the tree, generating rules, and classifying with the generated rule set.
The program focuses on demonstrating the theory presented by the group, so it integrates only the 2 basic decision tree algorithms, ID3 and C4.5.
Classification speed is not yet optimized. The chart below compares the program's running time on different data sets.

Figure: Running time (s) by data set.
- 5 rules, 6 attributes, longest rule 4 nodes, 320 data rows: 22
- 5 rules, 4 attributes, longest rule 3 nodes, 210 data rows: 12
- 5 rules, 4 attributes, longest rule 3 nodes, 1,000 data rows: 68
- 5 rules, 4 attributes, longest rule 3 nodes, 10,000 data rows: 611

These figures were collected while running the demo on a machine with an Intel Core i5 2.5GHz CPU.
6.6. Links to the application and sample data
Download the application at:
https://www.dropbox.com/s/pgutzi293erzb79/Decision%20Tree%20Release.zip
Download the sample data at:
https://www.dropbox.com/s/n51i5mu5c4xg2ob/Decision%20Tree%20Test%20Data.zip


References
[1] Tom M. Mitchell, Machine Learning, 1997
[2] Lior Rokach, Oded Maimon, Data Mining and Knowledge Discovery Handbook, Chap. 09
[3] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, 2006