
University of Science

Faculty of Information Technology


LECTURE NOTES: DATA MINING & APPLICATIONS (KTDL & UD)
Lecturer: ThS. Lê Ngọc Thành
Email: lnthanh@fit.hcmus.edu.vn
Summer 2012
Data Classification (Part 1)
Powerpoint Templates
Contents
Basic concepts of classification
Decision-tree-based classification
Rule-based classification
Case Study
A bank assesses whether granting a loan is safe or risky.
A store manager predicts whether a customer will buy or not buy.
A doctor decides which of three treatment methods suits a patient.
News articles are classified under topics such as sports, politics, culture, or entertainment.
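Classification (1/3)
Classification is the process of assigning (predefined) labels to new data samples with a certain level of accuracy.
Examples: labeling a customer as safe or risky; labeling a customer as buy or not buy; labeling a patient with treatment A, B, or C; labeling each new news article as sports, politics, and so on.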
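Classification (2/3)
Given a database $D = \{t_1, t_2, \dots, t_n\}$ and a set of classes $C = \{c_1, c_2, \dots, c_m\}$, classification is the problem of determining a mapping $f : D \to C$ such that each $t_i$ is assigned to one class $c_j$.

Give examples that illustrate the classification problem.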
Classification (3/3)
Classification is a form of supervised learning. Why?
Classification and numeric prediction are the two main forms of the prediction problem, but they differ:
Classification:
- Labels are discrete or nominal values
- The goal is to assign samples to the predefined labels
- Example: predict whether a customer will buy or not buy
Numeric prediction:
- The output is a continuous-valued or ordered-valued function
- The goal is to predict missing or unknown values
- Example: predict how much money a given customer will spend in one shopping trip
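The classification process
Step 1: Build the model (learning step)
- Describe the set of labels/classes
- Training set: samples that already carry class labels
- Output: a classification model, for example classification rules, a decision tree, or a mathematical formula describing the classes
Step 2: Use the model (classification step)
- Apply the model to test data (kept separate from the training data and already labeled) to estimate accuracy.
- If the accuracy is acceptable, apply the model to classify new samples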


Example of the learning step
Training data:
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
The classification algorithm learns the classifier (model):
IF rank = professor OR years > 6 THEN tenured = yes
Example of the classification step
Classifier: IF rank = professor OR years > 6 THEN tenured = yes
Testing data:
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen data: (Jeff, Professor, 4). Tenured?
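A minimal Python sketch of the classification step: it applies the rule from the slide to the four test rows to estimate accuracy, then classifies the unseen sample. The function name `predict_tenured` is illustrative and not part of the original material.

```python
# Rule from the slide: IF rank = professor OR years > 6 THEN tenured = yes
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data copied from the slide: (name, rank, years, actual label)
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy = fraction of test rows whose predicted label matches the actual one
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_data)
accuracy = correct / len(test_data)

# Classify the unseen sample (Jeff, Professor, 4)
jeff = predict_tenured("Professor", 4)
print(accuracy, jeff)
```

On these four rows the rule misclassifies only Merlisa, so the estimated accuracy is 3/4, and Jeff is predicted as tenured.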
Some classification methods
Decision-tree-based methods
Rule-based methods
Naive Bayes
Instance-based methods
Neural networks
SVM (support vector machine)
Rough sets
Evaluating a classification model
Prediction accuracy
Speed
Robustness (to noisy or missing data)
Interpretability and ease of implementation
Goodness of rules (size, number, and so on)

Contents
Basic concepts of classification
Decision-tree-based classification
- The decision tree concept
- Decision-tree-based methods
- Building a decision tree
- Tree pruning
Rule-based classification
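Definition of a decision tree
A decision tree is a hierarchical structure of nodes and branches.
Two kinds of nodes in the tree:
- Internal node: carries the name of an attribute of the database
- Leaf node: carries the name of a class
Branch: carries a value of the attribute
[Figure: a tree with the root node, internal nodes, and leaf nodes labeled]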
Introduction to decision-tree methods
1970-1980: J. Ross Quinlan proposed the ID3 decision-tree algorithm and later proposed C4.5, an improvement on ID3.
1984: L. Breiman and colleagues proposed CART for generating binary decision trees.
Other algorithms include SLIQ (Mehta 1996), SPRINT (J. Shafer 1996), PUBLIC (Rastogi 1998), and RainForest (Gehrke 1998).
These methods are mainly based on a top-down, divide-and-conquer model.
Generating a decision tree
Two main steps:
Step 1: Build the decision tree
- Initially, the whole training set is used to choose the attribute for the root
- The training set is partitioned recursively based on the chosen attribute.
Step 2: Prune the tree
- Identify and remove branches caused by noise or outliers
Example of generating a decision tree
Training data from a computer store:
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
Decision tree produced by ID3:
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
Decision-tree construction algorithm
Decision-tree construction algorithm
Input: database D, the attribute list, and an attribute selection method
1. Create a node N
2. If all rows in D belong to the same class, N becomes a leaf labeled with that class.
Example of building a decision tree
[Figure: a table with columns age, income, student, credit_rating, buys_computer in which every row has buys_computer = yes; node "?" becomes the leaf "yes"]
All rows in D have the class attribute buys_computer = yes, so node N becomes a leaf whose value is the label of this class.
Decision-tree construction algorithm
4. If the attribute list (not counting the class attribute) is empty, N is a leaf labeled with the most frequent class in D
Example of building a decision tree
buys_computer: yes, yes, no, yes, yes, yes, no, yes
[Figure: node "?" becomes the leaf "yes"]
D contains only the class attribute, so node N becomes a leaf whose value is the label of the most frequent class.
Decision-tree construction algorithm
6. Apply the heuristic method to choose the best splitting attribute.
7. Node N is labeled with this attribute together with its splitting criteria (if the attribute is continuous, the splitting criterion is a split point used to partition the data)
8. If the splitting attribute is discrete, remove it from the attribute list
Example of building a decision tree
(a) The best splitting attribute A is discrete
(b) The best splitting attribute A is continuous; split_point is the point at which the data is split

Decision-tree construction algorithm
10. For each splitting criterion of the attribute chosen in the previous step
11. Partition D into subsets $D_j$, one per criterion
12. If $D_j$ is empty, attach to N a leaf labeled with the most frequent class in D
13. Otherwise, recursively call the procedure to find the best splitting attribute for $D_j$
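The steps of the construction algorithm can be sketched as a short recursive function. This is an illustrative ID3-style sketch, assuming discrete attributes only and information gain as the selection method. The names `build_tree`, `info`, and `info_after_split` are hypothetical; the data is the 14-row buys_computer table from the slides.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
# Training rows from the slides; the last field is the class buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def info(rows):
    """Entropy of the class labels (last field of each row)."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r[-1] for r in rows).values())

def info_after_split(rows, i):
    """Weighted entropy after partitioning the rows on column i."""
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    return sum(len(g) / len(rows) * info(g) for g in groups.values())

def build_tree(rows, attrs):
    classes = Counter(r[-1] for r in rows)
    if len(classes) == 1:                  # all rows share one class: leaf
        return next(iter(classes))
    if not attrs:                          # no attributes left: majority leaf
        return classes.most_common(1)[0][0]
    # Highest gain equals lowest weighted entropy, since Info(D) is fixed.
    best = min(attrs, key=lambda a: info_after_split(rows, ATTRS.index(a)))
    i = ATTRS.index(best)
    rest = [a for a in attrs if a != best]  # discrete attribute: drop it
    # Only values present in rows are branched on, so no subset is empty here.
    return {best: {v: build_tree([r for r in rows if r[i] == v], rest)
                   for v in sorted({r[i] for r in rows})}}

tree = build_tree(ROWS, ATTRS)
print(tree)
```

Because the sketch branches only on values that occur in the current subset, the empty-subset case of step 12 never arises; a version that enumerates every value in an attribute's domain would need that step.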

Example of building a decision tree
Splitting the 14-row buys_computer training data on age gives three subsets:
age <= 30 ($D_1$):
income student credit_rating buys_computer
high no fair no
high no excellent no
medium no fair no
low yes fair yes
medium yes excellent yes
age 31..40: every row has buys_computer = yes, so this branch becomes the leaf "yes"
age > 40 ($D_2$):
income student credit_rating buys_computer
medium no fair yes
low yes fair yes
low yes excellent no
medium yes fair yes
medium no excellent no
$D_1$ and $D_2$ still contain both classes ("?"), so each is split recursively.
Stopping conditions
The recursion stops when one of the following conditions holds:
- All rows in D belong to the same class
- No attributes remain for further splitting.
- D is empty
Attribute selection methods
An attribute selection method is a heuristic for choosing the attribute that best separates the given data into classes.
Some heuristics:
- Information Gain
- Gain Ratio
- Gini Index
Information Gain (ID3/C4.5)
Choose the attribute with the highest information gain.
The information needed to classify a sample in D (also called entropy):
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
D: the training set
$C_{i,D}$: the set of samples in D with class label $C_i$ (i = 1, ..., m)
$p_i$: the probability that a sample in D belongs to class $C_i$, equal to $|C_{i,D}| / |D|$


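Example: information gain
The buys_computer training data has 9 rows with class yes and 5 rows with class no. Info(D) = ?
$$Info(D) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$$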
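The entropy value can be checked numerically. The sketch below assumes the class distribution is given as a list of counts; the function name `info` is chosen here to mirror the slide notation.

```python
from math import log2

def info(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing (0 * log 0 is taken as 0).
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 9 samples with buys_computer = yes and 5 with buys_computer = no
info_d = info([9, 5])
print(round(info_d, 3))
```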

Information Gain (ID3/C4.5)
Attribute A has the values $\{a_1, a_2, \dots, a_v\}$.
Using A to partition the training set D gives v subsets $\{D_1, D_2, \dots, D_v\}$.
The information still needed to classify D after partitioning on A:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$




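Example: information gain
For the buys_computer training data, $Info_{age}(D)$ = ?
$$Info_{age}(D) = \frac{5}{14} Info(D_{\le 30}) + \frac{4}{14} Info(D_{31..40}) + \frac{5}{14} Info(D_{>40}) = 0.694$$
The (yes, no) class counts in the three subsets are (2, 3), (4, 0), and (3, 2).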
Information Gain (ID3/C4.5)
The information gain obtained by partitioning D on attribute A:
$$Gain(A) = Info(D) - Info_A(D)$$
Example:
Gain(age) = Info(D) - Info_age(D)
= 0.940 - 0.694
= 0.246
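As a rough check, the gain of every attribute can be computed from the per-value (yes, no) class counts read off the buys_computer table. The helper names `info` and `gain` and the `PARTITIONS` dictionary are illustrative. Computed without intermediate rounding, Gain(age) comes out near 0.247; the slide's 0.246 results from subtracting the already rounded values.

```python
from math import log2

def info(counts):
    """Entropy of a class distribution given as class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(partitions):
    """Information gain of an attribute from its per-value (yes, no) counts."""
    totals = [sum(col) for col in zip(*partitions)]   # overall (yes, no) counts
    n = sum(totals)
    info_a = sum(sum(p) / n * info(p) for p in partitions)
    return info(totals) - info_a

# (yes, no) counts for each attribute value, counted from the training table
PARTITIONS = {
    "age": [(2, 3), (4, 0), (3, 2)],            # <=30, 31..40, >40
    "income": [(2, 2), (4, 2), (3, 1)],         # high, medium, low
    "student": [(6, 1), (3, 4)],                # yes, no
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}

gains = {attr: gain(p) for attr, p in PARTITIONS.items()}
best = max(gains, key=gains.get)
print(best, {a: round(g, 3) for a, g in gains.items()})
```

age has the largest gain, which is why it is chosen for the root of the tree.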
Exercise 1
Build a decision tree for the buys_computer training data, using Information Gain as the attribute selection method.
Exercise 1: Answer
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
Exercise 2
Same task as Exercise 1
Exercise 2: Answer
The problem with Information Gain
Information Gain tends to favor attributes with many values. In some cases this yields partitions that are pure but useless for classification.
Example: a product id attribute
The Information Gain measure therefore needs to be normalized.
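Gain Ratio (C4.5)
Normalizing value (split information):
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
The Gain Ratio measure:
$$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$$
The attribute with the largest Gain Ratio is chosen for the split.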

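Example: Gain Ratio (C4.5)
Example: in the buys_computer training data, income takes the values high, medium, and low in 4, 6, and 4 rows respectively.
$$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$$
GainRatio(income) = 0.029/1.557 = 0.019
GainRatio(student) = ? GainRatio(credit_rating) = ?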
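The income figures can be reproduced with a few lines of Python. This is a sketch under the assumption that the partition sizes (4, 6, 4) are counted from the training table and that Gain(income) = 0.029 is taken as given from the slide; the function name `split_info` is illustrative.

```python
from math import log2

def split_info(sizes):
    """Split information computed from the sizes of the partitions of D."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s > 0)

# income partitions the 14 rows into high = 4, medium = 6, low = 4
split_income = split_info([4, 6, 4])
gain_income = 0.029                      # Gain(income), taken from the slide
gain_ratio_income = gain_income / split_income
print(round(split_income, 3), round(gain_ratio_income, 3))
```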
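Gini Index (CART)
Measures the impurity of the data:
$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$$
$p_i$: the probability that a sample in D belongs to class $C_i$, equal to $|C_{i,D}| / |D|$
As with information gain, the Gini index of partitioning D on attribute A with values $\{a_1, a_2, \dots, a_v\}$ is:
$$Gini_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} Gini(D_j)$$
Choose the attribute with the smallest Gini index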
Example: Gini Index (CART)
$$Gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
$Gini_{age}(D)$ = ?
$Gini_{income}(D)$ = ?
$Gini_{credit\_rating}(D)$ = ?
$Gini_{student}(D)$ = ?
Example: Gini Index (CART): Answer
$$Gini_{age}(D) = \frac{5}{14} gini(2,3) + \frac{4}{14} gini(4,0) + \frac{5}{14} gini(3,2) = 0.343$$
$Gini_{income}(D)$ = 0.44
$Gini_{credit\_rating}(D)$ = 0.429
$Gini_{student}(D)$ = 0.367
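These answers can be verified with a short sketch that takes the per-value (yes, no) class counts from the buys_computer table. It follows the multiway split used on the slide rather than CART's binary splits; the function names `gini` and `gini_split` are illustrative.

```python
def gini(counts):
    """Gini impurity of a class distribution given as class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index; partitions holds the class counts of each branch."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

gini_d = gini([9, 5])
# (yes, no) counts per attribute value, counted from the training table
gini_age = gini_split([(2, 3), (4, 0), (3, 2)])
gini_income = gini_split([(2, 2), (4, 2), (3, 1)])
gini_credit = gini_split([(6, 2), (3, 3)])
gini_student = gini_split([(6, 1), (3, 4)])
print(round(gini_d, 3), round(gini_age, 3), round(gini_income, 3),
      round(gini_credit, 3), round(gini_student, 3))
```

age has the smallest Gini index, so it is again the attribute chosen for the root.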
Exercise 3
Build decision trees for the buys_computer training data, using Gain Ratio and Gini Index as the attribute selection methods.
Problems with decision trees (1/2)
Overfitting:
- Too many branches, some of them anomalous because they were created from noisy data or outliers
- Leads to low accuracy on unseen samples.
Problems with decision trees (2/2)
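Pruning (1/2)
Two approaches to avoiding overfitting:
Prepruning: stop growing branches early; do not split a node if a goodness measure falls below a threshold
- It is hard to choose an appropriate threshold
Postpruning: remove some branches after the tree is fully grown
- Use a separate data set, held out from the training data, to decide which pruned tree is best.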
Pruning (2/2)
Cross Validation
Pruning example
[Figure label: overfit point]
Visualization of classification results (DBMiner)

Decision-tree visualization in SGI/MineSet 3.0
Contents
Basic concepts of classification
Decision-tree-based classification
Rule-based classification
- IF-THEN rules
- Coverage and accuracy
- Building rules
- Evaluating rules
- The ILA algorithm

Classification with IF-THEN rules
Knowledge is represented as IF-THEN rules
Example: IF age = youth AND student = yes THEN buys_computer = yes
If a data row satisfies the condition of a rule, the rule is said to cover that row
Rules are evaluated by coverage and accuracy

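Coverage vs. accuracy
Coverage of a rule: coverage(R)
- The fraction of samples covered by the rule
$$coverage(R) = \frac{n_{covers}}{|D|}$$
Accuracy of a rule: accuracy(R)
- The fraction of covered samples that the rule classifies correctly
$$accuracy(R) = \frac{n_{correct}}{n_{covers}}$$
Here $n_{covers}$ is the number of samples covered by R and $n_{correct}$ is the number of covered samples that R classifies correctly.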



Example 1: coverage and accuracy
R: IF Marital Status = Single THEN No
Coverage = 4/10 = 40%
Accuracy = 2/4 = 50%
Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
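A small sketch that recomputes the coverage and accuracy of rule R on the ten records above. The helper `coverage_and_accuracy` and the tuple layout are illustrative choices, not part of the original slides.

```python
# (Tid, Refund, Marital Status, Taxable Income, Class), copied from the table
records = [
    (1, "Yes", "Single", "125K", "No"),
    (2, "No", "Married", "100K", "No"),
    (3, "No", "Single", "70K", "No"),
    (4, "Yes", "Married", "120K", "No"),
    (5, "No", "Divorced", "95K", "Yes"),
    (6, "No", "Married", "60K", "No"),
    (7, "Yes", "Divorced", "220K", "No"),
    (8, "No", "Single", "85K", "Yes"),
    (9, "No", "Married", "75K", "No"),
    (10, "No", "Single", "90K", "Yes"),
]

def coverage_and_accuracy(rows, condition, predicted_class):
    """Return (coverage, accuracy) of the rule IF condition THEN predicted_class."""
    covered = [r for r in rows if condition(r)]
    correct = [r for r in covered if r[-1] == predicted_class]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R: IF Marital Status = Single THEN No
cov, acc = coverage_and_accuracy(records, lambda r: r[2] == "Single", "No")
print(cov, acc)
```

Four of the ten records are Single, and two of those four have class No, which gives the 40% coverage and 50% accuracy.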

Example 2

Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
R2: (Give Birth = no) AND (Live in Water = yes) -> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) -> Mammals
R4: (Give Birth = no) AND (Can Fly = no) -> Reptiles
R5: (Live in Water = sometimes) -> Amphibians
Compute the coverage and accuracy of each rule.
Example 2 (cont.)
Use the rules below to classify the following new samples. What do you observe?
R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
R2: (Give Birth = no) AND (Live in Water = yes) -> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) -> Mammals
R4: (Give Birth = no) AND (Can Fly = no) -> Reptiles
R5: (Live in Water = sometimes) -> Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Observations on Example 2
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
- The lemur sample is covered by rule R3, so it is assigned to class Mammals
- The turtle sample is covered by both R4 and R5 (a conflict)
- The dogfish shark sample is not covered by any rule.
How can these cases be resolved?
R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
R2: (Give Birth = no) AND (Live in Water = yes) -> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) -> Mammals
R4: (Give Birth = no) AND (Can Fly = no) -> Reptiles
R5: (Live in Water = sometimes) -> Amphibians
Resolution methods (1/2)
Conflicts:
- Size-based ordering: rules with more conditions get higher priority
- Class-based ordering: classes are sorted by prevalence or by misclassification cost, and rules follow the priority order of their classes.
- Rule-based ordering: rules are ranked by a rule-quality measure (accuracy, coverage, and so on) or by expert opinion
Resolution methods (2/2)
If a sample is not covered by any rule, assign it to a default class
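The sketch below applies rules R1 to R5 from Example 2 to the three new samples, resolving conflicts by rule size and falling back to a default class. The choice of Mammals as the default (the most frequent class in the Example 2 table) is an assumption made for illustration, as are the names `covering_rules` and `classify`.

```python
# Each rule: (name, conditions as attribute -> required value, predicted class)
RULES = [
    ("R1", {"give_birth": "no", "can_fly": "yes"}, "Birds"),
    ("R2", {"give_birth": "no", "live_in_water": "yes"}, "Fishes"),
    ("R3", {"give_birth": "yes", "blood_type": "warm"}, "Mammals"),
    ("R4", {"give_birth": "no", "can_fly": "no"}, "Reptiles"),
    ("R5", {"live_in_water": "sometimes"}, "Amphibians"),
]
DEFAULT_CLASS = "Mammals"   # assumption: most frequent class in the training table

def covering_rules(sample):
    """All rules whose every condition is satisfied by the sample."""
    return [r for r in RULES if all(sample[a] == v for a, v in r[1].items())]

def classify(sample):
    matched = covering_rules(sample)
    if not matched:
        return DEFAULT_CLASS                 # no rule covers the sample
    # Size-based ordering: the rule with the most conditions wins.
    return max(matched, key=lambda r: len(r[1]))[2]

samples = {
    "lemur": {"blood_type": "warm", "give_birth": "yes",
              "can_fly": "no", "live_in_water": "no"},
    "turtle": {"blood_type": "cold", "give_birth": "no",
               "can_fly": "no", "live_in_water": "sometimes"},
    "dogfish shark": {"blood_type": "cold", "give_birth": "yes",
                      "can_fly": "no", "live_in_water": "yes"},
}
matches = {n: [r[0] for r in covering_rules(s)] for n, s in samples.items()}
labels = {n: classify(s) for n, s in samples.items()}
print(matches, labels)
```

With size-based ordering the turtle goes to Reptiles because R4 has two conditions and R5 only one. The dogfish shark receives the default class, which is wrong in this case and shows how much the choice of default matters.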
Building classification rules
Indirect methods: extract rules from other classification models
- For example decision trees, neural networks, ...
Direct methods: extract rules directly from the data
- Some algorithms: RIPPER, CN2, ILA, FOIL, AQ, ...

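Extracting rules from a decision tree
Rules are easier to understand than a large decision tree
One rule is created for each path from the root to a leaf
Each attribute-value pair along the path forms a conjunct of the rule condition
The leaf node gives the predicted class
The rules are exhaustive and mutually exclusive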
Example: extracting rules from a tree
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
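Rule extraction amounts to enumerating the root-to-leaf paths. In this sketch the tree is encoded as nested dictionaries, which is an assumed representation, and `extract_rules` is a hypothetical helper. The age values appear as the raw branch labels (<=30, 31..40, >40) rather than the young / mid-age / old names used in the rules above.

```python
# The buys_computer tree as nested dicts: {attribute: {value: subtree or leaf}}
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if not isinstance(node, dict):               # leaf: emit a rule
        cond = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {cond} THEN buys_computer = {node}"]
    (attr, branches), = node.items()             # an internal node tests one attribute
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((attr, value),))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

One rule is produced per leaf, so the five leaves of the tree yield five rules.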
Direct methods
Sequential covering algorithm: rules are learned one at a time.
Each rule for class $c_i$ covers many samples of $c_i$ but none (or few) of the samples of other classes.
By contrast, decision-tree induction learns a whole set of rules simultaneously.


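Sequential covering algorithm (1/2)
Step 0: Start from an empty rule set
Step 1: For each class $c_i$:
Step 1.1: Use the Learn-One-Rule function to find the best rule for the current class
Step 1.2: Remove the samples covered by the rule from the data
Step 1.3: Repeat from Step 1.1 until a stopping condition is met (for example, no samples remain, or the quality measure falls below a user-specified threshold)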
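The loop above can be sketched as follows. This is a simplified illustration, not a specific published algorithm: the `learn_one_rule` used here greedily adds the attribute test with the best accuracy (ties broken by the number of positives covered), and the only stopping condition is that no positive samples remain. All function names are hypothetical; the data is the buys_computer table.

```python
ATTRS = ["age", "income", "student", "credit_rating"]
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def covers(rule, row):
    """A rule is a dict {column index: required value}; {} covers everything."""
    return all(row[i] == v for i, v in rule.items())

def learn_one_rule(pos, neg):
    """Grow one rule from general to specific until it covers no negatives."""
    rule = {}
    while any(covers(rule, r) for r in neg) and len(rule) < len(ATTRS):
        cov_pos = [r for r in pos if covers(rule, r)]
        cov_neg = [r for r in neg if covers(rule, r)]
        candidates = {(i, r[i]) for r in cov_pos
                      for i in range(len(ATTRS)) if i not in rule}
        def quality(c):
            p = sum(r[c[0]] == c[1] for r in cov_pos)
            n = sum(r[c[0]] == c[1] for r in cov_neg)
            return (p / (p + n), p)          # accuracy first, then coverage
        i, v = max(sorted(candidates), key=quality)
        rule[i] = v
    return rule

def sequential_covering(rows, target):
    pos = [r for r in rows if r[-1] == target]
    neg = [r for r in rows if r[-1] != target]
    rules = []
    while pos:                               # stop when no positives remain
        rule = learn_one_rule(pos, neg)
        rules.append(rule)
        pos = [r for r in pos if not covers(rule, r)]   # remove covered samples
    return rules

rule_sets = {c: sequential_covering(ROWS, c) for c in ("yes", "no")}
print(rule_sets)
```

The first rule learned for class yes tests only age = 31..40, since that single test already covers four positive samples and no negative ones.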
Sequential covering algorithm (2/2)

Example: sequential covering
(i) Original data
(ii) Step 1
Positive samples (+) are the samples that belong to the class $c_i$ under consideration.
Samples of the other classes are negative samples (-)
Example: sequential covering (cont.)
(iii) Step 2 (figure shows the region covered by R1)
(iv) Step 3 (figure shows the regions covered by R1 and R2)
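The Learn-One-Rule function
Start with the most general rule, which has an empty condition:
IF (empty) THEN $c_i$
Add new attribute tests one at a time, using a greedy depth-first search strategy
- At each step, choose the attribute test that most improves the quality of the rule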
Example: the Learn-One-Rule function
Rule quality measures
Some possible measures:
- Coverage
- Accuracy
- FOIL (First Order Inductive Learner)
The FOIL measure is based on information gain. It favors rules that have high accuracy and cover many positive samples.
Coverage and accuracy on their own are not highly reliable*
(*) See the explanation in [1], page 361
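Rule quality measures
Let R be the current rule
- Example: IF k THEN $c_i$
R' is the rule obtained by extending R
- Example: IF k AND ($att_j = val_k$) THEN $c_i$
Let pos be the number of positive samples and neg the number of negative samples covered by R
pos' is the number of positive samples and neg' the number of negative samples covered by R'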

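Rule quality measures
FOIL estimates the information gained by extending the rule:
$$FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$$
The extension with the largest gain is kept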
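A short sketch of the FOIL gain computation. The counts come from the buys_computer data with class yes as the positive class: the empty rule covers 9 positives and 5 negatives, adding age = 31..40 leaves 4 positives and 0 negatives, and adding student = yes leaves 6 positives and 1 negative. Returning 0 for an extension that covers no positives is an assumption made here to avoid taking the logarithm of zero.

```python
from math import log2

def foil_gain(pos, neg, pos_ext, neg_ext):
    """FOIL gain for extending a rule covering (pos, neg) to one covering (pos_ext, neg_ext)."""
    if pos_ext == 0:
        return 0.0   # assumption: an extension covering no positives gains nothing
    return pos_ext * (log2(pos_ext / (pos_ext + neg_ext)) - log2(pos / (pos + neg)))

# Current rule: IF (empty) THEN buys_computer = yes, covering 9 positives, 5 negatives
gain_age = foil_gain(9, 5, 4, 0)       # extension: age = 31..40
gain_student = foil_gain(9, 5, 6, 1)   # extension: student = yes
print(round(gain_age, 2), round(gain_student, 2))
```

The test age = 31..40 scores slightly higher than student = yes: it covers fewer positives but excludes every negative sample.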

Rule pruning
To avoid overfitting, use a test set to prune rules:
$$FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$$
pos (neg) is the number of positive (negative) samples covered by R in the test set
A rule is pruned by removing one attribute test from it.
If the pruned version of R has better quality (a higher FOIL_Prune value), R is pruned.
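A minimal sketch of the pruning decision. FOIL_Prune grows as a rule covers more positive and fewer negative samples, so the pruned version is preferred when its value is higher. The counts in the example are hypothetical numbers chosen only to illustrate the comparison, and `should_prune` is an illustrative helper.

```python
def foil_prune(pos, neg):
    """FOIL_Prune value of a rule covering pos positives and neg negatives."""
    return (pos - neg) / (pos + neg)

def should_prune(full_counts, pruned_counts):
    """True if the pruned rule scores higher on the pruning set."""
    return foil_prune(*pruned_counts) > foil_prune(*full_counts)

# Hypothetical counts on the pruning set: (pos, neg)
full_rule = (10, 4)     # original rule R
pruned_rule = (18, 5)   # R with one attribute test removed
decision = should_prune(full_rule, pruned_rule)
print(round(foil_prune(*full_rule), 3), round(foil_prune(*pruned_rule), 3), decision)
```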
Remarks on direct rule extraction
Accuracy: similar to decision trees
Efficiency: slower than decision trees because:
- To generate each rule, all candidate rules have to be tried on the data (not literally all, but still many)
- When the data is large and/or there are many attribute-value pairs, the algorithm runs very slowly.
Rule interdependence: a rule may not be independent of the others, because it is found only after the data covered by the earlier rules has been removed.

ILA: Inductive Learning
M. Tolun, 1998: ILA (Inductive Learning Algorithm)
Derives IF-THEN rules directly from the training set (rules are developed from general to specific)
Partitions the training set into sub-tables, one per class value.
Compares attribute values across the sub-tables and counts their occurrences.
Attributes must be non-numeric with discrete values
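The Inductive Learning Algorithm (ILA)
Step 1: Partition the samples into sub-tables, one per class
Step 2: For each sub-table
Step 3: For each possible attribute combination (starting with combinations of size 1)
Step 4: Find the values that occur in this sub-table but not in the other sub-tables
Step 5: (If several combinations qualify, choose the one that covers the most rows)
Step 6: Use the attribute combination and values just found to create a rule
Step 7: Remove the rows covered by the rule
Step 8: If unprocessed rows remain, repeat from Step 3
Step 9: Repeat Step 2 for the remaining sub-tables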
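The steps can be sketched in Python as below, using the seven-row table from the ILA example in these slides. Two points are assumptions of this sketch: the Color value of rows 2, 3, 4, and 6 is taken to be Red, and ties between equally frequent values are broken in favor of the first one found in attribute order. The function name `ila` and the rule representation are illustrative.

```python
from itertools import combinations

ATTRS = ["Size", "Color", "Shape"]
# Example table; the last field is the class (Decision).
ROWS = [
    ("Medium", "Blue", "Box", "Buy"),
    ("Small", "Red", "Cone", "Don't buy"),
    ("Small", "Red", "Sphere", "Buy"),
    ("Large", "Red", "Cone", "Don't buy"),
    ("Large", "Green", "Cylinder", "Buy"),
    ("Large", "Red", "Cylinder", "Don't buy"),
    ("Large", "Green", "Sphere", "Buy"),
]

def ila(rows, n_attrs):
    rules = []
    classes = list(dict.fromkeys(r[-1] for r in rows))        # Step 1
    for cls in classes:                                       # Steps 2 and 9
        table = [r for r in rows if r[-1] == cls]
        others = [r for r in rows if r[-1] != cls]
        size = 1
        while table and size <= n_attrs:
            best, best_count = None, 0
            for combo in combinations(range(n_attrs), size):  # Step 3
                seen_elsewhere = {tuple(r[i] for i in combo) for r in others}
                counts = {}
                for r in table:                               # Step 4
                    key = tuple(r[i] for i in combo)
                    if key not in seen_elsewhere:
                        counts[key] = counts.get(key, 0) + 1
                for key, count in counts.items():             # Step 5
                    if count > best_count:
                        best, best_count = (combo, key), count
            if best is None:
                size += 1            # nothing unique: try larger combinations
                continue
            combo, key = best
            condition = dict(zip((ATTRS[i] for i in combo), key))
            rules.append((condition, cls))                    # Step 6
            # Step 7: drop the rows covered by the new rule, then repeat (Step 8)
            table = [r for r in table if tuple(r[i] for i in combo) != key]
    return rules

rules = ila(ROWS, len(ATTRS))
for condition, cls in rules:
    print("IF", " AND ".join(f"{a} = {v}" for a, v in condition.items()),
          "THEN Decision =", cls)
```

Under these assumptions the sketch produces three rules for Buy (Color = Green, Size = Medium, Shape = Sphere) and two for Don't buy (Shape = Cone, and Size = Large AND Color = Red).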
ILA example
Given the following data table:
No. Size Color Shape Decision
1 Medium Blue Box Buy
2 Small Red Cone Don't buy
3 Small Red Sphere Buy
4 Large Red Cone Don't buy
5 Large Green Cylinder Buy
6 Large Red Cylinder Don't buy
7 Large Green Sphere Buy
ILA example (cont.)
Partition the table into sub-tables, one per class:
No. Size Color Shape Decision
1 Medium Blue Box Buy
3 Small Red Sphere Buy
5 Large Green Cylinder Buy
7 Large Green Sphere Buy
No. Size Color Shape Decision
2 Small Red Cone Don't buy
4 Large Red Cone Don't buy
6 Large Red Cylinder Don't buy
ILA example (cont.)
Starting with combinations of size 1, choose the attribute combination whose value occurs most often in this sub-table without occurring in the other sub-tables
No. Size Color Shape Decision
1 Medium Blue Box Buy
3 Small Red Sphere Buy
5 Large Green Cylinder Buy
7 Large Green Sphere Buy
No. Size Color Shape Decision
2 Small Red Cone Don't buy
4 Large Red Cone Don't buy
6 Large Red Cylinder Don't buy
Choose attribute Color with value Green
ILA example (cont.)
Build a rule from the attribute combination and delete the samples covered by the rule.
No. Size Color Shape Decision
1 Medium Blue Box Buy
3 Small Red Sphere Buy
No. Size Color Shape Decision
2 Small Red Cone Don't buy
4 Large Red Cone Don't buy
6 Large Red Cylinder Don't buy
IF Color = Green THEN Decision = Buy
ILA example (cont.)
No. Size Color Shape Decision
3 Small Red Sphere Buy
No. Size Color Shape Decision
2 Small Red Cone Don't buy
4 Large Red Cone Don't buy
6 Large Red Cylinder Don't buy
IF Color = Green THEN Decision = Buy
IF Size = Medium THEN Decision = Buy
ILA example (cont.)
No. Size Color Shape Decision
2 Small Red Cone Don't buy
4 Large Red Cone Don't buy
6 Large Red Cylinder Don't buy
IF Color = Green THEN Decision = Buy
IF Size = Medium THEN Decision = Buy
IF Shape = Sphere THEN Decision = Buy
ILA example (cont.)
No. Size Color Shape Decision
1 Medium Blue Box Buy
3 Small Red Sphere Buy
5 Large Green Cylinder Buy
7 Large Green Sphere Buy
No. Size Color Shape Decision
2 Small Red Cone Don't buy
4 Large Red Cone Don't buy
6 Large Red Cylinder Don't buy
IF Shape = Cone THEN Decision = Don't buy
ILA example (cont.)
No. Size Color Shape Decision
1 Medium Blue Box Buy
3 Small Red Sphere Buy
5 Large Green Cylinder Buy
7 Large Green Sphere Buy
No. Size Color Shape Decision
6 Large Red Cylinder Don't buy
IF Shape = Cone THEN Decision = Don't buy
ILA example (cont.)
No. Size Color Shape Decision
1 Medium Blue Box Buy
3 Small Red Sphere Buy
5 Large Green Cylinder Buy
7 Large Green Sphere Buy
No. Size Color Shape Decision
6 Large Red Cylinder Don't buy
IF Shape = Cone THEN Decision = Don't buy
IF Size = Large AND Color = Red THEN Decision = Don't buy
Exercise
Given the following training set, assume Play Tennis is the class attribute.

Exercise (cont.)
New samples:
Outlook Temperature Humidity Wind Play Tennis
Rain Mild Normal Strong ?
Sunny Mild High Strong ?
a) Build a decision tree using the Gain measure and then the Gini index. Convert each tree into rules.
b) Use the ILA method to derive rules.
c) Use the rule sets obtained in (a) and (b) in turn to determine the class of the new samples.
Summary
Classification is the process of assigning labels to samples. The classifier is learned from samples that already carry labels.
Decision-tree-based classification searches for the best attribute to add to the tree using measures such as Information Gain, Gain Ratio, and Gini Index. Tree pruning is used to overcome overfitting.
Rule-based classification focuses on generating rules directly or indirectly from the data. Direct methods use the Learn-One-Rule function and the FOIL rule-quality measure. Indirect methods use decision trees and other models.
References
1. J. Han, M. Kamber, Chapter 8 "Classification: Basic Concepts" and Chapter 9 "Classification: Advanced Methods", in Data Mining: Concepts and Techniques, 3rd edition
2. J. Han, M. Kamber, J. Pei, Chapter 8, http://www.cs.uiuc.edu/homes/hanj/cs412/bk3_slides/08ClassBasic.ppt
3. Bing Liu, Chapter 3 "Supervised Learning", http://www.cs.uic.edu/~liub/teach/cs583-fall-06/CS583-supervised-learning.ppt
4. Mehmet R. Tolun, Saleh M. Abu-Soud. ILA, an inductive learning algorithm for rule extraction. ESA 14(3), 4/1998, 361-370
Q & A
