Winter 2012 Topic 4 Classification P1
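$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
D: the training set
C_{i,D}: the set of samples in D whose class label is C_i (i = 1, ..., m)
p_i: the probability that a sample in D belongs to class C_i, estimated as $|C_{i,D}| / |D|$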
Powerpoint Templates
31
Example: information gain
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Info(D) = ?
$Info(D) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
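
The entropy computation above can be checked with a short Python sketch. The function name `info` is illustrative and not part of the slides; the class counts (9 "yes", 5 "no") are tallied from the buys_computer table.

```python
from math import log2

def info(counts):
    """Entropy Info(D) in bits, given the number of samples in each class."""
    total = sum(counts)
    # Empty classes are skipped: 0 * log2(0) is taken as 0 by convention.
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# buys_computer: 9 "yes" samples and 5 "no" samples
print(round(info([9, 5]), 3))  # 0.94
```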

Information Gain (ID3/C4.5)
Attribute A has the values {a_1, a_2, ..., a_v}.
Using attribute A, the training set D is partitioned into v subsets {D_1, D_2, ..., D_v}.
Information needed to classify D after partitioning it by attribute A:
$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

Example: information gain
Using the buys_computer training data (the 14-sample table above):
Info_age(D) = ?
$Info_{age}(D) = \frac{5}{14} Info(D_{\le 30}) + \frac{4}{14} Info(D_{31..40}) + \frac{5}{14} Info(D_{>40}) = 0.694$

Information Gain (ID3/C4.5)
Information gain obtained by partitioning D on attribute A:
$Gain(A) = Info(D) - Info_A(D)$
Example:
Gain(age) = Info(D) - Info_age(D)
= 0.940 - 0.694
= 0.246
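
As a minimal sketch, the gain of every attribute can be computed from per-value class counts. The helper names and the `SPLITS` dictionary are illustrative; the (yes, no) counts are tallied by hand from the buys_computer table. Without intermediate rounding, Gain(age) comes out near 0.2467; the slide's 0.246 results from subtracting the rounded values 0.940 and 0.694.

```python
from math import log2

def info(counts):
    """Entropy of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    """Info_A(D): entropy of each partition, weighted by partition size."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(class_counts) - info_after_split(partitions)

# (yes, no) counts per attribute value, tallied from the buys_computer table
SPLITS = {
    "age": [(2, 3), (4, 0), (3, 2)],      # <=30, 31..40, >40
    "income": [(2, 2), (4, 2), (3, 1)],   # high, medium, low
    "student": [(6, 1), (3, 4)],          # yes, no
    "credit_rating": [(6, 2), (3, 3)],    # fair, excellent
}

gains = {attr: gain((9, 5), parts) for attr, parts in SPLITS.items()}
best = max(gains, key=gains.get)
for attr, g in gains.items():
    print(f"Gain({attr}) = {g:.3f}")
print("split on:", best)
```

The attribute with the largest gain (here age) becomes the root test, which agrees with the tree given as the answer to Exercise 1.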

Exercise 1
Build a decision tree for the buys_computer training data (the 14-sample table above), using Information Gain as the attribute selection measure.

Exercise 1: answer
age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes

Exercise 2
Same requirements as Exercise 1.

Exercise 2: answer
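
The problem with Information Gain
The Information Gain measure is biased toward attributes with many values.
In some cases the resulting partitions are pure but useless for classification.
Example: a product id attribute. Every partition then holds a single sample, so Info_id(D) = 0 and the gain is maximal.
The Information Gain measure therefore needs to be normalized.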
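
Gain Ratio (C4.5)
Normalization value (split information):
$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
The Gain Ratio measure:
$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$
The attribute with the largest Gain Ratio is chosen for the split.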

Example: Gain Ratio (C4.5)
Using the buys_computer training data (the 14-sample table above):
GainRatio(income) = 0.029 / 1.557 = 0.019
GainRatio(student) = ? GainRatio(credit_rating) = ?
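
The income example can be reproduced with a small sketch. The function names are illustrative; the partition sizes (4 high, 6 medium, 4 low) are counted from the buys_computer table, and Gain(income) = 0.029 is the value quoted in the example.

```python
from math import log2

def split_info(sizes):
    """SplitInfo_A(D), computed from the sizes |D_j| of the partitions."""
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D).

    Undefined when every sample falls into one partition (SplitInfo = 0).
    """
    return gain / split_info(sizes)

income_sizes = [4, 6, 4]  # high, medium, low
print(round(split_info(income_sizes), 3))         # 1.557
print(round(gain_ratio(0.029, income_sizes), 3))  # 0.019
```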
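
Gini Index (CART)
Measures the impurity of the data:
$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$
p_i: the probability that a sample in D belongs to class C_i, estimated as $|C_{i,D}| / |D|$
As with information gain, the Gini index of partitioning D by attribute A with values {a_1, a_2, ..., a_v} is:
$Gini_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Gini(D_j)$
Choose the attribute with the smallest Gini_A(D).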

Example: Gini Index (CART)
Using the buys_computer training data (the 14-sample table above):
$Gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
Gini_age(D) = ?
Gini_income(D) = ?
Gini_credit_rating(D) = ?
Gini_student(D) = ?

Example: Gini Index (CART), answer
$Gini_{age}(D) = \frac{5}{14} gini(2,3) + \frac{4}{14} gini(4,0) + \frac{5}{14} gini(3,2) = 0.343$
Gini_income(D) = 0.44
Gini_credit_rating(D) = 0.429
Gini_student(D) = 0.367
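
These answers can be verified with a sketch along the same lines. The names `gini`, `gini_split` and `SPLITS` are illustrative; the (yes, no) counts are tallied by hand from the buys_computer table, and the splits are multiway, as in the slides.

```python
def gini(counts):
    """Gini(D) = 1 - sum of squared class probabilities."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Gini_A(D): Gini of each partition, weighted by partition size."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# (yes, no) counts per attribute value, tallied from the buys_computer table
SPLITS = {
    "age": [(2, 3), (4, 0), (3, 2)],      # <=30, 31..40, >40
    "income": [(2, 2), (4, 2), (3, 1)],   # high, medium, low
    "student": [(6, 1), (3, 4)],          # yes, no
    "credit_rating": [(6, 2), (3, 3)],    # fair, excellent
}

scores = {attr: gini_split(parts) for attr, parts in SPLITS.items()}
best = min(scores, key=scores.get)  # smallest Gini_A(D) wins
for attr, s in scores.items():
    print(f"Gini_{attr}(D) = {s:.3f}")
print("split on:", best)
```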

Exercise 3
Build a decision tree for the buys_computer training data (the 14-sample table above), using Gain Ratio and then Gini Index as the attribute selection measure.

Decision tree issues (1/2)
Overfitting:
Too many branches; some branches are anomalies created by noisy data or outliers.
This leads to low accuracy on samples never seen before.

Decision tree issues (2/2)

Pruning (1/2)
Two approaches to avoid overfitting:
Prepruning: stop growing branches early; do not split a node if a goodness measure falls below a threshold.
Choosing a suitable threshold is difficult.
Postpruning: remove some branches once the tree is fully grown.
Use data sets held out from the training data to decide which pruned tree is best.

Pruning (2/2)
Cross Validation

Pruning example
Overfitting point

Presenting classification results (DBMiner)

Decision tree visualization in SGI/MineSet 3.0

Outline
Basic concepts of classification
Classification based on decision trees
Classification based on rules
  IF-THEN rules
  Coverage and accuracy
  Building rules
  Evaluating rules
  The ILA algorithm

Classification with IF-THEN rules
Knowledge is represented as IF-THEN rules.
Example: IF age = youth AND student = yes THEN buys_computer = yes
If a data row satisfies the condition of a rule, the rule is said to cover that row.
Rules are evaluated by coverage and accuracy.
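
Coverage vs. accuracy
Coverage of a rule, coverage(R): the fraction of samples covered by the rule.
$coverage(R) = \frac{n_{covers}}{|D|}$
Accuracy of a rule, accuracy(R): the fraction of covered samples that the rule classifies correctly.
$accuracy(R) = \frac{n_{correct}}{n_{covers}}$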

Example 1: coverage and accuracy
R: IF Marital Status = Single THEN No
Cov = 4/10 = 40%
Acc = 2/4 = 50%
Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
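
A minimal sketch of both measures on the ten-row table above. The tuple layout and the function name are illustrative choices, not from the slides. The rule covers the four Single rows (Tid 1, 3, 8, 10), of which two (Tid 1 and 3) have class No, so the accuracy is 2/4.

```python
ROWS = [
    # (refund, marital_status, taxable_income_in_thousands, class)
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

def coverage_and_accuracy(rows, condition, predicted):
    """Return (coverage, accuracy) of the rule IF condition THEN predicted."""
    covered = [r for r in rows if condition(r)]
    correct = [r for r in covered if r[-1] == predicted]
    coverage = len(covered) / len(rows)
    # A rule that covers nothing has no meaningful accuracy; report 0.0.
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# R: IF Marital Status = Single THEN No
cov, acc = coverage_and_accuracy(ROWS, lambda r: r[1] == "Single", "No")
print(cov, acc)  # 0.4 0.5
```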

Example 2
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Compute the coverage and accuracy of each rule.

Example 2 (cont.)
Use the rules below to classify the following new samples. What do you observe?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

Observations on Example 2
- The lemur sample is covered by rule R3, so it is assigned to class Mammals.
- The turtle sample is covered by both R4 and R5 (a rule conflict).
- The dogfish shark sample is not covered by any rule.
How can these cases be resolved?
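
The three observations can be reproduced with a short sketch. The dictionary encoding of rules R1 to R5 and the snake_case attribute names are illustrative; the rule conditions and the samples are taken from Example 2.

```python
RULES = {
    "R1": ({"give_birth": "no", "can_fly": "yes"}, "birds"),
    "R2": ({"give_birth": "no", "live_in_water": "yes"}, "fishes"),
    "R3": ({"give_birth": "yes", "blood_type": "warm"}, "mammals"),
    "R4": ({"give_birth": "no", "can_fly": "no"}, "reptiles"),
    "R5": ({"live_in_water": "sometimes"}, "amphibians"),
}

SAMPLES = {
    "lemur": {"blood_type": "warm", "give_birth": "yes",
              "can_fly": "no", "live_in_water": "no"},
    "turtle": {"blood_type": "cold", "give_birth": "no",
               "can_fly": "no", "live_in_water": "sometimes"},
    "dogfish shark": {"blood_type": "cold", "give_birth": "yes",
                      "can_fly": "no", "live_in_water": "yes"},
}

def covering_rules(sample):
    """Names of all rules whose conditions the sample satisfies."""
    return [name for name, (conditions, _) in RULES.items()
            if all(sample[attr] == value for attr, value in conditions.items())]

for name, sample in SAMPLES.items():
    print(name, covering_rules(sample))
```

A sample with exactly one covering rule is classified directly; two covering rules signal a conflict, and an empty list means no rule applies.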

Resolution methods (1/2)
Rule conflicts:
Size-based ordering: rules with more conditions get higher priority.
Class-based ordering: classes are ranked by prevalence or by misclassification cost, and rules follow the priority of their classes.
Rule-based ordering: rules are ranked by a rule quality measure (accuracy, coverage, ...) or by expert opinion.

Resolution methods (2/2)
If a sample is not covered by any rule, assign it to a default class.
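
A minimal sketch combining size-based ordering with a default class, using rules R1 to R5 from Example 2. The `classify` helper, the tie-break by rule position, and the choice of "mammals" as default (the most frequent class in the 20-animal training table) are assumptions made for illustration; the slides do not fix a particular default.

```python
RULES = [
    ("R1", {"give_birth": "no", "can_fly": "yes"}, "birds"),
    ("R2", {"give_birth": "no", "live_in_water": "yes"}, "fishes"),
    ("R3", {"give_birth": "yes", "blood_type": "warm"}, "mammals"),
    ("R4", {"give_birth": "no", "can_fly": "no"}, "reptiles"),
    ("R5", {"live_in_water": "sometimes"}, "amphibians"),
]

def classify(sample, rules, default_class):
    """Size-based ordering: the covering rule with the most conditions wins."""
    matches = [
        (len(conditions), -position, label)
        for position, (_, conditions, label) in enumerate(rules)
        if all(sample[attr] == value for attr, value in conditions.items())
    ]
    if not matches:
        return default_class  # no rule covers the sample
    # Ties on size fall back to the rule listed first (an assumption).
    return max(matches)[2]

turtle = {"blood_type": "cold", "give_birth": "no",
          "can_fly": "no", "live_in_water": "sometimes"}
shark = {"blood_type": "cold", "give_birth": "yes",
         "can_fly": "no", "live_in_water": "yes"}

print(classify(turtle, RULES, "mammals"))  # reptiles
print(classify(shark, RULES, "mammals"))   # mammals
```

R4 has two conditions and R5 only one, so the turtle conflict resolves to Reptiles; the dogfish shark falls through to the default class.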

Building classification rules
Indirect method: extract rules from other classification models.
Examples: decision trees, neural networks, ...
Direct method: extract rules directly from the data.
Some algorithms: RIPPER, CN2, ILA, FOIL, AQ, ...
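
Extracting rules from a decision tree
Rules are easier to understand than a large decision tree.
One rule is created for each path from the root to a leaf.
Each attribute-value pair along the path forms one conjunct of the rule condition.
The leaf node holds the predicted class.
The rules are exhaustive and mutually exclusive.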

Example: extracting rules from a tree
age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
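
The root-to-leaf procedure can be sketched as a short recursive traversal. The nested-tuple encoding of the tree and the function names are illustrative; branch labels use the age ranges from the tree (<=30, 31..40, >40) rather than the young/mid-age/old wording of the rules above.

```python
# A node is either a class label (leaf) or (attribute, {value: subtree}).
TREE = ("age", {
    "<=30": ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40": ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

def extract_rules(node, path=()):
    """Return one (conditions, predicted_class) pair per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the predicted class
        return [(path, node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, path + ((attribute, value),)))
    return rules

def format_rule(conditions, label):
    body = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    return f"IF {body} THEN buys_computer = {label}"

for conditions, label in extract_rules(TREE):
    print(format_rule(conditions, label))
```

Because every sample follows exactly one path, the five resulting rules are exhaustive and mutually exclusive.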

Direct method
Sequential covering algorithm: rules are learned one at a time.
Each rule for class c_i should cover many samples of c_i and no (or few) samples of the other classes.
Advantage over decision trees: the rules can be extracted simultaneously.

Sequential covering algorithm (1/2)
Step 0: Start with an empty rule set.
Step 1: For each class c_i:
  Step 1.1: Use the Learn-One-Rule function to find the best rule for the current class.
  Step 1.2: Remove the samples covered by the rule from the data set.
  Step 1.3: Repeat from Step 1.1 until a stopping condition is met (for example, no samples remain or the quality measure falls below a user-defined threshold).

Sequential covering algorithm (2/2)
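
The steps above can be sketched in Python on the buys_computer data. This is one possible reading, not the slides' exact procedure: the `learn_one_rule` shown here greedily adds the attribute-value test with the highest accuracy (ties broken by the number of positives covered), and the loop for a class stops once none of its samples remain. All function names are illustrative.

```python
TABLE = """\
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no"""
ATTRS = ["age", "income", "student", "credit_rating"]
LABEL = "buys_computer"
DATA = [dict(zip(ATTRS + [LABEL], line.split())) for line in TABLE.splitlines()]

def covers(rule, row):
    return all(row[attr] == value for attr, value in rule.items())

def learn_one_rule(data, attrs, target, label):
    """Grow one rule for `target`, starting from the empty condition."""
    rule = {}
    while True:
        covered = [r for r in data if covers(rule, r)]
        pure = all(r[label] == target for r in covered)
        if pure or len(rule) == len(attrs):
            return rule
        best, best_score = None, (-1.0, 0)
        for attr in attrs:
            if attr in rule:
                continue
            for value in sorted({r[attr] for r in covered}):
                subset = [r for r in covered if r[attr] == value]
                pos = sum(r[label] == target for r in subset)
                if pos == 0:
                    continue  # a test that covers no positive is useless
                score = (pos / len(subset), pos)  # accuracy, then coverage
                if score > best_score:
                    best, best_score = (attr, value), score
        rule[best[0]] = best[1]

def sequential_covering(data, attrs, label):
    data, rules = list(data), []
    for target in sorted({r[label] for r in data}):
        while any(r[label] == target for r in data):
            rule = learn_one_rule(data, attrs, target, label)
            rules.append((rule, target))
            data = [r for r in data if not covers(rule, r)]  # Step 1.2
    return rules

for rule, target in sequential_covering(DATA, ATTRS, LABEL):
    print(rule, "->", target)
```

The result is an ordered rule list: each sample should be classified by the first rule that covers it, since later rules were learned only on the data left over by earlier ones.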

Example: sequential covering
(i) Original data
(ii) Step 1
Positive samples (+) are the samples that belong to the class c_i under consideration.
Samples of the other classes are negative samples (-).

Example: sequential covering (cont.)
(iii) Step 2: rule R1
(iv) Step 3: rules R1 and R2

The Learn-One-Rule function
Start with the most general rule, whose condition is empty:
IF (empty) THEN c_i
Add new attribute tests one at a time, using a greedy depth-first search strategy.
At each step, choose the attribute test that most improves the quality of the rule.

Example: the Learn-One-Rule function

Rule quality measures
Some possible measures:
Coverage
Accuracy
(Coverage and accuracy are not very reliable on their own*)
FOIL (First Order Inductive Learner)
The FOIL measure is based on Information Gain. It favors rules that have high accuracy and cover many positive samples.
(*) See the explanation in [1], page 361

Rule quality measures
Let R be the current rule.
Example: IF k THEN c_i
R' is the rule obtained by extending R.
Example: IF k ∧ (att_j = val_k) THEN c_i
Let pos and neg be the numbers of positive and negative samples covered by R.
Let pos' and neg' be the numbers of positive and negative samples covered by R'.

Rule quality measures
FOIL evaluates the information gain obtained by extending a rule:
$FOIL\_Gain = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$
The extension with the largest gain is kept.
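
A small sketch of the FOIL gain, assuming the form FOIL_Gain = pos' × (log2(pos'/(pos'+neg')) − log2(pos/(pos+neg))) given in reference [1]. The function name is illustrative, and returning 0 when the extended rule covers no positive sample is an assumption made here to avoid log2(0). The counts come from the buys_computer table with target class yes: the empty rule covers 9 positives and 5 negatives.

```python
from math import log2

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL gain of extending rule R (pos, neg) to R' (pos_new, neg_new)."""
    if pos_new == 0:
        return 0.0  # assumption: an extension covering no positives gains nothing
    before = log2(pos / (pos + neg))
    after = log2(pos_new / (pos_new + neg_new))
    return pos_new * (after - before)

# Candidate extensions of "IF (empty) THEN buys_computer = yes"
candidates = {
    "age = 31..40": (4, 0),   # covers 4 positives, 0 negatives
    "student = yes": (6, 1),  # covers 6 positives, 1 negative
}
for test, (p, n) in candidates.items():
    print(test, round(foil_gain(9, 5, p, n), 3))
```

The measure rewards both purity and the number of positives covered, so a slightly impure test that covers more positives can score close to a pure one.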

Rule pruning
To avoid overfitting, use a test data set to prune rules:
$FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$
pos (neg) is the number of positive (negative) samples covered by R in the test set.
A rule is pruned by removing one attribute test from it.
If the pruned version of R has better quality (a higher FOIL_Prune value), R is pruned.
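
A minimal sketch of the pruning decision, assuming FOIL_Prune(R) = (pos − neg)/(pos + neg) and, following reference [1], treating a higher value as better. The function names and the counts in the usage lines are hypothetical.

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) from the positive/negative counts on the pruning set."""
    return (pos - neg) / (pos + neg)

def should_prune(original, pruned):
    """Both arguments are (pos, neg) counts measured on the pruning set."""
    return foil_prune(*pruned) > foil_prune(*original)

# Hypothetical counts: dropping one test lets the rule cover more samples.
original = (10, 4)   # FOIL_Prune = 6 / 14
pruned = (18, 6)     # FOIL_Prune = 12 / 24
print(should_prune(original, pruned))  # True
```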

Remarks on direct rule extraction
Accuracy: similar to decision trees.
Efficiency: slower than decision trees because:
To generate each rule, all possible rules must be tried on the data (not quite all of them, but still many).
When the data set is large and/or there are many attribute-value pairs, the algorithm runs very slowly.
Rule interpretability: a rule may not be independent of the other rules, because it is found after the data covered by earlier rules has been removed.

ILA: Inductive Learning
M. Tolun, 1998, ILA: Inductive Learning Algorithm
Derives IF-THEN rules directly from the training set (rules are developed from general to specific).
Splits the training set into sub-tables, one per class value.
Compares the attribute values across the sub-tables and counts their occurrences.
Attributes must be non-numeric, with discrete values.
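
The Inductive Learning Algorithm (ILA)
Step 1: Split the sample set into sub-tables, one per class.
Step 2: For each sub-table:
Step 3: For each possible attribute combination (starting with combinations of size 1):
Step 4: Find the values that occur only in this sub-table and not in the other sub-tables.
Step 5: (If there are several such combinations, choose the one that covers the most rows.)
Step 6: Use the attribute combination and values just found to create a rule.
Step 7: Remove the rows covered by the rule.
Step 8: If unprocessed rows remain, repeat from Step 3.
Step 9: Repeat Step 2 for the remaining sub-tables.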

ILA example
Given the following data table:
No.  Size    Color  Shape   Decision
1    Medium  Blue   Box     Buy
2    Small   Red    Cone    Don't buy
3    Small   Red    Sphere  Buy
4    Large   Red    Cone    Don't buy
5    Large   Green  Pillar  Buy
6    Large   Red    Pillar  Don't buy
7    Large   Green  Sphere  Buy

ILA example (cont.)
Split the table into sub-tables, one per class:
No.  Size    Color  Shape   Decision
1    Medium  Blue   Box     Buy
3    Small   Red    Sphere  Buy
5    Large   Green  Pillar  Buy
7    Large   Green  Sphere  Buy

No.  Size    Color  Shape   Decision
2    Small   Red    Cone    Don't buy
4    Large   Red    Cone    Don't buy
6    Large   Red    Pillar  Don't buy

ILA example (cont.)
Among the attribute combinations (starting with single attributes), choose the value that occurs most often in this sub-table and does not occur in the other sub-tables.
For the Buy sub-table: choose attribute Color with value Green.

ILA example (cont.)
Build a rule from the chosen attribute combination and delete the samples covered by the rule.
No.  Size    Color  Shape   Decision
1    Medium  Blue   Box     Buy
3    Small   Red    Sphere  Buy

No.  Size    Color  Shape   Decision
2    Small   Red    Cone    Don't buy
4    Large   Red    Cone    Don't buy
6    Large   Red    Pillar  Don't buy
IF Color = Green THEN Decision = Buy

ILA example (cont.)
No.  Size    Color  Shape   Decision
3    Small   Red    Sphere  Buy

No.  Size    Color  Shape   Decision
2    Small   Red    Cone    Don't buy
4    Large   Red    Cone    Don't buy
6    Large   Red    Pillar  Don't buy
IF Color = Green THEN Decision = Buy
IF Size = Medium THEN Decision = Buy

ILA example (cont.)
No.  Size    Color  Shape   Decision
2    Small   Red    Cone    Don't buy
4    Large   Red    Cone    Don't buy
6    Large   Red    Pillar  Don't buy
IF Color = Green THEN Decision = Buy
IF Size = Medium THEN Decision = Buy
IF Shape = Sphere THEN Decision = Buy

ILA example (cont.)
No.  Size    Color  Shape   Decision
1    Medium  Blue   Box     Buy
3    Small   Red    Sphere  Buy
5    Large   Green  Pillar  Buy
7    Large   Green  Sphere  Buy

No.  Size    Color  Shape   Decision
2    Small   Red    Cone    Don't buy
4    Large   Red    Cone    Don't buy
6    Large   Red    Pillar  Don't buy
IF Shape = Cone THEN Decision = Don't buy

ILA example (cont.)
No.  Size    Color  Shape   Decision
1    Medium  Blue   Box     Buy
3    Small   Red    Sphere  Buy
5    Large   Green  Pillar  Buy
7    Large   Green  Sphere  Buy

No.  Size    Color  Shape   Decision
6    Large   Red    Pillar  Don't buy
IF Shape = Cone THEN Decision = Don't buy

ILA example (cont.)
No single attribute value is unique to the remaining row, so combinations of two attributes are tried next.
No.  Size    Color  Shape   Decision
6    Large   Red    Pillar  Don't buy
IF Shape = Cone THEN Decision = Don't buy
IF Size = Large AND Color = Red THEN Decision = Don't buy
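
The worked example can be replayed with a compact sketch of ILA. The English value names are translations chosen for illustration, and the color "red" for rows 2, 3, 4 and 6 is an assumption, since that value did not survive extraction of the table. Ties between equally frequent values are broken by attribute order, which is also an assumption.

```python
from itertools import combinations

ATTRS = ("size", "color", "shape")
DATA = [
    ("medium", "blue", "box", "buy"),
    ("small", "red", "cone", "no_buy"),
    ("small", "red", "sphere", "buy"),
    ("large", "red", "cone", "no_buy"),
    ("large", "green", "pillar", "buy"),
    ("large", "red", "pillar", "no_buy"),
    ("large", "green", "sphere", "buy"),
]

def ila(rows, attrs):
    """Return a list of (conditions, class) rules in the order they are found."""
    classes = list(dict.fromkeys(r[-1] for r in rows))  # keep first-seen order
    rules = []
    for cls in classes:
        remaining = [r[:-1] for r in rows if r[-1] == cls]  # this sub-table
        others = [r[:-1] for r in rows if r[-1] != cls]     # all other sub-tables
        size = 1                                            # combination size
        while remaining and size <= len(attrs):
            best, best_count = None, 0
            for idx in combinations(range(len(attrs)), size):
                counts = {}
                for r in remaining:
                    key = tuple(r[i] for i in idx)
                    counts[key] = counts.get(key, 0) + 1
                for key, n in counts.items():
                    seen_elsewhere = any(
                        tuple(o[i] for i in idx) == key for o in others)
                    if not seen_elsewhere and n > best_count:
                        best, best_count = (idx, key), n
            if best is None:
                size += 1  # no unique value: try larger combinations
                continue
            idx, key = best
            rules.append(({attrs[i]: v for i, v in zip(idx, key)}, cls))
            # Remove the rows covered by the new rule.
            remaining = [r for r in remaining
                         if tuple(r[i] for i in idx) != key]
    return rules

for conditions, cls in ila(DATA, ATTRS):
    print(conditions, "->", cls)
```

Under these assumptions the sketch finds the same five rules, in the same order, as the example above.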

Exercise
Given the following training set, assume Play Tennis is the class attribute.

Exercise (cont.)
Outlook  Temperature  Humidity  Wind    Play Tennis
Rain     Medium       Normal    Strong  ?
Sunny    Medium       High      Strong  ?
a) Use the Gain measure and then the Gini index to build a decision tree. Convert the tree into rules.
b) Use the ILA method to derive rules.
c) Use the rule sets obtained in (a) and (b), in turn, to determine the class of the new samples.

Summary
Classification is the process of assigning labels to samples. A classifier is learned from samples that have already been labeled.
Decision-tree-based classification searches for the best attribute to place in the tree using measures such as Information Gain, Gain Ratio, and Gini Index. Tree pruning addresses the overfitting problem.
Rule-based classification focuses on generating rules directly or indirectly from the data. The direct approach uses the Learn-One-Rule function and the FOIL rule quality measure. The indirect approach uses decision trees, ...

References
1. J. Han, M. Kamber, Chapter 8 "Classification: Basic Concepts" and Chapter 9 "Classification: Advanced Methods", in Data Mining: Concepts and Techniques, 3rd edition
2. J. Han, M. Kamber, J. Pei, Chapter 8, http://www.cs.uiuc.edu/homes/hanj/cs412/bk3_slides/08ClassBasic.ppt
3. Bing Liu, Chapter 3 "Supervised Learning", http://www.cs.uic.edu/~liub/teach/cs583-fall-06/CS583-supervised-learning.ppt
4. Mehmet R. Tolun, Saleh M. Abu-Soud. ILA, an inductive learning algorithm for rule extraction. ESA 14(3), 4/1998, 361-370

Q & A