
Lesson 4: Classification

Instructor: Dr. PHC

Data Mining

Classification and Prediction: Overview

- What is classification? What is prediction?
- Introduction to decision trees
- Bayesian classification
- Other classification methods
- Classification accuracy
- Summary

What is classification?

- Goal: predict the class labels of new data tuples/samples
- Input: a set of training samples, each with a class label
- Output: a model (classifier) built from the training set and its class labels

Typical applications of classification

- Credit approval
- Marketing
- Medical diagnosis
- Treatment effectiveness analysis

What is prediction?

- Similar to classification:
  o build a model
  o use the model to predict unknown values
- Main method: regression
  o linear and multiple regression
  o nonlinear regression

Classification vs. prediction

- Classification:
  o predicts class labels
  o classifies data based on the training set and the values of a class label attribute, then uses the result to determine the class of new data
- Prediction:
  o builds models of continuous-valued functions
  o predicts unknown values

Classification: a two-step process

1. Step 1: Build a model from the training set
2. Step 2: Use the model: check the correctness of the model, then use it to classify new data

Model construction (Step 1)

- Each data tuple/sample is assumed to belong to a predefined class
- The class of a tuple/sample is determined by the class label attribute
- The set of training tuples/samples (the training set) is used to build the model
- The model is represented as classification rules, decision trees, or mathematical formulas

Model usage (Step 2)

- Classify new or previously unclassified objects
- Evaluate the accuracy of the model
  o the known class of each test tuple/sample is compared with the result produced by the model
  o accuracy rate = percentage of test tuples/samples that the model classifies correctly

Example: model construction

Training data -> Classification algorithms -> Classifier (model)

Training data:

NAME   RANK            YEARS  TENURED
Mary   Assistant Prof  3      no
James  Assistant Prof  7      yes
Bill   Professor       2      no
John   Associate Prof  7      yes
Mark   Assistant Prof  6      no
Annie  Associate Prof  3      no

Classifier (model):

IF rank = professor OR years > 6
THEN tenured = yes

Example: using the model

The classifier is first checked against the test data, then applied to unseen data.

Test data:

NAME  RANK            YEARS  TENURED
Tom   Assistant Prof  2      no
Lisa  Associate Prof  7      no
Jack  Professor       5      yes
Ann   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4)
Tenured? Yes
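Both uses of the classifier in this example, checking it against the test data and labelling the unseen sample, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `predict_tenured` and `accuracy` are hypothetical names, and the rule is the one shown in the model-construction example.

```python
# Rule from the model-construction example:
# IF rank = professor OR years > 6 THEN tenured = yes
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Test data from the slide: (name, rank, years, known class label)
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Lisa", "Associate Prof", 7, "no"),
    ("Jack", "Professor", 5, "yes"),
    ("Ann", "Assistant Prof", 7, "yes"),
]

def accuracy(samples) -> float:
    """Fraction of samples whose known label matches the model's output."""
    correct = sum(
        predict_tenured(rank, years) == label
        for _, rank, years, label in samples
    )
    return correct / len(samples)

test_accuracy = accuracy(test_data)                 # Lisa is the one mismatch
jeff_prediction = predict_tenured("Professor", 4)   # the unseen sample
```

On these four test samples the rule agrees with the known label for Tom, Jack and Ann, and disagrees for Lisa (7 years, but not tenured).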

Data preparation

- Data cleaning
  o noise
  o missing values
- Relevance analysis (feature selection)
- Data transformation

Evaluating classification methods

- Accuracy
- Speed
- Robustness
- Scalability
- Interpretability
- Ease of use

Decision tree induction

[Figure: a decision tree with internal test nodes A?, B?, C?, D? and a leaf labelled Yes]

A decision tree is a tree in which:
- an internal node = a test on an attribute
- a branch = an outcome of a test
- a leaf node = a class label or a class distribution

Decision tree generation

Decision tree generation has two phases:
- Tree construction
  o at the start, all training samples are at the root
  o partition the samples based on the selected attributes
  o test the selected attributes based on a statistical or heuristic measure
- Tree pruning
  o identify and remove branches that reflect noise or outliers

Decision trees: a typical example, "play tennis?"

Training set taken from Quinlan's ID3:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Decision tree obtained with ID3 (Quinlan 86)

outlook
  sunny    -> humidity
                high   -> N
                normal -> P
  overcast -> P
  rain     -> windy
                true   -> N
                false  -> P

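Extracting classification rules from decision trees

[Figure: decision tree with root outlook (sunny, overcast, rain), subtree humidity (high: N, normal: P) and subtree windy (true: N, false: P)]

IF outlook = sunny AND humidity = normal THEN play tennis

- Each path from the root to a leaf of the tree forms one rule
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the predicted class
- The resulting rules are easier to understand than the tree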
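The path-to-rule idea can be sketched as a short recursive walk. The nested-dictionary encoding of the tree and the names `extract_rules` and `format_rule` are assumptions made for this illustration; the slides do not prescribe a representation.

```python
# The ID3 tree for the play-tennis data, encoded as nested dicts:
# {attribute: {value: subtree_or_leaf_label}}
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"windy": {"true": "N", "false": "P"}},
    }
}

def extract_rules(node, conditions=()):
    """Return one (conditions, class_label) pair per root-to-leaf path."""
    if not isinstance(node, dict):           # leaf: holds the class prediction
        return [(list(conditions), node)]
    (attribute, branches), = node.items()    # an internal node tests one attribute
    rules = []
    for value, child in branches.items():
        # Each attribute-value pair on the path adds one conjunct to the rule.
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules

def format_rule(conditions, label):
    tests = " AND ".join(f"{attr}={value}" for attr, value in conditions)
    return f"IF {tests} THEN class={label}"

rules = [format_rule(conds, label) for conds, label in extract_rules(tree)]
```

The tree has five leaves, so five rules come out, one of which is the rule quoted on the slide (outlook = sunny and humidity = normal gives class P).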

Algorithms for decision trees

- Basic algorithm
  o builds the tree recursively, in a top-down divide-and-conquer manner
  o attributes are assumed to be categorical (discrete)
  o greedy (may end in a local maximum)
- Many variants: ID3, C4.5, CART, CHAID
  o main difference: the splitting criterion, that is, the attribute selection measure

Attribute selection measures

- Information gain
- Gini index
- χ² contingency table statistic
- G-statistic
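The slides only name the Gini index. As a point of comparison with information gain, here is a minimal sketch that assumes the usual definition, one minus the sum of squared class proportions; the function name `gini` is illustrative.

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples in each class.

    Assumed definition: 1 - sum(p_c ** 2) over the class proportions p_c.
    """
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

pure_node = gini([4, 0])     # all samples in one class
mixed_node = gini([7, 7])    # two classes, evenly split
tennis_root = gini([9, 5])   # 9 samples of class P, 5 of class N
```

A pure node has impurity 0, and for two classes the impurity peaks at an even split, so a split is preferred when it produces purer child nodes.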

Information gain (1)

- Select the attribute with the highest information gain
- Let P and N be two classes, and let S be a data set containing p elements of class P and n elements of class N
- The amount of information needed to decide whether an arbitrary sample belongs to P or to N is

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Information gain (2)

- Let the sets {S_1, S_2, ..., S_v} be a partition of S obtained by using attribute A
- Let each S_i contain p_i samples of class P and n_i samples of class N
- The entropy, or expected information needed to classify the objects in all subtrees S_i, is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

- The information gained by branching on attribute A is

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$

Information gain: example (1)

- Assume:
  o Class P: plays_tennis = yes
  o Class N: plays_tennis = no
- The information needed to classify a given sample is:

$$I(p, n) = I(9, 5) = 0.940$$
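For reference, the value 0.940 follows from substituting p = 9 and n = 5 into the definition of I(p, n); the intermediate values below are rounded to four decimal places.

```latex
I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}
       \approx 0.4098 + 0.5305
       \approx 0.940
```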

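Information gain: example (2)

Compute the entropy for the attribute outlook:

outlook   p_i  n_i  I(p_i, n_i)
sunny     2    3    0.971
overcast  4    0    0
rain      3    2    0.971

We have

$$E(\mathrm{outlook}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Hence

$$\mathrm{Gain}(\mathrm{outlook}) = I(9,5) - E(\mathrm{outlook}) = 0.246$$

Similarly

$$\mathrm{Gain}(\mathrm{temperature}) = 0.029 \qquad \mathrm{Gain}(\mathrm{humidity}) = 0.151 \qquad \mathrm{Gain}(\mathrm{windy}) = 0.048$$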
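The gains in this example can be re-derived with a short script. The following is a minimal Python sketch, assuming the 14-sample play-tennis training set from Quinlan's ID3 example; the names `DATA`, `ATTRIBUTES`, `info`, `expected_info` and `gain` are illustrative and do not come from the slides.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# (outlook, temperature, humidity, windy, class)
DATA = [
    ("sunny", "hot", "high", "false", "N"),
    ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"),
    ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"),
    ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"),
    ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"),
    ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"),
    ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"),
    ("rain", "mild", "high", "true", "N"),
]

def info(counts):
    """I(p, n): information needed to classify a sample, from class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(rows, index):
    """E(A): weighted information of the partition induced by attribute A."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    return sum(
        len(labels) / len(rows) * info(Counter(labels).values())
        for labels in groups.values()
    )

def gain(rows, index):
    """Gain(A) = I(p, n) - E(A)."""
    return info(Counter(row[-1] for row in rows).values()) - expected_info(rows, index)

gains = {name: gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)   # attribute chosen for the root
```

The computed gains match the slide values up to rounding in the third decimal place, and outlook comes out as the attribute with the highest gain.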

Other criteria used in building decision trees

- Conditions for stopping the partitioning
  o all samples belong to the same class
  o there are no attributes left to partition on
  o there are no samples left to classify
- Branching strategy
  o binary vs. k-ary
  o categorical (discrete) attributes vs. continuous attributes
- Labeling rule: a leaf node is labeled with the class to which the majority of the samples at that node belong
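The stopping conditions and the labeling rule above fit together in a compact recursive procedure. This is a simplified ID3-style sketch under stated assumptions, not Quinlan's implementation: it handles categorical attributes only, uses k-ary splits, does no pruning, and all names (`ROWS`, `LABELS`, `entropy`, `gain`, `build_tree`) are illustrative.

```python
from collections import Counter
from math import log2

NAMES = ("outlook", "temperature", "humidity", "windy")
RAW = [  # play-tennis training set: four attribute values, then the class
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]
ROWS = [dict(zip(NAMES, r[:-1])) for r in RAW]
LABELS = [r[-1] for r in RAW]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, labels, attr):
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:        # stop: all samples belong to the same class
        return labels[0]
    if not attributes:               # stop: no attributes left, label by majority
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # k-ary split on the values actually observed here, so the third stopping
    # condition (no samples left) never arises in this sketch.
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining
        )
    return {best: branches}

tree = build_tree(ROWS, LABELS, list(NAMES))
```

On the play-tennis data this reproduces the tree shown earlier: outlook at the root, humidity under sunny, windy under rain, and a P leaf under overcast.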

Overfitting in decision tree classification

- The generated tree may overfit the training data
  o too many branches
  o poor accuracy on unseen samples
- Reasons for overfitting
  o noisy data and outliers
  o too little training data
  o local maxima in the greedy search

How can overfitting be avoided?

Two approaches:
- Prepruning: stop growing the tree early
- Postpruning: remove branches after the whole tree has been built

Classification in large databases

- Scalability: classifying data sets with millions of samples and hundreds of attributes at an acceptable speed
- Why use decision trees in data mining?
  o learning is relatively fast compared with other methods
  o trees can be converted into simple, easy-to-understand classification rules
  o SQL queries can be used to access the database
  o classification accuracy is comparable to that of other methods

Decision tree methods in data mining research

- SLIQ (EDBT'96, Mehta et al.)
- SPRINT (VLDB'96, J. Shafer et al.)
- PUBLIC (VLDB'98, Rastogi & Shim)
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)

Bayesian classification: why? (1)

- Probabilistic learning:
  o computes explicit probabilities for hypotheses
  o one of the most practical approaches to certain types of learning problems
- Incremental:
  o each training sample can incrementally increase or decrease the probability that a hypothesis is correct
  o prior knowledge can be combined with observed data

Bayesian classification: why? (2)

- Probabilistic prediction:
  o predicts multiple hypotheses, weighted by their probabilities
- Standard:
  o even when Bayesian methods are computationally intractable, they can still provide a standard of optimal decision making against which other methods can be measured

Bayesian classification

- The classification problem can be formalized using a-posteriori probabilities:
  P(C|X) = probability that the sample X = <x1, ..., xk> belongs to class C
- Example: P(class = N | outlook = sunny, windy = true, ...)
- Idea: assign to sample X the class label C for which P(C|X) is maximal

Computing the a-posteriori probability

- Bayes' theorem: P(C|X) = P(X|C)·P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of the samples of class C
- The C for which P(C|X) is maximal = the C for which P(X|C)·P(C) is maximal
- Problem: computing P(X|C) directly is infeasible!

Naive Bayesian classification

- Naive assumption: attribute independence
  P(x1, ..., xk|C) = P(x1|C)·...·P(xk|C)
- If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute within class C
- If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
- Computation is easy in both cases
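For the continuous case, the slide does not write the density out. A common form, given here as an assumption, uses the mean and standard deviation of the i-th attribute over the training samples of class C; the symbols \mu_{i,C} and \sigma_{i,C} are introduced for this illustration only.

```latex
P(x_i \mid C) = \frac{1}{\sqrt{2\pi}\,\sigma_{i,C}}
                \exp\!\left(-\frac{(x_i - \mu_{i,C})^2}{2\,\sigma_{i,C}^2}\right)
```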

Naive Bayesian classification: example (1)

Estimating P(xi|C):

P(p) = 9/14    P(n) = 5/14

Outlook
P(sunny | p) = 2/9       P(sunny | n) = 3/5
P(overcast | p) = 4/9    P(overcast | n) = 0
P(rain | p) = 3/9        P(rain | n) = 2/5

Temperature
P(hot | p) = 2/9         P(hot | n) = 2/5
P(mild | p) = 4/9        P(mild | n) = 2/5
P(cool | p) = 3/9        P(cool | n) = 1/5

Humidity
P(high | p) = 3/9        P(high | n) = 4/5
P(normal | p) = 6/9      P(normal | n) = 1/5

Windy
P(true | p) = 3/9        P(true | n) = 3/5
P(false | p) = 6/9       P(false | n) = 2/5

Naive Bayesian classification: example (2)

Classifying X:
  o an unseen sample X = <rain, hot, high, false>
  o P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
  o P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
  o Sample X is assigned to class n (do not play tennis)
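The two products in this example can be checked with exact fractions. The sketch below is a minimal naive Bayes for categorical attributes, assuming the 14-sample play-tennis training set and plain relative-frequency estimates (no smoothing); the names `train`, `score` and `classify` are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (outlook, temperature, humidity, windy, class)
DATA = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def train(rows):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    for row in rows:
        for i, value in enumerate(row[:-1]):
            value_counts[(i, row[-1])][value] += 1
    return class_counts, value_counts

def score(x, label, class_counts, value_counts):
    """P(X|C)·P(C) under the naive attribute-independence assumption."""
    result = Fraction(class_counts[label], sum(class_counts.values()))
    for i, value in enumerate(x):
        result *= Fraction(value_counts[(i, label)][value], class_counts[label])
    return result

def classify(x, class_counts, value_counts):
    """Pick the class C that maximizes P(X|C)·P(C)."""
    return max(class_counts, key=lambda c: score(x, c, class_counts, value_counts))

class_counts, value_counts = train(DATA)
x = ("rain", "hot", "high", "false")
score_p = score(x, "P", class_counts, value_counts)
score_n = score(x, "N", class_counts, value_counts)
predicted = classify(x, class_counts, value_counts)
```

Using `Fraction` keeps the arithmetic exact; converting the two scores with `float()` gives the decimal values quoted in the example, and the larger score belongs to class N.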

Naive Bayesian classification: the independence hypothesis

- The independence hypothesis
  o makes the computation possible
  o yields an optimal classifier when it is satisfied
  o but is seldom satisfied in practice, since attributes (variables) are often correlated
- Attempts to overcome this limitation:
  o Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  o Decision trees, which reason on one attribute at a time, considering the most important attributes first

Other classification methods

- Neural networks
- k-nearest-neighbor classification
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches

Classification accuracy

Estimating the error rate:
- Partitioning: training and testing (large data sets)
  o use two independent data sets: a training set (2/3) and a test set (1/3)
- Cross-validation (moderate-sized data sets)
  o divide the data set into k subsamples
  o use k-1 subsamples as the training set and one subsample as the test set: k-fold cross-validation
- Bootstrapping: leave-one-out (small data sets)
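The partitioning schemes above differ only in how the samples are split. A minimal sketch in plain Python follows; the function names, the fixed `seed`, and the treatment of leave-one-out as k-fold with k equal to the number of samples are assumptions made for illustration.

```python
import random

def holdout_split(samples, train_fraction=2 / 3, seed=0):
    """Shuffle, then split into a training part and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(samples, k):
    """Yield (train, test) pairs; each sample is in exactly one test fold."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]

def leave_one_out_splits(samples):
    """Each sample is used once as a single-element test set."""
    return k_fold_splits(samples, len(samples))

samples = list(range(12))
train_set, test_set = holdout_split(samples)
three_fold = list(k_fold_splits(samples, 3))
loo = list(leave_one_out_splits(samples))
```

To estimate the error rate, a model would be trained on each `train` part and scored on the matching `test` part, and the per-split results averaged.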

Summary (1)

- Classification is an extensively studied problem
- Classification is probably one of the most widely used data mining techniques, with a great many extensions

Summary (2)

- Scalability is still an important issue for all database applications
- Research directions: classification of non-relational data, for example text, spatial data and multimedia
