
Machine Learning

(IT4866)

Nguyễn Nhật Quang
quang.nguyennhat@hust.edu.vn

Hanoi University of Science and Technology, School of Information and Communication Technology
Academic year 2013-2014

Course contents:

- General introduction
  - Machine learning
  - The WEKA tool
- Evaluating the performance of machine learning systems
- Probability-based learning methods
- Supervised learning methods
- Unsupervised learning methods

Introduction to Machine Learning

Machine learning (ML) is a research field within Artificial Intelligence (AI).

Definitions of machine learning:
- A process by which a system improves its performance [Simon, 1983]
- A process by which a computer program improves its performance at a task through experience [Mitchell, 1997]
- Programming computers to optimize a performance criterion using example data or past experience [Alpaydin, 2004]

Formulating a machine learning problem [Mitchell, 1997]:
Machine learning = improving performance at a task through experience
- A task T
- With respect to a performance measure P
- Through (using) experience E

Example machine learning problem (1)
Email spam filtering
- T: Predict (in order to filter out) which emails are spam
- P: The percentage of incoming emails that are classified correctly
- E: A set of sample emails, each represented by a set of attributes (e.g., a set of keywords) and its corresponding class label (normal/spam)

[Figure: an incoming email is classified ("Spam?") as either a normal email or a spam email]

Example machine learning problem (2)
Web page categorization (classification)
- T: Classify Web pages into predefined topics
- P: The percentage of Web pages that are classified correctly
- E: A set of Web pages, each associated with a topic

[Figure: a Web page is assigned a topic ("Topic?")]

Example machine learning problem (3)
Handwritten character recognition
- T: Recognize and classify the words in images of handwriting
- P: The percentage of words that are recognized and classified correctly
- E: A set of handwriting images, each associated with the identifier of a word

[Figure: handwriting images are mapped to words ("Which word?"), e.g., "we", "do", "in", "the", "right", "way"]

Example machine learning problem (4)
Loan risk estimation
- T: Determine the risk level (e.g., high/low) of loan applications
- P: The percentage of high-risk loan applications (those that would not repay the loan) that are identified correctly
- E: A set of loan applications, each represented by a set of attributes and a risk level (high/low)

[Figure: a loan application is assessed ("Risk?"): high risk leads to rejection, low risk leads to approval]

The machine learning process

[Figure: the dataset is split into three parts, each used in one stage of the process]
- Training set: used to train the system
- Validation set: used to optimize the system's parameters
- Test set: used to test the learned system
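The three-way split of a dataset described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the 60/20/20 proportions, and the fixed random seed are assumptions made for the example, not values given in the course.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the examples and split them into (train, validation, test).

    The fractions are illustrative defaults; whatever remains after the
    training and validation parts becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
```

The test set is kept apart from both training and parameter tuning, so that the final accuracy estimate is made on examples the system has never used.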

Main components of a machine learning problem (1)

Selecting the training (learning) examples
- Is the training feedback contained directly in the training examples, or is it provided indirectly (e.g., from the operating environment)?
- Are the training examples supervised or unsupervised?
- The training examples must be compatible with (representative of) the examples that the system will be used on in the future (future test examples)

Determining the target function (hypothesis, concept) to be learned
- F: X → {0,1}
- F: X → a set of class labels
- F: X → R+ (the domain of positive real values)

Main components of a machine learning problem (2)

Choosing a representation for the target function to be learned
- A polynomial function
- A set of rules
- A decision tree
- An artificial neural network

Choosing a learning algorithm that can learn (approximate) the target function
- Regression-based learning
- Rule induction
- Decision tree learning (ID3 or C4.5)
- Back-propagation learning

Issues in machine learning (1)

Learning algorithms
- Which learning algorithms can learn (approximate) a given target function?
- Under which conditions will a chosen learning algorithm converge (asymptotically) to the target function?
- For a specific problem domain and a specific representation of the examples (objects), which learning algorithm performs best?

Issues in machine learning (2)

Training examples
- How many training examples are enough?
- How does the size of the training set affect the accuracy of the learned target function?
- How do erroneous (noisy) examples and/or examples with missing attribute values affect the accuracy?

Issues in machine learning (3)

Learning process
- What is the best strategy for choosing the order in which the training examples are used?
- How do these selection strategies change the complexity of the learning problem?
- How can problem-specific knowledge (beyond the training examples) contribute to the learning process?

Issues in machine learning (4)

Learning capability
- Which target function should the system learn?
  - Representation of the target function: representational power (e.g., linear vs. nonlinear functions) vs. the complexity of the learning algorithm and the learning process
- What are the (theoretical) limits on the learning capability of learning algorithms?
- How well can the system generalize from the training examples?
  - The goal is to avoid over-fitting (high accuracy on the training set, but low accuracy on the test set)
- Can the system automatically change (adapt) its internal representation (structure)?
  - The goal is to improve the system's ability to represent and learn the target function

The over-fitting problem (1)

A learned target function (hypothesis) h is said to over-fit a training set if there exists another target function h' such that:
- h' fits the training set worse than h (it achieves lower accuracy on the training set), but
- h' achieves higher accuracy than h on the entire dataset (including the examples used after training)

The over-fitting problem (2)

- Let $D$ be the set of all examples, and $D_{train}$ be the set of training examples
- Let $\mathrm{Err}_D(h)$ be the error of hypothesis $h$ on $D$, and $\mathrm{Err}_{D_{train}}(h)$ be the error of $h$ on $D_{train}$
- Hypothesis $h$ over-fits the training set $D_{train}$ if there exists another hypothesis $h'$ such that:

  $\mathrm{Err}_{D_{train}}(h) < \mathrm{Err}_{D_{train}}(h')$ and $\mathrm{Err}_D(h) > \mathrm{Err}_D(h')$
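The definition above translates directly into a comparison of four error values. The following Python sketch assumes the errors have already been measured; the function name `overfits` and the example numbers are hypothetical and only illustrate the two inequalities.

```python
def overfits(err_train_h, err_all_h, err_train_alt, err_all_alt):
    """Return True if hypothesis h over-fits relative to an alternative h'.

    err_train_h   : error of h  on the training set D_train
    err_all_h     : error of h  on the whole example set D
    err_train_alt : error of h' on D_train
    err_all_alt   : error of h' on D
    """
    return err_train_h < err_train_alt and err_all_h > err_all_alt


# Illustrative numbers only: h is almost perfect on the training set
# but worse than h' on the whole set of examples.
print(overfits(err_train_h=0.01, err_all_h=0.30,
               err_train_alt=0.05, err_all_alt=0.10))
```

In practice the error on the whole set D cannot be measured exactly, so it is usually estimated on held-out examples.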

The over-fitting problem (3)

Over-fitting is typically caused by:
- Errors (noise) in the training set (introduced while collecting or constructing the dataset)
- A training set that is too small to be representative of the whole set (distribution) of examples of the learning problem
- Near-perfect accuracy (~100%) on the training set: the learning process converges to a target function that is ideal for the training set, but that does not fit other examples seen in the future

The over-fitting problem (4)

- Among the learned hypotheses (target functions), which one generalizes best from the training examples?
  - Note: the goal of machine learning is to achieve high accuracy when predicting future examples, not the training examples
- Which target function f(x) achieves the highest accuracy on future examples?

[Figure: plot of f(x)]

- Occam's razor: prefer the simplest target function that fits (not necessarily perfectly) the training examples
  - It generalizes better
  - It is easier to explain and interpret
  - It has lower computational complexity
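One simple way to act on Occam's razor is to keep only the hypotheses whose training error is acceptable and then pick the least complex one. The sketch below is an assumption-laden illustration: the candidate list, the complexity scores, and the error threshold `max_error` are all invented for the example and do not come from the course.

```python
def simplest_adequate(candidates, max_error):
    """Pick the simplest hypothesis whose training error is acceptable.

    candidates: list of (name, complexity, training_error) tuples
    max_error : largest training error still considered a good enough fit
    Returns the chosen tuple, or None if no candidate fits well enough.
    """
    adequate = [c for c in candidates if c[2] <= max_error]
    if not adequate:
        return None
    # Lowest complexity first; ties are broken by lower training error.
    return min(adequate, key=lambda c: (c[1], c[2]))


# Hypothetical candidates: polynomials of increasing degree.
CANDIDATES = [
    ("degree-1", 2, 0.20),   # too simple: fits poorly
    ("degree-3", 4, 0.04),   # fits well, though not perfectly
    ("degree-9", 10, 0.00),  # fits perfectly, most complex
]
print(simplest_adequate(CANDIDATES, max_error=0.05))
```

The perfect-fitting candidate is deliberately not selected, reflecting the point that the preferred function need not match the training examples exactly.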

The over-fitting problem: example
Continuing the decision tree learning process lowers the accuracy on the test set, even though it raises the accuracy on the training set.

[Mitchell, 1997]

WEKA: Introduction

- WEKA is a software tool written in Java for machine learning and data mining
- It can be downloaded from:
  http://www.cs.waikato.ac.nz/ml/weka/

Main features
- A collection of data preprocessing tools, machine learning and data mining algorithms, and experimental evaluation methods
- A graphical user interface (including data visualization)
- An environment for comparing machine learning and data mining algorithms

WEKA: Main environments

- Simple CLI
  - A simple command-line interface (like MS-DOS)
- Explorer (the environment we will mainly use)
  - Gives access to all of WEKA's capabilities for exploring data
- Experimenter
  - For running experiments and performing statistical tests between machine learning models
- KnowledgeFlow
  - A drag-and-drop graphical interface for designing the steps (components) of an experiment

WEKA: The Explorer environment

- Preprocess: select and modify (process) the working data
- Classify: train and test machine learning models (classification, or regression/prediction)
- Cluster: learn groups from the data (clustering)
- Associate: discover association rules from the data
- Select attributes: identify and select the most relevant (important) attributes of the data
- Visualize: view interactive 2-D plots of the data

WEKA: Dataset format

- WEKA works with text files in the ARFF format
- Example of a dataset:

    % Name of the dataset
    @relation weather

    % Nominal attribute
    @attribute outlook {sunny, overcast, rainy}
    % Numeric attributes
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    % Class attribute (by default, the last attribute)
    @attribute play {yes, no}

    % Examples (instances)
    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
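To make the structure of an ARFF file concrete, here is a minimal Python sketch that reads the header and data sections of the weather example. It is not WEKA's own loader: the function name `parse_arff` is hypothetical, and the sketch assumes simple files only (no quoted values, no sparse rows, no missing-value handling).

```python
WEATHER_ARFF = """\
% Name of the dataset
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
"""


def parse_arff(text):
    """Return (relation_name, attributes, rows) for a simple ARFF text.

    attributes is a list of (name, type_spec) pairs in declaration order;
    rows is a list of lists of raw string values.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        keyword = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(WEATHER_ARFF)
print(relation, [name for name, _ in attributes], len(rows))
```

The last declared attribute (`play`) is the one treated as the class attribute by default, as the slide notes.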

WEKA Explorer: Data preprocessing

- Data can be imported from a file in one of these formats: ARFF, CSV
- Data can also be read from a URL, or from a database via JDBC
- WEKA's data preprocessing tools are called filters
  - Discretization
  - Normalization
  - Re-sampling
  - Attribute selection
  - Transforming and combining attributes

Now look at the WEKA Explorer interface
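Two of the preprocessing ideas listed above, normalization and discretization, can be illustrated with a short Python sketch. It shows only the general technique (min-max scaling and equal-width binning); it does not reproduce the options or behavior of WEKA's filters, and both function names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width.

    Returns the bin index (0 .. n_bins - 1) of each value.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


temperatures = [85, 83, 70]
print(min_max_normalize(temperatures))
print(equal_width_bins([0, 5, 10], n_bins=2))
```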

WEKA Explorer: Classifiers (1)

- WEKA's classifiers are models that predict nominal quantities (classification) or numeric quantities (regression/prediction)
- Classification techniques supported by WEKA
  - Naïve Bayes classifier and Bayesian networks
  - Decision trees
  - Instance-based classifiers
  - Support vector machines
  - Neural networks

Now look at the WEKA Explorer interface

WEKA Explorer: Classifiers (2)

- Select a classifier
- Select the test options
  - Use training set. The learned classifier is evaluated on the training set
  - Supplied test set. A separate dataset (different from the training set) is used for the evaluation
  - Cross-validation. The dataset is divided into k folds of approximately equal size, and the learned classifier is evaluated by cross-validation
  - Percentage split. Specifies the proportion in which the dataset is split for the evaluation
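The cross-validation option can be pictured as follows: the examples are divided into k folds, and each fold in turn serves as the test set while the remaining folds form the training set. The Python sketch below only generates the index splits; the function name `k_fold_indices` and the seed are assumptions, and the sketch does not stratify the folds by class.

```python
import random


def k_fold_indices(n_examples, k, seed=1):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Dealing indices out round-robin gives folds whose sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(10, k=5):
    print(sorted(test), "held out;", len(train), "used for training")
```

Every example is used for testing exactly once, and the k accuracy figures are typically averaged into a single estimate.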

WEKA Explorer: Classifiers (3)

- More options
  - Output model. Displays the learned classifier
  - Output per-class stats. Displays precision/recall statistics for each class
  - Output entropy evaluation measures. Displays entropy (impurity) evaluation measures for the dataset
  - Output confusion matrix. Displays the confusion matrix of the learned classifier
  - Store predictions for visualization. The classifier's predictions are kept in memory so that they can be visualized later
  - Output predictions. Displays the detailed predictions for the test set
  - Cost-sensitive evaluation. The classifier's errors are evaluated against a specified cost matrix
  - Random seed for XVal / % Split. Specifies the random seed used when randomly selecting the examples for the test set

WEKA Explorer: Classifiers (4)

- Classifier output shows the important information
  - Run information. The options of the learned model, the name of the dataset, the number of examples, the attributes, and the test method
  - Classifier model (full training set). A textual representation of the learned classifier
  - Predictions on test data. Detailed information about the classifier's predictions on the test set
  - Summary. Statistics on the accuracy of the classifier for the chosen test method
  - Detailed Accuracy By Class. Detailed information about the classifier's accuracy for each class
  - Confusion Matrix. Its entries show the number of test instances that were classified correctly and incorrectly
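The confusion matrix and the per-class precision/recall figures mentioned above can be computed from the actual and predicted labels of the test instances. The Python sketch below illustrates the calculation; the function names and the toy labels are made up for the example, and the layout (rows are actual classes, columns are predicted classes) is a convention chosen for this sketch.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of instances of class labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def precision_recall(matrix, labels):
    """Return {label: (precision, recall)} computed from a confusion matrix."""
    stats = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column total
        actually_label = sum(matrix[i])                     # row total
        precision = true_positives / predicted_as_label if predicted_as_label else 0.0
        recall = true_positives / actually_label if actually_label else 0.0
        stats[label] = (precision, recall)
    return stats


LABELS = ["yes", "no"]
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]
matrix = confusion_matrix(actual, predicted, LABELS)
print(matrix)
print(precision_recall(matrix, LABELS))
```

The diagonal entries count the correctly classified instances; every off-diagonal entry counts one kind of misclassification.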

WEKA Explorer: Classifiers (5)

- Result list provides several useful functions
  - Save model. Saves the model of the learned classifier to a binary file
  - Load model. Loads a previously learned model from a binary file
  - Re-evaluate model on current test set. Evaluates a previously learned model (classifier) on the current test set
  - Visualize classifier errors. Opens a plot window showing the classification results
    - Correctly classified examples are shown as crosses (x), and misclassified examples as squares

WEKA Explorer: Clusterers (1)

- WEKA's cluster builders are models that find groups of similar examples in a dataset
- Clustering techniques supported by WEKA
  - Expectation maximization (EM)
  - k-Means
  - ...
- The clustering results can be displayed and compared with the actual clusters (classes)

Now look at the WEKA Explorer interface
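As a rough illustration of what a cluster builder such as k-Means does, the Python sketch below alternates between assigning points to their nearest centroid and recomputing each centroid as the mean of its points. It is a generic textbook version, not WEKA's implementation; the function name `k_means` is hypothetical, and the initial centroids are passed in explicitly (an assumption made to keep the example deterministic).

```python
def k_means(points, centroids, max_iter=100):
    """Cluster points (tuples of numbers) around the given initial centroids.

    Returns (final_centroids, clusters), where clusters[i] lists the points
    assigned to centroid i.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


POINTS = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = k_means(POINTS, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```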

WEKA Explorer: Clusterers (2)

- Select a cluster builder
- Select the cluster mode
  - Use training set. The learned clusters are tested on the training set
  - Supplied test set. A separate dataset is used to test the learned clusters
  - Percentage split. Specifies the proportion of the original dataset that is held out to build the test set
  - Classes to clusters evaluation. Compares the learned clusters against the specified classes
- Store clusters for visualization
  - Keeps the learned clusters in memory so that they can be displayed later
- Ignore attributes
  - Selects the attributes that will not take part in learning the clusters

WEKA Explorer: Association rule discovery

- Select an association rule discovery model (algorithm)
- Associator output shows the important information
  - Run information. The options of the association rule discovery model, the name of the dataset, the number of examples, the attributes
  - Associator model (full training set). A textual representation of the set of discovered association rules
    - Minimum support
    - Minimum confidence
    - Sizes of the frequent (large) itemsets
    - A list of the association rules found

Now look at the WEKA Explorer interface
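The minimum support and minimum confidence thresholds refer to two standard measures of an association rule. The Python sketch below computes them for a rule "antecedent → consequent" over a list of transactions; the function names and the shopping-basket data are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= set(t))
    return matching / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support


# Hypothetical transactions.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(BASKETS, {"bread", "milk"}))
print(confidence(BASKETS, {"bread"}, {"milk"}))
```

A rule is reported only if its support and its confidence both reach the chosen minimum values.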

WEKA Explorer: Attribute selection

- Used to determine which attributes are the most important
- In WEKA, an attribute selection method consists of two parts:
  - Attribute Evaluator. Specifies how the relevance of the attributes is assessed. E.g., correlation-based, wrapper, information gain, chi-squared, ...
  - Search Method. Specifies how (in which order) the attributes are examined. E.g., best-first, random, exhaustive, ranking, ...

Now look at the WEKA Explorer interface
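Information gain, one of the evaluators named above, scores a nominal attribute by how much knowing its value reduces the entropy of the class labels. The Python sketch below shows the textbook calculation; it is not WEKA's evaluator, and the function names and toy data are assumptions for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], labels))  # attribute separates the classes
print(information_gain(["a", "b", "a", "b"], labels))  # attribute tells nothing
```

Ranking attributes by this score corresponds to pairing an evaluator with a ranking search method.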

WEKA Explorer: Data visualization

- Visualizing the data is essential in practice
  - It helps gauge the difficulty of the learning problem
- WEKA can visualize
  - Each individual attribute (1-D visualization)
  - A pair of attributes (2-D visualization)
- Different class values (labels) are shown in different colors
- The Jitter slider makes the display clearer when too many examples (points) are concentrated around one position on the plot
- Zooming in and out (by increasing or decreasing the values of PlotSize and PointSize)

Now look at the WEKA Explorer interface

References
- E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2004.
- T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
- H. A. Simon. Why Should Machines Learn? In R. S. Michalski, J. Carbonell, and T. M. Mitchell (Eds.): Machine Learning: An Artificial Intelligence Approach, chapter 2, pp. 25-38. Morgan Kaufmann, 1983.
