
Data Mining (IT6080)

Nguyễn Nhật Quang
(quang.nguyennhat@hust.edu.vn)

Hanoi University of Science and Technology, School of Information and Communication Technology
Academic year 2012-2013

Course contents:

- Introduction to data mining
- Introduction to the WEKA tool
- Data preprocessing
- Association rule discovery
- Classification and prediction techniques
- Clustering techniques
- Collaborative filtering


WEKA: Introduction

WEKA is a software tool written in Java for machine learning and data mining.

Main features:
- A collection of data preprocessing tools, machine learning and data mining algorithms, and experimental evaluation methods
- A graphical user interface (including data visualization)
- An environment for comparing machine learning and data mining algorithms

It can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/


WEKA: Main environments

Simple CLI
- A simple command-line interface (similar to MS-DOS)

Explorer (we will mainly use this environment!)
- An environment that gives access to all of WEKA's facilities for exploring data

Experimenter
- An environment for running experiments and performing statistical tests between machine learning models

KnowledgeFlow
- An environment for designing the steps (components) of an experiment through drag-and-drop graphical interaction

WEKA: The Explorer environment

- Preprocess: choose and modify (process) the data to work with
- Classify: train and test machine learning models (classification, or regression/prediction)
- Cluster: learn groups from the data (clustering)
- Associate: discover association rules from the data
- Select attributes: identify and select the most relevant (important) attributes of the data
- Visualize: view interactive 2-D plots of the data

WEKA: Dataset format

WEKA works with text files in the ARFF format. Example of a dataset:

    % Name of the dataset
    @relation weather

    % Nominal attribute
    @attribute outlook {sunny, overcast, rainy}
    % Numeric attributes
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    % Class attribute (by default, the last attribute)
    @attribute play {yes, no}

    % Examples (instances)
    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
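To make the structure of an ARFF file concrete, the following is a minimal Python sketch of a reader for the subset of ARFF used in the weather example. It is not WEKA's own loader (WEKA is written in Java). The function name `parse_arff` is hypothetical, and the sketch assumes no quoted values, no missing values, and only nominal or numeric attributes.

```python
def parse_arff(text):
    """Parse a minimal ARFF string into (relation, attributes, rows).

    Simplifying assumptions: no quoted values, no sparse rows, no missing
    values; every attribute that is not nominal ({a, b}) is treated as numeric.
    """
    relation = None
    attributes = []   # list of (name, kind); kind is "numeric" or a list of labels
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [label.strip() for label in spec.strip("{}").split(",")]
            else:
                kind = "numeric"
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


WEATHER = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
"""

relation, attributes, rows = parse_arff(WEATHER)
class_name = attributes[-1][0]   # by default the last attribute is the class
```

Here `class_name` is `play`, matching the convention that the last attribute is the class attribute unless stated otherwise.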

WEKA Explorer: Data preprocessing

Data can be imported from a file in one of these formats: ARFF, CSV.
Data can also be read from a URL, or from a database through JDBC.

The data preprocessing tools in WEKA are called filters:
- Discretization
- Normalization
- Re-sampling
- Attribute selection
- Transforming and combining attributes
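Two of the filters listed above can be sketched in a few lines. This Python sketch shows min-max normalization and equal-width discretization on a single numeric column. It only illustrates the idea and is not WEKA's filter code; the function names and the temperature values are illustrative assumptions.

```python
def normalize(values):
    """Min-max normalization: rescale numeric values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width discretization: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall into bin number `bins`, so clamp it
    # to the last valid bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


temperature = [85, 80, 83, 70, 68, 65]   # illustrative values
scaled = normalize(temperature)          # 85 maps to 1.0, 65 maps to 0.0
binned = discretize(temperature, 2)      # [1, 1, 1, 0, 0, 0]
```

Equal-width binning is only one possible discretization strategy; it is used here because it is the simplest to follow by hand.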

Let's look at the WEKA Explorer interface.



WEKA Explorer: Classifiers (1)

Classifiers in WEKA are models that predict nominal quantities (classification) or numeric quantities (regression/prediction).

Classification techniques supported by WEKA:
- Naïve Bayes classifier and Bayesian networks
- Decision trees
- Instance-based classifiers
- Support vector machines
- Neural networks

Let's look at the WEKA Explorer interface.

WEKA Explorer: Classifiers (2)

Select a classifier.

Select the test options:
- Use training set. The learned classifier is evaluated on the training set itself.
- Supplied test set. A separate dataset (different from the training set) is used for evaluation.
- Cross-validation. The dataset is divided into k folds of approximately equal size, and the learned classifier is evaluated by cross-validation: each fold serves once as the test set while the other k-1 folds are used for training.
- Percentage split. Specifies the proportion in which the dataset is split into training and test parts for evaluation.
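The cross-validation option can be illustrated with a short Python sketch that builds k folds of nearly equal size and yields one train/test split per fold. The function names are hypothetical. For simplicity the sketch keeps the instances in their original order, whereas a real evaluation would normally shuffle (and possibly stratify) the data first.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds whose sizes differ by at most one."""
    folds = []
    start = 0
    for i in range(k):
        # The first n % k folds receive one extra instance.
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n, k):
    """Yield (train_indices, test_indices) once per fold."""
    folds = k_fold_indices(n, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 14 instances, 3 folds: fold sizes are 5, 5 and 4.
for train, test in cross_validation_splits(14, 3):
    print(len(train), len(test))
```

Every instance appears in exactly one test fold, so each instance is used for testing once and for training k-1 times.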


WEKA Explorer: Classifiers (3)

More options
- Output model. Displays the learned classifier.
- Output per-class stats. Displays precision/recall statistics for each class.
- Output entropy evaluation measures. Displays entropy-based evaluation measures for the dataset.
- Output confusion matrix. Displays the confusion matrix (classification errors) of the learned classifier.
- Store predictions for visualization. The classifier's predictions are kept in memory so that they can be displayed later.
- Output predictions. Displays the detailed predictions for the test set.
- Cost-sensitive evaluation. The classifier's errors are assessed using the specified cost matrix.
- Random seed for XVal / % Split. Specifies the random seed used when examples are randomly selected for the test set.
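The role of the random seed can be seen in a small Python sketch of a shuffled percentage split. The function name `random_percentage_split` is hypothetical and the code is not WEKA's; it only shows that a fixed seed makes the random selection of test examples repeatable.

```python
import random


def random_percentage_split(instances, train_percent, seed):
    """Shuffle a copy of the instances with a seeded generator, then split."""
    rng = random.Random(seed)          # private generator: reproducible per seed
    shuffled = list(instances)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


data = list(range(10))
train_a, test_a = random_percentage_split(data, 66, seed=1)
train_b, test_b = random_percentage_split(data, 66, seed=1)
# Same seed, same split: the experiment can be repeated exactly.
print(train_a == train_b and test_a == test_b)
```

Changing the seed generally produces a different split, which is one way to check how sensitive an accuracy estimate is to the particular choice of test examples.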

WEKA Explorer: Classifiers (4)

Classifier output displays the important information:
- Run information. The options of the learned model, the name of the dataset, the number of examples, the attributes, and the evaluation method (test mode).
- Classifier model (full training set). A textual representation of the learned classifier.
- Predictions on test data. Detailed information about the classifier's predictions on the test set.
- Summary. Statistics on the accuracy of the classifier under the chosen evaluation method.
- Detailed Accuracy By Class. Detailed information about the accuracy of the classifier for each class.
- Confusion Matrix. The entries of this matrix give the number of test instances that were classified correctly and incorrectly.
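The relationship between the confusion matrix, the summary accuracy, and the per-class figures can be sketched as follows. This is an illustrative Python sketch, not WEKA's evaluation code; the function names and the actual/predicted labels are made up for the example.

```python
def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of instances of class labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_stats(matrix, labels):
    """Return {label: (precision, recall)} computed from the confusion matrix."""
    stats = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted_as = sum(row[i] for row in matrix)   # column total
        actually = sum(matrix[i])                      # row total
        precision = tp / predicted_as if predicted_as else 0.0
        recall = tp / actually if actually else 0.0
        stats[label] = (precision, recall)
    return stats


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Illustrative labels only.
actual    = ["yes", "yes", "yes", "no", "no",  "yes"]
predicted = ["yes", "no",  "yes", "no", "yes", "yes"]
labels = ["yes", "no"]
matrix = confusion_matrix(actual, predicted, labels)   # [[3, 1], [1, 1]]
stats = per_class_stats(matrix, labels)
```

Rows correspond to the actual class and columns to the predicted class, so the diagonal holds the correctly classified test instances.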

WEKA Explorer: Classifiers (5)

Result list provides several useful functions:
- Save model. Saves the model of the learned classifier to a binary file.
- Load model. Loads a previously learned model from a binary file.
- Re-evaluate model on current test set. Evaluates a previously learned model (classifier) on the current test set.
- Visualize classifier errors. Opens a plot window showing the classification results.
  - Correctly classified examples are drawn as crosses (x), and misclassified examples are drawn as squares (□).



WEKA Explorer: Clusterers (1)

Clusterers (cluster builders) in WEKA are models that find groups of similar examples in a dataset.

Clustering techniques supported by WEKA:
- Expectation maximization (EM)
- k-Means
- ...

Clustering results can be displayed and compared with the actual clusters (classes).
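As an illustration of what a cluster builder does, here is a minimal Python sketch of k-Means (Lloyd's algorithm) on 2-D points. It is not WEKA's implementation. To keep the result reproducible, the sketch assumes that the first k points are used as the initial centroids, whereas practical implementations usually pick them at random. The points are made up.

```python
def k_means(points, k, iterations=100):
    """Lloyd's algorithm on 2-D points; returns (centroids, assignments).

    Assumption for reproducibility: the first k points seed the centroids.
    """
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignments == assignments:     # converged: no point changed cluster
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                        # an empty cluster keeps its old centroid
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assignments


# Two visibly separate groups of illustrative points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, 2)    # assignments: [0, 1, 0, 0, 1, 1]
```

The two steps (assign, then update) repeat until no point changes cluster, which is the stopping rule used in this sketch.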

Let's look at the WEKA Explorer interface.



WEKA Explorer: Clusterers (2)

Select a clusterer (cluster builder).

Select the cluster mode:
- Use training set. The learned clusters are tested on the training set.
- Supplied test set. A separate dataset is used to test the learned clusters.
- Percentage split. Specifies the proportion in which the original dataset is split to build the test set.
- Classes to clusters evaluation. Compares the learned clusters with the specified classes to measure how well they match.
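One simple way to compare clusters with known classes is to label each cluster with its majority class and count the examples that disagree. The Python sketch below does exactly that. This majority-vote mapping is an assumption chosen for illustration; the procedure WEKA uses to assign classes to clusters may differ. The function name and the data are hypothetical.

```python
from collections import Counter


def classes_to_clusters(assignments, classes):
    """Label each cluster with its majority class and return (mapping, error rate)."""
    by_cluster = {}
    for cluster, label in zip(assignments, classes):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    # Majority class per cluster (assumed mapping rule).
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    errors = sum(1 for c, label in zip(assignments, classes) if mapping[c] != label)
    return mapping, errors / len(classes)


assignments = [0, 0, 0, 1, 1, 1]                      # cluster of each example
classes = ["yes", "yes", "no", "no", "no", "no"]      # known class of each example
mapping, error_rate = classes_to_clusters(assignments, classes)
# mapping is {0: "yes", 1: "no"}; one of six examples disagrees with its cluster label.
```

The class attribute is only used after clustering, for the comparison; it plays no part in forming the clusters.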

Store clusters for visualization
- Keeps the learned clusters in memory so that they can be displayed later

Ignore attributes
- Selects the attributes that will not take part in learning the clusters

WEKA Explorer: Association rules

Select a model (algorithm) for discovering association rules.

Associator output displays the important information:
- Run information. The options of the association rule model, the name of the dataset, the number of examples, and the attributes.
- Associator model (full training set). A textual representation of the discovered association rules:
  - Minimum support
  - Minimum confidence
  - Sizes of the large (frequent) itemsets
  - The list of association rules found
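Minimum support and minimum confidence are thresholds on two simple ratios. The Python sketch below computes both for a rule over a handful of transactions. It is not an association rule mining algorithm and not WEKA code; the function names and the four transactions are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Illustrative transactions, written as attribute=value items.
transactions = [
    {"outlook=sunny", "windy=FALSE", "play=no"},
    {"outlook=sunny", "windy=TRUE", "play=no"},
    {"outlook=overcast", "windy=FALSE", "play=yes"},
    {"outlook=rainy", "windy=FALSE", "play=yes"},
]

# Rule: outlook=sunny -> play=no
rule_support = support(transactions, {"outlook=sunny", "play=no"})          # 0.5
rule_confidence = confidence(transactions, {"outlook=sunny"}, {"play=no"})  # 1.0
keep_rule = rule_support >= 0.4 and rule_confidence >= 0.9   # example thresholds
```

A rule is reported only if it clears both thresholds, which is why these two values appear in the associator output.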

Let's look at the WEKA Explorer interface.



WEKA Explorer: Attribute selection

Attribute selection determines which attributes are the most important.

In WEKA, an attribute selection method consists of two parts:
- Attribute Evaluator. Defines how the relevance of the attributes is assessed. E.g.: correlation-based, wrapper, information gain, chi-squared
- Search Method. Defines the method (order) in which the attributes are examined. E.g.: best-first, random, exhaustive, ranking
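The split into an evaluator and a search method can be illustrated with information gain as the evaluator and a plain ranking as the search. The following Python sketch is an illustration under those assumptions, not WEKA's implementation; the function names and the four example rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Evaluator: entropy of the class minus its expected entropy after splitting."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, attributes, target):
    """Search by ranking: sort attributes by decreasing information gain."""
    return sorted(attributes,
                  key=lambda a: information_gain(rows, a, target),
                  reverse=True)


# Illustrative nominal data.
rows = [
    {"outlook": "sunny",    "windy": "FALSE", "play": "no"},
    {"outlook": "sunny",    "windy": "TRUE",  "play": "no"},
    {"outlook": "overcast", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy",    "windy": "FALSE", "play": "yes"},
]
ranking = rank_attributes(rows, ["windy", "outlook"], "play")   # outlook ranks first
```

In this toy data `outlook` separates the classes perfectly, so its gain equals the full class entropy of 1 bit and it is ranked above `windy`.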

Let's look at the WEKA Explorer interface.



WEKA Explorer: Data visualization

Visualizing the data is very useful in practice:
- It helps gauge how difficult the learning problem is

WEKA can display:
- Each attribute on its own (1-D visualization)
- A pair of attributes (2-D visualization)

Different class values (labels) are shown in different colors.
The Jitter slider makes the plot easier to read when too many examples (points) are concentrated around the same position.
Zooming in and out is done by increasing or decreasing the values of PlotSize and PointSize.
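The effect of jitter can be sketched in Python: each plotted point is shifted by a small random offset so that points sharing the same coordinates no longer hide each other. This is only an illustration of the idea. The function name, the uniform noise model, and the `amount` parameter are assumptions, not WEKA's implementation.

```python
import random


def jitter(points, amount, seed=0):
    """Return a copy of 2-D points, each shifted by uniform noise in [-amount, amount]."""
    rng = random.Random(seed)          # seeded so the displayed layout is repeatable
    return [(x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
            for x, y in points]


# Three examples that would be drawn on top of each other.
overlapping = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
spread = jitter(overlapping, amount=0.1)
```

Jitter changes only where the points are drawn; the underlying data values stay the same.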

Let's look at the WEKA Explorer interface.

