
Data Mining

Nguyễn Nhật Quang

quangnn-fit@mail.hut.edu.vn
School of Information and Communication Technology
Hanoi University of Science and Technology
Academic year 2010-2011

Course contents:
- Introduction to data mining
- Introduction to the WEKA tool
- Data preprocessing
- Association rule discovery
- Classification and prediction techniques
- Clustering techniques


WEKA: Introduction

WEKA is a software tool written in Java for machine learning and data mining.

Main features:
- A collection of data preprocessing tools, machine learning and data mining algorithms, and experimental evaluation methods
- A graphical user interface (including data visualization)
- An environment for comparing machine learning and data mining algorithms

It can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/

WEKA: Main environments

Simple CLI
- A simple command-line interface (similar to MS-DOS)

Explorer (we will mainly use this environment!)
- An environment that gives access to all of WEKA's capabilities for exploring data

Experimenter
- An environment for running experiments and performing statistical tests between machine learning models

KnowledgeFlow
- An environment for designing the steps (components) of an experiment graphically, by drag and drop

WEKA: The Explorer environment



- Preprocess: select and modify (process) the data to work with
- Classify: train and test machine learning models (classification, or regression/prediction)
- Cluster: learn groups from the data (clustering)
- Associate: discover association rules from the data
- Select attributes: identify and select the most relevant (important) attributes of the data
- Visualize: view interactive two-dimensional plots of the data

WEKA: Dataset format

WEKA works natively with plain-text files in the ARFF format.

Example of a dataset:

    % Name of the dataset
    @relation weather

    % Nominal attribute
    @attribute outlook {sunny, overcast, rainy}
    % Numeric attributes
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    % Class attribute (by default, the last attribute)
    @attribute play {yes, no}

    % Examples (instances)
    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
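To make the structure of the format concrete, here is a minimal sketch of an ARFF reader in plain Python. It is an illustration only, not WEKA's own loader: the function name `parse_arff` is hypothetical, and the sketch assumes a simple file with no quoted values, no missing values and no sparse rows.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):           # nominal: list of allowed values
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


WEATHER = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
"""

relation, attributes, rows = parse_arff(WEATHER)
class_name = attributes[-1][0]   # the last attribute is the class by default
```

The parser keeps every data value as a string; converting numeric attributes is left out to keep the sketch short.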

WEKA Explorer: Data preprocessing

- Data can be imported from a file in one of these formats: ARFF, CSV
- Data can also be read from a URL, or from a database via JDBC
- WEKA's data preprocessing tools are called filters:
  - Discretization
  - Normalization
  - Re-sampling
  - Attribute selection
  - Transforming and combining attributes

Let's look at the WEKA Explorer interface.
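As an illustration of what two of these filters do, the following sketch implements min-max normalization and equal-width discretization in plain Python. It mirrors the idea only and is not WEKA's implementation; the function names and the temperature values (apart from 85 and 83, which come from the weather example) are made up for the example.

```python
def normalize(values):
    """Min-max rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width discretization: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


temperature = [85, 80, 83, 70, 68, 65]   # illustrative values
scaled = normalize(temperature)          # 65 maps to 0.0, 85 maps to 1.0
binned = discretize(temperature, 2)      # two bins: [65, 75) and [75, 85]
```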



WEKA Explorer: Classifiers (1)

- Classifiers in WEKA are models that predict nominal quantities (classification) or numeric quantities (regression/prediction)
- Classification techniques supported by WEKA:
  - Naïve Bayes classifier and Bayesian networks
  - Decision trees
  - Instance-based classifiers
  - Support vector machines
  - Neural networks

Let's look at the WEKA Explorer interface.

WEKA Explorer: Classifiers (2)

- Select a classifier
- Select the test options:
  - Use training set. The learned classifier is evaluated on the training set itself
  - Supplied test set. A separate dataset (different from the training set) is used for evaluation
  - Cross-validation. The dataset is divided into k subsets (folds) of approximately equal size, and the learned classifier is evaluated by cross-validation
  - Percentage split. Specifies the percentage of the dataset used for training; the remainder is used for evaluation
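The difference between the last two test options can be sketched with index lists. This is a simplified illustration, not WEKA's code: the function names are hypothetical, the dataset is assumed to have 14 instances, and the folds are not stratified by class.

```python
import random


def percentage_split(n, train_percent, seed=1):
    """Shuffle indices 0..n-1 and split them into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_percent / 100)
    return indices[:cut], indices[cut:]


def k_folds(n, k, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# Assumed dataset size of 14 instances.
train, test = percentage_split(14, 66)   # 66% for training, the rest for testing
splits = list(k_folds(14, 10))           # each instance is tested exactly once
```

With a percentage split every instance is used either for training or for testing; with cross-validation every instance is used for testing once and for training in the other k-1 rounds.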


WEKA Explorer: Classifiers (3)

More options
- Output model. Displays the learned classifier
- Output per-class stats. Displays precision/recall statistics for each class
- Output entropy evaluation measures. Displays entropy (impurity) evaluation measures for the dataset
- Output confusion matrix. Displays the classification error matrix (confusion matrix) of the learned classifier
- Store predictions for visualization. The classifier's predictions are kept in memory so that they can be displayed later
- Output predictions. Displays the detailed predictions for the test set
- Cost-sensitive evaluation. Classification errors are weighted according to a specified cost matrix
- Random seed for XVal / % Split. Specifies the random seed used when randomly selecting the examples for the test set
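To illustrate the idea behind cost-sensitive evaluation, the sketch below weights each cell of a confusion matrix by a cost matrix. The counts and costs are invented for the example, and `total_cost` is a hypothetical helper rather than a WEKA function.

```python
def total_cost(confusion, cost):
    """Sum count * cost over every (actual, predicted) cell."""
    size = len(confusion)
    return sum(
        confusion[a][p] * cost[a][p]
        for a in range(size)
        for p in range(size)
    )


# Rows are actual classes, columns are predicted classes (order: yes, no).
confusion = [[7, 2],
             [1, 4]]
# Hypothetical costs: predicting "no" for an actual "yes" is five times
# worse than the opposite mistake; correct predictions cost nothing.
cost = [[0, 5],
        [1, 0]]

errors = confusion[0][1] + confusion[1][0]   # plain error count
weighted = total_cost(confusion, cost)       # errors weighted by their cost
```

Two classifiers with the same error count can therefore have very different total costs, depending on which kind of mistake they make.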

WEKA Explorer: Classifiers (4)

Classifier output displays the important information:
- Run information. The options of the learning model, the name of the dataset, the number of examples, the attributes, and the test mode
- Classifier model (full training set). A textual representation of the learned classifier
- Predictions on test data. Detailed information about the classifier's predictions on the test set
- Summary. Statistics on the accuracy of the classifier under the chosen test mode
- Detailed Accuracy By Class. Detailed accuracy information for each class
- Confusion Matrix. The entries of this matrix show the number of test instances that were classified correctly and incorrectly
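The figures reported in the Summary, Detailed Accuracy By Class and Confusion Matrix sections can all be derived from pairs of actual and predicted labels. The following sketch shows one way to compute them; the label lists are invented and the helper names are hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def precision_recall(counts, label, labels):
    """Per-class precision and recall from the pair counts."""
    tp = counts[(label, label)]
    predicted_as = sum(counts[(a, label)] for a in labels)
    actually = sum(counts[(label, p)] for p in labels)
    precision = tp / predicted_as if predicted_as else 0.0
    recall = tp / actually if actually else 0.0
    return precision, recall


# Invented test-set results for a two-class problem.
actual    = ["yes", "yes", "yes", "no", "no",  "yes", "no", "yes"]
predicted = ["yes", "no",  "yes", "no", "yes", "yes", "no", "yes"]
labels = ["yes", "no"]

counts = confusion_counts(actual, predicted)
accuracy = sum(counts[(c, c)] for c in labels) / len(actual)
p_yes, r_yes = precision_recall(counts, "yes", labels)
```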

WEKA Explorer: Classifiers (5)

Result list provides several useful functions:
- Save model. Saves the model of the learned classifier to a binary file
- Load model. Reads a previously learned model back from a binary file
- Re-evaluate model on current test set. Evaluates a previously learned model (classifier) on the current test set
- Visualize classifier errors. Opens a plot window showing the classification results
  - Correctly classified examples are drawn as crosses (x), and misclassified examples are drawn as squares



WEKA Explorer: Clusterers (1)

- Clusterers (cluster builders) in WEKA are models that find groups of similar examples in a dataset
- Clustering techniques supported by WEKA:
  - Expectation maximization (EM)
  - k-Means
  - ...
- The results of a clusterer can be visualized and compared with the true clusters (classes)

Let's look at the WEKA Explorer interface.
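For readers who have not seen k-Means before, here is a deliberately small one-dimensional sketch of the algorithm. It is not WEKA's implementation: the function name, the fixed number of iterations, the data points and the initial centroids are all assumptions made for the example.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [1, 2, 3, 10, 11, 12]                 # invented data
centroids, clusters = k_means_1d(points, [2, 3])
```

Even with poorly chosen initial centroids (2 and 3), the two steps pull the centroids towards the two natural groups within a few iterations.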

WEKA Explorer: Clusterers (2)

- Select a clusterer (cluster builder)
- Select the cluster mode:
  - Use training set. The learned clusters are tested on the training set
  - Supplied test set. A separate dataset is used to test the learned clusters
  - Percentage split. Specifies how the original dataset is split in order to build the test set
  - Classes to clusters evaluation. Compares how accurately the learned clusters match the specified classes
- Store clusters for visualization
  - Keeps the learned clusters in memory so that they can be displayed later
- Ignore attributes
  - Selects the attributes that will not take part in learning the clusters
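The idea of a classes-to-clusters evaluation can be sketched as follows: label each cluster with a class, then count the examples whose class disagrees with their cluster's label. This sketch uses a simple majority-class labelling, which is a simplification and may differ from the assignment WEKA computes; the data and the function name are invented.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and count the mismatches."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    mapping = {c: Counter(ls).most_common(1)[0][0] for c, ls in members.items()}
    wrong = sum(mapping[c] != l for c, l in zip(cluster_ids, class_labels))
    return mapping, wrong


# Invented clustering result for seven examples.
cluster_ids  = [0, 0, 0, 1, 1, 1, 1]
class_labels = ["yes", "yes", "no", "no", "no", "yes", "no"]

mapping, wrong = classes_to_clusters(cluster_ids, class_labels)
error_rate = wrong / len(class_labels)
```

Note that the class labels are used only for evaluation; the clusters themselves are learned without them.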

WEKA Explorer: Association rules

- Select a model (algorithm) for discovering association rules
- Associator output displays the important information:
  - Run information. The options of the association rule discovery model, the name of the dataset, the number of examples, and the attributes
  - Associator model (full training set). A textual representation of the set of discovered association rules:
    - Minimum support
    - Minimum confidence
    - Sizes of the large (frequent) itemsets
    - The list of association rules found

Let's look at the WEKA Explorer interface.
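The two thresholds in the associator output, minimum support and minimum confidence, refer to measures that are easy to compute by hand. The sketch below does so for a tiny invented transaction set; the function names are hypothetical and no rule mining algorithm is implemented here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Invented market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(transactions, {"bread", "milk"})        # 2 of 4 transactions
c = confidence(transactions, {"bread"}, {"milk"})   # rule: bread => milk
```

A rule is reported only if its support and confidence both reach the chosen minimum values.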



WEKA Explorer: Attribute selection

- Used to determine which attributes are the most important
- In WEKA, an attribute selection method consists of two parts:
  - Attribute Evaluator. Specifies how the relevance of attributes is measured. Examples: correlation-based, wrapper, information gain, chi-squared, ...
  - Search Method. Specifies the method (order) in which attributes are examined. Examples: best-first, random, exhaustive, ranking, ...

Let's look at the WEKA Explorer interface.
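As an example of how an evaluator and a search method fit together, the sketch below scores attributes by information gain (the evaluator) and sorts them by that score (a ranking search). It is an illustration in plain Python under invented data, not WEKA's implementation, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the class minus its expected entropy after splitting."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Four invented examples with two nominal attributes and a class.
play    = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "overcast", "rainy"]
windy   = ["FALSE", "TRUE", "FALSE", "TRUE"]

scores = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
# Ranking: attributes ordered from most to least informative.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy data `outlook` separates the classes perfectly while `windy` carries no information about the class, so `outlook` is ranked first.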



WEKA Explorer: Data visualization

- Visualizing the data is essential in practice
  - It helps to assess how difficult the learning problem is
- WEKA can display:
  - Each attribute individually (1-D visualization)
  - A pair of attributes (2-D visualization)
- Different class values (labels) are shown in different colors
- The Jitter slider makes the display clearer when too many examples (points) are concentrated around the same position on the plot
- Zooming in and out (by increasing or decreasing the values of PlotSize and PointSize)

Let's look at the WEKA Explorer interface.
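The effect of jitter can be imitated by adding a small random offset to each plotted value, so that points with identical coordinates no longer overlap. The sketch below is an assumption about the general technique, not a description of how WEKA computes its jitter; the function name, the noise range and the data are invented.

```python
import random


def jitter(values, amount, seed=0):
    """Add uniform random noise in [-amount, +amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


humidity = [85, 85, 85, 86]      # three points share the same value
spread = jitter(humidity, 0.3)   # the same points, slightly displaced
```

The noise only changes where points are drawn; the underlying data values stay the same.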