
Data Mining

Nguyễn Nhật Quang

quangnn-fit@mail.hut.edu.vn
Hanoi University of Science and Technology, School of Information and Communication Technology
Academic year 2011-2012

Course contents:

Introduction to data mining
Introduction to the WEKA tool
Data preprocessing
Association rule discovery
Classification and prediction techniques
Clustering techniques


WEKA is a software tool written in Java for machine learning and data mining.

Main features:
  A collection of data preprocessing tools, machine learning and data mining algorithms, and experimental evaluation methods
  A graphical user interface (including data visualization)
  An environment for comparing machine learning and data mining algorithms

It can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/

Simple CLI
  A simple command-line interface (similar to MS-DOS)

Explorer (the environment we will mainly use)
  An environment that gives access to all of WEKA's capabilities for exploring data

Experimenter
  An environment for running experiments and performing statistical tests between machine learning models

KnowledgeFlow
  An environment for designing the steps (components) of an experiment graphically, by drag and drop

WEKA: the Explorer environment

Preprocess
  Select and modify (process) the working data
Classify
  Train and test machine learning models (classification, or regression/prediction)
Cluster
  Learn groups from the data (clustering)
Associate
  Discover association rules from the data
Select attributes
  Identify and select the most relevant (important) attributes of the data
Visualize
  View an interactive 2-D plot of the data

WEKA: dataset format

WEKA works natively with plain-text files in the ARFF format.

Example of a dataset:

% Name of the dataset
@relation weather

% Nominal attribute
@attribute outlook {sunny, overcast, rainy}
% Numeric attributes
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
% Class attribute (by default, the last attribute)
@attribute play {yes, no}

% Instances
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
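To make the structure of the weather example concrete, here is a minimal Python sketch of reading such a file. It is not WEKA's own loader: the function name `parse_arff` is hypothetical, and the sketch assumes unquoted names, only nominal and `real` attributes, and comma-separated data rows.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Simplified sketch: unquoted names, nominal or real attributes only.
    """
    relation = None
    attributes = []   # (name, "real") or (name, [nominal values])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comment lines
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                spec = [value.strip() for value in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


WEATHER = """\
% Name of the dataset
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
"""

relation, attributes, rows = parse_arff(WEATHER)
class_name = attributes[-1][0]   # by default the class is the last attribute
print(relation, class_name, len(rows))
```

The header lines describe the columns, and every line after `@data` is one instance whose values follow the attribute order.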

Data can be imported from a file in ARFF or CSV format.

Data can also be read from a URL, or from a database through JDBC.

WEKA's data preprocessing tools are called filters:

  Discretization
  Normalization
  Re-sampling
  Attribute selection
  Transforming and combining attributes

Let's look at the WEKA Explorer interface...
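Two of the filter types listed above can be illustrated with a short Python sketch. It assumes min-max scaling for normalization and equal-width binning for discretization, which are common choices but not necessarily what a given WEKA filter does by default. The function names and sample values are made up for illustration.

```python
def normalize(values):
    """Min-max normalization: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Equal-width discretization: map each value to a bin index 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin number `bins`, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


temperature = [85, 80, 83, 70, 68]   # illustrative numeric attribute values
print(normalize(temperature))
print(discretize(temperature, 3))
```

Normalization keeps the attribute numeric, while discretization turns it into a small set of ordered categories that nominal-only algorithms can use.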


WEKA Explorer: Classifiers

WEKA's classifiers are models that predict nominal quantities (classification) or numeric quantities (regression/prediction).

Classification techniques supported by WEKA:
  Naive Bayes classifier and Bayesian networks
  Decision trees
  Instance-based classifiers
  Support vector machines
  Neural networks

Let's look at the WEKA Explorer interface...
Select a classifier.
Select the test options:
  Use training set. The learned classifier is evaluated on the training set.
  Supplied test set. A separate dataset (different from the training set) is used for evaluation.
  Cross-validation. The dataset is divided into k subsets (folds) of approximately equal size, and the learned classifier is evaluated by cross-validation: each fold serves once as the test set while the remaining folds are used for training.
  Percentage split. Specifies the proportion in which the dataset is split into training and test parts for evaluation.

More options...
  Output model. Displays the learned classifier.

  Output per-class stats. Displays precision/recall statistics for each class.
  Output entropy evaluation measures. Displays entropy-based evaluation measures for the dataset.
  Output confusion matrix. Displays the confusion matrix of the learned classifier.
  Store predictions for visualization. The classifier's predictions are kept in memory so that they can be visualized later.
  Output predictions. Displays the detailed predictions for the test set.
  Cost-sensitive evaluation. The classifier's errors are weighted according to a specified cost matrix.
  Random seed for XVal / % Split. Specifies the random seed used when randomly selecting the instances for the test set.
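The cross-validation and percentage split options, together with the random seed, can be sketched in Python as follows. This is an illustration of the idea rather than WEKA's implementation: the function names are hypothetical, the shuffling relies on Python's `random` module, and the folds are not stratified by class.

```python
import random


def k_fold_indices(n, k, seed=1):
    """Shuffle indices 0..n-1 with a fixed seed and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n, k, seed=1):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def percentage_split(n, train_percent, seed=1):
    """Shuffle indices and keep the first train_percent% for training."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_percent / 100)
    return indices[:cut], indices[cut:]


for train, test in cross_validation_splits(n=14, k=3):
    print(len(train), len(test))
train, test = percentage_split(n=14, train_percent=66)
print(len(train), len(test))
```

Reusing the same seed reproduces the same partition, which is what makes two evaluation runs comparable.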

Classifier output displays the key information:
  Run information. The options of the learning model, the name of the dataset, the number of instances, the attributes, and the evaluation method (test mode).
  Classifier model (full training set). A textual representation of the learned classifier.
  Predictions on test data. Detailed information about the classifier's predictions on the test set.
  Summary. Statistics on the accuracy of the classifier under the chosen evaluation method.
  Detailed Accuracy By Class. Detailed accuracy information for each class.
  Confusion Matrix. The entries of this matrix give the numbers of test instances that were classified correctly and incorrectly.
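The relationship between the confusion matrix, the overall accuracy, and the per-class precision and recall can be shown with a small Python sketch. The predictions below are made up, and the function names are hypothetical; the layout with actual classes as rows and predicted classes as columns is an assumption of this sketch.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def per_class_stats(matrix, labels):
    """Return {label: (precision, recall)} computed from a confusion matrix."""
    stats = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted_as = sum(row[i] for row in matrix)   # column total
        actually = sum(matrix[i])                      # row total
        precision = true_pos / predicted_as if predicted_as else 0.0
        recall = true_pos / actually if actually else 0.0
        stats[label] = (precision, recall)
    return stats


# Made-up predictions for six test instances.
actual = ["yes", "yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes", "yes"]
labels = ["yes", "no"]

matrix = confusion_matrix(actual, predicted, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print(matrix)
print(accuracy)
print(per_class_stats(matrix, labels))
```

The diagonal of the matrix holds the correctly classified instances, so accuracy is the diagonal sum divided by the number of test instances.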


Result list provides several useful functions:
  Save model. Saves the model of the learned classifier to a binary file.
  Load model. Loads a previously learned model from a binary file.
  Re-evaluate model on current test set. Evaluates a previously learned model (classifier) on the current test set.
  Visualize classifier errors. Opens a plot window showing the classification results.

Correctly classified instances are drawn as crosses (x), and misclassified instances are drawn as squares (□).

WEKA's cluster builders are models that find groups of similar instances in a dataset.


Clustering techniques supported by WEKA:
  Expectation maximization (EM)
  k-Means

The resulting clusters can be visualized and compared with the actual clusters (classes).

Let's look at the WEKA Explorer interface...
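As a rough illustration of the k-Means technique named above, the following Python sketch alternates the assignment and update steps until no point changes cluster. It is a simplification, not WEKA's implementation: the first k points are used as initial centroids to keep the run deterministic (real implementations usually pick them at random), and the data points are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = list(points[:k])
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

The EM technique is not shown here; it assigns instances to clusters with probabilities instead of the hard assignments used in this sketch.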
Select a cluster builder.
Select the cluster mode:
  Use training set. The learned clusters are evaluated on the training set.
  Supplied test set. A separate dataset is used to evaluate the learned clusters.
  Percentage split. Specifies the proportion of the original dataset that is set aside to build the test set.
  Classes to clusters evaluation. Compares how well the learned clusters match the specified classes.

Store clusters for visualization
  Keeps the learned clusters in memory so that they can be visualized later.

Ignore attributes
  Selects the attributes that are excluded from the clustering process.
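The idea behind comparing clusters with classes can be sketched in Python as below. This is a simplified version under a stated assumption: each cluster is labelled with its majority class, and instances whose class differs from that label count as errors. WEKA's own procedure for matching classes to clusters may differ. The function name and the data are hypothetical.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and measure the mismatch rate."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    # Majority class per cluster.
    mapping = {c: Counter(labels).most_common(1)[0][0] for c, labels in members.items()}
    errors = sum(1 for c, label in zip(cluster_ids, class_labels) if mapping[c] != label)
    return mapping, errors / len(class_labels)


# Made-up clustering result and true classes for seven instances.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
class_labels = ["no", "no", "yes", "yes", "yes", "yes", "no"]
mapping, error_rate = classes_to_clusters(cluster_ids, class_labels)
print(mapping, error_rate)
```

The class attribute is used only for this comparison and takes no part in building the clusters.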


Select an association rule mining model (algorithm).

Associator output displays the key information:
  Run information. The options of the association rule mining model, the name of the dataset, the number of instances, and the attributes.
  Associator model (full training set). A textual representation of the discovered association rules:
    Minimum support
    Minimum confidence
    Sizes of the frequent (large) itemsets
    The list of association rules found

Let's look at the WEKA Explorer interface...
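The roles of minimum support and minimum confidence can be shown with a small Python sketch. It only evaluates a single candidate rule and does not search for frequent itemsets as a full mining algorithm would. The transactions and function names are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base


def keep_rule(antecedent, consequent, transactions, min_support, min_confidence):
    """A rule is reported only if it passes both thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)


# Made-up market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
print(keep_rule({"bread"}, {"milk"}, transactions, 0.5, 0.6))
```

Raising either threshold shrinks the list of reported rules, which is why both values appear in the associator output.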


Attribute selection determines which attributes are the most important.

In WEKA, an attribute selection method consists of two parts:
  Attribute Evaluator. Specifies how the relevance of attributes is assessed.
    E.g., correlation-based, wrapper, information gain, chi-squared, ...
  Search Method. Specifies the method (order) in which attributes are considered.
    E.g., best-first, random, exhaustive, ranking, ...

Let's look at the WEKA Explorer interface...
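One possible pairing of the two parts, information gain as the evaluator and ranking as the search method, can be sketched in Python as follows. The sketch assumes nominal attributes only, and the function names and data are hypothetical rather than taken from WEKA.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the class minus its expected entropy after splitting on values."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [label for v, label in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Evaluate every attribute on its own, then sort from best to worst."""
    scores = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up nominal data: `windy` separates the class perfectly, `outlook` does not.
columns = {
    "outlook": ["sunny", "sunny", "rainy", "rainy"],
    "windy": ["TRUE", "FALSE", "TRUE", "FALSE"],
}
labels = ["no", "yes", "no", "yes"]
ranking = rank_attributes(columns, labels)
print(ranking)
```

Here the evaluator scores each attribute independently, and the search method reduces to sorting; subset-based searches such as best-first would instead score groups of attributes.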


Data visualization is essential in practice.
  It helps assess how difficult the learning problem is.

WEKA can display:
  Each attribute individually (1-D visualization)
  A pair of attributes (2-D visualization)

Different class values (labels) are shown in different colors.
The Jitter slider makes the plot easier to read when too many instances (points) are concentrated around the same position.
Zooming in and out is done by increasing or decreasing the values of PlotSize and PointSize.

Let's look at the WEKA Explorer interface...
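The effect of jitter can be illustrated with a short Python sketch. How WEKA computes jitter internally is not described here, so the uniform noise model, the `amount` parameter, and the function name are all assumptions of this sketch.

```python
import random


def jitter(points, amount, seed=1):
    """Shift every (x, y) point by uniform random noise in [-amount, +amount]."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
            for x, y in points]


# Five instances that share the same coordinates and would overlap on a plot.
points = [(2.0, 3.0)] * 5
spread = jitter(points, amount=0.1)
print(len(set(points)), len(set(spread)))
```

The noise affects only the displayed positions, so overlapping points become visible without changing the underlying data.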
