L2 - Introduction to WEKA
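Hanoi University of Science and Technology
School of Information and Communication Technology
Academic year 2012-2013
Course contents:
- Introduction to data mining
- Introduction to the WEKA tool
- Data preprocessing
- Association rule discovery
- Classification and prediction techniques
- Clustering techniques
- Collaborative filtering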
Data Mining
WEKA is a software tool written in Java for machine learning and data mining.
Main features:
- A collection of data preprocessing tools, machine learning and data mining algorithms, and experimental evaluation methods
- A graphical user interface (including data visualization)
- An environment for comparing machine learning and data mining algorithms
WEKA can be downloaded from:
http://www.cs.waikato.ac.nz/ml/weka/
Simple CLI
A simple command-line interface (similar to MS-DOS)
Explorer
Experimenter
An environment for running experiments and performing statistical tests between machine learning models
KnowledgeFlow
An environment for designing the steps (components) of an experiment interactively, using graphical drag-and-drop
- Preprocess: select and modify (process) the working data
- Classify: train and test machine learning models (classification, or regression/prediction)
- Cluster: learn groups from the data (clustering)
- Associate: discover association rules from the data
- Select attributes: identify and select the most relevant (important) attributes of the data
- Visualize: view interactive 2-D plots of the data
% Nominal attribute
@attribute outlook {sunny, overcast, rainy}
% Numeric attributes
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
% Class attribute (by default, the last attribute)
@attribute play {yes, no}
@data
% Instances
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
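To make the structure of the format concrete, here is a minimal sketch of a reader for the small ARFF subset shown above. It is not WEKA's own loader. The function name `parse_arff` and the relation name `weather` are assumptions made for illustration, and anything that is not a nominal `{...}` declaration is treated as numeric.

```python
def parse_arff(text):
    """Return (attributes, instances) for a simple ARFF document.

    attributes: list of (name, kind); kind is "numeric" or a list of nominal values.
    instances:  list of dicts keyed by attribute name.
    Sketch only: no quoted values, sparse rows, dates, or missing values.
    """
    attributes, instances, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blanks and comment lines
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            instances.append(row)
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = "numeric"                  # assumption: real/numeric/integer only
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, instances


# The relation name "weather" is a placeholder, not taken from the slides.
SAMPLE = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
"""

attributes, instances = parse_arff(SAMPLE)
class_attribute = attributes[-1][0]   # the class defaults to the last attribute
```

After running the sketch, `class_attribute` holds the name of the last declared attribute and each instance is a dictionary with numeric values converted to floats.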
Data can be imported from a file in a format such as ARFF or CSV.
Data can also be read from a URL, or from a database via JDBC.
WEKA's data preprocessing tools are called filters:
- Discretization
- Normalization
- Re-sampling
- Attribute selection
- Transforming and combining attributes
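As an illustration of what two of these filter types do to a numeric attribute, the sketch below implements min-max normalization and equal-width discretization in plain Python. It is a conceptual example, not WEKA's filter code. The function names and the temperature values are hypothetical.

```python
def normalize(values):
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]          # constant attribute: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Replace each value by the index of its equal-width bin (0 .. bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # min(...) keeps the maximum value inside the last bin
    return [min(int((v - lo) / width), bins - 1) for v in values]


temperature = [85, 80, 83, 70, 68]            # hypothetical sample values
normalized = normalize(temperature)
binned = discretize(temperature, bins=3)
```

Equal-width binning is only one possible discretization strategy; it was chosen here because it needs no class information.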
WEKA's classifiers correspond to models that predict nominal quantities (classification) or numeric quantities (regression/prediction).
Classification techniques supported by WEKA:
- Naïve Bayes classifier and Bayesian networks
- Decision trees
- Instance-based classifiers
- Support vector machines
- Neural networks
Let's look at the WEKA Explorer interface.
Select a classifier.
Select the test options:
- Use training set. The learned classifier is evaluated on the training set.
- Supplied test set. A separate dataset (different from the training set) is used for evaluation.
- Cross-validation. The dataset is divided into k folds of approximately equal size, and the learned classifier is evaluated by cross-validation.
- Percentage split. Specifies the proportion in which the dataset is split for evaluation.
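The cross-validation option can be sketched as follows: shuffle the instances, deal them into k folds, and for each fold train on the remaining folds and test on the held-out one. This is a simplified, unstratified version written for illustration; it is not WEKA's implementation. The majority-class learner and the data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=1):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(instances, k, train, predict, seed=1):
    """Return the accuracy accumulated over all k held-out folds."""
    correct = 0
    for fold in k_fold_indices(len(instances), k, seed):
        held_out = set(fold)
        training = [inst for i, inst in enumerate(instances) if i not in held_out]
        model = train(training)
        correct += sum(predict(model, instances[i][0]) == instances[i][1]
                       for i in fold)
    return correct / len(instances)


# Hypothetical baseline learner: always predict the most frequent training label.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model, features):
    return model


# Hypothetical data: (features, label) pairs, 7 "yes" and 3 "no".
data = [((i,), "yes") for i in range(7)] + [((i,), "no") for i in range(7, 10)]
accuracy = cross_validate(data, k=5, train=train_majority, predict=predict_majority)
```

Every instance is used for testing exactly once, which is the main difference from a single percentage split.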
More options
- Output model. Displays the learned classifier.
- Output per-class stats. Displays precision/recall statistics for each class.
- Output entropy evaluation measures. Displays the entropy (impurity) evaluation of the dataset.
- Output confusion matrix. Displays the confusion matrix (classification errors) of the learned classifier.
- Store predictions for visualization. The classifier's predictions are kept in memory so that they can be displayed later.
- Output predictions. Displays the detailed predictions for the test set.
- Cost-sensitive evaluation. The classifier's errors are computed according to a specified cost matrix.
- Random seed for XVal / % Split. Specifies the random seed used when randomly selecting instances for the test set.
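To show how the confusion matrix and the per-class precision/recall figures relate, here is a small sketch that computes both from lists of actual and predicted labels. The function names and the label lists are hypothetical; WEKA's own report contains more statistics than this.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def per_class_stats(matrix, labels):
    """Return {label: (precision, recall)} computed from the confusion matrix."""
    stats = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)   # column total
        actually_label = sum(matrix[i])                      # row total
        precision = true_positive / predicted_as_label if predicted_as_label else 0.0
        recall = true_positive / actually_label if actually_label else 0.0
        stats[label] = (precision, recall)
    return stats


# Hypothetical test-set results.
actual    = ["yes", "yes", "yes", "no", "no",  "yes"]
predicted = ["yes", "no",  "yes", "no", "yes", "yes"]
labels = ["yes", "no"]

matrix = confusion_matrix(actual, predicted, labels)
stats = per_class_stats(matrix, labels)
```

Precision divides the diagonal entry by its column total, while recall divides it by its row total.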
WEKA's clusterers (cluster builders) correspond to models that find groups of similar instances in a dataset.
Clustering techniques supported by WEKA:
- Expectation maximization (EM)
- k-Means
- ...
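For readers unfamiliar with k-Means, the sketch below shows its two alternating steps (assign each point to the nearest centroid, then recompute each centroid as the mean of its members). It is a bare-bones illustration rather than WEKA's implementation. Using the first k points as initial centroids and running a fixed number of iterations are simplifying assumptions, and the points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Return (centroids, assignment) after a fixed number of k-means iterations."""
    centroids = [list(p) for p in points[:k]]    # naive initialization (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centroids, assignment = kmeans(points, k=2)
```

With these points, the odd and even positions end up in different clusters, and each centroid settles near the center of its group.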
Cluster mode options:
- Use training set. The learned clusters are tested on the training set.
- Supplied test set. A separate dataset is used to test the learned clusters.
- Percentage split. Specifies the proportion in which the original dataset is split to build the test set.
- Classes to clusters evaluation. Compares how accurately the learned clusters match the specified classes.
Ignore attributes
Selects the attributes that will not take part in learning the clusters.
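The idea behind classes-to-clusters evaluation can be sketched as follows: label each cluster with the class that occurs most often inside it, then count how many instances agree with their cluster's label. This majority-class mapping is a simplification chosen for illustration, and WEKA's own procedure may differ. The function name and the data are hypothetical.

```python
from collections import Counter, defaultdict


def classes_to_clusters(assignment, classes):
    """Map each cluster to its majority class and return (mapping, accuracy)."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(assignment, classes):
        by_cluster[cluster][label] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    correct = sum(mapping[c] == label for c, label in zip(assignment, classes))
    return mapping, correct / len(classes)


# Hypothetical cluster assignments and true class labels for six instances.
cluster_ids = [0, 0, 0, 1, 1, 1]
class_labels = ["yes", "yes", "no", "no", "no", "yes"]

mapping, agreement = classes_to_clusters(cluster_ids, class_labels)
```

The class labels are used only after clustering, to judge the clusters; they play no part in learning them.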
- Run information. The options of the association rule mining model, the name of the dataset, the number of instances, and the attributes.
- Associator model (full training set). A textual representation of the discovered association rules:
  - Minimum support
  - Minimum confidence
  - Sizes of the frequent (large) itemsets
  - List of the association rules found
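The roles of minimum support and minimum confidence can be illustrated with a small sketch that enumerates frequent itemsets by brute force and then derives rules from them. Brute-force enumeration is used here for brevity instead of an efficient search, so this is not how a real associator works internally. The transactions and function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute force: test every combination of items (fine for tiny examples only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    found = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, s, confidence))
    return found


# Hypothetical market-basket transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
frequent = frequent_itemsets(transactions, min_support=0.5)
found = rules(frequent, min_confidence=0.9)
```

With these thresholds, five itemsets are frequent, and the only rule that survives is butter => bread, because every transaction containing butter also contains bread.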
Attribute selection determines which attributes are the most important.
In WEKA, an attribute selection method consists of two parts:
- Attribute Evaluator. Specifies how the relevance of the attributes is assessed. Examples: correlation-based, wrapper, information gain, chi-squared, ...
- Search Method. Specifies the order in which the attributes are considered. Examples: best-first, random, exhaustive, ranking, ...
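As one concrete pairing of an evaluator with a search method, the sketch below scores each nominal attribute by information gain and then ranks the attributes by that score. It illustrates the idea only and is not WEKA's code. The rows and function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting on the attribute."""
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder


def rank_attributes(rows, target):
    """Ranking search: order attributes from highest to lowest information gain."""
    candidates = [a for a in rows[0] if a != target]
    return sorted(candidates,
                  key=lambda a: information_gain(rows, a, target),
                  reverse=True)


# Hypothetical rows: outlook determines play completely, windy tells nothing.
rows = [
    {"outlook": "sunny",    "windy": "FALSE", "play": "no"},
    {"outlook": "sunny",    "windy": "TRUE",  "play": "no"},
    {"outlook": "overcast", "windy": "FALSE", "play": "yes"},
    {"outlook": "overcast", "windy": "TRUE",  "play": "yes"},
]
ranking = rank_attributes(rows, target="play")
```

Here the evaluator is `information_gain` and the search method is a plain ranking; other search methods would explore subsets of attributes instead of scoring them one at a time.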
WEKA can display:
- Each attribute individually (1-D visualization)
- A pair of attributes (2-D visualization)
Different class values (labels) are shown in different colors.
The Jitter slider makes the display clearer when too many instances (points) are concentrated around one position on the plot.
Zooming in and out is done by increasing or decreasing the values of PlotSize and PointSize.
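The effect of jitter can be sketched in a few lines: each plotted point is shifted by a small random offset so that instances sharing the same coordinates no longer overlap exactly. The uniform noise model, the function name, and the `amount` parameter are assumptions made for illustration; the slides do not describe how WEKA generates its jitter.

```python
import random


def jitter(points, amount, seed=0):
    """Shift each (x, y) point by a random offset in [-amount, amount] per axis."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount))
            for x, y in points]


# Hypothetical case: five instances plotted at exactly the same position.
overlapping = [(1.0, 1.0)] * 5
spread = jitter(overlapping, amount=0.1)
```

The offsets only change where the points are drawn; the underlying data values stay the same.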