
Data Mining and Applications

Class 14HCB

LAB EXERCISE 1
DATA PREPROCESSING WITH WEKA
Objective:
Students learn how to use the Weka data mining tool to carry out data
preprocessing.

Rules
- Deadline: 1 week [from 18/08/2015 to 25/08/2015]
- Format: the submission folder is named <group name> (if the group name
  contains diacritics, remove them and write the name as one word) and contains:
  o A document saved as a *.doc(x) or *.pdf file: the report answering the questions
  o The *.arff files obtained after the data preprocessing steps
- Score: 0.5/3.0

Introduction
Weka is an open-source data mining tool written in Java. It was developed by
the University of Waikato, New Zealand, and is used at IAI Lab. Weka is an
effective tool for studying the Data Mining and Applications course: because it
is free, students can examine the differences between data mining models when
they are run. In addition, results obtained with Weka can be published in the
most prestigious journals and conferences. Weka is therefore regarded as a
practical development environment of choice for data mining research.
Download Weka at: http://www.cs.waikato.ac.nz/ml/weka/
How to use Weka: see the detailed guide in the installation folder

Heart disease database

The heart disease datasets are taken from the UCI repository (datasets-UCI.jar)
and consist of:
+ heart-h.arff: data from Hungarian patients
+ heart-c.arff: data from the Cleveland area
These datasets describe the components of heart disease. Download the datasets at:
http://prdownloads.sourceforge.net/weka/datasets-UCI.jar (1.1MB)
The goal of mining these datasets is to better understand the risk factors for
heart disease, specifically the 14th attribute: num (<50: no disease; 50-1
through 50-4 indicate increasing levels of disease)

Instructors: Le Ngoc Thanh, Nguyen Hai Minh


Assignment
The question is whether heart disease can be predicted from a patient's other
known data. The data mining task chosen to answer this question is
classification/prediction, and several different algorithms will be used to find
the one that gives the best prediction results.
1. Data preparation: Data integration¹
This step merges the 2 datasets into 1. Answer the following:
a. Define data integration.
b. Is there an entity identification problem in these 2 datasets? If so, how
do you resolve it?
c. Is there a data redundancy problem in these 2 datasets? If so, how do you
resolve it?
d. Are there data value conflicts in these 2 datasets? If so, how do you
resolve them?
e. Integrate the 2 datasets into 1 dataset in preparation for the following
questions. Load the integrated dataset into Explorer. How many instances do you
have? How many attributes?
f. Take a screenshot of your Explorer window.
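
As an illustration of step 1e, the following is a minimal sketch of concatenating two ARFF files outside Weka. It assumes both files declare the same attribute names in the same order, using simple unquoted names; the helper names `parse_arff` and `merge_arff`, the relation name `heart-merged`, and the toy rows are hypothetical and are not taken from the real datasets. It deliberately does not settle questions 1b to 1d: differing nominal value lists or conflicting values still have to be reconciled by hand.

```python
def parse_arff(text):
    """Split ARFF text into (attribute declaration lines, data rows)."""
    attributes, rows = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append(line)
        elif lower.startswith("@attribute"):
            attributes.append(line)
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


def attribute_names(attributes):
    """Take the second token of each declaration as the attribute name."""
    return [decl.split()[1] for decl in attributes]


def merge_arff(text_a, text_b, relation="heart-merged"):
    """Concatenate two ARFF datasets whose attribute names match in order."""
    attrs_a, rows_a = parse_arff(text_a)
    attrs_b, rows_b = parse_arff(text_b)
    if attribute_names(attrs_a) != attribute_names(attrs_b):
        raise ValueError("attribute names differ; resolve this before merging")
    # Keeps the declarations of the first file (an assumption of this sketch).
    header = [f"@relation {relation}", ""] + attrs_a + ["", "@data"]
    return "\n".join(header + rows_a + rows_b) + "\n"


# Toy inputs for illustration only, not rows from the real files.
hungarian = """@relation heart-h
@attribute age numeric
@attribute sex {female,male}
@data
28,male
29,male
"""
cleveland = """@relation heart-c
@attribute age numeric
@attribute sex {female,male}
@data
63,male
"""

merged = merge_arff(hungarian, cleveland)
attrs, rows = parse_arff(merged)
print(len(rows), "instances,", len(attrs), "attributes")
```

The instance and attribute counts printed here are the same figures Explorer reports after the merged file is loaded, so they can serve as a quick cross-check.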
2. Descriptive data summarization²
Before preprocessing the data, an important step is to become familiar with it.
a. In the Preprocess tab, examine the age attribute and answer: what are its
mean, standard deviation, minimum and maximum values?
b. List the five-number summary of this attribute. Does Weka provide these
numbers?
c. State which attributes are numeric, which are ordinal, and which are
categorical/nominal.
d. Explain the meaning of the chart in the Explorer window. What would you call
this chart? What do the blue and red colors mean (note the pop-ups that appear
when you move the mouse over the chart)? What does this chart represent?
e. Examine each of the other attributes of the dataset as a chart in turn.
Paste the screenshots into your report.
f. What are your observations from those charts?
g. Switch to the Visualize tab. What term does the textbook use for these
plots? Set jitter to the maximum and look at the num column (the last column):
in your opinion, which attributes seem to lead to heart disease the most? Paste
into your report the plot of the attribute that you think predicts heart
disease best (Y) as a function of num (X).
h. Are there any pairs of different attributes that appear to be correlated
with each other?

¹ See [4], Section 2.4.1 Data Integration
² See [4], Section 2.2 Descriptive Data Summarization
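
For questions 2a and 2b, the statistics in question can be computed by hand as a cross-check against what the Preprocess tab displays. The sketch below uses only the Python standard library; the function name `describe` and the sample ages are hypothetical, and the quartiles use the "inclusive" interpolation method, which is only one of several conventions and may differ slightly from the textbook's.

```python
import statistics


def describe(values):
    """Return mean, sample standard deviation and the five-number summary."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),  # sample (n - 1) standard deviation
        "five_number": (data[0], q1, median, q3, data[-1]),
    }


# Toy ages for illustration only, not values from the heart datasets.
ages = [29, 41, 45, 54, 63]
summary = describe(ages)
print(summary["five_number"])  # (min, Q1, median, Q3, max)
```

Whether to use the sample or the population standard deviation is a choice; swapping `statistics.stdev` for `statistics.pstdev` gives the latter.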
3. Data preparation: Data selection³
The datasets used in this exercise have already been processed by selecting the
set of attributes relevant to the data mining goal.
a. How many attributes were in these datasets before processing?
b. Use the Select attributes tab. List the different options Weka offers for
attribute selection and briefly explain each method.
c. Compare with the data selection methods in the textbook: is there any method
that is not in Weka, or any method in Weka that is not in the textbook?
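
To make the idea of scoring attributes against the mining goal concrete, here is a minimal sketch of one common criterion, information gain, for nominal attributes. It is offered as an assumption about what an attribute evaluator might compute, not as a description of any particular Weka option; the function names and the toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(attribute_values, labels):
    """Entropy of the class minus its expected entropy after the split."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data for illustration only.
num = ["sick", "sick", "healthy", "healthy"]
informative = ["a", "a", "b", "b"]   # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # tells nothing about the class

print(information_gain(informative, num))
print(information_gain(uninformative, num))
```

Ranking attributes by such a score and keeping the top few is one simple selection strategy; numeric attributes would first need to be discretized for this sketch to apply.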
4. Data preparation: Data cleaning⁴
Handle missing, noisy, and inconsistent data. Use the filters in Weka to clean
the data.
a. Missing values: List the methods you have learned for handling missing data.
Which of them does Weka implement? Choose 1 method to handle the missing values
in the dataset and explain why you chose it. Implement 1 other method of your
choice if it is not available in Weka.
b. Noisy data: List the methods you have learned for removing noisy data. Which
of them does Weka implement?
c. Outlier detection: List the methods you have learned for detecting outliers.
How do you detect outliers with Weka? Are there outliers in the given dataset?
If so, list some of them.
d. Save the cleaned dataset to the file heart-cleaned.arff and paste into your
report 1 screenshot showing at least 10 rows of data with all columns.
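
Question 4a asks for an implementation of a missing-value method of your own choosing, and 4c asks about outliers. The sketch below shows one possible choice for each: filling missing numeric values with the column median, and flagging values outside 1.5 times the interquartile range. Both are illustrative assumptions rather than the method the exercise expects; the function names, the factor `k=1.5`, and the toy columns are hypothetical.

```python
import statistics

MISSING = "?"  # ARFF marks a missing value with a question mark


def impute_median(column):
    """Replace missing entries of a numeric column with the column median."""
    present = [float(v) for v in column if v != MISSING]
    fill = statistics.median(present)
    return [fill if v == MISSING else float(v) for v in column]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Toy columns for illustration only, not values from the heart datasets.
raw_column = ["63", "?", "41", "57"]
print(impute_median(raw_column))

measurements = [1, 2, 3, 4, 5, 100]
print(iqr_outliers(measurements))
```

The median is less sensitive to extreme values than the mean, which is one argument for it when a column also contains outliers; for nominal columns the most frequent value plays the same role.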
³ See [3], Section 7
⁴ See [4], Section 2.3 Data Cleaning
⁵ See [4], Section 2.4.2 Data Transformation

5. Data preparation: Data transformation⁵
Among the data transformation techniques, use Weka's filters to explore the
following techniques:
a. Attribute construction: for example, add an attribute that is the sum of 2
other attributes. Which Weka filter allows this?
b. Normalize an attribute. Which Weka filter allows this? Can the filter
perform min-max normalization, z-score normalization, or decimal scaling
normalization? Describe specifically how to carry out each of these
normalizations in Weka.
c. Choose 1 method and normalize all real-valued attributes; explain your
choice.
d. Save the normalized dataset to the file heart-normal.arff and take a
screenshot showing at least 10 rows of data with all columns.
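
The three normalizations named in question 5b can be written out directly, which helps when checking the values a filter produces. This is a minimal sketch using the usual textbook definitions; the function names are hypothetical, and the z-score here uses the population standard deviation, which is an assumption (a tool may use the sample standard deviation instead).

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:  # constant column
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |v| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Toy values for illustration only.
print(min_max([10, 20, 30]))
print(z_score([10, 20, 30]))
print(decimal_scaling([-991, 45, 120]))
```

Min-max is sensitive to outliers because a single extreme value sets the range, whereas z-score does not bound the result; that trade-off is worth mentioning when justifying the choice in question 5c.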
6. Data preparation: Data reduction⁶
Databases are often very large and cannot be manipulated directly. Data
reduction techniques are applied as part of data preprocessing. In the
Preprocess tab, besides attribute selection, one data reduction method is to
pick out a subset of the rows of a dataset, also known as sampling. How do you
sample with Weka's filters? Can they perform the 2 main methods, Simple Random
Sample Without Replacement and Simple Random Sample With Replacement?
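
The difference between the two sampling schemes is easiest to see side by side. The sketch below uses Python's standard `random` module; the function names `srswor` and `srswr` and the `seed` parameter are hypothetical conveniences for reproducibility.

```python
import random


def srswor(rows, n, seed=None):
    """Simple random sample WITHOUT replacement: each row appears at most once."""
    return random.Random(seed).sample(rows, n)


def srswr(rows, n, seed=None):
    """Simple random sample WITH replacement: a row may be drawn repeatedly."""
    return random.Random(seed).choices(rows, k=n)


# Row indices stand in for dataset instances in this toy example.
rows = list(range(10))
print(srswor(rows, 5, seed=1))
print(srswr(rows, 15, seed=1))
```

Without replacement the sample size cannot exceed the number of rows (`random.sample` raises `ValueError` otherwise), while with replacement any size is possible, including one larger than the dataset.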

References

[1] Lecture slides
[2] Weka home page: http://www.cs.waikato.ac.nz/ml/weka/
[3] Weka Explorer user guide
[4] Textbook: J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Second Edition, Chapter 2: Data Preprocessing
[5] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and
Techniques

⁶ See [4], Section 2.5 Data Reduction
