L3: Data Preprocessing
Course content:
- Introduction to data mining
- Introduction to the WEKA tool
- Data preprocessing
- Association rule discovery
- Classification and prediction techniques
- Clustering techniques
Data Mining
Datasets
A dataset is a collection of objects and their attributes.
Each attribute describes one characteristic of an object.
Example: the attributes Refund, Marital Status, Taxable Income, and Cheat. Each column is an attribute and each row is an object.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      ...             ...             Yes
6    No      ...             ...             No
7    Yes     ...             ...             No
8    No      ...             ...             Yes
9    No      ...             ...             No
10   No      ...             ...             Yes
Types of datasets
Record
- Records in a relational database
- Data matrix
- Document representation
- Transaction data
Graph
- World Wide Web
- Information networks or social networks
- Molecular structures
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
(Han, Kamber - Data Mining: Concepts and Techniques)
Ordered
- Spatial data (e.g., maps)
- Temporal data (e.g., time-series data)
- Sequence data (e.g., transaction sequences)
- Genetic sequence data
Ordinal type:
- Takes its value from an ordered set of values
- Example 1: attributes that take numeric values, such as Age and Height
- Example 2: the attribute Income takes its value from the set {low, medium, high}
Descriptive characteristics of data
Purpose: to understand the available data (central tendency, variation, distribution).
Data dispersion characteristics:
- Minimum/maximum value (min/max)
- Most frequent value (mode)
- Average value (mean)
- Middle value (median)
- Variance and standard deviation
- Outliers
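As a minimal sketch, the measures listed above can be computed with Python's standard `statistics` module. The sample values are invented for illustration, and the 1.5 x IQR outlier rule is one common convention assumed here; the slides do not prescribe a specific outlier test.

```python
import statistics

# Hypothetical values of one numeric attribute (invented for illustration).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = {
    "min": min(values),
    "max": max(values),
    "mode": statistics.multimode(values),      # every most-frequent value
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.pvariance(values),  # population variance
    "stdev": statistics.pstdev(values),        # population standard deviation
}

# Assumed rule of thumb: values more than 1.5 * IQR beyond the quartiles
# are flagged as outliers.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(summary)
print("outliers:", outliers)
```

Here the value 110 lies far from the rest of the data and is the only one flagged.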
Histogram
A histogram is a very widely used graph-based representation.
It displays occurrence statistics (counts/frequencies) for a given attribute.
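The counting behind a histogram can be sketched in a few lines. The `histogram` helper, the equal-width binning, and the age values are assumptions made for illustration; a plotting library would normally draw the bars.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin [k*w, (k+1)*w)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages (invented for illustration).
ages = [13, 15, 16, 19, 20, 21, 22, 25, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70]
age_hist = histogram(ages, bin_width=10)

# Text rendering: one '#' per object in the bin.
for start, count in age_hist.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```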
Scatter plot
- Displays the two-dimensional relationship (between 2 attributes) of the data
- Allows visual inspection of groups of points, outliers, and so on
- Each pair of values of the 2 attributes is treated as the 2 coordinates of a point displayed on the plane
Noise/errors: the data contains errors or abnormal instances.
Example: salary = -525 (the value of this attribute cannot be a negative number)
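A simple way to surface this kind of error is a validity rule per attribute. The sketch below assumes records stored as dictionaries and a hypothetical helper `find_invalid`; the rule "salary must be non-negative" mirrors the example above.

```python
def find_invalid(records, attribute, is_valid):
    """Return the indexes of records whose attribute value fails the rule."""
    return [i for i, r in enumerate(records) if not is_valid(r[attribute])]

# Hypothetical records (invented for illustration).
records = [
    {"name": "A", "salary": 1200},
    {"name": "B", "salary": -525},  # noisy value: a salary cannot be negative
    {"name": "C", "salary": 0},
]

bad = find_invalid(records, "salary", lambda s: s >= 0)
print(bad)
```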
Missing attribute values need to be filled in (by inference) to ensure the accuracy of the data mining results.
Manual filling: people are assigned to check and fill in the missing attribute values. This is tedious and costly.
Automatic filling by computer, using one of:
- A default (constant) value
- The mean value of the attribute
- The mean value of the attribute, computed over all examples (records) that belong to the same class as the record in question
- The most probable value, based on a probabilistic method (e.g., Bayes' formula)
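The first three automatic strategies can be sketched as follows. The function `impute`, its parameters, and the sample records are hypothetical, and missing values are assumed to be stored as `None`. The probabilistic (Bayes-based) strategy is not covered by this sketch.

```python
from statistics import mean

def impute(records, attribute, strategy, class_attr=None, default=None):
    """Return copies of the records with missing (None) values filled in."""
    known = [r for r in records if r[attribute] is not None]
    overall = mean(r[attribute] for r in known) if known else default
    filled = []
    for r in records:
        r = dict(r)  # copy so the input records stay untouched
        if r[attribute] is None:
            if strategy == "constant":
                r[attribute] = default
            elif strategy == "mean":
                r[attribute] = overall
            elif strategy == "class_mean":
                same = [k[attribute] for k in known
                        if k[class_attr] == r[class_attr]]
                # Fall back to the overall mean if the class has no known value.
                r[attribute] = mean(same) if same else overall
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        filled.append(r)
    return filled

# Hypothetical records (invented for illustration).
data = [
    {"income": 100, "cheat": "no"},
    {"income": 80, "cheat": "no"},
    {"income": None, "cheat": "no"},
    {"income": 60, "cheat": "yes"},
    {"income": None, "cheat": "yes"},
]

by_mean = impute(data, "income", "mean")
by_class = impute(data, "income", "class_mean", class_attr="cheat")
by_const = impute(data, "income", "constant", default=0)
```

With these records the overall mean is 80, while the class-wise means are 90 for "no" and 60 for "yes", so the class-aware strategy fills the two gaps with different values.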
Regression
- Fit the data to a regression function
Clustering
- Detect and remove outliers (after the clusters have been identified)
Regression
[Figure: regression line y = x + 1 on x/y axes, with the labeled values X1 and Y1]
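Smoothing by regression can be sketched as fitting a straight line by ordinary least squares and replacing each observed value with the value on the line. The helper `fit_line` and the data points are assumptions; the points were invented to scatter around y = x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical noisy observations around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # values moved onto the line
print(round(slope, 2), round(intercept, 2))
```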
Data integration
- Data integration combines data from multiple sources into a unified data store
- Redundant attributes can be detected by correlation analysis: Pearson, cosine, chi-square
- General requirement for the data integration process: minimize (ideally avoid) redundancies and inconsistencies
- Doing so improves the speed of the data mining process and the quality of the results (knowledge) obtained
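As an illustration of correlation analysis, the sketch below computes the Pearson coefficient between pairs of numeric attributes. The attribute names and values are invented; a coefficient close to +1 or -1 suggests that one attribute carries nearly the same information as the other, and the cut-off used to call an attribute redundant is a choice left to the analyst.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attributes coming from two different sources.
height_cm = [150, 160, 170, 180, 190]
height_in = [h / 2.54 for h in height_cm]  # same measurement, other unit
weight_kg = [65, 58, 80, 70, 90]

r_redundant = pearson(height_cm, height_in)  # essentially 1: redundant
r_related = pearson(height_cm, weight_kg)    # correlated, but not redundant
print(round(r_redundant, 3), round(r_related, 3))
```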
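Min-max normalization of attribute i to the range [new_min_i, new_max_i]:
$$v^{new} = \frac{v^{old} - min_i}{max_i - min_i}\,(new\_max_i - new\_min_i) + new\_min_i$$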
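A minimal sketch of the min-max formula, assuming the attribute values are held in a Python list. The function name, the default target range [0, 1], and the income values are illustrative choices.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("constant attribute: min-max normalization is undefined")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

# Hypothetical income values.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
print([round(v, 3) for v in normalized])
```

The smallest value maps to new_min and the largest to new_max; 73600 lands at roughly 0.716.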
Z-score normalization
$\mu_i$, $\sigma_i$: the mean and standard deviation of attribute i
$$v^{new} = \frac{v^{old} - \mu_i}{\sigma_i}$$
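The z-score transformation can be sketched with the standard library. Whether the population or the sample standard deviation is meant is not stated in the slides; this sketch assumes the population form (`pstdev`).

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("constant attribute: z-score is undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical values with mean 5 and population standard deviation 2.
z = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)
```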
Normalization by decimal scaling:
$$v^{new} = \frac{v^{old}}{10^{j}}$$
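Dividing by a power of ten is usually called normalization by decimal scaling. The sketch below assumes the common convention that j is the smallest non-negative integer for which every scaled absolute value is below 1; that convention and the sample values are assumptions, since the definition of j is not given here.

```python
def decimal_scale(values):
    """Divide by 10**j, with j the smallest j >= 0 such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Hypothetical values: the largest absolute value is 986, so j = 3.
scaled, j = decimal_scale([-986, 120, 917])
print(j, scaled)
```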
Data reduction
Why is data reduction needed?
- A large data warehouse (dataset) can hold up to terabytes of data
- As a result, running the data mining process on the entire dataset can take a very long time
Dimensionality reduction
Negative effects of a large number of dimensions (attributes):
- As the number of dimensions grows, the data becomes sparser
- The density of points and the distances between them (important for clustering and outlier detection) become less meaningful
[Figure: data points plotted on the axes x1 and x2]
(Han, Kamber - Data Mining: Concepts and Techniques)
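The loss of meaning of distances in high dimensions can be observed with a small experiment. The sketch below is only an illustration under stated assumptions: points are drawn uniformly from the unit hypercube, and the "relative contrast" (farthest minus nearest distance, divided by the nearest distance) is one possible way to quantify the effect.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

contrast_low = relative_contrast(2)     # few dimensions
contrast_high = relative_contrast(500)  # many dimensions
print(contrast_low, contrast_high)
```

In 2 dimensions the nearest point is far closer than the farthest one, while in 500 dimensions all points lie at nearly the same distance from the query, so "nearest" carries little information.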
Appropriate aggregation levels
- Use the most compact (smallest) representation that is sufficient to answer the given request (information query)
- Queries over aggregated information should be answered using data cubes
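Aggregation to a coarser level can be sketched as a group-by over one dimension. The sales records, the field names, and the `aggregate` helper are hypothetical; a real data cube would precompute such totals for several dimensions and levels.

```python
from collections import defaultdict

# Hypothetical quarterly sales (invented for illustration).
sales = [
    {"year": 2008, "quarter": "Q1", "amount": 224},
    {"year": 2008, "quarter": "Q2", "amount": 408},
    {"year": 2008, "quarter": "Q3", "amount": 350},
    {"year": 2008, "quarter": "Q4", "amount": 586},
    {"year": 2009, "quarter": "Q1", "amount": 300},
    {"year": 2009, "quarter": "Q2", "amount": 416},
    {"year": 2009, "quarter": "Q3", "amount": 380},
    {"year": 2009, "quarter": "Q4", "amount": 600},
]

def aggregate(records, key, measure):
    """Sum `measure` over all records that share the same value of `key`."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r[measure]
    return dict(totals)

yearly = aggregate(sales, "year", "amount")
print(yearly)
```

A query about yearly sales can then be answered from two aggregated rows instead of eight quarterly ones.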
Data sampling
- Data sampling is an important method for data selection
- Sampling is necessary because collecting and processing an entire large dataset is costly and time-consuming
Key principles of data sampling:
- Using a sample works almost as well as using the entire dataset, provided that the sample is representative of the dataset
- A sample is representative of a dataset if it has (approximately) the same properties as the dataset
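A minimal sketch of simple random sampling without replacement, checking one property (the mean) of the sample against the full dataset. The population, the sample size, and the seed are arbitrary illustrative choices; other sampling schemes (for example, stratified sampling) are not shown.

```python
import random
from statistics import mean

def simple_random_sample(population, k, seed=None):
    """Draw k distinct items uniformly at random (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(population, k)

# Hypothetical population of 10,000 numeric values.
population = list(range(1, 10001))
sample = simple_random_sample(population, 1000, seed=42)

# A representative sample should have roughly the same mean as the population.
print(mean(population), mean(sample))
```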