
Data Mining

Nguyễn Nhật Quang

quangnn-fit@mail.hut.edu.vn
School of Information and Communication Technology, Hanoi University of Science and Technology
Academic year 2010-2011

Course contents:
- Introduction to data mining
- Introduction to the WEKA tool
- Data preprocessing
- Association rule discovery
- Classification and prediction techniques
- Clustering techniques

Datasets

A dataset is a collection of objects and their attributes.
- Each attribute describes one characteristic of an object. Example: the attributes Refund, Marital Status, Taxable Income, Cheat.
- A set of attribute values describes one object.
- An object is also referred to as a record, data point, case, sample, entity, or instance.

In the table below, each column is an attribute and each row is an object.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Tan, Steinbach, Kumar - Introduction to Data Mining)

Types of datasets

Record
- Records in a relational database
- Data matrix
- Document representation
- Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

(Han, Kamber - Data Mining: Concepts and Techniques)

Graph
- World Wide Web
- Information networks or social networks
- Molecular structures

Ordered
- Spatial data (e.g. maps)
- Temporal data (e.g. time-series data)
- Sequence data (e.g. transaction sequences)
- Genetic sequence data

Types of attribute values

Nominal: no ordering
- Takes values from an unordered set of values (names)
- Example: attributes such as Name, Profession, ...

Binary: a special case of the nominal type
- The value set contains exactly 2 values (Y/N, 0/1, T/F)

Ordinal:
- Takes values from an ordered set of values
- Example 1: attributes with numeric values, such as Age, Height, ...
- Example 2: the attribute Income takes values from the set {low, medium, high}

Discrete vs. continuous attributes

Discrete-valued attributes
- The set of values is finite (or countably infinite)
- Includes attributes whose values are integers
- Includes binary attributes

Continuous-valued attributes
- The values are real numbers

Descriptive characteristics of data

Purpose: to understand the data at hand (central tendency, variation, distribution)

Data dispersion
- Minimum and maximum values (min/max)
- Most frequent value (mode)
- Mean
- Median
- Variance and standard deviation
- Outliers
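As a rough illustration of the measures listed above, the sketch below computes them with Python's standard `statistics` module. The `prices` list reuses the Price values from the binning example in these slides; using the population (rather than sample) variance is an assumption made only for this example.

```python
import statistics

# Price values from the binning example in these slides.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

summary = {
    "min": min(prices),
    "max": max(prices),
    "mode": statistics.mode(prices),           # most frequent value
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "variance": statistics.pvariance(prices),  # population variance (assumed)
    "stdev": statistics.pstdev(prices),        # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean (about 20.3) lies below the median (22.5), a small hint that the values are not perfectly symmetric.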

Data visualization

- Representing data with visualization methods helps reveal the characteristics of the data
- Provides a qualitative overview of large datasets
- Can reveal patterns, trends, structures, irregularities, and relationships in the data
- Helps identify the important regions of the data and suitable parameters for subsequent quantitative analyses
- In some cases, can provide visual proof of the (knowledge) representations obtained

Symmetric vs. skewed data

The mean, the median, and the mode for:
- Symmetric data
- Skewed data

In a symmetric unimodal distribution the three values coincide; in a skewed distribution they differ.

(Han, Kamber - Data Mining: Concepts and Techniques)

Histograms

- A histogram is a very widely used graphical representation
- It displays occurrence statistics (counts/frequencies) for a given attribute

(Han, Kamber - Data Mining: Concepts and Techniques)
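A histogram can be approximated in text by counting how many values fall into each bucket. This is a minimal sketch: the Price values come from the binning example in these slides, and the bucket width of 10 is an assumption chosen only for illustration.

```python
from collections import Counter

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
BUCKET_WIDTH = 10  # assumed bucket size, for illustration only

# Map each value to the lower edge of its bucket, then count per bucket.
counts = Counter((p // BUCKET_WIDTH) * BUCKET_WIDTH for p in prices)

# Print one text bar per bucket, in increasing order of the bucket edge.
for low in sorted(counts):
    high = low + BUCKET_WIDTH - 1
    print(f"{low:>2}-{high:<2} | {'#' * counts[low]} ({counts[low]})")
```

Most of the values fall into the 20-29 bucket, which is easy to see from the bar lengths but not from the raw list.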

Scatter plots

- Show the 2-dimensional relationship (between 2 attributes) of the data
- Allow visual inspection of groups of points, outliers, ...
- Each pair of values of the 2 attributes is treated as the 2 coordinates of a point drawn on the plane

(Han, Kamber - Data Mining: Concepts and Techniques)

Data preprocessing: main tasks

Data cleaning
- Fill in missing attribute values, correct noisy or erroneous data, identify or remove outliers, resolve inconsistencies

Data integration
- Integrate multiple databases, data cubes, or data files

Data transformation
- Normalize and aggregate data

Data reduction
- Reduce the representation (the attributes) and the size of the data while still obtaining equivalent (or approximately equivalent) data mining results

Data discretization
- An operation within data reduction
- Used for data with numeric attributes

Data cleaning (1)

What is wrong with the data? Data collected in practice may be noisy, erroneous, incomplete, or inconsistent.

Incomplete: attribute values are missing, or some attributes are missing altogether
- Example: salary = <undefined>

Noise/error: the data contains errors or abnormal instances
- Example: salary = -525 (the value of this attribute cannot be negative)

Inconsistent: the data contains contradictions (lack of uniformity)
- Example: salary = "abc" (does not match the numeric data type of the salary attribute)
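The three problem types can be told apart with simple checks. The sketch below is illustrative only: the function name `check_salary`, the use of `None` for a missing value, and the rule that a salary must be a non-negative number are assumptions based on the salary examples above.

```python
def check_salary(value):
    """Classify a raw salary value by the kind of data problem it shows."""
    if value is None:
        return "incomplete"      # the value is missing
    if not isinstance(value, (int, float)):
        return "inconsistent"    # wrong data type, e.g. the string "abc"
    if value < 0:
        return "noise/error"     # a salary cannot be negative
    return "ok"

# The three examples from the slide, plus one clean (hypothetical) value.
for raw in (None, -525, "abc", 3000):
    print(repr(raw), "->", check_salary(raw))
```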

Data cleaning (2)

Where does dirty data come from?

Incomplete
- The attribute value was not available at the time of collection
- Problems caused by hardware, software, or the person collecting the data

Noise/error
- Introduced during data collection
- Introduced during data entry
- Introduced during data transmission

Inconsistent
- The data was collected from several different sources
- Constraints (conditions) on the attributes are violated

Data cleaning (3)

Why does data need to be cleaned?
- If the data is not clean (contains errors or noise, is incomplete or inconsistent), the data mining results are affected and become unreliable
- Inaccurate (unreliable) data mining results, i.e. the discovered knowledge, lead to inaccurate and suboptimal decisions
- Example: data that contains errors or missing attribute values can lead to wrong statistical results

Missing attribute values

For some attributes, the value is absent in some records
- Example: the value of the attribute Income is absent (was not recorded) for some records

Attribute values may be missing because of:
- Faults in hardware devices
- Incompatibility with previously recorded data, which caused the (new) value to be deleted
- Data that was never entered (a data-entry error)

Missing attribute values need to be filled in (by some inference mechanism) to preserve the accuracy of the data mining results

Missing attribute values: solutions

Ignore the records that have missing attribute values
- Commonly applied in classification problems
- Not effective when the percentage of missing values differs (greatly) from one attribute to another

Manual filling: people are assigned to check and fill in the missing attribute values; tedious and costly

Automatic filling by computer
- A default (constant) value
- The mean value of the attribute
- The mean value of the attribute, computed over all examples (records) that belong to the same class as the record in question
- The most probable value, determined by a probabilistic method (e.g. Bayes' formula)
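Two of the automatic strategies above, the attribute mean and the class-conditional mean, can be sketched as follows. The records, the class labels, and the function names are hypothetical; `None` stands for a missing value, and the sketch assumes every class has at least one known value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [
    ("yes", 95), ("yes", 85), ("yes", None),
    ("no", 125), ("no", 100), ("no", None), ("no", 60),
]

def fill_with_mean(rows):
    """Replace each missing income with the mean of all known incomes."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing income with the mean income of the same class."""
    by_class = defaultdict(list)
    for c, v in rows:
        if v is not None:
            by_class[c].append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

print(fill_with_mean(records))
print(fill_with_class_mean(records))
```

The class-conditional variant fills the two gaps with different values (one per class), whereas the plain mean uses the same value everywhere.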

Noisy data

Noise: a random error in the value of an attribute

Erroneous (noisy) attribute values may be caused by:
- Faults in data collection devices
- Errors during data entry
- Errors during data transmission
- Inconsistencies in naming conventions (attributes/variables)

Noisy data: solutions

Binning
- Sort the data and partition it into bins with equal frequency
- Each bin can then be represented by the mean, the median, or the boundaries of the values in that bin

Regression
- Fit the data to a regression function

Clustering
- Detect and remove outliers (after the clusters have been identified)

Combined computer and human inspection
- The computer automatically detects suspicious values (possible noise or errors)
- These suspicious values are then checked by a person

Binning

Equal-width partitioning
- Divide the value range into N intervals of equal size (width)
- If min_i and max_i are the smallest and largest values of the attribute, the size (width) of each interval is (max_i - min_i)/N
- Not suitable for skewed data or data containing outliers, because an interval may end up holding only one (or a few) outliers

Equal-depth (equal-frequency) partitioning
- Divide the value range into N intervals (not necessarily of equal width) such that each interval contains approximately the same number of examples
- More effective than equal-width partitioning
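The difference between the two partitioning schemes shows up clearly on skewed data. The following is a minimal sketch, not a reference implementation: the function names and the sample data (with one outlier, 200) are hypothetical, and the equal-width function assumes that the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum would fall at index n_bins, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins groups holding roughly the same count."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin i covers the positions from i*n // n_bins up to (i+1)*n // n_bins.
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

# Hypothetical skewed data: one outlier (200) far away from the rest.
data = [4, 8, 9, 15, 21, 21, 24, 25, 200]
print(equal_width_bins(data, 3))
print(equal_depth_bins(data, 3))
```

With equal width, the outlier stretches the range so that one bin holds almost everything, one is empty, and one holds only the outlier. With equal depth, each bin holds three values.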

Binning: example

Sorted values of the attribute Price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-depth (equal-frequency) bins
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Represent each bin by its mean value (rounded to the nearest integer)
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
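The Price example can be reproduced in a few lines. This sketch assumes that the number of values divides evenly by the number of bins, and it rounds the bin means to integers to match the slide. The helper `smooth_by_boundaries` illustrates the alternative of representing values by the bin boundaries; sending ties to the lower boundary is an assumption.

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
N_BINS = 3

ordered = sorted(prices)
depth = len(ordered) // N_BINS  # 4 values per bin; assumes an even split
bins = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

# Smoothing by bin means: every value is replaced by the rounded mean of its bin.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(b):
    """Replace each value by the closer of the bin's min and max (ties go low)."""
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [smooth_by_boundaries(b) for b in bins]

print(bins)
print(by_means)
print(by_boundaries)
```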

Regression

[Figure: regression plot with axes x and y, the line y = x + 1, and a point marked X1, Y1]

(Han, Kamber - Data Mining: Concepts and Techniques)
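Smoothing by regression fits a function to the data and replaces each observed value with the fitted one. The sketch below uses ordinary least squares for a straight line; the function name `fit_line` and the observations (scattered around y = x + 1) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations scattered around the line y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
# Smoothing: replace each observed y by the value predicted by the fitted line.
smoothed = [slope * x + intercept for x in xs]

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print([round(v, 2) for v in smoothed])
```

For these observations the fitted line comes out close to y = x + 1, so the smoothed values remove most of the scatter.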

Cluster analysis

[Figure: cluster analysis]

(Han, Kamber - Data Mining: Concepts and Techniques)

Data integration

Data integration: combine data from multiple sources into one unified data store

Schema integration
- Integrate metadata from different sources
- Example: A.cust-id corresponds to B.customID

Entity identification problem (to avoid data redundancy)
- Real-world entities must be identified across multiple data sources
- Example: Bill Clinton = B. Clinton

Detecting and resolving data value conflicts
- For the same real-world entity, attribute values from different sources differ
- Possible reasons: different representations, or different scales (e.g. metric vs. British units)

Data integration: handling redundant data

Redundant data occurs frequently when data from multiple sources (e.g. multiple databases) is integrated
- Object identification: the same attribute (or the same object) may carry different names in different databases
- Derivable data: an attribute in one table may be a derived attribute in another table. Example: Annual Revenue and Monthly Revenue

Redundant attributes can be detected by correlation analysis: Pearson, cosine, chi-square

General requirement for data integration: minimize (ideally avoid) redundancies and inconsistencies
- This speeds up the data mining process and improves the quality of the results (knowledge) obtained
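Correlation analysis for redundancy can be sketched with the Pearson coefficient. All figures below are hypothetical: `annual` is built as exactly 12 times `monthly` to mimic a derived attribute, and `unrelated` stands for some other attribute.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical figures: annual revenue is derived from monthly revenue.
monthly = [10, 12, 9, 15, 11]
annual = [12 * m for m in monthly]
unrelated = [3, 7, 1, 4, 9]  # hypothetical attribute with no built-in link

r_redundant = pearson(monthly, annual)
r_other = pearson(monthly, unrelated)

print(f"monthly vs annual:    {r_redundant:.3f}")
print(f"monthly vs unrelated: {r_other:.3f}")
```

A coefficient near 1 (or -1) suggests that one of the two attributes may be redundant. The cut-off used to decide is a judgment call and is not specified in the slides.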

Data transformation (1)

Data transformation
- Map the entire set of values of an attribute to a new set of replacement values, such that each old value corresponds to one of the new values

Data transformation methods
- Smoothing: remove noise and errors from the data
- Aggregation: summarize the data, build data cubes
- Generalization: build concept hierarchies
- Normalization: bring the values into a specified range
  - Min-max normalization
  - Z-score normalization
  - Normalization by decimal scaling
- Attribute construction: build new attributes from the original ones

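Data transformation (2)

Min-max normalization: maps the values into the range [new_min_i, new_max_i]

$$v^{new} = \frac{v^{old} - \text{min}_i}{\text{max}_i - \text{min}_i}\,(\text{new\_max}_i - \text{new\_min}_i) + \text{new\_min}_i$$

Z-score normalization, where $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of attribute i

$$v^{new} = \frac{v^{old} - \mu_i}{\sigma_i}$$

Normalization by decimal scaling

$$v^{new} = \frac{v^{old}}{10^{j}}$$

where j is the smallest integer such that $\max(\{|v^{new}|\}) < 1$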
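The three normalization formulas translate almost directly into code. This is a sketch under stated assumptions: the function names and sample values are hypothetical, the z-score uses the population standard deviation, min-max assumes the values are not all equal, and decimal scaling only searches non-negative j.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 giving max |v| / 10**j < 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(min_max([200, 300, 400, 600, 1000]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 200, 917]))
```

For the last call the largest absolute value is 986, so j = 3 and every value is divided by 1000.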

Data reduction

Why reduce the data?
- A large data store (dataset) may hold up to terabytes of data
- Running the data mining process on the entire dataset may therefore take a very long time

Data reduction
- Obtain a reduced (compact) representation that still produces the same (or approximately the same) analysis results as the original dataset

Data reduction strategies
- Dimensionality reduction: remove unimportant (or less important) attributes
- Data/numerosity reduction
- Data cube aggregation
- Data compression
- Regression
- Discretization

Dimensionality reduction

The negative effects of a large number of dimensions (attributes)
- As the number of dimensions grows, the data becomes sparser
- The density of points and the distances between them (which matter for clustering and outlier detection) become less meaningful

Dimensionality reduction
- Avoids (reduces) the negative effects of a large number of dimensions
- Helps remove irrelevant attributes and reduce noise and errors
- Reduces the time and memory required by the data mining process
- Makes the data easier to visualize, and more effectively so

Dimensionality reduction techniques
- Principal component analysis
- Feature subset selection

Principal component analysis (1)

Principal component analysis (PCA)
- Find a projection onto a new attribute space that preserves as much as possible of the variation in the original dataset
- Find the eigenvectors of the covariance matrix; these eigenvectors define the new attribute space

[Figure: data points plotted against the axes x1 and x2]

(Han, Kamber - Data Mining: Concepts and Techniques)

Principal component analysis (2)

- Each example (record) is represented by n dimensions (attributes)
- Goal: find k (k ≤ n) orthogonal vectors (the principal components) that best represent the original dataset

1) Normalize the input data: the values of all attributes are brought into the same range
2) Compute k orthogonal vectors (these are the principal components)
3) Each input data vector is a linear combination of these k principal component vectors
4) The principal components are sorted in decreasing order of importance
5) The size of the data is reduced by discarding the components (vectors) of low importance; these vectors correspond to low variance
6) Using the most important vectors makes it possible to approximate the original dataset

PCA applies only to numeric data
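For two attributes, the eigenvectors of the covariance matrix can be written in closed form, which gives a dependency-free sketch of the steps above. The function name `pca_2d` and the data points are hypothetical. As a simplification, the normalization step is reduced to mean-centering, and the population covariance is used.

```python
from math import hypot, sqrt

def pca_2d(points):
    """Return the centered points and (variance, unit axis) pairs, largest first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the symmetric covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n

    if b == 0:
        # Already uncorrelated: the coordinate axes are the eigenvectors.
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
    else:
        half_trace = (a + c) / 2
        root = sqrt(((a - c) / 2) ** 2 + b * b)
        pairs = []
        for lam in (half_trace + root, half_trace - root):
            vx, vy = b, lam - a  # eigenvector belonging to eigenvalue lam
            norm = hypot(vx, vy)
            pairs.append((lam, (vx / norm, vy / norm)))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return centered, pairs

# Hypothetical points lying close to a straight line.
points = [(2.0, 1.9), (1.0, 1.2), (3.0, 3.1), (4.0, 3.8)]
centered, components = pca_2d(points)
(var1, axis1), (var2, axis2) = components

# Keep k = 1 component: project every centered point onto the first axis.
reduced = [x * axis1[0] + y * axis1[1] for x, y in centered]
print(var1, var2)
print(reduced)
```

Because the points lie almost on a line, nearly all of the variance falls on the first component, so dropping the second one loses little information. For more than two attributes a numerical eigenvalue routine would be needed instead of the closed form.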

Feature subset selection

With d original attributes, there are up to 2^d possible ways to choose a subset of attributes

Methods commonly used for feature subset selection
- Select attributes individually (assuming that the attributes are independent of one another), according to one (or several) evaluation criteria
- Step-wise feature selection
  - The best attribute is selected first
  - Then the next best attribute, given the attributes already selected, is chosen, and so on
- Step-wise feature elimination
  - Repeatedly remove the worst attribute
- Combine both strategies: selection and elimination of attributes
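Step-wise selection avoids scoring all 2^d subsets by growing the subset greedily. In the sketch below everything is hypothetical: the attribute names, the usefulness figures, the redundancy penalty, and the scoring function `toy_score`. In practice the score would come from an evaluation criterion such as classifier accuracy.

```python
def forward_selection(features, score, k):
    """Greedy step-wise selection: repeatedly add the feature that scores best."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical usefulness of each attribute when considered on its own.
USEFULNESS = {"monthly_revenue": 0.9, "annual_revenue": 0.88, "age": 0.5, "name": 0.05}
# Hypothetical penalty: these two attributes carry nearly the same information.
REDUNDANT = {frozenset({"monthly_revenue", "annual_revenue"}): 0.85}

def toy_score(subset):
    """Sum of individual usefulness minus penalties for redundant pairs."""
    total = sum(USEFULNESS[f] for f in subset)
    for pair, penalty in REDUNDANT.items():
        if pair <= set(subset):
            total -= penalty
    return total

chosen = forward_selection(list(USEFULNESS), toy_score, k=2)
print(chosen)
```

Selecting attributes individually would pick the two revenue attributes, which are the two highest-ranked on their own. The step-wise search instead picks `age` as the second attribute, because it adds more given what has already been selected.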

Data cube aggregation

Lowest level of a data cube (base cuboid)
- The data aggregated for an individual entity of interest
- Example: a customer in a data warehouse of purchases

Multiple levels of aggregation in data cubes
- Further reduce the size of the data to be processed

Appropriate aggregation levels
- Use the smallest representation that is sufficient to answer the query at hand

Queries about aggregated information should be answered using data cubes
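Aggregation levels can be illustrated by rolling up a small fact table. The sales rows and the helper `roll_up` below are hypothetical; each coarser level answers fewer kinds of queries but needs fewer rows.

```python
from collections import defaultdict

# Hypothetical facts at the lowest level: (customer, year, quarter, amount).
sales = [
    ("alice", 2010, "Q1", 120), ("alice", 2010, "Q2", 80),
    ("alice", 2011, "Q1", 200), ("bob", 2010, "Q1", 50),
    ("bob", 2010, "Q4", 70), ("bob", 2011, "Q3", 30),
]

def roll_up(rows, key):
    """Sum the amount over all rows that share the same key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Two aggregation levels: per customer and year, then per year only.
by_customer_year = roll_up(sales, key=lambda r: (r[0], r[1]))
by_year = roll_up(sales, key=lambda r: r[1])

print(by_customer_year)
print(by_year)
```

A query for total sales per year needs only the two rows of `by_year`, so it can be answered without touching the six detailed rows.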

Data sampling

- Data sampling is an important method for selecting data
- Sampling is necessary because collecting and processing an entire large dataset is costly and time-consuming

Key principles of data sampling
- Using a sample works almost as well as using the entire dataset, provided that the sample is representative of the dataset
- A sample is representative of a dataset if it has (approximately) the same properties as the dataset

Data sampling methods

Simple random sampling
- Every example (record) is selected with the same probability

Sampling without replacement
- Once an example (record) has been sampled, it is removed from the original dataset (and cannot be chosen again)

Sampling with replacement
- Once an example (record) has been sampled, it is not removed from the original dataset (and may be chosen more than once)

Stratified sampling
- Partition the dataset into parts (partitions)
- Draw examples at random from each partition
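The sampling methods above map onto Python's standard `random` module. This is a sketch under assumptions: the record ids, the class labels, the helper `stratified_sample`, and the rule of drawing the same fraction from every stratum (at least one record each) are all hypothetical choices, and the seed is fixed only to make runs repeatable.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed, only so that the run is repeatable
records = list(range(1, 11))

without_replacement = rng.sample(records, k=4)  # no record can appear twice
with_replacement = rng.choices(records, k=4)    # a record may appear again

def stratified_sample(rows, stratum_of, fraction, rng):
    """Draw the same fraction of rows, at random, from every stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical strata: (record id, class label) with 8 "no" and 2 "yes" records.
labelled = [(i, "yes" if i in (5, 8) else "no") for i in records]
picked = stratified_sample(labelled, lambda r: r[1], 0.5, rng)

print(without_replacement)
print(with_replacement)
print(picked)
```

Stratified sampling guarantees that the rare "yes" class is present in the sample, which simple random sampling of 5 records out of 10 does not.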
