
Data Mining

Nguyễn Nhật Quang

quangnn-fit@mail.hut.edu.vn
Hanoi University of Science and Technology
School of Information and Communication Technology
Academic year 2011-2012

Course contents:

- Introduction to data mining
- Introduction to the WEKA tool
- Data preprocessing
- Association rule discovery
- Classification and prediction techniques
- Clustering techniques


Data sets

- A data set is a collection of objects and their attributes.
- Each attribute describes one characteristic of an object.
  - E.g., the attributes Refund, Marital Status, Taxable Income, and Cheat.
- A set of attribute values describes an object.
- An object is also referred to by other names: record, data point, case, sample, entity, or instance.

Example data set (each row is an object, each column is an attribute):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Tan, Steinbach, Kumar - Introduction to Data Mining)

Types of data sets

- Record
  - Records in a relational database
  - Data matrix
  - Document representation
  - Transaction data
- Graph
  - World Wide Web
  - Information networks or social networks
  - Molecular structures
- Ordered
  - Spatial data (e.g., maps)
  - Temporal data (e.g., time-series data)
  - Sequence data (e.g., transaction sequences)
  - Genetic sequence data

Example of transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

(Han, Kamber - Data Mining: Concepts and Techniques)

Types of attribute values

- Nominal: values have no order
  - Takes its value from an unordered set of (nominal) values
  - E.g., attributes such as Name, Profession, ...
- Binary: a special case of the nominal type
  - The value set contains only 2 values (Y/N, 0/1, T/F)
- Ordinal:
  - Takes its value from an ordered set of values
  - Example 1: attributes that take numeric values, such as Age, Height, ...
  - Example 2: the attribute Income takes its value from the set {low, medium, high}


Discrete vs. continuous attributes

- Discrete-valued attributes
  - The set of values is a finite set
  - Includes attributes whose values are integers
  - Includes binary attributes
- Continuous-valued attributes
  - The values are real numbers


Descriptive characteristics of data

- Purpose: to understand the data at hand (central tendency, variation, distribution)
- Data dispersion characteristics
  - Minimum/maximum value (min/max)
  - Most frequent value (mode)
  - Mean
  - Median
  - Variance and standard deviation
  - Outliers
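The measures listed above can be computed with Python's standard `statistics` module. The sketch below is only an illustration: it reuses the Taxable Income and Marital Status columns of the example table from the earlier slide, and the outlier rule (more than 2 standard deviations from the mean) is an assumed rule of thumb, not something the slides prescribe.

```python
import statistics

# Taxable Income (in thousands) and Marital Status from the example table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
marital_status = ["Single", "Married", "Single", "Married", "Divorced",
                  "Married", "Divorced", "Single", "Married", "Single"]

summary = {
    "min": min(incomes),
    "max": max(incomes),
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "variance": statistics.pvariance(incomes),  # population variance
    "stdev": statistics.pstdev(incomes),        # population standard deviation
}

# The mode also works for nominal attributes; multimode returns every
# most frequent value, in order of first appearance.
modes = statistics.multimode(marital_status)

# Assumed rule of thumb: flag values more than 2 standard deviations from the mean.
outliers = [x for x in incomes
            if abs(x - summary["mean"]) > 2 * summary["stdev"]]

print(summary)
print("modes:", modes)
print("outliers:", outliers)
```

Here the mean (104) sits well above the median (92.5) because the single large value 220 pulls it upward, which is also the value the outlier rule flags.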


Data visualization

- Representing data with visualization methods helps to understand the characteristics of the data
- Provides a qualitative overview of large data sets
- Can reveal patterns, trends, structures, anomalies, and relationships in the data
- Helps identify important regions of the data and suitable parameters for subsequent quantitative analysis
- In some cases, can provide visual proof of the (knowledge) representations obtained


Symmetric vs. skewed data

- Mean, median, and mode for:
  - Symmetric data: the three values coincide
  - Skewed data: the three values separate, with the mean pulled furthest toward the long tail

(Han, Kamber - Data Mining: Concepts and Techniques)

Histograms

- A histogram is a very widely used graph-based representation
- It displays summary statistics of occurrences (counts/frequencies) for a given attribute

(Han, Kamber - Data Mining: Concepts and Techniques)
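The counts that a histogram displays can be sketched in a few lines of Python. This is a minimal text-only illustration, assuming the Taxable Income values of the earlier example table and a hypothetical bucket width of 50; a plotting library would normally draw the bars.

```python
from collections import Counter

# Taxable Income (in thousands) from the example table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
BUCKET_WIDTH = 50  # hypothetical bucket size chosen for illustration


def histogram(values, width):
    """Count how many values fall into each bucket [start, start + width)."""
    return Counter((v // width) * width for v in values)


counts = histogram(incomes, BUCKET_WIDTH)
for start in sorted(counts):
    bar = "#" * counts[start]
    print(f"{start:>3}-{start + BUCKET_WIDTH - 1:<3} {bar} ({counts[start]})")
```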


Scatter plots

- Display the 2-dimensional relationship (between 2 attributes) of the data
- Allow visual inspection of groups of points, outliers, ...
- Each pair of values of the 2 attributes is treated as the 2 coordinates of a point displayed on the plane

(Han, Kamber - Data Mining: Concepts and Techniques)

Data preprocessing: Main tasks

- Data cleaning
  - Fill in missing attribute values, correct noisy/erroneous data, identify or remove outliers, resolve data inconsistencies
- Data integration
  - Integrate multiple databases, multiple data cubes, or multiple data files
- Data transformation
  - Normalize and aggregate data
- Data reduction
  - Reduce the representation (attributes) of the data and reduce its size, while still obtaining the same (or approximately the same) data mining results
- Data discretization
  - A part of data reduction
  - Used for data with numeric attributes

Data cleaning (1)

- What problems do data have?
  - Real-world data may contain noise and errors, be incomplete, and contain inconsistencies
- Incomplete: attribute values are missing, or some attributes are missing altogether
  - E.g., salary = <undefined>
- Noise/error: the data contain errors or abnormal instances
  - E.g., salary = -525 (the value of this attribute cannot be negative)
- Inconsistent: the data contain discrepancies
  - E.g., salary = abc (does not match the numeric type of the attribute salary)

Data cleaning (2)

- Where do unclean data come from?
- Incomplete
  - The attribute value was not available at the time of collection
  - Problems caused by hardware, software, or the people collecting the data
- Noise/error
  - Caused by data collection
  - Caused by data entry
  - Caused by data transmission
- Inconsistent
  - The data were collected from different sources
  - Constraints (conditions) on the attributes are violated

Data cleaning (3)

- Why must data be cleaned?
- If the data are not clean (they contain errors or noise, are incomplete, or are inconsistent), the data mining results are affected and become unreliable
- Inaccurate (unreliable) data mining results (the discovered knowledge) lead to inaccurate, suboptimal decisions
  - E.g., data that contain errors or missing attribute values can lead to misleading statistics


Missing attribute values

- For some attributes, the values are missing in some records
  - E.g., the value of the attribute Income is missing (was not recorded) for some records
- Attribute values may be missing because of:
  - Faulty hardware
  - Incompatibility with previously recorded data, so that the (new) value was deleted
  - Data that were never entered (a data-entry error)
- Missing attribute values need to be filled in (by an inference mechanism) to ensure the accuracy of the data mining results


Missing attribute values: Solutions

- Ignore the records that have missing attribute values
  - Often applied in classification problems
  - Not effective when the percentage of missing values differs (greatly) across attributes
- Have people check and fill in the missing attribute values (manual filling): tedious and costly
- Fill in values automatically by computer
  - A default (constant) value
  - The mean value of the attribute
  - The mean value of the attribute, computed over all examples (records) that belong to the same class as the record
  - The most probable value, based on a probabilistic method (e.g., Bayes' formula)
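Two of the automatic strategies above, filling with the attribute mean and filling with the mean within the same class, can be sketched as follows. The records, the attribute names `income` and `cheat`, and the helper names are hypothetical; `None` stands for a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing Income value.
records = [
    {"income": 125, "cheat": "No"},
    {"income": None, "cheat": "No"},
    {"income": 70, "cheat": "No"},
    {"income": 95, "cheat": "Yes"},
    {"income": None, "cheat": "Yes"},
    {"income": 85, "cheat": "Yes"},
]


def fill_with_mean(rows, attr):
    """Replace missing values of attr by the mean of the known values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr by the mean within the record's class.

    Assumes every class has at least one record with a known value.
    """
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[label], []).append(r[attr])
    class_mean = {c: mean(values) for c, values in known.items()}
    return [{**r, attr: class_mean[r[label]] if r[attr] is None else r[attr]}
            for r in rows]


global_filled = fill_with_mean(records, "income")
class_filled = fill_with_class_mean(records, "income", "cheat")
print(global_filled[1]["income"], class_filled[1]["income"], class_filled[4]["income"])
```

The class-conditional mean gives different fill values for the two classes (97.5 and 90), whereas the global mean assigns 93.75 to both missing records.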

Noisy data

- Noise: a random error in the value of an attribute
- Erroneous (noisy) attribute values may be caused by:
  - Faulty data collection devices
  - Data entry errors
  - Errors during data transmission
  - Inconsistency in naming conventions (for attributes/variables)


Noisy data: Solutions

- Binning
  - Sort the data and partition them into bins of equal frequency
  - Each bin can then be represented by the mean, the median, or the boundaries of the values in the bin
- Regression
  - Fit the data to a regression function
- Clustering
  - Detect and remove outliers (after the clusters have been identified)
- Combined computer and human inspection
  - The computer automatically detects suspicious values (possible noise/errors)
  - These suspicious values are then checked by a human

Binning

- Equal-width (distance) partitioning
  - Divides the range of values into N intervals of equal size (width)
  - If min_i and max_i are the smallest and largest values of the attribute, then the size (width) of each interval = (max_i - min_i)/N
  - Not suitable for skewed data or data containing outliers, because an interval may end up holding only one (or a few) outliers
- Equal-depth (frequency) partitioning
  - Divides the range of values into N intervals (not necessarily of equal width), such that each interval contains approximately the same number of examples
  - More effective than equal-width partitioning
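The weakness of equal-width partitioning on data with outliers can be seen in a short sketch. The price list below is hypothetical (it includes one deliberate outlier, 200), and the function name is an assumption made for illustration.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything goes into the first bin
        else:
            # The maximum value would index one past the end; clamp it.
            idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


# Hypothetical prices with a single outlier.
prices_with_outlier = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 200]

bins = equal_width_bins(prices_with_outlier, 3)
for i, b in enumerate(bins, start=1):
    print(f"Bin {i}: {b}")
```

With this input, the first bin receives eleven values, the second is empty, and the third holds only the outlier, which is the situation the slide warns about.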

Binning: Example

- Sorted values of the attribute Price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-depth (equal-frequency) bins
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Represent each bin by its mean value (the means 9, 22.75, and 29.25, rounded to integers)
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
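The example above can be reproduced with a short Python sketch. The function names are assumptions; smoothing by bin boundaries (one of the representation options for binning) is included as well, with ties resolved toward the lower boundary as an arbitrary choice.

```python
from statistics import mean

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equal_depth_bins(values, n_bins):
    """Sort the values and cut them into n_bins slices of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[k * n // n_bins:(k + 1) * n // n_bins] for k in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value by the rounded mean of its bin."""
    return [[round(mean(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)
print(by_means)
print(by_boundaries)
```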

Regression

[Figure: data points in the x-y plane smoothed by the regression line y = x + 1; the labels X1 and Y1 mark a sample point.]

(Han, Kamber - Data Mining: Concepts and Techniques)
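Smoothing by regression can be sketched with an ordinary least-squares line fit. The observations below are hypothetical noisy readings scattered around the line y = x + 1 from the figure; `fit_line` is an assumed helper name, and simple linear regression is only one possible choice of regression function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical noisy observations around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
```

The fitted line comes out close to y = x + 1, and each noisy observation is replaced by the corresponding point on that line.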


Cluster analysis

(Han, Kamber - Data Mining: Concepts and Techniques)


Data integration

- Data integration
  - Combines data from multiple sources into a single coherent data store
- Schema integration
  - Integrates metadata from different sources
  - E.g., A.cust-id and B.customID refer to the same attribute
- Entity identification problem (to avoid data redundancy)
  - Real-world entities must be identified across multiple data sources
  - E.g., Bill Clinton and B. Clinton are the same entity
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources differ. Possible reasons:
    - Different representations
    - Different scales, e.g., metric units vs. British units

Data integration: Handling redundant data

- Redundant data occur frequently when data are integrated from multiple sources (e.g., from multiple databases)
  - Object identification: the same attribute (or the same object) may carry different names in different databases
  - Derivable data: an attribute in one table may be a derived attribute in another table
    - E.g., Annual Revenue and Monthly Revenue
- Redundant attributes can be detected by correlation analysis: Pearson, cosine, chi-square
- General requirement for data integration: minimize (ideally avoid) redundancies and inconsistencies
  - This helps to speed up the data mining process and to improve the quality of the results (knowledge) obtained
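Correlation analysis with the Pearson coefficient can be sketched as follows. The revenue figures and the `store_size` attribute are hypothetical, and the 0.95 cut-off for calling two attributes redundant is an assumption made for this illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical attributes; annual revenue is derived from monthly revenue.
monthly_revenue = [10.0, 12.5, 9.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]
store_size = [300, 120, 450, 200, 310]

REDUNDANCY_THRESHOLD = 0.95  # assumed cut-off, not taken from the slides

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, store_size)

print(f"monthly vs annual: {r_derived:.3f}, redundant: {abs(r_derived) > REDUNDANCY_THRESHOLD}")
print(f"monthly vs size:   {r_other:.3f}, redundant: {abs(r_other) > REDUNDANCY_THRESHOLD}")
```

A derived attribute such as annual revenue correlates perfectly with its source, so one of the two can be dropped without losing information.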

Data transformation (1)

- Data transformation
  - Maps the entire set of values of an attribute to a new set of replacement values, such that each old value corresponds to one of the new values
- Data transformation methods
  - Smoothing: removes noise/errors from the data
  - Aggregation: summarizes the data, builds data cubes
  - Generalization: builds concept hierarchies
  - Normalization: scales the values into a specified range
    - Min-max normalization
    - Z-score normalization
    - Normalization by decimal scaling
  - Attribute construction: builds (creates) new attributes from the original attributes

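Data transformation (2)

- Min-max normalization to the range [new_min_i, new_max_i]:

\[ v^{new} = \frac{v^{old} - min_i}{max_i - min_i}\,(new\_max_i - new\_min_i) + new\_min_i \]

- Z-score normalization, where \mu_i and \sigma_i are the mean and the standard deviation of attribute i:

\[ v^{new} = \frac{v^{old} - \mu_i}{\sigma_i} \]

- Normalization by decimal scaling:

\[ v^{new} = \frac{v^{old}}{10^{j}} \]

  where j is the smallest integer such that \max(|v^{new}|) < 1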
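The three normalization formulas can be written directly in Python. This is a minimal sketch: the function names and the sample values are assumptions, the z-score uses the population standard deviation, and the code assumes the attribute is not constant (otherwise the divisions by max - min or by the standard deviation fail).

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Z-score normalization using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max(|v_new|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Hypothetical attribute values.
values = [200, 300, 400, 600, 1000]

scaled = min_max(values)
standardized = z_score(values)
decimal_scaled, j = decimal_scaling([-986, 120, 917])

print(scaled)
print([round(z, 3) for z in standardized])
print(decimal_scaled, j)
```

After z-score normalization the values have mean 0 and standard deviation 1; for the range -986 to 917, decimal scaling divides by 10^3.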



Data reduction

- Why reduce data?
  - A large data store (data set) may hold terabytes of data
  - The data mining process may therefore run for a very long time on the entire data set
- Data reduction
  - Obtains a reduced (compact) representation that still produces the same (or approximately the same) analysis (mining) results as the original data set
- Data reduction strategies
  - Dimensionality reduction: removes unimportant (less important) attributes
  - Data/numerosity reduction
    - Data cube aggregation
    - Data compression
    - Regression
    - Discretization

Dimensionality reduction

- Negative effects of a large number of dimensions (attributes)
  - As the number of dimensions grows, the data become sparser
  - Density and distance between points (important for clustering and outlier detection) become less meaningful
- Dimensionality reduction
  - Avoids (reduces) the negative effects of high dimensionality
  - Helps remove irrelevant attributes and reduce noise/errors
  - Helps reduce the time and memory needed by the data mining process
  - Allows the data to be visualized more easily and more effectively
- Dimensionality reduction techniques
  - Principal component analysis
  - Feature subset selection

Principal component analysis (1)

- Principal component analysis (PCA)
  - Finds a projection onto a new attribute space that preserves as much as possible of the variation in the original data set
  - Find the eigenvectors of the covariance matrix; these eigenvectors define the new attribute space

[Figure: data points plotted on the original axes x1 and x2.]

(Han, Kamber - Data Mining: Concepts and Techniques)


Principal component analysis (2)

- Each example (record) is represented by n dimensions (attributes)
- Goal: find k (k ≤ n) orthogonal vectors (the principal components) that best represent the original data set
  1) Normalize the input data: the values of all attributes are brought into the same range
  2) Compute k orthogonal vectors (these are the principal components)
  3) Each input data vector is a linear combination of these k principal component vectors
  4) The principal components are sorted in decreasing order of importance
  5) The size of the data is reduced by removing the components (vectors) of low importance; these vectors correspond to low variance
  6) Using the most important vectors allows an approximate representation of the original data set
- PCA can only be applied to numeric data
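The steps above can be sketched in pure Python for the first principal component. This is a simplified illustration under stated assumptions: the 2-dimensional data set is hypothetical, step 1 is reduced to mean-centering (the two attributes are assumed to share a comparable scale already), and the dominant eigenvector of the covariance matrix is found by power iteration rather than a full eigendecomposition.

```python
from math import sqrt


def center(rows):
    """Subtract the per-attribute mean from every record."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of mean-centered records."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=200):
    """Power iteration: converges to the eigenvector of the largest eigenvalue."""
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical 2-dimensional numeric data set.
data = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

centered = center(data)
cov = covariance_matrix(centered)
pc1 = top_eigenvector(cov)

# Reduce from 2 dimensions to 1 by projecting each record onto pc1.
projected = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
explained_variance = sum(p * p for p in projected) / (len(projected) - 1)

print([round(x, 4) for x in pc1], round(explained_variance, 4))
```

Keeping only the projection onto `pc1` retains most of the variance of this data set, which is the sense in which the reduced data approximate the original.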



Feature subset selection

- With d original attributes, there can be up to 2^d possible choices of an attribute subset
- Methods commonly applied for feature subset selection
  - Selecting attributes individually (assuming that the attributes are independent of one another)
    - According to one (or several) evaluation criteria
  - Step-wise feature selection
    - The best attribute is chosen first
    - The next best attribute is then chosen, given the attributes already chosen
  - Step-wise feature elimination
    - Repeatedly removes the worst attributes
  - Combining both strategies: attribute selection and attribute elimination
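Step-wise (forward) feature selection can be sketched as a greedy loop. Everything specific below is hypothetical: the attribute names, the relevance numbers, and the scoring function, which stands in for whatever evaluation criterion is actually used (for instance, the accuracy of a classifier trained on the candidate subset).

```python
def forward_selection(features, score, k):
    """Greedily add the attribute that most improves score(subset), up to k."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-attribute relevance and one pair of redundant attributes.
RELEVANCE = {"income": 0.6, "annual_income": 0.59, "age": 0.3, "zip_code": 0.05}
REDUNDANT_PAIRS = [frozenset({"income", "annual_income"})]


def toy_score(subset):
    """Toy criterion: total relevance minus a penalty per redundant pair."""
    total = sum(RELEVANCE[f] for f in subset)
    penalty = sum(0.5 for pair in REDUNDANT_PAIRS if pair <= set(subset))
    return total - penalty


chosen = forward_selection(list(RELEVANCE), toy_score, k=2)
print(chosen)
```

Because each step evaluates candidates together with the attributes already chosen, the second pick is `age` rather than the individually stronger but redundant `annual_income`. Selecting attributes one by one under an independence assumption would miss this.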

Data cube aggregation

- The lowest level of a data cube (base cuboid)
  - The data aggregated for an individual entity of interest
  - E.g., a customer in a purchase data warehouse
- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to be processed
- Appropriate aggregation levels
  - Use the smallest representation that is sufficient to answer the given request (query)
- Queries on aggregated information should be answered using data cubes
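Aggregating the same records at different levels can be sketched with a dictionary of running totals. The sales records, customer names, and the `aggregate` helper are hypothetical; a real data warehouse would precompute such aggregates inside the data cube.

```python
from collections import defaultdict

# Hypothetical sales records: (customer, year, quarter, amount).
sales = [
    ("alice", 2010, 1, 200), ("alice", 2010, 2, 150), ("alice", 2011, 1, 300),
    ("bob", 2010, 1, 100), ("bob", 2011, 3, 250), ("bob", 2011, 4, 50),
]


def aggregate(rows, key):
    """Sum the amount of all rows that share the same key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Finer level: one total per customer and year.
by_customer_year = aggregate(sales, lambda r: (r[0], r[1]))
# Coarser level: one total per year, enough for a query on yearly sales.
by_year = aggregate(sales, lambda r: r[1])

print(by_customer_year)
print(by_year)
```

A query about yearly sales only needs the two entries of `by_year`, not the six original records.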

Data sampling

- Data sampling is an important method for data selection
- Sampling is necessary because collecting and processing an entire large data set is expensive and time-consuming
- Key principles of data sampling
  - Using a sample works almost as well as using the entire data set, provided the sample is representative of the data set
  - A sample is representative of a data set if it has (approximately) the same properties as the data set


Data sampling methods

- Simple random sampling
  - Every example (record) is selected with the same probability
- Sampling without replacement
  - Once an example (record) has been sampled, it is removed from the original data set (it cannot be selected again)
- Sampling with replacement
  - When an example (record) is sampled, it is not removed from the original data set (it may be selected more than once)
- Stratified sampling
  - Partition the data set into several parts (partitions)
  - Draw examples at random from each partition
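The three sampling methods can be sketched with Python's `random` module. The records, the fixed seed, and the proportional allocation used in the stratified sampler (each partition contributes the same fraction of its members, at least one) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_without_replacement(rows, n, rng):
    """Each record can be drawn at most once."""
    return rng.sample(rows, n)


def sample_with_replacement(rows, n, rng):
    """A record may be drawn more than once."""
    return rng.choices(rows, k=n)


def stratified_sample(rows, key, fraction, rng):
    """Partition rows by key(row), then sample the same fraction of each part."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per partition
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical records (id, class): 20 "Yes" and 80 "No".
records = [(i, "Yes" if i % 5 == 0 else "No") for i in range(100)]
rng = random.Random(42)  # fixed seed so that the run is repeatable

without = sample_without_replacement(records, 10, rng)
with_repl = sample_with_replacement(records, 10, rng)
stratified = stratified_sample(records, key=lambda r: r[1], fraction=0.1, rng=rng)

print(sum(1 for r in stratified if r[1] == "Yes"), "of", len(stratified), "are Yes")
```

Stratified sampling keeps the 20/80 class proportion of the data set in the sample, which simple random sampling only achieves on average.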