L3 - Data Preprocessing (Tien Xu Ly Du Lieu)
Course contents:
Introduction to the WEKA tool
Data preprocessing
Classification and prediction techniques
Data Mining
Datasets
A dataset is a collection of objects and their attributes
Each attribute describes one characteristic of an object
An object is described by a set of attribute values
Attributes (columns)
Objects (rows)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of datasets
Record
Graph
Ordered
Spatial data (e.g., maps)
Temporal data (e.g., time-series data)
Sequence data (e.g., transaction sequences)
Genetic sequence data
TID  Items
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
(Han, Kamber - Data Mining:
Concepts and Techniques)
Ordinal type:
Takes its values from an ordered set of values
Example 1: attributes that take numeric values, such as Age and Height
Example 2: the attribute Income takes its values from the set {low, medium, high}
The set of values is a finite set
Descriptive characteristics of data
Minimum/maximum value (min/max)
Median value (median)
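As a minimal sketch of these descriptive measures, the following Python snippet computes the minimum, maximum, and median of the Taxable Income values (in thousands) from the example dataset in these slides. The variable names are illustrative assumptions, not part of the course material.

```python
import statistics

# Taxable Income values (in thousands) from the example dataset in the slides.
taxable_income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

lowest = min(taxable_income)
highest = max(taxable_income)
# The median is the middle value of the sorted data; with an even number of
# values it is the average of the two middle values.
middle = statistics.median(taxable_income)

print(lowest, highest, middle)
```

For this data the median is 92.5, while the mean is 104: the single large value 220 pulls the mean upward but barely affects the median.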
Can reveal patterns, trends, structures, anomalies, and relationships in the data
Symmetric data
Skewed data
Histograms are very widely used
They display occurrence statistics (counts/frequencies) for a given attribute
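The counts behind a histogram can be sketched in a few lines of Python. The snippet below, a simple illustration rather than a plotting recipe, counts the Marital Status values from the example dataset in these slides and prints one text bar per value. The variable names are assumptions made for the example.

```python
from collections import Counter

# Marital Status values from the example dataset in the slides.
marital_status = ["Single", "Married", "Single", "Married", "Divorced",
                  "Married", "Divorced", "Single", "Married", "Single"]

# Count how often each distinct value occurs.
counts = Counter(marital_status)

# Print one text bar per distinct value, most frequent first.
for value, count in counts.most_common():
    print(f"{value:<10}{'#' * count} ({count})")
```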
Scatter plot
Integrate multiple databases, data cubes, or data files
Fill in missing attribute values, correct noisy/erroneous data, identify or remove outliers, and resolve data inconsistencies
Reduce the representation (the attributes) of the data and reduce the data size, while still obtaining equivalent (or approximately equivalent) data mining results
Data discretization
An operation within data reduction
Applied to data with numeric attributes
Data problems?
Data collected from the real world may contain noise and errors, be incomplete, or be inconsistent
Noise/error: the data contain errors or abnormal instances
Noise/error
Caused by data collection
Caused by data entry
Caused by data transmission
Inconsistency (inconsistent)
Data are collected from many different sources
Constraints (conditions) on the attributes are violated
Attribute values may be missing because:
Missing attribute values need to be filled in (with a ...
Commonly applied in classification problems
Not effective when the percentage of missing values differs (greatly) across attributes
A default (constant) value
The mean value of the attribute
The mean value of the attribute, computed over all examples (records) that belong to the same class as the record
The most probable value, based on a probabilistic method (e.g., Bayes' formula)
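Two of the fill-in strategies listed above, the attribute mean and the attribute mean within the same class, can be sketched as follows. This is a minimal illustration under stated assumptions: records are plain dictionaries, a missing value is represented by `None`, and the sample records, field names, and function names are all hypothetical. The sketch also assumes every class has at least one known value.

```python
from collections import defaultdict

def fill_with_mean(records, attr):
    """Replace missing values of `attr` with the mean of the known values."""
    known = [r[attr] for r in records if r[attr] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{attr: mean if r[attr] is None else r[attr]})
            for r in records]

def fill_with_class_mean(records, attr, class_attr):
    """Replace missing values of `attr` with the mean of the known values
    among records that share the same class label."""
    by_class = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            by_class[r[class_attr]].append(r[attr])
    class_mean = {c: sum(v) / len(v) for c, v in by_class.items()}
    return [dict(r, **{attr: class_mean[r[class_attr]]
                       if r[attr] is None else r[attr]})
            for r in records]

# Hypothetical records: income in thousands, None marks a missing value.
records = [
    {"income": 125, "cheat": "No"},
    {"income": 100, "cheat": "No"},
    {"income": None, "cheat": "No"},
    {"income": 95, "cheat": "Yes"},
    {"income": 85, "cheat": "Yes"},
    {"income": None, "cheat": "Yes"},
]

overall = fill_with_mean(records, "income")
per_class = fill_with_class_mean(records, "income", "cheat")
print(overall[2]["income"], per_class[2]["income"], per_class[5]["income"])
```

With the overall mean, both gaps receive 101.25. With the class mean, the "No" record receives 112.5 and the "Yes" record receives 90.0, which keeps the difference between the two classes visible.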
Regression
Clustering
Sort the data and partition them into bins with equal frequency of values
Each bin can then be represented by the mean, the median, or the boundaries of the values in the bin
Divide the value range into N intervals of equal size (width)
If min_i and max_i are the smallest and largest values of the attribute, then the size (width) of each interval = (max_i - min_i)/N
Not suitable for skewed data or for data containing outliers, because an interval may end up containing only one (or a few) outliers
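The equal-width rule above can be sketched in Python as follows. The first call uses the twelve values from the binning example in these slides; the second call uses a shortened list with a made-up outlier (340) to show the weakness mentioned above. Function and variable names are assumptions for illustration, and the sketch assumes the values are not all identical (otherwise the width would be zero).

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width and
    assign each value to its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would land on index n_bins, so clamp it into
        # the last interval.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

# Values from the binning example in the slides.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width, bins)

# Hypothetical data with one outlier: almost everything falls into one bin.
with_outlier = [4, 8, 9, 15, 21, 340]
outlier_width, outlier_bins = equal_width_bins(with_outlier, 3)
print(outlier_width, outlier_bins)
```

In the second call the width becomes 112, so five of the six values share the first interval, the middle interval is empty, and the last interval holds only the outlier.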
Divide the value range into N intervals (not necessarily of equal width) such that each interval contains approximately the same number (frequency of occurrence) of examples
More effective than partitioning into intervals of equal width
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
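The bins above can be reproduced with a short Python sketch of equal-frequency (equal-depth) partitioning followed by smoothing with the rounded bin mean. The function names are assumptions for illustration, and rounding to the nearest integer is inferred from the values shown in the example (the exact means of bins 2 and 3 are 22.75 and 29.25).

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and cut them into n_bins groups holding (almost) the
    same number of values; the last bin takes any remainder."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = []
    for i in range(n_bins):
        start = i * size
        end = (i + 1) * size if i < n_bins - 1 else len(ordered)
        bins.append(ordered[start:end])
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(values, 3)
smoothed = smooth_by_means(bins)

for number, (raw, smooth) in enumerate(zip(bins, smoothed), start=1):
    print(f"Bin {number}: {raw} -> {smooth}")
```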
Regression
[Figure: regression line y = x + 1, with the labels y, X1, and Y1]
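Smoothing by regression can be sketched as fitting a straight line by ordinary least squares and replacing each observed value with the value on the line. The data points below are hypothetical, chosen to scatter around the line y = x + 1 shown in the figure; the function name is also an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
# Each noisy y value is replaced by the corresponding point on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
```

For these points the fitted line is approximately y = 0.97x + 1.09, close to the line in the figure.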
Data integration
Combine data from multiple sources into a unified data store
General requirement for the data integration process: minimize [...]
Helps improve the speed of the data mining process and the quality of the results (knowledge) obtained
The mapping of the entire set of values of an attribute to a new set of replacement values, such that each old value corresponds to one of the new values
Data transformation methods
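Min-max normalization
$$v^{new} = \frac{v^{old} - min_i}{max_i - min_i}\,(new\_max_i - new\_min_i) + new\_min_i$$
Z-score normalization
$$v^{new} = \frac{v^{old} - \mu_i}{\sigma_i}$$
where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of attribute $i$
Decimal scaling
$$v^{new} = \frac{v^{old}}{10^{j}}$$
where $j$ is the smallest integer such that $\max(|v^{new}|) < 1$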
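The three normalization formulas can be sketched in Python as below. The input is the Taxable Income column (in thousands) from the example dataset in these slides. The function names, the default target range [0, 1], and the use of the population standard deviation are assumptions made for this illustration; the sketch also assumes the values are not all identical.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer so that max |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest >= 10 ** j:
        j += 1
    return [v / 10 ** j for v in values]

# Taxable Income values (in thousands) from the example dataset.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

scaled = min_max(income)
standardized = z_score(income)
decimal = decimal_scaling(income)
print(scaled[0], decimal[6])
```

Here the first value, 125, maps to (125 - 60) / (220 - 60) = 0.40625 under min-max normalization, and decimal scaling uses j = 3, so 220 becomes 0.22.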
Data reduction
Data reduction strategies
Dimensionality reduction
Find a projection onto a new attribute space that preserves as much as possible of the variation in the original dataset
Find the eigenvectors of the covariance matrix; these eigenvectors define the new attribute space
[Figure: data points plotted on the axes x1 and x2]
(Han, Kamber - Data Mining:
Concepts and Techniques)
... sorted in decreasing order of importance
5) The size of the data is reduced by removing the components (vectors) with low importance; these vectors correspond to low variance
6) Using the vectors with the highest importance gives an approximate representation of the original dataset
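The procedure described above is commonly known as principal component analysis (PCA). The sketch below illustrates the idea on two-dimensional data without external libraries: it builds the covariance matrix, approximates the most important eigenvector with power iteration (a simple technique swapped in here for a full eigendecomposition), and projects the data onto that single direction. The data points, function names, and iteration count are assumptions for illustration; the sketch assumes the data are not constant and that the start vector is not orthogonal to the leading eigenvector.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the column means."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, means

def top_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and normalize a vector so that it
    converges to the eigenvector with the largest eigenvalue."""
    d = len(matrix)
    vec = [1.0] + [0.5] * (d - 1)  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def project(rows, means, direction):
    """Coordinates of the centered rows along one direction."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(means)))
            for r in rows]

# Hypothetical points lying exactly on a line, so one direction carries
# all of the variance.
points = [(1, 2), (2, 3), (3, 4), (4, 5)]

cov, means = covariance_matrix(points)
direction = top_eigenvector(cov)
reduced = project(points, means, direction)
print(direction, reduced)
```

Because the points lie on a straight line, the dominant direction is (1/√2, 1/√2) and the one-dimensional projection loses no information. With real data the discarded low-variance directions carry some information, so the reduced data are only an approximation.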
Attribute subset selection
Different aggregation levels in data cubes
Appropriate aggregation levels
Data sampling
Data sampling methods
Stratified sampling
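Stratified sampling can be sketched as follows: the records are grouped by a stratum (here the Cheat label from the example dataset in these slides), and the same fraction is drawn at random from every group, so that small groups stay represented. The function name, the 40% fraction, the fixed seed, and the rule of taking at least one record per stratum are assumptions made for this illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw roughly `fraction` of the records from every stratum."""
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for members in strata.values():
        # Take at least one record so that no stratum disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# (Tid, Cheat) pairs from the example dataset: 7 "No" and 3 "Yes" records.
records = [(1, "No"), (2, "No"), (3, "No"), (4, "No"), (5, "Yes"),
           (6, "No"), (7, "No"), (8, "Yes"), (9, "No"), (10, "Yes")]

sample = stratified_sample(records, stratum_of=lambda rec: rec[1], fraction=0.4)
no_count = sum(1 for _, label in sample if label == "No")
yes_count = sum(1 for _, label in sample if label == "Yes")
print(no_count, yes_count)
```

The sample contains three "No" records and one "Yes" record, which mirrors the 7:3 class ratio of the full dataset as closely as four records allow.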