
Data mining - classification

Instructor:

Nguyn Qunh Chi

Students:
Trn Tun Anh
inh Th Thanh Hng
Nguyn Trng Th


PREFACE
The rapid growth of the Internet and of intranets has produced an enormous volume of hypertext data (Web data). Because the content and the number of Web pages change and grow every day, even every hour, finding information has become increasingly difficult for users. The need to search for information in unstructured databases has, in effect, developed mainly alongside the Internet itself. Through the Internet, people have become familiar with Web pages and the vast amount of information they carry. In recent years the Internet has become one of the main channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page on the Internet. Compared with other services, such as selling or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users around the world with much faster updates. The Web can be likened to an encyclopedia. The information on Web pages is diverse in both content and form. The Internet can also be seen as a virtual society: it holds information on every aspect of economic and social life, presented as text, images, sound and so on.
However, this diversity and sheer volume of information have given rise to the problem of information overload. Users cannot find by themselves the addresses of the Web pages that contain the information they need, so a utility is required that manages the content of Web pages and returns the addresses of pages whose content matches the user's query. Such utilities manage data as unstructured objects. We are already familiar with several of them: Yahoo, Google, AltaVista and others.
On the other hand, suppose we run Web pages on topics such as informatics, sports, economy and society, construction and so on. By classifying the documents that customers view or download, we learn which content on our site customers focus on, and can then add more documents on the topics they care about and fewer on those they do not. The same analysis tells us which issues a given customer concentrates on, so that we can offer that customer additional support. Given these practical needs, Web page classification and search remain interesting problems that call for further research.

[D08 HTTT1]



TABLE OF CONTENTS
Introduction
Data mining
Concept
Advantages of data mining
Data mining techniques
Decision trees
The Weka data mining tool
Functions of Weka Explorer
Exploring the data
Data classification using decision trees
Overview of data classification in data mining
Data classification
Decision trees in data classification
Definition
The C4.5 algorithm
Practice
Introduction to the dataset
Analysis of results


Introduction
Data mining
Concept
Data mining is defined as the process of extracting valuable, hidden information from large volumes of data stored in databases, data warehouses and similar repositories. Besides the term data mining, several other terms with a similar meaning are in use: knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology and data dredging. Many people treat data mining and another common term, Knowledge Discovery in Databases (KDD), as synonyms. In practice, however, data mining is only one essential step in the process of knowledge discovery in databases. That process consists of the following steps:
Step 1: Data cleaning: remove noise and irrelevant data.
Step 2: Data integration: combine data from different sources such as databases, data warehouses and text files.
Step 3: Data selection: retrieve from the original data sources the data directly relevant to the task.
Step 4: Data transformation: convert the data into a form suitable for mining by applying grouping or aggregation operations.
Step 5: Data mining: the essential stage, in which intelligent methods are applied to extract data patterns.
Step 6: Pattern evaluation: assess the usefulness of the patterns that represent knowledge, based on a number of measures.
Step 7: Knowledge presentation: use presentation and visualization techniques to show the mined knowledge to the user.


Data mining and knowledge discovery in databases draw on methods, algorithms and techniques from many research fields, such as machine learning, pattern recognition, databases, statistics, artificial intelligence and knowledge acquisition in expert systems. All of them share one goal: extracting knowledge from the data held in very large databases. Compared with these other approaches, however, data mining has several clear advantages.

Advantages of data mining
Data mining has many applications and several clear advantages, discussed below:
+ Compared with machine learning, data mining has the advantage that it can be used with databases that contain a lot of noise, incomplete data or continuously changing data. Machine learning methods, by contrast, are mostly applied to databases that are complete, change little and are not too large.
+ Expert systems: this approach differs from data mining in that the examples supplied by experts are usually of much higher quality than the data in a database, and they usually cover only the important cases. In addition, the experts confirm the validity and usefulness of the patterns that are discovered.
+ Statistics is one of the theoretical foundations of data mining, but a comparison of the two shows that statistical methods have several weaknesses that data mining overcomes:
- Standard statistical methods are not suited to the structured data types found in many databases.
- Statistical methods are driven entirely by the data and do not use existing domain knowledge.
- The results of a statistical analysis can be very numerous and hard to interpret.
With these advantages, data mining is being applied, for example to personnel data, to cope with data that changes and grows continually and to find hidden information in the data that other methods cannot detect.



Data mining techniques
Data mining techniques are usually divided into two main groups:
- Descriptive data mining: describes the general properties or characteristics of the data in an existing database. These techniques include clustering, summarization, visualization, change and deviation detection, and association rule analysis.
- Predictive data mining: makes predictions based on inferences from the current data. These techniques include classification and regression.
The three most common methods in data mining are clustering, classification and association rule mining. This report considers only classification.
Data classification: the goal of classification is to predict the class label of data samples. The classification process usually consists of two steps: building a model, and using the model to classify data.
Step 1: a model is built by analysing the available data samples. Each sample belongs to one class, determined by an attribute called the class attribute. These samples are called the training data set. The class labels of the training data set must be known before the model is built, which is why this method is also called supervised learning, in contrast to clustering, which is unsupervised learning.
Step 2: use the model to classify data. First the accuracy of the model must be measured. If the accuracy is acceptable, the model is used to predict the class labels of other data samples in the future. Regression differs from classification in that regression predicts continuous values, whereas classification predicts only discrete values.
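The two steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the model is a one-attribute threshold rule (a decision stump), and the function names and toy data are hypothetical, not taken from the report.

```python
from collections import Counter

def train_stump(samples, labels):
    """Step 1: learn a one-attribute threshold rule from labelled training data."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (-1, None, majority, majority)  # (correct, threshold, label_low, label_high)
    for threshold in sorted(set(samples)):
        low = [l for x, l in zip(samples, labels) if x <= threshold]
        high = [l for x, l in zip(samples, labels) if x > threshold]
        label_low = Counter(low).most_common(1)[0][0]
        label_high = Counter(high).most_common(1)[0][0] if high else majority
        correct = low.count(label_low) + high.count(label_high)
        if correct > best[0]:
            best = (correct, threshold, label_low, label_high)
    _, threshold, label_low, label_high = best
    return lambda x: label_low if x <= threshold else label_high

def accuracy(model, samples, labels):
    """Step 2a: fraction of labelled test samples the model classifies correctly."""
    hits = sum(model(x) == y for x, y in zip(samples, labels))
    return hits / len(labels)

# Hypothetical toy data: one numeric attribute, two classes.
train_x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
train_y = ["low", "low", "low", "high", "high", "high"]
test_x = [2.5, 9.0]
test_y = ["low", "high"]

model = train_stump(train_x, train_y)            # step 1: build the model
test_accuracy = accuracy(model, test_x, test_y)  # step 2a: measure accuracy
prediction = model(11.5)                         # step 2b: predict a new sample
```

The accuracy is measured on samples that were not used for training, which is the point of step 2.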



Decision trees
In data classification, the decision tree is the most visual form a model can take. The following sections present the role of decision trees in data mining and an evaluation of them.

The Weka data mining tool

Functions of Weka Explorer

The main functions of Weka Explorer are available through the tabs of the main window:

Preprocess: open, adjust and save a data file; this tab contains the algorithms used for data preprocessing.
Classify: provides models for classification or regression.
Cluster: provides clustering models.
Associate: mines frequent itemsets and association rules.
SelectAttributes: selects the most relevant attributes in the dataset.
Visualize: displays the data as charts.



Exploring the data

Using the Preprocess tab

(1) Open file: open a dataset.
(2) Edit: display the data and edit it by hand if necessary.
(3) Save: save the current data to a file. Weka Explorer supports several formats, including arff and csv.
(4) Filter: data preprocessing tasks, which Weka calls filters.
(5) Selected attribute: information about the currently selected attribute:
o Type: the data type of the attribute (numeric; nominal, i.e. discrete or non-numeric; ordinal; binary)
o Missing: the number of samples with no value for this attribute
o Distinct: the number of distinct values
o Unique: the number of samples whose value is not shared with any other sample

Using the Classify tab

(1) Classifier: choose the classifier and its parameters.
(2) Test options: options for testing the model:
o Use training set: test on the training data itself.
o Supplied test set: test on a separate dataset.
o Cross-validation: split the data into several parts (folds) and evaluate the result several times.
o Percentage split: split the data into two parts by percentage; one part is used to build the model and the other to test it.
(3) Result list: a list of the results of each run of the algorithm; items in the list can be selected to carry out additional functions.
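The Missing, Distinct and Unique counts shown for a selected attribute are simple to compute. The sketch below follows the definitions given above; the function name and sample values are hypothetical, and `None` is assumed to stand for a missing value.

```python
from collections import Counter

def summarize_attribute(values):
    """Count missing, distinct and unique values of one attribute column."""
    present = [v for v in values if v is not None]  # None marks a missing value
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # a sample is "unique" when no other sample shares its value
        "unique": sum(1 for c in counts.values() if c == 1),
    }

summary = summarize_attribute(["red", "blue", None, "blue", "green"])
```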



Data classification using decision trees
Overview of data classification in data mining
Data classification
One of the main tasks of data mining is solving the classification problem. The input to a classification problem is a set of training samples that have already been classified, each described by a number of attributes. The attributes that describe a sample are of two kinds: continuous and discrete. Among the discrete attributes there is one special attribute, the class attribute, whose values are called class labels. Continuous attributes take ordered values, whereas discrete attributes take unordered values. Attributes may also have an undefined value (for example when, for reasons beyond our control, the value cannot be known). Note, however, that the class label of a sample is never allowed to be undefined. The task of classification is to establish a mapping from the attribute values to the class labels. The model that represents this relationship is then used to determine the class label of new observations that are not in the original sample set.

[Figure: input data -> classification algorithm -> Class 1, Class 2, ..., Class n]

In practice there is a need to extract intelligent business decisions from databases that contain a great deal of hidden information. Classification and prediction are two forms of data analysis that extract a model describing important data classes or predicting future data trends. Classification predicts categorical labels or discrete values; that is, it works with data objects whose set of possible values is known in advance. Prediction, in contrast, builds models with continuous-valued functions. For example, a weather classification model can tell whether tomorrow will be rainy or sunny from parameters such as the humidity, wind strength and temperature of today and the preceding days. Likewise, rules about customers' purchasing trends in a supermarket help sales staff make sound decisions about the quantities and types of goods to put on sale. A prediction model can estimate how much potential customers will spend from information about their income and occupation. In recent years, data classification has attracted the interest of researchers in many fields, such as machine learning, expert systems and statistics. The technology is also applied in many areas, including commerce, banking, marketing, market research, insurance, health care and education.
The data classification process consists of two steps:

Step one (learning)

The learning process builds a model that describes a predefined set of data classes or concepts. Its input is a structured dataset described by attributes and made up of tuples of values for those attributes. Each tuple is generally called a data tuple, and may also be referred to as a sample, example, object, record or case. This report uses these terms interchangeably. Every data tuple in the dataset is assumed to belong to a predefined class, where the class is the value of one attribute chosen as the class label attribute. The output of this step is usually a set of classification rules in the form of if-then rules, a decision tree, logical formulas or a neural network. The process is illustrated in the figure:

[Figure: Training data (columns P_i, P_t, L_l_a, S_s, P_r, D_s) -> Classification algorithm -> Classifier (model). Example rule produced: If D_s <= 19.85 and P_r <= 125.21 and S_s <= 40.47 and P_t > 9.97 then class = Abnormal]
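A classifier in if-then form is easy to express as code. The sketch below encodes the single example rule from the figure (D_s <= 19.85, P_r <= 125.21, S_s <= 40.47, P_t > 9.97 gives Abnormal). The function name and the sample values are hypothetical, and returning `None` when the rule does not fire is an assumption, since the figure shows only this one rule.

```python
def classify_by_rule(p_t, s_s, p_r, d_s):
    """Apply the example if-then rule from the figure to one sample."""
    if d_s <= 19.85 and p_r <= 125.21 and s_s <= 40.47 and p_t > 9.97:
        return "Abnormal"
    return None  # assumption: the rule does not fire, other rules would decide

# Hypothetical samples, not rows of the dataset.
fired = classify_by_rule(p_t=22.5, s_s=40.0, p_r=98.7, d_s=-0.25)
not_fired = classify_by_rule(p_t=5.0, s_s=40.0, p_r=98.7, d_s=-0.25)
```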

Step two (classification)



Step two uses the model built in the previous step to classify new data. First, the predictive accuracy of the newly created classification model is estimated. Holdout is a simple technique for estimating this accuracy. It uses a test dataset of samples that already carry class labels. These samples are chosen at random and are independent of the samples in the training dataset. The accuracy of the model on the test dataset is the percentage of test samples that the model classifies correctly (compared with their true labels). If the accuracy is estimated on the training dataset instead, the result is overly optimistic, because a model always tends to overfit its data. Overfitting is the phenomenon in which the classification results match the training data too closely, because the process of building the model from the training dataset may have incorporated characteristics peculiar to that dataset. A test dataset independent of the training dataset is therefore needed. If the accuracy of the model is acceptable, the model is used to classify future data, or data for which the value of the class attribute is unknown.
There are several methods for evaluating the accuracy of a classification model. We discuss only the most common one, k-fold cross-validation. In k-fold cross-validation the original dataset is randomly divided into k subsets (folds) of approximately equal size, S1, S2, ..., Sk. Training and testing are performed k times. In the i-th iteration, Si is the test dataset and the remaining subsets together form the training dataset. That is, training is first carried out on S2, S3, ..., Sk and testing on S1, and the process continues in the same way until the test set is Sk. The accuracy is the total number of correct classifications over the k iterations divided by the total number of samples in the original dataset.
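The k-fold procedure described above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the shuffling seed is arbitrary, and a trivial majority-class model stands in for a real classifier.

```python
import random
from collections import Counter

def k_fold_accuracy(samples, labels, k, train, seed=0):
    """Total correct predictions over k folds divided by the number of samples."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k folds of near-equal size
    correct = 0
    for i in range(k):
        test_idx = folds[i]                       # S_i is the test set
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train([samples[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        correct += sum(model(samples[j]) == labels[j] for j in test_idx)
    return correct / len(samples)

def train_majority(train_samples, train_labels):
    """Placeholder learner: always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: majority

# Hypothetical toy data: 7 "Abnormal" and 3 "Normal" samples.
xs = list(range(10))
ys = ["Abnormal"] * 7 + ["Normal"] * 3
cv_accuracy = k_fold_accuracy(xs, ys, k=5, train=train_majority)
```

Any function that takes training samples and labels and returns a callable model can be passed as `train`, so the same loop works for a decision tree learner.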

Decision trees in data classification

Definition
In recent years, scientists from many fields have proposed a variety of classification models, such as neural networks, linear and quadratic statistical models, decision trees and genetic models. Among these, the decision tree is regarded as a powerful and popular tool that is particularly well suited to data mining in general and to classification in particular [12]. The advantages of decision trees include that they are relatively fast to build and are simple and easy to understand. In addition, trees can easily be converted into SQL statements that can be used to access databases efficiently. Finally, classification based on decision trees achieves similar, and sometimes better, accuracy than other classification methods.
A decision tree is a flow chart with a tree structure: each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label or a class distribution. The structure is shown in the following figure:


In a decision tree:

Root: the topmost node of the tree
Internal node: represents a test on a single attribute (rectangle)
Branch: represents an outcome of the test at an internal node (arrow)
Leaf node: represents a class or a class distribution
To classify an unknown data sample, the attribute values of the sample are tested against the decision tree. Each sample follows one path from the root to a leaf, and the leaf gives the predicted class of that sample.
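The root-to-leaf walk can be sketched with a small data structure. This is an illustration only: the class names, the binary threshold tests and the tree itself (attributes and thresholds) are hypothetical and are not the tree learned later in this report.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str                      # predicted class

@dataclass
class Node:
    attribute: str                  # attribute tested at this internal node
    threshold: float
    left: Union["Node", Leaf]       # branch taken when value <= threshold
    right: Union["Node", Leaf]      # branch taken when value > threshold

def classify(tree, sample):
    """Follow one path from the root to a leaf and return the leaf's label."""
    node = tree
    while isinstance(node, Node):
        node = node.left if sample[node.attribute] <= node.threshold else node.right
    return node.label

# Hypothetical tree with made-up thresholds.
tree = Node("degree_spondylolisthesis", 10.0,
            Node("sacral_slope", 30.0, Leaf("Abnormal"), Leaf("Normal")),
            Leaf("Abnormal"))

label_a = classify(tree, {"degree_spondylolisthesis": 2.0, "sacral_slope": 40.0})
label_b = classify(tree, {"degree_spondylolisthesis": 50.0, "sacral_slope": 40.0})
```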
The C4.5 algorithm
C4.5 is a development of CLS and ID3 and is a widely used tool in data mining. Its input is a set of samples, each belonging to one class. Its output is a classifier used for prediction.
Decision tree
Given a set S of samples, C4.5 grows an initial decision tree by divide and conquer as follows:


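- If all samples in S belong to the same class, or S is small, the tree is a leaf node labelled with the most frequent class in S.
- Otherwise, choose an attribute with two or more outcomes from the attribute set. Make this attribute the root node of the tree, with one branch per outcome, partition S into subsets S1, S2, ..., Sk according to the outcome of each sample, and apply the same procedure recursively to each of the subsets S1, ...
How is the attribute for the root node chosen? C4.5 relies on one of the following two heuristics:
Information Gain

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Value}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where Value(A) is the set of values of attribute A and S_v is the subset of S for which A takes the value v.

Gain ratio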
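Both heuristics can be computed directly from class counts. The sketch below assumes the usual textbook definitions: Entropy(S) is the sum over classes c of -p_c log2 p_c, information gain is the entropy of S minus the weighted entropy of the subsets S_v, and gain ratio is the information gain divided by the split information, which is the entropy of the partition induced by the attribute. It handles nominal attributes only; the threshold search C4.5 uses for numeric attributes is left out. Function names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels (or attribute values)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def partition(values, labels):
    """Group class labels by the attribute value of each sample: v -> labels of S_v."""
    subsets = {}
    for v, label in zip(values, labels):
        subsets.setdefault(v, []).append(label)
    return subsets

def information_gain(values, labels):
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s)
                    for s in partition(values, labels).values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    split_info = entropy(values)      # entropy of the split itself
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: the attribute separates the classes perfectly.
attribute = ["a", "a", "b", "b"]
perfect = ["x", "x", "y", "y"]
useless = ["x", "y", "x", "y"]

gain_perfect = information_gain(attribute, perfect)
ratio_perfect = gain_ratio(attribute, perfect)
gain_useless = information_gain(attribute, useless)
```

Dividing by the split information penalizes attributes that fragment S into many small subsets, which plain information gain tends to favour.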



Practice
Introduction to the dataset
The dataset column_2C_weka.arff was built by Dr. Henrique da Mota during a medical residence period in the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The data are organized into two different but related classification tasks. The task used here consists of classifying patients into one of two categories: Normal (100 patients) or Abnormal (210 patients). The dataset authors also provide the files for use in the WEKA environment. Source:
http://archive.ics.uci.edu/ml/datasets/Vertebral+Column
The dataset column_2C_weka.arff contains 310 samples with 7 attributes (including the class attribute).
Table describing the name, data type and values of each attribute

No.  Attribute name             Data type  Values
1    Pelvic_incidence           Numeric    26.148 to 129.834
2    Pelvic_tilt                Numeric    -6.555 to 49.432
3    Lumbar_lordosis_angle      Numeric    14 to 125.742
4    Sacral_slope               Numeric    13.367 to 121.43
5    Pelvic_radius              Numeric    70.083 to 163.071
6    Degree_spondylolisthesis   Numeric    -11.058 to 418.543
7    Class                      Nominal    Normal, Abnormal

The class attribute is attribute 7 (class).

Class distribution:
o Normal: 100 (32.258%)
o Abnormal: 210 (67.742%)



An ARFF file is a text file made up of two parts, a declaration part and a data part:

Declaration part:
@relation column_2C_weka
@attribute pelvic_incidence numeric
@attribute pelvic_tilt numeric
@attribute lumbar_lordosis_angle numeric
@attribute sacral_slope numeric
@attribute pelvic_radius numeric
@attribute degree_spondylolisthesis numeric
@attribute class {Abnormal, Normal}

Data part:
@data
63.0278175,22.55258597,39.60911701,40.47523153,98.67291675,-0.254399986,Abnormal
39.05695098,10.06099147,25.01537822,28.99595951,114.4054254,4.564258645,Abnormal
68.83202098,22.21848205,50.09219357,46.61353893,105.9851355,-3.530317314,Abnormal
69.29700807,24.65287791,44.31123813,44.64413017,101.8684951,11.21152344,Abnormal
49.71285934,9.652074879,28.317406,40.06078446,108.1687249,7.918500615,Abnormal
40.25019968,13.92190658,25.1249496,26.32829311,130.3278713,2.230651729,Abnormal
Declaration part:
@relation <dataset name>
@attribute <attribute name 1> <data type>
@attribute <attribute name 2> <data type>
...
@attribute <attribute name n> <data type>

o Data types
Numeric: numeric data. Example: @ATTRIBUTE name numeric
Nominal: discrete data. Example: @ATTRIBUTE class {setosa, versicolor}
String: string data. Example: @ATTRIBUTE name string
Date: date data. Example: @ATTRIBUTE discovered date
Missing values are written as a question mark (?).
o Data part:
Each data sample occupies one line. Its attribute values are listed from left to right, in the order in which the attributes were declared, and are separated by commas.
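To make the file layout concrete, here is a minimal reader for the subset of ARFF described above. It is a sketch under stated assumptions, not a full ARFF parser: it does not handle quoted names, sparse rows or date formats, it keeps every value as a string, and the function name and sample text are hypothetical.

```python
def parse_arff(text):
    """Return (relation, [(attribute name, type)], rows) for a simple ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):       # skip blank lines and comments
            continue
        lower = line.lower()                       # keywords are case-insensitive
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])  # ? = missing
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

sample = """@relation demo
@attribute pelvic_tilt numeric
@attribute class {Abnormal, Normal}
@data
22.55,Abnormal
?,Normal
"""
relation, attributes, rows = parse_arff(sample)
```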
Viewing the file with ArffViewer

Meaning of the attributes

No.  Attribute                       Meaning
1    Pelvic_incidence = Pi           Pelvic incidence (angle)
2    Pelvic_tilt = Pt                Pelvic tilt
3    Lumbar_lordosis_angle = lla     Lumbar lordosis angle (outward curvature of the lumbar spine)
4    Sacral_slope = Ss               Sacral slope
5    Pelvic_radius = Pr              Pelvic radius
6    Degree_spondylolisthesis = Ds   Degree of spondylolisthesis
7    Class: Normal, Abnormal         Class: normal, abnormal

Analysis of results
The J48 algorithm (Weka's implementation of C4.5) is used to train on the dataset.
The decision tree produced by the algorithm is:

The classification performance of the algorithm on the given dataset is evaluated with two methods:

Cross-validation
Test 1: 10 folds

                        Samples  Rate
Correctly classified    253      81.6129%
Incorrectly classified  57       18.3871%
Unclassified
Total                   310

Test 2: fewer than 10 folds (8 folds)

                        Samples  Rate
Correctly classified    255      82.2581%
Incorrectly classified  55       17.7419%
Unclassified
Total                   310

Test 3: fewer than 10 folds (5 folds)

                        Samples  Rate
Correctly classified    254      81.9355%
Incorrectly classified  56       18.0645%
Unclassified
Total                   310



Test 4: more than 10 folds (12 folds)

                        Samples  Rate
Correctly classified    255      82.2581%
Incorrectly classified  55       17.7419%
Unclassified
Total                   310

Test 5: more than 10 folds (15 folds)

                        Samples  Rate
Correctly classified    260      83.871%
Incorrectly classified  50       16.129%
Unclassified
Total                   310

With cross-validation, the best classification performance, 83.871% over 310 tested samples, is obtained with the parameter Folds = 15.
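The rates in the cross-validation tables follow directly from the counts of correctly classified samples. The short check below recomputes them and picks the best fold setting; the counts are the ones reported above, while the variable names are hypothetical.

```python
# folds -> number of correctly classified samples, as reported in the tables
cv_results = {10: 253, 8: 255, 5: 254, 12: 255, 15: 260}
total = 310  # every sample is tested exactly once in cross-validation

# accuracy in percent, rounded to four decimals as in the Weka output
rates = {folds: round(100 * correct / total, 4)
         for folds, correct in cv_results.items()}
best_folds = max(rates, key=rates.get)
```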
Percentage split: find the split percentage that gives the best classification performance.
Test 1: split of 66%

                        Samples  Rate
Correctly classified    90       85.7143%
Incorrectly classified  15       14.2857%
Unclassified
Total                   105



Test 2: split below 66% (60%)

                        Samples  Rate
Correctly classified    97       78.2258%
Incorrectly classified  27       21.7742%
Unclassified
Total                   124

Test 3: split below 66% (55%)

                        Samples  Rate
Correctly classified    117      84.1727%
Incorrectly classified  22       15.8273%
Unclassified
Total                   139

Test 4: split above 66% (70%)

                        Samples  Rate
Correctly classified    76       81.7204%
Incorrectly classified  17       18.2796%
Unclassified
Total                   93



Test 5: split above 66% (75%)

                        Samples  Rate
Correctly classified    65       84.4156%
Incorrectly classified  12       15.5844%
Unclassified
Total                   77

With percentage split, a split of 66% gives the highest classification rate, 85.7143%. However, only 105 samples are tested, far fewer than the 310 tested under cross-validation, so this result is a less reliable measure of classification performance.
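The totals in the percentage split tables (105, 124, 139, 93 and 77 test samples) are what remains of the 310 samples after the training share is taken. The sketch below reproduces them under one assumption: the training size is the percentage of 310 rounded half up. That rounding rule is chosen because it matches the reported totals; it is not confirmed here as Weka's exact implementation, and the function name is hypothetical.

```python
def split_sizes(n_samples, train_percent):
    """Return (train size, test size) for a percentage split."""
    # round(n * p / 100) with halves rounded up, using integers only
    train = (2 * n_samples * train_percent + 100) // 200
    return train, n_samples - train

test_sizes = {p: split_sizes(310, p)[1] for p in (66, 60, 55, 70, 75)}
```

Integer arithmetic is used because Python's built-in `round` rounds halves to the nearest even number, which would give 170 rather than 171 training samples at 55%.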
Inferences drawn from the decision tree using cross-validation:

Classifier output:

The results are listed as text in the following distinct parts:


Run information: general information about the algorithm used, the data, the dataset and so on.

Classifier model: details of the classification model; for some classifiers, however, the model cannot be shown in full as text.

Summary: overall information about the accuracy of the classifier in the test run.



The correctly and incorrectly classified instances show the percentage of test instances that were classified correctly and incorrectly. The raw counts appear in the confusion matrix, where a and b stand for the class labels. Here there are 310 instances, so the percentages and the raw counts agree: aa + bb = 191 + 69 = 260 and ab + ba = 19 + 31 = 50.
The percentage of correctly classified instances is usually called accuracy or sample accuracy. As a performance estimate it has some drawbacks (it is not corrected for chance and is not sensitive to the class distribution), so it is worth looking at some of the other figures as well.
Kappa is a chance-corrected measure of agreement between the classifications and the true classes. It is calculated by subtracting the agreement expected by chance from the observed agreement and dividing by the maximum possible agreement. Kappa is at most 1; a value greater than 0 means that the classifier is doing better than chance (as it really should).
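The kappa calculation can be written out in a few lines. The sketch below assumes Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the row and column totals. The confusion matrix is assembled from the counts quoted in this section (aa = 191, ab = 19, ba = 31, bb = 69), with rows as actual classes and columns as predicted classes; the function name is hypothetical.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # chance agreement: product of marginal proportions, summed over classes
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

confusion = [[191, 19],   # actual a: 191 classified as a, 19 as b
             [31, 69]]    # actual b: 31 classified as a, 69 as b
kappa = cohen_kappa(confusion)   # roughly 0.62 for these counts
```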
The error rates are used for numeric prediction rather than classification. In numeric prediction, predictions are not simply right or wrong: the error has a magnitude, and these measures reflect it.

Detailed Accuracy By Class and Confusion Matrix: detailed accuracy results of the classifier for each class.

The confusion matrix is a 2x2 matrix. The number of correctly classified instances is the sum of the main diagonal of the matrix, aa + bb.



TP rate (True Positive rate): the proportion of instances classified as class x among all instances that truly belong to class x. In the confusion matrix this is the diagonal element divided by the sum of the corresponding row: 191/(191+19) = 0.91 for class a and 69/(69+31) = 0.69 for class b.
FP rate (False Positive rate): the proportion of instances classified as class x that actually belong to another class, among all instances that are not of class x. In the confusion matrix this is the off-diagonal element in the column of class x divided by the sum of the row of the other class: 31/(31+69) = 0.31 for class a and 19/(191+19) = 0.09 for class b.
Precision: the proportion of instances that truly belong to class x among all instances classified as class x:
Precision = TP / (TP + FP)
Recall: the proportion of instances of class x that are classified as class x; it is the same as the TP rate.
F-Measure: the harmonic mean of precision and recall:
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
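These per-class measures can be recomputed from the confusion matrix. The sketch below uses the counts quoted in this report (191 and 19 in the row of class a, 31 and 69 in the row of class b), with rows taken as actual classes and columns as predicted classes; the function and variable names are hypothetical.

```python
def class_metrics(matrix, index):
    """TP rate, FP rate, precision, recall and F-measure for one class."""
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                  # rest of the class's own row
    fp = sum(row[index] for row in matrix) - tp   # rest of the class's column
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

confusion = [[191, 19],   # actual a (Abnormal)
             [31, 69]]    # actual b (Normal)
abnormal = class_metrics(confusion, 0)
normal = class_metrics(confusion, 1)
```

Rounded to two decimals, the TP and FP rates agree with the values worked out in the text: 0.91 and 0.31 for class a, 0.69 and 0.09 for class b.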
