
POSTS AND TELECOMMUNICATIONS INSTITUTE OF TECHNOLOGY
FACULTY OF INFORMATION TECHNOLOGY

INTERNSHIP REPORT
Topic: A Study of Data Mining with Decision Trees

Supervisor: NGUYỄN QUỲNH CHI
Students: NGUYN C TNG
          NGUYN CNG TOAN
          NG TH YN
Class: L10CQCN7-B


Preface
In recent years, access to information has come to be seen as the key to business. Whoever collects, analyzes and understands information, and acts on it, is the winner in this information age. For that reason, both the production and the consumption of information keep growing.
Alongside operational use, the exploitation of databases for decision support is increasingly important and is in great demand in every area of business and management. More and more data is collected and stored, yet decision makers in management and business need useful information and knowledge drawn from that data rather than the raw data itself.
These needs have long been recognized, but they only really took off in the 1990s. As a result, recent years have seen strong growth in a range of research areas: the organization of data warehouses and information warehouses, decision support systems, and methods for knowledge discovery and data mining. Among these, data mining and knowledge discovery have become an active research field that attracts people from many disciplines, such as database systems, statistics, information extraction, pattern recognition, machine learning and artificial intelligence. Within the scope of this report, we present the following:
Chapter I. Overview of databases and the emergence of data mining
Chapter II. Data mining
Chapter III. Data mining with decision trees
Chapter IV. Demo with the WEKA tool


CHAPTER 1. OVERVIEW OF DATABASES AND THE EMERGENCE OF DATA MINING

1.1. Organizing and exploiting traditional databases
Computers have been used to organize and exploit databases since the 1960s. Since then, a great many databases have been built, developed and used at every scale and in every area of human and social activity. By one estimate, the amount of information in the world doubles every 20 months, and the size and number of databases grow even faster. In 1989 there were about 5 million databases worldwide, most of them small databases built on DBaseIII. As electronics produced large-capacity storage, fast processors and telecommunication networks, organizations built information systems to automate all of their business activities. This created a constantly growing stream of data, because even the simplest transactions, such as a phone call, a health check or a credit card payment, are recorded by computer. Today the volume is enormous and includes very large databases of gigabytes or even terabytes that store business data such as customer information, transaction histories, sales data, accounts, loans and capital usage.
Many powerful database management systems, with rich and convenient tools, have helped people exploit these data resources effectively. The relational database model and the standard query language (SQL) have played a very important role in organizing and exploiting these databases. Today there is hardly any business organization that does not use database management systems, reporting tools and query languages to exploit databases for its operational work.
1.2. A new step in organizing and exploiting databases
As the volume of data keeps growing, information systems have also become specialized and divided by application area, such as manufacturing, finance and trading. Beyond operational data use, success in business no longer depends on the productivity of information systems but on their flexibility and readiness to respond to real-world requirements: databases must deliver knowledge rather than just data. Decisions must be made as quickly as possible and must be accurate, based on the available data, while a data volume that doubles every 20 months affects both the time needed to decide and the ability to understand the content of the data. At this point the traditional database models and SQL proved unable to do the job. To obtain knowledge-level information from this huge mass of data, people looked for techniques that can consolidate data from different transaction systems and convert it into a set of stable, high-quality databases used only for a few specific purposes. These techniques are collectively called data warehousing, and the resulting data environments are called data warehouses.
A data warehouse is a structured environment of information systems that gives users information that is hard to access or represent in traditional operational databases, in order to support decisions about the past or the present. According to W.H. Inmon, a data warehouse can be defined as follows: "A data warehouse is a subject-oriented, integrated, stable (non-volatile), time-variant collection of data that supports decision making." In other words, a data warehouse consists of:
- One or more tools for extracting data from any kind of data structure.
- A stable, subject-oriented, integrated database, compiled from the source data together with tables that describe that data (data about the data).
A data warehouse can be regarded as an information system with the following properties:
- It is a database designed for analysis, using data from different applications.
- It supports a group of related users with related information.
- It is read-only data.
- Its content is updated regularly by adding information.
- It contains historical and current data in order to reveal trends.
- It contains large data tables.
A data warehouse is built on a relational database management system and acts as a central information store. Operational data and operational processing are kept separate from warehouse processing. The central store is surrounded by components designed to make the warehouse operable, manageable and accessible to end users as well as to the data sources.
A data warehouse alone, however, is not enough to obtain knowledge. Data warehouses are used in three main ways:
- In the traditional way, the warehouse is queried with query and reporting tools. Because the raw data has been extracted, consolidated and converted into stable, high-quality data, the warehouse improves these traditional techniques for presenting information (queries and reports).
- Second, data warehouses support online analytical processing (OLAP). While SQL and traditional reporting tools can only describe what is in the database, OLAP can analyze the data and determine whether a hypothesis is true or false. OLAP cannot, however, generate hypotheses.
Moreover, the size and complexity of a data warehouse make it very hard to use for purposes such as deriving hypotheses from the information an application provides (for example, it is hard to produce a hypothesis that explains the behavior of a group of customers).
In the past, machine learning techniques were used to find hypotheses in collected data. Experiments showed, however, that they performed very poorly on the large data sets found in data warehouses. Statistical methods, although long established, had not been improved to keep pace with the growth of data. This is why a large amount of data remained unexploited and was even stored mainly in off-line warehouses. The result was a large gap in support for analyzing and understanding data, a gap between producing data and exploiting it. At the same time, people increasingly realized that data, if analyzed intelligently, is a valuable resource in market competition.
The computing community responded to these practical and scientific challenges with a new approach that meets the needs of both research and practice: data mining. This is the third use of data warehouses.
1.3. The knowledge discovery process and data mining
The key to success in any business activity today is knowing how to use information effectively. This means finding, in the available data, valuable hidden information that was not previously known, and finding development trends and the factors that influence them. Doing this is the process of knowledge discovery in databases (KDD), and the technique that actually extracts the knowledge is data mining.
As John Naisbitt said, "We are drowning in data but starving for knowledge." Data usually consists of values that describe specific events and phenomena. But what is knowledge? Can data, information and knowledge be clearly defined and told apart? Exact definitions are hard to give, but distinguishing the three in a given context is both necessary and feasible. Information is a very broad concept for which a precise definition is hard to give. Knowledge cannot be defined precisely either, even when restricted to knowledge extracted from databases. We can, however, understand knowledge as an expression in some language that describes one or more relationships among attributes of the data. The languages commonly used to represent knowledge in KDD are frames, trees and graphs, rules, formulas in propositional or first-order predicate logic, systems of equations, and so on. Examples are rules that describe properties of the data, frequently occurring patterns, and groups of objects in the database.
In summary: knowledge discovery in databases (KDD) is the process of identifying patterns or models in data that are valid, novel, useful and understandable.
1.3.1. The knowledge discovery process consists of the following 5 stages:

Figure 1.1. The knowledge discovery process

Although there are 5 stages, knowledge discovery in databases is an interactive, iterative, spiral process in which each iteration is more complete than the previous one. In addition, each stage builds on the results of the preceding stage, in a waterfall fashion. This back-and-forth, learning character is the methodology of knowledge discovery. The stages are described below.
Stage 1: Formulating and defining the problem
This stage studies the application domain and formulates the problem. It determines which useful knowledge can be extracted, and selects the data mining methods that suit the purpose of the application and the nature of the data.

Stage 2: Collecting and preprocessing data
In this stage the data is collected in raw form (from data warehouses or from Internet sources). The data is also preprocessed to transform it and improve its quality so that it suits the data mining method chosen in the previous stage. This stage usually takes the most time in the whole knowledge discovery process.
Data preprocessing techniques include:
1. Handling missing data: missing values are replaced with suitable values.
2. Removing duplicates: duplicate data objects are discarded. This technique is not used for tasks that depend on the distribution of the data.
3. Reducing noise: noise and outliers that lie far from the overall distribution are removed from the data.
4. Normalization: the value ranges of the data are normalized.
5. Discretization: numeric data is converted to discrete values.
6. Extracting and constructing new features from existing attributes.
7. Dimensionality reduction: attributes that carry little information are dropped.
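A few of the preprocessing techniques listed above can be sketched in a handful of lines. This is only an illustrative sketch, not the procedure used in the report: the record layout, the `age` field, the choice of mean imputation, min-max scaling and equal-width bins are all assumptions made for the example.

```python
from statistics import mean

def preprocess(rows, field="age", bins=2):
    """Clean a list of dict records; `field` is a numeric column that may be None."""
    # 1. Missing values: replace None with the mean of the known values.
    fill = mean(r[field] for r in rows if r[field] is not None)
    rows = [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]

    # 2. Duplicates: keep only the first copy of each identical record.
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 4. Normalization: min-max scale the field to [0, 1].
    lo = min(r[field] for r in unique)
    hi = max(r[field] for r in unique)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    for r in unique:
        r[field + "_norm"] = (r[field] - lo) / span
        # 5. Discretization: equal-width bins numbered 0 .. bins-1.
        r[field + "_bin"] = min(int(r[field + "_norm"] * bins), bins - 1)
    return unique

# Hypothetical records: one missing value and one exact duplicate.
records = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},
    {"id": 1, "age": 20},
    {"id": 3, "age": 50},
]
cleaned = preprocess(records)
```

The numbered comments follow the numbering of the list. Which imputation, scaling or binning rule is appropriate depends on the mining method chosen in Stage 1.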
Stage 3: Mining the data and extracting knowledge
This is the most important stage of the knowledge discovery process. Its result is the set of patterns and/or models hidden in the data. A model is a global structural representation of a component of the system, or of the whole system, in the database, or a description of how the data arises. A pattern is a local structure that involves only a few variables and a few cases in the database.
Stage 4: Analyzing and validating the results
The fourth stage is to understand the knowledge that was found, in particular to interpret the descriptions and predictions. The results are transformed into a form that suits the application domain and is easier for users to understand.

Stage 5: Using the discovered knowledge
In this stage the discovered knowledge is consolidated and combined into a system, and potential conflicts within it are resolved. The extracted models are deployed in real information systems as decision support modules.
The stages of the knowledge discovery process are closely related within the overall system. The techniques used in one stage can affect the effectiveness of the algorithms used in later stages. The stages may be repeated several times, and the results may be averaged over all runs.


CHAPTER 2: DATA MINING

2.1. What is data mining?
Data mining is the term used for the process of discovering knowledge in databases. The process extracts hidden knowledge from data to support forecasting in business, manufacturing and other activities. Data mining reduces time and cost compared with earlier traditional methods (for example, statistical methods).
Data mining is a relatively new term that appeared in the late 1980s. It has many different definitions. Professor Tom Mitchell defined it as follows: "Data mining is the use of historical data to discover rules and improve future decisions." Taking a more applied view, Dr. Fayyad stated: "Data mining is often regarded as knowledge discovery in databases: a process of extracting hidden, previously unknown and potentially useful information in the form of regularities, constraints and rules in the database." Statisticians see data mining as "an analytical process designed to explore very large amounts of data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of the data".
In summary: data mining is one step of the knowledge discovery process. It consists of dedicated data mining algorithms that, within acceptable limits on computational efficiency, find patterns or models in data.

2.2. The data mining process

Figure 2.1. Architecture of a data mining system

Data mining is the central activity of the knowledge discovery process. Some researchers also use the term data mining to mean knowledge discovery in databases (KDD) (Fayyad, Smyth and Piatetsky-Shapiro, 1989). The process consists of 6 steps:


Figure 2.2. The data mining process

The data mining process starts with a store of raw data and ends with extracted knowledge. Its steps are as follows:
2.2.1. Data gathering
Gathering data is the first step in data mining. Data is taken from a database, a data warehouse, or even from web sources.
2.2.2. Data selection
In this step data is selected and partitioned according to some criteria. The required data is selected and extracted from the operational database into a separate database, keeping what is needed for the later steps. Collecting data into a single database is usually difficult, however, because the data is scattered across the organization and the same kind of information is recorded in different formats. For example, one place uses a string type and another uses a numeric type for the same customer attribute. Data quality also differs from place to place. The data therefore has to be selected carefully before moving on to the next step.
2.2.3. Data cleansing and preprocessing
This third step is often neglected, yet it is a very important step in the data mining process. Common faults in gathered data are that it is incomplete, inconsistent or loosely structured. The data therefore often contains meaningless values and records that cannot be linked, for example a student with age = 200. This step deals with such data (meaningless values and data that cannot be linked), which is usually regarded as redundant and worthless. If the data is not cleaned, preprocessed and prepared in advance, it will cause seriously misleading results later on.
This step has the following functions:
+ Data reconciliation: reducing the inconsistency that arises when data comes from many sources. The usual method is to remove duplicate data and unify notations.
+ Handling missing values: incomplete data leads to missing values, which is quite common. Many different methods are used to handle them.
+ Handling noise and outliers: noise may be random noise or abnormal values. To clean it, one can smooth the noise or use algorithms that detect outliers so that they can be handled.
2.2.4. Data transformation
In this step the data may be reorganized and reused. The purpose of transformation is to make the data better suited to the mining task.
2.2.5. Pattern extraction and discovery
This is the reasoning step of data mining. Many different algorithms are used here to extract patterns from the data. Commonly used ones are classification algorithms, association algorithms and sequential data modeling algorithms.
2.2.6. Evaluation of results
This is the last step of the data mining process, in which the patterns produced by the data mining software are assessed. Not every pattern is useful, and some are misleading. Criteria are therefore needed to rank the patterns so that the required knowledge can be drawn from them.
2.3. Functions of data mining
Data mining has two basic functions: prediction and description. Prediction uses some variables or fields in the database to predict unknown or future values of other variables of interest. Description focuses on finding patterns that people can understand and that describe the data. In KDD, description tends to receive more attention than prediction, in contrast to machine learning and pattern recognition applications, where prediction is usually the main goal.
Data mining brings benefits such as:
+ Providing knowledge to support decision making
+ Forecasting
+ Summarizing data
2.4. Data mining techniques
In practice there are many data mining techniques that serve the two functions of description and prediction.
- Descriptive data mining techniques describe the general properties or characteristics of the data in an existing database. Techniques in this group include clustering, summarization, visualization, and evolution and deviation analysis.
- Predictive data mining techniques make predictions based on inference over the current data. Techniques in this group include classification, regression, decision trees, statistics, neural networks and association rules.
Some of the most commonly used data mining techniques today are:
2.4.1. Classification
The goal of classification is to predict the class label of data samples. The process has two steps: building a model, and using the model to classify data (one class per sample). The model is used to predict class labels once its accuracy is acceptable.
2.4.2. Clustering
The goal of clustering is to group similar objects in the data set into clusters, so that objects in the same cluster are similar to one another.
2.4.3. Association rule mining
The goal of this method is to discover relationships between data values in the database. The output of an association rule algorithm is the set of association rules found. Association rule mining has two steps:
- Step 1: Find all frequent itemsets. An itemset is frequent if its computed support satisfies the minimum support.
- Step 2: Generate strong association rules from the frequent itemsets. The rules must satisfy the minimum support and the minimum confidence.
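The two steps can be illustrated on a toy example. The sketch below is a brute-force illustration rather than an efficient algorithm such as Apriori; the function names, the item names and the thresholds are hypothetical. Support is taken as the fraction of transactions that contain an itemset, and the confidence of a rule A -> B as support(A and B together) divided by support(A).

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 1 (brute force): every itemset whose support reaches min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= t for t in transactions) / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result

def strong_rules(itemsets, min_confidence):
    """Step 2: rules lhs -> rhs with confidence = support(itemset) / support(lhs)."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so itemsets[lhs] is always present.
                confidence = support / itemsets[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rules = strong_rules(frequent, min_confidence=0.8)
```

With these thresholds the only strong rule is butter -> bread: every transaction that contains butter also contains bread.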
2.4.4. Regression
Regression is similar to classification. The difference is that regression predicts continuous values, whereas classification predicts discrete values.
2.4.5. Genetic algorithms
A genetic algorithm simulates natural evolution. Its main idea is based on the laws of heredity, variation, natural selection and evolution in biology.
2.4.6. Neural networks
This is one of the most widely used data mining techniques today. It rests on a solid mathematical foundation, and its ability to be trained is modeled on the human central nervous system.
What a neural network learns can be used to build predictive models with high accuracy and reliability. It can detect complex trends that other common techniques find hard to detect. Neural networks are, however, very complex, and using them involves many difficulties: they require a lot of time, a lot of data and many rounds of testing.
2.4.7. Decision trees
The decision tree technique is a powerful and effective tool for classification and prediction. Data objects are divided into classes, and the unknown values of data objects are predicted. The knowledge extracted by this technique is usually expressed in an explicit, simple and intuitive form that is easy for users to understand.
2.5. Kinds of data that can be mined
- Relational databases
- Multidimensional databases
- Transactional databases
- Object-relational databases
- Spatial and temporal databases
- Multimedia databases
2.6. Fields related to data mining and applications of data mining
2.6.1. Fields related to knowledge discovery and data mining
Knowledge discovery and data mining are applied in, and draw on, many sectors and fields, such as banking and finance, commerce, healthcare, education, statistics, machine learning, artificial intelligence, databases, mathematical algorithms, high-speed parallel computing, and knowledge acquisition for expert systems.
2.6.2. Applications of data mining
Data mining is used to solve problems in many different fields. It is used for complex problems in high-technology industries, such as searching for oil fields from remote sensing images and warning of failures in production systems. It is used in planning and developing real management and production systems, for example to predict electricity load and product consumption or to segment customers. It is also applied to social problems such as crime detection and security.
Some specific applications are:
- Data mining is used to analyze data and support decision making.

- In biology: searching and comparing genomes and genetic information, finding relationships between genomes, and diagnosing some hereditary diseases.
- In medicine: finding relationships between symptoms and diagnosing diseases.
- Finance and the stock market: analyzing financial conditions, investments and stocks.
- Web data mining.
- In engineering: analyzing faults, control and scheduling.
- In commerce: analyzing user data and marketing data, analyzing investments, and detecting fraud.
2.7. Challenges and directions for knowledge discovery and data mining
The development of knowledge discovery and data mining faces the following challenges:
- Large databases (number of records, number of tables)
- High dimensionality
- Changes in data and knowledge can make discovered patterns obsolete.
- Missing or noisy data.
- Complex relationships between fields.
- Interaction with users and combination with existing knowledge.
- Integration with other systems.
The direction for knowledge discovery and data mining is to overcome all of these challenges. The focus is on extending applications to every area of social life and on increasing the usefulness of data mining in areas where it is already used; on creating flexible mining methods that handle large amounts of data efficiently; on providing good user interaction, so that users can help steer the mining process and direct the system toward interesting patterns; on integrating data mining into database systems; and on applying data mining to online web data. An important issue in this development is information security and privacy in data mining.


CHAPTER 3: DATA MINING WITH DECISION TREES

3.1. Decision trees
3.1.1. Definition
The decision tree is a powerful and widely used method for both tasks of data mining, classification and prediction. A decision tree can also be converted into an equivalent representation as a set of If-Then rules.
A decision tree is a tree-shaped structure in which each internal node represents an attribute, each branch represents a possible value of that attribute, each leaf node represents a decision class, and the topmost node is the root. A decision tree classifies a sample by starting at the root and following the branches until a leaf is reached. This classification can then be rewritten as decision rules.
Decision trees are used to build a plan for reaching a desired goal and to support the decision-making process. A decision tree is a special kind of tree structure.
Building a decision tree is a process of analyzing the database, classifying, and making predictions. The tree is built by recursively splitting a data set into subsets, each made up mostly of elements of a single class. The attribute used for each split is chosen by means of Entropy and Gain.
Example: a decision tree for classifying salary level

Age?
  <= 35: salary?
           <= 40: bad
           >  40: good
  >  35: salary?
           <= 50: bad
           >  50: good

Figure 3.1. Decision tree for classifying salary level


3.1.2. Decision tree learning
Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be represented as sets of if-then rules to make them easier for people to read. These methods are among the most widely used inductive inference algorithms and have been applied successfully to tasks ranging from learning to diagnose diseases in medicine to assessing risk in finance and economics.
3.1.3. Why decision tree learning is an attractive inductive learning method
Inductive learning methods form general hypotheses by finding empirical regularities in example data.
Among inductive learning methods, decision tree learning is attractive for 3 reasons:
1. A decision tree generalizes well to unobserved instances, provided that the instances are described in terms of features that are correlated with the target concept.
2. The methods are computationally efficient: the cost is proportional to the number of training instances.
3. The resulting tree is a representation of the concept that is easy for people to follow, because it makes the classification process explicit.

3.1.4. Advantages of decision trees
Compared with other data mining methods, decision trees have the following advantages:
- Decision trees are relatively easy to understand.
- They require only simple data preprocessing.
- They can handle both discrete and continuous data.
- A decision tree is a white-box model.
- Predictions made by a decision tree can be validated with statistical tests.
3.1.5. Building a decision tree
There are many algorithms for building decision trees, such as CLS, ID3, C4.5, SLIQ, SPRINT, EC4.5 and C5.0. In general, building a decision tree has 3 basic phases:
- Tree construction: recursively split the training samples until the samples at each leaf belong to the same class.
- Tree pruning: optimize the tree. Pruning means merging a subtree into a leaf node.
- Tree evaluation: assess the accuracy of the resulting tree. The criterion is the number of correctly classified samples divided by the total number of samples presented.
3.1.6. Extracting rules from a decision tree
A decision tree model can be converted to a rule-based model (IF ... THEN ...) and back. The two models are equivalent.
For example, the following rules can be extracted from the tree in Figure 3.1:
IF (Age <= 35) AND (salary <= 40) THEN class = bad
IF (Age <= 35) AND (salary > 40) THEN class = good
IF (Age > 35) AND (salary <= 50) THEN class = bad
IF (Age > 35) AND (salary > 50) THEN class = good
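The equivalence between the tree and the rules can be checked mechanically. The sketch below encodes the tree of Figure 3.1 as nested tuples and walks it to produce one rule per leaf. The tuple layout and the names `TREE`, `classify` and `extract_rules` are illustrative choices, not part of the report.

```python
# Internal node: (attribute, threshold, subtree_if_<=, subtree_if_>); leaf: class label.
TREE = ("Age", 35,
        ("salary", 40, "bad", "good"),
        ("salary", 50, "bad", "good"))

def classify(node, sample):
    """Follow the branches from the root until a leaf (a string) is reached."""
    while not isinstance(node, str):
        attribute, threshold, low, high = node
        node = low if sample[attribute] <= threshold else high
    return node

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):
        return ["IF " + " AND ".join(conditions) + " THEN class = " + node]
    attribute, threshold, low, high = node
    return (extract_rules(low, conditions + (f"({attribute} <= {threshold})",))
            + extract_rules(high, conditions + (f"({attribute} > {threshold})",)))

rules = extract_rules(TREE)
```

Each root-to-leaf path becomes exactly one rule, so `rules` holds the four rules listed above, and `classify(TREE, {"Age": 30, "salary": 45})` returns the same class as the matching rule.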
3.2. Decision tree algorithms for data mining
3.2.1. The CLS algorithm
This algorithm was introduced by Hovland and Hunt in the Concept Learning System (CLS) in the 1950s, and has since been known as the CLS algorithm. CLS follows a top-down, divide-and-conquer strategy. It consists of the following steps:
1. Create a node T that contains all samples of the training set.
2. If all samples in T have the decision attribute value "yes" (that is, they belong to the same class), label T "yes" and stop. T is now a leaf node.
3. If all samples in T have the decision attribute value "no" (that is, they belong to the same class), label T "no" and stop. T is now a leaf node.
4. Otherwise the samples in T belong to both classes "yes" and "no". In that case:
+ Choose an attribute X from the attribute set of the samples, where X has the values v1, v2, ..., vn.
+ Partition the samples in T into subsets T1, T2, ..., Tn according to the value of X.
+ Create n child nodes Ti (i = 1, 2, ..., n) with T as their parent.
+ Create branches from T to the nodes Ti (i = 1, 2, ..., n), labeled with the corresponding values of X.
5. Repeat the procedure for each child node Ti (i = 1, 2, ..., n), going back to step 2.
Note that in step 4 the attribute chosen to expand the tree is arbitrary. For the same training set, applying CLS with different orders of attribute selection therefore yields trees of different shapes. The choice of attribute affects the width, depth and complexity of the tree. This raises the question of which attribute order gives the best tree. The ID3 algorithm described below addresses this question.
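The effect of the arbitrary choice in step 4 can be seen in a small experiment. The sketch below is a minimal reading of CLS in which the attribute order is simply given by the caller. The sample data, the function names, and the fallback to the majority class when no attribute is left are assumptions added for the example.

```python
def cls(samples, attributes, target="class"):
    labels = [s[target] for s in samples]
    if len(set(labels)) == 1:          # steps 2-3: a pure node becomes a leaf
        return labels[0]
    if not attributes:                 # assumption: no attribute left, use the majority class
        return max(set(labels), key=labels.count)
    x, rest = attributes[0], attributes[1:]    # step 4: CLS does not say which X to pick
    children = {}
    for v in sorted({s[x] for s in samples}):  # one child T_i per value v_i of X
        children[v] = cls([s for s in samples if s[x] == v], rest, target)
    return (x, children)

def count_nodes(tree):
    """Count internal nodes and leaves of a tree returned by cls()."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1].values())

# Hypothetical data in which the class depends only on attribute "b".
samples = [
    {"a": 0, "b": 0, "class": "no"},
    {"a": 0, "b": 1, "class": "yes"},
    {"a": 1, "b": 0, "class": "no"},
    {"a": 1, "b": 1, "class": "yes"},
]
small = cls(samples, ["b", "a"])   # informative attribute first
large = cls(samples, ["a", "b"])   # uninformative attribute first
```

Both trees classify the four samples correctly, but splitting on `b` first gives 3 nodes while splitting on `a` first gives 7.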
3.2.2. The ID3 algorithm
The ID3 algorithm was formulated by Quinlan (University of Sydney, Australia) and published in the late 1970s. It was later presented in "Induction of Decision Trees", Machine Learning, 1986. ID3 is regarded as an improvement on CLS because it can select the best attribute with which to expand the tree at each step. ID3 builds the decision tree top-down.
ID3 represents concepts as decision trees. This representation lets us determine the class of an object by testing its values on certain attributes.
The task of ID3 is therefore to learn a decision tree from a set of training examples, also called training data. In other words, the algorithm has:
Input: a set of examples. Each example consists of attributes that describe a situation or an object, together with its class value.
Output: a decision tree that correctly classifies the examples in the training set and, it is hoped, also classifies unseen future examples correctly.
3.2.2.1. Entropy: measuring the homogeneity of a data set
Entropy measures how homogeneous a data set is. When the samples fall into two classes, "yes" (+) and "no" (-), the entropy of a set S is

$$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$

where $p_{+}$ is the proportion of samples in S whose decision attribute is "yes" and $p_{-}$ is the proportion whose decision attribute is "no".
In the general case, for a set S with n classes:

$$\mathrm{Entropy}(S) = \sum_{i=1}^{n} \left(-p_i \log_2 p_i\right)$$

where $p_i$ is the proportion of samples in S that belong to class i.
Special cases (for two classes):
- If all samples in S belong to the same class, Entropy(S) = 0.
- If the samples in S are evenly distributed between the two classes, Entropy(S) = 1.
- In all other cases, 0 < Entropy(S) < 1.
With n classes, the maximum is $\log_2 n$, reached when the samples are evenly distributed over the classes.
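The entropy formula and its special cases can be checked numerically with a short sketch. The function name and the count-based interface are illustrative choices. Classes with zero samples are skipped, following the usual convention that 0 log 0 = 0.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

pure = entropy([4, 0])            # all samples in one class
balanced = entropy([5, 5])        # two classes, evenly split
mixed = entropy([9, 5])           # 9 positive and 5 negative samples
four_way = entropy([1, 1, 1, 1])  # four classes, evenly split
```

Here `pure` is 0, `balanced` is 1, `mixed` is about 0.940, and `four_way` is 2, which is log2 4 and illustrates that entropy can exceed 1 when there are more than two classes.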
3.2.2.2. Information Gain
Gain measures how effective an attribute is when it is chosen for classification. It is computed from two quantities, Information and Entropy.
- For a data set S whose samples fall into n classes, the Information value of an attribute A, written Information(A), is the entropy of S before it is split on A:

$$\mathrm{Information}(A) = -\sum_{i=1}^{n} p_i \log_2 p_i = \mathrm{Entropy}(S)$$

- The Gain of attribute A on the set S, written Gain(S, A), is computed as:

$$\mathrm{Gain}(S, A) = \mathrm{Information}(A) - \mathrm{Entropy}(A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where:
S is the original set with attribute A, and v ranges over the values of attribute A.
$S_v$ is the subset of S in which attribute A has the value v.
$|S_v|$ is the number of elements of $S_v$.
$|S|$ is the number of elements of S.
Entropy(A) denotes the weighted sum of the entropies of the subsets $S_v$.
When ID3 builds a decision tree, at each expansion step the attribute chosen is the one with the largest Gain.
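As a minimal sketch of the Gain formula, the code below computes Gain(S, A) directly from a list of samples. The dictionary-based sample layout, the attribute names `x` and `y`, and the function names are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def gain(samples, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    groups = defaultdict(list)               # value v -> class labels of S_v
    for s in samples:
        groups[s[attribute]].append(s[target])
    total = len(samples)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy([s[target] for s in samples]) - weighted

# Hypothetical data: "x" determines the class, "y" carries no information.
samples = [
    {"x": "a", "y": "p", "c": "yes"},
    {"x": "a", "y": "q", "c": "yes"},
    {"x": "b", "y": "p", "c": "no"},
    {"x": "b", "y": "q", "c": "no"},
]
best = max(["x", "y"], key=lambda a: gain(samples, a, "c"))
```

Splitting on `x` removes all uncertainty (Gain = 1), while splitting on `y` removes none (Gain = 0), so `best` is `x`.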
3.2.2.3. The tree-building function of the ID3 algorithm
Function induce_tree(example_set, attribute_set)
begin
  if all examples in example_set belong to the same class then
    return a leaf node labeled with that class
  else if attribute_set is empty then
    return a leaf node labeled with the disjunction of all classes in example_set
  else begin
    select an attribute P and make it the root of the current tree;
    remove P from attribute_set;
    for each value V of P
    begin
      create a branch of the tree labeled V;
      put into partition_V the examples in example_set that have value V for attribute P;
      call induce_tree(partition_V, attribute_set) and attach the result to branch V
    end
  end
end
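The pseudocode above can be turned into a short runnable sketch. Two points are assumptions beyond the pseudocode: the attribute P is selected by the largest Gain, as ID3 prescribes, and when no attribute is left the leaf is labeled with the majority class instead of the disjunction of classes. The data layout and the example records are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def gain(examples, attribute, target):
    total = len(examples)
    weighted = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        weighted += len(subset) / total * entropy(subset)
    return entropy([e[target] for e in examples]) - weighted

def induce_tree(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples in one class: leaf
        return labels[0]
    if not attributes:                        # assumption: majority class as leaf label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]   # remove P from the attribute set
    branches = {}
    for value in sorted({e[best] for e in examples}):  # one branch per value V of P
        partition = [e for e in examples if e[best] == value]
        branches[value] = induce_tree(partition, remaining, target)
    return {best: branches}

# Hypothetical training examples.
examples = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
tree = induce_tree(examples, ["outlook", "windy"], "play")
```

The tree is returned as nested dictionaries of the form {attribute: {value: subtree or class label}}. On this data `outlook` has the larger Gain and becomes the root, and only the `rain` branch needs a further test on `windy`.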
3.2.2.4. Worked example
Consider the problem of deciding whether to play tennis under given weather conditions. ID3 learns a decision tree from the following set of examples:

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Table 2.1. Example data set for playing tennis
This data set contains 14 examples. Each example describes the weather through the attributes Outlook, Temperature, Humidity and Wind, and has a class attribute Play Tennis (Yes, No). "No" means tennis is not played in that weather, and "Yes" means it is. There are only two class values here (Yes, No), so the examples of this concept are divided into two classes. The attribute Play Tennis is also called the target attribute.
Each attribute has a finite set of values. Outlook has three values: Overcast, Rain, Sunny. Temperature has three values: Hot, Cool, Mild. Humidity has two values: High, Normal. Wind has two values: Strong, Weak. These values are the symbols used to represent the problem.
T tp d liu rn luyn ny, gii thut ID3 s hc mt cy quyt nh c
kh nng phn loi ng n cc v d trong tp ny, ng thi hy vng trong
tng lai, n cng s phn loi ng cc v d khng nm trong tp ny. Mt cy
quyt nh v d m gii thut ID3 c th quy np c l:

Hnh 2.2. Cy quyt nh thut ton ID3


Cc nt trong cy quyt nh biu din cho mt s kim tra trn mt thuc
tnh no , mi gi tr c th c ca thuc tnh tng ng vi mt nhnh ca
cy. Cc nt l th hin s phn loi ca cc v d thuc nhnh , hay chnh l gi
tr ca thuc tnh phn loi.
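To make the roles of internal nodes, branches and leaves concrete, the following Python sketch walks a tree stored as nested dictionaries. The representation and the names `TREE` and `classify` are assumptions made for illustration; the tree literal is the one derived step by step later in this section (Outlook at the root, Humidity under Sunny, Wind under Rain).

```python
# Assumed representation: an internal node is {attribute: {value: subtree}},
# a leaf is simply the class label.
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}


def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[example[attribute]]
    return node
```

For instance, example D1 (Sunny, Hot, High, Weak) follows the Sunny branch, then the High branch of Humidity, and ends in the leaf No.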
Once the algorithm has induced the decision tree, the tree is used to
classify all future examples, or instances. The tree does not change until ID3
is run again on a different training set.
For a given training set there are usually many decision trees that classify
all of its examples correctly. Their sizes differ depending on the order in
which the attributes are tested.
We have S = [9+, 5-], so
Entropy(S) = Entropy(9+, 5-)
           = -p_+ log2(p_+) - p_- log2(p_-)
           = -(9/14)log2(9/14) - (5/14)log2(5/14)
           = 0.940

Values(Outlook) = {Sunny, Overcast, Rain}
S_Sunny = [2+, 3-]
S_Overcast = [4+, 0-]
S_Rain = [3+, 2-]

\[ \mathrm{Gain}(S, \mathrm{Outlook}) = \mathrm{Entropy}(S) - \sum_{v \in \{\mathrm{Sunny}, \mathrm{Overcast}, \mathrm{Rain}\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \]

= Entropy(S) - (5/14)Entropy(S_Sunny) - (4/14)Entropy(S_Overcast) - (5/14)Entropy(S_Rain)
where:
Entropy(S) = 0.940
Entropy(S_Sunny) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.5288 + 0.4422 = 0.971
Entropy(S_Overcast) = -(4/4)log2(4/4) - (0/4)log2(0/4) = 0 + 0 = 0 (taking 0 log2 0 = 0)
Entropy(S_Rain) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.4422 + 0.5288 = 0.971
Hence:
Gain(S, Outlook) = 0.940 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) = 0.246

Values(Temperature) = {Hot, Mild, Cool}
S_Hot = [2+, 2-]
S_Mild = [4+, 2-]
S_Cool = [3+, 1-]

\[ \mathrm{Gain}(S, \mathrm{Temperature}) = \mathrm{Entropy}(S) - \sum_{v \in \{\mathrm{Hot}, \mathrm{Mild}, \mathrm{Cool}\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \]

= Entropy(S) - (4/14)Entropy(S_Hot) - (6/14)Entropy(S_Mild) - (4/14)Entropy(S_Cool)
where:
Entropy(S) = 0.940
Entropy(S_Hot) = -(2/4)log2(2/4) - (2/4)log2(2/4) = 0.5 + 0.5 = 1
Entropy(S_Mild) = -(4/6)log2(4/6) - (2/6)log2(2/6) = 0.3900 + 0.5283 = 0.9183
Entropy(S_Cool) = -(3/4)log2(3/4) - (1/4)log2(1/4) = 0.3113 + 0.5 = 0.8113
Hence:
Gain(S, Temperature) = 0.940 - (4/14)(1) - (6/14)(0.9183) - (4/14)(0.8113) = 0.029

Values(Humidity) = {High, Normal}
S_High = [3+, 4-]
S_Normal = [6+, 1-]

\[ \mathrm{Gain}(S, \mathrm{Humidity}) = \mathrm{Entropy}(S) - \sum_{v \in \{\mathrm{High}, \mathrm{Normal}\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \]

= Entropy(S) - (7/14)Entropy(S_High) - (7/14)Entropy(S_Normal)
where:
Entropy(S) = 0.940
Entropy(S_High) = -(3/7)log2(3/7) - (4/7)log2(4/7) = 0.5239 + 0.4613 = 0.9852
Entropy(S_Normal) = -(6/7)log2(6/7) - (1/7)log2(1/7) = 0.1906 + 0.4011 = 0.5917
Hence:
Gain(S, Humidity) = 0.940 - (7/14)(0.9852) - (7/14)(0.5917) = 0.152

Values(Wind) = {Weak, Strong}
S_Weak = [6+, 2-]
S_Strong = [3+, 3-]

\[ \mathrm{Gain}(S, \mathrm{Wind}) = \mathrm{Entropy}(S) - \sum_{v \in \{\mathrm{Weak}, \mathrm{Strong}\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \]

= Entropy(S) - (8/14)Entropy(S_Weak) - (6/14)Entropy(S_Strong)
where:
Entropy(S) = 0.940
Entropy(S_Weak) = -(6/8)log2(6/8) - (2/8)log2(2/8) = 0.3113 + 0.5 = 0.8113
Entropy(S_Strong) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 0.5 + 0.5 = 1
Hence:
Gain(S, Wind) = 0.940 - (8/14)(0.8113) - (6/14)(1) = 0.048

The results are:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.152
Gain(S, Wind) = 0.048
Gain(S, Outlook) is the largest value, so Outlook is chosen as the root node.
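The hand calculation above can be checked with a short Python sketch. It recomputes the information gain of each attribute directly from the 14 examples of Table 2.1; the tuple layout and the names `DATA`, `ATTRIBUTES`, `entropy` and `gain` are assumptions made for illustration. Because the script keeps full floating-point precision, its results can differ from the hand-rounded figures in the third decimal place (Outlook, for instance, comes out near 0.2467).

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Wind, Play tennis) for D1..D14
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]


def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain(rows, column):
    """Information gain of splitting `rows` on the attribute in `column`."""
    result = entropy([row[-1] for row in rows])
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        result -= len(subset) / len(rows) * entropy(subset)
    return result


if __name__ == "__main__":
    for index, name in enumerate(ATTRIBUTES):
        print(f"Gain(S, {name}) = {gain(DATA, index):.4f}")
```

Whatever the rounding, the ranking is the same: Outlook has the largest gain, followed by Humidity, Wind and Temperature.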
Outlook
    Sunny:     {D1, D2, D8, D9, D11}    [2+, 3-]   attribute 1 ?
    Overcast:  {D3, D7, D12, D13}       [4+, 0-]   Yes
    Rain:      {D4, D5, D6, D10, D14}   [3+, 2-]   attribute 2 ?

With the first level of the decision tree in place, we turn to the Sunny branch.
Computing Entropy and Gain for the Sunny branch gives:
Gain(S_Sunny, Humidity) = 0.971
Gain(S_Sunny, Temperature) = 0.571
Gain(S_Sunny, Wind) = 0.020
Humidity has the highest gain in the Sunny branch, so it is chosen as the
next node.
Proceeding in the same way for the remaining branch yields the complete
decision tree:

Outlook
    Sunny:     {D1, D2, D8, D9, D11}    [2+, 3-]   Humidity
        High:    {D1, D2, D8}     [0+, 3-]   No
        Normal:  {D9, D11}        [2+, 0-]   Yes
    Overcast:  {D3, D7, D12, D13}       [4+, 0-]   Yes
    Rain:      {D4, D5, D6, D10, D14}   [3+, 2-]   Wind
        Weak:    {D4, D5, D10}    [3+, 0-]   Yes
        Strong:  {D6, D14}        [0+, 2-]   No
By computing Gain to choose the best attribute at each expansion of the
tree, ID3 is regarded as an improvement on the CLS algorithm. However, ID3
cannot handle numeric (continuous) attributes, and it has difficulty with
missing data and noisy data. These problems are addressed by the C4.5
algorithm described next.
3.2.3. The C4.5 algorithm
C4.5 was developed and published by Quinlan in 1993. It improves on ID3 by
handling data sets with numeric attributes and by coping with missing and
noisy data. It classifies the training samples using a depth-first strategy.
The algorithm considers every test that could split the given data set and
selects the test with the best GainRatio. GainRatio is a measure of how
effective an attribute is for performing a split in the tree-growing algorithm.
3.2.3.1. Measures used to determine the best split point
Entropy: a measure of the homogeneity (purity) of a set of samples.

\[ \mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \]

where:
S is the training data set.
C_i is any class label in S, and c is the number of class labels.
p_i is the probability that an arbitrary tuple in S belongs to class C_i.
Suppose the tuples in S are partitioned on an attribute A which, without loss
of generality, has the distinct values {a_1, a_2, ..., a_v}. If A is used to
split S into v subsets, these subsets correspond to the child branches of the
current node, and the information measure obtained after partitioning into the
v subsets is:

\[ \mathrm{Entropy}_A(S) = \sum_{j=1}^{v} \frac{|S_j|}{|S|}\,\mathrm{Entropy}(S_j) \]

where |S_j| is the number of tuples assigned to the j-th subset.

Information gain: the measure of the influence of an attribute on the
classification of the samples is called information gain.
The information gain of branching on attribute A is:

\[ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \mathrm{Entropy}_A(S) \]

SplitInformation: the potential information generated by dividing the data
set into a number of subsets.

\[ \mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \]

where S_i is the subset of S containing the examples whose attribute A has the
value V_i; in this formula c is the number of distinct values of A.
Note that SplitInformation is in fact the entropy of S with respect to the
values of attribute A.
GainRatio: an evaluation that takes into account how the values of the
attribute divide the data.

\[ \mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)} \]

The gain ratio is computed for every attribute, and the attribute with the
largest gain ratio is chosen as the splitting attribute.
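As an illustration of how these measures combine, the sketch below computes the gain ratio from the class counts of each subset produced by a split. The input format (one list of per-class counts for each value of the attribute, such as [2, 3] for a subset written [2+, 3-]) and the function names are assumptions made for this example only.

```python
from math import log2


def entropy(counts):
    """Entropy of a distribution given as a list of counts (zeros are skipped)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gain_ratio(partitions):
    """Gain ratio of a split described by per-subset class counts.

    `partitions` holds one list of class counts per attribute value,
    e.g. [[2, 3], [4, 0], [3, 2]] for a three-valued attribute.
    """
    sizes = [sum(p) for p in partitions]
    total = sum(sizes)
    class_totals = [sum(column) for column in zip(*partitions)]
    conditional = sum(s / total * entropy(p) for s, p in zip(sizes, partitions))
    gain = entropy(class_totals) - conditional
    split_information = entropy(sizes)   # entropy of S with respect to A's values
    return gain / split_information if split_information else 0.0
```

With the subsets [2+, 3-], [4+, 0-] and [3+, 2-], `gain_ratio([[2, 3], [4, 0], [3, 2]])` evaluates to about 0.156. An attribute that puts every example in its own subset gets a large SplitInformation, which is how the ratio penalizes many-valued attributes.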
3.2.3.2. The decision tree construction algorithm
Input: data set D, list of attributes, set of class labels
Output: a decision tree model
Algorithm: CreateTree(data set E, attribute list F, set of class labels)
if stopping_condition(E, F) = true
    leaf = CreateNode()
    leaf.class_label = Classify(E)
    return leaf
else
    root = CreateNode()
    root.test_condition = find_best_split(E, F)
    set F = F \ {attribute chosen for the split}
    set V = {v | v is a possible outcome of root.test_condition}
    for each outcome v in V
        set E_v = {e | root.test_condition(e) = v and e in E}
        child = CreateTree(E_v, F, set of class labels)
        attach child to root as the branch for outcome v
    end for
end if
return root
3.2.3.3. Example

Day   Outlook    Temperature   Humidity   Wind     Play tennis
D1    Sunny      Hot           85         Weak     No
D2    Sunny      Hot           90         Strong   No
D3    Overcast   Hot           78         Weak     Yes
D4    Rain       Mild          96         Weak     Yes
D5    Rain       Cool          80         Weak     Yes
D6    Rain       Cool          70         Strong   No
D7    Overcast   Cool          65         Strong   Yes
D8    Sunny      Mild          95         Weak     No
D9    Sunny      Cool          70         Weak     Yes
D10   Rain       Mild          80         Weak     Yes
D11   Sunny      Mild          70         Strong   Yes
D12   Overcast   Mild          90         Strong   Yes
D13   Overcast   Hot           75         Weak     Yes
D14   Rain       Mild          80         Strong   No

- Input:
+ The weather data set.
+ Attribute list: Day, Outlook, Temperature, Humidity, Wind.
+ Class labels: Yes, No.
- Output: a decision tree model for playing tennis.
- Building the decision tree:
First tree-building pass:
find_best_split(E, F):
+ E: the weather data set.
+ F: Day, Outlook, Temperature, Humidity, Wind.
S = [9+, 5-]
+ : D3, D4, D5, D7, D9, D10, D11, D12, D13
- : D1, D2, D6, D8, D14
Entropy(S) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.940

Gain ratio for the attribute Outlook:
Outlook:  Sunny [2+, 3-]   Overcast [4+, 0-]   Rain [3+, 2-]
Entropy_Outlook(S) = (5/14)Entropy(S_Sunny) + (4/14)Entropy(S_Overcast) + (5/14)Entropy(S_Rain)
= (5/14)(-(2/5)log2(2/5) - (3/5)log2(3/5)) + (4/14)(0) + (5/14)(-(3/5)log2(3/5) - (2/5)log2(2/5))
= 0.694
Gain(S, Outlook) = Entropy(S) - Entropy_Outlook(S) = 0.940 - 0.694 = 0.246
SplitInformation(S, Outlook) = -(5/14)log2(5/14) - (4/14)log2(4/14) - (5/14)log2(5/14)
= 1.577
GainRatio(S, Outlook) = 0.246/1.577 = 0.156
Gain ratio for the attribute Wind:
Wind:  Strong [3+, 3-]   Weak [6+, 2-]
Entropy_Wind(S) = (6/14)Entropy(S_Strong) + (8/14)Entropy(S_Weak)
= (6/14)(1) + (8/14)(-(6/8)log2(6/8) - (2/8)log2(2/8))
= 0.892
Gain(S, Wind) = Entropy(S) - Entropy_Wind(S) = 0.940 - 0.892 = 0.048
SplitInformation(S, Wind) = -(6/14)log2(6/14) - (8/14)log2(8/14) = 0.985
GainRatio(S, Wind) = 0.048/0.985 = 0.049
Gain ratio for the attribute Humidity (split at the threshold 72.5):
Entropy_Humidity(S) = (4/14)Entropy(S_{<=72.5}) + (10/14)Entropy(S_{>72.5})
= (4/14)(-(3/4)log2(3/4) - (1/4)log2(1/4)) + (10/14)(-(6/10)log2(6/10) - (4/10)log2(4/10))
= 0.925
Gain(S, Humidity) = Entropy(S) - Entropy_Humidity(S) = 0.940 - 0.925 = 0.015
SplitInformation(S, Humidity) = -(4/14)log2(4/14) - (10/14)log2(10/14) = 0.863
GainRatio(S, Humidity) = 0.015/0.863 = 0.017
Gain ratio for the attribute Temperature:
Entropy_Temperature(S) = (4/14)Entropy(S_Hot) + (6/14)Entropy(S_Mild) + (4/14)Entropy(S_Cool)
= (4/14)(1) + (6/14)(-(4/6)log2(4/6) - (2/6)log2(2/6)) + (4/14)(-(3/4)log2(3/4) - (1/4)log2(1/4))
= 0.911
Gain(S, Temperature) = Entropy(S) - Entropy_Temperature(S) = 0.940 - 0.911 = 0.029
SplitInformation(S, Temperature) = -(4/14)log2(4/14) - (6/14)log2(6/14) - (4/14)log2(4/14)
= 1.557
GainRatio(S, Temperature) = 0.029/1.557 = 0.019
Gain ratio for the attribute Day:
Entropy_Day(S) = (1/14)Entropy(S_D1) + ... + (1/14)Entropy(S_D14)
= 14(1/14)(0) = 0
Gain(S, Day) = Entropy(S) - Entropy_Day(S) = 0.940 - 0 = 0.940
SplitInformation(S, Day) = 14(-(1/14)log2(1/14)) = 3.807
GainRatio(S, Day) = 0.940/3.807 = 0.247

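Choosing the best attribute for the split:
Average conditional entropy of the attributes = (0.694 + 0.892 + 0.925 + 0.911 + 0)/5
= 0.684
We have: GainRatio(S, Outlook) = 0.156
Entropy_Outlook(S) = 0.694 > 0.684
Day has the largest gain ratio, but its conditional entropy (0) is below the
average, so it is not selected; among the attributes whose conditional entropy
exceeds the average, Outlook has the largest gain ratio.
Attribute chosen for the split: Outlook

Outlook
    Sunny:     Yes - No
    Overcast:  Yes
    Rain:      Yes - No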

Second tree-building pass:
Sunny branch:
Gain ratio for the attribute Temperature:
Entropy(S_Sunny) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
Entropy_Temperature(S_Sunny) = (2/5)Entropy(S_Hot) + (2/5)Entropy(S_Mild) + (1/5)Entropy(S_Cool)
= (2/5)(0) + (2/5)(1) + (1/5)(0) = 0.4
Gain(S_Sunny, Temperature) = 0.971 - 0.400 = 0.571
SplitInformation(S_Sunny, Temperature) = -(2/5)log2(2/5) - (2/5)log2(2/5) - (1/5)log2(1/5)
= 1.522
GainRatio(S_Sunny, Temperature) = 0.571/1.522 = 0.375
Gain ratio for the attribute Humidity:
Choosing the best split value:
Entropy(S_Sunny) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971

Humidity values in the Sunny branch: 70, 85, 90, 95

Threshold   <= Yes   <= No   > Yes   > No   Gain
77.5        2        0       0       3      0.971
87.5        2        1       0       2      0.420
92.5        2        2       0       1      0.171

Outlook = Sunny
    Humidity <= 77.5:  [2+, 0-]
    Humidity > 77.5:   [0+, 3-]

Entropy_Humidity(S_Sunny) = (2/5)Entropy(S_{<=77.5}) + (3/5)Entropy(S_{>77.5})
= (2/5)(0) + (3/5)(0) = 0
Gain(S_Sunny, Humidity) = 0.971 - 0 = 0.971
SplitInformation(S_Sunny, Humidity) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971
GainRatio(S_Sunny, Humidity) = 0.971/0.971 = 1
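The search for the best split value of a numeric attribute can be sketched as follows. The sketch assumes, as in the table above, that the candidate thresholds are the midpoints between consecutive distinct values and that the candidate with the largest information gain wins; the names `candidate_thresholds`, `best_threshold` and `SUNNY` are hypothetical.

```python
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))


def candidate_thresholds(values):
    """Midpoints between consecutive distinct values, in increasing order."""
    distinct = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]


def best_threshold(pairs):
    """Return (threshold, gain) for a list of (numeric value, label) pairs."""
    labels = [label for _, label in pairs]
    base = entropy(labels)
    best = None
    for threshold in candidate_thresholds([value for value, _ in pairs]):
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        gain = (base
                - len(left) / len(pairs) * entropy(left)
                - len(right) / len(pairs) * entropy(right))
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


# Humidity and class of the Sunny examples D1, D2, D8, D9, D11
SUNNY = [(85, "No"), (90, "No"), (95, "No"), (70, "Yes"), (70, "Yes")]
```

For the Sunny examples the candidates are 77.5, 87.5 and 92.5, and `best_threshold(SUNNY)` selects 77.5 with a gain of about 0.971, the split used in the tree.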
Gain ratio for the attribute Wind:
Entropy_Wind(S_Sunny) = (2/5)Entropy(S_Strong) + (3/5)Entropy(S_Weak)
= (2/5)(1) + (3/5)(-(1/3)log2(1/3) - (2/3)log2(2/3)) = 0.951
Gain(S_Sunny, Wind) = 0.971 - 0.951 = 0.020
SplitInformation(S_Sunny, Wind) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971
GainRatio(S_Sunny, Wind) = 0.020/0.971 = 0.021
Gain ratio for the attribute Day:
Entropy_Day(S_Sunny) = (1/5)Entropy(S_D1) + (1/5)Entropy(S_D2) + (1/5)Entropy(S_D8)
+ (1/5)Entropy(S_D9) + (1/5)Entropy(S_D11) = 0
Gain(S_Sunny, Day) = 0.971 - 0 = 0.971
SplitInformation(S_Sunny, Day) = 5(-(1/5)log2(1/5)) = 2.322
GainRatio(S_Sunny, Day) = 0.971/2.322 = 0.418
Attribute chosen for the split: Humidity

Outlook
    Sunny:     Humidity
        <= 77.5:  Yes
        > 77.5:   No
    Overcast:  Yes
    Rain:      Yes - No

Rain branch:
Gain ratio for the attribute Temperature:
Entropy(S_Rain) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
Entropy_Temperature(S_Rain) = (3/5)Entropy(S_Mild) + (2/5)Entropy(S_Cool)
= (3/5)(-(2/3)log2(2/3) - (1/3)log2(1/3)) + (2/5)(1) = 0.951
Gain(S_Rain, Temperature) = 0.971 - 0.951 = 0.020
SplitInformation(S_Rain, Temperature) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
GainRatio(S_Rain, Temperature) = 0.020/0.971 = 0.021
Gain ratio for the attribute Wind:
Entropy_Wind(S_Rain) = (3/5)Entropy(S_Weak) + (2/5)Entropy(S_Strong)
= (3/5)(0) + (2/5)(0) = 0
Gain(S_Rain, Wind) = 0.971 - 0 = 0.971
SplitInformation(S_Rain, Wind) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
GainRatio(S_Rain, Wind) = 0.971/0.971 = 1
Gain ratio for the attribute Day:
Entropy_Day(S_Rain) = (1/5)Entropy(S_D4) + (1/5)Entropy(S_D5) + (1/5)Entropy(S_D6)
+ (1/5)Entropy(S_D10) + (1/5)Entropy(S_D14) = 0
Gain(S_Rain, Day) = 0.971 - 0 = 0.971
SplitInformation(S_Rain, Day) = 5(-(1/5)log2(1/5)) = 2.322
GainRatio(S_Rain, Day) = 0.971/2.322 = 0.418
Attribute chosen for the split: Wind

Outlook
    Sunny:     Humidity
        <= 77.5:  Yes
        > 77.5:   No
    Overcast:  Yes
    Rain:      Wind
        Weak:    Yes
        Strong:  No

- Rules extracted from the decision tree:
Rule 1: if (Outlook = Sunny) and (Humidity <= 77.5) then Play tennis = Yes
Rule 2: if (Outlook = Sunny) and (Humidity > 77.5) then Play tennis = No
Rule 3: if (Outlook = Overcast) then Play tennis = Yes
Rule 4: if (Outlook = Rain) and (Wind = Weak) then Play tennis = Yes
Rule 5: if (Outlook = Rain) and (Wind = Strong) then Play tennis = No
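The five rules can be applied mechanically, as in the sketch below, which stores each rule as a condition paired with a class label and returns the label of the first rule that fires. Rule 2 is read as Humidity > 77.5, which is what the tree derived above implies; the dictionary keys and the names `RULES` and `play_tennis` are assumptions made for illustration.

```python
# Each rule is (condition, class label); conditions take an example dict.
RULES = [
    (lambda e: e["Outlook"] == "Sunny" and e["Humidity"] <= 77.5, "Yes"),
    (lambda e: e["Outlook"] == "Sunny" and e["Humidity"] > 77.5, "No"),
    (lambda e: e["Outlook"] == "Overcast", "Yes"),
    (lambda e: e["Outlook"] == "Rain" and e["Wind"] == "Weak", "Yes"),
    (lambda e: e["Outlook"] == "Rain" and e["Wind"] == "Strong", "No"),
]


def play_tennis(example):
    """Return the label of the first matching rule, or None if no rule applies."""
    for condition, label in RULES:
        if condition(example):
            return label
    return None
```

Since the rules come from the root-to-leaf paths of one tree, at most one of them matches any example, so the order of the list does not affect the result.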
3.3. Pruning decision trees
The tree-building algorithms above grow each branch fully in depth until the
training samples are classified completely. Algorithms such as CLS and ID3
therefore sometimes run into trouble when the data are noisy or have missing
values and do not represent a general rule, that is, when they produce nodes
containing very few samples. If the algorithm keeps growing the tree in this
situation, the result is what is called overfitting of the decision tree.
Overfitting is a difficulty in both the study and the application of decision
trees. It is addressed by pruning the decision tree, for which there are two
approaches.
3.3.1. Prepruning
Prepruning means stopping the growth of the tree early, before it reaches
the point where the training samples are classified completely. During
construction, a node is not split further if the outcome of the split falls
within a threshold at which the classification is already nearly certain. The
node becomes a leaf and is labeled with the most common class among the
samples at that node.
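A prepruning test of this kind might look like the following sketch. The text does not fix a particular criterion, so every threshold here (`min_samples`, `max_depth`, `min_gain`) and both function names are hypothetical choices made for illustration.

```python
from collections import Counter


def should_stop(labels, depth, best_gain, min_samples=2, max_depth=5, min_gain=0.01):
    """Decide whether a node should become a leaf instead of being split.

    All thresholds are illustrative assumptions, not values from the text.
    """
    return (
        len(set(labels)) <= 1          # node is already pure
        or len(labels) < min_samples   # too few samples to justify a split
        or depth >= max_depth          # tree is deep enough
        or best_gain < min_gain        # best available split gains almost nothing
    )


def majority_label(labels):
    """Label given to a prepruned node: the most common class at that node."""
    return Counter(labels).most_common(1)[0][0]
```

A tree builder would call `should_stop` before each split and, when it returns true, create a leaf labeled `majority_label(labels)`.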
3.3.2. Postpruning
This strategy is the opposite of prepruning: the tree is first grown in full
and only then pruned, by removing branches that are not justified. While the
tree is being built, overfitting is allowed to occur. A node whose subtrees
are cut off becomes a leaf, and the leaf is labeled with the most common class
among its former children.
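One simple way to realize postpruning is reduced-error pruning, sketched below. It is a different and simpler technique than the cost-complexity, pessimistic and MDL methods named in this section: a subtree is replaced by a majority-class leaf whenever the leaf makes no more errors than the subtree on a separate validation set. The nested-dictionary tree format and all function names are assumptions.

```python
from collections import Counter


def classify(tree, example, default=None):
    """Walk a tree of the form {attribute: {value: subtree}} down to a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(example[attribute], default)
    return tree


def errors(tree, rows, target):
    """Number of rows that `tree` (or a bare leaf label) misclassifies."""
    return sum(classify(tree, row) != row[target] for row in rows)


def prune(tree, train, valid, target):
    """Bottom-up reduced-error pruning; returns the pruned tree."""
    if not isinstance(tree, dict) or not train:
        return tree
    attribute, branches = next(iter(tree.items()))
    pruned = {attribute: {
        value: prune(subtree,
                     [r for r in train if r[attribute] == value],
                     [r for r in valid if r[attribute] == value],
                     target)
        for value, subtree in branches.items()
    }}
    # Candidate leaf: most common class of the training rows at this node.
    leaf = Counter(r[target] for r in train).most_common(1)[0][0]
    if errors(leaf, valid, target) <= errors(pruned, valid, target):
        return leaf
    return pruned
```

Ties are resolved in favor of the leaf, so a subtree that no validation example reaches is also collapsed.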
In summary, pruning aims to optimize the resulting tree, in both size and
classification accuracy, by removing unsuitable (overfitted) branches. The
basic pruning techniques are:
- Using a separate set of training samples to evaluate the usefulness of
postpruning nodes of the tree. The CART algorithm uses this technique,
known as cost-complexity pruning.
- Applying statistical methods to evaluate the branches, cutting off those
with low reliability or expanding further those with high accuracy. This
technique is called pessimistic pruning and is commonly used to prune trees
built by ID3 and C4.5.
- The minimum description length (MDL) technique. It does not need to test
the samples and is commonly used in the SLIQ and SPRINT algorithms.
3.4. Evaluation of and conclusions about the decision tree algorithms
Each of the decision tree algorithms presented above has its own strengths
and weaknesses.
- CLS is one of the earliest algorithms. It applies only to databases with a
small number of attributes whose values are categorical or discrete. For
large databases with continuous-valued attributes, CLS is not effective. It
is nevertheless simple and easy to implement, and it is suited to forming
ideas and solving simple tasks.
- ID3: with ID3, Quinlan overcame the limitations of CLS (ID3 is regarded as
an improved version of CLS). The algorithm works very effectively and gives
better results than CLS. Applied several times to the same input data, ID3
produces the same result each time, because the candidate attribute at each
step of tree construction is selected deterministically. However, it still
does not handle numeric, continuous attributes, the number of attributes it
can handle is limited, and it deals only to a limited extent with missing or
noisy data.
- C4.5: to overcome the weaknesses of ID3, Quinlan proposed C4.5, an
improvement on ID3 and its successor. It handles numeric (continuous)
attributes, attributes with many values, and missing or noisy data. Its
weakness is that it works inefficiently on large databases and does not
solve the memory problem.
Despite many improvements and many new decision tree algorithms, difficult
and complex problems and many challenges remain in data mining with decision
trees.

Chapter 4: The Weka tool

4.1. Overview of the Weka software
Weka is data mining software from a research project at the University of
Waikato, New Zealand.
Goal: to build a modern tool for developing machine learning techniques
and applying them to real-world data mining problems.
WEKA is written in Java and consists of more than 600 classes organized
into 10 packages.
Main functions of the software:
- Data exploration: data preprocessing, classification, clustering, and
association rule mining.
- Model experimentation: facilities for validating and evaluating learned
models.
- Visualization of data with many kinds of plots.
4.2. The main environments
- Simple CLI: a simple command-line interface (like MS-DOS).
- Explorer (the environment mainly used here): gives access to all of
WEKA's facilities for exploring data.
- Experimenter: for running experiments and statistical tests between
machine learning models.
- KnowledgeFlow: lets you design the steps (components) of an experiment
interactively by drag and drop.

4.2.1. The Explorer environment

Preprocess: select and modify (process) the working data.
Classify: train and test machine learning models (classification, or
regression/prediction).
Cluster: learn groups from the data (clustering).
Associate: discover association rules from the data.
Select attributes: identify and select the most relevant (important)
attributes of the data.
Visualize: view interactive two-dimensional plots of the data.

WEKA reads text files in formats such as ARFF and CSV.
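To give an idea of what an ARFF file contains, the sketch below writes a small data set in that format: a `@relation` line, one `@attribute` line per attribute (nominal attributes list their values in braces, numeric ones use the keyword `numeric`), and the rows after `@data`. The helper `to_arff` and its argument layout are hypothetical, and the sketch assumes that names and values contain no spaces or commas, so no quoting is done.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    `attributes` is a list of (name, spec) pairs, where spec is either the
    string "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        kind = spec if isinstance(spec, str) else "{" + ",".join(spec) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = to_arff(
        "weather",
        [("outlook", ["sunny", "overcast", "rainy"]),
         ("humidity", "numeric"),
         ("play", ["yes", "no"])],
        [("sunny", 85, "no"), ("overcast", 78, "yes")],
    )
    print(text)
```

The resulting text can be saved with an `.arff` extension and opened from the Preprocess tab of the Explorer.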


* Input data: contains the information used to describe the problem

* Output

Chapter 5. Conclusion
Within the scope of this graduation report, we studied problems related to
data mining with decision trees. The topic has essentially been completed,
with the following results:
- An understanding of some basic data mining techniques and of the functions
and applications of data mining.
- An understanding of data mining with decision trees and of the algorithms
for building decision trees.
Application: a demo program for data mining with decision trees, using a
decision tree to predict whether or not to play tennis and to predict the
outcome of medical examination and treatment.
Future work: study further new algorithms for data mining with decision
trees, examine other data mining techniques in more depth, and build more
complex and more practical applications based on decision trees.


CONTENTS
Preface
CHAPTER 1. OVERVIEW OF DATABASES AND THE EMERGENCE OF DATA MINING
1.1. Organization and exploitation of traditional databases
1.2. New developments in organizing and exploiting databases
1.3. The knowledge discovery and data mining process
1.3.1. The knowledge discovery process is carried out in 5 steps:
Stage 1: Forming and defining the problem
Stage 2: Collecting and preprocessing data
Stage 3: Mining the data and extracting knowledge
Stage 4: Analyzing and validating the results
Stage 5: Using the discovered knowledge
CHAPTER 2: DATA MINING
2.1. What is data mining?
2.2. The data mining process
2.2.1. Gathering data (gathering)
2.2.2. Selecting data (selection)
2.2.4. Transforming data (transformation)
2.2.5. Pattern extraction and discovery
2.4. Data mining techniques
2.4.1. Classification
2.4.2. Clustering
2.4.3. Association rule mining
2.4.4. Regression
2.4.5. Genetic algorithms
2.4.6. Neural networks
2.4.7. Decision trees
2.5. Kinds of data that can be mined
2.6. Fields related to data mining and applications of data mining
2.6.1. Fields related to knowledge discovery and data mining
2.6.2. Applications of data mining
2.7. Challenges and directions for knowledge discovery and data mining
CHAPTER 3: DATA MINING WITH DECISION TREES
3.1. Decision trees
3.1.1. Definition
3.1.2. Decision tree learning
3.1.3. Why decision tree learning is an attractive inductive learning method
3.1.4. Advantages of decision trees
3.1.5. The problem of building decision trees
3.1.6. Extracting rules from decision trees
3.2. Decision tree algorithms for data mining
3.2.1. The CLS algorithm
3.2.2. The ID3 algorithm
3.2.2.4. Illustrative example
3.2.3.1. Measures used to determine the best split point
3.2.3.2. The decision tree construction algorithm
3.2.3.3. Example
3.3. Pruning decision trees
3.3.1. Prepruning
3.3.2. Postpruning
3.4. Evaluation of and conclusions about the decision tree algorithms
* Output
Chapter 5. Conclusion