
ARTIFICIAL INTELLIGENCE

APPLYING THE DECISION TREE ALGORITHM TO HELP STAFF SEND ADVISORY LETTERS TO CUSTOMERS

Supervisor: Dr. Nguyen Nhat Quang
Students: V Thnh Trung (20073070), Nguyn Hng Phc (20072236), Lu Vn ng (20070675), Nguyn Vn Hng
Class: Communications & Networking
Cohort: 52

TABLE OF CONTENTS
Preface
Content
1. Introduction to the topic
1.1. Problem statement
1.2. Problem description
2. Implementation
2.1. Decision tree algorithm: ID3
2.2. DataSet
2.3. Database
2.4. Technology used
2.5. Program demonstration
3. Directions for development
3.1. Weaknesses of ID3
3.2. Improving the algorithm: C4.5
4. Difficulties and points of debate during the work
Conclusion
References
Work allocation

PREFACE
Today, in many fields, professions, and specific jobs, people still play the central role. In practice, however, machines and technology provide more and more effective support in this work. Science and high technology have become indispensable and strongly influence our lives. They are powerful tools that help people solve problems faster, more accurately, and more efficiently, at much lower cost.

One of the most visible ways in which technology helps people is decision support: applying artificial intelligence to tasks that seem too hard or that consume a great deal of effort and time. After a period of study and investigation, our team chose the topic of applying artificial intelligence in a CRM system, specifically using a decision tree to help customer care staff send advisory letters to customers. Previously, staff had to collect information from hundreds or thousands of customers, match it against the available classes and courses, and then send invitation and advisory letters encouraging customers to enroll. This was a very hard problem that demanded a great deal of effort. Decision tree algorithms are effective enough to reduce that cost to a minimum and to support staff well.

Within the scope of the topic and of our own knowledge, we have tried to address one very small problem inside a very large subject area. With the dedicated guidance and help of Dr. Nguyen Nhat Quang, the team members came to understand the algorithm and to apply it to a concrete practical problem. We thank him for his contributions and support, and we hope to extend this project to a broader and more complete form in the future.

Our sincere thanks!
The project team

CONTENT
1. Introduction to the topic:
1.1. Problem statement:
At every company or training center, customers come to learn about the training services on offer and to find the courses, classes, and other factors that suit them, so that they can choose the best form of training. Customer care staff are responsible for giving customers the information they want, together with detailed and sensible advice. With a very large volume of customer data, however, selecting the courses and classes that suit each customer is very difficult and costs a great deal of time and effort.

Applying a decision tree algorithm to this data is a good solution. Given the criteria of each course and the corresponding needs of the customers, staff only have to operate the program's interface; the program automatically checks, compares, and returns the most suitable suggestions. Staff then know which training courses fit a particular customer. Finally, staff send an advisory letter to the customer with the most useful information.

1.2. Problem description:
- Input: the input data consists of:
+ Course/class information: course content, the certificate awarded on completion, teacher, class shift, and so on.
+ Customer information: name, age, customer category, contact details.
+ Other related information.
This information is stored in the center's database and is processed in the next step.
- Processing: from the input data described above (the dataset), the program uses a decision tree algorithm (specifically ID3) to build the decision tree model that corresponds to the existing data. From the tree it returns an assessment of how well each customer's information matches each course's information: which customer suits which class.
- Output: the returned data is the set of matching customer-class pairs. Advisory staff use these results to send advisory letters with detailed information about the most suitable class, helping each customer choose the class that fits best.

Working model of the problem

[Figure: workflow diagram. Customer information and course/class information form the INPUT (DataSet). The Process step applies the ID3 algorithm, together with a rule set if one exists, to the input data. The OUTPUT is the list of customer-course pairs, which staff use to send customers letters with the course information.]

2. Implementation:

2.1. Decision tree algorithm:
2.1.1. Description of the algorithm:
Many good data mining algorithms are in wide use, but given our knowledge and the time available for study, our team chose ID3. Some algorithms used in data mining:
Algorithm | Measure | Procedure | Pruning
----------|---------|-----------|--------
CART | Gini Diversity Index | Constructs a binary decision tree | Post-pruning based on cost-complexity measure
ID3 and C4.5 | Entropy, Info-gain | Top-down decision tree construction | Pre-pruning using a single-pass algorithm
SLIQ & SPRINT | Gini Index | Decision tree construction in a breadth-first manner | Post-pruning based on MDL principle
Proposed approach | Info gain & uncertainty coefficient | Decision tree with concepts of node merging and height balance using AVL trees | Dynamic pruning based on thresholds

- Reasons for choosing a decision tree:
+ A decision tree is easy to understand.
+ Data preparation for a decision tree is basic or unnecessary.
+ A decision tree can handle both numeric data and categorical data.
+ A decision tree is a white-box model.
+ The model can be validated with statistical tests.
+ A decision tree can process a large amount of data in a short time.

The ID3 algorithm (Iterative Dichotomiser 3), developed by Quinlan in 1979, builds a decision tree from a training dataset in which each attribute has a set of associated values.

Each non-leaf node of a decision tree corresponds to an input attribute, and each branch leaving that node corresponds to one value of the attribute. A leaf node corresponds to the expected value of the output attribute for the records described by the path from the root to that leaf. In other words, it is the final expected result after all the relevant attributes on the path have been examined according to the stated rules and constraints.

In a good decision tree, each non-leaf node corresponds to the attribute that is the most informative among all attributes not yet examined on the path from the root to the current node. That is, we want to predict the value of the output attribute by asking as few questions as possible on average; the attribute that gives the most reliable and accurate split is chosen.

a. Entropy:
Entropy measures how much information an input attribute provides about the output attribute over a set of training data. It quantifies the uncertainty of an information source: the more uncertain the source, the more information is needed to describe it. The average amount of information needed to identify a message from a source is a measure of the receiver's uncertainty about that source, and is called the entropy of the source.

Suppose the source S can emit n messages {m1, m2, ..., mn}, that these messages are independent of one another, and that the probability of each mi is pi. If S follows the distribution P = (p1, p2, ..., pn), the entropy of P is:

\[ H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i \]
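The entropy definition can be illustrated with a small Python sketch. It is only an illustration and not the report's C# implementation; the function name `entropy` and the sample labels are assumptions made for the example, and probabilities are estimated as relative frequencies of the labels.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum -p * log2(p) over the classes that actually occur,
    # where p is the relative frequency of each class.
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical labels: an even split is maximally uncertain for two classes.
even_split = entropy([True, True, False, False])
single_class = entropy([True, True, True])
```

An even two-class split gives the maximum of 1 bit, while a set whose records all share one label has entropy 0, which is why ID3 can stop splitting at such a node.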

b. Info Gain:
Now split the dataset T, according to the values of an attribute X, into the subsets T1, T2, ..., Tn. The information needed to classify an element of T after the split is the weighted average of the information needed within each subset:

\[ H(X, T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} H(T_i) \]

While building the decision tree we need to know how much information each attribute X provides. This is the difference between the information needed to classify the elements of T before the value of X is known, H(T), and the information needed after the value of X is known, H(X,T). This gives the Information Gain of attribute X on the dataset T:

\[ Gain(X, T) = H(T) - H(X, T) \]
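The two formulas above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: rows are represented as dictionaries, the helper names `entropy` and `information_gain` are hypothetical, and the four sample rows are invented for illustration (only the column names come from the report's dataset).

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Gain(X, T) = H(T) - H(X, T) for rows given as dictionaries."""
    # Partition the target labels by the value each row has for `attribute`.
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # H(X, T): entropy of each subset, weighted by its share of the rows.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical rows: GroupName separates the two classes perfectly here.
rows = [
    {"GroupName": "Network", "IsStudentLearned": True},
    {"GroupName": "Network", "IsStudentLearned": True},
    {"GroupName": "Office", "IsStudentLearned": False},
    {"GroupName": "Office", "IsStudentLearned": False},
]
gain = information_gain(rows, "GroupName", "IsStudentLearned")
```

Because the attribute splits the sample into two pure subsets, H(X,T) is 0 and the gain equals the full entropy H(T) of 1 bit.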


2.1.2. Idea of the algorithm: ID3(D, Target, Atts)
- Return value: a decision tree built from the input values
- Variables:
+ D: the training dataset given as input
+ Target: the attribute whose value the decision tree predicts
+ Atts: the set of attributes still to be tested while building the tree (not yet examined)
- Algorithm (pseudocode; here T plays the role of D, O of Target, and I of Atts):
function ID3 (I, O, T) {
  /* I is the set of input attributes
   * O is the output attribute
   * T is a set of training data
   *
   * function ID3 returns a decision tree
   */
  if (T is empty) {
    return a single node with the value "Failure";
  }
  if (all records in T have the same value for O) {
    return a single node with that value;
  }
  if (I is empty) {
    return a single node with the value of the most frequent value of O in T;
    /* Note: some elements in this node will be incorrectly classified */
  }
  /* now handle the case where we can't return a single node */
  compute the information gain for each attribute in I relative to T;
  let X be the attribute with largest Gain(X, T) of the attributes in I;
  let {x_j | j=1,2, .., m} be the values of X;
  let {T_j | j=1,2, .., m} be the subsets of T when T is partitioned according to the value of X;
  return a tree with the root node labelled X and arcs labelled x_1, x_2, .., x_m,
    where the arcs go to the trees ID3(I-{X}, O, T_1), ID3(I-{X}, O, T_2), .., ID3(I-{X}, O, T_m);
}
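The pseudocode above can be turned into a short runnable sketch. The version below is in Python rather than the C# used for the report's program, and it makes several assumptions that are not in the original: rows are dictionaries, the tree is a nested dictionary of the form {attribute: {value: subtree}}, and the six sample rows are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain(rows, attr, target):
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - remainder


def id3(attributes, target, rows):
    """Return a leaf label, or a nested dict {attribute: {value: subtree}}."""
    if not rows:
        # Mirrors the pseudocode; the recursion below never produces an
        # empty subset because it only branches on values that occur.
        return "Failure"
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the attribute with the largest information gain.
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[best]].append(r)
    return {best: {value: id3(remaining, target, subset)
                   for value, subset in partitions.items()}}


# Hypothetical rows in the style of the report's dataset.
rows = [
    {"GroupName": "Network", "TimeName": "Evening", "IsStudentLearned": True},
    {"GroupName": "Network", "TimeName": "Morning", "IsStudentLearned": True},
    {"GroupName": "Network", "TimeName": "Evening", "IsStudentLearned": True},
    {"GroupName": "Office", "TimeName": "Morning", "IsStudentLearned": True},
    {"GroupName": "Office", "TimeName": "Evening", "IsStudentLearned": False},
    {"GroupName": "Office", "TimeName": "Evening", "IsStudentLearned": False},
]
tree = id3(["GroupName", "TimeName"], "IsStudentLearned", rows)
```

On these rows GroupName has the larger gain, so it becomes the root; the Network branch is already pure and becomes a leaf, while the Office branch is split further on TimeName.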

2.1.3. Advantages of the ID3 algorithm:
- It uses hill-climbing search, guided by the Gain value, to select attributes throughout the decision tree.
- The output is a single hypothesis (exactly one result).
- It never backtracks, so it converges quickly.
- It uses the training data at every step, in contrast to algorithms that grow the tree incrementally (this helps keep the tree from becoming too large).
- Its choices are based on properties of the attributes over the whole training set, which limits the effect that errors in individual records could otherwise have on the whole result.
- It can cope with noisy or extraneous data by relaxing the criterion for accepting data that does not fit perfectly.

2.1.4. Building the decision tree:
- The tree is built from the top down (top-down approach).
- All training samples start at the root of the tree.
- An attribute is chosen to split the samples into branches. The attribute is chosen using a statistical or heuristic measure (the Entropy and Info-Gain values computed above). Among the attributes not yet used, the one with the highest Gain is added to the tree at that step. The aim is to produce the smallest possible tree: a higher Gain (equivalently, a lower H(X,T)) means the attribute is more useful for classification.
- The construction is repeated for each branch.
- Stopping conditions:
+ All samples at a node belong to the same class (leaf node).
+ No attribute remains that can be used to split the samples further.
+ No samples remain at the node.

2.2. DataSet:
Example of the dataset to be used:
CourseName | CourseCertificate | GroupName | CourseFee | TimeName | TeacherName | IsStudentLearned
-----------|-------------------|-----------|-----------|----------|-------------|-----------------
CCNA | CCNA | Network | 300 | Morning shift 1 | Nguyn Vn Cng | True
CCNP | CCNP | Network | 400 | Afternoon shift 1 | Trn Vn Nam | True
Office | MOS | Office | 200 | Morning shift 2 | Trn Trng Ti | True
CCNP | CCNP | Network | 400 | Afternoon shift 1 | Trn Vn Nam | True
CCDA | CCDA | Network | 300 | Evening shift 2 | Bo | True
CCDA | CCDA | Network | 300 | Evening shift 2 | Bo | True
CCDA | CCDA | Network | 300 | Evening shift 2 | Bo | True
Office | MOS | Office | 200 | Evening shift 2 | Trn Trng Ti | False
MCSA | MCSA | Network | 300 | Evening shift 2 | Trn Xun Chnh | False
MCSE | MCSE | Network | 1000 | Evening shift 1 | L Khnh | False
SCJP | SCJP | Programming | 200 | Evening shift 1 | Hng c | False
Office | MOS | Office | 200 | Evening shift 2 | Trn Trng Ti | False
CCSP | CCSP | Network | 300 | Afternoon shift 2 | Hng Hng | True
CCSP | CCSP | Network | 300 | Afternoon shift 2 | Hng Hng | True
CCDA | CCDA | Network | 300 | Evening shift 2 | Bo | True
CCDA | CCDA | Network | 300 | Evening shift 2 | Bo | True
CCDA | CCDA | Network | 300 | Evening shift 2 | Bo | True
Office | MOS | Office | 200 | Evening shift 2 | Trn Trng Ti | False
Office | MOS | Office | 200 | Evening shift 2 | Trn Trng Ti | False
MCSA | MCSA | Network | 300 | Evening shift 2 | Trn Xun Chnh | False
CCSP | CCSP | Network | 300 | Afternoon shift 2 | Hng Hng | True
CCNP | CCNP | Network | 400 | Afternoon shift 1 | Trn Vn Nam | True
CCDA | CCDA | Network | 300 | Morning shift 2 | Bo | True
CCSP | CCSP | Network | 300 | Afternoon shift 2 | Hng Hng | True
MCDBA | MCDBA | Programming | 350 | Morning shift 2 | Nguyn Tun Hng | False
SCJP | SCJP | Programming | 200 | Evening shift 1 | Hng c | False
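As a worked check of the Entropy and Gain formulas on the dataset above, the calculation below assumes the table is read as 26 records, of which 16 have IsStudentLearned = True and 10 have False, and that GroupName splits them into Network (18 records: 15 True, 3 False), Office (5 records: 1 True, 4 False), and Programming (3 records, all False). These counts are our reading of the extracted table, not figures stated in the report, and the results are rounded to three decimals.

```latex
\begin{align*}
H(T) &= -\tfrac{16}{26}\log_2\tfrac{16}{26} - \tfrac{10}{26}\log_2\tfrac{10}{26} \approx 0.961 \\
H(T_{\text{Network}}) &= -\tfrac{15}{18}\log_2\tfrac{15}{18} - \tfrac{3}{18}\log_2\tfrac{3}{18} \approx 0.650 \\
H(T_{\text{Office}}) &= -\tfrac{1}{5}\log_2\tfrac{1}{5} - \tfrac{4}{5}\log_2\tfrac{4}{5} \approx 0.722 \\
H(T_{\text{Programming}}) &= 0 \\
H(\text{GroupName}, T) &= \tfrac{18}{26}(0.650) + \tfrac{5}{26}(0.722) + \tfrac{3}{26}(0) \approx 0.589 \\
Gain(\text{GroupName}, T) &= H(T) - H(\text{GroupName}, T) \approx 0.961 - 0.589 = 0.372
\end{align*}
```

The same calculation would be repeated for every other attribute, and the attribute with the largest Gain would become the root of the tree.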

For the dataset above, the attributes and their value domains include, for example:
- CourseName: {MCSA, CCNP, MCDBA, SCJP, ...}
- CourseCertificate: {CCNP, MCSA, MCDBA, ...}
- and similarly for the other attributes.
Following the idea of the ID3 algorithm, we compute the entropy H(T), the values H(X,T), and the Gain of each attribute. The attribute with the largest Gain carries the most information and is chosen as the next node of the decision tree. This step is repeated until the algorithm terminates (no attributes are left to examine, or a leaf containing a single class is reached).

2.3. Database:

2.4. Technology used:
- Development environment: Microsoft Visual Studio 2008
- Programming language: C# on .NET 3.5
- Database management system: Microsoft SQL Server 2005

2.5. Program demonstration:
2.5.1. Main form and the Dataset:

2.5.2. Form shown after pressing the build-tree button:
- The decision tree built from the Dataset in MainForm

2.5.3. Form shown after pressing the demo button:
- Enter data into the text boxes and combo boxes
- Press the Test button
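The Test step in the demo form amounts to walking the decision tree with the values the user entered. The following Python sketch shows the idea only; it is not the report's C# code. The nested-dictionary tree format, the `classify` function, and the sample tree and record are all assumptions made for illustration.

```python
def classify(tree, record, default=None):
    """Walk a nested-dict decision tree until a leaf label is reached."""
    while isinstance(tree, dict):
        # Each internal node holds exactly one attribute and its branches.
        attribute, branches = next(iter(tree.items()))
        value = record.get(attribute)
        if value not in branches:
            return default  # value never seen while the tree was built
        tree = branches[value]
    return tree


# Hypothetical tree of the form {attribute: {value: subtree_or_label}}.
tree = {
    "GroupName": {
        "Network": True,
        "Office": {"TimeName": {"Morning": True, "Evening": False}},
    }
}
result = classify(tree, {"GroupName": "Office", "TimeName": "Evening"})
```

Returning a default for unseen values is one possible design choice; a real form might instead report that no prediction is available for that input.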

3. Directions for development:

3.1. Weaknesses of the ID3 algorithm:
- It is suited only to models with small amounts of discrete data.
- It does not adapt well to noisy datasets (errors arise easily).
- It is ineffective when unexpected data appear.
- The resulting decision tree can still be large and unwieldy, and is not optimized as far as it could be.

3.2. Improving the algorithm: use the C4.5 algorithm
3.2.1. The algorithm:
- It is robust to noisy data and can guard against overfitting. Overfitting occurs when details of the training data that are not needed (and could be removed entirely) are still built into the tree, so that the returned result is suboptimal and the tree is large and unwieldy. More advanced algorithms address this problem, and the final result is better optimized.
- It can handle continuous data.
- It handles attributes with missing values (gaps in the data encountered while building the tree).
- It can convert a decision tree into rules.

3.2.2. Tree pruning methods:
- Pre-pruning: stop growing a branch as soon as the data become unreliable (or the next subset is empty).
- Post-pruning: build the complete decision tree first, then gradually cut away the unreliable parts. This is done with two operations:
+ Subtree replacement: replace an unreliable subtree with a single leaf.
+ Subtree raising: move a subtree up to take the place of its parent node, redistributing the samples of the part that was cut.
The process is repeated until the whole decision tree has been examined.

3.2.3. Handling missing values:
Whenever a missing value is encountered while building the tree, it is treated as unknown data and represented as "?" in the tree; it is skipped, and the tree continues to grow from the valid data.

3.2.4. Handling continuous data (attributes with many values):
These are cases in which an attribute takes many different values. Plain Gain favors such attributes, so the solution is to account for the distribution of the attribute's values through SplitInfo(X,T):

\[ SplitInfo(X, T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} \]

The selection measure then differs from the original Gain:

\[ GainRatio(X, T) = \frac{Gain(X, T)}{SplitInfo(X, T)} \]
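To show why dividing by SplitInfo matters, the sketch below computes GainRatio in Python for two attributes. It is an illustration under assumptions: the rows, the attribute names `Id` and `Group`, and the helper names are hypothetical, and returning 0 when SplitInfo is 0 is a choice made for this sketch only.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, attr, target):
    """GainRatio(X, T) = Gain(X, T) / SplitInfo(X, T)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([r[target] for r in rows]) - remainder
    # SplitInfo is the entropy of the attribute's own value distribution.
    split_info = entropy([r[attr] for r in rows])
    # An attribute with a single value cannot split the data at all.
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical rows: "Id" is unique per row, "Group" has two values.
rows = [
    {"Id": 1, "Group": "A", "y": True},
    {"Id": 2, "Group": "A", "y": True},
    {"Id": 3, "Group": "B", "y": False},
    {"Id": 4, "Group": "B", "y": False},
]
ratio_id = gain_ratio(rows, "Id", "y")
ratio_group = gain_ratio(rows, "Group", "y")
```

Both attributes have the same Gain of 1 bit on these rows, but the unique identifier has a SplitInfo of 2 bits, so its GainRatio is half that of Group and it is no longer preferred.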

4. Difficulties and points of debate during the work:

4.1. Difficulties:
- It was difficult to find material on the C4.5 algorithm.
- We could not find material on the C5.0 algorithm.
- It was difficult to learn about the characteristics of customers and about course/class information as they are used in practice at training centers.
- Entering test data was difficult because the problem needs a fairly large amount of data; with too little data, the evaluation of the algorithm would not be objective.

4.2. Points of debate:
- Why use the ID3/C4.5 algorithms for the classification tree rather than other algorithms?
- Are the fields in the Dataset specific enough for the problem?
- Is it necessary to add user-defined rule sets, or is the decision tree alone enough to solve the problem?

CONCLUSION
Applying technology to practical work and everyday fields is very important. It helps people save effort, time, and many other costs. For a training center in particular, interacting with, supporting, advising, and caring for customers is indispensable and partly shapes the center's reputation. Whatever the scale, managers always want to give customers the most attentive care. Sending advisory letters and providing enrollment information is central to that strategy. This used to be very laborious work; with the support of new technology it becomes a simple task for the advisory staff.

The ID3 algorithm helps select the most suitable results from a very large set of data. The returned data can then be put to full use: advisory staff simply use the results for their next task, which is sending customers advisory letters about the most suitable courses and classes.

Within the scope of the topic, and given the difficulties we met, we have done our best to complete what was needed and the tasks we set. Shortcomings are nevertheless unavoidable, and we hope our teacher will continue to support and guide us so that the program can be improved further in the future. We sincerely thank our teacher for his dedicated help.

REFERENCES

- Nguyen Nhat Quang, Introduction to Artificial Intelligence (course textbook).
- Monash University, The ID3 Decision Tree Algorithm (course notes).
- Zied Elouedi, Khaled Mellouli, Philippe Smets, A Pre-pruning Method in Belief Decision Trees.
- Devi Prasad Bhukya, S. Ramachandram, Decision Tree Induction: An Approach for Data Classification Using AVL-Tree.
- J. Ross Quinlan, C4.5: Programs for Machine Learning.

WORK ALLOCATION

1. Study Machine Learning and Data Mining algorithms to apply to the problem.
- Lu Vn ng
- Nguyn Hng Phc
- V Thnh Trung
2. Study decision trees and the related algorithms.
2.1. ID3 algorithm
2.1.1. Idea and procedure of the ID3 algorithm
- Nguyn Hng Phc
- V Thnh Trung
2.1.2. Evaluation of the suitability and performance of the ID3 algorithm
- Lu Vn ng
2.2. C4.5 algorithm
2.2.1. Idea and procedure of the C4.5 algorithm
- V Thnh Trung
2.2.2. Evaluation of the suitability and performance of the C4.5 algorithm
- Lu Vn ng
3. Prepare data for the application:
3.1. Database analysis and design
- Nguyn Hng Phc
3.2. Entering test data
- Lu Vn ng
4. Implement the application:
4.1. Implement the ID3 algorithm
- V Thnh Trung
4.2. Implement the decision tree
- Nguyn Hng Phc
4.3. Build the sample application and demo the program
- Nguyn Hng Phc
- V Thnh Trung
4.4. Testing
- Nguyn Hng Phc
- V Thnh Trung