AIReport
Instructor: Dr. Nguyn Nht Quang
Students: V Thnh Trung (20073070), Nguyn Hng Phc (20072236), Lu Vn ng (20070675), Nguyn Vn Hng
TABLE OF CONTENTS
Foreword
Content
1. Introduction to the topic
1.1. Problem statement
1.2. Problem description
2. Implementation
2.1. Decision tree algorithm: ID3
2.2. DataSet
2.3. Database
2.4. Technologies used
2.5. Program demonstration
3. Directions for further development
3.1. Weaknesses of ID3
3.2. Improving the algorithm: C4.5
4. Difficulties and points of debate during the project
Conclusion
References
Work assignment
FOREWORD
In many fields, professions, and specific jobs today, people still play the central role. In practice, however, machines and technology provide more and more effective support for that work. Science and advanced technology have become indispensable and strongly influence our daily lives. They are powerful tools that help people solve problems faster, more accurately, and more efficiently, while saving considerable cost.
One of the most visible ways technology helps is decision support: applying artificial intelligence to tasks that seem too hard, or that take a great deal of effort and time. After studying and surveying the area, our group chose the topic of applying artificial intelligence in a CRM system, specifically using a decision tree to help customer care staff send advisory letters to customers. Previously, staff had to collect information from hundreds or thousands of customers, match it against information about suitable classes and courses, and then send invitations and advice encouraging customers to enroll. This is a very difficult problem that demands a great deal of effort. With the high effectiveness of decision tree algorithms, however, the cost of this work can be reduced to the lowest possible level, giving staff very good support.
Within the scope of the topic and of our own knowledge, we have tried to present a very small problem inside a very large subject area. With the dedicated guidance and help of Mr. Nguyn Nht Quang, the members of the group have come to understand the algorithm and have applied it to a concrete practical problem. We would like to thank our teacher for his contributions and support, and we hope to continue this topic on a broader and more complete scale in the future.
We sincerely thank you!
The project group
CONTENT
1. Introduction to the topic:
1.1. Problem statement:
At any company or training center, customers come to learn about the training services on offer and to find the courses, classes, and other factors that suit them, so that they can choose the best form of training. Customer care staff provide the information customers want, together with detailed and reasonable advice. With a very large volume of customer data, however, choosing the courses and classes that suit each type of customer is extremely difficult and costs a great deal of time and effort.
Applying a decision tree algorithm to process this data is a very good solution. Given the criteria of each course and the corresponding needs of customers, staff only need to operate the program's interface; the program automatically checks, compares, and produces the most accurate and suitable suggestions. From these, staff know which training courses suit a particular customer. Finally, staff send an advisory letter to the customer with the most useful information.
1.2. Problem description:
- Input: the input data set consists of:
+ Course/class information: subject content, certificate awarded on completion, instructor, class shift, and so on.
+ Customer information: name, age, customer type, contact details.
+ Other related information.
This information is stored in the center's database and is processed in the next step.
- Processing: from the input data above (the dataset), the program uses a decision tree algorithm (specifically ID3) to process the data and build the decision tree model that corresponds to the existing data. From the model it returns the best match between customer information and course information: which customer suits which class.
- Output: the returned data are the matching customer and class pairs. Advisory staff use these results to send advisory letters to customers, giving detailed information about the class that suits each customer best and helping the customer choose the most appropriate class.
Comparison of decision tree approaches (Measure, Procedure, Pruning):
- CART. Measure: Gini Diversity Index. Procedure: constructs a binary decision tree.
- ID3 and C4.5. Measure: entropy, information gain. Procedure: top-down decision tree construction.
- Proposed approach. Measure: information gain and uncertainty coefficient. Procedure: decision tree construction in a breadth-first manner, using node merging and height balancing with AVL trees. Pruning: dynamic pruning based on thresholds.
- Reasons for choosing a decision tree:
+ Decision trees are easy to understand.
+ Preparing data for a decision tree requires only basic work, or none at all.
+ Decision trees can handle both numeric data and categorical data.
+ A decision tree is a white-box model.
+ The model can be validated with statistical tests.
+ Decision trees can process a large amount of data in a short time.
The ID3 algorithm (Iterative Dichotomiser 3) was developed by Quinlan in 1979. Its purpose is to build a decision tree from a dataset in which each attribute has a set of associated values.
Each non-leaf node of a decision tree corresponds to an input attribute, and each branch leaving the node corresponds to one value of that attribute. A leaf node corresponds to the expected value of the output attribute for the path from the root to that leaf; it can be read as the final expected result after all relevant attributes on the path have been examined according to the stated rules and constraints.
In a good decision tree, each non-leaf node corresponds to the attribute that is the most informative among all attributes not yet examined on the path from the root to the current node. In other words, we want to predict the value of the output attribute while asking, on average, the smallest possible number of questions.
a. Entropy:
Entropy is used to determine how informative a particular input attribute is about the output attribute for a set of training data. It quantifies the uncertainty of an information source: the more uncertain the source, the more information is needed to describe it. The average amount of information needed to identify each message from the source is a measure of the receiver's uncertainty about the source, and is called the entropy of the source.
Suppose the source S has n messages {m1, m2, ..., mn} that are independent of one another, and let pi be the probability of message mi. If S follows the distribution P = (p1, p2, ..., pn), the entropy of P is:
\[ H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i \]
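A minimal Python sketch of the entropy calculation described above. The function name `entropy` and the sample label lists are illustrative assumptions, not taken from the report's C# program. The sum is written as p * log2(1/p), which is algebraically the same as -p * log2(p).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen for this sketch: empty input has no uncertainty
    counts = Counter(labels)
    # Sum p * log2(1/p) over the distinct labels.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical label lists (values of an output attribute such as IsStudentLearned)
balanced = entropy([True, True, False, False])  # evenly split: maximum uncertainty
pure = entropy([True, True, True, True])        # one class only: no uncertainty
skewed = entropy([True, True, True, False])
```

An evenly split two-class list gives the maximum of 1 bit, and a list with a single class gives 0, which is why ID3 treats a node whose records all share one class as a leaf.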
b. Info Gain:
Now split the data set T, according to the values of an attribute X, into the subsets T1, T2, ..., Tn. The information needed to identify an element of T is then the weighted average of the information needed to identify an element within each subset:
\[ H(X, T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} H(T_i) \]
While building the decision tree, we always need to know how much information each attribute X provides. This is the difference between the information needed to classify the elements of T before the value of X is known, H(T), and the information needed after the value of X is known, H(X, T). This gives the definition of Information Gain for attribute X on data set T:
\[ Gain(X, T) = H(T) - H(X, T) \]
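The two quantities above, H(X, T) and Gain(X, T), can be sketched in Python as follows. The function names and the four toy records are assumptions made for illustration; the column names are borrowed from the report's dataset, but the rows are invented.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def partition_entropy(records, attribute, target):
    """H(X, T): weighted average entropy of the subsets T_i produced by splitting on X."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[attribute]].append(record[target])
    total = len(records)
    return sum(len(s) / total * entropy(s) for s in subsets.values())


def information_gain(records, attribute, target):
    """Gain(X, T) = H(T) - H(X, T)."""
    h_t = entropy([record[target] for record in records])
    return h_t - partition_entropy(records, attribute, target)


# Invented toy records; only the column names come from the report.
records = [
    {"GroupName": "Network", "TimeName": "Evening", "IsStudentLearned": True},
    {"GroupName": "Network", "TimeName": "Morning", "IsStudentLearned": True},
    {"GroupName": "Office", "TimeName": "Evening", "IsStudentLearned": False},
    {"GroupName": "Office", "TimeName": "Morning", "IsStudentLearned": False},
]
gain_group = information_gain(records, "GroupName", "IsStudentLearned")
gain_time = information_gain(records, "TimeName", "IsStudentLearned")
```

In this toy data GroupName separates the two classes completely, so its gain equals H(T), while TimeName leaves every subset evenly mixed and has zero gain. ID3 would therefore split on GroupName first.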
function ID3(I, O, T) {
    /* I is the set of input attributes, O is the output attribute,
       T is the set of training records; returns a decision tree */
    if (all records in T have the same value for O) {
        return a single node with that value;
    }
    if (I is empty) {
        return a single node with the value of the most frequent value of O in T;
        /* Note: some elements in this node will be incorrectly classified */
    }
    /* now handle the case where we can't return a single node */
    compute the information gain for each attribute in I relative to T;
    let X be the attribute with largest Gain(X, T) of the attributes in I;
    let {x_j | j = 1, 2, .., m} be the values of X;
    let {T_j | j = 1, 2, .., m} be the subsets of T when T is partitioned
        according to the value of X;
    return a tree with the root node labelled X and arcs labelled
        x_1, x_2, .., x_m, where the arcs go to the trees
        ID3(I-{X}, O, T_1), ID3(I-{X}, O, T_2), .., ID3(I-{X}, O, T_m);
}
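The pseudocode above can be turned into a short runnable sketch. This is one possible Python rendering under stated assumptions, not the report's C# implementation: the nested-dict tree representation, the helper names (`gain`, `id3`, `classify`), and the toy records are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def gain(records, attribute, target):
    """Gain(X, T) = H(T) - H(X, T)."""
    total = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy([r[target] for r in records]) - remainder


def id3(attributes, target, records):
    """Return a leaf label, or {"attribute": X, "branches": {value: subtree}}."""
    if not records:
        raise ValueError("ID3 needs at least one training record")
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:      # all records share one output value
        return labels[0]
    if not attributes:             # no attribute left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(records, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        branches[value] = id3(remaining, target, subset)
    return {"attribute": best, "branches": branches}


def classify(tree, record):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][record[tree["attribute"]]]
    return tree


# Invented toy records; only the column names come from the report.
records = [
    {"GroupName": "Network", "TimeName": "Evening", "IsStudentLearned": True},
    {"GroupName": "Network", "TimeName": "Morning", "IsStudentLearned": True},
    {"GroupName": "Office", "TimeName": "Evening", "IsStudentLearned": False},
    {"GroupName": "Office", "TimeName": "Morning", "IsStudentLearned": False},
]
tree = id3(["TimeName", "GroupName"], "IsStudentLearned", records)
prediction = classify(tree, {"GroupName": "Office", "TimeName": "Morning"})
```

Because branches are created only for attribute values that occur in the training records, `classify` raises a `KeyError` for an unseen value; a fuller version would need a fallback such as the majority class at that node.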
2.1.3. Advantages of the ID3 algorithm:
- It uses hill-climbing search, guided by the Gain value, to select attributes across the whole decision tree.
- The output is a single hypothesis (exactly one result).
- It never backtracks, so it converges quickly, although the resulting tree may be only locally optimal.
- It uses the training data at every step, in contrast to algorithms that grow the decision tree incrementally; this can keep the tree from becoming too large.
- It bases its decisions on statistical properties of the data, which limits the effect that errors in individual records could otherwise have on the whole result.
- It can cope with noisy data by relaxing the criterion for accepting data that does not fit perfectly.
2.1.4. Building the decision tree:
- The tree is built from the top down (top-down method).
- All training samples start at the root of the tree.
- Choose an attribute on which to split into branches. The attribute is chosen using a statistical or heuristic measure (the Entropy and Info-Gain values computed above). Among the attributes not yet examined, the one with the highest Gain is added to the tree at that step. The aim of this choice is to produce the smallest possible tree: the larger the Gain, the more useful the attribute is for classification.
- Repeat the construction for each branch.
- Stopping conditions:
+ All samples that reach a node belong to the same class (the node becomes a leaf).
+ No attribute remains that can be used to split the samples.
+ No samples remain at the node.
2.2. DataSet:
Example of the dataset to be used:
[Dataset table; the row alignment was lost in extraction. Columns: CourseName, CourseCertificate, GroupName, CourseFee, TimeName, TeacherName, IsStudentLearned. Values that appear: CourseName and CourseCertificate: CCNA, CCNP, CCSP, CCDA, MCSA, MCSE, MCDBA, SCJP, and Office (certificate MOS); GroupName: Network, Office, Programming; CourseFee: 200, 300, 350, 400, 1000; TimeName: morning, afternoon, or evening shift (Ca Sng, Ca Chiu, Ca Ti), numbered 1 or 2; TeacherName: instructor names such as Nguyn Vn Cng, Trn Vn Nam, Trn Trng Ti, Trn Xun Chnh; IsStudentLearned: True or False.]
From the dataset above, we obtain the attributes and their value domains:
- CourseName: {MCSA, CCNP, MCDBA, SCJP}
- CourseCertificate: {CCNP, MCSA, MCDBA}
- and similarly for the other attributes.
Following the idea of the ID3 algorithm, we compute the entropy H(T), the values H(X, T), and the Gain of each attribute. The attribute with the largest Gain carries the most information and is chosen as the node at which the decision tree is built. This step is repeated until the algorithm terminates (no attributes remain to examine, or an optimal leaf has been found).
2.3. Database:
2.4. Technologies used:
- Development environment: Microsoft Visual Studio 2008
- Programming language: C# on .NET 3.5
- Database management system: Microsoft SQL Server 2005
- It handles cases where attributes have missing data (gaps in the data while the decision tree is being built).
- The decision tree can be converted into rules.
3.2.2. Tree pruning methods:
- Pre-pruning: stop growing a branch as soon as uncertain data is encountered (or the next data set is empty).
- Post-pruning: build a complete decision tree first, then gradually cut away the uncertain parts. This idea uses two operations:
+ Subtree replacement: replace an uncertain subtree with a leaf.
+ Subtree raising: move a subtree up to take the place of its parent node.
This process is repeated until the whole decision tree has been traversed.
3.2.3. Handling missing values:
Whenever missing data is encountered while building the tree, it is treated as unknown data and represented as "?" on the tree; it is skipped, and the tree continues to grow from the good data.
3.2.4. Handling continuous data (many values for one attribute):
These are cases where an attribute has many distinct values. The solution is to compute, from the proportion of records that take each value of the attribute, the quantity SplitInfo(X, T). The Gain value is then adjusted from its original form:
\[ GainRatio(X, T) = \frac{Gain(X, T)}{SplitInfo(X, T)} \]
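A minimal Python sketch of the gain ratio follows. It assumes the standard C4.5 definition of SplitInfo(X, T), namely the entropy of the distribution of X's own values, \( -\sum_i \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} \); the report does not spell this formula out. The function names, the `CustomerId` column, and the toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in Counter(values).values())


def gain_ratio(records, attribute, target):
    """GainRatio(X, T) = Gain(X, T) / SplitInfo(X, T)."""
    total = len(records)
    remainder = 0.0
    for value, count in Counter(r[attribute] for r in records).items():
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += count / total * entropy(subset)
    gain = entropy([r[target] for r in records]) - remainder
    # SplitInfo: entropy of the attribute's own value distribution (assumed definition).
    split_info = entropy([r[attribute] for r in records])
    return gain / split_info if split_info > 0 else 0.0


# Invented toy records; CustomerId is a hypothetical identifier-like column.
records = [
    {"CustomerId": "c1", "GroupName": "Network", "IsStudentLearned": True},
    {"CustomerId": "c2", "GroupName": "Network", "IsStudentLearned": True},
    {"CustomerId": "c3", "GroupName": "Office", "IsStudentLearned": False},
    {"CustomerId": "c4", "GroupName": "Office", "IsStudentLearned": False},
]
ratio_id = gain_ratio(records, "CustomerId", "IsStudentLearned")
ratio_group = gain_ratio(records, "GroupName", "IsStudentLearned")
```

Both attributes have the same plain Gain here, because each separates the classes completely. The identifier-like column is penalized by its larger SplitInfo, so the gain ratio prefers GroupName, which is the kind of bias toward many-valued attributes that the ratio is meant to correct.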
4.1. Difficulties:
- It was difficult to find documentation for the C4.5 algorithm.
- We could not find documentation for the C5.0 algorithm.
- It was difficult to learn about the characteristics of customers and of course/class information as they exist in practice at training centers.
- It was difficult to enter data for testing, because the problem requires a fairly large amount of data; with too little data, the evaluation of the algorithm would not be objective.
4.2. Points of debate:
- Why use the ID3/C4.5 algorithms for the classification tree rather than other algorithms?
- Are the fields in the dataset sufficient for the problem?
- Is it necessary to add user-defined rule sets, or is the decision tree alone enough to solve the problem?
CONCLUSION
Applying technology to practical work and everyday fields is extremely important. It helps people save effort, time, and many other costs. For an education and training center in particular, interacting with, supporting, advising, and caring for customers is indispensable and contributes to the center's brand. At any scale, managers want to give customers the most attentive and dedicated care. Sending advisory letters and providing enrollment information is the key element of that strategy. This used to be very laborious work, but with the support of new technology it is now a very simple task for advisory staff.
The ID3 algorithm helps select the most suitable results from a very large data set. The returned data can then be used to the full: advisory staff simply take the results and carry out their next task, which is sending customers advisory letters about the courses and classes that suit them best.
Within the scope of the topic and given the difficulties we encountered, we did our best to complete what was required and the tasks we had set. Some shortcomings nevertheless remain, and we hope our teacher will continue to support and guide us so that we can improve the program further in the future. We sincerely thank our teacher for his dedicated help.