
Term report for the special-topic course Data Mining & Data Warehousing

Data Classification Techniques in Data Mining
Bui Thanh Hieu

Data classification is one of the most widely studied problems in Data Mining today, with research concentrated mainly in statistics, machine learning, and neural networks. Classification is regarded as one of the most widely used data mining techniques, and it has many extensions. Combining classification with database technology is a promising field because it meets a critical requirement of database applications: a high degree of flexibility.

Given the importance of classification outlined above, this report studies classification techniques and the different approaches to them in depth. It also surveys and evaluates recent improvements to classification techniques, drawing on results published in papers at international Data Mining conferences, and looks at how classification is used in a commercial product, Microsoft SQL Server 2000.

Bui Thanh Hieu, Graduate Program, Class 1


1. Introduction to classification:

Data classification is a technique that learns from a training set, using the values (class labels) of a classifying attribute, and applies what it has learned to classify new data. Classification therefore predicts categorical class labels. A closely related technique is prediction. The difference is that classification predicts categorical class labels, whereas prediction models continuous-valued functions.

Classification is carried out in two steps: model construction and model usage.

Model construction: describe a set of predetermined classes. Each tuple or sample is assumed to belong to a predefined class, as determined by the class label attribute. The set of tuples used for model construction is called the training set. The model is represented as classification rules, decision trees, or mathematical formulas.

Model usage: the model is used to classify future data or objects whose class is unknown. Before the model is used, its accuracy is usually estimated: the known label of each test sample is compared with the class assigned by the model, and accuracy is the percentage of test samples that the model classifies correctly. The test set must be independent of the training set.

Classification is a form of supervised learning: the training data (observations, measurements, ...) are accompanied by labels indicating the class of each observation, and new data are classified based on the training set. The opposite is unsupervised learning, in which the class labels of the training data are unknown.
2. Classification by decision tree induction:
2.1. The decision tree concept:

A decision tree is a flow-chart-like tree structure: each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label or a class distribution.



Generating a decision tree involves two phases: tree construction and tree pruning. In tree construction, all the training examples start at the root and are then partitioned recursively based on the selected attributes. Tree pruning identifies and removes branches that reflect noise or outliers (elements that cannot be assigned to any class). To use a decision tree, the attribute values of a sample are tested against the tree.

2.2. The decision tree induction algorithm:

The basic algorithm (a greedy algorithm) proceeds as follows:
1. The tree is constructed recursively, top-down, in a divide-and-conquer manner.
2. At the start, all the training examples are at the root.
3. Attributes are categorical (continuous-valued attributes are discretized in advance).
4. The training examples are partitioned recursively based on the selected attributes.
5. Test attributes are selected on the basis of a heuristic or a statistical measure.

Partitioning stops when any of the following conditions holds:
1. All the training samples at a given node belong to the same class.
2. No attributes remain for further partitioning.
3. No samples remain.
2.3. Information gain in decision trees:

Information gain is the measure used for attribute selection: the attribute with the highest information gain is chosen. Suppose there are two classes, P and N, and let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary sample in S belongs to P or N is defined as:



$$I(p,n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Suppose that, using attribute A, the set S is partitioned into subsets {S1, S2, ..., Sv}. If Si contains p_i samples of P and n_i samples of N, the entropy, or expected information needed to classify the objects in all the subtrees Si, is:

$$E(A) = \sum_{i=1}^{v} \frac{p_i+n_i}{p+n}\, I(p_i,n_i)$$

The information gained by branching on A is:

$$Gain(A) = I(p,n) - E(A)$$

2.4. The basic decision tree learning algorithm ID3:

ID3 is a decision tree learning algorithm developed by Ross Quinlan (1983). Its basic idea is to construct the decision tree by a top-down search through the given sets, testing each attribute at every node of the tree. To select the attribute that is most useful for classifying the given sets, we introduce a metric, information gain. To find an optimal way to classify a learning set, we need to minimize the number of questions asked (for instance, by minimizing the height of the tree). We therefore need a function that can measure which question produces the most balanced split. The information gain metric is such a function.
ID3(Learning Set S, Attribute Set A, Attribute Values V) returns a decision tree.

Begin
    Load the learning set, create the root node 'rootNode' of the decision tree, and add the learning set S to the root node as its subset.
    For rootNode, first compute Entropy(rootNode.subset).
    If Entropy(rootNode.subset) == 0, then rootNode.subset consists of records that all have the same value for the decision attribute; return a leaf node with decision attribute: attribute value.
    If Entropy(rootNode.subset) != 0, then compute the information gain for each remaining attribute (one not yet used for splitting) and find the attribute A with Maximum(Gain(S, A)). Create the child nodes of this rootNode and add them to rootNode in the decision tree.
    For each child of rootNode, apply ID3(S, A, V) recursively until a node with entropy = 0 or a leaf node is reached.
End ID3.

Example: to illustrate how ID3 works, we use the Play Tennis example. The attributes are described as follows:

Attribute     Possible values
Outlook       sunny, overcast, rain
Temperature   hot, mild, cool
Humidity      high, normal
Windy         true, false
Decision      n (negative), p (positive)

The learning set for the Play Tennis example:

Outlook    Temperature  Humidity  Windy  Decision
sunny      hot          high      false  n
sunny      hot          high      true   n
overcast   hot          high      false  p
rain       mild         high      false  p
rain       cool         normal    false  p
rain       cool         normal    true   n
overcast   cool         normal    true   p
sunny      mild         high      false  n
sunny      cool         normal    false  p
rain       mild         normal    false  p
sunny      mild         normal    true   p
overcast   mild         high      true   p
overcast   hot          normal    false  p
rain       mild         high      true   n

ID3 proceeds as follows:
1. Create the root node (rootNode) containing the whole learning set as its subset, then compute:
   Entropy(rootNode.subset) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
2. Compute the information gain for each attribute:
   Gain(S, Windy) = Entropy(S) - (8/14) Entropy(S_false) - (6/14) Entropy(S_true) = 0.048
   Gain(S, Humidity) = 0.151
   Gain(S, Temperature) = 0.029
   Gain(S, Outlook) = 0.246
3. Select the attribute with the maximum information gain, which gives the split on Outlook.
4. Apply ID3 to each child of this root node until a leaf node or a node with entropy = 0 is reached.
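The entropy and gain computations, and the recursion in step 4, can be sketched in Python. This is a minimal illustration, not Quinlan's implementation: the function names (`entropy`, `gain`, `id3`) and the nested-dict tree representation are assumptions made for this sketch. The rows are Quinlan's standard 14-record Play Tennis learning set (9 records of class p, 5 of class n), which is the set that the figures 0.940, 0.246, 0.151, 0.048, and 0.029 are computed from.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]
# Each row: (Outlook, Temperature, Humidity, Windy, Decision)
ROWS = [
    ("sunny", "hot", "high", "false", "n"),
    ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"),
    ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"),
    ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"),
    ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"),
    ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"),
    ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"),
    ("rain", "mild", "high", "true", "n"),
]


def entropy(rows):
    """Entropy of the Decision column (last field) of the given rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def gain(rows, attr_index):
    """Information gain obtained by splitting rows on one attribute."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr_index], []).append(row)
    expected = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - expected


def id3(rows, attr_indices):
    """Return a nested-dict tree; leaves are class labels."""
    labels = Counter(row[-1] for row in rows)
    # Stop when the node is pure or no attribute is left to split on.
    if len(labels) == 1 or not attr_indices:
        return labels.most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: gain(rows, i))
    remaining = [i for i in attr_indices if i != best]
    children = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        children[value] = id3(subset, remaining)
    return {ATTRIBUTES[best]: children}


GAINS = {name: gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
TREE = id3(ROWS, list(range(len(ATTRIBUTES))))
```

On this learning set, `GAINS` ranks Outlook first, so `TREE` has Outlook at the root, with the sunny branch split on Humidity and the rain branch split on Windy.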
2.5. Shortcomings of the ID3 algorithm:

First shortcoming: the space of legal splits at a node is impoverished. A split is a partition of the instance space that results from placing a test at a decision node. ID3 and its descendants only allow a test on a single attribute, with branching on the outcome of that test.

Second shortcoming: ID3 relies heavily on the quantity of input data. Handling noise in the input data is extremely important when decision tree learning is applied to the real world. For example, when the input data are noisy, or when the number of examples is too small to be a representative sample of the true target function, ID3 can produce wrong decisions. Many extensions of the basic ID3 algorithm have been developed to apply decision tree learning to the real world, such as post-pruning of trees, handling real-valued attributes, dealing with missing attribute values, and using attribute selection criteria other than information gain.

2.6. Extensions to basic decision tree induction:

Continuous-valued attributes: dynamically define discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.

Missing attribute values: assign the most common value of the attribute, or assign a probability to each of its possible values.

Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.
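The first two extensions in section 2.6 can be illustrated with a short Python sketch. The helper names (`fill_most_common`, `value_probabilities`, `discretize`) and the use of `None` as the marker for a missing value are assumptions made for this example; they do not come from any particular decision tree package.

```python
from collections import Counter


def fill_most_common(values, missing=None):
    """Replace missing entries with the most common observed value."""
    observed = [v for v in values if v is not missing]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is missing else v for v in values]


def value_probabilities(values, missing=None):
    """Relative frequency of each observed value, usable as the
    probability assigned to each possible value of a missing entry."""
    observed = [v for v in values if v is not missing]
    counts = Counter(observed)
    return {v: c / len(observed) for v, c in counts.items()}


def discretize(value, cut_points):
    """Map a continuous value to the index of its interval.

    cut_points is an ascending list of boundaries; a value equal to a
    boundary falls in the lower interval (for example <=75 versus >75).
    """
    return sum(value > cut for cut in cut_points)


humidity = ["high", "normal", None, "high"]
filled = fill_most_common(humidity)
probabilities = value_probabilities(humidity)
```

Here `filled` replaces the missing entry with "high", and `discretize(80, [75])` returns 1, the index of the interval above 75.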
2.7. The C4.5 extension:

C4.5 extends ID3 in several respects.

When building a decision tree, C4.5 can handle training sets that contain records with unknown attribute values: the information gain, or the gain ratio, of an attribute is evaluated by considering only the records where that attribute is defined.

When using a decision tree, records with unknown attribute values can be classified by estimating the probabilities of the possible outcomes. In our golf example, given a new record where outlook is sunny and humidity is unknown, we proceed as follows. We move from the root node Outlook to the Humidity node along the arc labeled sunny. Since the value of Humidity is unknown, we observe that if humidity is at most 75 there are 2 records, in which one plays, and if humidity is greater than 75 there are 3 records, in which one does not play. The answer for the record is therefore the probabilities (0.4, 0.6) for playing and not playing golf.

C4.5 can also handle continuous values. Suppose attribute Ci has a continuous range. We examine the values of this attribute in the learning set; say that, in increasing order, they are A1, A2, ..., Am. For each value Aj, j = 1, 2, ..., m, we partition the records into those with Ci values up to and including Aj and those with values greater than Aj. For each of these partitions we compute the gain and the gain ratio, and choose the partition that maximizes the gain ratio. In our golf example, for humidity, with T the training set, we determine the information for each partition and find that the best partition is at 75. The range of this attribute then becomes {<=75, >75}. Note that this method requires a substantial amount of computation.

2.8. Pruning decision trees and deriving rule sets:

A decision tree built from the training set, because of the way it is built, deals correctly with most of the records in the training set. To do so it may in fact become quite complex, with long and uneven paths.

A decision tree is pruned by replacing a whole subtree with a leaf node. The replacement takes place when a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf. For example, consider the simple tree:

Color
    red:  Success
    blue: Failure

Suppose it was built from two training records, a red success and a blue failure, and that the test set then contains 3 red failures and one blue success. We might consider replacing this subtree with a single Failure node. Over all six records, the subtree misclassifies 4 (the three red failures and the blue success), whereas the Failure leaf misclassifies only 2 (the two successes).

Winston shows how to use Fisher's exact test to determine whether the category attribute truly depends on a non-categorical attribute. If it does not, the non-categorical attribute need not appear in the current path of the decision tree. Quinlan and Breiman suggest more sophisticated pruning heuristics.

It is easy to derive a rule set from a decision tree: write one rule for each path in the tree from the root to a leaf. The left-hand side of the rule is easily built from the labels of the nodes and the labels of the arcs. The resulting rules can be simplified as follows. Let LHS be the left-hand side of a rule, and let LHS' be obtained from LHS by eliminating some of its conditions. We can certainly replace LHS by LHS' in this rule if the subsets of the training set that satisfy LHS and LHS' respectively are equal. A rule may also be eliminated by using a metacondition such as "if no other rule applies".

2.9. The See5/C5.0 extension:

See5 is a state-of-the-art system that constructs classifiers in the form of decision trees and rule sets. It is designed to operate on large databases and incorporates innovations such as boosting. See5 and C5.0 produce similar results: the former runs on Windows 95/98/NT, and C5.0 is its Unix counterpart. Both are sophisticated data mining tools for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions. The main features of C5.0 are:
- C5.0 is designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric or nominal fields.
- To maximize interpretability, See5/C5.0 classifiers are expressed as decision trees or sets of if-then rules, forms that are easier to understand than neural networks.
- C5.0 is easy to use and does not presume advanced knowledge of statistics or machine learning.
2.10. See5/C5.0 is better than C4.5:



C5.0 on Unix and its Windows counterpart See5 are superior to C4.5 in a number of important respects. We compare C5.0 and C4.5 on the same Unix system.

Rule sets, much faster and much less memory: both C5.0 and C4.5 offer a choice of classifiers in the form of decision trees or rule sets. In many applications rule sets are preferred because they are simpler and easier to understand than decision trees. However, the rule-finding methods in C4.5 are slow and use a lot of memory. C5.0 uses improved methods for generating rule sets, and the improvement is dramatic.

Decision trees, faster and smaller: on the same data sets, C4.5 and C5.0 produce trees with similar predictive accuracy. The main differences are tree size and computation time: C5.0's trees are smaller, and C5.0 is faster by some factor.

Boosting: based on the research of Freund and Schapire, this is an exciting new development that has no counterpart in C4.5. Boosting is a technique for generating and combining multiple classifiers to improve predictive accuracy. C5.0 supports boosting with a configurable number of trials. Naturally, it takes longer to produce boosted classifiers, but the results can justify the additional computation. Boosting should always be tried when peak predictive accuracy is required, especially when the unboosted classifiers are already quite accurate.

New functionality: C5.0 incorporates several new facilities, such as variable misclassification costs. In C4.5 all errors are treated as equal, but in practical applications some classification errors are more serious than others. C5.0 allows a separate cost to be defined for each pair of predicted and actual classes. If this option is used, C5.0 then constructs classifiers that minimize expected misclassification costs rather than error rates.

C5.0 has more data types than C4.5, including dates, times, ordered discrete attributes, and case labels. In addition to missing values, C5.0 allows values to be marked as not applicable. Furthermore, C5.0 provides facilities for defining new attributes as functions of other attributes.



Some recent data mining applications are characterized by very high dimensionality, with hundreds or even thousands of attributes. C5.0 can automatically winnow the attributes before a classifier is constructed, discarding those that appear to be only marginally relevant. For applications of this kind, winnowing can lead to smaller classifiers and higher predictive accuracy, and can even reduce the time required to generate rule sets.

C5.0 is also easier to use. Options have been simplified and extended, for example to support sampling and cross-validation, and the C4.5 programs for generating decision trees and rule sets have been merged into a single program.

The Windows version, See5, has a user-friendly graphical interface and adds some further facilities. For example, the Cross-Reference Window makes classifiers more understandable by linking cases to the relevant parts of the classifier.
2.11. Classification with the gini index (IBM Intelligent Miner):

Analogous to the gain measure above, IBM introduced a measure for classification called the gini index. If a data set T contains samples from n classes, the gini index gini(T) is defined as:

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, and N = N1 + N2, the gini index of the split data, gini_split(T), is defined as:

$$gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)$$

The attribute that provides the smallest gini_split(T) is chosen to split the node.

Knowledge is represented in the form of IF-THEN rules. One rule is created for each path from the root to a leaf. Each attribute-value pair along a path forms a conjunction, and the leaf node holds the class prediction. The resulting rules are easy for people to understand.
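The gini index and the gini index of a binary split can be computed with a few lines of Python. This is only a sketch of the two definitions: the function names `gini` and `gini_split` are chosen for this example, and the label counts below are illustrative (14 samples, 9 of class p and 5 of class n, split into two subsets of 7), not output from IBM Intelligent Miner.

```python
from collections import Counter


def gini(labels):
    """gini(T) = 1 - sum of p_j^2 over the classes present in labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted gini index of a binary split into left and right."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)


# Illustrative split: T1 holds 3 p and 4 n, T2 holds 6 p and 1 n.
t1 = ["p"] * 3 + ["n"] * 4
t2 = ["p"] * 6 + ["n"] * 1
root_gini = gini(t1 + t2)
split_gini = gini_split(t1, t2)
```

A candidate split is attractive when `split_gini` is well below `root_gini`; among all candidate attributes, the one with the smallest split value would be chosen.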



2.12. Avoiding overfitting in classification:

The generated tree may overfit the training data. Overfitting shows up as follows:
- Too many branches, some of which may reflect anomalies caused by noise or outliers.
- Poor accuracy on unseen samples.

There are two approaches to avoiding overfitting:
- Prepruning: halt tree construction early, and do not split a node if doing so would make the goodness measure fall below a threshold. The difficulty with prepruning is choosing an appropriate threshold.
- Postpruning: remove branches from a fully grown tree to obtain a sequence of progressively pruned trees, then use a set of data different from the training data to decide which is the best pruned tree.

Approaches to determining the final tree size:
- Separate the data into a training set (2/3) and a test set (1/3), or use cross-validation.
- Use all the data for training, but apply a statistical test to estimate whether expanding or pruning a node may improve the entire distribution.
- Use the minimum description length principle: halt the growth of the tree when the encoding is minimized.

3. Decision tree classification in large databases:

Classification is a classical problem that has been studied extensively by statisticians and machine learning researchers. The current direction is to classify data sets with billions of examples and hundreds of attributes at reasonable speed.

Decision tree induction is highly regarded in large-scale data mining for the following reasons:
- Its learning speed is relatively fast compared with other classification methods.
- Trees can be converted into simple, easy-to-understand classification rules.
- SQL queries can be used to access the database.
- Its classification accuracy is comparable with that of other methods.

Decision tree induction methods from research on mining large data sets:

3.1. SLIQ: A Fast Scalable Classifier for Data Mining:

Classification algorithms have been designed only for memory-resident data. This method discusses how to build a scalable classifier and presents SLIQ (Supervised Learning In Quest) as a new classifier. SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. It uses a pre-sorting technique in the tree-growth phase. This pre-sorting procedure is integrated with a breadth-first tree-growing strategy to allow classification of disk-resident data sets. SLIQ also uses a tree-pruning algorithm that is inexpensive and yields compact, accurate trees. The combination of these techniques lets SLIQ scale to large data sets and classify data sets irrespective of the number of classes, attributes, and records. In this method the training data set does not have to fit in memory.


3.2. From SLIQ to SPRINT: A Scalable Parallel Classifier for Data Mining:

This method presents a decision-tree-based classification algorithm that removes the memory restrictions, runs fast, and is scalable. The algorithm is designed to be parallelized easily, allowing many processors to work together to build a single consistent model.
- The class list in SLIQ must reside in memory.
- Bottleneck: the class list can be large.
- SPRINT puts the class information into the attribute lists and has no class list.
- Parallel classification: the attribute lists are partitioned.

3.3. PUBLIC: integrating decision tree building and pruning:

This method presents an improved decision tree classifier that integrates the pruning phase with the initial building phase. In PUBLIC, a node is not expanded during the building phase if it is determined that it would be pruned during the subsequent pruning phase. To make this decision for a node before it is expanded, PUBLIC computes a lower bound on the minimum cost of a subtree rooted at the node. PUBLIC uses this estimate to identify the nodes that are certain to be pruned, and spends no effort on splitting such nodes. Integrating growing and pruning: at each node, check the cost of growing its subtrees.
3.4. RainForest: A Generic Framework:

This method presents a unifying framework for decision tree classifiers that separates the scalability aspects of algorithms for constructing a decision tree from the central features that determine the quality of the tree. The generic algorithm is easy to instantiate with specific algorithms from the literature, including C4.5, CART, CHAID, FACT, ID3 and its extensions, SLIQ, SPRINT, and QUEST.

The scalability bottleneck is computing the AVC-group (Attribute-Value, Class label) for each node. RainForest presents a set of algorithms for computing AVC-groups quickly.
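To make the AVC-group concrete, the following Python sketch counts, for each attribute at a node, how many rows have each (attribute value, class label) combination. It is an in-memory illustration only; the function name `avc_group` and the sample rows are assumptions for this example, and RainForest's own algorithms are concerned with computing these counts efficiently when the data do not fit in memory.

```python
from collections import Counter, defaultdict


def avc_group(rows, attribute_names):
    """Build the AVC-group of a node.

    Returns one AVC-set per attribute; each AVC-set maps
    (attribute value, class label) to the number of matching rows.
    The class label is the last field of every row.
    """
    group = defaultdict(Counter)
    for row in rows:
        *values, label = row
        for name, value in zip(attribute_names, values):
            group[name][(value, label)] += 1
    return dict(group)


# Hypothetical rows at one tree node: (Outlook, Humidity, class label)
node_rows = [
    ("sunny", "high", "n"),
    ("sunny", "normal", "p"),
    ("rain", "high", "p"),
    ("rain", "high", "n"),
    ("overcast", "high", "p"),
]
group = avc_group(node_rows, ["Outlook", "Humidity"])
```

The counts in an AVC-set are sufficient to evaluate split criteria such as information gain or the gini index for that attribute, without revisiting the rows themselves.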
3.5. Data cube-based decision tree induction:

Data cube-based decision tree induction integrates generalization with decision tree induction. Cube-based multi-level classification raises two important issues: relevance analysis at multiple levels, and information gain analysis with dimension and level.

4. Bayesian classification:

Bayesian theory provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions, and that optimal decisions can be made by reasoning about these probabilities together with the observed data. It is important to machine learning because it provides a quantitative approach to weighing the evidence that supports alternative hypotheses. Bayesian theory provides the basis for learning algorithms that manipulate probabilities directly, as well as a framework for analyzing the operation of algorithms that do not manipulate probabilities explicitly.
- Probabilistic learning: calculate explicit probabilities for hypotheses. This is among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making.

Bayes' theorem: given a training set D, the posterior probability of a hypothesis h, P(h|D), is given by:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

The maximum a posteriori (MAP) hypothesis:

$$h_{MAP} = \arg\max_{h} P(h \mid D) = \arg\max_{h} P(D \mid h)\,P(h)$$


The practical difficulty of Bayesian classification is that it requires initial knowledge of many probabilities and has a significant computational cost.

Naive Bayes classification: a simplifying assumption is made, namely that the attributes are conditionally independent. This greatly reduces the computation cost, since only the class distribution has to be counted. Given a training set, we can compute the required probabilities.

The classification problem can be formalized using a posteriori probabilities: P(C|X) is the probability that the sample X = <x1, ..., xk> is of class C. The idea is to assign to sample X the class label C such that P(C|X) is maximal.

Bayes' theorem states:

P(C|X) = P(X|C).P(C) / P(X)

where:
- P(X) is constant for all classes;
- P(C) is the relative frequency of class C samples;
- the C for which P(C|X) is maximal is the C for which P(X|C).P(C) is maximal.

Naive Bayesian classification: the naive assumption is attribute independence:

P(x1, ..., xk|C) = P(x1|C) ... P(xk|C)

If the i-th attribute is categorical, P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C. If the i-th attribute is continuous, P(xi|C) is estimated through a Gaussian density function. The computation is easy in both cases.

Play Tennis example: classifying X. Consider the unseen sample X = <rain, hot, high, false>:


P(X|p).P(p) = P(rain|p).P(hot|p).P(high|p).P(false|p).P(p) = 3/9 . 2/9 . 3/9 . 6/9 . 9/14 = 0.010582
P(X|n).P(n) = P(rain|n).P(hot|n).P(high|n).P(false|n).P(n) = 2/5 . 2/5 . 4/5 . 2/5 . 5/14 = 0.018286

Sample X is therefore classified into class n.

[Figure: the Naive Bayes algorithm in pseudocode]
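The two products for sample X can be reproduced with a small Python sketch of a categorical Naive Bayes classifier. It is an illustration under stated assumptions rather than the report's pseudocode: the function names (`train_naive_bayes`, `score`, `classify`) are invented for this example, probabilities are plain relative frequencies with no smoothing, and the rows are Quinlan's standard 14-record Play Tennis learning set, from which the fractions 3/9, 2/9, 3/9, 6/9 and 2/5, 2/5, 4/5, 2/5 are counted.

```python
from collections import Counter, defaultdict

# Each row: (Outlook, Temperature, Humidity, Windy, Decision)
ROWS = [
    ("sunny", "hot", "high", "false", "n"),
    ("sunny", "hot", "high", "true", "n"),
    ("overcast", "hot", "high", "false", "p"),
    ("rain", "mild", "high", "false", "p"),
    ("rain", "cool", "normal", "false", "p"),
    ("rain", "cool", "normal", "true", "n"),
    ("overcast", "cool", "normal", "true", "p"),
    ("sunny", "mild", "high", "false", "n"),
    ("sunny", "cool", "normal", "false", "p"),
    ("rain", "mild", "normal", "false", "p"),
    ("sunny", "mild", "normal", "true", "p"),
    ("overcast", "mild", "high", "true", "p"),
    ("overcast", "hot", "normal", "false", "p"),
    ("rain", "mild", "high", "true", "n"),
]


def train_naive_bayes(rows):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row in rows:
        *values, label = row
        for i, value in enumerate(values):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def score(x, label, class_counts, value_counts):
    """P(X|C).P(C) under the naive independence assumption."""
    total = sum(class_counts.values())
    result = class_counts[label] / total
    for i, value in enumerate(x):
        result *= value_counts[(i, label)][value] / class_counts[label]
    return result


def classify(x, class_counts, value_counts):
    """Return the class C that maximizes P(X|C).P(C)."""
    return max(class_counts, key=lambda c: score(x, c, class_counts, value_counts))


CLASS_COUNTS, VALUE_COUNTS = train_naive_bayes(ROWS)
X = ("rain", "hot", "high", "false")
SCORE_P = score(X, "p", CLASS_COUNTS, VALUE_COUNTS)
SCORE_N = score(X, "n", CLASS_COUNTS, VALUE_COUNTS)
PREDICTED = classify(X, CLASS_COUNTS, VALUE_COUNTS)
```

Because P(X) is the same for every class, comparing `SCORE_P` and `SCORE_N` is enough to pick the class; no normalization is needed.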

The independence hypothesis:

The independence hypothesis makes the computation easy. The optimal classifier it promises is rarely achieved in practice, however, because attributes (variables) are usually correlated. Two approaches are used to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes.
- Decision trees, which reason on one attribute at a time, considering the most important attributes first.

Bayesian belief networks: a Bayesian belief network allows a subset of the variables to be conditionally independent. It uses a graphical model of causal relationships. There are several cases of learning Bayesian belief networks:
- Both the network structure and all the variables are given: this is the easy case.
- The network structure is given, but only some of the variables are observed.
- The network structure is entirely unknown.

5. Classification by backpropagation:

Neural networks: the structure of a neuron is as follows:

[Figure: structure of a neuron]

The n-dimensional input vector x is mapped to the variable y by means of a scalar product and a nonlinear mapping function.

Network training: the ultimate objective of training is to obtain a set of weights that makes almost all the tuples in the training set classified correctly. The steps of the training process are:
- Initialize the weights with random values.
- Feed the input tuples into the network one by one.
- For each unit:
  - Compute the net input to the unit as a linear combination of all the inputs to the unit.
  - Compute the output value using the activation function.
  - Compute the error.
  - Update the weights and the bias.

Network pruning and rule extraction:
- Network pruning: a fully connected network is hard to articulate. N input nodes, h hidden nodes, and m output nodes lead to h(m+N) weights.
- Pruning: remove some of the links without affecting the classification accuracy of the network.

The advantages of neural networks include:
- High prediction accuracy.
- Robustness: they work even when the samples contain errors.
- The output can be a discrete value, a real value, or a vector of several discrete or real-valued attributes.
- Fast evaluation of the learned target function.

Weaknesses:
- Long training time.
- The learned function (the weights) is hard to understand.
- Domain knowledge is hard to incorporate.

6. Classification based on association rule mining:

There are several methods for association-based classification:
- ARCS: quantitative association mining and clustering of association rules.
- Associative classification: mine rules with high support and high confidence of the form cond_set => y, where y is a class label.
- CAEP (Classification by Aggregating Emerging Patterns): emerging patterns are itemsets whose support increases significantly from one class to another. Emerging patterns are mined based on minimum support and growth rate.



7. Other classification methods:
7.1. Distance-based (instance-based) methods:

These methods store the training examples and delay processing until a new instance has to be classified. Typical approaches:
- k-nearest neighbor.
- Locally weighted regression: constructs local approximations.
- Case-based reasoning: uses symbolic representations and knowledge-based inference.

The k-nearest neighbor algorithm: all instances correspond to points in an n-dimensional space. The nearest neighbors are defined in terms of Euclidean distance. The target function can be discrete-valued or real-valued. For discrete values, k-NN returns the most common value among the k training examples nearest to xq. Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

The k-NN algorithm for continuous-valued target functions computes the mean value of the k nearest neighbors.

The distance-weighted nearest neighbor algorithm weights the contribution of each neighbor according to its distance to the query point xq, giving greater weight to closer neighbors:

w = 1 / d(xq, xi)^2

The same applies to real-valued target functions. Averaging over the k nearest neighbors makes the method robust to noisy data.

Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome this, stretch the axes or eliminate the least relevant attributes.
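The k-NN variants described in section 7.1 can be sketched in Python as follows. This is a minimal brute-force illustration: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy training points are assumptions for this example, and returning the label of an exact match directly is a design choice made here to avoid dividing by a zero distance.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available from Python 3.8


def k_nearest(train, query, k):
    """Return the k (point, target) pairs closest to query."""
    return sorted(train, key=lambda item: dist(item[0], query))[:k]


def knn_classify(train, query, k):
    """Distance-weighted vote with weight w = 1 / d(xq, xi)^2."""
    votes = defaultdict(float)
    for point, label in k_nearest(train, query, k):
        d = dist(point, query)
        if d == 0:
            return label  # exact match: use its label directly
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)


def knn_regress(train, query, k):
    """Mean target value of the k nearest neighbors."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)


labelled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
predicted_class = knn_classify(labelled, (1, 1), k=3)

numeric = [((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0), ((10.0,), 10.0)]
predicted_value = knn_regress(numeric, (2.0,), k=3)
```

With k = 3, the query (1, 1) has two close neighbors of class a and one distant neighbor of class b, so the weighted vote favors a; the regression query averages the targets 1.0, 2.0, and 3.0.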



7.2. Genetic algorithms:

Genetic algorithms are based on an analogy to biological evolution. Each rule is represented by a string of bits. An initial population is created, consisting of randomly generated rules. Following the notion of survival of the fittest, a new population is formed from the fittest rules. The fitness of a rule is measured by its classification accuracy on a set of training examples. Offspring are generated by crossover and mutation.

7.3. The rough set approach:

Rough sets are used to approximately, or "roughly", define equivalence classes. A rough set for a class C is approximated by two sets: a lower approximation (elements certain to be in C) and an upper approximation. Finding the minimal subsets of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computational effort.

7.4. The fuzzy set approach:

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (for example, by means of a fuzzy membership graph). Attribute values are converted to fuzzy values. For a given new sample, more than one fuzzy value may apply. Each applicable rule contributes a vote for membership in the categories. Typically, the truth values for each predicted category are summed.

7.5. Classification by case-based reasoning:

Instances are represented by rich symbolic descriptions (for example, function graphs). Multiple retrieved cases may be combined, and case retrieval is coupled with knowledge-based reasoning and problem solving.

8. Prediction and classification:

Prediction is similar to classification. It proceeds as follows: first construct a model, then use the model to predict unknown values. The main method for prediction is regression, of which there are several kinds: linear regression, multiple regression, and nonlinear regression. Prediction differs from classification in that classification predicts categorical class labels, whereas prediction models continuous-valued functions.



Predictive modeling: predict data values or construct generalized linear models based on the data in the database. The main steps of the method are:
- Minimal generalization.
- Attribute relevance analysis.
- Construction of a generalized linear model.
- Prediction.

Determine the major factors that influence the prediction. Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, and so on. Multi-level prediction: drill-down and roll-up analysis.

Regression analysis and log-linear models:
- Linear regression: Y = α + βX. The two parameters α and β specify the line and are estimated from the data at hand.
- Multiple regression: Y = b0 + b1X1 + b2X2. Many nonlinear functions can be transformed into the forms above.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables: p(a,b,c,d) = α_ab β_ac χ_ad δ_bcd.

Locally weighted regression: construct an explicit approximation to f over a local region surrounding a query instance xq. Locally weighted linear regression: the target function f is approximated near xq using a linear function: f(x) = w0 + w1 a1(x) + ... + wn an(x).

Classification accuracy: estimating error rates.
- Partition into training and test sets: use two independent data sets, a training set (2/3) and a test set (1/3). This is used for data sets with a large number of samples.
- Cross-validation: divide the data set into k subsamples. Use k-1 subsamples as training data and one subsample as test data.

9. Data mining with Microsoft OLE DB:


Why OLE DB for data mining? An industry standard is critical for the development, use, interoperability, and exchange of data mining. OLE DB for Data Mining is a natural evolution from OLE DB and OLE DB for OLAP. Building mining applications over relational databases is nontrivial: it needs different customized data mining algorithms and significant work on the part of application builders. The goal is to remove the burden of developing mining applications over large relational databases.

Motivation for OLE DB for Data Mining:
- Make it easier to deploy data mining models:
  - generate data mining models;
  - store, maintain, and refresh models as data is updated;
  - programmatically use a model on other data sets;
  - browse models.
- Enable enterprise application developers to take part in building data mining solutions.

Features of OLE DB for Data Mining:



- Independent of the software provider.
- Not tied to any specific mining model.
- Structured to supply data to all well-known mining models.

Overview:
- The core relational engine exposes a language-based API.
- The analysis server exposes OLE DB for OLAP and OLE DB for Data Mining.
- The SQL metaphor is maintained.
- Existing notions are reused.

Key operations that support data mining models:
- Define a mining model:
  - the attributes to be predicted;
  - the attributes to be used for prediction;
  - the algorithm used to build the model.
- Populate a mining model from training data.
- Predict attributes for new data.
- Browse a mining model for reporting and visualization.

A data mining model (DMM) is analogous to a table in SQL:
- Create a data mining model object: CREATE MINING MODEL [model_name]
- Insert training data into the model and train it: INSERT INTO [model_name]
- Use the data mining model: SELECT relation_name.[id], [model_name].[predict_attr]. The DMM content is consulted to make predictions and to browse the statistics held by the model.
- Use DELETE to empty or reset the model.
- Prediction on data sets: a prediction join between a model and a data set (table).
- A DMM is deployed just by writing SQL statements.

Two main components:
- Cases and case sets: the input data, given as a table or as nested tables (for hierarchical data).
- The data mining model: a special kind of table. A case set is associated with a DMM, together with meta-information, when the DMM is created. The DMM stores the mining algorithm and the resulting abstraction instead of the data itself.

Fundamental operations: CREATE, INSERT INTO, PREDICTION JOIN, SELECT, DELETE FROM, and DROP.

Logical nested-table representation of a caseset: The Data Shaping Service is used to generate a hierarchical rowset. It is part of the Microsoft Data Access Components (MDAC) products.

Nested tables: The storage subsystem does not need to support nested records. Cases are only instantiated as nested rowsets prior to training or predicting with a data mining model. The same physical data can be used to generate different casesets.

Defining a data mining model: The definition covers the name of the model; the algorithm and its input parameters; the columns of the caseset and the relationships among them; and which columns are source columns and which are prediction columns. Example:
    CREATE MINING MODEL [Age Prediction]           -- name of the model
    (
        [Customer ID] LONG KEY,                    -- source column
        [Gender] TEXT DISCRETE,                    -- source column
        [Age] DOUBLE DISCRETIZED() PREDICT,        -- prediction column
        [Product Purchases] TABLE                  -- source column
        (
            [Product Name] TEXT KEY,               -- source column
            [Quantity] DOUBLE NORMAL CONTINUOUS,   -- source column
            [Product Type] TEXT DISCRETE RELATED TO [Product Name]   -- source column
        )
    )
    USING [Decision_Trees_101]                     -- mining algorithm

Column specifiers:
    KEY
    ATTRIBUTE
    RELATION (RELATED TO clause)
    QUALIFIER (OF clause): PROBABILITY [0, 1], VARIANCE, SUPPORT, PROBABILITY-VARIANCE, ORDER
    TABLE

Attribute types:
    DISCRETE, ORDERED, CYCLICAL, CONTINUOUS, DISCRETIZED, SEQUENCE_TIME

Populating a data mining model: Use the INSERT INTO statement. The data mining model consumes the cases it is given. Use the SHAPE statement to create the nested table from the input data. Example:
    INSERT INTO [Age Prediction]
    (
        [Customer ID], [Gender], [Age],
        [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
    )
    SHAPE
        {SELECT [Customer ID], [Gender], [Age] FROM Customers ORDER BY [Customer ID]}
    APPEND
    (
        {SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    ) AS [Product Purchases]
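To make the effect of SHAPE ... APPEND ... RELATE more concrete, the following Python sketch builds a nested caseset from two flat row sets. It is an analogy, not the Data Shaping Service: the function `shape` and the sample rows are hypothetical, and dropping the linking column from the nested rows is an assumption made for the illustration.

```python
from collections import defaultdict


def shape(parents, children, parent_key, child_key, nested_name):
    """Attach to each parent row the list of child rows whose child_key matches."""
    by_key = defaultdict(list)
    for child in children:
        # Drop the linking column from the nested rows (assumption for this sketch).
        row = {k: v for k, v in child.items() if k != child_key}
        by_key[child[child_key]].append(row)
    return [
        {**parent, nested_name: by_key.get(parent[parent_key], [])}
        for parent in parents
    ]


customers = [
    {"Customer ID": 1, "Gender": "F", "Age": 34},
    {"Customer ID": 2, "Gender": "M", "Age": 27},
]
sales = [
    {"CustID": 1, "Product Name": "TV", "Quantity": 1, "Product Type": "Electronics"},
    {"CustID": 1, "Product Name": "Beer", "Quantity": 6, "Product Type": "Beverage"},
    {"CustID": 2, "Product Name": "Ham", "Quantity": 2, "Product Type": "Food"},
]
caseset = shape(customers, sales, "Customer ID", "CustID", "Product Purchases")
```

Each element of `caseset` is one case: the customer's own columns plus a nested list of that customer's purchases, which is the hierarchical shape the model is trained on.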

Using the data mining model to predict:

Prediction join: A prediction on data set D using data mining model M. It differs from an equi-join. The data mining model acts as a "truth table". The SELECT statement associated with the PREDICTION JOIN specifies the values to extract from the DMM. Example:
    SELECT t.[Customer ID], [Age Prediction].[Age]
    FROM [Age Prediction]
    PREDICTION JOIN
    (
        SHAPE
            {SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]}
        APPEND
        (
            {SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
            RELATE [Customer ID] TO [CustID]
        ) AS [Product Purchases]
    ) AS t
    ON [Age Prediction].[Gender] = t.[Gender]
    AND [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name]
    AND [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]
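The difference between a prediction join and an equi-join can be sketched in Python. In this hedged illustration, `toy_age_model` is a hand-written rule standing in for a trained decision tree, and `prediction_join` and the sample cases are hypothetical. An equi-join would return only rows that match stored rows exactly, whereas here every input case receives a predicted value.

```python
def toy_age_model(case):
    """Hypothetical trained model: a hand-written rule standing in for a decision tree."""
    bought = {p["Product Name"] for p in case["Product Purchases"]}
    if "Beer" in bought:
        return "20-29"
    return "30-39" if case["Gender"] == "F" else "40-49"


def prediction_join(model, cases, key):
    """Pair every input case with the value the model predicts for it."""
    return [{key: case[key], "Predicted Age": model(case)} for case in cases]


new_cases = [
    {"Customer ID": 7, "Gender": "F",
     "Product Purchases": [{"Product Name": "TV", "Quantity": 1}]},
    {"Customer ID": 8, "Gender": "M",
     "Product Purchases": [{"Product Name": "Beer", "Quantity": 6}]},
    {"Customer ID": 9, "Gender": "M", "Product Purchases": []},
]
predictions = prediction_join(toy_age_model, new_cases, "Customer ID")
```

Customer 9 has no purchases at all, yet still gets a prediction, which is the sense in which the model behaves like a "truth table" covering every combination of inputs.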

Browsing a data mining model: Browsing a data mining model is a process of data visualization.

Conclusion: OLE DB for Data Mining integrates data mining with database systems. It is a good standard for building data mining applications.

10. Creating decision trees in Microsoft SQL Server 2000

This section shows how Microsoft Analysis Services is used to implement decision tree models in Microsoft SQL Server 2000. We create decision tree models in two ways: one uses standard relational tables as the source, and the other uses OLAP cubes.

Creating the model: The first step in a data mining operation is to create the model. A data mining model is built from the records held in a data source. Any data source that can be reached through OLE DB can be used to create a model. These sources include relational databases, OLAP cubes, FoxPro tables, text files, and even Microsoft Excel spreadsheets. We also look at how these data sources are used to store the test cases on which predictions are made, and how the results of the predictions are stored.

Analysis Manager: The starting point for creating a data mining model is Analysis Manager, which is included in the Analysis Services installation package on the SQL Server 2000 CD-ROM. Before starting, we must register the analysis server we want to connect to by right-clicking the Analysis Servers folder and choosing Register Server.

Note that the Analysis Manager tree contains many folders holding the elements needed to create OLAP cubes and data mining models. An analysis server includes the following components:

Databases: Each analysis server contains one or more databases, and an icon represents each database. Under each database icon there are four folders and one icon.

Data Sources: The Data Sources folder contains the data sources defined in the database. A data source holds the OLE DB provider information, network settings, connection time-out, and access permission information. A database can contain several data sources in its Data Sources folder.
Cubes: The Cubes folder contains the cubes in the database, with one icon per cube. Three kinds of cubes are depicted in the Analysis Manager tree pane: Regular, Linked, and Virtual.
Partitions: A cube's Partitions folder contains an icon for each partition in the cube. Two kinds of partitions are depicted in the Analysis Manager tree pane: Local and Remote.
Cube Roles: Under a cube, a single Cube Roles icon represents all of the cube's roles.
Shared Dimensions: The Shared Dimensions folder contains an icon for each shared dimension in the database. These dimensions can be included in any cube in the database. Four kinds of shared dimensions are depicted in the Analysis Manager tree pane: Regular, Virtual, Parent-Child, and Data-Mining.
Mining Models: The Mining Models folder contains an icon for each data mining model in the database. Note that there are two icons, representing the two types of data mining models.
Database Roles: The Database Roles icon represents all of the database roles. A role can be assigned to any cube or data mining model in the database.

Before we can do any data mining, we must first create a database. Creating a database: Creating a database is simple. We just right-click the server and choose New Database. The Database dialog box appears and we type the name of the database; optionally, we can also type a description of the database.

Mining Model Wizard: Microsoft products come with wizards for tasks that have a limited and predictable number of steps. The Mining Model Wizard guides us step by step through creating a model:
1. Select the source.
2. Select the case table or tables for the data mining model.
3. Select the data mining technique (algorithm).
4. Edit the joins between the tables chosen as the source in the previous steps.
5. Select the Case Key column.
6. Select the input and predictable columns.
7. Finish.

Select source: We must choose whether the data mining model takes its cases from relational tables or from OLAP cubes.

Select case tables: The connections used with relational models are created and displayed in the Select Case Tables screen. The screen also offers the option of creating a new connection by clicking New Data Source.

Select a data mining technique: The Mining Model Wizard offers two data mining algorithms, or "techniques" as the wizard calls them, to choose from. For our purposes, we choose Microsoft Decision Trees in the Select Data Mining Technique screen.

Create and edit joins: If we chose several tables in the previous steps, the Create and Edit Joins screen appears next. This screen lets us define the joins graphically by dragging columns from the parent tables onto their child tables. If only a single table was chosen, this step is skipped.

Select the key column: The next step is to choose an ID as the Case Key column. The choice of ID has an important effect on the outcome, because the key is what identifies a record uniquely. Choosing a key is mandatory, so it is important to create a key in the SQL Server database if one does not already exist.

Select input and predictable columns: In the Select Input and Predictable Columns screen, pick at least one column for the mining model from the available columns listed in the left-hand pane. The input columns represent the actual data used to train the data mining model. If Microsoft Decision Trees was chosen in the Select Data Mining Technique screen, also select at least one predictable column.

Available columns: Select columns from the tree view. Use the buttons provided to move columns into the Predictable Columns pane or the Input Columns pane, or to remove columns from the selection. The ID column chosen in the Select Key Column dialog box cannot be used as an input column, because it is the key.
Predictable columns: Shows the columns selected as predictable. This pane is displayed only if Microsoft Decision Trees was chosen in the Select Data Mining Technique dialog box.
Input columns: Shows the selected input columns.
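The constraints on keys and columns described above can be restated as a small check. This Python sketch is only an illustration: the function `validate_selection`, its messages, and the sample rows are hypothetical, and it does not reproduce what the wizard itself validates internally.

```python
def validate_selection(rows, key_column, input_columns, predictable_columns, algorithm):
    """Return a list of problems with a proposed column selection (empty if none)."""
    problems = []
    keys = [row[key_column] for row in rows]
    if len(set(keys)) != len(keys):
        problems.append("case key values are not unique")
    if key_column in input_columns:
        problems.append("the case key cannot be used as an input column")
    if not input_columns:
        problems.append("at least one input column is required")
    if algorithm == "decision_trees" and not predictable_columns:
        problems.append("decision trees need at least one predictable column")
    return problems


rows = [
    {"ID": 1, "Gender": "F", "Age": 34},
    {"ID": 2, "Gender": "M", "Age": 27},
    {"ID": 2, "Gender": "M", "Age": 45},  # duplicate key on purpose
]
bad = validate_selection(rows, "ID", ["ID", "Gender"], [], "decision_trees")
good = validate_selection(rows[:2], "ID", ["Gender"], ["Age"], "decision_trees")
```

The first call violates three of the rules at once (duplicate key, key used as input, no predictable column); the second selection passes all checks.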

Finish: Finally, once the parameters of the data mining model have been defined, we must enter a name for the data mining model.

Relational Mining Model Editor: Convenient as wizards are, they limit flexibility at every step because, to keep things simple, a wizard must rely on default values and implicit decisions to complete a task. By using the Relational Mining Model Editor, we can bypass the wizard.

Visualizing the model: One of the most valuable features of a decision tree is the simplicity of the logic inside its structure. The Data Mining Model Editor has two tabs at the bottom of the screen: the Schema tab, which is used to change the structure of the model, and the Content tab, which displays the data classified and organized into a tree. The Content tab is a quick and convenient way to view the model, its structure, and its attributes.

Dependency Network Browser: The Dependency Network Browser is a tool for viewing the dependencies and relationships among the objects in a data mining model. To open it from the Analysis Manager tree pane, right-click a data mining model and choose Browse Dependency Network. In the Dependency Network Browser, a data mining model is shown as a network of attributes. Within the model, we can identify the dependencies and predictions among related attributes. Dependencies are shown by arrows. The direction of prediction is indicated by the arrowheads and by the color-coding of the nodes.

Inside the decision tree algorithm: As the name suggests, a decision tree algorithm produces a tree-shaped model. There is no limit on the number of levels, and the more inputs and variables are given to the algorithm, the larger the tree becomes, both wider and deeper.

CART, CHAID, and C4.5: When a decision tree algorithm is applied to a data mining problem, the result looks like a tree. Although Microsoft uses its own algorithm to build decision trees, that algorithm was inspired by other tried and proven methods.

Classification and Regression Trees (CART): CART is the most widely used, because of its efficient classification and its use of various automatic tree-pruning techniques, including validation against a test set.

Chi-squared Automatic Interaction Detector (CHAID): The CHAID algorithm applies chi-squared analysis to contingency tables to determine how a given value is distributed.

C4.5: This algorithm is an enhancement of an earlier version, ID3 (Iterative Dichotomizer version 3).

Creating decision trees with OLAP: OLAP is a well-structured form designed in advance to optimize the storage of aggregated data. With OLAP we can create aggregations along hierarchical dimensions and access values summed over the time, product, and geographic location dimensions, much like a GROUP BY statement in SQL. Dimensions provide a means of expressing relationships between data fields in a way that is not easy to achieve with a relational database. For example, storing in flat relational tables the hierarchical relationship between employees and their managers in a corporate human-resources database requires relatively complex logic.

Creating the model: We start creating the model with the Mining Model Wizard. How to open and use this wizard was discussed in detail in the relational decision tree part above. The steps for creating a model with the Mining Model Wizard are as follows:
1. Select the source type.
2. Select the source cube for the mining model.
3. Select the data mining technique.
4. Select the dimension and level that the mining model will analyze.
5. Select the training data.
6. Create a dimension, a virtual cube, or both. These steps are optional.
7. Finish.

Select the source type: The welcome screen requires no input, but the Select Source Type dialog box then appears and asks us to specify the data source, which in this case is OLAP. Choosing OLAP means that the following screens are expressed in terms of cubes and dimensions, as opposed to tables and fields as in a relational database.

Select the source cube and the data mining technique: In the Select Source Cube dialog box, we choose the cube that contains the cases we will use to train the model.
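As an aside on the split criteria named in the CART, CHAID, and C4.5 discussion earlier in this section, the chi-squared analysis of a contingency table that CHAID relies on can be sketched in a few lines of Python. This is the textbook Pearson statistic only, not Microsoft's algorithm or a full CHAID implementation; the function name and the sample tables are hypothetical, and the sketch assumes every row and column total is nonzero.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if attribute and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: values of a candidate split attribute; columns: class counts.
informative = [[10, 20], [20, 10]]
uninformative = [[15, 15], [15, 15]]
chi_informative = chi_squared(informative)
chi_uninformative = chi_squared(uninformative)
```

A larger statistic means the class distribution differs more across the attribute's values, so the attribute is a stronger candidate for a split; a value of zero means the attribute tells us nothing about the class.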

Select Case: In the Select Case screen, choose the dimension containing the data used to train the data mining model. Optionally, also choose the level we are interested in. If we do not choose a level, the wizard picks the lowest level of the dimension.

Select Predicted Entity: In the Select Predicted Entity screen, we have three options for the source of the prediction:
+ A measure of the source cube.
+ A member property of the case level.
+ Members of another dimension.

A measure of the source cube: If we want to predict a measure, that is, one of the numeric values in the cube, we choose a measure.
A member property of the case level: Every dimension level in OLAP can hold member properties that further describe the level.
Members of another dimension: If there is a relationship between the dimension containing the cases and other dimensions, we can use the related dimension as the source of the predicted attribute.

Select the training data: In the next step, we choose the data used to train our model. The dimension we chose in the Select Case screen is selected by default.

Create a dimension and virtual cube: The next step is optional, but it provides a powerful feature that is available only when OLAP is the data source and Microsoft Decision Trees is the data mining algorithm. Regardless of the option we choose, the data mining model creates Analysis Services structures.

Dimension: The dimension is built from the output of the data mining model. If we look at any OLAP dimension, we notice that it takes the form of a hierarchical tree in which branches can have sub-branches, and each sub-branch can have sub-branches of its own.

Virtual cube: The virtual cube is almost identical to the cube the data came from, except that it also contains the newly created dimension.

Finishing the data mining model: The final step is to name the data mining model.

Transactions: Several tasks are completed in succession when a cube or a data mining model is processed:
1. Create the structures.
2. Query the data source.
3. Insert the data into the structures.
4. Create the calculated fields.
Before Analysis Services declares that the data mining model is complete, it checks that all of the steps have finished. If any step fails, the preceding steps are undone.

The OLAP Mining Model Editor: When processing ends, click the Close button and wait for the OLAP Mining Model Editor to appear. The editor works essentially like the Relational Mining Model Editor. The small difference is that the source of the model is OLAP rather than a relational database.
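The all-or-nothing behaviour of processing described above can be illustrated with a short Python sketch. It is a generic rollback pattern under stated assumptions, not how Analysis Services is implemented: the function `process_model`, the step names, and the simulated failure are all hypothetical.

```python
def process_model(steps):
    """Run (name, do, undo) steps in order; undo the completed ones if any step fails."""
    done = []
    try:
        for name, do, undo in steps:
            do()
            done.append((name, undo))
    except Exception:
        # Roll back in reverse order so later work is undone first.
        for name, undo in reversed(done):
            undo()
        return False
    return True


state = []


def fail():
    # Simulated failure while querying the data source.
    raise RuntimeError("source query failed")


steps = [
    ("create structures", lambda: state.append("structures"),
     lambda: state.remove("structures")),
    ("query source", fail, lambda: None),
    ("insert data", lambda: state.append("data"), lambda: state.remove("data")),
]
succeeded = process_model(steps)
```

Because the second step fails, the structures created by the first step are removed again and the third step never runs, so nothing half-built is left behind.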

Content detail pane: The first thing we notice is that the nodes in the tree do not carry valid field names as they do with a relational database.

Prediction tree list: The prediction tree list contains the different decision tree structures present in the model. Each decision tree is identified by the decision field that it is used to predict.

Analyzing data with an OLAP data mining model: Creating a model from OLAP is in many ways similar to creating it from a relational data source. The data mining tasks we want to perform are carried out in the same way regardless of the source, except that OLAP, unlike the relational model, offers the ability to bring the data mining model itself into the OLAP cube that was used as the source.

Note that although relational databases are sometimes used for decision support, they are best suited to transaction processing, such as order entry and accounting.

Using the virtual cube: When we create the data mining model as outlined earlier in this section, we get a virtual cube along with the data mining model. To browse the contents of the cube, right-click the cube and choose Browse Data.

Using the created dimension: Dimensions derive their hierarchy from the data in the tables. For example, an employee database may have several levels of management. These levels can be specified with the cube-definition operations and then picked up by the OLAP dimension processor, which builds a structure representing the management structure in the database.

The MDX concept: Wherever large volumes of formatted, structured data are stored, a query language is almost always needed to retrieve them. Microsoft SQL Server, DB2, and Oracle use SQL as the standard query language, while Microsoft OLAP uses MDX, a query language adapted specifically for accessing cubes. MDX is also used to create calculated members and custom member formulas that define the rules for inclusion in a data set. MDX looks like SQL because the two share some keywords, but it is really a different language. Besides having a similar structural appearance, dimensions resemble decision trees in that they use rules for inclusion in a set of members. In the same way that decision tree nodes apply rules that determine which records are included, dimensions use MDX. To get a better idea of how this process works, go to the cube's Shared Dimensions folder, right-click the dimension, and choose Browse Dimension Data. The virtual dimension level feature is important because it lets us browse the data in the cube according to the nodes expressed in the data mining model. Browsing this way lets us view the measures while following the rules expressed in the data mining model.

Conclusion: As discussed in the data mining part of Analysis Services, all models created with decision trees are treated in exactly the same way regardless of the structure of the data source, because the data mining engine formats the information in the same way before training the model.
All of the information is converted into a single table structure with columns and rows, so that it is temporarily held in a flat structure that the data mining engine can work with. A data mining model created from OLAP is essentially the same as one created from a relational data source. By using OLAP as the data source, however, Analysis Services can offer unique options during the creation of the data mining model that are not available when a relational database is the source of the case data. We can choose to create traditional OLAP structures such as virtual cubes and virtual dimensions. These are genuinely valuable in that they give an operator who wants to view the data the ability to do so with the help of the patterns that were mined and organized in the data mining model.
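The conversion to a single flat table mentioned above can be pictured with the following Python sketch. It is an assumption-laden illustration rather than a description of what the Analysis Services engine actually does: the function `flatten` and the sample cases are hypothetical, and repeating the parent values once per nested row is just one possible flattening.

```python
def flatten(cases, nested_name):
    """Turn each nested case into one flat row per nested item (parent values repeated)."""
    flat = []
    for case in cases:
        parent = {k: v for k, v in case.items() if k != nested_name}
        children = case[nested_name] or [{}]  # keep cases that have no nested rows
        for child in children:
            flat.append({**parent, **child})
    return flat


cases = [
    {"Customer ID": 1, "Gender": "F", "Product Purchases": [
        {"Product Name": "TV", "Quantity": 1},
        {"Product Name": "Beer", "Quantity": 6},
    ]},
    {"Customer ID": 2, "Gender": "M", "Product Purchases": []},
]
flat_rows = flatten(cases, "Product Purchases")
```

Whether the cases originate from relational tables or from a cube, once they are in a uniform row-and-column form like `flat_rows` the same training code can consume them, which is the point the conclusion makes.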
