
Application of Data Mining Techniques in IDS Systems

TABLE OF CONTENTS

LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
PREFACE
Chapter 1. OVERVIEW OF DATA MINING
1.1 Introduction to data mining
1.2 Tasks of data mining
1.3 Types of data that can be mined
1.4 History of data mining
1.5 Applications of data mining
1.6 Classification
1.7 Some challenges for data mining
Chapter summary
Chapter 2. DATA MINING PROCESS AND METHODS
2.1 General data mining process
2.2 The knowledge discovery process for a specific problem
2.3 Data preprocessing
2.3.1 Data cleaning
2.3.1.1 Missing values
2.3.1.2 Noisy data
2.3.2 Data integration and transformation
2.3.2.1 Data integration
2.3.2.2 Data transformation
2.3.3 Data reduction
2.3.3.1 Data reduction using histograms
2.3.3.2 Sampling
2.3.4 Data discretization and concept hierarchy generation
2.3.4.1 Discretization by intuitive partitioning for numeric data
2.3.4.2 Concept hierarchy generation for categorical data
2.3 Data mining methods
2.4 Some techniques used in data mining
2.4.1 Decision trees
2.4.1.1 General introduction
2.4.1.2 Types of decision trees
2.4.1.3 Advantages of decision trees
2.4.2 Association rules
2.4.2.1 Statement of the association rule mining problem
2.4.2.2 Approaches to association rule mining
2.4.3 Multidimensional data model
2.4.3.1 Definition
2.4.3.2 Operations on the dimensions of the MDDM
2.4.4 Shortest distance
2.4.5 K-nearest neighbors
2.4.6 Clustering

Vũ Thị Vân, Faculty of Information Security, Academy of Cryptography Techniques



2.4.7 Data visualization techniques
2.4.8 Neural networks
2.4.8.1 Overview
2.4.8.2 Neural network model
2.4.9 Genetic algorithms
2.4.9.1 General introduction
2.4.9.2 Basic steps of a genetic algorithm
Chapter summary
Chapter 3. APPLYING DATA MINING TECHNIQUES IN IDS SYSTEMS
3.1 IDS systems
3.1.1 Introduction
3.1.2 Intrusion detection systems (IDS)
3.1.2.1 What is an IDS?
3.1.2.2 Role and functions of an IDS
3.1.2.3 Physical-level model of an IDS
3.1.2.4 Internal structure and operation of an IDS
3.1.2.5 Classification
3.2 Data mining in IDS
3.2.1 Data-mining-based NIDS
3.2.1.1 Source of audit data
3.2.1.2 Processing raw audit data and constructing features
3.2.1.3 Data mining methods in NIDS
3.2.2 Situation in Vietnam
3.3.3 Situation worldwide
3.3.3.1 Earliest research
3.3.3.2 Later research
3.3.3.3 Recent and current research
Chapter 4. BUILDING A DoS ATTACK DETECTION PROGRAM USING DATA MINING TECHNIQUES
4.1 Clustering algorithms
4.1.1 Introduction
4.1.2 Types of data in cluster analysis
4.2.2.1 Interval-scaled variables
4.2.2.2 Binary variables
4.2.2.3 Categorical (nominal), ordinal, and ratio-scaled variables
4.2.3 Clustering methods
4.2.3.1 Partitioning methods
4.2.3.2 Hierarchical methods
4.2.4 Clustering with the k-means method
4.2.4.1 The k-means algorithm
4.2.4.2 Representative-object technique: the k-medoids method
4.2 Program analysis and design diagrams (patterns)
4.2.1 Data collection and preprocessing
4.2.1.1 Data collection
4.2.1.2 Preprocessing
4.2.2 Mining data to detect denial-of-service attacks



4.2.2.1 Anomalous patterns of denial-of-service attacks
4.2.2.2 Data mining
4.2.3 Data presentation
Chapter 5. RESULTS, EVALUATION, CONCLUSIONS, AND FUTURE WORK
5.1 Installation
5.2 Results achieved
5.3 Conclusions
5.4 Future work
REFERENCES

LIST OF ABBREVIATIONS
AS              Analysis Services
BIDS            Business Intelligence Development Studio
BI Dev Studio   Business Intelligence Development Studio
CSDL            Cơ sở dữ liệu (database)
DM              Data mining
DMX             Data Mining eXtensions
DSV             Data Source View
DTS             Data Transformation Services
IDS/IPS         Intrusion Detection System / Intrusion Prevention System
KDD             Knowledge Discovery and Data Mining
KTDL            Khai thác dữ liệu (data mining)
KDL             Kho dữ liệu (data warehouse)
MDDM            Multidimensional Data Model
MMPB            Mining Model Prediction Builder
MSE             Mining Structure Editor
MSS             Microsoft SQL Server
OLAP            Online Analytical Processing
SRSWOR          Simple random sample without replacement
SRSWR           Simple random sample with replacement

LIST OF TABLES
Table 2.1: Observed frequencies
Table 3.1: Golf-playing data
Table 3.2: Example of a transaction database D
Table 3.3: Frequent itemsets with minsup = 50%
Table 3.4: Association rules generated from the frequent itemset ABE
Table 3.5: Survey data on ownership of amenities
Table 3.6: Sample customer data
Table 3.7: Some examples using the k-nearest neighbor technique
Table 3.8: Contingency table for binary variables
Table 3.9: A relational table in which patients are described by binary variables
Table 3.10: Sample data table containing variables of mixed types


LIST OF FIGURES
Figure 2.1: Data mining, a step in the knowledge discovery process
Figure 2.2: Overview of the data mining process
Figure 2.3: Forms of data preprocessing
Figure 2.4: A histogram for price using singleton buckets, each representing one price value/frequency pair
Figure 2.5: An equal-width histogram for price
Figure 2.6: Sampling methods
Figure 2.7: A concept hierarchy for the concept price
Figure 2.8: Automatic generation of a concept hierarchy based on the number of distinct values of the attributes
Figure 3.1: Result of the decision tree
Figure 3.2: Geometric representation of an n-dimensional data model (with n=3)
Figure 3.3: Converting a two-dimensional table into an n-dimensional data model
Figure 3.4: Records represented as points in a space defined by their attributes, where the distance between them can be measured
Figure 3.6: Graph based on two measures
Figure 3.7: Three-dimensional interactive graph
Figure 3.8: Simulation of a neural network architecture
Figure 3.5: Illustration of the k-means algorithm


PREFACE

The development of information technology and its application in many areas of life, the economy, and society over the years has meant that organizations collect and store ever-growing volumes of data. They keep this data in the belief that it holds some value. According to statistics, however, only a small fraction of it (about 5% to 10%) is ever analyzed. For the rest, organizations do not know what they should or could do with it, yet they continue the costly collection for fear that something important might be missed and needed later.

Traditional database management and exploitation methods cannot meet this expectation, which led to the emergence of knowledge discovery and data mining (KDD - Knowledge Discovery and Data Mining). KDD techniques have been and are being studied and applied in many fields around the world. In Vietnam they are still relatively new, but they are also being researched and gradually put into practice.

Within the scope of this project, I present the basic concepts of data mining and the application of data mining in IDS/IPS systems.


While completing this project I received dedicated help and guidance from my teachers and friends, especially my instructor Vũ Đình Thu. Because of limited time and ability, mistakes are unavoidable, and I would welcome further comments from teachers and friends. My sincere thanks to all my teachers!

Student: Vũ Thị Vân

Chapter 1
OVERVIEW OF DATA MINING

1.1 Introduction to data mining

Data mining is defined as the process of extracting valuable, hidden information from the large volumes of data stored in databases and data warehouses. More specifically, it is the process of filtering out and producing knowledge or patterns that are hidden, previously unknown, yet useful, from large databases. It is also the process of generalizing discrete facts in the data into general, rule-like knowledge that actively supports decision making.

Besides the term data mining, several terms with similar meanings are in use: knowledge mining from databases, data extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as synonymous with another common term, Knowledge Discovery in Databases (KDD). In practice, however, data mining is only one essential step in the KDD process.

A simple analogy helps: data mining is like looking for a needle in a haystack. The needle is a small piece of knowledge or valuable information, and the haystack is a vast data store. Data mining is what allows the valuable information hidden in that store to be extracted and put to use.

Data mining functions include grouping, classification, forecasting, prediction, and link analysis. The source data may be large databases or data warehouses, structured or unstructured.

Data mining tasks can be divided into two categories, descriptive and predictive:
- Descriptive mining tasks describe the general properties or characteristics of the data in an existing database. Techniques include clustering, summarization, visualization, evolution and deviation analysis, and association rule analysis.
- Predictive mining tasks perform inference on current data in order to make predictions. Techniques include classification and regression.

1.2 Tasks of data mining

A great deal of research and development has been carried out in the field of data mining. Based on the kind of knowledge discovered, the tasks can be classified as follows:
- Characteristic rule mining: summarizes the general attributes of some dataset in the database. For example, the symptoms of a disease S can usually be expressed through a set of attributes A.
- Discriminant rule mining: mines the features and attributes that distinguish one dataset from another. For example, to tell diseases apart, a discriminant rule summarizes the symptoms that separate a given disease from the others.
- Association rule discovery: mines associations among the objects in a dataset. Given two sets of objects {A1, A2, ..., An} and {B1, B2, ..., Bn}, an association rule has the form {A1 ^ A2 ^ ... ^ An} => {B1 ^ B2 ^ ... ^ Bn}.
- Classification rule discovery: assigns data to a set of known classes. For example, cars sharing certain characteristics can be assigned to classes based on fuel consumption, or to classes based on load capacity.
- Clustering: determines a group for a set of objects based on their attributes. A number of criteria are used to decide whether an object belongs to a group.
- Prediction: predicts the likely values of missing data, or the distribution of some attribute in the dataset.



- Evolution rule discovery: finds rule sets that reflect the general evolutionary behavior and changes of a dataset. An example is a rule that uncovers the main factors driving changes in certain stock prices.

1.3 Types of data that can be mined

Data mining works with many different kinds of data. Most mined data falls into the following types:
- Relational databases: databases organized according to the relational model. Most current database management systems support this model, for example Oracle, IBM DB2, MS SQL Server, and MS Access.
- Multidimensional databases: also called data warehouses, in which data is selected from many different sources and carries historical characteristics through explicit or implicit time attributes.
- Transactional databases: widely used in supermarkets, commerce, finance, and banking.
- Object-relational databases: a hybrid of the object-oriented model and the relational database model.
- Temporal and spatial databases: contain geographic (spatial) information or time-based information.
- Multimedia databases: include audio, images, video, text, and many other formatted data types. This kind of data is now widely used on the Internet.

1.4 History of data mining

- 1960s: databases based on the network model and the hierarchical model appear.
- 1970s: the theoretical foundations of relational databases and relational database management systems are established.
- 1980s: relational database theory and relational database management systems mature; advanced database management systems appear (object-oriented, deductive, ...), along with application-oriented systems for spatial, scientific, industrial, agricultural, and geographic domains.
- 1990s-2000s: data mining and data warehouses, multimedia databases, and Web databases develop.

1.5 Applications of data mining

Data mining is related to many other disciplines, such as database systems, statistics, and visualization. Depending on the approach taken, it can also apply techniques such as neural networks, rough set theory, fuzzy sets, and knowledge representation. Compared with these methods, data mining has several clear advantages.

Compared with machine learning, data mining has the advantage that it can be used on databases that contain a lot of noise, are incomplete, or change continuously. Machine learning methods, by contrast, are mainly applied to databases that are complete, change little, and are not too large.

Expert systems differ from data mining in that the examples provided by experts are usually at a much higher level than the data in a database, and they usually cover only the important cases. In addition, the experts themselves confirm the value and usefulness of the patterns discovered.

Statistics is one of the theoretical foundations of data mining, but a comparison of the two shows that statistical methods still have some weaknesses that data mining overcomes:
- Standard statistical methods are not suited to the structured data types found in many databases.
- Statistical methods are entirely data-driven; they do not use existing domain knowledge.
- The results of a statistical analysis can be very numerous and hard to interpret.
- Statistical methods need user guidance to determine how and where to analyze the data.

Data mining is widely applied in many fields, such as:
- Banking: building credit risk prediction models; finding knowledge and patterns in the stock market and in real estate investment; detecting the use of fake credit cards online; and serving as a useful tool for risk management services in e-commerce.
- E-commerce: a tool for understanding, guiding, promoting to, and communicating with customers; analyzing online shopping behavior and providing marketing information suited to the type of customer in a given market segment.
- Human resources: helping recruiters choose the most suitable candidates for a company's needs.
- Medicine: helping doctors diagnose a patient's disease from the initial test results.
- Network security: used in intrusion detection and prevention systems (IDS/IPS) to detect unauthorized network intrusion attacks.
- And many other fields.

Some applications of data mining in business:


- BRANDAID: a flexible marketing model focused on packaged consumer goods.
- CALLPLAN: helps salespeople determine the number of visits to make to prospective and existing customers.
- DETAILER: determines which customers to visit and which products to present on each visit.
- GEOLINE: a model for designing sales and service territories.
- MEDIAC: helps advertisers buy media for a year and plan media usage, including outlining market segments and estimating potential.

1.6 Classification

Data mining systems can be classified according to the following criteria:
- By the type of data mined: systems that work with relational databases, data warehouses, transactional databases, object-oriented databases, multimedia, and the Web.
- By the kind of knowledge mined: systems that output summaries, descriptions, association rules, classification, clustering, and prediction.
- By the type of technique used: systems that use OLAP techniques or machine learning techniques (decision trees, neural networks, evolutionary algorithms, rough sets, and fuzzy sets).
- By application domain: systems used in fields such as biology, medicine, commerce, and insurance.

1.7 Some challenges for data mining

- Large databases
- High dimensionality
- Changes in data and knowledge can make previously discovered patterns no longer valid
- Missing or noisy data
- Complex relationships between fields
- Interaction with the user and combination with existing knowledge
- Integration with other systems

Chapter summary

This chapter introduced:
- The concept of data mining
- The tasks of data mining
- Classification in data mining
- Application areas of data mining
- Some challenges in data mining


The next chapter introduces the process and methods of data mining, as well as some techniques used in data mining.

Chapter 2
DATA MINING PROCESS AND METHODS


2.1 General data mining process

The process consists of the following steps:
1) Data cleaning: remove noise and irrelevant data.
2) Data integration: combine data from different sources, such as databases, data warehouses, and text files.
3) Data selection: retrieve from the original data sources the data directly relevant to the task.
4) Data transformation: convert the data into a form suitable for mining by performing grouping or aggregation operations.


5) Data mining: the essential stage, in which intelligent methods are applied to extract data patterns.
6) Pattern evaluation: assess the usefulness of the patterns that represent knowledge, based on a number of measures.
7) Knowledge presentation: use presentation and visualization techniques to show the mined knowledge to the user.
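The seven steps above can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions: the two toy sources, the field names (`src_ip`, `service`, `bytes`), the frequency threshold, and every function name are hypothetical and do not come from the thesis. The "mining" step here is a trivial frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical toy sources; field names and values are illustrative only.
source_a = [
    {"src_ip": "10.0.0.1", "service": "http", "bytes": 120},
    {"src_ip": "10.0.0.2", "service": "http", "bytes": None},  # incomplete record
    {"src_ip": "10.0.0.3", "service": "ftp", "bytes": 4000},
]
source_b = [
    {"src_ip": "10.0.0.4", "service": "http", "bytes": 90},
    {"src_ip": "10.0.0.5", "service": "dns", "bytes": 60},
]

def clean(records):
    """Step 1: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: merge records coming from several sources."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate into a form suitable for mining (count per service)."""
    return Counter(r["service"] for r in records)

def mine(counts, min_count):
    """Step 5: extract patterns; here, services seen at least min_count times."""
    return {s: c for s, c in counts.items() if c >= min_count}

def evaluate(patterns, total):
    """Step 6: score each pattern with a simple measure (relative frequency)."""
    return {s: c / total for s, c in patterns.items()}

def present(scored):
    """Step 7: render the scored patterns as readable lines."""
    return [f"{s}: {score:.0%}" for s, score in sorted(scored.items())]

cleaned = integrate(clean(source_a), clean(source_b))
selected = select(cleaned, ["service"])
counts = transform(selected)
patterns = mine(counts, min_count=2)
report = present(evaluate(patterns, total=len(selected)))
print(report)
```

In practice the steps interact and may be repeated, so a real system would rarely be a single straight-line pass like this one.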

Figure 2.1: Data mining, a step in the knowledge discovery process

2.2 The knowledge discovery process for a specific problem

Because the goal is to discover knowledge implicit in a database, the mining process usually has to pass through several necessary stages: preparing the data for mining, mining the data, and finally converting the mining results into knowledge that people can understand. The steps are summarized as follows:
- Stage 1: develop an understanding of the application domain and the relevant knowledge. Determine the goal of the data mining process from the user's point of view.
- Stage 2: prepare the data for mining; collect the data and sample data.
- Stage 3: preprocess the data: remove noisy information, eliminate duplicate data, and decide on a strategy for handling missing data.


- Stage 4: project the data, reduce it, and find the features to be mined.
- Stage 5: choose the most suitable data mining method from among the common ones, such as summarization, classification, regression, clustering, and association.
- Stage 6: starting from the chosen algorithm, model it for the specific case under consideration. Choose the methods for searching for data patterns and decide on the parameters.
- Stage 7: this is the data mining stage itself: use the algorithm to search for interesting patterns in a particular representational form or a set of such representations, including classification rules, trees, regression, and clustering.
- Stage 8: interpret the mined patterns in forms that express the knowledge in the data, such as natural language, charts, trees, and tables.

Figure 2.2: Overview of the data mining process

The mining process is interactive and may loop back between any two steps; its basic steps are illustrated in the figure above. Most earlier work focused on step 7, the data mining stage. However, the remaining steps are no less important and contribute greatly to the success of the whole process. The following section looks at preprocessing in detail.

2.3 Data preprocessing

Real-world data is often dirty and inconsistent. Preprocessing techniques can improve the quality of the data and thereby make the data mining process more accurate and efficient. Preprocessing is an important step in knowledge discovery because the quality of decisions depends on the quality of the data. It comprises data cleaning, data integration and transformation, data reduction, and data discretization with concept hierarchy generation.

Figure 2.3: Forms of data preprocessing

2.3.1 Data cleaning

Real-world data is often incomplete, noisy, and inconsistent. Data cleaning attempts to fill in missing values, remove noise, and correct inconsistencies in the data.

2.3.1.1 Missing values

Methods for handling missing values:

1. Ignore the tuple that has a missing value: this method is usually used when the class label is missing (typically in classification tasks). It is very ineffective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values in each attribute is considerable.

2. Fill in the missing values manually: this approach is time-consuming and infeasible on a large dataset with many missing values.

3. Use a global constant to fill in the missing values: replace all missing attribute values with a constant such as "Unknown". If missing values are replaced by a constant, the mining program may mistake it for a meaningful concept, since all those tuples share the same common value "Unknown". So although this method is simple, it is not foolproof.

4. Use the attribute mean to fill in the missing values.


5. Use the attribute mean over all samples that belong to the same class as the given tuple. For example, if customers are classified by credit_risk, a missing income value is replaced by the average income of the customers in the same credit_risk class as the given tuple.

6. Use the most probable value to fill in the missing values: this value can be found by regression or with tools based on Bayesian formalism. For example, using the other customer attributes in the dataset, one can build a decision tree to predict the missing value of income. Although more complex, this method is the most widely used. Because the missing value is inferred from an analysis of the remaining attributes, the value obtained is more accurate than with the other methods.

2.3.1.2 Noisy data

Noise is a random error or inconsistency in the measurement of variables. The techniques for removing noise are:

1. Binning: first sort the data and partition it into bins. The data can then be smoothed by bin means, by bin medians, by bin boundaries, and so on.
- Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
- Smoothing by bin medians: each value in a bin is replaced by the median of the bin.
- Smoothing by bin boundaries: the largest and smallest values in a bin are taken as the bin boundaries, and each value in the bin is replaced by the nearest boundary value.
The wider the bins, the "smoother" the resulting dataset.

Example 2.1: Given sorted data for price (in $): 4, 8, 15, 21, 21, 24, 25, 28, 34.

Partition the data into bins (equal frequency):
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
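Two of the cleaning steps described so far, filling a missing value with the class mean (method 5 in Section 2.3.1.1) and smoothing by binning (Example 2.1), can be sketched as follows. This is a minimal sketch, not the thesis's implementation: the function names, the sample customer rows, and the tie-breaking rule in boundary smoothing are assumptions made for illustration.

```python
def fill_with_class_mean(rows, target, label):
    """Fill a missing `target` with the mean of `target` over the rows
    that share the same `label` value (method 5)."""
    known = {}
    for row in rows:
        if row[target] is not None:
            known.setdefault(row[label], []).append(row[target])
    # Assumes every class has at least one known value.
    class_mean = {cls: sum(vals) / len(vals) for cls, vals in known.items()}
    return [
        {**row, target: class_mean[row[label]]} if row[target] is None else dict(row)
        for row in rows
    ]

def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's minimum and maximum.
    Ties go to the lower boundary (an assumption of this sketch)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Hypothetical customer rows; only the attribute names come from the text.
customers = [
    {"credit_risk": "low", "income": 50},
    {"credit_risk": "low", "income": 70},
    {"credit_risk": "low", "income": None},
    {"credit_risk": "high", "income": 20},
]
filled = fill_with_class_mean(customers, target="income", label="credit_risk")

# The price data from Example 2.1.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, size=3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
print(filled[2]["income"], by_means, by_bounds)
```

Run on the price data of Example 2.1, the two smoothing functions reproduce the bin means (9, 22, 29) and the boundary-smoothed bins listed in the example.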


2. Regression: Data can be smoothed by fitting it to a function, such as a regression function. Linear regression finds the best line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression extends simple linear regression: more than two attributes are involved, and the data is fitted to a multidimensional surface.
3. Clustering: Outliers can be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside the set of clusters can be regarded as outliers.

Many data smoothing methods are also data reduction methods involving discretization. For example, the binning technique described above reduces the number of distinct values of an attribute. This acts as a form of data reduction for logic-based data mining methods, such as decision tree induction, which repeatedly compare values on sorted data. Concept hierarchies are a form of data discretization that can also be used for data smoothing. For example, a concept hierarchy for price may map real price values to cheap, medium, and expensive, thereby reducing the number of data values the mining process has to handle.

2.3.2 Data integration and transformation

Data mining often requires data integration, that is, merging data from multiple data stores. The data may also need to be transformed into forms appropriate for mining. This section describes both data integration and data transformation.

2.3.2.1 Data integration

Data analysis tasks often require data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. Some issues that arise during data integration:

- How can real-world entities from multiple data sources be matched? For example, customer_id and cus_id are one attribute, not two.
- Metadata is used to avoid errors during schema integration and data transformation. Examples are the data type and the null-value rules of an attribute, or an attribute value that is "H" in one database but 1 in another.
- Data redundancy: An attribute may be derivable from other attributes. Inconsistencies in attribute naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis. Given two attributes, correlation analysis measures how strongly one attribute is related to the other, based on the available data. For numerical attributes, we can evaluate the relationship between two attributes, A and B, by computing the correlation coefficient:
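$$r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N} (a_i b_i) - N \bar{A} \bar{B}}{N \sigma_A \sigma_B}$$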

where:
- $N$ is the number of tuples, and $a_i$ and $b_i$ are the values of attributes A and B in tuple $i$;
- $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B;
- $\bar{A}$ and $\bar{B}$ are the mean values of A and B.

Note that $-1 \le r_{A,B} \le 1$. If $r_{A,B} > 0$, A and B are positively correlated, meaning that B increases as A increases; the larger $r_{A,B}$, the stronger the correlation. If $r_{A,B} < 0$, A and B are negatively correlated, meaning that B decreases as A increases. If $r_{A,B} = 0$, A and B are uncorrelated (there is no linear relationship between them).

Note also that correlation does not imply causality: if A and B are correlated, this does not mean that A causes B or that B causes A. For example, when analyzing demographic data we may find that the number of hospitals and the number of cars in a region are correlated. This does not mean that one causes the other. In fact, both are related to a third attribute, population.

For discrete (categorical) data, a correlation between two attributes A and B can be found with a $\chi^2$ (chi-square) test. Suppose A has $c$ distinct values $a_1, a_2, \ldots, a_c$ and B has $r$ distinct values $b_1, b_2, \ldots, b_r$. The tuples described by A and B can be shown as a table with the $c$ values of A as columns and the $r$ values of B as rows. Let $(A_i; B_j)$ denote the joint event $A = a_i$ and $B = b_j$. Then $\chi^2$ is computed as:
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where $o_{ij}$ is the observed frequency of the joint event $(A_i; B_j)$ and $e_{ij}$ is its expected frequency, computed as:

$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$$

The $\chi^2$ statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with $(r-1) \times (c-1)$ degrees of freedom. If the hypothesis is rejected, we conclude that A and B are dependent.

Example 2.2: Correlation analysis of categorical attributes using $\chi^2$. Suppose a group of 1500 people was surveyed. The gender of each person was noted, and each person was asked whether they prefer fiction. We thus have two attributes, gender and preferred_reading. The observed frequency (count) of each possible joint event is recorded in Table 2.1 below:


               Male         Female        Total
Fiction        250 (90)     200 (360)     450
Non-fiction    50 (210)     1000 (840)    1050
Total          300          1200          1500

Table 2.1: Observed frequencies; the values in parentheses are the expected frequencies (computed with the formula for $e_{ij}$).

For example, $e_{11}$ is computed as:

$$e_{11} = \frac{count(male) \times count(fiction)}{N} = \frac{300 \times 450}{1500} = 90$$
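As a cross-check, the expected frequencies in Table 2.1 and the statistic derived by hand next can be reproduced with a short sketch. The function name `chi_square` and the row/column layout are choices made for this illustration, not part of the original text.

```python
def chi_square(observed):
    """observed: contingency table as a list of rows of counts.
    Returns (chi2, expected), where expected has the same shape as observed."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # e_ij = (row total * column total) / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    return chi2, expected

# Rows: fiction, non-fiction; columns: male, female (Table 2.1).
observed = [[250, 200],
            [50, 1000]]
chi2, expected = chi_square(observed)
print(expected)          # [[90.0, 360.0], [210.0, 840.0]]
print(round(chi2, 2))    # 507.94; the hand calculation sums rounded terms and gets 507.93
```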

Then $\chi^2$ is computed as follows:

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 284.44 + 121.90 + 71.11 + 30.48 = 507.93$$

The number of degrees of freedom for this 2 x 2 table (Table 2.1) is (2-1) x (2-1) = 1. For one degree of freedom, the $\chi^2$ value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the $\chi^2$ distribution tables found in most statistics textbooks). Since the computed value is far above this threshold, we can reject the hypothesis that preferred_reading and gender are independent and conclude that the two attributes are correlated.

2.3.2.2 Data transformation

In data transformation, the data is transformed or consolidated into forms appropriate for mining. Data transformation involves the following:

- Smoothing, which removes noise from the data. Techniques include binning, regression, and clustering.
- Aggregation, where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated into monthly or yearly totals. This step is typically used when building a data cube over multiple levels of granularity.
- Generalization, where low-level or raw data is replaced by higher-level concepts.
- Normalization, where attribute data is scaled to fall within a small specified range, such as -1 to 1 or 0 to 1.
- Attribute construction (or feature construction), where new attributes are built and added to the given attribute set to help the mining process.

Data normalization methods:

1. Min-max normalization: performs a linear transformation on the original data. Suppose $min_A$ and $max_A$ are the minimum and maximum values of attribute A. Min-max normalization maps a value $v$ of A to a value $v'$ in the range $[new\_min_A, new\_max_A]$ by computing:
$$v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A$$
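A minimal sketch of the min-max formula above, applied for illustration to the price values of Example 2.1. The function name and the default target range [0, 1] are assumptions of this sketch.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max_A - min_A), so a constant attribute
        # cannot be rescaled this way.
        raise ValueError("min-max normalization is undefined for a constant attribute")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
normalized = min_max_normalize(prices)             # target range [0, 1]
symmetric = min_max_normalize(prices, -1.0, 1.0)   # target range [-1, 1]
print([round(v, 3) for v in normalized])
```

Because the minimum and maximum come from the data itself, one extreme value stretches the whole scale.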



2. Z-score normalization (or zero-mean normalization): the values of an attribute A are normalized based on the mean and standard deviation of A. A value $v$ of A is normalized to $v'$ by computing:

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A. This method is useful when the actual minimum and maximum of the attribute are unknown, or when there are outliers that would dominate min-max normalization.

3. Normalization by decimal scaling: a value $v$ is normalized to $v'$ by computing:

$$v' = \frac{v}{10^j}$$

where $j$ is the smallest integer such that $\max(|v'|) < 1$.

In attribute construction, new attributes are built from the given ones to help improve accuracy and the understanding of structure in high-dimensional data. For example, we can construct an attribute area from the attributes length and width. By combining attributes, attribute construction can uncover missing information about the relationships between attributes that may be useful for knowledge discovery.

2.3.3 Data reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely preserves the integrity of the original data. Mining on the reduced data is therefore more efficient than mining on the original data. The data reduction strategies are:

1. Data cube aggregation, where aggregation operations are applied to the data within a data cube structure.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the size of the data set.
4. Numerosity reduction, where the data is replaced or estimated by smaller representations, such as parametric models (only the model parameters are stored instead of the actual data) or nonparametric methods such as clustering, sampling, and histograms.
5. Discretization and concept hierarchy generation, where raw attribute values are replaced by ranges or higher conceptual levels. Discretization is a form of numerosity reduction and is very useful for generating concept hierarchies automatically. Discretization and concept hierarchy generation are powerful tools for data mining because they allow mining at multiple levels of abstraction.

2.3.3.1 Data reduction using histograms

Histograms use binning to approximate data distributions and are a very popular form of data reduction. The data is divided into buckets, and an average (or sum) is stored for each bucket. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, called buckets. The buckets are shown along a horizontal axis, while the height (and area) of a bucket typically reflects the average frequency of the values it represents. If each bucket represents only a single attribute value/frequency pair, the buckets are called singleton buckets. More often, buckets represent continuous ranges of the given attribute.

Example 2.3: Histograms. The following data is a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

Figure 2.4 shows a histogram for this data using singleton buckets. To reduce the data further, each bucket usually denotes a continuous range of values of the given attribute. In Figure 2.5, each bucket represents a different $10 range of price.
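The two histograms of Example 2.3 can be sketched as follows. The (price, count) pairs are tallied from the sorted list above. The function name is hypothetical, and the bucket ranges 1-10, 11-20, 21-30 are an assumption, since the text only states that each bucket is $10 wide.

```python
from collections import Counter

# (price, count) pairs tallied from the sorted list in Example 2.3.
price_counts = [(1, 2), (5, 5), (8, 2), (10, 4), (12, 1), (14, 3), (15, 6),
                (18, 8), (20, 7), (21, 4), (25, 5), (28, 2), (30, 3)]
prices = [p for p, n in price_counts for _ in range(n)]

# Singleton buckets: one bucket per distinct price value.
singleton = Counter(prices)

def equal_width_histogram(values, width, start=1):
    """Count values per bucket [start, start+width-1], [start+width, ...], ..."""
    buckets = Counter()
    for v in values:
        k = (v - start) // width          # index of the bucket holding v
        lo = start + k * width
        buckets[(lo, lo + width - 1)] += 1
    return dict(sorted(buckets.items()))

hist = equal_width_histogram(prices, width=10)
print(len(prices))   # 52 values in total
print(hist)          # {(1, 10): 13, (11, 20): 25, (21, 30): 14}
```

The reduced representation keeps three (range, count) pairs instead of 52 values.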

Figure 2.4: A histogram for price using singleton buckets; each bucket represents one price value/frequency pair


Figure 2.5: An equal-width histogram for price

Rules for partitioning the attribute values into buckets:

- Equal-width: in an equal-width histogram, the width of each bucket range is constant (such as the $10 width of the buckets above).
- Equal-depth (or equal-height): in an equal-depth histogram, the buckets are created so that the frequency of each bucket is roughly constant, that is, each bucket contains roughly the same number of contiguous data samples.
- V-Optimal: among all possible histograms for a given number of buckets, the V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where the weight of a bucket equals the number of values in the bucket.
- MaxDiff: in a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is placed between each of the pairs with the β-1 largest differences, where β is the number of buckets specified by the user.

2.3.3.2 Sampling

Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample of the data. Suppose a data set D contains N tuples. D can be sampled for data reduction in the following ways:

- Simple random sample without replacement (SRSWOR) of size n:
  o The sample is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely.
- Simple random sample with replacement (SRSWR) of size n:
  o This is similar to SRSWOR, except that each time a tuple is drawn from D it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
- Cluster sample:
  o If the tuples in D are grouped into M mutually disjoint clusters, then an SRS of m clusters can be obtained, where m < M.
  o For example, tuples in a database are usually retrieved a page at a time, so each page can be considered a cluster. A reduced representation of the data can be obtained by applying SRSWOR to the pages, which yields a cluster sample of the tuples.
- Stratified sample:
  o If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data is skewed.
  o For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group with the smallest number of customers is still guaranteed to be represented.

In data reduction, sampling is most commonly used to estimate the answer to an aggregate query. It is possible to determine a sample size sufficient to estimate a given function within a specified error bound, and this sample size may be very small compared with the size of the data set. Sampling is a natural choice for the progressive refinement of a reduced data set: such a set can be refined further simply by increasing the sample size.

Besides the methods above, there are other ways to reduce data, such as clustering, attribute subset reduction, and value range reduction.
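The four sampling schemes above can be sketched with Python's standard `random` module. The customer data, the age-group labels, the page size of 10, and the function names are all hypothetical, chosen only for this illustration.

```python
import random
from collections import defaultdict

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: n distinct tuples."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return [rng.choice(data) for _ in range(n)]

def cluster_sample(clusters, m, rng):
    """Pick m whole clusters (e.g. disk pages) by SRSWOR and keep all their tuples."""
    chosen = rng.sample(clusters, m)
    return [t for cluster in chosen for t in cluster]

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw an SRSWOR inside every stratum, keeping at least one tuple per stratum."""
    strata = defaultdict(list)
    for t in data:
        strata[stratum_of(t)].append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical skewed data: (customer id, age group).
customers = [(i, "young" if i < 80 else "middle" if i < 95 else "senior")
             for i in range(100)]
pages = [customers[i:i + 10] for i in range(0, 100, 10)]  # 10 tuples per "page"

a = srswor(customers, 10, rng)
b = srswr(customers, 10, rng)
c = cluster_sample(pages, 2, rng)
d = stratified_sample(customers, lambda t: t[1], 0.1, rng)
print({group for _, group in d})  # every age group is represented
```

With skewed data like this, a plain SRSWOR of 10 tuples can easily miss the small "senior" group, whereas the stratified sample always contains it.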


Figure 2.6: Sampling methods

2.3.4 Data discretization and concept hierarchy generation

Data discretization techniques can be used to reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace the actual data values. Replacing the many values of a continuous attribute with a small number of labeled intervals reduces and simplifies the original data. This leads to a concise, easy-to-use, knowledge-level representation of the mining results.

A concept hierarchy for a given numerical attribute defines a discretization of that attribute. Concept hierarchies can be used to reduce data by collecting low-level concepts and replacing them with higher-level concepts. Although detail is lost through such data generalization, the generalized data may be more meaningful and easier to interpret. It also contributes to a consistent representation of mining results across multiple mining tasks, which is a common requirement. In addition, mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger, ungeneralized data set. Because of these benefits, discretization techniques and concept hierarchies are typically applied before mining, as a preprocessing step, rather than during mining.

An example of a concept hierarchy for the attribute price is given in Figure 2.7. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of different users.
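Replacing continuous values with interval labels can be sketched as below. The cut points 100 and 500 are hypothetical and are not taken from Figure 2.7; the labels cheap, medium, and expensive follow the price example mentioned earlier in the chapter.

```python
from bisect import bisect_right

# Hypothetical interval boundaries for price: [0, 100), [100, 500), [500, ...)
CUTS = [100, 500]
LABELS = ["cheap", "medium", "expensive"]

def discretize(value, cuts=CUTS, labels=LABELS):
    """Return the label of the interval that contains value."""
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the interval.
    return labels[bisect_right(cuts, value)]

prices = [15, 99.9, 100, 250, 499, 500, 1200]
labeled = [discretize(p) for p in prices]
print(labeled)
# ['cheap', 'cheap', 'medium', 'medium', 'medium', 'expensive', 'expensive']
```

Seven distinct numeric values collapse into three labels, which is the reduction effect the section describes.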

Figure 2.7: A concept hierarchy for price

2.3.4.1 Discretization by intuitive partitioning for numerical data

Although the discretization methods above are useful for generating numerical hierarchies, many users prefer numerical ranges partitioned into intervals that are easy to read, intuitive, and natural. For example, annual salaries are more naturally divided into ranges such as [$50,000, $60,000) than into ranges such as [$51263.98, $860872.34) produced by some sophisticated clustering analysis.

The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural intervals. In general, the rule partitions a given range of data into 3, 4, or 5 intervals of roughly equal width, recursively and level by level, based on the value range at the most significant digit. We will illustrate the use of the rule with the examples below. The rule works as follows:

- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (3 equal-width intervals for 3, 6, and 9; 3 intervals in the grouping 2-3-2 for 7).
- If an interval covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equal-width intervals.
- If an interval covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equal-width intervals.

The rule can be applied recursively to each sub-interval, creating a concept hierarchy for the given numerical attribute.

Real-world data often contains extreme outliers, which can distort any top-down discretization method based on the minimum and maximum values. For example, the assets of a few people may be far larger than those of everyone else in the same data set. Discretization based on the maximum asset values may lead to a highly biased hierarchy. The top-level discretization can therefore be performed on the range of values that covers the majority of the given data (for example, the middle range of the data after cutting off 5% at each end). The extremely high or low values outside the top-level discretization then form separate intervals.

2.3.4.2 Concept hierarchy generation for categorical data

Categorical data is discrete data. Categorical attributes have a finite number of distinct values, with no ordering among the values. Examples are geographic location, job category, and item type. A concept hierarchy can be generated automatically based on the number of distinct values of each attribute in the given attribute set. A common heuristic is that the attribute with the most distinct values is placed at the lowest level of the hierarchy, and vice versa. Consider the following example.

Example 2.7: Concept hierarchy generation based on the number of distinct values per attribute. A concept hierarchy for location is built from the attribute set {street, city, province_or_state, country}.
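The distinct-value heuristic of Example 2.7 can be sketched as follows. The six records are made-up placeholder values (s1, c1, ...), not data from the original, and the function name is hypothetical.

```python
def hierarchy_by_distinct_values(records, attributes):
    """Order attributes from the top level (fewest distinct values)
    to the bottom level (most distinct values)."""
    distinct = {a: len({r[a] for r in records}) for a in attributes}
    order = sorted(attributes, key=lambda a: distinct[a])
    return order, distinct

attributes = ["street", "city", "province_or_state", "country"]
# Hypothetical location records with placeholder values.
rows = [("s1", "c1", "p1", "X"),
        ("s2", "c1", "p1", "X"),
        ("s3", "c2", "p1", "X"),
        ("s4", "c3", "p2", "X"),
        ("s5", "c4", "p3", "Y"),
        ("s6", "c4", "p3", "Y")]
records = [dict(zip(attributes, row)) for row in rows]

order, distinct = hierarchy_by_distinct_values(records, attributes)
print(distinct)  # {'street': 6, 'city': 4, 'province_or_state': 3, 'country': 2}
print(order)     # ['country', 'province_or_state', 'city', 'street']
```

The heuristic is only a starting point: an attribute with few distinct values is not always the most general concept, so the generated order may need manual adjustment.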

Figure 2.8: Automatic generation of a concept hierarchy based on the number of distinct values of the attributes

Besides the methods above, there are other discretization methods, such as binning, clustering, and entropy-based discretization.

2.3 Data mining methods

From the tasks described above, we can see that data mining is not simply a matter of applying one single technique. Any method that helps find useful information can be used. Methods are chosen according to the task, and each method has its own strengths and limitations. Data mining methods can be grouped as follows:


- Statistical methods: the statistical approach is built on probabilistic models. It works by testing predefined hypotheses and by fitting models to the data. These models are typically used by statisticians, so a human has to supply the candidate hypotheses and the models.
- Case-based reasoning: solves a given problem by directly reusing past experience and solutions. A case is usually a specific problem that was encountered and solved before. Given a new problem, case-based reasoning examines the stored cases and looks for similar ones. If similar cases exist, their solutions are applied to the new problem, and the newly solved case is added to the system for later use.
- Neural networks: a class of models that imitate the human brain. The human brain consists of billions of nerve cells connected through synapses. Neural networks are built from a large number of simulated neurons connected to one another in a way that resembles nerve cells. As in a real brain, the strength of the connections can change in response to stimuli, which gives the network the ability to learn.
- Decision trees: each internal node of the tree represents a test or a decision on the data item under consideration. The outcome of the test determines which branch to follow next. To classify a data item, we start at the root and move down through the nodes that match the tests until we reach a leaf, which gives the decision. Decision trees can also be read as a special form of a rule set.
- Rule induction: rules express a statistical correlation between the occurrences of certain attributes or objects in the data. The general form of a rule is X1 ∧ … ∧ XN → Y [S, C], meaning that the occurrence of the attributes X1 … XN implies the attribute Y with support S and confidence C.
- Bayesian networks: a Bayesian network is a graphical representation of a probability distribution derived from statistics on the occurrences of objects. Specifically, it is a directed graph in which each node represents an attribute variable and the edges represent the probabilistic dependencies between those attributes.
- Genetic algorithms: genetic algorithms, or evolutionary programming, solve problems with an optimization strategy based on the principles of evolution observed in nature.
  The best candidate solutions pass through a selection stage and are combined with one another to produce better solutions. This process is repeated until the problem is solved or a stopping threshold is reached.
- Fuzzy sets: the main technique for representing and handling imprecision. Imprecision arises in today's databases in the form of inaccuracy, indeterminacy, inconsistency, and vagueness. Fuzzy sets exploit uncertainty to make complex systems easier to manage. They are therefore a powerful approach not only for dealing with incomplete, noisy, or imprecise data, but also for developing uncertain models of the data that behave more intelligently and flexibly than traditional systems.
- Rough sets: a rough set is defined by an upper approximation and a lower approximation. Objects in the lower approximation certainly belong to the set. Objects in the upper approximation only possibly belong to it. The upper approximation of a rough set is the union of the lower approximation and the boundary region. An object in the boundary region may belong to the set, but this cannot be determined with certainty. A rough set can thus be seen as a fuzzy set with a three-valued membership function: yes, no, and possibly. Like fuzzy sets, rough sets provide a mathematical concept for classifying data. Both rough sets and fuzzy sets are rarely used as stand-alone solutions; they are usually combined with other methods such as rule induction, classification, and clustering.

2.4 Some techniques used in data mining

Many techniques can be applied in data mining, including fuzzy sets, Bayesian methods, decision trees, evolutionary programming with genetic algorithms, machine learning, neural networks, clustering, association rules, multidimensional data models, online analytical processing (OLAP) tools, shortest distance, k-nearest neighbors, association rules with the AprioriTID algorithm, and inductive learning. This chapter only outlines some of the commonly used techniques.

2.4.1 Decision trees

In decision theory (for example, in risk management), a decision tree is a graph of decisions and their possible consequences (including risks and resource costs). Decision trees are used to build a plan for reaching a desired goal and to support the decision-making process. A decision tree is a special form of tree structure.

2.4.1.1 General introduction

In machine learning, a decision tree is a predictive model, that is, a mapping from observations about an object or phenomenon to conclusions about its target value. Each internal node corresponds to a variable, and the edge between a node and one of its children represents a specific value of that variable. Each leaf represents the predicted value of the target variable, given the values of the variables on the path from the root to that leaf. The machine learning technique that builds such trees is called decision tree learning, or simply decision trees.
Decision tree learning is also a common method in data mining. Here a decision tree describes a tree structure in which the leaves represent classifications and the branches represent the conjunctions of attributes that lead to those classifications. A decision tree can be learned by splitting the source set into subsets based on an attribute value test. This process is repeated recursively on each derived subset. The recursion ends when no further split is possible, or when a single classification applies to every element of the derived subset. A random forest classifier uses a number of decision trees to improve the classification rate.

A decision tree is also a descriptive means of calculating conditional probabilities. It can be described as a combination of mathematical and computational techniques that support the description, classification, and generalization of a given data set. The data is given as records of the form:

$$(x, y) = (x_1, x_2, x_3, \ldots, x_k, y)$$

The dependent variable $y$ is the variable that we want to understand, classify, or generalize. $x_1, x_2, x_3, \ldots$ are the variables that help us do so.

2.4.1.2 Types of decision trees

Decision trees go by two other names:

- Regression tree: estimates real-valued functions rather than being used for classification tasks (for example, estimating the price of a house or the length of a patient's hospital stay).
- Classification tree: used when $y$ is a categorical variable, such as gender (male or female) or the result of a match (win or lose).

Example: We use an example to explain decision trees. David is the manager of a well-known golf club. He has trouble predicting whether members will show up. On some days everyone wants to play golf and the club does not have enough staff. On other days, for no obvious reason, nobody comes and the club is overstaffed. David's goal is to optimize the number of staff on duty each day by using the weather forecast to predict when people will come to play. To do this, he needs to understand why customers decide to play and whether there is an explanation for it. So for two weeks he collected information about:

- Outlook (sunny, overcast, or rain).
- Temperature, in degrees Fahrenheit.
- Humidity.
- Whether it is windy or not.
- And, of course, whether people came to play golf that day.

David obtained a data set of 14 rows and 5 columns.

Golf data
             Independent variables                         Dependent
Outlook      Temperature (F)   Humidity (%)   Windy        Play
Sunny        85                85             no           no
Sunny        80                90             yes          no
Overcast     83                78             no           yes
Rain         70                96             no           yes
Rain         68                80             no           yes
Rain         65                70             yes          no
Overcast     64                65             yes          yes
Sunny        72                95             no           no
Sunny        69                70             no           yes
Rain         75                80             no           yes
Sunny        75                70             yes          yes
Overcast     72                90             yes          yes
Overcast     81                75             no           yes
Rain         71                80             yes          no

Table 3.1: Golf data

A decision tree model is then built to solve David's problem.

Figure 3.1: The resulting decision tree

A decision tree is a data model that encodes the distribution of the class label (here $y$) in terms of the predictor attributes. It is a directed acyclic graph in the form of a tree. The root node (the node at the top) represents the entire data set. The classification tree algorithm finds that the best way to explain the dependent variable, play, is the variable Outlook. Splitting on the values of Outlook gives three groups: people who play golf when it is sunny, when it is overcast, and when it rains.

First conclusion: if it is overcast, people always play golf. And some people are keen enough to play even when it rains.
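The text does not say which criterion the algorithm uses to decide that Outlook is the best first split. Assuming an ID3-style information gain (an assumption of this sketch, not something stated in the original), the following compares the two categorical attributes on the 14 rows of the golf table, read as (outlook, temperature, humidity, windy, play).

```python
from collections import Counter
from math import log2

# (outlook, temperature_F, humidity_pct, windy, play): rows of the golf table.
ROWS = [
    ("sunny", 85, 85, "no", "no"),      ("sunny", 80, 90, "yes", "no"),
    ("overcast", 83, 78, "no", "yes"),  ("rain", 70, 96, "no", "yes"),
    ("rain", 68, 80, "no", "yes"),      ("rain", 65, 70, "yes", "no"),
    ("overcast", 64, 65, "yes", "yes"), ("sunny", 72, 95, "no", "no"),
    ("sunny", 69, 70, "no", "yes"),     ("rain", 75, 80, "no", "yes"),
    ("sunny", 75, 70, "yes", "yes"),    ("overcast", 72, 90, "yes", "yes"),
    ("overcast", 81, 75, "no", "yes"),  ("rain", 71, 80, "yes", "no"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, label_index=4):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    labels = [r[label_index] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[label_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gain_outlook = information_gain(ROWS, 0)
gain_windy = information_gain(ROWS, 3)
print(round(gain_outlook, 3), round(gain_windy, 3))  # 0.247 0.048
```

Under this criterion Outlook separates the play/no-play days far better than Windy, which is consistent with Outlook appearing at the root of the tree. The numeric attributes would need a threshold search and are left out of the sketch.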


Next, we split the sunny group into two subgroups. We find that customers do not want to play golf when the humidity is above 70%. Finally, we split the rain group into two and find that customers will not play golf when it is windy.

This gives a concise solution to the problem, described by the classification tree. David lets most of his staff take the day off on sunny, humid days and on rainy, windy days, because almost nobody will play golf on those days. On the other days, when many people will come to play, he can hire extra temporary staff to help. In short, a decision tree turns a complex representation of the data into a much simpler structure.

2.4.1.3 Advantages of decision trees

Compared with other data mining methods, decision trees have several advantages:

- Decision trees are easy to understand. People can understand a decision tree model after a brief explanation.
- Data preparation for a decision tree is basic or unnecessary. Other techniques often require data normalization, the creation of dummy variables, and the removal of empty values.
- Decision trees can handle both numerical and categorical data. Other techniques usually specialize in data sets with only one type of variable. For example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables.
- A decision tree is a white-box model. If a given situation can be observed in the model, the condition can easily be explained with Boolean logic. A neural network is an example of a black-box model, because the explanation of its results is too complex to be understood.
- A model can be validated with statistical tests, which makes it possible to trust the model.
- Decision trees can handle large amounts of data in a short time. A personal computer can analyze large data sets quickly enough for strategists to make decisions based on the decision tree analysis.

2.4.2 Association rules

Association rules are an important direction in data mining. They help us find relationships between the items of a database. Association rules have a fairly simple form but carry a lot of meaning. The information they provide is significant and is a considerable aid to decision making. Finding rare and informative association rules in operational databases is one of the main approaches in the field of data mining.

Association rules are rules of the form "80% of the customers who buy a mobile phone also buy a SIM card, and 30% of customers buy both a mobile phone and a SIM card" or "75% of the customers who make long-distance calls and live in rural districts use the IP 171 long-distance service, and 25% of customers make long-distance calls, live in rural districts, and use the IP 171 long-distance service". Here "buy a mobile phone" and "make long-distance calls and live in rural districts" are the left-hand sides (antecedents) of the rules, while "buy a SIM card" and "use the IP 171 long-distance service" are the right-hand sides (consequents). The numbers 30% and 25% are the support of the rules (the percentage of transactions that contain both the left-hand and the right-hand side), while 80% and 75% are the confidence of the rules (the percentage of transactions satisfying the left-hand side that also satisfy the right-hand side).

Support and confidence are the two measures of an association rule. A support of 25% means "among the customers who use telephone services, 25% use both long-distance calls and the IP 171 service". A confidence of 75% means "among the customers who make long-distance calls, 75% use the IP 171 service".

2.4.2.1 Statement of the association rule mining problem

I = {i1, i2, …, in} is a set of n items (also called attributes). A set X ⊆ I is called an itemset.
T = {t1, t2, …, tm} is a set of m transactions (also called records); each transaction is identified by a TID (transaction identifier).
R is a binary relation on I and T (R ⊆ I x T). If transaction t contains item i, we write (i, t) ∈ R (or iRt). (T, I, R) is the data mining context. Formally, a database D is a binary relation R as above. In terms of meaning, a database is a set of transactions, where each transaction t is an itemset, t ∈ 2^I (2^I is the set of all subsets of I).

Example of a transaction database: I = {A, B, C, D, E}, T = {1, 2, 3, 4, 5, 6}. The transactions are given in the following table:

Transaction identifier (TID)    Itemset
1                               A B D E
2                               B C E
3                               A B D E
4                               A B C E
5                               A B C D E
6                               B C D

Table 3.2: Example of a transaction database D

Let X ⊆ I be an itemset. The support s(X) of X is the percentage of transactions in the database D that contain X, out of the total number of transactions in D:

s(X) = Card(X) / Card(D) x 100%

where Card(X) here denotes the number of transactions in D that contain X.
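The support definition can be checked against the example database with a short sketch. The function name is hypothetical, and transaction 6 is taken as {B, C, D}, which is the reading that matches the supports listed in Table 3.3 (an assumption of this sketch).

```python
# Transactions of the example database, keyed by TID.
# Assumption: TID 6 is read as {B, C, D}, consistent with Table 3.3.
D = {
    1: set("ABDE"),
    2: set("BCE"),
    3: set("ABDE"),
    4: set("ABCE"),
    5: set("ABCDE"),
    6: set("BCD"),
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions.values() if items <= t)
    return hits / len(transactions)

for x in ["B", "BE", "ABE", "AD"]:
    print(x, f"{support(x, D):.0%}")
# B 100%, BE 83%, ABE 67%, AD 50%
```

The subset test `items <= t` is what "transaction t contains X" means in the definition; support is returned as a fraction and formatted as a percentage only for display.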


Frequent itemset: Let X ⊆ I be an itemset and let minsup ∈ (0, 1] be a minimum support threshold specified by the user. X is called a frequent itemset with respect to minsup if and only if its support is greater than or equal to the threshold: s(X) ≥ minsup. We write FX(T, I, R, minsup) = {X ⊆ I | s(X) ≥ minsup}.

For (T, I, R) of the example database in Table 3.2 and a threshold minsup = 50%, the frequent itemsets are:

Frequent itemsets                                Support (s)
B                                                100%
E, BE                                            83%
A, C, D, AB, AE, BC, BD, ABE                     67%
AD, CE, DE, ABD, ADE, BCE, BDE, ABDE             50%

Table 3.3: Frequent itemsets for minsup = 50%

The support s of an association rule X → Y is the percentage of transactions in D that contain both X and Y: s(X → Y) = Card(X ∪ Y) / Card(D) x 100%.

An association rule has the form X →c Y, where X and Y are itemsets satisfying X ∩ Y = ∅ and c is the confidence. The confidence of the rule is c = s(X ∪ Y) / s(X) x 100%: the percentage of the transactions in D containing X that also contain Y. In probabilistic terms, the confidence c of an association rule is the conditional probability of Y given X.

Confident association rule: A rule is considered confident if its confidence c is greater than or equal to a threshold minconf ∈ (0, 1] specified by the user (c ≥ minconf, the minimum confidence). The threshold minconf reflects how often Y occurs when X is given.

The association rules we look for are those that satisfy the given minsup and minconf. We are only interested in rules whose support is at least the minimum support and whose confidence is at least the minimum confidence.

Most association rule mining algorithms work in two phases:

Phase 1: Find all frequent itemsets in the database, that is, all itemsets X with s(X) ≥ minsup.
Phase 2: Generate confident rules from the frequent itemsets found in phase 1. If X is a frequent itemset, an association rule generated from X has the form X' →c X \ X', where:
- X' is a non-empty subset of X;
- X \ X' is the set difference of X and X';
- c is the confidence of the rule, satisfying c ≥ minconf.

From the frequent itemset ABE in Table 3.3, we can generate the following association rules:

Association rule    Confidence    c ≥ minconf?
A → BE              100%          Yes
B → AE              67%           No
E → AB              80%           Yes
AB → E              100%          Yes
AE → B              100%          Yes
BE → A              80%           Yes

Table 3.4: Association rules generated from the frequent itemset ABE

Maximal frequent itemset: Let M ∈ FX(T, I, R, minsup). M is called a maximal frequent itemset if there is no X ∈ FX(T, I, R, minsup) with M ⊂ X and M ≠ X.

2.4.2.2 Approaches to association rule mining

- Binary association rules (also called Boolean association rules): the earliest line of research on association rules. Most early studies dealt with binary rules. In this form, we only care whether an item (attribute) appears in a database transaction or not, not how often it appears. The best-known algorithm for mining this kind of rule is Apriori, together with its variants. This is the simplest form of rule, and other forms can be converted to it by methods such as discretization and encoding.
- Quantitative and categorical association rules: the attributes of real databases have many different types (binary, quantitative, categorical). To discover association rules over such attributes, researchers have proposed discretization methods that convert these rules into binary form so that existing algorithms can be applied.
- Association rules based on rough sets: rules are searched for on the basis of rough set theory.
- Multi-level association rules: in addition to very specific rules such as "buy an IBM PC => buy the Microsoft Windows operating system AND buy the Microsoft Office suite", this approach also looks for rules such as "buy a PC => buy an operating system AND buy office software". The latter form generalizes the former, and generalization is possible at many different levels.
- Fuzzy association rules: because of the limitations of discretizing quantitative attributes, researchers have proposed fuzzy association rules, which overcome those limitations and bring the rules into a form that is more natural and closer to the user.
- Association rules with weighted items: in practice, the attributes of a database do not always play equal roles. Some attributes receive more attention and are more important than others.
Parallel mining of association rules: besides sequential mining, researchers have also focused on parallel algorithms for discovering association rules. Parallel and distributed processing is needed because data volumes keep growing, which places demands on both processing speed and system memory. Many different parallel algorithms have been proposed, some of them independent of the hardware.

2.4.3 Multidimensional data model

The Multi Dimensional Data Model (MDDM) is a model in which data is represented in an n-dimensional space. It suits arithmetic and statistical computation: aggregating and grouping data in different ways, and analysing data with non-parametric regression. MDDM is also used to discover association rules between indicators of the form "if X then Y" with confidence c%. The formal definition of MDDM is as follows.

2.4.3.1 Definition

Let D1, D2, …, Dn be sets of discrete values, and let x1, x2, …, xn be variables ranging over D1, D2, …, Dn respectively. A multidimensional data model is a mapping from the Cartesian product of the domains D1 × D2 × … × Dn to a subset U of the real numbers:

f: B → U

where B = D1 × D2 × … × Dn; D1 = Dom(x1), D2 = Dom(x2), …, Dn = Dom(xn); and U ⊂ R. B is the domain and U is the set of values of the mapping f.

If the tuple (x1, x2, …, xn) is viewed as a point in n-dimensional space and f(x1, x2, …, xn) as the numeric value at that point, then the number of points in the model is:

Card(D1) × Card(D2) × ... × Card(Dn)

Example: for n = 3 the multidimensional data model is a rectangular box, also called a cube, in which the tuple (x1, x2, x3) gives the coordinates in three-dimensional space and f(x1, x2, x3) is the value of the corresponding cell of the box. The geometric representation of the dimensions is shown in the figure:

Figure 3.2: Geometric representation of the n-dimensional data model (n = 3)

Each domain defines one dimension, so a two-dimensional table in a database can be transformed into the n-dimensional data model:


Figure 3.3: Transforming a two-dimensional table into the n-dimensional data model

2.4.3.2 Operations on the dimensions of MDDM

We can operate on one, two or several dimensions of an MDDM by fixing the remaining dimensions at specific values. This can be seen as a parameterized projection onto some of the dimensions.

Example: suppose we have n-1 fixed values x2' ∈ D2, x3' ∈ D3, …, xn' ∈ Dn and a mapping g defined as:

g(x1) = f(x1, x2 = x2', …, xn = xn'), with x1 ∈ D1.

Then g(x1) is a one-dimensional data table (y1, y2, …, yk), where k = Card(D1).

Similarly, if we fix the n-2 dimensions D3, D4, …, Dn at specific values x3'' ∈ D3, x4'' ∈ D4, …, xn'' ∈ Dn, then the mapping on the two dimensions D1 and D2:

g(x1, x2) = f(x1, x2, x3 = x3'', …, xn = xn''), with x1 ∈ D1, x2 ∈ D2

gives a two-dimensional data table with k rows, one for each value of D1, and l columns, one for each value of D2. In the same way we can define the extraction of m dimensions from the original n-dimensional MDDM f.

Because the extracted tables contain real values, we can apply sum, arithmetic mean, min, max, variance and standard deviation to the cells of one column (letting x1 vary over D1 with the other domains fixed), to the cells of one row (letting x2 vary over D2 with the other domains fixed), or to all cells of the table. The result of each of these operations is again a real number.

Consider a simple example from a household living-standards survey. Suppose there are two domains: D1 = {101, 102, 103, 104, 105, 106}, the codes of the surveyed households, and D2 = {TV, refrigerator, motorbike, washing machine, air conditioner}, the household amenities that may be owned. If the mapping is defined as f: D1 × D2 → ℕ, we obtain a table of the number of amenities of each kind per household. If f: D1 × D2 → {0, 1}, then f yields a table recording whether each household has (1) or does not have (0) each amenity.

Assume household 101 has a TV, a motorbike and an air conditioner; household 102 has a TV, a refrigerator and a washing machine; household 103 has a TV and a washing machine; household 104 has a refrigerator and a motorbike; household 105 has a TV, a refrigerator, a motorbike, a washing machine and an air conditioner; household 106 has a TV, a refrigerator and a motorbike. The function f then gives the following two-dimensional table:

Household   TV   Refrigerator   Motorbike   Washing machine   Air conditioner
101          1        0             1              0                 1
102          1        1             0              1                 0
103          1        0             0              1                 0
104          0        1             1              0                 0
105          1        1             1              1                 1
106          1        1             1              0                 0

Table 3.5: Survey data on ownership of amenities

Summing down each column gives the number of households owning each amenity. Summing across each row gives the number of amenities owned by each household.

Consider a model on dimension D2 in which the number of items under consideration is 2. Then δ(TV) = 1; δ(refrigerator) = 1; δ(motorbike) = 0; δ(washing machine) = 0; δ(air conditioner) = 0, so that δ(D2) = (1, 1, 0, 0, 0). Define the projection g on D1 as follows:

g(x1) = 1 if the sum over x2 ∈ D2 of δ(x2) · f(x1, x2) equals c, and g(x1) = 0 otherwise,

where c is the number of items considered (c = 2). Then g(x1) is a one-dimensional table: g(x1) = {0, 1, 0, 0, 1, 1}. The number of 1s in the vector g(x1) is the number of households that own both a TV and a refrigerator.

By creating new mappings (MDDMs) and applying arithmetic operations and aggregate functions to them, we can obtain interesting analyses of database tables. The n-dimensional data model is well suited to statistical data analysis. Microsoft's OLAP online analysis tool is built on this data model.

2.4.4 Shortest distance

This method treats records as points in a multidimensional data space. With this idea, the distance between two records in the data space can be interpreted as follows: related records lie close together, and records that are far apart have little in common. Suppose a sample database contains the attributes age, income and credit. These three attributes form a three-dimensional data space in which the distances between records can be analysed.

              Age   Income   Credit
Customer 1     32   40,000   10,000
Customer 2     24   30,000    2,000
Difference      8       10        8

Table 3.6: Sample customer data

The distance between the two customers is calculated as:

√(8² + 10² + 8²) = √228 ≈ 15
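The distance calculation can be reproduced with a minimal Python sketch. It assumes, as the differences 8, 10, 8 in Table 3.6 imply, that income and credit are divided by 1000 before the Euclidean distance is taken. The function names `scale` and `euclidean` are illustrative.

```python
import math


def scale(customer):
    """Bring all attributes to a comparable range.

    Age is kept as is; income and credit are divided by 1000
    (an assumption matching the differences shown in Table 3.6).
    """
    return (customer["age"], customer["income"] / 1000, customer["credit"] / 1000)


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


customer_1 = {"age": 32, "income": 40_000, "credit": 10_000}
customer_2 = {"age": 24, "income": 30_000, "credit": 2_000}

d = euclidean(scale(customer_1), scale(customer_2))
print(round(d, 1))  # 15.1
```

Without the scaling step, the income difference of 10,000 would dominate the result and age would contribute almost nothing.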

Figure 3.4: Records represented as points in a space defined by their attributes; the distance between them can be measured

Example: age ranges from 1 to 100, while income ranges from 0 to 100,000 dollars per month. If this data is used without adjustment, income becomes a far more discriminating attribute than age, which is not what we want. We therefore divide income by 1000 to bring it to a scale comparable to age, and do the same for the credit attribute. When all attributes are measured on the same scale, we obtain a reliable distance measure for comparing records. In the example, using the Euclidean measure, the distance between customer 1 and customer 2 is about 15.

2.4.5 K-nearest neighbours

Once records are interpreted as points in a multidimensional data space, we can define the notion of a "neighbour": records that lie close to each other are neighbours.

Suppose we want to predict the behaviour of a set of customers from a database whose records describe those customers. The basic assumption needed for such a prediction is that customers of the same type behave in the same way. In the metaphor of the multidimensional data space, a type is simply a region of that space. Records of the same type therefore lie close together in the data space: they are neighbours. This insight leads to a powerful but very simple algorithm, the k-nearest neighbours algorithm.

The basic idea of k-nearest neighbours is "do as your neighbours do". To predict the behaviour of a particular individual, we look at the behaviour of, say, the ten people closest to him in the data space. We compute the average behaviour of these ten people, and this average is the basis for the prediction for that individual. The letter k in k-nearest neighbours stands for the number of neighbours examined.

Plain k-nearest neighbours is not really a learning technique but a pure search method, because the data set itself is used only for lookup. It does not produce a theory that helps us understand the structure of the data. For example, to make a decision for every element of a data set of n records, each record must be compared with all the others. This leads to quadratic complexity, which is a serious problem for large databases. A simple k-nearest neighbours analysis of a database of one million records requires a thousand billion (10¹²) comparisons. Such an approach is clearly impractical, although much research has gone into speeding up the search. As a rule, data mining algorithms should not have computational complexity greater than n·log(n), where n is the number of records. In practice the k-nearest neighbours technique is therefore seldom used. An example of the k-nearest neighbours algorithm:

Table 3.7: Some examples of the k-nearest neighbours technique

2.4.6 Clustering

Clustering is a form of unsupervised learning in which the training samples are not labelled. Its purpose is to find representative samples, or to group data that are similar under some evaluation criterion into clusters. Data points in different clusters are less similar to each other than data points in the same cluster.

Cluster analysis has many applications, including market research, pattern recognition, data analysis and image processing. In business, it can help marketers discover differences between customer groups from customer information and characterize those groups by their purchasing patterns. In biology, it can be used to classify plants and animals, and genes with similar functions. Cluster analysis can also classify land by function or actual use to support suitable planning policies, and classify documents on the Web.

The basic requirements of cluster analysis in data mining are:
- Works efficiently with large volumes of data: clustering on a sample of a large data set can lead to biased results, so clustering algorithms that scale to large databases are needed.
- Handles different kinds of data: many algorithms are designed for numeric data, but applications may require clustering of other kinds, such as binary, categorical or ordinal data, or a mixture of these types.
- Discovers clusters of arbitrary shape: many clustering algorithms are based on Euclidean or Manhattan distance. Such algorithms tend to find spherical clusters of similar size and density. A cluster, however, may have any shape, so algorithms that find clusters of arbitrary shape are needed.
- Requires minimal domain knowledge to determine input parameters: many clustering algorithms require the user to supply parameters. The clustering results can be sensitive to these parameters, which are often hard to determine, especially for data sets containing high-dimensional objects.
- Works with noisy data.
- Is insensitive to the order in which the data is input.
- Works well on high-dimensional databases.
- Accepts user-specified constraints.
- Produces clustering results that can be understood and used.

This section gives only a brief introduction to clustering; the technique is presented in more detail in chapter 4.

2.4.7 Data visualization techniques

Data visualization is a very effective way to discover patterns in a data set. It can be used at the start of the data mining process to get a feel for the value of the data set and for where patterns are likely to be found. These capabilities are provided by three-dimensional object-oriented visualization tools that let users explore interactive three-dimensional structures. The technique is currently being developed with advanced virtual-reality graphics, which let the viewer move through an artificial data space while the data set is transformed. Most users, however, do not have access to such tools and must rely on the simple graphics found in query tools or data mining tools. Even these simple methods can provide a good deal of valuable information. A basic and highly valuable technique is the scatter plot, in which information on two attributes is displayed in Cartesian space. Scatter plots can be used to identify subsets of the data that are of interest, so that the remainder of the data mining process can concentrate on them.


Figure 3.6: Plot based on two measures

In the example above, the plot is built from two measures: income and age. It shows that middle-aged people with low income tend to read music magazines. A much better way to explore a data set is through an interactive three-dimensional environment, as Figure 3.7 illustrates.

Figure 3.7: Interactive three-dimensional plot

2.4.8 Neural networks

2.4.8.1 Overview

An Artificial Neural Network (ANN) is an information processing model based on the way biological nervous systems, such as the brain, operate. The key element of the model is the particular structure of the system: a large number of interconnected processing elements (called neurons) working together to solve specific problems. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in the system consists of adjusting the synaptic connections that exist between the neurons.

The first artificial neuron was created in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts, but the technology of the time did not allow its strengths to be developed. Today's neural networks incorporate many improvements and meet the requirements of practical problems. Some advantages of present-day neural networks over earlier ones are:


1. Adaptive learning: the ability to learn how to perform a task from the data supplied during training or from initial parameters.
2. Self-organization: an ANN can organize or represent by itself the information it receives during learning.
3. Fault tolerance through redundant information coding: destroying part of a network degrades the performance of the system, but some networks are able to retain what the destroyed part had learned.

In the 1980s, researchers began developing neural networks with more complex architectures to overcome earlier limitations.

2.4.8.2 The neural network model

The general neural network model has the following form:

Today neural networks can solve many problems that are complex for humans and are applied in many fields, such as recognition, identification, classification, and signal and image processing. A neural network can be used to analyse a database as shown in Figure 3.8.

Figure 3.8: Illustration of a neural network architecture

The figure above shows a simple neural network architecture used to analyse a marketing database. The age attribute is divided into three age classes, each fed into a separate input node; home ownership and car ownership each have their own input node; and four nodes identify four regions. Each input node corresponds to a simple Yes/No decision. The output nodes work the same way: each magazine is one node. The input nodes are fully connected to the hidden nodes, and the hidden nodes are fully connected to the output nodes. At the start, all branches between nodes carry equal weights. During training, the network receives input and output patterns corresponding to records in the database and updates the branch weights until every input node is linked to the appropriate output node.

The bold branches show a relationship between readers of car magazines and comics magazines who are under 30, own a car and live in region 3. The network, however, does not provide a rule that identifies this association. Neural networks perform well on classification tasks and can be very useful in data mining.

2.4.9 Genetic algorithms

2.4.9.1 General introduction

There is a remarkable link between technology and nature: the theory of evolutionary computation. Evolutionary computation solves problems by applying evolutionary techniques. The field comprises three independent branches: genetic algorithms, evolutionary programming and evolution strategies. All three were first studied in the 1960s and 1970s, but were not widely accepted until the late 1980s, when researchers began to see a clear relationship between computing and biology. Genetic algorithms are now regarded as one of the most successful machine learning techniques.

Nature provides inspiration in two ways: natural selection as a problem-solving technique, and DNA as a technique for encoding problems. In "On the Origin of Species", Darwin described the theory of evolution, in which natural selection is the central concept. Every species produces more offspring than can survive; in the fierce struggle for life, only those best adapted to the environment survive.

In 1953, James Watson and Francis Crick discovered the double-helix structure of DNA. Genetic information is stored in long DNA molecules built from four building blocks. This means that all the genetic information of a human individual, or of any living creature, is written in a language of only four characters (C, G, A and T). The set of genetic instructions for a human amounts to about 3 billion characters.
Each individual inherits some characteristics from the father or the mother. Differences between individuals, such as hair colour or eye colour, and congenital diseases are caused by differences in the genetic code.

Natural selection has both advantages and disadvantages. One clear disadvantage is the overproduction of individuals. Another is that the process as a whole is not goal-directed, because the improvement of a species depends on changes in the environment. Species can evolve through random gene mutation.

The recipe for building a genetic algorithm to solve a problem is as follows:
- Devise a good code that represents the problem as strings over a limited set of characters.
- Create an artificial environment in the computer in which solutions can be combined with each other, and provide an object that can judge success or failure; specialists call this the "fitness function".
- To combine solutions, apply crossover operations: the gene strings of the father and the mother are cut and, after being exchanged, joined together again. During reproduction, all kinds of mutation may be applied.
- Start with a set of completely different "individuals" and let the computer carry out evolution by discarding bad solutions in each generation and replacing them with offspring or good mutations. Stop when a family of good solutions has formed.

Genetic algorithms have both strengths and weaknesses. Two limitations are overproduction and the random nature of the search process. On the other hand, solutions produced by genetic algorithms are encoded as symbols and are therefore usually easy to read, which is an advantage over neural networks.

2.4.9.2 Basic steps of a genetic algorithm

A simple genetic algorithm consists of the following steps:
Step 1: Initialize a population of chromosome strings.
Step 2: Determine the objective value of each chromosome.
Step 3: Create new chromosomes using the genetic operators.
Step 4: Evaluate the objective function for the new chromosomes and add them to the population.
Step 5: Remove chromosomes with low fitness.
Step 6: Check the stopping condition. If it is satisfied, return the best chromosome and stop; otherwise, go back to step 3.


[Flowchart: Start → Initialize population → Encode variables → Evaluate fitness → Selection → Crossover → Mutation → Stopping condition satisfied? (No: loop back; Yes: Result) → End]

General structure of a genetic algorithm:

Begin
    t = 0;
    Initialize P(t)
    Compute the fitness of the individuals in P(t);
    While (stopping condition not satisfied) loop
        t = t + 1;
        Select P(t)
        Crossover P(t)
        Mutate P(t)
    End loop
End
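The general structure above can be sketched in Python on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the text: the problem (maximize the number of 1 bits in a chromosome), tournament selection, single-point crossover, bit-flip mutation, keeping the best individual in each generation, and all function and parameter names.

```python
import random


def fitness(chromosome):
    """Toy fitness function (assumption): count of 1 bits."""
    return sum(chromosome)


def select(population, rng, k=3):
    """Tournament selection: best of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: cut both parents and join the pieces."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def genetic_algorithm(n_bits=20, pop_size=30, max_generations=200,
                      rate=0.02, seed=0):
    rng = random.Random(seed)
    # t = 0; initialize P(t)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    t = 0
    # Loop while the stopping condition is not satisfied.
    while fitness(best) < n_bits and t < max_generations:
        t += 1
        children = []
        while len(children) < pop_size - 1:
            child = crossover(select(population, rng),
                              select(population, rng), rng)
            children.append(mutate(child, rate, rng))
        # Keep the best individual so far (elitism, an added assumption).
        population = [best] + children
        best = max(population, key=fitness)
    return best, t


best, generations = genetic_algorithm()
print(fitness(best), generations)
```

The stopping condition combines a target fitness with a generation limit, so the loop always terminates even if the optimum is not reached.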


Chapter summary

This chapter introduced the general data mining process, the knowledge discovery process as applied to a specific problem, and data preprocessing, and surveyed nine techniques used in data mining: decision trees, association rule mining, the multidimensional data model, shortest distance, k-nearest neighbours, clustering, data visualization, neural networks and genetic algorithms. The next chapter presents the application of data mining techniques in IDS/IPS systems.

Chapter 3
APPLYING DATA MINING TECHNIQUES IN IDS SYSTEMS

3.1 IDS systems

3.1.1 Introduction

The continuous development of computer technology is undeniable. Many companies have set up Internet trading portals, customers are increasingly interested in buying and selling online, and the volume of collected information is growing rapidly. The Internet has become a common tool for communication. Along with this technological development comes the need to strengthen security. The accessibility, openness and complexity of the Internet make information system security more urgent than ever. Connecting an information network to networks such as the Internet and the public telephone system increases the potential risk to the system.

This thesis aims to explore and examine the issues that may arise with an Intrusion Detection System (IDS), a very important part of computer security in particular and network security in general. An IDS cannot by itself prevent security breaches, but it detects threats by monitoring for unwanted activity. The goal of this thesis, drawing on practical experience and online references, is to build an IDS that can learn attack behaviour and identify new attacks without the system having to be updated. This is feasible. Such a flexible system would not need a manually updated database of attack signatures; it could also identify new attacks from learned patterns and would not depend on third-party rules.

3.1.2 Intrusion detection systems (IDS)

3.1.2.1 What is an IDS?

Fitting alarms to the doors and windows of your house is like installing an intrusion detection system (IDS) in your home. In the same way, an intrusion detection system is used to protect your computer network. An IDS is software, hardware or a suitable combination of the two that recognizes threats that may attack your network. It detects intrusive activity that enters the network. Intrusive activity can be identified by examining network traffic, host logs, system calls and other areas that give off signs of an attack on the network. Besides detecting attacks, most intrusion detection systems also provide some form of response to them, such as resetting TCP connections.

3.1.2.2 Roles and functions of an IDS

* Detecting attack threats and unauthorized access
- This is the main role of an IDS: identifying attacks on, and unauthorized access to, the internal network.
- An IDS helps detect security threats to the network that other systems (such as firewalls) cannot detect. Combined with an intrusion prevention system (IPS), it helps stop and limit attacks and intrusions from outside.

* Improving understanding of what is happening on the network
- An IDS provides intrusion monitoring and security reporting, giving a consolidated view of what is running on the network from both the application and the network perspective. Linked with security analysis and investigation, it supplies information about the system that helps administrators grasp and understand what is happening on the network.

* Alerting and supporting attack prevention
- An IDS can operate as a passive monitoring device (sniffer mode) supporting active monitoring devices, or as an active prevention device (able to drop suspicious traffic).
- An IDS helps security systems make decisions about traffic based on IP address or port as well as on attack characteristics, for example attack signatures, protocol anomalies, or traffic coming from invalid hosts.
- An IDS can also raise alerts and log events, and capture traffic packets when an attack is detected, giving network administrators information for analysing and investigating the events.
- Once analysis and investigation are complete, a rule for dropping traffic is issued based on their results.
- Together, these properties and capabilities let network administrators integrate an IDS into the network and raise security to a level that could not previously be reached with single measures such as a firewall.

3.1.2.3 Physical model of an IDS

* At the physical level, an intrusion detection system consists of three main components:
- Sensor: monitors traffic within different network segments to collect information and data about network activity.
- Centralized database server: stores the information and data sent back by the sensors.
- User interface: helps network administrators manage and monitor the system.

3.1.2.4 Internal structure and operation of an IDS

An intrusion detection system consists of three main modules:
- Information and data collection module
- Analysis and attack detection module
- Response module

* Information and data collection module
- This module collects packets on the network for analysis.
- The practical question is where in the network the IDS should be deployed. Normally the IDS is placed at the points that need to be monitored.
- There are two main data collection models:
+ Out-of-band model.
+ In-line model.

Out-of-band data collection model
- In the out-of-band model the IDS does not interfere directly with the data flow. Traffic entering and leaving the network is copied, and the copy is forwarded to the data collection module.
- With this approach the IDS does not affect the network's traffic speed.

In-line data collection model
- In this model the IDS is placed directly in the data flow entering and leaving the network; traffic must pass through the IDS before entering the network.
- The advantage of this model is that the IDS controls the data flow directly and can respond immediately to security events. Its drawback is a significant impact on the network's traffic speed.

* Analysis and attack detection module

This is the most important module; its task is to detect attacks. It is divided into three stages: preprocessing, analysis and alerting.

Preprocessing: gathering data and reformatting packets.
- Data is arranged by category and class.
- The format of the incoming data is determined (the data is split up by category).
- Packets may also be reassembled (defragmented) and put in sequence.

Analysis:
- Misuse detection models: pattern-based, with the advantage of accuracy.
+ System activity is analysed in search of events that match known attack patterns.
+ Advantages: attacks are detected quickly and accurately, with few false alarms that would degrade network operation, and administrators are helped to identify security holes in their systems.
+ Disadvantages: attacks that are not in the database, such as new kinds of attack, cannot be detected, so the system must be updated continually with new attack patterns.
- Anomaly detection models: based on abnormal behaviour.
+ Initially, the system stores profiles of the normal activity of the system.
+ Attacks and intrusions cause abnormal activity, and the technique detects that abnormal activity.
+ Variants include threshold-based detection, detection through self-learning, and detection based on protocol anomalies.
+ Advantages: new kinds of attack can be detected, providing useful information that complements misuse detection.
+ Disadvantages: a number of false alarms are usually generated, which reduces the operating efficiency of the network.

Alerting: this stage generates alerts according to the characteristics and type of the attack or intrusion the system has detected.

* Response module
- This is one point of difference between an intrusion detection system (IDS) and an intrusion prevention system (IPS).
- An IDS usually has only very limited ability to block attacks, because it relies on a passive response mechanism.
- A passive response cannot prevent damage, because it is issued only after the attack has affected the system.
- When there are signs of an attack or intrusion, the attack detection module sends a signal reporting it to the response module.
- The response module then sends a signal that triggers the firewall to block the attack, or alerts the administrator.
- If this module only sends alerts to administrators and stops there, the system is called a passive defence system. Depending on the system, the response module has different blocking functions and methods.

3.1.2.5 Classification

IDSs are divided into two main types:

HIDS (Host Intrusion Detection System): deployed on important workstations or servers, protecting only the individual machine.

NIDS (Network Intrusion Detection System): placed at important points of the network to detect intrusions in that segment.

* Comparison of HIDS and NIDS:

NIDS                                                        | HIDS
Applies to a wide scope (monitors all network activity)    | Applies to a single host
Good at detecting attacks and intrusions from outside      | Good at detecting attacks and intrusions from inside
Detection based on data collected across the whole network | Detection based on data on the host
Independent of the operating system                        | Depends on the host's operating system
Saves cost in deployment                                    | Requires high cost
Easy to install and deploy                                  | Complex to install and deploy

3.2 Data mining in IDS

3.2.1 NIDS based on data mining

Intrusion detection systems of this kind rest mainly on statistical analysis or data mining. In this approach we keep a store of profiles of normal, harmless user behaviour. Every new event in the system is compared with these normal profiles, and the system determines how far the event deviates from them. It then decides whether the event should be flagged as an intrusion. Two kinds of error can occur: 1) anomalous behaviour (normal behaviour that looks suspicious) that is not an intrusion is flagged as an intrusion, giving a false positive; 2) intrusive behaviour is not flagged as an intrusion, giving a false negative, which is the more dangerous case.

An intrusion detection system based on data mining needs a database, preferably a large one, since data mining yields little on a small database, containing all harmless behaviours or signatures. Data mining techniques can then be applied to this database to determine correlations and associations in the data and to search for new rules.

3.2.1.1 Source of audit data

The main source of audit data is the network's sensors. A network sensor can be internal or external. Internal sensors are mainly used to identify attacks on a particular host. External sensors are usually used to monitor any host on the network. These sensors monitor network traffic and record the data. The data records contain many attributes.

Data mining for intrusion detection can use data at the TCPDUMP level or at the alert level. Both kinds of data contain fields for source IP address, destination IP address, source port, destination port, protocol (TCP, UDP, ...), time of occurrence and duration of the transaction. These basic attributes give a good view of individual connections or alerts.

For example, a sensor may be software installed on a workstation or on a network gateway. It scans ports and measures the bandwidth of connections; if there is a large deviation, it identifies this as an attack and, as a result, blocks access to a particular port or forwards the issue to the system administration component. One example of such software is BSM (Basic Security Module), part of the Solaris UNIX operating system. Similar software available for Windows is BAM (Basic Auditing Module), developed at Columbia University. Another example of a sensor system is Network Flight Recorder (NFR), a commercial network monitoring system consisting of a packet-sniffing engine together with a high-level language for interpreting network packet data.

We now look at what this audit data looks like. When a network sensor collects the data of a connection, it obtains it in a raw, unstructured form. Suppose a sensor detects that a connection has been established, or that someone is accessing the network through a port. It captures the destination IP address, source IP address, destination port, source port, protocol (TCP, UDP, etc.), time of occurrence and duration of the transaction. All of this data is kept together in a form that a person cannot understand just by looking at it: it is binary TCP/IP header data. The next figure shows the TCP/IP header format, and the figure after it shows the data obtained from a TCP dump. Software such as BSM, BAM and NFR organizes the data and presents it in a readable form, with attribute fields such as source IP, destination IP, source port, destination port and duration. Before it is organized, the data is called raw audit data.

Figure 1: TCP Header Format/ Format of raw audit binary data
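To make the step from raw binary audit data to named fields concrete, here is a minimal Python sketch that unpacks the fixed 20-byte part of a TCP header, following the field layout of RFC 793. The function name `parse_tcp_header` and the dictionary keys are illustrative; TCP options are ignored, and the IP addresses mentioned in the text belong to the IP header, which this sketch does not parse.

```python
import struct


def parse_tcp_header(raw: bytes) -> dict:
    """Unpack the fixed 20-byte TCP header (RFC 793 layout)."""
    if len(raw) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # Network byte order: two 16-bit ports, two 32-bit numbers,
    # then four 16-bit fields.
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    flags = offset_flags & 0x3F          # URG, ACK, PSH, RST, SYN, FIN
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "data_offset": offset_flags >> 12,  # header length in 32-bit words
        "syn": bool(flags & 0x02),
        "ack_flag": bool(flags & 0x10),
        "fin": bool(flags & 0x01),
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
    }


# Example: a hand-built SYN segment from port 12345 to the telnet port (23).
sample = struct.pack("!HHIIHHHH", 12345, 23, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
header = parse_tcp_header(sample)
print(header["src_port"], header["dst_port"], header["syn"])  # 12345 23 True
```

Tools such as those named in the text perform this kind of decoding, and much more, before any data mining algorithm sees the records.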

Figure 2: Sample TCP dump data

3.2.1.2 Processing raw audit data and constructing features

Audit data collected from network sensors or other sources is raw and in binary format. Before it can be used, it must be processed and rules must be extracted from it. The first and most basic task is to build a database from the audit data and to have some initial understanding of the rules. With these two things and an association rule algorithm, we obtain new rule sets for specific attacks. These rules can then be applied to future events to detect new, previously unknown attacks.

Before any data mining rule is applied, the raw binary audit data obtained from the sensors must be preprocessed. This is done with TCPDUMP or BSM. To preprocess raw audit data, the Columbia University group used BAM (Basic Auditing Module), which they created themselves, instead of BSM (Basic Security Module). Preprocessing takes raw audit data as input and produces audit data in an organized form, with attributes such as source IP, destination IP, source port, destination port, protocol (TCP, UDP, ...), time and duration.

The next step is to apply data mining algorithms to the preprocessed data. These include meta-classification, classification, association rules and the frequent episodes algorithm. Some studies focus on a single method, some on a combination of two or three, and some on improved variants such as generating artificial anomalies. The following sections give an overview of several of these methods (classification and association rules).

3.2.1.3 Data mining methods in NIDS

1. Classification rules

Intrusion detection can be viewed as a classification problem: each audit record must be assigned to one of a discrete set of possible categories, either normal or a particular kind of intrusion. Given a set of records in which one attribute is the class label, a classification algorithm can compute a model that uses the most discriminating attribute values to describe each concept. For example, consider the telnet connection records in the figure below. Here, "hot" is the count of accesses to system directories and of programs created and executed; "compromised" is the count of "file or path not found" errors and "Jump to" instructions. RIPPER (a standard rule-based machine learning algorithm developed at AT&T Research), a classification rule learning program, generates rules for classifying the telnet connections; some of these rules are shown in the table below.

Figure 3: Telnet Records ([LS00] page: 232)

Figure 4: Example RIPPER Rules from Telnet Records ([LS00] page: 233)

Here we see that RIPPER does select the discriminating attribute values to identify the intrusions. These rules can first be reviewed and edited by security experts and then incorporated into a misuse detection system. The accuracy of a classification model depends directly on the set of attributes provided in the training data. For example, if the attributes hot, compromised, and root shell were removed from the records in the table above, RIPPER would not be able to produce accurate rules for identifying buffer overflow connections. Selecting the right set of system attributes is therefore an important step in classification.

2. Association rules: Association rules are the main mathematical rules found useful in intrusion detection systems based on data mining. Nearly every model developed in this context is related in one way or another to association rules. In a database, an association between data items means that we infer that one data item is likely to be present in a transaction because certain other data items appear in it. The goal of mining association rules is to find all latent association rules among the data items. In general, in a data-mining-based intrusion detection system we build a database of intrusion-free events and then apply association rule mining to this data set to find all the rules that describe events without intrusion. This uncovers all the latent normal behaviors. These rules are then compared against any incoming set of data items to decide whether it is an intrusion. The most critical factor here is that we must set minimum thresholds for support and confidence.

3. Nearest neighbors (K-NN): This method determines whether a point lies in a sparse region of the feature space by computing the sum of the distances to the point's k nearest neighbors, called the K-NN score for short. Intuitively, points in dense regions have many points near them and therefore have a small K-NN score. If k exceeds the frequency of any attack type in the data set, and the images of the attack elements are far from the images of the normal elements, then the method works. The algorithm is described as follows:

    for each data instance i do
        for each data instance j (j ≠ i) do
            find D(i, j) = distance between j and i
        end
        keep the k smallest D(i, .) and sum them to get KNN_i
        if KNN_i > threshold then
            intrusion occurred
    end

The main problem with this algorithm is its $O(N^2)$ time complexity. We can speed it up with a technique similar to CANOPY clustering. CANOPY clustering is used as a means of partitioning the space into smaller subspaces, removing the need to check every data point. We use the clusters as a tool to reduce the time needed to find the k nearest neighbors.
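The K-NN score described above can be illustrated with a minimal sketch. It is the plain quadratic version, without the clustering speed-up; the function names, the sample points, and the threshold value are assumptions chosen for illustration, not part of the cited work.

```python
import math

def knn_scores(points, k):
    """Return, for each point, the sum of distances to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]))
    return scores

def flag_intrusions(points, k, threshold):
    """Indices of points whose K-NN score exceeds the (assumed) threshold."""
    return [i for i, s in enumerate(knn_scores(points, k)) if s > threshold]

# Hypothetical feature vectors: four points in a dense region and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_scores(points, k=2)
flagged = flag_intrusions(points, k=2, threshold=5.0)
print(scores, flagged)
```

Points in the dense group receive a small score, while the isolated point receives a large one and is the only index flagged.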


First we cluster the data using the fixed-width clustering algorithm from the previous section, with the variation that each element is placed into only one cluster. Once the data has been clustered with width w, we can compute the k nearest neighbors of a point x by exploiting the following properties. We write c(x) for the center of the cluster containing the point x. For a cluster c and a point x, we write d(x, c) for the distance between the point and the cluster center. For any two points $x_1$ and $x_2$, if the points lie in the same cluster then

$d(x_1, x_2) \le 2w$

and in all cases

$d(x_1, x_2) \le d(x_1, c(x_2)) + w$
$d(x_1, x_2) \ge d(x_1, c(x_2)) - w$

Let C be a set of clusters. Initially C contains all the clusters in the data. At any step of the algorithm we have a set of points that are potentially among the k nearest neighbors; we denote this set P. We also have a set of points that are in fact among the k nearest neighbors; we denote this set K. Initially K and P are empty. We compute the distance from x to each cluster. For the cluster whose center is closest to x, we remove it from C and add all of its points to P. We refer to this operation as "opening" the cluster. The key to the algorithm is that we can obtain a lower bound on the distance from x to all points in the clusters remaining in C by using the second inequality above:

$d_{min} = \min_{c \in C} d(x, c) - w$

The algorithm proceeds as follows. For each point $x_i \in P$, we compute $d(x, x_i)$. If $d(x, x_i) < d_{min}$, we are guaranteed that $x_i$ is closer to x than all points in the clusters of C. In that case we remove $x_i$ from P and add it to K. If we cannot guarantee this for any element of P (including the case where P is empty), we "open" the nearest cluster by adding all of its points to P and removing the cluster from C. Note that when we remove a cluster from C, $d_{min}$ increases. Once K has k elements, we stop.

Most of the computation is spent checking distances from points in D to the cluster centers. This is significantly more efficient than computing the distances between all pairs of points. The choice of the width w does not affect the k-NN result; it affects only the efficiency of the computation. Intuitively, we want to choose a w that splits the data into clusters of reasonable size.

4. Clustering: Although it has already been mentioned, clustering is central to training unsupervised anomaly detection. This is because clustering looks for anomalies by default and does not require heavy manual work. We consider one such approach here.

The first goal of the algorithm is to compute how many points are "near" each point in the feature space. One parameter of the algorithm is a radius w, also called the cluster width. For any pair of points $x_1$ and $x_2$, we consider the two points "near" each other if the distance between them is less than or equal to w, $d(x_1, x_2) \le w$, with the distance defined with respect to the kernel function. For each point x, we define N(x) as the number of points that lie within w of x. More formally:

$N(x) = |\{ s \mid d(x, s) \le w \}|$    (1)

The straightforward computation of N(x) for all points has complexity $O(N^2)$, where N is the number of points, because we must compute the distances between all pairs of points. However, because we are interested only in identifying the outliers, we can compute an efficient approximation as follows. First we perform fixed-width clustering over the entire data set with cluster width w. Then we label the points in the small clusters as anomalies.

A fixed-width clustering algorithm works as follows. The first point is the center of the first cluster. For each subsequent point, if it is within w of a cluster center, it is added to that cluster. Otherwise it becomes the center of a new cluster. Note that some points may be added to more than one cluster. Fixed-width clustering requires only one pass through the data. The complexity of the algorithm is O(cn), where c is the number of clusters and n is the number of data points. For a reasonable w, c is significantly smaller than n.

Note that by the definition in formula (1), for each cluster the number of points near the cluster center, N(c), is the number of points in cluster c. For each point x that is not a cluster center, we approximate N(x) by N(c) for the cluster c that contains x. For points in dense regions, where there is a lot of overlap between clusters, this estimate will be inaccurate. For points that are outliers, however, there will be relatively few overlapping clusters in these regions, and N(c) will be an accurate approximation of N(x). Because we are interested only in the outliers, and points in dense regions will be far above the threshold, the approximation is reasonable in our case.
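The fixed-width clustering pass and the approximation of N(x) by N(c) can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance stands in for the kernel-induced distance, a point that falls into several clusters is scored with the largest of their counts, and the names, sample points, and `min_count` threshold are hypothetical.

```python
import math

def fixed_width_clusters(points, w):
    """Single pass over the data; returns cluster centers and their point counts N(c)."""
    centers, counts = [], []
    for p in points:
        near = [c for c, center in enumerate(centers) if math.dist(p, center) <= w]
        if near:
            for c in near:          # a point may join several overlapping clusters
                counts[c] += 1
        else:
            centers.append(p)       # p becomes the center of a new cluster
            counts.append(1)
    return centers, counts

def flag_outliers(points, centers, counts, w, min_count):
    """Approximate N(x) by N(c) of a cluster containing x and flag small values."""
    flagged = []
    for i, p in enumerate(points):
        # Assumption: when several clusters contain p, use the largest N(c).
        n_x = max(counts[c] for c, center in enumerate(centers)
                  if math.dist(p, center) <= w)
        if n_x < min_count:
            flagged.append(i)
    return flagged

points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.4, 0.4),
          (5.0, 5.0), (5.2, 5.1), (20.0, 20.0)]
centers, counts = fixed_width_clusters(points, w=1.0)
flagged = flag_outliers(points, centers, counts, w=1.0, min_count=2)
print(centers, counts, flagged)
```

Each point is compared only with the current cluster centers, which is where the O(cn) cost quoted above comes from.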
With the efficient approximation algorithm, we can process significantly larger data sets than with the straightforward algorithm, because we do not need to compare every pair of points.

3.2.2 The situation in Vietnam

Data mining and knowledge discovery is a very new field in Vietnam. Nevertheless, it has already penetrated deeply into many areas, such as biology, education, information technology, artificial intelligence, commerce, security, and telecommunications. Because it is new, the field attracts a great deal of interest. However, most research in this field has stopped at the research level and has not yet been put into practical operation. Studies in Vietnam include: data farming and the application of data mining in medicine for automatic diagnosis, data mining for fingerprint recognition, machine translation, statistical data analysis, analysis of telephone data in telecommunications, analysis of consumer purchasing behavior, timetable scheduling, and data mining for financial analysis in commercial banks to support business decisions.

In the security field in particular, there are almost no concrete research results on applying data mining. Use is limited to deploying security products that employ data mining techniques, such as intrusion detection systems from vendors. For the application of data mining in intrusion detection systems there are some research directions, but they remain at the overview level. These issues will continue to be researched and developed and will yield significant results. Applying data mining in intrusion detection systems is a general trend not only in Vietnam but worldwide.

3.3.3 The situation worldwide

Data mining was applied to network intrusion detection only recently; the earliest studies in this field date from as late as 1997 or 1998. Yet in that short time data mining has attracted considerable research attention, and research in the field continues to expand.

3.3.3.1 Earliest research

1. Developing custom intrusion detection filters using data mining: Typical intrusion detection systems match the signatures of incoming traffic against known signatures and thereby detect intrusions, which is essentially misuse detection. But intrusion signatures may also match the activity of legitimate users, which results in false alarms. The approach developed in this model is to build custom filters that reduce false alarms on the basis of the normal behaviors known in a particular environment. Many normal behaviors are reported as attacks by false alarms.
The main question here is whether we can identify normal patterns and recognize them in a stream of alarms, so that we can easily filter them out and significantly reduce the false alarm rate. The difficulty with this approach is building the filters and defining the normal patterns, which requires considerable human effort. To reduce this effort, Clifton and Gengo use data mining techniques in their method.

Clifton and Gengo develop filters based on sequences of alarms. The idea is that a sequence of actions considered normal in one environment is not necessarily normal behavior in another. An action that is normal in one environment may therefore be harmful in another, and it will be flagged as abnormal there. In reality, however, the action may not be abnormal at all, or rather it is harmless. This misinterpretation can lead to a false alarm. But it is unlikely that a complete sequence of normal actions will be replicated in an intrusion. Alarms that are part of a complete normal sequence can therefore be ignored.

They use Frequent Episodes to identify frequently occurring sequences of alarms. An episode is a sequence of alarms that occurs within a specific time window, and a frequent episode is a sequence that appears many times across many time windows. Sometimes unrelated actions are interleaved in the middle of frequent episodes. Frequent episodes are hard to detect in the presence of unrelated actions, and this is where data mining proves effective: it discovers the most frequent sequences efficiently and automatically.
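To make the notion of a frequent episode concrete, the sketch below counts ordered pairs of alarms (a followed by b) that occur within a time window, tolerating unrelated alarms in between. It is a deliberately simplified stand-in for frequent episode mining, not the algorithm used by Clifton and Gengo; the alarm names, the window length, and the `min_count` threshold are assumptions.

```python
from collections import Counter

def frequent_serial_pairs(alarms, window, min_count):
    """alarms: list of (timestamp, name) sorted by time.
    Counts how often alarm a is followed by alarm b within `window` time units,
    ignoring whatever unrelated alarms occur in between."""
    counts = Counter()
    for i, (t_i, a) in enumerate(alarms):
        followers = set()
        for t_j, b in alarms[i + 1:]:
            if t_j - t_i > window:
                break
            followers.add(b)
        for b in followers:          # count each pair once per starting alarm
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical alarm stream: "login" is regularly followed by "ftp",
# sometimes with an unrelated alarm in between.
alarms = [
    (0, "login"), (1, "noise"), (2, "ftp"),
    (10, "login"), (12, "ftp"),
    (20, "login"), (21, "scan"), (23, "ftp"),
]
frequent = frequent_serial_pairs(alarms, window=5, min_count=3)
print(frequent)
```

Only the pair that recurs in every window survives the threshold, which is the kind of sequence a custom filter would treat as normal.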

Figure 8: Example of frequent sequences with interleaved actions or noise

The main goal in this report is to identify sequences of alarms caused by normal actions. Frequent episodes are sequences of alarms that occur often. They are important for two reasons. First, a frequent sequence of alarms is unlikely to be an intrusion, because attackers will not repeat the same thing over and over; otherwise they would be detected. Second, normal actions are performed more often, so a frequent episode is the result of normal behavior. Analyzing the frequent sequences and removing them from the list of possible attacks gives the greatest reduction in the stream of false alarms, because there are always many more normal actions than malicious ones.

In their experiment, they analyzed more than a million intrusion alarms collected from 7 sensors in a network. The experiment lasted two weeks. They loaded the log files into a relational database. The basic schema of this log is Log(Event, FromIP, ToIP, time). They then used an algorithm called Query Flocks, which is based on an extended association rule mining algorithm, to discover the frequent sequences. Query Flocks offers more flexibility in handling complex patterns than the Frequent Episodes algorithm. Their main focus is on adding mining technology to an existing intrusion detection system. They do not propose any new algorithm; instead they propose a model that uses the known frequent episode rule algorithm. Their model is as follows:

Figure 9: Developing custom filters with data mining ([CG00] page: 2)

In this model they use an existing commercial network intrusion detection system (they do not name a specific system, being concerned only with how a misuse detection system basically works), which collects connection records from the sensors, performs the basic actions, and produces alarms as its output. Before data mining was introduced into intrusion detection, all systems were signature-based misuse detection systems. Such a system tries to match traffic against normal signatures; if no match is found, the traffic is flagged as an alarm. This leads to a great many false alarms. They then pass these alarms through the custom filters. The filters use frequent episode rules to find the complete parts of a normal sequence. Alarms raised for frequent patterns are then discarded, and only the remaining intrusion alarms are passed to the data mining tool (based on Query Flocks, an extension of association rule mining), which can use any data mining technique to search for intrusion patterns. In other words, the system as a whole can then also perform anomaly detection.

In this study they essentially do not develop a new model. Their main focus is filtering false alarms with previously implemented algorithms (Frequent Episodes and Query Flocks). They mainly show that these algorithms can be used to reduce false alarms. The main strength of their approach is that their definition and use of filters can easily be implemented in any existing intrusion detection system. They also keep the data mining tool separate, so it can be plugged into any existing intrusion detection system. Moreover, the data mining tool does not have to handle too much input. The main weakness of their model is that the misuse detection system and the data mining tool operate independently in two places, whereas there are now anomaly-based intrusion detection systems such as ADAM that can do both at the same time.

3.3.3.2 Later research

1. ADAM: a testbed for exploring the use of data mining in intrusion detection

ADAM was the most important study in this field in 2001-2002. Much subsequent research has aimed to improve this algorithm. ADAM uses a combination of association rules and classification to detect attacks in TCP dump audit trails. First, ADAM collects frequent itemsets that are known to be normal by mining attack-free data. Second, it runs an online algorithm to find recent connections, compares them with the known mined data, and discards those considered normal. For the suspicious behaviors, it then uses a previously trained classifier to classify the suspicious connections as a known type of attack or a false alarm.

The experimental model has two phases. In the first phase they train the classifier. This phase takes place only once, offline, before the system is used. In the second phase they use the trained classifier to detect intrusions. The algorithm is described in detail below:

Phase 1:


Figure 11: The training phase of ADAM ([BCJ+01] page: 5)

In this first step, a database of frequent itemsets from normal, attack-free data, with a minimum support, is created. This database serves as a profile against which the frequent itemsets found later are compared. The profile database is populated with frequent itemsets, in a specific format, for the attack-free portions of the data. The algorithm used in this step can be any conventional association rule mining algorithm, although they use a customized algorithm for better speed. So in this first step they build a profile of normal behavior. The profile mainly contains data on normal network connections, that is, sets or combinations of the values of source IP, destination IP, source port, destination port, connection time, timestamp, and flags seen in normal traffic.

In the second step they again use the training data, the profile of normal behavior, and an online association rule algorithm whose output contains frequent itemsets that may be attacks. The suspicious itemsets, together with a set of features extracted from the data by a feature selection module, are used as training data for the classifier, which is based on a decision tree.

Now consider how the dynamic online association rule algorithm works. The algorithm is driven by a sliding window of tunable size. It outputs the itemsets that have received strong support within the given window. It compares every itemset with the profile database; if there is a match, the data is normal. Otherwise, it keeps a counter that tracks the support of the itemset. If the support exceeds a given threshold, the itemset is flagged as suspicious. The classifier then labels the suspicious data as a known attack, a false alarm, or an unknown attack.

Phase 2:

Figure 12: Discovering intrusion with ADAM (Phase 2) ([BCJ+01] page: 6)

In this phase the classifier has been trained and can classify any attack as known, unknown, or a false alarm. This phase also uses the same dynamic online algorithm to produce suspicious data, which, together with the feature selection module and the profile, is sent to the trained classifier. The classifier then outputs the type of attack that the data matches. If it is a false alarm, the classifier removes the data from the list of attacks and does not forward it to the system administrator.

In conclusion, this work shows an effective way of using mining techniques at that time. The main drawback of the method is that they use only association rules, and the result contains many rules, many of which are redundant. They have no technique for countering redundant and irrelevant rules. For example, suppose a rule (A, B) → C means that if A and B occur then C occurs. Suppose also that if B occurs then C occurs. The algorithm will still count B → C as a separate rule, which means it generates unnecessary extended rules. Later, however, many studies followed this approach, and many of them introduced additional measures (such as interestingness) to improve the model.

2. A framework for constructing features and models for intrusion detection systems (MADAM ID)

MADAM ID is a prominent IDS in this field. The goal of this work is to develop a more systematic and automated method for building an IDS. They developed a set of tools that can be applied to a variety of audit data sources to generate intrusion detection models. The central idea of MADAM ID is to apply data mining programs to the extensively collected audit data in order to compute models that accurately capture the actual behavior, or patterns, of intrusions and normal activities. The main components of the MADAM ID framework include classification and meta-classification learning programs, association rules for link analysis, and frequent episodes for sequence analysis. The process of applying MADAM ID is as follows:

Figure 13: MADAMID workflow ([LS00] page: 231)

In the first step, raw audit data is collected in binary form. It is then processed into ASCII network packet information. For example, the data initially consists of bytes of 0s and 1s. We then convert these values to ASCII so that we can read them easily. Suppose the first 16 binary bits give the source port; we convert these 16 bits to hexadecimal or decimal so that we can read the source port. After decoding all of the packet header information, we summarize it into connection records that contain basic features such as service and connection duration. Various data mining programs, such as association rules and frequent episodes, are then applied to these connection records. The output is a set of initial features, which are then used as rules in the model.

For example, suppose the connection records show the same source IP, across many packets, trying to access many destination IPs on the same port. In the event records or packets all of this information is scattered, so it is grouped, on the basis of a fixed time window (5 minutes in their experiment), into connection-level or session-level records. After applying the data mining rules (association, frequent episodes, classification) to these records, we learn a feature: if the above condition occurs, it may be an anomaly or an attack, and from this phase we obtain a rule that describes the situation. Finally this rule is applied in the model. Because all of these data mining methods (association rules, frequent episode rules, classification rules) are described in the example in section 2.2.1, and feature construction is described in 2.1.2, they are not discussed further here.

MADAM ID has recently produced misuse detection models for networks and hosts, as well as anomaly detection models for users. The main strength of this work is its focus on efficient, automated feature construction. Its limitation is that the system is currently off-line; the authors are studying how to turn it into a real-time IDS, since an intrusion detection system should run in real time to minimize security damage. Another weakness of the model is that it computes only the frequent patterns of connection records. Many intrusions, such as those that embed all of their activity in a single connection, have no frequent patterns in the connection data. Intrusions of this kind may go undetected in their model.

3.3.3.3 Recent and current research

1. Learning Rules for Anomaly Detection (LERAD)

This study presents an efficient algorithm called LERAD (Learning Rules for Anomaly Detection), offered as an alternative to ADAM. The main difference between the two is that ADAM generates all possible rules and associations, and as a result its false alarm rate is also very high. LERAD, in contrast, generates a smaller set of selected rules with almost no redundancy, so its false alarm rate is lower than that of ADAM. Here we present the algorithm by way of an example rather than a theoretical exposition. Let S be the sample training data set.

Figure 14: Sample Training Dataset ([MC03] page: 602)

There are 4 attributes here: Port, Word1, Word2, and Word3.

Step 1: in this step we generate rules and possible relations from the samples. The algorithm is as follows:
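As a rough sketch of this first step (an illustration under stated assumptions, not the authors' listing), the code below generates candidate rules from one pair of samples. The first randomly chosen matching attribute becomes the consequent, and each later one is added to the antecedent. The function name, the rule representation, and the optional `order` argument (used to replay a fixed sequence of random choices) are hypothetical.

```python
import random

def candidate_rules(s1, s2, max_rules, order=None, rng=random):
    """Generate up to max_rules rules from one pair of samples.
    A rule is (antecedent, consequent): a tuple of (attribute, value)
    conditions and a single (attribute, value) pair."""
    matching = [a for a in s1 if s1[a] == s2[a]]   # attributes where s1 and s2 agree
    if order is None:
        order = matching[:]
        rng.shuffle(order)                          # random selection order
    else:
        order = [a for a in order if a in matching]
    rules, antecedent, consequent = [], [], None
    for m, a in enumerate(order[:max_rules], start=1):
        if m == 1:
            consequent = (a, s1[a])                 # first pick is the consequent
        else:
            antecedent.append((a, s1[a]))           # later picks extend the antecedent
        rules.append((tuple(antecedent), consequent))
    return rules

s1 = {"Port": 80, "Word1": "GET", "Word2": "/", "Word3": "HTTP/1.0"}
s2 = {"Port": 80, "Word1": "GET", "Word2": "/index.html", "Word3": "HTTP/1.0"}

# Replay one particular sequence of random choices: Word1, then Port, then Word3.
rules = candidate_rules(s1, s2, max_rules=4, order=["Word1", "Port", "Word3"])
for antecedent, consequent in rules:
    print(antecedent, "->", consequent)
```

With M = 4 only three rules are produced, because the set of matching attributes runs out after three picks.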


Figure 15: LERAD Algorithm ([MC03] page: 602)

In the first line of the algorithm, we randomly select two samples S1 and S2 from the samples S, say S1 = {80, GET, /, HTTP/1.0} and S2 = {80, GET, /index.html, HTTP/1.0}. In the second line we find the attributes on which S1 and S2 match. The matching attributes are (Port, Word1, Word3). In the third line we start a loop that runs from 1 to M; assume here that M = 4. Inside the loop we randomly choose Word1 as a from the attribute list A and remove it from A, so a = Word1 and A = {Port, Word3}. On the first pass m = 1, so we enter the if branch and create the rule r1: Word1 = GET. We add this rule to the rule set.

On the second pass m = 2 and the attribute set A is not empty, so we continue with the loop. This time we randomly remove another attribute, Port, as a, so a = Port and A = {Word3}. Because m is not 1, we go to the else branch. S1[Port] = 80, and we add it as the antecedent of the second rule, r2: if Port = 80 then Word1 = GET. We add this rule to the rule set.

On the third pass m = 3 < 4 and A is not empty. We pick the only attribute left, Word3, and remove it from A, so a = Word3 and A = { }. We again go to the else branch. S1[Word3] = HTTP/1.0, which we add to the antecedent to form r3: if Port = 80 and Word3 = HTTP/1.0 then Word1 = GET. We add this rule to the rule set.

On the fourth pass m = 4 but A is empty, so we leave the loop. This shows how the algorithm generates rules; the whole process continues until all the rules have been generated. The final rule set is:

R = {
r1: Word1 = GET,
r2: if Port = 80 then Word1 = GET,
r3: if Port = 80 and Word3 = HTTP/1.0 then Word1 = GET
}

Step 2: in this step we sort the rules in descending order and remove the redundant ones. To sort the rules we use the ratio n/r, where n is the number of training samples that satisfy the antecedent and r is the number of allowed values. The algorithm is described as follows:
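A minimal sketch of this second step, under stated assumptions, is given here. Rules are sorted by n/r, and a rule is kept only if it marks at least one value in the training set that no earlier rule has marked. The rule representation and function name are hypothetical, and in the third sample only Word1 = HELO comes from the text; its other field values are placeholders assumed for illustration.

```python
def prune_rules(rules, samples):
    """rules: list of (antecedent, target) where antecedent is a tuple of
    (attribute, value) conditions and target is the consequent attribute.
    Returns the rules kept after sorting by n/r and removing redundant ones."""
    def covered(antecedent):
        return [i for i, s in enumerate(samples)
                if all(s[a] == v for a, v in antecedent)]

    scored = []
    for antecedent, target in rules:
        idx = covered(antecedent)
        if not idx:
            continue                                 # rule matches no training sample
        n = len(idx)                                 # samples satisfying the antecedent
        r = len({samples[i][target] for i in idx})   # distinct allowed values
        scored.append((n / r, antecedent, target))
    # Python's sort is stable, so rules with equal n/r keep their input order.
    scored.sort(key=lambda t: t[0], reverse=True)

    marked, kept = set(), []
    for _, antecedent, target in scored:
        new = {(i, target) for i in covered(antecedent)} - marked
        if new:                                      # marks at least one new value
            kept.append((antecedent, target))
            marked |= new
    return kept

samples = [
    {"Port": 80, "Word1": "GET", "Word2": "/", "Word3": "HTTP/1.0"},
    {"Port": 80, "Word1": "GET", "Word2": "/index.html", "Word3": "HTTP/1.0"},
    # Assumed values except Word1, which the text gives as HELO.
    {"Port": 25, "Word1": "HELO", "Word2": "pascal", "Word3": "MAIL"},
]
r1 = ((), "Word1")
r2 = ((("Port", 80),), "Word1")
r3 = ((("Port", 80), ("Word3", "HTTP/1.0")), "Word1")
kept = prune_rules([r1, r2, r3], samples)
print(kept)
```

On this data r2 and r3 tie at n/r = 2/1 and r1 scores 3/2, so r2 is examined first, r3 adds nothing new and is dropped, and r1 is kept because it marks the HELO value.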

Figure 16: LERAD algorithm (Part 2) ([MC03] page: 602)

For example, in our case, after training on S and sorting by n/r the rules are:

r2: if Port = 80 then Word1 = GET (n/r = 2/1)
r3: if Port = 80 and Word3 = HTTP/1.0 then Word1 = GET (n/r = 2/1)
r1: Word1 = GET or HELO (n/r = 3/2)

To make this clear, consider how n and r are chosen for r1. The number of samples with Word1 = GET or HELO is 3, and the number of allowed values is 2. As another example, for r2 the number of samples that the rule matches is 2 (the first and second rows of the table), and the only allowed value is GET, so r = 1. The n/r values of r2 and r3 are the same.

Removing redundant rules: r2 marks the two GET values in S. r3 would mark the same two values and no new ones, so we remove it. r1 marks HELO in the third sample in addition to the values marked earlier, so we keep this rule.

We can thus see that many of the algorithms introduced in recent research aim to improve on ADAM. Researchers treat ADAM as a starting point and focus on how to improve it.

2. Entropy-based intrusion detection

In this study the authors briefly present two data mining models, one based on ADAM and one based on entropy. They then compare the two systems and show the advantages of the entropy-based system over ADAM. A typical ADAM-based intrusion detection method works as follows:

1) First, build a profile of the normal or harmless behaviors of the computer system and the network.
2) Then, anomalies that deviate from these normal behaviors are treated as potential intrusions.

The ADAM algorithm is used in step 1 to mine association rules from the database. It finds all association rules whose support is greater than a minimum support specified by the user. ADAM therefore finds typical and frequent event sequences as the system profile. Although this is the most thoroughly studied and most widely used algorithm, it requires very careful selection of the minimum support parameter. For example, suppose we have a database containing a rule A with 100 supporting data items and 200 contradicting data items, and another rule B with 98 supporting data items and no contradicting data items. If we choose a minimum support of 100, only rule A is selected. To obtain rule B we must set a lower minimum support. The weakness of the ADAM system is therefore that it demands a careful choice of the minimum support. To analyze intrusions in a large network or a large e-commerce network, the minimum support must be set high to minimize noise. Conversely, to analyze intrusions in a small network, MinSupport must be small enough to capture the patterns in the smaller volume of network traffic. Another point is that the results of ADAM include association rules that contradict one another. So most of the time, depending on the choice of MinSupport, the result is noisy and requires post-processing before it can be used in an intrusion detection system. This is why Barbara uses a classification rule learning system in their ADAM system.

In this study the authors therefore propose an entropy-based data mining method for intrusion detection as a replacement for the APRIORI-based system. The entropy-based system is more precisely called the Graph Based Induction (GBI) method for finding rules. The algorithm is as follows. It has 4 parts and 3 steps. The four parts are: Extracted subgraphs, Input graph, Contract input graph, Enumerate pair graph and select pair. Initially the set of extracted sub-graphs is empty, but as the algorithm runs it uses nodes to represent rules. In the example below we see that we have two rules: A: 4 → 2 and B: 1 → 3

Figure 17: Workflow of Entropy based Intrusion Detection ([Yo03] page: 842)



In the first step, we contract the input graph according to the extracted sub-graphs. For example, from the extracted sub-graphs we know that 4 → 2 is represented by A. So wherever we find 4 → 2 in the input graph, we replace it with A. We carry out the same procedure for B. In the second step, the contracted graph is analyzed in every possible way and split into candidate sub-graphs called pairs. In this example 7 pairs are extracted. In the third step, we choose the best pair according to our requirement, or the pair that satisfies a given criterion. In this example B → 7 is chosen and then expanded to 1 → 3 → 7, since B = 1 → 3. The relation 1 → 3 → 7 is added to the extracted sub-graphs and is used in the next iteration. Starting from an empty set of extracted sub-graphs, the algorithm can extract the various sub-graphs that appear frequently in the input graph. By repeating this, it extracts progressively more complex sub-graphs, one step at a time.

The advantage of this algorithm is that if in step 3 we choose the pairs that occur more often than MinSupport, it behaves like the APRIORI algorithm, and if in step 3 the best pair is chosen with an entropy-based index, it can be viewed as an extension of traditional classification rule learning. APRIORI-based systems and classification rule systems use tables of attribute values; in GBI the same thing can be done by treating the root nodes (4, 1, 4, 8) as attribute classes and the attribute values as the connected nodes (2, 5, 6, 3, 7, 2, 6, 6, 3, 9). This work is still at the research stage, but it shows an easy alternative to APRIORI for mining rules.

3. MINDS: Minnesota Intrusion Detection System

MINDS is described as the most popular NIDS at the time of writing. The system was developed by the Department of Computer Science at the University of Minnesota in 2002. The University of Minnesota has used the system on its own network since 2002, where it has been able to detect many new attacks as they appeared (examples include the slammer worm and the NetBus worm).

There are two kinds of anomaly detection: supervised and unsupervised. In supervised anomaly detection, a set of normal data is given for training and a new set of data is given for testing, and the goal is to determine whether the test data is normal or anomalous.
In unsupervised anomaly detection, the model tries to detect anomalous behavior without using any knowledge about the training data. Unsupervised anomaly detection systems are based on statistical methods, clustering, outlier detection, and so on. MINDS is an unsupervised anomaly detection system.

MINDS uses a suite of data mining techniques to detect attacks on computer systems and networks automatically. The long-term goal of MINDS is to address all aspects of intrusion detection. This study describes two specific contributions in detail: (1) an unsupervised anomaly detection technique that assigns each network connection a score reflecting how anomalous the connection is, and (2) a module based on association pattern analysis that summarizes the network connections ranked as highly anomalous by the anomaly detection module. The flow of MINDS is shown in the following figure. We describe the system step by step along this flow.

Figure 18: MINDS System ([ELK+04] page: 4)

The input to MINDS is network flow data (NetFlow version 5) collected with flow-tools (details at www.splintered.net/sw/flow-tools) instead of tcpdump data. Flow-tools captures only the packet header information, not the packet contents. As with tcpdump data, the header information contains the source IP, destination IP, source port, destination port, timestamp, flags, connection duration, and so on. They use a time window of 10 minutes. All data on the Internet travels as packets, and every packet has header information and data. The system captures only the header information of all packets that passed through in the last 10 minutes. This data is stored, and before it is passed to the main system a data filtering step removes the network traffic that is not of interest for the analysis. For example, the data that is filtered out may include traffic from trusted sources. At the University of Windsor, for instance, a request sent to a port in the range 40000 to 60000 from inside the campus network (UofW) is allowed, whereas if the source IP is not on the university network the access is blocked.

Phase 1: feature extraction

We then enter the main system. The first phase in MINDS is feature extraction. The data arrives in binary form, but we know the format (what each byte represents), and we extract the basic features from the audit data. These basic features are the source and destination IP addresses, source and destination ports, protocol, flags, number of bytes, and number of packets. From these basic features the derived features are then computed. There are two kinds of derived features: (1) time-window based features and (2) connection-window based features. Time-window based features are constructed to capture connections with similar characteristics in the last T seconds. For example, the number of connections made to the same destination IP address in the last T seconds is called count-dest. Connection-window based features are constructed to capture connections with similar characteristics in the last N connections. For example, the number of connections among the last N that were made to the same destination IP address is called count-dest-conn. Examples of both kinds of features are described in the tables below.

Figure 19: Time-window based features ([ELK+04] page: 4)
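The two derived features named in the text, count-dest and count-dest-conn, can be sketched as follows. This is a minimal illustration that follows the definitions given in this section (connections to the same destination IP in the last T seconds, and among the last N connections). The record layout, function names, and addresses are assumptions, and the connection being scored is itself excluded from the count.

```python
def count_dest(connections, T):
    """connections: list of (timestamp, src_ip, dst_ip) sorted by timestamp.
    For each connection, count earlier connections to the same destination IP
    that started within the last T seconds."""
    counts, start = [], 0
    for i, (t, _src, dst) in enumerate(connections):
        while connections[start][0] < t - T:   # slide the window start forward
            start += 1
        counts.append(sum(1 for _, _, d in connections[start:i] if d == dst))
    return counts

def count_dest_conn(connections, N):
    """For each connection, count how many of the previous N connections
    went to the same destination IP."""
    counts = []
    for i, (_t, _src, dst) in enumerate(connections):
        recent = connections[max(0, i - N):i]
        counts.append(sum(1 for _, _, d in recent if d == dst))
    return counts

# Hypothetical flow records: (timestamp in seconds, source IP, destination IP).
flows = [
    (0, "10.0.0.1", "192.0.2.10"),
    (1, "10.0.0.2", "192.0.2.10"),
    (2, "10.0.0.1", "192.0.2.20"),
    (8, "10.0.0.3", "192.0.2.10"),
    (9, "10.0.0.1", "192.0.2.10"),
]
time_feature = count_dest(flows, T=5)
conn_feature = count_dest_conn(flows, N=3)
print(time_feature, conn_feature)
```

The fourth record shows why both windows are useful: the time window has already forgotten the earlier connections to the same destination, while the connection window still sees them.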

Figure 20: Connection-window based features ([ELK+04], page 5)

Stage 2: known attack detection
Once all the features of the connections have been obtained, the next step is to compare them with known anomalies. If a match is found, the connection is sent directly to the analyst. For example, using the time-window based features, a source IP may be seen trying to reach the same port on many destination IPs repeatedly within the last 3 seconds; if a signature for this kind of attack exists, the connection can be passed to the analyst as an attack without hesitation. If no matching attack signature exists, the connection record is sent to the anomaly detection module, which carries out the next stage.

Stage 3: anomaly detection
In this stage the anomaly detection module uses an outlier detection algorithm to assign an anomaly score to every network connection. It assigns each data point a degree of being an outlier, called the Local Outlier Factor (LOF). For each data example, the density of its neighbourhood is computed first. The LOF of a specific data example p is the average of the ratios between the density of p and the densities of its neighbours. LOF requires the neighbourhood of every data point to be constructed. This involves computing the pairwise distances between all data points, which has complexity O(n^2). With millions of data points this cost becomes very large. To reduce it, MINDS uses a sampling approach: a sample data set is drawn from the data and every data point is compared only with this small set, which lowers the complexity to O(n*m), where m is the size of the small set.

[Figure: sample data points p1 and p3 in cluster C1; p2, p4 and p5 in cluster C2]

In the figure above, cluster C2 is denser than cluster C1. Because C1 has low density, for most examples q in C1 the distance between the example and its neighbours is larger than the corresponding distance in C2. For example, the distance between p1 and p3 is larger than the distance between p2 and p4. Consequently p2 is not regarded as an outlier.

Stage 4: association pattern analysis
After every connection has been assigned a score, the connections with the top 10% of scores are placed in the anomalous class and those with the bottom 30% of scores are treated as the normal class. The middle 60% are ignored by the system. The scored connections are then passed to the component that generates association patterns.
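The scoring and labelling steps described above can be sketched as follows. This is a minimal brute-force LOF, assuming distinct points, more than k points, and exactly k neighbours per point (ties at the k-distance are not expanded); it computes all pairwise distances and does not include the sampling optimisation used by MINDS. The function names and the "ignored" label are illustrative assumptions.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor of every point, brute force O(n^2)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neigh, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neigh.append(order[:k])                 # k nearest neighbours of i
        kdist.append(dist[i][order[k - 1]])     # distance to the k-th one
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = sum(max(kdist[j], dist[i][j]) for j in neigh[i])
        lrd.append(len(neigh[i]) / reach)
    # LOF: average ratio of the neighbours' density to the point's density.
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]


def label_by_score(scores, top=0.10, bottom=0.30):
    """Top 10% of scores -> anomalous, bottom 30% -> normal, rest ignored."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = int(len(scores) * top)
    n_bottom = int(len(scores) * bottom)
    labels = ["ignored"] * len(scores)
    for i in order[:n_top]:
        labels[i] = "anomalous"
    for i in order[len(order) - n_bottom:]:
        labels[i] = "normal"
    return labels
```

Points inside a cluster of uniform density get a LOF close to 1, while a point whose neighbours are much denser than itself gets a LOF well above 1.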


Figure 21: MINDS Association Analysis Module ([ELK+04], page 14)

This module summarises the network connections that the anomaly detection module ranked as highly anomalous. The purpose of mining association patterns is to discover patterns that occur frequently in the anomalous class or in the normal class. In this step the authors apply association rules to build rule sets for the anomalous class and for the normal class. For example, activity directed at a particular service can be summarised by the following frequent set:

sourceIP=X, destinationPort=Y

If many of the connections covered by the frequent set were ranked highly in the previous step, the frequent set may be a suitable signature to add to a signature-based system. Conversely, if the following frequent set has low scores and occurs many times, it can be considered normal; here it corresponds to Web browsing behaviour:

Protocol=TCP, destinationPort=80, NumPackets=36

The system also looks for discriminating patterns. To find them it uses several measures: ratio, precision, recall, and the harmonic mean of precision and recall. Using these measures, the module orders the patterns, groups similar patterns together, and presents them to the analyst. The measures used to order the patterns are summarised as follows:


Figure 22: Measures for ordering patterns ([ELK+04], page 13)

Consider a set of features that occurs c1 times in the anomalous class and c2 times in the normal class, and let n1 and n2 be the numbers of anomalous and normal connections in the data set. Assuming that we are only interested in finding profiles of the anomalous class, the ratio of c1/n1 to c2/n2 indicates how well the pattern distinguishes anomalous connections from normal ones. Ratio or precision alone is not sufficient, because they often characterise only a very small number of anomalous connections. In the extreme case, a rare pattern that occurs only once in the anomalous class and never in the normal class has the maximum value of both ratio and precision, yet may still be unimportant. To account for the significance of a pattern, recall can be used as an alternative measure. Unfortunately, a pattern with high recall is not necessarily discriminating. The F1 measure is the harmonic mean of precision and recall and provides a good trade-off between the two. In the final step, all the rules are presented to the analyst, who can update or build the normal profile or label the signatures of a new attack. This is how MINDS works as an unsupervised intrusion detection system.

SUMMARY OF NIDS RESEARCH

System | Misuse/Anomaly | Alert-based | Mining techniques used | Training data used | Strengths | Weaknesses
ADAM | Anomaly | No | Association rules, classification rules | TCPDump | Pioneering research | Generates redundant rules; requires careful selection of minsupport
MADAM ID | Misuse and anomaly | No | Association rules, classification rules, frequent episodes | TCPDump, BSM-generated data | Builds features automatically and effectively | Off-line system; computes only the frequent patterns of the records
MINDS | Anomaly | No | Outlier detection, association rules | NetFlow version 5 | Scores make it easy to determine anomalies | Does not consider the middle 60% of the data, which may contain important rules
Custom ID Filters | Anomaly | Yes | Association rules, frequent episodes | Data prepared from sensors | Can be added to any system | Requires custom filters tuned for each environment
LERAD | Anomaly | No | Association, classification | TCPDump | No redundant rules | Requires careful selection of minsupport
Entropy based | Anomaly | No | Entropy / graph based | Audit data | An alternative to ADAM | As with ADAM, minsupport must be chosen carefully
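The pattern-ordering measures discussed for MINDS (ratio, precision, recall and F1) can be sketched as below. The ratio follows the definition given in the text; precision = c1/(c1+c2) and recall = c1/n1 are the usual definitions and are assumed here to match Figure 22. The function name and the handling of zero denominators are illustrative choices.

```python
import math


def pattern_measures(c1, c2, n1, n2):
    """Measures for a pattern seen c1 times among n1 anomalous connections
    and c2 times among n2 normal connections."""
    # Ratio of the pattern's frequency in the two classes.
    ratio = math.inf if c2 == 0 else (c1 / n1) / (c2 / n2)
    precision = c1 / (c1 + c2) if (c1 + c2) else 0.0
    recall = c1 / n1
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"ratio": ratio, "precision": precision,
            "recall": recall, "f1": f1}
```

For the rare pattern mentioned in the text (c1 = 1, c2 = 0), ratio and precision are maximal but recall, and therefore F1, is very small, which is why F1 is the better ordering criterion.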

Chapter 4
BUILDING A DoS ATTACK DETECTION PROGRAM USING DATA MINING TECHNIQUES

4.1 Clustering algorithms

4.1.1 Introduction
Although the spread of viruses has become commonplace, 90% of enterprises state that denial-of-service attacks are the most troublesome and most frequent problem they face. These attacks date back to the late 1990s, when security experts examining flaws in the Windows 98 operating system discovered that sending a single oversized ping packet was enough to paralyse a target server. Hackers immediately adopted the finding to take down the targets they intended to attack, and the earliest form of DoS (Denial of Service) was born. The DDoS (Distributed Denial of Service) form, by contrast, relies on sending a ping to a list of many servers (known as amplifiers, because they widen the attack), with the source IP address of the ping forged to be the IP of the victim. When the servers answer the ping request, they flood the victim with replies (pongs).

This part of the thesis therefore studies and demonstrates data mining for detecting denial-of-service attacks, and the technique used is clustering. Clustering is an unsupervised anomaly detection technique. Unsupervised anomaly detection algorithms can run on unlabelled data, which is easy to obtain because it only requires collecting raw audit data from a system. In practice, unsupervised anomaly detection has clear advantages over supervised anomaly detection. The main advantage is that it does not require a purely normal data set for training. Moreover, the data set of normal user behaviour is extremely large, and when a clean training set is collected for a supervised technique there is no guarantee that it contains no intrusions, whereas the set of behaviours regarded as intrusions is much smaller. The technique also has the many other advantages presented in the preceding sections.

4.1.2 Types of data in cluster analysis
Suppose a data set for cluster analysis contains n objects (the objects may be people, houses, documents, and so on). Clustering algorithms usually operate on one of the following two data structures:


(1) Data matrix: represents n objects, such as people, with p variables (also called measurements or attributes), such as age, height, weight and gender. The matrix represents the object-by-attribute relationship.

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix} \qquad (6.1)$$

(2) Dissimilarity matrix: to represent the distance between two points (objects) in a data space of n objects described by p attributes, the dissimilarity matrix is used.

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \qquad (6.2)$$

Here d(i,j) is the distance between object i and object j. In general, d(i,j) is a non-negative number that approaches 0 as objects i and j become more similar. Because d(i,j) = d(j,i) and d(i,i) = 0, the matrix takes the triangular form (6.2).

4.2.2.1 Interval-scaled variables
Interval-scaled variables are continuous measurements on a roughly linear scale, such as weight, height, temperature and age. The measurement unit strongly affects the clustering result. For example, changing units from inches to metres for height, or from pounds to kilograms for weight, can lead to different cluster structures. In general, expressing a variable in smaller units produces a larger range for that variable and therefore a larger influence on the clustering result. To avoid dependence on the choice of units, the data must be standardised. Standardisation attempts to give all variables equal influence. This is particularly useful when no prior knowledge of the data is available. In some applications, however, the user may want some variables to carry more influence than others. For example, when clustering candidates for a basketball team, height is given the highest priority.

To standardise measurements, one option is to convert the original measurements into unitless variables. For a variable f with measurements x_{1f}, x_{2f}, ..., x_{nf}, standardisation can be carried out as follows:

(1) Compute the mean absolute deviation s_f:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \qquad (6.3)$$

where m_f is the mean value of f, that is:

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

(2) Compute the standardised measurement, or z-score:

$$z_{if} = \frac{x_{if} - m_f}{s_f} \qquad (6.4)$$

From formula (6.4), the larger the mean absolute deviation, the more the effect of outliers is reduced. The chosen measure therefore affects the result of outlier analysis.

Common distance measures for interval-scaled variables:

(1) Euclidean distance:

$$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2} \qquad (6.5)$$

where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn}) are two n-dimensional data objects.

(2) Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}| \qquad (6.6)$$

Both the Euclidean and the Manhattan distance satisfy the mathematical requirements of a distance function:
a. d(i,j) >= 0: the distance is a non-negative number.
b. d(i,i) = 0: the distance from an object to itself is zero.
c. d(i,j) = d(j,i): the distance is a symmetric function.
d. d(i,j) <= d(i,h) + d(h,j): the triangle inequality.

(3) Minkowski distance:

$$d(i,j) = \left(|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p\right)^{1/p} \qquad (6.7)$$

where p is a positive integer.

(4) Weighted distance:

$$d(i,j) = \left(w_1|x_{i1} - x_{j1}|^p + w_2|x_{i2} - x_{j2}|^p + \cdots + w_n|x_{in} - x_{jn}|^p\right)^{1/p} \qquad (6.8)$$
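As a small sketch of formulas (6.3) to (6.8), the code below standardises one variable with the mean absolute deviation and evaluates the weighted Minkowski distance, of which the Euclidean (p = 2) and Manhattan (p = 1) distances are special cases with unit weights. The function names are illustrative, and the sketch assumes the variable is not constant (s_f > 0).

```python
def standardize(values):
    """z-scores of one variable using the mean absolute deviation (6.3)-(6.4)."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n   # assumed to be non-zero
    return [(x - m_f) / s_f for x in values]


def minkowski(a, b, p=2, weights=None):
    """Weighted Minkowski distance (6.8); unit weights give (6.7)."""
    if weights is None:
        weights = [1.0] * len(a)
    total = sum(w * abs(x - y) ** p for w, x, y in zip(weights, a, b))
    return total ** (1.0 / p)
```

For instance, `minkowski((1, 3), (3, 1), p=1)` gives the Manhattan distance between the two points and `minkowski((1, 3), (3, 1), p=2)` the Euclidean distance.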

The weighted distance is a refinement of the Minkowski distance that takes into account the influence of each attribute on the distance between two objects. The larger the weight w of an attribute, the more it affects the distance d. The choice of weights depends on the application and its specific goals.

4.2.2.2 Binary variables
A binary variable has only two states, 0 or 1. A binary variable is symmetric if both states are equivalent (in terms of their meaning for the application), that is, there is no preference for state 1. For example, the attribute gender has the two states male and female, corresponding to 0 and 1.

A binary variable is asymmetric if one state is more important than the other (it is usually coded as 1). In this case there is a tendency to favour the preferred state. In medical diagnosis, for example, one conclusion is usually preferred over the other, so unclear states (such as ambiguous symptoms) may also be coded as 1 in order to prioritise further diagnosis or isolation and monitoring. An example of an asymmetric binary variable is an HIV test with the two states positive (1) and negative (0).

Consider two objects i and j whose attributes are binary variables, and assume that all binary variables have the same weight. This gives the contingency table in Table 3.8, where q is the number of binary variables equal to 1 for both i and j, r is the number equal to 1 for i but 0 for j, s is the number equal to 0 for i but 1 for j, and t is the number equal to 0 for both i and j.

                 Object j
                 1      0      sum
Object i   1     q      r      q+r
           0     s      t      s+t
           sum   q+s    r+t    p

Table 3.8: Contingency table for binary variables

The dissimilarity of two objects based on symmetric binary variables (symmetric binary dissimilarity) is:
$$d(i,j) = \frac{r + s}{q + r + s + t} \qquad (6.9)$$

The dissimilarity of two objects based on asymmetric binary variables (asymmetric binary dissimilarity) is:

$$d(i,j) = \frac{r + s}{q + r + s} \qquad (6.10)$$

The distance between two binary variables can also be measured in terms of similarity rather than dissimilarity:

$$sim(i,j) = 1 - d(i,j) \qquad (6.11)$$

The coefficient sim(i,j) is called the Jaccard coefficient.

Example 6.1: Dissimilarity between binary variables. Suppose that a table of patient records (Table 3.9) contains the attributes name, gender, fever, cough, test-1, test-2, test-3 and test-4, where name is an identifier, gender is a symmetric attribute, and the remaining attributes are asymmetric binary attributes. For the asymmetric attributes, the values Y (yes) and P (positive) are set to 1 and the value N (no or negative) is set to 0. Suppose that the distance between objects (patients) is computed from the asymmetric attributes only.

Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      Y      N       N       N       N
Table 3.9: A relational table in which patients are described by binary variables.

By equation (6.10), the distances between the pairs of objects are:

$$d(Jack, Mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(Jack, Jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(Mary, Jim) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
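The three distances can be checked with a short sketch of equation (6.10). The 0/1 encoding follows the rule stated in the example (Y and P become 1, N becomes 0) over the attributes fever, cough, test-1 to test-4; the function name and the convention of returning 0 when q + r + s = 0 are illustrative assumptions.

```python
def asymmetric_binary_dissimilarity(a, b):
    """d(i, j) = (r + s) / (q + r + s) for two 0/1 vectors, equation (6.10)."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1 for both
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 for i, 0 for j
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 0 for i, 1 for j
    if q + r + s == 0:
        return 0.0  # assumption: no positive attribute at all, treat as identical
    return (r + s) / (q + r + s)


# fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_mary_jim = asymmetric_binary_dissimilarity(mary, jim)
```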

These measurements indicate that Mary and Jim are unlikely to have the same disease, because d(Mary, Jim) is the largest.

4.2.2.3 Categorical (nominal) variables, ordinal variables, and ratio-scaled variables

1. Categorical variables: a categorical variable can take more than two states. For example, a colour variable may have the states yellow, green and blue. Let M be the number of states of a nominal variable. The states can be denoted by letters, symbols, or a set of integers such as 1, 2, 3, ..., M. Note that these integers are used only to represent the data and do not express any particular ordering or numeric value. The distance between two objects i and j described by categorical variables can be computed with the simple matching coefficient:
$$d(i,j) = \frac{p - m}{p} \qquad (6.12)$$

where m is the number of categorical attributes whose values match between objects i and j, and p is the total number of categorical attributes.

Example 6.2: Dissimilarity between categorical attributes. Suppose we have the sample data in Table 3.10. Apart from the object identifier, only the categorical variable test-1 is considered here (test-2 and test-3 are used in the later examples).

Object identifier   Test-1 (categorical)   Test-2 (ordinal)   Test-3 (ratio-scaled)
1                   Code-A                 Excellent          445
2                   Code-B                 Fair               22
3                   Code-C                 Good               164
4                   Code-A                 Excellent          1,210

Table 3.10: Sample data table containing variables of mixed types

The dissimilarity matrix for the objects in the table above is:

$$\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ d(4,1) & d(4,2) & d(4,3) & 0 \end{bmatrix}$$

Here there is a single categorical variable, test-1, so p = 1. The computed dissimilarity matrix is therefore:

$$\begin{bmatrix} 0 & & & \\ 1 & 0 & & \\ 1 & 1 & 0 & \\ 0 & 1 & 1 & 0 \end{bmatrix}$$

2. Ordinal variables: an ordinal variable takes values from a set on which an order relation is defined, for example the medal ranking gold, silver, bronze. Ordinal variables may be discrete or continuous. An ordinal variable f has M_f states, and the ordered states define the ranks 1, ..., M_f. Suppose that f is one of a set of ordinal variables describing n objects. The dissimilarity for the ordinal variable f is built as follows:

(1) The value of f for the i-th object is x_{if}, and f has M_f ordered states representing the ranks 1, ..., M_f. Replace each x_{if} by its corresponding rank, r_{if} in {1, ..., M_f}.

(2) Map the rank of each variable onto the interval [0,1] by replacing the rank of object i in variable f with:

$$z_{if} = \frac{r_{if} - 1}{M_f - 1} \qquad (6.13)$$

(3) Compute the dissimilarity using any of the methods already described for interval-scaled variables, applied to z_{if}.

Example 6.3: Dissimilarity between ordinal variables. Take the sample data in Table 3.10 again. The variable test-2 is ordinal. It has three states in the order fair, good, excellent, so M_f = 3. In step 1, each value of test-2 is replaced by its rank, so the four objects are assigned the ranks 3, 1, 2 and 3 respectively. Step 2 normalises the ranks with formula (6.13), mapping rank 1 to 0, rank 2 to 0.5 and rank 3 to 1.0. In step 3 the Euclidean distance is used, and the result is the following dissimilarity matrix:

$$\begin{bmatrix} 0 & & & \\ 1.0 & 0 & & \\ 0.5 & 0.5 & 0 & \\ 0 & 1.0 & 0.5 & 0 \end{bmatrix}$$

3. Ratio-scaled variables: a ratio-scaled variable is a positive measurement on a non-linear scale, for example a quantity that follows an exponential law such as $Ae^{Bt}$, where A and B are positive constants and t is a variable representing time. In most cases the distance measures for interval-scaled variables cannot be applied directly to this kind of variable, because doing so can introduce large distortions. A better approach is to preprocess the data with a logarithmic transformation, y_{if} = log(x_{if}), and only then apply the methods for interval-scaled or ordinal variables.

Example 6.4: Dissimilarity between ratio-scaled variables. Return to the sample data in Table 3.10. The variable test-3 is ratio-scaled. Applying the logarithmic transformation gives the following dissimilarity matrix:

$$\begin{bmatrix} 0 & & & \\ 1.31 & 0 & & \\ 0.44 & 0.87 & 0 & \\ 0.43 & 1.74 & 0.87 & 0 \end{bmatrix}$$
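The three dissimilarity matrices of Examples 6.2 to 6.4 can be reproduced with the sketch below. It assumes a base-10 logarithm and rounds the logarithms to two decimals before taking differences, which is what reproduces the values printed in the last matrix; the helper name `lower_triangle` is illustrative.

```python
import math


def lower_triangle(values, dist):
    """Rows 2..n of the dissimilarity matrix; row i holds d(i,1) .. d(i,i-1)."""
    return [[round(dist(values[i], values[j]), 2) for j in range(i)]
            for i in range(1, len(values))]


test1 = ["Code-A", "Code-B", "Code-C", "Code-A"]        # categorical
test2 = ["Excellent", "Fair", "Good", "Excellent"]      # ordinal
test3 = [445, 22, 164, 1210]                            # ratio-scaled

# Example 6.2: simple matching with p = 1, so d is 0 on a match and 1 otherwise.
categorical = lower_triangle(test1, lambda a, b: 0.0 if a == b else 1.0)

# Example 6.3: rank, map to [0, 1] with (6.13), then take the distance.
ranks = {"Fair": 1, "Good": 2, "Excellent": 3}
z = [(ranks[v] - 1) / (len(ranks) - 1) for v in test2]
ordinal = lower_triangle(z, lambda a, b: abs(a - b))

# Example 6.4: log transform (base 10 assumed), then take the distance.
logs = [round(math.log10(v), 2) for v in test3]
ratio_scaled = lower_triangle(logs, lambda a, b: abs(a - b))
```

With a single variable the Euclidean distance reduces to the absolute difference, which is why `abs(a - b)` is used in both numeric cases.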

4. Variables of mixed types: a database may contain all six types of variable described above. A weighted formula can be used to combine the contributions of the individual variables:

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where $\delta_{ij}^{(f)}$ is computed as follows:
- $\delta_{ij}^{(f)} = 0$ when x_{if} or x_{jf} is missing, or when x_{if} = x_{jf} = 0 and f is an asymmetric binary variable.
- $\delta_{ij}^{(f)} = 1$ in all other cases.

and $d_{ij}^{(f)}$ is computed as follows:
- For interval-scaled or ordinal variables: $d_{ij}^{(f)}$ is the normalised distance.
- For binary or categorical variables:
  + $d_{ij}^{(f)} = 0$ when x_{if} = x_{jf}
  + $d_{ij}^{(f)} = 1$ in all other cases

4.2.3 Clustering methods

4.2.3.1 Partitioning methods
These methods partition a database D of n objects into k clusters such that:
i) Each cluster contains at least one object.
ii) Each object belongs to exactly one cluster.
iii) k is the number of clusters given in advance.
This is the general criterion of traditional partitioning methods. Many recent partitioning methods are based on fuzzy set theory; for them criterion (ii) is relaxed and replaced by the degree to which an object belongs to a cluster, a value between 0 and 1.

Partitioning approaches:
K-means: each cluster is represented by the mean value of the objects in the cluster.
K-medoids: each cluster is represented by one of the objects located near the centre of the cluster.


4.2.3.2 Hierarchical methods
These methods build a hierarchy of clusters rather than a partition of the objects. The number of clusters does not have to be fixed at the start; it is determined by the distance between clusters or by a stopping condition. There are two approaches: agglomerative (merging) and divisive (splitting).

Agglomerative:
(1) Start with each object in a cluster of its own.
(2) If two clusters are close enough (below some threshold), merge them into a single cluster.
(3) Repeat step 2 until all clusters have been merged into one or until the stopping condition is met.

Divisive:
(1) Start with a single cluster containing the whole data space.
(2) Choose the cluster with the highest dissimilarity (the largest element or the largest mean value in its dissimilarity matrix) and split it in two. This step applies a partitioning method to the chosen cluster.
(3) Repeat step 2 until each object forms its own cluster or the stopping condition is met (the required number of clusters is reached, or the distance between clusters falls below a sufficiently small threshold).

Other methods include:
- Density-based methods.
- Model-based methods.
- Grid-based methods.

4.2.4 Clustering with the K-means method
The k-means algorithm partitions a set of n objects into k clusters so that objects in the same cluster have high similarity and objects in different clusters have low similarity. Each cluster is represented by its centroid (the cluster mean). K-means assigns an object to a cluster based on the distance from the object to the centroids of the clusters: an object is placed in the cluster whose centroid is nearest. The cluster means are then updated. The process repeats until the objective function converges. Typically, the square-error criterion is used:

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

where p is an object belonging to cluster C_i and m_i is the centroid of cluster C_i.

4.2.4.1 The k-means algorithm
Input:
+ k: the number of clusters.
+ D: a data set containing n objects.
Output: a set of k clusters.
Method:
(1) Randomly choose k objects from D as the initial centroids of the k clusters;
(2) repeat
(3) Assign (or reassign) each object to the cluster it is nearest to, by comparing the distances from the object to the cluster centroids;
(4) Update the mean values of the clusters;
(5) until there is no change in the clusters.

Example 6.5: Clustering by k-means partitioning. Suppose a set of objects is distributed in space as in Figure 3.5a, and let k = 3.

Figure 3.5: Illustration of the k-means algorithm

Figure 3.5a: Three objects are chosen at random as the initial centroids of three clusters; they are marked with +. Each remaining object is assigned to the cluster whose centroid is nearest to it. This yields three clusters, outlined with dash-dotted curves in the figure.

Figure 3.5b: The centroid of each cluster is updated; the updated centroids are marked with +. Using the new centroids, the objects are redistributed among the clusters according to the nearest centroid. The resulting clusters are outlined with dotted curves.

Figure 3.5c: The steps of Figure 3.5b are repeated. The resulting clusters are outlined with solid curves. Repeating the steps once more produces no change, so the algorithm stops. The result is the three clusters outlined with solid curves.

Example 6.6: Given the set of points
x1 = {1, 3} = {x11, x12}
x2 = {1.5, 3.2} = {x21, x22}
x3 = {1.3, 2.8} = {x31, x32}
x4 = {3, 1} = {x41, x42}
use the k-means algorithm to cluster them with k = 2.

Initialisation step:


- Initialise the partition matrix M with 4 columns, one per point, and 2 rows, one per cluster. The elements m_{ij} (i = 1, 2; j = 1, ..., 4) are initialised to 0.

      x1  x2  x3  x4
  c1  0   0   0   0
  c2  0   0   0   0

- Choose x1 and x2 as the initial centroids of the clusters c1 and c2. The centroids are v1 = {1, 3} = {v11, v12} and v2 = {1.5, 3.2} = {v21, v22}.
- Update the partition matrix M:

      x1  x2  x3  x4
  c1  1   0   0   0
  c2  0   1   0   0

Step 1:
- Assign the objects to the clusters. The Euclidean distance between centroid v_c and point x_j is $d(v_c, x_j) = \sqrt{(x_{j1} - v_{c1})^2 + (x_{j2} - v_{c2})^2}$:

$$d(v_1, x_1) = \sqrt{(1 - 1)^2 + (3 - 3)^2} = 0$$
$$d(v_2, x_1) = \sqrt{(1 - 1.5)^2 + (3 - 3.2)^2} = 0.538$$
Place x1 in cluster c1.
$$d(v_1, x_2) = \sqrt{(1.5 - 1)^2 + (3.2 - 3)^2} = 0.538$$
$$d(v_2, x_2) = \sqrt{(1.5 - 1.5)^2 + (3.2 - 3.2)^2} = 0$$
Place x2 in cluster c2.
$$d(v_1, x_3) = \sqrt{(1.3 - 1)^2 + (2.8 - 3)^2} = 0.361$$
$$d(v_2, x_3) = \sqrt{(1.3 - 1.5)^2 + (2.8 - 3.2)^2} = 0.447$$
Place x3 in cluster c1.
$$d(v_1, x_4) = \sqrt{(3 - 1)^2 + (1 - 3)^2} = 2.828$$
$$d(v_2, x_4) = \sqrt{(3 - 1.5)^2 + (1 - 3.2)^2} = 2.663$$
Place x4 in cluster c2.

- Update the partition matrix:

      x1  x2  x3  x4
  c1  1   0   1   0
  c2  0   1   0   1

Step 2: update the cluster centroids:

$$v_{11} = \frac{m_{11}x_{11} + m_{12}x_{21} + m_{13}x_{31} + m_{14}x_{41}}{m_{11} + m_{12} + m_{13} + m_{14}} = \frac{1 \cdot 1 + 0 \cdot 1.5 + 1 \cdot 1.3 + 0 \cdot 3}{1 + 0 + 1 + 0} = 1.15$$
$$v_{12} = \frac{m_{11}x_{12} + m_{12}x_{22} + m_{13}x_{32} + m_{14}x_{42}}{m_{11} + m_{12} + m_{13} + m_{14}} = \frac{1 \cdot 3 + 0 \cdot 3.2 + 1 \cdot 2.8 + 0 \cdot 1}{1 + 0 + 1 + 0} = 2.9$$
$$v_{21} = \frac{m_{21}x_{11} + m_{22}x_{21} + m_{23}x_{31} + m_{24}x_{41}}{m_{21} + m_{22} + m_{23} + m_{24}} = \frac{0 \cdot 1 + 1 \cdot 1.5 + 0 \cdot 1.3 + 1 \cdot 3}{0 + 1 + 0 + 1} = 2.25$$
$$v_{22} = \frac{m_{21}x_{12} + m_{22}x_{22} + m_{23}x_{32} + m_{24}x_{42}}{m_{21} + m_{22} + m_{23} + m_{24}} = \frac{0 \cdot 3 + 1 \cdot 3.2 + 0 \cdot 2.8 + 1 \cdot 1}{0 + 1 + 0 + 1} = 2.1$$

Return to step 1:
- Assign the objects to the clusters using the new centroids v1 = {1.15, 2.9} and v2 = {2.25, 2.1}:

$$d(v_1, x_1) = \sqrt{(1 - 1.15)^2 + (3 - 2.9)^2} = 0.180$$
$$d(v_2, x_1) = \sqrt{(1 - 2.25)^2 + (3 - 2.1)^2} = 1.540$$
Place x1 in cluster c1.
$$d(v_1, x_2) = \sqrt{(1.5 - 1.15)^2 + (3.2 - 2.9)^2} = 0.461$$
$$d(v_2, x_2) = \sqrt{(1.5 - 2.25)^2 + (3.2 - 2.1)^2} = 1.331$$
Place x2 in cluster c1.
$$d(v_1, x_3) = \sqrt{(1.3 - 1.15)^2 + (2.8 - 2.9)^2} = 0.180$$
$$d(v_2, x_3) = \sqrt{(1.3 - 2.25)^2 + (2.8 - 2.1)^2} = 1.180$$
Place x3 in cluster c1.
$$d(v_1, x_4) = \sqrt{(3 - 1.15)^2 + (1 - 2.9)^2} = 2.652$$
$$d(v_2, x_4) = \sqrt{(3 - 2.25)^2 + (1 - 2.1)^2} = 1.331$$
Place x4 in cluster c2.

- Update the partition matrix:

      x1  x2  x3  x4
  c1  1   1   1   0
  c2  0   0   0   1

Step 2: update the cluster centroids:

$$v_{11} = \frac{1 \cdot 1 + 1 \cdot 1.5 + 1 \cdot 1.3 + 0 \cdot 3}{1 + 1 + 1 + 0} = 1.27$$
$$v_{12} = \frac{1 \cdot 3 + 1 \cdot 3.2 + 1 \cdot 2.8 + 0 \cdot 1}{1 + 1 + 1 + 0} = 3$$
$$v_{21} = \frac{0 \cdot 1 + 0 \cdot 1.5 + 0 \cdot 1.3 + 1 \cdot 3}{0 + 0 + 0 + 1} = 3$$
$$v_{22} = \frac{0 \cdot 3 + 0 \cdot 3.2 + 0 \cdot 2.8 + 1 \cdot 1}{0 + 0 + 0 + 1} = 1$$
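The iterations worked through above can be reproduced with a short sketch. It is a plain Lloyd-style k-means that, like the example, starts from given initial centroids instead of choosing them at random; the function name and the rule of keeping the old centroid for an empty cluster are illustrative assumptions.

```python
import math


def kmeans(points, initial_centroids):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(initial_centroids)
    assignment = None
    while True:
        # Step 1: assign every point to the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))


points = [(1, 3), (1.5, 3.2), (1.3, 2.8), (3, 1)]
centroids, assignment = kmeans(points, [points[0], points[1]])
```

Starting from x1 and x2, the run ends with x1, x2 and x3 in the first cluster and x4 alone in the second, matching the last partition matrix of the example.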

Because the partition matrix has changed, we return to step 1 and continue until the partition matrix no longer changes.

* Advantages and disadvantages of the algorithm

a. Advantages:
+ Relatively fast. The complexity of the algorithm is O(tkn), where:
  - n: the number of objects in the data space.
  - k: the number of clusters.
  - t: the number of iterations (t is usually small compared with n).
+ K-means is well suited to spherical clusters.

b. Disadvantages:
+ It does not guarantee a global optimum, and the output depends heavily on the choice of the k initial points. The algorithm therefore has to be rerun with several different initialisations to obtain a sufficiently good result.
+ The number of clusters must be specified in advance.
+ It is hard to determine the true number of clusters in the data space, so several values of k may have to be tried.
+ It has difficulty detecting clusters of complex shapes, especially non-convex clusters.
+ It cannot handle noise and outliers.
+ It can be applied only when a centroid can be computed.

4.2.4.2 Representative-object technique: the k-medoids method
The k-means algorithm is inaccurate when the data are noisy, because an object with a very large value can strongly distort the assignment of data to clusters. To overcome this, instead of using the mean value of the objects in a cluster as its representative point, an actual object can be used to represent the cluster. Each remaining object is assigned to the cluster whose representative object is most similar to it. The partitioning is based on minimising the sum of dissimilarities between each object and its corresponding representative object. The absolute-error criterion is therefore used:
$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$$

where p is the point in space representing an object in cluster C_j, and o_j is the representative object of C_j. In general, the algorithm iterates until every representative object is a true medoid, that is, it lies at the centre of its cluster. This is the basic idea of the k-medoids method.

The k-medoids algorithm
Input: data set D; number of clusters k.
Output: k clusters.
Method:
(1) Randomly choose k objects O_i (i = 1..k) as the initial centres (medoids) of the clusters.
(2) Repeat
(3) Assign (or reassign) each remaining object to the cluster whose centre is nearest to it.
(4) For each centre object O_i:
  - Consider each non-centre object (non-medoid) x in turn.
  - Compute the change in cost S of swapping O_i with x, defined as S = E_x - E_{O_i}.
  - If S < 0, replace O_i with x.
(5) Until there is no change in the clusters.

Advantage: k-medoids copes with noise and outliers.
Disadvantage: k-medoids is efficient only when the data set is not too large, because its complexity is O(k(n-k)^2 t), where:
n: the number of points in the data space.
k: the number of clusters.
t: the number of iterations, which is small compared with n.

4.2 Program analysis and design diagram (patterns)
[Figure: raw audit data and supplementary data pass through the processing mechanism into the transformed-data database; the transformed data and the parameters feed the data mining process, which outputs the clusters and the anomalous patterns]

Figure: General principle of an intrusion detection process using clustering

4.2.1 Data collection and preprocessing

4.2.1.1 Data collection
On *NIX systems the widely used tool TCPDUMP can be used to collect and filter data passing through a network interface. Tcpdump runs as a background process and writes the required information (selected through command-line parameters) to a file. Each time n packets have been collected they are stored for later processing and for classification, after which collection resumes. When packet information is saved to a file in each processing cycle, the required information is narrowed down with tcpdump parameters, for example:

tcpdump -c 50 -w dump host victim-machine

For each Ethernet packet received:
{
- resolve the destination IP address
- resolve the source address
- extract the protocol information
}

The destination and source IP addresses are in dotted-decimal form, four decimal numbers separated by dots, for example 192.168.43.2. Another very important piece of information in the collection process is the protocol type, which is either TCP or UDP. Other protocols such as ICMP, ARP and RARP are also considered. The following listing gives an example of tcpdump output containing the source and destination IP addresses followed by the protocol of each packet:

  192.168.138.20.123 > 192.168.166.63.123: udp 48
  192.168.166.63 > 19.168.138.20: icmp:192.168.166.63 udp port 123 unreachable (DF)
  arp who-has 192.168.138.1 tell 192.168.138.20
  arp reply 192.168.138.1 is-at 0:0:c:7:ac:0
  192.168.138.20.123 > 192.168.63.1.123: udp 48
  192.168.138.20.123 > 192.168.63.2.123: udp 48
  192.168.138.20.123 > 192.168.166.63.123: udp 48
  192.168.133.46 > 192.168.138.20: icmp: echo request (DF)
  192.168.138.20 > 192.168.133.46: icmp: echo reply
  192.168.133.46 > 192.168.138.20: icmp: echo request (DF)
  192.168.138.20 > 192.168.133.46: icmp: echo reply
  192.168.133.46 > 192.168.138.20: icmp: echo request (DF)
  192.168.138.20 > 192.168.133.46: icmp: echo reply

On hosts running an operating system other than *NIX (that is, Windows), WinDump can be used; it is regarded as the Windows version of tcpdump. WinDump is fully compatible with tcpdump and is likewise used to capture, inspect and save network traffic to disk. WinDump is compatible with Windows 95, 98, ME, NT, 2000, XP, 2003 and Vista, and relies on the library and drivers of WinPcap (http://www.winpcap.org/). WinDump is free under a BSD-style licence available at http://www.winpcap.org/windump/misc/copyright.htm.

The program designed in this thesis uses raw data obtained from a trace recorded on a DMZ Ethernet. The trace was collected from 16 September 1993 to 15 October 1993 and captured 782281 wide-area connections between the Lawrence Berkeley Laboratory and the rest of the network. The raw trace was taken with tcpdump on a Sun Sparcstation using the BPF kernel packet filter.

4.2.1.2 Preprocessing
The raw data are collected in a .txt file; each time a connection occurs, a line is automatically appended to the file. The file used in this thesis has 9 fields separated by whitespace:

Field | Meaning
TimeStamp | Timestamp recording when the connection occurred, with microsecond precision
Duration | Lifetime of the connection in seconds
Protocol | Protocol of the connection
Bytes sent by originator | Size of the data sent, in bytes; replaced by ? if unknown
Bytes sent by responder | Size of the data sent in reply, in bytes; replaced by ? if unknown
Local host | Address of the internal machine. Only the last number of the address is used instead of the full IP address, because the first three numbers are the same for all internal hosts
Remote host | IP address of the machine outside the network, in dotted-decimal form (four decimal numbers separated by dots)
State | State of the connection. The two most important states are SF, a normal connection, and REJ, a rejected connection (the initial SYN was answered with a RST). Other states include S0, S1, S2, S3, S4, RSTOSn, RSTRSn, SHR, SHa, SS, OOS1, OOS2
Flag | Optional flag indicating whether the connection is internal or with an outside network. L: the connection originated from the local network; N: connection with a neighbouring network. This flag is set mainly for nntp connections

Examples:

  748162802.427995 1.24383 smtp ? ? 1 128.97.154.3 REJ L
  748162802.803033 3.96513 smtp 1173 328 3 128.8.142.5 SF
  748162804.817224 1.02839 nntp 58 129 2 140.98.2.1 SF L
  748162812.254572 138.168 nntp 363238 1200 4 128.49.4.103 SF L
  748162817.478016 10.0858 nntp 230 100 4 128.32.133.1 SF N
  748162833.453963 2.16477 smtp 2524 306 5 192.48.232.17 SF
  748162836.735788 13.1779 smtp 16479 174 16 128.233.1.12 RSTRS3 L

The raw data are first read and stored in a table (TCP) in the TCP database.


RemoteHost is converted to an integer by adding a lookup table.

The data are grouped by a fixed time window and the grouping result is stored in an intermediate table, TCP_F.
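The three preprocessing steps (reading the raw lines, mapping RemoteHost to an integer, and grouping by time window) can be sketched as follows. The sketch assumes the field order shown in the sample lines (local host before remote host) and uses in-memory dictionaries in place of the TCP and TCP_F database tables; all function names and dictionary keys are hypothetical.

```python
def parse_line(line):
    """Split one raw connection line into named fields ('?' becomes None)."""
    parts = line.split()
    ts, duration, proto, b_orig, b_resp, local, remote, state = parts[:8]
    return {
        "timestamp": float(ts),
        "duration": float(duration),
        "protocol": proto,
        "bytes_orig": None if b_orig == "?" else int(b_orig),
        "bytes_resp": None if b_resp == "?" else int(b_resp),
        "local_host": local,
        "remote_host": remote,
        "state": state,
        "flag": parts[8] if len(parts) > 8 else "",  # the flag is optional
    }


def preprocess(lines, window_seconds):
    """Return (remote host -> integer id, time-window index -> records)."""
    remote_ids = {}
    windows = {}
    for line in lines:
        rec = parse_line(line)
        # Lookup table: each new remote host gets the next integer id.
        rec["remote_id"] = remote_ids.setdefault(rec["remote_host"],
                                                 len(remote_ids) + 1)
        key = int(rec["timestamp"] // window_seconds)
        windows.setdefault(key, []).append(rec)
    return remote_ids, windows
```

Keeping the window index as the grouping key makes it easy to count requests per host inside each window in the mining step.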

4.2.2 Data mining to detect denial-of-service attacks

4.2.2.1 Anomalous patterns of denial-of-service attacks
Practical surveys, combined with the knowledge of experts in the security field, show the following:

- Denial-of-service attacks are carried out mainly over HTTP. Normal HTTP requests are just Web addresses (URLs) and are therefore very small. In many attacks, however, the attacker inserts code to be executed in the victim's browser (cross-site scripting), for example: http://www.microsoft.com/education/?ID=MCTN&target=http://www.microsoft.com/education/?ID=MCTN&target=<script>alert(document.cookie)</script>, uses scripts in the form of flash files, or inserts SQL queries into the URL. This makes such requests larger than normal. Studies therefore indicate that requests of this protocol with a size >= 350 bytes are anomalous patterns that may be attacks.



- Denial-of-service attacks come in many forms, such as Ping of Death, Teardrop, Land Attack, Winnuke, Smurf Attack, UDP/ICMP Flooding, TCP/SYN Flooding and DNS attacks.

Ping of Death: some computers stop working, reboot or crash when they receive oversized ping packets. For example: ping address -n 1000, where 1000 is the number of times the packet is sent.

TCP/SYN Flooding:
Step 1: the client sends a TCP SYN packet to the service port of the server.
  Client -> SYN packet -> Server
Step 2: the server replies to the client with a SYN/ACK packet and waits for an ACK packet from the client.
  Server -> SYN/ACK packet -> Client
Step 3: the client replies to the server with an ACK packet and the connection is established; the client and the server then exchange data.
  Client -> ACK packet -> Server
When a hacker carries out SYN flooding by sending a rapid stream of TCP SYN packets to the service port of the server, the server becomes overloaded and can no longer respond.

UDP/ICMP Flooding: the hacker sends a large number of large UDP/ICMP packets to the network. A network under this attack becomes overloaded and its outbound bandwidth is exhausted, which severely affects the link and the speed of the network and makes it difficult for clients to reach the network from outside.

The anomalous pattern here is a RemoteHost that sends a large number of packets to a LocalHost within the period under consideration, that is, a RemoteHost that sends very many requests to a LocalHost inside the time window. The threshold on the number of requests varies with the conditions of the network being protected. Choosing this threshold carefully yields correct alerts, and the choice depends heavily on practical experience.

- With tools such as Trinoo, TFN2K and Stacheldraht, however, the attacker does not attack from a single location but uses many different networks to attack the victim machine simultaneously. A large number of requests from different network addresses therefore arrive at the same machine within a very short period (DDoS).

  The anomalous pattern here is a large number of requests arriving at one LocalHost within a short period of time. Again, the request threshold is adjusted to suit the conditions of the network and the machine.

Denial-of-service attacks have many other anomalous patterns with different characteristics, but this thesis uses only some basic ones, obtained through analysis of real traffic.

4.2.2.2 Data mining

In this mining process, the data used for training consists of the anomalous patterns described above for denial-of-service attacks, with configurable parameters for the number of requests and the period of time considered, so that they can be adapted to different systems. The clustering technique used in this demo is a representative-object technique: the k-medoids method. The anomalous patterns are used as the representatives of the intrusion cluster and everything else forms the normal cluster, so there are two clusters.

After preprocessing, the data is divided into time groups of a width chosen in advance. From here, processing depends on which sign is chosen for detecting denial-of-service attacks:

- If the chosen sign is the one for the HTTP protocol (recorded as WWW in the demo database), the anomalous pattern for denial-of-service attacks over this protocol is used. The input parameters of the k-medoids algorithm are the request packet size threshold and the number of requests meeting that size threshold that are sent to the Web server, which here is a machine in the internal network to be protected (Localhost). The output of the algorithm is two clusters. The anomalous cluster contains the patterns in which the number of connections with a request packet larger than the given size exceeds the request-count threshold within the chosen period of time. The normal cluster contains the patterns without this characteristic.

- If the chosen sign is the traditional DoS attack, the only input of the algorithm is the request-count threshold, and the attribute examined is RemoteHost. The output is two clusters: the anomalous cluster, containing the patterns in which the number of requests from one RemoteHost within a given period of time exceeds the connection threshold, and the normal cluster.

- If the chosen sign is many requests from many different IP addresses to one local machine, processing is similar to the DoS sign, but the attribute examined is LocalHost.
  When the algorithm finishes, it gives two clusters, one of which contains the anomalous patterns whose number of requests to an internal machine exceeds the chosen connection threshold.

We now have two clusters to analyze: the anomalous cluster and the normal cluster.

4.2.3 Data presentation

After the data has been grouped into two clusters as described above, it needs to be presented so that the user can understand it easily. There are many ways to present this data. The demo presents it only as records in a table of anomalous patterns. For example:

In data mining for intrusion detection systems, the results are not only displayed but also processed further, for example to raise alerts or to take action against the threats that have been discovered. This thesis, however, does not go into that topic.
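
The clustering step of Section 4.2.2.2 can be sketched as follows. This is a minimal Python illustration, not the C# program described in this thesis: the record layout (timestamp in seconds, remote host), the function names and the sample data are all assumptions made for the example. The thesis fixes the anomalous cluster through user-chosen thresholds. The sketch instead lets a two-medoid search find the split from the data, and it uses an exhaustive search over medoid pairs rather than the usual PAM swap heuristic, which is only practical for small inputs.

```python
from collections import Counter
from itertools import combinations


def count_requests(records, window_seconds=60):
    """Count requests per (time window, remote host).

    `records` is an iterable of (timestamp_seconds, remote_host) pairs;
    this layout is an assumption made for the example.
    """
    counts = Counter()
    for timestamp, remote_host in records:
        window = int(timestamp // window_seconds)
        counts[(window, remote_host)] += 1
    return counts


def two_medoids(values):
    """Return the pair of medoids (low, high) that minimises the total
    absolute distance from every value to its nearest medoid."""
    candidates = sorted(set(values))
    if len(candidates) < 2:
        raise ValueError("need at least two distinct values to form two clusters")
    best_pair, best_cost = None, None
    for pair in combinations(candidates, 2):
        cost = sum(min(abs(v - m) for m in pair) for v in values)
        if best_cost is None or cost < best_cost:
            best_pair, best_cost = pair, cost
    return best_pair


def split_clusters(counts):
    """Split the counted patterns into (anomalous, normal) lists of keys."""
    low, high = two_medoids(list(counts.values()))
    anomalous, normal = [], []
    for key, n in counts.items():
        # A pattern joins the cluster of its nearest medoid; ties go to normal.
        (anomalous if abs(n - high) < abs(n - low) else normal).append(key)
    return anomalous, normal


# Hypothetical sample: host 10.0.0.9 sends 50 requests in the first minute.
records = [(t, "10.0.0.9") for t in range(50)]
records += [(5, "10.0.0.2"), (20, "10.0.0.2"), (30, "10.0.0.3"), (70, "10.0.0.2")]
anomalous, normal = split_clusters(count_requests(records))
print("anomalous:", anomalous)
print("normal:", normal)
```

Keying the counts on LocalHost instead of RemoteHost would give the corresponding sketch for the third sign (many sources, one target).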


Chapter 5 RESULTS ACHIEVED, EVALUATION, CONCLUSIONS AND FUTURE DEVELOPMENT


5.1 Implementation

The program was implemented in C# in the Microsoft Visual Studio .NET 2005 environment on the Windows XP operating system. As a first step, the program already provides a reasonably user-friendly interface.

System configuration: the program should be run on WinNT, Windows 2000 Advanced Server, Windows 2000 Professional or Windows XP, on a machine with the following configuration:
- Minimum configuration:
  o CPU speed: 1.5 GHz
  o Memory: 256 MB
  o Free hard disk space: 500 MB
- Recommended configuration:
  o CPU speed: 3.2 GHz
  o Memory: 512 MB (or more)
  o Free hard disk space: 1 GB

Program information:

  Language                C Sharp (C#)
  Development tool        Microsoft Visual Studio 2005
  Application type        32-bit Windows application
  Operating system        WinNT, Windows 2000 Advanced Server, Windows 2000 Professional or Windows XP
  Runtime environment     MS .NET Framework 1.0
  Database                Microsoft SQL Server 7, 2000
  Database connectivity   ADO.NET

5.2 Results achieved

This section briefly introduces the program's interfaces and some results obtained by running the program, starting with the main interface (Figure 5.1).


Figure 5.1. Main interface

The components of this interface have the following functions:
- Choose file: selects a .txt or .log file to mine. The file is mined using the number of connections from one RemoteHost within a period of time (the time window) as the sign of attack.
- HTTP: opens a new interface (Figure 5.4) for mining by anomalous signs in the HTTP protocol, along with some other options.
- Automatic: opens a new interface (Figure 5.5) in which mining is run automatically at a chosen interval.
- Collect data: reads the data from the selected file into the TCP data table, converts some attributes from text to numeric form (such as the timestamp and the connection duration), and fills in missing data.
- Preprocess: opens a new interface (Figure 5.2) that preprocesses the data collected above.
- Redo: lets the user choose the file again and restart mining from the beginning.
- Exit: exits the program.


Figure 5.2. Data preprocessing

- Processing time: sets the period of time (the time window) to process, in seconds. The default value is 60 seconds.
- Preprocess: puts the connection data into a form suitable for the algorithm, for example by converting RemoteHost to numeric form and grouping the connections by time.
- Results: displays the preprocessing results in the Preprocessing results area.
- Mine: opens the mining interface (Figure 5.3), where mining is carried out to find the anomalies.
- Exit: returns to the main screen.
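
The preprocessing described for Figure 5.2 (converting RemoteHost to numeric form and grouping connections by time window) might look like the sketch below. It is written in Python for illustration and is not the thesis's C# code. It assumes that RemoteHost is an IPv4 address in dotted notation and that timestamps are already in seconds. The names host_to_number and group_by_window are hypothetical.

```python
import ipaddress
from collections import defaultdict


def host_to_number(remote_host):
    """Convert a dotted IPv4 address to an integer (assumed encoding)."""
    return int(ipaddress.IPv4Address(remote_host))


def group_by_window(connections, window_seconds=60):
    """Group connections into consecutive time windows.

    `connections` is an iterable of (timestamp_seconds, remote_host) pairs.
    Returns {window_index: [numeric_host, ...]}.
    """
    groups = defaultdict(list)
    for timestamp, remote_host in connections:
        window = int(timestamp // window_seconds)
        groups[window].append(host_to_number(remote_host))
    return dict(groups)


# Hypothetical connections; 60 seconds is the default window named in the text.
connections = [(3, "192.168.1.10"), (42, "192.168.1.10"), (61, "10.0.0.1")]
groups = group_by_window(connections)
print(groups)
```

The first two connections fall into window 0 and the third into window 1, so a later step can count connections per host inside each window.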


Figure 5.3. Mining interface

- Frequency: sets the frequency of connections that is treated as a sign of anomaly. This is the connection-count threshold used as the input parameter of the mining algorithm, measured in number of occurrences. To avoid wrong results this value must be chosen appropriately, based mainly on experiments and on the user's experience. The threshold is not expressed as a percentage, for two reasons. When the total number of connections in a window is very small, say 2, any threshold <= 50% is certain to report an anomaly. A threshold > 50%, on the other hand, gives badly wrong results when the total number of connections in a time window is very large, because a host can open a great many connections and still account for less than half of the total.
- Results: displays the mining results (the cluster of anomalies).
- Back: returns to the Preprocessing window so that preprocessing can be repeated.
- Exit: exits the program.
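
The argument for an absolute threshold rather than a percentage can be checked with a small Python calculation. The host names, connection counts and threshold values below are made-up numbers chosen only to illustrate the two failure cases.

```python
def flagged_by_count(counts, threshold):
    """Hosts whose number of connections in the window exceeds `threshold`."""
    return sorted(h for h, n in counts.items() if n > threshold)


def flagged_by_share(counts, share):
    """Hosts whose share of the window's connections is at least `share`."""
    total = sum(counts.values())
    return sorted(h for h, n in counts.items() if n / total >= share)


# Quiet window: only 2 connections in total, one per host.
quiet = {"hostA": 1, "hostB": 1}
# Busy window: hostC opens 4000 of 10000 connections.
busy = {"hostC": 4000, "hostD": 3000, "hostE": 3000}

print(flagged_by_share(quiet, 0.5))   # both harmless hosts reach 50%
print(flagged_by_share(busy, 0.6))    # hostC stays below 60% and is missed
print(flagged_by_count(quiet, 100))   # nothing is flagged
print(flagged_by_count(busy, 100))    # every heavy host is flagged
```

With a count threshold the quiet window produces no alert and the busy window does, which matches the behavior the text asks for.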


Figure 5.4. Data mining screen for the HTTP protocol

- Choose Audit file: works in the same way as in the main interface.
- Choose table in the database: allows mining to be run on data already stored in our database (TCP).
- Choose time period: the function already described for the Preprocessing screen.
- HTTP: if selected, mining is carried out by anomalous signs in the HTTP protocol. When this option is selected, the connection threshold and the size threshold for HTTP requests must be entered.
- All protocols: mining is based on the number of requests across all protocols.
- Mine: runs the mining on the collected data with the chosen parameters.
- Redo: selects the inputs again and reruns the mining process.
- Back: returns to the main screen.
- Exit: exits the program.
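
When the HTTP option is selected, two thresholds are entered: a request-size threshold and a connection threshold. A rough Python sketch of that rule is given below. The 350-byte default is the size named in Section 4.2.2.1. The record fields, the function name http_anomalies, the default count threshold of 3 and the sample traffic are assumptions for illustration. The thesis's own implementation is in C# and is not listed in the text.

```python
from collections import Counter


def http_anomalies(requests, size_threshold=350, count_threshold=3, window_seconds=60):
    """Return (window, local_host) pairs that receive more than
    `count_threshold` HTTP requests of at least `size_threshold` bytes
    within one time window.

    `requests` is an iterable of (timestamp_seconds, local_host, size_bytes).
    """
    oversized = Counter()
    for timestamp, local_host, size_bytes in requests:
        if size_bytes >= size_threshold:
            window = int(timestamp // window_seconds)
            oversized[(window, local_host)] += 1
    return sorted(key for key, n in oversized.items() if n > count_threshold)


# Hypothetical traffic to a web server "www": five large requests in minute 0.
requests = [(t, "www", 400) for t in (1, 5, 9, 13, 17)]
requests += [(20, "www", 120), (65, "www", 380), (70, "www", 90)]
print(http_anomalies(requests))
```

Only the first window is reported: it holds five oversized requests, while the second window holds a single one.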


Figure 5.5. Automatic mining screen

- Choose Audit data file: as above.
- Choose time period: as above.
- Threshold: the connection-count threshold.
- Current time mark: because the data being mined is a standalone data set, the starting time from which mining begins must be chosen. When the tool is integrated into a real system this step is unnecessary, because the current system time is used.
- Automatic: runs mining automatically with the parameters above. The parameters entered are checked for validity.
- Stop: pauses automatic mining while keeping the program state. When mining is resumed, it continues from the position where it stopped.
- Redo: restarts the automatic process from the beginning.
- Back: returns to the main screen.
- Exit: exits the system.

Execution speed:

- Time window: 60 seconds

  Total connections   Processing time (s)
  18                  0.046875
  14                  0.03125
  11                  0.0625
  11                  0.015625
  16                  0.03125
  15                  0.015625
  24                  0.03125
  19                  0.03125
  20                  0.03125
  9                   0.25
  08                  0

- Time window: 120 seconds

  Total connections   Processing time (s)
  28                  0.40625
  30                  0.0625
  22                  0.046875
  28                  0.0625
  31                  0.03125
  47                  0.0625
  25                  0.03125
  40                  0.0625
  26                  0.015625
  <10                 ~0
  10 - 20             ~0.015625

- Time window: 300 seconds

  Total connections   Processing time (s)
  69                  0.3125
  51                  0.046875
  31 - 45             0.03125
  69                  0.078125
  106                 0.078125
  116                 0.09375
  170                 0.140625
  162                 0.125
  194                 0.15625
  249                 0.1875
  212                 0.15625
  278                 0.203125
  291                 0.21875
  364                 0.265625
  413                 0.375
  433                 0.421875


- Time window: 600 seconds

  Total connections   Processing time (s)
  120                 0.09375
  202                 0.15625
  372                 0.28125
  245                 0.1875
  426                 0.375
  473                 0.515625
  551                 0.4375
  605                 0.5
  906                 0.78125
  799                 0.671875
  857                 0.75
  1145                0.9375
  1426                1.296875
  1776                1.59375
  3293                2.96875
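
To get a rough feel for how processing time grows with the number of connections, one can fit a straight line to the 600-second measurements. The short Python sketch below does this with ordinary least squares. It is an illustration added for the reader, not an analysis carried out in the thesis, and it uses only the fifteen rows of the 600-second table.

```python
# (connections, processing time in seconds) from the 600-second table.
data = [
    (120, 0.09375), (202, 0.15625), (372, 0.28125), (245, 0.1875),
    (426, 0.375), (473, 0.515625), (551, 0.4375), (605, 0.5),
    (906, 0.78125), (799, 0.671875), (857, 0.75), (1145, 0.9375),
    (1426, 1.296875), (1776, 1.59375), (3293, 2.96875),
]


def linear_fit(points):
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson r."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = linear_fit(data)
print(f"time per connection: {slope * 1000:.3f} ms, "
      f"intercept: {intercept:.3f} s, r = {r:.4f}")
```

In these rows the ratio of processing time to connections stays between about 0.75 and 1.1 ms per connection, so within this window size the time appears to grow roughly in proportion to the number of connections.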

- We can see that processing speed depends on many factors. The important ones considered here are the length of the grouping time window and the number of connections in each group, and how they affect processing time.

5.3 Conclusion

While completing this thesis I gained a certain amount of knowledge, but I also came to see that data mining in general, and data mining in IDS/IPS systems in particular, is a broad and promising field of research.

The thesis has presented the basic issues of data mining: the importance of data mining, the approaches to data mining and the techniques of data mining. It applies a clustering algorithm, specifically the k-medoids method, to data mining for intrusion detection.

The thesis demonstrates that artificial intelligence can be applied in specialized areas of computer science and in other fields, especially computer networks and the Internet. It proposes an operating model for an intelligent system that helps users and network administrators detect intrusions that may lead to denial-of-service attacks.

However, because of limited time and knowledge, the research stops at the level of a demo system covering a few simple types of denial-of-service attack.

As the volume of data collected and stored keeps growing, along with the need to extract information from it, the task set for data mining becomes ever more important. Its applicability to many fields (economy and society, national defense and security, network security) is also an advantage of data mining. With these aspirations, I hope to bring the knowledge gained from this thesis into practice soon, in the service of people's lives. The thesis still has many shortcomings and needs time to be completed and developed further.

5.4 Future development

Building on what has been presented, the following directions are proposed:
- Implement a denial-of-service alerting tool with a friendly graphical interface that can respond appropriately to behavior regarded as anomalous.
- Improve the tool into a real-time monitor of DoS attacks, improve the algorithm to make computation faster, and extend it not only with more anomalous patterns of denial-of-service attacks but also with patterns of other kinds of attack.
- Evaluate the algorithms experimentally to determine the best one for each type of attack (back dos, u2r, r2l, ...).
- Develop tools for monitoring and alerting on DDoS and DRDoS attacks.
- Build a parallel processing system to increase execution speed and to respond quickly to violations.
- Integrate the tool into a traditional misuse detection system.
