
Email Classification Using Support Vector Machines

I. Introduction to Support Vector Machines


The SVM method is an effective text classification method introduced by Vapnik in 1995 to solve two-class recognition problems using the principle of Structural Risk Minimization. Given a training set represented in a vector space in which each document is a point, the method finds the best decision hyperplane h that divides the points in the space into two separate classes. The quality of the hyperplane is measured by the distance from each class to the hyperplane. The larger this distance, the more reliable the decision and the more accurate the classification.

Formulation: In essence, the method is an optimization problem whose goal is to find a space H and a decision hyperplane h on H such that the classification error is minimal. Each document is represented by a vector $\vec{d}_i = (w_{i1}, w_{i2}, \ldots, w_{i|T|})$, where $w_{ij}$ is the weight of the $j$-th term occurring in the document.

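The hyperplane containing the vector $\vec{d}_i$ in this space has the equation

$$\vec{d}_i \cdot \vec{w} + b = 0$$

Define

$$h(\vec{d}_i) = \operatorname{sign}(\vec{d}_i \cdot \vec{w} + b) = \begin{cases} +1, & \vec{d}_i \cdot \vec{w} + b > 0 \\ -1, & \vec{d}_i \cdot \vec{w} + b < 0 \end{cases}$$

Thus $h(\vec{d}_i)$ expresses the assignment of $\vec{d}_i$ to one of the two classes. Let $y_i \in \{\pm 1\}$, with $y_i = +1$ when $\vec{d}_i$ belongs to the positive class and $y_i = -1$ when $\vec{d}_i$ belongs to the negative class. To obtain the hyperplane h we must solve the following problem: find $\min \|\vec{w}\|$ over $\vec{w}$ and $b$ subject to

$$\forall i \in \{1, \ldots, n\}: \quad y_i (\vec{d}_i \cdot \vec{w} + b) \ge 1$$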
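The decision rule and the margin condition can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function names, the four toy document vectors and the values of w and b are made up for illustration and do not come from the report, and the constraint is taken in its usual form y_i (d_i . w + b) >= 1.

```python
import math

def score(w, b, d):
    """Signed value d . w + b for a document vector d."""
    return sum(wj * dj for wj, dj in zip(w, d)) + b

def decision(w, b, d):
    """h(d) = sign(d . w + b). The case d . w + b == 0 is not defined
    in the text; returning -1 there is an arbitrary choice."""
    return 1 if score(w, b, d) > 0 else -1

def satisfies_margin(w, b, docs, labels):
    """True when y_i * (d_i . w + b) >= 1 holds for every training document."""
    return all(y * score(w, b, d) >= 1 for d, y in zip(docs, labels))

def margin(w):
    """Distance 1 / ||w|| from the hyperplane to the closest admissible points."""
    return 1.0 / math.sqrt(sum(wj * wj for wj in w))

# Hypothetical toy data: two documents per class in a 2-term space.
docs = [(2.0, 2.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
labels = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.0

print([decision(w, b, d) for d in docs])
print(satisfies_margin(w, b, docs, labels))
print(margin(w))
```

With these toy values every document lies exactly on the margin, so all four constraints hold with equality.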

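This problem can be solved with the method of Lagrange multipliers, which transforms it into a form with equality constraints. An interesting property of SVM is that the decision hyperplane depends only on the support vectors, the points whose distance to the decision hyperplane is $1 / \|\vec{w}\|$. If all the other points are removed, the algorithm still gives the same result as before. This distinguishes SVM from other methods, in which all the data in the training set are used to optimize the result.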
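For reference, the Lagrangian treatment mentioned above is usually written as follows. This is the standard textbook derivation rather than something taken from the report; the multipliers $\alpha_i$ are notation introduced here, and minimizing $\|\vec{w}\|$ is replaced by the equivalent problem of minimizing $\tfrac{1}{2}\|\vec{w}\|^2$.

```latex
% Lagrangian of the primal problem, with multipliers \alpha_i \ge 0
L(\vec{w}, b, \alpha) = \tfrac{1}{2}\|\vec{w}\|^2
  - \sum_{i=1}^{n} \alpha_i \left[ y_i (\vec{d}_i \cdot \vec{w} + b) - 1 \right]

% Setting the derivatives with respect to \vec{w} and b to zero
\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{d}_i ,
\qquad
\sum_{i=1}^{n} \alpha_i y_i = 0

% Dual problem obtained by substituting back
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j y_i y_j \, \vec{d}_i \cdot \vec{d}_j
\quad \text{subject to } \alpha_i \ge 0, \; \sum_{i=1}^{n} \alpha_i y_i = 0

% Complementary slackness
\alpha_i \left[ y_i (\vec{d}_i \cdot \vec{w} + b) - 1 \right] = 0
```

The last condition is what makes the support vectors special: $\alpha_i$ can be non-zero only for points that satisfy the constraint with equality, so $\vec{w}$ is a combination of those points alone and removing the remaining points leaves the solution unchanged.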

II. Introduction to the email classification problem


1.1 Overview of the email classification problem

Ours is the age of information technology, and the Internet has become an essential part of life. Since its arrival, information can be received and exchanged much faster. Email in particular is an effective, fast and inexpensive means of exchanging information. In the course of this exchange, however, we often run into the annoyance of emails that we have no need to receive but that land in our mailbox anyway. Such messages are called spam mail: email distributed widely and in bulk, regardless of who the recipient is. Its content is mostly advertising.

The problem is an old one and goes back to the early days of the Internet. The first Internet users were computer experts, and they too sent many emails to groups of newsgroups. Only later did incoming email become impossible to control, which is why programs that filter and block unwanted email are needed. Recent years have truly been an era of information explosion: the volume of information exchanged is very large and the number of participants keeps growing, which are favourable conditions for online advertising to develop strongly. Blocking programs exist, but they are no longer effective. Email sent with bad intent does a great deal of harm, for example slowing down the connection and wasting money on deleting it. Many scientists, organizations and individuals have studied and built programs to classify and filter email, yet spam is sent in ever larger quantities and gets past the filters. The fight between spammers and those who oppose them therefore continues, with no end in sight.

For these reasons I chose to study, and to present to the teacher and the class, the topic: a study of the email classification problem using Support Vector Machines.

As stated above, blocking spam is very necessary. Many studies have followed different directions based on different characteristics. The usual methods look at the subject line of incoming mail, or reject mail that comes from sources regarded as spam distributors, that contains certain words, or that repeats content in an unnatural way. Today, however, spam is increasingly sophisticated, and it is hard to tell spam from non-spam. One thing spam never changes is its nature, namely its advertising purpose, so filtering methods need to take this into account. A second, equally important point concerns the filtering itself: when a filter classifies wrongly or works ineffectively, the price paid is also very high. These are two issues that must be taken seriously.

1.2 The email classification problem

Solving this problem with the SVM method amounts to applying text classification to email classification. It is the process of training on available samples in order to classify a new document.

Some notes on the PU corpora

PU is a standard benchmark collection consisting of four smaller corpora: PU1, PU2, PU3 and PUA. In these corpora the text has been replaced by numbers, as illustrated in the following figure:

The function that maps the text to numbers has not been published, so recovering the original data is very difficult. This protects the privacy of both recipients and senders. Within a corpus, identical emails received on the same day were removed by hand. Two emails are considered different if they differ in at least 5 lines. All identical emails, regardless of the day they were received, were removed so that only one copy remains. This procedure was applied to both non-spam and spam emails. Each PU corpus is divided into parts numbered 1 to 10, and each part contains the same number of non-spam emails and of spam emails as the others.
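The division into ten parts with the same class counts can be sketched as follows. This is an assumption-laden illustration, not the procedure actually used to build the PU corpora: the function name `split_into_parts` and the round-robin assignment are hypothetical, and the toy label list stands in for a real corpus.

```python
def split_into_parts(labels, n_parts=10):
    """Assign message indices to n_parts parts, round-robin within each class,
    so that per-class counts differ by at most one between parts."""
    parts = [[] for _ in range(n_parts)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for y in sorted(by_class):
        for k, idx in enumerate(by_class[y]):
            parts[k % n_parts].append(idx)
    return parts

# Hypothetical toy corpus: 60 non-spam (+1) and 40 spam (-1) messages.
labels = [1] * 60 + [-1] * 40
parts = split_into_parts(labels)
for part in parts:
    n_ham = sum(1 for i in part if labels[i] == 1)
    n_spam = sum(1 for i in part if labels[i] == -1)
    print(len(part), n_ham, n_spam)
```

Keeping the class counts equal across parts means that any part can serve as a test set with the same spam to non-spam ratio as the rest.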


The effectiveness of the classifier is evaluated with the following measures.

The spam recognition rate is the percentage of the spam messages reaching the filter that the filter labels as spam.

The non-spam recognition rate is the percentage of the messages blocked by the filter that really are spam. This criterion measures how safe the filter is.

$$\text{spam recognition rate} = \frac{T_1}{T_1 + T_2}$$

$$\text{non-spam recognition rate} = \frac{T_1}{T_1 + T_3}$$

where
T1 : number of spam messages recognized as spam
T2 : number of spam messages recognized as non-spam
T3 : number of non-spam messages recognized as spam

The classifier can also be evaluated by its accuracy or by its error rate.

$$\text{accuracy} = \frac{v_1 + v_2}{V_1 + V_2}$$

v1 : number of spam messages identified as spam
v2 : number of non-spam messages identified as non-spam
V1 and V2 are the total numbers of non-spam and spam messages to be classified

$$\text{error rate} = \frac{\bar{v}_1 + \bar{v}_2}{V_1 + V_2}$$

$\bar{v}_1$ : number of spam messages identified as non-spam
$\bar{v}_2$ : number of non-spam messages identified as spam
V1 and V2 are the total numbers of non-spam and spam messages to be classified
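The ratios above are easy to compute from the four outcome counts. The sketch below is illustrative only: T1, T2 and T3 follow the definitions in the text, while `t4` (non-spam recognized as non-spam), the function name and the example counts are assumptions added here. The keys `spam_recall` and `spam_precision` are the common names for T1/(T1+T2) and T1/(T1+T3), not the report's own terms.

```python
def filter_metrics(t1, t2, t3, t4):
    """t1: spam -> spam, t2: spam -> non-spam,
    t3: non-spam -> spam, t4 (assumed name): non-spam -> non-spam."""
    total = t1 + t2 + t3 + t4
    return {
        "spam_recall": t1 / (t1 + t2),      # T1 / (T1 + T2)
        "spam_precision": t1 / (t1 + t3),   # T1 / (T1 + T3)
        "accuracy": (t1 + t4) / total,      # correctly classified / all
        "error_rate": (t2 + t3) / total,    # misclassified / all
    }

# Hypothetical counts for 100 messages.
m = filter_metrics(t1=40, t2=10, t3=5, t4=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to 1, which is a quick sanity check on any reported pair of values.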

The first step of any classification method is to convert the text from a string of characters into a form suited to that method. Most methods represent documents as vectors.

Input: the input data set for the program has the form
    <label> <index1>:<value1> <index2>:<value2> ...
where
    label is either +1 (non-spam) or -1 (spam)
    index is a word encoded as a number
    value is the number of times the word occurs in the email
The PU corpora are very large, but my program uses only part of the data, taken mainly from PU1.

Output: the accuracy of the model.

1.3 Results

From the corpora described above we built a working data set of 660 emails in total (still rather small). The results are given in the following table:

    Confidence    Accuracy
    0.00          0.77138
    0.25          0.62011
    0.50          0.62011
    0.75          0.61453
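The `<label> <index>:<value>` input format described above can be produced and read back with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the token ids are made-up stand-ins for the numeric codes in the PU corpora, and indices are written in ascending order, which is what LIBSVM expects.

```python
from collections import Counter

def to_libsvm_line(label, token_ids):
    """Build one input line from a label (+1 non-spam, -1 spam) and the
    numeric ids of the words in an email; value = occurrence count."""
    counts = Counter(token_ids)
    feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label:+d} {feats}".rstrip()

def parse_libsvm_line(line):
    """Inverse operation: return (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

line = to_libsvm_line(-1, [7, 3, 7, 12])
print(line)
print(parse_libsvm_line(line))
```

Only words that actually occur in an email are written out, so the representation stays compact even when the vocabulary is large.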


Here the confidence is computed according to the chosen n-fold cross validation. The low accuracy is partly due to the small size of the data set.

The libsvm 2.82 software includes several helper programs for processing the data into the desired output. There are two ways to obtain the accuracy, depending on the data supplied. For example, take the heart_scale data set (included with the software). This data set is small, so either easy.py or grid.py can be used, with the following commands:

    grid.py [-log2c begin,end,step] [-log2g begin,end,step] [-v fold] [-svmtrain pathname] [-gnuplot pathname] [-out pathname] [-png pathname] [additional parameters for svm-train] dataset

and

    easy.py training_file [testing_file]

where the testing file is given if there is one. In addition, a subset can be extracted from the original data:

    subset.py [options] dataset number [output1] [output2]
    options:
    -s method : method of selection (default 0)
        0 -- stratified selection (classification only)
        1 -- random selection
    output1 : the subset (optional)
    output2 : rest of the data (optional)
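The search that grid.py performs over `-log2c begin,end,step` and `-log2g begin,end,step` can be pictured with the sketch below. It is not the tool's code: the function names are hypothetical, the default ranges (-5,15,2 for log2 C and 3,-15,-2 for log2 gamma) are assumed from the tool's documented defaults and should be checked against the installed version, and `cv_accuracy` is a made-up stand-in for the n-fold cross-validation accuracy that svm-train would report.

```python
import math

def frange(begin, end, step):
    """Values begin, begin + step, ... up to and including end,
    for a positive or a negative step."""
    values = []
    x = begin
    while (step > 0 and x <= end) or (step < 0 and x >= end):
        values.append(x)
        x += step
    return values

def parameter_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """All (C, gamma) pairs on the grid; the defaults are an assumption."""
    return [(2.0 ** c, 2.0 ** g)
            for c in frange(*log2c) for g in frange(*log2g)]

def best_parameters(grid, cv_accuracy):
    """Pick the pair with the highest cross-validation accuracy."""
    return max(grid, key=lambda cg: cv_accuracy(*cg))

# Hypothetical scoring function peaking at C = 2, gamma = 0.125.
def toy_cv_accuracy(c, g):
    return -abs(math.log2(c) - 1) - abs(math.log2(g) + 3)

grid = parameter_grid()
print(len(grid))
print(best_parameters(grid, toy_cv_accuracy))
```

In real use the scoring function would train and cross-validate a model for each pair, which is why a coarse grid in log2 steps is searched rather than every possible value.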

The illustrative figure gives the best values.

III. Conclusion

The advantages of the SVM method are:
    It is effective at classification and offers a high level of safety.
    It suits individual users, typically at the client level.
    It is able to learn, which makes the classification more accurate.

IV. References

1. C.C. Chang and C.J. Lin; LIBSVM: a library for support vector machines; 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
2. Data sets taken from the website http://www.itt.demokritos.gr/skel/i-config
3. Graduation thesis "A study of approaches to email classification and the construction of mail client software supporting Vietnamese", Lê Nguyễn Bá Duy and Trần Minh Trí, University of Science, Vietnam National University Ho Chi Minh City
4. Bo co lt

