
VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

---------<>---------

Nguyen Van Huy

BAYES' ALGORITHM AND ITS APPLICATIONS

UNDERGRADUATE GRADUATION THESIS, FULL-TIME PROGRAM
Major: Information Technology

Supervisor: MSc. Nguyen Nam Hai
Co-supervisor: MSc. Hoang Kien

HANOI, 2009
Acknowledgments

Writing a scientific thesis is one of the hardest tasks I have had to complete so far. While working on this topic I ran into many difficulties and moments of confusion. Without the help and sincere encouragement of many teachers, friends, and my family, I could hardly have finished this thesis.
First, I would like to sincerely thank Mr. Nguyen Nam Hai and Mr. Hoang Kien, who directly supervised this thesis. Thanks to them I had access to valuable reference material and later received invaluable comments. In addition, the staff of the Computer Center gave me the best possible facilities and patient guidance so that I could work with the system. I am grateful for the days I spent working alongside them and will never forget that wonderful time.
While I was gathering this knowledge, my teachers and friends stood by me throughout my years of study and research at the College of Technology.
Finally, I must mention the immense contribution, which nothing can repay, of my parents, who gave me life, raised me, and always reminded and encouraged me to fulfil my duties.

Hanoi
May 2009

Nguyen Van Huy
Abstract
Statistics (mathematical statistics) is an important branch of mathematics with many major practical applications. It helps people extract information from observed data in order to solve real-world problems.
This thesis presents a statistical approach to predicting events based on Bayesian theory. The theory concerns computing the probability of an event from statistics of past events. After this computation, each event is assigned a probability or a score (depending on the evaluation method) that reflects how likely it is to occur. Finally, events are classified against a threshold.
After the theory, we study a practical problem in information technology: automatic spam filtering. Solving it combines many measures, such as DNS blacklists, recipient and sender checks, Bayesian filters, IP address blocking, and blacklists/whitelists. The Bayesian filter is the intelligent option that stays close to users, because users themselves train it to recognize spam. This thesis focuses on studying the open-source Bayesian spam filter BayesSpam and installing it for SquirrelMail, the open-source webmail system used for the email service of the College of Technology (Coltech Mail). The results show that the filter's effectiveness varies with how users train it through the emails they mark as spam, but overall the filter performs fairly well.


Table of contents
Chapter 1 Introduction
1.1 Overview
1.2 Structure
Chapter 2 Theoretical foundations
2.1 Statement of Bayes' theorem
2.2 Risk minimization in Bayesian classification
2.3 Normal Bayes classification
2.4 Decision regions
Chapter 3 Naive Bayes classification
3.1 Definition
3.2 Naive Bayes probability models
3.3 Parameter estimation
3.4 Building a classifier from the probability model
3.5 The Naive Bayes text classification algorithm
Example: Classifying email with a Naive Bayes classifier
Chapter 4 Solving the spam filtering problem
4.1 Problem statement
4.2 The problem
4.3 Preprocessing each email
4.4 Using Bayes' rule to compute probabilities
4.5 Training the Bayesian filter
4.6 Filtering incoming mail: is it spam?
4.7 The BayesSpam filter
4.8 Some improvements to the BayesSpam filter
Chapter 5 Conclusion
Appendix A The filter's database
References

Chapter 1 Introduction
1.1 Overview
Statistical science plays an extremely important, indispensable role in any scientific research, especially in experimental sciences such as medicine, biology, agriculture, chemistry, and even sociology. Experiments based on statistical methods can give science the most objective answers to its hardest questions.
Statistical science is the science of collecting, analyzing, interpreting, and presenting data in order to uncover the nature and regularities of economic, social, and natural phenomena. It rests on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modelled with probability theory. Because the goal of statistical science is to produce the "most correct" information from the available data, many scholars regard statistics as a kind of decision theory.
Statistics is one of the important tools of macro-level management. It provides truthful, objective, accurate, complete, and timely information for assessment, forecasting, strategy and policy making, and socio-economic development planning, and it meets the information needs of organizations and individuals. Among these roles, forecasting is one of the most meaningful: it involves an internal training process and runs automatically once trained. In other words, when knowledge drawn from statistical data or from user experience is combined with a learning (training) method based on statistical theory, we obtain a machine with knowledge that can make decisions with fairly high accuracy.
Statistical analysis is an indispensable step in scientific research, especially in experimental science. A research project, however expensive and important, will never appear in a scientific journal if it is not analyzed with the right methods. Today, a glance at scientific journals around the world shows that almost every medical paper has a Statistical Analysis section, in which the authors must carefully describe the analysis methods and computations and briefly explain why those methods were used, in order to guarantee or strengthen the scientific weight of the paper's claims. The more prestigious the medical journal, the stricter its requirements on statistical analysis. Without statistical analysis, a paper cannot be regarded as a scientific paper, and a research project is not considered complete.
In statistical science, two schools compete side by side: the frequentist school and the Bayesian school. Most statistical methods in use today were developed by the frequentist school, but the Bayesian school is now gaining ground in science with a new way of thinking about science and scientific inference. Frequentist methods are usually simpler than Bayesian ones. Someone once quipped that those who do Bayesian statistics are geniuses!
To understand the basic difference between the two schools, it helps to say a few words about the philosophy of statistics using an example from medical research. To find out whether two treatments are equally effective, a researcher must collect data from two groups of patients (one treated with method A and one with method B). The frequentist school asks: if the two treatments are equally effective, what is the probability of the observed data? The Bayesian school asks a different question: given the observed data, what is the probability that treatment A is more effective than treatment B? At first glance the two questions look the same, but on reflection the difference is one of scientific philosophy and its consequences are significant. For a physician (or a scientist in general), Bayesian inference is very natural and fits practice well. In clinical medicine, the physician must use test results to judge whether or not the patient has cancer, just as in scientific research we must use data to infer how plausible a hypothesis is.

1.2 Structure
The rest of the thesis is organized as follows:
Chapter 2 presents the theoretical foundations of Bayesian theory: the concepts and methods used in the thesis.
Chapter 3 presents a more advanced part of Bayesian theory, Naive Bayes. It covers the definition, advantages, and classification applications of Naive Bayes, which serve as the basis for building a text classification system.
Chapter 4 describes the filter in detail, including the knowledge base, training the filter, how the filter works, and directions for improving spam filtering.
Chapter 5 presents the conclusions and the application of the BayesSpam filter installed on the SquirrelMail email system.

Chapter 2 Theoretical foundations

2.1 Statement of Bayes' theorem
Bayes' theorem allows us to compute the probability of a random event A given that a related event B has occurred. This probability is denoted P(A|B) and read "the probability of A given B". It is called the conditional probability or posterior probability, because it is derived from, or depends on, the given value of B.

According to Bayes' theorem, the probability of A given B depends on three quantities:
- The probability of A on its own, regardless of B. It is denoted P(A) and read "the probability of A". This is called the marginal probability or prior probability; it is "prior" in the sense that it takes no account of any information about B.
- The probability of B on its own, regardless of A. It is denoted P(B) and read "the probability of B". This quantity is also called the normalising constant, because it is always the same, whichever event A we want to know about.
- The probability of B given that A has occurred. It is denoted P(B|A) and read "the probability of B given A". This quantity is called the likelihood of B given A. Take care not to confuse the likelihood of A given B, which is P(B|A) viewed as a function of A, with the probability of A given B, which is P(A|B).

Given these three quantities, the probability of A given B is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
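As a minimal sketch of the theorem, the following Python snippet computes a posterior from a prior, a likelihood, and the normalising constant. The function name and all numbers are hypothetical and chosen only for illustration; P(B) is obtained here with the law of total probability.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers, not taken from the thesis.
prior_a = 0.01            # P(A)
p_b_given_a = 0.9         # P(B|A)
p_b_given_not_a = 0.05    # P(B|not A)

# Normalising constant P(B) via the law of total probability.
evidence_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)

posterior = bayes_posterior(prior_a, p_b_given_a, evidence_b)
print(posterior)
```

Even with a high likelihood P(B|A), the posterior stays modest here because the prior P(A) is small.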
2.2 Risk minimization in Bayesian classification
Consider now the cork stopper problem. Imagine a factory that produces two kinds of stoppers: $w_1$ = Super and $w_2$ = Average.
Suppose further that the factory keeps a record of the stocks held in its warehouses, summarized as follows:
Number of stoppers of class $w_1$: $n_1$ = 901 420
Number of stoppers of class $w_2$: $n_2$ = 1 352 130
Total number of stoppers: n = 2 253 550
From this we can easily compute the probability that a stopper belongs to each of the two classes. These are called the prior probabilities, or prevalences:
$P(w_1) = n_1/n = 0.4$, $P(w_2) = n_2/n = 0.6$ (1-1)
Note that these prior probabilities are not entirely controlled by the factory; they depend mainly on the quality of the raw material. In the same way, a cardiologist cannot control how prevalent myocardial infarction is in a given population. Prevalences can therefore be regarded as states of nature.
Suppose we are asked to make a blind decision, for example to choose the class of an arbitrary stopper without knowing anything about it. If the only information available is the prior probabilities, we choose class $w_2$. With this choice we expect to be wrong only 40% of the time.
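The prevalences in (1-1) and the error rate of the blind decision can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are not from the source.

```python
# Stock counts quoted in the text.
n1, n2 = 901_420, 1_352_130
n = n1 + n2

# Prior probabilities (prevalences) of the two classes, as in (1-1).
priors = {"w1": n1 / n, "w2": n2 / n}

# Blind decision: always pick the most prevalent class.
blind_choice = max(priors, key=priors.get)
blind_error = 1 - priors[blind_choice]

print(priors, blind_choice, blind_error)
```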
Suppose now that we can measure a feature vector x of the stopper, and let $P(w_i \mid x)$ be the conditional probability that the object with features x belongs to class $w_i$. If we can determine $P(w_1 \mid x)$ and $P(w_2 \mid x)$, then clearly:
- if $P(w_1 \mid x) > P(w_2 \mid x)$, we assign x to $w_1$;
- if $P(w_1 \mid x) < P(w_2 \mid x)$, we assign x to $w_2$;
- if $P(w_1 \mid x) = P(w_2 \mid x)$, the choice is arbitrary.
In short:
if $P(w_1 \mid x) > P(w_2 \mid x)$ then $x \in w_1$ else $x \in w_2$. (1-2a)

The posterior probabilities $P(w_i \mid x)$ can be computed if we know the pdfs (probability density functions) of the feature vector distributions of the two classes. We then compute $p(x \mid w_i)$, the probability density of observing feature x for an object of class $w_i$, called the likelihood of x. In practice we use Bayes' formula:

$$P(w_i \mid x) = \frac{p(x \mid w_i)\,P(w_i)}{p(x)} \qquad \text{(1-3)}$$

with:

$$p(x) = \sum_{i=1}^{c} p(x \mid w_i)\,P(w_i)$$

Note that $P(w_i)$ and $P(w_i \mid x)$ are discrete probabilities, whereas $p(x \mid w_i)$ and $p(x)$ are values of probability density functions. Note also that $p(x)$ is a common factor in the comparison (1-2a), so the rule can be rewritten as:

if $p(x \mid w_1)\,P(w_1) > p(x \mid w_2)\,P(w_2)$ then $x \in w_1$ else $x \in w_2$. (1-4)

or equivalently:

if $v(x) = \dfrac{p(x \mid w_1)}{p(x \mid w_2)} > \dfrac{P(w_2)}{P(w_1)}$ then $x \in w_1$ else $x \in w_2$. (1-4a)

In formula (1-4a), $v(x)$ is called the likelihood ratio.
Figure 1: Histograms of feature N for the two classes of cork stoppers. The threshold value N = 65 is marked with a vertical line.

Suppose that each stopper has a single feature N, that is, the feature vector is x = [N], and suppose a stopper has x = [65].
From the histograms we obtain the likelihoods:
$p(x \mid w_1) = 20/24 = 0.833$, $P(w_1)\,p(x \mid w_1) = 0.333$ (1-5a)
$p(x \mid w_2) = 16/23 = 0.696$, $P(w_2)\,p(x \mid w_2) = 0.418$ (1-5b)
We therefore assign x = [65] to class $w_2$, even though the likelihood of $w_1$ is larger than that of $w_2$.
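The decision for x = [65] can be checked with a short Python sketch that applies rule (1-4) and then normalises with p(x) as in (1-3). The likelihoods 20/24 and 16/23 and the prevalences 0.4 and 0.6 are the values quoted in the text; the variable names are illustrative.

```python
priors = {"w1": 0.4, "w2": 0.6}
likelihoods = {"w1": 20 / 24, "w2": 16 / 23}   # p(x|w_i) read from the histograms

# Numerators of Bayes' formula: p(x|w_i) * P(w_i).
joint = {c: priors[c] * likelihoods[c] for c in priors}

# Normalising term p(x) and the resulting posteriors P(w_i|x).
evidence = sum(joint.values())
posteriors = {c: joint[c] / evidence for c in joint}

# Rule (1-2a): pick the class with the larger posterior.
decision = max(posteriors, key=posteriors.get)
print(joint, posteriors, decision)
```

The likelihood alone favours w1, but the larger prevalence of w2 tips the decision.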
Figure 2 illustrates how adjusting the prevalences shifts the decision threshold for the probability density functions shown:
- Equal prevalences. With equal pdfs, the decision threshold lies halfway between the means. The number of misclassified cases corresponds to the shaded areas. This is the situation with the minimum classification error.
- Prevalence of $w_1$ larger than that of $w_2$. The decision threshold is displaced towards the class with the smaller prevalence, which reduces the number of misclassified cases of the class with the higher prevalence, as seems convenient.

Figure 2: Equal prevalences (a), unequal prevalences (b).

We see that shifting the decision threshold indeed gives better performance for class $w_2$ than for class $w_1$. This sounds reasonable, since class $w_2$ now occurs more often. Because the overall error has increased, one may wonder whether the influence of the prevalences is beneficial at all. The answer is related to the topic of classification risk, which is presented next.
Assume that the cost of a cork stopper of class $w_1$ is 0.025 and that of a stopper of class $w_2$ is 0.015. Suppose that $w_1$ stoppers are used for special bottles, whereas $w_2$ stoppers are used for normal bottles.
If a $w_1$ stopper is misclassified, the loss is 0.025 - 0.015 = 0.01. If a $w_2$ stopper is misclassified, it will be rejected and the loss is 0.015.
We write:
SB: the action of assigning a cork stopper to special bottles.
NB: the action of assigning a cork stopper to normal bottles.
$w_1$ = S (super class); $w_2$ = A (average class)

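Figure 3: Classification results for the cork stoppers with unequal prevalences: 0.4 for class w1 and 0.6 for class w2.

Definition:
$\lambda_{ij} = \lambda(\alpha_i \mid w_j)$ is the loss incurred by action $\alpha_i$ when the correct class is $w_j$, with $\alpha_i \in \{SB, NB\}$:
$\lambda_{11} = \lambda(\alpha_1 \mid w_1) = \lambda(SB \mid S) = 0$,
$\lambda_{12} = \lambda(\alpha_1 \mid w_2) = \lambda(SB \mid A) = 0.015$,
$\lambda_{21} = \lambda(\alpha_2 \mid w_1) = \lambda(NB \mid S) = 0.01$,
$\lambda_{22} = \lambda(\alpha_2 \mid w_2) = \lambda(NB \mid A) = 0$.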

We can arrange the $\lambda_{ij}$ in a loss matrix $\Lambda$:

$$\Lambda = \begin{bmatrix} 0 & 0.015 \\ 0.01 & 0 \end{bmatrix} \qquad \text{(1-6)}$$

The risk of the action of using a stopper (described by feature vector x) for special bottles can then be expressed as:
$R(\alpha_1 \mid x) = R(SB \mid x) = \lambda(SB \mid S)\,P(S \mid x) + \lambda(SB \mid A)\,P(A \mid x)$ (1-6a)
$R(\alpha_1 \mid x) = 0.015\,P(A \mid x)$
Similarly, for the action of using it for normal bottles:
$R(\alpha_2 \mid x) = R(NB \mid x) = \lambda(NB \mid S)\,P(S \mid x) + \lambda(NB \mid A)\,P(A \mid x)$ (1-6b)
$R(\alpha_2 \mid x) = 0.01\,P(S \mid x)$
We assume that the risk evaluation is affected only by wrong decisions. A correct decision therefore incurs no loss, $\lambda_{ii} = 0$, as in (1-6). If instead of two classes we have c classes, the risk associated with action $\alpha_i$ is:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid w_j)\,P(w_j \mid x) \qquad \text{(1-6c)}$$

We are interested in minimizing the average risk computed over an arbitrarily large number of stoppers. The Bayes rule for minimum risk achieves this by minimizing the conditional risks $R(\alpha_i \mid x)$.
Suppose first that all wrong decisions carry the same loss, scaled to one unit of loss:

$$\lambda_{ij} = \lambda(\alpha_i \mid w_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases} \qquad \text{(1-7a)}$$

In this case, since all the posterior probabilities sum to one, we need to minimize:

$$R(\alpha_i \mid x) = \sum_{j \neq i} P(w_j \mid x) = 1 - P(w_i \mid x) \qquad \text{(1-7b)}$$
This is equivalent to maximizing $P(w_i \mid x)$, so the Bayes decision rule for minimum risk corresponds to the following generalization of the two-class rule:

Decide $w_i$ if $P(w_i \mid x) > P(w_j \mid x)$ for all $j \neq i$. (1-7c)

In short: under the Bayes decision rule for minimum risk, when a correct classification incurs no loss and every wrong classification incurs the same loss, we must choose the class with the maximum posterior probability.

The decision function for class $w_i$ is:
$g_i(x) = P(w_i \mid x)$ (1-7d)
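The conditional risks (1-6a) and (1-6b) and the minimum-risk choice between the two actions can be sketched in Python as follows. The loss values are those of the loss matrix (1-6); the posterior probabilities passed in are hypothetical, and the function names are illustrative only.

```python
# Loss lambda(action | true class) from the loss matrix (1-6).
LOSS = {
    ("SB", "S"): 0.0,   ("SB", "A"): 0.015,
    ("NB", "S"): 0.01,  ("NB", "A"): 0.0,
}


def conditional_risk(action, posteriors):
    """R(action | x) = sum over classes of loss * posterior, as in (1-6c)."""
    return sum(LOSS[(action, cls)] * p for cls, p in posteriors.items())


def min_risk_action(posteriors):
    """Pick the action with the smallest conditional risk."""
    return min(("SB", "NB"), key=lambda a: conditional_risk(a, posteriors))


# Hypothetical posteriors for one stopper: "super" is slightly more probable.
posteriors = {"S": 0.55, "A": 0.45}
print(conditional_risk("SB", posteriors), conditional_risk("NB", posteriors))
print(min_risk_action(posteriors))
```

With these numbers the most probable class is S, yet the cheaper action is NB, because wrongly using an average stopper in a special bottle costs more than the reverse mistake. With equal losses the rule reduces to picking the maximum posterior, as in (1-7c).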

Now consider the situation in which wrong decisions incur different losses; for simplicity assume c = 2. From expressions (1-6a) and (1-6b) it is easy to see that a stopper is assigned to class $w_1$ if:

$$\lambda_{12}\,P(w_2 \mid x) < \lambda_{21}\,P(w_1 \mid x)$$

or equivalently:

$$\frac{\lambda_{12}\,P(w_2)}{\lambda_{21}\,P(w_1)} < \frac{p(x \mid w_1)}{p(x \mid w_2)} \qquad \text{(1-8)}$$

The decision threshold for the likelihood ratio is therefore biased by the losses. The Bayes decision rule can be implemented as shown in Figure 5.
Equivalently, we can use adjusted prevalences:

$$P^*(w_1) = \frac{\lambda_{21}\,P(w_1)}{\lambda_{21}\,P(w_1) + \lambda_{12}\,P(w_2)}; \qquad P^*(w_2) = \frac{\lambda_{12}\,P(w_2)}{\lambda_{21}\,P(w_1) + \lambda_{12}\,P(w_2)} \qquad \text{(1-8a)}$$

With the losses $\lambda_{12} = 0.015$ and $\lambda_{21} = 0.01$, and using the prior probabilities above, we obtain $P^*(w_1) = 0.308$ and $P^*(w_2) = 0.692$. The loss is larger when a class $w_2$ stopper is misclassified, so $P^*(w_2)$ is increased relative to $P^*(w_1)$. The effect of this adjustment is to reduce the number of class $w_2$ stoppers that are misclassified as $w_1$. See the classification results in Figure 6.
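The adjusted prevalences of (1-8a) can be reproduced numerically. The following Python sketch uses the losses 0.015 and 0.01 and the prevalences 0.4 and 0.6 quoted in the text; the function name is an illustrative choice.

```python
def adjusted_prevalences(p1, p2, loss12, loss21):
    """Return (P*(w1), P*(w2)) according to (1-8a)."""
    norm = loss21 * p1 + loss12 * p2
    return loss21 * p1 / norm, loss12 * p2 / norm


p1_star, p2_star = adjusted_prevalences(0.4, 0.6, loss12=0.015, loss21=0.01)
print(round(p1_star, 3), round(p2_star, 3))
```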


We can compute the average risk for the two-class case:

$$R = \int_{R_1} \lambda_{12}\,P(w_2 \mid x)\,p(x)\,dx + \int_{R_2} \lambda_{21}\,P(w_1 \mid x)\,p(x)\,dx = \lambda_{12}\,Pe_{12} + \lambda_{21}\,Pe_{21} \qquad \text{(1-9)}$$

$R_1$ and $R_2$ are the decision regions of classes $w_1$ and $w_2$, and $Pe_{ij}$ is the probability of error of deciding class $w_i$ when the true class is $w_j$.
Using the training set to estimate these errors gives $Pe_{12} = 0.1$ and $Pe_{21} = 0.46$ (see Figure 6). The average risk per stopper is now:
$R = 0.015\,Pe_{12} + 0.01\,Pe_{21} = 0.0061$.
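As a quick arithmetic check of the two-class average risk, the snippet below evaluates the right-hand side of (1-9) with the losses and error estimates quoted in the text. The function name is illustrative.

```python
def average_risk(loss12, loss21, pe12, pe21):
    """Two-class average risk: R = loss12 * Pe12 + loss21 * Pe21."""
    return loss12 * pe12 + loss21 * pe21


risk = average_risk(loss12=0.015, loss21=0.01, pe12=0.1, pe21=0.46)
print(risk)
```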
With $\Omega$ denoting the set of classes, formula (1-9) generalizes to:

$$R = \sum_{w_i \in \Omega} \int_X \lambda(\alpha(x) \mid w_i)\,P(w_i, x)\,dx = \sum_{w_i \in \Omega} \int_X \lambda(\alpha(x) \mid w_i)\,P(w_i \mid x)\,p(x)\,dx \qquad \text{(1-9a)}$$

The Bayes decision rule is not the only choice in statistical classification. Note also that, in practice, one tries to minimize the average risk using estimates of the probability density functions computed from a training set, as we did above for the cork stoppers. If we have grounds to believe that the distributions follow a parametric model, we instead compute the appropriate parameters from the training set. Alternatively, we can use empirical risk minimization (ERM), whose principle is to minimize the empirical risk instead of the true risk.

2.3 Normal Bayes classification
So far we have not assumed any particular form for the distribution of the likelihoods. The normal model, however, is a reasonable assumption. It is related to the well-known central limit theorem, according to which the sum of a large number of independent and identically distributed random variables has a distribution that converges to the normal law. In practice we obtain a good approximation to the normal law even when a fairly small number of random variables are added. For features that can be regarded as the result of adding independent variables, the normal assumption is usually acceptable.
The normal likelihood for class $w_i$ is represented by the probability density function:

$$p(x \mid w_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_i)'\,\Sigma_i^{-1}\,(x - \mu_i)\right) \qquad \text{(1-10)}$$

Here d is the dimension of x, and the mean vector $\mu_i$ and covariance matrix $\Sigma_i$ are the distribution parameters; so far we have used their sample estimates $m_i$ and $C_i$.
Figure 7 illustrates the normal distribution in the two-dimensional case.
Consider a training set of n patterns $T = \{x_1, x_2, \ldots, x_n\}$ described by a distribution with probability density function $p(T \mid \theta)$, where $\theta$ is a parameter vector of the distribution (for example, the mean vector of a normal distribution). An interesting way of obtaining sample estimates of the parameter vector is to maximize $p(T \mid \theta)$, which, viewed as a function of $\theta$, is called the likelihood of $\theta$ for the training set. Assuming that each pattern is drawn independently from an infinite population, the likelihood can be written as:

$$p(T \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

When using maximum likelihood estimation of the distribution parameters, it is often easier to maximize $\ln[p(T \mid \theta)]$, which is equivalent. For Gaussian distributions, the sample estimates given by formulas (1-10a) and (1-10b) are the maximum likelihood estimates, and they converge to the true values.
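Maximum likelihood estimation can be illustrated for the simplest case, a one-dimensional Gaussian. The sketch below assumes independent samples, uses a small hypothetical data set, and relies on the standard closed-form ML estimates (the sample mean and the variance with divisor n); the function names are illustrative and not taken from the source.

```python
import math


def gaussian_log_likelihood(samples, mean, variance):
    """ln p(T | theta) for independent samples from N(mean, variance)."""
    n = len(samples)
    squared_error = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)


def ml_estimates(samples):
    """Closed-form maximum likelihood estimates of mean and variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n   # divisor n, not n - 1
    return mean, variance


samples = [2.1, 1.9, 2.4, 2.0, 1.6]   # hypothetical measurements
mean_hat, var_hat = ml_estimates(samples)
best = gaussian_log_likelihood(samples, mean_hat, var_hat)
print(mean_hat, var_hat, best)
```

Evaluating the log-likelihood at any other mean or variance gives a smaller value than `best`, which is what makes these estimates the maximum likelihood ones.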


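As can be seen from (1-10), the surfaces of equal probability density for the normal likelihood satisfy the Mahalanobis metric:

$$(x - \mu_i)'\,\Sigma_i^{-1}\,(x - \mu_i) = \text{constant}$$

We now compute the decision functions for normally distributed features:
$g_i(x) = P(w_i)\,p(x \mid w_i)$, which is proportional to $P(w_i \mid x)$ (1-11)
Taking logarithms, we obtain:

$$h_i(x) = \ln g_i(x) = -\frac{1}{2}(x - \mu_i)'\,\Sigma_i^{-1}\,(x - \mu_i) - \frac{1}{2}\ln|\Sigma_i| + \ln P(w_i) - \frac{d}{2}\ln 2\pi$$

Using these decision functions, which clearly depend on the Mahalanobis metric, we can build the Bayes classifier with minimum risk, which is the optimal classifier. Note that formula (1-11b) uses the true value of the Mahalanobis distance, whereas earlier we used estimates of this distance.
For the case of equal covariance for all classes ($\Sigma_i = \Sigma$), and ignoring constant terms, we obtain:

$$h_i(x) = -\frac{1}{2}(x - \mu_i)'\,\Sigma^{-1}\,(x - \mu_i) + \ln P(w_i) \qquad \text{(1-11c)}$$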
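For the two-class problem, the discriminant $d(x) = h_1(x) - h_2(x)$ is easy to compute. Expanding (1-11c) for the two classes, the quadratic terms in x cancel and we are left with:

$$d(x) = a'x + a_0, \qquad a = \Sigma^{-1}(\mu_1 - \mu_2), \qquad a_0 = -\frac{1}{2}(\mu_1 + \mu_2)'\,\Sigma^{-1}\,(\mu_1 - \mu_2) + \ln\frac{P(w_1)}{P(w_2)}$$

We thus obtain a linear decision function.

For two classes with normal distributions, equal prior probabilities, and equal covariance, there is moreover a very simple formula for the probability of error of the classifier:

$$Pe = 1 - \operatorname{erf}(\delta/2), \qquad \delta^2 = (\mu_1 - \mu_2)'\,\Sigma^{-1}\,(\mu_1 - \mu_2)$$

Here erf denotes the cumulative distribution function of the standard normal distribution (the numerical values quoted later in this section, such as 1 - erf(2.3/2) = 12.5%, correspond to this convention), and $\delta^2$ is the square of the Bhattacharyya distance, a Mahalanobis distance of the difference of the means, which expresses the separability of the classes.
Figure 8 shows the behaviour of Pe as the squared Bhattacharyya distance increases. The function decreases exponentially and converges asymptotically to 0. It is therefore increasingly difficult to reduce the classification error when it is already small.
Note that even when the pattern distributions are not normal, as long as they are symmetric and obey the Mahalanobis metric, we obtain the same decision classifier as in the normal case, although the error assessment and the posterior probabilities differ. To illustrate, consider two classes with equal prior probabilities and three kinds of symmetric distributions, with the same standard deviation and with means 0 and 2.3, as shown in Figure 9.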



The optimal classifier for the three cases uses the same decision threshold, 1.15, but the classification errors differ:

Normal: Pe = 1 - erf(2.3/2) = 12.5%
Cauchy: Pe = 22.7%
Logistic: Pe = 24.0%
Experimental results show that when the covariance matrices exhibit only limited deviations from one another, classification can be performed in much the same way as with the optimal method for equal covariances. This is reasonable: when the covariances do not differ much, the difference between the quadratic and the linear solutions is significant only for patterns far from the prototypes, as shown in Figure 10.
We illustrate this with the Norm2c2d data set. The theoretical error for this two-class, two-dimensional data set is:

$$\delta^2 = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{bmatrix} 0.8 & -0.8 \\ -0.8 & 1.6 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = 8, \qquad Pe = 1 - \operatorname{erf}(\sqrt{2}) = 7.9\%$$

The training set error estimate for this data set is 5%. When errors of ±0.1 are introduced into the values of the transformation matrix A for the data set, with deviations between 15% and 42% of the covariance values, the training set error becomes 6%.
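The error formula Pe = 1 - erf(δ/2) used in this section can be checked numerically. The Python sketch below rests on two assumptions: that "erf" in the text denotes the standard normal cumulative distribution function (the reading that reproduces the quoted 12.5%), and that the 2x2 matrix in the Norm2c2d example is the inverse covariance with negative off-diagonal entries (the reading that reproduces the value 8). The function names are illustrative.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with Python's math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error(delta):
    """Pe = 1 - Phi(delta / 2) for two equiprobable normal classes with equal covariance."""
    return 1.0 - normal_cdf(delta / 2.0)


def quadratic_form(diff, matrix):
    """Compute diff' * matrix * diff for a small dense matrix."""
    size = len(diff)
    return sum(diff[i] * matrix[i][j] * diff[j]
               for i in range(size) for j in range(size))


# One-dimensional example: means 0 and 2.3, unit standard deviation.
pe_normal = bayes_error(2.3)

# Norm2c2d example: mean difference [2, 3], assumed inverse covariance matrix.
delta_sq = quadratic_form([2.0, 3.0], [[0.8, -0.8], [-0.8, 1.6]])
pe_norm2c2d = bayes_error(math.sqrt(delta_sq))

print(pe_normal, delta_sq, pe_norm2c2d)
```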

Returning to the cork stopper data, consider the classification problem using the two features N and PRT with equal prior probabilities. Note that, apart from numerical considerations, statistical classifiers are invariant to scaling operations, so the results are the same whether PRT or PRT10 is used.
A listing of the posterior probabilities is useful for computing the classification errors; see Figure 11.
The covariance matrices are given in Table 1. The deviations of the elements of the covariance matrices from their central values lie between 5% and 30%. The shapes of the clusters are similar, which is evidence that the classification is close to optimal.
By using decision functions based on the individual covariance matrices, instead of a single pooled covariance matrix, we obtain quadratic decision boundaries. However, a quadratic classifier is less robust to deviations than a linear one, especially in high-dimensional spaces, and it requires a large training set (see, for example, Fukunaga and Hayes, 1989).
2.4 Decision regions
In practical pattern recognition applications, simply applying a decision rule such as (1-2a) or (1-7c) may produce many borderline decisions that are very sensitive to noise in the data, which affects the accuracy of the classification. Many patterns near the decision boundary can change their assigned class after only a small adjustment. In other words, in practice many patterns carry characteristics of both classes. For such patterns it is appropriate to place them in a special class for closer examination. This is certainly the case in some applications, for example in medicine, where the borderline between normal and abnormal calls for further analysis. One solution is to attach qualifications to the computed posterior probability $P(w_i \mid x)$ of class $w_i$. For example, we attach the qualification "definite" if the probability is larger than 0.9, "probable" if it is between 0.9 and 0.8, and "possible" if it is smaller than 0.8. In this way, the stopper of case 55 (see Figure 11) is classified as a "possible" cork of class "super", and case 54 as a "probable" cork of class "average".
Instead of attaching a qualitative description to the assigned class, another method used in certain circumstances is to stipulate the existence of a special class, called the reject class, with a corresponding reject region.
Notation:
$w^*$: the assigned class;
$w_i$: the class with the maximum posterior probability, that is, $P(w_i \mid x) = \max_j P(w_j \mid x)$ over all classes $w_j$.
The Bayes rule can then be written as $w^* = w_i$.
We now stipulate that the posterior probability of a stopper must be higher than a certain reject threshold $\lambda_r$; otherwise the stopper is assigned to the reject class $w_r$. The Bayes rule is rewritten as:

$$w^* = \begin{cases} w_i & \text{if } P(w_i \mid x) > \lambda_r \\ w_r & \text{if } P(w_i \mid x) \le \lambda_r \end{cases} \qquad \text{(1-14)}$$

When the likelihood ratio is compared with the prevalence ratio, this ratio has to be multiplied by $(1 - \lambda_r)/\lambda_r$. Since the largest of c posterior probabilities is at least 1/c, a c-class classifier never rejects if $\lambda_r < 1/c$; therefore $\lambda_r \in [1/c, 1]$.
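The reject rule (1-14) amounts to a threshold test on the largest posterior. The following Python sketch shows it for a dictionary of posteriors; the posterior values are hypothetical and the function name and the "reject" label are illustrative choices.

```python
def classify_with_reject(posteriors, reject_threshold):
    """Return the class with maximum posterior, or 'reject' if it is not above the threshold."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] > reject_threshold:
        return best
    return "reject"


# Hypothetical posteriors, with the reject threshold 0.7 used in the example.
print(classify_with_reject({"w1": 0.65, "w2": 0.35}, 0.7))
print(classify_with_reject({"w1": 0.20, "w2": 0.80}, 0.7))
```

The first pattern is left unclassified because its best posterior does not exceed 0.7, while the second is assigned to w2.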
We illustrate the reject class concept with the cork stopper data. Suppose that a reject threshold $\lambda_r = 0.7$ is stipulated. Computing the decision boundaries of the reject class amounts to determining the discriminant functions with prior probabilities $P(w_1) = 1 - \lambda_r = 0.3$ and $P(w_2) = \lambda_r = 0.7$, and vice versa. The decision lines are sloping lines that intersect the vertical axis at PRT10 = 15.5 and PRT10 = 20.1. Note that the two lines tend to be symmetric about the decision line determined previously. Figure 12 shows the scatter plot with the new decision lines. The region between the two lines is the reject region.
Consider the classification matrices shown in Figure 13. There are now 4 patterns of class 1 and 5 patterns of class 2 that lie inside the reject region, corresponding to 9% of the patterns. The number of misclassified patterns is now 1 for class 1 and 5 for class 2, a total error of 6%.
Chapter 3 Naive Bayes classification
3.1 Definition
A naive Bayes classifier is a term in Bayesian statistics for a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even though these features may depend on the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.
Despite their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in complex real-world situations than one might expect. Recent analysis of the Bayesian classification problem has shown that there are theoretical reasons for the effectiveness of naive Bayes classifiers. An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) needed for classification. Because the variables are assumed independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.
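3.2 Naive Bayes probability models
Abstractly, the probability model for a classifier is a conditional model

$$p(C \mid F_1, \ldots, F_n)$$

over a dependent class variable C with a small number of outcomes or classes, conditional on the feature variables $F_1$ through $F_n$.
The problem is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.
Using Bayes' theorem, we have:

$$p(C \mid F_1, \ldots, F_n) = \frac{p(C)\,p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)}$$

In practice, only the numerator of this fraction matters, because the denominator does not depend on C and the values of the features $F_i$ are given, so the denominator is effectively constant.

The numerator is equivalent to the joint probability model $p(C, F_1, \ldots, F_n)$, which can be rewritten as follows by repeatedly applying the definition of conditional probability:

$$p(C, F_1, \ldots, F_n) = p(C)\,p(F_1 \mid C)\,p(F_2 \mid C, F_1)\,p(F_3 \mid C, F_1, F_2) \cdots p(F_n \mid C, F_1, \ldots, F_{n-1})$$

Now the "naive" conditional independence assumption comes into play: assume that each feature $F_i$ is conditionally independent of every other feature $F_j$ for $j \neq i$. This means that

$$p(F_i \mid C, F_j) = p(F_i \mid C)$$

and so the joint model can be expressed as:

$$p(C, F_1, \ldots, F_n) = p(C) \prod_{i=1}^{n} p(F_i \mid C)$$

This means that, under the independence assumption above, the conditional distribution over the class variable C can be expressed as:

$$p(C \mid F_1, \ldots, F_n) = \frac{1}{Z}\,p(C) \prod_{i=1}^{n} p(F_i \mid C)$$

where Z is a scaling factor that depends only on $F_1, F_2, \ldots, F_n$, that is, a constant if the values of the feature variables are known.
If there are k classes and if a model for $p(F_i)$ can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k - 1) + nrk parameters. In practice, k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for prediction.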

3.3 Parameter estimation
All model parameters (that is, the class priors and the feature probability distributions) can be approximated by relative frequencies from the training set. These are the maximum likelihood estimates of the probabilities. Non-discrete features must be discretized first. Discretization can be unsupervised (ad hoc selection of bins) or supervised (binning guided by information in the training data).
If a given class and feature value never occur together in the training set, the frequency-based probability estimate will be zero. This is a problem because it wipes out all the information in the other probabilities when they are multiplied. It is therefore desirable to incorporate a small-sample correction in all probability estimates, so that no probability is ever set to exactly zero.
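One common small-sample correction is additive (Laplace) smoothing. The source does not say which correction it has in mind, so the Python sketch below is only one possible choice; the function name, the parameter `alpha`, and the example data are illustrative.

```python
from collections import Counter


def smoothed_probabilities(observations, values, alpha=1.0):
    """Estimate p(value) with additive smoothing so that no estimate is exactly zero."""
    counts = Counter(observations)
    total = len(observations) + alpha * len(values)
    return {v: (counts[v] + alpha) / total for v in values}


# Hypothetical feature values observed for one class; "yellow" never occurs.
probs = smoothed_probabilities(["red", "red", "green"], ["red", "green", "yellow"])
print(probs)
```

The unseen value still receives a small non-zero probability, so a product of per-feature probabilities is not forced to zero.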

3.4 Building a classifier from the probability model
The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the most probable hypothesis; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier is the function classify, defined as follows:

$$\text{classify}(f_1, \ldots, f_n) = \arg\max_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c)$$

Note that the independence assumption may lead to unexpected results in the computation of the posterior probability. In some circumstances where the observations are dependent, the probability computed above may contradict the second axiom of probability, by which every probability must be less than or equal to one.
Although the far-reaching independence assumption is often inaccurate, the naive Bayes classifier has several properties that make it useful in practice. In particular, the decoupling of the class-conditional feature distributions means that each distribution can be estimated independently as a one-dimensional distribution. The overall classifier is robust enough to tolerate serious deficiencies in its underlying naive probability model.
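The MAP decision rule of this section can be sketched in Python as follows. The sketch assumes that the class priors and the per-feature conditional probability tables are already known and strictly positive; the table layout, the function name, and all numbers (which echo the fruit example) are hypothetical. Sums of logarithms replace the product to avoid numerical underflow, which does not change the arg max.

```python
import math


def classify(features, priors, cond_probs):
    """MAP rule: arg max over c of p(c) * product of p(F_i = f_i | c).

    cond_probs[c][i][value] holds p(F_i = value | C = c).
    """
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(cond_probs[c][i][value]) for i, value in enumerate(features)
        )
    return max(priors, key=log_score)


# Hypothetical tables for two classes and two features (colour, shape).
priors = {"apple": 0.3, "other": 0.7}
cond_probs = {
    "apple": [{"red": 0.8, "green": 0.2}, {"round": 0.9, "long": 0.1}],
    "other": [{"red": 0.2, "green": 0.8}, {"round": 0.3, "long": 0.7}],
}

print(classify(("red", "round"), priors, cond_probs))
print(classify(("green", "long"), priors, cond_probs))
```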

3.5 The Naive Bayes text classification algorithm
The Naive Bayes classification technique is based on Bayes' theorem and is particularly suitable when the dimensionality of the input is large. Although Naive Bayes is quite simple, it can classify better than many more complex methods. For each document class, the Naive Bayes algorithm computes the probability that the document to be classified belongs to that class. The document is then assigned to the class with the highest probability.
The probability $P(c_k \mid d_i)$ that document $d_i$ belongs to class $c_k$ is computed as:

$$P(c_k \mid d_i) = \frac{P(c_k)\,P(d_i \mid c_k)}{P(d_i)}$$

Document $d_i$ is assigned to the class with the highest posterior probability, which is expressed by:

$$\text{Class of } d_i = \arg\max_{1 \le k \le N} P(c_k \mid d_i) = \arg\max_{1 \le k \le N} \frac{P(c_k)\,P(d_i \mid c_k)}{P(d_i)}$$

where N is the total number of classes.
In summary, text classification with the Naive Bayes algorithm can be described briefly as follows:
For each document D, we compute for each class the probability that D belongs to that class, using Bayes' rule:

$$P(C_i \mid D) = \frac{P(C_i)\,P(D \mid C_i)}{P(D)} \qquad \text{(1)}$$

where D is the document to be classified and $C_i$ is one of the document classes. Under the Naive Bayes assumption, the probability of each word in document D is independent of the context in which the word occurs and of the word's position in the document. The probability $P(D \mid C_i)$ is computed from the frequencies of the individual words $w_j$ in document D:

$$P(D \mid C_i) = \prod_{1 \le j \le l} P(w_j \mid C_i) \qquad \text{(2)}$$

where l is the total number of words w in document D.
Expression (1) can therefore be rewritten as:

$$P(C_i \mid D) = \frac{P(C_i)}{P(D)} \prod_{1 \le j \le l} P(w_j \mid C_i)$$

A threshold on the probability $P(C_i \mid D)$ is set by the person in charge of the classification. This value acts as the boundary between the document classes that may contain document D.
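The text classification procedure above can be sketched as a small multinomial Naive Bayes classifier in Python. This is an illustrative sketch, not the thesis's implementation: the tiny training corpus is hypothetical, whitespace tokenisation and add-one smoothing are assumptions, and P(D) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(text, class_counts, word_counts, vocab):
    """Return the class maximising log P(C) + sum of log P(w_j | C)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps every word probability above zero.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training corpus.
docs = [
    ("sports", "football match goal"),
    ("sports", "team wins match"),
    ("politics", "election vote result"),
    ("politics", "parliament vote law"),
]
model = train(docs)
print(classify("match goal", *model))
print(classify("vote law", *model))
```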
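Example: classifying email with a Naive Bayes classifier
This is a worked example of applying naive Bayesian classification to the document classification problem. Consider the problem of classifying documents by their content, for example into spam and non-spam email. Suppose that documents are drawn from a number of document classes, each of which can be modelled as a set of words, where the probability that the i-th word of a given document occurs in a document from class C can be written as:

$$p(w_i \mid C)$$

We simplify matters further by assuming that the probability of a word in a document is independent of the length of the document, or that all documents have the same length.
Then the probability of a document D, given a class C, is:

$$p(D \mid C) = \prod_i p(w_i \mid C)$$

The question we want to answer is: "what is the probability that a document D belongs to a class C?" In other words, what is $p(C \mid D)$?
Now, by definition:

$$p(D \mid C) = \frac{p(D \cap C)}{p(C)}$$

and

$$p(C \mid D) = \frac{p(D \cap C)}{p(D)}$$

So we have:

$$p(C \mid D) = \frac{p(C)}{p(D)}\, p(D \mid C)$$

Assume for the moment that there are only two classes, S and ¬S (for example spam and non-spam):

$$p(D \mid S) = \prod_i p(w_i \mid S), \qquad p(D \mid \neg S) = \prod_i p(w_i \mid \neg S)$$

Using the Bayesian result above, we can write:

$$p(S \mid D) = \frac{p(S)}{p(D)} \prod_i p(w_i \mid S)$$

$$p(\neg S \mid D) = \frac{p(\neg S)}{p(D)} \prod_i p(w_i \mid \neg S)$$

Therefore:

$$\frac{p(S \mid D)}{p(\neg S \mid D)} = \frac{p(S) \prod_i p(w_i \mid S)}{p(\neg S) \prod_i p(w_i \mid \neg S)}$$

This can be written as:

$$\frac{p(S \mid D)}{p(\neg S \mid D)} = \frac{p(S)}{p(\neg S)} \prod_i \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$

In practice, the probability p(S | D) is easily computed from log(p(S | D) / p(¬S | D)), using the fact that p(S | D) + p(¬S | D) = 1.
Thus:

$$\log \frac{p(S \mid D)}{p(\neg S \mid D)} = \log \frac{p(S)}{p(\neg S)} + \sum_i \log \frac{p(w_i \mid S)}{p(w_i \mid \neg S)}$$

Finally, the document can be classified as follows. It is spam if

$$\log \frac{p(S \mid D)}{p(\neg S \mid D)} > 0$$

and otherwise it is not spam.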
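The two-class derivation above can be sketched in Python. This is only an illustrative sketch: the function names and the per-word probability tables are hypothetical, and every word in the message is assumed to have a non-zero entry in both tables. The posterior is recovered from the log ratio using p(S | D) + p(¬S | D) = 1.

```python
import math


def log_odds_spam(words, p_spam, p_word_given_spam, p_word_given_ham):
    """Return log(p(S|D) / p(not S|D)) for a message given as a word list."""
    # Prior term: log(p(S) / p(not S)).
    log_ratio = math.log(p_spam / (1.0 - p_spam))
    # One likelihood-ratio term per word.
    for w in words:
        log_ratio += math.log(p_word_given_spam[w] / p_word_given_ham[w])
    return log_ratio


def posterior_spam(log_ratio):
    """Recover p(S|D) from the log ratio, using p(S|D) + p(not S|D) = 1."""
    return 1.0 / (1.0 + math.exp(-log_ratio))


# Hypothetical per-word probabilities, chosen only for illustration.
p_w_spam = {"free": 0.8, "meeting": 0.1}
p_w_ham = {"free": 0.2, "meeting": 0.4}

ratio_free = log_odds_spam(["free"], 0.5, p_w_spam, p_w_ham)
ratio_meeting = log_odds_spam(["meeting"], 0.5, p_w_spam, p_w_ham)
print(ratio_free > 0, posterior_spam(ratio_free))
print(ratio_meeting > 0, posterior_spam(ratio_meeting))
```

A positive log ratio corresponds to the "spam" decision in the rule above, and a negative one to "not spam".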













Chapter 4 Solving the spam filtering problem

4.1 Problem statement
Junk mail came to be called "spam" after the television show "Monty Python's Flying Circus". In one sketch from the show, a group of Vikings eat in a restaurant that specializes in canned meat (Spam) and loudly sing a song that repeats the word "spam" over and over. The original meaning of spam is therefore clear: something that is repeated again and again and that annoys the people nearby. That was on a small scale. On the internet, where geographic distance no longer matters, a great many people have to put up with the annoyance, the tedium and the time wasted on it.
Most unsolicited email and most commercial advertising email is regarded as spam by the majority of email users. This is a hard problem that mail systems, mailbox providers and network administrators face today, as the information society develops at a dizzying pace. Filtering and detecting spam calls for long-term solutions such as technical measures, social conventions and possibly legislation. Once such solutions are deployed, however, spammers break them within a short time, mainly because they keep devising new tricks to deceive users or to circumvent the rules set by anti-spam organizations.
So which anti-spam solution is both effective and lasting? One of the best approaches is to let email users block spam themselves, because they understand the problem most clearly. Each user's own perception of spam is used to train that user's own spam filter, and each filter then handles spam according to the habits of its user. The Bayesian statistical model is applied to implement this idea.
From the points above, we can see that building an intelligent spam filter that removes spam accurately is still a challenging task.

4.2 The problem
Email is one of the most reliable means of communication and costs almost nothing to use. It is used all over the world and can easily be accessed from most communication devices, which has made it a target for spammers. The mildest consequence is wasted network bandwidth. More seriously, spam wastes the time of email users and spreads computer viruses. At one point, statistics showed that up to 60% of email was spam and that an email user received at least 6 spam messages per day.
We cannot change our mailbox address every time we are spammed, because this not only fails to limit spam but can sometimes even increase it. We therefore need an anti-spam solution that uses a filter built on a classification algorithm that is effective, technically simple and easy to install. An indispensable requirement is that the algorithm should make spammers understand that their attempts to spam are futile.

4.3 Preprocessing each email
A personal filter is integrated into each user's mailbox address. It is always in a waiting state, ready to process incoming mail. When a message is sent to the user's address, it must be classified as spam or not. If it is spam, it is moved straight to the trash folder. Otherwise it is placed in the inbox to wait for the user to read it. This result is obtained through a strict screening process that combines several stages, such as evaluating the sender's address, checking whether the IP address or DNS name the message was sent from is on the blacklist of an international anti-spam organization, or, more simply, checking whether the message deviates from the format of an ordinary email (for example, a subject line with too many exclamation marks or question marks, written entirely in capitals, or in garish colours).
After this screening step, preprocessing for the Bayes filter begins. For each message we scan the entire text, including the headers and the embedded HTML and JavaScript. At present, letters, digits, dashes, exclamation marks and dollar signs are treated as part of a token, and everything else is treated as a token separator. Tokens that consist only of digits are ignored, and HTML comments are skipped as well and are not evaluated. After this step, each message corresponds to a set of distinct tokens.
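The tokenization rules described above might be sketched as follows. The function name `tokenize` and the exact regular expressions are assumptions made for illustration. In particular, "letters and digits" is interpreted here as any Unicode letter or digit, so that Vietnamese words stay intact, and HTML comments are deleted without acting as separators.

```python
import re

# HTML comments are removed entirely before tokens are extracted.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# A token is a run of letters, digits, dashes, exclamation marks or dollar
# signs; every other character acts as a separator.
TOKEN = re.compile(r"(?:[^\W_]|[-!$])+")


def tokenize(message):
    """Return the set of distinct tokens found in a raw message string."""
    text = HTML_COMMENT.sub("", message)
    tokens = set()
    for tok in TOKEN.findall(text):
        if tok.isdigit():
            # Tokens made up only of digits are ignored.
            continue
        tokens.add(tok)
    return tokens


sample = "FREE promotion!!! Only $100 <!-- hidden --> call 12345 now"
print(sorted(tokenize(sample)))
print(tokenize("vi<!-- x -->agra"))
```

Deleting comments rather than splitting on them means that a word broken up by a comment is reassembled before it is tokenized, as the second call illustrates.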

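4.4 Computing probabilities with Bayes' rule
The probability for each token is computed with Bayes' rule. Suppose we need the probability for a message containing the word "promotion". This word is often found in emails advertising marketing services. The formula according to Bayes' rule is:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}$$

where:
Pr(S|W) is the probability that a message containing the word "promotion" is spam
Pr(S) is the probability that any given message is spam
Pr(W|S) is the probability that the word "promotion" appears in spam messages
Pr(H) is the probability that any given message is not spam
Pr(W|H) is the probability that the word "promotion" appears in non-spam messages
As noted above, recent statistics indicate that 80% of email is spam, so we would have:

$$\Pr(S) = 0.8, \qquad \Pr(H) = 0.2$$

However, for simplicity and in line with common practice, the prior probabilities are usually taken to be equal, both with value 0.5. That is:

$$\Pr(S) = 0.5, \qquad \Pr(H) = 0.5$$

A filter that uses this assumption is said to be "unbiased", meaning that it has no prejudice about incoming mail. This assumption allows the formula above to be simplified to:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)}{\Pr(W \mid S) + \Pr(W \mid H)}$$

The BayesSpam filter applies exactly this formula to compute the probability for each individual word.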
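As a small worked illustration of the per-word formula, the sketch below evaluates Pr(S|W) for given likelihoods. The function name and the numeric likelihoods are hypothetical. With the default prior of 0.5 the general formula reduces to the simplified one, and the likelihoods are assumed not to be both zero.

```python
def word_spam_probability(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Pr(S|W) by Bayes' rule; with p_spam=0.5 this is the simplified form."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    return numerator / denominator


# Hypothetical likelihoods: the word occurs in 30% of spam and 10% of ham.
unbiased = word_spam_probability(0.3, 0.1)
with_prior = word_spam_probability(0.3, 0.1, p_spam=0.8)
print(unbiased, with_prior)
```

With equal priors the result depends only on the two likelihoods, while a prior of 0.8 pushes the same word's probability further toward spam.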
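Once the probability that a message containing a single word is spam has been computed, the individual probabilities have to be combined into one final probability. This probability estimates how likely a message containing all of those words is to be spam. The combining formula is:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

where:
p is the probability that the message under consideration is spam
p1 is the probability p(S|W1), corresponding to the first word (for example "promotion")
p2 is the probability p(S|W2), corresponding to the second word (for example "offer")
....
pN is the probability p(S|WN), corresponding to the N-th word (for example "home")
The result p is usually compared with a threshold to decide whether the message under consideration is spam. If p is greater than the threshold, the message is marked as spam. Otherwise it is not marked as spam.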
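The combining step and the threshold test can be sketched as below. The names `combine` and `is_spam` are illustrative, and the threshold value 0.9 is an assumption, since the text does not fix a value. The sketch requires Python 3.8 or later for `math.prod`.

```python
import math


def combine(probs):
    """Combine per-word spam probabilities p_1..p_N into one probability."""
    spam_product = math.prod(probs)
    ham_product = math.prod(1.0 - p for p in probs)
    return spam_product / (spam_product + ham_product)


def is_spam(probs, threshold=0.9):
    """Mark the message as spam when the combined probability exceeds
    the threshold (0.9 is a hypothetical value)."""
    return combine(probs) > threshold


print(combine([0.9, 0.8]))
print(combine([0.5, 0.5]))
print(is_spam([0.9, 0.8]), is_spam([0.6, 0.4]))
```

Two moderately spam-like words reinforce each other, whereas words at the neutral value 0.5 leave the result unchanged.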


4.5 Training the Bayes filter
Two sets of email are used for training, one of spam and one of non-spam. Each set contains about 4000 messages. We count the number of times each token occurs in each set. The counting ends with two hash tables, one for each set, each mapping tokens to their number of occurrences.
Next we create a third hash table, which maps each token to the probability that an email containing it is spam. It is computed with the following formula:

$$p(w) = \max\!\left(0.01,\ \min\!\left(0.99,\ \frac{\min(1,\ b/N_{bad})}{\min(1,\ g/N_{good}) + \min(1,\ b/N_{bad})}\right)\right), \qquad g + b \ge 5$$

where:
Ngood is the number of non-spam messages.
Nbad is the number of spam messages.
This formula comes from Paul Graham's "A Plan for Spam" [5], where it is written as an expression in the Arc language. Each expression is a pair of parentheses enclosing a list, with the operator in the first position followed by its arguments. Expressions are evaluated from left to right.
For example:
(< (+ g b) 5) is equivalent to (g + b) < 5.
The formula computes the probability for a word, or token, as follows. The token's count is looked up in the table good, the hash table for the non-spam set, and doubled. The doubling biases the result slightly toward non-spam, which lowers the risk of classifying legitimate mail as spam. The same token is then looked up in the table bad, the hash table for the spam set. This gives g, twice the number of occurrences of the token in the non-spam set, and b, the number of occurrences of the token in the spam set. If the sum of g and b is less than 5, the token is discarded. The resulting probability lies between 0.01 and 0.99. In the end, this computation corresponds to the simple Bayes formula:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)}{\Pr(W \mid S) + \Pr(W \mid H)}, \qquad \Pr(W \mid S) \approx \frac{b}{N_{bad}}, \quad \Pr(W \mid H) \approx \frac{g}{N_{good}}$$

The result of training is therefore a hash table, in other words a database extracted from the training mail. This table maps tokens to their probability values and is the basis for computing the probability that an email is spam.
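The training procedure described above can be sketched in Python. This is a rough sketch rather than the BayesSpam code: the function name `train_filter` and the toy messages are hypothetical, each message is assumed to be already tokenized, and both training sets are assumed to be non-empty. Capping each ratio at 1 follows the formula published in "A Plan for Spam" [5] and is noted as such in the code.

```python
from collections import Counter


def train_filter(good_messages, bad_messages, min_count=5):
    """Build the token -> spam probability table from two tokenized corpora.

    good_messages and bad_messages are lists of token lists.
    """
    # First two hash tables: occurrences of each token per corpus.
    good = Counter(tok for msg in good_messages for tok in msg)
    bad = Counter(tok for msg in bad_messages for tok in msg)
    ngood, nbad = len(good_messages), len(bad_messages)

    probs = {}  # third hash table
    for tok in set(good) | set(bad):
        g = 2 * good[tok]  # non-spam occurrences are doubled
        b = bad[tok]
        if g + b < min_count:
            continue  # too little evidence: the token is discarded
        # Ratios are capped at 1, as in the formula from [5].
        spam_ratio = min(1.0, b / nbad)
        ham_ratio = min(1.0, g / ngood)
        p = spam_ratio / (ham_ratio + spam_ratio)
        probs[tok] = max(0.01, min(0.99, p))  # clamp to [0.01, 0.99]
    return probs


good_mail = [["meeting", "report"]] * 3 + [["offer"]]
bad_mail = [["promotion", "offer"]] * 5 + [["rare"]]
table = train_filter(good_mail, bad_mail)
print(sorted(table.items()))
```

In this toy run, "rare" is dropped because it occurs too seldom, tokens seen only in spam are clamped at 0.99, and tokens seen only in non-spam are clamped at 0.01.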

4.6 Filtering incoming mail: is it spam?
When a new message arrives, it passes through several processing and classification stages before it reaches the user's mailbox, because it has to be assessed as spam or not. After the preprocessing steps, the content is checked: the text is scanned into tokens, and usually the fifteen most interesting tokens are used, where "interesting" means that the token's probability lies farthest from the neutral value 0.5. These tokens are used to compute the probability that the message is spam. Some years ago, when computer hardware was more limited, the number of tokens was capped at fifteen to save resources and processing time. Today hardware is sufficient for mail filtering, so the number of tokens no longer has to be limited, which means that the combined probability of all of them is computed. Some tokens will not yet appear in the probability hash table. What probability should such a token be given? Experience shows that the value 0.4 is reasonable. In other words, this is the naive probability.
The combined probability is then computed from the individual probabilities, applying exactly the combining formula presented in Section 4.4:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

The result p is then compared with the threshold to classify the message, as described above. Each time a message arrives, we therefore determine the type of one more message, which is added to the filter's training set. Training is scheduled to run again periodically in order to update, in other words improve, the filter's knowledge and classification ability. As a result, the filter classifies more and more accurately the longer it is used, to the point where users are surprised that its classification is almost the same as their own.
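The selection of the most interesting tokens and the default value for unknown tokens might look like the sketch below. The function names and the toy probability table are hypothetical. The limit of 15 and the default 0.4 are the values quoted in the text, and passing a larger `limit` corresponds to the unrestricted variant.

```python
import math


def most_interesting(tokens, probs, limit=15, unknown=0.4):
    """Return the probabilities of the tokens farthest from the neutral 0.5.

    Tokens missing from the table receive the default value `unknown`.
    """
    values = [probs.get(tok, unknown) for tok in set(tokens)]
    values.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return values[:limit]


def spam_probability(tokens, probs, limit=15, unknown=0.4):
    """Combined spam probability of a message from its selected tokens."""
    selected = most_interesting(tokens, probs, limit, unknown)
    spam = math.prod(selected)
    ham = math.prod(1.0 - p for p in selected)
    return spam / (spam + ham)


table = {"promotion": 0.99, "free": 0.9, "meeting": 0.01}
print(spam_probability(["promotion", "free", "hello"], table))
print(spam_probability(["meeting", "hello"], table))
print(most_interesting(["promotion", "free", "hello"], table, limit=2))
```

The unknown token "hello" contributes 0.4 in both calls, which nudges the result slightly toward non-spam without overriding strongly indicative tokens.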

4.7 The BayesSpam filter
The BayesSpam filter filters email following the procedure described above. It is written in the web programming language PHP as a plugin, which makes it convenient to integrate into an email system. The filter runs independently for each user, that is, every user has their own filter. BayesSpam lets each email user configure the filter or decline to use it, so users have almost full control over its configuration. The features offered to users can be seen on the control panel in the figure below.

Figure 16: The filter control panel available to each email user

Once a message is marked as spam, it is immediately moved to the trash folder and its subject is tagged as spam, [**SPAM/Thư rác**]. In the figure below, spam has been configured to go to a separate folder named test. After a short interval the filter automatically rebuilds its database, using the very messages it has classified to update the probability table as described above. The filter works quite stably and processes messages quickly because the algorithm is compact. Whenever a new event occurs, the filter immediately updates its database to improve its filtering ability. Training runs in parallel with use and depends on each user's view of spam. In other words, over time the filter takes on the mail-reading habits of the user who configures and trains it.
After configuration, the filter can be used right away. The user only needs to report which messages are spam and can also re-mark a message as not spam. Users typically use the "mark as spam" button and rarely need the "not spam" button. At first the filter's database is small and its classification is not yet good, so users have to judge for themselves whether incoming messages are spam. Later, however, most messages whose content resembles spam previously marked by the user are caught very accurately by the filter. Clearly, the length of use and the user's view of spam are decisive for the filter's classification ability. Below is a screenshot of spam from a test run of the filter, taken from the spam folder:

Figure 17: Spam filtered and moved to the Test folder, 943 spam messages.

How can spammers evade the spam filter? The answer to this question shows that trying to spam is futile once the filter is in use. To avoid being detected, spammers would have to write email whose content differs by as much as 80% from mail that an ordinary person would also consider spam, or more precisely, differs in the vocabulary used to write the content. There are two cases. In the first, if the spammer tries to disguise the content and the vocabulary, the message can no longer convey its spam content. An advertising email cannot do without words such as "shopping", "online", "free", "on the occasion of", "purchase", "promotion" and so on. Without these words the spammer cannot write an advertising message, so this approach cannot be used to evade the filter. The second approach is to keep the advertising content but, instead of standard Vietnamese, write in teen-style text, for example replacing the tilde tone mark with ~, the dot tone mark with . and the hook tone mark with ?, as in "Khuye^n mai. mua hang gia re? nha^t". This approach is technically clever (it disrupts the tokens in the database, although this can be remedied), but it can backfire, because many people dislike this style of writing and find it irritating, so many spammers have had to abandon it.
Spammers therefore continue to send spam as usual, but users are no longer bothered much once they have reported a few messages to the filter as spam. After that, thanks to training, the filter becomes smarter and filters out spam with surprising accuracy. Most regular users of the filter rate BayesSpam's filtering as very effective and almost free of errors. It is in fact working quite well in the email system of the College of Technology (http://mail.coltech.vnu.vn).

4.8 Some improvements to the BayesSpam filter
Before discussing improvements, we should note a current limitation of the filter: if a user does not log in for a long time and receives a large number of messages in that period, logging in becomes slow because the user has to wait for the incoming mail to be filtered. To overcome this, filtering should run periodically without waiting for the user to log in. Each message is a file stored in one of the folders (INBOX, SENT, TRASH, ...), so the filter can quietly filter mail even when the user is offline. Because this is then a filter shared by everyone, it has to be built on a common style, that is, a shared view of spam across all users. To achieve this, the filter must be carefully trained on the users' mail data. This thesis presents an application that selects training mail extracted from the mail of all users of the Squirrelmail email system who use the BayesSpam filter. The web application is written in PHP and has the simple interface shown below:
Main operations of the application:
1. Create the training corpus directory Corpus, containing two subdirectories: one for spam (SPAM) and one for non-spam (HAM).
2. Using the filter's database (spamCorpus), retrieve the names of the users who use the filter.
3. For each user, copy all mail files from the trash folder (TRASH) into the SPAM directory. Similarly, copy all files from the inbox folder (INBOX) into the HAM directory.
4. Process the SPAM directory. Select the messages with a high Bayes score (above the given threshold), that is, those more likely to be spam than the other messages of the same kind in the directory. The selection uses the messageID attribute of the ScoreCache table in the database.
5. Process the HAM directory. Select the messages with a low Bayes score (below the given threshold), that is, those more likely to be non-spam than the other messages of the same kind in the directory. The selection uses messageID in the ScoreCache table.
After this process we have a training set selected from all users of the filter. This training set represents a shared view of spam across all users of the filter and can be used to train the filter mentioned above.
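Steps 3 to 5 above can be sketched in memory, without touching the file system or the real database. Everything in this sketch is a simplifying assumption: the function name, the two threshold values, and the dictionaries standing in for the users' mail folders and for the messageID-to-score lookup of the ScoreCache table. The actual application is a PHP web application working on mail files.

```python
def select_training_set(user_mail, scores, spam_threshold=0.9, ham_threshold=0.1):
    """Pick high-scoring TRASH mail as SPAM and low-scoring INBOX mail as HAM.

    user_mail maps a user name to {"TRASH": [...], "INBOX": [...]} lists of
    message IDs; scores maps a message ID to its cached Bayes score.
    """
    corpus = {"SPAM": [], "HAM": []}
    for folders in user_mail.values():
        for message_id in folders.get("TRASH", []):
            score = scores.get(message_id)
            # Messages without a cached score are skipped.
            if score is not None and score > spam_threshold:
                corpus["SPAM"].append(message_id)
        for message_id in folders.get("INBOX", []):
            score = scores.get(message_id)
            if score is not None and score < ham_threshold:
                corpus["HAM"].append(message_id)
    return corpus


mail = {
    "alice": {"TRASH": ["m1", "m2"], "INBOX": ["m3", "m4"]},
    "bob": {"TRASH": ["m5"], "INBOX": ["m6"]},
}
cached_scores = {"m1": 0.99, "m2": 0.6, "m3": 0.02, "m4": 0.5, "m5": 0.95}
print(select_training_set(mail, cached_scores))
```

Only messages that the filter scored confidently end up in the shared corpus, and borderline messages such as "m2" and "m4" are left out.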

Chapter 5 Conclusion

As stated at the outset, mathematical statistics plays a very important role in many fields. Statistics makes it easier to grasp and assess a situation in an intuitive and understandable way. Processing and applying statistical data is highly effective for prediction, and on that basis an accurate automated system can be built. The statistical approach based on Bayesian theory is quite simple yet very effective, which is why it is widely applied in many fields.
Compared with other methods, the Bayesian statistical method reasons from accumulated experience, so when it is applied to an object classification model it is more flexible and better suited to the characteristics of the problem. Its estimation mechanisms are also close to ordinary reasoning, which is why its classification results are relatively similar to ordinary human classification.
The results achieved are as follows.
The thesis focused on studying Bayesian theory and, on that basis, went on to study one of its applications that is directly related to information technology: spam filtering. Studying the principles and operation of the filter led to conclusions about the strengths and weaknesses of the Bayesian statistical approach to spam classification. For the practical application, the thesis used the BayesSpam plugin as the main object of study. For the application of Bayesian theory, the thesis studied how to construct the probability formulas so that processing is fast and accurate.
From the study of the BayesSpam application, the thesis drew a number of observations about the strengths and weaknesses of the filter in operation. The spam classification results are, in general, close to the users' own assessment of their mail.
However, because of limited time and limited specialist knowledge of email systems, the conclusions drawn during the study remain limited. The main advantages and disadvantages of the Bayes spam filter are listed below.
Main advantages:
- An advantage of the Bayes spam filter is that it can be trained by each individual user. This is arguably its greatest advantage, because it captures each user's own view of spam.
- The spam that a user receives is often related to the user's online activities. For example, a user may have been subscribed to an online newsletter that the user regards as spam. Such a newsletter is likely to contain words that are common to all of its issues, such as the name of the newsletter and its originating email address. A Bayesian spam filter will assign it a higher probability based on that user's view.
- Legitimate email differs from person to person. For example, in a company, the company's name and the names of its customers are mentioned frequently. The filter will assign a lower spam probability to emails containing those names.
- Word probabilities are unique to each user and can evolve over time with training, together with corrective training whenever a message is misclassified. As a result, a regularly trained Bayesian spam filter becomes more accurate, often more so than filtering by pre-defined rules.
Main disadvantages:
- One technique that spammers use to reduce the effectiveness of the spam filter exploits the filter's own operating principle. The technique inserts words that are not normally associated with spam, in the form of large amounts of legitimate text (gathered from legitimate news or literary sources). This lowers the combined spam probability of the email and makes it more likely to slip past the Bayes spam filter.
- Another technique used to defeat the Bayes spam filter is to replace text with images, either included directly or linked. The whole content of the message, or part of it, is replaced by an image that presents the same content in a way that attracts the viewer. The spam filter usually cannot analyze such an image, which may contain sensitive words such as "pornography". However, because many email systems disable the display of linked images for security reasons, spammers who send links to remote images may reach fewer targets. In addition, an image is larger than the equivalent text, so spammers need more bandwidth to send messages that directly include images.
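The effect of the first technique on the combined probability can be illustrated numerically. The per-word probabilities below are made-up values chosen only to show the direction of the effect, and `combine` is a hypothetical helper implementing the combining formula from Section 4.4.

```python
import math


def combine(probs):
    """Combined spam probability from per-word spam probabilities."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical probabilities for three strongly spam-like words.
spam_words = [0.95, 0.9, 0.85]
# Four innocuous words taken from legitimate text, each with probability 0.1.
innocuous_padding = [0.1] * 4

original = combine(spam_words)
poisoned = combine(spam_words + innocuous_padding)
print(original, poisoned)
```

In this example a handful of innocuous words is enough to pull a clearly spam-like message below the neutral value 0.5, which is why retraining on reported messages matters.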
Therefore, after studying the theory and the application, the next research directions of this work for improving filtering effectiveness are:
- Derive a shared view of spam among the users of the same email system, by extracting emails with a high spam probability and adding them to a common training set for all users, in order to increase the filter's experience.
- Prevent image spam by flagging a message as spam if its content consists mainly of graphics. The simplest measure is not to display images when users read mail. Users who want to see the images can turn the display on themselves.
- Integrate image analysis to extract the text in images, in order to reduce the misclassification caused by discarding all messages whose content consists mainly of graphics. This requires a powerful system and a smart image analysis algorithm.
- Add Vietnamese neutral words to the filter's set of neutral words, in order to increase speed and save database resources. Examples are Vietnamese neutral words that correspond to English neutral words, such as: thì, là, cái, con, và, hoặc, ...














Appendix A The filter's database
References
[1] Nguyễn Quốc Đại; Lý thuyết Bayes, mạng Bayes (2009)
[2] Nguyễn Thanh Sơn, Lê Khánh Luận; Lý thuyết xác suất và thống kê toán; Nxb Thống kê (2008)
[3] Nguyễn Duy Tiến, Trần Minh Ngọc (Đại học Khoa học Tự nhiên, ĐHQGHN); lectures of the World Statistics Institute (IMS) in Malaysia
[4] Azam, N., Dar, H. A., Marwat, S.; Comparative study on Feature Space Reduction for Spam Detection
[5] Paul Graham; A Plan for Spam (2002). Available at http://paulgraham.com/spam.html
[6] Wikipedia; Bayesian Spam Filtering. Available at http://en.wikipedia.org/wiki/Bayesian_spam_filtering
[7] Wikipedia; Sequential Bayesian Filtering. Available at http://en.wikipedia.org/wiki/Sequential_bayesian_filtering
