
TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
INTRODUCTION
1. Summary of the topic
2. Objectives
3. Work carried out
4. Scope of application
CHAPTER 1: OVERVIEW
1.1. Web filters
1.1.1. Concept
1.1.2. Characteristics of websites with unwholesome content
1.1.3. Why a web filter is needed
1.2. Methods for filtering websites with unwholesome content
1.2.1. Web filtering based on network addresses
1.2.2. Web filtering based on URLs (Uniform Resource Locators)
1.2.3. Web filtering based on DNS
1.2.4. Web filtering based on keywords
1.3. Current web filtering software
CHAPTER 2: THEORETICAL BACKGROUND
2.1. Overview of data mining
2.1.1. Text mining
2.1.1.1. Concept
2.1.1.2. Some types of text mining
2.1.1.3. The text mining process
2.1.2. Web mining
2.1.2.1. Concept
2.1.2.2. Categories of web mining
2.1.2.3. Web page representation methods
2.1.3. Automatic text processing
2.1.3.1. Text feature extraction
2.1.3.2. Representing text as feature vectors
2.2. Filtering web page content with the Naive Bayes algorithm
2.2.1. Introduction
2.2.2. Bayesian learning
2.2.3. Bayes' formula
2.2.4. Steps for filtering content with a Bayesian network
2.3. Word segmentation methods for Vietnamese
2.3.1. State of research
2.3.2. Some word segmentation methods
2.3.2.1. Sentence segmentation based on Maximum Entropy
2.3.2.2. Maximum Matching
2.3.2.3. WFST (Weighted Finite State Transducer)
2.3.2.4. The word segmentation problem and the vnTokenizer tool
2.3.2.5. Word segmentation based on the probability that a word exists, independent of semantics
2.3.3. Comparison of Vietnamese word segmentation methods
2.4. Website content analysis
2.4.1. Website content classification
2.4.2. Characteristics of the Vietnamese language
2.4.3. Methods for processing website content
2.4.4. Sentence analysis
CHAPTER 3: APPLICATION
3.1. Building a filter for unwholesome Vietnamese web content
3.1.1. Proposed idea
3.1.2. Approach
3.1.3. Content collection process
3.1.4. Implementation procedure
3.1.4.1. Process 1
3.1.4.2. Process 2
3.1.4.3. Process 3
3.2. System architecture of the program
3.2.1. Web browser with the usual basic functions
3.2.2. Basic functions of the system
3.3. Program functions
3.3.1. Main interface of the program
3.3.2. Functional diagram of the program
3.3.2.1. System login function
3.3.2.2. Program functions
3.4. Vietnamese word learning function
3.5. Processing functions
3.5.1. Retrieving the website content to be analyzed
3.5.2. Managing the Vietnamese dictionary
3.5.3. Sentence analysis for Vietnamese website content
3.5.4. Analyzing Vietnamese website content
3.6. Word training function for content filtering
3.6.1. Training English words
3.6.2. Training Vietnamese words
3.7. Website content classification
3.7.1. English content
3.7.2. Vietnamese content
3.8. Managing system parameters
3.9. Managing the lists
3.9.1. Black List
3.9.2. White List
3.10. Experimental results and evaluation of the results obtained
CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF FIGURES
Figure 1.1 Browser screen showing access denied
Figure 1.2 Report on searches for the keyword "sex" in Vietnam
Figure 1.3 Report on searches for the keyword "sex" worldwide
Figure 2.1 Diagram of web mining areas
Figure 2.2 Word segmentation process
Figure 3.1 Content collection process
Figure 3.2 General model for filtering unwholesome content
Figure 3.3 Vietnamese sentence segmentation model
Figure 3.4 Vietnamese single-word segmentation model
Figure 3.5 Vietnamese compound-word segmentation model
Figure 3.6 Model for computing compound-word probabilities
Figure 3.7 Dictionary update model
Figure 3.8 Main interface of the program
Figure 3.9 Notification screen denying access to website content
Figure 3.10 Login function of the management program
Figure 3.11 Learning Vietnamese single and compound words
Figure 3.12 Retrieving the website content to be analyzed
Figure 3.13 Vietnamese dictionary
Figure 3.14 Sentence analysis in Vietnamese
Figure 3.15 Analyzing Vietnamese website content
Figure 3.16 Training English words
Figure 3.17 Training Vietnamese words
Figure 3.18 Classifying English website content
Figure 3.19 Classifying Vietnamese website content
Figure 3.20 Managing system parameters
Figure 3.21 Black List
Figure 3.22 White List
LIST OF TABLES
Table 1.1 NetProject evaluation results
Table 1.2 Some URL-based web filtering products
Table 2.1 Basic differences between English and Vietnamese
Table 3.1 Description of program functions
Table 3.2 Results of building the Vietnamese dictionary
Table 3.3 Web classification results
INTRODUCTION
1. Summary of the topic
The Internet is growing rapidly and has become commonplace for people of all ages, especially teenagers and school and university students. Its most practical benefit is the virtually unlimited source of information it offers users, which contributes significantly to improving young people's knowledge. However, the Internet's strength is also its weakness: alongside useful knowledge, users can just as easily find unwholesome content online.
The main purpose of this thesis is therefore to study existing methods and to propose a technique for automatically blocking web pages with unwholesome Vietnamese-language content.
2. Objectives
Study the characteristics and growth of websites with unwholesome content, and analyze existing web filtering systems. On that basis, propose a model that can automatically detect Vietnamese-language web pages with unwholesome content, using techniques for extracting information from websites together with text data mining. In particular, the Naive Bayes algorithm is used to determine a probability threshold above which a website is treated as unwholesome and handled accordingly.
In addition, implement the model as a web browser that can automatically block Vietnamese websites with unwholesome content.
3. Work carried out
The main work of the thesis consists of the following parts:
+ Survey the "black web" filtering systems in common use today, identify the shortcomings of existing web filtering applications, and assess the strengths and weaknesses of the methods used to build web filters.
+ Study the strengths of text classification techniques in order to apply them as effectively as possible to this research.
+ Study word segmentation methods for Vietnamese and select the most suitable one for the content filtering problem.
+ Study the relevant algorithms, in particular the Naive Bayes algorithm.
+ Propose a suitable web filtering method and build the model.
+ Implement a web filter that realizes the research.
4. Scope of application
The thesis "Building a filter to detect websites with unwholesome content" is applied in the form of a browser that helps parents supervise the websites their children visit and restricts access to websites with unwholesome content.
CHAPTER 1: OVERVIEW
1.1. Web filters
1.1.1. Concept
A web filter is software that filters the content displayed in a browser or blocks certain parts of a website that the user tries to access. The filter checks the content or the address of a web page against a set of rules and can replace unwanted content with a substitute page, typically one that displays a message such as "Access Denied".
The system administrator controls which kinds of content pass through the filter. Web filters are commonly used in schools, libraries, public Internet services, and homes to keep young people away from unwholesome content, since people of that age are not yet fully aware of the consequences of what they do.
Figure 1.1 Browser screen showing access denied
1.1.2. Characteristics of websites with unwholesome content
The issue of "black" or "bad" websites currently attracts a great deal of attention. People may judge such sites differently depending on their awareness and point of view. In general, however, a web page is considered bad when it meets one of the following two conditions:
+ Pornographic or depraved content.
+ Politically reactionary content.
Pornographic or depraved content
These websites are, and will remain, a hot topic of public concern. They use pornographic and sexually suggestive material such as sexual images and stories to attract Internet users.
In Asia in general and in Vietnam in particular, these websites corrupt society. They cause teenagers and some adults to neglect study and work, foster unhealthy thinking, and increase social ills that run counter to the fine cultural traditions of the Vietnamese people.
It must nevertheless be acknowledged that some places with more permissive cultures, such as the United States and Europe, allow certain pornographic websites to operate under license and restrict access to adults. These websites operate in a clearly organized way and under the supervision of the local authorities. From a legal standpoint, therefore, they are not necessarily bad.
On what basis, then, do we decide which web pages are bad and which are not? The answer depends on Vietnamese cultural tradition. Because current Vietnamese law does not recognize a sex industry, every website that uses pornographic material is regarded as a bad website.
Politically reactionary content
These websites are run by reactionary individuals or organizations in order to spread reactionary ideology, oppose the Vietnamese state, and call for political freedom, pluralism, and a multi-party system. There are many such sites; most are hosted abroad and use Vietnamese as their main language of propaganda. Their existence leads to internal division and incites people who already hold reactionary views. These websites need to be taken seriously: freedom of speech cannot be used as a pretext for carrying out sabotage.
As of 2010, Vietnam ranked third in Southeast Asia with 24,269,083 Internet users, 2.9% of all Internet users in Asia. With user growth of 12,034.5% over the preceding 10 years, Vietnam remains one of the most promising telecommunications technology markets in Asia and in the region, and attracts the interest of foreign investors (internetworldstats.com).
According to Google Trends (Hot Trends), a tool that tracks keyword search trends by region, city, and language, Vietnam was among the top 3 countries worldwide for online searches about sex in 2011. Within Vietnam, the statistics show that Hanoi had the most searches for the keyword "sex", followed by Vung Tau and Quang Ngai. According to the statistics, most places with a high density of searches for this keyword are tourist cities.
Figure 1.2 Report on searches for the keyword "sex" in Vietnam [12]
Figure 1.3 Report on searches for the keyword "sex" worldwide [12]
1.1.3. Why a web filter is needed
For the reasons given above, building a filter for "black" websites to make Internet access safe is an urgent requirement in Vietnam and a constant headache for administrators.
Many "black web" filtering products are on the market today, but most share the same drawback: they slow down the connection because they perform checks and comparisons continuously. Another drawback is that they have no mechanism for automatically updating themselves from the user's web usage behavior. The following section presents several methods for filtering websites with unwholesome content.
1.2. Methods for filtering websites with unwholesome content
1.2.1. Web filtering based on network addresses
Firewall
A firewall is a technique built into a network to prevent unauthorized access, protect internal information sources, and limit unwanted intrusion into the system. A firewall is typically placed between the internal network (intranet) of a company or organization and the Internet. Its main roles are to secure information, to block unwanted access from outside, and to forbid access from inside (the intranet) to certain addresses on the Internet.
Advantages: most firewall systems use a packet filter. One advantage of this method is its low cost, because a packet filtering mechanism is already included in every router's software.
Limitations: defining packet filtering rules is complex and requires the network administrator to have detailed knowledge of Internet services, packet header formats, and so on.
Black List and White List
White lists and black lists are two common methods used by many software vendors because they are simple, easy to manage, and give acceptable results.
A white list is a list of websites that may be accessed; a black list is a list of forbidden sites. These lists are usually built by hand: someone reviews a website and decides whether it should be forbidden or allowed.
A very large number of new websites appear every day, which makes it difficult to keep the black list and white list up to date. Because the work is manual, adding entries to these lists takes a great deal of time.
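The list lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any product mentioned in this chapter: the list contents, the function name `check_host`, and the "unknown" result for unlisted sites are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical lists; in practice they are built by manual review.
WHITE_LIST = {"www.lhu.edu.vn"}
BLACK_LIST = {"blocked.example"}


def check_host(url: str) -> str:
    """Return 'allow', 'block', or 'unknown' based on the URL's host name."""
    host = (urlsplit(url).hostname or "").lower()
    if host in WHITE_LIST:
        return "allow"
    if host in BLACK_LIST:
        return "block"
    # Newly created sites are in neither list until someone reviews them.
    return "unknown"
```

The "unknown" case is where the maintenance problem shows up: every new website lands there until a person reviews it and adds it to one of the lists.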
Web filtering by IP address
This technique blocks traffic directly on the network using the IP addresses of a website. It can be practical when websites are normally reached through their IP address, or when a site can be reached through its IP address instead of its DNS name. In most cases, however, it is not recommended because of the following 3 weaknesses:
+ Blocking access to an IP address also blocks traffic to every virtually hosted site on the same IP address, whether or not that site's content is related to the forbidden material.
+ Blocking access to an IP address also blocks traffic to every member of a portal hosted at that address. The website is blocked as a whole rather than a single section or set of subpages.
+ Filtered websites change their IP address frequently, as soon as the owner discovers that the site is being filtered. Because this relies on DNS, users can still reach the site.
The table below compares the filtering results of several products, according to the website survey carried out by the NetProject project.
Table 1.1 NetProject evaluation results

Filtering software          Correct blocking rate    Effectiveness rate
BizGuard                    55%                      10%
Cyber Patrol                52%                      2%
CYBERsitter                 46%                      3%
Cyber Snoop                 65%                      23%
Norton Internet Security    45%                      6%
SurfMonkey                  65%                      11%
X-Stop                      65%                      4%
1.2.2. Web filtering based on URLs (Uniform Resource Locators)
Filtering on URL keywords
In this approach, a list of keywords is compiled to recognize web addresses that should be blocked. A URL keyword is a substring of a web address; addresses that contain such a substring usually belong to bad websites.
According to surveys [5] [8], most bad websites use pornographic or sexually suggestive words in their domain names in order to attract the attention of Internet users. For such sites, blocking directly on the URL without examining the page content is considered justified, on the assumption that a site with a bad address does not have good content.
Example
The following are all sex websites:
www.sexviet.com
www.sex700.com
www.sexygirls.com
because they all contain the keyword "sex".
Likewise, the following sex websites
www.freeporns.com
www.asiaporns.com
www.childporn.com
all contain the keyword "porn".
Advantages
Accuracy is fairly high because the decision is based mainly on keywords.
Disadvantages
+ The filter is certain to miss websites that do not use a common URL keyword.
+ A small number of harmless websites contain a URL keyword and are treated as bad.
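The substring test and both of its disadvantages can be illustrated with a short Python sketch. The keyword tuple uses the two keywords from the examples above; the function name `url_is_blocked` and the `.example` host names are hypothetical and only serve the illustration.

```python
# Keywords taken from the examples in the text.
URL_KEYWORDS = ("sex", "porn")


def url_is_blocked(url: str) -> bool:
    """Block a URL if any keyword occurs anywhere in it as a substring."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in URL_KEYWORDS)


# A harmless address that happens to contain "sex" is blocked (false positive),
# while an address that avoids the common keywords slips through (miss).
false_positive = url_is_blocked("www.essex-council.example")
missed = not url_is_blocked("www.adult-videos.example")
```

Matching on whole words instead of substrings would avoid the first problem in some cases, but domain names rarely separate words cleanly, so the trade-off remains.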
URL-based web filtering technique
This technique filters by observing web (HTTP) traffic: it tracks the URL and the Host field inside each HTTP request to determine the destination of the request. The Host field is used by web hosting servers to determine which resource should be returned.
URL-based web filtering [9] is usually classified under the broad topic of "Content Management". URL filtering techniques derive from 2 kinds of filtering, "pass-by" and "pass-through".
Pass-by filtering: traffic is processed on the network without the filter sitting directly on the path between the user and the Internet. The original request is forwarded to the destination web server. If the request is judged inappropriate, the filter prevents the requested pages from reaching the user. This technique keeps the filtering device out of the request routing path, so if the filtering device fails, network traffic continues to flow normally.
Pass-through filtering: a device is placed on the path of all user requests, so all network traffic passes through the pass-through filter, which is the actual filtering device. Such filters are usually found in firewalls, routers, application switches, proxy servers, and cache servers.
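As a rough illustration of how a filter recovers the destination from the request line and the Host field, the following Python sketch parses a raw, unencrypted HTTP/1.1 request. The block list, the function names, and the sample host are assumptions for the example; a real filter would also have to handle malformed requests, absolute-form request targets, and non-standard ports.

```python
# Hypothetical block list of full URLs.
BLOCKED_URLS = {"http://www.exp.com/bad.htm"}


def requested_url(raw_request: bytes) -> str:
    """Rebuild the target URL from the request line and the Host header."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    _method, path, _version = lines[0].split(" ", 2)
    host = ""
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break
    return f"http://{host}{path}"


def should_block(raw_request: bytes) -> bool:
    """Decide per URL, so other pages on the same host stay reachable."""
    return requested_url(raw_request) in BLOCKED_URLS
```

Because the decision uses the host name and path rather than the IP address, two sites sharing one address are treated independently.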
URL filter options
A notable feature of products using this method is that they let users specify URLs by adding URLs to, or removing them from, the list of bad sites (Bad Site List), although the websites originally in the list cannot be removed. Some popular web filtering products are listed below.
Table 1.2 Some URL-based web filtering products

Product               Vendor (company)
Smartfilter           Secure Computing
Web Filter            SurfControl
Web Security          Symantec
bt-WebFilter          Burst Technology
CyBlock Web Filter    Wavecrest Computing
Advantages of URL-based filtering
Virtual websites are not affected: this technique does not affect virtual web servers that share an IP address with restricted websites. A blocked website and an unblocked website can share the same IP address.
Changes of IP address have no effect: in most situations, a restricted website changing its IP address does not affect this method, because the filtering does not depend on the IP address. Site owners can switch to any IP address they like, but users behind the filter still cannot reach the site.
Limitations of URL-based filtering
Non-standard ports usually cannot be blocked:
+ Filtering works well for web servers running on standard ports.
+ Websites on non-standard ports are difficult to block because they require a higher level of capability in the filter.
+ A URL filtering solution may need additional capabilities to handle HTTP connections on non-standard ports.
Encrypted traffic cannot be handled: HTTP requests sent over SSL/TLS are encrypted, so a URL filter cannot read the Host field. The filter therefore cannot determine which resource at a given IP address the request is actually aimed at.
In short, servers need a filter to remove certain undesirable web pages, but the filter can slow the system down.
1.2.3. Web filtering based on DNS
Filtered websites become completely inaccessible from every configuration that uses the filtering name server as its resolver, because the filtering name server returns invalid information whenever it is asked to resolve the host name of a filtered website. The documents on the server hosting that website therefore cannot be reached. Unfiltered websites remain accessible as long as their host names differ from those of the filtered websites. Because the filtering name server does not return invalid information for their names, the correct data is returned to any user who asks to resolve them, and the websites are accessible as usual.
Advantages
Works across multiple protocols: HTTP, FTP, gopher, and any other protocol that relies on the name system.
Not affected by changes of IP address: a website changing its IP address has no effect on this filtering method, which is completely independent of IP addresses.
Disadvantages
Ineffective for URLs that contain an IP address:
+ Most website addresses use a DNS name (www.lhu.edu.vn), but some addresses are given as an IP address instead of a DNS name (http://118.69.126.40).
+ In that case the site is reached through its IP address without using its DNS name.
The entire web server is blocked:
+ The technique does not allow selective blocking of individual pages on a web server. If the forbidden page is www.exp.com/bad.htm, all access to www.exp.com may become impossible even though the rest of the site is not on the block list.
Subdomains are affected:
+ Technically, a domain name such as example.com in the URL http://www.example.com is used to reach a web server. At the same time, the domain name can serve as the parent domain of other hosts such as host1.example.com. In this case, DNS addresses that use www.example.com may be resolved incorrectly. The filter can also cause the resolver to return wrong answers for subdomains, and it affects other network services such as e-mail.
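The behavior of a filtering name server, and the way a URL containing an IP address bypasses it, can be sketched as follows. Everything in this example is hypothetical: the host names, the address records (taken from the documentation range 203.0.113.0/24), the sinkhole address, and the function names. No real DNS queries are made; a dictionary stands in for the name system.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical data standing in for the real name system.
BLOCKED_HOSTNAMES = {"bad.example"}
REAL_RECORDS = {"bad.example": "203.0.113.10", "good.example": "203.0.113.20"}
SINKHOLE = "0.0.0.0"  # the "invalid information" returned for filtered names


def filtered_resolve(hostname: str):
    """Answer like a filtering name server: a wrong address for blocked names."""
    name = hostname.lower()
    if name in BLOCKED_HOSTNAMES:
        return SINKHOLE
    return REAL_RECORDS.get(name)


def address_for(url: str):
    """Return the address a client would connect to for this URL."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return host  # IP literal: no name lookup happens, so the filter is bypassed
    except ValueError:
        return filtered_resolve(host)
```

Note that the decision is made per host name, which is why a single forbidden page causes the whole server to be blocked.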
1.2.4. Web filtering based on keywords
As in the URL keyword approach [10], a list of keywords is used to recognize pages that should be blocked. A forbidden page contains many disallowed keywords, and this is the basis for recognizing it. What matters in this method is the meaning of a keyword in its context, and this is exactly what causes the system to make mistakes when deciding whether a page may be displayed.
A website specializing in cancer, for example, may be blocked because of an article about breast cancer. If the article mentions the word "breast" many times and that word is in the list of blocked keywords, the system will mistakenly block the page.
The next problem is words that are misspelled, deliberately or accidentally. Some pages with bad content alter the wording they use in order to fool the filtering system. A human reader immediately understands the text as a simple misspelling, but for the filtering system such changes have a large effect.
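Both failure modes can be reproduced with a minimal keyword counter. The keyword set, the threshold of 3 hits, and the function names below are assumptions chosen for the illustration, not values taken from any of the products discussed.

```python
import re

# Hypothetical configuration for the illustration.
BLOCKED_KEYWORDS = {"sex", "breast"}
THRESHOLD = 3  # number of keyword hits at which a page is blocked


def keyword_hits(text: str) -> int:
    """Count how many words of the text are in the blocked keyword list."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in BLOCKED_KEYWORDS)


def page_is_blocked(text: str) -> bool:
    return keyword_hits(text) >= THRESHOLD


medical = ("Breast cancer screening: early detection of breast cancer "
           "improves outcomes. Ask your doctor about breast exams.")
obfuscated = "s3x s.e.x s-e-x"
```

With these inputs, the medical article reaches the threshold and is blocked, while the deliberately misspelled text produces no hits at all. This is the motivation for the probabilistic, context-aware classification studied later in the thesis.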
1.3. Current web filtering software
SurfControl Enterprise Threat Protection: a product of SurfControl. It is designed to filter and block web traffic at the proxy using URLs and keywords, and offers about 20 ways of blocking.
Internet Filter Web Filters: developed by iPrism Internet Filters & Web Filters, this software performs monitoring and blocking. It is advertised as using dynamic web filtering to control page content right at the point of entry. According to the vendor's documentation, however, it also relies in part on keyword blocking.
DWK4.1: Depraved Web Killer (DWK), by Vũ Lương Bằng, was a finalist in the 2004 Trí Tuệ Việt Nam competition. At the time of writing the latest version is v4.1 (2011), which offers functions such as:
+ Blocking web pages with bad content (by keyword and URL).
+ Logging the programs that have been run on the computer.
+ Logging the web pages that have been visited.
+ Logging the bad web pages that the software has blocked.
+ Sending the logs to an e-mail address set by the user.
FamilyWall: firewall software that runs resident on the user's computer. Its main function is to block access to websites with bad content on the Internet. Its main layers of control include keywords with bad content, the content of web pages, a list of bad websites already detected, and so on.
Overall, the products above perform keyword blocking and URL blocking well, but most of them have no self-learning mechanism, which would allow their data to grow richer over time.
CHAPTER 2: THEORETICAL BACKGROUND
2.1. Overview of data mining
2.1.1. Text mining
2.1.1.1. Concept
According to Hà Quang Thụy [2], text mining is the process of extracting new, valuable, and actionable knowledge hidden in texts, so that this knowledge can be used to organize information better in support of people.
In essence, text mining combines data mining with natural language processing (NLP).
2.1.1.2. Some types of text mining
Keyword-based association analysis: a document can be viewed as a string of characters and can be characterized by a set of keywords. Documents are analyzed on the basis of these keywords in order to draw a conclusion about them.
Automatic document analysis: like an assistant, the system helps classify documents by reading all incoming documents and sorting them into categories automatically.
Measuring similarity between documents: a document is examined to see whether it belongs to a particular literary genre or to a particular author. Similarity can also be used to decide which field a text belongs to.
Sequence analysis: guessing events and forecasting trends. As noted above, a text is a string of characters that expresses an idea. Many incoming documents describe an issue at various levels of detail. From these, the system can predict how a phenomenon will develop or what will happen next.
Detecting anomalies: an anomaly is an incoming text that differs markedly from texts of the same kind received earlier, which leads to a conclusion that the text is unusual.
2.1.1.3. The text mining process
The text mining process goes through the following steps:
+ Collect text data belonging to the application domain. Two points need attention at this step. First, the data collected belongs to the application domain; it is not the set of all texts that could exist in the real world. For example, in the text data mining problem of Rich Caruana and colleagues, the application domain dictates that the data set consists only of scientific publications, whereas in a text mining problem in medicine and health care, only texts about medicine and health care need to be collected. Second, the collected data set must be representative of the application domain. For example, the set of web pages collected by Google's search engine is regarded as representative of the set of all web pages on the Internet. The model of how web pages are generated and the randomness of data collection are factors that must be considered in a web crawling algorithm. The set of pages collected by Google, although enormous, is not the set of all possible web pages.
+ Represent the text data in a format suited to the text mining problem. The better the representation fits the problem, the higher the quality of the mining results.
+ Select the input data set for the data mining algorithm. In most cases the collected data set for the application domain is very large, often beyond the space and time that data mining algorithms can handle. A subset of the collected data must therefore be chosen for the mining task. The factors that ensure the collected data set is representative are also applied when selecting the input data set for the algorithm.
+ Run the data mining algorithm on the selected data set to find patterns and knowledge. For text classification, the patterns (knowledge) are incorporated into the resulting classifier, which is then used to classify new texts.
+ Exploit the patterns: put the knowledge obtained from text mining to practical use.
2.1.2. Web mining
2.1.2.1. Concept
According to Hà Quang Thụy [2], web mining is the extraction of components that are of interest or judged useful, together with potential information, from resources or activities related to the World Wide Web.
Intuitively, web mining can be thought of as the combination of text mining and web technology, or more specifically:
Web mining = Data mining + Natural language processing + World Wide Web
At present the most typical content of a web page is text, so web text mining is a basic component of web mining. With the continuing advance of Internet technology, however, the demand for mining other multimedia data such as images, speech, music, and video keeps growing in both breadth and depth.
2.1.2.2. Categories of web mining
Web mining is divided into 3 main areas: web content mining, web structure mining, and web usage mining.
Figure 2.1 Diagram of web mining areas [2]
(Diagram: web mining branches into web page content mining, web structure mining, and web usage mining. The lower-level nodes are web page content mining, optimization of returned results, access pattern mining, and personal trend mining.)
As the diagram shows, web mining covers several areas, but the ones this thesis is mainly concerned with are web page content mining and web page structure mining.
Web page content mining: most of the main content of a web page is contained in its text. Web page content mining is concerned with retrieving information from structured text, hypertext, and semi-structured text.
Web page structure mining: thanks to the connections between hypertext documents, the World Wide Web holds more information than the set of page texts alone. For example, the number of links pointing to a web page is regarded as an indicator of that page's importance, while the links going out from a page indicate that the target pages have content related to the topics discussed on the current page. Web structure mining is the process of extracting knowledge from the way web pages are organized and linked to one another.
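The in-link indicator mentioned above can be computed with a simple count. The sketch below is an illustration under stated assumptions: the link graph is given as a dictionary from each page to the pages it links to, the page names are placeholders, and the function name `inlink_counts` is hypothetical. It is a raw count, not a ranking algorithm such as PageRank.

```python
from collections import Counter


def inlink_counts(links: dict) -> Counter:
    """Count, for every page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # repeated links from one page count once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c", "c"]}
counts = inlink_counts(links)
```

In this toy graph, page "c" receives links from three different pages and would be considered the most important by this indicator.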
2.1.2.3. Web page representation methods
Unlike an ordinary text document, a web page also contains external references (links) to other web pages, which imply that the content discussed on the current page is also of interest to the page being pointed to. In many cases the content of the linked page also serves as an explanation of the content under discussion. In other words, a topic of the current page is among the topics of the pages it links to. The relation "shares a topic of interest" is symmetric between two web pages that have a link between them. For this reason, web page representation extends ordinary text representation. The most significant extensions are expanding a page's content with content from its adjacent pages and exploiting knowledge about the page's markup in its representation.
Expanding the text of a web page with the text of adjacent pages
Two web pages are adjacent if at least one link exists between them. Expanding a page's text with the text of its adjacent pages rests on the observation that hyperlinks arise from a relationship between the contents of the linked pages.
There are four options for building the representation of a web page:
1. The current page only.
2. The union of the current page and its neighbors, without distinguishing their content.
3. The union of the current page and its neighbors, keeping their content distinct.
4. A generalization of option 3 to k neighborhood levels.
In detail:
- Option 1 uses only the content of the current page.
- Option 2 mixes the content of the current page with that of its neighbors.
- Option 3 represents the page in two parts: the first uses the content of the current page, the second uses the content of its neighboring pages.
- Option 4 generalizes option 3 by raising the number of levels from 2 to k. Given a level k and a repository of web pages, the representation of a page consists of k components.
Exploiting elements supplied by the page markup language
Elements supplied by the markup language, such as HTML tags, can be used to determine the weights of terms, because each HTML tag in a web page usually carries a specific meaning. Exploiting these tags therefore enriches the document representation.
For example, the <title></title> tag pair is defined to hold the title of the page's content, and the heading tag pairs <h1></h1>, <h2></h2>, ... are defined to display the section headings within the page.
Titles and headings usually state the main, important ideas of a page. Content (keywords) inside these tags can therefore be given a higher weight than other content (keywords at other positions).
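A minimal sketch of tag-based weighting, using Python's standard `html.parser`. The boost factors in `TAG_WEIGHTS` are assumptions chosen for illustration: the text only says that title and heading terms should weigh more, not by how much. Tokenization by whitespace is likewise a simplification.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical boost factors; the text only says title and heading
# terms should weigh more than body terms, not by how much.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 2.0}

class WeightedTermCounter(HTMLParser):
    """Accumulate term weights, boosting terms inside selected tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the most recent matching open tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        boost = max([TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags] or [1.0])
        for term in data.lower().split():
            self.weights[term] += boost

parser = WeightedTermCounter()
parser.feed("<html><head><title>web filter</title></head>"
            "<body><h1>filter</h1><p>a web page</p></body></html>")
weights = parser.weights
```

In this toy page, "filter" collects weight from both the title and the heading, so it ends up above "web", which appears once in the title and once in ordinary body text.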
2.1.3 Automatic text processing
Automatic text processing is a very important stage in fields such as text mining, data analysis, information extraction, text classification, text clustering, text summarization, document indexing for search engines, and measuring the similarity of two or more documents.
2.1.3.1. Text feature extraction
According to Phúc [1], five main kinds of analysis are used to identify the meaning of a text: lexical, semantic, statistical, syntactic, and usage analysis.
Lexical analysis: The main task is to identify the units of meaning in the text. In English, two words are separated by whitespace, punctuation, line breaks, and so on. For languages whose word boundaries are uncertain, the approach considered most general is to slide a window over the text to form sequences of n consecutive characters, called n-grams.
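The sliding window can be sketched in a few lines. The function name `char_ngrams` is an illustrative choice, not something defined in the thesis.

```python
def char_ngrams(text, n):
    """Return all runs of n consecutive characters, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = char_ngrams("web filter", 3)
```

A text shorter than n yields an empty list, and whitespace is treated like any other character, so n-grams may span word boundaries.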
Semantic analysis: The goal of semantic analysis is to relate the surface forms of the text to the meanings they express.
Different words can be used to describe similar concepts, and morphological analysis can help resolve this by reducing variants to a common form. In some languages, English in particular, this can be done by automatically removing suffixes.
This technique is called stemming: it converts the inflected and derived variants of a word to a common root by removing prefixes and suffixes. The purpose of stemming is to reduce a word to its most basic form for use in information retrieval, for example replacing a plural word in a query with its singular form, because singular forms usually occur more often than plural forms, and this affects the ranking of the returned documents.
Statistical analysis: Statistical analysis of term usage frequency has proven useful. The simplest method relies on the total number of occurrences of each term (stem, n-gram, phrase) in the document collection. A common task is to find phrases that stand for different concepts when used in different contexts. Analyzing co-occurring phrases can help resolve the meaning of ambiguous words.
Syntactic analysis: Part-of-speech analysis with the accuracy and speed needed for large document collections is becoming practical. It helps disambiguate words whose meaning is unclear and idioms that share the same syntax, and thereby provides additional information for statistical analysis.
Usage analysis: The way a document is used can give valuable hints about the document itself. Four basic kinds of user habits can be identified and used for usage analysis: checking, maintaining, consulting, and evaluating.
2.1.3.2. Representing text with feature vectors
Introduction to the method
The vector space model is a common way to represent documents. Each document corresponds to a multidimensional vector in Euclidean space, where each dimension corresponds to one term. From the document collection a reasonably complete vocabulary can be built, and its size is the number of dimensions of the space. Importantly, besides the stored documents, incoming queries must also be represented as vectors. In this method the vector space model describes the features of a document: the number of dimensions depends on the feature extraction method described later, and each component of the feature vector is the weight of one term in the document collection.
In this method, each document $D_i$ is represented as $\vec{D}_i = (\vec{d}_i, i)$, where $i$ is the index identifying the document and $\vec{d}_i$ is its feature vector, $\vec{d}_i = (w_{i1}, \ldots, w_{in})$. Here $n$ is the number of features of the document vector and $w_{ij}$ is the weight of the $j$-th feature, $j \in \{1, 2, \ldots, n\}$. The weight $w_{ij}$ is a quantity computed as described below.
Some formulas for computing the components of the feature vector
Each component of the feature vector corresponds to a word or phrase. The quality of word segmentation in the text depends on the segmentation method and technique.
+ Term weighting method
In this method, each component of the feature vector is computed as
$$w_{ik} = tf_{ik} \cdot \log\left(\frac{n}{df_k}\right) \qquad (2.1)$$
where
$n$ is the total number of documents in the database.
$tf_{ik}$ is the number of times term $k$ occurs in document $D_i$.
$df_k$ is the total number of documents containing term $k$.
+ Term count method
In this method, each component of the feature vector is computed as
$$w_{ik} = tf_{ik} \quad \text{or} \quad w_{ik} = \log\left(\frac{n}{df_k}\right) \qquad (2.2)$$
where
$n$ is the total number of documents in the database.
$tf_{ik}$ is the number of times term $k$ occurs in document $D_i$.
$df_k$ is the total number of documents containing term $k$.
+ Binary method
This method is quite simple: the weight $w_{ik} = 1$ if term $k$ occurs in document $D_i$, and $w_{ik} = 0$ otherwise.
Characteristics of document feature vectors
- The feature space usually has a large number of dimensions.
- The features are independent of one another.
- The features are sparse: the feature vector $\vec{d}_i$ can have many components equal to 0 because many features do not occur in document $D_i$ (when the binary values 1 and 0 encode whether a feature occurs in the document being vectorized). A purely binary encoding somewhat limits classification quality, however, because a feature term may be absent from the document while the document contains a different keyword with the same meaning. An alternative is therefore to use real-valued weights instead of binary 0/1 values, which reduces the sparsity of the document vector to some extent.
$$w_{ik} = \begin{cases} 1 & \text{if term } k \text{ occurs in document } D_i \\ 0 & \text{if term } k \text{ does not occur in document } D_i \end{cases}$$
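The three weighting schemes can be compared side by side in a short sketch. It assumes documents are already tokenized, uses the natural logarithm (the text does not state the base), and the function name `build_vectors` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def build_vectors(docs, scheme="tfidf"):
    """Return (vocabulary, vectors) for tokenized documents.

    scheme: "tfidf" for (2.1), "count" for the tf form of (2.2),
    "binary" for the 0/1 weights.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    # df[k]: number of documents that contain term k.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "tfidf":
            vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        elif scheme == "count":
            vec = [tf[t] for t in vocab]
        elif scheme == "binary":
            vec = [1 if tf[t] > 0 else 0 for t in vocab]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vocab, vectors

docs = [["web", "filter", "web"], ["web", "page"]]
vocab, tfidf = build_vectors(docs, "tfidf")
_, counts = build_vectors(docs, "count")
_, binary = build_vectors(docs, "binary")
```

Note that under (2.1) a term occurring in every document, such as "web" here, receives weight 0 regardless of how often it appears, because log(n/df) vanishes.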
2.2. Filtering web page content with the Naive Bayes algorithm
2.2.1. Introduction
The Naive Bayes algorithm [6] is a statistical analysis algorithm that operates on numeric data. The Naive Bayes probabilistic model is the most widely used method in text document classification. Its idea is to use the joint probabilities of words and classes to estimate the probability of each class given a document. Its simplicity comes from the assumption that words are independent of one another.
Naive Bayes for content filtering treats a text document as generated by randomly drawing words from all the words present in a class. The chance that a word is added is proportional to the probability of finding that word in the class under consideration. The Naive Bayes classifier then determines which class the content being examined most likely belongs to. Naive Bayes is simple and fast, and it works well with statistical representations such as bag-of-words. In contrast to rule-based methods, Naive Bayes can be trained incrementally, but it requires an additional preprocessing step to build a word-frequency feature vector of small size: the feature vector can be quite large, so extra steps are needed to reduce its size.
2.2.2. Bayes learning
Assume a prior probability distribution over all events. This assumption provides a quantitative way to weigh the evidence obtained during training. Such methods allow a finer-grained ranking of alternative hypotheses rather than only checking whether each hypothesis is consistent. Bayesian methods thus provide practical learning algorithms. They are also regarded as a standard against which other learning algorithms are evaluated.
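Conditional probability
Suppose we have assigned a distribution function to a sample space and then learn that an event E has occurred. How should we change the probabilities of the remaining events? We call the new probability of an event F the conditional probability of F given E and denote it P(F|E).
Let $\Omega = \{w_1, w_2, w_3, \ldots, w_n\}$ be the original sample space with distribution function $m(w_j)$. Suppose we learn that event E has occurred. We want to assign a new distribution function $m(w_j|E)$ to reflect this fact. Clearly, if a sample point $w_j$ is not in E, we must have $m(w_j|E) = 0$. Moreover, in the absence of information to the contrary, it is reasonable to assume that the probabilities of the points $w_k$ in E keep the same relative magnitudes they had before we learned that E occurred. For this reason we require
$$m(w_k|E) = c\, m(w_k) \qquad (2.3)$$
for all $w_k$ in E, where $c$ is a positive constant. But we must also have
$$\sum_{w_k \in E} m(w_k|E) = c \sum_{w_k \in E} m(w_k) = 1 \qquad (2.4)$$
Thus
$$c = \frac{1}{\sum_{w_k \in E} m(w_k)} = \frac{1}{P(E)} \qquad (2.5)$$
assuming P(E) > 0. We therefore define
$$m(w_k|E) = \frac{m(w_k)}{P(E)} \qquad (2.6)$$
for $w_k$ in E. This new distribution is called the conditional distribution given E. For a general event F this gives
$$P(F|E) = \sum_{w_k \in F \cap E} m(w_k|E) = \sum_{w_k \in F \cap E} \frac{m(w_k)}{P(E)} = \frac{P(F \cap E)}{P(E)} \qquad (2.7)$$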
The conditional probability is the probability of an event F given the occurrence of a related event E, written P(F|E). P(F|E) can also be read as the probability that F occurs when E has occurred. It is computed as
$$P(F|E) = \frac{P(F \cap E)}{P(E)} \qquad (2.8)$$
Two important theorems relate to conditional probability.
For any three events $A_1$, $A_2$, and $A_3$, the following relation always holds:
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\, P(A_2|A_1)\, P(A_3|A_1 \cap A_2) \qquad (2.9)$$
If an event A must occur together with one of the mutually exclusive events $A_1, A_2, \ldots, A_n$, then
$$P(A) = P(A_1)\,P(A|A_1) + P(A_2)\,P(A|A_2) + \cdots + P(A_n)\,P(A|A_n) \qquad (2.10)$$
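Formulas (2.8) and (2.10) can be checked numerically on a small sample space. The sketch below uses a fair six-sided die as an assumed example (it does not come from the thesis) and exact fractions to avoid rounding.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die, uniform distribution m.
omega = range(1, 7)
m = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(event) as the sum of m over the sample points in the event."""
    return sum(m[w] for w in event)

def cond_prob(f, e):
    """P(F|E) = P(F ∩ E) / P(E), as in (2.8); requires P(E) > 0."""
    return prob(f & e) / prob(e)

E = {4, 5, 6}          # roll is greater than 3
F = {2, 4, 6}          # roll is even
p_f_given_e = cond_prob(F, E)

# Total probability (2.10) with the partition A1 = E, A2 = complement of E.
not_E = set(omega) - E
p_f_total = prob(E) * cond_prob(F, E) + prob(not_E) * cond_prob(F, not_E)
```

Knowing that the roll exceeds 3 raises the probability of an even roll from 1/2 to 2/3, and summing over the partition recovers the unconditional value.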
Independent events
In practice it often happens that knowing that some event E has occurred has no effect on the probability that another event F occurs, that is, P(F|E) = P(F). In such a case we would expect P(E|F) = P(E) to hold as well. In fact, each of these equations implies the other. If they hold, we say that F is independent of E. For example, we would not expect knowledge of the outcome of a first trial to change the probability we assign to the outcome of a second trial; that is, we do not expect the second trial to depend on the first. This idea is formalized in the definition of independent events, starting from the definition of conditional probability:
$$P(E|F) = \frac{P(E \cap F)}{P(F)} \qquad (2.11)$$
$$P(E \cap F) = P(F)\, P(E|F) \qquad (2.12)$$
for any two events E and F.
If E and F are independent, the occurrence of F has no effect on the occurrence of E, so
$$P(E|F) = P(E) \qquad (2.13)$$
Substituting (2.13) into (2.12) gives the formula for independent events E and F:
$$P(E \cap F) = P(F)\, P(E) \qquad (2.14)$$
Conversely, if $P(E \cap F) = P(F)\,P(E)$, then E and F are independent.
These statements can be summarized as follows:
If E and F both have positive probability, then P(E|F) = P(E) holds exactly when P(F|E) = P(F), and in that case E and F are independent. More generally, a set of events is mutually independent if for every subset $\{A_i, A_j, \ldots, A_m\}$ of them
$$P(A_i \cap A_j \cap \cdots \cap A_m) = P(A_i)\, P(A_j) \cdots P(A_m) \qquad (2.15)$$
2.2.3. Bayes' formula
Given the outcome of the second stage of a two-stage experiment, we want the probability of an outcome at the first stage. These probabilities are called Bayes probabilities.
Suppose we have a set of pairwise disjoint events $\{H_1, H_2, \ldots, H_m\}$ such that
$$\Omega = H_1 \cup H_2 \cup \cdots \cup H_m \qquad (2.16)$$
We call these events hypotheses. We also have an event E that provides some information about which hypothesis is correct; we call this event the training data. Before receiving the training data we have a set of prior probabilities $P(H_1), P(H_2), \ldots, P(H_m)$ for the hypotheses. If we know the correct hypothesis, we know the probability of the training data; that is, we know $P(E|H_i)$ for every $i$. We want the probabilities of the hypotheses given the training data, that is, the conditional probabilities $P(H_i|E)$. These are called posterior probabilities.
To find them, we write
$$P(H_i|E) = \frac{P(H_i \cap E)}{P(E)} \qquad (2.17)$$
The numerator can be computed from the given information as
$$P(H_i \cap E) = P(H_i)\, P(E|H_i) \qquad (2.18)$$
Since exactly one of the events $H_1, H_2, \ldots, H_m$ occurs, the probability of E can be written as
$$P(E) = P(H_1 \cap E) + P(H_2 \cap E) + \cdots + P(H_m \cap E) \qquad (2.19)$$
Using (2.18), formula (2.19) can be rewritten as
$$P(E) = P(H_1)\,P(E|H_1) + P(H_2)\,P(E|H_2) + \cdots + P(H_m)\,P(E|H_m) \qquad (2.20)$$
From (2.17), (2.18), and (2.20) we obtain Bayes' formula:
$$P(H_i|E) = \frac{P(H_i)\,P(E|H_i)}{\sum_{k=1}^{m} P(H_k)\,P(E|H_k)} \qquad (2.21)$$
Formula (2.21) lets us find the probabilities of the different events $H_1, H_2, \ldots, H_m$ that may have caused the event E to occur.
The essence of Bayes' theorem is that a piece of evidence confirms a hypothesis to the degree that the evidence is more likely to appear if the hypothesis is true than if it is false. In the machine learning setting, Bayes' theorem is written as
$$P(h|D) = \frac{P(D|h)\, P(h)}{P(D)} \qquad (2.22)$$
where
D is the training data set.
h is a hypothesis.
P(h|D) is the posterior probability: the conditional probability of h after the training set has been observed (given D).
P(h) is the prior probability of hypothesis h. This value is usually found from historical data (the training set).
P(D) is the prior probability of the training data D. This value is usually a constant,
$$P(D) = P(D|h)\,P(h) + P(D|\neg h)\,P(\neg h) \qquad (2.23)$$
It can be computed easily from the requirement that the posteriors sum to 1:
$$P(h|D) + P(\neg h|D) = 1 \qquad (2.24)$$
P(D|h) is the conditional probability of D given h, called the likelihood. This value is set to 1 when D and h are consistent and to 0 when they are not.
Bayes' theorem is general and can be applied in any situation that requires computing a conditional probability from known prior probabilities. Its generality is evident from its derivation, which is very simple: it is short and uses only the definition of conditional probability and substitution.
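As a worked instance of (2.21), the sketch below computes posteriors for two hypotheses. The priors and likelihoods are made-up numbers chosen for illustration, and the function name `posteriors` is not from the thesis.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' formula (2.21) to hypotheses H_1..H_m.

    priors[i] is P(H_i); likelihoods[i] is P(E|H_i).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]   # (2.18)
    evidence = sum(joint)                                   # (2.20)
    if evidence == 0:
        raise ValueError("P(E) is zero; the posterior is undefined")
    return [j / evidence for j in joint]

# Made-up numbers: 20% of pages are "bad"; a given keyword appears in
# 60% of bad pages and in 5% of good pages.
post_bad, post_good = posteriors([0.2, 0.8], [0.6, 0.05])
```

With these numbers, observing the keyword raises the probability of the "bad" hypothesis from the prior 0.2 to a posterior of about 0.75, and the two posteriors sum to 1 as (2.24) requires.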
2.2.4. Steps for filtering content with a Bayesian network
- Define the features to use. This step examines the website content to be displayed and looks for words or phrases that signal healthy or unhealthy content; these can be regarded as the filter's database. This is an important part of the task and may be repeated several times.
- Use a feature selection method to analyze the data and choose features; then estimate the conditional probabilities and apply Bayes' rule to estimate the probability that a piece of website content is healthy or not.
- Define a threshold and reject all website content whose probability exceeds it.
- Test the unhealthy-content filtering system and estimate its effectiveness in practice.
Website content filtering differs considerably from ordinary text classification for the following reason: misclassifying legitimate content as illegitimate has more serious consequences than the reverse error. This asymmetry between the classes must be taken into account in the computation.
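The steps above can be sketched as a small two-class Naive Bayes filter. Everything here is an assumption made for illustration: the class name `NaiveBayesFilter`, the add-one (Laplace) smoothing, the training words, and the default threshold of 0.9, which is set high so that good content is less likely to be blocked by mistake. The sketch also assumes both classes have at least one training document.

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Two-class multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"good": Counter(), "bad": Counter()}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def prob_bad(self, tokens):
        """Posterior P(bad | tokens) under the independence assumption."""
        vocab = set(self.word_counts["good"]) | set(self.word_counts["bad"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                # Add-one smoothing avoids log(0) for unseen words.
                score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalize in log space for numerical stability.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        return exp["bad"] / (exp["good"] + exp["bad"])

    def is_blocked(self, tokens, threshold=0.9):
        # A high threshold makes blocking good content less likely.
        return self.prob_bad(tokens) > threshold

nb = NaiveBayesFilter()
nb.train(["casino", "bet", "win"], "bad")
nb.train(["bet", "casino", "bonus"], "bad")
nb.train(["school", "lesson", "book"], "good")
nb.train(["book", "library", "lesson"], "good")
p_bad = nb.prob_bad(["casino", "bet"])
p_good_page = nb.prob_bad(["book", "lesson"])
```

Raising `threshold` trades missed bad pages for fewer wrongly blocked good pages, which is one simple way to reflect the asymmetric cost described above.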
2.3. Word segmentation methods for Vietnamese
2.3.1. State of research
Although Vietnamese, like English, is written with Latin characters, the biggest obstacle is that its structure differs completely from the structure of English described above. Moreover, most common methods match words directly against an existing dictionary, and updating the dictionary is difficult and mostly done by hand.
Following previous studies, the word-based approach aims to segment the complete words in a sentence. It can be divided into three main directions: statistics-based, dictionary-based, and hybrid (combining several methods in the hope of gaining the advantages of each).
Statistics-based approach: relies on information such as the frequency of words in the initial training set. Because it depends specifically on the training data, this approach proves very flexible and useful in many specific domains.
Dictionary-based approach: commonly used in word segmentation. Its idea is that the phrases segmented from the text must match words in the dictionary. Different variants use different kinds of dictionaries. The full word/phrase approach needs a complete dictionary in order to segment all the words or phrases in the text, whereas the component approach uses a component dictionary [Wu & Tseng, 1993]. A complete dictionary contains all the words and phrases used in Chinese, while a component dictionary contains only the components of words and phrases, such as morphemes and simple Chinese words. The sections below present word segmentation methods for Vietnamese.
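2.3.2. Some word segmentation methods
2.3.2.1. Sentence segmentation based on Maximum Entropy
Phuong H.L. and Vinh H.T. [2] model sentence segmentation as a classification problem using Maximum Entropy. For each character string that may be a sentence boundary (".", "?", or "!"), the model estimates the joint probability of the character together with its surrounding context (represented by the random variable c) and a random variable indicating whether it really is a sentence boundary (b ∈ {no, yes}). The model probability is defined as
$$p(b, c) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(b, c)} \qquad (2.25)$$
Here $\pi$ is a normalization constant and the $\alpha_j$ are the unknown parameters of the model, each $\alpha_j$ corresponding to one feature function $f_j$. Let B = {no, yes} be the set of classes and C the set of contexts. The features are binary functions $f_j: B \times C \to \{0, 1\}$ used to encode the necessary information. The probability of observing a sentence boundary in context c is characterized by p(yes, c). The parameters $\alpha_j$ are chosen to maximize the likelihood of the training data using the GIS and IIS algorithms.
To classify a potential sentence-ending character into one of the two classes {yes, no}, where yes means it really is a sentence boundary and no means the opposite, the following classification rule is used:
$$p(\text{yes}|c) = \frac{p(\text{yes}, c)}{p(c)} = \frac{p(\text{yes}, c)}{p(\text{yes}, c) + p(\text{no}, c)} \qquad (2.26)$$
Here c is the context surrounding the potential sentence-ending character, including the character itself. The choices of feature functions $f_j$ for sentence segmentation in Vietnamese are described next.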
Feature selection
The features in Maximum Entropy encode information useful for sentence segmentation. If a feature appears in the feature set, its corresponding weight contributes to the computation of p(b|c).
Potential sentence-ending characters are found by scanning the text for character strings separated by spaces (called tokens) that contain one of the characters ".", "?", or "!". Information about the token and about the context of the tokens immediately to its left and right is used to determine the class probability.
Tokens containing a potential sentence-ending character are called candidates. The part before the potential sentence-ending character is called the prefix, and the part after it is called the suffix. The position of the potential sentence-ending character is also described in the feature set. The contexts considered for a character string are:
1. Whether there is a space before the potential sentence-ending character.
2. Whether there is a space after the potential sentence-ending character.
3. The potential sentence-ending character itself.
4. The prefix.
5. The length of the prefix, if it is greater than 0.
6. The first character of the prefix is a letter.
7. The prefix is in the list of abbreviations.
8. The suffix.
9. The token preceding the current token.
10. Whether the first character of the preceding token is uppercase.
11. The preceding token is in the list of abbreviations.
12. The following token.
13. Whether the candidate token is capitalized.
From these contexts, a set of contexts (the set C) can be extracted from the data. The contexts together with the labels from the data produce a corresponding feature set. Consider the following example to clarify the relationship between contexts and features:
"Computer hackers will have the chance to win prizes worth 10.000 USD and 10.000 Singapore dollars (5.882 USD) in an international competition named Hackers Zone, held on 13/5/1999 in Singapore."
Consider the potential sentence-ending character "." in the token "10.000 USD". From this position the following contexts can be extracted:
1. No space before the candidate character.
2. No space after the candidate character.
3. The candidate character is ".".
4. Prefix: 10
From this training data, features such as the following can be extracted:
f{no space before the candidate, no} = 1. This feature states that the proposition "the token has no space before the candidate and its label is no" is true (the feature takes the value 1).
After the feature weights have been estimated, the parameters are used to compute p(yes|c). If this value is greater than 50%, the candidate character is labeled yes, meaning it really is a sentence boundary.
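A minimal sketch of candidate detection and of the decision rule (2.26). It extracts only a handful of the thirteen contexts, and the dictionary keys, function names, and sample sentence are hypothetical. Estimating the parameters with GIS or IIS is outside the scope of this sketch, so `p_yes` simply takes the two joint probabilities as inputs.

```python
import re

def candidate_contexts(text):
    """Yield context dicts for each ".", "?" or "!" inside a token."""
    tokens = text.split()
    for idx, token in enumerate(tokens):
        for match in re.finditer(r"[.?!]", token):
            pos = match.start()
            yield {
                "token": token,
                "char": match.group(),
                "prefix": token[:pos],
                "suffix": token[pos + 1:],
                # Inside a token, a space precedes the character only if
                # the character is the first one of the token.
                "space_before": pos == 0,
                "space_after": pos == len(token) - 1 and idx < len(tokens) - 1,
                "prev_token": tokens[idx - 1] if idx > 0 else None,
                "next_token": tokens[idx + 1] if idx + 1 < len(tokens) else None,
            }

def p_yes(joint_yes, joint_no):
    """Classification rule (2.26) from the two joint probabilities."""
    return joint_yes / (joint_yes + joint_no)

contexts = list(candidate_contexts("trị giá 10.000 USD tại Singapore."))
first = contexts[0]
```

For the candidate inside "10.000", the extracted prefix is "10" and there is no space on either side, matching the contexts listed in the example above.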
2.3.2.2. Maximum Matching
Description: The Maximum Matching method [4], also called Left Right Maximum Matching (LRMM), scans a phrase or sentence from left to right and picks the word with the most syllables that is present in the dictionary, then continues with the next word until the end of the sentence.
Simple form: used to resolve single-word ambiguity. Suppose we have a string of characters (corresponding to a string of syllables in Vietnamese) $C_1, C_2, \ldots, C_n$. Starting from the beginning of the string, first check whether $C_1$ is a word, then check whether $C_1C_2$ is a word, and continue until the longest word is found. The most plausible word is the longest one. Select it, then repeat the search on the remaining syllables until the whole string has been segmented.
Complex form: The rule here is that the most plausible segmentation is the three-word chunk with the maximum length. The algorithm starts like the simple form. If ambiguous segmentations are found (for example, $C_1$ is a word and $C_1C_2$ is also a word), it looks ahead to find all possible three-word chunks beginning with $C_1$ or $C_1C_2$. For example, suppose the chunks are:
$C_1$ | $C_2$ | $C_3C_4$
$C_1C_2$ | $C_3C_4$ | $C_5$
$C_1C_2$ | $C_3C_4$ | $C_5C_6$
The longest chunk is the third one, so its first word ($C_1C_2$) is selected. These steps are repeated until the word sequence is complete.
The advantages of this method are clear: it is simple, easy to understand, and fast. Moreover, it needs only a sufficiently complete dictionary to segment text, with no training at all, unlike the methods presented next.
The drawback is that it fails to solve the two most important problems of Vietnamese word segmentation: it runs into many ambiguities, and it has no strategy at all for unknown words.
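The simple form can be sketched as a greedy longest-match loop. The toy dictionary, the `max_len` cap on word length, and the fallback of emitting unknown syllables as one-syllable words are assumptions for illustration; the complex three-word-chunk form is not implemented here.

```python
def maximum_matching(syllables, dictionary, max_len=4):
    """Greedy left-to-right longest-match segmentation.

    syllables: list of syllables; dictionary: set of words whose
    syllables are joined by single spaces. Unknown syllables are
    emitted as one-syllable words.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest span first and shrink until a word matches.
        for length in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + length])
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary, for illustration only.
dictionary = {"tốc độ", "truyền", "truyền thông", "thông tin", "tin",
              "sẽ", "tăng", "cao"}
sentence = "tốc độ truyền thông tin sẽ tăng cao".split()
segmented = maximum_matching(sentence, dictionary)
```

On this sentence the greedy scan commits to "truyền thông" and is then left with "tin", which illustrates the ambiguity problem noted above: the alternative reading "truyền | thông tin" is never considered.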
2.3.2.3. The WFST (Weighted Finite State Transducer) method
The WFST method [7] is also known as the weighted finite-state transducer method. Its main idea, as applied to Vietnamese segmentation, is that words are assigned weights based on their probability of occurrence in the data dictionary. The sentences are then traversed, and the traversal with the best weight (highest probability, that is, lowest cost) is chosen as the segmentation. WFST operates in three steps:
Building the weighted dictionary: the weighted dictionary D is built as a weighted finite-state transducer. Suppose:
+ H is the set of syllables in Vietnamese.
+ P is the set of parts of speech in Vietnamese.
+ Each arc of D is either:
++ from an element of H to an element of H; or
++ from ε (the empty string) to an element of P.
Each word in D is represented by a sequence of arcs that starts with an arc corresponding to an element of H and ends with a weighted arc corresponding to an element of ε × P. The weight represents an estimated cost given by
$$C = -\log\left(\frac{f}{N}\right) \qquad (2.27)$$
where
f is the frequency of the word
N is the size of the sample set
Building the segmentation candidates: This step enumerates all possible segmentations of a sentence. If the sentence has n syllables, there are up to $2^{n-1}$ different segmentations. To limit this combinatorial explosion, the algorithm immediately discards any branch containing a word that does not appear in the dictionary.
Choosing the optimal segmentation: After listing all candidate segmentations, the algorithm picks the best one, which is the segmentation with the smallest weight.
Example: "Tốc độ truyền thông tin sẽ tăng cao" (the speed of information transmission will rise sharply)
Weighted dictionary:
tốc độ 8.68
truyền 12.31
truyền thông 12.31
thông tin 7.24
tin 7.33
sẽ 6.09
tăng 7.43
cao 6.95
The weight of each segmentation is the sum of the weights of its words according to the weighted dictionary:
Tốc độ | truyền thông | tin | sẽ | tăng | cao (total 48.79)
Tốc độ | truyền | thông tin | sẽ | tăng | cao (total 48.70, the smaller weight, so this segmentation is chosen)
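The selection step can be sketched as a lowest-cost search over segmentations. This is a plain dynamic program over syllable positions, not an actual transducer implementation; the function name `best_segmentation` and the `max_len` limit are assumptions, while the costs are the ones from the weighted dictionary in the example.

```python
def best_segmentation(syllables, costs, max_len=3):
    """Return (total_cost, words) minimizing the summed word costs.

    costs maps dictionary words to their cost C = -log(f/N); spans
    that are not dictionary words are discarded, as in the text.
    """
    inf = float("inf")
    n = len(syllables)
    best = [inf] * (n + 1)   # best[i]: lowest cost of syllables[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = " ".join(syllables[start:end])
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if best[n] == inf:
        return inf, []       # no segmentation uses only dictionary words
    words, end = [], n
    while end > 0:
        words.append(" ".join(syllables[back[end]:end]))
        end = back[end]
    return best[n], words[::-1]

costs = {"tốc độ": 8.68, "truyền": 12.31, "truyền thông": 12.31,
         "thông tin": 7.24, "tin": 7.33, "sẽ": 6.09, "tăng": 7.43,
         "cao": 6.95}
total, words = best_segmentation(
    "tốc độ truyền thông tin sẽ tăng cao".split(), costs)
```

Because partial results are reused, the search avoids enumerating all 2^(n-1) segmentations while still returning the lowest-cost one.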
2.3.2.4. The word segmentation problem and the vnTokenizer tool
Idea: Given any Vietnamese sentence, segment it into lexical units (words), or point out the syllables that are not in the dictionary (detecting new lexical units).
About vnTokenizer: a Vietnamese word segmentation tool developed by Nguyễn Thị Minh Huyền, Vũ Xuân Lương, and Lê Hồng Phương, based on the Maximum Matching method and using a table of Vietnamese syllables and a Vietnamese vocabulary dictionary as its data.
The tool is written in Java and is open source. It can be modified, upgraded, and integrated into other Vietnamese text analysis systems.
Word segmentation workflow using maximum matching
Figure 2.2 Word segmentation workflow
[Figure: the text and the dictionary feed the word segmentation step, which outputs a sequence of word units.]
- The input of vnTokenizer is a sentence or a text stored as a file.
- The output is a sequence of segmented word units.
- Word units include dictionary words as well as number strings, foreign character strings, bound morphemes (including affixes), punctuation marks, and other miscellaneous character strings in the text (ISO, 2008). Word units thus include not only words in the dictionary but also new words, words generated freely by some rule (such as affixation or reduplication), and symbol strings not listed in the dictionary.
The tool ships with the following data files: the Vietnamese vocabulary dictionary and a list of supplementary new word units, represented as a minimal finite-state automaton; a file of regular expressions for filtering special word units (number strings, dates, ...); and files of unigram and bigram statistics computed on a sample segmented corpus.
For word units found in the dictionary, the tool resolves ambiguity during segmentation by combining the unigram and bigram statistics. Common ambiguities in Vietnamese include:
- The string AB can be read either as one word unit or as the sequence of two word units A-B.
- The string ABC can be split into two units as AB-C or A-BC.
Evaluation: The tool's results are reported to be stable across many text types and writing styles. The average accuracy achieved is about 94%.
2.3.2.5. Word segmentation based on the probability of word existence, independent of semantics
Idea: Tiep [3] proposes a word segmentation method entirely different from those above. Starting from the problem of filtering Vietnamese spam email, Tiep approaches segmentation through the existence of words. Put simply: for a spam or ham (ordinary) email written in English, segmenting words is fairly easy and produces distinct results. For Vietnamese spam, however, the solution used for English spam is unsuitable because of the differences between English and Vietnamese noted above, in particular the semantic ambiguity of Vietnamese.
In the published work, the author tried the other segmentation methods described above on this problem, but they were not very effective because no Vietnamese dictionary matches the content of Vietnamese spam. The key observation, however, is that an email belonging to the spam class or the ham class contains Vietnamese words specific to that class. For example, "quảng cáo" (advertisement), "khuyến mãi" (promotion), "rao vặt" (classified ad), and "mua bán" (buy and sell) are words frequently found in Vietnamese advertising email. The approach taken by this algorithm is presented below.
Vietnamese sentence splitter: Consider a text u consisting of n syllables, $t = s_1 s_2 \ldots s_n$. The main goal is to split u into m simple sentences, $t = z_1 z_2 \ldots z_m$, where $z_k = s_i \ldots s_j$ ($1 \le k \le m$, $1 \le i, j \le n$) may contain single words or compound words. Each resulting sentence is then split into single words.
Single-word splitter: each normalized simple sentence $S_j$ ($1 \le j \le n$) contains k single words, and consecutive single words $W_m$ and $W_{m+1}$ ($1 \le m < k$) are separated by a space. Thanks to this property it is easy to build a database of normalized single words and their frequencies in each piece of content of the training set. When single-word splitting finishes, the result is a set of single words, each with an identifier (id) and two frequencies: its total frequency over the training set and its frequency in each piece of content in the training set.
Compound-word splitter: A Vietnamese sentence S consists of the words $W_1, W_2, W_3, \ldots, W_n$, where each $W_i$ ($1 \le i \le n$) is a Vietnamese single word. Because the analysis focuses only on two-syllable compounds, each compound word CW is formed from two adjacent single words $W_i, W_{i+1}$ ($1 \le i < n$) separated by one space.
+ Because the meaning of words is not considered, forming compounds this way produces meaningless words. To address this, the author uses a threshold to judge the accuracy of each compound found. Each compound word has its own threshold; when the threshold value changes, the accuracy of the compound changes accordingly.
Building the compound-word dictionary: Suppose we have a training set TD (Training Document). Each email $D_i \in TD$ has a set of simple sentences $S_n$. Each simple sentence $S_i \in S_n$ ($1 \le i \le n$) consists of the single words $W_1, W_2, W_3, \ldots, W_m$. Applying the compound-splitting mechanism above, each compound word CW is a pair of single words $\{W_j, W_{j+1}\}$ ($1 \le j < m$), where $W_j$ and $W_{j+1}$ are two consecutive single words separated by a space. Each compound CW found is added to the compound set if it is not already there; if it is already there, its frequency is incremented.
The result of this preprocessing is a compound set containing both useful words and meaningless two-syllable strings. Each entry has a frequency k, the total number of occurrences of the entry over the whole training set; each occurrence increases k by 1.
Computing the threshold of each word CW in the compound set:
$$\text{threshold}(CW) = \frac{k}{\text{TotalMessage}} \qquad (2.28)$$
where k is the frequency of the compound word CW in the training set. If the threshold lies in the range [0.2 - 0.3], the accuracy of the word is acceptable. Words whose threshold lies outside this range are placed in the set of words that need further training.
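The compound-building procedure and formula (2.28) can be sketched as follows. The sketch assumes that TotalMessage is the number of messages in the training set, which the text does not state explicitly, and the function names and training messages are hypothetical.

```python
from collections import Counter

def compound_thresholds(messages):
    """Map each adjacent word pair to k / TotalMessage, as in (2.28).

    messages: list of messages, each a list of sentences, each a list
    of single words. k counts every occurrence of the pair.
    """
    freq = Counter()
    for sentences in messages:
        for words in sentences:
            # Pair every word with its right-hand neighbour.
            for left, right in zip(words, words[1:]):
                freq[f"{left} {right}"] += 1
    total = len(messages)   # assumed meaning of TotalMessage
    return {cw: k / total for cw, k in freq.items()}

def accepted(thresholds, low=0.2, high=0.3):
    """Compounds whose threshold falls inside the accepted range."""
    return {cw for cw, t in thresholds.items() if low <= t <= high}

# Hypothetical training messages.
messages = [
    [["khuyến", "mãi"], ["mua", "bán"]],
    [["mua", "bán"]],
    [["mua", "bán"]],
    [["mua", "bán"]],
]
thresholds = compound_thresholds(messages)
good = accepted(thresholds)
```

Under the stated rule, a pair that occurs in every message, such as "mua bán" here, falls outside [0.2, 0.3] and would be sent for further training rather than accepted.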
2.3.3. Comparison of Vietnamese word segmentation methods
In general, word-based methods achieve fairly high accuracy (above 95%) thanks to large, accurately annotated data sets; however, the performance of the algorithm depends entirely on the training data. For methods that require a dictionary or a training set, besides segmenting words accurately, the annotations in the data set can also serve other purposes that need part-of-speech information, such as machine translation, spell checking, and thesauri. Therefore, although training takes a long time, implementation is complex, and building a large data set is very costly, the benefit of the word-based approach for machine translation is considerable.
The character-based approach is easy to implement and relatively fast to run, but it is less accurate than the word-based approach. It suits purposes that do not require absolute accuracy or part-of-speech information, such as text classification, spam filtering, and firewalls. Overall, the word-based approach has many notable advantages for guiding research.
Based on this overall comparison and on the main goal of this thesis, which is to classify Vietnamese web content, the thesis adopts the syllable-based approach. Because text classification does not require segmentation to be that accurate, the thesis does not focus on semantics or on complex features of Vietnamese such as synonyms and reduplicated words. It only determines the frequency of Vietnamese single words and compound words in the content to be filtered, so its approach differs from methods that determine the meaning of Vietnamese words. The main characteristics of the approach are presented below.
Segmentation based on the probability of word existence, independent of semantics, does not resolve semantic ambiguity, but it has advantages for text classification. The dictionary is easy to update with enough words suited to the target class without being influenced by other classes, since Vietnamese covers many domains, each with its own words and syllables that differ in pronunciation and meaning; and the processing time is acceptable. The preceding part presented word segmentation methods for Vietnamese and compared their strengths and weaknesses. The next part applies word segmentation to build a filter for unhealthy Vietnamese content.
2.4. Website content analysis
2.4.1. Classifying website content
When website content is requested for display, the content is in one of two languages: English or Vietnamese. Apart from both being written in the Latin script, the two languages have quite distinct characteristics, as shown in the table below:
Table 2.1 Basic differences between English and Vietnamese
Characteristics of Vietnamese | Characteristics of English
Classified as an isolating language, also called non-morphological, non-inflecting, or monosyllabic | An inflectional (flexion) language, also called fusional
Words do not change form; grammatical meaning lies outside the word. Example: "Chị ngã em nâng" and "Em ngã chị nâng" | Words change form; grammatical meaning lies inside the word. Example: "I see him" and "He sees me"
Main grammatical devices: word order and function words. Example: "Gạo xay" and "Xay gạo" | Main grammatical device: affixes. Example: "studying" and "studied"
Word boundaries are not determined by whitespace | Morphemes combine tightly and are hard to delimit; words are recognized by whitespace or punctuation
Has a special word class, the classifier, which accompanies nouns, e.g. "cái bàn", "cuốn sách", "bức thư" | Forming words by attaching affixes to a root is very common. Example: "anticomputerizational"
Has reduplication and spoonerisms ("nói lái"). Example: "lấp lánh", "lung linh" |
The comparison table above shows the basic characteristics of Vietnamese as well as the difficulties encountered when segmenting Vietnamese words.
2.4.2. Characteristics of the Vietnamese language
The unit of word formation is the "tiếng" (syllable), that is, the syllables used in actual Vietnamese. A syllable may have a clear meaning, a faded meaning, or no meaning of its own, and these three states can shift into one another.
Syllabicity is one of the features that govern the typological character of Vietnamese. By number of syllables:
+ A word containing one syllable is called a single word, e.g. nhà, ...
+ A word of several syllables, mostly two, is called a complex word, e.g. nhà cửa, sạch sẽ, ...
By number of morphemes (the smallest elements taking part in word formation), words divide as follows:
+ A word containing one morpheme is called a monomorphemic word, e.g. nhà, ...
+ A monomorphemic word of several syllables whose meaning arises from sound harmony is called a reduplicative word. Otherwise it is an accidental combination.
+ A word containing several morphemes is called a polymorphemic word, e.g. nhà cửa, xe đạp, sạch sẽ, ...
+ A polymorphemic word whose meaning arises from sound harmony is of the reduplicative type. Otherwise it is a compound word.
Preprocessing text (word, paragraph, and sentence segmentation) becomes more complex with the handling of function words, affixes, reduplicated words, ...
The main grammatical device is word order, so methods based on the probability of word occurrence may not be as accurate as expected.
Word boundaries are not determined by whitespace, which makes morphological analysis (word segmentation) of Vietnamese difficult. Identifying word boundaries is important as the basis for later processing such as spell checking, part-of-speech tagging, and word frequency statistics.
Because English and Vietnamese differ in many ways, English algorithms cannot be applied to Vietnamese unchanged.
For these reasons, the next part proposes methods for processing Vietnamese and English content.
2.4.3. Methods for processing website content
As presented above, the website content considered here is in Vietnamese or English. The following methods for processing it are proposed.
The first approach is to separate the content into English and Vietnamese, then classify English content and Vietnamese content separately. A single piece of content may of course contain both Vietnamese and English, but such cases are uncommon.
The second approach is to build one classifier for both English and Vietnamese. It is simpler but may run into problems when choosing the parameter k for extracting k-grams.
With the first approach, one problem must be solved: distinguishing English content from Vietnamese content. Although more sophisticated solutions have been proposed for this problem, a very simple one is proposed here. During feature selection, features are marked as Vietnamese or English and stored in hash tables. When new content arrives, its first 20 features are hashed into the Vietnamese table and the English table. If the number of hits in the Vietnamese table exceeds that in the English table, the content is treated as Vietnamese, and vice versa. For content that uses both Vietnamese and English, however, concluding that it belongs to exactly one of the two languages may affect the subsequent classification.
Once the language of the content has been determined, each language is filtered separately. The overall classification performance is then taken as the arithmetic mean of the performance on Vietnamese content and on English content. To increase accuracy during content analysis, the content can be split into simple sentences, which lays the groundwork for Vietnamese word segmentation with the highest accuracy.
2.4.4. Sentence segmentation
Treating a sentence as a string of characters ending in a period (.), a question mark (?) or an exclamation mark (!) cannot eliminate all ambiguity, because punctuation marks are not only sentence terminators: some are used in abbreviations or in numeric strings. Even so, this basic heuristic performs reasonably well: in general, about 90% of periods are sentence terminators. Other symbols may also act as sentence terminators. For example, punctuation such as the colon, the semicolon and the dash (: ; and -) can be followed by a complete sentence.
The basic purpose of lexical analysis is to separate and identify the features of a text, starting by breaking a message into smaller parts, usually simple words. Sentence segmentation is therefore very important as support for the later word segmentation. Whitespace should be used as the word delimiter, since it separates words in most languages. The following delimiters are widely used for splitting sentences:
+ Period (.)
+ Comma (,)
+ Semicolon (;)
+ Double quotation mark (")
+ Colon (:)
+ Square brackets [ ]
+ Braces { }
+ Parentheses ( )
+ Operators + - / * = <>
At present, sentence segmentation is usually based on the following criteria:
- Place the sentence boundary after a closing double quotation mark (if any).
- Reject a candidate sentence boundary (a period) in the following cases:
o If it follows an abbreviation that does not normally occur at the end of a sentence but usually precedes a proper noun, for example Prof or vs.
o If it follows a known abbreviation and does not precede a capitalized word. This rule handles most abbreviations such as etc. or Jr. correctly (words that may occur in the middle or at the end of a sentence).
- Reject a candidate sentence boundary at ? or ! if it precedes a word that is not capitalized.
- Treat all remaining candidate sentence boundaries as real sentence boundaries.
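The criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_sentences` is hypothetical, the two abbreviation sets hold just the examples named in the text, and only `.`, `?` and `!` followed by whitespace are treated as candidate boundaries.

```python
import re

# Hypothetical abbreviation lists; the text names only Prof, vs, etc. and Jr.
NEVER_FINAL = {"Prof", "vs"}   # usually precede a proper noun
KNOWN_ABBREV = {"etc", "Jr"}   # may occur mid-sentence or at the end

# A candidate boundary: . ? or !, an optional closing quote, then space or end.
CANDIDATE = re.compile(r'([.?!])("?)(\s+|$)')


def split_sentences(text):
    sentences, start = [], 0
    for m in CANDIDATE.finditer(text):
        mark = m.group(1)
        before = text[start:m.start()].split()
        prev_word = before[-1] if before else ""
        rest = text[m.end():]
        next_is_upper = rest[:1].isupper()
        if mark == ".":
            if prev_word in NEVER_FINAL:
                continue
            if prev_word in KNOWN_ABBREV and not next_is_upper:
                continue
        elif rest and not next_is_upper:
            continue  # "?" or "!" followed by a non-capitalized word
        end = m.end(2)  # boundary goes after the closing quote, if present
        sentences.append(text[start:end].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


demo = 'Prof. Smith met Lee Jr. yesterday. He said "Stop!" and left. Really? Yes.'
print(split_sentences(demo))
```

Requiring whitespace after the mark means that a period inside a number such as 3.14 is never considered a candidate, which covers the numeric-string ambiguity mentioned earlier.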
CHAPTER 3: APPLICATION
3.1. Building a filter for unhealthy Vietnamese web content
3.1.1. Proposed idea
The proposed idea is to build a classifier that labels a new sample after being trained on existing samples. Here each sample is the content of one web page, and the set of classes that a piece of content can belong to is y = {tốt (good), xấu (bad)}.
When the browser receives content to display, it relies on certain characteristics or attributes of that content to improve the chance of classifying it correctly. Such characteristics include the title, the body text, and whether the page contains many images. The more information of this kind is available, the higher the probability of a correct classification, although this also depends on the size of the training set.
The probability is computed with the Naive Bayes formula and then compared with a threshold value t that separates good content from bad content. If the probability is greater than t the content is bad; otherwise it is good. Classification can fail in two ways: a good item is taken for a bad one, or a bad item is allowed to be displayed. The first type of error is more serious, so each good item is weighted as if it were a number of bad items. A good item misclassified as bad is then counted as that many errors, and a correctly classified good item as that many successes. The classification threshold t depends on this weighting factor.
3.1.2. Approach
Following the idea above, the problem to solve is to assign the content of a web page to one of two classes, good or bad, within an acceptable amount of time. Given the word segmentation methods described earlier and their strengths and weaknesses, the thesis adopts word segmentation based on word frequency rather than word semantics, combined with the Naive Bayes algorithm, for the following reasons:
Bad content is limited in scope (content that violates Vietnamese customs and traditions). The vocabulary involved is therefore confined to a specific domain.
Because the scope is limited, a dictionary for this domain is easy to create and update. Processing must also be fast, and a semantic approach would take far too much time.
According to published work, Naive Bayes is highly effective for text classification problems such as English or Vietnamese spam filtering.
The thesis blocks keywords in the following ways:
- Based on the website address
- Based on the content of the website title
- Based on the main content of the website
3.1.3. Content collection process
In this process, crawlers collect the data. A crawler takes a website (news site, blog, forum, ...) as its input configuration, extracts and aggregates the relevant topics, stores them, and serves them to the end user on request. Initially the crawler input is a set of bad web pages and a set of good web pages. The crawlers access the full content directly and then extract it. From that point on, the content processing is independent of the source website; the content is stored and reused in the word-learning step.
The word-learning step produces a dictionary of bad words and good words.
Figure 3.1 Content collection process
3.1.4. Implementation procedure
The procedure for filtering unhealthy Vietnamese content on the Internet can be expressed as a model. The model covers every step from the moment the web browser receives content to the final decision to display or not display it, which is made by applying the Naive Bayes algorithm to word probabilities computed from the Vietnamese content being filtered.
The proposed model consists of four smaller processes, as shown in the figure below.
Process 1: preprocessing. It retrieves the content and splits it into sentences, single words and Vietnamese compound words before passing the result to Process 2.
Process 2: applies the Naive Bayes algorithm to the list of single and compound words produced by Process 1 in order to determine word frequencies.
Process 3: uses the computed probability to decide whether the content is healthy and classifies the content to be displayed accordingly. It also updates the Black list and White list of addresses so that the filter works better.
Process 4: the word-learning stage. New compound and single words are learned automatically and added to the base training set, while words that already exist have their frequency in healthy or unhealthy content updated.
[Figure 3.1, flowchart: sets of bad and good web pages → Crawler (Vietspider) → bad content / good content → word learning → bad words / good words]
Figure 3.2 General model for filtering unhealthy content (the diagram is divided into Parts I, II, III and IV)
3.1.4.1. Process 1
Process 1 has two components: a filter that extracts the plain content of the website, and a Vietnamese lexical analyzer.
The first process can be described as follows. It takes as input a set T_s of training documents, in which each document T_i ∈ T_s (1 ≤ i ≤ s) belongs to one of two classes: normal content or unhealthy content. The training documents are selected during initialization and are updated regularly whenever website content is successfully classified (Process 4).
[Figure 3.2, flowchart: website to browse → filter that extracts the plain website content, plus preprocessing → Vietnamese lexical analyzer → set of two-syllable word tokens → Naive Bayes filter (backed by the dictionary data set) → identify the two-syllable tokens with high weight in the content → compute the probability that determines the class of the website content → unhealthy content? If true: do not display the website content and update the Black list. If false: allow the website content to be displayed and update the White list. In both cases the words appearing in the content are then used for training.]
Each input goes through the following processing steps:
Remove the HTML formatting.
Remove common words such as "thì", "là", "mà", "các", "những", ..., sentence connectors such as "tuy nhiên", "mặc dù", "vì thế", "không những", "mà còn", ..., and special characters such as @, #, $, ?, &, ... in order to speed up word segmentation.
Split the content into sentences, converting the whole website content into normalized simple sentences in which words are separated by exactly one space. The Vietnamese sentence segmentation procedure is described by the following model.
Input: the Vietnamese website content to be filtered
Output: the set of normalized, corrected simple sentences
Figure 3.3 Sentence segmentation model for Vietnamese
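The preprocessing steps listed above could look roughly like the sketch below. It is an illustration only: the names `preprocess` and `_TextExtractor` are hypothetical, the stop-phrase list contains just the examples quoted in the text, and sentences are split before special characters are removed so that the delimiters are still available when splitting.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


# Example stop phrases from the text; longer phrases come first so that
# "mà còn" is removed before "mà".
STOP_PHRASES = [unicodedata.normalize("NFC", p) for p in (
    "tuy nhiên", "mặc dù", "vì thế", "không những", "mà còn",
    "thì", "là", "mà", "các", "những")]


def preprocess(html):
    """Return normalized simple sentences: lower case, single-spaced words."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFC", " ".join(parser.parts)).lower()
    sentences = []
    for raw in re.split(r"[.?!;:]+", text):
        for phrase in STOP_PHRASES:
            raw = re.sub(rf"\b{re.escape(phrase)}\b", " ", raw)
        raw = re.sub(r"[^\w\s]", " ", raw)  # special characters such as @ # $ &
        raw = " ".join(raw.split())         # exactly one space between words
        if raw:
            sentences.append(raw)
    return sentences


page = ("<p>Trang này <b>là</b> ví dụ. Tuy nhiên, nội dung @ rất ngắn!</p>"
        "<script>var x = 1;</script>")
print(preprocess(page))
```

Normalizing to NFC first keeps accented Vietnamese letters as single code points, so the `\b` word boundaries in the stop-phrase patterns behave as expected.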
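The Vietnamese lexical analyzer has two parts: single-word analysis and compound-word analysis.
The procedure for extracting Vietnamese single words is described in detail by the following model, starting from the step that reads the list of normalized simple sentences.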
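[Figure 3.3, flowchart: start → Vietnamese website content → filter that removes HTML → processing that removes connecting words → split the content into sentences (Split) → create the list of simple sentences → output the list of simple sentences → end]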
Figure 3.4 Model for extracting Vietnamese single words
For compound-word analysis there is as yet no standard dictionary for Vietnamese language processing. According to figures from the website http://dict.vietfun.com [11], 67.1% of the words in the dictionary are two syllables long, about 20% are single words or words of 3 to 4 syllables, and longer words account for only about 3% of the dictionary. Compared with single words and longer compounds, two-syllable compounds therefore make up a large share. To keep the problem simple, the work initially concentrates on two-syllable compound words. The compound-word analysis procedure can be summarized by the model presented below.
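[Figure 3.4, flowchart: start → iterate over the list of normalized simple sentences → while DSCau.length > 0: split the sentence into single words → build the list of single words → for each single word in the list, check whether the word exists in the Vietnamese single-word database (CSDL); if it does, update its frequency; if not, add it as a new entry → when no sentences remain, end]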
Figure 3.5 Model for extracting Vietnamese compound words
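The single-word and two-syllable extraction described by the two models can be sketched as follows. The names `extract_tokens` and `update_database` are hypothetical, and a `collections.Counter` stands in for the word database, whose real storage the text does not specify.

```python
from collections import Counter


def extract_tokens(sentences):
    """Return (single_words, two_syllable_candidates) for normalized sentences."""
    singles, pairs = [], []
    for sentence in sentences:
        words = sentence.split()
        singles.extend(words)
        # Every pair of adjacent syllables, DsTu(i) and DsTu(i+1), is a
        # candidate compound; pairs never cross a sentence boundary.
        pairs.extend(f"{a} {b}" for a, b in zip(words, words[1:]))
    return singles, pairs


def update_database(db, tokens):
    """Add unseen tokens with count 1 and increment the count of known ones."""
    db.update(tokens)
    return db


single_db, pair_db = Counter(), Counter()
singles, pairs = extract_tokens(["nội dung trang web", "nội dung"])
update_database(single_db, singles)
update_database(pair_db, pairs)
print(pair_db.most_common(1))
```

Because candidates are formed purely from adjacency, many of them are not real compounds; in the approach described here it is their frequency in the training data that later separates useful tokens from noise.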
3.1.4.2. Process 2
In Process 2 the probability for compound words is computed from the Naive Bayes formula as follows:
+ Let each piece of web content to be accessed be: noidung
+ Class of unhealthy content: xau
+ Class of normal content: tot
+ Probability that a piece of website content is unhealthy: P(xau | noidung)
+ Word_1, Word_2, Word_3, ..., Word_n are the features that appear in noidung.
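[Figure 3.5, flowchart: start → iterate over the list of normalized simple sentences → while DSCau.length > 0: split the sentence into single words → build the single-word list DsTu → iterate over DsTu starting at i = 1 → while DsTu.length > 0: form the compound DsTu(i) & DsTu(i+1) → check whether the compound exists in the database (CSDL); if it does, increase its frequency in the database and set i = i + 1; if not, add the compound to the database and set i = i + 1 → when no sentences remain, end]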
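$$P(xau \mid noidung) = \frac{P(noidung \mid xau)\, P(xau)}{\mathrm{Total}} \tag{3.1}$$
where
$$\mathrm{Total} = P(noidung \mid xau)\, P(xau) + P(noidung \mid tot)\, P(tot) \tag{3.2}$$
with $P(noidung \mid tot)$ and $P(noidung \mid xau)$ computed as
$$P(noidung \mid tot) = \prod_{i=1}^{n} P(tu_i \mid tot) \tag{3.3}$$
$$P(noidung \mid xau) = \prod_{i=1}^{n} P(tu_i \mid xau) \tag{3.4}$$
and $P(xau)$, $P(tot)$ given by
$$P(xau) = \frac{\mathrm{Total}_{xau}}{\mathrm{Total}_{noidung}} \tag{3.5}$$
$$P(tot) = \frac{\mathrm{Total}_{tot}}{\mathrm{Total}_{noidung}} \tag{3.6}$$
Here $tu_i$ is the i-th feature (word) of noidung. In (3.5) and (3.6), $\mathrm{Total}_{xau}$, $\mathrm{Total}_{tot}$ and $\mathrm{Total}_{noidung}$ are read as the number of bad, good and all training contents respectively; they are distinct from the normalizer Total defined in (3.2).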
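Formulas (3.1) to (3.6) can be turned into a small sketch. This is not the thesis implementation: the function names are hypothetical, the class priors are taken as document-count ratios, and the per-word probability uses add-one (Laplace) smoothing, which the source does not mention, so that an unseen word does not force the whole product to zero.

```python
def likelihood(tokens, token_counts, vocab_size):
    """P(noidung | class) as the product of per-word probabilities (3.3, 3.4)."""
    total_tokens = sum(token_counts.values())
    p = 1.0
    for tok in tokens:
        # Add-one smoothing: an assumption of this sketch, not from the source.
        p *= (token_counts.get(tok, 0) + 1) / (total_tokens + vocab_size)
    return p


def posterior_bad(tokens, bad_counts, good_counts, n_bad_docs, n_good_docs):
    """P(xau | noidung) following (3.1), with Total from (3.2)."""
    vocab_size = len(set(bad_counts) | set(good_counts))
    n_docs = n_bad_docs + n_good_docs
    p_bad = n_bad_docs / n_docs      # (3.5)
    p_good = n_good_docs / n_docs    # (3.6)
    l_bad = likelihood(tokens, bad_counts, vocab_size)
    l_good = likelihood(tokens, good_counts, vocab_size)
    total = l_bad * p_bad + l_good * p_good
    return l_bad * p_bad / total


# Hypothetical toy counts: token -> occurrences in each training class.
bad_counts = {"tu_a": 3, "tu_b": 1}
good_counts = {"tu_b": 2, "tu_c": 2}
print(posterior_bad(["tu_a"], bad_counts, good_counts, 10, 10))
```

For long documents the raw product can underflow to zero in floating point; summing logarithms of the per-word probabilities is a common way around that, at the cost of departing slightly from the literal form of (3.3) and (3.4).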
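Figure 3.6 Model for computing the probability of compound words
[Figure 3.6, flowchart: list of compound-word tokens found in the website content → Naive Bayes filter → list of compound-word tokens with high weight in the content → compute the probability → compare with the probability thresholds: Good (< 0.3), Neutral (in between), Bad (>= 0.7)]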
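The threshold step shown in the figure can be written as a three-way decision. The cut-off values 0.3 and 0.7 are the ones read from the figure; the function name `label_from_probability` and the label strings are assumptions of this sketch.

```python
def label_from_probability(p_bad, low=0.3, high=0.7):
    """Map P(xau | noidung) to a label using the two thresholds of Figure 3.6."""
    if p_bad >= high:
        return "bad"
    if p_bad < low:
        return "good"
    return "neutral"


for p in (0.1, 0.5, 0.7):
    print(p, label_from_probability(p))
```

How neutral content is handled afterwards (displayed, blocked, or re-examined) is not stated in the text, so the sketch stops at the label.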
3.1.4.3. Process 3
In this step, which corresponds to the word-learning stage (Process 4) of the model, automatic word learning is carried out once the content to be displayed has been classified. Single or compound words that are not yet in the dictionary are added to it. For words that already exist, the system updates their frequency and, with it, their rate of occurrence.
With this self-learning process, the larger the amount of Vietnamese content, the more words the dictionary holds. This in turn improves the accuracy of the probability that content is normal or unhealthy, which greatly helps the Naive Bayes computation.
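The dictionary update described above can be sketched as a small class that keeps one frequency table per class. The class name `Dictionary`, its methods, and the definition of the occurrence rate as a word's share of all tokens in a class are assumptions made for illustration.

```python
from collections import Counter


class Dictionary:
    """Per-class word frequencies, updated after each classified content."""

    def __init__(self):
        self.counts = {"good": Counter(), "bad": Counter()}

    def learn(self, tokens, label):
        """Update frequencies for `label`; return tokens that were new."""
        known = set(self.counts["good"]) | set(self.counts["bad"])
        new_tokens = sorted(set(tokens) - known)
        self.counts[label].update(tokens)
        return new_tokens

    def rate(self, token, label):
        """Share of `token` among all tokens seen in `label` (0.0 if empty)."""
        total = sum(self.counts[label].values())
        return self.counts[label][token] / total if total else 0.0


d = Dictionary()
print(d.learn(["tu_a", "tu_b", "tu_a"], "bad"))
print(d.learn(["tu_a", "tu_c"], "good"))
print(d.rate("tu_a", "bad"))
```

Keeping the raw counts and deriving the rate on demand means that every newly learned content automatically changes the rates of all words in that class, which is the effect the text describes.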
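Figure 3.7 Model for updating the dictionary
[Figure 3.7, flowchart: the list of single words in the content goes to single-word learning, and the list of compound words in the content goes to compound-word processing and compound-word learning. In each branch the question "New word?" is asked: if yes, the word is added to the corresponding dictionary (single-word dictionary or compound-word dictionary); if no, its frequency in that dictionary is updated.]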
3.2. System architecture of the program
Purpose of the program: to build a web browser that can detect and filter websites with unhealthy content.
Input: the content of the web page to be displayed in the browser.
Output: a decision to allow or disallow display of the website based on content filtering, together with an update of the browser's Black list and White list.
The program has two main parts:
Part 01: an ordinary web browser with the usual basic functions
Part 02: the administration part of the browser
3.2.1. Web browser with the usual basic functions
Applying the research above, the program provides the following functions:
+ The basic functions of ordinary browsers
+ Filtered browsing: filtering by Black list and White list, and filtering by website content with the Naive Bayes algorithm when the website address is in neither list
+ As a distinguishing feature, besides the usual English filtering, the filter can also filter Vietnamese content
3.2.2. Basic functions of the system
Build a training set of unhealthy English words
Build a training set of unhealthy Vietnamese words (single words and compound words)
Build functions that extract single words from English content, and single words and compound words from Vietnamese content
Analyze English website content by single words and Vietnamese website content by compound words
Build and update the Vietnamese compound-word dictionary
3.3. Functions of the program
3.3.1. Main interface of the program
Figure 3.8 Main interface of the program
The interface offers many of the basic functions of an ordinary web browser, with the following differences.
When a website address is typed into the browser's Address field and the address is in the White list, the website content is not checked. Conversely, if the address is in the Black list, a message is shown stating that the website may not be accessed, again without checking the content.
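The order of checks described above (White list, then Black list, then content classification, followed by a list update) can be sketched as follows. The function name `check_address`, the use of host names as list entries, and the `classify_content` callback are assumptions; the callback stands in for the Naive Bayes filter and is expected to return "bad" or "good".

```python
from urllib.parse import urlsplit


def check_address(url, black_list, white_list, classify_content):
    """Return "display" or "block" for `url`, updating the lists if needed."""
    host = urlsplit(url).hostname or ""
    if host in white_list:
        return "display"          # no content check needed
    if host in black_list:
        return "block"            # no content check needed
    # Address is in neither list: fall back to content classification.
    if classify_content(url) == "bad":
        black_list.add(host)
        return "block"
    white_list.add(host)
    return "display"


black, white = {"bad.example"}, {"good.example"}
print(check_address("http://new.example/page", black, white, lambda u: "bad"))
print(sorted(black))
```

Checking the lists first keeps the expensive content analysis for addresses that have not been seen before.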
Figure 3.9 Message shown when access to website content is denied
3.3.2. Function diagram of the program
3.3.2.1. System login function
Description: this function lets the program administrator log in. After a successful login the administrator has full control of the program, including managing Black list and White list addresses and updating the dictionary and the word frequencies.
Figure 3.10 Login function of the management program
3.3.2.2. Program functions
Table 3.1 Description of the program functions
Function: Main content
- Vietnamese word learning: lets the user try out Vietnamese word learning. It can also be used to add new words to the dictionary, which improves the accuracy of content-based classification.
- Processing, with four basic functions:
  + Get the content of the website being accessed
  + Manage the Vietnamese dictionary (one-syllable and two-syllable words)
  + Analyze Vietnamese website content into single words and compound words
  + Analyze website content into sentences, that is, into normalized simple sentences
- Word training:
  + Training on English words for English website content
  + Training on Vietnamese words (single words and two-syllable words) for Vietnamese website content
- Trial classification of content under two modes:
  + English website content
  + Vietnamese website content
- Management of system parameters, such as where analyzed website content is stored (for both English and Vietnamese) and the probability threshold of the words used
- Management of the program's Black list
- Management of the program's White list
- Management of the program's list of main keywords
Besides the functions above, the system can also filter on the title of the website being accessed and on main keywords, like the existing filtering systems presented in Chapter 1.
3.4. Vietnamese word-learning function
Description: this function learns Vietnamese single words and compound words from previously collected website content. In addition to the existing content, the user can feed new content into word learning through the function that gets the content of the website being accessed (section 3.5.1).
Figure 3.11 Learning Vietnamese single words and compound words
3.5. Processing functions
3.5.1. Getting the website content to analyze
Description: retrieves the content of the accessed website for use in Vietnamese word learning (section 3.4) and in website content classification.
Figure 3.12 Getting the website content to analyze
3.5.2. Managing the Vietnamese dictionary
Description: manages the dictionary produced by the analysis (single words and compound words). Commonly used functions are:
- Search for single words and compound words
- Update the status of single words and compound words
- Remove rarely used single words and compound words
- Display all words in the dictionary
Figure 3.13 Vietnamese dictionary
3.5.3. Sentence analysis of Vietnamese website content
Description: after the content of the website being accessed has been retrieved, it undergoes basic cleanup and is then split into normalized simple sentences. These are used to extract single words for English, and single words and compound words for Vietnamese.
Figure 3.14 Sentence analysis for Vietnamese
3.5.4. Analyzing Vietnamese website content
Description: splits Vietnamese website content into single words and compound words after preprocessing and sentence segmentation. The words are shown in the program interface together with the number of single words and compound words found.
Figure 3.15 Analyzing Vietnamese website content
3.6. Word-training function for content filtering
This function trains single words for English, and single words and compound words for Vietnamese. Each type of word is trained on two training sets: a set of ordinary websites and a set of unhealthy websites.
3.6.1. Training English words
Description: the program interface shows the following figures:
- Total number of files in each training set and in the whole training run
- Number of single words trained from both training sets
Figure 3.16 Training English words
3.6.2. Training Vietnamese words
Description: the program interface shows the following figures:
- Total number of files in each training set and in the whole training run
- Number of single words trained from both training sets
- Number of compound words trained from both training sets
Figure 3.17 Training Vietnamese words
3.7. Classifying website content
This function decides which class website content belongs to: healthy or unhealthy. Classification is based on single words for English and on compound words for Vietnamese.
3.7.1. English content
Description: the input is the English website content to analyze. The result states which class the content belongs to and its average probability, based on the content analysis (English single words).
Figure 3.18 Classifying English website content
3.7.2. Vietnamese content
Description: the input is the Vietnamese website content to analyze. The result states which class the content belongs to and its average probability, based on the content analysis (Vietnamese compound words).
Figure 3.19 Classifying Vietnamese website content
3.8. Managing system parameters
Description: manages the basic parameters of the program. There are two groups of parameters, one for English and one for Vietnamese. Both groups contain the following main items:
- Path for storing unhealthy content after classification
- Path for storing good content after classification
- Number of best keyword tokens to take when classifying website content
- Minimum classification probability used for text classification
- Threshold for discarding rarely used words (frequency of the word over the total training content)
Figure 3.20 Managing system parameters
3.9. Managing the lists
This function manages the Black List and White List of addresses. The user can edit either list or add new entries.
3.9.1. Black List
Description: the program's Black List is shown on the left. The administrator can add, delete or edit entries in this list.
The list currently contains 24071 addresses that may not be accessed. When the user accesses an address in this list, the program blocks it without analyzing the content.
Figure 3.21 Black List
3.9.2. White List
Description: the program's White List is shown on the left. The administrator can add, delete or edit entries in this list.
The list currently contains 24071 addresses. When the user accesses an address in this list, the program displays the content immediately without analyzing it.
Figure 3.22 White List
3.10. Experimental results and evaluation
The Vietnamese dictionary was built from more than 400 web pages found on the Internet. After the data-cleaning steps, the raw data of each page averaged about 200 to 500 words, depending on the page.
Because the dictionary is updated automatically, its accuracy depends on how long it has been in use: the longer it is used, the more it learns and the more accurate it becomes.
Table 3.2 Results of building the Vietnamese dictionary
Word type             Count    Correct rate
Single words          2114     > 83%
Two-syllable words    5260     > 79%
Once the dictionary was available, learning was carried out on 200 good web pages and 200 bad web pages. The results are shown in the classification table below.
Table 3.3 Web classification results
                                        Test result          Correct classification rate
Word type                               Good       Bad       Good      Bad
Single words                            167/200    171/200   83.5%     85.5%
Single words and two-syllable words     183/200    181/200   91.8%     90.6%
Two-syllable words                      187/200    189/200   93.5%     94.7%
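As a worked check of the arithmetic, the rates can be recomputed from the raw counts in the table. The helper name `accuracy_percent` is hypothetical. For the first row the ratios reproduce the reported 83.5% and 85.5%; for the other rows they come out as 91.5%, 90.5%, 93.5% and 94.5%, slightly different from some of the reported percentages, and the text does not explain the difference.

```python
def accuracy_percent(correct, total):
    """Share of correctly classified pages, in percent."""
    return 100.0 * correct / total


# (correct good pages, correct bad pages) out of 200 each, from Table 3.3.
results = {
    "single words": (167, 171),
    "single + two-syllable words": (183, 181),
    "two-syllable words": (187, 189),
}
for name, (good, bad) in results.items():
    print(f"{name}: good {accuracy_percent(good, 200):.1f}%, "
          f"bad {accuracy_percent(bad, 200):.1f}%")
```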
CONCLUSION AND FUTURE WORK
Conclusion
The thesis achieved the following results:
Surveyed systems for filtering objectionable websites and the filter-building methods in common use today, including their strengths and weaknesses.
Studied statistical filtering methods and the strengths of text classification techniques in order to apply them to web content filtering.
Compared word segmentation methods for Vietnamese, selected the most suitable one for the problem, and built a complete dictionary for filtering unhealthy content.
Studied algorithms, in particular Naive Bayes, and applied them to the classification of web page content.
The new aspect of the approach is that it filters not only English web pages but also unhealthy Vietnamese web pages, based on the address, the title and the main content of the page.
Built a blacklist and a whitelist containing website addresses that may not and may be accessed.
Built a web filter that demonstrates that the research direction and approach of the thesis are sound.
The experimental results show that the approach is promising, giving high accuracy within an acceptable amount of time.
Future work
Integrate the filter into common web browsers such as Internet Explorer, FireFox and Safari to increase the practical value of the work.
Improve the word segmentation algorithm to reduce processing time during content classification and make the filter more convenient for users.
Propose a balanced method for processing website content that contains both Vietnamese and English.
Study how to build a dictionary of vocabulary belonging to unhealthy content without having to scan that content.
REFERENCES
Vietnamese
[1]. Đỗ Phúc (2005), Giáo trình khai thác dữ liệu, Đại học Công nghệ Thông tin Tp. HCM.
[2]. Hà Quang Thụy, Phan Xuân Hiếu, Đoàn Sơn (2009), Giáo trình Khai phá dữ liệu web, Nxb Giáo dục Việt Nam.
[3]. Phan Hữu Tiếp (2011), Nghiên cứu xây dựng bộ lọc Spam thông minh tự động, Tập san khoa học giáo viên, Trường Đại học Lạc Hồng.
English
[4]. Chih-Hao Tsai (1996), A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm.
[5]. Edel Garcia (2008), Term Vector Theory and Keyword Weights.
[6]. Goldszmidt D., Friedman, N.Geiger (2006), Bayesian network classifiers, Machine Learning.
[7]. Lafferty J. (2001), Conditional random fields: probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning.
[8]. Rongbo Du, Reihaneh Safavi-Naini and Willy Susilo (2006), Web Filtering Using Text Classification, Australia.
[9]. Sebastiani Fabrizio (2004), Text Classification for Web Filtering.
[10]. Stern Benjamin (2003), Web Filtering Technology Assessment.
Website
[11]. http://www.dict.vietfun.com
[12]. www.google.com/trends
