MINISTRY OF EDUCATION AND TRAINING
LẠC HỒNG UNIVERSITY
--------
ĐỒNG NAI, 2014
ACKNOWLEDGEMENTS
First of all, I would like to express my sincere and deep gratitude to Dr. Vũ Đức Lung, who supervised me and helped me wholeheartedly while I completed this thesis.
Respectfully,
DECLARATION
I take full responsibility for this declaration and accept any disciplinary action prescribed by the regulations.
The author
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
INTRODUCTION
Rationale for the topic
Objectives
Scope of work
Methodology
CHAPTER 1. OVERVIEW OF DATA EXTRACTION AND FILTERING ON WEBSITES
1.1 Introduction
1.2 Types of web filters for pornographic content
1.2.1 Web filtering based on network address
1.2.2 Web filtering based on URL
1.2.3 Web filtering based on DNS
1.2.4 Web filtering based on keywords
1.2.5 Web filtering based on text and image content
1.3 Related work
CHAPTER 2. THEORETICAL BACKGROUND APPLIED IN THE THESIS
2.1 Extracting website content
2.1.1 HTML code analysis
2.1.2 Template comparison
2.1.3 Natural language processing
2.2 Parsing content into tokens
2.2.1 Data preprocessing
2.2.2 Sentence segmentation based on Maximum Entropy
2.2.3 Word segmentation
2.2.3.1 The Maximum Matching method
2.2.3.2 The Transformation-based learning (TBL) method
2.2.3.3 Word segmentation model using WFST and neural networks
2.2.3.4 Vietnamese word segmentation based on Internet statistics and a genetic algorithm
2.2.4 The KEA algorithm
2.2.4.1 Candidate phrase selection
2.2.4.2 Feature computation
2.2.4.3 Training
2.2.4.4 Keyphrase extraction
2.2.5 The KIP algorithm
2.2.6 Named entity recognition
2.3 URL analysis
CHAPTER 3. A SOLUTION FOR FILTERING PORNOGRAPHIC WEBSITES BASED ON URL AND TEXT CONTENT
3.1 System model analysis
3.2 URL-based processing module
3.3 Content-based filtering module
3.3.1 Training phase
3.3.1.1 Text preprocessing
3.3.1.2 Feature extraction
3.3.1.3 The Naïve Bayes algorithm
3.3.2 Classification and recognition phase
CHAPTER 4. EXPERIMENTS AND EVALUATION OF RESULTS
4.1 Experimental environment
4.2 Program interface
4.2.1 Main interface
4.2.2 Interface for learning words to obtain TOKENs for classifying website content
4.2.3 Interface for reviewing single-word TOKENs for inclusion in the TOKEN list
4.2.4 Interface for reviewing compound-word TOKENs for inclusion in the TOKEN list
4.2.5 Interface listing the word TOKENs used to classify website content
4.2.6 Interface for extracting URL TOKENs
4.2.7 Interface listing the URL TOKENs used to classify website URLs
4.3 Data collection
4.3.1 Collecting data for the URL TOKEN database
4.3.2 Collecting data for the content TOKEN database
4.4 Evaluation of experimental results
CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF ABBREVIATIONS
Abbreviation    Meaning
MM              Maximum Matching
NB              Naïve Bayes
TF              Term Frequency
INTRODUCTION
1. Rationale for the topic
The Internet first appeared in the 1960s. At that time, however, it was used only internally and served mainly military purposes. Vietnam officially joined the global Internet on 19 November 1997. After more than a decade of operation, the Internet has become a term almost everyone knows and a medium almost everyone uses; some groups even depend on it entirely. Its influence spread rapidly once it began to serve entertainment: people can not only look up material but also watch films, listen to music, and play games online. Millions upon millions of people go online every day, yet very few of them do so to work, study, or consult documents.
The rapid growth of the Internet is an encouraging sign of the development of information technology in a modern society. Behind it, however, are the harmful consequences the Internet brings, especially for young people. Alongside online games, visiting unhealthy sites out of curiosity to read sex stories, view pornographic images, or watch sex films has also become common. The harm is that it makes viewers want to engage in sexual activity immediately, which leads to involvement in prostitution and to rape among people who are still under age. 1
Pornographic websites affect not only the sexual behaviour of young people but also work ethics in the office 2. They also compromise the security of users' personal computers and of office computer networks through malicious software. How to prevent users from accessing websites with pornographic content is therefore a matter of concern to society. Many software products have been studied both in Vietnam and abroad, such as:
Domestic work includes the following software: Killporn by Nguyễn Hữu Bình; VwebFilter (abbreviated VWF), built by the Computing and Data Communication Company (Công ty Điện toán và Truyền số liệu); Depraved Web Killer (DWK) by Vũ Lương Bằng, an employee of the Đông telephone company, District 10 (Ho Chi Minh City); MiniFireWall 4.0 (MFW) by Huỳnh Ngọc Ẩn (who works in the Informatics Department of the Đồng Tháp Provincial Post Office); and "A filter for detecting websites with unhealthy content", the master's thesis in information technology by Cao Nguyễn Thủy Tiên.
1 http://vi.wikipedia.org/wiki/Internet_t%E1%BA%A1i_Vi%E1%BB%87t_Nam
2 http://baohay.vn/chuyen-de/nhung-dieu-can-biet/288247/Web-sex-dang-tro-thanh-mon-giai-tri-o-chon-cong-so.html
Work from abroad includes the following software: STOP P-O-R-N 5.5, published by PB Software LLC; K9 Web Protection, published by Blue Coat Systems; Media Detective 2.3, published by Tap Tap Software; Parental Filter 3.0, published by NWSP Software Design; ScrubLT 3.2.2.0, published by CrubLT; CyberSitter, published by Solid Oak Software; and iShield 1.0, published by Guardware.
In practice, most foreign software must be paid for, and it usually filters pornographic images; its filtering of pornographic text works mainly for English and is limited for Vietnamese. Domestic software also has limitations in blocking common pornographic keywords and in blocking specific website URLs. This shows that many issues remain to be studied, clarified, and improved, which is why the topic "Building a filter to detect websites with pornographic content based on URL and TEXT CONTENT" was chosen for this thesis.
2. Objectives
Firewall
A firewall is a technique integrated into a network to counter unauthorised access, in order to protect internal information sources and to restrict intrusion ...
Advantages:
Virtual websites are not affected: this technique does not affect virtual web servers that share an IP address with restricted websites. A blocked website and an unblocked website can share the same IP address.
IP changes have no effect: in most situations, changing the IP address of a restricted website does not affect this method, because the filtering does not depend on IP addresses. Site owners can switch to any IP address they like, but users behind the filter still cannot access the site.
Limitations:
Usually cannot block non-standard ports:
Web servers on standard ports are handled very well.
Websites on non-standard ports are hard to block because they require a higher level of capability in the filter.
A URL filtering solution may need the ability to handle HTTP connections on non-standard ports.
Does not work with encrypted traffic: HTTP requests sent over SSL/TLS are encrypted, so URL-based filtering cannot read the Host field. The filter therefore cannot tell which resource on an IP address a request is actually directed at.
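The behaviour described above can be illustrated with a minimal sketch of a URL-based check. It is not the thesis's implementation: the `BLOCKED` entries, the function name `is_blocked`, and the prefix-matching rule are assumptions made for illustration. The check looks only at the host name and path, so it is independent of the IP address and of the port, provided the filter can see the plain-text request at all.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist of (host, path prefix) pairs; illustrative values only.
BLOCKED = {("www.exp.com", "/bad"), ("shared.example.net", "/adult/")}


def is_blocked(url: str) -> bool:
    """Return True when the URL's host and path match a blocklist entry."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname excludes any port number
    path = parts.path or "/"
    return any(host == h and path.startswith(p) for h, p in BLOCKED)


if __name__ == "__main__":
    for u in ("http://www.exp.com/bad.htm",
              "http://www.exp.com/index.htm",
              "http://www.exp.com:8080/bad.htm"):
        print(u, is_blocked(u))
```

Because the decision uses the host name rather than the address, two virtual hosts on one IP are treated separately. For HTTPS traffic the filter never sees the URL, which is the limitation noted above.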
Advantages:
Multi-protocol: it works with http, ftp, gopher, and any other protocol that relies on the name system.
Not affected by IP changes: changing the IP address of a website does not affect this filtering method, which is entirely independent of IP addresses.
Limitations:
Ineffective for URLs that contain an IP address:
Most website addresses use DNS names (www.lhu.edu.vn), but some are specified by an IP address instead of a DNS name (http://118.69.126.40).
In that case the site is reached through its IP address rather than its DNS name.
The whole web server is blocked: the technique does not allow selective blocking of individual pages on a web server. If the banned page is www.exp.com/bad.htm, every request to www.exp.com may become impossible, even for pages that are not on the block list.
Subdomains are affected: technically, a single domain name such as example.com in the URL http://www.example.com is used to reach the web server. At the same time, the domain name can serve as the parent domain of other gateways such as ...
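The limitations listed above can be made concrete with a small sketch of a name-based filter. The blocked domain set and the function names are hypothetical, and the suffix rule used for subdomains is an assumption about how such a filter might be configured.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical set of blocked domain names; illustrative only.
BLOCKED_DOMAINS = {"exp.com"}


def is_ip_literal(host: str) -> bool:
    """True when the host part of a URL is an IP address, not a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def dns_filter_blocks(url: str) -> bool:
    """Decide from the host name alone, as a DNS-based filter has to."""
    host = (urlsplit(url).hostname or "").lower()
    if is_ip_literal(host):
        return False  # no name lookup takes place, so the filter is bypassed
    # The page path is invisible at this level: the whole server is blocked,
    # and every subdomain of a blocked domain is blocked with it.
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
```

With these assumptions, `http://www.exp.com/good.htm` is blocked even though only `bad.htm` was meant to be banned, while `http://118.69.126.40/bad.htm` passes because no DNS name is involved.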
... the Ministry of Public Security, and the national Internet gateway at the VDC computing and data communication centre.
"Research, analysis, and proposal of legal policy in Vietnam for filtering information content on the Internet", a topical report within the state-level project coded 02/2006/HĐ-ĐTCT-KC.01/06-10. Department of Professional Information Technology, Ministry of Public Security.
DWK4.1: Depraved Web Killer (DWK) was entered by its author, Vũ Lương Bằng, in the final round of the Trí Tuệ Việt Nam competition in 2004. At the time of writing the latest version is v4.1 (2011). Its functions include blocking websites with harmful content (by keyword and URL), logging the programs run on the machine, logging the websites visited, logging the harmful websites the software has blocked, sending the logs to an e-mail address set by the user, and so on.
The REMPARO web filtering solution is a product of Chớp Sáng Co., Ltd. and Ashmanov, developed on artificial intelligence technology for natural language understanding. Remparo is a semantic web filtering solution that blocks access to websites with harmful or inappropriate content. Any web page passing through the Remparo filter whose content falls under topics such as pornography, violence, or reactionary politics is recognised by the system, which then takes an appropriate action: allowing or blocking the page, or other actions such as warning or redirecting the request, depending on the administrator's wishes.
Besides modules that filter by keyword and by blacklist/whitelist, Remparo integrates a content-based filtering module built on artificial intelligence technology for understanding Vietnamese natural language.
Remparo also offers the following features: it is easy to use and requires no software installation; it has a 99% blocking rate and can filter newly appearing web pages that are not yet in the database; it blocks individual web pages rather than whole websites; filtering is performed entirely on the network operator's side; access management can be set up from a central point; no time is spent on monitoring and control; and Internet access speed is not affected.
The master's thesis "Building a filter to detect websites with unhealthy content" by Cao Nguyễn Thủy Tiên, 2011, Lạc Hồng University. The aim of that thesis was to study the characteristics and the development of websites with ...
... blocking access to "black" (adult) websites. The software's features include detecting pornographic images, keyword scanning, several View modes, checking ZIP files and file extensions, scanning Word documents to find embedded images, and so on.
Anti-Porn (AP) is a fairly good program for protecting against "black" websites, thanks to a fairly complete database of unhealthy websites, mostly in English.
Internet Lock is a Windows program that uses a password to control Internet access, web browsing, chat, and e-mail.
Net Nanny is software designed for families who want a tool for monitoring their children's Internet use.
SurfControl - Enterprise Threat Protection: this product from SurfControl approaches web filtering by blocking at the proxy by URL and keyword, and offers about 20 ways of blocking.
Internet Filter - Web Filters: developed by iPrism Internet Filters & Web Filters, this software performs monitoring and blocking. It is advertised as using dynamic web filtering to control web content right at the point of entry. According to the vendor's administration guide, however, it also shows traces of keyword-based blocking.
FamilyWall: a firewall program that runs resident on the user's computer. Its main function is to block access to websites with harmful content on the Internet. Its main layers of control are harmful keywords, web page content, the list of harmful websites already detected, and so on.
In general, the software and tools above perform well at blocking unwanted web pages by blacklist, whitelist, and English keywords. Most of them, however, have no mechanism for self-learning or self-improvement to adapt to changes in unwanted web pages or to newly added data, and most were developed to block English-language pages rather than Vietnamese ones.
CHAPTER 2. THEORETICAL BACKGROUND APPLIED IN THE THESIS
2.1. Extracting website content
Content extraction from the web is usually carried out with crawlers or wrappers. A wrapper is regarded as a procedure designed to extract the content of interest from a given information source. Many studies around the world have used different wrapper construction methods to extract information from the web. These methods include:
+ HTML code analysis
+ Template comparison
+ Natural language processing
Like Google News, a content mining and aggregation system is responsible for mining, aggregating, storing, and then republishing content to users. The wrapper takes the configuration of a website (news, online journals, ...) as input, extracts the content, aggregates related topics, stores them in a database, and republishes them to end users. The content is extracted intact and clean and is aggregated from many different sources, which helps readers follow, monitor, search, compile, store, and publish it.
The difficulty of the problem is that not all of a web page's content is needed. If only the HTML script strings are removed, the filtered content will still contain a great deal of unnecessary junk, for example advertising, latest-news blocks, short news items, and menus. Such content usually has to be skipped when extracting the main content of a web page.
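The point that stripping scripts alone is not enough can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the wrapper used in the thesis: the set of tags treated as noise (`SKIP_TAGS`) is an assumption, and real pages usually need site-specific rules.

```python
from html.parser import HTMLParser

# Assumed noise containers; real sites need per-site or learned rules.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Menus and footers are dropped here only because they sit inside the assumed tags; advertising placed in ordinary `div` elements would still pass through, which is why template comparison or language-based methods are also listed above.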
3 http://nhuthuan.blogspot.com/2006/11/s-lc-v-k-thut-trong-vietspider-3.html
... the extracted content. After the extraction process, the content becomes independent of the source website and is stored and reused for different purposes.
4 http://vietnam.usembassy.gov/educational_exchange.html
10. Whether the first character of the preceding token is upper case or not.
11. Whether the preceding token is in the list of abbreviations.
12. The following token.
13. Whether the candidate token is capitalised or not.
From the contexts above, a set of contexts can be derived from the data set (set C). The context set, together with the labels from the data, produces a corresponding feature set. Consider the following example, which clarifies the relationship between contexts and features:
"Computer hackers will have the chance to win prizes worth 10.000 USD and 10.000 Singapore dollars (5.882 USD) in an international contest called "Hackers Zone", held on 13/5/1999 in Singapore."
Consider the potential sentence-ending character "." in the token "10.000 USD". From this position, the following contexts can be derived:
1. There is no whitespace before the candidate character.
2. There is no whitespace after the candidate character.
3. The candidate character is "."
4. Prefix: 10
From this training data, features such as the following can be extracted:
f{no whitespace before candidate, no} = 1. This feature encodes the statement "the token has no whitespace before the candidate and the label is no", which is true here (the feature takes the value 1).
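The four contexts and the indicator feature from the example can be sketched as follows. The function names and dictionary keys are illustrative assumptions; the sketch covers only contexts 1 to 4 of the worked example, not the full list of thirteen.

```python
def candidate_contexts(text: str, pos: int) -> dict:
    """Describe the context of the candidate character text[pos]."""
    before = text[pos - 1] if pos > 0 else " "
    after = text[pos + 1] if pos + 1 < len(text) else " "
    # Prefix: the part of the same token that precedes the candidate.
    start = text.rfind(" ", 0, pos) + 1
    return {
        "no_space_before": not before.isspace(),
        "no_space_after": not after.isspace(),
        "candidate": text[pos],
        "prefix": text[start:pos],
    }


def make_feature(context_name: str, label: str):
    """Build a binary feature f{context, label} in the sense used above."""
    def feature(contexts: dict, y: str) -> int:
        return 1 if contexts.get(context_name) and y == label else 0
    return feature


sample = "worth 10.000 USD"
ctx = candidate_contexts(sample, sample.index("."))
f = make_feature("no_space_before", "no")
print(ctx, f(ctx, "no"))
```

Each (context, label) pair becomes one such binary feature, and the maximum entropy model assigns a weight to each of them during training.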
Once the feature weights have been estimated, these parameters are used to compute p(yes|c). If this value is greater than 50%, the label for the candidate character is recorded as "yes"; that is, the candidate character really is a sentence separator.
2.2.3. Word segmentation
Word segmentation is the process of identifying the single words and compound words in a sentence by determining the boundaries between the words. In language processing, determining the grammatical structure of a sentence or the part of speech of a word first requires knowing which units in the sentence are words. The problem seems simple for humans, but for computers it is very hard to solve. For English, results in this field are very promising, because English is a common language on the Internet and languages ...
... then statistics are gathered and a genetic algorithm is used to find the best segmentation. The novelty of this approach is that, instead of relying on labelled training corpora or a lexicon, which are not yet available for Vietnamese, the author uses statistics extracted directly from a search engine and a genetic algorithm to determine the most plausible segmentations of a given Vietnamese text. What distinguishes the algorithm is that it combines a genetic algorithm with statistics extracted from the Internet through a search engine, rather than from a data set as other methods do.
The genetic algorithm makes it possible to build a parallel (evolutionary) search over a population in which each individual corresponds to one way of segmenting the sentence under consideration. The fitness function evaluates fitness from statistics extracted from the Internet; the extracted information includes document frequencies and correlation information between groups of words in documents. By the principle of evolution, genetic algorithms are well suited to approximating globally optimal solutions in very large search spaces, rather than locally optimal ones. The genetic algorithm evolves a population over many generations toward a global optimum through selection, crossover, mutation, and reproduction. The quality of each individual is determined by the fitness function, and in each generation the N best individuals are kept after crossover, mutation, and reproduction.
The experiments of Nguyễn Thanh Hùng [6], which explored this new approach of applying a genetic algorithm and Internet statistics, achieved promising results in segmenting and classifying Vietnamese text, with a micro-averaging F1 measure (Yang) above 90%.
Comparing the results of Lê An Hà [14] and H. Nguyễn [13] shows that H. Nguyễn's work gives better segmentation results, but because of the speed of the algorithm its processing time is longer. The outstanding advantages of the approach based on multiple characters are its simplicity and ease of application; it also costs little for index construction and for processing many queries. Across the published studies, the multi-character approach, specifically segmentation into two-character units, is considered an appropriate choice. The search space, however, becomes very large because there are many ways to combine syllables into words.
Word-based approaches:
... based on word frequency during segmentation, so ambiguities in Vietnamese are unavoidable when the text is very long.
2.2.3.4. Vietnamese word segmentation based on Internet statistics and a genetic algorithm
The method for Vietnamese word segmentation based on Internet statistics and a genetic algorithm, IGATEC (Internet and Genetics Algorithm based Text Categorization for Documents in Vietnamese), was proposed by H. Nguyễn [13] in 2005 as a new approach to word segmentation for text classification that needs neither a dictionary nor a training corpus. In this approach the author combines a genetic algorithm with statistical data taken from the Internet.
The author describes the word segmentation system as consisting of the following components.
2.2.3.4.1. Online Extractor:
This component obtains information about the frequency of the words in the text by using a well-known search engine such as Google or Yahoo. The author then uses the formulas below to compute the degree of mutual dependence (mutual information), which serves as the basis for computing fitness in the GA engine.
- Probability that words appear on the Internet:
$$p(w) = \frac{count(w)}{MAX}$$
$$p(w_1 \,\&\, w_2) = \frac{count(w_1 \,\&\, w_2)}{MAX}$$
where MAX = 4 * 10^9, and count(w) is the number of documents found on the Internet that contain the word w or, for count(w1 & w2), that contain both w1 and w2.
- Dependency probability of one word on another:
$$p(w_1 \mid w_2) = \frac{p(w_1 \,\&\, w_2)}{p(w_1)}$$
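A minimal sketch of these three quantities follows. The document counts passed in are hypothetical numbers chosen for illustration; in IGATEC they would come from search engine hit counts. The dependency formula is implemented exactly as written in the text, with p(w1) in the denominator.

```python
MAX = 4 * 10**9  # value of MAX given in the text


def p_word(count_w: int) -> float:
    """p(w) = count(w) / MAX."""
    return count_w / MAX


def p_joint(count_w1_w2: int) -> float:
    """p(w1 & w2) = count(w1 & w2) / MAX."""
    return count_w1_w2 / MAX


def p_dependency(count_w1_w2: int, count_w1: int) -> float:
    """p(w1 | w2) as written in the text: p(w1 & w2) / p(w1)."""
    return p_joint(count_w1_w2) / p_word(count_w1)


# Hypothetical hit counts, not real search results.
hits_w1 = 2_000_000
hits_w1_and_w2 = 500_000
print(p_word(hits_w1), p_joint(hits_w1_and_w2),
      p_dependency(hits_w1_and_w2, hits_w1))
```

Since MAX cancels in the ratio, the dependency value depends only on the two counts; MAX matters only for the individual probabilities.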
... the keyphrases [13]. KEA uses the Naïve Bayes machine learning method to train and to extract keyphrases.
According to its authors, KEA is a language-independent algorithm. It can be summarised in the following steps:
Step 1: Candidate phrase extraction: KEA extracts candidate n-gram phrases (1 to 3 words long) that do not begin or end with a stop word. For keyphrase assignment with a predefined dictionary (controlled indexing), KEA selects only the candidates that match terms defined in the dictionary. For the resulting n-grams, KEA removes stop words from the candidate phrases and reduces them to their stem form (stemming).
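Step 1 can be sketched as below. This is a simplified illustration rather than KEA itself: the stop word list is a tiny assumed sample, phrases are allowed to cross punctuation, and the stop word removal and stemming mentioned at the end of the step are omitted.

```python
import re

# Tiny illustrative stop word list; KEA ships its own per-language lists.
STOP_WORDS = {"the", "of", "a", "and", "in", "for", "is", "to"}


def candidate_phrases(text: str, max_len: int = 3) -> set:
    """Return n-grams of 1..max_len words that neither start nor end
    with a stop word."""
    tokens = re.findall(r"\w+", text.lower())
    candidates = set()
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            gram = tokens[i:i + n]
            if len(gram) < n:  # ran past the end of the text
                break
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            candidates.add(" ".join(gram))
    return candidates


print(sorted(candidate_phrases("the filtering of web content")))
```

A stop word may still appear inside a phrase, as in "filtering of web"; only the first and last positions are checked.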
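[Figure: KEA workflow. Boxes: document repository; domain dictionary; candidate extraction; candidate phrases; feature computation; pre-labelled keyphrases; decision "Training?" (Yes: build the model with Naïve Bayes; No: compute probabilities); model; keyphrases.]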
Within the scope of this thesis, the main goal is to detail the steps of text classification with the Naïve Bayes algorithm, together with some improved approaches, in order to classify pornographic content.
In this thesis, unhealthy content means content that Vietnamese culture regards as depraved, such as content containing information about sex, which is especially harmful to minors (in Vietnam, those under 18). A great deal of pornographic content and sexually suggestive stories now exists in Vietnamese. Classifying such content in order to keep it away from minors is a major challenge for families and society.
Specifically, the goal of the problem is to find a function f:
f : (URL, D) → C
Figure 3.3 Training process for content tokens
where:
Training data: a text corpus of pornographic and healthy content collected from sex websites, sex education sites, online newspapers, and so on.
Preprocessing: converts the corpus into a form suitable for classification.
Feature extraction: filters out the single words and compound words, collectively called tokens, that carry the overall meaning of the text.
Applying the Bayes algorithm: applies Bayes' formula to compute the prior probabilities of the two classes Bad and Good, as well as the probability of each token under each class, for later recognition or classification.
Content token database: the single and compound words that have been trained and selected.
3.3.1.1. Text preprocessing
Texts in the training set must be preprocessed before they are used. Preprocessing improves classification performance and reduces the complexity of the classification algorithm.
As analysed in [7, 8, 10], the biggest difference between English and Vietnamese is that in English almost every individual word carries its own meaning, whereas in Vietnamese a meaningful word is usually composed of two or more syllables. For example, "porno film" in English corresponds to "phim con heo" in Vietnamese, and "Viagra" corresponds to the three-word compound "thuốc tăng lực". A further peculiarity is that when such Vietnamese expressions are split into single words they become ordinary words: "phim con heo" splits into "phim" (movie), "con" (children), and "heo" (pig), none of which is pornographic any more. Learning on single words is therefore clearly ineffective for Vietnamese.
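One way to keep compound tokens intact is a greedy longest-match pass over the syllables, in the spirit of the Maximum Matching method named in the list of abbreviations. The sketch below is an illustration under assumptions: the compound dictionary, the function name `max_match`, and the three-syllable limit are hypothetical, and the source does not confirm that the thesis tokenises in exactly this way.

```python
def max_match(syllables, compounds, max_len=3):
    """Greedy longest-match segmentation of a syllable list.

    `compounds` is a set of known multi-syllable tokens; any syllable
    that starts no known compound is emitted as a single-word token.
    """
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest window first, then shrink down to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in compounds:
                tokens.append(candidate)
                i += n
                break
    return tokens


compounds = {"phim con heo"}  # hypothetical compound token list
print(max_match("xem phim con heo".split(), compounds))
```

With the compound in the dictionary, the phrase stays one token; without it, the same input falls apart into the ordinary words "phim", "con", and "heo".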
b. Word training
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
where:
Y represents a hypothesis, which is inferred once the new evidence X is available
P(X): probability that X occurs
P(Y): probability that Y occurs
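$$P(C_i \mid X) \propto P(C_i)\prod_{k=1}^{n} P(x_k \mid C_i)$$
where:
P(C_i | X) is the probability of class i given the sample X.
P(C_i) is the probability of class i.
P(x_k | C_i) is the probability that the k-th attribute takes the value x_k given that X belongs to class i.
Steps of the Naïve Bayes algorithm:
Step 1: Train Naïve Bayes (from the data set): compute P(C_i) and P(x_k | C_i).
Step 2: Classify X_new = (x_1, ..., x_n): compute the probability of each class given X_new. X_new is assigned to the class with the highest probability, according to
$$\max_{C_i \in C}\Big(P(C_i)\prod_{k=1}^{n} P(x_k \mid C_i)\Big)$$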
OCCURRENCE FREQUENCY
                  Bad   Good   Total
Total pages         5      4       9
Token "m o"         4      1       5
Token "anh"         3      3       6
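The two Naïve Bayes steps can be traced on the small frequency table above. In this sketch the names `token_a` and `token_b` stand in for the two token rows of the table, and reading each count as "number of pages of that class containing the token" is an assumption.

```python
# Counts copied from the table: pages per class, and pages per class
# that contain each token (token_a and token_b are placeholder names).
PAGES = {"bad": 5, "good": 4}
TOKEN_PAGES = {
    "token_a": {"bad": 4, "good": 1},
    "token_b": {"bad": 3, "good": 3},
}


def class_scores(tokens):
    """Score each class as P(C_i) multiplied by every P(x_k | C_i)."""
    total = sum(PAGES.values())
    scores = {}
    for label, n_pages in PAGES.items():
        score = n_pages / total  # prior P(C_i)
        for tok in tokens:
            score *= TOKEN_PAGES[tok][label] / n_pages  # P(x_k | C_i)
        scores[label] = score
    return scores


def classify(tokens):
    """Assign the class with the largest score."""
    scores = class_scores(tokens)
    return max(scores, key=scores.get)


print(class_scores(["token_a", "token_b"]), classify(["token_a", "token_b"]))
```

For a page containing both tokens, the Bad score is 5/9 * 4/5 * 3/5 and the Good score is 4/9 * 1/4 * 3/4, so the page is assigned to Bad.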
OCCURRENCE FREQUENCY
                                  Good word   Bad word   Total
Total web pages                         401        601    1002
Word "ánh sáng" (light)                 201          0     201
Word "khiêu dâm" (pornography)            0        301     301

OCCURRENCE FREQUENCY
                                  Good word   Bad word   Total
Total web pages                         401        601    1002
Word "ánh sáng" (light)                 202          1     203
Word "khiêu dâm" (pornography)            1        302     303
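The second table differs from the first only in that every word count is one higher. That looks like add-one (Laplace) smoothing, although this reading is an assumption, since the accompanying formulas did not survive. The sketch below shows why such an adjustment matters: with the raw counts, a single zero wipes out the whole product.

```python
PAGES = {"good": 401, "bad": 601}

# Word counts from the first table (raw) and the second table (plus one).
RAW = {
    "anh sang": {"good": 201, "bad": 0},
    "khieu dam": {"good": 0, "bad": 301},
}
PLUS_ONE = {
    "anh sang": {"good": 202, "bad": 1},
    "khieu dam": {"good": 1, "bad": 302},
}


def scores(word_counts, words):
    """P(C_i) multiplied by every P(x_k | C_i), using the given counts."""
    total = sum(PAGES.values())
    result = {}
    for label, n_pages in PAGES.items():
        score = n_pages / total
        for w in words:
            score *= word_counts[w][label] / n_pages
        result[label] = score
    return result


page = ["anh sang", "khieu dam"]  # a page that contains both words
print(scores(RAW, page))       # both classes score exactly zero
print(scores(PLUS_ONE, page))  # both classes get a small non-zero score
```

With the raw counts the classifier cannot choose between the classes for such a page, because both scores are zero; with the adjusted counts the comparison becomes meaningful again.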
4.2.7. Interface listing the URL tokens used to classify website URLs
Figure 4.12 Interface listing the URL tokens after training
4.3. Data collection
4.3.1. Collecting data for the URL TOKEN database
The prepared data consist of 100 URLs, 60 of which are actually bad. After processing, the filter flagged only 20 URLs as bad, of which only 18 were correct. When the filter fails to detect a bad URL, it goes on to check the page content.
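The figures in this paragraph can be turned into standard detection measures. The sketch assumes that "flagged 20, of which 18 were correct" means 18 true positives and 2 false positives; precision, recall, and F1 are not reported in the source and are computed here only as an illustration.

```python
def precision_recall_f1(true_pos: int, flagged: int, actual_bad: int):
    """Precision, recall and F1 for the 'bad URL' class."""
    precision = true_pos / flagged    # share of flagged URLs that are bad
    recall = true_pos / actual_bad    # share of bad URLs that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 60 bad URLs in the data, 20 flagged by the filter, 18 flagged correctly.
p, r, f1 = precision_recall_f1(true_pos=18, flagged=20, actual_bad=60)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this reading the URL module is precise but misses most bad URLs, which fits the design in which undetected URLs are passed on to the content check.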
The prepared data consist of 300 good files and 300 bad files collected from news websites and from websites with pornographic content. After processing by the filter, the experimental results are as follows:
Tokens used to          Files classified           Accuracy (%)
classify a file         Good files   Bad files     Good     Bad
Top 100 tokens          125/300      131/300       41.6%    43.7%
The experimental results show that the optimal N is about the 500 highest-ranked tokens for the classification computation.
[26] http://xahoithongtin.com.vn/2013061309041378p0c109/truy-cap-web-khieu-dam-ban-mat-gi.htm
[27] http://baohay.vn/chuyen-de/nhung-dieu-can-biet/288247/Web-sex-dang-tro-thanh-mon-giai-tri-o-chon-cong-so.html
[28] http://vi.wikipedia.org/wiki/Internet_t%E1%BA%A1i_Vi%E1%BB%87t_Nam
[29] http://vn.antoan.yahoo.com/qua-n-ly-n%E1%BB%99i-dung-ti-m-ki%C3%AA-152125467.html
[30] http://www.gltec.com.vn/tin-tuc/68-internet/2830-web-ngi-ln-ngay-cang-thu-hut-c-nhiu-qtin-q.html