
MINISTRY OF EDUCATION AND TRAINING

LẠC HỒNG UNIVERSITY
--------

NGUYỄN THANH PHONG

BUILDING A FILTER TO DETECT WEBSITES WITH PORNOGRAPHIC
CONTENT BASED ON URL AND TEXT CONTENT

Master's Thesis in Information Technology

ĐỒNG NAI, 2014
MINISTRY OF EDUCATION AND TRAINING
LẠC HỒNG UNIVERSITY
--------

NGUYỄN THANH PHONG

BUILDING A FILTER TO DETECT WEBSITES WITH PORNOGRAPHIC
CONTENT BASED ON URL AND TEXT CONTENT

Major: Information Technology
Code: 60480201

Master's Thesis in Information Technology

SCIENTIFIC ADVISOR

Dr. VŨ ĐỨC LUNG

ĐỒNG NAI, 2014
ACKNOWLEDGMENTS
First of all, I would like to express my sincere and deep gratitude to Dr. Vũ Đức Lung, who guided and wholeheartedly supported me throughout the completion of this thesis.

I also thank the lecturers of Lạc Hồng University for devotedly passing on their knowledge during my time at the university. That knowledge has helped me greatly in my studies and will continue to help in my later research.

Finally, I thank my family, friends, and colleagues for their help and for providing favorable conditions while I worked on this thesis.

Đồng Nai, August 2014

Respectfully,

Nguyễn Thanh Phong


DECLARATION
I declare that the results presented in this thesis are my own work and the outcome of independent study and scientific research. Throughout the thesis, the material presented is either my own or synthesized from multiple sources. All references have a clear origin and are properly cited.

I take full responsibility for this declaration and accept any disciplinary action prescribed by the regulations.

Đồng Nai, August 2014

Author

Nguyễn Thanh Phong


TABLE OF CONTENTS
ACKNOWLEDGMENTS

DECLARATION

TABLE OF CONTENTS

LIST OF ABBREVIATIONS

LIST OF TABLES AND FIGURES

INTRODUCTION

Rationale for the topic
Objectives
Scope of work
Methodology
CHAPTER 1. OVERVIEW OF DATA EXTRACTION AND FILTERING ON WEBSITES
1.1 Introduction
1.2 Types of filters for pornographic web content
1.2.1 Web filters based on network addresses
1.2.2 Web filters based on URLs
1.2.3 Web filters based on DNS
1.2.4 Web filters based on keywords
1.2.5 Web filters based on text content and images
1.3 Related work
CHAPTER 2. THEORY APPLIED IN THE THESIS
2.1 Extracting website content
2.1.1 HTML code analysis
2.1.2 Template comparison
2.1.3 Natural language processing
2.2 Parsing content into tokens
2.2.1 Data preprocessing
2.2.2 Sentence segmentation based on Maximum Entropy
2.2.3 Word segmentation
2.2.3.1 The Maximum Matching method
2.2.3.2 The Transformation-based Learning (TBL) method
2.2.3.3 Word segmentation with WFST and neural networks
2.2.3.4 Vietnamese word segmentation based on Internet statistics and a genetic algorithm
2.2.4 The KEA algorithm
2.2.4.1 Candidate phrase selection
2.2.4.2 Feature computation
2.2.4.3 Training
2.2.4.4 Keyphrase extraction
2.2.5 The KIP algorithm
2.2.6 Named entity recognition
2.3 URL analysis
CHAPTER 3. A SOLUTION FOR FILTERING PORNOGRAPHIC WEBSITES BASED ON URL AND TEXT CONTENT
3.1 System model analysis
3.2 URL-based processing module
3.3 Content-based filtering module
3.3.1 Training phase
3.3.1.1 Text preprocessing
3.3.1.2 Feature extraction
3.3.1.3 The Naïve Bayes algorithm
3.3.2 Classification and recognition phase
CHAPTER 4. EXPERIMENTS AND EVALUATION OF RESULTS
4.1 Test environment
4.2 Program interface
4.2.1 Main interface
4.2.2 Word-learning interface for obtaining TOKENs to classify website content
4.2.3 Interface for reviewing single-word TOKENs to add to the TOKEN list
4.2.4 Interface for reviewing compound-word TOKENs to add to the TOKEN list
4.2.5 Interface listing the word TOKENs used to classify website content
4.2.6 Interface for obtaining URL TOKENs
4.2.7 Interface listing the URL TOKENs used to classify website URLs
4.3 Data collection
4.3.1 Collecting data for the URL TOKEN database
4.3.2 Collecting data for the content TOKEN database
4.4 Evaluation of experimental results
CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF ABBREVIATIONS

Abbreviation    Meaning

KNN             K-Nearest Neighbor
LDA             Latent Dirichlet Allocation
LLSF            Linear Least Squares Fit
LRMM            Left Right Maximum Matching
MM              Maximum Matching
NB              Naïve Bayes
pLSA            Probabilistic Latent Semantic Analysis
SVM             Support Vector Machine
TBL             Transformation-based Learning
TF              Term Frequency
WFST            Weighted Finite State Transducer


LIST OF FIGURES
Figure 2.1 - The VietSpider content extraction system
Figure 2.2 - Model for extracting main content by template comparison
Figure 2.3 - Diagram of the KEA algorithm
Figure 3.1 - System model for filtering websites with pornographic content
Figure 3.2 - Training process for obtaining URL TOKENs
Figure 3.3 - Training process for content TOKENs
Figure 3.4 - Word segmentation process
Figure 3.5 - Word training model
Figure 3.6 - Model of the classification phase
Figure 4.1 - Interface at filter startup
Figure 4.2 - Login interface
Figure 4.3 - Interface when browsing a good web address
Figure 4.4 - Interface when browsing a bad web address
Figure 4.5 - Interface listing good and bad web addresses
Figure 4.6 - System functions interface
Figure 4.7 - Training interface for single and compound words
Figure 4.8 - Interface for reviewing single-word TOKENs to add to the TOKEN list
Figure 4.9 - Interface for reviewing compound-word TOKENs to add to the TOKEN list
Figure 4.10 - Interface for reviewing single-word and compound-word TOKENs
Figure 4.11 - URL TOKEN training interface
Figure 4.12 - Interface listing URL TOKENs after training
Figure 4.13 - Collected URLs
Figure 4.14 - Collected good files
Figure 4.15 - Collected bad files
Figure 4.16 - Content TOKEN database after training
LIST OF TABLES
Table 1.1 - NetProject evaluation results
Table 1.2 - Some URL-based web filtering products
Table 2.1 - Identifying candidate phrases
Table 3.1 - Statistics on some common Vietnamese dictionaries
Table 3.2 - Dictionary table statistics
Table 3.3 - Example of TOKEN occurrence frequencies
Table 3.4 - Example of TOKEN occurrence frequencies before smoothing
Table 3.5 - Example of TOKEN occurrence frequencies after smoothing
Table 4.1 - Experimental results on content files
Table 4.2 - Experimental results on URLs

INTRODUCTION
1. Rationale for the topic

The Internet first appeared in the 1960s, but at that time it was used only internally and mainly for military purposes. Vietnam officially joined the global Internet on 19/11/1997. After more than a decade of operation, the Internet has become a term almost everyone knows and a medium almost everyone uses, and some groups depend on it entirely. Its influence spread rapidly once it began to serve entertainment: people can not only look up documents but also watch films, listen to music, and play games online. Millions upon millions of people go online every day, but few of them do so to work, study, or consult documents.
The rapid growth of the Internet is a welcome sign of the progress of information technology in a modern society. Behind it, however, lie the consequences the Internet brings, especially for young people. Alongside online games, visiting unhealthy sites out of curiosity to read sex stories, view pornographic images, or watch sex films has become common. The harm is that such material can push viewers toward immediate sexual activity, drawing minors into social evils such as prostitution and rape. [1]
Pornographic websites affect not only the sexual behavior of young people but also work ethics in the office [2]. They also compromise the security of users' personal computers and of office networks through malware. How to prevent users from accessing websites with pornographic content is therefore a problem of public concern. Many software products addressing it have been developed both in Vietnam and abroad.
Domestic work includes the following software: Killporn by Nguyễn Hữu Bình; VwebFilter (VWF), built by the Vietnam Datacommunication Company (Công ty Điện toán và Truyền số liệu); Depraved Web Killer (DWK) by Vũ Lương Bằng, an employee of the East Telephone Company, District 10 (Ho Chi Minh City); MiniFireWall 4.0 (MFW) by Huỳnh Ngọc Ẩn (IT department, Đồng Tháp Provincial Post Office); and "A filter to detect websites with unhealthy content", the master's thesis in information technology of Cao Nguyễn Thủy Tiên.
Foreign work includes the following software: STOP P-O-R-N 5.5, published by PB Software LLC; K9 Web Protection, published by Blue Coat Systems; Media Detective 2.3, published by Tap Tap Software; Parental Filter 3.0, published by NWSP Software Design; ScrubLT 3.2.2.0, published by CrubLT; CyberSitter, published by Solid Oak Software; and iShield 1.0, published by Guardware.
In practice, most foreign products require payment. They usually filter pornographic images, and their filtering of pornographic text works mainly for English and only to a limited extent for Vietnamese. Domestic products, for their part, are still limited in blocking common pornographic keywords and in blocking the specific URLs of a website. Many problems therefore remain to be studied, clarified, and improved, which is why the topic "Building a filter to detect websites with pornographic content based on URL and TEXT CONTENT" was chosen for this thesis.

[1] http://vi.wikipedia.org/wiki/Internet_t%E1%BA%A1i_Vi%E1%BB%87t_Nam
[2] http://baohay.vn/chuyen-de/nhung-dieu-can-biet/288247/Web-sex-dang-tro-thanh-mon-giai-tri-o-chon-cong-so.html

2. Objectives

Build a web filter that can automatically detect whether a website being accessed has pornographic content, based on the website's URL and TEXT CONTENT.

3. Scope of work

Collect the URLs and TEXT CONTENT of pornographic and non-pornographic websites to build a data set of trained word tokens used to classify websites as pornographic or non-pornographic.
Study how to mine the URL and TEXT CONTENT of a website, and from that propose a model for filtering pornographic websites based on URL and TEXT CONTENT.
Implement the website filter to put the research into practice.

4. Methodology

Use existing domestic and foreign tools and software to collect data from news sites and from websites with pornographic content. The data to collect are the URL and the content of each website.
To classify the URL of a visited website as good or bad, the filter relies on the ToKenURL list. The ToKenURL list consists of words and phrases trained from the collected URLs.
To classify the content of a visited website as pornographic or not, the filter relies on the content ToKen list. This list is built by training on the collected good and bad data sets: the occurrence rate of each word ToKen in the good and bad sets is computed, the ToKens with high weights are selected, and these are compared against the dictionary data to choose the characteristic word ToKens used to classify website content.
Study and apply sentence segmentation and word segmentation algorithms for Vietnamese text, combined with the Naïve Bayes algorithm, to compute the probability of the text content of a visited website and classify the website as pornographic or non-pornographic.
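
The content-classification step described above can be sketched as follows. This is only a minimal illustration, assuming a multinomial Naïve Bayes model with add-one (Laplace) smoothing over tokens that have already been segmented. The function names, the class labels `good` and `bad`, and the toy training data are hypothetical and are not taken from the thesis implementation.

```python
import math
from collections import Counter

def train(docs_by_class):
    """Count tokens per class. docs_by_class maps a label to a list of token lists."""
    vocab = set()
    token_counts = {}
    doc_counts = {}
    for label, docs in docs_by_class.items():
        counts = Counter()
        for tokens in docs:
            counts.update(tokens)
        token_counts[label] = counts
        doc_counts[label] = len(docs)
        vocab.update(counts)
    return {"vocab": vocab, "token_counts": token_counts, "doc_counts": doc_counts}

def log_posteriors(model, tokens):
    """Return the unnormalized log posterior of each class for a token list."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counts in model["token_counts"].items():
        class_total = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)  # class prior
        for tok in tokens:
            if tok in model["vocab"]:  # tokens never seen in training are ignored
                # add-one smoothing avoids zero probabilities
                score += math.log((counts[tok] + 1) / (class_total + vocab_size))
        scores[label] = score
    return scores

def classify(model, tokens):
    """Pick the class with the highest log posterior."""
    scores = log_posteriors(model, tokens)
    return max(scores, key=scores.get)

# Hypothetical toy training sets (already tokenized).
model = train({
    "good": [["news", "school", "health"], ["school", "sport", "news"]],
    "bad": [["sex", "porn", "video"], ["porn", "sex", "free"]],
})
label_a = classify(model, ["free", "porn", "news"])
label_b = classify(model, ["school", "news"])
```

Working in log space keeps the product of many small probabilities from underflowing; a real filter would also need the Vietnamese word segmentation step before tokens reach `classify`.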

CHAPTER 1. OVERVIEW OF DATA EXTRACTION AND FILTERING ON WEBSITES

1.1. Introduction

Web page filtering is not a new problem, but detecting web pages with pornographic content and images remains necessary. In families that use the Internet, parents do not want their children exposed to such pages. Several countries, including Vietnam, have been studying effective ways to block pornographic web pages, so there is a continuing need to develop software that detects and blocks them as an additional safety measure. Many pornographic web pages contain not only pornographic text and images but also malware, adware, spyware, and viruses.
In general, web pages are classified as pornographic mainly on the basis of two factors: pornographic images and pornographic text. This thesis is limited to detecting and blocking web pages with pornographic content based on the URL and TEXT CONTENT of the page.

1.2. Types of filters for pornographic web content

Blocking pornographic content on the Internet is not simple. There are billions of links, and it is far from easy to know for certain which ones are objectionable in such a huge and disorderly mass of information. Below are some of the approaches commonly used in filters for "black" (pornographic) websites.
1.2.1. Web filters based on network addresses
Filters based on a black list and a white list
This is the measure used by most tools that block objectionable websites. They build, group, and classify web pages to establish whether the main content of a given domain belongs on the black list or the white list (the white list contains the websites that may be accessed, the black list contains the forbidden ones). This can be done by machine or with the help of the wider Internet community. The approach is quite effective, blocking almost 99% of popular sex websites. Its weakness is that it sometimes misses small sex websites, because such sites appear in large numbers every day and no software can add all of them to its black list.
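
A list-based check of this kind can be sketched in a few lines. This is a hedged illustration, not the mechanism of any particular product: the two sets, the helper names, and the rule of stripping a leading `www.` are assumptions made for the example, and the domains are taken from examples used elsewhere in this chapter.

```python
from urllib.parse import urlsplit

# Hypothetical lists; a real product ships and updates much larger ones.
WHITE_LIST = {"lhu.edu.vn"}
BLACK_LIST = {"sexviet.com"}

def host_of(url):
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def check(url):
    """Classify a URL as 'allow', 'block', or 'unknown' by list lookup only."""
    host = host_of(url)
    if host in WHITE_LIST:
        return "allow"
    if host in BLACK_LIST:
        return "block"
    return "unknown"

results = [check(u) for u in (
    "http://www.lhu.edu.vn/tin-tuc",
    "www.sexviet.com",
    "http://new-site.example/",
)]
```

The `unknown` outcome is the weakness noted above: a site that has not yet been listed cannot be judged by lookup alone.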
Filtering by IP address
This technique blocks traffic directly on the network path using the IP addresses of a website. It can be practical when websites are commonly reached by IP address, or can be reached by IP instead of by DNS name. In most cases it is not recommended, because of three shortcomings:
+ Blocking access to an IP address also blocks traffic to every virtually hosted site on the same IP, whether or not its content is related to the forbidden material.
+ Blocking access to an IP address also blocks traffic to every member of a portal hosted on that IP. The site is blocked as a whole rather than as one part or a subset of its pages.
+ Filtered websites change their addresses frequently, as soon as their owners discover that they are being filtered. This relies on DNS, so users can still reach the site.
The table below compares the filtering results of several products according to the website survey of the NetProject project.
Table 1.1 - NetProject evaluation results

Filtering software          Correct blocking rate    Effectiveness rate

BizGuard                    55%                      10%
Cyber Patrol                52%                      2%
Cyber Sitter                46%                      3%
Cyber Snoop                 65%                      23%
Norton Internet Security    45%                      6%
SurfMonkey                  65%                      11%
X-Stop                      65%                      4%

Firewall
A firewall is a technique integrated into a network to prevent unauthorized access, protect internal information sources, and limit unwanted intrusion into the system. A firewall is normally placed between the internal network (intranet) of a company or organization and the Internet. Its main role is to secure information, block unwanted access from outside, and forbid access from inside the intranet to certain addresses on the Internet.
Advantages: most firewall systems use packet filters. One advantage of this approach is low cost, because a packet filtering mechanism is already included in every piece of router software.
Limitations: defining packet filtering rules is complex and requires the network administrator to have detailed knowledge of Internet services, packet header formats, and so on.
1.2.2. Web filters based on URL (Uniform Resource Locator)
Based on URL keywords
Web filters of this kind usually come with a prepared list of adult keywords used to recognize the web addresses to block. A URL keyword is a substring of a web address, and pages whose URLs contain such a substring usually have pornographic content.
Pornographic websites often use pornographic or sexual words in their domain names so that the name is memorable and users can easily find the site with search engines. In practice, few websites with wholesome content choose such domain names. Websites whose URLs contain these keywords should therefore be blocked immediately, without examining their content.
Example: the following websites all have pornographic content
www.sexviet.com
www.sex700.com
www.sexygirls.com
because each contains the keyword "sex". The following pornographic websites are further examples:
www.freeporns.com
www.asiaporns.com
www.childporn.com
These sites all contain the keyword "porn".

Advantages: simple yet fairly reliable.

Limitations: a page whose URL contains no pornographic keyword but whose content is objectionable will be let through. Conversely, a wholesome sex education page whose URL contains the keyword "sex" will be blocked.
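
The substring test and both of its failure modes can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword tuple and function names are hypothetical, and the `.example` addresses are invented to show the false positive and the false negative.

```python
# Hypothetical keyword list, using the two keywords from the examples above.
URL_KEYWORDS = ("sex", "porn")

def matched_keywords(url):
    """Return the keywords that occur as substrings of the URL."""
    lowered = url.lower()
    return [kw for kw in URL_KEYWORDS if kw in lowered]

def is_blocked(url):
    """Block a URL as soon as any keyword appears in it."""
    return bool(matched_keywords(url))

blocked_example = is_blocked("www.sexviet.com")
# False positive: a harmless place name contains the substring "sex".
false_positive = is_blocked("www.essex-library.example")
# False negative: nothing in this address hints at the page content.
false_negative = is_blocked("www.holiday-photos.example")
```

Because the test only sees the address, it is cheap enough to run before any page content is downloaded, which is why it is typically combined with a content-based check rather than used alone.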
Based on the URL
This technique filters by observing web (HTTP) traffic: it monitors the URL and the Host field inside HTTP requests to identify the destination of each request. The Host field is used by web hosting servers to determine which resource to return.
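
How a filter might recover the requested URL from the request line and the Host field can be sketched as below. This is an illustrative assumption rather than the method of any specific product: it handles only a plain-text HTTP/1.1 request whose target is a path, and the function name and sample request are hypothetical.

```python
def requested_url(raw_request: bytes) -> str:
    """Rebuild the requested URL from the request line and the Host header."""
    # Headers end at the first blank line; the body (if any) is ignored.
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = head.split("\r\n")
    _method, target, _version = lines[0].split(" ", 2)
    host = ""
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":  # header names are case-insensitive
            host = value.strip()
            break
    return "http://" + host + target

sample = (
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: www.example.com\r\n"
    b"User-Agent: demo\r\n"
    b"\r\n"
)
url = requested_url(sample)
```

Two sites sharing one IP address are distinguished here only by the Host value, which is what lets URL filtering avoid the shared-hosting problem of IP filtering.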
URL-based web filtering is usually placed under the broad heading of "Content Management". URL filtering techniques derive from two filtering styles, "pass-by" and "pass-through".
"Pass-by" filtering: traffic is processed on the network without the filter sitting directly in the path between the user and the Internet. The original request is forwarded to the destination web server. If the request is judged inappropriate, the filter blocks the requested pages from being returned. With this technique the filtering device does not have to route requests, so if the device fails, network traffic continues to flow normally.
"Pass-through" filtering: a device is placed in the path of all user requests, so all network traffic passes through the pass-through filter, which is the actual filtering device. Such a filter is usually built into a firewall, router, application switch, proxy server, or cache server.
Customizable URL filters
The distinctive feature of products in this category is that the user can specify URLs by adding them to or removing them from the "Bad Site List", although the websites originally in the list cannot be removed. Below is a list of popular web filtering products.

Table 1.2 - Some URL-based web filtering products

Product              Vendor (company)

Smartfilter          Secure Computing
Web Filter           SurfControl
Web Security         Symantec
Bt-WebFilter         Burst Technology
CyBlock WebFilter    Wavecrest Computing

Advantages:
Virtual websites are not affected: this technique does not affect virtual web servers that share an IP address with restricted websites. A blocked website and an unblocked website can share the same IP address.
IP changes have no effect: in most situations, a change in the IP address of a restricted website does not affect this method, because the filtering does not depend on the IP address. Website owners can switch to any IP address they like, but users behind the filter still cannot access the site.
Limitations:
It usually cannot block non-standard ports:
+ The filter works well for web servers on the standard port.
+ Websites on non-standard ports are harder to block because they require a more advanced level of filtering.
+ A URL filtering solution therefore needs the technical ability to handle HTTP connections on non-standard ports.
It does not work with encrypted traffic: when HTTP requests are sent over SSL/TLS they are encrypted, so URL filtering cannot read the Host field. The filter therefore cannot determine which resource on an IP address the request is actually directed to.

In summary, servers need a filter to remove some objectionable websites, but the filter can slow the system down.
1.2.3. Web filters based on DNS
A filtered website becomes completely unreachable from every configuration that uses the filtering nameserver as its resolver, because the filtering nameserver returns invalid information whenever it is asked to resolve a hostname of a filtered website. The documents on the server hosting that website therefore cannot be reached. Unfiltered websites remain accessible as long as their hostnames differ from those of the filtered websites. Since the filtering nameserver does not return invalid information for their names, correct data is returned to any user who requests name resolution, and the website can be accessed as usual.

Advantages:
Multi-protocol: it works for http, ftp, gopher, and any other protocol that relies on the name system.
Not affected by IP changes: changing the IP address of a website has no effect on this method, which is completely independent of IP addresses.
Limitations:
Ineffective for URLs that contain an IP address:
Most website addresses use DNS names (www.lhu.edu.vn), but some are specified by an IP address instead of a DNS name (http://118.69.126.40).
In that case the site is reached by its IP address, without using its DNS name.
The whole web server is blocked: the technique does not allow selective blocking of individual pages on a web server. If the forbidden page is www.exp.com/bad.htm, all access to www.exp.com may be blocked, even for pages that are not on the block list.
Subdomains are affected: technically, a single domain name such as example.com in the URL http://www.example.com is used to reach the web server. At the same time, the domain name may serve as the parent domain of other hosts such as host1.example.com. In that case, DNS addresses under www.example.com may be resolved incorrectly, name resolution for the subdomains may also go wrong, and other network services such as e-mail may be affected.
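
The behavior of a filtering nameserver, and the way an IP-literal URL bypasses it, can be sketched as below. This is a toy model under stated assumptions: the in-memory `RECORDS` table stands in for real DNS resolution, `0.0.0.0` is just one possible form of "invalid information", and all `.example` hostnames and documentation-range addresses are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

BLOCKED_HOSTNAMES = {"blocked.example"}  # hypothetical filter list
RECORDS = {                              # hypothetical upstream answers
    "blocked.example": "203.0.113.10",
    "allowed.example": "198.51.100.7",
}
SINKHOLE = "0.0.0.0"  # invalid answer returned for filtered names

def filtered_resolve(hostname):
    """Answer a lookup the way a filtering nameserver would."""
    hostname = hostname.lower().rstrip(".")
    if hostname in BLOCKED_HOSTNAMES:
        return SINKHOLE
    return RECORDS.get(hostname)

def needs_dns(url):
    """Return False when the URL host is already an IP address literal."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return False
    except ValueError:
        return True

blocked_answer = filtered_resolve("blocked.example")
allowed_answer = filtered_resolve("allowed.example")
bypasses_filter = not needs_dns("http://118.69.126.40/")
```

The lookup key is the hostname only, so `blocked.example/bad.htm` and every other page on that host receive the same answer; this is the whole-server blocking limitation described above.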
1.2.4. Web filters based on keywords
Like the URL keyword approach, this approach uses a list of keywords to recognize pages to block. A forbidden page contains many invalid keywords, and this is the basis for recognizing it. What matters most for this method is the meaning of a keyword in its context, and this is where the system can make mistakes when deciding whether a page may be displayed.
A website specializing in cancer may be blocked because of an article on breast cancer ("bệnh ung thư vú"). If the article mentions the word "vú" (breast) too often, and that word is on the list of blocked keywords, the system will mistakenly block the page.
The next problem is words that are misspelled, deliberately or not. Some pages with objectionable content alter their wording to deceive the filtering system. A human reader immediately recognizes the altered words as misspellings, but the filtering system is strongly affected by them.
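
Both problems can be reproduced with a very small keyword counter. This is a sketch under stated assumptions, not a description of any product: the keyword set, the threshold of three hits, and the sample texts are hypothetical, and the English word "breast" stands in for the Vietnamese keyword from the example above.

```python
import re

BLOCKED_KEYWORDS = {"sex", "breast"}  # hypothetical keyword list
THRESHOLD = 3                         # hypothetical number of hits that triggers a block

def keyword_hits(text):
    """Count how many words of the text are blocked keywords."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in BLOCKED_KEYWORDS)

def is_blocked(text):
    """Block the page when the keyword count reaches the threshold."""
    return keyword_hits(text) >= THRESHOLD

medical_page = ("Breast cancer is common. Early breast screening helps. "
                "Breast self-exams matter.")
obfuscated_page = "s3x s3x s.e.x s3x"

medical_blocked = is_blocked(medical_page)        # context is ignored
obfuscated_blocked = is_blocked(obfuscated_page)  # misspellings slip through
```

Counting words without context blocks the medical page, while simple character substitutions bring the count for the second page to zero.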
1.2.5. Filters based on text content and images
In the Stanford CS229 project by Saikat Sen, the author builds a filter based on text content and images using three main techniques: image analysis, text analysis, and ranking.
For text analysis, the author checks:
+ Page title: if the page title contains an adult word, the page is classified as an adult page.
+ Keywords: for adult web pages and pages whose content resembles adult pages, the page's keywords are looked up in a dictionary of adult vocabulary.
+ URL: the words in the URL are split into substrings and looked up in the adult dictionary. Since no good online adult dictionary was available, one was built using a custom application and the Princeton WordNet lexical database. The custom application lets the user choose a set of seed words, outputs their synonyms in each iteration, and lets the user classify the synonyms as adult, gray, or clean before the next iteration, which repeats the process with the adult synonyms. Classifying the synonyms in every iteration is necessary, because otherwise the list quickly grows to thousands of words with many different meanings. The application is run repeatedly until no new words appear. The final list consists of two files: adult.txt and gray.txt. adult.txt contains confirmed adult words, which are filtered. gray.txt contains words that should be filtered but that can appear in both adult and non-adult content. The goal is for the filter to learn an appropriate number of words over the training rounds. The collected vocabulary contains 106 black-list words and 26 gray words.
+ Page content: according to the author, page content is an important deciding factor.
For image analysis, various image recognition techniques are used. The OpenCV packages are used for image recognition and ML classification. For ranking, AdultRank is used, a ranking measure similar to PageRank.
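
The iterative dictionary-building procedure described above can be sketched as follows. This is an illustration only: a small hand-written dictionary stands in for Princeton WordNet, a lookup table stands in for the human who labels each synonym, and all names (`expand_lexicon`, `THESAURUS`, `LABELS`) and words are hypothetical.

```python
def expand_lexicon(seeds, synonyms_of, label_word):
    """Grow adult and gray word lists by repeatedly expanding adult words only."""
    adult = set(seeds)
    gray = set()
    seen = set(seeds)
    frontier = set(seeds)
    while frontier:  # stop when an iteration yields no new adult words
        candidates = set()
        for word in frontier:
            candidates.update(synonyms_of(word))
        candidates -= seen       # only words not shown to the labeler before
        seen |= candidates
        frontier = set()
        for word in sorted(candidates):
            label = label_word(word)  # "adult", "gray", or "clean"
            if label == "adult":
                adult.add(word)
                frontier.add(word)    # only adult words are expanded further
            elif label == "gray":
                gray.add(word)
    return adult, gray

# Hypothetical stand-ins for WordNet and for the human labeler.
THESAURUS = {
    "porn": ["erotica", "smut"],
    "erotica": ["porn", "romance"],
    "smut": ["dirt"],
    "romance": ["novel"],
    "dirt": ["soil"],
}
LABELS = {"erotica": "adult", "smut": "adult", "romance": "gray", "dirt": "clean"}

adult_words, gray_words = expand_lexicon(
    ["porn"],
    lambda word: THESAURUS.get(word, []),
    lambda word: LABELS.get(word, "clean"),
)
```

Because gray and clean words are never expanded, "novel" and "soil" are never reached, which mirrors the point that labeling at every iteration keeps the list from growing without bound.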
1.3. Related work
In Vietnam there are several projects, such as "Research and development of a content filtering system to support the management and assurance of information safety and security on the Internet", led by Dr. Nguyễn Viết Thế of the Department of Information Technology, from 01/04/2006 to 01/03/2008. The project aimed to study and propose solutions for effectively managing the safety and security of data flows into and out of Vietnam over the Internet in particular and between wide area networks in general. It also aimed to develop software tools and hardware devices that process large volumes of data in real time (parallel computing, grid computing) and can detect and block information (images and text in both Vietnamese and English) whose content is inconsistent with Vietnamese culture and law or harmful to social order and safety. The system was deployed and tested at the inbound/outbound gateway of the University of Engineering and Technology, at the Ministry of Public Security, and at the national Internet gateway at the Vietnam Datacommunication Company (VDC).
The document "Research, analysis, and proposal of legal policy in Vietnam on the filtering of information content on the Internet" is a thematic report within the state-level project with code 02/2006/HĐ-ĐTCT-KC.01/06-10, by the Department of Professional Information Technology, Ministry of Public Security.
DWK4.1: Depraved Web Killer (DWK), by Vũ Lương Bằng, was a finalist in the 2004 Trí Tuệ Việt Nam (Vietnamese Talents) competition. At the time of writing, the latest version is v4.1 (2011), with functions such as blocking websites with objectionable content (by keyword and URL), logging the programs run on the machine, logging the websites visited, logging the objectionable websites that the software blocked, and sending the logs to an e-mail address set by the user.
The REMPARO web filtering solution is a product of the company Chp Sng and Ashmanov, developed on artificial intelligence technology for natural language understanding. Remparo is a semantic web filtering solution that blocks access to websites with objectionable or inappropriate content. Any page passing through the Remparo filter whose content falls under topics such as pornography, violence, or reactionary politics is recognized by the system, which then takes an appropriate action: allowing or blocking the page, or other actions such as issuing a warning or redirecting the request, depending on the wishes of the administrator.
Besides modules that filter by keyword and by black list/white list, Remparo integrates a content-based filtering module built on artificial intelligence technology for understanding Vietnamese natural language.
Remparo also offers new features: it is easy to use and needs no software installation; it has a 99% blocking rate and can filter newly appearing websites not yet in the database; it blocks individual web pages rather than whole websites; the filtering is done entirely on the network operator's side; an access management system can be set up from a central point; no time is spent on monitoring and control; and Internet access speed is not affected.
The master's thesis "Building a filter to detect websites with unhealthy content" by Cao Nguyễn Thủy Tiên (2011, Lạc Hồng University) studies the characteristics and growth of websites with unhealthy content and analyzes existing web filtering systems. From this it proposes a model that can automatically detect Vietnamese-language web pages with unhealthy content, using techniques for extracting information from websites and text mining, in particular the Naive Bayes algorithm, to determine a probability threshold for classifying unhealthy websites. That thesis classifies unhealthy websites only by their content and does not classify their URLs, so the filter is not very fast.
Many methods for blocking unwanted websites now exist worldwide, but most require technical knowledge of computing (using proxies, firewalls, filters, antivirus software, spyware blockers, and so on). This is a difficulty for parents, most of whom have no deep expertise in the field. The software is also sold at fairly high prices and usually blocks websites only by black and white lists, without emphasis on automatically analyzing content, while the websites themselves keep changing addresses to get around the lists. Products include the following:
ChildWebGuardian PRO: an application designed to give children a safe web browsing experience. It monitors and checks the content of every web page the user tries to visit. If it finds pornographic content, ChildWebGuardian PRO immediately blocks the page. It includes several control functions: a content filter, a pornographic page filter, parental control, URL blocking, Internet access blocking, and game control. Each function is a substantial obstacle for anyone looking for pornographic material on the Internet, and every page is checked by all of them before a child sees it.
The Parental Control feature integrated into Kaspersky Internet Security 2010.
K9 Web Protection: lets the user set restrictions on access time and lists of websites that are allowed or forbidden.
Media Detective: a utility that helps find and remove pornographic or unhealthy content on the user's computer and blocks access to "black" websites. Its features include detecting pornographic images, scanning for keywords, multiple view modes, checking ZIP files and file extensions, and scanning Word documents for embedded images.
Anti-Porn (AP): a fairly good program for blocking "black" websites, thanks to a fairly complete database of unhealthy websites, mainly in English.
Internet Lock: a Windows program that uses passwords to control Internet access, web browsing, chat, and e-mail.
Net Nanny: software designed for families that want a tool for monitoring their children's Internet use.
SurfControl Enterprise Threat Protection: software from SurfControl, designed to filter the web and block at the proxy by URL and keyword, with about 20 blocking methods.
Internet Filter Web Filters: monitoring and blocking software developed by iPrism Internet Filters & Web Filters. It is advertised as using dynamic web filtering to control web content right at the gateway. According to the vendor's administration guide, however, it also shows traces of keyword-based filtering.
FamilyWall: l phn mm bc tng la chy thng tr trn my tnh ca
ngi s dng. Chc nng ch yu ca FamilyWall l ngn chn vic truy cp cc
Website c ni dung xu trn mng Internet, bao gm cc lp kim sot chnh sau:
cc t kha c ni dung xu, ni dung cc trang Web, danh sch cc Website xu
c pht hin,
In general, the software and tools above are good at blocking unwanted web pages by means of blacklists, whitelists, and English keywords. However, most of them have no mechanism for self-learning or self-improvement to adapt to changes or to newly added data from unwanted web pages, and most were developed to block English-language pages rather than Vietnamese ones.

CHAPTER 2. THEORIES
APPLIED IN THE THESIS
2.1. Extracting website content

Content extraction from the web is usually performed with crawlers or wrappers. A wrapper is a procedure designed to extract the content of interest from a given information source. Many studies around the world have used different methods of building wrappers to extract information from the web. These methods include:
+ HTML code analysis
+ Template comparison
+ Natural language processing
Like Google News, a content harvesting and aggregation system collects, aggregates, and stores content and then republishes it to users. The wrapper takes as input the configuration of a website (news, blog, ...), extracts the content, aggregates related topics, stores the result in a database, and republishes it to end users. The content is extracted whole and clean and is aggregated from many sources, which helps readers follow, control, search, edit, store, and publish it.
The difficulty of the problem is that not all of the content of a web page is needed. If only the HTML script strings are removed, the filtered content still contains a great deal of unnecessary noise, for example advertisements, latest-news boxes, news summaries, and menus. Content of this kind usually has to be skipped when extracting the main content of a web page.
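The difficulty described above can be seen in a minimal sketch that only removes HTML markup, assuming Python's standard `html.parser` module. The class name `TextExtractor`, the function `strip_html`, and the sample page are hypothetical and are not part of the thesis.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring <script> and <style> bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_html(page):
    """Return the text chunks of a page in document order."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.chunks


# Hypothetical page: menu and advertisement text survive tag removal.
PAGE = (
    "<html><head><script>var a = 1;</script></head><body>"
    "<ul><li>Home</li><li>Ads</li></ul>"
    "<p>Main article text.</p></body></html>"
)
print(strip_html(PAGE))
```

The menu entries remain next to the article text, which is why the methods in the following subsections try to locate the main content region rather than just stripping tags.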

2.1.1. HTML code analysis

VietSpider³ by Nhữ Đình Thuần is a semantic extraction program: it accesses the full content of a page directly and then extracts it. Metadata are then built automatically on top of the extracted content. After this process the content becomes independent of the source website and is stored and reused for other purposes.

³ http://nhuthuan.blogspot.com/2006/11/s-lc-v-k-thut-trong-vietspider-3.html

Figure 2.1: The VietSpider content extraction system

Nhữ Đình Thuần's system also implements a data mining model that aggregates related content, and a topic tracking model that follows ongoing events in chronological order. The implemented algorithmic model is LOR (Linked Object Representation), supported by the Stopping technique in content analysis. The indexing and searching solution is modified and improved from Lucene Search, a well-known open-source solution from Apache.
However, the program's limitation is that the path to the main content region must be specified for each domain before extraction. This makes things difficult when the system encounters a completely new website.
2.1.2. Template comparison
Information extraction by matching two web pages, built on pattern recognition, was used by Trang Nhật Quang to extract content for publishing news on an administrative website. The method matches the page to be extracted against a template page in order to determine the layout frame shared by both pages, and then extracts the content located in the part that is identified as the main content on the template page.

Figure 2.2: Model of main content extraction by template comparison

(a) The web page whose main content is to be extracted
(b) The template web page (determined in advance)
(c) The main content obtained after matching and extraction

This method does not require the user to know the wrapper construction language or to change the wrapper when the layout changes, because the template page can be taken directly from the home page and has the same layout as the page to be extracted. However, for each domain a page must be chosen as the template for the other pages. This is also a limitation when automating the identification of the main content of a web page.

2.1.3. Natural language processing

This method applies natural language processing techniques to documents whose information usually has no fixed structure (such as stories). The techniques consider syntactic and semantic constraints to recognize relevant information and extract what is needed for later processing steps. Tools using this method are suitable for extracting information from web pages that contain passages following grammatical rules. Tools that use natural language processing for content extraction include WHISK and RAPIER.
The method also depends on the language of the web page being processed. For Vietnamese there is the study "Extracting the main content of a web page based on the context of the page" by Hồ Anh Thư⁴. The study determines the main content of a web page from the context of the content and then extracts a summary of the content using a method of selecting dominant sentences.
The main content is determined in the following steps:
+ Remove formatting information.
+ Split content regions by structure, specifically using the TABLE tag to separate text regions.
+ Determine the degree of content relatedness in order to merge adjacent regions.
+ Select the largest text region for further processing.
However, this method has several drawbacks:
+ Depending on the level of processing, the extraction depends more or less heavily on the language being processed.
+ The method relies on a similarity measure between regions to merge them and thereby determine the main content. If the main content is spread over several tables whose information is only loosely related, it is difficult to extend the region and identify a main content region that contains all of them.
+ When a region (table) contains too little information, the similarity computation and the extension of the main content region are affected.

2.2. Splitting content into tokens


The term "token" is rendered in Vietnamese as a sign or mark. Once the content of a web page has been extracted, classifying whether the page is pornographic requires data preprocessing, sentence segmentation, and word segmentation.
⁴ http://vietnam.usembassy.gov/educational_exchange.html

2.2.1. Data preprocessing
Text preprocessing usually follows this sequence:
+ Extract the text content: fetch the content of the web pages to be processed, remove the HTML tags, and extract the page content.
+ Phrase splitting: for each document whose content has been extracted, remove unnecessary symbols and digits and split the text into phrases separated by punctuation.
+ Word segmentation: this step is very important and affects the result of text classification.
+ Stopword removal (stopwords are words that appear in almost every document and carry no meaning for text classification): either use a stop list, usually a file storing the stopwords commonly encountered in text classification, or derive the list statistically from the training set itself.
The purpose of this step is to clean the input data reasonably well so that later steps work better. Its job is therefore only to fetch data from the websites and write it to a text file as a plain character string, with the following requirements:
+ Remove extraneous data: formatting tags, hyperlinks, and image links.
+ Collapse runs of more than one space.
+ Remove line breaks.
+ Remove blank lines.
+ Remove stray characters.
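The cleaning requirements listed above can be sketched with regular expressions from Python's standard library. This is an illustrative sketch, not the thesis's implementation: the function names `clean_text` and `remove_stopwords`, the choice of which characters count as stray, and the tiny stop list are all assumptions.

```python
import html
import re

# Hypothetical, very small stop list used only for illustration.
STOPWORDS = {"và", "là", "của", "the", "of"}


def clean_text(raw_html):
    """Turn raw HTML into a single-line plain-text string."""
    # Drop script and style elements together with their bodies.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw_html)
    # Drop all remaining tags (formatting, links, images).
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; into characters.
    text = html.unescape(text)
    # Drop bare URLs left in the text.
    text = re.sub(r"https?://\S+", " ", text)
    # Treat anything that is not a letter, digit, space, or basic
    # punctuation as a stray character (an assumption of this sketch).
    text = re.sub(r"[^\w\s.,!?]", " ", text)
    # Collapse spaces, line breaks, and blank lines into one space.
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Filter out tokens that appear in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


SAMPLE = (
    "<script>x=1</script><p>Hello   world<br>\n\n"
    '<a href="http://a.b/c">link</a> &amp; more \u00a9</p>'
)
print(clean_text(SAMPLE))
```

Word segmentation is deliberately left out here, since for Vietnamese it cannot be done by splitting on spaces.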
2.2.2. Sentence segmentation based on Maximum Entropy
Phuong H.L. and Vinh H.T. [4] model sentence segmentation as a classification problem solved with Maximum Entropy. For each character string that may be a sentence boundary (".", "?", or "!"), the model estimates the joint probability of that character together with its surrounding context (represented by the random variable c) and of a random variable indicating whether it really is a sentence boundary (b ∈ {no, yes}). The model probability is defined as follows:
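In the usual product form of a maximum entropy model, which is assumed here, the joint probability can be written as below. The symbols π (a normalization constant) and k (the number of features) are assumptions of this sketch; α_j and f_j are the model parameters and feature functions.

```latex
p(b, c) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(b, c)}
```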

Here the α_j are the unknown parameters of the model, and each α_j corresponds to a feature function f_j. Let B = {no, yes} be the set of classes and C the set of contexts. The features are binary functions f_j: B x C → {0, 1} that encode the necessary information. The probability of observing a sentence boundary in context c is characterized by the probability p(yes, c). The parameters α_j are chosen to maximize the likelihood of the training data, using the GIS and IIS algorithms.
A potential sentence-ending character is assigned to one of the two classes {yes, no}, where yes means it really is a sentence boundary and no means the opposite, according to the following classification rule:
p(yes|c) = p(yes,c)/p(c) = p(yes,c)/(p(yes,c) + p(no,c))

Here c is the context surrounding the potential sentence-ending character, including the character itself. The following are the choices of candidate functions f_j for sentence segmentation in Vietnamese.
Feature selection
The features in Maximum Entropy encode information useful for sentence segmentation. If a feature appears in the feature set, its corresponding weight contributes to the computation of the probability p(b|c).
Potential sentence-ending characters are found by scanning the text and identifying the character strings separated by spaces (called tokens) that contain one of the characters ".", "?", or "!". Information about the token, together with context information about the tokens immediately to its left and right, is used to determine the classification probability.
Tokens that contain a potential sentence-ending character are called "candidates". The part of the token before the potential sentence-ending character is called the "prefix" and the part after it the "suffix". The position of the potential sentence-ending character is also described in the feature set. The contexts considered are as follows:
1. Whether there is a space before the potential sentence-ending character.
2. Whether there is a space after the potential sentence-ending character.
3. The potential sentence-ending character itself.
4. The prefix feature.
5. The length of the prefix, if it is greater than 0.
6. Whether the first character of the prefix is a letter.
7. Whether the prefix is in the list of abbreviations.
8. The suffix feature.
9. The token preceding the current token.
10. Whether the first character of the preceding token is uppercase.
11. Whether the preceding token is in the list of abbreviations.
12. The following token.
13. Whether the candidate token is capitalized.
From these contexts, the set of contexts (the set C) can be derived from the data. The contexts together with the labels from the data produce a corresponding feature set. Consider the following example to clarify the relation between contexts and features:
"Những hacker máy tính sẽ có cơ hội chiếm giải thưởng trị giá 10.000 USD và 10.000 đôla Singapore (5.882 USD) trong một cuộc tranh tài quốc tế mang tên "Hackers Zone" được tổ chức vào ngày 13/5/1999 tại Singapore."
Consider the potential sentence-ending character "." in the token "10.000 USD". From this position the following contexts can be derived:
1. There is no space before the candidate character.
2. There is no space after the candidate character.
3. The candidate character is "."
4. Prefix: 10
From this training data, features such as the following can be extracted:
f{no space before the candidate, no} = 1. This feature means that the statement "the token has no space before the candidate and the label is no" is true (the feature takes the value 1).
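The context extraction in the example above and the rule p(yes|c) = p(yes,c)/(p(yes,c) + p(no,c)) can be sketched as follows. The sketch assumes a product-form model in which each active feature contributes a weight α; only a few of the thirteen contexts are implemented, and the function names and the weights in `ALPHA` are hypothetical values rather than trained parameters.

```python
import re


def candidate_contexts(text):
    """Return one context dict per potential sentence-ending character."""
    contexts = []
    for token in text.split():
        for match in re.finditer(r"[.?!]", token):
            prefix = token[:match.start()]
            suffix = token[match.end():]
            contexts.append({
                "token": token,
                "char": match.group(),
                # An empty prefix means the character starts the token,
                # i.e. it is preceded by whitespace (or the text start).
                "space_before": prefix == "",
                "space_after": suffix == "",
                "prefix": prefix,
                "suffix": suffix,
            })
    return contexts


def active_features(context):
    """Encode a few contexts as 'name=value' feature strings."""
    names = ("char", "space_before", "space_after", "prefix")
    return [f"{name}={context[name]}" for name in names]


def p_yes(features, alpha):
    """p(yes|c) under a product-form model; unseen features weigh 1."""
    def joint(label):
        score = 1.0
        for feat in features:
            score *= alpha.get((feat, label), 1.0)
        return score

    yes, no = joint("yes"), joint("no")
    return yes / (yes + no)


# Hypothetical weights, for illustration only.
ALPHA = {("space_after=False", "no"): 9.0, ("space_after=True", "yes"): 4.0}

contexts = candidate_contexts("giá 10.000 USD.")
for ctx in contexts:
    print(ctx["token"], p_yes(active_features(ctx), ALPHA))
```

The normalization constant of the joint model cancels in the ratio, which is why it does not appear in `p_yes`.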
After the feature weights have been estimated, these parameters are used to compute p(yes|c). If this value is greater than 50%, the candidate character is labeled "yes", that is, it really is a sentence boundary.
2.2.3. Word segmentation
Word segmentation is the process of identifying the simple and compound words in a sentence by determining the boundaries between words. In language processing, determining the grammatical structure of a sentence or the part of speech of a word first requires knowing what the words in the sentence are. The problem seems simple to humans but is very hard for computers. For English, a language widely used on the Internet, and for similar languages, results in this area are very good because meaningful words are separated by spaces, so word segmentation is very simple. In Vietnamese, by contrast, word boundaries are not given by spaces but depend on the context of the sentence. For example, the English words "book" and "cat" correspond to "quyển sách" and "con mèo" in Vietnamese; likewise, if "dịu dàng" is cut at the space, the resulting single units are meaningless and do not express the meaning of the compound word. Splitting on spaces to obtain a unit (term) for search purposes is therefore of no value for Vietnamese. Recent studies on text classification have produced some initial results but still have many limitations. The reason is that difficulties already arise in the very first step, processing the text to obtain word frequencies. Yet for text classification the first step is arguably the most important, because if word segmentation is wrong, classification can hardly succeed. Most Vietnamese word segmentation methods rely on training data and dictionaries, while no sufficiently large dictionary or labeled Vietnamese training set is currently available for this purpose. There are also studies on Vietnamese word segmentation, such as vntokenizer by Lê Hồng Phương, but because Vietnamese sentences are ambiguous, not every passage can be segmented accurately; for example, the phrase "con ngựa con ngựa" has a different meaning in each context.
Character-based approaches (based on syllables, "tiếng", in Vietnamese):
This approach simply extracts a certain number of syllables from the text, one at a time (unigram) or several at a time (n-gram). It has produced some results, as shown by published studies such as that of Lê An Hà [14] in 2003, who built a 10MB raw corpus and used dynamic programming to maximize the occurrence probability of the phrases in the segments delimited by separator characters, determining the most plausible segmentation for each sentence.
Another method is Vietnamese word segmentation based on Internet statistics and a genetic algorithm, IGATEC (Internet and Genetics Algorithm based Text Categorization for Documents in Vietnamese), proposed by H. Nguyễn [13] in 2005. Instead of a raw corpus, this work treats the Internet as a huge corpus, collects statistics from it, and uses a genetic algorithm to search for the best segmentation. The novelty of the approach is that, instead of labeled training data or a lexicon, neither of which is readily available for Vietnamese, the author uses statistics extracted directly from a search engine and a genetic algorithm to determine the most plausible segmentations of a given Vietnamese text. What distinguishes the algorithm is this combination of a genetic algorithm with statistics extracted from the Internet through a search engine rather than from a data set as in other methods. The genetic algorithm provides a parallel (evolutionary) search over a population in which each individual corresponds to one way of segmenting the sentence under consideration. The fitness function evaluates fitness from the statistics extracted from the Internet, which include document frequencies and the correlation between groups of words in documents. Following the principle of evolution, a genetic algorithm is suited to approximating globally optimal solutions in a very large search space rather than locally optimal ones. It evolves a population over many generations toward a global optimum through selection, crossover, mutation, and reproduction. The quality of each individual is determined by the fitness function, and in each generation the N best individuals are kept after crossover, mutation, and reproduction. The experimental results of Nguyễn Thanh Hùng [6], who studied this approach of applying a genetic algorithm and Internet statistics, were promising for Vietnamese word segmentation and text classification, with a micro-averaging F1 measure (Yang) above 90%.
Comparing the results of Lê An Hà [14] and H. Nguyễn [13], the work of H. Nguyễn gives better segmentation results but takes longer because of the processing speed of the algorithm. The notable advantages of the approach based on several characters are its simplicity and ease of application; it also has a low cost for index creation and for handling many queries. According to many published studies, segmentation based on several characters, specifically on two characters, is considered a suitable choice. However, the search space is very large because there are many ways of combining syllables into words.
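The unigram and n-gram extraction that underlies the character-based (syllable-based) approach can be sketched in a few lines. The function name `syllable_ngrams` is hypothetical, and syllables are assumed to be separated by whitespace.

```python
def syllable_ngrams(text, n):
    """Return all runs of n consecutive syllables, in order."""
    syllables = text.split()
    return [
        " ".join(syllables[i:i + n])
        for i in range(len(syllables) - n + 1)
    ]


sentence = "học sinh học sinh học"
print(syllable_ngrams(sentence, 1))  # unigrams
print(syllable_ngrams(sentence, 2))  # bigrams (two-syllable units)
```

Every bigram is produced regardless of whether it is a real word, which illustrates both the simplicity of the approach and why its accuracy is lower than that of word-based segmentation.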
Word-based approaches:
The goal is to segment the sentence into complete words. This approach is divided into three groups:
+ Statistics-based approach: solutions of this kind rely on statistical information such as term, word, or character frequencies, or co-occurrence probabilities in a base data set. The advantage of this approach is that it is flexible and useful in many domains. However, the effectiveness of such solutions depends mainly on the specific training corpus used, which unfortunately is a difficulty for Vietnamese word segmentation, as discussed above. Đinh Điền [12] built his own training corpus (about 10MB) from resources, news, and e-books on the Internet. Naturally this corpus is fairly small and not comprehensive, and is not broad enough to cover different domains and topics.
+ Dictionary-based approach: the idea is that the phrases segmented from the text must match words in a dictionary. This approach therefore requires a separate dictionary for each domain of interest. A component dictionary contains only the components of words and phrases, such as morphemes and simple words. The dictionary-based approach still has limitations because segmentation relies entirely on the dictionary. If a complete dictionary is used, building it is hard in practice because it requires a great deal of time and effort. Using a component dictionary eases this limitation, but building such a dictionary is difficult because morphemes, simple words, and other words must then be combined to form complete words and phrases. Building a complete dictionary of Vietnamese words and phrases is hardly feasible.
+ Hybrid approach: this approach combines the statistics-based and dictionary-based approaches to inherit the advantages of several techniques and improve the results. However, the hybrid approach requires a good lexicon or a sufficiently large and reliable training corpus, and it costs much processing time and storage and is expensive.
Some Vietnamese word segmentation methods in use today

2.2.3.1. Maximum Matching method: Forward / Backward


The Maximum Matching (MM) method, also called LRMM (Left Right Maximum Matching), was presented by Chih-Hao Tsai [11] in 2000. The method scans a phrase or sentence from left to right, selects the dictionary word with the most syllables, and repeats until the end of the sentence.
The simple form of the method is used to resolve single-word ambiguity. Suppose we have a character string C1, C2, ..., Cn. The method is applied from the start of the string: first check whether C1 is a word, then check whether C1C2 is a word, and continue in this way until the longest word is found.
The complex form: the rule of this form is chunking. Usually the three-word chunk with the maximum length is chosen. The algorithm starts from the simple form; if ambiguous segmentations are found, for instance if C1 is a word and C1C2 is also a word, then the following characters in the string C1, C2, ..., Cn are examined to find all three-word chunks that begin with C1 or C1C2.
Suppose we obtain the following chunks:
- C1 C2 C3C4
- C1C2 C3C4 C5
- C1C2 C3C4 C5C6
The longest chunk is the third one, so the first word of the third chunk (C1C2) is selected. The steps are repeated until the complete word sequence is obtained.
The advantages of this method are that segmentation is simple and fast and needs only a dictionary. Its weakness, however, is also the dictionary, because segmentation accuracy depends entirely on the completeness and accuracy of the dictionary.
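The simple form of forward maximum matching can be sketched as below. This is a minimal illustration over whitespace-separated syllables; the function name, the `max_len` limit, and the toy dictionary are assumptions, and the complex three-word-chunk form is not implemented.

```python
def forward_max_match(syllables, dictionary, max_len=4):
    """Greedy left-to-right segmentation taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        for size in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + size])
            # A single syllable is accepted even if it is not in the
            # dictionary, so the scan always advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary.
DICTIONARY = {"học sinh", "sinh học", "học"}
print(forward_max_match("học sinh học sinh học".split(), DICTIONARY))
```

On this input the greedy scan never considers "sinh học", which is the kind of ambiguity the complex form tries to reduce by comparing three-word chunks.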
2.2.3.2. Transformation-Based Learning (TBL) method:
The TBL (Transformation-Based Learning) method was first introduced by Eric Brill in 1992. Applied to the segmentation problem, the idea is as follows. The unsegmented text, D1, is first processed by an initial segmentation program P1. P1 may be of any complexity: it may simply annotate the text with a random structure or, at the other extreme, the text may be segmented manually. After P1, we obtain the segmented text D2. D2 is compared with D3, a correctly segmented reference version of the text. A program P2 then learns transformations one at a time such that applying each makes D2 more similar to the reference text D3. Learning is repeated until no better transformation remains, and the result is a rule set R used for segmentation.
In other words, this approach relies on an annotated corpus: so that the system can recognize word boundaries and segment accurately, the machine learns from sample sentences in a corpus in which the correct word boundaries have been marked. The advantage of the method is its simplicity: the machine only needs to learn from sets of sample sentences, after which it derives the rules of the language by itself and applies them to new sentences. The drawback is that learning takes a great deal of time and memory, because intermediate rules must be generated during learning. Accurate segmentation in all cases requires a complete Vietnamese corpus and a long training time in order to derive a complete rule set.
2.2.3.3. Word segmentation model using WFST and a neural network
The Weighted Finite State Transducer (WFST) model applies a WFST whose weights are the occurrence probabilities of each word in the corpus. The WFST scans the sentences under consideration, and the word with the largest weight is selected for segmentation. The use of WFST for Chinese word segmentation was presented by Richard Sproat and colleagues in 1996. In 2001, Đinh Điền [12] published a work using a hybrid model combining WFST with a neural network to resolve ambiguity in word segmentation. In this work the author built a segmentation system consisting of a WFST layer, which segments words and handles issues specific to Vietnamese such as reduplicative words and proper names, and a neural network layer, which resolves semantic ambiguity after segmentation. The WFST model uses the weights to choose a suitable segmentation: once all possible segmentation states of the sentence have been obtained, the model computes the total weight of each state and selects as the most correct segmentation the one with the smallest total weight (with costs defined as negative log-probabilities, the smallest total cost corresponds to the most probable segmentation).
The two layers are described in detail below.
The WFST layer consists of three steps:
- Step 1: Build a weighted dictionary following the WFST model. Word segmentation is viewed as a probabilistic state transition. The dictionary D is described as a weighted finite state transition graph.
Let:
+ H be the set of Vietnamese orthographic words, also called syllables ("tiếng").
+ P be the part of speech of a word.
Each arc of D can be:
+ From an element of H to an element of H
+ The labels in D represent a cost estimated by the formula:
Cost = -log(f/N)
where f is the frequency of the word and N is the size of the sample set.
- Step 2: Build the possible segmentations. To reduce the combinatorial explosion when generating the possible word sequences from a sequence of syllables in the sentence, the author proposes additionally using a dictionary: if a segmentation is found to be unsuitable (the unit is not in the dictionary, is not a reduplicative word, and is not a proper noun), the branches starting from that segmentation are discarded.
- Step 3: Select the optimal segmentation. After obtaining the list of possible segmentations of the sentence, the author selects the segmentation with the smallest weight.
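The selection of the segmentation with the smallest total cost, with Cost = -log(f/N) per word, can be sketched with dynamic programming. This is a simplification and not the WFST implementation of the cited work: the function name, the `max_len` limit, and the toy frequency table are assumptions, and units missing from the frequency table are simply pruned, as a stand-in for the dictionary check of Step 2.

```python
import math


def best_segmentation(syllables, freq, total, max_len=4):
    """Return (cost, words) of the minimum-cost segmentation.

    Each word costs -log(freq[word] / total). If no segmentation
    exists, the result is (math.inf, None).
    """
    n = len(syllables)
    # best[i] holds the cheapest segmentation of the first i syllables.
    best = [(0.0, [])] + [(math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = " ".join(syllables[start:end])
            f = freq.get(word)
            if f is None or best[start][1] is None:
                continue  # unknown unit or unreachable prefix: prune
            cost = best[start][0] - math.log(f / total)
            if cost < best[end][0]:
                best[end] = (cost, best[start][1] + [word])
    return best[n]


# Hypothetical frequencies in a sample of 1000 words.
FREQ = {"học": 50, "sinh": 30, "học sinh": 40, "giỏi": 20}
cost, words = best_segmentation("học sinh giỏi".split(), FREQ, total=1000)
print(words, round(cost, 3))
```

Because the costs are negative log-probabilities, the two-word segmentation wins over the three-word one here even though "học" alone is more frequent than "học sinh".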
The neural network layer: this model resolves segmentation ambiguity in combination with dictionary lookup. The author proposes evaluating three part-of-speech sequences: NNV, NVN, VNN (N: Noun, V: Verb). The model is trained on exactly those sentences whose segmentation is still ambiguous after the first model.
According to Đinh Điền's published results, the model reaches an accuracy above 97% by using the neural network together with the dictionary to resolve the ambiguities that may arise during segmentation. Like the TBL method, this model needs a complete training corpus.
Advantage of the method: it gives high accuracy if complete and accurate training data can be built. Main drawback of the algorithm: because weights are based on word frequencies during segmentation, ambiguities in Vietnamese cannot be avoided when the text is very long.
2.2.3.4. Vietnamese word segmentation based on Internet statistics and a genetic algorithm
The Vietnamese word segmentation method based on Internet statistics and a genetic algorithm, IGATEC (Internet and Genetics Algorithm based Text Categorization for Documents in Vietnamese), was proposed by H. Nguyễn [13] in 2005 as a new approach to word segmentation for text classification that needs neither a dictionary nor a training corpus. In this approach the author combines a genetic algorithm with statistics obtained from the Internet.
The author describes the segmentation system as consisting of the following components.
2.2.3.4.1. Online Extractor:
This component obtains the occurrence frequencies of the words in the text by using a well-known search engine such as Google or Yahoo. The author then uses the formulas below to compute the mutual information, which serves as the basis for the fitness computation in the GA engine.
- Probability that words appear on the Internet:

\[ p(w) = \frac{\mathrm{count}(w)}{MAX}, \qquad p(w_1 \,\&\, w_2) = \frac{\mathrm{count}(w_1 \,\&\, w_2)}{MAX} \]

where MAX = 4 * 10^9; count(w) is the number of documents on the Internet found to contain the word w, and count(w1 & w2) the number that contain both w1 and w2.
- Conditional probability of one word given another:

\[ p(w_1 \mid w_2) = \frac{p(w_1 \,\&\, w_2)}{p(w_1)} \]

- Mutual information of a compound word made of n syllables (cw = w1 w2 ... wn):

\[ MI(cw) = \frac{p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)}{\sum_{j=1}^{n} p(w_j) - p(w_1 \,\&\, w_2 \,\&\, \dots \,\&\, w_n)} \]
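The mutual information of an n-syllable compound can be computed directly from document counts. The sketch below reads the formula as the joint probability divided by the sum of the single-syllable probabilities minus the joint probability; the counts in the example are invented, and the function names and the zero-denominator guard are assumptions. No real search engine is queried.

```python
MAX_DOCS = 4e9  # the constant MAX = 4 * 10^9 from the text


def probability(count):
    """p = count / MAX for a document count returned by a search engine."""
    return count / MAX_DOCS


def mutual_information(joint_count, single_counts):
    """MI(cw) = p(w1 & ... & wn) / (sum_j p(wj) - p(w1 & ... & wn))."""
    p_joint = probability(joint_count)
    denominator = sum(probability(c) for c in single_counts) - p_joint
    if denominator <= 0:
        return 0.0  # guard chosen for this sketch, not from the source
    return p_joint / denominator


# Invented counts: both syllables together in 1000 documents,
# individually in 2000 and 3000 documents.
print(mutual_information(1000, [2000, 3000]))
```

A compound whose syllables almost always occur together gets a score close to 1 or above, while syllables that rarely co-occur score near 0.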
2.2.3.4.2. GA Engine for Text Segmentation:

Each individual in the population is represented by a string of bits 0 and 1, in which each bit stands for one syllable of the text and each group of identical consecutive bits stands for one segment. The individuals of the population are initialized randomly, with each segment limited to a length of about 5. The GA engine then performs mutation and crossover steps in order to increase the fitness of the individuals and reach the best possible segmentation.
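The bit-string representation can be made concrete with a small decoding sketch. The helper names `decode` and `fitness` are hypothetical, and the scoring function passed to `fitness` is a placeholder for a mutual-information score; selection, crossover, and mutation are not shown.

```python
from itertools import groupby


def decode(bits, syllables):
    """Map a bit string to segments: each run of equal bits is one segment."""
    segments = []
    position = 0
    for _, run in groupby(bits):
        length = len(list(run))
        segments.append(" ".join(syllables[position:position + length]))
        position += length
    return segments


def fitness(bits, syllables, score):
    """Sum a per-segment score over the segmentation encoded by bits."""
    return sum(score(segment) for segment in decode(bits, syllables))


syllables = ["a", "b", "c", "d", "e"]
individual = [0, 0, 1, 0, 0]
print(decode(individual, syllables))
```

Flipping a single bit at a run boundary moves one syllable from one segment to its neighbor, which is what makes bit-level mutation a natural search move for this encoding.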
CONCLUSION:
An overview of several approaches to Vietnamese word segmentation, based on published studies, shows that word-based segmentation gives fairly high accuracy. This is due to the large training set and the annotations in the data, which allow accurate segmentation rules to be learned for other texts; however, the results of this method clearly depend entirely on the training data.
The character-based approach has the advantages of being easy to implement and relatively fast, but its results are less accurate than those of the word-based approach. It is generally suitable for applications that do not need absolute accuracy in word segmentation, such as spam mail filtering and firewalls. Overall, if the accuracy of this approach can be improved, it is entirely feasible and could replace word-based segmentation, because it does not require building a corpus, a task that demands much effort, time, and the support of many experts in different fields.
2.2.4. The KEA algorithm
Turney (2000) is considered the first to address keyphrase extraction with supervised learning [17][18], while other studies used heuristics, n-gram analysis, or methods such as neural networks [13][14][15]. KEA [19] is an algorithm that extracts keyphrases from text. KEA identifies a list of candidate phrases using lexical methods, computes feature values for each candidate, and then uses a machine learning algorithm to predict which candidates are keyphrases. KEA is currently regarded as one of the simplest and most effective algorithms for keyphrase extraction [13]. It uses the Naïve Bayes learning method for training and extraction.
According to its authors, KEA is language independent. The algorithm can be summarized in the following steps:
Step 1: Candidate extraction: KEA extracts n-gram candidate phrases (1 to 3 words long) that do not begin or end with a stop word. For keyphrase assignment with a predefined dictionary (controlled indexing), KEA selects only the candidates that match terms defined in the dictionary. For the n-grams obtained, KEA removes stop words from the candidate phrases and reduces the words to their stems (stemming).

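Figure 2.3 - Flowchart of the KEA algorithm (reference: http://www.nzdl.org/Kea/description.html)
[Flowchart: documents and the domain dictionary feed candidate extraction; candidate phrases go to feature calculation; in training, the pre-labeled keyphrases are used to build a Naïve Bayes model; otherwise the model is used to compute probabilities and output keyphrases.]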


Step 2: Feature calculation: for each candidate phrase, KEA computes the following four feature values:
+ TF×IDF: expresses how important a candidate phrase is in the document under consideration compared with the other documents in the collection. The higher the TF×IDF of a candidate phrase, the more likely it is to be a keyphrase.
+ Position of first occurrence: according to the authors, candidate phrases that first occur near the beginning or the end of the document are more likely to be keyphrases.
+ Phrase length: the number of words in the phrase. According to the authors, phrases of length 2 are usually of interest.
+ Relatedness: the number of phrases in the candidate list that are semantically related to the phrase under consideration. Relatedness is computed using the predefined dictionary. The higher the relatedness of a candidate phrase, the more likely it is to be a keyphrase.
Step 3: Training and model building: a set of training documents whose keyphrases were assigned by their authors is used to build the model. From the list of candidate phrases identified with the n-gram, stop-word removal, and stemming techniques above, KEA marks which phrases are "+" phrases (keyphrases) and which are "-" phrases (not keyphrases). The model is built by analyzing and computing the feature values (as described above) for the "+" and "-" phrases. The resulting model reflects the distribution of feature values for each class of phrase.
Step 4: Keyphrase extraction: KEA uses the model built in step 3 and computes the feature values of the candidate phrases. It then computes the probability that each candidate is a keyphrase. The candidates with the highest probabilities are placed in the list of keyphrases. The user can specify the number of keyphrases for a document.
2.2.4.1. Selecting candidate phrases


Candidate phrases are selected in three steps:
Input cleaning: the input files are cleaned and normalized, and the initial phrase boundaries are determined. The input string is split into tokens:
+ Punctuation marks, parentheses, and numbers are replaced by phrase boundaries.
+ Apostrophes are removed.
+ Words with an internal hyphen are split in two.
+ Remaining non-token characters are deleted, as are tokens that contain no letters.
Result:
+ A set of lines.
+ Each line is a sequence of tokens (each token contains at least one letter).
+ Abbreviations that contain separator characters must be kept as single tokens (for example C4.5).
Phrase identification: KEA considers all subsequences in each line and determines which subsequences are suitable candidate phrases. Some other methods try to identify noun phrases, whereas KEA uses the following rules to identify phrases:
+ Maximum length: a candidate phrase is usually at most 3 words long.
+ A candidate phrase cannot be a proper name.
+ A candidate phrase may not begin or end with a stopword.
All sequences of consecutive words in each line are checked against these three rules. The result is a set of candidate phrases.
Example: Table 2.1 - Identifying candidate phrases
Line                                       Candidate phrases
the programming by demonstration method    programming
                                           demonstration
                                           method
                                           programming by demonstration
                                           demonstration method
                                           programming by demonstration method
Stemming: the last step in identifying candidate phrases is stemming, using the Lovins (1968) algorithm to strip suffixes. This lets the system treat different variants of a phrase as the same (for example, "cut elimination" becomes "cut elim"). The system also uses stemming to compare the keyphrases produced by KEA with the keyphrases defined by the authors.

2.2.4.2. Feature calculation


Features are computed for each candidate phrase and are used in training and extraction. Two features are used: the TF×IDF frequency and the position of the first occurrence of the phrase.
TF×IDF (t): this feature compares the frequency of a phrase in a document with its frequency in the whole collection. The fewer documents contain a phrase, the more likely the phrase is a keyphrase of the document under consideration. KEA creates a file to store the frequency values for this feature.

\[ \mathrm{TF{\times}IDF} = \frac{\mathrm{freq}(P, D)}{\mathrm{size}(D)} \times \left( -\log_2 \frac{\mathrm{df}(P)}{N} \right) \]

freq(P, D) is the number of times phrase P occurs in document D
size(D) is the number of words in document D
df(P) is the number of documents in the collection that contain phrase P
N is the size of the collection
Position of first occurrence (d: distance): this is the second feature, the number of words that precede the first occurrence of the phrase divided by the size of the document (total number of words). The value of this feature lies in the interval [0, 1].
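The two features can be sketched from the definitions of freq(P, D), size(D), df(P), and N given above. The sketch assumes the KEA form of TF×IDF, freq/size multiplied by -log2(df/N); the function names and the sample numbers are hypothetical.

```python
import math


def tf_idf(freq, size, df, n_docs):
    """TF×IDF = freq(P, D) / size(D) * -log2(df(P) / N)."""
    return (freq / size) * -math.log2(df / n_docs)


def first_occurrence(doc_tokens, phrase_tokens):
    """Words before the first occurrence divided by document length.

    Returns None if the phrase does not occur in the document.
    """
    k = len(phrase_tokens)
    for i in range(len(doc_tokens) - k + 1):
        if doc_tokens[i:i + k] == phrase_tokens:
            return i / len(doc_tokens)
    return None


# A phrase occurring twice in a 100-word document and found in
# 1 of 8 documents of the collection.
print(tf_idf(freq=2, size=100, df=1, n_docs=8))
print(first_occurrence("a b c d".split(), ["c", "d"]))
```

A phrase that occurs in every document of the collection gets a TF×IDF of 0, since -log2(1) = 0.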

2.2.4.3 Training


The training step uses a set of training documents whose keyphrases have been specified in advance by their authors. For each document in the training set, the candidate phrases are identified and the feature values of each candidate are computed. To reduce the size of the training set, the authors ignore phrases that occur only once in a document. Each candidate phrase is labeled as a keyphrase or not a keyphrase according to the keyphrases assigned by the document's author. Training produces a model, which is used to predict the class of new examples from the values of the two features. The authors experimented with several machine learning methods and chose Naïve Bayes for KEA because, in their view, this probabilistic learning method is simple yet gives fairly good results.

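2.2.4.4 Extracting keyphrases


To extract keyphrases from a new document, KEA identifies the candidate phrases and their feature values and then applies the model built during training. The model determines the probability that each candidate is a keyphrase. KEA then performs a post-processing operation to select the best possible set of keyphrases.
When the Naïve Bayes model is applied to a candidate phrase with feature values t (TF×IDF) and d (distance), two quantities are computed:

\[ P[\mathrm{yes}] = \frac{Y}{Y + N} \, P_{\mathrm{TF \times IDF}}[t \mid \mathrm{yes}] \, P_{\mathrm{distance}}[d \mid \mathrm{yes}] \qquad (1) \]

\[ P[\mathrm{no}] = \frac{N}{Y + N} \, P_{\mathrm{TF \times IDF}}[t \mid \mathrm{no}] \, P_{\mathrm{distance}}[d \mid \mathrm{no}] \]

Y: the number of phrases that are keyphrases (as assigned by the authors)
N: the number of candidate phrases that are not keyphrases.
The overall probability that a candidate phrase is a keyphrase is computed as:

\[ p = \frac{P[\mathrm{yes}]}{P[\mathrm{yes}] + P[\mathrm{no}]} \qquad (2) \]

After the probability p has been computed, the candidates are ranked by this value. Two post-processing steps follow. First, TF×IDF is used as the tie-breaker when two candidate phrases have the same probability p. Second, the authors remove from the list any phrase that is a sub-phrase of a phrase with a higher probability. From the remaining list, the algorithm selects the r phrases with the highest probability (where r is the requested number of keyphrases).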
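The Naïve Bayes scoring and the two post-processing steps can be sketched as below. The sketch assumes that the feature values have already been discretized into bins and that the conditional probabilities were estimated during training; the model layout, the function names, and all numbers are hypothetical.

```python
def keyphrase_probability(t_bin, d_bin, model):
    """p = P[yes] / (P[yes] + P[no]) for discretized features t and d."""
    y, n = model["Y"], model["N"]
    p_yes = y / (y + n) * model["t|yes"][t_bin] * model["d|yes"][d_bin]
    p_no = n / (y + n) * model["t|no"][t_bin] * model["d|no"][d_bin]
    return p_yes / (p_yes + p_no)


def is_subphrase(short, long):
    """True if the words of short occur contiguously inside long."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(
        l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1)
    )


def select_keyphrases(candidates, r):
    """candidates: (phrase, p, tfidf) tuples. Rank by p with TF×IDF as
    tie-breaker, drop sub-phrases of better-ranked phrases, keep r."""
    ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
    kept = []
    for phrase, _, _ in ranked:
        if not any(is_subphrase(phrase, better) for better in kept):
            kept.append(phrase)
    return kept[:r]


# Hypothetical trained model with two bins per feature.
MODEL = {
    "Y": 20, "N": 80,
    "t|yes": {0: 0.2, 1: 0.8}, "t|no": {0: 0.9, 1: 0.1},
    "d|yes": {0: 0.7, 1: 0.3}, "d|no": {0: 0.4, 1: 0.6},
}
print(keyphrase_probability(1, 0, MODEL))
print(select_keyphrases(
    [("data mining", 0.9, 0.5), ("mining", 0.8, 0.4), ("web filter", 0.8, 0.6)],
    r=2,
))
```

In the example, "mining" is discarded because it is contained in the better-ranked "data mining", and the tie between the remaining candidates with p = 0.8 would be decided by TF×IDF.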
2.2.5 The KIP algorithm
2.2.5.1 Idea
A noun phrase that contains keywords or keyphrases of a specific domain is likely to be a keyphrase in that domain, and the more keywords or keyphrases it contains, the more likely it is. The system builds in advance a lexical database that stores the keywords and keyphrases of a specific domain. The keywords in this predefined dictionary are used to compute a score (weight) for each noun phrase, and the candidate phrases with the highest scores are selected as keyphrases.
2.2.5.2 Algorithm description
KIP consists of a few simple steps: extract candidate noun phrases from the input document, examine the composition of each candidate, and compute its score. The candidates with the highest scores are then selected as keyphrases.
The score of a noun phrase depends on the following factors:
- its frequency of occurrence in the document;
- its composition (which words and sub-phrases it contains);
- how closely the words and sub-phrases that make up the noun phrase relate to the domain of the document.
KIP has three main components: a part-of-speech (POS) tagger, a noun phrase extractor, and a keyphrase extraction tool.
- POS tagger: KIP uses the widely used Brill tagging method [20].
- Noun phrase extractor: uses the POS tags assigned in the previous step and extracts noun phrases that match the pattern {[A]} {N} (A: adjective; N: noun; {}: repeated one or more times; []: optional).
- Keyphrase extraction: to weight the noun phrases, the algorithm builds a lexical dictionary of the keywords and keyphrases of a specific domain, with initial values. The dictionary consists of two lists: a list of keyphrases (each containing one or more words) and a list of keywords (single words obtained by splitting the keyphrases in the first list).
The weight of a noun phrase is W_NP = F × S, where:
- F is the frequency of the noun phrase in the document;
- S is the sum of the weights of the individual words and of the possible sub-phrases in the candidate:

$$ S = \sum_{i=1}^{N} W_i + \sum_{j=1}^{M} P_j $$

- W_i is the weight of a word in the noun phrase;
- P_j is the weight of a sub-phrase in the noun phrase.
The purpose of computing the weights of all individual words and sub-phrases is to determine whether a sub-phrase is a keyphrase already defined in the dictionary. If it is, the noun phrase under consideration becomes more important. KIP queries the keyword and keyphrase lists in the domain dictionary to obtain the weights for the individual words (W_i) and sub-phrases (P_j).
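The weighting W_NP = F × S can be illustrated with a short sketch. The function names, the dictionary contents, and the choices that unknown words weigh 0 and that sub-phrases are contiguous runs of two or more words are assumptions made for this example, not details confirmed by the KIP description.

```python
def sub_phrases(words):
    """All contiguous sub-phrases of two or more words, including the phrase itself."""
    n = len(words)
    return [" ".join(words[i:j]) for i in range(n) for j in range(i + 2, n + 1)]

def kip_weight(phrase, frequency, keyword_weights, keyphrase_weights):
    """W_NP = F * S, where S sums the word weights W_i and sub-phrase weights P_j."""
    words = phrase.split()
    s = sum(keyword_weights.get(w, 0.0) for w in words)               # sum of W_i
    s += sum(keyphrase_weights.get(p, 0.0) for p in sub_phrases(words))  # sum of P_j
    return frequency * s

# Hypothetical domain dictionary: one keyphrase list and one keyword list.
keyword_weights = {"data": 0.5, "mining": 0.75}
keyphrase_weights = {"data mining": 1.0}

print(kip_weight("web data mining", 3, keyword_weights, keyphrase_weights))  # 6.75
```

Here S = 0.5 + 0.75 + 1.0 = 2.25 and F = 3, so the phrase scores 6.75; a noun phrase with no dictionary words scores 0 regardless of its frequency.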
2.2.6. Named entity recognition
2.2.6.1 Concept
Named Entity Recognition (NER)^5 is an information extraction task that aims to locate, identify and classify elements of unstructured text into predefined entity categories such as names of people, organizations, locations, time expressions, numbers, monetary values, percentages, and so on. Named entities have many applications, particularly in text understanding, machine translation, information retrieval and automatic question answering.
2.2.6.2 Approaches and common systems
Most current named entity recognition systems apply text mining and natural language processing techniques and follow one of these main approaches:
- Grammar-based techniques: the rules and grammars are built by hand from the knowledge of linguistic experts, which takes a great deal of time. The grammar rules must be changed whenever the application domain or the language changes.
- Statistical learning models: less dependent on the language and independent of domain experts, but they require a large, well-prepared training data set in order to build an optimal classifier.
- Combinations of machine learning and natural language processing techniques.
Commonly used named entity recognition systems include:
- Stanford NER^6: builds the CRFClassifier on the Conditional Random Field (CRF) model.
- GATE-ANNIE^7: a subsystem of the GATE framework (General Architecture for Text Engineering), one of the largest projects of the Department of Computer Science at the University of Sheffield, UK. The system is based on dictionaries, an ontology, and rules for annotating elements of the text. Named entities are identified during the annotation of the text.

^5 http://en.wikipedia.org/wiki/Named_entity_recognition
^6 http://nlp.stanford.edu/ner/index.shtml
^7 http://gate.ac.uk/ie/annie.html
2.3. URL analysis
A URL (Uniform Resource Locator) is used to reference a resource on the Internet. URLs are what make hyperlinking between web pages possible. Each resource is referenced by an address, the URL, also called a web address or a link.
Technically, a URL is a kind of URI, but in many technical documents and in conversation "URL" is used as a synonym for "URI", and this is not considered a problem.
A URL consists of the following parts:
- URL scheme: usually a protocol name (e.g. http, ftp), but it can also be another name (e.g. news, mailto). See "URI scheme" for details.
- Domain name (e.g. vi.wikipedia.org)
- Port (optional)
- Absolute path of the resource on the server (e.g. /thumuc/trang)
- Query (optional)
- Fragment (optional)
More specifically:
http://vi.wikipedia.org:80/thumuc/trang?timkiem=cauhoi#dautien
\__/   \______________/\_/\___________/ \____________/ \_____/
 |            |         |       |             |           |
scheme   domain name  port    path          query     fragment
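As a quick check of the decomposition above, the same example URL can be split with Python's standard `urllib.parse` module. This is only an illustration of the URL parts; the thesis does not state which tool, if any, was used for this step.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://vi.wikipedia.org:80/thumuc/trang?timkiem=cauhoi#dautien")

print(parts.scheme)    # http
print(parts.hostname)  # vi.wikipedia.org
print(parts.port)      # 80
print(parts.path)      # /thumuc/trang
print(parts.query)     # timkiem=cauhoi
print(parts.fragment)  # dautien
```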

Many new domain names appear around the world every day.
To find an Internet address that points to the content they need, users can rely on online tools. Internet search engines are increasingly popular and widely used. Alternatively, a suitable URL can be found through website directories.
CHAPTER 3. A SOLUTION FOR FILTERING PORNOGRAPHIC WEBSITES BASED ON URL AND TEXT CONTENT
3.1. System model analysis

Website filtering is essentially a text classification problem, a pressing one given today's information explosion. The task is to assign text documents to predefined topics. There are many practical applications. For example, a political analyst who needs to gather a large number of political documents for study cannot go online, read every article to decide whether it is political, and only then read the relevant ones closely. The number of articles is simply too large; on the Internet in particular, reading every document is practically impossible because it would take far too much time.

The main goal of this thesis is to describe in detail the steps for solving the text classification problem with the Naïve Bayes algorithm, together with several improvements, applied to the classification of pornographic content.

In this thesis, "unhealthy content" means content regarded as depraved in Vietnamese culture, such as sexual content, which is particularly harmful to minors (under 18 in Vietnam). A great deal of pornographic or sexually suggestive content is now available in Vietnamese. Classifying this content so that it can be blocked from minors is a major challenge for families and society.

The problem of classifying websites with pornographic content can be stated as follows: we are given a set of web pages D = {d1, d2, ..., dn}, each labeled in advance with one of two classes C = {C1 = Bad, C2 = Good}, together with a set URL_Bad of URLs with unhealthy content and a set URL_Good of URLs with healthy content.

The task is to assign each document di to a class Cj; in this thesis, to one of the two classes Good (healthy content) and Bad (pornographic content).

Concretely, the goal is to find a function f:

f : (URL, D) → C
f(URL, D) ∈ {Bad, Good}.

The model of the system for filtering websites with pornographic content is shown in Figure 3.1. It is divided into two separate phases, training and recognition, which share two steps: preprocessing and feature extraction. The features here are the words segmented with the help of a dictionary, discussed later. The system consists of two main modules: URL processing and processing based on the content of the web page.
3.2. URL-based processing module

This module simply relies on counts of the keywords extracted from blacklist and whitelist URL data sets to find the keywords that characterize website names, such as sex, girl, xxl, xx, porn, ... The training phase counts how many times these keywords appear in the blacklist and in the whitelist, from which the probability that an arbitrary website is pornographic can be computed.

Figure 3.2 URL token training process: Training data → Preprocessing → Feature extraction → Bayes training algorithm → URL Token database

- Training data: a repository of blacklist URLs (addresses of websites with pornographic content) and whitelist URLs (addresses of websites with healthy content), collected from sex websites, sex education websites, online newspapers, ...
- Preprocessing: converts the data into a form suitable for classification.
- Feature extraction: removes components such as http://, www, the characters / and -, and suffixes such as .com, .vn, .gov, .info, .net, keeping the single words and compound words, collectively called Tokens, that capture the overall meaning of the URL.
- Bayes algorithm: applies Bayes' formula to compute the prior probabilities of the two classes Bad and Good, as well as the probability of each Token in each class, for later use in recognizing or classifying URLs.
- URL Token database: the single words and compound words that remain after training and selection.
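The feature extraction step for URLs can be sketched as below. The regular expressions, the function name `url_tokens` and the exact suffix set are assumptions for illustration; in particular, this sketch does not split concatenated words inside a host name, which the thesis handles through its trained token database.

```python
import re

# Components the text lists for removal; the exact suffix set is an assumption.
_SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*://")
_WWW = re.compile(r"^www\.")
_SUFFIXES = {"com", "vn", "gov", "info", "net"}

def url_tokens(url):
    """Strip scheme, 'www.' and common suffixes, then split on separators."""
    s = _SCHEME.sub("", url.lower())
    s = _WWW.sub("", s)
    parts = re.split(r"[^a-z0-9]+", s)  # '/', '-', '.', '?', '=' all act as separators
    return [p for p in parts if p and p not in _SUFFIXES]

print(url_tokens("http://www.tuoitre.com.vn/giao-duc/"))  # ['tuoitre', 'giao', 'duc']
```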

3.3. Content-based filtering module

This is the main module of the thesis. Using collected web pages from both classes, Bad and Good, training finds the words (single words and two-syllable compound words, called Tokens) together with their frequencies in the two classes. When an arbitrary web page is encountered, the module computes the probability that the page belongs to each class and assigns it to the more probable one. If the page belongs to the Bad class it is blocked from display; otherwise it is displayed normally. The stages of this module are carried out in the order presented below.

3.3.1. Training stage

The main purpose of this stage is to find, from the collected data set of the two classes Bad and Good, the keywords (Tokens) that represent the data.

Figure 3.3 Content token training process: Training data → Preprocessing → Feature extraction → Bayes training algorithm → Content Token database

Where:
- Training data: a repository of text with pornographic and healthy content, collected from sex websites, sex education websites, online newspapers, ...
- Preprocessing: converts the data into a form suitable for classification.
- Feature extraction: keeps the single words and compound words, collectively called Tokens, that capture the overall meaning of the whole text.
- Bayes algorithm: applies Bayes' formula to compute the prior probabilities of the two classes Bad and Good, as well as the probability of each Token in each class, for later use in recognition or classification.
- Content Token database: the single words and compound words that remain after training and selection.
3.3.1.1. Text preprocessing
Documents in the training set must be preprocessed before use. Preprocessing improves classification performance and reduces the complexity of the classification algorithm.
The preprocessing method is as follows:
- Remove special characters (~, !, @, #, $, %, ^, &, *, (, ), {, }, [, ], +, -, =, <, >, :, /, ;), digits, arithmetic operators and commas, and replace runs of whitespace with a single space.
- Remove stopwords such as: thì, là, mà, các, những, nếu, nhưng, tuy nhiên, mặc dù, vì thế, không những, mà còn.
- Convert all uppercase letters to lowercase.
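The three preprocessing rules can be sketched as a single function. This is a minimal illustration under stated assumptions: the function name `preprocess` is hypothetical, the stopword list is the one given in the text, and lowercasing is applied first so that capitalized stopwords are also matched (the list above gives it last). Unicode normalization (NFC) is an added precaution so that Vietnamese diacritics are compared consistently.

```python
import re
import unicodedata

# Stopwords listed in the text; longer phrases first so "mà còn" is removed before "mà".
STOPWORDS = sorted(
    (unicodedata.normalize("NFC", w) for w in [
        "thì", "là", "mà", "các", "những", "nếu", "nhưng", "tuy nhiên",
        "mặc dù", "vì thế", "không những", "mà còn",
    ]),
    key=len, reverse=True,
)

def preprocess(text):
    """Lowercase, drop punctuation and digits, collapse spaces, remove stopwords."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)   # special characters, digits, underscore
    text = re.sub(r"\s+", " ", text).strip()     # runs of whitespace -> one space
    for sw in STOPWORDS:
        # Match the stopword only as a whole word or phrase between spaces.
        text = re.sub(rf"(?<!\S){re.escape(sw)}(?!\S)", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Tuy nhiên, trang WEB này là trang tin tức!"))
```

For the sample sentence, "tuy nhiên" and "là" are removed and the remaining words are returned in lowercase, separated by single spaces.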
3.3.1.2. Feature extraction
a. Dictionary-based word segmentation that preserves the meaning of Tokens

As analyzed in [7, 8, 10], the biggest difference between English and Vietnamese is that in English almost every individual word carries its own meaning, whereas in Vietnamese a meaningful word is usually composed of two or more syllables. For example, the English "porno film" corresponds to the Vietnamese "phim con heo", and "Viagra" corresponds to a three-syllable compound, "thuốc tăng lực". Moreover, when such Vietnamese words are split into single syllables they become ordinary words: "phim con heo" splits into "phim" (movie), "con" (child) and "heo" (pig), none of which is pornographic. An approach that learns from single syllables is therefore clearly ineffective for Vietnamese.

Among existing approaches, the segmentation in [8] is based on frequency and ignores meaning, and its results are not yet high. We propose dictionary-based segmentation, which is the point of difference from the master's thesis of Cao Nguyễn Thủy Tiên. Meaning-based segmentation requires that the lexical units (words) extracted from the text match entries in a dictionary. This approach needs a complete dictionary in order to segment all the words in the text with their full characteristic meaning.
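One common way to realize dictionary-based segmentation is greedy longest matching over syllables. The text does not say which matching strategy was used, so the strategy, the function name `segment` and the `max_len` limit of four syllables are assumptions made for this sketch.

```python
def segment(syllables, dictionary, max_len=4):
    """Greedy longest-match segmentation of a syllable list against a dictionary."""
    tokens, i = [], 0
    while i < len(syllables):
        # Try the longest window first; a single syllable is always accepted.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in dictionary:
                tokens.append(cand)
                i += n
                break
    return tokens

# Hypothetical mini-dictionary built from the example in the text.
dictionary = {"phim con heo", "phim", "con", "heo", "xem"}
print(segment("xem phim con heo".split(), dictionary))  # ['xem', 'phim con heo']
```

With the compound present in the dictionary, "phim con heo" stays a single Token instead of being split into three ordinary words.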
According to statistics on the commonly used dictionaries available online, the numbers of Vietnamese words they contain are as shown in Table 3.1.

Based on these statistics, the thesis adopts LacViet, a popular dictionary with a large number of entries. The purpose of segmentation is to extract the words that characterize a document, both good and bad; words with a pornographic meaning are usually absent from dictionaries because they tend to be very vulgar. In a test with 100 pornographic words, about 95 of them (95%) were found in the dictionary. However, to build a dictionary complete enough for accurate segmentation, the thesis extracted from 500 pornographic text files the phrases not yet in the dictionary, reviewed them manually, and added them. The resulting dictionary has the word counts shown in Table 3.2.

Table 3.2: Dictionary statistics

Dictionary         | 1-syllable words | 2-syllable words | 3-syllable words | 4-syllable words | Total
Initial dictionary | 6607 (13.58%)    | 33769 (69.42%)   | 2997 (6.16%)     | 5269 (10.84%)    | 48642
After additions    | 8211 (15.25%)    | 37365 (69.4%)    | 2997 (5.56%)     | 5265 (9.79%)     | 53838

The analysis shows that single words and two-syllable compound words account for 84.65% of the dictionary. According to [7] and [8], segmenting into single words and compound words gives good classification results, so, to simplify and lighten the segmentation of sentences, we mainly segment into single words and compound words.

b. Word training

This stage scans all the training data to find the main features, which in this thesis are the frequencies of each keyword in the two classes Bad and Good. The training process is shown in Figure 3.5.

Training is repeated until all documents in the training set have been processed.
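The training pass can be sketched as a loop that accumulates per-class counts. The function name `train_counts` and the data layout are hypothetical, and counting each Token at most once per document is an assumption, chosen because the example frequencies given later never exceed the number of pages in a class.

```python
from collections import Counter

def train_counts(labeled_docs):
    """labeled_docs: iterable of (tokens, label) pairs, label in {"bad", "good"}."""
    doc_counts = Counter()
    token_counts = {"bad": Counter(), "good": Counter()}
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        token_counts[label].update(set(tokens))  # count each token once per document
    return doc_counts, token_counts

docs = [(["a", "b", "a"], "bad"), (["b"], "good"), (["a"], "bad")]
doc_counts, token_counts = train_counts(docs)
print(doc_counts["bad"], token_counts["bad"]["a"], token_counts["good"]["b"])  # 2 2 1
```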

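3.3.1.3. The Naïve Bayes algorithm

A Bayes classifier can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.
The Naïve Bayes algorithm is based on Bayes' theorem, stated as follows:

$$ P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)} $$

Where:
- Y represents a hypothesis, which is inferred once the new evidence X is available;
- P(X): the probability that X occurs;
- P(Y): the probability that Y occurs;
- P(X|Y): the probability that X occurs given that Y occurs (the conditional probability, i.e. how likely X is to be true when Y is true);
- P(Y|X): the posterior probability of Y given X.
Applied to the classification problem, the givens are:
- D: the training data set, vectorized as X = (x_1, x_2, ..., x_n);
- C_i: class i, with i = 1, 2, ..., m;
- the attributes are pairwise conditionally independent.
By Bayes' theorem:

$$ P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)} $$

By conditional independence:

$$ P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) $$

Where:
- P(C_i|X) is the probability of class i given the sample X;
- P(C_i) is the probability of class i;
- P(x_k|C_i) is the probability that the k-th attribute takes the value x_k given that X belongs to class i.
The steps of the Naïve Bayes algorithm:
Step 1: Train Naïve Bayes (on the data set): compute P(C_i) and P(x_k|C_i).
Step 2: Classify X_new = (x_1, x_2, ..., x_n): compute the probability of each class given X_new. X_new is assigned to the class with the highest probability according to the formula

$$ \max_{C_i \in C} \left( P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i) \right) \qquad (*) $$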

The probability that a word belongs to a class is learned with the Naïve Bayes formula. For example, the probabilities for the two tokens "m o" and "anh" are computed with:

$$ P(C \mid token) = \frac{P(C)\,P(token \mid C)}{P(token)} \tag{1} $$

Suppose the frequencies of these two words in the training sample are as follows: there are 9 training web pages in total, of which 4 are good and 5 are bad, as shown in Table 3.3.

Table 3.3 Example token frequencies

FREQUENCY    | Bad | Good | Total
Total pages  | 5   | 4    | 9
Token "m o"  | 4   | 1    | 5
Token "anh"  | 3   | 3    | 6

Applying Bayes' formula gives:

P(Bad | "m o") = (5/9 × 4/5) / (5/9)
P(Good | "m o") = (4/9 × 1/4) / (5/9)
P(Bad | "anh") = (5/9 × 3/5) / (6/9)
P(Good | "anh") = (4/9 × 3/4) / (6/9)
P(Bad) = 5/9
P(Good) = 4/9
P("m o" | Bad) = 4/5
P("m o" | Good) = 1/4

Next, we classify the content X = {"m o", "anh"} using formula (*):

$$ \max_{C_i \in C} \left( P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i) \right) \qquad (*) $$

Suppose a document DOCi has the content "m o anh", segmented into 2 tokens. We compute two values:
P(DOCi|Bad) = P(Bad) × [P("m o"|Bad) × P("anh"|Bad)] = 0.266666667
P(DOCi|Good) = P(Good) × [P("m o"|Good) × P("anh"|Good)] = 0.083333333
Therefore P(DOCi|Bad) > P(DOCi|Good), so DOCi belongs to class Bad.
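The arithmetic of this worked example can be checked with exact fractions. The variable and function names below are illustrative only; the numbers are those of Table 3.3.

```python
from fractions import Fraction as F

# Frequencies from Table 3.3.
pages = {"bad": 5, "good": 4}
token_pages = {"m o": {"bad": 4, "good": 1}, "anh": {"bad": 3, "good": 3}}
total = sum(pages.values())

def score(tokens, label):
    """P(C) times the product of P(token | C), as in formula (*)."""
    s = F(pages[label], total)
    for t in tokens:
        s *= F(token_pages[t][label], pages[label])
    return s

doc = ["m o", "anh"]
scores = {label: score(doc, label) for label in pages}
print(scores["bad"], float(scores["bad"]))    # 4/15, about 0.2667
print(scores["good"], float(scores["good"]))  # 1/12, about 0.0833
print(max(scores, key=scores.get))            # bad
```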
However, if the probability of any keyword is 0, the final probability is also 0, and the document cannot be classified.
Keeping the pure Bayes formula would therefore introduce noise into the classification of a piece of content. The thesis proposes smoothing the Bayes formula by increasing the frequency of the keyword in both the good and the bad class by 1 whenever this case occurs. For example, in Table 3.4 the frequency of the word "ánh sáng" (light) in class Bad is 0, so we smooth it by adding 1 to the frequency of this keyword in both classes. The same is done for the keyword "khiêu dâm" (pornography).

Table 3.4 Example token frequencies before smoothing

FREQUENCY         | Good | Bad | Total
Total web pages   | 401  | 601 | 1002
Word "ánh sáng"   | 201  | 0   | 201
Word "khiêu dâm"  | 0    | 301 | 301

Table 3.5 Example token frequencies after smoothing

FREQUENCY         | Good | Bad | Total
Total web pages   | 401  | 601 | 1002
Word "ánh sáng"   | 202  | 1   | 203
Word "khiêu dâm"  | 1    | 302 | 303
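The smoothing rule described here can be written as a small helper. The function name `smooth` is hypothetical. Note that this rule adds 1 only to tokens that have a zero count in some class, which differs from standard Laplace (add-one) smoothing, where 1 is added to every count and the denominators are adjusted accordingly.

```python
def smooth(counts):
    """counts: {"good": int, "bad": int}; add 1 to both classes only if one is zero."""
    if min(counts.values()) == 0:
        return {label: n + 1 for label, n in counts.items()}
    return dict(counts)

print(smooth({"good": 201, "bad": 0}))  # {'good': 202, 'bad': 1}
print(smooth({"good": 0, "bad": 301}))  # {'good': 1, 'bad': 302}
```

The two calls reproduce the rows of Table 3.5 from the rows of Table 3.4.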

3.3.5. Classification and recognition stage

This stage is carried out when the filter cannot determine the class of a website from its URL. The filter then fetches the content of the website, strips the HTML formatting, removes common words such as "thì", "là", "mà", "các", "những" and connectives such as "tuy nhiên", "vì thế", "mặc dù", "không những", "mà còn", and removes special characters such as @, #, $, ?, &, in order to speed up segmentation.
Once these components have been removed, the content is split into normalized simple sentences in which words are separated by exactly one space, so that segmentation is more accurate. Single words and compound words (two adjacent single words) are then extracted. The resulting list of single and compound words is treated as the Token list of the content. Each Token is checked against the content Tokens in the database: if it is absent, it is discarded and the next Token is checked; if it is present, it is added to the list of characteristic Tokens of the document. The Naïve Bayes formula is then applied to compute the probabilities for the content: if the GOOD probability is greater than the BAD probability, the content of the website is judged healthy, and vice versa. The model for classifying and recognizing website content is shown below.

This is the most important stage, since it determines whether a text is pornographic. The Naïve Bayes method is applied as follows:

+ Let the content to be classified be X_new.

+ There are 2 classes C_i: C1, the healthy class (good), and C2, the pornographic class (bad).

+ The probability that a text is pornographic: P(bad | X_new)

+ The probability that a text is healthy: P(good | X_new)

+ Suppose token_1, token_2, ..., token_n are the features that appear in the text X = (token_1, token_2, ..., token_n). X is assigned to the class with the highest probability according to the formula

$$ \max_{C_i \in C} \left( P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i) \right) $$

Where x_k are the features that appear in the content.

P(X_new|bad) and P(X_new|good) are computed as:

$$ P(X_{new} \mid bad) = \prod_{k=1}^{n} P(x_k \mid bad) $$

$$ P(X_{new} \mid good) = \prod_{k=1}^{n} P(x_k \mid good) $$

We compare P(X_new|bad) and P(X_new|good); X_new is assigned to the class with the higher probability.

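In practice, because a text contains a large number of tokens, the product above becomes extremely small and is unsuitable for computation. We therefore take the logarithm of the formula and use the following in our calculations:

$$ \log P(X_{new} \mid C_i) = \sum_{k=1}^{n} \log P(x_k \mid C_i) $$

P(bad) and P(good) are computed as:

$$ P(bad) = \frac{N_{bad}}{N_{bad} + N_{good}} $$

$$ P(good) = \frac{N_{good}}{N_{bad} + N_{good}} $$

where N_bad and N_good are the numbers of bad and good documents in the training set (as in the example of Table 3.3, where P(Bad) = 5/9 and P(Good) = 4/9).

In the same way, the class to which the content is assigned is given by:

$$ \max_{C_i \in C} \left( \log P(C_i) + \sum_{k=1}^{n} \log P(x_k \mid C_i) \right) $$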
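The log-space decision rule can be sketched as follows. This is an illustration under stated assumptions rather than the thesis implementation: the function name `classify` and the data layout are hypothetical, tokens unknown to every class are skipped (as in the database lookup described earlier), and the thesis-style add-one rule is applied when a token has a zero count in one class.

```python
import math

def classify(tokens, doc_counts, token_counts):
    """Return (best_label, log_scores) using log P(C) + sum of log P(x_k | C)."""
    labels = list(doc_counts)
    total_docs = sum(doc_counts.values())
    scores = {c: math.log(doc_counts[c] / total_docs) for c in labels}
    for t in tokens:
        counts = {c: token_counts[c].get(t, 0) for c in labels}
        if not any(counts.values()):
            continue  # token is not in the token database: ignore it
        if min(counts.values()) == 0:
            # Add 1 in every class; the ratio below is then a score, not a strict probability.
            counts = {c: n + 1 for c, n in counts.items()}
        for c in labels:
            scores[c] += math.log(counts[c] / doc_counts[c])
    return max(scores, key=scores.get), scores

# Frequencies from Table 3.3.
doc_counts = {"bad": 5, "good": 4}
token_counts = {"bad": {"m o": 4, "anh": 3}, "good": {"m o": 1, "anh": 3}}
label, scores = classify(["m o", "anh", "unknown"], doc_counts, token_counts)
print(label)  # bad
```

Exponentiating the two log scores recovers the values 4/15 and 1/12 of the worked example, so the ordering of the classes is unchanged by the logarithm.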
CHAPTER 4. EXPERIMENTS AND EVALUATION

4.1. Test environment
The filter was installed on a computer with a Core i5 processor and 4 GB of RAM, running Windows 8, SQL Server 2008 R2, Visual Studio .NET 2010 and other supporting software.
4.2. Program interface
4.2.1. Main interface
a. Start-up and login screens

Figure 4.1 Filter start-up screen

Figure 4.2 Login screen

b. Screen when a good web address is entered

Figure 4.3 Screen when browsing a good web address

c. Screen when a bad web address is entered

Figure 4.4 Screen when browsing a bad web address

d. System configuration screen

Figure 4.5 List of bad and good web addresses

Figure 4.6 System functions screen

4.2.2. Word learning screen for obtaining the tokens used to classify website content

Figure 4.7 Training screen for single words and compound words

4.2.3. Screen for reviewing single-word tokens to add to the Token list

Figure 4.8 Single-word Token review screen

4.2.4. Screen for reviewing compound-word tokens to add to the Token list

Figure 4.9 Compound-word Token review screen

4.2.5. List of word tokens used to classify website content

Figure 4.10 Review screen for single-word and compound-word Tokens

4.2.6. URL token extraction screen

Figure 4.11 URL Token training screen

4.2.7. List of URL tokens used to classify website URLs

Figure 4.12 List of URL Tokens after training
4.3. Data collection
4.3.1. Collecting data for the URL TOKEN database

To build the URL TOKEN database used to classify website URLs, more than 300 addresses with pornographic content were collected and fed into the filter for training, which yielded nearly 400 TOKENs as the database for URL classification. The URL TOKEN database obtained from training is shown in Figure 4.12.

Figure 4.13 Collected URLs

4.3.2. Collecting data for the content TOKEN database

For the content TOKEN database, a training set totalling 500 unhealthy documents and 500 healthy documents was collected from the content of the following main websites: www.hang9x.com, www.sexviet.com, conheo.com, lauxanh.us, sexvnonline.com, aitinhviet.com, vnexpress.net, vnn.vn, tuoitre.com.vn, and sex education websites.

Figure 4.14 Collected good files

Figure 4.15 Collected bad files

Training on the 500 good files and 500 bad files produced 8400 single words and 92752 compound words. The Naïve Bayes formula was applied to compute the probabilities of these single and compound words, and those with low frequency were removed. The remaining words were compared against the dictionary to remove meaningless words (those not in the dictionary), leaving 893 Tokens (single words and compound words) as the database of characteristic Tokens for classifying website content.

Figure 4.16 Content Token database after training

4.4. Evaluation of experimental results

Because text features are extracted against the training database, the number of tokens is large. To avoid spending time on unnecessary tokens that have little influence on classification, we sort the tokens in descending order of frequency (based on the training database) and experiment to find N, the smallest number of top tokens that can be used without affecting the classification.

URL test

The test data consist of 100 URLs, of which 60 are actually bad. The filter flagged only 20 URLs as bad, and of these only 18 were truly bad. Bad URLs that the URL check fails to detect are passed on to the content check.
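The URL test figures can be restated as precision and recall. These two metrics are not named in the text; they are introduced here only to summarize the reported counts (60 bad URLs, 20 flagged, 18 flagged correctly).

```python
actual_bad = 60       # bad URLs in the test set
flagged = 20          # URLs the URL module flagged as bad
true_positives = 18   # flagged URLs that were really bad

precision = true_positives / flagged   # share of flagged URLs that are truly bad
recall = true_positives / actual_bad   # share of bad URLs caught by the URL module

print(f"precision={precision:.0%}, recall={recall:.0%}")  # precision=90%, recall=30%
```

The low recall of the URL check alone is why undetected URLs are handed to the content module.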

Content test

The test data consist of 300 good files and 300 bad files collected from news websites and from websites with pornographic content. After processing by the filter, the experimental results are as follows:

Table 4.1 Experimental results on content files

Tokens used to classify a file | Good files classified | Bad files classified | Accuracy, Good (%) | Accuracy, Bad (%)
100 tokens                     | 125/300               | 131/300              | 41.6%              | 43.7%
200 tokens                     | 213/300               | 235/300              | 71%                | 78.3%
300 tokens                     | 275/300               | 279/300              | 98.3%              | 99.3%
500 tokens                     | 295/300               | 295/300              | 98.3%              | 99.3%
All tokens                     | 295/300               | 295/300              | 98.3%              | 99.3%

The experimental results show that the optimal N is about 500 of the highest-frequency tokens for the classification computation.

Comparing these results with [7][8], we see that classification based on segmenting words according to their meaning gives lower results than classification based on segmentation that ignores meaning and relies on word frequency.
CONCLUSION AND FUTURE WORK

Conclusion:
Through the study and research carried out, the thesis has achieved the following results:
- Surveyed and summarized methods for word segmentation, feature selection and text classification;
- Successfully applied the Naïve Bayes algorithm to classifying the URL and the Text Content of a website being accessed;
- Collected data from websites for training and built a feature data set (Tokens) of nearly 400 words for URLs and nearly 900 words for Text Content, used to classify the websites being accessed.

Future work
- Study the integration of the filter into popular web browsers to improve the practical applicability of the work;
- Improve the word segmentation algorithm to reduce the processing time of the filter when classifying website content.
REFERENCES

Vietnamese:
[1] Chu Anh Minh (2009), The keyword extraction problem for web pages using HTML tag analysis and graph methods (in Vietnamese), undergraduate thesis.
[2] Bùi Nguyên Khởi (2009), A study of several improved classification methods applied to text classification (in Vietnamese), master's thesis, University of Science.
[3] Đỗ Phúc (2005), Data Mining (textbook, in Vietnamese), University of Information Technology, Ho Chi Minh City.
[4] Hà Quang Thụy, Phan Xuân Hiếu, Đoàn Sơn (2009), Web Data Mining (textbook, in Vietnamese), Vietnam Education Publishing House.
[5] Nguyễn Linh Giang, Nguyễn Mạnh Hiển, Vietnamese text classification with the support vector machine (SVM) classifier (article, in Vietnamese).
[6] Nguyễn Thanh Hùng, A new approach to word segmentation for Vietnamese text classification using genetic algorithms and Internet statistics (in Vietnamese), Journal of Posts and Telecommunications.
[7] Nguyễn Cao Thủy Tiên (2011), Building a filter to detect websites with unhealthy content (in Vietnamese), master's thesis in information technology.
[8] Phan Hữu Tiếp, Vũ Đức Lung, Cao Nguyễn Thủy Tiên, Lâm Thành Hiển, A Vietnamese spam filtering method based on compound words and user tracking (in Vietnamese), Proceedings of the 14th National Conference, Theme: Decision support systems, Science and Technics Publishing House, Hanoi, 2012, pp. 463-473.
[9] Trần Thị Thảo (2013), Building a solution to support filtering forum posts (in Vietnamese), master's thesis in information technology.
[10] Vu Duc Lung, Truong Nguyen Vu, Bayesian spam filtering for Vietnamese emails, Proceedings of the International Conference on Computer & Information Science (ICCIS), 2012, ISBN: 978-1-4673-1937-9, Vol. 1, pp. 190-193.
English:
[11] Chih-Hao Tsai, MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm. Web publication at http://technology.chtsai.org/mmseg/, 2000.
[12] Dinh Dien, Hoang Kiem, Nguyen Van Toan (2001), Vietnamese Word Segmentation, Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium (NLPRS2001), pp. 749-756, Tokyo.
[13] H. Nguyen, T. Vu, N. Tran, K. Hoang (2005), Internet and Genetics Algorithm-based Text Categorization for Documents in Vietnamese, Research, Innovation and Vision of the Future, the 3rd International Conference in Computer Science (RIVF 2005).
[14] Le An Ha, A method for word segmentation in Vietnamese, Proceedings of Corpus Linguistics 2003, Lancaster, UK, 2003.
[15] K.-M. Schneider, On word frequency information and negative evidence in Naive Bayes text classification, in 4th International Conference on Advances in Natural Language Processing, pp. 474-485, Alicante, Spain, 2004.
[16] T. Hofmann, Probabilistic LSA, Proc. UAI, 1999.
[17] Thorsten Joachims, Text categorization with Support Vector Machines: Learning with many relevant features, Technical Report 23, LS VIII, University of Dortmund, 1997.
[18] Yang and Chute (1994), An example-based mapping method for text categorization and retrieval, ACM Transactions on Information Systems (TOIS), pp. 252-277.
[19] Yang and Liu (1999), A re-examination of text categorization methods, Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99).
Internet:
[20] http://vietnamesecommunity.wordpress.com/2013/03/21/khieu-dam-la-gi/
[21] http://xahoithongtin.com.vn/2013061309041378p0c109/truy-cap-web-khieu-dam-ban-mat-gi.htm
[22] http://news.go.vn/xa-hoi/tin-1262779/nhieu-website-ket-noi-noi-dung-khieu-dam-tinh-duc.htm
[23] http://vn.antoan.yahoo.com/qua-n-ly-n%E1%BB%99i-dung-ti-m-ki%C3%AA-152125467.html
[24] http://www.baomoi.com/Website-khieu-dam-va-cac-chieu-lua/76/4400420.epi
[25] http://www.gltec.com.vn/tin-tuc/68-internet/2830-web-ngi-ln-ngay-cang-thu-hut-c-nhiu-qtin-q.html
[26] http://xahoithongtin.com.vn/2013061309041378p0c109/truy-cap-web-khieu-dam-ban-mat-gi.htm
[27] http://baohay.vn/chuyen-de/nhung-dieu-can-biet/288247/Web-sex-dang-tro-thanh-mon-giai-tri-o-chon-cong-so.html
[28] http://vi.wikipedia.org/wiki/Internet_t%E1%BA%A1i_Vi%E1%BB%87t_Nam
[29] http://vn.antoan.yahoo.com/qua-n-ly-n%E1%BB%99i-dung-ti-m-ki%C3%AA-152125467.html
[30] http://www.gltec.com.vn/tin-tuc/68-internet/2830-web-ngi-ln-ngay-cang-thu-hut-c-nhiu-qtin-q.html