
VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyễn Thị Hải Yến

SEMI-SUPERVISED CLASSIFICATION AND APPLICATION OF THE SVM ALGORITHM TO WEB PAGE CLASSIFICATION

UNDERGRADUATE THESIS (FULL-TIME PROGRAM)

Major: Information Technology

HANOI - 2007

Supervisor: Assoc. Prof. Dr. Hà Quang Thụy

Co-supervisor: MSc. Đặng Thanh Hải

ACKNOWLEDGMENTS
First of all, I would like to express my sincere and deepest gratitude to my supervisors, Assoc. Prof. Dr. Hà Quang Thụy and MSc. Đặng Thanh Hải, who guided, encouraged, and helped me wholeheartedly throughout this work.
I am deeply grateful to the lecturers of the Faculty of Information Technology for the valuable knowledge they passed on to me over the past years.
I thank the members of the data mining seminar group for their enthusiastic advice while I was writing this thesis.
I am grateful to my grandparents and parents, who have always cared for and encouraged me at every step of my education.
My sincere thanks go to my friends, especially the members of class K48CD, who supported, helped, and encouraged me during four years of university study and during this work.
Although I have tried to complete the thesis within the limits of my scope and ability, shortcomings are unavoidable. I would appreciate the understanding and advice of my teachers and friends.
Thank you very much!
Hanoi, 31 May 2007
Student
Nguyễn Thị Hải Yến

ABSTRACT
With today's large volumes of data, data classification plays a very important role and is a perennial problem in text data processing. A basic requirement is to make classification algorithms more effective, raising their recall and precision. At the same time, labeled training examples are not always available, so classification algorithms that can use unlabeled examples are needed. Semi-supervised classification meets both requirements [5, 7, 8, 16, 17]. Semi-supervised classification algorithms exploit the abundant unlabeled data that occurs naturally, combined with a small amount of given labeled data.
In recent years, the Support Vector Machine (SVM) classifier has attracted much attention and is widely used in recognition and classification. Published work [4, 7, 8, 11] shows that SVM classifies fairly well in text classification as well as in many other applications.
In this thesis, I survey the semi-supervised SVM learning algorithm and present the SVMlin software proposed by V. Sindhwani [18]. In 2006-2007, V. Sindhwani used SVMlin to classify documents from the 20Newsgroups collection, with good results [14, 15].

TABLE OF CONTENTS
INTRODUCTION
Chapter 1. OVERVIEW OF SEMI-SUPERVISED CLASSIFICATION
1.1. Data classification
1.1.1. The data classification problem
1.1.2. The data classification process
1.2. Text classification
1.2.1. Problem statement
1.2.2. The vector model for document representation
1.2.3. Text classification methods
1.2.4. Applications of text classification
1.2.5. Steps in the text classification process
1.2.6. Evaluating a classification model
1.2.7. Important factors affecting text classification
1.3. Some machine learning algorithms for classification
1.3.1. Supervised learning
1.3.1.1. The supervised learning problem
1.3.1.2. Introduction to supervised learning
1.3.1.3. The k-nearest neighbor (kNN) supervised learning algorithm
1.3.1.4. The Support Vector Machine (SVM) supervised learning algorithm
1.3.2. Classification algorithms using semi-supervised learning
1.3.2.1. Concept
1.3.2.2. A brief history of semi-supervised learning
1.3.2.3. Some typical semi-supervised learning methods
Chapter 2. USING SVM AND SEMI-SUPERVISED SVM FOR THE CLASSIFICATION PROBLEM
2.1. SVM - Support Vector Machine
2.1.1. The SVM algorithm
2.1.2. Training SVM
2.1.3. Advantages of SVM in text classification
2.2. Semi-supervised SVM and Web page classification
2.2.1. Introduction to semi-supervised SVM
2.2.2. Web page classification using semi-supervised SVM
2.2.2.1. Introduction to the Web page classification problem (Web Classification)
2.2.2.3. Applying S3VM to Web page classification
Chapter 3. EXPERIMENTS WITH SEMI-SUPERVISED LEARNING FOR WEB PAGE CLASSIFICATION
3.1. Introduction to the SVMlin software
3.2. Downloading SVMlin
3.3. Installation
3.4. Using the software
CONCLUSION
What the thesis has accomplished
Future research directions
REFERENCES
I. Vietnamese
II. English

LIST OF TABLES AND ABBREVIATIONS

Abbreviation    Full term
kNN             k Nearest Neighbor
SVM             Support Vector Machine
S3VM            Semi-Supervised Support Vector Machine

LIST OF FIGURES
Figure 1. The classification problem.
Figure 2. A document represented as a feature vector.
Figure 3. Framework of the text classification process.
Figure 4. Hyperplane h separates the training data into two classes, + and -, with the maximum margin. The points closest to h are the support vectors (circled).
Figure 5. The Self-training semi-supervised learning method.
Figure 6. The Co-training semi-supervised learning method.

INTRODUCTION
In recent years, the rapid development of information technology has considerably increased the volume of information exchanged over the Internet, especially through digital libraries and online news. As a result, the number of documents on the Internet is growing at a dizzying pace, and information changes extremely quickly. With such a volume of digital information, a major requirement is how to organize and search information and data most effectively. Classification is one reasonable solution to this requirement. In practice, however, the volume of information is too large for manual classification to be possible. The way forward is a computer program that classifies this information automatically.
Automatic classification, however, runs into a difficulty: building a highly reliable classifier requires a large number of training samples, that is, documents already labeled with their classes. Such training data is usually scarce and expensive, because producing it takes human time and effort. A learning method is therefore needed that does not require much labeled data and can exploit the abundant unlabeled data now available; that method is semi-supervised learning. Semi-supervised learning uses the information contained in both the unlabeled data and the labeled training set, and it is widely used because of this convenience.
This thesis therefore focuses on classification using semi-supervised learning, and on applying the semi-supervised Support Vector Machine (SVM) algorithm to Web page classification.
The thesis consists of three chapters, organized as follows:

Chapter 1, Overview of semi-supervised classification. The first part outlines the data classification problem, text classification, and some preliminaries on supervised learning. The last part of the chapter introduces the basics of semi-supervised learning, including some typical semi-supervised learning algorithms.
Chapter 2, Using SVM and semi-supervised SVM for the classification problem. The thesis presents the basic steps of the SVM algorithm and then studies the semi-supervised SVM learning algorithm, an improvement of SVM presented in [11]. The last part of the chapter describes some applications of semi-supervised learning to Web page classification.
Chapter 3, Experimental Web page classification system and evaluation. This chapter presents V. Sindhwani's research results and the open-source SVMlin software [14, 15, 18], which he proposed and published. These studies show that SVMlin achieves high accuracy in semi-supervised text classification.

Chapter 1

OVERVIEW OF SEMI-SUPERVISED CLASSIFICATION

1.1. Data classification

1.1.1. The data classification problem
Data classification is the process of assigning a data object to one or more predefined classes using a classification model, where the model is built from a set of previously labeled data objects called the learning set (training set) [1-3]. The classification process is also called labeling the data objects.
The task of the data classification problem is thus to build a classification model (classifier) so that, when a new data item arrives, the model tells which class it belongs to.
There are several kinds of data classification problems, such as binary classification, multi-class classification, and multi-label classification.
Binary classification assigns each data item to one of two classes, depending on whether the data has certain properties defined by the classifier.
Multi-class classification involves more than two classes. The data in the domain under consideration is divided into several classes, not just two as in binary classification. In essence, binary classification is a special case of multi-class classification.
In multi-label classification, each data object in the training set, as well as each newly classified object, may belong to two or more classes. For example, a Web page about an outbreak of avian influenza among poultry and waterfowl in several northern provinces belongs both to the health domain, because of transmission to humans, and to the economic domain, because of its impact on the livestock industry. In such cases, placing a document in more than one class matches practical requirements.
Next we outline the data classification process and briefly introduce data classification methods.

1.1.2. The data classification process

Figure 1. The classification problem

The data classification process usually has two steps: building the model (creating the classifier) and using the model to classify data.
Step 1: a model is built by analyzing previously labeled data objects. This set of samples is called the training data set. The class labels of the training data are determined by people before the model is built, which is why this approach is called supervised learning. In this step we must also estimate the accuracy of the model, which requires a separate test data set. If the accuracy is acceptable (that is, high), the model is used to determine class labels for new data in the future. Testing the model uses measures of classifier quality: recall, precision, the F1 measure, and so on. These measures are described in detail in Section 1.2.6.
Many data classification methods exist, differing in how the classification model is built: the Bayes method, decision trees, k-nearest neighbors, support vector machines, and others. The methods differ mainly in the classification model. The classification model is also called the classification algorithm.
Step 2: the model built in Step 1 is used to classify new data.
A classification algorithm is thus a mapping from the data domain to a set of specific values of the class attribute, based on the values of the data's attributes.

1.2. Text classification
1.2.1. Problem statement
Paper-based transactions are increasingly being digitized into documents stored on computers or transmitted over networks. Digital documents have many advantages: compact storage, long retention, convenient exchange (especially over the Internet), and easy editing. The number of digital documents is therefore growing rapidly, especially on the World Wide Web, and the need to search them grows with it. Traditionally, documents are classified manually: we read each document, examine it, and then assign it to a specific class. This takes a great deal of human time and effort, and because the number of documents is enormous, assigning every document to a class by hand is not feasible. With digital documents in such numbers, automatic text classification is an urgent need.
So what is text classification? Text classification (Text Categorization) is classification applied to text data: assigning a document to one or more document classes using a classification model that is built from a set of previously labeled documents.
Text classification has received much attention and has been widely studied in recent years.

1.2.2. The vector model for document representation

As described above, the first step in text classification is to convert a document, given as a sequence of words, into another model that suits the classification algorithms.
Documents are usually represented with the vector model: each document is represented by a weight vector. The idea is to write each document $D_i$ as

\[ D_i = (\vec{d_i}, i), \]

where $i$ is the index that identifies the document and $\vec{d_i}$ is its feature vector:

\[ \vec{d_i} = (w_{i1}, w_{i2}, \ldots, w_{in}), \]

where $n$ is the number of features of the document vector and $w_{ij}$ is the weight of the $j$-th feature, $j \in \{1, 2, \ldots, n\}$.

When converting a document to vector form, the key questions are how to select features and how many dimensions the vector space should have: how many words to choose, which words, and by what method.
The choice of document representation for a classification problem depends on how suitable it is for the problem at hand and on the evaluation measures of the classification method used. For example, if the document is a Web page, feature selection will differ from that for other kinds of documents.
Characteristics of documents represented as vectors:
- The feature space usually has many dimensions. The longer the documents and the more topics they cover, the larger the feature space.
- The features are independent of one another; combinations of features usually carry no meaning for classification.
- The features are sparse: the feature vector $\vec{d_i}$ may contain many zero components, because many features do not occur in document $D_i$ (when binary values 1 and 0 mark whether a feature occurs in the document). A purely binary representation limits classification quality somewhat, because a feature word may be absent from a document that nonetheless contains a different keyword with the same meaning. An alternative is to use real-valued weights instead of binary values, which reduces the sparsity of the document vector to some extent.
- Most documents can be separated linearly by linear functions.

The length of the vector is thus the number of keywords that occur in at least one training sample. Before weighting the keywords, stop words must be removed. Stop words are words that occur frequently but are not useful for indexing and carry no meaning for text classification. Examples in Vietnamese include "và", "là", "thì", "như vậy"; examples in English include "and", "or", "the". Stop words are typically adverbs, conjunctions, and prepositions.
A document can be represented as a weight vector as in the following example:

[Figure 2: a short Vietnamese news passage about hacker software and computer viruses, shown next to its feature vector over vocabulary terms such as "software", "hacker", "virus", "speed", "generation", "event", "user", and "computer", alongside terms that do not occur in the passage, such as "car", "screen", "TV", and "beer".]

Figure 2. A document represented as a feature vector
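
The vector model described in this section can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the whitespace tokenizer, and the function names `build_vocabulary` and `to_vector` are assumptions made for the example, not part of the thesis, and whitespace tokenization in particular is not adequate for Vietnamese.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"and", "or", "the", "a", "of", "to", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_vocabulary(documents):
    """Collect every keyword that occurs in at least one training document."""
    return sorted({t for doc in documents for t in tokenize(doc)})

def to_vector(document, vocabulary, binary=False):
    """Map a document to a weight vector (w_1, ..., w_n) over the vocabulary.

    With binary=True the weights are 0/1 occurrence flags;
    otherwise they are raw term frequencies.
    """
    counts = Counter(tokenize(document))
    if binary:
        return [1 if counts[t] > 0 else 0 for t in vocabulary]
    return [counts[t] for t in vocabulary]

train_docs = ["the hacker wrote a virus", "the virus spread to the computer"]
vocab = build_vocabulary(train_docs)
vector = to_vector("virus and virus", vocab)
print(vocab)
print(vector)
```

The vector has one component per vocabulary keyword, so most components are zero for a short document, which is the sparsity noted above.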

Representing Web pages

Web pages are in essence hypertext. Besides text and multimedia components, Web pages include features such as hyperlinks, HTML tags, and metadata. Most studies show that the textual components of Web pages provide the main information for Web classification, while the non-textual components can be used to improve classification performance [6, 9].
There are many ways to represent Web pages, and each purpose calls for its own representation. Search engines such as Yahoo, Altavista, and Google do not use the vector model; they use a system of linked keywords, which does not represent the content of the document. Representing whole Web sites is an approach that currently attracts much attention worldwide: the object of interest is not the Web page but the Web site, so the search target is no longer a single Web page but an entire Web site [2, 9].
Traditional text processing has generally carried out tasks such as representation, search, and classification by treating Web pages as ordinary text documents and using the vector space model to represent them. Hyperlinks between Web pages provide information about how the contents of pages relate, which can be used to improve classification and search; this exploits the strength of hyperlinks in hypertext. Some researchers have proposed improving the representation by adding keywords that occur in neighboring Web pages, or keywords that occur in the text surrounding a hyperlink.
In this thesis we study the representation of Web pages by the vector model, because it is very widely used. Because link information is used to increase the accuracy of both search and Web page classification, information about neighboring Web pages must be added to the vector representing the page under consideration.
There are four ways to represent a Web page in the vector model [2]:
The first way
Each keyword in a Web page is stored together with its frequency in the page. This ignores all information about the position of keywords in the page, the order of words, and hyperlinks.
When the linked documents are independent of the class labels, this representation is the best choice. In some cases, however, it fails to exploit the regularities in hyperlinked documents.
The second way
Use the page's link information, that is, the links from it to neighboring pages, to create a super document. The vector includes the words that occur in the page together with all words that occur in its neighboring pages, along with their frequencies. This ignores the position of words in the page and their order.
The drawback is that it dilutes the content of the page of interest. It is nevertheless a good choice for representing a set of Web pages on the same topic; but since relatively few linked Web pages share a topic, this representation is rarely used.
The third way
Use a structured vector to represent the Web page. A structured vector is logically divided into two or more parts. Each part represents one set of neighboring pages. The vector has fixed length, but each part represents only the words that occur in one particular set.
This prevents the neighboring pages from diluting the content of the page itself. If the information in the neighboring pages is useful for classifying a page, the learner can still access their full content.
The fourth way
Build a structured vector:
1. Determine a number d, taken to be the highest degree of the pages in the collection.
2. Build a structured vector with d + 1 parts as follows:
a. The first part represents the document of the Web page itself.
b. The following parts, up to part d + 1, represent its neighboring documents, one document per part.
Comparing these four representations, most vector representations that incorporate information about neighboring pages give better classification results than the representation based only on word frequencies.
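
The difference between the super-document representation (the second way) and a structured vector (the third way) can be illustrated with a short Python sketch. The vocabulary, the sample page texts, and the function names are hypothetical and chosen only for the example; the structured vector here uses two parts, one for the page and one for all of its neighbors together.

```python
from collections import Counter

def term_counts(text, vocabulary):
    """Term-frequency vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]

def super_document_vector(page, neighbors, vocabulary):
    """Second way: merge the page and its neighbors into one bag of words."""
    return term_counts(" ".join([page] + neighbors), vocabulary)

def structured_vector(page, neighbors, vocabulary):
    """Third way: one part for the page itself, a separate part for its neighbors."""
    own_part = term_counts(page, vocabulary)
    neighbor_part = term_counts(" ".join(neighbors), vocabulary)
    return own_part + neighbor_part

vocabulary = ["svm", "web", "football"]   # hypothetical vocabulary
page = "svm web svm"                      # hypothetical page text
neighbors = ["football football web"]     # hypothetical neighboring page

print(super_document_vector(page, neighbors, vocabulary))
print(structured_vector(page, neighbors, vocabulary))
```

In the merged vector the neighbor's word "football" is indistinguishable from the page's own content, whereas the structured vector keeps the two sources in separate parts, so the page's content is not diluted.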

1.2.3. Text classification methods

As noted, there are many text classification methods, such as the Bayes method, decision trees, k-nearest neighbors, and support vector machines [1-3].
Automatic text classification tools are usually built with machine learning algorithms. There are also more specialized, relatively mechanical algorithms for particular kinds of documents: for example, when the system finds a specific phrase in a document, it assigns the document to a certain class. For documents with less distinctive features, however, classification must be based on the content of the documents and on how well they match documents already classified by people. This is the main idea of machine learning. In this model, documents are classified in advance, and the system must find the features that characterize the documents of each group. The set of sample documents used for training is called the training set (train set) or pattern set, and the process in which the machine finds the features of the groups is called learning. After learning, users submit new documents, and the machine's task is to determine which of the groups it was trained on each document fits best.

1.2.4. Applications of text classification

One of the most important applications of text classification is document retrieval. From a classified data set, documents are indexed by their classes. Users can specify, through their queries, the topic or class of documents they want to find [2, 3].
Another application is filtering documents, or parts of documents, that contain the data sought, without losing the complexity of natural language.
Text classification also has many practical applications, typically in filtering information on the Internet. Many commercial advertising pages, as well as politically subversive or culturally harmful pages, try to increase their traffic by slipping into search engine results or by flooding our mailboxes periodically, causing much annoyance. Concrete applications include spam mail filtering and filtering of subversive or harmful Web pages.
Text classification is thus an indispensable tool now that information technology is growing so strongly, and it deserves attention so that useful tools can be built to help information systems continue to develop.

1.2.5. Steps in the text classification process

The text classification process goes through four basic steps [1]:
Indexing: raw documents must be converted to some representation for processing. This is called document representation; the representation must be structured and easy to process. Here documents are represented in the most common form, the weight vector. Indexing speed plays an important role in text classification.
Determining the classifier: one must state how the class of each document is determined from its representation. Whereas queries are transient, the classifier is used stably and over the long term.
Comparison: in most classifiers, each document is required to be assigned true or false with respect to a given class.
Feedback (adaptation): feedback plays two roles in a text classification system. First, classification requires a large number of documents sorted by hand beforehand; these serve as training samples for building the classifier. Second, the requirements of such a classifier are not easy to change, so users inform the system maintainer when they want certain document classes removed, added, or changed.

The following figure shows a framework for text classification with three main stages:
Stage one: document representation, that is, converting the text data into some structured form and collecting the given samples into a training set.
Stage two: applying machine learning techniques to learn from the training samples just represented. The representation produced in stage one is thus the input to stage two.
Stage three: adding further knowledge supplied by the user to increase accuracy in document representation or in the learning process.

Figure 3. Framework of the text classification process

1.2.6. Evaluating a classification model

We cannot claim that any particular text classification method is completely accurate; every method makes errors, whether many or few. Measures of the effectiveness of a classification algorithm therefore help us determine which model is best and which is worst, and so decide which algorithm to apply. The general formulas for evaluating the accuracy of the algorithms are given below.
Recall, precision, and the F1 measure are used to evaluate the quality of a classification algorithm:

\[ \text{recall} = \frac{\text{true\_positive}}{\text{true\_positive} + \text{false\_negative}} \times 100\% \tag{1.1} \]

\[ \text{precision} = \frac{\text{true\_positive}}{\text{true\_positive} + \text{false\_positive}} \times 100\% \tag{1.2} \]

\[ F_1(\text{recall}, \text{precision}) = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \tag{1.3} \]

In words:

Recall = (number of documents correctly assigned to the positive class) / (total number of documents that actually belong to the positive class)
Precision = (number of documents correctly assigned to the positive class) / (total number of documents assigned to the positive class)
Evaluation criterion (F1) = (2 * recall * precision) / (recall + precision)
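
As a worked illustration, the sketch below computes the three measures for a small hand-made example. It uses the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP), as fractions rather than percentages; the function name and the sample labels are assumptions made for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here two of the three positive documents are found (recall 2/3) and two of the three documents predicted positive are correct (precision 2/3), so F1 is also 2/3.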

1.2.7. Important factors affecting text classification

Text classification now plays a very important role in the development of information technology. However, different kinds of documents differ in complexity, so classifiers differ in what they can achieve, which leads to different classification results. Three important factors affect the results:
A standard, large training data set is needed for the classification learning algorithm. With such a data set, training goes well and the learned classifier gives good results.
Most of the methods above use the vector model to represent documents, so the word segmentation method plays an important role in building the vectors. For some languages, such as English, segmentation simply relies on whitespace; in languages such as Vietnamese, where a word may consist of several syllables, and in some other languages, splitting on whitespace is inaccurate. Word segmentation is therefore an important factor.
The classification algorithm must have reasonable processing time, including learning time and document classification time. It should also be incremental (incremental function), meaning that when new documents are added to the data set it classifies only the new documents rather than reclassifying the whole collection, and it should be able to reduce noise when classifying documents.

1.3. Some machine learning algorithms for classification

1.3.1. Supervised learning
1.3.1.1. The supervised learning problem
The goal is to learn a mapping from $x$ to $y$, given a training set of pairs $(x_i, y_i)$, where the $y_i$ are called the labels of the samples $x_i$. If the labels are numbers, $y = (y_i)^T_{i \in [n]}$ denotes the column vector of labels. A standard assumption is that the pairs $(x_i, y_i)$ are sampled i.i.d. (independent and identically distributed random variables) from a distribution over $X \times Y$ [15].

1.3.1.2. Introduction to supervised learning

Supervised learning is a machine learning technique for building a function from training data. The training data consists of pairs of input objects (usually vectors) and the desired outputs. The output of the function can be a continuous value (regression) or a predicted class label for the input object (classification). The task of a supervised learner is to predict the value of the function for any valid input object after seeing a number of training examples (that is, pairs of inputs and corresponding outputs). To do this, the learner must generalize from the available data to unseen situations in a reasonable way.
Solving a supervised learning problem involves several steps:
Determine the type of training examples. Before anything else, the person doing the classification should decide what kind of data will be used as examples: for instance a single handwritten character, an entire handwritten word, or an entire line of handwriting.
Gather a training set. The training set needs to be representative of the real-world use of the function. A set of input objects is therefore collected together with the corresponding outputs, either from experts or from measurements.
Determine the input feature representation of the function to be learned. The accuracy of the learned function depends strongly on how the input objects are represented. Typically, an input object is converted into a feature vector containing a number of features that describe the object. The number of features should not be too large, because of the curse of dimensionality, but must be large enough to predict the output accurately.
Determine the structure of the function to be learned and the corresponding learning algorithm. For example, the practitioner may choose artificial neural networks or decision trees.
Complete the design. The designer runs the learning algorithm on the gathered training set. Parameters of the learning algorithm may be tuned by optimizing performance on a subset of the training set (called the validation set) or through cross-validation. After learning and parameter tuning, the performance of the algorithm can be measured on a test set that is independent of the training set.
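
The cross-validation mentioned in the last step can be sketched as follows. This is a minimal illustration of plain k-fold splitting over sample indices, without shuffling or stratification; the function name `k_fold_indices` is an assumption made for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Samples are split into k contiguous folds whose sizes differ by at most 1.
    Each fold serves once as the validation set while the rest is used for training.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(7, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A parameter setting would be scored by training on each `train` part, evaluating on the matching `validation` part, and averaging the k scores; the independent test set is used only once, after tuning is finished.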

1.3.1.3. The k-nearest neighbor (kNN) supervised learning algorithm

There are many supervised learning algorithms; here I introduce a typical one, k-nearest neighbor (kNN).
kNN is a well-known traditional method based on the statistical approach and has been studied for many years. It is regarded as one of the best methods used since the early days of text classification research.
The idea is that, to classify a new document, the algorithm computes the distance (for example Euclidean, cosine, or Manhattan) from every document in the training set to the new document in order to find the k nearest documents, called the k nearest neighbors. These distance (similarity) values are then used to weight all topics: the weight of a topic is the sum of the values above over the documents among the k nearest neighbors that share that topic, and a topic that does not occur among the k nearest neighbors has weight 0. The topics are then sorted by decreasing weight, and the topics with high weights are chosen as the topics of the document being classified.
The weight of topic $c_j$ for document $\vec{x}$ is computed as follows:

\[ W(\vec{x}, c_j) = \sum_{\vec{d_i} \in \{kNN\}} sim(\vec{x}, \vec{d_i}) \cdot y(\vec{d_i}, c_j) - b_j \tag{1.4} \]

where:
$y(\vec{d_i}, c_j) \in \{0, 1\}$, with:
$y = 0$: document $\vec{d_i}$ does not belong to topic $c_j$
$y = 1$: document $\vec{d_i}$ belongs to topic $c_j$
$sim(\vec{x}, \vec{d_i})$: the similarity between the document to classify $\vec{x}$ and document $\vec{d_i}$. The cosine measure can be used:

\[ sim(\vec{x}, \vec{d_i}) = \cos(\vec{x}, \vec{d_i}) = \frac{\vec{x} \cdot \vec{d_i}}{\lVert \vec{x} \rVert \, \lVert \vec{d_i} \rVert} \tag{1.5} \]

$b_j$ is the classification threshold of topic $c_j$, learned automatically from a validation set of documents selected from the training set.
To choose the best parameter k, the algorithm must be run experimentally with many different values of k; larger values of k tend to make the algorithm more stable and less sensitive to noise.
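
The topic-weighting scheme above can be sketched in Python as follows. This is an illustrative sketch only: it uses cosine similarity, represents the training set as a list of (vector, set of topics) pairs, and treats the thresholds $b_j$ as optional values supplied by the caller (defaulting to 0) instead of learning them from a validation set. All names and sample data are assumptions made for the example.

```python
import math
from collections import defaultdict

def cosine(x, d):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, d))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def knn_topic_weights(x, training_set, k, thresholds=None):
    """Return {topic: W(x, c_j)} summed over the k most similar training documents.

    training_set is a list of (vector, set_of_topics) pairs.
    Topics that do not occur among the k neighbors are omitted
    (their similarity sum is 0).
    """
    thresholds = thresholds or {}
    neighbors = sorted(training_set,
                       key=lambda item: cosine(x, item[0]),
                       reverse=True)[:k]
    weights = defaultdict(float)
    for vector, topics in neighbors:
        similarity = cosine(x, vector)
        for topic in topics:
            weights[topic] += similarity
    return {c: w - thresholds.get(c, 0.0) for c, w in weights.items()}

# Hypothetical two-dimensional training documents.
training = [
    ([1.0, 0.0], {"sports"}),
    ([0.9, 0.1], {"sports"}),
    ([0.0, 1.0], {"health"}),
]
weights = knn_topic_weights([1.0, 0.0], training, k=2)
best = max(weights, key=weights.get)
print(weights, best)
```

With k = 2, both nearest neighbors carry the topic "sports", so it receives the summed similarity of the two neighbors, while "health" does not appear among the neighbors at all.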

1.3.1.4. The Support Vector Machine (SVM) supervised learning algorithm

According to [4, 7], SVM is a very effective classification method introduced by Vapnik in 1995 to solve two-class pattern recognition problems using the Structural Risk Minimization principle.
The main idea is as follows: given a training set represented in a vector space in which each document is a point, the method finds the best decision hyperplane h that divides the points in this space into two separate classes, the + class and the - class. The quality of this hyperplane is determined by the distance (called the margin) from the nearest data point of each class to the hyperplane. The larger the margin, the better the decision hyperplane and the more accurate the classification. The aim of the SVM algorithm is to find the maximum margin so as to obtain good classification results.

The following figure illustrates the algorithm:

Figure 4. Hyperplane h separates the training data into two classes, + and -, with the maximum margin. The points closest to h are the support vectors (circled)

Chapter 2 presents the SVM and semi-supervised SVM learning algorithms in detail.
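
To make the notion of margin concrete, the sketch below measures the margin of a given hyperplane $w \cdot x + b = 0$ on a small labeled data set and compares two candidate hyperplanes. It does not train an SVM; the data points, the candidate hyperplanes, and the function name are assumptions made for the example.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over all points.

    A positive result means every point lies on the correct side of the
    hyperplane; the value is the distance from the hyperplane to the
    closest point.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical two-dimensional data: two positive and two negative points.
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, 1.0)]
labels = [+1, +1, -1, -1]

margin_a = geometric_margin((1.0, 0.0), 0.0, points, labels)   # line x1 = 0
margin_b = geometric_margin((1.0, 0.0), -1.0, points, labels)  # line x1 = 1
print(margin_a, margin_b)
```

Both lines separate the two classes, but the line x1 = 0 lies midway between the closest points of each class and therefore has the larger margin; this is the hyperplane the maximum-margin criterion prefers.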

1.3.2. Classification algorithms using semi-supervised learning

1.3.2.1. Concept
According to Xiaojin Zhu [16], the concept of semi-supervised learning emerged in 1970, when the problem of estimating the Fisher linear discriminant rule with unlabeled data attracted much attention from scientists around the world.
In computer science, semi-supervised learning is a class of machine learning methods that use both labeled and unlabeled data; many machine learning studies have found that unlabeled data can be exploited when used together with a small amount of labeled data [15]. Obtaining labeled data usually requires human skill and judgment and costs much time and money, so labeled data is usually scarce and expensive, while unlabeled data is abundant. In that situation, semi-supervised learning can be used to carry out large-scale tasks.

Semi-supervised learning works with both labeled and unlabeled data. It can be applied to classification and clustering. Its goal is to train, from labeled and unlabeled data, a classifier that is better than one obtained by supervised learning alone.
Semi-supervised learning can thus be described as supervised learning combined with the use of unlabeled data. In addition to the unlabeled data, the algorithm is given some supervision information, but not necessarily for all training samples. Usually this information is the labels associated with some of the given samples.

Semi-supervised learning is a branch of machine learning. Labeled data is usually scarce, expensive, and time-consuming to produce, and requires human effort, whereas unlabeled data is plentiful but hard to use directly for a specific purpose. The central idea of semi-supervised learning is therefore to combine unlabeled and labeled data to build a better classifier. Semi-supervised learning is a good way to reduce human work and raise accuracy.

1.3.2.2. A brief history of semi-supervised learning

According to [16, 17], semi-supervised learning has been actively studied and developed over the last decade, especially since the appearance of the Web, with its ever larger volume of information and ever richer range of topics. Its development can be traced through the following algorithms.
Given a large amount of unlabeled data, the components of a mixture model can be identified with the expectation-maximization (EM) algorithm. Only one labeled example per component is needed to fully determine the mixture model. This model has been applied successfully to text classification. A variant of this model is self-training. Both methods have been in use for quite a long time, and they are popular because their concepts are simple and the algorithms are easy to understand.
Co-training is the next typical semi-supervised learning algorithm that researchers have invested in. Whereas in self-training a classification mistake can reinforce itself, co-training reduces the reinforcement of errors that can occur when a classification step goes wrong.
As the SVM (Support Vector Machine) algorithm developed, became widely used, and improved in quality, the Transductive Support Vector Machine (TSVM) emerged as an extension of the standard SVM to semi-supervised learning.
Recently, graph-based semi-supervised learning methods have attracted much attention from scientists and others interested in data mining. Graph-based methods start with a graph whose nodes are the labeled and unlabeled data points and whose edges reflect the similarity between these nodes.
Semi-supervised learning can be seen as a gradual refinement of algorithms applied to problems of everyday life. Below we briefly introduce some typical semi-supervised learning algorithms that can be considered the most widely used.

1.3.2.3. Some typical semi-supervised learning methods


There are many semi-supervised learning methods. Commonly used ones include Naive Bayes, EM with generative mixture models, self-training, co-training, transductive support vector machines (TSVM), and graph-based methods. There is no definitive answer to the question of which method is best. What these methods have in common is that they use unlabeled data to modify or reprioritize the hypotheses obtained from the labeled data alone.
The content of some typical semi-supervised learning algorithms is outlined below.
Self-training

Self-training is a commonly used technique in semi-supervised learning. An initial classifier is first trained on a small amount of labeled data. The classifier is then used to label the unlabeled data. Typically, the unlabeled points labeled with the highest confidence are added, together with their predicted labels, to the training set. The classifier is then retrained and the procedure is repeated. Note that the classifier uses its own predictions to teach itself. The procedure is also called self-teaching or bootstrapping.
Self-training has been applied to several natural language processing tasks, including parsing and machine translation. According to Xiaojin Zhu [16, 17], several authors have also applied self-training to object detection systems for images.
Algorithm: Self-training

1. Choose a classification method. Train a classifier f from (Xl, Yl).
2. Use f to classify all unlabeled items x ∈ Xu.
3. Select the x* with the highest confidence and add (x*, f(x*)) to the labeled data.
4. Repeat the steps above.

Figure 5. The Self-training semi-supervised learning method
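
The four steps of self-training can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any cited work: the base learner is a nearest-centroid classifier chosen only to keep the example self-contained, the confidence measure (gap between the distances to the farthest and nearest centroid) is an assumption, and the function names train, predict, and self_training are hypothetical.

```python
import math


def train(X, y):
    """Nearest-centroid classifier: store the mean vector of each class."""
    model = {}
    for label in set(y):
        points = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(points) for col in zip(*points)]
    return model


def predict(model, x):
    """Return (label, confidence). Confidence is the gap between the
    distance to the farthest and to the nearest class centroid (assumed)."""
    dist = {
        label: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))
        for label, c in model.items()
    }
    label = min(dist, key=dist.get)
    return label, max(dist.values()) - dist[label]


def self_training(X_l, y_l, X_u):
    """Label X_u one point at a time, always taking the most confident one."""
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    f = train(X_l, y_l)                        # step 1: train f on (Xl, Yl)
    while X_u:                                 # step 4: repeat
        preds = [predict(f, x) for x in X_u]   # step 2: classify all of Xu
        best = max(range(len(X_u)), key=lambda i: preds[i][1])  # step 3
        X_l.append(X_u.pop(best))
        y_l.append(preds[best][0])
        f = train(X_l, y_l)                    # retrain with the new point
    return f, X_l, y_l


labeled = [(0.0, 0.0), (4.0, 4.0)]
labels = [-1, 1]
unlabeled = [(0.5, 0.2), (3.8, 4.1), (1.0, 1.2), (3.0, 3.1)]
f, X_all, y_all = self_training(labeled, labels, unlabeled)
```

Adding one point per round follows step 3 literally. Adding several confident points per round is a common variation that trades some safety for speed.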

Co-training

According to [16, 17], co-training is based on the assumption that the features can be split into two sets, that each feature subset is sufficient to train a good classifier, and that the two subsets are conditionally independent given the class.
Initially, two separate classifiers are trained on the labeled data, one on each feature subset. Each classifier then classifies the unlabeled data and teaches the other classifier with the few unlabeled examples (and their predicted labels) about which it is most confident. Finally, each classifier is retrained with the additional training examples provided by the other classifier, and the process repeats.

Algorithm: Co-training

1. Train two classifiers: f(1) from (Xl(1), Yl) and f(2) from (Xl(2), Yl).
2. Classify Xu with f(1) and f(2) separately.
3. Add the k most confident (x, f(1)(x)) of f(1) to the labeled data of f(2).
4. Add the k most confident (x, f(2)(x)) of f(2) to the labeled data of f(1).
5. Repeat the steps above.

Figure 6. The Co-training semi-supervised learning method
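
The co-training loop can be sketched in the same style. This is an illustrative sketch under assumptions: each example is given as two feature views, the base learner is a simple nearest-centroid classifier, a shared pool of unlabeled indices is used, and the names train, predict, and co_training are hypothetical.

```python
import math


def train(X, y):
    """Nearest-centroid classifier: store the mean vector of each class."""
    model = {}
    for label in set(y):
        points = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(points) for col in zip(*points)]
    return model


def predict(model, x):
    """Return (label, confidence) from distances to the class centroids."""
    dist = {
        label: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))
        for label, c in model.items()
    }
    label = min(dist, key=dist.get)
    return label, max(dist.values()) - dist[label]


def co_training(X1_l, X2_l, y_l, X1_u, X2_u, k=1, rounds=10):
    """X1_* and X2_* hold the two feature views of the same examples."""
    data1 = (list(X1_l), list(y_l))   # labeled data used by f(1)
    data2 = (list(X2_l), list(y_l))   # labeled data used by f(2)
    pool = list(range(len(X1_u)))     # indices of still-unlabeled examples
    f1, f2 = train(*data1), train(*data2)               # step 1
    for _ in range(rounds):                              # step 5
        if not pool:
            break
        p1 = {i: predict(f1, X1_u[i]) for i in pool}     # step 2
        p2 = {i: predict(f2, X2_u[i]) for i in pool}
        top1 = sorted(pool, key=lambda i: p1[i][1], reverse=True)[:k]
        top2 = sorted(pool, key=lambda i: p2[i][1], reverse=True)[:k]
        for i in top1:                # step 3: f(1) teaches f(2)
            data2[0].append(X2_u[i])
            data2[1].append(p1[i][0])
        for i in top2:                # step 4: f(2) teaches f(1)
            data1[0].append(X1_u[i])
            data1[1].append(p2[i][0])
        pool = [i for i in pool if i not in top1 and i not in top2]
        f1, f2 = train(*data1), train(*data2)
    return f1, f2


f1, f2 = co_training(
    X1_l=[(0.0,), (10.0,)], X2_l=[(1.0,), (9.0,)], y_l=[-1, 1],
    X1_u=[(1.0,), (9.0,), (2.0,), (8.0,)],
    X2_u=[(0.5,), (9.5,), (2.5,), (7.0,)],
)
```

Each classifier only ever sees its own view of the data. The exchange of confidently labeled examples is the only channel between the two.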

Chapter 2 USING SVM AND SEMI-SUPERVISED SVM FOR THE CLASSIFICATION PROBLEM
In data mining, text classification has traditionally relied on decision-based methods such as Bayesian decision rules, decision trees, and k-nearest neighbors. These methods give acceptable results and are widely used in practice. In recent years, classification with support vector classifiers (Support Vector Machine, SVM) has attracted much attention and has been widely used in recognition and classification. SVM is a family of kernel-based methods that minimize the estimated risk. The SVM method originates from the statistical learning theory developed by Vapnik and Chervonenkis and has great potential for development both in theory and in practical applications. Experiments show that SVM classifies well in text classification as well as in many other applications (such as handwriting recognition, face detection in images, and regression estimation). Compared with other classification methods, SVM achieves comparatively good and efficient classification performance.

2.1. SVM - Support Vector Machine


SVM uses a learning algorithm to construct a hyperplane that minimizes the misclassification of new data items. The misclassification of a hyperplane is characterized by the smallest distance from the data to that hyperplane. SVM has proved highly successful in text classification applications.
Learning text classifiers from given examples is a relatively recent approach to text classification. It combines high performance and efficiency with theoretical understanding and improved robustness. In particular, it is highly effective without heuristic components. SVM is computationally efficient for both training and classification, and it rests on a learning theory that can guide real-world applications.
The essential property that determines the quality of a classifier is its ability to classify new data on the basis of the knowledge accumulated during training. If the generalization performance of the classifier after training is high, the training algorithm is considered good. Generalization performance depends on two quantities: the training error and the capacity of the learning machine. The training error is the misclassification rate on the training set. The capacity of the learning machine is measured by the Vapnik-Chervonenkis dimension (VC dimension), an important notion for a family of separating functions (that is, classifiers). It is defined as the maximum number of points that the family of functions can separate completely in the object space. A good classifier is one with the lowest capacity (that is, the simplest one) that still guarantees a small training error. The SVM method is built on this idea.

2.1.1. The SVM algorithm


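Consider the simplest classification problem, two-class classification, with the sample data set:
$$\{(x_i, y_i) \mid i = 1, 2, \dots, N,\; x_i \in \mathbb{R}^m\}$$
Here the samples are object vectors classified into positive and negative samples, as in Figure 4:
- Positive samples are the samples x_i that belong to the domain of interest and are labeled y_i = 1.
- Negative samples are the samples x_i that do not belong to the domain of interest and are labeled y_i = -1.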
In essence, this method is an optimization problem: the goal is to find a space H and a decision hyperplane h in H such that the classification error is minimal.
In this case, the SVM classifier is a hyperplane that separates the positive samples from the negative samples with maximum separation, where the separation, also called the margin, is determined by the distance between the positive and the negative samples closest to the hyperplane (Figure 1). This hyperplane is called the optimal-margin hyperplane.
Hyperplanes in the object space have the equation:
$$C + w_1 x_1 + w_2 x_2 + \dots + w_n x_n = 0 \qquad (2.1)$$
which is equivalent to
$$C + \sum_{i=1}^{n} w_i x_i = 0 \qquad (2.2)$$
where w = (w_1, w_2, ..., w_n) is the set of hyperplane coefficients, or weight vector, and C is the offset. Changing w and C changes the orientation of the hyperplane and its distance from the origin.
The SVM classifier is defined as:
$$f(x) = \operatorname{sign}\Bigl(C + \sum_{i=1}^{n} w_i x_i\Bigr) \qquad (2.3)$$
where
sign(z) = +1 if z ≥ 0,
sign(z) = -1 if z < 0.
If f(x) = +1, x belongs to the positive class (the domain of interest). Conversely, if f(x) = -1, x belongs to the negative class (other domains).
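
The decision rule (2.3) translates directly into code. The sketch below is only an illustration: the weight vector and offset are made-up values, and the function names sign and svm_classify are hypothetical.

```python
def sign(z):
    """sign(z) = +1 if z >= 0, -1 otherwise, as defined for (2.3)."""
    return 1 if z >= 0 else -1


def svm_classify(w, C, x):
    """Evaluate f(x) = sign(C + sum_i w_i * x_i) for one vector x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of components")
    return sign(C + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical hyperplane: w = (1, -2), offset C = 0.5
w, C = [1.0, -2.0], 0.5
positive = svm_classify(w, C, [1.0, 0.5])   # 0.5 + 1.0 - 1.0 = 0.5
negative = svm_classify(w, C, [0.0, 1.0])   # 0.5 + 0.0 - 2.0 = -1.5
```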
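The SVM learning machine is a family of hyperplanes parameterized by the weight vector w and the offset C. The goal of the SVM method is to estimate w and C so as to maximize the margin between the positive and negative classes. Different margin values give different families of hyperplanes, and the larger the margin, the lower the capacity of the learning machine. Maximizing the margin therefore amounts to finding the learning machine with the smallest capacity. The classification is optimal when the classification error is minimal.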
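We have to solve the following problem:

(2.4)

to find the weight vector w and the error η_i of each point in the training set. From these we obtain the general equation of the hyperplane found by the SVM algorithm:
$$f(x_1, x_2, \dots, x_n) = C + \sum_{i=1}^{n} w_i x_i \qquad (2.5)$$
with i = 1, ..., n, where n is the number of components of the vector x, as in (2.1).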


Once the hyperplane equation has been found by the SVM algorithm, this formula is applied to determine the class labels of new data.
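
One common way to write the optimization problem behind (2.4) is the soft-margin formulation, sketched here under assumptions. The penalty parameter \lambda > 0 is introduced for illustration only (the text already uses C for the offset), \eta_i is the error of training point i, and x_i denotes the i-th training vector rather than a vector component. This is a standard textbook form and not necessarily the exact form intended for (2.4).

```latex
\begin{aligned}
\min_{w,\,C,\,\eta}\quad
  & \frac{1}{2}\lVert w\rVert^{2} + \lambda \sum_{i=1}^{N} \eta_i \\
\text{subject to}\quad
  & y_i\,(C + w \cdot x_i) + \eta_i \ge 1, \quad \eta_i \ge 0, \quad i = 1,\dots,N.
\end{aligned}
```

The first term rewards a large margin and the second penalizes training errors, which matches the trade-off between capacity and training error described above.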

2.1.2. Training SVM


Training an SVM means solving the SVM quadratic programming problem. Numerical methods for this quadratic program require storing a matrix whose size is the square of the number of training samples. In practical problems this is infeasible, because the training set is usually very large (up to tens of thousands of samples). Many algorithms have been developed to address this problem. They decompose the training set into groups of data, so that each quadratic programming problem to be solved is smaller. They then check the KKT (Karush-Kuhn-Tucker) conditions to identify the optimal solution.
Some training algorithms rely on the following property: if the training data of the quadratic programming subproblem solved at a given step contain at least one sample that violates the KKT conditions, then the objective function increases after the subproblem is solved. A sequence of quadratic programming subproblems, each containing at least one sample that violates the KKT conditions, is therefore guaranteed to converge to an optimal solution. Consequently, one can maintain a working set of fixed size and, at each training step, remove and add the same number of samples.
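
A short calculation shows why storing the full matrix is impractical. The sketch assumes a dense matrix with 8-byte floating-point entries, and the function name qp_matrix_bytes is hypothetical.

```python
def qp_matrix_bytes(num_samples, bytes_per_entry=8):
    """Memory for a dense num_samples x num_samples matrix (assumed 8-byte entries)."""
    return num_samples * num_samples * bytes_per_entry


small = qp_matrix_bytes(1_000)     # 8,000,000 bytes, about 8 MB
large = qp_matrix_bytes(50_000)    # 20,000,000,000 bytes, about 20 GB
```

With tens of thousands of samples the matrix alone needs tens of gigabytes, which is why decomposition methods keep only a small working set in memory at a time.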

2.1.3. Advantages of SVM in text classification


Text classification is the process of assigning documents whose topic is unknown to known document classes (corresponding to different topics or domains). Each domain is defined by a number of sample documents from that domain. To perform classification, training methods are used to build a classifier from the sample documents. The classifier is then used to predict the class of new documents (whose topic is unknown).
Both two-class classification algorithms such as SVM and multi-class classification algorithms require documents to be represented as feature vectors. However, the other algorithms must estimate parameters and optimal thresholds, whereas SVM can find these optimal parameters by itself. Among these methods, SVM uses the largest feature space (more than 10,000 dimensions), whereas the other methods use far fewer dimensions (for example, 2000 for Naive Bayes and 2415 for k-Nearest Neighbors).
In his 1999 work [12], Joachims compared SVM with Naive Bayes, k-Nearest Neighbour, Rocchio, and C4.5, and in 2003 [13] he showed that SVM works very well with the properties of text mentioned earlier. The results show that SVM achieves the best classification accuracy among the compared methods.
According to Xiaojin Zhu [15], the work of several authors (for example Kiritchenko and Matwin in 2001, Hwanjo Yu and Han in 2003, and Lewis in 2004) shows that the SVM algorithm gives the best results for text classification.
Kiritchenko and Matwin studied and compared SVM with the Naive Bayes technique and showed that SVM is the best method for e-mail classification as well as text classification.
Hwanjo Yu and Han showed that SVM performs best among text classification methods. The studies cited here indicate that SVM gives the most accurate results in text classification.
Lewis studied text classification and found that SVM gave the best results. Lewis introduced the RCV1 collection of documents for text classification research and used it to evaluate several text classification techniques. SVM gave the best results when compared with k-nearest neighbors and the Rocchio-style prototype classifier.
The analyses of the authors above show that SVM has many properties that make it suitable for text classification. In practice, experiments on English text classification show that SVM achieves high classification accuracy and outperforms other text classification methods.
The fundamental question of semi-supervised learning is whether unlabeled data can be exploited to improve classification accuracy, compared with a classifier designed without taking unlabeled data into account.

The rest of this chapter introduces an improvement of SVM, the semi-supervised SVM (semi-supervised support vector machine, S3VM) [16, 17]. Semi-supervised SVM was proposed to take SVM a step further: whereas SVM is a supervised learning algorithm that uses only labeled data, semi-supervised SVM uses both labeled data (the training set) and unlabeled data (the working set).

2.2. Semi-supervised SVM and Web page classification

2.2.1. Introduction to semi-supervised SVM
This section introduces the semi-supervised SVM (Semi-Supervised Support Vector Machine, S3VM), an improvement of SVM. Given a training set of labeled data together with a working set of unlabeled data, S3VM builds a support vector machine using both the training set and the working set. The transduction problem is to predict the value of a classification function at the given points of the working set.
Whereas SVM is a supervised algorithm that uses labeled data, S3VM is built from a mixture of labeled data (the training set) and unlabeled data (the working set). The aim is to assign class labels to the working set as well as possible, and then to use the labeled training data together with the newly labeled working set to classify new data. If the working set is empty, the method reduces to the standard SVM for classification. If the training set is empty, the method becomes a form of unsupervised learning. Semi-supervised learning takes place when both the training set and the working set are non-empty.
Understanding S3VM requires an understanding of SVM, presented above. Because of limited time and resources, this thesis examines the S3VM algorithm only for the binary classification problem.
We are given a training set of labeled data together with a working set of n unlabeled data items. The aim is to assign labels to these unlabeled data.

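There are two given classes, the positive class (class +1) and the negative class (class -1). Each data item is regarded as a point in a vector space. Each point i in the training set has an error η_i, and each point j in the working set has two errors: ξ_j (the classification error under the assumption that j belongs to class +1) and z_j (the classification error under the assumption that j belongs to class -1).
The S3VM algorithm solves the optimization problem (2.6) below in place of the optimization problem (2.4) of the SVM algorithm.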

(2.6)
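
A formulation of this kind, following the semi-supervised SVM of Bennett and Demiriz [7], can be sketched in the notation of this section. The sketch rests on several assumptions: \lambda > 0 is a penalty parameter introduced here for illustration, l is the number of training points, n is the number of working-set points, x_i and x_j denote whole data vectors, and the regularizer \tfrac{1}{2}\lVert w\rVert^{2} is one common choice that may differ from the norm used in [7].

```latex
\begin{aligned}
\min_{w,\,C,\,\eta,\,\xi,\,z}\quad
  & \frac{1}{2}\lVert w\rVert^{2}
    + \lambda\Bigl[\sum_{i=1}^{l}\eta_i + \sum_{j=1}^{n}\min(\xi_j, z_j)\Bigr] \\
\text{subject to}\quad
  & y_i\,(C + w\cdot x_i) + \eta_i \ge 1, \quad \eta_i \ge 0, \quad i = 1,\dots,l, \\
  & (C + w\cdot x_j) + \xi_j \ge 1, \quad \xi_j \ge 0, \quad j = 1,\dots,n, \\
  & -(C + w\cdot x_j) + z_j \ge 1, \quad z_j \ge 0, \quad j = 1,\dots,n.
\end{aligned}
```

The term \min(\xi_j, z_j) charges each working-set point only for the cheaper of its two possible labels, which is what lets the unlabeled data influence the position of the hyperplane.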

Once ξ_j and z_j have been found, the smaller of the two gives the error of point j. If ξ_j < z_j, point j belongs to the positive class. Conversely, if ξ_j > z_j, point j belongs to the negative class. This is done for all points in the working set. When it is complete, all unlabeled points have been labeled.
The working set, once labeled, is added to the training set, and the SVM algorithm is then run again to learn a new SVM. This SVM is the S3VM, with a new hyperplane. The hyperplane is then applied to classify new data samples.
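
The labeling rule for the working set can be illustrated as follows. The sketch assumes that a hyperplane (w, C) is already available and that the two errors of a point are the smallest non-negative slacks for the constraints "j is in class +1" and "j is in class -1", that is ξ_j = max(0, 1 - f_j) and z_j = max(0, 1 + f_j) with f_j = C + w·x_j. These closed forms, the tie-breaking choice, and the function names are assumptions made for illustration.

```python
def decision_value(w, C, x):
    """f = C + sum_i w_i * x_i for one data vector x."""
    return C + sum(wi * xi for wi, xi in zip(w, x))


def working_set_errors(w, C, x):
    """Return (xi, z): the error if x is assumed positive and if assumed negative."""
    f = decision_value(w, C, x)
    xi = max(0.0, 1.0 - f)   # error under the assumption "x is in class +1"
    z = max(0.0, 1.0 + f)    # error under the assumption "x is in class -1"
    return xi, z


def label_working_set(w, C, working_set):
    """Assign +1 when xi < z and -1 when xi > z; ties go to +1 (an assumption)."""
    labels = []
    for x in working_set:
        xi, z = working_set_errors(w, C, x)
        labels.append(1 if xi <= z else -1)
    return labels


# Hypothetical hyperplane and working set
w, C = [1.0, 1.0], -1.0
working_set = [[2.0, 1.0], [0.0, 0.0], [0.5, 0.5]]
labels = label_working_set(w, C, working_set)
```

After this step the labeled working set would be merged into the training set and the SVM retrained, as described above.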

2.2.2. Web page classification using semi-supervised SVM

2.2.2.1. Introduction to the Web page classification problem (Web Classification)
Web page classification is a special case of text classification because of the presence of hyperlinks and the richer, more tightly structured layout of Web pages, which give rise to mixed features such as plain text, hypertext tags, and hyperlinks.
With more than 10 billion Web pages, the Internet is a very rich training collection covering every topic in life. Moreover, since the number of topics on Websites is not large, the Internet is well suited as a basis for training. Although the accuracy is not absolute, each topic in Web pages is characterized by many specialized terms that occur with very high frequency. Exploiting how the frequency of these terms depends on the topic can give promising classification results.

2.2.2.3. Applying S3VM to Web page classification


Web pages are hypertext documents, which are very common today. The content of a Web page is usually described concisely. A page contains hyperlinks pointing to Web pages with related content and allows other pages to link to it.
As noted above, because Web pages are treated as ordinary documents, they are represented with the vector space model during classification. Representing and processing Web documents is the same as representing and processing text with this model. In Web classification, however, exploiting the strengths of hyperlinks is an issue of particular interest. Hyperlinks between Web pages provide information about the relationships between the contents of the pages, which can be used to improve classification and search.
When applied to Web page classification, the S3VM algorithm treats each Web page as a vector d = (d_1, d_2, ..., d_n), represented in the same way as a text document. Taking the hyperplane equation (2.5):
$$f(x_1, x_2, \dots, x_n) = C + \sum_{i=1}^{n} w_i x_i$$
and substituting the document corresponding to each Web page into it gives:
$$f(d_1, d_2, \dots, d_n) = C + \sum_{i=1}^{n} w_i d_i \qquad (2.7)$$
with i = 1, ..., n.
If f(d) ≥ 0, the Web page belongs to class +1.
Conversely, if f(d) < 0, the Web page belongs to class -1.

Applying the S3VM algorithm to Web page classification thus amounts to substituting the term-weight vector that represents a Web page into the S3VM hyperplane equation, which yields the class label of each unlabeled Web page.
In semi-supervised classification of Web pages, therefore, the training set consists of labeled Web pages, and the working set (the unlabeled data) consists of the Web pages that the labeled pages in the training set link to.
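
The substitution of a Web page into the hyperplane equation can be sketched as follows. The example assumes that the page text has already been extracted from the HTML, that each component d_i is the raw count of the i-th vocabulary term, and that the vocabulary, weights, and offset are made-up values. The function names page_to_vector and classify_page are hypothetical.

```python
import re
from collections import Counter


def page_to_vector(text, vocabulary):
    """Represent a page as d = (d_1, ..., d_n), the count of each vocabulary term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def classify_page(text, vocabulary, w, C):
    """Return +1 if f(d) = C + sum_i w_i * d_i >= 0, otherwise -1."""
    d = page_to_vector(text, vocabulary)
    score = C + sum(wi * di for wi, di in zip(w, d))
    return 1 if score >= 0 else -1


# Hypothetical vocabulary and hyperplane for a "machine learning" topic
vocabulary = ["svm", "kernel", "football"]
w, C = [1.0, 0.5, -2.0], -0.5

page_a = "SVM kernel methods: the SVM margin"
page_b = "Football results and football news"
label_a = classify_page(page_a, vocabulary, w, C)   # score = -0.5 + 2.0 + 0.5
label_b = classify_page(page_b, vocabulary, w, C)   # score = -0.5 - 4.0
```

In practice the components d_i are often weighted (for example by TF-IDF) rather than raw counts. Raw counts are used here only to keep the example short.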

Chapter 3 EXPERIMENTS WITH SEMI-SUPERVISED LEARNING FOR WEB PAGE CLASSIFICATION
This thesis uses open-source software to carry out experiments on semi-supervised classification of Web documents. The first part of this chapter introduces the open-source software SVMlin, titled "Fast Linear SVM Solvers for Supervised and Semi-supervised Learning" and released by Vikas Sindhwani. The following parts describe how the software is used for classification and evaluation. The content of this chapter is compiled from [14, 15, 18].
SVMlin is open-source software released under the terms of the GNU software license.

3.1. Introduction to the SVMlin software


SVMlin is a software package for linear SVMs, suited to classification problems with a large number of data samples and features. It is written in C++ (mostly in C).
Besides labeled data, SVMlin can also make use of unlabeled data during learning. Unlabeled data are particularly useful for improving classification accuracy when the amount of labeled data available is very small.
SVMlin currently implements the following algorithms [14, 15]:
- Supervised learning (uses labeled data only): Linear Regularized Least Squares Classification.
- Semi-supervised learning (can also make good use of unlabeled data): Multi-switch linear Transductive L2-SVMs.

According to Vikas Sindhwani, when SVMlin is used for text classification (the RCV1-v2/LYRL2004 data set) with 804414 labeled examples and 47326 features, it takes less than two minutes to train a linear SVM on an Intel machine with a 3 GHz processor and 2 GB of RAM. Given only 1000 labels, it can use hundreds of thousands of unlabeled examples to train a semi-supervised linear SVM in about 20 minutes. Unlabeled data are very useful for improving classification when the number of labels is not large.

3.2. Downloading SVMlin


The latest version of SVMlin can be downloaded from:
http://www.cs.uchicago.edu/people/vikass

3.3. Installation
First, unpack the distribution with the following commands:
unzip svmlin.zip
tar xvzf svmlin.tar.gz

This creates a directory named svmlin-v1.0 containing a Makefile and three source files: ssl.h, ssl.cpp, and svmlin.cpp.
Run the command:
make

This builds the executable
svmlin

The executable is used for training, testing, and evaluation.

3.4. Using the software and evaluation results

Data files

The input data format for SVMlin is similar to that of the SVM-Light/LIBSVM tools (the only difference is that there is no first column holding the label of each example).
Each line describes one data example as a list of feature-index:feature-value pairs for the features with non-zero values, separated by a space. Each line ends with a \n character.
<feature>:<value> <feature>:<value> ... <feature>:<value>

For example, the following data matrix with 4 examples and 5 features:
0 3 0 0 1
4 1 0 0 0
0 5 9 2 0
6 0 0 5 3

is described in the input file as:
2:3 5:1
1:4 2:1
2:5 3:9 4:2
1:6 4:5 5:3
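
The correspondence between the sparse lines and the dense matrix can be checked with a small parser. This is an illustrative sketch. The function name parse_sparse_line is hypothetical, and feature indices are assumed to start at 1, as in the example.

```python
def parse_sparse_line(line, num_features):
    """Expand one '<feature>:<value> ...' line into a dense row of floats."""
    row = [0.0] * num_features
    for pair in line.split():
        index, value = pair.split(":")
        row[int(index) - 1] = float(value)   # feature indices are 1-based
    return row


sparse_lines = [
    "2:3 5:1",
    "1:4 2:1",
    "2:5 3:9 4:2",
    "1:6 4:5 5:3",
]
matrix = [parse_sparse_line(line, num_features=5) for line in sparse_lines]
```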

The labels of the training data are stored in a separate file, called the label file. Each line of this file holds the label of the example on the corresponding line of the data file described above. A label can take the following values:
+1 (labeled example in the positive class)
-1 (labeled example in the negative class)
0 (unlabeled example)
The current version of SVMlin can be applied only to binary classification problems.
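
Preparing the two input files from in-memory data can be sketched as below. The helper names format_example and format_label are hypothetical, and the exact number formatting accepted by SVMlin is an assumption. The sketch simply follows the format described above, with one example per line and one label per line in the same order.

```python
def format_example(row):
    """Write one dense row as '<feature>:<value>' pairs for non-zero features."""
    return " ".join(
        f"{index}:{value:g}" for index, value in enumerate(row, start=1) if value != 0
    )


def format_label(label):
    """+1 and -1 mark labeled examples; 0 marks an unlabeled example."""
    if label not in (1, -1, 0):
        raise ValueError("label must be +1, -1 or 0")
    return "+1" if label == 1 else str(label)


rows = [[0, 3, 0, 0, 1], [4, 1, 0, 0, 0], [0, 5, 9, 2, 0]]
labels = [1, -1, 0]   # the third example is unlabeled

examples_text = "\n".join(format_example(r) for r in rows) + "\n"
labels_text = "\n".join(format_label(y) for y in labels) + "\n"
```

The two strings would then be written to the example file and the label file passed to svmlin.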
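Training

Run the command:
svmlin [options] training_examples training_labels

where:
training_examples: file containing the training data
training_labels: file containing the labels of the training data
Training produces two files:
training_examples.weights: file containing the resulting classification model
training_examples.outputs: file containing the outputs of the classifier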
Testing

Run the command:
svmlin -f training_examples.weights test_examples_filename

where:
training_examples.weights: file containing the classification model
test_examples_filename: file containing the test data

Evaluation
If the labels of the test data are known, the following command computes the performance measures of the classification:
svmlin -f weights_filename test_examples_filename test_labels_filename
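
The accuracy figure reported by such an evaluation can also be recomputed by hand. The sketch below assumes that the classifier produces one real-valued output per test example and that the sign of the output is the predicted class. Both the output format and the function name accuracy are assumptions made for illustration.

```python
def accuracy(outputs, labels):
    """Fraction of examples whose predicted sign matches the true label (+1 or -1)."""
    if len(outputs) != len(labels) or not labels:
        raise ValueError("outputs and labels must be non-empty and of equal length")
    correct = sum(
        1 for out, y in zip(outputs, labels) if (1 if out >= 0 else -1) == y
    )
    return correct / len(labels)


# Hypothetical classifier outputs and true labels for four test examples
outputs = [0.7, -1.2, 0.1, -0.3]
labels = [1, -1, -1, -1]
acc = accuracy(outputs, labels)   # three of four predictions are correct
```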
Training data

The training data consist of 1460 documents (of which only 50 are labeled) taken from the standard 20-newsgroups data set.
Classification results

With the training data above, SVMlin achieves an accuracy of 92.8% with the multi-switch TSVM option and 95.5% with the semi-supervised SVM option. These results support the effectiveness of semi-supervised SVM learning.

CONCLUSION
Work accomplished in this thesis
The thesis has surveyed a number of issues in the classification problem, including data classification methods, text classification, and machine learning algorithms applied to classification, with an emphasis on semi-supervised learning, which is widely used today.
For data classification, the thesis states the general problem, what is given and what is required, and presents the general data classification procedure so that the reader can gain an overview of the classification problem.
It presents the basics of the text classification problem and how a document is represented for classification, and then describes the main current text classification methods.
It examines machine learning algorithms applied to text classification, covering classification based on supervised learning and on semi-supervised learning. The focus is on semi-supervised learning: several typical semi-supervised learning methods are described, and on that basis the semi-supervised SVM algorithm is studied in more depth.
The Web page classification problem using semi-supervised SVM is described in detail. The experimental part introduces the open-source software SVMlin, how to use it, and the results of running it reported by V. Sindhwani in 2007. I downloaded and studied the software, but because of limited time and expertise I have not yet mastered running it.

Directions for future research


As stated above, because of limited time and knowledge, the thesis could not go into great depth, in particular in running the SVMlin software for the study. In the coming period I will therefore study the software more closely in order to master its use, and in particular the semi-supervised learning algorithms that form its theoretical foundation [14, 15].

REFERENCES

I. Vietnamese
1. Nguyễn Việt Cường (2006). Sử dụng các khái niệm tập mờ trong biểu diễn văn bản và ứng dụng vào bài toán phân lớp văn bản. Khóa luận tốt nghiệp đại học, Trường Đại học Công nghệ - Đại học Quốc gia Hà Nội.
2. Phạm Thị Thanh Nam (2003). Một số giải pháp cho bài toán tìm kiếm trong CSDL Hypertext. Luận văn tốt nghiệp cao học, Khoa Công nghệ, ĐHQGHN, 2003.
3. Trần Thị Oanh (2006). Thuật toán self-training và co-training ứng dụng trong phân lớp văn bản. Khóa luận tốt nghiệp đại học, Trường Đại học Công nghệ - Đại học Quốc gia Hà Nội.

II. English


4. Aixin Sun, Ee-Peng Lim, Wee-Keong Ng (2002). Web classification using support vector machine. Proceedings of the 4th International Workshop on Web Information and Data Management, McLean, Virginia, USA, 2002 (ACM Press).
5. Balaji Krishnapuram, David Williams, Ya Xue, Alex Hartemink, Lawrence Carin, Mário A. T. Figueiredo (2005). On Semi-Supervised Classification. NIPS: 721-728, 2005.
6. H.-J. Oh, S. H. Myaeng, and M.-H. Lee (2000). A practical hypertext categorization method using links and incrementally available class information. Proc. of the 23rd ACM SIGIR 2000: 264-271, Athens, GR, 2000.
7. Kristin P. Bennett, Ayhan Demiriz (1998). Semi-Supervised Support Vector Machines. NIPS 1998: 368-374.
8. Linli Xu, Dale Schuurmans (2005). Unsupervised and Semi-Supervised Multi-Class Support Vector Machines. AAAI 2005: 904-910.
9. M. Craven and S. Slattery (2001). Relational learning with statistical predicate invention: Better models for hypertext. Machine Learning, 43(1-2): 97-119, 2001.

10. Panu Erästö (2001). Support Vector Machines: Background and Practice. Academic Dissertation for the Degree of Licentiate of Philosophy. University of Helsinki, 2001.
11. Paul Pavlidis, Ilan Wapinski, and William Stafford Noble (2004). Support vector machine classification on the web. Bioinformatics Applications Note, 20(4), 586-587.
12. T. Joachims (1999). Transductive Inference for Text Classification using Support Vector Machines. International Conference on Machine Learning (ICML), 1999.
13. T. Joachims (2003). Transductive learning via spectral graph partitioning. Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003): 290-297.
14. V. Sindhwani, S. S. Keerthi (2006). Large Scale Semi-supervised Linear SVMs. SIGIR 2006.
15. V. Sindhwani, S. S. Keerthi (2007). Newton Methods for Fast Solution of Semi-supervised Linear SVMs. Large Scale Kernel Machines, MIT Press, 2005.
16. Xiaojin Zhu (2005). Semi-Supervised Learning with Graphs. PhD thesis, Carnegie
Mellon University, CMU-LTI-05-192, May 2005.
17. Xiaojin Zhu (2006). Semi-Supervised Learning Literature Survey. Computer
Sciences TR 1530, University of Wisconsin Madison, February 22, 2006.
18. http://people.cs.uchicago.edu/~vikass/svmlin.html
