VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY


Nguyen Huu Phuong


QUERY-ORIENTED ONLINE ADVERTISING
WITH THE HELP OF TOPIC ANALYSIS
AND RANKING TECHNIQUES


FULL-TIME UNDERGRADUATE GRADUATION THESIS

Major: Information Technology






Supervisor: Assoc. Prof. Dr. Ha Quang Thuy
Co-supervisor: Nguyen Minh Tuan, B.Sc.


HANOI - 2009

Acknowledgments
First of all, I would like to express my deepest gratitude to Associate Professor Dr. Ha Quang Thuy and Nguyen Minh Tuan, B.Sc., who devotedly guided and advised me throughout the work on this graduation thesis.
I sincerely thank the lecturers who gave me favorable conditions for studying and doing research at the University of Engineering and Technology.
I would also like to thank the senior colleagues and fellow students of the "Data Mining" group, who helped and supported me a great deal with both specialist knowledge and data collection.
Finally, I wish to express my boundless gratitude to my family and friends, the loved ones who always stood by me and encouraged me throughout the work on this thesis.
Thank you very much!


Student
Nguyen Huu Phuong

Abstract
Search engine advertising is currently the form of advertising that attracts the most attention. In it, advertisements are displayed next to the search results for a user's query. This raises the problem of how to display the advertisements that are most relevant to the query.
This thesis studies methods for ranking advertisements on search engines by their relevance to the query and proposes an advertising model that uses hidden topic analysis and ranking techniques. It also introduces a way of representing advertisements by new features, namely hidden topic features. Experiments were carried out using query logs to build the training data set. The model exploited useful information from user behavior and produced fairly promising results: the average precision of the ranking results is approximately 82% to 84%.

Table of Contents
Introduction
Chapter 1. Overview of online advertising
1.1. Introduction to advertising
1.2. Online advertising
1.2.1. Growth rate and market share
1.2.2. Forms of online advertising
1.3. Online advertising in Vietnam
1.3.1. Overview of online advertising in Vietnam
1.3.2. Untapped resources and the online advertising market
1.4. Search-based advertising
Chapter 2. Approaches to search-based advertising
2.1. Keyword extraction from web page content
2.2. Matching with an expanded vocabulary (impedance coupling)
2.3. Ranking optimization with a genetic algorithm (Genetic Programming)
2.4. Advertising with relevance feedback
2.5. CTR (Click Through Rate) estimation
2.6. Search and ranking with hidden topics in contextual advertising
Chapter 3. An online advertising system using ranking and hidden topics
3.1. Ranking
3.1.1. Ranking in search engines
3.1.2. Learning to rank and SVM Rank
3.1.3. Ranking evaluation methods
3.2. Hidden topics
3.2.1. Latent Dirichlet Allocation (LDA)
3.2.2. The generative model in LDA
3.2.3. Parameter estimation and inference
3.3. A query-oriented online advertising model with the help of topic analysis and ranking techniques
3.3.1. Problem description
3.3.2. Overall model
3.3.3. Determining features for the model
Chapter 4. Experiments and evaluation
4.1. Data
4.2. Experimental environment
4.2.1. Hardware configuration
4.2.2. Tools used
4.3. Experimental procedure
4.3.1. Data preprocessing
4.3.2. Collecting information from the obtained URLs
4.3.3. Data vectorization
4.3.4. Experimental design
4.4. Experimental results
4.5. Evaluation of the experimental results
Conclusion
References


List of Tables
Table 1. Major websites offering online advertising services in Vietnam
Table 2. Hardware configuration used in the experiments
Table 3. Open-source software used
Table 4. Values of the measures for several different queries


List of Figures
Figure 1. Online advertising revenue in the first and second halves of each year from 1999 to 2008 in the US
Figure 2. Breakdown of online advertising revenue in the first six months of 2007 and 2008 in the US
Figure 3. Online advertising on a Vietnamese online newspaper
Figure 4. Online advertising revenue of VnExpress and VietnamNet in 2004, 2005 and 2006
Figure 5. Example of an advertisement's content
Figure 6. Basic architecture of a search-based advertising system
Figure 7. Architecture of an advertising system using relevance feedback
Figure 8. Parameter estimation algorithm
Figure 9. Graphical representation of LDA
Figure 10. Full generative model for LDA
Figure 11. Overall model of the advertising system using hidden topics
Figure 12. Average of the measures over all queries
Figure 13. Average NDCG@5 for different numbers of queries
Figure 14. Average MAP for different numbers of queries



List of Abbreviations

CPA    Cost Per Action/Acquisition
CPC    Cost Per Click
CPM    Cost Per Mille/Thousand
CTR    Click Through Rate
IDF    Inverse Document Frequency
LDA    Latent Dirichlet Allocation
LSA    Latent Semantic Analysis
LSI    Latent Semantic Indexing
PLSA   Probabilistic Latent Semantic Analysis
PLSI   Probabilistic Latent Semantic Indexing
PPC    Pay Per Click
TF     Term Frequency
Introduction
Online advertising has been growing steadily and has generated enormous revenues in recent years, reaching 47.5 billion US dollars [33]. Search engine advertising is the most popular form of online advertising. In it, advertisements are displayed next to the search results returned to the user. Over the past five years, many works have been published in Vietnam and around the world on retrieving advertisements and producing the most suitable ordering of them [11], [22], [24], [25], [27], [30].
Le Dieu Thu [27] took a new approach to contextual advertising by expanding the set of advertising keywords using hidden topic analysis. The author showed the positive effect of hidden topics on retrieving and ranking advertisements.
This thesis continues to examine the problem of ranking advertisements on search engines and proposes an advertisement ranking model that applies hidden topic analysis in a new way. Unlike the approach in [27], the model in this thesis represents advertisements by hidden topic features and uses query logs to build the training data set. It achieved promising results. The thesis consists of four chapters, outlined below:
Chapter 1. Overview of online advertising presents the state of online advertising in the world and in Vietnam, and introduces search engine advertising and the problem of ranking advertisements on search engines.
Chapter 2. Approaches to search-based advertising presents works proposed in recent years to solve the advertisement ranking problem and points out the strengths and weaknesses of each method.
Chapter 3. An online advertising system using ranking and hidden topic analysis presents ranking techniques, the SVM Rank learning-to-rank method and hidden topic analysis, and proposes an advertisement ranking model that uses hidden topics.
Chapter 4. Experiments and evaluation describes the data used and the data processing and experimental stages, reports the results of the model, and discusses and analyzes them.
Conclusion. Summarizes the main content of the thesis.

Chapter 1. Overview of online advertising
1.1. Introduction to advertising
Advertising is a form of publicizing and introducing goods and services in order to create appeal and stimulate buyers, thereby boosting sales and the provision of services. To a certain extent advertising has positive effects, but it also raises the price of goods. In a commodity economy, advertising costs are usually very high. Advertising takes many forms: special posters, newspaper advertisements, radio, television, cinema, product exhibitions, manufacturers' labels, display windows in shops or factories, mail, gifts and so on [6].
According to another source, advertising is a complex phenomenon closely bound up with society, culture, history and the economy, and it does not fit any simple or single definition. Some aspects of advertising are universal, while others are specific to a culture. Advertising has evolved from the art of personal selling into indirect communication that provides new information in order to persuade people. Besides messages intended to sell, it also carries cultural values and social opinions. Depending on the point of view, advertising can have positive or negative effects on society and the economy [8].
According to [39], the father of advertising was an ancient Egyptian who posted the first notice on the walls of Thebes around 3000 BC. A few centuries later, in Greece, this kind of notice became very common: information for the public was painted on wooden boards displayed in the city square. Advertising boards spread quickly after the invention of printing (the first poster was printed by the Englishman Caxton in 1477), but it was the French painter J. Chéret (1835-1932) who invented the modern form of advertising: a poster for a show in 1867, consisting of a short sentence and a colorful, striking image. However, it was the Italian painter L. Cappiello (1875-1942) who was the first to really address the advertising poster, with his 1903 billboard for "Klaus" chocolate.
Today advertising has made new advances and is carried out through mass media such as television, newspapers, radio, direct mail and, in particular, online advertising over the Internet.
1.2. Online advertising
Online advertising is a form of advertising delivered over the Internet, in particular on web pages [8]. As the Internet and the World Wide Web have become more and more widely used, the Internet has become one of the most important advertising media today.
One benefit of online advertising is that information and content can be published immediately, without geographic or time limits. It can deliver advertising messages globally, to a very large number of users, at very low cost.
Online advertising offers advertisers a high return on investment, and it lets them customize their advertisements, including the content and the web pages on which the advertisements appear. For example, Google's AdWords and AdSense allow advertisements to be displayed on related web pages or next to search engine results for a set of predefined keywords.
Another advantage of online advertising is the way it is paid for: payment can take several forms, based on how users respond to the advertisement. Payment models include CPM (Cost Per Mille/Thousand), CPV (Cost Per Visitor), CPC (Cost Per Click) and CPA (Cost Per Action); the related CTR (Click Through Rate) measures how often a displayed advertisement is clicked [27].
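As a rough illustration of how these payment models differ, the sketch below computes what an advertiser would owe under CPM, CPC and CPA, together with the CTR of a campaign. The function names, rates and campaign figures are hypothetical and chosen only for this example; they are not taken from [27].

```python
def cpm_cost(impressions: int, rate_per_thousand: float) -> float:
    """Cost under CPM: the advertiser pays per thousand impressions."""
    return impressions / 1000 * rate_per_thousand


def cpc_cost(clicks: int, rate_per_click: float) -> float:
    """Cost under CPC: the advertiser pays for each click."""
    return clicks * rate_per_click


def cpa_cost(actions: int, rate_per_action: float) -> float:
    """Cost under CPA: the advertiser pays for each completed action."""
    return actions * rate_per_action


def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR: fraction of impressions that led to a click (0.0 if never shown)."""
    return clicks / impressions if impressions else 0.0


# Hypothetical campaign figures, for illustration only.
impressions, clicks, actions = 20_000, 300, 12
print(cpm_cost(impressions, 2.5))   # billed on impressions
print(cpc_cost(clicks, 0.25))       # billed on clicks
print(cpa_cost(actions, 5.0))       # billed on actions such as sign-ups
print(click_through_rate(clicks, impressions))
```

The same campaign can cost quite different amounts depending on the model, which is why CPC and CPA depend on a network that records clicks and actions reliably.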
1.2.1. Growth rate and market share
In 1994, online advertising first appeared on the first commercial web browser, Netscape Navigator 1.0, in the form of banner advertisements [32]. The first advertisements on the web were static content or company logos. They usually appeared at the top of each web page, which is typically the most visible position.
As technology advanced and opened up new opportunities, many forms of online advertising emerged. Some companies, such as DoubleClick, AdForce and Windwire, delivered advertisements on web sites through pop-ups. They provided some graphical information, and the web browser would carry out certain tasks when the user clicked on an advertisement [32].
A decade after online advertising first appeared, advertisers in the US market spent 9.6 billion US dollars on it in 2004, an increase of more than 31.5% over 2003, compared with 10% for television advertising, 7.4% for other advertising services in general and 6.6% for US GDP (Figure 1). According to the IAB's 2008 report [33], online advertising revenue reached more than 23 billion US dollars by the end of 2008.

Figure 1. Online advertising revenue in the first and second halves of each year from 1999 to 2008 in the US [33].
According to the latest Strategy Analytics report [38], total worldwide spending on online advertising reached nearly 47.5 billion US dollars in 2007 and may exceed 100 billion US dollars by 2012.
These figures show how quickly online advertising has grown in recent years, and they promise enormous revenues in the years to come.
1.2.2. Forms of online advertising
Online advertising can be divided into two kinds: legitimate (advertising networks) and illegitimate (spamming).
Spam advertising usually intrudes into the user's system and is known as spyware, adware or pop-up advertising. For example, when a new browser window is opened, an advertising pop-up appears and redirects the user to the advertised website. This annoys users, so many browsers now provide pop-up blocking to limit illegitimate pop-ups. Spyware and adware are usually add-on applications, and some of them can be harmful, for example Trojans.
Legitimate advertising can be classified into display advertising, email, classifieds and auctions, lead generation, rich media and search. Details of these forms can be found in [27]. The chart below shows the revenue of each category in the first six months of 2007 and 2008 in the US [33].

Figure 2. Breakdown of online advertising revenue in the first six months of 2007 and 2008 in the US [33]
As Figure 2 shows, search advertising, which this thesis calls search-based advertising, was the most popular form of advertising with the highest revenue in the US market from 2007 to 2008. It accounted for 41% of total online advertising revenue in the first six months of 2007 and 46% in the first six months of 2008.
1.3. Online advertising in Vietnam
Alongside the growth of online advertising worldwide, online advertising in Vietnam has developed step by step and achieved some initial success.

1.3.1. Overview of online advertising in Vietnam
1.3.1.1. Market share
According to the Vietnam Internet Network Information Center (VNNIC), approximately 19 million people, 22.47% of Vietnam's population, regularly access the Internet. This large and fast-growing user base (2007 added 4 million users compared with 2006) is a promising environment for online advertising.
However, online advertising in Vietnam is still in its exploratory, formative stage. According to the Vietnam Advertising Association (VAA), television stations hold more than 80% of the domestic advertising market, followed by print advertising.
Online advertising revenue in Vietnam was about 64 billion VND in 2006 and 160 billion VND in 2007, and it is expected to grow by 100% in the coming years to reach 500 billion VND in 2010. Even so, online advertising accounted for only about 1.5% of total advertising revenue in Vietnam in 2007 [4].
1.3.1.2. Forms of online advertising in Vietnam
In terms of form, online advertising in Vietnam is mainly brand advertising using logos and banners (Figure 3). On large websites, logos and banners are packed together regardless of the standards for making an effective impression (at most 4 advertisements per screen). Forms such as keyword advertising, contextual advertising and behavioral advertising are still new concepts. There is also no standard yet for online advertisement designs (size, position and so on), which costs customers extra time and money when they advertise on different websites [4].
Online advertising customers are concentrated in just a few industries. A survey of the websites carrying the most advertisements shows that the businesses occupying the most expensive positions are usually telecommunications companies and banks, followed by businesses in consumer electronics, education and food services.

Figure 3. Online advertising on a Vietnamese online newspaper
In addition, there is no reputable organization acting as an intermediary to assess objectively the number of users of each website or the effectiveness of online advertising there. Quite a few websites publish enormous user figures. As a result, online advertising revenue in Vietnam is concentrated on the few sites with the highest traffic (mainly online newspapers and news sites such as VnExpress, Dan Tri, Vietnamnet and 24h.com.vn) instead of being spread across specialized websites (travel, entertainment, commerce and so on).
As for payment, traditional methods similar to newspaper advertising are still used: the amount the advertiser pays the advertising company is calculated from the size of the banner, the number of times the advertisement is displayed on the web page and the rank of the website (the CPM method). Website ranks are usually determined by tools available on the Internet, for example alexa.com. The price of an advertisement is determined by the number of visits to the website and the position of the banner.
Other payment methods such as CPC or CPA are still very rare, because they require a trustworthy advertising network to provide the information on which payment is based. This is an important issue, and it explains why contextual advertising, behavioral advertising and search engine advertising have not yet developed in Vietnam. However, a few companies have recognized this and introduced trial models based on CPC, for example Hura Ad¹, daugia 247 ECOM JSC² and VietAd³. These systems were trialled in Vietnam, although according to VietnamNet they were later withdrawn for improvement.
Table 1. Major websites offering online advertising services in Vietnam
No.  Name                                               Address
1    VnExpress online newspaper                         http://vnexpress.net
2    VietnamNet online newspaper                        http://vnn.vn
3    Thanh Nien online newspaper                        www.thanhnien.com.vn
4    Dan Tri online newspaper                           www.dantri.com
5    Lao Dong online newspaper                          www.laodong.com.vn
6    VnMedia online newspaper                           www.vnmedia.com.vn
7    Ngoi sao                                           http://ngoisao.net
8    Online Service Advertising Joint Stock Company     www.24h.com.vn
9    Multimedia Communications Company (VTC)            www.vtc.com.vn

In summary, online advertising in Vietnam currently has few participants and little variety of form. The main form is the banner, paid for according to banner size, banner position and the rank of the website.

¹ http://ad.hurahost.com
² http://daugia247.com
³ http://vietad.vn

1.3.2. Untapped resources and the online advertising market
The previous section gave an overview of online advertising in Vietnam, which is still new but is expanding and has great potential. This section looks more closely at the untapped resources and the online advertising market, and from there points out the potential and the prominent issues of online advertising in Vietnam in the coming years.
1.3.2.1. The rapid growth of e-commerce in Vietnam
E-commerce is an important factor in online advertising, especially for payment in contextual, behavioral and search engine advertising systems. As e-commerce develops, many other lines of business can easily trade over the Internet, which lets companies introduce their products to customers and supports the growth of online advertising.
E-commerce began to develop in early 2006, when many new laws were enacted. With government support, e-commerce in Vietnam has continued to grow and has made clear progress.
By the end of 2008, a survey of 1,600 enterprises nationwide by the Ministry of Industry and Trade showed that most enterprises had deployed e-commerce applications to some degree. Investment in e-commerce had received attention and had brought clear benefits to enterprises [1].
Enterprises have paid attention to equipping themselves with computers, and almost 100% of them now have computers. The share of enterprises with 11 to 20 computers has risen over the years and exceeded 20% in 2008. The share of enterprises with an internal network exceeded 88% in 2008, compared with 84% in 2007. As many as 99% of enterprises are now connected to the Internet, 98% of them via broadband. The share of enterprises with a website reached 45% in 2008, up 7% from 2007. The shares of websites that are updated regularly and that support online ordering both rose quickly.
One of the brightest spots in enterprises' use of e-commerce is the fast growth of software investment, which made up 46% of enterprises' total information technology investment in 2008, twice the 2007 level. Meanwhile, hardware investment fell from 55.5% in 2007 to 39% in 2008. This shift in the investment structure shows that, having stabilized their IT infrastructure, enterprises have started to focus on application software for e-commerce. Revenue from e-commerce has become evident and has tended to rise steadily over the years: for 75% of enterprises, e-commerce accounted for more than 5% of total revenue in 2008. Many enterprises have also assigned staff dedicated to e-commerce.
1.3.2.2. The boom in online society and social networks
Recently, World Wide Web technologies and web design have made it easier for users to share information, for example through social networking sites, wikis, blogs and forums. At the same time, the number of Vietnamese Internet users keeps growing, forming a large online community of Vietnamese people. According to VNNIC (Vietnam Internet Network Information Center), in March 2008 the number of Internet users in Vietnam exceeded 19 million (19.41% of the population), and the figure is still rising [4]. This market is larger than those of Thailand, the Philippines and Indonesia. Over the past few years, online communities have seen social networking sites grow and compete, for example Yahoo! 360 blog, Tamtay, Yobanbe, Cyworld and Zoomban.
However, there is still a large gap in e-commerce development between Vietnam and developed countries, mostly because of user habits and income.
1.3.2.3. The online advertising market: a long-term view
The rapid growth of e-commerce and the boom in online communities and web portals in Vietnam lay a solid foundation for the development of online advertising. Recently, large advertising companies such as Yahoo and Google have begun to take an interest in the Vietnamese online advertising market and to build marketing strategies and services for Vietnamese users. According to VietnamNet, Google has translated its services into Vietnamese, for example the AdWords advertising service⁴. Yahoo currently has the largest number of Vietnamese users (according to the Alexa ranking). It has launched a Vietnamese version of Yahoo⁵ and the 360 plus blog service to attract Vietnamese users to this market. Advertisements for its new services have been broadcast on Vietnamese television since May 2008 [27].
However, the online advertising market attracts domestic companies as well as foreign ones. Some new companies have begun to expand their market and target online advertising. For the best-known online newspapers in Vietnam, such as VnExpress and VietnamNet, online advertising revenue has grown quite quickly, and VnExpress still holds first place in online advertising in Vietnam (Figure 4).

Figure 4. Online advertising revenue of VnExpress and VietnamNet in 2004, 2005 and 2006 [1].
In summary, although the Vietnamese online advertising market is still at an early stage, it has attracted a great deal of attention from both domestic and foreign companies. This creates a need for an online advertising network in Vietnam that supports newly developing forms of advertising, such as search engine advertising and behavioral or contextual advertising.
Google and Yahoo have achieved great success in the world market, but language and cultural barriers still limit their access to the Vietnamese market. The success of Baidu (China's leading search engine) shows that large advertising companies such as Google and Yahoo do not always succeed in regional markets, especially in Asia [32]. Vietnamese users are still waiting for a Vietnamese-language network from domestic companies. Building and developing online advertising in Vietnam has become essential for long-term development, and the Vietnamese will soon witness new advances in the advertising market in the coming years.

⁴ http://adwords.google.com/select/?hl=vi
⁵ http://vn.yahoo.com/

1.4. Search-based advertising
Search-based advertising is a form of advertising in which advertisements are displayed on the basis of predefined keywords or phrases [22]. It involves the following main elements:
- Advertisement content: supplied by the advertiser to the advertising company, it usually consists of a title, a description, a URL and the keywords associated with the advertisement.
- Cost per keyword: the amount the advertiser must pay the advertising company for each specific keyword or phrase.
- Automatic or manual checks that ensure the advertisement content matches its keywords.
- Retrieval of advertisements that match the user's query (search engine advertising) or the content of the web page (contextual advertising).
- Display of the advertisements in an appropriate order.
- Collection of information: counting user clicks, identifying user actions and charging advertisers accordingly.
Figure 5 is an example of advertising on the MSN search engine: when a user searches for the keyword "hotel", a list of hotel-related advertisements is displayed.

Figure 5. Example of an advertisement's content [36]
Figure 6 below shows the basic architecture of a search-based advertising system.

Figure 6. Basic architecture of a search-based advertising system [27]
Through the advertising network, advertisements are shown to users according to the content of the web page they are viewing (contextual advertising) or the query they are searching for (search engine advertising). When a user clicks on an advertisement or performs actions such as registering or paying, the advertising network records those actions. The advertiser then pays the advertising network according to the recorded actions. There are many well-known advertising networks today, such as Google, Yahoo, MSN, Yahoo! Publisher Network (YPN) and Amazon.com.
Search-based advertising has two main types: search engine advertising and contextual advertising.
Search engine advertising is advertising carried out on a search engine: when a user submits a query, a list of advertisements matching the query is displayed next to the search results. The advertisements are ordered by two criteria: relevance to the query and the amount the advertiser will pay the advertising company for displaying the advertisement. Search engine advertising is currently the most popular form of online advertising.
Contextual advertising differs from search engine advertising in that the list of advertisements is obtained by comparing the advertisements' keywords and phrases with the content of a web page, and it is returned according to how well the advertisements match that content.
In both types, the number of advertisements shown each time is usually very small, 4 or 5, and users normally notice only the first few. The advertising system must therefore find the advertisements most relevant to the user's query and put them at the top of the list. This gives rise to the problem of ranking the returned advertisements by their relevance to the user's query.
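To make the two ordering criteria concrete, here is a minimal sketch that orders candidate advertisements and keeps only the few that fit on the page. The text does not say how relevance and payment are combined, so multiplying them is purely an assumption of this example, as are the `Ad` class, its field names and the scores.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    title: str
    relevance: float  # hypothetical relevance to the query, in [0, 1]
    bid: float        # hypothetical amount the advertiser agrees to pay


def top_ads(ads: list[Ad], k: int = 4) -> list[Ad]:
    """Return the k ads with the highest combined score.

    Assumption: the two criteria are combined as relevance * bid.
    """
    return sorted(ads, key=lambda ad: ad.relevance * ad.bid, reverse=True)[:k]


candidates = [
    Ad("Hanoi hotel deals", relevance=0.9, bid=1.0),
    Ad("Luxury resort", relevance=0.5, bid=3.0),
    Ad("Used cars", relevance=0.2, bid=1.0),
]
for ad in top_ads(candidates, k=2):
    print(ad.title)
```

Because only the top few positions are seen by users, errors in the relevance estimate matter most at the head of the list, which is what the ranking problem in this thesis targets.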
Advertisement ranking currently receives a great deal of attention. Many methods and models have been proposed, for example the advertising model using relevance feedback [11], the CTR (Click Through Rate) estimation model [25], the approach that extracts advertising keywords from web page content [30], impedance coupling [24] and ranking optimization [22]. These methods are presented in detail in the next chapter.

Chapter 2. Approaches to search-based advertising
The main task of a search-based advertising system is to decide which advertisements to display and in what order, according to their relevance to the user's query or to the web page content (the context). When users search, their main goal is to find documents related to their keywords, not advertisements, so they will only really pay attention to advertisements that are highly relevant to what they are interested in. Showing relevant advertisements can give users additional useful information and access to the services they want; irrelevant advertisements, on the other hand, can annoy users and reduce their satisfaction with the search engine.
Over the past five years, many methods from around the world and some from Vietnam have been published to address this problem. Some of the notable ones are described below.
2.1. Keyword extraction from web page content
This is a model for contextual advertising. Following the idea behind search engine advertising, the current web page can be treated as a long query made up of many keywords. Yih et al. [30] proposed a supervised learning model for extracting keywords from web page content. Learning from a set of web pages whose keywords had been labeled in advance, they built a classifier using logistic regression.
To identify the keywords and phrases that describe a web page most accurately, they considered several methods and ran experiments to find the one giving the best results. Three methods were proposed: MoS, MoC and DeS. M (Monolithic) means that whole phrases are used in extraction. D (Decomposed) treats each word in a phrase as a separate entity. S (Separate) treats every word or phrase occurrence as a separate entity, whether or not it is identical to another, and C (Combined) merges identical words and phrases into one.
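The difference between the three settings can be sketched as different ways of generating candidates from a tokenized page. This is a simplified reading of the description above, not the implementation in [30]: the function names, the whitespace tokenization and the phrase length limit `max_len` are assumptions made for illustration.

```python
from collections import Counter


def phrase_occurrences(tokens: list[str], max_len: int = 2) -> list[tuple[int, str]]:
    """Every contiguous phrase of up to max_len tokens, one entry per occurrence."""
    return [
        (start, " ".join(tokens[start:start + n]))
        for n in range(1, max_len + 1)
        for start in range(len(tokens) - n + 1)
    ]


def monolithic_separate(tokens: list[str], max_len: int = 2) -> list[tuple[int, str]]:
    """MoS: whole phrases, and each occurrence is its own candidate."""
    return phrase_occurrences(tokens, max_len)


def monolithic_combined(tokens: list[str], max_len: int = 2) -> Counter:
    """MoC: whole phrases, with identical phrases merged into one candidate."""
    return Counter(phrase for _, phrase in phrase_occurrences(tokens, max_len))


def decomposed_separate(tokens: list[str]) -> list[tuple[int, str]]:
    """DeS: each word occurrence is treated as a separate candidate."""
    return list(enumerate(tokens))


tokens = "cheap hotel cheap hotel deals".split()
print(len(monolithic_separate(tokens)))            # one candidate per occurrence
print(monolithic_combined(tokens)["cheap hotel"])  # occurrences merged and counted
print(decomposed_separate(tokens)[:2])
```

In a full system each candidate would then be turned into a feature vector and scored by the classifier; that step is omitted here.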
An important point in their work is the use of 7.5 million queries from the MSN query logs [36] as a feature for extraction, together with 11 other features such as keyword frequency, linguistic features (POS tagging), capitalization, hypertext features (whether the word is inside a link), the page title, and the lengths of phrases and sentences.
In the experiments, they used 828 web pages taken from the Internet Archive [34] to train and test the system. The results show that the MoC system (identical phrases merged into one) performed best, while MoS performed worst. In addition, the DeS system (each word treated as a separate entity) performed worse than the Monolithic system (each phrase treated as a separate entity). The accuracy of the best system was 30.06% and that of the worst was 13.01%.
To determine the contribution of each feature, they ran experiments on the same system, adding the features one at a time. The results show that the query log feature and keyword frequency are the most important.
The study by Yih et al. [30] shows a different approach to contextual advertising. Their system makes it possible to rank advertisements on the basis of keywords extracted from the web page. However, the relevance of advertisements selected with these keywords has not yet been verified experimentally.
2.2. M hnh so khp vi tp t vng m rng (impedance coupling)
Mt vn ca qung co theo ng cnh, l s khc bit v t vng gia trang
web v cc qung co. Ribeiro Neto v cc cng s [24] tp trung vo vic gii quyt
vn ny bng cch m rng tp t vng ca cc trang web.
Nhn chung, mt qung co thng ngn, c ng v tp trung vo mt ch
chnh. Tuy nhin, mt trang web li c ni dung ln hn v thuc mt khng gian ng
cnh ln hn. Mt trang web c th ni v rt nhiu ch v vi cc t kha khc nhau.
Vn tm kim nhng qung co ph hp vi mt trang web s dng nhng ch c
trong ni dung trang ang l mt vn cn c quan tm.
Ribeiro-Neto et al. [24] examined 10 methods for matching ads to web pages. They ran experiments on a large collection of more than 93 thousand ads and 100 web pages.
For the first 5 methods, they compare web pages and ads using the vector space model. The rank of each ad is computed from the cosine similarity between the ad and the web page. The features used are the ad title, description and keywords. The best of these methods is AAK, which matches using the ad keywords that appear in the content of the web page; its results are used as the reference for comparison with the impedance coupling methods.
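The vector-space matching described above can be sketched as follows. This is only an illustration under simple assumptions: raw term counts as weights and whitespace tokenization; the function names and sample data are hypothetical and are not taken from [24].

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_ads(page_text, ads):
    """Return ad ids ordered by decreasing cosine similarity to the page."""
    page_vec = Counter(page_text.lower().split())
    scored = [(cosine(page_vec, Counter(text.lower().split())), ad_id)
              for ad_id, text in ads.items()]
    return [ad_id for _, ad_id in sorted(scored, key=lambda s: -s[0])]


# Hypothetical page and ads, for illustration only.
page = "cheap flights and hotel deals for summer travel"
ads = {
    "ad_travel": "summer travel deals cheap flights",
    "ad_shoes": "buy red shoes online",
}
print(rank_ads(page, ads))
```

An ad that shares no term with the page gets a similarity of 0, which is exactly the vocabulary mismatch that the impedance coupling methods try to reduce.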
As introduced above, there is a large difference between the vocabulary of web pages and that of ads. To address this, Ribeiro-Neto et al. [24] expand the vocabulary of a web page with keywords taken from web pages with similar content, using a Bayesian model. These expansion keywords may appear in the keyword set of an ad and thereby improve the effectiveness of the system. They use 5 different matching methods of this kind, called the impedance coupling methods.
In their experiments, they used a collection of 6 million web pages for vocabulary expansion. The results obtained with the expanded content are better than those of the AAK method above. The best method reported matches using the expanded page content together with the content of the page that the ad points to. The experiments of Ribeiro-Neto et al. demonstrate that reducing the vocabulary mismatch between web pages and ads can substantially help the search for ads that fit the context.
2.3. Ranking optimization with genetic programming
Building on the work in [24], Lacerda et al. [22] propose an approach based on genetic programming to optimize the ranking function. Using features such as keywords, term frequency, document length and collection size, they apply machine learning to build a matching function that optimizes the relevance between web pages and ads. The function is represented as a tree whose internal nodes are operators and whose leaves are features. Using the same training and evaluation data as [24], this model improves on the best method described there by 61.7%.
2.4. Advertising model using relevance feedback
Building on research into query processing and query expansion, Andrei Z. Broder et al. [11] propose a search advertising model that uses relevance feedback. Given an input query, called the original query, they run it on search engines and collect a number of the top results. From the original query and these results they build a new query, called the ad query, and use it to search the available ad collection. This approach exploits the additional information obtained from the search engine to create more informative features for the search. Moreover, using features that describe the whole ad works better than using only its individual keywords, and it also frees advertisers from having to specify the keywords of an ad in advance.
The ad query and the ads are represented by three main types of features: keywords, classes and Prisma phrases.
- Keywords: they gather all distinct keywords in the ad collection, select a suitable number of them, use each selected keyword as a feature, and then weight the features by TF-IDF.
- Classes: an ad and a query may be highly related yet be expressed with different words. To handle this case, in addition to keywords they use a higher-level feature, the class of the query. Using a large taxonomy of commerce-related topics, they build a classifier that maps a piece of text to a number of related classes. They classify each result retrieved for the original query and then select the classes that best fit the original query. These classes are used as features of the ad query, and the feature weights are given by the confidence returned by the classifier.
- Prisma phrases: using AltaVista's Prisma tool, which extracts phrases commonly used on the web, and a set of 10 million English Prisma phrases, they identify the Prisma phrases that appear in the results of the original query, select those that best fit the original query, and use them as features of the ad query. The feature weights are computed by TF-IDF.
In their experiments, Andrei Z. Broder et al. [11] set up 4 different systems, each with different mixing parameters between the feature types. They used a set of 700 queries built as follows. Starting from the set of all Yahoo queries in the week of 23-29, 2007, they divided the 10 million most frequently searched queries into groups by search frequency and randomly selected 50 queries from each group. In addition, they randomly drew 200 queries from the remaining queries (those not among the 10 million above). For each query, 3 ads were retrieved by each of the systems, giving about 9000 query-ad pairs. A group of 6 analysts, all with a good command of English, judged each result and assigned it to one of the following classes: Perfect, Certainly Attractive, Probably Attractive, Somewhat Attractive, Probably Not Attractive, and Certainly Not Attractive. To compute precision and recall, they treated the first 4 classes as relevant and the last two as not relevant.
The experimental results were compared with a model that does not use query expansion (only the original query) and show clearly higher precision. The precision of the four systems is 35%, 40%, 42% and 45% respectively, compared with 16% for the model without query expansion. Figure 7 shows their system architecture.

Figure 7. Architecture of the advertising system using relevance feedback [11]
The relevance feedback model of Andrei Z. Broder et al. provides a method for expanding the query using search results. They propose a way of building features from this additional knowledge, and with this model advertisers no longer need to define explicitly the keywords that correspond to their ads.
2.5. CTR (Click-Through Rate) estimation model
Starting from the use of CTR to rank ads, Matthew Richardson et al. [25] propose a model that estimates the CTR of new ads from previously available information. Ads with a high CTR are ranked above ads with a low CTR.
Richardson et al. treat CTR estimation from a given set of features as a regression problem and use logistic regression, whose outputs are probabilities, so that the estimated values lie in the interval [0, 1]. The features used are:
- Appearance: how many words are in the title and in the body, whether the body contains many symbols or punctuation marks, and whether short or long words are used.
- Attention capture: whether the title or body of the ad contains action words such as "buy", "join" or "subscribe".
- Reputation: whether the URL ends in .com, .net or .org, how long the URL is, and whether it has many or few segments; for example, books.com is better than books.something.com. Whether the URL contains many dashes or digits.
- Landing page quality: whether the page contains Flash, which parts are made up of images, whether a stylesheet is used, and whether there are many ads on the page.
- Relevance: whether the bid term appears in the title or in the body, and in which part of the body.
From these 5 feature categories they derive 81 features. In addition, the following features are used:
- Words in the ad collection: the 10000 most common words in the ad collection are extracted, and for each one a feature is added with value 1 if the word appears in the ad under consideration and 0 otherwise.
- CTR: the CTR of other ads that share the same keyword (bid term). The number of ads that have the same keyword as the ad under consideration is also used as a feature.
- Besides ads with the same keyword, the CTR of ads with related keywords is also used. For example, "red shoes" and "buy red shoes" are related keywords, and the CTR of an ad for "buy red shoes" can be used when estimating the CTR of an ad for "red shoes".
As data, they use a set of ads from the MSN search engine. Each ad carries information such as its URL, the keywords associated with it, its title and body, and in particular the total number of clicks and the total number of views since the ad entered the system. The data set is split into three parts: 70% for training, 10% for validation and 20% for testing.
In the experiments, they use the average KL-divergence [20] between the CTR estimated by the model and the true CTR of the ads in the test set. They build several systems with different features and compare them with a simple CTR estimation model that relies only on the training set (using a single feature, the CTR of the ad itself), called the baseline. The results are quite good: the improvement over the baseline ranges from 13.28% to 19.67%.
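As an illustration of this evaluation, the sketch below computes an average KL-divergence between true and predicted CTRs and the relative improvement over a baseline. It assumes that each CTR is treated as the parameter of a Bernoulli distribution; whether [25] uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def bernoulli_kl(p_true, p_pred, eps=1e-12):
    """KL divergence between Bernoulli(p_true) and Bernoulli(p_pred).

    Both probabilities are clipped away from 0 and 1 to keep the logs finite.
    """
    p = min(max(p_true, eps), 1.0 - eps)
    q = min(max(p_pred, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def mean_kl(true_ctrs, predicted_ctrs):
    """Average KL divergence over all ads in a test set."""
    pairs = list(zip(true_ctrs, predicted_ctrs))
    return sum(bernoulli_kl(p, q) for p, q in pairs) / len(pairs)


def improvement_percent(baseline_kl, model_kl):
    """Relative reduction of the error with respect to the baseline, in percent."""
    return 100.0 * (baseline_kl - model_kl) / baseline_kl


# Hypothetical CTR values, for illustration only.
true_ctrs = [0.02, 0.10, 0.05]
baseline = [0.05, 0.05, 0.05]
model = [0.03, 0.08, 0.05]
print(improvement_percent(mean_kl(true_ctrs, baseline), mean_kl(true_ctrs, model)))
```

A lower average divergence means the predicted CTRs are closer to the observed ones, so a positive improvement indicates that the model beats the baseline.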
2.6. Search and ranking model using hidden topics in contextual advertising
Starting from the idea that expanding the content of web pages and ads helps the search for and ranking of ads, Le Dieu Thu [27] proposes an approach to contextual advertising that focuses on hidden topic analysis in order to enrich the content of both web pages and ads with expansion keywords. To generalize the context of web pages and ads, the author builds a hidden topic model on a large data set, from which topics and the relations between topics and words, or between words, are discovered. The model also makes it possible to determine the probability distribution of topics over each web page or ad, and hence to enrich their content with keywords from the related topics.
Le Dieu Thu built a large data set, called the Universal Dataset, and used it for hidden topic analysis. The data were collected from VnExpress [7], one of the largest online newspapers in Vietnam, and cover various topics such as society, world news, lifestyle, culture, sports and science. More than 220 megabytes of data, comprising about 40 thousand web pages, were collected with Nutch [37] and preprocessed by removing HTML tags, splitting sentences, segmenting words and removing unsuitable words. After processing, the data set amounts to 53 megabytes with 40,268 documents. Hidden topic analysis was carried out on this data set with GibbsLDA [16], an implementation of the LDA model with Gibbs sampling.
For the experiments, the author used a set of 100 web pages and 2607 different ads. The web pages were selected at random from a set of 27,763 pages collected from the VnExpress online newspaper, chosen from topics such as food, trading, pharmaceuticals, real estate, the stock market and jobs. The ads were obtained by taking the titles, descriptions and keywords of web pages listed in the Vietnam website directory [5].
To evaluate the effect of keywords in contextual matching, Le Dieu Thu implemented two search methods following the approach of Ribeiro-Neto [24]. The first, called AD, uses only the title and description of the ad. The second, AD_KW, uses the title and description of the ad together with its keywords.
To evaluate the effect of hidden topics, the author ran 6 different experiments, each using a different hidden topic model with different parameters. The hidden topic models used have 60, 120 and 200 topics. After inferring hidden topics for all web pages and ads, their vocabularies are expanded according to the related topics. The experimental results show that using hidden topics raises the accuracy of the model from 64% to 72%.
The work of Le Dieu Thu [27] presents a model for the problem of searching for and ranking ads in contextual advertising. It shows the positive effect of using hidden topics to expand the keyword sets of both web pages and ads. The results are very promising: the model overcomes the problem of matching ads and web pages with different vocabularies by exploiting the hidden semantic relations in their content. This approach can be extended and used effectively in search advertising.

Chapter 3. Online advertising system using ranking and hidden topics
3.1 Ranking
Many applications need to order objects by some criterion, for example sorting the employees of a company by name or age, or sorting the students of a class by grade point average. Such a task is called ranking. The result of ranking is an ordered list of objects in which one object is placed above another when it better satisfies some requirement [2]. We say that object A has a higher rank than object B when A is more relevant to the given criterion than B. Ranking can be done by different criteria; we need to compute the relevance of the objects to the given criterion, and the function that computes this relevance is called the ranking function. Whenever we talk about ranking objects, we are concerned with the ranking function.
Some prominent ranking problems are ranking web pages by importance, ranking universities by size, and in particular ranking search engine results by relevance to the query. In practice, ranking is carried out in many fields. Ranking gives an overview, gives the fastest access to the objects that best fit a request, and makes it easy to compare objects with each other. This shows that ranking is an important and meaningful problem.
3.1.1 Ranking in search engines
The rapid growth of the World Wide Web (WWW) has created a very large demand for finding documents on the Internet, and search engines are used to serve this need. Given a user request, usually a query, the search engine searches for and returns the documents that match it. However, the number of results matching a query can be very large, up to hundreds or thousands, and the user cannot go through them one by one to find the desired document. The task is therefore to rank the documents returned by the search engine in decreasing order of relevance to the input query. Ranking helps users reach the results they want quickly and saves a great deal of time.
The ranking problem is very important in search engines. Simple rankings, such as ranking students by grade point average or employees by the amount of work completed, have a clear criterion and a ranking function that is easy to specify. Ranking the results returned by a search engine, in contrast, is very complex: each document has many different features, the relations among these features have to be found, and the features then have to be combined into a suitable ranking function. Many algorithms have been proposed, such as HITS, PageRank and TrustRank, each with its own advantages and disadvantages.
Joachims [21] regards learning to rank as an emerging field with strong growth in information retrieval and machine learning research. In other words, learning a ranking function is currently a topic of interest in machine learning with many applications in information retrieval. Learning to rank means learning a function of the features that orders objects by relevance, priority or importance, depending on the application. Methods for learning to rank are currently being studied by many researchers around the world. The following sections present SVM-Rank, one of the most popular learning-to-rank algorithms.
3.1.2 Learning to rank and SVM-Rank
3.1.2.1 Learning to rank
Research on learning to rank mainly targets the ranking of documents returned by a search engine for a query. Given a set of documents $D = \{d_1, d_2, \dots, d_n\}$ and a query $q$, we need to determine a ranking function $h(x): D \to \mathbb{R}$ that orders the documents in $D$ by relevance to the query [2].
The training data $S$ is the correct ranking of a subset of documents $D' \subset D$, provided in order to learn $h(x)$. Depending on the application, the correct ranking of the data is specified at different levels:
1. A specific relevance value $y$ is given for each object in $S$. Because in ranking applications users care about the order more than about the ranking value, $y$ usually takes:
   - Two values, corresponding to relevant or irrelevant. The user only cares whether an object satisfies the criterion or not.
   - N values corresponding to N fixed ranks, for example: highly relevant, relevant, possibly relevant, not relevant.
2. Relevance comparisons are given for pairs of objects.
3. A correctly ordered list of all objects by relevance is given.
According to Soumen Chakrabarti [13] and Tie-Yan Liu [23], the approaches to learning to rank are:
- Regression: given $S = \{(x_i, y_i)\}$, each object $x_i$ has a corresponding relevance value $y_i$. Learn a function $h(x)$ satisfying
\[ h(x_i) = y_i \quad \text{for all } x_i \in X. \]
In learning to rank, when the value $y_i$ specifies the rank of object $x_i$, the method is called ordinal regression.
- Pairwise: given $S = \{(x_i, x_j)\}$, a set of ordered pairs of objects, where each pair $(x_i, x_j)$ means that $x_i$ ranks higher than $x_j$ ($x_i$ satisfies the criterion better than $x_j$). Find $h(x)$ such that
\[ \forall (x_i, x_j) \in S \text{ with } x_i > x_j: \quad h(x_i) > h(x_j). \]
SVM-Rank is one of the algorithms of this type.
- Listwise: a complete ordering of all objects is given. This is, however, not feasible in some applications, for example search engines. We have $S = \{x_1, x_2, \dots, x_m\}$ with $x_i \in X$, which is an ordering $(x_1 > x_2 > \dots > x_m)$. We need to find a function $h(x)$ such that
\[ h(x_1) > h(x_2) > \dots > h(x_m). \]
3.1.2.2 SVM-Rank
SVM-Rank is an algorithm built to solve the document ranking problem using the supervised learning algorithm SVM.
Assume the input is a set of documents in an n-dimensional space $X \subseteq \mathbb{R}^n$, where $n$ is the number of document features. There is a set of ranks $Y = \{r_1, r_2, \dots, r_q\}$, where $q$ is the number of possible ranks. Assume there is an order among the ranks, $r_q \succ r_{q-1} \succ \dots \succ r_1$, where "$\succ$" denotes the preference relation between documents [29]. There is a set of ranking functions $f \in F$, each of which determines the preference relation between documents:
\[ x_i \succ x_j \iff f(x_i) > f(x_j) \tag{1} \]
Suppose we are given a set of ranked documents $S = \{(x_i, y_i)\}_{i=1}^{t}$ from the space $X \times Y$. The task is to choose the best function $f^*$ from $F$, the one that minimizes the loss value for a given loss function on the given data set.
Herbrich [14] formalizes this learning problem as learning a classifier on pairs of documents.
Assume $f$ is a linear function:
\[ f_w(x) = \langle w, x \rangle \tag{2} \]
where $w$ is the weight vector and $\langle \cdot, \cdot \rangle$ denotes the inner product.
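From (1) and (2) we have:
\[ x_i \succ x_j \iff \langle w, x_i - x_j \rangle > 0 \tag{3} \]
The relation $x_i \succ x_j$ between $x_i$ and $x_j$ is thus expressed by the vector $x_i - x_j$. Next, we take every pair of documents and the relation between them to create a new vector and a new label. Let $x^{(1)}$ and $x^{(2)}$ denote the first and the second document, and $y^{(1)}$ and $y^{(2)}$ their ranks. We have:
\[ \left( x^{(1)} - x^{(2)},\; z = \begin{cases} +1 & y^{(1)} > y^{(2)} \\ -1 & y^{(2)} > y^{(1)} \end{cases} \right) \tag{4} \]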
From the training set $S$ we create another training set $S'$ of $\ell$ labeled vectors:
\[ S' = \left\{ \left( x_i^{(1)} - x_i^{(2)},\; z_i \right) \right\}_{i=1}^{\ell} \tag{5} \]
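The construction of the pairwise training set in (4) and (5) can be sketched as follows. The sketch assumes that pairs are formed only between documents retrieved for the same query and that documents with equal ranks produce no pair; both choices, as well as the function names and sample data, are assumptions made for illustration.

```python
from itertools import combinations


def to_pairs(samples):
    """Turn ranked documents into labeled difference vectors.

    samples: list of (query_id, feature_vector, rank) tuples, where a larger
    rank value means a more relevant document.
    Returns a list of (x1 - x2, z) with z = +1 if y1 > y2 and -1 otherwise.
    """
    pairs = []
    for (q1, x1, y1), (q2, x2, y2) in combinations(samples, 2):
        if q1 != q2 or y1 == y2:
            continue  # assumption: only same-query pairs with distinct ranks
        diff = [a - b for a, b in zip(x1, x2)]
        pairs.append((diff, 1 if y1 > y2 else -1))
    return pairs


def score(w, x):
    """Linear ranking function f_w(x) = <w, x> from equation (2)."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical toy data: two documents for query q1, one for query q2.
samples = [
    ("q1", [1.0, 0.0], 2),
    ("q1", [0.0, 1.0], 1),
    ("q2", [1.0, 1.0], 1),
]
print(to_pairs(samples))
```

A weight vector $w$ orders a pair correctly exactly when $z \langle w, x^{(1)} - x^{(2)} \rangle > 0$, which is condition (3) applied to the difference vector.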

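We use $S'$ as classification data and build an SVM model that determines whether the label is positive or negative, $z = +1$ or $z = -1$, for every vector $x^{(1)} - x^{(2)}$.
Building the SVM model is equivalent to solving:
\[ \min_{w} M(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{\ell} \xi_i \quad \text{subject to } \xi_i \ge 0,\; z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \ge 1 - \xi_i,\; i = 1, \dots, \ell \tag{6} \]
Optimizing (6) is equivalent to optimizing (7) when $\lambda = 1/(2C)$:
\[ \min_{w} \sum_{i=1}^{\ell} \left[ 1 - z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_+ + \lambda \|w\|^2 \tag{7} \]
Let $w^*$ be the weight vector of the SVM model. Geometrically, $w^*$ is perpendicular to the hyperplane of the Ranking SVM. We use $w^*$ to build the ranking function $f_{w^*}$ for ranking documents:
\[ f_{w^*}(x) = \langle w^*, x \rangle \tag{8} \]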
When SVM is applied, each feature vector is created from a pair of documents. Each feature is defined as a function of the query and the document. For example, the keyword frequency feature is the number of times the keywords of the query occur in the document. All results from all queries are used in training. No distinction is made between documents from different queries. Furthermore, no distinction is made between document pairs from different rank levels, whereas in practice a ranking error between a high-ranked and a low-ranked document has a larger impact than a ranking error between two low-ranked documents. These are the two issues that can make Ranking SVM inaccurate.
To address these two issues, we can define a new loss function based on the hinge loss [29].
Loss function
In the loss function (9) we add a rank parameter $\tau$ to adjust for the bias between rank pairs, and a parameter $\mu$ to adjust for the bias between queries. We restate the Ranking SVM problem as minimizing the following loss function:
\[ \min_{w} L(w) = \sum_{i=1}^{\ell} \tau_{k(i)}\, \mu_{q(i)} \left[ 1 - z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_+ + \lambda \|w\|^2 \tag{9} \]
where $k(i)$ is the rank type of document pair $i$, $\tau_{k(i)}$ is the rank parameter of $k(i)$, $q(i)$ is the query of document pair $i$, and $\mu_{q(i)}$ is the parameter of query $q(i)$. The penalty incurred by the $i$-th pair is determined by the product $\tau_{k(i)}\, \mu_{q(i)}$.

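Determining the parameter values
We must decide how to compute the values of $\tau$ and $\mu$.
For $\tau$, we use a heuristic method that estimates the parameters from a base model. Assume NDCG is used for evaluation (other measures can also be used). The algorithm is described as follows:

Figure 8. Algorithm for estimating the parameter $\tau$ [29]
For $\mu$ we compute:
\[ \mu_{q(i)} = \frac{\max_j \#\{\text{document pairs associated with } q(j)\}}{\#\{\text{document pairs associated with } q(i)\}} \tag{10} \]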
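To make (9) and (10) concrete, the following sketch evaluates the weighted hinge loss for a given weight vector. It does not perform the minimization; the data layout, the rank-pair keys and the function names are hypothetical, and the values of $\tau$ are simply supplied by the caller.

```python
from collections import Counter


def query_weights(pair_queries):
    """mu_q from (10): the largest pair count over all queries divided by
    the pair count of query q."""
    counts = Counter(pair_queries)
    largest = max(counts.values())
    return {q: largest / c for q, c in counts.items()}


def weighted_loss(w, pairs, tau, lam):
    """Value of the objective (9) for weight vector w.

    pairs: list of (diff_vector, z, rank_pair_key, query_id)
    tau:   dict mapping rank_pair_key to its rank parameter
    lam:   regularization coefficient lambda
    """
    mu = query_weights([q for _, _, _, q in pairs])
    total = 0.0
    for diff, z, k, q in pairs:
        margin = z * sum(wi * di for wi, di in zip(w, diff))
        total += tau[k] * mu[q] * max(0.0, 1.0 - margin)  # weighted hinge
    return total + lam * sum(wi * wi for wi in w)


# Hypothetical toy data: query q1 contributes two pairs, q2 only one.
pairs = [
    ([1.0, 0.0], 1, "hi-lo", "q1"),
    ([0.0, 1.0], 1, "hi-lo", "q1"),
    ([1.0, 1.0], -1, "hi-lo", "q2"),
]
print(weighted_loss([0.5, 0.5], pairs, {"hi-lo": 1.0}, 0.1))
```

Because q2 has half as many pairs as q1, its single pair receives weight 2, so queries with few result pairs are not dominated by queries with many.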
3.1.3 Evaluation measures for ranking
To evaluate the quality of a ranking, the measures commonly used in machine learning, such as precision, recall and the F-measure, are not used directly. Ranking requires the correct objects (those that fit the criterion) to be placed as near the top of the ranked list as possible.
The following measures evaluate the effectiveness of a ranking:
3.1.3.1 MAP
Precision at K (P@K) is the precision of the top K objects of the ranked list. Let Match@K be the number of correct objects among the first K positions of the ranking [19]. We have:
\[ P@K = \frac{Match@K}{K} \]
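Average precision (AP) is the mean of the P@K values at the levels K where a correct object occurs. Let $I(K)$ be the indicator of the object at rank position K: $I(K) = 1$ if it is correct and $I(K) = 0$ otherwise. The average precision is:
\[ AP = \frac{\sum_{K=1}^{n} P@K \times I(K)}{\sum_{j=1}^{n} I(j)} \]
The mean over all queries (Mean Average Precision) is:
\[ MAP = \frac{\sum_{i=1}^{m} AP_i}{m} \]
where $m$ is the total number of queries.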
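Example:
Suppose there are 5 objects: a, b, c, d, e.
Of these, a, b, c are relevant objects and d, e are not relevant.
A ranking of the objects to be evaluated is: c, a, d, b, e. Then we have:
P@1 = 1; P@2 = 1; P@3 = 2/3; P@4 = 3/4; P@5 = 3/5.
Computing AP over the first K positions gives AP(1) = 1; AP(2) = 1; AP(3) = 1; AP(4) = (1 + 1 + 3/4) / 3 = 11/12, which is also the AP of the whole list, since position 5 holds an irrelevant object.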
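The P@K, AP and MAP definitions of this section can be sketched in a few lines. The function names are hypothetical; the data reuse the ranking c, a, d, b, e with relevant objects a, b, c from the example.

```python
def precision_at_k(ranking, relevant, k):
    """P@K: fraction of the first k ranked objects that are relevant."""
    return sum(1 for item in ranking[:k] if item in relevant) / k


def average_precision(ranking, relevant):
    """AP: mean of P@K over the positions K that hold a relevant object."""
    hits = [k for k, item in enumerate(ranking, start=1) if item in relevant]
    if not hits:
        return 0.0
    return sum(precision_at_k(ranking, relevant, k) for k in hits) / len(hits)


def mean_average_precision(runs):
    """MAP: mean of AP over a list of (ranking, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranking = ["c", "a", "d", "b", "e"]
relevant = {"a", "b", "c"}
print(average_precision(ranking, relevant))  # 11/12, about 0.917
```

The denominator is the number of relevant objects found in the ranking, which matches the sum of $I(j)$ in the AP formula.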
3.1.3.2 NDCG (Normalized Discounted Cumulative Gain)
DCG (Discounted Cumulative Gain) is a measure of the effectiveness of algorithms in search engines and similar applications, and it is commonly used in information retrieval. Given a graded relevance value for each document in the result set returned by the search engine, DCG measures the usefulness of a document based on its position in the list. The gain is accumulated from the top to the bottom of the result list and is discounted at lower positions [19].
Two assumptions are made when using DCG and related measures:
- It is better if highly relevant documents appear early in the result list of the search engine (that is, have a higher rank).
- Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than irrelevant documents.
DCG is derived from a more basic measure, CG (Cumulative Gain).
Cumulative Gain: CG does not take the position of a result into account; it is the sum of the relevance values of all documents in the result list. CG at position $p$ is computed as:
\[ CG_p = \sum_{i=1}^{p} rel_i \]
where $rel_i$ is the relevance of the result at position $i$.
CG is not affected by the order of the results in the list. Moving a highly relevant document down to a lower position does not change the CG value. Given the two assumptions above about the usefulness of search results, DCG is used instead because it reflects the ordering.
Discounted Cumulative Gain: the premise of DCG is that a highly relevant document appearing at a lower position should be penalized, by discounting its relevance by the logarithm of its position in the result list. DCG at position $p$ is computed as:
\[ DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i} \]
DCG can also be computed with the formula:
\[ DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2 (1 + i)} \]
Normalized DCG:
\[ nDCG_p = \frac{DCG_p}{IDCG_p} \]
where $IDCG_p$ (Ideal Discounted Cumulative Gain) is the DCG value of a perfect result, obtained when all documents are placed at the positions that correspond to their relevance.
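The first DCG formula and the normalization can be sketched as follows. The function names are hypothetical, and the ideal ordering is obtained simply by sorting the same relevance values in decreasing order. The sample input is the relevance list 3, 2, 3, 0, 1, 2.

```python
import math


def dcg(rels):
    """DCG_p = rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))


def ndcg(rels):
    """nDCG_p = DCG_p / IDCG_p, where IDCG_p is the DCG of the ideal order."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0


rels = [3, 2, 3, 0, 1, 2]  # relevance of the results, in ranked order
print(round(dcg(rels), 3), round(ndcg(rels), 3))
```

With unrounded logarithms the DCG of this list is about 8.10 and the nDCG about 0.93; hand calculations that round each logarithm to two decimals differ slightly in the third decimal place.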
Example: suppose there are 6 documents a, b, c, d, e, f with relevance values 3, 3, 2, 2, 1, 0 respectively, and the following ranking is returned: b, c, a, f, e, d. The values at each position are:

i    rel_i    log_2 i    rel_i / log_2 i
1    3        N/A        N/A
2    2        1          2
3    3        1.59       1.887
4    0        2.0        0
5    1        2.32       0.431
6    2        2.59       0.772

We have:
CG_6 = 3 + 2 + 3 + 0 + 1 + 2 = 11
DCG_6 = 3 + (2 + 1.887 + 0 + 0.431 + 0.772) = 8.09
IDCG_6 = 3 + (3 + 2/1.59 + 2/2 + 1/2.32 + 0) = 8.693
nDCG_6 = DCG_6 / IDCG_6 = 8.09 / 8.693 = 0.9306

Besides these two measures, other measures are also used, such as the mean reciprocal rank (MRR), the number of correct objects at level K (Match@K), and the mean total reciprocal rank of the correct objects (MTRR) [2]. However, NDCG and MAP are two widely used measures that appear in many works such as [11], [19], [29].
3.2 Hidden topics
Representing data effectively in order to exploit the relations within it has become increasingly sophisticated and complex, and much research has addressed this problem. Hidden topic models [10] are an important step forward in modeling text data. They are based on the idea that each document has a probability distribution over topics, and that each topic is a distribution over words. Representing words and documents as probability distributions has great advantages over the usual vector space model.
One idea behind hidden topic models is to generate new documents according to probability distributions. First, to create a new document, we choose a distribution over topics for that document; this means that the document is made up of different topics in different proportions. Then, to generate the words of the document, we pick words at random according to the probability distributions of words over topics.
Conversely, given a set of documents, we can determine a set of hidden topics for each document and the probability distribution of words over each topic.
Two examples of topic analysis with hidden models are Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA).
pLSA is a statistical technique for analyzing co-occurrence data [17]. It was developed from LSA combined with a probabilistic model. However, according to the analysis of Blei et al. (2003) [10], although pLSA is an important step in modeling text data, it is still incomplete in that it does not provide a proper probabilistic model at the document level. This causes problems when assigning probabilities to a document outside the training set; in addition, the number of parameters grows linearly with the size of the data set.
LDA is a more complete model than pLSA and overcomes these drawbacks. This hidden topic model is the one used to build our system.
3.2.1 Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is based on the idea that each document is a mixture of several topics. In essence, LDA is a three-level hierarchical Bayesian model (corpus level, document level, word level) in which each part of the model is treated as a finite mixture over an underlying set of topic probabilities [27].
3.2.2 Generative model in LDA
Consider a corpus of $M$ documents $D = \{d_1, d_2, \dots, d_M\}$, where each document $m$ in the corpus consists of $N_m$ words $w_i$ drawn from a vocabulary of terms $\{t_1, \dots, t_V\}$, and $V$ is the number of terms in the vocabulary. LDA provides a complete generative model that has been shown to give better results than earlier methods. The process that generates documents is as follows:

Figure 9. Graphical representation of LDA [15]
The boxes in Figure 9 represent repeated processes.
Input parameters: $\alpha$ and $\beta$ (corpus-level parameters)
$\alpha$: Dirichlet prior on $\vec{\theta}_m$
$\beta$: Dirichlet prior on $\vec{\varphi}_k$
$\vec{\theta}_m$: topic distribution of document $m$ (document-level parameter). It represents the parameters of $p(z \mid d = m)$, the topic mixture proportion for document $m$. There is one proportion per document: $\Theta = \{\vec{\theta}_m\}_{m=1}^{M}$ (an $M \times K$ matrix).
$z_{m,n}$: topic index of word $n$ in document $m$
$w_{m,n}$: word $n$ of document $m$, indicated by $z_{m,n}$ (word-level variable, observed word)
$\vec{\varphi}_k$: distribution of the words generated from topic $z_{m,n} = k$. It represents the parameters of $p(t \mid z = k)$, the mixture component of topic $k$. There is one component per topic: $\Phi = \{\vec{\varphi}_k\}_{k=1}^{K}$ (a $K \times V$ matrix).
$M$: number of documents
$N_m$: number of words in document $m$ (also called the document length)
$K$: number of hidden topics
LDA generates the words $w_{m,n}$ of each document $\vec{d}_m$ as follows:
- For each document $m$, draw a topic distribution $\vec{\theta}_m$ for the document.
- For each word, sample the topic index $z_{m,n}$ from this topic distribution.
- For each topic index $z_{m,n}$, generate the word $w_{m,n}$ from the word distribution $\vec{\varphi}_{z_{m,n}}$.
- $\vec{\varphi}_k$ is sampled once for the whole corpus.
The full generative model (annotated) is shown in Figure 10.

Figure 10. Full generative model for LDA [28].
Here Dir, Poiss and Mult denote the Dirichlet, Poisson and Multinomial distributions respectively (that is, sampling from these distributions).
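The generative process listed above can be sketched with the Python standard library. The sketch makes several simplifying assumptions that are not in the source: symmetric Dirichlet priors, a fixed document length instead of a Poisson draw, and hypothetical function names and vocabulary.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_length, vocab, num_topics, alpha, beta, seed=0):
    rng = random.Random(seed)
    # phi_k: one word distribution per topic, sampled once for the whole corpus
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        theta = sample_dirichlet(alpha, num_topics, rng)   # theta_m
        words = []
        for _ in range(doc_length):                        # fixed N_m (assumption)
            z = sample_index(theta, rng)                   # topic index z_{m,n}
            words.append(vocab[sample_index(phi[z], rng)]) # word w_{m,n}
        corpus.append(words)
    return corpus, phi


vocab = ["ad", "click", "query", "rank"]
corpus, phi = generate_corpus(3, 8, vocab, num_topics=2, alpha=0.5, beta=0.5, seed=1)
print(corpus[0])
```

Parameter estimation runs this process in reverse: only the words are observed, and the topic assignments and the two families of distributions have to be inferred.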
3.2.3 Parameter estimation and inference
Given a set of documents, the goal of this process is to find the topic model ($\vec{\varphi}_k$, $\vec{\theta}_m$) that generated them. Parameter estimation for LDA with Gibbs sampling consists of the following steps:
Initialization: the first sampling pass. The pseudocode of the initialization is:
    zero all count variables n_m^(z), n_m, n_z^(t), n_z
    for all documents m in [1, M] do
        for all words n in [1, N_m] in document m do
            sample topic index z_{m,n} ~ Mult(1/K)
            increment document-topic count: n_m^(z) + 1
            increment document-topic sum: n_m + 1
            increment topic-term count: n_z^(t) + 1
            increment topic-term sum: n_z + 1
        end for
    end for
where:
n_m^(z): number of times topic z occurs in document m
n_m: total number of topic assignments in document m
n_z^(t): number of times term t is assigned to topic z
n_z: total number of terms assigned to topic z
Each time a topic is sampled for a word, the counts for the corresponding term and topic are incremented.
Burn-in period: resampling is repeated until a certain accuracy is reached. The pseudocode of this process is:
    while not finished do
        for all documents m in [1, M] do
            for all words n in [1, N_m] in document m do
                - for the current assignment of z to a term t for word w_{m,n}:
                  decrement counts and sums: n_m^(z) - 1; n_m - 1; n_z^(t) - 1; n_z - 1
                - multinomial sampling according to the full conditional (using the decremented counts from the previous step):
                  sample topic index z~ ~ p(z_i | z_{-i}, w)
                - use the new assignment of z~ to the term t for word w_{m,n} to:
                  increment counts and sums: n_m^(z~) + 1; n_m + 1; n_{z~}^(t) + 1; n_{z~} + 1
            end for
        end for
In each resampling step, the counts for the old topic and term are decreased by 1, and the counts for the new topic and term are increased by 1.
Checking convergence and reading out the parameters: when the process finishes, the output parameters $\Phi$ and $\Theta$ are read out. The pseudocode for reading out the parameters is:
        if converged and L sampling iterations since last read out then
            - the different parameter read-outs are averaged
            read out parameter set Phi according to the formula for phi_k
            read out parameter set Theta according to the formula for theta_m
        end if
    end while
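The two hidden distributions $\vec{\varphi}_k$ and $\vec{\theta}_m$ are computed as follows:
\[ \varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{v=1}^{V} \left( n_k^{(v)} + \beta_v \right)}, \qquad \theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} \left( n_m^{(z)} + \alpha_z \right)} \]
With an estimated LDA model, topics can be inferred for new documents using a similar sampling procedure.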
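The three phases described in this section (initialization, resampling, read-out) can be put together in a compact sketch. It assumes symmetric priors $\alpha$ and $\beta$, uses the standard collapsed Gibbs full conditional for the sampling step, runs a fixed number of iterations instead of testing convergence, and reads the parameters out once at the end rather than averaging several read-outs. It is an illustration only, not the implementation in GibbsLDA++.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, alpha, beta, iterations, seed=0):
    """docs: list of documents, each a list of term indices in [0, vocab_size)."""
    rng = random.Random(seed)
    n_mz = [[0] * num_topics for _ in docs]                # n_m^(z)
    n_zt = [[0] * vocab_size for _ in range(num_topics)]   # n_z^(t)
    n_z = [0] * num_topics                                 # n_z
    z = []
    # Initialization: assign a uniformly random topic to every word.
    for m, doc in enumerate(docs):
        z.append([])
        for t in doc:
            k = rng.randrange(num_topics)
            z[m].append(k)
            n_mz[m][k] += 1
            n_zt[k][t] += 1
            n_z[k] += 1
    # Resampling: remove the word from the counts, sample, add it back.
    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]
                n_mz[m][k] -= 1
                n_zt[k][t] -= 1
                n_z[k] -= 1
                weights = [
                    (n_zt[j][t] + beta) / (n_z[j] + vocab_size * beta)
                    * (n_mz[m][j] + alpha)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                z[m][n] = k
                n_mz[m][k] += 1
                n_zt[k][t] += 1
                n_z[k] += 1
    # Read-out of phi (topics x terms) and theta (documents x topics).
    phi = [[(n_zt[k][t] + beta) / (n_z[k] + vocab_size * beta)
            for t in range(vocab_size)] for k in range(num_topics)]
    theta = [[(n_mz[m][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for m, doc in enumerate(docs)]
    return phi, theta


docs = [[0, 1, 0, 1], [2, 3, 2, 3]]  # hypothetical toy corpus of term indices
phi, theta = gibbs_lda(docs, vocab_size=4, num_topics=2,
                       alpha=0.5, beta=0.1, iterations=50, seed=1)
print(theta)
```

The document-level sum $n_m$ equals the document length and never changes, so the sketch uses `len(doc)` instead of keeping a separate counter.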
3.3 Query-oriented online advertising model with the help of topic analysis and ranking techniques
As presented in the previous chapters, an important problem in search advertising is ranking ads by their relevance to the user's query. The methods presented in Chapter 2 show that the choice of features for representing ads is crucial. There are cases where an ad and a query are highly relevant to each other but use different vocabularies. Therefore, besides keyword features, it is necessary to use features at a higher level of abstraction. The work of Broder et al. [11] shows that extended features such as query classes and Prisma phrases give promising results. In particular, the work of Le Dieu Thu [27] shows that using hidden topics in contextual advertising to expand the vocabulary of both ads and web pages gives very promising results.
In this section, we present an online advertising model for search engines that uses hidden topic analysis and ranking techniques. Unlike the model built by Le Dieu Thu [27], our model is designed to rank ads on a search engine according to the user's query. Hidden topics are used to build new features for representing ads. In addition, the model exploits a large volume of query logs to build the training set.
3.3.1 Problem description
The problem is stated as follows: given a user query and a set of available ads, return the K ads most relevant to the query.
Input:
- Query $q$
- Ad set $A = \{a_1, a_2, \dots, a_n\}$
Output:
- K ads $R = \{a_{r_1}, a_{r_2}, \dots, a_{r_K}\}$
To solve the problem, we build a ranking function $F$ as follows:
\[ F: \{Q\} \times \{A\} \to [0, 1] \]
where $F(q, a)$ returns the relevance of ad $a$ to query $q$; the higher the relevance, the higher the ad is ranked.
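The input/output contract of this problem can be sketched as follows. The scoring function used here is a hypothetical word-overlap placeholder that stands in for $F$; it is not the learned ranking function of the model.

```python
def top_k_ads(query, ads, score, k):
    """Return the k ads with the highest score(query, ad), best first."""
    ranked = sorted(ads, key=lambda ad: score(query, ad), reverse=True)
    return ranked[:k]


def overlap_score(query, ad):
    """Placeholder for F(q, a): fraction of query words found in the ad,
    which lies in [0, 1]."""
    q_words = set(query.lower().split())
    a_words = set(ad.lower().split())
    return len(q_words & a_words) / len(q_words) if q_words else 0.0


# Hypothetical ads, for illustration only.
ads = ["buy red shoes", "cheap flights", "red wine"]
print(top_k_ads("red shoes", ads, overlap_score, k=2))
```

Any function with values in [0, 1] can be plugged in as `score`, including a normalized score obtained from a learned ranking model.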
Zeng [29] and Xu [29] show that the SVM ranking algorithm gives good results in ranking as well as in clustering search results when the query, title and snippet (summary content) are all used in learning. In this model, SVM-Rank is used to build the ranking function $F$ defined above.
3.3.2 Overview of the model
Based on the research discussed above, we propose a search advertising system that uses hidden topic analysis and ranking techniques. An overview of the system is given below.

[Figure 11 (diagram). Components: Training data (1); Model estimation (2); Estimated model; New training data (3); Learn-to-rank model (4); Keyword matching (5); Ads; Relevant ads; Topic inference (6); Ranking function (7); Relevant ads ranking]

Figure 11. Overview of the advertising system using hidden topics
The model consists of the following main steps:
1) Build the training set. The training set is built by analyzing the query logs, collecting the titles and descriptions of the web pages, and treating each of them as an ad (document).
2) Build the hidden topic model, determining the topics and the probability distribution of topics over each document.
3) Build the training set with the new features; these consist of keyword frequencies and the probabilities that each document belongs to each topic.
4) Build the ranking function from the resulting training set, using the SVM-Rank algorithm.
5) Search for the ads that match the query.
6) Infer the hidden topics of the ads and represent the ads with the new features.
7) Rank the ads using the ranking function built from the training set.
3.3.3 Feature definition for the model
In this model, we treat each ad (its body and title) as a document, and we likewise treat the snippet (title and description) of a web page as a document. Let our document set be $D = \{d_1, d_2, \dots, d_m\}$. We use the following features when building the ranking function with the SVM-Rank algorithm:
Term Frequency / Inverse Document Frequency:
Term Frequency (TF):
\[ tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \]
where $n_{i,j}$ is the number of occurrences of keyword $t_i$ in document $j$.
Inverse Document Frequency (IDF):
\[ idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \]
where $|D|$ is the number of documents in the set $D$ and $|\{d : t_i \in d\}|$ is the number of documents in which keyword $t_i$ appears.
We have:
\[ (tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i \]
Hidden Topic:
Suppose we have identified $K$ topics from the training set. For each document $d$, we compute the probability $pd_i$ that document $d$ belongs to topic $i$, for $i = 1, \dots, K$.
From these we obtain the topic vector of document $d$:
\[ T(d) = [pd_1, pd_2, \dots, pd_K] \]
From the two kinds of features above, we build the vector $V(d)$ that represents the document:
\[ V(d) = [\mathit{tfidf}(t_1, d), \mathit{tfidf}(t_2, d), \dots, \mathit{tfidf}(t_m, d), pd_1, pd_2, \dots, pd_K] \]
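The construction of $V(d)$ can be sketched as follows. The sketch uses whitespace-tokenized documents, the natural logarithm for IDF, and topic probabilities supplied by the caller (for example from an LDA inference step); the function names and data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab):
    """TF-IDF vector of every document over a fixed vocabulary order.

    docs: list of token lists; vocab: list of terms t_1, ..., t_m.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vec = []
        for term in vocab:
            tf = counts[term] / total if total else 0.0
            idf = math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


def document_vector(tfidf_vec, topic_probs):
    """V(d): TF-IDF features followed by the topic probabilities pd_1..pd_K."""
    return list(tfidf_vec) + list(topic_probs)


docs = [["red", "shoes"], ["red", "wine"]]
vocab = ["red", "shoes", "wine"]
vectors = tfidf_vectors(docs, vocab)
v0 = document_vector(vectors[0], [0.7, 0.3])  # hypothetical topic probabilities
print(v0)
```

A term that occurs in every document, such as "red" here, gets an IDF of 0, so only the topic part of $V(d)$ can relate two documents that share no distinctive keyword.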
Chapter 4. Experiments and evaluation
4.1. Data
The model uses query logs to build the data set for training. The query log is an important part of a search engine. It records the behavior of users while searching, as well as what users are interested in for each query. The query log does not contain the ads shown to users, but it does contain the queries entered and the search results that users clicked. An ad is essentially a document with a title and a description of the web page that the ad points to. We can therefore treat the title and summary of a web page (usually placed in meta tags) as ad content and use them for training. Using the query log makes it possible to exploit a great deal of useful information from user search behavior.
We use 1 GB of query logs taken from the MSN search engine [36], with 14 million queries and clicked URLs. All queries are in English. Each query log record contains the following information:
- QueryID: identifier of the query; log records with the same identifier belong to the same session.
- Query: the query text as entered by the user.
- Time: the time at which the user clicked the URL.
- URL: the URL clicked by the user.
- Position: the position of the clicked URL in the returned result list.
4.2. Experimental environment
4.2.1 Hardware configuration
The experiments were carried out on a computer with the following hardware configuration:
Table 2. Hardware configuration used in the experiments
Component           Specification
CPU                 1 Pentium IV 3.06 GHz
RAM                 1.5 GB
OS                  Windows XP Service Pack 2
Secondary storage   240 GB
4.2.2 Tools used
The following open-source tools were used in the experiments:
Table 3. Open-source software used
No.   Software     Author           Source
1     SVM-Rank     Joachims         http://svmlight.joachims.org/
2     GibbsLDA++   Phan Xuan Hieu   http://gibbslda.sourceforge.net
Ngoi cc cng c k trn, chng ti xy dng cc module x l bng ngn ng
Python nh sau:
- Module filter: from the 14 million query log records, takes the first 1 million. Groups all URLs returned for the same query, scores each URL within each session, and sums the scores of each URL over all sessions. Sorts the URLs in descending order of score.
- Module crawl: crawls the web pages of the URLs produced by the filter module, then parses each page to extract its title and description. The title and description of a web page are treated as one document in the training data set.
- Module normalize: normalizes the content collected by the crawl module, for example by removing stop words, meaningless symbols, and duplicate content.
- Module tfidf: vectorizes the collected documents using keyword-frequency features (TF-IDF).
- Module tfidf_lda: vectorizes the collected documents using keyword-frequency features (TF-IDF) together with the probabilities that a document belongs to each hidden topic.
- Module test: takes ads that have been ordered according to user judgments, vectorizes them using keyword-frequency features, and ranks them with the ranking function. The returned ranking is compared with the users' ordering to compute the NDCG and MAP measures.
- Module test_lda: takes ads that have been ordered according to user judgments and infers the hidden topics each ad may belong to. Each ad is vectorized using keyword-frequency features together with the probabilities that the ad belongs to each hidden topic, and the ads are then ranked with the ranking function. The returned ranking is compared with the users' ordering to compute the NDCG and MAP measures.
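The extraction step of the crawl module can be sketched with Python's standard `html.parser`. The sketch below only covers parsing an already downloaded page; the class and function names are hypothetical, and the original module may have worked differently.

```python
from html.parser import HTMLParser

class TitleDescriptionParser(HTMLParser):
    """Collects the <title> text and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_document(html):
    """Return title and description joined into one training document."""
    parser = TitleDescriptionParser()
    parser.feed(html)
    parts = (parser.title.strip(), parser.description)
    return " ".join(part for part in parts if part)
```

A page with neither a title nor a meta description yields an empty string, which matches the case of empty content that is discarded later in the pipeline.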
4.3. Experimental process
The experimental process consists of the following main steps:
- Data processing: preprocess the data, build the training document set for the model, and vectorize the data.
- Building the ranking function: train on the prepared data set with the SVM-Rank algorithm.
- Building the test set: collect ads from the MSN search engine.
- Evaluating the model: collect user judgments and compare them with the results produced by the model.
4.3.1. Data preprocessing
We take the first one million query log records and, from these, select all queries with more than 4 user clicks. This yields 30,372 queries. A query may be entered by many users at different times. We score each URL with respect to a query as follows:
o Within one session, list the URLs that the user clicked.
o Assign each URL a score that decreases from 100 in the order of the user's clicks. For example, for the keyword yahoo, 4 URLs are returned and clicked in the following order: http://yahoo.com, http://my.yahoo.com, http://mail.yahoo.com, http://search.yahoo.com. The scores of the 4 URLs in this session are then 100, 90, 80, and 70, respectively.
o Sum the scores of every URL for a query over the different sessions.
o For each query, sort the clicked URLs in descending order of score. If two URLs have the same score, we consider the position of the URL among the returned URLs. This result is used in the next processing step.
This scoring scheme has the following properties:
o URLs that are clicked more often receive higher scores than URLs that are clicked less often.
o Within a session, URLs clicked earlier receive higher scores than URLs clicked later.
With this scoring scheme, we capture users' interest with respect to a query.
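The scoring scheme can be sketched in Python as follows. The step of 10 points per click is inferred from the example (100, 90, 80, 70). Flooring the score at 0 and breaking ties by the smallest result position are assumptions, and `rank_urls` is a hypothetical name.

```python
from collections import defaultdict

def rank_urls(sessions, start=100, step=10):
    """Score and rank the clicked URLs of one query.

    sessions: list of sessions; each session is a list of
              (url, position) pairs in the order they were clicked.
    Returns (url, total_score) pairs, best first.
    """
    totals = defaultdict(int)
    best_position = {}
    for clicks in sessions:
        for order, (url, position) in enumerate(clicks):
            # Earlier clicks score higher; assumed never to go below 0.
            totals[url] += max(start - step * order, 0)
            best_position[url] = min(position, best_position.get(url, position))
    # Higher total first; on ties, the better (smaller) result position wins.
    ranked = sorted(totals, key=lambda url: (-totals[url], best_position[url]))
    return [(url, totals[url]) for url in ranked]
```

Summing over sessions is what lets frequently clicked URLs overtake URLs that were clicked first in only one session.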
4.3.2. Collecting information from the obtained URLs
Starting from the list of URLs sorted by the scores obtained above, we retrieve the title and description of the web page corresponding to each URL. At this step we may encounter dead pages or broken URLs, which must be discarded. Combining the title and the description of each web page gives the data for the learning process. We remove the URLs whose retrieved content is empty and then keep only the queries that have at least 4 result contents. At the end of this step we obtain 16,534 queries and 83,312 contents (summaries) of the web pages corresponding to those queries.
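The final filtering step can be sketched as below, assuming the retrieved contents are held in a dictionary keyed by query; `filter_queries` and its parameters are hypothetical names.

```python
def filter_queries(contents_by_query, min_results=4):
    """Drop empty contents, then keep queries with enough remaining results.

    contents_by_query: {query: [content, ...]} in ranked order
    """
    kept = {}
    for query, contents in contents_by_query.items():
        # Dead pages and broken URLs are assumed to show up as empty content.
        non_empty = [text for text in contents if text and text.strip()]
        if len(non_empty) >= min_results:
            kept[query] = non_empty
    return kept
```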
Using the title and description of a web page is not necessarily the optimal way to build the training data set, but it may be better than using the full content of the page, which can introduce a large amount of noise into the learning process.
4.3.3. Data vectorization
The data are vectorized by extracting the following features:
a) TF-IDF
After removing stop words, symbols, and meaningless characters, we obtain the list of keywords in the data set. Each keyword is treated as one feature of the data.
Computing the weight of each data item for each feature by TF-IDF gives the tf-idf weight vector:
$$D(d) = (\mathrm{tfidf}(d, 1), \mathrm{tfidf}(d, 2), \ldots, \mathrm{tfidf}(d, n))$$
where n is the number of distinct keywords.
b) Hidden topics
From the prepared data set, we use the GibbsLDA++ tool [16] to obtain the list of hidden topics and the probability that each data item belongs to each topic. The number of topics is set to 100. This gives the hidden-topic feature vector of each data item:
$$H(d) = (\mathit{pd}_1, \mathit{pd}_2, \ldots, \mathit{pd}_K)$$
where K is the number of hidden topics.
Combining the two vectors H(d) and D(d) above, we obtain the representation vector V(d) of the data item.
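According to the GibbsLDA++ documentation, the tool writes the document-topic distributions to a `.theta` file with one document per line and one topic per column. Under that assumption, H(d) for every document can be read with the sketch below; `read_theta` is a hypothetical helper, and the file name in the usage note is only an example.

```python
def read_theta(lines):
    """Parse document-topic probabilities, one document per line.

    lines: iterable of text lines with whitespace-separated probabilities.
    Returns a list of topic vectors H(d), one per document.
    """
    return [[float(value) for value in line.split()]
            for line in lines if line.strip()]
```

With a trained model this could be called as `read_theta(open("model-final.theta"))`, and each returned row has as many entries as the number of topics chosen at training time.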
4.3.4. Experimental design
To evaluate the effect of hidden topics on the ranking results, we implement two ranking systems:
- The first system uses SVM-Rank with only the keyword-frequency features of the documents (TF-IDF). This system is called RTF.
- The second system uses SVM-Rank with the keyword-frequency features together with the probabilities that a document belongs to each hidden topic. This system is called RHT.
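Both systems pass their feature vectors to SVM-Rank, which reads one example per line in the form `<target> qid:<qid> <feature>:<value> ...`, with feature indices starting at 1 and in increasing order. The sketch below serializes one vector in that form; the helper name `to_svmrank_line` and the choice to omit zero-valued features are assumptions made for illustration.

```python
def to_svmrank_line(target, qid, vector):
    """Format one example for an SVM-Rank input file.

    target: rank label; within one qid, a higher value means preferred
    qid:    integer identifier of the query the example belongs to
    vector: dense feature vector (TF-IDF only for RTF, TF-IDF plus
            topic probabilities for RHT)
    """
    features = " ".join(
        f"{index}:{value:.6g}"
        for index, value in enumerate(vector, start=1)
        if value != 0  # sparse format: zero features are left out
    )
    return f"{target} qid:{qid} {features}".rstrip()
```

The only difference between the RTF and RHT input files in this sketch is the length of `vector`; the file format and the learner are the same.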
We select a number of queries and search for them manually on several search engines such as MSN, Yahoo, and Google. In total, 40 queries are used, covering different domains such as computer, sport, and medicine. From the result pages, we take 5 ads for each query. The model is evaluated in two steps:
- From the collected ads, remove stop words and meaningless characters and symbols. Infer the hidden topics of each ad and compute the probability distribution of the topics over the ad. Build the ad vector from these probabilities and from the frequencies of the keywords in the ad. Use the SVM-Rank tool with the model obtained during training to rank the results.
- Collect user judgments on the list of results obtained for each query. We ask 5 users, giving each of them a request such as: "for the query above, click the following links in order of relevance". Each user's judgments are used to compute a set of measures for the model, and the final result is the average of these measures.
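The measures and their averaging over users can be sketched as follows. The sketch assumes a common formulation of NDCG (gain 2^rel - 1 with a log2 position discount) and average precision over binary relevance flags; the exact definitions used in the thesis are given outside this section and may differ.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k graded relevance values."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """gains: relevance grades in the order produced by the ranking model."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def average_precision(relevant_flags):
    """relevant_flags: True/False per ranked item, best-ranked first."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean(values):
    """Average of per-user (or per-query) measure values."""
    values = list(values)
    return sum(values) / len(values)
```

For one query, `mean(ndcg_at_k(g, 5) for g in gains_per_user)` gives the user-averaged NDCG@5, where `gains_per_user` is a hypothetical list holding one graded ranking per user; MAP is the mean of `average_precision` over queries.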
4.4. Experimental results
We first compare the averages of the measures over all queries. The results show that the RHT system, which uses hidden topics, achieves higher average results than RTF. For the MAP and NDCG@5 measures, the results of RHT are 0.75 and 0.84, respectively (Figure 12).
[Chart comparing RTF and RHT on MAP, NDCG@1, NDCG@3 and NDCG@5]
Figure 12. Average of the measures over all queries
We then compare the average NDCG@5 and MAP at different numbers of queries.
[Plot of average NDCG@5 for RTF and RHT; x-axis: number of queries (10, 20, 30, 40)]
Figure 13. Average NDCG@5 at different numbers of queries
Figure 13 shows that the average NDCG@5 of the RHT system is higher than that of the RTF system. The maximum value reached is 0.85 at 10 queries and the minimum is 0.84 at 40 queries.
[Plot of average MAP for RTF and RHT; x-axis: number of queries (10, 20, 30, 40)]
Figure 14. Average MAP at different numbers of queries
Figure 14 shows that the average MAP of RHT is higher than that of the RTF system. The maximum value reached is 0.79 at 10 queries and the minimum is 0.75 at 40 queries.
The table below lists the values of the measures for several queries on the RHT system.
Table 4. Values of the measures for several queries
Query                      MAP   NDCG@1  NDCG@3  NDCG@5
paint colors for bedrooms  0.91  0.93    0.82    0.91
tennis equipment           0.77  0.79    0.68    0.85
baseball bats              0.86  1.0     0.77    0.88
shirt design               0.75  0.87    0.68    0.87
4.5. Evaluation of the experimental results
The experiments show that the ad ranking model we built gives fairly good results. The average NDCG@5 is about 0.82-0.84 and the average MAP is about 0.73-0.75.
Several factors may have affected these results:
- Using user judgments to evaluate the results: for a given query, different users may have different search goals and interests. This leads to large differences between the judgments of different users.
- Using web page titles and descriptions as training data: the title and description of a web page usually give an overview of that page. However, for pages that are poorly built or do not follow standards, the title and description may be missing or unrelated to the content of the page.
The experiments also compare ad ranking with and without hidden topics. Using hidden topics gives fairly promising results: compared with not using hidden topics, the average NDCG@5 increases by 0.02 and the average MAP increases by 0.02.
These results show that using a hidden topic model to build new features for representing ads is effective for ranking ads according to user queries. In addition, exploiting query logs to build the training data set lets the model capture users' interests with respect to each search query.
Conclusion
With the rapid growth of the internet and of search engines, solving the problems that arise in online advertising is becoming increasingly urgent. Ranking ads on a search engine according to user queries is a problem that currently receives a great deal of attention. The main goal of this thesis is to propose a solution to this problem based on a hidden topic model.
The thesis has achieved the following results:
- It gives an overview of online advertising and of the state of online advertising in the world and in Vietnam.
- It analyzes several methods and models used in online advertising.
- It proposes a query-oriented online advertising model that relies on hidden topics and ranking techniques, together with a method for exploiting query logs to build the training data set.
- It carries out experiments and evaluates the results of the proposed model. The results show that in some cases the model improves accuracy by up to 0.2.
Because of the author's limited time and knowledge, the thesis still has some limitations: an ad data set and a module for retrieving ads by user query have not yet been built. These limitations need further research in order to build a more complete system that can be applied to search engines in Vietnam.

References
Vietnamese
[1] Ministry of Industry and Trade, Vietnam E-commerce Report 2008, http://www.mot.gov.vn.
[2] Nguyen Thu Trang. Learning to rank in object ranking and document cluster labeling (in Vietnamese). Master's thesis, College of Technology, Vietnam National University, Hanoi, 2008.
[3] Dan Tri, Dan Tri online newspaper, http://dantri.com.
[4] Vietnam Advertising Association (VAA), http://vaa.org.vn.
[5] Zing Directory information library, http://directory.zing.vn/directory, 2008.
[6] Vietnam Encyclopedic Dictionary, http://dictionary.bachkhoatoanthu.gov.vn/
[7] VnExpress. Vietnamese online newspaper, http://vnexpress.net/.
English
[8] Advertising Educational Foundation. Advertising & Society Review, Volume 6,
Issue 1. E-ISSN 1154-7311, 2005.
[9] Kevin Amos, director-product development at search-engine marketing firm
Impaqt Oser, 2004.
[10] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine
Learning Research, 3:993-1022, January 2003.
[11] A. Z. Broder, P. Ciccolo, M. Fontoura, E. Gabrilovich, V. Josifovski, L. Riedel. Search advertising using web relevance feedback. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1013-1022, 2008.
[12] Yunbo Cao, Jun Xu, Tie-yan Liu, Hang Li, Yalou Huang, Hsiao-wuen Hon.
Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2006.
[13] Chakrabarti, S. Learning to rank in vector spaces and social networks. Tutorial - 16th international conference on World Wide Web (2007).
[14] R. Herbrich, T. Graepel, and K. Obermayer. Large Margin Rank Boundaries for
Ordinal Regression. Advances in Large Margin Classifiers, pages 115-132, 2000.
[15] Phan Xuan Hieu, Susumu Horiguchi, Nguyen Le Minh (2008). Learning to
Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data
Collections, In Proc. of The 17th International World Wide Web Conference,
http://www2008.org, 2008.
[16] Phan Xuan Hieu, GibbsLDA++: A C/C++ and Gibbs Sampling based
Implementation of Latent Dirichlet Allocation (LDA),
http://gibbslda.sourceforge.net/, 2007.
[17] T. Hofmann. Probabilistic LSA. Proc. UAI, 1999.
[18] Ms. Duong Thu Huong, Public Relations & Operations Manager at IDG Ventures
Vietnam based in Ho Chi Minh City, VietnamNet e-newspaper,
http://VietnamNet.vn.
[19] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant
documents. Proceedings of the 23rd annual international ACM SIGIR conference
on Research and development in information retrieval, pages 41-48, 2000.
[20] Kalervo Järvelin & Jaana Kekäläinen, University of Tampere, Department of Information Studies, Finland. IR evaluation methods for retrieving highly relevant documents. 2000.
[21] Joachims, T., Li, H., Liu, T.-Y., and Zhai, C. Learning to rank for information
retrieval (lr4ir 2007). SIGIR Forum 41, 2 (2007), 58- 62.
[22] A. Lacerda, M. Cristo, M. A. Gonçalves, W. Fan, N. Ziviani, and B. Ribeiro-Neto. Learning to Advertise. In SIGIR '06: Proc. of the 29th annual intl. ACM SIGIR conf., pages 549-556, New York, NY, 2006.
[23] Liu, T.-Y. Learning to rank in information retrieval. In WWW '08: Tutorial - 17th international conference on World Wide Web (2008).
[24] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, and E. S. de Moura. Impedance Coupling in Content-targeted Advertising. In SIGIR '05: Proc. of the 28th annual intl. ACM SIGIR conf., pages 496-503, New York, NY, 2005.
[25] M.Richardson, E. Dominowska, R. Ragno. Predicting Clicks: Estimating the
Click-Through Rate for New Ads. January 2007 In Proceedings of the 16th
International World Wide Web Conference Pages: 521 - 530.
[26] G. Salton, A. Wong, C.S. Yang. A Vector Space Model for Automatic Indexing, Communications of the ACM, Volume 18, Number 11, 1975.
[27] Le Dieu Thu, On the analysis of large-scale datasets towards online contextual advertising, thesis, College of Technology, Viet Nam National University, Ha Noi, Viet Nam, 2008.
[28] Nguyen Cam Tu, (2008). Hidden Topic Discovery Toward Classification And Clustering In Vietnamese Web Documents. MSc. thesis, College of Technology, Viet Nam National University, Ha Noi, Viet Nam, 2008.
[29] Jun Xu, Yunbo Cao, Hang Li, Yalou Huang. Cost-sensitive learning of SVM for
ranking. In ECML , 2006.
[30] W. Yih, J. Goodman, and V. R. Carvalho. Finding advertising keywords on web pages. In WWW '06: Proc. of the 15th intl. conf. on World Wide Web, pages 213-222, New York, NY, 2006.
[31] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, J. Ma. Learning to Cluster Web Search Results. In Proceedings of the ACM SIGIR Conference, 2004.
[32] CIA Advertising, www.ciaadvertising.org.
[33] Interactive Advertising Bureau (IAB) and Price Water House Coopers (PWC),
Internet Advertising Revenue Report, http://www.iab.net.
[34] Internet Archive, http://www.archive.org.
[35] Joachims SVM-Rank toolkit http://svmlight.joachims.org/.
[36] Microsoft Network (MSN), http://www.msn.com/.
[37] Nutch: an open-source search engine, http://lucene.apache.org/nutch/.
[38] Online Advertising, news and quality online advertising information,
http://www.onlineadvertising.net/.
[39] Wikipedia, The Free Encyclopedia http://www.wikipedia.org.
