subject to
$$\xi_i \ge 0, \quad z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \ge 1 - \xi_i, \quad i = 1, \dots, l \tag{6}$$
Optimizing (6) is equivalent to optimizing (7) when $\lambda = 1/(2C)$:
$$\min_{w} \sum_{i=1}^{l} \left[ 1 - z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2 \tag{7}$$
Let w* be the weight vector of the SVM model. Geometrically, w* is perpendicular to the hyperplane of Ranking SVM. We use w* to build the ranking function $f_{w^*}$ for ranking documents:
$$f_{w^*}(x) = \langle w^*, x \rangle \tag{8}$$
When SVM is applied, each feature vector is created from a pair of documents. Each feature is defined as a function of the query and the document. For example, the term-frequency feature is the number of times the keywords of the query occur in the document. All results from all queries are used in training. No distinction is made between documents from different queries. Moreover, no distinction is made between document pairs of different ranks, whereas in practice a ranking error between a high-ranked and a low-ranked document matters more than a ranking error between two low-ranked documents. These two issues can make Ranking SVM inaccurate.
To address these two issues, we can define a new loss function based on the hinge loss [29].
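Loss function
In loss function (9) we add a rank parameter $\tau$ to adjust for the difference between rank pairs, and a parameter $\mu$ to adjust for the difference between queries. We restate the Ranking SVM problem as minimizing the following loss function:
$$\min_{w} L(w) = \sum_{i=1}^{l} \tau_{k(i)} \, \mu_{q(i)} \left[ 1 - z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2 \tag{9}$$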
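Here $k(i)$ is the rank type of document pair i, $\tau_{k(i)}$ is the rank parameter for $k(i)$, $q(i)$ is the query of document pair i, and $\mu_{q(i)}$ is the query parameter for $q(i)$. The penalty incurred by the i-th pair is determined by the product of $\tau_{k(i)}$ and $\mu_{q(i)}$: $\tau_{k(i)} \mu_{q(i)}$.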
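As an illustration, the following is a minimal sketch of how loss (9) could be evaluated for a fixed weight vector. It assumes NumPy; the function name and the toy pairs, labels and weights are hypothetical and are not taken from [29].

```python
import numpy as np

def cost_sensitive_ranking_loss(w, x1, x2, z, tau, mu, lam):
    """Evaluate sum_i tau_i * mu_i * [1 - z_i <w, x1_i - x2_i>]_+ + lam * ||w||^2."""
    margins = z * ((x1 - x2) @ w)           # z_i <w, x_i^(1) - x_i^(2)>
    hinge = np.maximum(0.0, 1.0 - margins)  # [.]_+ keeps only violated margins
    return float(np.sum(tau * mu * hinge) + lam * np.dot(w, w))

# Hypothetical toy data: three document pairs with two features each.
w = np.array([1.0, -0.5])
x1 = np.array([[2.0, 0.0], [0.5, 1.0], [1.0, 1.0]])
x2 = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 3.0]])
z = np.array([1.0, 1.0, -1.0])
tau = np.array([2.0, 1.0, 1.0])  # per-pair rank weights tau_{k(i)}
mu = np.array([1.0, 1.5, 1.5])   # per-pair query weights mu_{q(i)}

loss = cost_sensitive_ranking_loss(w, x1, x2, z, tau, mu, lam=0.1)
plain = cost_sensitive_ranking_loss(w, x1, x2, z, np.ones(3), np.ones(3), lam=0.1)
```

Setting every $\tau$ and $\mu$ to 1, as in `plain`, reduces the computation to the unweighted objective (7).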
Determining the parameter values
We must determine how to compute the values of $\tau$ and $\mu$.
For $\tau$, we use a heuristic method that estimates the parameters from the base model. Suppose NDCG is used for evaluation (other measures can also be used). The algorithm is described as follows:
Figure 8. Parameter estimation algorithm [29]
For $\mu$, we compute:
$$\mu_{q(i)} = \frac{\max_{j} \#\{\text{document pairs associated with } q(j)\}}{\#\{\text{document pairs associated with } q(i)\}} \tag{10}$$
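Formula (10) can be sketched in a few lines of Python. The function name and the query identifiers below are hypothetical; the input is assumed to be one query id per document pair.

```python
from collections import Counter

def query_weights(pair_queries):
    """mu_q = (largest number of pairs for any query) / (number of pairs for q)."""
    counts = Counter(pair_queries)
    largest = max(counts.values())
    return {query: largest / n for query, n in counts.items()}

# Hypothetical query ids for six document pairs.
mu = query_weights(["q1", "q1", "q1", "q1", "q2", "q2"])
```

With this weighting, a query that contributes few pairs receives a larger weight, so queries with many pairs do not dominate training.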
3.1.3 Ranking evaluation methods
The measures commonly used in machine learning, such as precision, recall and the F-measure, are not used directly to evaluate the quality of a ranking. A ranking requires that correct objects (those matching the criterion) be placed as close to the top of the ranked list as possible.
The following measures are used to evaluate ranking effectiveness:
3.1.3.1 MAP
Precision at K (P@K): Precision@K is the precision of the first K objects in the ranked list. Count the correct objects among the first K positions of the ranking and call this number Match@K [19]. We have:
$$P@K = \frac{Match@K}{K}$$
Average precision (AP): the mean of the P@K values at the levels K where a correct object occurs. Let I(K) indicate whether the object at rank K is correct: I(K) = 1 if it is and I(K) = 0 otherwise. The average precision is:
$$AP = \frac{\sum_{K=1}^{n} P@K \times I(K)}{\sum_{j=1}^{n} I(j)}$$
Mean over all queries (Mean Average Precision):
$$MAP = \frac{\sum_{i=1}^{m} AP_i}{m}$$
where m is the total number of queries.
Example:
Suppose there are 5 objects: a, b, c, d, e.
Here a, b, c are relevant objects and d, e are non-relevant objects.
A ranking of the objects to be evaluated is: c, a, d, b, e. Then:
P@1 = 1; P@2 = 1; P@3 = 2/3; P@4 = 3/4; P@5 = 3/5.
AP(1) = 1; AP(2) = 1; AP(3) = 1; AP(4) = (1 + 1 + 3/4) / 3 ≈ 0.917
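The P@K, AP and MAP definitions above can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and each ranking is assumed to be given as a list of 0/1 relevance flags in rank order.

```python
def precision_at_k(relevant_flags, k):
    """P@K: fraction of correct objects among the first k positions."""
    return sum(relevant_flags[:k]) / k

def average_precision(relevant_flags):
    """AP: mean of P@K over the ranks K that hold a correct object."""
    hits = sum(relevant_flags)
    if hits == 0:
        return 0.0
    total = sum(
        precision_at_k(relevant_flags, k)
        for k, flag in enumerate(relevant_flags, start=1)
        if flag
    )
    return total / hits

def mean_average_precision(rankings):
    """MAP: mean of AP over all queries."""
    return sum(average_precision(flags) for flags in rankings) / len(rankings)

# Ranking c, a, d, b, e where a, b, c are relevant.
flags = [1, 1, 0, 1, 0]
ap = average_precision(flags)
```

For the ranking in the example, `ap` equals (1 + 1 + 3/4) / 3.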
3.1.3.2 NDCG (Normalized Discounted Cumulative Gain)
DCG (Discounted Cumulative Gain) is a measure of the effectiveness of algorithms in search engines and similar applications, and is widely used in information retrieval. Using a graded relevance scale for the documents in the result set returned by a search engine, DCG measures the usefulness of a document based on its position in the list. The gain is accumulated from the top of the result list to the bottom and is discounted at lower positions [19].
Two assumptions are made when using DCG and related measures:
- It is better if highly relevant documents appear earlier in the search engine's result list (at higher ranks).
- Highly relevant documents are more useful than marginally relevant documents, which in turn are more useful than non-relevant documents.
DCG is derived from a more primitive measure, CG (Cumulative Gain).
Cumulative Gain: CG does not take the position of a result into account; it is the sum of the relevance values of all documents in the result list. CG at position p is computed as follows:
$$CG_p = \sum_{i=1}^{p} rel_i$$
where $rel_i$ is the relevance of the result at position i.
CG is not affected by the order of the results in the list. Moving a highly relevant document to a lower position does not change the value of CG. Given the two assumptions above about the usefulness of search results, DCG is used instead because it is more informative.
Discounted cumulative gain: the premise of DCG is that a highly relevant document appearing at a lower position should be penalized, by reducing its relevance value by the logarithm of its position in the result list. DCG at position p is computed as follows:
$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$
DCG can also be computed with the formula:
$$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2 (1 + i)}$$
Normalized DCG:
$$nDCG_p = \frac{DCG_p}{IDCG_p}$$
where $IDCG_p$ (Ideal Discounted Cumulative Gain) is the DCG value of a perfect result, obtained when all documents are placed at the positions corresponding to their relevance.
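The CG, DCG and nDCG definitions above can be sketched as follows. The sketch uses the first DCG formula ($rel_1$ plus $rel_i / \log_2 i$ for $i \ge 2$); the function names and the relevance list are illustrative assumptions.

```python
import math

def cg(rels):
    """Cumulative gain: plain sum of relevance values."""
    return sum(rels)

def dcg(rels):
    """DCG with rel_1 undiscounted and rel_i / log2(i) for i >= 2."""
    return sum(
        rel if i == 1 else rel / math.log2(i)
        for i, rel in enumerate(rels, start=1)
    )

def ndcg(rels):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Illustrative relevance values listed in ranked order.
rels = [3, 2, 3, 0, 1, 2]
scores = (cg(rels), dcg(rels), ndcg(rels))
```

Because the code does not round the logarithms, its results can differ in the third decimal place from values computed by hand with rounded intermediate terms.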
Example: Suppose there are 6 documents a, b, c, d, e, f with relevance values 3, 3, 2, 2, 1, 0 respectively. A ranking is returned as follows: b, c, a, f, e, d.

i    rel_i    log2 i    rel_i / log2 i
1    3        N/A       N/A
2    2        1         2
3    3        1.59      1.887
4    0        2.0       0
5    1        2.32      0.431
6    2        2.59      0.772

We have: CG_6 = 3 + 2 + 3 + 0 + 1 + 2 = 11
DCG_6 = 3 + (2 + 1.887 + 0 + 0.431 + 0.772) = 8.09
IDCG_6 = 3 + (3 + 2/1.59 + 2/2 + 1/2.32 + 0) = 8.693
nDCG_6 = DCG_6 / IDCG_6 = 8.09 / 8.693 = 0.9306
Besides these two measures, several others are also used, such as Mean Reciprocal Rank (MRR), the number of correct objects at level K (Match@K), and Mean Total Reciprocal Rank of the correct objects (MTRR) [2]. However, NDCG and MAP are two widely used measures that appear in many works such as [11], [19], [29].
3.2 Hidden topics
Representing data effectively in order to exploit the relationships among data items is becoming increasingly sophisticated and complex. Many studies have addressed this problem. Hidden topic models [10] are an important step forward in modeling text data. They are based on the idea that each document has a probability distribution over topics, and each topic is a mixture distribution over words. Representing words and documents as probability distributions has great advantages over the conventional vector space model.
One idea behind hidden topic models is to construct new documents according to probability distributions. First, to create a new document, we choose a distribution over topics for that document, meaning the document is composed of different topics with different proportions. Next, to generate the words of the document, we draw words at random according to the probability distribution of words over topics. Conversely, given a set of documents, we can determine a set of hidden topics for each document and the probability distribution of words over each topic. Two examples of topic analysis with hidden models are Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA).
pLSA is a statistical technique for analyzing co-occurrence data [17]. It was developed from LSA combined with a probabilistic model. However, according to the analysis of Blei et al. (2003) [10], although pLSA is an important step in modeling text data, it is incomplete because it does not provide a well-defined probabilistic model at the document level. This causes problems when assigning a probability to a document outside the training set; in addition, the number of parameters grows linearly with the size of the dataset.
LDA is a more complete model than pLSA and overcomes these shortcomings. This hidden topic model is used to build our system.
3.2.1 Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA is based on the idea that each document is a mixture of several topics. In essence, LDA is a three-level hierarchical Bayesian model (corpus level, document level, word level) in which each part of the model is treated as a finite mixture model over an underlying set of topic probabilities [27].
3.2.2 The generative model in LDA
Consider a corpus of M documents D = {d_1, d_2, ..., d_M}, where each document m in the corpus consists of N_m words w_i drawn from a vocabulary of terms {t_1, ..., t_V}, and V is the number of terms in the vocabulary. LDA provides a complete generative model that has been shown to give better results than earlier methods. The document generation process is as follows:
Figure 9. Graphical model representation of LDA [15]
The boxes in Figure 9 represent repeated processes.
Input parameters: $\alpha$ and $\beta$ (corpus-level parameters)
$\alpha$: Dirichlet prior on $\vec{\theta}_m$
$\beta$: Dirichlet prior on $\vec{\varphi}_k$
$\vec{\theta}_m$: distribution of topics in document m (document-level parameter). $\vec{\theta}_m$ represents the parameter of p(z|d=m), the topic mixture for document m. There is one proportion vector per document, $\Theta = \{\vec{\theta}_m\}_{m=1}^{M}$ (an M × K matrix).
$z_{m,n}$: topic index of word n in document m
$w_{m,n}$: word n of document m, indicated by $z_{m,n}$ (word-level variable, observed word)
$\vec{\varphi}_k$: distribution of the words generated from topic $z_{m,n}$. $\vec{\varphi}_k$ represents the parameter of p(t|z=k), the mixture component of topic k. There is one component per topic, $\Phi = \{\vec{\varphi}_k\}_{k=1}^{K}$ (a K × V matrix).
M: number of documents.
$N_m$: number of words in document m (also called the document length)
K: number of hidden topics.
LDA generates the words $w_{m,n}$ of each document $\vec{d}_m$ as follows:
- For each document m, generate the topic distribution $\vec{\theta}_m$ of the document.
- For each word, sample $z_{m,n}$ from this topic distribution.
- For each topic index $z_{m,n}$, generate $w_{m,n}$ from the word distribution $\vec{\varphi}_k$.
$\vec{\varphi}_k$ is sampled once for the whole corpus.
The complete (annotated) generative model is shown in Figure 10.
Figure 10. Complete generative model for LDA [28].
Here Dir, Poiss and Mult denote the Dirichlet, Poisson and Multinomial distributions respectively (sampling from the Dirichlet, Poisson and Multinomial distributions).
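The generative process can be sketched with NumPy as follows. This is only an illustration under stated assumptions: symmetric scalar priors `alpha` and `beta`, document lengths drawn from a Poisson distribution with a hypothetical mean `mean_length`, and a minimum length of one word (an addition made here to avoid empty documents).

```python
import numpy as np

def generate_corpus(M, K, V, alpha, beta, mean_length, seed=0):
    """Sample M documents over a vocabulary of V terms from K topics."""
    rng = np.random.default_rng(seed)
    # phi_k ~ Dir(beta): sampled once for the whole corpus (K x V matrix).
    phi = rng.dirichlet(np.full(V, beta), size=K)
    # theta_m ~ Dir(alpha): one topic mixture per document (M x K matrix).
    theta = rng.dirichlet(np.full(K, alpha), size=M)
    docs = []
    for m in range(M):
        # N_m ~ Poiss(mean_length); at least one word (assumption of this sketch).
        n_m = max(1, rng.poisson(mean_length))
        # z_{m,n} ~ Mult(theta_m): topic index for every word position.
        z = rng.choice(K, size=n_m, p=theta[m])
        # w_{m,n} ~ Mult(phi_{z_{m,n}}): term index for every word position.
        words = np.array([rng.choice(V, p=phi[k]) for k in z])
        docs.append(words)
    return docs, theta, phi

docs, theta, phi = generate_corpus(M=3, K=2, V=5, alpha=0.5, beta=0.5, mean_length=8)
```

Each returned document is an array of term indices; `theta` and `phi` are the sampled $\Theta$ and $\Phi$ matrices.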
3.2.3 Parameter estimation and inference
Given a set of documents, the task is to find the topic model ($\vec{\varphi}_k$, $\vec{\theta}_m$) that generated them. Parameter estimation for LDA with Gibbs sampling consists of the following steps:
Initialization: the first sampling pass. The pseudocode of the initialization is:
    zero all count variables $n_m^{(z)}$, $n_m$, $n_z^{(t)}$, $n_z$
    for all documents m ∈ [1, M] do
        for all words n ∈ [1, N_m] in document m do
            sample topic index $z_{m,n}$ ~ Mult(1/K)
            increment document-topic count: $n_m^{(z_{m,n})} + 1$
            increment document-topic sum: $n_m + 1$
            increment topic-term count: $n_{z_{m,n}}^{(t)} + 1$
            increment topic-term sum: $n_{z_{m,n}} + 1$
        end for
    end for
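where:
$n_m^{(z)}$: number of words in document m assigned to topic z
$n_m$: total number of topic assignments in document m
$n_z^{(t)}$: number of times term t is assigned to topic z
$n_z$: total number of terms assigned to topic z
Each time a word is sampled, the counters for the corresponding term and topic are incremented.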
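Burn-in period: the sampling is repeated until a certain accuracy is reached. The pseudocode of this process is:
    while not finished do
        for all documents m ∈ [1, M] do
            for all words n ∈ [1, N_m] in document m do
                - for the current assignment of z to a term t for word $w_{m,n}$:
                  decrement counts and sums: $n_m^{(z)} - 1$; $n_m - 1$; $n_z^{(t)} - 1$; $n_z - 1$
                - multinomial sampling (using the decremented counts from the previous step):
                  sample topic index $\tilde{z} \sim p(z_i \mid \vec{z}_{\neg i}, \vec{w})$
                - use the new assignment $\tilde{z}$ for word $w_{m,n}$:
                  increment counts and sums: $n_m^{(\tilde{z})} + 1$; $n_m + 1$; $n_{\tilde{z}}^{(t)} + 1$; $n_{\tilde{z}} + 1$
            end for
        end for
        if converged then
            read out the parameter sets $\Phi$ and $\Theta$
        end if
    end while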
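The two hidden distributions $\vec{\varphi}_k$ and $\vec{\theta}_m$ are computed as follows:
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{v=1}^{V} \left( n_k^{(v)} + \beta_v \right)}$$
$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} \left( n_m^{(z)} + \alpha_z \right)}$$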
Given the estimated LDA model, topics for new documents can be inferred with a similar sampling procedure.
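The initialization, the sampling sweeps and the read-out of $\Phi$ and $\Theta$ can be sketched together as follows. This is a minimal illustration, not the GibbsLDA++ implementation: it assumes symmetric scalar priors `alpha` and `beta`, runs a fixed number of sweeps instead of testing convergence, and samples from the standard collapsed Gibbs full conditional $p(z_i = k \mid \cdot) \propto \frac{n_k^{(t)} + \beta}{n_k + V\beta}\,(n_m^{(k)} + \alpha)$, which is not written out in the text above.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of term indices."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_mk = np.zeros((M, K))  # document-topic counts n_m^(k)
    n_kt = np.zeros((K, V))  # topic-term counts n_k^(t)
    n_k = np.zeros(K)        # topic-term sums n_k
    z = []
    # Initialization: uniform random topic per word, counters incremented.
    for m, doc in enumerate(docs):
        z_m = rng.integers(K, size=len(doc))
        z.append(z_m)
        for t, k in zip(doc, z_m):
            n_mk[m, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1
    # Sampling sweeps (a fixed count stands in for a convergence check).
    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]
                # Remove the current assignment from all counters.
                n_mk[m, k] -= 1
                n_kt[k, t] -= 1
                n_k[k] -= 1
                # Full conditional over topics, up to a constant factor.
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment and restore the counters.
                z[m][n] = k
                n_mk[m, k] += 1
                n_kt[k, t] += 1
                n_k[k] += 1
    # Read out phi (K x V) and theta (M x K) from the final counts.
    phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# Hypothetical toy corpus: two documents over a vocabulary of four terms.
phi, theta = gibbs_lda([[0, 1, 0, 1], [2, 3, 2, 3]], K=2, V=4)
```

The document-topic sum $n_m$ is omitted because it is constant across topics in the conditional and cancels after normalization.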
3.3 A query-oriented online advertising model supported by topic analysis and ranking techniques
As presented in the previous chapters, an important problem in search engine advertising is ranking ads by their relevance to the user's query. The methods presented in Chapter II show that choosing the features used to represent an ad is crucial. In some cases an ad and a keyword are highly relevant to each other, yet the vocabularies used in the ad and in the query differ. Therefore, besides keyword features, features at a higher level of abstraction are needed. The work of Andrei et al. [11] shows that extended features such as query classification and Prisma phrases give promising results. In particular, the work of Le Dieu Thu [27] shows that using hidden topics in contextual advertising to expand the vocabulary of ads and web pages gives very promising results.
In this section, we present an online advertising model for search engines that uses hidden topic analysis and ranking. Unlike the model built by Le Dieu Thu [27], our model is built to rank ads on a search engine according to the user's query. The hidden topic technique is used to build new features for representing ads. In addition, the model exploits a large volume of query logs to build the training dataset.
3.3.1 Problem description
The problem is described as follows: given a user query and a set of available ads, return the K ads most relevant to the query.
Input:
- Query q
- Set of ads A = {a_1, a_2, ..., a_n}
Output:
- K ads R = {a_r1, a_r2, ..., a_rK}
To solve the problem, we build a ranking function F as follows:
F: {Q} × {A} → [0, 1]
where F(q, a) returns the relevance of ad a to query q; the higher the relevance, the higher the ad is ranked.
Zeng [31] and Xu [29] showed that the SVM ranking algorithm gives good results in ranking as well as clustering search results when the query, title and snippet (summary content) are used in training. In this model, SVM-Rank is used to build the ranking function F above.
3.3.2 Overview of the model
Based on the studies mentioned above, we propose a search engine advertising system that uses hidden topic analysis and ranking techniques. An overview of the system is given below.
[Figure 11: system diagram. Components: Training data (1), Model estimation (2), Estimated model, New training data (3), Learn-to-rank model (4), Keyword matching (5), Topic inference (6), Ranking function (7). Flow: Ads → Relevant ads → Relevant ads ranking.]
Figure 11. Overview of the advertising system using hidden topics
The model consists of the following main steps:
1) Build the training dataset. The training dataset is built by analyzing query logs and collecting the titles and descriptions of web pages, which are treated as ads (documents).
2) Build the hidden topic model, determining the topics and the probability distribution of the topics over each document.
3) Build the training dataset with the new features, which include keyword frequency and the probability that each document belongs to a topic.
4) Build the ranking function from the resulting training dataset. The ranking function is built with the SVM-Rank algorithm.
5) Retrieve the ads that match the query.
6) Determine the hidden topics of each ad and represent the ad with the new features.
7) Rank the ads with the ranking function learned from the training dataset.
3.3.3 Determining the features of the model
In this model, we treat each ad (its content and title) as a document. We also treat the snippet (title and description) of a web page as a document. Suppose our document set is D = {d_1, d_2, ..., d_m}. We use the following features when building the ranking function with the SVM-Rank algorithm:
Term Frequency / Inverse Document Frequency:
Term Frequency (TF):
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of occurrences of keyword $t_i$ in document j.
Inverse Document Frequency (IDF):
$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
where |D| is the number of documents in the set D and $|\{d : t_i \in d\}|$ is the number of documents in which keyword $t_i$ appears.
We have:
$$(tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i$$
Hidden Topic:
Suppose we have determined K topics from the training dataset. For each document d, we compute the probability pd(i) that document d belongs to topic i, for i = 1, ..., K.
From these we obtain the topic vector of document d:
T(d) = [pd_1, pd_2, ..., pd_K]
From the two features above, we build the representative vector V(d) of the document:
V(d) = [tfidf(t_1, d), tfidf(t_2, d), ..., tfidf(t_m, d), pd_1, pd_2, ..., pd_K]
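The construction of V(d) can be sketched as follows: TF-IDF weights computed with the formulas above, followed by the topic probabilities. The function names, the toy documents and the topic probabilities are hypothetical; in the system the probabilities would come from the estimated topic model.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one tf-idf vector per tokenized document."""
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[term] / total) * math.log(len(docs) / df[term]) if total else 0.0
            for term in vocab
        ])
    return vocab, vectors

def document_vector(tfidf_vec, topic_probs):
    """V(d): tf-idf weights followed by the topic probabilities pd_1..pd_K."""
    return list(tfidf_vec) + list(topic_probs)

# Hypothetical tokenized documents and topic probabilities for the first one.
docs = [["cheap", "flights"], ["cheap", "hotels"]]
vocab, vectors = tfidf_vectors(docs)
v_d = document_vector(vectors[0], [0.7, 0.3])
```

A term that occurs in every document, such as "cheap" here, gets an IDF of zero and therefore contributes nothing to the keyword part of the vector.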
Chapter 4. Experiments and evaluation
4.1. Data
The model uses query logs to build the dataset for training. Query logs are an important part of a search engine. They record the behavior of users during search as well as the interests of users for each query. A query log does not contain the ads shown to users, but it does contain the queries that were entered and the search results that users clicked. An ad is essentially a document with a title and a description of the web page it points to. Therefore, we can treat the titles and summaries of web pages (usually placed in meta tags) as ad content and use them in training. Using query logs makes it possible to exploit a great deal of useful information from user search behavior.
We use 1 GB of query logs taken from the MSN search engine [36], containing 14 million queries and clicked URLs. All queries are in English. Each query log entry contains the following information:
- QueryID: identifier of the query; log entries with the same identifier belong to the same session.
- Query: content of the query as entered by the user.
- Time: time at which the user clicked the URL.
- URL: the URL clicked by the user.
- Position: position of the clicked URL in the returned result list.
4.2. Experimental environment
4.2.1 Hardware configuration
The experiments were run on a computer with the following hardware configuration:
Table 2. Hardware configuration used in the experiments
Component       Specification
CPU             1 Pentium IV 3.06 GHz
RAM             1.5 GB
OS              Windows XP Service Pack 2
Disk storage    240 GB
4.2.2 Tools used
The following open-source tools were used in the experiments:
Table 3. List of open-source software used
No.   Software      Author            Source
1     SVM-Rank      Joachims          http://svmlight.joachims.org/
2     GibbsLDA++    Phan Xuan Hieu    http://gibbslda.sourceforge.net
In addition to the tools above, we built the following processing modules in Python:
- Module filter: filters the 14 million query log entries and takes the first 1 million. It groups all URLs returned for the same query, scores each URL within each session, and aggregates the score of each URL over all sessions. The URLs are sorted in decreasing order of score.
- Module crawl: from the URLs produced by module filter, crawls the web page content, parses it, and extracts the title and description of the page. We treat the description and title of a web page as one document in the training dataset.
- Module normalize: normalizes the content collected by module crawl, for example by removing stop words, meaningless symbols, and duplicate content.
- Module tfidf: vectorizes the collected documents using the keyword frequency feature, TF-IDF.
- Module tfidf_lda: vectorizes the collected documents using the keyword frequency feature, TF-IDF, and the probability of the document under each hidden topic.
- Module test: starting from ads ordered according to user judgments, vectorizes the ads using the keyword frequency feature, then ranks them with the ranking function. The returned ranking is compared with the ranking given by the users, and the NDCG and MAP measures are computed.
- Module test_lda: starting from ads ordered according to user judgments, infers the hidden topics each ad may belong to. Each ad is vectorized using the keyword frequency feature and the probability that the ad belongs to each hidden topic. The ads are ranked with the ranking function. The returned ranking is compared with the ranking given by the users, and the NDCG and MAP measures are computed.
4.3. Experimental procedure
The experimental procedure consists of the following main steps:
- Data processing: preprocess the data, build the training document set for the model, and vectorize the data.
- Building the ranking function: train on the resulting dataset with the SVM-Rank algorithm.
- Building the test set: collect ads from the MSN search engine.
- Evaluating the model: collect user judgments and compare them with the results produced by the model.
4.3.1. Data preprocessing
We take the first one million query log entries and, among them, select all queries with more than 4 user clicks. This yields 30,372 queries. A query may be entered by many users at different times. We score each URL for a query as follows:
- Within a session, list the URLs clicked by the user.
- Assign each URL a score that decreases from 100 in the order of the user's clicks. For example, for the keyword "yahoo", 4 URLs are returned and clicked in the order http://yahoo.com, http://my.yahoo.com, http://mail.yahoo.com, http://search.yahoo.com. The scores of the 4 URLs in that session are then 100, 90, 80, 70.
- Sum the scores of each URL for a query over the different sessions.
- For each query, sort the clicked URLs in decreasing order of score. If two URLs have the same score, we consider the position of the URL among the returned URLs. This result is used in the next processing step.
This scoring scheme has the following properties:
- URLs that are clicked more often get higher scores than URLs that are clicked less often.
- Within a session, URLs clicked earlier get higher scores than URLs clicked later.
With this scoring scheme, we capture the interest of users with respect to a query.
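The scoring scheme can be sketched as follows. This is an illustration under stated assumptions, not the actual filter module: the step of 10 points is taken from the example (100, 90, 80, 70), scores are floored at 0, ties are broken in favor of the smaller result position, and the function name and the second session are hypothetical.

```python
from collections import defaultdict

def score_urls(sessions, start=100, step=10):
    """Rank URLs for one query from per-session click lists.

    sessions: list of sessions; each session is a list of (url, position)
    tuples in click order.
    """
    scores = defaultdict(int)
    best_position = {}
    for clicks in sessions:
        for order, (url, position) in enumerate(clicks):
            # Earlier clicks in a session earn more points (floored at 0).
            scores[url] += max(start - step * order, 0)
            best_position[url] = min(position, best_position.get(url, position))
    # Sort by total score (descending), then by result position (ascending).
    ranking = sorted(scores, key=lambda u: (-scores[u], best_position[u]))
    return ranking, dict(scores)

# First session follows the "yahoo" example; the second one is hypothetical.
sessions = [
    [("http://yahoo.com", 1), ("http://my.yahoo.com", 3),
     ("http://mail.yahoo.com", 2), ("http://search.yahoo.com", 4)],
    [("http://mail.yahoo.com", 2), ("http://yahoo.com", 1)],
]
ranking, scores = score_urls(sessions)
```

Summing over sessions means that a URL clicked in many sessions can outrank a URL that was clicked first in only one session.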
4.3.2. Collecting information from the URLs
From the list of URLs sorted by score, we fetch the title and description of the web page behind each URL. At this step we may encounter dead pages or broken URLs, which must be discarded. Combining the title and the description of a page gives the data used for training. We discard the URLs whose collected content is empty and then keep only the queries with at least 4 result contents. At the end of this step we have a list of 16,534 queries and 83,312 contents (summaries) of the corresponding web pages.
Using the title and description of a web page is not necessarily the optimal way to build the training dataset, but it can be better than using the full content of the page, which can introduce a lot of noise into training.
4.3.3. Data vectorization
The data are vectorized while extracting the following features:
a) TF-IDF
After removing stop words, symbols and meaningless characters, we obtain the list of keywords in the dataset. Each keyword is treated as one feature of the data.
Computing the weight of each data item on each feature by TF-IDF gives the tf-idf weight vector:
D(d) = (tfidf(d, 1), tfidf(d, 2), ..., tfidf(d, n))
where n is the number of distinct keywords.
b) Hidden topics
From the dataset, using the GibbsLDA++ tool [16], we obtain the list of hidden topics and the probability that a data item belongs to each topic. The number of topics is set to 100. We obtain the hidden topic feature vector of each data item:
H(d) = (pd_1, pd_2, ..., pd_50)
Combining the two vectors H(d) and D(d), we obtain the representative vector V(d) of the data item.
4.3.4. Experimental design
To evaluate the effect of hidden topics on the ranking results, we implement 2 ranking systems:
- The first system uses SVM-Rank with only the keyword frequency features of the document (TF-IDF). This system is called RTF.
- The second system uses SVM-Rank with the keyword frequency features and the probabilities that the document belongs to the hidden topics. This system is called RHT.
We select a number of queries and search manually on several search engines such as MSN, Yahoo and Google. A total of 40 queries are used, covering different domains such as computer, sport and medicine. From the result pages, we take 5 ads for each query. The model is evaluated in two steps:
- From the collected ads, remove stop words and meaningless characters and symbols. Determine the hidden topics of each ad and compute the probability distribution of the topics over the ad. Build the ad vector from these probabilities and the keyword frequencies in the ad. Use SVM-Rank with the model obtained in training to rank the results.
- Collect user judgments on the result list for each query. We ask 5 users, giving each a request such as: "for the query above, click the following links in order of relevance". The judgment of each user is used to compute the measures for the model, and the final result is the average of these measures.
4.4. Experimental results
We first compare the averages of the measures over all queries. The results show that the RHT system, which uses hidden topics, gives higher average results than RTF. On MAP and NDCG@5, the results of RHT are 0.75 and 0.84 respectively (Figure 12).
[Figure 12: chart comparing RTF and RHT on MAP, NDCG@1, NDCG@3 and NDCG@5.]
Figure 12. Average of the measures over all queries
We then compare the averages of NDCG@5 and MAP at different numbers of queries.
[Figure 13: chart of average NDCG@5 for RTF and RHT at 10, 20, 30 and 40 queries.]
Figure 13. Average NDCG@5 at different numbers of queries
Figure 13 shows that the average NDCG@5 of the RHT system is higher than that of the RTF system. The maximum value is 0.85 at 10 queries and the minimum value is 0.84 at 40 queries.
[Figure 14: chart of average MAP for RTF and RHT at 10, 20, 30 and 40 queries.]
Figure 14. Average MAP at different numbers of queries
Figure 14 shows that the average MAP of RHT is higher than that of the RTF system. The maximum value is 0.79 at 10 queries and the minimum is 0.75 at 40 queries.
The table below gives the values of the measures for several queries on the RHT system.
Table 4. Values of the measures for several queries.
Query                        MAP     NDCG@1   NDCG@3   NDCG@5
paint colors for bedrooms    0.91    0.93     0.82     0.91
tennis equipment             0.77    0.79     0.68     0.85
baseball bats                0.86    1.0      0.77     0.88
shirt deign                  0.75    0.87     0.68     0.87
4.5. Evaluation of the experimental results
The experiments show that the ad ranking model gives fairly good results. The average NDCG@5 is about 0.82-0.84 and the average MAP is about 0.73-0.75.
Several factors may affect these results:
- Using user judgments to evaluate the results: each user may have a different search goal and different interests for a given query. This leads to large differences between the judgments of different users.
- Using the title and description of web pages as training data: the title and description of a web page usually give an overview of the page. However, for pages that are poorly built or do not follow standards, the title and description may be missing or unrelated to the page content.
The experiments also compare ad ranking with and without hidden topics. Using hidden topics gives promising results: the average NDCG@5 increases by 0.2 and MAP increases by 0.2 compared with not using hidden topics.
From these results, we see that using a hidden topic model to build new features for representing ads is effective for ranking ads according to the user's query. In addition, exploiting query logs to build the training dataset lets the model capture the interests of users for each search query.
Conclusion
With the rapid growth of the internet and search engines, solving the problems that arise in online advertising is becoming increasingly urgent. Ranking ads on a search engine according to the user's query is a problem that currently receives much attention. The main goal of this thesis is to propose a method for this problem based on hidden topic models.
The thesis achieved the following results:
- An overview of online advertising and of the state of online advertising worldwide and in Vietnam.
- An analysis of several methods and models used in online advertising.
- A query-oriented online advertising model supported by hidden topics and ranking techniques, together with a method for exploiting query logs to build the training dataset.
- Experiments and an evaluation of the proposed model. The results show that in some cases the model improves accuracy by up to 0.2.
Because of limits on time and on the author's knowledge, the thesis has some limitations: an ad dataset and a module for retrieving ads by user query have not yet been built. These limitations need further work in order to build a more complete system that can be applied to Vietnamese search engines.
References
Vietnamese
[1] Ministry of Industry and Trade. Vietnam E-Commerce Report 2008, http://www.mot.gov.vn.
[2] Nguyen Thu Trang. Learning to rank in object ranking and document cluster labeling. Master's thesis, College of Technology, Vietnam National University, Hanoi, 2008.
[3] Dan Tri. Dan Tri online newspaper, http://dantri.com.
[4] Vietnam Advertising Association (VAA), http://vaa.org.vn.
[5] Zing Directory information library, http://directory.zing.vn/directory, 2008.
[6] Vietnam Encyclopedic Dictionary, http://dictionary.bachkhoatoanthu.gov.vn/
[7] VnExpress. Vietnamese online newspaper, http://vnexpress.net/.
English
[8] Advertising Educational Foundation. Advertising & Society Review, Volume 6,
Issue 1. E-ISSN 1154-7311, 2005.
[9] Kevin Amos, director-product development at search-engine marketing firm
Impaqt Oser, 2004.
[10] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine
Learning Research, 3:993-1022, January 2003.
[11] Andrei Z. Broder; Ciccolo, P.; Fontoura, M.; Gabrilovich, E.; Josifovski, V.;
Riedel, L. Search advertising using web relevance feedback. In Proceeding of the
17th ACM conference on Information and knowledge management, 2008. Pages
1013-1022 .
[12] Yunbo Cao, Jun Xu, Tie-yan Liu, Hang Li, Yalou Huang, Hsiao-wuen Hon.
Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2006.
[13] Chakrabarti, S. Learning to rank in vector spaces and social networks. Tutorial -
16th international conference on World Wide Web(2007).
[14] R. Herbrich, T. Graepel, and K. Obermayer. Large Margin Rank Boundaries for
Ordinal Regression. Advances in Large Margin Classifiers, pages 115-132, 2000.
[15] Phan Xuan Hieu, Susumu Horiguchi, Nguyen Le Minh (2008). Learning to
Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data
Collections, In Proc. of The 17th International World Wide Web Conference,
http://www2008.org, 2008.
[16] Phan Xuan Hieu, GibbsLDA++: A C/C++ and Gibbs Sampling based
Implementation of Latent Dirichlet Allocation (LDA),
http://gibbslda.sourceforge.net/, 2007.
[17] T. Hofmann. Probabilistic LSA. Proc. UAI, 1999.
[18] Ms. Duong Thu Huong, Public Relations & Operations Manager at IDG Ventures
Vietnam based in Ho Chi Minh City, VietnamNet e-newspaper,
http://VietnamNet.vn.
[19] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant
documents. Proceedings of the 23rd annual international ACM SIGIR conference
on Research and development in information retrieval, pages 41-48, 2000.
[20] Kalervo Järvelin and Jaana Kekäläinen, University of Tampere, Department of Information Studies, Finland. IR evaluation methods for retrieving highly relevant documents. 2000.
[21] Joachims, T., Li, H., Liu, T.-Y., and Zhai, C. Learning to rank for information
retrieval (lr4ir 2007). SIGIR Forum 41, 2 (2007), 58- 62.
[22] A. Lacerda, M. Cristo, M. André G., W. Fan, N. Ziviani, and B. Ribeiro-Neto. Learning to Advertise. In SIGIR '06, ACM: Proc. of the 29th annual intl. ACM SIGIR conf., pages 549-556, New York, NY, 2006.
[23] Liu, T.-Y. Learning to rank in information retrieval. In WWW '08: Tutorial -
17th international conference on World Wide Web (2008).
[24] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, and E. S. de Moura. Impedance Coupling in Content-targeted Advertising. In SIGIR '05, ACM: Proc. of the 28th annual intl. ACM SIGIR conf., pages 496-503, New York, NY, 2005.
[25] M.Richardson, E. Dominowska, R. Ragno. Predicting Clicks: Estimating the
Click-Through Rate for New Ads. January 2007 In Proceedings of the 16th
International World Wide Web Conference Pages: 521 - 530.
[26] G. Salton, A. Wong, C. S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, Volume 18, Number 11, 1975.
[27] Le Dieu Thu. On the analysis of large-scale datasets towards online contextual advertising. Thesis, College of Technology, Viet Nam National University, Ha Noi, Viet Nam, 2008.
[28] Nguyen Cam Tu. Hidden Topic Discovery Toward Classification And Clustering In Vietnamese Web Documents. MSc. thesis, College of Technology, Viet Nam National University, Ha Noi, Viet Nam, 2008.
[29] Jun Xu, Yunbo Cao, Hang Li, Yalou Huang. Cost-sensitive learning of SVM for
ranking. In ECML , 2006.
[30] W. Yih, J. Goodman, and V. R. Carvalho. Finding advertising keywords on web pages. In WWW '06, ACM: Proc. of the 15th intl. conf. on World Wide Web, pages 213-222, New York, NY, 2006.
[31] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, J. Ma. Learning to Cluster Web Search Results. In Proceedings of the ACM SIGIR Conference, 2004.
[32] CIA Advertising, www.ciaadvertising.org.
[33] Interactive Advertising Bureau (IAB) and Price Water House Coopers (PWC),
Internet Advertising Revenue Report, http://www.iab.net.
[34] Internet Archive, http://www.archive.org.
[35] Joachims SVM-Rank toolkit http://svmlight.joachims.org/.
[36] Microsoft Social Network MSN, http://www.msn.com/.
[37] Nutch: an open-source search engine, http://lucene.apache.org/nutch/.
[38] Online Advertising, news and quality online advertising information,
http://www.onlineadvertising.net/.
[39] Wikipedia, The Free Encyclopedia http://www.wikipedia.org.