
VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY


Uông Huy Long


A SOLUTION FOR EXTENDING USER WEB-BROWSING SESSION CONTEXT
INFORMATION TO IMPROVE RECOMMENDATION QUALITY
IN A NEWS RECOMMENDER SYSTEM


UNDERGRADUATE GRADUATION THESIS (FULL-TIME PROGRAM)

Major: Information Technology


HANOI - 2010




Acknowledgements

First of all, I would like to express my thanks and deepest gratitude to Associate Professor Dr. Hà Quang Thụy and M.Sc. Trần Mai Vũ, who devotedly advised and guided me throughout the work on this graduation thesis.
I sincerely thank the lecturers who created favorable conditions for me to study and do research at the University of Engineering and Technology.
I would also like to thank the senior and fellow students of the "Data Mining" group, who helped me a great deal with the specialist knowledge needed to complete this thesis.
Finally, I wish to express my boundless thanks to my family and friends, the loved ones who were always by my side and encouraged me throughout the work on this thesis.
Thank you very much!

Student
Uông Huy Long

Supervisor: M.Sc. Trần Mai Vũ












Abstract
With the growth of the Internet, people today not only have more opportunities to reach news sources but can also read news more flexibly. Online newspapers in Vietnam publish tens to hundreds of new articles every day across many fields, ready to serve readers anytime and anywhere. Alongside these benefits, however, several problems remain to be solved, such as the growing volume of news, the diversity of content across sources, and the fit to each individual reader. In this setting, the help of a news recommender system is needed: by exploring the space of choices, it predicts which news items are potentially useful to each individual user.
Building a user preference profile is one of the most basic components of a recommender system. However, the models in current use (such as those surveyed by Gauch et al. [14]) still have many unresolved problems, for example semantic ambiguity in keyword-based profiles, or the need for information inferred from WordNet to determine meaning in profiles based on semantic networks. In addition, these solutions lack the ability to integrate contextual factors flexibly.
This thesis presents a model of a news recommender system that uses a new user preference model. Based on data mined from the user's web browsing context, the system treats a user's preferences as a combination of the set of frequently occurring latent topics and the set of entities in the news items the user has shown interest in.






Table of contents
Introduction
Chapter 1. Overview of recommender systems
1.1. The recommendation problem
1.2. Recommendation techniques
1.2.1. Content-based recommendation
1.2.2. Collaborative recommendation
1.2.3. Hybrid recommendation
1.3. Outline of the news recommender system of the thesis
1.3.1. Characteristics of news recommendation
1.3.2. The approach of the thesis
Chapter 2. User preference modeling for content-based recommender systems
2.1. The user preference modeling process
2.2. Collecting information about users
2.2.1. User identification methods
2.2.2. Information collection methods
2.3. Building the user preference model
2.3.1. Weighted-keyword method
2.3.2. Semantic-network method
2.3.3. Concept-hierarchy method
Chapter 3. Model
3.1. Theoretical basis
3.1.1. Topic analysis based on the LDA topic model
3.1.2. Dictionary-based recognition of entities in documents
3.2. User preference analysis
3.2.1. Information in a user's web browsing session
3.2.2. The user preference model
3.3. Applying the user interest model to news recommendation
3.3.1. Recommendation data analysis phase
3.3.2. Online recommendation phase
3.4. Evaluating the recommendation results
Chapter 4. Experiments and evaluation
4.1. Experimental environment
4.2. Data and tools
4.2.1. Data
4.2.2. Tools
4.3. Experiments
4.3.1. Example of news analysis
4.3.2. Example of user preference analysis
4.3.3. News recommendation
4.4. Experimental results and evaluation
Conclusion
References

List of figures
Figure 1. Main components of a recommender system.
Figure 2. The user preference modeling process.
Figure 3. Recommender systems based on explicit feedback.
Figure 4. Keyword-based user interest model.
Figure 5. Semantic-network-based user interest model.
Figure 6. Concept-network-based user interest model.
Figure 7. A document with K latent topics.
Figure 8. Graphical representation of LDA.
Figure 9. Parameter estimation on a text data set.
Figure 10. Topic inference using the VnExpress data set.
Figure 11. User preference model based on latent topics and entities.
Figure 12. Model of the recommendation data analysis phase.
Figure 13. Model of the online recommendation phase.
Figure 14. Representing news by topics and entities.
Figure 15. Analysis results showing the information related to topic 19.

List of tables
Table 1. Ratings on a point scale for some movies watched.
Table 2. Implicit information collection techniques.
Table 3. Example of a user preference profile.
Table 4. Information in a web browsing session.
Table 5. Experimental environment.
Table 6. Tools.
Table 7. Some latent topics.
Table 8. Example of user preference analysis.
Table 9. Evaluation of the preference analysis model.
Table 10. Accuracy of the model based on user evaluations.

Introduction
Since the first papers on collaborative filtering were published in the 1990s, recommender systems have proved their importance in both research and applications. One can easily reach scientific papers related to the keyword "Recommender System" among the more than 8600 results returned by the Google Scholar search engine (http://www.scholar.google.com), with more than 1100 results for 2009 alone, or use well-known recommendation applications such as books on Amazon (http://www.amazon.com) and movies on NetFlix (http://www.netflix.com).
Recommender systems act as an information filter [8] that tries to present content or product information (such as movies, books, websites, news, ...) that is likely to interest the user. Typically, a recommender system compares the user's interests (in this thesis, the terms "user interests" and "user preferences" are used interchangeably) with some reference features in order to estimate ratings for items. These features may come from information about the items (the content-based filtering approach) or from the user's social environment (the collaborative filtering approach).
Although recommender systems have been studied for a long time, and many applications around the world have demonstrated their effectiveness, research in this field in Vietnam is still limited. With the aim of developing such a system, this thesis focuses on building a recommender system for Vietnamese-language news.
Today, online newspapers and reading news online are no longer unfamiliar to most Vietnamese people. Recent statistics on BaoMoi (http://www.baomoi.com/Statistics/Report.aspx) about the number of Internet users viewing online news show society's steadily growing demand for this medium. A remaining problem, however, is that with so many news items updated every day, users seem to drown in a sea of information and still cannot find what suits them; this is exactly the setting in which the fields related to news recommendation develop. Responding to this need, the thesis proposes a solution that recommends content related to the user's current reading context, with the aim of giving correct and fast guidance without the inconvenience of registering or providing personal information.

The main content of the thesis is divided into 4 parts:
- Chapter 1. Recommender systems: presents the concepts, terminology, and techniques related to recommender systems. The advantages and disadvantages of these techniques are presented in more detail in Sections 1.2 and 1.3.
- Chapter 2. User preference modeling for content-based recommender systems: introduces the problem of building user preferences, the information used for the analysis, and some techniques for modeling user preferences.
- Chapter 3. Model: presents the proposal to build user preferences from an analysis of frequent latent topics and entities, and the application of this model to a news recommender system.
- Chapter 4. Experiments and evaluation: presents some initial evaluation results.

Chapter 1. Overview of recommender systems
In everyday life, when faced with too many choices, people usually rely on the opinions or advice of those around them. In the information age, however, millions of pieces of information are posted on the Internet every day, which calls for methods that automatically collect information and give advice to support these traditional methods. A recommender system is one such solution. It makes suggestions based on what the user has done in the past, or on the aggregated opinions of other users. Recommender systems have become an important application and have attracted great interest from researchers as well as businesses.
Some well-known recommender systems today are [26]:
- Movies / TV / music: MovieLens, EachMovie, Morse, Firefly, Flycasting
- News / press: Tapestry, GroupLens, Lotus Notes, Anatagonomy
- Books / documents: Amazon.com, Foxtrot, InfoFinder
- Web: Phoaks, Gab, Fab, IfWeb, Let's Browse
- Restaurants: Adaptive Place Advisor, Polylens, Pocket Restaurant Finder
- Travel: Dietorecs, LifestyleFinder

1.1. The recommendation problem
Formally, the recommendation problem is described by Adomavicius and Tuzhilin [2] as follows:
Let $U = \{u_1, u_2, u_3, \dots, u_M\}$ be the set of all users in the recommender system, and $I = \{i_1, i_2, i_3, \dots, i_N\}$ be the set of all items that can be recommended.
A function $g: U \times I \rightarrow R$, where $R$ is an ordered set, is used to measure the fit of item $i_n$ for user $u_m$.
Thus, for each user $u_m \in U$, the recommender system must choose the item $i_{max,u_m} \in I$, not yet known to user $u_m$, for which the function $g$ attains its maximum value:
$$\forall u_m \in U, \quad i_{max,u_m} = \arg\max_{i_n \in I} g(u_m, i_n)$$
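The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the function name `recommend_best`, the `seen` set (items already known to the user), and the toy score table are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Set


def recommend_best(user: str,
                   items: Iterable[str],
                   seen: Set[str],
                   g: Callable[[str, str], float]) -> Optional[str]:
    """Return the item not yet known to `user` that maximizes g(user, item)."""
    candidates = [i for i in items if i not in seen]
    if not candidates:
        return None  # nothing left to recommend
    return max(candidates, key=lambda i: g(user, i))


# Hypothetical utility values, for illustration only.
scores = {("A", "Up"): 9.0, ("A", "Spartacus"): 2.0, ("A", "Harry Potter 6"): 7.5}


def toy_g(user: str, item: str) -> float:
    return scores.get((user, item), 0.0)


best = recommend_best("A", ["Up", "Spartacus", "Harry Potter 6"],
                      seen={"Up"}, g=toy_g)
```

In practice $g$ is not available as a lookup table; estimating it for unrated items is the core difficulty discussed in the rest of this section.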


In recommender systems, the fit of an item is usually expressed as a rating on a point scale. Depending on the application, these ratings may be given directly by the user or computed by the system.
Each user in the user space $U$ is identified by a profile (user profile). The information stored in this profile may include gender, age, country, and marital status, and may also include the user's preferences and interests. Similarly, each item is described by a set of its features. For example, in a movie recommender system, the features of a movie may be its title, genre, director, lead actors, ...
In general terms, the recommendation process can be described as follows:

Figure 1. Main components of a recommender system.

First, the profile learning component analyzes the user's preferences. Once the system understands what the user is interested in, it runs a recommendation algorithm that compares and combines user profiles with one another, or user profiles with item features, and then selects the set of items the user is likely to enjoy.
The main problem of a recommender system is that the function $g$ is not defined on the whole space $U \times I$ but only on a small region of it. As a result, $g$ must be extrapolated to the rest of this space. Usually the fit is expressed as a score and is defined only on the set of items the user has already rated (usually quite few). For example, Table 1 shows the ratings some users gave to movies they have watched (on a scale of 0-10; a missing entry means that the user has not yet rated the movie). From this information, the recommender system must predict (extrapolate) scores for the movies the user has not rated, and from these produce the most suitable suggestions.

Table 1. Ratings on a point scale for some movies watched.
Columns: Spartacus, Back to the Future 3, Harry Potter 6, Up
A 2 8 9
B 8 7
C 6 5
D 4 7

1.2. Recommendation techniques
There are many ways to predict or estimate the rank or score of items, such as machine learning, approximation theory, and heuristic algorithms. Recommender systems are usually divided into three types according to how they estimate item ratings:
- Content-based: the user is recommended items similar to those they rated highly in the past.
- Collaborative: the user is recommended items that people with the same preferences rated highly.
- Hybrid: a combination of the two methods above.

1.2.1. Content-based recommendation
A content-based recommender system makes recommendations on the assumption that a person is likely to enjoy items that share many features with items they enjoyed before. Accordingly, the fit $g(u, i)$ of item $i$ for user $u$ is estimated from the fits $g(u, i_j)$, where $i_j \in I$ is similar in content to $i$. For example, to suggest a movie to user $u$, the recommender system identifies the preferences of $u$ from the characteristics of the movies $u$ rated highly (such as genre or director); then only movies that match the preferences of $u$ are suggested.
The content-based approach originates in research on information retrieval (IR) and information filtering (IF). For this reason, many current content-based systems focus on recommending objects that contain text data, such as documents, news, and websites. The improvement over the older IR approach comes from the use of a user profile (containing information about preferences and needs). This profile is built from information that the user provides directly (when answering a survey) or indirectly (mined from the user's transactions).
More concretely, let Content(i) be the set of information (or set of features) about item $i$. Because content-based systems are designed mainly for items that are text or that have textual content descriptions (metadata), the representation usually chosen is the Vector Space Model. In this model, the content of an item is represented by keywords: $Content(i) = (w_{i1}, w_{i2}, \dots, w_{ik})$, where $w_{i1}, \dots, w_{ik}$ are the weights (such as TF-IDF) of keywords 1 to $k$ in a keyword space built beforehand. Typical examples of such systems are web page recommender systems such as Fab [5], which represents the content of a web page by its 100 most important words, and Syskill & Webert [23], which uses the 128 words with the highest weights.
Let Profile(u) be the profile of user $u$, containing information about the preferences of $u$. This information is obtained by analyzing the content of the items that $u$ has previously rated. The methods used are usually keyword analysis techniques from IR, so Profile(u) can also be defined as a weight vector: $Profile(u) = (w_{u1}, \dots, w_{uk})$, where $w_{uj}$ denotes the importance of keyword $j$ to user $u$.
In a content-based recommender system, the fit $g(u, i)$ is determined by the formula:
$$g(u, i) = Score(Profile(u), Content(i))$$

Since both Profile(u) and Content(i) are represented as TF-IDF term-weight vectors (the vectors $\vec{w}_u$ and $\vec{w}_i$ respectively), a similarity formula such as the cosine measure can be used:
$$g(u, i) = \cos(\vec{w}_u, \vec{w}_i) = \frac{\vec{w}_u \cdot \vec{w}_i}{\|\vec{w}_u\| \, \|\vec{w}_i\|}$$
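As a concrete illustration of the cosine score between a profile vector and a content vector, the following minimal Python sketch may help. The function name `cosine` and the three-keyword vectors are assumptions made for this example; real TF-IDF vectors would have one component per keyword of the keyword space.

```python
import math
from typing import Sequence


def cosine(w_u: Sequence[float], w_i: Sequence[float]) -> float:
    """Cosine of the angle between two weight vectors of equal length."""
    if len(w_u) != len(w_i):
        raise ValueError("vectors must live in the same keyword space")
    dot = sum(a * b for a, b in zip(w_u, w_i))
    norm_u = math.sqrt(sum(a * a for a in w_u))
    norm_i = math.sqrt(sum(b * b for b in w_i))
    if norm_u == 0.0 or norm_i == 0.0:
        return 0.0  # an all-zero vector has no direction; treat as no match
    return dot / (norm_u * norm_i)


# Hypothetical weights over the keyword space (football, stock, match).
profile = [0.5, 0.0, 0.8]
sports_article = [0.4, 0.0, 0.9]
finance_article = [0.0, 1.0, 0.0]

sports_score = cosine(profile, sports_article)
finance_score = cosine(profile, finance_article)
```

With these made-up weights the sports article scores close to 1 and the finance article scores 0, so only the former would be suggested.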



Besides IR methods, content-based recommender systems also use many machine learning methods, such as Bayesian classification, decision trees, and artificial neural networks. These methods differ from IR methods in that they rely on models learned from background data. For example, given a set of web pages that the user has rated as "like" or "dislike", a Bayesian classifier can be used to classify web pages that have not yet been rated.
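The like/dislike example can be sketched with a small multinomial Naive Bayes classifier. This is a hedged illustration only: the source does not say which Bayesian variant is meant, so the multinomial model with add-one (Laplace) smoothing, the function names, and the toy pages are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_nb(pages: List[Tuple[List[str], str]]) -> Model:
    """Count labels and per-label word frequencies from (tokens, label) pairs."""
    class_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = {}
    vocab: Set[str] = set()
    for tokens, label in pages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify_nb(tokens: List[str], model: Model) -> str:
    """Return the label with the highest posterior log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_logp = "", -math.inf
    for label, n in class_counts.items():
        logp = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][t] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical rated pages, already tokenized.
rated_pages = [
    (["football", "goal", "match"], "like"),
    (["match", "league", "goal"], "like"),
    (["stock", "market", "bond"], "dislike"),
]
model = train_nb(rated_pages)
label_sports = classify_nb(["goal", "league"], model)
label_finance = classify_nb(["stock", "bond"], model)
```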

Some limitations of content-based recommender systems:
According to the survey of recommender systems by Adomavicius and Tuzhilin [2], content-based recommender systems have the following limitations:
- Restricted content analysis: the effectiveness of this kind of system depends on a sufficiently complete description of the content features of the items. The content of an item must therefore either be extractable automatically by a computer or be easy to extract by hand. In many cases this requirement is very hard to meet, for example when recommending multimedia data such as graphics, movies, and audio. Automatically extracting content features from such objects is a hard problem, and manual extraction is not feasible because of its high cost.
- Content over-specialisation: recommendations are produced only from an analysis of the content of items the user already liked. Whereas other users' ratings could be used to recommend new items (even items of a different kind), content-based recommendations can only offer items similar to those the user rated highly before. In many cases an item should not be recommended if it is too similar to items already rated. A typical example is news recommendation, where a recommended news item is valued more highly if it is not a copy of, and does not duplicate the information in, an earlier item.
- New user problem: a user must rate a sufficiently large number of items before the recommender system can really understand their preferences and produce reliable recommendations.

1.2.2. Collaborative recommendation
According to Adomavicius and Tuzhilin [2], unlike the content-based method, a collaborative system predicts the fit $g(u, i)$ of an item $i$ for user $u$ from the fits $g(u_j, i)$ between users $u_j$ and $i$, where $u_j$ is a user with the same preferences as $u$. For example, to suggest a movie to user $u$, the collaborative system first finds other users with the same preferences as $u$, for example users who also like action movies. The movies they rated highly are then used as recommendations for $u$.
Many collaborative systems have been developed, such as Grundy, GroupLens (news), Ringo (music), Amazon.com (books), and Phoaks (web). These systems can be divided into two types: heuristic-based (also called memory-based) and model-based.

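Heuristic-based collaborative systems
Heuristic-based algorithms predict the rating of an item from the entire collection of previously rated items. That is, the fit $g(u_m, i_n)$ of item $i_n$ for user $u_m$ is aggregated from the ratings other users gave to $i_n$ (usually the $N$ users whose preferences are most similar to those of $u_m$).
This collaborative filtering approach therefore combines the ratings of these like-minded users:
$$g(u_m, i_n) = \operatorname*{aggr}_{u_j \in \hat{U}_m} g(u_j, i_n)$$
where $\hat{U}_m$ is the set of users with the same preferences as $u_m$.
Some examples of the aggregation function [2]:
$$g(u_m, i_n) = \frac{1}{N} \sum_{u_j \in \hat{U}_m} g(u_j, i_n)$$
$$g(u_m, i_n) = d \sum_{u_j \in \hat{U}_m} sim(u_m, u_j) \, g(u_j, i_n)$$
$$g(u_m, i_n) = \bar{g}_{u_m} + d \sum_{u_j \in \hat{U}_m} sim(u_m, u_j) \left( g(u_j, i_n) - \bar{g}_{u_j} \right)$$
where $d$ is the normalizing factor:
$$d = \frac{1}{\sum_{u_j \in \hat{U}_m} |sim(u_m, u_j)|}$$
and the mean of the ratings given by user $u_j$ is:
$$\bar{g}_{u_j} = \frac{1}{|I_{u_j}|} \sum_{i \in I_{u_j}} g(u_j, i), \quad I_{u_j} = \{ i \in I \mid g(u_j, i) \text{ is known} \}$$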
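The mean-adjusted weighted sum described in the survey [2] can be sketched as follows. This is an illustration under assumptions, not the thesis's code: the name `predict_rating`, the nested-dictionary rating store, and the fixed similarity table are hypothetical, and the neighbourhood is simplified to every other user who rated the item rather than the $N$ most similar users.

```python
from statistics import mean
from typing import Callable, Dict

Ratings = Dict[str, Dict[str, float]]  # user -> item -> known rating


def predict_rating(target: str, item: str, ratings: Ratings,
                   sim: Callable[[str, str], float]) -> float:
    """Predict the target user's rating of `item` from neighbours' ratings."""
    mean_target = mean(ratings[target].values())
    # Neighbours: every other user who has rated the item.
    neighbours = [v for v in ratings if v != target and item in ratings[v]]
    norm = sum(abs(sim(target, v)) for v in neighbours)
    if norm == 0:
        return mean_target  # no usable neighbour: fall back to the user's mean
    adjustment = sum(
        sim(target, v) * (ratings[v][item] - mean(ratings[v].values()))
        for v in neighbours
    )
    return mean_target + adjustment / norm


# Hypothetical ratings and similarities, for illustration only.
ratings = {
    "A": {"x": 4.0, "y": 2.0},
    "B": {"x": 5.0, "y": 3.0, "z": 5.0},
    "C": {"x": 1.0, "y": 5.0, "z": 2.0},
}
similarity = {("A", "B"): 1.0, ("A", "C"): -0.5}
prediction = predict_rating("A", "z", ratings,
                            lambda u, v: similarity[(u, v)])
```

Subtracting each neighbour's mean compensates for users who rate systematically high or low; a dissimilar user (negative similarity) who rated the item below their own mean pushes the prediction upward.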




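There are many ways to compute the similarity (in preferences) between two users, but in most methods the similarity is computed only over the items that both users have rated. The two most common methods are correlation-based and cosine-based.
The past ratings of two users $u_m$ and $u_j$ are represented as vectors over the set $I_{mj}$ of items that both have rated:
$$\vec{g}_m = \left( g(u_m, i) \right)_{i \in I_{mj}}, \quad \vec{g}_j = \left( g(u_j, i) \right)_{i \in I_{mj}}$$

Cosine-based similarity:
$$sim(u_m, u_j) = \cos(\vec{g}_m, \vec{g}_j) = \frac{\sum_{i \in I_{mj}} g(u_m, i) \, g(u_j, i)}{\sqrt{\sum_{i \in I_{mj}} g(u_m, i)^2} \, \sqrt{\sum_{i \in I_{mj}} g(u_j, i)^2}}$$

Correlation (with $\bar{g}_{u_m}$ and $\bar{g}_{u_j}$ the mean ratings of the two users):
$$sim(u_m, u_j) = \frac{\sum_{i \in I_{mj}} \left( g(u_m, i) - \bar{g}_{u_m} \right) \left( g(u_j, i) - \bar{g}_{u_j} \right)}{\sqrt{\sum_{i \in I_{mj}} \left( g(u_m, i) - \bar{g}_{u_m} \right)^2 \, \sum_{i \in I_{mj}} \left( g(u_j, i) - \bar{g}_{u_j} \right)^2}}$$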
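Both user-to-user similarity measures can be sketched over the co-rated items as follows. The function names and sample ratings are assumptions for illustration; here each user's mean is taken over all of that user's ratings, which is one common convention (another uses only the co-rated items).

```python
import math
from typing import Dict

UserRatings = Dict[str, float]  # item -> rating given by one user


def cosine_sim(r_m: UserRatings, r_j: UserRatings) -> float:
    """Cosine similarity restricted to the items both users rated."""
    common = set(r_m) & set(r_j)
    dot = sum(r_m[i] * r_j[i] for i in common)
    norm = math.sqrt(sum(r_m[i] ** 2 for i in common)
                     * sum(r_j[i] ** 2 for i in common))
    return dot / norm if norm else 0.0


def pearson_sim(r_m: UserRatings, r_j: UserRatings) -> float:
    """Pearson-style correlation restricted to the items both users rated."""
    common = set(r_m) & set(r_j)
    if not common:
        return 0.0  # no overlap: similarity is undefined, report 0
    mean_m = sum(r_m.values()) / len(r_m)
    mean_j = sum(r_j.values()) / len(r_j)
    cov = sum((r_m[i] - mean_m) * (r_j[i] - mean_j) for i in common)
    spread = math.sqrt(sum((r_m[i] - mean_m) ** 2 for i in common)
                       * sum((r_j[i] - mean_j) ** 2 for i in common))
    return cov / spread if spread else 0.0


# Hypothetical users: b rates like a (scaled), c rates in the opposite order.
r_a = {"x": 1.0, "y": 2.0, "z": 3.0}
r_b = {"x": 2.0, "y": 4.0, "z": 6.0}
r_c = {"x": 3.0, "y": 2.0, "z": 1.0}
```

Cosine similarity is always non-negative for non-negative ratings, while the correlation distinguishes users with opposite tastes (here `r_a` and `r_c`) by returning a negative value.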


Model-based collaborative systems
Unlike the heuristic-based method, the model-based method applies statistical and machine learning techniques to background data (the known ratings) to build models. The model is then used to predict the ratings of items that have not been rated.
Breese [10] proposed a probabilistic approach to collaborative filtering, in which the following formula estimates the rating of user $u$ for item $i$ (on a rating scale from 0 to $n$):
$$r_{u,i} = E(r_{u,i}) = \sum_{x=0}^{n} x \cdot \Pr\left( r_{u,i} = x \mid r_{u,i'}, \; i' \in I_u \right)$$
where $I_u$ is the set of items that $u$ has already rated.
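The expected-value step of this probabilistic approach is simple arithmetic once the conditional distribution is available. The sketch below assumes that distribution is already given as a list (how it is estimated is the actual modeling work and is not shown); the name `expected_rating` and the sample probabilities are hypothetical.

```python
from typing import Sequence


def expected_rating(probabilities: Sequence[float]) -> float:
    """Expected rating, where probabilities[x] = Pr(rating = x | known ratings)
    for x = 0..n."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in enumerate(probabilities))


# Hypothetical distribution over the scale 0..3.
estimate = expected_rating([0.1, 0.2, 0.3, 0.4])
```

For this made-up distribution the estimate is 0(0.1) + 1(0.2) + 2(0.3) + 3(0.4) = 2.0.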


Billsus and Pazzani [9] proposed a collaborative filtering method built on machine learning, in which many machine learning techniques (such as artificial neural networks) and feature extraction techniques (such as SVD, an algebraic technique for reducing the dimensionality of a matrix) can be used.
There are also many other approaches, such as statistical models, Bayesian models, linear regression models, and maximum entropy models.
Collaborative recommender systems overcome many of the weaknesses of content-based systems. An important point is that they can handle any kind of data and suggest any kind of item, including new items that are completely different from what the user has seen before.

Some limitations of collaborative filtering recommender systems
The limitations of collaborative filtering recommender systems can be listed as follows:
- Rating sparsity: the number of ratings from users is too small to produce reliable predictions. The success of these systems depends heavily on the ratings received from customers, because collaborative recommendation relies on the overlap between these ratings. It is therefore very hard to produce accurate recommendations when the rating space is sparse. For example, items that have received only a few ratings may have very little chance of being recommended, even when they are rated highly.
- New user problem: the collaborative strategy learns a user's preferences from their own past ratings. For new users who have not yet rated anything, no recommendation can be produced.
- New item problem: as with new users, no recommendation can be made for new items that have not yet received any rating from users.
- Gray sheep problem: for users whose preferences differ from those of the majority, recommendation sometimes yields no useful result.
- Lack of diversity: because the system's knowledge of content rests only on users' choices, recommendations tend to be biased toward items chosen in the past. As a result, even while processing a large amount of data, most of the recommendations produced concentrate on the most popular items. News recommender systems are a typical example of the harm this causes: although newer items may carry more value, items that many people have already read are the ones recommended most often.

1.2.3. Hybrid recommendation
Some recommender systems combine the collaborative and content-based methods in order to avoid the limitations of both. Four ways of combining them can be distinguished:
- Implement the two methods separately and combine their predictions.
- Integrate features of the content-based method into a collaborative system.
- Integrate features of the collaborative method into a content-based system.
- Build a unified model that includes features of both methods.


Combining two separate methods
There are two scenarios for this case:
- Option 1: combine the results of the two methods into a single result, using a linear combination or a voting scheme.
- Option 2: at each point in time, use only the method that gives the better result (according to some measure of recommendation quality).
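The two scenarios can be sketched as follows. The function names, the mixing weight `alpha`, and the quality values are assumptions introduced for this illustration; the source does not prescribe a particular weight or quality measure.

```python
def linear_hybrid(content_score: float, collab_score: float,
                  alpha: float = 0.5) -> float:
    """Option 1: weighted linear combination of the two predictions."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * content_score + (1.0 - alpha) * collab_score


def switching_hybrid(content_score: float, collab_score: float,
                     content_quality: float, collab_quality: float) -> float:
    """Option 2: use the prediction of whichever method currently scores
    better on some recommendation-quality measure."""
    if content_quality >= collab_quality:
        return content_score
    return collab_score


# Hypothetical predictions for one (user, item) pair.
combined = linear_hybrid(8.0, 6.0, alpha=0.25)
switched = switching_hybrid(8.0, 6.0, content_quality=0.6, collab_quality=0.7)
```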

Adding content-based features to a collaborative model
Some hybrid systems (such as Fab [5]) rely mainly on collaborative techniques but still maintain user profiles (in the form used by the content-based model). These profiles are used to compute the similarity between two users, which resolves the case where the two users have rated too few items in common. Another benefit is that suggestions are not limited to items rated highly by like-minded users (indirect) but also include items with high similarity to the user's own preferences (direct).

Adding collaborative features to a content-based model
The most common approach is to apply dimensionality reduction techniques to the set of profiles of the content-based method. For example, Soboroff and Nicholas [29] use latent semantic analysis to create a collaborative view of the set of user profiles (each profile being represented by a keyword vector).

A unified model of the two methods
In recent years there has been a good deal of research on unified models. Basu et al. [5] propose combining features of both methods in a rule-based classifier. Popescul et al. [25] present a unified probabilistic method based on probabilistic latent semantic analysis. Ansari et al. [4] introduce a Bayesian regression model that uses Markov chain Monte Carlo to estimate its parameters.
The accuracy of hybrid recommender systems can be improved with knowledge-based techniques such as case-based reasoning. For example, the Entree system uses knowledge about restaurants and food (such as: "seafood" is not "vegetarian food") to suggest suitable restaurants to the user. The main limitation of this kind of system is that it must acquire enough knowledge, which is also the bottleneck of many other artificial intelligence systems. Nevertheless, knowledge-based recommender systems are being developed in fields whose domain knowledge can be represented in a machine-readable form (such as an ontology). For example, the Quickstep and Foxtrot systems use an ontology of the topics of scientific papers to suggest suitable papers to the user.

1.3. Outline of the news recommender system of the thesis
The recommender model proposed in this thesis is not deployed on its own but is integrated into a news delivery system. By analyzing the characteristics of this kind of recommended object, the thesis proposes an initial idea for the recommendation solution to be deployed.

1.3.1. Characteristics of news recommendation
News recommendation is a field with great potential, because the number of recommendable items, the number of users, and the number of uses are much higher than for other kinds of recommended objects. It comes, however, with challenges that stem from characteristics specific to the news domain as well as from general characteristics of the users of recommendations.
News is a special kind of recommended object. The following characteristics of news help in designing more effective recommendation solutions:
- Non-uniform value: the value of a news item can only be determined by combining several factors: its information content, source, time of publication, publisher, author, recipient, ...
- Easily generated: a large number of news items can arise around a single event or phenomenon.
- Short-lived: a news item quickly loses its value once the issue it covers is no longer topical.
When considering the fit between the recommended object and the user's interests, the characteristics of those interests must also be taken into account:
- Multiple interests: at any one time a user may have many different interests. For example, they may be interested in news about both sports and politics.
- Renewal: a user's interests can be divided into 3 main types: long-term, medium-term, and short-term. Renewal can occur in all three types, but short-term interests change fastest, and they are also the most useful for recommending news, which is generated continuously.

1.3.2. The approach of the thesis
To overcome these challenges, we focus on content-based filtering approaches that use information about short-term interests, expressed through latent topics. The reasons are as follows:
- First: content-based filtering does not run into the problems that are very hard to solve for collaborative filtering in the news domain: (i) the first-rater problem: news items are generated continuously and must be easy to reach, whereas collaborative filtering cannot recommend items that no other user has rated, nor serve users who have not yet rated any item; (ii) the sparse matrix problem: it is hard to find items rated by enough users, because the number of new items is too large and the burden of supplying ratings falls on the users [11].
- Second: representing information at the topic level describes the user's set of interests or preferences more clearly. This method can also overcome the limitation of recommending items that are too similar to items the user already liked (for example, the problem of duplicate news).
- Third: data collected from the most recently accessed news items describe the renewal of interests more accurately.

Accordingly, the proposed system addresses two basic problems of the recommendation process:
- First, based on a survey of methods for modeling user preferences from text data that are commonly applied in content-based filtering, it proposes a user preference model based on an analysis of the latent topics of the user's web browsing session (the news reading context).
- Then, based on this preference model, related news items are recommended by matching their topics and entities against the topics and entities the user has shown interest in.

Chapter 2. User preference modeling for content-based recommender systems
Chapter 1 gave a preliminary presentation of the concepts related to recommender systems. From it we know that the quality of personalized recommendations depends on the system's ability to learn user preferences (that is, to build a user preference profile). The more faithfully the profile reflects the user's interests, the more likely the recommendations are to be good.
Content-based recommendation techniques usually rely on preference profiles built by analyzing text documents.
This chapter looks more deeply at the concepts and techniques involved in modeling user preferences in general, and for content-based recommender systems in particular.

2.1. The user preference modeling process
According to Gauch et al. [14], the process of modeling user preferences for personalized applications (such as personalized recommender systems and adaptive web systems) consists of 2 basic phases, as illustrated below.

Figure 2. The user preference modeling process.

In the first phase, an information collection process gathers data from the user. These data can be divided into two basic types: explicit user information and implicit user information. This information is then aggregated to build the user preference model in the remaining phase, the profile construction phase.

2.2. Collecting information about users
The first step in learning user preferences is to collect information about the individual user. A basic requirement here is that the system must identify each user uniquely; this task is presented in Section 2.2.1. User information can be collected explicitly, through direct input by the user, or implicitly, through a software agent. It can be collected on the user's client machine or on the application server itself. Depending on how the data are collected, different kinds of data about the user can be obtained. Some of the options and their effects are presented in Section 2.2.2. In general, systems that collect information implicitly and on the server side are preferred, because they place less of a burden on the user to supply information and avoid the inconvenience of installing additional software [14].

2.2.1. User identification methods
User identification is an important criterion that lets the system distinguish users and build different profiles for different users. Gauch et al. [14] list 5 basic approaches to user identification: software agents, logins, proxy servers, cookies, and web browsing sessions. Each method has its own advantages and disadvantages and affects which user data can be collected.
The first three methods are more accurate, but they require the user's participation. A software agent is a small program placed on the user's machine that collects information about the user and shares it with the server through some protocol. This solution is the most reliable, because it gives more control over the deployment of the application and the protocols. It can also collect the most information, because it has access to more sources of user information. However, it requires the user to install software, which is an unwelcome obstacle. The second most reliable solution is based on logins. Because users identify themselves by logging in, this identification is usually accurate and can recognize a user across different client machines. A drawback is that the user must go through a registration process and then log in and log out on every use. In the third solution, a proxy server collects the user information. This method is useful for collecting information about a group of users or about one user who uses several computers. Like the two solutions above, it requires the user's participation, in this case registering the same proxy address on every machine they use.
The last two methods, cookies and web browsing sessions, require no participation from the user. The first time a client browser accesses the system, a userid is created and stored in a cookie on the user's machine. A user accessing the same website is recognized as the same user if the same userid is used. However, if the user uses more than one computer or more than one browser, there will be different cookies and, correspondingly, different user profiles. This solution also runs into problems when several people share one machine, or when the user deletes or disables cookies. Web browsing sessions face similar obstacles when several people share one machine or when one person uses more than one machine or browser, but they do not store a userid between visits. Each user starts with a new browsing session, and the information in the session records the traces of the user's interaction with the system during one visit, for example the list of pageviews, the time spent on each pageview, the IP address, ...
An important advantage of session-based identification is that it places no burden on the user, raises no privacy concerns (that is, it keeps no information about the user), and does not require browser cookies to be enabled.

2.2.2. Information collection methods
Information collection techniques are usually classified by the nature of the data they collect. Corresponding to the two kinds of user information, implicit and explicit, there are two methods of collecting user information.

2.2.2.1. Explicit user information collection
The explicit method (also called explicit feedback) collects information that users enter directly, usually through HTML forms. The collected data may include date of birth, marital status, occupation, interests, and so on.
One of the earliest recommender systems, Syskill & Webert [23], recommends web pages based on explicit feedback. If the user rates several links on a page highly, Syskill & Webert recommends other linked pages. In addition, the system can build a query to the Lycos search engine (footnote 1) to retrieve web pages the user may like.
A problem with explicit feedback is that it places the burden of providing information on the user. Users who do not want to disclose private information will therefore either not participate or provide inaccurate information. Moreover, the profiles are maintained statically while characteristics such as interests and habits change, so the profiles can become inaccurate over time. One argument in favor of systems that use explicit feedback is that in some cases users like to provide and share information about themselves.



2.2.2.2. Implicit user information collection
In this method, user profiles are built from implicit feedback. Its advantage is that it requires no intervention from the user while profiles are built and maintained. The work of Kelly and Teevan [20] gives an overview of the common techniques for collecting implicit feedback and of the user information that can be inferred from user behavior. Building on it, Gauch et al. [14] summarize the approaches to implicit feedback collection.


1 http://www.lycos.com/
Figure 3. Recommender systems based on explicit feedback.

Table 2. Techniques for collecting implicit information [14].
Technique | Information collected | Information breadth | Pros and cons | Example
--- | --- | --- | --- | ---
Browser Cache | Browsing history | Any website | Pro: the user does not need to install anything. Con: the user must upload the cache periodically. | OBIWAN [24]
Proxy Servers | Browsing behavior | Any website | Pro: the user can use several browsers. Con: the user must use a proxy server. | OBIWAN [24]
Browser Agents | Browsing behavior | Any personalized application | Pro: agents can collect all web activity. Con: a new application must be installed and used while browsing. | WebMate [12]
Desktop Agents | All user activity | Any personalized application | Pro: covers all of the user's files and activity. Con: requires software installation. | Google Desktop
Web Logs | Browsing behavior | Logged websites | Pro: information about many users. Con: possibly little information, and only from one website. | Mobasher [7]
Search Logs | Queries and clicked URLs | Search sites | Pro: collects and uses information from many sites. Con: cookies must be enabled and/or login is required. Con: possibly very little information. | Misearch



Based on where this implicit information originates, implicit feedback can be divided into two types: client-side implicit information (client logs), obtained by the first four approaches, and server-side implicit information (server logs), obtained by the remaining two.
Client-side techniques place a burden on users, who must collect and share the logs of their own behavior. Server-side techniques (such as search logs and web logs) collect only the information produced while the user interacts with the system. As a result, less information can be collected on the server, but the server-side approach has the advantage of less complex data and avoids concerns about user privacy.

2.3. Building the user preference model
Different approaches to building a user preference model are used depending on the characteristics of the collected data. Data collected from users falls into two main types: structured and unstructured. Structured data includes numeric ratings, occupation, age, and so on. Unstructured data is textual, such as the content of the news articles read, the descriptions of the films watched, or comments written in natural language.
Gauch et al. [14] describe in some detail three methods of building a user preference model from textual data: the weighted keyword method, the semantic network method, and the concept hierarchy method. These are the preference models most often used in content-based recommender systems.

2.3.1. Weighted keyword method
Interests are described by a set of weighted keywords. The keywords are extracted from the user's data, and their weights are usually computed with the tf*idf weighting scheme. This is the earliest approach and the easiest to implement, but it suffers from semantic ambiguity and from the size of the keyword space. A typical example is WebMate [12], whose user profile contains one keyword vector for each of the user's areas of interest. Alipes [31], which extends the WebMate [12] idea, uses three keyword vectors for each interest: one long-term vector and two short-term vectors, one positive and one negative.
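The weighted-keyword profile described above can be sketched as follows. This is a minimal illustration, not the WebMate or Alipes implementation: the function name `tfidf_profile`, the pre-tokenized input, and the smoothed idf formula are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_profile(user_docs, corpus):
    """Build a weighted-keyword profile from tokenized documents.

    user_docs: documents the user viewed (lists of tokens).
    corpus: reference collection used for document frequencies.
    The smoothed idf below is an assumed variant; the thesis only
    names the tf*idf scheme without fixing a formula.
    """
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    profile = Counter()
    for doc in user_docs:
        tf = Counter(doc)
        for term, count in tf.items():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            profile[term] += (count / len(doc)) * idf
    return dict(profile)
```

Terms that are frequent in the user's documents but rare in the reference collection receive the largest weights, which is what makes them useful as interest keywords.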


Figure 4. Keyword-based user interest model.

2.3.2. Semantic network method
Interests are described by a set of nodes (keywords or concepts) and the edges that link them. As before, keywords are first extracted from the user's data. A concept may consist of one or more linked keywords (for example, through semantic relations derived from WordNet). The weight of an edge is determined by how often the two nodes, or keywords belonging to the two nodes, occur together in the same document. A typical example of this model is the InfoWeb system [15], in which each user profile is represented by a semantic network of concepts. Initially, the network contains a set of unlinked concept nodes, called planet nodes, each with a weight. As more information is collected, the profile is enriched with weighted keywords linked to the concepts. These keywords are represented as satellite nodes around the main concepts, and weighted links between the corresponding concepts are added as well.
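As a rough sketch of how co-occurrence can drive the edge weights described above, the following counts, for every pair of keywords, the number of documents in which both appear. The function name `cooccurrence_edges` and the plain document count used as the weight are assumptions for illustration. InfoWeb's actual weighting is not specified here.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs):
    """Return {(node_a, node_b): weight} for tokenized documents.

    The weight is the number of documents containing both nodes
    (an assumed, simplest-possible co-occurrence measure).
    """
    edges = Counter()
    for doc in docs:
        # Sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(doc)), 2):
            edges[(a, b)] += 1
    return edges
```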

Figure 5. Semantic-network-based user interest model [15].

2.3.3. Concept hierarchy method
User interests are described by a set of weighted concepts. Here the concepts are not extracted from text but are predefined, taken from the category hierarchy of the ODP (Open Directory Project) [30]. User data is classified into one of the branches of this hierarchy. The problems with this method are that the level of detail of the categories can lose information about general interests, and that it depends on the accuracy of the concept hierarchy. One of the first projects to use this method is OBIWAN [24]. Initially it used a concept hierarchy taken from the top three levels of ODP [30]. User data is classified automatically to find the best-matching concepts, and the weights of those concepts are increased.


Figure 6. Concept-network-based user interest model [24].

Chapter 3. Model
For users of recommender systems, factors that belong to the user's current usage context strongly influence their future choices.
The news articles in the current web session reflect more accurately the topics or entities the user wants to learn more about. Analyzing the information in these articles is therefore a promising way to extend contextual information, compared with analyzing only the current page.
Current techniques for representing user preferences still have the limitations described in Chapter 2. A new approach to these problems can be based on the conjecture that a user A may like a news article X if A has viewed articles on the same topic as X and X relates to more of the named entities that A is interested in (for example, the name of a football club such as ManU, or of a well-known person such as US President Obama).
A user profile can thus be described formally as follows:

Table 3. An example of a user preference profile.
User | Topics of interest | Entities of interest
--- | --- | ---
An | Football, Travel, ... | ManU, Chelsea, Da Lat, Hoi An, ...

News articles can be labeled with topics by hand, but this is not feasible because of its high cost, especially when very many articles are published or in systems that collect news automatically, such as RSSReader. A promising approach is hidden topic analysis. Its basic idea is to view each document as a probability distribution over topics, and each topic as a probability distribution over words. Many studies have confirmed the applicability of hidden topic analysis, for example to data classification and clustering [22] and to matching the content of a web page with advertising messages [21].
In the following sections, the thesis presents a solution for determining user preferences that follows this new approach.


3.1. Theoretical background
3.1.1. Topic analysis based on the LDA topic model
Topic analysis, for text in general and for web data in particular, plays an important role in understanding and navigating information on the Web. Once we know which topics or information a web page contains, it becomes easier to categorize, organize, and summarize its content. In text classification, each document is usually assigned to one specific class. In topic analysis, we assume that each document addresses more than one topic (K topics), and the degree to which it relates to each topic is represented by the document's probability distribution over the topics.


Figure 7. A document with K hidden topics.

There are many methods for analyzing the topics of text, a typical one being the LDA model [13]. LDA is a generative model that performs topic analysis on text collections in a fully unsupervised way. In terms of its goal, LDA, like LSA, provides a technique for compactly describing collections of discrete data (such as text collections). Intuitively, LDA finds topic and concept structures in a text collection from the co-occurrence of keywords in documents, and it allows synonymy and polysemy to be modeled. In terms of modeling, LDA works much like pLSA (probabilistic LSA) [19]. However, LDA is superior to pLSA in several respects, such as completeness and better generalization [13][17].
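For reference, the generative process that LDA assumes can be written out as below. This is the commonly used smoothed formulation with symmetric Dirichlet priors; the symbols $\alpha$, $\beta$, $\theta_d$, $\varphi_k$, $z_{d,n}$ and $w_{d,n}$ are notation introduced here for illustration and are not taken from the thesis.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta)      && \text{for each topic } k = 1,\dots,K \\
\theta_d  &\sim \mathrm{Dirichlet}(\alpha)     && \text{for each document } d \\
z_{d,n}   &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the } n\text{-th word of } d \\
w_{d,n}   &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}) && \text{the word itself}
\end{align*}
```

Here $\theta_d$ is the document's distribution over the K topics and $\varphi_k$ is topic $k$'s distribution over the vocabulary, which matches the two distributions described in the text.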



Figure 8. Graphical representation of LDA [13].

Parameter estimation for the LDA model


Figure 9. Parameter estimation on a text collection.

Estimating the parameters of the LDA model by maximizing the likelihood function directly and exactly has very high time complexity and is not feasible in practice. Approximate methods such as Variational Methods [13] and Gibbs Sampling [17] are normally used instead. Gibbs Sampling is regarded as a fast, simple, and effective algorithm for training LDA.
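As a sketch of what Gibbs Sampling does for LDA, the widely used collapsed sampler reassigns the topic of each word token $i$ using the update below. The notation is an assumption introduced here: $n^{(w)}_{k,-i}$ counts how often word $w$ is assigned to topic $k$, $n^{(k)}_{d,-i}$ counts the tokens of document $d$ assigned to topic $k$ (both excluding token $i$), $V$ is the vocabulary size, and $\alpha$, $\beta$ are symmetric Dirichlet hyperparameters.

```latex
\begin{align*}
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) &\propto
  \frac{n^{(w_i)}_{k,-i} + \beta}{n^{(\cdot)}_{k,-i} + V\beta}
  \,\bigl(n^{(k)}_{d_i,-i} + \alpha\bigr) \\[4pt]
\hat{\varphi}_{k,w} &= \frac{n^{(w)}_{k} + \beta}{n^{(\cdot)}_{k} + V\beta},
\qquad
\hat{\theta}_{d,k} = \frac{n^{(k)}_{d} + \alpha}{n^{(\cdot)}_{d} + K\alpha}
\end{align*}
```

After enough sweeps, the count-based estimates $\hat{\varphi}$ and $\hat{\theta}$ give the topic-word and document-topic distributions.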

Using the LDA model for topic inference
According to Nguyen Cam Tu [22], given a topic model that has been trained well on a universal dataset covering the application domain, topic inference for new documents can be carried out in much the same way as parameter estimation (that is, by determining the document's distribution over topics through the parameter theta). The author also shows that models trained on data from VnExpress (footnote 1) perform better for topic analysis of news data, whereas models trained on data from Wiki (footnote 2) perform better on academic documents.
Based on these studies, we chose a topic model trained on the universal dataset collected from VnExpress for topic analysis. A general topic analysis process is illustrated below:


Figure 10. Topic inference using the VnExpress dataset [22].

3.1.2. Dictionary-based entity recognition in documents
The content of a text is closely related to the entities the text contains. An entity may be a person name, a place name, an organization, and so on. The dictionary-based entity recognition method simply checks whether entities from an entity dictionary are present in the text being analyzed. The Aho-Corasick string matching algorithm [3] is a typical dictionary-based technique. Its basic idea is quite simple: the entities in the dictionary are treated as patterns, and a finite-state automaton built from these patterns is used to detect occurrences of the patterns in the text.

1 www.vnexpress.net
2 www.wikipedia.org
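The automaton-based matching described above can be sketched as follows. This is a compact, illustrative Aho-Corasick implementation, not the code used in the thesis; the function names `build_automaton` and `find_entities` are assumptions, and the sketch matches raw substrings without checking word boundaries.

```python
from collections import deque

def build_automaton(patterns):
    """Build the goto, failure, and output tables for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognized when a state is reached
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_entities(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

The text is scanned once regardless of the dictionary size, which is why this technique suits a dictionary of several thousand entities.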

3.2. User preference analysis
3.2.1. Information in the user's web session
A web session is a sequence of pageviews by a single user during a single visit [7]. A pageview is a collection of web objects displayed to the user. Each pageview can be seen as a set of web objects or resources that represents one specific user action, such as reading a news article, viewing product information, or adding a product to a shopping cart. The model treats a web session as the list of URLs of the pages of the system that the user has accessed.

Table 4. Information in a web session.
Session ID (Profile ID) Url
1 www.bestnews4u.com?newsid=102
1 www.bestnews4u.com?newsid=82
1 www.bestnews4u.com?newsid=11
1 www.bestnews4u.com?newsid=1021
2 www.bestnews4u.com?newsid=102
2 www.bestnews4u.com?newsid=144



3.2.2. User preference model



In this model, a user's preferences are represented by two pieces of information: the set of hidden topics the user is most interested in, and the set of related entities.

The set of hidden topics the user is interested in is determined in three steps:
Step 1: From the set of documents describing the user's preferences, compute the topics and their distribution in each document.
For each document $d_i$ in the set $D$ of documents describing the user's interests, hidden topic analysis yields the topics of $d_i$, denoted $TP_j$ and belonging to the topic set $TP$, each with a weight $w_{tp_j}$:
$$\mathrm{Topics}(d_i) = \{(TP_j, w_{tp_j}), \dots\}$$

Step 2: Rank the topics by a popularity statistic:
$$\mathrm{Rank}(TP_j) = \text{number of occurrences of } TP_j \text{ in the } D \times TP \text{ matrix with } w_{tp_j} \text{ greater than a threshold}$$

Step 3: The top N highest-ranked hidden topics are used to represent the user model.
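The three steps above can be sketched in a few lines. The function name `popular_topics`, the dictionary-per-document input, and the parameter names are assumptions for illustration; the thesis does not fix the threshold or N at this point.

```python
from collections import Counter

def popular_topics(doc_topics, threshold, top_n):
    """Rank topics by how many documents carry them above a threshold.

    doc_topics: one {topic_id: weight} dict per document, i.e. the
    output of Topics(d_i) in Step 1 (assumed representation).
    """
    rank = Counter()
    for topics in doc_topics:
        for topic_id, weight in topics.items():
            if weight > threshold:      # Step 2: count only strong topics
                rank[topic_id] += 1
    # Step 3: keep the N topics with the highest rank.
    return [topic_id for topic_id, _ in rank.most_common(top_n)]
```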

Figure 11. User preference model based on hidden topics and entities. (Diagram: the news articles of interest in the session yield the popular hidden topics and the related entities.)

The entity set is determined in two steps:
Step 1: Determine the documents to analyze for entities. The documents used to extract the entities that represent the user's preferences satisfy two conditions:
- they are news articles in the user's web session;
- their content relates to a topic of interest identified while determining the popular hidden topics.
Step 2: Extract the entities from the news texts.

3.3. Applying the user preference model to news recommendation
Our work develops a recommender system model that uses the interest model proposed in the previous section. The general idea is to treat as candidate recommendations the news articles that carry information about the topics and entities the user has shown interest in. The recommendation application is integrated into a content management system (CMS). The proposed solution therefore determines the topics and entities of each article immediately after the article is entered into the system's news database. The thesis calls this stage the recommendation data analysis phase. After this phase, each article corresponds to two lists: a list of topics and a list of entities. The online recommendation phase collects information about the user's preferences by computing statistics on the popular topics in the web session and then automatically generates queries against the database. The result is related recommendation data that spans several topics and mentions entities the user has shown interest in.

3.3.1. Recommendation data analysis phase
Input: each news text.
Output: the topic and entity analysis of each article.
Hidden topic analysis phase:
- Infer the hidden topics
- Select the main topics
Related entity analysis phase:
- Identify the entities
- Select the main entities



This phase processes news articles before they are stored in the database. Processing consists of two independent analysis phases.

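Hidden topic analysis
The hidden topics of each article are inferred with a trained hidden topic model. This phase has two steps:
Step 1. Infer the hidden topics:
Taking news texts as input, this step computes the probabilities of the hidden topics that reflect the content of each text. Topics with higher probability are the topics that the main content of the article is about. Note that the number of hidden topics is fixed, and every topic has some probability of reflecting the content of the text. For example, if a model with 100 hidden topics is chosen for the analysis, each text is described by a 100-dimensional vector in which each dimension is a topic and each value is the probability weight of the corresponding topic.

Figure 12. Model of the recommendation data analysis phase. (Diagram: a news article passes through topic inference, which uses the topic model, and entity identification, which uses the entity dictionary; the top topics by probability and the top entities by weight are stored in the news database.)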

Step 2. Determine the top topics by probability:
From the topic distribution vector of a news text, we need to determine which topics can represent the information content of the article. These topics are identified by two constraints:
- the number of topics that represent the content of one text must stay within a limit;
- the probability of a topic must exceed a given threshold.
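A minimal sketch of this selection step, applying both constraints to a topic distribution vector. The function name `select_main_topics` and the default values for the limit and the threshold are assumptions chosen only for the example.

```python
def select_main_topics(theta, max_topics=3, min_prob=0.1):
    """Pick the topics that represent a document.

    theta: list of topic probabilities, indexed by topic id.
    Keeps at most max_topics topics, each with probability
    strictly above min_prob, ordered from most to least probable.
    """
    ranked = sorted(enumerate(theta), key=lambda pair: pair[1], reverse=True)
    return [(topic, prob) for topic, prob in ranked[:max_topics] if prob > min_prob]
```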

Related entity analysis
The value of a news article also depends on the entities it mentions. For example, news about a president's holiday is worth more than news about an ordinary person's holiday. This phase identifies the entities in the news text. Entities can be extracted from the text in two steps:
Step 1: Identify all entities in the article content.
If the news text is treated as a string and each entity in the dictionary as a pattern, a string matching algorithm can be applied to find all entities in the content of the article. The result of this step is a list of entities, each weighted by the number of times it occurs in the text.

Step 2: Select the highly weighted entities for storage.
An entity is judged to be more relevant to the content of the text if it is mentioned more than a certain number of times. This step filters out the entities that occur too rarely (fewer times than a threshold). The remaining entities are stored as a representation of part of the article's value.
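The two entity steps can be illustrated as below. For brevity this sketch counts occurrences with Python's built-in `str.count`, which stands in for the string matching algorithm; the function name `weighted_entities` and the `min_count` parameter are assumptions.

```python
def weighted_entities(text, dictionary, min_count=2):
    """Weight each dictionary entity by its occurrence count in text
    and keep only entities that reach min_count occurrences."""
    # Step 1: weight = number of (non-overlapping) occurrences.
    counts = {entity: text.count(entity) for entity in dictionary}
    # Step 2: drop entities mentioned fewer than min_count times.
    return {entity: n for entity, n in counts.items() if n >= min_count}
```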


3.3.2. Online recommendation phase
Input: the set of URLs stored in the web session.
Output: the set of recommended news articles.
URL preprocessing phase:
- Normalize the URLs to a uniform form and identify the articles in the session.
User interest analysis phase:
- Determine the articles in the session and their topics.
- Analyze the popular hidden topics.
- Determine the set of related entities in the session.
Recommendation phase:
- Filter the list of articles that share the popular hidden topics.
- Rerank the articles by how many of the entities they relate to.














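Figure 13. Model of the online recommendation phase. (Diagram: the URLs of the articles in the session are preprocessed; query 1 against the news database returns the session articles with their hidden topics; statistics give the popular topics; query 2 returns the entities the user is interested in during the session; query 3 returns the articles whose topics are popular topics; these articles are reranked to produce the top articles for recommendation.)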

Data preprocessing
The URLs stored in the user's session are brought to a uniform, normalized form:
- Remove the URLs that do not correspond to a news detail page.
- Normalize the URLs by removing redundant parameters.
  Example: www.bestnews4u.com?newsid=20#top becomes www.bestnews4u.com?newsid=20
- Remove duplicate URLs.
- Extract the news identifier field (newsid) from the URLs.
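The preprocessing steps above can be sketched with the Python standard library. The function name `session_news_ids` is an assumption, and the sketch assumes, as in the example URLs, that a news detail page is recognizable by a `newsid` query parameter.

```python
from urllib.parse import urlsplit, parse_qs

def session_news_ids(urls):
    """Return the distinct news ids of a session, in visiting order.

    URLs without a newsid parameter are treated as non-article pages
    and dropped; fragments such as #top and other parameters are
    ignored, so equivalent URLs collapse to one id.
    """
    seen, ids = set(), []
    for url in urls:
        query = parse_qs(urlsplit(url).query)
        news_id = query.get("newsid", [None])[0]
        if news_id is None or news_id in seen:
            continue
        seen.add(news_id)
        ids.append(news_id)
    return ids
```

Working with the extracted identifier rather than the full URL makes the later database queries independent of how the link was written.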

User preference analysis
As presented earlier in this chapter, user preferences can be determined from the popular topics and the entities. To improve response time, the topic and entity analysis of each article is carried out in the recommendation data analysis phase, so user preferences can be extracted directly from the database. Formally, this task has three steps:
Step 1. Extract from the database the articles in the session and their topics (query 1 in Figure 13).
Step 2. Compute the popular hidden topics:
From the data obtained in step 1, the system counts the topics that recur across the articles.
In practice, when the session still contains few articles, the topics may not yet overlap, or the articles may fall under separate topics. The system then cannot determine which topic is popular. The solution in this situation is to take the topics of the article the user accessed most recently. In the remaining cases, a threshold is used to determine whether a topic is popular.
Step 3. Determine the set of entities in the articles that belong to the popular hidden topics:
Each article relates to a set of entities. Once the popular topics are known, a method is needed to find the entities of articles that both belong to the session and relate to a popular topic (some articles in the session may not belong to any popular topic). The query that extracts the entities must therefore satisfy two constraints (query 2 in Figure 13):
- the entity belongs to an article in the session;
- the entity belongs to an article whose topic is a popular topic.
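Step 2 and its fallback can be sketched as follows. The function name `session_popular_topics` and the `min_articles` threshold are assumptions; the thesis says only that some threshold is used.

```python
from collections import Counter

def session_popular_topics(session_topics, min_articles=2):
    """Find the topics shared by several articles of a session.

    session_topics: one list of topic ids per article, oldest first.
    If no topic reaches min_articles articles, fall back to the
    topics of the most recently accessed article.
    """
    counts = Counter(t for topics in session_topics for t in set(topics))
    popular = [t for t, n in counts.most_common() if n >= min_articles]
    if popular:
        return popular
    return list(session_topics[-1]) if session_topics else []
```

The fallback keeps the system able to recommend something even when the session holds only one or two unrelated articles.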

News recommendation
The final stage of the recommendation process is to find the articles that best match the user's preferences. Recommendation proceeds in three steps:
Step 1: Determine the candidate articles among the articles available for recommendation.
The system filters the articles that share a topic with the user's interests by matching the hidden topics of the articles in the database against the hidden topics found to be popular for the user (query 3 in Figure 13).
Step 2: Rerank the articles.
The result of step 1 is a class of articles that may interest the user at the topic level. There may be too many such articles, so a way to rerank them is needed. One possible solution builds on the idea that part of the user's decision criterion is whether an article relates to the entities the user is currently interested in.
From the entity sets of the candidate articles, the rank of an article is the number of entities it mentions that appear in the list of entities of interest extracted from the web session in the previous phase.
Step 3: Recommend the top-ranked articles.
Reranking yields a list of articles sorted in decreasing order of relevance to the entities the user is currently interested in. In this step, the system selects the N most promising articles to recommend to the reader.
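The three recommendation steps can be sketched as one function. The name `recommend` and the article representation (a dict with `id`, `topics`, and `entities` keys) are assumptions for illustration. In the described system, the filtering would be done by a database query rather than in memory.

```python
def recommend(candidates, popular_topics, session_entities, top_n=5):
    """Filter articles by popular topic, then rank by shared entities.

    Returns up to top_n (score, article_id) pairs, best first, where
    score is the number of session entities the article mentions.
    """
    popular = set(popular_topics)
    interests = set(session_entities)
    scored = []
    for article in candidates:
        # Step 1: keep only articles that share a popular topic.
        if popular.isdisjoint(article["topics"]):
            continue
        # Step 2: rank = number of entities of interest mentioned.
        score = len(interests.intersection(article["entities"]))
        scored.append((score, article["id"]))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep input order
    # Step 3: return the N best candidates.
    return scored[:top_n]
```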


3.4. Evaluating recommendation results
Evaluating the quality of the articles returned by the system is a hard problem, because no semantic measure can accurately assess how well the returned articles match the user.
Herlocker [18] gives two main reasons why evaluating recommender systems is difficult. The first is that the quality of a recommender depends on the dataset used. A news recommender with a good model will not necessarily recommend better than one with good data (such as a rich news database). The second is that evaluation can target different goals. In some systems, evaluation is based on the number of recommendations that lead to right and wrong decisions. In others, it is based on whether users are satisfied or dissatisfied with the recommendations.
For these reasons, to evaluate the correctness of the proposed recommendation model, we rely mainly on collecting users' opinions about the recommendations.
In addition, building on our study of user preference analysis through browser history, proposed in our 2010 student research work [1], we introduce an automatic method for evaluating the preference analysis model. It is based on the similarity between the dominant preferences in the web session and the dominant preferences in the user's browsing history over the same period. The method thus compares the user's preferences across many sites with the user's preferences on the system. We compare the two by taking the three most popular hidden topics of each as representatives. If they share at least one specific topic, they are considered similar. The evaluation results are presented in the next part.
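The automatic comparison described above can be sketched as follows. The function names `preferences_match` and `match_rate` are assumptions; each preference is assumed to be a list of topic ids ordered from most to least popular.

```python
def preferences_match(session_topics, history_topics, k=3):
    """True if the top-k session topics and the top-k history topics
    share at least one topic."""
    return not set(session_topics[:k]).isdisjoint(history_topics[:k])

def match_rate(pairs, k=3):
    """Fraction of (session, history) pairs whose preferences match."""
    hits = sum(preferences_match(s, h, k) for s, h in pairs)
    return hits / len(pairs)
```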






Chapter 4. Experiments and evaluation
4.1. Experimental environment
Table 5. Experimental environment.
Component Specification
CPU Core 2 Duo 2.0 GHz
RAM 2 GB
HDD 320 GB
OS Windows 7 Ultimate


4.2. Data and tools
4.2.1. Data
Recommendation data
To build the system's recommendation dataset, we collected data from three websites: Dantri, Vnexpress, and 24h. After preprocessing, such as extracting the main content of each article, we obtained 4333 articles:
- 2060 articles from Dantri.com.vn
- 1291 articles from Vnexpress.net
- 982 articles from 24h.com.vn
User web session data
We collected 30 web sessions from 30 users on the Dantri and Vnexpress websites by analyzing their browsing histories.
User browser history data
We collected 30 browser histories from the same users, covering about 15 minutes before and after each of the 30 collected web sessions.

4.2.2. Tools
Table 6. Tools.
Tool | Description
--- | ---
SessionRecommendation | Author: Ung Huy Long. A toolkit that analyzes users' web browsing preferences through sessions and recommends news based on the analyzed preferences.
JGibbLDA | Authors: Nguyen Cam Tu and Phan Xuan Hieu. A hidden topic analysis tool for documents, written in Java. Website: http://jgibblda.sourceforge.net
VutmDic | Author: Tran Mai Vu. An entity dictionary of 6479 entities of 4 types: domestic place names, foreign place names, person names, and organization names.
Vnexpress 100topics | Authors: Nguyen Cam Tu and Phan Xuan Hieu. A dataset of 100 hidden topics analyzed from Vnexpress, used for hidden topic analysis. Website: http://jgibblda.sourceforge.net/vnexpress-100topics.txt
Crawler4j | Author: Yasser Ganjisaffar. A tool for collecting data from online news websites. Website: http://code.google.com/p/crawler4j/

4.3. Experiments
4.3.1. Example of news analysis












Table 7. Some hidden topics.
Topic 86: du_lch, tour, thi_lan, du_khch, p, khch, singapore, ph, c, im_n, bi_bin, sinh_thi, de_france
Topic 23: vit_nam, vng, th_thao, hc, chy, th_gii, vn, sea_games, in_kinh, vv, ginh, ni_dung, asiad
Topic 94: hc_sinh, quc_t, em, thi, tt_nghip, gio_vin, quc_gia, lp, thpt, t_chc, gii, k_thi, olympic

Article: Travelling to Beijing during the Olympics is extremely difficult
28/07/2008 08:17 According to travel agencies in Hanoi, demand for trips to Beijing during the 2008 Olympics has risen sharply, but the companies cannot meet it. At this time, hotel room prices in Beijing are five times higher than before, and tourist transport cannot be booked because the vehicles have been mobilized to serve the Olympics.
In addition, visa procedures for entering China are also difficult at this time. As a result, not only have tour prices to Beijing surged, but travel agencies in China are also refusing when the Vietnamese side proposes sending visitors over
List of topics:
- Topic 86
- Topic 23
- Topic 94
List of entities:
- Beijing
- Hanoi
- Olympic
- China
- Vietnam
Figure 14. Representing a news article by topics and entities.

4.3.2. Example of user preference analysis
The articles stored in the web session are used to analyze the user's preferences at the current time. The analysis follows the model proposed in Chapter 3, with two steps: hidden topic analysis and recognition of the entities in the articles. For example, for the 4 URLs listed in the table below, the system finds the 3 dominant hidden topics of each article and the entities present in those articles (in the original table, the entities are the words highlighted in color).

Table 8. Example of user preference analysis.

No. 1
Url: http://dantri.com.vn/c26/s26-393724/quy-do-mu-uu-tien-chi-20-trieu-bang-mua-benzema.htm
Red Devils MU prioritize spending 20 million pounds on Benzema
(Dan tri) - Worried that their attack depends too heavily on Wayne Rooney's current form, Manchester United plan to spend 20 million pounds to buy striker Karim Benzema this summer.
Because of forward Berbatov's dismal loss of form, Manchester United's attack currently relies heavily on the form of Wayne Rooney. Facing the risk that the striker will be overloaded next season after playing continuously from the World Cup through to the tours, MU are preparing a backup plan.
Topics 1-3: 19, 70, 72

No. 2
Url: http://dantri.com.vn/c25/s20-393779/bo-hoi-tai-di-sieu-thi-ngay-nghi-le.htm
Exhausting supermarket trips over the holiday
(Dan tri) - Jostling to buy goods, suffocating waits at the checkout, and even abandoning purchases to escape: this is what many people faced at supermarkets during the recent holiday.
Instead of travelling, a sizeable share of Ho Chi Minh City residents spent freely on shopping during the long 30/4 - 1/5 holiday. In response, supermarkets ran many attractive promotions to draw shoppers
Topics 1-3: 86, 78, 14

No. 3
Url: http://dantri.com.vn/c26/s26-394037/wayne-rooney-tiep-tuc-boi-thu-danh-hieu-ca-nhan.htm
Wayne Rooney continues to collect individual honors
(Dan tri) - With his brilliant form this season, Wayne Rooney has once again taken home prestigious individual honors. He has just won two more Player of the Year awards, voted by MU fans and by his teammates.
With an overwhelming 83% of the vote, Rooney beat teammates Patrice Evra and Antonio Valencia to become MU's best player of 2010 (Sir Matt Busby Player of the Year). The award is voted by Red Devils fans around the world through the website ManUtd.com. This is the second time the English striker has received this honor, after his first success in 2006.
Topics 1-3: 19, 4, 70

No. 4
Url: http://dantri.com.vn/c26/s26-381415/owen-rooney-giup-mu-bao-ve-thanh-cong-carling-cup.htm
Owen and Rooney help MU successfully defend the Carling Cup
(Dan tri) - Although Aston Villa took the lead early in the match, the Red Devils showed their character and came from behind to win 2-1 thanks to goals from Owen and Rooney, winning the Carling Cup for the second time in a row.
The final at Wembley tonight, 28/2, was open and exciting from the opening whistle. Aston Villa unexpectedly opened the scoring in the 4th minute with a successful penalty by James Milner. Despite this early setback, MU did not waver and quickly equalized after only 9 minutes, when Owen seized his chance.
Although the former Newcastle forward then had to leave the pitch at the end of the first half because of an injury, his replacement, Wayne Rooney, went on to do the job superbly. The forward, who is in outstanding scoring form, scored the goal that settled the match at 2-1 in the 74th minute, giving MU the Carling Cup title for the second time in a row...
Topics 1-3: 19, 39, 37

The system detects the topics shared by the articles that have just been read. In this example, the popular topics are 19 (3 times) and 70 (2 times) (some highly weighted keywords of topics 19 and 70 are shown in the table below), and the prominent entities include MU, Wayne Rooney, Newcastle, Carling Cup, Owen, ...

Word distribution of topic 19 | Probability | Word distribution of topic 70 | Probability
--- | --- | --- | ---
gii | 0.06996495208178817 | ng | 0.07584530113531
v_ch | 0.028954524962552533 | hng | 0.03834504357859601
cu_th | 0.025173421752977616 | tin | 0.03463622689716275
i | 0.021828599682969036 | triu | 0.03133950095811097
ma | 0.01935633989209313 | t | 0.02227350462571858
bng | 0.014993528496429764 | chim | 0.011765190694991037
vng | 0.014266393263819203 | la | 0.008674510127129994
trn | 0.011503279379899072 | trm | 0.008262419384748521
hng | 0.011212425286854849 | chc | 0.006614056415222632
bng_ | 0.011212425286854849 | gi | 0.006408011044031896
u | 0.010921571193810624 | chim_ot | 0.00620196567284116
thi_u | 0.010485290054244287 | nghn | 0.0053777841880782145

4.3.3. News recommendation
Articles are considered relevant if they share a popular topic with the articles the user is interested in. For example, for the articles listed in Table 8, the relevant articles are those whose topics include 19 or 70.


Figure 15. Analysis results showing the information related to topic 19.

However, if only articles on the same topic are recommended, too many articles may be selected, so a way to reorder them is needed. The thesis uses the entities found in the viewed articles that belong to the popular topics (such as MU, Wayne Rooney, Newcastle, Carling Cup, Owen, ...) to rerank the results.
The top N articles obtained are then recommended to the user. For example, the following article may be recommended.
Garry Neville and 10 memorable events in his MU career - Football - Sidelines. Score: 4
Gary Neville, full name Gary Alexander Neville, is currently fifth on the list of players with the most appearances for MU, with 597 matches in all competitions. Above him are Paul Scholes with 641 appearances and Ryan Giggs, who leads the list with 836. Neville is also one of 9 players with more than 500 appearances in MU colors.
Neville is a product of MU's youth academy of the 1990s and had the honor of wearing the captain's armband in the Manchester United team that won the FA youth cup in 1992. That season saw the emergence of a generation of talented players such as David Beckham, Ryan

4.4. Experimental results and evaluation
We evaluate the accuracy of the model using the two evaluation methods described in Section 3.4:
- Evaluating the preference analysis model by the topic similarity between the user interests derived from the browsing history stored on the client and the user interests derived from the web session stored on the server.
- Evaluating the accuracy of the model by user judgment: we tally users' direct assessments of whether each recommendation is relevant or not. The reported accuracy is the average accuracy over 30 users.



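Table 9. Evaluation of the preference analysis model.
Topic | Accuracy of the topic with respect to the user's interests
--- | ---
Top-ranked topic | 85%
Second-ranked topic | 79%
Third-ranked topic | 72%
Fourth-ranked topic | 66%
Fifth-ranked topic | 57%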

The comparison of topic similarity between the web session and the pages the user visited before and after the session shows that the analysis of user interests can be used to summarize the user's current interests and to predict the articles the user may like in the future.

Table 10. Accuracy of the model based on user judgment.
Number of articles the user has viewed | Accuracy of 1 recommendation | Accuracy of 3 recommendations | Accuracy of 5 recommendations
--- | --- | --- | ---
1 article | 70% | 68.3% | 65.2%
3 articles | 76.7% | 64.3% | 66.4%
5 articles | 83.3% | 79.4% | 76.5%
7 articles | 56.7% | 43.7% | 42%

From the figures in Table 10, the following conclusions can be drawn:
- Recommendation performs best when the web session stores 5 articles. Sessions storing 1 or 3 articles perform worse because users sometimes read articles from completely independent areas, so no popular topic has yet emerged among the analyzed topics.
- In the remaining case, when the session stores 7 articles, noise from topics of little interest in the earlier articles increases. Because the system identifies popular topics without considering the weight of each topic, topics of little interest sometimes become popular, which lowers the accuracy of the model.
- In general, the accuracy of the recommendation model decreases as the number of recommended articles grows. However, offering more recommendations gives users more choices.












Conclusion
Recommender systems have received a great deal of attention from the research community and from businesses because of their contribution to solving information overload and to providing personalized services. In news recommendation, however, current approaches still leave many problems open. Recognizing this need, the thesis studied and surveyed several existing approaches to the recommendation problem. Based on these surveys, it then proposed a recommendation solution for news delivery systems.

Main results
The thesis studied the concepts, terminology, and techniques related to recommender systems. Based on a survey of the characteristics of news recommendation and an analysis of the strengths and weaknesses of the methods for building the two main components of a recommender, the user preference model and the recommendation algorithm, the thesis proposed a news recommendation solution based on mining the user's current usage context. The system implements a recommendation algorithm that analyzes the hidden topics and entities in the content of the articles the user has just accessed (a content-based approach). This approach is promising, as shown by the initial statistical results.

Open issues
Although the model has achieved some encouraging initial results, many problems remain. First, because there are no semantic measures for recommender systems of this kind, the evaluation relies mainly on subjective judgments of whether the recommendations are relevant. In addition, the limited size and quality of the news repository also hurt recommendation quality. Finally, because the system uses data from the user's web session, recommendation quality is still low when the user has only just accessed the first few articles.



Future work
In the near future, besides continuing to address the remaining issues, we plan the following further research:
- Studying contextual factors and their influence on user decisions.
- Studying applications of the solution for extending user context information, such as providing advertising that fits the usage context.











Ti liu tham kho
Ting Vit
[1] Ung Huy Long, Nguyn o Thi, Trn Xun T. M hnh t vn da trn
vic phn tch ch n s quan tm ca ngi dng, Cng trnh sinh vin nghin
cu khoa hc, i hc Cng Ngh, HQGHN, 2009.

Ting Anh
[2] G.Adomavicius, A.Tuzhilin. Towards the Next Generation of Recommender
Systems:A Survey of the State-of-the-Art and Possible Extensions, IEEE
Transactions on Knowledge and Data Engineering, 2005.

[3] Aho, Alfred V.; Margaret J. Corasick. "Efficient string matching: An aid to
bibliographic search". Communications of the ACM 18 (6): 333340, June 1975.

[4] Ansari, A., S. Essegaier, and R. Kohli. Internet recommendations systems.
Journal of Marketing Research, pages 363-375, 2000.

[5] Basu, C., H. Hirsh, and W. Cohen. Recommendation as classification:
Using social and content-based information in recommendation. In Recommender
Systems. Papers from 1998 Workshop. Technical Report WS-98-08. AAAI Press, 1998.

[6] Balabanovic, M. and Y. Shoham. Fab: Content-based, collaborative
recommendation. Communications of the ACM, 40(3):66-72, 1997.

[7] Bamshad Mobasher: Data Mining for Web Personalization. The Adaptive
Web 2007:90-135.

[8] Belkin, N.J., Croft, W.B.: Information ltering and information retrieval: two
sides of the same coin?. Communications of the ACM 35(12), 2938 (1992).

[9] Billsus, D. and M. Pazzani. Learning collaborative information filters.
In International Conference on Machine Learning, Morgan Kaufmann Publishers,
49

1998.

[10] Breese, J. S., D. Heckerman, and C. Kadie. Empirical analysis of predictive
algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on
Uncertainty in Artificial Intelligence, Madison, WI, 1998.

[11] Burke, R. Hybrid Recommender Systems: Survey and Experiments. User
Modeling and User-Adapted Interaction 12, 4 (Nov. 2002), 331-370.

[12] Chen, L., Sycara, K.: A Personal Agent for Browsing and Searching. In:
Proceedings of the 2nd International Conference on Autonomous Agents,
Minneapolis/St. Paul, May 9-13, (1998) 132-139.

[13] David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation.
Journal of Machine Learning Research (JMLR) 3:993-1022 (2003).

[14] Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A. User profiles for
personalized information access, In: Brusilovsky, P., Kobsa, A., and Neidl, W., Eds.
The Adaptive Web: Methods and Strategies of Web Personalization. Springer- Verlag,
Berlin Heidelberg New York, 2007, 54-89.

[15] Gentili, G., Micarelli, A., Sciarrone, F.: Infoweb: An Adaptive Information
Filtering System for the Cultural Heritage Domain. Applied Artificial Intelligence
17(8-9) (2003) 715-744.

[16] Guarino, N., Masolo, C., Vetere, G.: OntoSeek: Content-Based Access to the
Web. IEEE Intelligent Systems, May 14(3) (1999) 70-80.

[17] Heinrich, G., Parameter Estimation for Text Analysis, Technical Report.

[18] Herlocker, .L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating
Collaborative Filtering Recommender Systems. ACM Transactionson Information
Systems 22(1), 553(2004).

[19] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of
50

SIGIR-99, (1999) 3544.

[20] Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: a
bibliography. ACM SIGIR Forum 37(2) (2003) 18-28.

[21] Le Dieu Thu. Online context advertising, Undergraduate Thesis, College of
Technology, Vietnam National University, Hanoi, 2008.

[22] Nguyen Cam Tu. Hidden Topic Discovery toward Classification and Clustering
in Vietnamese Web Documents, Master Thesis, College of Technology, Vietnam
National University, Hanoi, 2008.

[23] Pazzani, M., Muramatsu, J., Billsus, D.: Syskill & Webert: Identifying
Interesting Web Sites. In: Proceedings of the 13th National Conference On Artificial
Intelligence Portland, Oregon, August 48 (1996) 54-61.

[24] Pretschner, A.: Ontology Based Personalized Search. Masters thesis. University
of Kan- sas, June (1999).

[25] Popescul, A., L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic
Models for Unified Collaborative and Content-Based Recommendation in Sparse-
Data Environments. In Proc. of the 17th Conf. on Uncertainty in Artificial
Intelligence, Seattle, WA, 2001.

[26] R.Baeza, F.Silvestri. Web Query Log Mining, ACM SIGIR Conference tutorial,
2009.

[27] G. Salton, A. Wong, C.S. Yang. A Vector Space Model for Automatic Indexing,
Communication of the ACM, 18 (11), 1975.

[28] Sieg, A., Mobasher, B., Burke, R.: Inferring users information context:
Integrating user profiles and concept hierarchies. In: 2004 Meeting of the
International Federation of Classification Societies, IFCS, Chicago, July (2004).

[29] Soboroff, I. and C. Nicholas. Combining content and collaboration in
51

text filtering. In 43 IJCAI'99 Workshop: Machine Learning for Information Filtering,
1999.

[30] The Open Directory Project (ODP), http://dmoz.org

[31] Widyantoro, D.H., Yin, J., El Nasr, M., Yang, L., Zacchi, A., Yen, J.: Alipes:
A Swift Messenger In Cyberspace. In: Proc. 1999 AAAI Spring Symposium Workshop
on Intelli- gent Agents in Cyberspace, Stanford, March 22-24 (1999) 62-67.