
Introduction

Today, given the profound impact of the Internet on economic, political, and cultural life, Web data mining has become a topical research field that attracts many researchers. It draws on many areas, including databases, information retrieval, and artificial intelligence, and it is also a subfield of machine learning and natural language processing. One of the most active topics in Web mining is the construction of Web search tools, because in today's information society the need to obtain information quickly and accurately is increasingly urgent. Finding useful information is far from simple, especially for inexperienced users. A search tool makes it much easier for users to browse the Web and locate the pages they care about.

However, because the Internet grows and changes so quickly, search tools face hard problems of speed. One of them is the speed of computing ranks for Web pages, that is, of computing the importance of the result pages relative to the user's search request. The World Wide Web is extremely large, reaching billions of pages, and these pages are not static but change constantly, so time efficiency becomes all the more important. If PageRank cannot be computed fast enough for the set of pages in the database, the search system cannot deliver good search quality to its users.

Aware that this is a promising research area, we chose the topic "A page-ranking solution that exploits the Block structure of the Web, applied to a search engine" for our graduation thesis. The thesis studies the page-ranking (PageRank) problem in search engines: its structure, its algorithms, and the criteria for evaluating the process. We also apply this theory to analyze the source code and understand how PageRank computation is carried out in Vinahoo, an open-source Vietnamese search engine with many advanced features.

From this study, we propose a solution that applies the concept of connected components of the Web link matrix in Vinahoo, and we implement it experimentally in the source code of this search engine. The thesis is organized into four chapters, introduced below.

Chapter 1, "Overview of Web data mining and search engines", presents the basic research topics of Web mining and the advantages and difficulties of the field. The last part of the chapter presents the basic components of a search engine.

Chapter 2, "Some typical page-ranking algorithms", first gives an overview of the problem of ranking Web pages in a search engine and of the basic PageRank algorithm. Its final part analyzes the need to speed up PageRank computation in a search engine and presents several algorithms that improve on PageRank, together with an evaluation.

Chapter 3, "An algorithm using the Block structure by connected components", studies a solution that exploits the structure of the Web. It introduces the concept, some theoretical issues, and the proof and evaluation of the CCP algorithm, which uses this structure.

Chapter 4, "An improved page-ranking solution for the Vinahoo search engine", introduces the PageRank component in Vinahoo's indexing module, the improvements, their implementation, and an evaluation of the experimental results.

Chapter 1. Overview of Web Data Mining and Search Engines


1.1. Web Data Mining
1.1.1. Overview of Web data mining
Today, the rapid growth of the Internet and of intranets has produced an enormous volume of hypertext data (Web data). In recent years the Internet has become one of the main channels for science, economic information, commerce, and advertising. One reason for this growth is the low cost of maintaining a Web page on the Internet. Compared with other services such as publishing news or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users around the world with faster updates. The Internet can be seen as an encyclopedia with diverse content and forms. It is like a virtual society that holds information on every aspect of economic and social life, presented as text, images, sound, and so on.

[Figure 1. Web mining, not an easy task (diagram: WWW, knowledge)]

However, the Internet is a dynamic multimedia environment that combines heterogeneous databases, programs, and user interfaces. Text mining is clearly only a small area within this environment. Mining data on the Internet, usually called Web mining, must not only mine the content of text pages but also exploit the resources mentioned above and the relationships among them. Web mining, the intersection of data mining and the World Wide Web, is growing strongly and covers many research areas such as databases, artificial intelligence, information retrieval, and others. Agent-based technologies, concept-based information retrieval, information retrieval using case-based reasoning, and document ranking based on hyperlink features are usually regarded as subfields of Web mining.

Web mining has not yet been defined precisely, and its topics continue to expand. Nevertheless, we can understand Web mining as extracting the components that are of interest or judged useful, together with potential information, from resources or activities related to the World Wide Web [9]. Figure 2 shows a classification of the familiar research areas in Web mining. Web mining is usually divided into three main areas: Web content mining, Web structure mining, and Web usage mining.

[Figure 2: Topics in Web mining. Web data mining divides into Web content mining (Web page content mining; search result optimization), Web structure mining, and Web usage mining (access pattern mining; analysis of individual trends).]

1.1.2. Areas of Web data mining
1.1.2.1. Web content mining
Most of the knowledge on the World Wide Web is contained in text. Web content mining covers the processes that extract knowledge from the content of text pages or from their descriptions. There are two strategies: mining the content of Web pages directly, and improving the content-search capability of other tools such as search engines.
- Web page content mining (Web page summarization): retrieves information from structured documents, hypertext documents, or semi-structured documents. This area is mainly concerned with mining the content of the documents themselves.

- Search result optimization (search engine result summarization): searching within the results. After a search engine has found the Web pages that satisfy the user's request, an equally important task remains: sorting and filtering the results by how well they match that request. This process usually uses information such as the page title, URL, content type, and the links in the page to classify the results and return the best subset to the user.

1.1.2.2. Web structure mining
Thanks to the connections between hypertext documents, the World Wide Web can hold more information than the text inside the documents alone. For example, the links pointing to a Web page indicate how important that page is, while the links leaving a page point to pages related to the topic of the current page. Web structure mining covers the processes that extract knowledge from the way Web pages are organized and from the links between their references.

1.1.2.3. Web usage mining
Web usage mining, or Web log mining, extracts useful information from Web access logs. Web servers normally record and accumulate data about user interactions each time they receive an access request. Analyzing the access logs of different Web sites makes it possible to predict how users interact with the Web and to understand the structure of the Web, and thus to improve the design of related systems. There are two main directions in Web usage mining: general access pattern tracking and customized usage tracking.
- General access pattern tracking: analyzes Web logs to learn access patterns and trends. These analyses can help restructure sites into more effective groupings, identify the most effective advertising positions, and target particular product advertisements at particular users for the greatest effect.
- Customized usage tracking: aims to specialize Web sites for classes of users. The information displayed, the depth of the site structure, and the format of the resources can all be specialized automatically for each user over time, based on that user's access patterns.

1.1.3. Difficulties of Web mining
The World Wide Web is a very large, widely distributed system that provides information on every field: science, society, commerce, culture, and more. The Web is a rich resource for data mining. The following observations show that it also poses major challenges for data mining technology [6].

1.1.3.1. The Web is too large to organize into a data warehouse for data mining
Traditional databases are not very large and are usually stored centrally, whereas the Web is very large, reaching terabytes, changes continuously, and is distributed over a great many computers around the world. Studies of the size of the Web [6] report the following figures. More than one billion Web pages are currently available to users on the Internet. With an average page size of 5 to 10 KB, the total size of the WWW is on the order of 10 terabytes. The growth rate of the Web is equally striking: the number of Web pages doubled over the last two years and is expected to keep growing over the next two. Many organizations and communities put most of their public information on the Web. Building a data warehouse to store, copy, or integrate the data on the Web is therefore nearly impossible.

1.1.3.2. Web pages are far more complex than traditional text documents
Data in traditional databases is usually homogeneous (in language, format, and so on), whereas Web data is not homogeneous at all. Web data comes in many languages (both natural languages for content and programming languages), many formats (text, HTML, PDF, images, sound, and so on), and many kinds of tokens (email addresses, links, ZIP codes, telephone numbers, and so on). In other words, Web pages lack a uniform structure. The Web is regarded as a vast digital library, yet the enormous number of documents in this library is not arranged by any particular standard or category. This is a major challenge for finding the information one needs in such a library.

1.1.3.3. The Web is a highly dynamic information source
The Web changes not only in size; the information inside the pages themselves is updated continuously. According to a study [6] of more than 500,000 Web pages over more than four months, 23% of the pages changed daily, and within a little over ten days 50% of the pages in the domain studied disappeared, meaning that their URLs no longer existed. News sites, stock markets, advertising companies, and Web service centers update their pages regularly. In addition, linkage information and access records are updated as well.

1.1.3.4. The Web serves a large and diverse user community
The Internet currently connects about 50 million workstations [6], and the user community is still expanding rapidly. Each user has different knowledge, concerns, and preferences. Most users, however, do not have a good understanding of the structure of the information network or a clear sense of how to search. They easily get "lost" in the enormous mass of data on the network, or grow bored when a search returns only fragments of little use.

1.1.3.5. Only a very small part of the information on the Web is truly useful
According to statistics [6], 99% of the information on the Web is useless to 99% of Web users. Meanwhile, the parts of the Web that are of no interest still get mixed into the results returned by a search. How, then, should we mine the Web to obtain the pages of highest quality by the user's own standards?

These points show how searching a traditional database differs from searching the Internet. The challenges above have spurred research on mining and using the resources of the Internet.

1.1.4. Advantages of Web mining
Alongside these challenges, Web mining also has advantages:
1. The Web consists not only of pages but also of links pointing from one page to another. When an author creates a link from his or her page to a page A, this means A is useful for the topic under discussion. The more links a page receives from other pages, the more important it is. Link information therefore provides rich information about the relevance, quality, and structure of Web content, and is a major resource for Web mining.
2. A Web server usually records a Weblog entry for every page access. The entry includes the URL, the IP address, and a timestamp. Weblog data provides rich information about dynamic Web pages. By analyzing these access logs we can derive statistics on Web access trends, the structure of the Web, and much other useful information.

1.2. Overview of Search Engines


1.2.1. Need
As noted above, the Internet is a huge and complex store of information. Information on Web pages is diverse in both content and form. This diversity and sheer volume, however, give rise to information overload. As the content and number of Web pages on the Internet change and grow by the day and by the hour, finding information becomes ever harder for users. For any given user only a very small part of the information is useful; for example, some people care only about Sports and Culture pages and have little interest in Economics. People cannot search by hand for the addresses of the Web pages that hold the information they need, so a utility is required that manages the content of Web pages and finds the addresses of pages whose content matches the searcher's request.

Definition [14]: A search engine is a system built to receive users' search requests (usually a set of keywords), analyze them, search the database of pages downloaded from the Web, and return the relevant Web pages to the user. Concretely, the user submits a query, in its simplest form a list of keywords, and the search engine returns a list of Web pages that are relevant to or contain those keywords. In more complex cases the query is a whole document, a passage, or a summary of a document. Typical search engines today include Yahoo, Google, AltaVista, and ASPSeek.

1.2.2. Basic structure and operation of a search engine
A search engine can be regarded as an example of an information retrieval (IR) system [14]. An IR system usually focuses on improving the effectiveness of retrieval by using term-based indexing [11] and query reformulation techniques [12]. Term-based document processing first extracts the keywords of a document using a prebuilt dictionary, a set of stop words, and stemming rules [14] that reduce word forms to their roots. Once the keywords are extracted, systems usually apply TF-IDF (or a variant) to determine the importance of each keyword. A document can therefore be represented by a set of keywords and their weights. The similarity between a query and a document is the dot product of their two keyword vectors. To express how well documents match the query, the retrieved documents are presented as a list ranked by their similarity to the query.

Figure 3 shows the basic structure of a search engine. Although each search engine has its own implementation in practice, they all rest on the operating mechanism described here.
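The TF-IDF weighting and dot-product similarity described above can be sketched as follows. This is a minimal illustration, not the indexing code of any particular search engine: the function names `tf_idf_vectors` and `dot`, the toy documents, and the choice of relative term frequency times `log(N / df)` are all assumptions made for the example, and stop-word removal and stemming are omitted.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def dot(u, v):
    """Dot product of two sparse keyword vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Toy collection, already tokenized (illustrative data only).
docs = [["web", "mining", "web"], ["search", "engine", "ranking"], ["web", "search"]]
vecs = tf_idf_vectors(docs)
query = {"web": 1.0, "mining": 1.0}

scores = [dot(query, v) for v in vecs]
# Document indices sorted by decreasing similarity to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the first document matches both query terms, the third matches one, and the second matches none, so it receives a score of zero.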

[Figure 3: Structural model of a search engine (diagram labels recovered: page repository, crawler)]

- Crawler module: the programs that supply the data on which the search engine operates. This module browses the Web, following the links on Web pages to collect page content. The crawlers are given a set of starting URLs, read the corresponding pages, and parse them to find the URLs they contain. The crawler then passes the resulting URLs to the crawl control unit, which decides which URLs to visit next and sends them back to the crawler. After downloading pages, the crawlers store the results in the page repository. This process repeats until a termination condition is reached.
- Indexing module: scans the content of the downloaded pages and indexes them by recording, in the database, the URLs of the pages that contain each word. The result is a very large index table. With this table, the search engine can return the URLs of all pages that match a user's keyword query. The indexer usually builds a content index and a structure index. The content index holds information about the words that appear in the pages. The structure index represents the links between pages and so exploits an important characteristic of Web data. It is a graph of nodes and arcs: each node corresponds to a Web page, and each arc from node A to node B corresponds to a hyperlink from page A to page B.
- Collection analysis module: operates according to the properties of the query module. For example, if the query engine only needs to search within certain sites, or within one domain, the work is faster and more efficient. This module uses the two basic indexes (content and structure) supplied by the indexing module, together with information about the keywords in each page and ranking information, to build utility indexes.
- Query engine: receives the users' search requests. It regularly queries the database, in particular the index tables, to return the list of documents that satisfy a request. Because the number of Web pages is very large and users typically enter only a few keywords, the result set is usually very large. The ranking component therefore sorts these documents by how well they match the search request and displays the results to the user. When users want to find pages on some topic, they enter a few related keywords. The query module looks these keywords up in the content index to find the URLs that contain them. It then passes the pages to the ranking module, which sorts the results in decreasing order of relevance between page and query and displays them to the user.
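The two indexes built by the indexing module can be illustrated with a small sketch. The page data, the `example` URLs, and the helper names `build_indexes` and `lookup` are hypothetical and chosen only for this illustration; a real indexer would also tokenize properly, remove stop words, and persist the tables.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text and outgoing links.
pages = {
    "http://a.example/": {
        "text": "Web mining and search",
        "links": ["http://b.example/", "http://missing.example/"],
    },
    "http://b.example/": {
        "text": "search engine ranking",
        "links": ["http://a.example/"],
    },
    "http://c.example/": {
        "text": "web usage mining",
        "links": ["http://a.example/"],
    },
}


def build_indexes(pages):
    """Build a content index (term -> URLs) and a structure index (URL -> linked URLs)."""
    content_index = defaultdict(set)
    structure_index = {}
    for url, page in pages.items():
        for term in page["text"].lower().split():
            content_index[term].add(url)
        # Keep only arcs whose target page is in the repository.
        structure_index[url] = [t for t in page["links"] if t in pages]
    return content_index, structure_index


def lookup(content_index, query):
    """Return the URLs that contain every keyword of the query (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(content_index.get(terms[0], set()))
    for term in terms[1:]:
        result &= content_index.get(term, set())
    return result


content_index, structure_index = build_indexes(pages)
hits = lookup(content_index, "web mining")
```

The structure index produced here is exactly the kind of link graph that the ranking algorithms of the next chapter take as input.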


Chapter 2. Some Typical Page-Ranking Algorithms


2.1. The Web Page Ranking Problem in Search Engines
The first part of this chapter gives an overview of the problem of ranking Web pages in search engines. The later parts analyze the PageRank, Modified Adaptive PageRank, and Topic-sensitive PageRank algorithms as applied to computing ranks for Web pages.

2.1.1. Need
Today users can search the Internet for information on every aspect of human society. However, because the amount of information on the Internet is enormous and grows rapidly by the day and by the hour, finding and delivering the information users actually care about within an acceptable time has become an urgent problem. The technology for building Internet search tools (typically search engines) must be improved continuously so that it satisfies users both in search time and in how well the result pages match the search request.

When a user enters a group of keywords, the search engine carries out the search and returns a number of Web pages. The number of pages related to the keywords may run into the tens of thousands, while the user is interested in only a few of them, so it is necessary to find the pages that best meet the request and put them first. This is the ranking task of the search engine: sorting the result pages in decreasing order of importance.

A measure of the "relevance" of a retrieved Web page to the user's request must therefore be defined [1,10]. Two approaches are of interest. The first uses the importance of the Web page (determined through a quantity called the page rank) as its relevance to the request. Most studies accept the hypothesis that a Web page is important if many other pages link to it. In this case the rank is computed solely from the links between Web pages. Most search engines use page rank as the relevance of search results, with PageRank and Modified Adaptive PageRank [10] as typical algorithms. The second approach holds that relevance to the user's question depends not only on the page rank above but also on the relationship between the page content and the content of the question; the typical algorithm is Topic-sensitive PageRank [15,16]. Some studies that exploit page content when assessing the relevance of retrieved pages are also discussed in [4,7].

2.1.2. Importance of a Web page
Several methods are used to measure the importance of Web pages.
a. Keywords in the document: a Web page is considered relevant if it contains some or all of the keywords in the query. The frequency of the keywords in the page is also taken into account.
b. Similarity to the query: a user can specify the information sought with a short query or with longer phrases. The similarity between the user's short or long description and the content of each downloaded page can be used to determine the relevance of that page.
c. Similarity to the seed pages: the pages corresponding to the seed URLs are used to measure the relevance of each downloaded page. The seed pages are combined into a single large document, and the closeness of this document to the page being crawled is used as the score of that page.
d. Classifier score: a classifier can be trained to identify the pages that fit the information need or task. Training uses the seed pages (or Web pages designated as relevant in advance) as positive examples. The trained classifier then assigns binary (0 or 1) or continuous scores to the crawled pages.
e. Link-based importance: a crawler can use algorithms such as PageRank or HITS, which provide an estimate of the importance of each crawled page. More simply, it can use only the number of links to the page.


2.2. The Basic PageRank Algorithm


In [8], Page and Brin proposed a method for computing page ranks. The method rests on the idea that if there is a link from page A to page B, then the importance of page A also affects the importance of page B. Intuitively, a Web page linked from Yahoo! will certainly be more important than one linked only from some obscure page. Given a set of Web pages and the links between them, we treat the set as a graph whose vertices are the Web pages and whose edges are the links between them.

2.2.1. Simple PageRank
We first introduce a simple definition of PageRank that expresses the importance of each Web page in terms of links, before turning to a method used in practice. Assume that the Web pages form a connected graph, meaning that from any page there is a path of links to any other page in the graph. PageRank is then computed as follows. Number the Web pages 1, 2, ..., n. Let N(i) be the number of outgoing links of page i, and let B(i) be the set of Web pages that link to page i. The PageRank value r(i) of page i is then

$$r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}$$

If we let r = [r(1), r(2), ..., r(n)] be the PageRank vector, whose components are the ranks of the corresponding Web pages, these equations can be rewritten in matrix form as $r = A^T r$, where A is the n × n matrix with entries

$$a_{ij} = \begin{cases} \dfrac{1}{N(i)} & \text{if there is a link from } i \text{ to } j \\ 0 & \text{otherwise} \end{cases}$$

The PageRank vector r is therefore an eigenvector of the matrix $A^T$. As shown above, the importance, or rank, of a page under PageRank is obtained by analyzing the links that point to it: if important pages link to a page, that page is very likely important too. However, the rank of a page depends on the ranks of the pages that link to it, so to compute one rank we must already know the others, which could lead to a costly infinite recursion. Working with the rank vector removes this problem: the ranks are obtained by computing an eigenvector of $A^T$. Linear algebra offers several methods for computing eigenvectors; one that is convenient and applicable to the PageRank vector is the iterative (power) method. The computation proceeds as follows:
1. s ← an arbitrary vector
2. r ← $A^T$ s
3. If ||r − s|| < ε, stop: r is the PageRank vector. Otherwise set s ← r and return to step 2.

In other words, we first assign an arbitrary value to the global PageRank vector, then multiply it by $A^T$ to obtain a new vector, say r(1), then multiply r(1) by $A^T$ again, and continue until the sequence {r(i)} converges, that is, until every component of r(i) changes by less than a chosen tolerance ε. The resulting PageRank vector is a reasonable representation of the Web pages under consideration.
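The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the graph is given as a list of out-link lists (a representation chosen here, not taken from the source), every page has at least one out-link, the graph is strongly connected and aperiodic so that the iteration converges, and the norm is the L1 norm. The function name `simple_pagerank` and the toy graph are hypothetical.

```python
def simple_pagerank(out_links, eps=1e-10, max_iter=1000):
    """Power iteration r <- A^T s for the simple (undamped) PageRank.

    out_links[i] lists the pages that page i links to; each list must be
    non-empty (assumption: no page without out-links).
    """
    n = len(out_links)
    s = [1.0 / n] * n  # step 1: an arbitrary starting vector
    for _ in range(max_iter):
        # Step 2: r = A^T s, i.e. each page passes s[i] / N(i) to its targets.
        r = [0.0] * n
        for i, targets in enumerate(out_links):
            share = s[i] / len(targets)
            for j in targets:
                r[j] += share
        # Step 3: stop when the L1 distance between iterates is below eps.
        if sum(abs(a - b) for a, b in zip(r, s)) < eps:
            return r
        s = r
    return s


# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (strongly connected, aperiodic).
toy_graph = [[1, 2], [2], [0]]
toy_ranks = simple_pagerank(toy_graph)
```

For this toy graph the fixed point can be checked by hand: r(0) = r(2), r(1) = r(0)/2, and the ranks sum to 1, which gives (0.4, 0.2, 0.4).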


2.2.2. PageRank in practice
In practice, Web pages do not always form a connected graph. The WWW contains many pages that no other page links to (Web leak) or that link to no other page (Web sink). Consider pages without links to other pages, as in the example of Figure 4: the cluster (4,5) is a Web sink, so a user who reaches nodes (4,5) is stuck there, and the pages then have ranks r1 = r2 = r3 = 0 and r4 = r5 = 0.5. If node 5 and its links are removed, page 4 becomes a Web leak, and the rank of every page that leads to 4 becomes 0. This does not match reality, because every Web page has some importance of its own; even a personal page matters at least to its owner. The PageRank formula must therefore be modified by adding a damping factor d. The modified formula is:

$$r(i) = d \sum_{j \in B(i)} \frac{r(j)}{N(j)} + \frac{1 - d}{n}$$

Adding the damping factor d (usually chosen as d = 0.85) supplies extra PageRank value to pages that have no outgoing links. Note also that with d = 1 the formula reduces to the simple PageRank case. Page and Brin [8] also showed that these values converge fairly quickly: within about 100 iterations the result is obtained with a reasonably small error.
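The damped formula can be sketched by a small change to the power iteration. This is an illustrative sketch only: the out-link list representation, the function name `pagerank`, and the toy graph are assumptions, and the code applies the formula exactly as written, so a page without out-links would simply pass on no rank (the ranks would then sum to less than 1).

```python
def pagerank(out_links, d=0.85, eps=1e-10, max_iter=1000):
    """Iterate r(i) = d * sum_{j in B(i)} r(j) / N(j) + (1 - d) / n."""
    n = len(out_links)
    s = [1.0 / n] * n
    for _ in range(max_iter):
        # Every page starts from the (1 - d) / n term of the formula.
        r = [(1.0 - d) / n] * n
        for j, targets in enumerate(out_links):
            for i in targets:
                r[i] += d * s[j] / len(targets)
        if sum(abs(a - b) for a, b in zip(r, s)) < eps:
            return r
        s = r
    return s


# Hypothetical graph: page 3 has no in-links; every page has an out-link.
graph = [[1, 2], [2], [0], [2]]
ranks = pagerank(graph)

# With d = 1 the formula reduces to the simple PageRank case.
undamped = pagerank([[1, 2], [2], [0]], d=1.0)
```

In this example page 3 receives no links, so its rank is exactly the (1 − d)/n term, whereas the undamped formula would give it rank 0.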

[Figure 4: An example of Web links (graph with nodes 1 to 5)]


2.3. Some Other Algorithms


This section addresses the computational performance of PageRank, including how the computation can be accelerated, and presents one current acceleration method, Modified Adaptive PageRank.

2.3.1. The need to speed up PageRank computation
PageRank is one of the most popular and most effective methods for searching information on the Internet. As seen above, PageRank evaluates the rank of pages through the links between them, which amounts to computing an eigenvector of the adjacency matrix that represents the Web pages. Given the enormous size of the WWW, this computation can take many days. It needs to be accelerated for two reasons:
- Results are needed early so that they can be passed to other components of the same search engine; computing the PageRank vector quickly reduces the idle time of those components.
- Current research focuses on ranking by criteria that matter to individual users, so many PageRank vectors must be computed, each oriented toward a different topic. Computing many vectors also requires that each one be computed quickly.

Speeding up the computation is difficult because of the size of the WWW. A way to make it faster was therefore introduced in [11]. The method starts from the following observation: when the program is implemented and run, the importance values of different pages converge at different rates; some converge quickly, others slowly. We can take advantage of the pages that converge first: once their importance has converged, it need not be recomputed. This removes redundant computation and raises the efficiency of the system. The method is an improvement of PageRank.


2.3.2. The Modified Adaptive PageRank algorithm
2.3.2.1. The Adaptive PageRank algorithm
Suppose the computation of the PageRank vector has reached iteration k. We need to compute

$$x^{(k+1)} = A x^{(k)} \qquad (*)$$

Let C be the set of Web pages that have converged to some tolerance ε and N the set of pages that have not yet converged. We split A into two submatrices: $A_N$, of size m × n, is the adjacency matrix representing the links of the m pages that have not converged, and $A_C$, of size (n − m) × n, is the adjacency matrix representing the links of the (n − m) pages that have converged. Similarly, we split $x^{(k)}$ into two vectors: $x_N^{(k)}$, holding the components of $x^{(k)}$ that have not converged, and $x_C^{(k)}$, holding the components that have converged. A and $x^{(k)}$ can then be written as

$$x^{(k)} = \begin{pmatrix} x_N^{(k)} \\ x_C^{(k)} \end{pmatrix}, \qquad A = \begin{pmatrix} A_N \\ A_C \end{pmatrix}$$

and equation (*) becomes

$$\begin{pmatrix} x_N^{(k+1)} \\ x_C^{(k+1)} \end{pmatrix} = \begin{pmatrix} A_N \\ A_C \end{pmatrix} \begin{pmatrix} x_N^{(k)} \\ x_C^{(k)} \end{pmatrix}$$

Because the components of $x_C^{(k)}$ have converged, $x_C^{(k+1)}$ need not be computed. The work is reduced because $A_C x^{(k)}$ is no longer evaluated; only $x_N^{(k+1)} = A_N x^{(k)}$ has to be computed.

2.3.2.2. The Modified Adaptive PageRank algorithm
Adaptive PageRank is faster because it removes redundant computation by not recomputing values that have converged. This section looks more closely at how to reduce redundant computation further. The matrix A can be written more explicitly as

$$A = \begin{pmatrix} A_{NN} & A_{CN} \\ A_{NC} & A_{CC} \end{pmatrix}$$


Here $A_{NN}$ is the adjacency matrix representing the links from non-converged pages to non-converged pages, $A_{CN}$ is the adjacency matrix representing the links from converged pages to non-converged pages, and similarly for the other blocks $A_{NC}$ and $A_{CC}$. Because $x_C$ and $A_{CN} x_C$ no longer change after iteration k, having converged, equation (*) can be rewritten as

$$x_N^{(k+1)} = A_{NN} x_N^{(k)} + A_{CN} x_C^{(k)}$$
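The idea of skipping converged components can be sketched as follows. This is an illustration of the principle, not the implementation evaluated in [11]. Several choices are assumptions of this sketch: it uses the damped update of Section 2.2.2, a page is treated as converged only after its value has changed by less than `eps` for `patience` consecutive iterations (a guard against freezing a page that happens to change very little in a single step), and the names `adaptive_pagerank`, `patience`, and `updates` are hypothetical.

```python
def adaptive_pagerank(out_links, d=0.85, eps=1e-10, patience=3, max_iter=1000):
    """Damped PageRank that stops updating pages once they have converged.

    Returns the rank vector and the number of per-page updates performed.
    """
    n = len(out_links)
    in_links = [[] for _ in range(n)]
    for j, targets in enumerate(out_links):
        for i in targets:
            in_links[i].append(j)

    x = [1.0 / n] * n
    stable = [0] * n        # consecutive iterations with a change below eps
    active = set(range(n))  # set N: pages that have not converged yet
    updates = 0
    for _ in range(max_iter):
        if not active:
            break
        new_x = x[:]        # pages in set C keep their converged value
        for i in active:
            # Only the rows of the non-converged pages are evaluated.
            new_x[i] = (1.0 - d) / n + d * sum(
                x[j] / len(out_links[j]) for j in in_links[i]
            )
            updates += 1
        for i in list(active):
            stable[i] = stable[i] + 1 if abs(new_x[i] - x[i]) < eps else 0
            if stable[i] >= patience:
                active.discard(i)  # move page i from N to C
        x = new_x
    return x, updates


web = [[1, 2], [2], [0], [2]]
adaptive_ranks, adaptive_updates = adaptive_pagerank(web)
# Reference run: an unreachable patience means no page is ever frozen.
plain_ranks, plain_updates = adaptive_pagerank(web, patience=10**9, max_iter=200)
```

Freezing introduces a small approximation error, so the tolerance and the freezing rule trade accuracy against the number of updates saved.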

Because A is split into smaller blocks, the amount of computation can be reduced considerably. The experimental results in [11] show that the computation can be made about 30% faster. According to [11], reducing the computations of PageRank lets us compute faster, but this is not yet the final goal.

2.3.3. Topic-sensitive PageRank
PageRank is the ranking method currently used by the Google search engine. However, it considers only the links and ignores the content of the pages that contain them, which can lead to misleading search results. What is needed is a method that is as fast as PageRank but also takes page content into account through the page's "topic". Moreover, exploiting the user's interest in Web pages when computing the relevance of a page to the user's question makes the result more meaningful. Taher H. Haveliwala [15,16] proposed a new method to meet these requirements, Topic-sensitive PageRank. The author uses the notion of "context scope" to express the user's interest. In [4], an algorithm for finding Web pages with similar content offers another approach that considers page content in the search problem.

The algorithm has two steps, outlined below. In the first step, the Web pages in the database are divided into classes by topic $c_1, c_2, \ldots, c_n$; let $T_j$ be the set of Web pages on topic $c_j$. Each class corresponds to a topic PageRank vector whose components are the PageRank values of the pages in the class. The topic PageRank vector is computed as usual, except that instead of the vector $v = [1/N]_{N \times 1}$ the algorithm uses the vector $v = v_j$, where

$$v_{ji} = \begin{cases} \dfrac{1}{|T_j|} & i \in T_j \\ 0 & i \notin T_j \end{cases} \qquad (1)$$

Let $D_j$ be the keyword vector, consisting of all the keywords in the documents of the topics; $D_{jt}$ is the number of occurrences of keyword t in all the documents of topic $c_j$.

The second step is carried out at query time. Given a query q, let q' be the context scope of q. Roughly, the context scope is defined as follows. For an ordinary query (typed into a dialog box), q' is q itself. If the query q is issued by highlighting the keyword q in a Web page u, then q' contains the keywords of u, including q. The probability that q' belongs to each topic is then computed with a Bayes classifier, where (i) the training set consists of the pages listed under the topics; (ii) the input is the query or its context scope; and (iii) the output is the probability that the input belongs to each topic.

Some of these probabilities are given by the following formulas. Let $q'_i$ be the i-th keyword in the context q'. For each $c_j$, the probability that q' belongs to $c_j$ is

$$P(c_j \mid q') = \frac{P(c_j)\, P(q' \mid c_j)}{P(q')} \propto P(c_j) \prod_i P(q'_i \mid c_j) \qquad (2)$$

where $P(q'_i \mid c_j)$ is computed from the keyword vector $D_j$ determined in step 1. The value $P(c_j)$ is either the same for all topics (equiprobable topics) or estimated statistically from the users' references to the Web pages of each topic.

According to [15,16], with $rank_{jd}$ denoting the rank of document d given by the vector $PR(d, v_j)$, the PageRank vector of topic $c_j$, the query-dependent importance $s_{qd}$ is computed as

$$s_{qd} = \sum_j P(c_j \mid q') \cdot rank_{jd} \qquad (3)$$
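Formulas (2) and (3) can be illustrated with a small sketch. All data below (topics, keyword counts, per-topic ranks) is invented for the example, the helper names `topic_probabilities` and `query_sensitive_score` are hypothetical, and the add-one smoothing used to estimate $P(q'_i \mid c_j)$ from the counts $D_{jt}$ is an assumption of this sketch, since the text does not specify the estimator.

```python
def topic_probabilities(query_terms, term_counts, priors):
    """Formula (2): P(c_j | q') proportional to P(c_j) * prod_i P(q'_i | c_j)."""
    vocab = {t for counts in term_counts.values() for t in counts}
    scores = {}
    for topic, counts in term_counts.items():
        total = sum(counts.values())
        p = priors[topic]
        for term in query_terms:
            # Add-one smoothing (assumption of this sketch).
            p *= (counts.get(term, 0) + 1) / (total + len(vocab))
        scores[topic] = p
    norm = sum(scores.values())  # plays the role of P(q')
    return {topic: p / norm for topic, p in scores.items()}


def query_sensitive_score(doc, topic_probs, topic_ranks):
    """Formula (3): s_qd = sum_j P(c_j | q') * rank_jd."""
    return sum(p * topic_ranks[topic].get(doc, 0.0) for topic, p in topic_probs.items())


# Invented example data: keyword counts D_jt and topic PageRank values rank_jd.
term_counts = {
    "sports": {"football": 8, "goal": 2},
    "finance": {"stock": 7, "market": 3},
}
priors = {"sports": 0.5, "finance": 0.5}  # equiprobable topics
topic_ranks = {
    "sports": {"d1": 0.6, "d2": 0.4},
    "finance": {"d1": 0.2, "d2": 0.8},
}

probs = topic_probabilities(["football"], term_counts, priors)
score_d1 = query_sensitive_score("d1", probs, topic_ranks)
score_d2 = query_sensitive_score("d2", probs, topic_ranks)
```

With this data the query is assigned mostly to the sports topic, so document d1, which ranks higher under the sports vector, ends up with the higher combined score even though d2 ranks higher under the finance vector.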


Chapter 3. An Algorithm Using the Block Structure by Connected Components
The first part of this chapter presents some basic concepts of PageRank computation (Section 2) and from them proposes a new method that we call CCP (Connected Components in PageRank). The theory and proofs underlying the method are discussed in detail in Section 3.

3.1. The Concept of Block Structure by Connected Components
3.1.1. Analysis of the PageRank algorithm
Chapter 2 of this thesis presented the PageRank method for computing page ranks. This section analyzes the PageRank algorithm in more depth, expressed in the language of graphs. The method rests on the accepted idea that if there is a link from page A to page B, then the importance of page A also affects the importance of page B. Suppose the database contains n Web pages numbered 1 to n. For any page u, let $B_I(u)$ be the set of pages that link to u, and let $N_u$ be the number of outgoing links of u. Let $\pi_u$ be the page rank (PageRank) of u. The PageRank of page u is then given by:
$$\pi_u = \sum_{i \in B_I(u)} \frac{\pi_i}{N_i} \qquad (1)$$

In graph terms, let G = (V, E), where V is the set of Web pages to be ranked (V has n pages, indexed 1, 2, ..., n) and E is the edge set, E = {(i, j) | there is a link from page i to page j}. The algorithm assumes that the Web graph is connected in the sense that for any two pages i and j there is always a path from i to j and back. The adjacency matrix representing G can then be built as follows:

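$$P = (p_{ij})_{n \times n} \qquad (2)$$

where

$$p_{ij} = \begin{cases} \dfrac{1}{N_i} & \text{if there is a link from } i \text{ to } j \\ 0 & \text{if there is no link from } i \text{ to } j \end{cases} \qquad (3)$$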

Equation (1), rewritten in matrix form, then becomes:

$$\pi = \pi P \qquad (4)$$

In other words, this is the computation of an eigenvector of the matrix P, namely the eigenvector associated with the eigenvalue λ = 1. However, this eigenvector is guaranteed to exist only if P satisfies certain strict properties of a Markov transition matrix. For real Web pages the assumption of a connected graph is unreasonable, because there are always pages that link to no other page. The row of the adjacency matrix P corresponding to such a page consists entirely of zeros, and under that condition P has no stable stationary probability distribution, in other words no PageRank eigenvector. For a stable stationary distribution of the Markov matrix P to exist (see [12]), P must therefore be modified appropriately. Define the matrix

$$\tilde{P} = \alpha P + \frac{1 - \alpha}{n} J \qquad (5)$$

where 0 < α < 1 (α is usually chosen as 0.85) and J is the matrix whose entries are all 1. Then, instead of an eigenvector of P, we compute the eigenvector $\pi = (\pi_1, \ldots, \pi_n)$ of $\tilde{P}$ given by

$$\pi = \pi \tilde{P} \qquad (6)$$

and the components of the vector $\pi = (\pi_1, \ldots, \pi_n)$ sum to 1:

$$\sum_{i=1}^{n} \pi_i = 1 \qquad (7)$$

In other words, $\pi \mathbf{1} = 1$, where $\mathbf{1}$ is the column vector whose entries are all 1. This holds because $\pi$ is a stationary probability distribution of the Markov transition matrix, so its components must sum to 1. To compute the eigenvector, the simple iteration (power) method is used; it can give good results after a little more than 20 iterations [1,2]. With this method it is easy to see that P is a very sparse matrix, so the computation involves many redundant operations. The next section discusses the concept of Block structure by connected components in the Web link matrix, and how connected components can be used to reduce these redundant computations.
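The iteration on equation (6) can be illustrated with a short Python sketch. It is not taken from the thesis or from Vinahoo: the function name `pagerank`, the dictionary representation of links, and the stopping tolerance are assumptions made for illustration, and the sketch assumes that every page has at least one outgoing link so that P is row-stochastic.

```python
def pagerank(links, n, alpha=0.85, tol=1e-10, max_iter=100):
    """Power iteration for pi = pi * (alpha * P + (1 - alpha) / n * J).

    links maps each page index (0 .. n-1) to the list of pages it links to.
    Hypothetical helper: assumes every page has at least one outgoing link.
    Returns the rank vector and the number of iterations performed.
    """
    out_deg = [len(links.get(i, [])) for i in range(n)]
    if min(out_deg) == 0:
        raise ValueError("this sketch requires at least one out-link per page")

    pi = [1.0 / n] * n                      # start from the uniform distribution
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Contribution of the (1 - alpha) / n * J term: the same for every page.
        teleport = (1.0 - alpha) / n * sum(pi)
        new = [teleport] * n
        # Contribution of alpha * P: page i passes pi[i] / N_i along each link.
        for i in range(n):
            share = alpha * pi[i] / out_deg[i]
            for j in links[i]:
                new[j] += share
        delta = sum(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if delta < tol:
            break
    return pi, iteration


# Small example: page 0 links to 1 and 2, page 1 to 2, page 2 back to 0.
ranks, iterations = pagerank({0: [1, 2], 1: [2], 2: [0]}, 3)
```

Only the non-zero entries of P are visited in the inner loop, which is one simple way to avoid the redundant operations on a sparse matrix mentioned above.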


3.2. Some theoretical issues

When studying the Markov model [13], we observed that in probability theory the states can be divided into classes. States that can reach one another are considered to belong to the same class. The notion of a class of states in a Markov model is quite similar to the notion of a connected component in graph theory. Moreover, representing the Web graph by an adjacency matrix leads to the idea of using a Block structure by connected components in page-rank computation, with the following advantages:

- When the whole matrix P is used to compute the eigenvector, as in the PageRank method [1,2], the number of operations is quite large. As is well known, matrix multiplication takes $O(n^3)$ time, where n is the number of web pages. When the adjacency matrix is instead arranged into blocks, one per connected component, the computation time drops considerably. Indeed, suppose there are k connected components. For each block the computation time is less than $O(n_{max}^3)$, where $n_{max} = \max\{n_1, \ldots, n_k\}$, so the total time is less than $k \cdot O(n_{max}^3)$, which is much smaller than the time needed for the whole matrix when $n_{max}$ is small compared with n. The proposed method therefore has a better theoretical running time than the PageRank method. Furthermore, if it is combined with supporting methods such as MAP or extrapolation [9,10], the computation time is reduced significantly.

- Using connected components can genuinely reduce the number of iterations, unlike blockwise matrix computation or the method that splits the matrix into smaller parts by host [11]. Blockwise matrix computation reduces the running time because it can be parallelized, but it does not reduce the number of iterations. Splitting the matrix into Blocks by host can reduce the number of iterations, but it works in two steps and incurs extra per-block processing cost; in addition, the host-based blocks are still fairly large. CCP overcomes these points: its implementation does not need as many techniques as blockwise matrix computation, and it reduces the number of iterations because the connected-component blocks are smaller than the blocks obtained by the host criterion [11].


- The proposed method requires finding the connected components, and the connected components of a graph can be found easily in $O(n + m)$ time, where n is the number of vertices and m the number of edges of the graph [8]. The cost of finding the connected components is therefore negligible.

- Once the computation is carried out per connected-component block, it can be parallelized: different connected components can be assigned to different processors. This parallelization is very natural and requires no complicated technique, and it can speed up the computation considerably.

Remarks

The proposed method thus has the following basic advantages (compared with some previously studied methods):

- It reduces the computation time, because the iterative computation on the matrix is reduced by partitioning the Web graph into connected components.
- It can easily be combined with supporting methods for matrix computation.
- It can be parallelized naturally, without requiring many programming techniques.
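The $O(n + m)$ component search mentioned above can be sketched with a breadth-first traversal. This is only an illustration under stated assumptions: the function name `connected_components` and the edge-list input are hypothetical, and links are treated as undirected here, so that two pages fall in the same block whenever a link exists between them in either direction.

```python
from collections import deque


def connected_components(n, edges):
    """Group vertices 0 .. n-1 into connected components in O(n + m) time.

    edges is a list of (i, j) links. Assumption of this sketch: link
    direction is ignored when deciding which pages belong together.
    """
    adjacency = [[] for _ in range(n)]
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)

    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search from an unvisited vertex collects one component.
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adjacency[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        components.append(sorted(component))
    return components


blocks = connected_components(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 3)])
```

Each vertex is enqueued once and each edge is examined twice, which gives the linear bound. The resulting lists of vertices are independent of one another, so each could be handed to a separate worker.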

When the adjacency matrix is split into smaller matrix blocks, one for each connected component, the following questions have to be resolved during the computation:

- How should the page ranks be computed so that the result is correct?
- How can the computation on the connected components be made efficient?

We examine these questions from a theoretical standpoint in the next section.

3.2.1. Computing page ranks with Blocks by connected components

As mentioned above, when the computation is carried out on connected components, how are the PageRank values of the pages, in other words the eigenvector, obtained?


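Suppose the graph G has k connected components. The matrix P can then be written as k blocks placed on the main diagonal:

$$P = \begin{pmatrix} P_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & P_k \end{pmatrix} \tag{8}$$

where $P_i$ is the $n_i \times n_i$ adjacency matrix of the i-th connected component, $i = 1, \ldots, k$, and

$$\sum_{i=1}^{k} n_i = n.$$

Define the matrices

$$\tilde{P}_i = \alpha P_i + \frac{1 - \alpha}{n_i} J_i \tag{9}$$

for $i = 1, \ldots, k$, where $J_i$ is the $n_i \times n_i$ matrix whose entries are all 1. The eigenvector equation for each block matrix $P_i$ is

$$\pi_i = \pi_i \tilde{P}_i \tag{10}$$

where, from here on, $\pi_i$ denotes the PageRank vector of the i-th block, whose components sum to 1.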

Theorem: Under the assumptions above (5, 6, 7, 8, 9, 10),

$$\pi = \left( \frac{n_1}{n}\pi_1, \ldots, \frac{n_k}{n}\pi_k \right) \tag{11}$$

is the eigenvector of the matrix $\tilde{P}$.
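The statement of the theorem can be checked numerically on a small example. The sketch below is illustrative only: the two blocks, the helper name `pagerank`, and the fixed iteration count are assumptions, and every page is given at least one outgoing link so that each block matrix is row-stochastic. It computes the ranks of the whole graph with $\tilde{P}$ from (5), the ranks of each block with $\tilde{P}_i$ from (9), and compares the former with the combination (11).

```python
def pagerank(links, alpha=0.85, iterations=200):
    """Power iteration for pi = pi * (alpha * P + (1 - alpha) / n * J).

    links[i] lists the pages that page i links to (hypothetical input format);
    every page must have at least one outgoing link.
    """
    n = len(links)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - alpha) / n] * n       # teleport term, since sum(pi) == 1
        for i, targets in enumerate(links):
            for j in targets:
                new[j] += alpha * pi[i] / len(targets)
        pi = new
    return pi


# Two connected components with local page indices.
block1 = [[1, 2], [2], [0]]                 # n_1 = 3 pages
block2 = [[1], [0]]                         # n_2 = 2 pages

# The whole graph: pages of block2 are renumbered after those of block1.
offset = len(block1)
whole = block1 + [[j + offset for j in targets] for targets in block2]

n1, n2 = len(block1), len(block2)
n = n1 + n2

global_rank = pagerank(whole)
# Equation (11): scale each block's vector by n_i / n and concatenate.
combined = ([n1 / n * r for r in pagerank(block1)]
            + [n2 / n * r for r in pagerank(block2)])

max_diff = max(abs(a - b) for a, b in zip(global_rank, combined))
```

On this example `max_diff` is at the level of floating-point rounding, which is what the theorem predicts.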

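Proof:

To prove that $\pi = \left( \frac{n_1}{n}\pi_1, \ldots, \frac{n_k}{n}\pi_k \right)$ is the eigenvector of the matrix $\tilde{P}$, we must show that $\pi$ satisfies the eigenvector equation (6). Note that

$$\tilde{P} = \alpha P + \frac{1 - \alpha}{n} J = \alpha \begin{pmatrix} P_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & P_k \end{pmatrix} + \frac{1 - \alpha}{n} J.$$

Substituting (11) into equation (6) gives:

$$\left( \frac{n_1}{n}\pi_1, \ldots, \frac{n_k}{n}\pi_k \right) = \left( \frac{n_1}{n}\pi_1, \ldots, \frac{n_k}{n}\pi_k \right) \begin{pmatrix} \alpha P_1 + \frac{1-\alpha}{n} J_{11} & \frac{1-\alpha}{n} J_{12} & \cdots & \frac{1-\alpha}{n} J_{1k} \\ \frac{1-\alpha}{n} J_{21} & \alpha P_2 + \frac{1-\alpha}{n} J_{22} & \cdots & \frac{1-\alpha}{n} J_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1-\alpha}{n} J_{k1} & \frac{1-\alpha}{n} J_{k2} & \cdots & \alpha P_k + \frac{1-\alpha}{n} J_{kk} \end{pmatrix} \tag{12}$$

where $J_{ij} = (1)_{n_i \times n_j}$ is the $n_i \times n_j$ matrix whose entries are all 1. Multiplying out the right-hand side of (12) and considering the first component, we need:

$$\frac{n_1}{n}\pi_1 = \frac{n_1}{n}\pi_1 \left( \alpha P_1 + \frac{1-\alpha}{n} J_{11} \right) + \frac{n_2}{n}\pi_2 \frac{1-\alpha}{n} J_{21} + \ldots + \frac{n_k}{n}\pi_k \frac{1-\alpha}{n} J_{k1} \tag{13}$$

For each block $P_i$, the corresponding eigenvector $\pi_i$ satisfies equation (10):

$$\pi_i = \pi_i \tilde{P}_i = \pi_i \left( \alpha P_i + \frac{1-\alpha}{n_i} J_i \right) \tag{14}$$

$$\Leftrightarrow \frac{n_i}{n}\pi_i = \frac{n_i}{n}\pi_i \left( \alpha P_i + \frac{1-\alpha}{n_i} J_i \right) \tag{15}$$

In the particular case i = 1 we have:

$$\frac{n_1}{n}\pi_1 = \frac{n_1}{n}\pi_1 \left( \alpha P_1 + \frac{1-\alpha}{n_1} J_1 \right) \tag{16}$$

From (13) and (16), (13) holds if and only if:

$$\frac{n_1}{n}\pi_1 \left( \alpha P_1 + \frac{1-\alpha}{n_1} J_1 \right) = \frac{n_1}{n}\pi_1 \left( \alpha P_1 + \frac{1-\alpha}{n} J_{11} \right) + \frac{n_2}{n}\pi_2 \frac{1-\alpha}{n} J_{21} + \ldots + \frac{n_k}{n}\pi_k \frac{1-\alpha}{n} J_{k1} \tag{17}$$

Since $J_1 = J_{11}$, cancelling the common term $\frac{n_1}{n}\alpha\pi_1 P_1$ and dividing by $\frac{1-\alpha}{n}$ gives

$$\Leftrightarrow \pi_1 J_{11} = \frac{n_1}{n}\pi_1 J_{11} + \ldots + \frac{n_k}{n}\pi_k J_{k1}$$

$$\Leftrightarrow \mathbf{1}_{n_1} = \frac{\sum_{i=1}^{k} n_i}{n} \cdot \mathbf{1}_{n_1}$$

where $\mathbf{1}_{n_1}$ is the row vector with $n_1$ columns whose entries are all 1 (each $\pi_i J_{i1}$ equals $\mathbf{1}_{n_1}$ because the components of $\pi_i$ sum to 1). Since $\sum_{i=1}^{k} n_i = n$, this is

$$\Leftrightarrow \mathbf{1}_{n_1} = \mathbf{1}_{n_1}.$$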

Hence (13) holds. The remaining components, for $i = 2, \ldots, k$, are handled in the same way and also satisfy the equation (Q.E.D.).

3.2.2. The CCP algorithm

The method that uses the Block structure by connected components for page-rank computation can be divided into the following steps:

- Step 1: Split the graph into connected components, with their corresponding adjacency matrices.
- Step 2: Compute PageRank for the pages in each connected component by computing the eigenvector of that component's adjacency matrix.
- Step 3: Assemble the final page ranks from the ranks obtained in Step 2, as in formula (11).

For the computation to be efficient, we want to work only with the average case, that is, with connected components that have an average number of vertices. Two cases therefore have to be handled: connected components whose number of vertices is above, or below, the average threshold. To handle them, two threshold values are introduced, called min and max. For every $P_i$, $i = 1, \ldots, k$, we require $min \le n_i \le max$. If $n_i < min$, connected components of similar size are merged to obtain a component whose number of vertices satisfies the condition; if $n_i > max$, the connected component must be split into smaller components that satisfy the condition. However, when components are split or merged, the ranks of the pages change, and this has to be dealt with. We propose the following treatment of the two cases:

- Merging: This is the simple case. The PageRank of the pages is computed on, and taken from, the merged component. Here it is enough to merge until the number of vertices of the component just exceeds min. (Several values of min can be tried in order to find the best one.)

- Splitting: In this case the theory is harder than in the previous one. When a large connected component is split into smaller components, these smaller parts are related to one another in certain ways, and these relations must be represented or used so that the final result is correct. Suppose one connected component is split into m blocks (each block being connected in itself), with some links between the blocks (since they were split from one large connected component).

Step 1: Compute the ranks of all pages in each small connected component as usual.

Step 2: Treat each small connected component as a node of a graph. This gives a graph with m nodes, together with the number of links into each node and the total number of links in the graph. The rank of each node in this new graph is the number of links into that node divided by the total number of links in the graph. Why choose the node ranks this way? With this choice, the ranks in the new graph sum to 1 (satisfying the condition on a sum of probabilities).

Step 3: Each of the m blocks has its own PageRank vector, and the components of each such vector sum to 1, so taken together the components of the m vectors sum to m. Therefore, to make the components of the PageRank vectors of the m blocks sum to 1, the representative PageRank of each block is multiplied by the PageRank of each of its pages.
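Steps 2 and 3 of the splitting case can be sketched as follows. This is one possible reading of the description, not the thesis implementation: the function name `combine_split_blocks` and its arguments are hypothetical, and the sketch assumes that only links between different blocks count as links of the m-node graph, and that the per-page ranks from Step 1 are already available.

```python
def combine_split_blocks(block_of, local_rank, links):
    """Rescale per-block PageRank values so that all pages sum to 1.

    block_of[p]   -- index (0 .. m-1) of the block that page p was placed in
    local_rank[p] -- PageRank of page p computed inside its own block (Step 1)
    links         -- list of (i, j) page-level links

    Assumption of this sketch: the rank of a block is the number of links
    entering it from other blocks divided by the number of all such links.
    """
    m = max(block_of) + 1
    incoming = [0] * m
    for i, j in links:
        if block_of[i] != block_of[j]:      # keep only inter-block links
            incoming[block_of[j]] += 1
    total = sum(incoming)
    if total == 0:
        raise ValueError("no links between blocks; nothing to weight by")

    # Step 2: rank of each block-node, summing to 1 over the m blocks.
    block_rank = [count / total for count in incoming]
    # Step 3: multiply each page's local rank by the rank of its block.
    return [block_rank[block_of[p]] * local_rank[p]
            for p in range(len(block_of))]


# Pages 0, 1 form block 0 and pages 2, 3 form block 1.
final_rank = combine_split_blocks(
    block_of=[0, 0, 1, 1],
    local_rank=[0.5, 0.5, 0.4, 0.6],
    links=[(0, 1), (1, 0), (2, 3), (3, 2), (1, 2), (3, 0), (2, 0)],
)
```

Because the block ranks sum to 1 and each local vector sums to 1, the combined values also sum to 1, which is the property Step 3 aims for.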

The coefficients min and max, as well as the average number of vertices per block, will be determined experimentally in order to find the optimal values. The next chapter introduces the model of the Vinahoo search engine and applies the Modified Adaptive PageRank algorithm experimentally to the page-ranking problem in Vinahoo.


Chapter 4. An improved page-ranking solution for the Vinahoo search engine

4.1. PageRank computation in Vinahoo

In [1], the author describes in detail the architecture, database and source code of the Vinahoo search engine. Here we concentrate on the indexing module as it relates to the execution of PageRank computation.

4.1.1. Execution model of the indexing module

In the Vinahoo search engine, PageRank computation for web pages is integrated into the index module. First, this module runs the crawler, which downloads the content of the web pages. This process interacts with three main objects: the Internet, the SQL database management system, and the store of temporary binary files. Next, the index process builds the inverted index for the newly downloaded URLs and stores it in data structures that make later keyword-based URL lookup by the search module convenient; it also computes the PageRank values of these new pages. The process takes as input the content of the newly updated web pages in the temporary binary files, together with the information in the old binary files. It sorts the URLs by keyword, merges them with the old content of the URLs in the binary files, and finally computes the ranks of the web pages from the links between them.

4.1.2. The PageRank computation process in Vinahoo

4.1.2.1. Structure of the URL queue in Vinahoo

Vinahoo uses a hash table as the queue that stores the URLs to be indexed. The reason is that a hash table makes it very convenient to look up an element in a collection, so checking whether a URL is already in the queue is easy. The URLs in the queue are grouped by site: the URLs of one site are placed in a FIFO list called CSiteUrls. Grouping URLs by site has the advantage of reducing DNS name resolution, the number of connections that must be made to a server, and the number of times the robots.txt file has to be read, which significantly reduces crawling time. When a URL belonging to some site has to be put into the queue, it is appended to the end of the URL list of that site. The whole queue is a hash table of CSiteUrls, with a pointer to the site currently being crawled.
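A minimal sketch of such a site-grouped queue is given below. It is an illustration in Python rather than Vinahoo's own code: the class name `SiteQueue`, its methods, and the use of the URL's network location as the site key are assumptions, and the 100-URL array chunks of the real structure are replaced by a plain `deque` per site.

```python
from collections import deque
from urllib.parse import urlsplit


class SiteQueue:
    """Hash table of per-site FIFO lists with a pointer to the current site."""

    def __init__(self):
        self._sites = {}        # site name -> deque of URLs (insertion-ordered)
        self._queued = set()    # fast "is this URL already queued?" check
        self._current = None    # site currently being crawled

    def __contains__(self, url):
        return url in self._queued

    def push(self, url):
        """Append url to the end of its site's list; return False if queued."""
        if url in self._queued:
            return False
        site = urlsplit(url).netloc
        self._queued.add(url)
        self._sites.setdefault(site, deque()).append(url)
        if self._current is None:
            self._current = site
        return True

    def pop(self):
        """Return the head URL of the current site, or None if empty."""
        while self._sites:
            urls = self._sites.get(self._current)
            if urls:
                url = urls.popleft()
                self._queued.discard(url)
                return url
            # Current site is exhausted: drop it and move to the next site.
            self._sites.pop(self._current, None)
            self._current = next(iter(self._sites), None)
        return None


queue = SiteQueue()
queue.push("http://a.example/1")
queue.push("http://b.example/1")
queue.push("http://a.example/2")
order = [queue.pop(), queue.pop(), queue.pop(), queue.pop()]
```

In this sketch all queued URLs of one site are served before the queue moves on, which mirrors the stated goal of reusing one DNS lookup, connection and robots.txt check per site.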


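When the next URL to crawl is needed, the URL at the head of the current site's list is taken out. The structure of this queue is as follows:

[Diagram: a sequence of CSiteUrls elements, with the pointer m_current referring to the current element.]

Figure 6: Structure of the CSiteQueue queue in Vinahoo

Here, each CSiteUrls is a singly linked list of arrays containing URLs that belong to the same site, and a CUrlLinks is an array of 100 consecutive URLs.

[Diagram: a chain of CUrlLinks elements, delimited by the pointers m_first and m_last.]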

Figure 7: Structure of a CSiteUrls element

4.1.2.2. The PageRank computation process in Vinahoo

In the Vinahoo search engine, PageRank is computed after the web pages in the database have been indexed. In Vinahoo the PageRank quantity is treated as the rank value itself, and it is used directly as the criterion for ordering the pages shown to the user. The formula used is the simple PageRank formula. PageRank is computed for the pages one after another, and the values are stored in a binary file. Page ranks in Vinahoo are computed with a recursive formula. The function that performs PageRank is: CalsRanks( CSQLDatabase *database)

4.1.2.3. The need to speed up PageRank computation

As mentioned in the previous chapters, ranking the web pages in the database is a very important part of a search system. The quality of this module directly affects the later steps of the search process. One of its most important properties is time efficiency. If PageRank computation is not fast, it increases the idle time of the other components of the search engine, and the search system cannot deliver good search quality to its users. Speeding up PageRank computation also makes it feasible to compute additional component vectors, moving towards rankings that take the user's own criteria into account.

4.1.3. Proposed solution

4.1.3.1. Implementing Modified Adaptive PageRank

This section presents our proposals for implementing the Modified Adaptive PageRank ranking algorithm in the Vinahoo search engine. At present, the index module of Vinahoo implements only the simple PageRank algorithm. We replace the PageRank module with a module that implements Modified Adaptive PageRank, for the following reasons:

- Modified Adaptive PageRank is built on top of the PageRank algorithm, which is already implemented in Vinahoo, so integrating it into Vinahoo is likely to succeed.
- Modified Adaptive PageRank speeds up the computation and reduces redundant computation.

To make the most of the existing Vinahoo classes [1], we [2] do not build the $n \times n$ matrix A as in the theory; instead we keep the recursive formula and add the Modified Adaptive PageRank method to it. There are two ways to mark that a page has converged:

- Option 1: Add an attribute to UrlRank and, when storing, separate the converged and not-yet-converged URLs into two different files.
- Option 2: When storing, also store a convergence attribute converged, equal to 1 if the page has converged and 0 otherwise.

Comparing the two options, storing the ranks in two files appears to use less memory than storing an extra convergence attribute. However, the PageRank values of all URLs need to be stored in urlID order, which makes reading and writing the data simple and convenient. If two separate files were used, each PageRank value would have to be accompanied by its urlID. That would not save memory, would not simplify the computation, and would not make the best use of the existing Vinahoo source code.


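We therefore choose to add an attribute that marks the convergence of a UrlRank, and when the PageRank of a URL is stored we also store the value converged, which records whether the rank of the page has converged:

$$\text{converged} = \begin{cases} 1 & \text{if the PageRank value of the page has converged} \\ 0 & \text{otherwise} \end{cases}$$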

4.2. Experimental results and evaluation

a. Experimental setup

The experiments were carried out with the Vinahoo search engine on a computer with a Pentium 4 HT 3.0 GHz processor and 512 MB of RAM. They were run on the real Internet, with web pages taken from the website http://www.ets.org/ (the home page of the educational organization responsible for the TOEFL English test). After crawling, the database held 2368 web pages with a total of 37490 links. The algorithms tested were standard PageRank, MAP and CCP.

b. Experimental results

Some results of the test runs follow. The time and iteration figures being compared are averages over 3 runs of each algorithm.

[Chart: PageRank computation time. Vertical axis: time (s), from 0 to 3; horizontal axis: algorithm (PageRank, MAP, CCP). Bar values recoverable from the chart: 2.59 and 1.65.]

Figure 8: Chart of the PageRank computation time of the 3 algorithms


The chart above shows that the PageRank computation time of the MAP algorithm is about 36% lower than that of the standard PageRank algorithm.
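As a check on this figure, and assuming that the two values in the chart, 2.59 s and 1.65 s, belong to standard PageRank and MAP respectively (the pairing is inferred from the text rather than stated explicitly), the relative reduction is:

$$\frac{2.59 - 1.65}{2.59} = \frac{0.94}{2.59} \approx 0.363 \approx 36\%$$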
[Chart: PageRank computation iterations. Vertical axis: number of iterations, from 0 to 18; horizontal axis: algorithm (PageRank, MAP, CCP). Bar values recoverable from the chart: 17 and 10.]

Figure 9: Chart of the number of iterations needed to compute PageRank for the 3 algorithms

Queries were then run against the 3 algorithms. The results obtained for two queries, TOEFL and TEST, are given in the table below.

Table 7. Results obtained for the two queries TOEFL and TEST with each algorithm

| Query | Rank | PageRank | MAP |
|-------|------|----------|-----|
| TOEFL | 1 | ets.org/stoefl.html | ets.org/stoefl.html |
| TOEFL | 2 | ets.org/ellrsc/css/twocolumns.css | ets.org/ellrsc/css/twocolumns.css |
| TOEFL | 3 | ets.org/toefl/contact.html | ets.org/toefl/contact.html |
| TOEFL | 4 | ets.org/ell/testpreparation/toeflindex.html | ets.org/ell/testpreparation/toeflindex.html |
| TOEFL | 5 | ets.org/itp/ | ets.org/itp/ |
| TOEFL | 6 | ets.org/toefl/nextgen/index.html | ets.org/toefl/nextgen/index.html |
| TOEFL | 7 | ets.org/legal/copyright.html | ets.org/legal/copyright.html |
| TOEFL | 8 | ets.org/scoreitnow/index.html | ets.org/scoreitnow/index.html |
| TOEFL | 9 | ets.org/itp/academics/ | ets.org/itp/academics/ |
| TOEFL | 10 | ets.org/criterion/ell/academics/index.html | ets.org/ell/testpreparation/toefl/ |
| TEST | 1 | ets.org/ell/testpreparation/toeflindex.html | ets.org/ell/testpreparation/toeflindex.html |
| TEST | 2 | ets.org/etseurope/testinfo.html | ets.org/etseurope/testinfo.html |
| TEST | 3 | ets.org/praxis/prxdsabl.html | ets.org/praxis/prxdsabl.html |
| TEST | 4 | ets.org/praxis/prxorder.html | ets.org/praxis/prxorder.html |
| TEST | 5 | ets.org/criterion/elementary.html | ets.org/criterion/elementary.html |


Conclusion

1. Results achieved in the thesis

Through studying the research literature on the page-ranking problem in search engines, the thesis has achieved the following results:

- It presents and analyzes in detail the basic PageRank algorithm applied to page ranking in search engines.
- It presents the main content, including an analysis, of two algorithms, Modified Adaptive PageRank and Topic-sensitive PageRank, which aim to improve the performance of PageRank computation.
- It analyzes in detail the operation of Vinahoo, an open-source Vietnamese search engine.
- An important result of the thesis is the proposal of a solution that uses the Block structure by connected components in the Web link matrix for computing PageRank in the Vinahoo search engine.
- In addition, the thesis implements the Modified Adaptive PageRank algorithm in Vinahoo and obtains some fairly promising initial results.

However, because of the limited time available for the thesis, the program does not yet fully implement the CCP algorithm, which uses the Block structure by connected components in the Web link matrix for computing PageRank in the Vinahoo search engine.

2. Future work

Applying Web structure mining to the page-ranking problem in search engines is becoming a research area with great potential. The next research directions of the thesis are:

- To complete the implementation of the CCP algorithm, which uses the Block structure by connected components in the Web link matrix for computing PageRank in the Vinahoo search engine.
- To study in depth and optimize the values of min, max and the average, as well as the way the blocks are divided, and to develop the method towards parallelization.


References

[1] Bùi Quang Minh. The Vinahoo search engine (in Vietnamese). Research report of the special science project at ĐHQGHN level, code QG-02-02, 2002.
[2] Đỗ Thị Diệu Ngọc, Nguyễn Hoài Nam, Nguyễn Yến Ngọc, Nguyễn Thu Trang. An improved page-ranking solution for the Vinahoo search engine (in Vietnamese). Chuyên san Các công trình nghiên cứu - triển khai Viễn thông và CNTT, Tạp chí Bưu chính Viễn thông, 14, 4-2005, 65-71.
[3] Phạm Thị Thanh Nam, Bùi Quang Minh, Hà Quang Thụy. A solution for finding similar Web pages in the Vinahoo search engine (in Vietnamese). Tạp chí Tin học và Điều khiển học, 20(4), 293-304, 2004.
[4] Andrew Y. Ng, Alice X. Zheng, and Michael I. Jordan. Stable algorithms for link analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference. ACM, 2001.
[5] Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, November 1999.
[6] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001, pp. 435-443.
[7] Kir Kolyshkin. Vinahoo Manual. Available at http://www.Vinahoo.org. 2002.
[8] The Anatomy of large scale Hypertextual Web Search Engine.
[9] Page, L., Brin, S., Motwani, R. and Winograd, T. 1998. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford University.
Raymond Kosala, Hendrik Blockeel. Web Mining Research: A Survey. Department of Computer Science, Katholieke Universiteit Leuven, Heverlee, Belgium, pp. 601-602.
[10] Sepandar Kamvar, Taher Haveliwala, and Gene Golub (2003). Adaptive Methods for the Computation of PageRank. Technical report, Stanford University.
[11] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, Gene H. Golub (2003). Exploiting the Block Structure of the Web for Computing PageRank. Technical report, Stanford University.
[12] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrapolation methods for accelerating PageRank computations. In Proceedings of the Twelfth International World Wide Web Conference, 2003.
[13] Sheldon Ross. Introduction to probability models, 8th Edition. Academic Press, January 2003.
[14] Shian-Hua Lin, Meng Chang Chen, Jan-Ming Ho. ACIRD: Intelligent Internet Document Organization and Retrieval. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 3, May/June 2002.
[15] Taher H. Haveliwala. Topic-Sensitive PageRank. WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA (ACM 1581134495/02/0005).
[16] Taher H. Haveliwala. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search, 2003.
[17] Taher H. Haveliwala. Efficient Computation of PageRank. Technical report, Stanford University, 1999.
[18] http://www.google.com
