Information Retrieval on the World Wide Web
5.1
.
.
.
,
HTML,
.
5.1.1 The World Wide Web
. (web pages)
.
.
, , , ,
. ,
, , , .
.
,
.
.
.
5.1.2
. :
,
, .
.
.
,
.
.
.
, ,
.
.
,
.
.
, (Web Search Engines).
[K97, BR99]:
crawler, ,
(links).
crawler
,
crawlers .
(indexer),
crawler. ,
crawler.
,
(inverted index). indexer
, ...
(query processor),
.
, ,
.
.
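As a rough illustration of how the indexer and the query processor fit together, the following minimal sketch builds an inverted index over a few pages and answers a conjunctive query. The sample pages, the whitespace tokenization, and the function names `build_inverted_index` and `search` are assumptions made for the example, not part of any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of pages that contain every query term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical page contents, standing in for what a crawler collected.
pages = {
    1: "jaguar car dealer",
    2: "jaguar animal habitat",
    3: "car rental",
}
index = build_inverted_index(pages)
print(search(index, "jaguar car"))
```

A real indexer would also record term positions and formatting, which this sketch leaves out.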
Crawler
crawler
.
,
.
. crawler
, ,
(servers) .
Indexer
(indexer), ,
,
,
(URLs), ,
.
.
.
,
URLs .
.
.
.
( ).
.
,
, , ...
,
...
.
.
.
.
, ,
, .
, ,
.
.
,
.
.
" ".
.
jaguar
.
, .
.
""
"" "".
.
.
.
.
(hyperlinks) .
.
.
5.1.4 HTML
HTML HyperText Markup Language
.
tag ().
HTML (tags)
.
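An HTML document begins with the tag <html> and ends with </html>.
An HTML document has two parts: the head, enclosed in <head> </head>,
and the body, enclosed in <body> </body>.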
. ,
, scripting
.
<title> </title>.
,
, ,
.
HTML .
, .
,
.
Headings come in six levels, H1 to H6.
A first-level heading is enclosed in <h1> </h1>, and likewise for the other levels.
.
.
,
, , .
Bold text is enclosed in <b> </b>
or <strong> </strong>.
Italic text is enclosed in <i> </i>.
Underlined text is enclosed in <u> </u>.
.
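Images are placed in an HTML page with the img element.
It is written as the single tag <img>, with no closing tag.
The general form of the img tag is
: <img src=imagepath/imagename.jpeg alt=comment>.
The src attribute of img gives the location of the image file.
The alt attribute gives a text comment for the image,
shown when the image itself cannot be displayed.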
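Hyperlinks are written in HTML with the tags <a> </a>,
in the form: <a href=urlpage>hyperlink - text</a>.
The href attribute of the a element gives the URL of the page the link points to.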
,
.
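The tags described in this section are the ones an indexer typically looks for. The sketch below uses Python's standard `html.parser` module to pull the title, the headings, the images, and the hyperlinks with their anchor text out of a small page. The class name `PageFeatures` and the sample page are assumptions for the example, and nested tags (for instance a link inside a heading) are deliberately not handled.

```python
from html.parser import HTMLParser

TEXT_TAGS = ("title", "h1", "h2", "h3", "h4", "h5", "h6", "a")


class PageFeatures(HTMLParser):
    """Collect title, headings, images and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []   # list of (tag, text)
        self.images = []     # list of (src, alt)
        self.links = []      # list of (href, anchor text)
        self._open = None    # tag whose text is being collected
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.images.append((attrs.get("src"), attrs.get("alt")))
        elif tag in TEXT_TAGS:
            self._open = tag
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        text = "".join(self._text).strip()
        if tag == "title":
            self.title = text
        elif tag == "a":
            self.links.append((self._href, text))
        else:
            self.headings.append((tag, text))
        self._open = None


sample = """<html><head><title>Jaguar cars</title></head>
<body><h1>Models</h1>
<a href=urlpage>hyperlink - text</a>
<img src=imagepath/imagename.jpeg alt=comment>
</body></html>"""
parser = PageFeatures()
parser.feed(sample)
print(parser.title, parser.headings, parser.links, parser.images)
```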
5.1.5 Links
.
() .
term-based search engines
.
,
:
.
(in-degree) .
.
.
.
.
,
.
, .
,
, .
.
5.2.
Link-based
.
The best known are HITS, by Kleinberg
([K97]), which is used in CLEVER, and the algorithm of Brin and Page
([BP98]), which is used in GOOGLE.
.
5.2.1 HITS (Hyperlink-Induced Topic Search)
authoritative ()
. authority
authority
hub .
.
,
.
, .
term-based
,
, .
, ,
.
, ,
(query) .
, ,
.
:
. . Netscape JDK 1.1 code-signing API;
. .
JAVA.
. . java.sun.com
.
Scarcity Problem ( ),
.
,
.
.
Abundance Problem ( ),
.
,
.
authority (-) ,
.
.
.
,
Harvard, www.harvard.edu, authoritative
Harvard.
Harvard www.harvard.edu
,
.
authority .
, ,
. ,
authoritative Yahoo!, Excite, AltaVista,
. .
Honda
Toyota .
, ,
.
.
.
, .
Apple
IBM,
.
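A collection of hyperlinked pages can be viewed as a directed graph G = (V, E),
whose nodes are the pages and in which a directed edge (p, q) means that
page p contains a link to page q. The out-degree of a node p is the number of
pages p points to, and the in-degree of p is the number of pages that point to p.
From a graph G we can isolate smaller regions, or subgraphs.
If W is a subset of V, then
G[W] denotes the graph induced on W.
The nodes of G[W] are the pages in W,
and its edges are all the links between pages in W.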
.
,
, .
Q
. .
,
authorities
.
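Ideally, we want a collection of pages, S, with the following properties:
i. S is relatively small.
ii. S is rich in relevant pages.
iii. S contains most of the strongest authority pages.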
S
.
authorities.
.
,
R, t ( t 200)
term-based AltaVista
.
(i), t.
(ii) R Q
.
,
Q .
S,
R, .
authority R
R. authorities
R
R.
.
Subgraph(σ, E, t, d)
  σ: a query string
  E: a text-based search engine
  t, d: natural numbers
  Let Rσ denote the top t results of E on σ
  Set Sσ := Rσ
  For each page p that belongs to Rσ
    Let Γ+(p) denote the set of all pages p points to
    Let Γ-(p) denote the set of all pages pointing to p
    Add all pages in Γ+(p) to Sσ
    If |Γ-(p)| <= d then
      Add all pages in Γ-(p) to Sσ
    Else
      Add an arbitrary set of d pages from Γ-(p) to Sσ
  End
  Return Sσ
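The Subgraph procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the link structure is given as two dictionaries (`out_links` and `in_links`, names chosen here), the root set is assumed to have been fetched already from the text-based engine, and "an arbitrary set of d pages" is taken to be the first d in sorted order so that the result is reproducible.

```python
def subgraph(root_set, out_links, in_links, d):
    """Expand a root set R into the base set S.

    root_set  -- iterable of page ids (the top t results of the text engine)
    out_links -- dict: page -> set of pages it points to
    in_links  -- dict: page -> set of pages pointing to it
    d         -- maximum number of in-linking pages added per root page
    """
    root = list(root_set)
    base = set(root)
    for p in root:
        # Every page that p points to joins the base set.
        base |= set(out_links.get(p, ()))
        # At most d of the pages pointing to p join the base set.
        pointing = sorted(in_links.get(p, ()))
        base |= set(pointing[:d])
    return base


# Hypothetical link data for two root pages.
out_links = {"r1": {"a", "b"}, "r2": {"b"}}
in_links = {"r1": {"x", "y", "z"}, "r2": set()}
print(sorted(subgraph(["r1", "r2"], out_links, in_links, d=2)))
```

Capping the in-linking pages at d keeps a single very popular root page from flooding the base set.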
1
S R,
R
R, d S
R. ,
,
S, .
S base set .
S G[S],
.
, ,
,
. G[S] .
(transverse)
domain, (intrinsic)
domain. domain ,
.
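The distinction between transverse and intrinsic links can be sketched as a simple filter over the edges of G[S]. The example assumes that "domain" can be approximated by the host name of the URL, which is a simplification, and the function names are chosen for illustration.

```python
from urllib.parse import urlsplit


def is_transverse(source_url, target_url):
    """True if the two URLs have different host names (a transverse link)."""
    return urlsplit(source_url).hostname != urlsplit(target_url).hostname


def keep_transverse(edges):
    """Drop intrinsic links (same host) and keep only transverse ones."""
    return [(p, q) for p, q in edges if is_transverse(p, q)]


# Hypothetical edges: one link inside a site, one link between two sites.
edges = [
    ("http://www.harvard.edu/a", "http://www.harvard.edu/b"),
    ("http://www.harvard.edu/a", "http://java.sun.com/"),
]
print(keep_transverse(edges))
```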
,
.
.
G. , ,
.
Hubs and Authorities
G
.
, .
in-degree,
.
.
, .
,
. ,
, , ,
in-degree.
authoritative
in-degree,
. authoritative
, hub ,
authoritative .
authorities , in-degree.
authorities hub
5.
[Figure 2: hubs and authorities. Labels: in-degree, hubs, authorities.]
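Hubs and authorities reinforce each other:
a good hub is a page that points to many good authorities,
and a good authority is a page that is pointed to by many good hubs.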
, G.
,
hub authority ,
hubs authorities .
Each page p is therefore given an authority weight $x^{\langle p \rangle}$ and a hub weight
$y^{\langle p \rangle}$.
,
authorities hubs .
:
p x (authority),
y (hub). p
y
x.
x y, I O.
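The I operation updates the x-weights:
$$x^{\langle p \rangle} \leftarrow \sum_{q:(q,p)\in E} y^{\langle q \rangle}$$
The O operation updates the y-weights:
$$y^{\langle p \rangle} \leftarrow \sum_{q:(p,q)\in E} x^{\langle q \rangle}$$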
,
I .
x
, y .
.
Iterate(G, k)
  G: a collection of n linked pages
  k: a natural number
  Let z denote the vector (1, 1, 1, ..., 1) ∈ R^n
  Set x_0 := z.
  Set y_0 := z.
  For i = 1, 2, ..., k
    Apply the I operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i.
    Apply the O operation to (x'_i, y_{i-1}), obtaining new y-weights y'_i.
    Normalize x'_i, obtaining x_i.
    Normalize y'_i, obtaining y_i.
  End
  Return (x_k, y_k).
c
authority c hub .
:
Filter(G, k, c)
G: a collection of n linked pages
k,c: natural numbers
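The Iterate and Filter procedures can be sketched together in Python. The representation of G as a list of pages plus a list of (p, q) edges, the function names, and the choice to normalize so that the squared weights sum to 1 are assumptions of this example.

```python
import math


def normalize(weights):
    """Scale weights so their squares sum to 1 (left at zero if all are zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: (w / norm if norm else 0.0) for p, w in weights.items()}


def iterate(pages, edges, k):
    """Run k rounds of the I and O operations, normalizing after each round.

    pages -- list of page ids
    edges -- list of (p, q) pairs, meaning page p links to page q
    Returns (x, y): authority weights and hub weights as dicts.
    """
    x = {p: 1.0 for p in pages}
    y = {p: 1.0 for p in pages}
    for _ in range(k):
        # I operation: x[q] is the sum of y[p] over pages p pointing to q.
        new_x = {p: 0.0 for p in pages}
        for p, q in edges:
            new_x[q] += y[p]
        # O operation: y[p] is the sum of the new x[q] over pages q that p points to.
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


def filter_top(pages, edges, k, c):
    """Return the c pages with the largest authority and hub weights."""
    x, y = iterate(pages, edges, k)
    authorities = sorted(pages, key=lambda p: -x[p])[:c]
    hubs = sorted(pages, key=lambda p: -y[p])[:c]
    return authorities, hubs


# Hypothetical graph: two hub-like pages pointing to two authority-like pages.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
print(filter_top(pages, edges, k=20, c=1))
```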
k, Iterate,
.
. ,
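Let A be the adjacency matrix of G: entry (i, j) is 1 if G contains a link
from p_i to p_j, and 0 otherwise.
The I and O operations can then be written as:
$x \leftarrow A^{T} y$ and $y \leftarrow A x$.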
x
Iterate ATA y
.
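To make the link to eigenvectors explicit, the updates can be unrolled by hand. The short derivation below assumes the starting vector z = (1, ..., 1) of Iterate, writes A for the adjacency matrix of G, and ignores the normalization constants, which only rescale the vectors.

```latex
\begin{align*}
x_1 &\propto A^{T} z, & y_1 &\propto A x_1 \propto (A A^{T})\, z,\\
x_2 &\propto A^{T} y_1 \propto (A^{T} A)\, A^{T} z, & y_2 &\propto (A A^{T})^{2} z,\\
x_k &\propto (A^{T} A)^{k-1} A^{T} z, & y_k &\propto (A A^{T})^{k} z.
\end{align*}
```

Repeatedly multiplying by the same symmetric matrix is the power method, which is why, under the usual conditions for that method, x approaches the principal eigenvector of $A^{T}A$ and y that of $AA^{T}$.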
Iterate
x y, 20
.
. .
Iterate . ,
. ,
,
.
.
.
, ,
.
.
, ,
.
hubs authorities, .
authorities
.
.
,
.
. R
S.
hubs authorities
hubs authorities
hubs authorities G.
. :
jaguar,
.
, randomized algorithms.
,
,
abortion.
,
.
.
, hubs authorities
A ,
G.
hubs authorities S.
.
,
.
,
hubs authorities.
G .
.
,
.
, .
.
,
.
,
5.2.2 Google
- Google [BP98].
Google ,
. Google
.
PageRank.
.
PageRank
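Assume that page A is pointed to by pages T1, ..., Tn.
The parameter d is a damping factor that takes values between 0 and 1.
Usually d is set to 0.85. C(A) is the number of links going out of page A.
The PageRank of page A is then:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))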
PageRanks
, PageRanks
.
PageRanks
,
.
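The PageRank formula can be evaluated by simple repeated substitution. The sketch below is one such iteration, not Google's implementation: the dictionary representation of the links, the starting value of 1.0 for every page, the fixed number of iterations, and the decision that a page with no out-links passes on no rank are all assumptions of the example.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    out_links -- dict: page -> list of pages it links to
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}
        for page, targets in out_links.items():
            if not targets:
                continue  # assumption: a page without out-links passes on no rank
            share = d * pr[page] / len(targets)
            for t in targets:
                new_pr[t] += share
        pr = new_pr
    return pr


# Hypothetical three-page web: A and B link to each other, C links to A.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)
```

With the formula in this form, and when every page has at least one out-link, the values add up to the number of pages; dividing each value by the page count turns them into a distribution.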
PageRank .
.
,
(back) (browser).
.
PageRank
. d
.
d
.
.
PageRank
,
PageRank.
.
, ,
Yahoo!,
. PageRank
,
.
(anchor text)
Google
, .
. Google
.
. ,
. ,
, , ,
.
crawled
,
.
,
.
, .
Google ,
. ,
,
. ,
, .
. , HTML
.
Google crawling ,
, crawlers. URLserver,
(urls) crawlers.
StoreServer,
Repository. ID, docID,
.
(indexer)
(sorter). indexer repository,
.
hits. hit ,
. indexer hits barrels,
(forward index).
indexer .
(anchors).
, .
URLresolver anchors
docIDs. anchor
docID .
(links), docIDs.
PageRank .
sorter barrels, docID,
wordID, (inverted index).
sorter wordIDs . ,
DumpLexicon,
indexer
(searcher). searcher ,
PageRanks .
3.
[Figure 3: Google architecture]
crawled
.
Big Files
Big Files
. .
.
repository HTML ,
. docID,
oyw. repository
,
.
repository
crawling. repository 4.
[Figure 4: Repository]
Document Index
(document index)
docID.
, repository,
(checksum) .
crawled docinfo
,
URLs, .
URLs docIDs. checksums
docIDs .
docID URL, checksum URLs
checksums docID.
URLs docIDs .
URLresolver.
Lexicon
. .
,
, hash .
Hit Lists
hit list
,
, .
,
.
.
bytes hit.
hits: hits . hits
URL, , meta tag.
. hit bit
, 12 bits
.
bits. hit bit
, 7
hit, 4 bits hit 8
bits . hits 8
bits 4 bits anchor
4 bits docID anchor .
, ,
.
hits hits.
hits wordID forward
index docID inverted index.
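The compact encoding of a hit in two bytes can be illustrated with plain bit operations. The sketch packs a plain hit as 1 capitalization bit, 3 font-size bits and 12 position bits, the layout described in [BP98]. The order of the fields inside the 16 bits, the clamping of large positions to 4095, and the function names are assumptions of this example.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 bit + 3 bits + 12 bits."""
    if not 0 <= font_size < 7:
        # Font size 7 is the flag that marks a fancy hit.
        raise ValueError("font size must be between 0 and 6 for a plain hit")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp to the 12-bit range
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 100)
print(hit, unpack_plain_hit(hit))
```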
Forward Index
forward index ,
64 barrels. barrel wordIDs.
barrel, docID
. ,
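Google evaluates a query as follows:
1. Parse the query.
2. Convert the words into wordIDs.
3. Seek to the start of the doclist in the short barrels for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
Finally, sort the matching documents by rank and return the top k.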
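The evaluation steps listed above can be approximated in a few lines. This sketch is a simplification under several assumptions: the barrels are plain dictionaries from wordID to a sorted doclist, the switch from short to full barrels happens whenever the short barrels yield fewer than k matches, and `rank` is any scoring function supplied by the caller. All names are chosen for the example.

```python
import heapq


def match_all(barrel, word_ids):
    """Return the docIDs that appear in the doclist of every wordID."""
    doclists = [set(barrel.get(wid, ())) for wid in word_ids]
    return set.intersection(*doclists) if doclists else set()


def evaluate(query, lexicon, short_barrel, full_barrel, rank, k):
    """Return up to k docIDs matching all query words, best rank first."""
    words = query.lower().split()
    if not words or any(w not in lexicon for w in words):
        return []  # an unknown word cannot match any document
    word_ids = [lexicon[w] for w in words]
    matches = match_all(short_barrel, word_ids)
    if len(matches) < k:
        # Assumption: fall back to the full barrels when short ones give too few.
        matches |= match_all(full_barrel, word_ids)
    return heapq.nlargest(k, sorted(matches), key=rank)


# Hypothetical lexicon, barrels and scores.
lexicon = {"jaguar": 1, "car": 2}
short_barrel = {1: [30], 2: [30]}
full_barrel = {1: [10, 20, 30], 2: [20, 30, 40]}
scores = {10: 0.1, 20: 0.5, 30: 0.9, 40: 0.2}
print(evaluate("jaguar car", lexicon, short_barrel, full_barrel, scores.get, 2))
```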
Google
. hits
, .
hits anchor, PageRank
.
, ,
.
.
Google hits .
hit ,
.
. Google hits hits.
, hit, ,
.
IR
. IR PageRank
.
,
, . hits
hits
. hits
,
.
hits 10
.
hits,
. .
IR
.
,
.
, .
.
,
.
Clever Google
. , Google
, Clever
. Google .
Google
,
. Clever
, authoritative
. Clever ,
,
hub
.
5.2.3 SALSA
SALSA [LM00]
.
Markov,
TCK,
(TCK Tightly-Knit Community) .
.
,
. Kleinberg
.
.
.
y hub authorities,
hub authority . z
hub
authority y.
z
.
authority z hub
, y
y
z.
.
O SALSA ,
, Kleinberg.
authoritative , ,
, .
.
, hub authority,
Markov, .
,
,
.
,
hub authority.
G =(Vh, Va, E). Vh
hub Va authority .
G. s
, sh sa. s r
sh ra.
.
,
, , .
,
. 2
( hubs authorities).
hub authority
.
. authorities
,
hubs
.
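B(i) = {k : k → i} is the set of all pages that point to
i, that is, the pages that can be reached from i by following a link backwards.
F(i) = {k : i → k} is the set of all pages that i points to,
that is, the pages that can be reached from i by following a link forwards.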
Markov authorities :
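The formula itself did not survive in this copy of the text. As a hedged illustration only, the sketch below builds the authority chain in the way SALSA is usually described in [LM00]: from authority i, step back along a link to a page k in B(i) chosen uniformly, then forward along a link from k to a page j in F(k) chosen uniformly. The function name and the edge-list representation are assumptions.

```python
from collections import defaultdict


def salsa_authority_chain(edges):
    """Build the transition probabilities of the authority chain.

    edges -- list of (k, i) pairs, meaning page k links to page i
    Returns dict: (i, j) -> probability of moving from authority i to j.
    """
    B = defaultdict(set)  # B[i]: pages that point to i
    F = defaultdict(set)  # F[k]: pages that k points to
    for k, i in edges:
        B[i].add(k)
        F[k].add(i)
    P = defaultdict(float)
    for i, back in B.items():
        for k in back:          # step back along a link into i ...
            for j in F[k]:      # ... then forward along a link out of k
                P[(i, j)] += 1.0 / (len(back) * len(F[k]))
    return dict(P)


# Hypothetical graph: h1 links to a1 and a2, h2 links to a1.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
P = salsa_authority_chain(edges)
print(P)
```

Each row of the resulting matrix sums to 1, so it is a valid transition matrix over the authority side of the graph.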
[Altavista]
[BH98]
[LM00]
[LS01]
[PRTV98]
[Search Engine
Watch]
[Telcordia]