
5.

5.1


.
.

.
,
HTML,

.
5.1.1 The World Wide Web


. (web pages)

.

.
, , , ,
. ,
, , , .



.
,

.

.

.

5.1.2


. :
,
, .

.

.
,

.
.
.
, ,


.
.
,

.


.



, (Web Search Engines).

5.1.3 Search Engines

[K97, BR99]:
crawler, ,
(links).
crawler
,

crawlers .
(indexer),
crawler. ,

crawler.
,
(inverted index). indexer

, ...
(query processor),

.
, ,


.
.

Crawler
crawler
.
,
.

. crawler
, ,
(servers) .
Indexer
(indexer), ,
,
,


(URLs), ,
.


.

.

,
URLs .

.

.
.



( ).
.
,
, , ...
,
...



.

.


.

.

, ,
, .
, ,

.

.
,
.
.
" ".

.
jaguar
.

, .



.
""
"" "".
.



.


.
.

(hyperlinks) .
.


.

5.1.4 HTML
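HTML stands for HyperText Markup Language.
An HTML document is text marked up with tags.
An HTML document begins with <html> and ends with </html>. It has two
parts: the head, enclosed in <head> and </head>, and the body, enclosed in
<body> and </body>.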

. ,
, scripting
.

<title> </title>.

,
, ,
.
HTML .

, .
,
.
, H1 H6 .

<h1> </h1>, .

.
.
,
, , .


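Bold text is written between <b> and </b>, or between <strong> and </strong>.
Italic text is written between <i> and </i>.
Underlined text is written between <u> and </u>.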

.
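Images are placed in an HTML document with the img tag, written <img>.
The general form of the img tag is:
<img src="imagepath/imagename.jpeg" alt="comment">.
The src attribute of img gives the location of the image file. The alt
attribute gives a text description that stands in for the image when the
image itself cannot be shown.
Links are the other basic element of HTML. A link is written between <a>
and </a>: <a href="urlpage">hyperlink text</a>. The href attribute of the
a tag gives the URL of the page the link points to.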
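As an illustration of how a program such as a crawler can read the img and a tags described above, here is a minimal sketch using Python's standard html.parser module. The class name LinkAndImageParser, the helper extract, and the sample page are hypothetical and are not taken from the original text.

```python
from html.parser import HTMLParser


class LinkAndImageParser(HTMLParser):
    """Collects href targets of <a> tags and (src, alt) pairs of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])
        elif tag == "img":
            self.images.append((attributes.get("src"), attributes.get("alt")))


def extract(html_text):
    """Parse html_text and return (links, images)."""
    parser = LinkAndImageParser()
    parser.feed(html_text)
    return parser.links, parser.images


# Hypothetical sample page, built from the tag forms shown in the text.
sample = (
    "<html><head><title>Demo</title></head><body>"
    "<h1>Heading</h1>"
    '<img src="imagepath/imagename.jpeg" alt="comment">'
    '<a href="http://www.example.com/">hyperlink text</a>'
    "</body></html>"
)
links, images = extract(sample)
```

The list links holds the link targets that a link-analysis algorithm would treat as edges, and images holds the alt text that a text-based index could use for pictures.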
,
.

5.1.5 Links


.
() .
term-based search engines
.
,
:


.
(in-degree) .


.

.

.
.

,
.
, .
,
, .


.


connectivity-based search engines.



, u
(u,v) u
v.

5.2.

Link-based



.
The two best-known algorithms are HITS, by Kleinberg ([K97]), which is used
in CLEVER, and the algorithm of Brin and Page ([BP98]), which is used in
Google.
.
5.2.1 HITS (Hyperlink-Induced Topic Search)

authoritative ()
. authority
authority
hub .

.

,

.
, .
term-based
,
, .

, ,

.
, ,
(query) .
, ,
.
:
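Specific queries, e.g. "Does Netscape support the JDK 1.1 code-signing API?"
Broad-topic queries, e.g. "Find information about the Java programming language."
Similar-page queries, e.g. "Find pages similar to java.sun.com."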

.
Scarcity Problem ( ),

.
,
.


.
Abundance Problem ( ),

.
,

.
authority (-) ,
.

.
.
,
Harvard, www.harvard.edu, authoritative
Harvard.
Harvard www.harvard.edu
,

.

authority .
, ,
. ,

authoritative Yahoo!, Excite, AltaVista,
. .
Honda
Toyota .

, ,
.
.

.

, .
Apple
IBM,
.


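A collection V of linked pages can be viewed as a directed graph G = (V, E),
in which the nodes are the pages and a directed edge (p, q) means that page
p has a link to page q. The out-degree of a node p is the number of nodes
that p links to, and the in-degree of p is the number of nodes that link to p.
From a graph G we can isolate subgraphs. If W is a subset of V, then
G[W] denotes the subgraph induced by W.
The nodes of G[W] are the pages in W, and its edges are all the links between
pages in W.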

.


,
, .
Q
. .

,
authorities
.

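Ideally, we want a collection of pages, S, with the following properties:
i. It is relatively small.
ii. It is rich in relevant pages.
iii. It contains most of the strongest authority pages.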
S
.

authorities.
.

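To this end we form the root set R from the t highest-ranked pages
(typically t is about 200) returned for the query by a term-based search
engine such as AltaVista.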
(i), t.
(ii) R Q
.
,
Q .
S,
R, .
authority R
R. authorities
R
R.
.
Subgraph(σ, E, t, d)
  σ: a query string
  E: a text-based search engine
  t, d: natural numbers
  Let R denote the top t results of E on σ
  Set S := R
  For each page p that belongs to R
    Let Γ+(p) denote the set of all pages p points to
    Let Γ-(p) denote the set of all pages pointing to p
    Add all pages in Γ+(p) to S
    If |Γ-(p)| <= d then
      Add all pages in Γ-(p) to S
    Else
      Add an arbitrary set of d pages from Γ-(p) to S
  End
  Return S
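The Subgraph procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionaries out_links and in_links are hypothetical stand-ins for the link information a real system would obtain from a search engine or link database, the function name build_base_set is invented here, and the "arbitrary set of d pages" is drawn with a seeded random generator so that runs are repeatable.

```python
import random


def build_base_set(root_set, out_links, in_links, d, rng=None):
    """Expand the root set R into the base set S.

    out_links[p] is the set of pages p points to, in_links[p] the set of
    pages pointing to p (both hypothetical, precomputed inputs).
    """
    rng = rng or random.Random(0)  # assumption: any arbitrary choice is acceptable
    base = set(root_set)
    for page in root_set:
        # Add every page that this root page points to.
        base.update(out_links.get(page, ()))
        # Add at most d of the pages that point to this root page.
        parents = sorted(in_links.get(page, ()))
        if len(parents) <= d:
            base.update(parents)
        else:
            base.update(rng.sample(parents, d))
    return base


# Tiny hypothetical example: two root pages, d = 2.
out_links = {"r1": {"a", "b"}, "r2": {"b"}}
in_links = {"r1": {"h1", "h2", "h3"}, "r2": {"h1"}}
S = build_base_set(["r1", "r2"], out_links, in_links, d=2)
```

Here r1 has three in-linking pages, so only two of them are added on its behalf, while r2 has a single in-linking page, which is always added.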

1
S R,

R
R, d S
R. ,

,
S, .
S base set .
S G[S],

.
, ,
,
. G[S] .
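A link is called transverse if it connects pages that belong to different
domains, and intrinsic if it connects pages that belong to the same
domain. Here the domain is taken from the first level of the URL of the
page.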
,


.
.
G. , ,
.
Hubs and Authorities
G
.
, .
in-degree,
.
.

, .

,
. ,
, , ,
in-degree.
authoritative

in-degree,
. authoritative
, hub ,
authoritative .
authorities , in-degree.
authorities hub
5.


in-degree

hubs

authorities

2 hubs authorities
hubs authorities ,
hub authorities
authority hubs.

, G.



,
hub authority ,
hubs authorities .
, authority x<p> hub
y<p>.
,
authorities hubs .
:
p x (authority),
y (hub). p
y
x.
x y, I O.
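The I operation updates the x weights as follows:

$$x^{<p>} \leftarrow \sum_{q:(q,p) \in E} y^{<q>}$$

The O operation updates the y weights as follows:

$$y^{<p>} \leftarrow \sum_{q:(p,q) \in E} x^{<q>}$$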

,
I .
x
, y .
.
Iterate(G, k)
  G: a collection of n linked pages
  k: a natural number
  Let z denote the vector (1, 1, 1, ..., 1) in R^n
  Set x_0 := z.
  Set y_0 := z.
  For i = 1, 2, ..., k
    Apply the I operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i.
    Apply the O operation to (x'_i, y_{i-1}), obtaining new y-weights y'_i.
    Normalize x'_i, obtaining x_i.
    Normalize y'_i, obtaining y_i.
  End
  Return (x_k, y_k).
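A minimal Python sketch of the Iterate procedure is given below. It assumes that the graph is supplied as a list of distinct (source, target) pairs and that normalization means scaling to unit Euclidean length; the function names and the example graph are hypothetical.

```python
import math


def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def iterate(edges, k):
    """Run k rounds of the I and O operations; return (x, y).

    x holds the authority weights and y the hub weights.
    """
    pages = sorted({page for edge in edges for page in edge})
    x = {page: 1.0 for page in pages}
    y = {page: 1.0 for page in pages}
    for _ in range(k):
        # I operation: x<p> is the sum of y<q> over all links (q, p).
        new_x = {page: 0.0 for page in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # O operation: y<p> is the sum of the new x<q> over all links (p, q).
        new_y = {page: 0.0 for page in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x = normalize(new_x)
        y = normalize(new_y)
    return x, y


# Hypothetical graph: three hub-like pages pointing to two targets.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
x, y = iterate(edges, 20)
```

In this example a1 receives the largest authority weight because three pages point to it, and h1 receives the largest hub weight because it points to both targets.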

c
authority c hub .
:
Filter(G, k, c)
  G: a collection of n linked pages
  k, c: natural numbers
  (x_k, y_k) := Iterate(G, k)
  Report the pages with the c largest coordinates in x_k as authorities
  Report the pages with the c largest coordinates in y_k as hubs
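The selection step of Filter can be sketched with the standard heapq module. The weights below are hypothetical values of the kind an Iterate-style routine would return, and the function name filter_top is invented for this illustration.

```python
import heapq


def filter_top(x, y, c):
    """Return the c pages with the largest authority (x) and hub (y) weights."""
    authorities = heapq.nlargest(c, x, key=x.get)
    hubs = heapq.nlargest(c, y, key=y.get)
    return authorities, hubs


# Hypothetical authority (x) and hub (y) weights for four pages.
x = {"p1": 0.90, "p2": 0.40, "p3": 0.10, "p4": 0.05}
y = {"p1": 0.02, "p2": 0.30, "p3": 0.80, "p4": 0.50}
authorities, hubs = filter_top(x, y, c=2)
```

Using nlargest avoids sorting all n pages when only a small number c of them is reported.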


k, Iterate,
.

Let A be the adjacency matrix of G: its entry (i,j) is 1 if G contains
a link from page pi to page pj, and 0 otherwise.

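With this matrix the I and O operations can be written as:

$$x \leftarrow A^{T} y, \qquad y \leftarrow A x.$$

The vector x produced by Iterate converges to the principal eigenvector
of $A^{T}A$, and the vector y converges to the principal eigenvector
of $AA^{T}$.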
Iterate
x y, 20
.

. .
Iterate . ,

. ,
,

.




.
.
, ,
.

.
, ,
.
hubs authorities, .
authorities
.

.
,
.
. R
S.


hubs authorities
hubs authorities
hubs authorities G.


. :
jaguar,
.

, randomized algorithms.
,
,
abortion.
,
.

.
, hubs authorities
A ,
G.

hubs authorities S.


.
,
.

,
hubs authorities.



G .
.

,
.

, .

.
,
.

,

5.2.2 Google

- Google [BP98].
Google ,
. Google



.
PageRank.
.
PageRank
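Assume that page A has pages T1, ..., Tn pointing to it. The parameter d
is a damping factor, which can be set between 0 and 1.
It is usually set to d = 0.85. C(A) is the number of links going out of
page A. The PageRank of page A is given by:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))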
PageRanks
, PageRanks
.

PageRanks
,
.
PageRank .
.
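The PageRank formula can be evaluated with a simple fixed-point iteration, sketched below. The function name pagerank, the starting value 1.0, the fixed number of iterations and the three-page graph are assumptions made for this illustration, and every page is assumed to have at least one outgoing link. Note that with the formula exactly as written, the values sum to the number of pages in such a graph; dividing the (1-d) term by the number of pages would give values that sum to one.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    out_links maps each page to the list of distinct pages it links to.
    Assumption: every page appearing as a target is also a key with at
    least one outgoing link.
    """
    pages = sorted(out_links)
    # Invert the link structure: in_links[p] lists the pages pointing to p.
    in_links = {page: [] for page in pages}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        pr = {
            page: (1 - d)
            + d * sum(pr[t] / len(out_links[t]) for t in in_links[page])
            for page in pages
        }
    return pr


# Hypothetical three-page web: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this example C collects rank from both A and B and ends up highest, A follows because it receives all of C's rank, and B is lowest.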

,
(back) (browser).
.
PageRank
. d
.
d
.

.

PageRank
,
PageRank.
.
, ,
Yahoo!,
. PageRank
,
.
(anchor text)
Google
, .
. Google
.
. ,
. ,

, , ,
.
crawled
,

.
,
.
, .

Google ,
. ,
,
. ,
, .

. , HTML
.

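In Google, crawling is done by several distributed crawlers. A URLserver
sends lists of addresses (urls) to be fetched to the crawlers.
The fetched pages are sent to the StoreServer, which compresses them and
stores them in the Repository. Every page has an associated ID, the docID,
which is assigned whenever a new URL is parsed out of a page.
Indexing is performed by the indexer and the sorter. The indexer reads the
repository, uncompresses the documents and parses them.
Each document is converted into a set of word occurrences called hits. A hit
records the word, its position in the document, an approximation of its font
size, and its capitalization. The indexer distributes the hits into barrels,
creating a partially sorted forward index.
The indexer also performs another important function. It parses out all the
links in every page and stores the important information about them in an
anchors file. This file records where each link points from and to, and the
text of the link.
The URLresolver reads the anchors file, converts relative URLs into absolute
URLs and these in turn into docIDs. It puts the anchor text into the forward
index, associated with the docID that the anchor points to.
It also generates a database of links, which are pairs of docIDs.
This database is used to compute the PageRank of every document.
The sorter takes the barrels, which are sorted by docID, and re-sorts them
by wordID to generate the inverted index.
The sorter also produces a list of wordIDs and offsets into the inverted
index. A program called DumpLexicon takes this list, together with the
lexicon produced by the indexer, and generates a new lexicon for use by the
searcher. The searcher uses this lexicon, together with the inverted index
and the PageRanks, to answer queries.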

3.


3 Google

crawled
.
Big Files
Big Files
. .
.
repository HTML ,
. docID,
oyw. repository
,
.
repository
crawling. repository 4.

4 Repository


Document Index
(document index)
docID.
, repository,
(checksum) .
crawled docinfo
,
URLs, .

URLs docIDs. checksums
docIDs .
docID URL, checksum URLs
checksums docID.
URLs docIDs .
URLresolver.
Lexicon
. .
,
, hash .
Hit Lists
hit list
,
, .
,
.
.
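The compact encoding uses two bytes for every hit.
There are two types of hits: fancy hits and plain hits. Fancy hits are hits
that occur in a URL, title, anchor text, or meta tag.
Plain hits are everything else. A plain hit consists of a capitalization bit,
the font size, and 12 bits for the position of the word in the document.
The font size is given relative to the rest of the document and takes three
bits. A fancy hit consists of a capitalization bit,
the font size set to 7 to mark the hit as a fancy
hit, 4 bits that encode the type of fancy hit, and 8
bits of position. For anchor hits, the 8
bits of position are split into 4 bits for the position in the anchor and
4 bits for a hash of the docID that the anchor occurs in.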
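The two-byte layout of a plain hit (one capitalization bit, three bits of font size, 12 bits of position) can be sketched with bit operations. The order of the fields inside the 16 bits and the function names are assumptions made for this illustration; the text only gives the field widths.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits.

    Assumed layout: capitalization in the top bit, then 3 bits of font
    size, then 12 bits of word position.
    """
    if not 0 <= font_size <= 6:
        raise ValueError("font size 7 is reserved for fancy hits")
    if not 0 <= position < 4096:
        raise ValueError("position must fit in 12 bits")
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


packed = pack_plain_hit(True, 3, 100)
fields = unpack_plain_hit(packed)
```

Reserving font size 7 is what lets a reader of the hit list tell a fancy hit from a plain one without any extra bits.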
, ,

.
hits hits.
hits wordID forward
index docID inverted index.
Forward Index
forward index ,
64 barrels. barrel wordIDs.
barrel, docID


barrel, wordIDs hit lists


. wordIDs,
wordID wordID
barrel.
Inverted Index
inverted index barrels, forward index,
(sorter).
wordID, barrel wordID.
docIDs hit lists.
.
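The relation between the forward index (keyed by docID) and the inverted index (keyed by wordID) can be illustrated with a small in-memory sketch. The dictionary layout and the function name invert are hypothetical simplifications; they ignore barrels and the compact on-disk encoding.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn docID -> {wordID: hit list} into wordID -> [(docID, hit list)].

    Documents are visited in docID order, so every posting list in the
    result is sorted by docID.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


# Hypothetical forward index: two documents, two word IDs, positions as hits.
forward = {2: {10: [5], 11: [1, 7]}, 1: {10: [3]}}
inverted = invert(forward)
```

A query for wordID 10 can then read one posting list instead of scanning every document.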

docIDs. docID.

.
.

.
,
.
, inverted
barrels, hits
hits.
barrels
barrels .

Crawling
, Google
crawling. URLserver URLs
crawlers. crawler 300
,
.

o Parsing parser
,
tags HTML
ASCII .
o barrels
parser, barrels.
wordID hash, .
.
wordIDs,
hits, forward
barrels.
.
(log)
. indexers

indexer.
o inverted index
forward barrels wordID,


inverted barrel hits anchor


inverted barrel .
barrel ,
.


. ,
Google, :
1.

2.

wordIDs

3.

barrels

4.

5.

6.

barrels
,
barrel
4

7.


4.
k.


Google
. hits
, .
hits anchor, PageRank
.
, ,
.
.
Google hits .
hit ,
.
. Google hits hits.
, hit, ,
.
IR
. IR PageRank
.
,
, . hits
hits
. hits
,
.
hits 10
.
hits,
. .


IR
.

,
.
, .
.
,
.
Clever Google
. , Google

, Clever

. Google .
Google
,
. Clever
, authoritative
. Clever ,
,
hub
.

5.2.3 The SALSA Algorithm
SALSA [LM00]

.
Markov,
TKC,
(TKC: Tightly-Knit Community) .
.

,
. Kleinberg
.
.
.
y hub authorities,
hub authority . z
hub
authority y.
z
.
authority z hub
, y
y
z.
.


O SALSA ,
, Kleinberg.
authoritative , ,
, .
.

, hub authority,
Markov, .
,
,
.
,
hub authority.


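The algorithm works on a bipartite graph G = (Vh, Va, E), where Vh is the
set of hub nodes and Va is the set of authority nodes.
Every page s of the original graph is represented by two nodes, sh and sa.
A link from page s to page r is represented by an edge between sh and ra.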
.
,
, , .
,

. 2

( hubs authorities).
hub authority

.



. authorities
,
hubs
.
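Let B(i) = {k : k → i} denote the set of pages that point to page i,
that is, the back-links of i. Let F(i) = {k : i → k} denote the set of
pages that page i points to, that is, the forward links of i.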
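The stationary distribution of the Markov chain for the authorities is:

$$a = (a_1, a_2, \ldots, a_N), \qquad a_i = \frac{|B(i)|}{|B|}, \qquad B = \bigcup_i B(i).$$

Similarly, the stationary distribution of the Markov chain for the hubs is:

$$h = (h_1, h_2, \ldots, h_N), \qquad h_i = \frac{|F(i)|}{|F|}, \qquad F = \bigcup_i F(i).$$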
.
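The closed-form SALSA weights can be computed directly from link counts, as in the sketch below. It assumes that B(i) and F(i) are counted as links, so that |B| and |F| both equal the total number of links, and that the graph forms a single connected component; the function name salsa_weights and the example graph are hypothetical.

```python
from collections import Counter


def salsa_weights(edges):
    """Return (authority, hub) weights proportional to in- and out-degree."""
    edges = set(edges)  # count each link once
    total = len(edges)
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    authority = {page: n / total for page, n in in_degree.items()}
    hub = {page: n / total for page, n in out_degree.items()}
    return authority, hub


# Hypothetical graph with three links.
edges = [("h1", "a1"), ("h2", "a1"), ("h2", "a2")]
authority, hub = salsa_weights(edges)
```

Unlike the Iterate procedure of HITS, no iteration is needed here: a single pass over the links gives the weights.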

Kleinberg.
authority , i,
, .
,

Kleinberg. ,
Kleinberg , authorities
= u, u .
L1. i = |(i)| / |B|,
SALSA. hubs.
, Base Set,
, SALSA

.
j i, i
i
j. authority i :


[Altavista] AltaVista Search Engine, http://www.altavista.com
[BH98] K. Bharat and M.R. Henzinger, Improved Algorithms for Topic
Distillation in a Hyperlinked Environment, Proc. ACM Conf. Res. and
Developments in Information Retrieval, 1998
[BP98] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web
Search Engine, Proc. 7th International World Wide Web Conference, 1998
[BR99] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval,
ACM Press, 1999, Chapter 13
[BRRT01] A. Borodin, G.O. Roberts, J.S. Rosenthal and P. Tsaparas, Finding
Authorities and Hubs from Link Structure on the World Wide Web, Proc.
10th International World Wide Web Conference, Hong Kong, May 2001
[CDGKRR98] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan and
S. Rajagopalan, Automatic Resource List Compilation by Analyzing
Hyperlink Structure and Associated Text, Proc. 7th International World
Wide Web Conference, 1998
[CDI98] S. Chakrabarti, B. Dom and P. Indyk, Enhanced Hypertext Categorization
Using Hyperlinks, Proc. ACM SIGMOD International Conference on
Management of Data, 1998
[Chakrabarti et al. 99] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg,
S.R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Hypersearching
the Web, Scientific American, June 1999
[Chakrabarti et al. 99b] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg,
S.R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Mining the Link
Structure of the World Wide Web, IEEE Computer, August 1999
[F97] G.N. Frederickson, A Data Structure for Dynamically Maintaining
Rooted Tree, Journal of Algorithms 24, 1997
[GKR98a] D. Gibson, J. Kleinberg and P. Raghavan, Inferring Web Communities
from Link Topology, Proc. 9th ACM Conference on Hypertext and
Hypermedia, 1998
[GKR98b] D. Gibson, J. Kleinberg and P. Raghavan, Structural Analysis of the
World Wide Web, Position paper at the WWW Consortium Web
Characterization Workshop, November 1998
[Google] Google Search Engine, http://www.google.com
[GG01] G. Greco and S. Greco, Topic Distillation on Hyperlinked Data,
Unpublished manuscript
[H01] M.R. Henzinger, Web Information Retrieval: an Algorithmic
Perspective, in Proceedings of European Symposium on Algorithms (ESA)
[K97] J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc.
9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended
version in Journal of the ACM 46 (1999). Also appears as IBM Research
Report RJ 10076, May 1997
[K99] J. Kleinberg, The Small-World Phenomenon: an Algorithmic
Perspective, Cornell Computer Science Technical Report 99-1778,
October 1999
[Kleinberg et al. 99] J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan
and A. Tomkins, The Web as a Graph: Measurements, Models and Methods,
Invited survey at the International Conference on Combinatorics and
Computing, 1999
[KT99] J. Kleinberg and A. Tomkins, Applications of Linear Algebra to
Information Retrieval and Hypertext Analysis, Tutorial survey at the
ACM Symposium on Principles of Database Systems, 1999
[LM00] R. Lempel and S. Moran, The Stochastic Approach for Link-Structure
Analysis (SALSA) and the TKC Effect, Proc. 9th International World
Wide Web Conference, Amsterdam, May 2000
[LS01] R. Lempel and A. Soffer, PicASHOW: Pictorial Authority Search by
Hyperlinks on the Web, Proc. 10th International World Wide Web
Conference, Hong Kong, May 2001
[PRTV98] C.H. Papadimitriou, P. Raghavan, H. Tamaki and S. Vempala, Latent
Semantic Indexing: a Probabilistic Analysis, Proc. ACM Symposium on
Principles of Database Systems, 1998
[Search Engine Watch] Search Engine Watch,
http://searchenginewatch.com/reports/sizes.html
[Telcordia] Telcordia Technologies NetSizer, http://www.netsizer.com/
