
The Anatomy of a Large-Scale

Hypertextual Web Search Engine:


Google
Bakkaloglu, Mehmet
Google (a common spelling for googol)
googol: 10^100, or 1 followed by 100 zeros
googolplex: 10^googol = 10^(10^100), or 1 followed by a googol of zeros
The name 'googol' was invented in the 1930s by a child (mathematician Edward Kasner's nine-year-old nephew) who was asked to think up a name for a very big number, namely 1 with a hundred zeros after it.

General Architecture of a Search Engine
Spider (Crawler)
gathering information
Indexer
analyzing information
Searcher
displaying information (ranking)


What makes Ranking difficult:
The Web is not well controlled (it is not like a closed Information Retrieval system): anyone can publish anything they want
A word can be repeated many times (bad if term frequency is one of the ranking criteria)
Metadata may be abused
Cloaking: a website returns altered web pages to a search engine accessing the site, usually to distort search engine rankings

Sub-Optimal Ranking Methods:
manually maintain a list!
simply return the document that is closest to the query

What information does Google use in ranking Web Pages?
Link Structure (PageRank)
IR (Information Retrieval) Measures:
Anchor Text
Font (relative to the rest of the page), Capitalization, Position in Page
Plain Hits vs. Fancy Hits (URL, title, anchor text, meta tag)
Location information of the different hits (proximity)
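The paper says the IR score of the hits is combined with PageRank to give a final rank, but it does not publish the exact formula. A purely hypothetical Python sketch of what such a combination could look like (the weight and the log-scaling of PageRank are illustrative assumptions, not Google's method):

import math

def final_score(ir_score: float, pagerank: float, ir_weight: float = 0.7) -> float:
    # Hypothetical combination: the weight and the log-scaling are illustrative only.
    return ir_weight * ir_score + (1 - ir_weight) * math.log(1 + pagerank * 1e6)

# A page with a weaker text match but a very high PageRank can outrank
# a page with a slightly better text match and a negligible PageRank.
print(final_score(ir_score=0.6, pagerank=0.02))      # ~3.4
print(final_score(ir_score=0.7, pagerank=0.000001))  # ~0.7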

Link Structure (PageRank)
The idea behind PageRank is citation:
Count the number of links pointing to a page,
but place a different level of importance on each link (e.g. a link from Yahoo vs. a link from a personal web page)




How does PageRank actually work?
Markov Chains:
Limiting probability of a page ~ probability that a random surfer will visit that page

[Diagram: three pages A, B, and C connected by links]
PageRank Example:

[Diagram: A links to B and to C with weight 1/2 each, B links to C with weight 1, C links to A with weight 1]
Equations:
P(A)=P(C)
P(B)=(1/2)*P(A)
P(C)=(1/2)*P(A)+P(B)
P(A)+P(B)+P(C)=1
Limiting probabilities:
P(A)=0.4, P(B)=0.2, P(C)=0.4
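These limiting probabilities can be checked with a few lines of Python: encode the example graph as a Markov chain and iterate it until the distribution stops changing (a sketch without damping; the graph below is the one in the diagram).

# Transition probabilities of the example: A -> B (1/2), A -> C (1/2), B -> C (1), C -> A (1)
links = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"C": 1.0},
    "C": {"A": 1.0},
}

# Start from a uniform distribution and apply the chain repeatedly.
p = {page: 1.0 / len(links) for page in links}
for _ in range(100):
    new_p = {page: 0.0 for page in links}
    for src, outgoing in links.items():
        for dst, weight in outgoing.items():
            new_p[dst] += p[src] * weight
    p = new_p

print(p)  # converges to approximately {'A': 0.4, 'B': 0.2, 'C': 0.4}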

Problem with this approach:

[Diagram: A links to B and to C with weight 1/2 each, B links only to itself with weight 1, C links to A with weight 1]
Equations:
P(A)=P(C)
P(B)=(1/2)*P(A)+P(B)
P(C)=(1/2)*P(A)
P(A)+P(B)+P(C)=1




Limiting probabilities:
P(A)=0, P(B)=1, P(C)=0
This is no good!!!
Solution: Use a Damping Factor

[Diagram: the same three-page graph, where every page now receives (1-d)/3 from random jumps and every link weight is multiplied by d]
P(C) = [(1-d)/3]*[P(A)+P(B)+P(C)] + d*[(1/2)*P(A)]
Solution: Use a Damping Factor (continued)
P(C)= [(1-d)/3]*[P(A)+P(B)+P(C)] + d*[(1/2)*P(A)]
Rationale:
The user follows links for a while, then gets bored and jumps to a random page
Question:
How should we apply the damping factor?
Equally to all pages or more heavily to a subset of pages?
P(C) = [(1-d)/3] + d*[(1/2)*P(A)]   (using P(A)+P(B)+P(C)=1)

In general:
P(X) = (1-d)/n + d*[P(T1)/C(T1) + ... + P(Tk)/C(Tk)]
where n is the total number of pages, T1...Tk are the pages linking to X, and C(Ti) is the number of links going out of Ti

On a typical workstation each iteration takes ~ 6 min.
The PageRank Citation Ranking: Bringing Order to the Web
Copy of paper available at:
http://citeseer.nj.nec.com/368196.html
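A minimal power-iteration sketch of this formula (Python, d = 0.85 as suggested in the paper). Applied to the rank-sink graph from the earlier slide, where B links only to itself, it shows how the damping term keeps A and C from being starved:

def pagerank(links, d=0.85, iterations=100):
    # P(X) = (1-d)/n + d * sum over pages T linking to X of P(T)/C(T),
    # where C(T) is the number of links going out of T.
    pages = list(links)
    n = len(pages)
    p = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_p = {}
        for x in pages:
            incoming = sum(p[t] / len(links[t]) for t in pages if x in links[t])
            new_p[x] = (1 - d) / n + d * incoming
        p = new_p
    return p

# The rank-sink graph: B links only to itself and swallows all the rank.
sink_graph = {"A": {"B", "C"}, "B": {"B"}, "C": {"A"}}
print(pagerank(sink_graph, d=1.0))   # no damping: almost everything ends up on B
print(pagerank(sink_graph, d=0.85))  # with damping: A and C keep a reasonable share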

Anchor Text
Idea:
Links provide information about the pages they point to
Also allows the inclusion of documents which have links pointing to them but which cannot be crawled
e.g.: images, programs, databases
(which cannot be indexed by text-based search engines)
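A toy sketch of the idea (hypothetical URLs and data structures, not Google's actual ones): while crawling, the text of every link is credited to the page it points to, so even a target that was never parsed, such as an image, ends up with indexable words.

from collections import defaultdict

# Maps a target URL to the anchor words pointing at it.
anchor_index = defaultdict(list)

# (source page, anchor text, target URL) triples collected while crawling.
crawled_links = [
    ("http://example.edu/reading.html", "search engine anatomy paper", "http://example.edu/paper.ps"),
    ("http://example.org/links.html", "google prototype paper", "http://example.edu/paper.ps"),
]

for source, anchor_text, target in crawled_links:
    anchor_index[target].extend(anchor_text.lower().split())

# The PostScript file itself was never crawled, yet it is now findable by these words:
print(anchor_index["http://example.edu/paper.ps"])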

General Architecture of a Search Engine
Spider (Crawler)
gathering information
Indexer
analyzing information
Searcher
displaying information

Main Concerns of Google:
Fast and Space Efficient

Architecture:

Distributed Crawling


Barrels: Forward vs. Inverted Index
Forward: Partially sorted (each barrel holds a word range)



Inverted: Sorted




Two steps (for performance reasons??)
Is using word ranges the best solution, or should the barrels be balanced by word popularity instead (when searching)?
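The two-step structure can be illustrated with a toy sketch (Python, using plain words instead of wordIDs and ignoring the split into barrels): first a forward index keyed by docID, then an inversion pass producing the word-to-document lists used at search time.

from collections import defaultdict

docs = {
    1: "web search engine",
    2: "large scale web crawler",
}

# Step 1: forward index -- for each docID, the words (hits) it contains.
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Step 2: invert -- for each word, the list of docIDs containing it, in docID order.
inverted_index = defaultdict(list)
for doc_id, words in sorted(forward_index.items()):
    for word in set(words):
        inverted_index[word].append(doc_id)

print(inverted_index["web"])  # [1, 2]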

Inverted Barrels
Sort by docID, or sort by ranking?

Hybrid solution: use 2 sets of barrels

One for title and anchor hits (they have more importance
than plain hits)
and one for all hits

Hit Lists
Capitalization
Font Size
Position

No Color Information!!!

Question:
How much effect do each of these properties have on the
ordering of web pages?
(i.e. what's the trade-off in using these?)
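For reference, the paper packs each plain hit into two bytes: one capitalization bit, three bits of font size relative to the rest of the document, and twelve bits of word position (font size 7 is reserved to flag fancy hits, and there is indeed no color field). A rough bit-packing sketch of that layout:

def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    # Layout: [capitalization: 1 bit][relative font size: 3 bits][word position: 12 bits]
    font_size = min(font_size, 6)    # value 7 is reserved to mark fancy hits
    position = min(position, 4095)   # positions beyond 12 bits are capped
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit: int):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(capitalized=True, font_size=3, position=42)
print(unpack_plain_hit(hit))  # (True, 3, 42)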

How often should Google's database be updated?
Well, there are some limitations (back in 1998):
Crawling 26 million pages takes ~ 9 days
Indexing 24 million pages takes ~ 5 days
Sorting them takes ~ 1 day
Plus PageRanking
In reality Google was updated every ~ 1-4 weeks
Incremental updating??
Smart algorithms to decide which pages should be crawled
(or cooperation from Web servers)






Search engine index sizes: http://searchenginewatch.com/reports/sizes.html
Improvements in Ranking
1) User feedback
User preferences (relevance)
e.g.: DirectHit (a system that measures what users click on from search results in order to refine relevancy rankings)
Personalize PageRank by increasing the weights of the user's bookmarks
2) Use correlation information among different words?
(e.g.: "networks" vs. "computer networks")
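One way to read the bookmark idea: instead of jumping to a uniformly random page with probability (1-d), the bored surfer jumps to one of the user's bookmarks. A hedged sketch of that variant (the uniform-over-bookmarks teleport vector is my simplification; the PageRank paper describes personalization only in general terms):

def personalized_pagerank(links, bookmarks, d=0.85, iterations=100):
    # Like plain PageRank, but the (1-d) "random jump" mass goes only to bookmarked pages.
    pages = list(links)
    p = {page: 1.0 / len(pages) for page in pages}
    teleport = {page: (1.0 / len(bookmarks) if page in bookmarks else 0.0) for page in pages}
    for _ in range(iterations):
        new_p = {}
        for x in pages:
            incoming = sum(p[t] / len(links[t]) for t in pages if x in links[t])
            new_p[x] = (1 - d) * teleport[x] + d * incoming
        p = new_p
    return p

graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
print(personalized_pagerank(graph, bookmarks={"B"}))  # B's rank rises relative to plain PageRank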





Improvements in Ranking(Continued)
3) XML issues
HTML:
<td width="20%" valign="top"><small><font
face="Arial">Hamburg</font></small></td>
Code:
<td width="20%" valign="top"><% = & " " &
rs.fields("city") %></td>

XML:
<City>Hamburg</City>
(the XML markup makes the meaning of the content explicit, whereas the HTML and server-side code describe only its presentation)


Is Google's ranking optimal?
Hmm... there are some bad examples:

At one time, a search for "What is more evil than Satan"
used to return Microsoft's home page

Any ideas about why this happened?
Sergey Brin: "Lots of sites point to Microsoft as evil"
