
Search Engines

How does it work?

Which of these is a Search Engine?
!
Purpose of Search Engines
  Convert people's need into a query
  Let people find information
  In the form of a web page or other format
  In the shortest time possible
Searching is different than Database Queries
  Structured data tends to refer to information in "tables".
  Unstructured data typically refers to free text, which allows:
  Keyword queries, including operators
  More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse
Information Retrieval
• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
– These days we frequently think first of web search, but there are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval

A Search Engine: the Google Search Engine
[Architecture diagram: the Web feeds a Crawling component; crawled pages pass through Indexing, Sorting, and a Page Ranker to build the index. At query time, a Query Parser, Query Engine, Relevance Ranker, and Formatter produce the results page.]
How does Crawling work?
  Begin with known "seed" URLs
  Fetch and parse them
  Extract URLs they point to
  Place the extracted URLs on a queue
  Fetch each URL on the queue and repeat
Web Crawlers
[Diagram: seed pages feed the URL frontier; URLs are crawled and parsed, progressively uncovering the unseen Web.]
What a crawler "MUST" do
  Be Polite: Respect implicit and explicit politeness considerations
  Only crawl allowed pages
  Respect robots.txt (more on this shortly)
  Be Robust: Be immune to spider traps and other malicious behavior from web servers
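Python's standard library ships a robots.txt parser a polite crawler can consult before fetching. A sketch — the rule set below is illustrative; a real crawler would fetch it from the site with `set_url(...)` and `read()`:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules (normally fetched from
# http://example.com/robots.txt rather than written inline).
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Check a URL before crawling it:
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # False
```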
What any crawler "SHOULD" do
  Be capable of distributed operation: designed to run on multiple distributed machines
  Be scalable: designed to increase the crawl rate by adding more machines
  Performance/efficiency: permit full use of available processing and network resources
  Fetch pages of "higher quality" first
  Continuous operation: continue fetching fresh copies of a previously fetched page
  Extensible: adapt to new data formats, protocols
Updated Crawling Picture
[Diagram: multiple crawling threads pull from the URL frontier; seed pages and the crawled/parsed URLs sit on one side, the unseen Web on the other.]
Parsing a document
  What format is it in? pdf/word/excel/html?
  What language is it in?
  What character set is in use? (CP1252, UTF-8, …)

How good are the retrieved docs?
  Precision: fraction of retrieved docs that are relevant to the user's information need
  Recall: fraction of relevant docs in the collection that are retrieved
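Both measures fall out of a single set intersection. A sketch — the document IDs below are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|
       Recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant docs we actually found
    return hits / len(retrieved), hits / len(relevant)

# 3 of the 4 retrieved docs are relevant; 3 of the 6 relevant docs were found.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
# p == 0.75, r == 0.5
```

Note the tension: retrieving everything gives perfect recall but terrible precision.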
Indexing
Once we have crawled all the pages, we need to index them to make retrieval easier.

But how does one index all the pages?
Term-document incidence matrices

               Antony  Julius  Tempest  Hamlet  Othello  Macbeth
  Document 1     1       1       0        0       0        1
  Document 2     1       1       0        1       0        0
  Document 3     1       1       0        1       1        1
  …              0       1       0        0       0        0
  …              1       0       0        0       0        0
  …              1       0       1        1       1        1
  Document n     1       0       1        1       1        0

An entry is 1 if the document contains the term, 0 otherwise.
Can't build the matrix
  A 500K x 1M matrix has half a trillion 0's and 1's.
  But it has no more than one billion 1's.
  The matrix is extremely sparse.
  What's a better representation?
  We only record the 1 positions.

Inverted Index
  For each term t, we must store a list of all documents that contain t.
  Identify each doc by a docID, a document serial number.
  Can we use fixed-size arrays? What happens if the word Caesar is added to document 14?

  Brutus    → 1 2 4 11 31 45 173 174
  Caesar    → 1 2 4 5 6 16 57 132
  Calpurnia → 2 31 54 101
Inverted Index
  We need variable-size postings lists
  On disk, a continuous run of postings is normal and best
  In memory, can use linked lists or variable-length arrays
  Some tradeoffs in size/ease of insertion

  Dictionary → Postings (each docID entry is a posting):
  Brutus    → 1 2 4 11 31 45 173 174
  Calpurnia → 2 31 54 101
  Caesar    → 1 2 4 5 6 16 57 132
Inverted index construction

  Documents to be indexed:  Friends, Romans, countrymen.
          ↓ Tokenizer
  Token stream:             Friends Romans Countrymen
          ↓ Linguistic modules
  Modified tokens:          friend roman countryman
          ↓ Indexer
  Inverted index:           friend     → 2 4
                            roman      → 1 2
                            countryman → 13 16
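The pipeline above can be sketched in a few lines — here with a crude regex tokenizer in place of real linguistic modules, and invented docIDs and texts:

```python
import re

def build_index(docs):
    """docs: dict of docID -> text. Returns term -> sorted list of docIDs."""
    index = {}
    for doc_id in sorted(docs):
        # Crude tokenizer + normalizer: lowercase alphabetic tokens only.
        # set() so each doc appears at most once per term's postings list.
        for token in sorted(set(re.findall(r"[a-z]+", docs[doc_id].lower()))):
            index.setdefault(token, []).append(doc_id)
    return index

docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_index(docs)
# index["romans"] == [1, 2]; index["countrymen"] == [1]
```

Because docIDs are visited in sorted order, every postings list comes out sorted — the property the AND merge below relies on.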
Initial Text Processing
  Tokenization
    Cut character sequence into word tokens
    Deal with "John's", a state-of-the-art solution
  Normalization
    Map text and query terms to the same form
    You want U.S.A. and USA to match
  Stemming
    We may wish different forms of a root to match
    authorize, authorization
  Stop words
    We may omit very common words (or not)
    the, a, to, of
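A toy sketch of the four steps above. The suffix list is an invented stand-in for a real stemmer (e.g. Porter's), and the stop-word list is just the four words from the slide:

```python
def normalize(token):
    """Lowercase and drop punctuation, so "U.S.A." and "usa" match."""
    return "".join(ch for ch in token.lower() if ch.isalnum())

def stem(token):
    """Toy suffix-stripping stemmer (illustrative suffix list only)."""
    for suffix in ("ization", "ation", "ize", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

STOP_WORDS = {"the", "a", "to", "of"}

def process(text):
    """Tokenize -> normalize -> drop stop words -> stem."""
    tokens = [normalize(t) for t in text.split()]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

# "authorize" and "authorization" both stem to "author", as desired:
# process("the U.S.A. authorization of friends") -> ['usa', 'author', 'friend']
```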
Query processing: AND
  Consider processing the query: Brutus AND Caesar
  Locate Brutus in the Dictionary; retrieve its postings.
  Locate Caesar in the Dictionary; retrieve its postings.
  "Merge" the two postings (intersect the document sets):

  Brutus → 2 4 8 16 32 64 128
  Caesar → 1 2 3 5 8 13 21 34
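The "merge" walks both sorted postings lists in tandem, advancing whichever pointer holds the smaller docID — linear time in the combined list length. A sketch using the Brutus/Caesar lists from the slide:

```python
def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2))."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists -> emit it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
# intersect(brutus, caesar) == [2, 8]
```

This is why postings lists are kept sorted by docID: the intersection never has to look at any posting twice.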
Are all the pages of the same importance?
Page Rank
  A method for rating the importance of web
pages objectively and mechanically using
the link structure of the web
  PageRank was developed by Larry Page
(hence the name Page-Rank) and Sergey Brin.
  It first appeared as part of a research project
about a new kind of search engine. That project
started in 1995 and led to a functional
prototype in 1998 known as Google.
PageRank in a single Equation

  R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}

  u: a web page
  B_u: the set of u's backlinks
  N_v: the number of forward links of page v
  c: the normalization factor making ||R||_{L1} = 1 (||R||_{L1} = |R_1 + … + R_n|)
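The equation is typically solved by power iteration: start from a uniform rank vector and apply the update until it stabilizes. A sketch on a made-up three-page cycle, with per-step renormalization playing the role of c:

```python
def pagerank(backlinks, out_degree, iterations=50):
    """Iterate R(u) = c * sum_{v in B_u} R(v)/N_v, renormalizing each step
    so that ||R||_1 = 1 (simplified PageRank, no random jumps)."""
    pages = list(backlinks)
    rank = {u: 1.0 / len(pages) for u in pages}   # uniform starting ranks
    for _ in range(iterations):
        new = {u: sum(rank[v] / out_degree[v] for v in backlinks[u])
               for u in pages}
        total = sum(new.values()) or 1.0          # c: normalization factor
        rank = {u: r / total for u, r in new.items()}
    return rank

# Invented example: A -> B -> C -> A (a simple cycle).
backlinks = {"A": ["C"], "B": ["A"], "C": ["B"]}   # B_u for each page u
out_degree = {"A": 1, "B": 1, "C": 1}              # N_v for each page v
# by symmetry, every page converges to rank 1/3
```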
Probabilistic Interpretation of PageRank
The Random Surfer Model: PageRank is the standing probability distribution of a random walk on the graph of the web, where the surfer simply keeps clicking successive links at random.
Example: Pagerank Matrices
[Worked example on a simplified version of the Web: the PageRank vector after the first iteration, the second iteration, and convergence after many iterations.]
Problem with simplified Pagerank
  Loops: During each iteration, the loop accumulates
rank but never distributes rank to other pages!
Example
[Figures: rank values after the first iteration, the second iteration, and convergence after many iterations — the loop ends up with all the rank.]
Solution
  Modify the "random surfer" such that he simply keeps clicking successive links at random, but periodically "gets bored" and jumps to a random page based on the distribution of E

  E(u): a distribution of ranks of web pages that "users" jump to when they "get bored" with following successive links at random.
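With a uniform jump distribution E, the fix becomes the familiar damped update, where with probability 1-d the bored surfer jumps to a uniformly random page. A sketch on an invented two-page graph with a self-loop, showing the loop no longer absorbs all the rank:

```python
def pagerank_damped(backlinks, out_degree, d=0.85, iterations=100):
    """R(u) = (1-d)/N + d * sum_{v in B_u} R(v)/N_v, with E uniform:
    the surfer "gets bored" with probability 1-d and jumps anywhere."""
    n = len(backlinks)
    rank = {u: 1.0 / n for u in backlinks}
    for _ in range(iterations):
        rank = {u: (1 - d) / n + d * sum(rank[v] / out_degree[v]
                                         for v in backlinks[u])
                for u in backlinks}
    return rank

# Invented example: A links only to itself, B links to A.
# Under simplified PageRank, A's self-loop would accumulate everything;
# the random jump guarantees B keeps a nonzero rank of (1-d)/2.
backlinks = {"A": ["A", "B"], "B": []}
out_degree = {"A": 1, "B": 1}
```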
Google in 1997
Ways of index partitioning
  By doc: each shard has index for subset of docs
  – pro: each shard can process queries independently
  – pro: easy to keep additional per-doc information
  – pro: network traffic (requests/responses) small
  – con: query has to be processed by each shard
  – con: O(K*N) disk seeks for K-word query on N shards

Ways of index partitioning
  By word: shard has subset of words for all docs
  – pro: K-word query => handled by at most K shards
  – pro: O(K) disk seeks for K-word query
  – con: much higher network bandwidth needed; data about each word for each matching doc must be collected in one place
  – con: harder to have per-doc information
Google in 1999
Caching
  Cache both index results and doc snippets
  Hit rates typically 30-60%
    depends on frequency of index updates, mix of query traffic, level of personalization, etc.
  Better performance: 10s of machines do the work of 100s or 1000s
    reduce query latency on hits
    queries that hit in cache tend to be both popular and expensive (common words, lots of documents to score, etc.)
  Beware: big latency spike/capacity drop when the index is updated or the cache is flushed
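A toy LRU query cache illustrating the hit/miss bookkeeping behind those hit rates. The class name and interface are invented for this sketch; production systems also cache snippets and must invalidate entries when the index is updated:

```python
from collections import OrderedDict

class QueryCache:
    """Minimal LRU cache for query results (illustrative sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()        # insertion order = recency order
        self.hits = self.misses = 0

    def get(self, query, compute):
        if query in self.data:
            self.hits += 1
            self.data.move_to_end(query)     # mark as most recently used
            return self.data[query]
        self.misses += 1
        result = compute(query)              # expensive: score documents
        self.data[query] = result
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict least recently used
        return result

cache = QueryCache(capacity=2)
cache.get("dogs", lambda q: "results-" + q)  # miss: computed and stored
cache.get("dogs", lambda q: "results-" + q)  # hit: served from cache
# cache.hits == 1, cache.misses == 1
```

Flushing `data` on an index update is exactly the latency spike the slide warns about: every popular, expensive query misses at once.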
Google 2000
Dealing with growth
Google 2001
In-memory Indexing
  Big increase in throughput
  Big decrease in latency
    especially at the tail: expensive queries that previously needed GBs of disk I/O became much faster, e.g. [ "circle of life" ]
  Variance: touch 1000s of machines, not dozens
  Availability: 1 or few replicas of each doc's index data
Google 2004
Google 2007 : Universal Search
References
  Stanford CS-276 (Information Retrieval and Web Search): http://www.stanford.edu/class/cs276/
  The PageRank Citation Ranking: Bringing Order to the Web. EECS 584, University of Michigan: http://web.eecs.umich.edu/~michjc/eecs584/notes/lecture19-pagerank.ppt
  Challenges in Building Large-Scale Information Retrieval Systems, by Jeff Dean: http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf