Professional Documents
Culture Documents
Advanced Analysis
of Algorithms
Dept of CS & IT
1
University of Sargodha
Google & PageRank
Larry Page & Sergy Brin
2
Search Engine
Crawler Page
Module Repository
WWW
Indexing
Module 1
5
Query
Module 3
2 Ranking
Indexes
4 Module
Content Special-
Structure Purpose
3
PageRank
Essentially, Google interprets a link from page A
to page B as a vote, by page A, for page B.
BUT these votes doesn’t weigh the same, because
Google also analyzes the page that casts the vote.
4
Random Surfer Model
PageRank™ is considered as a model of user
behaviour, where a surfer clicks on links at
random with no regard towards content.
The random surfer visits a web page with a
certain probability which derives from the
page's PageRank.
The probability that the random surfer clicks
on one link is solely given by the number of
links on that page.
This is why one page's PageRank is not
completely passed on to a page it links to, but
is devided by the number of links on the page.
5
Random Surfer Model
So, the probability for the random surfer
reaching one page is the sum of probabilities
for the random surfer following links to this
page.
This probability is reduced by the damping
factor d. The justification within the Random
Surfer Model, therefore, is that the surfer
does not click on an infinite number of links,
but gets bored sometimes and jumps to
another page at random.
6
Damping Factor
The probability for the random surfer not
stopping to click on links is given by the damping
factor d, which depends on probability therefore,
is set between 0 and 1
The higher d is, the more likely will the random
surfer keep clicking links. Since the surfer jumps
to another page at random after he stopped
clicking links, the probability therefore is
implemented as a constant (1-d) into the
algorithm.
Regardless of inbound links, the probability for
the random surfer jumping to a page is always (1-
d), so a page has always a minimum PageRank.
7
Graph Data: Media Networks
9
Web as a Graph
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
I teach a
class on
Networks. CS224W:
Classes are
in the
Gates
building Computer
Science
Department
at Stanford
Stanford
University
10
Web as a Directed Graph
11
Broad Question
How to organize the Web?
First try: Human curated
Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval investigates:
Find relevant docs in a small
and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents,
random things, web spam, etc.
12
Web Search: 2 Challenges
2 challenges of web search:
(1) Web contains many sources of information
Who to “trust”?
Trick: Trustworthy pages may point to each other!
13
Ranking Nodes on the Graph
All web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
14
Link Analysis Algorithms
We will cover the following Link Analysis
approaches for computing importances
of nodes in a graph:
Page Rank
Topic-Specific (Personalized) Page Rank
Web Spam Detection Algorithms
15
PageRank:
The “Flow” Formulation
Links as Votes
Idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link
17
Example: PageRank Scores
A
B
3.3 C
38.4
34.3
D
E F
3.9
8.1 3.9
1.6
1.6 1.6 1.6
1.6
18
Simple Recursive Formulation
Each link’s vote is proportional to the
importance of its source page
j rj/3
rj = ri/3+rk/4
rj/3 rj/3
19
PageRank: The “Flow” Model
A “vote” from an important The web in 1839
i j di
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
… out-degree of node
20
Solving the Flow Equations
Flow equations:
3
equations, 3 unknowns, ry = ry /2 + ra /2
no constants ra = ry /2 + rm
rm = ra /2
No unique solution
All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness:
Solution:
Gaussian elimination method works for
small examples, but we need a better method
for large web-size graphs
We need a new formulation!
21
PageRank: Matrix Formulation
Stochastic adjacency matrix
Let page has out-links
If , then else
is a column stochastic matrix
Columns sum to 1
Rank vector : vector with an entry per page
is the importance score of page
22
Example
ri
Remember
the flow equation: rj d
Flow equation in the matrix form i j i
j rj
. =
ri
1/3
M . r = r
23
Eigenvector Formulation
The
flow equations can be written
So the rank vector r is an eigenvector of the
stochastic web matrix M
In fact, its first or principal eigenvector,
with corresponding eigenvalue 1 NOTE:
x is an
eigenvector with
Largest eigenvalue of M is 1 since M is the corresponding
eigenvalue λ if:
column stochastic (with non-negative entries)
We know r is unit length and each column of M
sums to one, so
ry = ry /2 + ra /2 y ½ ½ 0 y
ra = ry /2 + rm a = ½ 0 1 a
m 0 ½ 0 m
rm = ra /2
25
Power Iteration Method
Given a web graph with n nodes, where the
nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
Suppose there are N web pages (t )
ri
( t 1)
Initialize: r(0) = [1/N,….,1/N]T rj
i j di
Iterate: r(t+1) = M ∙ r(t)
di …. out-degree of node i
Stop when |r(t+1) – r(t)|1 <
|x|1 = 1≤i≤N|xi| is the L1 norm
Can use any other vector norm, e.g., Euclidean
26
PageRank: How to solve?
y a m
Power Iteration: y y ½ ½ 0
Set /N a ½ 0 1
a m m 0 ½ 0
1:
2: ry = ry /2 + ra /2
Goto 1 ra = ry /2 + rm
Example: rm = ra /2
Iteration 0, 1, 2, …
27
PageRank: How to solve?
y a m
Power Iteration: y y ½ ½ 0
Set /N a ½ 0 1
a m m 0 ½ 0
1:
2: ry = ry /2 + ra /2
Goto 1 ra = ry /2 + rm
Example: rm = ra /2
Iteration 0, 1, 2, …
28
Details!
Why Power Iteration works? (1)
Power iteration:
A method for finding dominant eigenvector (the
vector corresponding to the largest eigenvalue)
29
Random Walk Interpretation
Imagine a random web surfer: i1 i2 i3
30
The Stationary Distribution
i1 i2 i3
Where is the surfer at time t+1?
Follows a link uniformly at random
j
p (t 1) M p (t )
Suppose the random walk reaches a state
then is stationary distribution of a random walk
Our original rank vector satisfies
So, is a stationary distribution for
the random walk
31
Existence and Uniqueness
A central result from the theory of random
walks (a.k.a. Markov processes):
32
PageRank:
The Google Formulation
PageRank: Three Questions
(t )
ri
( t 1)
rj
i j di
or
equivalently r Mr
Does this converge?
34
Does this converge?
(t )
ri
( t 1)
a b rj
i j di
Example:
ra 1 0 1 0
=
rb 0 1 0 1
Iteration 0, 1, 2, …
35
Does it converge to what we want?
(t )
ri
( t 1)
a b rj
i j di
Example:
ra 1 0 0 0
=
rb 0 1 0 0
Iteration 0, 1, 2, …
36
PageRank: Problems
Dead end
2 problems:
(1) Some pages are
dead ends (have no out-links)
Random walk has “nowhere” to go to
Such pages cause importance to “leak out” Spider
trap
a m a m
39
Problem: Dead Ends
y a m
Power Iteration: y y ½ ½ 0
Set a ½ 0 0
a m m 0 ½ 0
And iterate ry = ry /2 + ra /2
ra = ry /2
Example: rm = ra /2
Here the PageRank “leaks” out since the matrix is not stochastic.
40
Solution: Always Teleport!
Teleports: Follow random teleport links with
probability 1.0 from dead-ends
Adjust matrix accordingly
y y
a m a m
y a m y a m
y ½ ½ 0 y ½ ½ ⅓
a ½ 0 0 a ½ 0 ⅓
m 0 ½ 0 m 0 ½ ⅓
41
Why Teleports Solve the Problem?
Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
Spider-traps are not a problem, but with traps
PageRank scores are not what we want
Solution: Never get stuck in a spider trap by
teleporting out of it in a finite number of steps
Dead-ends are a problem
The matrix is not column stochastic so our initial
assumptions are not met
Solution: Make matrix column stochastic by always
teleporting when there is nowhere else to go
42
Solution: Random Teleports
Google’s solution that does it all:
At each step, random surfer has two options:
With probability , follow a link at random
With probability 1-, jump to some random page
[1/N]NxN…N by N matrix
We have a recursive problem: where all entries are 1/N
44
Random Teleports ( = 0.8)
M [1/N]NxN
7/15
y 1/2 1/2 0 1/3 1/3 1/3
0.8 1/2 0 0 + 0.2 1/3 1/3 1/3
0 1/2 1 1/3 1/3 1/3
1/
5
7/1
15
5
7/1
1/
13/15
a 7/15 1/15 1/15
7/15
a m 1/15 7/15 13/15
1/15
m
1/
15 A
45
How do we actually compute
the PageRank?
Computing Page Rank
Key step is matrix-vector multiplication
rnew = A ∙ rold
Easy if we have enough main memory to
hold A, rold, rnew
Say N = 1 billion pages
A = ∙M + (1-) [1/N]NxN
We need 4 bytes for ½ ½ 0 1/3 1/3 1/3
each entry (say) A = 0.8 ½ 0 0 +0.2 1/3 1/3 1/3
0 ½ 1 1/3 1/3 1/3
2 billion entries for
vectors, approx 8GB
7/15 7/15 1/15
Matrix A has N2 entries = 7/15 1/15 1/15
1018 is a large number! 1/15 7/15 13/15
47
Matrix Formulation
Suppose there are N pages
Consider page i, with di out-links
We have Mji = 1/|di| when i → j
and Mji = 0 otherwise
The random teleport is equivalent to:
Adding a teleport link from i to every other page
and setting transition probability to (1-)/N
Reducing the probability of following each
out-link from 1/|di| to /|di|
Equivalent: Tax each page a fraction (1-) of its
score and redistribute evenly
48
Rearranging the Equation
, where
since
So we get:
50
PageRank: The Complete Algorithm