
PageRank

Advanced Analysis of Algorithms
Dept of CS & IT
University of Sargodha
Google & PageRank
 Larry Page & Sergey Brin
 Over 200 million queries per day
 Inspired by academic citation analysis
Search Engine

[Figure: search engine architecture. The Crawler Module fetches pages from the WWW into a Page Repository; the Indexing Module builds Content, Structure, and Special-Purpose indexes; the Query Module answers user queries with help from the Ranking Module, which consults the indexes.]
PageRank
 Essentially, Google interprets a link from page A
to page B as a vote, by page A, for page B.
 BUT these votes don't all carry the same weight,
because Google also analyzes the page that casts
the vote.
 The PageRank™ algorithm does not rank a whole
website; PageRank™ is determined for each page
individually. Furthermore, the PageRank™ of page A
is recursively defined by the PageRank™ of the
pages that link to page A.
Random Surfer Model
 PageRank™ can be seen as a model of user
behaviour, where a surfer clicks on links at
random with no regard for content.
 The random surfer visits a web page with a
certain probability, which derives from the
page's PageRank.
 The probability that the random surfer clicks
on any one link is determined solely by the
number of links on that page.
 This is why a page's PageRank is not passed on
in full to a page it links to, but is divided by
the number of links on the page.
Random Surfer Model
 So, the probability of the random surfer
reaching a page is the sum of the probabilities
of the surfer following links to that page.
 This probability is reduced by the damping
factor d. The justification within the Random
Surfer Model is that the surfer does not click
through an infinite number of links, but gets
bored at some point and jumps to another page
at random.
Damping Factor
 The probability that the random surfer keeps
clicking on links is given by the damping factor
d, which, being a probability, is set between 0
and 1.
 The higher d is, the more likely the random
surfer is to keep clicking links. Since the surfer
jumps to a random page after they stop clicking
links, this probability enters the algorithm as
the constant term (1-d).
 Regardless of inbound links, the probability of
the random surfer jumping to a given page is
always (1-d), so every page has a minimum
PageRank.
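The random-surfer model above can be sketched as a short simulation: the fraction of time a long random walk spends on each page approximates that page's PageRank. This is only an illustrative sketch, not how PageRank is computed in practice; the three-page graph and the value d = 0.85 are assumptions chosen for the example.

```python
import random

random.seed(42)

# Hypothetical toy web: each page maps to its list of out-links
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
d = 0.85            # damping factor (assumed value between 0 and 1)
steps = 200_000

visits = {page: 0 for page in links}
page = "y"
for _ in range(steps):
    visits[page] += 1
    if random.random() < d:
        # keep clicking: follow one of the page's out-links at random
        page = random.choice(links[page])
    else:
        # get bored (probability 1-d): jump to a random page
        page = random.choice(list(links))

# Visit frequency of each page approximates its PageRank
rank = {page: count / steps for page, count in visits.items()}
```

Note how even a page with few in-links keeps a nonzero score: the (1-d) teleport guarantees a minimum visit frequency for every page.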
Graph Data: Media Networks

[Figure: connections between political blogs, showing polarization of the network; Adamic-Glance, 2005]
Web as a Graph
 Web as a directed graph:
 Nodes: Webpages
 Edges: Hyperlinks

[Figure: example pages as nodes — a personal page ("I teach a class on Networks. Classes are in the Gates building") linking to the CS224W page, the Computer Science Department at Stanford, and Stanford University.]
Web as a Directed Graph
Broad Question
 How to organize the Web?
 First try: Human-curated web directories
 Yahoo, DMOZ, LookSmart
 Second try: Web Search
 Information Retrieval investigates:
finding relevant docs in a small and trusted set
 Newspaper articles, patents, etc.
 But: the Web is huge, full of untrusted
documents, random things, web spam, etc.
Web Search: 2 Challenges
2 challenges of web search:
 (1) The Web contains many sources of information.
Who to "trust"?
 Trick: Trustworthy pages may point to each other!
 (2) What is the "best" answer to the query
"newspaper"?
 No single right answer
 Trick: Pages that actually know about newspapers
might all be pointing to many newspapers
Ranking Nodes on the Graph
 All web pages are not equally "important"
 www.joe-schmoe.com vs. www.stanford.edu
 There is large diversity in the web-graph
node connectivity.
 Let's rank the pages by the link structure!
Link Analysis Algorithms
 We will cover the following link analysis
approaches for computing the importance
of nodes in a graph:
 PageRank
 Topic-Specific (Personalized) PageRank
 Web Spam Detection Algorithms
PageRank:
The “Flow” Formulation
Links as Votes
 Idea: Links as votes
 A page is more important if it has more links
 In-coming links? Out-going links?
 Think of in-links as votes:
 www.stanford.edu has 23,400 in-links
 www.joe-schmoe.com has 1 in-link
 Are all in-links equal?
 Links from important pages count more
 Recursive question!
Example: PageRank Scores

[Figure: example graph with PageRank scores — B: 38.4, C: 34.3, A: 3.3, D: 3.9, E: 8.1, F: 3.9, and five peripheral nodes with 1.6 each.]
Simple Recursive Formulation
 Each link's vote is proportional to the
importance of its source page
 If page j with importance rj has n out-links,
each link gets rj / n votes
 Page j's own importance is the sum of the
votes on its in-links

[Figure: page j has in-links from i (which has 3 out-links) and k (which has 4 out-links), so

rj = ri/3 + rk/4

and each of j's own 3 out-links carries rj/3.]
PageRank: The "Flow" Model
 A "vote" from an important page is worth more
 A page is important if it is pointed to by
other important pages
 Define a "rank" rj for page j:

rj = Σi→j ri / di

di … out-degree of node i

 Example ("the web in 1839"): pages y, a, m with
y → {y, a}, a → {y, m}, m → {a}
"Flow" equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Solving the Flow Equations
 Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
 3 equations, 3 unknowns, no constants
 No unique solution
 All solutions equivalent modulo the scale factor
 Additional constraint forces uniqueness:
ry + ra + rm = 1
 Solution: ry = 2/5, ra = 2/5, rm = 1/5
 Gaussian elimination works for small examples,
but we need a better method for large,
web-scale graphs
 We need a new formulation!
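For a graph this small, the flow equations plus the normalization constraint can be solved directly. The sketch below (using NumPy, with the y/a/m example written as a column-stochastic matrix) stacks the constraint Σ rj = 1 under the system r = M·r; this works for 3 pages but not for web-scale graphs, which is exactly why a new formulation is needed.

```python
import numpy as np

# Column-stochastic matrix for the y, a, m example (columns: y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# r = M r alone has no unique solution; append the constraint sum(r) = 1
A = np.vstack([M - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
# r is approximately [2/5, 2/5, 1/5]
```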
PageRank: Matrix Formulation
 Stochastic adjacency matrix M
 Let page i have di out-links
 If i → j, then Mji = 1/di, else Mji = 0
 M is a column stochastic matrix
 Columns sum to 1
 Rank vector r: a vector with an entry per page
 ri is the importance score of page i
 Σi ri = 1
 The flow equations can be written

r = M ∙ r,   i.e.,   rj = Σi→j ri / di
Example
 Remember the flow equation: rj = Σi→j ri / di
 Flow equation in matrix form: M ∙ r = r
 Suppose page i links to 3 pages, including j.
Then column i of M has the entry 1/3 in row j,
and page j receives ri/3 from page i.
Eigenvector Formulation
 The flow equations can be written r = M ∙ r
 So the rank vector r is an eigenvector of the
stochastic web matrix M
 In fact, it is M's first or principal eigenvector,
with corresponding eigenvalue 1
 NOTE: x is an eigenvector of A with
corresponding eigenvalue λ if: A ∙ x = λ ∙ x
 The largest eigenvalue of M is 1, since M is
column stochastic (with non-negative entries)
 We know r has unit (L1) length and each column
of M sums to one, so M ∙ r is again a valid
rank vector
 We can now efficiently solve for r!
The method is called Power iteration
Example: Flow Equations & M
 Graph: y → {y, a}, a → {y, m}, m → {a}

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

r = M∙r:

  ry     ½ ½ 0   ry
  ra  =  ½ 0 1 ∙ ra
  rm     0 ½ 0   rm

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Power Iteration Method
 Given a web graph with N nodes, where the
nodes are pages and the edges are hyperlinks
 Power iteration: a simple iterative scheme
 Initialize: r(0) = [1/N,…,1/N]T
 Iterate: r(t+1) = M ∙ r(t), that is,
rj(t+1) = Σi→j ri(t) / di
di … out-degree of node i
 Stop when |r(t+1) – r(t)|1 < ε
 |x|1 = Σ1≤i≤N |xi| is the L1 norm
Can use any other vector norm, e.g., Euclidean
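The steps above can be sketched directly in code. This is a minimal sketch for the small y/a/m example (function name and tolerance are our own choices, not from the slides):

```python
import numpy as np

def power_iterate(M, eps=1e-10):
    """Iterate r(t+1) = M r(t) until the L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]^T
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:   # L1-norm stopping rule
            return r_next
        r = r_next

# The y, a, m example from the slides (columns sum to 1)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iterate(M)
# r converges to [2/5, 2/5, 1/5]
```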
PageRank: How to solve?
 Power Iteration:
 Set r(0) = [1/N, …, 1/N]T
 1: r(t+1) = M ∙ r(t)
 2: Goto 1 until convergence
 Example: y → {y, a}, a → {y, m}, m → {a}

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

Iteration 0, 1, 2, 3, …
ry   1/3   1/3   5/12    9/24   …   6/15
ra   1/3   3/6   1/3    11/24   …   6/15
rm   1/3   1/6   3/12    1/6    …   3/15
Why Does Power Iteration Work?
 Power iteration is a method for finding the
dominant eigenvector (the eigenvector that
corresponds to the largest eigenvalue)
Random Walk Interpretation
 Imagine a random web surfer:
 At any time t, the surfer is on some page i
 At time t+1, the surfer follows an out-link
from i uniformly at random
 Ends up on some page j linked from i
 Process repeats indefinitely

rj = Σi→j ri / dout(i)

 Let:
 p(t) … vector whose i-th coordinate is the
probability that the surfer is at page i at time t
 So, p(t) is a probability distribution over pages
The Stationary Distribution
 Where is the surfer at time t+1?
 Follows a link uniformly at random:
p(t+1) = M ∙ p(t)
 Suppose the random walk reaches a state
p(t+1) = M ∙ p(t) = p(t);
then p(t) is a stationary distribution of the
random walk
 Our original rank vector r satisfies r = M ∙ r
 So, r is a stationary distribution for
the random walk
Existence and Uniqueness
 A central result from the theory of random
walks (a.k.a. Markov processes):

For graphs that satisfy certain conditions,
the stationary distribution is unique and will
eventually be reached no matter what the
initial probability distribution at time t = 0 is
PageRank:
The Google Formulation

PageRank: Three Questions

rj(t+1) = Σi→j ri(t) / di
or equivalently r = M ∙ r

 Does this converge?
 Does it converge to what we want?
 Are results reasonable?
Does this converge?
 Example: two pages a ⇄ b, each linking only
to the other

rj(t+1) = Σi→j ri(t) / di

Iteration 0, 1, 2, …
ra   1   0   1   0   …
rb   0   1   0   1   …

The scores oscillate and never converge.
Does it converge to what we want?
 Example: a → b, where b has no out-links

rj(t+1) = Σi→j ri(t) / di

Iteration 0, 1, 2, …
ra   1   0   0   0   …
rb   0   1   0   0   …

All the score leaks out; r converges to 0.
PageRank: Problems
2 problems:
 (1) Some pages are dead ends
(have no out-links)
 Random walk has "nowhere" to go to
 Such pages cause importance to "leak out"
 (2) Spider traps
(all out-links are within the group)
 Random walk gets "stuck" in the trap
 Eventually spider traps absorb all importance
Problem: Spider Traps
 Power Iteration:
 Set r(0) = [1/N, …, 1/N]T
 And iterate r(t+1) = M ∙ r(t)
 Example: y → {y, a}, a → {y, m}, m → {m}
(m is a spider trap)

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   1

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm

Iteration 0, 1, 2, 3, …
ry   1/3   2/6   3/12    5/24   …   0
ra   1/3   1/6   2/12    3/24   …   0
rm   1/3   3/6   7/12   16/24   …   1

All the PageRank score gets "trapped" in node m.
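The trap can be reproduced numerically. A small sketch of the iteration above (the iteration count of 200 is arbitrary, just enough to show the limit):

```python
import numpy as np

# m links only to itself, so column m is [0, 0, 1]: a spider trap
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

r = np.full(3, 1.0 / 3)      # r(0) = [1/3, 1/3, 1/3]
for _ in range(200):
    r = M @ r
# all the score drains into the trap: r approaches [0, 0, 1]
```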
Solution: Teleports!
 The Google solution for spider traps: at each
time step, the random surfer has two options
 With probability β, follow a link at random
 With probability 1-β, jump to some random page
 Common values for β are in the range 0.8 to 0.9
 The surfer will teleport out of the spider trap
within a few time steps
Problem: Dead Ends
 Power Iteration:
 Set r(0) = [1/N, …, 1/N]T
 And iterate r(t+1) = M ∙ r(t)
 Example: y → {y, a}, a → {y, m}, m has no
out-links

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   0

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2

Iteration 0, 1, 2, 3, …
ry   1/3   2/6   3/12   5/24   …   0
ra   1/3   1/6   2/12   3/24   …   0
rm   1/3   1/6   1/12   2/24   …   0

Here the PageRank "leaks" out since the matrix
is not column stochastic.
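The leak is easy to see by tracking the total mass Σj rj over the iterations. A sketch of the example above (50 iterations is an arbitrary choice for illustration):

```python
import numpy as np

# m has no out-links: column m is all zeros, so M is NOT column stochastic
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1.0 / 3)
totals = [r.sum()]
for _ in range(50):
    r = M @ r
    totals.append(r.sum())
# the total PageRank mass never grows and leaks away toward 0
```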
Solution: Always Teleport!
 Teleports: follow random teleport links with
probability 1.0 from dead ends
 Adjust the matrix accordingly: the all-zero
column of the dead end m becomes uniform

Before:             After:
      y   a   m           y   a   m
  y   ½   ½   0       y   ½   ½   ⅓
  a   ½   0   0       a   ½   0   ⅓
  m   0   ½   0       m   0   ½   ⅓
Why Teleports Solve the Problem?
Why are dead ends and spider traps a problem,
and why do teleports solve it?
 Spider traps are not a problem as such, but with
traps the PageRank scores are not what we want
 Solution: never get stuck in a spider trap by
teleporting out of it in a finite number of steps
 Dead ends are a problem
 The matrix is not column stochastic, so our
initial assumptions are not met
 Solution: make the matrix column stochastic by
always teleporting when there is nowhere else to go
Solution: Random Teleports
 Google's solution that does it all:
at each step, the random surfer has two options:
 With probability β, follow a link at random
 With probability 1-β, jump to some random page

 PageRank equation [Brin-Page, '98]:

rj = Σi→j β ri / di + (1-β) 1/N

di … out-degree of node i

 This formulation assumes that M has no dead ends. We can either
preprocess the matrix M to remove all dead ends or explicitly follow random
teleport links with probability 1.0 from dead ends.
The Google Matrix
 PageRank equation [Brin-Page, '98]:
rj = Σi→j β ri / di + (1-β) 1/N
 The Google Matrix A:

A = β M + (1-β) [1/N]NxN

[1/N]NxN … N-by-N matrix where all entries are 1/N
 We again have a recursive problem: r = A ∙ r
And the Power method still works!
 What is β?
 In practice β = 0.8, 0.9 (make 5 steps on
average, then jump)
Random Teleports (β = 0.8)
 Example: y → {y, a}, a → {y, m}, m → {m}
(m is a spider trap)

A = 0.8 ∙ M + 0.2 ∙ [1/N]NxN

        ½ ½ 0          1/3 1/3 1/3
A = 0.8 ½ 0 0  + 0.2   1/3 1/3 1/3
        0 ½ 1          1/3 1/3 1/3

        A
  y   7/15  7/15   1/15
  a   7/15  1/15   1/15
  m   1/15  7/15  13/15

Iteration 0, 1, 2, 3, …
ry   1/3   0.33   0.24   0.26   …    7/33
ra   1/3   0.20   0.20   0.18   …    5/33
rm   1/3   0.46   0.52   0.56   …   21/33
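The example above can be checked by building the Google matrix and iterating. A sketch (100 iterations is an arbitrary choice; with β = 0.8 the error shrinks by at least a factor 0.8 per step):

```python
import numpy as np

beta = 0.8
N = 3
# y -> {y, a}, a -> {y, m}, m -> {m}  (m is a spider trap)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

# Google matrix: beta*M plus the uniform teleport matrix (1-beta)/N
A = beta * M + (1 - beta) / N

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
# converges to [7/33, 5/33, 21/33]
```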
How do we actually compute the PageRank?

Computing PageRank
 Key step is matrix-vector multiplication
 rnew = A ∙ rold
 Easy if we have enough main memory to
hold A, rold, rnew
 Say N = 1 billion pages
 We need 4 bytes for each entry (say)
 2 billion entries for the vectors rold and
rnew, approx 8GB
 But matrix A has N² entries
 10^18 is a large number!

A = β∙M + (1-β)[1/N]NxN

        ½ ½ 0          1/3 1/3 1/3
A = 0.8 ½ 0 0  + 0.2   1/3 1/3 1/3
        0 ½ 1          1/3 1/3 1/3

    7/15  7/15   1/15
  = 7/15  1/15   1/15
    1/15  7/15  13/15
Matrix Formulation
 Suppose there are N pages
 Consider page i, with di out-links
 We have Mji = 1/|di| when i → j
and Mji = 0 otherwise
 The random teleport is equivalent to:
 Adding a teleport link from i to every other
page, with transition probability (1-β)/N
 Reducing the probability of following each
out-link from 1/|di| to β/|di|
 Equivalently: tax each page a fraction (1-β) of
its score and redistribute the tax evenly
Rearranging the Equation
 r = A ∙ r, where Aji = β Mji + (1-β)/N

rj = Σi Aji ri
   = Σi [β Mji + (1-β)/N] ri
   = β Σi Mji ri + (1-β)/N Σi ri
   = β Σi Mji ri + (1-β)/N, since Σi ri = 1

 So we get: r = β M ∙ r + [(1-β)/N]N

Note: Here we assumed M has no dead ends
[x]N … a vector of length N with all entries x
Sparse Matrix Formulation
 We just rearranged the PageRank equation:
r = β M ∙ r + [(1-β)/N]N
 where [(1-β)/N]N is a vector with all N entries (1-β)/N
 M is a sparse matrix! (with no dead ends)
 10 links per node on average, approx 10N entries
 So in each iteration, we need to:
 Compute rnew = β M ∙ rold
 Add a constant value (1-β)/N to each entry in rnew
 Note: if M contains dead ends, then Σj rjnew < 1 and
we also have to renormalize rnew so that it sums to 1
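The two steps per iteration can be sketched without ever materializing the dense Google matrix A; only the sparse out-link lists are stored. A minimal sketch on the earlier spider-trap example (graph and iteration count are illustrative):

```python
beta = 0.8
# Out-link lists only; the dense Google matrix A is never built
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
N = len(out_links)

r = {page: 1.0 / N for page in out_links}
for _ in range(100):
    # Step 1: r_new = beta * M . r_old, touching only actual links
    r_new = {page: 0.0 for page in out_links}
    for i, targets in out_links.items():
        share = beta * r[i] / len(targets)
        for j in targets:
            r_new[j] += share
    # Step 2: add the constant (1 - beta)/N to every entry
    for page in r_new:
        r_new[page] += (1 - beta) / N
    r = r_new
# matches the dense computation: r is about {y: 7/33, a: 5/33, m: 21/33}
```

The work per iteration is proportional to the number of links (roughly 10N on the web), not to the N² entries of A.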
PageRank: The Complete Algorithm
 Input: directed graph G (can have spider traps
and dead ends) and parameter β
 Output: PageRank vector r
 Set: rj(0) = 1/N, t = 1
 repeat until convergence (Σj |rj(t) – rj(t-1)| < ε):
 ∀j: r'j(t) = Σi→j β ri(t-1)/di
(r'j(t) = 0 if the in-degree of j is 0)
 Now re-insert the leaked PageRank:
∀j: rj(t) = r'j(t) + (1-S)/N, where S = Σj r'j(t)
 t = t + 1

If the graph has no dead ends, then the amount of leaked PageRank is 1-β. But since we have dead ends,
the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
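The complete algorithm can be sketched as follows. Note that this is an illustrative sketch on a hypothetical three-page graph with a dead end; a fixed iteration count stands in for the convergence test:

```python
beta = 0.8
# A graph with a dead end: page m has no out-links at all
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
N = len(out_links)

r = {page: 1.0 / N for page in out_links}
for _ in range(100):
    # r'_j = sum over in-links i -> j of beta * r_i / d_i
    # (pages with no in-links simply stay at 0 here)
    r_prime = {page: 0.0 for page in out_links}
    for i, targets in out_links.items():
        for j in targets:
            r_prime[j] += beta * r[i] / len(targets)
    # S < 1 because of both the teleport (1 - beta) and the dead-end leak
    S = sum(r_prime.values())
    # re-insert the leaked PageRank (1 - S), spread evenly over all pages
    r = {page: r_prime[page] + (1 - S) / N for page in out_links}
# r is a proper probability distribution despite the dead end
```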