
EXPERIMENT NO. 10

AIM:-
Implementation of Page Rank/Hits Algorithm.

THEORY:-
Page Rank Algorithm:-
PageRank (PR) is an algorithm used by Google Search to rank websites in their
search engine results. PageRank was named after Larry Page, one of the
founders of Google. PageRank is a way of measuring the importance of website
pages. According to Google: PageRank works by counting the number and
quality of links to a page to determine a rough estimate of how important the
website is. The underlying assumption is that more important websites are
likely to receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm the company used, and it is the best known. Note that this centrality measure is not implemented for multigraphs.
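For reference, the original PageRank paper by Brin and Page defines the rank of a page A in terms of the pages T1...Tn that link to it:

PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )

where d is a damping factor (usually set to 0.85) and C(T) is the number of outbound links on page T.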

Algorithm
The PageRank algorithm outputs a probability distribution used to represent the
likelihood that a person randomly clicking on links will arrive at any particular
page. PageRank can be calculated for collections of documents of any size. It is
assumed in several research papers that the distribution is evenly divided among
all documents in the collection at the beginning of the computational process.
The PageRank computations require several passes, called “iterations”, through
the collection to adjust approximate PageRank values to more closely reflect the
theoretical true value.
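To make the iterative process concrete, here is a minimal Python sketch of such a computation. The three-page link structure (Home, Product, More, echoing the example below) and the damping factor d = 0.85 are illustrative assumptions, not part of the original example:

# Minimal PageRank power-iteration sketch (illustrative; the graph below is assumed).
links = {
    'Home':    ['Product', 'More'],
    'Product': ['Home'],
    'More':    ['Home'],
}
d = 0.85                                  # damping factor (assumed)
n = len(links)
pr = {page: 1.0 / n for page in links}    # start with an even distribution

for _ in range(50):                       # the "iterations" (passes) described above
    new_pr = {}
    for page in links:
        # rank flowing in: each linking page shares its PR equally among its outlinks
        incoming = sum(pr[src] / len(out) for src, out in links.items() if page in out)
        new_pr[page] = (1 - d) / n + d * incoming
    pr = new_pr

print(pr)   # converges to a probability distribution summing to 1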
Example:
The "More" page is still getting a smaller share of the vote than in example 7, of course, but now the "Product" page has kept three quarters of its vote within our site, unlike example 10, where it was giving away fully half of its vote to the external site!
Keeping just this small extra fraction of the vote within our site has had a very nice effect on the Home Page too: a PR of 2.28 compared with just 1.66 in the earlier example.
Observation: increasing the internal links in your site can minimise the damage to your PR when you give away votes by linking to external sites.

Principle:
If a particular page is highly important, use a hierarchical structure with the important page at the "top". Where a group of pages may contain outward links, increase the number of internal links to retain as much PR as possible. Where a group of pages does not contain outward links, the number of internal links in the site has no effect on the site's average PR; you might as well use a link structure that gives the user the best navigational experience.
Page Hits Algorithm

Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm, developed by Jon Kleinberg, that rates webpages. It analyses the web's link structure to discover and rank the webpages relevant to a particular search.
HITS uses hubs and authorities to define a recursive relationship between webpages. Before understanding the HITS Algorithm, we first need to know about Hubs and Authorities.

• Given a query to a search engine, the set of highly relevant webpages is called the Root set. These are the potential Authorities.
• Pages that are not very relevant but point to pages in the Root set are called Hubs. Thus, an Authority is a page that many hubs link to, whereas a Hub is a page that links to many authorities.
Algorithm –

-> Let the number of iterations be k.
-> Each node is assigned a Hub score = 1 and an Authority score = 1.
-> Repeat k times:

• Hub update: each node's Hub score = Σ (Authority score of each node it points to).
• Authority update: each node's Authority score = Σ (Hub score of each node pointing to it).
• Normalize the scores by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and each Authority score by the square root of the sum of the squares of all Authority scores. (optional)
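These update rules can be sketched directly in Python. The edge list below mirrors the one used in the Page Hits program later in this experiment; k = 3 and the update order (hub first, then authority, as listed above) are assumptions for illustration, so intermediate values may differ from implementations that update authorities first:

edges = [('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'), ('D', 'C'),
         ('E', 'D'), ('E', 'B'), ('E', 'F'), ('E', 'C'), ('F', 'C'),
         ('F', 'H'), ('G', 'A'), ('G', 'C'), ('H', 'A')]
nodes = sorted({n for edge in edges for n in edge})
hub = {n: 1.0 for n in nodes}    # every node starts with Hub score = 1
auth = {n: 1.0 for n in nodes}   # ... and Authority score = 1

k = 3
for _ in range(k):
    # Hub update: sum the Authority scores of the nodes each node points to
    hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
    # Authority update: sum the Hub scores of the nodes pointing to each node
    auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}

print('Hub Scores:', hub)
print('Authority Scores:', auth)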

On running the HITS Algorithm on the example graph (without normalization):

[Figure: the example directed graph on nodes A–H]

Initially,
Hub Scores:    Authority Scores:
A -> 1         A -> 1
B -> 1         B -> 1
C -> 1         C -> 1
D -> 1         D -> 1
E -> 1         E -> 1
F -> 1         F -> 1
G -> 1         G -> 1
H -> 1         H -> 1

After 1st iteration,
Hub Scores:    Authority Scores:
A -> 1         A -> 3
B -> 2         B -> 2
C -> 1         C -> 4
D -> 2         D -> 2
E -> 4         E -> 1
F -> 1         F -> 1
G -> 2         G -> 0
H -> 1         H -> 1

After 2nd iteration,
Hub Scores:    Authority Scores:
A -> 2         A -> 4
B -> 5         B -> 6
C -> 3         C -> 7
D -> 6         D -> 5
E -> 9         E -> 2
F -> 1         F -> 4
G -> 7         G -> 0
H -> 3         H ->

After 3rd iteration,
Hub Scores:    Authority Scores:
A -> 5         A -> 13
B -> 9         B -> 15
C -> 4         C -> 27
D -> 13        D -> 11
E -> 22        E -> 5
F -> 1         F -> 9
G -> 11        G -> 0
H -> 4         H -> 3

Program: Page Rank

import networkx as nx
import numpy as np
from numpy import array
import matplotlib.pyplot as plt

# Read the edge list: one comma-separated edge per line
with open('./dataset/HITS.txt') as f:
    lines = f.readlines()

G = nx.DiGraph()

for line in lines:
    t = tuple(line.strip().split(','))
    G.add_edge(*t)

# HITS: hub and authority scores, sorted by node label
h, a = nx.hits(G, max_iter=100)
h = dict(sorted(h.items(), key=lambda x: x[0]))
a = dict(sorted(a.items(), key=lambda x: x[0]))
print(np.round(list(a.values()), 3))
print(np.round(list(h.values()), 3))

# PageRank scores, sorted by node label
pr = nx.pagerank(G)
pr = dict(sorted(pr.items(), key=lambda x: x[0]))
print(np.round(list(pr.values()), 3))

# SimRank similarity matrix over all node pairs
sim = nx.simrank_similarity(G)
lol = [[sim[u][v] for v in sorted(sim[u])] for u in sorted(sim)]
sim_array = np.round(array(lol), 3)
print(sim_array)

# Draw the graph and save it to a file
nx.draw(G, with_labels=True, node_size=2000,
        edge_color='#eb4034', width=3, font_size=16, font_weight=500,
        arrowsize=20, alpha=0.8)
plt.savefig("graph.png")
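The program above expects ./dataset/HITS.txt to store the graph as one comma-separated edge per line. The actual dataset file is not reproduced here; a file in the expected format would look like this (hypothetical contents):

A,D
B,C
B,E
C,A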

Page Hits

import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()

G.add_edges_from([('A', 'D'), ('B', 'C'), ('B', 'E'), ('C', 'A'),
                  ('D', 'C'), ('E', 'D'), ('E', 'B'), ('E', 'F'),
                  ('E', 'C'), ('F', 'C'), ('F', 'H'), ('G', 'A'),
                  ('G', 'C'), ('H', 'A')])

# Draw the example graph
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, with_labels=True)

# normalized=True scales the scores so that each set sums to 1
hubs, authorities = nx.hits(G, max_iter=50, normalized=True)
print("Hub Scores: ", hubs)
print("Authority Scores: ", authorities)

Output:
Page Rank: [output of the PageRank program]
Page Hits: [output of the HITS program]
