
WEB STRUCTURE MINING:

CLEVER ALGORITHM

By Arya Telang
Sap ID: 60004210201
C31
INTRODUCTION
One recent system developed at IBM, Clever, is aimed at finding both
authoritative pages and hubs.
The best source for the requested information is called the “authority source”.
The Clever system identifies authoritative pages and hub pages by
creating weights.
A search can be viewed as having a goal of finding the best hubs and
authorities.
Because websites are developed without supervision, a user cannot tell
which webpages are accurate.
Pages that are of higher quality and free of false facts and errors are
referred to as the most authoritative.
A relevant article that is not error free is not something a user would
want to retrieve.
HITS: HYPERLINK-INDUCED TOPIC SEARCH
Components in this technique:
Based on a given set of keywords (found in a query), a set of
relevant pages (perhaps in the thousands) is found.

Hub and authority measures are associated with these pages.


Pages with the highest values are returned.
Terms used in this algorithm:
SE - Search engine used to find a small set of pages
R - Root set
P - Pages
B - Base set

A search engine, SE, is used to find a small set of pages, P, called the
root set (R), which satisfy the given query, q. This set is then expanded
into a larger set, the base set (B), by adding pages that link to or are
linked from pages in R. B is used to induce a subgraph of the Web. This
subgraph is the one that is actually examined to find the hubs and
authorities.
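
The expansion from root set to base set can be sketched in Python as follows. This is only an illustration: `out_links` is a hypothetical in-memory dictionary standing in for real link data, and the function names are made up for this sketch.

```python
def build_base_set(root_set, out_links):
    """Expand a root set R into a base set B.

    out_links maps each page to the set of pages it links to
    (hypothetical stand-in for link data gathered from the Web).
    """
    base = set(root_set)
    for page, targets in out_links.items():
        if page in root_set:
            # Add pages that a root page links to.
            base.update(targets)
        elif any(t in root_set for t in targets):
            # Add pages that link into the root set.
            base.add(page)
    return base


def induced_links(base, out_links):
    """Return the directed edges of the subgraph induced by B."""
    return {(p, q) for p in base for q in out_links.get(p, ()) if q in base}


# Tiny made-up link structure; "a" plays the role of the root set.
out_links = {"a": {"b"}, "b": {"c"}, "d": {"a"}, "e": {"f"}}
base = build_base_set({"a"}, out_links)
print(sorted(base))                            # ['a', 'b', 'd']
print(sorted(induced_links(base, out_links)))  # [('a', 'b'), ('d', 'a')]
```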
Terms used in this algorithm:
G(B, L) - The graph G, composed of vertices B and directed edges L (links)
Xp - Weight of page p used to find authorities
Yp - Weight of page p used to find hubs

Pages at the same site often point to each other, so the structure of the
links between these pages should not be used to help find hubs and
authorities. The algorithm therefore removes these links from the graph.
Hubs should point to many good authorities, and authorities should be
pointed to by many hubs. This observation is the basis for the weight
calculations shown in the algorithm.
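
Removing links between pages at the same site could look like the sketch below. It assumes that "same site" means "same hostname", which is a simplification chosen for this illustration and not something the slides specify; `drop_same_site_links` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def drop_same_site_links(links):
    """Keep only links whose source and target are on different hosts.

    Assumption: two pages belong to the same site when their URLs
    share a hostname.
    """
    return {
        (src, dst)
        for src, dst in links
        if urlsplit(src).hostname != urlsplit(dst).hostname
    }


# Made-up example URLs.
links = {
    ("http://a.example/x", "http://a.example/y"),  # same site, dropped
    ("http://a.example/x", "http://b.example/"),   # different sites, kept
}
print(drop_same_site_links(links))
```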
Weight calculations are done using an adjacency matrix.
The approach is to iteratively recalculate the weights until they
converge. The weights are normalized so that the sum of the squares of the
hub weights is 1 and the sum of the squares of the authority weights is 1.
Normally, between 5 and 10 hubs and between 5 and 10 authorities are found.
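
The iterate-and-normalize loop described above can be sketched in plain Python. This is a minimal illustration and not the slide's own listing: the names `hits` and `normalize`, the stopping tolerance `tol`, and the cap `max_iter` are assumptions made for the sketch.

```python
import math


def normalize(weights):
    """Scale weights so that the sum of their squares is 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)


def hits(adj, max_iter=100, tol=1e-9):
    """Return (hub, auth) weights for a 0/1 adjacency matrix.

    adj[i][j] == 1 means page i links to page j.
    max_iter and tol are illustrative choices, not values from the slides.
    """
    n = len(adj)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(max_iter):
        # Authority weight: sum of hub weights of the pages pointing to it.
        new_auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub weight: sum of authority weights of the pages it points to.
        new_hub = [sum(adj[i][j] * new_auth[j] for j in range(n)) for i in range(n)]
        new_auth = normalize(new_auth)
        new_hub = normalize(new_hub)
        change = max(
            max(abs(a - b) for a, b in zip(new_auth, auth)),
            max(abs(a - b) for a, b in zip(new_hub, hub)),
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Made-up 3-page graph: page 0 links to 1 and 2, page 1 links to 2.
hub, auth = hits([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
print(hub.index(max(hub)), auth.index(max(auth)))  # 0 2
```

In this made-up graph, page 0 comes out as the best hub and page 2 as the best authority, matching the idea that good hubs point to good authorities.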
ALGORITHM
Input:

Output:
EXAMPLE:
Identify the best hub and authority for the given adjacency matrix.
Calculate the hub and authority scores using the HITS algorithm for k=2.

Solution:
1) Convert the matrix into a graph.
2) Set up the weight vectors:
u -> Hub weight vector (u = A v)
v -> Authority weight vector (v = A^T u)
Assume u is initially a 4x1 vector of ones.
3) Calculate the v matrix from u, then calculate the u matrix.
Looking at the u and v matrices we see:
Hub: {N1, N2 (tie)}, {N3, N4 (tie)}
Authority: N3, {N1, N2, N4 (tie)}
4) Plot the graph for k=1.
5) Repeat the calculation for k=2 to obtain u' and v'.
Tie not resolved.
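
The k-round calculation used in the example can be sketched as below. The 3-page adjacency matrix and the names P1 to P3 are hypothetical and are NOT the matrix from the example above; the function names are also made up. As in the example, u starts as a vector of ones and no normalization is applied.

```python
def hits_k_rounds(adj, k):
    """Run k rounds of v = A^T u, then u = A v, starting from u = all ones."""
    n = len(adj)
    u = [1] * n  # hub weights
    v = [0] * n  # authority weights
    for _ in range(k):
        v = [sum(adj[i][j] * u[i] for i in range(n)) for j in range(n)]  # v = A^T u
        u = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]  # u = A v
    return u, v


def rank_with_ties(weights, names):
    """Group names by weight, highest weight first; equal weights share a group."""
    groups = {}
    for name, w in zip(names, weights):
        groups.setdefault(w, []).append(name)
    return [groups[w] for w in sorted(groups, reverse=True)]


# Hypothetical graph: P1 -> P2, P3; P2 -> P1, P3; P3 has no out-links.
adj = [[0, 1, 1],
       [1, 0, 1],
       [0, 0, 0]]
names = ["P1", "P2", "P3"]
u, v = hits_k_rounds(adj, 2)
print(u, v)                      # [9, 9, 0] [3, 3, 6]
print(rank_with_ties(u, names))  # [['P1', 'P2'], ['P3']]
print(rank_with_ties(v, names))  # [['P3'], ['P1', 'P2']]
```

Because P1 and P2 have symmetric links in this made-up graph, their weights stay equal in every round, which is one way a tie can remain unresolved as k grows.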
