Faculty of Graph Theory and Applications

Google PageRank Algorithm

(Project Report)

Prepared By

Kalishavali Shaik

N091390-CSE-I

Supervised By

Kalpana Gangwar Madam

Graph theory and Applications

Dated:

December 10, 2014

Abstract

This paper presents different parallel implementations of Google's PageRank algorithm. The purpose is to compare different methods for computing PageRank on large domains of the Web. The iterative algorithms used are the Power method and the Arnoldi method. We have implemented these algorithms in a parallel environment and created a basic WebCrawler to gather test data. Tests have then been carried out with the different algorithms using various test data. The explicitly restarted Arnoldi method was shown to be superior to the normal Arnoldi method as well as the Power method for high values of the dampening factor α. Results also show that load balancing our parallel implementation was usually quite ineffective. For smaller values of α, including 0.85 as Google uses, the Power method is preferable. It is usually somewhat slower, but the memory used is significantly less. For higher values of α, if very accurate results are needed, the restarted Arnoldi method is preferable.

The importance of a Webpage is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Webpages. This paper describes PageRank, a method for rating Webpages objectively and mechanically, effectively measuring the human interest and attention devoted to them.

Abbreviations

HITS: Hyperlink-Induced Topic Search
DWPR: Distance Weighted Page Ranking
WWW: World Wide Web

1. Introduction

The World Wide Web creates many new challenges for information retrieval. It is very large and heterogeneous. Current estimates are that there are over 150 million web pages with a doubling life of less than one year. More importantly, the web pages are extremely diverse, ranging from "What is Joe having for lunch today?" to journals about information retrieval. In addition to these major challenges, search engines on the Web must also contend with inexperienced users and pages engineered to manipulate search engine ranking functions. However, unlike "flat" document collections, the World Wide Web is hypertext and provides considerable auxiliary information on top of the text of the web pages, such as link structure and link text. In this paper, we take advantage of the link structure of the Web to produce a global "importance" ranking of every web page. This ranking, called PageRank, helps search engines and users quickly make sense of the vast heterogeneity of the World Wide Web.

Today search engines are becoming the best friends of most people for navigation on the Internet. The user enters a keyword or a combination of keywords to trigger the search. The search engine searches the web pages available on the Internet and returns the result as a number of ordered pages. This order should depend on the relevancy and the importance of the pages. Different search engines use different techniques to decide the importance of a page on the web. AltaVista uses the HITS (Hyperlink-Induced Topic Search) algorithm to rank pages.

Google uses the PageRank algorithm to rank pages, and Yahoo also uses some variation of it.

2. What is PageRank?

Most people start their web navigation with a search engine. The user cannot go through all the pages presented as the output of a search. When search results are presented, they should be ordered by their relevancy and importance on the web. Thus all the pages in the collection should be weighted and presented in the order of their weights.

Google is the most famous search engine in use nowadays. One of the most important factors that Google uses is PageRank, which decides the importance of a page. PageRank is a numeric value that represents how important a page is on the web. Of course PageRank is not the only factor, but it is still one of them.

There are many variations of the page ranking algorithm, such as confidence based page ranking, the weighted page rank algorithm, the query sensitive self-adaptable web page ranking algorithm, etc. This paper covers only one algorithm, which is based on the formula given by Brin and Page, the founders of Google.

2.1 Formula:

The citation graph of the web is the main resource for the calculation of PageRank. PageRank is described by one mathematical formula that seems very difficult at first, but actually it is not. In the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", the founders of Google, Sergey Brin and Lawrence Page, defined PageRank as:

"We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))"

This is the original formula that was published when PageRank was being developed. It is probable that Google uses a variation of it, but they aren't telling us what it is.
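As a rough illustration, the formula can be written as a small Python function. This is only a sketch of a single evaluation of PR(A), not Google's implementation. The function name `page_rank_of`, the dictionaries `inbound`, `out_count` and `pr`, and the sample pages T1 and T2 with their link counts are all assumptions made for this example.

```python
def page_rank_of(page, inbound, out_count, pr, d=0.85):
    """Evaluate PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    inbound[page] lists the pages T1...Tn that point to `page`,
    out_count[t] is C(t), the number of links going out of page t,
    pr[t] is the current PageRank value assumed for page t.
    """
    votes = sum(pr[t] / out_count[t] for t in inbound[page])
    return (1 - d) + d * votes


# Hypothetical data: T1 and T2 both link to A.
# T1 has 2 outgoing links and T2 has 4; both are assumed to have PR 1.0.
inbound = {"A": ["T1", "T2"]}
out_count = {"T1": 2, "T2": 4}
pr = {"T1": 1.0, "T2": 1.0}

pr_a = page_rank_of("A", inbound, out_count, pr)
print(pr_a)  # 0.15 + 0.85 * (1/2 + 1/4), about 0.7875
```

Here T1 contributes half of its PR and T2 a quarter, so a page with fewer outgoing links casts a stronger vote.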

2.2 How to use formula?

It's just voting: when one page links to another page, the first page votes some PageRank to the second page. It is the same as a shareholders' meeting, where the weight of a shareholder's vote depends on the shares held. The only difference is that the weight of the vote of a page depends on its own PR. (From here onwards I'll refer to PageRank as PR.) Note that when a page votes for another page, it doesn't give away anything from its own PR.

The value of the vote by a particular page depends on the PageRank and the total number of out links of that page. Thus, the higher the PageRank, the higher the value of the vote, and the higher the number of out links, the lower the value of the vote. E.g. if page A is pointing to four other pages, then the factor d * PR(A)/4 will appear in the PageRank equation of all those four pages. Those four factors add up to d * PR(A), i.e. 85% of the PageRank of A. So every page distributes 85% of its original PageRank evenly among all the pages to which it points. Thus, the PageRank of a page is the sum of one constant (1 - d = 0.15) and the damped value of the sum of the votes by all pages pointing to it.

If we look at the equation, it's obvious that the PR of a page depends on the PR of the other pages that point to it, which may in turn depend on the PR of the original page if it has a back link to them. E.g. if there are only two pages A and B which point to each other, then the PR of each page will depend on the PR of the other, which is yet uncalculated. So where to start from? Surprisingly, we can start with any assumed values of PR and repeat the calculation of the PR values of all pages iteratively till they become stable.

Let's see an example: consider that there are only two pages A and B, which point to each other. We don't know what their PR should be to begin with, so let's take a guess at 1 and do some calculations.

PR(A) = (1 - d) + d * PR(B) = (1 - 0.85) + 0.85 * 1 = 1
PR(B) = (1 - d) + d * PR(A) = (1 - 0.85) + 0.85 * 1 = 1

Since the values are not changing, we can stop here. If we start the calculation with different values of PR(A) and PR(B), we will still end up with PR(A) = PR(B) = 1. Now let's do the calculations with an initial guess of 0:

PR(A) = 0.15 + 0.85 * 0 = 0.15
PR(B) = 0.15 + 0.85 * 0.15 = 0.2775

After second iteration:

PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375

After third iteration:

PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375

and so on. The values will keep changing till they reach the value of 1. If we start with values greater than 1, the values will decrease iteration by iteration and will also settle down to 1.

Here we considered a symmetrical link structure between A and B; thus, with any values of PR(A) and PR(B) to start with, we end up with PR(A) = PR(B) = 1. If we take an asymmetrical structure, e.g. only A is pointing to B and B is not pointing to A, then we will still end up with consistent values of PR(A) and PR(B), whatever values we start with.
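The hand calculation above can be reproduced with a short Python sketch. The function name `iterate_pagerank` and the `links` dictionary layout are assumptions for illustration only. As in the hand calculation, each page is updated in place, so PR(B) in a given iteration already uses the new value of PR(A).

```python
def iterate_pagerank(links, initial=1.0, d=0.85, iterations=100):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T)/C(T)) to every page.

    links maps each page to the list of pages it links to.
    Values are updated in place, so a page updated later in a pass
    sees the new values of pages updated earlier in the same pass.
    """
    pr = {page: initial for page in links}
    for _ in range(iterations):
        for page in links:
            # Sum the votes of every page that links to `page`.
            votes = sum(pr[src] / len(targets)
                        for src, targets in links.items()
                        if page in targets)
            pr[page] = (1 - d) + d * votes
    return pr


# Two pages A and B which point to each other, initial guess 0.
two_pages = {"A": ["B"], "B": ["A"]}

first_pass = iterate_pagerank(two_pages, initial=0.0, iterations=1)
settled = iterate_pagerank(two_pages, initial=0.0, iterations=100)

print(first_pass)  # PR(A) is about 0.15 and PR(B) about 0.2775
print(settled)     # both values are very close to 1
```

Calling the function with `initial=5.0` instead gives values that decrease towards 1, which matches the observation that the starting guess does not matter for this structure.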

net/pagerank_calculator.0.15&iblprs=0. This yields a result that efforts needed to increase PR from 2 to 3 are very less as compared to increase PR from 8 to 9.15 to unlimited. But toolbar gives PR of any page in the range 1 to 10. How is PageRank used? Google uses PR as one of the important factor in search process to fine the relevancy of page.15. According to Ian Rogers. Means the scale used by Google must be logarithmic. 4.15.For more practice with different numbers of pages with different configurations of citation graphs. So actual PR values are divided into intervals and are represented as one of the value ranging from 1 to 10. Google assigns PR for each page on the web. 15&pgnms=&pgs=2&initpr=1&its=100&type=simple 3. Example 1. intervals cannot be equidistant.google. web pages are accessed in the decreasing order of PR. visit: www. 8 . What these efforts are? At the end of this paper anyone can answer this question. For this question.php?lnks=2.10. But the question is whether these intervals are equidistant? Means is the scale linear? No one outside Google knows it. Thus PR of site is nothing but sum of PRs of all pages of site. But we can solve different examples and come up with different observations. But from the site point of view.webworkshop. As our normal PR calculation can yield PR ranging from 0. One can find PR of its own webpage by using Google toolbar (http://toolbar.com/). But output of this toolbar is somewhat unexpected. total PR of all pages of site is important.15. there are different answers from different researchers.0. Before starting examples I would like to explain difference between PR of page and PR of site.0. So while searching. Examples and Observations: Google does not provide any information about methods to improve PR values of page or site.

[Figure: pages A and B with no links between them]

PR(A) = (1 - d) + d (0) = 0.15
PR(B) = (1 - d) + d (0) = 0.15

In this example A and B don't have any inbound or outbound links. They come up with PR values equal to 0.15. Thus we can conclude that: "The minimum value of PR of any page is 0.15."

Example 2.

[Figure: pages A and B, with a link from B to A]

PR(A) = (1 - d) + d (PR(B) / C(B)) = 0.15 + 0.85 (1/1) = 1 (taking the initial guess PR(B) = 1)
PR(B) = (1 - d) + d (0) = 0.15

Actually this is not the correct result. There are two concepts: orphan pages and dangling links.

Orphan pages are those which don't have any inbound link. For a page to be indexed by Google it must have at least one page linking to it. If a page is not in the Google index, it can't have a PR value and it can't share PR with other pages. Therefore, it and its links don't exist as far as the calculations are concerned. Page B is an orphan page and thus will be eliminated during the PR calculation, and thus page A will not receive any vote from page B.

Dangling links are links that go to pages that don't have any outbound links. These links are dropped for the duration of the calculations. The link from B to A is a dangling link and will be eliminated during the PR calculation.

But from here onwards we will consider only a small part of the whole web, so even if no inbound links are shown for a particular page, they are assumed to be present.
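The two definitions can be expressed as a minimal Python sketch. The function names `orphan_pages` and `drop_dangling_links` and the `links` dictionary layout are assumptions for illustration. The sketch makes a single pass only: in a larger graph, dropping one link may leave another page without outbound links, which this simple version does not follow up.

```python
def orphan_pages(links):
    """Return the pages that have no inbound link.

    links maps each page to the list of pages it links to.
    """
    linked_to = {target for targets in links.values() for target in targets}
    return {page for page in links if page not in linked_to}


def drop_dangling_links(links):
    """Return a copy of links without the links that go to pages
    which have no outbound links of their own (single pass)."""
    return {page: [t for t in targets if links.get(t)]
            for page, targets in links.items()}


# Example 2: B links to A, and A has no outbound links.
example_2 = {"A": [], "B": ["A"]}

print(orphan_pages(example_2))         # {'B'}
print(drop_dangling_links(example_2))  # {'A': [], 'B': []}
```

For Example 2 this reports B as an orphan page and removes the link from B to A as a dangling link, in line with the description above.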

Thus we will not consider these two concepts (orphan pages and dangling links) anymore in our examples; inbound links that are not shown for a page are assumed to be present. From here onwards I'll not show all calculations. Instead I'll show only the final settled values of PR after a large number of iterations.

Example 3

[Figure: two link diagrams of pages A, B and C; in both diagrams every page has PR 1.0]

Example 4.

[Figure: link diagram of pages A, B and C]

Fig. 1. Analysis of Page Rank [1]

This is a hierarchical structure. In general, the homepage of a website should have the maximum PR, so we need our PR to be concentrated at the homepage. Thus we can use a hierarchical structure to channel a large proportion of the PR of the site to where we want it.

Example 5:

[Figure: link diagram of pages A, B and C with External Site 1 and External Site 2]

Example 6:

[Figure: link diagram of pages A, B and C with External Site 1 and External Site 2]

Fig. 2. Analysis of Page Rank [1]

Compare example 5 with example 6. In both of these examples, pages A, B and C are pages of the same site. External site 1 is pointing to page A and page C is pointing to external site 2. The PR of external site 1 is assumed to be equal to 1, and then the PR values of the remaining pages are calculated.

In example 5, page C gives its entire vote to the external site, while in example 6 page C gives part of it to the external site and part of it to page A. This part of the vote from C increases the PR value of A and thus increases the weight of the vote of A. In turn the PR of B and C increases, which again increases the weight of the vote of C. This is an iterative process, which finally results in an increased value of PR for all of A, B and C. This is definitely in favor of the PR of the whole site. There is a PR leak in example 5, which is avoided in example 6.

Can we infer something from these examples? Definitely. From these examples we can conclude that if any page of a site is pointing to an external page, then we can reduce the PR leak by increasing the citation network within the site. Can we decrease the PR leak by introducing reciprocating links between B and C? Let's try this.

[Figure: link diagram of pages A, B and C with External Site 1 and External Site 2, with reciprocating links between B and C]

Fig. 3. Analysis of Page Rank [1]

That's great! The average PR of the site increased as we expected.

5. How to increase PageRank?

There are different ways to increase the PR of your site. From the examples illustrated above we can come up with some ideas to increase the PR of a web site. Some standard ideas are given below:

5.1 Add spam pages:

As shown in the figure, we can increase the total PR of a website by adding more pages to the site, no matter what the contents of these pages are. The total PR of the site increases, but we can't increase the average PR of the site (i.e. PR per page). The upper limit on the average PR of a site is 1.

[Figure: page A (PR 331.0), page B (PR 281.6) and spam pages Spam 1, Spam 2, ..., Spam 1000 (PR 0.39 each)]

Fig. 4. Analysis of Page Rank [1]

5.2 Join forums:

Forums are a great way to get links to your website. In most forums you are allowed to have a signature, and in your signature you can put a link to your website. Another important point is to make sure the forum is somewhat related to your website. You will still get credit if it's not, but if it's related to your website then you will be accomplishing two tasks at once.

5.3 Submit to search engine directories.

Search engine directories are a good way to get a free link to your website. They also increase your chances of being listed higher on popular search engines like Google and others. Most search engine directories allow you to submit your website for free. This will allow you to increase your web presence by being listed on another search engine, and it will also be a free link. Remember, the more links you have, the higher your PR will be.

5.4 Reciprocating Links:

You can search for sites related to the same topic as your website. If the PR of such a site is higher than that of your site, then you can pay that site and create reciprocating links between the two. There are link-farming sites, which exchange links with other sites. This practice is against Google's guidelines, and Google bans sites that participate in link farming. Thus beware of link farming!

5.5 Contents:

The last and most important way to increase the PR of your site is to keep solid contents on your site. Other sites will automatically link to your site if it has good contents. Through this, people will come to know about your site, and this will help to increase its popularity. Really, there is no substitute for good contents!

Conclusion

Even though the formula for calculating PageRank seems to be difficult, it is easy to understand. But when a simple calculation is applied hundreds of times over, the results can seem complicated, and we cannot predict the result of these iterations. Surely, more practice can yield more observations.

PageRank is an important factor considered in Google ranking, but it is only one of the important factors considered. E.g. nowadays Google is paying a lot of attention to a link's anchor text while deciding the relevancy of the target page. But as PageRank is also one of the important factors, one should be well aware of PageRank while designing a website.

References

[1] Fig. Analysis of Page Rank: http://pr.efactory.de/
Bing Liu, "Web Data Mining", Springer International Edition
IEEE Conference Paper, "Research on PageRank and Hyperlink-Induced Topic Search in Web Structure Mining"
Websites: Google, Wikipedia
www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html
