
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 2, February 2011. http://sites.google.com/site/ijcsis/ ISSN 1947-5500

**WEB-OBJECT RANK ALGORITHM FOR EFFICIENT INFORMATION COMPUTING**

Dr. Pushpa R. Suri

Department of Computer Science and Applications, Kurukshetra University Kurukshetra, Haryana- 136119, India. pushpa.suri@yahoo.com

Harmunish Taneja

Department of Information Technology, Maharishi Markendeshwar University, Mullana, Haryana- 133203, India harmunish.taneja@gmail.com

Abstract - In recent years there has been considerable interest in analyzing the relative trust level of web objects. Because the web contains facts as well as assumptions on a global scale, many different criteria arise for trusting a web page. In this paper an algorithm is proposed that assigns a rank to every web object, such as a requested document on the web, specifying the quality of that object or the relative level of trust one can place in that web page. The algorithm is used for object-level information extraction for ranking search results and is implemented in C++. The behavior of the object rank for different values of the moister factor in a domain is analyzed. The results emphasize that the moister factor can be useful in rank computation and can further expose more web pages in alignment with the user's requirements.

Keywords - Random Surfer Model, Information Computing, Web Objects, Information Retrieval System, Web Graph, Ranking, Object Rank.

I. INTRODUCTION

Search engines rank search results based upon various lexicons. Because the web contains contradictions and hypotheses on a huge scale, finding relevant information using search engines is a tedious job. With the help of object-level ranking [22], the various objects in a domain can be prioritized independently of the query, describing the relative trust of each web page. The object rank of a page depends upon various factors associated with the web object. The organization of the paper is as follows. Related work is presented in Section 2. Section 3 discusses the challenges of producing high quality search results. In Section 4, the Web_Object_Rank algorithm is proposed and discussed. The algorithm is implemented in Section 5. Finally, Section 6 concludes the paper on the basis of the results obtained.

II. RELATED WORK

Information computing in various web domains broadly means extracting web objects of an unstructured nature, such as text objects that satisfy an information need, from within large collections using document-level ranking, as well as extracting the structured information about real-world objects that is embedded in static web pages. Online databases of an unstructured nature exist on the web in huge numbers. Unstructured data refers to data that does not have a clear, semantically obvious structure [7]. In other words, information computing constitutes the process of searching, recovering, and understanding information from huge amounts of stored data. Information can be retrieved from the web by implementing searching techniques such as keyword-based searching, concept-based searching, hybrid search, and knowledge-base search. In the case of object-level information computing, a domain-based search is required. Every commercial information retrieval system tries to facilitate a user's access to information that is relevant to his information needs. This paper highlights the ranking problem for domain-based information retrieval: every owner of a document wants to improve the ranking of that document, and to do so can apply many manipulations, such as increasing the number of links to the page through dummy pages [1]. Object-based information computing maintains the integrity of the search results.

Google is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext [1]. Google is designed to crawl and index the web efficiently and to produce much more satisfying search results than existing systems. Link Analysis Ranking [16] emphasizes that hyperlink structures can be used to determine the relative authority of a web page and to produce improved algorithms for ranking search results. A prototype with a full-text and hyperlink database of web pages is available at [8]. There is currently much interest in using random graph models for the web. The Random Surfer model [9] and the PageRank-based selection model [11] are described as the two major models [10]. The PageRank-based selection model tries to capture the effect that search engines have on the growth of the web by adding new links according to PageRank. The PageRank algorithm is used in the Google search engine [12] for ranking search results. PageRank is a link analysis algorithm used by Google that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web (WWW), with the purpose of "measuring" its relative importance within the set. Google is designed to be a scalable search engine whose primary goal is to provide high quality search results over a rapidly growing WWW [18]. PageRank theory assumes that even an imaginary surfer who is randomly clicking on links will eventually stop clicking; the probability, at any step, that the surfer will continue is a damping factor d [2].


The damping factor (α) is eminently empirical, and in most cases its value can be taken as 0.85 [1]. PageRank is the stationary state of a Markov chain [2, 7]. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor that spreads rank uniformly. The behavior of PageRank with respect to changes in α is useful in link-spam detection [3]. Mathematical analysis of PageRank under changing α shows that, contrary to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking [2, 21]. The order of displayed web pages is computed by Google as the PageRank vector, whose entries are the PageRanks of the web pages [4]. The PageRank vector is the stationary distribution of a stochastic matrix, the Google matrix. The Google matrix in turn is a convex combination of two stochastic matrices: one matrix represents the link structure of the web graph, while a second, rank-one matrix mimics the random behavior of web surfers and can also be used to fight web spamming. As a consequence, PageRank depends mainly on the link structure of the web graph, not on the contents of the web pages. The PageRank of the first vertex, the root of the graph, follows a power law [10]; however, the power undergoes a phase transition as parameters of the model vary. Link-based ranking algorithms rank web pages by using the dominant eigenvector of certain matrices, such as the co-citation matrix or its variations [17]. Distributed page ranking on top of structured peer-to-peer networks is needed because the size of the web grows at a remarkable speed and centralized page ranking is not scalable [5]. Page ranking can be propagated at different rates depending on the types of the links and on the user's specific set of interests [6]. Page filtering can be decided based on link types combined with other information relevant to the links. For ranking, a profile containing a set of ranking rules to be followed in the task can be specified to reflect the user's specific interests [20]. Similarities of content between hyperlinked pages are useful for producing a better global ranking of web pages [19].

III. CHALLENGES

The primary focus of a Web Information Retrieval Support System (WIRSS) is to address the aspects of search that consider the specific needs and goals of the individuals conducting web searches [15]. The major goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including page rank, anchor text, and proximity information. Decentralized content publishing is the main reason for the explosive growth of the web. Corresponding to a user query there are many documents that a search engine can retrieve, and every owner of a document wants to improve the ranking of that document. Commercial search engines have to maintain the integrity of their search results, which is one reason their efforts are not publicly available. Democratization of content creation on the web generates new challenges in WIRSS and gives rise to questions about the integrity of web pages. In a simplistic approach, one might argue that only some publishers are trustworthy and others are not. A further challenge is that fast crawling technology is needed to gather the web objects and keep them up to date.

IV. WEB_OBJECT_RANK ALGORITHM AND IMPLEMENTATION

The PageRank of a web object can be defined as the fraction of time that the surfer spends, on average, on that object. The probability that the random surfer visits a web page is its PageRank [1]. Evidently, web objects that are hyperlinked by many other pages are visited more often. The random surfer gets bored and restarts from another random web object with a probability termed the moister factor (m); the probability that the surfer follows a randomly chosen outlink is (1 - m). A Markov chain is a discrete-time stochastic process: a process that occurs in a series of time steps, in each of which a random choice is made [7]. There is one state corresponding to each web object, so the Markov chain consists of N states if there are N web objects in the collection. A Markov chain is characterized by an N × N Probability Transition Matrix P, each of whose entries lies in the interval [0, 1], with the entries in each row of P adding up to 1. The Markov property states that each entry Pij is a transition probability that depends only on the current state i. A Markov chain's probability distribution over its states may be viewed as a Probability Vector: a vector all of whose entries lie in the interval [0, 1] and add up to 1. In [7, 14] the problem of computing bounds on the conditional steady-state Probability Vector of a subset of states in finite, discrete-time Markov chains is considered.

A. Web_Object_Rank Algorithm: Features

The features of the Object Rank algorithm are as follows:
- Query independent: it assigns a value to every document independently of the query.
- Content independent.
- Concerned with the static quality of a web page.
- The Object Rank value can be computed offline using only the web graph.
- Object Rank is based upon the linking structure of the whole web.
- Object Rank does not rank a website as a whole; it is determined for each web page individually.
- The Object Ranks of pages Ti that link to page A do not influence the rank of page A uniformly: the more outbound links a page T has, the less page A benefits from a link on it.
- Object Rank is a model of user behavior.

B. Web_Object_Rank Algorithm: Assumptions

- If there are multiple links between two web objects, only a single edge is placed.
- No self loops are allowed.
- The edges could be weighted, but we assume that no weights are assigned to edges in the graph.
- Links within the same web site are removed.
- Isolated nodes are removed from the graph.
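The graph assumptions of the proposed algorithm (single edges, no self loops, no isolated nodes) amount to a cleaning pass over the raw link data. The following is a minimal C++ sketch of such a pass; the function name, input format, and renumbering scheme are assumptions of this example, not the paper's actual implementation:

```cpp
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper: normalize a raw link list into the simple directed
// graph the algorithm assumes (single edges, no self loops, no isolated
// nodes). Site information is not modeled here, so intra-site link removal
// is represented only by a comment.
std::vector<std::vector<int>> normalize_web_graph(
        int n, const std::vector<std::pair<int, int>>& raw_links) {
    std::set<std::pair<int, int>> edges;    // set collapses duplicate links
    for (const auto& e : raw_links) {
        if (e.first == e.second) continue;  // drop self loops
        // (intra-site links would also be dropped here, given site ids)
        edges.insert(e);
    }
    // Mark nodes that touch at least one surviving edge.
    std::vector<bool> used(n, false);
    for (const auto& e : edges) { used[e.first] = used[e.second] = true; }
    // Renumber the surviving nodes and build an unweighted adjacency matrix.
    std::vector<int> id(n, -1);
    int m = 0;
    for (int i = 0; i < n; ++i) if (used[i]) id[i] = m++;
    std::vector<std::vector<int>> adj(m, std::vector<int>(m, 0));
    for (const auto& e : edges) adj[id[e.first]][id[e.second]] = 1;
    return adj;
}
```

The resulting 0/1 adjacency matrix plays the role of adj[i][j] in the algorithm below.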


C. Web_Object_Rank Algorithm

This algorithm is a query independent algorithm that takes a web graph as input and assigns a rank to every object, specifying the relative authority of that web page. The variables used in the proposed algorithm are:
- moist_fact (m): the moister factor, the probability that the random surfer restarts the search from another random web object
- 1 - m: the probability that the random surfer follows a randomly chosen outlink
- outlinks: the number of web objects linked from a particular page
- N: the number of objects in the domain
- prob[i][j]: the Probability Transition Matrix, for all i, j ∈ 1 to N
- adj[i][j]: the Adjacency Matrix, for all i, j ∈ 1 to N
- x: the Probability Vector
- itr: the iteration counter

D. Web_Object_Rank Algorithm

Step 1. Create a web graph of the various objects in a domain.
Step 2. Set prob[i][j] = adj[i][j].
Step 3. Compute the number of outlinks from each node, say counter. If a web object has no outlinks, its prob values are distributed equally; otherwise they are distributed according to the number of outlinks:
For all i, j: IF counter = 0 THEN prob[i][j] = 1/N ELSE IF prob[i][j] = 1 THEN prob[i][j] = 1.0/counter
Step 4. Multiply the resulting matrix by (1 - m).
Step 5. Add m/N to every entry to obtain the Probability Transition Matrix:
For all i, j: prob[i][j] = prob[i][j]*(1 - m) + m/N
Step 6. Randomly select a node from 0 to N-1 from which to start the walk, say s_int.
Step 7. Initialize the random surfer and set itr, which counts the iterations required, to 0.
Step 8. Try to reach a steady state within 200 iterations; otherwise toggling occurs.
Step 9. Multiply the Probability Transition Matrix with the Probability Vector to obtain the steady state.
Step 10. Check whether the system has entered a steady state.
Step 11. Print the ranks stored in Probability Vector x and EXIT.
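Steps 2 through 5 can be sketched in C++ as follows. This is a minimal sketch, not the paper's actual code; the function name and the use of std::vector are choices of this example:

```cpp
#include <cmath>
#include <vector>

// Build the Probability Transition Matrix from an adjacency matrix adj and
// a moister factor m in (0, 1): rows with no outlinks are spread uniformly,
// other rows split their probability over their outlinks, and the whole
// matrix is then damped so every entry stays strictly positive.
std::vector<std::vector<double>> transition_matrix(
        const std::vector<std::vector<int>>& adj, double m) {
    const int n = static_cast<int>(adj.size());
    std::vector<std::vector<double>> prob(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        int counter = 0;                       // number of outlinks of node i
        for (int j = 0; j < n; ++j) counter += adj[i][j];
        for (int j = 0; j < n; ++j) {
            double p = (counter == 0) ? 1.0 / n            // no outlinks
                                      : adj[i][j] * 1.0 / counter;
            prob[i][j] = p * (1.0 - m) + m / n;            // Steps 4 and 5
        }
    }
    return prob;
}
```

Each row of the returned matrix sums to 1, as the Markov chain formulation requires.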

V. IMPLEMENTATION

The implementation is based upon the random surfer model [7] and Markov chains [13, 14]. The random surfer visits the objects in the web graph according to a distribution, and at any time the surfer can be in one of four possible states. The initial state is the state from which the system starts its walk: the system is placed in a random state by selecting an object with a random function and setting the corresponding entry of the Probability Vector to unity, with all other entries zero. The steady state is the state in which the Probability Vector of the random surfer fulfills the properties of irreducibility and aperiodicity; to check whether the system has reached the steady state, two successive values of the Probability Vector must be the same. The ideal state is the state in which the system achieves the steady state but the web object ranks are distributed uniformly over all documents. The toggling state is reached when the system is unable to arrive at a steady state and simply toggles between two sets of object ranks.
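Steps 6 through 11 and the state checks just described can be sketched as follows. Again this is a hedged sketch: the convergence tolerance, the function signature, and the way "two successive values are the same" is tested numerically are assumptions of this example:

```cpp
#include <cmath>
#include <vector>

// Power iteration from a chosen start object: repeatedly multiply the
// probability vector by the transition matrix and stop on steady state or
// after max_itr rounds (the toggling state). Returns the final vector.
std::vector<double> object_rank(const std::vector<std::vector<double>>& prob,
                                int start, int max_itr = 200) {
    const int n = static_cast<int>(prob.size());
    std::vector<double> x(n, 0.0);
    x[start] = 1.0;                        // initial state: all mass on start
    for (int itr = 0; itr < max_itr; ++itr) {
        std::vector<double> next(n, 0.0);  // next = x * P (row vector times P)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) next[j] += x[i] * prob[i][j];
        double diff = 0.0;                 // compare successive vectors
        for (int j = 0; j < n; ++j) diff += std::fabs(next[j] - x[j]);
        x = next;
        if (diff < 1e-10) return x;        // steady state reached
    }
    return x;  // toggling state: no steady state within max_itr iterations
}
```

Because each multiplication preserves the total probability mass, the returned entries always sum to 1; whether they are meaningful depends on whether a steady state was actually reached.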

Fig. 1. Web Graph (web objects O1 to O10)


C. Results and Discussion

The web graph shown in Fig. 1 is used for analyzing various aspects of the proposed algorithm; variations in the graph structure used for analysis change the performance of the algorithm. The graph shows 10 web objects in a domain that are interlinked as a strongly connected graph, and every two nodes of the graph are joined by a path with a small number of links. Oi is the ith web object in the domain, where i varies from 1 to 10. The adjacency matrix for the web graph of Fig. 1 is shown in Fig. 2.

0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 1 0 0 0 0
0 0 0 0 0 0 1 1 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0


Fig. 2. Adjacency Matrix for all i, j ∈ 1 to 10

To analyze the convergence speed, the number of iterations required by the random surfer to reach a steady state is recorded in Table 1, and the corresponding graph is shown in Fig. 3. In Fig. 3, infinity is shown as a large number of iterations (200 or more). The results clearly show that as the moister factor approaches 1, the number of iterations is reduced.

Table 1: Moister Factor vs No. of Iterations

Moister Factor   No. of Iterations
0                Infinity
0.05             Infinity
0.1              Infinity
0.15             Infinity
0.2              83
0.25             73
0.3              62
0.35             46
0.4              41
0.45             33
0.5              35
0.55             39
0.6              24
0.65             21
0.7              20
0.75             22
0.8              16
0.85             12
0.9              11
0.95             10
1                2

Fig. 3. Moister Factor vs Number of Iterations

It is further observed that when the moister factor equals 1, the random surfer enters the ideal state, and the corresponding rank values of the web objects are as shown in Table 2. The graph for the ideal state is shown in Fig. 4.

Table 2: Ranks of objects at moister factor 1

Object   Computed Rank
O1       0.1
O2       0.1
O3       0.1
O4       0.1
O5       0.1
O6       0.1
O7       0.1
O8       0.1
O9       0.1
O10      0.1
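The convergence experiment of Table 1 can be reproduced in outline with the following self-contained sketch. The start node, the numerical tolerance, and the iteration cap of 200 are assumptions of this example, so exact counts may differ from Table 1; only the qualitative trend (larger moister factor, faster convergence) should be expected to hold:

```cpp
#include <cmath>
#include <vector>

// Count power-iteration steps to a steady state on the Fig. 2 web graph
// for a given moister factor m; max_itr plays the role of "infinity".
int iterations_to_converge(double m, int start = 0, int max_itr = 200) {
    const int N = 10;
    const int adj[N][N] = {   // adjacency matrix of Fig. 2
        {0,0,0,0,0,0,0,0,0,0}, {1,0,0,0,0,0,0,0,0,0},
        {0,1,0,0,0,0,0,0,0,0}, {0,0,1,0,0,0,0,0,0,0},
        {0,0,0,1,0,1,0,0,0,0}, {0,0,0,0,0,0,1,1,0,0},
        {0,0,0,0,1,0,0,0,0,0}, {0,0,0,0,0,0,0,0,0,1},
        {0,0,0,0,0,0,1,0,0,0}, {0,0,0,0,0,0,0,0,1,0}};
    double prob[N][N];        // Steps 2-5: damped transition matrix
    for (int i = 0; i < N; ++i) {
        int counter = 0;
        for (int j = 0; j < N; ++j) counter += adj[i][j];
        for (int j = 0; j < N; ++j) {
            double p = (counter == 0) ? 1.0 / N : adj[i][j] * 1.0 / counter;
            prob[i][j] = p * (1.0 - m) + m / N;
        }
    }
    std::vector<double> x(N, 0.0);
    x[start] = 1.0;           // Steps 6-7: start the walk at one object
    for (int itr = 1; itr <= max_itr; ++itr) {
        std::vector<double> next(N, 0.0);
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) next[j] += x[i] * prob[i][j];
        double diff = 0.0;
        for (int j = 0; j < N; ++j) diff += std::fabs(next[j] - x[j]);
        x = next;
        if (diff < 1e-10) return itr;   // steady state
    }
    return max_itr;                     // treated as "infinity" (toggling)
}
```

Calling this for m = 0.85 versus m = 0.3 should show noticeably faster convergence at the larger moister factor, matching the trend of Table 1.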



Fig. 4. Random Surfer Ideal State

Figure 5 shows that, for a moister factor less than 0.2, no rank is assigned to any web object and the system enters the toggling state after a large number of iterations for the given domain. It also shows the ranks computed by the proposed algorithm for moister factor values from 0.2 to 1.

Fig. 5. Computed object ranks at various moister factors (0.2 and above)

From the above graphs and analysis, we can say that the moister factor plays a major role in this algorithm, and the performance of the algorithm can be improved if this factor is selected properly. The value of the moister factor can vary from 0 to 1, but in most cases the system enters the toggling state if the value selected is less than 0.2, and at the value 1 the system enters the ideal state, giving insignificant results. The value must therefore be close to 1 but cannot be 1. As shown in Fig. 3, the system achieves a steady state in fewer iterations when the moister factor is closer to 1.

VI. CONCLUSION

The current study was conducted to demonstrate how the link structure of the web can be used to rank various documents. This ranking can be computed offline, and with this approach one can prioritize the various documents on the web independently of the query; a complete score computation, however, is based on various other factors. The proposed algorithm uses a damping factor that plays a very important role in the analysis of the algorithm. The analysis leads to the conclusion that the damping factor must not be selected close to zero, while at a damping factor of one the system enters the ideal state and the ranking provided is insignificant. Based on this evaluation, the damping factor should be selected greater than or equal to 0.5; if convergence speed is the only factor used to evaluate performance, the best moister factor is 0.95. The proposed algorithm is query independent and does not consider the query during ranking.

REFERENCES

[1] Sergey Brin, Lawrence Page, "The anatomy of a large-scale hypertextual web search engine", Proceedings of the 7th International Conference on World Wide Web, pp. 107-117, Brisbane, Australia, April 1998.
[2] Paolo Boldi, Massimo Santini, S. Vigna, "PageRank as a function of the damping factor", Proceedings of the 14th International Conference on World Wide Web, Chiba, Japan, pp. 557-566, 2005.
[3] Hui Zhang, Ashish Goel, Ramesh Govindan, Kahn Mason, Benjamin Van Roy, "Making eigenvector-based reputation systems robust to collusion", in Stefano Leonardi (Ed.), Proceedings of WAW 2004, LNCS 3243, pp. 92-104, Springer-Verlag, 2004.
[4] Z. Nie, F. Wu, J.R. Wen, W.Y. Ma, "Extracting objects from the web", 22nd International Conference on Data Engineering (ICDE'06), pp. 1-3, 2006.
[5] Jianfeng Zheng, Zaiqing Nie, "Architecture of an object-level vertical search", Proceedings of the International Conference on Web Information Systems and Mining, IEEE, pp. 51-55, 2009.
[6] Zhanzi Qiu, Matthias Hemmje, Erich J. Neuhold, "Using link types in web page ranking and filtering", Proceedings of the Second International Conference on Web Information Systems Engineering (WISE'01), Vol. 1, p. 311, IEEE Computer Society, 2001.
[7] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze, "An Introduction to Information Retrieval", Cambridge University Press, New York, NY, USA, 2008.
[8] http://google.stanford.edu/
[9] Blum, T.-H. H. Chan, M. R. Rwebangira, "A random-surfer web-graph model", Proceedings of the 8th Workshop on Algorithm Engineering and Experiments and the 3rd Workshop on Analytic Algorithmics and Combinatorics (ANALCO '06), pp. 238-246, SIAM, Philadelphia, PA, USA, 2006.
[10] Prasad Chebolu, Páll Melsted, "PageRank and the random surfer model", Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1010-1018, 2008.
[11] Gopal Pandurangan, Prabhakar Raghavan, Eli Upfal, "Using PageRank to characterize web structure", Proceedings of the 8th Annual International Conference on Computing and Combinatorics, pp. 330-339, August 2002.
[12] Google technology overview, http://www.google.com/intl/en/corporate/tech.html, 2004.
[13] R. Montenegro, P. Tetali, "Mathematical aspects of mixing times in Markov chains", Foundations and Trends in Theoretical Computer Science, Vol. 1, Issue 3, pp. 237-354, May 2006.
[14] Tugrul Dayar, Nihal Pekergin, Sana Younes, "Conditional steady-state bounds for a subset of states in Markov chains", Proceedings of the 2006 Workshop on Tools for Solving Structured Markov Chains, ACM International Conference Proceeding Series, Vol. 201, Article 3, 2006.
[15] Orland Hoeber, "Web information retrieval support systems: the future of web search", Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Vol. 3, pp. 29-32, 2008.
[16] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas, "Link analysis ranking: algorithms, theory, and experiments", ACM Transactions on Internet Technology (TOIT), Vol. 5, Issue 1, pp. 231-297, February 2005.
[17] R. Lempel, S. Moran, "Rank-stability and rank-similarity of link-based web ranking algorithms in authority-connected graphs", Information Retrieval, Vol. 8, Issue 2, pp. 245-264, Kluwer Academic Publishers, April 2005.
[18] Umesh Sehgal, Kuljeet Kaur, Pawan Kumar, "The anatomy of a large-scale hyper textual web search engine", Second International Conference on Computer and Electrical Engineering (ICCEE '09), Vol. 2, pp. 491-495, December 2009.
[19] A. Kritikopoulos, M. Sideri, Varlamis, "WordRank: a method for ranking web pages based on content similarity", 24th British National Conference on Databases (BNCOD '07), pp. 92-100, July 2007.
[20] Zaiqing Nie, Ji-Rong Wen, Wei-Ying Ma, "Object-level vertical search", 3rd Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, January 2007.
[21] Zhi-Xiong Zhang, Jian Xu, Jian-Hua Liu, Qi Zhao, Na Hong, Si-Zhu Wu, Dai-Qing Yang, "Extraction knowledge objects in scientific web resource for research profiling", Eighth International Conference on Machine Learning and Cybernetics, Baoding, pp. 3475-3480, IEEE, July 2009.
[22] Z. Nie, Y. Zhang, J.R. Wen, W.Y. Ma, "Object-level ranking: bringing order to web objects", Proceedings of the World Wide Web Conference (WWW), 2007.

Dr. Pushpa R. Suri received her Ph.D. degree from Kurukshetra University, Kurukshetra. She is working as an Associate Professor in the Department of Computer Science and Applications at Kurukshetra University, Kurukshetra, Haryana, India. She has many publications in international and national journals and conferences. Her teaching and research activities include discrete mathematical structures, data structures, information computing, and database systems.

Harmunish Taneja received his M.Phil. degree in Computer Science from Algappa University, Tamil Nadu, and his Master of Computer Applications degree from Guru Jambeshwar University of Science and Technology, Hissar, Haryana, India. He is presently working as an Assistant Professor in the Information Technology Department of M.M. University, Mullana, Haryana, India, and is pursuing a Ph.D. in Computer Science at Kurukshetra University, Kurukshetra. He has published 11 papers in international and national conferences and seminars. His teaching and research areas include database systems, web information retrieval, and object-oriented information computing.

