
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 2, February 2011

WEB-OBJECT RANK ALGORITHM FOR EFFICIENT INFORMATION COMPUTING

Dr. Pushpa R. Suri

Department of Computer Science and Applications, Kurukshetra University, Kurukshetra, Haryana-136119, India.

Harmunish Taneja

Department of Information Technology, Maharishi Markendeshwar University, Mullana, Haryana-133203, India

Abstract - In recent years there has been considerable interest in analyzing the relative trust level of web objects. Since the web contains facts as well as assumptions on a global scale, various criteria arise for trusting a web page. In this paper an algorithm is proposed that assigns a rank to every web object, such as a requested document on the web, specifying the quality of that object or the relative level of trust one can place in that web page. It is used for object-level information extraction for ranking search results and is implemented in C++. The paper analyzes the behavior of the object rank for different values of the moister factor in a domain. The results emphasize that the moister factor can be useful in rank computation and in exploring further web pages in alignment with the user's requirements.

Keywords- Random Surfer Model, Information Computing, Web Objects, Information Retrieval System, Web Graph, Ranking, Object Rank.

I. INTRODUCTION

Information computing in various web domains broadly means extracting web objects of an unstructured nature, such as text objects that satisfy an information need from within large collections using document-level ranking, as well as the structured information about real-world objects that is embedded in static web pages. Online databases exist on the web in huge numbers and are largely unstructured. Unstructured data refers to data that does not have a clear, semantically obvious structure [7]. In other words, information computing constitutes the process of searching, recovering, and understanding information from huge amounts of stored data. Information from the web can be retrieved by implementing searching techniques such as keyword-based search, concept-based search, hybrid search, and knowledge-base search. In the case of object-level information computing, domain-based search is required. Every commercial information retrieval system tries to facilitate a user's access to information that is relevant to his information needs. This paper highlights the ranking problem for domain-based information retrieval: every owner of a document wants to improve the ranking of that document, and to do so can perform many manipulations on it, such as increasing the number of links to the page through dummy pages [1]. Object-based information computing maintains the integrity of the search results based upon various lexicons. As the web contains contradictions and hypotheses on a huge scale, finding relevant information using search engines is a tedious job. With the help of object-level ranking [22], the various objects in a domain can be prioritized independently of the query, describing the relative trust of the web page. The object rank of a page depends upon various factors associated with the web object.

The organization of the paper is as follows. Related work is presented in Section 2. Section 3 discusses the challenges of high-quality search results. In Section 4, the Web_Object_Rank algorithm is proposed and discussed. The algorithm is implemented in Section 5. Finally, Section 6 concludes the paper on the basis of the results obtained.

II. RELATED WORK

Google is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext [1]. Google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. Link Analysis Ranking [16] emphasizes that hyperlink structures are used to determine the relative authority of a web page and to produce improved algorithms for the ranking of search results. The prototype with a full-text and hyperlink database of web pages is available at [8]. In the current era there is much interest in using random graph models for the web. The Random Surfer model [9] and the PageRank-based selection model [11] are described as two major models [10]. The PageRank-based selection model tries to capture the effect that search engines have on the growth of the web by adding new links according to PageRank. The PageRank algorithm is used in the Google search engine [12] for ranking search results. PageRank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web (WWW), with the purpose of "measuring" its relative importance within the set. Google is designed to be a scalable search engine with the primary goal of providing high-quality search results over a rapidly growing WWW [18]. The PageRank theory suggests that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the surfer will continue is a damping factor d [2]. The

162 | http://sites.google.com/site/ijcsis/ | ISSN 1947-5500


damping factor (α) is eminently empirical, and in most cases the value of α can be taken as 0.85 [1]. PageRank is the stationary state of a Markov chain [2, 7]. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor that spreads uniformly over the rank. The behavior of PageRank with respect to changes in α is useful in link-spam detection [3]. The mathematical analysis of PageRank with change in α shows that, contrary to popular belief, for real-world graphs values of α close to 1 do not give a more meaningful ranking [2, 21]. The order of displayed web pages is computed by the search engine Google as the PageRank vector, whose entries are the PageRanks of the web pages [4]. The PageRank vector is the stationary distribution of a stochastic matrix, the Google matrix. The Google matrix in turn is a convex combination of two stochastic matrices: one matrix represents the link structure of the web graph, and a second, rank-one matrix mimics the random behavior of web surfers and can also be used to fight web spamming. As a consequence, PageRank depends mainly on the link structure of the web graph, not on the contents of the web pages. Also, the PageRank of the first vertex, the root of the graph, follows the power law [10]. However, the power undergoes a phase transition as parameters of the model vary.

Link-based ranking algorithms rank web pages by using the dominant eigenvector of certain matrices, like the co-citation matrix or its variations [17]. Distributed page ranking on top of structured peer-to-peer networks is needed because the size of the web grows at a remarkable speed and centralized page ranking is not scalable [5]. Page-ranking propagation rates can depend on the types of the links and the user's specific set of interests [6]. Page filtering can be decided based on link types combined with some other information relevant to links. For ranking, a profile containing a set of ranking rules to be followed in the task can be specified to reflect the user's specific interests [20]. Similarities of contents between hyperlinked pages are useful to produce a better global ranking of web pages [19].
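The convex combination just described has a standard compact form in the PageRank literature; as a sketch (the text above does not write it out), with α the damping factor and N the number of pages:

```latex
G \;=\; \alpha\,S \;+\; (1-\alpha)\,\frac{1}{N}\,\mathbf{1}\mathbf{1}^{\mathsf{T}},
\qquad
\pi^{\mathsf{T}} G = \pi^{\mathsf{T}},\quad \pi \ge 0,\quad \sum_{i} \pi_i = 1,
```

where S is the row-stochastic matrix induced by the links of the web graph, the rank-one matrix (1/N)·11ᵀ models the surfer's random jump, and the PageRank vector π is the stationary distribution of G.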

III. CHALLENGES

The primary focus of a Web Information Retrieval Support System (WIRSS) is to address the aspects of search that consider the specific needs and goals of the individuals conducting web searches [15]. The major goal is to provide high-quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including page rank, anchor text, and proximity information. Decentralized content publishing is the main reason for the explosive growth of the web. Corresponding to a user query there are many documents that can be retrieved by a search engine, and every owner of a document wants to improve the ranking of its document. Commercial search engines have to maintain the integrity of their search results, and this is one reason why the efforts made by them are not publicly available. Democratization of content creation on the web generates new challenges in WIRSS. This gives rise to the question of the integrity of web pages. In a simplistic approach, one might argue that only some publishers are trustworthy and others are not. One more challenge is that fast crawling technology is needed to gather the web objects and keep them up to date.

IV. WEB_OBJECT_RANK ALGORITHM AND IMPLEMENTATION

The Page Rank of a web object can be defined as the fraction of time that the surfer spends, on average, on that object.

The probability that the random surfer visits a web page is its Page Rank [1]. Evidently, web objects that are hyperlinked by many other pages are visited more often. The random surfer gets bored and restarts from another random web object with a probability termed the moister factor (m). The probability that the surfer follows a randomly chosen outlink is (1 - m). The Markov chain is a discrete-time stochastic process: a process that occurs in a series of time steps, in each of which a random choice is made [7]. There is one state corresponding to each web object; hence, a Markov chain consists of N states if there are N web objects in the collection. A Markov chain is characterized by an N × N Probability Transition Matrix P, each of whose entries is in the interval [0, 1]; the entries in each row of P add up to 1. The Markov property states that each entry Pij is the transition probability, which depends only on the current state i. A Markov chain's probability distribution over its states may be viewed as a Probability Vector: a vector all of whose entries are in the interval [0, 1] and add up to 1. According to [7, 14], the problem of computing bounds on the conditional steady-state Probability Vector of a subset of states in finite, discrete-time Markov chains is considered.
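These Markov-chain objects can be sketched minimally in C++ (the helper names and the small illustrative chain are not from the paper):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch of the Markov-chain objects described above.
using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;

// Each row of a Probability Transition Matrix must sum to 1.
bool is_row_stochastic(const Matrix& P) {
    for (const Vector& row : P) {
        double s = 0.0;
        for (double p : row) s += p;
        if (std::fabs(s - 1.0) > 1e-6) return false;
    }
    return true;
}

// One step of the chain: the next distribution is x' = x * P,
// which is again a probability vector.
Vector step(const Vector& x, const Matrix& P) {
    Vector next(x.size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            next[j] += x[i] * P[i][j];
    return next;
}
```

For example, a 2-state chain P = {{0.9, 0.1}, {0.5, 0.5}} started at state 0 (x = {1, 0}) moves to the distribution {0.9, 0.1} after one step.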

A. Web_Object_Rank Algorithm: Features

Features of the Object Rank algorithm are as follows:

- Query-independent algorithm (assigns a value to every document independent of the query).
- Content-independent algorithm.
- Concerned with the static quality of a web page.
- The Object Rank value can be computed offline using only the web graph.
- Object Rank is based upon the linking structure of the whole web.
- Object Rank does not rank a website as a whole; it is determined for each web page individually.
- The Object Rank of the web pages Ti which link to page A does not influence the rank of page A uniformly.
- The more outbound links a page T has, the less page A benefits from a link on it.
- Object Rank is a model of the user's behavior.

B. Web_Object_Rank Algorithm: Assumptions

- If there are multiple links between two web objects, only a single edge is placed.
- No self-loops are allowed.
- The edges could be weighted, but we assume that no weight is assigned to edges in the graph.
- Links within the same web site are removed.
- Isolated nodes are removed from the graph.
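The first two assumptions can be realized as a small preprocessing pass; a sketch in C++ (the edge-list representation and helper name are assumptions, and the same-site and isolated-node filters, which need extra metadata, are omitted):

```cpp
#include <set>
#include <utility>
#include <vector>

// Collapse multiple links between two web objects into a single
// unweighted edge, and drop self-loops, per the assumptions above.
using Edge = std::pair<int, int>;

std::vector<Edge> clean_edges(const std::vector<Edge>& raw) {
    std::set<Edge> seen;                 // remembers edges already kept
    std::vector<Edge> out;
    for (const Edge& e : raw) {
        if (e.first == e.second) continue;            // no self-loops
        if (seen.insert(e).second) out.push_back(e);  // one copy per edge
    }
    return out;
}
```

For instance, the raw list {(1,2), (1,2), (2,2), (2,3)} reduces to {(1,2), (2,3)}.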


C. Web_Object_Rank Algorithm

This algorithm is basically a query-independent algorithm that takes a web graph as input and assigns a rank to every object, which specifies the relative authority of that web page. In the proposed algorithm, the variables are as follows:

- moist_fact (m) is the moister factor: the probability that the random surfer restarts the search from another web object
- 1 - m is the probability that the random surfer searches web objects from randomly chosen outlinks
- outlinks is the number of web objects linked from a particular page
- N is the number of objects in the domain
- prob[i][j] is the Probability Transition Matrix, for all i, j ∈ 1 to N
- adj[i][j] is the Adjacency Matrix, for all i, j ∈ 1 to N
- x is the Probability Vector
- itr is the iteration counter

D. Web_Object_Rank Algorithm: Steps

Step 1. Create a web graph of the various objects in a domain.
Step 2. Set prob[i][j] = adj[i][j].
Step 3. Compute the number of outlinks from a particular node, say counter. If a web object has no outlinks, prob[i][j] is distributed equally; otherwise the prob values are distributed according to the number of outlinks:
    For all i, j:
        IF (counter = 0) THEN prob[i][j] = 1/N
        ELSE IF (prob[i][j] = 1) THEN prob[i][j] = 1.0/counter
Step 4. Multiply the resulting matrix by (1 - m).
Step 5. Add m/N to every entry of the resulting matrix to obtain the Probability Transition Matrix:
    For all i, j: prob[i][j] = (prob[i][j] * (1 - m)) + (m/N)
Step 6. Randomly select a node from 0 to N-1 to start a walk, say s_int.
Step 7. Initialize the random surfer, and set itr, which counts the number of iterations required, to 0.
Step 8. Try to reach a steady state within 200 iterations; otherwise toggling occurs.
Step 9. Multiply the Probability Transition Matrix with the Probability Vector to approach the steady state.
Step 10. Check whether or not the system enters a steady state.
Step 11. Print the ranks stored in the Probability Vector x and EXIT.
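The steps above can be sketched in C++ (the paper's actual source is not reproduced here; the function names, dense-matrix representation, and convergence tolerance are illustrative assumptions):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the Web_Object_Rank steps; m is the moister factor.
using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Steps 2-5: turn the adjacency matrix into the Probability Transition Matrix.
Matrix transition_matrix(const std::vector<std::vector<int>>& adj, double m) {
    const std::size_t N = adj.size();
    Matrix prob(N, Vector(N, 0.0));
    for (std::size_t i = 0; i < N; ++i) {
        int counter = 0;                         // outlinks of object i
        for (std::size_t j = 0; j < N; ++j) counter += adj[i][j];
        for (std::size_t j = 0; j < N; ++j) {
            double p = (counter == 0) ? 1.0 / N  // object with no outlinks
                                      : adj[i][j] * 1.0 / counter;
            prob[i][j] = p * (1.0 - m) + m / N;  // steps 4 and 5
        }
    }
    return prob;
}

// Steps 6-11: power iteration until two successive Probability Vectors agree.
Vector object_rank(const Matrix& prob, std::size_t start, int max_itr = 200) {
    const std::size_t N = prob.size();
    Vector x(N, 0.0);
    x[start] = 1.0;                              // initial state of the surfer
    for (int itr = 0; itr < max_itr; ++itr) {
        Vector next(N, 0.0);
        for (std::size_t i = 0; i < N; ++i)
            for (std::size_t j = 0; j < N; ++j)
                next[j] += x[i] * prob[i][j];    // x' = x * P
        double diff = 0.0;
        for (std::size_t j = 0; j < N; ++j) diff += std::fabs(next[j] - x[j]);
        x = next;
        if (diff < 1e-10) break;                 // steady state reached
    }
    return x;                                    // object ranks
}
```

On a symmetric graph such as a 3-object cycle, the ranks converge to the uniform distribution, and the entries of the returned Probability Vector always sum to 1.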

V. IMPLEMENTATION

This implementation is based upon the random surfer model [7] and the Markov chain [13, 14]. The random surfer visits the objects in the web graph according to a distribution based on which the random surfer can be in one of the following four possible states at any time.

Initial state is the state of the system from which it starts its walk. The system is set in a random state by randomly selecting an object using a random function; the value corresponding to that web object in the Probability Vector is set to unity, and the rest of the values in the Probability Vector are zero. Steady state is the state of the system in which the Probability Vector of the random surfer fulfills the properties of irreducibility and aperiodicity. To check whether the system has reached the steady state, two successive values of the Probability Vector must be the same. Ideal state is the state of the random surfer in which the system achieves the steady state but the web object ranks are distributed uniformly to all documents. Toggling state is reached by the random surfer when the system is not able to reach a steady state and just toggles between two sets of object ranks.
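The steady and toggling states described above can be told apart by comparing the last few Probability Vectors of the walk; a sketch in C++ (the helper names and tolerance are illustrative, not from the paper):

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

using Vector = std::vector<double>;

// Two Probability Vectors agree if they match entry by entry.
bool same(const Vector& a, const Vector& b, double eps = 1e-9) {
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > eps) return false;
    return true;
}

// Classify the walk from its last three distributions x_{t-2}, x_{t-1}, x_t:
// "steady" when successive vectors agree, "toggling" when the vector
// repeats with period two, "running" otherwise.
std::string classify(const Vector& x2, const Vector& x1, const Vector& x0) {
    if (same(x0, x1)) return "steady";
    if (same(x0, x2)) return "toggling";
    return "running";
}
```

A walk that alternates between {1, 0} and {0, 1}, for example, is classified as toggling rather than steady.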

Fig. 1. Web Graph (objects O1-O10)
