Combating Web Spam with TrustRank
Zoltán Gyöngyi, Stanford University, Computer Science Department, Stanford, CA 94305, zoltan@cs.stanford.edu
Hector Garcia-Molina, Stanford University, Computer Science Department, Stanford, CA 94305, hector@cs.stanford.edu
Jan Pedersen, Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089, jpederse@yahoo-inc.com

Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004
Abstract
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
1 Introduction
The term web spam refers to hyperlinked pages on the World Wide Web that are created with the intention of misleading search engines. For example, a pornography site may spam the web by adding thousands of keywords to its home page, often making the text invisible to humans through ingenious use of color schemes. A search engine will then index the extra keywords, and return the pornography page as an answer to queries that contain some of the keywords. As the added keywords are typically not of strictly adult nature, people searching for other topics will be led to the page.
Another web spamming technique is the creation of a large number of bogus web pages, all pointing to a single target page. Since many search engines take into account the number of incoming links in ranking pages, the rank of the target page is likely to increase, and the page will appear earlier in query result sets.

Just as with email spam, determining if a page or group of pages is spam is subjective. For instance, consider a cluster of web sites that link to each other's pages repeatedly. These links may represent useful relationships between the sites, or they may have been created with the express intention of boosting the rank of each other's pages. In general, it is hard to distinguish between these two scenarios.

However, just as with email spam, most people can easily identify the blatant and brazen instances of web spam. For example, most would agree that if much of the text on a page is made invisible to humans (as noted above), and is irrelevant to the main topic of the page, then it was added with the intention to mislead. Similarly, if one finds a page with thousands of URLs referring to hosts like

    buy-canon-rebel-300d-lens-case.camerasx.com,
    buy-nikon-d100-d70-lens-case.camerasx.com,
    ...,

and notices that all host names map to the same IP address, then one would conclude that the page was created to mislead search engines. (The motivation behind URL spamming is that many search engines pay special attention to words in host names and give these words a higher weight than if they had occurred in plain text.)

While most humans would agree on the blatant web spam cases, this does not mean that it is easy for a computer to detect such instances. Search engine companies typically employ staff members who specialize in the detection of web spam, constantly scanning the web looking for offenders. When a spam page is identified, a search engine stops crawling it, and its content is no longer indexed. This spam detection process is very expensive and slow, but it is critical to the success of search engines: without the removal of the blatant offenders, the quality of search results would degrade significantly.

Our research goal is to assist the human experts who detect web spam.
[Figure 1: A simple web graph.]
In particular, we want to identify pages and sites that are likely to be spam or that are likely to be reputable. The methods that we present in this paper could be used in two ways: (1) either as helpers in an initial screening process, suggesting pages that should be examined more closely by an expert, or (2) as a counter-bias to be applied when results are ranked, in order to discount possible boosts achieved by spam.

Since the algorithmic identification of spam is very difficult, our schemes do not operate entirely without human assistance. As we will see, the main algorithm we propose receives human assistance as follows. The algorithm first selects a small seed set of pages whose "spam status" needs to be determined. A human expert then examines the seed pages, and tells the algorithm if they are spam (bad pages) or not (good pages). Finally, the algorithm identifies other pages that are likely to be good based on their connectivity with the good seed pages.

In summary, the contributions of this paper are:

1. We formalize the problem of web spam and spam detection algorithms.
2. We define metrics for assessing the efficacy of detection algorithms.
3. We present schemes for selecting seed sets of pages to be manually evaluated.
4. We introduce the TrustRank algorithm for determining the likelihood that pages are reputable.
5. We discuss the results of an extensive evaluation, based on 31 million sites crawled by the AltaVista search engine, and a manual examination of over 2,000 sites. We provide some interesting statistics on the type and frequency of encountered web contents, and we use our data for evaluating the proposed algorithms.
2 Preliminaries
2.1 Web Model
We model the web as a graph G = (V, E) consisting of a set V of N pages (vertices) and a set E of directed links (edges) that connect pages. In practice, a web page p may have multiple HTML hyperlinks to some other page q. In this case we collapse these multiple hyperlinks into a single link (p, q) ∈ E. We also remove self hyperlinks. Figure 1 presents a very simple web graph of four pages and four links. (For our experiments in Section 6, we will deal with web sites, as opposed to individual web pages. However, our model and algorithms carry through to the case where graph vertices are entire sites.)

Each page has some incoming links, or inlinks, and some outgoing links, or outlinks. The number of inlinks of a page p is its indegree ι(p), whereas the number of outlinks is its outdegree ω(p). For instance, the indegree of page 3 in Figure 1 is one, while its outdegree is two.

Pages that have no inlinks are called unreferenced pages. Pages without outlinks are referred to as non-referencing pages. Pages that are both unreferenced and non-referencing at the same time are isolated pages. Page 1 in Figure 1 is an unreferenced page, while page 4 is non-referencing.

We introduce two matrix representations of a web graph, which will have important roles in the following sections. One of them is the transition matrix T:

    T(p, q) = \begin{cases} 0 & \text{if } (q, p) \notin E, \\ 1/\omega(q) & \text{if } (q, p) \in E. \end{cases}

The transition matrix corresponding to the graph in Figure 1 is:

    T = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & \frac{1}{2} & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{1}{2} & 0 \end{pmatrix}.

We also define the inverse transition matrix U:

    U(p, q) = \begin{cases} 0 & \text{if } (p, q) \notin E, \\ 1/\iota(q) & \text{if } (p, q) \in E. \end{cases}

Note that U ≠ T^T. For the example in Figure 1 the inverse transition matrix is:

    U = \begin{pmatrix} 0 & \frac{1}{2} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \frac{1}{2} & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
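To make the two matrix definitions concrete, here is a minimal Python sketch (not from the paper) that builds T and U for the four-page graph of Figure 1, using the edge list 1→2, 2→3, 3→2, 3→4 implied by the matrices above; numpy is used only for convenience.

```python
import numpy as np

# Figure 1 graph as (source, destination) pairs, pages numbered 1..4.
# This edge list is read off the matrices above: 1->2, 2->3, 3->2, 3->4.
edges = [(1, 2), (2, 3), (3, 2), (3, 4)]
N = 4

outdegree = {p: sum(1 for s, _ in edges if s == p) for p in range(1, N + 1)}  # omega(p)
indegree = {p: sum(1 for _, t in edges if t == p) for p in range(1, N + 1)}   # iota(p)

# Transition matrix: T(p, q) = 1/omega(q) whenever there is a link q -> p.
T = np.zeros((N, N))
for q, p in edges:                      # q is the source, p the destination
    T[p - 1, q - 1] = 1.0 / outdegree[q]

# Inverse transition matrix: U(p, q) = 1/iota(q) whenever there is a link p -> q.
U = np.zeros((N, N))
for p, q in edges:                      # p is the source, q the destination
    U[p - 1, q - 1] = 1.0 / indegree[q]

print(T)                     # matches the matrix T shown above
print(U)                     # matches the matrix U shown above
print(np.allclose(U, T.T))   # False: U is not simply the transpose of T
```

Each column q of T spreads page q's score evenly over its outlinks, so a column sums to one whenever the corresponding page has any outlinks; this is the property the PageRank iteration in the next section relies on.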
2.2 PageRank
PageRank is a well known algorithm that uses link information to assign global importance scores to all pages on the web. Because our proposed algorithms rely on PageRank, this section offers a short overview.

The intuition behind PageRank is that a web page is important if several other important web pages point to it. Correspondingly, PageRank is based on a mutual reinforcement between pages: the importance of a certain page influences and is being influenced by the importance of some other pages.

The PageRank score r(p) of a page p is defined as:

    r(p) = \alpha \cdot \sum_{q:(q,p) \in E} \frac{r(q)}{\omega(q)} + (1 - \alpha) \cdot \frac{1}{N},

where α is a decay factor.¹ The equivalent matrix equation form is:

    r = \alpha \cdot T \cdot r + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N.

Hence, the score of some page p is a sum of two components: one part of the score comes from pages that point to p, and the other (static) part of the score is equal for all web pages.

PageRank scores can be computed iteratively, for instance, by applying the Jacobi method [3]. While in a strict mathematical sense, iterations should be run to convergence, it is more common to use only a fixed number of M iterations in practice.

It is important to note that while the regular PageRank algorithm assigns the same static score to each page, a biased PageRank version may break this rule. In the matrix equation

    r = \alpha \cdot T \cdot r + (1 - \alpha) \cdot d,

vector d is a static score distribution vector of arbitrary, non-negative entries summing up to one. Vector d can be used to assign a non-zero static score to a set of special pages only; the score of such special pages is then spread during the iterations to the pages they point to.

¹ Note that there are a number of equivalent definitions of PageRank [12] that might slightly differ in mathematical formulation and numerical properties, but yield the same relative ordering between any two web pages.
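To illustrate how the matrix iteration above is used in practice, the following Python sketch (not code from the paper) runs the recurrence for a fixed number of M iterations. A uniform d reproduces regular PageRank, while a d that is non-zero only on a few special pages gives the biased version; the values α = 0.85 and M = 20 are illustrative assumptions, as is the choice of starting vector.

```python
import numpy as np

def biased_pagerank(T, d, alpha=0.85, M=20):
    """Iterate r = alpha * T * r + (1 - alpha) * d for a fixed number of M steps.

    T is the N x N transition matrix and d a non-negative static score
    distribution summing to one. A uniform d (all entries 1/N) corresponds
    to regular PageRank; a d that is non-zero only on special pages gives
    the biased version.
    """
    N = T.shape[0]
    r = np.full(N, 1.0 / N)                  # arbitrary starting scores
    for _ in range(M):
        r = alpha * T.dot(r) + (1.0 - alpha) * d
    return r

# Transition matrix of the Figure 1 graph from Section 2.1.
T = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0]])

uniform = np.full(4, 0.25)                    # regular PageRank
special = np.array([1.0, 0.0, 0.0, 0.0])      # all static score on page 1

print(biased_pagerank(T, uniform))
print(biased_pagerank(T, special))
```

In the second run, the static score placed on page 1 is spread along outlinks to pages 2, 3 and 4 over the iterations, while page 1 itself keeps only its static share, matching the behavior of special pages described above.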
3 Assessing Trust
3.1 Oracle and Trust Functions
As discussed in Section 1, determining if a page is spam is subjective and requires human evaluation. We formalize the notion of a human checking a page for spam by a binary oracle function O over all pages p ∈ V:

    O(p) = \begin{cases} 0 & \text{if } p \text{ is bad,} \\ 1 & \text{if } p \text{ is good.} \end{cases}

[Figure 2: A web of good (white) and bad (black) nodes.]

Figure 2 represents a small seven-page web where good pages are shown as white, and bad pages as black. For this example, calling the oracle on pages 1 through 4 would yield the return value of 1.

Oracle invocations are expensive and time consuming. Thus, we obviously do not want to call the oracle function for all pages. Instead, our objective is to be selective, i.e., to ask a human expert to evaluate only some of the web pages.

To discover good pages without invoking the oracle function on the entire web, we will rely on an important empirical observation we call the approximate isolation of the good set: good pages seldom point to bad ones. This notion is fairly intuitive: bad pages are built to mislead search engines, not to provide useful information. Therefore, people creating good pages have little reason to point to bad pages.

However, the creators of good pages can sometimes be "tricked," so we do find some good-to-bad links on the web. (In Figure 2 we show one such good-to-bad link, from page 4 to page 5, marked with an asterisk.) Consider the following example. Given a good, but unmoderated message board, spammers may include URLs to their spam pages as part of the seemingly innocent messages they post. Consequently, good pages of the message board would link to bad pages. Also, sometimes spam sites offer what is called a honey pot: a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have hidden links to their spam pages. The honey pot then attracts people to point to it, boosting the ranking of the spam pages.

Note that the converse to approximate isolation does not necessarily hold: spam pages can, and in fact often do, link to good pages. For instance, creators of spam pages point to important good pages either to create a honey pot, or hoping that many good outlinks would boost their hub-score-based ranking [10].
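The seven-page example can be written down directly. In the sketch below (not from the paper), the oracle labels follow the text (pages 1 through 4 are good, so the remaining black pages 5 through 7 are bad); the edge list is hypothetical, except for the good-to-bad link from page 4 to page 5 mentioned above. Under approximate isolation, the list of good-to-bad links should be short.

```python
# Toy version of the seven-page example in Figure 2. The labels follow the
# text (pages 1-4 are good, pages 5-7 are the bad ones); the edge list is
# hypothetical, except for the good-to-bad link 4 -> 5 mentioned above.
oracle = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}          # O(p): 1 good, 0 bad

edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 5)]   # assumed

# Approximate isolation: links from good pages to bad pages should be rare.
good_to_bad = [(p, q) for p, q in edges if oracle[p] == 1 and oracle[q] == 0]
print(good_to_bad)    # [(4, 5)] -- the single "tricked" link in this toy graph
```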
To evaluate pages without relying on O, we will estimate the likelihood that a given page p is good. More formally, we define a trust function T that yields a range of values between 0 (bad) and 1 (good). Ideally, for any page p, T(p) should give us the probability that p is good:

    Ideal Trust Property:   T(p) = Pr[O(p) = 1].
To illustrate, let us consider a set of 100 pages and say that the trust score of each of these pages happens to be 0.7. Let us suppose that we also evaluate all the 100 pages with the oracle function. Then, if T works properly, for 70 of the pages the oracle score should be 1, and for the remaining 30 pages the oracle score should be 0.

In practice, it is very hard to come up with a function T with the previous property. However, even if T does not accurately measure the likelihood that a page is good, it would still be useful if the function could at least help us order pages by their likelihood of being good. That is, if we are given a pair of pages p and q, and p has a lower trust score than q, then this should indicate that p is less likely to be good than q. Such a function would at least be useful in ordering search results, giving preference to pages more likely to be good. More formally, then, a desirable property for the trust function is:

    Ordered Trust Property:   T(p) < T(q)  ⟺  Pr[O(p) = 1] < Pr[O(q) = 1].
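A small sketch (not from the paper) can make both properties concrete. The first part simulates the 100-page illustration above: with every trust score equal to 0.7, roughly 70 of the simulated oracle calls come back as good. The second part checks the pairwise ordering requirement against hypothetical trust scores and assumed probabilities of being good.

```python
import random

random.seed(0)

# Ideal Trust Property, illustrated with the 100-page example above: every
# page has trust score 0.7, so a simulated oracle should label roughly 70
# of them as good.
trust = {p: 0.7 for p in range(100)}
oracle = {p: 1 if random.random() < trust[p] else 0 for p in trust}
print(sum(oracle.values()))          # close to 70

# Ordered Trust Property: a page with a lower trust score should be less
# likely to be good than a page with a higher trust score.
def ordering_consistent(trust, prob_good):
    pages = list(trust)
    return all(
        (trust[p] < trust[q]) == (prob_good[p] < prob_good[q])
        for p in pages for q in pages if trust[p] != trust[q]
    )

prob_good = {"a": 0.2, "b": 0.5, "c": 0.9}     # assumed true probabilities of being good
print(ordering_consistent({"a": 0.1, "b": 0.4, "c": 0.8}, prob_good))   # True
print(ordering_consistent({"a": 0.6, "b": 0.4, "c": 0.8}, prob_good))   # False
```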