15 views

Uploaded by Sanni Kumar

save

- Human Action Recognition by Learning Bases of Action Attributes and Parts
- Word Search Paper
- Tables
- 70539_06d
- baraff-witkin98
- For Most Large Underdetermined Systems of Linear Equations the Minimal `1-norm Solution is also the Sparsest Solution
- Matlab Notes
- MatlabNotes (1)
- MolerEtAl_SparseMatricesinMATLAB_10.1.1.38
- Lecture 1 Basic Matlab
- 75 Truth Discovery
- bcsopt
- Activity 1.docx
- Efficient Enforcement of Hard Articulation Constraints in the Presence of Closed Loops and Contacts_comments
- Matlab Help Text
- Mb0040 Statistics for Management Final
- new paper 2
- Hungarian Algorithm for Excel_VBA
- Young Londoners Survey 2009
- CI_p
- TCS Placement Paper 1.doc
- lifsci_v4_n34_2012_2
- Sampling Waters Flow and Load
- Writing Mesh & Nodal Equations Directly Matrix Form.doc
- Prioritization Matrices
- Project Proposal
- 3.1 Sampling Concept
- Mata33 Final 2013w
- 2018 ma1 wk 17 18 revision test week
- softeng_2015_4_40_55097.pdf
- Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
- Dispatches from Pluto: Lost and Found in the Mississippi Delta
- The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution
- Sapiens: A Brief History of Humankind
- Yes Please
- The Unwinding: An Inner History of the New America
- Grand Pursuit: The Story of Economic Genius
- This Changes Everything: Capitalism vs. The Climate
- A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
- The Emperor of All Maladies: A Biography of Cancer
- The Prize: The Epic Quest for Oil, Money & Power
- John Adams
- Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
- The World Is Flat 3.0: A Brief History of the Twenty-first Century
- Rise of ISIS: A Threat We Can't Ignore
- The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
- Smart People Should Build Things: How to Restore Our Culture of Achievement, Build a Path for Entrepreneurs, and Create New Jobs in America
- Team of Rivals: The Political Genius of Abraham Lincoln
- The New Confessions of an Economic Hit Man
- How To Win Friends and Influence People
- Angela's Ashes: A Memoir
- Steve Jobs
- Bad Feminist: Essays
- You Too Can Have a Body Like Mine: A Novel
- The Incarnations: A Novel
- The Light Between Oceans: A Novel
- Leaving Berlin: A Novel
- The Silver Linings Playbook: A Novel
- The Sympathizer: A Novel (Pulitzer Prize for Fiction)
- Extremely Loud and Incredibly Close: A Novel
- A Man Called Ove: A Novel
- The Master
- Bel Canto
- We Are Not Ourselves: A Novel
- The First Bad Man: A Novel
- The Rosie Project: A Novel
- The Blazing World: A Novel
- Brooklyn: A Novel
- The Flamethrowers: A Novel
- Life of Pi
- The Love Affairs of Nathaniel P.: A Novel
- The Bonfire of the Vanities: A Novel
- Lovers at the Chameleon Club, Paris 1932: A Novel
- The Perks of Being a Wallflower
- A Prayer for Owen Meany: A Novel
- The Cider House Rules
- Wolf Hall: A Novel
- The Art of Racing in the Rain: A Novel
- The Wallcreeper
- Interpreter of Maladies
- The Kitchen House: A Novel
- Beautiful Ruins: A Novel
- Good in Bed

You are on page 1of 13

**Information Providers on the Web
**

Xiaoxin Yin, Jiawei Han, Senior Member, IEEE, and Philip S. Yu, Fellow, IEEE

Abstract—The World Wide Web has become the most important information source for most of us. Unfortunately, there is no

guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a

subject, such as different specifications for the same product. In this paper, we propose a new problem, called Veracity, i.e., conformity

to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various

websites. We design a general framework for the Veracity problem and invent an algorithm, called TRUTHFINDER, which utilizes the

relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a

piece of information is likely to be true if it is provided by many trustworthy websites. An iterative method is used to infer the

trustworthiness of websites and the correctness of information from each other. Our experiments show that TRUTHFINDER

successfully finds true facts among conflicting information and identifies trustworthy websites better than the popular search engines.

Index Terms—Data quality, Web mining, link analysis.

Ç

1 INTRODUCTION

T

HE World Wide Web has become a necessary part of our

lives and might have become the most important

information source for most people. Everyday, people

retrieve all kinds of information from the Web. For example,

when shopping online, people find product specifications

from websites like Amazon.com or ShopZilla.com. When

looking for interesting DVDs, they get information and read

movie reviews on websites such as NetFlix.com or IMDB.

com. When they want to know the answer to a certain

question, they go to Ask.com or Google.com.

“Is the World Wide Web always trustable?” Unfortunately,

the answer is “no.” There is no guarantee for the correctness

of information on the Web. Even worse, different websites

often provide conflicting information, as shown in the

following examples.

Example 1 (Height of Mount Everest). Suppose a user is

interested in how high Mount Everest is and queries

Ask.com with “What is the height of Mount Everest?”

Among the top 20 results,

1

he or she will find the

following facts: four websites (including Ask.com itself)

say 29,035 feet, five websites say 29,028 feet, one says

29,002 feet, and another one says 29,017 feet. Which

answer should the user trust?

Example 2 (Authors of books). We tried to find out who

wrote the book Rapid Contextual Design (ISBN:

0123540518). We found many different sets of authors

from different online bookstores, and we show several of

them in Table 1. From the image of the book cover, we

found that A1 Books provides the most accurate

information. In comparison, the information from Po-

well’s books is incomplete, and that from Lakeside books is

incorrect.

The trustworthiness problem of the Web has been

realized by today’s Internet users. According to a survey

on the credibility of websites conducted by Princeton

Survey Research in 2005 [11], 54 percent of Internet users

trust news websites at least most of time, while this ratio is

only 26 percent for websites that offer products for sale and

is merely 12 percent for blogs.

There have been many studies on ranking web pages

according to authority (or popularity) based on hyperlinks.

The most influential studies are Authority-Hub analysis [7],

and PageRank [10], which lead to Google.com. However,

does authority lead to accuracy of information? The answer

is unfortunately no. Top-ranked websites are usually the

most popular ones. However, popularity does not mean

accuracy. For example, according to our experiments

(Section 4.2), the bookstores ranked on top by Google

(Barnes & Noble and Powell’s books) contain many errors on

book author information. In comparison, some small book-

stores (e.g., A1 Books) provide more accurate information.

In this paper, we propose a new problem called the

Veracity problem, which is formulated as follows: Given a

large amount of conflicting information about many

objects, which is provided by multiple websites (or other

types of information providers), how can we discover the

true fact about each object? We use the word “fact” to

represent something that is claimed as a fact by some

website, and such a fact can be either true or false. In this

paper, we only study the facts that are either properties of

796 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008

. X. Yin is with Microsoft Research, Microsoft Corporation, One Microsoft

Way, Redmond, WA 98052. E-mail: xyin@microsoft.com.

. J. Han is with the Department of Computer Science, University of Illinois

at Urbana-Champaign, 201 N. Goodwin Avenue, 2132, Urbana, IL 61801.

E-mail: hanj@cs.uiuc.edu.

. P.S. Yu is with the Department of Computer Science, University of Illinois

at Chicago, 851 South Morgan Street, Chicago, IL 60607.

E-mail: psyu@cs.uic.edu.

Manuscript received 17 July 2007; revised 20 Nov. 2007; accepted 6 Dec.

2007; published online 19 Dec. 2007.

For information on obtaining reprints of this article, please send e-mail to:

tkde@computer.org, and reference IEEECS Log Number TKDE-2007-07-0369.

Digital Object Identifier no. 10.1109/TKDE.2007.190745.

1. The query was sent on 9 February 2007.

1041-4347/08/$25.00 ß 2008 IEEE Published by the IEEE Computer Society

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

objects (e.g., weights of laptop computers) or relationships

between two objects (e.g., authors of books). We also

require that the facts can be parsed from the web pages.

There are often conflicting facts on the Web, such as

different sets of authors for a book. There are also many

websites, some of which are more trustworthy than

others.

2

A fact is likely to be true if it is provided by

trustworthy websites (especially if by many of them). A

website is trustworthy if most facts it provides are true.

Because of this interdependency between facts and

websites, we choose an iterative computational method.

At each iteration, the probabilities of facts being true and

the trustworthiness of websites are inferred from each

other. This iterative procedure is rather different from

Authority-Hub analysis [7]. The first difference is in the

definitions. The trustworthiness of a website does not

depend on how many facts it provides but on the accuracy

of those facts. For example, a website providing 10,000 facts

with an average accuracy of 0.7 is much less trustworthy

than a website providing 100 facts with an accuracy of 0.95.

Thus, we cannot compute the trustworthiness of a website

by adding up the weights of its facts as in [7], nor can we

compute the probability of a fact being true by adding up

the trustworthiness of websites providing it. Instead, we

have to resort to probabilistic computation. Second and

more importantly, different facts influence each other. For

example, if a website says that a book is written by

“Jessamyn Wendell” and another says “Jessamyn Burns

Wendell,” then these two websites actually support each

other although they provide slightly different facts. We

incorporate such influences between facts into our compu-

tational model.

In summary, we make three major distributions in this

paper. First, we formulate the Veracity problem about

how to discover true facts from conflicting information.

Second, we propose a framework to solve this problem,

by defining the trustworthiness of websites, confidence of

facts, and influences between facts. Finally, we propose

an algorithm called TRUTHFINDER for identifying true

facts using iterative methods. Our experiments show that

TRUTHFINDER achieves very high accuracy in discovering

true facts, and it can select better trustworthy websites

than authority-based search engines such as Google.

The rest of the paper is organized as follows: We describe

the problem in Section 2 and propose the computational

model and algorithms in Section 3. Experimental results are

presented in Section 4. We discuss related work in Section 5

and conclude this study in Section 6.

2 PROBLEM DEFINITIONS

In this paper, we study the problem of finding true facts in a

certain domain. Here, a domain refers to a property of a

certain type of objects, such as authors of books or number

of pixels of camcorders. The input of TRUTHFINDER is a

large number of facts in a domain that are provided by

many websites. There are usually multiple conflicting facts

from different websites for each object, and the goal of

TRUTHFINDER is to identify the true fact among them.

Fig. 1 shows a miniexample data set, which contains five

facts about two objects provided by four websites. Each

website provides at most one fact for an object.

2.1 Basic Definitions

We first introduce the two most important definitions in

this paper, the confidence of facts and the trustworthiness

of websites.

Definition 1 (Confidence of facts). The confidence of a fact 1

(denoted by :ð1Þ) is the probability of 1 being correct,

according to the best of our knowledge.

Definition 2 (Trustworthiness of websites). The trust-

worthiness of a website n (denoted by tðnÞ) is the expected

confidence of the facts provided by n.

Different facts about the same object may be conflicting.

For example, one website claims that a book is written by

“Karen Holtzblatt,” whereas another claims that it is written

by “Jessamyn Wendell.” However, sometimes facts may be

supportive to each other although they are slightly

different. For example, one website claims the author to

be “Jennifer Widom,” and another one claims “J. Widom,”

or one website says that a certain camera is 4 inches long,

and another one says 10 cm. If one of such facts is true, the

other is also likely to be true.

In order to represent such relationships, we propose

the concept of implication between facts. The implication

from fact 1

1

to 1

2

, iijð1

1

!1

2

Þ, is 1

1

’s influence on

1

2

’s confidence, i.e., how much 1

2

’s confidence should be

increased (or decreased) according to 1

1

’s confidence. It is

required that iijð1

1

!1

2

Þ is a value between À1 and 1.

YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 797

TABLE 1

Conflicting Information about Book Authors

2. The “trustworthiness” in this paper means accuracy in providing

information. It is different from the “trustworthiness” in the studies of trust

management [2].

Fig. 1. Input of TRUTHFINDER.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

A positive value indicates that if 1

1

is correct, 1

2

is likely

to be correct. While a negative value means that if 1

1

is

correct, 1

2

is likely to be wrong. The details about this

will be described in Section 3.1.2.

We define implication instead of similarity between facts

because such relationship is asymmetric. For example, in

some domains (e.g., book authors), websites tend to provide

incomplete facts (e.g., first author of a book). Suppose two

websites provide author information for the same book. The

first website indicates that the author of the book is

“Jennifer Widom,” which is fact 1

1

. The second website

says that there are two authors “Jennifer Widom and

Stefano Ceri,” which is fact 1

2

. If 1

2

is correct, then 1

1

is

incomplete and will have low confidence, and thus,

iijð1

2

!1

1

Þ is low. On the other hand, we know that it

is very common for a website to provide only one of the

authors for a book. Thus, 1

1

may only tell us that “Jennifer

Widom” is one author of the book instead of the sole author.

If we are confident about 1

1

, we should also be confident

about 1

2

because 1

2

is consistent with 1

1

, and iijð1

1

!1

2

Þ

should be high. From this example, we can see that

implication is an asymmetric relationship.

Please notice that the definition of implication is

domain specific. The implication for book authors should

be very different from that for the number of pixels of

camcorders. When a user uses TRUTHFINDER on a

certain domain, he or she should provide the definition

of implication between facts. If in a domain, the

relationship between two facts is symmetric and the

definition of similarity is available, the user can define

iijð1

1

!1

2

Þ ¼ :iið1

1

. 1

2

Þ À/o:c :ii, where :iið1

1

. 1

2

Þ is

the similarity between 1

1

and 1

2

, and /o:c :ii is a

threshold for similarity.

2.2 Basic Heuristics

Based on common sense and our observations on real data,

we have four basic heuristics that serve as the base of our

computational model.

Heuristic 1. Usually there is only one true fact for a property of

an object.

In this paper, we assume that there is only one true fact

for a property of an object. The case of multiple true facts

will be studied in our future work.

Heuristic 2. This true fact appears to be the same or similar on

different websites.

Different websites that provide this true fact may present

it in either the same or slightly different ways, such as

“Jennifer Widom” versus “J. Widom.”

Heuristic 3. The false facts on different websites are less likely to

be the same or similar.

Different websites often make different mistakes for the

same object and thus provide different false facts. Although

false facts can be propagated among websites, in general,

the false facts about a certain object are much less consistent

than the true facts.

Heuristic 4. In a certain domain, a website that provides mostly

true facts for many objects will likely provide true facts for

other objects.

There are trustworthy websites such as wikipedia and

untrustworthy websites such as blogs and some small

websites. We believe that a website has some consistency in

the quality of its information in a certain domain.

3 COMPUTATIONAL MODEL

Based on the above heuristics, we know that if a fact is

provided by many trustworthy websites, it is likely to be

true; and, if a fact is conflicting with the facts provided by

many trustworthy websites, it is unlikely to be true. On the

other hand, a website is trustworthy if it provides facts with

high confidence. We can see that the website trustworthi-

ness and fact confidence are determined by each other, and

we can use an iterative method to compute both. Because

true facts are more consistent than false facts (Heuristic 3), it

is likely that we can find and distinguish true facts from

false ones at the end.

In this section, we introduce the model of iterative

computation. Table 2 shows the variables and parameters

used in the following discussion.

3.1 Website Trustworthiness and Fact Confidence

We first discuss how to infer website trustworthiness and

fact confidence from each other. The inference of website

trustworthiness is rather simple, whereas that of fact

confidence is more complicated. We start from the simplest

case and proceed to more complicated ones step by step.

3.1.1 Basic Inference

As defined in Definition 2, the trustworthiness of a website

is just the expected confidence of facts it provides. For

website n, we compute its trustworthiness tðnÞ by

calculating the average confidence of facts provided by n:

tðnÞ ¼

P

121ðnÞ

:ð1Þ

j1ðnÞj

. ð1Þ

where 1ðnÞ is the set of facts provided by n.

798 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008

TABLE 2

Variables and Parameters of TRUTHFINDER

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

In comparison, it is much more difficult to estimate the

confidence of a fact. As shown in Fig. 2, the confidence of a

fact 1

1

is determined by the websites providing it and other

facts about the same object.

Let us first analyze the simple case where there is no

related fact, and 1

1

is the only fact about object o

1

(i.e., 1

2

does not exist in Fig. 2). Because 1

1

is provided by n

1

and

n

2

, if 1

1

is wrong, then both n

1

and n

2

are wrong. We

first assume that n

1

and n

2

are independent. (This is not

true in many cases, and we will compensate for it later.)

Thus, the probability that both of them are wrong is

ð1 Àtðn

1

ÞÞ Á ð1 Àtðn

2

ÞÞ, and the probability that 1

1

is not

wrong is 1 Àð1 Àtðn

1

ÞÞ Á ð1 Àtðn

2

ÞÞ. In general, if a fact 1

is the only fact about an object, then its confidence :ð1Þ

can be computed as

:ð1Þ ¼ 1 À

Y

n2\ð1Þ

ð1 ÀtðnÞÞ. ð2Þ

where \ð1Þ is the set of websites providing 1.

In (2), 1 ÀtðnÞ is usually quite small, and multiplying

many of them may lead to underflow. In order to facilitate

computation and veracity exploration, we use a logarithm

and define the trustworthiness score of a website as

tðnÞ ¼ Àlnð1 ÀtðnÞÞ. ð3Þ

tðnÞ is between and zero and þ1, and a larger tðnÞ

indicates higher trustworthiness.

Similarly, we define the confidence score of a fact as

oð1Þ ¼ Àlnð1 À:ð1ÞÞ. ð4Þ

A very useful property is that the confidence score of a

fact 1 is just the sum of the trustworthiness scores of

websites providing 1. This is shown in the following

lemma, which is used to compute oð1Þ in TRUTHFINDER.

Lemma 1.

oð1Þ ¼

X

n2\ð1Þ

tðnÞ. ð5Þ

Proof. According to (2),

1 À:ð1Þ ¼

Y

n2\ð1Þ

ð1 ÀtðnÞÞ.

Take the logarithm on both sides, and we have

lnð1 À:ð1ÞÞ ¼

X

n2\ð1Þ

lnð1 ÀtðnÞÞ

() oð1Þ ¼

X

n2\ð1Þ

tðnÞ.

tu

3.1.2 Influences between Facts

The above discussion shows how to compute the confidence

of a fact that is the only fact about an object. However, there

are usually many different facts about an object (such as 1

1

and 1

2

in Fig. 2), and these facts influence each other.

Suppose in Fig. 2 that the implication from 1

2

to 1

1

is very

high (e.g., they are very similar). If 1

2

is provided by many

trustworthy websites, then 1

1

is also somehow supported by

these websites, and 1

1

should have reasonably high

confidence. Therefore, we should increase the confidence

score of 1

1

according to the confidence score of 1

2

, which is

the sum of the trustworthiness scores of websites providing

1

2

. We define the adjusted confidence score of a fact 1 as

o

Ã

ð1Þ ¼ oð1Þ þj Á

X

oð1

0

Þ¼oð1Þ

oð1

0

Þ Á iijð1

0

!1Þ. ð6Þ

j is a parameter between zero and one, which controls the

influence of related facts. We can see that o

Ã

ð1Þ is the sum of

the confidence scores of 1, and a portion of the confidence

score of each related fact 1

0

multiplies the implication from

1

0

to 1. Please notice that iijð1

0

!1Þ < 0 when 1 is

conflicting with 1

0

.

We can compute the confidence of 1 based on o

Ã

ð1Þ in the

same way as computing it based on oð1Þ (defined in (4)). We

use :

Ã

ð1Þ to represent this confidence:

3

:

Ã

ð1Þ ¼ 1 Àc

Ào

Ã

ð1Þ

. ð7Þ

3.1.3 Handling Additional Subtlety

Until now, we have described how to compute fact

confidence from website trustworthiness. There are still

two problems with our model, which are discussed below.

The first problem is that we have been assuming that

different websites are independent of each other. This assump-

tion is often incorrect because errors can be propagated

between websites. According to the definitions above, if a

fact 1 is provided by five websites with a trustworthiness

of 0.6 (which is quite low), 1 will have a confidence of 0.99.

However, actually, some of the websites may copy contents

from others. In order to compensate for the problem of

overly high confidence, we add a dampening factor · into (7)

and redefine fact confidence as :

Ã

ð1Þ ¼ 1 Àc

À·Áo

Ã

ð1Þ

, where

0 < · < 1.

The second problem with our model is that the confidence

of a fact 1 can easily be negative if 1 is conflicting with some

YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 799

Fig. 2. Computing confidence of a fact.

3. With a similar proof as in Lemma 1, we can show that (7) is equivalent

to :

Ã

ð1Þ ¼

1 À

Y

n2\ð1Þ

ð1 ÀtðnÞÞ Á

Y

oð1

0

Þ¼oð1Þ

Y

n

0

2\ð1

0

Þ

ð1 Àtðn

0

ÞÞ

0

@

1

A

jÁiijð1

0

!1Þ

.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

facts provided by trustworthy websites, which makes

o

Ã

ð1Þ < 0 and :

Ã

ð1Þ < 0. This is unreasonable because

1) the confidence cannot be negative and 2) even with

negative evidences, there is still a chance that 1 is correct, so

its confidence should still be above zero. Moreover, if we set

:

Ã

ð1Þ ¼ 0 if it is negative according to (7), this “chunking”

operation and the multiple zero values may lead to unstable

conditions in our iterative computation. Therefore, we

adopt the widely used Logistic function [8], which is a

variant of (7), as the final definition for fact confidence.

:ð1Þ ¼

1

1 þc

À·Áo

Ã

ð1Þ

. ð8Þ

When · Á o

Ã

ð1Þ is significantly greater than zero, :ð1Þ is

very close to :

Ã

ð1Þ because

1

1þc

À·Áo

Ã

ð1Þ

% 1 Àc

À·Áo

Ã

ð1Þ

. When · Á

o

Ã

ð1Þ is significantly less than zero, :ð1Þ is close to zero but

remains positive. We compare these two definitions in

Fig. 3. One can see that the two curves are very close when

o

Ã

ð1Þ 3. :

Ã

ð1Þ decreases very sharply when o

Ã

ð1Þ < 1,

which is not reasonable because it is always possible that

the fact is true even with negative evidence. In comparison,

:ð1Þ decreases slowly and is slightly above zero when

o

Ã

ð1Þ (0, which is consistent with the real situation. Please

notice that (8) is also very similar to the Sigmoid function

[12], which has been successfully used in various models in

many fields.

3.2 Computing Website Trustworthiness and Fact

Confidence with Matrix Operations

We have described how to calculate website trustworthi-

ness and fact confidence. It will be more convenient if we

can convert the computational procedure into some basic

matrix operations, which can be implemented easily and

performed efficiently. To facilitate our discussions, we use

vectors to represent the trustworthiness of all websites and

the confidence of all facts. Let the vectors be

t

!

¼ ðtðn

1

Þ. . . . . tðn

`

ÞÞ

T

.

t

!

¼ ðtðn

1

Þ. . . . . tðn

`

ÞÞ

T

.

:

!

¼ ð:ð1

1

Þ. . . . . :ð1

`

ÞÞ

T

.

o

Ã

!

¼ ðo

Ã

ð1

1

Þ. . . . . o

Ã

ð1

`

ÞÞ

T

.

We want to define a ` Â` matrix A for inferring website

trustworthiness from fact confidence and a ` Â` matrix B

for the reverse inference, i.e.

t

!

¼ A:

!

.

o

Ã

!

¼ Bt

!

.

ð9Þ

t

!

and t

!

can be converted from each other using (3), and :

!

and o

!

can be converted using (8).

Matrix A can be easily defined according to (1), by

setting

A

i,

¼

1

.

1ðn

i

Þ j j. if 1

,

2 1ðn

i

Þ.

0. otherwise.

(

ð10Þ

In comparison, matrix B involves more factors because

facts about the same object influence each other. From (6)

and Lemma 1, we can infer that

B

,i

¼

1. if n

i

provides 1

,

.

j Á iijð1

/

!1

,

Þ. if n

i

provides 1

/

and oð1

/

Þ¼oð1

,

Þ.

0. otherwise.

8

<

:

ð11Þ

Both Aand Bare sparse matrices. In general, the website

trustworthiness and fact confidence can be computed

conveniently with matrix operations, as shown in Fig. 4.

The above procedure is very different from Authority-

Hub analysis proposed by Kleinberg [7]. It involves non-

linear transformations and thus cannot be computed using

eigenvector computation as in [7]. Authority-Hub analysis

defines the authority scores and hub scores as the sum of

each other. On the other hand, TRUTHFINDER studies the

probabilities of websites being correct and facts being true,

which cannot be defined as simple summations because the

probability often needs to be computed in nonlinear ways.

That is why TRUTHFINDER requires iterative computation

to achieve convergence.

800 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008

Fig. 3. Two methods for computing confidence.

Fig. 4. Computing website trustworthiness and fact confidence with

matrix operations.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

3.3 Iterative Computation

As described above, we can infer the website trustworthi-

ness if we know the fact confidence and vice versa. As in

Authority-Hub analysis [7] and PageRank [10], TRUTHFIN-

DER adopts an iterative method to compute the trustworthi-

ness of websites and confidence of facts. Initially, it has very

little information about the websites and the facts. At each

iteration, TRUTHFINDER tries to improve its knowledge

about their trustworthiness and confidence, and it stops

when the computation reaches a stable state.

As in other iterative approaches [7], [10], TRUTHFINDER

needs an initial state. We choose the initial state in which all

websites have uniform trustworthiness t

0

. (t

0

should be set

to the estimated average trustworthiness, such as 0.9.) From

the website trustworthiness TRUTHFINDER can infer the

confidence of facts, which are very meaningful because the

facts supported by many websites are more likely to be

correct. On the other hand, if we start from a uniform fact

confidence, we cannot infer meaningful trustworthiness for

websites. Before the iterative computation, we also need to

calculate the two matrices Aand B, as defined in Section 3.2.

They are calculated once and used at every iteration.

In each step of the iterative procedure, TRUTHFINDER

first uses the website trustworthiness to compute the fact

confidence and then recomputes the website trustworthi-

ness from the fact confidence. Each step only requires two

matrix operations and conversions between tðnÞ and tðnÞ

and between :ð1Þ and o

Ã

ð1Þ. The matrices are stored in

sparse formats, and the computational cost of multiplying

such a matrix and a vector is linear with the number of

nonzero entries in the matrix. TRUTHFINDER stops iterating

when it reaches a stable state. The stableness is measured by

how much the trustworthiness of websites changes between

iterations. If t

!

only changes a little after an iteration

(measured by cosine similarity between the old and the new

t

!

), then TRUTHFINDER will stop. The overall algorithm is

presented in Fig. 5.

3.4 Complexity Analysis

In this section, we analyze the complexity of TRUTHFIN-

DER. Suppose there are 1 links between all websites and

facts. Because different websites may provide the same fact,

1 should be greater than ` (number of facts). Suppose on

the average there are / facts about each object, and thus,

each fact has / À1 related facts on the average.

Let us first look at the two matrices A and B. Each link

between a website and a fact corresponds to an entry in A.

Thus, A has 1 entries, and it takes Oð1Þ time to compute A.

B contains more entries than A because B

,i

is nonzero if

website n

i

provides a fact that is related to fact 1

,

. Thus,

there are Oð/1Þ entries in B. Because each website can

provide at most one fact about each object, each entry of B

involves only one website and one fact. Thus, it still takes

constant time to compute each entry of B, and it takes

Oð/1Þ time to compute B.

The time cost of multiplying a sparse matrix and a vector

is linear with the number of entries in the matrix. Therefore,

each iteration takes Oð/1Þ time and no extra space. Suppose

there are 1 iterations. TRUTHFINDER takes Oð1/1Þ time and

Oð/1 þ`þ`Þ space.

If in some cases, Oð/1Þ space is not available, we can

discard the matrix operations and compute the website

trustworthiness andfact confidence using their definitions in

Section 3.1. If we precompute the implication between all

facts, then Oð/`Þ space is needed to store these implication

values, and the total space requirement is Oð1 þ/`Þ. If the

implication between two facts can be computed in a very

short constant time and we do not precompute the implica-

tion, then the total space requirement is Oð1 þ` þ`Þ. In

both cases, it takes Oð1Þ time to propagate between website

trustworthiness and fact confidence and Oð/`Þ time to

adjust fact confidence according to the interfact implication.

Thus, the overall time complexity is Oð11 þ1/`Þ.

4 EMPIRICAL STUDY

We perform experiments on two real data sets and

synthetic data sets to examine the accuracy and efficiency

of TRUTHFINDER. The first real data set contains the

authors of many books provided by many online book-

stores, and the second one contains the runtime of many

movies provided by many websites. As will be explained

in Section 5, there is no existing approach to the problem

studied in this paper, and thus, we compare TRUTHFIN-

DER with a baseline approach as described below.

4.1 Experiment Setting

In order to show the effectiveness of TRUTHFINDER, we

compare it with a baseline approach called VOTING. When

trying to find the true fact for a certain object, VOTING

chooses the fact that is provided by most websites and

resolves ties randomly. This is the simplest approach and

only uses the number of websites supporting each fact. In

comparison, TRUTHFINDER considers the implication be-

tween different facts from the first iteration and considers

the different trustworthiness of different websites in the

following iterations.

All experiments are performed on an Intel PC with a

1.66-GHz dual-core processor with 1 Gbyte of memory

running Windows XP Professional. All approaches are

implemented using Visual Studio.Net (C#), using a single

thread. The two parameters in (8) are set as j ¼ 0.5 and

· ¼ 0.3. The maximum difference between two iterations c

YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 801

Fig. 5. Algorithm of TRUTHFINDER.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

is set to 0.001 percent. We will show in our experiments that

the performance of TRUTHFINDER is not sensitive to these

parameters.

4.2 Real Data Sets

We test the accuracy and efficiency of TRUTHFINDER on

two real data sets. The results are presented below.

4.2.1 Book Authors

The first real data set contains the authors of many books

provided by many online bookstores. It contains 1,265 books

about computer science and engineering, which are pub-

lished by Addison-Wesley, McGraw-Hill, Morgan Kauf-

mann, or Prentice Hall. For each book, we use its ISBN to

search on www.abebooks.com, which returns the online

bookstores that sell the book and the book information from

each store, including the price, the authors, and a short

description. The data set contains 894 bookstores and

34,031 listings (i.e., bookstore selling a book). On the average,

each book has 5.4 different sets of authors, which are

indicated by different bookstores.

TRUTHFINDER performs iterative computation to find

out the set of authors for each book. In order to test its

accuracy, we randomly select 100 books and manually find

out their authors. We find the image of each book and use

the authors on the book cover as the standard fact.

We compare the set of authors found by TRUTHFINDER

for each book with the standard fact to compute the

accuracy of TRUTHFINDER. For a certain book, suppose the

standard fact contains r authors; TRUTHFINDER indicates

that there are n authors, among which : authors belong to

the standard fact. The accuracy of TRUTHFINDER is defined

as

:

maxðr.nÞ

.

4

Sometimes TRUTHFINDER provides partially correct

facts. For example, the standard set of authors for a book

is “Graeme C. Simsion and Graham Witt,” and the authors

found by TRUTHFINDER may be “Graeme Simsion and

G. Witt.” We consider “Graeme Simsion” and “G. Witt” as

partial matches for “Graeme C. Simsion” and “Graham

Witt” and give them partial scores. We assign different

weights to different parts of persons’ names. Each author

name has a total weight of 1, and the ratio between weights

of the last name, first name, and middle name is 3:2:1. For

example, “Graeme Simsion” will get a partial score of 5/6

because it omits the middle name of “Graeme C. Simsion.”

If the standard name has a full first or middle name and

TRUTHFINDER provides the correct initial, we give TRUTH-

FINDER a half score. For example, “G. Witt” will get a score

of 4/5 with respect to “Graham Witt,” because the first

name has weight 2/5, and the first initial “G.” gets half of

the score.

The implication between two sets of authors 1

1

and 1

2

is

defined in a very similar way as the accuracy of 1

2

with

respect to 1

1

. Because many bookstores provide incomplete

facts (e.g., only the first author), if 1

2

contains authors that are

not in 1

1

, the implication from 1

1

to 1

2

will not be decreased.

If 1

1

has r authors, 1

2

has n authors, and there are : shared

ones, then iijð1

1

!1

2

Þ ¼ :´r À/o:c :ii, where /o:c :ii

is the threshold for positive implication and is set to 0.5.

Fig. 6 shows the accuracies of TRUTHFINDER and

VOTING. One can see that TRUTHFINDER is significantly

more accurate than VOTING even at the first iteration, where

all bookstores have the same trustworthiness. This is

because TRUTHFINDER considers the implications between

different facts about the same object, while VOTING does

not. As TRUTHFINDER repeatedly computes the trustworthi-

ness of bookstores and the confidence of facts, its accuracy

increases to about 95 percent after the third iteration and

remains stable. It takes TRUTHFINDER 8.73 seconds to

precompute the implications between related facts and

4.43 seconds to finish the four iterations. VOTING takes

1.22 seconds.

Fig. 7 shows the relative change of the trustworthiness

vector after each iteration. The change is defined as one

minus the cosine similarity of the old and the new

vectors. We can see that the relative change decreases by

about one order after each iteration, showing that

TRUTHFINDER converges at a steady speed. After the

fourth iteration, the stop criterion is met, which is

indicated by the horizontal axis.

In Table 3, we manually compare the results of VOTING

and TRUTHFINDER and the authors provided by Barnes &

Noble on its website. We list the number of books in which

each approach makes each type of errors. Please notice that

one approach may make multiple errors for one book.

VOTING tends to miss authors because many bookstores

only provide subsets of all authors. On the other hand,

TRUTHFINDER tends to consider facts with more authors as

802 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008

Fig. 6. Accuracies of TRUTHFINDER and VOTING.

Fig. 7. Relative changes of TRUTHFINDER.

4. For simplicity, we do not consider the order of authors in this study,

although TRUTHFINDER can report the authors in correct order in most cases.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

correct facts because of our definition of implication for book

authors andthus makes more mistakes of adding in incorrect

names. One may think that the largest bookstores will

provide accurate information, which is surprisingly untrue

according to our experiment. Table 3 shows that Barnes &

Noble has more errors than VOTING and TRUTHFINDER on

these 100 randomly selected books. It has redundant names

for many books. For example, for a book written by “Robert J.

Muller,” it says the authors are “Robert J. Muller; Muller.”

Actually Barnes &Noble is less accurate than TRUTHFINDER

even if we do not consider such errors.

We give an example mistake made by TRUTHFINDER.

The book “Server Storage Technologies for Windows 2000,

Windows Server 2003, and Beyond” is written by Dilip C.

Naik. There are nine bookstores saying that its authors are

“Dilip C. Naik and Dilip Naik,” seven saying “Dilip C.

Naik,” and five saying “Dilip Naik.” TRUTHFINDER infers

that its authors are “Dilip C. Naik and Dilip Naik,” which

contains redundant names. However, Amazon makes the

same mistake as TRUTHFINDER, which may explain why

the mistake is propagated to so many bookstores.

Finally, we perform an interesting experiment on finding

trustworthy websites. It is well known that Google (or other

search engines) is good at finding authoritative websites.

However, do these websites provide accurate information?

To answer this question, we compare the online bookstores

that are given highest ranks by Google with the bookstores

with highest trustworthiness found by TRUTHFINDER. We

query Google with “bookstore”

5

and find all bookstores that

exist in our data set from the top 300 Google results.

6

The

accuracy of each bookstore is tested on the 100 randomly

selected books in the same way as we test the accuracy of

TRUTHFINDER. We only consider bookstores that provide

at least 10 of the 100 books.

The results are shown in Table 4. TRUTHFINDER can find

bookstores that provide much more accurate information

than the top bookstores found by Google. TRUTHFINDER

also finds some large trustworthy bookstores such as

A1 Books, which provides 86 of 100 books with an accuracy

of 0.878. Please notice that TRUTHFINDER uses no training

data, and the testing data is manually created by reading

the authors’ names from book covers. Therefore, we believe

that the results suggest that there may be better alternatives

than Google for finding accurate information on the Web.

4.2.2 Movie Runtime

The second real data set contains runtimes of movies

provided by many websites. It contains 603 movies, which

are the movies with top ratings in every genre on

IMDB.com.

7

We collect the runtime data of each movie

using Google. For example, for the movie Forrest Gump, we

query Google with

00

forrest gump

00

þruntime and parse the

result page that contains digests of the first 100 results. If we

see terms like “130 minutes” or “2h10m” appearing after

“runtime” (or “run time”), we consider such terms as the

runtime of this movie. We found 17,109 useful digests,

which contain information from 1,727 websites. On average,

each movie has 14.3 different runtimes provided by

different websites.

Because of the authority of IMDB, we consider the runtime

it provides as the standard facts (information from IMDB.

com is excluded from our data set). We randomly select

100 movies and find their runtimes on IMDB, in order to test

the accuracy of TRUTHFINDER. If the standard runtime for a

movie is r minutes and TRUTHFINDER infers that its runtime

is n minutes, then the accuracy is defined as

jnÀrj

maxðr.nÞ

. If there

are two facts 1

1

and 1

2

about a movie, where 1

1

¼ n, and

1

2

¼ :, then the implication from 1

1

to 1

2

is defined as

iijð1

1

!1

2

Þ ¼

jnÀ:j

maxðn.:Þ

À/o:c :ii, where /o:c :ii is the

threshold for positive implication and is set to 0.75.

YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 803

TABLE 4

Comparison of the Accuracies of Top Bookstores

by TRUTHFINDER and by Google

TABLE 3

Comparison of the Results of VOTING, TRUTHFINDER,

and Barnes & Noble

5. This query was submitted on 7 February 2007.

6. Our data set does not contain the largest bookstores such as Barnes &

Noble and Amazon.com, because they do not list their books on

www.abebooks.com. However, we include Barnes & Noble here because

we have manually retrieved its authors for the 100 books for testing. 7. Please see http://imdb.com/chart/.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

Fig. 8 shows the accuracies of TRUTHFINDER and

VOTING on the movie runtime data set. TRUTHFINDER

achieves very high accuracy on this data set, reaching

99 percent after three iterations. In comparison, the error

rate of VOTING is more than twice that of TRUTHFINDER.

Fig. 9 shows the relative change after each iteration of the

trustworthiness vector of TRUTHFINDER. It can be seen that

TRUTHFINDER still converges at a steady speed, and it takes

five iterations to meet the stop criterion. It takes TRUTH-

FINDER 2.89 seconds for initialization and 3.19 seconds for

five iterations. VOTING takes 1.55 seconds.

In Table 5, we compare the numbers of errors of VOTING

and TRUTHFINDER in different cases. Again, we find that

TRUTHFINDER is much more accurate than VOTING.

In Table 6, we compare the most trustworthy websites by

TRUTHFINDER and top ranked movie websites by Google

(with query “movies”

8

). Again, we find that the trustworthy

websites found by TRUTHFINDER provide much more

accurate information than those ranked high by Google.

4.2.3 Parameter Sensitivity

There are two important parameters in the computation

website trustworthiness and fact confidence—j and · in (6)

and (8) in Section 3.1. j controls the degree of influence

between related facts (i.e., facts about the same object), and ·

determines the shape of the confidence curve in Fig. 3. By

default, j ¼ 0.5, and · ¼ 0.3. The following experiments

show that the accuracy of TRUTHFINDER is only very

slightly affected by these two parameters.

Fig. 10a shows the accuracy of TRUTHFINDER with

different js on the two data sets (with · ¼ 0.3). It can be

seen that the accuracy of TRUTHFINDER is very stable when

j ! 0.1. Fig. 10b shows the accuracy with different ·s on the

two data sets (with j ¼ 0.5). · has more influence on

TRUTHFINDER than j. However, the influence is still very

limited, and the accuracy only varies by about 0.5 percent

even when · is changed significantly.

We also investigate how the parameters influence the

convergence of the algorithm. Figs. 11a and 11b show the

relative changes after each iteration of TRUTHFINDER, with

different js and ·s, when TRUTHFINDER is applied on the

book author data set. (We do not show experiments on the

movie runtime data set because of limited space.) In the

figures, we can see that the algorithm always converges

rapidly, with a relative change of less than 10

À5

after five

iterations. TRUTHFINDER converges faster when j is

smaller, which means that the influence between different

facts is smaller. This is reasonable because interfact

influences add complexity to the problem and will probably

make it slower to converge. TRUTHFINDER also converges

faster when · is smaller, probably because a smaller ·

means a smaller influence of o

Ã

ð1Þ on :ð1Þ (there is no

influence when · ¼ 0).

4.3 Synthetic Data Sets

In this section, we test the scalability and noise resistance of

TRUTHFINDER on synthetic data sets. We generate data sets

containing ` websites and ` facts. There are `´5 objects

(i.e., each object has five facts on average), and there are

four websites providing each fact on average (i.e., 4` links

between websites and facts). The expected trustworthiness

of each website is

t, and the trustworthiness of each website

is drawn from a uniform distribution from maxð0. 2

t À1Þ to

minð2

t. 1Þ. For example, if

t ¼ 0.7, then the range for website

trustworthiness is [0.4, 1].

Each object has a true value, which is a random number

drawn from a uniform distribution on interval [1,000,

10,000]. The facts of each object are real numbers that do

not deviate too much from the true value. We want to

generate the facts by randomly generating some numbers

on an interval around the true value. However, we do not

want the true value to be at the middle of the interval,

which makes it very easy to find the true value by taking

the average of all facts. Suppose the value for object o is .ðoÞ.

We create a value range for o that is an interval of length

.ðoÞ´2. The value range is randomly placed so that the true

value .ðoÞ falls at any position in it with equal probability.

Then, the facts of o are drawn from a uniform distribution

804 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008

Fig. 8. Accuracies of TRUTHFINDER and VOTING.

Fig. 9. Relative changes of TRUTHFINDER.

8. This query was submitted on 7 February 2007.

TABLE 5

Comparison of the Results of VOTING and TRUTHFINDER

on Movie Runtime

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

on this value range. In the following experiments, TRUTH-

FINDER does not precompute implications between facts

(because they can be easily computed on the fly) and

performs five iterations.

We first study the time and space scalability of TRUTH-

FINDER with respect to the number of facts. The number of

websites is fixed at 1,000, and the number of facts varies from

5,000 to 500,000. The results are shown in Fig. 12. Its runtime

increases 118 times as the number of objects grows 100 times,

which is very close to being linearly scalable, and the minor

superlinear part should come from the usage of dynamically

growing data structures (e.g., vectors and hash tables). The

memory usage is also linearly scalable. In fact, it grows

sublinearly, possibly because of the fixed part of memory

usage.

Then, we study the scalability with respect to the number

of websites, as shown in Fig. 13. There are 10,000 objects

and 50,000 facts, and the number of websites varies from

100 to 10,000. It can be seen that the time and memory usage

almost remain unchanged when the number of websites

grows 100 times, which is consistent with our complexity

analysis.

Finally, we study how well TRUTHFINDER can do when

the expected trustworthiness varies from zero to one. We

compare the accuracies (as defined on the movie runtime

data set) of TRUTHFINDER and VOTING in Fig. 14a and the

YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 805

TABLE 6

Comparison of the Accuracies of Top Movie Websites

by TRUTHFINDER and by Google

Fig. 10. Accuracy of TRUTHFINDER with respect to j and ·. (a) Accuracy

with respect to j. (b) Accuracy with respect to ·.

Fig. 11. Relative changes after each iteration with different js and ·s.

(a) Convergence with j. (b) Convergence with ·.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

percentages of objects for which exactly correct facts are

found in Fig. 14b. When the expected trustworthiness is

zero, TRUTHFINDER and VOTING are not doing much

better than random. (It can be shown that even if the true

value and the fact found are two random numbers between

1,000 and 1,500, the accuracy is as high as 0.87.) However,

their accuracies grow sharply as the expected trustworthi-

ness increases. When the expected trustworthiness is 0.5,

their accuracy is more than 99.5 percent, and the percentage

of exact match for TRUTHFINDER is 98.2 percent, and that

for VOTING is 90.5 percent. This shows that TRUTHFINDER

can find the true value even if there are many untrust-

worthy websites.

5 RELATED WORK

The quality of information on the Web has always been a

major concern for Internet users [11]. There have been

studies on what factors of data quality are important for

users [13] and on machine learning approaches for

distinguishing high-quality and low-quality web pages

[9], where the quality is defined by human preference. It

is also shown that information quality measures can help

improve the effectiveness of Web search [15].

In 1998, two pieces of groundbreaking work, PageRank

[10] and Authority-Hub analysis [7], were proposed to

utilize the hyperlinks to find pages with high authorities.

These two approaches are very successful at identifying

important web pages that users are interested in, which is

also shown by a subsequent study [1]. In [3], the authors

propose a framework of link analysis and provide theore-

tical studies for many link-based approaches.

Unfortunately, the popularity of web pages does not

necessarily lead to accuracy of information. Two observa-

tions are made in our experiments: 1) even the most popular

website (e.g., Barnes & Noble) may contain many errors,

whereas some comparatively not-so-popular websites may

provide more accurate information, and 2) more accurate

information can be inferred by using many different

websites instead of relying on a single website.

TRUTHFINDER studies the interaction between websites

and the facts they provide and infers the trustworthiness of

websites and confidence of facts from each other. An

analogy can be made between this problem and Authority-

Hub analysis, by considering websites as hubs (both of

them indicate others’ authority weights) and facts as

authorities. However, these two problems are very differ-

ent, and Authority-Hub analysis cannot be applied to our

problem. In Authority-Hub analysis, a hub’s weight is

806 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008

Fig. 12. Scalability of TRUTHFINDER with respect to the number of facts.

(a) Time. (b) Memory.

Fig. 13. Scalability of TRUTHFINDER with respect to the number of

websites. (a) Time. (b) Memory.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

computed by summing up the weights of authorities linked

to it. This is unreasonable in computing the trustworthiness

of a website, because a trustworthy website should be one

that provides accurate facts instead of many of them, and a

website providing many inaccurate facts is an untrust-

worthy one. Moreover, the confidence of a fact is not simply

the sum of the trustworthiness of the websites providing it.

Instead, it needs to be computed using some nonlinear

transformations according to a probabilistic analysis.

Another difference between TRUTHFINDER and Author-

ity-Hub analysis is that TRUTHFINDER considers the

relationships (implications) between different facts and

uses such information in inferring the confidence of facts.

This is related to existing studies on inferring similarities

between objects using links. Collaborative filtering [4] infers

the similarity between objects based on their ratings to or

from other objects. There are also studies on link-based

similarity analysis [6], [14], which defines the similarity

between two objects as the average similarity between

objects linked to them. In [5], the authors propose an

approach that uses the trust or distrust relationships

between some users (e.g., user ratings on eBay.com) to

determine the trust relationship between each pair of users.

TRUTHFINDER uses iterative methods to compute the

website trustworthiness and fact confidence, which is

widely used in many link analysis approaches [5], [6], [7],

[10], [14]. The common feature of these approaches is that

they start from some initial state that is either random or

uninformative. Then, at each iteration, the approach will

improve the current state by propagating information

(weights, probability, trustworthiness, etc.) through the

links. This iterative procedure has been proven to be

successful in many applications, and thus, we adopt it in

TRUTHFINDER.

6 CONCLUSIONS

In this paper, we introduce and formulate the Veracity

problem, which aims at resolving conflicting facts from

multiple websites and finding the true facts among them.

We propose TRUTHFINDER, an approach that utilizes the

interdependency between website trustworthiness and fact

confidence to find trustable websites and true facts.

Experiments show that TRUTHFINDER achieves high

accuracy at finding true facts and at the same time identifies

websites that provide more accurate information.

ACKNOWLEDGMENTS

This work was supported in part by the US National Science

Foundation Grants IIS-05-13678/06-42771 and NSF BDI-05-

15813. Any opinion, findings, and conclusions or recom-

mendations expressed here are those of the authors and do

not necessarily reflect the views of the funding agencies.

REFERENCES

[1] B. Amento, L.G. Terveen, and W.C. Hill, “Does ‘Authority’ Mean

Quality? Predicting Expert Quality Ratings of Web Documents,”

Proc. ACM SIGIR ’00, July 2000.

[2] M. Blaze, J. Feigenbaum, and J. Lacy, “Decentralized Trust

Management,” Proc. IEEE Symp. Security and Privacy (ISSP ’96),

May 1996.

[3] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, “Link

Analysis Ranking: Algorithms, Theory, and Experiments,” ACM

Trans. Internet Technology, vol. 5, no. 1, pp. 231-297, 2005.

[4] J.S. Breese, D. Heckerman, and C. Kadie, “Empirical Analysis of

Predictive Algorithms for Collaborative Filtering,” technical

report, Microsoft Research, 1998.

[5] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins, “Propagation

of Trust and Distrust,” Proc. 13th Int’l Conf. World Wide Web

(WWW), 2004.

[6] G. Jeh and J. Widom, “SimRank: A Measure of Structural-Context

Similarity,” Proc. ACM SIGKDD ’02, July 2002.

[7] J.M. Kleinberg, “Authoritative Sources in a Hyperlinked Environ-

ment,” J. ACM, vol. 46, no. 5, pp. 604-632, 1999.

[8] Logistical Equation from Wolfram MathWorld, http://

mathworld.wolfram.com/LogisticEquation.html, 2008.

[9] T. Mandl, “Implementation and Evaluation of a Quality-Based

Search Engine,” Proc. 17th ACM Conf. Hypertext and Hypermedia,

Aug. 2006.

[10] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank

Citation Ranking: Bringing Order to the Web,” technical report,

Stanford Digital Library Technologies Project, 1998.

[11] Princeton Survey Research Associates International, “Leap of

faith: Using the Internet Despite the Dangers,” Results of a Nat’l

Survey of Internet Users for Consumer Reports WebWatch, Oct. 2005.

[12] Sigmoid Function from Wolfram MathWorld, http://mathworld.

wolfram.com/SigmoidFunction.html, 2008.

[13] R.Y. Wang and D.M. Strong, “Beyond Accuracy: What Data

Quality Means to Data Consumers,” J. Management Information

Systems, vol. 12, no. 4, pp. 5-34, 1997.

YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 807

Fig. 14. Noise resistance of TRUTHFINDER and VOTING with respect to

website trustworthiness. (a) Accuracy. (b) Percentage of exact match.

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

[14] X. Yin, J. Han, and P.S. Yu, “LinkClus: Efficient Clustering via

Heterogeneous Semantic Links,” Proc. 32nd Int’l Conf. Very Large

Data Bases (VLDB ’06), Sept. 2006.

[15] X. Zhu and S. Gauch, “Incorporating Quality Metrics in

Centralized/Distributed Information Retrieval on the World Wide

Web,” Proc. ACM SIGIR ’00, July 2000.

Xiaoxin Yin received the BE degree from

Tsinghua University in 2001 and the MS and

PhD degrees in computer science from the

University of Illinois, Urbana-Champaign, in

2003 and 2007, respectively. He is a researcher

at the Internet Services Research Center,

Microsoft Research. He has been working in

the area of data mining since 2001, and his

research work is focused on multirelational data

mining and link analysis, and their applications

on the World Wide Web. He has published 14 papers in refereed

conference proceedings and journals.

Jiawei Han is a professor in the Department of

Computer Science, University of Illinois, Urbana-

Champaign. He has been working on research

into data mining, data warehousing, stream data

mining, spatiotemporal and multimedia data

mining, biological data mining, information net-

work analysis, text and Web mining, and soft-

ware bug mining, with more than 350 conference

and journal publications. He has chaired or

served on many program committees of inter-

national conferences and workshops. He also served or is serving on the

editorial boards of Data Mining and Knowledge Discovery, the IEEE

Transactions on Knowledge and Data Engineering, the Journal of

Computer Science and Technology, and the Journal of Intelligent

Information Systems. He is currently serving as the founding editor in

chief of the ACM Transactions on Knowledge Discovery from Data and

on the board of directors for the executive committee of the ACM Special

Interest Group on Knowledge Discovery and Data Mining (SIGKDD). He

has received many awards and recognitions, including the ACM

SIGKDD Innovation Award in 2004 and the IEEE Computer Society

Technical Achievement Award in 2005. He is a senior member of the

IEEE and a fellow of the ACM.

Philip S. Yu received the BS degree in electrical

engineering from the National Taiwan Univer-

sity, the MS and PhD degrees in electrical

engineering from Stanford University, and the

MBA degree from New York University. He is a

professor in the Department of Computer

Science, University of Illinois, Chicago, and also

holds the Wexler chair in information and

technology. He was the manager of the Software

Tools and Techniques Group at the IBM T.J.

Watson Research Center. His research interests include data mining,

Internet applications and technologies, database systems, multimedia

systems, parallel and distributed processing, and performance model-

ing. He has published more than 500 papers in refereed journals and

conference proceedings. He holds or has applied for more than 300 US

patents. He is an associate editor of the ACM Transactions on the

Internet Technology and the ACM Transactions on Knowledge

Discovery from Data. He is on the steering committee of the IEEE

Conference on Data Mining and was a member of the IEEE Data

Engineering steering committee. He was the editor in chief of IEEE

Transactions on Knowledge and Data Engineering from 2001 to 2004,

an editor, an advisory board member, and also a guest coeditor of the

special issue on mining of databases. He had also served as an

associate editor of Knowledge and Information Systems. In addition to

serving as a program committee member on various conferences, he

was the program chair or cochair of the IEEE Workshop of Scalable

Stream Processing Systems (SSPS ’07), the IEEE Workshop on Mining

Evolving and Streaming Data (2006), the 2006 Joint Conferences of the

Eighth IEEE Conference on E-commerce Technology (CEC ’06) and the

Third IEEE Conference on Enterprise Computing, E-commerce and E-

services (EEE ’06), the 11th IEEE International Conference on Data

Engineering, the Sixth Pacific Area Conference on Knowledge Dis-

covery and Data Mining, the Ninth ACM Sigmod Workshop on Research

Issues in Data Mining and Knowledge Discovery, the Second IEEE

International Workshop on Research Issues on Data Engineering:

Transaction and Query Processing, the PAKDD Workshop on Knowl-

edge Discovery from Advanced Databases, and the Second IEEE

International Workshop on Advanced Issues of E-commerce and Web-

Based Information Systems. He served as the general chair or cochair

of the 2006 ACM Conference on Information and Knowledge Manage-

ment, the 14th IEEE International Conference on Data Engineering, and

the Second IEEE International Conference on Data Mining. He has

received several IBM honors, including two IBM Outstanding Innovation

Awards, an Outstanding Technical Achievement Award, two Research

Division Awards, and the 93rd Plateau of Invention Achievement

Awards. He was an IBM Master Inventor. He received a Research

Contributions Award from the IEEE International Conference on Data

Mining in 2003 and also an IEEE Region 1 Award for “promoting and

perpetuating numerous new electrical engineering concepts” in 1999.

He is a fellow of the ACM and the IEEE.

> For more information on this or any other computing topic,

please visit our Digital Library at www.computer.org/publications/dlib.

808 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008

Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

For example.. different facts influence each other.YIN ET AL. if a website says that a book is written by “Jessamyn Wendell” and another says “Jessamyn Burns Wendell. We discuss related work in Section 5 and conclude this study in Section 6. Different facts about the same object may be conflicting. A website is trustworthy if most facts it provides are true. There are also many websites.2 A fact is likely to be true if it is provided by trustworthy websites (especially if by many of them). Our experiments show that TRUTHFINDER achieves very high accuracy in discovering true facts. Second and more importantly. It is required that impðf1 ! f2 Þ is a value between À1 and 1. We also require that the facts can be parsed from the web pages. and it can select better trustworthy websites than authority-based search engines such as Google. we formulate the Veracity problem about how to discover true facts from conflicting information. 2. authors of books). 2 PROBLEM DEFINITIONS In this paper. Experimental results are presented in Section 4. we propose the concept of implication between facts.g.” whereas another claims that it is written by “Jessamyn Wendell. This iterative procedure is rather different from Authority-Hub analysis [7].g. In summary. a domain refers to a property of a certain type of objects. 2009 at 12:50 from IEEE Xplore. we propose an algorithm called TRUTHFINDER for identifying true facts using iterative methods. is f1 ’s influence on f2 ’s confidence. For example. There are usually multiple conflicting facts from different websites for each object. The “trustworthiness” in this paper means accuracy in providing information. a website providing 10.7 is much less trustworthy than a website providing 100 facts with an accuracy of 0. weights of laptop computers) or relationships between two objects (e. Second. we study the problem of finding true facts in a certain domain. such as different sets of authors for a book.” then these two websites actually support each other although they provide slightly different facts. we choose an iterative computational method. the confidence of facts and the trustworthiness of websites. Because of this interdependency between facts and websites. sometimes facts may be supportive to each other although they are slightly different.95. Definition 2 (Trustworthiness of websites). such as authors of books or number of pixels of camcorders. by defining the trustworthiness of websites. In order to represent such relationships. Fig. 1 shows a miniexample data set. objects (e. For example. It is different from the “trustworthiness” in the studies of trust management [2]. how much f2 ’s confidence should be increased (or decreased) according to f1 ’s confidence.000 facts with an average accuracy of 0. Downloaded on January 9. 2. The confidence of a fact f (denoted by sðfÞ) is the probability of f being correct. and the goal of TRUTHFINDER is to identify the true fact among them.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 797 TABLE 1 Conflicting Information about Book Authors Fig. The trustworthiness of a website does not depend on how many facts it provides but on the accuracy of those facts. Thus.1 Basic Definitions We first introduce the two most important definitions in this paper. we propose a framework to solve this problem. and influences between facts. i. impðf1 ! f2 Þ. Restrictions apply. Input of TRUTHFINDER. Definition 1 (Confidence of facts). There are often conflicting facts on the Web. and another one says 10 cm. Each website provides at most one fact for an object.e. confidence of facts. If one of such facts is true. Instead. The trustworthiness of a website w (denoted by tðwÞ) is the expected confidence of the facts provided by w. The first difference is in the definitions.” and another one claims “J. Here. At each iteration. we make three major distributions in this paper. one website claims the author to be “Jennifer Widom..” or one website says that a certain camera is 4 inches long. we have to resort to probabilistic computation. Widom. First. the probabilities of facts being true and the trustworthiness of websites are inferred from each other. The rest of the paper is organized as follows: We describe the problem in Section 2 and propose the computational model and algorithms in Section 3. which contains five facts about two objects provided by four websites. We incorporate such influences between facts into our computational model.. some of which are more trustworthy than others.” However. Authorized licensed use limited to: University of Illinois. The implication from fact f1 to f2 . 1. the other is also likely to be true. For example. we cannot compute the trustworthiness of a website by adding up the weights of its facts as in [7]. . Finally. one website claims that a book is written by “Karen Holtzblatt. The input of TRUTHFINDER is a large number of facts in a domain that are provided by many websites. nor can we compute the probability of a fact being true by adding up the trustworthiness of websites providing it. according to the best of our knowledge.

If f2 is correct.1.2. Thus. From this example. f2 Þ is the similarity between f1 and f2 . JUNE 2008 A positive value indicates that if f1 is correct. In this paper. 3 COMPUTATIONAL MODEL 2. the relationship between two facts is symmetric and the definition of similarity is available. 2009 at 12:50 from IEEE Xplore. ð1Þ where F ðwÞ is the set of facts provided by w. websites tend to provide incomplete facts (e. Based on the above heuristics. If in a domain. the trustworthiness of a website is just the expected confidence of facts it provides.1 Basic Inference As defined in Definition 2. and. it is likely to be true. Widom. Suppose two websites provide author information for the same book. The false facts on different websites are less likely to be the same or similar. f2 is likely to be wrong. The inference of website trustworthiness is rather simple. The first website indicates that the author of the book is “Jennifer Widom. Although false facts can be propagated among websites. where simðf1 . in some domains (e. it is unlikely to be true. We believe that a website has some consistency in the quality of its information in a certain domain. in general.1. 20. On the other hand. a website that provides mostly true facts for many objects will likely provide true facts for other objects. Different websites that provide this true fact may present it in either the same or slightly different ways. Table 2 shows the variables and parameters used in the following discussion. and base sim is a threshold for similarity. For example.798 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING.1 Website Trustworthiness and Fact Confidence We first discuss how to infer website trustworthiness and fact confidence from each other. In this section. we compute its trustworthiness tðwÞ by calculating the average confidence of facts provided by w: P tðwÞ ¼ f2F ðwÞ sðfÞ jF ðwÞj . NO. If we are confident about f1 . and thus. VOL. Heuristic 2.” which is fact f1 . we introduce the model of iterative computation. and impðf1 ! f2 Þ should be high. Because true facts are more consistent than false facts (Heuristic 3). We start from the simplest case and proceed to more complicated ones step by step. he or she should provide the definition of implication between facts. f2 Þ À base sim. Please notice that the definition of implication is domain specific. When a user uses TRUTHFINDER on a certain domain. we assume that there is only one true fact for a property of an object.g. While a negative value means that if f1 is correct. the false facts about a certain object are much less consistent than the true facts.” Heuristic 3. we know that it is very common for a website to provide only one of the authors for a book. For website w. The case of multiple true facts will be studied in our future work. Different websites often make different mistakes for the same object and thus provide different false facts. we have four basic heuristics that serve as the base of our computational model. f1 may only tell us that “Jennifer Widom” is one author of the book instead of the sole author. a website is trustworthy if it provides facts with high confidence. On the other hand. 3. the user can define impðf1 ! f2 Þ ¼ simðf1 .2 Basic Heuristics Based on common sense and our observations on real data. such as “Jennifer Widom” versus “J. first author of a book). Authorized licensed use limited to: University of Illinois. We can see that the website trustworthiness and fact confidence are determined by each other. then f1 is incomplete and will have low confidence. book authors). if a fact is conflicting with the facts provided by many trustworthy websites. we can see that implication is an asymmetric relationship. Heuristic 4. The second website says that there are two authors “Jennifer Widom and Stefano Ceri. . 3. We define implication instead of similarity between facts because such relationship is asymmetric. Heuristic 1. we know that if a fact is provided by many trustworthy websites. Restrictions apply.. The details about this will be described in Section 3.. and we can use an iterative method to compute both. we should also be confident about f2 because f2 is consistent with f1 .g.” which is fact f2 . whereas that of fact confidence is more complicated. The implication for book authors should be very different from that for the number of pixels of camcorders. This true fact appears to be the same or similar on different websites. it is likely that we can find and distinguish true facts from false ones at the end. f2 is likely to be correct. Downloaded on January 9. 6. TABLE 2 Variables and Parameters of TRUTHFINDER There are trustworthy websites such as wikipedia and untrustworthy websites such as blogs and some small websites. impðf2 ! f1 Þ is low. Usually there is only one true fact for a property of an object. In a certain domain.

There are still two problems with our model. which is the sum of the trustworthiness scores of websites providing f2 . Similarly. If f2 is provided by many trustworthy websites.. ð2Þ 3. it is much more difficult to estimate the confidence of a fact. then both w1 and w2 are wrong. However.1.6 (which is quite low). the probability that both of them are wrong is ð1 À tðw1 ÞÞ Á ð1 À tðw2 ÞÞ.YIN ET AL. which is used to compute ðfÞ in TRUTHFINDER. Suppose in Fig. Please notice that impðf 0 ! fÞ < 0 when f is conflicting with f 0 . and these facts influence each other. In general.) Thus. then f1 is also somehow supported by these websites. With a similar proof as in Lemma 1. According to (2). 1 À sðfÞ ¼ Y 1À Y w2W ðfÞ ð1 À tðwÞÞ Á Y oðf 0 Þ¼oðfÞ 0 @ Y 1Áimpðf 0 !fÞ ð1 À tðw0 ÞÞA : ð1 À tðwÞÞ: w0 2W ðf 0 Þ w2W ðfÞ Authorized licensed use limited to: University of Illinois. we define the confidence score of a fact as ðfÞ ¼ À lnð1 À sðfÞÞ: ð4Þ A very useful property is that the confidence score of a fact f is just the sum of the trustworthiness scores of websites providing f. and we will compensate for it later. According to the definitions above. Computing confidence of a fact. (This is not true in many cases. 2009 at 12:50 from IEEE Xplore. 2).e. and f1 is the only fact about object o1 (i. The second problem with our model is that the confidence of a fact f can easily be negative if f is conflicting with some 3. and multiplying many of them may lead to underflow. we should increase the confidence score of f1 according to the confidence score of f2 . the confidence of a fact f1 is determined by the websites providing it and other facts about the same object. f2 does not exist in Fig. they are very similar). actually. In comparison. We can compute the confidence of f based on Ã ðfÞ in the same way as computing it based on ðfÞ (defined in (4)). Lemma 1. if f1 is wrong. if a fact f is provided by five websites with a trustworthiness of 0. we use a logarithm and define the trustworthiness score of a website as ðwÞ ¼ À lnð1 À tðwÞÞ: ð3Þ ðfÞ : ð7Þ ðwÞ is between and zero and þ1. However. 2. . We use sÃ ðfÞ to represent this confidence:3 sÃ ðfÞ ¼ 1 À eÀ Ã where W ðfÞ is the set of websites providing f. Let us first analyze the simple case where there is no related fact. We define the adjusted confidence score of a fact f as X Ã ðfÞ ¼ ðfÞ þ Á ðf 0 Þ Á impðf 0 ! fÞ: ð6Þ oðf 0 Þ¼oðfÞ w2W ðfÞ is a parameter between zero and one.99. In order to facilitate computation and veracity exploration. we can show that (7) is equivalent to sÃ ðfÞ ¼ Proof. We can see that Ã ðfÞ is the sum of the confidence scores of f. 2 that the implication from f2 to f1 is very high (e. there are usually many different facts about an object (such as f1 and f2 in Fig. This is shown in the following lemma. 2). ðfÞ ¼ X w2W ðfÞ ðwÞ: ð5Þ 3. Therefore. Because f1 is provided by w1 and w2 . and a larger ðwÞ indicates higher trustworthiness. This assumption is often incorrect because errors can be propagated between websites. where 0 < < 1. Downloaded on January 9. we add a dampening factor into (7) Ã and redefine fact confidence as sÃ ðfÞ ¼ 1 À eÀ Á ðfÞ .2 Influences between Facts The above discussion shows how to compute the confidence of a fact that is the only fact about an object. which controls the influence of related facts.3 Handling Additional Subtlety Until now.g.. In (2).1. 2. and we have X lnð1 À sðfÞÞ ¼ lnð1 À tðwÞÞ w2W ðfÞ X ðwÞ: ( ) ðfÞ ¼ w2W ðfÞ u t Fig. then its confidence sðfÞ can be computed as sðfÞ ¼ 1 À Y ð1 À tðwÞÞ. f will have a confidence of 0. and f1 should have reasonably high confidence. if a fact f is the only fact about an object. Restrictions apply. we have described how to compute fact confidence from website trustworthiness. 1 À tðwÞ is usually quite small.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 799 Take the logarithm on both sides. and the probability that f1 is not wrong is 1 À ð1 À tðw1 ÞÞ Á ð1 À tðw2 ÞÞ. In order to compensate for the problem of overly high confidence. and a portion of the confidence score of each related fact f 0 multiplies the implication from f 0 to f. some of the websites may copy contents from others. We first assume that w1 and w2 are independent. The first problem is that we have been assuming that different websites are independent of each other. which are discussed below. As shown in Fig.

It involves nonlinear transformations and thus cannot be computed using eigenvector computation as in [7]. Bji ¼ Á impðfk ! fj Þ. by setting ( . sðfÞ is Ã 1 very close to sÃ ðfÞ because 1þeÀ ÁÃ ðfÞ % 1 À eÀ Á ðfÞ . When Á Ã ðfÞ is significantly less than zero. and ! s and ! can be converted using (8). 3. s 1 N ! Ã Ã Ã ¼ ð ðf1 Þ. if we set sÃ ðfÞ ¼ 0 if it is negative according to (7). 2009 at 12:50 from IEEE Xplore. 3. Let the vectors be Authorized licensed use limited to: University of Illinois. . which is not reasonable because it is always possible that the fact is true even with negative evidence. This is unreasonable because 1) the confidence cannot be negative and 2) even with negative evidences. . which makes Ã ðfÞ < 0 and sÃ ðfÞ < 0. which is a variant of (7). otherwise: In comparison. we use vectors to represent the trustworthiness of all websites and the confidence of all facts. . ! t ¼ ðtðw1 Þ. s ! Ã ¼ B!: ð9Þ facts provided by trustworthy websites. To facilitate our discussions. which is consistent with the real situation. From (6) and Lemma 1. . Restrictions apply. We compare these two definitions in Fig. . . matrix B involves more factors because facts about the same object influence each other. ! ¼ ðsðf Þ. Moreover. 3. Please notice that (8) is also very similar to the Sigmoid function [12]. ! t ¼ A!. which has been successfully used in various models in many fields. 6. Matrix A can be easily defined according to (1). NO. otherwise: ð11Þ Both A and B are sparse matrices. . 1 M Fig. That is why TRUTHFINDER requires iterative computation to achieve convergence. ! ¼ ððw Þ. . the website trustworthiness and fact confidence can be computed conveniently with matrix operations. this “chunking” operation and the multiple zero values may lead to unstable conditions in our iterative computation. . 20. In comparison. sðf ÞÞT . On the other hand. The above procedure is very different from AuthorityHub analysis proposed by Kleinberg [7]. which can be implemented easily and performed efficiently. Therefore. Two methods for computing confidence. . VOL. i. we can infer that 8 if wi provides fj . 4. 4. JUNE 2008 Fig. we adopt the widely used Logistic function [8]. . ðfN ÞÞT : We want to define a M Â N matrix A for inferring website trustworthiness from fact confidence and a N Â M matrix B for the reverse inference. . Computing website trustworthiness and fact confidence with matrix operations. . so its confidence should still be above zero. . : 0. sðfÞ ¼ 1 1þ eÀ ÁÃ ðfÞ : ð8Þ When Á Ã ðfÞ is significantly greater than zero. which cannot be defined as simple summations because the probability often needs to be computed in nonlinear ways. . sÃ ðfÞ decreases very sharply when Ã ðfÞ < 1. TRUTHFINDER studies the probabilities of websites being correct and facts being true. Downloaded on January 9. tðwM ÞÞT . It will be more convenient if we can convert the computational procedure into some basic matrix operations.e. if fj 2 F ðwi Þ. sðfÞ is close to zero but remains positive. ! t and ! can be converted from each other using (3). ð10Þ Aij ¼ 0. if wi provides fk and oðfk Þ ¼ oðfj Þ. ðw ÞÞT . 1 jF ðwi Þj. . there is still a chance that f is correct. < 1. as the final definition for fact confidence.800 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. as shown in Fig. In general. .2 Computing Website Trustworthiness and Fact Confidence with Matrix Operations We have described how to calculate website trustworthiness and fact confidence. sðfÞ decreases slowly and is slightly above zero when Ã ðfÞ ( 0. Authority-Hub analysis defines the authority scores and hub scores as the sum of each other. One can see that the two curves are very close when Ã ðfÞ > 3.

each fact has k À 1 related facts on the average. it takes OðLÞ time to propagate between website trustworthiness and fact confidence and OðkNÞ time to adjust fact confidence according to the interfact implication. At each iteration. each entry of B involves only one website and one fact. In both cases. it has very little information about the websites and the facts.1. we compare TRUTHFINDER with a baseline approach as described below.4 Complexity Analysis In this section. As will be explained in Section 5. and the total space requirement is OðL þ kNÞ. If t only changes a little after an iteration (measured by cosine similarity between the old and the new ! t ). Suppose there are L links between all websites and facts.) From the website trustworthiness TRUTHFINDER can infer the confidence of facts.2. They are calculated once and used at every iteration. 4. as defined in Section 3. Let us first look at the two matrices A and B. (t0 should be set to the estimated average trustworthiness. we compare it with a baseline approach called VOTING. 2009 at 12:50 from IEEE Xplore. The maximum difference between two iterations Authorized licensed use limited to: University of Illinois. the overall time complexity is OðIL þ IkNÞ. and the second one contains the runtime of many movies provided by many websites.3 Iterative Computation As described above. If the implication between two facts can be computed in a very short constant time and we do not precompute the implication. The matrices are stored in sparse formats. TRUTHFINDER considers the implication between different facts from the first iteration and considers the different trustworthiness of different websites in the following iterations. This is the simplest approach and only uses the number of websites supporting each fact. there are OðkLÞ entries in B. and it takes OðLÞ time to compute A. Thus. Because different websites may provide the same fact. and thus.Net (C#). each iteration takes OðkLÞ time and no extra space. VOTING chooses the fact that is provided by most websites and resolves ties randomly. Initially. TRUTHFINDER first uses the website trustworthiness to compute the fact confidence and then recomputes the website trustworthiness from the fact confidence. In each step of the iterative procedure. and it takes OðkLÞ time to compute B. there is no existing approach to the problem studied in this paper. The two parameters in (8) are set as ¼ 0:5 and ¼ 0:3. B contains more entries than A because Bji is nonzero if website wi provides a fact that is related to fact fj . In comparison. Each link between a website and a fact corresponds to an entry in A. 3. If we precompute the implication between all facts. All approaches are implemented using Visual Studio. As in Authority-Hub analysis [7] and PageRank [10]. we can discard the matrix operations and compute the website trustworthiness and fact confidence using their definitions in Section 3. If in some cases. 3. then the total space requirement is OðL þ M þ NÞ. then TRUTHFINDER will stop. TRUTHFINDER takes OðIkLÞ time and OðkL þ M þ NÞ space. Thus. L should be greater than N (number of facts). Algorithm of TRUTHFINDER. if we start from a uniform fact confidence.1 Experiment Setting In order to show the effectiveness of TRUTHFINDER. OðkLÞ space is not available. and the computational cost of multiplying such a matrix and a vector is linear with the number of nonzero entries in the matrix. Restrictions apply. TRUTHFINDER stops iterating when it reaches a stable state. Downloaded on January 9.9. Because each website can provide at most one fact about each object. Thus. The first real data set contains the authors of many books provided by many online bookstores. 5. On the other hand. we also need to calculate the two matrices A and B. [10]. and it stops when the computation reaches a stable state. We choose the initial state in which all websites have uniform trustworthiness t0 .66-GHz dual-core processor with 1 Gbyte of memory running Windows XP Professional. we analyze the complexity of TRUTHFINDER. which are very meaningful because the facts supported by many websites are more likely to be correct. it still takes constant time to compute each entry of B. When trying to find the true fact for a certain object. Thus. The overall algorithm is presented in Fig.YIN ET AL. such as 0. The time cost of multiplying a sparse matrix and a vector is linear with the number of entries in the matrix. we cannot infer meaningful trustworthiness for websites. we can infer the website trustworthiness if we know the fact confidence and vice versa. TRUTHFINDER needs an initial state. Before the iterative computation.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 801 Fig. and thus. A has L entries. Suppose there are I iterations. Each step only requires two matrix operations and conversions between tðwÞ and ðwÞ and between sðfÞ and Ã ðfÞ. Therefore. . 5. TRUTHFINDER adopts an iterative method to compute the trustworthiness of websites and confidence of facts. then OðkNÞ space is needed to store these implication values. Suppose on the average there are k facts about each object. TRUTHFINDER tries to improve its knowledge about their trustworthiness and confidence. using a single thread. As in other iterative approaches [7]. All experiments are performed on an Intel PC with a 1. 4 EMPIRICAL STUDY We perform experiments on two real data sets and synthetic data sets to examine the accuracy and efficiency of TRUTHFINDER. The stableness is measured by how much the trustworthiness of websites changes between ! iterations.

4. which are indicated by different bookstores. Each author name has a total weight of 1.43 seconds to finish the four iterations. “Graeme Simsion” will get a partial score of 5/6 because it omits the middle name of “Graeme C.73 seconds to precompute the implications between related facts and 4. 4. the authors. We compare the set of authors found by TRUTHFINDER for each book with the standard fact to compute the accuracy of TRUTHFINDER. TRUTHFINDER indicates that there are y authors.4 Sometimes TRUTHFINDER provides partially correct facts. Fig. For simplicity. only the first author). After the fourth iteration. This is because TRUTHFINDER considers the implications between different facts about the same object.. On the average. 7.031 listings (i.265 books about computer science and engineering. We will show in our experiments that the performance of TRUTHFINDER is not sensitive to these parameters.22 seconds. We assign different weights to different parts of persons’ names. We list the number of books in which each approach makes each type of errors. we manually compare the results of VOTING and TRUTHFINDER and the authors provided by Barnes & Noble on its website.001 percent. and the ratio between weights of the last name. Simsion and Graham Witt. which returns the online bookstores that sell the book and the book information from each store. Relative changes of TRUTHFINDER. Witt. and middle name is 3:2:1. including the price. For example. example. suppose the standard fact contains x authors. 6 shows the accuracies of TRUTHFINDER and VOTING.” and the authors found by TRUTHFINDER may be “Graeme Simsion and G. f2 has y authors. 7 shows the relative change of the trustworthiness vector after each iteration. .2. Fig. which is indicated by the horizontal axis. Fig. while VOTING does not. On the other hand. One can see that TRUTHFINDER is significantly more accurate than VOTING even at the first iteration. The implication between two sets of authors f1 and f2 is defined in a very similar way as the accuracy of f2 with respect to f1 .com.1 Book Authors The first real data set contains the authors of many books provided by many online bookstores. or Prentice Hall. Witt” will get a score of 4/5 with respect to “Graham Witt. where base sim is the threshold for positive implication and is set to 0.e. showing that TRUTHFINDER converges at a steady speed.g. Simsion.5. where all bookstores have the same trustworthiness. among which z authors belong to the standard fact. NO. TRUTHFINDER performs iterative computation to find out the set of authors for each book. bookstore selling a book). VOTING takes 1. and there are z shared ones.. 2009 at 12:50 from IEEE Xplore. JUNE 2008 Fig. the stop criterion is met.” We consider “Graeme Simsion” and “G. we give TRUTHFINDER a half score. “G. we use its ISBN to search on www. we do not consider the order of authors in this study.4 different sets of authors.” gets half of the score. Please notice that one approach may make multiple errors for one book. Restrictions apply. although TRUTHFINDER can report the authors in correct order in most cases.” If the standard name has a full first or middle name and TRUTHFINDER provides the correct initial. In order to test its accuracy. Witt” as partial matches for “Graeme C. McGraw-Hill. In Table 3. Because many bookstores provide incomplete facts (e.yÞ . The results are presented below. the implication from f1 to f2 will not be decreased. each book has 5. TRUTHFINDER tends to consider facts with more authors as Authorized licensed use limited to: University of Illinois. 6. For each book. and the first initial “G. If f1 has x authors. Morgan Kaufmann. As TRUTHFINDER repeatedly computes the trustworthiness of bookstores and the confidence of facts. The accuracy of TRUTHFINDER is defined z as maxðx. It contains 1. For a certain book. Accuracies of TRUTHFINDER and VOTING. its accuracy increases to about 95 percent after the third iteration and remains stable. and a short description. The change is defined as one minus the cosine similarity of the old and the new vectors. VOL.2 Real Data Sets We test the accuracy and efficiency of TRUTHFINDER on two real data sets. is set to 0. It takes TRUTHFINDER 8. The data set contains 894 bookstores and 34. we randomly select 100 books and manually find out their authors. Downloaded on January 9. We find the image of each book and use the authors on the book cover as the standard fact. Simsion” and “Graham Witt” and give them partial scores. 6. if f2 contains authors that are not in f1 .802 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. the standard set of authors for a book is “Graeme C.” because the first name has weight 2/5. VOTING tends to miss authors because many bookstores only provide subsets of all authors. then impðf1 ! f2 Þ ¼ z=x À base sim.abebooks. We can see that the relative change decreases by about one order after each iteration. first name. 20. For example. which are published by Addison-Wesley. For 4.

3 different runtimes provided by different websites. we query Google with 00 forrest gump00 þ runtime and parse the result page that contains digests of the first 100 results. We randomly select 100 movies and find their runtimes on IMDB. It has redundant names for many books.” it says the authors are “Robert J.6 The accuracy of each bookstore is tested on the 100 randomly selected books in the same way as we test the accuracy of TRUTHFINDER.” TRUTHFINDER infers that its authors are “Dilip C. we include Barnes & Noble here because we have manually retrieved its authors for the 100 books for testing. . TRUTHFINDER also finds some large trustworthy bookstores such as A1 Books. For example. We found 17.75.109 useful digests. If there are two facts f1 and f2 about a movie. 6.7 We collect the runtime data of each movie using Google.727 websites. The book “Server Storage Technologies for Windows 2000. for a book written by “Robert J. Naik. However. Please see http://imdb. Restrictions apply. There are nine bookstores saying that its authors are “Dilip C. that the results suggest that there may be better alternatives than Google for finding accurate information on the Web. Please notice that TRUTHFINDER uses no training data. Authorized licensed use limited to: University of Illinois.2. Muller.yÞ . which are the movies with top ratings in every genre on IMDB. Amazon makes the same mistake as TRUTHFINDER. We give an example mistake made by TRUTHFINDER.zÞ À base sim.” seven saying “Dilip C. Naik. 2009 at 12:50 from IEEE Xplore. and f2 ¼ z. One may think that the largest bookstores will provide accurate information.YIN ET AL. TRUTHFINDER.” Actually Barnes & Noble is less accurate than TRUTHFINDER even if we do not consider such errors. Our data set does not contain the largest bookstores such as Barnes & Noble and Amazon. we compare the online bookstores that are given highest ranks by Google with the bookstores with highest trustworthiness found by TRUTHFINDER. Naik and Dilip Naik. we perform an interesting experiment on finding trustworthy websites. The results are shown in Table 4. 4.com.com/chart/. com is excluded from our data set). It is well known that Google (or other search engines) is good at finding authoritative websites.com. Because of the authority of IMDB.” and five saying “Dilip Naik. in order to test the accuracy of TRUTHFINDER. For example. Windows Server 2003. Downloaded on January 9. If we see terms like “130 minutes” or “2h10m” appearing after “runtime” (or “run time”).878.com.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 803 TABLE 3 Comparison of the Results of VOTING. which contain information from 1.abebooks. we believe 5. Finally. If the standard runtime for a movie is x minutes and TRUTHFINDER infers that its runtime jyÀxj is y minutes. then the accuracy is defined as maxðx. On average. which is surprisingly untrue according to our experiment. where base sim is the threshold for positive implication and is set to 0.2 Movie Runtime The second real data set contains runtimes of movies provided by many websites. then the implication from f1 to f2 is defined as jyÀzj impðf1 ! f2 Þ ¼ maxðy. We only consider bookstores that provide at least 10 of the 100 books. Table 3 shows that Barnes & Noble has more errors than VOTING and TRUTHFINDER on these 100 randomly selected books. we consider the runtime it provides as the standard facts (information from IMDB. each movie has 14. However. This query was submitted on 7 February 2007. 7. which provides 86 of 100 books with an accuracy of 0. and Barnes & Noble TABLE 4 Comparison of the Accuracies of Top Bookstores by TRUTHFINDER and by Google correct facts because of our definition of implication for book authors and thus makes more mistakes of adding in incorrect names. where f1 ¼ y. Muller. and Beyond” is written by Dilip C. We query Google with “bookstore”5 and find all bookstores that exist in our data set from the top 300 Google results. However. Naik and Dilip Naik. do these websites provide accurate information? To answer this question. It contains 603 movies. TRUTHFINDER can find bookstores that provide much more accurate information than the top bookstores found by Google. Muller. which may explain why the mistake is propagated to so many bookstores. Therefore. we consider such terms as the runtime of this movie. for the movie Forrest Gump. because they do not list their books on www. and the testing data is manually created by reading the authors’ names from book covers.” which contains redundant names.

Fig. the influence is still very limited. The following experiments show that the accuracy of TRUTHFINDER is only very slightly affected by these two parameters. and the accuracy only varies by about 0. controls the degree of influence between related facts (i. Again. and there are four websites providing each fact on average (i. 3. we find that TRUTHFINDER is much more accurate than VOTING. We generate data sets containing M websites and N facts. and it takes five iterations to meet the stop criterion. 20. 4. VOL. Again. Figs. Then. However. We also investigate how the parameters influence the convergence of the algorithm. probably because a smaller means a smaller influence of Ã ðfÞ on sðfÞ (there is no influence when ¼ 0). Downloaded on January 9.5 percent even when is changed significantly. and ¼ 0:3. 9. This is reasonable because interfact influences add complexity to the problem and will probably make it slower to converge. There are N=5 objects (i. Relative changes of TRUTHFINDER. In comparison. TRUTHFINDER also converges faster when is smaller. the facts of o are drawn from a uniform distribution TABLE 5 Comparison of the Results of VOTING and TRUTHFINDER on Movie Runtime Authorized licensed use limited to: University of Illinois. 4. has more influence on TRUTHFINDER than .3 Synthetic Data Sets In this section. TRUTHFINDER converges faster when is smaller. Restrictions apply. The value range is randomly placed so that the true value vðoÞ falls at any position in it with equal probability. if t ¼ 0:7. For example. TRUTHFINDER achieves very high accuracy on this data set. with different s and s. we compare the numbers of errors of VOTING and TRUTHFINDER in different cases. which means that the influence between different facts is smaller. . each object has five facts on average). which is a random number drawn from a uniform distribution on interval [1. when TRUTHFINDER is applied on the book author data set. 9 shows the relative change after each iteration of the trustworthiness vector of TRUTHFINDER. Suppose the value for object o is vðoÞ. we find that the trustworthy websites found by TRUTHFINDER provide much more accurate information than those ranked high by Google. the error rate of VOTING is more than twice that of TRUTHFINDER. ¼ 0:5. Fig. VOTING takes 1. It can be seen that the accuracy of TRUTHFINDER is very stable when ! 0:1. 10b shows the accuracy with different s on the two data sets (with ¼ 0:5). 2t À 1Þ to minð2t. 1Þ.1. with a relative change of less than 10À5 after five iterations. Accuracies of TRUTHFINDER and VOTING.4. However.e. This query was submitted on 7 February 2007. reaching 99 percent after three iterations. (We do not show experiments on the movie runtime data set because of limited space. we test the scalability and noise resistance of TRUTHFINDER on synthetic data sets. 1]. Each object has a true value. then the range for website trustworthiness is [0.e. 6. 10a shows the accuracy of TRUTHFINDER with different s on the two data sets (with ¼ 0:3). 10.19 seconds for five iterations. We create a value range for o that is an interval of length vðoÞ=2. Fig. which makes it very easy to find the true value by taking the average of all facts.2. Fig. and determines the shape of the confidence curve in Fig.3 Parameter Sensitivity There are two important parameters in the computation website trustworthiness and fact confidence— and in (6) and (8) in Section 3.. In Table 6. 4N links between websites and facts). In Table 5.) In the 8.. 11a and 11b show the relative changes after each iteration of TRUTHFINDER. The expected trustworthiness of each website is t.000. 8 shows the accuracies of TRUTHFINDER and VOTING on the movie runtime data set. 2009 at 12:50 from IEEE Xplore.89 seconds for initialization and 3. we do not want the true value to be at the middle of the interval. JUNE 2008 Fig. facts about the same object).e. we compare the most trustworthy websites by TRUTHFINDER and top ranked movie websites by Google (with query “movies”8).55 seconds. and the trustworthiness of each website is drawn from a uniform distribution from maxð0. We want to generate the facts by randomly generating some numbers on an interval around the true value. 8. It can be seen that TRUTHFINDER still converges at a steady speed. By default. Fig. It takes TRUTHFINDER 2. The facts of each object are real numbers that do not deviate too much from the true value. figures.804 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING.. we can see that the algorithm always converges rapidly. NO.000].

000 objects and 50.000 facts. In the following experiments. and the minor superlinear part should come from the usage of dynamically growing data structures (e.000. which is very close to being linearly scalable. Its runtime increases 118 times as the number of objects grows 100 times. (b) Convergence with . (a) Convergence with . 13.000. 11.. In fact. we study the scalability with respect to the number of websites.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 805 TABLE 6 Comparison of the Accuracies of Top Movie Websites by TRUTHFINDER and by Google Fig.g. The number of websites is fixed at 1. Relative changes after each iteration with different s and s. which is consistent with our complexity analysis. and the number of websites varies from 100 to 10. vectors and hash tables). (a) Accuracy with respect to .000 to 500. TRUTHFINDER does not precompute implications between facts (because they can be easily computed on the fly) and performs five iterations. Downloaded on January 9. Restrictions apply. It can be seen that the time and memory usage almost remain unchanged when the number of websites grows 100 times. on this value range. There are 10. . 12. We first study the time and space scalability of TRUTHFINDER with respect to the number of facts. possibly because of the fixed part of memory usage. and the number of facts varies from 5. (b) Accuracy with respect to .YIN ET AL. 10. we study how well TRUTHFINDER can do when the expected trustworthiness varies from zero to one. Fig. Accuracy of TRUTHFINDER with respect to and . We compare the accuracies (as defined on the movie runtime data set) of TRUTHFINDER and VOTING in Fig. Finally. Then. as shown in Fig. it grows sublinearly. The memory usage is also linearly scalable. The results are shown in Fig.000. 14a and the Authorized licensed use limited to: University of Illinois. 2009 at 12:50 from IEEE Xplore.

When the expected trustworthiness is zero. (a) Time. by considering websites as hubs (both of them indicate others’ authority weights) and facts as authorities. these two problems are very different. (b) Memory. There have been studies on what factors of data quality are important for users [13] and on machine learning approaches for distinguishing high-quality and low-quality web pages [9]. their accuracy is more than 99. Fig. In Authority-Hub analysis. a hub’s weight is Authorized licensed use limited to: University of Illinois. An analogy can be made between this problem and AuthorityHub analysis. Two observations are made in our experiments: 1) even the most popular website (e. Downloaded on January 9.) However.806 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. percentages of objects for which exactly correct facts are found in Fig. 14b. (It can be shown that even if the true value and the fact found are two random numbers between 1. Restrictions apply. whereas some comparatively not-so-popular websites may provide more accurate information. their accuracies grow sharply as the expected trustworthiness increases. It is also shown that information quality measures can help improve the effectiveness of Web search [15].500.5 percent.87. the accuracy is as high as 0. where the quality is defined by human preference.5 percent. two pieces of groundbreaking work. Barnes & Noble) may contain many errors. which is also shown by a subsequent study [1]. 5 RELATED WORK The quality of information on the Web has always been a major concern for Internet users [11]. VOL. 6. In 1998. NO. Scalability of TRUTHFINDER with respect to the number of facts. the popularity of web pages does not necessarily lead to accuracy of information.g. 12. 20. PageRank [10] and Authority-Hub analysis [7]. Unfortunately. (a) Time. TRUTHFINDER studies the interaction between websites and the facts they provide and infers the trustworthiness of websites and confidence of facts from each other. TRUTHFINDER and VOTING are not doing much better than random. and the percentage of exact match for TRUTHFINDER is 98.000 and 1. and 2) more accurate information can be inferred by using many different websites instead of relying on a single website. Scalability of TRUTHFINDER with respect to the number of websites.5. the authors propose a framework of link analysis and provide theoretical studies for many link-based approaches. and Authority-Hub analysis cannot be applied to our problem. 2009 at 12:50 from IEEE Xplore.2 percent. However. These two approaches are very successful at identifying important web pages that users are interested in. In [3]. 13. This shows that TRUTHFINDER can find the true value even if there are many untrustworthy websites. were proposed to utilize the hyperlinks to find pages with high authorities. and that for VOTING is 90. When the expected trustworthiness is 0. JUNE 2008 Fig. .. (b) Memory.

13th Int’l Conf. 2005. R. Microsoft Research. Experiments show that TRUTHFINDER achieves high accuracy at finding true facts and at the same time identifies websites that provide more accurate information.S. and J. Theory. Security and Privacy (ISSP ’96).M.” Results of a Nat’l Survey of Internet Users for Consumer Reports WebWatch. Another difference between TRUTHFINDER and Authority-Hub analysis is that TRUTHFINDER considers the relationships (implications) between different facts and uses such information in inferring the confidence of facts. “SimRank: A Measure of Structural-Context Similarity. “Link Analysis Ranking: Algorithms. G. Management Information Systems. Widom. Rosenthal. Heckerman. the approach will improve the current state by propagating information (weights. probability. etc.” Proc. J. Downloaded on January 9. and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the funding agencies. Any opinion.G. Logistical Equation from Wolfram MathWorld. 2004. Brin. we introduce and formulate the Veracity problem. IEEE Symp. and T. Mandl. Raghavan. Jeh and J. 5. vol.” J. Collaborative filtering [4] infers the similarity between objects based on their ratings to or from other objects. R. no.html. it needs to be computed using some nonlinear transformations according to a probabilistic analysis. vol. “Leap of faith: Using the Internet Despite the Dangers. 2008. Borodin. 5.com) to determine the trust relationship between each pair of users. no.com/SigmoidFunction. Kleinberg. Fig. an approach that utilizes the interdependency between website trustworthiness and fact confidence to find trustable websites and true facts. 231-297. “Authoritative Sources in a Hyperlinked Environment. Tomkins. (a) Accuracy. and Experiments. pp. 604-632. and P. World Wide Web (WWW). no. 1999. Wang and D. “Empirical Analysis of Predictive Algorithms for Collaborative Filtering.” ACM Trans. which defines the similarity between two objects as the average similarity between objects linked to them. wolfram. “Does ‘Authority’ Mean Quality? Predicting Expert Quality Ratings of Web Documents. 1998. and W. “Decentralized Trust Management. 6 CONCLUSIONS In this paper. Kumar.html.C. This is unreasonable in computing the trustworthiness of a website.” technical report. 4. 2009 at 12:50 from IEEE Xplore. ACM SIGIR ’00. There are also studies on link-based similarity analysis [6].M. Hill.” Proc.g. In [5]. TRUTHFINDER uses iterative methods to compute the website trustworthiness and fact confidence. Terveen. M. [7]. computed by summing up the weights of authorities linked to it. [6]. “Beyond Accuracy: What Data Quality Means to Data Consumers. user ratings on eBay. ACKNOWLEDGMENTS This work was supported in part by the US National Science Foundation Grants IIS-05-13678/06-42771 and NSF BDI-0515813. 17th ACM Conf. Tsaparas.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 807 widely used in many link analysis approaches [5]. Oct. and a website providing many inaccurate facts is an untrustworthy one. which aims at resolving conflicting facts from multiple websites and finding the true facts among them.) through the links. Kadie. A. Internet Technology. 12. http://mathworld. Guha.Y. S. which is Authorized licensed use limited to: University of Illinois. Feigenbaum.” Proc. Aug. July 2002. the authors propose an approach that uses the trust or distrust relationships between some users (e. Sigmoid Function from Wolfram MathWorld. Lacy. We propose TRUTHFINDER. Roberts. P. [14]. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] B. 5-34. Princeton Survey Research Associates International. Blaze.YIN ET AL.” Proc. J. http:// mathworld.wolfram. 2005. and A. 14. the confidence of a fact is not simply the sum of the trustworthiness of the websites providing it. Noise resistance of TRUTHFINDER and VOTING with respect to website trustworthiness.com/LogisticEquation. “The PageRank Citation Ranking: Bringing Order to the Web. J. pp. we adopt it in TRUTHFINDER. 1. “Implementation and Evaluation of a Quality-Based Search Engine. 2006. ACM.S. J. T. at each iteration. Motwani. Hypertext and Hypermedia. R. findings. pp. The common feature of these approaches is that they start from some initial state that is either random or uninformative. (b) Percentage of exact match. vol. This is related to existing studies on inferring similarities between objects using links. Then. 2008. Restrictions apply. Stanford Digital Library Technologies Project. Page. Breese. ACM SIGKDD ’02. “Propagation of Trust and Distrust. trustworthiness. and thus. Moreover. 1998.” Proc. Strong. G. July 2000.. [10]. L. Winograd. . L. R. Amento. 1997.” J.” technical report. [14]. 46. May 1996.O. D. This iterative procedure has been proven to be successful in many applications. and C. Instead. because a trustworthy website should be one that provides accurate facts instead of many of them.

the MS and PhD degrees in electrical engineering from Stanford University. Zhu and S. Yu. He is on the steering committee of the IEEE Conference on Data Mining and was a member of the IEEE Data Engineering steering committee.J. ACM SIGIR ’00. the Second IEEE International Workshop on Research Issues on Data Engineering: Transaction and Query Processing. He has received several IBM honors. with more than 350 conference and journal publications. “LinkClus: Efficient Clustering via Heterogeneous Semantic Links. and performance modeling. VOL. He is a fellow of the ACM and the IEEE. and software bug mining. Sept. He was the editor in chief of IEEE Transactions on Knowledge and Data Engineering from 2001 to 2004. the IEEE Workshop on Mining Evolving and Streaming Data (2006). Downloaded on January 9. July 2000. E-commerce and Eservices (EEE ’06). He has chaired or served on many program committees of international conferences and workshops. Gauch. biological data mining. Yu received the BS degree in electrical engineering from the National Taiwan University. respectively. He holds or has applied for more than 300 US patents. Xiaoxin Yin received the BE degree from Tsinghua University in 2001 and the MS and PhD degrees in computer science from the University of Illinois. 32nd Int’l Conf. and P. the Journal of Computer Science and Technology. 2006. he was the program chair or cochair of the IEEE Workshop of Scalable Stream Processing Systems (SSPS ’07). and also holds the Wexler chair in information and technology.org/publications/dlib. the PAKDD Workshop on Knowledge Discovery from Advanced Databases. He received a Research Contributions Award from the IEEE International Conference on Data Mining in 2003 and also an IEEE Region 1 Award for “promoting and perpetuating numerous new electrical engineering concepts” in 1999. including the ACM SIGKDD Innovation Award in 2004 and the IEEE Computer Society Technical Achievement Award in 2005. and the Second IEEE International Conference on Data Mining. and their applications on the World Wide Web. the Ninth ACM Sigmod Workshop on Research Issues in Data Mining and Knowledge Discovery. Authorized licensed use limited to: University of Illinois. an editor. 6. JUNE 2008 [14] X. and the Journal of Intelligent Information Systems. and the MBA degree from New York University. text and Web mining. and the Second IEEE International Workshop on Advanced Issues of E-commerce and WebBased Information Systems. NO. University of Illinois.” Proc. the Sixth Pacific Area Conference on Knowledge Discovery and Data Mining. including two IBM Outstanding Innovation Awards. He has been working on research into data mining. an advisory board member.S. University of Illinois. “Incorporating Quality Metrics in Centralized/Distributed Information Retrieval on the World Wide Web.” Proc. Han. His research interests include data mining. spatiotemporal and multimedia data mining. For more information on this or any other computing topic. He has published 14 papers in refereed conference proceedings and journals. Philip S. information network analysis. data warehousing. the 11th IEEE International Conference on Data Engineering. . Internet applications and technologies. He has been working in the area of data mining since 2001. 20. J. Watson Research Center. two Research Division Awards.computer. He was the manager of the Software Tools and Techniques Group at the IBM T. Chicago. He served as the general chair or cochair of the 2006 ACM Conference on Information and Knowledge Management. database systems. and the 93rd Plateau of Invention Achievement Awards. He is currently serving as the founding editor in chief of the ACM Transactions on Knowledge Discovery from Data and on the board of directors for the executive committee of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Yin. He is an associate editor of the ACM Transactions on the Internet Technology and the ACM Transactions on Knowledge Discovery from Data. stream data mining. He was an IBM Master Inventor. 2009 at 12:50 from IEEE Xplore. the 2006 Joint Conferences of the Eighth IEEE Conference on E-commerce Technology (CEC ’06) and the Third IEEE Conference on Enterprise Computing. Very Large Data Bases (VLDB ’06). please visit our Digital Library at www. He has published more than 500 papers in refereed journals and conference proceedings. He has received many awards and recognitions. and his research work is focused on multirelational data mining and link analysis. in 2003 and 2007. [15] X. multimedia systems. He had also served as an associate editor of Knowledge and Information Systems. He is a professor in the Department of Computer Science. In addition to serving as a program committee member on various conferences.808 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. He also served or is serving on the editorial boards of Data Mining and Knowledge Discovery. and also a guest coeditor of the special issue on mining of databases. parallel and distributed processing. Microsoft Research. Urbana-Champaign. He is a researcher at the Internet Services Research Center. . Jiawei Han is a professor in the Department of Computer Science. Restrictions apply. the 14th IEEE International Conference on Data Engineering. He is a senior member of the IEEE and a fellow of the ACM. UrbanaChampaign. the IEEE Transactions on Knowledge and Data Engineering. an Outstanding Technical Achievement Award.

- Human Action Recognition by Learning Bases of Action Attributes and PartsUploaded byzukun
- Word Search PaperUploaded bySitiNorhanumMuhdJani
- TablesUploaded bySekhar Chowdary
- 70539_06dUploaded byDan Farris
- baraff-witkin98Uploaded byRohit Tatu
- For Most Large Underdetermined Systems of Linear Equations the Minimal `1-norm Solution is also the Sparsest SolutionUploaded bygaroldf
- Matlab NotesUploaded byJyothi Prakash
- MatlabNotes (1)Uploaded byMohitRajput
- MolerEtAl_SparseMatricesinMATLAB_10.1.1.38Uploaded byalexandru_bratu_6
- Lecture 1 Basic MatlabUploaded byVinoth Kumar
- 75 Truth DiscoveryUploaded byHamza Mohd
- bcsoptUploaded bygorot1
- Activity 1.docxUploaded byDarryl Dave Ditucalan
- Efficient Enforcement of Hard Articulation Constraints in the Presence of Closed Loops and Contacts_commentsUploaded bydagush
- Matlab Help TextUploaded byAmine Bouaz
- Mb0040 Statistics for Management FinalUploaded byDip Konar
- new paper 2Uploaded byapi-242378778
- Hungarian Algorithm for Excel_VBAUploaded byYinghui Liu
- Young Londoners Survey 2009Uploaded byGilles85
- CI_pUploaded bysophie
- TCS Placement Paper 1.docUploaded byrohit sharma
- lifsci_v4_n34_2012_2Uploaded byDARWIN FABRICIO TORRES CHAVEZ
- Sampling Waters Flow and LoadUploaded byPauline Anne Wilson
- Writing Mesh & Nodal Equations Directly Matrix Form.docUploaded byjazbo8
- Prioritization MatricesUploaded bySteven Bonacorsi
- Project ProposalUploaded byRajat Goel
- 3.1 Sampling ConceptUploaded byshashankgulati30
- Mata33 Final 2013wUploaded byexamkiller
- 2018 ma1 wk 17 18 revision test weekUploaded byapi-314131718
- softeng_2015_4_40_55097.pdfUploaded byBojana Koteska