You are on page 1of 13

Truth Discovery with Multiple Conflicting

Information Providers on the Web
Xiaoxin Yin, Jiawei Han, Senior Member, IEEE, and Philip S. Yu, Fellow, IEEE
Abstract—The World Wide Web has become the most important information source for most of us. Unfortunately, there is no
guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a
subject, such as different specifications for the same product. In this paper, we propose a new problem, called Veracity, i.e., conformity
to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various
websites. We design a general framework for the Veracity problem and invent an algorithm, called TRUTHFINDER, which utilizes the
relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a
piece of information is likely to be true if it is provided by many trustworthy websites. An iterative method is used to infer the
trustworthiness of websites and the correctness of information from each other. Our experiments show that TRUTHFINDER
successfully finds true facts among conflicting information and identifies trustworthy websites better than the popular search engines.
Index Terms—Data quality, Web mining, link analysis.
Ç
1 INTRODUCTION
T
HE World Wide Web has become a necessary part of our
lives and might have become the most important
information source for most people. Everyday, people
retrieve all kinds of information from the Web. For example,
when shopping online, people find product specifications
from websites like Amazon.com or ShopZilla.com. When
looking for interesting DVDs, they get information and read
movie reviews on websites such as NetFlix.com or IMDB.
com. When they want to know the answer to a certain
question, they go to Ask.com or Google.com.
“Is the World Wide Web always trustable?” Unfortunately,
the answer is “no.” There is no guarantee for the correctness
of information on the Web. Even worse, different websites
often provide conflicting information, as shown in the
following examples.
Example 1 (Height of Mount Everest). Suppose a user is
interested in how high Mount Everest is and queries
Ask.com with “What is the height of Mount Everest?”
Among the top 20 results,
1
he or she will find the
following facts: four websites (including Ask.com itself)
say 29,035 feet, five websites say 29,028 feet, one says
29,002 feet, and another one says 29,017 feet. Which
answer should the user trust?
Example 2 (Authors of books). We tried to find out who
wrote the book Rapid Contextual Design (ISBN:
0123540518). We found many different sets of authors
from different online bookstores, and we show several of
them in Table 1. From the image of the book cover, we
found that A1 Books provides the most accurate
information. In comparison, the information from Po-
well’s books is incomplete, and that from Lakeside books is
incorrect.
The trustworthiness problem of the Web has been
realized by today’s Internet users. According to a survey
on the credibility of websites conducted by Princeton
Survey Research in 2005 [11], 54 percent of Internet users
trust news websites at least most of time, while this ratio is
only 26 percent for websites that offer products for sale and
is merely 12 percent for blogs.
There have been many studies on ranking web pages
according to authority (or popularity) based on hyperlinks.
The most influential studies are Authority-Hub analysis [7],
and PageRank [10], which lead to Google.com. However,
does authority lead to accuracy of information? The answer
is unfortunately no. Top-ranked websites are usually the
most popular ones. However, popularity does not mean
accuracy. For example, according to our experiments
(Section 4.2), the bookstores ranked on top by Google
(Barnes & Noble and Powell’s books) contain many errors on
book author information. In comparison, some small book-
stores (e.g., A1 Books) provide more accurate information.
In this paper, we propose a new problem called the
Veracity problem, which is formulated as follows: Given a
large amount of conflicting information about many
objects, which is provided by multiple websites (or other
types of information providers), how can we discover the
true fact about each object? We use the word “fact” to
represent something that is claimed as a fact by some
website, and such a fact can be either true or false. In this
paper, we only study the facts that are either properties of
796 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008
. X. Yin is with Microsoft Research, Microsoft Corporation, One Microsoft
Way, Redmond, WA 98052. E-mail: xyin@microsoft.com.
. J. Han is with the Department of Computer Science, University of Illinois
at Urbana-Champaign, 201 N. Goodwin Avenue, 2132, Urbana, IL 61801.
E-mail: hanj@cs.uiuc.edu.
. P.S. Yu is with the Department of Computer Science, University of Illinois
at Chicago, 851 South Morgan Street, Chicago, IL 60607.
E-mail: psyu@cs.uic.edu.
Manuscript received 17 July 2007; revised 20 Nov. 2007; accepted 6 Dec.
2007; published online 19 Dec. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-2007-07-0369.
Digital Object Identifier no. 10.1109/TKDE.2007.190745.
1. The query was sent on 9 February 2007.
1041-4347/08/$25.00 ß 2008 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
objects (e.g., weights of laptop computers) or relationships
between two objects (e.g., authors of books). We also
require that the facts can be parsed from the web pages.
There are often conflicting facts on the Web, such as
different sets of authors for a book. There are also many
websites, some of which are more trustworthy than
others.
2
A fact is likely to be true if it is provided by
trustworthy websites (especially if by many of them). A
website is trustworthy if most facts it provides are true.
Because of this interdependency between facts and
websites, we choose an iterative computational method.
At each iteration, the probabilities of facts being true and
the trustworthiness of websites are inferred from each
other. This iterative procedure is rather different from
Authority-Hub analysis [7]. The first difference is in the
definitions. The trustworthiness of a website does not
depend on how many facts it provides but on the accuracy
of those facts. For example, a website providing 10,000 facts
with an average accuracy of 0.7 is much less trustworthy
than a website providing 100 facts with an accuracy of 0.95.
Thus, we cannot compute the trustworthiness of a website
by adding up the weights of its facts as in [7], nor can we
compute the probability of a fact being true by adding up
the trustworthiness of websites providing it. Instead, we
have to resort to probabilistic computation. Second and
more importantly, different facts influence each other. For
example, if a website says that a book is written by
“Jessamyn Wendell” and another says “Jessamyn Burns
Wendell,” then these two websites actually support each
other although they provide slightly different facts. We
incorporate such influences between facts into our compu-
tational model.
In summary, we make three major distributions in this
paper. First, we formulate the Veracity problem about
how to discover true facts from conflicting information.
Second, we propose a framework to solve this problem,
by defining the trustworthiness of websites, confidence of
facts, and influences between facts. Finally, we propose
an algorithm called TRUTHFINDER for identifying true
facts using iterative methods. Our experiments show that
TRUTHFINDER achieves very high accuracy in discovering
true facts, and it can select better trustworthy websites
than authority-based search engines such as Google.
The rest of the paper is organized as follows: We describe
the problem in Section 2 and propose the computational
model and algorithms in Section 3. Experimental results are
presented in Section 4. We discuss related work in Section 5
and conclude this study in Section 6.
2 PROBLEM DEFINITIONS
In this paper, we study the problem of finding true facts in a
certain domain. Here, a domain refers to a property of a
certain type of objects, such as authors of books or number
of pixels of camcorders. The input of TRUTHFINDER is a
large number of facts in a domain that are provided by
many websites. There are usually multiple conflicting facts
from different websites for each object, and the goal of
TRUTHFINDER is to identify the true fact among them.
Fig. 1 shows a miniexample data set, which contains five
facts about two objects provided by four websites. Each
website provides at most one fact for an object.
2.1 Basic Definitions
We first introduce the two most important definitions in
this paper, the confidence of facts and the trustworthiness
of websites.
Definition 1 (Confidence of facts). The confidence of a fact 1
(denoted by :ð1Þ) is the probability of 1 being correct,
according to the best of our knowledge.
Definition 2 (Trustworthiness of websites). The trust-
worthiness of a website n (denoted by tðnÞ) is the expected
confidence of the facts provided by n.
Different facts about the same object may be conflicting.
For example, one website claims that a book is written by
“Karen Holtzblatt,” whereas another claims that it is written
by “Jessamyn Wendell.” However, sometimes facts may be
supportive to each other although they are slightly
different. For example, one website claims the author to
be “Jennifer Widom,” and another one claims “J. Widom,”
or one website says that a certain camera is 4 inches long,
and another one says 10 cm. If one of such facts is true, the
other is also likely to be true.
In order to represent such relationships, we propose
the concept of implication between facts. The implication
from fact 1
1
to 1
2
, iijð1
1
!1
2
Þ, is 1
1
’s influence on
1
2
’s confidence, i.e., how much 1
2
’s confidence should be
increased (or decreased) according to 1
1
’s confidence. It is
required that iijð1
1
!1
2
Þ is a value between À1 and 1.
YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 797
TABLE 1
Conflicting Information about Book Authors
2. The “trustworthiness” in this paper means accuracy in providing
information. It is different from the “trustworthiness” in the studies of trust
management [2].
Fig. 1. Input of TRUTHFINDER.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
A positive value indicates that if 1
1
is correct, 1
2
is likely
to be correct. While a negative value means that if 1
1
is
correct, 1
2
is likely to be wrong. The details about this
will be described in Section 3.1.2.
We define implication instead of similarity between facts
because such relationship is asymmetric. For example, in
some domains (e.g., book authors), websites tend to provide
incomplete facts (e.g., first author of a book). Suppose two
websites provide author information for the same book. The
first website indicates that the author of the book is
“Jennifer Widom,” which is fact 1
1
. The second website
says that there are two authors “Jennifer Widom and
Stefano Ceri,” which is fact 1
2
. If 1
2
is correct, then 1
1
is
incomplete and will have low confidence, and thus,
iijð1
2
!1
1
Þ is low. On the other hand, we know that it
is very common for a website to provide only one of the
authors for a book. Thus, 1
1
may only tell us that “Jennifer
Widom” is one author of the book instead of the sole author.
If we are confident about 1
1
, we should also be confident
about 1
2
because 1
2
is consistent with 1
1
, and iijð1
1
!1
2
Þ
should be high. From this example, we can see that
implication is an asymmetric relationship.
Please notice that the definition of implication is
domain specific. The implication for book authors should
be very different from that for the number of pixels of
camcorders. When a user uses TRUTHFINDER on a
certain domain, he or she should provide the definition
of implication between facts. If in a domain, the
relationship between two facts is symmetric and the
definition of similarity is available, the user can define
iijð1
1
!1
2
Þ ¼ :iið1
1
. 1
2
Þ À/o:c :ii, where :iið1
1
. 1
2
Þ is
the similarity between 1
1
and 1
2
, and /o:c :ii is a
threshold for similarity.
2.2 Basic Heuristics
Based on common sense and our observations on real data,
we have four basic heuristics that serve as the base of our
computational model.
Heuristic 1. Usually there is only one true fact for a property of
an object.
In this paper, we assume that there is only one true fact
for a property of an object. The case of multiple true facts
will be studied in our future work.
Heuristic 2. This true fact appears to be the same or similar on
different websites.
Different websites that provide this true fact may present
it in either the same or slightly different ways, such as
“Jennifer Widom” versus “J. Widom.”
Heuristic 3. The false facts on different websites are less likely to
be the same or similar.
Different websites often make different mistakes for the
same object and thus provide different false facts. Although
false facts can be propagated among websites, in general,
the false facts about a certain object are much less consistent
than the true facts.
Heuristic 4. In a certain domain, a website that provides mostly
true facts for many objects will likely provide true facts for
other objects.
There are trustworthy websites such as wikipedia and
untrustworthy websites such as blogs and some small
websites. We believe that a website has some consistency in
the quality of its information in a certain domain.
3 COMPUTATIONAL MODEL
Based on the above heuristics, we know that if a fact is
provided by many trustworthy websites, it is likely to be
true; and, if a fact is conflicting with the facts provided by
many trustworthy websites, it is unlikely to be true. On the
other hand, a website is trustworthy if it provides facts with
high confidence. We can see that the website trustworthi-
ness and fact confidence are determined by each other, and
we can use an iterative method to compute both. Because
true facts are more consistent than false facts (Heuristic 3), it
is likely that we can find and distinguish true facts from
false ones at the end.
In this section, we introduce the model of iterative
computation. Table 2 shows the variables and parameters
used in the following discussion.
3.1 Website Trustworthiness and Fact Confidence
We first discuss how to infer website trustworthiness and
fact confidence from each other. The inference of website
trustworthiness is rather simple, whereas that of fact
confidence is more complicated. We start from the simplest
case and proceed to more complicated ones step by step.
3.1.1 Basic Inference
As defined in Definition 2, the trustworthiness of a website
is just the expected confidence of facts it provides. For
website n, we compute its trustworthiness tðnÞ by
calculating the average confidence of facts provided by n:
tðnÞ ¼
P
121ðnÞ
:ð1Þ
j1ðnÞj
. ð1Þ
where 1ðnÞ is the set of facts provided by n.
798 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008
TABLE 2
Variables and Parameters of TRUTHFINDER
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
In comparison, it is much more difficult to estimate the
confidence of a fact. As shown in Fig. 2, the confidence of a
fact 1
1
is determined by the websites providing it and other
facts about the same object.
Let us first analyze the simple case where there is no
related fact, and 1
1
is the only fact about object o
1
(i.e., 1
2
does not exist in Fig. 2). Because 1
1
is provided by n
1
and
n
2
, if 1
1
is wrong, then both n
1
and n
2
are wrong. We
first assume that n
1
and n
2
are independent. (This is not
true in many cases, and we will compensate for it later.)
Thus, the probability that both of them are wrong is
ð1 Àtðn
1
ÞÞ Á ð1 Àtðn
2
ÞÞ, and the probability that 1
1
is not
wrong is 1 Àð1 Àtðn
1
ÞÞ Á ð1 Àtðn
2
ÞÞ. In general, if a fact 1
is the only fact about an object, then its confidence :ð1Þ
can be computed as
:ð1Þ ¼ 1 À
Y
n2\ð1Þ
ð1 ÀtðnÞÞ. ð2Þ
where \ð1Þ is the set of websites providing 1.
In (2), 1 ÀtðnÞ is usually quite small, and multiplying
many of them may lead to underflow. In order to facilitate
computation and veracity exploration, we use a logarithm
and define the trustworthiness score of a website as
tðnÞ ¼ Àlnð1 ÀtðnÞÞ. ð3Þ
tðnÞ is between and zero and þ1, and a larger tðnÞ
indicates higher trustworthiness.
Similarly, we define the confidence score of a fact as
oð1Þ ¼ Àlnð1 À:ð1ÞÞ. ð4Þ
A very useful property is that the confidence score of a
fact 1 is just the sum of the trustworthiness scores of
websites providing 1. This is shown in the following
lemma, which is used to compute oð1Þ in TRUTHFINDER.
Lemma 1.
oð1Þ ¼
X
n2\ð1Þ
tðnÞ. ð5Þ
Proof. According to (2),
1 À:ð1Þ ¼
Y
n2\ð1Þ
ð1 ÀtðnÞÞ.
Take the logarithm on both sides, and we have
lnð1 À:ð1ÞÞ ¼
X
n2\ð1Þ
lnð1 ÀtðnÞÞ
() oð1Þ ¼
X
n2\ð1Þ
tðnÞ.
tu
3.1.2 Influences between Facts
The above discussion shows how to compute the confidence
of a fact that is the only fact about an object. However, there
are usually many different facts about an object (such as 1
1
and 1
2
in Fig. 2), and these facts influence each other.
Suppose in Fig. 2 that the implication from 1
2
to 1
1
is very
high (e.g., they are very similar). If 1
2
is provided by many
trustworthy websites, then 1
1
is also somehow supported by
these websites, and 1
1
should have reasonably high
confidence. Therefore, we should increase the confidence
score of 1
1
according to the confidence score of 1
2
, which is
the sum of the trustworthiness scores of websites providing
1
2
. We define the adjusted confidence score of a fact 1 as
o
Ã
ð1Þ ¼ oð1Þ þj Á
X
oð1
0
Þ¼oð1Þ
oð1
0
Þ Á iijð1
0
!1Þ. ð6Þ
j is a parameter between zero and one, which controls the
influence of related facts. We can see that o
Ã
ð1Þ is the sum of
the confidence scores of 1, and a portion of the confidence
score of each related fact 1
0
multiplies the implication from
1
0
to 1. Please notice that iijð1
0
!1Þ < 0 when 1 is
conflicting with 1
0
.
We can compute the confidence of 1 based on o
Ã
ð1Þ in the
same way as computing it based on oð1Þ (defined in (4)). We
use :
Ã
ð1Þ to represent this confidence:
3
:
Ã
ð1Þ ¼ 1 Àc
Ào
Ã
ð1Þ
. ð7Þ
3.1.3 Handling Additional Subtlety
Until now, we have described how to compute fact
confidence from website trustworthiness. There are still
two problems with our model, which are discussed below.
The first problem is that we have been assuming that
different websites are independent of each other. This assump-
tion is often incorrect because errors can be propagated
between websites. According to the definitions above, if a
fact 1 is provided by five websites with a trustworthiness
of 0.6 (which is quite low), 1 will have a confidence of 0.99.
However, actually, some of the websites may copy contents
from others. In order to compensate for the problem of
overly high confidence, we add a dampening factor · into (7)
and redefine fact confidence as :
Ã
ð1Þ ¼ 1 Àc
À·Áo
Ã
ð1Þ
, where
0 < · < 1.
The second problem with our model is that the confidence
of a fact 1 can easily be negative if 1 is conflicting with some
YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 799
Fig. 2. Computing confidence of a fact.
3. With a similar proof as in Lemma 1, we can show that (7) is equivalent
to :
Ã
ð1Þ ¼
1 À
Y
n2\ð1Þ
ð1 ÀtðnÞÞ Á
Y
oð1
0
Þ¼oð1Þ
Y
n
0
2\ð1
0
Þ
ð1 Àtðn
0
ÞÞ
0
@
1
A
jÁiijð1
0
!1Þ
.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
facts provided by trustworthy websites, which makes
o
Ã
ð1Þ < 0 and :
Ã
ð1Þ < 0. This is unreasonable because
1) the confidence cannot be negative and 2) even with
negative evidences, there is still a chance that 1 is correct, so
its confidence should still be above zero. Moreover, if we set
:
Ã
ð1Þ ¼ 0 if it is negative according to (7), this “chunking”
operation and the multiple zero values may lead to unstable
conditions in our iterative computation. Therefore, we
adopt the widely used Logistic function [8], which is a
variant of (7), as the final definition for fact confidence.
:ð1Þ ¼
1
1 þc
À·Áo
Ã
ð1Þ
. ð8Þ
When · Á o
Ã
ð1Þ is significantly greater than zero, :ð1Þ is
very close to :
Ã
ð1Þ because
1
1þc
À·Áo
Ã
ð1Þ
% 1 Àc
À·Áo
Ã
ð1Þ
. When · Á
o
Ã
ð1Þ is significantly less than zero, :ð1Þ is close to zero but
remains positive. We compare these two definitions in
Fig. 3. One can see that the two curves are very close when
o
Ã
ð1Þ 3. :
Ã
ð1Þ decreases very sharply when o
Ã
ð1Þ < 1,
which is not reasonable because it is always possible that
the fact is true even with negative evidence. In comparison,
:ð1Þ decreases slowly and is slightly above zero when
o
Ã
ð1Þ (0, which is consistent with the real situation. Please
notice that (8) is also very similar to the Sigmoid function
[12], which has been successfully used in various models in
many fields.
3.2 Computing Website Trustworthiness and Fact
Confidence with Matrix Operations
We have described how to calculate website trustworthi-
ness and fact confidence. It will be more convenient if we
can convert the computational procedure into some basic
matrix operations, which can be implemented easily and
performed efficiently. To facilitate our discussions, we use
vectors to represent the trustworthiness of all websites and
the confidence of all facts. Let the vectors be
t
!
¼ ðtðn
1
Þ. . . . . tðn
`
ÞÞ
T
.
t
!
¼ ðtðn
1
Þ. . . . . tðn
`
ÞÞ
T
.
:
!
¼ ð:ð1
1
Þ. . . . . :ð1
`
ÞÞ
T
.
o
Ã
ƒ!
¼ ðo
Ã
ð1
1
Þ. . . . . o
Ã
ð1
`
ÞÞ
T
.
We want to define a ` Â` matrix A for inferring website
trustworthiness from fact confidence and a ` Â` matrix B
for the reverse inference, i.e.
t
!
¼ A:
!
.
o
Ã
ƒ!
¼ Bt
!
.
ð9Þ
t
!
and t
!
can be converted from each other using (3), and :
!
and o
!
can be converted using (8).
Matrix A can be easily defined according to (1), by
setting
A
i,
¼
1
.
1ðn
i
Þ j j. if 1
,
2 1ðn
i
Þ.
0. otherwise.
(
ð10Þ
In comparison, matrix B involves more factors because
facts about the same object influence each other. From (6)
and Lemma 1, we can infer that
B
,i
¼
1. if n
i
provides 1
,
.
j Á iijð1
/
!1
,
Þ. if n
i
provides 1
/
and oð1
/
Þ¼oð1
,
Þ.
0. otherwise.
8
<
:
ð11Þ
Both Aand Bare sparse matrices. In general, the website
trustworthiness and fact confidence can be computed
conveniently with matrix operations, as shown in Fig. 4.
The above procedure is very different from Authority-
Hub analysis proposed by Kleinberg [7]. It involves non-
linear transformations and thus cannot be computed using
eigenvector computation as in [7]. Authority-Hub analysis
defines the authority scores and hub scores as the sum of
each other. On the other hand, TRUTHFINDER studies the
probabilities of websites being correct and facts being true,
which cannot be defined as simple summations because the
probability often needs to be computed in nonlinear ways.
That is why TRUTHFINDER requires iterative computation
to achieve convergence.
800 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008
Fig. 3. Two methods for computing confidence.
Fig. 4. Computing website trustworthiness and fact confidence with
matrix operations.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
3.3 Iterative Computation
As described above, we can infer the website trustworthi-
ness if we know the fact confidence and vice versa. As in
Authority-Hub analysis [7] and PageRank [10], TRUTHFIN-
DER adopts an iterative method to compute the trustworthi-
ness of websites and confidence of facts. Initially, it has very
little information about the websites and the facts. At each
iteration, TRUTHFINDER tries to improve its knowledge
about their trustworthiness and confidence, and it stops
when the computation reaches a stable state.
As in other iterative approaches [7], [10], TRUTHFINDER
needs an initial state. We choose the initial state in which all
websites have uniform trustworthiness t
0
. (t
0
should be set
to the estimated average trustworthiness, such as 0.9.) From
the website trustworthiness TRUTHFINDER can infer the
confidence of facts, which are very meaningful because the
facts supported by many websites are more likely to be
correct. On the other hand, if we start from a uniform fact
confidence, we cannot infer meaningful trustworthiness for
websites. Before the iterative computation, we also need to
calculate the two matrices Aand B, as defined in Section 3.2.
They are calculated once and used at every iteration.
In each step of the iterative procedure, TRUTHFINDER
first uses the website trustworthiness to compute the fact
confidence and then recomputes the website trustworthi-
ness from the fact confidence. Each step only requires two
matrix operations and conversions between tðnÞ and tðnÞ
and between :ð1Þ and o
Ã
ð1Þ. The matrices are stored in
sparse formats, and the computational cost of multiplying
such a matrix and a vector is linear with the number of
nonzero entries in the matrix. TRUTHFINDER stops iterating
when it reaches a stable state. The stableness is measured by
how much the trustworthiness of websites changes between
iterations. If t
!
only changes a little after an iteration
(measured by cosine similarity between the old and the new
t
!
), then TRUTHFINDER will stop. The overall algorithm is
presented in Fig. 5.
3.4 Complexity Analysis
In this section, we analyze the complexity of TRUTHFIN-
DER. Suppose there are 1 links between all websites and
facts. Because different websites may provide the same fact,
1 should be greater than ` (number of facts). Suppose on
the average there are / facts about each object, and thus,
each fact has / À1 related facts on the average.
Let us first look at the two matrices A and B. Each link
between a website and a fact corresponds to an entry in A.
Thus, A has 1 entries, and it takes Oð1Þ time to compute A.
B contains more entries than A because B
,i
is nonzero if
website n
i
provides a fact that is related to fact 1
,
. Thus,
there are Oð/1Þ entries in B. Because each website can
provide at most one fact about each object, each entry of B
involves only one website and one fact. Thus, it still takes
constant time to compute each entry of B, and it takes
Oð/1Þ time to compute B.
The time cost of multiplying a sparse matrix and a vector
is linear with the number of entries in the matrix. Therefore,
each iteration takes Oð/1Þ time and no extra space. Suppose
there are 1 iterations. TRUTHFINDER takes Oð1/1Þ time and
Oð/1 þ`þ`Þ space.
If in some cases, Oð/1Þ space is not available, we can
discard the matrix operations and compute the website
trustworthiness andfact confidence using their definitions in
Section 3.1. If we precompute the implication between all
facts, then Oð/`Þ space is needed to store these implication
values, and the total space requirement is Oð1 þ/`Þ. If the
implication between two facts can be computed in a very
short constant time and we do not precompute the implica-
tion, then the total space requirement is Oð1 þ` þ`Þ. In
both cases, it takes Oð1Þ time to propagate between website
trustworthiness and fact confidence and Oð/`Þ time to
adjust fact confidence according to the interfact implication.
Thus, the overall time complexity is Oð11 þ1/`Þ.
4 EMPIRICAL STUDY
We perform experiments on two real data sets and
synthetic data sets to examine the accuracy and efficiency
of TRUTHFINDER. The first real data set contains the
authors of many books provided by many online book-
stores, and the second one contains the runtime of many
movies provided by many websites. As will be explained
in Section 5, there is no existing approach to the problem
studied in this paper, and thus, we compare TRUTHFIN-
DER with a baseline approach as described below.
4.1 Experiment Setting
In order to show the effectiveness of TRUTHFINDER, we
compare it with a baseline approach called VOTING. When
trying to find the true fact for a certain object, VOTING
chooses the fact that is provided by most websites and
resolves ties randomly. This is the simplest approach and
only uses the number of websites supporting each fact. In
comparison, TRUTHFINDER considers the implication be-
tween different facts from the first iteration and considers
the different trustworthiness of different websites in the
following iterations.
All experiments are performed on an Intel PC with a
1.66-GHz dual-core processor with 1 Gbyte of memory
running Windows XP Professional. All approaches are
implemented using Visual Studio.Net (C#), using a single
thread. The two parameters in (8) are set as j ¼ 0.5 and
· ¼ 0.3. The maximum difference between two iterations c
YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 801
Fig. 5. Algorithm of TRUTHFINDER.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
is set to 0.001 percent. We will show in our experiments that
the performance of TRUTHFINDER is not sensitive to these
parameters.
4.2 Real Data Sets
We test the accuracy and efficiency of TRUTHFINDER on
two real data sets. The results are presented below.
4.2.1 Book Authors
The first real data set contains the authors of many books
provided by many online bookstores. It contains 1,265 books
about computer science and engineering, which are pub-
lished by Addison-Wesley, McGraw-Hill, Morgan Kauf-
mann, or Prentice Hall. For each book, we use its ISBN to
search on www.abebooks.com, which returns the online
bookstores that sell the book and the book information from
each store, including the price, the authors, and a short
description. The data set contains 894 bookstores and
34,031 listings (i.e., bookstore selling a book). On the average,
each book has 5.4 different sets of authors, which are
indicated by different bookstores.
TRUTHFINDER performs iterative computation to find
out the set of authors for each book. In order to test its
accuracy, we randomly select 100 books and manually find
out their authors. We find the image of each book and use
the authors on the book cover as the standard fact.
We compare the set of authors found by TRUTHFINDER
for each book with the standard fact to compute the
accuracy of TRUTHFINDER. For a certain book, suppose the
standard fact contains r authors; TRUTHFINDER indicates
that there are n authors, among which : authors belong to
the standard fact. The accuracy of TRUTHFINDER is defined
as
:
maxðr.nÞ
.
4
Sometimes TRUTHFINDER provides partially correct
facts. For example, the standard set of authors for a book
is “Graeme C. Simsion and Graham Witt,” and the authors
found by TRUTHFINDER may be “Graeme Simsion and
G. Witt.” We consider “Graeme Simsion” and “G. Witt” as
partial matches for “Graeme C. Simsion” and “Graham
Witt” and give them partial scores. We assign different
weights to different parts of persons’ names. Each author
name has a total weight of 1, and the ratio between weights
of the last name, first name, and middle name is 3:2:1. For
example, “Graeme Simsion” will get a partial score of 5/6
because it omits the middle name of “Graeme C. Simsion.”
If the standard name has a full first or middle name and
TRUTHFINDER provides the correct initial, we give TRUTH-
FINDER a half score. For example, “G. Witt” will get a score
of 4/5 with respect to “Graham Witt,” because the first
name has weight 2/5, and the first initial “G.” gets half of
the score.
The implication between two sets of authors 1
1
and 1
2
is
defined in a very similar way as the accuracy of 1
2
with
respect to 1
1
. Because many bookstores provide incomplete
facts (e.g., only the first author), if 1
2
contains authors that are
not in 1
1
, the implication from 1
1
to 1
2
will not be decreased.
If 1
1
has r authors, 1
2
has n authors, and there are : shared
ones, then iijð1
1
!1
2
Þ ¼ :´r À/o:c :ii, where /o:c :ii
is the threshold for positive implication and is set to 0.5.
Fig. 6 shows the accuracies of TRUTHFINDER and
VOTING. One can see that TRUTHFINDER is significantly
more accurate than VOTING even at the first iteration, where
all bookstores have the same trustworthiness. This is
because TRUTHFINDER considers the implications between
different facts about the same object, while VOTING does
not. As TRUTHFINDER repeatedly computes the trustworthi-
ness of bookstores and the confidence of facts, its accuracy
increases to about 95 percent after the third iteration and
remains stable. It takes TRUTHFINDER 8.73 seconds to
precompute the implications between related facts and
4.43 seconds to finish the four iterations. VOTING takes
1.22 seconds.
Fig. 7 shows the relative change of the trustworthiness
vector after each iteration. The change is defined as one
minus the cosine similarity of the old and the new
vectors. We can see that the relative change decreases by
about one order after each iteration, showing that
TRUTHFINDER converges at a steady speed. After the
fourth iteration, the stop criterion is met, which is
indicated by the horizontal axis.
In Table 3, we manually compare the results of VOTING
and TRUTHFINDER and the authors provided by Barnes &
Noble on its website. We list the number of books in which
each approach makes each type of errors. Please notice that
one approach may make multiple errors for one book.
VOTING tends to miss authors because many bookstores
only provide subsets of all authors. On the other hand,
TRUTHFINDER tends to consider facts with more authors as
802 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008
Fig. 6. Accuracies of TRUTHFINDER and VOTING.
Fig. 7. Relative changes of TRUTHFINDER.
4. For simplicity, we do not consider the order of authors in this study,
although TRUTHFINDER can report the authors in correct order in most cases.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
correct facts because of our definition of implication for book
authors andthus makes more mistakes of adding in incorrect
names. One may think that the largest bookstores will
provide accurate information, which is surprisingly untrue
according to our experiment. Table 3 shows that Barnes &
Noble has more errors than VOTING and TRUTHFINDER on
these 100 randomly selected books. It has redundant names
for many books. For example, for a book written by “Robert J.
Muller,” it says the authors are “Robert J. Muller; Muller.”
Actually Barnes &Noble is less accurate than TRUTHFINDER
even if we do not consider such errors.
We give an example mistake made by TRUTHFINDER.
The book “Server Storage Technologies for Windows 2000,
Windows Server 2003, and Beyond” is written by Dilip C.
Naik. There are nine bookstores saying that its authors are
“Dilip C. Naik and Dilip Naik,” seven saying “Dilip C.
Naik,” and five saying “Dilip Naik.” TRUTHFINDER infers
that its authors are “Dilip C. Naik and Dilip Naik,” which
contains redundant names. However, Amazon makes the
same mistake as TRUTHFINDER, which may explain why
the mistake is propagated to so many bookstores.
Finally, we perform an interesting experiment on finding
trustworthy websites. It is well known that Google (or other
search engines) is good at finding authoritative websites.
However, do these websites provide accurate information?
To answer this question, we compare the online bookstores
that are given highest ranks by Google with the bookstores
with highest trustworthiness found by TRUTHFINDER. We
query Google with “bookstore”
5
and find all bookstores that
exist in our data set from the top 300 Google results.
6
The
accuracy of each bookstore is tested on the 100 randomly
selected books in the same way as we test the accuracy of
TRUTHFINDER. We only consider bookstores that provide
at least 10 of the 100 books.
The results are shown in Table 4. TRUTHFINDER can find
bookstores that provide much more accurate information
than the top bookstores found by Google. TRUTHFINDER
also finds some large trustworthy bookstores such as
A1 Books, which provides 86 of 100 books with an accuracy
of 0.878. Please notice that TRUTHFINDER uses no training
data, and the testing data is manually created by reading
the authors’ names from book covers. Therefore, we believe
that the results suggest that there may be better alternatives
than Google for finding accurate information on the Web.
4.2.2 Movie Runtime
The second real data set contains runtimes of movies
provided by many websites. It contains 603 movies, which
are the movies with top ratings in every genre on
IMDB.com.
7
We collect the runtime data of each movie
using Google. For example, for the movie Forrest Gump, we
query Google with
00
forrest gump
00
þruntime and parse the
result page that contains digests of the first 100 results. If we
see terms like “130 minutes” or “2h10m” appearing after
“runtime” (or “run time”), we consider such terms as the
runtime of this movie. We found 17,109 useful digests,
which contain information from 1,727 websites. On average,
each movie has 14.3 different runtimes provided by
different websites.
Because of the authority of IMDB, we consider the runtime
it provides as the standard facts (information from IMDB.
com is excluded from our data set). We randomly select
100 movies and find their runtimes on IMDB, in order to test
the accuracy of TRUTHFINDER. If the standard runtime for a
movie is r minutes and TRUTHFINDER infers that its runtime
is n minutes, then the accuracy is defined as
jnÀrj
maxðr.nÞ
. If there
are two facts 1
1
and 1
2
about a movie, where 1
1
¼ n, and
1
2
¼ :, then the implication from 1
1
to 1
2
is defined as
iijð1
1
!1
2
Þ ¼
jnÀ:j
maxðn.:Þ
À/o:c :ii, where /o:c :ii is the
threshold for positive implication and is set to 0.75.
YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 803
TABLE 4
Comparison of the Accuracies of Top Bookstores
by TRUTHFINDER and by Google
TABLE 3
Comparison of the Results of VOTING, TRUTHFINDER,
and Barnes & Noble
5. This query was submitted on 7 February 2007.
6. Our data set does not contain the largest bookstores such as Barnes &
Noble and Amazon.com, because they do not list their books on
www.abebooks.com. However, we include Barnes & Noble here because
we have manually retrieved its authors for the 100 books for testing. 7. Please see http://imdb.com/chart/.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
Fig. 8 shows the accuracies of TRUTHFINDER and
VOTING on the movie runtime data set. TRUTHFINDER
achieves very high accuracy on this data set, reaching
99 percent after three iterations. In comparison, the error
rate of VOTING is more than twice that of TRUTHFINDER.
Fig. 9 shows the relative change after each iteration of the
trustworthiness vector of TRUTHFINDER. It can be seen that
TRUTHFINDER still converges at a steady speed, and it takes
five iterations to meet the stop criterion. It takes TRUTH-
FINDER 2.89 seconds for initialization and 3.19 seconds for
five iterations. VOTING takes 1.55 seconds.
In Table 5, we compare the numbers of errors of VOTING
and TRUTHFINDER in different cases. Again, we find that
TRUTHFINDER is much more accurate than VOTING.
In Table 6, we compare the most trustworthy websites by
TRUTHFINDER and top ranked movie websites by Google
(with query “movies”
8
). Again, we find that the trustworthy
websites found by TRUTHFINDER provide much more
accurate information than those ranked high by Google.
4.2.3 Parameter Sensitivity
There are two important parameters in the computation
website trustworthiness and fact confidence—j and · in (6)
and (8) in Section 3.1. j controls the degree of influence
between related facts (i.e., facts about the same object), and ·
determines the shape of the confidence curve in Fig. 3. By
default, j ¼ 0.5, and · ¼ 0.3. The following experiments
show that the accuracy of TRUTHFINDER is only very
slightly affected by these two parameters.
Fig. 10a shows the accuracy of TRUTHFINDER with
different js on the two data sets (with · ¼ 0.3). It can be
seen that the accuracy of TRUTHFINDER is very stable when
j ! 0.1. Fig. 10b shows the accuracy with different ·s on the
two data sets (with j ¼ 0.5). · has more influence on
TRUTHFINDER than j. However, the influence is still very
limited, and the accuracy only varies by about 0.5 percent
even when · is changed significantly.
We also investigate how the parameters influence the
convergence of the algorithm. Figs. 11a and 11b show the
relative changes after each iteration of TRUTHFINDER, with
different js and ·s, when TRUTHFINDER is applied on the
book author data set. (We do not show experiments on the
movie runtime data set because of limited space.) In the
figures, we can see that the algorithm always converges
rapidly, with a relative change of less than 10
À5
after five
iterations. TRUTHFINDER converges faster when j is
smaller, which means that the influence between different
facts is smaller. This is reasonable because interfact
influences add complexity to the problem and will probably
make it slower to converge. TRUTHFINDER also converges
faster when · is smaller, probably because a smaller ·
means a smaller influence of o
Ã
ð1Þ on :ð1Þ (there is no
influence when · ¼ 0).
4.3 Synthetic Data Sets
In this section, we test the scalability and noise resistance of
TRUTHFINDER on synthetic data sets. We generate data sets
containing ` websites and ` facts. There are `´5 objects
(i.e., each object has five facts on average), and there are
four websites providing each fact on average (i.e., 4` links
between websites and facts). The expected trustworthiness
of each website is

t, and the trustworthiness of each website
is drawn from a uniform distribution from maxð0. 2

t À1Þ to
minð2

t. 1Þ. For example, if

t ¼ 0.7, then the range for website
trustworthiness is [0.4, 1].
Each object has a true value, which is a random number
drawn from a uniform distribution on interval [1,000,
10,000]. The facts of each object are real numbers that do
not deviate too much from the true value. We want to
generate the facts by randomly generating some numbers
on an interval around the true value. However, we do not
want the true value to be at the middle of the interval,
which makes it very easy to find the true value by taking
the average of all facts. Suppose the value for object o is .ðoÞ.
We create a value range for o that is an interval of length
.ðoÞ´2. The value range is randomly placed so that the true
value .ðoÞ falls at any position in it with equal probability.
Then, the facts of o are drawn from a uniform distribution
804 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008
Fig. 8. Accuracies of TRUTHFINDER and VOTING.
Fig. 9. Relative changes of TRUTHFINDER.
8. This query was submitted on 7 February 2007.
TABLE 5
Comparison of the Results of VOTING and TRUTHFINDER
on Movie Runtime
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
on this value range. In the following experiments, TRUTH-
FINDER does not precompute implications between facts
(because they can be easily computed on the fly) and
performs five iterations.
We first study the time and space scalability of TRUTH-
FINDER with respect to the number of facts. The number of
websites is fixed at 1,000, and the number of facts varies from
5,000 to 500,000. The results are shown in Fig. 12. Its runtime
increases 118 times as the number of objects grows 100 times,
which is very close to being linearly scalable, and the minor
superlinear part should come from the usage of dynamically
growing data structures (e.g., vectors and hash tables). The
memory usage is also linearly scalable. In fact, it grows
sublinearly, possibly because of the fixed part of memory
usage.
Then, we study the scalability with respect to the number
of websites, as shown in Fig. 13. There are 10,000 objects
and 50,000 facts, and the number of websites varies from
100 to 10,000. It can be seen that the time and memory usage
almost remain unchanged when the number of websites
grows 100 times, which is consistent with our complexity
analysis.
Finally, we study how well TRUTHFINDER can do when
the expected trustworthiness varies from zero to one. We
compare the accuracies (as defined on the movie runtime
data set) of TRUTHFINDER and VOTING in Fig. 14a and the
YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 805
TABLE 6
Comparison of the Accuracies of Top Movie Websites
by TRUTHFINDER and by Google
Fig. 10. Accuracy of TRUTHFINDER with respect to j and ·. (a) Accuracy
with respect to j. (b) Accuracy with respect to ·.
Fig. 11. Relative changes after each iteration with different js and ·s.
(a) Convergence with j. (b) Convergence with ·.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
percentages of objects for which exactly correct facts are
found in Fig. 14b. When the expected trustworthiness is
zero, TRUTHFINDER and VOTING are not doing much
better than random. (It can be shown that even if the true
value and the fact found are two random numbers between
1,000 and 1,500, the accuracy is as high as 0.87.) However,
their accuracies grow sharply as the expected trustworthi-
ness increases. When the expected trustworthiness is 0.5,
their accuracy is more than 99.5 percent, and the percentage
of exact match for TRUTHFINDER is 98.2 percent, and that
for VOTING is 90.5 percent. This shows that TRUTHFINDER
can find the true value even if there are many untrust-
worthy websites.
5 RELATED WORK
The quality of information on the Web has always been a
major concern for Internet users [11]. There have been
studies on what factors of data quality are important for
users [13] and on machine learning approaches for
distinguishing high-quality and low-quality web pages
[9], where the quality is defined by human preference. It
is also shown that information quality measures can help
improve the effectiveness of Web search [15].
In 1998, two pieces of groundbreaking work, PageRank
[10] and Authority-Hub analysis [7], were proposed to
utilize the hyperlinks to find pages with high authorities.
These two approaches are very successful at identifying
important web pages that users are interested in, which is
also shown by a subsequent study [1]. In [3], the authors
propose a framework of link analysis and provide theore-
tical studies for many link-based approaches.
Unfortunately, the popularity of web pages does not
necessarily lead to accuracy of information. Two observa-
tions are made in our experiments: 1) even the most popular
website (e.g., Barnes & Noble) may contain many errors,
whereas some comparatively not-so-popular websites may
provide more accurate information, and 2) more accurate
information can be inferred by using many different
websites instead of relying on a single website.
TRUTHFINDER studies the interaction between websites
and the facts they provide and infers the trustworthiness of
websites and confidence of facts from each other. An
analogy can be made between this problem and Authority-
Hub analysis, by considering websites as hubs (both of
them indicate others’ authority weights) and facts as
authorities. However, these two problems are very differ-
ent, and Authority-Hub analysis cannot be applied to our
problem. In Authority-Hub analysis, a hub’s weight is
806 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008
Fig. 12. Scalability of TRUTHFINDER with respect to the number of facts.
(a) Time. (b) Memory.
Fig. 13. Scalability of TRUTHFINDER with respect to the number of
websites. (a) Time. (b) Memory.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
computed by summing up the weights of authorities linked
to it. This is unreasonable in computing the trustworthiness
of a website, because a trustworthy website should be one
that provides accurate facts instead of many of them, and a
website providing many inaccurate facts is an untrust-
worthy one. Moreover, the confidence of a fact is not simply
the sum of the trustworthiness of the websites providing it.
Instead, it needs to be computed using some nonlinear
transformations according to a probabilistic analysis.
Another difference between TRUTHFINDER and Author-
ity-Hub analysis is that TRUTHFINDER considers the
relationships (implications) between different facts and
uses such information in inferring the confidence of facts.
This is related to existing studies on inferring similarities
between objects using links. Collaborative filtering [4] infers
the similarity between objects based on their ratings to or
from other objects. There are also studies on link-based
similarity analysis [6], [14], which defines the similarity
between two objects as the average similarity between
objects linked to them. In [5], the authors propose an
approach that uses the trust or distrust relationships
between some users (e.g., user ratings on eBay.com) to
determine the trust relationship between each pair of users.
TRUTHFINDER uses iterative methods to compute the
website trustworthiness and fact confidence, which is
widely used in many link analysis approaches [5], [6], [7],
[10], [14]. The common feature of these approaches is that
they start from some initial state that is either random or
uninformative. Then, at each iteration, the approach will
improve the current state by propagating information
(weights, probability, trustworthiness, etc.) through the
links. This iterative procedure has been proven to be
successful in many applications, and thus, we adopt it in
TRUTHFINDER.
6 CONCLUSIONS
In this paper, we introduce and formulate the Veracity
problem, which aims at resolving conflicting facts from
multiple websites and finding the true facts among them.
We propose TRUTHFINDER, an approach that utilizes the
interdependency between website trustworthiness and fact
confidence to find trustable websites and true facts.
Experiments show that TRUTHFINDER achieves high
accuracy at finding true facts and at the same time identifies
websites that provide more accurate information.
ACKNOWLEDGMENTS
This work was supported in part by the US National Science
Foundation Grants IIS-05-13678/06-42771 and NSF BDI-05-
15813. Any opinion, findings, and conclusions or recom-
mendations expressed here are those of the authors and do
not necessarily reflect the views of the funding agencies.
REFERENCES
[1] B. Amento, L.G. Terveen, and W.C. Hill, “Does ‘Authority’ Mean
Quality? Predicting Expert Quality Ratings of Web Documents,”
Proc. ACM SIGIR ’00, July 2000.
[2] M. Blaze, J. Feigenbaum, and J. Lacy, “Decentralized Trust
Management,” Proc. IEEE Symp. Security and Privacy (ISSP ’96),
May 1996.
[3] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, “Link
Analysis Ranking: Algorithms, Theory, and Experiments,” ACM
Trans. Internet Technology, vol. 5, no. 1, pp. 231-297, 2005.
[4] J.S. Breese, D. Heckerman, and C. Kadie, “Empirical Analysis of
Predictive Algorithms for Collaborative Filtering,” technical
report, Microsoft Research, 1998.
[5] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins, “Propagation
of Trust and Distrust,” Proc. 13th Int’l Conf. World Wide Web
(WWW), 2004.
[6] G. Jeh and J. Widom, “SimRank: A Measure of Structural-Context
Similarity,” Proc. ACM SIGKDD ’02, July 2002.
[7] J.M. Kleinberg, “Authoritative Sources in a Hyperlinked Environ-
ment,” J. ACM, vol. 46, no. 5, pp. 604-632, 1999.
[8] Logistical Equation from Wolfram MathWorld, http://
mathworld.wolfram.com/LogisticEquation.html, 2008.
[9] T. Mandl, “Implementation and Evaluation of a Quality-Based
Search Engine,” Proc. 17th ACM Conf. Hypertext and Hypermedia,
Aug. 2006.
[10] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank
Citation Ranking: Bringing Order to the Web,” technical report,
Stanford Digital Library Technologies Project, 1998.
[11] Princeton Survey Research Associates International, “Leap of
faith: Using the Internet Despite the Dangers,” Results of a Nat’l
Survey of Internet Users for Consumer Reports WebWatch, Oct. 2005.
[12] Sigmoid Function from Wolfram MathWorld, http://mathworld.
wolfram.com/SigmoidFunction.html, 2008.
[13] R.Y. Wang and D.M. Strong, “Beyond Accuracy: What Data
Quality Means to Data Consumers,” J. Management Information
Systems, vol. 12, no. 4, pp. 5-34, 1997.
YIN ET AL.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 807
Fig. 14. Noise resistance of TRUTHFINDER and VOTING with respect to
website trustworthiness. (a) Accuracy. (b) Percentage of exact match.
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.
[14] X. Yin, J. Han, and P.S. Yu, “LinkClus: Efficient Clustering via
Heterogeneous Semantic Links,” Proc. 32nd Int’l Conf. Very Large
Data Bases (VLDB ’06), Sept. 2006.
[15] X. Zhu and S. Gauch, “Incorporating Quality Metrics in
Centralized/Distributed Information Retrieval on the World Wide
Web,” Proc. ACM SIGIR ’00, July 2000.
Xiaoxin Yin received the BE degree from
Tsinghua University in 2001 and the MS and
PhD degrees in computer science from the
University of Illinois, Urbana-Champaign, in
2003 and 2007, respectively. He is a researcher
at the Internet Services Research Center,
Microsoft Research. He has been working in
the area of data mining since 2001, and his
research work is focused on multirelational data
mining and link analysis, and their applications
on the World Wide Web. He has published 14 papers in refereed
conference proceedings and journals.
Jiawei Han is a professor in the Department of
Computer Science, University of Illinois, Urbana-
Champaign. He has been working on research
into data mining, data warehousing, stream data
mining, spatiotemporal and multimedia data
mining, biological data mining, information net-
work analysis, text and Web mining, and soft-
ware bug mining, with more than 350 conference
and journal publications. He has chaired or
served on many program committees of inter-
national conferences and workshops. He also served or is serving on the
editorial boards of Data Mining and Knowledge Discovery, the IEEE
Transactions on Knowledge and Data Engineering, the Journal of
Computer Science and Technology, and the Journal of Intelligent
Information Systems. He is currently serving as the founding editor in
chief of the ACM Transactions on Knowledge Discovery from Data and
on the board of directors for the executive committee of the ACM Special
Interest Group on Knowledge Discovery and Data Mining (SIGKDD). He
has received many awards and recognitions, including the ACM
SIGKDD Innovation Award in 2004 and the IEEE Computer Society
Technical Achievement Award in 2005. He is a senior member of the
IEEE and a fellow of the ACM.
Philip S. Yu received the BS degree in electrical
engineering from the National Taiwan Univer-
sity, the MS and PhD degrees in electrical
engineering from Stanford University, and the
MBA degree from New York University. He is a
professor in the Department of Computer
Science, University of Illinois, Chicago, and also
holds the Wexler chair in information and
technology. He was the manager of the Software
Tools and Techniques Group at the IBM T.J.
Watson Research Center. His research interests include data mining,
Internet applications and technologies, database systems, multimedia
systems, parallel and distributed processing, and performance model-
ing. He has published more than 500 papers in refereed journals and
conference proceedings. He holds or has applied for more than 300 US
patents. He is an associate editor of the ACM Transactions on the
Internet Technology and the ACM Transactions on Knowledge
Discovery from Data. He is on the steering committee of the IEEE
Conference on Data Mining and was a member of the IEEE Data
Engineering steering committee. He was the editor in chief of IEEE
Transactions on Knowledge and Data Engineering from 2001 to 2004,
an editor, an advisory board member, and also a guest coeditor of the
special issue on mining of databases. He had also served as an
associate editor of Knowledge and Information Systems. In addition to
serving as a program committee member on various conferences, he
was the program chair or cochair of the IEEE Workshop of Scalable
Stream Processing Systems (SSPS ’07), the IEEE Workshop on Mining
Evolving and Streaming Data (2006), the 2006 Joint Conferences of the
Eighth IEEE Conference on E-commerce Technology (CEC ’06) and the
Third IEEE Conference on Enterprise Computing, E-commerce and E-
services (EEE ’06), the 11th IEEE International Conference on Data
Engineering, the Sixth Pacific Area Conference on Knowledge Dis-
covery and Data Mining, the Ninth ACM Sigmod Workshop on Research
Issues in Data Mining and Knowledge Discovery, the Second IEEE
International Workshop on Research Issues on Data Engineering:
Transaction and Query Processing, the PAKDD Workshop on Knowl-
edge Discovery from Advanced Databases, and the Second IEEE
International Workshop on Advanced Issues of E-commerce and Web-
Based Information Systems. He served as the general chair or cochair
of the 2006 ACM Conference on Information and Knowledge Manage-
ment, the 14th IEEE International Conference on Data Engineering, and
the Second IEEE International Conference on Data Mining. He has
received several IBM honors, including two IBM Outstanding Innovation
Awards, an Outstanding Technical Achievement Award, two Research
Division Awards, and the 93rd Plateau of Invention Achievement
Awards. He was an IBM Master Inventor. He received a Research
Contributions Award from the IEEE International Conference on Data
Mining in 2003 and also an IEEE Region 1 Award for “promoting and
perpetuating numerous new electrical engineering concepts” in 1999.
He is a fellow of the ACM and the IEEE.
> For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
808 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 6, JUNE 2008
Authorized licensed use limited to: University of Illinois. Downloaded on January 9, 2009 at 12:50 from IEEE Xplore. Restrictions apply.

For example.. different facts influence each other.YIN ET AL. if a website says that a book is written by “Jessamyn Wendell” and another says “Jessamyn Burns Wendell. We discuss related work in Section 5 and conclude this study in Section 6. Different facts about the same object may be conflicting. A website is trustworthy if most facts it provides are true. There are also many websites.2 A fact is likely to be true if it is provided by trustworthy websites (especially if by many of them). Our experiments show that TRUTHFINDER achieves very high accuracy in discovering true facts. Second and more importantly. It is required that impðf1 ! f2 Þ is a value between À1 and 1. We also require that the facts can be parsed from the web pages. and it can select better trustworthy websites than authority-based search engines such as Google. we formulate the Veracity problem about how to discover true facts from conflicting information. 2. authors of books). 2 PROBLEM DEFINITIONS In this paper. Experimental results are presented in Section 4. we propose the concept of implication between facts.g.” whereas another claims that it is written by “Jessamyn Wendell. This iterative procedure is rather different from Authority-Hub analysis [7].g. In summary. a domain refers to a property of a certain type of objects. 2009 at 12:50 from IEEE Xplore. we propose an algorithm called TRUTHFINDER for identifying true facts using iterative methods. is f1 ’s influence on f2 ’s confidence. For example. There are usually multiple conflicting facts from different websites for each object. The “trustworthiness” in this paper means accuracy in providing information. a website providing 10.7 is much less trustworthy than a website providing 100 facts with an accuracy of 0. weights of laptop computers) or relationships between two objects (e. Second. we study the problem of finding true facts in a certain domain. such as different sets of authors for a book.” then these two websites actually support each other although they provide slightly different facts. we choose an iterative computational method. the confidence of facts and the trustworthiness of websites. Because of this interdependency between facts and websites. sometimes facts may be supportive to each other although they are slightly different.95. Definition 2 (Trustworthiness of websites). such as authors of books or number of pixels of camcorders. by defining the trustworthiness of websites. In order to represent such relationships. Fig. 1 shows a miniexample data set. objects (e. For example. It is different from the “trustworthiness” in the studies of trust management [2]. how much f2 ’s confidence should be increased (or decreased) according to f1 ’s confidence.000 facts with an average accuracy of 0. Downloaded on January 9. 2. The confidence of a fact f (denoted by sðfÞ) is the probability of f being correct. and the goal of TRUTHFINDER is to identify the true fact among them.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 797 TABLE 1 Conflicting Information about Book Authors Fig. The trustworthiness of a website does not depend on how many facts it provides but on the accuracy of those facts. Thus.1 Basic Definitions We first introduce the two most important definitions in this paper. we propose a framework to solve this problem. and influences between facts. i. impðf1 ! f2 Þ. Restrictions apply. Input of TRUTHFINDER. Definition 1 (Confidence of facts). There are often conflicting facts on the Web. and another one says 10 cm. Each website provides at most one fact for an object.e. confidence of facts. If one of such facts is true. Instead. The trustworthiness of a website w (denoted by tðwÞ) is the expected confidence of the facts provided by w. The first difference is in the definitions.” and another one claims “J. Here. At each iteration. we make three major distributions in this paper. one website claims the author to be “Jennifer Widom..” or one website says that a certain camera is 4 inches long. we have to resort to probabilistic computation. Widom. First. the probabilities of facts being true and the trustworthiness of websites are inferred from each other. The rest of the paper is organized as follows: We describe the problem in Section 2 and propose the computational model and algorithms in Section 3. which contains five facts about two objects provided by four websites. We incorporate such influences between facts into our computational model.. some of which are more trustworthy than others.” However. Authorized licensed use limited to: University of Illinois. The implication from fact f1 to f2 . 1. the other is also likely to be true. For example. we cannot compute the trustworthiness of a website by adding up the weights of its facts as in [7]. . Finally. one website claims that a book is written by “Karen Holtzblatt. The input of TRUTHFINDER is a large number of facts in a domain that are provided by many websites. nor can we compute the probability of a fact being true by adding up the trustworthiness of websites providing it. according to the best of our knowledge.

If f2 is correct.1.2. Thus. From this example. f2 Þ is the similarity between f1 and f2 . JUNE 2008 A positive value indicates that if f1 is correct. In this paper. 3 COMPUTATIONAL MODEL 2. the relationship between two facts is symmetric and the definition of similarity is available. 2009 at 12:50 from IEEE Xplore. ð1Þ where F ðwÞ is the set of facts provided by w. websites tend to provide incomplete facts (e. Based on the above heuristics. If in a domain. the trustworthiness of a website is just the expected confidence of facts it provides.1 Basic Inference As defined in Definition 2. and. it is likely to be true. Widom. Suppose two websites provide author information for the same book. The false facts on different websites are less likely to be the same or similar. f2 is likely to be wrong. The inference of website trustworthiness is rather simple. The first website indicates that the author of the book is “Jennifer Widom. Although false facts can be propagated among websites. where simðf1 . in some domains (e. it is unlikely to be true. We believe that a website has some consistency in the quality of its information in a certain domain. in general.1. 20. On the other hand. a website that provides mostly true facts for many objects will likely provide true facts for other objects. Different websites that provide this true fact may present it in either the same or slightly different ways. Table 2 shows the variables and parameters used in the following discussion. and base sim is a threshold for similarity. For example.798 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING.1 Website Trustworthiness and Fact Confidence We first discuss how to infer website trustworthiness and fact confidence from each other. In this section. we compute its trustworthiness tðwÞ by calculating the average confidence of facts provided by w: P tðwÞ ¼ f2F ðwÞ sðfÞ jF ðwÞj . NO. If we are confident about f1 . and thus. VOL. Heuristic 2.” which is fact f1 . we introduce the model of iterative computation. and impðf1 ! f2 Þ should be high. Because true facts are more consistent than false facts (Heuristic 3). We start from the simplest case and proceed to more complicated ones step by step. he or she should provide the definition of implication between facts. f2 Þ À base sim. Please notice that the definition of implication is domain specific. When a user uses TRUTHFINDER on a certain domain. we assume that there is only one true fact for a property of an object.g. While a negative value means that if f1 is correct. the false facts about a certain object are much less consistent than the true facts.” Heuristic 3. we know that it is very common for a website to provide only one of the authors for a book. For website w. The case of multiple true facts will be studied in our future work. Different websites often make different mistakes for the same object and thus provide different false facts. we have four basic heuristics that serve as the base of our computational model. f1 may only tell us that “Jennifer Widom” is one author of the book instead of the sole author. a website is trustworthy if it provides facts with high confidence. On the other hand. 3. the user can define impðf1 ! f2 Þ ¼ simðf1 .2 Basic Heuristics Based on common sense and our observations on real data. such as “Jennifer Widom” versus “J. first author of a book). Authorized licensed use limited to: University of Illinois. We can see that the website trustworthiness and fact confidence are determined by each other. then f1 is incomplete and will have low confidence. book authors). if a fact is conflicting with the facts provided by many trustworthy websites. we can see that implication is an asymmetric relationship. Heuristic 4. The second website says that there are two authors “Jennifer Widom and Stefano Ceri. . 3. We define implication instead of similarity between facts because such relationship is asymmetric. Heuristic 1. we know that if a fact is provided by many trustworthy websites. Restrictions apply.. The details about this will be described in Section 3.. and we can use an iterative method to compute both. we should also be confident about f2 because f2 is consistent with f1 .g.” which is fact f2 . whereas that of fact confidence is more complicated. The implication for book authors should be very different from that for the number of pixels of camcorders. This true fact appears to be the same or similar on different websites. it is likely that we can find and distinguish true facts from false ones at the end. f2 is likely to be correct. Downloaded on January 9. 6. TABLE 2 Variables and Parameters of TRUTHFINDER There are trustworthy websites such as wikipedia and untrustworthy websites such as blogs and some small websites. impðf2 ! f1 Þ is low. Usually there is only one true fact for a property of an object. In a certain domain.

There are still two problems with our model. which is the sum of the trustworthiness scores of websites providing f2 . Similarly. If f2 is provided by many trustworthy websites.. ð2Þ 3. it is much more difficult to estimate the confidence of a fact. then both w1 and w2 are wrong. However.1.6 (which is quite low). the probability that both of them are wrong is ð1 À tðw1 ÞÞ Á ð1 À tðw2 ÞÞ.YIN ET AL. which is used to compute ðfÞ in TRUTHFINDER. Suppose in Fig. Please notice that impðf 0 ! fÞ < 0 when f is conflicting with f 0 . and these facts influence each other. In general.) Thus. then f1 is also somehow supported by these websites. With a similar proof as in Lemma 1. According to (2). 1 À sðfÞ ¼ Y 1À Y w2W ðfÞ ð1 À tðwÞÞ Á Y oðf 0 Þ¼oðfÞ 0 @ Y 1Áimpðf 0 !fÞ ð1 À tðw0 ÞÞA : ð1 À tðwÞÞ: w0 2W ðf 0 Þ w2W ðfÞ Authorized licensed use limited to: University of Illinois. we define the confidence score of a fact as ðfÞ ¼ À lnð1 À sðfÞÞ: ð4Þ A very useful property is that the confidence score of a fact f is just the sum of the trustworthiness scores of websites providing f. and we will compensate for it later. According to the definitions above. Computing confidence of a fact. (This is not true in many cases. 2009 at 12:50 from IEEE Xplore. 2).e. and f1 is the only fact about object o1 (i. The second problem with our model is that the confidence of a fact f can easily be negative if f is conflicting with some 3. and multiplying many of them may lead to underflow. we should increase the confidence score of f1 according to the confidence score of f2 . the confidence of a fact f1 is determined by the websites providing it and other facts about the same object. f2 does not exist in Fig. they are very similar). actually. In comparison. We can compute the confidence of f based on à ðfÞ in the same way as computing it based on ðfÞ (defined in (4)). Lemma 1. if f1 is wrong. if a fact f is provided by five websites with a trustworthiness of 0. we use a logarithm and define the trustworthiness score of a website as ðwÞ ¼ À lnð1 À tðwÞÞ: ð3Þ ðfÞ : ð7Þ ðwÞ is between and zero and þ1. However. 2. . We use sà ðfÞ to represent this confidence:3 sà ðfÞ ¼ 1 À eÀ à where W ðfÞ is the set of websites providing f. Let us first analyze the simple case where there is no related fact. We define the adjusted confidence score of a fact f as X à ðfÞ ¼ ðfÞ þ  Á ðf 0 Þ Á impðf 0 ! fÞ: ð6Þ oðf 0 Þ¼oðfÞ w2W ðfÞ  is a parameter between zero and one.99. In order to facilitate computation and veracity exploration. we can show that (7) is equivalent to sà ðfÞ ¼ Proof. We can see that à ðfÞ is the sum of the confidence scores of f. 2 that the implication from f2 to f1 is very high (e. there are usually many different facts about an object (such as f1 and f2 in Fig. This is shown in the following lemma. 2). ðfÞ ¼ X w2W ðfÞ ðwÞ: ð5Þ 3. Therefore. Because f1 is provided by w1 and w2 . and a larger ðwÞ indicates higher trustworthiness. This assumption is often incorrect because errors can be propagated between websites. where 0 < < 1. Downloaded on January 9. we add a dampening factor into (7) à and redefine fact confidence as sà ðfÞ ¼ 1 À eÀ Á ðfÞ .2 Influences between Facts The above discussion shows how to compute the confidence of a fact that is the only fact about an object. which controls the influence of related facts.3 Handling Additional Subtlety Until now.g.. In (2).1. 2. and we have X lnð1 À sðfÞÞ ¼ lnð1 À tðwÞÞ w2W ðfÞ X ðwÞ: ( ) ðfÞ ¼ w2W ðfÞ u t Fig. then its confidence sðfÞ can be computed as sðfÞ ¼ 1 À Y ð1 À tðwÞÞ. f will have a confidence of 0. and f1 should have reasonably high confidence. if a fact f is the only fact about an object. Restrictions apply. we have described how to compute fact confidence from website trustworthiness. 1 À tðwÞ is usually quite small.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 799 Take the logarithm on both sides. and the probability that f1 is not wrong is 1 À ð1 À tðw1 ÞÞ Á ð1 À tðw2 ÞÞ. In order to compensate for the problem of overly high confidence. and a portion of the confidence score of each related fact f 0 multiplies the implication from f 0 to f. some of the websites may copy contents from others. We first assume that w1 and w2 are independent. The first problem is that we have been assuming that different websites are independent of each other. which are discussed below. As shown in Fig.

It involves nonlinear transformations and thus cannot be computed using eigenvector computation as in [7]. Bji ¼  Á impðfk ! fj Þ. by setting ( . sðfÞ is à 1 very close to sà ðfÞ because 1þeÀ Áà ðfÞ % 1 À eÀ Á ðfÞ . When Á à ðfÞ is significantly less than zero. and !  s and ! can be converted using (8). 3. s 1 N ƒ! à à à  ¼ ð ðf1 Þ. if we set sà ðfÞ ¼ 0 if it is negative according to (7). 2009 at 12:50 from IEEE Xplore. 3. Let the vectors be Authorized licensed use limited to: University of Illinois. . which is not reasonable because it is always possible that the fact is true even with negative evidence. This is unreasonable because 1) the confidence cannot be negative and 2) even with negative evidences. . which makes à ðfÞ < 0 and sà ðfÞ < 0. which is a variant of (7). otherwise: In comparison. we use vectors to represent the trustworthiness of all websites and the confidence of all facts. . ! t ¼ ðtðw1 Þ. s ƒ! à   ¼ B!: ð9Þ facts provided by trustworthy websites. To facilitate our discussions. which is consistent with the real situation. From (6) and Lemma 1. . Restrictions apply. We compare these two definitions in Fig. . . matrix B involves more factors because facts about the same object influence each other. ! ¼ ðsðf Þ. Moreover. 3. Please notice that (8) is also very similar to the Sigmoid function [12]. ! t ¼ A!. which has been successfully used in various models in many fields. 6.  Matrix A can be easily defined according to (1). NO. otherwise: ð11Þ Both A and B are sparse matrices. .  1 M Fig. That is why TRUTHFINDER requires iterative computation to achieve convergence. ! ¼ ððw Þ. . the website trustworthiness and fact confidence can be computed conveniently with matrix operations. this “chunking” operation and the multiple zero values may lead to unstable conditions in our iterative computation. . 20. In comparison. sðf ÞÞT . On the other hand. The above procedure is very different from AuthorityHub analysis proposed by Kleinberg [7]. which can be implemented easily and performed efficiently. Therefore. Two methods for computing confidence. . VOL. i. we can infer that 8 if wi provides fj . 4. 4. JUNE 2008 Fig. we adopt the widely used Logistic function [8]. .  ðfN ÞÞT : We want to define a M  N matrix A for inferring website trustworthiness from fact confidence and a N  M matrix B for the reverse inference. . Computing website trustworthiness and fact confidence with matrix operations. . so its confidence should still be above zero. . : 0. sðfÞ ¼ 1 1þ eÀ Áà ðfÞ : ð8Þ When Á à ðfÞ is significantly greater than zero. which cannot be defined as simple summations because the probability often needs to be computed in nonlinear ways. . sà ðfÞ decreases very sharply when à ðfÞ < 1. TRUTHFINDER studies the probabilities of websites being correct and facts being true. Downloaded on January 9. tðwM ÞÞT . It will be more convenient if we can convert the computational procedure into some basic matrix operations.e. if fj 2 F ðwi Þ. sðfÞ is close to zero but remains positive. ! t and ! can be converted from each other using (3). ð10Þ Aij ¼ 0. if wi provides fk and oðfk Þ ¼ oðfj Þ. ðw ÞÞT . 1 jF ðwi Þj. . there is still a chance that f is correct. < 1. as the final definition for fact confidence.800 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. as shown in Fig. In general. .2 Computing Website Trustworthiness and Fact Confidence with Matrix Operations We have described how to calculate website trustworthiness and fact confidence. sðfÞ decreases slowly and is slightly above zero when à ðfÞ ( 0. Authority-Hub analysis defines the authority scores and hub scores as the sum of each other. One can see that the two curves are very close when à ðfÞ > 3.

each fact has k À 1 related facts on the average. it takes OðLÞ time to propagate between website trustworthiness and fact confidence and OðkNÞ time to adjust fact confidence according to the interfact implication. At each iteration. each entry of B involves only one website and one fact. In both cases. it has very little information about the websites and the facts.1. we compare TRUTHFINDER with a baseline approach as described below.4 Complexity Analysis In this section. As will be explained in Section 5. and the total space requirement is OðL þ kNÞ. If t only changes a little after an iteration (measured by cosine similarity between the old and the new ! t ). Suppose there are L links between all websites and facts.) From the website trustworthiness TRUTHFINDER can infer the confidence of facts.2. They are calculated once and used at every iteration. 4. as defined in Section 3. Let us first look at the two matrices A and B. (t0 should be set to the estimated average trustworthiness. we compare it with a baseline approach called VOTING. 2009 at 12:50 from IEEE Xplore. The maximum difference between two iterations  Authorized licensed use limited to: University of Illinois. the overall time complexity is OðIL þ IkNÞ. and the second one contains the runtime of many movies provided by many websites.3 Iterative Computation As described above. If the implication between two facts can be computed in a very short constant time and we do not precompute the implication. The matrices are stored in sparse formats. TRUTHFINDER considers the implication between different facts from the first iteration and considers the different trustworthiness of different websites in the following iterations. This is the simplest approach and only uses the number of websites supporting each fact. there are OðkLÞ entries in B. and it takes OðLÞ time to compute A. Thus. Because different websites may provide the same fact. and thus.Net (C#). each iteration takes OðkLÞ time and no extra space. VOTING chooses the fact that is provided by most websites and resolves ties randomly. Initially. TRUTHFINDER first uses the website trustworthiness to compute the fact confidence and then recomputes the website trustworthiness from the fact confidence. In each step of the iterative procedure. and it takes OðkLÞ time to compute B. there is no existing approach to the problem studied in this paper. The two parameters in (8) are set as  ¼ 0:5 and ¼ 0:3. B contains more entries than A because Bji is nonzero if website wi provides a fact that is related to fact fj . In comparison. Each link between a website and a fact corresponds to an entry in A. 3. If we precompute the implication between all facts. All approaches are implemented using Visual Studio. As in Authority-Hub analysis [7] and PageRank [10]. we can discard the matrix operations and compute the website trustworthiness and fact confidence using their definitions in Section 3. If in some cases. 3. then the total space requirement is OðL þ M þ NÞ. then TRUTHFINDER will stop. TRUTHFINDER takes OðIkLÞ time and OðkL þ M þ NÞ space. Thus. L should be greater than N (number of facts). Algorithm of TRUTHFINDER. if we start from a uniform fact confidence.1 Experiment Setting In order to show the effectiveness of TRUTHFINDER. OðkLÞ space is not available. and the computational cost of multiplying such a matrix and a vector is linear with the number of nonzero entries in the matrix. Restrictions apply. TRUTHFINDER stops iterating when it reaches a stable state. Downloaded on January 9.9. Because each website can provide at most one fact about each object. Thus. The first real data set contains the authors of many books provided by many online bookstores. 5. On the other hand. we also need to calculate the two matrices A and B. [10]. and it stops when the computation reaches a stable state. We choose the initial state in which all websites have uniform trustworthiness t0 .66-GHz dual-core processor with 1 Gbyte of memory running Windows XP Professional. we analyze the complexity of TRUTHFINDER. which are very meaningful because the facts supported by many websites are more likely to be correct. it still takes constant time to compute each entry of B. When trying to find the true fact for a certain object. Thus. The overall algorithm is presented in Fig.YIN ET AL. such as 0. The time cost of multiplying a sparse matrix and a vector is linear with the number of entries in the matrix. we cannot infer meaningful trustworthiness for websites. we can infer the website trustworthiness if we know the fact confidence and vice versa. TRUTHFINDER needs an initial state. Before the iterative computation.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 801 Fig. and thus. A has L entries. Suppose there are I iterations. Each step only requires two matrix operations and conversions between tðwÞ and ðwÞ and between sðfÞ and à ðfÞ. Therefore. . 5. TRUTHFINDER adopts an iterative method to compute the trustworthiness of websites and confidence of facts. then OðkNÞ space is needed to store these implication values. Suppose on the average there are k facts about each object. TRUTHFINDER tries to improve its knowledge about their trustworthiness and confidence. using a single thread. As in other iterative approaches [7]. All experiments are performed on an Intel PC with a 1. 4 EMPIRICAL STUDY We perform experiments on two real data sets and synthetic data sets to examine the accuracy and efficiency of TRUTHFINDER. The stableness is measured by how much the trustworthiness of websites changes between ! iterations.

4. which are indicated by different bookstores. Each author name has a total weight of 1.43 seconds to finish the four iterations. “Graeme Simsion” will get a partial score of 5/6 because it omits the middle name of “Graeme C.73 seconds to precompute the implications between related facts and 4. 4. the authors. We compare the set of authors found by TRUTHFINDER for each book with the standard fact to compute the accuracy of TRUTHFINDER. TRUTHFINDER indicates that there are y authors.4 Sometimes TRUTHFINDER provides partially correct facts. Fig. For simplicity. only the first author). After the fourth iteration. This is because TRUTHFINDER considers the implications between different facts about the same object.. On the average. 7.031 listings (i.265 books about computer science and engineering. We will show in our experiments that the performance of TRUTHFINDER is not sensitive to these parameters.22 seconds. We assign different weights to different parts of persons’ names. We list the number of books in which each approach makes each type of errors. we manually compare the results of VOTING and TRUTHFINDER and the authors provided by Barnes & Noble on its website.001 percent. and the ratio between weights of the last name. Simsion and Graham Witt. which returns the online bookstores that sell the book and the book information from each store. Relative changes of TRUTHFINDER. Witt. and middle name is 3:2:1. including the price. For example. example. suppose the standard fact contains x authors. 6 shows the accuracies of TRUTHFINDER and VOTING.” and the authors found by TRUTHFINDER may be “Graeme Simsion and G. f2 has y authors. 7 shows the relative change of the trustworthiness vector after each iteration. .2. Fig. which is indicated by the horizontal axis. Fig. while VOTING does not. On the other hand. One can see that TRUTHFINDER is significantly more accurate than VOTING even at the first iteration. The implication between two sets of authors f1 and f2 is defined in a very similar way as the accuracy of f2 with respect to f1 .com.1 Book Authors The first real data set contains the authors of many books provided by many online bookstores. or Prentice Hall. Witt” will get a score of 4/5 with respect to “Graham Witt. where base sim is the threshold for positive implication and is set to 0.e. showing that TRUTHFINDER converges at a steady speed.g. Simsion.5. where all bookstores have the same trustworthiness. among which z authors belong to the standard fact. NO. TRUTHFINDER performs iterative computation to find out the set of authors for each book. bookstore selling a book). VOTING takes 1. and there are z shared ones.. 2009 at 12:50 from IEEE Xplore. JUNE 2008 Fig. the stop criterion is met.” We consider “Graeme Simsion” and “G. we give TRUTHFINDER a half score. “G. we use its ISBN to search on www. we do not consider the order of authors in this study.4 different sets of authors.” gets half of the score. Please notice that one approach may make multiple errors for one book. Restrictions apply. although TRUTHFINDER can report the authors in correct order in most cases.” If the standard name has a full first or middle name and TRUTHFINDER provides the correct initial. In order to test its accuracy. Witt” as partial matches for “Graeme C. McGraw-Hill. In Table 3. Because many bookstores provide incomplete facts (e.yÞ . The results are presented below. the implication from f1 to f2 will not be decreased. each book has 5. TRUTHFINDER tends to consider facts with more authors as Authorized licensed use limited to: University of Illinois. 6. For each book. and the first initial “G. If f1 has x authors. Morgan Kaufmann. As TRUTHFINDER repeatedly computes the trustworthiness of bookstores and the confidence of facts. The accuracy of TRUTHFINDER is defined z as maxðx. It contains 1. For a certain book. Accuracies of TRUTHFINDER and VOTING. its accuracy increases to about 95 percent after the third iteration and remains stable. and a short description. The change is defined as one minus the cosine similarity of the old and the new vectors. VOL.2 Real Data Sets We test the accuracy and efficiency of TRUTHFINDER on two real data sets. is set to 0. It takes TRUTHFINDER 8. The data set contains 894 bookstores and 34. we randomly select 100 books and manually find out their authors. Downloaded on January 9. We find the image of each book and use the authors on the book cover as the standard fact. Simsion” and “Graham Witt” and give them partial scores. 6. if f2 contains authors that are not in f1 .802 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. the standard set of authors for a book is “Graeme C.” because the first name has weight 2/5. VOTING tends to miss authors because many bookstores only provide subsets of all authors. then impðf1 ! f2 Þ ¼ z=x À base sim.abebooks. We can see that the relative change decreases by about one order after each iteration. first name. 20. For example. which are published by Addison-Wesley. For 4.

3 different runtimes provided by different websites. we query Google with 00 forrest gump00 þ runtime and parse the result page that contains digests of the first 100 results. We randomly select 100 movies and find their runtimes on IMDB. It has redundant names for many books.” it says the authors are “Robert J.6 The accuracy of each bookstore is tested on the 100 randomly selected books in the same way as we test the accuracy of TRUTHFINDER.” TRUTHFINDER infers that its authors are “Dilip C. we include Barnes & Noble here because we have manually retrieved its authors for the 100 books for testing. . TRUTHFINDER also finds some large trustworthy bookstores such as A1 Books. For example. We found 17.75.109 useful digests. If there are two facts f1 and f2 about a movie. 6.7 We collect the runtime data of each movie using Google.727 websites. The book “Server Storage Technologies for Windows 2000. for a book written by “Robert J. Naik. However. Please see http://imdb. Restrictions apply. There are nine bookstores saying that its authors are “Dilip C. that the results suggest that there may be better alternatives than Google for finding accurate information on the Web. Please notice that TRUTHFINDER uses no training data. Authorized licensed use limited to: University of Illinois.2. Muller.yÞ . which are the movies with top ratings in every genre on IMDB. Amazon makes the same mistake as TRUTHFINDER. We give an example mistake made by TRUTHFINDER.zÞ À base sim.” seven saying “Dilip C. Naik. 2009 at 12:50 from IEEE Xplore. and f2 ¼ z. One may think that the largest bookstores will provide accurate information.YIN ET AL. TRUTHFINDER.” Actually Barnes & Noble is less accurate than TRUTHFINDER even if we do not consider such errors. Our data set does not contain the largest bookstores such as Barnes & Noble and Amazon. we compare the online bookstores that are given highest ranks by Google with the bookstores with highest trustworthiness found by TRUTHFINDER. Naik and Dilip Naik. we perform an interesting experiment on finding trustworthy websites. The results are shown in Table 4. 4.com.com/chart/. com is excluded from our data set). It is well known that Google (or other search engines) is good at finding authoritative websites.com. Because of the authority of IMDB.” and five saying “Dilip Naik. in order to test the accuracy of TRUTHFINDER. For example. Windows Server 2003. Downloaded on January 9. If we see terms like “130 minutes” or “2h10m” appearing after “runtime” (or “run time”).878.com.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 803 TABLE 3 Comparison of the Results of VOTING. which contain information from 1.abebooks. we believe 5. Finally. If the standard runtime for a movie is x minutes and TRUTHFINDER infers that its runtime jyÀxj is y minutes. then the accuracy is defined as maxðx. On average. which is surprisingly untrue according to our experiment. where base sim is the threshold for positive implication and is set to 0.2 Movie Runtime The second real data set contains runtimes of movies provided by many websites. then the implication from f1 to f2 is defined as jyÀzj impðf1 ! f2 Þ ¼ maxðy. We only consider bookstores that provide at least 10 of the 100 books. Table 3 shows that Barnes & Noble has more errors than VOTING and TRUTHFINDER on these 100 randomly selected books. we consider the runtime it provides as the standard facts (information from IMDB. each movie has 14. However. This query was submitted on 7 February 2007. 7. which provides 86 of 100 books with an accuracy of 0. and Barnes & Noble TABLE 4 Comparison of the Accuracies of Top Bookstores by TRUTHFINDER and by Google correct facts because of our definition of implication for book authors and thus makes more mistakes of adding in incorrect names. where f1 ¼ y. Muller. and Beyond” is written by Dilip C. We query Google with “bookstore”5 and find all bookstores that exist in our data set from the top 300 Google results. However. Naik and Dilip Naik. do these websites provide accurate information? To answer this question. It contains 603 movies. TRUTHFINDER can find bookstores that provide much more accurate information than the top bookstores found by Google. Muller. which may explain why the mistake is propagated to so many bookstores. Therefore. we consider such terms as the runtime of this movie. for the movie Forrest Gump. because they do not list their books on www. and the testing data is manually created by reading the authors’ names from book covers.” which contains redundant names.

Fig. the influence is still very limited. The following experiments show that the accuracy of TRUTHFINDER is only very slightly affected by these two parameters. and the accuracy only varies by about 0.  controls the degree of influence between related facts (i. Again. and there are four websites providing each fact on average (i. 3. we find that TRUTHFINDER is much more accurate than VOTING. We generate data sets containing M websites and N facts. and it takes five iterations to meet the stop criterion. 20. 4. VOL. Again. Figs. Then. However. We also investigate how the parameters influence the convergence of the algorithm. probably because a smaller means a smaller influence of à ðfÞ on sðfÞ (there is no influence when ¼ 0). Downloaded on January 9.5 percent even when is changed significantly. and ¼ 0:3. 9. This is reasonable because interfact influences add complexity to the problem and will probably make it slower to converge. There are N=5 objects (i. Relative changes of TRUTHFINDER. In comparison. TRUTHFINDER also converges faster when is smaller. the facts of o are drawn from a uniform distribution TABLE 5 Comparison of the Results of VOTING and TRUTHFINDER on Movie Runtime Authorized licensed use limited to: University of Illinois. 4. has more influence on TRUTHFINDER than .3 Synthetic Data Sets In this section. TRUTHFINDER converges faster when  is smaller. Restrictions apply. The value range is randomly placed so that the true value vðoÞ falls at any position in it with equal probability. if t ¼ 0:7. For example. TRUTHFINDER achieves very high accuracy on this data set. with different s and s. we compare the numbers of errors of VOTING and TRUTHFINDER in different cases. which means that the influence between different facts is smaller. . each object has five facts on average). which is a random number drawn from a uniform distribution on interval [1. when TRUTHFINDER is applied on the book author data set. 9 shows the relative change after each iteration of the trustworthiness vector of TRUTHFINDER. Suppose the value for object o is vðoÞ. we find that the trustworthy websites found by TRUTHFINDER provide much more accurate information than those ranked high by Google. the error rate of VOTING is more than twice that of TRUTHFINDER.  ¼ 0:5. Fig. VOTING takes 1. It can be seen that the accuracy of TRUTHFINDER is very stable when  ! 0:1. 10b shows the accuracy with different s on the two data sets (with  ¼ 0:5). 2t À 1Þ to   minð2t. 1Þ.1. with a relative change of less than 10À5 after five iterations. Accuracies of TRUTHFINDER and VOTING.4. However.e. This query was submitted on 7 February 2007. reaching 99 percent after three iterations. (We do not show experiments on the movie runtime data set because of limited space. we test the scalability and noise resistance of TRUTHFINDER on synthetic data sets. 1]. Each object has a true value. then the range for website trustworthiness is [0.e. 6. 10a shows the accuracy of TRUTHFINDER with different s on the two data sets (with ¼ 0:3). 10.19 seconds for five iterations. We create a value range for o that is an interval of length vðoÞ=2. Fig. which makes it very easy to find the true value by taking the average of all facts.2. Fig. and determines the shape of the confidence curve in Fig.3 Parameter Sensitivity There are two important parameters in the computation website trustworthiness and fact confidence— and in (6) and (8) in Section 3.. In Table 6. 4N links between websites and facts). In Table 5.) In the 8.. 11a and 11b show the relative changes after each iteration of TRUTHFINDER. The expected trustworthiness  of each website is t.000. 8 shows the accuracies of TRUTHFINDER and VOTING on the movie runtime data set. 2009 at 12:50 from IEEE Xplore.89 seconds for initialization and 3. we do not want the true value to be at the middle of the interval. JUNE 2008 Fig. facts about the same object).e. we compare the most trustworthy websites by TRUTHFINDER and top ranked movie websites by Google (with query “movies”8).55 seconds. and the trustworthiness of each website  is drawn from a uniform distribution from maxð0. We want to generate the facts by randomly generating some numbers on an interval around the true value. 8. It can be seen that TRUTHFINDER still converges at a steady speed. By default. Fig. It takes TRUTHFINDER 2. The facts of each object are real numbers that do not deviate too much from the true value. figures.804 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING.. we can see that the algorithm always converges rapidly. NO.000].

000 objects and 50.000 facts. In the following experiments. and the minor superlinear part should come from the usage of dynamically growing data structures (e.000. which is very close to being linearly scalable. Its runtime increases 118 times as the number of objects grows 100 times. (b) Convergence with . (a) Convergence with . 13.000. 11.. In fact. we study the scalability with respect to the number of websites.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 805 TABLE 6 Comparison of the Accuracies of Top Movie Websites by TRUTHFINDER and by Google Fig.g. The number of websites is fixed at 1. Relative changes after each iteration with different s and s. which is consistent with our complexity analysis. and the number of websites varies from 100 to 10. vectors and hash tables). (a) Accuracy with respect to .000 to 500. TRUTHFINDER does not precompute implications between facts (because they can be easily computed on the fly) and performs five iterations. Downloaded on January 9. Restrictions apply. It can be seen that the time and memory usage almost remain unchanged when the number of websites grows 100 times. on this value range. There are 10. . 12. We first study the time and space scalability of TRUTHFINDER with respect to the number of facts. possibly because of the fixed part of memory usage. and the number of facts varies from 5. (b) Accuracy with respect to .YIN ET AL. 10. we study how well TRUTHFINDER can do when the expected trustworthiness varies from zero to one. Fig. Accuracy of TRUTHFINDER with respect to  and . We compare the accuracies (as defined on the movie runtime data set) of TRUTHFINDER and VOTING in Fig. Finally. Then. as shown in Fig. it grows sublinearly. The memory usage is also linearly scalable. The results are shown in Fig.000. 14a and the Authorized licensed use limited to: University of Illinois. 2009 at 12:50 from IEEE Xplore.

When the expected trustworthiness is zero. (a) Time. by considering websites as hubs (both of them indicate others’ authority weights) and facts as authorities. these two problems are very different. (b) Memory. There have been studies on what factors of data quality are important for users [13] and on machine learning approaches for distinguishing high-quality and low-quality web pages [9]. their accuracy is more than 99. Fig. In Authority-Hub analysis. a hub’s weight is Authorized licensed use limited to: University of Illinois. An analogy can be made between this problem and AuthorityHub analysis. Two observations are made in our experiments: 1) even the most popular website (e. Downloaded on January 9.) However.806 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. percentages of objects for which exactly correct facts are found in Fig. 14b. (It can be shown that even if the true value and the fact found are two random numbers between 1. Restrictions apply. whereas some comparatively not-so-popular websites may provide more accurate information. their accuracies grow sharply as the expected trustworthiness increases. It is also shown that information quality measures can help improve the effectiveness of Web search [15].500.5 percent.87. the accuracy is as high as 0. where the quality is defined by human preference.5 percent. two pieces of groundbreaking work. Barnes & Noble) may contain many errors. which is also shown by a subsequent study [1]. 5 RELATED WORK The quality of information on the Web has always been a major concern for Internet users [11]. VOL. 6. In 1998. NO. Scalability of TRUTHFINDER with respect to the number of facts. the popularity of web pages does not necessarily lead to accuracy of information.g. 12. 20. PageRank [10] and Authority-Hub analysis [7]. Unfortunately. (a) Time. TRUTHFINDER studies the interaction between websites and the facts they provide and infers the trustworthiness of websites and confidence of facts from each other. TRUTHFINDER and VOTING are not doing much better than random. and the percentage of exact match for TRUTHFINDER is 98.000 and 1. and 2) more accurate information can be inferred by using many different websites instead of relying on a single website. Scalability of TRUTHFINDER with respect to the number of websites.5. the authors propose a framework of link analysis and provide theoretical studies for many link-based approaches. and Authority-Hub analysis cannot be applied to our problem. 2009 at 12:50 from IEEE Xplore.2 percent. However. These two approaches are very successful at identifying important web pages that users are interested in. In [3]. 13. This shows that TRUTHFINDER can find the true value even if there are many untrustworthy websites. were proposed to utilize the hyperlinks to find pages with high authorities. and that for VOTING is 90. When the expected trustworthiness is 0. JUNE 2008 Fig. .. (b) Memory.

13th Int’l Conf. 2005. R. Microsoft Research. Experiments show that TRUTHFINDER achieves high accuracy at finding true facts and at the same time identifies websites that provide more accurate information.S. and J. Theory. Security and Privacy (ISSP ’96).M.” Results of a Nat’l Survey of Internet Users for Consumer Reports WebWatch. Another difference between TRUTHFINDER and Authority-Hub analysis is that TRUTHFINDER considers the relationships (implications) between different facts and uses such information in inferring the confidence of facts. “SimRank: A Measure of Structural-Context Similarity. “Link Analysis Ranking: Algorithms. G. Management Information Systems. Widom. Rosenthal. Heckerman. the approach will improve the current state by propagating information (weights. probability. etc.” Proc. J. Downloaded on January 9. and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the funding agencies. Any opinion.G. Logistical Equation from Wolfram MathWorld. 2004. Brin. we introduce and formulate the Veracity problem. IEEE Symp. and T. Mandl. Raghavan. Jeh and J. 5. vol.” J. Collaborative filtering [4] infers the similarity between objects based on their ratings to or from other objects. R. no.html. it needs to be computed using some nonlinear transformations according to a probabilistic analysis. vol. “Leap of faith: Using the Internet Despite the Dangers. 2008. Borodin. 5.com) to determine the trust relationship between each pair of users. no.com/SigmoidFunction. Kleinberg. Fig. an approach that utilizes the interdependency between website trustworthiness and fact confidence to find trustable websites and true facts. 231-297. “Authoritative Sources in a Hyperlinked Environment. Tomkins. (a) Accuracy. and Experiments. pp. 604-632. and P. World Wide Web (WWW). no. 1999. Wang and D. “Empirical Analysis of Predictive Algorithms for Collaborative Filtering.” ACM Trans. which defines the similarity between two objects as the average similarity between objects linked to them. wolfram. “Does ‘Authority’ Mean Quality? Predicting Expert Quality Ratings of Web Documents. 1998. and W. “Decentralized Trust Management. 6 CONCLUSIONS In this paper. Kumar.html.C. This is unreasonable in computing the trustworthiness of a website.” technical report. 4. 2009 at 12:50 from IEEE Xplore. ACM SIGIR ’00. There are also studies on link-based similarity analysis [6].M. Hill.” Proc.g. In [5]. TRUTHFINDER uses iterative methods to compute the website trustworthiness and fact confidence. Terveen. M. [7]. computed by summing up the weights of authorities linked to it. [6]. “Beyond Accuracy: What Data Quality Means to Data Consumers. user ratings on eBay. ACKNOWLEDGMENTS This work was supported in part by the US National Science Foundation Grants IIS-05-13678/06-42771 and NSF BDI-0515813. 17th ACM Conf. Tsaparas.: TRUTH DISCOVERY WITH MULTIPLE CONFLICTING INFORMATION PROVIDERS ON THE WEB 807 widely used in many link analysis approaches [5]. Oct. and a website providing many inaccurate facts is an untrustworthy one. which aims at resolving conflicting facts from multiple websites and finding the true facts among them.) through the links. Kadie. A. Internet Technology. 12. http://mathworld. Guha.Y. S. which is Authorized licensed use limited to: University of Illinois. Feigenbaum.” Proc. Aug. July 2002. the authors propose an approach that uses the trust or distrust relationships between some users (e. Sigmoid Function from Wolfram MathWorld. Lacy. We propose TRUTHFINDER. Roberts. P. [14]. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] B. 5-34. Princeton Survey Research Associates International. Blaze.YIN ET AL.” Proc. J. http:// mathworld.wolfram. 2005. and A. 14. the confidence of a fact is not simply the sum of the trustworthiness of the websites providing it. Noise resistance of TRUTHFINDER and VOTING with respect to website trustworthiness.com/LogisticEquation. “The PageRank Citation Ranking: Bringing Order to the Web. J. pp. we adopt it in TRUTHFINDER. 1. “Implementation and Evaluation of a Quality-Based Search Engine. 2006. ACM.S. J. T. at each iteration. Motwani. Hypertext and Hypermedia. R. findings. pp. The common feature of these approaches is that they start from some initial state that is either random or uninformative. (b) Percentage of exact match. vol. This is related to existing studies on inferring similarities between objects using links. Then. 2008. Restrictions apply. Stanford Digital Library Technologies Project. Page. Breese. ACM SIGKDD ’02. “Propagation of Trust and Distrust. trustworthiness. and thus. Moreover. 1998.” Proc. Strong. G. July 2000.. [10]. L. Winograd. . L. R. Amento. 1997.” J.” technical report. [14]. 46. May 1996.O. D. This iterative procedure has been proven to be successful in many applications. and C. Instead. because a trustworthy website should be one that provides accurate facts instead of many of them.

the MS and PhD degrees in electrical engineering from Stanford University. Zhu and S. Yu. He is on the steering committee of the IEEE Conference on Data Mining and was a member of the IEEE Data Engineering steering committee.J. ACM SIGIR ’00. the Second IEEE International Workshop on Research Issues on Data Engineering: Transaction and Query Processing. He has received several IBM honors. with more than 350 conference and journal publications. “LinkClus: Efficient Clustering via Heterogeneous Semantic Links. and performance modeling. VOL. He is a fellow of the ACM and the IEEE. and software bug mining. Sept. He was the editor in chief of IEEE Transactions on Knowledge and Data Engineering from 2001 to 2004. the IEEE Workshop on Mining Evolving and Streaming Data (2006). Downloaded on January 9. July 2000. E-commerce and Eservices (EEE ’06). He has chaired or served on many program committees of international conferences and workshops. Gauch. biological data mining. Yu received the BS degree in electrical engineering from the National Taiwan University. respectively. He holds or has applied for more than 300 US patents. Xiaoxin Yin received the BE degree from Tsinghua University in 2001 and the MS and PhD degrees in computer science from the University of Illinois. 32nd Int’l Conf. and P. the Journal of Computer Science and Technology. 2006. he was the program chair or cochair of the IEEE Workshop of Scalable Stream Processing Systems (SSPS ’07). and also holds the Wexler chair in information and technology.org/publications/dlib. the PAKDD Workshop on Knowledge Discovery from Advanced Databases. He received a Research Contributions Award from the IEEE International Conference on Data Mining in 2003 and also an IEEE Region 1 Award for “promoting and perpetuating numerous new electrical engineering concepts” in 1999. including the ACM SIGKDD Innovation Award in 2004 and the IEEE Computer Society Technical Achievement Award in 2005. and the Second IEEE International Conference on Data Mining. and their applications on the World Wide Web. the Ninth ACM Sigmod Workshop on Research Issues in Data Mining and Knowledge Discovery. Authorized licensed use limited to: University of Illinois. an editor. 6. JUNE 2008 [14] X. and the Journal of Intelligent Information Systems. and the MBA degree from New York University. text and Web mining. and the Second IEEE International Workshop on Advanced Issues of E-commerce and WebBased Information Systems. NO. University of Illinois.” Proc. the Sixth Pacific Area Conference on Knowledge Discovery and Data Mining. including two IBM Outstanding Innovation Awards. He has been working on research into data mining. an advisory board member.S. University of Illinois. “Incorporating Quality Metrics in Centralized/Distributed Information Retrieval on the World Wide Web.” Proc. Han. His research interests include data mining. spatiotemporal and multimedia data mining. For more information on this or any other computing topic. He has published 14 papers in refereed conference proceedings and journals. Philip S. information network analysis. data warehousing. the 11th IEEE International Conference on Data Engineering. . Internet applications and technologies. He has been working in the area of data mining since 2001. 20. J. Watson Research Center. two Research Division Awards.computer. He was the manager of the Software Tools and Techniques Group at the IBM T. Chicago. He served as the general chair or cochair of the 2006 ACM Conference on Information and Knowledge Management. database systems. and the 93rd Plateau of Invention Achievement Awards. He is currently serving as the founding editor in chief of the ACM Transactions on Knowledge Discovery from Data and on the board of directors for the executive committee of the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Yin. He is an associate editor of the ACM Transactions on the Internet Technology and the ACM Transactions on Knowledge Discovery from Data. stream data mining. He was an IBM Master Inventor. 2009 at 12:50 from IEEE Xplore. the 2006 Joint Conferences of the Eighth IEEE Conference on E-commerce Technology (CEC ’06) and the Third IEEE Conference on Enterprise Computing. Very Large Data Bases (VLDB ’06). please visit our Digital Library at www. He has published more than 500 papers in refereed journals and conference proceedings. He has received many awards and recognitions. and his research work is focused on multirelational data mining and link analysis. in 2003 and 2007. [15] X. multimedia systems. He had also served as an associate editor of Knowledge and Information Systems. He is a professor in the Department of Computer Science. In addition to serving as a program committee member on various conferences.808 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. He also served or is serving on the editorial boards of Data Mining and Knowledge Discovery. and also a guest coeditor of the special issue on mining of databases. parallel and distributed processing. Microsoft Research. Urbana-Champaign. He is a researcher at the Internet Services Research Center. . Jiawei Han is a professor in the Department of Computer Science. Restrictions apply. the 14th IEEE International Conference on Data Engineering. He is a senior member of the IEEE and a fellow of the ACM. UrbanaChampaign. the IEEE Transactions on Knowledge and Data Engineering. an Outstanding Technical Achievement Award.