AND INTEGRATION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Parag Agrawal
August 2012
© 2012 by Parag Agrawal. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jeffrey Ullman
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Peter Haas
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
Contents
1 Introduction 1
1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Contributions and Thesis Outline . . . . . . . . . . . . . 2
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Preliminaries 7
2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Uncertain Databases . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Queries over Uncertain Databases . . . . . . . . . . . . . . . . 8
2.2 ULDB Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Generalized Uncertain Databases 28
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.2 Query Semantics . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.1 Uncertainty Evaluation . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Uncertain-Data Integration 47
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Containment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Uncertain Databases . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Containment Definitions . . . . . . . . . . . . . . . . . . . . . 52
5.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Queries, Views, and Sources . . . . . . . . . . . . . . . . . . . . . . . 57
5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5.1 Intensional complexity . . . . . . . . . . . . . . . . . . . . . . 60
5.5.2 Extensional complexity . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Query Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.6.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.6.2 Superset-Containment . . . . . . . . . . . . . . . . . . . . . . 70
5.6.3 Equality-Containment . . . . . . . . . . . . . . . . . . . . . . 71
5.7 Monotonic Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.8 Incorporating Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 76
5.8.1 Source Reliability . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.10 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 High-Confidence Joins 83
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.2 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.3 Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.4 Top-k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.5 Sorted-Threshold . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4 Efficiency Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5.2 Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.5.3 Top-k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5.4 Sorted . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.7 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5.2 Alternative Counts . . . . . . . . . . . . . . . . . . . . . . . . 130
7.5.3 X-tuple Counts . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.5.4 Other Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.6 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 133
7.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Summary 136
Bibliography 139
Chapter 1
Introduction
1.1 Applications
We briefly discuss some applications that motivate our work on uncertain data
management. We specifically highlight the kind of uncertainty they introduce.
• Information extraction: Extracted data can often be uncertain. For example,
an extractor may only be able to
determine that the director of a movie is either Robert or Richard. Furthermore,
the extraction system might output confidence values of .7 that the director is
Robert and .3 that it is Richard.
• Entity resolution: There are often multiple ways of referring to the same
real-world entity. The process of determining if two data items refer to the same
entity is inherently approximate. As a result, entity resolution often results in
uncertain information. For example, an algorithm might determine that Bob
Doe and Robert D. refer to the same person with confidence at least .8.
• Sensor data: Sensors aim to measure the real world, but are imprecise due to
limitations in the physical world. For example, a temperature sensor measurement
may be described by a Gaussian distribution around a reported reading of 90F,
with a standard deviation of 5F.
The key challenge in uncertain data management is to efficiently represent and query
large amounts of data with the various forms of uncertainty detailed above.
Continuous Uncertainty
Many efforts in the field of uncertain databases use a data model based on a set of
alternative values for tuples, and/or probabilities (or confidences) assigned to a tuple’s
existence. However, some applications must manage data that cannot be represented
with such forms of discrete uncertainty. For example, sensor measurements may be
described by a Gaussian distribution around the reported reading (based on sensor
calibration); astronomical data may contain locations described by two-dimensional
Gaussians; a database that records time-of-day may use intervals bounding the possible
time; predictive models stored in a database may include continuous probability
distributions.
Trio is a system developed at Stanford for managing data, uncertainty, and lin-
eage [ABS+ 06]. In Chapter 3, we discuss extensions to Trio for incorporating continuous
uncertainty. Data items with uncertain possible values drawn from a continuous do-
main are represented through a generic set of functions. Our approach enables precise
and efficient representation of arbitrary probability distribution functions, along with
standard distributions such as Gaussians. We also describe how queries are processed
efficiently over this representation, without knowledge of specific distributions. For
queries that cannot be answered exactly, we can provide approximate answers using
sampling or histogram approximations, offering the user a cost-precision trade-off.
Our approach exploits Trio’s lineage and confidence features, with smooth integration
into the overall data model and system.
Generalized Uncertainty
Uncertain database systems enable management of incomplete or imprecise information.
However, most uncertain databases, including our own work in Trio, require that
exact confidence values are attached to the data being managed. In some applications,
confidence values may be known imprecisely or coarsely, or be missing altogether.
In Chapter 4, we introduce the notion of generalized uncertain databases to manage
such incomplete uncertainty. We propose a semantics for generalized uncertain
databases based on Dempster-Shafer theory [Sha76] that is consistent with previous
semantics for uncertain databases. We present an extension of the representation
scheme used by Trio in order to represent any generalized uncertain database. Finally,
we adapt Trio’s query processing techniques to operate over this new representation.
High-Confidence Joins
In uncertain databases, users often seek to compute high-confidence results for queries.
In Chapter 6, we discuss specialized algorithms for join queries that seek to prefer-
entially return result tuples with the highest confidence. We restrict ourselves to
a special case of uncertain databases without alternative values, called probabilistic
databases.
When joining uncertain data, confidence values are assigned to result tuples based
on combining confidences from the input data. To preferentially access high-confidence
results, users may wish to apply a threshold on result confidence values, ask for
the “top-k” results by confidence, or obtain results sorted by confidence. Efficient
algorithms for these types of queries can be devised by accessing the input data sorted
by confidence and exploiting the monotonicity of the confidence combination function.
Previous algorithms for these problems assumed sufficient memory was available for
processing. We address the problem of processing all three types of queries when
sufficient memory is not available, minimizing retrieval cost. All our algorithms are
proven to be close to optimal.
optimization.
Chapter 7 describes the first steps towards building a native query optimizer in Trio.
We demonstrate that there is indeed an opportunity to obtain significantly increased
performance when the well-defined structure of Trio relations (and their encoding) is
exploited. We then present several indexing techniques, study new statistics, design
histograms for accurately estimating our new statistics, and describe a new “interesting
order” and its associated operator to exploit this opportunity.
Preliminaries
In this chapter, we set up some preliminaries that the rest of the thesis relies on. We
start in Section 2.1 by reviewing basic definitions related to the concept of uncertain
databases. A lot of the work in this thesis is in the context of the Trio system, which
is a specific instantiation of an uncertain database system. Section 2.2 reviews the
ULDB data model used by the Trio system to represent uncertain data, and query
semantics in Trio.
2.1 Definitions
The term probabilistic databases is also commonly used to refer to uncertain databases.
sensor-id zone
1 A || B
Location(sensor-id,zone)
(1,A):0.4 || (1,B):0.6
(2,C):0.7
The above table represents an uncertain database (Definition 1) with four possible
worlds, and a probability distribution over them:
ID Location(sensor-id,zone)
11 (1,A):0.4 || (1,B):0.6
12 (2,C):0.7
ID Reading(sensor-id,temp)
21 (1,77)
22 (2,81)
ID Temp(zone,temp)
31 (A,77):0.4 || (B,77):0.6
32 (C,81):0.7
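To make the possible-worlds reading concrete, the following illustrative sketch (not part of Trio; all names hypothetical) enumerates the four possible worlds of the Location table above, treating the two alternatives of x-tuple 11 as mutually exclusive and x-tuple 12 as a maybe-tuple present with probability 0.7:

```python
from itertools import product

# X-tuple 11 always contributes exactly one alternative (its
# confidences sum to 1); x-tuple 12 may be absent (with prob. 0.3).
xtuple_11 = [(("1", "A"), 0.4), (("1", "B"), 0.6)]
xtuple_12 = [(("2", "C"), 0.7), (None, 0.3)]  # None = tuple absent

worlds = []
for (t1, p1), (t2, p2) in product(xtuple_11, xtuple_12):
    world = [t for t in (t1, t2) if t is not None]
    worlds.append((world, p1 * p2))

for world, prob in worlds:
    print(world, round(prob, 2))

assert len(worlds) == 4
assert abs(sum(p for _, p in worlds) - 1.0) < 1e-9
```

The four world probabilities (0.28, 0.12, 0.42, 0.18) sum to 1, as Definition 1 requires.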
Chapter 3
Continuous Uncertainty in Trio
The Trio system manages discrete uncertainty: a limited number of possible values
for data items are permitted. In this chapter, we present extensions to Trio for
incorporating continuous uncertainty, where an infinite number of possible values are
permitted for each data item. The work presented in this chapter initially appeared
in [AW09b].
3.1 Introduction
Many efforts in the field of uncertain databases, including our own work in Trio, use a
data model (Section 2.2) based on a set of alternative values for tuples. However, some
applications must manage data that cannot be represented with such forms of discrete
uncertainty. For example, sensor measurements may be described by a Gaussian
distribution around the reported reading (based on sensor calibration) [FGB02];
astronomical data may contain locations described by two-dimensional Gaussians;
a database that records time-of-day may use intervals bounding the possible time;
predictive models stored in a database may include continuous probability distributions.
There have been a few proposals for managing continuous uncertainty in a DBMS.
Reference [FGB02] suggests using an abstract data type for continuous uncertain
values, focusing specifically on Gaussian distributions. The Orion system [SMS+ 08]
generalizes to other distributions using discrete histogram approximations, and defines
CHAPTER 3. CONTINUOUS UNCERTAINTY IN TRIO 13
a semantics for interpreting SQL over this data. In this chapter we describe how to
incorporate continuous uncertainty into Trio. Our proposal is more general than the
previous work in both data representation and querying, but admittedly it is not
implemented yet. More detailed discussion of related work is provided in Section 3.6.
Next we describe in brief the key components of our proposal: semantics of the
data model and query language, data representation in the system, query processing
over this representation, interfaces to the uncertain data, and integration into other
aspects of Trio, including Trio’s model for discrete uncertainty and its important
lineage feature. Later sections go into more detail on each of these components.
Data Model Semantics: Models for uncertain data usually are based on possible
worlds: a set of possible instances for the database. With discrete uncertainty, an
uncertain database always represents a finite set of possible worlds; with continuous
uncertainty, we may represent an infinite set. Specifically, the value of an uncertain
attribute may be an arbitrary probability distribution function (pdf) over a continuous
domain, describing the possible values for the attribute. For example, an uncertain
temperature may be described as a Gaussian distribution around a mean temperature
of 50 with variance 12; a time may be described as a uniform distribution between
9:20 and 9:25. Gaussian and uniform distributions are special cases of pdfs, as are
the discrete alternatives on which Trio was based originally. In our model, a pdf may
also be over a multidimensional domain, representing multiple correlated uncertain
attributes. For example, predicted object locations may be comprised of uncertain
correlated latitude-longitude pairs, i.e., two-dimensional pdfs.
Representation: Previous work [FGB02, SMS+ 08] has used symbolic representa-
tions for a set of known distribution types, and histogram approximations for the rest.
For example, Gaussian distributions can be represented by mean and variance; uniform
distributions by their endpoints. We have taken a different approach to handling a
wide class of distributions: we represent a pdf by a set of functions. For example, a
pdf ρ can be represented by a pair of functions:
• The function sample(low,high) returns a random value between low and high
according to ρ. For example, sample(3,4) for a Gaussian distribution with mean
4 is more likely to return a value close to 4 than close to 3.
Query Processing: Recall from Chapter 2 that in Trio, query results include
lineage identifying the data from which each result value was derived. Lineage is
needed to represent uncertainty correctly, and to compute result confidence values
lazily [BDHW06b]. In the presence of pdfs, we make all processing over pdf values
lazy, deferring potentially expensive computations until they are needed. At query
time, we simply generate lineage, and for those results involving pdfs, we extend
the lineage to include relevant predicates and mappings. The information contained
in the lineage is sufficient to be able to compute the functions that encode the
result lazily. For example, a query asking for temperatures less than 60 would
generate lineage annotated with the “< 60” predicate. Now if we have a result pdf ρ′
whose lineage points to a pdf ρ, the function weightρ′(low,high) on ρ′ is translated to
weightρ(low, min(high,60)) / weightρ(−∞,60) on ρ.
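As an illustrative sketch (function names hypothetical, not Trio's actual code), the lazy translation of weight through a "< 60" lineage predicate can be written for a Gaussian base pdf as:

```python
import math

# Base pdf rho: a Gaussian reading; weight(l, h) is the probability
# mass in [l, h], computed from the Gaussian CDF via erf.
def gaussian_weight(mu, sigma, low, high):
    cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    return cdf(high) - cdf(low)

# Result pdf rho' = rho conditioned on the "< 60" predicate in lineage:
# weight_rho'(low, high) = weight_rho(low, min(high, cutoff))
#                          / weight_rho(-inf, cutoff)
def translated_weight(mu, sigma, low, high, cutoff=60.0):
    num = gaussian_weight(mu, sigma, low, min(high, cutoff))
    den = gaussian_weight(mu, sigma, -math.inf, cutoff)
    return max(num, 0.0) / den

# All mass of the result pdf lies below the cutoff:
assert abs(translated_weight(50, 12, -math.inf, 60) - 1.0) < 1e-9
```

No computation on ρ happens until translated_weight is actually called, mirroring the lazy evaluation described above.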
The “translation” approach motivated above can be used to answer many queries
efficiently. However, for expensive queries we support approximate answers, either
using the sample function, or a histogram based on the weight function; both options
offer the user a cost-precision trade-off. Query processing is modular and generic: our
approach of mapping functions on result data to functions on data in the result’s
lineage is independent of specific distributions—all functions can be treated as “black
boxes.” However, for specific distributions we do introduce specialized processing that
is more efficient.
Integration into Trio: In addition to lineage and discrete uncertainty, recall from
Chapter 2 that Trio’s data model includes confidence values associated with tuples,
denoting the probability of the tuple existing. In our new model and query language
for continuous uncertainty, operations tend to modify the probability of a tuple’s
existence. Thus, the confidence feature is particularly helpful for integrating pdfs into
Trio. Consider for example the predicate “≤ 4” applied to a uniform distribution over
[3, 7]. The result can be represented by a uniform distribution over [3, 4] that exists
with probability 0.25, or more generally one fourth of the original probability.
Not only do Trio’s previous features (lineage and confidences) help integrate
continuous uncertainty into the system, but the ability to represent and query pdfs
enhances some pre-existing Trio functionality. For example, aggregate query results
[MW07] may naturally be represented as continuous pdfs, and pdfs over discrete
domains can sometimes yield more efficient representations than tuple alternatives,
for certain types of data.
We discuss data model and query semantics in Section 3.2, representations and
interfaces in Section 3.3 and query processing in Section 3.4. Section 3.6 presents
related work, and we conclude with future work in Section 3.7.
Pdf Attributes: Recall from Section 2.2 that an x-tuple is comprised of tuple
alternatives. We now allow pdf attributes within x-tuple alternatives. The value for
a pdf attribute is given by a probability distribution function over a given domain.
The domain may be discrete or continuous. For example, P (A) = 0.4; P (B) = 0.6
is a pdf over a discrete domain {A, B}; a pdf of a Gaussian distribution G(77, 20)
with mean 77 and variance 20 is a pdf over the continuous domain of all real numbers
R. Suppose the sensors used to record temperature values in the Reading table from
Section 2.2 have errors described by Gaussian distributions: say sensors 1 and 2 have
variance of 20 and 30 respectively. The uncertainty in a sensor’s temperature report
can now be captured in the database by making temp a pdf attribute over R. Using
G(µ, σ²) to represent a Gaussian pdf:
Reading(sensor-id,temp)
(1, G(77,20))
(2, G(81,30))
The Reading table now has an infinite number of possible worlds, since it contains
a pdf attribute over a continuous domain. Any table with two tuples of the form
(1, x) and (2, y) is a valid possible world of Reading. The probability of an individual
possible world is infinitesimally small, so we consider sets of possible worlds. Let the
probability of the set of possible worlds where a ≤ x ≤ b be px, and where c ≤ y ≤ d be
py. As with other uncertain data in Trio, the values of pdf attributes are independent.
Hence the probability of the set of possible worlds where a ≤ x ≤ b and c ≤ y ≤ d is
px · py.
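This independence semantics can be sanity-checked by simulation. The sketch below (illustrative only; interval bounds chosen arbitrarily) draws independent readings from G(77, 20) and G(81, 30) and confirms that the empirical probability of the conjunctive world set is close to the product px · py:

```python
import math, random

rng = random.Random(7)
trials = 100_000
a, b, c, d = 75.0, 80.0, 78.0, 85.0

in_x = in_y = in_both = 0
for _ in range(trials):
    x = rng.gauss(77, math.sqrt(20))  # variance 20 -> sigma = sqrt(20)
    y = rng.gauss(81, math.sqrt(30))
    in_x += a <= x <= b
    in_y += c <= y <= d
    in_both += (a <= x <= b) and (c <= y <= d)

p_x, p_y = in_x / trials, in_y / trials
# Joint frequency matches the product of the marginals:
assert abs(in_both / trials - p_x * p_y) < 0.01
```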
Note that Trio’s attribute-level uncertainty (Section 2.2) is captured by pdf at-
tributes over finite discrete domains. Also note that the presence of alternatives along
with pdf attributes allows us to represent certain cases that pdf attributes alone can’t
represent. For example, this x-tuple represents a temperature 78 with probability 0.3,
otherwise a Gaussian around 77 with variance 20:
Temperature(zone,temp)
(A,78):0.3 || (A, G(77,20)):0.7
Predicates applied to pdf attributes can affect confidence values of result alter-
natives. For example, consider a tuple that exists with confidence 0.8 and has just
one pdf attribute whose value is uniformly distributed between 2 and 6 (denoted by
U(2, 6) : 0.8). The result of applying a predicate A ≤ 5 to this tuple is U(2, 5) : 0.6.
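A minimal sketch of this confidence rescaling for uniform pdfs (helper name hypothetical):

```python
# Applying an upper-bound predicate to a uniform pdf attribute keeps
# the fraction of mass below the bound and scales confidence by it.
def apply_upper_bound(low, high, conf, bound):
    new_high = min(high, bound)
    fraction = (new_high - low) / (high - low)  # mass kept by A <= bound
    return (low, new_high), conf * fraction

# U(2, 6) : 0.8 under the predicate A <= 5 becomes U(2, 5) : 0.6.
(result_low, result_high), result_conf = apply_upper_bound(2, 6, 0.8, 5)
assert (result_low, result_high) == (2, 5)
assert abs(result_conf - 0.6) < 1e-12
```

The same helper reproduces the earlier example: a certain U(3, 7) under "≤ 4" becomes U(3, 4) with confidence 0.25.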
We also support correlated pdf attributes. Correlations may exist in the base
data, for example uncertain location data may be represented by correlated x and y
coordinates. A joint pdf (over possibly heterogeneous domains) is used to represent
multiple correlated attributes. Correlated pdf attributes, and processing queries over
them, are described in more detail in Section 3.5.1. Note that independent attributes
in base data can yield correlated attributes in the result. For example, a selection
query with predicate A < B over pdfs A and B can make attributes corresponding
to A and B be correlated in the result. Lineage is used to correctly handle these
correlations, as described in Section 3.4.2.
For example, the weight function of a Gaussian pdf with mean µ and variance σ² is:

weight(l, h) = ½ ( erf((h − µ)/(σ√2)) − erf((l − µ)/(σ√2)) )
where erf is the error function. The weight function can completely describe any pdf,
but we may often use additional functions in the representation for increased efficiency
and convenience. One such function is sample(l,h), which returns values between any
given l and h drawn at random according to the pdf.
Notice that a call to sample can be computed using inverse sampling by using
calls to weight and a uniformly-distributed random number generator. Conversely, an
approximation to weight can be computed using calls to sample with Monte-Carlo
simulations. In addition to weight and sample, we might use inverse-cdf, fourier, or
describe for a pdf ρ representing a random variable X:
• The function inverse-cdf(v) returns the value r such that v = weight(−∞, r).
Intuitively, inverse-cdf(v) returns the 100v-th percentile value for the pdf. For
example, the inverse-cdf of a Gaussian with mean µ and variance σ² is:

inverse-cdf(v) = µ + σ√2 · erf⁻¹(2v − 1)

In general, inverse-cdf can be computed approximately using calls to weight and
binary search, or using sample. Functions weight and sample can also be computed
using inverse-cdf.
• The function fourier(x) returns the value of the Fourier transform of the pdf
of the random variable X at x. The Fourier transform may be known as an
expression for certain distributions. Efficient numerical computation of the
Fourier transform and inverse Fourier transform has been studied extensively.
Function fourier may be computed using weight and vice-versa.
• The function describe() returns a parametrized form of the pdf. For example,
describe might return [Type:"Gaussian", Parameters: µ, σ²]. For a known type
of pdf, the system can compute describe using weight, sample, or inverse-cdf,
and vice-versa.
Since the functions above can be computed using calls to other functions, not all
of them need to be instantiated. In fact, all the above functions can be computed
(approximately) if any one of them were instantiated. Of course when more functions
are instantiated, it may be possible to obtain more precise answers more efficiently.
The functions above are our starting point. Hooks can be provided to the user to
create additional functions.
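To illustrate this inter-derivability, the sketch below (illustrative only; an Exponential(1) pdf is used purely so the answers can be checked in closed form) computes inverse-cdf from weight by binary search, and sample from inverse-cdf by inverse transform sampling:

```python
import math, random

# weight(l, h) for an Exponential(1) pdf, via its CDF 1 - e^{-x}.
def weight(low, high):
    cdf = lambda x: 0.0 if x <= 0 else 1.0 - math.exp(-x)
    return cdf(high) - cdf(low)

# inverse-cdf derived from weight: binary search for r with
# weight(-inf, r) = v. More iterations -> more precision.
def inverse_cdf(v, lo=0.0, hi=1e9, iters=100):
    for _ in range(iters):
        mid = (lo + hi) / 2
        if weight(float("-inf"), mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# sample derived from inverse-cdf: inverse transform sampling.
def sample(rng=random):
    return inverse_cdf(rng.random())

# For Exponential(1), the median is ln 2.
assert abs(inverse_cdf(0.5) - math.log(2)) < 1e-6
```

Only weight is "instantiated" here; inverse-cdf and sample are derived from it, at the cost of many weight calls per request.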
Pdf data represented as functions is presented to the application or user through
interfaces. Interfaces, like functions, return properties of the pdf, but they are not
used to represent the pdf and hence cannot be used to compute functions or other
interfaces. Here are two examples of interfaces:
• Some applications may ask for the median of a pdf attribute. The property
of the median m is that weight(−∞, m) = weight(m, ∞) = 0.5. Median m can
be found approximately by calling weight repeatedly with binary search. The
precision of the result improves as more calls are made.
Although weight is suitable for computing the discrete histogram interface, it isn’t
ideal for median, and possibly other interfaces. On the other hand, median can be
computed easily using inverse-cdf by calling it with argument 0.5. If inverse-cdf is
instantiated, it should be used for a median interface. If inverse-cdf is not instantiated,
calling it from the median interface will degenerate to the same binary search over calls
to weight. Some other interesting interfaces are mean, variance, percentile,
inverse-percentile, and a human-readable form of describe. All can be implemented
using the functions described earlier, some more efficiently than others.
Our extensible functional representation, along with the notion of interfaces, sets
up a framework permitting easy incorporation of new techniques for pdf attributes. We
will discuss how the set-of-functions approach enables efficient query processing, and
introduces a new optimization problem over a space of “plans” that can be compared
in terms of precision and efficiency.
Data Processing: In Trio, data processing at query time uses a standard rela-
tional processor. We use the algorithm described in [BDHW06b] with the following
modifications:
• Any predicate in the where clause referencing one or more pdf attributes is not
evaluated. The query is processed using only the remaining predicates in the
where clause. (Recall we assume conjunctive where clauses.)
• A placeholder is created for each element in the select clause that is of type pdf.
For example, consider a Trio query over relation R(A, B, C), where A and B are pdf
attributes:
select A, B+5, C
from R
where A ≤ 5 and B ≥ A and C = 11
Translation:
Some simple queries permit a very efficient technique that gives precise answers.
Consider for example an alternative with Pt = A ≤ 5 ∧ B ≤ 10, and Mt = X1 →
A, X2 → B, where A and B are base pdfs and X1 and X2 are pdfs in the result.
The result alternative exists only in possible worlds with values for A and B that
satisfy the predicate, and the product is taken because base pdfs are assumed to be
independent. Similarly, the weight functions for X1 and X2 are:
Discrete Approximation:
The weight function for the pdf X1 in the result can be approximated as:

weightX1(j, k) = (1/s) · Σ_{j ≤ i ≤ k} weightA(i, i + 1) · weightB(i, ∞)

where s is the confidence scaling factor.
This example uses a discrete approximation for the weight function. It is easy to see
that inverse-cdf can also be used. There is a trade-off between cost and precision
when this method is used: more points in the approximation increase both precision
and cost. The predicate A ≤ B in the example above is equivalent to the comparison
of an arithmetic expression with a constant: A − B ≤ 0. The discrete approximation
technique scales poorly as the number of pdf attributes in the arithmetic expression
increases, in terms of both cost and precision. For example, an expression like
Σi Ai ≤ 0 with n pdfs can be very inefficient: O(d^(n−1)), where d points are used to
approximate each pdf. However, it should be noted that the discrete approximation
technique is applicable to all queries.
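The grid computation can be sketched as follows. This is illustrative only: the predicate is taken to be A ≤ B, and A and B are assumed to be independent Uniform(0, 4) pdfs, chosen purely so the exact answers (s = P(A ≤ B) = 0.5, and weightX1(0, 2) = 0.75) are known:

```python
# weight(l, h) of a Uniform(a, b) pdf.
def u_weight(low, high, a=0.0, b=4.0):
    lo, hi = max(low, a), min(high, b)
    return max(hi - lo, 0.0) / (b - a)

# Discrete approximation of weight_X1(j, k) for predicate A <= B:
# (1/s) * sum over grid cells [i, i+h) of
#         weight_A(i, i+h) * weight_B(i, inf).
# n controls the precision/cost trade-off.
def approx_weight_X1(j, k, n=400):
    h = 4.0 / n
    s = total = 0.0
    for step in range(n):
        i = step * h
        cell = u_weight(i, i + h) * u_weight(i, float("inf"))
        s += cell  # running approximation of the scaling factor s
        if j <= i < k:
            total += cell
    return total / s

# The result pdf is normalized, and skewed toward small A values
# (conditioning on A <= B): exactly 0.75 of its mass lies in [0, 2].
assert abs(approx_weight_X1(0.0, 4.0) - 1.0) < 1e-9
assert abs(approx_weight_X1(0.0, 2.0) - 0.75) < 0.01
```

Increasing n tightens the second approximation, at linear cost here; with more pdf attributes in the expression, the nested grid sums grow exponentially as noted above.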
Fourier:
Addition and subtraction over pdf attributes is computed using a convolution operation
on pdfs. A convolution is a kind of integration problem that can be solved using
Fourier transforms, since the convolution operation translates to multiplication in
Fourier space. Consider for example a result tuple with Mt = X1 → A + 2B − C and
an empty Pt (always true), where A, B, C are pdf attributes.
For base data, these functions may be symbolic, making them efficient and precise; or
they may use the weight function to numerically approximate the Fourier transform,
yielding a less precise, more expensive execution. Also, the weight function for the
result can be computed numerically from the fourier function. This technique can be
more efficient and precise than discrete approximations, particularly for expressions
involving a large number of base pdfs. The limitation of the technique is that it is
useful only for addition and subtraction operations over pdf attributes. It needs to be
combined with other approaches when the query involves predicates.
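The key identity behind this technique can be demonstrated on discretized pdfs. The sketch below (illustrative; a real system would use an FFT library rather than this hand-rolled O(n²) DFT) checks that circular convolution of two discretized pdfs equals pointwise multiplication of their Fourier transforms:

```python
import cmath

# Discrete Fourier transform (sign=-1) and its inverse (sign=+1).
def dft(x, sign=-1):
    n = len(x)
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n)
               for t in range(n)) for f in range(n)]
    return out if sign == -1 else [v / n for v in out]

def circular_convolve(x, y):
    n = len(x)
    return [sum(x[j] * y[(i - j) % n] for j in range(n)) for i in range(n)]

# Two discretized pdfs, zero-padded so no wraparound occurs.
a = [0.1, 0.4, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0]
b = [0.2, 0.5, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]

direct = circular_convolve(a, b)
via_fourier = dft([fa * fb for fa, fb in zip(dft(a), dft(b))], sign=+1)

# Convolution in the original domain == multiplication in Fourier space.
assert all(abs(d - v.real) < 1e-9 for d, v in zip(direct, via_fourier))
assert abs(sum(direct) - 1.0) < 1e-9  # the result is again a pdf
```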
Sampling:
Random sampling over possible worlds can be used to compute confidences and to
answer calls to functions representing the result. We use the sample function for this
technique. For example, consider a result tuple with pdf-lineage: Pt = A + B ≤ 4
and Mt = X1 → A + B. The confidence scaling factor s can be approximated as the
fraction of times the following expression is true over a large number of calls:

sampleA(−∞, ∞) + sampleB(−∞, ∞) ≤ 4
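The Monte-Carlo estimate of s can be sketched as below. This is illustrative only: A and B are taken to be Uniform(0, 4) pdfs (and the predicate direction as ≤ 4) purely so the exact answer, s = 0.5, is known:

```python
import random

rng = random.Random(42)

# Hypothetical sample functions for two base pdf attributes.
def sample_a():
    return rng.uniform(0.0, 4.0)

def sample_b():
    return rng.uniform(0.0, 4.0)

# Fraction of trials where the lineage predicate A + B <= 4 holds.
trials = 100_000
hits = sum(sample_a() + sample_b() <= 4.0 for _ in range(trials))
s = hits / trials
assert abs(s - 0.5) < 0.01  # converges as the number of calls grows
```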
Specialized Processing:
Our approach for pdf attributes makes it easy to use specialized processing for known
distributions. For example, consider a result tuple with Mt = X1 → A + B and
an empty Pt. If both A and B are Gaussians, describe on X1 can be computed as
[Type:"Gaussian", Parameters: µA + µB, σ²A + σ²B], where µA, µB, σ²A, σ²B are obtained
using describe calls on A, B. Other functions including weight, sample, inverse-cdf, and
even fourier can be computed efficiently from describe for known distributions. The
framework allows for adding knowledge about specific distributions to the system in
computing the describe function. This processing may be limited in scope with regards
to complicated queries, but can yield precise answers efficiently when applicable.
Combining:
For any interface, function, or confidence computation call, the answer may be produced
by combining more than one of the techniques presented above. Combining the
techniques results in multiple “query plans.” Choosing a good plan is important, since
each evaluation is potentially expensive, and we wish to optimize for a combination of
precision and efficiency.
3.5 Discussion
• The current proposal restricts the query language to have only a conjunctive
where clause. It would be useful to allow a more general query language, along
with some built-in language and data-type extensions specifically for querying
pdfs.
Chapter 4
Generalized Uncertain Databases
Existing uncertain databases have difficulty managing data when exact confidence
values are not available. In some applications, confidence values may be known
imprecisely or coarsely, or even be missing altogether. In this chapter, we propose
a generalized uncertain database: a database that manages incomplete information
about uncertainty. The work presented in this chapter appeared initially in [AW10].
4.1 Introduction
As we have seen in the previous chapters, uncertain databases enable the management
of data that has incomplete or imprecise information. However, most uncertain
databases, including Trio (Chapter 2), require that exact confidence values are at-
tached to the data being managed. In some applications, confidence values may be
known imprecisely or coarsely, or be missing altogether. To manage such incomplete
uncertainty, we propose the notion of a generalized uncertain database.
We use uncertain data about movies and release years, obtained say as a result of
information extraction, to illustrate some different kinds of uncertainty. In the
following, (1) is an example of "complete" (exact) uncertainty, while (2), (3), and (4)
illustrate three kinds of incomplete uncertainty.
1. Exact confidence values: The Godfather was released in 1972 with confidence
.8 and 1974 with confidence .2.
2. Missing confidence values: Shawshank Redemption was released in 1984 or 1994,
but no confidence values are known.
3. Imprecise confidence values: Pulp Fiction was released in 1994 with confidence
at least .5. (Effectively, confidence is between .5 and 1.)
4. Coarse confidence values: Die Hard was released in 1988 or 1989 with confidence
.8, or 1990 with confidence .2.
• The Godfather was released in 1972 with confidence .8 and 1974 with confi-
dence .2.
• The Godfather II was released in 1972 with confidence .3 and 1974 with
confidence .7.
Suppose we want to know whether The Godfather was released at least one year before
The Godfather II. Assuming independence, the confidence would be .8 × .7 = .56. Without
• Data and query semantics (Section 4.2): We propose a semantics for general-
ized uncertain databases based on Dempster-Shafer theory, and the associated
semantics for queries over such databases. We have chosen Dempster-Shafer
theory because it is a mature and elegant generalization of traditional Bayesian
probability theory that incorporates epistemic uncertainty in addition to aleatory
uncertainty. We demonstrate how the new semantics degenerates to the seman-
tics for current uncertain databases.
We emphasize that our new proposal requires only minor modifications to the
semantics, representation, and query processing algorithms for uncertain databases,
while still enabling the management of various kinds of incomplete uncertainty in a
unified manner.
We discuss related work in Section 4.5 and list directions for future research in
Section 4.6.
4.2 Semantics
In this section, we define data-model semantics for generalized uncertain databases,
and the associated query semantics.
Notice that m assigns mass to every subset of possible worlds (although many may
have mass zero), as opposed to the traditional way of assigning a probability to each
individual possible world. It is useful to interpret mass m(S) assigned to a set of
possible worlds S as probability mass constrained to stay within S, but free to be
assigned anywhere within S. For example, when S = {W1 , W2 }, it is unspecified
how m(S) is to be split between W1 and W2. Since we use the Dempster-Shafer
interpretation for the mass function, we also have the belief and plausibility functions,
which can be interpreted as the lower and upper bounds for probabilities:
bel(S) = Σ_{A ⊆ S} m(A)
pl(S) = Σ_{A ∩ S ≠ ∅} m(A)
This interpretation allows us to answer queries that ask for tuples which “could have
probability more than .5” (mentioned in Section 4.1). We don’t further describe
details or intuition for belief or plausibility functions in this thesis, and instead refer
the reader to [Sha76] for a detailed discussion. We do, however, note the following
relationships:
m(S) = Σ_{A ⊆ S} (−1)^{|S \ A|} bel(A)
pl(S) = 1 − bel(S̄)
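The belief and plausibility functions above can be computed directly from an explicit mass function. A minimal sketch, with illustrative worlds and mass values (not taken from the chapter's examples):

```python
# Belief and plausibility computed from an explicit mass function over
# subsets of possible worlds. Worlds and mass values are illustrative.

worlds = frozenset({"W1", "W2", "W3"})
mass = {
    frozenset({"W1"}): 0.5,
    frozenset({"W2", "W3"}): 0.3,
    worlds: 0.2,  # mass on the full set: completely unconstrained portion
}

def bel(s):
    # bel(S) = sum of m(A) over all A with A ⊆ S (lower bound)
    return sum(m for a, m in mass.items() if a <= s)

def pl(s):
    # pl(S) = sum of m(A) over all A with A ∩ S ≠ ∅ (upper bound)
    return sum(m for a, m in mass.items() if a & s)

s = frozenset({"W2", "W3"})
assert abs(pl(s) - (1 - bel(worlds - s))) < 1e-9   # pl(S) = 1 - bel(S̄)
```

Here bel({W2, W3}) = 0.3 while pl({W2, W3}) = 0.5, since the mass on the full set may or may not fall inside S.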
1. Exact confidence values: In the standard probabilistic case, all the mass is
contained in sets with a single possible world. It is worth noting that the mass
function is the same as the belief and the plausibility function, and hence mass
can be interpreted as probability.
W1 W2
The Godfather, 1972 The Godfather, 1974
2. Missing confidence values: Since no confidence values are known, all the mass is
assigned to the set of all possible worlds. For all individual possible worlds, the
belief and plausibility functions are 0 and 1 respectively.
W1 W2
Shawshank Redemption, 1984 Shawshank Redemption, 1994
m(S) = 1 if S = {W1, W2}, and m(S) = 0 otherwise
W1 W2
Pulp Fiction, 1994 ∅
4. Coarse confidence values: In this case, confidence is specified over sets of possible
worlds, but not exactly to each individual possible world.
W1 W2 W3
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Since all the above kinds of uncertainty can be captured in generalized uncertain data-
bases, we can provide a unified semantics for various kinds of incomplete uncertainty
and exact probabilities. The Trio system allows relations with exact probabilities
and missing confidences, but a query joining these relations coerces the relation with
missing probabilities by assigning probabilities uniformly, and hence the results are ad hoc.
In contrast, our new proposal is able to combine such information in a principled
manner.
We now formalize the observation that our new data-model semantics generalizes
semantics for probabilistic databases [ABS+06]. Recall from Chapter 2 the definition
of a probabilistic database (Definition 1).
Probabilistic mass functions are those mass functions that satisfy the property above.
Recall from Chapter 2 the definition of uncertain databases without probabilities
(Definition 2). We can make the observation that our semantics generalizes those of
uncertain databases without probabilities.
PW_R = {Q(W) | W ∈ PW_U}
m_R(S) = Σ_{A : S = {Q(W) | W ∈ A}} m_U(A)
Observation 4. When the mass function for a generalized uncertain database corre-
sponds to an uncertain database without confidences (m(PW) = 1), the mass function
for the generalized uncertain database resulting from the application of a relational
query also corresponds to an uncertain database without confidences.
4.3 Representation
In this section, we describe our new representation scheme for generalized uncertain
databases. This scheme is a modification of the representation scheme used by Trio
(Chapter 2) in the sense that it preserves the notions of alternatives, x-tuples, and
lineage. The significant change is that probabilities associated with alternatives are
replaced by a mass function defined for each x-tuple.
We start by describing our scheme without lineage, and then incorporate lineage.
A database instance is represented by a set of x-tuples, each with an associated x-mass
function. We use a running example to describe these notions, and to show how a
generalized uncertain database instance is interpreted from a representation instance.
Possible worlds for the database are obtained by picking exactly one alternative
from each x-tuple. Symbol “?” is a special alternative value; picking it means
that none of the other alternatives are picked [ABS+ 06]. The first x-tuple has
three possible states, while the second x-tuple has two possible states. Hence,
the possible world set P W has six possible worlds as shown below.
W1 W2 W3
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Pulp Fiction, 1994 Pulp Fiction, 1994 Pulp Fiction, 1994
W4 W5 W6
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
The intuitions described for mass functions in Section 4.2 carry over to x-mass
functions. We have an x-mass function for each x-tuple in the representation.
From these functions, we interpret the mass function m over the possible
worlds to get a generalized uncertain database from a representation instance.
Conceptually, we first interpret each x-mass function as a basic mass function
over the set of possible worlds, and then “combine” these basic mass functions
to obtain the mass function for the generalized uncertain database.
• Basic mass functions: In general, the basic mass function m_i : 2^PW → [0, 1] for
an x-mass function x_i : 2^A → [0, 1] is constructed as follows:
Consider the x-mass functions x1 , x2 for the tuples above. We create the
• Combination: The basic mass functions mi for all x-tuples are combined to obtain
the mass function m for the generalized uncertain database using Dempster’s
combination rule:
m = ⊕_i m_i
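Dempster's combination rule can be written out directly for two basic mass functions over the same possible-world set. A sketch, with illustrative masses (the standard rule; mass falling on an empty intersection is conflict and is renormalized away):

```python
# Dempster's combination rule for two basic mass functions over the same
# possible-world set. The example masses are illustrative only.

def combine(m1, m2):
    out, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                out[inter] = out.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb   # mass on the empty intersection
    # Renormalize by 1 - K, where K is the total conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in out.items()}

# Two basic mass functions, e.g. from two independent x-tuples whose
# alternatives partition six worlds W1..W6.
m1 = {frozenset({"W1", "W2", "W4", "W5"}): 0.8,
      frozenset({"W3", "W6"}): 0.2}
m2 = {frozenset({"W1", "W2", "W3"}): 0.6,
      frozenset({"W4", "W5", "W6"}): 0.4}

m = combine(m1, m2)
assert abs(m[frozenset({"W1", "W2"})] - 0.48) < 1e-9
```

When the basic mass functions come from independent x-tuples, as here, no conflict arises and the normalization is a no-op.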
Hence, we get the mass function m for the generalized uncertain database as:
Based on results from [DBHW06b], we can immediately see that the model
presented so far for generalized uncertain databases is not complete, or even
closed under select-project-join queries. Hence, we incorporate the notion of
lineage from Trio (recall from Chapter 2) into the model.
t3 1989, 1994 ?
W1 W2 W3
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Pulp Fiction, 1994 Pulp Fiction, 1994 Pulp Fiction, 1994
1989, 1994
W4 W5 W6
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Our model requires that an x-mass function be provided for each base x-tuple,
while no x-mass functions may be explicitly provided for derived tuples (just like
confidences in Trio). Hence, the mass function over possible worlds is not affected by
the presence of lineage, and is constructed based on base x-tuples only. The mass
function for the generalized uncertain database with x-tuples t1 , t2 , t3 is as shown
earlier.
We can now make the following observation:
define the mass, belief, and plausibility values for the formula f as follows:
It is well known that in probabilistic databases, when the boolean formula over
base tuples is conjunctive, the confidence can be evaluated efficiently. This result
carries over to the uncertainty evaluation problem:
Belief, plausibility, and mass can each be treated as probabilities [Sha76] (lower bound,
upper bound, and point mass, respectively) when the boolean formula over base tuples
is conjunctive, and the confidence computation module of Trio can be used directly
for uncertainty evaluation.
Approximation techniques have been proposed for the confidence computation
problem [DS04a]. The observations above suggest that such techniques can be adapted
to solve the uncertainty evaluation problem. Thorough investigation of the uncertainty
evaluation problem is left as future work; specific directions are listed in Section 4.6.
possibility theory. Like our approach, this proposal also enables managing of data
where no exact probability values are available. Possibility theory can be captured in
Dempster-Shafer theory in terms of representation, but it uses a cautious combination
operator instead of Dempster’s rule.
Chapter 5
Uncertain-Data Integration
5.1 Introduction
Most work on uncertain data has focused on modeling and management problems.
Little work has been done on integrating multiple sources of uncertain data. Similarly,
several decades of research have focused on the theory and practice of data integra-
tion [HRO06], but only considering integration of certain data. This chapter develops
theoretical foundations for local-as-view (LAV) integration [Hal01] of uncertain data.
The combined study of data integration and data uncertainty is important for
several reasons. The traditional benefits of data integration still apply when sources
are uncertain: Integrating data from multiple sources allows a uniform query interface
to access their combined information. In addition, integrating multiple sources of
uncertain data may help resolve portions of the uncertainty, yielding more accurate
results than any of the individual sources. As a very simple example, if one sensor
reports that an object is either in location A or in location B, and a second sensor
Containment
LAV data integration typically uses the open world assumption: Consider a mapping
view query Q for a source S. When Q is applied to the (logical) mediated database,
we do not require the result to be S exactly, but only require it to contain S. For
the case of certain databases, containment is straightforward. To extend LAV data
integration to the uncertain data setting, we need to find an appropriate definition of
containment. We will see that by defining containment carefully, we can capture the
“contradiction” and “corroboration” intuition motivated above.
We will see that for uncertain data, two different integration settings require two
somewhat different notions of containment. In one setting, which we call equality-
containment, the sources were derived from an existing but unknown uncertain
mediated database that we are trying to reconstruct. In the other setting, which
we call superset-containment, there is no actual mediated database from which the
sources were derived, so our goal is to come up with a logical mediated database that
captures the information from the sources. We will give examples to illustrate the
differences. This distinction is new for handling uncertain data. For certain data,
these two settings can be handled identically.
Consistency
When sources contain uncertain data, we need to define what it means for sources to
be consistent. (As an extremely simple example, one sensor reporting location A or B
and the other reporting C or D for the same object is inconsistent.) Informally, a set
of sources is consistent if there exists a mediated database that contains all sources.
We will formalize this notion and then study the problem of consistency-checking
under both equality-containment and superset-containment. We show that in general,
consistency-checking is NP-hard in the size of the view schema for both of our settings.
Next we identify a class of sources where consistency-checking is polynomial. We
describe the construction of a hypergraph for a set of sources, and we provide a
PTIME consistency-checking algorithm when this induced hypergraph is acyclic. We
also show that the extensional complexity of consistency-checking is PTIME for both
of our settings.
Query Answers
Lastly, we consider the problem of defining correct query answers over mediated
uncertain databases. Once again, the definitions used for certain databases do not
adapt directly to the uncertain case. The conventional LAV setting uses certain
answers (where the use of the word “certain” here is not to be confused with certain
data). A certain answer is a set of tuples that is guaranteed to be contained in
any mediated database [Hal01]. We define a corresponding notion for uncertain
data, which we call correct answers, that incorporates possible worlds through our
containment definitions. Further, we seek to find a unique strongest correct answer
(SCA) defined using the partial orders implied by the containment definitions. For
superset-containment, we prove by construction the existence of an SCA. However,
for equality-containment an SCA does not always exist, and hence we define a relaxed
notion for the “best” query answer.
Discussion
For ease of presentation, we restrict ourselves to identity queries and views defined
by identity queries for most of the chapter. In Section 5.7, we extend our techniques
for monotonic views with some restrictions, and for monotonic queries over uncertain
data. For a majority of this chapter, we focus attention on uncertain databases
without probabilities. In Section 5.8, we discuss extensions of our results for the case
of generalized uncertain databases (recall from Chapter 4).
The results in this chapter are independent of the specific representation used
for uncertain data. (The computational complexity of certain problems considered
may depend on the specific representation, and we point out these differences in the
relevant places.) Also, although our results are presented for discrete uncertain data
with a finite set of possible worlds, they can be generalized for continuous uncertain
data with an infinite set of possible worlds. We emphasize that our foundations are
defined in terms of possible worlds, but we neither rely on nor advocate possible worlds
as a physical representation.
5.2 Containment
This section introduces and formalizes the notions of equality-containment and superset-
containment. Recall the definition of an uncertain database from Chapter 2. We shall
motivate an extension to the definition from Chapter 2 in Section 5.2.1, and then
present our containment definitions in Section 5.2.2 using the extended definition.
Example 1. Consider tuples A and B, and the four possible worlds: P1 = ∅, P2 = {A},
P3 = {B}, and P4 = {A, B}. An uncertain database U1 that contains no information
about the existence and co-existence of tuples A and B consists of all four possible
worlds. An uncertain database U2 with information that at least one of A or B exists,
and that they cannot co-exist, does not contain either P1 or P4 . U2 contains more
information than U1 , since it asserts that P1 and P4 are not possible. An uncertain
database U3 with T (U3 ) = {A, B} but containing only the possible world P2 = {A}
asserts that the tuple A is contained in the database and that B cannot be contained
in the database, since B is contained in T (U3 ) but not contained in any possible world
of U3 . U3 contains more information than either U1 or U2 .
Equality-Containment
Equality-containment integration is relevant in situations where each source has access
to only a portion of an uncertain database that is existing but not known completely.
There are many real-world applications where access is controlled, and only slices of
the data may be visible to various parties (sources). For example, an actual uncertain
database may be hidden behind a web service, or people may only be given access
to data depending on what they pay for. The goal of data integration in this setting
is to answer queries using the best (virtual) reconstruction of the unknown actual
uncertain database. Also, when smaller pieces of sensitive data are given out to
multiple people so that no single piece leaks information, this reconstruction allows
one to detect whether the pieces can be combined to obtain sensitive information.
Finally, this setting captures the problem of answering queries using materialized
views over uncertain data, where each source is a view.
T(U1) ⊆ T(U2)
and
PW(U1) = {W ∩ T(U1) | W ∈ PW(U2)}
Informally, if we remove from any possible world of U2 those tuples not contained in
T (U1 ), then the resulting possible world is a world of U1 , and U1 may not contain
additional possible worlds.
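The definition above can be checked directly on small explicit databases. A minimal sketch, modeling an uncertain database as a (tuple set, set of possible worlds) pair; this explicit encoding, and the mediated database's tuple set {A, B, C}, are choices made for illustration:

```python
# Equality-containment checked from the definition:
# T(U1) ⊆ T(U2) and PW(U1) = { W ∩ T(U1) : W ∈ PW(U2) }.
# An uncertain database is a (tuple set, set of possible worlds) pair.

def eq_contained(u1, u2):
    t1, pw1 = u1
    t2, pw2 = u2
    return t1 <= t2 and set(pw1) == {frozenset(w & t1) for w in pw2}

# In the flavor of Example 2: M over tuples {A, B, C} with worlds
# {A} and {A, C}; U3 = ({A, B}, {{A}}); U1 has all four worlds over {A, B}.
m_db = (frozenset("ABC"), {frozenset("A"), frozenset("AC")})
u3 = (frozenset("AB"), {frozenset("A")})
u1_db = (frozenset("AB"), {frozenset(), frozenset("A"),
                           frozenset("B"), frozenset("AB")})

assert eq_contained(u3, m_db)         # U3 ⊑E M
assert not eq_contained(u1_db, m_db)  # U1 ⋢E M
```

Restricting each world of M to {A, B} yields only {A}, which is exactly PW(U3) but strictly less than PW(U1).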
Superset-Containment
Superset-containment integration is relevant in settings where we obtain uncertain
data about the real world from different sources, and the goal is to combine information
from these sources to construct a logical “real-world truth” as accurately as possible.
The simplest example of this scenario was given in Section 5.1, where one sensor
reported A or B for an object and another reported B or C. When we integrate these
sources to obtain our best guess at the real-world truth, we decide the location is
likely to be B.
Superset-containment also arises in information extraction: several parties may
extract structured data from unstructured data (e.g., extracting relations from text,
or extracting text in an OCR context) using different techniques, and integration
can be used to resolve uncertain results from the sources. Another setting where
superset-containment integration is relevant is the combination of information from
multiple sources that attempt to make predictions, such as weather forecasts from
different websites, or sales projections using different techniques.
In contrast to equality-containment, under superset-containment the sources may
not have been derived from an actual uncertain database.
T(U1) ⊆ T(U2)
and
PW(U1) ⊇ {W ∩ T(U1) | W ∈ PW(U2)}
This partial order captures the intuition that the larger pair contains more information,
with respect to both presence and absence of tuples.
¹See [Gun92, LW95] for examples.
The Smyth lifting of the partial order above yields the definition of superset-
containment, while the Plotkin lifting yields the definition for equality-containment.
The Plotkin order is stricter than the Smyth order, and similarly, we have:
(U1 ⊑E U2) ⟹ (U1 ⊑S U2)
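Both orders can be checked directly on explicit possible-world sets. A self-contained sketch, again modeling an uncertain database as a (tuple set, set of possible worlds) pair chosen for illustration:

```python
# Superset-containment checked from the definition: T(U1) ⊆ T(U2) and
# PW(U1) ⊇ { W ∩ T(U1) : W ∈ PW(U2) }. Equality-containment is also
# defined so the snippet stands alone.

def eq_contained(u1, u2):
    t1, pw1 = u1
    t2, pw2 = u2
    return t1 <= t2 and set(pw1) == {frozenset(w & t1) for w in pw2}

def sup_contained(u1, u2):
    t1, pw1 = u1
    t2, pw2 = u2
    return t1 <= t2 and {frozenset(w & t1) for w in pw2} <= set(pw1)

# Illustrative pair: M over {A, B, C}, source U3 = ({A, B}, {{A}}).
m_db = (frozenset("ABC"), {frozenset("A"), frozenset("AC")})
u3 = (frozenset("AB"), {frozenset("A")})

# Equality-containment implies superset-containment (the Plotkin order
# is stricter than the Smyth order).
assert eq_contained(u3, m_db) and sup_contained(u3, m_db)
```

The implication is immediate from the definitions: a set equal to the restricted worlds in particular contains them.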
5.3 Examples
In this section, we use two examples to motivate our definitions of containment. We
start with an abstract but simple example that illustrates the differences between the
two notions of containment. Then, we present a practical example from a real-world
application to illustrate the utility of our approach. Our example also demonstrates
the notion of consistency, which is formally studied in Section 5.5.
Example 2. Recall Example 1 where uncertain databases over tuple set T = {A, B}
were:
PW(M) = {A}, {A, C}
Suppose M is an actual database, and a source S obtained from M doesn’t have the
privileges to access tuple C. Then S would be represented by U3. Intuitively, we should
have U3 ⊑E M under equality-containment, and indeed this is the case according to
Definition 2. Notice that U1 ⋢E M and U2 ⋢E M, consistent with the fact that U1
and U2 cannot be obtained as a result of restricting access on M.
Now consider our other setting, where the sources were not derived from an actual
Next consider the three uncertain relations SCPD, NCPD, and SFPD under superset-
containment. In this setting, instead of being derived from the FBI's Suspects relation,
these relations were obtained by collecting evidence locally. We now have the following
mediated uncertain database U that superset-contains the three sources: T (U ) contains
all three tuples (George,...), (Kenny,...), and (Henry,...), while P W (U ) contains
a single possible world with two tuples: (George,...) and (Kenny,...).
Intuitively, the three sources were resolved to conclude that George and Kenny
were suspects while Henry was not: SFPD insists that Kenny is a suspect, while NCPD
says that either both Kenny and George are suspects, or neither is. Since Kenny is a
suspect, we conclude that both Kenny and George are suspects. Finally, from SCPD we
rule out Henry being a suspect, since SCPD says that exactly one of Henry and George
is a suspect.
Recalling the intuition from Section 5.1, notice that the (Henry,...) possible world
from SCPD and the ; possible world from NCPD are contradicted by a “corroboration” of
all other possible worlds.
T(Q(U)) = Q(T(U))
PW(Q(U)) = {Q(W) | W ∈ PW(U)}
Notice that for monotonic queries, each possible world in Q(U ) is a subset of the tuple
set of Q(U ), ensuring that Q(U ) is indeed an uncertain database.
The next definition specifies the semantics of LAV mappings by defining the notions
of view extension and view definition. These definitions are always used in conjunction
with an implicit logical mediated database.
Next we formalize the notion of a source for LAV data integration in terms of
views.
5.5 Consistency
In this section, we formally define consistency of a set of uncertain data sources.
We then present complexity results for the problem of consistency-checking under
equality-containment and superset-containment.
Roughly speaking, a set of sources is consistent if there exists some mediated
database that contains each source.
• PW(M) ≠ ∅
PW(U # S) ≡def {W \ S | W ∈ PW(U)}
T(U # S) ≡def T(U) \ S
The above definitions satisfy the following properties, which we will use later in proofs:
The last two statements indicate that the restriction and removal operations cannot
make the set of possible worlds empty, although individual possible worlds may become
empty.
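The removal operator and the property just noted can be sketched directly, using the same illustrative (tuple set, set of possible worlds) encoding of an uncertain database:

```python
# The removal operator U # S written out: drop the tuples in S from the
# tuple set and from every possible world.

def remove(u, s):
    t, pw = u
    return (frozenset(t - s), {frozenset(w - s) for w in pw})

u = (frozenset("AB"), {frozenset("A"), frozenset("B")})
t, pw = remove(u, frozenset("B"))
assert t == frozenset("A")
# An individual world ({B}) collapses to the empty world, but the set of
# possible worlds itself stays non-empty, matching the property above.
assert pw == {frozenset("A"), frozenset()}
```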
Intractability
Theorem 1 below establishes the NP-hardness of consistency checking of a set of
sources under both superset-containment and equality-containment. For both cases
we show reductions from the well-known 3-coloring problem [GJ79], although the
arguments are slightly different. The reductions in the proof use one source for every
node and edge, giving us NP-hardness in the size of source schemas. In Section 5.5.2
we will show that consistency-checking is tractable when the number of sources is
fixed.
• For every vertex v, construct a view extension Vv with 3 possible worlds repre-
senting its 3 colorings: PW(Vv) = {v0}, {v1}, {v2}.
• For every edge (u, v), construct a view Vuv with 6 possible worlds representing
the 6 allowed colorings of the nodes u, v: PW(Vuv) = {u0, v1}, {u1, v0}, {u1, v2},
{u2, v1}, {u2, v0}, {u0, v2}.
• For every edge (u, v), u and v are assigned different colors:
W ∩ {u0, u1, u2, v0, v1, v2} ∈ PW(Vuv)
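The construction above can be sketched as code that builds the view extensions from a graph. Tuple names like "v0" (vertex v colored 0) are an encoding chosen here for illustration:

```python
from itertools import product

# Sketch of the reduction from 3-coloring: one view extension per vertex
# (its 3 colorings) and one per edge (the 6 allowed colorings).

def coloring_views(vertices, edges):
    views = {}
    for v in vertices:
        tuples = frozenset(f"{v}{c}" for c in range(3))
        views[v] = (tuples, {frozenset({f"{v}{c}"}) for c in range(3)})
    for (u, v) in edges:
        tuples = frozenset(f"{x}{c}" for x in (u, v) for c in range(3))
        worlds = {frozenset({f"{u}{cu}", f"{v}{cv}"})
                  for cu, cv in product(range(3), repeat=2) if cu != cv}
        views[(u, v)] = (tuples, worlds)
    return views

views = coloring_views(["a", "b"], [("a", "b")])
assert len(views["a"][1]) == 3          # 3 colorings per vertex
assert len(views[("a", "b")][1]) == 6   # 6 allowed colorings per edge
```

A consistent mediated database for these views then corresponds exactly to a proper 3-coloring of the graph.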
Tractable subclass
Next we show that for an interesting subclass, the intensional complexity of consistency-
checking is PTIME. This subclass is based on a mapping from sets of uncertain
Note that the notion of acyclic hypergraphs has been used extensively in database
theory to identify polynomial subclasses for hard problems. See [CR97, Mai83, Ull88]
for a few examples. In our mapping, the nodes in the hypergraph represent tuples
from uncertain databases, and each uncertain database is represented by a hyperedge
in the hypergraph.
We argue that practical uncertain databases often satisfy the acyclic hypergraph
structure. Consider, for instance, our FBI data from Example 3 under the equality-
containment setting. In addition to the zone- and city-level police departments,
suppose we have state-level police departments: states subsuming zones and zones
subsuming cities. The resulting uncertain database yields an acyclic hypergraph.
Under the superset-containment setting, consider a series of sensors monitoring sets
of rooms in a hallway; when the rooms are placed in an “acyclic fashion” (i.e., the
hallway isn’t a circle, but a set of chained rooms), the uncertain database representing
sensor readings gives an acyclic hypergraph.
While we’ve shown practical scenarios where acyclicity arises in practice, we note
that even when an uncertain database does not exhibit an acyclic hypergraph, we can
impose acyclicity by “splitting” some sources. The consequence of splitting a source
is that we may lose some specificity-information. (For example, U with PW(U) =
{A}, {B}, {C} may be split to get U1 and U2 such that PW(U1) = {A}, {B}, ∅ and
PW(U2) = ∅, {C}.) Effectively, our results enable any set of uncertain databases to
have tractable consistency-checking, but with some information loss when the acyclic
hypergraph property isn’t satisfied.
Our results are framed for the possible-worlds representation, but they also hold
for more compact representations that satisfy conditions outlined in the respective
theorems (such as the existence of a polynomial containment check). The following
two theorems are the most technically challenging of the chapter.
• Node Removal: Since the node t is removed, we remove the tuple corresponding
to t from V(e); i.e., we replace V(e) by V(e) # {t}.
Proof. Let the node removal step remove tuple t from the source Vi . Recall that t
is not contained in any other source. Let VI = {V1 , . . . , Vi , . . . , Vm } denote the view
extensions before the node removal, and let VF = {V1 , . . . , Vi # {t}, . . . , Vm } denote
the view extensions after the node removal.
{V | V = W ∪ {t}, if (W ∩ T(Vi # {t})) ∪ {t} ∈ PW(Vi);
V = W, otherwise}
Proof. Consider the edge-removal step on edge e ⊆ f. Let the set of view extensions
be VI and VF = VI \ {V(e)} before and after the edge removal, respectively.
Proof. The node removal step of the GYO-reduction is the same as Theorem 2.
• Edge Removal: Remove the source corresponding to V(e), along with the
edge e, and modify the uncertain database associated with f to VR(f) by
retaining the same tuple set and making the possible worlds PW(VR(f)) equal
to:
{W | W ∈ PW(V(f)), W ∩ T(V(e)) ∈ PW(V(e))}
The sources are consistent if and only if the above reduction results in a simple
hyperedge. If during an edge-removal step we obtain VR(f) = ∅, we declare the sources
inconsistent. Lemma 1 above continues to hold. Lemmas 4 and 5 complete the proof.
Proof. Consider the edge-removal step on edge e ⊆ f. Let the set of view extensions
be VI and VF = (VI \ {V(e), V(f)}) ∪ {VR(f)}, before and after the removal.
PW(M + VR(f)) ⊆ PW(M + V(f))   (since T(VR(f)) = T(V(f)))
⊆ PW(V(f))   (since V(f) ⊑S M)
PW(M + V(f)) = PW(M + VR(f))   (since T(VR(f)) = T(V(f)))
⊆ PW(VR(f))   (since VR(f) ⊑S M)
Lemma 6. The set V of sources is consistent if and only if at least one of these N
uncertain databases, say U , is a consistent mediated database.
Proof.
Consistent U ⟹ Consistent V: U is a consistent mediated database for the set V.
with one possible world ∪_{i∈{1,...,m}} Wi. Note that U is consistent, and is one of the N
uncertain databases above.
Proof.
Consistent M ⟹ Consistent V: M is a consistent mediated database for V.
well. There are multiple consistent mediated uncertain databases for a given input,
hence the challenge is in defining the notion of “best” query answers corresponding to
certain answers in certain data integration [Hal01]. Informally, we would like the
best answer to contain all the information implied by the sources, and nothing more.
Note that query answering only makes sense when the input sources are consistent.
Also, we restrict ourselves to identity queries; extensions to the class of monotonic
views and queries follow using additional results presented in Section 5.7.
5.6.1 Definitions
We define the notion of correct answer and strongest correct answer, analogous to the
traditional notions of certain answer and maximal certain answer for data integration
without uncertainty [Hal01].
Definition 15 (Fictitious Tuple). For a set of identity views {Vi}, a tuple t is said
to be fictitious in a consistent mediated database M if t is not present in any of the
view extensions; i.e., ∀i, t ∉ Vi.
the set of all consistent mediated databases. The collected database MC has tuple set
T(MC) = ∪_i T(Vi) and contains all possible worlds in all mediated databases in Mres:
PW(MC) = ∪_{M ∈ Mres} PW(M)
Notice that Mres is the set of mediated databases that do not contain any fictitious
tuples.
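The collected database is a straightforward union once Mres is in hand. A sketch, with a hand-picked illustrative Mres (the text defines Mres as all consistent mediated databases without fictitious tuples; enumerating it is not shown here):

```python
# The collected database M_C: union of the sources' tuple sets and of the
# possible worlds of every mediated database in M_res.

def collected(view_tuple_sets, m_res):
    t_c = frozenset().union(*view_tuple_sets)
    pw_c = set().union(*(pw for _, pw in m_res))
    return (t_c, pw_c)

mc = collected(
    [frozenset("a"), frozenset("b")],                     # source tuple sets
    [(frozenset("ab"), {frozenset("ab"), frozenset("a")}),  # illustrative
     (frozenset("ab"), {frozenset("ab"), frozenset()})],    # members of M_res
)
assert mc[0] == frozenset("ab")
assert mc[1] == {frozenset("ab"), frozenset("a"), frozenset()}
```

Every world that some consistent mediated database allows survives into MC, which is what makes MC produce the strongest correct answer under superset-containment.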
5.6.2 Superset-Containment
The following theorem shows how to obtain the SCA to a query for a set of sources.
Theorem 6. For a set of consistent sources, where each source is described by the
identity view, there exists an uncertain database MI that gives the SCA CQ to any
query Q:
∃MI ∀Q : CQ = Q(MI)
Proof. We show that answering queries using the collected database MC (from Defini-
tion 16) gives the SCA for all queries.
∃M ∈ Mres : PW(M′ + MC) = PW(M)
∀M ∈ Mres : PW(M) ⊆ PW(MC)
Notice that the collected database produces the SCA to all queries.
5.6.3 Equality-Containment
This section studies query answering under the equality-containment setting. Un-
fortunately, a nontrivial SCA may not always exist for a set of views. However,
the construction from the previous section still gives us good answers to queries in
this setting: we show that it yields a unique procedure that satisfies certain natural
properties.
Theorem 7. There exist sets of view extensions for which even though there are
several nontrivial mediated databases, there is no SCA for the identity query.
Proof. Consider two views: V1 with possible worlds W11 = {a} and W12 = ∅, and V2
with possible worlds W21 = {b} and W22 = ∅. The two views give several consistent
mediated databases, such as M1 = {{a, b}, {a}, {b}, ∅} and M2 = {{a, b}, ∅}. While V1
and V2 themselves are correct answers, any uncertain database A with T (A) = {a, b}
is not contained in at least one of the above mediated databases. Hence, there is no
SCA for the identity query.
Since an SCA may not always exist, we relax our requirements from the best answer.
We introduce the notion of a “query answering mechanism,” and we define two
desirable properties. We prove that the query answering mechanism we propose is the
only mechanism that satisfies these properties.
The consistency property requires that the results for a query be obtained from a
consistent mediated database. It also asserts that the query answering mechanism
must not add data-information to the result beyond what is entailed by the sources,
hence disallowing fictitious tuples (Definition 15).
The all-possibility property requires that the query answering mechanism must
not add specificity-information to the result beyond what is entailed by the sources.
It asserts that the mechanism must answer queries without ruling out any possible
world that could exist in some consistent mediated database.
The following fairly straightforward theorem states that answering queries using
the collected database (Definition 16) is the only procedure that satisfies the two
properties above.
Definition 18 (Inverted skolemized source). For a source with view extension V and
view definition Q, let the inverse rules be R. Consider the source (V S ,I) over the
mediated schema, where V^S is obtained by applying the inverse rules to each possible
world of V:
PW(V^S) = {R(W) | W ∈ PW(V)}
CHAPTER 5. UNCERTAIN-DATA INTEGRATION 74
Definition 19 (Inverted deskolemized source). When all tuples from the inverted
skolemized version (V S ) of a source that contain at least one skolem constant are
dropped, the uncertain database obtained (V D ) is called the inverted deskolemized
version of the source.
The following theorems show that the inverted deskolemized source and the inverted
skolemized source are consistency preserving.
Proof. Consider a tuple t that contains a skolem constant. Since skolem constants
are unique to the rule applied, the tuple t can only exist in one deskolemized source.
Hence, the tuple t can be dropped using the node-removal step of Theorem 2, which
preserves consistency.
Theorem 10. A set of sources with extensions V = {V1, . . . , Vm} and view definitions
Q = {Q1, . . . , Qm} is consistent if and only if the set of inverted skolemized versions of the
sources V^S = {V1^S, . . . , Vm^S} (all defined by the identity query) is consistent.
Theorem 11. A set of sources with extensions V = {V1, . . . , Vm} and view definitions
Q = {Q1, . . . , Qm} is consistent if and only if the set of inverted deskolemized versions
of the sources V^D = {V1^D, . . . , Vm^D} (all defined by the identity query) is consistent.
The above theorems together show that all of our PTIME results (for the tractable
query-complexity subclass as well as the extensional complexity results) for both
superset-containment and equality-containment carry over for monotonic views: the
consistency checks are now applied on the inverted deskolemized sources.
Next we turn to answering monotonic queries over a set of sources with monotonic
queries as view definitions. For monotonic views, we use their inverted skolemized
versions to construct the set of mediated databases. Note that our construction allows
non-fictitious tuples to have skolem constants. The construction of the collected
database from Section 5.6 uses the above set of mediated databases. The result of a
query over a skolemized relation retains only tuples with no skolem constants.
The query-answering results for equality-containment presented in Section 5.6.3
carry over from the above observations. However, as described in Section 5.6.2,
extending our superset-containment results to arbitrary monotonic queries additionally
requires us to show that containment is preserved by our class of queries:
Lemma 8. For uncertain databases U1, U2 with the same schema SC, for any monotonic
query Q over SC that retains a key K of SC, we have: U1 ⊑_S U2 ⟹ Q(U1) ⊑_S Q(U2).
Proof. By definition of U1 ⊑_S U2:
For any W2 ∈ PW(U2), consider any tuple t ∈ W2 \ T(U1). (If no such t exists, clearly,
Q(W2) = Q(W1).) For key-preserving monotonic queries, t.K ∉ T(Q(W1).K), hence
Q(W2) = Q(W1). Hence
• m_U(∅) ≠ 1
Recall from Section 4.2 that m assigns mass to every subset of possible worlds, and it
is unspecified how this mass is split within the subset.
For presentation of containment definitions, we extend Definition 9 from Section 5.5
to generalized uncertain databases:
U1 = U2 + U1
• T(U1) = T(U2 + U1)
• m(U1) ⊇ m(U2 + U1)
• m_M(∅) ≠ 1
• T(U_α) = T(U)
Notice that m_{U_α} is still a valid mass function. Discounting can be interpreted as
asserting that U might be completely inaccurate with weight α.
Definition 6 can be extended so that each source S is now a triple: S = (V, Q, α).
By using discounted sources, we are able to encode the reliability of each source in the
data integration problem. A higher α denotes a less reliable data source.
Recall from Section 7 that a set of inconsistent sources represents contradiction,
and hence at least one of the sources must be incorrect. Previously, we had no
tools to resolve this contradiction. But now, with the notion of discounted sources,
inconsistency implies that at least some of our sources need to be discounted. The
following observation shows that we can always achieve consistency if we discount
each of the input sources.
There has been a lot of work in the area of incomplete information databases.
These are sets of certain global databases that arise as a result of data integration
of certain sources. Reference [Gra02] presents a good overview. In contrast, in our
setting integration of uncertain sources results in sets of uncertain global databases.
Our theory is based on possible worlds and some of our results rely on the
existence of an efficient containment check over the model used for representing
uncertain databases. In contrast, reference [AKG87] presents complexity results about
representing and asking questions about sets of possible worlds. This work is in fact
complementary to our work, and provides a natural starting point for our investigation
about compact representations.
There has been previous work on using probabilistic techniques for data integra-
tion [FKL97, MM07, MRBM05, SDH08]. This work looks at uncertain integration of
certain data and is not to be confused with our work, which addresses integration of
uncertain data itself.
Recently, data exchange has been studied over probabilistic databases [FKK10].
In contrast to our work, which combines information from multiple sources to a single
target, the work in [FKK10] only considers a single source. However, in the context of
a single source: (1) it allows more general kinds of mappings than just local-as-view
mappings; (2) it associates probabilities with possible worlds; and (3) it studies
some compact representations.
Reference [DS05] studies the problem of answering queries using imprecise sources,
where the imprecision is captured by a probabilistic database. That work presents
conditions under which view statistics are sufficient to answer a query over a mediated
database and describes algorithms for computing result-tuple probabilities. In contrast,
the goal of our work is to develop a theory for integrating uncertain sources starting
with the fundamental notion of containment. To this end, we introduce superset-
containment and equality-containment, and address the problem of consistency, none
of which are the subject of [DS05].
Finally, several papers [DHY07, Haa07, HRO06, SDH09] mention that the
problem of integrating uncertain data is important, but do not address it.
uncertain data. The hope is that this approach is in fact equivalent to the current
approach. While the two approaches provide similar utility for integration, our
idea is especially appealing because it can potentially allow us to perform entity
resolution (recall Section 5.7) after the data from all sources has been migrated
to the mediated schema.
Chapter 6
High-Confidence Joins
In uncertain databases, users often seek to compute high-confidence results for queries.
In this chapter, we discuss specialized algorithms for join queries that seek to prefer-
entially return result tuples with the highest confidence. We restrict ourselves to a
special case of uncertain databases without alternative values. The work presented in
this chapter appeared in [AW09a].
6.1 Introduction
Recall from Chapter 2 that alternatives and confidence values may be associated with
data items in uncertain databases. In this chapter, we consider the special case of
uncertain databases with confidences but without alternative values. In this special
case, result tuples have no alternatives, yet always have associated confidence values.
When result data has confidence values, the following query types become important:
CHAPTER 6. HIGH-CONFIDENCE JOINS 84
• In Section 6.2 we formalize our problem and environment: processing joins over
uncertain data with confidences, in which result confidences influence query
processing and data sets are large so memory may be a limitation.
• In Section 6.3 we provide efficient algorithms for all four query types introduced
above: Threshold, Top-k, Sorted, and Sorted-Threshold. Our algorithms are
suitable for processing stand-alone join queries, and they can be used for non-
blocking operators within a larger query plan.
• Section 6.4 presents efficiency guarantees for our algorithms. For each query
type and problem instance, we present our guarantees with respect to the most
efficient correct algorithm on that instance that uses the same primitive operations
and memory as ours.
Related work is covered in Section 6.6 and we discuss future work in Section 6.7.
6.2 Foundations
Consider two relations R and S. Generically, a join operation R ⋈_θ S on them (where
θ is the join condition) can be viewed as an exploration of the cross-product of R and
S. For example, if:
then six points in the cross-product need to be considered to evaluate R ⋈_θ S. All join
methods effectively explore this two-dimensional space, though often incorporating
[Figure: the join viewed as exploration of a two-dimensional space, with tuples of R on one axis and tuples of S on the other.]
6.3 Algorithms
6.3.1 Basics
In this section, we set up the basics for the description of the algorithms. Recall we
assume that relations can be retrieved through sorted access by confidence. More
precisely, we assume that we can ask for tuples in a relation starting at an offset
in sorted order. We also assume a monotonic combining function f that is used to
compute confidence values of result tuples from the confidence values of the joining tuples.
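A common concrete instance of such a combining function f is multiplication of the joining tuples' confidences; multiplication is an illustrative assumption here, not mandated by the text. A quick sketch spot-checks the monotonicity property that confidence-based pruning relies on:

```python
# An illustrative monotonic combining function f: the product of the
# two confidences. Monotonicity (f never decreases when either input
# increases) is what makes confidence-based pruning sound.

def f(c_r, c_s):
    return c_r * c_s

# Spot-check monotonicity on a small grid of confidence values.
grid = [i / 10 for i in range(11)]
monotone = all(
    f(a, c) <= f(b, c)
    for c in grid
    for a in grid
    for b in grid
    if a <= b
)
```

Any function with this property (e.g., minimum instead of product) would work with the algorithms that follow.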
We consider a cost metric proportional to data retrieval cost. This metric closely
resembles disk-read cost, since our algorithms all retrieve data in large sequential
chunks. Some data items may be retrieved more than once and are counted as many
times as they are retrieved. (We manage all use of memory as part of our algorithms.)
We use the following symbols in our description:
• R and S are the relations to be joined, with R usually the inner and S the outer.
• or, os, or1, or2, os1, and os2 are offsets into the sorted relations, in some unit of
memory.
• M and L are memory sizes in the same unit as the offsets above.
• load(S, os1, os2): Loads into memory tuples from relation S (outer) starting
at offset os1 and ending at os2. (It may construct a hash table on the joining
attributes to enable efficient look-ups in memory when R is scanned, if the join
condition can benefit, but doing so does not affect retrieval cost.) The cost of a
load is os2 − os1.
• scan(R, or1, or2): Scans relation R (inner) starting at offset or1 and ending at
or2. While scanning, the join condition is evaluated on each scanned tuple of R
and all tuples of S residing in memory (possibly using the hash table if the join
condition allows it). The cost of a scan is or2 − or1.
• explore(R, S, or1, or2, os1, os2): Combines load(S, os1, os2) followed by scan(R,
or1, or2). Each explore step covers a rectangle of the two-dimensional space, as
shown in Figure 6.3. The cost for the exploration of this rectangle is the sum of
the length corresponding to the scan and the breadth corresponding to the load:
(or2 − or1) + (os2 − os1).
In load, scan, and explore, sometimes os2 and or2 are not constants, but are found by
checking a condition while loading or scanning. In our pseudocode, these end-conditions
are specified alongside the function call.
We consider the setting where the total memory available is limited: specifically
there is only enough memory to load size M of a relation, plus a small amount of
additional memory for scanning and maintaining state for the algorithms. Note that
since breadth corresponds to load, an explore rectangle can have breadth no more than
M.
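The three primitives and their cost accounting can be sketched as follows, assuming relations are Python lists of (key, confidence) pairs sorted by decreasing confidence and a product combining function; all names here are illustrative, not the dissertation's implementation:

```python
# Hypothetical sketch of the load/scan/explore primitives. Costs follow
# the text: a load costs os2 - os1, a scan costs or2 - or1, and an
# explore (load then scan of one rectangle) is the sum of the two.

def load(S, os1, os2):
    """Bring S[os1:os2] into memory; cost is the breadth os2 - os1."""
    block = S[os1:os2]
    return block, os2 - os1

def scan(R, block, or1, or2, theta, f):
    """Scan R[or1:or2] against the in-memory S block, evaluating the
    join condition theta and combining confidences with f.
    Cost is the length or2 - or1."""
    results = []
    for r in R[or1:or2]:
        for s in block:
            if theta(r, s):
                results.append((r, s, f(r[1], s[1])))
    return results, or2 - or1

def explore(R, S, or1, or2, os1, os2, theta, f):
    """One rectangle of the cross-product: load, then scan.
    Total cost is (or2 - or1) + (os2 - os1)."""
    block, load_cost = load(S, os1, os2)
    results, scan_cost = scan(R, block, or1, or2, theta, f)
    return results, load_cost + scan_cost
```

A hash table on the joining attributes would replace the inner loop in `scan` without changing the retrieval cost, matching the remark in the `load` description above.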
6.3.3 Threshold
A Threshold query returns all result tuples whose confidence value is above a specified
threshold τ. We describe algorithms to efficiently compute results of Threshold queries
[Figure 6.4: M × M blocks a–y of the cross-product. Block confidences along S (decreasing): 1.0, 0.8, 0.7, 0.6, 0.5 for rows a–e, f–j, k–o, p–t, u–y; along R (decreasing): 1.0, 0.9, 0.6, 0.3, 0.1 for the five columns.]
over joins of two relations. As discussed in Section 6.2, to answer such queries, we
only need to explore a stair-like region in the cross-product. This exploration can
be performed like a nested-loop join, modified not to explore regions that cannot
contribute results with confidence above the threshold. The exploration is detailed in
Algorithm 1 and illustrated in Figure 6.5. We call this algorithm Threshold1.
In the example of Figure 6.4, for τ = 0.55, the blocks would be explored in the
following order: a, b, c, f, g, k, l, p. Blocks c, g, l, p are explored only partially: the load
and scan within explore operations also use threshold τ to terminate, as seen in the
specification of os and or in Algorithm 1. For instance, in the example the load on
block p only loads tuples in S with confidence > 0.55, and scans only tuples in R
with confidence > 0.55/0.6. The explore operation emits result tuples as it finds them,
after checking that they satisfy the threshold.
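The pruning arithmetic for block p can be checked directly, assuming multiplication as the combining function (an assumption consistent with the 0.55/0.6 bound in the text); the R-block confidences are those of Figure 6.4:

```python
# Worked check of the pruning bound for block p: an S tuple with
# confidence 0.6 can only contribute results above tau = 0.55 when
# joined with R tuples of confidence greater than tau / 0.6.

tau = 0.55
s_conf = 0.6
r_bound = tau / s_conf                 # minimum R confidence worth scanning

r_confs = [1.0, 0.9, 0.6, 0.3, 0.1]    # R block confidences from Figure 6.4
worth_scanning = [c for c in r_confs if c > r_bound]
```

Only the first R block survives the bound, which is why block p is explored partially.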
Theorem 12 in Section 6.4 shows that Threshold1 has cost less than 2 times that
Algorithm 1 Threshold1
i ← 0
while 1 do
  if conf(R, S, 0, M · i) ≤ τ then
    break
  explore(R, S, 0, or, M · i, min(M · (i + 1), os))
    {os ← min p : conf(R, S, 0, p) ≤ τ}
    {or ← min p : conf(R, S, p, M · i) ≤ τ}
  i ← i + 1
of any algorithm that explores the entire part of the cross-product that can produce
result tuples. Threshold1 performs much better than this factor when the confidence
distributions in the input relations are asymmetric.
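A minimal sketch of the Threshold1 exploration, assuming relations are lists of (key, confidence) pairs sorted by decreasing confidence, an equality join condition, and multiplication as the combining function; M is the memory size in tuples, and the helper names and block layout are illustrative, not the dissertation's code:

```python
# Threshold1 sketch: block-by-block nested-loop exploration, pruned so
# that no region that cannot beat tau is loaded or scanned.

def threshold1(R, S, tau, M):
    def cutoff(rel, bound):
        # Offset of the first tuple with confidence <= bound
        # (confidences are sorted in decreasing order).
        j = 0
        while j < len(rel) and rel[j][1] > bound:
            j += 1
        return j

    results, i = [], 0
    while True:
        # Stop when even the best R tuple cannot beat tau on this S block.
        if M * i >= len(S) or R[0][1] * S[M * i][1] <= tau:
            break
        # Load: S tuples of this block that can still beat tau.
        block = S[M * i:min(M * (i + 1), cutoff(S, tau / R[0][1]))]
        # Scan: R tuples that can beat tau with this block's best S tuple.
        for r in R[:cutoff(R, tau / S[M * i][1])]:
            for s in block:
                c = r[1] * s[1]
                if r[0] == s[0] and c > tau:
                    results.append((r[0], c))
        i += 1
    return results
```

Note how τ prunes both the portion of the S block that is loaded and the prefix of R that is scanned, mirroring the end-conditions in Algorithm 1.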
An algorithm for Threshold with a better guarantee can be devised if the shape of
the shaded region is known at the start. Suppose for example that we have meta-data
(call it D) allowing us to find the confidence values of tuples at offsets that are multiples
of M, as in Figure 6.4. We now describe an algorithm called Threshold2 that uses
this information while exploring the space.
Like Threshold1, Threshold2 explores the pruned space, but this time in a more
efficient manner. The idea is to choose the outer for the explore operation at each
step to achieve a longer scan. The choice is made by using meta-data D to determine
the approximate scan lengths for both possible outers. The exploration is detailed
in Algorithm 2 and illustrated in Figure 6.6. The function scan-lengths returns the
approximate scan lengths for both choices of outer using the meta-data D. For the
example of Figure 6.4 with threshold 0.55, the order in which the blocks would be
explored is a, f, k, p, b, g, l, c. As before, some blocks, namely p, l, and c, are explored only
partially. As in Threshold1, result tuples are emitted as they are found after checking
that they satisfy the threshold.
Theorem 13 in Section 6.4 shows that Threshold2 has cost less than 1.5 times
any algorithm that explores the entire part of the cross-product that can produce
result tuples. This factor occurs only in cases where the part of the cross-product
explored just exceeds the memory available. As the area to be explored becomes large
compared to available memory, the factor comes closer to 1, as shown by Theorem 14
in Section 6.4.
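The greedy outer choice in Threshold2 can be sketched as follows, assuming the meta-data D is the array of confidence values at block boundaries (offsets that are multiples of M) and multiplication as the combining function; names are illustrative:

```python
# scan-lengths sketch: estimate, for the current corner (or1, os1), how
# long the scan would be for each choice of outer, using only the
# block-boundary confidences (the meta-data D).

def scan_lengths(R_conf, S_conf, or1, os1, tau):
    """R_conf[i] / S_conf[j]: confidence at block boundary i of R, j of S.
    Returns approximate scan lengths, in blocks, for each choice of outer."""
    def length(confs, bound):
        n = 0
        while n < len(confs) and confs[n] > bound:
            n += 1
        return n
    lr = length(R_conf, tau / S_conf[os1])  # scan over R if S is the outer
    ls = length(S_conf, tau / R_conf[or1])  # scan over S if R is the outer
    return lr, ls
```

With the block confidences of Figure 6.4 and τ = 0.55, scanning S gives the longer scan (4 blocks versus 3), which matches the column-first order a, f, k, p in the example above.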
6.3.4 Top-k
A Top-k query returns k result tuples with the highest confidence for a query-specified
k. In this section, we discuss our algorithm for Top-k queries over two-relation joins.
The idea again is to prune the space to be explored using a threshold as in the
algorithms for Threshold. The threshold to be used is the confidence value of the kth
tuple in the result, i.e., the minimum confidence value among the Top-k result tuples.
Of course, this value is not known at the start of the algorithm.
During the algorithm we maintain a current top-k result set in a priority queue K.
When a result tuple t is generated with confidence greater than the confidence of the
lowest-confidence result tuple t0 in K, t0 is replaced by t. To avoid too much extra
Algorithm 2 Threshold2
or1 ← 0; os1 ← 0
while 1 do
  if conf(R, S, or1, os1) ≤ τ then
    break
  lr, ls ← scan-lengths(or1, os1, τ, D)
  if lr ≥ ls then
    explore(R, S, or1, or2, os1, min(M + os1, os2))
      {os2 ← min p : conf(R, S, or1, p) ≤ τ}
      {or2 ← min p : conf(R, S, p, os1) ≤ τ}
    os1 ← min(M + os1, os2)
  else
    explore(S, R, os1, os2, or1, min(M + or1, or2))
      {os2 ← min p : conf(R, S, or1, p) ≤ τ}
      {or2 ← min p : conf(R, S, p, os1) ≤ τ}
    or1 ← min(M + or1, or2)
explored becomes much larger than available memory. For such regions, larger L may
further increase efficiency. In fact, the best approach may be to vary L for different
explore steps, as discussed in Section 6.3.6.
Technically, our guarantees for Top-k hold only in cases where sufficient memory
is not available, which is the environment we are targeting. However, we can combine
our algorithm with an efficient Top-k algorithm A0 that relies on sufficient memory
being available, e.g., [IAE03]. Such algorithms also maintain a queue corresponding
to our queue K. If algorithm A0 runs out of memory during processing, it is possible
to switch to ours, continuing processing on queue K.
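The queue K of current top-k results can be sketched as a bounded min-heap, where Bottom(K) is the pruning threshold that tightens as better results arrive; this is an illustrative structure, not the dissertation's code:

```python
import heapq

# Running top-k buffer: a min-heap of at most k results whose smallest
# element is Bottom(K). A new result replaces the current bottom only
# if it has higher confidence.

class TopKBuffer:
    def __init__(self, k):
        self.k = k
        self.heap = []          # min-heap of (confidence, tuple)

    def bottom(self):
        # Pruning threshold: 0 until the buffer holds k results.
        return self.heap[0][0] if len(self.heap) == self.k else 0.0

    def offer(self, conf, tup):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (conf, tup))
        elif conf > self.heap[0][0]:
            heapq.heapreplace(self.heap, (conf, tup))

    def results(self):
        # Final top-k, in decreasing confidence.
        return sorted(self.heap, reverse=True)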
6.3.5 Sorted-Threshold
We now describe an algorithm for Sorted-Threshold, which returns result tuples
with confidence over a threshold τ, sorted by confidence. Sorted, which returns all
results sorted by confidence, corresponds to the special case where τ = 0. The
algorithm explores the same pruned space as in the Threshold problem, but in an
order resembling Top-k. The algorithm is detailed in Algorithm 4. Like Top-k, it
uses two priority queues: K for temporarily maintaining result tuples, and Q for
Algorithm 3 Top-k
oR[e] ← 0; oS[e] ← 0
confidence[e] ← 1
insert e into Q
while 1 do
  i ← ExtractMax(Q)
  if confidence[i] ≤ Bottom(K) then
    break
  or1 ← oR[i]
  os1 ← oS[i]
  explore(R, S, or1, min(L + or1, or2), os1, min(M + os1, os2))
    {os2 ← min p : conf(R, S, or1, p) ≤ Bottom(K)}
    {or2 ← min p : conf(R, S, p, os1) ≤ Bottom(K)}
  oR[r] ← or1 + L; oS[r] ← os1
  confidence[r] ← conf(R, S, oR[r], oS[r])
  insert r into Q
  oR[u] ← or1; oS[u] ← os1 + M
  confidence[u] ← conf(R, S, oR[u], oS[u])
  insert u into Q
maintaining information about the rectangles yet to be explored. The threshold used
to terminate loads and scans, and to terminate the algorithm itself, is the input ⌧ .
The explore operation pushes result tuples that satisfy the threshold into the queue K.
The algorithm maintains an upper bound on the confidences of all unseen result
tuples at any stage of its execution. Result tuples in K whose confidence is at least
this bound are emitted in sorted order as the bound decreases.
Sorted-Threshold emits a tuple as soon as the bound decreases below the tuple's
confidence, so it is a nonblocking algorithm. The algorithm explores the space in the same order
as Top-k, but we can make stronger guarantees. Theorem 17 in Section 6.4 shows
that Sorted-Threshold with L = M has cost less than 2 times that of any algorithm that
explores the space that can produce tuples with confidence > τ. Parameter L is
further discussed in Section 6.3.6.
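The nonblocking emission from K can be sketched with a max-heap: buffered results are released, already in sorted order, as soon as the upper bound on unseen confidences drops below them (an illustrative sketch, not the dissertation's code):

```python
import heapq

# Sorted emission buffer: results wait in a max-heap (negated
# confidences) and are emitted once no unseen result can exceed them.

class SortedEmitter:
    def __init__(self):
        self.K = []             # max-heap via negated confidences

    def push(self, conf, tup):
        heapq.heappush(self.K, (-conf, tup))

    def emit_upto(self, bound):
        """Emit every buffered result with confidence >= bound. Since no
        unseen result can exceed bound, the emitted order is final."""
        out = []
        while self.K and -self.K[0][0] >= bound:
            neg_conf, tup = heapq.heappop(self.K)
            out.append((-neg_conf, tup))
        return out
```

Calling `emit_upto` each time the bound decreases yields the nonblocking, sorted output described above.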
Algorithm 4 Sorted-Threshold
oR[e] ← 0; oS[e] ← 0
confidence[e] ← 1
insert e into Q
while 1 do
  i ← ExtractMax(Q)
  if confidence[i] ≤ τ then
    break
  or1 ← oR[i]
  os1 ← oS[i]
  explore(R, S, or1, min(L + or1, or2), os1, min(M + os1, os2))
    {os2 ← min p : conf(R, S, or1, p) ≤ τ}
    {or2 ← min p : conf(R, S, p, os1) ≤ τ}
  oR[r] ← or1 + L; oS[r] ← os1
  confidence[r] ← conf(R, S, oR[r], oS[r])
  insert r into Q
  oR[u] ← or1; oS[u] ← os1 + M
  confidence[u] ← conf(R, S, oR[u], oS[u])
  insert u into Q
6.3.6 Discussion
In this section we cover a few properties of and extensions to our algorithms not
discussed in their initial presentation.
Parameter L
Recall that our Top-k algorithm uses input parameter L to restrict the maximum
length of scan operations. Our efficiency guarantees are proved for L = M . However,
other choices of L may yield more efficient computations for some inputs: If the space
that needs to be explored to compute the top k results turns out to be much larger
than memory M , then it is beneficial to use a larger L. Various methods can be
employed to detect and exploit this situation. For example, if statistics are available,
we might estimate in advance the space to be explored and adjust L accordingly. L
can also be adjusted adaptively during the algorithm based on what the algorithm
has seen so far. Specifically, each invocation of the explore operation can dynamically
choose a different L. Also note that when L ≠ M is used, the algorithm is no longer
symmetric, so we can incorporate dynamic selection of the outer relation in a similar
fashion to algorithm Threshold2 (Section 6.3.3).
Sorted-Threshold may also choose to adjust parameter L, but the considerations
are somewhat different. In this case, L encapsulates a memory-cost trade-off: a larger
L, while being more cost-efficient, may require a larger buffer for result tuples. Here
too we can set L in advance based on known statistics, or modify it adaptively based
on observations during the algorithm’s execution. As before, we can also incorporate
dynamic selection of the outer relation in each explore step.
Using Indexes
Our algorithms do not assume any indexes on the join attributes, and instead use
pruning based on result confidences. Nested-loop index joins can be modified to devise
algorithms for all four query types that also prune based on result confidences:
• A buffer for result tuples is maintained for Top-k, Sorted, and Sorted-Threshold,
as in the algorithms described above.
• The confidence of the last tuple from the outer can be used to determine an upper
bound on the confidence of all unseen result tuples because of monotonicity.
These algorithms are modifications of the Threshold Algorithm [FLN03]. They are
easily used in the context of uncertain databases because they don’t need to load the
outer into memory, and hence can’t run out of memory.
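The confidence-based stopping rule for such an index-join variant can be sketched as follows, assuming multiplication as the combining function and a hypothetical `probe` function standing in for the index; all names are illustrative:

```python
# Index nested-loop join with confidence pruning: once the current
# outer confidence times the best possible inner confidence (1.0)
# cannot beat tau, monotonicity guarantees no later outer tuple can
# produce a qualifying result, so the scan stops.

def index_join_threshold(outer, probe, tau):
    """outer: (key, conf) pairs sorted by decreasing conf.
    probe(key): matching (key, conf) inner tuples via a hypothetical index."""
    results = []
    for key, conf in outer:
        if conf * 1.0 <= tau:       # upper bound on any unseen result
            break
        for _ikey, iconf in probe(key):
            c = conf * iconf
            if c > tau:
                results.append((key, c))
    return results
```

The same upper bound (outer confidence times 1.0) serves as the decreasing bound needed for the Top-k and Sorted variants mentioned above.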
Use in Queries
Our algorithms have all been specified for two-way join operations. Suppose we have
R1 ⋈ R2 ⋈ · · · ⋈ Rm, and we wish to obtain the result of this multiway join using
Threshold, Top-k, or Sorted-Threshold. Our approach is fairly simple: since Sorted
(i.e., Sorted-Threshold with τ < 0) provides access to its results in a non-blocking
fashion sorted by confidence, we build a tree of two-way joins. All lower nodes in the
tree perform the Sorted operation on their two operands, while the root performs the
desired operation (Threshold, Top-k, or Sorted-Threshold) on its operands. (Sometimes
portions of intermediate results need to be materialized, if the algorithm needs to scan
them multiple times.) Operators also have the option of using the indexed version
of the algorithms as described in Section 6.3.6, if the join condition permits it. All
algorithms that use pruning based on result confidences require their input sorted by
confidence. This may be provided either by storing the relations sorted by confidence
or by using a sort operator in the query plan, which allows other operators to be used
beneath the join. Sorted by confidence is a special "interesting" order because it can yield
efficiency benefits for a significant fraction of queries. These new algorithms introduce
new query optimization challenges that we leave for future work.
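The join-tree idea can be sketched at a high level, simulating each Sorted node by a merge of keyed confidences under a product combining function; this is purely illustrative (a real plan streams results rather than materializing them):

```python
# Left-deep tree of two-way joins: every lower node runs in Sorted mode
# (all results, ordered by decreasing confidence), and the root applies
# the desired operation, here Top-k. Relations are modeled as dicts
# mapping a join key to a confidence.

def sorted_join(left, right):
    """Join two key -> confidence maps; return (key, conf) pairs in
    decreasing combined confidence, i.e., a Sorted result."""
    out = [(k, left[k] * right[k]) for k in left.keys() & right.keys()]
    return sorted(out, key=lambda kv: -kv[1])

def multiway_topk(relations, k):
    """Fold Sorted joins left-deep, then take top-k at the root."""
    acc = relations[0]
    for rel in relations[1:]:
        acc = dict(sorted_join(acc, rel))
    return sorted(acc.items(), key=lambda kv: -kv[1])[:k]
```

Because each lower node's output is sorted by confidence, the root operator can apply the same confidence-based pruning as in the two-way case.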
• As was shown in [DS04b], “safe plans” can be found for a class of queries over
uncertain databases. A safe plan ensures that all intermediate results are inde-
pendent, allowing the query processor to compute confidences for intermediate
result tuples using the simple multiplication combining function. The operators
proposed in this chapter can be used directly in all safe plans.
to the results of the combining function, which are then refined as needed. Our
algorithms can be extended to operate on interval-approximations instead of
exact results, minimizing both data retrieval cost and (expensive) computation
of exact values.
Generalizing
We cast our presentation in the context of uncertain and probabilistic databases, since
that is the area in which we work, and it forms the context for our implementation.
However, just as with other related algorithms (to be discussed in Section 6.6, e.g.,
[IAE03]), our algorithms are quite generic. For example, they can also be applied
directly to any setting with a monotonic “scoring” function, such as middleware
systems [Fag99] and multimedia databases [CG96, NR99].
Every algorithm emits an unordered set of tuples as its result for Threshold, and
an ordered set of result tuples for Sorted-Threshold and Top-k. An algorithm is valid
if it returns the correct result for all possible inputs.
For efficiency comparisons we only consider algorithms in a class we refer to as A.
Algorithms in this class are restricted to use the same access on the data and memory
Let Cost(A, I) represent the cost of algorithm A on input I. Recall that our cost
metric, which can be applied to any algorithm in A, captures the total amount of
data retrieved through the sorted-by-confidence interface.
The following five theorems encompass our efficiency guarantees for Threshold
and Top-k. (Sorted and Sorted-Threshold are discussed after these theorems.) Each
theorem compares one of our algorithms against any valid algorithm in class A for
the same problem. Note that Threshold1, Threshold2, and Top-k are valid algorithms
in A for their respective problems.
Our results are quite strong in the following sense. Our bounds on efficiency are
based on problem instances: We show for each instance a comparison of our algorithm
against any other valid algorithm on the same instance. This comparison is stronger
than comparing algorithms on all instances—in our comparison, a different "best"
algorithm may compete with ours on different instances.
We provide intuitive proof sketches for each theorem. Full proofs are deferred to
Section 6.8.
Theorem 12. Let A_t represent the set of all valid algorithms in A for the Threshold
problem, and let I_t represent the set of all inputs. The following holds:
∀I ∈ I_t : Cost(Threshold1, I) < 2 · min_{A ∈ A_t} Cost(A, I)
For an input I to the Threshold problem, consider the part of the cross-product
of the input relations at which the combining function f evaluates to more than the
threshold τ, for example the shaded region in Figure 6.2. All valid algorithms for the
problem must cover the shaded region, i.e., they must evaluate the join condition ✓ at
all points in the region.
We show that Threshold1 has cost less than 2 times the cost of any algorithm that
covers the region. Intuitively, one lower bound corresponds roughly to the area (in
units M 2 ) of the shaded region. However, another lower bound is the semi-perimeter
(in units M ) of the stair-like shaded region, which is equal to the sum of the lengths of
the shaded x and y axes. The latter bound can sometimes be greater than the former
(intuitively, when shaded regions are very “narrow”), so these bounds are combined in
our proof. In any exploration performed by Threshold1, all but one of the rectangles
have breadth 1 (in units M). Any rectangle with cost ≥ 2 covers area at least 1, and those
with cost < 2 contribute at least 1 to the semi-perimeter. This intuition is formalized to
prove Theorem 12.
A bad case for algorithm Threshold1 occurs when all explore rectangles (say there
are b of them) have length 1 in units M. Of the b ≥ 2 rectangles, b − 1 have cost
2, and the last rectangle has cost 1, giving a factor (2b − 1)/b. Such an example can occur
only for asymmetric confidence distributions.
Theorem 13. Let A_t represent the set of all valid algorithms in A for the Threshold
problem, and let I_t represent the set of all inputs. The following holds:
∀I ∈ I_t : Cost(Threshold2, I) < (3/2) · min_{A ∈ A_t} Cost(A, I)
In fact, the factor 3/2 occurs only for very small shaded regions. For larger shaded
regions, Threshold2 achieves better factors, as formalized next. Let s_I be the number
of M × M blocks that intersect the shaded region for input I.
Theorem 14. Let A_t represent the set of all valid algorithms in A for the Threshold
problem, and let I_t represent the set of all inputs. The following holds:
∀I ∈ I_t : Cost(Threshold2, I) < ((s_I + p − c)/(s_I − c)) · min_{A ∈ A_t} Cost(A, I)
where constants p and c are such that p ≥ c and s_I ≥ p·(p+1)/2.
The value s_I is a measure of the size of the region to be explored. The factor
decreases as s_I increases, tending to 1 as s_I tends to ∞. The intuition for the guarantee
of Threshold1 also applies to both guarantees for Threshold2. Additionally, because
of the greedy choice of outer in Threshold2, the ith-from-last rectangle explored by
Threshold2 always has length at least i − 1. This intuition is formalized to prove
Theorems 13 and 14.
Now let us consider the Top-k problem. An input I satisfies the distinctness
property if confidences of all result tuples in R ⇥ S are distinct. For Top-k, we
restrict attention to inputs that satisfy this property. No valid algorithm can have any
constant-factor guarantee corresponding to our guarantees in the general case (i.e.,
without assuming distinctness). Consider, for example, inputs where all confidences are
1, k = 1, and only one arbitrarily placed pair of tuples satisfies the join condition.
Recall that algorithm Top-k has a parameter L, discussed in Section 6.3.6. We
prove the following two theorems for L = M .
Theorem 15. Let A_k represent the set of all valid algorithms in A for the Top-k
problem, and let I_k represent the set of all inputs that satisfy the distinctness property.
The following holds:
∀I ∈ I_k : Cost(Top-k, I) < 3 · min_{A ∈ A_k} Cost(A, I)
For an input I to the Top-k problem, consider the part of the cross-product where
f evaluates to at least the confidence α of the kth result tuple, i.e., our usual shaded
region with τ = α. In a manner similar to Threshold2, the factor 3 occurs only for
small shaded regions, and decreases for larger regions, as formalized next. Let s_I be
the number of M × M blocks that intersect the shaded region.
Theorem 16. Let A_k represent the set of all valid algorithms in A for the Top-k
problem, and let I_k represent the set of all inputs that satisfy the distinctness property.
The following holds:
∀I ∈ I_k : Cost(Top-k, I) < ((2·s_I − 1)/(s_I − c)) · min_{A ∈ A_k} Cost(A, I)
where constant c is such that s_I ≥ c·(c+1)/2.
Theorem 17. Let A_s represent the set of all valid algorithms in A_M for the Sorted-
Threshold problem, and let I_s represent the set of all inputs. The following holds:
∀I ∈ I_s : Cost(Sorted-Threshold, I) < 2 · min_{A ∈ A_s} Cost(A, I)
The intuition behind the proof is similar to that of Theorems 12–16. Any valid algorithm
in A_M needs to cover the part of the cross-product that can contribute results with
confidences over the threshold τ. Each M × M square with cost ≥ 2 contributes area 1,
and the others contribute their cost to the semi-perimeter.
6.5 Experiments
To evaluate the performance characteristics and trade-offs in our algorithms, we
conducted several experiments on a large synthetic data set with a variety of confidence
distributions. The main objectives and results of our experiments are summarized as
follows.
• Our algorithms have theoretical efficiency guarantees ranging from 1.5 to 3 times a
lower bound in the general case, as shown in Section 6.4. We wanted to determine
how close our algorithms are to the theoretical lower bound in practice. One
interesting outcome is that in practice, Threshold1 and Threshold2 are similarly
close to the lower bound, except when the input confidence distributions are
asymmetric, in which case Threshold2 is much closer than Threshold1.
• Our efficiency guarantees hold for all distributions of confidence values on the
input relations. We wanted to see how different confidence distributions affect
the performance of our algorithms. In general, we do not see a dramatic change
with different distributions, except as they affect the actual result.
• Each data set has |R| = |S| = N where N is either one million or ten million
tuples.
• The join between R and S is one-to-one in each data set. The joining pairs are
independent of their relative confidence values, i.e., of their relative positions in
sorted-by-confidence order.
[Figure 6.7: Confidence Distributions. Confidence (0–1) versus tuple position (×10^6) for Distributions 1 through 6.]
6.5.2 Threshold
For the Threshold problem, we compare our two algorithms, Threshold1 and
Threshold2, against a lower-bound cost, as discussed in Section 6.4. We show four graphs.
Figures 6.8 and 6.9 consider “symmetric” distributions in the two relations: we
use Distribution 6 for both R and S. Figures 6.10 and 6.11 consider “asymmetric”
[Figure 6.8: Cost (×10^5) versus threshold τ, symmetric distributions.]
6.5.3 Top-k
For the Top-k problem we have four graphs:
• Figures 6.12 and 6.13 show how cost varies with k. (Figure 6.12 is for lower
values of k, while Figure 6.13 “continues” Figure 6.12 with much higher values
to confirm the trend.) We ran our Top-k algorithm to obtain its cost, noting
[Figure 6.9: Effect of M on Threshold, symmetric distributions. Cost versus memory M (×10^3).]
the confidence of the kth result tuple. Using this confidence as threshold τ,
we plotted the cost of Threshold2 and the lower bound (as in the Threshold
experiments; Section 6.5.2). Note that the gap in performance between Top-k
and the other two algorithms is expected since Top-k cannot know τ in advance.
Also note that the cost of Top-k is significantly lower than the relation sizes.
• Figure 6.14 shows how the cost of Top-k is affected by memory size M. Again,
we plot Threshold2 and the lower bound using the τ from the execution of Top-k.
[Figure 6.10: Cost versus threshold τ, asymmetric distributions.]
Before looking at our results for Sorted, we consider the effect of confidence
distributions on Threshold and Top-k. (We will separately consider the issue for
Sorted.) Figure 6.16 plots the cost of Threshold2 and Top-k, along with the lower
bound, for each of the six distributions shown in Figure 6.7 on both relations. (As in
the Top-k graphs, Threshold2 and the lower bound use the τ from Top-k.) We see
that the distribution has little effect on cost.
6.5.4 Sorted
Our Sorted experiments all use the Sorted-Threshold algorithm with threshold < 0, i.e.,
they deliver a full sorted result. Our first conclusion, shown clearly in Figure 6.17, is
that the cost of Sorted is proportional to the number of results emitted. The two plots
correspond to two different distributions, and behave nearly identically. Figures 6.18
and 6.19 consider different settings for parameter L, seeing how they affect cost and
memory use, respectively, as tuples are emitted. Comparing the two graphs we see the
trade-off clearly: a lower L incurs higher cost but requires less memory. In Figure 6.19,
[Figure 6.11: Effect of M on Threshold, asymmetric distributions. Cost (×10^5) versus memory M (×10^3).]
the y-axis plots the largest buffer used so far in the execution. The horizontal steps
correspond to the points at which a batch of result tuples is emitted.
[Figure 6.12: Cost versus k (×100).]
[Figure 6.13: Cost versus k (×1000).]
[Figure 6.14: Effect of M on Top-k. Cost versus memory M (×10^3).]
[Figure 6.15: Effect of L on Top-k. Cost (×10^5) versus parameter L (×10^3).]
6.8 Proofs
We first prove three lemmas, which are then used to prove the six theorems from the
body of the chapter. For an input I, consider a shaded region SI corresponding to
points that can produce tuples with confidence greater than a threshold τ. For Top-k,
τ is the confidence of the kth result tuple.
Lemma 9. Any valid algorithm A ∈ A must evaluate the join condition at each point
in SI for input I.
Proof. Suppose the join condition was not evaluated at point p ∈ SI by algorithm
A. Consider input I′ that differs from I only in the outcome of the join condition at
point p. Algorithm A cannot distinguish input I from I′, and hence returns the same
result. But I and I′ have different correct results; hence algorithm A cannot be valid.
Recall that M represents the memory available to load tuples from the input
relations.

[Figure 6.16: Cost of Threshold2, Top-k, and the lower bound (×10^5) for each Distribution d.]

Let sI be the number of M × M blocks that intersect region SI corresponding
to input I. For a block b, let Xb represent the fraction of its lower edge that is within
the shaded region. Correspondingly, let Yb represent the fraction of the left edge that
is within the shaded region. Those blocks that have no block above or to the right
intersecting the shaded region are referred to as corner blocks – let there be cI such
blocks. We consider cost in units of M .
Lemma 10. Let algorithm A ∈ A evaluate the join condition at all points in region
SI. The following holds:
Cost(A, I) ≥ sI − cI + Σ_{corner blocks b} (Xb + Yb)

where sI ≥ cI · (cI + 1)/2.
Proof. Any algorithm’s execution can be viewed as a sequence of load and scan
operations, which can be visualized like the explore steps in our descriptions. Note
that the shapes formed in the visualization are not restricted to rectangles, but they
are restricted to have width at most M. The cost of a shape is its semi-perimeter.

[Figure 6.17: Emit Rate for Sorted-Threshold. Cost versus result tuples emitted (×10^5).]
We allocate the cost of shapes to the blocks they intersect in order to provide a lower
bound for the cost of the algorithm. We consider the right and upper edges of a shape
and allocate them to the block they lie in.
Formally, the restriction on the shapes is that for any interior point p, there is a
right or upper edge within distance 1 of p. It is easy to observe that any block b with
max(Xb, Yb) = 1 has cost at least 1. Also, for the corner blocks the cost is at least
Xb + Yb.
The relationship between sI and cI is obtained by observing that no row or column
has more than 1 corner block.
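The counting argument can be sketched concretely. The following is an illustrative sketch (not from the dissertation; the helper name is ours): the shaded region is represented by the number of M × M row blocks it intersects in each block column, non-increasing left to right, as when both relations are processed in sorted-by-confidence order.

```python
def block_counts(heights):
    """Given a non-increasing staircase (heights[i] = number of M x M row
    blocks the shaded region intersects in block column i), return
    (s_I, c_I): total intersected blocks and corner blocks.

    A corner block has no intersecting block above it or to its right, so
    column i contributes one exactly when it is nonempty and is either
    the last column or strictly taller than the column to its right.
    """
    assert all(a >= b for a, b in zip(heights, heights[1:]))
    s = sum(heights)
    c = sum(1 for i, h in enumerate(heights)
            if h > 0 and (i == len(heights) - 1 or heights[i + 1] < h))
    return s, c

# Corner blocks have strictly decreasing heights left to right, so c_I
# corner blocks force at least 1 + 2 + ... + c_I blocks in total:
for staircase in ([3, 3, 2], [5, 4, 2, 1], [1], [4, 4, 4]):
    s, c = block_counts(staircase)
    assert s >= c * (c + 1) // 2
```

The final loop checks the lemma's relationship sI ≥ cI·(cI + 1)/2 on a few staircases.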
Let HI and WI be the height and width, respectively, of the parts of the two relations
that the shaded region SI covers, in units of M.
Lemma 11. Let algorithm A ∈ A evaluate the join condition at all points in region
SI. Then Cost(A, I) ≥ HI + WI.

[Figure 6.18: Effect of L on Cost of Sorted-Threshold. Cost (×10^6) versus result tuples emitted (×10^5).]
Proof. Any algorithm that evaluates the join condition at all points in SI must read
the part of each relation that is covered by SI. The cost of doing this is HI + WI.
Theorem 12. Let At represent the set of all valid algorithms in A for the Threshold
problem, and let It represent the set of all inputs. The following holds:
Proof. Consider the cost of the Threshold1 algorithm. The scan cost for all blocks
except the rightmost in each row is 1. For every row except the topmost, the load
cost is 1. Hence the following holds:
Cost(Threshold1, I) ≤ sI − 1 + Y_lefttop + Σ_{b ∈ rightmost} Xb    (6.1)
[Figure 6.19: Effect of L on Memory Use of Sorted-Threshold. Largest buffer used versus result tuples emitted (×10^5).]
Lemma 9, together with arithmetic manipulation of (6.1) using Lemmas 10 and 11,
proves the theorem.
Theorem 14. Let At represent the set of all valid algorithms in A for the Threshold
problem, and let It represent the set of all inputs. The following holds:
∀ I ∈ It :  Cost(Threshold2, I) < ((sI + p − c)/(sI − c)) · min_{A ∈ At} Cost(A, I)

where the constants p and c are such that p ≥ c and sI ≥ p · (p + 1)/2.
Proof. Consider the cost of the Threshold2 algorithm. Let it make pI greedy choices.
The load cost is at most pI . For each non-corner block the scan cost is at most 1,
while for a corner block b, it is at most max(Xb , Yb ). The following holds:
Cost(Threshold2, I) ≤ pI + sI − cI + Σ_{corner blocks b} max(Xb, Yb)    (6.2)

Clearly pI ≥ cI, since a scan has to terminate at each corner block. Also, the
relationship between sI and pI is obtained by observing that the length of the
i-th-from-last scan is at least i. Using Lemmas 9 and 10 with (6.2)
proves the theorem.
Theorem 13. Let At represent the set of all valid algorithms in A for the Threshold
problem, and let It represent the set of all inputs. The following holds:
∀ I ∈ It :  Cost(Threshold2, I) < (3/2) · min_{A ∈ At} Cost(A, I)

Proof. For sI ≥ 15, Theorem 14 implies the result. For smaller sI, case analysis proves
the theorem using Lemmas 9, 10, and 11.
Theorem 16. Let Ak represent the set of all valid algorithms in A for the Top-k
problem, and let Ik represent the set of all inputs that satisfy the distinctness property.
The following holds:
∀ I ∈ Ik :  Cost(Top-k, I) < ((2 · sI − 1)/(sI − c)) · min_{A ∈ Ak} Cost(A, I)

where the constant c is such that sI ≥ c · (c + 1)/2.
Proof. Consider the cost of the Top-k algorithm. It has cost at most 2 for each block
except the last. For the last block b it has cost at most 1 + max(Xb , Yb ). The following
holds:
Cost(Top-k, I) ≤ 2 · sI − 1 + max_{corner blocks b} max(Xb, Yb)    (6.3)
Theorem 15. Let Ak represent the set of all valid algorithms in A for the Top-k
problem, and let Ik represent the set of all inputs that satisfy the distinctness property.
The following holds:

∀ I ∈ Ik :  Cost(Top-k, I) < 3 · min_{A ∈ Ak} Cost(A, I)

Proof. For sI ≥ 12, Theorem 16 implies the result. For smaller sI, case analysis proves
the theorem using Lemmas 9, 10, and 11.
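As a quick numerical cross-check of how Theorems 15 and 16 fit together (a sketch, not part of the dissertation's proofs; the helper name is ours): taking the largest c with sI ≥ c·(c + 1)/2 gives the weakest Theorem 16 factor, and that factor drops below the constant 3 exactly once sI ≥ 12, matching the case split in the proof.

```python
def topk_factor(s):
    """Weakest Theorem-16 factor (2*s - 1)/(s - c), using the largest
    integer c satisfying s >= c*(c+1)/2."""
    c = 0
    while (c + 1) * (c + 2) // 2 <= s:
        c += 1
    return (2 * s - 1) / (s - c)

# At s_I = 11 the factor is exactly 3; from s_I = 12 on it stays below 3.
assert topk_factor(11) == 3.0
assert all(topk_factor(s) < 3 for s in range(12, 2000))
```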
Theorem 17. Let As represent the set of all valid algorithms in AM for the Sorted-
Threshold problem, and let Is represent the set of all inputs. The following holds:
Proof. Consider the cost of the Sorted-Threshold algorithm. It has cost at most Xb + Yb
for each block b. This gives us the following:

Cost(Sorted-Threshold, I) ≤ 2 · (sI − cI) + Σ_{b ∈ corner blocks} (Xb + Yb)    (6.4)
Chapter 7

Native Query Optimization in Trio

The Trio system is developed on top of a conventional DBMS [ABS+06]. Uncertain
data with lineage is encoded in relational tables, and Trio queries are translated to
SQL queries on the encoding. Such a layered approach reaps significant benefits in
terms of architectural simplicity, and the ability to use an off-the-shelf data storage
and query processing engine. However, it fails to recognize and hence capitalize on
the regular structure that the encoded Trio relations possess, and thus misses out
on the benefits of specialized query optimization. This chapter describes first steps
towards building a native query optimizer for Trio. The work presented in this chapter
appeared in [SANW08].
7.1 Introduction
Recall from Section 2.2 that the basic construct for uncertainty in Trio’s ULDB data
model is alternatives. Alternatives in a tuple specify a nonempty finite set of possible
values for the tuple. For example:
contains a tuple with two alternatives giving the two possible values for the tuple. The
Trio system is layered on top of a conventional relational DBMS [MTK+ 07]. ULDB
CHAPTER 7. NATIVE QUERY OPTIMIZATION IN TRIO 122
• We motivate the need for considering a new interesting order in query plans,
and design an operator that ensures this order. (Section 7.4)
• We enumerate the statistics necessary for choosing the optimal query plan from
the set of all query plans combining conventional and new operators described
above. We present histograms that enable estimating these statistics efficiently
and accurately. (Section 7.5)
• We discuss a variety of interesting and challenging problems our work opens up,
which can form the basis for further research in the area. (Section 7.6)
Section 7.2 introduces our relational encoding for the ULDB data model, and we
conclude with a discussion of related work in Section 7.7.
7.2.1 Queries
Queries over ULDB relations are translated to queries over the encoded relations.
To obtain xid’s on the resulting relation, the alternatives of the result need to be
grouped based which x-tuple they are a part of. Therefore, all tuples in the result of
the translated query are grouped by xids of the input relations. For example, if we
perform a join of relations R and S, the translated query over Renc and Senc includes
the clause group-by Renc .xid,Senc .xid. Details of this translation can be found
in [MTK+ 07, BSH+ 08].
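The grouping step can be sketched as follows (an illustrative sketch with hypothetical encodings, not Trio's actual schema): alternatives carry their x-tuple's xid, and the result alternatives of a join are grouped by the pair of input xids, mirroring the group-by Renc.xid, Senc.xid clause.

```python
from collections import defaultdict

# Hypothetical encoded relations: each alternative is (xid, value).
r_enc = [(1, "a"), (1, "b"), (2, "c")]
s_enc = [(10, "b"), (10, "c"), (11, "c")]

def join_and_group(r, s):
    """Join alternatives on equal values, then group the result
    alternatives by the pair of input xids; each group is one result
    x-tuple, as the translated query's group-by produces."""
    groups = defaultdict(list)
    for r_xid, r_val in r:
        for s_xid, s_val in s:
            if r_val == s_val:            # stand-in join condition
                groups[(r_xid, s_xid)].append((r_val, s_val))
    return dict(groups)

result = join_and_group(r_enc, s_enc)
# Three result x-tuples, keyed (1,10), (2,10), and (2,11).
```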
• Index on A
An index on A may be used to retrieve only the tuples that satisfy the predicate
A ≥ 5. This index lets us efficiently retrieve all alternatives that satisfy the
predicate, but now they need to be grouped to form x-tuples. A sort on xid is
required to group the result alternatives into x-tuples. A query plan like this
can be efficient for highly selective queries, i.e., the result contains very few
alternatives, making the grouping step inexpensive.
• Index on xid
An index on xid lets us retrieve all alternatives in an x-tuple together. This
allows us to avoid the sort since result tuples are generated already grouped by
xid. This index may be useful if the predicate is not very selective, especially if
the data is stored clustered by xid.
• Index on (xid,A)
If x-tuples are very wide, i.e., contain a large number of alternatives, we may
be able to use an index on (xid, A), to only retrieve alternatives that satisfy
the predicate. This also avoids the sort at the end, and may yield efficient
executions.
• Index on (A,xid)
For queries that use an equality predicate on A, it might be useful to use an
index that returns all alternatives satisfying the predicate grouped by xid. This
index allows us to avoid the sort which may be expensive for large results. But
equality predicates seldom have large results. We would ideally like an index
that also works for range queries, and can also avoid the sort.
The indexes discussed above help either in avoiding the sort required for the group-
by on xid, or in pruning down the amount of data accessed by evaluating the predicate
before retrieving the tuples. An index on (A,xid) accomplishes both objectives for
equality predicates on A. We now describe a new index that generalizes this index for
efficient range scans over relations stored clustered by xid.
• Index on Ax
An index on Ax refers to an index that indexes x-tuples instead of tuples. Each
Let R = {R1 , · · · , Rn } be the relations in the from clause of a query. The translated
query contains a group by on the xid attribute of all relations in R. The subtree
rooted at a node N in the query tree has joined relations in RN ⊆ R. Suppose
the tuples flowing out of the node N are guaranteed to be grouped by xid for a set
BN ⊆ RN. The node N is said to give the XGroup interesting order on the set of
relations BN . We need the result to be XGrouped on R.
An indexed access on relation R using the index on Ax for some attribute A
guarantees an XGroup on {R}. Other access methods that use an index on xid or
(xid,A), or that scan relations clustered by xid also give an XGroup on {R}. Selection
and projection operators preserve the interesting order, i.e., if the input is XGrouped
on B, the result from the operator is also XGrouped on B.
Many join operators also preserve XGroup of the inputs. For a join operator, let
the inner relation be XGrouped on Bi and let the outer be XGrouped on Bo . It is easy
to see that operators like nested-loop join (or nested-block joins) and nested-loop index
join preserve the XGroup on both the inner and the outer, i.e., the result is XGrouped
on Bi ∪ Bo. Some other join operators may only preserve the XGroup of one of the
inputs. For instance a hash-join preserves the XGroup of the outer, i.e., returns tuples
XGrouped on Bo . Many operators in traditional database query processors naturally
preserve XGroup, making it efficient to maintain and utilize.
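To illustrate why a hash join preserves the outer's XGroup (a sketch with hypothetical tuple shapes, not Trio's implementation): the inner is hashed, the outer is streamed, and output order follows the outer, so tuples sharing an outer xid remain adjacent.

```python
from collections import defaultdict

def hash_join(outer, inner):
    """Build a hash table on the inner's join key, then stream the outer;
    output order follows the outer, so grouping by outer xid survives."""
    table = defaultdict(list)
    for i_xid, i_key in inner:
        table[i_key].append(i_xid)
    for o_xid, o_key in outer:
        for i_xid in table[o_key]:
            yield (o_xid, i_xid)

outer = [(1, "x"), (1, "y"), (2, "x")]   # XGrouped: xid 1's alternatives adjacent
inner = [(8, "y"), (7, "x"), (9, "x")]   # arbitrary order
out = list(hash_join(outer, inner))

# Outer xids appear in contiguous runs (1, 1, 1, then 2, 2):
runs = [x for i, (x, _) in enumerate(out) if i == 0 or out[i - 1][0] != x]
assert len(runs) == len(set(runs))
```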
XGroup can often help us avoid an expensive group-by (usually implemented
through a sort). If all relations in a query are accessed using access methods that
guarantee the XGroup on the relation, and all operators preserve the interesting order,
then no group by needs to be used in the query plan. Our selection example earlier
was the simplest case that illustrates how the group by can be eliminated. Requiring
that every access method and operator in a query plan preserve the XGroup property
results in a very small set of potential plans that can exploit XGroup. We would like
to allow a broader class of query plans in which only a subset of the access methods
and operators preserve the XGroup property; such plans would still provide benefits
by making the group by cheaper.
We propose a new unary operator XGB: Let the input to XGB be XGrouped on
Bi. XGB returns an output XGrouped on Bo ⊇ Bi. XGB essentially breaks up
each group in the input into smaller groups, and hence it only needs to look at one
group in the input at a time. If the groups in the input are not large, this can be very
cheap, and it operates in a non-blocking fashion. Hence the use of the XGB operator
in conjunction with an access method which provides the XGroup interesting order,
allows us to incrementally do the group by. The presence of even one access method
that guarantees the XGroup interesting order can drastically reduce the cost of the
group by. Roughly, if a relation with x tuples is XGrouped in a query with y tuples
in the result, using XGB and XGroup requires at most sorting x groups of y/x
tuples, in contrast to a sort on y tuples.
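The incremental regrouping can be sketched as follows (an illustrative sketch; names and tuple shapes are ours): since the input arrives in runs sharing the coarse grouping key, each run can be regrouped by the finer key on its own, buffering one run at a time.

```python
from itertools import groupby

def xgb(stream, coarse_key, fine_key):
    """XGB sketch: the input is assumed grouped (adjacent runs) by
    coarse_key; regroup each run by fine_key independently. Sorting x
    runs of roughly y/x tuples replaces one sort of all y tuples."""
    for _, run in groupby(stream, key=coarse_key):
        for t in sorted(run, key=fine_key):   # buffer only this run
            yield t

# Hypothetical join results (r_xid, s_xid), already XGrouped on R.
tuples = [(1, 9), (1, 7), (1, 9), (2, 5), (2, 4)]
refined = list(xgb(tuples, coarse_key=lambda t: t[0], fine_key=lambda t: t[1]))
# refined is now grouped by the pair: (1,7), (1,9), (1,9), (2,4), (2,5)
```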
• Index on A
To estimate the cost of accessing the relation through an index on A, the optimizer
needs to determine the number of alternatives that satisfy the predicate.
• Index on xid
To estimate the number of x-tuples retrieved using an index on xid, the optimizer
needs to know the total number of x-tuples in the relation R.
• Index on (xid,A)
As discussed in Section 7.3, this index is useful only if x-tuples contain a large
number of alternatives. Hence, to decide whether or not to use this index, an
estimate indicating the average number of alternatives per x-tuple (width of
CHAPTER 7. NATIVE QUERY OPTIMIZATION IN TRIO 129
x-tuple) is useful. The average width of x-tuples satisfying the predicate might
differ from the average width for the entire relation, and can yield more accurate
cost estimates.
• Index on (A,xid)
The number of alternatives that satisfy the predicate determines the cost of
accessing the relation using an index on (A,xid). In addition, the number of x-
tuples returned is determined by estimating the number of x-tuples that contain
at least one alternative that satisfies the predicate (which should be an equality
predicate).
• Index on Ax
The number of random accesses made in accessing the relation R using an index on
Ax is determined by the number of x-tuples that contain at least one alternative
satisfying the predicate.
Next we consider the statistics that cannot be maintained exactly (because of the
data size), and hence need to be estimated.
Clearly, the above cardinalities translate to counting the number of tuples in Renc
satisfying A = v and x ≤ A ≤ y, respectively. These cardinalities over the conven-
tional relational table Renc can be estimated using well-known sampling or histogram
techniques. For example, in Trio we can build a histogram over Renc that counts the
number of tuples in it: we have various histogram buckets corresponding to the range
of all possible values in A. A histogram bucket with bucket boundary [p, q) maintains
the count of the number of tuples in Renc that satisfy p ≤ A < q. Now, we can use
the bucket frequencies to estimate the A-selectivity.
Example 4. For relation R with integer attribute A, suppose the histogram on A has
bucket intervals 0, 5, 10, and so on; i.e., bucket B0,5 stores the number of tuples in
Renc with 0 ≤ A < 5, B5,10 for 5 ≤ A < 10, etc. To compute the number of tuples in
Renc satisfying 3 ≤ A ≤ 8, we evaluate (2 · B0,5)/5 + (4 · B5,10)/5.
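Example 4's estimate follows the usual uniformity assumption within buckets; a minimal sketch (the helper name and bucket layout are ours, not Trio's):

```python
def estimate_count(buckets, lo, hi):
    """Estimate |{t : lo <= t.A <= hi}| for integer attribute A.

    buckets maps (p, q) -> number of tuples with p <= A < q; integer A
    values are assumed uniformly distributed within each bucket.
    """
    total = 0.0
    for (p, q), count in buckets.items():
        shared = max(0, min(q, hi + 1) - max(p, lo))  # integer values in common
        total += count * shared / (q - p)
    return total

# Example 4's query 3 <= A <= 8 takes 2/5 of B[0,5) and 4/5 of B[5,10):
buckets = {(0, 5): 10, (5, 10): 20, (10, 15): 30}
assert estimate_count(buckets, 3, 8) == 2 / 5 * 10 + 4 / 5 * 20
```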
(1) at least k alternatives with A value in the range [p, q), or (2) alternatives with at
least k distinct A values in the range [p, q). Recall that we use (1) above if we assume
the distribution of all alternatives is independent, and we use (2) if we assume the
distribution of distinct A values is independent.
For A-Average we can use a histogram similar to the one for X-Selectivity, but by
eliminating the size dimension. We can then store the total count of alternatives from
the x-tuples in each bucket, instead of storing the number of x-tuples.
• Construction and Maintenance: The first question that arises about our
techniques presented in Section 7.5 is how we construct the histograms that
enable estimating the various statistics. Construction of a histogram entails
deciding the bucket boundaries for all the dimensions followed by populating
the counts in each bucket. Recall that our histograms for ULDB relations
translate to statistics on the encoded conventional relations. Hence we can
leverage previously proposed techniques for determining the bucket boundaries
and constructing the histogram through sampling [PSC84, CMN98]. Alternatively,
we could scan the entire relations a priori and compute exact frequencies for all
buckets.
Specifically, they do not study new operators, indexing techniques, and statistics
estimation, which are the subject of this chapter.
Two notable pieces of work studying query processing in probabilistic databases
are [DS04b, DTW08c], whose focus is on probability computation. Reference [DS04b]
characterizes when a query can be computed using a safe plan, which can be very
efficiently executed. Reference [DTW08c] studies how lineage in Trio can aid in making
confidence computation more efficient. Importantly, it shows that lineage allows us to
decouple data and confidence computation in Trio, enabling us to use any query plan
for data computation. It is this decoupling that allows the optimizer to consider any
query plan, without worrying about confidence computation.
The work closest to ours is that of [SMP+ 07], which proposes new techniques for
indexing uncertain data with probabilities. The indexes proposed in [SMP+07] are
different from ours and focus on the probabilities. Their indexes are very useful for
queries with thresholds on probabilities. However, as described above, Trio adopts
lineage-based probability computation, and hence we focus only on data computation
in this chapter. We have proposed new indexing mechanisms specific to the ULDB data
model and its relational encoding, which are more useful for Trio query processing.
Finally, [SMP+ 07] does not consider the problem of estimating various kinds of
statistics on uncertain data.
There is, of course, a huge body of work studying every aspect of query
optimization for conventional databases. Since there have been many papers written
on every topic, including relational operations, indexing, histograms, and query plan
selection, we do not review this literature here; instead we refer the reader to any
standard database textbook such as [GMWU02].
Chapter 8
Summary
This thesis makes contributions to several aspects of managing uncertain data, includ-
ing: (1) generalizing the types of uncertain data that can be managed; (2) providing
techniques for the integration of uncertain data; and (3) developing efficient algorithms
for processing uncertain data. We summarize the main contributions of the thesis:
[ABS+ 06] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth,
Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A System
for Data, Uncertainty, and Lineage (Demo). In Proceedings of VLDB,
2006.
BIBLIOGRAPHY 140
[CBF04] Sunil Choenni, Henk Ernst Blok, and Maarten Fokkinga. Extending
the relational model with uncertainty and ignorance. Internal Report,
University of Twente, 2004.
[CBL06] Sunil Choenni, Henk Ernst Blok, and Erik Leertouwer. Handling uncer-
tainty and ignorance in databases: A rule to combine dependent data.
In DASFAA, 2006.
[CSP+ 06] Reynold Cheng, Sarvjeet Singh, Sunil Prabhakar, Rahul Shah, Jef-
frey Scott Vitter, and Yuni Xia. Efficient join processing over uncertain
data. In ACM CIKM, 2006.
[DGS09] Amol Deshpande, Lise Getoor, and Prithviraj Sen. Graphical models
for uncertain data. In Managing and Mining Uncertain Data. Springer,
2009.
[DS05] N. Dalvi and D. Suciu. Answering queries from statistics and probabilistic
views. In Proc. of VLDB, 2005.
[Haa07] L. Haas. Beauty and the beast: The theory and practice of information
integration. In ICDT, 2007.
[KS91] Robert Kennes and Philippe Smets. Fast algorithms for Dempster-Shafer
theory. In Uncertainty in Knowledge Bases. Springer, 1991.
[NCS+ 01] A. Natsev, Y-C. Chang, J. R. Smith, C-S Li, and J. S. Vitter. Supporting
incremental join queries on ranked inputs. In VLDB, 2001.
[SANW08] Anish Das Sarma, Parag Agrawal, Shubha Nabar, and Jennifer Widom.
Towards Special-Purpose Indexes and Statistics for Uncertain Data. In
Proceedings of the Workshop on Management of Uncertain Data (MUD),
2008.
[YL08] Ronald R. Yager and Liping Liu. Classic Works of the Dempster-Shafer
Theory of Belief Functions. Springer, 2008.