AND INTEGRATION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Parag Agrawal
August 2012
© 2012 by Parag Agrawal. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jeffrey Ullman
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Peter Haas
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
Contents
1 Introduction 1
1.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Contributions and Thesis Outline . . . . . . . . . . . . . 2
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Preliminaries 7
2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Uncertain Databases . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Queries over Uncertain Databases . . . . . . . . . . . . . . . . 8
2.2 ULDB Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Generalized Uncertain Databases 28
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.2 Query Semantics . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.1 Uncertainty Evaluation . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Uncertain-Data Integration 47
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Containment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.1 Uncertain Databases . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2 Containment Definitions . . . . . . . . . . . . . . . . . . . . . 52
5.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Queries, Views, and Sources . . . . . . . . . . . . . . . . . . . . . . . 57
5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.5.1 Intensional complexity . . . . . . . . . . . . . . . . . . . . . . 60
5.5.2 Extensional complexity . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Query Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.6.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.6.2 Superset-Containment . . . . . . . . . . . . . . . . . . . . . . 70
5.6.3 Equality-Containment . . . . . . . . . . . . . . . . . . . . . . 71
5.7 Monotonic Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.8 Incorporating Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 76
5.8.1 Source Reliability . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.10 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6 High-Confidence Joins 83
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2 Foundations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.3.2 Running Example . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.3 Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.4 Top-k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.5 Sorted-Threshold . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.4 Efficiency Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5.2 Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.5.3 Top-k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5.4 Sorted . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.7 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.8 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5.2 Alternative Counts . . . . . . . . . . . . . . . . . . . . . . . . 130
7.5.3 X-tuple Counts . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.5.4 Other Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.6 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 133
7.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Summary 136
Bibliography 139
Chapter 1
Introduction
1.1 Applications
We briefly discuss some applications that motivate our work on uncertain data
management. We specifically highlight the kind of uncertainty they introduce.
• Information extraction: Extracted data can often be uncertain. For example,
an extractor may only be able to
determine that the director of a movie is either Robert or Richard. Furthermore,
the extraction system might output confidence values of .7 that the director is
Robert and .3 that it is Richard.
• Entity resolution: There are often multiple ways of referring to the same
real-world entity. The process of determining if two data items refer to the same
entity is inherently approximate. As a result, entity resolution often results in
uncertain information. For example, an algorithm might determine that Bob
Doe and Robert D. refer to the same person with confidence at least .8.
• Sensor data: Sensors aim to measure the real world, but are imprecise due to
limitations in the physical world. For example, a temperature sensor measurement
may be described by a Gaussian distribution around a reported reading of 90F,
with a standard deviation of 5F.
The key challenge in uncertain data management is to efficiently represent and query
large amounts of data with the various forms of uncertainty detailed above.
Continuous Uncertainty
Many efforts in the field of uncertain databases use a data model based on a set of
alternative values for tuples, and/or probabilities (or confidences) assigned to a tuple’s
existence. However, some applications must manage data that cannot be represented
with such forms of discrete uncertainty. For example, sensor measurements may be
described by a Gaussian distribution around the reported reading (based on sensor
calibration); astronomical data may contain locations described by two-dimensional
Gaussians; a database that records time-of-day may use intervals bounding the possible
time; predictive models stored in a database may include continuous probability
distributions.
Trio is a system developed at Stanford for managing data, uncertainty, and lin-
eage [ABS+ 06]. In Chapter 3, we discuss extensions to Trio for incorporating continuous
uncertainty. Data items with uncertain possible values drawn from a continuous do-
main are represented through a generic set of functions. Our approach enables precise
and efficient representation of arbitrary probability distribution functions, along with
standard distributions such as Gaussians. We also describe how queries are processed
efficiently over this representation, without knowledge of specific distributions. For
queries that cannot be answered exactly, we can provide approximate answers using
sampling or histogram approximations, offering the user a cost-precision trade-off.
Our approach exploits Trio’s lineage and confidence features, with smooth integration
into the overall data model and system.
Generalized Uncertainty
Uncertain database systems enable management of incomplete or imprecise information.
However, most uncertain databases, including our own work in Trio, require that
exact confidence values are attached to the data being managed. In some applications,
confidence values may be known imprecisely or coarsely, or be missing altogether.
In Chapter 4, we introduce the notion of generalized uncertain databases to manage
such incomplete uncertainty. We propose a semantics for generalized uncertain
databases based on Dempster-Shafer theory [Sha76] that is consistent with previous
semantics for uncertain databases. We present an extension of the representation
scheme used by Trio in order to represent any generalized uncertain database. Finally,
we adapt Trio’s query processing techniques to operate over this new representation.
High-Confidence Joins
In uncertain databases, users often seek to compute high-confidence results for queries.
In Chapter 6, we discuss specialized algorithms for join queries that seek to prefer-
entially return result tuples with the highest confidence. We restrict ourselves to
a special case of uncertain databases without alternative values, called probabilistic
databases.
When joining uncertain data, confidence values are assigned to result tuples based
on combining confidences from the input data. To preferentially access high-confidence
results, users may wish to apply a threshold on result confidence values, ask for
the “top-k” results by confidence, or obtain results sorted by confidence. Efficient
algorithms for these types of queries can be devised by accessing the input data sorted
by confidence and exploiting the monotonicity of the confidence combination function.
Previous algorithms for these problems assumed sufficient memory was available for
processing. We address the problem of processing all three types of queries when
sufficient memory is not available, minimizing retrieval cost. All our algorithms are
proven to be close to optimal.
optimization.
Chapter 7 describes the first steps towards building a native query optimizer in Trio.
We demonstrate that there is indeed an opportunity to obtain significantly increased
performance when the well-defined structure of Trio relations (and their encoding) is
exploited. We then present several indexing techniques, study new statistics, design
histograms for accurately estimating our new statistics, and describe a new “interesting
order” and its associated operator to exploit this opportunity.
Preliminaries
In this chapter, we set up some preliminaries that the rest of the thesis relies on. We
start in Section 2.1 by reviewing basic definitions related to the concept of uncertain
databases. A lot of the work in this thesis is in the context of the Trio system, which
is a specific instantiation of an uncertain database system. Section 2.2 reviews the
ULDB data model used by the Trio system to represent uncertain data, and query
semantics in Trio.
2.1 Definitions
The term probabilistic databases is also commonly used to refer to uncertain databases.
sensor-id zone
1 A || B
Location(sensor-id,zone)
(1,A):0.4 || (1,B):0.6
(2,C):0.7
The above table represents an uncertain database (Definition 1) with four possible
worlds, and a probability distribution over them:
ID Location(sensor-id,zone)
11 (1,A):0.4 || (1,B):0.6
12 (2,C):0.7
ID Reading(sensor-id,temp)
21 (1,77)
22 (2,81)
ID Temp(zone,temp)
31 (A,77):0.4 || (B,77):0.6
32 (C,81):0.7
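To make the possible-worlds reading concrete, the following illustrative sketch (not part of Trio; all names hypothetical) enumerates the four possible worlds of the Location table above, treating the two alternatives of x-tuple 11 as mutually exclusive and x-tuple 12 as a maybe-tuple present with probability 0.7:

```python
from itertools import product

# X-tuple 11 always contributes exactly one alternative (its
# confidences sum to 1); x-tuple 12 may be absent (with prob. 0.3).
xtuple_11 = [(("1", "A"), 0.4), (("1", "B"), 0.6)]
xtuple_12 = [(("2", "C"), 0.7), (None, 0.3)]  # None = tuple absent

worlds = []
for (t1, p1), (t2, p2) in product(xtuple_11, xtuple_12):
    world = [t for t in (t1, t2) if t is not None]
    worlds.append((world, p1 * p2))

for world, prob in worlds:
    print(world, round(prob, 2))

assert len(worlds) == 4
assert abs(sum(p for _, p in worlds) - 1.0) < 1e-9
```

The four world probabilities (0.28, 0.12, 0.42, 0.18) sum to 1, as Definition 1 requires.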
Chapter 3
Continuous Uncertainty in Trio
The Trio system manages discrete uncertainty: a limited number of possible values
for data items are permitted. In this chapter, we present extensions to Trio for
incorporating continuous uncertainty, where an infinite number of possible values are
permitted for each data item. The work presented in this chapter initially appeared
in [AW09b].
3.1 Introduction
Many efforts in the field of uncertain databases, including our own work in Trio, use a
data model (Section 2.2) based on a set of alternative values for tuples. However, some
applications must manage data that cannot be represented with such forms of discrete
uncertainty. For example, sensor measurements may be described by a Gaussian
distribution around the reported reading (based on sensor calibration) [FGB02];
astronomical data may contain locations described by two-dimensional Gaussians;
a database that records time-of-day may use intervals bounding the possible time;
predictive models stored in a database may include continuous probability distributions.
There have been a few proposals for managing continuous uncertainty in a DBMS.
Reference [FGB02] suggests using an abstract data type for continuous uncertain
values, focusing specifically on Gaussian distributions. The Orion system [SMS+ 08]
generalizes to other distributions using discrete histogram approximations, and defines
CHAPTER 3. CONTINUOUS UNCERTAINTY IN TRIO 13
a semantics for interpreting SQL over this data. In this chapter we describe how to
incorporate continuous uncertainty into Trio. Our proposal is more general than the
previous work in both data representation and querying, but admittedly it is not
implemented yet. More detailed discussion of related work is provided in Section 3.6.
Next we describe in brief the key components of our proposal: semantics of the
data model and query language, data representation in the system, query processing
over this representation, interfaces to the uncertain data, and integration into other
aspects of Trio, including Trio’s model for discrete uncertainty and its important
lineage feature. Later sections go into more detail on each of these components.
Data Model Semantics: Models for uncertain data usually are based on possible
worlds: a set of possible instances for the database. With discrete uncertainty, an
uncertain database always represents a finite set of possible worlds; with continuous
uncertainty, we may represent an infinite set. Specifically, the value of an uncertain
attribute may be an arbitrary probability distribution function (pdf) over a continuous
domain, describing the possible values for the attribute. For example, an uncertain
temperature may be described as a Gaussian distribution around a mean temperature
of 50 with variance 12; a time may be described as a uniform distribution between
9:20 and 9:25. Gaussian and uniform distributions are special cases of pdfs, as are
the discrete alternatives on which Trio was based originally. In our model, a pdf may
also be over a multidimensional domain, representing multiple correlated uncertain
attributes. For example, predicted object locations may be comprised of uncertain
correlated latitude-longitude pairs, i.e., two-dimensional pdfs.
Representation: Previous work [FGB02, SMS+ 08] has used symbolic representa-
tions for a set of known distribution types, and histogram approximations for the rest.
For example, Gaussian distributions can be represented by mean and variance; uniform
distributions by their endpoints. We have taken a different approach to handling a
wide class of distributions: we represent a pdf by a set of functions. For example, a
pdf ρ can be represented by a pair of functions:
• The function sample(low,high) returns a random value between low and high
according to ρ. For example, sample(3,4) for a Gaussian distribution with mean
4 is more likely to return a value close to 4 than close to 3.
Query Processing: Recall from Chapter 2 that in Trio, query results include
lineage identifying the data from which each result value was derived. Lineage is
needed to represent uncertainty correctly, and to compute result confidence values
lazily [BDHW06b]. In the presence of pdfs, we make all processing over pdf values
lazy, deferring potentially expensive computations until they are needed. At query
time, we simply generate lineage, and for those results involving pdfs, we extend
the lineage to include relevant predicates and mappings. The information contained
in the lineage is sufficient to be able to compute the functions that encode the
result lazily. For example, a query asking for temperatures less than 60 would
generate lineage annotated with the “< 60” predicate. Now if we have a result pdf ρ′
whose lineage points to a pdf ρ, the function weightρ′(low,high) on ρ′ is translated to
weightρ(low, min(high,60)) / weightρ(−∞,60) on ρ.
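As an illustrative sketch (function names hypothetical, not Trio's actual code), the lazy translation of weight through a "< 60" lineage predicate can be written for a Gaussian base pdf as:

```python
import math

# Base pdf rho: a Gaussian reading; weight(l, h) is the probability
# mass in [l, h], computed from the Gaussian CDF via erf.
def gaussian_weight(mu, sigma, low, high):
    cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    return cdf(high) - cdf(low)

# Result pdf rho' = rho conditioned on the "< 60" predicate in lineage:
# weight_rho'(low, high) = weight_rho(low, min(high, cutoff))
#                          / weight_rho(-inf, cutoff)
def translated_weight(mu, sigma, low, high, cutoff=60.0):
    num = gaussian_weight(mu, sigma, low, min(high, cutoff))
    den = gaussian_weight(mu, sigma, -math.inf, cutoff)
    return max(num, 0.0) / den

# All mass of the result pdf lies below the cutoff:
assert abs(translated_weight(50, 12, -math.inf, 60) - 1.0) < 1e-9
```

No computation on ρ happens until translated_weight is actually called, mirroring the lazy evaluation described above.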
The “translation” approach motivated above can be used to answer many queries
efficiently. However, for expensive queries we support approximate answers, either
using the sample function, or a histogram based on the weight function; both options
offer the user a cost-precision trade-off. Query processing is modular and generic: our
approach of mapping functions on result data to functions on data in the result’s
lineage is independent of specific distributions—all functions can be treated as “black
boxes.” However, for specific distributions we do introduce specialized processing that
is more efficient.
Integration into Trio: In addition to lineage and discrete uncertainty, recall from
Chapter 2 that Trio’s data model includes confidence values associated with tuples,
denoting the probability of the tuple existing. In our new model and query language
for continuous uncertainty, operations tend to modify the probability of a tuple’s
existence. Thus, the confidence feature is particularly helpful for integrating pdfs into
Trio. Consider for example the predicate “≤ 4” applied to a uniform distribution over
[3, 7]. The result can be represented by a uniform distribution over [3, 4] that exists
with probability 0.25, or more generally one fourth of the original probability.
Not only do Trio’s previous features (lineage and confidences) help integrate
continuous uncertainty into the system, but the ability to represent and query pdfs
enhances some pre-existing Trio functionality. For example, aggregate query results
[MW07] may naturally be represented as continuous pdfs, and pdfs over discrete
domains can sometimes yield more efficient representations than tuple alternatives,
for certain types of data.
We discuss data model and query semantics in Section 3.2, representations and
interfaces in Section 3.3 and query processing in Section 3.4. Section 3.6 presents
related work, and we conclude with future work in Section 3.7.
Pdf Attributes: Recall from Section 2.2 that an x-tuple is comprised of tuple
alternatives. We now allow pdf attributes within x-tuple alternatives. The value for
a pdf attribute is given by a probability distribution function over a given domain.
The domain may be discrete or continuous. For example, P (A) = 0.4; P (B) = 0.6
is a pdf over a discrete domain {A, B}; a pdf of a Gaussian distribution G(77, 20)
with mean 77 and variance 20 is a pdf over the continuous domain of all real numbers
R. Suppose the sensors used to record temperature values in the Reading table from
Section 2.2 have errors described by Gaussian distributions: say sensors 1 and 2 have
variance of 20 and 30 respectively. The uncertainty in a sensor’s temperature report
can now be captured in the database by making temp a pdf attribute over R. Using
G(µ, σ²) to represent a Gaussian pdf:
Reading(sensor-id,temp)
(1, G(77,20))
(2, G(81,30))
The Reading table now has an infinite number of possible worlds, since it contains
a pdf attribute over a continuous domain. Any table with two tuples of the form
(1, x) and (2, y) is a valid possible world of Reading. The probability of an individual
possible world is infinitesimally small, so we consider sets of possible worlds. Let the
probability of the set of possible worlds where a ≤ x ≤ b be px, and where c ≤ y ≤ d be
py. As with other uncertain data in Trio, the values of pdf attributes are independent.
Hence the probability of the set of possible worlds where a ≤ x ≤ b and c ≤ y ≤ d is
px · py.
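This independence semantics can be sanity-checked by simulation. The sketch below (illustrative only; interval bounds chosen arbitrarily) draws independent readings from G(77, 20) and G(81, 30) and confirms that the empirical probability of the conjunctive world set is close to the product px · py:

```python
import math, random

rng = random.Random(7)
trials = 100_000
a, b, c, d = 75.0, 80.0, 78.0, 85.0

in_x = in_y = in_both = 0
for _ in range(trials):
    x = rng.gauss(77, math.sqrt(20))  # variance 20 -> sigma = sqrt(20)
    y = rng.gauss(81, math.sqrt(30))
    in_x += a <= x <= b
    in_y += c <= y <= d
    in_both += (a <= x <= b) and (c <= y <= d)

p_x, p_y = in_x / trials, in_y / trials
# Joint frequency matches the product of the marginals:
assert abs(in_both / trials - p_x * p_y) < 0.01
```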
Note that Trio’s attribute-level uncertainty (Section 2.2) is captured by pdf at-
tributes over finite discrete domains. Also note that the presence of alternatives along
with pdf attributes allows us to represent certain cases that pdf attributes alone can’t
represent. For example, this x-tuple represents a temperature 78 with probability 0.3,
otherwise a Gaussian around 77 with variance 20:
Temperature(zone,temp)
(A,78):0.3 || (A, G(77,20)):0.7
Predicates applied to pdf attributes can affect confidence values of result alter-
natives. For example, consider a tuple that exists with confidence 0.8 and has just
one pdf attribute whose value is uniformly distributed between 2 and 6 (denoted by
U(2, 6) : 0.8). The result of applying a predicate A ≤ 5 to this tuple is U(2, 5) : 0.6.
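A minimal sketch of this confidence rescaling for uniform pdfs (helper name hypothetical):

```python
# Applying an upper-bound predicate to a uniform pdf attribute keeps
# the fraction of mass below the bound and scales confidence by it.
def apply_upper_bound(low, high, conf, bound):
    new_high = min(high, bound)
    fraction = (new_high - low) / (high - low)  # mass kept by A <= bound
    return (low, new_high), conf * fraction

# U(2, 6) : 0.8 under the predicate A <= 5 becomes U(2, 5) : 0.6.
(result_low, result_high), result_conf = apply_upper_bound(2, 6, 0.8, 5)
assert (result_low, result_high) == (2, 5)
assert abs(result_conf - 0.6) < 1e-12
```

The same helper reproduces the earlier example: a certain U(3, 7) under "≤ 4" becomes U(3, 4) with confidence 0.25.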
We also support correlated pdf attributes. Correlations may exist in the base
data, for example uncertain location data may be represented by correlated x and y
coordinates. A joint pdf (over possibly heterogeneous domains) is used to represent
multiple correlated attributes. Correlated pdf attributes, and processing queries over
them, are described in more detail in Section 3.5.1. Note that independent attributes
in base data can yield correlated attributes in the result. For example, a selection
query with predicate A < B over pdfs A and B can make attributes corresponding
to A and B be correlated in the result. Lineage is used to correctly handle these
correlations, as described in Section 3.4.2.
For example, the weight function of a Gaussian pdf with mean µ and variance σ² is:

weight(l, h) = ½ ( erf((h − µ)/(σ√2)) − erf((l − µ)/(σ√2)) )
where erf is the error function. The weight function can completely describe any pdf,
but we may often use additional functions in the representation for increased efficiency
and convenience. One such function is sample(l,h), which returns values between any
given l and h drawn at random according to the pdf.
Notice that a call to sample can be computed using inverse sampling by using
calls to weight and a uniformly-distributed random number generator. Conversely, an
approximation to weight can be computed using calls to sample with Monte-Carlo
simulations. In addition to weight and sample, we might use inverse-cdf, fourier, or
describe for a pdf ρ representing a random variable X:
• The function inverse-cdf(v) returns the value r such that v = weight(−∞, r).
Intuitively, inverse-cdf(v) returns the 100v-th percentile value for the pdf. For
example, the inverse-cdf of a Gaussian with mean µ and variance σ² is:

inverse-cdf(v) = µ + σ√2 · erf⁻¹(2v − 1)

In general, inverse-cdf can be computed approximately using calls to weight and
binary search, or using sample. Functions weight and sample can also be computed
using inverse-cdf.
• The function fourier(x) returns the value of the Fourier transform of the pdf
of the random variable X at x. The Fourier transform may be known as an
expression for certain distributions. Efficient numerical computation of the
Fourier transform and inverse Fourier transform has been studied extensively.
Function fourier may be computed using weight and vice-versa.
• The function describe() returns a parametrized form of the pdf. For example,
describe might return [Type:"Gaussian", Parameters: µ, σ²]. For a known type
of pdf, the system can compute describe using weight, sample, or inverse-cdf,
and vice-versa.
Since the functions above can be computed using calls to other functions, not all
of them need to be instantiated. In fact, all the above functions can be computed
(approximately) if any one of them were instantiated. Of course when more functions
are instantiated, it may be possible to obtain more precise answers more efficiently.
The functions above are our starting point. Hooks can be provided to the user to
create additional functions.
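To illustrate this inter-derivability, the sketch below (illustrative only; an Exponential(1) pdf is used purely so the answers can be checked in closed form) computes inverse-cdf from weight by binary search, and sample from inverse-cdf by inverse transform sampling:

```python
import math, random

# weight(l, h) for an Exponential(1) pdf, via its CDF 1 - e^{-x}.
def weight(low, high):
    cdf = lambda x: 0.0 if x <= 0 else 1.0 - math.exp(-x)
    return cdf(high) - cdf(low)

# inverse-cdf derived from weight: binary search for r with
# weight(-inf, r) = v. More iterations -> more precision.
def inverse_cdf(v, lo=0.0, hi=1e9, iters=100):
    for _ in range(iters):
        mid = (lo + hi) / 2
        if weight(float("-inf"), mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# sample derived from inverse-cdf: inverse transform sampling.
def sample(rng=random):
    return inverse_cdf(rng.random())

# For Exponential(1), the median is ln 2.
assert abs(inverse_cdf(0.5) - math.log(2)) < 1e-6
```

Only weight is "instantiated" here; inverse-cdf and sample are derived from it, at the cost of many weight calls per request.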
Pdf data represented as functions is presented to the application or user through
interfaces. Interfaces, like functions, return properties of the pdf, but they are not
used to represent the pdf and hence cannot be used to compute functions or other
interfaces. Here are two examples of interfaces:
• Some applications may ask for the median of a pdf attribute. The property
of the median m is that weight(−∞, m) = weight(m, ∞) = 0.5. Median m can
be found approximately by calling weight repeatedly with binary search. The
precision of the result improves as more calls are made.
Although weight is suitable for computing the discrete histogram interface, it isn’t
ideal for median, and possibly other interfaces. On the other hand, median can be
computed easily using inverse-cdf by calling it with argument 0.5. If inverse-cdf is
instantiated, it should be used for a median interface. If inverse-cdf is not instantiated,
calling it from the median interface will degenerate to the same binary search over calls
to weight. Some other interesting interfaces are mean, variance, percentile,
inverse-percentile, and a human-readable form of describe. All can be implemented
using the functions described earlier, some more efficiently than others.
Our extensible functional representation, along with the notion of interfaces, sets
up a framework permitting easy incorporation of new techniques for pdf attributes. We
will discuss how the set-of-functions approach enables efficient query processing, and
introduces a new optimization problem over a space of “plans” that can be compared
in terms of precision and efficiency.
Data Processing: In Trio, data processing at query time uses a standard rela-
tional processor. We use the algorithm described in [BDHW06b] with the following
modifications:
• Any predicate in the where clause referencing one or more pdf attributes is not
evaluated. The query is processed using only the remaining predicates in the
where clause. (Recall we assume conjunctive where clauses.)
• A placeholder is created for each element in the select clause that is of type pdf.
For example, consider a Trio query over relation R(A, B, C), where A and B are pdf
attributes:
select A, B+5, C
from R
where A ≤ 5 and B ≥ A and C = 11
Translation:
Some simple queries permit a very efficient technique that gives precise answers.
Consider for example an alternative with Pt = A ≤ 5 ∧ B ≤ 10, and Mt = X1 →
A, X2 → B, where A and B are base pdfs and X1 and X2 are pdfs in the result.
The result alternative exists only in possible worlds with values for A and B that
satisfy the predicate, and the product is taken because base pdfs are assumed to be
independent. Similarly, the weight functions for X1 and X2 are:
Discrete Approximation:
The weight function for the pdf X1 in the result can be approximated as:

weightX1(j, k) = (1/s) · Σ_{j ≤ i ≤ k} weightA(i, i + 1) · weightB(i, ∞)

where s is the confidence scaling factor.
This example uses a discrete approximation for the weight function. It is easy to see
that inverse-cdf can also be used. There is a trade-off between cost and precision
when this method is used: more points in the approximation increase both precision
and cost. The predicate A ≤ B in the example above is equivalent to the comparison
of an arithmetic expression with a constant: A − B ≤ 0. The discrete approximation
technique scales poorly as the number of pdf attributes in the arithmetic expression
increases, in terms of both cost and precision. For example, an expression like
Σi Ai ≤ 0 with n pdfs can be very inefficient: O(d^(n−1)), where d points are used to
approximate each pdf. However, it should be noted that the discrete approximation
technique is applicable to all queries.
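The grid computation can be sketched as follows. This is illustrative only: the predicate is taken to be A ≤ B, and A and B are assumed to be independent Uniform(0, 4) pdfs, chosen purely so the exact answers (s = P(A ≤ B) = 0.5, and weightX1(0, 2) = 0.75) are known:

```python
# weight(l, h) of a Uniform(a, b) pdf.
def u_weight(low, high, a=0.0, b=4.0):
    lo, hi = max(low, a), min(high, b)
    return max(hi - lo, 0.0) / (b - a)

# Discrete approximation of weight_X1(j, k) for predicate A <= B:
# (1/s) * sum over grid cells [i, i+h) of
#         weight_A(i, i+h) * weight_B(i, inf).
# n controls the precision/cost trade-off.
def approx_weight_X1(j, k, n=400):
    h = 4.0 / n
    s = total = 0.0
    for step in range(n):
        i = step * h
        cell = u_weight(i, i + h) * u_weight(i, float("inf"))
        s += cell  # running approximation of the scaling factor s
        if j <= i < k:
            total += cell
    return total / s

# The result pdf is normalized, and skewed toward small A values
# (conditioning on A <= B): exactly 0.75 of its mass lies in [0, 2].
assert abs(approx_weight_X1(0.0, 4.0) - 1.0) < 1e-9
assert abs(approx_weight_X1(0.0, 2.0) - 0.75) < 0.01
```

Increasing n tightens the second approximation, at linear cost here; with more pdf attributes in the expression, the nested grid sums grow exponentially as noted above.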
Fourier:
Addition and subtraction over pdf attributes is computed using a convolution operation
on pdfs. A convolution is a kind of integration problem that can be solved using
Fourier transforms, since the convolution operation translates to multiplication in
Fourier space. Consider for example a result tuple with Mt = X1 → A + 2B − C and
an empty Pt (always true), where A, B, C are pdf attributes.
For base data, these functions may be symbolic, making them efficient and precise; or
they may use the weight function to numerically approximate the Fourier transform,
yielding a less precise, more expensive execution. Also, the weight function for the
result can be computed numerically from the fourier function. This technique can be
more efficient and precise than discrete approximations, particularly for expressions
involving a large number of base pdfs. The limitation of the technique is that it is
useful only for addition and subtraction operations over pdf attributes. It needs to be
combined with other approaches when the query involves predicates.
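The key identity behind this technique can be demonstrated on discretized pdfs. The sketch below (illustrative; a real system would use an FFT library rather than this hand-rolled O(n²) DFT) checks that circular convolution of two discretized pdfs equals pointwise multiplication of their Fourier transforms:

```python
import cmath

# Discrete Fourier transform (sign=-1) and its inverse (sign=+1).
def dft(x, sign=-1):
    n = len(x)
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n)
               for t in range(n)) for f in range(n)]
    return out if sign == -1 else [v / n for v in out]

def circular_convolve(x, y):
    n = len(x)
    return [sum(x[j] * y[(i - j) % n] for j in range(n)) for i in range(n)]

# Two discretized pdfs, zero-padded so no wraparound occurs.
a = [0.1, 0.4, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0]
b = [0.2, 0.5, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0]

direct = circular_convolve(a, b)
via_fourier = dft([fa * fb for fa, fb in zip(dft(a), dft(b))], sign=+1)

# Convolution in the original domain == multiplication in Fourier space.
assert all(abs(d - v.real) < 1e-9 for d, v in zip(direct, via_fourier))
assert abs(sum(direct) - 1.0) < 1e-9  # the result is again a pdf
```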
Sampling:
Random sampling over possible worlds can be used to compute confidences and to
answer calls to functions representing the result. We use the sample function for this
technique. For example, consider a result tuple with pdf-lineage: Pt = A + B ≤ 4
and Mt = X1 → A + B. The confidence scaling factor s can be approximated as the
fraction of times the following expression is true over a large number of calls:

sampleA(−∞, ∞) + sampleB(−∞, ∞) ≤ 4
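The Monte-Carlo estimate of s can be sketched as below. This is illustrative only: A and B are taken to be Uniform(0, 4) pdfs (and the predicate direction as ≤ 4) purely so the exact answer, s = 0.5, is known:

```python
import random

rng = random.Random(42)

# Hypothetical sample functions for two base pdf attributes.
def sample_a():
    return rng.uniform(0.0, 4.0)

def sample_b():
    return rng.uniform(0.0, 4.0)

# Fraction of trials where the lineage predicate A + B <= 4 holds.
trials = 100_000
hits = sum(sample_a() + sample_b() <= 4.0 for _ in range(trials))
s = hits / trials
assert abs(s - 0.5) < 0.01  # converges as the number of calls grows
```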
Specialized Processing:
Our approach for pdf attributes makes it easy to use specialized processing for known
distributions. For example, consider a result tuple with Mt = X1 → A + B and
an empty Pt. If both A and B are Gaussians, describe on X1 can be computed as
[Type:"Gaussian", Parameters: µA + µB, σ²A + σ²B], where µA, µB, σ²A, σ²B are obtained
using describe calls on A, B. Other functions including weight, sample, inverse-cdf, and
even fourier can be computed efficiently from describe for known distributions. The
framework allows for adding knowledge about specific distributions to the system in
computing the describe function. This processing may be limited in scope with regards
to complicated queries, but can yield precise answers efficiently when applicable.
Combining:
For any interface, function, or confidence computation call, the answer may be produced
by combining more than one of the techniques presented above. Combining the
techniques results in multiple “query plans.” Choosing a good plan is important, since
each evaluation is potentially expensive, and we wish to optimize for a combination of
precision and efficiency.
3.5 Discussion
• The current proposal restricts the query language to have only a conjunctive
where clause. It would be useful to allow a more general query language, along
with some built-in language and data-type extensions specifically for querying
pdfs.
Chapter 4
Generalized Uncertain Databases
Existing uncertain databases have difficulty managing data when exact confidence
values are not available. In some applications, confidence values may be known
imprecisely or coarsely, or even be missing altogether. In this chapter, we propose
a generalized uncertain database: a database that manages incomplete information
about uncertainty. The work presented in this chapter appeared initially in [AW10].
4.1 Introduction
As we have seen in the previous chapters, uncertain databases enable the management
of data that has incomplete or imprecise information. However, most uncertain
databases, including Trio (Chapter 2), require that exact confidence values are at-
tached to the data being managed. In some applications, confidence values may be
known imprecisely or coarsely, or be missing altogether. To manage such incomplete
uncertainty, we propose the notion of a generalized uncertain database.
We use uncertain data about movies and release years, obtained say as a result of
information extraction, to illustrate some different kinds of uncertainty. In the
following, (1) is an example of "complete" (exact) uncertainty, while (2), (3), and (4)
illustrate three kinds of incomplete uncertainty.
1. Exact confidence values: The Godfather was released in 1972 with confidence
.8 and 1974 with confidence .2.
2. Missing confidence values: Shawshank Redemption was released in 1984 or 1994,
but no confidence values are known.
3. Imprecise confidence values: Pulp Fiction was released in 1994 with confidence
at least .5. (Effectively, confidence is between .5 and 1.)
4. Coarse confidence values: Die Hard was released in 1988 or 1989 with confidence
.8, or 1990 with confidence .2.
• The Godfather was released in 1972 with confidence .8 and 1974 with confi-
dence .2.
• The Godfather II was released in 1972 with confidence .3 and 1974 with
confidence .7.
Suppose we want to know whether The Godfather was released at least one year before
The Godfather II. Assuming independence, the confidence would be .8 × .7 = .56. Without
• Data and query semantics (Section 4.2): We propose a semantics for general-
ized uncertain databases based on Dempster-Shafer theory, and the associated
semantics for queries over such databases. We have chosen Dempster-Shafer
theory because it is a mature and elegant generalization of traditional Bayesian
probability theory that incorporates epistemic uncertainty in addition to aleatory
uncertainty. We demonstrate how the new semantics degenerates to the seman-
tics for current uncertain databases.
We emphasize that our new proposal requires only minor modifications to the
semantics, representation, and query processing algorithms for uncertain databases,
while still enabling the management of various kinds of incomplete uncertainty in a
unified manner.
We discuss related work in Section 4.5 and list directions for future research in
Section 4.6.
4.2 Semantics
In this section, we define data-model semantics for generalized uncertain databases,
and the associated query semantics.
Notice that m assigns mass to every subset of possible worlds (although many may
have mass zero), as opposed to the traditional way of assigning a probability to each
individual possible world. It is useful to interpret mass m(S) assigned to a set of
possible worlds S as probability mass constrained to stay within S, but free to be
assigned anywhere within S. For example, when S = {W1 , W2 }, it is unspecified
how m(S) is to be split between W1 and W2. Since we use the Dempster-Shafer
interpretation for the mass function, we also have the belief and plausibility functions,
which can be interpreted as the lower and upper bounds for probabilities:
bel(S) = Σ_{A ⊆ S} m(A)
pl(S) = Σ_{A ∩ S ≠ ∅} m(A)
This interpretation allows us to answer queries that ask for tuples which “could have
probability more than .5” (mentioned in Section 4.1). We don’t further describe
details or intuition for belief or plausibility functions in this thesis, and instead refer
the reader to [Sha76] for a detailed discussion. We do, however, note the following
relationships:
m(S) = Σ_{A ⊆ S} (−1)^{|S \ A|} bel(A)
pl(S) = 1 − bel(S̄)
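The belief and plausibility functions above can be computed directly from an explicit mass function. A minimal sketch, with illustrative worlds and mass values (not taken from the chapter's examples):

```python
# Belief and plausibility computed from an explicit mass function over
# subsets of possible worlds. Worlds and mass values are illustrative.

worlds = frozenset({"W1", "W2", "W3"})
mass = {
    frozenset({"W1"}): 0.5,
    frozenset({"W2", "W3"}): 0.3,
    worlds: 0.2,  # mass on the full set: completely unconstrained portion
}

def bel(s):
    # bel(S) = sum of m(A) over all A with A ⊆ S (lower bound)
    return sum(m for a, m in mass.items() if a <= s)

def pl(s):
    # pl(S) = sum of m(A) over all A with A ∩ S ≠ ∅ (upper bound)
    return sum(m for a, m in mass.items() if a & s)

s = frozenset({"W2", "W3"})
assert abs(pl(s) - (1 - bel(worlds - s))) < 1e-9   # pl(S) = 1 - bel(S̄)
```

Here bel({W2, W3}) = 0.3 while pl({W2, W3}) = 0.5, since the mass on the full set may or may not fall inside S.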
1. Exact confidence values: In the standard probabilistic case, all the mass is
contained in sets with a single possible world. It is worth noting that the mass
function is the same as the belief and the plausibility function, and hence mass
can be interpreted as probability.
W1 W2
The Godfather, 1972 The Godfather, 1974
2. Missing confidence values: Since no confidence values are known, all the mass is
assigned to the set of all possible worlds. For all individual possible worlds, the
belief and plausibility functions are 0 and 1 respectively.
W1 W2
Shawshank Redemption, 1984 Shawshank Redemption, 1994
m(S) = 1 if S = {W1, W2}, and m(S) = 0 otherwise
W1 W2
Pulp Fiction, 1994 ∅
4. Coarse confidence values: In this case, confidence is specified over sets of possible
worlds, but not exactly to each individual possible world.
W1 W2 W3
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Since all the above kinds of uncertainty can be captured in generalized uncertain data-
bases, we can provide a unified semantics for various kinds of incomplete uncertainty
and exact probabilities. The Trio system allows relations with exact probabilities
and missing confidences, but a query joining these relations coerces the relation with
missing probabilities by assigning probabilities uniformly, and hence the results are ad hoc.
In contrast, our new proposal is able to combine such information in a principled
manner.
We now formalize the observation that our new data-model semantics generalizes
semantics for probabilistic databases [ABS+06]. Recall from Chapter 2 the definition
of a probabilistic database (Definition 1).
Probabilistic mass functions are those mass functions that satisfy the property above.
Recall from Chapter 2 the definition of uncertain databases without probabilities
(Definition 2). We can make the observation that our semantics generalizes those of
uncertain databases without probabilities.
PW_R = {Q(W) | W ∈ PW_U}
m_R(S) = Σ_{A : S = {Q(W) | W ∈ A}} m_U(A)
Observation 4. When the mass function for a generalized uncertain database corre-
sponds to an uncertain database without confidences (m(PW) = 1), the mass function
for the generalized uncertain database resulting from the application of a relational
query also corresponds to an uncertain database without confidences.
4.3 Representation
In this section, we describe our new representation scheme for generalized uncertain
databases. This scheme is a modification of the representation scheme used by Trio
(Chapter 2) in the sense that it preserves the notions of alternatives, x-tuples, and
lineage. The significant change is that probabilities associated with alternatives are
replaced by a mass function defined for each x-tuple.
We start by describing our scheme without lineage, and then incorporate lineage.
A database instance is represented by a set of x-tuples, each with an associated x-mass
function. We use a running example to describe these notions, and to show how a
generalized uncertain database instance is interpreted from a representation instance.
Possible worlds for the database are obtained by picking exactly one alternative
from each x-tuple. Symbol “?” is a special alternative value; picking it means
that none of the other alternatives are picked [ABS+ 06]. The first x-tuple has
three possible states, while the second x-tuple has two possible states. Hence,
the possible world set P W has six possible worlds as shown below.
W1 W2 W3
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Pulp Fiction, 1994 Pulp Fiction, 1994 Pulp Fiction, 1994
W4 W5 W6
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
The intuitions described for mass functions in Section 4.2 carry over to x-mass
functions. We have an x-mass function for each x-tuple in the representation.
From these functions, we interpret the mass function m over the possible
worlds to get a generalized uncertain database from a representation instance.
Conceptually, we first interpret each x-mass function as a basic mass function
over the set of possible worlds, and then “combine” these basic mass functions
to obtain the mass function for the generalized uncertain database.
• Basic mass functions: In general, the basic mass function m_i : 2^PW → [0, 1] for
an x-mass function x_i : 2^A → [0, 1] is constructed as follows:
Consider the x-mass functions x1 , x2 for the tuples above. We create the
• Combination: The basic mass functions mi for all x-tuples are combined to obtain
the mass function m for the generalized uncertain database using Dempster’s
combination rule:
m = ⊕_i m_i
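Dempster's combination rule can be written out directly for two basic mass functions over the same possible-world set. A sketch, with illustrative masses (the standard rule; mass falling on an empty intersection is conflict and is renormalized away):

```python
# Dempster's combination rule for two basic mass functions over the same
# possible-world set. The example masses are illustrative only.

def combine(m1, m2):
    out, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                out[inter] = out.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb   # mass on the empty intersection
    # Renormalize by 1 - K, where K is the total conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in out.items()}

# Two basic mass functions, e.g. from two independent x-tuples whose
# alternatives partition six worlds W1..W6.
m1 = {frozenset({"W1", "W2", "W4", "W5"}): 0.8,
      frozenset({"W3", "W6"}): 0.2}
m2 = {frozenset({"W1", "W2", "W3"}): 0.6,
      frozenset({"W4", "W5", "W6"}): 0.4}

m = combine(m1, m2)
assert abs(m[frozenset({"W1", "W2"})] - 0.48) < 1e-9
```

When the basic mass functions come from independent x-tuples, as here, no conflict arises and the normalization is a no-op.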
Hence, we get the mass function m for the generalized uncertain database as:
Based on results from [DBHW06b], we can immediately see that the model
presented so far for generalized uncertain databases is not complete, or even
closed under select-project-join queries. Hence, we incorporate the notion of
lineage from Trio (recall from Chapter 2) into the model.
t3 1989, 1994 ?
W1 W2 W3
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Pulp Fiction, 1994 Pulp Fiction, 1994 Pulp Fiction, 1994
1989, 1994
W4 W5 W6
Die Hard, 1988 Die Hard, 1989 Die Hard, 1990
Our model requires that an x-mass function be provided for each base x-tuple,
while no x-mass functions may be explicitly provided for derived tuples (just like
confidences in Trio). Hence, the mass function over possible worlds is not affected by
the presence of lineage, and is constructed based on base x-tuples only. The mass
function for the generalized uncertain database with x-tuples t1 , t2 , t3 is as shown
earlier.
We can now make the following observation:
define the mass, belief, and plausibility values for the formula f as follows:
It is well known that in probabilistic databases, when the boolean formula over
base tuples is conjunctive, the confidence can be evaluated efficiently. This result
carries over to the uncertainty evaluation problem:
Belief, plausibility, and mass can each be treated as probabilities [Sha76] (lower bound,
upper bound, and point mass, respectively) when the boolean formula over base tuples
is conjunctive, and the confidence computation module of Trio can be used directly
for uncertainty evaluation.
Approximation techniques have been proposed for the confidence computation
problem [DS04a]. The observations above suggest that such techniques can be adapted
to solve the uncertainty evaluation problem. Thorough investigation of the uncertainty
evaluation problem is left as future work; specific directions are listed in Section 4.6.
possibility theory. Like our approach, this proposal also enables managing of data
where no exact probability values are available. Possibility theory can be captured in
Dempster-Shafer theory in terms of representation, but it uses a cautious combination
operator instead of Dempster’s rule.
Chapter 5
Uncertain-Data Integration
5.1 Introduction
Most work on uncertain data has focused on modeling and management problems.
Little work has been done on integrating multiple sources of uncertain data. Similarly,
several decades of research have focused on the theory and practice of data integra-
tion [HRO06], but only considering integration of certain data. This chapter develops
theoretical foundations for local-as-view (LAV) integration [Hal01] of uncertain data.
The combined study of data integration and data uncertainty is important for
several reasons. The traditional benefits of data integration still apply when sources
are uncertain: Integrating data from multiple sources allows a uniform query interface
to access their combined information. In addition, integrating multiple sources of
uncertain data may help resolve portions of the uncertainty, yielding more accurate
results than any of the individual sources. As a very simple example, if one sensor
reports that an object is either in location A or in location B, and a second sensor
Containment
LAV data integration typically uses the open world assumption: Consider a mapping
view query Q for a source S. When Q is applied to the (logical) mediated database,
we do not require the result to be S exactly, but only require it to contain S. For
the case of certain databases, containment is straightforward. To extend LAV data
integration to the uncertain data setting, we need to find an appropriate definition of
containment. We will see that by defining containment carefully, we can capture the
“contradiction” and “corroboration” intuition motivated above.
We will see that for uncertain data, two different integration settings require two
somewhat different notions of containment. In one setting, which we call equality-
containment, the sources were derived from an existing but unknown uncertain
mediated database that we are trying to reconstruct. In the other setting, which
we call superset-containment, there is no actual mediated database from which the
sources were derived, so our goal is to come up with a logical mediated database that
captures the information from the sources. We will give examples to illustrate the
differences. This distinction is new for handling uncertain data. For certain data,
these two settings can be handled identically.
Consistency
When sources contain uncertain data, we need to define what it means for sources to
be consistent. (As an extremely simple example, one sensor reporting location A or B
and the other reporting C or D for the same object is inconsistent.) Informally, a set
of sources is consistent if there exists a mediated database that contains all sources.
We will formalize this notion and then study the problem of consistency-checking
under both equality-containment and superset-containment. We show that in general,
consistency-checking is NP-hard in the size of the view schema for both of our settings.
Next we identify a class of sources where consistency-checking is polynomial. We
describe the construction of a hypergraph for a set of sources, and we provide a
PTIME consistency-checking algorithm when this induced hypergraph is acyclic. We
also show that the extensional complexity of consistency-checking is PTIME for both
of our settings.
Query Answers
Lastly, we consider the problem of defining correct query answers over mediated
uncertain databases. Once again, the definitions used for certain databases do not
adapt directly to the uncertain case. The conventional LAV setting uses certain
answers (where the use of the word “certain” here is not to be confused with certain
data). A certain answer is a set of tuples that is guaranteed to be contained in
any mediated database [Hal01]. We define a corresponding notion for uncertain
data, which we call correct answers, that incorporates possible worlds through our
containment definitions. Further, we seek to find a unique strongest correct answer
(SCA) defined using the partial orders implied by the containment definitions. For
superset-containment, we prove by construction the existence of an SCA. However,
for equality-containment an SCA does not always exist, and hence we define a relaxed
notion for the “best” query answer.
Discussion
For ease of presentation, we restrict ourselves to identity queries and views defined
by identity queries for most of the chapter. In Section 5.7, we extend our techniques
for monotonic views with some restrictions, and for monotonic queries over uncertain
data. For a majority of this chapter, we focus attention on uncertain databases
without probabilities. In Section 5.8, we discuss extensions of our results for the case
of generalized uncertain databases (recall from Chapter 4).
The results in this chapter are independent of the specific representation used
for uncertain data. (The computational complexity of certain problems considered
may depend on the specific representation, and we point out these differences in the
relevant places.) Also, although our results are presented for discrete uncertain data
with a finite set of possible worlds, they can be generalized for continuous uncertain
data with an infinite set of possible worlds. We emphasize that our foundations are
defined in terms of possible worlds, but we neither rely on nor advocate possible worlds
as a physical representation.
5.2 Containment
This section introduces and formalizes the notions of equality-containment and superset-
containment. Recall the definition of an uncertain database from Chapter 2. We shall
motivate an extension to the definition from Chapter 2 in Section 5.2.1, and then
present our containment definitions in Section 5.2.2 using the extended definition.
Example 1. Consider tuples A and B, and the four possible worlds: P1 = ∅, P2 = {A},
P3 = {B}, and P4 = {A, B}. An uncertain database U1 that contains no information
about the existence and co-existence of tuples A and B consists of all four possible
worlds. An uncertain database U2 with information that at least one of A or B exists,
and that they cannot co-exist, does not contain either P1 or P4 . U2 contains more
information than U1 , since it asserts that P1 and P4 are not possible. An uncertain
database U3 with T (U3 ) = {A, B} but containing only the possible world P2 = {A}
asserts that the tuple A is contained in the database and that B cannot be contained
in the database, since B is contained in T (U3 ) but not contained in any possible world
of U3 . U3 contains more information than either U1 or U2 .
Equality-Containment
Equality-containment integration is relevant in situations where each source has access
to only a portion of an uncertain database that is existing but not known completely.
There are many real-world applications where access is controlled, and only slices of
the data may be visible to various parties (sources). For example, an actual uncertain
database may be hidden behind a web service, or people may only be given access
to data depending on what they pay for. The goal of data integration in this setting
is to answer queries using the best (virtual) reconstruction of the unknown actual
uncertain database. Also, when smaller pieces of sensitive data are given out to
multiple people so that no single piece leaks information, this reconstruction allows
one to detect whether the pieces can be combined to obtain sensitive information.
Finally, this setting captures the problem of answering queries using materialized
views over uncertain data, where each source is a view.
T(U1) ⊆ T(U2)
and
PW(U1) = {W ∩ T(U1) | W ∈ PW(U2)}
Informally, if we remove from any possible world of U2 those tuples not contained in
T (U1 ), then the resulting possible world is a world of U1 , and U1 may not contain
additional possible worlds.
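The definition above can be checked directly on small explicit databases. A minimal sketch, modeling an uncertain database as a (tuple set, set of possible worlds) pair; this explicit encoding, and the mediated database's tuple set {A, B, C}, are choices made for illustration:

```python
# Equality-containment checked from the definition:
# T(U1) ⊆ T(U2) and PW(U1) = { W ∩ T(U1) : W ∈ PW(U2) }.
# An uncertain database is a (tuple set, set of possible worlds) pair.

def eq_contained(u1, u2):
    t1, pw1 = u1
    t2, pw2 = u2
    return t1 <= t2 and set(pw1) == {frozenset(w & t1) for w in pw2}

# In the flavor of Example 2: M over tuples {A, B, C} with worlds
# {A} and {A, C}; U3 = ({A, B}, {{A}}); U1 has all four worlds over {A, B}.
m_db = (frozenset("ABC"), {frozenset("A"), frozenset("AC")})
u3 = (frozenset("AB"), {frozenset("A")})
u1_db = (frozenset("AB"), {frozenset(), frozenset("A"),
                           frozenset("B"), frozenset("AB")})

assert eq_contained(u3, m_db)         # U3 ⊑E M
assert not eq_contained(u1_db, m_db)  # U1 ⋢E M
```

Restricting each world of M to {A, B} yields only {A}, which is exactly PW(U3) but strictly less than PW(U1).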
Superset-Containment
Superset-containment integration is relevant in settings where we obtain uncertain
data about the real world from different sources, and the goal is to combine information
from these sources to construct a logical “real-world truth” as accurately as possible.
The simplest example of this scenario was given in Section 5.1, where one sensor
reported A or B for an object and another reported B or C. When we integrate these
sources to obtain our best guess at the real-world truth, we decide the location is
likely to be B.
Superset-containment also arises in information extraction: several parties may
extract structured data from unstructured data (e.g., extracting relations from text,
or extracting text in an OCR context) using different techniques, and integration
can be used to resolve uncertain results from the sources. Another setting where
superset-containment integration is relevant is the combination of information from
multiple sources that attempt to make predictions, such as weather forecasts from
different websites, or sales projections using different techniques.
In contrast to equality-containment, under superset-containment the sources may
not have been derived from an actual uncertain database.
T(U1) ⊆ T(U2)
and
PW(U1) ⊇ {W ∩ T(U1) | W ∈ PW(U2)}
This partial order captures the intuition that the larger pair contains more information,
with respect to both presence and absence of tuples.
¹See [Gun92, LW95] for examples.
The Smyth lifting of the partial order above yields the definition of superset-
containment, while the Plotkin lifting yields the definition for equality-containment.
The Plotkin order is stricter than the Smyth order, and similarly, we have:
(U1 ⊑E U2) ⟹ (U1 ⊑S U2)
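Both orders can be checked directly on explicit possible-world sets. A self-contained sketch, again modeling an uncertain database as a (tuple set, set of possible worlds) pair chosen for illustration:

```python
# Superset-containment checked from the definition: T(U1) ⊆ T(U2) and
# PW(U1) ⊇ { W ∩ T(U1) : W ∈ PW(U2) }. Equality-containment is also
# defined so the snippet stands alone.

def eq_contained(u1, u2):
    t1, pw1 = u1
    t2, pw2 = u2
    return t1 <= t2 and set(pw1) == {frozenset(w & t1) for w in pw2}

def sup_contained(u1, u2):
    t1, pw1 = u1
    t2, pw2 = u2
    return t1 <= t2 and {frozenset(w & t1) for w in pw2} <= set(pw1)

# Illustrative pair: M over {A, B, C}, source U3 = ({A, B}, {{A}}).
m_db = (frozenset("ABC"), {frozenset("A"), frozenset("AC")})
u3 = (frozenset("AB"), {frozenset("A")})

# Equality-containment implies superset-containment (the Plotkin order
# is stricter than the Smyth order).
assert eq_contained(u3, m_db) and sup_contained(u3, m_db)
```

The implication is immediate from the definitions: a set equal to the restricted worlds in particular contains them.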
5.3 Examples
In this section, we use two examples to motivate our definitions of containment. We
start with an abstract but simple example that illustrates the differences between the
two notions of containment. Then, we present a practical example from a real-world
application to illustrate the utility of our approach. Our example also demonstrates
the notion of consistency, which is formally studied in Section 5.5.
Example 2. Recall Example 1 where uncertain databases over tuple set T = {A, B}
were:
PW(M) = {A}, {A, C}
Suppose M is an actual database, and a source S obtained from M doesn’t have the
privileges to access tuple C. Then S would be represented by U3. Intuitively, we should
have U3 ⊑E M under equality-containment, and indeed this is the case according to
Definition 2. Notice that U1 ⋢E M and U2 ⋢E M, consistent with the fact that U1
and U2 cannot be obtained as a result of restricting access on M.
Now consider our other setting, where the sources were not derived from an actual
Next consider the three uncertain relations SCPD, NCPD, and SFPD under superset-
containment. In this setting, instead of being derived from the FBI's Suspects relation,
these relations were obtained by collecting evidence locally. We now have the following
mediated uncertain database U that superset-contains the three sources: T (U ) contains
all three tuples (George,...), (Kenny,...), and (Henry,...), while P W (U ) contains
a single possible world with two tuples: (George,...) and (Kenny,...).
Intuitively, the three sources were resolved to conclude that George and Kenny
were suspects while Henry was not: SFPD insists that Kenny is a suspect, while NCPD
says that either both Kenny and George are suspects, or neither is. Since Kenny is a
suspect, we conclude that both Kenny and George are suspects. Finally, from SCPD we
rule out Henry being a suspect, since SCPD says that exactly one of Henry and George
is a suspect.
Recalling the intuition from Section 5.1, notice that the (Henry,...) possible world
from SCPD and the ; possible world from NCPD are contradicted by a “corroboration” of
all other possible worlds.
T(Q(U)) = Q(T(U))
PW(Q(U)) = {Q(W) | W ∈ PW(U)}
Notice that for monotonic queries, each possible world in Q(U ) is a subset of the tuple
set of Q(U ), ensuring that Q(U ) is indeed an uncertain database.
The next definition specifies the semantics of LAV mappings by defining the notions
of view extension and view definition. These definitions are always used in conjunction
with an implicit logical mediated database.
Next we formalize the notion of a source for LAV data integration in terms of
views.
5.5 Consistency
In this section, we formally define consistency of a set of uncertain data sources.
We then present complexity results for the problem of consistency-checking under
equality-containment and superset-containment.
Roughly speaking, a set of sources is consistent if there exists some mediated
database that contains each source.
• PW(M) ≠ ∅
PW(U # S) ≡def {W \ S | W ∈ PW(U)}
T(U # S) ≡def T(U) \ S
The above definitions satisfy the following properties, which we will use later in proofs:
The last two statements indicate that the restriction and removal operations cannot
make the set of possible worlds empty, although individual possible worlds may become
empty.
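The removal operator and the property just noted can be sketched directly, using the same illustrative (tuple set, set of possible worlds) encoding of an uncertain database:

```python
# The removal operator U # S written out: drop the tuples in S from the
# tuple set and from every possible world.

def remove(u, s):
    t, pw = u
    return (frozenset(t - s), {frozenset(w - s) for w in pw})

u = (frozenset("AB"), {frozenset("A"), frozenset("B")})
t, pw = remove(u, frozenset("B"))
assert t == frozenset("A")
# An individual world ({B}) collapses to the empty world, but the set of
# possible worlds itself stays non-empty, matching the property above.
assert pw == {frozenset("A"), frozenset()}
```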
Intractability
Theorem 1 below establishes the NP-hardness of consistency checking of a set of
sources under both superset-containment and equality-containment. For both cases
we show reductions from the well-known 3-coloring problem [GJ79], although the
arguments are slightly different. The reductions in the proof use one source for every
node and edge, giving us NP-hardness in the size of source schemas. In Section 5.5.2
we will show that consistency-checking is tractable when the number of sources is
fixed.
• For every vertex v, construct a view extension Vv with 3 possible worlds repre-
senting its 3 colorings: PW(Vv) = {v0}, {v1}, {v2}.
• For every edge (u, v), construct a view Vuv with 6 possible worlds representing
the 6 allowed colorings of the nodes u, v: PW(Vuv) = {u0, v1}, {u1, v0}, {u1, v2},
{u2, v1}, {u2, v0}, {u0, v2}.
• For every edge (u, v), u and v are assigned different colors:
W ∩ {u0, u1, u2, v0, v1, v2} ∈ PW(Vuv)
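The construction above can be sketched as code that builds the view extensions from a graph. Tuple names like "v0" (vertex v colored 0) are an encoding chosen here for illustration:

```python
from itertools import product

# Sketch of the reduction from 3-coloring: one view extension per vertex
# (its 3 colorings) and one per edge (the 6 allowed colorings).

def coloring_views(vertices, edges):
    views = {}
    for v in vertices:
        tuples = frozenset(f"{v}{c}" for c in range(3))
        views[v] = (tuples, {frozenset({f"{v}{c}"}) for c in range(3)})
    for (u, v) in edges:
        tuples = frozenset(f"{x}{c}" for x in (u, v) for c in range(3))
        worlds = {frozenset({f"{u}{cu}", f"{v}{cv}"})
                  for cu, cv in product(range(3), repeat=2) if cu != cv}
        views[(u, v)] = (tuples, worlds)
    return views

views = coloring_views(["a", "b"], [("a", "b")])
assert len(views["a"][1]) == 3          # 3 colorings per vertex
assert len(views[("a", "b")][1]) == 6   # 6 allowed colorings per edge
```

A consistent mediated database for these views then corresponds exactly to a proper 3-coloring of the graph.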
Tractable subclass
Next we show that for an interesting subclass, the intensional complexity of consistency-
checking is PTIME. This subclass is based on a mapping from sets of uncertain
Note that the notion of acyclic hypergraphs has been used extensively in database
theory to identify polynomial subclasses for hard problems. See [CR97, Mai83, Ull88]
for a few examples. In our mapping, the nodes in the hypergraph represent tuples
from uncertain databases, and each uncertain database is represented by a hyperedge
in the hypergraph.
We argue that practical uncertain databases often satisfy the acyclic hypergraph
structure. Consider, for instance, our FBI data from Example 3 under the equality-
containment setting. In addition to the zone- and city-level police departments,
suppose we have state-level police departments: states subsuming zones and zones
subsuming cities. The resulting uncertain database yields an acyclic hypergraph.
Under the superset-containment setting, consider a series of sensors monitoring sets
of rooms in a hallway; when the rooms are placed in an “acyclic fashion” (i.e., the
hallway isn’t a circle, but a set of chained rooms), the uncertain database representing
sensor readings gives an acyclic hypergraph.
While we’ve shown practical scenarios where acyclicity arises in practice, we note
that even when an uncertain database does not exhibit an acyclic hypergraph, we can
impose acyclicity by “splitting” some sources. The consequence of splitting a source
is that we may lose some specificity-information. (For example, U with PW(U) =
{A}, {B}, {C} may be split to get U1 and U2 such that PW(U1) = {A}, {B}, ∅ and
PW(U2) = ∅, {C}.) Effectively, our results enable any set of uncertain databases to
have tractable consistency-checking, but with some information loss when the acyclic
hypergraph property isn’t satisfied.
Our results are framed for the possible-worlds representation, but they also hold
for more compact representations that satisfy conditions outlined in the respective
theorems (such as the existence of a polynomial containment check). The following
two theorems are the most technically challenging of the chapter.
• Node Removal: Since the node t is removed, we remove the tuple corresponding
to t from V(e); i.e., we replace V(e) by V(e) # {t}.
Proof. Let the node removal step remove tuple t from the source Vi . Recall that t
is not contained in any other source. Let VI = {V1 , . . . , Vi , . . . , Vm } denote the view
extensions before the node removal, and let VF = {V1 , . . . , Vi # {t}, . . . , Vm } denote
the view extensions after the node removal.
{V | V = W ∪ {t}, if (W ∩ T(Vi # {t})) ∪ {t} ∈ PW(Vi);
V = W, otherwise}
Proof. Consider the edge-removal step on edge e ⊆ f. Let the set of view extensions
be VI and VF = VI \ {V(e)} before and after the edge removal, respectively.
Proof. The node removal step of the GYO-reduction is the same as Theorem 2.
• Edge Removal: Remove the source corresponding to V(e), along with the
edge e, and modify the uncertain database associated with f to VR(f) by
retaining the same tuple set and making the possible worlds PW(VR(f)) equal
to:
{W | W ∈ PW(V(f)), W ∩ T(V(e)) ∈ PW(V(e))}
The sources are consistent if and only if the above reduction results in a simple
hyperedge. If during an edge-removal step we obtain VR(f) = ∅, we declare the sources
inconsistent. Lemma 1 above continues to hold. Lemmas 4 and 5 complete the proof.
Proof. Consider the edge-removal step on edge e ⊆ f. Let the set of view extensions
be VI and VF = (VI \ {V(e), V(f)}) ∪ {VR(f)}, before and after the removal.
PW(M + VR(f)) ⊆ PW(M + V(f))   (since T(VR(f)) = T(V(f)))
⊆ PW(V(f))   (since V(f) ⊑S M)
PW(M + V(f)) = PW(M + VR(f))   (since T(VR(f)) = T(V(f)))
⊆ PW(VR(f))   (since VR(f) ⊑S M)
Lemma 6. The set V of sources is consistent if and only if at least one of these N
uncertain databases, say U , is a consistent mediated database.
Proof.
Consistent U ⟹ Consistent V: U is a consistent mediated database for the set V.
with one possible world ∪_{i∈{1,...,m}} Wi. Note that U is consistent, and is one of the N
uncertain databases above.
Proof.
Consistent M ⟹ Consistent V: M is a consistent mediated database for V.
well. There are multiple consistent mediated uncertain databases for a given input,
hence the challenge is in defining the notion of “best” query answers corresponding to
certain answers in certain data integration [Hal01]. Informally, we would like the
best answer to contain all the information implied by the sources, and nothing more.
Note that query answering only makes sense when the input sources are consistent.
Also, we restrict ourselves to identity queries; extensions to the class of monotonic
views and queries follow using additional results presented in Section 5.7.
5.6.1 Definitions
We define the notion of correct answer and strongest correct answer, analogous to the
traditional notions of certain answer and maximal certain answer for data integration
without uncertainty [Hal01].
Definition 15 (Fictitious Tuple). For a set of identity views {Vi}, a tuple t is said
to be fictitious in a consistent mediated database M if t is not present in any of the
view extensions; i.e., ∀i, t ∉ Vi.
the set of all consistent mediated databases. The collected database MC has tuple set
T(MC) = ∪_i T(Vi) and contains all possible worlds in all mediated databases in Mres:
PW(MC) = ∪_{M ∈ Mres} PW(M)
Notice that Mres is the set of mediated databases that do not contain any fictitious
tuples.
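The collected database is a straightforward union once Mres is in hand. A sketch, with a hand-picked illustrative Mres (the text defines Mres as all consistent mediated databases without fictitious tuples; enumerating it is not shown here):

```python
# The collected database M_C: union of the sources' tuple sets and of the
# possible worlds of every mediated database in M_res.

def collected(view_tuple_sets, m_res):
    t_c = frozenset().union(*view_tuple_sets)
    pw_c = set().union(*(pw for _, pw in m_res))
    return (t_c, pw_c)

mc = collected(
    [frozenset("a"), frozenset("b")],                     # source tuple sets
    [(frozenset("ab"), {frozenset("ab"), frozenset("a")}),  # illustrative
     (frozenset("ab"), {frozenset("ab"), frozenset()})],    # members of M_res
)
assert mc[0] == frozenset("ab")
assert mc[1] == {frozenset("ab"), frozenset("a"), frozenset()}
```

Every world that some consistent mediated database allows survives into MC, which is what makes MC produce the strongest correct answer under superset-containment.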
5.6.2 Superset-Containment
The following theorem shows how to obtain the SCA to a query for a set of sources.
Theorem 6. For a set of consistent sources, where each source is described by the
identity view, there exists an uncertain database MI that gives the SCA CQ to any
query Q:
∃MI ∀Q : CQ = Q(MI)
Proof. We show that answering queries using the collected database MC (from Defini-
tion 16) gives the SCA for all queries.
∃M ∈ Mres : PW(M′ + MC) = PW(M)
∀M ∈ Mres : PW(M) ⊆ PW(MC)
Notice that the collected database produces the SCA to all queries.
5.6.3 Equality-Containment
This section studies query answering under the equality-containment setting. Un-
fortunately, a nontrivial SCA may not always exist for a set of views. However,
the construction from the previous section still gives us good answers to queries in
this setting: we show that it yields a unique procedure that satisfies certain natural
properties.
Theorem 7. There exist sets of view extensions for which even though there are
several nontrivial mediated databases, there is no SCA for the identity query.
Proof. Consider two views: V1 with possible worlds W11 = {a} and W12 = ∅, and V2
with possible worlds W21 = {b} and W22 = ∅. The two views give several consistent
mediated databases, such as M1 = {{a, b}, {a}, {b}, ∅} and M2 = {{a, b}, ∅}. While V1
and V2 themselves are correct answers, any uncertain database A with T (A) = {a, b}
is not contained in at least one of the above mediated databases. Hence, there is no
SCA for the identity query.
Since an SCA may not always exist, we relax our requirements from the best answer.
We introduce the notion of a “query answering mechanism,” and we define two
desirable properties. We prove that the query answering mechanism we propose is the
only mechanism that satisfies these properties.
The consistency property requires that the results for a query be obtained from a
consistent mediated database. It also asserts that the query answering mechanism
must not add data-information to the result beyond what is entailed by the sources,
hence disallowing fictitious tuples (Definition 15).
The all-possibility property requires that the query answering mechanism must
not add specificity-information to the result beyond what is entailed by the sources.
It asserts that the mechanism must answer queries without ruling out any possible
world that could exist in some consistent mediated database.
The following fairly straightforward theorem states that answering queries using
the collected database (Definition 16) is the only procedure that satisfies the two
properties above.
Definition 18 (Inverted skolemized source). For a source with view extension V and
view definition Q, let the inverse rules be R. Consider the source (V S ,I) over the
mediated schema, where V^S is obtained by applying the inverse rules to each possible
world of V:
PW(V^S) = {R(W) | W ∈ PW(V)}
CHAPTER 5. UNCERTAIN-DATA INTEGRATION 74
Definition 19 (Inverted deskolemized source). When all tuples from the inverted
skolemized version (V S ) of a source that contain at least one skolem constant are
dropped, the uncertain database obtained (V D ) is called the inverted deskolemized
version of the source.
The following theorems show that the inverted deskolemized source and the inverted
skolemized source are consistency preserving.
Proof. Consider a tuple t that contains a skolem constant. Since skolem constants
are unique to the rule applied, the tuple t can only exist in one deskolemized source.
Hence, the tuple t can be dropped using the node-removal step of Theorem 2, which
preserves consistency.
Theorem 10. A set of sources with extensions V = {V1, . . . , Vm} and view definitions
Q = {Q1, . . . , Qm} is consistent if and only if the set of inverted skolemized versions of the
sources V^S = {V1^S, . . . , Vm^S} (all defined by the identity query) is consistent.
Theorem 11. A set of sources with extensions V = {V1, . . . , Vm} and view definitions
Q = {Q1, . . . , Qm} is consistent if and only if the set of inverted deskolemized versions
of the sources V^D = {V1^D, . . . , Vm^D} (all defined by the identity query) is consistent.
The above theorems together show that all of our PTIME results (for the tractable
query-complexity subclass as well as the extensional complexity results) for both
superset-containment and equality-containment carry over for monotonic views: the
consistency checks are now applied on the inverted deskolemized sources.
Next we turn to answering monotonic queries over a set of sources with monotonic
queries as view definitions. For monotonic views, we use their inverted skolemized
versions to construct the set of mediated databases. Note that our construction allows
non-fictitious tuples to have skolem constants. The construction of the collected
database from Section 5.6 uses the above set of mediated databases. The result of a
query over a skolemized relation retains only tuples with no skolem constants.
The query-answering results for equality-containment presented in Section 5.6.3
carry over from the above observations. However, as described in Section 5.6.2,
extending our superset-containment results to arbitrary monotonic queries additionally
requires us to show that containment is preserved by our class of queries:
Lemma 8. For uncertain databases U1, U2 with the same schema SC, for any monotonic
query Q over SC that retains a key K of SC, we have: U1 ⊑_S U2 ⟹ Q(U1) ⊑_S Q(U2).
Proof. By definition of U1 ⊑_S U2:
For any W2 ∈ PW(U2), consider any tuple t ∈ W2 \ T(U1). (If no such t exists, clearly,
Q(W2) = Q(W1).) For key-preserving monotonic queries, t.K ∉ T(Q(W1).K), hence
Q(W2) = Q(W1). Hence
• m_U(∅) ≠ 1
Recall from Section 4.2 that m assigns mass to every subset of possible worlds, and it
is unspecified how this mass is split within the subset.
For presentation of containment definitions, we extend Definition 9 from Section 5.5
to generalized uncertain databases:
U1 = U2 + U1
• T(U1) = T(U2 + U1)
• m(U1) ⊇ m(U2 + U1)
• m_M(∅) ≠ 1
• T(U_α) = T(U)
Notice that m_{U_α} is still a valid mass function. Discounting can be interpreted as
asserting that U might be completely inaccurate with weight α.
Definition 6 can be extended so that each source S is now a triple: S = (V, Q, α).
By using discounted sources, we are able to encode the reliability of each source in the
data integration problem. A higher α denotes a less reliable data source.
Recall from Section 7 that a set of inconsistent sources represents contradiction,
and hence at least one of the sources must be incorrect. Previously, we had no
tools to resolve this contradiction. But now, with the notion of discounted sources,
inconsistency implies that at least some of our sources need to be discounted. The
following observation shows that we can always achieve consistency if we discount
each of the input sources.
There has been a lot of work in the area of incomplete information databases.
These are sets of certain global databases that arise as a result of data integration
of certain sources. Reference [Gra02] presents a good overview. In contrast, in our
setting integration of uncertain sources results in sets of uncertain global databases.
Our theory is based on possible worlds and some of our results rely on the
existence of an efficient containment check over the model used for representing
uncertain databases. In contrast, reference [AKG87] presents complexity results about
representing and asking questions about sets of possible worlds. This work is in fact
complementary to our work, and provides a natural starting point for our investigation
about compact representations.
There has been previous work on using probabilistic techniques for data integra-
tion [FKL97, MM07, MRBM05, SDH08]. This work looks at uncertain integration of
certain data and is not to be confused with our work, which addresses integration of
uncertain data itself.
Recently, data exchange has been studied over probabilistic databases [FKK10].
In contrast to our work, which combines information from multiple sources to a single
target, the work in [FKK10] only considers a single source. However, in the context of
a single source: (1) it allows more general kinds of mappings than just local-as-view
mappings; (2) it associates probabilities with possible worlds; and (3) it studies
some compact representations.
Reference [DS05] studies the problem of answering queries using imprecise sources,
where the imprecision is captured by a probabilistic database. That work presents
conditions under which view statistics are sufficient to answer a query over a mediated
database and describes algorithms for computing result-tuple probabilities. In contrast,
the goal of our work is to develop a theory for integrating uncertain sources starting
with the fundamental notion of containment. To this end, we introduce superset-
containment and equality-containment, and address the problem of consistency, none
of which are the subject of [DS05].
Finally, several papers [DHY07, Haa07, HRO06, SDH09] mention that the
problem of integrating uncertain data is important, but do not address it.
uncertain data. The hope is that this approach is in fact equivalent to the current
approach. While the two approaches provide similar utility for integration, our
idea is especially appealing because it can potentially allow us to perform entity
resolution (recall Section 5.7) after the data from all sources has been migrated
to the mediated schema.
Chapter 6
High-Confidence Joins
In uncertain databases, users often seek to compute high-confidence results for queries.
In this chapter, we discuss specialized algorithms for join queries that seek to prefer-
entially return result tuples with the highest confidence. We restrict ourselves to a
special case of uncertain databases without alternative values. The work presented in
this chapter appeared in [AW09a].
6.1 Introduction
Recall from Chapter 2 that alternatives and confidence values may be associated with
data items in uncertain databases. In this chapter, we consider the special case of
uncertain databases with confidences but without alternative values. In this special
case, result tuples have no alternatives, yet always have associated confidence values.
When result data has confidence values, the following query types become important:
CHAPTER 6. HIGH-CONFIDENCE JOINS 84
• In Section 6.2 we formalize our problem and environment: processing joins over
uncertain data with confidences, in which result confidences influence query
processing and data sets are large so memory may be a limitation.
• In Section 6.3 we provide efficient algorithms for all four query types introduced
above: Threshold, Top-k, Sorted, and Sorted-Threshold. Our algorithms are
suitable for processing stand-alone join queries, and they can be used for non-
blocking operators within a larger query plan.
• Section 6.4 presents efficiency guarantees for our algorithms. For each query
type and problem instance, we present our guarantees with respect to the most
efficient correct algorithm on that instance that uses the same primitive operations
and memory as ours.
Related work is covered in Section 6.6 and we discuss future work in Section 6.7.
6.2 Foundations
Consider two relations R and S. Generically, a join operation R ⋈_θ S on them (where
θ is the join condition) can be viewed as an exploration of the cross-product of R and
S. For example, if:
then six points in the cross-product need to be considered to evaluate R ⋈_θ S. All join
methods effectively explore this two-dimensional space, though often incorporating
[Figure: the join viewed as exploration of a two-dimensional space, with tuples of R on one axis and tuples of S on the other.]
6.3 Algorithms
6.3.1 Basics
In this section, we set up the basics for the description of the algorithms. Recall we
assume that relations can be retrieved through sorted access by confidence. More
precisely, we assume that we can ask for tuples in a relation starting at an offset
in sorted order. We also assume a monotonic combining function f that is used to
compute confidence values of result tuples from the confidence values of the joining tuples.
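A common concrete instance of such a combining function f is multiplication of the joining tuples' confidences; multiplication is an illustrative assumption here, not mandated by the text. A quick sketch spot-checks the monotonicity property that confidence-based pruning relies on:

```python
# An illustrative monotonic combining function f: the product of the
# two confidences. Monotonicity (f never decreases when either input
# increases) is what makes confidence-based pruning sound.

def f(c_r, c_s):
    return c_r * c_s

# Spot-check monotonicity on a small grid of confidence values.
grid = [i / 10 for i in range(11)]
monotone = all(
    f(a, c) <= f(b, c)
    for c in grid
    for a in grid
    for b in grid
    if a <= b
)
```

Any function with this property (e.g., minimum instead of product) would work with the algorithms that follow.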
We consider a cost metric proportional to data retrieval cost. This metric closely
resembles disk-read cost, since our algorithms all retrieve data in large sequential
chunks. Some data items may be retrieved more than once and are counted as many
times as they are retrieved. (We manage all use of memory as part of our algorithms.)
We use the following symbols in our description:
• R and S are the relations to be joined, with R usually the inner and S the outer.
• or, os, or1, or2, os1, and os2 are offsets into the sorted relations, in some unit of
memory.
• M and L are memory sizes in the same unit as the offsets above.
• load(S, os1, os2): Loads into memory tuples from relation S (outer) starting
at offset os1 and ending at os2. (It may construct a hash table on the joining
attributes to enable efficient look-ups in memory when R is scanned, if the join
condition can benefit, but doing so does not affect retrieval cost.) The cost of a
load is os2 − os1.
• scan(R, or1, or2): Scans relation R (inner) starting at offset or1 and ending at
or2. While scanning, the join condition is evaluated on each scanned tuple of R
and all tuples of S residing in memory (possibly using the hash table if the join
condition allows it). The cost of a scan is or2 − or1.
• explore(R, S, or1, or2, os1, os2): Combines load(S, os1, os2) followed by scan(R,
or1, or2). Each explore step covers a rectangle of the two-dimensional space, as
shown in Figure 6.3. The cost for the exploration of this rectangle is the sum of
the length corresponding to the scan and the breadth corresponding to the load:
(or2 − or1) + (os2 − os1).
In load, scan, and explore, sometimes os2 and or2 are not constants, but are found by
checking a condition while loading or scanning. In our pseudocode, these end-conditions
are specified alongside the function call.
We consider the setting where the total memory available is limited: specifically
there is only enough memory to load size M of a relation, plus a small amount of
additional memory for scanning and maintaining state for the algorithms. Note that
since breadth corresponds to load, an explore rectangle can have breadth no more than
M.
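The three primitives and their cost accounting can be sketched as follows, assuming relations are Python lists of (key, confidence) pairs sorted by decreasing confidence and a product combining function; all names here are illustrative, not the dissertation's implementation:

```python
# Hypothetical sketch of the load/scan/explore primitives. Costs follow
# the text: a load costs os2 - os1, a scan costs or2 - or1, and an
# explore (load then scan of one rectangle) is the sum of the two.

def load(S, os1, os2):
    """Bring S[os1:os2] into memory; cost is the breadth os2 - os1."""
    block = S[os1:os2]
    return block, os2 - os1

def scan(R, block, or1, or2, theta, f):
    """Scan R[or1:or2] against the in-memory S block, evaluating the
    join condition theta and combining confidences with f.
    Cost is the length or2 - or1."""
    results = []
    for r in R[or1:or2]:
        for s in block:
            if theta(r, s):
                results.append((r, s, f(r[1], s[1])))
    return results, or2 - or1

def explore(R, S, or1, or2, os1, os2, theta, f):
    """One rectangle of the cross-product: load, then scan.
    Total cost is (or2 - or1) + (os2 - os1)."""
    block, load_cost = load(S, os1, os2)
    results, scan_cost = scan(R, block, or1, or2, theta, f)
    return results, load_cost + scan_cost
```

A hash table on the joining attributes would replace the inner loop in `scan` without changing the retrieval cost, matching the remark in the `load` description above.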
6.3.3 Threshold
A Threshold query returns all result tuples whose confidence value is above a specified
threshold τ. We describe algorithms to efficiently compute results of Threshold queries
[Figure 6.4: M × M blocks a–y of the cross-product. Block confidences along S (decreasing): 1.0, 0.8, 0.7, 0.6, 0.5 for rows a–e, f–j, k–o, p–t, u–y; along R (decreasing): 1.0, 0.9, 0.6, 0.3, 0.1 for the five columns.]
over joins of two relations. As discussed in Section 6.2, to answer such queries, we
only need to explore a stair-like region in the cross-product. This exploration can
be performed like a nested-loop join, modified not to explore regions that cannot
contribute results with confidence above the threshold. The exploration is detailed in
Algorithm 1 and illustrated in Figure 6.5. We call this algorithm Threshold1.
In the example of Figure 6.4, for τ = 0.55, the blocks would be explored in the
following order: a, b, c, f, g, k, l, p. Blocks c, g, l, p are explored only partially: the load
and scan within explore operations also use threshold τ to terminate, as seen in the
specification of os and or in Algorithm 1. For instance, in the example the load on
block p only loads tuples in S with confidence > 0.55, and scans only tuples in R
with confidence > 0.55/0.6. The explore operation emits result tuples as it finds them,
after checking that they satisfy the threshold.
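The pruning arithmetic for block p can be checked directly, assuming multiplication as the combining function (an assumption consistent with the 0.55/0.6 bound in the text); the R-block confidences are those of Figure 6.4:

```python
# Worked check of the pruning bound for block p: an S tuple with
# confidence 0.6 can only contribute results above tau = 0.55 when
# joined with R tuples of confidence greater than tau / 0.6.

tau = 0.55
s_conf = 0.6
r_bound = tau / s_conf                 # minimum R confidence worth scanning

r_confs = [1.0, 0.9, 0.6, 0.3, 0.1]    # R block confidences from Figure 6.4
worth_scanning = [c for c in r_confs if c > r_bound]
```

Only the first R block survives the bound, which is why block p is explored partially.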
Theorem 12 in Section 6.4 shows that Threshold1 has cost less than 2 times that
Algorithm 1 Threshold1
i ← 0
while 1 do
  if conf(R, S, 0, M · i) ≤ τ then
    break
  explore(R, S, 0, or, M · i, min(M · (i + 1), os))
    {os ← min p : conf(R, S, 0, p) ≤ τ}
    {or ← min p : conf(R, S, p, M · i) ≤ τ}
  i ← i + 1
of any algorithm that explores the entire part of the cross-product that can produce
result tuples. Threshold1 performs much better than this factor when the confidence
distributions in the input relations are asymmetric.
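A minimal sketch of the Threshold1 exploration, assuming relations are lists of (key, confidence) pairs sorted by decreasing confidence, an equality join condition, and multiplication as the combining function; M is the memory size in tuples, and the helper names and block layout are illustrative, not the dissertation's code:

```python
# Threshold1 sketch: block-by-block nested-loop exploration, pruned so
# that no region that cannot beat tau is loaded or scanned.

def threshold1(R, S, tau, M):
    def cutoff(rel, bound):
        # Offset of the first tuple with confidence <= bound
        # (confidences are sorted in decreasing order).
        j = 0
        while j < len(rel) and rel[j][1] > bound:
            j += 1
        return j

    results, i = [], 0
    while True:
        # Stop when even the best R tuple cannot beat tau on this S block.
        if M * i >= len(S) or R[0][1] * S[M * i][1] <= tau:
            break
        # Load: S tuples of this block that can still beat tau.
        block = S[M * i:min(M * (i + 1), cutoff(S, tau / R[0][1]))]
        # Scan: R tuples that can beat tau with this block's best S tuple.
        for r in R[:cutoff(R, tau / S[M * i][1])]:
            for s in block:
                c = r[1] * s[1]
                if r[0] == s[0] and c > tau:
                    results.append((r[0], c))
        i += 1
    return results
```

Note how τ prunes both the portion of the S block that is loaded and the prefix of R that is scanned, mirroring the end-conditions in Algorithm 1.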
An algorithm for Threshold with a better guarantee can be devised if the shape of
the shaded region is known at the start. Suppose for example that we have meta-data
(call it D) allowing us to find the confidence values of tuples at offsets that are multiples
of M, as in Figure 6.4. We now describe an algorithm called Threshold2 that uses
this information while exploring the space.
Like Threshold1, Threshold2 explores the pruned space, but this time in a more
efficient manner. The idea is to choose the outer for the explore operation at each
step to achieve a longer scan. The choice is made by using meta-data D to determine
the approximate scan lengths for both possible outers. The exploration is detailed
in Algorithm 2 and illustrated in Figure 6.6. The function scan-lengths returns the
approximate scan lengths for both choices of outer using the meta-data D. For the
example of Figure 6.4 with threshold 0.55, the order in which the blocks would be
explored is a, f, k, p, b, g, l, c. As before, some blocks, namely p, l, and c, are explored only
partially. As in Threshold1, result tuples are emitted as they are found after checking
that they satisfy the threshold.
Theorem 13 in Section 6.4 shows that Threshold2 has cost less than 1.5 times
any algorithm that explores the entire part of the cross-product that can produce
result tuples. This factor occurs only in cases where the part of the cross-product
explored just exceeds the memory available. As the area to be explored becomes large
compared to available memory, the factor comes closer to 1, as shown by Theorem 14
in Section 6.4.
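The greedy outer choice in Threshold2 can be sketched as follows, assuming the meta-data D is the array of confidence values at block boundaries (offsets that are multiples of M) and multiplication as the combining function; names are illustrative:

```python
# scan-lengths sketch: estimate, for the current corner (or1, os1), how
# long the scan would be for each choice of outer, using only the
# block-boundary confidences (the meta-data D).

def scan_lengths(R_conf, S_conf, or1, os1, tau):
    """R_conf[i] / S_conf[j]: confidence at block boundary i of R, j of S.
    Returns approximate scan lengths, in blocks, for each choice of outer."""
    def length(confs, bound):
        n = 0
        while n < len(confs) and confs[n] > bound:
            n += 1
        return n
    lr = length(R_conf, tau / S_conf[os1])  # scan over R if S is the outer
    ls = length(S_conf, tau / R_conf[or1])  # scan over S if R is the outer
    return lr, ls
```

With the block confidences of Figure 6.4 and τ = 0.55, scanning S gives the longer scan (4 blocks versus 3), which matches the column-first order a, f, k, p in the example above.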
6.3.4 Top-k
A Top-k query returns k result tuples with the highest confidence for a query-specified
k. In this section, we discuss our algorithm for Top-k queries over two-relation joins.
The idea again is to prune the space to be explored using a threshold as in the
algorithms for Threshold. The threshold to be used is the confidence value of the kth
tuple in the result, i.e., the minimum confidence value among the Top-k result tuples.
Of course, this value is not known at the start of the algorithm.
During the algorithm we maintain a current top-k result set in a priority queue K.
When a result tuple t is generated with confidence greater than the confidence of the
lowest-confidence result tuple t0 in K, t0 is replaced by t. To avoid too much extra
Algorithm 2 Threshold2
or1 ← 0; os1 ← 0
while 1 do
  if conf(R, S, or1, os1) ≤ τ then
    break
  lr, ls ← scan-lengths(or1, os1, τ, D)
  if lr ≥ ls then
    explore(R, S, or1, or2, os1, min(M + os1, os2))
      {os2 ← min p : conf(R, S, or1, p) ≤ τ}
      {or2 ← min p : conf(R, S, p, os1) ≤ τ}
    os1 ← min(M + os1, os2)
  else
    explore(S, R, os1, os2, or1, min(M + or1, or2))
      {os2 ← min p : conf(R, S, or1, p) ≤ τ}
      {or2 ← min p : conf(R, S, p, os1) ≤ τ}
    or1 ← min(M + or1, or2)
explored becomes much larger than available memory. For such regions, larger L may
further increase efficiency. In fact, the best approach may be to vary L for different
explore steps, as discussed in Section 6.3.6.
Technically, our guarantees for Top-k hold only in cases where sufficient memory
is not available, which is the environment we are targeting. However, we can combine
our algorithm with an efficient Top-k algorithm A0 that relies on sufficient memory
being available, e.g., [IAE03]. Such algorithms also maintain a queue corresponding
to our queue K. If algorithm A0 runs out of memory during processing, it is possible
to switch to ours, continuing processing on queue K.
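The queue K of current top-k results can be sketched as a bounded min-heap, where Bottom(K) is the pruning threshold that tightens as better results arrive; this is an illustrative structure, not the dissertation's code:

```python
import heapq

# Running top-k buffer: a min-heap of at most k results whose smallest
# element is Bottom(K). A new result replaces the current bottom only
# if it has higher confidence.

class TopKBuffer:
    def __init__(self, k):
        self.k = k
        self.heap = []          # min-heap of (confidence, tuple)

    def bottom(self):
        # Pruning threshold: 0 until the buffer holds k results.
        return self.heap[0][0] if len(self.heap) == self.k else 0.0

    def offer(self, conf, tup):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (conf, tup))
        elif conf > self.heap[0][0]:
            heapq.heapreplace(self.heap, (conf, tup))

    def results(self):
        # Final top-k, in decreasing confidence.
        return sorted(self.heap, reverse=True)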
6.3.5 Sorted-Threshold
We now describe an algorithm for Sorted-Threshold, which returns result tuples
with confidence over a threshold τ, sorted by confidence. Sorted, which returns all
results sorted by confidence, corresponds to the special case where τ = 0. The
algorithm explores the same pruned space as in the Threshold problem, but in an
order resembling Top-k. The algorithm is detailed in Algorithm 4. Like Top-k, it
uses two priority queues: K for temporarily maintaining result tuples, and Q for
Algorithm 3 Top-k
oR[e] ← 0; oS[e] ← 0
confidence[e] ← 1
insert e into Q
while 1 do
  i ← ExtractMax(Q)
  if confidence[i] ≤ Bottom(K) then
    break
  or1 ← oR[i]
  os1 ← oS[i]
  explore(R, S, or1, min(L + or1, or2), os1, min(M + os1, os2))
    {os2 ← min p : conf(R, S, or1, p) ≤ Bottom(K)}
    {or2 ← min p : conf(R, S, p, os1) ≤ Bottom(K)}
  oR[r] ← or1 + L; oS[r] ← os1
  confidence[r] ← conf(R, S, oR[r], oS[r])
  insert r into Q
  oR[u] ← or1; oS[u] ← os1 + M
  confidence[u] ← conf(R, S, oR[u], oS[u])
  insert u into Q
maintaining information about the rectangles yet to be explored. The threshold used
to terminate loads and scans, and to terminate the algorithm itself, is the input ⌧ .
The explore operation pushes result tuples that satisfy the threshold into the queue K.
The algorithm maintains an upper bound on the confidences of all unseen result
tuples at any stage of its execution. Result tuples in K whose confidence is at least
this bound are emitted in sorted order as the bound decreases.
Sorted-Threshold emits a tuple as soon as the bound decreases below the tuple's
confidence, so it is a nonblocking algorithm. The algorithm explores the space in the same order
as Top-k, but we can make stronger guarantees. Theorem 17 in Section 6.4 shows
that Sorted-Threshold with L = M has cost less than 2 times that of any algorithm that
explores the space that can produce tuples with confidence > τ. Parameter L is
further discussed in Section 6.3.6.
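The nonblocking emission from K can be sketched with a max-heap: buffered results are released, already in sorted order, as soon as the upper bound on unseen confidences drops below them (an illustrative sketch, not the dissertation's code):

```python
import heapq

# Sorted emission buffer: results wait in a max-heap (negated
# confidences) and are emitted once no unseen result can exceed them.

class SortedEmitter:
    def __init__(self):
        self.K = []             # max-heap via negated confidences

    def push(self, conf, tup):
        heapq.heappush(self.K, (-conf, tup))

    def emit_upto(self, bound):
        """Emit every buffered result with confidence >= bound. Since no
        unseen result can exceed bound, the emitted order is final."""
        out = []
        while self.K and -self.K[0][0] >= bound:
            neg_conf, tup = heapq.heappop(self.K)
            out.append((-neg_conf, tup))
        return out
```

Calling `emit_upto` each time the bound decreases yields the nonblocking, sorted output described above.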
Algorithm 4 Sorted-Threshold
oR[e] ← 0; oS[e] ← 0
confidence[e] ← 1
insert e into Q
while 1 do
  i ← ExtractMax(Q)
  if confidence[i] ≤ τ then
    break
  or1 ← oR[i]
  os1 ← oS[i]
  explore(R, S, or1, min(L + or1, or2), os1, min(M + os1, os2))
    {os2 ← min p : conf(R, S, or1, p) ≤ τ}
    {or2 ← min p : conf(R, S, p, os1) ≤ τ}
  oR[r] ← or1 + L; oS[r] ← os1
  confidence[r] ← conf(R, S, oR[r], oS[r])
  insert r into Q
  oR[u] ← or1; oS[u] ← os1 + M
  confidence[u] ← conf(R, S, oR[u], oS[u])
  insert u into Q
6.3.6 Discussion
In this section we cover a few properties of and extensions to our algorithms not
discussed in their initial presentation.
Parameter L
Recall that our Top-k algorithm uses input parameter L to restrict the maximum
length of scan operations. Our efficiency guarantees are proved for L = M . However,
other choices of L may yield more efficient computations for some inputs: If the space
that needs to be explored to compute the top k results turns out to be much larger
than memory M , then it is beneficial to use a larger L. Various methods can be
employed to detect and exploit this situation. For example, if statistics are available,
we might estimate in advance the space to be explored and adjust L accordingly. L
can also be adjusted adaptively during the algorithm based on what the algorithm
has seen so far. Specifically, each invocation of the explore operation can dynamically
choose a different L. Also note that when L ≠ M is used, the algorithm is no longer
symmetric, so we can incorporate dynamic selection of the outer relation in a similar
fashion to algorithm Threshold2 (Section 6.3.3).
Sorted-Threshold may also choose to adjust parameter L, but the considerations
are somewhat different. In this case, L encapsulates a memory-cost trade-off: a larger
L, while being more cost-efficient, may require a larger buffer for result tuples. Here
too we can set L in advance based on known statistics, or modify it adaptively based
on observations during the algorithm’s execution. As before, we can also incorporate
dynamic selection of the outer relation in each explore step.
Using Indexes
Our algorithms do not assume any indexes on the join attributes, and instead use
pruning based on result confidences. Nested-loop index joins can be modified to devise
algorithms for all four query types that also prune based on result confidences:
• A buffer for result tuples is maintained for Top-k, Sorted, and Sorted-Threshold,
as in the algorithms described above.
• The confidence of the last tuple from the outer can be used to determine an upper
bound on the confidence of all unseen result tuples because of monotonicity.
These algorithms are modifications of the Threshold Algorithm [FLN03]. They are
easily used in the context of uncertain databases because they don’t need to load the
outer into memory, and hence can’t run out of memory.
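The confidence-based stopping rule for such an index-join variant can be sketched as follows, assuming multiplication as the combining function and a hypothetical `probe` function standing in for the index; all names are illustrative:

```python
# Index nested-loop join with confidence pruning: once the current
# outer confidence times the best possible inner confidence (1.0)
# cannot beat tau, monotonicity guarantees no later outer tuple can
# produce a qualifying result, so the scan stops.

def index_join_threshold(outer, probe, tau):
    """outer: (key, conf) pairs sorted by decreasing conf.
    probe(key): matching (key, conf) inner tuples via a hypothetical index."""
    results = []
    for key, conf in outer:
        if conf * 1.0 <= tau:       # upper bound on any unseen result
            break
        for _ikey, iconf in probe(key):
            c = conf * iconf
            if c > tau:
                results.append((key, c))
    return results
```

The same upper bound (outer confidence times 1.0) serves as the decreasing bound needed for the Top-k and Sorted variants mentioned above.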
Use in Queries
Our algorithms have all been specified for two-way join operations. Suppose we have
R1 ⋈ R2 ⋈ · · · ⋈ Rm, and we wish to obtain the result of this multiway join using
Threshold, Top-k, or Sorted-Threshold. Our approach is fairly simple: since Sorted
(i.e., Sorted-Threshold with τ < 0) provides access to its results in a non-blocking
fashion sorted by confidence, we build a tree of two-way joins. All lower nodes in the
tree perform the Sorted operation on their two operands, while the root performs the
desired operation (Threshold, Top-k, or Sorted-Threshold) on its operands. (Sometimes
portions of intermediate results need to be materialized, if the algorithm needs to scan
them multiple times.) Operators also have the option of using the indexed version
of the algorithms as described in Section 6.3.6, if the join condition permits it. All
algorithms that use pruning based on result confidences require their input sorted by
confidence. This may be provided either by storing the relations sorted by confidence
or by using a sort operator in the query plan, which allows other operators to be used
beneath the join. Sorted by confidence is a special "interesting" order because it can yield
efficiency benefits for a significant fraction of queries. These new algorithms introduce
new query optimization challenges that we leave for future work.
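The join-tree idea can be sketched at a high level, simulating each Sorted node by a merge of keyed confidences under a product combining function; this is purely illustrative (a real plan streams results rather than materializing them):

```python
# Left-deep tree of two-way joins: every lower node runs in Sorted mode
# (all results, ordered by decreasing confidence), and the root applies
# the desired operation, here Top-k. Relations are modeled as dicts
# mapping a join key to a confidence.

def sorted_join(left, right):
    """Join two key -> confidence maps; return (key, conf) pairs in
    decreasing combined confidence, i.e., a Sorted result."""
    out = [(k, left[k] * right[k]) for k in left.keys() & right.keys()]
    return sorted(out, key=lambda kv: -kv[1])

def multiway_topk(relations, k):
    """Fold Sorted joins left-deep, then take top-k at the root."""
    acc = relations[0]
    for rel in relations[1:]:
        acc = dict(sorted_join(acc, rel))
    return sorted(acc.items(), key=lambda kv: -kv[1])[:k]
```

Because each lower node's output is sorted by confidence, the root operator can apply the same confidence-based pruning as in the two-way case.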
• As was shown in [DS04b], “safe plans” can be found for a class of queries over
uncertain databases. A safe plan ensures that all intermediate results are inde-
pendent, allowing the query processor to compute confidences for intermediate
result tuples using the simple multiplication combining function. The operators
proposed in this chapter can be used directly in all safe plans.
to the results of the combining function, which are then refined as needed. Our
algorithms can be extended to operate on interval-approximations instead of
exact results, minimizing both data retrieval cost and (expensive) computation
of exact values.
Generalizing
We cast our presentation in the context of uncertain and probabilistic databases, since
that is the area in which we work, and it forms the context for our implementation.
However, just as with other related algorithms (to be discussed in Section 6.6, e.g.,
[IAE03]), our algorithms are quite generic. For example, they can also be applied
directly to any setting with a monotonic “scoring” function, such as middleware
systems [Fag99] and multimedia databases [CG96, NR99].
Every algorithm emits an unordered set of tuples as its result for Threshold, and
an ordered set of result tuples for Sorted-Threshold and Top-k. An algorithm is valid
if it returns the correct result for all possible inputs.
For efficiency comparisons we only consider algorithms in a class we refer to as A.
Algorithms in this class are restricted to use the same access on the data and memory
Let Cost(A, I) represent the cost of algorithm A on input I. Recall that our cost
metric, which can be applied to any algorithm in A, captures the total amount of
data retrieved through the sorted-by-confidence interface.
The following five theorems encompass our efficiency guarantees for Threshold
and Top-k. (Sorted and Sorted-Threshold are discussed after these theorems.) Each
theorem compares one of our algorithms against any valid algorithm in class A for
the same problem. Note that Threshold1, Threshold2, and Top-k are valid algorithms
in A for their respective problems.
Our results are quite strong in the following sense. Our bounds on efficiency are
based on problem instances: We show for each instance a comparison of our algorithm
against any other valid algorithm on the same instance. This comparison is stronger
than comparing algorithms on all instances—in our comparison, a different "best"
algorithm may compete with ours on different instances.
We provide intuitive proof sketches for each theorem. Full proofs are deferred to
Section 6.8.
Theorem 12. Let A_t represent the set of all valid algorithms in A for the Threshold
problem, and let I_t represent the set of all inputs. The following holds:
∀I ∈ I_t : Cost(Threshold1, I) < 2 · min_{A ∈ A_t} Cost(A, I)
For an input I to the Threshold problem, consider the part of the cross-product
of the input relations at which the combining function f evaluates to more than the
threshold τ, for example the shaded region in Figure 6.2. All valid algorithms for the
problem must cover the shaded region, i.e., they must evaluate the join condition ✓ at
all points in the region.
We show that Threshold1 has cost less than 2 times the cost of any algorithm that
covers the region. Intuitively, one lower bound corresponds roughly to the area (in
units M 2 ) of the shaded region. However, another lower bound is the semi-perimeter
(in units M ) of the stair-like shaded region, which is equal to the sum of the lengths of
the shaded x and y axes. The latter bound can sometimes be greater than the former
(intuitively, when shaded regions are very “narrow”), so these bounds are combined in
our proof. In any exploration performed by Threshold1, all but one of the rectangles
have breadth 1 (in units M). Any rectangle with cost ≥ 2 covers area at least 1, and those
with cost < 2 contribute at least 1 to the semi-perimeter. This intuition is formalized to
prove Theorem 12.
A bad case for algorithm Threshold1 occurs when all explore rectangles (say there
are b of them) have length 1 in units M. Of the b ≥ 2 rectangles, b − 1 have cost
2, and the last rectangle has cost 1, giving a factor (2b − 1)/b. Such an example can occur
only for asymmetric confidence distributions.
Theorem 13. Let A_t represent the set of all valid algorithms in A for the Threshold
problem, and let I_t represent the set of all inputs. The following holds:
∀I ∈ I_t : Cost(Threshold2, I) < (3/2) · min_{A ∈ A_t} Cost(A, I)
In fact, the factor 3/2 occurs only for very small shaded regions. For larger shaded
regions, Threshold2 achieves better factors, as formalized next. Let s_I be the number
of M × M blocks that intersect the shaded region for input I.
Theorem 14. Let A_t represent the set of all valid algorithms in A for the Threshold
problem, and let I_t represent the set of all inputs. The following holds:
∀I ∈ I_t : Cost(Threshold2, I) < ((s_I + p − c)/(s_I − c)) · min_{A ∈ A_t} Cost(A, I)
where constants p and c are such that p ≥ c and s_I ≥ p·(p+1)/2.
The value s_I is a measure of the size of the region to be explored. The factor
decreases as s_I increases, tending to 1 as s_I tends to ∞. The intuition for the guarantee
of Threshold1 also applies to both guarantees for Threshold2. Additionally, because
of the greedy choice of outer in Threshold2, the ith-from-last rectangle explored by
Threshold2 always has length at least i − 1. This intuition is formalized to prove
Theorems 13 and 14.
Now let us consider the Top-k problem. An input I satisfies the distinctness
property if confidences of all result tuples in R ⇥ S are distinct. For Top-k, we
restrict attention to inputs that satisfy this property. No valid algorithm can have any
constant-factor guarantee corresponding to our guarantees in the general case (i.e.,
without assuming distinctness). Consider, for example, inputs where all confidences are
1, k = 1, and only one arbitrarily placed pair of tuples satisfies the join condition.
Recall that algorithm Top-k has a parameter L, discussed in Section 6.3.6. We
prove the following two theorems for L = M .
Theorem 15. Let A_k represent the set of all valid algorithms in A for the Top-k
problem, and let I_k represent the set of all inputs that satisfy the distinctness property.
The following holds:
∀I ∈ I_k : Cost(Top-k, I) < 3 · min_{A ∈ A_k} Cost(A, I)
For an input I to the Top-k problem, consider the part of the cross-product where
f evaluates to at least the confidence α of the kth result tuple, i.e., our usual shaded
region with τ = α. In a manner similar to Threshold2, the factor 3 occurs only for
small shaded regions, and decreases for larger regions, as formalized next. Let s_I be
the number of M × M blocks that intersect the shaded region.
Theorem 16. Let A_k represent the set of all valid algorithms in A for the Top-k
problem, and let I_k represent the set of all inputs that satisfy the distinctness property.
The following holds:
∀I ∈ I_k : Cost(Top-k, I) < ((2·s_I − 1)/(s_I − c)) · min_{A ∈ A_k} Cost(A, I)
where constant c is such that s_I ≥ c·(c+1)/2.
Theorem 17. Let A_s represent the set of all valid algorithms in A_M for the Sorted-
Threshold problem, and let I_s represent the set of all inputs. The following holds:
∀I ∈ I_s : Cost(Sorted-Threshold, I) < 2 · min_{A ∈ A_s} Cost(A, I)
The intuition behind the proof is similar to that of Theorems 12–16. Any valid algorithm
in A_M needs to cover the part of the cross-product that can contribute results with
confidences over the threshold τ. Each M × M square with cost ≥ 2 contributes area 1,
and the others contribute their cost to the semi-perimeter.
6.5 Experiments
To evaluate the performance characteristics and trade-offs in our algorithms, we
conducted several experiments on a large synthetic data set with a variety of confidence
distributions. The main objectives and results of our experiments are summarized as
follows.
• Our algorithms have theoretical efficiency guarantees ranging from 1.5 to 3 times a
lower bound in the general case, as shown in Section 6.4. We wanted to determine
how close our algorithms are to the theoretical lower bound in practice. One
interesting outcome is that in practice, Threshold1 and Threshold2 are similarly
close to the lower bound, except when the input confidence distributions are
asymmetric, in which case Threshold2 is much closer than Threshold1.
• Our efficiency guarantees hold for all distributions of confidence values on the
input relations. We wanted to see how different confidence distributions affect
the performance of our algorithms. In general, we do not see a dramatic change
with different distributions, except as they affect the actual result.
• Each data set has |R| = |S| = N where N is either one million or ten million
tuples.
• The join between R and S is one-to-one in each data set. The joining pairs are
independent of their relative confidence values, i.e., of their relative positions in
sorted-by-confidence order.
[Figure 6.7: Confidence Distributions. Confidence (0–1) versus tuple position (×10^6) for Distributions 1 through 6.]
6.5.2 Threshold
For the Threshold problem, we compare our two algorithms, Threshold1 and
Threshold2, against a lower-bound cost, as discussed in Section 6.4. We show four graphs.
Figures 6.8 and 6.9 consider “symmetric” distributions in the two relations: we
use Distribution 6 for both R and S. Figures 6.10 and 6.11 consider “asymmetric”
[Figure 6.8: Cost (×10^5) versus threshold τ, symmetric distributions.]
6.5.3 Top-k
For the Top-k problem we have four graphs:
• Figures 6.12 and 6.13 show how cost varies with k. (Figure 6.12 is for lower
values of k, while Figure 6.13 “continues” Figure 6.12 with much higher values
to confirm the trend.) We ran our Top-k algorithm to obtain its cost, noting
[Figure 6.9: Effect of M on Threshold, symmetric distributions. Cost versus memory M (×10^3).]
the confidence of the kth result tuple. Using this confidence as threshold τ,
we plotted the cost of Threshold2 and the lower bound (as in the Threshold
experiments; Section 6.5.2). Note that the gap in performance between Top-k
and the other two algorithms is expected since Top-k cannot know τ in advance.
Also note that the cost of Top-k is significantly lower than the relation sizes.
• Figure 6.14 shows how the cost of Top-k is affected by memory size M. Again,
we plot Threshold2 and the lower bound using the τ from the execution of Top-k.
[Figure 6.10: Cost versus threshold τ, asymmetric distributions.]
Before looking at our results for Sorted, we consider the effect of confidence
distributions on Threshold and Top-k. (We will separately consider the issue for
Sorted.) Figure 6.16 plots the cost of Threshold2 and Top-k, along with the lower
bound, for each of the six distributions shown in Figure 6.7 on both relations. (As in
the Top-k graphs, Threshold2 and the lower bound use the τ from Top-k.) We see
that the distribution has little effect on cost.
6.5.4 Sorted
Our Sorted experiments all use the Sorted-Threshold algorithm with threshold < 0, i.e.,
they deliver a full sorted result. Our first conclusion, shown clearly in Figure 6.17, is
that the cost of Sorted is proportional to the number of results emitted. The two plots
correspond to two different distributions, and behave nearly identically. Figures 6.18
and 6.19 consider different settings for parameter L, seeing how they affect cost and
memory use, respectively, as tuples are emitted. Comparing the two graphs we see the
trade-off clearly: a lower L incurs higher cost but requires less memory. In Figure 6.19,
[Figure 6.11: Effect of M on Threshold, asymmetric distributions. Cost (×10^5) versus memory M (×10^3).]
the y-axis plots the largest buffer used so far in the execution. The horizontal steps
correspond to the points at which a batch of result tuples is emitted.
[Figure 6.12: Cost versus k (×100).]
[Figure 6.13: Cost versus k (×1000).]
[Figure 6.14: Effect of M on Top-k. Cost versus memory M (×10^3).]
[Figure 6.15: Effect of L on Top-k. Cost (×10^5) versus parameter L (×10^3).]
6.8 Proofs
We first prove three lemmas, which are then used to prove the six theorems from the
body of the chapter. For an input I, consider a shaded region SI corresponding to
points that can produce tuples with confidence greater than a threshold τ. For Top-k,
τ is the confidence of the kth result tuple.
Lemma 9. Any valid algorithm A ∈ A must evaluate the join condition at each point
in SI for input I.
Proof. Suppose the join condition was not evaluated at point p ∈ SI by algorithm
A. Consider input I′ that differs from I only in the outcome of the join condition at
point p. Algorithm A cannot distinguish input I from I′, and hence returns the same
result. But I and I′ have different correct results; hence algorithm A cannot be valid.
Recall that M represents the memory available to load tuples from the input
relations.

[Figure 6.16: Cost of Threshold2, Top-k, and the lower bound (×10^5) for each Distribution d.]

Let sI be the number of M × M blocks that intersect region SI corresponding
to input I. For a block b, let Xb represent the fraction of its lower edge that is within
the shaded region. Correspondingly, let Yb represent the fraction of the left edge that
is within the shaded region. Those blocks that have no block above or to the right
intersecting the shaded region are referred to as corner blocks – let there be cI such
blocks. We consider cost in units of M .
Lemma 10. Let algorithm A ∈ A evaluate the join condition at all points in region
SI. The following holds:
Cost(A, I) ≥ sI − cI + Σ_{corner blocks b} (Xb + Yb)

where sI ≥ cI · (cI + 1)/2.
Proof. Any algorithm’s execution can be viewed as a sequence of load and scan
operations, which can be visualized like the explore steps in our descriptions. Note
that the shapes formed in the visualization are not restricted to rectangles, but they
are restricted to have width at most M. The cost of a shape is its semi-perimeter.

[Figure 6.17: Emit Rate for Sorted-Threshold. Cost versus result tuples emitted (×10^5).]
We allocate the cost of shapes to the blocks they intersect in order to provide a lower
bound for the cost of the algorithm. We consider the right and upper edges of a shape
and allocate them to the block they lie in.
Formally, the restriction on the shapes is that for any interior point p, there is a
right or upper edge within distance 1 of p. It is easy to observe that any block b with
max(Xb, Yb) = 1 has cost at least 1. Also, for the corner blocks the cost is at least
Xb + Yb.
The relationship between sI and cI is obtained by observing that no row or column
has more than 1 corner block.
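The counting argument can be sketched concretely. The following is an illustrative sketch (not from the dissertation; the helper name is ours): the shaded region is represented by the number of M × M row blocks it intersects in each block column, non-increasing left to right, as when both relations are processed in sorted-by-confidence order.

```python
def block_counts(heights):
    """Given a non-increasing staircase (heights[i] = number of M x M row
    blocks the shaded region intersects in block column i), return
    (s_I, c_I): total intersected blocks and corner blocks.

    A corner block has no intersecting block above it or to its right, so
    column i contributes one exactly when it is nonempty and is either
    the last column or strictly taller than the column to its right.
    """
    assert all(a >= b for a, b in zip(heights, heights[1:]))
    s = sum(heights)
    c = sum(1 for i, h in enumerate(heights)
            if h > 0 and (i == len(heights) - 1 or heights[i + 1] < h))
    return s, c

# Corner blocks have strictly decreasing heights left to right, so c_I
# corner blocks force at least 1 + 2 + ... + c_I blocks in total:
for staircase in ([3, 3, 2], [5, 4, 2, 1], [1], [4, 4, 4]):
    s, c = block_counts(staircase)
    assert s >= c * (c + 1) // 2
```

The final loop checks the lemma's relationship sI ≥ cI·(cI + 1)/2 on a few staircases.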
Let HI and WI be the height and width, respectively, of the parts of the two relations
that the shaded region SI covers, in units of M.
Lemma 11. Let algorithm A ∈ A evaluate the join condition at all points in region
SI. Then Cost(A, I) ≥ HI + WI.

[Figure 6.18: Effect of L on Cost of Sorted-Threshold. Cost (×10^6) versus result tuples emitted (×10^5).]
Proof. Any algorithm that evaluates the join condition at all points in SI must read
the part of each relation that is covered by SI. The cost of doing this is HI + WI.
Theorem 12. Let At represent the set of all valid algorithms in A for the Threshold
problem, and let It represent the set of all inputs. The following holds:
Proof. Consider the cost of the Threshold1 algorithm. The scan cost for all blocks
except the rightmost in each row is 1. For every row except the topmost, the load
cost is 1. Hence the following holds:
Cost(Threshold1, I) ≤ sI − 1 + Y_lefttop + Σ_{b ∈ rightmost} Xb    (6.1)
[Figure 6.19: Effect of L on Memory Use of Sorted-Threshold. Largest buffer used versus result tuples emitted (×10^5).]
Lemma 9, together with arithmetic manipulation of (6.1) using Lemmas 10 and 11,
proves the theorem.
Theorem 14. Let At represent the set of all valid algorithms in A for the Threshold
problem, and let It represent the set of all inputs. The following holds:
∀ I ∈ It :  Cost(Threshold2, I) < ((sI + p − c)/(sI − c)) · min_{A ∈ At} Cost(A, I)

where the constants p and c are such that p ≥ c and sI ≥ p · (p + 1)/2.
Proof. Consider the cost of the Threshold2 algorithm. Let it make pI greedy choices.
The load cost is at most pI . For each non-corner block the scan cost is at most 1,
while for a corner block b, it is at most max(Xb , Yb ). The following holds:
Cost(Threshold2, I) ≤ pI + sI − cI + Σ_{corner blocks b} max(Xb, Yb)    (6.2)

Clearly pI ≥ cI, since a scan has to terminate at each corner block. Also, the
relationship between sI and pI is obtained by observing that the length of the
i-th-from-last scan is at least i. Using Lemmas 9 and 10 with (6.2)
proves the theorem.
Theorem 13. Let At represent the set of all valid algorithms in A for the Threshold
problem, and let It represent the set of all inputs. The following holds:
∀ I ∈ It :  Cost(Threshold2, I) < (3/2) · min_{A ∈ At} Cost(A, I)

Proof. For sI ≥ 15, Theorem 14 implies the result. For smaller sI, case analysis proves
the theorem using Lemmas 9, 10, and 11.
Theorem 16. Let Ak represent the set of all valid algorithms in A for the Top-k
problem, and let Ik represent the set of all inputs that satisfy the distinctness property.
The following holds:
∀ I ∈ Ik :  Cost(Top-k, I) < ((2 · sI − 1)/(sI − c)) · min_{A ∈ Ak} Cost(A, I)

where the constant c is such that sI ≥ c · (c + 1)/2.
Proof. Consider the cost of the Top-k algorithm. It has cost at most 2 for each block
except the last. For the last block b it has cost at most 1 + max(Xb , Yb ). The following
holds:
Cost(Top-k, I) ≤ 2 · sI − 1 + max_{corner blocks b} max(Xb, Yb)    (6.3)
Theorem 15. Let Ak represent the set of all valid algorithms in A for the Top-k
problem, and let Ik represent the set of all inputs that satisfy the distinctness property.
The following holds:

∀ I ∈ Ik :  Cost(Top-k, I) < 3 · min_{A ∈ Ak} Cost(A, I)

Proof. For sI ≥ 12, Theorem 16 implies the result. For smaller sI, case analysis proves
the theorem using Lemmas 9, 10, and 11.
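As a quick numerical cross-check of how Theorems 15 and 16 fit together (a sketch, not part of the dissertation's proofs; the helper name is ours): taking the largest c with sI ≥ c·(c + 1)/2 gives the weakest Theorem 16 factor, and that factor drops below the constant 3 exactly once sI ≥ 12, matching the case split in the proof.

```python
def topk_factor(s):
    """Weakest Theorem-16 factor (2*s - 1)/(s - c), using the largest
    integer c satisfying s >= c*(c+1)/2."""
    c = 0
    while (c + 1) * (c + 2) // 2 <= s:
        c += 1
    return (2 * s - 1) / (s - c)

# At s_I = 11 the factor is exactly 3; from s_I = 12 on it stays below 3.
assert topk_factor(11) == 3.0
assert all(topk_factor(s) < 3 for s in range(12, 2000))
```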
Theorem 17. Let As represent the set of all valid algorithms in AM for the Sorted-
Threshold problem, and let Is represent the set of all inputs. The following holds:
Proof. Consider the cost of the Sorted-Threshold algorithm. It has cost at most Xb + Yb
for each block b. This gives us the following:

Cost(Sorted-Threshold, I) ≤ 2 · (sI − cI) + Σ_{b ∈ corner blocks} (Xb + Yb)    (6.4)
Chapter 7

Native Query Optimization in Trio

The Trio system is developed on top of a conventional DBMS [ABS+06]. Uncertain
data with lineage is encoded in relational tables, and Trio queries are translated to
SQL queries on the encoding. Such a layered approach reaps significant benefits in
terms of architectural simplicity, and the ability to use an off-the-shelf data storage
and query processing engine. However, it fails to recognize and hence capitalize on
the regular structure that the encoded Trio relations possess, and thus misses out
on the benefits of specialized query optimization. This chapter describes first steps
towards building a native query optimizer for Trio. The work presented in this chapter
appeared in [SANW08].
7.1 Introduction
Recall from Section 2.2 that the basic construct for uncertainty in Trio’s ULDB data
model is alternatives. Alternatives in a tuple specify a nonempty finite set of possible
values for the tuple. For example:
contains a tuple with two alternatives giving the two possible values for the tuple. The
Trio system is layered on top of a conventional relational DBMS [MTK+ 07]. ULDB
CHAPTER 7. NATIVE QUERY OPTIMIZATION IN TRIO 122
• We motivate the need for considering a new interesting order in query plans,
and design an operator that ensures this order. (Section 7.4)
• We enumerate the statistics necessary for choosing the optimal query plan from
the set of all query plans combining conventional and new operators described
above. We present histograms that enable estimating these statistics efficiently
and accurately. (Section 7.5)
• We discuss a variety of interesting and challenging problems our work opens up,
which can form the basis for further research in the area. (Section 7.6)
Section 7.2 introduces our relational encoding for the ULDB data model, and we
conclude with a discussion of related work in Section 7.7.
7.2.1 Queries
Queries over ULDB relations are translated to queries over the encoded relations.
To obtain xid’s on the resulting relation, the alternatives of the result need to be
grouped based which x-tuple they are a part of. Therefore, all tuples in the result of
the translated query are grouped by xids of the input relations. For example, if we
perform a join of relations R and S, the translated query over Renc and Senc includes
the clause group-by Renc .xid,Senc .xid. Details of this translation can be found
in [MTK+ 07, BSH+ 08].
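The grouping step can be sketched as follows (an illustrative sketch with hypothetical encodings, not Trio's actual schema): alternatives carry their x-tuple's xid, and the result alternatives of a join are grouped by the pair of input xids, mirroring the group-by Renc.xid, Senc.xid clause.

```python
from collections import defaultdict

# Hypothetical encoded relations: each alternative is (xid, value).
r_enc = [(1, "a"), (1, "b"), (2, "c")]
s_enc = [(10, "b"), (10, "c"), (11, "c")]

def join_and_group(r, s):
    """Join alternatives on equal values, then group the result
    alternatives by the pair of input xids; each group is one result
    x-tuple, as the translated query's group-by produces."""
    groups = defaultdict(list)
    for r_xid, r_val in r:
        for s_xid, s_val in s:
            if r_val == s_val:            # stand-in join condition
                groups[(r_xid, s_xid)].append((r_val, s_val))
    return dict(groups)

result = join_and_group(r_enc, s_enc)
# Three result x-tuples, keyed (1,10), (2,10), and (2,11).
```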
• Index on A
An index on A may be used to retrieve only the tuples that satisfy the predicate
A ≥ 5. This index lets us efficiently retrieve all alternatives that satisfy the
predicate, but now they need to be grouped to form x-tuples. A sort on xid is
required to group the result alternatives into x-tuples. A query plan like this
can be efficient for highly selective queries, i.e., the result contains very few
alternatives, making the grouping step inexpensive.
• Index on xid
An index on xid lets us retrieve all alternatives in an x-tuple together. This
allows us to avoid the sort since result tuples are generated already grouped by
xid. This index may be useful if the predicate is not very selective, especially if
the data is stored clustered by xid.
• Index on (xid,A)
If x-tuples are very wide, i.e., contain a large number of alternatives, we may
be able to use an index on (xid, A), to only retrieve alternatives that satisfy
the predicate. This also avoids the sort at the end, and may yield efficient
executions.
• Index on (A,xid)
For queries that use an equality predicate on A, it might be useful to use an
index that returns all alternatives satisfying the predicate grouped by xid. This
index allows us to avoid the sort which may be expensive for large results. But
equality predicates seldom have large results. We would ideally like an index
that also works for range queries, and can also avoid the sort.
The indexes discussed above help either in avoiding the sort required for the group-
by on xid, or in pruning down the amount of data accessed by evaluating the predicate
before retrieving the tuples. An index on (A,xid) accomplishes both objectives for
equality predicates on A. We now describe a new index that generalizes this index for
efficient range scans over relations stored clustered by xid.
• Index on Ax
An index on Ax refers to an index that indexes x-tuples instead of tuples. Each
Let R = {R1 , · · · , Rn } be the relations in the from clause of a query. The translated
query contains a group by on the xid attribute of all relations in R. The subtree
rooted at a node N in the query tree has joined relations in RN ⊆ R. Suppose
the tuples flowing out of the node N are guaranteed to be grouped by xid for a set
BN ⊆ RN. The node N is said to give the XGroup interesting order on the set of
relations BN . We need the result to be XGrouped on R.
An indexed access on relation R using the index on Ax for some attribute A
guarantees an XGroup on {R}. Other access methods that use an index on xid or
(xid,A), or that scan relations clustered by xid also give an XGroup on {R}. Selection
and projection operators preserve the interesting order, i.e., if the input is XGrouped
on B, the result from the operator is also XGrouped on B.
Many join operators also preserve XGroup of the inputs. For a join operator, let
the inner relation be XGrouped on Bi and let the outer be XGrouped on Bo . It is easy
to see that operators like nested-loop join (or nested-block joins) and nested-loop index
join preserve the XGroup on both the inner and the outer, i.e., the result is XGrouped
on Bi ∪ Bo. Some other join operators may only preserve the XGroup of one of the
inputs. For instance a hash-join preserves the XGroup of the outer, i.e., returns tuples
XGrouped on Bo . Many operators in traditional database query processors naturally
preserve XGroup, making it efficient to maintain and utilize.
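To illustrate why a hash join preserves the outer's XGroup (a sketch with hypothetical tuple shapes, not Trio's implementation): the inner is hashed, the outer is streamed, and output order follows the outer, so tuples sharing an outer xid remain adjacent.

```python
from collections import defaultdict

def hash_join(outer, inner):
    """Build a hash table on the inner's join key, then stream the outer;
    output order follows the outer, so grouping by outer xid survives."""
    table = defaultdict(list)
    for i_xid, i_key in inner:
        table[i_key].append(i_xid)
    for o_xid, o_key in outer:
        for i_xid in table[o_key]:
            yield (o_xid, i_xid)

outer = [(1, "x"), (1, "y"), (2, "x")]   # XGrouped: xid 1's alternatives adjacent
inner = [(8, "y"), (7, "x"), (9, "x")]   # arbitrary order
out = list(hash_join(outer, inner))

# Outer xids appear in contiguous runs (1, 1, 1, then 2, 2):
runs = [x for i, (x, _) in enumerate(out) if i == 0 or out[i - 1][0] != x]
assert len(runs) == len(set(runs))
```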
XGroup can often help us avoid an expensive group-by (usually implemented
through a sort). If all relations in a query are accessed using access methods that
guarantee the XGroup on the relation, and all operators preserve the interesting order,
then no group by needs to be used in the query plan. Our selection example earlier
was the simplest case that illustrates how the group by can be eliminated. Requiring
that every access method and operator in a query plan preserve the XGroup property
results in a very small set of potential plans that can exploit XGroup. We would like
to allow a broader class of query plans in which only a subset of the access methods
and operators preserve the XGroup property; such plans would still provide benefits
by making the group by cheaper.
We propose a new unary operator XGB: Let the input to XGB be XGrouped on
Bi. XGB returns an output XGrouped on Bo ⊇ Bi. XGB essentially breaks up
each group in the input into smaller groups, and hence it only needs to look at one
group in the input at a time. If the groups in the input are not large, this can be very
cheap, and it operates in a non-blocking fashion. Hence the use of the XGB operator
in conjunction with an access method which provides the XGroup interesting order,
allows us to incrementally do the group by. The presence of even one access method
that guarantees the XGroup interesting order can drastically reduce the cost of the
group by. Roughly, if a relation with x tuples is XGrouped in a query with y tuples
in the result, using XGB and XGroup requires at most sorting x groups of y/x
tuples, in contrast to a sort on y tuples.
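The incremental regrouping can be sketched as follows (an illustrative sketch; names and tuple shapes are ours): since the input arrives in runs sharing the coarse grouping key, each run can be regrouped by the finer key on its own, buffering one run at a time.

```python
from itertools import groupby

def xgb(stream, coarse_key, fine_key):
    """XGB sketch: the input is assumed grouped (adjacent runs) by
    coarse_key; regroup each run by fine_key independently. Sorting x
    runs of roughly y/x tuples replaces one sort of all y tuples."""
    for _, run in groupby(stream, key=coarse_key):
        for t in sorted(run, key=fine_key):   # buffer only this run
            yield t

# Hypothetical join results (r_xid, s_xid), already XGrouped on R.
tuples = [(1, 9), (1, 7), (1, 9), (2, 5), (2, 4)]
refined = list(xgb(tuples, coarse_key=lambda t: t[0], fine_key=lambda t: t[1]))
# refined is now grouped by the pair: (1,7), (1,9), (1,9), (2,4), (2,5)
```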
• Index on A
To estimate the cost of accessing the relation through an index on A, the optimizer
needs to determine the number of alternatives that satisfy the predicate.
• Index on xid
To estimate the number of x-tuples retrieved using an index on xid, the optimizer
needs to know the total number of x-tuples in the relation R.
• Index on (xid,A)
As discussed in Section 7.3, this index is useful only if x-tuples contain a large
number of alternatives. Hence, to decide whether or not to use this index, an
estimate indicating the average number of alternatives per x-tuple (width of
CHAPTER 7. NATIVE QUERY OPTIMIZATION IN TRIO 129
x-tuple) is useful. The average width of x-tuples satisfying the predicate might
differ from the average width for the entire relation, and can yield more accurate
cost estimates.
• Index on (A,xid)
The number of alternatives that satisfy the predicate determines the cost of
accessing the relation using an index on (A,xid). In addition, the number of x-
tuples returned is determined by estimating the number of x-tuples that contain
at least one alternative that satisfies the predicate (which should be an equality
predicate).
• Index on Ax
The number of random accesses made in accessing the relation R using an index on
Ax is determined by the number of x-tuples that contain at least one alternative
satisfying the predicate.
Next we consider the statistics that cannot be maintained exactly (because of the
data size), and hence need to be estimated.
Clearly, the above cardinalities translate to counting the number of tuples in Renc
satisfying A = v and x ≤ A ≤ y, respectively. These cardinalities over the conven-
tional relational table Renc can be estimated using well-known sampling or histogram
techniques. For example, in Trio we can build a histogram over Renc that counts the
number of tuples in it: we have various histogram buckets corresponding to the range
of all possible values in A. A histogram bucket with bucket boundary [p, q) maintains
the count of the number of tuples in Renc that satisfy p ≤ A < q. Now, we can use
the bucket frequencies to estimate the A-selectivity.
Example 4. For relation R with integer attribute A, suppose the histogram on A has
bucket intervals 0, 5, 10, and so on; i.e., bucket B0,5 stores the number of tuples in
Renc with 0 ≤ A < 5, B5,10 for 5 ≤ A < 10, etc. To compute the number of tuples in
Renc satisfying 3 ≤ A ≤ 8, we evaluate (2 · B0,5)/5 + (4 · B5,10)/5.
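Example 4's estimate follows the usual uniformity assumption within buckets; a minimal sketch (the helper name and bucket layout are ours, not Trio's):

```python
def estimate_count(buckets, lo, hi):
    """Estimate |{t : lo <= t.A <= hi}| for integer attribute A.

    buckets maps (p, q) -> number of tuples with p <= A < q; integer A
    values are assumed uniformly distributed within each bucket.
    """
    total = 0.0
    for (p, q), count in buckets.items():
        shared = max(0, min(q, hi + 1) - max(p, lo))  # integer values in common
        total += count * shared / (q - p)
    return total

# Example 4's query 3 <= A <= 8 takes 2/5 of B[0,5) and 4/5 of B[5,10):
buckets = {(0, 5): 10, (5, 10): 20, (10, 15): 30}
assert estimate_count(buckets, 3, 8) == 2 / 5 * 10 + 4 / 5 * 20
```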
(1) at least k alternatives with A value in the range [p, q), or (2) alternatives with at
least k distinct A values in the range [p, q). Recall that we use (1) above if we assume
the distribution of all alternatives is independent, and we use (2) if we assume the
distribution of distinct A values is independent.
For A-Average we can use a histogram similar to the one for X-Selectivity, but by
eliminating the size dimension. We can then store the total count of alternatives from
the x-tuples in each bucket, instead of storing the number of x-tuples.
• Construction and Maintenance: The first question that arises about our
techniques presented in Section 7.5 is how we construct the histograms that
enable estimating the various statistics. Construction of a histogram entails
deciding the bucket boundaries for all the dimensions followed by populating
the counts in each bucket. Recall that our histograms for ULDB relations
translate to statistics on the encoded conventional relations. Hence we can
leverage previously proposed techniques for determining the bucket boundaries
and constructing the histogram through sampling [PSC84, CMN98]. Alternatively,
we could scan the entire relations a priori and compute exact frequencies for all
buckets.
Specifically, they do not study new operators, indexing techniques, and statistics
estimation, which are the subject of this chapter.
Two notable pieces of work studying query processing in probabilistic databases
are [DS04b, DTW08c], whose focus is on probability computation. Reference [DS04b]
characterizes when a query can be computed using a safe plan, which can be very
efficiently executed. Reference [DTW08c] studies how lineage in Trio can aid in making
confidence computation more efficient. Importantly, it shows that lineage allows us to
decouple data and confidence computation in Trio, enabling us to use any query plan
for data computation. It is this decoupling that allows the optimizer to consider any
query plan, without worrying about confidence computation.
The work closest to ours is that of [SMP+ 07], which proposes new techniques for
indexing uncertain data with probabilities. The indexes proposed in [SMP+07] are
different from ours and focus on the probabilities. Their indexes are very useful for
queries with thresholds on probabilities. However, as described above, Trio adopts
lineage-based probability computation, and hence we focus only on data computation
in this chapter. We have proposed new indexing mechanisms specific to the ULDB data
model and its relational encoding, which are more useful for Trio query processing.
Finally, [SMP+ 07] does not consider the problem of estimating various kinds of
statistics on uncertain data.
There is, of course, a huge body of work studying every aspect of query
optimization for conventional databases. Since there have been many papers written
on every topic, including relational operations, indexing, histograms, and query plan
selection, we do not review this literature here; instead we refer the reader to any
standard database textbook such as [GMWU02].
Chapter 8
Summary
This thesis makes contributions to several aspects of managing uncertain data, includ-
ing: (1) generalizing the types of uncertain data that can be managed; (2) providing
techniques for the integration of uncertain data; and (3) developing efficient algorithms
for processing uncertain data. We summarize the main contributions of the thesis:
[ABS+ 06] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth,
Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A System
for Data, Uncertainty, and Lineage (Demo). In Proceedings of VLDB,
2006.
BIBLIOGRAPHY 140
[CBF04] Sunil Choenni, Henk Ernst Blok, and Maarten Fokkinga. Extending
the relational model with uncertainty and ignorance. Internal Report,
University of Twente, 2004.
[CBL06] Sunil Choenni, Henk Ernst Blok, and Erik Leertouwer. Handling uncer-
tainty and ignorance in databases: A rule to combine dependent data.
In DASFAA, 2006.
[CSP+ 06] Reynold Cheng, Sarvjeet Singh, Sunil Prabhakar, Rahul Shah, Jef-
frey Scott Vitter, and Yuni Xia. Efficient join processing over uncertain
data. In ACM CIKM, 2006.
[DGS09] Amol Deshpande, Lise Getoor, and Prithviraj Sen. Graphical models
for uncertain data. In Managing and Mining Uncertain Data. Springer,
2009.
[DS05] N. Dalvi and D. Suciu. Answering queries from statistics and probabilistic
views. In Proc. of VLDB, 2005.
[Haa07] L. Haas. Beauty and the beast: The theory and practice of information
integration. In ICDT, 2007.
[KS91] Robert Kennes and Philippe Smets. Fast algorithms for Dempster-Shafer
theory. In Uncertainty in Knowledge Bases. Springer, 1991.
[NCS+ 01] A. Natsev, Y-C. Chang, J. R. Smith, C-S Li, and J. S. Vitter. Supporting
incremental join queries on ranked inputs. In VLDB, 2001.
[SANW08] Anish Das Sarma, Parag Agrawal, Shubha Nabar, and Jennifer Widom.
Towards Special-Purpose Indexes and Statistics for Uncertain Data. In
Proceedings of the Workshop on Management of Uncertain Data (MUD),
2008.
[YL08] Ronald R. Yager and Liping Liu. Classic Works of the Dempster-Shafer
Theory of Belief Functions. Springer, 2008.