Lawrence Berkeley National Laboratory: Recent Work
Title
Simple Random Sampling from Relational Databases
Permalink
https://escholarship.org/uc/item/9704f3dr
Authors
Olken, Frank
Rotem, D.
Publication Date
1986-06-01
Computing Division
June 1986
Prepared for the U.S. Department of Energy under Contract DE-AC03-76SF00098
DISCLAIMER
This document was prepared as an account of work sponsored by the United States
Government. While this document is believed to contain correct information, neither the
United States Government nor any agency thereof, nor the Regents of the University of
California, nor any of their employees, makes any warranty, express or implied, or
assumes any legal responsibility for the accuracy, completeness, or usefulness of any
information, apparatus, product, or process disclosed, or represents that its use would not
infringe privately owned rights. Reference herein to any specific commercial product,
process, or service by its trade name, trademark, manufacturer, or otherwise, does not
necessarily constitute or imply its endorsement, recommendation, or favoring by the
United States Government or any agency thereof, or the Regents of the University of
California. The views and opinions of authors expressed herein do not necessarily state or
reflect those of the United States Government or any agency thereof or the Regents of the
University of California.
LBL-20707
condensed
June 1986

Simple Random Sampling from Relational Databases*

Frank Olken
Doron Rotem
…statistical methodology. Applications include scientific investigations such as high energy particle physics experiments, quality control, and policy analyses. For example, one might sample a join of welfare recipient records with tax returns or social security records in order to estimate welfare fraud rates.

1.2 Why put sampling in the DBMS?

Given that one wants to perform sampling, is it worthwhile to put the sampling operator into the DBMS?

We believe that one should put sampling operators into the DBMS for reasons of efficiency. By embedding the sampling within the query evaluation, we can reduce the amount of data which must be retrieved in order to answer sampling queries, and can exploit indices created by the DBMS.

Sampling can be used in the DBMS to provide cheap estimates of the answers of aggregate queries [Mor80]. Sampling may also be used to estimate database parameters used by the query optimizer to choose query evaluation plans [Wil84].

1.3 Organization of Paper

The condensed paper is organized into six sections. In Section 2 we explain some of the different types of sampling. In Section 3 we discuss the various efficiency metrics used to evaluate competing algorithms. In Section 4 we review basic sampling techniques from a single flat file which we use in this paper. In Section 5 we discuss sampling from individual relational operators. Finally, in Section 6 we state our conclusions.

In the full paper [OR86], we also discuss the use of auxiliary "indices" to improve random access sampling of hash files, grid files, and B+-tree files.

2 Types of Sampling

There are a variety of types of sampling which may be performed. The various types of sampling can be classified according to:

1. the manner in which the sample size is determined,

2. whether the sample is drawn with or without replacement,

3. whether the access pattern is random or sequential,

4. whether or not the size of the population from which the sample is drawn is known,

5. whether or not each record has a uniform inclusion probability.

The sample inclusion probabilities for individual records may be uniform (an unweighted or simple random sample (SRS)) or they may be weighted according to some attribute of the record.

In this paper we will deal primarily with fixed-size simple random samples. In Section 4.1 we show how to convert between simple random samples with and without replacement. We include a short discussion of weighted sampling of an existing file, because it is used to implement simple random sampling of some relational operators. We deal with sampling from both known and unknown population sizes.

3 Efficiency Measures

We shall assume that the database resides on disk and thus measure efficiency in terms of disk blocks read.

Converting the number of records read into the number of disk blocks read is fairly well understood if the data is uniformly distributed (see [Yao77]). Christodoulakis [Chr84] has analyzed the case where the data is not uniformly distributed. For the sake of brevity and clarity we will typically assume that the fraction of records sampled from a relation is sufficiently small that each record sampled results in a disk read. Hence we will approximate the number of disk blocks read by the number of records read. Obviously this is a gross simplification, but it can easily be corrected.

4 Basic Techniques

In this section we develop basic techniques for sampling from single files which either already exist or are being generated in their entirety. In this section we discuss conversion between simple random samples with and without replacement, weighted random sampling, and previous work on sampling from flat files.

In the full paper [OR86], we develop techniques for sampling from some types of files which are common in DBMSs, such as files with variable numbers of records per block, hashed or grid files, and B+-tree files.

4.1 Converting Samples

In this section we discuss how one converts between simple random samples with replacement and those without replacement.
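As a concrete sketch of the conversion in the with-to-without direction, one can repeatedly draw single records with replacement and discard duplicates until s distinct records are in hand. The Python below is illustrative only; `draw_one` stands in for whatever single-record sampling primitive the file supports (an assumption of the sketch, not part of the paper):

```python
import random

def srs_without_replacement(population, s, draw_one=random.choice):
    """Convert with-replacement draws into an SRSWOR of size s.

    Duplicates are detected with a hash set, so each draw costs O(1)
    expected time; total expected time and space are O(s) when s is
    small relative to the population size.
    """
    seen = set()
    sample = []
    while len(sample) < s:
        r = draw_one(population)
        if r not in seen:          # hash-table duplicate detection
            seen.add(r)
            sample.append(r)
    return sample
```

The incremental duplicate check is what lets sampling continue until the target sample size is reached, rather than deduplicating once at the end and falling short.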
Definition 1 Samples without replacement are those in which each element of the sampled population appears at most once.

For simple random sampling, a sample without replacement can be obtained from a sample with replacement by simply removing the duplicates. Of course, the sample size may thereby be reduced, so that additional elements of the population may have to be sampled.

The most efficient way to detect duplicates is usually to construct a hash table of the sample elements. This can be done in O(s) time and space. Duplicate detection can be performed incrementally as the sample is collected, so that the sampling may be continued until the target sample size is reached. See [EN82].

When simple random sampling sequentially from a file (or the output of a query) it may be simpler to sample without replacement. Suppose, however, that one wants a sample with replacement. Such a SRSWR is needed when sampling from joins, as will be seen later. The SRSWOR sample can be converted to SRSWR by synthetically generating the duplicates. Essentially we generate a simple random sample with replacement from an index set of integers and then we construct a random mapping between the original sample without replacement and our sample from the index set. It is easy to see that this will produce a simple random sample with replacement (SRSWR). It requires random access only to the SRSWOR, which will usually fit in main memory, unlike the original population. Assuming that we use a hash table to check for duplicates, the conversion can be done in O(s) time and space. For details see the full paper.

4.2 Weighted Random Sampling

We will show later that, in order to obtain simple random samples of some types of files or relational queries, it is often necessary to compute weighted random samples. Hence we include a brief treatment of how to calculate a weighted random sample from an existing file with fixed blocking. Two methods of weighted sampling, acceptance/rejection sampling and partial sum trees, are compared below.

4.2.1 Acceptance/Rejection Sampling

The basic tactic used in this paper is acceptance/rejection sampling. A brief explanation of this classic sampling technique is included here for those in the database community who may be unfamiliar with it.

Suppose that we wish to draw a weighted random sample of size 1 from a file of N records, denoted r_j, with inclusion probability for record r_j proportional to the weight w_j. The maximum of the w_j is denoted w_max.

We can do this by generating a uniformly distributed random integer, j, between 1 and N, and then accepting the sampled record r_j with probability p_j:

    p_j = w_j / w_max    (1)

The acceptance test is performed by generating another uniform random variate, u_j, between 0 and 1, and accepting r_j if u_j < p_j. If r_j is rejected, we repeat the process until some j is accepted.

The reason for dividing w_j by w_max is to assure that we have a proper probability (i.e., p_j ≤ 1). If we do not know w_max we can use instead a bound w̄ such that ∀j, w̄ > w_j. The number of iterations required to accept a record r_j is geometrically distributed with a mean of (E[p_j])^-1. Hence using w̄ in lieu of w_max results in a less efficient algorithm.

Acceptance/rejection sampling is well suited to sampling with ad hoc weights or when the weights are being frequently updated. Other methods, such as the partial sum tree method discussed below, require preprocessing the entire table of weights.

4.2.2 Partial Sum Trees

Wong and Easton [WE80] proposed to use binary partial sum trees to expedite weighted sampling. As above, consider the file of N records, in which each record r_i has inclusion probability w_i in a sample of size 1. Binary partial sum trees are simply binary trees with N leaves, each containing one record r_i and its weight w_i. Each internal node contains the sum of the weights of all the data nodes (i.e., leaves) in its subtree. Each record, r_i, can be thought to span an interval [Σ_{j=1}^{i-1} w_j, Σ_{j=1}^{i} w_j), of length w_i.

A sample of size 1 is obtained by generating a uniform random number, u, which ranges between 0 and W, where W = Σ_{i=1}^{N} w_i. The partial sum tree is then traversed from root to leaf to identify the record which spans the location u.

The height of the tree is O(log N), where N is the number of records. Hence the time to obtain a sample of size s is O(s log N). The tree can also be updated in time O(log N) should the record weights be modified, or if sampling without replacement is desired.

Partial sum trees can be constructed in the form of B-trees, in order to minimize disk accesses by increasing the tree fanout (and hence the radix of the log). Alternatively, a partial sum tree may be embedded into a B-tree index on some domain.

Partial sum tree sampling may well outperform acceptance/rejection sampling. Essentially, it is another index, specially suited to sampling. However, it is practical only when the weights are known beforehand. Like any other index, it increases the cost of updates.

However, we believe that updates will greatly outnumber sampling queries in most applications. For this reason, and for the sake of brevity, we will discuss only acceptance/rejection methods in this paper.

4.3 A Review of Sampling from Files

In Table 1 we list the major results on sampling from a single flat file (with fixed blocking), with citations to the relevant algorithms. We employ some of these techniques in our work on query sampling. Also see [Dev86].

5 Sampling from Relational Operators

In this section we show how to sample the output of individual relational operators such as selection, projection, intersection, union, difference, and join. These sampling techniques form the basic building blocks for sampling from more complex composite queries. The techniques entail a synthesis of the basic file sampling techniques and algorithms for implementing relational operators. We discuss only simple random sampling.

In order to facilitate the exposition, we treat the simpler relational operators first, leaving the most interesting results concerning joins for last.

Our cost measure is the number of disk pages read, denoted as D. Usually we will be interested in the expected number of disk pages read, E(D).

5.1 Notation

The sampling operator will be denoted as ψ. Sampling method and size will be denoted by subscripts. Except as noted, simple random sampling without replacement (SRSWOR) is the default sampling method, e.g., ψ_100(R) or ψ_SRSWOR,500(S). More complex sampling schemes will be described via the iteration operators WR(s, <expr>) and WOR(s, <expr>), which indicate that <expr> (a sampling expression) is to be repeatedly evaluated until a sample of size s is obtained (with or without replacement, respectively).

Definition 2 Two sampling schemes, A(R) and B(R), of relation R are said to be equivalent, denoted by A ⇔ B, if, for every possible instance r of relation R, they generate the same size samples, and the inclusion probability for each element of the population is the same in both schemes. Note that the samples are not necessarily identical.

Definition 3 MIX(a, <expr1>, <expr2>) denotes a random mixture of two sampling schemes. It indicates that we sample according to <expr1> with probability a, and with probability 1 − a we sample according to <expr2>.

MIX is used to implement sampling from unions.

Definition 4 ACCEPT(a, <expr>) indicates that we accept the sample element generated according to <expr> with probability a.

ACCEPT is used to implement sampling from projections and joins.

5.2 Selection

We denote the selection of records satisfying predicate pred from relation R by σ_pred(R). The number of records in relation R is n. The fraction of records of relation R which satisfies predicate pred is p_pred. Hence n·p_pred is the number of records in relation R which satisfy the selection predicate.

Selection is unique in that it correctly commutes with the sampling operator, i.e., selecting from a simple random sample generates a simple random sample of a selection.

Theorem 1

    ψ_s(σ_pred(R)) ⇔ WOR(s, σ_pred(ψ_1(R)))

Proof: For records which do not satisfy the predicate, the inclusion probability is obviously zero on both sides.

In the sampling scheme on the lefthand side the inclusion probability, p, for any record r which satisfies the selection predicate is:

    p = s / (n·p_pred)    (2)

i.e., all such records have equal inclusion probabilities.

On the righthand side, we repeatedly sample one record from R, evaluate the selection predicate, and then retain it if it satisfies the selection predicate
and is not a duplicate. This continues until we have a sample of size s.

Since the selection operator does not alter the inclusion probabilities of those records which satisfy the selection predicate, they remain equi-probable. From the definition of the WOR iterator, we are assured that the sample size is s distinct records. □

Table 1: Sampling from a single flat file.

    Type of sampling                                   Citation   Expected Disk Accesses
    Simple Random Sampling with replacement                       O(s)
    Simple Random Sampling with replacement,           [OR86]     O(s(b_max/b_avg))
      variable blocking
    Simple Random Sampling without replacement         [EN82]     O(s)
    Weighted Random Sampling                           [WE80]     O(s log n)
    Sequential Random Sampling, known population       [FMR62]    O(n/b_avg)
      size                                             [Vit84]    O(s)
    Sequential Random Sampling, unknown population     [Vit85]    O(n/b_avg)
      size                                             [Vit85]    O(s(1 + log(n/s)))

    Note: s = sample size, n = population size, b_max = maximum number of records in a block,
    b_avg = average number of records in a block. Assume each sample is taken from a distinct
    disk page, i.e., s < n/b_avg. For Vitter's algorithms assume random disk I/O.

Techniques for sampling from selections may be classified according to whether they use an index, or scan the entire relation. The first class can be further classified according to whether the index contains rank information, which permits random access to the j'th ranked record. We shall assume that the index is a single-attribute record-level index constructed as a B+-tree as discussed in the full paper, [OR86]. Except as noted, we assume that the predicate can be fully resolved by the index. Based on this classification schema, we have the following algorithms:

• KSKIPI: sample sequentially via random access SKIPs in the Index,

• RAI: Random Access sample via Index until the desired sample size is obtained,

• SCANI: sample sequentially via the Index, SCANning every relevant index page,

• RA: Random Access sample directly until the desired sample size is obtained,

• SCAN: sample sequentially, SCANning every page of the relation.

In order to generate random accesses via the index, we must assume that the index includes rank information as discussed in Section 5.3.

The first method, sequential sampling via random access skips in the index (KSKIPI), can be expedited [Vit84] if the population size (the number of tuples which qualify on the predicate) is known, i.e., computable from the rank information in the index. In this case the expected number of disk accesses is given by:

    E(D_KSKIPI) ≈ s(1 + log_f(n·p_pred/s))    (3)

Here f is the average fan-out of each node in the B+-tree index. The log term is due to the average height in the tree we must backtrack for each skip. We assume one additional access for each element of the sample to actually retrieve the sampled record.

Again assuming rank information in the index, the second method, random probes of the subtree of the index selected by the predicate (RAI), has an expected cost of:

    E(D_RAI) ≈ s(1 + log_f(n·p_pred))    (4)

Clearly, KSKIPI is always more efficient than RAI for simple predicates. However, there may be occasions in which multi-attribute predicates are specified for which only a single index is available. This precludes the use of KSKIPI, because we don't know the size or the identity of the population satisfying the multi-attribute predicate. However, we can continue to use RAI on one index, and evaluate the multi-attribute predicate on each record sampled.

The third method, sequential sampling via the index (SCANI), consists of finding the pages of the index which point to records which satisfy the predicate, and then sequentially scanning and sampling each such index page, assuming that successive index pages are chained. The sequential sampling would be done with a reservoir method such as [Vit85], which does not require a known population size. This method would be used when the index does not contain the rank information needed for RAI or KSKIPI. It has an expected cost of:

    E(D_SCANI) ≈ log_f(n/f) + n·p_pred/f + s    (5)

The fourth method, direct random access sampling (RA), does not require any index. For a relation with a fixed blocking factor the number of disk accesses required to obtain s distinct records follows a negative hypergeometric distribution whose mean is approximately given by:

    E(D_RA) ≈ s / p_pred    (6)

assuming that s < n·p_pred. The advantage of this method is that it does not require an index. If p_pred is close to 1 this method avoids superfluous accesses to the index. If p_pred is very small the SCAN method is to be preferred.

The fifth method, SCAN, consists of simply scanning the entire relation to perform the selection, with a pipelined sequential sampling of the result. The number of page accesses is simply the size of the relation:

    E(D_SCAN) = n / b_R    (7)

Here b_R is the blocking factor for relation R.

5.3 Projection

For simplicity we only consider projection on a single domain. Similar results hold for projection on multiple domains. Similarly, for expository purposes we treat only sampling with replacement. As shown earlier, extensions to sampling without replacement are straightforward. We denote the projection of relation R onto domain A as π_A(R).

Let A be an attribute defined on the domain a_1, a_2, ..., a_m. The set R.a_i includes all the tuples in R with value a_i on the attribute A.

Definition 5 The minimum frequency of the attribute A in relation R, denoted as |(R.a)|_min, is the minimum cardinality in relation R of any projection domain value a_i.

Theorem 2 If A is not a key of R then

    ψ_s(π_A(R)) ⇎ π_A(ψ_s(R))

Proof: A counterexample is given in the full paper [OR86].

Projection does not generally commute with sampling because the projection operator removes duplicates. Hence, interchanging projection and sampling will produce uneven inclusion probabilities. □

However, if the attribute A is a key of the relation R, then there will be no duplicate values of A in R, hence projection and sampling can be exchanged with impunity.

Theorem 3

    ψ_SRSWR,s(π_A(R)) ⇔ π_A(WR(s, ACCEPT(|(R.a)|_min/|(R.a_i)|, ψ_SRSWR,1(R))))

Proof: On the righthand side we sample with replacement from relation R first. Hence, each value a_i in the projection domain A would have an inclusion probability of |(R.a_i)|/|R|. Since we want uniform inclusion probabilities on the projected domain values, we employ acceptance/rejection sampling to correct the inclusion probabilities. The acceptance probability for a tuple with value a_i in the projection domain A is given as:

    p_i = |(R.a)|_min / |(R.a_i)|    (8)

Hence, for each iteration the inclusion probability for each distinct a_i is:

    p = |(R.a)|_min / |R|    (9)

We repeat this until we have s distinct records in our sample. □

Hence the expected cost is:

    E(D) ≈ s·|(R.a)|_avg / |(R.a)|_min    (10)

assuming s < |π_A(R)|, where |(R.a)|_avg = |R|/m is the average cardinality of attribute A over all attribute values a_i present in the relation. Here we have assumed that relation R is hashed on the projection domain so that records may be retrieved in a single access.
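The acceptance/rejection scheme of Theorem 3 can be sketched as follows. The Python is illustrative only; `attr_index` and the precomputed cardinality map `card` (giving |(R.a_i)| for each value, e.g. derived from a hash index on A) are assumptions of the sketch, not structures prescribed by the paper:

```python
import random

def sample_projection_one(R, attr_index, card):
    """Draw one value of the projected domain A uniformly at random.

    R          : list of tuples
    attr_index : position of attribute A within each tuple
    card       : dict mapping each value a_i to its cardinality |(R.a_i)|
    """
    c_min = min(card.values())      # |(R.a)|_min
    while True:
        t = random.choice(R)        # SRSWR of size 1 from R
        a = t[attr_index]
        # ACCEPT with probability |(R.a)|_min / |(R.a_i)|
        if random.random() < c_min / card[a]:
            return a
```

Frequent values are drawn more often but accepted proportionally less often, which is exactly how the acceptance test equalizes the per-iteration inclusion probability at |(R.a)|_min/|R|.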
In order for the above algorithm to work we must be able to readily determine the cardinality (number of duplicates) of each projected tuple. This requires that the relation to be projected must be either sorted, indexed, or hashed on the projection domain. Also, we must either know |(R.a)|_min or replace it with a lower bound of 1, at the expense of reduced efficiency.

The case in which the relation to be projected is not "indexed" is discussed in the full paper, [OR86].

5.4 Intersection

We denote the intersection of two distinct relations R and T as R ∩ T.

While it is possible to distribute sampling over intersection and still preserve uniform inclusion probabilities, the resulting computation is so inefficient that it is rarely worthwhile.

Theorem 4

    ψ_s(R ∩ T) ⇔ WOR(s, ψ_1(R) ∩ T)    (11)
              ⇔ WOR(s, R ∩ ψ_1(T))    (12)

Proof: Consider the first case. From the lefthand …

The expected cost of sampling from relation R, when relation T has a B+-tree index, is:

    E(D_R) ≈ (s/p_{R∩T})(1 + log_{f_T}|T|)    (14)

where f_T is the average fan-out of the B+-tree index to relation T. Again we assume that s < |R ∩ T|, i.e., we neglect the extra cost of sampling without replacement.

An analogous formula for E(D_T) can be written if we sample from relation T and check the intersection in relation R. The choice of which file to sample from can be made by comparing the values of the two cost formulas.

5.5 Difference

We denote the difference of two relations R and T as R − T.

Theorem 5 For all k:

    ψ_k(R − T) ⇎ ψ_k(R) − ψ_k(T)

Proof: Interchanging sampling and difference fails because elements in R ∩ T but not in ψ_k(T) may be erroneously included in ψ_k(R) − ψ_k(T). □

Theorem 6 …
5.6 Union

We denote the union of two distinct relations R and T as R ∪ T.

Assuming B+-tree indices, and s ≪ |R ∪ T|, the expected number of iterations of the above algorithm needed to obtain a single sample is:
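As an illustration of how MIX (Definition 3) can be combined with acceptance/rejection to sample a union, one possible scheme is sketched below. It is a sketch of a standard construction, not necessarily the algorithm analyzed above:

```python
import random

def sample_union_one(R, T):
    """Draw one uniformly random element of R ∪ T.

    MIX: choose R with probability |R|/(|R|+|T|), else T, and draw a
    uniform tuple from the chosen relation.  ACCEPT: a tuple lying in
    both relations had two chances to be drawn, so it is kept only
    with probability 1/2, making every element of the union equally
    likely on each iteration.
    """
    r_set, t_set = set(R), set(T)
    p_R = len(R) / (len(R) + len(T))
    while True:
        if random.random() < p_R:
            t = random.choice(R)
            in_both = t in t_set
        else:
            t = random.choice(T)
            in_both = t in r_set
        if not in_both or random.random() < 0.5:
            return t
```

Per iteration, an element in only one relation is drawn with probability 1/(|R|+|T|), and an element in both is drawn with probability 2/(|R|+|T|) and then halved, so all inclusion probabilities match.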
5.7.1 Join without keys

First, we will deal with the case that the join attribute X is not a key in any of the relations R or T. We assume that relation T is "indexed" on the join attribute X, and that the modal frequency of the attribute X in relation T, as defined below, is also known.

Definition 6 The modal frequency of the attribute X in relation T, denoted as |T.x|_max, is the maximum cardinality in relation T of any join domain value x_i, i.e.,

    |T.x|_max = max_i |T.x_i|    (21)

Theorem 9 The following acceptance/rejection algorithm will generate a simple random sample of size s:

    ψ_SRSWR,s(R ⋈_{R.X=T.X} T) ⇔

    comment This version of the algorithm samples R first;
    for j := 1 to s do
        set accept to false
        while accept = false
        begin
            Choose a random record r from R.
            Assume r[X] = x_i.
            Find the cardinality of T.x_i.
            Accept r into the sample with probability

                p = |T.x_i| / |T.x|_max

            In case record r is accepted,
                choose randomly a tuple t of T.x_i, and join r with it.
                Store the result r ⋈_{R.X=T.X} t in the sample file.
                Set accept to true.
        end
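The algorithm of Theorem 9 translates directly into Python. In this sketch, `T_index` (a map from each join value x_i to the list of matching T-tuples, standing in for the index on T) and the layout of records as tuples with the join attribute last are assumptions of the illustration:

```python
import random

def sample_join_one(R, T_index, t_max):
    """Draw one tuple of the join R ⋈ T uniformly at random
    (join attribute not a key), by acceptance/rejection.

    R       : list of tuples; r[-1] holds the join attribute value
    T_index : dict mapping join value x_i -> list of matching T tuples
    t_max   : modal frequency |T.x|_max of the join attribute in T
    """
    while True:
        r = random.choice(R)                  # random record r from R
        matches = T_index.get(r[-1], [])      # the set T.x_i
        # accept r with probability |T.x_i| / |T.x|_max
        if random.random() < len(matches) / t_max:
            t = random.choice(matches)        # random matching tuple of T
            return r + t                      # the joined result r ⋈ t
```

Records of R with many join partners are accepted proportionally more often, which compensates for the fact that each such record appears in more tuples of the join output. The loop assumes at least one record of R has a match, otherwise it never terminates.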
…and cost analysis. If both R and T are indexed, either algorithm could be used. If the join selectivity, p, is very small, neither algorithm is recommended. Instead, it may be preferable to compute the full join and then sample sequentially the output of the join using [Vit85] as it is generated.

5.7.2 Join with key

Suppose that the join domain X is a key of relation T and that relation T is indexed on X. In this case, the acceptance/rejection sampling is unnecessary, if we first sample from relation R. See the full paper [OR86] for details and proof.

References

[Coc77] William G. Cochran. Sampling Techniques. Wiley, 1977.

[Dev86] Luc Devroye. Non-uniform Random Variate Generation. Springer-Verlag, 1986.

[EN82] Jarmo Ernvall and Olli Nevalainen. An algorithm for unbiased random sampling. The Computer Journal, 25(1), 1982.

[FMR62] C.T. Fan, M.E. Muller, and I. Rezucha. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association, 57:387–402, June 1962.