Evaluating Multiple Join Queries in A Distributed Database System

Mathl. Comput. Modelling Vol. 21, No. 7, pp. 83-98, 1995
Pergamon. Copyright © 1995 Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0895-7177/95-$9.50 + 0.00
0895-7177(95)00033-x
D. J. REID
Distributed Systems Technology Centre
Department of Computer Science
The University of Queensland
St. Lucia, Queensland 4072, Australia
(Received and accepted October 1994)
Abstract-It is proposed that the execution of a set of join queries in a distributed environment
should be considered cooperatively, rather than as a set of separate requests. With this understanding,
a model of multiple query execution in the form of a linear integer program is offered.
Several requests are issued to the distributed database management system, each specifying the
collation of information comprised of a number of logically distinct data sets, or relations, and dis-
persed across the sites of a distributed system. Performing these tasks demands the usage of limited
resources, so that efficient management commands the smallest additional imposition possible. Both
processors and the data communication devices that interconnect them are exploited; an optimal
strategy is defined to be one that minimizes a weighted sum of the costs of computation and those
of information exchange incurred in resolving the group of queries.
Previous models of join query evaluation would regard each individual query in isolation, to produce
a sequence of independent execution strategies, one correspondingly for every request. By instead
permitting multiple utilization of intermediate computations, any overlap between these queries can
be exploited to further reduce the total demand placed on the system as a whole. Through investi-
gations into the character of a number of interacting join computations, performed at a single site in
isolation, an earlier single query model [1] can be extended to facilitate the cooperative execution of
an entire group.
Keywords-Distributed database system, Integer program, Join query, Relational database.
1. INTRODUCTION AND OVERVIEW

In view of its impact on overall performance, the adept management of system resources is
widely understood to be a constituent intrinsic in the construction and practical application
of distributed system technologies. In particular, the development of information systems that
traverse numerous spatially separate processor facilities requires assiduous regard for the effi-
cient resolution of database transactions. As a result, the subject has attracted much research
attention [1-9].
The resolution of a query in a distributed environment frequently involves the amalgamation
of information disseminated amongst numerous sites of the network and conceptually viewed as
many distinct data sets. In response to an issued request, the result of joining all of the relations
named in the query is to be computed and presented to the appropriate user.
The author wishes to acknowledge the support of the Transactions Management group, and Distributed Databases
Unit, of the Distributed Systems Technology Center, and also the Department of Computer Science, the University
of Queensland. Thanks are also due to M. Orlowska.
Rarely are such requests issued solitarily. Often, an external event detected by a user or
applications program triggers the submission of whole groups of transactions, to achieve the
desired action or access the appropriate information. On a broader scale, related requests may
be submitted from distinct users, at different sites, in response to an occasion topical to both.
By resolving each query separately, the intermediate results of one cannot be effectively utilized
in the satisfaction of another. In the facilitation of a methodology to allow the cooperative
execution of any number of overlapping requests, partial computations can be shared and designed
with the needs of all of the queries in mind.
While the expense of monitoring the entire network to detect overlapping requests may be
prohibitive, it may instead be viable to compare requests emanating from the same site, or from
a collection of user processes known to share common interests. Support at the language level
could also permit a user or group of users to explicitly specify that a number of queries are to be
treated as a whole. The model developed here precludes no option; the requests are not presumed
to originate at users related in any particular way.
By investigating the desired behavior of processors in permitting multiple utilization of rela-
tions, an earlier model of the execution of a single query is extended. In its initial statement, the
new model of multiple join query resolution is nonlinear, preventing its direct solution by the convenient
and widely available techniques for linear problems. On closer examination, and using some
intuitively appealing properties of any feasible solution, this problem can be linearized. The final
model is thereby appropriately and conveniently expressed as a linear integer program [10-15].
The relational database, as is relevant to the execution of a group of join queries, is outlined in
some detail in Section 2. Also presented here is a formal definition of the distributed environment
in which the database management system operates.
A model of minimal transmission cost [1], itself an enhancement of previous work appearing
in [7], underlies the developments presented here, so some background discussion of this is given
in Section 3.
Under the assumption that the tasks of every site have already been determined, an individual
processor is assigned the computation of some number of possibly overlapping relations. The
problem of minimizing the processing cost incurred in satisfying this is developed in Section 4,
leading to the construction in Section 5 of a new model of multiple query execution.
2. RELATIONS, JOINS, AND THE DISTRIBUTED SYSTEM

A relational database appears conceptually to a user or applications program to be composed
of a number of multidimensional tables, or ‘relations’. The set of ‘attributes’, say Y, naming
the columns of a given relation y is its ‘relation scheme’, and defines the format of the table.
Each of these attributes has an associated ‘domain’, which is the set of values that a conforming
column entry is permitted to adopt. A ‘Y-tuple’ is then a function t from relation scheme Y
to the corresponding domains. This mapping can be restricted from the relation scheme Y to
any subset X of Y, with the resultant X-tuple denoted as t[X]. A relation y having relation
scheme Y is then a set of Y-tuples t.
The distributed database management system is presented with a group of requests numbered,
say, as $\{1, \ldots, w\}$. Each requires the join of relations from some appropriate subset
$$r = \{r_1, \ldots, r_m\}$$
of the database, with corresponding relation schemes
$$R = \{R_1, \ldots, R_m\}.$$
The relations specified by any query $\xi \in \{1, \ldots, w\}$ form the set
$$r^{(\xi)} = \{r_{\kappa(\xi)}, \ldots, r_{\tau(\xi)}\},$$
having the concordant relation schemes $R^{(\xi)} = \{R_{\kappa(\xi)}, \ldots, R_{\tau(\xi)}\}$ respectively, where
$\kappa, \tau : \{1, \ldots, w\} \to \{1, \ldots, m\}$ and with $\kappa(\xi) \le \tau(\xi)$ for any $\xi \in \{1, \ldots, w\}$. Without loss of generality,
since any relations not named in any of the queries are irrelevant to the evaluation of the group
and can so be eliminated, it is presumed that
$$\bigcup_{\xi=1}^{w} r^{(\xi)} = r, \quad \text{and correspondingly,} \quad \bigcup_{\xi=1}^{w} R^{(\xi)} = R.$$
For each query, several relations named in that query are to be combined into one; the join of
a set of relations $r^{(\xi)}$ is defined to be the relation
$$\bowtie r^{(\xi)} = \{\, t \mid t[R_i] \in r_i,\ \forall i : \kappa(\xi) \le i \le \tau(\xi) \,\},$$
that has the relation scheme $\bigcup R^{(\xi)}$ [16-19]. ‘Consistency’ [20], also known as ‘joinability’ [21],
of all of the relations of $r$ is presumed, meaning that the join of any two relations $\bowtie x$ and $\bowtie y$,
where $x, y \subseteq r$, is the relation $\bowtie (x \cup y)$.
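As an illustration of the definitions above, the following is a minimal Python sketch of tuple restriction $t[X]$ and of the join of two relations. The representation of a relation as a list of attribute-to-value dictionaries, and the names `restrict` and `join`, are assumptions made for this example only; they do not come from the paper. The toy relations are not required to be consistent in the sense presumed in the text.

```python
from itertools import product

def restrict(t, attrs):
    """Return t[X]: the tuple t restricted to the attribute subset attrs."""
    return {a: t[a] for a in attrs}

def join(r1, r2):
    """Join two relations, each a list of dicts mapping attribute -> value."""
    result = []
    for t1, t2 in product(r1, r2):
        shared = t1.keys() & t2.keys()
        # Two tuples combine only when they agree on every shared attribute.
        if all(t1[a] == t2[a] for a in shared):
            merged = {**t1, **t2}
            if merged not in result:  # a relation is a set of tuples
                result.append(merged)
    return result

# Hypothetical relations with schemes {A, B} and {B, C}.
r1 = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]
r2 = [{"B": 2, "C": 5}, {"B": 2, "C": 6}]
joined = join(r1, r2)
```

Every tuple of `joined`, restricted to the scheme of `r1`, is a tuple of `r1`, which mirrors the condition $t[R_i] \in r_i$ in the definition.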
For simplicity in this presentation, it is also understood that the relations $r$ assume the structure
of a ‘referential path’, so that each of the requests $\xi \in \{1, \ldots, w\}$ is a ‘chain query’ [5,22]; this
presumption can be eliminated with an increased complexity of notation in the indices and
decision variables of the final model to capture the more intricate pattern of overlapping relation
schemes. The overall structure, however, remains intact.
The relations of $r$ are thereby numbered as $r = \{r_i \mid i = 1, \ldots, m\}$ with corresponding relation
schemes $R = \{R_i \mid i = 1, \ldots, m\}$, such that $R_i \cap R_j \ne \emptyset$ if and only if $j = i + 1$. Only joins of
relations having overlapping relation schemes are considered, and noting that $\bowtie \{r_i\} = r_i$, all
relations that may be produced by joining subsets of $r$ take the form
$$\bowtie \{r_i, \ldots, r_j\}, \quad \text{for } 1 \le i \le j \le m.$$
The amount of data contained within any such relation $\bowtie \{r_i, \ldots, r_j\}$ is a real-valued function,
for brevity written as
$$\rho_{ij} \equiv \rho(\bowtie \{r_i, \ldots, r_j\}).$$
The distributed environment of the database system is a network
$$D = (S, L)$$
of processor sites $S$ linked by the communications channels $L \subseteq S \times S$ that are available for use
in answering the query. The graph $D$ is here considered to be directed, so that a bidirectional
data channel is represented as two opposing arcs.
For each data link, there is a cost imposed for its transmission of data, and this is presumed
to be a linear function of the amount of data sent. The coefficient
$$c_{hk}$$
is the cost per unit of data transferred from one processor site $h \in S$ to another $k \in S$. Also
associated with a data channel $(h, k) \in L$ is a predefined data capacity limit
$$p_{hk} > 0,$$
which defines the maximum total volume of data that the channel may convey in the course of
the join computation. Observe the imposition that the capacity limit is strictly greater than zero;
this is made without loss of generality, for a zero available capacity implies that there is, in effect,
no available direct data link between the processors concerned.
The join is usually implemented as a binary operation, so that the join of several relations
is actualized by a sequence of join computations. A relation $\bowtie \{r_i, \ldots, r_j\}$, for $i < j$, is then
the direct product of joining two relations $\bowtie \{r_i, \ldots, r_p\}$, say, and $\bowtie \{r_{p+1}, \ldots, r_j\}$, where
$1 \le i \le p < j \le m$. Without loss of generality, it is assumed that implementations of just such a
join operator are available at all of the sites of the system; any processor site not supporting the
join operator can only contribute in terms of its data links, and therefore can be removed, with
sites connected through it joined directly.
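To make the index structure concrete, the sketch below enumerates the relations $\bowtie \{r_i, \ldots, r_j\}$ obtainable from a chain of $m$ relations, and the binary joins that can directly produce one of them. The function names `chain_subrelations` and `binary_splits` are hypothetical, and relations are identified only by their index pairs $(i, j)$.

```python
def chain_subrelations(m):
    """Index pairs (i, j) with 1 <= i <= j <= m, one per joinable sub-chain."""
    return [(i, j) for i in range(1, m + 1) for j in range(i, m + 1)]

def binary_splits(i, j):
    """Argument pairs ((i, p), (p + 1, j)) whose binary join yields (i, j)."""
    return [((i, p), (p + 1, j)) for p in range(i, j)]

pairs = chain_subrelations(4)   # 4 base relations r_1..r_4
splits = binary_splits(1, 3)    # ways to form the join of r_1..r_3 in one step
```

For a chain of $m$ relations there are $m(m+1)/2$ such index pairs, and a relation spanning $j - i + 1$ base relations has $j - i$ possible split points $p$.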
Computational abilities of various processor sites may vary, due both to hardware differences
and alternate implementations of the binary join operator, and therefore the cost of joining two
relations depends upon the site at which the computation is performed. The processing cost of
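joining $\bowtie \{r_i, \ldots, r_p\}$ and $\bowtie \{r_{p+1}, \ldots, r_j\}$ at processor $k \in S$ is a function
$$\gamma_{ipjk} \equiv \gamma(\bowtie \{r_i, \ldots, r_p\},\ \bowtie \{r_{p+1}, \ldots, r_j\},\ k).$$
Typically, the value of $\gamma_{ipjk}$ will be related to the sizes $\rho_{ip}$ and $\rho_{p+1,j}$ of the relations being joined,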
and to some measure of the speed of the processor k. However, no particular relationship is here
supposed.
A processing facility can offer only limited computational capacity to the answering of the
query, according both to its hardware capabilities and the work load already imposed upon it.
The maximum volume of work, in terms of computational cost, that may be additionally levied
on a site k E S is
$$\psi_k = \psi(k) > 0.$$
Note that any site having no available processing capacity is irrelevant to the query evaluation,
and may be removed in the same way as any sites not providing a join operator.
Each processor site has some type of secondary storage available to it, either supported locally,
or provided by another site playing the role of a server.
Each processor site is allocated a (possibly empty) subset of the relations named in the query,
and this is represented as a function
$$\alpha : S \to \mathbb{P}(r).$$
The set of relations that are stored at a site $k \in S$ is then $\alpha(k) \subseteq r$. This allows the possibility of
replicated data, where several copies of the same relation may exist, each allocated to a different
site. However, at least one copy of each relation in r is to be utilized in the computation of the
requested relations $\bowtie r^{(\xi)}$, for $1 \le \xi \le w$, and the choice of which copies will be used is a task of
the optimization problem.
Relations are communicated between processors, where they may be joined to produce new
relations. Before it may be utilized, any relation that was not initially available on the network
must be manufactured at some site, by amalgamating a suitable set of components. The consumer
of any relation $\bowtie r^{(\xi)}$, for $1 \le \xi \le w$, is the user process that submitted this request; all others
are produced to become arguments to yet other join operations.
A final result $\bowtie r^{(\xi)}$ is to be made available to the client process that issued this query, residing
at some particular site, say $q(\xi) \in S$. This is to be performed in such a way that minimizes the
total cost incurred to the system, measured in terms of the exploitation of both processor and
communications resources.
To prohibit obviously incongruous circumstances in which it is not possible to answer every
query, it is presupposed that the information contained in every relation can ultimately arrive at
the site of every user that specified it. That is, for each relation named in any particular query,
there must be a path through the network D from at least one site holding a copy to the site of
the relevant user:
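$$\forall \xi : 1 \le \xi \le w \;\; \forall r_i \in r^{(\xi)} \;\; \exists k_1, \ldots, k_p \in S \;\bullet\; r_i \in \alpha(k_1)$$
$$\wedge\; (k_n, k_{n+1}) \in L, \quad n = 1, \ldots, p-1$$
$$\wedge\; k_p = q(\xi).$$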
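The path condition above can be checked mechanically on a small instance. The following Python sketch uses a breadth-first search over the directed links; the function names, the dictionary-based encoding of $\alpha$ and $q$, and the toy network are all assumptions made for illustration.

```python
from collections import deque

def reachable_from(sources, links):
    """Sites reachable along directed links from any site in sources."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        h = queue.popleft()
        for (a, b) in links:
            if a == h and b not in seen:
                seen.add(b)
                queue.append(b)
    return seen

def answerable(queries, alpha, links, q):
    """True if every relation named in every query can reach that query's site.

    queries: query index -> list of relation names
    alpha:   site -> set of relation names stored there
    q:       query index -> site hosting the user that issued the query
    """
    for xi, relations in queries.items():
        for r in relations:
            holders = [k for k, stored in alpha.items() if r in stored]
            if q[xi] not in reachable_from(holders, links):
                return False
    return True

# Hypothetical three-site network with links s1 -> s2 -> s3.
links = {("s1", "s2"), ("s2", "s3")}
alpha = {"s1": {"r1"}, "s2": {"r2"}, "s3": set()}
queries = {1: ["r1", "r2"]}
ok_at_s3 = answerable(queries, alpha, links, {1: "s3"})
ok_at_s1 = answerable(queries, alpha, links, {1: "s1"})
```

With the user at `s3` both relations can arrive; with the user at `s1` the relation `r2`, held only at `s2`, has no path back, so the condition fails.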
Additional restrictions may also be placed on the structure of the network D. No processor that
cannot transmit data, directly or indirectly, to a site bearing a querist can possibly participate in
the computation. This is also true for sites that cannot gain access to any relevant information.
Any such processors may be deleted from the network, so that, without loss of generality,
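$$\forall k \in S \;\; \exists \xi : 1 \le \xi \le w \;\; \exists k_1, \ldots, k_p \in S \;\bullet\; k_1 = k \;\wedge\; k_p = q(\xi)$$
$$\wedge\; (k_n, k_{n+1}) \in L, \quad n = 1, \ldots, p-1$$
and
$$\forall k \in S \;\; \exists r_i \in r \;\; \exists k_1, \ldots, k_p \in S \;\bullet\; r_i \in \alpha(k_1) \;\wedge\; k_p = k$$
$$\wedge\; (k_n, k_{n+1}) \in L, \quad n = 1, \ldots, p-1.$$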
3. MINIMAL EXECUTION COST OF A SINGLE QUERY

If only a single query is to be evaluated, then the final result required at a site $q = q(1)$ is the
relation $\bowtie r = \bowtie r^{(1)}$. Here, every relation leaving a site is the join of some number of relations
(perhaps only one) that enter it, and an entering relation must either be used in producing a
new relation, or re-transmitted unaltered. This problem has been previously regarded [1] in the
form of a linear integer program, as a development of an earlier model that discounted the costs
of join computation [7].
The allocation $\alpha$ of relations to processor sites can be expediently incorporated into the network
$D = (S, L)$. New vertices are added to the graph, one for each relation in $r$, and a directed arc
connects one of these ‘artificial’ nodes to the vertex representing an actual processor site if a
copy of the corresponding relation is allocated to this site. A new vertex $u(\xi)$ is also included to
represent each user process that issued a query $\xi \in \{1, \ldots, w\}$, and connected by an arc originating
at the appropriate host site $q(\xi)$.
When there is only one query in the group, letting $u = u(1)$, the new network $D' = (S', L')$ so
constructed is defined [1,7] as
$$S' = S \cup r \cup \{u\},$$
$$L' = L \cup \{(r_i, k) \mid r_i \in \alpha(k)\} \cup \{(q, u)\}$$
(the extension of this, in facilitation of more than one query in the group, is clear). Each relation
of $r$ now originates at exactly one source node, which conveniently captures the
condition that exactly one copy of each is to be utilized.
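A short Python sketch of this construction for the single-query case follows. The names `augment` and `predecessors`, the string labels for nodes, and the toy allocation are hypothetical; the sketch also assumes that every relation of $r$ is stored at one site or more, so that $r$ can be recovered from $\alpha$.

```python
def augment(S, L, alpha, q, u="u"):
    """Build D' = (S', L') from D = (S, L), the allocation alpha, and host site q."""
    relations = {r for stored in alpha.values() for r in stored}
    S_prime = set(S) | relations | {u}
    # One arc from each artificial relation node to every site holding a copy,
    # plus one arc from the host site to the user node.
    L_prime = set(L) | {(r, k) for k, stored in alpha.items() for r in stored} | {(q, u)}
    return S_prime, L_prime

def predecessors(k, links):
    """Nodes that may transfer data directly to node k."""
    return {h for (h, l) in links if l == k}

S = {"s1", "s2"}
L = {("s1", "s2")}
alpha = {"s1": {"r1"}, "s2": {"r1", "r2"}}   # r1 is replicated at both sites
S_prime, L_prime = augment(S, L, alpha, q="s2")
```

In this toy instance the replicated relation `r1` has a single artificial source node with two outgoing arcs, so choosing which copy to use amounts to choosing one of those arcs.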
In this discussion, the set of nodes of $D'$ that may possibly transfer data to a given node $k \in S'$
will, for brevity of notation, be denoted as $\mathcal{P}(k)$. Likewise, the set of nodes that may receive
information from $k$ is $\mathcal{A}(k)$. That is,
$$\mathcal{P}(k) = \{h \in S' \mid (h, k) \in L'\}, \quad k \in S', \quad \text{and}$$
$$\mathcal{A}(k) = \{l \in S' \mid (k, l) \in L'\}, \quad k \in S'.$$
Furthermore, note that $\forall r_i \in r \;\bullet\; \mathcal{P}(r_i) = \emptyset$, and $\mathcal{A}(u) = \emptyset$.

The conveyance of information between vertices of the network D’ is represented by the trans-
mission decision variables. Any relation can be potentially communicated by any real data
channel, that is, any link in the distributed system D. However, only the particular correlating
relation may be sent from an artificial node to a processor site, and only the final result $\bowtie r$ can
be given in answer to the query. Then
$$f_{ijhk} = \begin{cases} 1 & \text{if } \bowtie \{r_i, \ldots, r_j\} \text{ is transmitted along data link } (h, k) \in L', \\ 0 & \text{otherwise,} \end{cases}$$
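for $i, j, h, k$ satisfying
$$(1 \le i \le j \le m \;\wedge\; (h, k) \in L)$$
$$\vee\; (1 \le i = j \le m \;\wedge\; h = r_i \;\wedge\; h \in \alpha(k))$$
$$\vee\; (i = 1 \;\wedge\; j = m \;\wedge\; i \le j \;\wedge\; h = q \;\wedge\; k = u).$$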
Conversely, for i, j, h, k such that
(l<i<jSmAh~rAk~cr(h))
V(i = j A h = ri A k y? a(h))
V(i>lAj<mAi>jAh=qAk=u),
the value of fijhk is defined to be zero.

The actual computations performed are represented by the join decision variables, each record-
ing the specification of a binary join operation between two joinable relations and the particular
processor site at which this occurs. Then let
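$$g_{ipjk} = \begin{cases} 1 & \text{if } \bowtie \{r_i, \ldots, r_p\} \text{ and } \bowtie \{r_{p+1}, \ldots, r_j\} \text{ are joined at processor site } k \text{ to produce } \bowtie \{r_i, \ldots, r_j\}, \\ 0 & \text{otherwise,} \end{cases}$$
where $i, p, j : 1 \le i \le p < j \le m$, and $k \in S$. When $i, j, p : i = j \vee p \ge j \vee p < i$, or $k : k \in r \vee k = u$,
define that $g_{ipjk} = 0$.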
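The problem of minimizing the total execution cost, including both processing and transmission
costs, can then be stated as the zero-one linear program, as proposed in [1],
$$\min \;\; \varepsilon_1 \sum_{i=1}^{m} \sum_{j=i}^{m} \sum_{k \in S} \sum_{h \in \mathcal{P}(k)} (\rho_{ij} \cdot c_{hk} + \delta)\, f_{ijhk} \;+\; \varepsilon_2 \sum_{k \in S} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk}$$
subject to
$$\forall i : 1 \le i \le m \;\bullet\; \sum_{k \in \mathcal{A}(r_i)} f_{i\,i\,r_i\,k} = 1$$
$$\forall i, j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; \sum_{p=i}^{j-1} g_{ipjk} - \Bigl( \sum_{i'=1}^{i-1} g_{i',\,i-1,\,j,\,k} + \sum_{j'=j+1}^{m} g_{i,\,j,\,j',\,k} \Bigr) = \sum_{l \in \mathcal{A}(k)} f_{ijkl} - \sum_{h \in \mathcal{P}(k)} f_{ijhk}$$
$$f_{1\,m\,q\,u} = 1$$
$$\forall (h, k) \in L \;\bullet\; \sum_{i=1}^{m} \sum_{j=i}^{m} \rho_{ij} \cdot f_{ijhk} \le p_{hk}$$
$$\forall k \in S \;\bullet\; \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk} \le \psi_k$$
$$f_{ijhk} \in \{0, 1\}, \quad g_{ipjk} \in \{0, 1\}. \tag{P1}$$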
The objective function is a weighted sum of the transmission and processing costs evoked
in answering the query. The scaling factors $\varepsilon_1$ and $\varepsilon_2$ are intended to represent the relative
importance of each, according to the specific requirements and conditions in a particular system.
The constant $\delta$ is chosen so that $0 < \delta < 1$, ensuring that all terms in the transmission cost
section of the objective function have positive coefficients; in this way, the transmission of a
relation around a zero-cost cycle is penalized, and therefore avoided at optimality.
The first constraint, as written here, expresses the condition that exactly one copy of each
relation is to be used in forming $\bowtie r$. The second defines the relationship between the transmission,
utilization, and computation of relations; this encompasses a conservation of information
throughout the process of evaluating the join. The third specifies that a single copy of $\bowtie r$ is to
be finally manufactured, and made available to the user. The capacity limits of data channels
and processor facilities are enforced by the fourth and fifth constraints, respectively.
4. MULTIPLE JOINS AT AN ISOLATED PROCESSOR SITE

The single query model (Pl) was developed by considering the specifics of a join computation
performed at some particular processor in isolation; this investigation underlies the nature of the
constraints representing the behavior of join operations [1]. Similarly, the realization of a model
that supports the cooperative execution of numerous queries demands the exploration of such a
computation at a solitary site.
Suppose then that an overall strategy has been appointed, producing a schedule of relation
transferals and therein assigning to processors the responsibilities of determining join sequences
that fulfill their assigned tasks. With this approach, the role of any single processor can be
characterized, within the context of an overall distributed computation.
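The problem of minimizing the processing cost incurred in computing a single result, say
$\bowtie \{r_a, \ldots, r_b\}$, from some number of relations $r_a, \ldots, r_b$ availed to a processor $k \in S$, has been
formulated [1] as the zero-one linear program
$$\min \;\; \sum_{i=a}^{b-1} \sum_{j=i+1}^{b} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk}$$
subject to
$$\forall i, j : a \le i \le j \le b \;\bullet\; \sum_{p=i}^{j-1} g_{ipjk} - \Bigl( \sum_{i'=a}^{i-1} g_{i',\,i-1,\,j,\,k} + \sum_{j'=j+1}^{b} g_{i,\,j,\,j',\,k} \Bigr) = \begin{cases} -1 & \text{if } i = j, \\ 1 & \text{if } i = a \wedge j = b, \\ 0 & \text{otherwise,} \end{cases}$$
$$g_{ipjk} \in \{0, 1\}. \tag{P2}$$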
This structure is also applicable when the relations presented to the site are themselves the result
of joining others; the simplification is merely notational, a renaming of relations.
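For intuition about problem (P2), the sketch below computes a minimum processing cost by dynamic programming over split points, in the manner of the classical matrix-chain recurrence. This is a swapped-in technique and not the integer programming formulation of the paper; it rests on the assumption that feasible solutions of (P2) correspond to binary join trees over the chain $r_a, \ldots, r_b$. The size table `rho` and the cost function `gamma` (a product of argument sizes) are hypothetical.

```python
from functools import lru_cache

def min_join_cost(a, b, gamma):
    """Least total cost of producing the join of r_a..r_b at a single site.

    gamma(i, p, j) is the cost of joining the relation over r_i..r_p with the
    relation over r_{p+1}..r_j; base relations are assumed already available.
    """
    @lru_cache(maxsize=None)
    def cost(i, j):
        if i == j:
            return 0  # a base relation needs no join
        return min(cost(i, p) + cost(p + 1, j) + gamma(i, p, j)
                   for p in range(i, j))
    return cost(a, b)

# Hypothetical sizes rho[(i, j)] of the relation over r_i..r_j.
rho = {(1, 1): 10, (2, 2): 100, (3, 3): 5,
       (1, 2): 50, (2, 3): 20, (1, 3): 30}

def gamma(i, p, j):
    # Assumed cost model: product of the two argument sizes.
    return rho[(i, p)] * rho[(p + 1, j)]

best = min_join_cost(1, 3, gamma)
```

With these numbers, joining $r_2$ and $r_3$ first costs $100 \cdot 5 + 10 \cdot 20 = 700$, while joining $r_1$ and $r_2$ first costs $10 \cdot 100 + 50 \cdot 5 = 1250$, so the cheaper tree splits at $p = 1$.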
Towards the modification of problem (P2) to encompass possible multiple utilization of inter-
mediate results, in the production of several outcomes, suppose then that some specific processor
$k \in S$ is the recipient of some number $I_k(i, j)$ of copies of each relation $\bowtie \{r_i, \ldots, r_j\}$, where
$$I_k : \{(i, j) : 1 \le i \le j \le m\} \to \mathbb{N}_0.$$
For those relations that are not available initially at this processor, $I_k(i, j) = 0$. Site $k$ is then
required to manufacture $O_k(i, j)$ copies of each relation $\bowtie \{r_i, \ldots, r_j\}$, where $O_k$ is a function
defined as
$$O_k : \{(i, j) : 1 \le i \le j \le m\} \to \mathbb{N}_0.$$
The presumption is made that it is possible to produce all outcome relations, using those provided.
These tasks are achieved by consecutively joining pairs of relations; the number of times that
a relation $\bowtie \{r_i, \ldots, r_j\}$ is produced as the result of a join operation performed at this site $k \in S$
is
$$N_k(i, j) = \sum_{p=i}^{j-1} g_{ipjk},$$
and the number of times that it is used to produce yet others is
$$M_k(i, j) = \sum_{i'=1}^{i-1} g_{i',\,i-1,\,j,\,k} + \sum_{j'=j+1}^{m} g_{i,\,j,\,j',\,k}.$$
Note also that $N_k(i, j), M_k(i, j) \ge 0$, for all $i, j : 1 \le i \le j \le m$, and all processors $k$.
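The two counts can be read directly off an assignment of the join variables. In the Python sketch below, `g` is a dictionary keyed by `(i, p, j, k)` holding the value of $g_{ipjk}$ (absent keys are taken as zero); this encoding, the function names `produced` and `used`, and the toy join sequence are assumptions for illustration.

```python
def produced(g, i, j, k):
    """N_k(i, j): joins at site k whose result is the relation over r_i..r_j."""
    return sum(g.get((i, p, j, k), 0) for p in range(i, j))

def used(g, i, j, k, m):
    """M_k(i, j): joins at site k taking the relation over r_i..r_j as an argument."""
    # As the right-hand argument, joined with the relation over r_i'..r_{i-1}.
    as_right = sum(g.get((i2, i - 1, j, k), 0) for i2 in range(1, i))
    # As the left-hand argument, joined with the relation over r_{j+1}..r_j'.
    as_left = sum(g.get((i, j, j2, k), 0) for j2 in range(j + 1, m + 1))
    return as_right + as_left

# Toy sequence at site "k" with m = 3: first r_1 with r_2, then the result with r_3.
m = 3
g = {(1, 1, 2, "k"): 1, (1, 2, 3, "k"): 1}
n12, m12 = produced(g, 1, 2, "k"), used(g, 1, 2, "k", m)
n13, m13 = produced(g, 1, 3, "k"), used(g, 1, 3, "k", m)
```

Here the intermediate relation over $r_1, r_2$ is produced once and used once, while the final relation over $r_1, \ldots, r_3$ is produced once and never used, matching the right-hand sides of (P2).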
With the generalization to facilitate many interacting join evaluations, a relation $\bowtie \{r_i, \ldots, r_j\}$
may still be used in a join operation the same number of times that it has been produced, so the
possibility that
$$N_k(i, j) = M_k(i, j)$$
must be preserved. Observe that this includes relations that do not directly participate
in the computation as seen by site $k$; these are never produced, and never used, by any join
operation that occurs there.
Any relation not initially delivered to the site cannot be used in a join until first produced.
However, the multiple utilization of any relation $\bowtie \{r_i, \ldots, r_j\}$, once realized, is now permissible;
these circumstances are characterized by permitting
$$N_k(i, j) \le M_k(i, j) \;\wedge\; \bigl(N_k(i, j) \ge 1 \;\vee\; I_k(i, j) \ge 1\bigr).$$
The situation in which a relation is used but neither produced by joining, nor made available to
a site, is not allowed by either condition.
In considering these two possibilities together, the viable strategies for actualizing the required
relations can be formulated in terms of a system of integer constraints. That is, for all relations
$\bowtie \{r_i, \ldots, r_j\}$ that are neither presented to processor site $k$ nor specified as one of its outcomes,
or if the site can fulfill its requirement by simply re-transmitting the given copies of this relation,
$$\bigl(N_k(i,j) \le M_k(i,j) \;\wedge\; (N_k(i,j) \ge 1 \vee I_k(i,j) \ge 1)\bigr) \;\vee\; N_k(i,j) = M_k(i,j),$$
or, equivalently,
$$\bigl(N_k(i,j) \le M_k(i,j) \vee (N_k(i,j) = 0 \wedge I_k(i,j) = 0)\bigr)$$
$$\wedge\; \bigl(N_k(i,j) \le M_k(i,j) \vee (N_k(i,j) \ge 1 \vee I_k(i,j) \ge 1)\bigr)$$
$$\wedge\; \bigl(N_k(i,j) \ge M_k(i,j) \vee (N_k(i,j) \ge 1 \vee I_k(i,j) \ge 1)\bigr).$$
That is, $\forall i, j : 1 \le i \le j \le m \wedge I_k(i,j) = O_k(i,j)$,
$$N_k(i,j) \le M_k(i,j) \;\wedge\; \bigl(N_k(i,j) = M_k(i,j) \;\vee\; N_k(i,j) + I_k(i,j) \ge 1\bigr).$$
The difficulty with this specification lies in the disjunction forming the second term; this may be
reduced to a more favorable form by introducing new binary variables, say $e_{ijk}$. This constraint
system may thereby be written as
$$\forall i, j : 1 \le i \le j \le m \wedge I_k(i,j) = O_k(i,j) \;\bullet$$
$$N_k(i,j) \le M_k(i,j)$$
$$\wedge\; (N_k(i,j) - M_k(i,j)) \cdot e_{ijk} = 0$$
$$\wedge\; N_k(i,j) + I_k(i,j) - (1 - e_{ijk}) \ge 0$$
$$\wedge\; e_{ijk} \in \{0, 1\}.$$
The variables $e_{ijk}$ may be interpreted in the context of the join evaluation. The assignment
$e_{ijk} = 0$ indicates $N_k(i,j) < M_k(i,j)$, so that the relation $\bowtie \{r_i, \ldots, r_j\}$ is used more times than
it is produced, demanding $N_k(i,j) \ge 1$ or $I_k(i,j) \ge 1$ to ensure that this relation is manufactured
at least once, if not already available. Conversely, when $e_{ijk} = 1$, $N_k(i,j) = M_k(i,j)$, and either
the relation $\bowtie \{r_i, \ldots, r_j\}$ never participates in the join evaluation, or if it does, it is used the same
number of times that it is produced. This permits $N_k(i,j) \ge 0$ and $I_k(i,j) \ge 0$.
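Introducing slack variables $x_{ijk} = M_k(i,j) - N_k(i,j)$,
$$\forall i, j : 1 \le i \le j \le m \wedge I_k(i,j) = O_k(i,j) \;\bullet$$
$$N_k(i,j) - M_k(i,j) + x_{ijk} = 0$$
$$\wedge\; x_{ijk} \cdot e_{ijk} = 0$$
$$\wedge\; N_k(i,j) + I_k(i,j) + e_{ijk} - 1 \ge 0$$
$$\wedge\; e_{ijk} \in \{0, 1\}, \quad x_{ijk} \ge 0.$$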
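That the system with $e_{ijk}$ and $x_{ijk}$ admits a solution exactly when the disjunctive condition holds can be confirmed by enumeration over small values. The Python sketch below does this for a single relation at a single site with $I_k(i,j) = O_k(i,j)$; the function names and the enumeration range are arbitrary choices made for illustration.

```python
from itertools import product

def disjunctive(N, M, I):
    """The condition N <= M and (N == M or N + I >= 1)."""
    return N <= M and (N == M or N + I >= 1)

def slack_system(N, M, I):
    """True if some e in {0, 1} and x >= 0 satisfy the rewritten constraints."""
    x = M - N                     # forced by N - M + x = 0
    if x < 0:
        return False              # violates x >= 0
    return any(x * e == 0 and N + I + e - 1 >= 0 for e in (0, 1))

# Compare the two formulations for all N, M, I in 0..3.
agree = all(disjunctive(N, M, I) == slack_system(N, M, I)
            for N, M, I in product(range(4), repeat=3))
```

The enumeration only checks small values; the general argument is that $e = 1$ forces $x = 0$, that is $N = M$, while $e = 0$ forces $N + I \ge 1$.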
The variable $x_{ijk}$ is the number of additional accesses to relation $\bowtie \{r_i, \ldots, r_j\}$, expressing the
number of times this relation is copied from a logical point of view (note that this does not
necessarily suggest that actual duplication occurs). Therefore, $x_{ijk}$ implies how long this relation
must be held; the total number of times it is to be utilized is $x_{ijk} + N_k(i, j)$.
The first constraint of this system expresses the net change in the number of copies of a relation
entering and leaving the processor; in this case, the number required of a site is the same
as that given to it, $I_k(i,j) = O_k(i,j)$, and the net change is zero:
$$N_k(i,j) - M_k(i,j) + x_{ijk} = 0.$$
However, when the number $O_k(i, j)$ of copies of a particular relation $\bowtie \{r_i, \ldots, r_j\}$ required of a
site exceeds the number $I_k(i, j)$ supplied to it, the processor must fulfill its quota by a combination
of joining and copying. That is, the number of new copies to be created is $O_k(i, j) - I_k(i, j)$, so
that, for $i, j : 1 \le i \le j \le m \wedge I_k(i,j) < O_k(i,j)$,
$$N_k(i,j) - M_k(i,j) + x_{ijk} = O_k(i,j) - I_k(i,j).$$
Alternatively, if the number $I_k(i, j)$ of copies presented to the processor exceeds the number
$O_k(i, j)$ required of it, then $I_k(i,j) - O_k(i,j)$ copies of the relation $\bowtie \{r_i, \ldots, r_j\}$ must be consumed at
site $k$. Therefore, for any $i, j : 1 \le i \le j \le m \wedge I_k(i,j) > O_k(i,j)$,
$$N_k(i,j) - M_k(i,j) + x_{ijk} = -(I_k(i,j) - O_k(i,j)).$$
Combining these three options into a single condition, to be satisfied for all $i, j$ satisfying
$1 \le i \le j \le m$,
$$N_k(i,j) - M_k(i,j) + x_{ijk} = \begin{cases} 0 & \text{if } I_k(i,j) = O_k(i,j), \\ O_k(i,j) - I_k(i,j) & \text{if } I_k(i,j) < O_k(i,j), \\ -(I_k(i,j) - O_k(i,j)) & \text{if } I_k(i,j) > O_k(i,j). \end{cases}$$
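The model of cooperative evaluation of numerous relations with overlapping relation schemes, at
a single processor facility, can thereby be stated as a generalization of problem (P2) in the form
of the integer program
$$\min \;\; z = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk}$$
subject to
$$\forall i, j : 1 \le i \le j \le m \;\bullet\; \sum_{p=i}^{j-1} g_{ipjk} - \Bigl( \sum_{i'=1}^{i-1} g_{i',\,i-1,\,j,\,k} + \sum_{j'=j+1}^{m} g_{i,\,j,\,j',\,k} \Bigr) + x_{ijk} = O_k(i,j) - I_k(i,j)$$
$$\forall i, j : 1 \le i \le j \le m \;\bullet\; x_{ijk} \cdot e_{ijk} = 0$$
$$\forall i, j : 1 \le i \le j \le m \;\bullet\; \sum_{p=i}^{j-1} g_{ipjk} + I_k(i,j) + e_{ijk} - 1 \ge 0$$
$$g_{ipjk} \in \{0, 1\}, \quad e_{ijk} \in \{0, 1\}, \quad x_{ijk} \ge 0. \tag{P3}$$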
The variables $x_{ijk}$ may take any nonnegative integer values; a useful upper bound can be
derived in terms of the net flow of certain relations through the processor $k$. More precisely, a
relation $\bowtie \{r_i, \ldots, r_j\}$ can be copied at most once for each additional outgoing relation that it
might be used to eventually produce. Also, the initial provision of some relations precludes the
utilization of $\bowtie \{r_i, \ldots, r_j\}$ in manufacturing results that might otherwise have been achieved in
this way.
Exiting relations that wholly contain the information proffered by $\bowtie \{r_i, \ldots, r_j\}$ are candidates
for requiring a copy somewhere in the evaluation process. However, relations initially presented
to the processor that contain $\bowtie \{r_i, \ldots, r_j\}$ are competitors to it, obstructing one opportunity
for its use.
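LEMMA 4.1. An upper bound on any variable $x_{ijk}$, in terms of the tasks allocated to processor $k$,
is
$$\forall i, j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; x_{ijk} \le \sum_{\bar\imath=1}^{i} \sum_{\bar\jmath=j}^{m} \bigl(O_k(\bar\imath, \bar\jmath) - I_k(\bar\imath, \bar\jmath)\bigr).$$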
PROOF. In the interests of brevity, define Uk(i, j) and K(i, j) as
i m i-l i m r-1
and Vi(i, j) = CCCgTpik.

f=l +j p=i P=l j=j p=j
Consider now any relation w {T.%,. . . , rj}. Then, summing the first constraints of problem (P3)
for all relations w (5 ’* . , TJ} that wholly contain w {hi, . . . , rj},
i m i m
That is,
is1 +j j’=J+l t=l J=j +l +j

Expanding the first term of the left-hand side of this expression,
2 2 5 QfPJk = 2 2 2 QZPik+ 2 2 F Stp$

%=I Jcj p=t Z=l +j p=t f=l Tz=j p=i
= uk(&j) + Vk(i, j) + 2 2 &pJk + 2 &zpjk.

i=l f=j+l p=i k1 p=i
Now considering the second,
Likewise rearranging the third term,
= 2 c gijjJk
i=l j<j<ji<m
= 2 f: jCyiiTjik
i=l j’=j+l Jz=j
= Vk(i, j).
Therefore,
-
uk(&j) + vk(6.i) + 2 2 ‘&,, + 2 ‘ggiplk
Z=l J=j+l p=i i=l p=i
Simplifying this, we have that

_
2 Fxz~k = 2 2 (Ok(f,j) - Ik(T,_?)) - 2 2 ‘e$%pJk - 2 g&pjk
i=l Jcj i=1 jcj i=l j=j+l p=i i=l p=i
94 D. J. REID
However, isolating zijk in the left-hand side,
m i-l m
kc xifk = xijk + c c XbJkT

+I j+j Z=l +=j+1
where XrJk 2 0 for all Z andJ according to the constraint system of (P3), implying that
for any i, j : 1 5 i 5 j 5 m, as claimed.
5. MINIMAL EXECUTION COST OF A GROUP OF JOIN QUERIES

The resolution of an entire set of queries in the environment of a distributed system may
involve many join computations performed at numerous processors, interacting by information
transferred through the communications network. While the behavior of any particular site is now
understood, its actual role in the overall computation cannot be known prior to the specification
of a strategy for fulfilling it, as so far implied. By assembling the models of cooperative join
computation for all processors into one, a complete model of multiple join query execution can
be realized.
The constraint set of problem (P3) can be easily modified to incorporate the transferal of
relations between sites. The number of times a given relation $\bowtie \{r_i, \ldots, r_j\}$ is presented to a
processor $k \in S$ is the total number of other processors (or artificial nodes) that transmit this
relation to $k$:
$$I_k(i, j) = \sum_{h \in \mathcal{P}(k)} f_{ijhk}.$$
Similarly, the number of copies required of the processor for use elsewhere is
$$O_k(i, j) = \sum_{l \in \mathcal{A}(k)} f_{ijkl}.$$
Initially available on the network are the relations $r = \{r_1, \ldots, r_m\}$ relevant to the multiple
query execution. In order that every request $\xi : 1 \le \xi \le w$ is satisfied, and under the presumption
that every relation in $r$ is named in at least one query $\xi$, each relation $r_i : 1 \le i \le m$ must be
used at least once. That is, for any $i : 1 \le i \le m$,
$$\sum_{k \in \mathcal{A}(r_i)} f_{i\,i\,r_i\,k} \ge 1.$$
Each request $\xi \in \{1, \ldots, w\}$ specifies that one copy of the final relation $\bowtie \{r_{\kappa(\xi)}, \ldots, r_{\tau(\xi)}\}$ be
presented to the appropriate user, delineated by $u(\xi) \in S'$, residing at a site $q(\xi) \in S$. Then, for
all $\xi : 1 \le \xi \le w$,
$$f_{\kappa(\xi)\,\tau(\xi)\,q(\xi)\,u(\xi)} = 1.$$
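Substituting the new constraints representing the behavior and interaction of processors, the
initial availability of relations, and the final results for those of the original single query
model (P1), a new model to allow the cooperative resolution of numerous queries is realized:
$$\min \;\; \varepsilon_1 \sum_{i=1}^{m} \sum_{j=i}^{m} \sum_{k \in S} \sum_{h \in \mathcal{P}(k)} (\rho_{ij} \cdot c_{hk} + \delta)\, f_{ijhk} \;+\; \varepsilon_2 \sum_{k \in S} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk}$$
subject to
$$\forall i : 1 \le i \le m \;\bullet\; \sum_{k \in \mathcal{A}(r_i)} f_{i\,i\,r_i\,k} \ge 1$$
$$\forall i, j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; \sum_{p=i}^{j-1} g_{ipjk} - \Bigl( \sum_{i'=1}^{i-1} g_{i',\,i-1,\,j,\,k} + \sum_{j'=j+1}^{m} g_{i,\,j,\,j',\,k} \Bigr) + x_{ijk} = \sum_{l \in \mathcal{A}(k)} f_{ijkl} - \sum_{h \in \mathcal{P}(k)} f_{ijhk}$$
$$\forall i, j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; x_{ijk} \cdot e_{ijk} = 0$$
$$\forall i, j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; \sum_{p=i}^{j-1} g_{ipjk} + \sum_{h \in \mathcal{P}(k)} f_{ijhk} + e_{ijk} - 1 \ge 0$$
$$\forall i, j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; x_{ijk} \ge 0$$
$$\forall \xi : 1 \le \xi \le w \;\bullet\; f_{\kappa(\xi)\,\tau(\xi)\,q(\xi)\,u(\xi)} = 1$$
$$\forall (h, k) \in L \;\bullet\; \sum_{i=1}^{m} \sum_{j=i}^{m} \rho_{ij} \cdot f_{ijhk} \le p_{hk}$$
$$\forall k \in S \;\bullet\; \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk} \le \psi_k$$
$$f_{ijhk} \in \{0, 1\}, \quad g_{ipjk} \in \{0, 1\}, \quad e_{ijk} \in \{0, 1\}, \quad x_{ijk} \in \mathbb{Z}. \tag{P4}$$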
Other constraints might also have been considered for inclusion; in particular, it is desirable to
avoid any redundant information transfers across the network. In fact, this is a property inherent
at optimality, even when the costs associated with data transfer are exceeded by those incurred
in performing computations, a situation rarely, if ever, observed in reality [7]. The formulation
given here guarantees that, in any optimal solution, no relation will ever be sent to any site more
than once.
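LEMMA 5.1. Let $\delta > 0$, and $\varepsilon_1 > 0$. Then, at optimality in problem (P4),
$$\forall i, j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; \sum_{h \in \mathcal{P}(k)} f_{ijhk} \le 1.$$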
PROOF. Suppose that, in some solution c to problem (P4), there is a processor k E S and a
relation w {Ti, . . . , ~j} for which
c fijhk = a,
hEP(k)
for some a > 1. That is, there are hl, . . . , h, E P(k) with
VP : 1 < p < a ??fijh,,k = 1.
Consider also a feasible solution F* constructed from c by letting
xtk = Xijk + Cl- 1, fGhjhlk = 1, fGh,,k = 0 for p = 2,. . . , a.

96 D. J. REID
(Note that c* is indeed feasible, if c is.) Then, if ,%and 2” are the objective values of < and q*,
respectively,
a
&,g*+fl cbij ’ Chk + S)fijh,,k,
p=2
with(pij.ch~+6)>Oandcr>O,sothat~>i*. Consequently c cannot be optimal. I
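
The saving on which this proof rests can be illustrated numerically. The Python sketch below takes the incoming links over which one relation is sent to the same site and reports how far the objective falls when all but the first transfer are dropped. The function name and the figures are hypothetical, chosen only to show the arithmetic.

```python
def redundant_transfer_saving(p_ij, link_costs, eps1, delta):
    """Objective reduction z - z* obtained by keeping only the first of
    several transfers of one relation into the same site (cf. Lemma 5.1).

    link_costs lists the unit costs of the incoming links h_1, ..., h_alpha.
    """
    dropped = link_costs[1:]  # transfers p = 2..alpha are set to zero
    return eps1 * sum(p_ij * c_hk + delta for c_hk in dropped)


costs = [0.5, 0.25, 1.0]  # three incoming links, so alpha = 3 (made-up values)
saving = redundant_transfer_saving(p_ij=40, link_costs=costs, eps1=1, delta=1)
extra_copies = len(costs) - 1  # x_ijk rises by alpha - 1 in the cheaper solution

print(saving, extra_copies)  # (40*0.25 + 1) + (40*1.0 + 1) = 52.0, and 2
```

Because every dropped term is strictly positive when both parameters are positive, the saving is zero only when there is a single incoming transfer.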

The statement in the form of problem (P4), however, includes some constraints that are nonlinear,
obstructing the direct application of efficient and readily available methods suited to linear
programs. The amelioration of this lies in the observation (Lemma 4.1) that

$$x_{ijk} \le \sum_{i'=1}^{i} \sum_{j'=j}^{m} O_k(i', j'),$$

since $I_k(i,j) \ge 0$ for all relations $\bowtie \{r_i, \dots, r_j\}$ and all processors $k$. Restated in terms of the
transmission variables $f_{ijhk}$,

$$x_{ijk} \le \sum_{i'=1}^{i} \sum_{j'=j}^{m} \sum_{l \in A(k)} f_{i'j'kl}.$$
But

$$
\begin{aligned}
\sum_{l \in A(k)} f_{ijkl} &= \sum_{l \in A(k) \cap S} f_{ijkl} + \sum_{l \in A(k) \setminus S} f_{ijkl} \\
&\le \sum_{(k,l) \in L} f_{ijkl} + \sum_{1 \le \xi \le w} f_{ij\,q^{(\xi)}\,u^{(\xi)}} \\
&\le |L| + w,
\end{aligned}
$$

by the definition of the network $D' = (S', L')$, and where $|L|$ is the number of data links in the
network $D = (S, L)$ and $w$ is the number of queries in the group. Therefore,

$$x_{ijk} \le \sum_{i'=1}^{i} \sum_{j'=j}^{m} (|L| + w) \le m^2 (|L| + w),$$

where $m = |r|$ is the number of relations named in the requests. For convenience in the discussion
that follows, denote this bound on $x_{ijk}$ as

$$B = m^2 (|L| + w), \quad \text{so that } x_{ijk} \le B.$$

Note that while this bound may not be particularly tight, it is independent of all decision variables,
and therefore suffices for the purpose intended here.
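
As a small worked instance of this bound, the sketch below computes B and compares it with the count of terms in the double sum for each pair (i, j). It assumes that the double sum runs over i' = 1..i and j' = j..m; the function names and the sample network size are hypothetical.

```python
def copy_bound(m, num_links, w):
    """Uniform bound B = m^2 (|L| + w) on every x_ijk."""
    return m * m * (num_links + w)


def pairwise_bound(i, j, m, num_links, w):
    """Bound for one pair (i, j): the double sum has i * (m - j + 1) terms,
    each at most |L| + w (assumed summation limits, see the lead-in)."""
    return i * (m - j + 1) * (num_links + w)


m, num_links, w = 3, 4, 2  # made-up sizes: 3 relations, 4 data links, 2 queries
B = copy_bound(m, num_links, w)
tightest = max(pairwise_bound(i, j, m, num_links, w)
               for i in range(1, m + 1) for j in range(i, m + 1))

print(B, tightest)  # 54 and 24
```

The gap between the two figures reflects the remark that B need not be tight; what matters for the linearization is only that it is a constant.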
The nonlinear constraints of problem (P4) specify that either $e_{ijk} = 0$ or $x_{ijk} = 0$ or both, by
enforcing

$$x_{ijk} \cdot e_{ijk} = 0,$$

for any relation $\bowtie \{r_i, \dots, r_j\}$ and any processor site $k \in S$. However, $0 \le x_{ijk} \le B$; then observe
that, in any feasible solution, either

$$0 \le \frac{B - x_{ijk}}{B} < 1, \quad \text{for } x_{ijk} > 0,$$

when it is required that $e_{ijk} = 0$, or alternatively,

$$\frac{B - x_{ijk}}{B} = 1, \quad \text{for } x_{ijk} = 0,$$

in which case $e_{ijk} = 0$ or $e_{ijk} = 1$ are both permitted. That is,

$$e_{ijk} \le \frac{B - x_{ijk}}{B},$$

since $e_{ijk}$ is permitted to take values only in $\{0,1\}$. Replacing the original nonlinear constraint
with this expression produces a new linear problem, equivalent to the original:
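$$
\begin{aligned}
\min\ z ={}& \varepsilon_1 \sum_{i=1}^{m} \sum_{j=i}^{m} \sum_{k \in S} \sum_{h \in P(k)} (p_{ij} \cdot c_{hk} + \delta)\, f_{ijhk} \\
&+ \varepsilon_2 \sum_{k \in S} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk}
\end{aligned}
$$

subject to

$$
\begin{aligned}
&\forall i : 1 \le i \le m \;\bullet\; \sum_{k \in A(r_i)} f_{i i r_i k} \ge 1 \\
&\forall i,j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet \\
&\qquad \sum_{p=i}^{j-1} g_{ipjk} - \Bigl( \sum_{i'=1}^{i-1} g_{i'\,i-1\,jk} + \sum_{j'=j+1}^{m} g_{ijj'k} \Bigr) + x_{ijk}
  = \sum_{l \in A(k)} f_{ijkl} - \sum_{h \in P(k)} f_{ijhk} \\
&\forall i,j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; m^2 (|L| + w) \cdot e_{ijk} \le m^2 (|L| + w) - x_{ijk} \\
&\forall i,j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; \sum_{p=i}^{j-1} g_{ipjk} + \sum_{h \in P(k)} f_{ijhk} + e_{ijk} - 1 \ge 0 \\
&\forall i,j : 1 \le i \le j \le m \;\; \forall k \in S \;\bullet\; x_{ijk} \ge 0 \\
&\forall \xi : 1 \le \xi \le w \;\bullet\; f_{\kappa(\xi)\,\tau(\xi)\,q^{(\xi)}\,u^{(\xi)}} = 1 \\
&\forall (h,k) \in L \;\bullet\; \sum_{i=1}^{m} \sum_{j=i}^{m} p_{ij} \cdot f_{ijhk} \le p_{hk} \\
&\forall k \in S \;\bullet\; \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \sum_{p=i}^{j-1} g_{ipjk} \cdot \gamma_{ipjk} \le \psi_k \\
&f_{ijhk} \in \{0,1\}, \quad g_{ipjk} \in \{0,1\}, \quad e_{ijk} \in \{0,1\}, \quad x_{ijk} \in \mathbb{Z}.
\end{aligned}
\tag{P5}
$$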
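
The claimed equivalence between the product constraint of (P4) and its linear replacement in (P5) can be checked exhaustively for any fixed bound. The sketch below does so over every integer value of the copy variable from 0 to B and both values of the binary variable; the function names and the choice B = 54 are illustrative assumptions.

```python
def nonlinear_ok(x, e):
    """Original constraint of (P4): x * e = 0."""
    return x * e == 0


def linear_ok(x, e, B):
    """Replacement constraint of (P5): B * e <= B - x."""
    return B * e <= B - x


def equivalent_on_domain(B):
    """True if both constraints accept exactly the same pairs (x, e)
    with 0 <= x <= B and e in {0, 1}."""
    return all(nonlinear_ok(x, e) == linear_ok(x, e, B)
               for x in range(B + 1) for e in (0, 1))


print(equivalent_on_domain(54))  # True
```

The check also shows why the bound is needed: a value above B with the binary variable at zero satisfies the product constraint but violates the linear one, so the replacement is valid only because no feasible solution exceeds B.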
Observe also that the restriction of the variables $x_{ijk}$ to integer values is redundant; integrality
is already assured by the second set of constraints in the statement of problem (P5), which equate
$x_{ijk}$ to a sum and difference of integer-valued variables. Therefore, in solving this model,
$x_{ijk}$ requires no special attention of the kind that the other decision variables do.
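
A brief sketch of why integrality comes for free: solving the balance constraint for the copy variable gives a sum and difference of 0/1 values. The function and argument names below are hypothetical, and the sample site is invented for illustration.

```python
def copies_from_balance(g_produce, g_consume, f_out, f_in):
    """Solve  N - M + x = O - I  for x, where each argument is a list of
    0/1 values of the corresponding join or transmission variables."""
    n_made, n_used = sum(g_produce), sum(g_consume)
    n_out, n_in = sum(f_out), sum(f_in)
    return (n_out - n_in) - (n_made - n_used)


# A site that builds the relation once, uses it in no further join,
# receives no copy of it, and sends it on two outgoing links.
x = copies_from_balance(g_produce=[1], g_consume=[0, 0], f_out=[1, 1], f_in=[])

print(x, isinstance(x, int))  # 1 True
```

Since the result is determined entirely by binary variables, a solver may treat the copy variable as continuous and still obtain an integer value.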
6. CONCLUSION
The problem of efficiently using the resources of a distributed database system in answering an
arbitrary number of related queries has been modeled in the form of an integer linear program.
This was actualized by focusing upon the role of any processor, in isolation, and assembling these
results to modify an earlier model of single query resolution.
The model first realized displays the undesirable quality of nonlinearity in some of its
constraints. With the realization that, in any feasible solution to this problem, the number of
times any relation is copied cannot exceed a known upper bound, these constraints
can be replaced, producing a final model in the form of a linear integer program, suitable for
solution using the branch-and-bound, or cut set, methods that have been developed for such
problems [12,14,23-25].
REFERENCES
1. D.J. Reid, Incorporating processor costs in optimizing the distributed execution of join queries, Mathl.
Comput. Modelling 20 (3), 7-29, (1994).
2. P.A. Bernstein and D.-M.W. Chiu, Using semi-joins to solve relational queries, Journal of the Association
for Computing Machinery 28 (1), 25-40, (January, 1981).
3. S. Ceri and G. Gottlob, Optimising joins between two partitioned relations in distributed databases, Journal
of Parallel and Distributed Computing 3, 183-205, (1986).
4. M.E. Orlowska, Effective utilization of copies in a transparent distributed environment, Distributed and
Parallel Databases 1, 409-425, (1993).
5. M.W. Orlowski, On optimisation of joins in distributed database system, Future Databases '92, 106-114,
(1992).
6. S. Pramanik and D. Vineyard, Optimizing join queries in distributed databases, IEEE Transactions on
Software Engineering 14 (9), 1319-1326, (1988).
7. D.J. Reid, Optimal distributed execution of join queries, Computers Math. Applic. 27 (11), 27-40, (1994).
8. D. Shasha and T.L. Wang, Optimizing equijoin queries in distributed database systems where relations are
hash partitioned, ACM Transactions on Database Systems 16 (2), 279-308, (1991).
9. C.P. Wang, The complexity of processing tree queries in distributed databases, 2nd IEEE Symposium on
Parallel and Distributed Processing, 604-611, (1990).
10. V. Chvátal, Linear Programming, W.H. Freeman and Company, New York, (1983).
11. J.E. Strum, Introduction to Linear Programming, Holden-Day, San Francisco, (1972).
12. T.C. Hu, Integer Programming and Network Flows, Addison-Wesley, Reading, MA, (1969).
13. A. Kaufmann, Integer and Mixed Programming: Theory and Applications, Academic Press, New York,
(1977).
14. H.A. Taha, Integer Programming: Theory, Applications, and Computations, Academic Press, New York,
(1975).
15. L.A. Wolsey, Generalized dynamic programming methods in integer programming, Mathematical
Programming 4, 222-232, (1973).
16. A.V. Aho, C. Beeri and J.D. Ullman, The theory of joins in relational databases, Association for Computing
Machinery Transactions on Database Systems 4 (3), 297-314, (September, 1979).
17. C. Beeri, R. Fagin, D. Maier and M. Yannakakis, On the desirability of acyclic database schemes, Journal of
the Association for Computing Machinery 30 (3), 479-513, (July, 1983).
18. R. Elmasri and S. Navathe, Fundamentals of Database Systems, Benjamin/Cummings, Redwood City, CA,
(1989).
19. C.-C. Yang, Relational Databases, Prentice-Hall, Englewood Cliffs, NJ, (1986).
20. R. Fagin, Degrees of acyclicity for hypergraphs and relational database schemes, Journal of the Association
for Computing Machinery 30 (3), 514-550, (July, 1983).
21. J. Rissanen, Independent components of relations, ACM Transactions on Database Systems 2 (4), 317-325,
(1977).
22. J. Ullman, Principles of Database Systems, 2nd edition, Computer Science Press, Rockville, MD, (1982).
23. R.J. Dakin, A tree-search algorithm for mixed integer programming problems, The Computer Journal 8,
250-255, (1965-1966).
24. G. Mitra, Investigation of some branch and bound strategies for the solution of mixed integer linear programs,
The Computer Journal 8, 155-170, (1965-1966).
25. A. Schrijver, Theory of Linear and Integer Programming, Wiley, Chichester, NY, (1986).
