The third step is finding the best ordering of operations in order to minimize cost
or optimize performance.
1 Principles of Distributed Database Systems, second edition, Özsu and Valduriez
ICS 611 Spring Semester, 2008 L Gottschalk
So we need:
• fragment statistics
• a forecast of the cardinalities of intermediate result tables.
The search space: all possible execution plans. They all must produce the same
result.
The search strategy: this explores the search space and selects the best plan. The
strategy defines the order in which plans are examined.
The number of alternative join trees that can be produced for N relations is O(N!).
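To get a feel for how quickly this grows, the sketch below counts the N! permutations of N relations, which is the number of linear (left-deep) join orders. The function name `linear_join_orders` is made up for this illustration.

```python
from math import factorial

def linear_join_orders(n: int) -> int:
    """Number of permutations of n relations (linear join orders)."""
    return factorial(n)

# Print the count for a few sizes to show the growth.
for n in (2, 5, 10):
    print(n, linear_join_orders(n))
```

Even at ten relations an exhaustive search would have to consider millions of orderings, which is why heuristics are used to prune the search space.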
Heuristic 1: perform selection and projection when first accessing a base relation.
Another deterministic strategy is the “greedy algorithm”, which builds only one
plan, depth first.
Dynamic programming has acceptable cost only when the number of relations is small.
Randomized strategies therefore start with a plan built by a greedy algorithm, then
try to improve it by visiting its neighbors (nearly identical plans). The neighbors
are found by randomly switching two steps in the plan.
For more than a few relations, randomized strategies do better than deterministic
ones.
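The neighbor-swapping idea can be sketched as follows. This is a minimal illustration, not the algorithm of any particular system: the cost model (sum of estimated intermediate result sizes), the names `plan_cost` and `improve`, and the cardinalities and selectivities are all assumptions, and the starting plan is simply passed in rather than built by a greedy step.

```python
import random

def plan_cost(order, card, sel):
    """Toy cost: sum of the estimated sizes of all intermediate results."""
    size = float(card[order[0]])
    joined = [order[0]]
    total = 0.0
    for r in order[1:]:
        factor = 1.0
        for j in joined:
            # No join predicate between j and r means a factor of 1.0.
            factor *= sel.get(frozenset((j, r)), 1.0)
        size = size * card[r] * factor
        total += size
        joined.append(r)
    return total

def improve(order, card, sel, tries=200, seed=0):
    """Swap two random steps; keep the neighbor only if it is cheaper."""
    rng = random.Random(seed)
    best = list(order)
    best_cost = plan_cost(best, card, sel)
    for _ in range(tries):
        i, j = rng.sample(range(len(best)), 2)
        neighbor = list(best)
        neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
        cost = plan_cost(neighbor, card, sel)
        if cost < best_cost:
            best, best_cost = neighbor, cost
    return best, best_cost

# Hypothetical statistics for illustration only.
card = {"Emp": 400, "Assignments": 1000, "Proj": 100}
sel = {frozenset(("Emp", "Assignments")): 1 / 400,
       frozenset(("Assignments", "Proj")): 1 / 100}
start = ["Emp", "Proj", "Assignments"]  # begins with a Cartesian product
print(plan_cost(start, card, sel))
print(improve(start, card, sel))
```

Because only cheaper neighbors are accepted, this variant can still get stuck in a local minimum; restarting from several different starting plans is one common way to reduce that risk.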
Each element of total time can be given a cost, and therefore cost can be
minimized.
There is a direct trade-off between precision of database statistics and the cost of
maintaining them.
Using these simplifying assumptions, very simple rules of thumb are given for
estimating the result sizes of
selection
projection
Cartesian product
join
semijoin
union, and
difference.
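A few of these rules of thumb can be written down directly. The sketch below uses the common textbook approximations in which a selectivity factor between 0 and 1 scales the input cardinality; the function names are made up here, and the selectivity factors are assumed to be supplied from database statistics. Projection is left out because its result size depends on how many duplicates are removed.

```python
def card_selection(card_r, selectivity):
    """Selection keeps a fraction of the tuples of R."""
    return selectivity * card_r

def card_cartesian(card_r, card_s):
    """Cartesian product pairs every tuple of R with every tuple of S."""
    return card_r * card_s

def card_join(card_r, card_s, join_selectivity):
    """Join keeps a fraction of the Cartesian product."""
    return join_selectivity * card_r * card_s

def card_semijoin(card_r, semijoin_selectivity):
    """Semijoin keeps the fraction of R's tuples that have a match in S."""
    return semijoin_selectivity * card_r

def card_union_bounds(card_r, card_s):
    """(lower, upper) bounds: full overlap versus no overlap."""
    return max(card_r, card_s), card_r + card_s

def card_difference_bounds(card_r, card_s):
    """(lower, upper) bounds for R minus S."""
    return max(0, card_r - card_s), card_r

print(card_join(400, 1000, 0.5))
print(card_union_bounds(400, 100))
```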
Then, the pieces of the calculus query are executed in turn. The intermediate
result of the first piece is consumed (used) by the second piece. (This is a linear
join tree rather than a bushy join tree strategy.)
Instead of systematically doing select operations before joins, System R only does
that if it leads to a better strategy. Every candidate tree is given a cost (total time),
and the lowest cost one is retained.
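The "cost every candidate, keep the cheapest" idea can be sketched with dynamic programming over sets of relations, which is the approach System R is known for. The version below is a toy under stated assumptions: only linear join trees are considered, cost is the sum of estimated intermediate result sizes rather than total time, and the name `best_linear_plan` and the statistics are hypothetical.

```python
from itertools import combinations

def best_linear_plan(card, sel):
    """Return (cost, result_size, order) of the cheapest linear join order."""
    rels = sorted(card)
    # best[subset] holds the cheapest plan found for joining that subset.
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            subset = frozenset(combo)
            for r in combo:  # r is the relation joined last
                rest = subset - {r}
                cost, size, order = best[rest]
                factor = 1.0
                for j in rest:
                    factor *= sel.get(frozenset((j, r)), 1.0)
                new_size = size * card[r] * factor
                candidate = (cost + new_size, new_size, order + (r,))
                # Retain only the lowest-cost plan for each subset.
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(rels)]

# Hypothetical statistics for illustration only.
card = {"Emp": 400, "Assignments": 1000, "Proj": 100}
sel = {frozenset(("Emp", "Assignments")): 1 / 400,
       frozenset(("Assignments", "Proj")): 1 / 100}
print(best_linear_plan(card, sel))
```

Keeping one best plan per subset avoids re-costing every permutation, but the number of subsets still grows exponentially with the number of relations.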
Two approaches:
• Optimize the ordering of joins,
• Replace joins by combinations of semijoins (to minimize communication
costs).
Simplifying assumptions:
Since we are focusing on join processing (and join ordering), we ignore local
processing time for selection and projection.
First, if there are two relations to be joined, R and S, we want to send the smaller to
the larger.
If there are three relations, Emp, Assignmts, and Proj, we must estimate the
resulting cardinalities of Emp⋈Assignmts, Emp⋈Proj, and Assignmts⋈Proj.
Then we can calculate the lowest-cost ordering.
Explanation: Emp⋈Assignmts must be combined with Proj, and so again there
will be network transmission. To calculate that cost, we need to estimate
the size of Emp⋈Assignmts.
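The comparison can be sketched as a small cost table. This is an illustration under assumptions: each relation sits at its own site, cost is just the amount of data shipped, Emp and Proj are linked only through Assignments (so orderings that start with Emp⋈Proj are left out), and the function name, dictionary keys, and sizes are made up.

```python
def cheapest_ordering(size, est_join_size):
    """Return (description, cost) of the cheapest way to ship and join."""
    options = {
        "ship Emp to Assignments, then the result to Proj":
            size["Emp"] + est_join_size["Emp,Assignments"],
        "ship Assignments to Emp, then the result to Proj":
            size["Assignments"] + est_join_size["Emp,Assignments"],
        "ship Assignments to Proj, then the result to Emp":
            size["Assignments"] + est_join_size["Assignments,Proj"],
        "ship Proj to Assignments, then the result to Emp":
            size["Proj"] + est_join_size["Assignments,Proj"],
        "ship Emp and Proj to Assignments":
            size["Emp"] + size["Proj"],
    }
    name = min(options, key=options.get)
    return name, options[name]

# Hypothetical sizes; the join sizes are estimates, not measured values.
size = {"Emp": 400, "Assignments": 1000, "Proj": 100}
est_join_size = {"Emp,Assignments": 1000, "Assignments,Proj": 1000}
print(cheapest_ordering(size, est_join_size))
```

With two relations the table collapses to the rule above: ship the smaller relation to the site of the larger one.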
Example:
Select * from Emp, Assignments, Proj
where (Emp.EmpNo = Assignments.EmpNo and Assignments.ProjNo = Proj.ProjNo)
and Emp.EmpNo = 'fred';
Example:
Select * from Emp, Assignments, Proj
where (Emp.EmpNo = Assignments.EmpNo and Assignments.ProjNo = Proj.ProjNo);
Using semijoins may not be a good idea if local processing cost is significant
compared to communication (network) costs.
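The communication side of this trade-off can be sketched as a simple comparison. To join R (at one site) with S (at another) on attribute A, the plain join ships all of R, while the semijoin approach ships the projection of S on A to R's site and then ships only the matching tuples of R back. The function and parameter names below are made up, and local processing cost is deliberately ignored, which is exactly the caveat above.

```python
def compare_join_and_semijoin(size_r, size_s_join_attr, size_r_reduced):
    """Return (join_cost, semijoin_cost, semijoin_is_cheaper).

    size_r            -- size of R, shipped whole by the plain join
    size_s_join_attr  -- size of the projection of S on the join attribute
    size_r_reduced    -- size of R after the semijoin with S
    """
    join_cost = size_r
    semijoin_cost = size_s_join_attr + size_r_reduced
    return join_cost, semijoin_cost, semijoin_cost < join_cost

# Hypothetical sizes: the semijoin wins only when it shrinks R enough.
print(compare_join_and_semijoin(1000, 50, 200))
print(compare_join_and_semijoin(1000, 600, 500))
```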
3) SDD-1 uses semijoins. Distributed INGRES and SYSTEM R use methods similar to
their non-distributed versions (i.e., joins).
4) All three products use statistical information about the data.
R* Static TotalCost No No
Then, the pieces of the calculus query are executed in turn. The intermediate
result of the first piece is consumed (used) by the second piece. (This is
a linear join tree rather than a bushy join tree strategy.)
As the processing of the query proceeds, “a simple choice for the next subquery is
to take the next one having no predecessor and involving the smaller fragments.”
“An alternative to the limited search is the exhaustive search used by R* where all
possible strategies are evaluated to find the best one. In [one study], the two
approaches are compared on the basis of the size of data transfers. An important
conclusion of this study is that
- exhaustive search significantly outperforms limited search as soon as the query
accesses more than three relations,
- dynamic optimization is beneficial because the exact sizes of the intermediate
results are known.”
This is costly.
The hill-climbing algorithm could pursue either total time or response time, but
SDD-1 pursues only total cost.
The initial feasible solution is found by computing the cost of all solutions that
transfer all the required relations to a single candidate site, and then choosing the
least costly of these.
A cost that is ignored is sending the final result to the site where the result is
needed.
Then, for each possible pair of tables, the cost is computed of joining them (which
requires sending one table to the other’s site) and then sending the result to the
single candidate site.
If one of these costs less, it becomes the new chosen strategy (at least for now).
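The two steps just described can be sketched as follows. This is a simplified illustration, not the SDD-1 algorithm itself: cost is the amount of data shipped, the cost of delivering the final result is ignored as noted above, and the names `initial_solution` and `one_improvement_step`, the sites, and the sizes are all hypothetical.

```python
def initial_solution(site_of, size):
    """Pick the site that is cheapest to ship every other relation to."""
    def cost(target):
        return sum(size[r] for r in site_of if site_of[r] != target)
    target = min(sorted(set(site_of.values())), key=cost)
    return target, cost(target)

def one_improvement_step(site_of, size, join_size, target, base_cost):
    """Try 'ship Ri to Rj's site, join there, ship the result to target'."""
    best = (base_cost, None)
    for ri in site_of:
        for rj in site_of:
            if ri == rj or site_of[ri] == site_of[rj]:
                continue
            pair = frozenset((ri, rj))
            if pair not in join_size:  # no join predicate between them
                continue
            cost = size[ri]  # ship Ri to the site of Rj
            if site_of[rj] != target:
                cost += join_size[pair]  # ship the join result to target
            # Relations outside the pair are still shipped whole.
            cost += sum(size[r] for r in site_of
                        if r not in pair and site_of[r] != target)
            if cost < best[0]:
                best = (cost, (ri, rj))
    return best

# Hypothetical placement and sizes; join sizes are estimates.
site_of = {"Emp": "s1", "Assignments": "s2", "Proj": "s3"}
size = {"Emp": 1000, "Assignments": 1000, "Proj": 100}
join_size = {frozenset(("Emp", "Assignments")): 1000,
             frozenset(("Assignments", "Proj")): 200}
target, base = initial_solution(site_of, size)
print(target, base)
print(one_improvement_step(site_of, size, join_size, target, base))
```

In this made-up case, shipping the small Proj table to the Assignments site and forwarding only the join result beats shipping Assignments and Proj whole, so the alternative replaces the initial solution.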
Greedy algorithms start with an initial feasible solution and then iteratively
improve upon it.
The main problem is that strategies with a higher initial cost are eliminated, even
though one of them might end up having the best overall cost.