Access Path Selection in a

Relation Database Management
System
(summarized in section 2)
Processing an SQL statement
• parsing, optimization, code generation,
execution
• an SQL statement may have many query
blocks (nesting)
Optimizer
• validates parsed query
• collects statistics on referenced relations &
columns
• discovers available access paths for each
relation
• checks for type errors in expressions
• Access path selection:
– determines order of evaluation of query blocks
– a tree of alternate path choices is created for
each query block with more than one relation
– minimum cost access path is chosen from the
tree
• results of optimizer is passed to code
generation and execution components
RSS (Research Storage System)
• storage manager for System R
• Maintains physical storage, access paths,
locking, logging, and recovery
• Relations are stored as a collection of tuples
• tuples are stored on 4K pages; pages are
organized into segments
• segments completely contain one or more
relations
• tuples are accessed via a scan: sequential
scan or index scan
• indexes are B-trees with linked leaves
• sequential scan touches all the pages of a
segment that contains a relation once
• index scans touch all the leaf pages of the
index once; relation pages >=1 times
• if index and data tuples are in the same
order, the data is “clustered”
• scans may takes a set of predicates to apply
to a tuple before returning it
– predicates are of the form (column op value)
Cost computation
• cost = page fetches + W*(RSI calls)
– cost = IO costs + W * CPU costs
• an index that matches a boolean factor of
the query is an efficient access path
Statistics
• NCARD(T): cardinality of the relation T
• TCARD(T): number of pages used for T
• P(T): fraction of pages in a segment used
for T

• ICARD(I): number of distinct keys in index
I
• NINDX(I): number of pages in index I
Selectivity
• column = value : F = 1/ICARD(column) if
there is an index. F = 1/10 otherwise
• column1 = column2: F =
1/MAX(ICARD(column1),
ICARD(column2));
F = 1/ICARD(column i); F = 1/10
• column > value: F = (high key value -
value) / (high key - low key)
• column between value1 and value2: F=
(value2 - value1)/ (high key - low key)
• column IN (list of values): F = (# of items
in list) * (selectivity for column = value)
max is 1/2
• columnA IN subquery: F = (card. of
subquery) / ( card. of subquery relations)
• (pred1) OR (pred2): F = F(pred1) +
F(pred2) - F(pred1) * F(pred2)
• (pred1) AND (pred2): F = F(pred1) *
F(pred2)
• NOT pred: F = 1 - F(pred)
QCARD
• QCARD is ( card. of all relations) * (
F(pred i))
• RSICARD is the expected number of calls
to RSI
( card. of all relations) * ( F(sargable
pred i))
• An “interesting order” is an order specified
by the GROUP BY or ORDER BY clause
• Single relation cost: cheapest access path
which produces the “interesting order” or
cheapest access path plus sorting cost of
result
Cost Table (p. 515)
• index pages fetched plus data pages fetched
plus W times RSI tuple retrieval calls.
• unique index matching an equal predicate:
1+1+W
• clustered index I matching one or more
boolean factors:
F(preds) *(NINDX(I) + TCARD) + W *
RSICARD
• etc…
Joins
• nested loops and merging scans
• merging scans require sorts on the join
column -- another “interesting order”
• n-way joins can be done by a succession of
2-way joins; not necessarily using the same
technique. Results may be pipelined if a sort
is not required.
Join ordering
• n! permutations of relation join orders
• join of (k+1) relation with previous k
relations is independent of first k join order
• avoid Cartesian products when possible;
make them as late as possible
Construct a tree
• construct a tree of possible join orderings:
keep the cheapest order that produces an
interesting ordering.
• First find the best way to access each single
relation for each interesting ordering and
unordered.
• Next, find the best way of joining any
relation to each of these.
• Repeat until all relations have been added to
each branch
• Choose the cheapest strategy that has an
interesting ordering, or the cheapest strategy
plus a sort.
• Total number of solutions to store: 2
n