
Chapter 2

Query Processing and Optimization

4/2/2024 ADB(SSoftware Engineering) 1


Objectives
At the end of this chapter students will be able to …
• Define relational algebra and calculus

• Identify Query Processing steps

• List Query Optimization Process

• Explain approaches to Query Optimization

• Identify how to Implement relational Operators

• Define Pipelining



Formal Relational Query Languages
• Query languages: allow manipulation and retrieval of
data from a database.
• Two mathematical Query Languages form the basis for
“real” languages (e.g. SQL), and for implementation :
• Relational Algebra: More operational, very useful for
representing execution plans.
• Relational Calculus: Lets users describe what they want,
rather than how to compute it. (Non-procedural,
declarative.)
❖Understanding Algebra & Calculus is key to understanding
SQL, query processing!



Relational Algebra
• Relational Algebra is a procedural query language that enables a user to
specify basic retrieval requests.

• The result of a retrieval is a new relation, which may have been formed from one
or more relations. A sequence of relational algebra operations forms a relational
algebra expression, whose result will also be a relation that represents the result of
a database query (or retrieval request).

• Relational Algebra Operations:

• A set of mathematical operators that compose, modify, and combine tuples within different relations.

• Relational algebra operations operate on relations and produce relations (“closure”) from one or more relations.
Relational Algebra: 5 Basic Operations
• Selection (σ): Selects a subset of rows from a relation
(horizontal).
• Projection (π): Retains only wanted columns from a
relation (vertical).
• Cross-product (×): Allows us to combine two relations.
• Set-difference (−): Tuples in r1, but not in r2.
• Union (∪): Tuples in r1 or in r2.
❖Since each operation returns a relation, operations
can be composed!



Select (σ)
• Selects rows that satisfy selection
condition.
• σ<selection condition>(R)
• It filters those tuples that fulfill a given
condition and discards the others.
• The selection condition is a Boolean
expression that uses =, ≠, ≤, ≥, <, > in the
selection predicate, optionally combined with the logical
connectives ∨ (or) and ∧ (and).
• Example: σ<rating > 8>(S2)
• Schema of result identical to schema of
input relation.
• Selection is commutative
E.g: σ<cond1>(σ<cond2>(R)) =
σ<cond2>(σ<cond1>(R))
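
As a rough illustration of selection, the sketch below models a relation as a list of Python dictionaries. The instance `S2` and the helper name `select` are assumptions made for this example, not part of the slides.

```python
# Hypothetical instance of a sailors relation S2 (values are illustrative).
S2 = [
    {"sid": 28, "sname": "yuppy", "rating": 9, "age": 35.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 44, "sname": "guppy", "rating": 5, "age": 35.0},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]


def select(relation, predicate):
    """Selection: keep the rows for which predicate(row) is true.

    The schema of the result is the same as the schema of the input.
    """
    return [row for row in relation if predicate(row)]


# Rows of S2 with rating > 8.
high_rated = select(S2, lambda r: r["rating"] > 8)

# Commutativity: applying two selections in either order gives the same rows.
a = select(select(S2, lambda r: r["rating"] > 8), lambda r: r["age"] < 40)
b = select(select(S2, lambda r: r["age"] < 40), lambda r: r["rating"] > 8)
```

Both nested calls return the same rows, which is the commutativity property stated above.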



Projection (π)
• Deletes attributes that are not in the
projection list.
• π<attribute list>(R)
• Schema of result contains exactly the
fields in the projection list, with the
same names that they had in the
input relation.
• Projection operator eliminates
duplicates!

Examples: π<sname, rating>(S2), π<age>(S2), π<sname, rating>(σ<rating > 8>(S2))

π<sname, rating>(S2):
sname rating
yuppy 9
lubber 8
guppy 5
rusty 10

π<sname, rating>(σ<rating > 8>(S2)):
sname rating
yuppy 9
rusty 10
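
A minimal sketch of projection with duplicate elimination follows. The instance `S2` (in particular its `sid` and `age` values) and the helper name `project` are assumptions made for this example.

```python
# Hypothetical instance of a sailors relation S2 (values are illustrative).
S2 = [
    {"sid": 28, "sname": "yuppy", "rating": 9, "age": 35.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 44, "sname": "guppy", "rating": 5, "age": 35.0},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]


def project(relation, attributes):
    """Projection: keep only the listed attributes and drop duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:  # duplicate elimination
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


names_and_ratings = project(S2, ["sname", "rating"])  # four distinct rows
ages = project(S2, ["age"])  # duplicates of 35.0 collapse into one row
```

With this sample data, projecting on `age` alone returns fewer rows than the input, because three sailors share the same age.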



Union and Set-Difference
• Both operations take two input relations, which must be
union-compatible:
- Same number of fields.
- Corresponding fields have the same type.
• Union (∪): R ∪ S is a relation that includes all tuples that are
either in R or in S or in both R and S.
• It eliminates duplicate tuples.
• The union operator is commutative: R ∪ S = S ∪ R.
• Set Difference (−): R − S is a relation that includes all
tuples that are in R but not in S.
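
Because relations are sets of tuples, union and set-difference map directly onto Python's set operators. In the sketch below the instances `S1` and `S2` and the helper `union_compatible` are assumptions made for illustration.

```python
# Hypothetical instances with schema (sid, sname, rating, age).
S1 = {(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5), (58, "rusty", 10, 35.0)}
S2 = {(28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
      (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)}


def union_compatible(r, s):
    """True if all tuples of r and s have the same arity and column types."""
    schemas = {tuple(type(value) for value in row) for row in r | s}
    return len(schemas) <= 1


assert union_compatible(S1, S2)

union = S1 | S2       # tuples in S1 or S2; duplicates appear only once
difference = S1 - S2  # tuples in S1 that are not in S2
```

The two rows shared by `S1` and `S2` appear once in the union, and only the row that is unique to `S1` survives the difference.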



Cont’d…
[Figure: example instances S1 and S2, with the results of S1 ∪ S2 and S1 − S2]



Cross-Product (Cartesian Product) (X)
• This operation is used to combine tuples from two
relations in a combinatorial fashion.

• S1 × R1: Each row of S1 is paired with each row of R1.

• The result schema has one field per field of S1 and R1,
with field names “inherited” if possible.
• May have a naming conflict: both S1 and R1 have a field
with the same name.
• In this case, we can use the renaming operator (ρ).
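
The pairing of rows can be sketched as follows. Qualifying a conflicting field name with its relation name is a simplification chosen for this example to stand in for explicit renaming; the instances `S1` and `R1` and the function name `cross_product` are likewise assumptions.

```python
from itertools import product

# Hypothetical instances: sailors S1 and reservations R1.
S1 = [
    {"sid": 22, "sname": "dustin", "rating": 7, "age": 45.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]
R1 = [
    {"sid": 22, "bid": 101, "day": "10/10/96"},
    {"sid": 58, "bid": 103, "day": "11/12/96"},
]


def cross_product(r, r_name, s, s_name):
    """Pair every row of r with every row of s.

    Field names are inherited; a name that occurs in both inputs is
    qualified with its relation name to avoid the conflict.
    """
    result = []
    for a, b in product(r, s):
        row = {}
        for field, value in a.items():
            row[f"{r_name}.{field}" if field in b else field] = value
        for field, value in b.items():
            row[f"{s_name}.{field}" if field in a else field] = value
        result.append(row)
    return result


rows = cross_product(S1, "S1", R1, "R1")
```

The result has 3 × 2 rows, and the shared field `sid` appears twice under the qualified names `S1.sid` and `R1.sid`.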



Example: Cross Product
[Figure: example instances R1 and S1, and the result of R1 × S1]



Compound Operators (Additional Operators)
• In addition to the 5 basic operators, there are
several additional “Compound Operators”
• These add no computational power to the language,
but are useful shorthands.
• They can be expressed solely with the basic operators.
• Some of the additional operators are:
• Intersection
• Join
• Division (reading assignment)



Intersection (∩)
• Intersection takes two input relations, which must
be union-compatible.
• R ∩ S is a relation that includes all tuples that are in
both R and S.
• R ∩ S = R − (R − S)

Example: S1 ∩ S2 = S1 − (S1 − S2)

sid sname rating age
31 lubber 8 55.5
58 rusty 10 35.0
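
The identity R ∩ S = R − (R − S) can be checked on sample data. The full contents of `S1` and `S2` below are assumed for illustration; only the two result rows are taken from the example above.

```python
# Hypothetical instances with schema (sid, sname, rating, age).
S1 = {(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5), (58, "rusty", 10, 35.0)}
S2 = {(28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
      (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)}

direct = S1 & S2                 # built-in set intersection
via_difference = S1 - (S1 - S2)  # intersection using only set-difference
```

Both expressions yield the same two tuples, which is why intersection adds no expressive power beyond the basic operators.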
Join (⋈)
• Joins are compound operators involving cross product,
selection, and (sometimes) projection.
• Most common type of join is a “natural join” (often just
called “join”). R ⋈ S conceptually is:
• Compute R × S, where R and S are the results of some
general relational algebra expressions.
• Select rows where attributes that appear in both
relations have equal values.
• Project all unique attributes and one copy of
each of the common ones.
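
The three conceptual steps above can be folded into one nested loop, as in this sketch. The instances `S1` and `R1` and the function name `natural_join` are assumptions; a real DBMS would normally use a more efficient join algorithm.

```python
# Hypothetical instances: sailors S1 and reservations R1.
S1 = [
    {"sid": 22, "sname": "dustin", "rating": 7, "age": 45.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]
R1 = [
    {"sid": 22, "bid": 101, "day": "10/10/96"},
    {"sid": 58, "bid": 103, "day": "11/12/96"},
]


def natural_join(r, s):
    """Natural join of two relations given as lists of dictionaries."""
    result = []
    for a in r:                      # step 1: consider every pair (cross product)
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):  # step 2: equal common attributes
                result.append({**a, **b})          # step 3: one copy of each common one
    return result


joined = natural_join(S1, R1)
```

Only sailors 22 and 58 have reservations in this sample, so the result has two rows, each with a single `sid` column.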



Example: Natural Join
[Figure: example instances S1 and R1, and the result of S1 ⋈ R1]



Overview of Query Processing
• Query processing:
- The steps required to transform a high-level SQL query into a correct and
“efficient” strategy for execution and retrieval.
• In general, a query is a form of questioning, in a line of inquiry.
• A query language is a language in which a user requests information from
the database.
• The aim of query processing is to find information in one or more
databases and deliver it to the user.
• Query optimization:
- The process of choosing a suitable execution strategy for
processing a query.
• Two internal representations of a query:
• Query Tree
• Query graph



Query Processing (Cont…)
• A query expressed in a high-level query language such as SQL must first
be scanned, parsed, and validated.
• Scanner: identifies the language components (keywords, attribute names, relation names).
• Parser: checks the query syntax.
• Validation: checks that all attribute and relation names are valid and semantically
meaningful names in the schema.
• Query tree and query graph are the internal representations of the query.
• Execution strategy: a query plan for retrieving the results of the query from
the database files.
• Query optimization: choosing a reasonably efficient execution strategy.
• Code generator: generates the code to execute that plan.
• Runtime database processor: has the task of running (executing) the
query code, whether in compiled or interpreted mode, to produce the
query result.


Query Processing steps
• Query Processing can be divided into four main phases:
• Decomposition
• Optimization
• Code generation, and
• Execution



Part-2



Query Decomposition
• Decomposition: the process of transforming a high-level
query into a relational algebra query and checking that the
query is syntactically and semantically correct. Query
decomposition consists of parsing and validation.

• Typical stages in query decomposition are:


• Analysis
• Normalization
• Semantic Analysis
• Simplification
• Query Restructuring



Cont…
• Stage 1: Analysis
• In this stage, the query is lexically and syntactically analyzed using
the techniques of programming language compilers.
• This stage verifies that the relations and attributes specified in the
query are defined in the system catalog.
• It also verifies that any operations applied to database objects are
appropriate for the object type.
Example: Staff(staffNo (int), salary (double), position (varchar))
SELECT staffNumber FROM Staff WHERE position > 10;
Q: Will this query be accepted or rejected? Why?
A: Rejected, because:
• In the select list, the attribute staffNumber is not defined for the
Staff relation (should be staffNo).
• In the WHERE clause, the comparison ‘>10’ is incompatible with the
data type position, which is a variable character string.
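
The two checks in this stage can be sketched against a toy system catalog. The dictionary `CATALOG`, the function `analyse`, and the pre-parsed form of the query (a select list plus a list of comparisons) are assumptions made for this example; a real DBMS works on the parser's output.

```python
# Toy system catalog: relation name -> {attribute name: data type}.
CATALOG = {"Staff": {"staffNo": "int", "salary": "double", "position": "varchar"}}

NUMERIC_TYPES = ("int", "double")


def analyse(relation, select_list, comparisons):
    """Return a list of errors; an empty list means the query is accepted.

    comparisons is a list of (attribute, operator, literal) triples.
    """
    if relation not in CATALOG:
        return [f"relation {relation} is not defined"]
    schema = CATALOG[relation]
    errors = []
    for attribute in select_list:
        if attribute not in schema:
            errors.append(f"attribute {attribute} is not defined for {relation}")
    for attribute, operator, literal in comparisons:
        if attribute not in schema:
            errors.append(f"attribute {attribute} is not defined for {relation}")
            continue
        column_is_numeric = schema[attribute] in NUMERIC_TYPES
        literal_is_numeric = isinstance(literal, (int, float))
        if column_is_numeric != literal_is_numeric:
            errors.append(
                f"{attribute} {operator} {literal!r} is incompatible "
                f"with type {schema[attribute]}"
            )
    return errors


# SELECT staffNumber FROM Staff WHERE position > 10;
errors = analyse("Staff", ["staffNumber"], [("position", ">", 10)])
```

For the example query, `errors` holds one entry for the undefined attribute and one for the type mismatch, matching the two reasons given above.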



Cont…
• On completion of this stage, the high-level query has been
transformed into some internal representation that is more
suitable for processing.
• The internal form that is typically chosen is some kind of query
tree, which is constructed as follows:
• A leaf node is created for each base relation in the query.
• A non-leaf node is created for each intermediate relation
produced by a relational algebra operation.
• The root of the tree represents the result of the query.
• The sequence of operations is directed from the leaves to
the root.



Cont…
• Stage 2: Normalization
• converts the query into a normalized form that can be more easily
manipulated.
• The predicate (in SQL, the WHERE condition), which may be arbitrarily
complex, can be converted into one of two forms by applying a few
transformation rules:
• Conjunctive normal form: A sequence of conjuncts that are connected
with the ∧ (AND) operator.
• Each conjunct contains one or more terms connected by the ∨ (OR) operator.
• For example:(position = ‘Manager’ ∨ salary > 20000) ∧ branchNo = ‘B003’ .
• A conjunctive selection contains only those tuples that satisfy all conjuncts.
• Disjunctive normal form: A sequence of disjuncts that are connected with
the ∨ (OR) operator.
• Each disjunct contains one or more terms connected by the ∧ (AND) operator.
• For example, we could rewrite the above conjunctive normal form as: (position
= ‘Manager’ ∧ branchNo = ‘B003’ ) ∨ (salary > 20000 ∧ branchNo = ‘B003’).
• A disjunctive selection contains those tuples formed by the union of all tuples
that satisfy the disjuncts.
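
That the two example forms describe the same predicate can be confirmed with a truth table. In this sketch the Boolean variables `m`, `s` and `b` are stand-ins chosen here for the terms position = ‘Manager’, salary > 20000 and branchNo = ‘B003’.

```python
from itertools import product


def cnf(m, s, b):
    """(position = 'Manager' OR salary > 20000) AND branchNo = 'B003'"""
    return (m or s) and b


def dnf(m, s, b):
    """(position = 'Manager' AND branchNo = 'B003') OR
    (salary > 20000 AND branchNo = 'B003')"""
    return (m and b) or (s and b)


# Compare the two forms on all 2**3 combinations of truth values.
equivalent = all(
    cnf(*values) == dnf(*values)
    for values in product([False, True], repeat=3)
)
```

The rewrite is an application of the distributive law, so `equivalent` is true for every combination.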
Cont…
• Stage 3: Semantic analysis
• The objective is to reject normalized queries that are incorrectly
formulated or contradictory.
• A query is incorrectly formulated if components do not contribute to
the generation of the result, which may happen if some join
specifications are missing.
• A query is contradictory if its predicate cannot be satisfied by any
tuple.
Examples:
1. (position = ‘Manager’ ∧ position = ‘Assistant’)
• is contradictory, as a member of staff cannot be both a Manager and an
Assistant simultaneously.
2. ((position = ‘Manager’ ∧ position = ‘Assistant’) ∨ salary > 20000)
• the predicate could be simplified to (salary > 20000) by interpreting the
contradictory clause as the Boolean value FALSE.
Unfortunately, the handling of contradictory clauses is not consistent
between DBMSs.
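
A simplified contradiction check for the two examples might look like the sketch below. It assumes the predicate is in disjunctive normal form, with each disjunct given as a list of (attribute, operator, constant) terms, and it only detects an attribute equated to two different constants; the function names are hypothetical.

```python
def is_contradictory(conjunct):
    """True if one attribute is required to equal two different constants."""
    required = {}
    for attribute, operator, constant in conjunct:
        if operator != "=":
            continue  # only equality terms are examined in this sketch
        # setdefault stores the first constant seen and returns the stored one.
        if required.setdefault(attribute, constant) != constant:
            return True
    return False


def drop_false_disjuncts(disjuncts):
    """Treat each contradictory disjunct as FALSE and remove it."""
    return [d for d in disjuncts if not is_contradictory(d)]


example1 = [("position", "=", "Manager"), ("position", "=", "Assistant")]
example2 = [example1, [("salary", ">", 20000)]]

simplified = drop_false_disjuncts(example2)
```

Example 1 is flagged as contradictory, and example 2 reduces to the single disjunct salary > 20000.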
Cont…
• Stage 4: Simplification
• The objectives are to detect redundant qualifications, eliminate common
subexpressions, and transform the query to a semantically equivalent but
more easily and efficiently computed form.
• Typically, access restrictions, view definitions, and integrity constraints are
considered at this stage, some of which may also introduce redundancy.
• If the user does not have the appropriate access to all the components of
the query, the query must be rejected.
Example:
CREATE VIEW Staff3 AS
SELECT staffNo, fName, lName, salary, branchNo
FROM Staff
WHERE branchNo = 'B003';

SELECT *
FROM Staff3
WHERE (branchNo = 'B003' AND salary > 20000);

Simplified to:
SELECT staffNo, fName, lName, salary, branchNo FROM Staff
WHERE (branchNo = 'B003' AND salary > 20000) AND branchNo = 'B003';
and the WHERE condition reduces to (branchNo = 'B003' AND salary > 20000).
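
The final reduction relies on idempotence (p AND p = p). A minimal sketch, assuming each predicate is held as a list of conjunct strings and using the hypothetical helper `merge_conjuncts`:

```python
def merge_conjuncts(*conjunct_lists):
    """AND together several conjunct lists, keeping each distinct term once."""
    merged = []
    for conjuncts in conjunct_lists:
        for term in conjuncts:
            if term not in merged:  # p AND p reduces to p
                merged.append(term)
    return merged


query_predicate = ["branchNo = 'B003'", "salary > 20000"]
view_predicate = ["branchNo = 'B003'"]

where_clause = " AND ".join(merge_conjuncts(query_predicate, view_predicate))
```

The redundant copy of the branch condition introduced by the view definition is dropped, leaving the reduced WHERE condition shown above.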
Cont…
• Stage 5: Query restructuring
• The query is restructured to provide a more efficient
implementation.

Query Optimization
• There are two main techniques for query optimization,
although the two strategies are usually combined in
practice.
• Two Approaches:
• Heuristic Approach to query optimization
• Cost Estimation for the Relational Algebra Operations
• The first technique uses heuristic rules that order the
operations in a query.
• The other technique compares different strategies based on
their relative costs and selects the one that minimizes
resource usage.
• Since disk access is slow compared with memory access, disk
access tends to be the dominant cost in query processing for
a centralized DBMS, and it is the one that we concentrate on
exclusively when providing cost estimates.
Cont…
• Generally, we try to reduce the total execution time of the
query, which is the sum of the execution times of all
individual operations that make up the query.
• However, resource usage may also be viewed as the
response time of the query, in which case we concentrate
on maximizing the number of parallel operations.
• Since the problem is computationally intractable with a
large number of relations, the strategy adopted is generally
reduced to finding a near optimum solution.
• Both methods of query optimization depend on database
statistics to evaluate properly the different options that are
available.
• The accuracy and currency of these statistics have a
significant bearing on the efficiency of the execution
strategy chosen.
Cont…
• The statistics cover information about relations, attributes,
and indexes.
• For example, the system catalog may store statistics giving
the cardinality of relations, the number of distinct values for
each attribute, and the number of levels in a multilevel
index.
• Keeping the statistics current can be problematic.
• If the DBMS updates the statistics every time a tuple is
inserted, updated, or deleted, this would have a significant
impact on performance during peak periods.
• An alternative, and generally preferable, approach is to
update the statistics on a periodic basis, for example nightly,
or whenever the system is idle.
• Another approach taken by some systems is to make it the
users’ responsibility to indicate when the statistics are to be
updated.
Heuristics Approach
• The heuristic approach uses knowledge of the characteristics of the relational
algebra operations and of the relationships between the operators to optimize the
query.

• Thus the heuristic approach of optimization will make use of:

• Properties of individual operators

• Association between operators

• Query Tree: Represents relational algebra expression

• Query graph: Represents relational calculus expression

Query Tree
• A query tree is a graphical representation of the operators, relations,
attributes and processing sequence during query processing.
• It is a tree data structure that corresponds to a relational
algebra expression. It represents the input relations of the
query as leaf nodes of the tree, and represents the
relational algebra operations as internal nodes.
• It is composed of three main parts:
• The leaves: the base relations used for processing the query and extracting
the required information.
• The root: the final result (relation) produced by the operations on
the relations used for query processing.
• Nodes: intermediate results or relations before reaching the final result.

• The sequence of execution of operations in a query tree starts from the leaves, continues through the intermediate nodes, and ends at the root.
Cont’d…
• An execution of the query tree consists of executing an internal node
operation whenever its operands are available and then replacing that
internal node by the relation that results from executing the operation.

• Query graph:

– A graph data structure that corresponds to a relational calculus expression. It does not indicate the order in which operations are performed. There is only a single graph corresponding to each query.
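
The bottom-up execution of a query tree described on this slide can be sketched with a small recursive evaluator. The `Node` class, the `execute` function and the tiny Staff and Branch instances are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    name: str
    op: Optional[Callable[..., list]] = None  # None for a leaf (base relation)
    children: tuple = ()
    relation: Optional[list] = None           # set for leaves, filled in for others


def execute(node):
    """Evaluate the tree from the leaves up to the root."""
    if node.relation is not None:
        return node.relation                  # leaf, or node already replaced
    inputs = [execute(child) for child in node.children]
    node.relation = node.op(*inputs)          # replace the node by its result
    return node.relation


# Hypothetical base relations: Staff(staffNo, position, branchNo), Branch(branchNo, city).
staff = Node("Staff", relation=[("S1", "Manager", "B1"),
                                ("S2", "Assistant", "B1"),
                                ("S3", "Manager", "B2")])
branch = Node("Branch", relation=[("B1", "London"), ("B2", "Glasgow")])

managers = Node("select position = 'Manager'",
                op=lambda r: [t for t in r if t[1] == "Manager"],
                children=(staff,))
london = Node("select city = 'London'",
              op=lambda r: [t for t in r if t[1] == "London"],
              children=(branch,))
root = Node("join on branchNo",
            op=lambda s, b: [x + y for x in s for y in b if x[2] == y[0]],
            children=(managers, london))

result = execute(root)
```

After the call, every internal node holds its intermediate relation, and the relation at the root is the answer to the query.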



Cont…
Heuristic Processing Strategies

• Perform Selection operations as early as possible.

• Combine the Cartesian product with a subsequent Selection operation whose predicate represents a join condition into a Join operation.

• Use associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations are executed first.

• Perform Projection operations as early as possible.

• Compute common expressions once.
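
To see why early selection helps, the sketch below evaluates the same query in two ways on synthetic data and compares the sizes of the intermediate results. The relation contents and sizes are invented for this illustration.

```python
# Synthetic relations: Staff(staffNo, position, branchNo), Branch(branchNo, city).
staff = [(i, "Manager" if i % 10 == 0 else "Assistant", i % 10) for i in range(100)]
branch = [(b, "London" if b < 2 else "Glasgow") for b in range(10)]

# Strategy A: Cartesian product first, selection afterwards.
pairs = [(s, b) for s in staff for b in branch]
late = [(s, b) for (s, b) in pairs
        if s[1] == "Manager" and b[1] == "London" and s[2] == b[0]]

# Strategy B: selections first, then a join on branchNo.
managers = [s for s in staff if s[1] == "Manager"]
london = [b for b in branch if b[1] == "London"]
early = [(s, b) for s in managers for b in london if s[2] == b[0]]

size_a = len(pairs)                   # intermediate tuples built by strategy A
size_b = len(managers) + len(london)  # intermediate tuples built by strategy B
```

Both strategies return the same rows, but strategy A materializes every staff and branch pair, while strategy B only builds the two small selections before joining.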


Heuristic Optimization of Query Trees
• Heuristic Optimization of Query Trees:
• The same query could correspond to many different relational
algebra expressions — and hence many different query trees.
• The task of heuristic optimization of query trees is to find a final
query tree that is efficient to execute.
• Example: Find the last names of employees born after
1957 who work on a project named ‘Aquarius’. This query
can be specified in SQL as follows:
Q: SELECT E.Lname
FROM EMPLOYEE E, WORKS_ON W, PROJECT P
WHERE P.Pname = 'Aquarius' AND
P.Pnumber = W.Pno AND W.Essn = E.Ssn AND E.Bdate > '1957-12-31';



Cost Estimation Approach to Query Optimization
• Cost-based query optimization:
• Estimate and compare the costs of executing a query using different execution
strategies and choose the strategy with the lowest cost estimate. (Compare to
heuristic query optimization)

• Issues
• Cost function
• Number of execution strategies to be considered
• Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation/calculation cost
4. Memory usage cost
5. Communication cost
Cont’d…
• The main idea is to minimize the cost of processing a query.
• The cost function is comprised of:
I/O cost + CPU processing cost + communication cost + Storage cost
• These components might have different weights in different
processing environments
• The DBMS will use information stored in the system catalogue
for the purpose of estimating cost.
• The main target of query optimization is to minimize the size of
the intermediate relation.
• The size will have effect in the cost of:
• Disk Access
• Data Transportation
• Storage space in the Primary Memory
• Writing on Disk
Example: Cost Estimation Query optimization
Assume:
– 1000 tuples in Staff.
– 50 tuples in Branch.
– 50 Managers (one for each branch).
– 5 London branches.
– No indexes or sort keys.
– All temporary results are written back to disk (memory is small).
– Tuples are accessed one at a time (not in blocks).
Q. Find all managers who work in a London branch.

SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND s.position = 'Manager' AND
b.city = 'London';



Example: Cost Estimation Query optimization

Three equivalent relational algebra queries corresponding to this SQL statement are:
• σ<(position=‘Manager’) ∧ (city=‘London’) ∧ (Staff.branchNo=Branch.branchNo)>(Staff × Branch)
• σ<(position=‘Manager’) ∧ (city=‘London’)>(Staff ⋈<Staff.branchNo=Branch.branchNo> Branch)
• (σ<position=‘Manager’>(Staff)) ⋈<Staff.branchNo=Branch.branchNo> (σ<city=‘London’>(Branch))



Cont… Cost estimation for Query 1

• σ<(position=‘Manager’) ∧ (city=‘London’) ∧ (Staff.branchNo=Branch.branchNo)>(Staff × Branch)

• Calculates the Cartesian product of Staff and Branch, which requires (1000 + 50) disk accesses to read the relations, and
• creates a relation with (1000 * 50) tuples.
• We then have to read each of these tuples again to test them
against the selection predicate at a cost of another (1000 * 50)
disk accesses,
• giving a total cost of: (1000 + 50) + 2*(1000 * 50) = 101 050 disk
accesses.
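
The arithmetic for Query 1 can be reproduced step by step under the stated assumptions (one disk access per tuple, temporary results written back to disk). The function name `cost_query1` is chosen for this sketch.

```python
def cost_query1(n_staff, n_branch):
    """Disk accesses for a selection applied to the product Staff x Branch."""
    read_inputs = n_staff + n_branch    # read both base relations once
    product_size = n_staff * n_branch   # tuples in the Cartesian product
    write_product = product_size        # temporary result written back to disk
    read_product = product_size         # read again to apply the selection
    return read_inputs + write_product + read_product


total = cost_query1(1000, 50)
```

With 1000 Staff tuples and 50 Branch tuples, this gives 1050 + 50 000 + 50 000 = 101 050 disk accesses, in agreement with the total above.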



Individual Assignment

• Calculate the cost estimate for each of the following queries and show all
the necessary steps clearly.
• Submission Date: 27/07/2016 E.C. until 10:00 LT.

1. σ<(position=‘Manager’) ∧ (city=‘London’)>(Staff ⋈<Staff.branchNo=Branch.branchNo> Branch)

2. (σ<position=‘Manager’>(Staff)) ⋈<Staff.branchNo=Branch.branchNo> (σ<city=‘London’>(Branch))
