
Chapter 2

Query Processing and Optimization


❑After completing this chapter the learner should be familiar
with the following concepts:
❑Query Processing
❑Query processing steps
❑Query optimization
❑Query optimizer approaches
❑Transformation rules
❑Cost estimation approach for query
❑Pipelining
Overview of Query Processing and Optimization
• Query processing: the activities involved in retrieving data from the database.
• Query optimization: the activity of choosing an efficient execution strategy
for processing a query.
• The aim of query optimization is to choose the strategy that minimizes
resource usage.
• A DBMS uses different techniques to process, optimize, and execute
high-level queries (SQL).
• A query expressed in high-level query language must be first scanned, parsed,
and validated.
• The scanner identifies the language components (tokens) in the text of the
query, while the parser checks the correctness of the query syntax.
• The query is also validated against the system catalog to check that the
attribute names and relation names are valid.
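The scanning and validation steps described above can be sketched as follows. This is a minimal illustration, not a real DBMS front end: the `CATALOG` contents, the token classes, and the `scan` and `validate` helpers are all hypothetical, only single-relation queries are handled, and the parsing (syntax-checking) step is left out.

```python
import re

# Hypothetical system catalog: relation name -> set of attribute names.
CATALOG = {"Staff": {"staffNo", "position", "salary", "branchNo"}}

KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

# One alternative per token class; leading whitespace is skipped.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<number>\d+)"
    r"|(?P<string>'[^']*')"
    r"|(?P<name>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<symbol>[<>=]+|[,*]))"
)


def scan(text):
    """Scanner: split the query text into (kind, value) tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        kind = match.lastgroup
        value = match.group(kind)
        if kind == "name" and value.upper() in KEYWORDS:
            kind, value = "keyword", value.upper()
        tokens.append((kind, value))
        pos = match.end()
    return tokens


def validate(tokens, catalog=CATALOG):
    """Validation: check relation and attribute names against the catalog."""
    values = [value for _, value in tokens]
    relation = values[values.index("FROM") + 1]
    if relation not in catalog:
        raise NameError(f"unknown relation {relation}")
    for kind, value in tokens:
        if kind == "name" and value != relation and value not in catalog[relation]:
            raise NameError(f"unknown attribute {value}")
    return True


tokens = scan("SELECT staffNo, salary FROM Staff WHERE salary > 20000")
print(tokens)
print(validate(tokens))
```

A query that names an attribute missing from the catalog, such as `SELECT wage FROM Staff`, is rejected by `validate` with a `NameError` in this sketch.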
Query Processing
Query Processing Phases
• The aim of query processing is to find information in one or more databases and
deliver it to the user quickly and efficiently.
• Traditional techniques work well for databases with standard, single-site relational
structures, but databases containing more complex and diverse types of data demand
new query processing and optimization techniques.
• Query processing can be divided into four main phases: decomposition (consisting of
parsing and validation), optimization, code generation, and execution.
Basic Steps in Query Processing:
❑Step 1: Parsing and translation: the system checks the syntax of the
query.
➢Creates a parse-tree representation of the query.
➢Translates the query into a relational-algebra expression.
➢The parser checks the syntax and verifies that the relations exist.
❑Step 2: Optimization: finding the cheapest evaluation plan for a
query.
➢Each relational-algebra operation can be executed by one of several
different algorithms.
➢A query optimizer must know the cost of each operation.
❑Step 3: Evaluation: the query-execution engine takes a query-evaluation
plan, executes that plan, and returns the answers to the query.
Query Decomposition
• Query decomposition is the first phase of query processing.
• It transforms a high-level query into a relational algebra query and checks that
the query is syntactically and semantically correct.
• The typical stages of query decomposition are analysis, normalization, semantic
analysis, simplification, and query restructuring:
1. Analysis: the query is analyzed lexically and syntactically to check its correctness.
❑At the end of this stage, the high-level query has been transformed into an internal
representation that is more suitable for processing.
• The internal form that is typically chosen is some kind of query tree, which is
constructed as follows:
• A leaf node is created for each base relation in the query.
• A non-leaf node is created for each intermediate relation produced by a relational algebra
operation.
• The root of the tree represents the result of the query.
• The sequence of operations is directed from the leaves to the root.
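The construction rules above can be sketched with a small tree structure. The `Node` class, the `evaluation_order` helper, and the sample query (names of staff working at branches in London, over hypothetical Staff and Branch relations) are illustrative assumptions, not part of the original material.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                    # relation name or algebra operation
    children: list = field(default_factory=list)  # empty for a base relation

    def is_leaf(self):
        return not self.children


def evaluation_order(node):
    """Post-order walk: operations are visited from the leaves to the root."""
    order = []
    for child in node.children:
        order.extend(evaluation_order(child))
    order.append(node.label)
    return order


# Leaves: one node per base relation.
staff = Node("Staff")
branch = Node("Branch")
# Non-leaf nodes: one per intermediate relation produced by an operation.
london = Node("σ city='London'", [branch])
join = Node("⋈ Staff.branchNo=Branch.branchNo", [staff, london])
# Root: the result of the query.
root = Node("π fName,lName", [join])

for step in evaluation_order(root):
    print(step)
```

The post-order traversal lists the base relations first and the root projection last, which matches the leaves-to-root direction of evaluation.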
2. Normalization: the normalization stage of query processing converts the query
into a normalized form that can be more easily manipulated.
❖The WHERE predicate is converted to conjunctive (∧) or disjunctive (∨)
normal form.
❖Conjunctive normal form: A sequence of conjuncts that are connected with the ∧
(AND) operator.
❖Each conjunct contains one or more terms connected by the ∨ (OR) operator.
❖For example: (position = ‘Manager’ ∨ salary > 20000) ∧ branchNo = ‘B003’. A
conjunctive selection contains only those tuples that satisfy all conjuncts.
❖Disjunctive normal form: A sequence of disjuncts that are connected with the ∨ (OR)
operator.
❖Each disjunct contains one or more terms connected by the ∧ (AND) operator.
❖For example, we could rewrite the above conjunctive normal form as: (position = ‘Manager’
∧ branchNo = ‘B003’) ∨ (salary > 20000 ∧ branchNo = ‘B003’). A disjunctive selection
contains those tuples formed by the union of all tuples that satisfy the disjuncts.
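As a quick check, the two normal forms in the example can be compared by brute force. The sketch below evaluates both predicates on every combination of a few sample values; the values are hypothetical and were chosen so that each atomic term is true in some rows and false in others.

```python
from itertools import product


def cnf(position, salary, branch_no):
    # (position = 'Manager' OR salary > 20000) AND branchNo = 'B003'
    return (position == "Manager" or salary > 20000) and branch_no == "B003"


def dnf(position, salary, branch_no):
    # (position = 'Manager' AND branchNo = 'B003') OR (salary > 20000 AND branchNo = 'B003')
    return (position == "Manager" and branch_no == "B003") or (
        salary > 20000 and branch_no == "B003"
    )


positions = ["Manager", "Assistant"]
salaries = [15000, 25000]
branches = ["B003", "B005"]

rows = list(product(positions, salaries, branches))
agree = all(cnf(*row) == dnf(*row) for row in rows)
print(len(rows), agree)
```

Because the eight rows cover every truth assignment of the three atomic terms, agreement on all of them shows that the two forms select the same tuples.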
3. Semantic Analysis
• The objective of semantic analysis is to reject normalized queries that are incorrectly
formulated or contradictory.
• A query is incorrectly formulated if components do not contribute to the generation
of the result, which may happen if some join specifications are missing.
• A query is contradictory if its predicate cannot be satisfied by any tuple.
• For example, the predicate (position = ‘Manager’ ∧ position = ‘Assistant’) on the
Staff relation is contradictory, as a member of staff cannot be both a Manager and
an Assistant simultaneously.
• However, the predicate ((position = ‘Manager’ ∧ position = ‘Assistant’) ∨ salary >
20000) could be simplified to (salary > 20000) by interpreting the contradictory
clause as the Boolean value FALSE.
• Unfortunately, the handling of contradictory clauses is not consistent between
DBMSs.
• Two graph-based checks can be used to test whether a query is correctly formulated:
• Construct a relation connection graph: nodes represent the relations and the result, and
edges represent joins and the sources of projection operations. If the graph is not
connected, the query is incorrectly formulated.
• Construct a normalized attribute connection graph: nodes represent attribute references
(plus a constant 0), and directed edges represent joins and selection operations. If the
graph has a cycle for which the valuation sum is negative, the query is contradictory.
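The first check can be sketched as a plain connectivity test. The graph construction here is an assumption: one node per relation plus a RESULT node, an edge for each join, and an edge from each relation that supplies projected attributes to RESULT. The relation names and the `is_connected` helper are hypothetical.

```python
from collections import deque


def is_connected(nodes, edges):
    """Breadth-first search over an undirected graph."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacent = {n: set() for n in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        for nxt in adjacent[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(nodes)


nodes = ["EMPLOYEE", "WORKS_ON", "PROJECT", "RESULT"]

# All join specifications present; EMPLOYEE supplies the projected attributes.
good_edges = [("EMPLOYEE", "WORKS_ON"), ("WORKS_ON", "PROJECT"), ("EMPLOYEE", "RESULT")]
# Same query with the WORKS_ON-PROJECT join condition missing.
bad_edges = [("EMPLOYEE", "WORKS_ON"), ("EMPLOYEE", "RESULT")]

print(is_connected(nodes, good_edges))  # connected: correctly formulated
print(is_connected(nodes, bad_edges))   # disconnected: incorrectly formulated
```

In the second case PROJECT is unreachable, which corresponds to a relation that does not contribute to the result because a join specification is missing.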
4. Simplification: The objectives of the simplification stage are to detect redundant
qualifications, eliminate common subexpressions, and transform the query to a semantically
equivalent but more easily and efficiently computed form.
❖Typically, access restrictions, view definitions, and integrity constraints are
considered at this stage.
❖If the user does not have the appropriate access to all the components of the query,
the query must be rejected.
5. Query Restructuring: In the final stage of query decomposition, the query is
restructured to provide a more efficient implementation.
❖More than one translation is possible, so transformation rules are used to convert one form into another.
❖Most real-world data is not well structured.
❖Today's databases typically contain much non-structured data such as text, images,
video, and audio, often distributed across computer networks.
❖In this complex environment, efficient and accurate query processing becomes quite
challenging.
❖Many techniques can help (not only in storage and query processing, but also in
concurrency control, recovery, etc.).
❑Query Optimization
• Query optimization is the activity of choosing an efficient execution strategy for processing a
query.
• Everyone wants the performance of their database to be optimal.
• In particular, there is often a requirement for a specific query, or an object that is based on a
query, to run faster.
• The problem of query optimization is to find the sequence of steps that produces the answer to
the user's request in the most efficient manner, given the database structure.
• The performance of a query is affected by the tables or queries that underlie it and by the
complexity of the query.
• Query optimizers are one of the main means by which modern database systems achieve their
performance advantages.
• Given a request for data manipulation or retrieval, an optimizer will choose an optimal plan for
evaluating the request from among the manifold alternative strategies.
• That means there are many ways (access paths) of accessing the desired file or record.
• The optimizer tries to select the most efficient (cheapest) access path for accessing the data.
• The DBMS is responsible for picking the best execution strategy based on various considerations.
• Query optimizers are among the largest and most complex modules of database systems.
• Most efficient processing: the least amount of I/O and CPU resources.
• Selection of the best method: in a non-procedural language the system does the optimization
at the time of execution.
• On the other hand, in a procedural language, programmers have some flexibility in selecting
the best method.
❖For optimizing the execution of a query the programmer
must know:
❖File organization.
❖Record access mechanism and primary or secondary key.
❖Data location on disk.
❖Data access limitations.
Approaches to Query Optimization
Heuristics Approach
• The heuristic approach to query optimization uses transformation rules to
convert one relational algebra expression into an equivalent form that is
known to be more efficient.
• It uses knowledge of the characteristics of the relational algebra operations
and of the relationships between the operators to optimize the query.
• Thus the heuristic approach to optimization makes use of:
❑Properties of individual operators
❑Associations between operators
❑Query tree: a graphical representation of the operators, relations,
attributes, predicates, and processing sequence during query processing.
• A query tree is composed of three main parts: the leaves, the root, and the
internal (non-leaf) nodes.
Transformation Rules for the Relational Algebra Operations
• By applying transformation rules, the optimizer can transform one relational
algebra expression into an equivalent expression that is known to be more efficient.
• Use these rules to restructure the relational algebra tree generated during query
decomposition.
1. Conjunctive selection operations can cascade into individual selection operations (and vice
versa). This transformation is sometimes referred to as cascade of selection.
2. Commutativity of Selection operations.
3. In a sequence of Projection operations, only the last in the sequence is required. This is also
called cascade of projection.
4. Commutativity of Selection and Projection. If the predicate p involves only the attributes in
the projection list, then the Selection and Projection operations commute.
5. Commutativity of Theta join and Cartesian product.
6. Commutativity of Selection and Theta join (or Cartesian product). If the selection predicate
involves only attributes of one of the relations being joined, then the Selection and Join (or
Cartesian product) operations commute.
7. Commutativity of Projection and Theta join (or Cartesian product).
8. Commutativity of Union and Intersection (but not Set difference).
9. Commutativity of Selection and set operations (Union, Intersection, and Set difference).
10. Commutativity of Projection and Union.
11. Associativity of Theta join (and Cartesian product). Cartesian product and Natural join are
always associative.
12. Associativity of Union and Intersection (but not Set difference).
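Rules 1 to 4 can be checked on a small in-memory relation. This is only a sketch: the `select` and `project` helpers, the `staff` data, and the predicates `p` and `q` are hypothetical, and a relation is modelled as a list of dictionaries with duplicates removed on projection.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]


def project(attrs, rel):
    """Projection with duplicate elimination; input order is preserved."""
    seen, out = set(), []
    for t in rel:
        key = tuple((a, t[a]) for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(key))
    return out


staff = [
    {"name": "Ann", "position": "Manager", "salary": 30000},
    {"name": "Bob", "position": "Assistant", "salary": 12000},
    {"name": "Cy", "position": "Manager", "salary": 18000},
]


def p(t):
    return t["position"] == "Manager"


def q(t):
    return t["salary"] > 20000


# Rule 1: cascade of selection.
rule1 = select(lambda t: p(t) and q(t), staff) == select(p, select(q, staff))
# Rule 2: commutativity of selection.
rule2 = select(p, select(q, staff)) == select(q, select(p, staff))
# Rule 3: cascade of projection (only the last projection is required).
rule3 = project(["name"], project(["name", "salary"], staff)) == project(["name"], staff)
# Rule 4: selection and projection commute when p uses only projected attributes.
attrs = ["name", "position"]
rule4 = project(attrs, select(p, staff)) == select(p, project(attrs, staff))

print(rule1, rule2, rule3, rule4)
```

Agreement on one sample relation does not prove the rules, but it shows what "equivalent expression" means: both sides produce the same tuples, while the amount of intermediate work can differ.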
Main Heuristic
• The main heuristic is to first apply the operations that reduce the size of
the intermediate relations. That is:
• Perform SELECTION as early as possible: this reduces the cardinality
(number of tuples) of the relation.
• Perform PROJECTION as early as possible: this reduces the degree
(number of attributes) of the relation.
• Both are accomplished by placing the SELECT and PROJECT operations as
far down the tree as possible.
• SELECT and JOIN operations with the most restrictive conditions, which
produce the smallest absolute size, should be executed before other similar
operations.
• This is achieved by reordering the JOIN nodes in the query tree.
• Example: consider the following schemas, where the EMPLOYEE and the
PROJECT relations are related by the WORKS_ON relation.
• EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)
• PROJECT (PProjID, PName, PLocation, PFund, PManagerID)
• WORKS_ON (WEmpID, WProjID)
• WEmpID (employee identification) and WProjID (project identification) are
foreign keys in the WORKS_ON relation that reference the EMPLOYEE and PROJECT
relations respectively.
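The effect of the main heuristic on intermediate result size can be illustrated with the EMPLOYEE and WORKS_ON schemas. The data, the department value 'IT', and the query itself (project assignments of employees in the IT department) are hypothetical and chosen only for the demonstration; the join is a naive nested loop.

```python
# Hypothetical data: 100 employees, 20 of them in 'IT'; each works on 2 projects.
employee = [{"EEmpID": i, "Dept": "IT" if i % 5 == 0 else "HR"} for i in range(1, 101)]
works_on = [{"WEmpID": i, "WProjID": proj} for i in range(1, 101) for proj in (1, 2)]


def join(left, right, cond):
    """Nested-loop theta join."""
    return [{**l, **r} for l in left for r in right if cond(l, r)]


def same_employee(e, w):
    return e["EEmpID"] == w["WEmpID"]


def in_it(t):
    return t["Dept"] == "IT"


# Plan A: join first, then select.
joined = join(employee, works_on, same_employee)
plan_a = [t for t in joined if in_it(t)]

# Plan B: select first (pushed down the tree), then join.
it_staff = [e for e in employee if in_it(e)]
plan_b = join(it_staff, works_on, same_employee)

print("plan A intermediate tuples:", len(joined))
print("plan B intermediate tuples:", len(it_staff))
print("same answer:", plan_a == plan_b)
```

Both plans return the same 40 tuples, but plan A builds a 200-tuple intermediate relation while plan B carries only the 20 selected employees into the join.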
Cost Estimation Approach to Query Optimization
• The main idea is to minimize the cost of processing a query.
• The cost function is composed of:
❑I/O cost + CPU processing cost + communication cost + storage cost
• These components might have different weights in different processing
environments.
• The DBMS will use information stored in the system catalogue for the
purpose of estimating cost.
• The main target of query optimization is to minimize the size of the
intermediate relations. Their size affects the cost of:
❑Disk access
❑Data transportation
❑Storage space in primary memory
❑Writing to disk
• The statistics in the system catalogue used for cost estimation are:
✓Cardinality of a relation (r): the number of tuples currently contained in the relation
✓Degree of a relation: the number of attributes of the relation
✓Blocking factor: the number of tuples of the relation that can be stored in one block
✓Total number of blocks used by the relation
✓Number of distinct values of an attribute (d)
✓Selection cardinality of an attribute (S): the average number of records that
satisfy an equality condition on the attribute, S = r/d
• Using the above information, one can calculate the cost of executing a query
under each strategy and select the best strategy, which is the one with the
minimum processing cost.
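A small worked example of these statistics follows. All the numbers are hypothetical, the helper names are illustrative, and the estimate assumes attribute values are uniformly distributed, which is what the formula S = r/d implies.

```python
import math


def selection_cardinality(r, d):
    """Average number of tuples matching an equality condition: S = r / d."""
    return r / d


def blocks_needed(tuples, blocking_factor):
    """Blocks occupied when `blocking_factor` tuples fit in one block."""
    return math.ceil(tuples / blocking_factor)


r = 10_000   # cardinality of the relation (hypothetical)
d = 50       # distinct values of the attribute (hypothetical)
bfr = 40     # tuples per block (hypothetical)

S = selection_cardinality(r, d)
full_scan_blocks = blocks_needed(r, bfr)
matching_blocks = blocks_needed(S, bfr)  # only if matching tuples are stored together

print(S, full_scan_blocks, matching_blocks)
```

With these figures an equality selection is expected to return 200 tuples. A full scan reads 250 blocks, whereas an access path that reaches the matching tuples directly could read as few as 5, which is the kind of comparison the optimizer makes between strategies.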
Cost Components for Query Optimization
• The cost of query execution can be broken down into the following major components.
1. Access Cost of Secondary Storage: a query needs some part of the data stored in the
database, so data must be accessed from secondary storage. The disk access cost can in
turn be analyzed in terms of searching, reading, and writing the data blocks used to store
some portion of a relation.
• The disk access cost varies depending on the file organization used and the
access method implemented for the file organization.
• In addition to the file organization, the data allocation scheme (whether the data
is stored contiguously or in a scattered manner) affects the disk access cost.
2. Storage Cost: a query is composed of many database operations, so there can be one or
more intermediate results before the final output is reached.
• These intermediate results should be stored in primary memory for further processing.
• The bigger the intermediate relation, the larger the memory requirement, which has an
impact on the limited available space.
• This is considered the cost of storage.
3. Computation Cost: a query is composed of many operations.
• The operations could be database operations like reading and writing to a disk, or
mathematical and other operations like searching, sorting, merging, and computation on
field values.
4. Communication Cost: in most database systems the database resides at one station and
various queries originate from different terminals.
• This affects the performance of the system, adding cost to query processing.
• Thus, the cost of transporting data between the database site and the terminal from which
the query originates should be analyzed.
Pipelining
• Pipelining is another technique used to improve the performance of queries.
• It is sometimes known as stream-based processing or on-the-fly processing of queries.
• Whereas query optimization tries to reduce the size of the intermediate results, pipelining
avoids storing them: the tuples produced by one operation are passed directly to the next
operation.
• Thus the technique is said to reduce the number of intermediate relations in query
execution.
• Pipelining performs multiple operations on a single relation in a pipeline.
• Generally, a pipeline is implemented as a separate process or thread within the DBMS.
• Each pipeline takes a stream of tuples from its inputs and creates a stream of tuples as its
output.
• A buffer is created for each pair of adjacent operations to hold the tuples being passed
from the first operation to the second one.
• One drawback with pipelining is that the inputs to operations are not necessarily available
all at once for processing.
• This can restrict the choice of algorithms.
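The stream-of-tuples idea can be sketched with Python generators, where each operation pulls one tuple at a time from its input. This is an illustration only: it runs in a single thread rather than in separate processes, the generator hand-off stands in for the buffer between adjacent operations, and the `staff` data and the `scanned` log are hypothetical.

```python
scanned = []  # records which tuples the scan has read so far


def scan(relation):
    """Leaf operation: emit the tuples of a base relation one by one."""
    for t in relation:
        scanned.append(t["name"])
        yield t


def select(pred, tuples):
    """Pass on only the tuples that satisfy the predicate."""
    for t in tuples:
        if pred(t):
            yield t


def project(attrs, tuples):
    """Keep only the requested attributes of each tuple."""
    for t in tuples:
        yield {a: t[a] for a in attrs}


staff = [
    {"name": "Ann", "salary": 30000},
    {"name": "Bob", "salary": 12000},
    {"name": "Cy", "salary": 25000},
]

# No intermediate relation is built: each tuple flows through all three operations.
pipeline = project(["name"], select(lambda t: t["salary"] > 20000, scan(staff)))

first = next(pipeline)
scanned_at_first = list(scanned)  # only part of the input has been read so far
rest = list(pipeline)

print(first, scanned_at_first, rest)
```

The first result is available after a single tuple has been scanned, which shows the benefit. It also shows the drawback noted above: an operation that needs its whole input at once, such as a sort, cannot be written as a simple stage of this kind.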
End of Chapter 2
