After completing this chapter the learner should be familiar with the following concepts:
❑ Query processing
❑ Query processing steps
❑ Query optimization
❑ Query optimizer approaches
❑ Transformation rules
❑ Cost estimation approach for queries
❑ Pipelining

Overview of Query Processing and Optimization
• Query processing: the activities involved in retrieving data from the database are called query processing.
• Query optimization: the activity of choosing an efficient execution strategy for processing a query is called query optimization.
• The aim of query optimization is to choose, from the possible strategies, the one that minimizes resource usage.
• A DBMS uses different techniques to process, optimize, and execute high-level queries (SQL).
• A query expressed in a high-level query language must first be scanned, parsed, and validated.
• The scanner identifies the language components (tokens) in the text of the query, while the parser checks the correctness of the query syntax.
• The query is also validated (by accessing the system catalog) to check that the attribute names and relation names are valid.

Query Processing Phases
• The aim of query processing is to find information in one or more databases and deliver it to the user quickly and efficiently.
• Traditional techniques work well for databases with standard, single-site relational structures, but databases containing more complex and diverse types of data demand new query processing and optimization techniques.
• Query processing can be divided into four main phases: decomposition (consisting of parsing and validation), optimization, code generation, and execution.

Basic Steps in Query Processing
❑ Step 1. Parsing and translation: the system checks the syntax of the query.
➢ Creates a parse-tree representation of the query.
➢ Translates the query into a relational-algebra expression.
➢ The parser checks the syntax and verifies that the relations exist.
❑ Step 2. Optimization: finding the cheapest evaluation plan for the query.
❑ Each relational-algebra operation can be executed by one of several different algorithms.
❑ A query optimizer must know the cost of each operation.
❑ Step 3. Evaluation: the query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

Query Decomposition
• Query decomposition is the first phase of query processing.
• It is used to transform a high-level query into a relational algebra query, and to check that the query is syntactically and semantically correct.
• The typical stages of query decomposition are analysis, normalization, semantic analysis, simplification, and query restructuring.
1. Analysis: lexical and syntactical analysis of the query for correctness.
❑ In this stage, the high-level query is transformed into some internal representation that is more suitable for processing.
• The internal form that is typically chosen is some kind of query tree, which is constructed as follows:
• A leaf node is created for each base relation in the query.
• A non-leaf node is created for each intermediate relation produced by a relational algebra operation.
• The root of the tree represents the result of the query.
• The sequence of operations is directed from the leaves to the root.
2. Normalization: this stage converts the query into a normalized form that can be more easily manipulated.
❖ The WHERE predicate is converted into conjunctive (∧) or disjunctive (∨) normal form.
❖ Conjunctive normal form: a sequence of conjuncts that are connected with the ∧ (AND) operator; each conjunct contains one or more terms connected by the ∨ (OR) operator.
❖ For example: (position = 'Manager' ∨ salary > 20000) ∧ branchNo = 'B003'. A conjunctive selection contains only those tuples that satisfy all conjuncts.
❖ Disjunctive normal form: a sequence of disjuncts that are connected with the ∨ (OR) operator.
❖ Each disjunct contains one or more terms connected by the ∧ (AND) operator.
❖ For example, we could rewrite the above conjunctive normal form as: (position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003'). A disjunctive selection contains those tuples formed by the union of all tuples that satisfy the disjuncts.
3. Semantic Analysis
• The objective of semantic analysis is to reject normalized queries that are incorrectly formulated or contradictory.
• A query is incorrectly formulated if components do not contribute to the generation of the result, which may happen if some join specifications are missing.
• A query is contradictory if its predicate cannot be satisfied by any tuple.
• For example, the predicate (position = 'Manager' ∧ position = 'Assistant') on the Staff relation is contradictory, as a member of staff cannot be both a Manager and an Assistant simultaneously.
• However, the predicate ((position = 'Manager' ∧ position = 'Assistant') ∨ salary > 20000) can be simplified to (salary > 20000) by interpreting the contradictory clause as the Boolean value FALSE.
• Unfortunately, the handling of contradictory clauses is not consistent between DBMSs.
• Two graph-based checks are used to detect such queries:
• Construct a relation connection graph, whose edges represent the join operations and the source of the projection operations: if the graph is not connected, the query is incorrectly formulated.
• Construct a normalized attribute connection graph, whose weighted edges represent the selection operations: if the graph has a cycle for which the valuation sum is negative, the query is contradictory.
4. Simplification: the objectives of the simplification stage are to detect redundant qualifications, eliminate common subexpressions, and transform the query into a semantically equivalent but more easily and efficiently computed form.
❖ Typically, access restrictions, view definitions, and integrity constraints are considered at this stage.
❖ If the user does not have the appropriate access to all the components of the query, the query must be rejected.
5. Query Restructuring: in the final stage of query decomposition, the query is restructured to provide a more efficient implementation.
❖ Where more than one translation is possible, transformation rules are used to choose between them.
❖ Most real-world data is not well structured: today's databases typically contain much unstructured data such as text, images, video, and audio, often distributed across computer networks.
❖ In this complex environment, efficient and accurate query processing becomes quite challenging, and many techniques are involved (not only in storage and query processing, but also in concurrency control, recovery, etc.).

Query Optimization
• The activity of choosing an efficient execution strategy for processing a query is called query optimization.
• Everyone wants the performance of their database to be optimal; in particular, there is often a requirement for a specific query, or an object that is query based, to run faster.
• The problem of query optimization is to find the sequence of steps that produces the answer to the user's request in the most efficient manner, given the database structure.
• The performance of a query is affected by the tables or queries that underlie the query and by the complexity of the query.
• Query optimizers are one of the main means by which modern database systems achieve their performance advantages.
• Given a request for data manipulation or retrieval, an optimizer will choose an optimal plan for evaluating the request from among the manifold alternative strategies.
• That means there are many ways (access paths) of accessing the desired file/record; the optimizer tries to select the most efficient (cheapest) access path for accessing the data.
• The DBMS is responsible for picking the best execution strategy based on various considerations.
• Query optimizers are already among the largest and most complex modules of database systems.
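As a toy sketch (hypothetical in-memory relations and data, not from the text), the following Python fragment shows why the optimizer's choice among equivalent strategies matters: two plans that produce the same answer can differ greatly in the size of their intermediate results.

```python
# Toy illustration of query optimization: two equivalent strategies for
# sigma(salary > 20000)(EMPLOYEE join DEPARTMENT), compared by the size
# of their intermediate results. Relation and attribute names are invented.

employee = [{"empID": i, "dept": i % 10, "salary": 15000 + 1000 * i}
            for i in range(100)]
department = [{"dept": d, "dname": f"D{d}"} for d in range(10)]

# Strategy 1: Cartesian product first (100 x 10 = 1000 intermediate tuples),
# then apply the join condition and the selection predicate.
product = [(e, d) for e in employee for d in department]
plan1 = [(e, d) for (e, d) in product
         if e["dept"] == d["dept"] and e["salary"] > 20000]

# Strategy 2: apply the selection first, so only the qualifying employees
# (94 tuples) reach the join. Same answer, far smaller intermediate relation.
selected = [e for e in employee if e["salary"] > 20000]
plan2 = [(e, d) for e in selected for d in department
         if e["dept"] == d["dept"]]

assert plan1 == plan2                 # equivalent strategies, identical answers
print(len(product), len(selected))    # 1000 vs 94 intermediate tuples
```

A real optimizer enumerates such alternatives symbolically, using the transformation rules and cost estimates discussed below, rather than executing them.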
• Most efficient processing: the least amount of I/O and CPU resources.
• Selection of the best method: in a non-procedural language the system does the optimization at the time of execution; in a procedural language, programmers have some flexibility in selecting the best method.
❖ To optimize the execution of a query the programmer must know:
❖ File organization.
❖ Record access mechanism and primary or secondary key.
❖ Data location on disk.
❖ Data access limitations.

Approaches to Query Optimization

Heuristics Approach
• The heuristic approach to query optimization uses transformation rules to convert one relational algebra expression into an equivalent form that is known to be more efficient.
• It uses knowledge of the characteristics of the relational algebra operations and of the relationships between the operators to optimize the query.
• Thus the heuristic approach of optimization makes use of:
❖ Properties of individual operators.
❖ Associations between operators.
❖ Query tree: a graphical representation of the operators, relations, attributes, and predicates, and of the processing sequence during query processing. A query tree is composed of three main parts: the leaves, the root, and the intermediate nodes.

Transformation Rules for the Relational Algebra Operations
• By applying transformation rules, the optimizer can transform one relational algebra expression into an equivalent expression that is known to be more efficient.
• These rules are used to restructure the relational algebra tree generated during query decomposition.
1. Conjunctive Selection operations can cascade into individual Selection operations (and vice versa). This transformation is sometimes referred to as cascade of selection.
2. Commutativity of Selection operations.
3. In a sequence of Projection operations, only the last in the sequence is required. Also called cascade of projection.
4. Commutativity of Selection and Projection.
If the predicate p involves only the attributes in the projection list, then the Selection and Projection operations commute.
5. Commutativity of Theta join (and Cartesian product).
6. Commutativity of Selection and Theta join (or Cartesian product): if the selection predicate involves only attributes of one of the relations being joined, then the Selection and Join (or Cartesian product) operations commute.
7. Commutativity of Projection and Theta join (or Cartesian product).
8. Commutativity of Union and Intersection (but not Set difference).
9. Commutativity of Selection and set operations (Union, Intersection, and Set difference).
10. Commutativity of Projection and Union.
11. Associativity of Theta join (and Cartesian product); Cartesian product and Natural join are always associative.
12. Associativity of Union and Intersection (but not Set difference).

Main Heuristic
• The main heuristic is to first apply the operations that reduce the size of the intermediate relations. That is:
• Perform Selection as early as possible: this reduces the cardinality (number of tuples) of the relation.
• Perform Projection as early as possible: this reduces the degree (number of attributes) of the relation.
• Both heuristics are accomplished by placing the Selection and Projection operations as far down the tree as possible.
• Selection and Join operations with the most restrictive conditions (those resulting in relations with the smallest absolute size) should be executed before other similar operations. This is achieved by reordering the Join nodes.
• Example: consider the following schemas and a query in which the EMPLOYEE and PROJECT relations are related through the WORKS_ON relation:
• EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)
• PROJECT (PProjID, PName, PLocation, PFund, PManagerID)
• WORKS_ON (WEmpID, WProjID)
WEmpID (the employee identifier) and WProjID (the project identifier) are foreign keys in the WORKS_ON relation referencing the EMPLOYEE and PROJECT relations, respectively.

Cost Estimation Approach to Query Optimization
• The main idea is to minimize the cost of processing a query.
• The cost function comprises: I/O cost + CPU processing cost + communication cost + storage cost.
• These components may have different weights in different processing environments.
• The DBMS uses information stored in the system catalogue to estimate these costs.
• The main target of query optimization is to minimize the size of the intermediate relations. Their size affects the cost of:
❑ Disk access
❑ Data transportation
❑ Storage space in primary memory
❑ Writing on disk
❑ The statistics in the system catalogue used for cost estimation are:
✓ Cardinality of a relation: the number of tuples the relation currently contains (r)
✓ Degree of a relation: the number of attributes of the relation
✓ Number of tuples of the relation that can be stored in one block of memory (the blocking factor)
✓ Total number of blocks used by the relation
✓ Number of distinct values of an attribute (d)
✓ Selection cardinality of an attribute (S): the average number of records that will satisfy an equality condition, S = r/d
• Using this information, one can calculate the cost of executing a query under each candidate strategy and select the strategy with the minimum processing cost.

Cost Components for Query Optimization
The costs of query execution can be calculated for the following major processes during query processing.
1. Access cost of secondary storage: data is accessed from secondary storage, as a query will need some part of the data stored in the database. The disk access cost can be analyzed in terms of searching, reading, and writing the data blocks used to store some portion of a relation.
• The disk access cost will vary depending on the file organization used and the access method implemented for that file organization.
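As a rough sketch (all figures assumed, not from the text), the catalogue statistics listed above can be combined to compare the disk-access cost of two access methods for the same equality selection; the formulas below are the standard textbook estimates for a linear scan versus binary search on an ordered file.

```python
import math

# Hypothetical catalogue statistics for one relation (assumed figures).
r = 10_000               # cardinality: number of tuples currently stored
d = 50                   # number of distinct values of the selection attribute
bfr = 40                 # blocking factor: tuples stored per disk block
b = math.ceil(r / bfr)   # total blocks used by the relation: 250

# Selection cardinality S = r/d: average tuples matching an equality
# condition, assuming the values are uniformly distributed.
S = r / d                # 200 tuples expected

# Cost (in block accesses) of the selection under two access methods:
linear_scan = b          # unordered file: read every block (250)
# Ordered file: binary search locates the first match, then the remaining
# matching tuples are read from consecutive blocks.
binary_search = math.ceil(math.log2(b)) + math.ceil(S / bfr) - 1  # 8 + 5 - 1
```

The optimizer would perform this arithmetic for every candidate access path and keep the cheapest, here the binary search at 12 block accesses versus 250 for the scan.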
• In addition to the file organization, the data allocation scheme (whether the data is stored contiguously or in a scattered manner) will affect the disk access cost.
2. Storage cost: while processing a query, which is composed of many database operations, there may be one or more intermediate results before the final output is reached.
• These intermediate results must be stored in primary memory for further processing.
• The bigger the intermediate relation, the larger the memory requirement, which has an impact on the limited available space. This is considered the storage cost.
3. Computation cost: a query is composed of many operations. These could be database operations, like reading from and writing to disk, or mathematical and other operations, such as searching, sorting, merging, and computation on field values.
4. Communication cost: in most database systems the database resides at one station and the various queries originate from different terminals. This has an impact on the performance of the system, adding cost to query processing.
• Thus, the cost of transporting data between the database site and the terminal from which the query originates should be analyzed.

Pipelining
• Pipelining is another method used in query optimization to improve the performance of queries.
• It is sometimes known as stream-based processing or on-the-fly processing of queries.
• While query optimization tries to reduce the size of the intermediate results, pipelining goes further by applying the different conditions to a single intermediate result continuously, as it is produced.
• Thus the technique reduces the number of intermediate relations materialized during query execution.
• Pipelining performs multiple operations on a single relation in a pipeline.
• Generally, a pipeline is implemented as a separate process or thread within the DBMS.
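The stream-based processing described above can be sketched compactly with Python generators (toy data and function names of my own choosing; a real DBMS would run the stages as separate threads or processes with buffers between them): each stage consumes a stream of tuples and yields its output one tuple at a time, so no intermediate relation is materialized.

```python
# A pipelined evaluation of pi(name)(sigma(position='Manager')(Staff)).
# Each stage is a generator: it pulls one tuple from its input stream and
# pushes at most one tuple downstream, instead of building a full
# intermediate relation between the operations.

def scan(relation):
    for t in relation:          # leaf of the pipeline: reads the base relation
        yield t

def select(stream, predicate):
    for t in stream:            # sigma: passes through only matching tuples
        if predicate(t):
            yield t

def project(stream, attrs):
    for t in stream:            # pi: keeps only the listed attributes
        yield {a: t[a] for a in attrs}

staff = [
    {"name": "Ann", "position": "Manager",   "salary": 30000},
    {"name": "Bob", "position": "Assistant", "salary": 18000},
    {"name": "Eve", "position": "Manager",   "salary": 25000},
]

pipeline = project(select(scan(staff),
                          lambda t: t["position"] == "Manager"),
                   ["name"])
result = list(pipeline)         # [{'name': 'Ann'}, {'name': 'Eve'}]
```

Because each stage only ever holds one tuple, the whole query runs in a single pass over Staff; this is exactly the intermediate-relation saving the bullets above describe.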
• Each pipeline takes a stream of tuples from its inputs and creates a stream of tuples as its output.
• A buffer is created for each pair of adjacent operations to hold the tuples being passed from the first operation to the second one.
• One drawback of pipelining is that the inputs to the operations are not necessarily all available at once for processing, which can restrict the choice of algorithms.

End of Chapter 2