
Distributed Query Processing

Outline
Overview of Query Processing
Objectives of Query Processing
Complexity of Relational Algebra Operations
Layers of Query Processing
    Query Decomposition
    Data Localisation
    Global Query Optimisation
    Local Query Optimisation

Query Processing
The language of access is SQL
Non-procedural: no program is written to tell the DBMS how to get the data
An SQL query is structured around the attributes, constraints, and joins it asks for
The order and process of evaluating the query are decided by a query processor
Query processor efficiency is crucial to the success of RDBMSs
A very complicated issue: many factors to consider

Distributed Query Processing


More complicated: many more factors affect the performance of distributed queries:
    Distributed fragments
    More communication
    Replication must be accounted for
    More sites increase query response time

Query Processing
[Figure: high-level user query → query processor → low-level data manipulation commands]

The query processor takes a high-level query (e.g., in SQL) and translates it into a set of relational algebra expressions. Since this can be done in a number of different ways, the query processor must choose the best one. This is called query optimization.

Query Processing Purpose


Execute an SQL (or relational calculus) query on a distributed database
Decompose the calculus query into a sequence of relational operations called an algebraic query
Localize the data accessed by the query, so that operations are translated to bear on local data (distributed fragments)
Optimize the performance of the query:
    Minimize the cost: disk I/O, CPU, communication, joins, etc.

Query Processing Problem


The low-level query implements an execution strategy for the query
The transformation from high-level query to low-level query must be:
    Correct and efficient

A transformation is correct if:
    The low-level query has the same semantics as the original query (i.e., both queries produce the same result)

A relational calculus query may have many equivalent and correct transformations into relational algebra
An efficient execution strategy:
    Select the relational algebra (RA) expression that minimizes resource consumption

Query Processing Example


Translate the SQL query into a relational algebra query
Optimize this translation and determine the best query plan
There are several choices, which lead to different efficiencies
Consider the following schema:
    EMP (ENO, ENAME, TITLE)
    ASG (ENO, PNO, RESP, DUR)

Example Continued
Find the names of employees who manage projects:
    SELECT ENAME
    FROM EMP, ASG
    WHERE EMP.ENO = ASG.ENO
    AND RESP = "Manager"

Scenario 1: $\Pi_{ENAME}(\sigma_{RESP = \text{"Manager"} \wedge EMP.ENO = ASG.ENO}(EMP \times ASG))$
Scenario 2: $\Pi_{ENAME}(EMP \bowtie_{ENO} \sigma_{RESP = \text{"Manager"}}(ASG))$

Scenario 2 avoids the Cartesian product, so it is better.
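The difference between the two scenarios can be made concrete with a small sketch. The rows below are made-up illustration data, and the function names are hypothetical; the point is only that both plans return the same names while Scenario 1 builds a much larger intermediate result.

```python
# Toy instances of EMP(ENO, ENAME, TITLE) and ASG(ENO, PNO, RESP, DUR).
# All rows are invented for illustration; they are not from the slides.
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Analyst"),
    ("E3", "A. Lee", "Programmer"),
]
ASG = [
    ("E1", "P1", "Manager", 12),
    ("E2", "P1", "Analyst", 24),
    ("E2", "P2", "Manager", 6),
    ("E3", "P3", "Programmer", 10),
]


def scenario_1(emp, asg):
    """Cartesian product first, then selection, then projection on ENAME."""
    product = [(e, a) for e in emp for a in asg]
    selected = [(e, a) for e, a in product
                if a[2] == "Manager" and e[0] == a[0]]
    return {e[1] for e, _ in selected}, len(product)


def scenario_2(emp, asg):
    """Selection on ASG first, then join on ENO, then projection on ENAME."""
    managers = [a for a in asg if a[2] == "Manager"]
    emp_by_eno = {e[0]: e for e in emp}  # assumes ENO is a key of EMP
    joined = [(emp_by_eno[a[0]], a) for a in managers if a[0] in emp_by_eno]
    return {e[1] for e, _ in joined}, len(managers)


names_1, intermediate_1 = scenario_1(EMP, ASG)
names_2, intermediate_2 = scenario_2(EMP, ASG)
print(names_1 == names_2)              # both plans agree on the answer
print(intermediate_1, intermediate_2)  # size of the largest intermediate input
```

With 3 EMP tuples and 4 ASG tuples, Scenario 1 materializes 12 pairs before filtering, whereas Scenario 2 only carries the 2 manager assignments into the join.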

Central vs. Distributed Databases


In central systems, the problem is finding the best relational algebra expression:
    In general, this problem is intractable
    But we can find something close to the best using rules and heuristics

In a distributed system, this becomes more complicated:
    Communication time for exchanging data between sites
    Advantages of parallel processing
    Besides choosing the ordering of relational algebra operations, the query processor must choose the best site to process the data
    Choosing the right site for replicated data
    The solution space is larger

Query Processing Strategies


Different processing strategies have different efficiencies
Consider the same query:

Assume relations EMP and ASG are horizontally fragmented as follows:

Results are expected at site 5



Query Processing Strategies


Assumptions:
    Tuple access (tupacc): 1 unit
    Tuple transfer (tuptrans): 10 units
    size(EMP): 400 tuples
    size(ASG): 1000 tuples
    ASG has 20 managers
    Data is uniformly distributed among sites
    The relations ASG and EMP are locally clustered on attributes RESP and ENO, respectively

Note: statistics are required for doing this properly
    In general, exact statistics are not available
    Therefore, you have to make the best guess

Query Processing Strategies


Strategy 1:
    Compute selections at Sites 1 and 2, and move the results to Sites 3 and 4, respectively.
    Compute joins at Sites 3 and 4, and move the results to Site 5 for the final union.

Strategy 2:
    Move all information to Site 5.
    Compute selections and joins at Site 5, and display the results.
    Assumption: the access methods to relations EMP and ASG based on attributes RESP and ENO are lost because of the data transfer.

[Figure: Strategy 1]

[Figure: Strategy 2]

First Strategy Cost


1. Produce ASG' by selecting ASG requires (10+10) * tupacc = 20
2. Transfer ASG' to the sites of EMP requires (10+10) * tuptrans = 200
3. Produce EMP' by joining ASG' and EMP requires (10+10) * tupacc * 2 = 40
4. Transfer EMP' to the result site requires (10+10) * tuptrans = 200

Total cost is 460 units

Second Strategy Cost


1. Transfer EMP to site 5 requires 400 * tuptrans = 4,000
2. Transfer ASG to site 5 requires 1000 * tuptrans = 10,000
3. Produce ASG' by selecting ASG requires 1000 * tupacc = 1,000
4. Join EMP and ASG' requires 400 * 20 * tupacc = 8,000

Total cost is 23,000 units
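The arithmetic for both strategies can be re-derived from the stated assumptions. The sketch below is only a restatement of the slides' cost model; the constant and function names are chosen here for illustration.

```python
TUPACC = 1       # cost of one tuple access (from the assumptions)
TUPTRANS = 10    # cost of one tuple transfer (from the assumptions)
SIZE_EMP = 400   # tuples in EMP
SIZE_ASG = 1000  # tuples in ASG
MANAGERS = 20    # ASG tuples with RESP = "Manager"
PER_SITE = MANAGERS // 2  # uniform distribution over the two ASG sites


def strategy_1_cost():
    """Select at the ASG sites, join at the EMP sites, union at the result site."""
    select_asg = (PER_SITE + PER_SITE) * TUPACC
    ship_selected = (PER_SITE + PER_SITE) * TUPTRANS
    join = (PER_SITE + PER_SITE) * TUPACC * 2
    ship_result = (PER_SITE + PER_SITE) * TUPTRANS
    return select_asg + ship_selected + join + ship_result


def strategy_2_cost():
    """Ship both relations whole to the result site and do all work there."""
    ship_emp = SIZE_EMP * TUPTRANS
    ship_asg = SIZE_ASG * TUPTRANS
    select_asg = SIZE_ASG * TUPACC         # full scan: clustering is lost
    join = SIZE_EMP * MANAGERS * TUPACC    # nested loop over the 20 managers
    return ship_emp + ship_asg + select_asg + join


cost_1 = strategy_1_cost()
cost_2 = strategy_2_cost()
print(cost_1, cost_2, cost_2 // cost_1)
```

Under these assumptions Strategy 2 is 50 times more expensive than Strategy 1, and most of the gap comes from shipping whole relations instead of the 20 selected tuples.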

Query Processing Objective


Transform a high-level query (relational calculus) on a distributed database into an efficient execution strategy expressed in a low-level language (relational algebra with communication operations) on local databases.
Many execution strategies are correct transformations of the same high-level query:
    The one that optimizes (minimizes) resource consumption is retained.

Cost Measures
Measure resource consumption in terms of total cost:
    The cost incurred in processing the query
    The sum of all times spent processing the query operations at the various sites and in intersite communication
Measure the response time of the query:
    The time elapsed in executing the query
    Benefits from parallel execution of operations at different sites
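A small sketch may help separate the two measures. It reuses the unit costs from the earlier example (tuple access 1, tuple transfer 10) and assumes, purely for illustration, that the two branches of a plan run fully in parallel and that the final union is free. The function name `branch_cost` is hypothetical.

```python
TUPACC = 1     # cost of one tuple access
TUPTRANS = 10  # cost of one tuple transfer


def branch_cost(selected_tuples):
    """Cost of one select -> ship -> join -> ship branch of the plan."""
    select = selected_tuples * TUPACC
    ship_to_join_site = selected_tuples * TUPTRANS
    join = selected_tuples * TUPACC * 2
    ship_to_result_site = selected_tuples * TUPTRANS
    return select + ship_to_join_site + join + ship_to_result_site


# Two independent branches, each handling 10 selected tuples.
branches = [branch_cost(10), branch_cost(10)]

total_cost = sum(branches)     # all work done anywhere in the system
response_time = max(branches)  # elapsed time if the branches overlap fully

print(total_cost, response_time)
```

Parallel execution leaves the total cost unchanged but can halve the response time here, which is why the two measures can rank strategies differently.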

Query Optimization Objectives


Minimize a cost function:
    I/O cost + CPU cost + communication cost

CPU cost: incurred when performing operations on data in main memory
I/O cost: the time necessary for disk input/output operations
    Can be minimized through buffer management
Communication cost: the time needed for exchanging data between sites:
    Processing messages
    Transmitting data on the network

Query Optimization Objectives


The components of the total cost may carry different weights in different distributed environments.
Wide area networks:
    Communication cost will dominate
        Low bandwidth
        Low speed
        High protocol overhead
    Most algorithms ignore all other cost components
Local area networks:
    Communication cost is not that dominant
    The total cost function should be considered

Can also maximize throughput
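The effect of the weights can be sketched with a weighted cost function. Every number below (plan costs and weights) is invented for illustration; only the form I/O + CPU + communication comes from the slides.

```python
# Hypothetical resource usage of two candidate plans, in arbitrary units.
PLANS = {
    "A": {"io": 500, "cpu": 300, "comm": 100},  # more local work, less shipping
    "B": {"io": 100, "cpu": 100, "comm": 400},  # less local work, more shipping
}

# Hypothetical weights: communication is expensive on a WAN, cheap on a LAN.
WEIGHTS = {
    "WAN": {"io": 1.0, "cpu": 1.0, "comm": 10.0},
    "LAN": {"io": 1.0, "cpu": 1.0, "comm": 0.5},
}


def total_cost(plan, weights):
    """Weighted sum of I/O, CPU, and communication cost."""
    return sum(weights[k] * plan[k] for k in ("io", "cpu", "comm"))


def best_plan(plans, weights):
    """Name of the plan with the lowest weighted total cost."""
    return min(plans, key=lambda name: total_cost(plans[name], weights))


for network, weights in WEIGHTS.items():
    print(network, best_plan(PLANS, weights))
```

With these made-up figures the WAN weights favour plan A, which ships less data, while the LAN weights favour plan B, so the same pair of plans is ranked differently in the two environments.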

Complexity of Relational Operations


Operation                                   Complexity
Select                                      O(n)
Project (without duplicate elimination)     O(n)
Project (with duplicate elimination)        O(n log n)
Group                                       O(n log n)
Join                                        O(n log n)
Semi-join                                   O(n log n)
Division                                    O(n log n)
Set Operators                               O(n log n)
Cartesian Product                           O(n^2)

Assume: relations of cardinality n, sequential scan
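These complexity classes follow from how each operation can be implemented over unindexed relations. The sketch below shows one possible implementation of three of them; the function names are hypothetical, and sort-based duplicate elimination is just one common way to reach O(n log n).

```python
def project(rows, columns):
    """Projection without duplicate elimination: a single pass, O(n)."""
    return [tuple(row[c] for c in columns) for row in rows]


def project_distinct(rows, columns):
    """Projection with duplicate elimination via sorting: O(n log n)."""
    projected = sorted(project(rows, columns))
    result = []
    for row in projected:
        # After sorting, duplicates are adjacent, so one comparison suffices.
        if not result or result[-1] != row:
            result.append(row)
    return result


def cartesian_product(r, s):
    """Every tuple of r paired with every tuple of s: O(n^2) for |r| = |s| = n."""
    return [a + b for a in r for b in s]


rows = [("E1", "P1"), ("E2", "P1"), ("E1", "P2")]
print(project(rows, [0]))
print(project_distinct(rows, [0]))
print(len(cartesian_product(rows, rows)))
```

The product of a 3-tuple relation with itself already has 9 tuples, which is why plans that avoid the Cartesian product are preferred.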

Query Processing Layers


Input: Calculus Query on Distributed Relations

CONTROL SITE
    Query Decomposition (uses GLOBAL SCHEMA)
        output: Algebraic Query on Distributed Relations
    Data Localization (uses FRAGMENT SCHEMA)
        output: Fragment Query
    Global Optimization (uses STATS ON FRAGMENTS)
        output: Optimized Fragment Query with Communication Operations

LOCAL SITES
    Local Optimization (uses LOCAL SCHEMAS)
        output: Optimized Local Queries

Step 1 Query Decomposition


Input: Calculus query on global relations
Decompose the distributed calculus query into an algebraic query

Normalization
    Manipulate query quantifiers and qualification by applying logical operator priority

Analysis
    Detect and reject incorrect queries as early as possible
    Possible for only a subset of relational calculus

Simplification
    Eliminate redundant predicates

Restructuring
    Calculus query → algebraic query
    More than one translation is possible
    Use transformation rules
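The simplification step can be illustrated with a deliberately small sketch. It assumes the qualification is already a conjunction of atomic predicates written as (attribute, operator, value) triples, which is a representation chosen here for illustration, and it applies only two rules: p AND p reduces to p, and two different equalities on the same attribute make the conjunction false.

```python
def simplify_conjunction(predicates):
    """Return the conjunction without redundant predicates,
    or None if it can never be satisfied."""
    kept = []
    equalities = {}
    for attr, op, value in predicates:
        if op == "=":
            if attr in equalities and equalities[attr] != value:
                # attr = a AND attr = b with a != b is always false.
                return None
            equalities[attr] = value
        if (attr, op, value) not in kept:
            kept.append((attr, op, value))  # idempotency: drop repeats
    return kept


query = [("RESP", "=", "Manager"), ("DUR", ">", 12), ("RESP", "=", "Manager")]
print(simplify_conjunction(query))

contradiction = [("RESP", "=", "Manager"), ("RESP", "=", "Analyst")]
print(simplify_conjunction(contradiction))
```

A real decomposition layer works on full predicate trees with disjunctions and negations; this sketch only shows the flavour of the rules involved.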

Step 2 Data Localization


Input: Algebraic query on distributed relations
Localize the query's data using data distribution information
Determine which fragments are involved in the query and transform the distributed query into a fragment query
Apply a localization program:
    Substitute for each global relation its materialization program (reconstruction program)
    Optimize: the fragment query is simplified and restructured according to the same rules used in the decomposition layer
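A minimal sketch of localization followed by reduction, under assumptions made up for this example: EMP is horizontally fragmented into three fragments by ENO range, the materialization program is the union of the fragments, and the query selects a single ENO value. Fragment names and ranges are hypothetical.

```python
# Hypothetical horizontal fragmentation of EMP by ENO range:
# each fragment holds tuples with low <= ENO < high.
# Plain string comparison is enough here because every ENO has one digit.
FRAGMENTS = {
    "EMP1": ("E0", "E4"),
    "EMP2": ("E4", "E7"),
    "EMP3": ("E7", "E9"),
}


def localize(fragments):
    """Materialization program: EMP is the union of all its fragments."""
    return sorted(fragments)


def reduce_for_equality(fragments, eno):
    """Keep only the fragments whose range can contain ENO = eno."""
    return [
        name
        for name in localize(fragments)
        if fragments[name][0] <= eno < fragments[name][1]
    ]


print(localize(FRAGMENTS))                   # fragment query before reduction
print(reduce_for_equality(FRAGMENTS, "E5"))  # fragment query after reduction
```

The selection on ENO contradicts the defining predicates of two fragments, so the fragment query only needs to touch one of the three.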

Step 3 Global Query Optimization


Input: Fragment query (an algebraic query on fragments)
Find the best (not necessarily optimal) global schedule (execution strategy)
    Find the best ordering of operations in the fragment query, including communication operations
    Minimize a cost function in terms of time units, I/O, CPU time, buffer space, communication, etc.
Distributed join processing (join ordering)
    Bushy vs. linear trees
    Which relation to ship where?
    Ship-whole vs. ship-as-needed
Decide on the use of semijoins
    A semijoin saves on communication at the expense of more local processing
Join methods
    Nested loop vs. ordered joins (merge join or hash join)
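The semijoin trade-off can be sketched on toy data. The relations, their placement at two sites, and the simple count of shipped items are all assumptions made for illustration; a real optimizer would weigh bytes transferred and local processing cost rather than item counts.

```python
def semijoin(r, s, r_key, s_key):
    """R semijoin S: the tuples of R that have a match in S on the join attribute."""
    s_values = {row[s_key] for row in s}
    return [row for row in r if row[r_key] in s_values]


def join(r, s, r_key, s_key):
    """Plain nested-loop equijoin, used here to check both plans agree."""
    return [a + b for a in r for b in s if a[r_key] == b[s_key]]


# Made-up data: EMP is stored at site A, ASG at site B, result wanted at A.
EMP = [("E1", "J. Doe"), ("E2", "M. Smith")]
ASG = [("E1", "P1"), ("E3", "P2"), ("E4", "P2"), ("E5", "P3")]

# Plan 1 (ship whole): send every ASG tuple to site A.
shipped_whole = len(ASG)

# Plan 2 (semijoin): send the distinct ENO values of EMP to site B,
# reduce ASG there, and send back only the matching ASG tuples.
join_values = {e[0] for e in EMP}
reduced_asg = semijoin(ASG, EMP, 0, 0)
shipped_semijoin = len(join_values) + len(reduced_asg)

print(shipped_whole, shipped_semijoin)
print(join(EMP, ASG, 0, 0) == join(EMP, reduced_asg, 0, 0))
```

The semijoin plan ships fewer items here, and the items sent in its first step are single attribute values rather than whole tuples, but it pays for that with an extra projection and an extra pass over ASG at site B.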

Step 4 Local Optimization


Input: Best global execution schedule
Performed by all sites having fragments involved in the query
Select the best access path
Use the centralized optimization techniques
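Access path selection at a local site can be sketched with the simplified cost model used earlier in these slides, where a clustered access method touches only the matching tuples and a sequential scan touches every tuple. The function names are hypothetical, and treating the index traversal itself as free is an assumption of this sketch.

```python
def sequential_scan_cost(cardinality, tupacc=1):
    """Read every tuple of the fragment."""
    return cardinality * tupacc


def clustered_index_cost(matching_tuples, tupacc=1):
    """Read only the matching tuples; index traversal is assumed to be free."""
    return matching_tuples * tupacc


def best_access_path(cardinality, matching_tuples, has_index):
    """Return the cheapest available access path and its cost."""
    candidates = {"sequential scan": sequential_scan_cost(cardinality)}
    if has_index:
        candidates["clustered index"] = clustered_index_cost(matching_tuples)
    name = min(candidates, key=candidates.get)
    return name, candidates[name]


# ASG has 1000 tuples, 20 of which are managers.
print(best_access_path(1000, 20, has_index=True))
print(best_access_path(1000, 20, has_index=False))
```

The two results mirror the selection costs in the earlier strategies: 20 accesses when the clustering on RESP is available, and 1000 accesses when the access method has been lost.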

References
Principles of Distributed Database Systems, 2nd ed., by M. Tamer Ozsu, Patrick Valduriez, and S. Sridhar.
