Distributed Query Processing
Outline
Overview of Query Processing
Objectives of Query Processing
Complexity of Relational Algebra Operations
Layers of Query Processing
    Query Decomposition
    Data Localisation
    Global Query Optimisation
    Local Query Optimisation

Query Processing
The language of access is SQL.
Non-procedural: no program is written to tell the DBMS how to get the data.
An SQL query specifies attributes, constraints, and joins.
The query processor decides the order and the method of evaluating the query.
Query processor efficiency is crucial to the success of RDBMSs.
A very complicated issue: many factors to consider.

Distributed Query Processing

More complicated: many more factors affect the performance of distributed queries:
    Distributed fragments
    More communication
    Replication must be accounted for
    More sites can increase query response time
Query Processing
[Figure: a high-level user query enters the query processor, which outputs low-level data manipulation commands]
The query processor takes a high-level query (e.g., in SQL) and translates it into a set of relational algebra expressions. Since this can be done in a number of different ways, the query processor must choose the best one. This is called query optimization.
Query Processing Purpose

Execute an SQL (or relational calculus) query on a distributed database.
Decompose the calculus query into a sequence of relational operations, called an algebraic query.
Data accessed by the query must be localized, so that operations are translated to bear on local data (distributed fragments).
Optimize the performance of the query:
    Minimize the cost: disk I/O, CPU, communication, joins, etc.
Query Processing Problem

The low-level query implements the execution strategy for the query.
The transformation from high-level query to low-level query must be correct and efficient.
A transformation is correct if the low-level query has the same semantics as the original query (i.e., both queries produce the same result).
A relational calculus query may have many equivalent and correct transformations into relational algebra.
An efficient execution strategy selects the relational algebra expression that minimizes resource consumption.

Query Processing Example

Translate the SQL query into a relational algebra query.
Optimize this translation and determine the best query plan.
There are several choices, which lead to different efficiencies.
Consider the following schema:
    EMP(ENO, ENAME, TITLE)
    ASG(ENO, PNO, RESP, DUR)

Example Continued
Find the names of employees who manage projects:
    SELECT ENAME
    FROM   EMP, ASG
    WHERE  EMP.ENO = ASG.ENO
    AND    RESP = 'Manager'
Scenario 1: $\Pi_{ENAME}(\sigma_{RESP = \text{'Manager'} \wedge EMP.ENO = ASG.ENO}(EMP \times ASG))$
Scenario 2: $\Pi_{ENAME}(EMP \bowtie_{ENO} \sigma_{RESP = \text{'Manager'}}(ASG))$
Scenario 2 avoids the Cartesian product, so it is better.

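The difference between the two scenarios can be sketched on toy data. The tuples below are hypothetical; only the schemas EMP(ENO, ENAME, TITLE) and ASG(ENO, PNO, RESP, DUR) come from the example. The sketch returns the answer together with the size of the first intermediate result, which is where the two plans differ.

```python
from itertools import product

# Hypothetical toy tuples; only the two schemas are taken from the slides.
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Analyst"),
    ("E3", "A. Lee", "Mech. Eng."),
]
ASG = [
    ("E1", "P1", "Manager", 12),
    ("E2", "P1", "Analyst", 24),
    ("E2", "P2", "Analyst", 6),
    ("E3", "P3", "Manager", 10),
]

def scenario_1(emp, asg):
    """Cartesian product first, then selection, then projection on ENAME."""
    pairs = list(product(emp, asg))  # |EMP| * |ASG| intermediate tuples
    kept = [(e, a) for e, a in pairs if a[2] == "Manager" and e[0] == a[0]]
    return {e[1] for e, _ in kept}, len(pairs)

def scenario_2(emp, asg):
    """Selection on ASG first, then join with EMP on ENO, then projection."""
    managers = [a for a in asg if a[2] == "Manager"]  # small intermediate
    manager_enos = {a[0] for a in managers}
    return {e[1] for e in emp if e[0] in manager_enos}, len(managers)

names_1, intermediate_1 = scenario_1(EMP, ASG)
names_2, intermediate_2 = scenario_2(EMP, ASG)
print(names_1 == names_2, intermediate_1, intermediate_2)
```

Both functions return the same set of names, but Scenario 1 materializes every EMP/ASG pair before filtering, while Scenario 2 only carries the selected ASG tuples into the join.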
Central vs. Distributed Databases

In centralized systems, the problem is finding the best relational algebra expression:
    In general, this problem is intractable.
    But rules and heuristics find something close to the best.
In distributed systems, this becomes more complicated:
    Communication time for exchanging data between sites
    Advantages of parallel processing
    Besides choosing the ordering of relational algebra operations, the optimizer must choose the best site to process data
    Choosing the right site for replicated data
    The solution space is larger

Query Processing Strategies

Different processing strategies have different efficiencies.
Consider the same query.
Assume relations EMP and ASG are horizontally fragmented as follows:
[Figure: fragments of EMP and ASG and the sites that store them]
Results are expected at Site 5.

Assumptions:
    Tuple access (tupacc): 1 unit
    Tuple transfer (tuptrans): 10 units
    size(EMP) = 400 tuples
    size(ASG) = 1000 tuples
    ASG has 20 managers
    Data is uniformly distributed among sites
    The relations ASG and EMP are locally clustered on attributes RESP and ENO, respectively
Note: statistics are required for doing this properly.
    In general, exact statistics are not available.
    Therefore, you have to make the best guess.

Strategy 1:
    Compute the selections at Sites 1 and 2, and move the results to Sites 3 and 4, respectively.
    Compute the joins at Sites 3 and 4, and move the results to Site 5 for the final union.
Strategy 2:
    Move all data to Site 5.
    Compute the selections and joins at Site 5, and display the results.
    Assumption: the access methods to relations EMP and ASG based on attributes RESP and ENO are lost because of the data transfer.
[Figures: execution plans for Strategy 1 and Strategy 2]
First Strategy Cost

1. Produce ASG' by selecting ASG:          (10+10) * tupacc     =  20
2. Transfer ASG' to the sites of EMP:      (10+10) * tuptrans   = 200
3. Produce EMP' by joining ASG' and EMP:   (10+10) * tupacc * 2 =  40
4. Transfer EMP' to the result site:       (10+10) * tuptrans   = 200
Total cost:                                                       460 units

Second Strategy Cost

1. Transfer EMP to Site 5:                 400 * tuptrans       =  4,000
2. Transfer ASG to Site 5:                 1000 * tuptrans      = 10,000
3. Produce ASG' by selecting ASG:          1000 * tupacc        =  1,000
4. Join EMP and ASG':                      400 * 20 * tupacc    =  8,000
Total cost:                                                       23,000 units

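The arithmetic of the two strategies can be checked with a short script. This is a minimal sketch: the constants are the slide assumptions, while the function and variable names are made up for illustration.

```python
# Unit costs and sizes from the stated assumptions.
TUPACC = 1      # cost of one tuple access
TUPTRANS = 10   # cost of one tuple transfer
EMP_SIZE = 400
ASG_SIZE = 1000
MANAGERS = 20                # ASG tuples that satisfy the selection
PER_SITE = MANAGERS // 2     # uniform distribution over two ASG sites

def strategy_1_cost():
    """Select at the ASG sites, join at the EMP sites, union at Site 5."""
    select_asg = (PER_SITE + PER_SITE) * TUPACC
    ship_selected = (PER_SITE + PER_SITE) * TUPTRANS
    join = (PER_SITE + PER_SITE) * TUPACC * 2
    ship_result = (PER_SITE + PER_SITE) * TUPTRANS
    return select_asg + ship_selected + join + ship_result

def strategy_2_cost():
    """Ship both relations to Site 5 and do all the work there."""
    ship_emp = EMP_SIZE * TUPTRANS
    ship_asg = ASG_SIZE * TUPTRANS
    select_asg = ASG_SIZE * TUPACC           # clustering on RESP is lost
    join = EMP_SIZE * MANAGERS * TUPACC      # clustering on ENO is lost
    return ship_emp + ship_asg + select_asg + join

cost_1 = strategy_1_cost()
cost_2 = strategy_2_cost()
print(cost_1, cost_2, cost_2 / cost_1)
```

Under these assumptions Strategy 1 costs 460 units and Strategy 2 costs 23,000 units, a factor of 50.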
Query Processing Objective

Transform a high-level query (relational calculus) on a distributed database into an efficient execution strategy expressed in a low-level language (relational algebra with communication operations) on local databases.
Many execution strategies are correct transformations of the same high-level query.
    The one that optimizes (minimizes) resource consumption is retained.

Cost Measures
Total cost: the resource consumption incurred in processing the query.
    The sum of all times spent processing the query's operations at the various sites and in intersite communication.
Response time: the time elapsed in executing the query.
    Benefits from parallel execution of operations at different sites.

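The two measures can diverge as soon as work runs in parallel. The sketch below is a simplification under stated assumptions: the (site, time) pairs are hypothetical, operations at one site run sequentially, different sites run fully in parallel, and dependencies between operations are ignored.

```python
from collections import defaultdict

# Hypothetical operations as (site, time units) pairs.
operations = [("site1", 4), ("site1", 2), ("site2", 5), ("site3", 3)]

def total_cost(ops):
    """Sum of all processing times, regardless of where they occur."""
    return sum(time for _, time in ops)

def response_time(ops):
    """Elapsed time if sites work in parallel: the busiest site decides."""
    per_site = defaultdict(int)
    for site, time in ops:
        per_site[site] += time
    return max(per_site.values())

print(total_cost(operations), response_time(operations))
```

With these numbers the total cost is 14 units while the response time is 6 units, so a plan that minimizes one measure does not necessarily minimize the other.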
Query Optimization Objectives

Minimize a cost function:
    I/O cost + CPU cost + communication cost
CPU cost: time for performing operations on data in main memory.
I/O cost: time necessary for disk input/output operations.
    Can be minimized through buffer management.
Communication cost: time needed for exchanging data between sites:
    Processing messages
    Transmitting data on the network

Query Optimization Objectives (continued)

The components of the total cost may carry different weights in different distributed environments.
Wide area networks
    Communication cost dominates:
        low bandwidth
        low speed
        high protocol overhead
    Most algorithms ignore all other cost components.
Local area networks
    Communication cost is not that dominant.
    The total cost function should be considered.
An alternative objective is to maximize throughput.

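How the weights change the preferred plan can be sketched with a weighted cost function. Everything numeric here is an assumption for illustration: the weight values, the two plans, and their resource counts are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass

@dataclass
class Weights:
    io: float
    cpu: float
    comm: float

@dataclass
class Plan:
    name: str
    io_units: float
    cpu_units: float
    comm_units: float

def cost(plan, w):
    """Weighted sum: I/O cost + CPU cost + communication cost."""
    return w.io * plan.io_units + w.cpu * plan.cpu_units + w.comm * plan.comm_units

# Hypothetical weights: communication is expensive on a WAN, cheap on a LAN.
WAN = Weights(io=1, cpu=1, comm=100)
LAN = Weights(io=1, cpu=1, comm=1)

# Hypothetical plans: one does more local work, the other ships more data.
local_heavy = Plan("local-heavy", io_units=500, cpu_units=200, comm_units=10)
ship_heavy = Plan("ship-heavy", io_units=100, cpu_units=100, comm_units=300)
plans = [local_heavy, ship_heavy]

best_wan = min(plans, key=lambda p: cost(p, WAN))
best_lan = min(plans, key=lambda p: cost(p, LAN))
print(best_wan.name, best_lan.name)
```

With the WAN weights the plan that ships little data wins, and with the LAN weights the plan that does less local work wins, even though the plans themselves are unchanged.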
Complexity of Relational Operations

Assume: relations of cardinality n, sequential scan

Operation                                            Complexity
Select, Project (without duplicate elimination)      O(n)
Project (with duplicate elimination), Group          O(n log n)
Join, Semi-join, Division, Set Operators             O(n log n)
Cartesian Product                                    O(n^2)

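One way to see where the O(n log n) bound for projection with duplicate elimination comes from is a sort-based implementation: sorting costs O(n log n), after which duplicates are adjacent and a single O(n) pass removes them. The function name and sample rows below are hypothetical.

```python
def project_distinct(rows, columns):
    """Project rows on the given column indexes and drop duplicates by sorting."""
    projected = sorted(tuple(row[c] for c in columns) for row in rows)  # O(n log n)
    result = []
    for t in projected:                      # one linear pass over sorted tuples
        if not result or result[-1] != t:    # duplicates are adjacent after sorting
            result.append(t)
    return result

# Hypothetical (ENO, PNO) pairs; projecting on PNO leaves two distinct values.
rows = [("E1", "P1"), ("E2", "P1"), ("E1", "P2")]
print(project_distinct(rows, [1]))
```

Without duplicate elimination the projection is a single pass over the input, which matches the O(n) entry in the table.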
Query Processing Layers

Input: calculus query on distributed relations
CONTROL SITE
    Query Decomposition (uses the GLOBAL SCHEMA)
        Output: algebraic query on distributed relations
    Data Localization (uses the FRAGMENT SCHEMA)
        Output: fragment query
    Global Optimization (uses STATS ON FRAGMENTS)
        Output: optimized fragment query with communication operations
LOCAL SITES
    Local Optimization (uses the LOCAL SCHEMAS)
        Output: optimized local queries

Step 1: Query Decomposition

Input: calculus query on global relations
Decompose the distributed calculus query into an algebraic query.
Normalization
    Manipulate query quantifiers and qualification by applying logical operator priority.
Analysis
    Detect and reject incorrect queries as early as possible.
    Possible for only a subset of relational calculus.
Simplification
    Eliminate redundant predicates.
Restructuring
    Translate the calculus query into an algebraic query.
    More than one translation is possible.
    Use transformation rules.

Step 2: Data Localization

Input: algebraic query on distributed relations
Localize the query's data using data distribution information.
Determine which fragments are involved in the query and transform the distributed query into a fragment query.
Apply a localization program:
    Substitute for each global relation its materialization program (reconstruction program).
    Optimize: the fragment query is simplified and restructured according to the same rules used in the decomposition layer.

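Localization can be sketched for a horizontally fragmented relation and an equality selection. The fragment names and their ENO ranges below are assumptions chosen for illustration, and the pruning rule only handles a selection of the form ENO = value.

```python
# Hypothetical horizontal fragmentation of EMP on ENO.
# The materialization program is then EMP = EMP1 UNION EMP2.
FRAGMENT_PREDICATES = {
    "EMP1": lambda eno: eno <= "E3",
    "EMP2": lambda eno: eno > "E3",
}

def materialization_program():
    """Fragments whose union reconstructs the global relation EMP."""
    return list(FRAGMENT_PREDICATES)

def localize_selection(eno):
    """Keep only fragments that can hold a tuple with ENO == eno.

    A fragment whose defining predicate is false for the selected value
    would contribute an empty result, so it is dropped from the query.
    """
    return [name for name, holds in FRAGMENT_PREDICATES.items() if holds(eno)]

print(materialization_program(), localize_selection("E5"))
```

For the selection ENO = 'E5' the fragment query touches only EMP2, so the site holding EMP1 does not need to be contacted at all.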
Step 3: Global Query Optimization

Input: fragment query (an algebraic query on fragments)
Find the best (not necessarily optimal) global schedule (execution strategy):
    Find the best ordering of operations in the fragment query, including communication operations.
    Minimize a cost function in terms of time units, I/O, CPU time, buffer space, communication, etc.
Distributed join processing (join ordering)
    Bushy vs. linear trees
    Which relation to ship where?
    Ship-whole vs. ship-as-needed
Decide on the use of semijoins
    A semijoin saves on communication at the expense of more local processing.
Join methods
    Nested loop vs. ordered joins (merge join or hash join)

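The semijoin trade-off can be sketched by counting what crosses the network. The data is hypothetical, EMP is assumed to sit at one site and the selected ASG tuples at another, and communication is measured crudely as the number of attribute values shipped.

```python
# Hypothetical data: EMP at one site, the manager tuples of ASG at another.
EMP = [(f"E{i}", f"Name{i}", "Engineer") for i in range(1, 7)]
ASG_MGR = [("E2", "P1", "Manager", 12), ("E5", "P3", "Manager", 10)]

def join(emp, asg):
    """Nested-loop join on ENO (first attribute of both relations)."""
    return [(e, a) for e in emp for a in asg if e[0] == a[0]]

def ship_whole(emp, asg):
    """Ship all of EMP to the ASG site and join there."""
    shipped = sum(len(t) for t in emp)            # attribute values sent
    return join(emp, asg), shipped

def via_semijoin(emp, asg):
    """Ship ASG's ENO values, reduce EMP at its own site, ship the reduced EMP."""
    enos = {a[0] for a in asg}                    # projection of ASG on ENO
    reduced = [e for e in emp if e[0] in enos]    # EMP semijoin ASG (extra local work)
    shipped = len(enos) + sum(len(t) for t in reduced)
    return join(reduced, asg), shipped

result_whole, cost_whole = ship_whole(EMP, ASG_MGR)
result_semi, cost_semi = via_semijoin(EMP, ASG_MGR)
print(result_whole == result_semi, cost_whole, cost_semi)
```

Both programs produce the same join, but here the semijoin version ships 8 values instead of 18, at the price of an extra scan of EMP at its own site. The saving disappears when most EMP tuples participate in the join.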
Step 4: Local Optimization

Input: best global execution schedule
Performed by all sites having fragments involved in the query.
    Select the best access path.
    Use the centralized optimization techniques.

References
M. Tamer Özsu, Patrick Valduriez, and S. Sridhar, Principles of Distributed Database Systems, 2nd ed.
