
CT80A0000 – DATA-INTENSIVE SYSTEMS

DISTRIBUTED QUERY PROCESSING


Lecture

Jiri Musto, D.Sc.


QUERY PROCESSING
High-level user query
-> Query processor
-> Low-level data manipulation commands for the DBMS
QUERY PROCESSING COMPONENTS
Query language
E.g. SQL

Query execution
Steps of a given query

Query optimization
Manual or automatic (DBMS)

We assume a homogeneous distributed DBMS

OPTIMIZATION: SELECTING ALTERNATIVES
Each query can have multiple methods to reach the result
2+2, 1+3, 2*2, 2^2

To select the best option, some metrics are needed


Minimize the cost function (I/O, CPU, Communication)

Fastest, cheapest, transfer cost, total time, response time, etc.

Either done automatically by the system or manually by the user
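The idea of choosing among alternatives by minimizing a cost function can be sketched as follows. This is a minimal illustration, not a real optimizer: the plan names, the cost numbers, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    """Estimated cost components of one alternative plan (hypothetical units)."""
    name: str
    io: float
    cpu: float
    communication: float


def total_cost(plan, w_io=1.0, w_cpu=1.0, w_comm=1.0):
    # Weighted sum of the three components named on the slide.
    return w_io * plan.io + w_cpu * plan.cpu + w_comm * plan.communication


def cheapest(plans, **weights):
    # Pick the alternative with the smallest weighted total cost.
    return min(plans, key=lambda p: total_cost(p, **weights))


# Two made-up alternatives that produce the same result.
plans = [
    PlanCost("ship_whole_relation", io=10, cpu=5, communication=100),
    PlanCost("semijoin_first", io=20, cpu=10, communication=30),
]

best_on_slow_network = cheapest(plans)
best_on_fast_network = cheapest(plans, w_comm=0.1)
```

Changing the weights changes the winner, which is why the metric has to be fixed before the alternatives are compared.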

TYPES OF OPTIMIZERS
Exhaustive search

Cost-based
Optimal

Heuristics

Not optimal (near optimal)


Optimize individual operations
Find a solution with reasonable cost

ONE OR MULTIPLE QUERIES
Single query at a time

Cannot use common intermediate results

Easier to optimize

Multiple queries at a time

Efficient if many similar queries

Decision space is much larger

OPTIMIZATION STRATEGY
Static
Optimize prior to the execution
Sizes of intermediate results are difficult to estimate, and estimation errors propagate
Dynamic
Run time optimization
Exact information on the intermediate relation sizes
Have to reoptimize for multiple executions
Hybrid
Compile using a static algorithm
If the error in the estimated sizes exceeds a threshold, reoptimize at run time
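The hybrid strategy's trigger can be sketched as a relative-error check. The function name, the use of relative error, and the default threshold of 0.5 are assumptions made for illustration; the slides do not specify how the error is measured.

```python
def needs_reoptimization(estimated_rows, actual_rows, threshold=0.5):
    """Return True when the size estimate is off by more than `threshold`.

    The error is measured relative to the estimate (an assumption of this sketch).
    """
    if estimated_rows <= 0:
        # No usable estimate: any non-empty result counts as a miss.
        return actual_rows > 0
    relative_error = abs(actual_rows - estimated_rows) / estimated_rows
    return relative_error > threshold


# The static plan assumed 1000 rows for an intermediate result.
keep_plan = not needs_reoptimization(1000, 1200)   # 20 % off, below threshold
replan = needs_reoptimization(1000, 5000)          # 400 % off, above threshold
```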

OPTIMIZATION DECISION SITES
Centralized
One site decides “best” schedule
Simple
Need knowledge about the entire distributed database

Distributed
Each site cooperates to determine schedule
Need only local information
Cost of cooperation

Hybrid
One site determines the global schedule
Each site optimizes the local schedules

QUERY PROCESSING METHODOLOGY

STEP 1 – QUERY DECOMPOSITION
Normalize and analyze the query
Bad queries are rejected

Simplify the query as much as possible


Remove irrelevant parameters

Restructure the query to get a more efficient option

STEP 2 – DATA LOCALIZATION
What fragments / partitions are involved in the query

Localize the query plan


Global parameters are changed to local ones

Optimize the localized plans
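Finding the fragments involved in a query can be sketched for the simple case of horizontal fragmentation by key range. The fragment names, the ranges, and the range-predicate query are hypothetical.

```python
def relevant_fragments(fragments, low, high):
    """Return the fragments whose key range overlaps the query range [low, high]."""
    return [
        name
        for name, (frag_low, frag_high) in fragments.items()
        if frag_low <= high and low <= frag_high
    ]


# Hypothetical horizontal fragments of a Student relation, split by student id.
student_fragments = {
    "student_site1": (0, 999),
    "student_site2": (1000, 1999),
    "student_site3": (2000, 2999),
}

# A query such as "student id between 500 and 1200" only touches two sites.
involved = relevant_fragments(student_fragments, 500, 1200)
```

Fragments that cannot contribute rows are dropped before the localized plan is optimized, so no work is sent to their sites.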

STEP 3 – GLOBAL QUERY OPTIMIZATION
Find the best (not necessarily optimal) global schedule

Minimize a cost function


Join processing
Bushy vs. linear trees
Which relation to ship where?
Ship-whole vs ship-as-needed

How data is joined together


Semijoins or not, what join methods are used
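The "semijoins or not" decision can be sketched by comparing transfer volumes only. The sketch assumes relation R must reach the site of relation S, ignores local processing cost, and uses made-up sizes.

```python
def ship_whole_cost(size_r):
    # Send all of R to the site that holds S.
    return size_r


def semijoin_cost(size_join_column_s, size_reduced_r):
    # Send S's join column to R's site, then send back only the matching R tuples.
    return size_join_column_s + size_reduced_r


def prefer_semijoin(size_r, size_join_column_s, size_reduced_r):
    """True when the semijoin strategy transfers less data than shipping R whole."""
    return semijoin_cost(size_join_column_s, size_reduced_r) < ship_whole_cost(size_r)


# Selective join: few R tuples match, so the semijoin pays off.
selective = prefer_semijoin(size_r=1000, size_join_column_s=50, size_reduced_r=200)

# Unselective join: almost all of R matches, so the extra round trip is wasted.
unselective = prefer_semijoin(size_r=1000, size_join_column_s=400, size_reduced_r=900)
```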

QUERY OPTIMIZATION PROCESS
QEP = Query Execution Plan
Search/Solution space
Set of possible solutions (query trees)

Cost model
E.g. I/O cost + CPU cost + communication cost

Search strategy
How do we move inside the search space?

Exhaustive search, heuristic algorithms

Deterministic, randomized
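The three ingredients (search space, cost model, search strategy) can be sketched for join ordering. The search space here is restricted to linear join orders, the cost model is the sum of intermediate result sizes, and the cardinalities and selectivities are hypothetical. The greedy function is one possible heuristic, not a specific published algorithm.

```python
from itertools import permutations


def plan_cost(order, cardinality, selectivity):
    """Cost model: sum of the intermediate result sizes of a linear join order."""
    size = float(cardinality[order[0]])
    cost = 0.0
    for i, rel in enumerate(order[1:], start=1):
        sel = 1.0
        for prev in order[:i]:
            # Pairs without a join predicate default to 1.0 (Cartesian product).
            sel *= selectivity.get(frozenset((prev, rel)), 1.0)
        size = size * cardinality[rel] * sel
        cost += size
    return cost


def exhaustive_order(relations, cardinality, selectivity):
    # Search strategy 1: enumerate every order in the search space.
    return min(permutations(relations),
               key=lambda o: plan_cost(o, cardinality, selectivity))


def greedy_order(relations, cardinality, selectivity):
    # Search strategy 2: start small, then always add the cheapest next relation.
    remaining = sorted(relations)
    order = [min(remaining, key=lambda r: cardinality[r])]
    remaining.remove(order[0])
    while remaining:
        nxt = min(remaining,
                  key=lambda r: plan_cost(order + [r], cardinality, selectivity))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)


RELATIONS = ["Student", "Project", "Course", "Grade"]
CARDINALITY = {"Student": 1000, "Project": 200, "Course": 50, "Grade": 5000}
SELECTIVITY = {
    frozenset(("Student", "Grade")): 1 / 1000,
    frozenset(("Course", "Grade")): 1 / 50,
    frozenset(("Student", "Project")): 1 / 1000,
}

best = exhaustive_order(RELATIONS, CARDINALITY, SELECTIVITY)
quick = greedy_order(RELATIONS, CARDINALITY, SELECTIVITY)
```

The exhaustive search examines n! orders and cannot do worse than the heuristic under the same cost model; the greedy search examines far fewer candidates but gives no optimality guarantee.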

NORMAL QUERY PROCESSING ISSUES
The optimizer needs sufficient knowledge about runtime conditions

Runtime conditions should remain stable during query execution

Good for systems with few data sources and a controlled environment
What about changing environments?
Or large numbers of data sources?
Unpredictable runtime conditions?

EXAMPLE: QEP WITH BLOCKED OPERATOR
• Student data cannot be accessed
• Whole process is blocked until regaining access
• What if this would be reorganized?

[Figure: query tree of three Join operators over the relations Student, Project, Course, and Grade]

ADAPTIVE QUERY PROCESSING
Receive information from the execution environment
Modify process accordingly
Communication between optimizer and runtime environment and other components

Additional components
Monitoring (statistics, data, network, cost), assessment, reaction
Embedded in control operators of QEP

Tradeoff between reactiveness and overhead of adaptation

Change schedule, replace operators, modify behaviour
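The monitor, assess, react cycle can be sketched with a scheduler that changes the order of work when a source is unavailable, instead of blocking on it. This is a toy simulation: the tick-based clock and the availability table are hypothetical, and the sketch ignores whether the reordered inputs can actually be joined in that order.

```python
from collections import deque


def run_adaptively(sources, available_at, max_ticks=100):
    """Process one source per tick, deferring sources that are not yet reachable."""
    pending = deque(sources)
    processed = []
    tick = 0
    while pending and tick < max_ticks:
        # Monitoring: which pending sources respond at this tick?
        ready = [s for s in pending if available_at.get(s, 0) <= tick]
        # Assessment and reaction: take the first reachable source, skip blocked ones.
        if ready:
            chosen = ready[0]
            pending.remove(chosen)
            processed.append(chosen)
        tick += 1
    return processed


# Student is unreachable until tick 2; a fixed plan would stall on it first.
order = run_adaptively(
    ["Student", "Project", "Course", "Grade"],
    available_at={"Student": 2},
)
```

Checking availability on every tick is the overhead side of the tradeoff; checking less often would make the plan slower to react.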

CONCLUSION ON QUERY PROCESSING
There are multiple ways to organize query processing
Centralized, distributed, hybrid

Queries need to be optimized automatically or manually

There are multiple methods of searching for an optimal solution


The organization of query processing has an impact

Most often the best option is not the “most optimal” solution

