Query Processing
Principles of Data Integration
AnHai Doan, Alon Halevy, Zachary Ives
Query Processing
To this point, we have seen the upper levels of the data integration problem:
- How to construct mappings from sources to a single mediated schema
- How queries posed over the mediated schema are reformulated over the sources
[Figure: a trivial mediated schema over sources Movie, Plays, and Reviews; requests flow down to the sources and query results stream back up.]
Query Processing for Integration:
Challenges Beyond those of a DBMS
- Source query capabilities and limited data statistics
  - Statistics on underlying data are often unavailable
  - Many sources aren't even databases!
- Input binding restrictions
  - May need to provide inputs before we can see matching data
- Variable network connectivity and source performance
  - Performance is not always predictable
- Need for fast initial answers
  - Emphasis is often on pipelined results, not overall query completion time – harder to optimize for!
Outline
- What makes query processing hard
- Review of DBMS query processing
- Review of distributed query processing
- Query processing for data integration
A DBMS Query Processor
[Figure: SQL query → Parsing → Query optimization (using statistics and index info) → executable query plan → Query execution over indices and relations → query results.]
Our Example Query in the DBMS Setting
[Figure: the example query in the DBMS setting, joining Movie and Plays.]
Query Optimization
Goal: compare all equivalent query expressions and
their low-level implementations (operators)
[Figure: equivalent physical plans, e.g., a hash join HJoin on title = movie.]
Relational DBMSs Set the Stage
1. An array of algorithms for relational operations – physical plans
2. Cost-based query optimization
   - Enumeration, possibly with pruning heuristics
   - Exploit principle of optimality
   - Also search interesting orders
[Figure: two equivalent physical plans for π title,startTime over Movie ⋈ Plays.]
The Basic Approach
- Enumerate all possible query expressions: "bushy" plan enumeration instead of the "left-linear" enumeration illustrated for relational DBMS optimization
  - Left-linear plans: ⋈ joins with single relations on the right
  - Bushy plans: ⋈ joins of arbitrary expressions
[Figure: a left-linear plan ((R ⋈ S) ⋈ T) ⋈ U versus a non-linear ("bushy") plan (R ⋈ S) ⋈ (T ⋈ U).]
- Use a different set of cost formulas for pushed-down operations, custom to the wrapper and the source
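The dynamic-programming enumeration of bushy plans can be sketched as follows. This is a minimal illustration, not the book's implementation: the cost model (sum of intermediate-result sizes, with a single assumed join selectivity) and the relation names and cardinalities are invented for the example, and cross products are not pruned.

```python
from itertools import combinations

def enumerate_bushy(relations, card, join_selectivity=0.01):
    """Dynamic-programming enumeration of bushy join plans.

    relations: list of relation names; card: name -> cardinality.
    Returns (best_plan, result_size, total_cost) for the full join,
    where a plan is a nested tuple and cost is the sum of
    intermediate-result sizes (a deliberately simple cost model).
    """
    best = {}  # frozenset of relations -> (plan, result_size, cost)
    for r in relations:
        best[frozenset([r])] = (r, card[r], 0)

    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            for k in range(1, size):
                # Bushy: allow *any* split into two non-empty halves,
                # not just (subset minus one relation) x (one relation).
                for left in combinations(subset, k):
                    l, r = frozenset(left), s - frozenset(left)
                    lp, lsize, lcost = best[l]
                    rp, rsize, rcost = best[r]
                    out = lsize * rsize * join_selectivity
                    cost = lcost + rcost + out
                    # Principle of optimality: keep only the cheapest
                    # plan for each subset of relations.
                    if s not in best or cost < best[s][2]:
                        best[s] = ((lp, rp), out, cost)
    return best[frozenset(relations)]

plan, _, cost = enumerate_bushy(
    ["Movies", "Plays", "Reviews"],
    {"Movies": 10_000, "Plays": 50_000, "Reviews": 200_000})
```

Because every subset keeps only its cheapest plan, the search stays exponential in the number of relations rather than in the number of plan trees.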
Changes to Query Execution
Internet-based sources present challenges:
Slow and unpredictable data transfer rates
Source failures
The pipelined (symmetric) hash join keeps a hash table on each input, with work scheduled by the CPU based on data availability. For each arriving R or S tuple:
1. Read the tuple from its input
2. Add the tuple to that input's hash table (build)
3. Probe against the opposite input's hash table on R.x = S.y, emitting any matches
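The three steps above can be sketched as a small operator; this is a minimal in-memory illustration (tuples as dicts, a hand-rolled driver), not a full executor.

```python
from collections import defaultdict

class SymmetricHashJoin:
    """Pipelined (symmetric) hash join on R.x = S.y.

    Each arriving tuple is (1) read from its input, (2) inserted into
    that input's hash table, and (3) probed against the opposite
    input's hash table; matches are emitted immediately, so results
    stream out without waiting for either input to finish.
    """
    def __init__(self, left_key, right_key):
        self.keys = {"left": left_key, "right": right_key}
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def push(self, side, tup):
        other = "right" if side == "left" else "left"
        k = tup[self.keys[side]]
        self.tables[side][k].append(tup)        # step 2: build
        for match in self.tables[other][k]:     # step 3: probe
            l, r = (tup, match) if side == "left" else (match, tup)
            yield {**l, **r}

join = SymmetricHashJoin(left_key="x", right_key="y")
out = []
# Tuples may arrive interleaved, in any order, from either input.
for side, tup in [("left",  {"x": 1, "a": "r1"}),
                  ("right", {"y": 1, "b": "s1"}),
                  ("right", {"y": 2, "b": "s2"}),
                  ("left",  {"x": 2, "a": "r2"})]:
    out.extend(join.push(side, tup))
```

Note that each result is produced as soon as its second tuple arrives, regardless of which input delivered it, which is exactly what makes the operator robust to slow or bursty sources.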
Pipelined Join Plan
[Figure: a pipelined join plan. Movies with σ actor="Morgan Freeman" joins Reviews with σ avgStars≥3 on M.title = R.movie; the Movies-Reviews result then joins Plays with σ location="New York" on M.title = P.movie.]
Other Pipelined Operators
- In general, all algorithms will be hash-based as opposed to index- or sort-based
  - This supports pipelined execution in a networked setting
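The same pattern applies beyond joins. As one hedged example (not from the source), duplicate elimination can be done with a hash set that emits each tuple the first time it is seen, instead of sorting, which would block the pipeline until all input had arrived:

```python
def pipelined_distinct(tuples):
    """Hash-based duplicate elimination: emit each tuple on first
    arrival rather than sorting, so output streams immediately."""
    seen = set()
    for tup in tuples:
        key = tuple(sorted(tup.items()))  # hashable form of a dict tuple
        if key not in seen:
            seen.add(key)
            yield tup

rows = [{"title": "Up"}, {"title": "Up"}, {"title": "Her"}]
unique = list(pipelined_distinct(rows))
```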
An Adaptive Query Processor
[Figure: query → Parsing → Query (re)optimization (using statistics & costs and source descriptions) → incremental query plan → Query execution over the data sources, with execution feedback flowing back into re-optimization.]
Challenges of Adaptive Query Processing
Generally a balance among many factors:
- How much information to collect – exploration vs. exploitation
- Analysis time vs. execution time – more optimization results in better plans, but less time to spend running them
- Space of adaptations – how much can we change execution?
- Short-term vs. long-term – balancing between making the best progress now, and incurring state that might cost a lot later
We divide adaptive techniques into event-driven and performance-driven types
Event-Driven Adaptivity
Consider changing the plan at designated points where the query processor is stopped:
- When a source fails or times out
- When a segment of the query plan completes and is materialized
  - … and we verify at runtime that the materialized result size is significantly different from expectations
  - on timeout(wrapper1, 10msec) …
There can be multiple kinds of responses:
- Find an alternative source
  - on timeout(wrapper1, 10msec) if true then activate(coll1, A)
- Reschedule (suspend until later) a subplan or operator
  - on timeout(wrapper1, 10msec) if true then reschedule(op2)
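Event-condition-action rules like the ones above can be represented as small objects dispatched when an event fires. This is an illustrative sketch only: the rule syntax, and the `activate`/`reschedule` actions (here just logged as strings), are taken from the slide examples, while the class and dispatch loop are invented for the example.

```python
class Rule:
    """Event-condition-action rule:
    on <event>, if <condition>, then <action>.
    Models rules such as
    `on timeout(wrapper1, 10msec) if true then activate(coll1, A)`."""
    def __init__(self, event, condition, action):
        self.event, self.condition, self.action = event, condition, action

    def fire(self, event):
        if event == self.event and self.condition():
            self.action()
            return True
        return False

log = []
rules = [
    # If wrapper1 times out, switch to an alternate source...
    Rule(("timeout", "wrapper1"), lambda: True,
         lambda: log.append("activate(coll1, A)")),
    # ...and reschedule the operator that was consuming from it.
    Rule(("timeout", "wrapper1"), lambda: True,
         lambda: log.append("reschedule(op2)")),
]

def on_timeout(source):
    """Dispatch a timeout event to every matching rule."""
    for rule in rules:
        rule.fire(("timeout", source))

on_timeout("wrapper1")
```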
How Do We Generate the Rules?
- For source failures, we need a partial ordering of data sources with alternates
  - The optimizer can generate rules to access the alternates if the primaries are down
Re-Optimization
In an event-driven setting, the optimizer only gets re-
run if it looks likely that costs for the remaining plan
will be significantly different
[Figure: the example plan with its operators labeled o1–o5: the selections σ avgStars≥3 over Reviews, σ actor="Morgan Freeman" over Movies, and σ location="New York" over Plays, the join ⋈ M.title=R.movie, and the join ⋈ M.title=P.movie.]
The new plan “resumes” from the input streams that the
old plan was using
Ultimately we need to combine data across these plans
(phases) – this is what we call the stitch-up phase
Corrective Query Processing
[Figure: corrective query processing in three phases. Phase 0 runs a plan over partitions M0, P0, R0 with PHJoin M.title=P.movie below PHJoin M.title=R.movie; Phase 1 switches to a different join order over partitions M1, P1, R1, with the same selections σ actor="Morgan Freeman", σ avgStars≥3, and σ location="New York". The stitch-up phase joins across phases with StJoin M.title=R.movie and applies EXCEPT to remove the results M0P0R0 and M1P1R1 already produced by the earlier phases.]
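The stitch-up computation can be sketched as follows. This is a minimal illustration under assumed names, not the actual operator: each phase fully joined its own partitions, so the stitch-up phase computes the full join over all partitions and subtracts the per-phase results (the EXCEPT in the plan), leaving exactly the cross-phase combinations.

```python
def join(xs, ys, kx, ky):
    """Simple hash join of two lists of dicts on xs.kx = ys.ky."""
    index = {}
    for y in ys:
        index.setdefault(y[ky], []).append(y)
    return [{**x, **y} for x in xs for y in index.get(x[kx], [])]

def stitch_up(m_parts, p_parts, r_parts):
    """Corrective query processing, sketched: phase i produced
    Mi x Pi x Ri; the stitch-up phase computes the full join over all
    partitions and removes those per-phase results (EXCEPT)."""
    M = [t for part in m_parts for t in part]
    P = [t for part in p_parts for t in part]
    R = [t for part in r_parts for t in part]
    full = join(join(M, P, "title", "movie"), R, "title", "movie")
    phase = []
    for Mi, Pi, Ri in zip(m_parts, p_parts, r_parts):
        phase += join(join(Mi, Pi, "title", "movie"), Ri, "title", "movie")
    return [t for t in full if t not in phase]

# Two phases, each with its own partition of Movies, Plays, Reviews.
m_parts = [[{"title": "Up"}], [{"title": "Her"}]]
p_parts = [[{"movie": "Up", "loc": "NY"}], [{"movie": "Her", "loc": "LA"}]]
r_parts = [[{"movie": "Her", "stars": 4}], [{"movie": "Up", "stars": 5}]]
extra = stitch_up(m_parts, p_parts, r_parts)
```

In this toy data every result happens to span partitions from different phases, so the per-phase plans produced nothing and the stitch-up phase produces all of the answers; in general it only fills in the combinations the phases missed.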