
Osprey

MapReduce-style fault tolerance


in a shared-nothing distributed database

ICDE 2010
March 3, 2010

Christine Yen
with Christopher Yang, Ceryen Tan, Samuel Madden
Motivation
• Existing distributed DBMSes perform poorly
with long-running, OLAP-style queries
• Typical warehouses have hundreds or
thousands of nodes
• Osprey:
• tolerate & recover from mid-query faults
• linear speedup
High-level approach
• Replication of data and re-execution of hung jobs, loosely inspired by the MapReduce approach
• Most clusters are heterogeneous and/or built from commodity hardware
• Workers/replicas are just database instances
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Osprey - the idea
• Divide and conquer - running partial queries
on intermediate nodes
• Minimal custom software on intermediate
nodes
• "Chained Declustering" (Hsiao & DeWitt,
DE’90) as a data replication technique
• Experiment with variety of scheduling
algorithms
Architecture
[Figure: the USER sends a QUERY to the COORDINATOR, which dispatches SUBQUERIES to WORKER 1, WORKER 2, ..., WORKER n - each an unmodified POSTGRES instance - and merges the returned RESULTS for the user]


Query execution
[Figure: inside the coordinator, the QUERY TRANSFORMER splits the QUERY into subqueries and places one per partition (A, B, C, D) on the WORK QUEUES; the SCHEDULER hands subqueries to EXECUTION, and the RESULTS MERGER combines the subresults into the final RESULT]
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Data Model
• Assume a star schema
• Partition the fact table across n machines (n partitions)
• Each partition is made up of m chunks (each chunk is a physical table)
• e.g. chunk 7 of partition 2 of table orders would be orders_part_2_chunk_7
• Each of the n partitions is replicated k times across k machines (k replicas)
[Figure: partitions 1..n, each divided into chunks 1..m, with k replicas of every partition]
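To make the naming convention concrete, here is a minimal Python sketch (the helper names and example values are illustrative, not Osprey's actual code) that produces the physical chunk-table names used above:

# Sketch of the partition/chunk naming convention described above.
# Helper names and example values are illustrative only.

def chunk_table_name(fact_table: str, partition: int, chunk: int) -> str:
    """Physical table name for one chunk, e.g. orders_part_2_chunk_7."""
    return f"{fact_table}_part_{partition}_chunk_{chunk}"

def all_chunk_tables(fact_table: str, n_partitions: int, m_chunks: int) -> list[str]:
    """All n * m physical chunk tables of a partitioned fact table."""
    return [
        chunk_table_name(fact_table, p, c)
        for p in range(1, n_partitions + 1)
        for c in range(1, m_chunks + 1)
    ]

print(chunk_table_name("orders", 2, 7))          # orders_part_2_chunk_7
print(len(all_chunk_tables("orders", 4, 100)))   # 400 physical tables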
Replication - Chained declustering
• Partition i is stored on Worker i
• k backups of Partition i are stored on
workers i+1, ..., i+k (% n)
• Partition i is only unavailable if the whole chain fails
[Figure: n = 4, k = 1 - WORKER 1-4 each hold two partitions (P1 & P2, P2 & P3, P3 & P4, P4 & P1), so adjacent workers share a copy and the replicas form a chain]
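As a rough illustration of the placement rule in the bullets above, the following Python sketch (1-based numbering to match the slides; the function name is made up) lists the chain of workers holding each partition for n = 4, k = 1:

# Minimal sketch of chained declustering as stated above: partition i lives on
# worker i, with its k backups chained onto workers i+1, ..., i+k (mod n).

def chain_for_partition(i: int, n: int, k: int) -> list[int]:
    """Workers holding a copy of partition i: the primary first, then the k backups."""
    return [((i - 1 + j) % n) + 1 for j in range(k + 1)]

n, k = 4, 1
for p in range(1, n + 1):
    print(f"partition {p} is stored on workers {chain_for_partition(p, n, k)}")
# partition 1 -> [1, 2], partition 2 -> [2, 3], partition 3 -> [3, 4], partition 4 -> [4, 1];
# a partition becomes unavailable only if every worker in its chain fails.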
Replication - Coordinator
[Figure: the COORDINATOR keeps one Partition Work Queue per partition (PWQ 1-4); with n = 4, k = 1, each of WORKER 1-4 stores two partitions and can therefore serve two of the queues]
• Coordinator’s perspective: each worker...


• has access to k+1 Partition Work Queues
• can only execute queries on local partitions
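A hedged sketch of what this looks like from the coordinator's side: one work queue per partition, and an idle worker is only handed subqueries from the k+1 queues it can serve locally. The queue contents and the simple "first non-empty local queue" policy below are placeholders; the actual scheduling policies are discussed later in the talk.

# Sketch of the coordinator's bookkeeping: one Partition Work Queue (PWQ) per
# partition, and a worker may only be handed subqueries for partitions it stores.
from collections import deque

n, k = 4, 1
pwqs = {p: deque([f"subquery_p{p}_c{c}" for c in range(1, 4)]) for p in range(1, n + 1)}

def local_partitions(worker: int) -> list[int]:
    # Worker w holds a copy of partition p if w is in p's chain (p, p+1, ..., p+k mod n).
    return [p for p in range(1, n + 1)
            if worker in [((p - 1 + j) % n) + 1 for j in range(k + 1)]]

def next_subquery(worker: int):
    """Hand the idle worker a subquery from one of its local PWQs, if any remain."""
    for p in local_partitions(worker):
        if pwqs[p]:
            return pwqs[p].popleft()
    return None

print(local_partitions(1))   # the k+1 = 2 partitions worker 1 can serve
print(next_subquery(1))      # a subquery drawn from one of those PWQs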
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Query Transformation
• We can take the original query, run it for
each chunk, and then combine all results to
re-assemble the full result set
• Overriding assumption: the fact table is much larger than the dimension tables (deg(fact table) >> deg(dimension tables))
• Joins between partitions
Query Transformation Example
a generic query:
SELECT key_1, key_2
FROM fact_table, dim_1, dim_2
WHERE key_1 BETWEEN 0 AND 10
<... join conditions ...>

is the union of n*m of these:
SELECT key_1, key_2
FROM fact_table_part_n_chunk_m, dim_1, dim_2
WHERE key_1 BETWEEN 0 AND 10
<... join conditions ...>

where fact_table_part_n_chunk_m is chunk m in partition n
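A minimal Python sketch of this rewrite, under the simplifying assumption that the fact table name appears as a plain identifier in the query text (a real implementation would rewrite the parsed query, not the string):

# Sketch of subquery generation: substitute the fact table name with each chunk's
# physical table name, yielding n * m subqueries whose union is the original result.
import re

def make_subqueries(query: str, fact_table: str, n: int, m: int) -> list[str]:
    pattern = re.compile(rf"\b{re.escape(fact_table)}\b")
    return [
        pattern.sub(f"{fact_table}_part_{p}_chunk_{c}", query)
        for p in range(1, n + 1)
        for c in range(1, m + 1)
    ]

query = ("SELECT key_1, key_2 "
         "FROM fact_table, dim_1, dim_2 "
         "WHERE key_1 BETWEEN 0 AND 10")

subqueries = make_subqueries(query, "fact_table", n=2, m=3)
print(len(subqueries))   # 6 subqueries (n * m)
print(subqueries[0])     # ... FROM fact_table_part_1_chunk_1, dim_1, dim_2 ...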


Aggregates
• Common in analytical database loads:
SUM, COUNT, MIN, MAX, AVG...
• Most can simply be "collected up" as intermediate results are unioned (e.g. MIN)
• Some require partial pre-aggregation (e.g. AVG, recomputed from partial SUMs and COUNTs)
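To make the AVG case concrete, here is a sketch of how partial aggregates might be merged as subresults arrive; the subresult layout below is an assumption for illustration, not Osprey's actual format:

# Merging partial aggregates as subresults are unioned. MIN/MAX/SUM/COUNT combine
# directly; AVG is rewritten so each subquery returns (SUM, COUNT) and the final
# average is computed once all subresults are in.
subresults = [
    {"min": 3, "sum": 120.0, "count": 10},
    {"min": 7, "sum":  45.0, "count":  5},
    {"min": 1, "sum":  60.0, "count":  6},
]

merged_min  = min(r["min"] for r in subresults)      # MIN of the per-chunk MINs
total_sum   = sum(r["sum"] for r in subresults)      # SUM of the per-chunk SUMs
total_count = sum(r["count"] for r in subresults)    # SUM of the per-chunk COUNTs
merged_avg  = total_sum / total_count                # AVG recomputed at the end

print(merged_min, merged_avg)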
Query Execution Recap
• Subqueries are generated by replacing the fact table name with the appropriate partition/chunk table name
• Rewrite aggregate functions where necessary
• Collect results as they are returned and compute aggregates appropriately
[Figure: QUERY → TRANSFORMER → SUBQUERIES → EXECUTION → SUBRESULTS → RESULTS MERGER → RESULT]
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Scheduling
• Job scheduling in distributed systems is a
well-studied problem
• Coordinator schedules subqueries
• Chained declustering => each worker can
work on several partitions
• Assigns subqueries to workers as they
become available
Scheduling Implementations
• Random
• Longest Queue First
• Majorization (Golubchik et al, RIDE-
TQP'92)
• "vector majorization" to minimize the
difference in PWQ lengths
• picks subqueries "for the greater good"
LQF and Majorization
[Figure: workers W1-W4 drawing from PWQ 1-4, with example loads L(W1) = 5, L(W2) = 5, L(W3) = 3, L(W4) = 3]
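As a rough sketch of the LQF policy (the queue contents and the worker-to-partition eligibility map below are illustrative, following a chained-declustering layout with k = 1):

# Longest Queue First: an idle worker takes a subquery from the longest
# Partition Work Queue among those it is eligible to serve.
from collections import deque

pwqs = {1: deque(["q1a", "q1b", "q1c"]), 2: deque(["q2a", "q2b"]),
        3: deque(["q3a"]),               4: deque(["q4a", "q4b"])}

# With k = 1, each worker can serve its own partition plus one neighbor's.
eligible = {1: [1, 4], 2: [1, 2], 3: [2, 3], 4: [3, 4]}

def lqf_dispatch(worker: int):
    """Pick the longest non-empty eligible queue and return one subquery from it."""
    candidates = [p for p in eligible[worker] if pwqs[p]]
    if not candidates:
        return None
    longest = max(candidates, key=lambda p: len(pwqs[p]))
    return pwqs[longest].popleft()

print(lqf_dispatch(1))   # 'q1a' - PWQ 1 (length 3) beats PWQ 4 (length 2)
# Majorization instead chooses so that the remaining PWQ lengths stay as even as
# possible overall, rather than what is best for the requesting worker alone.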
Handling Stragglers
• Compensating for straggling workers
• Assigned subqueries are marked as "active"
• An "active" subquery can still be assigned to another worker
• Upon completion, a subquery is marked "complete" and stored with its result
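A hedged sketch of this bookkeeping (the state names and dictionaries below are assumptions for illustration):

# Straggler handling: a dispatched subquery is only marked "active", so it may be
# re-assigned to another worker; the first completion wins and later duplicates
# for the same subquery are ignored.
status  = {}   # subquery id -> "active" | "complete"
results = {}   # subquery id -> result rows

def dispatch(subquery_id: str, worker: int) -> None:
    if status.get(subquery_id) != "complete":
        status[subquery_id] = "active"
        print(f"{subquery_id} assigned to worker {worker}")

def complete(subquery_id: str, result) -> None:
    if status.get(subquery_id) == "complete":
        return                            # a faster copy already finished; drop it
    status[subquery_id] = "complete"
    results[subquery_id] = result

dispatch("p1_chunk_7", worker=3)               # original assignment
dispatch("p1_chunk_7", worker=1)               # re-assigned while worker 3 straggles
complete("p1_chunk_7", result=[("key", 42)])
complete("p1_chunk_7", result=[("key", 42)])   # duplicate from the straggler, ignored
print(status["p1_chunk_7"], len(results))      # complete 1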
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Experimental Setup
• Star Schema Benchmark (O’Neil et al, ‘07)
• Data generation (5.5 GB, 6M rows) and
set of queries
• Added indices on commonly joined fields, clustered on the date attribute
• 9 commodity workstations (1 coordinator,
8 workers)
Results - Scalability
[Figure: total computation time (mins) vs. number of workers (n = 1, 4, 8), Osprey compared with ideal speedup]
Results - Load Balancing
[Figure: "Load Balance Over Time" - relative fraction of total computation time for Worker 1, Worker 2, and Worker 3 over 0-300 s]


Results - Load Balancing
[Figure: test completion time (mins) vs. stress (s), k = 0]


Results - Load Balancing
[Figure: test completion time (mins) vs. stress (s), comparing k = 0 and k = 1]


Results - Overhead
[Figure: completion time per chunk (mins) vs. number of chunks per partition (m = 10, 100, 1000), broken down into test completion, SQL exec time, and Osprey overhead]
Future
• Self- or nested joins
• Exploration of query schedulers - take
advantage of database caches on workers?

• Improve estimate of “work” left in PWQ


Contributions
• Linear speedup, acceptable overhead
• Load balancing and fault tolerance provided by:
• Dividing up a query into subqueries running in
parallel
• Greedy workers and chained-declustering
replication
• Re-execution of straggler subqueries
• MapReduce strategies + DBMS optimizations + limited
specialized software stack

Questions?
