Osprey

MapReduce-style fault tolerance in a shared-nothing distributed database

ICDE 2010
March 3, 2010

Christine Yen
with Christopher Yang, Ceryen Tan, Samuel Madden
Motivation
• Existing distributed DBMSes perform poorly
with long-running, OLAP-style queries
• Typical warehouses have hundreds or
thousands of nodes
• Osprey:
• tolerate & recover from mid-query faults
• linear speedup
High-level approach
• Replication of data, re-running hung jobs
loosely inspired by MapReduce approach
• Most clusters are heterogeneous and/or
built off of commodity hardware
• Workers/replicas are just database instances
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Osprey - the idea
• Divide and conquer - running partial queries
on intermediate nodes
• Minimal custom software on intermediate
nodes
• "Chained Declustering" (Hsiao & DeWitt,
DE’90) as a data replication technique
• Experiment with variety of scheduling
algorithms
Architecture
[Figure: the USER sends a QUERY to the COORDINATOR and receives the RESULT; the COORDINATOR sends SUBQUERIES to WORKER 1, WORKER 2, ..., WORKER n and collects their RESULTS; each worker runs a POSTGRES instance]
Query execution
[Figure: components shown are QUERY, QUERY TRANSFORMER, SUBQUERIES, SCHEDULER, WORK QUEUES (PARTITION A to PARTITION D), EXECUTION, RESULTS MERGER, RESULT]
Data Model
• Assume star schema
• Partition the fact table across n machines
• Each partition is m chunks (each chunk is a physical table)
• e.g. Chunk 7 of Partition 2 of table orders would be
orders_part_2_chunk_7
• Each of n partitions replicated k times
across k machines
[Figure: n partitions (PARTITION 1 ... PARTITION n), each made of CHUNK 1 ... CHUNK m, with k replicas]
Replication - Chained declustering
• Partition i is stored on Worker i
• k backups of Partition i are stored on
workers i+1, ..., i+k (% n)
• Partition i is only unavailable if the whole
chain fails
[Figure: n = 4, k = 1. Partitions shown under WORKER 1 to WORKER 4, left to right: P1 P2 | P2 P3 | P3 P4 | P4 P1]
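The placement rule stated in the bullets above can be sketched in a few lines of Python. This is only an illustration under assumptions: workers and partitions are numbered 1..n, the "% n" wrap-around is applied to that 1-based numbering, k is smaller than n, and the names `replica_workers` and `is_available` are hypothetical helpers for this sketch, not part of Osprey.

```python
def replica_workers(i, n, k):
    """Workers holding partition i: the primary (worker i) followed by
    k backups on workers i+1, ..., i+k, wrapping around modulo n.
    Assumes 1-based numbering and k < n (hypothetical helper)."""
    return [((i - 1 + j) % n) + 1 for j in range(k + 1)]


def is_available(i, n, k, failed):
    """Partition i stays available while any worker in its chain is up."""
    return any(w not in failed for w in replica_workers(i, n, k))


if __name__ == "__main__":
    n, k = 4, 1
    for i in range(1, n + 1):
        print(f"partition {i}: workers {replica_workers(i, n, k)}")
    # Losing one worker of the chain leaves the partition reachable;
    # losing the whole chain does not.
    print(is_available(4, n, k, failed={4}))
    print(is_available(4, n, k, failed={4, 1}))
```

With k backups, a partition is lost only when all k+1 workers in its chain are down at once, which is the availability property the last bullet describes.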
Replication - Coordinator
[Figure: the COORDINATOR holds PWQ 1, PWQ 2, PWQ 3, PWQ 4; below it, WORKER 1 to WORKER 4 hold P1 P2 | P2 P3 | P3 P4 | P4 P1]
• Coordinator’s perspective: each worker...
• has access to k+1 Partition Work Queues
• can only execute queries on local partitions
Query Transformation
• We can take the original query, run it for
each chunk, and then combine all results to
re-assemble the full result set
• Overriding assumption: deg(fact table) >>
deg(dimension tables)
• Joins between partitions
Query Transformation Example
a generic query:
SELECT key_1, key_2
FROM fact_table, dim_1, dim_2
WHERE key_1 BETWEEN 0 AND 10
<... join conditions ...>

is the union of n*m of these:
SELECT key_1, key_2
FROM fact_table_part_n_chunk_m,
dim_1, dim_2
WHERE key_1 BETWEEN 0 AND 10
<... join conditions ...>

where fact_table_part_n_chunk_m is chunk m in partition n
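A minimal sketch of this rewrite, assuming the transformation is a purely textual swap of the fact table name for each chunk table name. The function `make_subqueries` is hypothetical, and a real transformer would need to handle aliases, quoted identifiers and string literals, which this sketch does not.

```python
import re


def make_subqueries(query, fact_table, n, m):
    """Return n*m subqueries, one per (partition, chunk), by replacing the
    fact table name with <fact_table>_part_<p>_chunk_<c>.
    Hypothetical helper: plain text substitution, no SQL parsing."""
    # \b keeps the match to the whole identifier, since "_" is a word character.
    pattern = re.compile(r"\b" + re.escape(fact_table) + r"\b")
    subqueries = []
    for part in range(1, n + 1):
        for chunk in range(1, m + 1):
            chunk_table = f"{fact_table}_part_{part}_chunk_{chunk}"
            subqueries.append(pattern.sub(chunk_table, query))
    return subqueries


if __name__ == "__main__":
    query = ("SELECT key_1, key_2 FROM fact_table, dim_1, dim_2 "
             "WHERE key_1 BETWEEN 0 AND 10")
    for sq in make_subqueries(query, "fact_table", n=2, m=2):
        print(sq)
```

The dimension tables are left untouched in every subquery; only the fact table reference changes.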


Aggregates
• Common in analytical database workloads:
SUM, COUNT, MIN, MAX, AVG...
• Most can simply be "collected up" as
intermediate results are union-ed (e.g.
MIN)
• Some need partial pre-aggregation (e.g. AVG,
rebuilt from per-chunk SUM and COUNT)
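To make the distinction concrete, here is a small sketch of merging per-chunk partial aggregates. It assumes each subquery returns MIN, MAX, SUM and COUNT for a non-empty chunk, and that AVG is recovered at merge time as total SUM over total COUNT. The dictionary layout and the name `merge_partials` are illustrative assumptions, not Osprey's actual result format.

```python
def merge_partials(partials):
    """Combine per-chunk partial aggregates into final values.
    Each partial is a dict with keys 'min', 'max', 'sum', 'count'
    (hypothetical layout; chunks are assumed non-empty)."""
    total_sum = sum(p["sum"] for p in partials)
    total_count = sum(p["count"] for p in partials)
    return {
        "min": min(p["min"] for p in partials),   # MIN of per-chunk MINs
        "max": max(p["max"] for p in partials),   # MAX of per-chunk MAXes
        "sum": total_sum,                         # SUM of per-chunk SUMs
        "count": total_count,                     # SUM of per-chunk COUNTs
        # AVG cannot be merged directly; rebuild it from SUM and COUNT.
        "avg": total_sum / total_count if total_count else None,
    }


if __name__ == "__main__":
    # Two chunks holding the values [1, 2, 3] and [10].
    chunk_a = {"min": 1, "max": 3, "sum": 6, "count": 3}
    chunk_b = {"min": 10, "max": 10, "sum": 10, "count": 1}
    print(merge_partials([chunk_a, chunk_b]))
```

In the usage example, averaging the two per-chunk averages (2 and 10) would give 6, while the true average of the four values is 4; that gap is why AVG needs the pre-aggregation step.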
Query Execution Recap
• Subqueries generated by replacing the name of the fact table
with the name of the appropriate partition chunk table
• Rewrite aggregate functions where necessary
• Collect results as they return and calculate aggregates
appropriately
[Figure: QUERY -> QUERY TRANSFORMER -> SUBQUERIES -> EXECUTION -> SUBRESULTS -> RESULTS MERGER -> RESULT]
Scheduling
• Job scheduling in distributed systems is a
well-studied problem
• Coordinator schedules subqueries
• Chained declustering => each worker can
work on several partitions
• Assigns subqueries to workers as they
become available
Scheduling Implementations
• Random
• Longest Queue First
• Majorization (Golubchik et al., RIDE-TQP'92)
• "vector majorization" to minimize the
difference in PWQ lengths
• picks subqueries "for the greater good"
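The first two policies can be sketched as follows, assuming the coordinator keeps one queue per partition (a PWQ) and that a free worker may only take work from the PWQs of partitions it stores locally. The function names and the queue representation are assumptions made for this sketch. Majorization is not shown because its details are not given here.

```python
import random
from collections import deque


def pick_random(queues, local_partitions, rng=random):
    """Random policy: take the next subquery from a randomly chosen
    non-empty PWQ among the worker's local partitions."""
    candidates = [p for p in local_partitions if queues[p]]
    if not candidates:
        return None
    p = rng.choice(candidates)
    return p, queues[p].popleft()


def pick_lqf(queues, local_partitions):
    """Longest Queue First: take the next subquery from the longest
    non-empty PWQ among the worker's local partitions."""
    candidates = [p for p in local_partitions if queues[p]]
    if not candidates:
        return None
    p = max(candidates, key=lambda q: len(queues[q]))
    return p, queues[p].popleft()


if __name__ == "__main__":
    # Hypothetical state: a worker that stores partitions 1 and 2.
    queues = {1: deque(["sq1", "sq2", "sq3"]), 2: deque(["sq4"])}
    print(pick_lqf(queues, [1, 2]))     # draws from PWQ 1, the longer queue
    print(pick_random(queues, [1, 2]))  # draws from either non-empty queue
```

Draining the longest accessible queue first tends to even out the remaining work across partitions, which is the same goal the majorization bullet describes in terms of PWQ lengths.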
LQF and Majorization
[Figure: PWQ 1, PWQ 2, PWQ 3, PWQ 4, PWQ 1 drawn above workers W1 to W4, with L(W1) = 5, L(W2) = 5, L(W3) = 3, L(W4) = 3]
Handling Stragglers
• Compensating for straggling workers
• Assigned subqueries marked as “active”
• Still possible to be assigned to another
worker
• Upon completion, marked as “complete”
and stored with result
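The bookkeeping in these bullets can be sketched as a small state tracker. The policy of handing out unassigned subqueries first and re-issuing "active" ones only afterwards is an assumption of this sketch, as are the class name `SubqueryTracker` and the rule that the first result to arrive is the one kept.

```python
PENDING, ACTIVE, COMPLETE = "pending", "active", "complete"


class SubqueryTracker:
    """Tracks subquery states so that a slow worker cannot stall the query
    (hypothetical sketch, not Osprey's implementation)."""

    def __init__(self, subquery_ids):
        self.state = {s: PENDING for s in subquery_ids}
        self.results = {}

    def assign(self):
        """Hand out a pending subquery if one exists; otherwise re-issue an
        active one, since active subqueries may still go to another worker."""
        for wanted in (PENDING, ACTIVE):
            for s, st in self.state.items():
                if st == wanted:
                    self.state[s] = ACTIVE
                    return s
        return None  # everything is complete

    def complete(self, s, result):
        """Mark the subquery complete and store its result; a duplicate
        result arriving later from a straggler is ignored."""
        if self.state[s] != COMPLETE:
            self.state[s] = COMPLETE
            self.results[s] = result

    def done(self):
        return all(st == COMPLETE for st in self.state.values())


if __name__ == "__main__":
    t = SubqueryTracker(["q1", "q2"])
    print(t.assign(), t.assign())  # both subqueries become active
    print(t.assign())              # an active subquery is issued again
    t.complete("q1", "rows-1")
    t.complete("q2", "rows-2")
    print(t.done())
```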
Experimental Setup
• Star Schema Benchmark (O’Neil et al., ’07)
• Data generation (5.5 GB, 6M rows) and
set of queries
• Added indices on commonly joined
fields, clustered around date property
• 9 commodity workstations (1 coordinator,
8 workers)
Results - Scalability
[Figure: total computation time (mins, axis 0 to 170) against number of workers n (1, 4, 8); series: Osprey, Ideal Speedup]
Results - Load Balancing
[Figure: Load Balance Over Time. Relative fraction of total computation time (axis 0 to 0.6) against time (0 to 300 s); series: Worker 1, Worker 2, Worker 3]


Results - Load Balancing
[Figure: test completion time (mins, axis 0 to 50) against stress (s), 0 to 3; series: k=0, k=1]


Results - Overhead
[Figure: completion time per chunk (mins) against number of chunks per partition m (10, 100, 1000); series: Test Completion, SQL Exec Time, Osprey Overhead]
Future
• Self- or nested joins
• Exploration of query schedulers - take
advantage of database caches on workers?

• Improve estimate of “work” left in PWQ


Contributions
• Linear speedup, acceptable overhead
• Load balancing and fault tolerance provided by:
• Dividing up a query into subqueries running in
parallel
• Greedy workers and chained-declustering
replication
• Re-execution of straggler subqueries
• MapReduce strategies + DBMS optimizations + limited
specialized software stack

Questions?