Osprey

MapReduce-style fault tolerance in a shared-nothing distributed database
ICDE 2010, March 3, 2010
Christine Yen
with Christopher Yang, Ceryen Tan, Samuel Madden

Motivation
• Existing distributed DBMSes perform poorly with long-running, OLAP-style queries
• Typical warehouses have hundreds or thousands of nodes
• Osprey:
  • tolerate & recover from mid-query faults
  • linear speedup

High-level approach
• Most clusters are heterogeneous and/or built from commodity hardware
• Loosely inspired by the MapReduce approach: replication of data, re-running hung jobs
• Workers/replicas are just database instances

Outline
• Background • Osprey • Data model (partitioning) • Executing queries on partitioned data • Scheduling • Results

Osprey - the idea
• Divide and conquer: running partial queries on nodes
• Minimal custom software on intermediate nodes
• "Chained Declustering" (Hsiao & DeWitt, ICDE '90) as a data replication technique
• Experiment with a variety of scheduling algorithms

Architecture
[Diagram: the USER sends a QUERY to the COORDINATOR and receives RESULTS; the COORDINATOR sends SUBQUERIES to and collects SUBRESULTS from WORKER 1 ... WORKER n, each running POSTGRES.]

Execution
[Diagram: a QUERY passes through the QUERY TRANSFORMER, which produces SUBQUERIES; the SCHEDULER places them on WORK QUEUES for EXECUTION against PARTITIONS A-D; the RESULTS MERGER combines the subresults into the final RESULT.]

Outline
• Background • Osprey • Data model (partitioning) • Executing queries on partitioned data • Scheduling • Results

Data Model
• Assume a star schema
• Partition the fact table across n machines
• Each partition is m chunks (each a physical table)
  • e.g. Chunk 7 of Partition 2 of table orders would be orders_part_2_chunk_7
• Each of the n partitions is replicated k times across k machines
[Diagram: n partitions, each split into CHUNK 1 ... CHUNK m, with k replicas of each partition.]
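A minimal Python sketch of this chunk naming and layout, following the pattern shown above; the helper names are illustrative, not Osprey's actual code:

# Minimal sketch of the chunk naming/layout above; helper names are
# illustrative, not Osprey's actual code.

def chunk_table_name(table, partition, chunk):
    """Physical table holding one chunk of one partition of a fact table."""
    return f"{table}_part_{partition}_chunk_{chunk}"

def all_chunk_tables(table, n_partitions, m_chunks):
    """Enumerate every physical chunk table of a partitioned fact table."""
    return [chunk_table_name(table, p, c)
            for p in range(1, n_partitions + 1)
            for c in range(1, m_chunks + 1)]

# e.g. chunk 7 of partition 2 of "orders":
assert chunk_table_name("orders", 2, 7) == "orders_part_2_chunk_7"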

Replication - Chained declustering
• Partition i is stored on Worker i
• k backups of Partition i are stored on Workers i+1, ..., i+k (mod n)
• Partition i is only unavailable if the whole chain fails
[Diagram, n = 4, k = 1: Worker 1 holds P1, P2; Worker 2 holds P2, P3; Worker 3 holds P3, P4; Worker 4 holds P4, P1.]
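A minimal sketch of the placement rule under these assumptions (1-based indices, layout as in the n = 4, k = 1 diagram); the function names are hypothetical:

# Sketch of chained-declustering placement; function names are hypothetical.

def partitions_on_worker(i, n, k):
    """Worker i stores its own partition plus the next k partitions in the chain (mod n)."""
    return [((i - 1 + j) % n) + 1 for j in range(k + 1)]

def partition_available(p, n, k, failed_workers):
    """Partition p is unavailable only if every worker in its chain has failed."""
    holders = {w for w in range(1, n + 1) if p in partitions_on_worker(w, n, k)}
    return bool(holders - failed_workers)

assert partitions_on_worker(1, 4, 1) == [1, 2]                  # Worker 1: P1, P2
assert partitions_on_worker(4, 4, 1) == [4, 1]                  # Worker 4: P4, P1
assert partition_available(1, 4, 1, failed_workers={1})         # P1 still on Worker 4
assert not partition_available(1, 4, 1, failed_workers={1, 4})  # whole chain down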

Replication - Coordinator
[Diagram: the COORDINATOR maintains Partition Work Queues PWQ 1-4 above Workers 1-4, which hold (P1, P2), (P2, P3), (P3, P4), and (P4, P1) respectively.]
Coordinator's perspective: each worker...
• has access to k+1 Partition Work Queues
• can only execute queries on local partitions

Outline
• Background • Osprey • Data model (partitioning) • Executing queries on partitioned data • Scheduling • Results

Query Transformation
• We can take the original query, run it for each chunk, and then combine all results to re-assemble the full result set
• Overriding assumption: deg(fact table) >> deg(dimension tables)
• Joins between partitions are avoided: each subquery joins a single fact-table chunk with the dimension tables

Query Transformation Example
a generic query:
SELECT key_1, key_2
FROM fact_table, dim_1, dim_2
WHERE key_1 BETWEEN 0 AND 10
  <... join conditions ...>

is the union of n*m of these:
SELECT key_1, key_2
FROM fact_table_part_n_chunk_m, dim_1, dim_2
WHERE key_1 BETWEEN 0 AND 10
  <... join conditions ...>

where fact_table_part_n_chunk_m is chunk m in partition n
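A sketch of this rewrite in Python, assuming the fact table is referenced by name exactly once in the query text; the function name is illustrative:

# Sketch of the query transformation step; assumes the fact table appears
# by name exactly once, so a plain string substitution is enough.

def to_subqueries(query, fact_table, n, m):
    """Rewrite one query into n*m subqueries, one per fact-table chunk."""
    subqueries = []
    for p in range(1, n + 1):
        for c in range(1, m + 1):
            chunk = f"{fact_table}_part_{p}_chunk_{c}"
            subqueries.append(query.replace(fact_table, chunk))
    return subqueries

query = ("SELECT key_1, key_2 FROM fact_table, dim_1, dim_2 "
         "WHERE key_1 BETWEEN 0 AND 10")
subqueries = to_subqueries(query, "fact_table", n=2, m=3)   # 6 subqueries
# The full result set is the union of the subquery result sets.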

Aggregates
• Common in analytical database workloads: SUM, COUNT, MIN, MAX, AVG...
• Most can simply be "collected up" as intermediate results are union-ed (e.g. MIN)
• Some require partial pre-aggregation (e.g. AVG, rebuilt from partial SUMs and COUNTs)
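One way the merge step could look, assuming each subresult carries partial sum/count/min values; this is an illustrative sketch, not Osprey's actual merge code:

# Illustrative sketch of merging decomposable aggregates from per-chunk
# subresults; not Osprey's actual merge code.

def merge_partials(partials):
    """Combine per-chunk partial aggregates into the final answer."""
    total_sum = sum(p["sum"] for p in partials)
    total_count = sum(p["count"] for p in partials)
    return {
        "SUM": total_sum,
        "COUNT": total_count,
        "MIN": min(p["min"] for p in partials),
        # AVG cannot be averaged directly; rebuild it from SUM and COUNT.
        "AVG": total_sum / total_count,
    }

partials = [{"sum": 10, "count": 2, "min": 3},
            {"sum": 30, "count": 4, "min": 1}]
print(merge_partials(partials))   # SUM=40, COUNT=6, MIN=1, AVG=6.67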

Query Execution Recap
• Subqueries generated by replacing the name of the fact table with the appropriate chunk table
• Rewrite aggregate functions where necessary
• Collect results as they return and calculate aggregates appropriately
[Diagram: the QUERY TRANSFORMER emits SUBQUERIES; EXECUTION produces SUBRESULTS; the RESULTS MERGER combines them.]

Outline
• Background • Osprey • Data model (partitioning) • Executing queries on partitioned data • Scheduling • Results

Scheduling
• Job scheduling in distributed systems is a well-studied problem
• Coordinator schedules subqueries
• Chained declustering => each worker can work on several partitions
• Assigns subqueries to workers as they become available

Scheduling Implementations
• Random
• Longest Queue First
• Majorization (Golubchik et al, RIDE-TQP '92)
  • "vector majorization" to minimize the difference in PWQ lengths
  • picks subqueries "for the greater good"

LQF and Majorization
[Diagram: Partition Work Queues PWQ 1-4 feeding Workers 1-4, with loads L(W1) = 5, L(W2) = 5, L(W3) = 3, L(W4) = 3.]
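A sketch of Longest Queue First over the PWQs, assuming an idle worker may only drain queues for its local partitions; names and queue contents are illustrative:

# Sketch of Longest Queue First over Partition Work Queues (PWQs); an idle
# worker is only offered queues for partitions it holds locally.

def lqf_pick(worker_partitions, pwqs):
    """Hand an idle worker a subquery from the longest PWQ it can serve."""
    eligible = [p for p in worker_partitions if pwqs[p]]
    if not eligible:
        return None
    longest = max(eligible, key=lambda p: len(pwqs[p]))
    return pwqs[longest].pop(0)

# Hypothetical queue lengths (5, 5, 3, 3), echoing the diagram.
pwqs = {1: ["q"] * 5, 2: ["q"] * 5, 3: ["q"] * 3, 4: ["q"] * 3}
lqf_pick([2, 3], pwqs)   # a worker holding P2 and P3 is served from PWQ 2 first
# Majorization would instead pick the queue whose shortening best evens out all
# PWQ lengths, even if it is not the longest queue for this worker.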

Handling Stragglers
• Compensating for straggling workers
• Assigned subqueries marked as "active"
  • Still possible to be assigned to another worker
• Upon completion, marked as "complete" and stored with result
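A sketch of that bookkeeping, with hypothetical state names; the actual coordinator implementation may differ:

# Sketch of straggler bookkeeping; state names and helpers are hypothetical.

class Subquery:
    def __init__(self, sql):
        self.sql = sql
        self.state = "queued"     # queued -> active -> complete
        self.result = None

def assign(subquery):
    # An "active" subquery remains eligible, so another (faster) worker
    # may still be handed the same work.
    subquery.state = "active"
    return subquery.sql

def complete(subquery, result):
    # First completion wins; duplicate results from stragglers are dropped.
    if subquery.state != "complete":
        subquery.state = "complete"
        subquery.result = result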

Outline
• Background • Osprey • Data model (partitioning) • Executing queries on partitioned data • Scheduling • Results

Experimental Setup
• Star Schema Benchmark (O'Neil et al, '07)
  • Data generation (5.5 GB, 6M rows) and set of queries
• Added indices on commonly joined fields, clustered around the date property
• 9 commodity workstations (1 coordinator, 8 workers)

Results - Scalability
[Plot: total computation time (mins) vs. number of workers (n = 1, 4, 8), Osprey vs. ideal speedup.]

Results - Load Balancing
[Plot: "Load Balance Over Time" - relative fraction of total computation time for Workers 1-3 over 0-300 s.]

Results - Load Balancing
[Plot: test completion time (mins) vs. stress (s) on a worker, k = 0.]

Results - Load Balancing
[Plot: test completion time (mins) vs. stress (s) on a worker, k = 0 and k = 1.]

Results - Overhead
[Plot: completion time per chunk (mins) vs. number of chunks per partition (m = 10, 100, 1000): test completion, SQL exec time, and Osprey overhead.]

Future
• Self- or nested joins
• Exploration of query schedulers - take advantage of database caches on workers?
• Improve estimate of "work" left in a PWQ

Contributions
• Linear speedup, acceptable overhead
• Load balancing and fault tolerance provided by:
  • Dividing a query into subqueries run in parallel
  • Greedy workers and chained-declustering replication
  • Re-execution of straggler subqueries
• MapReduce strategies + DBMS optimizations + limited specialized software stack

Questions?