ICDE 2010
March 3, 2010
Christine Yen
with Christopher Yang, Ceryen Tan, Samuel Madden
Motivation
• Existing distributed DBMSes perform poorly
with long-running, OLAP-style queries
• Typical warehouses have hundreds or
thousands of nodes
• Osprey:
• tolerate & recover from mid-query faults
• linear speedup
High-level approach
• Replication of data and re-running of hung jobs, loosely inspired by the MapReduce approach
• Most clusters are heterogeneous and/or built from commodity hardware
• Workers/replicas are just database instances
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Osprey - the idea
• Divide and conquer - running partial queries
on intermediate nodes
• Minimal custom software on intermediate
nodes
• "Chained Declustering" (Hsiao & DeWitt,
DE’90) as a data replication technique
• Experiment with variety of scheduling
algorithms
Architecture
[Architecture diagram: the user submits a query to the coordinator; the coordinator sends subqueries down to Workers 1..n and collects results back up; each worker executes subqueries over its local partitions (A-D shown), and a results merger combines the subresults into the final result.]
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Data Model
• Assume a star schema
• Partition the fact table across n machines
• Each partition is m chunks (physical tables)
• e.g. chunk 7 of partition 2 of table orders would be orders_part_2_chunk_7 (see the naming sketch below)
• Each of the n partitions is replicated k times across k machines
[Diagram: n partitions, each made up of chunks 1..m, with k replicas.]
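A minimal sketch (Python, illustrative only) of the chunk naming scheme described on this slide: the fact table is split into n partitions of m chunks each, and every chunk is an ordinary physical table in a worker's local DBMS. The helper names and the PostgreSQL-style CREATE TABLE ... (LIKE ...) statement are assumptions, not taken from the Osprey code.

# Illustrative sketch of the chunk naming scheme from the slide above.
# Helper names and the CREATE TABLE (LIKE ...) DDL are assumptions.

def chunk_table_name(fact_table: str, partition: int, chunk: int) -> str:
    """Physical table name for one chunk, e.g. orders_part_2_chunk_7."""
    return f"{fact_table}_part_{partition}_chunk_{chunk}"

def create_chunk_statements(fact_table: str, n_partitions: int, m_chunks: int):
    """Yield DDL creating every chunk table with the fact table's schema."""
    for p in range(1, n_partitions + 1):
        for c in range(1, m_chunks + 1):
            yield f"CREATE TABLE {chunk_table_name(fact_table, p, c)} (LIKE {fact_table});"

if __name__ == "__main__":
    for stmt in create_chunk_statements("orders", n_partitions=2, m_chunks=7):
        print(stmt)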
Replication - Chained declustering
• Partition i is stored on Worker i
• The k backups of Partition i are stored on Workers i+1, ..., i+k (mod n) (see the placement sketch below)
• Partition i is only unavailable if the whole chain of workers storing it fails
[Diagram: example with n = 4, k = 1 — each partition P1..P4 is stored on its primary worker and replicated on its neighbor in the chain.]
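A minimal sketch of the chained-declustering placement rule on this slide (partition i on Worker i, backups on Workers i+1, ..., i+k mod n). Indices are 0-based here for simplicity; the function names are illustrative, not Osprey's.

# Chained declustering placement, 0-based indices. Illustrative only.

def workers_for_partition(i: int, n: int, k: int) -> list[int]:
    """Workers holding partition i: its primary worker plus the next k in the chain."""
    return [(i + j) % n for j in range(k + 1)]

def partitions_on_worker(w: int, n: int, k: int) -> list[int]:
    """Partitions stored (as primary or backup) on worker w."""
    return sorted((w - j) % n for j in range(k + 1))

if __name__ == "__main__":
    n, k = 4, 1
    for w in range(n):
        print(f"Worker {w}: partitions {partitions_on_worker(w, n, k)}")
    # Partition i is lost only if all of workers i, i+1, ..., i+k fail together.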
Replication - Coordinator
[Diagram: the coordinator maintains a partition work queue (PWQ 1..4) recording which workers hold each partition (P1..P4); a subquery transformer turns the incoming query into subqueries, the workers execute them, and a subresult merger combines the subresults into the final results (a sketch of this flow follows below).]
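A hedged sketch of the coordinator flow drawn above: a transformer rewrites the user's query into one subquery per chunk, and a merger combines the subresults. Substituting the chunk table name into the SQL text and merging by simple concatenation (a UNION ALL) are simplifying assumptions here; queries with aggregates would need a smarter merge step.

# Illustrative coordinator flow: query -> subqueries -> subresults -> result.

from typing import Iterable, List, Tuple

def make_subqueries(query: str, fact_table: str, partition: int, m_chunks: int) -> List[str]:
    """One subquery per chunk, with the chunk table substituted for the fact table."""
    return [
        query.replace(fact_table, f"{fact_table}_part_{partition}_chunk_{c}")
        for c in range(1, m_chunks + 1)
    ]

def merge_results(subresults: Iterable[List[Tuple]]) -> List[Tuple]:
    """Naive merger: concatenate the rows of every subresult (a UNION ALL)."""
    merged: List[Tuple] = []
    for rows in subresults:
        merged.extend(rows)
    return merged

if __name__ == "__main__":
    q = "SELECT * FROM orders WHERE o_totalprice > 1000"
    for sub in make_subqueries(q, "orders", partition=2, m_chunks=3):
        print(sub)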
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Scheduling
• Job scheduling in distributed systems is a
well-studied problem
• Coordinator schedules subqueries
• Chained declustering => each worker can
work on several partitions
• Assigns subqueries to workers as they
become available
Scheduling Implementations
• Random
• Longest Queue First
• Majorization (Golubchik et al., RIDE-TQP '92)
• "vector majorization" to minimize the
difference in PWQ lengths
• picks subqueries "for the greater good"
LQF and Majorization
[Diagram: example partition work queues for workers W1..W4 with lengths L(W1) = 5, L(W2) = 5, L(W3) = 3, L(W4) = 3 (see the LQF sketch below).]
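A minimal sketch of Longest Queue First dispatch, under the assumptions that the coordinator keeps one partition work queue (PWQ) per partition and that chained declustering lets an idle worker serve any of the partitions it stores. Names are illustrative; this is not the Osprey scheduler code.

# Longest Queue First: an idle worker takes the next subquery from the
# longest PWQ among the partitions it can serve. Illustrative only.

from collections import deque

def lqf_pick(pwqs: dict, servable: set):
    """Pop a subquery from the longest non-empty queue the worker can serve."""
    candidates = [p for p in servable if pwqs[p]]
    if not candidates:
        return None
    longest = max(candidates, key=lambda p: len(pwqs[p]))
    return pwqs[longest].popleft()

if __name__ == "__main__":
    # Queue lengths loosely based on the example above (5, 5, 3, 3).
    pwqs = {1: deque(f"q1_{i}" for i in range(5)),
            2: deque(f"q2_{i}" for i in range(5)),
            3: deque(f"q3_{i}" for i in range(3)),
            4: deque(f"q4_{i}" for i in range(3))}
    # With n = 4, k = 1 a single idle worker might serve partitions 2 and 3;
    # LQF prefers the longer of those two queues.
    print(lqf_pick(pwqs, servable={2, 3}))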
Handling Stragglers
• Compensating for straggling workers
• Assigned subqueries are marked as "active"
• An "active" subquery can still be assigned to another worker
• Upon completion, a subquery is marked as "complete" and stored with its result (see the sketch below)
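A hedged sketch of the straggler bookkeeping described above: dispatched subqueries are marked "active" but may be re-issued to another idle worker, and the first copy to finish marks the subquery "complete" and stores its result. The class and method names are assumptions for illustration only.

# Bookkeeping for speculative re-execution of straggling subqueries.
# Illustrative only; not the Osprey implementation.

class SubqueryTracker:
    def __init__(self, subqueries):
        self.status = {sq: "pending" for sq in subqueries}
        self.results = {}

    def assign(self):
        """Prefer pending work; if none is left, re-issue an active subquery."""
        for wanted in ("pending", "active"):
            for sq, state in self.status.items():
                if state == wanted:
                    self.status[sq] = "active"
                    return sq
        return None  # everything is complete

    def complete(self, subquery, result):
        """The first completion wins; later duplicates are ignored."""
        if self.status[subquery] != "complete":
            self.status[subquery] = "complete"
            self.results[subquery] = result

if __name__ == "__main__":
    t = SubqueryTracker(["sq1", "sq2"])
    first = t.assign()      # sq1 becomes active
    t.assign()              # sq2 becomes active
    t.assign()              # nothing pending: sq1 is speculatively re-issued
    t.complete(first, [(1,)])
    print(t.status, t.results)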
Outline
• Background
• Osprey
• Data model (partitioning)
• Executing queries on partitioned data
• Scheduling
• Results
Experimental Setup
• Star Schema Benchmark (O’Neil et al, ‘07)
• Data generation (5.5 GB, 6M rows) and a set of queries
• Added indices on commonly joined fields, clustered on the date property
• 9 commodity workstations (1 coordinator,
8 workers)
Results - Scalability
[Plot: x-axis is the number of workers (n = 1, 4, 8).]
Load Balance Over Time
[Plot: load balance over time; x-axis is Time (s), 0-300.]
[Plots: two panels with x-axis Stress (s), 0-3.]
Results - Overhead
[Plot: x-axis is the number of chunks per partition (m), 10-1000.]
Future
• Self- or nested joins
• Exploration of query schedulers - take
advantage of database caches on workers?
Questions?