MapReduce
Clean abstraction
Extremely rigid two-stage group-by aggregation
Code reuse and maintenance difficult
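The rigid two-stage shape can be seen in a minimal in-memory sketch (illustrative only; real MapReduce distributes each phase across machines): every computation must be squeezed into a map step, a group-by-key shuffle, and a reduce step.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Stage 1: map each record to (key, value) pairs, then group
    # by key (the "shuffle"); Stage 2: reduce each group.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical instance of the two-stage pattern:
counts = run_mapreduce(
    ["a b a", "b c"],
    map_fn=lambda rec: ((w, 1) for w in rec.split()),
    reduce_fn=lambda key, values: sum(values),
)
# counts == {'a': 2, 'b': 2, 'c': 1}
```

Anything that does not fit this shape (joins, multi-stage flows) has to be expressed as chains of such jobs.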
Google → MapReduce, Sawzall
Yahoo → Hadoop, Pig Latin
Microsoft → Dryad, DryadLINQ
Improving MapReduce in heterogeneous environments
[Figure: MapReduce data flow — input records pass through map, are split and locally sorted (quicksort), shuffled by key, and aggregated by reduce into output (key, value) records.]
Extremely rigid data flow
Other flows (multi-stage pipelines, joins, splits) must be hacked in as chains of M→R jobs
Common operations must be coded by hand
Join, filter, projection, aggregates, sorting, distinct
Semantics hidden inside map-reduce functions
Difficult to maintain, extend, and optimize
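For instance, without a join primitive even a simple equi-join must be hand-coded inside the map and reduce functions. A reduce-side join sketch in Python (relation and field names are hypothetical, chosen to match the visits example later in the deck):

```python
from collections import defaultdict

def mapreduce_join(visits, url_info):
    # map phase: tag each record with its source relation, keyed on url
    groups = defaultdict(lambda: ([], []))
    for user, url, time in visits:
        groups[url][0].append((user, time))
    for url, category, rank in url_info:
        groups[url][1].append((category, rank))
    # reduce phase: per url, emit the cross product of the two sides
    return [(url, v, i)
            for url, (vs, infos) in groups.items()
            for v in vs for i in infos]

joined = mapreduce_join(
    visits=[("alice", "a.com", 1), ("bob", "a.com", 2)],
    url_info=[("a.com", "news", 0.9)],
)
# joined == [('a.com', ('alice', 1), ('news', 0.9)),
#            ('a.com', ('bob', 2), ('news', 0.9))]
```

The join semantics live entirely inside opaque user functions, which is exactly what makes such code hard to optimize.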
Pig Latin
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo! Research)
Pigs Eat Anything
Can operate on data without metadata: relational, nested, or unstructured
Pigs Live Anywhere
Not tied to one particular parallel framework
Pigs Are Domestic Animals
Designed to be easily controlled and modified by its users.
UDFs: transformation functions, aggregates, grouping functions, and conditionals
Pigs Fly
Processes data quickly(?)
Dataflow language
Procedural, unlike declarative SQL
Quick Start and Interoperability
Nested Data Model
UDFs as First-Class Citizens
Parallelism Required
Debugging Environment
Data Model
Atom : 'cs'
Tuple: ('cs', 'ece', 'ee')
Bag: { ('cs', 'ece'), ('cs') }
Map: [ 'courses' → { ('523'), ('525'), ('599') } ]
Expressions
Fields by position: $0
Fields by name: f1
Map lookup: #
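The four levels of the nested data model can be mirrored with plain Python literals (an illustration of the nesting only, not Pig syntax):

```python
atom = 'cs'                                        # Atom: a simple value
tup = ('cs', 'ece', 'ee')                          # Tuple: ordered fields
bag = {('cs', 'ece'), ('cs',)}                     # Bag: a collection of tuples
mp = {'courses': {('523',), ('525',), ('599',)}}   # Map: key -> data item

first = tup[0]           # field by position, like $0 in Pig
courses = mp['courses']  # map lookup, like # in Pig
```

Note that tuples and bags nest freely: a tuple field may itself be a bag, which is what lets Pig operate on data without a flat relational schema.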
Find the top 10 most visited pages in each category
Relations: Visits(user, url, time), UrlInfo(url, category, pagerank)

Plan: Load Visits → Group by url → Foreach url, generate count → Load UrlInfo → Join on url → Group by category → Foreach category, generate top-10 urls
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
Every group or join operation forms a map-reduce boundary
Other operations are pipelined into the map and reduce phases

[Figure: the plan compiled into map-reduce jobs — Map1: Load Visits; Reduce1: Group by url; Map2: Foreach url, generate count / Load UrlInfo; Map3: Group by category; Reduce3: Foreach category, generate top-10 urls]
Write-run-debug cycle
Sandbox dataset
Objectives:
Realism
Conciseness
Completeness
Problems:
UDFs
Optional “safe” query optimizer
Performs only high-confidence rewrites
User interface
Boxes and arrows UI
Promote collaboration, sharing code fragments and UDFs
Tight integration with a scripting language
Use loops, conditionals of host language
DryadLINQ
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey (Microsoft Research)
[Figure: Dryad system — the job manager schedules vertices (V) onto cluster machines through a name server (NS) and per-machine process daemons (PD); the data plane moves data between vertices over files, TCP, or in-memory FIFOs.]

Data model: partitioned collections of C# objects
Escape hatches: Apply, Fork
Annotations: hints
Vertex code example:

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };

[Figure: the query is compiled into a query plan and executed as a Dryad job — the client machine hands the plan to the Dryad job manager (JM), the partitioned data collection is processed by C# vertices, output tables (DryadTable) are written, and results return to the client as C# objects via foreach.]
LINQ expressions converted to execution plan graph (EPG)
similar to database query plan
DAG
annotated with metadata properties
EPG is the skeleton of the Dryad dataflow graph
As long as native operators are used, properties can propagate, helping optimization
Pipelining
Multiple operations in a single process
Removing redundancy
Eager Aggregation
Move aggregations in front of partitionings
I/O Reduction
Use TCP and in-memory FIFOs instead of disk where possible
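Eager aggregation is the same idea as a MapReduce combiner: pre-aggregate on each node before the partitioning step, so far less data crosses the network. A minimal word-count-style sketch (hypothetical pipeline, not DryadLINQ's API):

```python
from collections import Counter

def partial_aggregate(partition):
    # runs on each node, before the partitioning/shuffle step
    return Counter(partition)

def final_aggregate(partials):
    # runs after the shuffle, over the much smaller partial counts
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partitions = [["a", "b", "a"], ["b", "b", "c"]]
result = final_aggregate(partial_aggregate(p) for p in partitions)
# result == Counter({'b': 3, 'a': 2, 'c': 1})
```

Each node ships at most one count per distinct key instead of one record per occurrence, which is why moving the aggregation in front of the partitioning pays off.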
As information from the running job becomes available, mutate the execution graph
Dataset-size-based decisions
Intelligent partitioning of data
Aggregation can turn into a tree to improve I/O based on locality
Example: if part of the computation is done locally, it is aggregated before being sent across the network
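Turning a flat aggregation into a tree can be sketched as merging partial results in fan-in-sized groups per round, so no single node receives everything at once (hypothetical helper; in the real system, locality decides the grouping):

```python
from functools import reduce

def tree_aggregate(partials, combine, fan_in=2):
    # each round merges groups of fan_in partial results,
    # i.e. one level of the aggregation tree
    while len(partials) > 1:
        partials = [reduce(combine, partials[i:i + fan_in])
                    for i in range(0, len(partials), fan_in)]
    return partials[0]

total = tree_aggregate([1, 2, 3, 4, 5], lambda a, b: a + b)
# total == 15
```

With fan-in 2, five partial results are merged over three rounds instead of one node reading all five at once.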
TeraSort - scalability
240-computer cluster of 2.6 GHz dual-core AMD Opterons
Sort 10 billion 100-byte records on a 10-byte key
Each computer stores 3.87 GB
DryadLINQ vs Dryad - SkyServer
Dryad version is hand-optimized, with no dynamic-optimization overhead
DryadLINQ is within about 10% of the native code
High level and data type transparent
Automatic optimization friendly
Manual optimizations using Apply operator
Leverage any system running LINQ framework
Support for interacting with SQL databases
Single computer debugging made easy
Strong typing, narrow interface
Deterministic replay execution
Dynamic optimizations appear data intensive
What kind of overhead?
EPG analysis overhead -> high latency
No real comparison with other systems
Progress tracking is difficult
No speculation
Will Solid State Drives diminish advantages of MapReduce?
Why not use Parallel Databases?
MapReduce Vs Dryad
How different from Sawzall and Pig?
[Table: language comparison — Sawzall vs Pig Latin vs DryadLINQ]
[Figures: Sort, Grep, and WordCount benchmarks; heterogeneity emulated by manually slowing down 8 VMs with background processes]
1. Make decisions early
2. Use finishing times
3. Nodes are not equal
4. Resources are precious
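These principles come from the LATE scheduler, whose core heuristic can be sketched as speculating on the task with the longest estimated time left, derived from its observed progress rate (field names are hypothetical; the real scheduler also caps and prioritizes speculation):

```python
def pick_speculative_task(tasks, now):
    # estimate time-to-completion from each task's progress rate and
    # speculate on the straggler expected to finish last: decide
    # early, and rank by finishing time rather than raw progress
    def time_left(task):
        rate = task["progress"] / (now - task["start"])  # progress/sec
        return (1.0 - task["progress"]) / rate
    return max(tasks, key=time_left)["id"]

tasks = [
    {"id": "t1", "start": 0.0, "progress": 0.5},  # slow node
    {"id": "t2", "start": 0.0, "progress": 0.8},  # fast node
]
picked = pick_speculative_task(tasks, now=10.0)
# picked == 't1'
```

Ranking by estimated finishing time is what distinguishes LATE from Hadoop's default speculation, which compares raw progress and so wastes precious slots on tasks that would finish soon anyway.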
Is focusing work on small VMs fair?
Would it be better to pay for a large VM and implement the system with more customized control?
Could this be used in other systems?
Progress tracking is key
Is this a fundamental contribution, or just an optimization?
“Good” research?