
Simplifying Hadoop Usage and Administration
Or, With Great Power Comes Great Responsibility in MapReduce Systems
Shivnath Babu
Duke University

Time

Relational DBMS
  1975-1985: New & useful technology
  1985-1995: Features +++++, Open source ++
  1995-2005: Manageability Crisis, Research +++
  2005-2010: Claims of self-managing, Hard to add new features

MapReduce/Hadoop
  New & useful technology
  Features +++++, Open source ++
  2020: ?

Different Aspects of Manageability

Testing
Tuning
Diagnosis
Applying fixes
Configuring
Benchmarking
Capacity planning
Disaster/failure recovery automation
Detection/repair of data corruption

Roles (often overlap)

User (writes MapReduce programs, Pig scripts, HiveQL queries, etc.)
Developer
Administrator

Lifecycle of a MapReduce Job


Map function

Reduce function

Run this program as a MapReduce job

Lifecycle of a MapReduce Job


[Figure: job execution over time, from input splits through map wave 1, map wave 2, reduce wave 1, and reduce wave 2]

How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
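As a rough illustration of how task counts turn into waves, here is a minimal sketch. It assumes one map task per input split, a 64 MB split size (not stated in the slides), and that tasks run in waves whenever there are more tasks than concurrent slots. The helper names `num_splits` and `num_waves` are made up for this example, and real Hadoop split computation has more cases than this.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2

def num_splits(input_bytes, split_bytes):
    """Approximate split count for one large, splittable input."""
    return math.ceil(input_bytes / split_bytes)

def num_waves(num_tasks, num_slots):
    """Tasks run in waves when they outnumber the concurrent slots."""
    return math.ceil(num_tasks / num_slots)

# Assumption: 64 MB splits; the slot counts (64 map, 32 reduce) are the
# ones quoted for the 17-node cluster in the Terasort experiment.
map_tasks = num_splits(50 * GB, 64 * MB)
map_waves = num_waves(map_tasks, 64)
reduce_waves_small = num_waves(28, 32)
reduce_waves_large = num_waves(300, 32)

print(map_tasks, map_waves, reduce_waves_small, reduce_waves_large)
```

Under these assumptions, changing only the number of reduce tasks moves the job from a single reduce wave to many, which is one reason a single parameter can change the shape of the whole execution.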

Job Configuration Parameters


190+ parameters in Hadoop
Set manually or defaults are used
Are defaults or rules-of-thumb good enough?

Experiments
On EC2 and local clusters
[Plots: axes labeled "Running time (seconds)" and "Running time (minutes)"]

Illustrative Result: 50GB Terasort


17-node cluster, 64+32 concurrent map+reduce slots

mapred.reduce.tasks   io.sort.factor   io.sort.record.percent
10                    10               0.15
10                    500              0.15
28                    10               0.15
300                   10               0.15
300                   500              0.15

Annotation on one setting: "Based on popular rule-of-thumb"
[Chart: running time for each setting]

Performance at default and rule-of-thumb settings can be poor


Cross-parameter interactions are significant
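Because interactions between parameters matter, settings have to be tried in combination rather than one at a time. The sketch below enumerates the cross product of the parameter values listed in the Terasort table and renders each setting as `-D property=value` arguments. It assumes the job accepts Hadoop's generic options; the helper names `enumerate_settings` and `to_generic_options` are made up for this example.

```python
from itertools import product

# Parameter values as they appear in the Terasort table.
grid = {
    "mapred.reduce.tasks": [10, 28, 300],
    "io.sort.factor": [10, 500],
    "io.sort.record.percent": [0.15],
}

def enumerate_settings(grid):
    """Return every combination of the listed values as a dict."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]

def to_generic_options(setting):
    """Render one setting as -D property=value argument pairs."""
    args = []
    for name, value in setting.items():
        args.extend(["-D", f"{name}={value}"])
    return args

settings = enumerate_settings(grid)
for setting in settings:
    print(" ".join(to_generic_options(setting)))
```

The full grid has six combinations, and the table covers five of them. The grid grows multiplicatively with every added parameter, which is why exhaustive manual exploration of 190+ parameters is not practical.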

Problem Space

Space of execution choices:
  Job configuration parameters
  Declarative HiveQL/Pig operations
  Multi-job workflows

Performance objectives:
  Cost in pay-as-you-go environment
  Energy considerations

Complexity

Current approaches:
  Predominantly manual
  Post-mortem analysis

Is this where we want to be?

Challenges

Features of Hadoop from a usability perspective:
  Ability to specify schema late
  Easy integration with programming languages
  Pluggability
    Input data formats
    Storage engines
    Schedulers
    Instrumentation

These features are very useful when dealing with:
  Multiple data formats
  Mix of structured and unstructured data
  Multiple computational engines (e.g., R, DBMS)
  Changes/evolution

But they pose nontrivial manageability challenges

Some Thoughts on Possible Solutions


Exploit opportunities to learn
  Schema can be learned from Pig Latin scripts, HiveQL queries, MapReduce jobs
  Profile-driven optimization from the compiler world
  High ratio of repeated jobs to new jobs is common
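One way to picture learning from repeated jobs is a store that remembers what was observed the last time the same job ran. The sketch below is hypothetical throughout: the `ProfileStore` class, the idea of identifying a repeated job by its program text plus input location, and the profile fields are all assumptions made for illustration, not part of Hadoop.

```python
import hashlib

class ProfileStore:
    """Keeps the latest observed profile per job signature (hypothetical)."""

    def __init__(self):
        self._profiles = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def signature(program_text, input_path):
        # Assumption: same program text and input location means "same job".
        key = f"{program_text}\n{input_path}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()

    def lookup(self, program_text, input_path):
        """Return the stored profile, or None if the job is new."""
        profile = self._profiles.get(self.signature(program_text, input_path))
        if profile is None:
            self.misses += 1
        else:
            self.hits += 1
        return profile

    def record(self, program_text, input_path, profile):
        """Store the profile observed during the most recent run."""
        self._profiles[self.signature(program_text, input_path)] = profile

store = ProfileStore()
script = "A = LOAD 'logs'; B = GROUP A BY user;"  # placeholder Pig Latin text
first = store.lookup(script, "/data/logs")
# Placeholder numbers, not measurements.
store.record(script, "/data/logs", {"map_output_records": 1000})
second = store.lookup(script, "/data/logs")
print(first, second, store.hits, store.misses)
```

When most submitted jobs are repeats, most lookups find a profile, so information gathered during one run can inform the configuration of the next.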

Exploit the MapReduce/Hadoop design
  Common sort-partition-merge skeleton
  Design for robustness gives many mechanisms for adaptation & observation (speculative execution, storing intermediate data)
  Multiple map waves
  Fine-grained and pluggable scheduler

Some Thoughts on Possible Solutions


Automate try-it-out and trial-and-error approaches
  For example, use 5% of cluster resources to run MapReduce tasks with a different configuration
  Exploit the cloud's pay-as-you-go resources, e.g., EC2 spot instances
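The try-it-out idea can be sketched as a small search loop: set aside a fraction of the slots, run candidate configurations there, and keep the cheapest one seen. This is a minimal sketch, not a description of any existing tool. The functions `trial_slots` and `try_it_out` are made up, and `synthetic_cost` is a deliberately artificial stand-in for running real tasks and timing them.

```python
import math
import random

def trial_slots(total_slots, fraction=0.05):
    """Slots set aside for experiments; at least one so trials can run."""
    return max(1, math.floor(total_slots * fraction))

def try_it_out(candidates, measure, budget, seed=0):
    """Try up to `budget` candidates in random order; return (config, cost)."""
    rng = random.Random(seed)
    order = list(candidates)
    rng.shuffle(order)
    best = None
    for config in order[:budget]:
        cost = measure(config)  # in practice: run tasks with this config
        if best is None or cost < best[1]:
            best = (config, cost)
    return best

def synthetic_cost(config):
    # Stand-in for a real measurement; NOT derived from any experiment.
    return abs(config["mapred.reduce.tasks"] - 28) + config["io.sort.factor"] / 1000

candidates = [
    {"mapred.reduce.tasks": r, "io.sort.factor": f}
    for r in (10, 28, 300)
    for f in (10, 500)
]

slots = trial_slots(64 + 32)
best_config, best_cost = try_it_out(candidates, synthetic_cost, budget=len(candidates))
print(slots, best_config, best_cost)
```

With a budget smaller than the number of candidates, the loop only samples the space, which is the trade-off any trial-and-error approach has to manage.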
