Starfish: A Self-Tuning System For Big Data Analytics
Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong,
Fatma Bilgen Cetin, Shivnath Babu
Department of Computer Science
Duke University
ABSTRACT
Timely and cost-effective analytics over Big Data is now a key
ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software
stack (which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces) is a popular choice for big data analytics. Most practitioners of big data analytics, like computational scientists, systems researchers, and business analysts, lack the expertise to tune the system to get good performance. Unfortunately, Hadoop's performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in pay-as-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish's system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges, leading us to different design choices in Starfish.
1. INTRODUCTION
2. OVERVIEW OF STARFISH
The workload that a Hadoop deployment runs can be considered
at different levels. At the lowest level, Hadoop runs MapReduce
jobs. A job can be generated directly from a program written in
a programming language like Java or Python, or generated from a
query in a higher-level language like HiveQL or Pig Latin [15], or
submitted as part of a MapReduce job workflow by systems like
Azkaban, Cascading, Elastic MapReduce, and Oozie. The execution plan generated for a HiveQL or Pig Latin query is usually a
workflow. Workflows may be ad-hoc, time-driven (e.g., run every
The schema as well as properties of the data in the files could have
been unknown so far. The system now has to quickly choose the
join execution technique (given the limited information available so far, and from among 10+ ways to execute joins in Starfish) as well as the corresponding settings of job configuration parameters.
Starfish's Just-in-Time Optimizer addresses unique optimization problems like those above to automatically select efficient execution techniques for MapReduce jobs. "Just-in-time" captures the online nature of decisions forced on the optimizer by Hadoop's MADDER features. The optimizer takes the help of the Profiler
and the Sampler. The Profiler uses a technique called dynamic instrumentation to learn performance models, called job profiles, for
unmodified MapReduce programs written in languages like Java
and Python. The Sampler collects statistics efficiently about the
input, intermediate, and output key-value spaces of a MapReduce
job. A unique feature of the Sampler is that it can sample the execution of a MapReduce job in order to enable the Profiler to collect
approximate job profiles at a fraction of the full job execution cost.
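A minimal sketch of the idea behind the Sampler, assuming a hypothetical run_map_on_split callback and a uniform random sample of input splits (Starfish's actual implementation is more sophisticated):

```python
import random

def sample_job_execution(input_splits, run_map_on_split, fraction=0.1, seed=42):
    """Run the map phase on a random fraction of input splits and
    extrapolate key-value statistics to the full input (illustrative
    sketch; the real Sampler also feeds the Profiler)."""
    rng = random.Random(seed)
    k = max(1, int(len(input_splits) * fraction))
    sampled = rng.sample(input_splits, k)

    records_out = bytes_out = 0
    for split in sampled:
        for key, value in run_map_on_split(split):
            records_out += 1
            bytes_out += len(str(key)) + len(str(value))

    scale = len(input_splits) / k  # extrapolate from the sample to the full input
    return {
        "est_map_output_records": int(records_out * scale),
        "est_map_output_bytes": int(bytes_out * scale),
        "sampled_splits": k,
    }
```

Because only a fraction of the tasks execute, the statistics (and any profiles collected alongside them) cost a fraction of a full job run.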
                              WordCount                TeraSort
  Parameter                 Rules of   Based on     Rules of   Based on
                            Thumb      Job Profile  Thumb      Job Profile
  io.sort.spill.percent     0.80       0.80         0.80       0.80
  io.sort.record.percent    0.50       0.05         0.15       0.15
  io.sort.mb                200        50           200        200
  io.sort.factor            10         10           10         100
  mapred.reduce.tasks       27         2            27         400
  Running Time (sec)        785        407          891        606

Table 1: Parameter settings from rules of thumb and recommendations from job profiles for WordCount and TeraSort
loads on big data. At the same time, we want Starfish to be usable in environments where workloads are run directly on Hadoop without going through Starfish. Lastword enables Starfish to be used as a recommendation engine in these environments. The full or partial Hadoop workload from such an environment can be expressed in Lastword (we will provide tools to automate this step) and then given as input to Starfish, which is run in a special recommendation mode. In this mode, Starfish uses its tuning features to recommend good configurations at the job, workflow, and workload levels, instead of running the workload with these configurations as it would do in its normal usage mode.
3.
Figure 4: Response surfaces of MapReduce programs in Hadoop: (a) WordCount, with io.sort.mb ∈ [50, 200] and io.sort.record.percent ∈ [0.05, 0.5]; (b) TeraSort, with io.sort.mb ∈ [50, 200] and io.sort.record.percent ∈ [0.05, 0.5]; (c) TeraSort, with io.sort.record.percent ∈ [0.05, 0.5] and mapred.reduce.tasks ∈ [27, 400]
4. The cluster setup and resource allocation that will be used to
run Job J. This information includes the number of nodes
and network topology of the cluster, the number of map and
reduce task slots per node, and the memory available for each
task execution.
The What-if Engine uses a set of performance models for predicting (a) the flow of data going through each subphase in the job's
execution, and (b) the time spent in each subphase. The What-if Engine then produces a virtual job profile by combining the predicted
information in accordance with the cluster setup and resource allocation that will be used to run the job. The virtual job profile contains the predicted Timings and Data-flow views of the job when
run with the new parameter settings. The purpose of the virtual
profile is to provide the user with more insights on how the job will
behave when using the new parameter settings, as well as to expand the use of the What-if Engine towards answering hypothetical
questions at the workflow and workload levels.
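As a toy illustration of how a what-if engine can combine a measured profile with hypothetical settings, the following sketch uses an invented two-knob cost model; the field names and formulas are assumptions, far simpler than Starfish's actual performance models:

```python
import math

def virtual_profile(measured, new_conf):
    """Predict a simplified 'virtual job profile' for new settings from
    a measured profile (hypothetical model, not Starfish's)."""
    # Data-flow view: map output size is assumed invariant to these knobs.
    map_out_mb = measured["map_output_mb"]

    # Number of sort-and-spill rounds per map task depends on the
    # in-memory sort buffer size (io.sort.mb).
    per_task_mb = map_out_mb / measured["map_tasks"]
    spills = max(1, math.ceil(per_task_mb / new_conf["io.sort.mb"]))

    # Timings view: spill cost scales with the number of spill rounds;
    # reduce cost scales with the data shuffled to each reduce task.
    map_sec = measured["map_cpu_sec"] + spills * measured["spill_sec_per_round"]
    reduce_sec = map_out_mb / new_conf["mapred.reduce.tasks"] * measured["reduce_sec_per_mb"]
    return {"spills_per_map_task": spills,
            "pred_map_sec": map_sec,
            "pred_reduce_sec": reduce_sec}
```

The key property mirrored here is that no job is executed: the prediction is derived entirely from the stored profile and the proposed configuration.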
Towards Cost-Based Optimization: Table 1 shows the parameter
settings for WordCount and TeraSort recommended by an initial
implementation of the Just-in-Time Optimizer. The What-if Engine used the respective job profiles collected from running the jobs
using the rules-of-thumb settings. WordCount runs almost twice
as fast at the recommended setting. As we saw earlier, while the
Combiner reduced the amount of intermediate data drastically, it
was making the map execution heavily CPU-bound and slow. The
configuration setting recommended by the optimizer (with lower io.sort.mb and io.sort.record.percent) made the map tasks significantly faster. This speedup outweighed the lowered effectiveness
of the Combiner that caused more intermediate data to be shuffled
and processed by the reduce tasks.
These experiments illustrate the usefulness of the Just-in-Time
Optimizer. One of the main challenges that we are addressing
is in developing an efficient strategy to search through the high-dimensional space of parameter settings. A related challenge is in
generating job profiles with minimal overhead. Figure 6 shows the
tradeoff between the profiling overhead (in terms of job slowdown)
and the average relative error in the job profile views when profiling is limited to a fraction of the tasks in WordCount. The results
are promising but show room for improvement.
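One baseline for such a search is plain random probing of the configuration space, scoring each candidate with the What-if Engine instead of actually running the job; the sketch below is illustrative and much cruder than the strategies a real optimizer would use:

```python
import random

def random_search(whatif_cost, space, trials=200, seed=0):
    """Randomly probe a high-dimensional configuration space, asking a
    what-if cost function (predicted running time) about each point.
    No job is executed during the search."""
    rng = random.Random(seed)
    best_conf, best_cost = None, float("inf")
    for _ in range(trials):
        conf = {knob: rng.choice(values) for knob, values in space.items()}
        cost = whatif_cost(conf)  # cheap model-based prediction
        if cost < best_cost:
            best_conf, best_cost = conf, cost
    return best_conf, best_cost
```

Because each probe costs only a model evaluation, hundreds of configurations can be scored in the time it takes to run the job once; the hard part, as noted above, is pruning the space more cleverly than uniform sampling does.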
4. WORKFLOW-AWARE SCHEDULING
Cause and Effect of Unbalanced Data Layouts: Section 2.2 mentioned how interactions between the task scheduler and the policies
employed by the distributed filesystem can lead to unbalanced data
layouts. Figure 7 shows how even the execution of a single large job can lead to an unbalanced data layout.
Figure 11: Part of an example workflow showing producer-consumer relationships among jobs

1. What block placement policy to use in the distributed filesystem for the output file of a job? HDFS uses the Local Write block placement policy, which works as follows: the first replica of any block is stored on the same node where the block's writer (a map or reduce task) runs. We have implemented a new Round Robin block placement policy in HDFS where the blocks written are stored on the nodes of the distributed filesystem in a round-robin fashion.

2. How many replicas to store (called the replication factor) for the blocks of a file? Replication helps improve performance for heavily-accessed files. Replication also improves robustness by reducing performance variability in case of node failures.

3. What size to use for blocks of a file? For a very big file, a block size larger than the default of 64MB can improve performance significantly by reducing the number of map tasks needed to process the file. The caveat is that the choice of the block size interacts with the choice of job-level configuration parameters like io.sort.mb (recall Section 3).

4. Should a job's output files be compressed for storage? Like the use of Combiners (recall Section 3), the use of compression enables the cost of local I/O and network transfers to be traded for additional CPU cost. Compression is not always beneficial. Furthermore, like the choice of the block size, the usefulness of compression depends on the choice of job-level parameters.

The Workflow-aware Scheduler performs a cost-based search for a good layout for the output data of each job in a given workflow. The technique we employ here poses a number of questions to the What-if Engine and uses the answers to infer the costs and benefits of various choices. The what-if questions asked for a workflow consisting of the producer-consumer relationships among Jobs P, C1, and C2 shown in Figure 11 include:

(a) What is the expected running time of Job P if the Round Robin block placement policy is used for P's output files?

(b) What will the new data layout in the cluster be if the Round Robin block placement policy is used for P's output files?

(c) What is the expected running time of Job C1 (C2) if its input data layout is the one in the answer to Question (b)?
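The contrast between the default Local Write policy and a Round Robin policy can be sketched as follows (hypothetical interface for illustration, not the actual HDFS plugin API):

```python
from itertools import cycle

class RoundRobinPlacement:
    """Assign each new block's first replica to cluster nodes in
    round-robin order (sketch of the alternative policy)."""
    def __init__(self, nodes):
        self._next = cycle(nodes)

    def place(self, block_id, writer_node):
        # The writer's node is ignored: placement no longer favors the
        # writer, which avoids piling a job's output onto the few nodes
        # that happened to run its reduce tasks.
        return next(self._next)

def local_write_place(block_id, writer_node):
    """Default HDFS behavior: the first replica goes to the writer's node."""
    return writer_node
```

Under Local Write, a job whose reduce tasks run on a subset of nodes concentrates its output there; Round Robin spreads the same blocks evenly, at the cost of losing write locality.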
5.
Figure 14: Workload performance under various cluster and Hadoop configurations on Amazon Elastic MapReduce
Figure 15: Performance vs. pay-as-you-go costs for a workload on Amazon Elastic MapReduce
on the user. Specifically, users have to specify the number and type
of EC2 nodes (from among 10+ types) as well as whether to copy
data from S3 into the in-cluster HDFS. The space of provisioning
choices is further complicated by Amazon Spot Instances which
provide a market-based option for leasing EC2 nodes. In addition,
the user has to specify the Hadoop-level as well as job-level configuration parameters for the provisioned cluster.
One of the goals of Starfish's Elastisizer is to automatically determine the best cluster and Hadoop configurations to process a given
workload subject to user-specified goals (e.g., on completion time
and monetary costs incurred). To illustrate this problem, Figure 14
shows how the performance of a workload W consisting of a single workflow varies across different cluster configurations (number
and type of EC2 nodes) and corresponding Hadoop configurations
(number of concurrent map and reduce slots per node).
The user could have multiple preferences and constraints for the
workload, which poses a multi-objective optimization problem. For
example, the goal may be to minimize the monetary cost incurred to
run the workload, subject to a maximum tolerable workload completion time. Figures 15(a) and 15(b) show the running time as well
as cost incurred on Elastic MapReduce for the workload W for different cluster configurations. Some observations from the figures:
- If the user wants to minimize costs subject to a completion time of 30 minutes, then the Elastisizer should recommend a cluster of four m1.large EC2 nodes.
- If the user wants to minimize costs, then two m1.small nodes are best. However, the Elastisizer can suggest that by paying just 20% more, the completion time can be reduced by 2.6x.
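The cost-versus-deadline choice above can be framed as picking the cheapest candidate that meets the completion-time constraint; the candidate clusters and numbers below are hypothetical, not the measured values from Figure 15:

```python
def pick_cluster(candidates, deadline_sec=None):
    """Choose the cheapest candidate whose predicted completion time
    meets an optional deadline. Each candidate is a tuple of
    (name, predicted_time_sec, cost_usd)."""
    feasible = [c for c in candidates
                if deadline_sec is None or c[1] <= deadline_sec]
    if not feasible:
        return None  # no configuration can meet the deadline
    return min(feasible, key=lambda c: c[2])

# Illustrative candidates (invented numbers, not from the paper's experiments).
clusters = [
    ("2 x m1.small", 3900, 0.50),   # cheapest, but slow
    ("4 x m1.large", 1500, 0.60),
    ("8 x m1.large",  900, 1.20),
]
```

With a deadline, the cheapest infeasible option drops out and a slightly more expensive but much faster cluster wins, mirroring the 20%-more-cost-for-2.6x-speedup tradeoff described above.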
To estimate workload performance for various cluster configurations, the Elastisizer invokes the What-if Engine which, in turn,
uses a mix of simulation and model-based estimation. As discussed in Section 4, the What-if Engine simulates the task scheduling and block-placement policies over a hypothetical cluster, and
uses performance models to predict the data flow and performance
of the MapReduce jobs in the workload. The latest Hadoop release
includes a Hadoop simulator, called Mumak, that we initially at-
7. ACKNOWLEDGEMENTS
We are extremely grateful to Amazon Web Services for awarding
us a grant for the use of resources on the Amazon Elastic Compute
Cloud (EC2) and Elastic MapReduce.
8. REFERENCES
APPENDIX
A. STARFISH'S VISUALIZER
When a MapReduce job executes in a Hadoop cluster, a lot of information is generated including logs, counters, resource utilization
metrics, and profiling data. This information is organized, stored,
and managed by Starfish's Metadata Manager in a catalog that can be viewed using Starfish's Visualizer. A user can employ the Visualizer to get a deep understanding of a job's behavior during execution, and to ultimately tune the job. Broadly, the functionality
of the Visualizer can be categorized into Timeline views, Data-flow
views, and Profile views.
A.1 Timeline Views
Timeline views are used to visualize the progress of a job execution at the task level. Figure 16 shows the execution timeline of
map and reduce tasks that ran during a MapReduce job execution.
The user can observe information like how many tasks were running at any point in time on each node, when each task started and
ended, or how many map or reduce waves occurred. The user is
able to quickly spot any variance in the task execution times and
discover potential load balancing issues.
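A simple way to turn such a timeline into a load-imbalance check is to flag tasks whose durations are statistical outliers; this heuristic is an assumption for illustration, not how the Visualizer works internally:

```python
from statistics import mean, pstdev

def find_stragglers(task_times, threshold=1.5):
    """Flag tasks whose duration exceeds the mean by more than
    `threshold` population standard deviations. task_times maps a
    task id to its (start, end) timestamps from the timeline."""
    durations = {t: end - start for t, (start, end) in task_times.items()}
    mu = mean(durations.values())
    sigma = pstdev(durations.values())
    if sigma == 0:
        return []  # perfectly balanced execution
    return [t for t, d in durations.items() if (d - mu) / sigma > threshold]
```

A task flagged this way is exactly the kind of long bar that stands out visually in the timeline and hints at skewed input data or a slow node.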
Moreover, Timeline views can be used to compare different executions of the same job run at different times or with different
parameter settings. Comparison of timelines will show whether
the job behavior changed over time as well as help understand the
impact that changing parameter settings has on job execution. In
addition, the Timeline views support a What-if mode, with which the user can visualize what the execution of a job would look like when run with different parameter settings. For example, the user can determine the impact of decreasing the value of io.sort.mb on map task
execution. Under the hood, the Visualizer invokes the What-if Engine to generate a virtual job profile for the job in the hypothetical
setting (recall Section 3).
Figure 16: Execution timeline of the map and reduce tasks of a MapReduce job
Figure 17: Visual representation of the data-flow among the Hadoop nodes during a MapReduce job execution