
OPTIMIZATION TECHNIQUES WITHIN THE HADOOP ECO-SYSTEM: A SURVEY

Giulia Rumi, Claudia Colella, Danilo Ardagna

Dipartimento di Elettronica, Informazione e Bioingegneria


Hadoop 1.0

Courtesy of Microsoft
Hadoop 2.x

• RM and AMs implemented within YARN: Yet Another Resource Negotiator

Courtesy of Microsoft
Why scheduling is important

First, we run the sort benchmark with an identity map/reduce function (i.e., the entire input of map tasks is shuffled to reduce tasks and written as output), with 8GB input on 64 machines, each configured with a single map and a single reduce slot, i.e., with 64 map and 64 reduce slots overall.

[Figure: sorting with 64 map and 64 reduce slots; task slots vs. time in seconds; phases: Map, Shuffle, Sort, Reduce]
Why scheduling is important

[Figure 2: Sorting with 16 map and 22 reduce slots; task slots vs. time in seconds; map waves one through five, then Shuffle, Sort, Reduce]

We introduce a job profile that can be used for prediction of the job completion time as a function of assigned resources.

2.2 Job Profile
Our goal is to create a compact job profile that is comprised of performance invariants which are independent of the amount of resources assigned to the job over time and that reflects all phases of a given job: map, shuffle, sort, and reduce phases. This information can be obtained from the counters at the job master during the job's execution or parsed from the logs. More details can be found in Section 6.
The map stage consists of a number of map tasks. To compactly characterize the task duration distribution [...] the profile in the shuffle phase is characterized by two pairs of measurements: (Sh1_avg, Sh1_max, Sh_typ_avg, Sh_typ_max).
The reduce phase begins only after the shuffle phase is complete. The profile of the reduce phase is represented by (R_avg, R_max, Selectivity_R): the average and maximum of the reduce tasks' durations and the reduce selectivity, denoted as Selectivity_R, which is defined as the ratio of the reduce output size to its input.

3. ESTIMATING JOB COMPLETION TIME
In this section, we design a MapReduce performance model that is based on i) the job profile and ii) the performance bounds of completion time of different job phases.
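The phase bounds such a profile feeds into can be sketched as follows. This is a minimal illustration of the classic makespan bounds for n tasks on k slots (lower bound: perfectly balanced load; upper bound: the remaining work spread evenly plus the longest task finishing last), not the full model from the paper:

```python
def phase_bounds(durations, slots):
    """Lower/upper bounds on the completion time of one phase:
    n tasks with the given durations executed on `slots` parallel slots.
    low = n * avg / k (ideal balance); up = (n-1) * avg / k + max
    (worst case: the longest task is scheduled last)."""
    n = len(durations)
    avg, longest = sum(durations) / n, max(durations)
    low = n * avg / slots
    up = (n - 1) * avg / slots + longest
    return low, up
```

Averaging the two bounds gives a single completion-time estimate for the phase; summing phase estimates approximates the whole job.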
MapReduce Ecosystem
Optimization policy goals
• Data locality

• Sticky slots

• Skewness and Map-Reduce interdependence

• Poor system utilization

• Starvation

• Fairness
Optimization at two layers
• Application level optimization (e.g., Pig and Hive queries)

• Task scheduling optimization (i.e., FIFO scheduler enhancement)
Application level optimization

1. Job Profiling: build application invariants
2. Performance Modeling: estimate job/workflow completion times
3. Optimization and Scheduling: allocate map and reduce slots and fulfill deadlines

Aria & Autotune: HP Labs, Lucy Cherkasova et al., ref. 21, 22, 26, 27
Signature-based approach: K. Kambatla et al., ref. 9
Job Profiling

Aria & Autotune
• Past job runs of the whole application
• Execution of a smaller input data set than the original one over the same cluster configuration
• OUTPUT: lower, upper, and average estimated completion time

Signature-based
• Execution of a small fraction of input data on a smaller number of resources
• OUTPUT: Resource Consumption Signature Set
Performance modeling

Aria & Autotune vs. Signature-based

• Signature-based similarity: the vector distance between the signatures of application 1 and application 2
Optimization and Scheduling

Aria & Autotune
• Lagrange Multiplier Method

Signature-based
• Similarity: vector distance between application 1 and application 2
• Optimum resource allocation already computed in the DB!
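As a rough sketch of the signature-based matching step (the signature vectors, application names, and the `best_match` helper here are hypothetical; the actual Resource Consumption Signature is richer), finding the stored application closest to a new job, whose precomputed optimum allocation can then be reused, could look like:

```python
import math

def distance(sig_a, sig_b):
    """Euclidean distance between two resource-consumption signature
    vectors; a smaller distance means more similar applications."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))

def best_match(new_sig, signature_db):
    """Return the stored application whose signature is closest to
    new_sig; its allocation already computed in the DB is then reused."""
    return min(signature_db, key=lambda app: distance(new_sig, signature_db[app]))
```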
Task scheduling optimization
• The Hadoop framework executes its tasks based on a runtime scheduling scheme

• Map and reduce tasks can be executed without communication with other tasks: no contention and synchronization cost between running jobs

• The first scheduler implementation was First In First Out (FIFO)

• Hadoop 2.x:
  • Fair and Capacity schedulers
  • Work conserving preemption
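One FIFO enhancement surveyed here, delay scheduling (aimed at data locality), can be sketched as follows. This is a minimal illustration with hypothetical job records; real Hadoop schedulers track far more state:

```python
def delay_schedule(jobs, free_node, max_skips):
    """Delay scheduling sketch: walk jobs in fair-share order; a job with
    no data-local task on free_node is skipped up to max_skips times
    before it is allowed to launch a non-local (remote) task."""
    for job in jobs:
        if free_node in job["local_nodes"]:
            job["skips"] = 0          # got locality: reset skip counter
            return job["name"], "local"
        if job["skips"] >= max_skips:
            job["skips"] = 0          # waited long enough: run remotely
            return job["name"], "remote"
        job["skips"] += 1             # skip this job, try the next one
    return None, None
```

Waiting a bounded number of scheduling opportunities trades a small delay for a much higher fraction of data-local tasks.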
Task scheduling optimization - Comparison

Scheduler                      Released/Prototype   Hadoop version
FIFO                           Released             1.0/2.0
FAIR                           Released             1.0/2.0
Capacity                       Released             1.0/2.0
Delay Scheduling               Released             1.0
Dynamic Priority               Released             1.0
Deadline Constraint            Prototype            1.0
Adaptive Sched. (Basic Alg.)   Prototype            2.0
Adaptive Sched. (Data aff.)    Prototype            2.0
Adaptive Sched. (Hw aff.)      Prototype            2.0
Dynamic Share                  Prototype            1.0

Issues addressed (✔ count per row in the comparison): data locality (2 schedulers), sticky slots (2), skewness (1), starvation (5), utilization (7), fairness (5)
Conclusions and future work
• Technology continuously evolving

• Distributed/hierarchical optimization solutions according to the YARN architecture

• Integrate batch analysis with data streaming

• Provide a quantitative evaluation of the solutions presented
Thanks for your attention…

… any questions?
