
HADOOP PERFORMANCE TUNING

Abstract

This paper explains how to tune the Hadoop configuration parameters that directly affect Map-Reduce job performance under various conditions, in order to achieve maximum performance. The paper is intended for users who are already familiar with Hadoop HDFS and Map-Reduce.

Contents

Abstract
Document Flow
Definitions
Map Reduce Workflow
Parameters affecting Performance
Other Performance Aspects
Conclusion
About Impetus

Document Flow

The document describes the Map-Reduce job workflow: the different phases of Map-Reduce operations and the usage of configuration parameters at different stages of a Map-Reduce job. It explains the configuration parameters, their default values, pros, cons, and suggested values under different conditions. It also explains other performance aspects such as temporary space and Reducer lazy initialization. Further, the document provides a detailed case study with statistics.

Definitions

The document highlights the following Hadoop configuration parameters with respect to performance tuning:

• dfs.block.size: Specifies the size of the data blocks into which the input data set is split.

• mapred.compress.map.output: Specifies whether to compress the output of maps.

• mapred.map/reduce.tasks.speculative.execution: When a task (map or reduce) runs much more slowly than expected (due to hardware degradation or software mis-configuration), the Job Tracker runs another, equivalent task as a backup on another node. This is known as speculative execution. The output of the task that finishes first is taken and the other task is killed.

• mapred.tasktracker.map/reduce.tasks.maximum: The maximum number of map/reduce tasks that will be run simultaneously by a task tracker.

• io.sort.mb: The size of the in-memory buffer (in MB) used by a map task for sorting its output.

• io.sort.factor: The maximum number of streams to merge at once when sorting files. This property is also used in the reduce phase. It is fairly common to increase this to 100.

• mapred.job.reuse.jvm.num.tasks: The maximum number of tasks to run for a given job for each JVM on a tasktracker. A value of -1 indicates no limit: the same JVM may be used for all tasks of a job.

• mapred.reduce.parallel.copies: The number of threads used to copy map outputs to the Reducer.
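Most of these properties can be set per job; mapred.tasktracker.map/reduce.tasks.maximum is a tasktracker-side setting and dfs.block.size is applied when files are written to HDFS, so those belong in the cluster or client configuration instead. The following is a minimal sketch with illustrative values (not recommendations from this paper), using the 0.18-era JobConf API:

    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {                             // hypothetical driver class
      public static JobConf configure() {
        JobConf conf = new JobConf(TuningExample.class);
        conf.setInt("io.sort.mb", 200);                      // larger map-side sort buffer
        conf.setInt("io.sort.factor", 100);                  // merge more streams at once
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);   // reuse each task JVM without limit
        conf.setInt("mapred.reduce.parallel.copies", 10);    // more copier threads per reduce
        conf.setBoolean("mapred.compress.map.output", true); // compress map output
        return conf;
      }
    }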

Map Reduce Workflow

This section explains the various phases in a Map-Reduce job and the data flow across the job.

Map Operations: A map task involves the following actions.

• Map Processing: HDFS splits the large input data set into smaller data blocks (64 MB by default), controlled by the property dfs.block.size. Data blocks are provided as input to the map tasks. The number of blocks given to each map depends on mapred.min.split.size and mapred.max.split.size. If mapred.min.split.size is less than the block size and mapred.max.split.size is greater than the block size, then one block is sent to each map task. The block data is split into key-value pairs based on the Input Format. The map function is invoked for every key-value pair in the input. Output generated by the map function is written to a circular memory buffer associated with each map. The buffer is 100 MB by default and can be controlled by the property io.sort.mb.

• Spill: When the buffer contents reach a threshold size controlled by io.sort.spill.percent (default 0.80, or 80%), a background thread starts to spill the contents to disk. While the spill takes place, the map continues to write data to the buffer unless it is full. Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property, in a job-specific subdirectory. A new spill file is created each time the memory buffer reaches the spill threshold.

• Partitioning: Before writing to the disk, the background thread divides the data into partitions (based on the partitioner used) corresponding to the Reducers where they will be sent (a minimal partitioner sketch follows this list).

• Sorting: An in-memory sort is performed on the key (based on the compareTo method of the key class). The sorted output is provided to the combiner function, if any.

• Merging: Before the map task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.

• Compression: The map output can be compressed before writing to the disk for faster disk writes, less disk space, and a reduced amount of data to transfer to the Reducers. By default the output is not compressed, but it is easy to enable by setting mapred.compress.map.output to true. The compression library to use is specified by mapred.map.output.compression.codec.
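The partitioner determines which Reducer receives each map output record; by default Hadoop hashes the key. As a hedged illustration only (not code from the paper), a custom partitioner in the 0.18-era API could route records by a chosen field of the key:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical partitioner: partitions on the part of the key before ':'
    public class FieldPartitioner implements Partitioner<Text, Text> {
      public int getPartition(Text key, Text value, int numPartitions) {
        String field = key.toString().split(":")[0];            // field used for grouping
        return (field.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
      public void configure(JobConf job) { }                    // no configuration needed
    }

It would be registered on the job with conf.setPartitionerClass(FieldPartitioner.class).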


Output file partitions are made available to the Reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.

Reduce Operations: The Reducer has three phases.

1. Copy: Each map task's output for the corresponding Reducer is copied as soon as the map task completes. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is 5 threads, but this can be changed by setting the mapred.reduce.parallel.copies property. The map output is copied to the reduce tasktracker's memory buffer, whose size is controlled by mapred.job.shuffle.input.buffer.percent (the proportion of the heap to use for this purpose). When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time in subsequent merging.

2. Sort: This phase should actually be called the merge phase, as the sorting is done on the map side. This phase starts when all the maps have completed and their output has been copied. Map outputs are merged maintaining their sort order. This is done in rounds. For example, if there were 40 map outputs and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map's merge), there would be 4 rounds: in the first round 4 files are merged, and in the remaining 3 rounds 10 files are merged each time. The last batch of files is not merged to disk but fed directly to the reduce phase.

3. Reduce: During the reduce phase the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS.
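The copy and merge behaviour described above is governed by a handful of job properties. As a hedged sketch (illustrative values only, not recommendations from the paper), they could be tuned per job like this:

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {                                       // hypothetical helper
      public static void apply(JobConf conf) {
        conf.setInt("mapred.reduce.parallel.copies", 10);              // copier threads per reduce
        conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");   // heap share for the shuffle buffer
        conf.set("mapred.job.shuffle.merge.percent", "0.66");          // buffer fill level that triggers a merge
        conf.setInt("mapred.inmem.merge.threshold", 1000);             // map-output count that triggers a merge
      }
    }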
Parameters affecting Performance

dfs.block.size: File system block size

• Default: 67108864 (bytes), i.e. 64 MB

• Suggestions:

o Small cluster and large data set: the default block size will create a large number of map tasks.

e.g. If the input data size is 160 GB and dfs.block.size = 64 MB, the minimum no. of maps = (160*1024)/64 = 2560 maps.
If dfs.block.size = 128 MB, the minimum no. of maps = (160*1024)/128 = 1280 maps.
If dfs.block.size = 256 MB, the minimum no. of maps = (160*1024)/256 = 640 maps.


In a small cluster (6-7 nodes) the map task creation overhead is considerable, so dfs.block.size should be large in this case, yet small enough to utilize all the cluster resources.

o The block size should be set according to the size of the cluster, the map task complexity, the map task capacity of the cluster, and the average size of the input files. If the map computation is such that one data block takes much more time than another, the DFS block size should be smaller.
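dfs.block.size is a client-side default applied to files as they are written into HDFS, so to get larger blocks the input files have to be loaded (or re-loaded) with that setting. A hedged sketch, assuming the input is copied into HDFS from a Java client (paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {                            // hypothetical loader
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);    // 256 MB blocks for files created below
        FileSystem fs = FileSystem.get(conf);
        // files written through this FileSystem use the 256 MB block size
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
      }
    }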
mapred.compress.map.output: Map output compression

• Default: False

• Pros: Faster disk writes, saves disk space, less time in data transfer (from Mappers to Reducers).

• Cons: Overhead of compression at the Mappers and decompression at the Reducers.

• Suggestions: For large clusters and large jobs this property should be set to true. The compression codec can also be set through the property mapred.map.output.compression.codec (the default is org.apache.hadoop.io.compress.DefaultCodec).
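As a hedged illustration (the codec choice is an assumption, not a recommendation from the paper), the old mapred API exposes convenience setters for this:

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOutputCompression {                        // hypothetical helper
      public static void enable(JobConf conf) {
        conf.setCompressMapOutput(true);                       // same as mapred.compress.map.output=true
        conf.setMapOutputCompressorClass(GzipCodec.class);     // sets mapred.map.output.compression.codec
      }
    }

A lighter codec than gzip may be preferable when the job is already CPU-bound; the right choice depends on the cluster and the data.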
mapred.map/reduce.tasks.speculative.execution: Enable/disable speculative execution of (map/reduce) tasks

• Default: True

• Pros: Reduces the job time if task progress is slow due to memory unavailability or hardware degradation.

• Cons: Increases the job time if task progress is slow due to complex and large calculations. On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are executed in an attempt to bring down the execution time of a single job.

• Suggestions: In large jobs where the average task completion time is significant (> 1 hr) due to complex and large calculations, and high throughput is required, speculative execution should be set to false.

mapred.tasktracker.map/reduce.tasks.maximum: Maximum (map/reduce) tasks for a tasktracker

• Default: 2

• Suggestions:

o This value should be set according to the hardware specification of the cluster nodes and the resource requirements of the tasks (map/reduce).

e.g. A node has 8 GB main memory + 8-core CPU + swap space.
Maximum memory required by a task ~ 500 MB
Memory required by the tasktracker, Datanode and other processes ~ (1 + 1 + 1) = 3 GB
Maximum tasks that can be run = (8-3) GB / 500 MB = 10

The number of map or reduce tasks (out of the maximum tasks) can be decided on the basis of the memory usage and computational complexity of the tasks.


o The memory available to each task (JVM) is controlled by the mapred.child.java.opts property. The default is -Xmx200m (200 MB). Other JVM options can also be provided in this property.
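For example, a job whose tasks need more heap and a particular garbage collector could override the child JVM options per job. This is a sketch with illustrative values, not a recommendation from the paper:

    import org.apache.hadoop.mapred.JobConf;

    public class ChildJvmOptions {                             // hypothetical helper
      public static void apply(JobConf conf) {
        // 512 MB heap plus an explicit GC choice for every map and reduce child JVM
        conf.set("mapred.child.java.opts", "-Xmx512m -XX:+UseParallelGC");
      }
    }

Raising the heap here has to be balanced against mapred.tasktracker.map/reduce.tasks.maximum so that the node's physical memory is not oversubscribed.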
io.sort.mb: Buffer size (in MB) for sorting

• Default: 100

• Suggestions:

o For large jobs (jobs in which the map output is very large), this value should be increased, keeping in mind that it will increase the memory required by each map task. The increment in this value should therefore be in line with the memory available at the node.

o The greater the value of io.sort.mb, the fewer the spills to disk, saving disk writes.

io.sort.factor: Stream merge factor

• Default: 10

• Suggestions:

o For large jobs (jobs in which the map output is very large and the number of maps is also large) which have a large number of spills to disk, the value of this property should be increased.

o Increasing io.sort.factor also benefits merging at the Reducers, since the last batch of streams (up to io.sort.factor of them) is sent to the reduce function without merging, thus saving merge time.

mapred.job.reuse.jvm.num.tasks: Reuse a single JVM

• Default: 1

• Suggestions: The overhead of JVM creation for each task is around 1 second. So for tasks which live for only seconds or a few minutes and have a lengthy initialization, this value can be increased to gain performance.

mapred.reduce.parallel.copies: Threads for parallel copy at the Reducer

• Default: 5

• Description: The number of threads used to copy map outputs to the Reducer.

• Suggestions: For large jobs (jobs in which the map output is very large), the value of this property can be increased, keeping in mind that it will increase the total CPU usage.

Other Performance Aspects

Temporary space: Jobs which generate large intermediate data (map output) should have enough temporary space, controlled by the property mapred.local.dir. This property specifies the list of directories where Map-Reduce stores intermediate data for jobs. The data is cleaned up after the job completes.


By default, the replication factor for file storage on HDFS is 3, which means that every file has three replicas. As a rule of thumb, at least 25% of the total hard disk should be reserved for intermediate temporary output. So, effectively, only ¼ of the hard disk space is available for business use.

The default value for mapred.local.dir is ${hadoop.tmp.dir}/mapred/local. So if mapred.local.dir is not set, hadoop.tmp.dir must have enough space to hold the job's intermediate data. If the node does not have enough temporary space, the task attempt fails and a new attempt is started, thus reducing performance.

JVM tuning: Besides normal Java code optimizations, the JVM settings for each child task also affect the processing time. On the slave node end, the Tasktracker and Datanode use 1 GB RAM each. Effective use of the remaining RAM, as well as choosing the right GC mechanism for each map or reduce task, is very important for maximum utilization of hardware resources. The default maximum RAM for child tasks is 200 MB, which might be insufficient for many production-grade jobs. The JVM settings for child tasks are governed by the mapred.child.java.opts property.

Reducer Lazy Initialization: In a Map-Reduce job, Reducers are initialized with the Mappers at job initialization, but the reduce method is called in the reduce phase, when all the maps have finished. So in large jobs where a Reducer loads data (> 100 MB for business logic) in memory on initialization, performance can be increased by lazily initializing the Reducers, i.e. loading the data in the reduce method, guarded by an initialization flag variable which ensures that it is loaded only once (a sketch of such a Reducer follows the example below).

By lazily initializing Reducers which require memory (for business logic) on initialization, the number of maps can be increased (controlled by mapred.tasktracker.map.tasks.maximum).

e.g. Total memory per node = 8 GB
Maximum memory required by each map task = 400 MB
If the Reducer loads 400 MB of data (metadata for business logic) on initialization,
Maximum memory required by each reduce task = 600 MB
No. of reduce tasks to run = 4
Maximum memory required by Tasktracker + Datanode = 2 GB


Before lazy initialization of Reducers:
Maximum memory required by all Reducers throughout the job = 600 * 4 = 2400 MB
Number of map tasks that can be run = (8-2-2.4) GB / 400 MB = 9

After lazy initialization of Reducers:
Maximum memory required by all Reducers during the copy and sort phases = (600-400)*4 = 800 MB
Number of map tasks that can be run = (8-2-0.8) GB / 400 MB = 13

So by lazily initializing Reducers, 4 more map tasks can be run, thereby increasing the performance.
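The following is a minimal sketch of such a lazily initialized Reducer in the 0.18-era API; the metadata type and the MetadataLoader helper are placeholders, not code from the case study:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class LazyInitReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      private JobConf conf;
      private Map<String, String> metadata;   // acts as the initialization flag: null until first reduce()

      public void configure(JobConf job) {
        this.conf = job;                      // deliberately do NOT load the metadata here
      }

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        if (metadata == null) {
          metadata = MetadataLoader.load(conf);   // hypothetical helper; runs once, after the maps finish
        }
        while (values.hasNext()) {
          Text value = values.next();
          // apply the business logic using the metadata, then emit
          output.collect(key, value);
        }
      }
    }

Because the metadata is loaded inside reduce(), no Reducer holds it during the map and copy phases, which is what frees the memory for the extra map slots in the calculation above.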
Case Study 1

Organizations that work with huge amounts of data logs offer perfect examples of Hadoop requirements. This case study is for an organization engaged in analyzing and aggregating information collected from various data logs of size more than 160 GB per month.

Problem Statement: Processing more than 160 GB of data logs based on some business logic, using around 1.2 GB of metadata (complex hierarchical look-up/master data). Processing involves complex business logic, grouping based on different fields, and sorting. It also involves processing metadata from raw metadata files.

Challenges:

1. The metadata is not strictly metadata, since it is updated while processing, based on some field.
2. Processing generates huge intermediate data (~15 TB).
3. A large number of records (> 170 million) fall in a few aggregation groups.
4. The metadata (~1.2 GB) should remain in memory while processing.
5. The large metadata has to be provided to different parallel jobs and tasks.
6. The solution has to run on a small cluster of 7 nodes.
7. Multiple outputs were required from single input files.

Cluster Specifications:

Datanodes + tasktrackers = 6 machines
Namenode + jobtracker = 1 machine
Each machine has 8 GB main memory, 8-core CPU, 5 TB hard drive
Cloudera Hadoop 0.18.3


Approaches: The problem was divided into 5 Map-Reduce jobs: a metadata job, an input processing job for parsing the input files and generating 2 outputs, an intermediate output 1 processing job, an intermediate output 2 processing job, and an additional downstream aggregation job.

Metadata Processing: The first step was to process the metadata from raw metadata files using some business logic. The metadata was in the form of in-memory hash maps. Here the challenge (5) was how to provide this in-memory data to different jobs and tasks in a parallel environment, since Hadoop does not have global (shared) storage among different tasks. The solution arrived at was to serialize all the hash maps, store them into files (~390 MB), then add these files to the distributed cache (a cache provided by Hadoop, which copies every resource added to it to the local file system of each node), and de-serialize them in all the tasks.
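The distributed cache mechanism mentioned above is exposed through the DistributedCache class. As a hedged sketch (the file path and the deserialization step are placeholders, not the case study's actual code):

    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class MetadataCacheExample {                        // hypothetical helper
      // At job submission time: register a serialized metadata file already stored in HDFS.
      public static void register(JobConf conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/meta/lookup-maps.ser"), conf);   // hypothetical path
      }

      // Inside a task (e.g. in configure() of a Mapper or Reducer): find the local copy.
      public static Path locate(JobConf conf) throws IOException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        return localFiles[0];   // de-serialize this file back into the in-memory hash maps
      }
    }

Each tasktracker node gets one local copy of the file, so every task on that node can de-serialize it from local disk instead of re-reading it from HDFS.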
o Approach 1: Due to challenge 1, it was decided that whichever job updates the metadata should be processed sequentially, i.e. only 1 map and 1 reduce task should run. The main processing job had the business logic as part of the Maps.

Feasibility: This approach worked only for small input data sets (< 500 MB), since the memory requirements were high: the aggregation records (key, value) ran into 170 million+ records in some cases. Overall this approach was unfeasible for the target input data set.

o Approach 2: As a solution for challenge 1, all the processing was moved to the Reducers and the map output was partitioned on the fields on the basis of which the metadata was updated. Hadoop's default hash partitioner was used, and the map output key's hashCode method was written accordingly. Now all the jobs could run in parallel.

Feasibility: This approach could process more data than the previous approach, but was still not feasible for processing the target input data set.

o Approach 3: Two jobs (input processing and intermediate output 1) were merged together (since one job had all the processing in the Mapper and the other in the Reducer) to save the time of an extra read and write (> 320 GB) to HDFS. Now this combined job was generating huge intermediate data, ~15 TB (challenge 2), which caused the job to fail with the error:

    Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException:
    No space left on device
        at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)

The solution to the above error was to increase the space provided to the mapred.local.dir directories so that 15 TB of data could be stored. In addition, the property mapred.compress.map.output was set to true.


Then came challenge 3: grouping on one field caused ~170 million records to fall into a few groups, and the business logic was such that the intermediate data generated by processing each group had to remain in memory until a new group arrived, causing memory shortage and ultimately job failure. The solution was to tweak the business logic so that the grouping could be done on more than one field, thereby reducing the number of records in a group. Another optimization, to overcome the effect of challenge 6, was to increase dfs.block.size to 256 MB, since its default value (64 MB) was creating ~5000 maps; in a small cluster of 7 nodes the map creation overhead was significant. After the block size change, 631 maps were being created for the input data.

Feasibility: This approach was feasible, but the estimated time to process the target input data set was more than 200 hrs.

o Approach 4: Another solution, for challenge 4, was to use memcached for the metadata instead of keeping it in memory, so that the number of maps could be increased. First the spy memcached client was used, but it had some data leakages and did not work. Finally the Danga memcached client was used and it worked well. Throughput using memcached was around 10000 transactions per second.

Here are some statistics of the job running with memcached.

Stats: Attempt 1

Feasibility: Time taken ~136 hrs
No. of memcached servers = 1
HDFS data processing rate ~16 MB per minute
Memcached server memory ~1.2 GB
No. of Mappers per node = 10
Total number of Mappers = 60
Total number of Reducers = 12
HDFS chunk size = 256 MB
io.sort.mb = 300
io.sort.factor = 75

Memory usage per node:
Swap space = 32 GB
Per Mapper/Reducer memory ~ 0.5 GB
Total memory usage ~ 0.5*12 + 1 + 1 (tasktracker and datanode) ~ 8 GB


Stats: Attempt 2

Small metadata maps in memory, big maps in memcached, lazy Reducer initialization.

Feasibility: Time taken ~71 hrs
No. of memcached servers = 15 (2*6 + 3 = 15)
HDFS data processing rate ~30 MB per minute
Average loaded data on memcached servers = 105M on 14 servers, 210M on 1 memcached server (with double weight)
No. of Mappers per node = 8
Number of Mappers = 8*6 = 48
Number of Reducers = 12
HDFS chunk size = 256 MB
io.sort.mb = 300
io.sort.factor = 75

Memory usage per node:
Swap space = 32 GB
Per Mapper memory ~ 0.8 GB
Per Reducer memory ~ 50 MB (during copy) + 1.5 GB (during sort and reduce)
Total memory usage ~ 0.8*8 + 100 MB + 1 + 1 (tasktracker and datanode) ~ 8.5 GB

Conclusion: The memcached servers were showing a throughput of around 10000 fetches per second (150000 hits per second across 15 nodes). However, this was not sufficient to meet the requirements for faster processing, or to match up with in-memory hash map lookups.

o Approach 5: This approach uses the fact that a Reducer starts processing data only after the copy and sort phases, by which time the Mappers would have shut down and released their memory (Reducer lazy initialization).

Feasibility: Time taken ~42 hrs
HDFS data processing rate ~8 GB per hour
Maps finished = 19 hrs 37 mins
No. of Mappers per node = 5
No. of Mappers = 5*6 = 30
No. of Reducers = 2*6 = 12
HDFS chunk size = 256 MB
io.sort.mb = 300
io.sort.factor = 75

Memory usage per node:
Per node: RAM ~8 GB, swap 4-6 GB
Per Mapper memory ~ 1.1-1.8 GB
Per Reducer memory ~ 50 MB (during copy) + 1.1-2 GB (during sort and reduce)
Total memory usage in Mapper processing ~ 1.5 GB*5 + 50 MB*2 + 1 + 1 (tasktracker and datanode) ~ 9.6 GB

o Approach 6: Approach 5 was run again with the following configuration:

No. of Mappers per node = 4
No. of Reducers per node = 5
No. of Mappers = 4*6 = 24
No. of Reducers = 5*6 = 30
HDFS chunk size = 256 MB
io.sort.mb = 600
io.sort.factor = 120

Memory usage per node:
Per node: RAM ~8 GB, swap 4-6 GB
Per Mapper memory ~ 1.5-2.1 GB
Per Reducer memory ~ 50 MB (during copy) + 1.3-2 GB (during sort and reduce)
Total memory usage in Mapper processing ~ 1.8 GB*4 + 50 MB*5 + 1 + 1 (tasktracker and datanode) ~ 9.5 GB

Feasibility: Time taken ~39 hrs

Approach 6 gave the best performance and took the minimum time so far for the given cluster specifications. Only 4 map tasks per node could be run, as the use of any additional map was causing excessive swapping, which was a major bottleneck for faster processing.


o Approach 7: Approach 6 was run again, but with Tokyo Cabinet (a Berkeley DB-like file storage system) for metadata handling, to overcome the excessive RAM usage for storing the metadata, with the following configuration:

No. of Mappers per node = 7
No. of Reducers per node = 7
No. of Mappers = 7*6 = 42
No. of Reducers = 7*6 = 42
HDFS chunk size = 256 MB
io.sort.mb = 400
io.sort.factor = 120

Memory usage per node:
Per node: RAM ~8 GB
Per Mapper memory ~ 800 MB
Per Reducer memory ~ 70 MB (during copy) + 700 MB (during sort and reduce)
Total memory usage in Mapper processing ~ 800 MB*7 + 70 MB*7 + 1 + 1 (tasktracker and datanode) ~ 8 GB

Feasibility: Time taken ~35 hrs

Conclusion: Approach 7 gave the maximum performance and took the minimum time for the given cluster specifications.

Conclusion

The performance of a Hadoop Map-Reduce job can be increased without increasing the hardware cost, just by tuning some parameters according to the cluster specifications, the input data size, and the processing complexities.

References

1. HDFS: http://hadoop.apache.org/hdfs/ and http://hadoop.apache.org/common/docs/current/hdfs_design.html
2. Map-Reduce tutorial: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
3. Tom White, Hadoop: The Definitive Guide, O'Reilly.


About Impetus

Impetus is a pure play product engineering company providing outsourced software product design, R&D and related services to leading global software and technology enabled companies. Headquartered in San Jose (CA), and with multiple development centers across India, Impetus has been involved in hundreds of product launches over its 15 year association with clients ranging from enterprise class companies to innovative technology start-ups.

Impetus builds cutting-edge products utilizing the latest and emerging technologies for a diverse client base across domains including Digital Media, Telecommunications, Healthcare, Mobile Applications, Web 2.0, IPTV, and Internet Advertising. The company creates leading edge products that underscore future trends while also offering premium design consulting and product architecture services.

Impetus brings forth a perfect synergy of expert talent and competent engineering skills, coupled with technology frameworks and product development maturity, to help you build sustainable, scalable and efficient next-generation products, and bring them faster to the market.

Corporate Headquarters

UNITED STATES
Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450
San Jose, CA 95129, USA.
Phone: 408.252.7111

Regional Development Centers

INDIA
• New Delhi
• Indore
• Hyderabad

To know more

Visit: http://www.impetus.com
Email: inquiry@impetus.com

Copyright © 2009 Impetus Technologies, Inc. All Rights Reserved
All trademarks and registered trademarks used in this whitepaper are property of their respective owners
Last Revised on October 2009
