Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Download
Standard view
Full view
of .
Look up keyword
Like this
98Activity
0 of .
Results for:
No results containing your search query
P. 1
Hadoop Performance Tuning

Hadoop Performance Tuning

Ratings: (0)|Views: 19,175|Likes:
Published by Impetus
Impetus caters to diverse business needs using HPC based innovative solutions including software (Hadoop, Korus, Grids, Erlang) as well as hardware centric GPU (using CUDA) solutions.
This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance.
Impetus caters to diverse business needs using HPC based innovative solutions including software (Hadoop, Korus, Grids, Erlang) as well as hardware centric GPU (using CUDA) solutions.
This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance.

More info:

Published by: Impetus on Nov 24, 2009
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF or read online from Scribd
See more
See less

03/25/2014

 
1
 Abstract
This paper explains tuning of Hadoop configuration parameters whichdirectly affects Map-Reduce job performance under various conditions,to achieve maximum performance. The paper is intended for the userswho are already familiar with Hadoop- HDFS and Map-Reduce.Impetus Technologies Inc.www.impetus.comOctober 2009
HADOOPPERFORMANCE TUNINGHADOOPPERFORMANCE TUNING
 Abstract
1
2
2
2
 
2
WHITE PAPER
HADOOPPERFORMANCE TESTING
Document Flow 
The document contains Map-Reduce jobworkflow. It describes different phases of Map-Reduce operations and usage of configuration parameters at different stagesin the Map-Reduce job. It explains theconfiguration parameters, their defaultvalues, pros, cons, and suggested values indifferent conditions. It also explains other performance aspects such as temporaryspace and Reducer lazy initialization. Further the document provides a detailed case studywith statistics.
Definitions
The document highlights the followingHadoop configuration parameters withrespect to performance tuning-
dfs.block.size
: Specifies the sizeof data blocks in which the input dataset is split
mapred.compress.map.output
: Specifies whether to compress outputof maps.
mapred.map/reduce.tasks.speculative.execution
:::::When a task (map/reduce) runs veryslowly (due to hardware degradation or software mis-configuration) thanexpected. The Job Tracker runs another equivalent task as a backup on another node. This is known as speculativeexecution. The output of the task whichfinishes first is taken and the other task iskilled.
mapred.tasktracker.map/reduce.tasks.maximum
::::: Themaximum number of map/reduce tasksthat will be run simultaneously by a tasktracker.
io.sort.mb
::::: The size of in-memorybuffer (in MBs) used by map task for sorting its output.
io.sort.factor
::::: The maximumnumber of streams to merge at oncewhen sorting files. This property is alsoused in reduce phase. It’s fairlycommon to increase this to 100.
mapred.job.reuse.jvm.num.tasks
::::: The maximum number of tasks to run for a given job for eachJVM on a tasktracker. A value of –1indicates no limit: the same JVM maybe used for all tasks for a job.
mapred.reduce.parallel.copies
::::: The number of threads usedto copy map outputs to the Reducer.
Map Reduce Work Flow 
The diagram below explains various phasesin Map-Reduce job and the data flow acrossthe job.Map Operations:Map Operations:Map Operations:Map Operations:Map Operations: Map task involves thefollowing actions
Map PMap PMap PMap PMap Processing:rocessing:rocessing:rocessing:rocessing: HDFS splits the largeinput data set into smaller data blocks(64 MB by default) controlled by theproperty
dfs.block.size
. Datablocks are provided as an input to maptasks. The number of blocks to eachmap depends on
mapred.min.split.size
and
 mapred.max.split.size
..... If 
 mapred.min.split.size
 is less
 
3
WHITE PAPER
HADOOPPERFORMANCE TESTING
than block size and
 mapred.max.split.size
 isgreater than block size then 1 block issent to each map task. The block datais split into key value pairs based on theInput Format. The map function isinvoked for every key value pair in theinput. Output generated by mapfunction is written in a circular memorybuffer, associated with each map. Thebuffer is 100 MB by default and can becontrolled by the property
io.sort.mb
.....Spill:Spill:Spill:Spill:Spill: When the buffer size reaches athreshold size controlled by
io.sort.spill.percent
(default 0.80 or 80%), a backgroundthread starts to spill the contents todisk. While the spill takes place mapcontinues to write data to the buffer unless it is full. Spills are written inround-robin fashion to the directoriesspecified by the
 mapred.local.dir
property, in ajob-specific subdirectory. A new spill fileis created each time the memory buffer reaches to spill threshold.PPPPPartitioningartitioningartitioningartitioningartitioning Before writing to the diskthe background thread divides the datainto partitions (based on the partitioner used) corresponding to the Reducer where they will be sent.Sorting:Sorting:Sorting:Sorting:Sorting: In-memory sort is performedon key (based on
compareTo
method of key class). The sorted outputis provided to the combiner function if any. Merging:Merging:Merging:Merging:Merging: Before the map task isfinished, the spill files are merged intoa single partitioned and sorted outputfile. The configuration property
io.sort.factor
controls themaximum number of streams to mergeat once; the default is 10.Compression:Compression:Compression:Compression:Compression: The map output can becompressed before writing to the diskfor faster disk writing, lesser disk space,and to reduce the amount of data totransfer to the Reducer. By default theoutput is not compressed, but it is easyto enable by setting
 mapred.compress.map.output
to true. The compression library to useis specified by
 mapred.map.output.compression.codec
.....

Activity (98)

You've already reviewed this. Edit your review.
vijay_kumar243 liked this
1 thousand reads
1 hundred reads
Joshua Wang added this note
This is really a good outline for Hadoop Performance Tuning.
Ilya Kaplun liked this
sati1987 liked this
Ioannis Atsonios liked this
jian1 liked this
iliketoknow added this note
It was very helpful in understanding some of the deeper aspects of how MR functions
hi02 liked this

You're Reading a Free Preview

Download
scribd
/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->