The six main Hadoop streaming options in the preceding command are listed and explained as follows:
jar: This option runs a JAR containing the classes that provide the streaming functionality, so that Mappers and Reducers written in languages other than Java can be plugged in. It is known as the Hadoop streaming JAR.
input: This option specifies the location of the input dataset (stored on HDFS) for the Hadoop streaming MapReduce job.
output: This option specifies the HDFS output directory where the results of the Hadoop streaming MapReduce job will be written.
file: This option copies MapReduce resources, such as the Mapper, Reducer, and Combiner files, to the compute nodes (TaskTrackers) so that they are available locally.
mapper: This option identifies the executable Mapper file.
reducer: This option identifies the executable Reducer file.
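Assuming a word-count Mapper and Reducer written as shell scripts (the jar location and HDFS paths below are illustrative, not fixed), the six options come together in a single command. The Hadoop invocation is shown as a comment; the pipeline beneath it reproduces the streaming contract locally (input lines to the Mapper's stdin, Mapper output sorted by key, then fed to the Reducer), which is a common way to test the scripts before submitting the job.

```shell
# Illustrative invocation -- jar path and HDFS paths are assumptions:
# hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
#     -input  /hdfs/path/to/input.txt \
#     -output /hdfs/path/to/output \
#     -file mapper.sh -file reducer.sh \
#     -mapper mapper.sh \
#     -reducer reducer.sh

# mapper.sh: emit one (word, 1) pair per word, tab-separated.
cat > mapper.sh <<'EOF'
#!/bin/sh
tr -s ' ' '\n' | awk '{ print $0 "\t" 1 }'
EOF

# reducer.sh: sum the counts per key; streaming delivers input sorted by key.
cat > reducer.sh <<'EOF'
#!/bin/sh
awk -F'\t' '{ c[$1] += $2 } END { for (k in c) print k "\t" c[k] }' | sort
EOF
chmod +x mapper.sh reducer.sh

# Local simulation of the streaming contract: input -> Mapper -> sort -> Reducer.
printf 'hello world\nhello hadoop\n' > data.txt
./mapper.sh < data.txt | sort | ./reducer.sh
```

Because the Mapper and Reducer only read stdin and write stdout, exactly the same files can be shipped unchanged to the cluster with -file.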
DR Mrs. J N Jadhav, Associate Professor Dept of CSE
There are other Hadoop streaming command options too, but they are optional. Let's have a look at
them:
inputformat: This defines the input data format by specifying a Java class name. By default, it is TextInputFormat.
outputformat: This defines the output data format by specifying a Java class name. By default, it is TextOutputFormat.
partitioner: This includes the class or file containing the code for partitioning the (key, value) pairs output by the Mapper phase.
combiner: This includes the class or file containing the code for reducing the Mapper output by aggregating the values of each key. Alternatively, we can use the default Combiner, which simply combines the values of each key before the Mapper's output is passed to the Reducer.
cmdenv: This option passes environment variables to the streaming command. For example, we can pass R_LIBS=/your/path/to/R/libraries.
inputreader: This can be used instead of the inputformat class to specify the record reader class.
verbose: This enables verbose output.
numReduceTasks: This specifies the number of Reducers.
mapdebug: This debugs the Mapper file's script when the Mapper task fails.
reducedebug: This debugs the Reducer file's script when the Reducer task fails.
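As a sketch of how these optional flags extend the same command (the jar path, HDFS paths, and script names are assumptions for illustration), the commented invocation below requests two Reducers, fixes the input format, and passes an environment variable with cmdenv. The runnable snippet underneath shows what cmdenv amounts to from the script's point of view: the variable simply appears in the Mapper's (and Reducer's) environment, exactly as if it had been exported before the script ran.

```shell
# Illustrative invocation -- paths and names are assumptions:
# hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
#     -input  /hdfs/path/to/input \
#     -output /hdfs/path/to/output \
#     -file mapper.sh -file reducer.sh \
#     -mapper mapper.sh -reducer reducer.sh \
#     -numReduceTasks 2 \
#     -inputformat TextInputFormat \
#     -cmdenv R_LIBS=/your/path/to/R/libraries

# env_mapper.sh: a hypothetical Mapper that just reports the variable,
# to show that -cmdenv makes it visible as an ordinary environment variable.
cat > env_mapper.sh <<'EOF'
#!/bin/sh
echo "R_LIBS is $R_LIBS"
EOF
chmod +x env_mapper.sh

# Locally, -cmdenv is equivalent to setting the variable for the script:
R_LIBS=/your/path/to/R/libraries ./env_mapper.sh < /dev/null
# Prints: R_LIBS is /your/path/to/R/libraries
```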