Merged jobs
• Requires:
• The two jobs must share the same intermediate keys and
output formats
• Steps for merging:
• Bring the code for the two mappers together.
• In the mapper, change the writing of the key and value
to “tag” the key with the map source.
• In the reducer, parse out the tag and use an if-statement
to switch what reducer code actually gets executed.
• Use MultipleOutputs to separate the output for the jobs.
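The tagging scheme in the steps above can be sketched in plain Java. This is an illustrative sketch only, not real Hadoop classes: the tag separator, the "A"/"B" source names, and the sum/count reduce logics are all placeholders, and the Hadoop boilerplate (Job setup, MultipleOutputs configuration) is omitted.

```java
import java.util.List;

// Sketch of the key-tagging scheme for merged jobs: the merged mapper
// prefixes each key with its map source, and the merged reducer parses
// the tag to decide which job's reduce logic to run.
public class MergedJobSketch {

    // Mapper side: "tag" the key with the map source ("A" or "B" here).
    static String tagKey(String source, String key) {
        return source + "|" + key;
    }

    // Reducer side: parse out the tag and use an if-statement to switch
    // which reduce logic runs. Job A (a sum) and job B (a count) are
    // placeholder logics for this sketch.
    static String reduce(String taggedKey, List<Integer> values) {
        int sep = taggedKey.indexOf('|');
        String tag = taggedKey.substring(0, sep);
        String key = taggedKey.substring(sep + 1);
        if (tag.equals("A")) {
            int sum = 0;
            for (int v : values) sum += v;     // job A's logic: sum
            return key + "\t" + sum;           // would go to output "A"
        } else {
            return key + "\t" + values.size(); // job B's logic: count
        }
    }
}
```

In a real merged job, the return value of `reduce` would be routed through MultipleOutputs so each logical job's results land in a separate output directory.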
Input and Output Patterns
• Customizing input and output improves the value of
MapReduce: jobs can read from and write to systems
other than HDFS
• Skip the data-staging step – accept data directly from
its source
• Useful when the basic Hadoop data paradigm does not fit
the problem
• Patterns:
• generating data
• external source input
• partition pruning
• external source output
Input and Output Patterns -
Generating Data
• Generate data on the fly
and in parallel
• Intent:
• Generate a large set of data from scratch, in parallel,
with no input data to load
Input and Output Patterns -
Partition Pruning
• Intent:
• Have a set of data that is partitioned by a
predetermined value, which can be used to dynamically
load the data based on what is requested by the
application.
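The partition pruning intent above, data laid out by a predetermined value so a request loads only the partitions it needs, can be sketched as follows. The year-based partitioning and all names are illustrative assumptions; in Hadoop this selection would happen in a custom InputFormat that skips whole input splits.

```java
import java.util.*;

// Sketch of partition pruning: records are stored by a predetermined
// partition value (here, year), so a request reads only the matching
// partitions and prunes the rest without touching them.
public class PartitionPruningSketch {

    // Simulated storage: one "file" of records per partition value.
    static final Map<Integer, List<String>> PARTITIONS = Map.of(
        2019, List.of("a", "b"),
        2020, List.of("c"),
        2021, List.of("d", "e", "f"));

    // Dynamically load only the partitions named in the request.
    static List<String> load(Set<Integer> requestedYears) {
        List<String> out = new ArrayList<>();
        for (int year : requestedYears) {
            out.addAll(PARTITIONS.getOrDefault(year, List.of()));
        }
        return out;
    }
}
```

The saving is that unrequested partitions are never read at all, which matters when each partition is a large file rather than a short in-memory list.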
Input and Output Patterns -
External Source Output
• The external source output pattern writes data to a
system outside of Hadoop and HDFS
• Output can be streamed to the external system as the
job runs, rather than staged in HDFS first
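A minimal sketch of the pattern's write path, under the assumption that each task opens its own connection to the external system. The `ExternalSink` interface stands in for a real client (e.g., a key-value store's API); the in-memory sink exists only so the sketch is runnable, and all names are illustrative.

```java
import java.util.*;

// Sketch of external source output: records are streamed straight to an
// external system as they are produced, bypassing HDFS. A real
// implementation would put this logic in a custom OutputFormat/RecordWriter.
public class ExternalOutputSketch {

    // Stand-in for a real external system's client API.
    interface ExternalSink {
        void put(String key, String value);
        void close();
    }

    // RecordWriter-style wrapper: write() streams each record out as it
    // arrives; close() releases the connection when the task finishes.
    static class ExternalRecordWriter {
        private final ExternalSink sink;
        ExternalRecordWriter(ExternalSink sink) { this.sink = sink; }
        void write(String key, String value) { sink.put(key, value); }
        void close() { sink.close(); }
    }

    // In-memory sink used here only to make the sketch self-contained.
    static class MemorySink implements ExternalSink {
        final Map<String, String> data = new LinkedHashMap<>();
        boolean closed = false;
        public void put(String key, String value) { data.put(key, value); }
        public void close() { closed = true; }
    }
}
```

One design consequence worth noting: because output bypasses HDFS, a failed task may leave partial writes in the external system, so the sink should tolerate replayed records.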