
CP 422 Programming for Big Data

MapReduce design pattern (3)


MapReduce Patterns
• Summarization Patterns: Get a top-level view by
summarizing and grouping data
• Filtering Patterns: View data subsets such as records
generated from one user
• Data organization patterns: Reorganize data to work with
other systems, or to make MapReduce analysis easier
• Join patterns: Analyze different datasets together to
discover interesting relationships
• Metapatterns: Piece together several patterns to solve
multi-stage problems, or to perform several analytics in the
same job
• Input and output patterns: Customize the way you use
Hadoop to load or store data
Metapatterns
• “patterns about patterns”

• Job chaining: complex, multistage problems

• Job merging: performing several analytics in the
same MapReduce job
Metapatterns - Job Chaining
• Some problems cannot be solved by a single
MapReduce job

• Solving them with a series of MapReduce jobs
brings more challenges

• Hadoop is designed to handle one MapReduce job very
well; handling multiple jobs takes a lot of
manual coding
• Some Notes
• Take the driver for each MapReduce job and call the
drivers in the order the jobs should run. Make sure that
the output path of the first job is the input path of the
second.
• In a production scenario, the temporary directories
should be cleaned up so they don’t linger past the
completion of the job.
• Use a nonblocking job completion check and poll it
repeatedly to see whether all of the jobs are complete.
• Pay attention to job success. It’s not enough to
know that a job completed. You also need to
check whether it succeeded.
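The notes above can be sketched as a small driver. This is a plain-Python simulation, not Hadoop code: `run_job`, `run_chain`, the mapper and reducer functions, and the `part-00000` file layout are all illustrative assumptions. It runs the two jobs one after the other (so it omits the nonblocking polling), wires the first job's output directory in as the second job's input, checks success after each step, and removes the temporary directory at the end.

```python
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path


def run_job(mapper, reducer, input_dir, output_dir):
    """Simulate one MapReduce job. Returns True on success, False on failure."""
    try:
        groups = defaultdict(list)
        for part in sorted(Path(input_dir).glob("part-*")):
            for line in part.read_text().splitlines():
                for key, value in mapper(line):
                    groups[key].append(value)
        out = Path(output_dir)
        out.mkdir(parents=True)  # fails if the output directory already exists
        lines = [f"{k}\t{v}" for key in sorted(groups)
                 for k, v in reducer(key, groups[key])]
        (out / "part-00000").write_text("".join(line + "\n" for line in lines))
        return True
    except Exception:
        return False


def posts_per_user_mapper(line):      # job 1: "user,post" -> (user, 1)
    user, _post = line.split(",", 1)
    yield user, 1


def count_to_bin_mapper(line):        # job 2: "user<TAB>count" -> (count, 1)
    _user, count = line.split("\t")
    yield int(count), 1


def sum_reducer(key, values):         # shared by both jobs
    yield key, sum(values)


def run_chain(input_dir, final_dir):
    """Run job 1, then job 2 on job 1's output. Returns overall success."""
    temp_root = Path(tempfile.mkdtemp())
    intermediate = temp_root / "job1-out"
    try:
        # A finished job is not necessarily a successful one: check each result.
        if not run_job(posts_per_user_mapper, sum_reducer, input_dir, intermediate):
            return False
        # The output path of the first job is the input path of the second.
        return run_job(count_to_bin_mapper, sum_reducer, intermediate, final_dir)
    finally:
        # Clean up the temporary directory so it does not linger.
        shutil.rmtree(temp_root, ignore_errors=True)


def demo():
    base = Path(tempfile.mkdtemp())
    try:
        (base / "in").mkdir()
        (base / "in" / "part-00000").write_text(
            "alice,p1\nbob,p2\nalice,p3\ncarol,p4\ncarol,p5\n")
        ok = run_chain(base / "in", base / "out")
        return ok, (base / "out" / "part-00000").read_text()
    finally:
        shutil.rmtree(base, ignore_errors=True)
```

Here job 1 counts posts per user and job 2 counts how many users share each post count, a question neither job could answer on its own.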
Metapatterns - Job Merging
• Job merging is a process that allows two unrelated
jobs that are loading the same data to share the
MapReduce pipeline.

• The data needs to be loaded and parsed only once

• With the job merging pattern, we’ll have one
MapReduce job that logically performs the two jobs
at once without mixing the two applications
[Figure: original jobs compared with merged jobs]
• Requires:
• both jobs need to have the same intermediate keys and
output formats
• Steps for merging:
• Bring the code for the two mappers together.
• In the mapper, change the writing of the key and value
to “tag” the key with the map source.
• In the reducer, parse out the tag and use an if-statement
to switch what reducer code actually gets executed.
• Use MultipleOutputs to separate the output for the jobs.
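The merging steps above can be sketched in plain Python. This is a simulation under assumptions, not Hadoop code: the two analytics (job A counts posts per user, job B finds each user's longest post), the `"A:"`/`"B:"` tag format, and all function names are hypothetical, and a dictionary with one entry per tag stands in for the separation that MultipleOutputs provides.

```python
from collections import defaultdict


def merged_mapper(record):
    """Parse the record once, then emit tagged pairs for both logical jobs."""
    user, text = record.split(",", 1)
    yield "A:" + user, 1            # job A: one count per post
    yield "B:" + user, len(text)    # job B: length of this post


def merged_reducer(tagged_key, values):
    """Parse out the tag and switch to the matching reducer logic."""
    tag, key = tagged_key.split(":", 1)
    if tag == "A":
        return tag, key, sum(values)    # job A: total posts
    if tag == "B":
        return tag, key, max(values)    # job B: longest post
    raise ValueError(f"unknown tag: {tag}")


def run_merged_job(records):
    groups = defaultdict(list)
    for record in records:              # the data is loaded only once
        for key, value in merged_mapper(record):
            groups[key].append(value)
    outputs = {"A": {}, "B": {}}        # one separate output per original job
    for tagged_key in sorted(groups):
        tag, key, result = merged_reducer(tagged_key, groups[tagged_key])
        outputs[tag][key] = result
    return outputs
```

Because the tag is part of the key, pairs from the two jobs never land in the same reduce group, so the two applications stay logically separate even though they share one pipeline.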
Input and Output Patterns
• Customizing input and output improves the value of
MapReduce
• Skip storing data in HDFS first and accept data
directly from its source
• Useful when the basic Hadoop data paradigm does not
fit the problem

• Patterns:
• generating data
• external source input
• partition pruning
• external source output
Input and Output Patterns - Generating Data
• Generate data on the fly and in parallel

• Does not load data

• Use case example:
• Generate random data for test purposes
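A minimal sketch of the idea, in plain Python rather than Hadoop: each simulated task receives only a task number (there is no input data to load) and produces its own batch of random test records, with the tasks running in a thread pool. The record layout, the `USERS` list, and the function names are assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor

USERS = ["alice", "bob", "carol", "dave"]   # hypothetical values for test data


def generate_split(task_id, records_per_task, seed=0):
    """One task: create random records from nothing but its task number."""
    rng = random.Random(seed * 1_000_003 + task_id)  # independent stream per task
    return [
        f"{task_id}-{i}\t{rng.choice(USERS)}\t{rng.randint(1, 100)}"
        for i in range(records_per_task)
    ]


def generate_data(num_tasks, records_per_task, seed=0):
    """Run all generator tasks in parallel and collect their records."""
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        parts = pool.map(
            lambda task_id: generate_split(task_id, records_per_task, seed),
            range(num_tasks),
        )
        return [record for part in parts for record in part]
```

Seeding each task from its own number keeps the output reproducible, which is convenient when the generated data feeds tests.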
External Source Input
• The external source input pattern doesn’t load data
from HDFS, but instead from some system outside
of Hadoop, such as an SQL database or a web
service.
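As a rough illustration, the sketch below reads records from an SQL database instead of from files. It is plain Python with an in-memory SQLite database standing in for the external system; the `posts` table, the id-range splitting scheme, and the function names are assumptions. A single shared connection is used only because the database lives in memory; on a cluster, each task would open its own connection to the external source.

```python
import sqlite3


def build_demo_db(num_rows):
    """Create a stand-in external database with ids 1..num_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user TEXT)")
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?)",
        [(i, f"user{i % 3}") for i in range(1, num_rows + 1)],
    )
    conn.commit()
    return conn


def make_splits(conn, num_splits):
    """Divide the id range into contiguous (first_id, last_id) splits."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM posts").fetchone()
    if lo is None:                               # empty table: nothing to read
        return []
    step = -(-(hi - lo + 1) // num_splits)       # ceiling division
    return [(start, min(start + step - 1, hi))
            for start in range(lo, hi + 1, step)]


def read_split(conn, split):
    """Record reader for one split: yields (id, user) rows in its id range."""
    first_id, last_id = split
    query = "SELECT id, user FROM posts WHERE id BETWEEN ? AND ? ORDER BY id"
    yield from conn.execute(query, (first_id, last_id))
```

Each split is a description of which rows to fetch, not a block of a file, so the splitting logic has to come from knowledge of the external system.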
Input and Output Patterns - Partition Pruning
• Partition pruning configures the way the framework
picks input splits and excludes files from being loaded
into MapReduce based on their names.

• Intent:
• You have a data set that is partitioned by a
predetermined value, so the data can be loaded
dynamically based on what the application requests.
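A small sketch of pruning by file name, assuming the data is partitioned by day and each file is named like `posts-2024-01-15.txt` (the naming scheme and function names are hypothetical). The selection looks only at names, so files outside the requested range are never opened.

```python
from datetime import date


def partition_date(filename):
    """Extract the partition value from a name like 'posts-2024-01-15.txt'."""
    stem = filename.rsplit(".", 1)[0]           # drop the extension
    return date.fromisoformat(stem.split("-", 1)[1])


def prune_partitions(filenames, start, end):
    """Keep only the files whose partition date lies in [start, end]."""
    return [name for name in filenames
            if start <= partition_date(name) <= end]
```

For example, asking for the second half of January 2024 would select only the files dated in that range and skip the rest without reading them.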
Input and Output Patterns - External Source Output
• The external source output pattern writes data to a
system outside of Hadoop and HDFS

• The pattern skips storing data in a file system
entirely and sends output key/value pairs directly
where they belong.

• Make sure the destination system can handle
the parallel ingest it is bound to endure with all the
open connections
The structure of the external source output pattern
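The pattern can be sketched in plain Python with a SQLite file standing in for the external system. This is an assumption-laden illustration, not Hadoop code: the `counts` table and the function names are hypothetical. Each simulated output task opens its own connection and writes its key/value pairs straight to the database, with no intermediate files. The tasks run one after another here for simplicity; on a cluster they would run concurrently, which is why the destination must cope with many open connections.

```python
import os
import shutil
import sqlite3
import tempfile


def write_partition(db_path, pairs):
    """One output task: open a connection and write pairs directly."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT OR REPLACE INTO counts (key, value) VALUES (?, ?)",
                pairs,
            )
    finally:
        conn.close()


def run_output_tasks(db_path, partitions):
    """Prepare the destination table, then run one task per partition."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counts (key TEXT PRIMARY KEY, value INTEGER)"
        )
    conn.close()
    for pairs in partitions:        # sequential here; parallel on a cluster
        write_partition(db_path, pairs)


def read_counts(db_path):
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT key, value FROM counts"))
    finally:
        conn.close()


def demo():
    workdir = tempfile.mkdtemp()
    try:
        db_path = os.path.join(workdir, "external.db")
        run_output_tasks(db_path, [[("a", 1), ("b", 2)], [("c", 3)]])
        return read_counts(db_path)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```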
Summary
• Summarization Patterns: Numerical summarizations,
Inverted index, Counting with counters.
• Filtering Patterns: Filtering, Bloom filtering, Top ten, Distinct
• Data organization patterns: The structured to hierarchical
pattern, The partitioning and binning patterns, The total
order sorting and shuffling patterns
• Join patterns
• Metapatterns: Job chaining, Job merging
• Input and output patterns: generating data, external source
input, partition pruning, external source output
Future of MapReduce Patterns
• New features and new systems on Hadoop

• Trends in data - Not only textual data

• Images, Audio, and Video

• Streaming
