1. MapReduce: MapReduce is a software framework for
easily writing applications that process vast amounts of
data (multi-terabyte datasets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
MapReduce is both a programming model for expressing
distributed computations on massive amounts of data and
an execution framework for large-scale data processing on
clusters of commodity servers.
In short, MapReduce is a programming model for data processing.
Its main characteristics are batch processing, no limit on
the number of passes over the data or on processing time, and
no memory constraints.
Fig: MapReduce Logical Data Flow
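To make the logical data flow concrete, here is a small walk-through (the readings are invented for illustration, in the style of the maximum-temperature example used later in these notes):

map output:    (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
shuffle/sort:  (1949, [111, 78]), (1950, [0, 22, -11])
reduce output: (1949, 111), (1950, 22)

The shuffle groups all values for a key together, so each reduce call sees one key with the full list of its values.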
History of MapReduce: Developed by researchers at Google
around 2003, building on principles of parallel and
distributed processing.
MapReduce provides a clear separation between what to compute
and how to compute it on a cluster.
5. Hadoop Pipes.
1) Analyzing the Data with UNIX Tools: Before bringing Hadoop into
the picture, data can be analyzed with standard UNIX tools such as awk.
The wider data-analysis ecosystem includes Hadoop, Cloudera, Datameer,
Splunk, Mahout, Hive, HBase, LucidWorks, R, MapR, and Linux flavors
such as Ubuntu.
Ex: A program for finding the maximum recorded temperature by year from
NASA weather records.
Program:
#!/usr/bin/env bash
# For each compressed year file: print the year, then scan the records
# with awk for the maximum valid temperature.
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then processing
each file using awk. The awk script extracts two fields from the data: the air temperature and
the quality code. The END block is executed after all the lines in the file have been
processed, and it prints the maximum value.
2) Analyzing the Data with Hadoop: Analyzing the data with
Hadoop rests mainly on MapReduce and HDFS.
To take advantage of the parallel processing that Hadoop provides,
we need to express our query as a MapReduce job.
MapReduce works by breaking the processing into two phases, the
map phase and the reduce phase; each phase has key-value pairs
as input and output. The map() method is passed a key and a value,
and is also given an instance of Context to write the output to.
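As a minimal sketch of what the map phase might look like for the maximum-temperature example (assuming the new org.apache.hadoop.mapreduce API; the class name and field offsets follow the record layout used in the UNIX script above, and are illustrative rather than definitive):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: turns each input line into a (year, temperature) pair.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999; // sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);  // assumed year field of the record
    // Same offsets as the awk script: temperature at columns 88-92, quality code at 93.
    int airTemperature = Integer.parseInt(line.substring(87, 92).trim());
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

A matching reducer would then receive each year together with the list of its temperatures and emit the maximum.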
It consists of
1. Features of MapReduce
2. Counters
3. Sorting
4. Joins
5. Side Data Distribution
6. MapReduce Library Classes
1. Features of MapReduce: MapReduce is a software framework
for easily writing applications that process vast amounts of data
in parallel on large clusters of commodity hardware in a reliable,
fault-tolerant manner.
Features of MapReduce include counters, sorting, and joining
datasets.
It consists of
Scale-out Architecture: Add servers to increase processing power
Security & Authentication: Works with HDFS and HBase security to
make sure that only approved users can operate against the data in the
system
Resource Manager: Employs data locality and server resources to
determine optimal computing operations
Optimized Scheduling: Completes jobs according to prioritization
Flexibility: Procedures can be written in virtually any programming
language
Resiliency & High Availability: Multiple job and task trackers ensure
that jobs fail independently and restart automatically.
Fig: MapReduce Logical Data Flow
2. Counters: The MapReduce framework provides Counters as an
efficient mechanism for tracking the occurrences of global events
within the map and reduce phases of jobs.
Counters are a useful channel for gathering statistics about a job,
whether for quality control or for application-level statistics.
They are also useful for problem diagnosis.
Hadoop maintains built-in counters for every job, which report
various metrics for your job; for example, there are counters
for the number of input files and records processed.
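As a hedged sketch of how such built-in metrics can be read once a job has completed (the helper method name is illustrative; TaskCounter.MAP_INPUT_RECORDS is one of Hadoop's built-in counters):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
  // Illustrative helper: prints one built-in metric for a completed job.
  static void printMapInputRecords(Job job) throws IOException {
    long records = job.getCounters()
        .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    System.out.println("Map input records processed: " + records);
  }
}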
Ex: A typical MapReduce job will kick off several mapper instances,
one for each block of the input data, all running the same code.
These instances are part of the same job, but run independently of
one another.
User-Defined Java Counters: MapReduce allows user-defined Java counters,
created with the Java “enum” keyword.
A job may define an arbitrary number of enums, each with an arbitrary number of
fields.
The name of the enum is the group name, and the enum’s fields are the counter
names.
Ex: TOTAL_LAUNCHED_MAPS counts the number of map tasks that were
launched over the course of a job.
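A minimal sketch of a user-defined counter in a mapper (the enum Temperature and its fields are illustrative names; the record layout follows the earlier example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapperWithCounters
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Group name "Temperature" with counter names MISSING and MALFORMED.
  enum Temperature { MISSING, MALFORMED }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String temp = line.substring(87, 92);
    if (temp.equals("+9999")) {               // assumed missing-value sentinel
      context.getCounter(Temperature.MISSING).increment(1);
      return;
    }
    try {
      context.write(new Text(line.substring(15, 19)),
          new IntWritable(Integer.parseInt(temp.trim())));
    } catch (NumberFormatException e) {
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}

After the job finishes, these counters appear in its summary grouped under the enum's name.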
Joins: Reduce-side joins are simpler than map-side joins, since the
– input datasets don't have to be structured in any particular way
– they are less efficient, however, as both datasets have to go through the
MapReduce shuffle
Multiple inputs: The input sources for the datasets often have different
formats.
Use the MultipleInputs class to separate the logic for parsing
and tagging each source, as in the sketch below.
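A sketch of what that wiring might look like in the job driver (the input paths and the two mapper classes, NcdcMapper and MetOfficeMapper, are hypothetical placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class JoinJobSetup {
  // Each source gets its own input format and its own mapper, keeping
  // per-source parsing and tagging logic separate.
  static void configureInputs(Job job) {
    MultipleInputs.addInputPath(job, new Path("ncdc/input"),
        TextInputFormat.class, NcdcMapper.class);       // hypothetical mapper
    MultipleInputs.addInputPath(job, new Path("metoffice/input"),
        TextInputFormat.class, MetOfficeMapper.class);  // hypothetical mapper
  }
}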
IntSumReducer, LongSumReducer: reducers that sum integer (or long) values
to produce a total for every key.
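For instance, a driver for a word-count style job could plug the library reducer in directly (a hedged sketch; it assumes a mapper that emits (word, 1) pairs):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountSetup {
  // IntSumReducer sums all IntWritable values per key, giving per-word totals.
  static void configureReduce(Job job) {
    job.setReducerClass(IntSumReducer.class);
    job.setCombinerClass(IntSumReducer.class); // summing is associative, so it doubles as a combiner
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}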