
Map Reduce and Hadoop

MapReduce: Insight
■ Consider the problem of counting the number of occurrences of each word in a large collection of documents
■ How would you do it in parallel?
■ Solution:
❑ Divide the documents among workers
❑ Each worker parses its documents to find all words and outputs (word, count) pairs
❑ Partition the (word, count) pairs across workers based on the word
❑ For each word at a worker, locally add up the counts
MapReduce Example - WordCount

■ Hadoop MapReduce is an implementation of MapReduce
❑ MapReduce is a computing paradigm (from Google)
❑ Hadoop MapReduce is an open-source software implementation of that paradigm
MapReduce Programming Model
■ Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
■ Input: a set of key/value pairs
■ The user supplies two functions:
❑ map(k,v) 🡪 list(k1,v1)
❑ reduce(k1, list(v1)) 🡪 v2
■ (k1,v1) is an intermediate key/value pair
■ The output is the set of (k1,v2) pairs
MapReduce: The Map Step
[Figure: the map step. Each input key-value pair (k1,v1) … (kn,vn) is passed through map to produce intermediate key-value pairs. E.g., the input pairs are (doc-id, doc-content) and the intermediate pairs are (word, wordcount-in-a-doc).]
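For concreteness, here is a minimal sketch of the map step of WordCount written against the standard Hadoop Java MapReduce API (the class name WordCountMapper is just an example):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map step: for every word in an input line, emit the intermediate pair (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate (word, count-in-doc) pair
            }
        }
    }
}
```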
MapReduce: The Reduce Step
[Figure: the reduce step. Intermediate key-value pairs are grouped by key into key-value groups (~ SQL GROUP BY), and reduce turns each group into an output key-value pair (~ SQL aggregation). E.g., intermediate (word, wordcount-in-a-doc) pairs are grouped into (word, list-of-counts) and reduced to (word, final-wordcount).]
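Correspondingly, a minimal sketch of the reduce step in the Hadoop Java API: it receives each word together with all of its per-document counts and sums them.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce step: sum all counts that were grouped under the same word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // output (word, final-wordcount) pair
    }
}
```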
MapReduce: Execution overview
[Figure: distributed execution overview. The user program forks a master and several workers. The master assigns map and reduce tasks to the workers. Map workers read input splits from the distributed file system, apply map, and write intermediate results to local disk; reduce workers read those results remotely, sort them, apply reduce, and write the output files (Output File 0, Output File 1, …).]
Map Reduce vs. Parallel Databases

■ Map Reduce is widely used for parallel processing
❑ Google, Yahoo, and hundreds of other companies
❑ Example uses: compute PageRank, build keyword indices, do data analysis of web click logs, …
■ Database people say: but parallel databases have been doing this for decades
■ Map Reduce people say:
❑ we operate at scales of 1000's of machines
❑ we handle failures seamlessly
❑ we allow procedural code in map and reduce and allow data of any type
Isolated Tasks
▪ MapReduce divides the workload into multiple independent tasks and schedules them across cluster nodes
▪ The work performed by each task is done in isolation from the other tasks
▪ The amount of communication that tasks can perform is deliberately limited, mainly for scalability reasons
▪ The communication overhead required to keep the data on the nodes synchronized at all times would prevent the model from performing reliably and efficiently at large scale
Data Distribution
▪ In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in
▪ An underlying distributed file system (e.g., HDFS) splits large data files into chunks which are managed by different nodes in the cluster

[Figure: a large input file is split into chunks, with one chunk of input data placed on each of Node 1, Node 2, and Node 3.]

▪ Even though the file chunks are distributed across several machines, they form a single namespace
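As an illustration of this chunk placement, the HDFS Java API can report which nodes hold the blocks of a file; the path below is only an example value.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list which cluster nodes host the chunks (blocks) of a large HDFS file.
public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/large-file.txt"); // example path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosts: " + String.join(", ", block.getHosts()));
        }
    }
}
```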
MapReduce: A Bird’s-Eye View
▪ In MapReduce, chunks are processed in isolation by tasks called Mappers
▪ The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers
▪ The process of bringing together IOs into a set of Reducers is known as shuffling
▪ The Reducers produce the final outputs (FOs)
▪ Overall, MapReduce breaks the data flow into two phases, the map phase and the reduce phase

[Figure: map phase - chunks C0–C3 are processed by mappers M0–M3, producing intermediate outputs IO0–IO3; shuffling moves the IOs into the reduce phase, where reducers R0 and R1 produce final outputs FO0 and FO1.]
Keys and Values
▪ The programmer in MapReduce has to specify two functions, the map function and the reduce function, that implement the Mapper and the Reducer in a MapReduce program
▪ In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
▪ The map and reduce functions receive and emit (K, V) pairs

[Figure: input splits of (K, V) pairs are fed to the Map function, which emits intermediate (K', V') pairs; the Reduce function turns these into final (K'', V'') pairs.]
Partitions
▪ In MapReduce, intermediate output values are not usually reduced together
▪ All values with the same key are presented to a single Reducer together
▪ More specifically, a different subset of the intermediate key space is assigned to each Reducer
▪ These subsets are known as partitions

[Figure: different colors represent different keys (potentially) from different Mappers; the partitions are the input to the Reducers.]
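A sketch of how a partition is chosen: Hadoop's default HashPartitioner hashes the intermediate key and takes it modulo the number of Reducers, which a custom Partitioner can also spell out explicitly, as below.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All pairs with the same word hash to the same partition, so they reach the same Reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is a valid partition index in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```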


MapReduce
• A MapReduce job consists of two phases:
• Map: In the Map phase, data is read from a distributed file system and partitioned among a set of computing nodes in the cluster. The data is sent to the nodes as a set of key-value pairs. The Map tasks process the input records independently of each other and produce intermediate results as key-value pairs. The intermediate results are stored on the local disk of the node running the Map task.
• Reduce: When all the Map tasks are completed, the Reduce phase begins, in which the intermediate data with the same key is aggregated.

• Optional Combine Task
• An optional Combine task can be used to aggregate the mapper's intermediate output for records with the same key before that output is transferred to the Reduce task.
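A driver sketch that wires the two phases together; for WordCount the optional Combine task can simply reuse the reducer class. The class names refer to the mapper and reducer sketches shown earlier.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configure the Map, optional Combine, and Reduce tasks and submit the job.
public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional Combine task, pre-aggregates map output
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```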

MapReduce Job Execution Workflow
• MapReduce job execution starts when the client applications submit jobs to the JobTracker.
• The JobTracker returns a JobID to the client application. The JobTracker talks to the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at or near the data.
• The TaskTrackers periodically send heartbeat messages to the JobTracker to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster new work can be delegated.

MapReduce Job Execution Workflow

• The JobTracker submits the work to the TaskTracker nodes when they poll for tasks. To choose a task for a TaskTracker, the JobTracker uses various scheduling algorithms (the default is FIFO).
• The TaskTracker nodes are monitored using the heartbeat signals that the TaskTrackers send to the JobTracker.
• The TaskTracker spawns a separate JVM process for each task so that any task failure does not bring down the TaskTracker.
• The TaskTracker monitors these spawned processes while capturing the output and exit codes. When a process finishes, successfully or not, the TaskTracker notifies the JobTracker. When the job is completed, the JobTracker updates its status.

Simple Observation
The JobTracker does a lot.
Disappearance of the Centralized JobTracker
YARN - Yet Another Resource Negotiator
ResourceManager and ApplicationMaster
What is Yarn?
● YARN (Yet Another Resource Negotiator) is the cluster resource management system for Hadoop.
● YARN was introduced in Hadoop 2 to improve the MapReduce implementation and eliminate the scheduling disadvantages of Hadoop 1.
● YARN is the connecting link between high-level applications (Spark, HBase, etc.) and the low-level Hadoop environment.
● With the introduction of YARN, Hadoop transformed from a MapReduce-only framework into a general big data processing core.
● Some people have characterized YARN as a large-scale, distributed operating system for big data applications.

Anatomy of Yarn
YARN provides its core services via two types of long-running daemon:

● Resource Manager (one per cluster) is responsible for tracking the resources in the cluster and scheduling applications. It responds to resource requests from the ApplicationMaster (one per YARN application) by asking the Node Managers for containers. It does not monitor or collect any job history; it is only responsible for cluster scheduling. The ResourceManager is a single point of failure, but Hadoop 2 supports High Availability, which can restore the ResourceManager's state in case of failures.
● Node Manager (one per node) is responsible for monitoring the node's and containers' (the slot analogue of MapReduce 1) resources such as CPU, memory, disk space, network, etc. It also collects log data and reports that information to the ResourceManager.

Anatomy of Yarn
Another important component of YARN is the ApplicationMaster:
● The ApplicationMaster runs in a separate container process on a slave node.
● There is one instance per application, unlike the JobTracker, which was a single daemon that ran on a master node, tracked the progress of all applications, and was therefore a single point of failure.
● It is responsible for sending heartbeat messages to the ResourceManager with its status and the state of the application's resource needs.
● Hadoop 2 supports uber (lightweight) tasks, which the ApplicationMaster can run on its own node without spending time on container allocation.
● An ApplicationMaster has to be implemented for each YARN application type; for MapReduce it is designed to execute map and reduce tasks.

Anatomy of Yarn
Steps to run a YARN application (see the sketch below):

1. The client sends a request to the ResourceManager to run the application.
2. The ResourceManager asks a NodeManager to allocate a container for the ApplicationMaster instance on an available node (one that has enough resources).
3. Once the ApplicationMaster instance is running, it sends requests (heartbeats, the application's resource needs, etc.) to the ResourceManager itself and manages the application.
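As an illustration of step 1, a client could use the YARN client API roughly as follows; the application name, queue, and resource sizes are illustrative values, and the details of launching the ApplicationMaster are omitted.

```java
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: a client asks the ResourceManager to run an application (step 1).
public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("example-app");                    // illustrative name
        context.setResource(Resource.newInstance(1024, 1));           // 1 GB, 1 vcore for the AM container
        context.setQueue("default");
        // A ContainerLaunchContext describing how to start the ApplicationMaster
        // (command, environment, local resources) would also be set here.

        yarnClient.submitApplication(context); // steps 2-3 are then driven by YARN
    }
}
```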

Yarn Application
● Hadoop 2 has a lot of high-level applications written on top of YARN using the YARN APIs, for example Spark, HBase, etc. MapReduce has also become just another application on top of YARN.
● YARN and its application concept bring flexibility and a lot of new opportunities to the Hadoop environment.

MapReduce 2.0 - YARN
• In Hadoop 2.0 the original processing engine of Hadoop (MapReduce) has been separated from the resource management (which is now part of YARN).

• This makes YARN effectively an operating system for Hadoop that supports different processing engines on a Hadoop cluster, such as MapReduce for batch processing, Apache Tez for interactive queries, Apache Storm for stream processing, etc.

• The YARN architecture divides the two major functions of the JobTracker - resource management and job life-cycle management - into separate components:
• ResourceManager
• ApplicationMaster

YARN Components
• Resource Manager (RM): The RM manages the global assignment of compute resources to applications. It consists of two main services:
• Scheduler: The Scheduler is a pluggable service that manages and enforces the resource scheduling policy in the cluster.
• Applications Manager (AsM): The AsM manages the running Application Masters in the cluster. It is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.

• Application Master (AM): A per-application AM manages the application's life cycle. The AM is responsible for negotiating resources from the RM and working with the NMs to execute and monitor the tasks.
• Node Manager (NM): A per-machine NM manages the user processes on that machine.
• Containers: A Container is a bundle of resources allocated by the RM (memory, CPU, network, etc.). A container is a conceptual entity that grants an application the privilege to use a certain amount of resources on a given machine to run a component task.
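To make the RM/AM/container relationship concrete, here is a rough sketch of an ApplicationMaster registering with the ResourceManager and requesting one container via the AMRMClient API; the memory and vcore values are illustrative.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: an ApplicationMaster registers with the RM and asks for one container.
public class RequestContainer {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL

        Resource capability = Resource.newInstance(2048, 2); // 2 GB memory, 2 vcores
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
        // The RM's Scheduler grants containers in subsequent allocate() heartbeats.
    }
}
```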
Application Submission in YARN
■ https://www.slideshare.net/cloudera/introduction-to-yarn-and-mapreduce-2
ResourceManager Components
Hadoop Schedulers
• The Hadoop scheduler is a pluggable component, which makes it possible to plug in different scheduling algorithms.
• The default scheduler in Hadoop is FIFO.
• Two advanced schedulers are also available - the Fair Scheduler, developed at Facebook, and the Capacity Scheduler, developed at Yahoo.
• The pluggable scheduler framework provides the flexibility to support a variety of workloads with varying priority and performance constraints.
• Efficient job scheduling makes Hadoop a multi-tasking system that can process multiple data sets for multiple jobs for multiple users simultaneously.
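As a small illustration of this pluggability, in YARN the scheduler is selected through a single configuration property (normally set in yarn-site.xml; the programmatic form below is only for illustration):

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: the scheduler class is just a configuration value that the
// ResourceManager reads at startup; swapping schedulers means changing this property.
public class SchedulerConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        // The Capacity Scheduler would be
        // org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```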
FIFO Scheduler

• FIFO is the default scheduler in Hadoop; it maintains a work queue in which the jobs are queued.
• The scheduler pulls jobs in first-in, first-out order (oldest job first) for scheduling.
• There is no concept of job priority or job size in the FIFO scheduler.

Fair Scheduler
• The Fair Scheduler allocates resources evenly between multiple jobs and also provides capacity guarantees.
• The Fair Scheduler assigns resources to jobs such that each job gets an equal share of the available resources on average over time.
• Task slots that are free are assigned to new jobs, so that each job gets roughly the same amount of CPU time.
• Job Pools
• The Fair Scheduler maintains a set of pools into which jobs are placed. Each pool has a guaranteed capacity.
• When there is a single job running, all the resources are assigned to that job. When there are multiple jobs in the pools, each pool gets at least as many task slots as guaranteed.
• Each pool receives at least its minimum share.
• When a pool does not require its guaranteed share, the excess capacity is split between the other jobs.
• Fairness
• The scheduler periodically computes the difference between the computing time received by each job and the time it should have received under ideal scheduling.
• The job with the highest deficit of compute time is scheduled next (see the sketch below).
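The deficit-based selection can be sketched as follows; this is illustrative pseudocode with hypothetical types, not the actual Fair Scheduler implementation.

```java
import java.util.Comparator;
import java.util.List;

// Sketch: pick the job with the largest deficit between its ideal fair share of
// compute time and the compute time it has actually received.
public class FairShareSketch {
    record JobStats(String id, long idealShareMs, long receivedMs) {
        long deficit() { return idealShareMs - receivedMs; }
    }

    static JobStats pickNext(List<JobStats> jobs) {
        return jobs.stream()
                   .max(Comparator.comparingLong(JobStats::deficit))
                   .orElseThrow();
    }

    public static void main(String[] args) {
        List<JobStats> jobs = List.of(
                new JobStats("job-1", 4000, 1000),  // deficit 3000 ms
                new JobStats("job-2", 4000, 3500)); // deficit 500 ms
        System.out.println("next: " + pickNext(jobs).id()); // job-1 has the larger deficit
    }
}
```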

Capacity Scheduler
■ The Capacity Scheduler has similar functionality to the Fair Scheduler but adopts a different scheduling philosophy.
■ Queues
• In the Capacity Scheduler, you define a number of named queues, each with a configurable number of map and reduce slots.
• Each queue is also assigned a guaranteed capacity.
• The Capacity Scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. Within each queue, FIFO scheduling with priority is used.
■ Fairness
• For fairness, it is possible to place a limit on the percentage of running tasks per user, so that users share the cluster equally.
• A wait time can be configured for each queue. When a queue is not scheduled for more than the wait time, it can preempt tasks of other queues to get its fair share.

Scheduling in Yarn
● Scheduling is an important task when a Hadoop cluster is shared between many users and jobs. The problem is which user's task should be run first, or which job should be run first, the big one or the small one.
● Hadoop 1, which used a FIFO scheduler with a slot-based model (fixed CPU, memory, and disk counts), was very inefficient for shared clusters. Hadoop 2 introduced the Capacity (by Yahoo) and Fair (by Facebook) schedulers with a container-based allocation model (dynamic CPU, memory, and disk counts) as part of YARN, which is more efficient than the FIFO scheduler.
● YARN supports three schedulers, FIFO, Capacity, and Fair, which we consider in the next slides. It also provides the opportunity to create your own scheduler by extending the Fair Scheduler class.

Scheduling in Yarn
FIFO - First In, First Out:

● the simplest and most understandable scheduler
● it doesn't need any configuration
● but it's not suitable for shared clusters, because big applications would eat all the resources

Scheduling in Yarn
■ Capacity Scheduler - allows sharing of a Hadoop cluster along organizational lines (each organization gets a queue). Queues may be further divided in a hierarchical fashion.
● each organization is allocated a certain capacity of the overall cluster.
● if there is more than one job, the Capacity Scheduler may allocate the spare resources to jobs in the queue, even if that causes the queue's capacity to be exceeded.
Scheduling in Yarn
The Fair Scheduler dynamically balances resources evenly between all running jobs. There is also a queue hierarchy for organisations.

● If the queue policy is not configured, it is Fair (an even 50/50%, i.e., 1:1, split)
● Preemption allows the scheduler to kill containers of queues that are running with more than their fair share of resources
● Delay scheduling lets the scheduler wait briefly so that a container can be allocated on a node where the application's data is local
● Dominant Resource Fairness (DRF) determines each application's share by its dominant resource (the resource of which it needs the largest proportion)

