MapReduce: Insight
■ Consider the problem of counting the number of
occurrences of each word in a large collection of
documents
■ How would you do it in parallel?
■ Solution:
❑ Divide documents among workers
❑ Each worker parses its documents to find all words and outputs
(word, count) pairs
❑ Partition the (word, count) pairs across workers based on the
word
❑ For each word at a worker, locally add up counts
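The four steps above can be sketched on a single machine. This is a minimal simulation, not Hadoop code: the "workers" are just hash partitions, and the function name is my own.

```python
from collections import defaultdict

def word_count_parallel(documents, num_workers=3):
    """Simulate the parallel word-count solution above."""
    # Steps 1-2: each worker parses its documents and emits (word, 1) pairs.
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    # Step 3: partition pairs across workers based on the word,
    # so every pair for a given word lands at the same worker.
    partitions = [defaultdict(int) for _ in range(num_workers)]
    for word, count in pairs:
        # Step 4: each worker locally adds up the counts it receives.
        partitions[hash(word) % num_workers][word] += count
    # Merge the per-worker results for display.
    totals = {}
    for part in partitions:
        totals.update(part)
    return totals

docs = ["the cat sat", "the dog sat"]
counts = word_count_parallel(docs)
```

Because the partition function sends all pairs for one word to one worker, the local sums are already the global sums; no second aggregation pass is needed.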
MapReduce Example - WordCount
[Diagram: n mappers each emit intermediate (k, v) pairs, which are grouped by key and passed to the reducers]
Map Phase
▪ The outputs from the mappers are denoted as
intermediate outputs (IOs) and are brought to the Reducers
Reduce Phase
▪ The process of transferring the IOs from the mappers to the
Reducers is known as the shuffling process
▪ The Reducers produce the final outputs (FOs)
[Diagram: mappers M0–M3 → shuffled data → Reducers R0, R1 → final outputs FO0, FO1]
▪ Overall, MapReduce breaks the data flow into two phases,
map phase and reduce phase
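The shuffling step between the two phases can be sketched as a routing function. This is an illustrative simulation (the function name and two-reducer setup are my own, echoing the R0/R1 diagram), not the actual Hadoop shuffle implementation.

```python
from collections import defaultdict

def shuffle(intermediate_outputs, num_reducers=2):
    """Route each intermediate (key, value) pair to a reducer partition
    so that all values for one key end up at the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in intermediate_outputs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions

# Intermediate outputs from the map phase:
ios = [("cat", 1), ("dog", 1), ("cat", 1)]
r0, r1 = shuffle(ios)
# Every key now appears in exactly one reducer's partition,
# with all of its values collected into one list.
```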
Keys and Values
▪ The programmer in MapReduce has to specify two functions, the
map function and the reduce function, which implement the Mapper
and the Reducer in a MapReduce program
▪ The map and reduce functions receive and emit (K, V) pairs
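For WordCount, the two functions the programmer specifies might look like the sketch below. This follows the conceptual (K, V) signatures described above; the function names are hypothetical, and real Hadoop implementations would subclass Mapper and Reducer in Java instead.

```python
def map_fn(key, value):
    # key: document name (unused here), value: one line of document text.
    # Emit an intermediate (word, 1) pair for every word.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word.
    # Emit the final (word, total) pair.
    yield (key, sum(values))

list(map_fn("doc1", "a b a"))   # intermediate pairs for one input
list(reduce_fn("a", [1, 1]))    # final pair for one key
```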
• The JobTracker submits the work to the TaskTracker nodes when they poll for
tasks. To choose a task for a TaskTracker, the JobTracker uses various scheduling
algorithms (default is FIFO).
• The TaskTracker nodes are monitored using the heartbeat signals that the
TaskTrackers send to the JobTracker.
• The TaskTracker spawns a separate JVM process for each task so that any task
failure does not bring down the TaskTracker.
• The TaskTracker monitors these spawned processes while capturing the output
and exit codes. When the process finishes, successfully or not, the TaskTracker
notifies the JobTracker. When the job is completed, the JobTracker updates its
status.
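The isolation idea above, one separate process per task so a task failure cannot bring down the TaskTracker itself, can be sketched with child processes. This is an analogy only: a real TaskTracker spawns JVMs, not Python interpreters, and `run_task` is a hypothetical helper.

```python
import subprocess
import sys

def run_task(code):
    """Run one task in a separate process (standing in for the separate
    JVM a TaskTracker spawns), capturing its output and exit code."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True,
    )
    return result.returncode, result.stdout

# A successful task reports exit code 0 with its output captured.
ok_code, ok_out = run_task("print('done')")
# A crashing task kills only its own process; the parent keeps running
# and can report the nonzero exit code upward.
bad_code, _ = run_task("raise RuntimeError('task failed')")
```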
© Vigen Sahakyan
Anatomy of Yarn
YARN provides its core services via two types of long-running daemon:
●Resource Manager (one per cluster) is responsible for tracking the resources in a
cluster and for scheduling applications. It responds to resource requests from the
ApplicationMaster (one per Yarn application) by asking Node Managers to launch
containers. It doesn't monitor or collect any job history; it is responsible only for
cluster scheduling. The ResourceManager is a single point of failure, but Hadoop2
supports High Availability, which can restore the ResourceManager state in case of failures.
●Node Manager (one per node) is responsible for monitoring its node's and containers'
(the analogue of slots in MapReduce 1) resources, such as CPU, memory, disk space,
and network. It also collects log data and reports that information to the
ResourceManager.
Anatomy of Yarn
Another important component of Yarn is the ApplicationMaster
●The ApplicationMaster runs in a separate container process on a slave node.
●It has one instance per application, unlike the JobTracker, which was a single daemon
that ran on a master node, tracked the progress of all applications, and was a single
point of failure.
●It is responsible for sending heartbeat messages to the ResourceManager with its status
and the state of the application's resource needs.
●Hadoop2 supports uber (lightweight) tasks, which the ApplicationMaster can run on its
own node without spending time on container allocation.
●An ApplicationMaster has to be implemented for each Yarn application type; for
MapReduce it is designed to execute map and reduce tasks.
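The heartbeat contents described above, the ApplicationMaster's status plus its outstanding resource needs, can be sketched as a small data structure. The class, field names, and helper below are all hypothetical, not YARN's actual protocol types.

```python
import dataclasses

@dataclasses.dataclass
class Heartbeat:
    """What an ApplicationMaster might report to the ResourceManager:
    its own status plus the application's remaining resource needs."""
    app_id: str
    progress: float            # fraction of tasks completed
    requested_containers: int  # containers still needed

def next_heartbeat(app_id, done, total, allocated):
    # Hypothetical helper: derive the next heartbeat from task counts.
    outstanding = max(total - done - allocated, 0)
    return Heartbeat(app_id, done / total, outstanding)

# 3 of 10 tasks done, 2 containers already allocated -> 5 still needed.
hb = next_heartbeat("app_001", done=3, total=10, allocated=2)
```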
Anatomy of Yarn
Steps to run a Yarn application:
[Diagram: the client submits the application to the ResourceManager, which allocates a container for the ApplicationMaster; the ApplicationMaster then requests further containers from the ResourceManager and launches its tasks via the Node Managers]
Yarn Application
● Hadoop 2 has a lot of high-level applications that were written on top of
Yarn using the Yarn APIs, for example Spark, HBase, etc. MapReduce has
also become an application on top of Yarn.
● Yarn and its application concept bring flexibility and a lot of new
opportunities to the Hadoop environment.
MapReduce 2.0 - YARN
• In Hadoop 2.0 the original processing
engine of Hadoop (MapReduce) has been
separated from the resource management
(which is now part of YARN).
Scheduling in Yarn
FIFO - First In, First Out:
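FIFO scheduling can be sketched with a plain queue. This is an illustrative model (class and method names are my own), not YARN's FIFO scheduler code.

```python
from collections import deque

class FifoScheduler:
    """Minimal FIFO sketch: jobs are served strictly in arrival order,
    so a large early job delays every job submitted after it."""

    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)          # new jobs join the back

    def next_job(self):
        # The scheduler always picks the oldest waiting job.
        return self.queue.popleft() if self.queue else None

s = FifoScheduler()
s.submit("big-report")
s.submit("small-query")   # must wait behind "big-report"
```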
Scheduling in Yarn
■Capacity scheduler - allows sharing of a
Hadoop cluster along organizational lines (each
organization gets a queue). Queues may be further divided
in hierarchical fashion.
●each organization is allocated a certain
capacity of the overall cluster.
●if there is more than one job in the queue and there are
spare resources available, the Capacity Scheduler may allocate the
spare resources to jobs in the queue, even if that causes the
queue's capacity to be exceeded (this is known as queue elasticity).
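The capacity-plus-spare behaviour can be sketched numerically: each queue first gets up to its guaranteed share, and leftover capacity is then handed to queues that still have demand. This is a simplified model of the policy described above (the function and its parameters are my own), not the real scheduler's algorithm.

```python
def allocate(total, guaranteed, demand):
    """Sketch of Capacity Scheduler behaviour.
    guaranteed: each queue's configured fraction of the cluster.
    demand: resources each queue currently asks for.
    A queue may borrow idle capacity beyond its guarantee."""
    shares = {}
    # First pass: every queue gets at most its guaranteed capacity.
    for q in guaranteed:
        shares[q] = min(demand[q], round(total * guaranteed[q]))
    # Second pass: hand spare capacity to queues that still want more.
    spare = total - sum(shares.values())
    for q in guaranteed:
        extra = min(demand[q] - shares[q], spare)
        shares[q] += extra
        spare -= extra
    return shares

# Queue "a" is guaranteed 60 of 100 but can grow to 80 while "b" is idle.
allocate(100, {"a": 0.6, "b": 0.4}, {"a": 80, "b": 10})
```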
Scheduling in Yarn
Fair Scheduler dynamically balances resources evenly between all running jobs.
There is also a queue hierarchy for organizations.
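The even-split idea can be sketched as repeatedly dividing the remaining capacity among the jobs that still want more, so capacity a small job cannot use flows to the others. This is a toy model of fair sharing (function name and integer units are my own), not the Fair Scheduler's implementation.

```python
def fair_shares(total, demands):
    """Split `total` capacity evenly across jobs; capacity a job
    cannot use is redistributed to the remaining jobs."""
    shares = {job: 0 for job in demands}
    remaining = total
    active = dict(demands)          # jobs that still want resources
    while remaining > 0 and active:
        even = remaining // len(active)   # even slice per active job
        if even == 0:
            break
        for job in list(active):
            take = min(even, active[job])  # a job never takes more than it needs
            shares[job] += take
            active[job] -= take
            remaining -= take
            if active[job] == 0:
                del active[job]            # satisfied jobs drop out
    return shares

# Job "b" only needs 20, so its unused slice is split between "a" and "c".
fair_shares(100, {"a": 70, "b": 20, "c": 50})
```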