
CH-3: Pillars of Hadoop – HDFS,

MapReduce, and YARN


Introduction to HDFS
HDFS is the default storage file system in Hadoop. It is distributed, simple in design, highly scalable and flexible, and has strong fault tolerance. HDFS follows a master-slave architecture, which allows the slave nodes to be managed and utilized efficiently.
HDFS has self-healing processes and speculative execution, which make the system fault tolerant; nodes can be added or removed easily, which increases scalability while preserving reliability.
HDFS is designed to be well suited for MapReduce programming.
A key assumption in HDFS is that moving computation is cheaper than moving data; this concept is known as data locality.
HDFS stores multiple copies of data on different nodes.
A file is split up into blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x) and stored in a distributed fashion across multiple machines.
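The block-splitting arithmetic can be sketched in a few lines of plain Python (an illustration only, not HDFS code; the file and block sizes are hypothetical):

```python
import math

def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return the number of HDFS-style blocks needed and the size of the last block."""
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    last_block = file_size_bytes - (num_blocks - 1) * block_size_bytes
    return num_blocks, last_block

# A 300 MB file with 128 MB blocks -> 3 blocks (128 MB + 128 MB + 44 MB)
blocks, last = split_into_blocks(300 * 1024 * 1024)
print(blocks, last // (1024 * 1024))  # 3 44
```

Note that the last block only occupies as much space as the remaining data, not a full block.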
Data Storage in HDFS
HDFS architecture

 HDFS is managed by the following daemon processes:
 NameNode: Master process
 DataNode: Slave process
 Checkpoint NameNode or Secondary NameNode: Checkpoint process
 BackupNode: Backup NameNode


HDFS architecture (contd.)
NameNode
NameNode is the master daemon process in HDFS that coordinates all storage-related operations in Hadoop, including reads and writes in HDFS.
 NameNode manages the filesystem namespace. It holds the metadata about all the file blocks and records which DataNodes in the cluster hold each block.
 NameNode does not store the actual data. It keeps the metadata cached in RAM for faster access, so it requires a system with high RAM; otherwise, NameNode can become a bottleneck in cluster processing.
DataNode
DataNode holds the actual data in HDFS and is also responsible for
creating, deleting, and replicating data blocks, as assigned by NameNode.
 DataNode sends periodic messages to NameNode, called heartbeats.
 If a DataNode fails to send heartbeat messages, NameNode marks it as a dead node.
 If the number of replicas of a block falls below the replication factor, NameNode replicates that block's data to other DataNodes.
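The heartbeat/dead-node logic described above can be sketched as a toy Python class (the names and the 30-second timeout are invented for illustration; this is not Hadoop source code):

```python
HEARTBEAT_TIMEOUT = 30  # seconds; a hypothetical timeout, not HDFS's real value

class ToyNameNode:
    def __init__(self):
        self.last_heartbeat = {}          # DataNode id -> time of last heartbeat

    def receive_heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def dead_nodes(self, now):
        # Any DataNode silent longer than the timeout is considered dead
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

nn = ToyNameNode()
nn.receive_heartbeat("dn1", now=0)
nn.receive_heartbeat("dn2", now=20)
print(nn.dead_nodes(now=40))  # ['dn1'] -- dn1 has been silent for 40s > 30s
```

Once a node is marked dead, NameNode would schedule re-replication of the blocks that node held.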
HDFS Architecture

Image source: http://yoyoclouds.files.wordpress.com/2011/12/hadoop_arch.png.


HDFS Commands
• To check the Hadoop version
$ hadoop version
• To create a folder in HDFS
$ hdfs dfs -mkdir /user/maria_dev/sample2
• To list the contents of a folder
$ hdfs dfs -ls /user/maria_dev/sample2
• To copy a file from the local filesystem to HDFS
$ hdfs dfs -put test.txt /user/maria_dev/sample2
• To copy files from HDFS to the local filesystem
$ hdfs dfs -get /user/maria_dev/sample2/
• To see the contents of a file
$ hdfs dfs -cat /user/maria_dev/sample2/test.txt
• Other commands include -touchz, -copyFromLocal, -copyToLocal, and -cp
MapReduce

 MapReduce can process a large volume of data in parallel by dividing a task into independent sub-tasks. MapReduce also has a master-slave architecture.
 The input and output of a MapReduce job, and even the intermediate output, are in the form of <Key, Value> pairs.
Map Abstract
 Takes a key/value pair as input
Key is a reference to the input value.
Value is the data set on which to operate.
 Processing
Function defined by the user
Applied to every value in the input
Produces a new list of key/value pairs
 Output of Map is called intermediate output
Can be of a different type from the input pair
Output is stored on local disk
Reduce Abstract
 Takes intermediate key/value pairs as input
Generated by Map
Key/value pairs are sorted by key
 Processing
Function defined by the user
An iterator supplies the values for a given key to the reduce function.
 Produces the final list of key/value pairs
Output of Reduce is called the final output
Can be of a different type from the input pair
Output is stored in HDFS
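The Map → Shuffle/Sort → Reduce flow above can be sketched as a minimal in-memory Python helper (plain Python for illustration, not a Hadoop API), shown here on the classic word-count example:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map phase: each record may emit several (key, value) pairs
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle and sort: group the intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per key, fed an iterator of its values
    return {k: reducer(k, (v for _, v in grp))
            for k, grp in groupby(pairs, key=itemgetter(0))}

def wc_mapper(line):
    for word in line.split():
        yield word, 1          # intermediate output: (word, 1)

def wc_reducer(word, counts):
    return sum(counts)         # final output: (word, total count)

print(run_mapreduce(["a b a", "b a"], wc_mapper, wc_reducer))  # {'a': 3, 'b': 2}
```

In real Hadoop the map outputs go to local disk and the reduce outputs to HDFS; here everything stays in memory to keep the shape of the computation visible.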
The MapReduce architecture

 MapReduce architecture has the following two daemon processes:
 JobTracker: Master process
 TaskTracker: Slave process
JobTracker
 JobTracker is the master coordinator daemon process that is
responsible for coordinating and completing a MapReduce job in
Hadoop. The primary functions of JobTracker are resource
management, tracking resource availability, and managing the task
life cycle. JobTracker identifies the TaskTracker that will perform a
given task and monitors the progress and status of tasks. JobTracker
is a single point of failure for the MapReduce process.
The MapReduce architecture
TaskTracker
 TaskTracker is the slave daemon process that
performs the tasks assigned by JobTracker.
TaskTracker periodically sends heartbeat messages
to JobTracker to report its free slots, update the
status of its running tasks, and check whether any
task is waiting to be performed.
MapReduce Detailed View
MapReduce Example
Example 2: MapReduce
Assume the following table has information about students who repeat a course. Apply
MapReduce to find the average mark for each student.

StudentID   Marks   Semester
16J1211     70      Sem1 2018-2019
16S1456     80      Sem1 2018-2019
16J8521     70      Sem1 2018-2019
16J1211     66      Sem2 2018-2019
16S1456     60      Sem2 2018-2019
16J8521     80      Sem2 2018-2019
16S1456     77      Sem3 2018-2019
16J8521     90      Sem3 2018-2019

Mapper

16J1211:70 16S1456:80 16J8521:70 16J1211:66 16S1456:60 16J8521:80 16S1456:77 16J8521:90

Shuffle And Sort

16J1211:70,66 16J8521:70,80,90 16S1456:60,77,80

Reducer

16J1211:68 16J8521:80 16S1456:72.3
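The same Mapper → Shuffle → Reducer steps can be traced in plain Python (a sketch of the example above, not Hadoop code):

```python
from collections import defaultdict

# Mapper output: (StudentID, mark) pairs taken from the table
rows = [("16J1211", 70), ("16S1456", 80), ("16J8521", 70),
        ("16J1211", 66), ("16S1456", 60), ("16J8521", 80),
        ("16S1456", 77), ("16J8521", 90)]

# Shuffle and sort: group the marks by StudentID
grouped = defaultdict(list)
for student, mark in rows:
    grouped[student].append(mark)

# Reducer: average per student, rounded to one decimal as in the slide
averages = {s: round(sum(m) / len(m), 1) for s, m in grouped.items()}
print(averages)  # {'16J1211': 68.0, '16S1456': 72.3, '16J8521': 80.0}
```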


Example 3: MapReduce
Assume the following table has information about customer ratings for a service.
Apply MapReduce to find the number of customers for each rating.

CustomerID   ServiceID   Ratings
CA111        10          3
CA121        10          3
CA311        10          2
CA422        10          3
CA111        12          4
CA121        12          4
CA311        12          2
CA111        15          1
CA121        15          1
CA311        15          1

Mapper

3:1 3:1 2:1 3:1 4:1 4:1 2:1 1:1 1:1 1:1

Shuffle And Sort

1: 1, 1,1 2:1, 1 3:1, 1, 1 4:1, 1

Reducer

1:3 2:2 3:3 4:2
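This example is a pure counting job, so the shuffle-plus-reduce step collapses to a counter in plain Python (again a sketch, not Hadoop code):

```python
from collections import Counter

# Mapper emits (rating, 1) for every row; here we keep just the ratings
ratings = [3, 3, 2, 3, 4, 4, 2, 1, 1, 1]

# Counter plays the role of shuffle + reduce: it sums the 1s per rating key
counts = Counter(ratings)
print(dict(sorted(counts.items())))  # {1: 3, 2: 2, 3: 3, 4: 2}
```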


How Map and Reduce work together
MapReduce Example 4
Sample MapReduce Functions - Python
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line: userID, movieID, rating, timestamp (tab-separated)
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()
MapReduce Example
 100 files with daily temperatures in two cities. Each file has
10,000 entries.
 For example, one file may have (Toronto 20), (New York 30), …
 Our goal is to compute the maximum temperature in the two
cities.
 Assign the task to 100 Map processors, each working on one file.
Each processor outputs a list of key/value pairs, e.g., (Toronto
30), (New York 65), …
 Now we have 100 lists, each with two elements. We give these
lists to two reducers – one for Toronto and another for New
York.
 The reducers produce the final answer: (Toronto 55), (New
York 65)
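The two phases of this example can be traced in plain Python (a sketch with made-up temperature readings chosen to match the slide's final answer):

```python
# Two small "files" of (city, temperature) records; real files would be larger
file1 = [("Toronto", 20), ("New York", 30), ("Toronto", 55)]
file2 = [("New York", 65), ("Toronto", 30), ("New York", 12)]

# Map phase: one mapper per file, each emitting its per-city maxima
def local_max(records):
    out = {}
    for city, temp in records:
        out[city] = max(out.get(city, temp), temp)
    return out

partials = [local_max(f) for f in (file1, file2)]

# Reduce phase: one reducer per city combines the partial maxima
final = {}
for part in partials:
    for city, temp in part.items():
        final[city] = max(final.get(city, temp), temp)
print(final)  # {'Toronto': 55, 'New York': 65}
```

Each mapper shrinks its 10,000-entry file to at most two pairs, so the reducers only combine the small per-file maxima rather than the raw data.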
Hadoop MapReduce
 is highly scalable, fault tolerant, and designed to run on
commodity hardware.
 MapReduce architecture has a master JobTracker and multiple
worker TaskTracker processes on the nodes.
 MapReduce jobs are broken into multistep processes: Mapper,
Shuffle, Sort, Reducer, and the auxiliary Combiner and
Partitioner.
 MapReduce jobs need a lot of data transfer, for which Hadoop
uses the Writable and WritableComparable interfaces.
 MapReduce file formats include an InputFormat interface,
RecordReader, OutputFormat, and RecordWriter to improve
processing efficiency.
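The auxiliary Combiner mentioned above can be sketched in plain Python (a toy word-count fragment with invented data, not Hadoop code): it pre-aggregates a mapper's output on the mapper's own node before the shuffle, reducing the data sent over the network.

```python
from collections import Counter

# Raw mapper output on one node: one (word, 1) pair per occurrence
mapper_output = [("hadoop", 1), ("yarn", 1), ("hadoop", 1), ("hadoop", 1)]

# Combiner: merge pairs with the same key locally before shuffling
combined = list(Counter(k for k, _ in mapper_output).items())
print(combined)  # 4 pairs shrink to [('hadoop', 3), ('yarn', 1)]
```

A combiner is only safe when the reduce operation is associative and commutative (like sum or max), since it may be applied zero or more times.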
Data Locality
Move computation close to the data, rather than data to the computation.
 This minimizes network congestion and increases the overall throughput of the system.
YARN
YARN
YARN (Yet Another Resource Negotiator) is a distributed resource
manager that manages and runs different applications on top of
Hadoop. It provides much-needed enhancements to the
MapReduce framework, making Hadoop more available, scalable,
and integrable.
YARN architecture has the following components:
ResourceManager, NodeManager, and ApplicationMaster.
Many applications are built on top of YARN, which has made
Hadoop much more stable and integrable with other applications.
YARN architecture
 YARN architecture has the following three
components:
ResourceManager (RM)
NodeManager (NM)
ApplicationMaster (AM)
Resource Manager
In YARN, ResourceManager is the master process
responsible for resource management among the
applications in the system. ResourceManager has a
scheduler, which only allocates resources to
applications; it learns about resource availability
from the NodeManagers, which report information
such as memory, disk, CPU, and network.
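The scheduler's job can be sketched as a toy Python class (all names and numbers are invented for illustration; this is not the YARN API, and YARN's real schedulers are far more sophisticated):

```python
class ToyScheduler:
    def __init__(self, node_free_mb):
        self.free = dict(node_free_mb)    # node -> free memory in MB

    def allocate(self, mem_mb):
        # Grant the container request on the first node with enough free memory
        for node, free in self.free.items():
            if free >= mem_mb:
                self.free[node] -= mem_mb
                return node
        return None                       # request waits until capacity frees up

sched = ToyScheduler({"node1": 2048, "node2": 4096})
print(sched.allocate(3072))  # node2 -- node1 has only 2048 MB free
print(sched.allocate(2048))  # node1
```

Real YARN schedulers (Capacity and Fair schedulers) additionally handle queues, CPU vcores, priorities, and locality preferences.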
NodeManager
NodeManager is present on every node and is
responsible for containers, authentication, and
monitoring resource usage, and it reports this
information to ResourceManager.
 Similar to TaskTracker, NodeManager sends
heartbeats to ResourceManager.
ApplicationMaster
An ApplicationMaster exists for each application and is
responsible for managing every instance of that application
running within YARN.
ApplicationMaster negotiates resources with ResourceManager
and coordinates with NodeManager to monitor the execution
and resource consumption of containers, such as their CPU and
memory allocations.
Applications powered by YARN
Below are some of the applications that have adapted
YARN to leverage its features and achieve high
availability:
Apache Giraph: Graph processing
Apache Hama: Advanced Analytics
Apache Hadoop MapReduce: Batch processing
Apache Tez: Interactive/Batch on top of Hive
Apache S4: Stream processing
Apache Samza: Stream processing
Apache Storm: Stream processing
Apache Spark: Real-time iterative processing
Hoya: HBase on YARN
References:
 Hadoop Essentials by Shiva Achari, ISBN 978-1-78439-668-8
 ProQuest Ebook Central: http://ebookcentral.proquest.com/lib/momp/detail.action?docID=2039889
