

Hadoop Ecosystem

Why Hadoop
• Distribution
• Durability
• Synchronization
• Fault tolerance
• Scalability
• Consistency
• Data locality

Core Components of Hadoop

HDFS Daemons

NameNode
• maintains the HDFS namespace metadata in RAM
• stores the metadata in an fsimage file on disk
• block locations are kept only in memory, not persisted (see the lookup sketch below)
• does not directly read or write HDFS data
• keeps an edit log of all transactions
• namespace changes made in NameNode RAM are also recorded in the edit log
• on startup, the NameNode enters safe mode, during which the edit log is applied to the fsimage file
• the NameNode is a single point of failure (SPOF) in Hadoop 1.x
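
As a minimal illustration of the metadata role described above, the sketch below uses the standard FileSystem client API to ask the NameNode for a file's block locations; the cluster URI and file path are hypothetical.

```java
// A minimal sketch: file metadata and block locations are served by the
// NameNode through the FileSystem client API. URI and path are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/sample.txt");           // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    // The NameNode answers this from its in-memory namespace; no DataNode I/O happens here.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}
```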

Secondary NameNode
• the edits file could get very large and take up extra disk space
• applying a large edits file at NameNode startup (check-pointing) could take a long time
• makes periodic checkpoints of the edits log into the fsimage file while the NameNode is running (checkpoint tuning sketch below)
• a cluster can have only one NameNode and one Secondary NameNode
• on a production cluster, the Secondary NameNode should be located on a different node than the NameNode

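A minimal sketch of the checkpoint-tuning settings that control how often the Secondary NameNode merges the edits log into the fsimage file; property names follow Hadoop 2.x (Hadoop 1.x used fs.checkpoint.period and fs.checkpoint.size), the values are examples only, and in practice these belong in hdfs-site.xml.

```java
// A minimal checkpoint-tuning sketch; values are examples, not recommendations.
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // Checkpoint at least once per hour...
    conf.setLong("dfs.namenode.checkpoint.period", 3600);
    // ...or sooner, once this many uncheckpointed transactions accumulate in the edit log.
    conf.setLong("dfs.namenode.checkpoint.txns", 1000000);
    return conf;
  }
}
```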
DataNode
• stores the HDFS data
• data is broken into blocks
• blocks are replicated across DataNodes
• a configurable number of failed volumes may be tolerated
• communicates with the NameNode periodically (heartbeats and block reports)
• new block locations for adding blocks to a new file are requested from the NameNode (see the write sketch below)
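
A minimal write sketch, assuming a reachable cluster and a hypothetical output path, showing how a client creates a file with an explicit replication factor and block size; HDFS then places the resulting blocks on DataNodes.

```java
// A minimal sketch of a client write; path and sizes are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path out = new Path("/data/replicated.txt");   // hypothetical output path
    short replication = 3;                         // copies of each block
    long blockSize = 128L * 1024 * 1024;           // 128 MB blocks
    int bufferSize = 4096;

    // The client asks the NameNode for new block locations as the file grows;
    // the actual bytes are streamed to the chosen DataNodes.
    FSDataOutputStream stream =
        fs.create(out, true, bufferSize, replication, blockSize);
    stream.writeBytes("hello hdfs\n");
    stream.close();
    fs.close();
  }
}
```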

HDFS High Availability

Active/Standby NameNodes
• two NameNodes run simultaneously, one active and one standby
• if the active NameNode fails or crashes, control fails over to the standby NameNode automatically (fast failover)
• Two methods to synchronize the NameNodes:
• Quorum-based storage
• Shared storage using NFS
• a Secondary NameNode is not required with HDFS HA

JournalNodes
• to synchronize state between the active NameNode and the standby
NameNode
• keep a journal of edits (modifications) logged by the active
NameNode
• The active NameNode logs each edit to a majority of the JournalNodes in the quorum (configuration sketch below)
• The standby NameNode reads the edits from one of the JournalNodes
and applies them
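
A minimal configuration sketch for the quorum-based storage option; the nameservice id, host names, and ports are hypothetical, and in a real cluster these properties live in hdfs-site.xml.

```java
// A minimal HA + quorum-journal configuration sketch; hosts and ids are examples.
import org.apache.hadoop.conf.Configuration;

public class QuorumJournalHaConfig {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
    // The active NameNode writes edits to a majority of these JournalNodes;
    // the standby tails the same journal and applies the edits.
    conf.set("dfs.namenode.shared.edits.dir",
        "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    return conf;
  }
}
```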

ZooKeeper
• detects failure of the active NameNode
• initiates a failover to the standby NameNode and provides a mechanism for active NameNode election (settings sketch below)
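
A minimal sketch of the two settings that enable ZooKeeper-based automatic failover; the ZooKeeper hosts are hypothetical, and in practice the properties go in hdfs-site.xml and core-site.xml respectively.

```java
// A minimal automatic-failover configuration sketch; hosts are examples.
import org.apache.hadoop.conf.Configuration;

public class AutomaticFailoverConfig {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // Let the ZKFailoverController elect the active NameNode and trigger failover.
    conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
    // ZooKeeper ensemble used for failure detection and active-NameNode election.
    conf.set("ha.zookeeper.quorum",
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
    return conf;
  }
}
```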

Benefits and Limitations of HDFS
Benefits:
• distributed
• block abstraction
• scalable
• fault-tolerant
• data locality
• data coherency
• unstructured data
• commodity hardware
• high availability
• rack-awareness
• NameNode federation

Limitations:
• does not support updates or modifications
• not optimized for random seeks
• local caching of data is not supported

MapReduce Daemons

JobTracker
• accepting job submissions
• initiating a job
• scheduling job tasks
• monitoring the tasks
• relaunching a task if it fails
• schedules extra tasks in parallel for slow-running tasks, called speculative execution (configuration sketch below)
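
A minimal sketch of toggling speculative execution from a job configuration; the property names shown are the classic Hadoop 1.x (JobTracker-era) names, while newer releases use mapreduce.map.speculative and mapreduce.reduce.speculative.

```java
// A minimal speculative-execution toggle sketch; values are illustrative.
import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecutionConfig {
  public static Configuration build() {
    Configuration conf = new Configuration();
    // Allow the JobTracker to launch duplicate attempts for slow map tasks...
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    // ...but not for reduce tasks, which can be expensive to duplicate.
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    return conf;
  }
}
```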

TaskTracker
• has map slots to run map tasks and reduce slots to run reduce tasks
• launches map/reduce tasks in child JVMs
• monitors the task’s progress
• communicates with the JobTracker periodically

YARN Daemons

ResourceManager
• manages the global allocation of compute resources
• starts per application ApplicationMasters
• does not initiate tasks to be launched nor does it handle monitoring
tasks and relaunching failed tasks
• has two main components: Scheduler and ApplicationsManager
• The Scheduler’s function is to allocate resource containers to applications
• ApplicationsManager’s function is to accept job submissions and
launch and monitor ApplicationMaster containers, one per application
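
A minimal, incomplete sketch of a client talking to the ResourceManager: it obtains an application id from the ApplicationsManager and submits a context describing the ApplicationMaster container. The application name, command, and resource sizes are placeholders.

```java
// A minimal application-submission sketch; command and sizes are placeholders.
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToResourceManager {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();

    // The ApplicationsManager hands out a new application id.
    YarnClientApplication app = client.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");

    // Describe the container in which the ApplicationMaster itself will run.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("sleep 60"));  // placeholder command
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM

    // From here on, the ResourceManager launches and monitors the AM container;
    // it does not launch or monitor the application's own tasks.
    client.submitApplication(ctx);
    client.stop();
  }
}
```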

NodeManager
• manages the user processes on the node in coordination with the ApplicationMasters
• manages resource containers, including:
• starting containers
• monitoring their status
• reporting container status to the ResourceManager/Scheduler
• ensuring that an application is not using more resources than it has been allocated (see the limits sketch below)
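
A minimal sketch, with example values, of the per-node limits that bound what the NodeManager will hand out as containers; these are normally set in yarn-site.xml and may differ from node to node.

```java
// A minimal NodeManager resource-limits sketch; values are examples.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeManagerResources {
  public static YarnConfiguration build() {
    YarnConfiguration conf = new YarnConfiguration();
    // Total memory and vcores this node offers for containers.
    conf.setInt("yarn.nodemanager.resource.memory-mb", 8192);
    conf.setInt("yarn.nodemanager.resource.cpu-vcores", 8);
    // Kill containers whose physical memory use exceeds their allocation.
    conf.setBoolean("yarn.nodemanager.pmem-check-enabled", true);
    return conf;
  }
}
```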

ApplicationMaster
• handles an application’s lifecycle including:
• scheduling tasks
• launching tasks
• monitoring tasks
• re-launching tasks if a task fails, and handling speculative execution
• coordinating with the NodeManager
• The containers are not task specific; the same container can be used for any task (see the container request sketch below)
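
A minimal sketch of an ApplicationMaster registering with the ResourceManager and requesting a generic container; the host, tracking URL, and resource sizes are placeholders, and a real AM would loop on allocate() and launch work in the containers it receives.

```java
// A minimal ApplicationMaster container-request sketch; values are placeholders.
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RequestContainers {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new YarnConfiguration());
    rm.start();
    rm.registerApplicationMaster("", 0, "");  // host, rpc port, tracking URL (unused here)

    // Ask for one 1 GB / 1 vcore container; the container is generic,
    // not tied to a particular task type.
    Resource capability = Resource.newInstance(1024, 1);
    rm.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // Heartbeat/allocate call; allocated containers arrive over one or more calls.
    AllocateResponse response = rm.allocate(0.0f);
    for (Container c : response.getAllocatedContainers()) {
      System.out.println("Got container " + c.getId() + " on " + c.getNodeId());
    }
    rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    rm.stop();
  }
}
```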

JobHistory Server
• archives job and task information and logs

YARN Advantages
• Dynamic resource allocation means the same container may be used
with map and reduce tasks
• Supports MR and non-MR applications
• a cluster may have several ApplicationMasters
• The different nodes do not need the same resource distribution
• supports the fair scheduler and the FIFO scheduler (scheduler configuration sketch below)
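
A minimal sketch of selecting the scheduler; in practice yarn.resourcemanager.scheduler.class is set in yarn-site.xml, and the class names below are the schedulers shipped with YARN.

```java
// A minimal scheduler-selection sketch.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerChoice {
  public static YarnConfiguration useFairScheduler() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
    // Alternative: the FIFO scheduler,
    // org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
    return conf;
  }
}
```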

MapReduce Lifecycle

Tips
• The JobTracker holds the meta-information for each input split to be processed.
• Data locality is used, if feasible, to launch a map task on the same machine where its input data is stored.
• One map task is launched for each input split.
• Map outputs are partitioned, sorted, and shuffled to the reducers (see the word-count sketch below).
• The reduce task output files are not merged in the job’s final output location; each reducer writes its own part file.
• Ubertasking is a feature that enables running a small job’s tasks in the same JVM.
• Data serialization is the process of converting data in memory to bytes that can be transmitted over the network or written to disk; it is used for:
• writing intermediate map outputs locally to disk
• interprocess communication
• deserializing the shuffled map outputs at the reducers
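
A minimal word-count sketch (class names and paths are illustrative) that exercises the lifecycle above: one map task per input split, map outputs partitioned and sorted, shuffled to reducers, and one unmerged output file per reduce task.

```java
// A minimal word-count job illustrating the MapReduce lifecycle.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the split assigned to this task.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);   // serialized locally, then shuffled to a reducer
      }
    }
  }

  // Reduce: all values for one key arrive together after the sort/shuffle.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```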
