
hadoop.md 2024-04-13

Hadoop
Hadoop Architecture
assesses ability to define Hadoop and the components that make up the Hadoop framework.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems. However, the
differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is
designed to be deployed on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX
requirements to enable streaming access to file system data. HDFS was originally built as
infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop
Core project. The project URL is http://hadoop.apache.org/.
Hadoop Architecture

NameNode (1, or 2 for HA)
- holds the metadata, e.g. the locations of the blocks (~128 MB each) and their replicas

DataNode
- responsible for the data: storing, deleting, replicating
- also hosts the processing: in a typical cluster a YARN NodeManager runs alongside each DataNode

The main components of the framework:
- Hadoop HDFS to store data across slave machines
- Hadoop YARN for resource management in the Hadoop cluster
- Hadoop MapReduce to process data in a distributed fashion
- ZooKeeper to ensure synchronization across a cluster
Assumptions and Goals
- Hardware Failure
- Streaming Data Access
- Large Data Sets
- Simple Coherency Model
- Moving Computation is Cheaper than Moving Data
- Portability Across Heterogeneous Hardware and Software Platforms
NameNode and DataNode
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by clients. In addition,
there are a number of DataNodes, usually one per node in the cluster, which manage storage
attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data
to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a
set of DataNodes. The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories. It also determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the file system's clients. The
DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.
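The block mechanics above can be sketched in a few lines. This is illustrative Python, not the Hadoop API: it only shows how a file's byte range divides into fixed-size blocks, which the NameNode then maps to DataNodes (the function name is made up for this sketch).

```python
# Minimal sketch (not Hadoop code): split a file into fixed-size blocks,
# as HDFS does internally before distributing them across DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file yields 3 blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```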
Replica Placement: The First Baby Steps
HDFS's placement policy is to put one replica on the local machine if the writer is on a datanode,
otherwise on a random datanode in the same rack as that of the writer, another replica on a node in a
different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the
inter-rack write traffic which generally improves write performance.
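The placement policy can be sketched as a toy model. This is illustrative Python, not Hadoop code; the `racks` mapping (rack id to node names) and the function name are assumptions made for the sketch, and it presumes at least two racks with two or more nodes each.

```python
import random

def place_replicas(writer_node, racks):
    """Sketch of HDFS's default 3-replica placement policy.
    racks: dict mapping rack id -> list of node names (hypothetical model).
    Assumes >= 2 racks and >= 2 nodes in each rack."""
    # Replica 1: on the writer's own node (assumed to be a DataNode here).
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    first = writer_node
    # Replica 2: on a node in a different (remote) rack.
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second = random.choice(racks[remote_rack])
    # Replica 3: on a different node in that same remote rack.
    third = random.choice([n for n in racks[remote_rack] if n != second])
    return [first, second, third]
```

Only one replica crosses racks during the write, which is why the policy cuts inter-rack write traffic.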
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request
from a replica that is closest to the reader.
Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does
not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and
Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a
DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered
safely replicated when the minimum number of replicas of that data block has checked in with the
NameNode. After a configurable percentage of safely replicated data blocks checks in with the
NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then
determines the list of data blocks (if any) that still have fewer than the specified number of replicas.
The NameNode then replicates these blocks to other DataNodes.
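The exit condition reduces to a threshold check, sketched below in illustrative Python (not the NameNode's implementation; the function and argument names are made up, though the 0.999 default mirrors the real `dfs.namenode.safemode.threshold-pct` setting).

```python
def can_exit_safemode(block_replica_counts, min_replicas, threshold=0.999):
    """Sketch of the Safemode exit check: the fraction of blocks with at
    least `min_replicas` reported replicas must reach `threshold`.
    block_replica_counts: replicas reported so far, one entry per block."""
    if not block_replica_counts:
        return True  # nothing to wait for
    safe = sum(1 for c in block_replica_counts if c >= min_replicas)
    return safe / len(block_replica_counts) >= threshold
```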
The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a
connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the
NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote
Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By
design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by
DataNodes or clients.
Robustness
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common
types of failures are NameNode failures, DataNode failures and network partitions.
https://www.geeksforgeeks.org/hadoop-architecture/
Hadoop Common
measures understanding of the collection of common utilities and libraries that support other Hadoop
modules.
https://hadoop.apache.org/docs/stable/api/
Hadoop Common (the common utilities) is the set of shared Java libraries and scripts required by all the other components of a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce to run the cluster. Hadoop Common embodies the assumption that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
Hadoop Distributed File System (HDFS)
tests understanding of the different Hadoop shell commands and components and usage of these in the
Hadoop environment related to the file system.
| Action | Command |
| --- | --- |
| Create a directory named /foodir | `bin/hadoop dfs -mkdir /foodir` |
| Remove a directory named /foodir | `bin/hadoop dfs -rmr /foodir` |
| View the contents of a file named /foodir/myfile.txt | `bin/hadoop dfs -cat /foodir/myfile.txt` |

Admin

| Action | Command |
| --- | --- |
| Put the cluster in Safemode | `bin/hadoop dfsadmin -safemode enter` |
| Generate a list of DataNodes | `bin/hadoop dfsadmin -report` |
| Recommission or decommission DataNode(s) | `bin/hadoop dfsadmin -refreshNodes` |
Hadoop YARN
tests proficiency in the understanding and usage of the Hadoop YARN APIs.
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html
Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop and
is responsible for resource allocation and job scheduling. Introduced in the Hadoop 2.0 version, YARN is the
middle layer between HDFS and MapReduce in the Hadoop architecture.
The elements of YARN include:

- ResourceManager (one per cluster)
- ApplicationMaster (one per application)
- NodeManagers (one per node)
The fundamental idea of YARN is to split up the functionalities of resource management and job
scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager and the NodeManager form the data-computation framework. The
ResourceManager is the ultimate authority that arbitrates resources among all the applications in the
system. The NodeManager is the per-machine framework agent that is responsible for containers,
monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the
ResourceManager/Scheduler.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with
negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and
monitor the tasks.
YARN supports the notion of resource reservation via the ReservationSystem, a component that allows
users to specify a profile of resources over time and temporal constraints (e.g., deadlines), and reserve
resources to ensure the predictable execution of important jobs. The ReservationSystem tracks resources
over time, performs admission control for reservations, and dynamically instructs the underlying scheduler
to ensure that the reservation is fulfilled.
Steps to Running an application in YARN

1. The client submits an application to the ResourceManager.
2. The ResourceManager allocates a container.
3. The ApplicationMaster contacts the related NodeManager because it needs to use the containers.
4. The NodeManager launches the container.
5. The container executes the ApplicationMaster.
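The flow above can be sketched as a toy simulation. None of these class or method names are real YARN APIs; the sketch only illustrates the order of events: the ResourceManager hands out a container for the ApplicationMaster, which then requests further containers for its tasks.

```python
# Purely illustrative model of the YARN submission flow (not the YARN API).
class ResourceManager:
    def __init__(self, node_capacity):
        self.free = dict(node_capacity)  # node -> free container slots

    def allocate(self):
        """Grant one container from the first node with free capacity;
        the NodeManager on that node would then launch the container."""
        for node, slots in self.free.items():
            if slots > 0:
                self.free[node] -= 1
                return node
        return None  # cluster is full

def submit_application(rm, num_tasks):
    am_node = rm.allocate()  # container for the ApplicationMaster itself
    # The AM then negotiates containers for its tasks with the RM.
    task_nodes = [rm.allocate() for _ in range(num_tasks)]
    return am_node, task_nodes
```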

Hadoop MapReduce
measures proficiency and understanding of core MapReduce functionality in Hadoop.
The Map function takes input from the disk as <key,value> pairs, processes them, and produces
another set of intermediate <key,value> pairs as output.
The Reduce function also takes inputs as <key,value> pairs, and produces <key,value> pairs as
output.
Combine is an optional process. The combiner is a reducer that runs individually on each mapper server. It
reduces the data on each mapper further to a simplified form before passing it downstream.
This makes shuffling and sorting easier as there is less data to work with. Often, the combiner class is set to
the reducer class itself, because the reduce function is commutative and associative. However, if
needed, the combiner can be a separate class as well.
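The map, combine, and reduce stages for the classic word-count job can be sketched in plain Python (no Hadoop involved; the function names are made up for this sketch). Because addition is commutative and associative, the same function serves as both combiner and reducer.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: sum the counts for one key. Summation is commutative and
    # associative, so this same function can act as the combiner.
    return key, sum(values)

def combine(pairs):
    # Combiner: pre-aggregate one mapper's output before the shuffle.
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return [reduce_fn(k, vs) for k, vs in grouped.items()]

# One mapper's output of 6 pairs shrinks to 4 pairs before shuffling.
mapped = map_fn("to be or not to be")
combined = combine(mapped)
```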
Partition is the process that translates the <key, value> pairs resulting from mappers to another set of <key,
value> pairs to feed into the reducer. It decides how the data has to be presented to the reducer and also
assigns it to a particular reducer.
The default partitioner determines the hash value for the key, resulting from the mapper, and assigns a
partition based on this hash value. There are as many partitions as there are reducers. So, once the
partitioning is complete, the data from each partition is sent to a specific reducer.
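The default scheme reduces to one line. This Python sketch stands in for Hadoop's `HashPartitioner` (which uses the key's Java `hashCode` modulo the number of reduce tasks); Python's built-in `hash` is used here purely for illustration.

```python
def default_partition(key, num_reducers):
    # Hash the key and take it modulo the number of reducers, so equal
    # keys always land in the same partition (and thus the same reducer).
    return hash(key) % num_reducers
```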
