What is Hadoop?
Open-source framework from Apache
Distributed Computing
Cluster of Commodity Hardware
Cheap > (does not require highly reliable hardware)
Scalable > (grow the cluster by adding nodes)
Flexible > (hardware from multiple vendors)
Fault Tolerance (3 copies by default)
NameNode + Secondary NameNode
Manages Big Data (the 3 Vs)
Volume > up to petabytes
Variety
Structured
Semi-Structured > XML , JSON
Quasi-Structured
Unstructured > text files
Velocity
Hadoop Components
Classification by Node
Hadoop Server
Master
Slave
Master / Secondary Master Node
Locates data and assigns tasks
Tasks: Map / Reduce
Slave Node
Stores data. On a write, each block is passed from one data node to the
next data node for replication through a pipeline. This process gets repeated
for the third replica as well (the default replication factor is 3).
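The pipeline described above can be sketched as a small simulation. This is illustrative only, with hypothetical names (`write_block`, the node dicts), not the real HDFS API: the point is that each node in the pipeline stores the block before it is counted as replicated, up to the replication factor.

```python
# Hypothetical sketch of HDFS write-pipeline replication (names are
# illustrative, not the real Hadoop API). A block travels through a
# pipeline of datanodes, one hop at a time, until the replication
# factor is satisfied.

REPLICATION_FACTOR = 3  # HDFS default

def write_block(block, datanodes):
    """Replicate `block` through a pipeline of datanodes; return their ids."""
    pipeline = datanodes[:REPLICATION_FACTOR]
    stored = []
    for node in pipeline:  # each node stores the block, then forwards it
        node.setdefault("blocks", []).append(block)
        stored.append(node["id"])
    return stored

nodes = [{"id": f"dn{i}"} for i in range(1, 5)]
print(write_block("blk_001", nodes))  # ['dn1', 'dn2', 'dn3']
```

Note that the fourth datanode never receives the block: with a replication factor of 3, only three nodes participate in the pipeline.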
Read File System
When a client issues a read request for a particular file, the request
goes to the namenode, which returns the data nodes hosting the blocks
of that file, in increasing order of distance from the requesting node.
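The distance-ordered lookup above can be sketched as follows. All names (`locate_blocks`, the distance table) are illustrative, not the HDFS API; the numeric distances follow the common HDFS convention (0 = same node, 2 = same rack, 4 = a different rack):

```python
# Illustrative sketch (not the real HDFS API): for each block, return the
# hosting datanodes sorted by increasing network distance from the client.

def locate_blocks(block_map, distance):
    """Return datanodes per block, nearest first."""
    return {
        blk: sorted(hosts, key=distance)
        for blk, hosts in block_map.items()
    }

# distance 0 = same node, 2 = same rack, 4 = other rack (HDFS convention)
dist = {"dn1": 4, "dn2": 0, "dn3": 2}
blocks = {"blk_001": ["dn1", "dn2", "dn3"]}
print(locate_blocks(blocks, dist.get))  # {'blk_001': ['dn2', 'dn3', 'dn1']}
```

Reading from the nearest replica first minimizes network traffic, which is why the namenode orders the list this way.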
Hadoop Revolution
HDFS Federation
In Hadoop 1.x, even though a Hadoop cluster can scale up to
hundreds of DataNodes, the NameNode keeps all its metadata in
memory (RAM). >>> This limits the maximum number of files a
Hadoop cluster can store (typically 50-100M files).
As your data size and cluster size grow, this becomes a bottleneck,
since the size of your cluster is limited by the NameNode's memory.
Hadoop 2.0 feature HDFS Federation allows horizontal scaling for
Hadoop distributed file system (HDFS)
To scale the name service horizontally, federation uses
multiple independent NameNodes and namespaces.
The NameNodes are independent and don't require coordination
with each other.
The DataNodes are used as common storage for blocks by all
the NameNodes.
Each DataNode registers with all the NameNodes in the cluster.
DataNodes send periodic heartbeats and block reports and
handle commands from the NameNodes.
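The federation layout above can be sketched in a few lines. The class and names are hypothetical (this is not Hadoop's implementation): several independent namenodes each own a namespace, and every datanode registers with all of them.

```python
# Sketch of HDFS Federation (illustrative names, not the Hadoop API):
# independent namenodes, each owning one namespace; every datanode
# registers with ALL namenodes and serves as shared block storage.

class NameNode:
    def __init__(self, namespace):
        self.namespace = namespace   # e.g. "/user" -- owned by this NN only
        self.datanodes = set()

    def register(self, datanode_id):
        self.datanodes.add(datanode_id)

namenodes = [NameNode("/user"), NameNode("/projects")]

for dn in ["dn1", "dn2", "dn3"]:     # each datanode registers with ALL NNs
    for nn in namenodes:
        nn.register(dn)

print([(nn.namespace, sorted(nn.datanodes)) for nn in namenodes])
```

Because the namenodes never reference each other, losing one namespace does not affect the others; only the datanodes are shared.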
Fault tolerance
In Hadoop 1.x, the NameNode was a single point of failure: a NameNode
failure makes the Hadoop cluster inaccessible. Usually this is a rare
occurrence, because of the business-critical hardware with RAS features
used for NameNode servers.
In case of NameNode failure, Hadoop administrators need to
manually recover the NameNode using the Secondary NameNode.
In Hadoop 2.0, the NameNode High Availability feature comes with
support for a passive Standby NameNode. These Active-Passive
NameNodes are configured for automatic failover.
All namespace edits are logged to shared NFS storage, and
there is only a single writer (with a fencing configuration) to this shared
storage at any point in time. The passive NameNode reads from this
storage and keeps updated metadata information for the cluster.
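The single-writer/standby-reader protocol can be sketched like this. All names here (`append_edit`, `standby_catch_up`) are hypothetical, not Hadoop's: only the active node may append to the shared log, and the standby replays whatever it has not yet applied.

```python
# Simplified sketch (hypothetical names) of the Active/Standby shared
# edit log: fencing ensures a single writer; the standby tails the log
# to keep its metadata current and be ready for failover.

shared_edit_log = []

def append_edit(node_role, edit):
    if node_role != "active":        # fencing: reject any non-active writer
        raise PermissionError("only the active NameNode may write")
    shared_edit_log.append(edit)

def standby_catch_up(applied):
    """Standby replays any edits it has not yet applied."""
    new = shared_edit_log[len(applied):]
    applied.extend(new)
    return new

append_edit("active", "mkdir /data")
append_edit("active", "create /data/f1")
standby_state = []
print(standby_catch_up(standby_state))  # ['mkdir /data', 'create /data/f1']
```

Because the standby's state is rebuilt purely from the shared log, failover only requires promoting it to active; no manual metadata recovery is needed, unlike Hadoop 1.x.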
MapReduce V1
- The task tracker is pre-configured with a number of slots indicating
the number of tasks it can accept.
- When the job tracker tries to schedule a task, it looks for an empty
slot in a tasktracker running on the same server that hosts the
datanode where the data for that task resides.
- If none is found, it looks for a machine in the same rack.
- There is no consideration of system load during this allocation.
- The name node and job tracker keep a table mapping the IP
address of each data node to its rack id.
- The task tracker spawns different JVM processes to ensure that
process failures do not bring down the task tracker.
- The task tracker keeps sending heartbeat messages to the job tracker
to say that it is alive and to keep it updated with the number of empty
slots available for running more tasks.
- If a TaskTracker fails or times out, that part of the job is rescheduled.
- The job tracker does some checkpointing of its work in the filesystem.
Whenever it starts up, it checks what it had done up to the last checkpoint
and resumes any incomplete jobs. Earlier, if the job tracker went down, all
the active job information was lost.
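The locality-aware slot allocation in the bullets above can be sketched as follows. The function and data shapes are illustrative, not the MRv1 API: prefer a tracker on the node holding the data (node-local), then one in the same rack (rack-local), with no regard for system load.

```python
# Sketch of MRv1 slot scheduling (illustrative names): node-local first,
# then rack-local, using the IP-to-rack mapping; load is not considered.

def schedule(task, trackers, rack_of):
    """Pick a tracker with a free slot: node-local first, then rack-local."""
    # node-local: a tracker on the same host as the task's data
    for t in trackers:
        if t["free_slots"] > 0 and t["host"] == task["data_host"]:
            t["free_slots"] -= 1
            return t["host"], "node-local"
    # rack-local: a tracker in the same rack as the task's data
    for t in trackers:
        if t["free_slots"] > 0 and rack_of[t["host"]] == rack_of[task["data_host"]]:
            t["free_slots"] -= 1
            return t["host"], "rack-local"
    return None, "unscheduled"

racks = {"h1": "r1", "h2": "r1", "h3": "r2"}
trackers = [{"host": "h1", "free_slots": 0},
            {"host": "h2", "free_slots": 1},
            {"host": "h3", "free_slots": 2}]
print(schedule({"data_host": "h1"}, trackers, racks))  # ('h2', 'rack-local')
```

Here h1 holds the data but has no free slots, so the task falls through to h2, which sits in the same rack; that is exactly the fallback the notes describe.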
YARN
- The resource manager only manages the allocation of resources to
the different jobs
- The scheduler just takes care of scheduling jobs, without doing
any monitoring or status updates.
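That separation of concerns can be shown with a deliberately minimal sketch (hypothetical names, not the YARN API): a pure scheduler only matches resource requests against free capacity and returns grants; it tracks nothing about the running applications.

```python
# Minimal sketch of YARN's pure scheduler (illustrative, not the real
# API): grant requests while capacity remains; no monitoring, no status.

def schedule_containers(requests, capacity):
    """Grant (app, needed) requests in order while capacity lasts."""
    granted = []
    for app, needed in requests:
        if needed <= capacity:
            capacity -= needed
            granted.append(app)
    return granted, capacity

apps = [("job-1", 4), ("job-2", 8), ("job-3", 2)]
print(schedule_containers(apps, 10))  # (['job-1', 'job-3'], 4)
```

Monitoring and status updates are handled elsewhere (by per-application masters in YARN), which is what lets the scheduler stay this simple.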
Zookeeper : Coordination
Flume : Log Collector ,
agents on web servers collect logs and ship them to HDFS in real time
Sqoop : Data Exchange , between HDFS and relational databases