
<< HADOOP >>

What is Hadoop?
Hadoop is an open-source framework from Apache for distributed
computing on a cluster of commodity servers.
Cheap > (does not require highly reliable hardware)
Scalable > (grows by adding more commodity nodes)
Flexible > (hardware can come from multiple vendors)
Fault Tolerant > (3 copies of data by default; a secondary name node
backs up the master server)
Manage Big data (3V)

Volume > (up to petabytes)
Variety
Structured
Semi-structured
Quasi-structured
Unstructured > (text files, XML, JSON)
Velocity > (speed at which data arrives)
Hadoop Component
Classification by Node
Hadoop servers are classified into Master and Slave nodes.
Master / Secondary Master Node : locates data and assigns
Map/Reduce tasks
Slave Node : stores data and runs tasks assigned by the Master Node

Classification by Job
HDFS
Files are split into blocks and each block is replicated across Data
Nodes; the Name Node keeps the metadata of which Data Node holds
which block.
Master : Name Node , Secondary Name Node
Slave : Data Node
Map Reduce V1
MapReduce is a programming model and an associated
implementation for processing and generating large data sets.
Master : Job Tracker
Slave : Task Tracker
Map Reduce V2
Master : Resource Manager
Slave : Node Manager , App Master
HDFS
Data Nodes store the blocks and report to the Master (Name) Node.
HDFS stores large files across multiple machines > default replication
= 3 : data is stored on 3 nodes , 2 copies on the same rack and 1 copy
on a different rack
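The default placement policy described above can be sketched as a toy model in Python (rack and node names are hypothetical; real HDFS also prefers the writer's own node for the first replica when one hosts a datanode):

```python
import random

def place_replicas(racks, writer_rack):
    """Toy sketch of HDFS default placement (replication = 3):
    one replica on the writer's rack, and two replicas together
    on a second, randomly chosen remote rack."""
    first = random.choice(racks[writer_rack])
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [(writer_rack, first), (remote_rack, second), (remote_rack, third)]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
placement = place_replicas(racks, "rack1")
# three replicas, spread over exactly two racks (1 local + 2 remote)
```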
HDFS does not store data in tables like an RDBMS ; it is suited to
unstructured data and is not designed for random access.
Data nodes can talk to each other to rebalance data , to move copies
around , and to keep the data properly replicated
HDFS cannot be mounted directly by an existing operating system
Write File System
When a client issues a write request, the filesystem metadata is
written to the namenode and the data is written to a datanode (on the
requesting node itself if a datanode is hosted on the same machine);
simultaneously, this datanode transfers the data to another datanode
for replication through a pipeline. The process repeats for the third
level of replication as well (the default replication value is 3).
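The pipelined write can be sketched as a toy model (hypothetical datanode names; a real pipeline streams packets and acknowledgements concurrently):

```python
def write_block(block, pipeline, storage):
    """Toy model of the HDFS write pipeline: each datanode persists
    the block, streams it to the next node in the pipeline, and
    acknowledges after all downstream nodes have acknowledged."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage[head] = block                      # persist on this datanode
    acks = write_block(block, rest, storage)   # forward downstream
    return acks + [head]                       # ack flows back upstream

storage = {}
acks = write_block(b"blk_0001", ["dn1", "dn2", "dn3"], storage)
# acknowledgements arrive last-node-first: ["dn3", "dn2", "dn1"]
```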
Read File System
When a client issues a read request on a particular file, the request
goes to the namenode, which returns information about the datanodes
hosting the blocks of that file, in increasing order of distance from
the requesting node.
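The distance ordering can be sketched with a simplified two-level topology (node and rack names are hypothetical):

```python
def order_replicas(reader_node, reader_rack, replicas):
    """Sketch: return the datanodes hosting a block in increasing
    network distance from the reader (0 = same node, 2 = same rack,
    4 = different rack -- a simplified two-level topology)."""
    def distance(replica):
        node, rack = replica
        if node == reader_node:
            return 0
        if rack == reader_rack:
            return 2
        return 4
    return sorted(replicas, key=distance)

replicas = [("dn5", "rack3"), ("dn2", "rack1"), ("dn1", "rack1")]
nearest_first = order_replicas("dn1", "rack1", replicas)
# local copy first, then same-rack, then remote-rack
```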
Hadoop Evolution
HDFS Federation
In Hadoop 1.x, even though a Hadoop cluster can scale up to
hundreds of DataNodes, the NameNode keeps all its metadata in
memory (RAM). >>> This limits the maximum number of files a
Hadoop cluster can store (typically 50-100M files). As your data size
and cluster size grow, this becomes a bottleneck: the size of your
cluster is limited by the NameNode's memory.
The Hadoop 2.0 feature HDFS Federation allows horizontal scaling of
the Hadoop Distributed File System (HDFS).
In order to scale the name service horizontally, federation uses
multiple independent Namenodes and Namespaces.
The Namenodes are independent and don't require coordination
with each other.
The DataNodes are used as common storage for blocks by all
the Namenodes.
Each DataNode registers with all the NameNodes in the cluster.
DataNodes send periodic heartbeats and block reports and
handle commands from the NameNodes.
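Federation can be pictured as a mount table mapping namespace prefixes to independent namenodes. This is a toy sketch with hypothetical names (real deployments use ViewFS for client-side resolution):

```python
# Hypothetical mount table: each namespace is owned by one independent
# namenode; datanodes (not shown) register with all of them.
MOUNT_TABLE = {"/user": "nn1", "/tmp": "nn2"}

def namenode_for(path):
    """Resolve a path to the namenode that owns its namespace."""
    for prefix, namenode in MOUNT_TABLE.items():
        if path.startswith(prefix):
            return namenode
    raise KeyError(f"no namenode owns {path}")
```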
Fault tolerance
In Hadoop 1.x, the NameNode was a single point of failure: NameNode
failure makes the Hadoop cluster inaccessible. In practice this is a
rare occurrence, because business-critical hardware with RAS features
is used for NameNode servers.
In case of NameNode failure, Hadoop administrators need to
manually recover the NameNode using the Secondary NameNode.
In Hadoop 2.0, the NameNode High Availability feature comes with
support for a passive standby NameNode. These Active-Passive
NameNodes are configured for automatic failover.
All namespace edits are logged to shared NFS storage, and there is
only a single writer (with fencing configured) to this shared storage
at any point in time. The passive NameNode reads from this storage
and keeps updated metadata information for the cluster.
In case of Active NameNode failure, the passive NameNode becomes
the Active NameNode and starts writing to the shared storage. The
fencing mechanism ensures that there is only one writer to the
shared storage at any point in time.
With Hadoop Release 2.4.0, High Availability support for
Resource Manager is also available.
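The single-writer rule with fencing can be sketched as a toy model (not the actual HDFS HA implementation, which uses NFS or a Quorum Journal):

```python
class SharedEditLog:
    """Toy model of shared edit-log storage with fencing:
    only the currently fenced-in writer may append namespace edits."""
    def __init__(self):
        self.writer = None
        self.edits = []

    def fence(self, namenode):
        # Grant write access to one namenode, revoking any previous writer.
        self.writer = namenode

    def append(self, namenode, edit):
        if namenode != self.writer:
            raise PermissionError(f"{namenode} is fenced off")
        self.edits.append(edit)

log = SharedEditLog()
log.fence("nn-active")
log.append("nn-active", "mkdir /user")
log.fence("nn-standby")            # failover: standby takes over as writer
log.append("nn-standby", "mkdir /tmp")
# a late write by the old active namenode would now raise PermissionError
```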
MapReduce V1 and YARN
MapReduce is a programming model and an associated
implementation for processing and generating large data sets.
Users specify a map function that processes a key/value pair to
generate a set of intermediate key/value pairs, and a reduce function
that merges all intermediate values associated with the same
intermediate key.
Unlike an RDBMS, which serves real-time (online) SQL queries,
Map/Reduce is designed for offline batch processing.
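The map and reduce functions above can be sketched with a minimal single-process word count (just the programming model, not how Hadoop itself runs jobs):

```python
from itertools import groupby

def map_fn(_, line):
    # map: emit (word, 1) for every word in the input line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: merge all intermediate values for the same key
    return word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal MapReduce sketch: map every record, shuffle & sort the
    intermediate pairs by key, then reduce each key group."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    intermediate.sort(key=lambda kv: kv[0])              # shuffle & sort
    return dict(reduce_fn(k, (v for _, v in group))
                for k, group in groupby(intermediate, key=lambda kv: kv[0]))

counts = run_mapreduce([(0, "big data"), (1, "big cluster")], map_fn, reduce_fn)
# → {"big": 2, "cluster": 1, "data": 1}
```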

MapReduce V1
- The task tracker is pre-configured with a number of slots indicating
the number of tasks it can accept.
- When the job tracker tries to schedule a task, it looks for an empty
slot in the tasktracker running on the same server which hosts the
datanode where the data for that task resides.
- If not found, it looks for the machine in the same rack.
- There is no consideration of system load during this allocation.
- Name node and job tracker have table mapping between the IP
address of data node and the rack id
- The task tracker spawns different JVM processes to ensure that
process failures do not bring down the task tracker.
- The task tracker keeps sending heartbeat messages to the job tracker
to say that it is alive and to keep it updated with the number of empty
slots available for running more tasks.
- If a TaskTracker fails or times out, that part of the job is rescheduled.
- The job tracker does some checkpointing of its work in the filesystem.
Whenever it starts up, it checks what it was up to as of the last
checkpoint and resumes any incomplete jobs. Earlier, if the job tracker
went down, all the active job information was lost.
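The locality-aware slot scheduling described above can be sketched as a toy model (the cluster layout and tracker names are hypothetical; for simplicity each task tracker is assumed to be co-hosted with a datanode):

```python
def schedule_task(task, trackers, rack_of):
    """Sketch of MRv1 slot scheduling: prefer a free slot on the node
    hosting the data, then a node in the same rack, then any node
    with a free slot."""
    candidates = [t for t, info in trackers.items() if info["free_slots"] > 0]
    data_node = task["data_node"]
    if data_node in candidates:
        return data_node, "node-local"
    for t in candidates:
        if rack_of[t] == rack_of[data_node]:
            return t, "rack-local"
    if candidates:
        return candidates[0], "remote"
    return None, "unscheduled"

trackers = {"tt1": {"free_slots": 0}, "tt2": {"free_slots": 2}, "tt3": {"free_slots": 1}}
rack_of = {"tt1": "rack1", "tt2": "rack1", "tt3": "rack2"}
choice = schedule_task({"data_node": "tt1"}, trackers, rack_of)
# tt1 has no free slot, so the task runs rack-local on tt2
```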
YARN
- The resource manager only manages the allocation of resources to
the different jobs
- The Scheduler just takes care of scheduling jobs, without worrying
about monitoring or status updates.

- Different resources such as memory, cpu time, network bandwidth


etc. are put into one unit called the Resource Container.
- Different AppMasters run on different nodes; each talks to a number
of these resource containers and reports the monitoring/status
details back to the Resource Manager.
- Resource manager
o responsible for allocating resources to the various running
applications, according to constraints such as queue capacities,
user-limits etc. The scheduler performs its scheduling function
based on the resource requirements of the applications
- Node manager
o is the per-machine slave, which is responsible for launching the
applications' containers, monitoring their resource usage (cpu,
memory, disk, network) and reporting the same to the
ResourceManager.
- Container
o a bundle of resources (memory, CPU, etc.) granted on a specific
node; application tasks run inside containers
- App Master
o a framework-specific library tasked with negotiating resources
from the resource manager, working with the node manager, and
monitoring tasks
o Each Application Master has the responsibility of negotiating
appropriate resource containers from the scheduler, tracking
their status, and monitoring their progress. From the system
perspective, the Application Master runs as a normal container.
- can now run multiple applications in Hadoop
- Fault tolerance
o a hot-standby NameNode replaces manual recovery via the
Secondary NameNode
- Scalability:
o The processing power in data centers continues to grow quickly.
Because YARN ResourceManager focuses exclusively on
scheduling, it can manage those larger clusters much more
easily.
o Can have more than 1 name node
- Compatibility with MapReduce
o Existing MapReduce applications and users can run on top of
YARN without disruption to their existing processes.
- Support for workloads other than MapReduce:
o Additional programming models such as graph processing and
iterative modeling are now possible for data processing. These
added models allow enterprises to realize near real-time
processing and increased ROI on their Hadoop investments.
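The Resource Manager's role above can be sketched as a toy scheduler that grants containers while capacity lasts (memory is the only resource modeled here; a real RM also tracks CPU, queues, and user limits):

```python
class ResourceManager:
    """Toy YARN-style scheduler: grants resource containers while
    cluster memory capacity lasts."""
    def __init__(self, total_memory_mb):
        self.free_mb = total_memory_mb

    def allocate(self, memory_mb):
        # An AppMaster negotiates a container; None means the request
        # cannot be satisfied right now (a real RM would queue it).
        if memory_mb <= self.free_mb:
            self.free_mb -= memory_mb
            return {"memory_mb": memory_mb}
        return None

    def release(self, container):
        # A finished container returns its capacity to the cluster.
        self.free_mb += container["memory_mb"]

rm = ResourceManager(total_memory_mb=4096)
c1 = rm.allocate(2048)   # granted
c2 = rm.allocate(4096)   # exceeds remaining capacity -> None
```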
Hadoop Ecosystem (For YARN Only)
Hadoop itself provides HDFS and Map/Reduce. The other projects in
the ecosystem add capabilities on top of Hadoop, such as SQL queries
and random access. The ecosystem described here is for Hadoop V2,
where YARN acts as the resource manager.

Hadoop Common : contains libraries & utilities needed by other
Hadoop modules
Hadoop Distributed File System (HDFS) : a distributed file system that
stores data on commodity machines, providing very high aggregate
bandwidth across the cluster
Hadoop MapReduce : a programming model for large-scale data
processing
Hadoop YARN : a resource management platform for managing compute
resources in the cluster and using them for scheduling users'
applications

Oozie : Workflow scheduler ; defines workflows that chain together
Hadoop Map/Reduce, Hive, and Pig jobs
Pig : Scripting ; like Hive, a high-level layer over Map/Reduce. Pig
scripts are written in the Pig Latin language and are commonly used
for ETL, including semi-structured data such as JSON.

Pig is a system developed by Yahoo! to facilitate data analysis in
the MapReduce framework. Queries are written in the dataflow
language Pig Latin, which prefers an incremental and procedural
style compared with the declarative approach of SQL. Pig Latin
programs can be translated automatically into a series of map-reduce
iterations, removing the need for the developer to implement map and
reduce functions manually.

Mahout : Machine learning ; used by data scientists for predictive
analytics on Hadoop. Mahout provides algorithms for recommenders,
classification, and clustering.
R Connectors : Statistics ; connect the R language to Hadoop
Hive : SQL query engine ; lets users query data in HDFS with an
SQL-like language that Hive translates into Map/Reduce jobs, so
queries run as batch jobs.

Hive is a data warehouse developed by Facebook. In contrast to


Pig Latin, the Hive query language follows the declarative style of SQL.
Hive automatically maps to the MapReduce framework at execution
time.

HBase : Columnar store on Hadoop ; modeled after Google's BigTable,
it provides real-time random access to data by row and column. HBase
is the NoSQL database of the Hadoop ecosystem.

HBase is a column-oriented NoSQL database based on HDFS. In


typical NoSQL database style, HBase is useful for random read/write
access in contrast to HDFS.

Zookeeper : Coordination service for distributed applications
Flume : Log collector ; agents on servers (e.g. web servers) stream
logs into HDFS in real time
Sqoop : Data exchange ; imports and exports tables between an RDBMS
(SQL Server, Oracle, MySQL) and HDFS/Hadoop
Hue : Hadoop User Experience ; a web user interface for Hadoop as an
alternative to the command line
