
What is Apache Hadoop?

• Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

• Created by Doug Cutting and Mike Cafarella in 2005.


The Hadoop Ecosystem
• Hadoop Common: Contains libraries and other modules
• HDFS: Hadoop Distributed File System
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: A programming model for large scale data processing
Why Hadoop?
Why Hadoop?
• Historically, computation was processor-bound
• Data volumes were relatively small
• Complicated computations were performed on that data
• Advances in computer technology historically centered on improving the power of a single machine
• Distributed systems allow developers to use multiple machines for a single task
Why Hadoop?
Distributed System: Problems
• Programming on a distributed system is much more complex
• Synchronizing data exchanges
• Managing finite bandwidth
• Controlling computation timing is complicated
• Distributed systems must be designed with the expectation of failure
Distributed System: Data Storage
• Typically divided into Data Nodes and Compute Nodes
• At compute time, data is copied to the Compute Nodes
• Fine for relatively small amounts of data
• Modern systems deal with far more data than was gathered in the past
Requirements for Hadoop
Support partial failure
• Failure of a single component must not cause the failure of the entire system, only a degradation of application performance, with no loss of data
• If a component fails, it should be able to recover without restarting the entire system
• Component failure or recovery during a job must not affect the final output
Scalability
• Increasing resources should increase load capacity
Hadoop
• Based on work done by Google in the early 2000s
• “The Google File System” in 2003
• “MapReduce: Simplified Data Processing on Large Clusters” in 2004
• The core idea was to distribute the data as it is initially stored
• Each node can then perform computation on the data it stores without
moving the data for the initial processing
Core Hadoop Concepts
• Applications are written in a high-level programming language
• No network programming or temporal dependency
• Nodes should communicate as little as possible
• A “shared nothing” architecture
• Data is spread among the machines in advance
• Perform computation where the data is already stored as often as possible
High-Level Overview
• When data is loaded onto the system it is divided into blocks
• Typically 64MB or 128MB
• Tasks are divided into two phases
• Map tasks which are done on small portions of data where the data is stored
• Reduce tasks which combine data to produce the final output
• A master program allocates work to individual nodes
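The bullets above describe the classic MapReduce programming model. Below is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names (WordCountMapper, WordCountReducer) are illustrative, not part of Hadoop itself.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: runs where the data block is stored; turns each input line into (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce task: after the shuffle, receives every count emitted for one word and combines them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```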
Overview

• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
• Responsible for storing data on the cluster
• Highly fault tolerant.
• Suitable for applications that have large data sets.
• Data files are split into blocks and distributed across the nodes in the
cluster
• Each block is replicated multiple times.
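A minimal sketch of reading and writing a file through HDFS's Java FileSystem API, assuming the Configuration picks up the cluster's core-site.xml/hdfs-site.xml; the path /tmp/hello.txt is hypothetical. Splitting into blocks and replication happen transparently behind these calls.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");     // hypothetical HDFS path

        // Write: HDFS splits the stream into blocks and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode for block locations, then streams from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, conf, false);
        }
        fs.close();
    }
}
```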
Goals
• Hardware Failure - Detection of faults and quick, automatic recovery
from them is a core architectural goal of HDFS.
• Batch processing rather than interactive. Emphasis is on high
throughput of data access rather than low latency of data access.
• HDFS is tuned to support large data sets.
• Write-once, read-many access model.
• Hadoop migrates the computation closer to where the data is located.
This minimizes network congestion and increases the overall
throughput of the system.
• Easily portable from one platform to another.
Master-Worker architecture
• HDFS supports a traditional hierarchical file organization.
• Data write and read - Master Worker architecture in the form of
Name nodes and Data Nodes.
Name Node
• Master server
• Manages the file system namespace.
• Metadata storage.
• Regulates access to files by clients.
• File system namespace operations like opening, closing, and renaming
files and directories.
• It also determines the mapping of blocks to DataNodes.
Data Node
• Worker server
• One per node in the cluster
• Manage storage attached to the nodes that they run on.
• A file is split into one or more blocks and these blocks are stored in a
set of DataNodes.
• Serving read and write requests from the file system’s clients.
• Block creation, deletion, and replication, upon instruction from the
NameNode.
• The NameNode and DataNode are pieces of software
designed to run on commodity machines.
• These machines typically run a Linux operating system.
• HDFS is built using the Java language; any machine
that supports Java can run the NameNode or the
DataNode software.
Blocks
• HDFS supports a traditional hierarchical file organization.
• It stores each file as a sequence of blocks of configurable size.
• All blocks in a file except the last block are the same size.
• A typical block size used by HDFS is 128 MB.
• If possible, each block will reside on a different DataNode.
Replication
• The blocks of a file are replicated for fault tolerance.
• Typically, each block is replicated 3 times across the cluster.
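Replication factor and block size can also be chosen per file when writing through the Java API; the sketch below is illustrative only (the path, 128 MB block size, and 3 replicas are example values).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockAndReplicaSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.dat");   // hypothetical HDFS path

        short replication = 3;                        // replicas kept for each block
        long blockSize = 128L * 1024 * 1024;          // 128 MB blocks
        int bufferSize = 4096;

        // create() overload that fixes replication factor and block size for this file
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("sample payload");
        }

        // Replication can be changed later; HDFS re-replicates (or deletes extras) in the background.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}
```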
Rack Awareness and Replica Placement.
• Rack Awareness: To take a node’s physical location into account while
scheduling tasks and allocating storage
• The placement of replicas is critical to HDFS reliability and
performance. The purpose of a rack-aware replica placement policy is
to improve data reliability, availability, and network bandwidth
utilization.
• The NameNode determines the rack id each DataNode belongs to.
Blockreport
• NameNode receives Heartbeat and Blockreport messages from the
DataNodes.
• A Blockreport contains the list of data blocks that a DataNode is
hosting.
• On startup, the NameNode enters a special read-only state called Safemode; it waits for Blockreports from the DataNodes before replicating any blocks. Safemode can also be entered manually for administration and maintenance.
Balancer
• Policy to keep one of the replicas of a block on the same node as the
node that is writing the block.
• Different replicas of a block need to be spread across racks so that the cluster can survive the loss of a whole rack.
• One of the replicas is usually placed on the same rack as the node
writing to the file so that cross-rack network I/O is reduced.
• Spread HDFS data uniformly across the DataNodes in the cluster.
• For a replication factor of 3, one replica is on the local machine or a
random datanode, another replica on a node in a different (remote)
rack, and the last on a different node in the same remote rack.
Balancer
Deals with under- and over-replication
Reasons
• A DataNode may become unavailable
• A replica may become corrupted
• A hard disk on a DataNode may fail
• The replication factor of a file may be increased.
Replication
• Writes are pipelined through the DataNodes.
• Replica selection for reads is based on minimizing global bandwidth consumption and read latency.
Metadata Persistence
• The NameNode uses a transaction log called the EditLog to persistently
record every change that occurs to file system metadata.
• The entire file system namespace, including the mapping of blocks to
files and file system properties, is stored in a file called the FsImage.
• Both the EditLog and the FsImage are stored as files in the NameNode's local OS file system.
• Checkpoint: the FsImage is updated by applying the EditLog transactions, giving an efficient way to hold a consistent view of the file system.
• A checkpoint can be triggered at a given time interval or after a given number of transactions.
• Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time.
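A short sketch of where these ideas surface in the configuration and client API: the standard properties dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns control the checkpoint triggers, and FileSystem.createSnapshot captures a directory at an instant in time. The directory and snapshot names below are hypothetical, and the directory must first be made snapshottable by an administrator.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckpointAndSnapshot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Checkpoint triggers: elapsed time or number of EditLog transactions,
        // whichever comes first (the fallback values shown are the usual defaults).
        long periodSecs = conf.getLong("dfs.namenode.checkpoint.period", 3600);
        long txns = conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000);
        System.out.println("Checkpoint every " + periodSecs + " s or " + txns + " transactions");

        // Snapshot: capture /data as it is right now (hypothetical directory and label;
        // requires e.g. `hdfs dfsadmin -allowSnapshot /data` beforehand).
        FileSystem fs = FileSystem.get(conf);
        fs.createSnapshot(new Path("/data"), "before-upgrade");
        fs.close();
    }
}
```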
Heartbeat
• Each DataNode sends a Heartbeat message to the NameNode
periodically.
• The NameNode marks DataNodes without recent Heartbeats as dead
and does not forward any new IO requests to them.
• The default heartbeat interval is 3 seconds; the default wait time before declaring a DataNode dead is roughly 10 minutes.
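The 3-second and roughly-10-minute figures come from two standard HDFS settings; this small sketch just reads them back and shows how the dead-node timeout is derived, assuming the usual defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // DataNode -> NameNode heartbeat interval, in seconds (default 3)
        long heartbeatSecs = conf.getLong("dfs.heartbeat.interval", 3);
        // NameNode re-check interval for stale DataNodes, in milliseconds (default 300000)
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);
        // A DataNode is declared dead after about 2 * recheck + 10 * heartbeat,
        // i.e. roughly 10.5 minutes with the defaults.
        long deadTimeoutMs = 2 * recheckMs + 10 * heartbeatSecs * 1000;
        System.out.println("DataNode declared dead after ~" + (deadTimeoutMs / 1000) + " s of silence");
    }
}
```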
Secondary Name Node
Need for a secondary NameNode
• The NameNode is a single point of failure
• The edits log file can get very large over time on a busy cluster
• The next restart of the NameNode then takes longer
• The secondary NameNode merges the fsimage and the edits log files
periodically and keeps edits log size within a limit.
• It is usually run on a different machine than the primary NameNode.
HDFS 3.x
Erasure Coding
• Goal: high durability and high availability
• A good storage system needs:
1. A high mean time between failures
2. Fast recovery time
3. Redundancy across different failure domains
• Related ideas: Redundant Array of Inexpensive Disks (RAID) and parity codes
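On HDFS 3.x, erasure coding is applied per directory rather than per block. Below is a minimal sketch using the DistributedFileSystem API with the built-in RS-6-3-1024k Reed-Solomon policy; the directory name is hypothetical, the cast assumes the default filesystem really is HDFS, and the policy may need to be enabled by an administrator first.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ErasureCodingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cast assumes fs.defaultFS points at an HDFS cluster.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        Path dir = new Path("/cold-data");            // hypothetical directory
        dfs.mkdirs(dir);

        // RS-6-3-1024k: Reed-Solomon with 6 data blocks + 3 parity blocks and 1 MB cells.
        // Files written under this directory tolerate the loss of any 3 blocks while
        // using ~1.5x raw storage instead of the 3x cost of plain replication.
        dfs.setErasureCodingPolicy(dir, "RS-6-3-1024k");

        System.out.println("Policy on " + dir + ": " + dfs.getErasureCodingPolicy(dir));
        dfs.close();
    }
}
```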
Hadoop 2.x vs 3.x
Why MR over Parallel processing?
• Automatic parallelization, distribution
• Load balancing – Trackers and nodes
• Network and data transfer optimization – transfer
between nodes
• Fault tolerance - Replicas
• Scaling - Large number of commodity servers
[Figure: MapReduce data flow, from large inputs through small processing tasks, segregation by key, and integration into the final output]
Map
• Partitioned
• Transforms the data
• Extracts from each record
• Converts raw input into key-value pairs
• Data is stored in a temp file and later deleted

Combine
• Acts like a semi-reducer
• Brings all map outputs together

Shuffle & Sort
• Shuffles across mapper nodes
• Aggregates based on keys
• Like a merge sort
• Same number of shuffle-sorts as reducers

Reduce
• Does the operations and calculations
• Can have a partition based on a factor
• Output is written to HDFS
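A minimal driver sketch showing how the stages above are wired together in a Job; it reuses the hypothetical WordCountMapper/WordCountReducer classes from the earlier sketch, and the reducer doubles as the combiner since summing partial counts is associative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // Map: record -> (word, 1)
        job.setCombinerClass(WordCountReducer.class);  // Combine: local semi-reduce on the mapper node
        job.setReducerClass(WordCountReducer.class);   // Reduce: sum counts after shuffle & sort

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```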
