
Introduction To Hadoop


Agenda

Limitations of Traditional Systems


Hadoop - The Solution
Hadoop Eco-System
Hadoop Distributed File System (HDFS)
Hadoop Daemon Processes
Hadoop Architecture


Limitations of Traditional Systems

Business Challenges

Volume
Scale of data growth. The projected volume of global data by 2020 is 40 zettabytes, 300 times the 2005 volume.

Velocity
The rate at which data is changed or captured. NYSE captures 1 TB of data each trading session.

Variety
Different sources and formats of data. Facebook, Twitter, YouTube and electronic sensors generate data that does not conform to any structural form.

Veracity
Varying levels of data uncertainty and reliability.


Limitations of Traditional Systems

Technical Limitations

Development and maintenance of distributed programs is difficult.
Network bandwidth is not unlimited.
Disk I/O has not improved as much as CPU and memory speed.
Specialized hardware is required for fault tolerance and performance.
Traditional methods need specialized infrastructure to mitigate these challenges, which leads to higher IT cost.


Limitations of Traditional Systems

Expectation

High performance.
Minimal infrastructure cost.
Scalability.
Simple access.
Fault tolerance.


Hadoop - The Solution

Solution to Business Challenges

Volume
Scales data storage and processing.

Velocity
Uses distributed Massively Parallel Processing to speed up data handling.

Variety
Supports structured, semi-structured and unstructured data.

Veracity
Provides the environment to run data cleansing processes that remove biases, noise and abnormality from the system.


Hadoop - The Solution

Overcome Technical Limitations

The Hadoop Distributed File System provides the environment to manage distributed programs effectively.
Sends the processing code to the data location rather than spending a lot of time moving data over the network.
The Hadoop framework, which runs on commodity hardware, is fault tolerant.


Hadoop - The Solution

Meet Expectation

Provides high performance through massively parallel processing.
Use of low-cost commodity hardware minimizes the infrastructure cost.
The Hadoop framework, which runs on commodity hardware, is fault tolerant.
Can handle thousands of nodes working in parallel.
Provides a fault-tolerance mechanism by which the system continues to function correctly even after some components fail.


Hadoop Eco-System

Hadoop has two core components:
1> A distributed file system to store data, known as the Hadoop Distributed File System (HDFS). It works on top of the native file system.
2> A Java-based distributed processing framework called the MapReduce framework.
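Example (illustrative): the word-count job that ships with Hadoop exercises both core components - the data lives in HDFS and the MapReduce code is sent to it. The examples jar path below is an assumption; it varies by distribution.

# Load a local file into HDFS, run the bundled word-count job, inspect the result
hadoop fs -mkdir /user/train/input
hadoop fs -put words.txt /user/train/input
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount /user/train/input /user/train/output
hadoop fs -cat /user/train/output/part-r-00000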


Hadoop Eco-System contd.

Pig and Hive act as abstraction layers on top of MapReduce.
The Pig engine provides a procedural interface to the Hadoop core components.
The Hive query engine provides a SQL interface to the Hadoop core components.
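Example (illustrative sketch - the table name 'orders' and the HDFS path are assumptions): the same aggregate expressed through both interfaces.

# HiveQL: declarative, SQL-like, over a Hive table
hive -e "SELECT customer, SUM(amount) FROM orders GROUP BY customer"

# Pig Latin: a procedural data flow over a delimited file in HDFS
cat > orders.pig <<'EOF'
A = LOAD '/user/train/orders' USING PigStorage(',')
      AS (customer:chararray, amount:double);
B = GROUP A BY customer;
C = FOREACH B GENERATE group, SUM(A.amount);
DUMP C;
EOF
pig orders.pig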


Hadoop Eco-System contd.

HBase is added to the Eco-System to meet the need for a NoSQL database. HBase utilizes HDFS for data storage and parallel processing.
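Example (illustrative - the table and column family names are invented): basic operations from the HBase shell.

hbase shell
create 'users', 'info'                      # table with one column family
put 'users', 'row1', 'info:name', 'Alice'   # write a cell
get 'users', 'row1'                         # read it back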


Hadoop Eco-System contd.

Sqoop is introduced to facilitate data exchange between relational database systems and HDFS.
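Example (illustrative - the JDBC URL, credentials, table and target directory are assumptions): importing a relational table into HDFS.

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /user/train/orders \
  --num-mappers 4        # run the import as 4 parallel map tasks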


Hadoop Eco-System contd.

Flume streams data from multiple sources into Hadoop.
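Example (illustrative sketch - the agent name, tailed file and HDFS path are assumptions): a minimal Flume agent that streams a log file into HDFS.

# example.conf: exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/train/flume/events
a1.sinks.k1.channel = c1

# Start the agent:
flume-ng agent --conf conf --conf-file example.conf --name a1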


Hadoop Eco-System contd.

Managing different components manually is possible, but not feasible for a cluster with hundreds of nodes. Zookeeper provides a centralized interface to manage all the components in the Hadoop Eco-System.


Hadoop Eco-System contd.

Oozie helps to create processing workflows involving the other Eco-System components, such as Pig, Hive, Impala, HBase, Sqoop and Zookeeper.


Hadoop Eco-System contd.

Impala is a Massively Parallel Processing (MPP) database engine, developed by Cloudera.
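Example (illustrative - the query and table are assumptions): Impala is usually queried through impala-shell, giving low-latency SQL without launching MapReduce jobs.

impala-shell -q "SELECT COUNT(*) FROM orders"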


Basic Concept Of Cluster

[Diagram: a cluster spanning two sites (Site 1 and Site 2); each site contains racks (Rack 1 to Rack 4), each rack has a switch and several nodes.]

What is NoSQL?

What is a Relational Database?
Data is organized in tables with rows and columns.
Maintains relationships among tables.
Supports structured data only.
Very strict and uniform structure.

What is a NoSQL Database?
A non-relational database.
A largely distributed database system that allows for high-performance, agile processing of information at massive scale.
Supports structured, semi-structured and unstructured data.
Also known as "Not Only SQL".


Types of NoSQL Databases

Key-value store
In a key-value store, all of the data consists of an indexed key and a value. Examples: Cassandra, Azure Table Storage (ATS), BerkeleyDB, etc.

Column store
Stores data as sections of columns of data, rather than as rows of data. Examples: HBase, BigTable and HyperTable.

Document database
Designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. Examples: MongoDB and CouchDB.

Graph database
Designed for data whose relations are well represented as a graph: interconnected elements with an undetermined number of relations between them. Examples: Neo4j and Polyglot.

Hadoop Distributed File System (HDFS)


Basic Features
Distributed file system designed to address big data problems.
Runs on low-cost commodity hardware.
Automatic fault detection and fault tolerance are part of its architectural goals.
Supports Massively Parallel Processing (MPP) for high-throughput access.
Ideal for Write Once Read Many (WORM) applications.
Massively scalable; can grow beyond 1000 nodes per cluster.
Provides streaming access to stored data.


Hadoop Daemon Processes

[Diagram: a bare six-node cluster (Node 1 to Node 6), before any Hadoop daemons are assigned.]

Hadoop Daemon Processes contd.

HDFS Daemons
[Diagram: the NameNode runs on Node 1; DataNodes run on Nodes 4, 5 and 6.]

Hadoop Daemon Processes contd.

HDFS Daemons
[Diagram: as before, the NameNode on Node 1 and DataNodes on Nodes 4, 5 and 6, now joined by the Secondary NameNode (SNN) on Node 3.]

Hadoop Daemon Processes contd.

[Diagram: the full daemon set - NameNode on Node 1, JobTracker on Node 2, Secondary NameNode (SNN) on Node 3; Nodes 4, 5 and 6 each run a DataNode and a TaskTracker.]

Hadoop Daemon Processes contd.

NameNode
Manages the file system metadata in the Hadoop architecture. Metadata is stored in the fsimage file inside the node on which it runs. Single point of failure.

Secondary NameNode
Updates the fsimage file periodically using the edits files.

DataNode
Manages storage attached to the node on which it runs.

JobTracker
Tracks all jobs submitted in a Hadoop distributed environment. It communicates with the TaskTrackers for distribution of tasks among nodes.

TaskTracker
Tracks tasks submitted to the DataNode where it is deployed. It communicates with the JobTracker for task execution.

NameNode Directory Structure

Metadata in Memory
The entire metadata is in main memory.
No demand paging of metadata.

Types of Metadata
List of files (i.e. fsimage, fstime).
List of blocks for each file.
List of DataNodes for each block.
File attributes, e.g. creation time, replication factor.

A Transaction Log
Records file creations, file deletions (i.e. edits).
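Example (illustrative - the exact file names carry transaction ids and vary per cluster): Hadoop 2 ships offline viewers for inspecting these metadata files.

hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml              # dump a copy of fsimage
hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml     # decode an edits segment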

DataNode

A Block Server
Stores data in the local file system (e.g. ext3).
Stores metadata of a block (e.g. CRC).
Serves data and metadata to clients.

Block Report
Periodically sends a report of all existing blocks to the NameNode.

Facilitates Pipelining of Data
Forwards data to other specified DataNodes.
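Example: the result of these block reports is visible from the command line.

hdfs dfsadmin -report    # lists each live DataNode with its capacity and usage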


Hadoop Architecture

[Diagram: HDFS architecture. A Client issues metadata operations to the NameNode, which holds the metadata (name, replicas, ... e.g. /home/foo/data with its replica count). Clients read blocks from and write blocks to DataNodes spread across Rack 1 and Rack 2; the NameNode directs block operations and replication between DataNodes.]

Hadoop Architecture contd.

Master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in a cluster.
Large block sizes are used to improve disk I/O.
The DataNodes manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
DataNodes serve read and write requests and perform block creation, deletion and replication upon instruction from the NameNode.
Hadoop maintains a replication factor for each data block, specified during creation.
The default replication factor is three.
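Example (illustrative - paths are assumptions): the replication factor can be set per file at write time and changed later.

hadoop fs -D dfs.replication=2 -put File.txt /user/train/File.txt   # write with replication factor 2
hadoop fs -setrep -w 3 /user/train/File.txt                         # raise it to 3 and wait for re-replication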

Write Operation In HDFS



The Client is ready to load File.txt into the cluster and breaks it up into blocks, starting
with Block A.
The Client consults the NameNode that it wants to write File.txt, gets permission from
the Name Node, and receives a list of (3) DataNodes for each block, a unique list for
each block.
The Name Node used its Rack Awareness data to influence the decision of which Data
Nodes to provide in these lists.
The key rule is that for every block of data, two copies will exist in one rack, another
copy in a different rack. So the list provided to the Client will follow this rule.
Before the Client writes Block A of File.txt to the cluster it wants to know that all Data
Nodes which are expected to have a copy of this block are ready to receive it.
It picks the first Data Node in the list for Block A (Data Node 1) and says, Hey, get ready
to receive a block, and heres a list of (2) Data Nodes, Data Node 5 and Data Node 6. Go
make sure theyre ready to receive this block too.
Data Node 1 then opens a TCP connection to Data Node 5 and says, Hey, get ready to
receive a block, and go make sure Data Node 6 is ready is receive this block too.
Data Node 5 will then ask Data Node 6, Hey, are you ready to receive a block?
The acknowledgments of readiness come back on the same TCP pipeline, until the initial
Data Node 1 sends a Ready message back to the Client.
At this point the Client is ready to begin writing block data into the cluster.
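Example (illustrative - the path is an assumption): once written, the block placement produced by this pipeline can be verified with fsck.

hdfs fsck /user/train/File.txt -files -blocks -locations   # shows each block and the DataNodes holding its replicas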

Read Operation In HDFS

When a Client wants to retrieve a file from HDFS, it again consults the NameNode and asks for the block locations of the file.
The NameNode returns a list of the DataNodes holding each block.
The Client picks a DataNode from each block list and reads one block at a time.
It does not progress to the next block until the previous block completes.
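Example (the path is an illustrative assumption): from the client's point of view this is hidden behind an ordinary read; the shell client fetches the block locations and streams the blocks in order.

hadoop fs -cat /user/train/File.txt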


Hadoop Cluster


Hadoop Cluster - Single Node

[Diagram: a single master (Node 1) runs the JobTracker and TaskTracker in the MapReduce layer, and the NameNode and DataNode in the HDFS layer.]

Hadoop Cluster - Multi Node

[Diagram: the master (Node 1) runs the JobTracker and NameNode, plus a TaskTracker and DataNode; the slaves (Node 2 and Node 3) each run a TaskTracker in the MapReduce layer and a DataNode in the HDFS layer.]

Hadoop Rack Awareness

Through Rack Awareness configuration, Hadoop knows the topology of the entire network (when Hadoop is configured in a multi-rack cluster).
Never lose all data, even if an entire rack fails.
Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes.
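Example (illustrative sketch - the IP-to-rack mapping is invented): rack topology is commonly supplied as an administrator-written script, referenced by the net.topology.script.file.name property, that maps each DataNode address to a rack id.

#!/bin/bash
# topology.sh: print a rack path for every address Hadoop passes in
for host in "$@"; do
  case $host in
    10.1.1.*) echo -n "/rack1 " ;;
    10.1.2.*) echo -n "/rack2 " ;;
    *)        echo -n "/default-rack " ;;
  esac
done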


HDFS High Availability - Why?

The NameNode is a single point of failure (SPOF) in an HDFS cluster.
Each cluster has a single NameNode.
If that machine or process becomes unavailable, the cluster as a whole will be unavailable until the NameNode is either restarted or brought up on a separate machine.


HDFS High Availability

Two separate machines are configured as NameNodes. At any point in time, one of the NameNodes is in an Active state, and the other is in a Standby state.

For fast failover, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both.

A fencing process is responsible for cutting off the previous Active NameNode's access to the shared edits storage, to prevent data loss or other incorrect results.
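Example (illustrative - the service ids nn1 and nn2 come from the cluster's configuration): the state of an HA pair can be inspected and switched with the haadmin tool.

hdfs haadmin -getServiceState nn1    # prints active or standby
hdfs haadmin -failover nn1 nn2       # fail over to nn2, fencing nn1 if required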

HDFS Basic Commands


How to view HDFS file content?
hadoop fs -cat URI
Example:
hadoop fs -cat hdfs://nn1.example.com/home/dess/PSEG/load.hql

How to Create a directory in HDFS at given path(s)?


hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2


HDFS Basic Commands


How to list the contents of a directory?
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode

How to Upload a file in HDFS?


hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/


HDFS Basic Commands


How to Download a file from HDFS?
hadoop fs -get <HDFS src> ... <local_dest_Path>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/saurzcode/

How to Copy a file from source to destination?


hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2


HDFS Basic Commands

How to Copy a file from Local file system to HDFS?


hadoop fs -copyFromLocal <localsrc> URI

Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

How to Copy a file from HDFS to Local file system?

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>


Example:
hadoop fs -copyToLocal /user/saurzcode/abc.txt /home/saurzcode/abc.txt


HDFS Basic Commands


How to Remove a file or directory in HDFS?
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt

How to Display last few lines of a file?


hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt


HDFS Basic Commands


How to Change group association of files?
hadoop fs -chgrp [-R] GROUP URI [URI ...]
Example:
hadoop fs -chgrp impala /user/dess/home/dess/PSEG/ste_copy.hql

How to Change the permissions of files?


hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Example:
hadoop fs -chmod 700 home/dess/PSEG/ste_copy.hql


HDFS Basic Commands

How to Change the ownership of files?


hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Example:
hadoop fs -chown dess /user/dess/home/dess/PSEG/ste_copy.hql


Hadoop References
Books
Hadoop: The Definitive Guide, 3rd Edition, by Tom White.

URLs
http://hadoop.apache.org/docs/r2.6.0
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network
http://hortonworks.com/hadoop/hdfs


Thank You

