
Introduction To Hadoop


Agenda

Limitations of Traditional Systems


Hadoop - The Solution
Hadoop Eco-System
Hadoop Distributed File System (HDFS)
Hadoop Daemon Processes
Hadoop Architecture


Limitations of Traditional Systems

Business Challenges

Volume
Scale of data growth. The projected volume of global data by 2020 is 40 zettabytes, 300 times the 2005 volume.

Velocity
The rate at which data is changed or captured. NYSE captures 1 TB of data each trading session.

Variety
Different sources and formats of data. Facebook, Twitter, YouTube and electronic sensors generate data that does not conform to any structural form.

Veracity
Varying levels of data uncertainty and reliability.


Limitations of Traditional Systems

Technical Limitations

Development and maintenance of distributed programs is difficult.
Network bandwidth is not unlimited.
Disk I/O has not improved as much as CPU and memory speed.
Specialized hardware is required for fault tolerance and performance.
Traditional methods need specialized infrastructure to mitigate these challenges, which leads to higher IT cost.


Limitations of Traditional Systems

Expectation

High performance.
Minimal infrastructure cost.
Scalability.
Simple access.
Fault tolerance.


Hadoop - The Solution

Solution to Business Challenges

Volume
Scales data storage and processing.

Velocity
Uses distributed Massively Parallel Processing to speed up data handling.

Variety
Supports structured, semi-structured and unstructured data.

Veracity
Provides the environment to run data cleansing processes that remove biases, noise and abnormality from the system.


Hadoop - The Solution

Overcome Technical Limitations

The Hadoop Distributed File System provides the environment to manage distributed programs effectively.
Sends the processing code to the data location rather than spending a lot of time moving data over the network.
The Hadoop framework, which runs on commodity hardware, is fault tolerant.


Hadoop - The Solution

Meet Expectation

Provides high performance through massively parallel processing.
Use of low-cost commodity hardware minimizes the infrastructure cost.
The Hadoop framework, which runs on commodity hardware, is fault tolerant.
Can handle thousands of nodes working in parallel.
Provides a fault-tolerance mechanism by which the system continues to function correctly even after some components fail.


Hadoop Eco-System

Hadoop has two core components:
1> A distributed file system to store data, known as the Hadoop Distributed File System (HDFS). It works on top of the native file system.
2> A Java-based distributed processing framework called the MapReduce framework.
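Example (illustrative): the word-count job that ships with Hadoop exercises both core components - the data lives in HDFS and the MapReduce code is sent to it. The examples jar path below is an assumption; it varies by distribution.

# Load a local file into HDFS, run the bundled word-count job, inspect the result
hadoop fs -mkdir /user/train/input
hadoop fs -put words.txt /user/train/input
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount /user/train/input /user/train/output
hadoop fs -cat /user/train/output/part-r-00000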


Hadoop Eco-System contd.

Pig and Hive act as abstraction layers on top of MapReduce.
The Pig engine provides a procedural interface to the Hadoop core components.
The Hive query engine provides a SQL interface to the Hadoop core components.
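Example (illustrative sketch - the table name 'orders' and the HDFS path are assumptions): the same aggregate expressed through both interfaces.

# HiveQL: declarative, SQL-like, over a Hive table
hive -e "SELECT customer, SUM(amount) FROM orders GROUP BY customer"

# Pig Latin: a procedural data flow over a delimited file in HDFS
cat > orders.pig <<'EOF'
A = LOAD '/user/train/orders' USING PigStorage(',')
      AS (customer:chararray, amount:double);
B = GROUP A BY customer;
C = FOREACH B GENERATE group, SUM(A.amount);
DUMP C;
EOF
pig orders.pig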


Hadoop Eco-System contd.

HBase is added to the Eco-System to meet the need for a NoSQL database. HBase utilizes HDFS for data storage and parallel processing.
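Example (illustrative - the table and column family names are invented): basic operations from the HBase shell.

hbase shell
create 'users', 'info'                      # table with one column family
put 'users', 'row1', 'info:name', 'Alice'   # write a cell
get 'users', 'row1'                         # read it back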


Hadoop Eco-System contd.

Sqoop is introduced to facilitate data exchange between relational database systems and HDFS.
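Example (illustrative - the JDBC URL, credentials, table and target directory are assumptions): importing a relational table into HDFS.

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --target-dir /user/train/orders \
  --num-mappers 4        # run the import as 4 parallel map tasks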


Hadoop Eco-System contd.

Flume streams data from multiple sources into Hadoop.
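Example (illustrative sketch - the agent name, tailed file and HDFS path are assumptions): a minimal Flume agent that streams a log file into HDFS.

# example.conf: exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/train/flume/events
a1.sinks.k1.channel = c1

# Start the agent:
flume-ng agent --conf conf --conf-file example.conf --name a1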


Hadoop Eco-System contd.

Managing different components manually is possible, but not feasible for a cluster with hundreds of nodes. Zookeeper provides a centralized interface to manage all the components in the Hadoop Eco-System.


Hadoop Eco-System contd.

Oozie helps to create processing workflows involving the other Eco-System components, such as Pig, Hive, Impala, HBase, Sqoop and Zookeeper.


Hadoop Eco-System contd.

Impala is a Massively Parallel Processing (MPP) database engine, developed by Cloudera.
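Example (illustrative - the query and table are assumptions): Impala is usually queried through impala-shell, giving low-latency SQL without launching MapReduce jobs.

impala-shell -q "SELECT COUNT(*) FROM orders"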


Basic Concept Of Cluster

[Diagram: a cluster spanning two sites (Site 1 and Site 2); each site contains racks (Rack 1 to Rack 4), each rack has a switch and several nodes.]

What is NoSQL?

What is a Relational Database?
Data is organized in tables with rows and columns.
Maintains relationships among tables.
Supports structured data only.
Very strict and uniform structure.

What is a NoSQL Database?
A non-relational database.
A largely distributed database system that allows for high-performance, agile processing of information at massive scale.
Supports structured, semi-structured and unstructured data.
Also known as "Not Only SQL".


Types of NoSQL Databases

Key-value store
In a key-value store, all of the data consists of an indexed key and a value. Examples: Cassandra, Azure Table Storage (ATS), BerkeleyDB, etc.

Column store
Stores data as sections of columns of data, rather than as rows of data. Examples: HBase, BigTable and HyperTable.

Document database
Designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. Examples: MongoDB and CouchDB.

Graph database
Designed for data whose relations are well represented as a graph: interconnected elements with an undetermined number of relations between them. Examples: Neo4j and Polyglot.

Hadoop Distributed File System (HDFS)


Basic Features
Distributed file system designed to address big data problems.
Runs on low-cost commodity hardware.
Automatic fault detection and fault tolerance are part of its architectural goals.
Supports Massively Parallel Processing (MPP) for high-throughput access.
Ideal for Write Once Read Many (WORM) applications.
Massively scalable; can grow beyond 1000 nodes per cluster.
Provides streaming access to stored data.


Hadoop Daemon Processes

[Diagram: a bare six-node cluster (Node 1 to Node 6), before any Hadoop daemons are assigned.]

Hadoop Daemon Processes contd.

HDFS Daemons
[Diagram: the NameNode runs on Node 1; DataNodes run on Nodes 4, 5 and 6.]

Hadoop Daemon Processes contd.

HDFS Daemons
[Diagram: as before, the NameNode on Node 1 and DataNodes on Nodes 4, 5 and 6, now joined by the Secondary NameNode (SNN) on Node 3.]

Hadoop Daemon Processes contd.

[Diagram: the full daemon set - NameNode on Node 1, JobTracker on Node 2, Secondary NameNode (SNN) on Node 3; Nodes 4, 5 and 6 each run a DataNode and a TaskTracker.]

Hadoop Daemon Processes contd.

NameNode
Manages the file system metadata in the Hadoop architecture. Metadata is stored in the fsimage file inside the node on which it runs. Single point of failure.

Secondary NameNode
Updates the fsimage file periodically using the edits files.

DataNode
Manages storage attached to the node on which it runs.

JobTracker
Tracks all jobs submitted in a Hadoop distributed environment. It communicates with the TaskTrackers for distribution of tasks among nodes.

TaskTracker
Tracks tasks submitted to the DataNode where it is deployed. It communicates with the JobTracker for task execution.

NameNode Directory Structure

Metadata in Memory
The entire metadata is in main memory.
No demand paging of metadata.

Types of Metadata
List of files (i.e. fsimage, fstime).
List of blocks for each file.
List of DataNodes for each block.
File attributes, e.g. creation time, replication factor.

A Transaction Log
Records file creations, file deletions (i.e. edits).
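Example (illustrative - the exact file names carry transaction ids and vary per cluster): Hadoop 2 ships offline viewers for inspecting these metadata files.

hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml              # dump a copy of fsimage
hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml     # decode an edits segment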

DataNode

A Block Server
Stores data in the local file system (e.g. ext3).
Stores metadata of a block (e.g. CRC).
Serves data and metadata to clients.

Block Report
Periodically sends a report of all existing blocks to the NameNode.

Facilitates Pipelining of Data
Forwards data to other specified DataNodes.
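Example: the result of these block reports is visible from the command line.

hdfs dfsadmin -report    # lists each live DataNode with its capacity and usage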


Hadoop Architecture

[Diagram: HDFS architecture. A Client issues metadata operations to the NameNode, which holds the metadata (name, replicas, ... e.g. /home/foo/data with its replica count). Clients read blocks from and write blocks to DataNodes spread across Rack 1 and Rack 2; the NameNode directs block operations and replication between DataNodes.]

Hadoop Architecture contd.

Master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in a cluster.
Large block sizes are used to improve disk I/O.
The DataNodes manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
DataNodes serve read and write requests and perform block creation, deletion and replication upon instruction from the NameNode.
Hadoop maintains a replication factor for each data block, specified during creation.
The default replication factor is three.
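Example (illustrative - paths are assumptions): the replication factor can be set per file at write time and changed later.

hadoop fs -D dfs.replication=2 -put File.txt /user/train/File.txt   # write with replication factor 2
hadoop fs -setrep -w 3 /user/train/File.txt                         # raise it to 3 and wait for re-replication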

Write Operation In HDFS



The Client is ready to load File.txt into the cluster and breaks it up into blocks, starting
with Block A.
The Client consults the NameNode that it wants to write File.txt, gets permission from
the Name Node, and receives a list of (3) DataNodes for each block, a unique list for
each block.
The Name Node used its Rack Awareness data to influence the decision of which Data
Nodes to provide in these lists.
The key rule is that for every block of data, two copies will exist in one rack, another
copy in a different rack. So the list provided to the Client will follow this rule.
Before the Client writes Block A of File.txt to the cluster it wants to know that all Data
Nodes which are expected to have a copy of this block are ready to receive it.
It picks the first Data Node in the list for Block A (Data Node 1) and says, Hey, get ready
to receive a block, and heres a list of (2) Data Nodes, Data Node 5 and Data Node 6. Go
make sure theyre ready to receive this block too.
Data Node 1 then opens a TCP connection to Data Node 5 and says, Hey, get ready to
receive a block, and go make sure Data Node 6 is ready is receive this block too.
Data Node 5 will then ask Data Node 6, Hey, are you ready to receive a block?
The acknowledgments of readiness come back on the same TCP pipeline, until the initial
Data Node 1 sends a Ready message back to the Client.
At this point the Client is ready to begin writing block data into the cluster.
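Example (illustrative - the path is an assumption): once written, the block placement produced by this pipeline can be verified with fsck.

hdfs fsck /user/train/File.txt -files -blocks -locations   # shows each block and the DataNodes holding its replicas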

Read Operation In HDFS

When a Client wants to retrieve a file from HDFS, it again consults the NameNode and asks for the block locations of the file.
The NameNode returns a list of the DataNodes holding each block.
The Client picks a DataNode from each block list and reads one block at a time.
It does not progress to the next block until the previous block completes.
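Example (the path is an illustrative assumption): from the client's point of view this is hidden behind an ordinary read; the shell client fetches the block locations and streams the blocks in order.

hadoop fs -cat /user/train/File.txt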


Hadoop Cluster


Hadoop Cluster - Single Node

[Diagram: a single master (Node 1) runs the JobTracker and TaskTracker in the MapReduce layer, and the NameNode and DataNode in the HDFS layer.]

Hadoop Cluster - Multi Node

[Diagram: the master (Node 1) runs the JobTracker and NameNode, plus a TaskTracker and DataNode; the slaves (Node 2 and Node 3) each run a TaskTracker in the MapReduce layer and a DataNode in the HDFS layer.]

Hadoop Rack Awareness

Through Rack Awareness configuration, Hadoop knows the topology of the entire network (when Hadoop is configured in a multi-rack cluster).
Never lose all data, even if an entire rack fails.
Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes.
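Example (illustrative sketch - the IP-to-rack mapping is invented): rack topology is commonly supplied as an administrator-written script, referenced by the net.topology.script.file.name property, that maps each DataNode address to a rack id.

#!/bin/bash
# topology.sh: print a rack path for every address Hadoop passes in
for host in "$@"; do
  case $host in
    10.1.1.*) echo -n "/rack1 " ;;
    10.1.2.*) echo -n "/rack2 " ;;
    *)        echo -n "/default-rack " ;;
  esac
done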


HDFS High Availability - Why?

The NameNode is a single point of failure (SPOF) in an HDFS cluster.
Each cluster has a single NameNode.
If that machine or process becomes unavailable, the cluster as a whole will be unavailable until the NameNode is either restarted or brought up on a separate machine.


HDFS High Availability

Two separate machines are configured as NameNodes. At any point in time, one of the NameNodes is in an Active state, and the other is in a Standby state.

For fast failover, it is also necessary that the Standby node has up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both.

A fencing process is responsible for cutting off the previous Active NameNode's access to the shared edits storage, to prevent data loss or other incorrect results.
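Example (illustrative - the service ids nn1 and nn2 come from the cluster's configuration): the state of an HA pair can be inspected and switched with the haadmin tool.

hdfs haadmin -getServiceState nn1    # prints active or standby
hdfs haadmin -failover nn1 nn2       # fail over to nn2, fencing nn1 if required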

HDFS Basic Commands


How to view HDFS file content?
hadoop fs -cat URI
Example:
hadoop fs -cat hdfs://nn1.example.com/home/dess/PSEG/load.hql

How to Create a directory in HDFS at given path(s)?


hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2


HDFS Basic Commands


How to list the contents of a directory?
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode

How to Upload a file in HDFS?


hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/


HDFS Basic Commands


How to Download a file from HDFS?
hadoop fs -get <HDFS src> ... <local_dest_Path>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/saurzcode/

How to Copy a file from source to destination?


hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2


HDFS Basic Commands

How to Copy a file from Local file system to HDFS?


hadoop fs -copyFromLocal <localsrc> URI

Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

How to Copy a file from HDFS to Local file system?

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>


Example:
hadoop fs -copyToLocal /user/saurzcode/abc.txt /home/saurzcode/abc.txt


HDFS Basic Commands


How to Remove a file or directory in HDFS?
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt

How to Display last few lines of a file?


hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt


HDFS Basic Commands


How to Change group association of files?
hadoop fs -chgrp [-R] GROUP URI [URI ...]
Example:
hadoop fs -chgrp impala /user/dess/home/dess/PSEG/ste_copy.hql

How to Change the permissions of files?


hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Example:
hadoop fs -chmod 700 home/dess/PSEG/ste_copy.hql


HDFS Basic Commands

How to Change the ownership of files?


hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Example:
hadoop fs -chown dess /user/dess/home/dess/PSEG/ste_copy.hql


Hadoop References
Books
Hadoop: The Definitive Guide, 3rd Edition, by Tom White.

URLs
http://hadoop.apache.org/docs/r2.6.0
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network
http://hortonworks.com/hadoop/hdfs


Thank You

