TCS Internal
Agenda
Technical Limitations
Expectation

The four Vs of Big Data:
- Volume
- Velocity (e.g. the rate at which the NYSE generates trade data)
- Variety
- Veracity
Technical Limitations
Expectation

- High performance
- Minimal infrastructure cost
- Scalable
- Simple to access
- Fault tolerant
Hadoop
Solution to Business Challenges

Hadoop addresses the Big Data challenges of Volume, Velocity, Variety and
Veracity: it is a distributed system that processes large volumes of varied
data at speed, overcoming the technical limitations and meeting the
expectations described above.
Hadoop Eco-System
Managing the different components manually is possible, but it is not
feasible for a cluster with hundreds of nodes. ZooKeeper provides a
centralized interface for managing all the components in the Hadoop
Eco-System.
Hadoop Eco-System
Impala is a Massively Parallel Processing (MPP) database engine developed
by Cloudera.
[Diagram: cluster network topology — nodes grouped into Racks 1-4, each
rack connected through a switch, with the racks spread across two sites
(Site 1 and Site 2).]
What is NoSQL?
Column store
Stores data as sections of columns rather than as rows. Examples: HBase,
BigTable and HyperTable.

Document database
Designed for storing, retrieving, and managing document-oriented
information, also known as semi-structured data. Examples: MongoDB and
CouchDB.

Graph database
Designed for data whose relations are well represented as a graph:
interconnected elements with an undetermined number of relations between
them. Examples: Neo4J and Polyglot.
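The row-versus-column distinction above can be sketched in a few lines of
plain Python. This is a minimal illustration, not any real database: it
shows why a column store can scan a single attribute without touching the
rest of each record.

```python
# Illustrative only: the same records in row-oriented and
# column-oriented layouts.
rows = [
    {"id": 1, "city": "NY", "temp": 21},
    {"id": 2, "city": "LA", "temp": 28},
    {"id": 3, "city": "SF", "temp": 17},
]

# Column store: one contiguous list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate over one column reads only that column's data.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(columns["temp"])  # [21, 28, 17]
print(avg_temp)         # 22.0
```

A real column store (e.g. HBase) adds persistence, compression per column,
and distribution, but the access pattern is the same idea.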
HDFS Daemons

[Diagram, built up over several slides: a six-node cluster in which the
slave nodes run DataNodes (each later paired with a TaskTracker), while
the NameNode, the Secondary NameNode (SNN) and the JobTracker run on
dedicated nodes.]
NameNode
Manages the file system metadata in the Hadoop architecture. The metadata
is stored in the fsimage file on the node where it runs. It is a single
point of failure.

DataNode
Manages the storage attached to the node on which it runs.

JobTracker
Tracks all jobs submitted to a Hadoop distributed environment. It
communicates with the TaskTrackers to distribute tasks among nodes.

TaskTracker
Tracks the tasks submitted to the DataNode where it is deployed. It
communicates with the JobTracker for task execution.
NameNode

Metadata in memory
- The directory structure and all other metadata are kept entirely in main
memory.
- There is no demand paging of metadata.

Transaction log
- Records file creations and file deletions (i.e. edits).
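The split between in-memory metadata and a transaction log of edits can be
sketched as follows. This is a toy model of the idea only — paths and
operation names are illustrative, not the actual HDFS edit-log format.

```python
# Hypothetical edit log: an ordered record of namespace changes.
edit_log = [
    ("create", "/user/a/file1.txt"),
    ("create", "/user/a/file2.txt"),
    ("delete", "/user/a/file1.txt"),
]

# In-memory metadata: here, just the set of existing paths.
# Replaying the log rebuilds the namespace after a restart.
namespace = set()
for op, path in edit_log:
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)

print(sorted(namespace))  # ['/user/a/file2.txt']
```

The real NameNode periodically checkpoints this replayed state into the
fsimage file so the log does not have to be replayed from the beginning.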
DataNode

A block server
- Stores data in the local file system (e.g. ext3).
- Stores metadata for each block (e.g. a CRC checksum).
- Serves data and metadata to clients.

Block report
- Periodically sends a report of all existing blocks to the NameNode.

Pipelining of data
- Forwards data to other specified DataNodes.
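The per-block CRC mentioned above can be illustrated with Python's
standard-library `zlib.crc32`. This is a simplified single-checksum-per-
block sketch; HDFS actually checksums fixed-size chunks within a block.

```python
import zlib

# A DataNode stores a checksum alongside each block's bytes.
block = b"some block bytes"
stored_crc = zlib.crc32(block)

# On a later read, recompute and compare to detect corruption.
assert zlib.crc32(block) == stored_crc          # intact block passes
assert zlib.crc32(block + b"x") != stored_crc   # corruption is detected
```

When a checksum mismatch is found, the client can fetch another replica of
the block and the corrupt copy can be re-replicated.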
Hadoop Architecture

[Diagram: a client issues metadata operations to the NameNode, which holds
the metadata (name, replicas, e.g. /home/foo/data, 6, ...). Block-level
read and write operations go directly to the DataNodes, which replicate
blocks across Rack 1 and Rack 2.]
Hadoop Architecture contd

- Master/slave architecture.
- An HDFS cluster consists of a single NameNode: a master server that
manages the file system namespace and regulates access to files by
clients.
- There are a number of DataNodes, usually one per node in the cluster.
- Large block sizes are used to improve disk I/O.
- The DataNodes manage the storage attached to the nodes they run on.
- HDFS exposes a file system namespace and allows user data to be stored
in files.
- A file is split into one or more blocks, and the blocks are stored on
DataNodes.
- A DataNode serves read and write requests, and performs block creation,
deletion, and replication on instruction from the NameNode.
- Hadoop maintains a replication factor for each data block, specified at
creation time. The default replication factor is three.
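The block-splitting and replication bullets above amount to simple
arithmetic, sketched here with the Hadoop 2.x defaults (128 MB blocks,
replication factor 3); the file size is an illustrative example.

```python
import math

block_size = 128 * 1024 * 1024   # default HDFS block size (Hadoop 2.x)
replication = 3                  # default replication factor

file_size = 300 * 1024 * 1024    # an example 300 MB file

# The file is split into fixed-size blocks (the last one may be partial),
# and every block is stored `replication` times across the cluster.
num_blocks = math.ceil(file_size / block_size)
raw_storage = file_size * replication

print(num_blocks)                    # 3
print(raw_storage // (1024 * 1024))  # 900
```

So a 300 MB file occupies three blocks and consumes 900 MB of raw cluster
storage, which is why capacity planning must account for the replication
factor.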
Hadoop Cluster

[Diagram: a master node (Node 1) running the JobTracker (MapReduce layer)
and the NameNode (HDFS layer) alongside a TaskTracker and DataNode, and
slave nodes (Nodes 2 and 3) each running a TaskTracker (MapReduce layer)
and a DataNode (HDFS layer).]
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.tx
Hadoop References

Books
- Hadoop: The Definitive Guide, 3rd Edition, by Tom White.

URLs
- http://hadoop.apache.org/docs/r2.6.0
- http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network
- http://hortonworks.com/hadoop/hdfs
Thank You