
HADOOP
------------

BIGDATA
----------------------------------------------------------------------------------
BIGDATA is defined by the "3 Vs" , and HADOOP is the MAJOR SOLUTION for all of them:

STORAGE    --> 1. VOLUME ( Size of Data ) --- TB , PB , EB , ZB , YB , BB , SB
               2. VARIETY --- StrData , SemiStrData , UnStrData

PROCESSING --> 3. VELOCITY --- Speed of Retrieving Data

HADOOP is meant for Dual Purposes i.e. BigData STORAGE & PROCESSING

HDFS MAPREDUCE

PIG

HIVE

SQOOP

HBASE

OOZIE

FLUME

HADOOP is an "Open Source" Technology/Framework

-----------------------------------------------------------------------------------

BASIC TERMINOLOGIES USED IN BIGDATA HADOOP PROJECT

-----------------------------------------------------------------------------------

1. NODE --- Any normal System/Machine/Laptop/Desktop is nothing but a "Node"

2. CLUSTERED NODE --- Any Node which is part of some "Common & Dedicated Network" is nothing but a "Clustered Node".

3. CLUSTER ---- Collection of all the Clustered Nodes under one "Common & Dedicated Network" is nothing but a "Cluster".

4. HADOOP CLUSTERED NODE ---- On any normal Clustered Node , if both "HDFS" ( Storage Component of Hadoop ) and "MAPREDUCE" ( Processing Component of Hadoop ) are running , we can call that node a "Hadoop Clustered Node".

                              OR

                              On any normal Clustered Node , if we are installing Hadoop , that node is known as a "Hadoop Clustered Node" ( because through the Hadoop Installation , by default we are getting HDFS and MAPREDUCE ).

5. HADOOP CLUSTER ---- Collection of Hadoop Clustered Nodes which are part of some "Common & Dedicated Network" is a "Hadoop Cluster".

6. HADOOP CLUSTER SIZE ---- The number of nodes present in the Hadoop Cluster , including the Master Node , is the "Hadoop Cluster Size".

-----------------------------------------------------------------------------------

COMPATIBLE "OPERATING SYSTEMS" and PRE-REQ SOFTWARES for HADOOP Installation

-----------------------------------------------------------------------------------

1. LINUX ---- UBUNTU , CENTOS , REDHAT , FEDORA , MINT , SUSE ................

2. MAC OS

3. SUN SOLARIS

4. WINDOWS -----> used in hardly 0.0001% of cases ---> we have to download a utility called "Cygwin" and then install Hadoop on top of it
---------------------------------------------------------------------------------

PRE-REQ SOFTWARES

---------------------------

To install any version of Hadoop , JAVA - 1.6 or any above version is mandatorily required.

HADOOP - 1.X [ 2010 ] =======> JAVA - 1.6

HADOOP - 2.X [ 2015 ] =======> JAVA - 1.7

HADOOP - 3.X [ 2019 ] =======> JAVA - 1.8

**** JAVA is the Mother Programming Language for HADOOP
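
Both pre-requisites can be verified from the terminal before and after the installation ; a minimal check ( the commands are standard , the expected versions are for a Hadoop 2.x setup as described above ):

    # confirm the installed Java version ( must be 1.7 or above for Hadoop 2.x )
    java -version

    # confirm the Hadoop version after installation
    hadoop version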

=================================================================================

SOW ( Statement Of Work )
-----------------------------------------------------------------------------------
CISCO ------------------------------------------------------------------------ INFY

Nodes: 1 , 2 , 3 , 4 ... 100 --- 1 Year --- $XYZ

Which Distribution of Hadoop are we using?

WHAT ARE THE DIFF DISTRIBUTIONS OF HADOOP?


-------------------------------------------------------------

1. CDH ( Cloudera Distribution for Hadoop ) / CDP ( Cloudera Data Platform )
                                ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> YES

2. Hortonworks ( HDP )          ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> YES

3. MapR                         ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> YES

4. Apache                       ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> NO

Air Conditioner
------------------

1. LG---------------> Rs. 34000/- -------------> 0 Yrs Warranty


2. SAMSUNG------> Rs. 36000/- -------------> 0 Yrs Warranty
3. LLOYD----------> Rs. 44000/- -------------> 1 Yrs Warranty
4. VOLTAS--------> Rs. 33000/- -------------> 0 Yrs Warranty
5. BLUESTAR-----> Rs. 55000/- -------------> 2 Yrs Warranty
6. O'GENERAL---> Rs. 65000/- -------------> 5 Yrs Warranty

Windows ---> VMware Workstation ----> HADOOP - 2.6.0

============================================================================

Informatica ----------------------------- MS-SQLSERVER / ORACLE / DB2
DataStage ------------------------------- "
Java -------------------------------------- "
Dot Net ----------------------------------- "

95 Products ....... 70 ZBs of data ( Str , SemiStr & UnStr )
-----------------------------------------------------------------------------------

LEGACY PROJECT MIGRATION TO BIGDATA HADOOP PLATFORM

------------------------------------------------------------------------------

What is "Data Locality(DL)" Design Rule in Hadoop ?


--------------------------------------------------------------

In Hadoop the input data must be available on HDFS before processing commence..i.e.
small volume of BUSINESS LOGIC
will move near to HUGE VOLUME of input data.
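
A minimal sketch of Data Locality in practice ( the file name and HDFS paths below are assumptions , the commands themselves are standard Hadoop 2.x ) : first the data is placed on HDFS , and only then is the small business logic shipped to it as a job.

    # STEP 1: move the huge input data onto HDFS
    hdfs dfs -mkdir -p /user/gopalkrishna/input
    hdfs dfs -put /home/gopalkrishna/logs/A.log /user/gopalkrishna/input/

    # STEP 2: submit the (small) business logic ; it runs where the blocks live
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
           wordcount /user/gopalkrishna/input /user/gopalkrishna/output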

NFS(Network File System) Mount Point

SFTP ( Secured File Transfer Protocol )

Live Cricket Match ------------------------------------------------------- HDFS

8:00 PM ( 20GB ) ---------------------------------------------------------- HDFS ------ copied by 8:20 PM ( 20 Mins )
8:20 PM ( 32GB ) ---------------------------------------------------------- HDFS ------ copied by 8:32 PM ( 12 Mins )
8:32 PM ( 40GB ) ---------------------------------------------------------- HDFS ------ copied by 8:40 PM ( 8 Mins )

Appl Log Server


Web Log Server
Stock Exch Server
Sensex Log Server
Extn News Feed
Twitter

Live Streaming Data ( Streaming Data )

cp
scp
distcp

================================================
UNDER FIRST BOX( DATA INGESTION)
---------------------------------------------

1. SQOOP ( RDBMS ----------> HDFS )
2. SFTP SCRIPTs ( NFS MOUNT POINT ------------> HDFS )
3. FLUME/KAFKA ( From Live Streaming Appl --------------> HDFS )
   ( sample ingestion commands are shown below )
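
A minimal sketch of the first two ingestion routes ( the hostnames , database , table and directory names are assumptions ; the sqoop / hdfs / distcp commands themselves are standard ):

    # 1. RDBMS ----------> HDFS using SQOOP
    sqoop import --connect jdbc:mysql://dbhost/salesdb --table orders \
          --username etl_user -P --target-dir /data/raw/orders

    # 2. NFS MOUNT POINT ( local file system ) ------------> HDFS
    hdfs dfs -put /mnt/nfs/logs/web.log /data/raw/weblogs/

    # Hadoop Cluster ------------> Hadoop Cluster ( the "distcp" mentioned above )
    hadoop distcp hdfs://clusterA:8020/data/raw hdfs://clusterB:8020/data/raw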

UNDER SECOND BOX ( DATA ANALYTICS )


---------------------------------------------------
1. MAP REDUCE ( If UnStr Data is Present )
2. PIG
3. HIVE
----------------
4. SPARK --- 3.X

UNDER THIRD BOX ( DOWNSTREAM APPLICATIONS TO CONSUME THE PROCESSED DATA )

-----------------------------------------------------------------------------------
1. TABLEAU
2. POWER BI
3. COGNOS
4. QLIKVIEW
5. QLIK SENSE
6. MATLAB

=================================================================================

DATA STORAGE ON HDFS


-----------------------------

BLOCK --------------------> The Smallest & Individual Storage Unit on HDFS is a "BLOCK".

                       HADOOP - 1.X (2010)          HADOOP - 2.X (2015)          HADOOP - 3.X (2019)
                       -------------------          -------------------          -------------------
BLOCK SIZE --------> = 64MB (Default&MinSize)     = 128MB (Default&MinSize)    = 128MB (Default&MinSize)
                     = 128MB                      = 256MB                      = 256MB
                     = 256MB                      = 512MB                      = 512MB
                     = 512MB                      = 1024MB (=1GB)              = 1024MB (=1GB)
                     = 1024MB (=1GB)
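
The block size can also be overridden per file at write time ; a minimal sketch ( the file and target directory are assumptions , -D with dfs.blocksize is the standard generic option in Hadoop 2.x ):

    # store this one file with a 256MB block size ( 268435456 bytes ) instead of the configured default
    hdfs dfs -D dfs.blocksize=268435456 -put /home/gopalkrishna/data/C.txt /data/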

DESIGN PRINCIPLES OF HDFS BLOCK SIZE


---------------------------------------------------

1. Irrespective of the file size in Hadoop , for each and every file a dedicated number of blocks will be there.

2. Except the Last Block , all the remaining blocks hold an Equal Volume of Data ; the Last Block may or may not be full.

   Example: If a file consists of 10 Blocks   ==> 9 blocks hold an Equal Volume of data & the 10th Block may/may not be full
            If a file consists of 100 Blocks  ==> 99 blocks hold an Equal Volume of data & the 100th Block may/may not be full
            If a file consists of 500 Blocks  ==> 499 blocks hold an Equal Volume of data & the 500th Block may/may not be full
            If a file consists of 1000 Blocks ==> 999 blocks hold an Equal Volume of data & the 1000th Block may/may not be full
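
A small worked example ( the file size is an assumed number , the arithmetic simply follows the rule above ): with a configured BlockSize of 128MB , a 300MB file is stored as 3 blocks --- Block 1 = 128MB , Block 2 = 128MB , and the Last Block ( Block 3 ) = only the remaining 44MB.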

If any file storage request is coming to the Hadoop Cluster:

STEP 1: The request will always be received by the Hadoop Master Node only.

STEP 2: Based on the Configured BlockSize at that point of time , the file data will be divided into blocks ; the Master Node will only keep the Metadata and will move the actual data to the Slave Nodes.
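
Once a file is stored , the block layout the Master Node recorded can be inspected ; a minimal sketch ( the HDFS path is an assumption , the fsck options are standard ):

    # lists every block of the file , its size , and which Slave Nodes hold it
    hdfs fsck /user/gopalkrishna/input/A.log -files -blocks -locations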

-----------------------------------------------------------------------------------

HOW TO CONFIGURE BLOCK SIZE IN HADOOP ?

--------------------------------------------------------

/home/gopalkrishna/INSTALL/hadoop-2.6.0/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>

64MB  = 64 x 1024 x 1024  = 67108864 bytes
128MB = 128 x 1024 x 1024 = 134217728 bytes
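
After restarting HDFS , the effective value can be cross-checked from the command line ; a minimal sketch ( dfs.blocksize is the Hadoop 2.x name of this property , dfs.block.size being its older deprecated alias ):

    # prints the configured block size in bytes
    hdfs getconf -confKey dfs.blocksize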

-----------------------------------------------------------------------------------
WHENEVER THERE IS A CHANGE IN "BLOCKSIZE" , WHAT WILL BE THE IMPACT ON EXISTING "BLOCKS" ?
-----------------------------------------------------------------------------------

CISCO ----- JAN - MAR ( GB ---- SMEs -->64MB)

A.log --> 200MB


B.log --> 200MB

APR - JUNE --> TB , PB -- SMEs ( 128MB )

C.txt ---> 200MB

Ans: There will NOT be any impact on the already existing blocks.... because the Hadoop Master Node will only use the BLOCKSIZE at the time of WRITING(STORING) the data ... but NOT at the time of READING the data ( for Reading , the Master Node will only use the MetaData Information of the blocks ).

-----------------------------------------------------------------------------------
WHAT IS REPLICATION IN HADOOP?
---------------------------------------------

BACK UP MECHANISM or FAIL OVER MECHANISM or FAULT TOLERANT MECHANISM

------------------------------------ "REPLICATION" -----------------------------------------------------

REPLICATION : Replication is the process of duplicating the file system data on multiple Slave Nodes of the Hadoop Cluster to achieve Highly Available Processing in Hadoop.

IN HADOOP
---------------

MIN REPLICATION = 1 Time ( only possible in case of a Single Node Hadoop Cluster )

DEFAULT REPLICATION = 3 TIMES ( we need not configure this value as it is the default anyway )

MAX REPLICATION = 512 TIMES
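
The replication factor of data that is already on HDFS can also be changed file by file ; a minimal sketch ( the path is an assumption , -setrep is the standard command and -w waits until the new factor is reached ):

    # raise this file's replication factor to 4
    hdfs dfs -setrep -w 4 /data/raw/A.log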

DESIGN RULES OF HADOOP REPLICATION


---------------------------------------------------

1. Replication is only applicable to the Hadoop File System Data .... but NOT to the MetaData ( Data about Data ).

2. Keep ONE Replica per ONE Slave Node as per the design:

   To keep a 3 times Replication factor    = min 3 Slave Nodes must be required
   To keep a 10 times Replication factor   = min 10 Slave Nodes must be required
   To keep an "N" times Replication factor = min "N" number of Slave Nodes must be required

3. Replication only happens on the Slave Nodes but NOT on the Master Node ( because the Master Node is only meant for MetaData Storage , and for MetaData there is NO Replication in Hadoop ).

-----------------------------------------------------------------------------------

HOW TO CONFIGURE "BLOCK SIZE" & "REPLICATION" IN HADOOP ?


----------------------------------------------------------------------------------

/home/gopalkrishna/INSTALL/hadoop-2.6.0/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.replication</name>
<value>4</value>
</property>
</configuration>
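
Once the above properties are set and HDFS is restarted , both values can be verified from the command line ; a minimal sketch ( standard getconf keys , dfs.blocksize being the 2.x name for the block size property ):

    hdfs getconf -confKey dfs.blocksize       # expected: 134217728
    hdfs getconf -confKey dfs.replication     # expected: 4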

==========================================================================
HOW DOES THE ACTUAL REPLICATION OF DATA HAPPEN IN THE BACKEND ?

What is "Rack Awareness" in Hadoop?


-------------------------------------------------

Based on the geographical locations of the Slave Nodes of Hadoop , the Master Node divides the nodes into different Racks.

RACK 1
---------
S1 - HYD
S2 - BAN

RACK2
--------
S3 - MAL

RACK3
---------
S4 - NJ
S5 - NY

**** NOTE: In any particular RACK , as per the architecture , at most "N-1" number of Replicas will be stored ( where "N" is the Replication Factor ).

**** NOTE: The Rack Awareness concept is a pure logical concept ... there is no physical storage(configuration) for this.
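
The rack assignment the Master Node is currently using can be viewed from the command line ; a minimal sketch ( a standard admin command , its output depends on the cluster ):

    # prints each Rack and the Data Nodes placed under it
    hdfs dfsadmin -printTopology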

===================================================================================

HADOOP - 1.X ARCHITECTURE


-----------------------------------------

STORAGE ( HDFS ) ARCHITECTURE


--------------------------------------------
1. NAME NODE
2. DATA NODE
3. SECONDARY NAME NODE

PROCESSING(MAPREDUCE) ARCHITECTURE

------------------------------------------------------
4. JOB TRACKER
5. TASK TRACKER

NOTE: The above 5 architectural components are also known as "5 Daemons of Hadoop".

Daemon ---> A background Process

1. NAME NODE -----> 1. In any Hadoop Cluster , the Master Node is what we call the "Name Node".
                    2. However big the Hadoop Cluster size might be , the Name Node is always a single player i.e. we will not see multiple Name Nodes running at the same point of time.
                    3. The Name Node is exclusively meant for "MetaData(File System NameSpace) Storage".

2. DATA NODE ------> 1. In any Hadoop Cluster , the Slave Nodes are what we call the "Data Nodes".
                     2. Unlike the Name Node , which is a single player in the Hadoop Cluster , there is NO MAX LIMIT for the number of Data Nodes in a Hadoop Cluster.
                     3. A Data Node is exclusively meant for "Actual Data Storage in the form of BLOCKS"... it will NOT manage the metadata.

3. SECONDARY NAME NODE ----> 1. Whenever the Primary Name Node is down .... the Secondary Name Node will come into the picture as a back up node for the Primary Name Node.
                             2. However , the Secondary Name Node is NOT a 100% backup Node for the Primary Name Node because the Secondary Name Node should be started manually.

NOTE: In the Hadoop - 1.X version --> whenever the Primary Name Node is down , that problem is what we used to call SPOF ( Single Point Of Failure ).
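
All 5 Daemons can be seen as Java background processes once the cluster is started ; a minimal check ( jps is a standard JDK command , the process list below is only what a Hadoop 1.X single node would typically show ):

    jps
    # NameNode
    # SecondaryNameNode
    # DataNode
    # JobTracker
    # TaskTracker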

L( Load) T(Transform) E ( Extract )


Local File System(LFS) --- VS --- Dist File System(DFS) --- VS --- HDFS(Hadoop Dist File System)
---------------------------

www.amazon.com -----> 1 min ---> 8.9

1997 --- www.yahoo.com ----> 51 ---> $1500

HA( High Available Processing)
