
HADOOP
------------

BIGDATA
----------------------------------------------------------------------------------
BIGDATA is defined by the "3 Vs" , and HADOOP is the MAJOR SOLUTION for all of them:

STORAGE    --> 1. VOLUME ( Size of Data ) --- TB , PB , EB , ZB , YB , BB , SB
               2. VARIETY --- StrData , SemiStrData , UnStrData

PROCESSING --> 3. VELOCITY --- Speed of Retrieving Data

HADOOP is meant for Dual Purposes i.e. BigData STORAGE & PROCESSING

HDFS MAPREDUCE

PIG

HIVE

SQOOP

HBASE

OOZIE

FLUME

HADOOP is an "Open Source" Technology/Framework

-----------------------------------------------------------------------------------

BASIC TERMINOLOGIES USED IN BIGDATA HADOOP PROJECT

-----------------------------------------------------------------------------------

1. NODE --- Any normal System/Machine/Laptop/Desktop is nothing but a "Node"

2. CLUSTERED NODE --- Any Node which is part of some "Common & Dedicated Network" is nothing but a "Clustered Node".

3. CLUSTER ---- Collection of all the Clustered Nodes under one "Common & Dedicated Network" is nothing but a "Cluster".

4. HADOOP CLUSTERED NODE ---- On any normal Clustered Node , if both "HDFS" ( Storage Component of Hadoop ) and "MAPREDUCE" ( Processing Component of Hadoop ) are running , we can call that node a "Hadoop Clustered Node".

                              OR

                              On any normal Clustered Node , if we are installing Hadoop , that node is known as a "Hadoop Clustered Node" ( because through the Hadoop Installation , by default we are getting HDFS and MAPREDUCE ).

5. HADOOP CLUSTER ---- Collection of Hadoop Clustered Nodes which are part of some "Common & Dedicated Network" is a "Hadoop Cluster".

6. HADOOP CLUSTER SIZE ---- The number of nodes present in the Hadoop Cluster , including the Master Node , is the "Hadoop Cluster Size".

-----------------------------------------------------------------------------------

COMPATIBLE "OPERATING SYSTEMS" and PRE-REQ SOFTWARES for HADOOP Installation

-----------------------------------------------------------------------------------

1. LINUX ---- UBUNTU , CENTOS , REDHAT , FEDORA , MINT , SUSE ................

2. MAC OS

3. SUN SOLARIS

4. WINDOWS -----> used in hardly 0.0001% of cases ---> we have to download a utility called "Cygwin" and then install Hadoop on top of it
---------------------------------------------------------------------------------

PRE-REQ SOFTWARES

---------------------------

To install any version of Hadoop , JAVA - 1.6 or any above version is mandatorily required.

HADOOP - 1.X [ 2010 ] =======> JAVA - 1.6

HADOOP - 2.X [ 2015 ] =======> JAVA - 1.7

HADOOP - 3.X [ 2019 ] =======> JAVA - 1.8

**** JAVA is the Mother Programming Language for HADOOP
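
Both pre-requisites can be verified from the terminal before and after the installation ; a minimal check ( the commands are standard , the expected versions are for a Hadoop 2.x setup as described above ):

    # confirm the installed Java version ( must be 1.7 or above for Hadoop 2.x )
    java -version

    # confirm the Hadoop version after installation
    hadoop version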

=================================================================================

SOW ( Statement Of Work )
-----------------------------------------------------------------------------------
CISCO ------------------------------------------------------------------------ INFY

Nodes: 1 , 2 , 3 , 4 ... 100 --- 1 Year --- $XYZ

Which Distribution of Hadoop are we using?

WHAT ARE THE DIFF DISTRIBUTIONS OF HADOOP?


-------------------------------------------------------------

1. CDH ( Cloudera Distribution for Hadoop ) / CDP ( Cloudera Data Platform )
                                ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> YES

2. Hortonworks ( HDP )          ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> YES

3. MapR                         ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> YES

4. Apache                       ---> Open Source(OS) Dist     ---> YES
                                ---> Enterprise Edition(EE)   ---> NO

Air Conditioner
------------------

1. LG---------------> Rs. 34000/- -------------> 0 Yrs Warranty


2. SAMSUNG------> Rs. 36000/- -------------> 0 Yrs Warranty
3. LLOYD----------> Rs. 44000/- -------------> 1 Yrs Warranty
4. VOLTAS--------> Rs. 33000/- -------------> 0 Yrs Warranty
5. BLUESTAR-----> Rs. 55000/- -------------> 2 Yrs Warranty
6. O'GENERAL---> Rs. 65000/- -------------> 5 Yrs Warranty

Windows ---> VMware Workstation ----> HADOOP - 2.6.0

============================================================================

Informatica ----------------------------- MS-SQLSERVER / ORACLE / DB2
DataStage ------------------------------- "
Java -------------------------------------- "
Dot Net ----------------------------------- "

95 Products ....... 70 ZBs of data ( Str , SemiStr & UnStr )
-----------------------------------------------------------------------------------

LEGACY PROJECT MIGRATION TO BIGDATA HADOOP PLATFORM

------------------------------------------------------------------------------

What is "Data Locality(DL)" Design Rule in Hadoop ?


--------------------------------------------------------------

In Hadoop the input data must be available on HDFS before processing commence..i.e.
small volume of BUSINESS LOGIC
will move near to HUGE VOLUME of input data.
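
A minimal sketch of Data Locality in practice ( the file name and HDFS paths below are assumptions , the commands themselves are standard Hadoop 2.x ) : first the data is placed on HDFS , and only then is the small business logic shipped to it as a job.

    # STEP 1: move the huge input data onto HDFS
    hdfs dfs -mkdir -p /user/gopalkrishna/input
    hdfs dfs -put /home/gopalkrishna/logs/A.log /user/gopalkrishna/input/

    # STEP 2: submit the (small) business logic ; it runs where the blocks live
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
           wordcount /user/gopalkrishna/input /user/gopalkrishna/output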

NFS(Network File System) Mount Point

SFTP ( Secured File Transfer Protocol )

Live Cricket Match ------------------------------------------------------- HDFS

8:00 PM ( 20GB ) ---------------------------------------------------------- HDFS ------ copied by 8:20 PM ( 20 Mins )
8:20 PM ( 32GB ) ---------------------------------------------------------- HDFS ------ copied by 8:32 PM ( 12 Mins )
8:32 PM ( 40GB ) ---------------------------------------------------------- HDFS ------ copied by 8:40 PM ( 8 Mins )

Appl Log Server


Web Log Server
Stock Exch Server
Sensex Log Server
Extn News Feed
Twitter

Live Streaming Data ( Streaming Data )

cp
scp
distcp

================================================
UNDER FIRST BOX( DATA INGESTION)
---------------------------------------------

1. SQOOP ( RDBMS ----------> HDFS )
2. SFTP SCRIPTs ( NFS MOUNT POINT ------------> HDFS )
3. FLUME/KAFKA ( From Live Streaming Appl --------------> HDFS )
   ( sample ingestion commands are shown below )
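
A minimal sketch of the first two ingestion routes ( the hostnames , database , table and directory names are assumptions ; the sqoop / hdfs / distcp commands themselves are standard ):

    # 1. RDBMS ----------> HDFS using SQOOP
    sqoop import --connect jdbc:mysql://dbhost/salesdb --table orders \
          --username etl_user -P --target-dir /data/raw/orders

    # 2. NFS MOUNT POINT ( local file system ) ------------> HDFS
    hdfs dfs -put /mnt/nfs/logs/web.log /data/raw/weblogs/

    # Hadoop Cluster ------------> Hadoop Cluster ( the "distcp" mentioned above )
    hadoop distcp hdfs://clusterA:8020/data/raw hdfs://clusterB:8020/data/raw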

UNDER SECOND BOX ( DATA ANALYTICS )


---------------------------------------------------
1. MAP REDUCE ( If UnStr Data is Present )
2. PIG
3. HIVE
----------------
4. SPARK --- 3.X

UNDER THIRD BOX ( DOWNSTREAM APPLICATIONS TO CONSUME THE PROCESSED DATA )

-----------------------------------------------------------------------------------
1. TABLEAU
2. POWER BI
3. COGNOS
4. QLIKVIEW
5. QLIK SENSE
6. MATLAB

=================================================================================

DATA STORAGE ON HDFS


-----------------------------

BLOCK --------------------> The Smallest & Individual Storage Unit on HDFS is a "BLOCK".

                       HADOOP - 1.X (2010)          HADOOP - 2.X (2015)          HADOOP - 3.X (2019)
                       -------------------          -------------------          -------------------
BLOCK SIZE --------> = 64MB (Default&MinSize)     = 128MB (Default&MinSize)    = 128MB (Default&MinSize)
                     = 128MB                      = 256MB                      = 256MB
                     = 256MB                      = 512MB                      = 512MB
                     = 512MB                      = 1024MB (=1GB)              = 1024MB (=1GB)
                     = 1024MB (=1GB)
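
The block size can also be overridden per file at write time ; a minimal sketch ( the file and target directory are assumptions , -D with dfs.blocksize is the standard generic option in Hadoop 2.x ):

    # store this one file with a 256MB block size ( 268435456 bytes ) instead of the configured default
    hdfs dfs -D dfs.blocksize=268435456 -put /home/gopalkrishna/data/C.txt /data/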

DESIGN PRINCIPLES OF HDFS BLOCK SIZE


---------------------------------------------------

1. Irrespective of the file size in Hadoop , for each and every file a dedicated number of blocks will be there.

2. Except the Last Block , all the remaining blocks hold an Equal Volume of Data ; the Last Block may or may not be full.

   Example: If a file consists of 10 Blocks   ==> 9 blocks hold an Equal Volume of data & the 10th Block may/may not be full
            If a file consists of 100 Blocks  ==> 99 blocks hold an Equal Volume of data & the 100th Block may/may not be full
            If a file consists of 500 Blocks  ==> 499 blocks hold an Equal Volume of data & the 500th Block may/may not be full
            If a file consists of 1000 Blocks ==> 999 blocks hold an Equal Volume of data & the 1000th Block may/may not be full
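
A small worked example ( the file size is an assumed number , the arithmetic simply follows the rule above ): with a configured BlockSize of 128MB , a 300MB file is stored as 3 blocks --- Block 1 = 128MB , Block 2 = 128MB , and the Last Block ( Block 3 ) = only the remaining 44MB.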

If any file storage request is coming to the Hadoop Cluster:

STEP 1: The request will always be received by the Hadoop Master Node only.

STEP 2: Based on the Configured BlockSize at that point of time , the file data will be divided into blocks ; the Master Node will only keep the Metadata and will move the actual data to the Slave Nodes.
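
Once a file is stored , the block layout the Master Node recorded can be inspected ; a minimal sketch ( the HDFS path is an assumption , the fsck options are standard ):

    # lists every block of the file , its size , and which Slave Nodes hold it
    hdfs fsck /user/gopalkrishna/input/A.log -files -blocks -locations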

-----------------------------------------------------------------------------------

HOW TO CONFIGURE BLOCK SIZE IN HADOOP ?

--------------------------------------------------------

/home/gopalkrishna/INSTALL/hadoop-2.6.0/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>

64MB  = 64 x 1024 x 1024  = 67108864 bytes
128MB = 128 x 1024 x 1024 = 134217728 bytes
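
After restarting HDFS , the effective value can be cross-checked from the command line ; a minimal sketch ( dfs.blocksize is the Hadoop 2.x name of this property , dfs.block.size being its older deprecated alias ):

    # prints the configured block size in bytes
    hdfs getconf -confKey dfs.blocksize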

-----------------------------------------------------------------------------------
WHENEVER THERE IS A CHANGE IN "BLOCKSIZE" , WHAT WILL BE THE IMPACT ON EXISTING "BLOCKS" ?
-----------------------------------------------------------------------------------

CISCO ----- JAN - MAR ( GB ---- SMEs -->64MB)

A.log --> 200MB


B.log --> 200MB

APR - JUNE --> TB , PB -- SMEs ( 128MB )

C.txt ---> 200MB

Ans: There will NOT be any impact on the already existing blocks.... because the Hadoop Master Node will only use the BLOCKSIZE at the time of WRITING(STORING) the data ... but NOT at the time of READING the data ( for Reading , the Master Node will only use the MetaData Information of the blocks ).

-----------------------------------------------------------------------------------
WHAT IS REPLICATION IN HADOOP?
---------------------------------------------

BACK UP MECHANISM or FAIL OVER MECHANISM or FAULT TOLERANT MECHANISM

------------------------------------ "REPLICATION" -----------------------------------------------------

REPLICATION : Replication is the process of duplicating the file system data on multiple Slave Nodes of the Hadoop Cluster to achieve Highly Available Processing in Hadoop.

IN HADOOP
---------------

MIN REPLICATION = 1 Time ( only possible in case of a Single Node Hadoop Cluster )

DEFAULT REPLICATION = 3 TIMES ( we need not configure this value as it is the default anyway )

MAX REPLICATION = 512 TIMES
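
The replication factor of data that is already on HDFS can also be changed file by file ; a minimal sketch ( the path is an assumption , -setrep is the standard command and -w waits until the new factor is reached ):

    # raise this file's replication factor to 4
    hdfs dfs -setrep -w 4 /data/raw/A.log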

DESIGN RULES OF HADOOP REPLICATION


---------------------------------------------------

1. Replication is only applicable to the Hadoop File System Data .... but NOT to the MetaData ( Data about Data ).

2. Keep ONE Replica per ONE Slave Node as per the design:

   To keep a 3 times Replication factor    = min 3 Slave Nodes must be required
   To keep a 10 times Replication factor   = min 10 Slave Nodes must be required
   To keep an "N" times Replication factor = min "N" number of Slave Nodes must be required

3. Replication only happens on the Slave Nodes but NOT on the Master Node ( because the Master Node is only meant for MetaData Storage , and for MetaData there is NO Replication in Hadoop ).

-----------------------------------------------------------------------------------

HOW TO CONFIGURE "BLOCK SIZE" & "REPLICATION" IN HADOOP ?


----------------------------------------------------------------------------------

/home/gopalkrishna/INSTALL/hadoop-2.6.0/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.replication</name>
<value>4</value>
</property>
</configuration>
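
Once the above properties are set and HDFS is restarted , both values can be verified from the command line ; a minimal sketch ( standard getconf keys , dfs.blocksize being the 2.x name for the block size property ):

    hdfs getconf -confKey dfs.blocksize       # expected: 134217728
    hdfs getconf -confKey dfs.replication     # expected: 4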

==========================================================================
HOW DOES THE ACTUAL REPLICATION OF DATA HAPPEN IN THE BACKEND ?

What is "Rack Awareness" in Hadoop?


-------------------------------------------------

Based on the geographical locations of the Slave Nodes of Hadoop , the Master Node divides the nodes into different Racks.

RACK 1
---------
S1 - HYD
S2 - BAN

RACK2
--------
S3 - MAL

RACK3
---------
S4 - NJ
S5 - NY

**** NOTE: In any particular RACK , as per the architecture , at most "N-1" number of Replicas will be stored ( where "N" is the Replication Factor ).

**** NOTE: The Rack Awareness concept is a pure logical concept ... there is no physical storage(configuration) for this.
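
The rack assignment the Master Node is currently using can be viewed from the command line ; a minimal sketch ( a standard admin command , its output depends on the cluster ):

    # prints each Rack and the Data Nodes placed under it
    hdfs dfsadmin -printTopology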

===================================================================================

HADOOP - 1.X ARCHITECTURE


-----------------------------------------

STORAGE ( HDFS ) ARCHITECTURE


--------------------------------------------
1. NAME NODE
2. DATA NODE
3. SECONDARY NAME NODE

PROCESSING(MAPREDUCE) ARCHITECTURE

------------------------------------------------------
4. JOB TRACKER
5. TASK TRACKER

NOTE: The above 5 architectural components are also known as "5 Daemons of Hadoop".

Daemon ---> A background Process

1. NAME NODE -----> 1. In any Hadoop Cluster , the Master Node is what we call the "Name Node".
                    2. However big the Hadoop Cluster size might be , the Name Node is always a single player i.e. we will not see multiple Name Nodes running at the same point of time.
                    3. The Name Node is exclusively meant for "MetaData(File System NameSpace) Storage".

2. DATA NODE ------> 1. In any Hadoop Cluster , the Slave Nodes are what we call the "Data Nodes".
                     2. Unlike the Name Node , which is a single player in the Hadoop Cluster , there is NO MAX LIMIT for the number of Data Nodes in a Hadoop Cluster.
                     3. A Data Node is exclusively meant for "Actual Data Storage in the form of BLOCKS"... it will NOT manage the metadata.

3. SECONDARY NAME NODE ----> 1. Whenever the Primary Name Node is down .... the Secondary Name Node will come into the picture as a back up node for the Primary Name Node.
                             2. However , the Secondary Name Node is NOT a 100% backup Node for the Primary Name Node because the Secondary Name Node should be started manually.

NOTE: In the Hadoop - 1.X version --> whenever the Primary Name Node is down , that problem is what we used to call SPOF ( Single Point Of Failure ).
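
All 5 Daemons can be seen as Java background processes once the cluster is started ; a minimal check ( jps is a standard JDK command , the process list below is only what a Hadoop 1.X single node would typically show ):

    jps
    # NameNode
    # SecondaryNameNode
    # DataNode
    # JobTracker
    # TaskTracker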

L( Load) T(Transform) E ( Extract )


Local File System(LFS) --- VS --- Dist File System(DFS) --- VS --- HDFS(Hadoop Dist File System)
---------------------------

www.amazon.com -----> 1 min ---> 8.9

1997 --- www.yahoo.com ----> 51 ---> $1500

HA( High Available Processing)
