------------
BIGDATA
----------------------------------------------------------------------------------
HADOOP
HADOOP is meant for a Dual Purpose i.e. BigData STORAGE & PROCESSING
HDFS
MAPREDUCE
PIG
HIVE
SQOOP
HBASE
OOZIE
FLUME
-----------------------------------------------------------------------------------
-------------------------------
2. CLUSTERED NODE --- Any Node which is part of some "Common & Dedicated Network"
is nothing but a "Clustered Node".
3. CLUSTER ---- Collection of all the clustered nodes under one "Common & Dedicated
Network" is nothing but a "Cluster".
4. HADOOP CLUSTERED NODE ---- On any normal clustered node, if both "HDFS"
(Storage Component of Hadoop) and "MAPREDUCE" (Processing Component of Hadoop)
are running, we can call that node a "Hadoop Clustered Node".
OR
On any normal clustered node, if we install Hadoop, that node is known as a
"Hadoop Clustered Node" (because through the Hadoop installation we get
HDFS and MAPREDUCE by default).
5. HADOOP CLUSTER ---- Collection of Hadoop Clustered Nodes which are part of some
"Common & Dedicated Network" is a "Hadoop Cluster".
6. HADOOP CLUSTER SIZE ---- The number of nodes present in the Hadoop Cluster,
including the Master Node, is the "Hadoop Cluster Size".
-----------------------------------------------------------------------------------
---------------------------------------------------------
2. MAC OS
3. SUN SOLARIS
4. WINDOWS -----> 0.0001% ---> Download a Unix-like environment called "Cygwin",
on top of which we have to install Hadoop
---------------------------------------------------------------------------------
To install any version of Hadoop, JAVA 1.6 or any later version is required.
=================================================================================
CISCO / INFY ---> Cluster of 1 to 100 nodes --- 1 Year --- $XYZ
1. CDH ( Cloudera Distribution for Hadoop ) / CDP ( Cloudera Data Platform ) --->
Open Source(OS) Dist ---> YES
============================================================================
In Hadoop the input data must be available on HDFS before processing commences,
i.e. the small volume of BUSINESS LOGIC will move near to the HUGE VOLUME of
input data (data locality).
cp     ---> copy files within the same (local) file system
scp    ---> secure copy of files between two machines over the network
distcp ---> distributed copy, a MapReduce-based tool for copying data between Hadoop clusters
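A rough back-of-the-envelope comparison shows why moving logic beats moving data; the sizes and link speed below are illustrative assumptions, not figures from the course:

```python
# Rough illustration of why Hadoop ships business logic to the data
# instead of shipping data to the logic. All numbers are assumptions.

LINK_BYTES_PER_SEC = 1_000_000_000 / 8   # assume a 1 Gbps network link

data_bytes = 1 * 1024**4                 # 1 TB of input data already on HDFS
logic_bytes = 1 * 1024**2                # ~1 MB of packaged business logic

time_to_move_data = data_bytes / LINK_BYTES_PER_SEC    # seconds
time_to_move_logic = logic_bytes / LINK_BYTES_PER_SEC  # seconds

print(f"moving data : {time_to_move_data:,.0f} s")     # hours of transfer
print(f"moving logic: {time_to_move_logic:.4f} s")     # well under a second
```

Shipping the terabyte takes hours; shipping the jar takes milliseconds, which is exactly why the business logic travels to the data.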
================================================
UNDER FIRST BOX (DATA INGESTION)
---------------------------------------------
=================================================================================
              HADOOP-1.X (2010)         HADOOP-2.X (2015)         HADOOP-3.X (2019)
              -----------------------   -----------------------   -----------------------
BLOCK SIZE -> = 64MB(Default&MinSize)   = 128MB(Default&MinSize)  = 128MB(Default&MinSize)
              = 128MB                   = 256MB                   = 256MB
              = 256MB                   = 512MB                   = 512MB
              = 512MB                   = 1024MB(=1GB)            = 1024MB(=1GB)
              = 1024MB(=1GB)
1. Irrespective of the file size, in Hadoop each and every file gets its own
dedicated number of blocks.
2. Except the Last Block, all the remaining blocks hold an Equal Volume of Data.
   If a file consists of 100 Blocks  ==> the first 99 blocks hold an Equal Volume of data & the 100th Block May/May Not be full.
   If a file consists of 500 Blocks  ==> the first 499 blocks hold an Equal Volume of data & the 500th Block May/May Not be full.
   If a file consists of 1000 Blocks ==> the first 999 blocks hold an Equal Volume of data & the 1000th Block May/May Not be full.
STEP 1: The request will always be received by the Hadoop Master Node only.
STEP 2: Based on the BlockSize configured at that point of time, the file data
        will be divided into blocks; the Master Node will keep only the Metadata,
        moving the actual data to the Slave Nodes.
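The splitting in STEP 2 can be sketched as follows; `BLOCK_SIZE` here assumes the 128MB Hadoop-2.X default:

```python
import math

# Sketch of how a file is divided into HDFS blocks (illustrative only).
BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB, the Hadoop-2.X default

def split_into_blocks(file_size_bytes):
    """Return (number_of_blocks, size_of_last_block_in_bytes) for a file."""
    if file_size_bytes == 0:
        return 0, 0
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    last_block = file_size_bytes - (n_blocks - 1) * BLOCK_SIZE
    return n_blocks, last_block

# A 300 MB file -> two full 128 MB blocks plus a partial 44 MB last block,
# matching the rule that only the last block may be less than full.
print(split_into_blocks(300 * 1024 * 1024))   # (3, 46137344)
```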
-----------------------------------------------------------------------------------
-------------------------------------------------
/home/gopalkrishna/INSTALL/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.block.size</name>
<value>134237584</value>
</property>
</configuration>
64MB  = 67108864 bytes
128MB = 134217728 bytes
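The conversions above can be checked with a one-liner (1 MB = 1024 * 1024 bytes):

```python
# MB -> bytes conversion used when setting dfs.block.size in hdfs-site.xml.
def mb_to_bytes(mb):
    return mb * 1024 * 1024

for mb in (64, 128, 256, 512, 1024):
    print(f"{mb}MB = {mb_to_bytes(mb)} bytes")
```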
-----------------------------------------------------------------------------------
---------------------------------------------------
WHENEVER THERE IS A CHANGE IN "BLOCKSIZE", WHAT WILL BE THE IMPACT ON EXISTING "BLOCKS"?
-----------------------------------------------------------------------------------
-------------------------------------------------
Ans: There will NOT be any impact on already existing blocks, because the Hadoop
Master Node uses the BLOCKSIZE only at the time of WRITING (STORING) the data,
NOT at the time of READING the data (for Reading, the Master Node uses only the
MetaData information of the blocks).
-----------------------------------------------------------------------------------
-----------------------------------------------------------
WHAT IS REPLICATION IN HADOOP?
---------------------------------------------
IN HADOOP
---------------
MIN REPLICATION     = 1 Time ( only possible in the case of a Single Node Hadoop Cluster )
DEFAULT REPLICATION = 3 TIMES ( we need not configure this value as it is the default anyway )
3. Replication happens only on the Slave Nodes but NOT on the Master Node ( because
the Master Node is meant only for Metadata storage, and for MetaData there is NO
Replication in Hadoop ).
-----------------------------------------------------------------------------------
---------------
/home/gopalkrishna/INSTALL/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.block.size</name>
<value>134237584</value>
</property>
<property>
<name>dfs.replication</name>
<value>4</value>
</property>
</configuration>
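With dfs.replication set to 4 as in the config above, every block of a file is stored four times across the Slave Nodes, so the physical footprint is four times the logical file size:

```python
# Sketch: physical storage consumed on the cluster for a file, given
# the replication factor (4 matches the hdfs-site.xml above).
REPLICATION = 4

def physical_bytes(logical_file_bytes, replication=REPLICATION):
    """Every block of the file is stored `replication` times on the slave nodes."""
    return logical_file_bytes * replication

one_gb = 1024**3
print(physical_bytes(one_gb))   # a 1 GB file occupies 4 GB of cluster storage
```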
==========================================================================
HOW THE ACTUAL REPLICATION OF DATA HAPPENS IN BACKEND ?
Based on the geographical locations of the Slave Nodes of Hadoop, the Master Node
divides the nodes into different Racks:
RACK 1
---------
S1 - HYD
S2 - BAN
RACK2
--------
S3 - MAL
RACK3
---------
S4 - NJ
S5 - NY
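The placement of replicas across the racks above can be sketched as below. This is a simplified illustration loosely following the default HDFS policy for 3 replicas (one replica on a first rack, the other two on different nodes of a second rack), not the Name Node's exact algorithm:

```python
import random

# Slave nodes grouped into racks, mirroring the example above.
RACKS = {
    "RACK1": ["S1-HYD", "S2-BAN"],
    "RACK2": ["S3-MAL"],
    "RACK3": ["S4-NJ", "S5-NY"],
}

def place_replicas(racks, seed=0):
    """Pick 3 distinct nodes: one on a first rack, two on a second rack."""
    rng = random.Random(seed)
    first_rack = rng.choice(sorted(racks))
    first = rng.choice(racks[first_rack])
    # For the remaining two replicas, pick a different rack with >= 2 nodes.
    candidates = [r for r in sorted(racks)
                  if r != first_rack and len(racks[r]) >= 2]
    second_rack = rng.choice(candidates)
    second, third = rng.sample(racks[second_rack], 2)
    return [first, second, third]

print(place_replicas(RACKS))
```

Spreading replicas over two racks means a whole rack can fail without losing the block.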
===================================================================================
PROCESSING(MAPREDUCE) ARCHITECTURE
------------------------------------------------------
4. JOB TRACKER
5. TASK TRACKER
NOTE: The above 5 architectural components are also known as "5 Daemons of Hadoop".
1. NAME NODE -----> 1. In any Hadoop Cluster, the Master Node is what we call the
                       "Name Node".
                    2. However big the Hadoop Cluster may be, the Name Node is
                       always a single player, i.e. we will not see multiple Name
                       Nodes running at the same point of time.
                    3. The Name Node is exclusively meant for
                       "MetaData (File System NameSpace) Storage".
2. DATA NODE ------> 1. In any Hadoop Cluster, a Slave Node is what we call a
                        "Data Node".
                     2. Unlike the Name Node, which is a single player in the
                        Hadoop Cluster, there is NO MAX LIMIT on the number of
                        Data Nodes in a Hadoop Cluster.
                     3. A Data Node is exclusively meant for "Actual Data Storage
                        in the form of BLOCKS"; it does NOT manage the metadata.
NOTE: In the Hadoop-1.X version, whenever the Primary Name Node goes down, that
problem is known as SPOF ( Single Point Of Failure ).