Skype ID – edureka.hadoop
Email – hadoop@edureka.in
Venkat – venkat@edureka.in
Course Topics
Week 1 – Introduction to HDFS
Week 2 – Setting Up Hadoop Cluster
Week 3 – Map-Reduce Basics, Types and Formats
Week 4 – PIG
Week 5 – HIVE
Week 6 – HBASE
Week 7 – ZOOKEEPER
Week 8 – SQOOP
Recap of Week 1
• NameNodes
• DataNodes
Main Components of HDFS:
NameNode:
– Master of the system
– Maintains and manages the blocks that are present on the DataNodes
– Single point of failure for the Hadoop cluster
– Manages block replication
DataNodes:
– Slaves deployed on each machine; they provide the actual storage
– Responsible for serving read and write requests from the clients
Secondary Name Node
It is important to make the NameNode resilient to failure, and one technique for
doing this is to run a Secondary NameNode.
This node is also called a Checkpoint Node, as it manages the edit log and
checkpointing of NameNode metadata (once per hour, or when the edits log reaches
64 MB in size). Please note that it does not provide NameNode failover.
The Secondary NameNode copies the FSImage and transaction log from the NameNode to a
temporary directory. It then merges the FSImage and transaction log into a new FSImage
in that temporary location and uploads the new FSImage to the NameNode.
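The checkpoint interval and size trigger mentioned above are controlled by configuration properties. A minimal sketch of the relevant Hadoop 1.x settings (the values shown are the usual defaults):

```xml
<!-- core-site.xml: Secondary NameNode checkpoint tuning -->
<configuration>
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>      <!-- checkpoint once per hour (seconds) -->
  </property>
  <property>
    <name>fs.checkpoint.size</name>
    <value>67108864</value>  <!-- or when the edits log reaches 64 MB (bytes) -->
  </property>
</configuration>
```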
NameNode Metadata
Meta-data in Memory
1. The entire metadata is in main memory
2. No demand paging of FS meta-data
Types of Metadata
1. List of files
2. List of Blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g., access time, replication factor
A Transaction Log
1. Records file creations, file deletions, etc.
JobTracker and TaskTracker:
HDFS Architecture
Job Tracker
Job Tracker Contd.
HDFS Client Creates a New File
Rack Awareness
Anatomy of a File Write:
Anatomy of a File Read:
Terminal Commands
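As a sketch of the kind of terminal commands covered here, the standard `hadoop fs` shell works against HDFS paths (the paths below are examples, not part of the course material):

```shell
# Common HDFS shell commands (require a running cluster)
hadoop fs -ls /                        # list the HDFS root directory
hadoop fs -mkdir /user/data            # create a directory in HDFS
hadoop fs -put local.txt /user/data/   # copy a local file into HDFS
hadoop fs -cat /user/data/local.txt    # print a file's contents
hadoop fs -rm /user/data/local.txt     # delete a file
hadoop fsck / -files -blocks           # check file and block health
```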
Web UI URLs
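Assuming the default Hadoop 1.x ports on a pseudo-distributed setup, the daemon web UIs are typically reachable at:

```
NameNode    http://localhost:50070/   HDFS health, browse the file system
JobTracker  http://localhost:50030/   job and tasktracker status
TaskTracker http://localhost:50060/   per-node task status
```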
• Pseudo-distributed Mode
– All daemons running on single machine, a cluster simulation on one machine
– Good for Test Environment
– Core: core-site.xml
– HDFS: hdfs-site.xml
– MapReduce: mapred-site.xml
core-site.xml and hdfs-site.xml

<!--hdfs-site.xml-->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!--core-site.xml-->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>
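With these files in place, a pseudo-distributed cluster is typically initialized and started as follows (a sketch using the standard Hadoop 1.x scripts; run from the Hadoop installation with `bin/` on the PATH):

```shell
# One-time setup: format the HDFS filesystem
# (destroys any existing HDFS metadata -- only for a fresh install)
hadoop namenode -format

# Start the HDFS daemons: NameNode, DataNode, Secondary NameNode
start-dfs.sh

# Start the MapReduce daemons: JobTracker, TaskTracker
start-mapred.sh

# Verify that the daemon JVMs are running
jps
```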
Defining HDFS details in hdfs-site.xml
Property Value Description
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Defining mapred-site.xml
Property / Value / Description

mapred.job.tracker
  Value: localhost:8021
  Description: The hostname and the port that the jobtracker's RPC server runs
  on. If set to the default value of local, then the jobtracker is run
  in-process on demand when you run a MapReduce job (you don't need to start
  the jobtracker in this case, and in fact you will get an error if you try to
  start it in this mode).

mapred.local.dir
  Value: ${hadoop.tmp.dir}/mapred/local
  Description: A list of directories where MapReduce stores intermediate data
  for jobs. The data is cleared out when the job ends.

mapred.system.dir
  Value: ${hadoop.tmp.dir}/mapred/system
  Description: The directory, relative to fs.default.name, where shared files
  are stored during a job run.

mapred.tasktracker.map.tasks.maximum
  Value: 2
  Description: The number of map tasks that may be run on a tasktracker at any
  one time.

mapred.tasktracker.reduce.tasks.maximum
  Value: 2
  Description: The number of reduce tasks that may be run on a tasktracker at
  any one time.
Critical Properties
fs.default.name
hadoop.tmp.dir
mapred.job.tracker
Slaves and masters
Two files are used by the startup and shutdown commands:
slaves
masters
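As an illustration of what these two files contain (the hostnames below are assumptions, not from the course), each lists one host per line:

```
# conf/masters -- host that runs the Secondary NameNode
master.example.com

# conf/slaves -- hosts that run the DataNode and TaskTracker daemons
slave1.example.com
slave2.example.com
```

The `start-dfs.sh` and `stop-dfs.sh` scripts read these files to decide which machines to SSH into when starting or stopping the cluster daemons.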
hadoop-env.sh file:
• hadoop-env.sh sets the environment variables (such as JAVA_HOME) used by the
Hadoop daemon JVMs.
• This file also offers a way to provide custom parameters for each of the
servers.
• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in
the conf/ directory of the installation.
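A minimal sketch of a hadoop-env.sh excerpt (the JAVA_HOME path is an assumption for illustration; the variable names are the standard Hadoop 1.x ones):

```shell
# hadoop-env.sh (excerpt)

# Required: where the JVM is installed (path is an example)
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Heap size for each daemon JVM, in MB (default is 1000)
export HADOOP_HEAPSIZE=1000

# Custom JVM options for a specific daemon, e.g. a larger NameNode heap
export HADOOP_NAMENODE_OPTS="-Xmx2g $HADOOP_NAMENODE_OPTS"
```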
Reporting
hadoop-metrics.properties
Hadoop Core
2. Copy the Secondary NameNode's fs.checkpoint.dir to the NameNode's
   dfs.name.dir
3. Copy the Secondary NameNode's fs.checkpoint.edits to the NameNode's
   dfs.name.edits.dir
4. When the copy completes, start the NameNode and restart the
   Secondary NameNode
Clarifications
Q & A..?
Thank You
See You in Class Next Week