
Connect with us

• 24x7 Support on Skype, Email & Phone
• Skype ID – edureka.hadoop
• Email – hadoop@edureka.in
• Call us – +91 88808 62004
• Venkat – venkat@edureka.in
Course Topics

• Week 1 – Introduction to HDFS
• Week 2 – Setting Up Hadoop Cluster
• Week 3 – Map-Reduce Basics, Types and Formats
• Week 4 – PIG
• Week 5 – HIVE
• Week 6 – HBASE
• Week 7 – ZOOKEEPER
• Week 8 – SQOOP

Recap of Week 1

• What is Big Data
• What is Hadoop
• Hadoop Eco-System Components
• Why DFS
• Features of HDFS
• Areas where HDFS is not a good fit
• Block Abstraction in HDFS
• HDFS Components:
  – NameNodes
  – DataNodes

Main Components of HDFS:

• NameNode:
  – Master of the system
  – Maintains and manages the blocks which are present on the DataNodes
  – Single point of failure for the Hadoop cluster
  – Manages block replication

• DataNodes:
  – Slaves which are deployed on each machine and provide the actual storage
  – Responsible for serving read and write requests for the clients
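
As a quick sanity check of these two roles, the commands below are a minimal sketch, assuming a running Hadoop 1.x cluster with $HADOOP_HOME/bin on the PATH; the file path is a placeholder. They ask the NameNode for its view of the DataNodes and of block placement.

    # Ask the NameNode for cluster capacity and the list of live/dead DataNodes
    hadoop dfsadmin -report

    # Ask the NameNode which blocks make up a file and which DataNodes hold each replica
    # (/user/edureka/sample.txt is a placeholder path)
    hadoop fsck /user/edureka/sample.txt -files -blocks -locations
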
Secondary Name Node

It is important to make the namenode resilient to failure, and one technique for doing
this is to run a Secondary NameNode.

The secondary namenode usually runs on a separate physical machine, since it requires
plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy
of the merged namespace image, which can be used in the event of the namenode failing.

This node is also called a Checkpoint Node, as it manages the edit log and the
checkpointing of namenode metadata (once per hour, or when the edits log reaches 64 MB
in size). Please note that it does not provide namenode failover.

The Secondary NameNode copies the FSImage and the transaction log from the NameNode to
a temporary directory, merges them into a new FSImage in that temporary location, and
then uploads the new FSImage back to the NameNode.
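
The checkpoint cycle can also be exercised by hand. The sketch below assumes a Hadoop 1.x installation with the secondary namenode configured; the -geteditsize and -checkpoint options belong to the secondarynamenode command in that release line.

    # Report the current size of the edits log on the NameNode
    hadoop secondarynamenode -geteditsize

    # Force a checkpoint now, even if the edits log has not reached the size threshold
    hadoop secondarynamenode -checkpoint force
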
NameNode Metadata

Metadata in Memory
1. The entire metadata is kept in main memory
2. No demand paging of FS metadata

Types of Metadata
1. List of files
2. List of blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g. access time, replication factor

A Transaction Log
1. Records file creations, file deletions, etc.
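
The persistent side of this metadata lives under the NameNode's dfs.name.dir. The listing below is a minimal sketch; the directory path is an assumption and should be replaced with your configured dfs.name.dir value from hdfs-site.xml.

    # Inspect the NameNode's on-disk metadata
    # (/disk1/hdfs/name is a placeholder; use your dfs.name.dir value)
    ls -l /disk1/hdfs/name/current/
    # In a Hadoop 1.x layout this holds fsimage (namespace image), edits (transaction log),
    # fstime and VERSION
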
JobTracker and TaskTracker:
HDFS Architecture
Job Tracker
Job Tracker Contd.
Job Tracker Contd.
HDFS Client Creates a New File
Rack Awareness
Anatomy of a File Write:
Anatomy of a File Read:
Terminal Commands
Terminal Commands
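
The slides show these commands on screen; for reference, a few of the everyday HDFS shell commands in Hadoop 1.x look like the sketch below (all paths are placeholders).

    hadoop fs -ls /                                      # list the HDFS root directory
    hadoop fs -mkdir /user/edureka/input                 # create a directory in HDFS
    hadoop fs -put localfile.txt /user/edureka/input/    # copy a local file into HDFS
    hadoop fs -cat /user/edureka/input/localfile.txt     # print an HDFS file to the terminal
    hadoop fs -get /user/edureka/input/localfile.txt .   # copy a file back to local disk
    hadoop fs -rmr /user/edureka/input                   # remove a directory recursively (1.x syntax)
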
Web UI URLs

• NameNode status: http://localhost:50070/dfshealth.jsp
• JobTracker status: http://localhost:50030/jobtracker.jsp
• TaskTracker status: http://localhost:50060/tasktracker.jsp
• DataBlock Scanner Report: http://localhost:50075/blockScannerReport
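
If a browser is not handy, the same pages can be probed from the terminal. A minimal sketch, assuming the daemons run on localhost with the default ports listed above:

    # A 200 response indicates the NameNode web UI is up
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/dfshealth.jsp

    # Likewise for the JobTracker
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/jobtracker.jsp
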
Sample Examples List
Running the Teragen Example
Checking the Output
Checking the Output
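
For reference, running the teragen example and checking its output from the terminal looks roughly like the sketch below; the examples jar name varies with the Hadoop 1.x release, and the row count and output path are arbitrary placeholders.

    # Generate 1,000,000 rows of synthetic data into HDFS (jar name depends on your release)
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000 /user/edureka/teragen-out

    # Check the output directory and the size of the generated part files
    hadoop fs -ls /user/edureka/teragen-out
    hadoop fs -du /user/edureka/teragen-out
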
Deployment Modes

• Standalone or Local Mode
  – No daemons running
  – Everything runs in a single JVM
  – Good for development and debugging

• Pseudo-distributed Mode
  – All daemons run on a single machine: a cluster simulation on one machine
  – Good for a test environment

• Fully distributed Mode
  – Hadoop running on multiple machines in a cluster
  – Production environment
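
As an illustration of the pseudo-distributed case, the commands below are a sketch, assuming a configured Hadoop 1.x install with $HADOOP_HOME/bin on the PATH. They start all five daemons on one machine and confirm they are running.

    # Format HDFS once, before the very first start (destroys any existing HDFS metadata)
    hadoop namenode -format

    # Start NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker
    start-all.sh

    # jps should list all five Hadoop daemon processes alongside their JVM ids
    jps
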
Folder View of Hadoop
Hadoop Configuration Files

Configuration Filename – Description

hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop
core-site.xml – Configuration settings for Hadoop Core, such as I/O settings, that are common to HDFS and MapReduce
hdfs-site.xml – Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes
mapred-site.xml – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
masters – A list of machines (one per line) that each run a secondary namenode
slaves – A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties – Properties for controlling how metrics are published in Hadoop
log4j.properties – Properties for system log files, the namenode audit log and the task log for the tasktracker child process
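
All of these files live in the conf/ directory of the installation. A quick check, assuming $HADOOP_HOME points at the install root:

    # The configuration files listed above should all appear here
    ls $HADOOP_HOME/conf
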
DD for each component

• Core – core-site.xml
• HDFS – hdfs-site.xml
• MapReduce – mapred-site.xml

core-site.xml and hdfs-site.xml

hdfs-site.xml:

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

Defining HDFS details in hdfs-site.xml

Property: dfs.data.dir
Value: /disk1/hdfs/data,/disk2/hdfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories. Default: ${hadoop.tmp.dir}/dfs/data

Property: fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. Default: ${hadoop.tmp.dir}/dfs/namesecondary

mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Defining mapred-site.xml

Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).

Property: mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

Property: mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory, relative to fs.default.name, where shared files are stored during a job run.

Property: mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

Property: mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.

Critical Properties

• fs.default.name
• hadoop.tmp.dir
• mapred.job.tracker

Slaves and masters

Two files are used by the startup and shutdown commands:

• slaves
  – contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers

• masters
  – contains a list of hosts, one per line, that are to host secondary NameNode servers
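
A minimal sketch of what the two files might contain, assuming hypothetical hostnames (master1, slave1, slave2) and that $HADOOP_HOME/conf is the configuration directory:

    # masters: the host that will run the secondary NameNode (hostname is a placeholder)
    cat > $HADOOP_HOME/conf/masters <<EOF
    master1
    EOF

    # slaves: one host per line, each running a DataNode and a TaskTracker (hostnames are placeholders)
    cat > $HADOOP_HOME/conf/slaves <<EOF
    slave1
    slave2
    EOF
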
Per-process runtime environment

hadoop-env.sh file:

• Specifies the per-process runtime environment, such as the JVM settings, for the Hadoop servers.
• This file also offers a way to provide custom parameters for each of the servers.
• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
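
Since hadoop-env.sh is itself a shell script, typical entries are plain exports. A minimal sketch; the JAVA_HOME path and the heap size are assumptions to adapt to your machine:

    # conf/hadoop-env.sh (excerpt)
    export JAVA_HOME=/usr/lib/jvm/java-6-sun        # placeholder path to your JDK
    export HADOOP_HEAPSIZE=1000                     # heap size in MB for each Hadoop daemon
    export HADOOP_NAMENODE_OPTS="-Xmx2g"            # example of a per-server custom parameter
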
Reporting

hadoop-metrics.properties:

• This file controls the reporting of metrics.
• The default is not to report.
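
A minimal sketch of turning on file-based reporting for the HDFS metrics context, written from the shell; the dfs.* property names and the FileContext class follow the commented examples that ship with the Hadoop 1.x hadoop-metrics.properties, and the log path is a placeholder:

    # Append a file-based metrics sink for the "dfs" context
    cat >> $HADOOP_HOME/conf/hadoop-metrics.properties <<'EOF'
    dfs.class=org.apache.hadoop.metrics.file.FileContext
    dfs.period=10
    dfs.fileName=/tmp/dfsmetrics.log
    EOF
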
Network Requirements

Hadoop Core:

• Uses Shell (SSH) to launch the server processes on the slave nodes
• Requires a password-less SSH connection between the master and all slave and secondary machines
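
Setting up the password-less connection is usually a one-time key exchange. A minimal sketch, assuming a hadoop user on the master and a hypothetical slave host named slave1:

    # On the master: generate a key pair with an empty passphrase
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

    # Copy the public key to each slave and secondary machine (slave1 is a placeholder hostname)
    ssh-copy-id hadoop@slave1

    # Verify that no password prompt appears
    ssh hadoop@slave1 hostname
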
Namenode Recovery

1. Shut down the secondary NameNode
2. Copy secondary:fs.checkpoint.dir → namenode:dfs.name.dir
3. Copy secondary:fs.checkpoint.edits → namenode:dfs.name.edits.dir
4. When the copy completes, start the NameNode and restart the secondary NameNode
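
A minimal sketch of these steps from the shell, assuming the secondary's fs.checkpoint.dir is /disk1/hdfs/namesecondary and the namenode's dfs.name.dir is /disk1/hdfs/name; both paths and the namenode-host name are placeholders to replace with your configured values.

    # 1. On the secondary: stop the secondary NameNode daemon
    hadoop-daemon.sh stop secondarynamenode

    # 2-3. Copy the checkpoint data onto the NameNode machine
    #      (placeholder paths; use your fs.checkpoint.dir / dfs.name.dir values)
    scp -r /disk1/hdfs/namesecondary/* namenode-host:/disk1/hdfs/name/

    # 4. Start the NameNode, then restart the secondary NameNode
    hadoop-daemon.sh start namenode            # run on the namenode machine
    hadoop-daemon.sh start secondarynamenode   # run on the secondary machine
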
Clarifications

Q & A..?
Thank You
See You in Class Next Week
