
Connect with us

• 24x7 Support on Skype, Email & Phone
• Skype ID – edureka.hadoop
• Email – hadoop@edureka.in
• Call us – +91 88808 62004
• Venkat – venkat@edureka.in
Course Topics

• Week 1 – Introduction to HDFS
• Week 2 – Setting Up Hadoop Cluster
• Week 3 – Map-Reduce Basics, Types and Formats
• Week 4 – PIG
• Week 5 – HIVE
• Week 6 – HBASE
• Week 7 – ZOOKEEPER
• Week 8 – SQOOP

Recap of Week 1

• What is Big Data
• What is Hadoop
• Hadoop Eco-System Components
• Why DFS
• Features of HDFS
• Areas where HDFS is not a good fit
• Block Abstraction in HDFS
• HDFS Components:
  – NameNodes
  – DataNodes

Main Components of HDFS:

• NameNode:
  – Master of the system
  – Maintains and manages the blocks which are present on the DataNodes
  – Single point of failure for the Hadoop cluster
  – Manages block replication

• DataNodes:
  – Slaves which are deployed on each machine and provide the actual storage
  – Responsible for serving read and write requests for the clients
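
As a quick sanity check of these two roles, the commands below are a minimal sketch, assuming a running Hadoop 1.x cluster with $HADOOP_HOME/bin on the PATH; the file path is a placeholder. They ask the NameNode for its view of the DataNodes and of block placement.

    # Ask the NameNode for cluster capacity and the list of live/dead DataNodes
    hadoop dfsadmin -report

    # Ask the NameNode which blocks make up a file and which DataNodes hold each replica
    # (/user/edureka/sample.txt is a placeholder path)
    hadoop fsck /user/edureka/sample.txt -files -blocks -locations
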
Secondary Name Node

It is important to make the namenode resilient to failure, and one technique for doing
this is to run a Secondary NameNode.

The secondary namenode usually runs on a separate physical machine, since it requires
plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy
of the merged namespace image, which can be used in the event of the namenode failing.

This node is also called a Checkpoint Node, as it manages the edit log and the
checkpointing of namenode metadata (once per hour, or when the edits log reaches 64 MB
in size). Please note that it does not provide namenode failover.

The Secondary NameNode copies the FSImage and the transaction log from the NameNode to
a temporary directory, merges them into a new FSImage in that temporary location, and
then uploads the new FSImage back to the NameNode.
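
The checkpoint cycle can also be exercised by hand. The sketch below assumes a Hadoop 1.x installation with the secondary namenode configured; the -geteditsize and -checkpoint options belong to the secondarynamenode command in that release line.

    # Report the current size of the edits log on the NameNode
    hadoop secondarynamenode -geteditsize

    # Force a checkpoint now, even if the edits log has not reached the size threshold
    hadoop secondarynamenode -checkpoint force
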
NameNode Metadata

Metadata in Memory
1. The entire metadata is kept in main memory
2. No demand paging of FS metadata

Types of Metadata
1. List of files
2. List of blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g. access time, replication factor

A Transaction Log
1. Records file creations, file deletions, etc.
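
The persistent side of this metadata lives under the NameNode's dfs.name.dir. The listing below is a minimal sketch; the directory path is an assumption and should be replaced with your configured dfs.name.dir value from hdfs-site.xml.

    # Inspect the NameNode's on-disk metadata
    # (/disk1/hdfs/name is a placeholder; use your dfs.name.dir value)
    ls -l /disk1/hdfs/name/current/
    # In a Hadoop 1.x layout this holds fsimage (namespace image), edits (transaction log),
    # fstime and VERSION
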
JobTracker and TaskTracker:
HDFS Architecture
Job Tracker
Job Tracker Contd.
Job Tracker Contd.
HDFS Client Creates a New File
Rack Awareness
Anatomy of a File Write:
Anatomy of a File Read:
Terminal Commands
Terminal Commands
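
The slides show these commands on screen; for reference, a few of the everyday HDFS shell commands in Hadoop 1.x look like the sketch below (all paths are placeholders).

    hadoop fs -ls /                                      # list the HDFS root directory
    hadoop fs -mkdir /user/edureka/input                 # create a directory in HDFS
    hadoop fs -put localfile.txt /user/edureka/input/    # copy a local file into HDFS
    hadoop fs -cat /user/edureka/input/localfile.txt     # print an HDFS file to the terminal
    hadoop fs -get /user/edureka/input/localfile.txt .   # copy a file back to local disk
    hadoop fs -rmr /user/edureka/input                   # remove a directory recursively (1.x syntax)
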
Web UI URLs

• NameNode status: http://localhost:50070/dfshealth.jsp
• JobTracker status: http://localhost:50030/jobtracker.jsp
• TaskTracker status: http://localhost:50060/tasktracker.jsp
• DataBlock Scanner Report: http://localhost:50075/blockScannerReport
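
If a browser is not handy, the same pages can be probed from the terminal. A minimal sketch, assuming the daemons run on localhost with the default ports listed above:

    # A 200 response indicates the NameNode web UI is up
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/dfshealth.jsp

    # Likewise for the JobTracker
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/jobtracker.jsp
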
Sample Examples List
Running the Teragen Example
Checking the Output
Checking the Output
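
For reference, running the teragen example and checking its output from the terminal looks roughly like the sketch below; the examples jar name varies with the Hadoop 1.x release, and the row count and output path are arbitrary placeholders.

    # Generate 1,000,000 rows of synthetic data into HDFS (jar name depends on your release)
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000 /user/edureka/teragen-out

    # Check the output directory and the size of the generated part files
    hadoop fs -ls /user/edureka/teragen-out
    hadoop fs -du /user/edureka/teragen-out
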
Deployment Modes

• Standalone or Local Mode
  – No daemons running
  – Everything runs in a single JVM
  – Good for development and debugging

• Pseudo-distributed Mode
  – All daemons run on a single machine: a cluster simulation on one machine
  – Good for a test environment

• Fully distributed Mode
  – Hadoop running on multiple machines in a cluster
  – Production environment
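
As an illustration of the pseudo-distributed case, the commands below are a sketch, assuming a configured Hadoop 1.x install with $HADOOP_HOME/bin on the PATH. They start all five daemons on one machine and confirm they are running.

    # Format HDFS once, before the very first start (destroys any existing HDFS metadata)
    hadoop namenode -format

    # Start NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker
    start-all.sh

    # jps should list all five Hadoop daemon processes alongside their JVM ids
    jps
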
Folder View of Hadoop
Hadoop Configuration Files

Configuration Filename – Description

hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop
core-site.xml – Configuration settings for Hadoop Core, such as I/O settings, that are common to HDFS and MapReduce
hdfs-site.xml – Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes
mapred-site.xml – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
masters – A list of machines (one per line) that each run a secondary namenode
slaves – A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties – Properties for controlling how metrics are published in Hadoop
log4j.properties – Properties for system log files, the namenode audit log and the task log for the tasktracker child process
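
All of these files live in the conf/ directory of the installation. A quick check, assuming $HADOOP_HOME points at the install root:

    # The configuration files listed above should all appear here
    ls $HADOOP_HOME/conf
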
DD for each component

• Core – core-site.xml
• HDFS – hdfs-site.xml
• MapReduce – mapred-site.xml

core-site.xml and hdfs-site.xml

hdfs-site.xml:

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

Defining HDFS details in hdfs-site.xml

Property: dfs.data.dir
Value: /disk1/hdfs/data,/disk2/hdfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories. Default: ${hadoop.tmp.dir}/dfs/data

Property: fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. Default: ${hadoop.tmp.dir}/dfs/namesecondary

mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Defining mapred-site.xml

Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).

Property: mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

Property: mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory, relative to fs.default.name, where shared files are stored during a job run.

Property: mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

Property: mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.

Critical Properties

• fs.default.name
• hadoop.tmp.dir
• mapred.job.tracker

Slaves and masters

Two files are used by the startup and shutdown commands:

• slaves
  – contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers

• masters
  – contains a list of hosts, one per line, that are to host secondary NameNode servers
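
A minimal sketch of what the two files might contain, assuming hypothetical hostnames (master1, slave1, slave2) and that $HADOOP_HOME/conf is the configuration directory:

    # masters: the host that will run the secondary NameNode (hostname is a placeholder)
    cat > $HADOOP_HOME/conf/masters <<EOF
    master1
    EOF

    # slaves: one host per line, each running a DataNode and a TaskTracker (hostnames are placeholders)
    cat > $HADOOP_HOME/conf/slaves <<EOF
    slave1
    slave2
    EOF
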
Per-process runtime environment

hadoop-env.sh file:

• Specifies the per-process runtime environment, such as the JVM settings, for the Hadoop servers.
• This file also offers a way to provide custom parameters for each of the servers.
• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
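
Since hadoop-env.sh is itself a shell script, typical entries are plain exports. A minimal sketch; the JAVA_HOME path and the heap size are assumptions to adapt to your machine:

    # conf/hadoop-env.sh (excerpt)
    export JAVA_HOME=/usr/lib/jvm/java-6-sun        # placeholder path to your JDK
    export HADOOP_HEAPSIZE=1000                     # heap size in MB for each Hadoop daemon
    export HADOOP_NAMENODE_OPTS="-Xmx2g"            # example of a per-server custom parameter
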
Reporting

hadoop-metrics.properties:

• This file controls the reporting of metrics.
• The default is not to report.
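
A minimal sketch of turning on file-based reporting for the HDFS metrics context, written from the shell; the dfs.* property names and the FileContext class follow the commented examples that ship with the Hadoop 1.x hadoop-metrics.properties, and the log path is a placeholder:

    # Append a file-based metrics sink for the "dfs" context
    cat >> $HADOOP_HOME/conf/hadoop-metrics.properties <<'EOF'
    dfs.class=org.apache.hadoop.metrics.file.FileContext
    dfs.period=10
    dfs.fileName=/tmp/dfsmetrics.log
    EOF
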
Network Requirements

Hadoop Core:

• Uses Shell (SSH) to launch the server processes on the slave nodes
• Requires a password-less SSH connection between the master and all slave and secondary machines
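
Setting up the password-less connection is usually a one-time key exchange. A minimal sketch, assuming a hadoop user on the master and a hypothetical slave host named slave1:

    # On the master: generate a key pair with an empty passphrase
    ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

    # Copy the public key to each slave and secondary machine (slave1 is a placeholder hostname)
    ssh-copy-id hadoop@slave1

    # Verify that no password prompt appears
    ssh hadoop@slave1 hostname
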
Namenode Recovery

1. Shut down the secondary NameNode
2. Copy secondary:fs.checkpoint.dir → namenode:dfs.name.dir
3. Copy secondary:fs.checkpoint.edits → namenode:dfs.name.edits.dir
4. When the copy completes, start the NameNode and restart the secondary NameNode
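
A minimal sketch of these steps from the shell, assuming the secondary's fs.checkpoint.dir is /disk1/hdfs/namesecondary and the namenode's dfs.name.dir is /disk1/hdfs/name; both paths and the namenode-host name are placeholders to replace with your configured values.

    # 1. On the secondary: stop the secondary NameNode daemon
    hadoop-daemon.sh stop secondarynamenode

    # 2-3. Copy the checkpoint data onto the NameNode machine
    #      (placeholder paths; use your fs.checkpoint.dir / dfs.name.dir values)
    scp -r /disk1/hdfs/namesecondary/* namenode-host:/disk1/hdfs/name/

    # 4. Start the NameNode, then restart the secondary NameNode
    hadoop-daemon.sh start namenode            # run on the namenode machine
    hadoop-daemon.sh start secondarynamenode   # run on the secondary machine
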
Clarifications

Q & A..?
Thank You
See You in Class Next Week
