
Unit 1

Working with Big Data: Google File System, Hadoop Distributed File System (HDFS) –
Building blocks of Hadoop (Namenode, Datanode, Secondary Namenode, JobTracker,
TaskTracker), Introducing and Configuring Hadoop cluster (Local, Pseudo-distributed mode,
Fully Distributed mode), Configuring XML files.

INTRODUCTION TO BIG DATA:

Big data is a term that describes large volumes of data, both structured and unstructured. It is a collection of large data sets that cannot be processed using traditional computing techniques. Big data is not merely data; rather, it has become a complete subject, which involves various tools, techniques and frameworks.

Big Data involves huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:

● Structured data: Relational data.


● Semi-structured data: XML data.
● Unstructured data: Word, PDF, text, media logs.

Characteristics of Big data:

Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

Variety: The type and nature of the data. This helps people who analyze it to effectively use the
resulting insight.

Velocity: In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.

Variability: Inconsistency of the data set can hamper processes to handle and manage it.

Veracity: The quality of captured data can vary greatly, affecting accurate analysis.
1. The Google File System:
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed
by Google for its own use. It is designed to provide efficient, reliable access to data using large
clusters of commodity hardware.

GFS is tailored to Google's core data storage and usage needs (primarily the search engine), which generate enormous amounts of data that must be retained. Google File System grew out of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days of Google, while the company was still based at Stanford. Files are divided into fixed-size chunks of 64 megabytes, similar to clusters or sectors in regular file systems; chunks are only extremely rarely overwritten or shrunk, and files are usually appended to or read. GFS is also designed and optimized to run on Google's computing clusters, dense nodes which consist of cheap "commodity" computers, which means precautions must be taken against the high failure rate of individual nodes and the resulting data loss. Other design decisions favor high data throughput, even when it comes at the cost of latency.
A GFS cluster consists of multiple nodes. These nodes are divided into two types: one Master node and a large number of Chunk servers. Each file is divided into fixed-size chunks. Chunk servers store these chunks. Each chunk is assigned a unique 64-bit label by the master node at the time of creation, and logical mappings of files to constituent chunks are maintained. Each chunk is replicated several times throughout the network; the minimum is three replicas, with more for files that are in high demand or need more redundancy.
The Master server does not usually store the actual chunks, but rather all the metadata associated with the chunks, such as the tables mapping the 64-bit labels to chunk locations and the files they make up, the locations of the copies of the chunks, what processes are reading or writing to a particular chunk, or taking a "snapshot" of a chunk in order to replicate it (usually at the instigation of the Master server when, due to node failures, the number of copies of a chunk has fallen below the set number). All this metadata is kept current by the Master server periodically receiving updates from each chunk server ("heartbeat messages").
Permissions for modifications are handled by a system of time-limited, expiring "leases", where
the Master server grants permission to a process for a finite period of time during which no other
process will be granted permission by the Master server to modify the chunk. The modifying
chunkserver, which is always the primary chunk holder, then propagates the changes to the
chunkservers with the backup copies. The changes are not saved until all chunk servers
acknowledge, thus guaranteeing the completion and atomicity of the operation.
Programs access the chunks by first querying the Master server for the locations of the desired
chunks; if the chunks are not being operated on (i.e. no outstanding leases exist), the Master
replies with the locations, and the program then contacts and receives the data from the
chunkserver directly.

Architecture of Google File System


GFS is organized as clusters of computers. A cluster is simply a network of computers. Each cluster might contain hundreds or even thousands of machines. In each GFS cluster there are three main entities:
1. Clients
2. Master servers
3. Chunk servers.

Clients can be other computers or computer applications that make file requests. Requests can range from retrieving and manipulating existing files to creating new files on the system. Clients can be thought of as the customers of the GFS.
The Master Server is the coordinator for the cluster. Its tasks include:
1. Maintaining an operation log that keeps track of the activities of the cluster. The operation log helps keep service interruptions to a minimum: if the master server crashes, a replacement server that has monitored the operation log can take its place.
2. The master server also keeps track of metadata, which is the information that describes chunks.
The metadata tells the master server to which files the chunks belong and where they fit within
the overall file.
Chunk Servers are the workhorses of the GFS. They store 64-MB file chunks. The chunk
servers don't send chunks to the master server. Instead, they send requested chunks directly to the
client. The GFS copies every chunk multiple times and stores it on different chunk servers. Each
copy is called a replica. By default, the GFS makes three replicas per chunk, but users can
change the setting and make more or fewer replicas if desired.
Google File System READ Algorithm (execution flow)
1. Application originates the read request.
2. GFS client translates the request from (filename, byte range) to (filename, chunk index), and sends it to the master.
3. Master responds with chunk handle and replica locations (i.e. chunk servers where the replicas
are stored).
4. Client picks a location and sends the (chunkhandle, byterange) request to that location.
5. Chunk server sends requested data to the client.
6. Client forwards the data to the application.
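GFS is proprietary and has no public client library, so the following Java-style sketch is purely illustrative: GfsMaster, ChunkServer, ChunkInfo and their methods are hypothetical names invented here to mirror steps 2-6 above, not actual GFS calls.

    // Hypothetical types only -- GFS exposes no public API; this just mirrors the read steps above.
    interface GfsMaster {
        // Step 3: the master returns the chunk handle plus the chunk servers holding its replicas.
        ChunkInfo lookup(String filename, long chunkIndex);
    }

    interface ChunkServer {
        // Step 5: a chunk server returns the requested byte range of one chunk.
        byte[] read(long chunkHandle, long offsetInChunk, int length);
    }

    record ChunkInfo(long chunkHandle, java.util.List<ChunkServer> replicas) { }

    class GfsReadSketch {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB chunks

        // A real client would also handle reads that span chunk boundaries; this sketch ignores that.
        byte[] read(GfsMaster master, String filename, long byteOffset, int length) {
            long chunkIndex = byteOffset / CHUNK_SIZE;             // Step 2: (filename, byte range) -> chunk index
            ChunkInfo info = master.lookup(filename, chunkIndex);  // Step 3: one metadata request to the master
            ChunkServer replica = info.replicas().get(0);          // Step 4: pick a replica location (e.g. the closest)
            return replica.read(info.chunkHandle(),
                                byteOffset % CHUNK_SIZE, length);  // Steps 5-6: data flows straight from the chunk server
        }
    }

Note how the master is involved only in the metadata lookup; the file data itself never flows through it.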
Google File System WRITE Algorithm (execution flow)
1. Application originates the write request.
2. GFS client translates the request from (filename, data) to (filename, chunk index), and sends it to the master.
3. Master responds with chunk handle and (primary + secondary) replica locations.
4. Client pushes write data to all locations. Data is stored in chunk servers’ internal buffers.
5. Client sends write command to primary.
6. Primary determines serial order for data instances stored in its buffer and writes the instances
in that order to the chunk.
7. Primary sends serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to client.
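As with the read path, the following is only a hypothetical Java-style sketch of steps 2-9 (pushData and commitWrite are invented names, not a real GFS interface); its point is that data is first buffered on every replica and only then committed through the primary.

    // Hypothetical types again -- purely illustrative of the write steps above.
    class GfsWriteSketch {
        interface Master {
            WriteInfo lookupForWrite(String filename, long chunkIndex);  // Steps 2-3: handle + primary + secondaries
        }
        interface Chunkserver {
            void pushData(long chunkHandle, byte[] data);  // Step 4: data lands in the chunk server's internal buffer
            void commitWrite(long chunkHandle);            // Steps 5-9: the primary picks a serial order, applies it,
                                                           // forwards that order to the secondaries and collects acks
        }
        record WriteInfo(long chunkHandle, Chunkserver primary, java.util.List<Chunkserver> secondaries) { }

        void write(Master master, String filename, long chunkIndex, byte[] data) {
            WriteInfo info = master.lookupForWrite(filename, chunkIndex);
            info.primary().pushData(info.chunkHandle(), data);            // Step 4: push data to the primary...
            for (Chunkserver secondary : info.secondaries()) {
                secondary.pushData(info.chunkHandle(), data);             // ...and to every secondary's buffer
            }
            info.primary().commitWrite(info.chunkHandle());               // Steps 5-9: commit through the primary
        }
    }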

Advantages and disadvantages of large sized chunks in Google File System


Chunk size is one of the key design parameters. In GFS it is 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunk server and is extended only as needed.
Advantages
1. It reduces clients’ need to interact with the master because reads and writes on the same chunk
require only one initial request to the master for chunk location information.
2. Since a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunk server over an extended period of time.
3. It reduces the size of the metadata stored on the master. This allows us to keep the metadata in
memory, which in turn brings other advantages.
Disadvantages
1. Even with lazy space allocation (which avoids wasting space due to internal fragmentation), a small file consists of a small number of chunks, perhaps just one. The chunk servers storing those chunks may become hot spots if many clients are accessing the same file.
2. In practice, hot spots have not been a major issue because the applications mostly read large multi-chunk files sequentially. Where they do occur, they can be mitigated with higher replication and by allowing clients to read from other clients.
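A rough calculation shows why advantage 3 matters; the figures are approximate and assume roughly 64 bytes of master metadata per chunk, the order of magnitude reported for GFS:

    1 TB of data / 64 MB per chunk                      = 16,384 chunks per terabyte
    16,384 chunks x ~64 bytes of metadata               ≈ 1 MB of chunk metadata per terabyte of data
    1 TB of data / 4 KB blocks (typical local filesystem) ≈ 268 million blocks per terabyte

With 64 MB chunks the master can therefore comfortably keep the metadata for very large amounts of data in RAM, which would not be feasible with conventional block sizes.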

What Is Hadoop
● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
● It is an open-source data management framework with scale-out storage and distributed processing.
Key features – Why Hadoop?

1. Flexible:
It is commonly estimated that only about 20% of the data in organizations is structured and the rest is unstructured, so it is very crucial to manage the unstructured data that otherwise goes unattended. Hadoop is built to manage different types of Big Data, whether structured or unstructured, encoded or formatted, or any other type of data, and to make it useful for the decision-making process. Moreover, Hadoop is simple, relevant and schema-less! Though Hadoop is primarily programmed in Java, MapReduce programs can also be written in other languages. Though Hadoop works well on Windows and Linux, it can also work on other operating systems like BSD and OS X.

2. Scalable
Hadoop is a scalable platform, in the sense that new nodes can easily be added to the system as and when required, without altering the data formats, how data is loaded, how programs are written, or the existing applications.

Hadoop is a totally open source platform and runs on industry-standard hardware. Moreover,
Hadoop is also fault tolerant – this means, even if a node gets lost or goes out of service, the
system automatically reallocates work to another location of the data and continues processing as
if nothing had happened!

3. Building more efficient data economy:


Hadoop has revolutionized the processing and analysis of Big Data the world over. Until now, organizations were worrying about how to manage the non-stop data overflowing their systems. Hadoop is more like a "dam", harnessing the flow of an unlimited amount of data and generating a lot of power in the form of relevant information. Hadoop has entirely changed the economics of storing and evaluating data!

4. Robust Ecosystem:
Hadoop has a robust and rich ecosystem that is well suited to meet the analytical needs of developers, web startups and other organizations. The Hadoop ecosystem consists of various related projects such as MapReduce, Hive, HBase, ZooKeeper, HCatalog and Apache Pig, which make Hadoop very competent to deliver a broad spectrum of services.

5. Hadoop is getting more “Real-Time”!


Did you ever wonder how to stream information into a cluster and analyze it in real time? Hadoop has the answer! Yes, Hadoop's competencies are getting more and more real-time. Hadoop also provides a standard approach to a wide set of APIs for big data analytics comprising MapReduce, query languages, database access, and so on.

6. Cost Effective:
The basic idea behind Hadoop is to perform cost-effective analysis of the data present across the World Wide Web!

7. Hadoop is getting Cloudy!


Hadoop is getting cloudier! In fact, cloud computing and Hadoop are being combined in several organizations to manage Big Data. In no time, Hadoop will become one of the most required applications for cloud computing. This is evident from the number of Hadoop clusters offered by cloud vendors in various businesses. Thus, Hadoop will soon reside in the cloud!
Hadoop Eco-System
The Hadoop platform consists of two key services: a reliable, distributed file system
called Hadoop Distributed File System (HDFS) and the high-performance parallel data
processing engine called Hadoop MapReduce, described in MapReduce below. Hadoop was
created by Doug Cutting and named after his son’s toy elephant. Vendors that provide Hadoop-
based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM, and Amazon.
The combination of HDFS and MapReduce provides a software framework for
processing vast amounts of data in parallel on large clusters of commodity hardware (potentially
scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic
processing framework designed to execute queries and other batch read operations against
massive datasets that can scale from tens of terabytes to petabytes in size.
The popularity of Hadoop has grown in the last few years, because it meets the needs of
many organizations for flexible data analysis capabilities with an unmatched price-performance
curve. The flexible data analysis features apply to data in a variety of formats, from unstructured
data, such as raw text, to semi-structured data, such as logs, to structured data with a fixed
schema.
Hadoop has been particularly useful in environments where massive server farms are
used to collect data from a variety of sources. Hadoop is able to process parallel queries as big,
background batch jobs on the same server farm. This saves the user from having to acquire
additional hardware for a traditional database system to process the data (assuming such a system could scale to the required size). Hadoop also reduces the effort and time required to load data into another system, since the data can be processed directly within Hadoop; this loading overhead becomes impractical for very large data sets.
Many of the ideas behind the open source Hadoop project originated from the Internet
search community, most notably Google and Yahoo!. Search engines employ massive farms of
inexpensive servers that crawl the Internet, retrieving Web pages into local clusters where they are analyzed with massive, parallel queries to build search indices and other useful data structures.
The Hadoop ecosystem includes other tools to address particular needs. Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is used for federating services and Oozie is a scheduling system. Avro, Thrift and Protobuf are platform-portable data serialization and description formats.
Apache Hadoop is a framework that enables the distributed processing of large sets of data
across clusters of servers.

Hadoop Eco-system Components:


Hadoop:

Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to
detect and handle failures at the application layer, so delivering a highly-available service on top
of a cluster of computers, each of which may be prone to failures.

Hive:

Apache Hive data warehouse software facilitates querying and managing large datasets residing
in distributed storage. Hive provides a mechanism to project structure onto this data and query
the data using a SQL-like language called HiveQL. At the same time this language also allows
traditional map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.

Pig:

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Flume:

Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible architecture
based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms
and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

Sqoop:

Apache Sqoop is a tool designed to transfer data between Hadoop and relational databases. You
can use Sqoop to import data from a relational database management system (RDBMS) such as
MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in
Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of
this process, relying on the database to describe the schema for the data to be imported. Sqoop
uses MapReduce to import and export the data, which provides parallel operation as well as fault
tolerance
Oozie:

Apache Oozie is an open source project that simplifies the process of creating workflows and
managing coordination among jobs. In principle, Oozie offers the ability to combine multiple
jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and
supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. In addition, it can be used to
schedule jobs specific to a system, such as Java programs. Therefore, using Oozie, Hadoop
administrators are able to build complex data transformations that can combine the processing of
different individual tasks and even sub-workflows.

2. Hadoop Distributed File System (HDFS) and Building blocks of Hadoop:


HDFS – Hadoop Distributed File System
● Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem; it is inspired by the Google File System (GFS).
● HDFS is a highly fault-tolerant, distributed, reliable, scalable file system for storing very large files on clusters of commodity hardware. Commodity hardware is cheap and less reliable.
● HDFS has a master/slave architecture.
● An HDFS cluster consists of a single Master Node, a master server that manages the file system namespace and regulates access to files by clients.
● In addition, there are a number of Slave Nodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
● HDFS has the property of horizontal scalability, which means we can add any number of slave nodes on the fly, i.e. dynamically at runtime.
● When the data given by the user/client is very big, in the range of terabytes and petabytes, it is internally split into one or more blocks and these blocks are stored in a set of Slave Nodes.
● A block is an independent unit of data.

Given below is the architecture of the Hadoop File System:


A. Namenode
B. Datanode
C. Secondary Name node
D. JobTracker
E. TaskTracker
Hadoop is made up of 2 parts:
1. HDFS – Hadoop Distributed File System
2. MapReduce – The programming model that is used to work on the data present in HDFS.

1. NameNode

● This daemon runs on the Master. The Namenode stores all the metadata like filename, path, number of blocks, block IDs, block locations, number of replicas, slave-related configuration, etc.
● The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
● The Namenode acts like a bookkeeper.
● The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
● Receipt of a Heartbeat implies that the DataNode is functioning properly.
● A Blockreport contains a list of all blocks on a DataNode.
● If a Heartbeat is missed, the NameNode treats that DataNode as failed and re-replicates its blocks on other nodes; this acts as the balancing utility in the architecture.

2. DataNode

● This daemon runs on all the slaves and stores the actual data. So the namenode corresponds to the master machine and the datanodes to the slave machines.
● The NameNode determines the mapping of blocks to DataNodes; the DataNodes are responsible for serving read and write requests from the file system's clients.
● The client communicates directly with a datanode to process the local files corresponding to the blocks.
● A datanode may communicate with other datanodes to replicate its data blocks for redundancy.
● Datanodes are constantly reporting to the namenode.
● A datanode informs the namenode of the blocks it is currently storing.

3. Secondary Name Node:


The Secondary NameNode is NOT the backup or high-availability node for the NameNode.
The job of the Secondary NameNode is to contact the NameNode periodically, after a certain time interval (by default 1 hour).

The NameNode keeps all filesystem metadata in RAM and records changes in an edit log on disk, but it does not merge that edit log into the on-disk filesystem image (fsimage) while it is running. What the Secondary NameNode does is contact the NameNode every hour, pull a copy of the fsimage and edit log, merge them into a new, compact fsimage, and send it back to the NameNode, while keeping a copy for itself. Hence the Secondary NameNode is not a backup; rather, it does the job of housekeeping (checkpointing).

In case of NameNode failure, the saved metadata can be used to rebuild the filesystem state.

4. Job Tracker:

● There is only one JobTracker daemon per Hadoop cluster.
● It typically runs on a server as a master node of the cluster.
● The JobTracker daemon is the binding between your application and Hadoop.
● The JobTracker receives requests for MapReduce execution from the client.
● The JobTracker determines the execution plan, assigns nodes to different tasks, and monitors all tasks as they are running.
● If a task fails, the JobTracker will automatically relaunch the task on a different node.

5. Task Tracker:

● The JobTracker is the master of the overall execution of a MapReduce job.
● TaskTrackers manage the execution of individual tasks on each slave node.
● Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.
● One responsibility of a TaskTracker is to constantly communicate with the JobTracker.
● If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.

The blocks of a file are replicated for fault tolerance across the cluster. The NameNode makes all decisions regarding replication of blocks. Replication is nothing but keeping the same blocks of data on different nodes.
The replication factor is the number of times every single block of data is replicated. All the data blocks are replicated across the cluster of nodes. Ideally, replicas are placed at different locations so that a single failure does not destroy all copies.
The default replication factor is 3, which can be changed according to the requirements. We can change the replication factor to a desired value by editing the configuration files.
An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
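As a small illustration of the last point, the Hadoop Java API lets a client pass a replication factor when creating a file and change it later; the path and values below are arbitrary examples.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/report.txt");   // hypothetical path

            // Specify replication factor 2 at file creation time
            // (overwrite = true, 4 KB buffer, replication 2, 128 MB block size).
            FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024);
            out.writeUTF("sample record");
            out.close();

            // Change the replication factor of the existing file later.
            fs.setReplication(file, (short) 3);
        }
    }

The same change can also be made from the command line for existing files with the -setrep option of the hadoop fs shell.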
Hadoop HDFS data read and write operations

● HDFS has a master and slave kind of architecture.
● The Namenode acts as the master and the Datanodes as workers.
● All the metadata information is with the namenode and the original data is stored on the datanodes.
● Keeping all this in mind, the figure below gives an idea of how data flows between the Client interacting with HDFS, i.e. the Namenode and the Datanodes.

The following steps are involved in reading the file from HDFS:
Let’s suppose a Client (a HDFS Client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on
FileSystem object, which for HDFS is an instance of DistributedFileSystem class.

Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote Procedure Call), to
determine the locations of the blocks for the first few blocks of the file. For each block, the
namenode returns the addresses of all the datanodes that have a copy of that block. The client will interact with the respective datanodes to read the file. The Namenode also provides a token to the client, which the client shows to the datanode for authentication.

The DistributedFileSystem returns an object of type FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first closest datanode
for the first block in the file.

Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on
the stream.

Step 5: When the end of the block is reached, DFSInputStream will close the connection to the
datanode, then find the best datanode for the next block. This happens transparently to the client,
which from its point of view is just reading a continuous stream.

Step 6: Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
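The steps above correspond to the standard Hadoop Java client API (FileSystem, Path, FSDataInputStream). The sketch below uses that real API; only the cluster URI and file path are made-up examples.

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            String uri = "hdfs://namenode-host:8020/user/demo/sample.txt";  // hypothetical cluster and path
            Configuration conf = new Configuration();

            // Steps 1-2: FileSystem.get returns a DistributedFileSystem for hdfs:// URIs,
            // and open() asks the namenode (over RPC) for the block locations.
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));                      // returns an FSDataInputStream
                IOUtils.copyBytes(in, System.out, 4096, false);   // Steps 3-6: repeated read() calls stream
                                                                  // block data directly from the datanodes
            } finally {
                IOUtils.closeStream(in);                          // close() on the stream
            }
        }
    }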

Differences between HDFS and GFS:


1. Platform: HDFS is cross-platform; GFS runs on Linux.
2. Implementation: HDFS is developed in a Java environment; GFS is developed in a C/C++ environment.
3. Origin: HDFS was first developed by Yahoo and is now an open-source Apache framework; GFS was developed by Google and is proprietary.
4. Node roles: HDFS has a NameNode and DataNodes; GFS has a Master node and Chunk servers.
5. Block size: HDFS uses a default block size of 128 MB; GFS uses a default chunk size of 64 MB.
6. Heartbeats: In HDFS the NameNode receives heartbeats from the DataNodes; in GFS the Master node receives heartbeats from the chunk servers.
7. Hardware: Both run on commodity hardware.
8. Access model: HDFS follows a WORM model (write once, read many times); GFS allows a multiple-writer, multiple-reader model.
9. Writes: In HDFS only appends are possible; in GFS random file writes are possible.
3. Introducing and Configuring Hadoop cluster
Core Components of Hadoop Cluster:

Hadoop cluster has 3 components:


1. Client
2. Master
3. Slave
The role of each component is shown in the image below.

Let's try to understand these components one by one.


Client:
It is neither master nor slave; rather, its role is to load data into the cluster, submit MapReduce jobs describing how the data should be processed, and then retrieve the results once the jobs are complete.
Masters:
The Masters consist of 3 components: NameNode, Secondary NameNode and JobTracker.

NameNode:
NameNode does NOT store the files but only the file's metadata. NameNode oversees the health
of DataNode and coordinates access to the data stored in DataNode.
The Name node keeps track of all the file-system-related information, such as:
● Which section of a file is saved in which part of the cluster
● Last access time for the files
● User permissions, i.e. which users have access to a file
JobTracker:
JobTracker coordinates the parallel processing of data using MapReduce.

Secondary Name Node:

The Secondary NameNode is NOT the backup or high-availability node for the NameNode.

So what does the Secondary NameNode do?

As described earlier, it contacts the NameNode periodically (by default every hour), pulls a copy of the filesystem metadata (fsimage and edit log), merges it into a new, compact checkpoint, and sends it back to the NameNode while keeping a copy for itself. Hence the Secondary NameNode is not a backup; it performs housekeeping. In case of NameNode failure, the saved metadata can be used to rebuild the filesystem state.
Slaves:
Slave nodes are the majority of machines in a Hadoop cluster and are responsible to
● Store the data
● Process the computation

Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
Configuring Hadoop cluster (Local, Pseudo-distributed mode, Fully Distributed mode)

Standalone Mode

1. Default mode of Hadoop


2. HDFS is not utilized in this mode.
3. The local file system is used for input and output.
4. Used for debugging purposes.
5. No custom configuration is required in the three Hadoop configuration files (mapred-site.xml, core-site.xml, hdfs-site.xml).
6. Standalone mode is much faster than pseudo-distributed mode.

Pseudo Distributed Mode (Single Node Cluster)

1. Configuration is required in the 3 files mentioned above for this mode.
2. The replication factor is one for HDFS.
3. Here one node is used as Master Node / Data Node / Job Tracker / Task Tracker.
4. Used to test real code against HDFS.
5. A pseudo-distributed cluster is a cluster where all daemons are running on one node itself.

Fully distributed mode (or multiple node cluster)

1. This is the production mode of Hadoop.
2. Data is stored and distributed across many nodes.
3. Different nodes are used as Master Node / Data Node / Job Tracker / Task Tracker.

Configuring XML files.

Hadoop configuration is driven by two types of important configuration files:

1. Read-only default configuration - src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml.
2. Site-specific configuration - conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.

The site-specific files are available under the 'conf' directory of the Hadoop installation directory. The usage of each of the configuration files (hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves) is as follows:

hadoop-env.sh
hadoop-env.sh specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh. This variable directs the Hadoop daemons to the Java path on the system.

This file is also used for setting other parts of the Hadoop daemon execution environment, such as heap size (HADOOP_HEAPSIZE), Hadoop home (HADOOP_HOME), log file location (HADOOP_LOG_DIR), etc.
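For illustration, typical hadoop-env.sh entries look like the following; the JAVA_HOME path is environment-specific and all values shown are only examples.

    # hadoop-env.sh (illustrative values; adjust the paths for your own system)
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_HEAPSIZE=1000                 # daemon heap size, in MB
    export HADOOP_LOG_DIR=${HADOOP_HOME}/logs   # where the daemon log files go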
Note: For the simplicity of understanding the cluster setup, we have configured only necessary
parameters to start a cluster.

The following three files are the important configuration files for the runtime environment
settings of a Hadoop cluster.
core-site.xml
This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
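A minimal conf/core-site.xml might look like the following (the hostname is a placeholder; newer Hadoop releases use the property name fs.defaultFS instead of fs.default.name):

    <?xml version="1.0"?>
    <!-- conf/core-site.xml : illustrative values only -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode-host:8020</value>
      </property>
    </configuration>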

Here, hostname and port (the value of fs.default.name) are the machine and port on which the NameNode daemon runs and listens; this also informs the NameNode which IP and port it should bind to. The commonly used port is 8020, and you can also specify an IP address rather than a hostname.

hdfs-site.xml
This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
You can also configure hdfs-site.xml to specify default block replication and permission checking on HDFS. The actual number of replications can also be specified when a file is created; the default is used if replication is not specified at create time.

The value “true” for the property ‘dfs.permissions’ enables permission checking in HDFS and the value “false” turns permission checking off. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.
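For example, a conf/hdfs-site.xml setting the default replication and enabling permission checking could look like this (illustrative values):

    <?xml version="1.0"?>
    <!-- conf/hdfs-site.xml : illustrative values only -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <name>dfs.permissions</name>
        <value>true</value>
      </property>
    </configuration>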
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker to the TaskTrackers and MapReduce clients.
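An illustrative conf/mapred-site.xml is shown below; the hostname is a placeholder and 8021 is just a commonly used JobTracker RPC port.

    <?xml version="1.0"?>
    <!-- conf/mapred-site.xml : illustrative values only -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker-host:8021</value>
      </property>
    </configuration>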

You can replicate all four of the files explained above to all the DataNodes and the Secondary NameNode. These files can then be adjusted for any node-specific configuration, e.g. in case of a different JAVA_HOME on one of the Datanodes.

The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in a Hadoop cluster.
Masters
This file informs the Hadoop daemons about the location of the Secondary Namenode. The ‘masters’ file on the Master server contains the hostname(s) of the Secondary NameNode server(s).

The ‘masters’ file on the Slave Nodes is blank.

Slaves
The ‘slaves’ file at the Master node contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers.
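For example, on the master node the two files might contain the following (all hostnames here are made-up placeholders):

    conf/masters:
        secondarynamenode.example.com

    conf/slaves:
        slave01.example.com
        slave02.example.com
        slave03.example.com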
