LAB ASSIGNMENT - 02

Name: Parth Shah
ID: 191080070
Branch: Information Technology

AIM
To set up a multi-node Hadoop cluster.

THEORY
Hadoop is an open-source framework that allows the storage and processing of
big data in a distributed environment across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or a tool, rather it has become a
complete subject, which involves various tools, techniques and frameworks.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different nodes. In short, Hadoop is used to develop
applications that can perform complete statistical analysis on huge amounts of data.
Hadoop Architecture
At its core, Hadoop has two major layers namely –
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System)
MapReduce: MapReduce is a parallel programming model for writing distributed
applications devised at Google for efficient processing of large amounts of data
(multiterabyte data-sets), on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner. The MapReduce program runs on
Hadoop which is an Apache open-source framework.
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However,
the differences from other distributed file systems are significant. It is highly
fault-tolerant and is designed to be deployed on low-cost hardware. It provides
high throughput access to application data and is suitable for applications having
large datasets.
Apart from the above-mentioned two core components, Hadoop framework also
includes the following two modules −
• Hadoop Common − These are Java libraries and utilities required by other
Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
NameNode and DataNodes
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients. In addition, there
are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on. HDFS exposes a file system
namespace and allows user data to be stored in files. Internally, a file is split into
one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to
DataNodes. The DataNodes are responsible for serving read and write requests
from the file system’s clients. The DataNodes also perform block creation,
deletion, and replication upon instruction from the NameNode.
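For illustration, once a cluster such as the one set up in the Implementation
section is running, this block-to-DataNode mapping can be inspected with the
hdfs fsck utility (the path below is only a placeholder):

  $ hdfs fsck /user/hadoop/sample.txt -files -blocks -locations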

NameNode
- It is a single master server that exists in the HDFS cluster.
- As it is a single node, it may become a single point of failure.
- It manages the file system namespace by executing operations such as
opening, renaming and closing files.
- It simplifies the architecture of the system.

DataNode
- The HDFS cluster contains multiple DataNodes.
- Each DataNode contains multiple data blocks.
- These data blocks are used to store data.
- It is the responsibility of the DataNode to serve read and write requests from
the file system's clients.
- It performs block creation, deletion, and replication upon instruction from the
NameNode.

HDFS
In HDFS, data is distributed over several machines and replicated to ensure
durability against failure and high availability to parallel applications.
It is cost-effective as it uses commodity hardware. It involves the concepts of
blocks, DataNodes and the NameNode.

Why is HDFS used? For -

Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.
Streaming Data Access: The time to read the whole dataset is more important
than the latency in reading the first record. HDFS is built on a write-once and
read-many-times pattern (a small usage example follows this list).
Commodity Hardware: It works on low-cost hardware.
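As a small sketch of the write-once, read-many pattern, data is copied into
HDFS once and then read repeatedly by jobs; the local file and HDFS paths
below are only placeholders:

  $ hdfs dfs -mkdir -p /user/hadoop/input
  $ hdfs dfs -put sample.txt /user/hadoop/input
  $ hdfs dfs -ls /user/hadoop/input
  $ hdfs dfs -cat /user/hadoop/input/sample.txt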

Hadoop Tasks
Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into uniformly
sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further
processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
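These map, sort and reduce stages can be observed by running the wordcount
example bundled with Hadoop once the cluster set up below is running; the jar
location and the input/output paths are assumptions:

  $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output
  $ hdfs dfs -cat /user/hadoop/output/part-r-00000
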
IMPLEMENTATION
1) Installing SSH on slave
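On Ubuntu this step typically amounts to (package names may differ on other
distributions):

  $ sudo apt-get update
  $ sudo apt-get install openssh-server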

2) Creating new key on slave
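A typical way to generate the key pair; an RSA key with an empty passphrase
is assumed here:

  $ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa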


3) Copy the public key to authorized keys file and verify ssh configuration using ssh
localhost
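This is usually done roughly as follows:

  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys
  $ ssh localhost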

4) Installing Java 8 on slave
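On Ubuntu, OpenJDK 8 can be installed and verified with:

  $ sudo apt-get install openjdk-8-jdk
  $ java -version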

5) Download Hadoop on slave
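For example, a Hadoop release can be fetched from the Apache archive; the
version shown here (3.2.1) is only an assumption about the release used:

  $ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz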


6) Extracting Hadoop folder
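Extracting the archive and moving it to a working location; the
/usr/local/hadoop target directory is an assumption:

  $ tar -xzf hadoop-3.2.1.tar.gz
  $ sudo mv hadoop-3.2.1 /usr/local/hadoop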

8) Adding JAVA_HOME path to /etc/environment
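Assuming the default OpenJDK 8 location on 64-bit Ubuntu, the entry looks
roughly like:

  $ echo 'JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"' | sudo tee -a /etc/environment
  $ source /etc/environment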

9) Adding new user ‘hadoop’
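A typical command sequence for creating the user (granting sudo is optional):

  $ sudo adduser hadoop
  $ sudo usermod -aG sudo hadoop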


10) Editing .bashrc file
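The .bashrc of the hadoop user is typically extended with entries along these
lines (the paths are assumptions matching the earlier steps):

  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  export HADOOP_HOME=/usr/local/hadoop
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
  export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

followed by reloading it:

  $ source ~/.bashrc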

11) Setting passphraseless SSH and copying keys between master and slave
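Assuming the hostnames master and slave are resolvable (e.g., via /etc/hosts),
the keys can be exchanged roughly as:

  $ ssh-keygen -t rsa -P ""
  $ ssh-copy-id hadoop@master
  $ ssh-copy-id hadoop@slave
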
12) Editing core-site.xml in master and slave
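A minimal core-site.xml for both nodes, assuming the NameNode runs on a host
named master and listens on port 9000:

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master:9000</value>
    </property>
  </configuration>
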
13) Editing hdfs-site.xml in master and slave
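A minimal hdfs-site.xml with a replication factor of 2; the storage directories
are assumptions:

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/usr/local/hadoop/data/nameNode</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/usr/local/hadoop/data/dataNode</value>
    </property>
  </configuration>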

14) Editing mapred-site.xml in master and slave
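A minimal mapred-site.xml telling MapReduce to run on YARN:

  <configuration>
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
  </configuration>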


15) Specifying worker nodes; in this case, the slave
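In Hadoop 3.x this is the workers file (slaves in Hadoop 2.x); listing only the
slave here is an assumption matching this setup:

  $ echo "slave" > $HADOOP_HOME/etc/hadoop/workers
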
16) Editing yarn-site.xml in slave
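A minimal yarn-site.xml, assuming the ResourceManager runs on the master:

  <configuration>
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>master</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
  </configuration>
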
17) Formatting namenode
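Run once on the master before first use:

  $ hdfs namenode -format
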
18) Starting namenode and datanode
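From the master, the HDFS daemons on all nodes are started with:

  $ start-dfs.sh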

19) Checking active nodes on slave
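The running daemons and the DataNodes registered with the NameNode can be
checked with:

  $ jps
  $ hdfs dfsadmin -report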


20) Starting YARN
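The ResourceManager and NodeManagers are started from the master with:

  $ start-yarn.sh
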
21) Stopping all nodes
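The cluster is shut down in the reverse order:

  $ stop-yarn.sh
  $ stop-dfs.sh
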
CONCLUSION
Thus, in this experiment I set up a multi-node (master and slave) Hadoop
cluster on Linux. I studied the architecture of the Hadoop Distributed File
System and successfully worked with Hadoop on Ubuntu.
