
NAME- DEEPTI AGRAWAL

BDA LAB ASSIGNMENT NUMBER -02


REG. NUMBER- 201081010

What is a Hadoop Cluster?


In general, a computer cluster is a collection of computers that work together as a single system.
A Hadoop cluster is a collection of independent machines connected through a dedicated network so that they work as a single, centralized data processing resource. It can also be described as a computational cluster for storing and analysing big data (structured, semi-structured and unstructured) in a distributed environment: the data analysis workload is distributed across the cluster nodes, which process the data in parallel.
Hadoop clusters are also known as "shared nothing" systems because nothing is shared between the nodes except the network that connects them. This shared-nothing design keeps processing latency low, so even queries over huge amounts of data see minimal cluster-wide latency.

Hadoop

Apache Hadoop, commonly referred to as Hadoop, is open-source software developed under the Apache Software Foundation that provides a framework for storing and processing large datasets. It is appealing because it is open source and provides an open alternative to Google's proprietary big data systems such as the BigTable storage system.

The Hadoop framework is composed of four main “modules”:


Hadoop Distributed File System (HDFS)
Yet Another Resource Negotiator (YARN)
MapReduce
Hadoop Common

These modules provide the services that run on the cluster's nodes and the algorithms those nodes execute; run together, they form a cluster. By working together, the cluster can perform tasks that are normally not possible on a single computer because the input is too large or the tasks are too computationally intensive. This division of work is a fundamental principle of distributed systems.
It is important to understand that Hadoop is actually a system of multiple distributed systems interacting with each other. MapReduce is a programming model and engine for distributed computation, HDFS is a distributed storage system, YARN defines the roles that nodes must play for the distributed system to exist and to run tasks, and a separate service called ZooKeeper acts as its own cluster to provide a distributed configuration and coordination service.

Hadoop Cluster Architecture

A Hadoop cluster architecture consists of a data centre, racks, and the nodes that actually execute the jobs: the data centre contains racks, and the racks contain nodes. A medium to large cluster uses a two- or three-level architecture built with rack-mounted servers. The servers within each rack are interconnected through 1 gigabit Ethernet (1 GbE). Each rack-level switch in a Hadoop cluster is connected to a cluster-level switch, and the cluster-level switches are in turn connected to other cluster-level switches or uplink to other switching infrastructure.

Components of a Hadoop Cluster

A Hadoop cluster consists of three components -

Master Node – The master node in a Hadoop cluster is responsible for storing data in HDFS and coordinating parallel computation on that data using MapReduce. It runs three services: the NameNode, the Secondary NameNode and the JobTracker. The JobTracker monitors the parallel processing of data using MapReduce, while the NameNode handles the data storage function of HDFS. The NameNode keeps track of all the metadata about files, such as access times, permissions, and which blocks of each file are stored on which DataNodes. The Secondary NameNode periodically checkpoints the NameNode's metadata so that it can be restored if the NameNode fails.

Slave/Worker Node – This component in a Hadoop cluster is responsible for storing the data and performing the computations. Every slave/worker node runs both a TaskTracker and a DataNode service to communicate with the master node: the DataNode reports to the NameNode, and the TaskTracker reports to the JobTracker.

Client Nodes – A client node has Hadoop installed with all the required cluster configuration settings and is responsible for loading data into the cluster. The client node submits MapReduce jobs describing how the data needs to be processed, and retrieves the output once job processing is complete.

Advantages of a Hadoop Cluster Setup

As big data grows exponentially, the parallel processing capability of a Hadoop cluster helps to increase the speed of analysis. However, the processing power of a cluster may become inadequate as the volume of data increases. In such scenarios, a Hadoop cluster can easily be scaled out to keep up with the required speed of analysis by adding extra cluster nodes, without having to make any modifications to the application logic.

A Hadoop cluster setup is inexpensive because it is built from cheap commodity hardware. Any organization can set up a powerful Hadoop cluster without having to spend on expensive server hardware.

Hadoop clusters are resilient to failure: whenever data is sent to a particular node for analysis, it is also replicated to other nodes in the cluster. If a node fails, the replicated copy of the data on another node in the cluster can be used for analysis.
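For example, once data has been loaded into HDFS, the replication of a file can be inspected and changed from the command line; a brief sketch (the path /user/hadoop/sample.txt is only an illustrative placeholder):

$ hdfs fsck /user/hadoop/sample.txt -files -blocks -locations
$ hdfs dfs -setrep -w 3 /user/hadoop/sample.txt

The first command reports which DataNodes hold each block replica, and the second changes the replication factor of that file to 3 and waits for the extra copies to be created.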

Execution:

1. Download the tar file for Hadoop from this website:
https://hadoop.apache.org/release/3.3.6.html
2. Extract this tar file using tar command:

$ tar xzf hadoop-3.3.4.tar.gz

3. Rename the extracted directory to hadoop:

$ mv hadoop-3.3.4 hadoop

4. Open the hadoop-env.sh file in the nano editor and edit the
JAVA_HOME variable
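A typical way to do this (the JDK path below is an assumption and should be adjusted to the locally installed Java version):

$ nano ~/hadoop/etc/hadoop/hadoop-env.sh
# inside hadoop-env.sh, point JAVA_HOME at the installed JDK (assumed path)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64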

5. Move the hadoop directory from the home directory to /usr/local/:
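A sketch of the command, assuming the directory currently sits in the home directory:

$ sudo mv ~/hadoop /usr/local/hadoop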


6. Open the environment file (a typical command is sketched below), then add the JAVA_HOME variable to it.
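A sketch, assuming the system-wide /etc/environment file is meant and the same JDK path as in step 4:

$ sudo nano /etc/environment
# append the assumed JDK path so JAVA_HOME is available system-wide
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"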
7. Create a hadoop user
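For example, a dedicated user can be created and given ownership of the Hadoop directory (the chown step is an assumption, but it avoids permission problems later):

$ sudo adduser hadoop
$ sudo chown -R hadoop:hadoop /usr/local/hadoop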

8. Edit .bashrc file, add following


#Hadoop Related Options
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

9. In order to apply the changes to the environment, run the following


command:

$ source ~/.bashrc

10.Run the following command to get the ip address of the machine:
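For example, either of these standard commands shows the machine's IP address:

$ ip addr show
$ hostname -I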


11. Now we will write down these IP addresses in the hosts file.
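A sketch of the hosts file entries (the IP addresses below are placeholders and must be replaced with the addresses obtained in the previous step):

$ sudo nano /etc/hosts
192.168.1.10 master
192.168.1.11 slave1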
12. Now open the hostname file on both the machines and rename the machines to master and slave1 respectively.
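For example, on each machine:

$ sudo nano /etc/hostname
# on the master machine the file should contain only: master
# on the slave machine the file should contain only: slave1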


13.Now in the master node we create rsa key pair
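A sketch of the key generation, run as the hadoop user on the master (defaults accepted at the prompts):

$ ssh-keygen -t rsa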
14. Now we setup this public key in both master and slave
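For example, the public key can be copied to both machines with ssh-copy-id (user and hostnames taken from the earlier steps):

$ ssh-copy-id hadoop@master
$ ssh-copy-id hadoop@slave1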

15. Now, open the core-site.xml file and add the following configurations
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>

</configuration>
16. Now, open the hdfs-site.xml file and add the following configurations
as shown:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>

17. Now, open the mapred-site.xml file and add the following
configurations as shown:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:9001</value>
</property>
</configuration>
18. Now, on master, open the workers file:

$ sudo nano /usr/local/hadoop/etc/hadoop/workers

Add the following lines as shown:
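Based on the hostnames used earlier, the workers file would typically contain the lines below (listing master as a worker is an assumption; it is only needed if the master should also store data and run tasks):

master
slave1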

19. Use the following command to copy the files to slave node
$ scp -r /usr/local/hadoop/* slave1:/usr/local/hadoop
20. Now, on slave1, open the yarn-site.xml file (a typical command for opening it is sketched after the snippet) and add the following configuration:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
</configuration>
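For reference, the file would typically be opened with a command along these lines, assuming the same layout as on the master:

$ sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml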

21. Now we need to format the HDFS file system. Run this
command on the master node:

$ hdfs namenode -format


22. Now, on the hadoop-master, we will start HDFS and then verify the output by checking the running daemons on both machines.
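A sketch of the commands typically used here (start-yarn.sh is included on the assumption that YARN is also wanted, given the yarn-site.xml configuration above):

$ start-dfs.sh
$ start-yarn.sh
$ jps

start-dfs.sh and start-yarn.sh launch the HDFS and YARN daemons from $HADOOP_HOME/sbin (already on the PATH via .bashrc), and jps lists the Java daemons running on the current machine, so it can be run on both master and slave1 to verify the setup.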


23. For hadoop-slave: run the same verification command on the slave machine to confirm that the worker daemons are running.

Conclusion:
In conclusion, this experiment helped me learn how to set up a multi-node
Hadoop cluster. By configuring different parts of Hadoop and
understanding how it handles errors, I gained valuable skills for managing
large amounts of data. Doing tasks like formatting the HDFS file system
also gave me hands-on experience in working with big data systems. This
experiment improved my understanding of Hadoop's structure and taught
me how to build and manage data processing setups effectively.
