
Subject: Framework et technologies Big Data

Audience: 3-IM
Lab Supervisor: Ikram Chaabane
Academic Year: 2024-2025

TP 2 – Hadoop Installation and Configuration

(single-node installation)
1. Definitions
 Hadoop is a platform for storing and processing very large amounts of data in a
distributed manner. It spreads data and tasks across a group of machines called a cluster.
 A server cluster (also called a computer cluster, compute farm, or compute cluster) is a group
of independent computers, called nodes, that work together as a single system.
 HDFS: Hadoop's distributed file system.
 MapReduce: Hadoop's programming model for distributed processing.
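For illustration, once the installation below is complete, HDFS is used through shell commands such as:

hdfs dfs -ls /

which lists the root of the distributed file system (the hdfs command becomes available once the PATH is set in step 2.5).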

2. Hadoop Installation
2.1. Download a version of Hadoop
 From Apache's official website, download a version of Hadoop
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
 Click on the link, then Save.
2.2. Extract the zipped folder
 Go to the Downloads directory and extract the zipped folder

sudo tar -zxvf hadoop-3.2.4.tar.gz
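You can check that the extraction succeeded by listing the extracted folder (a quick sanity check; the exact contents may vary slightly between versions):

ls hadoop-3.2.4

You should see, among others, the bin, sbin, etc and share directories.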

2.3. Create the hadoop folder in /usr/local


 Enter the command: sudo mkdir /usr/local/hadoop

2.4. Move the downloaded Hadoop source folder


sudo mv Téléchargements/hadoop-3.2.4 /usr/local/hadoop

Note: Téléchargements is the name of the Downloads directory on a French-locale system; use
Downloads if your system is in English. The command assumes it is run from your home directory.
2.5. Add the following environment variables to the .bashrc file
Note: .bashrc is a hidden file in your home directory (/home/<name>/.bashrc) that is read each
time you open a terminal; it is commonly used to define aliases and environment variables. Open
it with nano (or gedit) ~/.bashrc (no sudo is needed for a file in your own home directory), and
write these lines at the bottom of the file, adapting the paths if your Java or Hadoop versions differ.
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-3.2.4
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

To save in nano: press Ctrl+X, then O (Y on an English-locale system), then Enter.


For these variables to take effect, you need to restart the terminal or type the command
source ~/.bashrc
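To check that the variables are set, run (a quick sanity check, assuming the paths above):

echo $HADOOP_INSTALL
hadoop version

The second command should print Hadoop 3.2.4 if the PATH was updated correctly.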
2.6. Configure Hadoop on a single-node cluster (pseudo-distributed mode)
To configure Hadoop, you need to modify five files located in
/usr/local/hadoop/hadoop-<version>/etc/hadoop/, where all the configuration files are found:
hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
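You can list this directory to see the files you are about to edit:

ls $HADOOP_INSTALL/etc/hadoop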

2.6.1. Modify the first file hadoop-env.sh: the startup file for Hadoop daemons.
Daemons, in programming terms, are processes running in the background. Hadoop has
five daemons: NameNode, SecondaryNameNode, DataNode, NodeManager, and
ResourceManager.

Since Hadoop is developed in Java, we need to specify the JDK path so it can activate its
daemons. To modify the JAVA_HOME path in the hadoop-env.sh file, first type one of the
following commands to open it.
sudo nano $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh
or sudo nano /usr/local/hadoop/hadoop-<version>/etc/hadoop/hadoop-env.sh
or sudo gedit /usr/local/hadoop/hadoop-<version>/etc/hadoop/hadoop-env.sh

Go to the line containing « export JAVA_HOME={…} » and replace the path with
/usr/lib/jvm/<java version>
where <java version> corresponds to the JDK name used in step 2.5 (e.g. java-11-openjdk-amd64).
Once finished, type Ctrl+X, then O, then Enter to save the changes.
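If you are unsure which JDK is installed, the following commands can help you find the path (a sketch; the output depends on your system):

ls /usr/lib/jvm
readlink -f $(which java)

The first lists the installed JVMs; the second resolves the java binary to its real location.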

2.6.2. Modify the second file core-site.xml

The core-site.xml file tells the Hadoop daemons that a NameNode is running on the
cluster by specifying its address.

Since we have only one machine in the cluster, the NameNode will be on localhost
(127.0.0.1). Port 9000 is associated with the HDFS file system.
To modify core-site.xml, type:

sudo nano $HADOOP_INSTALL/etc/hadoop/core-site.xml


At the end of the file, add the following lines:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

 Once finished, press Ctrl X, then O, then Enter to save the changes made.
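You can verify that Hadoop picks up this setting (assuming the environment variables from step 2.5 are active):

hdfs getconf -confKey fs.defaultFS

which should print hdfs://localhost:9000.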
2.6.3. Modify the third file hdfs-site.xml
The hdfs-site.xml file informs Hadoop and its HDFS system of the number of replications
(property 1) (only one in our case, since we have a single machine), the directory where the
NameNode stores its metadata and transaction history (property 2), the directory where the
DataNode stores its blocks (property 3), and the addresses of the web interfaces of the
NameNode and the SecondaryNameNode (properties 4 and 5).

Modify this file by typing:


sudo nano $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml

 At the end of the file, add the following lines:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/home/…/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/home/…/data</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>localhost:50070</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>localhost:50090</value>
    </property>
</configuration>
 Replace the … with your username.

 Once finished, press Ctrl X, then O, then Enter to save the changes made.
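The two directories referenced above must exist and be writable by the user running Hadoop. A minimal sketch to create them up front ($USER expands to your username, the value substituted for …):

mkdir -p /home/$USER/name /home/$USER/data

Note also that dfs.name.dir and dfs.data.dir are deprecated names in Hadoop 3 (the current ones are dfs.namenode.name.dir and dfs.datanode.data.dir); both forms work, but a deprecation warning may appear in the logs.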

2.6.4. Modify the fourth file mapred-site.xml
The mapred-site.xml file mainly tells MapReduce that it will run as a YARN
application (separating resource management from job management).
Open the mapred-site.xml file to make modifications:
sudo nano $HADOOP_INSTALL/etc/hadoop/mapred-site.xml

 In the <configuration> tag, add the following lines:


<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop/hadoop-3.2.4</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop/hadoop-3.2.4</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop/hadoop-3.2.4</value>
</property>

- The property mapreduce.framework.name specifies the framework that MapReduce
must use to execute tasks. It determines the environment in which MapReduce jobs
will run, and it can significantly impact how resources are managed and how tasks are
scheduled. The main possible values are yarn and local (the latter runs the whole job
in a single JVM and is generally used for development and testing).
- The property yarn.app.mapreduce.am.env defines environment variables that will be
available to the MapReduce Application Master during its execution.
- The property mapreduce.map.env defines a set of environment variables that will be
passed to the Mapper processes.
- The property mapreduce.reduce.env defines a set of environment variables that will be
passed to the Reducer processes.

 Once finished, press Ctrl X, then O, then Enter to save the changes made.

2.6.5. Modify the fifth file yarn-site.xml

The yarn-site.xml file is essential for the configuration and optimization of YARN's behavior,
particularly for:

 Configuration of YARN Resources (how resources, such as memory and CPU, are allocated
to applications running on the Hadoop cluster).

 Definition of YARN Components (where these components are located, how they
communicate, and their operational properties).
 Management of Applications (the types of applications that can be executed and the
scheduling policies to be used, including parameters for expiration times, job priorities, and
other aspects of scheduling).
 Configuration of Network Parameters (such as the ports used for communication between
the ResourceManager and the NodeManagers, etc.).
 Definition of Quotas and Limits (such as quotas for allocated resources, ensuring that no
application monopolizes the cluster's resources).

In this lab, the yarn-site.xml file tells the NodeManager to run an auxiliary service
(mapreduce_shuffle) that lets MapReduce perform its shuffle phase.
sudo nano $HADOOP_INSTALL/etc/hadoop/yarn-site.xml
 In the <configuration> tag, add the following lines:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

 Once finished, press Ctrl + X, then O, and then Enter to save the changes made.
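To confirm the edit was saved (a quick check):

grep -A 1 aux-services $HADOOP_INSTALL/etc/hadoop/yarn-site.xml

You should see the mapreduce_shuffle value on the line that follows.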

2.7. Verify the Installation


Once you have completed the Hadoop configuration, you need to verify the installation. To do this,
you must first format the HDFS file system (only the first time) before starting Hadoop:
hdfs namenode -format
The command prints many log lines; near the end, you should see a message indicating that the
storage directory has been successfully formatted.

2.8. Check Active Services
Before starting Hadoop, check the currently active Java services using the jps command:
jps
2.9. Start the Hadoop System
start-all.sh
Note: start-all.sh is deprecated in recent Hadoop versions; running start-dfs.sh and then
start-yarn.sh is equivalent, but the combined script still works.
2.10. Check Active Services After Startup
Run jps again. If the installation was successful, the five daemons should now be listed in
addition to the Jps process itself.
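For illustration, the output should resemble the following (the process IDs will differ on your machine):

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps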

If any of these services are missing, you may have a configuration error. Start by checking the log
files located in the $HADOOP_INSTALL/logs directory. For example, if the NameNode is not started,
check the files related to the NameNode.
If all services are present, Hadoop is functional on your machine.

View the NameNode web interface


Using your web browser on the virtual machine, you can access the NameNode web interface
at http://localhost:50070/
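As a final smoke test, you can run one of the example MapReduce jobs shipped with Hadoop (a sketch; the jar path assumes the 3.2.4 layout used in this guide):

hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.4.jar pi 2 5

The job estimates π using 2 map tasks and 5 samples per map; if it completes and prints an estimate, HDFS, YARN, and MapReduce are working together. When you are done, stop-all.sh stops all the daemons.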
