How To Install and Set Up A 3-Node Hadoop Cluster

How to Install and Set Up a 3-Node Hadoop Cluster
What is Hadoop?
Hadoop is an open-source Apache project that allows creation of parallel processing applications on large data sets,
distributed across networked nodes. It’s composed of the Hadoop Distributed File System (HDFS™) that handles scalability
and redundancy of data across nodes, and Hadoop YARN: a framework for job scheduling that executes data processing
tasks on all nodes.
Before You Begin

1. They’ll be referred to throughout this guide as node-master, node1 and node2.
Run the steps in this guide from the node-master unless otherwise specifed.
2. Create a normal user for the install, and a user called hadoop for any Hadoop daemons.
3.Install the JDK using the appropriate guide for your distribution.
4.The steps below use example IPs for each node. Adjust each example according to your confguration:
•node-master: 192.0.2.1
•node1: 192.0.2.2
•node2: 192.0.2.3
Note
This guide is written for a non-root user. Commands that require elevated privileges are prefixed with sudo. If you’re not familiar with
the sudo command, see the Users and Groups guide. All commands in this guide are run with the hadoop user if not specified otherwise.
Architecture of a Hadoop Cluster

Before confguring the master and slave nodes, it’s important to understand the diferent components of a Hadoop cluster.
A master node keeps knowledge about the distributed fle system, like the inode table on an ext3flesystem, and schedules
resources allocation. node-master will handle this role in this guide, and host two daemons:
•The NameNode: manages the distributed fle system and knows where stored data blocks inside the cluster are.
•The ResourceManager: manages the YARN jobs and takes care of scheduling and executing processes on slave
nodes.
Slave nodes store the actual data and provide processing power to run the jobs. They’ll be node1and node2, and will host
two daemons:
•The DataNode manages the actual data physically stored on the node.
•The NodeManager manages execution of tasks on the node.
Confgure the System
Create Host File on Each Node
For each node to communicate with its names, edit the /etc/hosts fle to add the IP address of the three servers. Don’t forget to
replace the sample IP with your IP:
/etc/hosts
1 192.0.2.1 node-master
2 192.0.2.2 node1
3 192.0.2.3 node2
Distribute Authentication Key-pairs for the Hadoop User
The master node will use an ssh-connection to connect to other nodes with key-pair authentication, to manage the cluster.
1.Login to node-master as the hadoop user, and generate an ssh-key:
ssh-keygen -b 4096
2.Copy the key to the other nodes. It’s good practice to also copy the key to the node-masteritself, so that you can also
use it as a DataNode if needed. Type the following commands, and enter the hadoop user’s password when asked. If
you are prompted whether or not to add the key to known hosts, enter yes:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node-master

ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node2
Download and Unpack Hadoop Binaries
Login to node-master as the hadoop user, download the Hadoop tarball from Hadoop project page, and unzip it:
cd
wget http://apache.mindstudios.com/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
tar -xzf hadoop-2.8.1.tar.gz
mv hadoop-2.8.1 hadoop
Set Environment Variables
1.Add Hadoop binaries to your PATH. Edit /home/hadoop/.profle and add the following line:
/home/hadoop/.profile
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:
1
$PATH
Confgure the Master Node

Confguration will be done on node-master and replicated to other nodes.
Set JAVA_HOME
1.Get your Java installation path. If you installed open-jdk from your package manager, you can get the path with the
command:
update-alternatives --display java
Take the value of the current link and remove the trailing /bin/java. For example on Debian, the link is /usr/lib/jvm/java-8-
openjdk-amd64/jre/bin/java, so JAVA_HOME should be /usr/lib/jvm/java-8-openjdk-amd64/jre.
If you installed java from Oracle, JAVA_HOME is the path where you unzipped the java archive.
2.Edit ~/hadoop/etc/hadoop/hadoop-env.sh and replace this line:
export JAVA_HOME=${JAVA_HOME}
with your actual java installation path. For example on a Debian with open-jdk-8:
~/hadoop/etc/hadoop/hadoop-env.sh
1 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
Set NameNode Location
On each node update ~/hadoop/etc/hadoop/core-site.xml you want to set the NameNode location to node-master on
port 9000:
~/hadoop/etc/hadoop/core-site.xml
1 <?xml version="1.0" encoding="UTF-8"?>
2 <?xml-stylesheet type="text/xsl" href="confguration.xsl"?>
3 <configuration>i
4 <property>i
5 <name>ifs.default.name</name>i
6 <value>ihdfs://node-master:9000</value>i
7 </property>i
8 </configuration>i
Set path for HDFS
Edit hdfs-site.conf:
~/hadoop/etc/hadoop/hdfs-site.xml
1 <configuration>i
2 <property>i
3 <name>idfs.namenode.name.dir</name>i
4 <value>i/home/hadoop/data/nameNode</value>i
5 </property>i
6
<property>i
<name>idfs.datanode.data.dir</name>i
<value>i/home/hadoop/data/dataNode</value>i
</property>i
7
8
<property>i
<name>idfs.replication</name>i
<value>i1</value>i
</property>i
</configuration>i
The last property, dfs.replication, indicates how many times data is replicated in the cluster. You can set 2 to have all the data
duplicated on the two nodes. Don’t enter a value higher than the actual number of slave nodes.
Set YARN as Job Scheduler
1.In ~/hadoop/etc/hadoop/, rename mapred-site.xml.template to mapred-site.xml:
cd ~/hadoop/etc/hadoop
mv mapred-site.xml.template mapred-site.xml
2.Edit the fle, setting yarn as the default framework for MapReduce operations:
~/hadoop/etc/hadoop/mapred-site.xml
1 <configuration>i
2 <property>i
3 <name>imapreduce.framework.name</name>i
4 <value>iyarn</value>i
5 </property>i
6 </configuration>i
Confgure YARN
Edit yarn-site.xml:
~/hadoop/etc/hadoop/yarn-site.xml
<configuration>i
1 <property>i
<name>iyarn.acl.enable</name>i
2
<value>i0</value>i
</property>i
3
<property>i
4
<name>iyarn.resourcemanager.hostname</name>i
<value>inode-master</value>i
5
</property>i
6
<property>i
7 <name>iyarn.nodemanager.aux-services</name>i
<value>imapreduce_shufe</value>i
8 </property>i
6 </configuration>i
Confgure Slaves
The fle slaves is used by startup scripts to start required daemons on all nodes. Edit ~/hadoop/etc/hadoop/slaves to be:
~/hadoop/etc/hadoop/slaves
1 node1
2 node2
Confgure Memory Allocation

Memory allocation can be tricky on low RAM nodes because default values are not suitable for nodes with less than 8GB of
RAM. This section will highlight how memory allocation works for MapReduce jobs, and provide a sample confguration for
2GB RAM nodes.

The Memory Allocation Properties
A YARN job is executed with two kind of resources:
•An Application Master (AM) is responsible for monitoring the application and coordinating distributed executors in the
cluster.
•Some executors that are created by the AM actually run the job. For a MapReduce jobs, they’ll perform map or
reduce operation, in parallel.
Both are run in containers on slave nodes. Each slave node runs a NodeManager daemon that’s responsible for container
creation on the node. The whole cluster is managed by a ResourceManager that schedules container allocation on all the
slave-nodes, depending on capacity requirements and current charge.
Four types of resource allocations need to be confgured properly for the cluster to work. These are:
1.How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the
others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire
amount of RAM on the node.
This value is confgured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.
2.How much memory a single container can consume and the minimum memory allocation allowed. A container will
never be bigger than the maximum, or else allocation will fail and will always be allocated as a multiple of the
minimum amount of RAM.

Those values are confgured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-
mb.
3.How much memory will be allocated to the ApplicationMaster. This is a constant value that should ft in the container
maximum size.
This is confgured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.
4.How much memory will be allocated to each map or reduce operation. This should be less than the maximum size.
This is confgured in mapred-site.xml with properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
The relationship between all those properties can be seen in the following fgure:
Sample Confguration for 2GB Nodes
For 2GB nodes, a working confguration may be:
Property Value
yarn.nodemanager.resource.memory-mb 1536
Property Value
yarn.scheduler.maximum-allocation-mb 1536
yarn.scheduler.minimum-allocation-mb 128
yarn.app.mapreduce.am.resource.mb 512
mapreduce.map.memory.mb 256
mapreduce.reduce.memory.mb 256
1.Edit /home/hadoop/hadoop/etc/hadoop/yarn-site.xml and add the following lines:
~/hadoop/etc/hadoop/yarn-site.xml
1 <property>i
2 <name>iyarn.nodemanager.resource.memory-mb</name>i
3 <value>i1536</value>i
4 </property>i
5
6 <property>i
7 <name>iyarn.scheduler.maximum-allocation-mb</name>i
9 </property>i
10
11 <property>i
12 <name>iyarn.scheduler.minimum-allocation-mb</name>i
14 </property>i
15
16 <property>i
17 <name>iyarn.nodemanager.vmem-check-enabled</name>i
18 <value>ifalse</value>i
19 </property>i
The last property disables virtual-memory checking and can prevent containers from being allocated properly on JDK8.
2.Edit /home/hadoop/hadoop/etc/hadoop/mapred-site.xml and add the following lines:
~/hadoop/etc/hadoop/mapred-site.xml
1 <property>i
2 <name>iyarn.app.mapreduce.am.resource.mb</name>i
4 </property>i
5
6 <property>i
7 <name>imapreduce.map.memory.mb</name>i
9 </property>i
10
11 <property>i
12 <name>imapreduce.reduce.memory.mb</name>i
14 </property>i
Duplicate Confg Files on Each Node

1.Copy the hadoop binaries to slave nodes:
cd /home/hadoop/
scp hadoop-*.tar.gz node1:/home/hadoop
scp hadoop-*.tar.gz node2:/home/hadoop
2.Connect to node1 via ssh. A password isn’t required, thanks to the ssh keys copied above:
ssh node1
3.Unzip the binaries, rename the directory, and exit node1 to get back on the node-master:
tar -xzf hadoop-2.8.1.tar.gz
mv hadoop-2.8.1 hadoop
exit
4.Repeat steps 2 and 3 for node2.
5.Copy the Hadoop confguration fles to the slave nodes:
for node in node1 node2; do
scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/;
done
Format HDFS
HDFS needs to be formatted like any classical fle system. On node-master, run the following command:
hdfs namenode -format
Your Hadoop installation is now confgured and ready to run.

Run and monitor HDFS
This section will walk through starting HDFS on NameNode and DataNodes, and monitoring that everything is properly working
and interacting with HDFS data.
Start and Stop HDFS
1.Start the HDFS by running the following script from node-master:
start-dfs.sh
It’ll start NameNode and SecondaryNameNode on node-master, and DataNode on node1and node2, according to
the confguration in the slaves confg fle.
2.Check that every process is running with the jps command on each node. You should get on node-master (PID will be
diferent):
21922 Jps
21603 NameNode
21787 SecondaryNameNode
and on node1 and node2:

19728 DataNode
19819 Jps
3.To stop HDFS on master and slave nodes, run the following command from node-master:
stop-dfs.sh
Monitor your HDFS Cluster
1.You can get useful information about running your HDFS cluster with the hdfs dfsadmincommand. Try for example:
hdfs dfsadmin -report
This will print information (e.g., capacity and usage) for all running DataNodes. To get the description of all available
commands, type:
hdfs dfsadmin -help
2.You can also automatically use the friendlier web user interface. Point your browser to http://node-master-
IP:50070 and you’ll get a user-friendly monitoring console.

Put and Get Data to HDFS
Writing and reading to HDFS is done with command hdfs dfs. First, manually create your home directory. All other commands
will use a path relative to this default home directory:
hdfs dfs -mkdir -p /user/hadoop
Let’s use some textbooks from the Gutenberg project as an example.
1.Create a books directory in HDFS. The following command will create it in the home directory, /user/hadoop/books:
hdfs dfs -mkdir books
2.Grab a few books from the Gutenberg project:
cd /home/hadoop
wget -O alice.txt https://www.gutenberg.org/fles/11/11-0.txt
wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8
wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8
3.Put the three books through HDFS, in the booksdirectory:
hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
4.List the contents of the book directory:

hdfs dfs -ls books
5.Move one of the books to the local flesystem:
hdfs dfs -get books/alice.txt
6.You can also directly print the books from HDFS:
hdfs dfs -cat books/alice.txt
There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell
documentation, or print help with:
hdfs dfs -help
Run YARN
HDFS is a distributed storage system, it doesn’t provide any services for running and scheduling tasks in the cluster. This is the
role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN.
Start and Stop YARN
1.Start YARN with the script:

start-yarn.sh
2.Check that everything is running with the jps command. In addition to the previous HDFS daemon, you should see
a ResourceManager on node-master, and a NodeManager on node1 and node2.
3.To stop YARN, run the following command on node-master:
stop-yarn.sh
Monitor YARN
1.The yarn command provides utilities to manage your YARN cluster. You can also print a report of running nodes with
the command:
yarn node -list
Similarly, you can get a list of running applications with command:
yarn application -list
To get all available parameters of the yarn command, see Apache YARN documentation.
2.As with HDFS, YARN provides a friendlier web UI, started by default on port 8088 of the Resource Manager. Point your
browser to http://node-master-IP:8088 and browse the UI:

Submit MapReduce Jobs to YARN
Yarn jobs are packaged into jar fles and submitted to YARN for execution with the command yarn jar. The Hadoop installation
package provides sample applications that can be run to test your cluster. You’ll use them to run a word count on the three
books previously uploaded to HDFS.
1.Submit a job with the sample jar to YARN. On node-master, run:
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount "books/*" output
The last argument is where the output of the job will be saved - in HDFS.
2.After the job is fnished, you can get the result by querying HDFS with hdfs dfs -ls output. In case of a success, the output
will resemble:
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2017-10-11 14:09 output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 269158 2017-10-11 14:09 output/part-r-00000
3.Print the result with:
hdfs dfs -cat output/part-r-00000

Before You Begine
P
1.Follow our guide on

r how to install and confgure a three-node Hadoop cluster to set up your YARN cluster. The master
m
node (HDFS NameNodea and YARN ResourceManager) is called node-master and the slave nodes (HDFS DataNode
l
and YARN NodeManager) are called node1 and node2.
i
Run the commandsnin this guide from node-master unless otherwise specifed.
k
2.Be sure you have a hadoop user that can access all cluster nodes with SSH keys without a password.
3.Note the path of your Hadoop installation. This guide assumes it is installed in /home/hadoop/hadoop. If it is not, adjust
the path in the examples accordingly.
4.Run jps on each of the nodes to confrm that HDFS and YARN are running. If they are not, start the services with:
start-dfs.sh
start-yarn.sh
Note
This guide is written for a non-root user. Commands that require elevated privileges are prefixed with sudo. If you’re not familiar with
the sudo command, see the Users and Groups guide.
Download and Install Spark Binaries
Spark binaries are available from the Apache Spark download page. Adjust each command below to match the correct
version number.
1.Get the download URL from the Spark download page, download it, and uncompress it.
For Spark 2.2.0 with Hadoop 2.7 or later, log on node-master as the hadoop user, and run:
cd /home/hadoop
wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
tar -xvf spark-2.2.0-bin-hadoop2.7.tgz
mv spark-2.2.0-bin-hadoop2.7 spark
2.Add the Spark binaries directory to your PATH. Edit /home/hadoop/.profle and add the following line:
For Debian/Ubuntu systems:
1 PATH=/home/hadoop/spark/bin:$PATH
For RedHat/Fedora/CentOS systems:
1 pathmunge /home/hadoop/spark/bin
Integrate Spark with YARN
To communicate with the YARN Resource Manager, Spark needs to be aware of your Hadoop confguration. This is done via
the HADOOP_CONF_DIR environment variable. The SPARK_HOME variable is not mandatory, but is useful when submitting Spark
jobs from the command line.
1.Edit the hadoop user profle /home/hadoop/.profle and add the following lines:
1 export HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
2 export SPARK_HOME=/home/hadoop/spark
3 export LD_LIBRARY_PATH=/home/hadoop/hadoop/lib/native:$LD_LIBRARY_PATH
2.Restart your session by logging out and logging in again.
3.Rename the spark default template confg fle:
mv $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
4.Edit $SPARK_HOME/conf/spark-defaults.conf and set spark.master to yarn:
$SPARK_HOME/conf/spark-defaults.conf
1 spark.master yarn
Spark is now ready to interact with your YARN cluster.

Understand Client and Cluster Mode
Spark jobs can run on YARN in two modes: cluster mode and client mode. Understanding the diference between the two
modes is important for choosing an appropriate memory allocation confguration, and to submit jobs as expected.
A Spark job consists of two parts: Spark Executors that run the actual tasks, and a Spark Driver that schedules the Executors.
•Cluster mode: everything runs inside the cluster. You can start a job from your laptop and the job will continue running
even if you close your computer. In this mode, the Spark Driver is encapsulated inside the YARN Application Master.
•Client mode the Spark driver runs on a client, such as your laptop. If the client is shut down, the job fails. Spark
Executors still run on the cluster, and to schedule everything, a small YARN Application Master is created.
Client mode is well suited for interactive jobs, but applications will fail if the client stops. For long running jobs, cluster mode is
more appropriate.
Confgure Memory Allocation

Allocation of Spark containers to run in YARN containers may fail if memory allocation is not confgured properly. For nodes
with less than 4G RAM, the default confguration is not adequate and may trigger swapping and poor performance, or even
the failure of application initialization due to lack of memory.
Be sure to understand how Hadoop YARN manages memory allocation before editing Spark memory settings so that your
changes are compatible with your YARN cluster’s limits.

Note
See the memory allocation section of the Install and Configure a 3-Node Hadoop Cluster guide for more details on managing your YARN cluster’s
memory.
Give Your YARN Containers Maximum Allowed Memory
If the memory requested is above the maximum allowed, YARN will reject creation of the container, and your Spark
application won’t start.
1.Get the value of yarn.scheduler.maximum-allocation-mb in $HADOOP_CONF_DIR/yarn-site.xml. This is the maximum allowed
value, in MB, for a single container.
2.Make sure that values for Spark memory allocation, confgured in the following section, are below the maximum.
This guide will use a sample value of 1536 for yarn.scheduler.maximum-allocation-mb. If your settings are lower, adjust the samples
with your confguration.
Confgure the Spark Driver Memory Allocation in Cluster Mode
In cluster mode, the Spark Driver runs inside YARN Application Master. The amount of memory requested by Spark at
initialization is confgured either in spark-defaults.conf, or through the command line.
From spark-defaults.conf
•Set the default amount of memory allocated to Spark Driver in cluster mode via spark.driver.memory (this value defaults
to 1G). To set it to 512MB, edit the fle:
1 spark.driver.memory 512m
From the Command Line
•Use the --driver-memory parameter to specify the amount of memory requested by spark-submit. See the following section
about application submission for examples.

Note
Values given from the command line will override whatever has been set in spark-defaults.conf.
Confgure the Spark Application Master Memory Allocation in Client Mode
In client mode, the Spark driver will not run on the cluster, so the above confguration will have no efect. A YARN Application
Master still needs to be created to schedule the Spark executor, and you can set its memory requirements.
Set the amount of memory allocated to Application Master in client mode with spark.yarn.am.memory(default to 512M)
1 spark.yarn.am.memory 512m
This value can not be set from the command line.
Confgure Spark Executors’ Memory Allocation
The Spark Executors’ memory allocation is calculated based on two parameters inside $SPARK_HOME/conf/spark-defaults.conf:
•spark.executor.memory: sets the base memory used in calculation

•spark.yarn.executor.memoryOverhead: is added to the base memory. It defaults to 7% of base memory, with a minimum
of 384MB
Note
Make sure that Executor requested memory, including overhead memory, is below the YARN container maximum size, otherwise the Spark
application won’t initialize.
Example: for spark.executor.memory of 1Gb , the required memory is 1024+384=1408MB. For 512MB, the required memory will be
512+384=896MB
To set executor memory to 512MB, edit $SPARK_HOME/conf/spark-defaults.conf and add the following line:
1 spark.executor.memory 512m
How to Submit a Spark Application to the YARN Cluster

Applications are submitted with the spark-submit command. The Spark installation package contains sample applications, like
the parallel calculation of Pi, that you can run to practice starting Spark jobs.
To run the sample Pi calculation, use the following command:
spark-submit --deploy-mode client \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10
The frst parameter, --deploy-mode, specifes which mode to use, client or cluster.
To run the same application in cluster mode, replace --deploy-mode clientwith --deploy-mode cluster.
Monitor Your Spark Applications
When you submit a job, Spark Driver automatically starts a web UI on port 4040 that displays information about the
application. However, when execution is fnished, the Web UI is dismissed with the application driver and can no longer be
accessed.
Spark provides a History Server that collects application logs from HDFS and displays them in a persistent web UI. The following
steps will enable log persistence in HDFS:
1.Edit $SPARK_HOME/conf/spark-defaults.conf and add the following lines to enable Spark jobs to log in HDFS:
1 spark.eventLog.enabled true
2 spark.eventLog.dir hdfs://node-master:9000/spark-logs
2.Create the log directory in HDFS:
hdfs dfs -mkdir /spark-logs
3.Confgure History Server related properties in $SPARK_HOME/conf/spark-defaults.conf:
1 spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
2 spark.history.fs.logDirectory hdfs://node-master:9000/spark-logs
3 spark.history.fs.update.interval 10s
4 spark.history.ui.port 18080
You may want to use a diferent update interval than the default 10s. If you specify a bigger interval, you will have some
delay between what you see in the History Server and the real time status of your application. If you use a shorter
interval, you will increase I/O on the HDFS.
4.Run the History Server:
$SPARK_HOME/sbin/start-history-server.sh
5.Repeat steps from previous section to start a job with spark-submit that will generate some logs in the HDFS:
6.Access the History Server by navigating to http://node-master:18080 in a web browser:

Run the Spark Shell
The Spark shell provides an interactive way to examine and work with your data.
1.Put some data into HDFS for analysis. This example uses the text of Alice In Wonderlandfrom the Gutenberg project:
cd /home/hadoop
wget -O alice.txt https://www.gutenberg.org/fles/11/11-0.txt
hdfs dfs -mkdir inputs
hdfs dfs -put alice.txt inputs
2.Start the Spark shell:
spark-shell
var input = spark.read.textFile("inputs/alice.txt")
// Count the number of non blank lines
input.flter(line => line.length()>0).count()
The Scala Spark API is beyond the scope of this guide. You can fnd the ofcial documentation on Ofcial Apache Spark
documentation.
Source : https://www.linode.com/

How To Install and Set Up A 3-Node Hadoop Cluster

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

How To Install and Set Up A 3-Node Hadoop Cluster

Uploaded by

Copyright:

Available Formats

How to Install and Set Up a 3-Node Hadoop Cluster

tasks on all nodes.

Before You Begin

Architecture of a Hadoop Cluster

replace the sample IP with your IP:

Distribute Authentication Key-pairs for the Hadoop User

1.Login to node-master as the hadoop user, and generate an ssh-key:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node-master

ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node2

Download and Unpack Hadoop Binaries

tar -xzf hadoop-2.8.1.tar.gz

Set Environment Variables

Confgure the Master Node

update-alternatives --display java

openjdk-amd64/jre/bin/java, so JAVA_HOME should be /usr/lib/jvm/java-8-openjdk-amd64/jre.

2.Edit ~/hadoop/etc/hadoop/hadoop-env.sh and replace this line:

Set NameNode Location

1 <?xml version="1.0" encoding="UTF-8"?>

2 <?xml-stylesheet type="text/xsl" href="confguration.xsl"?>

Set path for HDFS

1.In ~/hadoop/etc/hadoop/, rename mapred-site.xml.template to mapred-site.xml:

Confgure Memory Allocation

2GB RAM nodes.

A YARN job is executed with two kind of resources:

slave-nodes, depending on capacity requirements and current charge.

amount of RAM on the node.

This value is confgured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.

minimum amount of RAM.

This is confgured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.

This is confgured in mapred-site.xml with properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.

For 2GB nodes, a working confguration may be:

2.Edit /home/hadoop/hadoop/etc/hadoop/mapred-site.xml and add the following lines:

Duplicate Confg Files on Each Node

scp hadoop-*.tar.gz node1:/home/hadoop

scp hadoop-*.tar.gz node2:/home/hadoop

tar -xzf hadoop-2.8.1.tar.gz

4.Repeat steps 2 and 3 for node2.

5.Copy the Hadoop confguration fles to the slave nodes:

for node in node1 node2; do

scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/;

hdfs namenode -format

Your Hadoop installation is now confgured and ready to run.

and interacting with HDFS data.

Start and Stop HDFS

1.Start the HDFS by running the following script from node-master:

the confguration in the slaves confg fle.

and on node1 and node2:

Monitor your HDFS Cluster

hdfs dfsadmin -report

hdfs dfsadmin -help

IP:50070 and you’ll get a user-friendly monitoring console.

will use a path relative to this default home directory:

hdfs dfs -mkdir -p /user/hadoop

Let’s use some textbooks from the Gutenberg project as an example.

hdfs dfs -mkdir books

2.Grab a few books from the Gutenberg project:

wget -O alice.txt https://www.gutenberg.org/fles/11/11-0.txt

wget -O holmes.txt https://www.gutenberg.org/ebooks/1661.txt.utf-8

wget -O frankenstein.txt https://www.gutenberg.org/ebooks/84.txt.utf-8

3.Put the three books through HDFS, in the booksdirectory:

hdfs dfs -put alice.txt holmes.txt frankenstein.txt books

4.List the contents of the book directory:

5.Move one of the books to the local flesystem: