
How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2

We are going to set up a 4-node Hadoop cluster as below.

• NameNode (Master)
• SecondaryNameNode
• DataNode (Slave1)
• DataNode (Slave2)

Here’s what you will need

1. Amazon AWS Account


2. PuTTY Windows client (to connect to the Amazon EC2 instances)
3. PuTTYgen (to generate the private key – this will be used in PuTTY to connect to the EC2 instances)
4. WinSCP (secure copy)

1. Setting up Amazon EC2 Instances:


1.1 Get Amazon AWS Account by visiting this URL
https://aws.amazon.com/free/?sc_ichannel=ha&sc_icampaign=free-tier&sc_icontent=2234&all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc

If you do not already have an account, please create a new one. I already have an AWS account and
am going to skip the sign-up process. Amazon EC2 comes with eligible free-tier instances.
1.2 Launch Instance

Once you have signed up for an Amazon account, log in to Amazon Web Services, click on My
Account and navigate to the Amazon EC2 Console.

1.3 Select AMI

1.4 Select Instance Type

Select the micro instance


1.5 Configure Number of Instances

1.6 Add Storage

The minimum volume size is 8 GB.

1.7 Instance Description

Give your instance a name and description.


1.8 Define a Security Group

Create a new security group; later on we are going to modify it with additional security
rules.

1.9 Launch Instance and Create Security Pair

Review and Launch Instance.

Amazon EC2 uses public-key cryptography to encrypt and decrypt login information. Public-key
cryptography uses a public key to encrypt a piece of data, such as a password, and then the
recipient uses the private key to decrypt the data. The public and private keys are known as a key
pair.
1.10 Launching Instances

Once you click “Launch Instance”, 4 instances should be launched in the “pending” state.

Once they are in the “running” state, we are going to rename the instances as below.

1. HadoopNameNode (Master)
2. HadoopSecondaryNameNode
3. HadoopSlave1 (data node will reside here)
4. HadoopSlave2 (data node will reside here)
2. Setting up client access to Amazon Instances

Now, let’s make sure we can connect to all 4 instances. For that we are going to use the PuTTY client.
We are going to set up password-less SSH access among the servers to set up the cluster. This allows
remote access from the Master server to the Slave servers so the Master server can remotely start the
DataNode and TaskTracker services on the Slave servers.

We are going to use the downloaded hadoopec2cluster.pem file to generate the private key (.ppk). In
order to generate the private key we need the PuTTYgen client.

2.1 Generating Private Key

Let’s launch the PuTTYgen client and import the key pair we created during the launch instance step
– “hadoopec2cluster.pem”.

Navigate to Conversions and “Import Key”


Once you import the key, you can enter a passphrase to protect your private key or leave the
passphrase fields blank to use the private key without any passphrase. A passphrase protects the
private key from unauthorized access to the servers from your machine using your private key.

Any access to a server using a passphrase-protected private key will require the user to enter the
passphrase before the private key can be used to access the AWS EC2 server.

2.2 Save Private Key

Now save the private key by clicking on “Save Private Key” and click “Yes”, as we are going to
leave the passphrase empty.
Save the .ppk file and give it a meaningful name.

2.3 Connect to Amazon Instance

Let’s connect to HadoopNameNode first. Launch the PuTTY client, grab the public URL, and import the
.ppk private key that we just created for password-less SSH access. As per the Amazon
documentation, for Ubuntu machines the username is “ubuntu”.
2.3.1 Provide private key for authentication

2.3.2 Hostname and Port and Connection Type

Click “Open” to launch the PuTTY session.

It will prompt you for the username; enter ubuntu. If everything goes well you will be presented with a
welcome message and a Unix shell at the end.
Similarly, connect to the remaining 3 machines (HadoopSecondaryNameNode, HadoopSlave1 and HadoopSlave2)
to make sure you can connect successfully.
2.4 Enable Public Access

Issue the ifconfig command and note down the IP address. Next, we are going to update the
hostname with the EC2 public URL, and finally we are going to update the /etc/hosts file to map the
EC2 public URL to the IP address. This will help us to configure the master and slave nodes with
hostnames instead of IP addresses.

Following is the output of ifconfig on HadoopNameNode.
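
To update the hostname to the EC2 public URL, a command along these lines can be used on each node
(the DNS name below is just the master address used later in this guide; substitute each node's own
public DNS name):

$ sudo hostname ec2-52-66-240-250.ap-south-1.compute.amazonaws.com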

2.5 Modify /etc/hosts

Let’s change the host entry to the EC2 public IP and hostname.

Open /etc/hosts in vi. The very first line will show “127.0.0.1 localhost”; we need to replace
that with the Amazon EC2 hostname and IP address we just collected.
Modify the file and save your changes. Repeat sections 2.3 and 2.4 for the remaining 3 machines.
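
For example, the modified first line of /etc/hosts might look like the following, assuming a private IP
of 172.31.0.10 (a made-up value; use the IP address and public DNS name you noted for your own
instance):

172.31.0.10 ec2-52-66-240-250.ap-south-1.compute.amazonaws.com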

3. Setup WinSCP access to EC2 instances

In order to securely transfer files from your Windows machine to Amazon EC2, WinSCP is a
handy utility. Provide the hostname, username and private key file, save your configuration and
log in.
Upon successful login you will see the Unix file system of the logged-in user (/home/ubuntu) on your
Amazon EC2 Ubuntu machine.

Upload the .pem file to the master machine (HadoopNameNode) in the /home/ubuntu/.ssh folder. It will be
used while connecting to the slave nodes when starting the Hadoop daemons.

4. Apache Hadoop Installation and Cluster Setup


4.1 Update the packages and dependencies.

Let’s update the packages. I will start with the master; repeat this for all slaves.

$ sudo apt-get update

Once it’s complete, let’s install Java.

4.2 Install Java

Install OpenJDK 7 (JDK) on Ubuntu:

$ sudo apt-get install openjdk-7-jdk

Check if Ubuntu uses JDK 7
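
A quick way to check is to print the Java version; the exact build string varies, but it should report
version 1.7.x:

$ java -version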



Repeat this for all slaves.

4.3 Download Hadoop

We are going to use the Hadoop 2.7.1 stable version from the Apache site.

Issue the wget command from the shell on all the machines:

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Create a folder called /apache on all machines:

$ sudo mkdir /apache

Change the ownership of the /apache folder to the ubuntu user:

$ sudo chown ubuntu /apache

Extract the files into the /apache folder and review the package content and configuration files:

$ tar -xzvf hadoop-2.7.1.tar.gz -C /apache

In the /apache folder, create a softlink to the hadoop-2.7.1 folder:

$ cd /apache

$ ln -s hadoop-2.7.1 hadoop
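
You can verify the link, which should show hadoop pointing to hadoop-2.7.1:

$ ls -l /apache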

4.4 Setup Environment Variable

Navigate to home directory


$ cd

Open the .bashrc file in the vi editor:

$ vi .bashrc

Add the following at the end of the file:

export HADOOP_HOME=/apache/hadoop

export HADOOP_CONF_DIR=/apache/hadoop/etc/hadoop

export JAVA_HOME=/usr

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

To check whether it has been updated correctly, reload the bash profile with the following command:

$ source ~/.bashrc
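
As a quick sanity check (assuming the paths above), the hadoop command should now be on the PATH and
report version 2.7.1:

$ echo $HADOOP_HOME
$ hadoop version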

4.5 Setup Password-less SSH on Servers

Create a config file in the /home/ubuntu/.ssh folder on all the nodes and enter the following in it.

Host master
    HostName ec2-13-233-154-170.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem
Host slave1
    HostName ec2-35-154-115-217.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem
Host slave2
    HostName ec2-13-232-200-61.ap-south-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/aws_key.pem

Change the permissions of both the config and the .pem file to 600:

$ sudo chmod 600 config

$ sudo chmod 600 aws_key.pem


Run the ssh-keygen command on the master node:

$ ssh-keygen

Check the public and private RSA keys in the .ssh folder.
Append the id_rsa.pub file to the authorized_keys file:

$ cat id_rsa.pub >> authorized_keys

Copy the config, .pem and authorized_keys files to all the nodes:

$ scp config aws_key.pem authorized_keys slave1:~/.ssh
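
Repeat the same copy for the other node defined in the config file, for example:

$ scp config aws_key.pem authorized_keys slave2:~/.ssh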

Check the password-less SSH access as follows:

$ ssh slave1

4.6 Hadoop Cluster Setup

This section will cover the Hadoop cluster configuration. We will have to modify:

• hadoop-env.sh – This file contains some environment variable settings used by Hadoop.
You can use these to affect some aspects of Hadoop daemon behavior, such as where log
files are stored, the maximum amount of heap used, etc. The only variable you should
need to change in this file at this point is JAVA_HOME, which specifies the path to the
Java 1.7.x installation used by Hadoop.
• core-site.xml – key property fs.default.name – the namenode URI, for
example hdfs://namenode/
• hdfs-site.xml – key property dfs.replication – 3 by default
• mapred-site.xml – set the processing framework to YARN
• yarn-site.xml – set up the YARN-related properties.
• masters – defines on which machines Hadoop will start SecondaryNameNodes in our
multi-node cluster.
• slaves – defines the list of hosts, one per line, where the Hadoop slave daemons
(datanodes and tasktrackers) will run.

We will first start with the master (NameNode) and then copy the above changes to the remaining 3
nodes (all slaves).

1. hadoop-env.sh
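
The only change needed here is JAVA_HOME; a minimal edit, assuming the same Java location exported in
.bashrc above, would be:

export JAVA_HOME=/usr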

2. core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:9000</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/apache/hadoop/tmp</value>
</property>

Create the tmp folder

3. hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/apache/hadoop/hdfs/nn/</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/apache/hadoop/hdfs/dn/</value>
</property>

Create the namenode and datanode folders on all the nodes (the nn folder only on the master node):

mkdir -p /apache/hadoop/tmp

mkdir -p /apache/hadoop/hdfs/nn/

mkdir -p /apache/hadoop/hdfs/dn/

4. mapred-site.xml
Create the mapred-site.xml file from the mapred-site.xml.template file:

cp mapred-site.xml.template mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

5. yarn-site.xml (change the nodemanager localizer property to the respective node’s DNS name
on each node)

<property>
<name>yarn.resourcemanager.hostname</name>
<value>ec2-52-66-240-250.ap-south-1.compute.amazonaws.com</value>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:8040</value>
</property>
6. masters
Put the hostname of the secondary namenode:
slave1

7. slaves
Put the hostnames of the slave machines:

master
slave1
slave2

Copy all the files to all the nodes:

$ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml masters slaves slave1:$HADOOP_CONF_DIR
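
And similarly for the other slave defined in the SSH config file, for example:

$ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml masters slaves slave2:$HADOOP_CONF_DIR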

Update the yarn-site.xml file on each slave machine to reflect the respective public DNS names.

For slave1
For slave2
4.7 Hadoop Daemon Startup

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem (HDFS),
which is implemented on top of the local filesystems of your cluster. You need to do this the first
time you set up a Hadoop installation. Do not format a running Hadoop filesystem; this will cause
all your data to be erased.

To format the namenode

$ hadoop namenode -format
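
In Hadoop 2.x this command still works but is marked deprecated; the equivalent current form is:

$ hdfs namenode -format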

Let’s start all Hadoop daemons from HadoopNameNode:

$HADOOP_HOME/sbin/start-all.sh

To protect the EC2 instances from unauthorized access from the internet, set the security group as
follows.

Check the processes running on all the nodes using the jps command.
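
On the master you would expect to see daemons such as NameNode and ResourceManager listed, and on the
slaves DataNode and NodeManager (the exact list depends on which daemons you placed on each node):

$ jps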

Check the NameNode web UI on port 50070 on the master machine.

Check the ResourceManager web UI on port 8088 on the master machine.
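
For example, using the master’s public DNS name from core-site.xml (substitute your own, and make sure
the security group allows these ports), the web UIs are reachable in a browser at:

http://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:50070
http://ec2-52-66-240-250.ap-south-1.compute.amazonaws.com:8088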
