
Objectives

- Set up Hadoop in Single-Node and Pseudo-Cluster Node modes.
- Create a WordCount MapReduce program in the Eclipse IDE.
- Test the Hadoop installation by running the MapReduce program created above.

Step 1. After you have installed Ubuntu (stand-alone, dual-boot, or as a VM), update the package lists:

$ sudo apt-get update

Step 2. Install the Java JDK (you may also install it directly from the Oracle website):
http://www.oracle.com/technetwork/java/javase/downloads/index.html

Step 3. Move the existing Java binary, if there is one, out of the way to /usr/bin/java_old.

Issue the following command:
3.1 $ sudo mv /usr/bin/java /usr/bin/java_old
If this command fails, do not be concerned; it simply means /usr/bin/java does not exist yet.

3.2 Download the 32-bit or 64-bit Linux "compressed binary file" - it has a ".tar.gz" file extension - and extract it:

$ tar -xvf jdk-8*

Assuming you have downloaded the 1.8.0_92 version:


sudo mkdir /usr/lib/jvm
sudo mv ./jdk1.8.0_92 /usr/lib/jvm/
ls /usr/lib/jvm
You should see the jdk1.8.0_92 directory there.

sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0_92/bin/java" 4000
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.8.0_92/bin/javac" 4000
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.8.0_92/bin/javaws" 4000
sudo chmod a+x /usr/bin/java
sudo chmod a+x /usr/bin/javac
sudo chmod a+x /usr/bin/javaws
sudo chown -R root:root /usr/lib/jvm/jdk1.8.0_92

sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config javaws

$ java -version

Step 4. We will use a dedicated Hadoop user account for running Hadoop.
While that is not required, it is recommended because it helps separate the Hadoop installation from other software applications and user accounts running on the same machine (for example: security, permissions, backups).

Create a new user dedicated to running MapReduce programs.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo

This will add the user hduser and the group hadoop to your local machine.

$ su - hduser

Step 5. Configuring SSH:

Hadoop requires SSH access to manage its nodes, i.e. remote machines as well as your local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous step.
rsync is a fast and extraordinarily versatile file copying tool used alongside ssh. It can copy locally, to/from another host over any remote shell, or to/from a remote rsync daemon.

Install ssh and rsync, and set up key-based ssh to the hduser account itself. To do this, execute the following commands:
$ sudo apt-get install ssh

$ sudo apt-get install rsync

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa


This command creates a DSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed so the key can be unlocked without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys


The above command enables SSH access to your local machine with the newly created key. Now check whether you can ssh to localhost without a password. This step is also needed to save your local machine's host key fingerprint to the hduser user's known_hosts file.
$ ssh localhost
$ exit

Step 6. Setting up Environment Variables

$ gedit ~/.bashrc

Append the following lines at the end of the file:

#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_92
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

export HADOOP_PREFIX=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_PREFIX:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
export HADOOP_LOG_DIR=$HADOOP_HOME/logs

$ source ~/.bashrc
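
As a quick optional check, confirm that the variables are now visible in your shell:

$ echo $JAVA_HOME
$ echo $HADOOP_HOME

Both should print the paths set above.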

Disabling IPv6
One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various
networking-related
Hadoop configuration options will result in Hadoop binding to the IPv6
addresses of your Ubuntu OS.
There is no practical point in enabling IPv6 on a box when it is not
connected to any IPv6 network.

$ sudo gedit /etc/sysctl.conf

Add the following lines to the end of the file:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Save the above config and reboot your machine in order to make the
changes take effect.
Then check whether IPv6 is enabled on your machine with the following
command:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 0 means IPv6 is enabled, a value of 1 means disabled
(that’s what we want).

Step 7. Download and install Hadoop:
Download Apache Hadoop and untar it. You may also use git to clone and download the latest version.
$ wget -c http://mirror.olnevhost.net/pub/apache/hadoop/common/current/hadoop-2.6.0.tar.gz

$ sudo tar -zxvf hadoop-2.6.0.tar.gz

The following move command is needed because HADOOP_PREFIX is defined in .bashrc as /usr/local/hadoop:
$ sudo mv hadoop-2.6.0 /usr/local/hadoop

Step 8. Configure Hadoop:

We will now set up Hadoop in Pseudo-Cluster Node mode, where each Hadoop daemon runs in a separate Java process.
Open hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable:
$ cd $HADOOP_PREFIX/etc/hadoop

$ sudo gedit hadoop-env.sh

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_92

Also set HADOOP_OPTS so that Hadoop prefers IPv4, as shown below:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

Make both changes in hadoop-env.sh and save the file.

Step 9. Configure core-site.xml

This xml file is used to configure the directory where Hadoop stores its data files, the network ports it listens on, and so on. There are two properties that need to be set. The first is fs.defaultFS, which sets the host and port of the NameNode (the metadata server for HDFS). The second is hadoop.http.staticuser.user, which sets the default user name to hdfs. Copy the following lines into the Hadoop etc/hadoop/core-site.xml file, replacing the original empty <configuration> </configuration> tags.
$ sudo gedit core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>hdfs</value>
</property>
</configuration>
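
As an optional check (not part of the original steps; hdfs getconf only reads the configuration files, so no daemons need to be running yet), you can confirm that Hadoop picks up the new setting:

$ hdfs getconf -confKey fs.defaultFS

This should print hdfs://localhost:9000.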

Step 10. Make a copy of the mapred-site.xml template file

$ sudo cp mapred-site.xml.template mapred-site.xml

Step 11. Configure mapred-site.xml

Here we configure which framework MapReduce will run on. A new configuration option in Hadoop 2.x is the ability to specify a framework name for MapReduce by setting the mapreduce.framework.name property. In this install we use the value "yarn" to tell MapReduce that it will run as a YARN application.
$ sudo gedit mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 12. Configure hdfs-site.xml

In single-node pseudo-distributed mode, we do not need or want HDFS to replicate file blocks. By default, HDFS keeps three copies of each block in the filesystem. There is no need for replication on a single machine, so the dfs.replication value will be set to one. In hdfs-site.xml we also specify the NameNode and DataNode data directories; these are the directories used by the respective HDFS components to store their data. Copy the following into Hadoop etc/hadoop/hdfs-site.xml, replacing the original empty <configuration> </configuration> tags.
$ sudo gedit hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
</configuration>

Step 13. Configure yarn-site.xml

The yarn.nodemanager.aux-services property tells the NodeManagers that there will be an auxiliary service called mapreduce_shuffle that they need to implement. After we tell the NodeManagers to implement that service, we give them a class name as the means to implement it; in this case it is yarn.nodemanager.aux-services.mapreduce_shuffle.class. Specifically, what this configuration does is tell MapReduce how to do its shuffle. Because NodeManagers will not shuffle data for a non-MapReduce job by default, we need to configure such a service for MapReduce. Copy the following into the Hadoop etc/hadoop/yarn-site.xml file, replacing the original empty <configuration> </configuration> tags.
$ sudo gedit yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 14. Modify Java Heap Sizes

The Hadoop installation has various environment variables that determine the heap sizes for each Hadoop process. These are defined in the etc/hadoop/*-env.sh files used by Hadoop. The default for most of the processes is a 1 GB heap size, but since we are running on a workstation that will probably have limited resources compared to a standard server, we need to adjust the heap size settings. The values that follow are adequate for a small workstation or server; they can be adjusted to fit your machine.

Edit the hadoop-env.sh file to reflect the following (do not forget to remove the "#" at the beginning of each line):
HADOOP_HEAPSIZE=500
HADOOP_NAMENODE_INIT_HEAPSIZE="500"
Next, edit mapred-env.sh to reflect the following:
HADOOP_JOB_HISTORYSERVER_HEAPSIZE=250
Finally, edit yarn-env.sh to reflect the following:
JAVA_HEAP_MAX=-Xmx500m
The following will also need to be added to yarn-env.sh:
YARN_HEAPSIZE=500

$ cd

Step 15. Create directories for the NameNode and DataNode, and take ownership of $HADOOP_PREFIX so that the MapReduce program has permission to write.
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
$ sudo chown hduser:hadoop -R /usr/local/hadoop

Step 16. Format HDFS

In order for the HDFS NameNode to start, it needs to initialize the directory where it will hold its data. The NameNode service tracks all the metadata for the filesystem. The format process uses the value assigned to dfs.namenode.name.dir in etc/hadoop/hdfs-site.xml earlier (i.e., /usr/local/hadoop/hadoop_data/hdfs/namenode). Formatting destroys everything in that directory and sets up a new filesystem. Format the NameNode directory as the HDFS superuser, which in this setup is the hduser account:
$ hdfs namenode -format

Step 17. Start the HDFS Services

$ start-dfs.sh

As a sanity check, issue a jps command to see that all the services (namely NameNode, SecondaryNameNode, and DataNode) are running.
$ jps

Step 18. Start the YARN Services

As with the HDFS services, the YARN services need to be started: one ResourceManager and one NodeManager.
$ start-yarn.sh
Issue the jps command again to see that the ResourceManager and NodeManager have also started.
$ jps
If any services are missing, check the log file for the specific service. Similar to HDFS, the services can be stopped by issuing a stop argument to the daemon script:
$ yarn-daemon.sh stop nodemanager

Step 19. Verify the Running Services Using the Web Interface
Both HDFS and the YARN ResourceManager have a web interface. These interfaces are a convenient way to browse many aspects of your Hadoop installation. To monitor HDFS, enter the following:
$ firefox http://localhost:50070 &
Connecting to port 50070 brings up the NameNode web interface. The web interface for the ResourceManager can be viewed by entering the following:
$ firefox http://localhost:8088 &
The SecondaryNameNode (port 50090) and DataNode (port 50075) web interfaces can be viewed similarly:
$ firefox http://localhost:50090/ &
$ firefox http://localhost:50075/ &

Step 20. Store an input file in HDFS to be used for the WordCount application. Any local text file will do; here we create a small one named testing.txt:
$ jps >> testing.txt

$ hdfs dfs -mkdir -p /user/hadoop/input

$ hdfs dfs -copyFromLocal testing.txt /user/hadoop/input

$ hdfs dfs -ls /user/hadoop/input

Step 21. Download and install Eclipse (Kepler) as mentioned in the previous post.

Step 22. Run Eclipse and specify the workspace directory when prompted.

$ cd /opt/eclipse/
$ ./eclipse &

Step 23. Create the WordCount Program

From the File > New menu, choose Java Project; a popup window appears. Use WordCount as the project name and click Finish.
From the New menu, select Class; a popup window appears. Use WordCount as the class name, tick "public static void main(String[] args)", and click Finish.
Open WordCount.java and replace its contents with the WordCount.java file uploaded along with this post.
Similarly, create two more classes, WordCountMapper and WordCountReducer, and replace their contents with the WordCountMapper.java and WordCountReducer.java files uploaded along with this post (a sketch of all three classes is given after these instructions).
Right-click on WordCount -> Properties and choose Java Build Path. Click on "Add External JARs", go to the directory /usr/local/hadoop/share/hadoop, and select all the jar files in that directory and in all its subdirectories.
Click on File -> Export -> JAR file, select a directory of your choice, and save the project as WordCount.jar.
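
For reference, here is a minimal sketch of what the three classes might look like, written against the standard Hadoop 2.x MapReduce API (org.apache.hadoop.mapreduce) and reusing the class names from this step. This is only an assumption about the uploaded files, which remain the authoritative versions.

// WordCountMapper.java (sketch)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line into tokens and emit a (word, 1) pair per token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

// WordCountReducer.java (sketch)
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the per-word counts emitted by the mappers (and combiner, if any).
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// WordCount.java (driver, sketch)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // the reducer doubles as a combiner
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}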
Step 24. Running the WordCount MapReduce job
$ hadoop jar WordCount.jar /user/hadoop/input /user/hadoop/output
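
Note: if Eclipse did not record a main class in the jar's manifest when you exported it, pass the driver class name explicitly after the jar name (assuming the driver class WordCount is in the default package, as above):

$ hadoop jar WordCount.jar WordCount /user/hadoop/input /user/hadoop/output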

Step 25. Viewing the output of the WordCount MapReduce job

$ hdfs dfs -ls /user/hadoop/output
$ hdfs dfs -cat /user/hadoop/output/part-*

Step 26. Stop the YARN and HDFS Services

$ stop-yarn.sh
$ stop-dfs.sh
