
Big Data Lab Manual

Experiment 1:
1. Implement the following Data structures in Java

a) Linked Lists b) Stacks c) Queues d) Set e) Map

a) Linked Lists

import java.util.*;
public class LinkedList1{
public static void main(String args[]){

LinkedList<String> al=new LinkedList<String>();


al.add("Ravi");
al.add("Vijay");
al.add("Ravi");
al.add("Ajay");

Iterator<String> itr=al.iterator();
while(itr.hasNext()){
System.out.println(itr.next());
}
}
}
Output: Ravi
Vijay
Ravi
Ajay

b) Stacks

import java.util.Stack;
public class StackEmptyMethodExample
{
public static void main(String[] args)
{
//creating an instance of Stack class
Stack<Integer> stk= new Stack<>();
// checking stack is empty or not
boolean result = stk.empty();
System.out.println("Is the stack empty? " + result);
// pushing elements into stack
stk.push(78);
stk.push(113);
stk.push(90);
stk.push(120);
//prints elements of the stack
System.out.println("Elements in Stack: " + stk);
result = stk.empty();
System.out.println("Is the stack empty? " + result);
}
}

Output:

Is the stack empty? true


Elements in Stack: [78, 113, 90, 120]
Is the stack empty? false

c) Queues
import java.util.*;

public class Main {


public static void main(String[] args) {
//declare a Queue
Queue<String> str_queue = new LinkedList<>();
//initialize the queue with values
str_queue.add("one");
str_queue.add("two");
str_queue.add("three");
str_queue.add("four");
//print the Queue
System.out.println("The Queue contents:" + str_queue);
}
}
Output:
The Queue contents:[one, two, three, four]

d)Set
// Java program Illustrating Set Interface

// Importing utility classes


import java.util.*;

// Main class
public class GFG {

// Main driver method


public static void main(String[] args)
{
// Demonstrating Set using HashSet
// Declaring object of type String
Set<String> hash_Set = new HashSet<String>();

// Adding elements to the Set


// using add() method
hash_Set.add("Geeks");
hash_Set.add("For");
hash_Set.add("Geeks");
hash_Set.add("Example");
hash_Set.add("Set");

// Printing elements of HashSet object


System.out.println(hash_Set);
}
}

Output
[Set, Example, Geeks, For]

e) Map

// Java Program to Demonstrate


// Working of Map interface

// Importing required classes


import java.util.*;

// Main class
class GFG {

// Main driver method


public static void main(String args[])
{
// Creating an empty HashMap
Map<String, Integer> hm
= new HashMap<String, Integer>();

// Inserting pairs in above Map


// using put() method
hm.put("a", new Integer(100));
hm.put("b", new Integer(200));
hm.put("c", new Integer(300));
hm.put("d", new Integer(400));

// Traversing through Map using for-each loop


for (Map.Entry<String, Integer> me :
hm.entrySet()) {

// Printing keys
System.out.print(me.getKey() + ":");
System.out.println(me.getValue());
}
}
}

Output:
a:100
b:200
c:300
d:400

Experiment 2:
2. (i)Perform setting up and Installing Hadoop in its three operating modes: Standalone, Pseudo
distributed, Fully distributed

(ii)Use web based tools to monitor your Hadoop setup.

(i) Perform setting up and Installing Hadoop in its three operating modes: Standalone,
Pseudo distributed, Fully distributed

Standalone:
Install Hadoop: Setting up a Single Node Hadoop Cluster
By now you should have a theoretical idea about Hadoop, HDFS and its architecture,
but good hands-on knowledge is just as important. This section takes you through the
practical side of Hadoop and HDFS. The first step forward is to install
Hadoop.
There are two ways to install Hadoop, i.e. Single node and Multi-node.

A single node cluster means only one DataNode running and setting up all the
NameNode, DataNode, ResourceManager, and NodeManager on a single machine.
This is used for studying and testing purposes. For example, let us consider a
sample data set inside the healthcare industry. So, for testing whether the Oozie jobs
have scheduled all the processes like collecting, aggregating, storing, and
processing the data in a proper sequence, we use a single node cluster. It can easily
and efficiently test the sequential workflow in a smaller environment as compared to
large environments which contain terabytes of data distributed across hundreds of
machines.

While in a Multi-node cluster, there are more than one DataNode running and each
DataNode is running on different machines. The multi-node cluster is practically used
in organizations for analyzing Big Data. Considering the above example, in real-time
when we deal with petabytes of data, it needs to be distributed across hundreds of
machines to be processed. Thus, here we use a multi-node cluster.

Install Hadoop
Step 1: Download the Java 8 package and save the file in your home
directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig: Hadoop Installation – Extracting Java Files

Step 3: Download the Hadoop 2.7.3 Package.


Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-
2.7.3.tar.gz

Fig: Hadoop Installation – Downloading Hadoop

Step 4: Extract the Hadoop tar File.


Command: tar -xvf hadoop-2.7.3.tar.gz
Fig: Hadoop Installation – Extracting Hadoop Files

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file. Now, add the Hadoop and Java paths as shown below.


Command: vi .bashrc

Fig: Hadoop Installation – Setting Environment Variable

Then, save the bash file and close it.

For applying all these changes to the current Terminal, execute the source
command.

Command: source .bashrc

Fig: Hadoop Installation – Refreshing environment variables

To make sure that Java and Hadoop have been properly installed on your system
and can be accessed through the Terminal, execute the java -version and hadoop
version commands.

Command: java -version


Fig: Hadoop Installation – Checking Java Version

Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version

Step 6: Edit the Hadoop Configuration files.


Command: cd hadoop-2.7.3/etc/hadoop/

Command: ls

All the Hadoop configuration files are located in hadoop-2.7.3/etc/hadoop directory


as you can see in the snapshot below:

Fig: Hadoop Installation – Hadoop Configuration Files


Step 7: Open core-site.xml and edit the property mentioned below inside
configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the cluster. It
contains configuration settings of Hadoop core such as I/O settings that are common
to HDFS & MapReduce.

Command: vi core-site.xml

Fig: Hadoop Installation – Configuring core-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside
configuration tag:
hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode,
DataNode, Secondary NameNode). It also includes the replication factor and block
size of HDFS.

Command: vi hdfs-site.xml
Fig: Hadoop Installation – Configuring hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Step 9: Edit the mapred-site.xml file and edit the property mentioned below
inside configuration tag:
mapred-site.xml contains configuration settings of MapReduce application like
number of JVM that can run in parallel, the size of the mapper and the reducer
process, CPU cores available for a process, etc.

In some cases, mapred-site.xml file is not available. So, we have to create the
mapred-site.xml file using mapred-site.xml template.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml.
Fig: Hadoop Installation – Configuring mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:
yarn-site.xml contains configuration settings of ResourceManager and NodeManager
like application memory management size, the operation needed on program &
algorithm, etc.

Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml

<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run
Hadoop like Java home path, etc.

Command: vi hadoop-env.sh

Fig: Hadoop Installation – Configuring hadoop-env.sh

Step 12: Go to Hadoop home directory and format the NameNode.


Command: cd

Command: cd hadoop-2.7.3

Command: bin/hadoop namenode -format


Fig: Hadoop Installation – Formatting NameNode

This formats the HDFS via the NameNode. This command is executed only the first
time you set up the cluster. Formatting the file system means initializing the directory
specified by the dfs.name.dir variable.

Never format an up-and-running Hadoop filesystem; you will lose all the data stored in
the HDFS.


Step 13: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory


and start all the daemons.
Command: cd hadoop-2.7.3/sbin

Either you can start all daemons with a single command or do it individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh and mr-jobhistory-daemon.sh.

Or you can run all the services individually as below:

Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree
of all files stored in the HDFS and tracks all the files stored across the cluster.

Command: ./hadoop-daemon.sh start namenode

Fig: Hadoop Installation – Starting NameNode

Start DataNode:
On startup, a DataNode connects to the Namenode and it responds to the requests
from the Namenode for different operations.
Command: ./hadoop-daemon.sh start datanode

Fig: Hadoop Installation – Starting DataNode

Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster resources
and thus helps in managing the distributed applications running on the YARN
system. Its work is to manage each NodeManager and each application's
ApplicationMaster.

Command: ./yarn-daemon.sh start resourcemanager

Fig: Hadoop Installation – Starting ResourceManager

Start NodeManager:
The NodeManager in each machine framework is the agent which is responsible for
managing containers, monitoring their resource usage and reporting the same to the
ResourceManager.

Command: ./yarn-daemon.sh start nodemanager

Fig: Hadoop Installation – Starting NodeManager


Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job history related requests from
clients.

Command: ./mr-jobhistory-daemon.sh start historyserver

Step 14: To check that all the Hadoop services are up and running, run the
below command.
Command: jps

Fig: Hadoop Installation – Checking Daemons

Step 15: Now open the Mozilla browser and go


to localhost:50070/dfshealth.html to check the NameNode interface.

Fig: Hadoop Installation – Starting WebUI

Congratulations, you have successfully installed a single-node Hadoop cluster in one


go.
Pseudo distributed:
To set up and install Hadoop in pseudo-distributed mode on
Windows 10, use the steps given below. Let's discuss them one
by one.
Step 1: Download Binary Package :
Download the latest binary from the following site as follows.
http://hadoop.apache.org/releases.html

For reference, you can save the file to the folder as follows.
C:\BigData
Step 2: Unzip the binary package
Open Git Bash, and change directory (cd) to the folder where you saved the binary
package, and then unzip it as follows.
$ cd C:\BigData
MINGW64: C:\BigData
$ tar -xvzf hadoop-3.1.2.tar.gz
In this case, the Hadoop binary is extracted to C:\BigData\hadoop-3.1.2.
Next, go to this GitHub repo and download the bin folder as a zip, as
shown below. Extract the zip and copy all the
files present under the bin folder to C:\BigData\hadoop-
3.1.2\bin. Replace the existing files too.
Step 3: Create folders for datanode and namenode:
• Go to C:\BigData\hadoop-3.1.2 and create a folder 'data'.
Inside the 'data' folder create two folders 'datanode' and
'namenode'. Your files on HDFS will reside under the datanode
folder.

• Set Hadoop Environment Variables


• Hadoop requires the following environment variables to be set.
HADOOP_HOME="C:\BigData\hadoop-3.1.2"
HADOOP_BIN="C:\BigData\hadoop-3.1.2\bin"
JAVA_HOME=<Root of your JDK installation>
• To set these variables, navigate to My Computer or This PC.
Right-click -> Properties -> Advanced System settings -> Environment
variables.
• Click New to create a new environment variable.
• If you don't have Java 1.8 installed, then
you'll have to download and install it first. If the
JAVA_HOME environment variable is already set, then check whether the
path has any spaces in it (ex: C:\Program Files\Java\... ). Spaces in the
JAVA_HOME path will lead to issues. There is a trick to get around
it: replace 'Program Files' with 'Progra~1' in the variable value.
Ensure that the version of Java is 1.8 and JAVA_HOME is pointing to
JDK 1.8.
Step 4: To make Short Name of Java Home path
• Set Hadoop Environment Variables
• Edit PATH Environment Variable
• Click on New and Add %JAVA_HOME%, %HADOOP_HOME%,
%HADOOP_BIN%, %HADOOP_HOME%\sbin to your PATH one by one.
• Now that we have set the environment variables, we need to validate them.
Open a new Windows Command prompt and run an echo command on
each variable to confirm they are assigned the desired values.
echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
• If the variables are not initialized yet, it is
most likely because you are testing them in an old session.
Make sure you have opened a new command prompt to test them.
Step 5: Configure Hadoop
Once environment variables are set up, we need to configure Hadoop by editing
the following configuration files.
hadoop-env.cmd
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
hadoop-env.cmd
First, let’s configure the Hadoop environment file. Open C:\BigData\hadoop-
3.1.2\etc\hadoop\hadoop-env.cmd and add the content below at the bottom:
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
Step 6: Edit hdfs-site.xml
Next, you need to set the replication factor and the locations
of the namenode and datanodes. Open C:\BigData\hadoop-3.1.2\etc\hadoop\hdfs-
site.xml and add the content below within the <configuration> </configuration> tags.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\BigData\hadoop-3.1.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\BigData\hadoop-3.1.2\data\datanode</value>
</property>
</configuration>
Step 7: Edit core-site.xml
Now, configure Hadoop Core’s settings. Open C:\BigData\hadoop-
3.1.2\etc\hadoop\core-site.xml and add the content below within the <configuration>
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
Step 8: YARN configurations
Edit file yarn-site.xml
Make sure the following entries exist, as follows.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 9: Edit mapred-site.xml
Finally, let's configure properties for the MapReduce framework. Open
C:\BigData\hadoop-3.1.2\etc\hadoop\mapred-site.xml and add the content below
inside the <configuration> </configuration> tags. If you don't see
mapred-site.xml, then open the mapred-site.xml.template file and rename
it to mapred-site.xml.
<configuration>
<property>
<name>mapreduce.job.user.name</name>
<value>%USERNAME%</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.apps.stagingDir</name>
<value>/user/%USERNAME%/staging</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
</configuration>
Check if the C:\BigData\hadoop-3.1.2\etc\hadoop\slaves file is present; if it's not,
then create one, add localhost in it, and save it.
Step 10: Format the Name Node:
To format the Name Node, open a new Windows Command Prompt and run
the command below. It might give you a few warnings; ignore them.
• hadoop namenode -format

Format Hadoop Name Node


Step 11: Launch Hadoop:
Open another Windows Command Prompt, making sure to run it as an
Administrator to avoid permission errors.
Once it is open, execute the start-all.cmd command. Since we have added
%HADOOP_HOME%\sbin to the PATH variable, you can run this command from any
folder. If you haven't done so, then go to the
%HADOOP_HOME%\sbin folder and run the command.

Four new windows with cmd terminals will open, one for each of the
following 4 daemon processes:
• namenode
• datanode
• node manager
• resource manager
Don’t close these windows, minimize them. Closing the windows will terminate
the daemons. You can run them in the background if you don’t like to see these
windows.

Step 12: Hadoop Web UI


In conclusion, let's monitor how the Hadoop daemons are
doing. You can also use the Web UI for a wide range of administrative
and monitoring purposes. Open your browser and get started.
Step 13: Resource Manager
Open localhost:8088 to open Resource Manager
Step 14: Node Manager
Open localhost:8042 to open Node Manager

Step 15: Name Node :


Open localhost:9870 to check out the health of Name Node
Step 16: Data Node :
Open localhost:9864 to check out Data Node
Fully-distributed:

Installation and Configuring Hadoop in fully-distributed mode

To configure a Hadoop cluster in fully-distributed mode, we need to configure all
the master and slave machines. Even though it is different from the pseudo-
distributed mode, the configuration method will be the same.

The following are steps to configure Hadoop cluster in fully-distributed


mode:

Step 1 − Setting Up Hadoop environment variables

Ensure that the master and all the slaves have the same user. On all nodes in the
cluster, append the following export commands to the (~/.bashrc) file to export the
Hadoop environment variables:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$HADOOP_PREFIX/bin:$JAVA_HOME/bin:$PATH
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

Apply the changes to the system currently running.

Command: $ source ~/.bashrc

Step 2: Configuration

Add all the export commands listed below at the start of the script
etc/hadoop/yarn-env.sh:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$PATH:$HADOOP_PREFIX/bin:$JAVA_HOME/bin:.
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop

Step 3: Create a folder for hadoop.tmp.dir

Create a temporary folder in HADOOP_PREFIX

Command: mkdir -p $HADOOP_PREFIX/tmp

Step 4: Tweak config files

Add all the properties mentioned below between the configuration tags on all the
machines in the cluster, under the $HADOOP_PREFIX/etc/hadoop folder:

The following XML files must be reconfigured:

• Core-site.xml
• Hdfs-site.xml
• Yarn-site.xml
• Mapred-site.xml

Core-site.xml

The core-site.xml file contains information regarding memory allocated for the file
system, the port number used for Hadoop instance, size of Read/Write buffers, and
memory limit for storing the data.

Open the core-site.xml file and add the properties listed below in between the
<configuration>, </configuration> tags in this file.

<property>
<name>fs.default.name</name>
<value>hdfs://Master-Hostname:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/impadmin/hadoop-2.6.4/tmp</value>
</property>

Hdfs-site.xml

The hdfs-site.xml file contains information regarding the namenode path, the datanode
paths on the local file systems, the replication factor, etc.; in other words, the place
where you want to store the Hadoop infrastructure.
Open the hdfs-site.xml file and add the properties listed below in between the
<configuration>, </configuration> tags. In this file, all the property values are
user-defined and can be changed according to your Hadoop infrastructure.

<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

Mapred-site.xml :

The mapred-site.xml file is used to specify the MapReduce framework currently in
use. First, to replace the default template, copy the mapred-site.xml.template file to
mapred-site.xml.

Open the mapred-site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Yarn-site.xml :

yarn-site.xml.template is a default template. The yarn-site.xml file is used to
configure YARN in the Hadoop environment. Remember to replace "Master-Hostname"
with the host name of the cluster's master.

Open the yarn-site.xml file and add the following properties in between the
<configuration>, </configuration> tags in this file.

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>Master-Hostname:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>Master-Hostname:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>Master-Hostname:8040</value>
</property>

(ii)Use web based tools to monitor your Hadoop setup.

Here is our list of the best Hadoop monitoring tools:

1. Datadog – Cloud monitoring software with a customizable Hadoop dashboard,


integrations, alerts, and more.
2. LogicMonitor – Infrastructure monitoring software with a Hadoop package, REST
API, alerts, reports, dashboards, and more.
3. Dynatrace – Application performance management software with Hadoop monitoring,
including NameNode/DataNode metrics, dashboards, analytics, custom alerts, and more.

How to Monitor Hadoop: Metrics You Need to Keep Track of to Monitor


Hadoop Clusters

Like any computing resource, Hadoop clusters need to be monitored


to ensure that they keep performing at their best. Hadoop’s
architecture may be resilient to system failures, but it still needs
maintenance to prevent jobs from being disrupted. When monitoring
the status of clusters, there are four main categories of metrics you
need to be aware of:

• HDFS metrics (NameNode metrics and DataNode metrics)


• MapReduce counters
• YARN metrics
• ZooKeeper metrics

Below, we’re going to break each of these metric types down,


explaining what they are and providing a brief guide for how you can
monitor them.

HDFS Metrics

Apache Hadoop Distributed File System (HDFS) is a distributed file


system with a NameNode and DataNode architecture. Whenever the
HDFS receives data it breaks it down into blocks and sends it to
multiple nodes. The HDFS is scalable and can support thousands of
nodes.

Monitoring key HDFS metrics is important because it helps you to:


monitor the capacity of the DFS, monitor the space available, track
the status of blocks, and optimize the storage of your data.

There are two main categories of HDFS metrics:

• NameNode metrics
• DataNode metrics
NameNodes and DataNodes

HDFS follows a master-slave architecture where every HDFS cluster
is composed of a single NameNode (master) and multiple
DataNodes (slaves). The NameNode controls access to files, records
metadata of files stored in the cluster, and monitors the state of
DataNodes.

A DataNode is a process that runs on each slave machine, which


performs low-level read/write requests from the system’s clients, and
sends periodic heartbeats to the NameNode, to report on the health of
the HDFS. The NameNode then uses the health information to
monitor the status of DataNodes and verify that they’re live.

When monitoring, it’s important to prioritize analyzing metrics taken


from the NameNode because if a NameNode fails, all the data within
a cluster will become inaccessible to the user.
Prioritizing monitoring NameNode also makes sense as it enables you
to ascertain the health of all the data nodes within a cluster.
NameNode metrics can be broken down into two groups:

• NameNode-emitted metrics
• NameNode Java Virtual Machine (JVM) metrics

Below we’re going to list each group of metrics you can monitor and
then show you a way to monitor these metrics for HDFS.

NameNode-emitted metrics

• CapacityRemaining – Records the available capacity


• CorruptBlocks/MissingBlocks – Records number of
corrupt/missing blocks
• VolumeFailuresTotal – Records number of failed volumes
• NumLiveDataNodes/NumDeadDataNodes – Records count of
alive or dead DataNodes
• FilesTotal – Total count of files tracked by the NameNode
• Total Load – Measure of file access across all DataNodes
• BlockCapacity/BlocksTotal – Maximum number of blocks
allocable/count of blocks tracked by NameNode
• UnderReplicated Blocks – Number of under-replicated blocks
• NumStaleDataNodes – Number of stale DataNodes

NameNode JVM Metrics

• ConcurrentMarkSweep count – Number of old-generation


collections
• ConcurrentMarkSweep time – The elapsed time of old-
generation collections, in milliseconds
How to Monitor HDFS Metrics

One way that you can monitor HDFS metrics is through Java
Management Extensions (JMX) and the HDFS daemon web interface.
To view a summary of NameNode and performance metrics enter the
following URL into your web browser to access the web interface
(which is available by default at port 50070):
http://<namenodehost>:50070
Here you’ll be able to see information on Configured Capacity, DFS
Used, Non-DFS Used, DFS Remaining, Block Pool Used, DataNodes
usage%, and more.

If you require more in-depth information, you can enter the following
URL to view more metrics with a JSON output:
http://<namenodehost>:50070/jmx
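
If you prefer to pull these JMX metrics programmatically rather than in a browser, a minimal Java sketch along the following lines can be used. This is an illustration, not part of the original guide: the host, the port, and the FSNamesystem MBean name used in the qry parameter are assumptions that mirror the URLs above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NameNodeJmxFetch {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode host/port and MBean name; adjust to your cluster.
        String urlStr = "http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem";
        HttpURLConnection conn = (HttpURLConnection) new URL(urlStr).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            // Print the raw JSON; in practice you would parse it and extract
            // metrics such as CapacityRemaining or MissingBlocks.
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
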
MapReduce Counters

MapReduce is a software framework used by Hadoop to process


large datasets in-parallel across thousands of nodes. The framework
breaks down a dataset into chunks and stores them in a file system.
MapReduce jobs are responsible for splitting the datasets and map
tasks then process the data.

For performance monitoring purposes, you need to monitor


MapReduce counters, so that you view information/statistics about job
execution. Monitoring MapReduce counters enables you to monitor
the number of rows read, and the number of rows written as output.

You can use MapReduce counters to find performance bottlenecks.


There are two main types of MapReduce counters:

• Built-in Counters – Counters that are included with


MapReduce by default
• Custom counters – User-defined counters that the user can
create with custom code (a brief sketch follows this list)
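
As a brief sketch of the custom counter idea, the hedged example below (not from the original text) defines a user-defined counter with an enum and increments it from inside a mapper; the class, enum, and counter names are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordAuditMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Hypothetical user-defined counter group for this example.
    public enum AuditCounters { MALFORMED_RECORDS }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Treat records without a comma as malformed and count them.
        if (!line.contains(",")) {
            context.getCounter(AuditCounters.MALFORMED_RECORDS).increment(1);
            return;
        }
        context.write(new Text(line.split(",")[0]), new IntWritable(1));
    }
}
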

Below we’re going to look at some of the built-in counters you can use
to monitor Hadoop.
Built-In MapReduce Counters

Built-in Counters are counters that come with MapReduce by default.


There are five main types of built-in counters:

• Job counters
• Task counters
• File system counters
• FileInputFormat Counters
• FileOutputFormat Counters

Job Counters
MapReduce job counters measure statistics at the job level, such as
the number of failed maps or reduces.

• MILLIS_MAPS/MILLIS_REDUCES – Processing time for


maps/reduces
• NUM_FAILED_MAPS/NUM_FAILED_REDUCES – Number of
failed maps/reduces
• RACK_LOCAL_MAPS/DATA_LOCAL_MAPS/OTHER_LOCAL_MAPS –
Counters tracking where map tasks were executed

Task Counters

Task counters collect information about tasks during execution, such


as the number of input records for reduce tasks.

• REDUCE_INPUT_RECORDS – Number of input records for


reduce tasks
• SPILLED_RECORDS – Number of records spilled to disk
• GC_TIME_MILLIS – Processing time spent in garbage
collection

FileSystem Counters

FileSystem Counters record information about the file system, such as


the number of bytes read by the FileSystem.

• FileSystem bytes read – The number of bytes read by the


FileSystem
• FileSystem bytes written – The number of bytes written to the
FileSystem

FileInputFormat Counters

FileInputFormat Counters record information about the number of


bytes read by map tasks

• Bytes read – Displays the bytes read by map tasks with the
specific input format

FileOutputFormat Counters

FileOutputFormat counters gather information on the number of bytes


written by map tasks or reduce tasks in the output format.
• Bytes written – Displays the bytes written by map and reduce
tasks with the specified format
How to Monitor MapReduce Counters

You can monitor MapReduce counters for jobs through the


ResourceManager web UI. To load up the ResourceManager web UI,
go to your browser and enter the following URL:
http://<resourcemanagerhost>:8088

Here you will be shown a list of All Applications in a table format.


Now, go to the application you want to monitor and click
the History hyperlink in the Tracking UI column.

On the application page, click on the Counters option on the left-hand


side. You will now be able to view counters associated with the job
monitored.
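
Counters can also be read programmatically from the driver once a job has finished. Below is a hedged Java sketch that assumes a Job object like the one built in the WordCount driver of Experiment 4; the helper class name is illustrative.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterReport {
    // Prints a few built-in counters for a finished job; 'job' is assumed to be
    // the same Job object used in the WordCount driver of Experiment 4.
    public static void print(Job job) throws Exception {
        Counters counters = job.getCounters();
        long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
        long gcTime = counters.findCounter(TaskCounter.GC_TIME_MILLIS).getValue();
        long reduceIn = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
        System.out.println("Spilled records:      " + spilled);
        System.out.println("GC time (ms):         " + gcTime);
        System.out.println("Reduce input records: " + reduceIn);
    }
}
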
YARN Metrics

Yet Another Resource Negotiator (YARN) is the component of


Hadoop that’s responsible for allocating system resources to the
applications or tasks running within a Hadoop cluster.

There are three main categories of YARN metrics:

• Cluster metrics – Enable you to monitor high-level YARN


application execution
• Application metrics – Monitor execution of individual YARN
applications
• NodeManager metrics – Monitor information at the individual
node level

Cluster Metrics

Cluster metrics can be used to view a YARN application execution.

• unhealthyNodes – Number of unhealthy nodes


• activeNodes – Number of currently active nodes
• lostNodes – Number of lost nodes
• appsFailed – Number of failed applications
• totalMB/allocatedMB – Total amount of memory/amount of
memory allocated
Application metrics

Application metrics provide in-depth information on the execution of


YARN applications.

• progress – Application execution progress meter

NodeManager metrics

NodeManager metrics display information on resources within


individual nodes.

• containersFailed – Number of containers that failed to launch


How to Monitor YARN Metrics

To collect metrics for YARN, you can use the HTTP API. On your
ResourceManager host, query the YARN metrics exposed on port 8088
by entering the following (use the qry parameter to specify the
MBeans you want to monitor):
http://<resourcemanagerhost>:8088/jmx?qry=java.lang:type=Memory
ZooKeeper Metrics

ZooKeeper is a centralized service that maintains configuration


information and delivers distributed synchronization across a Hadoop
cluster. ZooKeeper is responsible for maintaining the availability of the
HDFS NameNode and YARN's ResourceManager.

Some key ZooKeeper metrics you should monitor include:

• zk_followers – Number of active followers


• zk_avg_latency – Amount of time it takes to respond to a client
request (in ms)
• zk_num_alive_connections – Number of clients connected to
ZooKeeper
How to Collect Zookeeper Metrics

There are a number of ways you can collect metrics for ZooKeeper,
but the easiest is by using the four-letter-word commands through Telnet
or Netcat at the client port. To keep things simple, we're going to look
at mntr, arguably the most important of the four-letter-word
commands.
$ echo mntr | nc localhost 2181

Entering the mntr command will return you information on average


latency, maximum latency, packets received, packets sent,
outstanding requests, number of followers, and more. You can view a
list of four-letter word commands on the Apache ZooKeeper site.
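
If you would rather issue the mntr command from code than from Telnet or Netcat, a small Java sketch such as the one below can do it over a plain socket. It is not part of the original text; the host name and the default client port 2181 are assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

public class ZkMntr {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper host and client port.
        try (Socket socket = new Socket("localhost", 2181)) {
            OutputStream out = socket.getOutputStream();
            // Send the four-letter-word command followed by a newline.
            out.write("mntr\n".getBytes("UTF-8"));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // e.g. zk_avg_latency, zk_num_alive_connections, ...
            }
        }
    }
}
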

Hadoop Monitoring Software

Monitoring Hadoop metrics through JMX or an HTTP API enables you


to see the key metrics, but it isn’t the most efficient method of
monitoring performance. The most efficient way to collect and analyze
HDFS, MapReduce, Yarn, and ZooKeeper metrics, is to use an
infrastructure monitoring tool or Hadoop monitoring software.

Many network monitoring providers have designed platforms with the


capacity to monitor frameworks like Hadoop, with state-of-the-art
dashboards and analytics to help the user monitor the performance of
clusters at a glance. Many also come with custom alerts systems that
provide you with email and SMS notifications when a metric hits a
problematic threshold.

In this section, we’re going to look at some of the top Hadoop


monitoring tools on the market. We’ve prioritized tools with high-
quality visibility, configurable alerts systems, and complete data
visualizations.
Our methodology for selecting Hadoop monitoring tools 

We reviewed the market for Hadoop monitors and analyzed tools


based on the following criteria:

• A counter to record the log message throughput rate


• Alerts for irregular log throughput rates
• Throughput of Hadoop-collected system statistics
• Collection of HDFS, MapReduce, Yarn, and ZooKeeper
metrics
• Pre-written searches to make sense of Hadoop data
• A free tool or a demo package for a no-obligation
assessment
• Value for money offered by a thorough Hadoop data
collection tool that is provided at a fair price
With these selection criteria in mind, we selected a range of tools that
both monitor Hadoop activities and pass through the data collected by
Hadoop on disk and data management activity.
1. Datadog

Datadog is a cloud monitoring tool that can monitor services and


applications. With Datadog you can monitor the health and
performance of Apache Hadoop. There is a Hadoop dashboard that
displays information on DataNodes and NameNodes.

For example, you can view a graph of Disk remaining by DataNode,


and TotalLoad by NameNode. Dashboards can be customized to
add information from other systems as well. Integrations for HDFS,
MapReduce, YARN, and ZooKeeper enable you to monitor the most
significant performance indicators.

Key features:

• Hadoop monitoring dashboard


• Integrations for HDFS, MapReduce, YARN, and ZooKeeper
• Alerts
• Full API access
The alerts system makes it easy for you to track performance changes
when they occur by providing you with automatic notifications. For
example, Datadog can notify you if Hadoop jobs fail. The alerts
system uses machine learning, which has the ability to identify
anomalous behavior.

To give you greater control over your monitoring experience, Datadog


provides full API access so that you can create new integrations.
You can use the API access to complete tasks such as querying
Datadog in the command-line or creating JSON-formatted
dashboards.
2. LogicMonitor

LogicMonitor is an infrastructure monitoring platform that can be


used for monitoring Apache Hadoop. LogicMonitor comes with a
Hadoop package that can monitor HDFS NameNode, HDFS
DataNode, Yarn, and MapReduce metrics. For monitoring Hadoop all
you need to do is add Hadoop hosts to monitor, enable JMX on the
Hadoop hosts, and assign properties to each resource. The tool then
collects Hadoop metrics through a REST API.

Key features:
• Monitors HDFS NameNode, HDFS DataNode, Yarn, and
MapReduce metrics
• REST API
• Custom alert thresholds
• Dashboard
• Reports

To monitor these metrics you can set alert trigger conditions to


determine when alerts will be raised. Alerts can be assigned a
numeric priority value to determine the severity of a breach. There is
also an escalation chain you can use to escalate alerts that haven’t
been responded to.

For more general monitoring activity, LogicMonitor includes a


dashboard that you can use to monitor your environment with key
metrics and visualizations including graphs and charts. The software
also allows you to schedule reports to display performance data. For
example, the Alert Trends report provides a summary of alerts that
occurred for resources/groups over a period of time.
3. Dynatrace

Dynatrace is an application performance management tool you can


use to monitor services and applications. Dynatrace also offers users
performance monitoring for Hadoop. The platform can automatically
detect Hadoop components and display performance metrics for
HDFS and MapReduce. Whenever a new host running Hadoop is
added to your environment the tool detects it automatically.

Key features:

• Automatically detects Hadoop components


• DataNode and NameNode metrics
• Analytics and Data visualizations
• Dashboards
• Custom alerts

You can monitor a range of NameNode and DataNode metrics.


NameNode metrics include Total, Used, Remaining, Total load,
Total, Pending deletion, Files total, Under replicated, Live,
Capacity, and more. Types of DataNode metrics include Capacity,
Used, Cached, Failed to Cache, Blocks, Removed,
Replicated, and more.

Dashboards provide a range of information with rich data


visualizations. For example, you can view a chart of MapReduce
maps failed or a bar graph of Jobs preparing and running. Custom
alerts powered by anomaly detection enable you to identify
performance issues, helping you to make sure that your service stays
available.
Experiment 3:

3.Implement the following file management tasks in Hadoop:

• Adding files and directories


• Retrieving files
• Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies
them into HDFS using one of the above command line utilities.

Inserting Data into HDFS


Assume we have data in a file called file.txt on the local system which ought to
be saved in the HDFS file system. Follow the steps given below to insert the required
file into the Hadoop file system.

Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using
the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple
demonstration for retrieving the required file from the Hadoop file system.

Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Removing a file or directory from HDFS:
Step 1: Switch to root user from ec2-user using the “sudo -i” command.

Step 2: Check files in the HDFS

Check files in the HDFS using the “hadoop fs -ls” command. In this case, we found 44
items in the HDFS.

Step 3: Removing the file

Let us try removing the “users_2.orc” file we find in the above result. A file or a directory
can be removed by passing the “-rmr” argument in the hadoop fs command. The syntax
for it is:

hadoop fs -rmr <path to file or directory>

Since my “users_2.orc” file is present in the root directory, the command would be
“hadoop fs -rmr /user/root/users_2.orc”. Let us observe the output of this command upon
execution.

It is clear from the above result that the file is moved to the trash.

Step 4: Cross-checking to see if the file is removed

Let us cross-check the same by listing the files in the HDFS, and our desired output
must not contain the “users_2.orc” file in the list.
So only 43 items are now present in the root directory as users_2.orc file is removed
from the HDFS.
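
The same add, retrieve, and delete tasks can also be performed programmatically through the HDFS FileSystem API. The following is a hedged Java sketch, not part of the original write-up; the paths and the fs.defaultFS value are assumptions that mirror the commands and configuration used above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileTasks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode URI; matches the core-site.xml used in Experiment 2.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Adding files and directories
        fs.mkdirs(new Path("/user/input"));
        fs.copyFromLocalFile(new Path("/home/file.txt"), new Path("/user/input/file.txt"));

        // Retrieving files
        fs.copyToLocalFile(new Path("/user/input/file.txt"), new Path("/home/hadoop_tp/file.txt"));

        // Deleting files (second argument: recursive delete for directories)
        fs.delete(new Path("/user/input/file.txt"), false);

        fs.close();
    }
}
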

Experiment 4:
4. Run a basic Word Count MapReduce program to understand MapReduce Paradigm.

Workflow of MapReduce consists of 5 steps:

1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line (‘\n’).

2. Mapping – as explained above.

3. Intermediate splitting – the entire process in parallel on different clusters. In order to


group them in “Reduce Phase” the similar KEY data should be on the same cluster.

4. Reduce – it is nothing but mostly group by phase.

5. Combining – The last phase where all the data (individual result set from each
cluster) is combined together to form a result.

Now Let's See the Word Count Program in Java
Fortunately, we don’t have to write all of the above steps, we only need to write the splitting
parameter, Map function logic, and Reduce function logic. The rest of the remaining steps
will execute automatically.

Make sure that Hadoop is installed on your system with the Java SDK.

Steps
1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish.

2. Right Click > New > Package ( Name it - PackageDemo) > Finish.

3. Right Click on Package > New > Class (Name it - WordCount).

4. Add Following Reference Libraries:

1. Right Click on Project > Build Path> Add External

1. /usr/lib/hadoop-0.20/hadoop-core.jar

2. /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar

5. Type the following code:

package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: emits (WORD, 1) for every comma-separated word in a line
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}

The above program consists of three classes:

• Driver class (contains the public static void main method; this is the entry point).
• The Map class which extends the public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and implements the Map function.
• The Reduce class which extends the public class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the Reduce function.

6. Make a jar file

Right Click on Project> Export> Select export destination as Jar File > next> Finish.
7. Take a text file and move it into HDFS:

To move this into Hadoop directly, open the terminal and enter the following commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
8. Run the jar file:

(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount
wordCountFile MRDir1
9. Open the result:

[training@localhost ~]$ hadoop fs -ls MRDir1

Found 3 items

-rw-r--r-- 1 training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS

drwxr-xr-x - training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_logs

-rw-r--r-- 1 training supergroup 20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000

[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000

BUS 7

CAR 4

TRAIN 6

Experiment 5:
5. Write a map reduce program that mines weather data. Weather sensors collecting data
every hour at many locations across the globe gather a large volume of log data, which is a
good candidate for analysis with Map Reduce, since it is semi structured and record-oriented.
Analyzing the Data with Hadoop

To take advantage of the parallel processing that Hadoop provides, we need to express our query as a
MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.

Map and Reduce

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase
has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer
also specifies two functions: the map function and the reduce function.

The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the
dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we
have no need for this, we ignore it.

Our map function is simple. We pull out the year and the air temperature, because these are the only fields we
are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way
that the reduce function can do its work on it: finding the maximum temperature for each year. The map function
is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.

To visualize the way the map works, consider the following sample lines of input data (some unused columns
have been dropped to fit the page, indicated by ellipses):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as the key-value pairs:

(0, 0067011990999991950051507004…9999999N9+00001+99999999999…)

(106, 0043011990999991950051512004…9999999N9+00221+99999999999…)

(212, 0043011990999991950051518004…9999999N9-00111+99999999999…)

(318, 0043012650999991949032412004…0500001N9+01111+99999999999…)

(424, 0043012650999991949032418004…0500001N9+00781+99999999999…)

The keys are the line offsets within the file, which we ignore in our map function. The map function merely
extracts the year and the air temperature (indicated in bold text), and emits them as its output (the temperature
values have been interpreted as integers):

(1950, 0)

(1950, 22)

(1950, −11)

(1949, 111)

(1949, 78)

The output from the map function is processed by the MapReduce framework before being sent to the reduce
function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce
function sees the following input:
(1949, [111, 78])

(1950, [0, 22, −11])

Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate
through the list and pick up the maximum reading:

(1949, 111)

(1950, 22)

This is the final output: the maximum global temperature recorded in each year. The whole data flow is illustrated
in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow.
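
The text above describes the map and reduce logic without listing the code, so here is a hedged Java sketch of a MaxTemperature job written in the same style as the WordCount program of Experiment 4. The class names are illustrative, and the substring offsets assume the full fixed-width NCDC records (the sample lines above are elided with "..."), so treat them as assumptions to adjust for your input.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Mapper: pulls out (year, temperature) and drops missing/suspect readings
    public static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);              // assumed NCDC year offset
            int airTemperature;
            if (line.charAt(87) == '+') {                      // signed, fixed-width temperature field
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Reducer: picks the maximum temperature for each year
    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    // Driver
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
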

Experiment 6:
6.Use MapReduce to find the shortest path between two people in a social graph.

Hint: Use an adjacency list to model a graph, and for each node store the distance from the
original node, as well as a back pointer to the original node. Use the mappers to propagate the
distance to the original node, and the reducer to restore the state of the graph. Iterate until the
target node has been reached.

To find the shortest path between two people in a social graph using MapReduce, we
can follow the following steps:

1. Create an adjacency list to represent the social graph. Each node in the graph
should have a unique identifier and a list of adjacent nodes.
2. Initialize the distance of each node from the source node to infinity except the
source node itself which has a distance of zero. Each node also has a back
pointer to the source node.
3. Create a mapper function that takes a node and emits a key-value pair for
each adjacent node. The key is the adjacent node identifier, and the value is a
tuple containing the distance of the current node from the source node plus
the weight of the edge connecting them, and the back pointer of the current
node.
4. Create a reducer function that takes a node and a list of its adjacent nodes.
For each adjacent node, compute the distance from the source node to the
adjacent node as the minimum of its current distance and the distance
received from the mapper plus the weight of the edge connecting them.
Update the back pointer of the adjacent node to the current node if its
distance is updated.
5. Repeat steps 3 and 4 until the target node is reached, i.e., its distance is
updated.
6. Once the target node is reached, trace back the path from the target node to
the source node using the back pointers stored in each node.

Here's the pseudocode for the MapReduce algorithm:

// Mapper function
map(node):
    if node is the source node:
        emit(node, (0, None))
    else:
        emit(node, (infinity, None))
    for adj_node in node.adjacent_nodes:
        emit(adj_node, (node.distance + edge_weight(node, adj_node), node))

// Reducer function
reduce(node, values):
    min_distance = infinity
    back_pointer = None
    for value in values:
        distance, pointer = value
        if distance < min_distance:
            min_distance = distance
            back_pointer = pointer
    if min_distance < node.distance:
        node.distance = min_distance
        node.back_pointer = back_pointer
    emit(node, node.adjacent_nodes)

// Main function
while target node is not reached:
    run mapper function for each node in the graph
    run reducer function for each node in the graph
trace back the path from target node to source node using back pointers

Note that the emit function in the mapper and reducer functions is used to send key-
value pairs to the next stage of the MapReduce pipeline. The pairs emitted by the
mapper function are grouped by key (i.e., the adjacent node identifier) and passed to
the reducer function as a list of values. The reducer function emits a key-value pair
for each updated node, indicating that its adjacent nodes may need to be
reevaluated in the next iteration.
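
One practical detail the pseudocode above leaves open is how a node's state (distance, back pointer, adjacency list) is carried between MapReduce iterations; in Hadoop this is commonly done by serializing each node as a text line. The following small Java holder class is a hedged sketch of one possible layout; the field order and separators are assumptions, not part of the original hint.

// One line of the graph file per node, e.g.:
//   nodeId <TAB> distance|backPointer|adj1,adj2,adj3
// A minimal holder class for parsing and re-emitting that layout.
public class NodeState {
    public String id;
    public long distance;       // Long.MAX_VALUE stands in for "infinity"
    public String backPointer;  // "None" until a shorter path is found
    public String[] adjacent;

    public static NodeState parse(String id, String value) {
        NodeState n = new NodeState();
        n.id = id;
        String[] parts = value.split("\\|", -1);
        n.distance = parts[0].equals("inf") ? Long.MAX_VALUE : Long.parseLong(parts[0]);
        n.backPointer = parts[1];
        n.adjacent = parts[2].isEmpty() ? new String[0] : parts[2].split(",");
        return n;
    }

    public String serialize() {
        String d = (distance == Long.MAX_VALUE) ? "inf" : Long.toString(distance);
        return d + "|" + backPointer + "|" + String.join(",", adjacent);
    }
}
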

Experiment 7:
7. Implement Friends-of-friends algorithm in MapReduce.

Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social
network .The first job calculates the common friends for each user, and the second job
sorts the common friends by the number of connections to your friends.

To implement the Friends-of-friends (FoFs) algorithm in MapReduce, we can use two


MapReduce jobs. The first job calculates the common friends for each user, and the
second job sorts the common friends by the number of connections to your friends.

Here are the steps for the two MapReduce jobs:

First Job: Calculating Common Friends

1. Create an adjacency list to represent the social graph. Each node in the graph
should have a unique identifier and a list of adjacent nodes.
2. Create a mapper function that takes a node and emits a key-value pair for
each pair of adjacent nodes. The key is a tuple containing the two adjacent
node identifiers, sorted in alphabetical order. The value is the node identifier
of the mapper's input node.
3. Create a reducer function that takes a pair of adjacent nodes and a list of the
nodes that are adjacent to both nodes. The reducer function emits a key-value
pair for each node that is adjacent to both nodes. The key is the node
identifier of the common friend, and the value is a list of the node identifiers
of the two adjacent nodes.
4. Run the first MapReduce job.

Second Job: Sorting by Number of Connections


1. Create a mapper function that takes a key-value pair from the first job and
emits a key-value pair for each node identifier in the value list. The key is the
node identifier of the common friend, and the value is a tuple containing the
node identifier of one of the adjacent nodes and the number of connections
the common friend has to the adjacent node's friends.
2. Create a reducer function that takes a common friend and a list of tuples. The
reducer function sorts the list of tuples by the number of connections and
emits a key-value pair for each tuple in the sorted list. The key is the node
identifier of the adjacent node, and the value is the number of connections
the common friend has to the adjacent node's friends.
3. Run the second MapReduce job.

Here's the pseudocode for the MapReduce algorithm:

First Job

// Mapper function
map(node):
    for adj_node in node.adjacent_nodes:
        if node.identifier < adj_node.identifier:
            emit((node.identifier, adj_node.identifier), node.identifier)
        else:
            emit((adj_node.identifier, node.identifier), node.identifier)

// Reducer function
reduce(pair, nodes):
    common_friends = []
    for node1 in nodes:
        for node2 in nodes:
            if node1 < node2:
                common_friends.append((node1, node2))
    for friend in common_friends:
        emit(friend, pair)
Second Job

// Mapper function
map(friend, pairs):
    for pair in pairs:
        if pair[0] != friend:
            emit(pair[0], (pair[1], 1))

// Reducer function
reduce(friend, pairs):
    sorted_pairs = sorted(pairs, key=lambda x: x[1], reverse=True)
    for pair in sorted_pairs:
        emit(pair[0], pair[1])

Note that the emit function in the mapper and reducer functions is used to send key-
value pairs to the next stage of the MapReduce pipeline. The pairs emitted by the
first mapper function are grouped by key (i.e., the pair of adjacent node identifiers)
and passed to the first reducer function as a list of values. The reducer function emits
a key-value pair for each common friend, indicating which two adjacent nodes share
that friend. The pairs emitted by the second mapper function are grouped by key
(i.e., the common friend identifier) and passed to the second reducer function as a
list of tuples. The reducer function sorts the tuples by the number of connections and
emits a key-value pair for each tuple in the sorted list.

Experiment 8:
8. Implement an iterative PageRank graph algorithm in MapReduce.

Hint: PageRank can be implemented by iterating a MapReduce job until the graph has
converged. The mappers are responsible for propagating node PageRank values to their
adjacent nodes, and the reducers are responsible for calculating new PageRank values for
each node, and for re-creating the original graph with the updated PageRank values.

To implement an iterative PageRank algorithm in MapReduce, we can use a sequence of
MapReduce jobs, where each job represents a single iteration of the PageRank algorithm.
Each job consists of mappers that propagate node PageRank values to their adjacent
nodes, and reducers that calculate new PageRank values for each node and re-create the
original graph with the updated PageRank values.

Here are the steps for the MapReduce algorithm:

1. Create an adjacency list to represent the graph. Each node in the graph should
have a unique identifier and a list of adjacent nodes.
2. Initialize the PageRank values for each node. Each node should start with a
PageRank of 1/N, where N is the total number of nodes in the graph.
3. Create a mapper function that takes a node and emits a key-value pair for
each adjacent node. The key is the adjacent node identifier, and the value is
the PageRank value of the input node divided by the number of adjacent
nodes. The mapper also emits the node's own adjacency list, keyed by the
node's identifier, so that the graph structure is passed through to the reducer.
4. Create a reducer function that takes a node and the list of values emitted for
it: the PageRank contributions from the nodes that link to it, plus the node's
own adjacency list. The reducer function sums the PageRank contributions and
applies the PageRank formula to calculate the new PageRank value for the
node. It then emits a key-value pair for each node in the graph. The key is the
node identifier, and the value is a tuple containing the new PageRank value
and the node's adjacency list.
5. Run the MapReduce job.
6. Update the PageRank values for each node in the graph based on the output
of the previous MapReduce job.
7. Repeat steps 3-6 until the graph has converged (i.e., the difference in
PageRank values for each node between iterations is below a certain
threshold).

Here's the pseudocode for the MapReduce algorithm:

// Mapper function
map(node):
    // Pass the graph structure through so the reducer can re-create the node
    emit(node.identifier, node.adjacent_nodes)
    for adj_node in node.adjacent_nodes:
        emit(adj_node, node.pagerank / len(node.adjacent_nodes))

// Reducer function
reduce(node, values):
    damping_factor = 0.85
    adjacent_nodes = []
    rank_sum = 0
    for value in values:
        if value is an adjacency list:
            adjacent_nodes = value
        else:
            rank_sum = rank_sum + value
    new_pagerank = (1 - damping_factor) / N + damping_factor * rank_sum
    emit(node, (new_pagerank, adjacent_nodes))

// Main function
while not converged:
    // Run a MapReduce job to calculate new PageRank values
    run_mapreduce_job()
    // Update PageRank values for each node
    for node in graph:
        node.pagerank = new_pagerank[node.identifier]
    // Check for convergence
    converged = check_convergence()

Note that the emit function in the mapper and reducer functions is used to send key-
value pairs to the next stage of the MapReduce pipeline.
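
For reference, here is a minimal Hadoop Java sketch of a single PageRank iteration. It
assumes input lines of the form "nodeId<TAB>currentRank<TAB>adj1,adj2,...", and that the
total number of nodes N is known to the job; the class names, the hard-coded node count,
and the input layout are illustrative assumptions. A real driver would re-run this job,
feeding each iteration's output into the next, until the ranks converge.

// Sketch of one PageRank iteration in Hadoop Java.
// Assumed input: "nodeId<TAB>currentRank<TAB>adj1,adj2,..."
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

    private static final double DAMPING = 0.85;
    // Assumed total number of nodes N; in practice this would come from the job configuration.
    private static final double NUM_NODES = 1000.0;

    public static class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String nodeId = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String adjList = parts.length > 2 ? parts[2] : "";
            String[] adjacent = adjList.isEmpty() ? new String[0] : adjList.split(",");

            // Pass the graph structure through so the reducer can re-create the node.
            context.write(new Text(nodeId), new Text("LINKS\t" + adjList));

            // Propagate an equal share of this node's rank to each adjacent node.
            for (String adj : adjacent) {
                context.write(new Text(adj), new Text(Double.toString(rank / adjacent.length)));
            }
        }
    }

    public static class PageRankReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double rankSum = 0.0;
            String adjList = "";
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("LINKS\t")) {
                    adjList = v.substring("LINKS\t".length());
                } else {
                    rankSum += Double.parseDouble(v);
                }
            }
            double newRank = (1 - DAMPING) / NUM_NODES + DAMPING * rankSum;
            // With the default TextOutputFormat this writes "nodeId<TAB>newRank<TAB>adjList",
            // i.e. the same layout the next iteration expects as input.
            context.write(key, new Text(newRank + "\t" + adjList));
        }
    }
}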

Experiment 9:
9. Perform an efficient semi-join in MapReduce.

Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed
Cache, and then filter results from the actual MapReduce data source by performing
membership queries against the Bloom filter to determine which data source records should
be emitted to the reducers.

A semi-join operation involves finding all the records in one dataset that have matching keys in
another dataset. One way to perform an efficient semi-join in MapReduce is to use a Bloom filter
to filter the records in the second dataset.

A Bloom filter is a probabilistic data structure that allows efficient membership tests. It is
essentially a bit vector that is initially set to zero. To add an element to the filter, it is hashed
multiple times, and the resulting hash values are used to set the corresponding bits in the vector.
To test if an element is in the filter, the same hash functions are applied, and the bits
corresponding to the hash values are checked. If all the corresponding bits are set, the element is
deemed to be in the filter (with a certain probability of false positives).
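
As a quick standalone illustration of these membership tests, here is a small sketch that
uses Hadoop's built-in Bloom filter class (org.apache.hadoop.util.bloom.BloomFilter); the
vector size, number of hash functions, and example keys are arbitrary choices for the example.

// Small demo of Hadoop's Bloom filter: add keys, then test membership.
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterDemo {
    public static void main(String[] args) {
        // 1,000,000-bit vector, 10 hash functions, murmur hashing
        BloomFilter filter = new BloomFilter(1000000, 10, Hash.MURMUR_HASH);

        filter.add(new Key("alice".getBytes()));
        filter.add(new Key("bob".getBytes()));

        // Always true for keys that were added; may occasionally be true
        // (a false positive) for keys that were never added.
        System.out.println(filter.membershipTest(new Key("alice".getBytes()))); // true
        System.out.println(filter.membershipTest(new Key("carol".getBytes()))); // almost certainly false
    }
}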

Here is how you can perform an efficient semi-join in MapReduce using a Bloom filter:

1. Preprocess the dataset that contains the keys to filter, and generate a Bloom filter. This
can be done outside of MapReduce, or in a separate MapReduce job that outputs the
filter as a file in HDFS.
2. In the main MapReduce job, load the Bloom filter from the Distributed Cache into each
mapper. The Distributed Cache is a Hadoop facility that copies read-only files (typically
stored in HDFS) to the local disk of every task node before the job runs, so each
mapper can open them locally.
3. In the mapper, for each record in the input dataset, compute the hash of the key, and
perform membership queries against the Bloom filter. If the query returns true, emit the
record to the reducer.
4. In the reducer, process the records emitted by the mappers as usual.

By using a Bloom filter to filter the records in the second dataset, we can reduce the amount of
data that needs to be processed in the MapReduce job, and therefore improve the overall
efficiency. However, Bloom filters have a certain probability of false positives, so it is possible that
some records that do not actually match the keys in the first dataset may be emitted by the
mappers. Therefore, this technique is most effective when the false positive rate of the Bloom
filter is low, and when it is acceptable to tolerate some false positives in the output.
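
The following Java program sketches this technique end to end: the driver builds a Bloom
filter from the join keys of the right-hand dataset, serializes it to a file that is shipped to the
mappers through the distributed cache, and the mappers then use it to filter the left-hand
dataset before the records reach the reducers.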

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomSemiJoin {

    public static class BloomSemiJoinMapper extends Mapper<Object, Text, Text, Text> {

        private final BloomFilter bloomFilter = new BloomFilter();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Deserialize the Bloom filter that the driver placed in the distributed cache
            URI[] cacheFiles = context.getCacheFiles();
            Path filterPath = new Path(cacheFiles[0]);
            FileSystem fs = FileSystem.get(context.getConfiguration());
            try (FSDataInputStream in = fs.open(filterPath)) {
                bloomFilter.readFields(in);
            }
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String joinField = fields[0];
            // Emit the record only if its join key is (probably) present in the right dataset
            if (bloomFilter.membershipTest(new Key(joinField.getBytes()))) {
                context.write(new Text(joinField), value);
            }
        }
    }

    public static class BloomSemiJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: BloomSemiJoin <left_input_path> <right_input_path> <output_path>");
            System.exit(1);
        }

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Build the Bloom filter from the join keys of the right dataset
        // (assumed here to be a single CSV file; the size and hash count are example values)
        BloomFilter bloomFilter = new BloomFilter(1000000, 10, Hash.MURMUR_HASH);
        int joinFieldIndex = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(fs.open(new Path(args[1]))))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] fields = line.split(",");
                bloomFilter.add(new Key(fields[joinFieldIndex].getBytes()));
            }
        }

        // Serialize the filter and register it in the distributed cache
        Path bloomFilterPath = new Path(args[1] + ".bloom");
        try (FSDataOutputStream out = fs.create(bloomFilterPath, true)) {
            bloomFilter.write(out);
        }

        Job job = Job.getInstance(conf, "Bloom Semi-Join");
        job.addCacheFile(bloomFilterPath.toUri());

        job.setJarByClass(BloomSemiJoin.class);
        job.setMapperClass(BloomSemiJoinMapper.class);
        job.setReducerClass(BloomSemiJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Experiment 10:
10. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your
data.

Pig is a high-level platform for creating MapReduce programs that run on Apache
Hadoop. It is used to analyze large datasets in a distributed computing environment.
To install and run Pig, follow the steps below:

1. Download Pig: Go to the Apache Pig website (https://pig.apache.org/) and
download the latest stable version of Pig. The download will be a compressed
file with a .tar.gz extension.
2. Extract Pig: Extract the downloaded file using the following command:

tar -xzvf pig-x.y.z.tar.gz

Replace x.y.z with the version number of Pig you downloaded.
3. Set up environment variables: You need to set up the environment variables
for Pig to work properly. Open the .bashrc or .bash_profile file in your home
directory and add the following lines:

export PIG_HOME=/path/to/pig-x.y.z
export PATH=$PATH:$PIG_HOME/bin

Replace /path/to/pig-x.y.z with the actual path to the Pig installation directory.
4. Verify installation: Open a new terminal window and run the following
command to verify the installation:

pig -help

If the installation is successful, you should see the Pig help message.
5. Run Pig: To run Pig, you need to have Hadoop installed and running. Start
Hadoop and run the following command:

pig -x local

This will start Pig in local mode. You can now run Pig scripts or type Pig Latin
commands directly in the console.

That's it! You have successfully installed and run Pig.

here are some examples of Pig Latin scripts to perform common data manipulations (the
examples assume the input relations, such as mydata, have already been created with
Pig's LOAD operator):

1. Sort:

To sort a dataset by a particular column, you can use the ORDER BY operator. For
example, suppose you have a dataset mydata with columns name and age. You can sort
it by age using the following Pig Latin script:

sorted = ORDER mydata BY age;

This will create a new relation sorted that contains the same data as mydata, but sorted by
age.

2. Group:

To group a dataset by a particular column, you can use the GROUP BY operator. For example,
suppose you have a dataset mydata with columns name, age, and gender. You can group it by
gender using the following Pig Latin script:
grouped = GROUP mydata BY gender;

This will create a new relation grouped that contains groups of records with the same
gender.

3. Join:

To join two datasets by a common column, you can use the JOIN operator. For
example, suppose you have two datasets mydata1 and mydata2, both with columns name
and age. You can join them on the name column using the following Pig Latin script:

joined = JOIN mydata1 BY name, mydata2 BY name;

This will create a new relation joined that contains the records from mydata1 and
mydata2 that have the same name.

4. Project:

To select a subset of columns from a dataset, you can use the FOREACH...GENERATE
operator. For example, suppose you have a dataset mydata with columns name, age, and
gender. You can select just the name and gender columns using the following Pig Latin
script:

projected = FOREACH mydata GENERATE name, gender;

This will create a new relation projected that contains only the name and gender
columns from mydata.

5. Filter:

To select a subset of records from a dataset that satisfy a certain condition, you can
use the FILTER operator. For example, suppose you have a dataset mydata with
columns name and age. You can select only the records where the age is greater than
or equal to 18 using the following Pig Latin script:

filtered = FILTER mydata BY age >= 18;

This will create a new relation filtered that contains only the records from mydata
where the age is greater than or equal to 18.
Experiment 11:
11. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes.

Hive is a data warehousing tool that enables querying and managing large datasets
that are stored in Hadoop Distributed File System (HDFS). To install and run Hive,
follow the steps below:

1. Download Hive: Go to the Apache Hive website (https://hive.apache.org/) and
download the latest stable version of Hive. The download will be a
compressed file with a .tar.gz extension.
2. Extract Hive: Extract the downloaded file using the following command:

tar -xzvf apache-hive-x.y.z-bin.tar.gz

Replace x.y.z with the version number of Hive you downloaded.


3. Set up environment variables: You need to set up the environment variables
for Hive to work properly. Open the .bashrc or .bash_profile file in your home
directory and add the following lines:

export HIVE_HOME=/path/to/apache-hive-x.y.z-bin

export PATH=$PATH:$HIVE_HOME/bin

Replace /path/to/apache-hive-x.y.z-bin with the actual path to the Hive installation
directory.
4. Set up Hive configuration: Hive requires a configuration file to run properly.
You can copy the default configuration file by running the following
command:

cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml

5. Start Hadoop: Hive requires Hadoop to be running. Start Hadoop by running
the following command:

start-all.sh

This will start all the Hadoop daemons.


6. Verify installation: Open a new terminal window and run the following
command to verify the installation:

hive --version

If the installation is successful, you should see the Hive version number.

7. Run Hive: To run Hive, you can use the command line interface (CLI), or connect
through HiveServer2, a service that remote clients such as Beeline can use to run
queries. To start the CLI, run the following command:

hive

This will start the Hive CLI, where you can run Hive queries.

That's it! You have successfully installed and run Hive.


here are some examples of using Hive to create, alter, and drop databases, tables,
views, functions, and indexes:

1. Create a database:

To create a new database in Hive, you can use the CREATE DATABASE command. For
example, to create a database named mydb, run the following command:

CREATE DATABASE mydb;


2. Alter a database:

To alter an existing database in Hive, you can use the ALTER DATABASE command. For
example, to change the database location to ‘/my/new/location’, run the following
command:

ALTER DATABASE mydb SET LOCATION '/my/new/location';


3. Drop a database:

To drop an existing database in Hive, you can use the DROP DATABASE command. For
example, to drop the mydb database, run the following command:

DROP DATABASE mydb;


Note that dropping a database will also drop all the tables, views, functions, and
indexes in that database.

4. Create a table:

To create a new table in Hive, you can use the CREATE TABLE command. For example,
to create a table named mytable with columns name, age, and gender, run the following
command:

CREATE TABLE mytable (name STRING, age INT, gender STRING);

5. Alter a table:

To alter an existing table in Hive, you can use the ALTER TABLE command. For example,
to add a new column email of type STRING to the mytable table, run the following
command:

ALTER TABLE mytable ADD COLUMNS (email STRING);


6. Drop a table:

To drop an existing table in Hive, you can use the DROP TABLE command. For example,
to drop the mytable table, run the following command:

DROP TABLE mytable;


7. Create a view:

To create a new view in Hive, you can use the CREATE VIEW command. For example, to
create a view named myview that selects the name and age columns from the mytable
table, run the following command:

CREATE VIEW myview AS SELECT name, age FROM mytable;


8. Drop a view:

To drop an existing view in Hive, you can use the DROP VIEW command. For example,
to drop the myview view, run the following command:

DROP VIEW myview;


9. Create a function:

To create a new function in Hive, you can use the CREATE FUNCTION command. For
example, to create a function named myfunction that returns the length of a string,
run the following command:

CREATE FUNCTION myfunction AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLength';

10. Drop a function:


To drop an existing function in Hive, you can use the DROP FUNCTION command. For
example, to drop the myfunction function, run the following command:

DROP FUNCTION myfunction;

11. Create an index:

To create a new index in Hive (supported up to Hive 2.x; indexes were removed in Hive
3.0), you can use the CREATE INDEX command. For example, to create a compact index
named myindex on the name column of the mytable table, run the following command:

CREATE INDEX myindex ON TABLE mytable (name)
AS 'COMPACT' WITH DEFERRED REBUILD;

12. Drop an index:

To drop an existing index in Hive, you can use the DROP INDEX command. For example,
to drop the myindex index, run the following command:

DROP INDEX myindex ON mytable;
