Experiment 1:
1. Implement the following Data structures in Java
a) Linked Lists
import java.util.*;
public class LinkedList1{
    public static void main(String args[]){
        // create a linked list and add elements to it
        LinkedList<String> al=new LinkedList<String>();
        al.add("Ravi");
        al.add("Vijay");
        al.add("Ravi");
        al.add("Ajay");
        // iterate over the list and print each element
        Iterator<String> itr=al.iterator();
        while(itr.hasNext()){
            System.out.println(itr.next());
        }
    }
}
Output: Ravi
Vijay
Ravi
Ajay
b) Stacks
import java.util.Stack;
public class StackEmptyMethodExample
{
public static void main(String[] args)
{
//creating an instance of Stack class
Stack<Integer> stk= new Stack<>();
// checking stack is empty or not
boolean result = stk.empty();
System.out.println("Is the stack empty? " + result);
// pushing elements into stack
stk.push(78);
stk.push(113);
stk.push(90);
stk.push(120);
//prints elements of the stack
System.out.println("Elements in Stack: " + stk);
result = stk.empty();
System.out.println("Is the stack empty? " + result);
}
}
Output:
Is the stack empty? true
Elements in Stack: [78, 113, 90, 120]
Is the stack empty? false
c) Queues
import java.util.*;
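The queue program itself is not included above; the following is a minimal sketch (the class name QueueExample and the sample values are illustrative, not from the original) that exercises a java.util.Queue backed by a LinkedList:
import java.util.LinkedList;
import java.util.Queue;
public class QueueExample {
    public static void main(String[] args) {
        // create a FIFO queue backed by a LinkedList
        Queue<String> q = new LinkedList<>();
        // enqueue elements
        q.add("Ravi");
        q.add("Vijay");
        q.add("Ajay");
        // peek at the head of the queue without removing it
        System.out.println("Head: " + q.peek());
        // dequeue and print elements in FIFO order
        while (!q.isEmpty()) {
            System.out.println(q.remove());
        }
    }
}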
d) Set
// Java program illustrating the Set interface
import java.util.*;
// Main class
public class GFG {
    public static void main(String[] args){
        Set<String> hs=new HashSet<String>(); // duplicate elements are ignored
        hs.add("Geeks"); hs.add("For"); hs.add("Geeks");
        hs.add("Example"); hs.add("Set");
        System.out.println(hs);
    }
}
Output
[Set, Example, Geeks, For]
e) Map
// Java program illustrating the Map interface
import java.util.*;
// Main class
class GFG {
    public static void main(String[] args){
        Map<String,Integer> hm=new HashMap<String,Integer>();
        hm.put("a",100); hm.put("b",200);
        hm.put("c",300); hm.put("d",400);
        // Iterating over the map entries
        for(Map.Entry<String,Integer> me : hm.entrySet()){
            // Printing keys
            System.out.print(me.getKey() + ":");
            System.out.println(me.getValue());
        }
    }
}
Output:
a:100
b:200
c:300
d:400
Experiment 2:
2. (i) Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed.
Standalone:
Install Hadoop: Setting up a Single Node Hadoop Cluster
A theoretical understanding of Hadoop, HDFS, and their architecture is assumed; this section walks through the practical steps. The first step is to install Hadoop.
There are two ways to install Hadoop, i.e. Single node and Multi-node.
A single-node cluster means only one DataNode is running, with the NameNode, DataNode, ResourceManager, and NodeManager all set up on a single machine. This is used for study and testing purposes. For example, consider a sample data set from the healthcare industry. To test whether the Oozie jobs have scheduled all the processes, collecting, aggregating, storing, and processing the data in the proper sequence, we use a single-node cluster. It can easily and efficiently test the sequential workflow in a smaller environment, compared to large environments that contain terabytes of data distributed across hundreds of machines.
In a multi-node cluster, more than one DataNode is running, and each DataNode runs on a different machine. Multi-node clusters are what organizations actually use for analyzing Big Data. Continuing the above example, when we deal with petabytes of data in real time, the data needs to be distributed across hundreds of machines to be processed, so a multi-node cluster is used.
Install Hadoop
Step 1: Download the Java 8 package and save the file in your home directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
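The appended lines take roughly the following form (a sketch only; the exact paths depend on where you extracted the JDK and Hadoop, here assumed to be the home directory):
export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin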
To apply these changes to the current terminal, execute the source command.
Command: source .bashrc
To make sure that Java and Hadoop have been properly installed on your system
and can be accessed through the Terminal, execute the java -version and hadoop
version commands.
Command: ls
Command: vi core-site.xml
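A minimal single-node core-site.xml typically contains a property like the following (a sketch; the host and port are common defaults, not taken from the original):
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>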
Command: vi hdfs-site.xml
Fig: Hadoop Installation – Configuring hdfs-site.xml
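A minimal single-node hdfs-site.xml looks like the following (a sketch, assuming a replication factor of 1 for a single DataNode):
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>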
In some cases, the mapred-site.xml file is not available, so we have to create it from the mapred-site.xml.template file.
Command: cp mapred-site.xml.template mapred-site.xml
Command: vi mapred-site.xml
Fig: Hadoop Installation – Configuring mapred-site.xml
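The property typically added here (a sketch) tells MapReduce to run on YARN:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>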
Step 10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:
yarn-site.xml contains configuration settings of ResourceManager and NodeManager
like application memory management size, the operation needed on program &
algorithm, etc.
Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run
Hadoop like Java home path, etc.
Command: vi hadoop-env.sh
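The line to add is of the following form (the path is assumed to match the JDK extracted earlier):
export JAVA_HOME=$HOME/jdk1.8.0_101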
Command: cd hadoop-2.7.3
Command: bin/hadoop namenode -format
This formats the HDFS via the NameNode. The command is executed only the first time; formatting the file system means initializing the directory specified by the dfs.name.dir variable.
Never format an up-and-running Hadoop file system; you will lose all your data stored in HDFS.
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in HDFS and tracks all the files stored across the cluster.
Command: ./hadoop-daemon.sh start namenode
Start DataNode:
On startup, a DataNode connects to the Namenode and it responds to the requests
from the Namenode for different operations.
Command: ./hadoop-daemon.sh start datanode
Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster resources and thus helps in managing the distributed applications running on the YARN system. Its job is to manage each NodeManager and each application's ApplicationMaster.
Command: ./yarn-daemon.sh start resourcemanager
Start NodeManager:
The NodeManager on each machine is the framework agent responsible for managing containers, monitoring their resource usage, and reporting the same to the ResourceManager.
Command: ./yarn-daemon.sh start nodemanager
Step 14: To check that all the Hadoop services are up and running, run the
below command.
Command: jps
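If everything is running, the output lists one line per daemon, similar to the following (process IDs are illustrative; a SecondaryNameNode entry also appears if the daemons were started with start-all.sh):
2626 NameNode
2786 DataNode
3054 ResourceManager
3218 NodeManager
3519 Jps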
Installation on Windows:
Step 1: Download the Hadoop binary package. For reference, you can save the file to a folder as follows:
C:\BigData
Step 2: Unzip the binary package
Open Git Bash, and change directory (cd) to the folder where you save the binary
package and then unzip as follows.
$ cd C:\BigData
MINGW64: C:\BigData
$ tar -xvzf hadoop-3.1.2.tar.gz
In this case, the Hadoop binary is extracted to C:\BigData\hadoop-3.1.2.
Next, go to this GitHub Repo and download the bin folder as shown below. Extract the zip and copy all the files present under the bin folder to C:\BigData\hadoop-3.1.2\bin, replacing the existing files.
Step 3: Create folders for the datanode and namenode:
• Go to C:\BigData\hadoop-3.1.2 and create a folder 'data'. Inside the 'data' folder, create two folders, 'datanode' and 'namenode'. Your files on HDFS will reside under the datanode folder.
Four new windows with cmd terminals will open, one for each of the following daemon processes:
• namenode
• datanode
• node manager
• resource manager
Don’t close these windows, minimize them. Closing the windows will terminate
the daemons. You can run them in the background if you don’t like to see these
windows.
Fully distributed:
Ensure that the master and all the slaves have the same user across all nodes in the cluster. Append the following Hadoop environment variables to the ~/.bashrc file on each node to export them:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_67
export HADOOP_PREFIX=/home/impadmin/hadoop-2.6.4
export PATH=$HADOOP_PREFIX/bin:$JAVA_HOME/bin:$PATH
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
Step 2: Configuration
Add all the properties mentioned below between the <configuration> tags, in the corresponding files under $HADOOP_PREFIX/etc/hadoop, on all the machines in the cluster:
The following XML files must be reconfigured:
• core-site.xml
• hdfs-site.xml
• yarn-site.xml
• mapred-site.xml
Core-site.xml
The core-site.xml file contains information regarding memory allocated for the file
system, the port number used for Hadoop instance, size of Read/Write buffers, and
memory limit for storing the data.
Open the core-site.xml file and add the properties listed below in between the <configuration>, </configuration> tags in this file.
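For example (a sketch; the host name master and port 9000 are placeholders for your own NameNode host):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>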
Hdfs-site.xml
The hdfs-site.xml file contains information regarding the namenode path, datanode
paths of the local file systems, the value of replication data, etc. It means the place
where you want to store the Hadoop infrastructure.
Open the hdfs-site.xml file and add the properties listed below in between the <configuration>, </configuration> tags.
In this file, all the property values are user-defined and can be changed according to
the Hadoop infrastructure.
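A representative sketch (the directory paths and the replication factor are assumptions to be adapted to your own layout):
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>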
Mapred-site.xml:
Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
Yarn-site.xml:
Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
HDFS Metrics
• NameNode metrics
• DataNode metrics
NameNodes and DataNodes
• NameNode-emitted metrics
• NameNode Java Virtual Machine (JVM) metrics
Below we’re going to list each group of metrics you can monitor and
then show you a way to monitor these metrics for HDFS.
NameNode-emitted metrics
One way that you can monitor HDFS metrics is through Java
Management Extensions (JMX) and the HDFS daemon web interface.
To view a summary of NameNode and performance metrics enter the
following URL into your web browser to access the web interface
(which is available by default at port 50070):
http://<namenodehost>:50070
Here you’ll be able to see information on Configured Capacity, DFS
Used, Non-DFS Used, DFS Remaining, Block Pool Used, DataNodes
usage%, and more.
If you require more in-depth information, you can enter the following
URL to view more metrics with a JSON output:
http://<namenodehost>:50070/jmx
MapReduce Counters
Below we’re going to look at some of the built-in counters you can use
to monitor Hadoop.
Built-In MapReduce Counters
• Job counters
• Task counters
• File system counters
• FileInputFormat counters
• FileOutputFormat counters
Job Counters
MapReduce job counters measure statistics at the job level, such as
the number of failed maps or reduces.
Task Counters
FileSystem Counters
FileInputFormat Counters
• Bytes read – Displays the bytes read by map tasks with the
specific input format
Cluster Metrics
NodeManager metrics
To collect metrics for YARN, you can use the HTTP API. Query your ResourceManager host for the YARN metrics exposed on port 8088 by entering the following (use the qry parameter to specify the MBeans you want to monitor):
http://<resourcemanagerhost>:8088/jmx?qry=java.lang:type=Memory
ZooKeeper Metrics
There are a number of ways you can collect metrics for Zookeeper,
but the easiest is by using the 4 letter word commands through Telnet
or Netcat at the client port (2181 by default). To keep things simple, we're going to look at mntr, arguably the most important of the four-letter-word commands.
$ echo mntr | nc localhost 2181
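The command returns one metric per line; a typical (abridged) response looks like the following, with values that will differ on your own ensemble:
zk_version 3.4.6
zk_avg_latency 0
zk_packets_received 71
zk_packets_sent 70
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state standalone
zk_znode_count 4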
Key features:
• Monitors HDFS NameNode, HDFS DataNode, Yarn, and
MapReduce metrics
• REST API
• Custom alert thresholds
• Dashboard
• Reports
Experiment 3:
3. Implement the following file management tasks in Hadoop: adding files and directories to HDFS, retrieving files, and deleting files.
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using the command line utilities shown below.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from local systems to the Hadoop file system using
the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
Retrieving Data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple
demonstration for retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Removing a file or directory from HDFS:
Step 1: Switch to root user from ec2-user using the “sudo -i” command.
Check files in the HDFS using the “hadoop fs -ls” command. In this case, we found 44
items in the HDFS.
Let us try removing the "users_2.orc" file we find in the above result. A file or a directory can be removed by passing the "-rmr" argument to the hadoop fs command. The syntax for it is:
hadoop fs -rmr <absolute path to the file or directory>
Since my "users_2.orc" file is present in the root directory, the command would be "hadoop fs -rmr /user/root/users_2.orc". Let us observe the output of this command upon execution.
It is clear from the above result that the file is moved to the trash.
Let us cross-check the same by listing the files in the HDFS, and our desired output
must not contain the “users_2.orc” file in the list.
So only 43 items are now present in the root directory as users_2.orc file is removed
from the HDFS.
Experiment 4:
4. Run a basic Word Count MapReduce program to understand MapReduce Paradigm.
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line (‘\n’).
5. Combining – The last phase where all the data (individual result set from each
cluster) is combined together to form a result.
Make sure that Hadoop is installed on your system with the Java SDK.
Steps
1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish.
2. Right Click > New > Package ( Name it - PackageDemo) > Finish.
3. Add the following reference libraries to the project build path:
1. /usr/lib/hadoop-0.20/hadoop-core.jar
2. /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true)?0:1);
    }
    // Mapper: emits (word, 1) for every comma-separated word in a line
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }
    // Reducer: sums the counts for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
• Driver class (the public static void main method; this is the entry point).
• The Map class which extends the public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and implements the Map function.
• The Reduce class which extends the public class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the Reduce function.
Right Click on Project> Export> Select export destination as Jar File > next> Finish.
7. Take a text file and move it into HDFS:
To move this into Hadoop directly, open the terminal and enter the following commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
8. Run the jar file:
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile <output directory>
Found 3 items
BUS 7
CAR 4
TRAIN 6
Experiment 5:
5. Write a map reduce program that mines weather data. Weather sensors collecting data
every hour at many locations across the globe gather a large volume of log data, which is a
good candidate for analysis with Map Reduce, since it is semi structured and record-oriented.
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a
MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase
has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer
also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the
dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we
have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, because these are the only fields we
are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way
that the reduce function can do its work on it: finding the maximum temperature for each year. The map function
is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data (some unused columns
have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004…9999999N9+00001+99999999999…)
(106, 0043011990999991950051512004…9999999N9+00221+99999999999…)
(212, 0043011990999991950051518004…9999999N9-00111+99999999999…)
(318, 0043012650999991949032412004…0500001N9+01111+99999999999…)
(424, 0043012650999991949032418004…0500001N9+00781+99999999999…)
The keys are the line offsets within the file, which we ignore in our map function. The map function merely
extracts the year and the air temperature (indicated in bold text), and emits them as its output (the temperature
values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce
function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce
function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate
through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year. The whole data flow is illustrated
in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow.
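The experiment asks for an actual program; the following is a minimal sketch of the mapper, reducer, and driver described above (class names are illustrative; the fixed-width field positions follow the full NCDC record layout, and the quality/missing-value checks are simplified):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
    public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);          // year field of the NCDC record
            int airTemperature;
            if (line.charAt(87) == '+') {                  // parseInt doesn't like leading plus signs
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            // drop missing, suspect, or erroneous readings
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }
    public static class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get()); // keep the largest reading for the year
            }
            context.write(key, new IntWritable(maxValue));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}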
Experiment 6:
6.Use MapReduce to find the shortest path between two people in a social graph.
Hint: Use an adjacency list to model a graph, and for each node store the distance from the
original node, as well as a back pointer to the original node. Use the mappers to propagate the
distance to the original node, and the reducer to restore the state of the graph. Iterate until the
target node has been reached.
To find the shortest path between two people in a social graph using MapReduce, we
can follow the following steps:
1. Create an adjacency list to represent the social graph. Each node in the graph
should have a unique identifier and a list of adjacent nodes.
2. Initialize the distance of each node from the source node to infinity except the
source node itself which has a distance of zero. Each node also has a back
pointer to the source node.
3. Create a mapper function that takes a node and emits a key-value pair for
each adjacent node. The key is the adjacent node identifier, and the value is a
tuple containing the distance of the current node from the source node plus
the weight of the edge connecting them, and the back pointer of the current
node.
4. Create a reducer function that takes a node and a list of its adjacent nodes.
For each adjacent node, compute the distance from the source node to the
adjacent node as the minimum of its current distance and the distance
received from the mapper plus the weight of the edge connecting them.
Update the back pointer of the adjacent node to the current node if its
distance is updated.
5. Repeat steps 3 and 4 until the target node is reached, i.e., its distance is
updated.
6. Once the target node is reached, trace back the path from the target node to
the source node using the back pointers stored in each node.
// Mapper function
map(node):
    if node.distance != infinity:
        for adjacent in node.adjacent_nodes:
            emit(adjacent, (node.distance + edge_weight(node, adjacent), node.id))
    else:
        pass  // distance still unknown, nothing to propagate yet
    emit(node.id, node)  // pass the graph structure through to the reducer
// Reducer function
reduce(node, values):
    // values contain the node structure plus (distance, back pointer) tuples from neighbours
    min_distance = infinity
    back_pointer = None
    for (distance, pointer) in values:
        if distance < min_distance:
            min_distance = distance
            back_pointer = pointer
    if min_distance < node.distance:
        node.distance = min_distance
        node.back_pointer = back_pointer
    emit(node, node.adjacent_nodes)
// Main function
while the target node has not been reached:
    run the MapReduce job on the updated graph
trace back the path from target node to source node using back pointers
Note that the emit function in the mapper and reducer functions is used to send key-
value pairs to the next stage of the MapReduce pipeline. The pairs emitted by the
mapper function are grouped by key (i.e., the adjacent node identifier) and passed to
the reducer function as a list of values. The reducer function emits a key-value pair
for each updated node, indicating that its adjacent nodes may need to be
reevaluated in the next iteration.
Experiment 7:
7. Implement Friends-of-friends algorithm in MapReduce.
Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social
network .The first job calculates the common friends for each user, and the second job
sorts the common friends by the number of connections to your friends.
1. Create an adjacency list to represent the social graph. Each node in the graph
should have a unique identifier and a list of adjacent nodes.
2. Create a mapper function that takes a node and emits a key-value pair for
each pair of adjacent nodes. The key is a tuple containing the two adjacent
node identifiers, sorted in alphabetical order. The value is the node identifier
of the mapper's input node.
3. Create a reducer function that takes a pair of adjacent nodes and a list of the
nodes that are adjacent to both nodes. The reducer function emits a key-value
pair for each node that is adjacent to both nodes. The key is the node
identifier of the common friend, and the value is a list of the node identifiers
of the two adjacent nodes.
4. Run the first MapReduce job.
First Job
// Mapper function
map(node):
    for friend1 in node.adjacent_nodes:
        for friend2 in node.adjacent_nodes:
            if friend1 < friend2:  // emit each pair once, in sorted order
                emit((friend1, friend2), node.id)
            else:
                continue
// Reducer function
reduce(pair, nodes):
    common_friends = []
    for node in nodes:
        common_friends.append(node)  // nodes adjacent to both members of the pair
    for friend in common_friends:
        emit(friend, pair)
Second Job
// Mapper function
map(friend, pairs):
for pair in pairs:
if pair[0] != friend:
emit(pair[0], (pair[1], 1))
// Reducer function
reduce(friend, pairs):
sorted_pairs = sorted(pairs, key=lambda x: x[1], reverse=True)
for pair in sorted_pairs:
emit(pair[0], pair[1])
Note that the emit function in the mapper and reducer functions is used to send key-
value pairs to the next stage of the MapReduce pipeline. The pairs emitted by the
first mapper function are grouped by key (i.e., the pair of adjacent node identifiers)
and passed to the first reducer function as a list of values. The reducer function emits
a key-value pair for each common friend, indicating which two adjacent nodes share
that friend. The pairs emitted by the second mapper function are grouped by key
(i.e., the common friend identifier) and passed to the second reducer function as a
list of tuples. The reducer function sorts the tuples by the number of connections and emits a key-value pair for each friend-of-friend, ordered by the number of friends they share.
Experiment 8:
8. Implement an iterative PageRank graph algorithm in MapReduce.
Hint: PageRank can be implemented by iterating a MapReduce job until the graph has
converged. The mappers are responsible for propagating node PageRank values to their
adjacent nodes, and the reducers are responsible for calculating new PageRank values for
each node, and for re-creating the original graph with the updated PageRank values.
1. Create an adjacency list to represent the graph. Each node in the graph should
have a unique identifier and a list of adjacent nodes.
2. Initialize the PageRank values for each node. Each node should start with a
PageRank of 1/N, where N is the total number of nodes in the graph.
3. Create a mapper function that takes a node and emits a key-value pair for
each adjacent node. The key is the adjacent node identifier, and the value is
the PageRank value of the input node divided by the number of adjacent
nodes.
4. Create a reducer function that takes a node and a list of PageRank values from
its adjacent nodes. The reducer function sums the PageRank values and
applies the PageRank formula to calculate the new PageRank value for the
node. The reducer function emits a key-value pair for each node in the graph.
The key is the node identifier, and the value is a tuple containing the new
PageRank value and the node's adjacency list.
5. Run the MapReduce job.
6. Update the PageRank values for each node in the graph based on the output
of the previous MapReduce job.
7. Repeat steps 3-6 until the graph has converged (i.e., the difference in
PageRank values for each node between iterations is below a certain
threshold).
// Mapper function
map(node):
    for adjacent in node.adjacent_nodes:
        emit(adjacent, node.pagerank / length(node.adjacent_nodes))
    emit(node.id, node.adjacent_nodes)  // preserve the graph structure
// Reducer function
reduce(node, pageranks):
    damping_factor = 0.85
    // N is the total number of nodes in the graph
    new_pagerank = (1 - damping_factor) / N + damping_factor * sum(pageranks)
    emit(node.id, (new_pagerank, node.adjacent_nodes))
// Main function
converged = false
while not converged:
    run_mapreduce_job()
    converged = check_convergence()
Note that the emit function in the mapper and reducer functions is used to send key-
value pairs to the next stage of the MapReduce pipeline.
Experiment 9:
9. Perform an efficient semi-join in MapReduce.
Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed
Cache, and then filter results from the actual MapReduce data source by performing
membership queries against the Bloom filter to determine which data source records should
be emitted to the reducers.
A semi-join operation involves finding all the records in one dataset that have matching keys in
another dataset. One way to perform an efficient semi-join in MapReduce is to use a Bloom filter
to filter the records in the second dataset.
A Bloom filter is a probabilistic data structure that allows efficient membership tests. It is
essentially a bit vector that is initially set to zero. To add an element to the filter, it is hashed
multiple times, and the resulting hash values are used to set the corresponding bits in the vector.
To test if an element is in the filter, the same hash functions are applied, and the bits
corresponding to the hash values are checked. If all the corresponding bits are set, the element is
deemed to be in the filter (with a certain probability of false positives).
Here is how you can perform an efficient semi-join in MapReduce using a Bloom filter:
1. Preprocess the dataset that contains the keys to filter, and generate a Bloom filter. This
can be done outside of MapReduce, or in a separate MapReduce job that outputs the
filter as a file in HDFS.
2. In the main MapReduce job, load the Bloom filter from the Distributed Cache into each
mapper. The Distributed Cache is a distributed file system that can be used to cache files
needed by MapReduce jobs.
3. In the mapper, for each record in the input dataset, compute the hash of the key, and
perform membership queries against the Bloom filter. If the query returns true, emit the
record to the reducer.
4. In the reducer, process the records emitted by the mappers as usual.
By using a Bloom filter to filter the records in the second dataset, we can reduce the amount of
data that needs to be processed in the MapReduce job, and therefore improve the overall
efficiency. However, Bloom filters have a certain probability of false positives, so it is possible that
some records that do not actually match the keys in the first dataset may be emitted by the
mappers. Therefore, this technique is most effective when the false positive rate of the Bloom
filter is low, and when it is acceptable to tolerate some false positives in the output.
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;
public class BloomSemiJoin {
    public static class BloomSemiJoinMapper extends Mapper<Object, Text, Text, Text> {
        private final BloomFilter bloomFilter = new BloomFilter(100000, 10, Hash.MURMUR_HASH);
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Load the serialized Bloom filter that the driver placed in the Distributed Cache
            Path cached = new Path(context.getCacheFiles()[0].getPath());
            try (DataInputStream in = FileSystem.get(context.getConfiguration()).open(cached)) {
                bloomFilter.readFields(in);
            }
        }
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // The join field is assumed to be the first comma-separated column of each record
            String joinField = value.toString().split(",")[0];
            if (bloomFilter.membershipTest(new Key(joinField.getBytes()))) {
                context.write(new Text(joinField), value); // only probable matches reach the reducers
            }
        }
    }
    public static class BloomSemiJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: BloomSemiJoin <left input (HDFS)> <right dataset (local file)> <output (HDFS)>");
            System.exit(1);
        }
        Configuration conf = new Configuration();
        // Load the join field from the right dataset and add it to the Bloom filter
        BloomFilter bloomFilter = new BloomFilter(100000, 10, Hash.MURMUR_HASH);
        int joinFieldIndex = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(args[1]))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] fields = line.split(",");
                bloomFilter.add(new Key(fields[joinFieldIndex].getBytes()));
            }
        }
        // Serialize the filter to an (assumed) HDFS path and register it in the Distributed Cache
        Path bloomFilterPath = new Path("/tmp/bloom_filter");
        try (FSDataOutputStream out = FileSystem.get(conf).create(bloomFilterPath, true)) {
            bloomFilter.write(out);
        }
        Job job = Job.getInstance(conf, "bloom semi-join");
        job.addCacheFile(bloomFilterPath.toUri());
        job.setJarByClass(BloomSemiJoin.class);
        job.setMapperClass(BloomSemiJoinMapper.class);
        job.setReducerClass(BloomSemiJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Experiment 10:
10. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your
data.
Pig is a high-level platform for creating MapReduce programs that run on Apache
Hadoop. It is used to analyze large datasets in a distributed computing environment.
To install and run Pig, follow the steps below. Download a Pig release from the Apache Pig website, extract it, and add the following lines to your ~/.bashrc file:
export PIG_HOME=/path/to/pig-x.y.z
export PATH=$PATH:$PIG_HOME/bin
5. Replace /path/to/pig-x.y.z with the actual path to the Pig installation directory.
6. Verify installation: Open a new terminal window and run the following command to verify the installation:
pig -help
7. If the installation is successful, you should see the Pig help message.
8. Run Pig: To run Pig, you need to have Hadoop installed and running. Start Hadoop and run the following command:
pig -x local
9. This will start Pig in local mode. You can now run Pig scripts or type Pig Latin commands directly in the console.
Here are some examples of Pig Latin scripts to perform common data manipulations:
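The examples below assume a relation mydata has already been loaded; a minimal sketch (the file name and schema are illustrative):
mydata = LOAD 'mydata.txt' USING PigStorage(',') AS (name:chararray, age:int, gender:chararray);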
1. Sort:
To sort a dataset by a particular column, you can use the ORDER BY operator. For
example, suppose you have a dataset mydata with columns name and age. You can sort
it by age using the following Pig Latin script:
sorted = ORDER mydata BY age;
This will create a new relation sorted that contains the same data as mydata, but sorted by
age.
2. Group:
To group a dataset by a particular column, you can use the GROUP BY operator. For example,
suppose you have a dataset mydata with columns name, age, and gender. You can group it by
gender using the following Pig Latin script:
grouped = GROUP mydata BY gender;
This will create a new relation grouped that contains groups of records with the same
gender.
3. Join:
To join two datasets by a common column, you can use the JOIN operator. For
example, suppose you have two datasets mydata1 and mydata2, both with columns name
and age. You can join them on the name column using the following Pig Latin script:
joined = JOIN mydata1 BY name, mydata2 BY name;
This will create a new relation joined that contains the records from mydata1 and
mydata2 that have the same name.
4. Project:
To select a subset of columns from a dataset, you can use the FOREACH...GENERATE
operator. For example, suppose you have a dataset mydata with columns name, age, and
gender. You can select just the name and gender columns using the following Pig Latin
script:
projected = FOREACH mydata GENERATE name, gender;
This will create a new relation projected that contains only the name and gender
columns from mydata.
5. Filter:
To select a subset of records from a dataset that satisfy a certain condition, you can
use the FILTER operator. For example, suppose you have a dataset mydata with
columns name and age. You can select only the records where the age is greater than
or equal to 18 using the following Pig Latin script:
filtered = FILTER mydata BY age >= 18;
This will create a new relation filtered that contains only the records from mydata
where the age is greater than or equal to 18.
Experiment 11:
11. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes.
Hive is a data warehousing tool that enables querying and managing large datasets
that are stored in Hadoop Distributed File System (HDFS). To install and run Hive,
follow the steps below:
Add the following lines to your ~/.bashrc file (replacing /path/to/apache-hive-x.y.z-bin with the actual Hive installation path):
export HIVE_HOME=/path/to/apache-hive-x.y.z-bin
export PATH=$PATH:$HIVE_HOME/bin
Create the hive-site.xml configuration file from the bundled template:
cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml
Start Hadoop, then check the Hive version to verify the installation:
start-all.sh
hive --version
If the installation is successful, you should see the Hive version number.
7. Run Hive: To run Hive, you can use the command line interface (CLI) or connect to the HiveServer2 service through a client such as Beeline. To start the CLI, run the following command:
hive
This will start the Hive CLI, where you can run Hive queries.
1. Create a database:
To create a new database in Hive, you can use the CREATE DATABASE command. For
example, to create a database named mydb, run the following command:
CREATE DATABASE mydb;
2. Alter a database:
To alter an existing database in Hive, you can use the ALTER DATABASE command. For example, to change the database location to '/my/new/location', run the following command:
ALTER DATABASE mydb SET LOCATION '/my/new/location';
3. Drop a database:
To drop an existing database in Hive, you can use the DROP DATABASE command. For example, to drop the mydb database, run the following command:
DROP DATABASE mydb;
4. Create a table:
To create a new table in Hive, you can use the CREATE TABLE command. For example,
to create a table named mytable with columns name, age, and gender, run the following
command:
CREATE TABLE mytable (name STRING, age INT, gender
STRING);
5. Alter a table:
To alter an existing table in Hive, you can use the ALTER TABLE command. For example,
to add a new column email of type STRING to the mytable table, run the following
command:
ALTER TABLE mytable ADD COLUMNS (email STRING);
6. Drop a table:
To drop an existing table in Hive, you can use the DROP TABLE command. For example, to drop the mytable table, run the following command:
DROP TABLE mytable;
7. Create a view:
To create a new view in Hive, you can use the CREATE VIEW command. For example, to create a view named myview that selects the name and age columns from the mytable table, run the following command:
CREATE VIEW myview AS SELECT name, age FROM mytable;
8. Drop a view:
To drop an existing view in Hive, you can use the DROP VIEW command. For example, to drop the myview view, run the following command:
DROP VIEW myview;
9. Create a function:
To create a new function in Hive, you can use the CREATE FUNCTION command. For
example, to create a function named myfunction that returns the length of a string,
run the following command:
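CREATE FUNCTION registers a Java UDF class packaged in a jar; the class and jar names below are purely illustrative placeholders, not taken from the original:
CREATE FUNCTION myfunction AS 'com.example.udf.StringLength' USING JAR 'hdfs:///user/hive/udfs/string-length-udf.jar';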
10. Create an index:
To create a new index in Hive, you can use the CREATE INDEX command. For example, to create an index named myindex on the name column of the mytable table, run the following command:
CREATE INDEX myindex ON TABLE mytable (name) AS 'COMPACT' WITH DEFERRED REBUILD;
11. Drop an index:
To drop an existing index in Hive, you can use the DROP INDEX command. For example, to drop the myindex index, run the following command:
DROP INDEX myindex ON mytable;