PRACTICAL-1
Aim: Hadoop Configuration and Single node cluster setup and perform file management task in
Hadoop. (creating a directory, list the content of directory, upload and download file in HDFS)
Theory:
There are two ways to install Hadoop, i.e. Single node and Multi node.
Single node cluster means only one DataNode running and setting up all the NameNode,
DataNode, ResourceManager and NodeManager on a single machine. This is used for
studying and testing purposes. For example, let us consider a sample data set inside a
healthcare industry. So, for testing whether the Oozie jobs have scheduled all the
processes like collecting, aggregating, storing and processing the data in a proper
sequence, we use a single node cluster. It can easily and efficiently test the sequential
workflow in a smaller environment, as compared to large environments that contain
terabytes of data distributed across hundreds of machines.
In a multi node cluster, on the other hand, more than one DataNode is running, and each
DataNode runs on a different machine. The multi node cluster is practically used in
organizations for analyzing Big Data. Considering the above example, in real time when
we deal with petabytes of data, it needs to be distributed across hundreds of machines to
be processed. Thus, here we use multi node cluster.
Implementation:
o I will show you how to install Hadoop on a single node cluster.
o Prerequisites
VIRTUAL BOX: the operating system is installed on it.
OPERATING SYSTEM: You can install Hadoop on Linux based operating
systems. Ubuntu and CentOS are very commonly used. In this tutorial, we are using
CentOS.
JAVA: You need to install the Java 8 package on your system.
HADOOP: You require the Hadoop 2.7.3 package.
o Install Hadoop
o Step 1: Download the Java 8 package and save the file in your home directory.
o Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
o Step 4: Extract the Hadoop tar file. Command: tar -xvf hadoop-2.7.3.tar.gz
o Step 5: Add the new group and new user to it.
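A sketch of what this step might look like on CentOS, assuming the group is named hadoop and the user hduser (both names are only illustrative):
# Create a hadoop group and a user that belongs to it (illustrative names)
sudo groupadd hadoop
sudo useradd -g hadoop hduser
sudo passwd hduser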
o Step 9: Edit hdfs-site.xml and edit the property mentioned below inside
configuration tag: hdfs-site.xml contains configuration settings of HDFS
daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes
the replication factor and block size of HDFS. Command: sudo gedit
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
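For reference, a minimal sketch of the property typically set in hdfs-site.xml for a single node cluster (replication factor of 1, since there is only one DataNode); the exact properties in your setup may differ:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Single node: keep only one copy of each block -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>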
Conclusion: In this practical, I understood the installation of Hadoop by setting up a single node
Hadoop cluster.
PRACTICAL-2
Aim: Copy your data into the Hadoop Distributed File System (HDFS)
Implementation:
Steps:
Download a text file of words
Open a terminal shell
Copy text file from local file system to HDFS
Copy a file within HDFS
Copy a file from HDFS to the local file system
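A sketch of the commands these steps correspond to, assuming a file named words.txt and the /user/cloudera home directory (adjust file names and paths to your setup):
# Copy the text file from the local file system into HDFS
hdfs dfs -copyFromLocal words.txt /user/cloudera/words.txt

# Copy a file within HDFS
hdfs dfs -cp /user/cloudera/words.txt /user/cloudera/words2.txt

# Copy a file from HDFS back to the local file system
hdfs dfs -copyToLocal /user/cloudera/words2.txt words_copy.txt

# List the directory to verify
hdfs dfs -ls /user/cloudera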
Conclusion: In this practical, we learnt how to copy a text file of words into HDFS.
PRACTICAL-3
Aim: To understand the overall programming architecture using the MapReduce API. (word count
using MapReduce, or a weather report POC: a MapReduce program to analyse time-temperature
statistics and generate a report with max/min temperature)
Implementation:
1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish.
2. Right Click > New > Package ( Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - WordCount).
4. Add the following reference libraries:
o Right Click on Project > Build Path > Add External Archives:
• /usr/lib/hadoop-0.20/hadoop-core.jar
• /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
5. Type the following code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Driver: configures and submits the job; args[0] = input path, args[1] = output path
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);

        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1) for every word
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts received for each word and writes (word, total)
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
o The above program consists of three classes:
Driver class (the public static main method; this is the entry point of the job).
The Map class, which extends Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the map function.
The Reduce class, which extends Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the reduce function.
8. To move this into Hadoop directly, open the terminal and enter the following command:
hadoop fs -put wordcountFile wordCountFile
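After exporting the project as a jar, the job would typically be run and its result inspected as follows (the jar name and output directory below are assumptions for illustration):
# Run the word count job: <jar> <main class> <input in HDFS> <output dir>
hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1

# View the reducer output
hadoop fs -cat MRDir1/part-r-00000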
Output:
Conclusion: In this practical, I understood the basic concepts of the overall programming
architecture using Map Reduce API.
PRACTICAL-4
Aim: Configure Flume/Spark for streaming data analysis. (Analysis using Apache Spark and
fetching live tweets from Twitter) OR (Word count/analysis from localhost data streaming)
Sentiment analysis using Apache Spark and fetching live tweets from Twitter using flume-ng
Go to the Twitter developer portal and create a new app.
To run Flume:
bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
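A minimal sketch of what conf/twitter.conf might look like, assuming the Apache Flume TwitterSource, a memory channel and an HDFS sink; the four Twitter credentials come from the app created above, and the HDFS path is only an example:
# Name the source, channel and sink of the agent (agent name must match -n TwitterAgent)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: fill in the credentials of the app created on the developer portal
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>

# HDFS sink: example path, change it to match your cluster
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

# In-memory channel buffering events between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100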
Open Eclipse and go to File -> New -> Project -> select Maven Project; see below.
Edit pom.xml.
Write your code or just copy the given WordCount code from D:\spark\
Conclusion: From this practical, we studied how to install Spark and created a Word
Count project.
PRACTICAL-5
Aim: To implement HDFS and MapReduce operations using Talend.
Implementation:
Add the IP address of the host from Cloudera. This can be verified using Cloudera Manager in
Cloudera.
Location: C:\Windows\System32\drivers\etc\; in that folder, open the "hosts" file.
Click on "Launch Cloudera Express" and you will get a message in the terminal:
3. Here, enter the username in the User Name tab and click on Check Services to verify
whether the connection to the NameNode is successful. Here you may have to
download the required server modules.
Note: Before that, you have to start the NameNode service in Cloudera; follow these steps:
Enter the command below in the terminal:
“for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x start; done”
It takes some time to complete.
After that, in the browser, click on HDFS NameNode from the Hadoop bookmark to verify that
the NameNode is working.
4. Now we have to create a job, which is under the Repository tab. Right click
on the Job Designs tab, click to create a job, and give the job a name. Then click
Finish.
9. Now click on the browse button; if the fetching is successful, you will
get this window, which you can verify against the repository under HDFS in
Cloudera:
1. For that, you have to replace tHDFSList with tHDFSInput and tLogRow.
2. Now connect tHDFSConnection with tHDFSInput using the OnSubjobOk trigger, and
connect tHDFSInput with tLogRow using a Main row.
3. Now configure tHDFSInput and tLogRow as shown in the screenshot, and don't
change anything in tHDFSConnection.
4. Now click on Run. However, it gives an error; the same kind of error occurred
when writing a file.
Conclusion: From this practical, I studied how to implement HDFS and MapReduce using
Talend.
PRACTICAL-6
Aim: To perform NoSQL database operations (CRUD) using MongoDB.
Implementation:
1. db.student.insert({
regNo: "3014",
name: "Test Student",
course: {
courseName: "MCA",
duration: "3 Years" },
address: {
city: "Bangalore",
state: "KA",
country: "India" }
})
2. db.student.find().pretty()
3. db.student.update({"regNo": "3014"},{$set:{"name":"user"}})
4. db.student.remove({"regNo":"3014"})
PRACTICAL-7
Aim: Retrieve various types of documents from MongoDB (file import/export operations)
Implementation:
Importing a JSON file:
To import a JSON file:
mongoimport --db restaurantd --collection restaurantc --file data.json
d. db.Airport.count()
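The aim also mentions export; a minimal sketch using mongoexport with the same database and collection as above (the output file name is only an example):
mongoexport --db restaurantd --collection restaurantc --out exported_data.json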
Conclusion
In this practical, I learnt how to import data, how to use the imported data, and how to export data
in MongoDB.
PRACTICAL-8
Aim: Configure Neo4j and create nodes (add/remove properties) and relationships in it.
Implementation:
1. Download Neo4j from the below link:
https://neo4j.com/download-center/
2. Extract the zip file and keep the extracted folder in the D:\ drive (for example).
3. Open command prompt and navigate to D:\neo4j
4. Run the command below to start the Neo4j service:
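For example, assuming the zip distribution extracted to D:\neo4j, the server can be started in console mode (the exact command may vary slightly by Neo4j version):
cd D:\neo4j\bin
neo4j.bat console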
CREATE (n:Person {name:"Hazel"}) RETURN n
10. To view all nodes use:
MATCH (n) RETURN n
MATCH (a: Person {name:"Hazel"}), (b: Person {name:"Jill"}) MERGE (a) -[r: FRIENDS
{since:"1998"}]-> (b)
19. To fetch or query the database (in this case, to fetch the node with name Jill):
MATCH (n:Person) WHERE n.name="Jill" RETURN n
CREATE (node1)-[:RelationshipType]->(node2)
CREATE (Dhawan:player{name: "Shikar Dhawan", YOB: 1985, POB: "Delhi"}) CREATE
(Ind:Country {name: "India"})
CREATE (Dhawan)-[r:BATSMAN_OF]->(Ind)
RETURN Dhawan, Ind
Syntax:
MATCH (a:player), (b:Country) WHERE a.name = "Shikar Dhawan" AND b.name = "India"
CREATE (a)-[r: BATSMAN_OF]->(b)
RETURN a,b
MATCH (a:player), (b:Country) WHERE a.name = "Shikar Dhawan" AND b.name = "India"
CREATE (a)-[r:BATSMAN_OF {Matches:5, Avg:90.75}]->(b)
RETURN a,b
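The aim also asks for adding and removing a property; a small sketch using Cypher SET and REMOVE on the Person node created earlier (the property name and value are only illustrative):
// Add a property to an existing node
MATCH (n:Person {name:"Hazel"}) SET n.city = "Anand" RETURN n

// Remove the property again
MATCH (n:Person {name:"Hazel"}) REMOVE n.city RETURN n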
Conclusion: After performing this practical, I understood how to configure Neo4j and create nodes.
PRACTICAL-9
Aim: Import files into Neo4j and build a complete relational graph from them.
Implementation:
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
WITH toInteger(row.orderID) AS orderId, datetime(replace(row.orderDate,' ','T')) AS orderDate,
row.shipCountry AS country
RETURN orderId, orderDate, country
6. Relationship
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
WITH toInteger(row.orderID) AS orderId, row.shipCountry AS country
MERGE (o:Order {orderId: orderId})
CREATE (a:orderID {id: orderId})
CREATE (b:country {cname: country})
CREATE (a)-[:MADE_IN]->(b)
RETURN a, b
Conclusion: In this practical, I learnt how to import files into Neo4j and build a complete relational
graph from them.
PRACTICAL-10
Aim: Configure Apache Spark and perform a word count operation on it.
Implementation:
1) Installing Scala
Download Scala from below link:
https://www.scala-lang.org/download/
2) Installing Java8
Download Java from below link:
https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
3) Install Eclipse
Download Eclipse from below link:
https://www.eclipse.org/downloads/
6) Execute Spark
x. System variable:
Variable: PATH
Value: D:\apache-maven-3.3.9\bin
Edit pom.xml.
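A sketch of the dependency typically added to pom.xml for a Spark word count project (the group and artifact are the standard Spark coordinates; the Scala suffix and Spark version are assumptions, match them to the versions installed above):
<dependencies>
    <!-- Spark core API (version is illustrative) -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
</dependencies>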
Write your code or just copy the given WordCount code from D:\spark\
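If you write the code yourself, a minimal Java sketch of the word count (assuming the Spark 2.x Java API and a local master; the input file and output directory are passed as arguments):
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local Spark context; args[0] = input text file, args[1] = output directory
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);

        // Split each line into words, pair each word with 1, and sum the counts per word
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile(args[1]);
        sc.close();
    }
}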
Conclusion: From this practical, we studied how to install Spark and created a Word
Count project.