
CE 441 Big Data Analytics 18CE004

PRACTICAL-1

Aim: Hadoop configuration and single node cluster setup, and performing file management tasks in
Hadoop (creating a directory, listing the contents of a directory, uploading and downloading files in HDFS).

Theory:
There are two ways to install Hadoop, i.e. Single node and Multi node.
 Single node cluster means only one DataNode running and setting up all the NameNode,
DataNode, ResourceManager and NodeManager on a single machine. This is used for
studying and testing purposes. For example, let us consider a sample data set inside a
healthcare industry. So, for testing whether the Oozie jobs have scheduled all the
processes like collecting, aggregating, storing and processing the data in a proper
sequence, we use a single node cluster. It can easily and efficiently test the sequential
workflow in a smaller environment as compared to large environments which contain
terabytes of data distributed across hundreds of machines.
 While in a Multi node cluster, there are more than one DataNode running and each
DataNode is running on different machines. The multi node cluster is practically used in
organizations for analyzing Big Data. Considering the above example, in real time when
we deal with petabytes of data, it needs to be distributed across hundreds of machines to
be processed. Thus, here we use multi node cluster.

Implementation:
o I will show you how to install Hadoop on a single node cluster.
o Prerequisites
 VIRTUAL BOX: it is used for installing the operating system on it.
 OPERATING SYSTEM: You can install Hadoop on Linux based operating
systems. Ubuntu and CentOS are very commonly used. In this tutorial, we are using
CentOS.
 JAVA: You need to install the Java 8 package on your system.
 HADOOP: You require the Hadoop 2.10.0 package.

o Install Hadoop
o Step 1: Download the Java 8 package and save the file in your home directory.
o Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz

o Step 3: Download the Hadoop 2.10.0 package.


Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.10.0/hadoop-2.10.0.tar.gz

o Step 4: Extract the Hadoop tar file.
Command: tar -xvf hadoop-2.10.0.tar.gz
o Step 5: Add the new group and new user to it.
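For example (the group and user names below are only illustrative; they are not part of the original steps), a dedicated Hadoop group and user can be created on CentOS as follows:
Command: sudo groupadd hadoop
sudo useradd -g hadoop hduser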

o Step 6: Export path and check Hadoop version.


Command: export HADOOP_HOME=/home/mypc/Desktop/Hadoop
export PATH=$PATH:$HADOOP_HOME/bin
hadoop version

o Step 7: Edit the Hadoop Configuration files.


Command: cd Desktop/Hadoop/etc/hadoop
gedit hadoop-env.sh

 Add java path at the last
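For example, the line appended to hadoop-env.sh might look like the following (the exact JDK path depends on where the tar file was extracted in Step 2):
export JAVA_HOME=/home/mypc/jdk1.8.0_101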


o Step 8: Open core-site.xml and edit the property mentioned below inside
configuration tag: core-site.xml informs Hadoop daemon where NameNode runs
in the cluster. It contains configuration settings of Hadoop core such as I/O
settings that are common to HDFS & MapReduce.
Command: sudo gedit $HADOOP_HOME/etc/hadoop/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>Parent directory for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.</description>
</property>

o Step 9: Edit hdfs-site.xml and edit the property mentioned below inside
configuration tag: hdfs-site.xml contains configuration settings of HDFS
daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes
the replication factor and block size of HDFS. Command: sudo gedit
$HADOOP_HOME/etc/hadoop/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser_/hdfs</value>
</property>
</configuration>
o Step 10: Edit the mapred-site.xml file and edit the property mentioned below
inside configuration tag:
 mapred-site.xml contains configuration settings of the MapReduce application, like the
number of JVMs that can run in parallel, the size of the mapper and the reducer
process, CPU cores available for a process, etc.
 In some cases, mapred-site.xml file is not available. So, we have to create the
mapred-site.xml file using mapred-site.xml template.
 Command: cp mapred-site.xml.template mapred-site.xml
 Command: sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:54311</value>
<description>MapReduce job tracker runs at this host and port.
</description>
</property>
</configuration>

o Step 11: Go to Hadoop home directory and format the NameNode.


Command: $HADOOP_HOME/bin/hdfs namenode -format

o Step 12: Once the NameNode is formatted, start all services of Hadoop.


Command: $HADOOP_HOME/sbin/start-dfs.sh
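Once the daemons are running (they can be listed with the jps command), the file management tasks mentioned in the aim can be carried out with commands such as the following (the directory and file names are only examples):
Command: $HADOOP_HOME/bin/hdfs dfs -mkdir /user
$HADOOP_HOME/bin/hdfs dfs -ls /
$HADOOP_HOME/bin/hdfs dfs -put sample.txt /user
$HADOOP_HOME/bin/hdfs dfs -get /user/sample.txt downloaded.txt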

Conclusion: In this practical, I understood the installation of Hadoop by setting up a single node Hadoop cluster.
PRACTICAL-2

Aim: Copy your data into the Hadoop Distributed File System (HDFS)

Implementation:

Steps (example commands are shown after the list):
 Download a text file of words
 Open a terminal shell
 Copy text file from local file system to HDFS
 Copy a file within HDFS
 Copy a file from HDFS to the local file system
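A minimal set of commands covering these steps might look like the following (the file name words.txt and the HDFS paths are only examples):
Command: hdfs dfs -put words.txt /user/hduser/words.txt
hdfs dfs -cp /user/hduser/words.txt /user/hduser/words_copy.txt
hdfs dfs -get /user/hduser/words_copy.txt words_copy.txt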

Conclusion: In this practical, we have learnt to copy a text file of words to HDFS.
PRACTICAL-3

Aim: To understand the overall programming architecture using the MapReduce API. (word count
using MapReduce, or weather report POC: a MapReduce program to analyse time-temperature
statistics and generate a report with max/min temperature)

Implementation:

1. Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish.
2. Right Click > New > Package ( Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - WordCount).
4. Add Following Reference Libraries:
o Right Click on Project > Build Path > Add External Archives:
• /usr/lib/hadoop-0.20/hadoop-core.jar
• /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
5. Type the following code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
o The above program consists of three classes:
 Driver class (the public static void main method; this is the entry point).
 The Map class which extends the public class
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the Map
function.
 The Reduce class which extends the public class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements the
Reduce function.

6. Make a jar file.


• Right Click on Project> Export> Select export destination as
JAR File > next> Finish.
7. Take a text file and move it into HDFS:

8. To move this into Hadoop directly, open the terminal and enter the following
command: hadoop fs -put wordcountFile wordCountFile

9. Run the jar file:


hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile
MRDir1
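After the job finishes, the word counts can be viewed from the output directory (part-r-00000 is the usual name of the reducer output file):
hadoop fs -cat MRDir1/part-r-00000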

Output:

Conclusion: In this practical, I understood the basic concepts of the overall programming
architecture using Map Reduce API.

PRACTICAL-4

Aim: Configure Flume/Spark for streaming data analysis. (Analysis using Apache Spark and fetching live tweets from Twitter) OR (Word count / analysis from localhost data streaming)

Sentiment Analysis using Apache Spark and fetching live tweets from Twitter using flume-ng
 Go to the Twitter developer portal and create a new app.

 Create twitter.conf file inside /usr/lib/apache-flume-1.9.0-bin:


sudo gedit twitter.conf
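A typical twitter.conf for this setup looks roughly like the sketch below (the credential values and the HDFS path are placeholders to be replaced with your own; the agent name TwitterAgent matches the -n option used when running Flume):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100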

 To run flume:
bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Create First WordCount project

 Open Eclipse and do File->New -> project ->Select Maven Project; see below.

 Enter Group id, Artifact id, and click finish.

 Edit pom.xml.

 Write your code or just copy the given WordCount code from
D:\spark\spark-1.6.1-bin-hadoop2.6\examples\src\main\java\org\apache\spark\examples

 Now, add the external jar from the location D:\spark\spark-1.6.1-bin-hadoop2.6\lib and set Java 8 for compilation; see below.

 Build the project (D:\hadoop\spWCexamples): write mvn package on cmd.

 Execute the project: go to the following location on cmd: D:\spark\spark-1.6.1-bin-hadoop2.6\bin

Write the following command


spark-submit --class sparkWCexamples.spWCexamples.WORDCOUNT --master
local /D:/Hadoop/spWCexamples/target/spWCexamples-1.0-SNAPSHOT.jar
D:/Hadoop/spWCexamples/how.txt D:/Hadoop/spWCexamples/answer.txt

 You can also check the progress of the project at http://localhost:4040/jobs/. Finally, get the answers; see below.

Conclusion: From this practical, we studied how to configure Flume for fetching live tweets and how to create a Word Count project in Spark.

PRACTICAL-5

Aim: Implementation of HDFS & MapReduce using Talend.

Implementation:

 Download Talend from below link:


https://www.talend.com/download/thankyou/big-data-windows/

 Add the IP address of the Host from Cloudera. This can be verified using the "Cloudera Manager" in Cloudera.
Location: C:\Windows\System32\drivers\etc\; in that folder, open the "hosts" file.

Note: To open the cloud manager in Cloudera follow some steps:



Click on "Launch Cloudera Express" and you will get a message in the terminal:

 Run the command "sudo /home/cloudera/cloudera-manager --force --express", wait until it finishes, then go to the browser and open Cloudera Manager.
 After Cloudera Manager loads successfully, enter the username and password into the login section and then click on "Hosts" from the menu bar.
 Verify that the IP address in Cloudera Manager matches the one in the hosts file.

 Open “Talend” and start configuration of Hadoop Connection:


o As per the given steps in the document file:
1. Create a new project and give it an appropriate name.
2. Now, inside the Repository tab, expand the Metadata tab, right click
on Hadoop cluster and select create new cluster. Then follow the steps as
per the screenshot:

3. Here, enter the username in the User Name field and click on Check Services to
check whether a successful connection to the NameNode is made or not. Here you
may have to download the required server modules.

Note: Before that you have to start NamenNode Service in Cloudera follow some steps:
Enter below code into terminal:
“for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x start; done”
It takes some time to complete.
After that in browser from Hadoop bookmark click on HDFS NameNode to verify that
NameNode is working.
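Alternatively, a quick check from the terminal is possible with the jps command, which lists the running Hadoop daemons (NameNode should appear in the list):
sudo jps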

 After that, click on Finish, which will create the cluster under Hadoop Cluster.

4. Now we have to create a Job, which is under the Repository tab. Right click
on the Job Design tab, click Create job, give a name to that job and then click
Finish.

5. After that, go to the Palette tab and add "tHDFSConnection" and "tHDFSList" from it.

6. Now right click on "tHDFSConnection", and from the Trigger option select "OnSubjobOk" and connect it with "tHDFSList".

7. Now click on tHDFSConnection_1, go to the Component tab and add the configuration as seen here:

8. Now click on tHDFSList_1, go to the Component tab and add the configuration as seen here:

9. Now click on the browse button; if fetching is successful, you will get this window,
which you can verify against the repository under HDFS in Cloudera:

Now Reading file from HDFS

1. For that, you have to replace tHDFSList with tHDFSInput and add tLogRow.
2. Now connect tHDFSConnection with tHDFSInput using the OnSubjobOk trigger, and
connect tHDFSInput with tLogRow using the Main row.

3. Now configure tHDFSInput and tLogRow as seen in the screenshot and don't
change anything in tHDFSConnection.

4. Now click on Run. But it gives an error. The same kind of error occurred for writing the file.

Conclusion: From this practical, I studied how to implement HDFS and MapReduce operations using Talend.

PRACTICAL-6

Aim: To perform NoSQL database operations using MongoDB and perform CRUD operations on it.

Implementation:

1. db.student.insert({
regNo: "3014",
name: "Test Student",
course: {
courseName: "MCA",
duration: "3 Years" },
address: {
city: "Bangalore",
state: "KA",
country: "India" }
})

2. db.student.find().pretty()

3. db.student.update({"regNo": "3014"},{$set:{"name":"user"}})

4. db.student.remove({"regNo":"3014"})
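To confirm the removal, a count query such as the following (illustrative) should return 0:
db.student.find({"regNo": "3014"}).count()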

Conclusion: In this practical, I learnt the basic CRUD operations in MongoDB.



PRACTICAL-7

Aim: Retrieve various types of documents from MongoDB (file import/export operations)

Implementation:
Importing a JSON file:
 To import a JSON file: mongoimport --db restaurantd --collection restaurantc --file data.json

 To verify whether data has been imported or not:


a. show dbs
b. use restaurantd
c. db.restaurantc.count()

 To display data : db.restaurantc.findOne()



 db.restaurantc.find({name : "Glorious Food"}).forEach(printjson)

Import csv file:

 mongoimport --db OpenFlights --collection Airport --type csv --headerline --ignoreBlanks airports.csv

 To verify whether data has been imported or not:


a. show dbs
b. use OpenFlights
c. show tables

d. db.Airport.count()

 To display particular data : db.Airport.find({iso_region : "US-GA"})
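 To retrieve only selected fields, a projection can be added; for example (the field names here are assumptions based on the CSV columns used later for export):
db.Airport.find({iso_region : "US-GA"}, {name : 1, type : 1}).limit(5)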

Export csv file:

 mongoexport --db=OpenFlights --collection=Airport --type=csv --fields=id,type,name --out=G:\output.csv

Conclusion: In this practical, I learnt how to import data, how to use the imported data and how to export data in MongoDB.

Practical- 8

Aim: Configure Neo4j and create nodes (add/remove properties) & relationships in it.

Implementation:

1. Download Neo4J Community Edition from below link:

https://neo4j.com/download-center/

2. Extract the zip file and keep the extracted folder in the D:\ drive (for example).
3. Open command prompt and navigate to D:\neo4j
4. Run below commands to start neo4j service:

bin\neo4j install-service
bin\neo4j start

5. Open the URL http://localhost:7474/browser/ in your web browser.


6. Initial username: neo4j and password: neo4j.
7. Once submitted, a prompt for changing the password will be shown, wherein you can
change the database password.
8. The homepage will be shown, in which the topmost text box starting with $ is for
writing queries.
9. Create a node "Brad" with label Person as under:
CREATE (n: Person {name:"Brad"}) RETURN n
Similarly, create the rest of the 4 nodes:

CREATE (n: Person {name:"Alice"}) RETURN n
CREATE (n: Person {name:"Mike"}) RETURN n
CREATE (n: Person {name:"Jill"}) RETURN n
CREATE (n: Person {name:"Hazel"}) RETURN n

10. To view all nodes use:
MATCH (n) RETURN n

11. Create a node "Avengers" with label Movie as under:
CREATE (n: Movie {title:"Avengers"}) RETURN n
Similarly, create the rest of the 2 nodes:
CREATE (n: Movie {title:"Skyfall"}) RETURN n
CREATE (n: Movie {title:"Inception"}) RETURN n

12. To view nodes of specific label: MATCH (n: Movie) RETURN n

13. To set other properties:


MATCH (n: Person {name:"Brad"}) SET n.age=34
14. To remove the property:
MATCH (n: Person {name:"Brad"}) REMOVE n.age

15. To add a relationship to nodes:
MATCH (a: Person {name:"Brad"}), (b: Person {name:"Alice"}) MERGE (a)-[r:FRIENDS]->(b)
16. To add properties in relationship:

MATCH (a: Person {name:"Hazel"}), (b: Person {name:"Jill"}) MERGE (a) -[r: FRIENDS
{since:"1998"}]-> (b)

17. Adding a relationship between two different labels:
MATCH (a: Person {name:"Jill"}), (b: Movie {title:"Avengers"}) MERGE (a)-[r:FAVOURITE]->(b)
18. Add some more relationships between nodes.

19. To fetch or query the database (in this case, to fetch the node with name Jill):
MATCH (n: Person) WHERE n.name="Jill" RETURN n

20. To find friends of BRAD:


MATCH (a: Person {name:"Brad"}) - [: FRIENDS] -> (b: Person) RETURN a, b
21. To delete a relationship and a node:
MATCH (a: Person {name:"Jill"})-[r]-(b) DELETE r
MATCH (a: Person {name:"Jill"}) DELETE a


22. To delete the entire database:
MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r


Creating node Relationships in Neo4j

Syntax to create a relationship using the CREATE clause:

CREATE (node1)-[:RelationshipType]->(node2)
CREATE (Dhawan:player {name: "Shikar Dhawan", YOB: 1985, POB: "Delhi"})
CREATE (Ind:Country {name: "India"})
CREATE (Dhawan)-[r:BATSMAN_OF]->(Ind)
RETURN Dhawan, Ind

Syntax:

MATCH (a:LabelOfNode1), (b:LabelOfNode2)
WHERE a.name = "nameofnode1" AND b.name = "nameofnode2"
CREATE (a)-[:Relation]->(b) RETURN a, b

MATCH (a:player), (b:Country) WHERE a.name = "Shikar Dhawan" AND b.name = "India"
CREATE (a)-[r: BATSMAN_OF]->(b)

RETURN a,b

MATCH (a:player), (b:Country) WHERE a.name = "Shikar Dhawan" AND b.name = "India"
CREATE (a)-[r:BATSMAN_OF {Matches:5, Avg:90.75}]->(b)

RETURN a,b
Conclusion: After performing this practical, I understood how to configure Neo4j and create nodes and relationships.

PRACTICAL-9

Aim: Import files into Neo4j and build a complete relational graph from them.

Implementation:

 First, we will load the CSV files into Neo4j.

 Start the database and open the Neo4j browser.

1. To read data from a CSV file:

LOAD CSV FROM 'file:///orders.csv' AS row RETURN row;

2. To count rows from the data:

 LOAD CSV FROM 'file:///products.csv' AS row RETURN count(row);

OR

 LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row RETURN count(row);



3. LOAD CSV WITH HEADERS FROM 'file:///order-details.csv' AS row RETURN count(row);

Here, the data is in string format by default. Convert it to the appropriate format using:


 toInteger(): converts a value to an integer.
 toFloat(): converts a value to a float (in this case, for monetary amounts).
 datetime(): converts a value to a datetime.

4. LOAD CSV FROM 'file:///products.csv' AS row
WITH toInteger(row[0]) AS productId, row[1] AS productName, toFloat(row[2]) AS unitCost
RETURN productId, productName, unitCost

OR
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row WITH toInteger(row.orderID) AS
orderId, datetime(replace(row.orderDate,' ','T')) AS orderDate, row.shipCountry AS country
RETURN orderId, orderDate, country

5. Now create the nodes:

 LOAD CSV FROM 'file:///products.csv' AS row
WITH toInteger(row[0]) AS productId, row[1] AS productName, toFloat(row[2]) AS unitCost
MERGE (p:Product {productId: productId})
SET p.productName = productName, p.unitCost = unitCost
RETURN p LIMIT 20

6. Relationship:
 LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
WITH toInteger(row.orderID) AS orderId, row.shipCountry AS country
MERGE (o:Order {orderId: orderId})
CREATE (a:orderID {id: orderId})
CREATE (b:country {cname: country})
CREATE (a)-[:MADE_IN]->(b)
RETURN a, b

7. To create a unique field:

 CREATE CONSTRAINT ON (c:country) ASSERT c.name IS UNIQUE;

// the above query is used only one time

 LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
// WITH toInteger(row.orderID) AS orderId, row.shipCountry AS country
CREATE (orderId:orderId {id: toInteger(row.orderID)})
MERGE (country:country {name: row.shipCountry})
CREATE (country)-[r:try]->(orderId)
RETURN country, orderId
// CREATE (a:orderID {id: orderId})

Conclusion: In this practical, I learnt how to import files into Neo4j and build a complete relational graph from them.

PRACTICAL-10

Aim: Configure apache spark and perform word count operation on it.

Implementation:

1) Installing Scala
 Download Scala from below link:
https://www.scala-lang.org/download/

 Set environmental variables:


i. User variable:
 Variable: SCALA_HOME;
 Value: C:\Program Files (x86)\scala

ii. System variable:


 Variable: PATH
 Value: C:\Program Files (x86)\scala\bin

2) Installing Java8
 Download Java from below link:
https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html

 Set environmental variables:


iii. User variable:
 Variable: JAVA_HOME;
 Value: C:\Program Files\Java\jre1.8.0_251

iv. System variable:


 Variable: PATH
 Value: C:\Program Files\Java\jre1.8.0_251\bin

3) Install Eclipse
 Download Eclipse from below link:
https://www.eclipse.org/downloads/

 Set environmental variables:


v. User variable:
 Variable: ECLIPSE_HOME;
 Value: C:\eclipse

vi. System variable:


 Variable: PATH
 Value: C:\eclipse\bin

4) Install Spark 1.6.1


 Download Spark from below link:
http://spark.apache.org/downloads.html

 Set environmental variables:


vii. User variable:
 Variable: SPARK_HOME;
 Value: D:\Spark\spark-3.0.1-bin-hadoop2.7

viii. System variable:


 Variable: PATH
 Value: D:\Spark\spark-3.0.1-bin-hadoop2.7\bin

5) Download Windows Utilities


 Download from below link:
https://github.com/steveloughran/winutils/tree/master/hadoop-2.6.0/bin
And paste it in D:\spark\spark-1.6.1-bin-hadoop2.6\bin

6) Execute Spark
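For example, the installation can be verified by opening cmd in the Spark bin folder and starting the interactive shell (if winutils-related errors appear, setting a HADOOP_HOME variable pointing to the folder that contains bin\winutils.exe is a common workaround, not part of the original steps):
spark-shell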

7) Install Maven 3.3


 Download Apache Maven 3.3.9 from the below link:
https://maven.apache.org/download.cgi
and extract it into the D drive, such as D:\apache-maven-3.3.9

 Set environmental variables:


ix. User variable:
 Variable: MAVEN_HOME;
 Value: D:\apache-maven-3.3.9

x. System variable:
 Variable: PATH
 Value: D:\apache-maven-3.3.9\bin

8) Create First WordCount project


 Open Eclipse and do File->New -> project ->Select Maven Project; see below.

 Enter Group id, Artifact id, and click finish.

 Edit pom.xml.
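A minimal pom.xml for this project typically only needs the Spark core dependency; the artifact and version shown below are assumptions matching Spark 1.6.1 built for Scala 2.10:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>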

 Write your code or just copy the given WordCount code from
D:\spark\spark-1.6.1-bin-hadoop2.6\examples\src\main\java\org\apache\spark\examples

 Now, add the external jar from the location D:\spark\spark-1.6.1-bin-hadoop2.6\lib and set Java 8 for compilation; see below.

 Build the project (D:\hadoop\spWCexamples): write mvn package on cmd.

 Execute the project: go to the following location on cmd: D:\spark\spark-1.6.1-bin-hadoop2.6\bin

Write the following command


spark-submit --class sparkWCexamples.spWCexamples.WORDCOUNT --master
local /D:/Hadoop/spWCexamples/target/spWCexamples-1.0-SNAPSHOT.jar
D:/Hadoop/spWCexamples/how.txt D:/Hadoop/spWCexamples/answer.txt

 You can also check the progress of the project at http://localhost:4040/jobs/. Finally, get the answers; see below.

Conclusion: From this practical, we studied how to install Spark and created a Word Count project.
