
FACULTY OF TECHNOLOGY AND ENGINEERING

DEVANG PATEL INSTITUTE OF ADVANCE TECHNOLOGY AND RESEARCH
DEPARTMENT OF COMPUTER ENGINEERING

A.Y. 2023-24 [ODD]

LAB MANUAL

CE449: BIG DATA ANALYTICS


Semester: 7th Academic Year: 2023-24
Subject Code: CE449 Subject Name: Big Data Analytics
Student Id: 20DCE017 Student Name: Raj Chauhan

PRACTICAL INDEX
Columns: Sr. No. | Aim | Assigned Date | Completion Date | Grade | Assessment Date | Signature

1. To install the Hadoop framework, configure it and set up a single-node cluster. Use web-based tools to monitor your Hadoop setup.
2. To implement file management tasks in Hadoop HDFS and perform Hadoop commands.
3. To implement basic functions and commands in R programming. To build a WordCloud, a text mining method, using R for easier understanding and better visualization than a data table.
4. To implement a word count application using the MapReduce programming model. To implement a program that counts the occurrences of words based on their length.
5. A. To design and implement MapReduce algorithms to take a very large file of integers and produce as output: a) the largest integer, b) the average of all the integers, c) the same set of integers but with each integer appearing only once, d) the count of the number of distinct integers in the input. B. To design an application to find mutual friends using MapReduce.
6. To implement basic CRUD operations (create, read, update, delete) in MongoDB and Cassandra.
7. To develop a MapReduce application and implement a program that analyzes weather data.
8. To install and run Hive. Use Hive to create, alter, and drop databases, tables, views, functions, and indexes. To create HDFS tables, load them in Hive, and implement joining of tables in Hive.
9. To install and run Pig and then write Pig Latin scripts to sort, group, join, project, and filter your data.
10. To install, deploy and configure an Apache Spark cluster. To select fields from the dataset using Spark SQL. To explore the Spark shell and read from HDFS.
11. To perform sentiment analysis using Twitter data, Scala and Spark.
12. To perform graph path and connectivity analytics and implement basic queries after loading data using Neo4j.
13. To perform a case study of the following platforms for solving any big data analytics problem of your choice: (1) Amazon Web Services, (2) Microsoft Azure, (3) Google App Engine.


PRACTICAL 1
Aim: To install the Hadoop framework, configure it, and set up a single-node cluster. Use web-based tools to monitor your Hadoop setup.
Practical:
THEORY:
Hadoop:

 The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
 It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
 First, install Docker on your Windows machine from the official website.
 Alongside this, also download WSL 2 with Ubuntu from the Microsoft Store.



 After downloading all the above-mentioned requirements, run the installer and install Docker on your Windows machine. Then download the Hadoop setup from GitHub using the link given here: https://github.com/big-data-europe/docker-hadoop
 Once all the downloads are complete, simply run this command:
o docker-compose up -d



 This will automatically download all the required images and create the Docker containers.

 Then click on the NameNode CLI.

 Now open a browser and go to localhost:9870 to view the NameNode web UI.
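As a quick check beyond the browser, the same NameNode web UI also exposes cluster metrics over its JMX endpoint, which can be queried programmatically. The snippet below is a minimal sketch; it assumes the web UI is reachable on localhost:9870 and that the Python requests package is installed.

import requests

# Query the NameNode's JMX endpoint (served by the same web UI on port 9870)
url = "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"

resp = requests.get(url, timeout=10)
resp.raise_for_status()

# The response is JSON; the first bean holds basic NameNode/cluster information
info = resp.json()["beans"][0]
print("Cluster ID :", info.get("ClusterId"))
print("Total space:", info.get("Total"))
print("Live nodes :", info.get("LiveNodes"))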



CONCLUSION:
In this practical, we learned how to install Hadoop and run a single-node cluster using Docker, and monitored the setup through the NameNode web UI.



PRACTICAL 2
Aim: To implement file management tasks in Hadoop HDFS and perform
Hadoop commands.
Practical:
There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated here, but these basic operations will get you started. Running ./bin/hadoop fs with no additional arguments will list all the commands supported by the FsShell system. Furthermore, $HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage summary for the operation in question if you are stuck.
 Start Docker and run the Hadoop containers in it.
 Open the NameNode CLI from Docker and navigate to localhost:9870.

 The CLI shown here is used to run commands on the NameNode.

Commands :-
1.) ls :- Lists all the files and directories present in the Hadoop file system, e.g. hdfs dfs -ls /

2.) mkdir :- Creates a new directory in HDFS, e.g. hdfs dfs -mkdir /user/data

3.) touchz :- Creates an empty (zero-length) file, e.g. hdfs dfs -touchz /user/data/sample.txt

4.) hadoop version :- Prints the version of Hadoop that is installed.

5.) find :- Searches for files that match a given expression, e.g. hdfs dfs -find / -name sample.txt

6.) copyToLocal :- Copies a file from HDFS to the local file system, e.g. hdfs dfs -copyToLocal /user/data/sample.txt /tmp

7.) rmdir :- Removes an empty directory from HDFS, e.g. hdfs dfs -rmdir /user/data
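Besides the shell, the same file-management tasks can be scripted against the NameNode's WebHDFS REST interface. The snippet below is a minimal sketch; it assumes the NameNode web UI is reachable on localhost:9870, that the Python requests package is installed, and it uses a hypothetical /user/test path and user name.

import requests

BASE = "http://localhost:9870/webhdfs/v1"   # WebHDFS endpoint of the NameNode (assumed)
USER = "root"                               # HDFS user to act as (assumption)

# Create a directory (similar to: hdfs dfs -mkdir /user/test)
requests.put(f"{BASE}/user/test", params={"op": "MKDIRS", "user.name": USER})

# List a directory (similar to: hdfs dfs -ls /user)
listing = requests.get(f"{BASE}/user", params={"op": "LISTSTATUS", "user.name": USER}).json()
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])

# Delete the empty directory again (similar to: hdfs dfs -rmdir /user/test)
requests.delete(f"{BASE}/user/test", params={"op": "DELETE", "user.name": USER})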

CONCLUSION:
In this practical, we performed various basic HDFS commands to create, remove, and copy files in the Hadoop file system.



PRACTICAL 3
AIM: To implement basic functions and commands in R programming. To build a WordCloud, a text mining method, using R for an easier-to-understand and better visualization than a data table.
CODE:
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("RColorBrewer")
library("tm")
library("wordcloud")
library("RColorBrewer")

# To choose the text file
text = readLines(file.choose())

# VectorSource() creates a corpus of character vectors
docs = Corpus(VectorSource(text))

# Text transformation
toSpace = content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "@")
docs1 = tm_map(docs1, toSpace, "#")
strwrap(docs1)

# Cleaning the text
docs1 = tm_map(docs1, content_transformer(tolower))
docs1 = tm_map(docs1, removeNumbers)
docs1 = tm_map(docs1, stripWhitespace)

# Build a term-document matrix
dtm = TermDocumentMatrix(docs1)
m = as.matrix(dtm)
v = sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
head(d, 10)

# Generate the word cloud
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

OUTPUT :-

CONCLUSION:

In this practical, we learnt the basics of R and implemented a WordCloud using R.



PRACTICAL 4
AIM: To Implement a Word Count Application using MapReduce API.
THEORY:

o MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

o The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

o MapReduce programming offers several benefits that help you gain valuable insights from your big data, most notably scalability: businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).
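Before looking at the Hadoop implementation below, the following plain-Python sketch (no Hadoop involved) illustrates the same map, shuffle, and reduce steps for word counting; the sample lines are made up for the example.

from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every token in the line
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # reduce: sum all the counts emitted for the same word
    return word, sum(counts)

lines = ["big data big insights", "map reduce map"]

# shuffle/sort: group the mapped (word, 1) pairs by word
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

for word in sorted(grouped):
    print(reduce_phase(word, grouped[word]))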

CODE:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT:
o Create a directory on the local system and create two directories inside it called Classes and Input. We also create an input.txt file in the Input directory containing the random words we want to count.

o Now we compile the Java code using the OpenJDK we installed previously and create a JAR file from it.

o Now we run the JAR on Hadoop using the hadoop jar command (hadoop jar <jar file> WordCount <input dir> <output dir>), and then we can see the final output.



CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and also executed a simple word count program in MapReduce using the Java language.



PRACTICAL 5
AIM: A. To design and implement MapReduce algorithms to take a very large file of integers and produce as output: a) the largest integer, b) the average of all the integers, c) the same set of integers but with each integer appearing only once, d) the count of the number of distinct integers in the input. B. To design an application to find mutual friends using MapReduce.
CODE:
//Reducer.java (driver class)
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Reducer {
    public static void main(String[] args)
            throws IllegalStateException, IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Practical 5");
        job.setJarByClass(Reducer.class);
        job.setMapperClass(BDA_Mapper.class);
        job.setReducerClass(BDA_Reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

//BDA_Mapper.java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BDA_Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // In-mapper combining: count how many times each integer is seen by this mapper
    private TreeMap<String, Integer> tmap;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap = new TreeMap<String, Integer>();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String number = value.toString().trim();
        if (tmap.containsKey(number)) {
            tmap.put(number, tmap.get(number) + 1);
        } else {
            tmap.put(number, 1);
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Emit (integer, local count) pairs once this mapper has processed all of its input
        for (Map.Entry<String, Integer> entry : tmap.entrySet()) {
            context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
        }
    }
}

//BDA_Reducer.java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BDA_Reducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private TreeMap<String, Long> tmap2;
    private int max = Integer.MIN_VALUE, unique = 0, cnt = 0;
    private long sum = 0;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap2 = new TreeMap<String, Long>();
    }

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String number = key.toString();
        long count = 0;
        for (LongWritable val : values) {
            count += val.get();
            sum += ((int) val.get()) * Integer.parseInt(number.trim());
        }
        tmap2.put(number, count);   // each integer stored once -> the distinct set
        cnt += count;               // total number of integers in the input
        if (max < Integer.parseInt(number.trim())) {
            max = Integer.parseInt(number.trim());
        }
        unique++;                   // one reduce call per distinct integer
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Long> entry : tmap2.entrySet()) {
            context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
        }
        context.write(new Text("MAX NUMBER = "), new LongWritable(max));
        context.write(new Text("AVERAGE = "), new LongWritable(sum / cnt));
        context.write(new Text("Total Unique Numbers = "), new LongWritable(unique));
    }
}
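The Java code above covers part A only. As a purely illustrative, plain-Python sketch of part B (finding mutual friends), the usual MapReduce formulation is mimicked here: the map step emits each sorted friend pair as a key together with the owner's friend list, and the reduce step intersects the grouped lists. The friends dictionary is made up for the example.

from collections import defaultdict

# Hypothetical adjacency list: person -> set of friends
friends = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"B"},
}

# Map phase: for every friendship edge, emit (sorted pair, owner's friend list)
mapped = []
for person, flist in friends.items():
    for friend in flist:
        pair = tuple(sorted((person, friend)))
        mapped.append((pair, flist))

# Shuffle: group the emitted friend lists by pair key
grouped = defaultdict(list)
for pair, flist in mapped:
    grouped[pair].append(flist)

# Reduce phase: intersect the two friend lists of each pair to get mutual friends
for pair, lists in sorted(grouped.items()):
    if len(lists) == 2:
        mutual = lists[0] & lists[1]
        print(pair, sorted(mutual))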

OUTPUT:
 In Eclipse, right-click MyProject, select Build Path -> Configure Build Path, click Add External JARs..., add the Hadoop JARs from their download location, and then click Apply and Close.
 Now export the project as a JAR file: right-click MyProject, choose Export..., go to Java -> JAR file, click Next, choose your export destination, click Next, choose the Main Class by clicking Browse, and then click Finish -> OK.

CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and also executed simple operations on a large file of integers using MapReduce in Java.
PRACTICAL 6
AIM: To implement basic CRUD operations (create, read, update, delete) in
MongoDB and Cassandra.
CODE:
 sudo docker network create cassandra-network
 sudo docker network ls
 sudo docker run -p 9042:9042 --rm -it -d -e CASSANDRA_PASSWORD=temp --network cassandra-network cassandra
 sudo docker ps
 sudo docker exec -it 05f99656aef3 bash
 cqlsh -u cassandra -p temp
 CREATE KEYSPACE IF NOT EXISTS charusat_db WITH REPLICATION = {'class':'NetworkTopologyStrategy','datacenter1':3};
 describe charusat_db;
 use charusat_db;
 CREATE TABLE depstar(id int PRIMARY KEY, firstname text, lastname text, email text);
 select * from depstar;
 INSERT INTO depstar(id,firstname,lastname,email) VALUES(1,'abc','xyz','19dce000@charusat.edu.in');
 INSERT INTO depstar(id,firstname,lastname,email) VALUES(2,'def','xyz','19dce999@charusat.edu.in');
 update depstar set firstname='temp' where id=1;
 delete from depstar where id=1;

MongoDB

 sudo docker run -p 27017:27017 -d -it --network cassandra-network --rm -e MONGO_INITDB_ROOT_USERNAME=root -e MONGO_INITDB_ROOT_PASSWORD=temp mongo:4.4.6
 sudo docker ps
 sudo docker exec -it 2e36ee901ed7 bash
 mongo -u root -p temp
 use depstar;
 db;
 db.newCollection.insertOne({_id:1,firstname:"abc",lastname:"xyz",email:"19dce000@charusat.edu.in"});
 db.newCollection.insertOne({_id:2,firstname:"def",lastname:"xyz",email:"19dce999@charusat.edu.in"});
 db.newCollection.find({})
 db.newCollection.updateOne({_id:1},{$set:{"firstname":"temp"}});
 db.newCollection.deleteOne({_id:1})
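The same CRUD operations can also be issued from a client program. Below is a minimal sketch using the pymongo and cassandra-driver Python packages; it assumes the two containers above are reachable on localhost with the credentials used in the commands (root/temp for MongoDB, cassandra/temp for Cassandra) and that the keyspace and table already exist.

from pymongo import MongoClient
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# ---- MongoDB CRUD (assumes the mongo container is published on localhost:27017) ----
mongo = MongoClient("mongodb://root:temp@localhost:27017/")
col = mongo["depstar"]["newCollection"]
col.insert_one({"_id": 1, "firstname": "abc", "lastname": "xyz"})     # create
print(col.find_one({"_id": 1}))                                       # read
col.update_one({"_id": 1}, {"$set": {"firstname": "temp"}})           # update
col.delete_one({"_id": 1})                                            # delete

# ---- Cassandra CRUD (assumes the cassandra container is published on localhost:9042) ----
auth = PlainTextAuthProvider(username="cassandra", password="temp")
session = Cluster(["127.0.0.1"], port=9042, auth_provider=auth).connect("charusat_db")
session.execute("INSERT INTO depstar (id, firstname, lastname, email) "
                "VALUES (1, 'abc', 'xyz', '19dce000@charusat.edu.in')")       # create
print(session.execute("SELECT * FROM depstar WHERE id = 1").one())            # read
session.execute("UPDATE depstar SET firstname = 'temp' WHERE id = 1")         # update
session.execute("DELETE FROM depstar WHERE id = 1")                           # delete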
OUTPUT:

CONCLUSION:
In this practical, we learnt about the CRUD operations in MongoDB and Cassandra.



PRACTICAL 7
AIM: To develop a MapReduce application and implement a program that
analyzes weather data.
CODE:
// importing libraries
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyMaxMin {

    // Mapper: parses one weather record per line and flags hot/cold days
    public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {
        public static final int MISSING = 9999;

        @Override
        public void map(LongWritable arg0, Text Value, Context context)
                throws IOException, InterruptedException {
            String line = Value.toString();
            if (!(line.length() == 0)) {
                String date = line.substring(6, 14);
                float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
                float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
                if (temp_Max > 30.0) {
                    context.write(new Text("The Day is Hot Day :" + date),
                                  new Text(String.valueOf(temp_Max)));
                }
                if (temp_Min < 15) {
                    context.write(new Text("The Day is Cold Day :" + date),
                                  new Text(String.valueOf(temp_Min)));
                }
            }
        }
    }

    // Reducer: passes the reported temperature through for each flagged day
    public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text Key, Iterator<Text> Values, Context context)
                throws IOException, InterruptedException {
            String temperature = Values.next().toString();
            context.write(Key, new Text(temperature));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "weather example");
        job.setJarByClass(MyMaxMin.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path OutputPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        OutputPath.getFileSystem(conf).delete(OutputPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

OUTPUT:
 In Eclipse, right-click MyProject, select Build Path -> Configure Build Path, click Add External JARs..., add the Hadoop JARs from their download location, and then click Apply and Close.
 Now export the project as a JAR file: right-click MyProject, choose Export..., go to Java -> JAR file, click Next, choose your export destination, click Next, choose MyMaxMin as the Main Class by clicking Browse, and then click Finish -> OK.

CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and implemented a program that analyzes weather data using MapReduce in Java.



PRACTICAL 8
AIM: To Install and Run Hive. Use Hive to create, alter, and drop databases,
tables, views, functions, and indexes. To create HDFS tables and load them
in Hive and implement joining of tables in Hive.
CODE:
//hive-site.xml
<property>
<name>system:java.io.tmpdir</name>
<value>/tmp/hive/java</value>
</property>
<property>
<name>system:user.name</name>
<value>${user.name}</value>
</property>
 Hive database
1) CREATE DATABASE IF NOT EXISTS depstar;
2) ALTER DATABASE depstar SET OWNER USER hadoopuser;
3) DROP DATABASE IF EXISTS depstar;
 Hive view
1) CREATE VIEW std_id_3 AS SELECT * FROM students WHERE id=3;
2) ALTER VIEW std_id_3 AS SELECT * FROM students WHERE id>1;
3) DROP VIEW std_id_3;
 Hive Table
1) CREATE TABLE IF NOT EXISTS students (id int, firstname String, lastname String, email String) COMMENT 'Student details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
2) ALTER TABLE students RENAME TO std;
3) DROP TABLE IF EXISTS std;
 Join on tables
1) CREATE TABLE IF NOT EXISTS students (id int, firstname String, lastname String, email String) COMMENT 'Student details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
2) CREATE TABLE IF NOT EXISTS dept (id int, dept String) COMMENT 'Student dept details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
3) LOAD DATA INPATH '/hive_data/hive_1.txt' INTO TABLE students;
   LOAD DATA INPATH '/hive_data/hive_2.txt' INTO TABLE dept;
4) SELECT s.id, s.firstname, s.lastname, d.dept, s.email FROM students s JOIN dept d ON s.id = d.id;
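These statements are normally run from the Hive CLI or Beeline. As an optional illustration, they can also be submitted to HiveServer2 from Python using the PyHive package; the host, port, user, and database below are assumptions for a local setup.

from pyhive import hive

# Connect to HiveServer2 (host/port/username/database are assumptions for a local setup)
conn = hive.Connection(host="localhost", port=10000, username="hadoopuser", database="depstar")
cur = conn.cursor()

# Run the join query from step 4 and print the rows it returns
cur.execute("SELECT s.id, s.firstname, s.lastname, d.dept, s.email "
            "FROM students s JOIN dept d ON s.id = d.id")
for row in cur.fetchall():
    print(row)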

OUTPUT:



CONCLUSION:
In this practical, we learnt about Hive and used it to create, alter, and drop databases, tables, and views, and implemented joining of tables in Hive.



PRACTICAL 9
AIM: To install and run Pig and then write Pig Latin scripts to sort, group,
join, project, and filter your data.
CODE:
 hdfs dfs -put dept.txt /hadoopuser/dept.txt
 hdfs dfs -ls /hadoopuser
 hdfs dfs -cat /hadoopuser/dept.txt
1) Sort of data
 student_details = LOAD 'hdfs://localhost:9000/hadoopuser/tp.txt' USING
PigStorage(',') as
(id:int,firstname:chararray,lastname:chararray,age:int,email:chararray);
 student_grp_details = LOAD 'hdfs://localhost:9000/hadoopuser/dept.txt' USING
PigStorage(',') as (id:int,dept:chararray);
 order_by = ORDER student_details BY age ASC;
 dump order_by
2) Group of data
 student_details_with_grp = LOAD 'hdfs://localhost:9000/hadoopuser/temp_grp1.txt'
USING PigStorage(',') as
(id:int,firstname:chararray,lastname:chararray,age:int,dept:chararray,email:chararray);
 group_by = GROUP student_details_with_grp BY dept;
 dump group_by
3) Join of data
 join_data = JOIN student_details BY id,student_grp_details BY id;
 dump join_data;
4) Filter data
 filter_table = FILTER student_details_with_grp BY dept=='ce';
 dump filter_table;
5) Project data
 data = FOREACH student_details GENERATE firstname,lastname;
 dump data;
OUTPUT:



1) Sort of data

2) Group of data

3) Join of Data

4) Filter data



5) Project data

CONCLUSION:
In this practical, we learnt about the Pig Latin scripts to sort, group, join, project, and filter
data.



PRACTICAL 10
AIM: To install, deploy and configure an Apache Spark cluster. To select the fields from the dataset using Spark SQL. To explore the Spark shell and read from HDFS.
CODE:
var data = spark.read.format("csv").option("header", "true").load("/home/hadoopuser/Chicago.csv");
var df1 = data.select("ID", "Case Number", "Description").show();
var ds = spark.read.text("hdfs://localhost:9000/hadoopuser/temp.txt");
ds.count
ds.show();
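If the Python shell (pyspark) is preferred, the same steps can be written with the DataFrame API and Spark SQL. The following is a minimal sketch, reusing the same CSV path and HDFS URI as above (both are assumptions about the local setup).

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; in the pyspark shell this object already exists as `spark`
spark = SparkSession.builder.appName("practical10").getOrCreate()

# Read the CSV with a header row and select a few fields using the DataFrame API
data = spark.read.format("csv").option("header", "true").load("/home/hadoopuser/Chicago.csv")
data.select("ID", "Case Number", "Description").show()

# Equivalent selection through Spark SQL
data.createOrReplaceTempView("chicago")
spark.sql("SELECT ID, `Case Number`, Description FROM chicago").show()

# Read a plain-text file from HDFS
ds = spark.read.text("hdfs://localhost:9000/hadoopuser/temp.txt")
print(ds.count())
ds.show()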

OUTPUT:



CONCLUSION:
In this practical, we learnt about the Apache Spark cluster in detail and explored the Spark shell.



PRACTICAL 11

AIM: To perform Sentiment Analysis using Twitter data, Scala and Spark.

CODE:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

tweet = 'Great content! subscribed'

# preprocess tweet: mask user mentions and links as the model expects
tweet_words = []
for word in tweet.split(' '):
    if word.startswith('@') and len(word) > 1:
        word = '@user'
    elif word.startswith('http'):
        word = "http"
    tweet_words.append(word)

tweet_proc = " ".join(tweet_words)

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

# sentiment analysis
encoded_tweet = tokenizer(tweet_proc, return_tensors='pt')
# output = model(encoded_tweet['input_ids'], encoded_tweet['attention_mask'])
output = model(**encoded_tweet)

scores = output[0][0].detach().numpy()
scores = softmax(scores)

for i in range(len(scores)):
    l = labels[i]
    s = scores[i]
    print(l, s)

OUTPUT:

CONCLUSION: From this practical, I learned how to perform sentiment analysis on Twitter data using a pre-trained RoBERTa sentiment model.



PRACTICAL 12

AIM: To perform Graph Path and Connectivity analytics and implement basic queries after loading
data using Neo4j

CODE:

1. Install Neo4j.

2. Create a project.

3. Create data using Cypher statements.

4. Querying for nodes:

 All nodes
 All nodes with a specific label
 All nodes with priorities
 Nodes where the name is LeBron James
 Nodes where the name is not LeBron James

5. Querying relationships:

 Get all Lakers players
 Get players and the number of games played

6. Graph path analysis:

 The shortest path between two players

7. Connectivity analysis:

 Identify the players with the most teammates (representative Cypher queries for these steps are sketched below)
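Since only the result screenshots accompany the steps above, the sketch below shows what the corresponding Cypher queries might look like when issued through the official neo4j Python driver. The node label (Player), relationship type (PLAYS_FOR), team and player names, and the connection credentials are all assumptions made purely for illustration.

from neo4j import GraphDatabase

# Connection details are assumptions for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

queries = {
    "create sample data": "CREATE (:Player {name: 'LeBron James'})-[:PLAYS_FOR]->(:Team {name: 'LA Lakers'})",
    "all nodes": "MATCH (n) RETURN n LIMIT 25",
    "nodes with a label": "MATCH (p:Player) RETURN p.name",
    "name is LeBron James": "MATCH (p:Player {name: 'LeBron James'}) RETURN p",
    "name is not LeBron James": "MATCH (p:Player) WHERE p.name <> 'LeBron James' RETURN p.name",
    "all Lakers players": "MATCH (p:Player)-[:PLAYS_FOR]->(:Team {name: 'LA Lakers'}) RETURN p.name",
    "shortest path": ("MATCH (a:Player {name: 'LeBron James'}), (b:Player {name: 'Stephen Curry'}), "
                      "p = shortestPath((a)-[*]-(b)) RETURN p"),
    "most teammates": ("MATCH (p:Player)-[:PLAYS_FOR]->(:Team)<-[:PLAYS_FOR]-(mate:Player) "
                       "RETURN p.name, count(DISTINCT mate) AS teammates "
                       "ORDER BY teammates DESC LIMIT 5"),
}

with driver.session() as session:
    for name, cypher in queries.items():
        print("--", name)
        for record in session.run(cypher):
            print(record)

driver.close()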



CONCLUSION: From this practical, I learned how to perform graph path and connectivity analytics and implement basic queries after loading data using Neo4j.



PRACTICAL 13

AIM: To perform a case study of the following platforms for solving any big data analytics problem of your choice.
(1) Amazon Web Services,
(2) Microsoft Azure,
(3) Google App Engine

Problem Statement:
Imagine a global e-commerce company, "EcomCorp," wants to optimize its product recommendation
engine to increase customer engagement and sales. They have terabytes of customer data, including
purchase history, browsing behavior, and demographics. EcomCorp aims to leverage big data
analytics to provide personalized product recommendations to its customers in real-time.

To address this challenge, EcomCorp will evaluate three major cloud platforms for their big data
analytics capabilities: Amazon Web Services (AWS), Microsoft Azure, and Google App Engine. We
will analyze each platform's strengths and weaknesses in the context of this use case.

Amazon Web Services (AWS):

AWS offers a comprehensive suite of services for big data analytics. EcomCorp can use the
following AWS services to address their problem:

a. Amazon S3 (Simple Storage Service): Store large volumes of customer data in S3 buckets,
making it highly durable and scalable.

b. Amazon EMR (Elastic MapReduce): Deploy EMR clusters to process and analyze data using
popular tools like Apache Spark, Hadoop, and Presto. This allows EcomCorp to extract valuable
insights from their data.

c. Amazon Redshift: Utilize Redshift as a data warehousing solution to store and query large
datasets for analytics and reporting.

d. AWS Lambda: Trigger real-time recommendations based on customer actions, such as clicks or
purchases, using Lambda functions.

e. Amazon SageMaker: Implement machine learning models to improve recommendation accuracy, leveraging SageMaker's built-in algorithms and model training capabilities.

Strengths:
Broad range of services specifically designed for big data analytics.
Scalability to handle massive datasets.
Integration with machine learning and AI tools.
Strong security and compliance features.

Weaknesses:
Learning curve for managing and optimizing services.
Costs can increase as data volume and processing requirements grow.



Microsoft Azure:
Azure provides a robust set of services for big data analytics, including:

a. Azure Data Lake Storage: Store and manage large datasets in Azure Data Lake Storage
Gen2, ensuring high performance and security.

b. Azure HDInsight: Deploy HDInsight clusters with Hadoop, Spark, and other big data
frameworks to process and analyze data.

c. Azure SQL Data Warehouse: Use Azure SQL Data Warehouse for data warehousing
and complex queries.

Strengths:
Integration with other Microsoft products and services.
Seamless scaling capabilities.
Azure Databricks for advanced analytics and AI workloads.
Strong support for hybrid cloud solutions.

Weaknesses:
Cost management can be challenging.
Learning curve for Azure-specific tools and services.

Google App Engine:

Google App Engine primarily focuses on application deployment and scaling rather than big data analytics. However, Google Cloud Platform offers various services that can be leveraged for big data analytics, such as:

a. Google Cloud Storage: Store large datasets in Google Cloud Storage buckets.
b. Google BigQuery: Use BigQuery for ad-hoc SQL queries and analysis of structured
data.
c. Google Dataflow: Process streaming data and create real-time recommendations using
Dataflow.

Strengths:
Integration with other Google Cloud services.
Serverless computing for application deployment.
BigQuery's fast query performance for large datasets.

Weaknesses:
Limited big data analytics services compared to AWS and Azure.
Not as suitable for complex machine learning models without additional setup.

CONCLUSION: From this practical, I learned how to use different cloud platforms to solve a big data analytics problem.

