
L. D. College of Engineering
Opp. Gujarat University, Navrangpura, Ahmedabad - 380015

LAB PRACTICALS

Branch: Computer Engineering
Subject: BIG DATA ANALYTICS (3170722)
Semester: VII
Enrollment No.: 190280107002
Name: Dev Ansodariya
Division: B
Batch: B1
Faculty Name: Prof. (Dr.) Hetal Joshiara

Computer Engineering Department
L. D. College of Engineering, Ahmedabad

VISION

• To achieve academic excellence in Computer Engineering by providing value-based education.

MISSION

• To produce graduates according to the needs of industry, government, society and scientific community.
• To develop partnership with industries, research and development organizations and government sectors for continuous improvement of faculties and students.
• To motivate students for participating in reputed conferences, workshops, seminars and technical events to make them technocrats and entrepreneurs.
• To enhance the ability of students to address the real-life issues by applying technical expertise, human values and professional ethics.
• To inculcate habit of using free and open-source software, latest technology and soft skills so that they become competent professionals.
• To encourage faculty members to upgrade their skills and qualification through training and higher studies at reputed universities.
Certificate

This is to certify that Mr. Dev Ansodariya, Enrollment No. 190280107002 of B.E. Sem-7 class has satisfactorily completed the course in BIG DATA ANALYTICS (3170722) within the four walls of L. D. College of Engineering, Ahmedabad - 380015.

Date of submission: 11th November, 2022

Staff in-Charge: Prof. (Dr.) Hetal A. Joshiara
Head of Department: Dr. Chirag S. Thaker

Department of Computer Engineering

Rubrics for Practical

SEMESTER: BE-VII Academic Term: July-Nov 2022-23 (ODD)

Subject: Big Data Analytics (3170722) Elective Subject

Faculty: Prof. (Dr.) Hetal A. Joshiara

Rubrics ID | Criteria | Marks | Good (2) | Satisfactory (1) | Need Improvement (0)
RB1 | Regularity | 05 | High (>70%) | Moderate (40-70%) | Poor (0-40%)
RB2 | Problem Analysis & Development of the Solution | 05 | Apt & full identification of the problem & complete solution for the problem | Limited identification of the problem / incomplete solution for the problem | Very less identification of the problem / very less solution for the problem
RB3 | Testing of the Solution | 05 | Correct solution as required | Partially correct solution for the problem | Very less correct solution for the problem
RB4 | Documentation | 05 | Documentation completed neatly | Not up to standard | Proper format not followed, incomplete

Each practical carries 20 marks.

SIGN OF FACULTY
L. D. College of Engineering, Ahmedabad

Department of Computer Engineering

LABORATORY PRACTICALS ASSESSMENT

Subject Name: Big Data Analytics (3170722)

Term: 2022-23

Enroll. No.: 190280107002

Name: DEV ANSODARIYA

Pract. No. | CO Achieved | RB1 (5) | RB2 (5) | RB3 (5) | RB4 (5) | Total (20)
1  | CO-2 |  |  |  |  |
2  | CO-2 |  |  |  |  |
3  | CO-2 |  |  |  |  |
4  | CO-2 |  |  |  |  |
5  | CO-2 |  |  |  |  |
6  | CO-2 |  |  |  |  |
7  | CO-3 |  |  |  |  |
8  | CO-5 |  |  |  |  |
9  | CO-3 |  |  |  |  |
10 | CO-3 |  |  |  |  |
11 | CO-3 |  |  |  |  |
12 | CO-1 |  |  |  |  |
INDEX

Sr. No. | CO | AIM | Date | Marks | Sign
1  | CO-2 | Make a single node cluster in Hadoop. | 28-07-2022 |  |
2  | CO-2 | Run Word count program in Hadoop with 250 MB size of Data Set. | 04-08-2022 |  |
3  | CO-2 | Understand the Logs generated by MapReduce program. | 18-08-2022 |  |
4  | CO-2 | Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs. | 25-08-2022 |  |
5  | CO-2 | Develop Map-Reduce Application to Sort a given file and do aggregation on some parameters. | 01-09-2022 |  |
6  | CO-2 | Download any two Big Data Sets from authenticated websites. | 08-09-2022 |  |
7  | CO-3 | Explore Spark and Implement Word count application using Spark. | 15-09-2022 |  |
8  | CO-5 | Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive. | 22-09-2022 |  |
9  | CO-3 | Implementation of Matrix algorithms in Spark SQL programming. | 29-09-2022 |  |
10 | CO-3 | Create Data Pipeline Based on Messaging Using PySpark and Hive - Covid-19 Analysis. | 13-10-2022 |  |
11 | CO-3 | Explore NoSQL database like MongoDB and perform basic CRUD operation. | 13-10-2022 |  |
12 | CO-1 | Case study based on the concept of Big Data Analytics. Prepare presentation in the group of 4. Submit PPTs. | 18-10-2022 |  |

Practical 1
AIM: Make a single node cluster in Hadoop.

• For making a cluster in Hadoop, there are two different types of Hadoop installations:

1. Single node cluster Hadoop (only one DataNode is running, and the NameNode, DataNode, ResourceManager, and NodeManager are all set up on a single machine.)
2. Multi node cluster Hadoop (more than one DataNode is running, and each DataNode runs on a different machine.)

• For this practical, a single node cluster Hadoop installation has been used.

• Installation of a single node cluster is the same as for a multi-node cluster. After installation of Hadoop, several configurations need to be done. Those are as follows:

Step-1: After downloading Hadoop, edit the core-site.xml file under the “hadoop-3.3.0\etc\hadoop” path by adding the I/O settings as follows:
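(The screenshot of the edited file is not reproduced here. A minimal sketch of the standard single-node setting, assuming the default HDFS port 9000:)

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>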


Step-2: Edit the hdfs-site.xml file (contains configuration settings of HDFS entities) under the “hadoop-3.3.0\etc\hadoop” path and add the property mentioned below inside the configuration tag:
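(Screenshot not reproduced; a minimal sketch of the usual single-node property, where the replication factor is kept at 1 because there is only one DataNode:)

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>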

Step-3: Edit the mapred-site.xml file (contains configuration settings of the MapReduce application) under the “hadoop-3.3.0\etc\hadoop” path and add the property mentioned below inside the configuration tag:
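(Screenshot not reproduced; a minimal sketch of the standard property that makes MapReduce run on YARN:)

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>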


Step-4: Edit the yarn-site.xml file (contains configuration settings of the ResourceManager and NodeManager) under the “hadoop-3.3.0\etc\hadoop” path and add the property mentioned below inside the configuration tag:
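(Screenshot not reproduced; a minimal sketch of the standard NodeManager property needed for the MapReduce shuffle:)

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>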

Step-5: Edit the hadoop-env.sh file under the “hadoop-3.3.0\etc\hadoop” path and add the Java path as mentioned below:
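(Illustrative only; the actual JDK path differs per machine. On Linux the line in hadoop-env.sh looks like
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
while on a Windows installation the equivalent line goes into hadoop-env.cmd:
set JAVA_HOME=C:\Java\jdk1.8.0_202)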

Step-6: From the Hadoop directory, run the command bin/hadoop namenode -format (this command formats HDFS via the NameNode and is executed only the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir property.)


Step-7: Open the command line, go to “hadoop-3.3.0/sbin” and type the start-all.cmd command (this command is a combination of start-dfs.cmd, start-yarn.cmd & mr-jobhistory-daemon.cmd).


Step-8: Alternatively, start the HDFS daemons (NameNode and DataNode) with start-dfs.cmd and the YARN daemons (ResourceManager and NodeManager) with start-yarn.cmd under the sbin folder.
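(To verify that the daemons came up, the JDK's jps command can be run; on a healthy single-node setup it typically lists NameNode, DataNode, ResourceManager and NodeManager, though the exact list depends on the installation.)

jps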


Step-9: Open the http://localhost:9870 URL in a browser (the NameNode web UI helps to monitor all of the information about nodes, memory management, resource utilization, etc.)


Practical 2
AIM: Run Word count program in Hadoop with 250 MB size of Data Set.

• Pre-requisite:
▪ Java Installation - Check whether the Java is installed or not.
▪ Hadoop Installation - Check whether the Hadoop is installed or not.

• Steps to Execute Word Count Program:


• Create or import a text file on your local machine and put it in a proper directory.

• In this practical, we find out the frequency of each word that exists in this text file.

• Create a directory in HDFS where the text file will be kept.


hdfs dfs -mkdir /test

• Upload the data.txt file on HDFS in the specific directory.

hdfs dfs -put /hadoop/wordcount/data.txt /test


• Write the MapReduce program using eclipse:

WC_Mapper.java

package com.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}

File: WC_Reducer.java

package com.wordcount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements


Reducer<Text,IntWritable,Text,IntWritable> {

public void reduce(Text key, Iterator<IntWritable>
values,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException {
int sum=0;
while (values.hasNext()) {
sum+=values.next().get();
}
output.collect(key,new IntWritable(sum));
}
}

File: WC_Runner.java

package com.wordcount;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(WC_Runner.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);

conf.setReducerClass(WC_Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}

• Create the jar file of this program and name it wordcountdemo.jar.
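• For reference, the classes can also be compiled and packaged from the command line (an illustrative sketch, assuming the three .java files are in the current directory and the hadoop command is on the PATH):

mkdir wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WC_Mapper.java WC_Reducer.java WC_Runner.java
jar -cvf wordcountdemo.jar -C wordcount_classes/ .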


• Run the jar file
hadoop jar /home/wordcount/wordcountdemo.jar com.wordcount.WC_Runner
/test/data.txt /r_output
• The output is stored in /r_output/part-00000, where each line has the form word<TAB>count.
• Now execute the following command to see the output.
hdfs dfs -cat /r_output/part-00000

Signature of Faculty: Grade:


Practical – 3
AIM: Understand the Logs generated by MapReduce program.

Logs & File Location


• Default directory of Hadoop log file is $HADOOP_HOME/logs.

MapReduce service:
• Service instance log that contains details related to the MapReduce framework and
startup on the service side. Each service instance has a separate log, in addition to
separate logs for each task attempt.
$HADOOP_HOME/logs/user.pmr.service.application.index.log

MapReduce task:
• Task log that contains details relating to user-defined MapReduce code on the
service side. Each task records its task log messages (syslog), standard output
(stdout) and errors (stderr) to separate files in this directory. Knowing the job and
task IDs can be very useful when debugging MapReduce jobs.

- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/syslog
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stdout
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stderr

Shuffle daemon:
• MRSS service log that captures messages, events, and errors related to the
MapReduce host.
$HADOOP_HOME/logs/mrss.hostname.log


MapReduce API:
• API log that captures details relating to API calls from EGO.
api.hostname.log in the directory where the job was submitted

MapReduce client:
• MapReduce client messages are not recorded in a log file. You can only view them on the client console.

MapReduce log levels


• MapReduce logs support various levels. You can configure the log levels for the
MapReduce service and tasks.
• You can set log levels to any of the following values:

Level | Description
DEBUG | Logs all debug-level and informational messages.
INFO  | Logs all informational messages and more serious messages. This is the default log level.
WARN  | Logs only those messages that are warnings or more serious messages. This is the default level of debug information.
ERROR | Logs only those messages that indicate error conditions or more serious messages.
FATAL | Logs only those messages in which the system is unusable.
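(In a plain Apache Hadoop installation, such as the one set up in Practical 1, per-task log levels can be raised through mapred-site.xml; a minimal illustrative sketch using the standard Hadoop property names:)

<property>
   <name>mapreduce.map.log.level</name>
   <value>DEBUG</value>
</property>
<property>
   <name>mapreduce.reduce.log.level</name>
   <value>DEBUG</value>
</property>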

Signature of Faculty: Grade:


Practical – 4
AIM: Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.

First Dataset:
Program: Word Count
Size: 1MB

(Screenshots of the terminal output and job logs for the two runs are not reproduced here.)

Conclusion:
• Running the same MapReduce job on two datasets of different sizes, we can see from the logs that the time taken to run the job increases as the dataset gets bigger.

Signature of Faculty: Grade:


Practical – 5
AIM: Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.

Let’s say we want to:

“View all donor cities by descending order of donation total amount, considering only donations which were not issued by a teacher. City names should be case insensitive (using upper-case).”

This can be done in SQL as:
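(The screenshot of the query is not reproduced here; the statement below is an illustrative reconstruction, assuming a table named donations with the columns used in the rest of this practical.)

SELECT UPPER(donor_city) AS city, SUM(total) AS sumtotal
FROM donations
WHERE donor_is_teacher != 't'
GROUP BY UPPER(donor_city)
ORDER BY sumtotal DESC;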

This query actually involves quite a few operations:

• Filtering on the value of donor_is_teacher
• Aggregating the sum of total values, grouping by city
• Sorting on the aggregated value sumtotal


This task could be broken down into 2 MapReduce jobs:

1. First Job: Filtering and Aggregation

• Map:
▪ Input: DonationWritables “full row” objects from the SequenceFile.
▪ Output: (city, total) pairs for each entry, only if donor_is_teacher is not true.

• Reduce:
▪ Reduce by summing the “total” values for each “city” key.

2. Second Job: Sorting

• Map:
▪ Input: (city, sumtotal) pairs with summed total per city.
▪ Output: (sumtotal, city) inversed pair.

• Reduce:
▪ Identity reducer. Does not reduce anything, but the shuffling will sort on keys for us.


• Here is a simple visualization of the 2-job process (diagram not reproduced here):


The First Job

• Code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import data.writable.DonationWritable;

public class DonationsSumByCity {

    public static class CityDonationMapper extends Mapper<Text, DonationWritable, Text, FloatWritable> {

        private Text city = new Text();
        private FloatWritable total = new FloatWritable();

        @Override
        public void map(Text key, DonationWritable donation, Context context)
                throws IOException, InterruptedException {

            // Ignore rows where the donor is a teacher
            if ("t".equals(donation.donor_is_teacher)) {
                return;
            }

            // Transform the city name to uppercase and write the (string, float) pair
            city.set(donation.donor_city.toUpperCase());
            total.set(donation.total);
            context.write(city, total);
        }
    }

    public static class FloatSumReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {

        private FloatWritable result = new FloatWritable();

        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {

            float sum = 0;
            for (FloatWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration(), "Sum donations by city");
        job.setJarByClass(DonationsSumByCity.class);

        // Mapper configuration
        job.setMapperClass(CityDonationMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FloatWritable.class);

        // Reducer configuration (use the reducer as combiner also, useful in cases of aggregation)
        job.setCombinerClass(FloatSumReducer.class);
        job.setReducerClass(FloatSumReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Sequence File input format

A couple of things to notice here regarding the Sequence File as the input:

• job.setInputFormatClass(SequenceFileInputFormat.class)
o Tell the job that we are reading a Sequence File.
• ... extends Mapper<Text, DonationWritable, Text, FloatWritable>
o The first two generic type parameters of the Mapper class should be the
input Key and Value types of Sequence File.
• map(Text key, DonationWritable donation, Context context)
o The parameter of the map method are directly the Writable objects. If we
were using the CSV input we would have a Text object as the second
parameter containing the csv line, which we would have to split on
commas to obtain values.

Using a Combiner

• Since we are doing an aggregation task here, using our Reducer as a Combiner
by calling job.setCombinerClass(FloatSumReducer.class) improves performance. It
will start reducing the Mapper’s output during the map phase, which will result in
less data being shuffled and sent to the Reducer.


The Second Job

• Code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderBySumDesc {

    public static class InverseCitySumMapper extends Mapper<Text, Text, FloatWritable, Text> {

        private FloatWritable floatSum = new FloatWritable();

        @Override
        public void map(Text city, Text sum, Context context) throws IOException, InterruptedException {
            float floatVal = Float.parseFloat(sum.toString());
            floatSum.set(floatVal);
            context.write(floatSum, city);
        }
    }

    public static class DescendingFloatComparator extends WritableComparator {

        public DescendingFloatComparator() {
            super(FloatWritable.class, true);
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            FloatWritable key1 = (FloatWritable) w1;
            FloatWritable key2 = (FloatWritable) w2;
            return -1 * key1.compareTo(key2);
        }
    }

    public static void main(String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration(), "Order By Sum Desc");
        job.setJarByClass(DonationsSumByCity.class);

        // The mapper which transforms (K:V) => (float(V):K)
        job.setMapperClass(InverseCitySumMapper.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapOutputKeyClass(FloatWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Sort with descending float order
        job.setSortComparatorClass(DescendingFloatComparator.class);

        // Use the default Reducer, which simply transforms (K: V1, V2) => (K: V1), (K: V2)
        job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running and Viewing Results

• Here are the terminal commands for executing and viewing the outputs for these 2 MapReduce jobs:

$ hadoop jar donors.jar mapreduce.donation.DonationsSumByCity donors/donations.seqfile donors/output_sumbycity

$ hdfs dfs -cat donors/output_sumbycity/*
[...]
ROCKWALL 8422.99
ROCKWELL 80.0
ROCKWOOD 9224.17
[...]

$ hadoop jar donors.jar mapreduce.donation.OrderBySumDesc donors/output_sumbycity donors/output_orderbysumdesc

$ hdfs dfs -cat donors/output_orderbysumdesc/* | head
1.71921696E8
2.5504284E7 NEW YORK
1.5451513E7 SAN FRANCISCO
6163194.0 CHICAGO
5085116.5 SEATTLE

• As expected, the output of the first job is a plain text list of <city, sum> ordered by city name. The second job generates a list of <sum, city> sorted by descending sum.

• Execution times:

▪ The first job took an average of 1 min 25 sec on my cluster.


▪ This second job took an average of 1 min 02 sec on my cluster.


Signature of Faculty: Grade:

Practical – 6
AIM: Download any two Big Data Sets from authenticated websites.

1. Yelp Dataset:
• Website: https://www.yelp.com/dataset


2. Kaggle:
• Website: https://www.kaggle.com/datasets

Signature of Faculty: Grade:



Practical – 7
AIM: Explore Spark and Implement Word count application using Spark.

Apache Spark:
• Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.

• Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
respective system, it reduces the management burden of maintaining separate tools.

Spark Word Count

1. Create a directory in HDFS where the text file will be kept.


$ hadoop fs -mkdir /spark
2. Upload the data file on Hadoop in the specific directory.
$ hadoop fs -put <fileLocation> URI


3. Now, follow the below command to open the spark in Scala mode.
$ spark-shell

4. Let’s create an RDD by using the following command.

scala > val data = sc.textFile("sparkdata.txt");

• Now, we can read the generated result by using the following command.

scala > data.collect;


5. Here, we split the existing data in the form of individual words by using the following command.
scala > val splitdata = data.flatMap(line=>line.split(" "));

Now, we can read the generated result by using the following command.

scala > splitdata.collect;


6. Map each word to a (word, 1) pair so that it can be counted.

scala > val mapdata = splitdata.map(word=>(word,1));

7. Perform the reduce operation.

scala > val reducedata = mapdata.reduceByKey(_+_);

we can read the generated result by using the following command.

scala > reducedata.collect;
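For reference, the whole word count can also be written as a single chained transformation in the Spark shell (an equivalent sketch, assuming the same sparkdata.txt input file):

scala > sc.textFile("sparkdata.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect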

Signature of Faculty: Grade:


Practical – 8
AIM: Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.

Create a folder on HDFS under /user/cloudera HDFS Path


javachain~hadoop]$ hadoop fs -mkdir javachain

Move the text file from local file system into newly created folder called javachain
javachain~hadoop]$ hadoop fs -put ~/Desktop/student.txt javachain/

Create Empty table STUDENT in HIVE

hive> create table student


> ( std_id int,
> std_name string,
> std_grade string,
> std_addres string)
> partitioned by (country string)
> row format delimited
> fields terminated by ','
>;
OK
Time taken: 0.349 seconds

Load Data from HDFS path into HIVE TABLE.

hive> load data inpath 'javachain/student.txt' into table student


partition(country='usa');
Loading data to table default.student partition (country=usa)
chgrp: changing ownership of
'hdfs://quickstart.cloudera:8020/user/hive/warehouse/student/country=usa/stud
ent.txt': User does not
belong to hive
Partition default.student{country=usa} stats: [numFiles=1, numRows=0,
totalSize=120, rawDataSize=0]
OK
Time taken: 1.048 seconds


Select the values in the Hive table.

hive> select * from student;


OK
101 'JAVACHAIN' 3RD 'USA usa
102 'ANTO' 10TH 'ENGLAND' usa
103 'PRABU' 2ND 'INDIA' usa
104 'KUMAR' 4TH 'USA' usa
105 'jack' 2ND 'INDIA' usa
Time taken: 0.553 seconds, Fetched: 5 row(s)
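
• The join queries below refer to two additional Hive tables, CUSTOMERS and ORDERS, which are not created above. A minimal sketch of the assumed schema (column names taken from the queries themselves; in newer Hive versions the DATE column, being a reserved word, has to be quoted with backticks as shown) would be:

hive> CREATE TABLE CUSTOMERS (ID INT, NAME STRING, AGE INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> CREATE TABLE ORDERS (OID INT, `DATE` STRING, CUSTOMER_ID INT, AMOUNT FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';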

JOIN
• The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT


FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

• On successful execution of the query, you get to see the following response:

LEFT OUTER JOIN

• A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.


• The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE


FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

• On successful execution of the query, you get to see the following response :

RIGHT OUTER JOIN

• The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and
ORDER tables.

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

• On successful execution of the query, you get to see the following response:


Full Outer Join:


• The following query demonstrates FULL OUTER JOIN between CUSTOMER and
ORDER tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE


FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

Signature of Faculty: Grade:


Practical – 9
AIM: Implementation of Matrix algorithms in Spark SQL programming.

• Code:

matrix_multiply.py

from pyspark import SparkConf, SparkContext
import sys, operator


def add_tuples(a, b):
    return list(sum(p) for p in zip(a, b))


def permutation(row):
    rowPermutation = []
    for element in row:
        for e in range(len(row)):
            rowPermutation.append(float(element) * float(row[e]))
    return rowPermutation


def main():
    input = sys.argv[1]
    output = sys.argv[2]

    conf = SparkConf().setAppName('Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    # Read the matrix file, split each line into its elements and cache the RDD.
    row = sc.textFile(input).map(lambda row: row.split(' ')).cache()
    ncol = len(row.take(1)[0])
    intermediateResult = row.map(permutation).reduce(add_tuples)

    # Reshape the flat result back into ncol-wide rows and write it out.
    outputFile = open(output, 'w')
    result = [intermediateResult[x:x + ncol] for x in range(0, len(intermediateResult), ncol)]
    for row in result:
        for element in row:
            outputFile.write(str(element) + ' ')
        outputFile.write('\n')
    outputFile.close()

    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)


if __name__ == "__main__":
    main()
matrix_multiply_sparse.py

from pyspark import SparkConf, SparkContext
import sys, operator
from scipy import *
from scipy.sparse import csr_matrix


def createCSRMatrix(input):
    # Each input token has the form "column:value"; build a 1 x 100 sparse row.
    row = []
    col = []
    data = []

    for values in input:
        value = values.split(':')
        row.append(0)
        col.append(int(value[0]))
        data.append(float(value[1]))

    return csr_matrix((data, (row, col)), shape=(1, 100))


def multiplyMatrix(csrMatrix):
    csrTransponse = csrMatrix.transpose(copy=True)
    return (csrTransponse * csrMatrix)


def formatOutput(indexValuePairs):
    return ' '.join(map(lambda pair: str(pair[0]) + ':' + str(pair[1]), indexValuePairs))


def main():
    input = sys.argv[1]
    output = sys.argv[2]

    conf = SparkConf().setAppName('Sparse Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    sparseMatrix = sc.textFile(input).map(lambda row: row.split(' ')) \
        .map(createCSRMatrix).map(multiplyMatrix).reduce(operator.add)

    # Walk the CSR structure row by row and write each row in "column:value" form.
    outputFile = open(output, 'w')
    for row in range(len(sparseMatrix.indptr) - 1):
        col = sparseMatrix.indices[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        data = sparseMatrix.data[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        indexValuePairs = zip(col, data)
        formattedOutput = formatOutput(indexValuePairs)
        outputFile.write(formattedOutput + '\n')
    outputFile.close()


if __name__ == "__main__":
    main()
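
• Both scripts are submitted to Spark from the command line; an illustrative invocation (the input and output file names here are assumptions):

spark-submit matrix_multiply.py /path/to/matrix.txt /path/to/output.txt
spark-submit matrix_multiply_sparse.py /path/to/sparse_matrix.txt /path/to/sparse_output.txt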


Signature of Faculty: Grade:


Practical – 10
AIM: Create A Data Pipeline Based on Messaging Using PySpark and Hive -Covid-19 Analysis.

Building data pipeline for Covid-19 data analysis using Bigdata technologies and Tableau

• The purpose is to collect the real time streaming data from COVID19 open API for every
5 minutes into the ecosystem using NiFi and to process it and store it in the data lake on
AWS.
• Data processing includes parsing the data from complex JSON format to CSV format, then publishing to Kafka for persistent delivery of messages into PySpark for further processing (a minimal PySpark sketch of this step is shown after this list).
• The processed data is then fed into an output Kafka topic, which is in turn consumed by NiFi and stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the process is orchestrated using Airflow to run at every time interval. Finally, KPIs are visualised in Tableau.
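
• The Kafka-to-PySpark step referred to above can be sketched as follows. This is a minimal, illustrative example only: the topic names, broker address, CSV schema and checkpoint path are assumptions, and it requires the spark-sql-kafka connector package to be available to Spark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("Covid19KafkaPipeline").getOrCreate()

# Read the raw messages that NiFi publishes into the input Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid_raw")
       .load())

# Kafka delivers the payload as bytes; cast it to string and split the CSV fields.
parsed = (raw.selectExpr("CAST(value AS STRING) AS csv")
          .select(split(col("csv"), ",").alias("f"))
          .select(col("f")[0].alias("state"),
                  col("f")[1].cast("int").alias("confirmed"),
                  col("f")[2].cast("int").alias("recovered"),
                  col("f")[3].cast("int").alias("deaths")))

# Write the processed records to the output topic that NiFi consumes and lands in HDFS.
query = (parsed.selectExpr("state AS key", "to_json(struct(*)) AS value")
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "covid_processed")
         .option("checkpointLocation", "/tmp/covid_checkpoint")
         .start())

query.awaitTermination()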

• Data Architecture:


• Tools used:
1. Nifi -nifi-1.10.0
2. Hadoop -hadoop_2.7.3
3. Hive-apache-hive-2.1.0
4. Spark-spark-2.4.5
5. Zookeeper-zookeeper-2.3.5
6. Kafka-kafka_2.11-2.4.0
7. Airflow-airflow-1.8.1
8. Tableau

Signature of Faculty: Grade:


Practical – 11
AIM: Explore NoSQL database like MongoDB and perform basic CRUD operation.

• Start the Mongo shell.

• Display all databases (show dbs).


1. Create Operation

To insert one document:
db.collection.insertOne()
To insert multiple documents:
db.collection.insertMany()

2. Update Operations

Update a single document:
db.collection.updateOne()
Update multiple documents:
db.collection.updateMany()
Replace a single document:
db.collection.replaceOne()

3. Read Operations

To display all documents:
db.collection.find()


4. Delete Operation

To delete a single document:
db.collection.deleteOne()
To delete multiple documents:
db.collection.deleteMany()
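
A short illustrative session tying these operations together (the database name college and the collection name students are assumptions made for this example):

> use college
> db.students.insertOne({ enrollment: "190280107002", name: "Dev", branch: "CE" })
> db.students.insertMany([{ name: "A", sem: 7 }, { name: "B", sem: 7 }])
> db.students.find({ branch: "CE" })
> db.students.updateOne({ name: "Dev" }, { $set: { sem: 7 } })
> db.students.deleteOne({ name: "B" })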

Signature of Faculty: Grade:


Practical – 12
AIM: Case study based on the concept of Big Data Analytics.

1. Walmart:

• Walmart is the largest retailer in the world and the world’s largest company by revenue,
with more than 2 million employees and 20000 stores in 28 countries.
• It started making use of big data analytics much before the term Big Data came into the picture. Walmart uses data mining to discover patterns that can be used to provide product recommendations to the user, based on which products were bought together.
• By applying effective data mining, Walmart has increased its customer conversion rate. It has been speeding along big data analysis to provide best-in-class e-commerce technologies with a motive to deliver a superior customer experience.
• The main objective of holding big data at Walmart is to optimize the shopping
experience of customers when they are in a Walmart store.
• Big data solutions at Walmart are developed with the intent of redesigning global
websites and building innovative applications to customize the shopping experience for
customers whilst increasing logistics efficiency.
• Hadoop and NoSQL technologies are used to provide internal customers with access to
real-time data collected from different sources and centralized for effective use.

2. Uber:

• Uber is the first choice for people around the world when they think of moving people
and making deliveries.
• It uses the personal data of the user to closely monitor which features of the service are used most, to analyze usage patterns and to determine where the services should be more focused.
• Uber focuses on the supply of and demand for its services, due to which the prices of the services provided change. Therefore, one of Uber’s biggest uses of data is surge pricing.
• For instance, if you are running late for an appointment and you book a cab in a
crowded place then you must be ready to pay twice the amount.

• For example, on New Year’s Eve, the price for driving one mile can go from 200 to 1,000. In the short term, surge pricing affects the rate of demand, while long-term use could be the key to retaining or losing customers. Machine learning algorithms are used to determine where the demand is strong.

3. Netflix:

• It is the most loved American entertainment company specializing in online on-demand


streaming video for its customers.
• With Big Data, Netflix is able to predict what exactly its customers will enjoy watching. As such, Big Data analytics is the fuel that fires the ‘recommendation engine’ designed to serve this purpose.
• More recently, Netflix started positioning itself as a content creator, not just a
distribution method.
• Unsurprisingly, this strategy has been firmly driven by data. Netflix’s recommendation
engines and new content decisions are fed by data points such as what titles customers
watch, how often playback stopped, ratings are given, etc.
• The company’s data structure includes Hadoop, Hive and Pig with much other
traditional business intelligence.
• Netflix shows us that knowing exactly what customers want is easy to understand if the
companies just don’t go with the assumptions and make decisions based on Big Data.

4. eBay:

• A big technical challenge for eBay as a data-intensive business is to exploit a system that can rapidly analyze and act on data as it arrives (streaming data).
• There are many rapidly evolving methods to support streaming data analysis. eBay is
working with several tools including Apache Spark, Storm, Kafka.
• It allows the company’s data analysts to search for information tags that have been
associated with the data (metadata) and make it consumable to as many people as
possible with the right level of security and permissions (data governance).
• The company has been at the forefront of using big data solutions and actively
contributes its knowledge back to the open-source community.


5. Procter & Gamble:

• Procter & Gamble, whose products we all use 2-3 times a day, is a 179-year-old company. It has recognized the potential of Big Data and put it to use in business units around the globe.
• P&G has put a strong emphasis on using big data to make better, smarter, real-time
business decisions. The Global Business Services organization has developed tools,
systems, and processes to provide managers with direct access to the latest data and
advanced analytics.
• Therefore, despite many emerging competitors, P&G, being one of the oldest companies, still holds a great share of the market.

Signature of Faculty: Grade:


Practical 12

AIM: Case study based on the concept of Big Data Analytics. Prepare presentation
in the group of 4. Submit PPTs.

(Figures 1-30: slides of the group presentation submitted for this case study; the slide images are not reproduced here.)