
L. D. College of Engineering
Opp. Gujarat University, Navrangpura, Ahmedabad - 380015

LAB PRACTICALS

Branch: Computer Engineering
Subject: BIG DATA ANALYTICS (3170722)
Semester: VII
Enrollment No.: 190280107002
Name: Dev Ansodariya
Division: B
Batch: B1
Faculty Name: Prof. (Dr.) Hetal Joshiara

Computer Engineering Department
L. D. College of Engineering, Ahmedabad

VISION

• To achieve academic excellence in Computer Engineering by providing value-based education.

MISSION

• To produce graduates according to the needs of industry, government, society and scientific community.
• To develop partnership with industries, research and development organizations and government sectors for continuous improvement of faculties and students.
• To motivate students for participating in reputed conferences, workshops, seminars and technical events to make them technocrats and entrepreneurs.
• To enhance the ability of students to address the real-life issues by applying technical expertise, human values and professional ethics.
• To inculcate habit of using free and open-source software, latest technology and soft skills so that they become competent professionals.
• To encourage faculty members to upgrade their skills and qualification through training and higher studies at reputed universities.
Certificate

This is to certify that Mr. Dev Ansodariya, Enrollment No. 190280107002 of B.E. Sem-7 class has satisfactorily completed the course in BIG DATA ANALYTICS (3170722) within the four walls of L. D. College of Engineering, Ahmedabad - 380015.

Date of submission: 11th November, 2022

Staff in-Charge: Prof. (Dr.) Hetal A. Joshiara
Head of Department: Dr. Chirag S. Thaker

Department of Computer Engineering

Rubrics for Practical

SEMESTER: BE-VII Academic Term: July-Nov 2022-23 (ODD)

Subject: Big Data Analytics (3170722) Elective Subject

Faculty: Prof. (Dr.) Hetal A. Joshiara

Rubrics ID | Criteria | Marks | Good (2) | Satisfactory (1) | Need Improvement (0)
RB1 | Regularity | 05 | High (>70%) | Moderate (40-70%) | Poor (0-40%)
RB2 | Problem Analysis & Development of the Solution | 05 | Apt & full identification of the problem & complete solution for the problem | Limited identification of the problem / incomplete solution for the problem | Very less identification of the problem / very less solution for the problem
RB3 | Testing of the Solution | 05 | Correct solution as required | Partially correct solution for the problem | Very less correct solution for the problem
RB4 | Documentation | 05 | Documentation completed neatly | Not up to standard | Proper format not followed, incomplete

Each practical carries 20 marks.

SIGN OF FACULTY
L. D. College of Engineering, Ahmedabad

Department of Computer Engineering

LABORATORY PRACTICALS ASSESSMENT

Subject Name: Big Data Analytics (3170722)

Term: 2022-23

Enroll. No.: 190280107002

Name: DEV ANSODARIYA

Pract. No. | CO Achieved | RB1 (5) | RB2 (5) | RB3 (5) | RB4 (5) | Total (20)
1  | CO-2 |  |  |  |  |
2  | CO-2 |  |  |  |  |
3  | CO-2 |  |  |  |  |
4  | CO-2 |  |  |  |  |
5  | CO-2 |  |  |  |  |
6  | CO-2 |  |  |  |  |
7  | CO-3 |  |  |  |  |
8  | CO-5 |  |  |  |  |
9  | CO-3 |  |  |  |  |
10 | CO-3 |  |  |  |  |
11 | CO-3 |  |  |  |  |
12 | CO-1 |  |  |  |  |
INDEX

Sr. No. | CO | AIM | Date | Marks | Sign
1  | CO-2 | Make a single node cluster in Hadoop. | 28-07-2022 |  |
2  | CO-2 | Run Word count program in Hadoop with 250 MB size of Data Set. | 04-08-2022 |  |
3  | CO-2 | Understand the Logs generated by MapReduce program. | 18-08-2022 |  |
4  | CO-2 | Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs. | 25-08-2022 |  |
5  | CO-2 | Develop Map-Reduce Application to Sort a given file and do aggregation on some parameters. | 01-09-2022 |  |
6  | CO-2 | Download any two Big Data Sets from authenticated websites. | 08-09-2022 |  |
7  | CO-3 | Explore Spark and Implement Word count application using Spark. | 15-09-2022 |  |
8  | CO-5 | Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive. | 22-09-2022 |  |
9  | CO-3 | Implementation of Matrix algorithms in Spark SQL programming. | 29-09-2022 |  |
10 | CO-3 | Create Data Pipeline Based on Messaging Using PySpark and Hive - Covid-19 Analysis. | 13-10-2022 |  |
11 | CO-3 | Explore NoSQL database like MongoDB and perform basic CRUD operation. | 13-10-2022 |  |
12 | CO-1 | Case study based on the concept of Big Data Analytics. Prepare presentation in the group of 4. Submit PPTs. | 18-10-2022 |  |

Practical 1
AIM: Make a single node cluster in Hadoop.

• For making a cluster in Hadoop, there are two different types of Hadoop installations:

1. Single node cluster Hadoop (only one DataNode is running, and the NameNode, DataNode, ResourceManager, and NodeManager are all set up on a single machine.)
2. Multi node cluster Hadoop (more than one DataNode is running, and each DataNode runs on a different machine.)

• For this practical, a single node cluster Hadoop installation has been used.

• Installation of a single node cluster is the same as for a multi-node cluster. After installation of Hadoop, several configurations need to be done. Those are as follows:

Step-1: After downloading Hadoop, edit the core-site.xml file under the “hadoop-3.3.0\etc\hadoop” path by adding the I/O settings as follows:
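(The screenshot of the edited file is not reproduced here. A minimal sketch of the standard single-node setting, assuming the default HDFS port 9000:)

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>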


Step-2: Edit the hdfs-site.xml file (contains configuration settings of HDFS entities) under the “hadoop-3.3.0\etc\hadoop” path and add the property mentioned below inside the configuration tag:
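(Screenshot not reproduced; a minimal sketch of the usual single-node property, where the replication factor is kept at 1 because there is only one DataNode:)

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>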

Step-3: Edit the mapred-site.xml file (contains configuration settings of the MapReduce application) under the “hadoop-3.3.0\etc\hadoop” path and add the property mentioned below inside the configuration tag:
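(Screenshot not reproduced; a minimal sketch of the standard property that makes MapReduce run on YARN:)

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>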


Step-4: Edit the yarn-site.xml file (contains configuration settings of the ResourceManager and NodeManager) under the “hadoop-3.3.0\etc\hadoop” path and add the property mentioned below inside the configuration tag:
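(Screenshot not reproduced; a minimal sketch of the standard NodeManager property needed for the MapReduce shuffle:)

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>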

Step-5: Edit the hadoop-env.sh file under the “hadoop-3.3.0\etc\hadoop” path and add the Java path as mentioned below:
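(Illustrative only; the actual JDK path differs per machine. On Linux the line in hadoop-env.sh looks like
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
while on a Windows installation the equivalent line goes into hadoop-env.cmd:
set JAVA_HOME=C:\Java\jdk1.8.0_202)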

Step-6: From the Hadoop directory, run the command bin/hadoop namenode -format (this command formats HDFS via the NameNode and is executed only the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir property.)


Step-7: Open the command line, go to “hadoop-3.3.0/sbin” and type the start-all.cmd command (this command is a combination of start-dfs.cmd, start-yarn.cmd & mr-jobhistory-daemon.cmd).


Step-8: Alternatively, start the HDFS daemons (NameNode and DataNode) with start-dfs.cmd and the YARN daemons (ResourceManager and NodeManager) with start-yarn.cmd under the sbin folder.
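(To verify that the daemons came up, the JDK's jps command can be run; on a healthy single-node setup it typically lists NameNode, DataNode, ResourceManager and NodeManager, though the exact list depends on the installation.)

jps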


Step-9: Open the http://localhost:9870 URL in a browser (the NameNode web UI helps to monitor all of the information about nodes, memory management, resource utilization, etc.)


Practical 2
AIM: Run Word count program in Hadoop with 250 MB size of Data Set.

• Pre-requisite:
▪ Java Installation - Check whether the Java is installed or not.
▪ Hadoop Installation - Check whether the Hadoop is installed or not.

• Steps to Execute Word Count Program:


• Create or import a text file on your local machine and put it in a proper directory.

• In this practical, we find out the frequency of each word that exists in this text file.

• Create a directory in HDFS where the text file will be kept.


hdfs dfs -mkdir /test

• Upload the data.txt file on HDFS in the specific directory.

hdfs dfs -put /hadoop/wordcount/data.txt /test


• Write the MapReduce program using eclipse:

WC_Mapper.java

package com.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}

File: WC_Reducer.java

package com.wordcount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements


Reducer<Text,IntWritable,Text,IntWritable> {

public void reduce(Text key, Iterator<IntWritable>
values,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException {
int sum=0;
while (values.hasNext()) {
sum+=values.next().get();
}
output.collect(key,new IntWritable(sum));
}
}

File: WC_Runner.java

package com.wordcount;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(WC_Runner.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);

conf.setReducerClass(WC_Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));
JobClient.runJob(conf);
}
}

• Create the jar file of this program and name it wordcountdemo.jar.
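• For reference, the classes can also be compiled and packaged from the command line (an illustrative sketch, assuming the three .java files are in the current directory and the hadoop command is on the PATH):

mkdir wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WC_Mapper.java WC_Reducer.java WC_Runner.java
jar -cvf wordcountdemo.jar -C wordcount_classes/ .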


• Run the jar file
hadoop jar /home/wordcount/wordcountdemo.jar com.wordcount.WC_Runner
/test/data.txt /r_output
• The output is stored in /r_output/part-00000, where each line has the form word<TAB>count.
• Now execute the following command to see the output.
hdfs dfs -cat /r_output/part-00000

Signature of Faculty: Grade:


Practical – 3
AIM: Understand the Logs generated by MapReduce program.

Logs & File Location


• Default directory of Hadoop log file is $HADOOP_HOME/logs.

MapReduce service:
• Service instance log that contains details related to the MapReduce framework and
startup on the service side. Each service instance has a separate log, in addition to
separate logs for each task attempt.
$HADOOP_HOME/logs/user.pmr.service.application.index.log

MapReduce task:
• Task log that contains details relating to user-defined MapReduce code on the
service side. Each task records its task log messages (syslog), standard output
(stdout) and errors (stderr) to separate files in this directory. Knowing the job and
task IDs can be very useful when debugging MapReduce jobs.

- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/syslog
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stdout
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stderr

Shuffle daemon:
• MRSS service log that captures messages, events, and errors related to the
MapReduce host.
$HADOOP_HOME/logs/mrss.hostname.log


MapReduce API:
• API log that captures details relating to API calls from EGO.
api.hostname.log in the directory where the job was submitted

MapReduce client:
• MapReduce client messages are not recorded in a log file. You can only view them on the client console.

MapReduce log levels


• MapReduce logs support various levels. You can configure the log levels for the
MapReduce service and tasks.
• You can set log levels to any of the following values:

Level | Description
DEBUG | Logs all debug-level and informational messages.
INFO  | Logs all informational messages and more serious messages. This is the default log level.
WARN  | Logs only those messages that are warnings or more serious messages. This is the default level of debug information.
ERROR | Logs only those messages that indicate error conditions or more serious messages.
FATAL | Logs only those messages in which the system is unusable.
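(In a plain Apache Hadoop installation, such as the one set up in Practical 1, per-task log levels can be raised through mapred-site.xml; a minimal illustrative sketch using the standard Hadoop property names:)

<property>
   <name>mapreduce.map.log.level</name>
   <value>DEBUG</value>
</property>
<property>
   <name>mapreduce.reduce.log.level</name>
   <value>DEBUG</value>
</property>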

Signature of Faculty: Grade:


Practical – 4
AIM: Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.

First Dataset:
Program: Word Count
Size: 1MB

(Screenshots of the terminal output and job logs for the two runs are not reproduced here.)

Conclusion:
• Running the same MapReduce job on two datasets of different sizes, we can see from the logs that the time taken to run the job increases as the dataset gets bigger.

Signature of Faculty: Grade:


Practical – 5
AIM: Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.

Let’s say we want to:

“View all donor cities by descending order of donation total amount, considering only donations which were not issued by a teacher. City names should be case insensitive (using upper-case).”

This can be done in SQL as:
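(The screenshot of the query is not reproduced here; the statement below is an illustrative reconstruction, assuming a table named donations with the columns used in the rest of this practical.)

SELECT UPPER(donor_city) AS city, SUM(total) AS sumtotal
FROM donations
WHERE donor_is_teacher != 't'
GROUP BY UPPER(donor_city)
ORDER BY sumtotal DESC;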

This query actually involves quite a few operations:

• Filtering on the value of donor_is_teacher
• Aggregating the sum of total values, grouping by city
• Sorting on the aggregated value sumtotal


This task could be broken down into 2 MapReduce jobs:

1. First Job: Filtering and Aggregation

• Map:
▪ Input: DonationWritables “full row” objects from the SequenceFile.
▪ Output: (city, total) pairs for each entry, only if donor_is_teacher is not true.

• Reduce:
▪ Reduce by summing the “total” values for each “city” key.

2. Second Job: Sorting

• Map:
▪ Input: (city, sumtotal) pairs with summed total per city.
▪ Output: (sumtotal, city) inversed pair.

• Reduce:
▪ Identity reducer. Does not reduce anything, but the shuffling will sort on keys for us.


• Here is a simple visualization of the 2-job process (diagram not reproduced here):


The First Job

• Code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import data.writable.DonationWritable;

public class DonationsSumByCity {

    public static class CityDonationMapper extends Mapper<Text, DonationWritable, Text, FloatWritable> {

        private Text city = new Text();
        private FloatWritable total = new FloatWritable();

        @Override
        public void map(Text key, DonationWritable donation, Context context)
                throws IOException, InterruptedException {

            // Ignore rows where the donor is a teacher
            if ("t".equals(donation.donor_is_teacher)) {
                return;
            }

            // Transform the city name to uppercase and write the (string, float) pair
            city.set(donation.donor_city.toUpperCase());
            total.set(donation.total);
            context.write(city, total);
        }
    }

    public static class FloatSumReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {

        private FloatWritable result = new FloatWritable();

        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {

            float sum = 0;
            for (FloatWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration(), "Sum donations by city");
        job.setJarByClass(DonationsSumByCity.class);

        // Mapper configuration
        job.setMapperClass(CityDonationMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FloatWritable.class);

        // Reducer configuration (use the reducer as combiner also, useful in cases of aggregation)
        job.setCombinerClass(FloatSumReducer.class);
        job.setReducerClass(FloatSumReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Sequence File input format

A couple of things to notice here regarding the Sequence File as the input:

• job.setInputFormatClass(SequenceFileInputFormat.class)
o Tell the job that we are reading a Sequence File.
• ... extends Mapper<Text, DonationWritable, Text, FloatWritable>
o The first two generic type parameters of the Mapper class should be the
input Key and Value types of Sequence File.
• map(Text key, DonationWritable donation, Context context)
o The parameter of the map method are directly the Writable objects. If we
were using the CSV input we would have a Text object as the second
parameter containing the csv line, which we would have to split on
commas to obtain values.

Using a Combiner

• Since we are doing an aggregation task here, using our Reducer as a Combiner
by calling job.setCombinerClass(FloatSumReducer.class) improves performance. It
will start reducing the Mapper’s output during the map phase, which will result in
less data being shuffled and sent to the Reducer.


The Second Job

• Code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderBySumDesc {

    public static class InverseCitySumMapper extends Mapper<Text, Text, FloatWritable, Text> {

        private FloatWritable floatSum = new FloatWritable();

        @Override
        public void map(Text city, Text sum, Context context) throws IOException, InterruptedException {
            float floatVal = Float.parseFloat(sum.toString());
            floatSum.set(floatVal);
            context.write(floatSum, city);
        }
    }

    public static class DescendingFloatComparator extends WritableComparator {

        public DescendingFloatComparator() {
            super(FloatWritable.class, true);
        }

        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            FloatWritable key1 = (FloatWritable) w1;
            FloatWritable key2 = (FloatWritable) w2;
            return -1 * key1.compareTo(key2);
        }
    }

    public static void main(String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration(), "Order By Sum Desc");
        job.setJarByClass(DonationsSumByCity.class);

        // The mapper which transforms (K:V) => (float(V):K)
        job.setMapperClass(InverseCitySumMapper.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapOutputKeyClass(FloatWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Sort with descending float order
        job.setSortComparatorClass(DescendingFloatComparator.class);

        // Use the default Reducer, which simply transforms (K: V1, V2) => (K: V1), (K: V2)
        job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running and Viewing Results

• Here are the terminal commands for executing and viewing the outputs for these 2 MapReduce jobs:

$ hadoop jar donors.jar mapreduce.donation.DonationsSumByCity donors/donations.seqfile donors/output_sumbycity

$ hdfs dfs -cat donors/output_sumbycity/*
[...]
ROCKWALL 8422.99
ROCKWELL 80.0
ROCKWOOD 9224.17
[...]

$ hadoop jar donors.jar mapreduce.donation.OrderBySumDesc donors/output_sumbycity donors/output_orderbysumdesc

$ hdfs dfs -cat donors/output_orderbysumdesc/* | head
1.71921696E8
2.5504284E7 NEW YORK
1.5451513E7 SAN FRANCISCO
6163194.0 CHICAGO
5085116.5 SEATTLE

• As expected, the output of the first job is a plain text list of <city, sum> ordered by city name. The second job generates a list of <sum, city> sorted by descending sum.

• Execution times:

▪ The first job took an average of 1 min 25 sec on my cluster.


▪ This second job took an average of 1 min 02 sec on my cluster.


Signature of Faculty: Grade:

Practical – 6
AIM: Download any two Big Data Sets from authenticated websites.

1. Yelp Dataset:
• Website: https://www.yelp.com/dataset


2. Kaggle:
• Website: https://www.kaggle.com/datasets

Signature of Faculty: Grade:



Practical – 7
AIM: Explore Spark and Implement Word count application using Spark.

Apache Spark:
• Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.

• Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
respective system, it reduces the management burden of maintaining separate tools.

Spark Word Count

1. Create a directory in HDFS where the text file will be kept.


$ hadoop fs -mkdir /spark
2. Upload the data file on Hadoop in the specific directory.
$ hadoop fs -put <fileLocation> URI


3. Now, follow the below command to open the spark in Scala mode.
$ spark-shell

4. Let’s create an RDD by using the following command.

scala > val data = sc.textFile("sparkdata.txt");

• Now, we can read the generated result by using the following command.

scala > data.collect;


5. Here, we split the existing data in the form of individual words by using the following command.
scala > val splitdata = data.flatMap(line=>line.split(" "));

Now, we can read the generated result by using the following command.

scala > splitdata.collect;


6. Map each word to a (word, 1) pair so that it can be counted.

scala > val mapdata = splitdata.map(word=>(word,1));

7. Perform the reduce operation.

scala > val reducedata = mapdata.reduceByKey(_+_);

we can read the generated result by using the following command.

scala > reducedata.collect;
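For reference, the whole word count can also be written as a single chained transformation in the Spark shell (an equivalent sketch, assuming the same sparkdata.txt input file):

scala > sc.textFile("sparkdata.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect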

Signature of Faculty: Grade:


Practical – 8
AIM: Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.

Create a folder on HDFS under /user/cloudera HDFS Path


javachain~hadoop]$ hadoop fs -mkdir javachain

Move the text file from local file system into newly created folder called javachain
javachain~hadoop]$ hadoop fs -put ~/Desktop/student.txt javachain/

Create Empty table STUDENT in HIVE

hive> create table student


> ( std_id int,
> std_name string,
> std_grade string,
> std_addres string)
> partitioned by (country string)
> row format delimited
> fields terminated by ','
>;
OK
Time taken: 0.349 seconds

Load Data from HDFS path into HIVE TABLE.

hive> load data inpath 'javachain/student.txt' into table student


partition(country='usa');
Loading data to table default.student partition (country=usa)
chgrp: changing ownership of
'hdfs://quickstart.cloudera:8020/user/hive/warehouse/student/country=usa/stud
ent.txt': User does not
belong to hive
Partition default.student{country=usa} stats: [numFiles=1, numRows=0,
totalSize=120, rawDataSize=0]
OK
Time taken: 1.048 seconds


Select the values in the Hive table.

hive> select * from student;


OK
101 'JAVACHAIN' 3RD 'USA usa
102 'ANTO' 10TH 'ENGLAND' usa
103 'PRABU' 2ND 'INDIA' usa
104 'KUMAR' 4TH 'USA' usa
105 'jack' 2ND 'INDIA' usa
Time taken: 0.553 seconds, Fetched: 5 row(s)
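
• The join queries below refer to two additional Hive tables, CUSTOMERS and ORDERS, which are not created above. A minimal sketch of the assumed schema (column names taken from the queries themselves; in newer Hive versions the DATE column, being a reserved word, has to be quoted with backticks as shown) would be:

hive> CREATE TABLE CUSTOMERS (ID INT, NAME STRING, AGE INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> CREATE TABLE ORDERS (OID INT, `DATE` STRING, CUSTOMER_ID INT, AMOUNT FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';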

JOIN
• The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT


FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

• On successful execution of the query, you get to see the following response:

LEFT OUTER JOIN

• A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.


• The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE


FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

• On successful execution of the query, you get to see the following response :

RIGHT OUTER JOIN

• The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and
ORDER tables.

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

• On successful execution of the query, you get to see the following response:


Full Outer Join:


• The following query demonstrates FULL OUTER JOIN between CUSTOMER and
ORDER tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE


FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

Signature of Faculty: Grade:


Practical – 9
AIM: Implementation of Matrix algorithms in Spark SQL programming.

• Code:

matrix_multiply.py

from pyspark import SparkConf, SparkContext
import sys, operator


def add_tuples(a, b):
    return list(sum(p) for p in zip(a, b))


def permutation(row):
    rowPermutation = []
    for element in row:
        for e in range(len(row)):
            rowPermutation.append(float(element) * float(row[e]))
    return rowPermutation


def main():
    input = sys.argv[1]
    output = sys.argv[2]

    conf = SparkConf().setAppName('Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    # Read the matrix file, split each line into its elements and cache the RDD.
    row = sc.textFile(input).map(lambda row: row.split(' ')).cache()
    ncol = len(row.take(1)[0])
    intermediateResult = row.map(permutation).reduce(add_tuples)

    # Reshape the flat result back into ncol-wide rows and write it out.
    outputFile = open(output, 'w')
    result = [intermediateResult[x:x + ncol] for x in range(0, len(intermediateResult), ncol)]
    for row in result:
        for element in row:
            outputFile.write(str(element) + ' ')
        outputFile.write('\n')
    outputFile.close()

    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)


if __name__ == "__main__":
    main()
matrix_multiply_sparse.py

from pyspark import SparkConf, SparkContext
import sys, operator
from scipy import *
from scipy.sparse import csr_matrix


def createCSRMatrix(input):
    # Each input token has the form "column:value"; build a 1 x 100 sparse row.
    row = []
    col = []
    data = []

    for values in input:
        value = values.split(':')
        row.append(0)
        col.append(int(value[0]))
        data.append(float(value[1]))

    return csr_matrix((data, (row, col)), shape=(1, 100))


def multiplyMatrix(csrMatrix):
    csrTransponse = csrMatrix.transpose(copy=True)
    return (csrTransponse * csrMatrix)


def formatOutput(indexValuePairs):
    return ' '.join(map(lambda pair: str(pair[0]) + ':' + str(pair[1]), indexValuePairs))


def main():
    input = sys.argv[1]
    output = sys.argv[2]

    conf = SparkConf().setAppName('Sparse Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    sparseMatrix = sc.textFile(input).map(lambda row: row.split(' ')) \
        .map(createCSRMatrix).map(multiplyMatrix).reduce(operator.add)

    # Walk the CSR structure row by row and write each row in "column:value" form.
    outputFile = open(output, 'w')
    for row in range(len(sparseMatrix.indptr) - 1):
        col = sparseMatrix.indices[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        data = sparseMatrix.data[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        indexValuePairs = zip(col, data)
        formattedOutput = formatOutput(indexValuePairs)
        outputFile.write(formattedOutput + '\n')
    outputFile.close()


if __name__ == "__main__":
    main()
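
• Both scripts are submitted to Spark from the command line; an illustrative invocation (the input and output file names here are assumptions):

spark-submit matrix_multiply.py /path/to/matrix.txt /path/to/output.txt
spark-submit matrix_multiply_sparse.py /path/to/sparse_matrix.txt /path/to/sparse_output.txt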


Signature of Faculty: Grade:


Practical – 10
AIM: Create A Data Pipeline Based on Messaging Using PySpark and Hive -Covid-19 Analysis.

Building data pipeline for Covid-19 data analysis using Bigdata technologies and Tableau

• The purpose is to collect the real time streaming data from COVID19 open API for every
5 minutes into the ecosystem using NiFi and to process it and store it in the data lake on
AWS.
• Data processing includes parsing the data from complex JSON format to CSV format, then publishing to Kafka for persistent delivery of messages into PySpark for further processing (a minimal PySpark sketch of this step is shown after this list).
• The processed data is then fed into an output Kafka topic, which is in turn consumed by NiFi and stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the process is orchestrated using Airflow to run at every time interval. Finally, KPIs are visualised in Tableau.
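
• The Kafka-to-PySpark step referred to above can be sketched as follows. This is a minimal, illustrative example only: the topic names, broker address, CSV schema and checkpoint path are assumptions, and it requires the spark-sql-kafka connector package to be available to Spark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("Covid19KafkaPipeline").getOrCreate()

# Read the raw messages that NiFi publishes into the input Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid_raw")
       .load())

# Kafka delivers the payload as bytes; cast it to string and split the CSV fields.
parsed = (raw.selectExpr("CAST(value AS STRING) AS csv")
          .select(split(col("csv"), ",").alias("f"))
          .select(col("f")[0].alias("state"),
                  col("f")[1].cast("int").alias("confirmed"),
                  col("f")[2].cast("int").alias("recovered"),
                  col("f")[3].cast("int").alias("deaths")))

# Write the processed records to the output topic that NiFi consumes and lands in HDFS.
query = (parsed.selectExpr("state AS key", "to_json(struct(*)) AS value")
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "covid_processed")
         .option("checkpointLocation", "/tmp/covid_checkpoint")
         .start())

query.awaitTermination()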

• Data Architecture:


• Tools used:
1. Nifi -nifi-1.10.0
2. Hadoop -hadoop_2.7.3
3. Hive-apache-hive-2.1.0
4. Spark-spark-2.4.5
5. Zookeeper-zookeeper-2.3.5
6. Kafka-kafka_2.11-2.4.0
7. Airflow-airflow-1.8.1
8. Tableau

Signature of Faculty: Grade:


Practical – 11
AIM: Explore NoSQL database like MongoDB and perform basic CRUD operation.

• Start the Mongo shell.

• Display all databases (show dbs).


1. Create Operation

To insert one document:
db.collection.insertOne()
To insert multiple documents:
db.collection.insertMany()

2. Update Operations

Update a single document:
db.collection.updateOne()
Update multiple documents:
db.collection.updateMany()
Replace a single document:
db.collection.replaceOne()

3. Read Operations

To display all documents:
db.collection.find()


4. Delete Operation

To delete a single document:
db.collection.deleteOne()
To delete multiple documents:
db.collection.deleteMany()
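
A short illustrative session tying these operations together (the database name college and the collection name students are assumptions made for this example):

> use college
> db.students.insertOne({ enrollment: "190280107002", name: "Dev", branch: "CE" })
> db.students.insertMany([{ name: "A", sem: 7 }, { name: "B", sem: 7 }])
> db.students.find({ branch: "CE" })
> db.students.updateOne({ name: "Dev" }, { $set: { sem: 7 } })
> db.students.deleteOne({ name: "B" })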

Signature of Faculty: Grade:


Practical – 12
AIM: Case study based on the concept of Big Data Analytics.

1. Walmart:

• Walmart is the largest retailer in the world and the world’s largest company by revenue,
with more than 2 million employees and 20000 stores in 28 countries.
• It started making use of big data analytics much before the term Big Data came into the picture. Walmart uses data mining to discover patterns that can be used to provide product recommendations to the user, based on which products were bought together.
• By applying effective data mining, Walmart has increased its customer conversion rate. It has been speeding along big data analysis to provide best-in-class e-commerce technologies with a motive to deliver a superior customer experience.
• The main objective of holding big data at Walmart is to optimize the shopping
experience of customers when they are in a Walmart store.
• Big data solutions at Walmart are developed with the intent of redesigning global
websites and building innovative applications to customize the shopping experience for
customers whilst increasing logistics efficiency.
• Hadoop and NoSQL technologies are used to provide internal customers with access to
real-time data collected from different sources and centralized for effective use.

2. Uber:

• Uber is the first choice for people around the world when they think of moving people
and making deliveries.
• It uses the personal data of the user to closely monitor which features of the service are used most, to analyze usage patterns and to determine where the services should be more focused.
• Uber focuses on the supply of and demand for its services, due to which the prices of the services provided change. Therefore, one of Uber’s biggest uses of data is surge pricing.
• For instance, if you are running late for an appointment and you book a cab in a
crowded place then you must be ready to pay twice the amount.

• For example, on New Year’s Eve, the price for driving one mile can go from 200 to 1,000. In the short term, surge pricing affects the rate of demand, while long-term use could be the key to retaining or losing customers. Machine learning algorithms are used to determine where the demand is strong.

3. Netflix:

• It is the most loved American entertainment company specializing in online on-demand


streaming video for its customers.
• With Big Data, Netflix is able to predict what exactly its customers will enjoy watching. As such, Big Data analytics is the fuel that fires the ‘recommendation engine’ designed to serve this purpose.
• More recently, Netflix started positioning itself as a content creator, not just a
distribution method.
• Unsurprisingly, this strategy has been firmly driven by data. Netflix’s recommendation
engines and new content decisions are fed by data points such as what titles customers
watch, how often playback stopped, ratings are given, etc.
• The company’s data structure includes Hadoop, Hive and Pig with much other
traditional business intelligence.
• Netflix shows us that knowing exactly what customers want is easy to understand if the
companies just don’t go with the assumptions and make decisions based on Big Data.

4. eBay:

• A big technical challenge for eBay as a data-intensive business is to exploit a system that can rapidly analyze and act on data as it arrives (streaming data).
• There are many rapidly evolving methods to support streaming data analysis. eBay is
working with several tools including Apache Spark, Storm, Kafka.
• It allows the company’s data analysts to search for information tags that have been
associated with the data (metadata) and make it consumable to as many people as
possible with the right level of security and permissions (data governance).
• The company has been at the forefront of using big data solutions and actively
contributes its knowledge back to the open-source community.


5. Procter & Gamble:

• Procter & Gamble, whose products we all use 2-3 times a day, is a 179-year-old company. It has recognized the potential of Big Data and put it to use in business units around the globe.
• P&G has put a strong emphasis on using big data to make better, smarter, real-time
business decisions. The Global Business Services organization has developed tools,
systems, and processes to provide managers with direct access to the latest data and
advanced analytics.
• Therefore, despite many emerging competitors, P&G, being one of the oldest companies, still holds a great share of the market.

Signature of Faculty: Grade:


Practical 12

AIM: Case study based on the concept of Big Data Analytics. Prepare presentation
in the group of 4. Submit PPTs.

(Figures 1-30: slides of the group presentation submitted for this case study; the slide images are not reproduced here.)