
Department of Information Technology

BIG DATA ANALYTICS LAB


(19CS433)

Regulation – R19

Lab manual for the Academic Year (2022-23)

IV B.Tech I Semester

DEPARTMENT OF
INFORMATION TECHNOLOGY

Vadlamudi, 522 213, AP, India

Name of the Faculty: Mr. SRIKANTH YADAV.M


Asst. Professor,
Department of IT

Vision:
To become a centre of excellence in technical and knowledge-based education, utilizing the potential of
emerging technologies in the field of Information Technology, with a deep passion for wisdom, culture, and
values.

Mission:
• Impart quality education to the students through modern teaching methodologies.
• Produce employable engineers with the skills required by industry.
• Enable students and faculty members to carry out research and develop IT-based solutions that meet societal and
industry requirements.

Programme Educational Objectives (PEOs)


Graduates of the Information Technology programme should be able to:

PEO1: Evolve as globally competent computer professionals possessing leadership skills for developing
innovative solutions in multidisciplinary domains.
PEO2: Excel as socially committed individuals having high ethical values and empathy for the needs of
society.
PEO3: Succeed in employment and professional practice, or pursue postgraduate studies.
PEO4: Engage in lifelong learning to adapt to technological advancements in the emerging areas of
computer applications.

Programme Specific Outcomes (PSOs)


PSO1: Develop a competitive edge in basic technical skills of computer applications like Programming
Languages, Algorithms and Data Structures, Databases and Software Engineering.
PSO2: Able to identify, analyze and formulate solutions for problems using computer applications.
PSO3: Attempt to work professionally with a positive attitude as an individual or in multidisciplinary
teams and to communicate effectively among the stakeholders.

Programme Outcomes (POs)


The graduates of Information Technology will be able to:
PO1: Design and develop reliable software applications for social needs and excel in IT-enabled
services.
PO2: Analyse and identify customer requirements in multidisciplinary domains, create high-level
designs and implement robust software applications using the latest technological skills.
PO3: Proficiently design innovative solutions for solving real-life business problems and address
business development issues with a passion for quality, competency and a holistic approach.
PO4: Perform professionally, with social, cultural and ethical responsibility, as an individual as well as in
multifaceted teams, with a positive attitude.
PO5: Adapt to new technologies and constantly upgrade skills with an attitude towards
independent and lifelong learning.


PREFACE
Prerequisites: SQL, Big Data Analytics

About the LAB:

This Lab is aimed at providing knowledge on data cube construction, OLAP operations on the database,
Decision tree induction, and attributes relevance analysis through Information gain method, correlation
coefficient methods, various classifications, and clustering algorithms usage through the WEKA (Waikato
Environment for Knowledge Analysis) software and R statistical programming.

Waikato Environment for Knowledge Analysis (Weka) is a suite of machine learning software written in Java,
developed at the University of Waikato, New Zealand. It is free software licensed under the GNU General
Public License.

Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling,
together with graphical user interfaces for easy access to these functions. Weka supports several standard
data analysis tasks, more specifically: data preprocessing, clustering, classification, regression, visualization,
and feature selection. Weka's techniques are predicated on the assumption that the data is available as one
flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or
nominal attributes, but some other attribute types are also supported).

Weka provides access to SQL databases using Java Database Connectivity and can process the result returned
by a database query. Weka provides access to deep learning with Deeplearning4j. It is not capable of
multi-relational analysis, but there is separate software for converting a collection of linked database tables
into a single table suitable for processing with Weka. Another important area currently not covered by the
algorithms included in the Weka distribution is sequence modeling.

Relevance to industry:
This lab knowledge helps in analyzing the data. Many industries are using data analysis techniques for fraud
detection, stock market prediction, image processing, etc. Data analysis techniques are used for making business
decisions. A student can gain a good knowledge of various preprocessing, classification, and clustering methods
by learning this subject.

Latest Technologies:

1. Hadoop
2. MapReduce
3. Hive, PIG, Spark
Lab Evaluation Procedure:
The performance of a student in each lab is evaluated continuously during the semester. The marks awarded
through continuous evaluation are referred to as internal marks. A comprehensive end-semester examination is
also conducted, and the marks awarded in this examination are referred to as external marks.
The maximum sum of internal and external assessment marks is 100, in the ratio of 50:50.

Internal Evaluation - 50 marks

External Evaluation - 50 marks
Internal Evaluation Criteria:
S.No  Component                                          Marks
1     Viva and Interaction                                  10
2     Experimentation and Data Collection                   20
3     Analysis of Experimental Data and Interpretation      10
4     Knowledge of Outcome and Skills                       10
      Total                                                 50

External Evaluation Criteria:

Sl.No  Component                                              Internal Examiner  External Examiner  Total
1      Objective and Procedure Write-up Including Outcomes            5                  5            10
2      Experimentation and Data Collection                            5                  5            10
3      Computation of Results                                         5                  5            10
4      Analysis of Results and Interpretation                         5                  5            10
5      Viva Voce                                                      5                  5            10
       Total Marks                                                   20                 30            50


INDEX

S.No Description Page No.


1 Syllabus 6
2 Course objectives and outcomes 7
3 List of experiments 8
4 Lab set up procedure (infrastructure, software, etc.) 8
5 Installation procedure 9
6 Experiments 13
7 Viva-voce questions 106
8 Minor Projects List 120


19CS433 BIG DATA ANALYTICS LAB

Syllabus

1. HDFS basic command-line file operations.


2. HDFS monitoring User Interface.
3. WordCount Map Reduce program using Hadoop.
4. Implementation of word count with combiner Map Reduce program.
5. Practice on Map Reduce monitoring User Interface
6. Implementation of Sort operation using MapReduce
7. MapReduce program to count the occurrence of similar words in a file by using
partitioner.
8. Design a MapReduce solution to find the years whose average sales are greater than 30. The
input file format has the year, the sales of all months, and the average sales: Year Jan Feb Mar
April May Jun July Aug Sep Oct Nov Dec Average.
9. MapReduce program to find Dept-wise salary. Input format: Empno EmpName Dept Salary.
10. Creation of Database using Hive.
11. Creation of partitions and buckets using Hive.
12. Practice of advanced features in Hive Query Language: RC File & XML data
processing.
13. Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter
the data.
14. Implementation of Word count using Pig.
15. Implementation of word count using Spark RDDs.
16. Filter the log data using Spark RDDs.

Text Book:
1. Big Data Analytics 2ed, Seema Acharya, Subhashini Chellappan, Wiley Publishers, 2020
Reference Books:

1. Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, "Professional Hadoop Solutions",
Wiley, ISBN: 9788126551071, 2015.
2. Chris Eaton, Dirk deRoos, et al., "Understanding Big Data", McGraw Hill, 2012.
3. Tom White, "Hadoop: The Definitive Guide", O'Reilly, 2012.
4. Vignesh Prajapati, "Big Data Analytics with R and Hadoop", Packt Publishing, 2013.


Course Description & Objectives:

This course gives an overview of Big Data, i.e. storage, retrieval and processing of big data. The focus
will be on the “technologies”, i.e., the tools/algorithms that are available for storage, processing of Big Data
and a variety of analytics.

Course Outcomes:
After completion of this course, a student will be able to:

CO1 (maps to PO1): Understand Big Data and its analytics in the real world.
CO2 (maps to PO2): Use Big Data frameworks like Hadoop and NoSQL to efficiently store and process Big Data to generate analytics.
CO3 (maps to PO3): Design algorithms to solve data-intensive problems using the MapReduce paradigm.
CO4 (maps to PO4): Design and implement Big Data analytics using Pig and Spark to solve data-intensive problems and to generate analytics.
CO5 (maps to PO5): Analyse Big Data using Hive.

Skills:
• Build and maintain reliable, scalable, distributed systems with Apache Hadoop.
• Develop MapReduce-based applications for Big Data.
• Design and build Big Data applications using Hive and Pig.
• Learn tips and tricks for Big Data use cases and solutions.

Mapping Of Course Outcomes with Program Outcomes:


PO1 PO2 PO3 PO4 PO5 PSO1 PSO2 PSO3
CO1 √ √ √ √ √
CO2 √ √ √ √ √
CO3 √ √ √ √ √
CO4 √ √ √ √ √
CO5 √ √ √ √ √


List of Experiments:

S.No  Program Name (CO)

1.  HDFS basic command-line file operations (CO2)
2.  HDFS monitoring User Interface (CO4)
3.  WordCount MapReduce program using Hadoop (CO4)
4.  Implementation of word count with combiner MapReduce program (CO3)
5.  Practice on MapReduce monitoring User Interface (CO3)
6.  Implementation of Sort operation using MapReduce (CO3)
7.  MapReduce program to count the occurrence of similar words in a file by using partitioner (CO3)
8.  Design MapReduce solution to find the years whose average sales is greater than 30; the input file
    format has year, sales of all months and average sales: Year Jan Feb Mar April May Jun July Aug Sep
    Oct Nov Dec Average (CO3)
9.  MapReduce program to find Dept wise salary: Empno EmpName Dept Salary (CO5)
10. Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter the data (CO1)
11. Implementation of Word count using Pig (CO3)
12. Creation of Database and tables using Hive query language (CO5)
13. Creation of partitions and buckets using Hive (CO1)
14. Practice of advanced features in Hive Query Language: RC File & XML data processing (CO3)
15. Implementation of word count using Spark RDDs (CO5)
16. Filter the log data using Spark RDDs (CO1)


LAB Setup procedure:


1. Installation of a stable version of the Hadoop software
2. Installation of Hive, Pig and Spark
3. Installation of the Cloudera QuickStart VM
4. Installation of VMware Workstation/Player


Installation of Hadoop

Install and run Hadoop in standalone mode, pseudo mode and fully distributed cluster.

Step-1: Verify that Java is installed on your system.

Open a terminal and run:
$ java -version
If you have already installed Java, move to Step-2. If not, install it:
$ sudo apt-get update
$ sudo apt-get install default-jdk
Now check the version once again:
$ java -version

Standalone mode.
Step-2: Download a stable Hadoop release from the Apache Hadoop downloads page, then extract the archive:
$ tar -xzvf hadoop-2.7.3.tar.gz
//Change the version number if needed to match the Hadoop version you have downloaded.//

Step-3: Now we are moving this extracted file to /usr/local, suitable for local installs.
$ sudo mv hadoop-2.7.3 /usr/local/hadoop

Step-4: Now go to the Hadoop distribution directory using terminal


$ cd /usr/local/hadoop
Let's see what's inside the Hadoop folder:
etc — the configuration files for the Hadoop environment.
bin — the hadoop command and the other executables used to run jobs.
share — the jars (Hadoop libraries) that are required when you write MapReduce jobs.

Step-5: Hadoop command in the bin folder is used to run jobs in Hadoop.
$ bin/hadoop

Step-6: jar command is used to run the MapReduce jobs on Hadoop cluster
$ bin/hadoop jar


Step-7: Now we will run an example MapReduce job to ensure that our standalone install works.
Create an input directory to hold the input files, then run the MapReduce command on it. The Hadoop
configuration and command files make a convenient text-file input for our MapReduce job.
$ mkdir input
$ cp etc/hadoop/* input/

The job is run using the bin/hadoop command. The jar argument indicates that the MapReduce operation is
packaged in a Java archive. Here we use the hadoop-mapreduce-examples jar file that comes with the
installation; the jar name differs based on the version you are installing. From the Hadoop install
directory, type:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
If you are not in the correct directory, you will get an error saying "Not a valid JAR". If this issue
persists, check that the location of the jar file is correct for your system.
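As a quick end-to-end check, you can also run one of the bundled examples (for instance wordcount) on the
input directory created above. This is a minimal sketch; the jar version must match the release you
downloaded, and the output directory must not already exist:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output
$ cat output/*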

Running the example checks the working of standalone mode: if the command completes without errors, the
MapReduce example has run successfully on the standalone setup.


Experiment 01
Problem statement:
HDFS basic command-line file operations.

Objective: To understand the basic hadoop commands

1. To Check hadoop version

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hadoop version

2. To check the Java compiler version

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$javac -version

3. To update the system package index

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$sudo apt-get update

4. To format namenode

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs namenode -format

5. To start hadoop services

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$start-all.sh

6. To see the services started

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$jps

7. To Create new directory

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -mkdir /msy

Or

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hadoop fs -mkdir /msy

8. To Remove a file in the specified path:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -rm <src>

Eg. hdfs dfs -rm /msy/abc.txt

9. To Copy file from local file system to hdfs:



vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$ hdfs dfs -copyFromLocal <src> <dst>

Eg. hdfs dfs -copyFromLocal /home/vignan/sample.txt /msy/abc1.txt

10. To display a list of contents in a directory:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -ls <path>

Eg. hdfs dfs -ls /msy

11. To display contents in a file:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -cat <path>

Eg. hdfs dfs -cat /msy/abc1.txt

12. To copy file from hdfs to the local file system:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -copyToLocal <src> <dst>

Eg. hdfs dfs -copyToLocal /msy/abc1.txt /home/vignan/Desktop/sample.txt

13. To display the last few lines of a file:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -tail <path>

Eg. hdfs dfs -tail /msy/abc1.txt

14. To display aggregate length of the file in bytes:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -du <path>

Eg. hdfs dfs -du /msy

15. To count the number of directories, files, and bytes under the given path:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -count <path>

Eg. hdfs dfs -count /msy

Output: 1 1 60 (directory count, file count, content size in bytes)

16. To remove a directory from HDFS:

vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -rm -r <path>

Eg. hdfs dfs -rm -r /msy



Experiment 02
HDFS monitoring User Interface

Let us access the HDFS web console.


1. Access the link http://MASTER_NODE:50070/ using your browser, and verify that you can
see the HDFS NameNode web UI. Here, replace MASTER_NODE with the IP address of the master
node running the HDFS NameNode.
2. This page shows the current status of the HDFS installation, including the number of live nodes,
the total storage, and the storage used by each node. It also allows users to browse the
HDFS filesystem.


Experiment 03
Problem statement: Wordcount Map Reduce program using standalone Hadoop
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount
{
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable>
{

private IntWritable result = new IntWritable();


public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values) { sum+= val.get();
}

result.set(sum);
context.write(key, result);
}
}

// driver: configures and submits the job (referenced as class WordCount in the run step below)
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Steps:
1. Create one folder on the Desktop named "WordCountTutorial"
a. Paste the WordCount.java file into it
b. Create a folder named "Input_Data" -> create an Input.txt file (enter some words)
c. Create a folder named "tutorial_classes"
2. export HADOOP_CLASSPATH=$(hadoop classpath)
3. echo $HADOOP_CLASSPATH
4. hadoop fs -mkdir /WordCountTutorial
5. hadoop fs -mkdir /WordCountTutorial/input
6. hadoop fs -put /home/hadoop/Desktop/WordCountTutorial/Input_Data/Input.txt
/WordCountTutorial/input
7. Change the current directory to the tutorial directory
cd '/home/hadoop/Desktop/WordCountTutorial'
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:~/Desktop/WordCountTutorial$
8. Compile the java code
javac -classpath ${HADOOP_CLASSPATH} -d
/home/hadoop/Desktop/WordCountTutorial/tutorial_classes
/home/hadoop/Desktop/WordCountTutorial/WordCount.java
9. Put the output files in one JAR file
jar -cvf firstTutorial.jar -C tutorial_classes/ .
10. Run JAR file
hadoop jar /home/hadoop/Desktop/WordCountTutorial/firstTutorial.jar WordCount
/WordCountTutorial/input /WordCountTutorial/output
11. See the output
hadoop dfs -cat /WordCountTutorial/output/*

Output:


Experiment 04
Problem statement:
Implementation of word count with combiner Map Reduce program

CODE: Combiner/Reducer Code


ReduceClass.java
package com.javacodegeeks.examples.wordcount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException
{
int sum = 0;
Iterator<IntWritable> valuesIt = values.iterator();
//For each key-value pair, get the value and add it to the sum
//to get the total occurrences of a word
while(valuesIt.hasNext())
{
sum = sum + valuesIt.next().get();
}
//Write the word and its total occurrences as a key-value pair to the context
context.write(key, new IntWritable(sum));
}
}
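The listing above gives only the reduce class, which doubles as the combiner; the combiner itself is wired in
by the job driver, which is not reproduced in this copy. The following is a minimal driver sketch (the class
names WordCountWithCombiner and MapClass are assumptions, not part of the original example) showing where
job.setCombinerClass(...) is set:

package com.javacodegeeks.examples.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    // Minimal tokenizing mapper emitting (word, 1) pairs (assumed helper class).
    public static class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(MapClass.class);
        // ReduceClass is reused as the combiner: summing partial counts on the
        // map side is safe because addition is associative and commutative.
        job.setCombinerClass(ReduceClass.class);
        job.setReducerClass(ReduceClass.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}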

Output:


Experiment 05
Problem statement:
Practice on Map Reduce monitoring User Interface

Aim: Practice on Map Reduce monitoring User Interface


PROCEDURE:
Let us access the MapReduce web console.
1. Access the link http://MASTER_NODE:8088/ using your browser, and verify that you can
see the YARN ResourceManager web UI. Here, replace MASTER_NODE with the IP address of the
master node running the ResourceManager.
2. This page shows the current status of the cluster: the running applications with their application
IDs, the completed and failed applications, and the resources used.


Experiment 06
Problem statement:
Implementation of Sort operation using MapReduce
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Sort{
public static class SortMapper extends Mapper<LongWritable,Text,Text,Text>{
Text comp_key = new Text();
protected void map(LongWritable key, Text value, Context context) throws IOException,InterruptedException{
String[] token = value.toString().split(",");
// build a composite key from the first two fields so the shuffle sorts on both
comp_key.set(token[0] + "," + token[1]);
context.write(comp_key,new Text(token[0]+"-"+token[1]+"-"+token[2]));
}
}
public static class SortReducer extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,InterruptedException{
for(Text details:values){
context.write(NullWritable.get(),details);
}
}
}
public static void main(String args[]) throws IOException,InterruptedException,ClassNotFoundException{
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Sort.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true)?0:1);
}
}



Steps:


1. Create one folder on the Desktop named "SortTutorial"

a. Paste the Sort.java file into it
b. Create a folder named "Input_Data" -> create an Input.txt file (enter some comma-separated records,
   three fields per line, since the mapper splits each line on ",")
c. Create a folder named "tutorial_classes"
2. export HADOOP_CLASSPATH=$(hadoop classpath)
3. echo $HADOOP_CLASSPATH
4. hadoop fs -mkdir /SortTutorial
5. hadoop fs -mkdir /SortTutorial/input
6. hadoop fs -put /home/hadoop/Desktop/SortTutorial/Input_Data/Input.txt
/SortTutorial/input
7. Change the current directory to the tutorial directory
cd '/home/hadoop/Desktop/SortTutorial'
8. Compile the java code
javac -classpath ${HADOOP_CLASSPATH} -d
/home/hadoop/Desktop/SortTutorial/tutorial_classes
/home/hadoop/Desktop/SortTutorial/Sort.java
9. Put the output files in one JAR file
jar -cvf firstTutorial.jar -C tutorial_classes/ .
10. Run the JAR file
hadoop jar /home/hadoop/Desktop/SortTutorial/firstTutorial.jar Sort
/SortTutorial/input /SortTutorial/output
11. See the output
hdfs dfs -cat /SortTutorial/output/*

Output:


Experiment 07
Problem statement:
MapReduce program to count the occurrence of similar words in a file by using partitioner

package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable,
Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
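Note that the listing above is a plain word count with a combiner; the partitioner that the problem statement
asks for is not shown in this copy. The following is a minimal sketch of a custom partitioner for the old
mapred API used above (the class name WordPartitioner and the first-letter bucketing rule are assumptions).
It would be plugged into the driver with conf.setPartitionerClass(WordPartitioner.class) together with
conf.setNumReduceTasks(3):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // no configuration needed for this simple rule
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route words to reducers by their first letter: a-i, j-r, everything else.
        char first = Character.toLowerCase(key.toString().charAt(0));
        if (numPartitions == 1) return 0;
        if (first >= 'a' && first <= 'i') return 0;
        if (first >= 'j' && first <= 'r') return 1 % numPartitions;
        return 2 % numPartitions;
    }
}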

Output:


Experiment 08
Problem statement:
Design a MapReduce solution to find the years whose average sales are greater than 30. The input file format has the
year, the sales of all months, and the average sales. The sample input file is:

package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException
{
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens()){
lasttoken=s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));

}
}
//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer< Text, IntWritable, Text, IntWritable >
{
//Reduce function
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())

{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}
//Main function
public static void main(String args[])throws Exception
{
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Save the above program into ProcessUnits.java. The compilation and execution of the program is given below.
Compilation and Execution of ProcessUnits Program
Let us assume we are in the home directory of Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1 − Use the following command to create a directory to store the compiled java classes.
$ mkdir units


Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program.
Download the jar from mvnrepository.com. Let us assume the download folder is /home/hadoop/.
Step 3 − The following commands are used to compile the ProcessUnits.java program and to create a jar for the
program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4 − The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5 − The following command is used to copy the input file named sample.txt in the input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6 − The following command is used to verify the files in the input directory
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7 − The following command is used to run the ProcessUnits application by taking input files from the input
directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Output:


Experiment 09
Problem statement:
MapReduce program to find Dept-wise salary. The sample input file format is: Empno, EmpName, Dept, Salary.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Salary
{
public static class SalaryMapper extends Mapper <LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String[] token = value.toString().split(",");
// input format: Empno,EmpName,Dept,Salary -> key is Dept (token[2]), value is Salary (token[3])
int s = Integer.parseInt(token[3]);
IntWritable sal = new IntWritable();
sal.set(s);
context.write(new Text(token[2]),sal);
}
}
public static class SalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException,
InterruptedException
{
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key,result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Salary");
job.setJarByClass(Salary.class);
job.setMapperClass(SalaryMapper.class);

job.setReducerClass(SalaryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Output:


Experiment 10
Problem statement:
Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter the data.
Apache Pig provides a high-level procedural language (Pig Latin) for querying large data sets using Hadoop and
the MapReduce platform. A Pig Latin statement is an operator that takes a relation as input and produces
another relation as output. These operators are the main tools Pig Latin provides to operate on the data. They
allow you to transform it by sorting, grouping, joining, projecting, and filtering. Let's create two files to run the
commands:
We have two files named 'first' and 'second'. The first file contains three fields: user, url and id.

The second file contains two fields: url and rating. These two files are CSV files.

The Apache Pig operators can be classified as relational and diagnostic.
Relational Operators:
Relational operators are the main tools Pig Latin provides to operate on the data. They allow you to transform the
data by sorting, grouping, joining, projecting and filtering. This section covers the basic relational operators.

LOAD:
The LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
In this example, the LOAD operator loads data from the file 'first' to form the relation 'loading1'.
The field names are user, url and id, as shown in the sketch below.
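The screenshots of these operators are not reproduced in this copy, so the following Pig Latin sketch shows
the LOAD step together with the filtering, projection, grouping, joining and sorting operations named in the
problem statement (the comma delimiter follows the CSV description above; the file paths and the sample
filter value 'user1' are assumptions):

-- load the two sample files (assumed to be comma-separated)
loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);

-- FILTER: keep only the rows for a particular user
filtered = FILTER loading1 BY user == 'user1';

-- FOREACH ... GENERATE (projection): keep only the user and url fields
projected = FOREACH loading1 GENERATE user, url;

-- GROUP: group the records by url
grouped = GROUP loading1 BY url;

-- JOIN: join the two relations on the url field
joined = JOIN loading1 BY url, loading2 BY url;

-- ORDER (sorting): sort the joined data by rating
sorted = ORDER joined BY rating DESC;

DUMP sorted;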


Experiment 11
Problem statement:
Implementation of Word count using Pig.

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);


words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;


Experiment 12
Problem statement:
Creation of Database using hive

a. To create a database named “STUDENTS” with comments and database properties.

CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH


DBPROPERTIES ('creator' = 'JOHN');

b. To describe a database.

DESCRIBE DATABASE STUDENTS;

c. To drop database.

DROP DATABASE STUDENTS;

d. To create managed table named ‘STUDENT’.

CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t';

e. To create external table named ‘EXT_STUDENT’.

CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT,name STRING,gpa FLOAT)


ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';

f. To load data into the table from file named student.tsv.

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE


EXT_STUDENT;

g. To retrieve the student details from “EXT_STUDENT” table.

SELECT * from EXT_STUDENT;


Experiment 13
Problem statement:
Creation of partitions and buckets using Hive.
Partitions are of two types:
• STATIC PARTITION: It is up to the user to specify the partition (the segregation unit) into which the data from
the file is to be loaded.
• DYNAMIC PARTITION: The user simply states the column on which the partitioning will take place; Hive then
creates partitions based on the unique values in that column.
• Partitions split the larger dataset into more meaningful chunks.
• Hive provides two kinds of partitions: static partitions and dynamic partitions.

a. To create static partition based on “gpa” column.

CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)


PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

b. Load data into partition table from table.

INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa =4.0) SELECT rollno,


name from EXT_STUDENT where gpa=4.0;
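Only the static partition is exercised above. A minimal dynamic-partition sketch (the table name
DYNAMIC_PART_STUDENT is an assumption) first enables dynamic partitioning and then lets Hive derive the gpa
partitions from the data in EXT_STUDENT:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- the partition column must come last in the SELECT list
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;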

Bucketing is similar to partition.

• However, there is a subtle difference between partition and bucketing. In partition, you need to create a
partition for each unique value of the column. This may lead to a situation where you may end up with
thousands of partitions.
• This can be avoided using bucketing, in which you can limit the number of buckets that will be created. A
bucket is a file, whereas a partition is a directory.

a. To create a bucketed table having 3 buckets.

CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT, name STRING, grade FLOAT)
CLUSTERED BY (grade) INTO 3 BUCKETS;

b. Load data to bucketed table.

FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, gpa;
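Depending on the Hive version, the bucketed insert above may also need bucketing to be enforced explicitly
before it is run (recent Hive releases do this automatically, so treat this as an optional, version-dependent
setting):

SET hive.enforce.bucketing = true;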

c. To display the content of the first bucket.

SELECT DISTINCT GRADE FROM STUDENT_BUCKET
TABLESAMPLE(BUCKET 1 OUT OF 3 ON GRADE);

Hive supports aggregation functions like avg, count, etc.

a. To write the average and count aggregation function.

SELECT avg(gpa) FROM STUDENT;

SELECT count(*) FROM STUDENT;

b. To write the GROUP BY and HAVING clauses.

SELECT rollno, name,gpa


FROM STUDENT
GROUP BY rollno,name,gpa
HAVING gpa > 4.0;

Experiment 14
Problem statement:
Practice of advanced features in Hive Query Language: RC File & XML data processing. (Worked examples for
SerDe-based XML processing, user-defined functions, and the RCFile format are given in the SERDE,
USER-DEFINED FUNCTION (UDF) and RCFILE IMPLEMENTATION sections near the end of this manual.)


Experiment 15
Problem statement:
Implement of word count using spark RDDs.
val data=sc.textFile("sparkdata.txt")
data.collect;
val splitdata = data.flatMap(line => line.split(" "));
splitdata.collect;
val mapdata = splitdata.map(word => (word,1));
mapdata.collect;
val reducedata = mapdata.reduceByKey(_+_);
reducedata.collect;

(OR)

val textFile = sc.textFile("hdfs://...")


val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")


Experiment 16
Problem statement:
Basic RDD operations in Spark
Login into spark environment
1. Open terminal in cloudera quickstart
2. Type spark-shell
Types of RDD creation
1. Using Parallelize
scala>val data = Array (1,2,3,4,5)
scala>val distdata = sc.parallelize(data)
2. External dataset
a. create one text file on desktop data.txt
b. create one directory in HDFS
hdfs dfs -mkdir /spark
c. Load the file from local to HDFS
hdfs dfs -put /home/cloudera/Desktop/data.txt /spark/data.txt
Basic RDD Transformations
1. To print elements of RDD
a. syntax: rdd.foreach(println)
Example: lines.foreach(println)
2. MAP
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z,1))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
b, a, c
(b,1), (a,1), (c,1)
3. FILTER
val x = sc.parallelize(Array(1,2,3))

val y = x.filter(n => n%2 == 1)


println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
1, 2, 3
1, 3
4. FLATMAP
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
1, 2, 3
1, 100, 42, 2, 200, 42, 3, 300, 42
5. GROUPBY
val x = sc.parallelize(
Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
output:
['John', 'Fred', 'Anna', 'James']
[('A',['Anna']),('J',['John','James']),('F',['Fred'])]
6. GROUPBYKEY
val x = sc.parallelize(
Array(('B',5),('B',4),('A',3),('A',2),('A',1)))
val y = x.groupByKey()
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
[('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)]

[('A', [2, 3, 1]), ('B', [5, 4])]


7. SAMPLE
val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))
output:
[1, 2, 3, 4, 5]
[1, 3]
8. UNION
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()
output:
[1, 2, 3]
[3, 4]
[[1], [2, 3], [3, 4]]
9. JOIN
val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))
output:
[("a", 1), ("b", 2)]
[("a", 3), ("a", 4), ("b", 5)]
[('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
10. DISTINCT
val x = sc.parallelize(Array(1,2,3,3,4))
val y = x.distinct()

println(y.collect().mkString(", "))
output:
[1, 2, 3, 3, 4]
[1, 2, 3, 4]
Basic RDD ACTIONS
1. COLLECT
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
val xOut = x.glom().collect()
println(y)
output:
[[1], [2, 3]]
[1, 2, 3]
2. REDUCE
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)
output:
[1, 2, 3, 4]
10
3. AGGREGATE
def seqOp = (data:(Array[Int], Int), item:Int) =>
(data._1 :+ item, data._2 + item)
def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) =>
(d1._1.union(d2._1), d1._2 + d2._2)

val x = sc.parallelize(Array(1,2,3,4))
val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)

output:
[1, 2, 3, 4]
(Array(3, 1, 2, 4),10)

4. MAX
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
4
5. SUM
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
7
6. MEAN
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
2.3333333
7. STDEV
val x = sc.parallelize(Array(2,4,1))

val y = x.stdev
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
1.2472191

Filter the log data using Spark RDDs.


from pyspark import SparkContext
words = sc.parallelize (
["python",
"java",
"hadoop",
"C"
]
)
words_map = words.map(lambda x: (x, 1))
mapping = words_map.collect()
print("Key value pair -> %s" % (mapping))

from pyspark import SparkContext


x = sc.parallelize([("pyspark", 1), ("hadoop", 3)])
y = sc.parallelize([("pyspark", 2), ("hadoop", 4)])
joined = x.join(y)
mapped = joined.collect()
print("Join RDD -> %s" % (mapped))


ASSIGNMENT 1: SPLIT

Objective: To learn about the SPLIT relational operator.
Problem Description: Write a Pig script to split customers for a reward program based on their life time values.

Customers    Life Time Value

Jack         25000
Smith         8000
David        35000
John         15000
Scott        10000
Joshi        28000
Ajay         12000
Vinay        30000
Joseph       21000

Input: Customers Life Time Value Jack 25000 Smith 8000 David 35000 John 15000 Scott 10000 Joshi 28000 Ajay 12000
Vinay 30000 Joseph 21000

If Life Time Value is > 10000 and <= 20000 ---> Silver Program.

If Life Time Value is >20000 ---> Gold Program.
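The SPLIT operator itself is not demonstrated elsewhere in this manual, so the following Pig Latin sketch
shows its syntax for this task (the file name customers.txt, the tab delimiter and the relation names are
assumptions):

-- load the customer data (assumed to be tab-separated: name, life time value)
customers = LOAD 'customers.txt' USING PigStorage('\t') AS (name:chararray, ltv:int);

-- split one relation into several, based on the reward thresholds
SPLIT customers INTO silver IF (ltv > 10000 AND ltv <= 20000),
                     gold   IF (ltv > 20000);

DUMP silver;
DUMP gold;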

ASSIGNMENT 2: GROUP

Objective: To learn about GROUP relational operator. Problem Description: Create a data file for below schemas:

Order: CustomerId, ItemId, ItemName, OrderDate, DeliveryDate


Customer: CustomerId, CustomerName, Address, City, State, Country
1. Load Order and Customer Data.
2. Write a Pig Latin Script to determine number of items bought by each customer.


ASSIGNMENT 3: COMPLEX DATA TYPE — BAG
Objective: To learn about the complex datatype — bag — in Pig.
Problem Description:

1. Create a file which contains bag dataset as shown below.

User ID     From                   To
user1001    user1001@sample.com    {(user003@sample.com), (user004@sample.com), (user006@sample.com)}
user1002    user1002@sample.com    {(user005@sample.com), (user006@sample.com)}
user1003    user1003@sample.com    {(user001@sample.com), (user005@sample.com)}
2. Write a Pig Latin statement to display the names of all users who have sent emails and also a list of all the people that
they have sent the email to.

3. Store the result in a file.

ASSIGNMENT 1: HIVEQL

Objective: To learn about HiveQL statements.
Problem Description:

Create a data file for below schemas:


Order: CustomerId, ItemId, ItemName, OrderDate, DeliveryDate
Customer: CustomerId, CustomerName, Address, City, State, Country
1. Create a table for Order and Customer Data.
2. Write a HiveQL to find number of items bought by each customer.

ASSIGNMENT 2: PARTITION
Objective: To learn about partitions in hive.
Problem Description: Create a partition table for customer schema to reward the customers based on their life time values.
Input:
Customer ID Customers Life Time Value
1001 Jack 25000
1002 Smith 8000
1003 David 12000
1004 John 15000
1005 Scott 12000
1006 Joshi 28000
1007 Ajay 12000
1008 Vinay 30000
1009 Joseph 21000

Create a partition table if life time value is 12000.


Create a partition table for all life time values.


SERDE
SerDe stands for Serializer/Deserializer.
1. Contains the logic to convert unstructured data into records.
2. Implemented using Java.
3. Serializers are used at the time of writing.
4. Deserializers are used at query time (SELECT statement).

Deserializer interface takes a binary representation or string of a record, converts it into a java object that Hive
can then manipulate. Serializer takes a java object that Hive has been working with and translates it into
something that Hive can write to HDFS.

Objective: To manipulate the XML data.


input:
<employee><empid>1001</empid><name>John</name><designation>Team Lead</designation></employee>
<employee><empid>1002</empid><name>Smith</name><designation>Analyst</designation></employee>

Act:
CREATE TABLE XMLSAMPLE(xmldata string);
LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;

CREATE TABLE xpath_table AS


SELECT xpath_int(xmldata,'employee/empid'),
xpath_string(xmldata,'employee/name'),
xpath_string(xmldata,'employee/designation')
FROM xmlsample;
SELECT * FROM xpath_table;

USER-DEFINED FUNCTION (UDF)


In Hive, you can use custom functions by defining the User-Defined Function (UDF).
Objective: Write a Hive function to convert the values of a field to uppercase.
Act:
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
@Description(name = "SimpleUDFExample")
public final class MyUpperCase extends UDF
{
public String evaluate(final String word)
{
return word.toUpperCase();
}
}


Note: Convert this Java Program into Jar.


ADD JAR /root/hivedemos/UpperCase.jar;
CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase'; SELECT
TOUPPERCASE(name) FROM STUDENT;

Outcome: hive> ADD JAR /root/hivedemos/UpperCase.jar;


Added [/root/hivedemos/UpperCase.jar] to class path
Added resources: [/root/hivedemos/upperCase.jar]
hive> CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase';
OK

RCFILE IMPLEMENTATION
RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on
computer clusters.

Objective: To work with RCFILE Format.

CREATE TABLE STUDENT_RC( rollno int, name string,gpa float ) STORED AS RCFILE; INSERT
OVERWRITE table STUDENT_RC SELECT * FROM STUDENT;
SELECT SUM(gpa) FROM STUDENT_RC;
