Professional Documents
Culture Documents
Regulation – R19
IV B.Tech I Semester
DEPARTMENT OF
INFORMATION TECHNOLOGY
Vision:
To become a centre of excellence in technical and knowledge-based education, utilizing the potential of
emerging technologies in the field of Information Technology, with a deep passion for wisdom, culture, and
values.
Mission:
Impart modern teaching methodologies to provide quality education to the students
Produce employable engineers based on skills required for industry
Enable students and faculty members in research to develop IT-based solutions to meet societal and
industry requirements
PEO1: Evolve as globally competent computer professionals possessing leadership skills for developing
innovative solutions in multidisciplinary domains.
PEO2: Excel as socially committed individual having high ethical values and empathy for the needs of
society.
PEO3: Prepare students to succeed in employment/profession or to pursue postgraduate studies.
PEO4: Involve in lifelong learning to adapt to technological advancements in the emerging areas of
computer applications.
Department of Information Technology
PREFACE
Prerequisites: SQL, Big Data Analytics
This lab is aimed at providing knowledge of data cube construction, OLAP operations on databases,
decision tree induction, attribute relevance analysis through the information gain and correlation
coefficient methods, and the usage of various classification and clustering algorithms through the WEKA
(Waikato Environment for Knowledge Analysis) software and R statistical programming.
Waikato Environment for Knowledge Analysis (Weka) is a suite of machine learning software written in Java,
developed at the University of Waikato, New Zealand. It is free software licensed under the GNU General
Public License.
Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling,
together with graphical user interfaces for easy access to these functions. Weka supports several standard
data mining tasks, more specifically: data preprocessing, clustering, classification, regression, visualization,
and feature selection. Weka's techniques are predicated on the assumption that the data is available as one
flat file or relation, where each data point is described by a fixed number of attributes (normally numeric
or nominal attributes, but some other attribute types are also supported).
Weka provides access to SQL databases using Java Database Connectivity and can process the result returned
by a database query. Weka provides access to deep learning with Deeplearning4j. It is not capable of
multi-relational data mining, but there is separate software for converting a collection of linked database
tables into a single table suitable for processing with Weka. Another important area currently not covered
by the algorithms included in the Weka distribution is sequence modeling.
Relevance to industry:
This lab knowledge helps in analyzing the data. Many industries are using data analysis techniques for fraud
detection, stock market prediction, image processing, etc. Data analysis techniques are used for making business
decisions. A student can gain a good knowledge of various preprocessing, classification, and clustering methods
by learning this subject.
Latest Technologies:
1. Hadoop
2. MapReduce
3. Hive, PIG, Spark
Lab Evaluation Procedure:
The performance of a student in each lab is evaluated continuously during the semester. The marks awarded
through continuous evaluation are referred to as internal marks. A comprehensive end-semester examination is
conducted, and the marks awarded for this evaluation are referred to as external marks.
The maximum sum of internal and external assessment marks is 100, in the ratio of 50:50.
Marks
Sl.No  Component                                             Internal   External   Total
                                                             Examiner   Examiner
1      Objective and procedure write-up including outcomes   5          5          10
2      Experimentation and data collection                   5          5          10
3      Computation of results                                5          5          10
4      Analysis of results and interpretation                5          5          10
5      Viva voce                                             5          5          10
       Total Marks                                           20         30         50
INDEX
Syllabus
Text Book:
1. Big Data Analytics 2ed, Seema Acharya, Subhashini Chellappan, Wiley Publishers, 2020
Reference Books:
This course gives an overview of Big Data, i.e. the storage, retrieval, and processing of big data. The focus
will be on the “technologies”, i.e., the tools/algorithms that are available for the storage and processing of
Big Data, and on a variety of analytics.
Course Outcomes:
After completion of this course, a student will be able to:
Skills:
Build and maintain reliable, scalable, distributed systems with Apache Hadoop.
Develop Map-Reduce based Applications for Big Data.
Design and build Big Data applications using Hive and Pig.
Learn tips and tricks for Big Data use cases and solutions.
List of Experiments:
Installation of Hadoop
Install and run Hadoop in standalone mode, pseudo-distributed mode, and as a fully distributed cluster.
Standalone mode.
Step-2: Download the latest version of Hadoop from the Apache Hadoop releases page, then extract it:
$ tar -xzvf hadoop-2.7.3.tar.gz
# Change the version number if needed to match the Hadoop release you downloaded.
Step-3: Move the extracted directory to /usr/local, a suitable location for local installs:
$ sudo mv hadoop-2.7.3 /usr/local/hadoop
Step-5: The hadoop command in the bin folder is used to run jobs in Hadoop:
$ bin/hadoop
Step-6: The jar subcommand is used to run MapReduce jobs on a Hadoop cluster:
$ bin/hadoop jar
Step-7: Now we will run an example MapReduce job to ensure that our standalone install works.
Create an input directory to hold the input files, then run the MapReduce command on it. The
Hadoop configuration and command files will serve as the text input for our MapReduce job.
$ mkdir input
$ cp etc/hadoop/* input/
The job is run using the bin/hadoop command. The jar subcommand indicates that the MapReduce
operation is specified in a Java archive. Here we use the hadoop-mapreduce-examples.jar file that
comes with the installation; the jar name differs based on the version you are installing. Now
move to your Hadoop install directory and type:
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar
If you are not in the correct directory, you will get a “Not a valid JAR” error as shown
below. If this issue persists, check whether the location of the jar file is correct for your system.
Experiment 01
Problem statement:
HDFS basic command-line file operations.
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hadoop version
2. To check the Java compiler version:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$javac -version
4. To start all Hadoop daemons:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$start-all.sh
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$jps
Or
15. To count the number of directories, files, and bytes under the given path:
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:/$hdfs dfs -count <path>
o/p: 1 1 60
Experiment 02
HDFS monitoring User Interface
Experiment 03
Problem statement: Wordcount Map Reduce program using standalone Hadoop
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount
{
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
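The map-shuffle-reduce flow of the program above can be sketched in plain Python. This is a toy model for intuition only; the function names and sample lines are illustrative and not part of the Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit (word, 1) for every token, like TokenizerMapper."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts per word, like IntSumReducer."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big", "data lab"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'lab': 1}
```

The actual job distributes the map and reduce phases across the cluster; the logic per key is the same.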
Steps:
1. Create one folder on the Desktop: "WordCountTutorial"
a. Paste the WordCount.java file into it
b. Create a folder named "Input_Data" -> create an Input.txt file (enter some words)
c. Create a folder named "tutorial_classes"
2. export HADOOP_CLASSPATH=$(hadoop classpath)
3. echo $HADOOP_CLASSPATH
4. hadoop fs -mkdir /WordCountTutorial
5. hadoop fs -mkdir /WordCountTutorial/input
6. hadoop fs -put /home/hadoop/Desktop/WordCountTutorial/Input_Data/Input.txt
/WordCountTutorial/input
7. Change the current directory to the tutorial directory
cd '/home/hadoop/Desktop/WordCountTutorial'
vignan@vignan-HP-Compaq-8200-Elite-SFF-PC:~/Desktop/WordCountTutorial$
8. Compile the java code
javac -classpath ${HADOOP_CLASSPATH} -d
/home/hadoop/Desktop/WordCountTutorial/tutorial_classes
/home/hadoop/Desktop/WordCountTutorial/WordCount.java
9. Put the output files in one JAR file
jar -cvf firstTutorial.jar -C tutorial_classes/ .
10. Run JAR file
hadoop jar /home/hadoop/Desktop/WordCountTutorial/firstTutorial.jar WordCount
/WordCountTutorial/input /WordCountTutorial/output
11. See the output
hdfs dfs -cat /WordCountTutorial/output/*
Output:
Experiment 04
Problem statement:
Implementation of word count with combiner Map Reduce program
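The idea behind a combiner can be sketched in plain Python: a combiner pre-aggregates each mapper's local output before the shuffle, so far fewer (word, count) pairs cross the network. The split contents and function names below are illustrative, not Hadoop API.

```python
from collections import Counter

def mapper_with_combiner(lines):
    """Combine (pre-aggregate) counts locally, as a combiner would do
    on each mapper's output before it is shuffled to the reducers."""
    local = Counter()
    for line in lines:
        local.update(line.split())
    return list(local.items())

# Two "input splits" processed by two mappers
split1 = ["big data big"]
split2 = ["data data lab"]
partial1 = mapper_with_combiner(split1)  # [('big', 2), ('data', 1)]
partial2 = mapper_with_combiner(split2)  # [('data', 2), ('lab', 1)]

# The reducer merges the partial counts into the final result
final = Counter()
for partial in (partial1, partial2):
    final.update(dict(partial))
print(dict(final))  # {'big': 2, 'data': 3, 'lab': 1}
```

Without the combiner, split1 would ship three pairs instead of two; on real data the savings are much larger.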
Output:
Experiment 05
Problem statement:
Practice on Map Reduce monitoring User Interface
Experiment 06
Problem statement:
Implementation of Sort operation using MapReduce
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Sort{
public static class SortMapper extends Mapper<LongWritable,Text,Text,Text>{
Text comp_key = new Text();
protected void map(LongWritable key, Text value, Context context) throws IOException,InterruptedException{
String[] token = value.toString().split(",");
// Build a composite key from the first two fields so records sort on both
comp_key.set(token[0] + "," + token[1]);
context.write(comp_key,new Text(token[0]+"-"+token[1]+"-"+token[2]));
}
}
public static class SortReducer extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,InterruptedException{
for(Text details:values){
context.write(NullWritable.get(),details);
}
}
}
public static void main(String args[]) throws IOException,InterruptedException,ClassNotFoundException{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "sort");
job.setJarByClass(Sort.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
// Mapper and reducer emit different key types, so set both explicitly
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
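The sort relies on Hadoop's shuffle delivering records to the reducer in key order. That behaviour can be sketched in plain Python; the record values below are invented sample data.

```python
# Records as in the Sort job: each CSV line has three fields
records = [
    "mumbai,2023,120",
    "delhi,2021,80",
    "delhi,2023,95",
    "mumbai,2021,60",
]

def composite_key(line):
    """Build the composite key from the first two fields, mirroring what
    the mapper emits; Hadoop's shuffle then sorts records by this key."""
    token = line.split(",")
    return (token[0], token[1])

# The reducer sees the records in key order and writes them out
for line in sorted(records, key=composite_key):
    token = line.split(",")
    print("-".join(token))
```

The output appears sorted first by the first field, then by the second, exactly as the composite key dictates.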
Steps:
Output:
Experiment 07
Problem statement:
MapReduce program to count the occurrences of similar words in a file using a partitioner
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable,
Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
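The program above does not define a custom partitioner; it relies on Hadoop's default HashPartitioner to route every occurrence of a word to the same reducer. That routing rule can be sketched in Python (the word list is invented sample data):

```python
def java_string_hash(s):
    """Mirror Java's String.hashCode(): h = 31*h + ch, in 32-bit arithmetic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def hash_partition(key, num_reducers):
    """Default HashPartitioner rule: (hashCode & Integer.MAX_VALUE) % numReduceTasks,
    so every occurrence of the same key goes to the same reducer."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

words = ["big", "data", "big", "lab", "data"]
parts = {w: hash_partition(w, 3) for w in words}
print(parts)  # {'big': 0, 'data': 2, 'lab': 0}
```

Because repeats of a word always land in the same partition, each reducer can compute a complete count for the words it receives.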
Output:
Experiment 08
Problem statement:
Design a MapReduce solution to find the years whose average sales is greater than 30. The input file format
has the year, the sales of all months, and the average sales. The sample input file is:
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits
{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{
//Map function
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException
{
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens()){
lasttoken=s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
}
}
//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer< Text, IntWritable, Text, IntWritable >
{
//Reduce function
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
int maxavg=30;
int val=Integer.MIN_VALUE;
while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}
//Main function
public static void main(String args[])throws Exception
{
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_electricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Save the above program into ProcessUnits.java. The compilation and execution of the program are given below.
Compilation and Execution of ProcessUnits Program
Let us assume we are in the home directory of Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1 − Use the following command to create a directory to store the compiled java classes.
$ mkdir units
Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program.
Download the jar from mvnrepository.com. Let us assume the download folder is /home/hadoop/.
Step 3 − The following commands are used to compile the ProcessUnits.java program and to create a jar for the
program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4 − The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5 − The following command is used to copy the input file named sample.txt in the input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6 − The following command is used to verify the files in the input directory
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7 − The following command is used to run the Eleunit_max application by taking input files from the input
directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
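The filtering logic of the program above (keep years whose last field exceeds the threshold) can be sketched in plain Python; the sample lines are invented and the function names are illustrative, not Hadoop API.

```python
MAX_AVG = 30

# Tab-separated lines: year, monthly sales..., average sales (last field),
# matching the format the E_EMapper above parses
lines = [
    "1980\t10\t20\t30\t45",
    "1981\t15\t25\t35\t25",
    "1982\t40\t50\t60\t50",
]

def mapper(line):
    """Emit (year, average): first token is the year, last is the average."""
    tokens = line.split("\t")
    return tokens[0], int(tokens[-1])

def reducer(pairs):
    """Keep only years whose average sales exceed the threshold."""
    return {year: avg for year, avg in pairs if avg > MAX_AVG}

result = reducer(mapper(line) for line in lines)
print(result)  # {'1980': 45, '1982': 50}
```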
Output:
Experiment 09
Problem statement:
MapReduce program to find department-wise salary. The sample input file is as follows.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Salary
{
public static class SalaryMapper extends Mapper <LongWritable, Text, Text, IntWritable>
{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String[] token = value.toString().split(",");
int s = Integer.parseInt(token[2]);
IntWritable sal = new IntWritable();
sal.set(s);
context.write(new Text(token[1]),sal);
}
}
public static class SalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException,
InterruptedException
{
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key,result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Salary");
job.setJarByClass(Salary.class);
job.setMapperClass(SalaryMapper.class);
job.setReducerClass(SalaryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output:
Experiment 10
Problem statement:
Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter the data.
Apache Pig provides a high-level procedural language, Pig Latin, for querying large data sets using Hadoop and
the MapReduce platform. A Pig Latin statement is an operator that takes a relation as input and produces
another relation as output. These operators are the main tools Pig Latin provides to operate on the data. They
allow you to transform it by sorting, grouping, joining, projecting, and filtering. Let's create two files to run the
commands:
We have two files named 'first' and 'second'. The first file contains three fields: user, url & id.
The second file contains two fields: url & rating. These two files are CSV files.
The Apache Pig operators can be classified as: Relational and Diagnostic.
Relational Operators:
Relational operators are the main tools Pig Latin provides to operate on the data. They allow you to transform
the data by sorting, grouping, joining, projecting, and filtering. This section covers the basic relational operators.
LOAD:
LOAD operator is used to load data from the file system or HDFS storage into a Pig relation.
In this example, the Load operator loads data from file ‘first’ to form relation ‘loading1’.
The field names are user, url, id.
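What these relational operators compute can be illustrated in plain Python using the two sample schemas, (user, url, id) and (url, rating). The rows below are invented for illustration; Pig's JOIN, GROUP, and FILTER operate on the same relational logic, just at cluster scale.

```python
# Toy relations matching the 'first' (user, url, id) and 'second' (url, rating) files
first = [
    ("amy", "cnn.com", 1),
    ("bob", "bbc.com", 2),
    ("amy", "bbc.com", 3),
]
second = [
    ("cnn.com", 5),
    ("bbc.com", 4),
]

# JOIN first BY url, second BY url
ratings = dict(second)
joined = [(user, url, rid, ratings[url]) for user, url, rid in first if url in ratings]

# GROUP joined BY user
grouped = {}
for row in joined:
    grouped.setdefault(row[0], []).append(row)

# FILTER: keep rows whose rating is at least 5
filtered = [row for row in joined if row[3] >= 5]
print(filtered)  # [('amy', 'cnn.com', 1, 5)]
```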
Experiment 11
Problem statement:
Implementation of Word count using Pig.
Experiment 12
Problem statement:
Creation of a database using Hive
b. To describe a database.
c. To drop a database.
CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t';
Experiment 13
Problem statement:
Creation of partitions and buckets using Hive.
Partitions split a larger dataset into more meaningful chunks. Hive provides two kinds of partition:
STATIC PARTITION: it is up to the user to specify the partition (the segregation unit) into which the data from
the file is to be loaded.
DYNAMIC PARTITION: the user simply states the column on which partitioning will take place; Hive then
creates partitions based on the unique values found in that column.
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno,name,gpa;
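Dynamic partitioning amounts to routing each row by the value of the partition column, the way Hive creates one partition directory per distinct value. A toy Python sketch, with invented student rows and a hypothetical "grade" partition column:

```python
# Rows: (rollno, name, grade); grade is the partition column
rows = [
    (501, "anita", "A"),
    (502, "bala", "B"),
    (503, "chitra", "A"),
]

# Route each row to a "partition" keyed by the partition column's value,
# as Hive does with one directory per distinct value
partitions = {}
for rollno, name, grade in rows:
    partitions.setdefault(grade, []).append((rollno, name))

print(sorted(partitions))   # ['A', 'B']
print(partitions["A"])      # [(501, 'anita'), (503, 'chitra')]
```

A query restricted to one partition value then only has to scan that partition's rows.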
Experiment 14
Problem statement:
Practice of advanced features in Hive Query Language: RC File & XML data processing.
Experiment 15
Problem statement:
Implementation of word count using Spark RDDs.
val data=sc.textFile("sparkdata.txt")
data.collect;
val splitdata = data.flatMap(line => line.split(" "));
splitdata.collect;
val mapdata = splitdata.map(word => (word,1));
mapdata.collect;
val reducedata = mapdata.reduceByKey(_+_);
reducedata.collect;
(OR)
Experiment 16
Problem statement:
Basic RDD operations in Spark
Login into spark environment
1. Open terminal in cloudera quickstart
2. Type spark-shell
Types of RDD creation
1. Using Parallelize
scala>val data = Array(1,2,3,4,5)
scala>val distdata = sc.parallelize(data)
2. External dataset
a. create one text file on desktop data.txt
b. create one directory in HDFS
hdfs dfs -mkdir /spark
c. Load the file from local to HDFS
hdfs dfs -put /home/cloudera/Desktop/data.txt /spark/data.txt
Basic RDD Transformations
1. To print elements of RDD
a. syntax: rdd.foreach(println)
Example: lines.foreach(println)
2. MAP
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z,1))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
b, a, c
(b,1), (a,1), (c,1)
3. FILTER
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n % 2 == 1)
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))
output:
1, 2, 3
1, 3
Basic RDD ACTIONS
1. COLLECT
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
val xOut = x.glom().collect()
output:
xOut (one array per partition): [[1], [2, 3]]
y: [1, 2, 3]
2. REDUCE
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)
output:
[1, 2, 3, 4]
10
3. AGGREGATE
def seqOp = (data:(Array[Int], Int), item:Int) =>
(data._1 :+ item, data._2 + item)
def combOp = (d1:(Array[Int], Int), d2:(Array[Int], Int)) =>
(d1._1.union(d2._1), d1._2 + d2._2)
val x = sc.parallelize(Array(1,2,3,4))
val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)
output:
[1, 2, 3, 4]
(Array(3, 1, 2, 4),10)
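The AGGREGATE action above is the least obvious one: each partition is folded with seqOp starting from the zero value, then the per-partition results are merged with combOp. A plain-Python model (the partitioning of the data is illustrative):

```python
def aggregate(partitions, zero, seq_op, comb_op):
    """Model of RDD.aggregate: fold each partition with seq_op from zero,
    then merge the partial results with comb_op."""
    partials = []
    for part in partitions:
        acc = zero
        for item in part:
            acc = seq_op(acc, item)
        partials.append(acc)
    result = partials[0]
    for p in partials[1:]:
        result = comb_op(result, p)
    return result

# Same seqOp/combOp as the Scala example: accumulate (seen items, running sum)
seq_op = lambda data, item: (data[0] + [item], data[1] + item)
comb_op = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])

# Two partitions of Array(1,2,3,4), like sc.parallelize with 2 partitions
print(aggregate([[1, 2], [3, 4]], ([], 0), seq_op, comb_op))
# ([1, 2, 3, 4], 10)
```

Because combOp may merge partial results in any order on a real cluster, the array in Spark's answer can appear shuffled, as in the output shown above.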
4. MAX
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
4
5. SUM
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
7
6. MEAN
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
2.3333333
7. STDEV
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev
println(x.collect().mkString(", "))
println(y)
output:
[2, 4, 1]
1.2472191
ASSIGNMENT 1: SPLIT
Objective: To learn about the SPLIT relational operator.
Problem Description: Write a Pig script to split customers for a reward program based on their life time values.
Input:
Customers   Life Time Value
Jack        25000
Smith       8000
David       35000
John        15000
Scott       10000
Joshi       28000
Ajay        12000
Vinay       30000
Joseph      21000
ASSIGNMENT 2: GROUP
Objective: To learn about the GROUP relational operator.
Problem Description: Create a data file for the below schemas:
ASSIGNMENT 3: COMPLEX DATA TYPE — BAG
Objective: To learn about the complex datatype — bag — in Pig.
Problem Description:
User ID    From                   To
user1001   user1001@sample.com    {(user003@sample.com),(user004@sample.com),(user006@sample.com)}
user1002   user1002@sample.com    {(user005@sample.com),(user006@sample.com)}
user1003   user1003@sample.com    {(user001@sample.com),(user005@sample.com)}
2. Write a Pig Latin statement to display the names of all users who have sent emails, along with a list of all
the people they have sent the email to.
ASSIGNMENT 1: HIVEQL
ASSIGNMENT 2: PARTITION
Objective: To learn about partitions in Hive.
Problem Description: Create a partitioned table for the customer schema to reward customers based on their life time values.
Input:
Customer ID   Customer   Life Time Value
1001          Jack       25000
1002          Smith      8000
1003          David      12000
1004          John       15000
1005          Scott      12000
1006          Joshi      28000
1007          Ajay       12000
1008          Vinay      30000
1009          Joseph     21000
SERDE
SerDe stands for Serializer/Deserializer.
1. Contains the logic to convert unstructured data into records.
2. Implemented using Java.
3. Serializers are used at the time of writing.
4. Deserializers are used at query time (SELECT statement).
The Deserializer interface takes a binary representation or string of a record and converts it into a Java object
that Hive can then manipulate. The Serializer takes a Java object that Hive has been working with and
translates it into something that Hive can write to HDFS.
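The serialize/deserialize round trip can be sketched in Python for the STUDENT schema used in this manual. The class and field names are illustrative, not Hive API:

```python
class StudentSerDe:
    """Toy SerDe: the deserializer turns a tab-delimited record string into an
    object the engine can manipulate; the serializer writes it back out."""

    def deserialize(self, record):
        rollno, name, gpa = record.split("\t")
        return {"rollno": int(rollno), "name": name, "gpa": float(gpa)}

    def serialize(self, obj):
        return "\t".join([str(obj["rollno"]), obj["name"], str(obj["gpa"])])

serde = StudentSerDe()
row = serde.deserialize("501\tanita\t8.5")
print(row["name"])           # anita
print(serde.serialize(row))  # 501	anita	8.5
```

Hive invokes the deserializer for every row read by a SELECT and the serializer for every row written, which is why a custom SerDe is enough to support a new storage format.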
Act:
CREATE TABLE XMLSAMPLE(xmldata string);
LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;
RCFILE IMPLEMENTATION
RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on
computer clusters.
CREATE TABLE STUDENT_RC(rollno int, name string, gpa float) STORED AS RCFILE;
INSERT OVERWRITE TABLE STUDENT_RC SELECT * FROM STUDENT;
SELECT SUM(gpa) FROM STUDENT_RC;
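Column-wise placement, the idea behind RCFile, can be illustrated in Python: rows in a row group are stored column by column, so an aggregate over one column (like the SUM(gpa) query above) reads only that column's values. The student rows are invented sample data.

```python
# Sample rows of the STUDENT table: (rollno, name, gpa)
rows = [(501, "anita", 8.5), (502, "bala", 7.0), (503, "chitra", 9.0)]

# A row group stored column-wise: one contiguous list per column,
# as RCFile lays out each row group on disk
row_group = {
    "rollno": [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "gpa":    [r[2] for r in rows],
}

# SUM(gpa) touches only the gpa column, not names or roll numbers
print(sum(row_group["gpa"]))  # 24.5
```

In a row-oriented layout the same query would have to read every field of every row; the columnar layout is what makes column aggregates cheap.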