Analysis of Pig Script PDF

CHAPTER 1
APACHE PIG
1.1 INTRODUCTION
Apache Pig is a platform for analyzing large data sets by representing them as
data flows. It is designed to provide an abstraction over MapReduce, reducing
the complexity of writing a MapReduce program. With Apache Pig, data
manipulation operations in Hadoop can be performed very easily.
The features of Apache Pig are:

• Pig enables programmers to write complex data transformations without
knowing Java.
• Apache Pig has two main components: the Pig Latin language and the Pig
runtime environment, in which Pig Latin programs are executed.
• For Big Data analysis, Pig provides a simple data flow language known as Pig
Latin, which offers SQL-like functionality such as join, filter, and limit.
• Developers who already work with scripting languages and SQL can leverage
Pig Latin, which makes programming with Apache Pig easy. Pig Latin provides
a rich set of built-in operators such as join, sort, and filter to read, write, and
process large data sets.
• Programmers write scripts in Pig Latin to analyze data, and these scripts are
internally converted to map and reduce tasks by the Pig MapReduce engine.
Without Pig, such processing of data stored in HDFS would have to be written
directly as MapReduce tasks.
• If a programmer needs custom functions that are unavailable in Pig, Pig
allows them to write User Defined Functions (UDFs) in languages such as
Java, Python (Jython), and Ruby (JRuby) and embed them in a Pig script.
This makes Apache Pig extensible.
ANALYSIS ON FLICKSERY USECASE USING PIG Page 1

• Pig can process any kind of data, i.e. structured, semi-structured, or
unstructured data, coming from various sources.
• Roughly 10 lines of Pig code can do the work of about 200 lines of MapReduce
code.
• It can handle inconsistent schemas (as in the case of unstructured data).
• Apache Pig extracts the data, performs operations on it, and stores the
result in the required format in HDFS, i.e. ETL (Extract, Transform, Load).
• Apache Pig automatically optimizes tasks before execution.
• It allows programmers and developers to concentrate on the whole
operation rather than on writing separate Mapper and Reducer functions.
The architecture of Apache Pig is depicted in this diagram:
Figure 1.1.1: Apache Pig Architecture

CHAPTER 2
PIG LATIN QUERIES
2.1 DATASET
The data set used here is Flicksery. Flicksery is a Netflix search engine. The
data set is a sample text file that lists movie names and their details, such as
release year, rating, and runtime.
The columns of the data set are described as follows:
• ID - Unique number for each movie
• Name - Movie name
• Year - Movie release year
• Rating - Movie rating (float)
• Duration - Movie duration (int)
1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333

2.2 EXECUTION STEPS FOR PIG LATIN QUERIES
Step 1: Check if all daemons are running
Figure 1.11: check if all daemons are running
Step 2: Load the file from the local file system into HDFS.
[cloudera@localhost ~]$ hadoop fs -put /home/cloudera/datafile /user/cloudera
Step 3: Start Pig in MapReduce mode.
[cloudera@localhost ~]$ pig -x mapreduce
Or, since MapReduce mode is the default:
[cloudera@localhost ~]$ pig
Step 4: Load the data file from HDFS into Pig.
grunt> movies = load '/user/cloudera/datafile' using PigStorage(',') as (id:int,name:chararray,year:int,rating:float,duration:int);
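The load step can be pictured outside Pig as splitting each line on the delimiter and casting the fields to the declared types. The following Python sketch is only an illustration of that idea, not Pig's implementation; the names SAMPLE, Movie, and load are hypothetical, and unlike this sketch Pig does not abort on a field it cannot cast.

```python
from typing import NamedTuple

# Ten sample rows copied from section 2.1 (typos in titles kept as in the file).
SAMPLE = """\
1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
"""


class Movie(NamedTuple):
    # Mirrors the schema (id:int,name:chararray,year:int,rating:float,duration:int).
    id: int
    name: str
    year: int
    rating: float
    duration: int


def load(text, delimiter=","):
    """Split each line on the delimiter and cast the fields to the schema types."""
    movies = []
    for line in text.strip().splitlines():
        id_, name, year, rating, duration = line.split(delimiter)
        movies.append(Movie(int(id_), name, int(year), float(rating), int(duration)))
    return movies


movies = load(SAMPLE)
print(len(movies), "rows loaded")
print(movies[0])
```

Naming the fields in the schema is what lets the later queries refer to name, year, rating, and duration directly.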
2.3 PIG LATIN QUERIES
Query 1: Display the name of the movie and rating.
grunt> bag1 = foreach movies generate name,rating;
grunt> dump bag1;
(The Nightmare before Chrustmas,3.9)
(The Mummy,3.5)
(Orphans of the Strom,3.2)
(The Object of Beauty,2.8)
(Night Tide,2.8)
(One Magic Christmas,3.8)
(Muriel's Wedding,3.5)
(Mother's Boys,3.4)
(Nosferatu,3.5)
(Nick of Time,3.4)

Query 2: Display the details of movies with a rating greater than 3.
grunt> rating = filter movies by rating>3;
grunt> dump rating;
(1,The Nightmare before Chrustmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Strom,1921,3.2,9062)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
Query 3: Display the details of movies that were released in the year 1994.
grunt> a = filter movies by year==1994;
grunt> dump a;
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)

Query 4: Display the details of movies whose rating is more than 3
and that were not released in the year 1993.
grunt> bag4 = filter movies by rating>3 and year!=1993;
grunt> dump bag4;
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Strom,1921,3.2,9062)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
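One way to cross-check which rows a combined condition should keep is to apply the same predicate to the ten sample rows outside Pig. The Python sketch below does this for rating>3 and year!=1993; SAMPLE and load are hypothetical helper names, and the sketch only mirrors the filter logic, not how Pig executes it.

```python
# Ten sample rows copied from section 2.1.
SAMPLE = """\
1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
"""


def load(text):
    """Parse id,name,year,rating,duration lines into typed tuples."""
    rows = []
    for line in text.strip().splitlines():
        id_, name, year, rating, duration = line.split(",")
        rows.append((int(id_), name, int(year), float(rating), int(duration)))
    return rows


movies = load(SAMPLE)

# Same predicate as: filter movies by rating>3 and year!=1993
bag4 = [m for m in movies if m[3] > 3 and m[2] != 1993]

for m in bag4:
    print(m)
```

Both conditions must hold, so movie 1 is dropped for its year and movies 4 and 5 are dropped for their rating.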
Query 5: Display the details of movies whose duration is greater than 5000
or whose rating is above 3.
grunt> bag5 = filter movies by duration>5000 or rating>3;
grunt> dump bag5;
(1,The Nightmare before Chrustmas,1993,3.9,4568)
(2,The Mummy,1932,3.5,4388)
(3,Orphans of the Strom,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(5,Night Tide,1963,2.8,5126)
(6,One Magic Christmas,1985,3.8,5333)
(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)
(9,Nosferatu,1929,3.5,5651)
(10,Nick of Time,1995,3.4,5333)
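With or, a row is kept when either condition holds. A small Python sketch over the ten sample rows (hypothetical names, only mirroring the logic) separates the two conditions to show which one admits each row:

```python
# Ten sample rows copied from section 2.1.
SAMPLE = """\
1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
"""

movies = []
for line in SAMPLE.strip().splitlines():
    id_, name, year, rating, duration = line.split(",")
    movies.append((int(id_), name, int(year), float(rating), int(duration)))

# Ids that satisfy each half of: duration>5000 or rating>3
by_duration = {m[0] for m in movies if m[4] > 5000}
by_rating = {m[0] for m in movies if m[3] > 3}

# The "or" keeps the union of the two sets.
bag5_ids = sorted(by_duration | by_rating)

print("duration only:", sorted(by_duration - by_rating))
print("rating only:  ", sorted(by_rating - by_duration))
print("kept:         ", bag5_ids)
```

On this sample every row satisfies at least one of the two conditions, so the filter keeps all ten movies.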
Query 6: Display the details of movies whose name contains the
substring 'a'.
grunt> bag6 = filter movies by name matches '.*a.*';
grunt> dump bag6;
(1,The Nightmare before Chrustmas,1993,3.9,4568)
(3,Orphans of the Strom,1921,3.2,9062)
(4,The Object of Beauty,1991,2.8,6150)
(6,One Magic Christmas,1985,3.8,5333)
(9,Nosferatu,1929,3.5,5651)
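The pattern needs the leading and trailing .* because Pig's matches operator follows Java regular-expression matching, where the pattern has to cover the whole string. Python's re.fullmatch behaves the same way, so the sketch below uses it as a stand-in; the list NAMES is simply the name column of the sample data.

```python
import re

# Name column of the ten sample rows from section 2.1.
NAMES = [
    "The Nightmare before Chrustmas",
    "The Mummy",
    "Orphans of the Strom",
    "The Object of Beauty",
    "Night Tide",
    "One Magic Christmas",
    "Muriel's Wedding",
    "Mother's Boys",
    "Nosferatu",
    "Nick of Time",
]

# Whole-string match with wildcards on both sides: "contains a lowercase a".
contains_a = [n for n in NAMES if re.fullmatch(r".*a.*", n)]

# Without the wildcards the pattern would have to equal the entire name.
only_a = [n for n in NAMES if re.fullmatch(r"a", n)]

for name in contains_a:
    print(name)
print("names equal to 'a':", only_a)
```

The match is case-sensitive, so a name containing only an uppercase A would not be selected by this pattern.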

Query 7: Display the details of the movies grouped by movie name and
year.
grunt> bag7 = group movies by (name,year);
grunt> dump bag7;
((Nosferatu,1929),{(9,Nosferatu,1929,3.5,5651)})
((The Mummy,1932),{(2,The Mummy,1932,3.5,4388)})
((Night Tide,1963),{(5,Night Tide,1963,2.8,5126)})
((Nick of Time,1995),{(10,Nick of Time,1995,3.4,5333)})
((Mother's Boys,1994),{(8,Mother's Boys,1994,3.4,5733)})
((Muriel's Wedding,1994),{(7,Muriel's Wedding,1994,3.5,6323)})
((One Magic Christmas,1985),{(6,One Magic Christmas,1985,3.8,5333)})
((Orphans of the Strom,1921),{(3,Orphans of the Strom,1921,3.2,9062)})
((The Object of Beauty,1991),{(4,The Object of Beauty,1991,2.8,6150)})
((The Nightmare before Chrustmas,1993),{(1,The Nightmare before Chrustmas,1993,3.9,4568)})
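Each output line above pairs a group key, here the tuple (name, year), with a bag holding every full row that shares that key. A minimal Python sketch of the same shape, using a dictionary from key to list of rows (hypothetical names, not Pig's implementation):

```python
from collections import defaultdict

# Ten sample rows copied from section 2.1.
SAMPLE = """\
1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
"""

movies = []
for line in SAMPLE.strip().splitlines():
    id_, name, year, rating, duration = line.split(",")
    movies.append((int(id_), name, int(year), float(rating), int(duration)))

# Same idea as: group movies by (name,year)
groups = defaultdict(list)
for m in movies:
    key = (m[1], m[2])      # composite key: (name, year)
    groups[key].append(m)   # the "bag" of full rows for that key

for key, bag in groups.items():
    print(key, bag)
```

Because every (name, year) pair in the sample is unique, each bag holds exactly one row; grouping on year alone would place movies 7 and 8 in the same bag.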

Query 8: Display the movie name, rating, and maximum rating, with the
movies grouped by name.
grunt> gr = group movies by name;
grunt> result = foreach gr generate movies.name, movies.rating, MAX(movies.rating);
grunt> dump result;
({(Nosferatu)},{(3.5)},3.5)
({(The Mummy)},{(3.5)},3.5)
({(Night Tide)},{(2.8)},2.8)
({(Nick of Time)},{(3.4)},3.4)
({(Mother's Boys)},{(3.4)},3.4)
({(Muriel's Wedding)},{(3.5)},3.5)
({(One Magic Christmas)},{(3.8)},3.8)
({(Orphans of the Strom)},{(3.2)},3.2)
({(The Object of Beauty)},{(2.8)},2.8)
({(The Nightmare before Chrustmas)},{(3.9)},3.9)
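Because the relation is grouped by name, MAX is evaluated once per name, which is why each line repeats that movie's own rating. The Python sketch below (hypothetical names, mirroring only the logic) contrasts this per-group maximum with a single maximum over all rows; in Pig the latter would need one group covering the whole relation, for example via group movies all.

```python
# Ten sample rows copied from section 2.1.
SAMPLE = """\
1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
"""

movies = []
for line in SAMPLE.strip().splitlines():
    id_, name, year, rating, duration = line.split(",")
    movies.append((int(id_), name, int(year), float(rating), int(duration)))

# Per-group maximum: collect the ratings for each name, then take max per name.
ratings_by_name = {}
for _id, name, _year, rating, _duration in movies:
    ratings_by_name.setdefault(name, []).append(rating)
result = [(name, ratings, max(ratings)) for name, ratings in ratings_by_name.items()]

# Single maximum across every row, independent of any grouping.
overall_max = max(m[3] for m in movies)

for row in result:
    print(row)
print("overall maximum rating:", overall_max)
```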

Query 9: Display the movie name, rating, and total rating, with the
movies grouped by name.
grunt> result = foreach gr generate movies.name, movies.rating, SUM(movies.rating);
grunt> dump result;
({(Nosferatu)},{(3.5)},3.5)
({(The Mummy)},{(3.5)},3.5)
({(Night Tide)},{(2.8)},2.799999952316284)
({(Nick of Time)},{(3.4)},3.4000000953674316)
({(Mother's Boys)},{(3.4)},3.4000000953674316)
({(Muriel's Wedding)},{(3.5)},3.5)
({(One Magic Christmas)},{(3.8)},3.799999952316284)
({(Orphans of the Strom)},{(3.2)},3.200000047683716)
({(The Object of Beauty)},{(2.8)},2.799999952316284)
({(The Nightmare before Chrustmas)},{(3.9)},3.9000000953674316)
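The long decimal tails in this output are not calculation errors. The rating column is declared as float (32-bit), and SUM returns a double (64-bit); when a 32-bit value such as 2.8 is widened to 64 bits, the small rounding error of the 32-bit representation becomes visible. A value like 3.5 is exactly representable and therefore prints unchanged. The following Python sketch reproduces the effect with the standard struct module; widen_float32 is a hypothetical helper name.

```python
import struct


def widen_float32(x):
    """Round x to a 32-bit float, then return it as a 64-bit Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# The distinct ratings that appear in the sample data.
for rating in (3.5, 2.8, 3.4, 3.8, 3.2, 3.9):
    print(rating, "->", widen_float32(rating))
```

If rounded totals are preferred, the rating column could be declared as double in the load schema, or the result could be rounded before display.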

Query 10: Display the maximum movie rating and the number of movies
for each year, and the minimum number of movies released in a
year.
grunt> gr = group movies by year;

grunt>
grunt> dump result;
(3.2,2,1921)
(3.5,5,1929)
(3.5,4,1932)
(2.8,5,1963)
(3.8,7,1985)
(2.8,3,1991)
(3.9,8,1993)
(3.5,4,1994)
(3.4,6,1995)
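As a rough illustration of a per-year aggregation of this shape, the Python sketch below computes (maximum rating, number of movies, year) for the ten sample rows only. It assumes that this is the intent of the query; the names are hypothetical, and the counts it produces are those of the ten-row sample, so they are not expected to equal the counts printed above.

```python
# Ten sample rows copied from section 2.1.
SAMPLE = """\
1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
"""

movies = []
for line in SAMPLE.strip().splitlines():
    id_, name, year, rating, duration = line.split(",")
    movies.append((int(id_), name, int(year), float(rating), int(duration)))

# Group the ratings by release year (same key as: group movies by year).
ratings_by_year = {}
for _id, _name, year, rating, _duration in movies:
    ratings_by_year.setdefault(year, []).append(rating)

# Assumed output shape: (maximum rating, number of movies, year), ordered by year.
result = [(max(r), len(r), year) for year, r in sorted(ratings_by_year.items())]

# Smallest number of movies released in any single year of the sample.
min_per_year = min(count for _max, count, _year in result)

for row in result:
    print(row)
print("minimum movies in a year:", min_per_year)
```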

