CHAPTER 1

APACHE PIG

1.1 INTRODUCTION
Apache Pig is a platform used to analyze large data sets by representing them as
data flows. It is designed to provide an abstraction over MapReduce, reducing the
complexity of writing a MapReduce program. With Apache Pig, data manipulation
operations in Hadoop can be performed easily.

The features of Apache Pig are:


• Pig enables programmers to write complex data transformations without
knowing Java.
• Apache Pig has two main components: the Pig Latin language and the Pig
runtime environment, in which Pig Latin programs are executed.
• For Big Data analysis, Pig provides a simple data flow language known as Pig
Latin, which offers SQL-like operations such as join, filter, and limit.
• Developers who already work with scripting languages and SQL can leverage
Pig Latin, which makes programming with Apache Pig easier. Pig Latin
provides built-in operators such as join, sort, and filter to read, write, and
process large data sets, giving Pig a rich set of operators.
• Programmers write scripts in Pig Latin to analyze data, and these scripts are
internally converted to Map and Reduce tasks by the Pig MapReduce engine.
Before Pig, writing MapReduce tasks was the only way to process the data
stored in HDFS.
• If a programmer needs a custom function that is unavailable in Pig, Pig
allows them to write User Defined Functions (UDFs) in supported languages
such as Java, Python (via Jython), and Ruby (via JRuby) and to embed them
in a Pig script. This makes Apache Pig extensible.

ANALYSIS ON FLICKSERY USECASE USING PIG Page 1


• Pig can process structured, semi-structured, and unstructured data coming
from various sources.
• Approximately 10 lines of Pig code are equivalent to 200 lines of MapReduce
code.
• It can handle inconsistent schemas, as in the case of unstructured data.
• Apache Pig extracts the data, performs operations on it, and dumps the
result in the required format in HDFS, i.e. ETL (Extract, Transform, Load).
• Apache Pig automatically optimizes tasks before execution.
• It allows programmers and developers to concentrate on the whole
operation rather than on creating Mapper and Reducer functions separately.
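
The ETL flow described in the list above can be sketched in Pig Latin. This is a
minimal illustration rather than a definitive script: the input path and schema
are borrowed from the Flicksery data set used in Chapter 2, while the aliases
well_rated and summary and the output directory are hypothetical.

```pig
-- Extract: read the comma-separated file from HDFS
-- (path and schema as used in Chapter 2)
movies = LOAD '/user/cloudera/datafile' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:float, duration:int);

-- Transform: keep well-rated movies and project three columns
well_rated = FILTER movies BY rating > 3.0;
summary    = FOREACH well_rated GENERATE name, year, rating;

-- Load: write the result back to HDFS
-- (hypothetical output directory; it must not already exist)
STORE summary INTO '/user/cloudera/well_rated_movies' USING PigStorage(',');
```

No Mapper or Reducer class is written here; Pig plans the underlying jobs when
the STORE statement is reached.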

The architecture of Apache Pig is depicted in this diagram:

Figure 1.1.1: Apache Pig Architecture

CHAPTER 2
PIG LATIN QUERIES
2.1 DATASET
The data set used here is Flicksery, a Netflix search engine. The data set is a
sample text file that lists movie names and their details, such as release year,
rating, and runtime.

The columns of the data set are described as follows:

• ID - Unique number for each movie
• Name - Movie name
• Year - Movie release year
• Rating - Movie rating (float)
• Duration - Movie duration (int)

1,The Nightmare before Chrustmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Strom,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu,1929,3.5,5651
10,Nick of Time,1995,3.4,5333

2.2 EXECUTION STEPS FOR PIG LATIN QUERIES

Step 1: Check if all daemons are running

Figure 1.11: Checking that all daemons are running

Step 2: Load the file from the local file system into HDFS.

[cloudera@localhost ~]$ hadoop fs -put /home/cloudera/datafile /user/cloudera

Step 3: Start Pig in MapReduce mode.

[cloudera@localhost ~]$ pig -x mapreduce

Or, since MapReduce is the default mode:

[cloudera@localhost ~]$ pig

Step 4: Load the data file from HDFS into Pig.

grunt> movies = load '/user/cloudera/datafile' using PigStorage(',') as
       (id:int,name:chararray,year:int,rating:float,duration:int);
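
Before running the queries, it can help to confirm that the schema was applied
and that the rows parse as expected. A minimal sketch, assuming the movies
relation from Step 4 is still defined in the Grunt session; the alias first_rows
is hypothetical.

```pig
-- Print the schema Pig has attached to the relation
DESCRIBE movies;

-- Inspect a few rows without dumping the whole file
first_rows = LIMIT movies 3;
DUMP first_rows;
```

A field that cannot be converted to its declared type is loaded as null, so
empty values in this output usually point to a delimiter or type mismatch in the
load statement.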

2.3 PIG LATIN QUERIES

Query 1: Display the name of the movie and rating.

grunt> bag1 = foreach movies generate name,rating;

grunt> dump bag1;

(The Nightmare before Chrustmas,3.9)

(The Mummy,3.5)

(Orphans of the Strom,3.2)

(The Object of Beauty,2.8)

(Night Tide,2.8)

(One Magic Christmas,3.8)

(Muriel's Wedding,3.5)

(Mother's Boys,3.4)

(Nosferatu,3.5)

(Nick of Time,3.4)

Query 2: Display the details of movies whose rating is more than 3.

grunt> rating = filter movies by rating>3;

grunt> dump rating;

(1,The Nightmare before Chrustmas,1993,3.9,4568)

(2,The Mummy,1932,3.5,4388)

(3,Orphans of the Strom,1921,3.2,9062)

(6,One Magic Christmas,1985,3.8,5333)

(7,Muriel's Wedding,1994,3.5,6323)

(8,Mother's Boys,1994,3.4,5733)

(9,Nosferatu,1929,3.5,5651)

(10,Nick of Time,1995,3.4,5333)

Query 3: Display the details of movies which were released in the year 1994.

grunt> a = filter movies by year==1994;

grunt> dump a;

(7,Muriel's Wedding,1994,3.5,6323)
(8,Mother's Boys,1994,3.4,5733)

Query 4: Display the details of movies whose rating is more than 3
and that were not released in the year 1993.

grunt> bag4 = filter movies by rating>3 and year!=1993;

grunt> dump bag4;

(2,The Mummy,1932,3.5,4388)

(3,Orphans of the Strom,1921,3.2,9062)

(6,One Magic Christmas,1985,3.8,5333)

(7,Muriel's Wedding,1994,3.5,6323)

(8,Mother's Boys,1994,3.4,5733)

(9,Nosferatu,1929,3.5,5651)

(10,Nick of Time,1995,3.4,5333)

Query 5: Display the details of movies whose duration is more than 5000
or whose rating is above 3.

grunt> bag5 = filter movies by duration>5000 or rating>3;

grunt> dump bag5;

(1,The Nightmare before Chrustmas,1993,3.9,4568)

(2,The Mummy,1932,3.5,4388)

(3,Orphans of the Strom,1921,3.2,9062)

(4,The Object of Beauty,1991,2.8,6150)

(5,Night Tide,1963,2.8,5126)

(6,One Magic Christmas,1985,3.8,5333)

(7,Muriel's Wedding,1994,3.5,6323)

(8,Mother's Boys,1994,3.4,5733)

(9,Nosferatu,1929,3.5,5651)

(10,Nick of Time,1995,3.4,5333)

Query 6: Display the details of movies whose name contains the
substring 'a'.

grunt> bag6 = filter movies by name matches '.*a.*';

grunt> dump bag6;

(1,The Nightmare before Chrustmas,1993,3.9,4568)

(3,Orphans of the Strom,1921,3.2,9062)

(4,The Object of Beauty,1991,2.8,6150)

(6,One Magic Christmas,1985,3.8,5333)

(9,Nosferatu,1929,3.5,5651)

Query 7: Display the details of the movies, grouped by movie name and
year.

grunt> bag7 = group movies by (name,year);

grunt> dump bag7;

((Nosferatu,1929),{(9,Nosferatu,1929,3.5,5651)})

((The Mummy,1932),{(2,The Mummy,1932,3.5,4388)})

((Night Tide,1963),{(5,Night Tide,1963,2.8,5126)})

((Nick of Time,1995),{(10,Nick of Time,1995,3.4,5333)})

((Mother's Boys,1994),{(8,Mother's Boys,1994,3.4,5733)})

((Muriel's Wedding,1994),{(7,Muriel's Wedding,1994,3.5,6323)})

((One Magic Christmas,1985),{(6,One Magic Christmas,1985,3.8,5333)})

((Orphans of the Strom,1921),{(3,Orphans of the Strom,1921,3.2,9062)})

((The Object of Beauty,1991),{(4,The Object of Beauty,1991,2.8,6150)})

((The Nightmare before Chrustmas,1993),{(1,The Nightmare before Chrustmas,1993,3.9,4568)})
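
Because every (name, year) pair in the sample is unique, each group above holds
exactly one movie. If the intent is to see which movies share a release year,
grouping on year alone may be closer to it. A minimal sketch under that
assumption, using the movies relation from Step 4; the aliases by_year and
year_titles are hypothetical.

```pig
-- One group per release year; each group carries a bag of that year's movies
by_year = GROUP movies BY year;

-- Keep the year and the bag of movie names that fall in it
year_titles = FOREACH by_year GENERATE group AS year, movies.name AS names;
DUMP year_titles;
```

With the ten sample rows, 1994 is the only year whose bag would contain two
names.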

Query 8: Display movie name, rating and maximum rating among
all the movies.

grunt> gr = group movies by name;

grunt> result = foreach gr generate movies.name, movies.rating,
       MAX(movies.rating);

grunt> dump result;

({(Nosferatu)},{(3.5)},3.5)

({(The Mummy)},{(3.5)},3.5)

({(Night Tide)},{(2.8)},2.8)

({(Nick of Time)},{(3.4)},3.4)

({(Mother's Boys)},{(3.4)},3.4)

({(Muriel's Wedding)},{(3.5)},3.5)

({(One Magic Christmas)},{(3.8)},3.8)

({(Orphans of the Strom)},{(3.2)},3.2)

({(The Object of Beauty)},{(2.8)},2.8)

({(The Nightmare before Chrustmas)},{(3.9)},3.9)
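
Since the relation is grouped by name, MAX is evaluated inside each one-movie
group, which is why the third column simply repeats each movie's own rating. If
a single maximum across all movies is wanted, one option is GROUP ... ALL
combined with a scalar projection. A sketch, assuming a Pig release that
supports casting a one-row relation to a scalar (0.8 or later) and the movies
relation from Step 4; the aliases all_movies, overall and with_max are
hypothetical.

```pig
-- Collapse the whole relation into a single group
all_movies = GROUP movies ALL;
overall    = FOREACH all_movies GENERATE MAX(movies.rating) AS max_rating;

-- overall has exactly one row, so overall.max_rating can be used as a scalar
with_max = FOREACH movies GENERATE name, rating, overall.max_rating;
DUMP with_max;
```

Each output tuple would then carry the movie name, its rating, and the same
overall maximum.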

Query 9: Display movie name, rating and total rating among all
the movies.

grunt> result = foreach gr generate movies.name, movies.rating,
       SUM(movies.rating);

grunt> dump result;

({(Nosferatu)},{(3.5)},3.5)

({(The Mummy)},{(3.5)},3.5)

({(Night Tide)},{(2.8)},2.799999952316284)

({(Nick of Time)},{(3.4)},3.4000000953674316)

({(Mother's Boys)},{(3.4)},3.4000000953674316)

({(Muriel's Wedding)},{(3.5)},3.5)

({(One Magic Christmas)},{(3.8)},3.799999952316284)

({(Orphans of the Strom)},{(3.2)},3.200000047683716)

({(The Object of Beauty)},{(2.8)},2.799999952316284)

({(The Nightmare before Chrustmas)},{(3.9)},3.9000000953674316)
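
As in Query 8, gr is grouped by name, so each SUM covers a single movie. The
long decimals are consistent with float ratings being widened to double when
they are summed. A sketch of a total across all movies, assuming a Pig release
that supports scalar projection (0.8 or later) and the movies relation from
Step 4; the aliases all_movies, totals and with_total are hypothetical.

```pig
-- Put every movie into one group and add up the ratings
all_movies = GROUP movies ALL;
totals     = FOREACH all_movies GENERATE SUM(movies.rating) AS total_rating;

-- totals has exactly one row, so totals.total_rating can be used as a scalar
with_total = FOREACH movies GENERATE name, rating, totals.total_rating;
DUMP with_total;
```

Declaring rating as double instead of float in the load statement is one way to
avoid the widening artifacts in the printed values.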

Query 10: For each year, display the maximum movie rating and the
number of movies released, and find the minimum number of movies
released in a year.

grunt> gr = group movies by year;


grunt>
grunt> dump result;
(3.2,2,1921)
(3.5,5,1929)
(3.5,4,1932)
(2.8,5,1963)
(3.8,7,1985)
(2.8,3,1991)
(3.9,8,1993)
(3.5,4,1994)
(3.4,6,1995)
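
The statement that defines result did not survive in the listing above. One
possible reconstruction is sketched below, assuming the three output columns are
the maximum rating, the number of movies, and the year. The aliases by_year,
per_year, ordered_years and fewest are hypothetical, and the last three
statements are only one reading of "minimum movies released in a year".

```pig
-- One group per release year (movies is the relation from Step 4)
by_year = GROUP movies BY year;

-- Maximum rating, number of movies, and the year itself
per_year = FOREACH by_year GENERATE MAX(movies.rating) AS max_rating,
                                    COUNT(movies)      AS num_movies,
                                    group              AS year;
DUMP per_year;

-- Year with the fewest movies: sort by the count and keep the first row
ordered_years = ORDER per_year BY num_movies ASC;
fewest        = LIMIT ordered_years 1;
DUMP fewest;
```

The counts in the output above exceed what the ten sample rows could produce, so
that output was presumably generated from the full data file.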
