
This module deals with the usage of Pig, its use cases, features, and advantages over SQL and MapReduce. The first session covers the introductory part, where you were introduced to Pig, its features and its data model. The second session takes you through the commonly used Pig commands for performing basic operations. The third session provides you with a good understanding of Pig's compilation and execution stages. It also helps you gain a basic understanding of the usage of user-defined functions.

Some of the limitations of MapReduce are as follows:

● It requires expert-level knowledge of Java.
● It requires expertise in the optimisation of MapReduce jobs.
● A lot of effort, in terms of time and lines of code, is needed to create even a simple job using Java.

Pig successfully overcomes these limitations. The advantages associated with Pig are as follows:
● In Pig, instructions are written in a language called Pig Latin.
● With some practice, Pig Latin can easily be understood by professionals with little or no programming experience.
● Pig internally converts all Pig Latin programs into optimised MapReduce jobs, so you don't have to worry about the optimisation of the jobs.
● Pig saves a lot of the effort, in terms of time, that professionals previously had to invest in coding complex MapReduce tasks.
● Note that the execution time of Pig scripts can be higher than that of equivalent MapReduce jobs, since Pig internally uses MapReduce to compute the results.

The different features of Pig are as follows:

Data flow language:


● Pig allows users to describe the sequence of operations in a step-by-step manner.
● This step-by-step approach provides query optimisation opportunities.

Quick start and interoperability:


● Pig doesn't require importing of data into tables. It can parse the input data based on a
function provided by the user.
● Also, the Pig output can be formatted according to the user-defined function provided.
● Pig operates in a read-only manner on the data stored in external files and doesn’t take
control over the data.

Nested data model:


● Pig has a flexible and fully nested data model.
● It supports complex data types like map, bag and tuple as fields of a relation.

User-defined functions (UDFs):


● Pig Latin provides extensive support for user-defined functions for custom processing.
● The input and output of UDFs in Pig also follow the flexible and nested data model.
● Pig supports UDFs written in Java, Python, Javascript, Jython, Ruby and Groovy.

Parallelism:

● Pig is designed and optimised for processing web-scale data.
● Only a small set of primitives that can be easily parallelised is included in Pig Latin.
● Non-equi joins, whose conditions use comparison operators other than the equality operator (==), are not included, as they cannot be parallelised easily.

Debugging environment:
● Pig provides an interactive debugging environment.
● For each step of the user program, it generates a concise sample data table of the
output.
● This helps in early detection of errors at each step even before the completion of the
first iteration on the entire dataset.

The data types in Pig can be divided into two major categories:
1. Scalar data types
2. Complex data types

Scalar Data Types:

These consist of primitive data types, which are similar to the primitive data types in any other
programming language. Some of the scalar data types used in Pig are —

● chararray​: It is the same as the String data type in other languages. It is used to
represent names, addresses, and so on.

● float​: It is used to represent real numbers for variables like ‘weight’, ‘marks_obtained’,
etc. The float data type can store 32-bit floating-point numbers.

● double​: This data type is used to represent real numbers but for a range greater than
that represented by variables of the float data type, e.g., ‘Total_Sales’, ‘Total_Expenses’,
etc. The double data type can store 64-bit floating-point numbers.
● int​: Used to represent integral values, such as age and the count of students.

● bytearray​: It is the default data type of Pig. If no data type is declared, then Pig, by
default, assigns bytearray as the data type.

Complex Data Types:

Complex data types are collections of primitive data types. They consist of the following (a combined schema sketch is given after the examples below):

● Map​: This data type stores sets of key-value pairs. The keys and values are separated by
hash (‘#’) symbols and are enclosed within square brackets. The key-value pairs are
separated by commas (,).
For example, [name#david, age#20, place#chicago]

● Tuple:​ It is an ordered collection of items. It is enclosed within round brackets.


For example, (david, 20, chicago)

● Bag: It is a collection of tuples enclosed within curly brackets.


For example, { (david, 20, chicago), (lettice, 28, london) }
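
The sketch below shows, in a minimal and purely illustrative way, how scalar and complex types can be combined in a single LOAD schema; the file name and field names are hypothetical and not taken from the examples above.

people = LOAD 'people_data' AS (
    name:chararray,                        -- scalar type
    age:int,                               -- scalar type
    attributes:map[chararray],             -- map, e.g. [place#chicago]
    friends:bag{t:tuple(fname:chararray)}  -- bag of tuples, e.g. {(lettice),(david)}
);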

In general, the data stored on Hadoop may or may not have a schema. Whenever Pig loads or stores data, it depends on storage functions to delimit the data values and tuples.

"PigStorage('field delimiter')" is the default storage function used by Pig. The field delimiter can be any character/symbol, as shown in the example given below.

Example:
A = LOAD 'data' USING PigStorage(',');

The default field delimiter of this function is '\t'. Pig also has a binary storage function called BinStorage.
In addition to these storage functions, Pig also allows users to define their own storage functions. To use a storage function other than the default, the user has to specify it in the Load or Store command.
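
As a minimal sketch (with hypothetical file and directory names), the default and non-default delimiters can be combined in a Load/Store pair as follows:

A = LOAD 'input_data.csv' USING PigStorage(',');   -- comma-delimited input
STORE A INTO 'output_dir' USING PigStorage();      -- default '\t'-delimited output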

Pig provides three methods for declaring the data types of fields. They are as follows:

Default: The first method is not to declare any data type. In this case, by default, every field is treated as a bytearray. Using the default bytearray has the benefit of avoiding the casting of data, which may be expensive or may even corrupt the data.

Example:
a = LOAD 'data' USING BinStorage AS (user);
b = ORDER a BY user;

Even if the types are undeclared, there are two cases where Pig can determine the type of a field. They are:
● If the program uses an operator that expects a certain type on a field, Pig will coerce that field to that type.
● The other case is when operators or user-defined functions whose return types are known are applied to the field.

Examples:
1. a = LOAD 'data' USING BinStorage AS (users);
   b = FOREACH a GENERATE users#'interests';

2. a = LOAD 'data' USING BinStorage AS (user);
   b = GROUP a BY user;
   c = FOREACH b GENERATE COUNT(a) AS cnt;
   d = ORDER c BY cnt;
Using "AS" clause:​ The second method of declaring types in Pig is by providing them explicitly
as part of the “AS” clause while loading the data.

Example:
a = LOAD 'data' USING BinStorage AS (name: chararray);

Specifying schema: The third method is where the load function itself provides the schema information; this accommodates self-describing data formats such as JSON.

Pig delays the typecasting of a bytearray to the point where it is actually necessary. Hence, Pig's data type conversion is known as lazy type conversion. This helps in avoiding the casting of unnecessary values, that is, values that get filtered out in earlier steps.

There are two execution modes in Pig:


1. The local mode​:
The local mode execution in Pig requires access to just one system. It does not require
any Hadoop services running in the background. Also, it uses the local file system for
loading and storing data. Execution in local mode can be done using:
Command: ​pig -x local

2. The MapReduce mode​:


The MapReduce mode execution requires access to a distributed Hadoop cluster or a
virtual machine where Hadoop is running in pseudo-distributed mode. It requires all the
Hadoop services running in the background. Also, it uses the HDFS for loading and
storing data. Execution in the MapReduce mode can be carried out using:
Command: ​pig
Pig code can be run in two ways:
1. You can execute a Pig script file, where the complete Pig code is written. And the
commands written in this file will be executed in a sequential manner.

○ In the local mode, a script file named ​pigscript.pig​ can be run using the
command ​pig -x local pigscript.pig
○ In the MapReduce mode, a script file named ​pigscript.pig ​can be run using the
command ​pig pigscript.pig

2. You can run Pig commands one at a time using the interactive mode. The shell that
executes Pig commands is known as grunt shell.
○ In the local mode, the grunt shell can be invoked using the command ​pig -x local
○ In the MapReduce mode, the grunt shell can be invoked using the command ​pig

Load:​ This command helps in specifying the details of the input data files and the method to
deserialise the file contents, or in simpler words, to convert them into Pig’s data model.

Example:
customer_queries = LOAD 'query_log.txt' USING customLoad()
                   AS (userId, queryString, timestamp);
Both the "USING" clause (which is used to mention the custom deserialiser) and the "AS" clause (which is used to mention the schema information) are optional. If no deserialiser is specified, the default deserialiser is used, which expects the input to be a tab-delimited text file. Similarly, if no schema is specified, the fields must be referred to by their positions instead of their names.
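
For instance, a minimal sketch of loading the same hypothetical query log without an "AS" clause, so that fields can only be referred to by position ($0, $1 and so on):

raw_queries = LOAD 'query_log.txt' USING PigStorage('\t');
first_two   = FOREACH raw_queries GENERATE $0, $1;   -- userId and queryString by position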

Note: ​The "Load command" only specifies the details of the input file and how to read it. The
data in the file is neither read nor processed until the user explicitly asks for an output. This is
different from the typical database loading style.

Filter: This command is used to extract the subset of data that is of interest, discarding the rest of the data.

Example:
real_queries = FILTER customer_queries BY userId neq 'bot';

The filtering conditions in Pig Latin may involve a combination of expressions, comparison operators such as ==, eq, != and neq, and logical connectors such as AND, OR and NOT. Since arbitrary expressions are allowed, programmers can write UDFs for filtering as well.
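
As a hedged sketch, a filtering condition that combines a comparison operator with a logical connector could look like the following (it assumes the timestamp field of customer_queries holds numeric values; the threshold is illustrative):

recent_real_queries = FILTER customer_queries
                      BY (userId neq 'bot') AND (timestamp > 100);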

FOREACH GENERATE​: This command is similar to the for loop in other programming languages.
FOREACH loops across all the rows in the relation and GENERATE either extracts the data fields
as they are or performs some transformation on the data fields of that row. The transformed
result is then stored in another relation.
Example:
expanded_queries = FOREACH real_queries GENERATE
                   userId, expandQuery(queryString);
The "Generate" clause can be followed by the following expressions:
● Constant: ​The simplest type of expression is a constant expression which is independent
of the tuple.
● Field by position:​ This expression fetches a field using its position inside the tuple.
● Field by name:​ This expression retrieves a field by its name.
● Projection: ​This expression projects a range of columns from the input.
● Map lookup: ​This expression is​ ​used to find the value of the corresponding key.
● Functional Evaluation: Pig Latin also supports functions such as AVG, MIN, MAX, SUM and COUNT, which can be used as expressions.
● Conditional: ​A specific condition can also be provided as an expression.
● Flatten:​ The FLATTEN command unnests the bags inside the tuples and creates new
tuples.

Example:

Figure 1: Tuple

Figure 2: Expression Table


Figure 3: Flatten Command
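
The following sketch combines some of these expression types, reusing the real_queries relation from the Filter example; the constant and the conditional threshold are purely illustrative.

projected = FOREACH real_queries GENERATE
            'web',                              -- constant expression
            $0,                                 -- field by position (userId)
            queryString,                        -- field by name
            (timestamp > 100 ? 'new' : 'old');  -- conditional (bincond) expression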

To process different data sets together, their tuples need to be grouped. Pig Latin provides the CoGroup command to perform such grouping operations.

Example:
grouped_data = COGROUP results BY queryString,
               revenue BY queryString;

Figure 4: CoGroup Command


The output of a CoGroup command contains one tuple for each group. The first field of the
tuple is the group identifier. Each of the next fields is a bag, one for each input being
co-grouped.

Note:​ The name of the bag remains the same as the input. The kth bag of the tuple contains all
the tuples of the kth input belonging to that group.

The CoGroup command only performs the operation of grouping the tuples together into nested bags. It is up to the user to subsequently choose whether to apply a custom aggregation function to those tuples or to cross-product them to get the join result. Similar to filtering, grouping can also be performed based on arbitrary expressions, including UDFs.
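
For instance, a minimal sketch of applying aggregation after co-grouping, reusing grouped_data from the example above (the amount field of revenue is assumed):

query_summary = FOREACH grouped_data GENERATE
                group,                 -- the group identifier (queryString)
                COUNT(results),        -- number of result tuples in the group
                SUM(revenue.amount);   -- aggregate over the revenue bag of the group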

If the processing involves only one dataset, the "Group" command, which serves as an alternative, can be used in such cases.

Example:
grouped_data = GROUP results BY queryString;

To perform regular equi-joins, Pig Latin also provides the "Join" command.

Example:
join_result = JOIN results BY queryString,
              revenue BY queryString;
Figure 5: Join Command

A generic MapReduce program can be easily expressed in Pig Latin using the Group and ForEach commands. The two steps involved in expressing a MapReduce program in Pig are:
● A user-defined map function operates on one input tuple at a time and outputs a bag of key-value pairs.
● A user-defined reduce function then operates on all the values for a key at a time and produces the final output.

Example:
map_result = FOREACH results GENERATE FLATTEN(map(*));
url_groups = GROUP map_result BY $1;
reduce_result = FOREACH url_groups GENERATE reduce(*);
Some of the SQL-like commands supported by Pig are listed below, with a short usage sketch after the list:

● Order: ​This command orders the data bags based on the specified field or fields.

● Distinct: This command helps to eliminate duplicate tuples from a bag. Syntactically, it can be considered a shortcut for grouping the bag by all fields and then projecting out the groups.

● Union: This command returns the union of two or more bags as the output.

● Cross: This command returns the cross product of two or more bags as the output.
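
A minimal sketch of these four commands on two hypothetical relations A and B (the queryString field used for ordering is assumed):

sorted    = ORDER A BY queryString;
deduped   = DISTINCT A;
combined  = UNION A, B;
all_pairs = CROSS A, B;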

In some cases, there might be nested bags within tuples, either due to co-grouping or due to the base data itself being nested. To handle such data with the same ease, Pig Latin allows the nesting of some commands within the ForEach command.

Example:
grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue {
    top_slot = FILTER revenue BY
               adSlot eq 'top';
    GENERATE queryString,
             SUM(top_slot.amount),
             SUM(revenue.amount);
};
Some of the commands that are allowed within the ForEach command are Filter, Order and Distinct.

Store: ​This command helps the user to save the result of the Pig Latin script into a file in the
specified folder location.

Example:
STORE query_revenues INTO 'myoutput'
      USING myStore();

Similar to the case of the Load command, the USING clause is optional for the Store command. If no custom serialiser is mentioned, the default serialiser, which writes plain-text, tab-delimited files, is applied.

Some of the additional commands supported by Pig are —

1. Describe:
You can use the ‘describe’ command to retrieve the schema of any relation.

Example:
a = LOAD 'users/data/data.txt' USING PigStorage(',') AS
    (ID:int, Name:chararray, age:int, salary:double);
DESCRIBE a;

Here, DESCRIBE a; will print the schema of relation 'a' on the screen. The schema looks
like {(ID:int, Name:chararray, age:int, salary:double)}.
2. DUMP:
You can print the contents of a relation on the screen using the DUMP command.
For example, ​DUMP a;​ will print the contents of relation ‘a’ on the screen.

3. LIMIT:
This command is used to restrict the number of rows in a relation.
For example, the command b = LIMIT a 5; will return the first 5 rows of relation 'a' and
store them in 'b'.

4. DISTINCT:
This command is used to remove the duplicate rows from a relation.
For example, the command b = DISTINCT a; will return all the unique rows from relation
'a' and store them in relation 'b'.

5. TOKENIZE:
The TOKENIZE command splits a string in a particular column on the basis of delimiters.
The delimiters this command supports are double quotes [“ “], spaces [ ], parentheses
[()], commas [,], and asterisks [*]. Consider a column with the value ‘David Miller’; if we
apply the Tokenize command on this string, it will return ({(David),(Miller)}).
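
A minimal sketch of this (the file and field names are illustrative):

names  = LOAD 'names.txt' AS (full_name:chararray);
tokens = FOREACH names GENERATE TOKENIZE(full_name);
-- For the value 'David Miller', the generated bag is {(David),(Miller)}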

6. Aggregation/Evaluation Functions:
Pig provides a number of aggregation/evaluation functions; a short usage sketch follows this list. Some of them are:
a. CONCAT(): To concatenate two or more columns
b. MIN(): To find the minimum value in a dataset
c. MAX(): To find the maximum value in a dataset
d. FLOOR(): To calculate the floor value of a float
e. SIZE(): To calculate the length of a string or the size of a field
f. LOWER(): To convert a string to lowercase letters
g. COUNT(): To count the number of tuples in a bag
h. AVG(): To calculate the average of the numerical values of a column in a bag
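
The following sketch applies a few of these functions to the relation 'a' loaded in the Describe example above; grouping by age is purely illustrative.

grouped_a = GROUP a BY age;
stats     = FOREACH grouped_a GENERATE
            group,            -- the age value of each group
            COUNT(a),         -- number of tuples in the group
            AVG(a.salary),    -- average salary within the group
            MAX(a.salary);    -- maximum salary within the group
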
A Pig program goes through a series of transformation steps before execution. Parsing is the first transformation step; it verifies the syntactical correctness of the program. The other tasks performed in the parsing phase include:
● Checking whether all the referenced variables are defined.
● Performing type checking and schema inference.
● Verifying the ability of the user's program to instantiate classes corresponding to user-defined functions.
● Confirming the existence of streaming executables referenced by the user's program.

This phase outputs a canonical logical plan with a one-to-one correspondence between Pig Latin statements and logical operators. The output plan is arranged as a directed acyclic graph (DAG). This plan is then passed through a logical optimiser phase, where logical optimisations such as projection pushdown are carried out.

The optimised logical plan obtained from the logical optimiser phase is then compiled into a series of MapReduce jobs. This phase is followed by another, MapReduce-level, optimisation phase. An example of an optimisation carried out in this stage is utilising the combiner stage to perform early partial aggregations, in the case of distributive or algebraic aggregation functions.

The DAG of optimised MapReduce jobs is then sorted topologically, and the jobs are submitted to Hadoop in that sorted order for execution. Pig monitors the Hadoop execution status and periodically reports the progress of the overall program to the user. Any warnings or errors that occur during the execution get logged and reported to the user.
Figure 6: Execution Stages of Pig

A Pig program is initially translated into a logical plan in a one-to-one manner. In this plan, each operator in the Pig script is annotated with the schema of its output data.
Figure 7: Logical Plan Conversion

After obtaining the logical plan, Pig performs a limited set of optimisations to transform the logical plan before converting it into a MapReduce plan. Pig then translates the logical plan into a physical plan. In this translation, the logical operators are converted into physical operators.

Figure 8: Physical Plan Conversion


The logical co-group operator translates into a series of three physical operators, namely local rearrange, global rearrange and package (Note: the term 'rearrange' refers to either hashing or sorting by key). The local rearrange operator annotates each tuple in a way that indicates its source relation. The global rearrange operator then ensures that tuples with the same group-by key end up on the same machine and adjacent in the data stream. The package operator collects the adjacent tuples with the same key into a single 'package' tuple, which consists of the key followed by its corresponding bag of tuples. The Join operator can be rewritten as a CoGroup operator followed by a ForEach operator that performs flattening, as sketched below.
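
A minimal Pig Latin sketch of that rewrite, reusing the results and revenue relations from the earlier Join example:

-- join_result = JOIN results BY queryString, revenue BY queryString;
-- is equivalent to:
temp        = COGROUP results BY queryString, revenue BY queryString;
join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue);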

After constructing a physical plan, Pig then translates it into a MapReduce plan.

Figure 9: MapReduce Plan Conversion

In the MapReduce plan, the local rearrange operator simply annotates the tuples with keys and relation identifiers, and then lets Hadoop's local sort stage do the rest of the work.

Pig identifies scope for additional optimisation even after generating the MapReduce plan. In particular, it breaks distributive and algebraic aggregation functions into a series of three steps, namely initial, intermediate and final. The initial step is assigned to the map stage, the intermediate step to the combine stage and the final step to the reduce stage.

The two benefits associated with the usage of the combiner are as follows:

● It reduces the volume of data handled by the shuffle and merge phases, which often consume a significant portion of the job execution time.

● It tends to equalise the amount of data associated with each key, which, in turn, helps in reducing the skew in the reduce phase.

In the final compilation step, each MapReduce combination is converted into a Hadoop job description for execution. In this step, a Java JAR file is generated that contains the map and reduce implementation classes, as well as any user-defined functions that will be invoked as part of the job.

Pig UDFs can be implemented in six languages: Java, JavaScript, Python, Jython, Ruby and Groovy. They can customise each part of the processing, including data load/store, column transformation and aggregation. Further, UDFs written in Java are more efficient and get the most extensive support, as they are implemented in the same language as Pig.

Pig has a huge list of built-in functions, including many string, math, eval and load/store functions. Apart from these built-in functions, Piggybank is a repository of Java UDFs supported by Pig. Using it, users can access the Java UDFs contributed by other users or add their own UDFs for others to use.

Note: The Piggybank UDFs are not included in the Pig JAR. So, the user has to register them manually in the Pig script.
Register: This command helps the user specify the location of the UDF.

Example:
REGISTER /usr/local/pig/myudfs.jar;
A = LOAD 'user_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Note:

● Multiple Register commands can also be used in the same script.

● If the JAR is located in your Java classpath, then Pig automatically locates and loads it.

The Define command helps in avoiding the verbosity of mentioning the full package and class
name of the UDF. It helps the user to assign a short name to the function.

Example:
REGISTER /usr/local/pig/myudfs.jar;
DEFINE UPPER myudfs.UPPER;
A = LOAD 'user_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE UPPER(name);
DUMP B;

The Eval, or Evaluation, function is the most common type of function and can be used in FOREACH statements. Aggregate functions are a common type of Eval function and are generally applied to grouped data. An aggregate function takes a bag as input and returns a scalar value. A useful property of aggregate functions is that they can often be computed incrementally in a distributed manner. COUNT, MIN, MAX and AVG are some of the aggregate functions implemented in this way.
