
UNIT-4: APACHE PIG

PIG: HADOOP PROGRAMMING MADE EASIER- Admiring the Pig Architecture, Going with the Pig Latin Application Flow, Working through the ABCs of Pig Latin, Evaluating Local and Distributed Modes of Running Pig Scripts, Checking out the Pig Script Interfaces, Scripting with Pig Latin

4.0. INTRODUCTION: WHAT IS APACHE PIG?

Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop.
The language for Pig is Pig Latin.

Pig scripts are internally converted to MapReduce jobs and executed on data stored in
HDFS. Every task that can be accomplished with Pig can also be accomplished with a Java
MapReduce program.

Why Do We Need Apache Pig?

Programmers who are not well versed in Java often struggle to work with Hadoop,
especially while performing MapReduce tasks. Apache Pig is a boon for all such
programmers.

● Using Pig Latin, programmers can perform MapReduce tasks easily without having to
type complex code in Java.
● Apache Pig uses a multi-query approach, thereby reducing the length of code. For
example, an operation that would require you to type 200 lines of code (LoC) in Java
can be done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig
reduces development time by almost 16 times.
● Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar
with SQL.
● Apache Pig provides many built-in operators to support data operations like joins,
filters, and ordering. In addition, it also provides nested data types like tuples, bags, and
maps that are missing from MapReduce.

Features of Pig

Apache Pig comes with the following features −

● Rich set of operators − It provides many operators to perform operations like join,
sort, filter, etc.
● Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script
if you are good at SQL.
● Optimization opportunities − The tasks in Apache Pig optimize their execution
automatically, so the programmers need to focus only on semantics of the language.
● Extensibility − Using the existing operators, users can develop their own functions to
read, process, and write data.
● UDFs − Pig provides the facility to create User Defined Functions in other
programming languages such as Java, and to invoke or embed them in Pig scripts.
● Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as
well as unstructured. It stores the results in HDFS.

APACHE PIG VS MAPREDUCE

Differences between Apache Pig and MapReduce.

● Apache Pig is a data flow language; MapReduce is a data processing paradigm.
● Pig Latin is a high-level language; MapReduce is low level and rigid.
● Performing a Join operation in Apache Pig is pretty simple; it is quite difficult in
MapReduce to perform a Join operation between datasets.
● Any novice programmer with a basic knowledge of SQL can work conveniently with
Apache Pig; exposure to Java is a must to work with MapReduce.
● Apache Pig uses a multi-query approach, thereby reducing the length of the code to a
great extent; MapReduce requires almost 20 times more lines of code to perform the
same task.
● With Apache Pig there is no need for compilation: on execution, every Pig operator is
converted internally into a MapReduce job; MapReduce jobs, in contrast, have a long
compilation process.

Applications of Apache Pig

Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping. Apache Pig is used −

● To process huge data sources such as web logs.
● To perform data processing for search platforms.
● To process time-sensitive data loads.

4.1. ADMIRING THE PIG ARCHITECTURE:

Pig is made up of two components:

a) The language itself: The programming language for Pig is known as Pig Latin, a high-level
language that allows you to write data processing and analysis programs.

b) The Pig Latin compiler: The Pig Latin compiler converts the Pig Latin code into executable
code. The executable code is either in the form of MapReduce jobs or it can spawn a process
where a virtual Hadoop instance is created to run the Pig code on a single node.

To perform a particular task, programmers need to write a Pig script using the Pig Latin
language and execute it using any of the execution mechanisms (Grunt shell, UDFs,
embedded). After execution, these scripts go through a series of transformations applied by
the Pig framework to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.

Fig: Pig Architecture

Apache Pig Components- As shown in the figure, the Apache Pig framework consists of the
following components:

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are
executed to produce the desired results.

4.2. GOING WITH THE PIG LATIN APPLICATION FLOW:

Pig Latin is a dataflow language, where you define a data stream and a series of transformations
that are applied to the data as it flows through your application. This is in contrast to a control
flow language (like C or Java), where you write a series of instructions. In control flow
languages, we use constructs like loops and conditional logic (like an if statement). You won’t
find loops and if statements in Pig Latin.

If you need some convincing that working with Pig is significantly easier than having to
write Map and Reduce programs, start by taking a look at some real Pig syntax. The following
listing shows sample Pig code that illustrates the data processing dataflow.

Listing: Sample pig code to illustrate the data processing data flow
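A minimal sketch of such a script might look like this (the file name, field names, and filter condition here are illustrative, not tied to a specific dataset):

A = LOAD 'data_file.csv' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
B = FILTER A BY id > 5;
C = GROUP B BY url;
D = FOREACH C GENERATE group, COUNT(B);
DUMP D;
STORE D INTO 'results';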

Some of the text in this example actually looks like English. Looking at each line in turn, you
can see the basic flow of a Pig program. This code can either be part of a script or issued on
the interactive shell called Grunt.

● Load: You first load (LOAD) the data you want to manipulate. As in a typical
MapReduce job, that data is stored in HDFS. For a Pig program to access the data, you
first tell Pig what file or files to use. For that task, you use the LOAD 'data_file'
command. Here, 'data_file' can specify either an HDFS file or a directory. If a directory
is specified, all files in that directory are loaded into the program. If the data is stored
in a file format that isn’t natively accessible to Pig, you can optionally add the USING
function to the LOAD statement to specify a user-defined function that can read in (and
interpret) the data.
● Transform: You run the data through a set of transformations which are translated into
a set of Map and Reduce tasks. The transformation logic is where all the data
manipulation happens. You can FILTER out rows that aren’t of interest, JOIN two sets
of data files, GROUP data to build aggregations, ORDER results, and do much, much
more.
● Dump: Finally, you dump (DUMP) the results to the screen or Store (STORE) the
results in a file somewhere.

4.3. WORKING THROUGH THE ABCS OF PIG LATIN:


Pig Latin is the language for Pig programs. Pig translates the Pig Latin script into MapReduce
jobs that can be executed within a Hadoop cluster. When coming up with Pig Latin, the
development team followed three key design principles:

Keep it simple.

Pig Latin provides a streamlined method for interacting with Java MapReduce. It’s an
abstraction, in other words, that simplifies the creation of parallel programs on the Hadoop
cluster for data flows and analysis. Writing data transformations and flows as Pig Latin scripts
instead of Java MapReduce programs makes these programs easier to write, understand, and
maintain because a) you don’t have to write the job in Java, b) you don’t have to think in terms
of MapReduce, and c) you don’t need to come up with custom code to support rich data types.
Pig Latin provides a simpler language to exploit your Hadoop cluster, thus making it easier for
more people to leverage the power of Hadoop and become productive sooner.

Make it smart.

You may recall that the Pig Latin compiler does the work of transforming a Pig Latin program
into a series of Java MapReduce jobs. The trick is to make sure that the compiler can optimize
the execution of these Java MapReduce jobs automatically, allowing the user to focus on
semantics rather than on how to optimize and access the data. SQL is set up as a declarative
query that you use to access structured data stored in an RDBMS. The RDBMS engine first
translates the query to a data access method, then looks at the statistics and generates a
series of data access approaches. The cost-based optimizer chooses the most efficient approach
for execution.

Don’t limit development.

Make Pig extensible so that developers can add functions to address their particular business
problems.

The problem we’re trying to solve involves calculating the total miles flown by all flights.
The following listing is the Pig Latin script we’ll use to answer this question.
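A sketch of what this four-line script might look like (the input file name and column layout are assumptions; the relation names follow the discussion below):

records = LOAD 'flight_data.csv' USING PigStorage(',') AS (Carrier:chararray, Distance:int);
mileage_recs = GROUP records ALL;
total_miles = FOREACH mileage_recs GENERATE SUM(records.Distance);
DUMP total_miles;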

The Pig script is a lot smaller than the MapReduce application you’d need to accomplish the
same task: the Pig script has only 4 lines of code. And not only is the code shorter, but it’s
even semi-human readable.

Most Pig scripts start with the LOAD statement to read data from HDFS.
In this case, we’re loading data from a .csv file. Pig has a data model it uses, so next we need
to map the file’s data model to the Pig data model. This is accomplished with the help of the
USING statement. We then specify that it is a comma-delimited file with the PigStorage(',')
statement, followed by the AS statement defining the names of the columns.

Aggregations are commonly used in Pig to summarize data sets.

The GROUP statement is used to aggregate the records into a single record, mileage_recs.
The ALL statement is used to aggregate all tuples into a single group. Note that some
statements — including the following SUM statement — require a preceding GROUP ALL
statement for global sums.

FOREACH . . . GENERATE statements are used here to transform column data. In this
case, we want to total the miles traveled, held in the records.Distance column. The SUM
statement computes the sum of records.Distance into a single-column collection, total_miles.

The DUMP operator is used to execute the Pig Latin statement and display the results on
the screen.

DUMP is used in interactive mode, which means that the statements are executed
immediately and the results are not saved. Typically, you will use either the DUMP or the
STORE operator at the end of your Pig script.

4.4. PIG LATIN DATA MODEL:

The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such
as map and tuple. Given below is the diagrammatic representation of Pig Latin’s data
model.

Fig: Pig Latin Data Model


Pig Latin’s data model includes the following types:
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored
as a string and can be used as a string or a number. int, long, float, double, chararray, and
bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as
a field.

Example − ‘raja’ or ‘30’


Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any
type. A tuple is similar to a row in an RDBMS table.

Example − (Raja, 30)

Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is
known as a bag. Each tuple can have any number of fields (flexible schema). A bag is
represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not
necessary that every tuple contain the same number of fields or that the fields in the same
position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag.

Example − {Raja, 30, {9848022338, raja@gmail.com}}

Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value can be of any type. A map is represented by ‘[]’.

Example − [name#Raja, age#30]

Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee
that tuples are processed in any particular order).

DATA TYPES

S.N. Data Type − Description & Example

1. int − Represents a signed 32-bit integer. Example: 8
2. long − Represents a signed 64-bit integer. Example: 5L
3. float − Represents a signed 32-bit floating point. Example: 5.5F
4. double − Represents a 64-bit floating point. Example: 10.5
5. chararray − Represents a character array (string) in Unicode UTF-8 format. Example: ‘tutorials point’
6. bytearray − Represents a byte array (blob).
7. boolean − Represents a Boolean value. Example: true/false
8. datetime − Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger − Represents a Java BigInteger. Example: 60708090709
10. bigdecimal − Represents a Java BigDecimal. Example: 185.98376256272893883

Complex Types

11. tuple − An ordered set of fields. Example: (raja, 30)
12. bag − A collection of tuples. Example: {(raju,30),(Mohhammad,45)}
13. map − A set of key-value pairs. Example: [‘name’#’Raju’, ‘age’#30]


PIG LATIN OPERATORS:

Operator Description

Loading and Storing

LOAD To Load the data from the file system (local/HDFS) into a relation.

STORE To save a relation to the file system (local/HDFS).

Filtering

FILTER To remove unwanted rows from a relation.

DISTINCT To remove duplicate rows from a relation.

FOREACH, GENERATE To generate data transformations based on columns of data.

STREAM To transform a relation using an external program.

Grouping and Joining

JOIN To join two or more relations.

COGROUP To group the data in two or more relations.

GROUP To group the data in a single relation.

CROSS To create the cross product of two or more relations.

Sorting

ORDER To arrange a relation in a sorted order based on one or more fields
(ascending or descending).

LIMIT To get a limited number of tuples from a relation.

Combining and Splitting

UNION To combine two or more relations into a single relation.

SPLIT To split a single relation into two or more relations.

Diagnostic Operators

DUMP To print the contents of a relation on the console.

DESCRIBE To describe the schema of a relation.


EXPLAIN To view the logical, physical, or MapReduce execution plans used to compute a relation.

ILLUSTRATE To view the step-by-step execution of a series of statements.

Let’s create two files to run the commands:

We have two files named ‘first’ and ‘second’. The first file contains three fields: user, url &
id. The second file contains two fields: url & rating. Both are CSV files.
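For instance, the two files might contain rows such as the following (these values are illustrative, chosen to line up with the examples that follow):

first (user, url, id):
amr,crap,8
amr,myblog,10
fred,flickr,2
fred,crap,9

second (url, rating):
crap,9
myblog,10
flickr,4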

Relational Operators

● Relational operators are the main tools with which Pig Latin operates on the data.
● They allow you to transform the data by sorting, grouping, joining, projecting, and
filtering.

1. LOAD –

● The Load operator is used to load data from the file system or HDFS storage into a Pig relation.

● In this example, the Load operator loads the data from the file ‘first’ to form the relation
‘loading1’.
● The field names are user, url, id.
Example:
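Assuming the comma-delimited files described above, the commands might look like this (loading2 is built the same way from ‘second’ for use in the later examples):

grunt> loading1 = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
grunt> loading2 = LOAD 'second' USING PigStorage(',') AS (url:chararray, rating:int);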

2. FOREACH –

● This operator generates data transformations based on columns of data.


● It is used to add or remove fields from a relation.
● Use FOREACH-GENERATE operation to work with columns of data.

Example:
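A sketch of one such transformation (projecting just the user and url fields is an illustrative choice):

grunt> foreach_data = FOREACH loading1 GENERATE user, url;
grunt> DUMP foreach_data;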

Result:

3. FILTER –

● This operator selects tuples from a relation based on a condition.


● In this example, we are filtering the record from ‘loading1’ when the condition ‘id’ is
greater than 8.

Example:
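One way this might look (filtered_data is an illustrative relation name):

grunt> filtered_data = FILTER loading1 BY id > 8;
grunt> DUMP filtered_data;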

Result:

4. JOIN –

● The Join operator is used to perform an inner equijoin of two or more relations based
on common field values.
● The Join operator always performs an inner join.
● Inner joins ignore null keys, so it makes sense to filter them out before the join.
● In this example, we join the two relations on the column ‘url’ from ‘loading1’ and
‘loading2’.
Example –
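A sketch of the join (the result is named loading3 here, an assumption, so it can be reused in the STORE example later):

grunt> loading3 = JOIN loading1 BY url, loading2 BY url;
grunt> DUMP loading3;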

Result:

5. ORDER BY -

● Order By is used to sort a relation based on one or more fields.


● You can sort in ascending or descending order using the ASC and DESC keywords.
● In the example below, we sort the data in loading2 in ascending order on the rating field.

Example:
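One way this might look (sorted_data is an illustrative relation name):

grunt> sorted_data = ORDER loading2 BY rating ASC;
grunt> DUMP sorted_data;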

Result:

6. DISTINCT –

● Distinct removes duplicate tuples in a relation.


● Let’s take an input file that contains amr, crap, 8 and amr, myblog, 10 twice each.

● When we apply distinct on the data in this file, duplicate entries are removed.

Example:
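A sketch, assuming the duplicated rows live in a hypothetical file ‘first_dup’ with the same schema as ‘first’:

grunt> dup_data = LOAD 'first_dup' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
grunt> distinct_data = DISTINCT dup_data;
grunt> DUMP distinct_data;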
Output:

7. STORE –

Store is used to save the results to the file system.

Here we are saving loading3 data into a file named storing on HDFS.

Example:
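One way this might look (keeping the output comma-delimited with PigStorage(',') is an assumption):

grunt> STORE loading3 INTO 'storing' USING PigStorage(',');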

Result:

8. GROUP -

● The GROUP operator groups together the tuples with the same group key (key field).
● The key field will be a tuple if the group key has more than one field, otherwise it will
be the same type as that of the group key.
● The result of a GROUP operation is a relation that includes one tuple per group.
● In this example, we group the relation ‘loading1’ by the column url.
Example:
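A sketch (grouped_data is an illustrative relation name):

grunt> grouped_data = GROUP loading1 BY url;
grunt> DUMP grouped_data;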

Result:

9. COGROUP –

● COGROUP is the same as the GROUP operator.
● For readability, programmers usually use GROUP when only one relation is involved
and COGROUP when multiple relations are involved.
● In this example, we group ‘loading1’ and ‘loading2’ by the url field in both relations.

Example:
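A sketch (cogrouped_data is an illustrative relation name):

grunt> cogrouped_data = COGROUP loading1 BY url, loading2 BY url;
grunt> DUMP cogrouped_data;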

Result:

10. CROSS:

● The CROSS operator is used to compute the cross product (Cartesian product) of two
or more relations.

● Here we apply the cross product to loading1 and loading2.

Example:
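A sketch (crossed_data is an illustrative relation name):

grunt> crossed_data = CROSS loading1, loading2;
grunt> DUMP crossed_data;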

Result:
11. LIMIT-

● LIMIT operator is used to limit the number of output tuples.


● If the specified number of output tuples is equal to or exceeds the number of tuples in
the relation, the output will include all tuples in the relation.

Example:
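A sketch that keeps only the first three tuples (the count 3 is illustrative):

grunt> limited_data = LIMIT loading2 3;
grunt> DUMP limited_data;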

Result:

12. SPLIT:

● The SPLIT operator is used to partition the contents of a relation into two or more
relations, based on the conditions stated in an expression.
● Here we split loading2 into two relations, x and y.
● Relation x contains the tuples from loading2 whose rating is greater than 8, and
relation y contains the tuples whose rating is less than or equal to 8.

Example:
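One way this might look:

grunt> SPLIT loading2 INTO x IF rating > 8, y IF rating <= 8;
grunt> DUMP x;
grunt> DUMP y;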

Result:
13. DUMP –

The DUMP operator is used to run Pig Latin statements and display the results on the screen.
In this example, the operator prints ‘loading1’ onto the screen.
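Example:

grunt> DUMP loading1;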

Result:

14. DESCRIBE –

Use the DESCRIBE operator to review the schema of a particular relation. The DESCRIBE
operator is best used for debugging a script.
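For instance, given the schema used when loading1 was loaded above, DESCRIBE would print the relation’s schema along these lines:

grunt> DESCRIBE loading1;
loading1: {user: chararray,url: chararray,id: int}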

15. ILLUSTRATE –

● The ILLUSTRATE operator is used to review how data is transformed through a
sequence of Pig Latin statements.
● The ILLUSTRATE command is your best friend when it comes to debugging a script.
This command alone might be a good reason for choosing Pig over something else.

Example:
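A sketch (running ILLUSTRATE on the loading1 relation is an illustrative choice; it prints a sampled, step-by-step view of how tuples flow through the statements):

grunt> ILLUSTRATE loading1;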
4.5. EVALUATING LOCAL AND DISTRIBUTED MODES OF RUNNING PIG
SCRIPTS- CHECKING OUT THE PIG SCRIPT INTERFACES:

Before you can run your first Pig script, you need to have a handle on how Pig programs can
be packaged with the Pig server. Pig has two modes for running scripts:

1. Local mode

In this mode, all the files are installed and run from your local host and local file system. There
is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

2. MapReduce mode (also known as Hadoop mode)

MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular operation
on the data that exists in the HDFS.

Pig programs can be packaged in three different ways:

1. Batch mode (Script): You can run Apache Pig in batch mode by writing the Pig Latin script
in a single file with the .pig extension.

2. Grunt: You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you
can enter the Pig Latin statements and get the output (using Dump operator).

3. Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript
programs.

Invoking the Grunt Shell

You can invoke the Grunt shell in the desired mode (local/MapReduce) using the -x option,
as shown below.
Local mode:

$ ./pig -x local

MapReduce mode:

$ ./pig -x mapreduce

Either of these commands gives you the Grunt shell prompt, as shown below.

grunt>

We can exit the Grunt shell using ‘ctrl + d’.

After invoking the Grunt shell, we can execute a Pig script by directly entering the Pig Latin
statements in it.

grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it using the -x option. Let us
suppose we have a Pig script in a file named sample_script.pig, as shown below.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt'
    USING PigStorage(',') AS (id:int, name:chararray, city:chararray);

Dump student;

Now, you can execute the script in the above file as shown below.

The following shows how to run your Pig script in local mode and in MapReduce mode:

Local mode:

$ pig -x local Sample_script.pig

MapReduce mode:

$ pig -x mapreduce Sample_script.pig

Here’s how you’d run the Pig script in Hadoop mode, which is the default if you don’t specify
the flag: pig -x mapreduce milesPerCarrier.pig

You can execute it from the Grunt shell as well using the exec command as shown below.

grunt> exec /sample_script.pig

Executing a Pig Script from HDFS


We can also execute a Pig script that resides in HDFS. Suppose there is a Pig script named
Sample_script.pig in the HDFS directory /pig_data/. We can execute it as shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig

4.6. SCRIPTING WITH PIG LATIN

Hadoop is a rich and quickly evolving ecosystem with a growing set of new applications.
Rather than try to keep up with all the requirements for new capabilities, Pig is designed to be
extensible via user-defined functions, also known as UDFs. UDFs can be written in a number
of programming languages, including Java, Python, and JavaScript. Developers are also
posting and sharing a growing collection of UDFs online. (Look for Piggy Bank and DataFu,
to name just two examples of such online collections.) Some of the Pig UDFs that are part of
these repositories are LOAD/STORE functions (XML, for example), date time functions, text,
math, and stats functions.
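As a sketch of how a UDF is wired into a script, suppose a jar myudfs.jar contains a Java function com.example.pig.UPPER that upper-cases a chararray (both names are hypothetical); registering and invoking it would look like this:

-- register the jar so Pig can find the UDF class (hypothetical jar/class names)
REGISTER 'myudfs.jar';
DEFINE UPPER com.example.pig.UPPER();
users = LOAD 'first' USING PigStorage(',') AS (user:chararray, url:chararray, id:int);
-- apply the UDF to the user field of each tuple
shouted = FOREACH users GENERATE UPPER(user), url, id;
DUMP shouted;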

Pig can also be embedded in host languages such as Java, Python, and JavaScript, which
allows you to integrate Pig with your existing applications. It also helps overcome limitations
in the Pig language. One of the most commonly referenced limitations is that Pig doesn’t
support control flow statements: if/else, while loops, for loops, and condition statements. Pig
natively supports data flow, but needs to be embedded within another language to provide
control flow. There are tradeoffs, however, to embedding Pig in a control-flow language. For
example, if a Pig statement is embedded in a loop, every iteration of the loop runs the Pig
statement again, causing a separate MapReduce job to run each time.

REVIEW QUESTIONS:
1. (a) Define Pig? List out the features and applications of Pig?

(b) Distinguish between Apache Pig and MapReduce.

2. (a) Briefly explain the Pig Architecture and its components with a neat sketch.

(b) Describe the Pig Latin data model and Pig Latin operators in detail.

3. (a) Discuss in detail about the Pig Latin Application flow.

(b) Explain the working principles through ABCs of Pig Latin

4. Elaborate on evaluating the local and distributed modes of running Pig scripts with a neat sketch.
