
Shivajirao Kadam Institute of Technology and Management, Indore (M.P.)


Department of Computer Science and Engineering

Lecture on “Introduction to PIG”
Hadoop Ecosystem
Pig Introduction
 Developed at Yahoo around 2006

 The initial release of Pig was on September 11, 2008.

 Pig is used to analyze large datasets by representing them as data flows.

 Works on top of Hadoop.

 Apache Pig allows developers to write complex MapReduce programs using a simple scripting language.
Pig Introduction
 Metadata is not required, but is used when available.

 Pig is a high-level scripting language that is used with Apache Hadoop.

 Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

 Pig includes a compiler that produces sequences of MapReduce programs.

 Pig translates Pig Latin scripts into MapReduce jobs that execute within Hadoop.
Pig Introduction
 Like pigs, which eat anything, the Pig programming language is designed to work on any kind of data. That is why the name, Pig!

 To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks.

 Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
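A minimal sketch of such a script is shown below; the file name, delimiter, and schema are assumptions made up for illustration, not taken from the slides. The three statements are what the Pig Engine compiles into MapReduce jobs:

-- hypothetical comma-separated file 'students.txt' with name, age, gpa
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
adults   = FILTER students BY age >= 18;
DUMP adults;   -- the DUMP triggers compilation and execution as MapReduce jobs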
Why Pig
 Suits programmers who are not comfortable with Java.

 Pig uses a multi-query approach.

 Built-in operators: Join, Filter, Order, Group (see the example after this list)

 Provides nested data types: Bag, Tuple, Field
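A small sketch showing these built-in operators used together; the file names and schemas are hypothetical, chosen only for illustration:

users   = LOAD 'users.txt'  USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
orders  = LOAD 'orders.txt' USING PigStorage(',') AS (uid:int, amount:double);
big     = FILTER orders BY amount > 100.0;       -- Filter
joined  = JOIN users BY id, big BY uid;          -- Join
by_city = GROUP joined BY users::city;           -- Group
counts  = FOREACH by_city GENERATE group AS city, COUNT(joined) AS n;
ranked  = ORDER counts BY n DESC;                -- Order
DUMP ranked;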


Pig Architecture
Logical Plan
 Pig Latin programs are checked statement by statement by an interpreter.

 Basically, a logical plan is created for each line in the Pig script/program.

 The interpreter checks each statement/line for syntax errors and valid (logical) operators; if it finds an error, it throws an exception and program execution ends.

 If no error is found for the statement/line, a logical plan is generated for it and added to the default logical plan of the program.
Logical Plan
 Important:
 During logical plan construction no data processing takes place; only syntax and semantic checks are performed.

 The logical plan contains the collection of logical operators. It does not contain the edges between the operators.
Logical Plan
 The logical plan can be drawn as a chart whose flow runs bottom to top, so the Load operator is at the very bottom. The lines between operators show the flow of data.
Physical Plan
 The physical plan is what eventually gets compiled into a series of MapReduce jobs.

 This plan describes the physical operators Pig will use to execute the script, without reference to how they will be executed in MapReduce.

 During the creation of the physical plan, the cogroup logical operator is converted into three physical operators, namely Local Rearrange, Global Rearrange, and Package. Load and store functions usually get resolved in the physical plan.
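To inspect these plans yourself, Pig’s EXPLAIN statement prints the logical, physical, and MapReduce plans for a relation without actually running the job. A small sketch (the file name and schema are hypothetical):

urls      = LOAD 'urls.txt' USING PigStorage(',') AS (url:chararray, category:chararray, pagerank:double);
good_urls = FILTER urls BY pagerank > 0.2;
EXPLAIN good_urls;   -- prints the logical plan, physical plan, and MapReduce plan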
Logical and Physical Plan
Pig Philosophy
 Pig Eats Anything

 Pig Lives Anywhere

 Pigs Are Domestic Animals


Pig Philosophy
 Pig Eats Anything
 Pig can operate on data whether it has metadata or not
 Pig can operate on data that is relational, nested, or unstructured
Pig Philosophy
 Pig Lives Anywhere
 Pig is not tied to one particular parallel framework
 Pig was first implemented on Hadoop
 It is not intended to be Hadoop-only
 Pig on MongoDB
 Pig with Cassandra
Pig Philosophy
 Pigs Are Domestic Animals
 Designed to be easily controlled and modified by its users
 Integration of User Defined Functions (UDFs) (see the sketch below)
 Pig supports streaming
 Using Hadoop streaming methods
 Pig’s optimizer rearranges some of the operations for better performance
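A rough sketch of the UDF and streaming extension points above; myudfs.jar, the UPPER function, and clean.py are hypothetical names used only for illustration:

REGISTER myudfs.jar;                                   -- make a (hypothetical) UDF jar visible to Pig
names   = LOAD 'names.txt' AS (name:chararray);
upper   = FOREACH names GENERATE myudfs.UPPER(name);   -- call a user defined function
cleaned = STREAM upper THROUGH `python clean.py`;      -- stream records through an external script
DUMP cleaned;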
Where Not to Use Pig
 When the data is completely unstructured, such as video, audio, and readable text.

 When time constraints exist, since Pig is slower than hand-written MapReduce jobs.

 Also, when more power is required to optimize the code, we cannot use Pig.
Pig Data Model
 The Pig Latin data model is fully nested. Also, it allows complex non-atomic data types like map and tuple.

 Atom
 An atom is any single value in Pig Latin, irrespective of its data type. It is stored as a string and can be used as a string or as a number. Atomic values in Pig are int, long, float, double, chararray, and bytearray. Moreover, a field is a piece of data or a simple atomic value in Pig.
For Example − ‘Shubham’ or ‘25’
Pig Data Model

 Tuple

 A tuple is a record that is formed by an ordered set of fields. However, the fields can be of any type. In addition, a tuple is similar to a row in a table of an RDBMS.
For Example − (Vikas, 100)
Pig Data Model

 Bag

 An unordered set of tuples is what we call a bag. To be more specific, a bag is a collection of tuples (non-unique). Moreover, each tuple can have any number of fields (flexible schema). Generally, we represent a bag by ‘{}’.
For Example − {(vikas, 100), (hemant, 400)}
Pig Data Model

 Relation

 A bag of tuples is what we call a relation. In Pig Latin, relations are unordered. Also, there is no guarantee that tuples are processed in any particular order.
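A small sketch tying these types together; the file name and layout are assumptions made for illustration:

-- each record holds an atom, a nested tuple, and a bag of tuples
students = LOAD 'students.txt'
           AS (name:chararray,                          -- atom (field)
               address:tuple(city:chararray, pin:int),  -- tuple
               courses:bag{c:(course:chararray)});      -- bag of tuples
DESCRIBE students;   -- the relation 'students' is itself a bag of such tuples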
Pig Execution Environment
 Local Mode
 Needs access only to a single machine
 All files are installed and run using your local host and file system
 Is invoked by using the -x local flag
 pig -x local
 MapReduce Mode
 MapReduce mode is the default mode
 Needs access to a Hadoop cluster and an HDFS installation
 Can also be invoked by using the -x mapreduce flag, or by just running pig
 pig -x mapreduce
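The same script can be submitted in either mode; wordcount.pig is a hypothetical script name used only for illustration:

pig -x local wordcount.pig        # run against the local file system
pig -x mapreduce wordcount.pig    # run against HDFS on the Hadoop cluster
pig                               # no flag: opens the Grunt shell in the default (MapReduce) mode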
Committers of Pig
Pig Latin Example
Pig Latin Statements
 Pig Latin statements work with relations
 A field is a piece of data.
 John

 A tuple is an ordered set of fields.
 (John,18,4.0F)

 A bag is a collection of tuples; a bag can also appear nested inside a tuple.
 (1,{(1,2,3)})

 A relation is an (outer) bag
Pig Pros and Cons
 Advantages of Apache Pig
 Less development time
 Easy to learn
 Procedural language
 Dataflow
 Easy to control execution
 UDFs
 Usage of Hadoop features
Pig Pros and Cons
 Limitations of Apache Pig
 Errors of Pig
 Not mature
 Support
 Minor one
 Implicit data schema
 Delay in execution
Pig Commands
Statement        Description
Load             Read data from the file system
Store            Write data to the file system
Dump             Write output to stdout
Foreach          Apply expression to each record and generate one or more records
Filter           Apply predicate to each record and remove records where false
Group / Cogroup  Collect records with the same key from one or more inputs
Join             Join two or more inputs based on a key
Order            Sort records based on a key
Distinct         Remove duplicate records
Union            Merge two datasets
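A short script exercising several of the statements above; the employee files and schema are hypothetical:

emp1    = LOAD 'emp1.txt' USING PigStorage(',') AS (id:int, dept:chararray, salary:double);
emp2    = LOAD 'emp2.txt' USING PigStorage(',') AS (id:int, dept:chararray, salary:double);
all_emp = UNION emp1, emp2;                                  -- Union
uniq    = DISTINCT all_emp;                                  -- Distinct
paid    = FILTER uniq BY salary > 30000.0;                   -- Filter
by_dept = GROUP paid BY dept;                                -- Group
avg_sal = FOREACH by_dept GENERATE group AS dept,
                                   AVG(paid.salary) AS avg_salary;   -- Foreach
ranked  = ORDER avg_sal BY avg_salary DESC;                  -- Order
STORE ranked INTO 'avg_salary_out' USING PigStorage(',');    -- Store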
Pig Latin Example
 Suppose we have a table
urls: (url, category, pagerank)

 A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
Equivalent Pig Latin program
 good_urls = FILTER urls BY pagerank > 0.2;

 groups = GROUP good_urls BY category;

 big_groups = FILTER groups BY COUNT(good_urls) > 10^6;

 output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
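Worth noting as a follow-up: Pig evaluates lazily, so the four statements above only define relations; the MapReduce jobs are submitted once the final relation is written out, for example (the output path below is a hypothetical name):

STORE output INTO 'top_categories' USING PigStorage(',');   -- or simply: DUMP output;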
Data Flow
Pig Installation

 Prerequisite:
 It is essential that you have Hadoop and Java installed on your system before you install Apache Pig
 Installation
 Step 1: Download Apache Pig from the link below:
https://downloads.apache.org/pig/pig-0.17.0/

 Step 2: Extract the tar file using the tar command.
Command: tar -xzf pig-0.17.0.tar.gz
Pig Installation
 Step 3: Edit the “.bashrc” file to update the environment variables for Apache Pig.
Command: sudo gedit .bashrc

 Step 4: Add the following lines at the end of the file:

export PATH=$PATH:/home/virendra/pig-0.17.0/bin
export PIG_HOME=/home/virendra/pig-0.17.0
export PIG_CLASSPATH=$HADOOP_HOME/conf
Pig Installation
 Step 5: Run the command below so that the changes take effect in the same terminal.
Command: source .bashrc

 Step 6: Check the Pig version to verify that Apache Pig was installed correctly.
Command: pig -version

 Step 7: Check the Pig help to see all the Pig command options.
Command: pig -help
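As a quick smoke test after installation (a sketch; student.txt is a hypothetical local file):

pig -x local                      # start the Grunt shell against the local file system
# then, inside the Grunt shell:
#   grunt> A = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, marks:int);
#   grunt> DUMP A;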
