• Please log in 10 minutes before the class starts and check your internet connection to avoid any network issues during the LIVE session
• All participants will be muted by default to avoid any background noise. However, the instructor will unmute you if required. Please use the "Questions" tab on your webinar tool to interact with the instructor at any point during the class
• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic
• If you want to connect to your Personal Learning Manager (PLM), dial +917618772501
• We have a dedicated support team to assist with all your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772
• Your feedback is very much appreciated. Please share feedback after each class; it will help us enhance your learning experience
✓ Join
✓ Group
✓ Filter
✓ Sort
✓ and more…
[Chart: bar graphs comparing Hadoop (MapReduce) and Pig; one axis measures development time in minutes]
Pig
▪ Handles unstructured data
▪ Similar to SQL: easy to learn, easy to write, and easy to read; reads like a series of steps
▪ Extensible by UDFs (User Defined Functions) written in Java, Python, JavaScript, or Ruby
▪ Open source and actively supported by a community of developers
Pig sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.
It provides a simple language for queries and data manipulation, Pig Latin, which is compiled into MapReduce jobs that run on Hadoop.
Why is it Important?
▪ Companies like Yahoo, Google and Microsoft are collecting enormous data sets
in the form of click streams, search logs, and web crawls.
▪ Some form of ad-hoc processing and analysis of all of this information is
required.
[Example dataflow: join VISITS and PAGES on url, group by user, compute the average pagerank per user, and filter users with avgPR > 0.5]
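A Pig Latin sketch of this dataflow (relation and field names are illustrative, taken from the diagram labels):
Visits = LOAD 'visits' AS (user:chararray, url:chararray, time:chararray);
Pages  = LOAD 'pages' AS (url:chararray, pagerank:double);
VP = JOIN Visits BY url, Pages BY url;                                        -- join on url
ByUser = GROUP VP BY user;                                                    -- group by user
AvgPR = FOREACH ByUser GENERATE group AS user, AVG(VP.pagerank) AS avgPR;     -- compute average pagerank
Result = FILTER AvgPR BY avgPR > 0.5;                                         -- keep users with avgPR > 0.5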
▪ Typical Pig workloads: production data pipelines and research (ad-hoc analysis).
Script: Pig can run a script file that contains Pig commands.
Example: pig script.pig runs the commands in the local file script.pig.
Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec (execute).
Embedded: You can run Pig programs from Java, much as you can use JDBC to run SQL programs from Java.
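A quick sketch of the first two modes (the script name is illustrative):
pig -x local script.pig        launches the script in local mode
pig script.pig                 launches the script on the Hadoop cluster (MapReduce mode)
grunt> exec script.pig         runs the script from Grunt in a separate context
grunt> run script.pig          runs the script in the current Grunt context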
▪ Case insensitive: the names of parameters and all other Pig reserved keywords.
▪ Case sensitive: the names (aliases) of fields, e.g. name and age.
Example:
▪ fs: invokes any Hadoop FsShell command from within a Pig script or the Grunt shell. Example: listing the files present in HDFS.
▪ sh: invokes any shell command from within a Pig script or the Grunt shell. Example: checking the list of files present in the local directory.
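How these look in the Grunt shell (the HDFS path is illustrative):
grunt> fs -ls /user/edureka
grunt> sh ls -l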
2. Execution Environments
▪ Local mode: Pig runs in a single JVM on the user machine.
▪ MapReduce mode: distributed execution on a Hadoop cluster.
Data Model Types: Atom, Tuple, Bag, Map
▪ A Data Map is a map from keys that are string literals to values that can be any data type.
▪ Square brackets [ ] are used to indicate the map data type.
Pig type : Java class
Bag : org.apache.pig.data.DataBag
Tuple : org.apache.pig.data.Tuple
Map : java.util.Map<Object, Object>
Integer : java.lang.Integer
Long : java.lang.Long
Float : java.lang.Float
Double : java.lang.Double
Chararray : java.lang.String
Bytearray : byte[]
▪ A = LOAD 'edureka' AS (name, age);
▪ B = FOREACH A GENERATE age + 5;
▪ For example, in relation B, age is converted to integer because 5 is an integer.
▪ However, it is not mandatory to always assign a schema. If we don't assign data types, the default type bytearray is assigned to the fields, and implicit conversions are applied depending on the context in which each field is used.
▪ If a schema is defined as part of a load statement, the load function tries to assign the given schema.
▪ However, if the data does not conform to the given schema, Pig will generate a null value or an error.
▪ If Pig cannot resolve incompatible types through implicit casts, Pig will report an error.
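The same idea with an explicit schema, as a minimal sketch (the file name 'edureka' follows the example above):
A = LOAD 'edureka' AS (name:chararray, age:int);
B = FOREACH A GENERATE age + 5;
DESCRIBE B;    -- prints the schema Pig inferred for the generated field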
('Edureka', 25)
▪ The above declares a tuple constant with two fields of data types chararray and int respectively.
▪ Hence, we can reference individual fields in the tuple by their position ($0 refers to the first field in the tuple).
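A small sketch of positional references (field names assumed from the earlier example):
A = LOAD 'edureka' AS (name:chararray, age:int);
B = FOREACH A GENERATE $0;    -- $0 is the first field, here name
DUMP B;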
X = GROUP A BY age;
DUMP X;
(25,{(aron,25), (bruno,25)})
(45,{(sam,45), (Rony,45)})
▪ The map constants below each define two key-value pairs. Notice that the keys are always of type chararray, while the values here take types chararray and int respectively.
[name#aron, age#25]
[name#sam, age#26]
(aron)
(sam)
▪ We can choose not to specify the data type of the values in a map, as below.
▪ In this case Pig assumes the type of the values to be bytearray and performs implicit casts to the appropriate type depending on how your Pig Latin statements handle the data.
▪ In the second statement we retrieve the value associated with the key 'name'. Notice the syntax a#'name'.
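The statements producing the (aron) and (sam) output above are not shown on the slide; a sketch under the assumption that the two map records are stored one per line in a file named 'data':
A = LOAD 'data' AS (a:map[]);            -- value types left unspecified, so they default to bytearray
B = FOREACH A GENERATE a#'name';         -- look up the value stored under the key 'name'
DUMP B;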
Loading and Storing:
LOAD : Loads data from the file system or other storage into a relation.
STORE : Saves a relation to the file system or other storage.
DUMP : Prints a relation to the console.
Filtering:
FILTER : Removes unwanted rows from a relation.
DISTINCT : Removes duplicate rows from a relation.
FOREACH...GENERATE : Adds or removes fields from a relation.
STREAM : Transforms a relation using an external program.
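A short sketch tying several of these operators together (file and field names are illustrative):
A = LOAD 'students' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
B = FILTER A BY age >= 18;              -- drop unwanted rows
C = DISTINCT B;                         -- remove duplicate rows
D = FOREACH C GENERATE name, gpa;       -- keep only the fields we need
STORE D INTO 'output';                  -- or DUMP D; to print to the console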
(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)
X = group A by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})
(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})
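The statement producing this second output is not shown; it looks like a COGROUP of A with a second relation. A sketch, assuming a relation B of (name, value) pairs loaded from another file:
B = LOAD 'other_data' AS (name:chararray, value:int);
Y = COGROUP A BY name, B BY name;
dump Y;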
Built-in function categories:
▪ Eval Functions
▪ Load/Store Functions
▪ Math Functions
▪ String Functions
▪ Datetime Functions
▪ Tuple, Bag, Map Functions
▪ Built-in functions don't need to be registered because Pig knows where they are.
▪ Built-in functions don't need to be qualified when they are used because Pig knows where to find them.
▪ Eval functions accept the bag data type as an input parameter and return the result according to the function.
▪ Load/store functions determine how data goes into Pig and comes out of Pig. Pig provides a set of built-in load/store functions.
▪ Math functions allow you to perform mathematical tasks on Pig fields.
▪ String functions are used to manipulate chararray fields.
▪ Datetime functions work on date and time; such fields are loaded as chararray and converted to date and time format using the ToDate function.
▪ Tuple, Bag, Map functions are used to convert fields into complex data types.
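A few of these built-in functions in action (the file and field names are illustrative):
A = LOAD 'employees' AS (name:chararray, salary:double, joined:chararray);
B = FOREACH A GENERATE UPPER(name), ABS(salary), ToDate(joined, 'yyyy-MM-dd');   -- string, math, datetime functions
G = GROUP A ALL;
C = FOREACH G GENERATE COUNT(A), AVG(A.salary);                                  -- eval (aggregate) functions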
http://www.edureka.in/blog/pig-programming-create-your-first-apache-pig-script/
http://www.edureka.in/blog/pig-programming-apache-pig-script-in-local-mode/
http://www.edureka.in/blog/operators-in-apache-pig/
http://www.edureka.in/blog/operators-in-apache-pig-diagnostic-operators/
history [-n] : Displays the list of statements used so far.
  -n : omits line numbers in the list.
set : Shows/assigns values to keys used in Pig. Example: grunt> set debug 'on'
▪ Replicated Joins
▪ Skewed Joins
▪ Merge Joins
▪ alias = JOIN alias BY {joining relation Field}, alias BY {joining relation Field} USING ['replicated' |
'skewed' | 'merge']
▪ A special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side.
▪ A replicated join can be used with more than two tables. In this case, the rightmost tables are read into memory, i.e. the large relation comes first, followed by the smaller relations.
▪ Pig runs the replicated join by loading the small file (the replicated input) into Hadoop's distributed cache.
▪ Example: a customer transaction data set, which could potentially have billions of rows, joined to a smaller geographic data set.
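A sketch of this example (relation and field names are illustrative; 'transactions' is the large relation, 'geography' the small one):
big = LOAD 'transactions' AS (custid:int, amount:double, country:chararray);
small = LOAD 'geography' AS (country:chararray, region:chararray);
J = JOIN big BY country, small BY country USING 'replicated';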
▪ One of the keys is much more common than the others, and the data for it is too large to fit in memory.
▪ Standard joins run in parallel across different reducers by splitting key values across processes.
▪ If there is a lot of data for a certain key, the data will not be distributed evenly across the reducers, and one of them will be 'stuck' processing the majority of the data.
▪ Skewed join handles this case. It calculates a histogram to check which key is the most prevalent and then splits its data across different reducers for optimal performance.
Example:
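A sketch of a skewed join, assuming the country key in a large click-stream relation is heavily skewed (names are illustrative):
clicks = LOAD 'clicks' AS (user:chararray, url:chararray, country:chararray);
geo = LOAD 'geography' AS (country:chararray, region:chararray);
J = JOIN clicks BY country, geo BY country USING 'skewed';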
▪ To do a merge join, both data sets must be sorted in ascending order by the join key.
▪ To do a merge join, we should have only two tables or input files.
▪ Merge join works in MapReduce mode; in local mode a merge join gets converted to a regular join.
Example: if the customer details and geography tables are sorted on country, which is the join key, we can use the merge join.
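A sketch of that example, assuming both inputs are already sorted by country:
cust = LOAD 'customers_sorted' AS (custid:int, name:chararray, country:chararray);
geo = LOAD 'geography_sorted' AS (country:chararray, region:chararray);
J = JOIN cust BY country, geo BY country USING 'merge';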
UDF (User Defined Functions)
▪ The advantage of Pig is its ability to let users combine its operators with their own or others' code via UDFs.
▪ Pig itself comes with some UDFs. In version 0.8, a large number of standard string-processing, math, and complex-type UDFs were added.
// Completed sketch: the slide shows only the method body; the class declaration,
// imports, and exec() signature below are reconstructed so that IsOfAge can be
// used as the filter UDF registered in the next snippet.
import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            // Accept only the listed ages
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
register myudf.jar;
X = filter A by IsOfAge(age);
http://www.edureka.in/blog/apache-pig-udf-part-1-eval-aggregate-filter-functions/
Load Functions
http://www.edureka.in/blog/apache-pig-udf-part-2-load-functions/
Store Functions
http://www.edureka.in/blog/apache-pig-udf-store-functions/
Syntax:
Terms:
▪ Use the STREAM operator to send data through an external script or program.
▪ Multiple stream operators can appear in the same Pig script.
▪ The stream operators can be adjacent to each other or have other operations in between.
▪ Use DEFINE to assign an alias name to the script or shell command, as sketched below.
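A sketch of STREAM with DEFINE, assuming a local script named filter.py that should be shipped to the cluster (the script name is illustrative):
DEFINE mycmd `python filter.py` SHIP('filter.py');
B = STREAM A THROUGH mycmd;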
▪ For instance, if you have daily processing that is identical every day except the date it needs to process, it would be very
convenient to put a placeholder for the date and provide the actual value at run time.
Specifying Parameters:
Syntax:
pig [-x local] -param param_name=param_value <path to your script>
Syntax:
pig [-x local] -param_file file_name <path to your script>
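A sketch of both forms, assuming a script daily.pig that refers to the placeholder as $date (file names and the date are illustrative):
pig -param date=2024-01-01 daily.pig
pig -param_file params.txt daily.pig
Here params.txt would contain a line such as date=2024-01-01, and inside daily.pig the placeholder is referenced as $date, e.g. A = LOAD '/logs/$date';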
▪ The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent
optimizations (such as applying filters early on) also apply.
Challenges:
• A huge amount of data flows into the systems daily, and there are multiple data sources that we need to aggregate data from.
• Crunching this huge volume of data and de-identifying it in the traditional way was problematic.
→http://www.edureka.in/blog/apache-hive-installation-on-ubuntu/
→http://www.edureka.in/blog/apache-hadoop-hive-script/