
How To Make The Best Use Of Live Sessions

• Please log in 10 mins before the class starts and check your internet connection to avoid any network issues during the LIVE
session

• All participants will be on mute, by default, to avoid any background noise. However, you will be unmuted by instructor if
required. Please use the “Questions” tab on your webinar tool to interact with the instructor at any point during the class

• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the
ongoing topic

• If you want to connect to your Personal Learning Manager (PLM), dial +917618772501

• We have a dedicated support team to assist with all your queries. You can reach us anytime on the below numbers:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772

• Your feedback is very much appreciated. Please share feedback after each class, which will help us enhance your learning
experience

Copyright © edureka and/or its affiliates. All rights reserved.


Big Data & Hadoop Certification Training



Course Outline
▪ Understanding Big Data and Hadoop
▪ Hadoop Architecture and HDFS
▪ Hadoop MapReduce Framework
▪ Advance MapReduce
▪ Pig
▪ Hive
▪ Advance Hive and HBase
▪ Advance HBase
▪ Processing Distributed Data with Apache Spark
▪ Apache Oozie and Hadoop Project


Module 5: Pig



Topics
Following are the topics covered in this module:
▪ Need of Pig
▪ Why Pig?
▪ What is Pig?
▪ Pig Conceptual Data Flow
▪ Pig Basic Program Structure
▪ Pig Running Modes
▪ Pig Components
▪ Pig Data Types
▪ Data Structure used in Pig
▪ Pig Latin Relational Operators
▪ Pig Built-in Functions
▪ Specialized Joins
▪ Pig User Defined Functions
▪ Pig Streaming
▪ Parameter Substitution in Pig
▪ Diagnostic Operators and UDF Statements


Objectives
At the end of this module, you will be able to:
▪ Understand the Problem with Writing MapReduce
▪ Understand what Pig is and its Use Cases
▪ Understand Pig Architecture
▪ Understand Apache Pig Data Types
▪ Understand Apache Pig Working
▪ Write and Execute Pig Scripts
▪ Implement Pig UDFs & UDAFs



Let’s Revise – Advance MR
▪ Combiner and Partition functions
▪ MapReduce Joins
▪ Hadoop Data Types
▪ Custom Data Types
▪ Input and Output Formats
▪ Sequence Files
▪ Distributed Cache
▪ MRUnit testing framework
▪ Hadoop Counters: Reporting Custom Metrics


Need of Pig
✓ Do you know Java? ✓ 10 lines of PIG = 200 lines of Java

+ Built in operations like:

✓ Join
✓ Group
✓ Filter
✓ Sort
✓ and more…



Why should I go for Pig when there is MR?
1/20 the lines of code, 1/16 the development time

[Figure: two bar charts comparing Hadoop and Pig, one for lines of code and one for development time in minutes]

Performance on par with raw Hadoop


Why should I go for Pig when there is MR?
Map-reduce

▪ Powerful model for parallelism.


▪ Based on a rigid procedural structure.
▪ Provides a good opportunity to parallelize algorithm.

Pig

▪ It is desirable to have a higher level declarative language.


▪ Similar to SQL query where the user specifies the “what” and
leaves the “how” to the underlying processing engine.





Why Pig?
▪ Java not required
▪ Can take any data: structured, semi-structured, or unstructured
▪ Easy to learn, easy to write, and easy to read: similar to SQL, and reads like a series of steps
▪ Extensible by UDFs (User Defined Functions) written in Java, Python, JavaScript, or Ruby
▪ Provides common data operations (filters, joins, ordering, etc.) and nested data types (tuples, bags, and maps) that are missing from MapReduce
▪ An ad-hoc way of creating and executing map-reduce jobs on very large data sets
▪ Open source and actively supported by a community of developers


Where should I use Pig?

Pig is a data flow language.

It sits on top of Hadoop and makes it possible to create complex jobs to process large
volumes of data quickly and efficiently.

Case 1 – Time Sensitive Data Loads

Case 2 – Processing Many Data Sources

Case 3 – Analytic Insight Through Sampling


Where not to use Pig?
▪ Really nasty data formats or completely unstructured data
(video, audio, raw human-readable text).

▪ Not easy for complex business logic.

▪ When you would like more power to optimize your code.



Annie’s Question
Apache Pig is a platform for analysing?
» Small Data
» Data less than 10 GB
» Large Data
» All of them



Annie’s Answer

Ans. Large Data.





What is Pig?
Pig is an open-source high-level dataflow system.

It provides a simple language for queries and data manipulation, Pig Latin, that is
compiled into map-reduce jobs that are run on Hadoop.

Why is it Important?
▪ Companies like Yahoo, Google and Microsoft are collecting enormous data sets
in the form of click streams, search logs, and web crawls.
▪ Some form of ad-hoc processing and analysis of all of this information is
required.



Use Cases Where Pig is Used
▪ Processing of Web Logs.

▪ Data processing for search platforms.

▪ Support for Ad Hoc queries across large datasets.

▪ Quick Prototyping of algorithms for processing large datasets.



Examples of Data Analysis Task
Find users who tend to visit “good” pages:

VISITS

User   URL                  Time
Amy    www.cnn.com          8:00
Amy    www.crap.com         8:05
Amy    www.myblog.com       10:00
Amy    www.flickr.com       10:05
Fred   cnn.com/index.htm    12:00

PAGES

URL               Page Rank
www.cnn.com       0.9
www.flickr.com    0.9
www.myblog.com    0.7
www.crap.com      0.2


Conceptual Data Flow
1. Load Visits (user, url, time) and load Pages (url, pagerank)
2. Join on url = url
3. Group by user
4. Compute average pagerank
5. Filter on avgPR > 0.5
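
The data flow above can be traced step by step in plain Python to see what each stage produces. This is only a sketch of the semantics, not Pig Latin: the function name `users_visiting_good_pages`, the `min_avg_rank` parameter, and the in-memory lists are assumptions made for illustration, with rows taken from the VISITS and PAGES example.

```python
from collections import defaultdict

# Sample rows from the VISITS and PAGES example.
visits = [
    ("Amy", "www.cnn.com", "8:00"),
    ("Amy", "www.crap.com", "8:05"),
    ("Amy", "www.myblog.com", "10:00"),
    ("Amy", "www.flickr.com", "10:05"),
    ("Fred", "cnn.com/index.htm", "12:00"),
]
pages = [
    ("www.cnn.com", 0.9),
    ("www.flickr.com", 0.9),
    ("www.myblog.com", 0.7),
    ("www.crap.com", 0.2),
]


def users_visiting_good_pages(visits, pages, min_avg_rank=0.5):
    """Join on url, group by user, average the page rank, then filter."""
    rank_by_url = dict(pages)
    # Join: keep only visits whose url also appears in PAGES (inner join).
    joined = [
        (user, rank_by_url[url])
        for user, url, _time in visits
        if url in rank_by_url
    ]
    # Group by user.
    ranks_by_user = defaultdict(list)
    for user, rank in joined:
        ranks_by_user[user].append(rank)
    # Compute the average page rank per user, then filter on the threshold.
    averages = {
        user: sum(ranks) / len(ranks) for user, ranks in ranks_by_user.items()
    }
    return {user: avg for user, avg in averages.items() if avg > min_avg_rank}


result = users_visiting_good_pages(visits, pages)
```

Fred's only visit (cnn.com/index.htm) matches no URL in PAGES, so the inner join drops it and only Amy remains in the result.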



How Yahoo Uses Pig?
Pig is best suited for the data factory.

Data Factory contains:

Pipelines:

▪ Pipelines bring logs from Yahoo!'s web servers.


▪ These logs undergo a cleaning step where bots, company internal views, and clicks are removed.

Research:

▪ Researchers want to quickly write a script to test a theory.


▪ Pig integration with streaming makes it easy for researchers to take a Perl or Python script and run it against a huge
data set.





Pig – Basic Program Structure
Script:

Pig can run a script file that contains Pig commands.
Example: pig script.pig runs the commands in the local file script.pig.

Grunt:

Grunt is an interactive shell for running Pig commands. It is also possible
to run Pig scripts from within Grunt using run and exec (execute).

Embedded:

Pig programs can be run from Java, much like you can use JDBC to
run SQL programs from Java.


Case Sensitivity
Case Sensitive: The names / aliases of relations and fields, names of Pig functions

Case Insensitive: The names of parameters and all other pig reserved keywords

In the example below, note the following:

▪ The names of each relation i.e. A, B, and C are case sensitive

▪ The names (aliases) of fields name and age are case sensitive

▪ Functions PigStorage and COUNT are case sensitive

Example:

grunt> A = load 'edureka' using PigStorage() as (name:chararray, age:int);
grunt> B = group A by name;
grunt> C = foreach B generate COUNT(A);
grunt> dump C;


Pig – Running Modes
▪ Local Mode: pig -x local

▪ MapReduce Mode: pig (or pig -x mapreduce)


Shell and Utility Commands
Shell commands:

▪ fs: Invokes any Hadoop FsShell command from within a Pig script or the Grunt shell.
Example: To list the files present in HDFS: grunt> fs -ls

▪ sh: Invokes any shell command from within a Pig script or the Grunt shell.
Example: To list the files present in the local directory: grunt> sh ls


Pig is Made Up of Two Components

1. Pig data flows: Pig Latin is used to express data flows
2. Execution environments:
▪ Distributed execution on a Hadoop cluster
▪ Local execution in a single JVM


Pig Execution

▪ Pig resides on the user machine
▪ The job executes on the Hadoop cluster

No need to install anything extra on your Hadoop Cluster!


Pig Latin Program

It is made up of a series of operations or transformations that are
applied to the input data to produce output.

Pig turns the transformations into a series of MapReduce jobs.


Four Basic Types of Data Models

▪ Atom
▪ Tuple
▪ Bag
▪ Map


Data Model
Data Models can be defined as follows:
▪ A Field is a piece of data.
▪ A Tuple is an ordered set of fields. Parentheses ( ) indicate the tuple data type.
▪ A Bag is a collection of tuples. Curly brackets { } indicate the bag data type.
▪ A Map is a set of key-value pairs whose keys are string literals and whose values can be any data type. Square brackets [ ] indicate the map data type.

Example: t = ( 1, {(2,3),(4,6),(5,7)}, ['apache'#'search'] )




Pig Data Types

Pig Data Type    Implementing Class
Bag              org.apache.pig.data.DataBag
Tuple            org.apache.pig.data.Tuple
Map              java.util.Map<Object, Object>
Integer          java.lang.Integer
Long             java.lang.Long
Float            java.lang.Float
Double           java.lang.Double
Chararray        java.lang.String
Bytearray        byte[]


Pig Data Types (Contd.)
▪ Fields are assigned data types with the help of schemas
▪ However, it is not mandatory to always assign a schema: if we don't assign data types, the default type bytearray is assigned to fields, and implicit conversions are applied to the fields depending on the context in which each field is used
▪ For example, in relation B, age is converted to integer because 5 is an integer

A = LOAD 'edureka' AS (name, age);
B = FOREACH A GENERATE age + 5;

▪ If a schema is defined as part of a load statement, the load function tries to assign the given schema
▪ However, if the data does not conform to the given schema, Pig will generate a null value or an error

→ Example : Input data

» 4, Edureka
» A = LOAD 'edureka' AS (name:chararray,age:int);
» Dump A;
» O/P : (4,)
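
The null-on-mismatch behaviour in this example can be sketched in plain Python. This is an illustration under stated assumptions, not how Pig's load functions are implemented: the helper names `cast_field` and `load_with_schema`, the comma delimiter, and the use of Python's `None` for a Pig null are all hypothetical choices.

```python
def cast_field(raw, type_name):
    """Cast one raw text field; return None (standing in for null) on failure."""
    casts = {"chararray": str, "int": int, "float": float}
    if raw is None:
        return None
    try:
        return casts[type_name](raw)
    except ValueError:
        # The data does not conform to the schema, so the field becomes null.
        return None


def load_with_schema(line, schema, delimiter=","):
    """Split a line and cast each field to the type named in the schema."""
    raw_fields = line.split(delimiter)
    # Pad with None so that missing trailing fields also become null.
    raw_fields += [None] * (len(schema) - len(raw_fields))
    return tuple(
        cast_field(raw, type_name)
        for raw, (_name, type_name) in zip(raw_fields, schema)
    )


schema = [("name", "chararray"), ("age", "int")]
row = load_with_schema("4, Edureka", schema)
```

Here `row` holds "4" as the name and `None` as the age, because " Edureka" cannot be read as an int. This matches the (4,) output on the slide.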



Pig Data Types (Contd.)
▪ If an explicit cast is not supported, Pig will report an error

Example: You cannot cast a chararray to int in Pig

A = LOAD 'edureka' AS (name: chararray, age: int);
B = FOREACH A GENERATE (int)name;
error…

▪ If Pig cannot resolve incompatible types through implicit casts, Pig will report an error

Example: You cannot add chararray and float

A = LOAD 'edureka' AS (name: chararray, age: float);
B = FOREACH A GENERATE name + age;
error…


Data Structure Used in Pig
▪ A field is a piece of data
▪ It can be compared to a column in a database table
→ Referencing fields :
▪ Fields are accessed by positional indexes or by name
▪ Positional indexes are generated by the system
▪ Positional indexes are indicated with the dollar sign ($) and begin with zero, like $0, $1, $2
▪ Names are assigned to the fields by the user when defining schemas with PigStorage or any other loader, or
internally by the system during some operations such as group by
▪ You can use any name that is not a Pig keyword

                                                  First Field    Second Field
Data type                                         chararray      int
Positional index (system generated)               $0             $1
Possible name (assigned by you using a schema)    name           age
Field value (for the first tuple)                 abhay          3


Data Structure Used in Pig: Tuple
▪ A tuple is an ordered set of fields
▪ Tuples contain fields which may be of different data types
▪ A tuple can be compared to a row in SQL with fields as columns
▪ Since tuples are ordered, we can access fields in each tuple using the indexes of the fields
▪ Tuple constants use parentheses to define the tuple and commas to separate the fields

('Edureka', 25)

▪ The above declares a tuple constant with two fields, of data types chararray and int respectively

grunt> data = load 'edureka';
grunt> mydata = foreach data generate $0;
grunt> dump mydata

▪ Notice that the above example doesn't have a schema

▪ Hence, we reference individual fields in the tuple by their position ($0 refers to the first field in the tuple)


Data Structure Used in Pig: Tuple (Contd.)
▪ A tuple is enclosed in parentheses ( )
▪ If the relation is defined with a schema, we can access the fields using field names

grunt> data = load 'StudentData' as (name:chararray, age:int);
grunt> finaldata = foreach data generate name;
grunt> dump finaldata

▪ In this case, we have defined a schema for the tuples


Data Structure Used in Pig: Bag
▪ A bag is a collection of tuples
▪ An inner bag is enclosed in curly brackets { }
▪ Tuples in the bag correspond to the rows in a table
Bag properties :
▪ A bag can have duplicate tuples
▪ A bag can consist of tuples with different numbers of fields
▪ However, if Pig tries to access a field that does not exist in a tuple, a null value is substituted for it
▪ A bag can have tuples with fields with varying data types
▪ However, for Pig to effectively process bags, the schemas of the tuples within those bags should be the same
Example: If half of the tuples include chararray fields while the other half include float fields, only half of the tuples will
participate in any kind of computation because the chararray fields will be converted to null
▪ Bags have two forms: outer bag (or relation) and inner bag


Data Structure Used in Pig: Bag
Outer Bag or Relations
▪ A relation is a bag of tuples also known as outer bags
▪ A Pig relation is similar to a table in a relational database
▪ Pig relations don't require that every tuple contain the same number of fields or that the fields in the same position
(column) have the same type
▪ Relations are unordered which means there is no guarantee that tuples are processed in any particular order
▪ Referencing relations:
▪ Relations are referred to by name (or alias)
▪ In this example, A is a relation or bag of tuples. You can think of this bag as an outer bag

A = LOAD 'edureka' USING PigStorage() AS (name:chararray, age:int);
X = FOREACH A GENERATE name,$1;
DUMP X;
(aron,25)
(sam,45)
(bruno,25)
(Rony,45)


Data Structure Used in Pig: Bag
Inner Bag
▪ Now, suppose we group relation A by its second field (age) to form relation X
▪ In this example, X is a relation or bag of tuples. The tuples in relation X have two fields
▪ The first field is type int. The second field is type bag; you can think of this bag as an inner bag

X = GROUP A BY age;
DUMP X;
(25,{(aron,25), (bruno,25)})
(45,{(sam,45), (Rony,45)})
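
How grouping produces one inner bag per key can be sketched in plain Python, using the four tuples of relation A from the previous example. The helper name `group_by` is an assumption for illustration. The result is sorted only to make the output reproducible, since Pig itself gives no ordering guarantee for relations.

```python
from collections import defaultdict

# Relation A from the outer-bag example: (name, age) tuples.
relation_a = [("aron", 25), ("sam", 45), ("bruno", 25), ("Rony", 45)]


def group_by(relation, key_index):
    """Return (key, inner_bag) pairs, one per distinct key value."""
    bags = defaultdict(list)
    for tup in relation:
        # Each tuple is kept whole inside the bag for its key.
        bags[tup[key_index]].append(tup)
    return sorted(bags.items())


relation_x = group_by(relation_a, key_index=1)
```

Each element of `relation_x` has two fields, the age and a list standing in for the inner bag, which mirrors the DUMP X output above.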



Data Structure Used in Pig: Map
▪ A map is a set of key-value pairs
▪ A map is a chararray to data element mapping, expressed in key-value pairs
▪ The key should always be of type chararray and can be used as an index to access the associated value
▪ It is not necessary that all the values in a map be of the same type
▪ A map is enclosed in square brackets [ ]
▪ Within a pair, the key and value are separated by the pound sign #; pairs are separated from each other by ','
▪ Key: must be of chararray data type and must be a unique value
▪ Value: can be any data type

Syntax : [ key#value <, key#value …> ]

Example : ['Name'#'aron', 'Age'#22]

▪ The above defines a map constant with two key-value pairs. Notice that the keys are always of type chararray
while the values take type chararray and int respectively


Data Structure Used in Pig: Map (Contd.)
▪ In order to load data from files as maps, the data should be structured as below:

[name#aron, age#25]
[name#sam, age#26]

▪ Sample PigLatin statements to load the above data sample as map

grunt> mapload = load 'AboveFile' as (a:map[chararray]);
grunt> values = foreach mapload generate a#'name' as value;
grunt> value = FILTER values BY value is not null;
grunt> dump value

▪ The output of above statements is:

(aron)
(sam)
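
The lookup-and-filter steps above can be mimicked in plain Python. This is a simplified sketch, not Pig's parser: `parse_map` is a hypothetical helper that handles only flat maps with no nested values, and `None` stands in for a Pig null.

```python
def parse_map(text):
    """Parse a line such as '[name#aron, age#25]' into a dict of strings."""
    inner = text.strip()[1:-1]  # drop the surrounding [ ]
    if not inner.strip():
        return {}
    pairs = (item.split("#", 1) for item in inner.split(","))
    return {key.strip(): value.strip() for key, value in pairs}


lines = ["[name#aron, age#25]", "[name#sam, age#26]"]
maps = [parse_map(line) for line in lines]

# a#'name': look up the key; a missing key gives None, like a Pig null.
looked_up = [m.get("name") for m in maps]

# FILTER ... BY value is not null
names = [value for value in looked_up if value is not None]
```

A map without a `name` key would contribute `None` to `looked_up` and be removed by the final filter, which is the role of the FILTER statement in the Pig example.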



Data Structure Used in Pig : Map (Contd.)
▪ The load statement will construct two maps having two key-value pairs each.

▪ We can choose not to specify the data type of values in the map, as below:

grunt> mapload = load 'AboveFile' as (a:map[]);

▪ In this case Pig assumes the type of values to be bytearray and performs implicit casts to the appropriate type depending
on how your PigLatin statements handle the data

▪ In the second statement (the foreach) we retrieve the value associated with the key 'name'. Notice the syntax

a#'name'

which will return aron and sam




Pig Latin Relational Operators
Category                   Operator              Description
Loading and Storing        LOAD                  Loads data from the file system or other storage into a relation.
                           STORE                 Saves a relation to the file system or other storage.
                           DUMP                  Prints a relation to the console.
Filtering                  FILTER                Removes unwanted rows from a relation.
                           DISTINCT              Removes duplicate rows from a relation.
                           FOREACH...GENERATE    Adds or removes fields from a relation.
                           STREAM                Transforms a relation using an external program.
Grouping and Joining       JOIN                  Joins two or more relations.
                           COGROUP               Groups the data in two or more relations.
                           GROUP                 Groups the data in a single relation.
                           CROSS                 Creates the cross product of two or more relations.
Sorting                    ORDER                 Sorts a relation by one or more fields.
                           LIMIT                 Limits the size of a relation to a maximum number of tuples.
Combining and Splitting    UNION                 Combines two or more relations into one.
                           SPLIT                 Splits a relation into two or more relations.


Pig Latin - Nulls

▪ In Pig, when a data element is NULL, it means the value is unknown.
▪ Pig includes the concept of a data element being NULL.
▪ Data of any type can be NULL.


Data

File – Student

Name     Age    GPA
Joe      18     2.5
Sam             3.0
Angel    21     7.9
John     17     9.0
Joe      19     2.9

File – Student Roll

Name     Roll No.
Joe      45
Sam      24
Angel    1
John     12
Joe      19


Pig Latin – File Loaders
BinStorage - "binary" storage

PigStorage - Loads and stores data that is delimited by a given character (tab by default)

TextLoader - Loads data line by line (delimited by the newline character)

CSVLoader - Loads CSV files

XML Loader - Loads XML files


Pig Latin – Group Operator
Example of GROUP Operator:

A = load '/student' USING PigStorage(',') as (name:chararray, age:int, gpa:float);
dump A;

(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)

X = group A by name;
dump X;

(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})


Pig Latin – Cogroup Operator
Example of COGROUP Operator:

A = load '/student' USING PigStorage(',') as (name:chararray, age:int, gpa:float);
B = load '/studentRoll' USING PigStorage(',') as (name:chararray, rollno:int);

X = cogroup A by name, B by name;
dump X;

(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})


Joins and Cogroup
▪ JOIN and COGROUP operators perform similar functions
▪ JOIN creates a flat set of output records while COGROUP creates a nested set of output records

Example:
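
The difference between flat and nested output can be sketched in plain Python, using a subset of the student and student-roll rows. The functions `join` and `cogroup` below are illustrative stand-ins written for this sketch, not Pig's implementation, and `None` stands in for the missing age.

```python
# A subset of the Student and Student Roll data.
students = [("joe", 18, 2.5), ("sam", None, 3.0), ("joe", 19, 2.9)]
rolls = [("joe", 45), ("sam", 24), ("joe", 19)]


def join(left, right):
    """Inner join on field 0: one flat tuple per matching pair."""
    return [l + r for l in left for r in right if l[0] == r[0]]


def cogroup(left, right):
    """One tuple per key: (key, bag of left tuples, bag of right tuples)."""
    keys = sorted({t[0] for t in left} | {t[0] for t in right})
    return [
        (
            key,
            [l for l in left if l[0] == key],
            [r for r in right if r[0] == key],
        )
        for key in keys
    ]


flat = join(students, rolls)
nested = cogroup(students, rolls)
```

The join yields five flat tuples because each of the two joe rows pairs with both joe roll numbers, while the cogroup yields one nested tuple per name.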



Union
UNION: Merges the contents of two or more relations.




Pig Built In Functions
▪ Pig comes with a set of built-in functions, categorized as follows:

▪ Eval Functions
▪ Load/Store Functions
▪ Math Functions
▪ String Functions
▪ Datetime Functions
▪ Tuple, Bag, Map Functions

▪ The main differences between built-in functions and Pig UDFs are:

▪ Built-in functions don't need to be registered because Pig knows where they are.
▪ Built-in functions don't need to be qualified when they are used because Pig knows where to find them.

Pig function names are case sensitive, and many built-in names are written in UPPER CASE.


Pig Built In Functions
Eval Functions
Eval functions accept the bag data type as input parameter and return the result according to the function.
Examples: AVG, BagToString, CONCAT, COUNT, COUNT_STAR, DIFF, IsEmpty, MAX, MIN

Load/Store Functions
Load/store functions determine how data goes into Pig and comes out of Pig. Pig provides a set of built-in load/store functions.
Examples: Handling Compression, BinStorage, PigDump, PigStorage, TextLoader, HBaseStorage, OrcStorage

Math Functions
The Math functions allow you to perform mathematical tasks on fields.
Examples: ABS, ACOS, CBRT, CEIL, COS, EXP, FLOOR, LOG

String Functions
String functions are used to manipulate char data type fields.
Examples: ENDSWITH, EqualsIgnoreCase, INDEXOF, LOWER, REGEX_EXTRACT, REGEX_EXTRACT_ALL, REPLACE

Date-time Functions
To work with these functions, date and time fields are loaded as chararray data type and converted to date and time format using the ToDate function.
Examples: AddDuration, CurrentTime, GetDay, GetHour, GetMilliSecond, GetMinute, GetMonth, GetSecond

Tuple, Bag, Map Functions
These functions are used to convert fields into complex data types.
Examples: TOTUPLE, TOBAG, TOMAP, TOP


Must Read!!
▪ Review the following Blogs on Pig Scripts:

http://www.edureka.in/blog/pig-programming-create-your-first-apache-pig-script/

http://www.edureka.in/blog/pig-programming-apache-pig-script-in-local-mode/

http://www.edureka.in/blog/operators-in-apache-pig/

http://www.edureka.in/blog/operators-in-apache-pig-diagnostic-operators/



Utility Commands
▪ clear: Clears the screen of the Pig Grunt shell and positions the cursor at the top of the screen.
Syntax: grunt> clear

▪ exec: Runs a Pig script.
Syntax: exec [-param param_name = param_value] [-param_file file_name] [script]

▪ help: Prints a list of Pig commands or properties.
Syntax: -help [properties]

▪ history: Displays the list of statements used so far.
Syntax: history [-n]
-n : Omit line numbers in the list

▪ kill: Kills a job.
Syntax: kill jobid

▪ quit: Quits from the Pig Grunt shell.
Syntax: grunt> quit

▪ run: Runs a Pig script.
Syntax: run [-param param_name = param_value] [-param_file file_name] script

▪ set: Shows or assigns values to keys used in Pig.
Syntax: grunt> set debug 'on'




Specialized Joins
▪ To increase the performance of Pig jobs, Pig provides three kinds of joins, called specialized joins:

▪ Replicated Joins
▪ Skewed Joins
▪ Merge Joins

▪ General syntax of a specialized join:

alias = JOIN alias BY {joining relation field}, alias BY {joining relation field} USING ['replicated' | 'skewed' | 'merge']


Replicated Joins
▪ Also known as fragment-replicate join. A replicated join works only for INNER and LEFT OUTER joins; it does not support RIGHT and FULL OUTER joins.

▪ It is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side.

▪ A replicated join can be used with more than two tables. In this case, the rightmost tables are read into memory, which means the large relation comes first, followed by the smaller relations.

▪ Pig runs the replicated join by loading the small file (the replicated input) into Hadoop's distributed cache.

▪ Example: A customer transaction data set, which could potentially have billions of rows, joined to a smaller geographic data set.
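
The map-side idea behind a replicated join can be sketched in plain Python: the small relation is held in memory as a lookup table while the large relation is streamed past it once. The function `replicated_join` and the sample rows are hypothetical, and the sketch shows only the per-mapper logic, not how Pig distributes the work.

```python
def replicated_join(large, small, large_key=0, small_key=0):
    """Inner join that holds `small` in memory and streams over `large`."""
    # Build an in-memory lookup table from the small relation.
    lookup = {}
    for tup in small:
        lookup.setdefault(tup[small_key], []).append(tup)
    # Stream the large relation once; no sort or shuffle is needed.
    for tup in large:
        for match in lookup.get(tup[large_key], []):
            yield tup + match


# Hypothetical data: the large relation comes first, the small one last.
transactions = [("IN", 100), ("US", 250), ("IN", 75), ("FR", 40)]
geography = [("IN", "Asia"), ("US", "North America")]

joined = list(replicated_join(transactions, geography))
```

The FR transaction has no match in the small relation, so an inner join drops it. Because the function is a generator, the large relation never needs to be held in memory in full.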



Skewed Joins
▪ Skewed joins work for inner and outer joins.

▪ A skewed join is useful when one of the keys is much more common than the others, and the data for it is too large to fit in memory.

▪ Standard joins run in parallel across different reducers by splitting key values across processes.

▪ If there is a lot of data for a certain key, the data will not be distributed evenly across the reducers, and one of them
will be 'stuck' processing the majority of the data.

▪ Skewed join handles this case. It calculates a histogram to check which key is the most prevalent and then splits its data
across different reducers for optimal performance.

Example:



Merge Joins
▪ In a normal (default) join, both inputs are first sorted by the join key and then joined. This is called a sort-merge join. If both files are already sorted on the join key, there is no need to sort again; in such cases the merge join is very efficient.

▪ To do a merge join, both data sets must be sorted in ascending order by the join key.

▪ The merge join is supported only for inner joins.

▪ A merge join takes exactly two tables or input files.

▪ Merge join works in MapReduce mode; in local mode a merge join gets converted to a regular join.

Example: If the customer details and geography tables are both sorted on country, which is the join key, we can use the
merge join.
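
Why pre-sorted inputs make the join cheap can be seen in a plain-Python sketch of the merge step: two cursors advance through the sorted inputs and never look back. The function `merge_join` and the sample rows are hypothetical, and this is a single-process illustration rather than Pig's implementation.

```python
def merge_join(left, right, key=lambda t: t[0]):
    """Inner-join two lists that are already sorted ascending by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = key(left[i]), key(right[j])
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == left_key:
                j_end += 1
            # Pair that run with every left tuple carrying the same key.
            while i < len(left) and key(left[i]) == left_key:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical inputs, both sorted by country code.
customers = [("FR", "claire"), ("IN", "abhay"), ("IN", "ravi"), ("US", "amy")]
geography = [("IN", "Asia"), ("UK", "Europe"), ("US", "North America")]

merged = merge_join(customers, geography)
```

Each input is read once from start to end. If either list were not sorted, matches could be skipped, which is why the sorted-input requirement matters.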





Pig – UDF (User Defined Function)
▪ Pig allows users to combine existing operators with their own or others' code via UDFs. This ability is one of Pig's main advantages.

▪ Pig itself comes with some UDFs. In version 0.8, a large number of standard string-processing, math, and complex-type
UDFs were added.


Pig – Creating UDF
→ A Program to create UDF:
import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Filter UDF: returns true only when the first field is one of the listed ages.
public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        // An empty or missing tuple never passes the filter.
        if (tuple == null || tuple.size() == 0) {
            return false;
        }

        try {
            Object object = tuple.get(0);
            // A null field never passes the filter.
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}


Pig – Calling a UDF
How to call a UDF?

register myudf.jar;

X = filter A by IsOfAge(age);

Must Read → http://www.edureka.in/blog/pig-programming-apache-pig-script-with-udf-in-hdfs-mode/



Pig – UDFs Inbuilt
Eval, Aggregate, Filter Functions

http://www.edureka.in/blog/apache-pig-udf-part-1-eval-aggregate-filter-functions/

Load Functions

http://www.edureka.in/blog/apache-pig-udf-part-2-load-functions/

Store Functions

http://www.edureka.in/blog/apache-pig-udf-store-functions/





PIG Streaming
Sends data to an external script or program.

Syntax:

alias = STREAM alias THROUGH {`command` | cmd_alias } [AS schema] ;

Terms:

▪ Alias: name of the relation.
▪ Command: an external script or shell command, including its arguments, enclosed in backticks.
▪ cmd_alias: name assigned to a command through the DEFINE statement in Pig.
▪ Schema: the schema assigned to the streaming output.

▪ Use the STREAM operator to send data through an external script or program.
▪ Multiple stream operators can appear in the same Pig script.
▪ The stream operators can be adjacent to each other or have other operations in between.
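
A minimal sketch of the inline form, with two adjacent STREAM operators, is shown below. The input file and field names are assumptions, and the commands (cut, tr) are ordinary Unix tools chosen only for illustration; they must be available on the machines that run the job.

```pig
-- Hypothetical tab-delimited input with three fields.
A = LOAD 'data.txt' AS (f1:chararray, f2:chararray, f3:chararray);

-- Send each record through an external command and name the output field.
B = STREAM A THROUGH `cut -f 2` AS (f2:chararray);

-- A second STREAM operator directly after the first one.
C = STREAM B THROUGH `tr a-z A-Z` AS (f2_upper:chararray);
DUMP C;
```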



PIG Streaming
Running a shell command in a Pig Latin statement to print the top 5 fields in a relation

Use DEFINE to assign the alias name for the script or shell command.
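
The slide's original screenshot is not available, so the following is only a sketch of what such a script could look like. It assumes that "top 5 fields" means the first five tab-separated fields of each record and uses the Unix cut command; the alias first_five and the input file name are hypothetical.

```pig
-- Give the shell command an alias.
DEFINE first_five `cut -f 1-5`;

-- Hypothetical tab-delimited input.
A = LOAD 'data.txt';

-- Stream the relation through the aliased command.
B = STREAM A THROUGH first_five;
DUMP B;
```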



Parameter Substitution in Pig
▪ Parameter substitution lets you write a template Pig script once and run it regularly with different parameter values.

▪ For instance, if you have daily processing that is identical every day except the date it needs to process, it would be very
convenient to put a placeholder for the date and provide the actual value at run time.

Specifying Parameters:

▪ You can specify parameter names and values in the following ways:
▪ On the command line, with -param.
▪ In a parameter file named on the command line, with -param_file.
▪ With the %declare statement, as part of a Pig script.
▪ With the %default statement, as part of a Pig script.
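
The two in-script forms could be used as in the sketch below. The parameter names, values, and paths are assumptions. A %default value has the lowest priority, so it is used only when the parameter is not supplied by any other means.

```pig
-- Fallback value, used only if input_dir is not supplied elsewhere.
%default input_dir '/data/daily';

-- Value fixed inside the script.
%declare run_date '20240101';

-- Both parameters are substituted before the script runs.
A = LOAD '$input_dir/$run_date' USING PigStorage(',');
DUMP A;
```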



Parameter Substitution in Pig
Passing Parameter from Command line

Syntax :
pig [-x local] -param param_name=param_value <path to your script>
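
As an illustration, the daily-processing case could be written as the template below. The script name daily.pig, the parameter name date, and the paths and schema are all assumptions.

```pig
-- daily.pig (hypothetical script name)
-- Run as: pig -x local -param date=20240101 daily.pig
logs = LOAD 'logs/$date' USING PigStorage('\t') AS (user:chararray, url:chararray);

-- Count the records per user for the given day.
by_user = GROUP logs BY user;
counts = FOREACH by_user GENERATE group AS user, COUNT(logs) AS hits;

STORE counts INTO 'out/$date';
```

Running the same script the next day only requires changing the value passed with -param.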



Parameter Substitution in Pig
Passing Parameter from file

Syntax :
pig [-x local] -param_file file_name <path to your script>
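
A parameter file holds one name = value pair per line. The sketch below shows a possible file in a comment, followed by a script that uses it; the file name daily.params, the parameter names, and the paths are assumptions.

```pig
/* daily.params (hypothetical parameter file):
     date = 20240101
     input_dir = /data/logs
   Run as: pig -x local -param_file daily.params daily.pig */
logs = LOAD '$input_dir/$date' USING PigStorage('\t') AS (user:chararray, url:chararray);
DUMP logs;
```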



PiggyBank
▪ PiggyBank is a collection of useful LOAD, STORE, and UDF functions.
▪ To use a function, you need to figure out which package it belongs to:
▪ org.apache.pig.piggybank.comparison - for custom comparator used by ORDER operator
▪ org.apache.pig.piggybank.evaluation - for eval functions like aggregates and column transformations
▪ org.apache.pig.piggybank.filtering - for functions used in FILTER operator
▪ org.apache.pig.piggybank.grouping - for grouping functions
▪ org.apache.pig.piggybank.storage - for load/store functions
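
A minimal sketch of using a PiggyBank function follows. The JAR path, input file, and the alias PB_UPPER are assumptions; the class shown is the string UPPER function from the evaluation package.

```pig
-- The location of piggybank.jar depends on the installation.
REGISTER /path/to/piggybank.jar;

-- Alias the fully qualified class name to keep the script readable.
DEFINE PB_UPPER org.apache.pig.piggybank.evaluation.string.UPPER();

A = LOAD 'names.txt' AS (name:chararray);
B = FOREACH A GENERATE PB_UPPER(name);
DUMP B;
```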



Diagnostic Operators and UDF Statements
Pig Latin Diagnostic Operators

Types of Pig Latin Diagnostic Operators:

DESCRIBE - Prints a relation’s schema.
EXPLAIN - Prints the logical, physical, and MapReduce execution plans.
ILLUSTRATE - Shows a sample execution of the logical plan, using a generated subset of the input.
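
ILLUSTRATE could be tried on a small script such as the sketch below, where the input file and schema are assumptions. It shows example rows at each step, which helps check a script's logic without running it on the full data.

```pig
-- Hypothetical input: comma-separated name and age.
A = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int);
B = FILTER A BY age >= 18;

-- Show sample rows flowing through LOAD and FILTER.
ILLUSTRATE B;
```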

Pig Latin UDF Statements

Types of Pig Latin UDF Statements:

REGISTER - Registers a JAR file with the Pig runtime.
DEFINE - Creates an alias for a UDF, streaming script, or a command specification.



Describe
▪ Use the DESCRIBE operator to review the fields and data-types.
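
A minimal sketch, assuming a hypothetical student file with three typed fields:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

DESCRIBE A;
-- A: {name: chararray,age: int,gpa: float}
```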



EXPLAIN: Logical Plan
▪ Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the
specified relation.

▪ The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent
optimizations (such as applying filters early on) also apply.



EXPLAIN: Physical Plan
▪ The physical plan shows how the logical operators are translated to backend-specific physical operators. Some
backend optimizations also apply.



EXPLAIN: MapReduce Plan
▪ The MapReduce plan shows how the physical operators are grouped into MapReduce jobs.
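
To see all three plans for a relation, EXPLAIN can be run on a short script like the sketch below. The input file and schema are assumptions.

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY name;
C = FOREACH B GENERATE group AS name, COUNT(A) AS n;

-- Print the logical, physical, and MapReduce plans used to compute C.
EXPLAIN C;
```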



Demo



Demo on HealthCare Dataset
Problem Statement:

• De-identify personal health information.

Challenges:

• A huge amount of data flows into the systems daily, and it must be aggregated from multiple data sources.

• Crunching this volume of data and de-identifying it with traditional tools was problematic.



Demo on Weather Data with Pig
ftp://ftp.ncdc.noaa.gov/pub/data/uscrn/products/daily01/



Use Case in HealthCare
▪ Take a DB dump in CSV format and ingest it into HDFS.
▪ The Pig script reads the CSV file from HDFS.
▪ The Pig script de-identifies columns based on configurations and stores the data back in a CSV file.
▪ The de-identified CSV file is stored into HDFS.
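
The flow above could be sketched in Pig Latin as follows. This is not the demo's actual script: the paths, the schema, and the choice of which columns to mask are assumptions, and the masking uses only a constant and the built-in REPLACE function, where a real pipeline would more likely call a configurable UDF.

```pig
-- Hypothetical schema and HDFS paths.
patients = LOAD '/healthcare/raw/patients.csv' USING PigStorage(',')
           AS (patient_id:chararray, name:chararray, phone:chararray,
               diagnosis:chararray);

-- Mask the identifying columns and keep the rest unchanged.
deidentified = FOREACH patients GENERATE
    patient_id,
    'REDACTED' AS name,
    REPLACE(phone, '[0-9]', 'X') AS phone,
    diagnosis;

-- Write the de-identified records back to HDFS as CSV.
STORE deidentified INTO '/healthcare/deidentified' USING PigStorage(',');
```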



Please read all the blogs that have been shared with you in the
earlier slides without fail!



Assignment
▪ Find out successful students in the Class

▪ Execute Pig Weather Example

▪ Execute Data-set and Pig Script for Health Care Use-Case

▪ Execute Data-set and Pig Script for Weather Use-Case



Pre-work
Review the following Blogs on Hive Scripts:

→http://www.edureka.in/blog/apache-hive-installation-on-ubuntu/
→http://www.edureka.in/blog/apache-hadoop-hive-script/



Agenda for Next Class
• Hive and its Use Cases
• Hive vs Pig
• Hive Architecture and Components
• Primitive and Complex Types in Hive
• Data Models in Hive
• Hive Script and Hive UDF


