
Introduction

Pig is a high-level platform, or tool, for processing massive data sets. It provides a high
level of abstraction over MapReduce computation. It includes a high-level scripting language
called Pig Latin, which is used to write data analysis programs. Programmers write scripts in
the Pig Latin language to process the data stored in the Hadoop Distributed File System (HDFS).
Internally, the Pig Engine (an Apache Pig component) converts all of these scripts into Map
and Reduce jobs. However, to provide a high level of abstraction, these jobs are not
visible to programmers. The Apache Pig tool's two primary components are Pig Latin and the Pig
Engine. Pig's output is always saved in HDFS.
Pig and Pig Latin are closely related to each other. Pig provides a high-level
language known as Pig Latin for writing data analysis applications. This language offers several
operators, and programmers can also create their own functions for reading, writing, and
processing data.
To analyze data using Apache Pig, programmers must write scripts in the Pig Latin language.
Internally, all of these scripts are converted into Map and Reduce jobs. The Pig Engine component
of Apache Pig accepts Pig Latin scripts as input and converts them into MapReduce jobs.
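For illustration, here is a minimal Pig Latin script of the kind described above; the file name
student.txt and its fields are assumptions made for this example. When the STORE statement is
executed, Pig compiles the script into MapReduce jobs behind the scenes.

-- load a tab-delimited file from HDFS into a relation (file and schema are assumed)
students = LOAD 'student.txt' USING PigStorage('\t')
           AS (id:int, name:chararray, city:chararray);

-- save the relation back to HDFS; this is what triggers the underlying MapReduce job
STORE students INTO 'student_out' USING PigStorage(',');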

Introduction to Apache Pig


Apache Pig is an open-source platform developed by Yahoo! that makes it easier to
work with large datasets. It is built on top of the Hadoop ecosystem and provides a high-level
scripting language, Pig Latin, for expressing data transformations and data analysis tasks. Pig
simplifies the process of writing complex MapReduce jobs, making it accessible to a broader
range of users, including those without deep programming expertise.

Why Use Apache Pig?


There are several reasons why Apache Pig is a valuable tool for working with big data:
 Abstraction Layer: Pig provides an abstraction layer over Hadoop, which means
users can write data transformation operations in a more user-friendly language, Pig
Latin, rather than dealing with the low-level intricacies of MapReduce.
 Simplicity: Pig Latin scripts are more straightforward to write and understand
compared to Java-based MapReduce programs. This simplicity can lead to faster
development and easier maintenance.
 Reusability: Pig Latin scripts are modular, allowing users to reuse and share them
across different data processing tasks, which can significantly improve development
efficiency.
 Optimization: Pig optimizes the execution of queries automatically, taking care of
query planning and execution, which reduces the need for manual optimization.
 Versatility: Pig can handle a wide range of data sources, including structured and
semi-structured data, making it versatile for a variety of data processing tasks.
 Scalability: Pig is highly scalable and can process large datasets efficiently by
distributing workloads across a Hadoop cluster.

Use Cases for Pig


Pig is well-suited for a wide range of data processing and analysis tasks, making it a
valuable tool in various industries. Some common use cases for Pig include:
 Log Processing: Organizations can use Pig to process log data generated by web
servers, applications, or network devices to gain insights and monitor system health.
 ETL (Extract, Transform, Load): Pig can be used for data extraction,
transformation, and loading tasks, which are common in data warehousing and
business intelligence applications.
 Text Analysis: Text processing and analysis, such as sentiment analysis, can be
performed with Pig to derive valuable insights from unstructured textual data.
 Data Cleaning: Pig can help clean and preprocess raw data, ensuring that it is
ready for further analysis or machine learning tasks.
 Graph Processing: Pig's iterative processing capabilities are well-suited for graph
algorithms and analytics, making it useful in social network analysis and
recommendation systems.
 Machine Learning: Pig can be used for iterative machine learning algorithms,
especially when coupled with libraries like Apache Mahout.
 Data Aggregation: Aggregating and summarizing data for reporting and analytics
is a common use case, and Pig simplifies this process.
 Exploratory Data Analysis (EDA): Pig can be used for initial data exploration,
helping data scientists and analysts understand the characteristics of their datasets.

Architecture of Apache PIG:


 PIG Latin – Pig Latin is a high-level data processing language that enables
users/developers to write code for processing and analysing data.
 Runtime Environment – The runtime environment is an execution mechanism
(platform) for running PIG Latin programs.
The PIG architecture comprises various elements, including the parser, optimiser,
compiler and, finally, the execution engine.

Apache PIG execution modes:


 Local mode: In this mode, the files are accessed from the local host and local file
system.
 MapReduce Mode: In this mode, the files are accessed from the Hadoop file
system (HDFS).

Apache PIG execution mechanism:


The programs written in Apache PIG can be executed in three ways, sketched briefly below:
 Interactive Mode (Grunt Shell)
 Batch Mode (Script)
 Embedded Mode (UDF)
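As a rough sketch (the script and file names here are assumptions), the first two mechanisms
can be invoked as follows; the -x flag selects local or MapReduce execution mode.

Interactive mode (Grunt shell):
    $ pig -x local            (or: pig -x mapreduce)
    grunt> students = LOAD 'student.txt' AS (id:int, name:chararray);
    grunt> DUMP students;

Batch mode (script):
    $ pig -x mapreduce myscript.pig

In embedded mode, Pig Latin statements are issued from a host language such as Java (for
example through the PigServer class), typically together with user-defined functions.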

Advantages of Using Pig


Using Apache Pig and Pig Latin offers several advantages when working with big
data:
 Productivity: Pig Latin's high-level constructs and simplicity allow developers and
data analysts to be more productive. They can focus on the logic of data processing
rather than low-level coding details.
 Scalability: Pig can handle large datasets by distributing processing across a
Hadoop cluster, which ensures scalability and performance.
 Code Reusability: Pig Latin scripts are modular and can be reused across different
projects, reducing development time and effort.
 Optimization: Pig handles many optimizations automatically, including query
planning and execution optimization, reducing the need for manual tuning.
 Extensibility: The ability to incorporate custom UDFs allows you to tailor Pig to
your specific data processing requirements.
 Error Handling: Pig provides robust error handling and debugging tools, making it
easier to diagnose and fix issues in data processing.
 Iterative Processing: Pig supports iterative processing, which is crucial for
machine learning and graph algorithms.
 Ecosystem Integration: Pig seamlessly integrates with other Hadoop ecosystem
tools like Hive and HBase, expanding its capabilities.

Limitations of Pig
While Apache Pig is a powerful tool, it does have some limitations:
 Learning Curve: Pig Latin may have a learning curve for those who are not
familiar with it, although it is generally easier to learn than low-level MapReduce
programming.
 Performance Overhead: Pig introduces some performance overhead due to the
translation from Pig Latin to MapReduce jobs. For very simple tasks, the direct use
of MapReduce may be more efficient.
 Customization: While Pig offers extensibility through UDFs, complex custom
operations may be more efficiently implemented using low-level MapReduce.
 Real-time Processing: Pig is more suited for batch processing, and it may not be
the best choice for real-time data processing requirements.

Introduction to PIG Latin


Today, when organisations are gathering huge amounts of data, popular
websites such as Facebook and Instagram make use of Big Data technology to store,
process and analyse the data for later use. The Hadoop framework is an innovative solution
for supporting big data. This framework consists mainly of two components, the Hadoop
Distributed File System (HDFS) and MapReduce. While HDFS helps in storing big data,
MapReduce helps in processing it. The MapReduce programming paradigm is written in Java.
Apache PIG is a tool for processing and analysing massive data sets, i.e. big data, as data
flows. It is a high-level scripting language built over MapReduce for expressing data analysis
programs. Apache PIG provides an abstraction that reduces the complexity of developing
MapReduce programs for developers. The scripting language used by PIG is Pig Latin. Apache Pig
was developed in 2006 by Yahoo to create and manipulate MapReduce tasks on large datasets.
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this
chapter, we are going to discuss the basics of Pig Latin, such as Pig Latin statements, data
types, general and relational operators, and Pig Latin UDFs.
Need of Pig and Pig Latin
Programmers who are not fluent in Java typically struggle when working
with Hadoop, particularly when writing MapReduce jobs. Apache Pig is a godsend for all of
these programmers.
Using Pig and Pig Latin, programmers can readily perform MapReduce tasks without
having to write sophisticated Java code.
Apache Pig employs a multi-query approach, which reduces code length. For example,
an operation that would require 200 lines of code (LoC) in Java can be completed in Apache Pig
in as little as 10 LoC. As a result, Apache Pig cuts development time by nearly a factor of 16.
Apache Pig includes many built-in operators to help with data operations such as joins,
filters, sorting, etc. Furthermore, it provides nested data types, such as tuples, bags, and maps,
that MapReduce lacks.
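As a rough illustration of this conciseness (the input and output paths are assumptions), a
word count that needs dozens of lines of Java MapReduce code fits in a handful of Pig Latin
statements:

lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';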

Features of Pig and Pig Latin


Apache Pig and Pig Latin come with the following features.
 Rich set of operators: It has a variety of operators for performing operations like
join, sort, filter, etc.
 It handles all kinds of data: Apache Pig analyses all kinds of data, both structured
and unstructured, and the results are saved in HDFS.
 User-Defined Functions (UDFs): Pig allows you to write user-defined functions in
other programming languages, such as Java, and then invoke or embed them in Pig
scripts (see the sketch after this list).
 Extensibility: Using the existing operators, users can create their own functions to
read, process, and write data.
 Ease of programming: Pig Latin is comparable to SQL, and writing a Pig script is
simple if you know SQL.
 Optimization opportunities: The jobs in Apache Pig optimize their execution
automatically, so programmers need to focus only on the semantics of the language.
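For example, a Java UDF can be registered and invoked from a Pig script roughly as follows;
the jar name myudfs.jar and the class myudfs.UPPER are hypothetical and stand in for a
user-supplied function.

-- register the jar containing the user-defined function (hypothetical jar and class)
REGISTER myudfs.jar;

students    = LOAD 'student.txt' AS (id:int, name:chararray);
-- invoke the Java UDF by its fully qualified class name
upper_names = FOREACH students GENERATE id, myudfs.UPPER(name);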

Applications of Pig and Pig Latin


A few applications of Pig and Pig Latin are mentioned below.
 Pig scripts are used for exploring massive databases
 Pig and Pig Latin provide support for ad-hoc queries across huge data sets
 Pig scripts aid in the development of massive data set processing methods
 Pig is required for the processing of time-sensitive data loads.
 Pig scripts are used to collect massive volumes of data in search logs and web
crawls.

Pig Latin – Data Model


As discussed in the previous chapters, the data model of Pig is fully nested.
A relation is the outermost structure of the Pig Latin data model. It is a bag (see the example
sketched after this list), where −
 A bag is a collection of tuples.
 A tuple is an ordered set of fields.
 A field is a piece of data.
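A small sketch of this nesting, with made-up values:

field:    'Raju'                                   (a single piece of data)
tuple:    (1, 'Raju', 30)                          (an ordered set of fields)
bag:      {(1, 'Raju', 30), (2, 'Rajesh', 45)}     (a collection of tuples)
relation: the outermost bag, referred to in scripts by an alias such as students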

Pig Latin – Statements


While processing data using Pig Latin, statements are the basic constructs.
 These statements work with relations. They include expressions and schemas.
 Every statement ends with a semicolon (;).
 We perform various operations through statements, using the operators provided by
Pig Latin.
 Except for LOAD and STORE, all other Pig Latin statements take a relation as
input and produce another relation as output.
 As soon as you enter a LOAD statement in the Grunt shell, only its semantic
checking is carried out. To see the contents of the relation, you need to use
the DUMP operator; the MapReduce job that actually loads the data is carried out
only after the dump operation is performed, as illustrated below.
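A small sketch of this behaviour in the Grunt shell (the file name and schema are assumptions):

grunt> students = LOAD 'student.txt' AS (id:int, name:chararray, city:chararray);
grunt> DUMP students;     -- the MapReduce job runs only at this point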

Pig Latin – Data types


The simple data types of PIG are int, long, float, double, chararray, bytearray,
boolean, datetime, biginteger and bigdecimal, whereas bag, tuple and map are the complex
data types in PIG. Any of the above data types can hold a NULL value, which represents an
unknown or non-existent value. Apache PIG treats NULL values in a similar way to SQL.
The table below describes the Pig Latin data types.

S.N.  Data Type    Description & Example

1.    int          Represents a signed 32-bit integer. Example: 8
2.    long         Represents a signed 64-bit integer. Example: 5L
3.    float        Represents a signed 32-bit floating point. Example: 5.5F
4.    double       Represents a 64-bit floating point. Example: 10.5
5.    chararray    Represents a character array (string) in Unicode UTF-8 format.
                   Example: ‘tutorials point’
6.    bytearray    Represents a byte array (blob).
7.    boolean      Represents a Boolean value. Example: true/false
8.    datetime     Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9.    biginteger   Represents a Java BigInteger. Example: 60708090709
10.   bigdecimal   Represents a Java BigDecimal. Example: 185.98376256272893883

Complex Data Types

11.   tuple        An ordered set of fields. Example: (raja, 30)
12.   bag          A collection of tuples. Example: {(raju, 30), (Mohammad, 45)}
13.   map          A set of key-value pairs. Example: [‘name’#’Raju’, ‘age’#30]
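These types usually appear in the AS clause of a LOAD statement. A minimal sketch, with an
assumed file name and assumed field names:

emp = LOAD 'employees.txt' USING PigStorage(',')
      AS (id:int, name:chararray, salary:double, joined:datetime,
          skills:bag{t:(skill:chararray)}, details:map[]);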

Pig Latin – Arithmetic Operators


The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b =
20.

+      Addition − adds the values on either side of the operator.
       Example: a + b gives 30

-      Subtraction − subtracts the right-hand operand from the left-hand operand.
       Example: a - b gives -10

*      Multiplication − multiplies the values on either side of the operator.
       Example: a * b gives 200

/      Division − divides the left-hand operand by the right-hand operand.
       Example: b / a gives 2

%      Modulus − divides the left-hand operand by the right-hand operand and returns the
       remainder. Example: b % a gives 0

?:     Bincond − evaluates a Boolean expression. It has three operands:
       variable x = (expression) ? value1 (if true) : value2 (if false).
       Example: b = (a == 1) ? 20 : 30;
       if a == 1, the value of b is 20; if a != 1, the value of b is 30.

CASE WHEN THEN ELSE END
       Case − the case operator is equivalent to a nested bincond operator.
       Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
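A small sketch using these operators inside a FOREACH statement (the relation and field
names are assumptions):

results = FOREACH marks GENERATE
            name,
            (maths + science) / 2               AS average,
            (maths % 2 == 0 ? 'even' : 'odd')   AS parity;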

Pig Latin – Comparison Operators


The following table describes the comparison operators of Pig Latin.

==        Equal − checks whether the values of the two operands are equal; if yes, the
          condition becomes true. Example: (a == b) is not true.

!=        Not equal − checks whether the values of the two operands are equal; if the values
          are not equal, the condition becomes true. Example: (a != b) is true.

>         Greater than − checks whether the value of the left operand is greater than the
          value of the right operand; if yes, the condition becomes true.
          Example: (a > b) is not true.

<         Less than − checks whether the value of the left operand is less than the value of
          the right operand; if yes, the condition becomes true. Example: (a < b) is true.

>=        Greater than or equal to − checks whether the value of the left operand is greater
          than or equal to the value of the right operand; if yes, the condition becomes true.
          Example: (a >= b) is not true.

<=        Less than or equal to − checks whether the value of the left operand is less than
          or equal to the value of the right operand; if yes, the condition becomes true.
          Example: (a <= b) is true.

matches   Pattern matching − checks whether the string on the left-hand side matches the
          constant (regular expression) on the right-hand side.
          Example: f1 matches '.*tutorial.*'
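A sketch of these operators inside FILTER conditions (the relation and field names are
assumptions):

adults    = FILTER people BY age >= 18;
tutorials = FILTER pages  BY url matches '.*tutorial.*';
locals    = FILTER people BY (city == 'Chennai') AND (age < 60);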

Pig Latin – Type Construction Operators


The following table describes the Type construction operators of Pig Latin.

()    Tuple constructor operator − used to construct a tuple.
      Example: (Raju, 30)

{}    Bag constructor operator − used to construct a bag.
      Example: {(Raju, 30), (Mohammad, 45)}

[]    Map constructor operator − used to construct a map.
      Example: [name#Raja, age#30]

Pig Latin – Relational Operations


The following table describes the relational operators of Pig Latin.

Loading and Storing

LOAD        – To load the data from the file system (local/HDFS) into a relation.
STORE       – To save a relation to the file system (local/HDFS).

Filtering

FILTER      – To remove unwanted rows from a relation.
DISTINCT    – To remove duplicate rows from a relation.
FOREACH, GENERATE – To generate data transformations based on columns of data.
STREAM      – To transform a relation using an external program.

Grouping and Joining

JOIN        – To join two or more relations.
COGROUP     – To group the data in two or more relations.
GROUP       – To group the data in a single relation.
CROSS       – To create the cross product of two or more relations.

Sorting

ORDER       – To arrange a relation in a sorted order based on one or more fields
              (ascending or descending).
LIMIT       – To get a limited number of tuples from a relation.

Combining and Splitting

UNION       – To combine two or more relations into a single relation.
SPLIT       – To split a single relation into two or more relations.

Diagnostic Operators

DUMP        – To print the contents of a relation on the console.
DESCRIBE    – To describe the schema of a relation.
EXPLAIN     – To view the logical, physical, or MapReduce execution plans used to compute
              a relation.
ILLUSTRATE  – To view the step-by-step execution of a series of statements.
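A short sketch chaining several of these operators (the file, relation and field names are
assumptions):

raw      = LOAD 'sales.txt' USING PigStorage(',')
           AS (region:chararray, product:chararray, amount:double);
filtered = FILTER raw BY amount > 0;
grouped  = GROUP filtered BY region;
totals   = FOREACH grouped GENERATE group AS region, SUM(filtered.amount) AS total;
ordered  = ORDER totals BY total DESC;
top3     = LIMIT ordered 3;
DUMP top3;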

Conclusion
Apache Pig and its query language, Pig Latin, are powerful tools for simplifying big
data processing. They provide an abstraction layer over the Hadoop ecosystem, making it
easier for developers and data analysts to work with large datasets. Pig's high-level language,
modular design, and optimization capabilities make it a valuable addition to the toolkit of
anyone working with big data.
As organizations continue to generate vast amounts of data, tools like Pig are essential
for processing, analyzing, and extracting valuable insights. Whether you are performing log
analysis, data cleaning, machine learning, or any other data-related task, Apache Pig is a
versatile and efficient choice for simplifying the process of working with big data.
