• Please log in 10 minutes before the class starts and check your internet connection to avoid any network issues during the LIVE session
• All participants will be muted by default to avoid any background noise. However, the instructor will unmute you if required. Please use the "Questions" tab on your webinar tool to interact with the instructor at any point during the class
• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic
• If you want to connect to your Personal Learning Manager (PLM), dial +917618772501
• We have a dedicated support team to assist with all your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772
• Your feedback is very much appreciated. Please share feedback after each class; it will help us enhance your learning experience
✓ Join
✓ Group
✓ Filter
✓ Sort
✓ and more…
[Chart: bar graphs comparing Hadoop (MapReduce) and Pig; one axis measures development time in minutes]
Pig
▪ Handles unstructured data
▪ Similar to SQL: easy to learn, easy to write, and easy to read; reads like a series of steps
▪ Extensible by UDFs (User Defined Functions) written in Java, Python, JavaScript, or Ruby
▪ Open source and actively supported by a community of developers
Pig sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.
It provides a simple language for queries and data manipulation, Pig Latin, which is compiled into MapReduce jobs that run on Hadoop.
Why is it Important?
▪ Companies like Yahoo, Google and Microsoft are collecting enormous data sets
in the form of click streams, search logs, and web crawls.
▪ Some form of ad-hoc processing and analysis of all of this information is
required.
[Example dataflow: join VISITS and PAGES on url, group by user, compute the average pagerank per user, and filter users with avgPR > 0.5]
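A Pig Latin sketch of this dataflow (relation and field names are illustrative, taken from the diagram labels):
Visits = LOAD 'visits' AS (user:chararray, url:chararray, time:chararray);
Pages  = LOAD 'pages' AS (url:chararray, pagerank:double);
VP = JOIN Visits BY url, Pages BY url;                                        -- join on url
ByUser = GROUP VP BY user;                                                    -- group by user
AvgPR = FOREACH ByUser GENERATE group AS user, AVG(VP.pagerank) AS avgPR;     -- compute average pagerank
Result = FILTER AvgPR BY avgPR > 0.5;                                         -- keep users with avgPR > 0.5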
▪ Typical Pig workloads: production data pipelines and research (ad-hoc analysis).
Script: Pig can run a script file that contains Pig commands.
Example: pig script.pig runs the commands in the local file script.pig.
Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec (execute).
Embedded: You can run Pig programs from Java, much as you can use JDBC to run SQL programs from Java.
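A quick sketch of the first two modes (the script name is illustrative):
pig -x local script.pig        launches the script in local mode
pig script.pig                 launches the script on the Hadoop cluster (MapReduce mode)
grunt> exec script.pig         runs the script from Grunt in a separate context
grunt> run script.pig          runs the script in the current Grunt context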
▪ Case insensitive: the names of parameters and all other Pig reserved keywords.
▪ Case sensitive: the names (aliases) of fields, e.g. name and age.
Example:
▪ fs: invokes any Hadoop FsShell command from within a Pig script or the Grunt shell. Example: listing the files present in HDFS.
▪ sh: invokes any shell command from within a Pig script or the Grunt shell. Example: checking the list of files present in the local directory.
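How these look in the Grunt shell (the HDFS path is illustrative):
grunt> fs -ls /user/edureka
grunt> sh ls -l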
2. Execution Environments
▪ Local mode: Pig runs in a single JVM on the user machine.
▪ MapReduce mode: distributed execution on a Hadoop cluster.
Data Model Types: Atom, Tuple, Bag, Map
▪ A Data Map is a map from keys that are string literals to values that can be any data type.
▪ Square brackets [ ] are used to indicate the map data type.
Pig type : Java class
Bag : org.apache.pig.data.DataBag
Tuple : org.apache.pig.data.Tuple
Map : java.util.Map<Object, Object>
Integer : java.lang.Integer
Long : java.lang.Long
Float : java.lang.Float
Double : java.lang.Double
Chararray : java.lang.String
Bytearray : byte[]
▪ A = LOAD 'edureka' AS (name, age);
▪ B = FOREACH A GENERATE age + 5;
▪ For example, in relation B, age is converted to integer because 5 is an integer.
▪ However, it is not mandatory to always assign a schema. If we don't assign data types, the default type bytearray is assigned to the fields, and implicit conversions are applied depending on the context in which each field is used.
▪ If a schema is defined as part of a load statement, the load function tries to assign the given schema.
▪ However, if the data does not conform to the given schema, Pig will generate a null value or an error.
▪ If Pig cannot resolve incompatible types through implicit casts, Pig will report an error.
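The same idea with an explicit schema, as a minimal sketch (the file name 'edureka' follows the example above):
A = LOAD 'edureka' AS (name:chararray, age:int);
B = FOREACH A GENERATE age + 5;
DESCRIBE B;    -- prints the schema Pig inferred for the generated field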
('Edureka', 25)
▪ The above declares a tuple constant with two fields of data types chararray and int respectively.
▪ Hence, we can reference individual fields in the tuple by their position ($0 refers to the first field in the tuple).
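A small sketch of positional references (field names assumed from the earlier example):
A = LOAD 'edureka' AS (name:chararray, age:int);
B = FOREACH A GENERATE $0;    -- $0 is the first field, here name
DUMP B;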
X = GROUP A BY age;
DUMP X;
(25,{(aron,25), (bruno,25)})
(45,{(sam,45), (Rony,45)})
▪ The map constants below each define two key-value pairs. Notice that the keys are always of type chararray, while the values here take types chararray and int respectively.
[name#aron, age#25]
[name#sam, age#26]
(aron)
(sam)
▪ We can choose not to specify the data type of the values in a map, as below.
▪ In this case Pig assumes the type of the values to be bytearray and performs implicit casts to the appropriate type depending on how your Pig Latin statements handle the data.
▪ In the second statement we retrieve the value associated with the key 'name'. Notice the syntax a#'name'.
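The statements producing the (aron) and (sam) output above are not shown on the slide; a sketch under the assumption that the two map records are stored one per line in a file named 'data':
A = LOAD 'data' AS (a:map[]);            -- value types left unspecified, so they default to bytearray
B = FOREACH A GENERATE a#'name';         -- look up the value stored under the key 'name'
DUMP B;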
Loading and Storing:
LOAD : Loads data from the file system or other storage into a relation.
STORE : Saves a relation to the file system or other storage.
DUMP : Prints a relation to the console.
Filtering:
FILTER : Removes unwanted rows from a relation.
DISTINCT : Removes duplicate rows from a relation.
FOREACH...GENERATE : Adds or removes fields from a relation.
STREAM : Transforms a relation using an external program.
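A short sketch tying several of these operators together (file and field names are illustrative):
A = LOAD 'students' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
B = FILTER A BY age >= 18;              -- drop unwanted rows
C = DISTINCT B;                         -- remove duplicate rows
D = FOREACH C GENERATE name, gpa;       -- keep only the fields we need
STORE D INTO 'output';                  -- or DUMP D; to print to the console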
(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)
X = group A by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})
(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})
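The statement producing this second output is not shown; it looks like a COGROUP of A with a second relation. A sketch, assuming a relation B of (name, value) pairs loaded from another file:
B = LOAD 'other_data' AS (name:chararray, value:int);
Y = COGROUP A BY name, B BY name;
dump Y;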
Built-in function categories:
▪ Eval Functions
▪ Load/Store Functions
▪ Math Functions
▪ String Functions
▪ Datetime Functions
▪ Tuple, Bag, Map Functions
▪ Built-in functions don't need to be registered because Pig knows where they are.
▪ Built-in functions don't need to be qualified when they are used because Pig knows where to find them.
▪ Eval functions accept the bag data type as an input parameter and return the result according to the function.
▪ Load/store functions determine how data goes into Pig and comes out of Pig. Pig provides a set of built-in load/store functions.
▪ Math functions allow you to perform mathematical tasks on Pig fields.
▪ String functions are used to manipulate chararray fields.
▪ Datetime functions work on date and time; such fields are loaded as chararray and converted to date and time format using the ToDate function.
▪ Tuple, Bag, Map functions are used to convert fields into complex data types.
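A few of these built-in functions in action (the file and field names are illustrative):
A = LOAD 'employees' AS (name:chararray, salary:double, joined:chararray);
B = FOREACH A GENERATE UPPER(name), ABS(salary), ToDate(joined, 'yyyy-MM-dd');   -- string, math, datetime functions
G = GROUP A ALL;
C = FOREACH G GENERATE COUNT(A), AVG(A.salary);                                  -- eval (aggregate) functions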
http://www.edureka.in/blog/pig-programming-create-your-first-apache-pig-script/
http://www.edureka.in/blog/pig-programming-apache-pig-script-in-local-mode/
http://www.edureka.in/blog/operators-in-apache-pig/
http://www.edureka.in/blog/operators-in-apache-pig-diagnostic-operators/
history [-n] : Displays the list of statements used so far.
  -n : omits line numbers in the list.
set : Shows/assigns values to keys used in Pig. Example: grunt> set debug 'on'
▪ Replicated Joins
▪ Skewed Joins
▪ Merge Joins
▪ alias = JOIN alias BY {joining relation Field}, alias BY {joining relation Field} USING ['replicated' |
'skewed' | 'merge']
▪ A special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the Hadoop work is done on the map side.
▪ A replicated join can be used with more than two tables. In this case, the rightmost tables are read into memory, i.e. the large relation comes first, followed by the smaller relations.
▪ Pig runs the replicated join by loading the small file (the replicated input) into Hadoop's distributed cache.
▪ Example: a customer transaction data set, which could potentially have billions of rows, joined to a smaller geographic data set.
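A sketch of this example (relation and field names are illustrative; 'transactions' is the large relation, 'geography' the small one):
big = LOAD 'transactions' AS (custid:int, amount:double, country:chararray);
small = LOAD 'geography' AS (country:chararray, region:chararray);
J = JOIN big BY country, small BY country USING 'replicated';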
▪ One of the keys is much more common than the others, and the data for it is too large to fit in memory.
▪ Standard joins run in parallel across different reducers by splitting key values across processes.
▪ If there is a lot of data for a certain key, the data will not be distributed evenly across the reducers, and one of them will be 'stuck' processing the majority of the data.
▪ Skewed join handles this case. It calculates a histogram to check which key is the most prevalent and then splits its data across different reducers for optimal performance.
Example:
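A sketch of a skewed join, assuming the country key in a large click-stream relation is heavily skewed (names are illustrative):
clicks = LOAD 'clicks' AS (user:chararray, url:chararray, country:chararray);
geo = LOAD 'geography' AS (country:chararray, region:chararray);
J = JOIN clicks BY country, geo BY country USING 'skewed';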
▪ To do a merge join, both data sets must be sorted in ascending order by the join key.
▪ To do a merge join, we should have only two tables or input files.
▪ Merge join works in MapReduce mode; in local mode a merge join gets converted to a regular join.
Example: if the customer details and geography tables are sorted on country, which is the join key, we can use the merge join.
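A sketch of that example, assuming both inputs are already sorted by country:
cust = LOAD 'customers_sorted' AS (custid:int, name:chararray, country:chararray);
geo = LOAD 'geography_sorted' AS (country:chararray, region:chararray);
J = JOIN cust BY country, geo BY country USING 'merge';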
UDF (User Defined Functions)
▪ The advantage of Pig is its ability to let users combine its operators with their own or others' code via UDFs.
▪ Pig itself comes with some UDFs. In version 0.8, a large number of standard string-processing, math, and complex-type UDFs were added.
// Completed sketch: the slide shows only the method body; the class declaration,
// imports, and exec() signature below are reconstructed so that IsOfAge can be
// used as the filter UDF registered in the next snippet.
import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            // Accept only the listed ages
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
register myudf.jar;
X = filter A by IsOfAge(age);
http://www.edureka.in/blog/apache-pig-udf-part-1-eval-aggregate-filter-functions/
Load Functions
http://www.edureka.in/blog/apache-pig-udf-part-2-load-functions/
Store Functions
http://www.edureka.in/blog/apache-pig-udf-store-functions/
Syntax:
Terms:
▪ Use the STREAM operator to send data through an external script or program.
▪ Multiple stream operators can appear in the same Pig script.
▪ The stream operators can be adjacent to each other or have other operations in between.
▪ Use DEFINE to assign an alias name to the script or shell command, as sketched below.
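A sketch of STREAM with DEFINE, assuming a local script named filter.py that should be shipped to the cluster (the script name is illustrative):
DEFINE mycmd `python filter.py` SHIP('filter.py');
B = STREAM A THROUGH mycmd;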
▪ For instance, if you have daily processing that is identical every day except the date it needs to process, it would be very
convenient to put a placeholder for the date and provide the actual value at run time.
Specifying Parameters:
Syntax:
pig [-x local] -param param_name=param_value <path to your script>
Syntax:
pig [-x local] -param_file file_name <path to your script>
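A sketch of both forms, assuming a script daily.pig that refers to the placeholder as $date (file names and the date are illustrative):
pig -param date=2024-01-01 daily.pig
pig -param_file params.txt daily.pig
Here params.txt would contain a line such as date=2024-01-01, and inside daily.pig the placeholder is referenced as $date, e.g. A = LOAD '/logs/$date';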
▪ The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent
optimizations (such as applying filters early on) also apply.
Challenges:
• A huge amount of data flows into the systems daily, and there are multiple data sources that we need to aggregate data from.
• Crunching this huge volume of data and de-identifying it in the traditional way was problematic.
→http://www.edureka.in/blog/apache-hive-installation-on-ubuntu/
→http://www.edureka.in/blog/apache-hadoop-hive-script/