
Pig Tutorial

Table of contents
1 Overview
2 Java Installation
3 Pig Installation
4 Running the Pig Scripts in Local Mode
5 Running the Pig Scripts in Hadoop Mode
6 Pig Tutorial File
7 Pig Script 1: Query Phrase Popularity
8 Pig Script 2: Temporal Query Phrase Popularity

Copyright © 2007 The Apache Software Foundation. All rights reserved.

1. Overview

The Pig tutorial shows you how to run two Pig scripts in local mode and hadoop mode.

• Local Mode: To run the scripts in local mode, no Hadoop or HDFS installation is required. All files are installed and run from your local host and file system.
• Hadoop Mode: To run the scripts in hadoop (mapreduce) mode, you need access to a Hadoop cluster and HDFS installation.

To get started, follow these basic steps:

1. Install Java.
2. Download the Pig tutorial file and install Pig.
3. Run the Pig scripts, locally or on a Hadoop cluster.

2. Java Installation

Make sure your run-time environment includes the following:

1. Java 1.6 or higher (preferably from Sun)
2. The JAVA_HOME environment variable set to the root of your Java installation.

3. Pig Installation

To install Pig, do the following:

1. Download the Pig tutorial file to your local directory. The Pig tutorial file (tutorial/pigtutorial.tar.gz file in the pig distribution) includes the Pig JAR file (pig.jar) and the tutorial files (tutorial.jar, Pig scripts, log files). These files work with Hadoop 0.18 and provide everything you need to run the Pig scripts.
2. Unzip the Pig tutorial file (the files are stored in a newly created directory, pigtmp).
   $ tar -xzf pigtutorial.tar.gz
3. Move to the pigtmp directory.
4. Review the contents of the Pig tutorial file.
5. Copy the pig.jar file to the appropriate directory on your system. For example: /home/me/pig.
6. Create an environment variable, PIGDIR, and point it to your directory. For example: export PIGDIR=/home/me/pig (bash, sh) or setenv PIGDIR /home/me/pig (tcsh, csh).
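Taken together, the installation steps amount to a short shell session. The following is a minimal sketch only, assuming a bash shell, that pigtutorial.tar.gz has already been downloaded to the current directory, that the unpacked pigtmp directory contains pig.jar as described above, and that /home/me/pig is the directory you chose for the Pig JAR file:

$ java -version                     # confirm Java 1.6 or higher is available
$ tar -xzf pigtutorial.tar.gz       # creates the pigtmp directory
$ cd pigtmp
$ cp pig.jar /home/me/pig/          # copy the Pig JAR file to your chosen directory
$ export PIGDIR=/home/me/pig        # bash/sh; use setenv PIGDIR /home/me/pig for tcsh/csh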

4. Running the Pig Scripts in Local Mode

To run the Pig scripts in local mode, do the following:

1. Move to the pigtmp directory.
2. Review Pig Script 1 and Pig Script 2.
3. Execute the following command (using either script1-local.pig or script2-local.pig):
   $ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig
4. Review the result file (either script1-local-results.txt or script2-local-results.txt):
   $ ls -l script1-local-results.txt
   $ cat script1-local-results.txt

5. Running the Pig Scripts in Hadoop Mode

To run the Pig scripts in hadoop (mapreduce) mode, do the following:

1. Move to the pigtmp directory.
2. Review Pig Script 1 and Pig Script 2.
3. Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory.
   $ hadoop fs -copyFromLocal excite.log.bz2 .
4. Set the HADOOPSITEPATH environment variable to the location of your hadoop-site.xml file.
5. Execute the following command (using either script1-hadoop.pig or script2-hadoop.pig):
   $ java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script1-hadoop.pig
6. Review the result files (located in either the script1-hadoop-results or script2-hadoop-results HDFS directory):
   $ hadoop fs -ls script1-hadoop-results
   $ hadoop fs -cat 'script1-hadoop-results/*' | less
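For example, a complete run of the second script in each mode might look like the session below. This is a sketch only, assuming PIGDIR and HADOOPSITEPATH are already set as described above and that you are starting from the pigtmp directory:

# Local mode: input and results stay on the local file system
$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script2-local.pig
$ cat script2-local-results.txt

# Hadoop (mapreduce) mode: input and results live in HDFS
$ hadoop fs -copyFromLocal excite.log.bz2 .
$ java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main script2-hadoop.pig
$ hadoop fs -cat 'script2-hadoop-results/*' | less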

6. Pig Tutorial File

The contents of the Pig tutorial file (pigtutorial.tar.gz) are described here.

File                 Description
pig.jar              Pig JAR file
tutorial.jar         User-defined functions (UDFs) and Java classes
script1-local.pig    Pig Script 1, Query Phrase Popularity (local mode)
script1-hadoop.pig   Pig Script 1, Query Phrase Popularity (Hadoop cluster)
script2-local.pig    Pig Script 2, Temporal Query Phrase Popularity (local mode)
script2-hadoop.pig   Pig Script 2, Temporal Query Phrase Popularity (Hadoop cluster)
excite-small.log     Log file, Excite search engine (local mode)
excite.log.bz2       Log file, Excite search engine (Hadoop cluster)

The user-defined functions (UDFs) are described here.

UDF              Description
ExtractHour      Extracts the hour from the record.
NGramGenerator   Composes n-grams from the set of words.
NonURLDetector   Removes the record if the query field is empty or a URL.
ScoreGenerator   Calculates a "popularity" score for the n-gram.
ToLower          Changes the query field to lowercase.
TutorialUtil     Divides the query string into a set of words.
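The UDFs live in tutorial.jar and are referenced by their full Java package name (org.apache.pig.tutorial) once the JAR file is registered. The fragment below is a minimal sketch of calling two of them, assuming Pig is started from the pigtmp directory so that tutorial.jar and excite-small.log are both on the local path:

REGISTER ./tutorial.jar;
raw   = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
clean = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
lower = FOREACH clean GENERATE org.apache.pig.tutorial.ToLower(query) as query;
DUMP lower;   -- print the cleaned, lowercased queries instead of storing them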

7. Pig Script 1: Query Phrase Popularity

The Query Phrase Popularity script (script1-local.pig or script1-hadoop.pig) processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day.

The script is shown here:

• Register the tutorial JAR file so that the included UDFs can be called in the script.
  REGISTER ./tutorial.jar;
• Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields user, time, and query.
  raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
• Call the NonURLDetector UDF to remove records if the query field is empty or a URL.
  clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
• Call the ToLower UDF to change the query field to lowercase.
  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
• Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour (HH) from the time field.
  houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
• Call the NGramGenerator UDF to compose the n-grams of the query.
  ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
• Use the DISTINCT operator to get the unique n-grams for all records.
  ngramed2 = DISTINCT ngramed1;
• Use the GROUP operator to group records by n-gram and hour.
  hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
• Use the COUNT function to get the count (occurrences) of each n-gram.
  hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
• Use the GROUP operator to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour.
  uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
• For each group, identify the hour in which this n-gram is used with a particularly high frequency. Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
  uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
• Use the FOREACH-GENERATE operator to assign names to the fields.
  uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
• Use the FILTER operator to remove all records with a score less than or equal to 2.0.
  filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
• Use the ORDER operator to sort the remaining records by hour and score.
  ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
• Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: hour, ngram, score, count, mean.
  STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage();
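While working through a script like this, it can help to inspect the intermediate relations. The lines below are a sketch of such a check, assuming the statements above have been entered interactively in Pig's grunt shell rather than run as a batch script:

DESCRIBE hour_frequency2;       -- schema after grouping by (ngram, hour) and counting
DESCRIBE uniq_frequency3;       -- the renamed fields: hour, ngram, score, count, mean
DUMP ordered_uniq_frequency;    -- print the final relation instead of storing it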

8. Pig Script 2: Temporal Query Phrase Popularity

The Temporal Query Phrase Popularity script (script2-local.pig or script2-hadoop.pig) processes a search query log file from the Excite search engine and compares the frequency of occurrence of search phrases across two time periods separated by twelve hours.

The script is shown here:

• Register the tutorial JAR file so that the user-defined functions (UDFs) can be called in the script.
  REGISTER ./tutorial.jar;
• Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields user, time, and query.
  raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);

• Call the NonURLDetector UDF to remove records if the query field is empty or a URL.
  clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
• Call the ToLower UDF to change the query field to lowercase.
  clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
• Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour from the time field.
  houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
• Call the NGramGenerator UDF to compose the n-grams of the query.
  ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
• Use the DISTINCT operator to get the unique n-grams for all records.
  ngramed2 = DISTINCT ngramed1;
• Use the GROUP operator to group the records by n-gram and hour.
  hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
• Use the COUNT function to get the count (occurrences) of each n-gram.
  hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
• Use the FOREACH-GENERATE operator to assign names to the fields.
  hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram, $1 as hour, $2 as count;
• Use the FILTER operator to get the n-grams for hour '00'.
  hour00 = FILTER hour_frequency2 BY hour eq '00';
• Use the FILTER operator to get the n-grams for hour '12'.
  hour12 = FILTER hour_frequency3 BY hour eq '12';
• Use the JOIN operator to get the n-grams that appear in both hours.
  same = JOIN hour00 BY $0, hour12 BY $0;

• Use the FOREACH-GENERATE operator to record their frequency.
  same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as ngram, $2 as count00, $5 as count12;
• Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: ngram, count00, count12.
  STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
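Once the script finishes, the stored relation can be reviewed in the same way as the results of the first script. The commands below are a sketch for a local-mode run, assuming PigStorage wrote the output under the /tmp/tutorial-join-results path used in the STORE statement above; in hadoop (mapreduce) mode, review the HDFS output directory as shown in section 5 instead:

$ ls -l /tmp/tutorial-join-results
$ cat /tmp/tutorial-join-results/*    # assumes the output path is a directory of part files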