$ pig
You will get the Grunt prompt: grunt>
You will see some warnings on type conversion. You can ignore them.
4. Since this is unstructured data, the entire line is $0. We will extract the log type from
character positions 24 to 28 of the line, giving a four-character log type.
This gives an error that the function substring is not found. This is because keywords in
Pig are case-insensitive, whereas function names are case-sensitive. So let us modify it to
use capital letters.
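Putting these steps together, the load and extraction might look like the sketch below. The input file names 'log1' and 'log2' are assumptions; the alias names log1f and log2f are taken from the later steps.

```pig
-- Sketch only: input file names 'log1'/'log2' are hypothetical.
log1 = load 'log1';
log2 = load 'log2';

-- SUBSTRING must be in upper case; character positions 24 to 28
-- yield the four-character log type.
log1f = foreach log1 generate SUBSTRING($0, 24, 28) as logtype;
log2f = foreach log2 generate SUBSTRING($0, 24, 28) as logtype;
```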
6. Since we want to count the number of occurrences of each log type, let us do a group by
the log type.
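The grouping step might be written as follows; the alias names log1grp and log2grp are taken from the next step.

```pig
-- Group each relation by the extracted log type.
log1grp = group log1f by logtype;
log2grp = group log2f by logtype;
```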
7. Let us check the structure of log1grp and log2grp. You will see that they are nested
structures with log1f nested as a bag inside log1grp and log2f nested as a bag inside
log2grp.
describe log1grp;
describe log2grp;
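Assuming the schemas sketched earlier, the describe output would be roughly of the form below, showing log1f as a bag nested inside log1grp:

```pig
-- Approximate describe output (exact formatting varies by Pig version):
-- log1grp: {group: chararray, log1f: {(logtype: chararray)}}
-- log2grp: {group: chararray, log2f: {(logtype: chararray)}}
```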
Note that we have used lower case for “foreach” and “generate”, as they are keywords,
whereas “COUNT” must be in upper case as it is a function.
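The counting step this note refers to might look like the sketch below; the field names logtype1/cnt1 and logtype2/cnt2 are assumptions chosen to match the schema used when the results are reloaded in step 12.

```pig
-- Count occurrences of each log type; field names are assumed.
log1cnt = foreach log1grp generate group as logtype1, COUNT(log1f) as cnt1;
log2cnt = foreach log2grp generate group as logtype2, COUNT(log2f) as cnt2;
```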
Note that Pig uses lazy evaluation: a MapReduce job is created only when Pig sees a dump
or a store command. While processing log1cnt, only the aliases pertaining to log1 are
processed; similarly, during log2cnt processing, only the aliases for log2 are processed.
Pig performs this optimization automatically.
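The store commands that trigger these jobs might look like the sketch below; the output directory names 'log1cnt' and 'log2cnt' are assumed from the reload in step 12, and PigStorage(',') matches the delimiter used there.

```pig
-- Each store triggers a MapReduce job for its relation only.
store log1cnt into 'log1cnt' using PigStorage(',');
store log2cnt into 'log2cnt' using PigStorage(',');
```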
You can see that the output is similar to MapReduce output. Output is a directory with
an empty file _SUCCESS and an output file ‘part-r-00000’.
12. Reload these files into the same aliases as before. These will be read as structured data,
so provide the delimiter as well as the schema. Also, we can specify the directory itself
and Pig will load all the files in the directory. Since _SUCCESS is an empty file, this will
not be a problem for us.
log1cnt = load 'log1cnt' using PigStorage(',') as (logtype1: chararray, cnt1: chararray);
log2cnt = load 'log2cnt' using PigStorage(',') as (logtype2: chararray, cnt2: chararray);
13. Join the two log type relations to produce a third relation. Use Full Outer Join as we want
records from both the relations even if there is no match.
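In Pig Latin, the full outer join described here can be sketched as below; the alias logtotal is taken from step 15.

```pig
-- Full outer join keeps records from both relations even without a match.
logtotal = join log1cnt by logtype1 full outer, log2cnt by logtype2;
```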
15. Check the content of the logtotal relation using dump command. Note that Pig starts
MapReduce jobs only on dump or store commands.
dump logtotal;
Note that it processes the data from files upon dump. Output contains fields from both
the relations.
quit;