
Since it is procedural, you can control the execution of every step.

If you want to write your own UDF (User Defined Function) and inject it at one
specific point in the pipeline, it is straightforward.

The data schema is not enforced explicitly but implicitly. I think this is a big
one, too.
In my experience, 90% of the time spent debugging Pig scripts goes to schema
problems. Since Pig does not enforce an explicit schema, a field sometimes becomes
bytearray, which is a "raw" data type; unless you coerce the fields, even strings
turn into bytearray without notice.
This may propagate to later steps of the data processing.

You can write your UDFs in Python.


If you have UDFs that you want to parallelize and apply to large amounts of data,
then you are in luck.
Use Pig as the base pipeline that does the hard work, and just apply your UDF
in the step that you want.
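As a rough illustration of the point above, here is a minimal sketch of a Python UDF. The file name myudfs.py, the function normalize_event_type and its behaviour are made up for this example. Pig supplies the outputSchema decorator when it registers a Jython script; the fallback definition below is an assumption added only so the same file also runs as plain Python for local testing.

```python
# Hypothetical file: myudfs.py (name chosen for illustration only).

# Pig provides outputSchema when the script is registered with Jython.
# This fallback is only here so the file can be tested outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema  # keep the declared schema for inspection
            return func
        return wrap


@outputSchema("eventType:chararray")
def normalize_event_type(value):
    """Return a trimmed, upper-cased event type, or None for missing input."""
    if value is None:
        return None
    # str() coerces the value explicitly instead of trusting the incoming type.
    return str(value).strip().upper()
```

In a Pig script this would typically be wired in with something like REGISTER 'myudfs.py' USING jython AS myudfs; and then called inside a FOREACH ... GENERATE step as myudfs.normalize_event_type(eventType).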

PigServer: a class for Java programs to connect to Pig. Typically a program will
create a PigServer instance.

pig -x local myscript.pig

pig

Basic commands

sh ls

clear

help

Execute Pig commands

truck_events1 = LOAD '/user/centos/drivers.csv' USING PigStorage(',');


DESCRIBE truck_events1;

truck_events2 = LOAD '/user/centos/drivers.csv' USING PigStorage(',')
AS (driverId:int, truckId:int, eventTime:chararray,
eventType:chararray, longitude:double, latitude:double,
eventKey:chararray, correlationId:long, driverName:chararray,
routeId:long, routeName:chararray, eventDate:chararray);
DESCRIBE truck_events2;

truck_events_subset = LIMIT truck_events2 10;


DESCRIBE truck_events_subset;

DUMP truck_events_subset;

specific_columns = FOREACH truck_events_subset GENERATE driverId, eventTime,
eventType;
DESCRIBE specific_columns;
STORE specific_columns INTO 'output1/specific_columns' USING PigStorage(',');

orders = load '/user/centos/data1.csv' using PigStorage(',') as
(cstrId:int, itmId:int, orderDate:long, deliveryDate:long);
grpd = group orders by cstrId;
items_by_customer = foreach grpd generate group as cstrId, COUNT(orders) as itemCnt;
describe items_by_customer;
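To make the GROUP and COUNT steps above easier to follow, here is a plain-Python sketch of what they compute. The rows are made up for illustration and are not the contents of data1.csv; the function name count_items_by_customer is also hypothetical.

```python
from collections import Counter

# Made-up rows in the order (cstrId, itmId, orderDate, deliveryDate).
orders = [
    (1, 100, 20200101, 20200105),
    (1, 101, 20200102, 20200106),
    (2, 100, 20200103, 20200107),
]


def count_items_by_customer(rows):
    """Mimic 'group orders by cstrId' followed by COUNT(orders)."""
    counts = Counter(row[0] for row in rows)  # row[0] is cstrId
    return sorted(counts.items())             # list of (cstrId, itemCnt)


items_by_customer = count_items_by_customer(orders)
print(items_by_customer)  # [(1, 2), (2, 1)]
```

Each output pair corresponds to one (cstrId, itemCnt) tuple in items_by_customer.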

orders = load '/user/centos/data1.csv' using PigStorage(',') as
(cstrId:int, itmId:int, orderDate:long, deliveryDate:long);
cstr_info = load '/user/centos/information.csv' using PigStorage(',') as
(cstrId:int, name:chararray, city:chararray);
jnd = join orders by cstrId, cstr_info by cstrId;
describe jnd;
jnd_grp = group jnd by (orders::itmId, cstr_info::city);
describe jnd_grp;
result = foreach jnd_grp generate FLATTEN(group), COUNT(jnd) as cnt;
describe result;
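The join, the grouping on (itmId, city) and the flattened count can be sketched in plain Python as follows. All rows and the function name count_items_by_city are made up for illustration; this mimics the logic of the script on a single machine and says nothing about how Pig executes it.

```python
from collections import Counter

# Made-up rows: (cstrId, itmId, orderDate, deliveryDate)
orders = [
    (1, 100, 20200101, 20200105),
    (1, 101, 20200102, 20200106),
    (2, 100, 20200103, 20200107),
    (3, 100, 20200104, 20200108),
]
# Made-up rows: (cstrId, name, city)
cstr_info = [
    (1, "Ann", "Austin"),
    (2, "Bob", "Boston"),
    (3, "Cid", "Austin"),
]


def count_items_by_city(order_rows, info_rows):
    """Inner join on cstrId, group by (itmId, city), then count each group."""
    info_by_id = {}
    for cstr_id, name, city in info_rows:
        info_by_id.setdefault(cstr_id, []).append((name, city))

    counts = Counter()
    for cstr_id, itm_id, _order_date, _delivery_date in order_rows:
        # Orders without a matching customer are dropped, as in an inner join.
        for _name, city in info_by_id.get(cstr_id, []):
            counts[(itm_id, city)] += 1

    # Flatten the (itmId, city) key into the output tuple, like FLATTEN(group).
    return sorted((itm, city, cnt) for (itm, city), cnt in counts.items())


result = count_items_by_city(orders, cstr_info)
print(result)  # [(100, 'Austin', 2), (100, 'Boston', 1), (101, 'Austin', 1)]
```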
