

Pig is a scripting language, so the coding part is smaller; the scripting language is called Pig Latin. Relations in Pig are temporary storage.

Hive is a warehouse. Pig works from a command prompt (the grunt shell); if we write something there and then exit, we automatically lose the data.
To store a specific relation in Pig, we have the store command.
Client requirement -> write a Pig script -> it goes to MapReduce -> processing on the Hadoop system.
Pig is simple, fast, and rich.
Topics:
1) multiple stages to arrive at solutions
2) from HDFS to relations
3) from LFS to relations
4) joins
5) complex data types
6) foreach, filter, shell scripting

In Hive we create the table first and load afterwards from HDFS or LFS,
but in Pig we first decide which data we are using, HDFS or LFS:
if it is HDFS, use the pig command
if it is LFS, use pig -x local, and after that we process
In a single line we can create the relation and load it from HDFS or LFS.
dump is used for retrieving data.
Instead of string, in Pig we use chararray.
store is the command used to store the output.
1) Give the pig command; it opens the grunt shell, which means we are using HDFS.
2) >$ hadoop fs -put desktop/emp.txt venkat
The emp file is now there in HDFS.
Relation creation:
3) grunt> emp = load '/user/cloudera/venkat/emp.txt' using PigStorage(',') as ...
4) grunt> describe emp;  -- the relation is created and loaded
5) grunt> dump emp;
The job goes to the job tracker.
6) Successfully stored into the emp relation.
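A complete version of the HDFS load above might look like this; the emp schema (eno, ename, esal, dno) is an assumption, since the notes truncate the AS clause:

```pig
-- assumed schema for emp.txt (comma-delimited): eno, ename, esal, dno
emp = LOAD '/user/cloudera/venkat/emp.txt'
      USING PigStorage(',')
      AS (eno:int, ename:chararray, esal:float, dno:int);
DESCRIBE emp;   -- prints the schema of the relation
DUMP emp;       -- runs a MapReduce job and prints the tuples
```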

2) Take LFS on the same shell:

1) grunt> emp = load '/home/cloudera/desktop/emp.txt' using PigStorage(',') as ...
2) grunt> dump emp;
We automatically get an error, because Pig goes to HDFS and we gave a local path.
So we have to say the path is local. Take a new shell:
>$ pig -x local
2) grunt> emp = load '/home/cloudera/desktop/emp.txt' using PigStorage(',') as ...
3) grunt> dump emp;
4) Likewise, what works in local mode fails in HDFS mode when the file is only in LFS.
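The same load in local mode can be sketched like this (same assumed schema as before; run it after starting the shell with pig -x local):

```pig
-- start the shell in local mode first:  pig -x local
emp = LOAD '/home/cloudera/desktop/emp.txt'
      USING PigStorage(',')
      AS (eno:int, ename:chararray, esal:float, dno:int);
DUMP emp;   -- reads from the local file system, not HDFS
```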

#3) joins
2) grunt> dept = load '/home/cloudera/desktop/dept.txt' using PigStorage(',') as ...
3) grunt> dump dept;
The store into the dept relation is completed.
Now we have to do a join between dept and emp:
1) grunt> empdept = join emp by dno, dept by dno;
2) grunt> empdept = join emp by dno left outer, dept by dno;
3) grunt> empdept = join emp by dno right outer, dept by dno;
4) grunt> empdept = join emp by dno full outer, dept by dno;
This full outer output is stored in HDFS:
grunt> store empdept into '/user/cloudera/empdept/';
The data is stored in HDFS. If we want to see the data:
>$ hadoop fs -ls empdept
>$ hadoop fs -cat empdept/part-r-00000
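Putting the join variants above together in one sketch; the emp and dept schemas are assumptions, since the notes truncate both AS clauses:

```pig
-- assumed schemas for the two comma-delimited files
emp  = LOAD '/home/cloudera/desktop/emp.txt'  USING PigStorage(',')
       AS (eno:int, ename:chararray, esal:float, dno:int);
dept = LOAD '/home/cloudera/desktop/dept.txt' USING PigStorage(',')
       AS (dno:int, dname:chararray, dloc:chararray);

empdept_inner = JOIN emp BY dno, dept BY dno;              -- inner join
empdept_left  = JOIN emp BY dno LEFT OUTER,  dept BY dno;  -- keep all emp rows
empdept_right = JOIN emp BY dno RIGHT OUTER, dept BY dno;  -- keep all dept rows
empdept_full  = JOIN emp BY dno FULL OUTER,  dept BY dno;  -- keep all rows

STORE empdept_full INTO '/user/cloudera/empdept/';         -- writes part files to HDFS
```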

Common operations in Pig:
1) foreach: used to get only a few columns.
grunt> empdept_for = foreach empdept generate $0, $1;  -- $0 and $1 are positional indexes (dno, dname)
grunt> dump empdept_for;
2) filter:
grunt> empdept_fil = filter empdept by empid == 101;
We get the filtered result.
3) order:
grunt> empdept_sort = order empdept by esal;
Gives ascending order.
grunt> empdept_lim = limit empdept_sort 3;
grunt> dump empdept_lim;
Gives at most 3 values.
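The four operations above can be chained into one pipeline; the column names empid and esal come from the notes, anything else is assumed:

```pig
-- project only the first two columns (positional references)
empdept_for  = FOREACH empdept GENERATE $0, $1;

-- keep only the employee with id 101 (== is the equality operator)
empdept_fil  = FILTER empdept BY empid == 101;

-- ORDER is ascending by default; add DESC for descending
empdept_sort = ORDER empdept BY esal;

-- LIMIT returns at most 3 tuples of the sorted relation
empdept_lim  = LIMIT empdept_sort 3;
DUMP empdept_lim;
```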

Complex data types in Pig:
Take the sample data complex.txt:
(1,2,3) (venkat,2)
(5,6,7) (sita,3)

There are 3 complex data types.
1) tuple: (1,2,3) is one tuple, (venkat,2) is another tuple, and so on.
>$ hadoop fs -ls ram
>$ hadoop fs -put desktop/complex.txt ram
The file is loaded into the ram directory in HDFS. Use the HDFS file system:
grunt> A = load '/user/cloudera/ram/complex.txt' using PigStorage(' ') as
(t1:(t1a:int,t1b:int,t1c:int), t2:(t2a:chararray,t2b:int));
grunt> dump A;
So the data is automatically loaded into A.
We want only the 2nd tuple's 1st column, so use the foreach command:
grunt> A_for = foreach A generate t2.t2a;
grunt> dump A_for;
Then it automatically gives the 2nd tuple's 1st column (venkat).
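The tuple load above, written with the explicit tuple keyword (the field names follow the notes):

```pig
-- complex.txt: (1,2,3) (venkat,2)  -- two tuples per line, space-delimited
A = LOAD '/user/cloudera/ram/complex.txt'
    USING PigStorage(' ')
    AS (t1:tuple(t1a:int, t1b:int, t1c:int),
        t2:tuple(t2a:chararray, t2b:int));

-- the dot operator projects one field out of a tuple
A_for = FOREACH A GENERATE t2.t2a;
DUMP A_for;
```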

bag:
A bag looks like {(1,2,3)}.
Save the data as bag.txt and store it in HDFS first, i.e.
1) >$ hadoop fs -put desktop/bag.txt ram
Stored into the ram directory in HDFS.
2) >$ pig
3) grunt> A_bag = load '/user/cloudera/ram/bag.txt' as (B:{ ...
4) grunt> dump A_bag;

Stored into A_bag.
Here I want the 1st column, so use:
grunt> A_for = foreach A_bag generate B.t1a;
grunt> dump A_for;
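The truncated bag schema above might be completed like this; the inner tuple fields (t1a, t1b, t1c) are an assumption based on the {(1,2,3)} sample:

```pig
-- bag.txt: each line holds a bag like {(1,2,3)}
A_bag = LOAD '/user/cloudera/ram/bag.txt'
        AS (B:bag{t:tuple(t1a:int, t1b:int, t1c:int)});

-- projecting B.t1a pulls the t1a field out of every tuple in the bag
A_for = FOREACH A_bag GENERATE B.t1a;
DUMP A_for;
```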

map is key and value.
The sample data is stored in HDFS:
1) >$ hadoop fs -put desktop/sample.txt ram
2) >$ pig
3) grunt> A_map = load '/user/cloudera/ram/sample.txt' AS (M:map[]);
4) grunt> dump A_map;
Here the delimiter between key and value is #.
5) Using foreach:
6) grunt> A_mapfor = foreach A_map generate M#'open';
Shell script in Pig:
(datameer is a GUI in Hadoop.)
1) emp = load '/user/cloudera/ram/emp.txt' using PigStorage(',') as ...
2) dept = load '/user/cloudera/ram/dept.txt' using PigStorage(',') as ...
3) empdept = join emp by dno, dept by dno;
4) dump empdept;
5) store empdept into '/user/cloudera/ram/empdept/';
These statements are stored in one text file; give it the name txt.pig.
6) >$ pig txt.pig
Then we get the result.

If we look in the HDFS file system:
1) >$ hadoop fs -ls ram/empdept
2) >$ hadoop fs -cat ram/empdept/part-r-00000
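The full txt.pig script described above might read as follows (schemas are assumed, as before); save it as txt.pig and run it in batch mode with pig txt.pig:

```pig
-- txt.pig : a complete Pig script, run with: pig txt.pig
emp  = LOAD '/user/cloudera/ram/emp.txt'  USING PigStorage(',')
       AS (eno:int, ename:chararray, esal:float, dno:int);
dept = LOAD '/user/cloudera/ram/dept.txt' USING PigStorage(',')
       AS (dno:int, dname:chararray, dloc:chararray);
empdept = JOIN emp BY dno, dept BY dno;
STORE empdept INTO '/user/cloudera/ram/empdept/';
```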