WHAT'S IN STORE?
We assume that you are already familiar with commercial database systems. In this chapter, we use that knowledge as our base to build a structure on Hadoop for effective analysis. We will discuss the importance of Hive with the help of use cases, and we will also cover working with Hive Query Language. We suggest you refer to some of the learning resources suggested at the end of this chapter and also complete the "Test Me" exercises to enrich your knowledge.

Figure 9.1 Hive - a data warehousing tool.

About the Company
TENTOTEN is a retail store which has a chain of hypermarkets in India. They have 25 stores across cities and towns. About 45,000+ people are working in TENTOTEN. TENTOTEN deals in a wide range of products including fashion apparel, food products, books, furniture, etc. Around 1500 customers visit and/or purchase products every day from each of these stores.

Problem Scenario
The approximate size of TENTOTEN log datasets is 12 TB. Information about the various stores is stored in the form of semi-structured data. Traditional Business Intelligence (BI) tools are good when data is present in a pre-defined schema and datasets are just several hundreds of gigabytes. But the TENTOTEN dataset is mostly log dataset, which does not conform to any particular schema. Querying such a large dataset is difficult and immensely time consuming.
The challenges are:
1. Moving the log dataset to HDFS (Hadoop Distributed File System).
2. Performing analysis on HDFS data.
Hadoop MapReduce can be used to resolve these issues. However, we will still have to deal with the below constraints:
1. Writing complex MapReduce jobs in Java can be tedious and error prone.
2. Joining across large datasets is quite tricky.
Enter Hive to counter the above challenges.

9.1 WHAT IS HIVE?
Hive is a data warehousing tool (refer Figure 9.1). Hive is used to query structured data built on top of Hadoop. Facebook created the Hive component to manage their ever-growing volumes of log data. Hive makes use of the following:
1. HDFS for storage.
2. MapReduce for execution.
3. Stores metadata in an RDBMS.
Hive provides HQL (Hive Query Language), which is similar to SQL. Hive compiles SQL queries into MapReduce jobs and then runs the job in the Hadoop cluster. Hive provides extensive data type functions and formats for data summarization and analysis.

9.1.1 History of Hive and Recent Releases of Hive
The history of Hive and recent releases of Hive are illustrated pictorially in Figures 9.2 and 9.3, respectively.

Figure 9.2 History of Hive.
Figure 9.3 Recent releases of Hive.

9.1.2 Hive Features
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists, and maps.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom types and custom functions can be defined.

9.1.3 Hive Integration and Work Flow
Figure 9.4 depicts the flow of log file analysis.
Hourly log data can be stored directly into HDFS and then data cleansing is performed on the log file. Finally, Hive table(s) can be created to query the log file.

Figure 9.4 Flow of log analysis file.

9.1.4 Hive Data Units
1. Databases: The namespace for tables.
2. Tables: Set of records that have similar schema.
3. Partitions: Logical separations of data based on classification of given information as per specific attributes. Once Hive has partitioned the data based on a specified key, it starts to assemble the records into specific folders as and when the records are inserted.
4. Buckets (or Clusters): Similar to partitions but uses a hash function to segregate data and determine the cluster or bucket into which the record should be placed.
Figure 9.5 shows how these data units are arranged in a Hive cluster.
Figure 9.6 describes the semblance of Hive structure with database.
A database contains several tables. Each table is constituted of rows and columns. In Hive, tables are stored as a folder and partition tables are stored as a sub-directory. Bucketed tables are stored as a file.

Figure 9.5 Data units as arranged in a Hive.
Figure 9.6 Semblance of Hive structure with database.

9.2 HIVE ARCHITECTURE
Hive architecture is depicted in Figure 9.7. The various parts are as follows:
1. Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact with Hive.
2. Hive Web Interface: It is a simple Graphic User Interface to interact with Hive and to execute query.
3. Hive Server: This is an optional server. This can be used to submit Hive jobs from a remote client.
4. JDBC/ODBC: Jobs can be submitted from a JDBC client. One can write Java code to connect to Hive and submit jobs on it.
5. Driver: Hive queries are sent to the driver for compilation, optimization and execution.
6. Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A Metastore consists of the following:
   • Metastore service: Offers interface to the Hive.
   • Database: Stores data definitions, mappings to the data and others.

Figure 9.7 Hive architecture.

The metadata which is stored in the metastore includes IDs of Database, IDs of Tables, IDs of Indexes, the time of creation of a Table, the Input Format used for a Table, the Output Format used for a Table, etc. The metastore is updated whenever a table is created or deleted from Hive. There are three kinds of metastore.
1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is allowed to connect to the metastore at a time. This is the default metastore for Hive. It is Apache Derby Database. In this metastore, both the database and the metastore service run embedded in the main Hive Server process. Figure 9.8 shows an Embedded Metastore.
2. Local Metastore: Metadata can be stored in any RDBMS component like MySQL. Local metastore allows multiple connections at a time. In this mode, the Hive metastore service runs in the main Hive Server process, but the metastore database runs in a separate process, and can be on a separate host. Figure 9.9 shows a Local Metastore.
3. Remote Metastore: In this, the Hive driver and the metastore interface run on different JVMs (which can run on different machines as well) as in Figure 9.10. This way the database can be fire-walled from the Hive user and also database credentials are completely isolated from the users of Hive.

Figure 9.8 Embedded Metastore.
Figure 9.9 Local Metastore.

9.3 HIVE DATA TYPES
9.3.1 Primitive Data Types

Numeric Types
TINYINT     1-byte signed integer
SMALLINT    2-byte signed integer
INT         4-byte signed integer
BIGINT      8-byte signed integer
FLOAT       4-byte single-precision floating-point
DOUBLE      8-byte double-precision floating-point number

String Types
STRING      Strings can be expressed in single quotes (') or double quotes (")
VARCHAR     Only available starting with Hive 0.12.0
CHAR        Only available starting with Hive 0.13.0

Miscellaneous Types
BOOLEAN
BINARY      Only available starting with Hive 0.8.0

9.3.2 Collection Data Types

STRUCT      Similar to 'C' struct. Fields are accessed using dot notation. E.g.: struct('John', 'Doe')
MAP         A collection of key-value pairs. Fields are accessed using [] notation. E.g.: map('first', 'John', 'last', 'Doe')
ARRAY       Ordered sequence of same types. Fields are accessed using array index. E.g.: array('John', 'Doe')
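
The three access styles of the collection types (dot notation for STRUCT, [] notation for MAP, index for ARRAY) can be mimicked in a few lines of Python. This is only an analogy, not Hive code: the field names `first` and `last` and the variable names are chosen here for illustration.

```python
from collections import namedtuple

# STRUCT: named fields, read with dot notation (as in a Hive struct).
Name = namedtuple("Name", ["first", "last"])  # illustrative field names
name_struct = Name("John", "Doe")

# MAP: key-value pairs, read with [] notation (as in a Hive map).
name_map = {"first": "John", "last": "Doe"}

# ARRAY: ordered values of one type, read by zero-based index.
name_array = ["John", "Doe"]

print(name_struct.first)  # dot notation
print(name_map["last"])   # [] notation with a key
print(name_array[0])      # array index
```

The Python types are looser than Hive's: a Hive ARRAY holds elements of a single declared type, whereas a Python list does not enforce this.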
Figure 9.10 Remote Metastore.

9.4 HIVE FILE FORMAT
The file formats in Hive specify how records are encoded in a file.

9.4.1 Text File
The default file format is text file. In this format, each record is a line in the file. In text file, different control characters are used as delimiters. The delimiters are ^A (octal 001, separates all fields), ^B (octal 002, separates the elements in the array or struct), ^C (octal 003, separates key-value pairs), and \n. The term field is used when overriding the default delimiter. The supported text files are CSV and TSV. JSON or XML documents too can be specified as text file.

9.4.2 Sequential File
Sequential files are flat files that store binary key-value pairs. It includes compression support which reduces the CPU, I/O requirement.

9.4.3 RCFile (Record Columnar File)
RCFile stores the data in a column-oriented manner, which keeps aggregation operations from being expensive. For example, consider a table which contains four columns as shown in Table 9.1.
Instead of only partitioning the table horizontally like the row-oriented DBMS (row-store), RCFile partitions this table first horizontally and then vertically to serialize the data. Based on the user-specified value, first the table is partitioned into multiple row groups horizontally. Depicted in Table 9.2, Table 9.1 is partitioned into two row groups by considering three rows as the size of each row group.
Next, in every row group RCFile partitions the data vertically like column-store. So the table will be serialized as shown in Table 9.3.

Table 9.1 A table with four columns
C1   C2   C3   C4
11   12   13   14
21   22   23   24
31   32   33   34
41   42   43   44
51   52   53   54

Table 9.2 Table with two row groups
Row Group 1            Row Group 2
C1   C2   C3   C4      C1   C2   C3   C4
11   12   13   14      41   42   43   44
21   22   23   24      51   52   53   54
31   32   33   34

Table 9.3 Table in RCFile Format
Row Group 1            Row Group 2
11, 21, 31;            41, 51;
12, 22, 32;            42, 52;
13, 23, 33;            43, 53;
14, 24, 34;            44, 54;

9.5 HIVE QUERY LANGUAGE (HQL)
Hive query language provides basic SQL-like operations. Here are a few of the tasks which HQL can do easily.
1. Create and manage tables and partitions.
2. Support various Relational, Arithmetic, and Logical Operators.
3. Evaluate functions.
4. Download the contents of a table to a local directory or result of queries to HDFS directory.

9.5.1 DDL (Data Definition Language) Statements
These statements are used to build and modify the tables and other objects in the database. The DDL commands are as follows:
1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe

9.5.2 DML (Data Manipulation Language) Statements
These statements are used to retrieve, store, modify, delete, and update data in database. The DML commands are as follows:
1. Loading files into table.
2. Inserting data into Hive Tables from queries.
Note: Hive 0.14 supports update, delete, and transaction operations.

9.5.3 Starting Hive Shell
To start Hive, go to the installation path of Hive and type as below:
    hive
    Logging initialized using configuration in jar:file:/root/Desktop/VMDATA/Hive/hive/lib/hive-common-0.14.0.jar!/hive-log4j.properties
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/root/Desktop/VMDATA/Hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/root/Desktop/VMDATA/Hive/hive/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    hive>

The sections have been designed as follows:
Objective: What is it that we are trying to achieve here?
Input (optional): What is the input that has been given to us to act upon?
Act: The actual statement/command to accomplish the task at hand.
Outcome: The result/output as a consequence of executing the statement.

9.5.4 Database
A database is like a container for data. It has a collection of tables which houses the data.

Objective: To create a database named "STUDENTS" with comments and database properties.
Act:
    CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN');
Outcome:
    hive> CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN');
    OK
    Time taken: 0.536 seconds

Objective: To display a list of all databases.
Act:
    SHOW DATABASES;
Outcome:
    hive> SHOW DATABASES;
    OK
    ...
    Time taken: 0.082 seconds, Fetched: 22 row(s)

Objective: To describe a database.
Act:
    DESCRIBE DATABASE STUDENTS;
Note: Shows only DB name, comment, and DB directory.
Outcome:
    hive> DESCRIBE DATABASE STUDENTS;
    OK
    students    STUDENT Details    hdfs://volgalnx010.ad.infosys.com:9000/user/hive/warehouse/students.db    root    USER
    Time taken: 0.03 seconds, Fetched: 1 row(s)

Objective: To describe the extended database.
Act:
    DESCRIBE DATABASE EXTENDED STUDENTS;
Note: Shows DB properties also.
Outcome:
    hive> DESCRIBE DATABASE EXTENDED STUDENTS;
    OK
    students    STUDENT Details    hdfs://volgalnx010.ad.infosys.com:9000/user/hive/warehouse/students.db    root    USER    {creator=JOHN}
    Fetched: 1 row(s)

Objective: To alter the database properties.
Act:
    ALTER DATABASE STUDENTS SET DBPROPERTIES ('edited-by' = 'JAMES');
Note: It is not possible to unset the DB properties.
Outcome:
    hive> ALTER DATABASE STUDENTS SET DBPROPERTIES ('edited-by' = 'JAMES');
    OK

Objective: To make the database as current working database.
Act:
    USE STUDENTS;
Outcome:
    hive> USE STUDENTS;
    OK
    Time taken: 0.02 seconds

Objective: To drop database.
Act:
    DROP DATABASE STUDENTS;

Note: Hive creates database in the warehouse directory of Hive.

9.5.5 Tables
Hive provides two kinds of table: Managed and External Table.

9.5.5.1 Managed Table
1. Hive stores the managed tables under the warehouse folder under Hive.
2. The complete life cycle of table and data is managed by Hive.
3. When the internal table is dropped, it drops the data as well as the metadata.

Objective: To create managed table named 'STUDENT'.
Act:
    CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Outcome:
    hive> CREATE TABLE IF NOT EXISTS STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    OK
    Time taken: 0.355 seconds

Objective: To describe the "STUDENT" table.
Act:
    DESCRIBE STUDENT;
Outcome:
    hive> DESCRIBE STUDENT;
    OK
    rollno    int
    name      string
    gpa       float
    Time taken: 0.163 seconds, Fetched: 3 row(s)
Note: Hive creates managed table in the warehouse directory of Hive.

9.5.5.2 External or Self-Managed Table
1. When the table is dropped, it retains the data in the underlying location.
2. External keyword is used to create an external table.
3. Location needs to be specified to store the dataset in that particular location.

Objective: To create external table named 'EXT_STUDENT'.
Act:
    CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';
Outcome:
    hive> CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT(rollno INT,name STRING,gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/STUDENT_INFO';
    OK
Note: Hive creates the external table in the specified location.

9.5.5.3 Loading Data into Table from File
Objective: To load data into the table from file named student.tsv.
Act:
    LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;
Note: Local keyword is used to load the data from the local file system. To load the data from HDFS, remove local keyword from the statement.
Outcome:
    hive> LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;
    Loading data to table students.ext_student
    Table students.ext_student stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
    OK
    Time taken: 5.034 seconds
Hive loads the file in the specified location (STUDENT_INFO).

9.5.5.4 Collection Data Types
Objective: To work with collection data types.
Input:
    1001,John,Smith:Jones,Mark1!45:Mark2!46:Mark3!43
    1002,Jack,Smith:Jones,Mark1!46:Mark2!47:Mark3!42
Act:
    CREATE TABLE STUDENT_INFO (rollno INT,name String, sub ARRAY<STRING>,marks MAP<STRING,INT>)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY ':'
    MAP KEYS TERMINATED BY '!';
    LOAD DATA LOCAL INPATH '/root/hivedemos/studentinfo.csv' INTO TABLE STUDENT_INFO;
Outcome:
    hive> CREATE TABLE STUDENT_INFO (rollno INT,name String, sub ARRAY<STRING>,marks MAP<STRING,INT>)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        > COLLECTION ITEMS TERMINATED BY ':'
        > MAP KEYS TERMINATED BY '!';
    OK
    hive> LOAD DATA LOCAL INPATH '/root/hivedemos/studentinfo.csv' INTO TABLE STUDENT_INFO;
    Loading data to table students.student_info
    Table students.student_info stats: [numFiles=1, totalSize=109]
    OK
    Time taken: 0.397 seconds

9.5.5.5 Querying Table
Objective: To retrieve the student details from "EXT_STUDENT" table.
Act:
    SELECT * from EXT_STUDENT;
Outcome:
    hive> SELECT * from EXT_STUDENT;
    OK
    ...
    Fetched: 10 row(s)

Objective: Querying Collection Data Types.
Act:
    SELECT * FROM STUDENT_INFO;
    SELECT NAME,SUB FROM STUDENT_INFO;
    // To retrieve value of Mark1
    SELECT NAME, MARKS['Mark1'] from STUDENT_INFO;
    // To retrieve subordinate (array) value
    SELECT NAME,SUB[0] FROM STUDENT_INFO;
Outcome:
    hive> SELECT * from STUDENT_INFO;
    OK
    1001    John    ["Smith","Jones"]    {"Mark1":45,"Mark2":46,"Mark3":43}
    1002    Jack    ["Smith","Jones"]    {"Mark1":46,"Mark2":47,"Mark3":42}
    Fetched: 2 row(s)
    hive> SELECT NAME,SUB FROM STUDENT_INFO;
    OK
    John    ["Smith","Jones"]
    Jack    ["Smith","Jones"]
    Time taken: 0.061 seconds, Fetched: 2 row(s)
    hive> SELECT NAME, MARKS['Mark1'] from STUDENT_INFO;
    OK
    John    45
    Jack    46
    Time taken: 0.06 seconds, Fetched: 2 row(s)
    hive> SELECT NAME,SUB[0] FROM STUDENT_INFO;
    OK
    John    Smith
    Jack    Smith
    Time taken: 0.071 seconds, Fetched: 2 row(s)

9.5.6 Partitions
In Hive, the query reads the entire dataset even though a where clause filter is specified on a particular column. This becomes a bottleneck in most of the MapReduce jobs as it involves a huge degree of I/O. So it is necessary to reduce the I/O required by the MapReduce job to improve the performance of the query. A very common method to reduce I/O is data partitioning.
Partitions split the larger dataset into more meaningful chunks.
Hive provides two kinds of partitions: Static Partition and Dynamic Partition.
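
The I/O saving from partitioning can be pictured with a small Python model. The sketch below is not Hive code: the `gpa=<value>` folder naming follows Hive's usual partition directory convention, while the table name `student`, the helper names `partition_dir` and `rows_scanned`, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

# Illustrative sample rows: (rollno, name, gpa).
rows = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
    (1008, "James", 4.0),
]

def partition_dir(table, gpa):
    """Folder name for one partition value, e.g. student/gpa=4.0."""
    return f"{table}/gpa={gpa}"

# Group rows into one folder per distinct gpa value.
partitions = defaultdict(list)
for rollno, name, gpa in rows:
    partitions[partition_dir("student", gpa)].append((rollno, name))

def rows_scanned(gpa_filter=None):
    """Count the rows a query must read, with or without a partition filter."""
    if gpa_filter is None:
        return sum(len(members) for members in partitions.values())  # full scan
    return len(partitions.get(partition_dir("student", gpa_filter), []))

print(rows_scanned())     # no filter: every folder is read
print(rows_scanned(4.0))  # filter on gpa: only the gpa=4.0 folder is read
```

With a filter on the partition column, only the matching folder is touched, which is the effect partitioning has on the amount of data a job reads.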

9.5.6.1 Static Partition
Static partitions comprise columns whose values are known at compile time.

Objective: To create static partition based on "gpa" column.
Act:
    CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Outcome:
    hive> CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT(rollno INT, name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    OK
    Time taken: 0.105 seconds

Objective: Load data into partition table from table.
Act:
    INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0) SELECT rollno, name from EXT_STUDENT where gpa=4.0;
Outcome:
    hive> INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0) SELECT rollno,name from EXT_STUDENT where gpa=4.0;
Hive creates the folder for the value specified in the partition.

Objective: To add one more static partition based on "gpa" column using the "alter" statement.
Act:
    ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa=3.5);
    INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa = 4.0) SELECT rollno, name from EXT_STUDENT where gpa=4.0;
Outcome:
    hive> ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa=3.5);
    OK
    Time taken: 0.166 seconds

9.5.6.2 Dynamic Partition
Dynamic partitions have columns whose values are known only at execution time.

Objective: To create dynamic partition on "gpa" column.
Act:
    CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT(rollno INT,name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Outcome:
    hive> CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT(rollno INT,name STRING) PARTITIONED BY (gpa FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    OK
    Time taken: 0.166 seconds

Objective: To load data into a dynamic partition table from table.
Act:
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
Note: The dynamic partition strict mode requires at least one static partition column. To turn this off, set hive.exec.dynamic.partition.mode=nonstrict.
    INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa) SELECT rollno,name,gpa from EXT_STUDENT;
Outcome:
Note: Creates partition for all values.

9.5.7 Bucketing
Bucketing is similar to partition. However, there is a subtle difference between partition and bucketing. In a partition, you need to create a partition for each unique value of the column. This may lead to situations where you may end up with thousands of partitions. This can be avoided by using bucketing, in which you can limit the number of buckets to create. A bucket is a file whereas a partition is a directory.

Objective: To learn about bucket in Hive.
Act:
    CREATE TABLE IF NOT EXISTS STUDENT (rollno INT,name STRING,grade FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' INTO TABLE STUDENT;
Set below property to enable bucketing.
    set hive.enforce.bucketing=true;
    // To create a bucketed table having 3 buckets
    CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT,name STRING,grade FLOAT) CLUSTERED BY (grade) into 3 buckets;
    FROM STUDENT INSERT OVERWRITE TABLE STUDENT_BUCKET SELECT rollno,name,grade;
    // To display content of first bucket
    SELECT DISTINCT GRADE FROM STUDENT_BUCKET TABLESAMPLE (BUCKET 1 OUT OF 3 ON GRADE);
Outcome:
    hive> CREATE TABLE IF NOT EXISTS STUDENT (rollno INT,name STRING,grade FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    OK
    hive> LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' INTO TABLE STUDENT;
    Loading data to table book.student
    Table book.student stats: [numFiles=1, totalSize=145]
    OK
    Time taken: 0.536 seconds
    hive> set hive.enforce.bucketing=true;
    hive> CREATE TABLE IF NOT EXISTS STUDENT_BUCKET (rollno INT,name STRING,grade FLOAT)
        > CLUSTERED BY (grade) into 3 buckets;
    hive> FROM STUDENT
        > INSERT OVERWRITE TABLE STUDENT_BUCKET
        > SELECT rollno, name, grade;
3 buckets have been created.
    hive> SELECT DISTINCT GRADE FROM STUDENT_BUCKET
        > TABLESAMPLE (BUCKET 1 OUT OF 3 ON GRADE);
    Fetched: 2 row(s)
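
How rows end up in a fixed number of buckets can be sketched in Python. This assumes the usual hash-then-modulo scheme (bucket index = hash of the CLUSTERED BY column modulo the number of buckets). The `toy_hash` function is a stand-in invented for this illustration and is not Hive's actual hash function, so real bucket assignments will differ; the sample rows are also illustrative.

```python
NUM_BUCKETS = 3  # mirrors "CLUSTERED BY (grade) into 3 buckets"

def toy_hash(grade):
    """Stand-in hash for illustration only; NOT Hive's real hash function."""
    return int(round(grade * 10))

def bucket_for(grade, num_buckets=NUM_BUCKETS):
    """Map a column value to a bucket index in the range [0, num_buckets)."""
    return toy_hash(grade) % num_buckets

# Illustrative (rollno, name, grade) rows.
rows = [
    (1001, "John", 3.0),
    (1002, "Jack", 4.0),
    (1003, "Smith", 4.5),
    (1004, "Scott", 4.2),
    (1005, "Joshi", 3.5),
]

# Each bucket corresponds to one file of the bucketed table.
buckets = {index: [] for index in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_for(row[2])].append(row)

for index, members in buckets.items():
    print(index, [name for _, name, _ in members])
```

However many distinct grades appear, the number of buckets stays at three, and equal grades always land in the same bucket. This is why sampling a single bucket returns whole groups of grade values rather than a random slice of every group.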

9.5.8 Views
In Hive, view support is available only in versions starting from 0.6. Views are purely logical objects.

Objective: To create a view table named "STUDENT_VIEW".
Act:
    CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM EXT_STUDENT;
Outcome:
    hive> CREATE VIEW STUDENT_VIEW AS SELECT rollno, name FROM EXT_STUDENT;
    OK
    Time taken: 0.606 seconds

Objective: Querying the view "STUDENT_VIEW".
Act:
    SELECT * FROM STUDENT_VIEW LIMIT 4;
Outcome:
    hive> SELECT * FROM STUDENT_VIEW LIMIT 4;
    OK
    1001    John
    1002    Jack
    1003    Smith
    1004    Scott
    Time taken: 0.279 seconds, Fetched: 4 row(s)

Objective: To drop the view "STUDENT_VIEW".
Act:
    DROP VIEW STUDENT_VIEW;
Outcome:
    hive> DROP VIEW STUDENT_VIEW;
    OK
    Time taken: 0.452 seconds

9.5.9 Sub-Query
In Hive, sub-queries are supported only in the FROM clause (Hive 0.12). You need to specify a name for the sub-query because every table in a FROM clause has a name. The columns in the sub-query select list should have unique names. The columns in the sub-query select list are available to the outer query just like columns of a table.

Objective: Write a sub-query to count occurrence of similar words in the file.
Act:
    CREATE TABLE docs (line STRING);
    LOAD DATA LOCAL INPATH '/root/hivedemos/lines.txt' OVERWRITE INTO TABLE docs;
    CREATE TABLE word_count AS
    SELECT word, count(1) AS count FROM
    (SELECT explode (split (line, ' ')) AS word FROM docs) w
    GROUP BY word
    ORDER BY word;
    SELECT * FROM word_count;
Outcome:
    hive> CREATE TABLE docs (line STRING);
    OK
    Time taken: 0.118 seconds
    hive> LOAD DATA LOCAL INPATH '/root/hivedemos/lines.txt' OVERWRITE INTO TABLE docs;
    Loading data to table students.docs
    Table students.docs stats: [numFiles=1, numRows=0, totalSize=91, rawDataSize=0]
    OK
    Time taken: 2.697 seconds
    hive> CREATE TABLE word_count AS
        > SELECT word, count(1) AS count FROM
        > (SELECT explode (split (line, ' ')) AS word FROM docs) w
        > GROUP BY word
        > ORDER BY word;
    hive> SELECT * FROM word_count;
    OK
    Hadoop          2
    Hive            2
    Introducing     1
    Introduction    1
    ...
    Time taken: 0.062 seconds, Fetched: 8 row(s)
Note: The explode() function takes an array as input and outputs the elements of the array as separate rows.
In Hive 0.13, sub-queries are supported in the where clause as well.

9.5.10 Joins
Joins in Hive are similar to the SQL Join.

Objective: To create JOIN between Student and Department tables where we use RollNo from both the tables as the join key.
11,g Data and ¼ ~
250' . 'tc ,i,, ,va,g, and'°"" ' ,gg,eg,uio, fu,aio,.
CCRfA11!
roRf,l,\TTABLE
Ac:n IF NoT EJ(IS'J'S sruoEN'f(,ollno INT,nom• STRING gp Flo
J!Lll,ll1ED FJELDS TE(U,fINATED on,·,
,r..~ .
.
·(

to"'1fl
' • 'AlJ q ~ F)IOM STUDENT,
sruoENf•
0
L()AI) oATAL()('.AL (Nl\(111 ·1........,....,,1,tud,nLuv' OVERWRITE
. INTo'fAB~
JP'~-. fll.Old STUDENT:)
cR.f.ATE TABLE IF NOT EXIsTS DEPAKfMENT(rollno INT,deptno int
RfN{ FORf,1,1'.f oJ!Lll,ll1ED FJEL0S i£RMINA'IID BY 'It'; '"'°" S'Ilu!,C) ~ •,iO(gp&) FltOM STUDENT;
,,.,
~ l1'lo.
LOAD DATA LOCAL INPATH '/rootfhlvedemos/department.tsv' OVER
. TABLE oEPAKfMENT• ' ~ "·" ,_... ""'''' , ,~,,,
- ·es3CJ27
E
SJlLECf .,.0.0,.......,..,.. b.d<ptDO FROM STIIDEN
~ 7
....UO• b.,,llao<
T, JOIN DEPARl'M
ENT bON
I~ _ d'f a,19(91'&) FROM STUDENT·
pit"'Si.... '
en: ,
3. 1
-
26-218 seconds, Fetched: l row(s)
Outcome:

!hive> cREAlI TABLE IF NOT EXISTS STUOENT(rollno INT,name STRING
ELDIITED FIELDS nRMINATED av '\t'; ,gpa FLOAT) ROW F
Tille taken: o.115 seconds
hive> I
;
ihive> LOAD DATA LOCAL JNPAlll '/rootfhive de111>S/s tudent. tsv' OVER~TE INTO
TAB LE STUDENT
Gra.wriY and Having
oadinostudents.s
1K
'able data to table
tudentstudents. student
stats: [nulllfiles 5
=l, nuntows=O, totalsize= 145 ' rawoat a 1ze=O]
.
9.s.12 7 ·I .
.,.nn or columns can be grouped on the basis of values contained
· acol.,.,~- therein b . "G
.
·;ae taken: O. 723 seconds
l)al2~. clause is used to filter out groups NOT meeung the specified condition. Yusing roup By"·
lhive> J ~
hive> CREATI TABLE IF NOT EXISTS OEPARTMENT(rollno INT,deptn o int,name
~ntING) R
AT DELIMITED FIELDS nRMINATID av '\t'; ~ g r o u p by and having function.
l IK
ow FOR/,!
lob1-·-
rime taken: O. 099 seconds
-AJJ;
lhive>_J
'hive> LOAD DATA LOCAL INPAlll • /root/Mve deaos/dep artaent. tsv· OVER~TE
INTO TABLE DEPA
.SEIJ!C'I' rollno, name,gpa FROM STUDENT GROUP BY rollno,name,gpa HAVING gpa >
,RlMENT;
uading
!Table data to epartment
students.d table students.d
stats:el)&rtlleJl t =l, nunlOWS=O, tota1Size
[nulllfiles =120, raWOataSize=O]
4.01)
OK Outcome:
Time taken: 0.442 seconds
lhive> 001 Saith 4. 5
\1004 Scott 4.2
h;ve> SELECT a.rollno, · · - • a.g,a, b.deptno F - STUDENT • JOIN DEPAR1"4ENT b OH • . 11 006 Alex 4. 5
lrollno • b.rollno: .001 o.vid 4. 2
·toe tilcen: 78. 972 seconds, Fetched: 4 row(s)
lvo.
1001 John 3.0 101
1002 Jack 4.0 102
11003 Saith 4.5 103
1
1004 Scott 4. 2 104
1005 Joshi 3. S 105
1006 Alex 4.5 101
1007 o..,;d 4.2 104
1008 Jaaes 4.0 102
Tjae taken: 115. 282 5econds, Fetched: 8 row(s)
9.5.11 Aggregation

Hive supports aggregation functions like avg, count, etc.

9.6 RCFILE IMPLEMENTATION

RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters.

Objective: To work with RCFILE Format.
Act:
CREATE TABLE STUDENT_RC(rollno int, name string, gpa float) STORED AS RCFILE;
INSERT OVERWRITE table STUDENT_RC SELECT * FROM STUDENT;
SELECT SUM(gpa) FROM STUDENT_RC;
Outcome:
hive> CREATE TABLE STUDENT_RC(rollno int, name string, gpa float) STORED AS RCFILE;
Time taken: 0.093 seconds
hive> INSERT OVERWRITE table STUDENT_RC SELECT * from STUDENT;
hive> SELECT SUM(gpa) from STUDENT_RC;
38.39999961853027
Time taken: 25.41 seconds, Fetched: 1 row(s)
Note: RCFile stores the data in a column-oriented manner.
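To make "column oriented" concrete, here is a toy plain-Java contrast between a row layout and a column layout. It is only a sketch of the idea and not the RCFile on-disk format (RCFile first splits a table into row groups and stores each group column by column, which this example ignores). The names `ColumnLayoutSketch`, `Student`, and `StudentColumns` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: row-oriented versus column-oriented layout in memory.
final class ColumnLayoutSketch {

    // Row-oriented: each record keeps all of its fields together.
    static final class Student {
        final int rollno;
        final String name;
        final double gpa;

        Student(int rollno, String name, double gpa) {
            this.rollno = rollno;
            this.name = name;
            this.gpa = gpa;
        }
    }

    // Column-oriented: one list per column, aligned by position.
    static final class StudentColumns {
        final List<Integer> rollno = new ArrayList<>();
        final List<String> name = new ArrayList<>();
        final List<Double> gpa = new ArrayList<>();
    }

    static StudentColumns toColumns(List<Student> rows) {
        StudentColumns cols = new StudentColumns();
        for (Student s : rows) {
            cols.rollno.add(s.rollno);
            cols.name.add(s.name);
            cols.gpa.add(s.gpa);
        }
        return cols;
    }

    // An aggregate such as SUM(gpa) only needs to scan the gpa column.
    static double sumGpa(StudentColumns cols) {
        double total = 0.0;
        for (double g : cols.gpa) {
            total += g;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Student> rows = Arrays.asList(
                new Student(1001, "John", 3.0),
                new Student(1002, "Jack", 4.0));
        System.out.println(sumGpa(toColumns(rows)));
    }
}
```

The point of the layout is visible in `sumGpa`: a query that touches one column does not have to read the others.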
9.7 SERDE

SerDe stands for Serializer/Deserializer.
1. Contains the logic to convert unstructured data into records.
2. Implemented using Java.
3. Serializers are used at the time of writing.
4. Deserializers are used at query time (SELECT Statement).

The Deserializer interface takes a binary representation or string of a record and converts it into a Java object that Hive can then manipulate. The Serializer takes a Java object that Hive has been working with and translates it into something that Hive can write to HDFS.

Objective: To manipulate the XML data.
Input:
<employee> <empid>1001</empid> <name>John</name> <designation>Team Lead</designation> </employee>
<employee> <empid>1002</empid> <name>Smith</name> <designation>Analyst</designation> </employee>
Act:
CREATE TABLE XMLSAMPLE(xmldata string);
LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;
CREATE TABLE xpath_table AS
SELECT xpath_int(xmldata, 'employee/empid'),
xpath_string(xmldata, 'employee/name'),
xpath_string(xmldata, 'employee/designation')
FROM xmlsample;
SELECT * FROM xpath_table;
Outcome:
hive> CREATE TABLE XMLSAMPLE(xmldata string);
Time taken: 0.244 seconds
hive> LOAD DATA LOCAL INPATH '/root/hivedemos/input.xml' INTO TABLE XMLSAMPLE;
Loading data to table students.xmlsample
Table students.xmlsample stats: [numFiles=1, totalSize=194]
Time taken: 0.889 seconds
hive> CREATE TABLE xpath_table AS
    > SELECT xpath_int(xmldata, 'employee/empid'),
    > xpath_string(xmldata, 'employee/name'),
    > xpath_string(xmldata, 'employee/designation')
    > FROM xmlsample;
hive> SELECT * FROM xpath_table;
1001 John Team Lead
1002 Smith Analyst
Time taken: 0.064 seconds, Fetched: 2 row(s)

9.8 USER-DEFINED FUNCTION (UDF)

In Hive, you can use custom functions by defining the User-Defined Function (UDF).

Objective: Write a Hive function to convert the values of a field to uppercase.
Act:
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(
    name = "SimpleUDFExample")
public final class MyUpperCase extends UDF {
    // Hive calls evaluate() for each row with the value of the column.
    public String evaluate(final String word) {
        return word.toUpperCase();
    }
}
Note: Convert this Java program into a JAR.

ADD JAR /root/hivedemos/UpperCase.jar;
CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase';
SELECT TOUPPERCASE(name) FROM STUDENT;

Outcome:
hive> ADD JAR /root/hivedemos/UpperCase.jar;
Added [/root/hivedemos/UpperCase.jar] to class path
Added resources: [/root/hivedemos/UpperCase.jar]
hive> CREATE TEMPORARY FUNCTION touppercase AS 'com.example.hive.udf.MyUpperCase';
Time taken: 0.014 seconds
hive> Select touppercase(name) from STUDENT;
OK
JOHN
JACK
SMITH
SCOTT
JOSHI
ALEX
DAVID
JAMES
JOHN
JOSHI
Time taken: 0.061 seconds, Fetched: 10 row(s)
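The conversion logic inside the UDF can be tried out without a Hive installation. The sketch below is a stand-alone helper that mirrors the uppercase conversion and adds a null check, on the assumption that a NULL column value reaches the function as a Java null. The class name `UpperCaseLogic` and the method `toUpper` are hypothetical and are not part of Hive.

```java
import java.util.Locale;

// Hypothetical stand-alone helper: same conversion as the UDF, but with no
// dependency on Hive classes, so it runs with plain Java.
final class UpperCaseLogic {

    static String toUpper(String word) {
        if (word == null) {
            return null; // assumption: NULL in, NULL out
        }
        // Locale.ROOT keeps the result independent of the machine's locale.
        return word.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toUpper("Smith"));
        System.out.println(toUpper(null));
    }
}
```

Without the null check, calling `toUpperCase()` on a null value would throw a NullPointerException.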
WHAT'S IN STORE?

We assume that by now you would have become familiar with the basic concepts of HDFS and MapReduce Programming. The focus of this chapter will be to build on this knowledge to perform analysis using Pig. We will discuss a few relational and eval operators of Pig. We will also discuss Complex Data Types and UDF (User Defined Functions) of Pig.

We suggest you refer to some of the learning resources provided at the end of this chapter for better learning. We also suggest that you practice the "Test Me" exercises.

10.1 WHAT IS PIG?

Apache Pig is a platform for data analysis. It is an alternative to MapReduce Programming. Pig was developed as a research project at Yahoo.

10.1.1 Key Features of Pig

1. It provides an engine for executing data flows (how your data should flow). Pig processes data in parallel on the Hadoop cluster.
2. It provides a language called "Pig Latin" to express data flows.
3. Pig Latin contains operators for many of the traditional data operations such as join, filter, sort, etc.
4. It allows users to develop their own functions (User Defined Functions) for reading, processing, and writing data.

10.2 THE ANATOMY OF PIG

The main components of Pig are as follows:
1. Data flow language (Pig Latin).
2. Interactive shell where you can type Pig Latin statements (Grunt).
3. Pig interpreter and execution engine.

Refer Figure 10.1.

Figure 10.1 The anatomy of Pig. (Components shown: Pig Latin Script; Pig Interpreter/Execution Engine, which processes and parses Pig Latin, checks data types, performs optimization, creates MapReduce jobs, submits job(s) to Hadoop, and monitors progress; MapReduce Jobs; Hadoop.)

10.3 PIG ON HADOOP

Pig runs on Hadoop. Pig uses both Hadoop Distributed File System and MapReduce Programming. Pig reads input files from HDFS. Pig stores the intermediate data (data produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input from and place output to other sources.

Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.

10.4 PIG PHILOSOPHY

Figure 10.2 describes the Pig philosophy.
1. Pigs Eat Anything: Pig can process different kinds of data such as structured and unstructured data.
2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other sources such as files in the local file system.
3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions and the same can be included in the script for complex operations.
4. Pigs Fly: Pig processes data quickly.

Figure 10.2 Pig philosophy.

10.5 USE CASE FOR PIG: ETL PROCESSING

Pig is widely used for "ETL" (Extract, Transform, and Load). Pig can extract data from different sources such as ERP, Accounting, Flat Files, etc. Pig then makes use of various operators to perform transformation on the data and subsequently loads it into the data warehouse. Refer Figure 10.3.

Figure 10.3 Pig: ETL Processing. (Steps shown: data validation, fixing errors, removal of duplicates, encode value.)

10.6 PIG LATIN OVERVIEW

10.6.1 Pig Latin Statements

1. Pig Latin statements are basic constructs to process data using Pig.
2. A Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions to process data.
5. Pig Latin statements should end with a semi-colon.
Pig Latin Statements are generally ordered as follows:
1. LOAD statement that reads data from the file system.
2. Series of statements to perform transformations.
3. DUMP or STORE to display/store result.
The following is a simple Pig Latin script to load, filter, and store "student" data.

A = load 'student' as (rollno, name, gpa);
A = filter A by gpa > 4.0;
A = foreach A generate UPPER(name);
STORE A INTO 'myreport';

Note: In the above example A is a relation and NOT a variable.
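The script above is a data flow in which every statement takes a relation and yields a new one. The same three steps can be sketched in plain Java as an analogy only. Pig would run these steps as jobs on the cluster, and the class name `PigFlowSketch` and the sample rows are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Analogy only: each method takes a "relation" (a list) and returns a new one,
// mirroring filter and foreach ... generate in the script above.
final class PigFlowSketch {

    static final class Student {
        final int rollno;
        final String name;
        final double gpa;

        Student(int rollno, String name, double gpa) {
            this.rollno = rollno;
            this.name = name;
            this.gpa = gpa;
        }
    }

    // Corresponds to: A = filter A by gpa > 4.0;
    static List<Student> filterByGpa(List<Student> relation, double threshold) {
        return relation.stream()
                .filter(s -> s.gpa > threshold)
                .collect(Collectors.toList());
    }

    // Corresponds to: A = foreach A generate UPPER(name);
    static List<String> upperNames(List<Student> relation) {
        return relation.stream()
                .map(s -> s.name.toUpperCase())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample rows standing in for the 'student' input.
        List<Student> a = Arrays.asList(
                new Student(1001, "John", 3.0),
                new Student(1003, "Smith", 4.5));
        System.out.println(upperNames(filterByGpa(a, 4.0)));
    }
}
```

Neither method modifies its input list, which matches the note that A names a relation produced by a statement, not a variable that is updated in place.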

10.6.2 Pig Latin: Keywords

Keywords are reserved; they cannot be used to name things.

10.6.3 Pig Latin: Identifiers


1. Identifiers are names assigned to fields or other data structures.
2. An identifier should begin with a letter and should be followed only by letters, numbers, and underscores.
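The identifier rule can be written as a regular expression. The following plain-Java sketch checks a name against it. The class name `PigIdentifierCheck` and the sample names are hypothetical, and "letter" is assumed to mean an ASCII letter.

```java
import java.util.regex.Pattern;

// Sketch of the rule: a letter first, then only letters, digits, underscores.
final class PigIdentifierCheck {

    // Assumption: "letter" means A-Z or a-z.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    static boolean isValid(String name) {
        return name != null && IDENTIFIER.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample names.
        String[] samples = {"A1_2014", "Sample", "_Sales", "5", "Sales$"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

A name that starts with an underscore or a digit, or that contains a symbol such as `$`, is rejected by this check.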

Table 10.1 describes valid and invalid identifiers.

Table 10.1 Valid and invalid identifiers

10.6.4 Pig Latin: Comments

Two types of comments are supported:
1. Single line comments that begin with "--".
2. Multiline comments that begin with "/*" and end with "*/".

10.6.5 Pig Latin: Case Sensitivity

1. Keywords such as LOAD, STORE, GROUP, FOREACH, DUMP, etc. are not case sensitive.
2. Relations and paths are case-sensitive.
3. Function names such as PigStorage, COUNT are case sensitive.

10.6.6 Operators in Pig Latin

Table 10.2 describes operators in Pig Latin.

Table 10.2 Operators in Pig Latin

Arithmetic    Comparison    Null           Boolean
+             ==            IS NULL        AND
-             !=            IS NOT NULL    OR
*             <                            NOT
/             >
%             <=
              >=

10.7 DATA TYPES IN PIG

10.7.1 Simple Data Types

Table 10.3 describes simple data types supported in Pig. In Pig, fields of unspecified types are considered as an array of bytes which is known as bytearray.

Null: In Pig Latin, NULL denotes a value that is unknown or is non-existent.

10.7.2 Complex Data Types

Table 10.4 describes complex data types in Pig.
