Why Apache Pig?

By now, we know that Apache Pig is used with Hadoop, and Hadoop is based on the Java programming language. Now, the question that arises in our minds is 'Why Pig?' The need for Apache Pig came up when many programmers weren't comfortable with Java and were facing a lot of struggle working with Hadoop, especially when MapReduce tasks had to be performed. Apache Pig came into the Hadoop world as a boon for all such programmers.

• After the introduction of Pig Latin, programmers are now able to work on MapReduce tasks without the use of complicated codes as in Java.
• To reduce the length of codes, the multi-query approach is used by Apache Pig, which results in development time reduced by 16 folds.
• Since Pig Latin is very similar to SQL, it is comparatively easy to learn Apache Pig if we have a little knowledge of SQL.
• For supporting data operations such as filters, joins, ordering, etc., Apache Pig provides several in-built operations.
Features of Pig Hadoop

There are several features of Apache Pig:
1. In-built operators: Apache Pig provides a very good set of operators for performing several data operations like sort, join, filter, etc.
2. Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to write a Pig script.
3. Automatic optimization: The tasks in Apache Pig are automatically optimized. This makes the programmers concentrate only on the semantics of the language.
4. Handles all kinds of data: Apache Pig can analyze both structured and unstructured data and store the results in HDFS.
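To make these operators concrete, here is a minimal Pig Latin sketch; the file names customers.txt and orders.txt, their fields, and the output path are hypothetical:

customers = LOAD 'customers.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (oid:int, cust_id:int, amount:double);
-- in-built operators: FILTER, JOIN, ORDER
big_orders = FILTER orders BY amount > 1000.0;
joined = JOIN customers BY id, big_orders BY cust_id;
sorted = ORDER joined BY amount DESC;
STORE sorted INTO 'output' USING PigStorage(',');

Each statement defines a relation, and nothing runs until a STORE (or DUMP) is reached, which is what lets Pig optimize the whole script as one unit.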

Apache Pig Architecture

The main reason why programmers have started using Hadoop Pig is that it converts the scripts into a series of MapReduce tasks, making their job easy. Below is the architecture of Pig Hadoop:

[Figure: Pig Hadoop architecture - Pig Latin scripts enter Apache Pig through the Grunt shell or Pig Server, pass through the Parser, Optimizer, and Compiler, and reach the Execution Engine, which runs MapReduce jobs on Hadoop/HDFS.]
Pig Hadoop framework has four main components:
1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The parser is responsible for checking the syntax of the script, along with other miscellaneous checks. The parser gives an output in the form of a Directed Acyclic Graph (DAG) that contains Pig Latin statements, together with other logical operators represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for the DAG is passed to a logical optimizer. The optimizer is responsible for carrying out the logical optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is received. The compiler compiles the logical plan sent by the optimizer. The logical plan is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs are sent to Hadoop in a properly sorted order, and the jobs are executed on Hadoop for yielding the desired result.

By default, Pig Hadoop chooses to run MapReduce jobs, in which access is required to the Hadoop cluster and the HDFS installation. But there is also a local mode, in which all the files are installed and run using a localhost and the local file system. You can run the local mode using the following command:

pig -x local

Pig represents Big Data as data flows. Pig is a high-level platform or tool which is used to process large datasets. It provides a high level of abstraction for processing over MapReduce. It provides a high-level scripting language, known as Pig Latin, which is used to develop the data analysis codes.

First, to process the data which is stored in the HDFS, the programmers will write the scripts using the Pig Latin language. Internally, Pig Engine (a component of Apache Pig) converts all these scripts into specific map and reduce tasks. But these are not visible to the programmers, in order to provide a high level of abstraction. Pig Latin and Pig Engine are the two main components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Note: Pig Engine has two types of execution environment, i.e., a local execution environment in a single JVM (used when the dataset is small in size) and a distributed execution environment in a Hadoop cluster.
Need of Pig: One limitation of MapReduce is that the development cycle is very long. Writing the reducer and mapper, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task. Apache Pig reduces the time of development using the multi-query approach. Also, Pig is beneficial for programmers who are not from a Java background. 200 lines of Java code can be written in only 10 lines using the Pig Latin language. Programmers who have SQL knowledge need less effort to learn Pig Latin.
• It uses the multi-query approach, which results in reducing the length of the code.
• Pig Latin is an SQL-like language.
• It provides many built-in operators.
• It provides nested data types (tuples, bags, maps).
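The classic word count makes the '200 lines of Java in 10 lines of Pig Latin' claim concrete. This is a minimal sketch; the input and output paths are hypothetical:

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_out';

The same job written as raw Java MapReduce needs a mapper class, a reducer class, a driver, and build packaging.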
Evolution of Pig: Earlier, in 2006, Apache Pig was developed by Yahoo's researchers. At that time, the main idea behind developing Pig was to execute the MapReduce jobs on extremely large datasets. In the year 2007, it moved to the Apache Software Foundation (ASF), which made it an open-source project. The first version (0.1) of Pig came in the year 2008. The latest version of Apache Pig is 0.17, which came in the year 2017.
Features of Apache Pig:
• For performing several operations, Apache Pig provides a rich set of operators like filter, join, sort, etc.
• It is easy to learn, read, and write. Especially for SQL programmers, Apache Pig is a boon.
• Apache Pig is extensible, so that you can make your own user-defined functions and processes; see the sketch after this list.
• The join operation is easy in Apache Pig.
• It needs fewer lines of code.
• Apache Pig allows splits in the pipeline.
• The data structure is multivalued, nested, and richer.
• Pig can handle the analysis of both structured and unstructured data.
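REGISTER and DEFINE are the standard Pig Latin statements for wiring in a user-defined function; the jar, class, and file names below are hypothetical:

REGISTER 'myudfs.jar';
DEFINE TO_UPPER com.example.pig.ToUpper();
names = LOAD 'names.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE TO_UPPER(name);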

Difference between Pig and MapReduce

Apache Pig                                                 MapReduce
It is a scripting language.                                It is a compiled programming language.
Abstraction is at a higher level.                          Abstraction is at a lower level.
It has fewer lines of code as compared to MapReduce.       Lines of code are more.
Less effort is needed for Apache Pig.                      More development effort is required for MapReduce.
Code efficiency is less as compared to MapReduce.          As compared to Pig, the efficiency of code is higher.
Pig provides built-in functions for data operations.       It is hard to perform data operations.
It allows nested data types like maps, tuples, and bags.   It does not allow nested data types.
Applications of Apache Pig:
• Pig scripting is used for exploring large datasets.
• It provides support for ad-hoc queries across large datasets.
• It is used in the prototyping of large dataset processing algorithms.
• It is required to process time-sensitive data loads.
• It is used for collecting large amounts of data in the form of search logs and web crawls.
• It is used where analytical insights are needed using sampling.
Types of Data Models in Apache Pig: It consists of the following 4 types of data models:
• Atom: It is an atomic data value which is stored as a string. The main benefit of this model is that it can be used both as a number and as a string.
• Tuple: It is an ordered set of fields.
• Bag: It is a collection of tuples.
• Map: It is a set of key/value pairs.
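As an illustration (the values here are made up), the four data models appear in Pig Latin literals like this:

42                            -- atom: a simple value
(42, 'john')                  -- tuple: an ordered set of fields
{(42,'john'), (43,'jane')}    -- bag: a collection of tuples
['name'#'john']               -- map: a set of key/value pairs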

Pig Architecture
The architecture of Pig consists of two components:
1. Pig Latin, which is a language.
2. A runtime environment, for running Pig Latin programs.
A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation by the Hadoop Pig execution environment. Underneath, the results of these transformations are a series of MapReduce jobs which a programmer is unaware of. So, in a way, Pig in Hadoop allows the programmer to focus on data rather than the nature of execution.
Pig Latin is a relatively rigid language which uses familiar keywords from data processing, e.g., Join, Group and Filter.

[Figure: PIG Architecture - a Pig Latin script is transformed into a Logical Plan, then into a Physical Plan, and finally into Hadoop execution.]
Execution modes:
Pig in Hadoop has two execution modes:
1. Local mode: In this mode, the Hadoop Pig language runs in a single JVM and makes use of the local file system. This mode is suitable only for the analysis of small datasets using Pig in Hadoop.
2. MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (the cluster may be pseudo or fully distributed). MapReduce mode with a fully distributed cluster is useful for running Pig on large datasets.
https://www.tutorialspoint.com/apache_pig/pig_latin_basics.htm
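As a quick sketch, the two modes are selected from the command line with the -x flag (script.pig is a hypothetical script file):

pig -x local script.pig        # local mode: single JVM, local file system
pig -x mapreduce script.pig    # MapReduce mode: jobs run on the Hadoop cluster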
1. Pig:
Pig is used for the analysis of a large amount of data. It is an abstraction over MapReduce. Pig is used to perform all kinds of data manipulation operations in Hadoop. It provides the Pig Latin language to write code that contains many inbuilt functions like join, filter, etc. The two parts of Apache Pig are Pig Latin and Pig Engine. Pig Engine is used to convert all these scripts into specific map and reduce tasks. Pig's abstraction is at a higher level. It contains fewer lines of code as compared to MapReduce.
2. Hive:
Hive is built on top of Hadoop and is used to process structured data in Hadoop. Hive was developed by Facebook. It provides a querying language which is frequently known as Hive Query Language. Apache Hive is a data warehouse which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS), with which it integrates.

Difference between Pig and Hive:

S.No.  Pig                                                         Hive
1.     Pig operates on the client side of a cluster.               Hive operates on the server side of a cluster.
2.     Pig uses the Pig Latin language.                            Hive uses the HiveQL language.
3.     Pig is a Procedural Data Flow Language.                     Hive is a Declarative SQL-like Language.
4.     It was developed by Yahoo.                                  It was developed by Facebook.
5.     It is used by researchers and programmers.                  It is mainly used by data analysts.
6.     It is used to handle structured and semi-structured data.   It is mainly used to handle structured data.
7.     It is used for programming.                                 It is used for creating reports.
8.     Pig scripts end with the .pig extension.                    In Hive, all extensions are supported.
9.     It does not support partitioning.                           It supports partitioning.
10.    It loads data quickly.                                      It loads data slowly.
11.    It does not support JDBC.                                   It supports JDBC.
12.    It does not support ODBC.                                   It supports ODBC.
13.    Pig does not have a dedicated metadata database.            Hive makes use of a dedicated SQL-DDL language by defining tables beforehand.
14.    It supports the Avro file format.                           It does not support the Avro file format.
15.    Pig is suitable for complex and nested data structures.     Hive is suitable for batch-processing OLAP systems.
16.    Pig does not support schema to store data.                  Hive supports schema for data insertion in tables.
17.    It is very easy to write UDFs to calculate matrices.        It supports UDFs, but they are much harder to debug.
Pig Data Types
Apache Pig supports many data types. A list of Apache Pig data types with descriptions and examples is given below.

Type        Description                  Example
int         Signed 32-bit integer        2
long        Signed 64-bit integer        15L or 15l
float       32-bit floating point        2.5f or 2.5F
double      64-bit floating point        1.5 or 1.5e2 or 1.5E2
chararray   Character array (string)     hello javatpoint
bytearray   BLOB (byte array)
tuple       Ordered set of fields        (12,43)
bag         Collection of tuples         {(12,43),(54,28)}
map         Set of key/value pairs       [open#apache]
Map
A map in Pig is a chararray-to-data-element mapping, where that element can be any Pig type, including a complex type. The chararray is called a key and is used as an index to find the element, referred to as the value.

Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type.

Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema associated with it. In the case of a bag, the schema describes all tuples within the bag.

Nulls
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig, a null data element means the value is unknown. This might be because the data is missing or an error occurred in processing it.
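A hedged Pig Latin sketch of a schema that combines these complex types; the file people.txt and its layout are hypothetical:

people = LOAD 'people.txt' AS (name:chararray,
                               scores:bag{t:tuple(subject:chararray, mark:int)},
                               info:map[chararray]);
-- look up a map value by key, and flatten the bag of tuples
cities = FOREACH people GENERATE name, info#'city';
flattened = FOREACH people GENERATE name, FLATTEN(scores);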
Hive is a data warehouse system which is used to analyze structured data. It is built on top of Hadoop. It was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which get internally converted to MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).
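For a feel of HQL, here is a minimal hedged sketch (the employee table and its columns are hypothetical) of the kind of query that Hive internally turns into a MapReduce job:

hive> SELECT department, AVG(salary)
    > FROM employee
    > GROUP BY department;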
Features of Hive
The following are the features of Hive:
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) through which users can plug in their own functionality.
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.

Hive Architecture
The following architecture explains the flow of submission of a query into Hive.

[Figure: Hive architecture - Hive Clients submit queries to Hive Services, which consult the Metastore and read/write data in HDFS.]
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients such as:
o Thrift Server - It is a cross-language service provider platform that serves the requests from all those programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata of columns and their type information, the serializers and deserializers which are used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from different clients and provides them to the Hive Driver.
o Hive Driver - It receives queries from different sources like the Web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
HIVE Data Types
Hive data types are categorized into numeric types, string types, misc types, and complex types. A list of Hive data types is given below.
Integer Types

Type       Size                   Range
TINYINT    1-byte signed integer  -128 to 127
SMALLINT   2-byte signed integer  -32,768 to 32,767
INT        4-byte signed integer  -2,147,483,648 to 2,147,483,647
BIGINT     8-byte signed integer  -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Decimal Types

Type     Size    Range
FLOAT    4-byte  Single precision floating point number
DOUBLE   8-byte  Double precision floating point number
Date/Time Types
TIMESTAMP
o It supports the traditional UNIX timestamp with optional nanosecond precision.
o As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
o As a floating point numeric type, it is interpreted as a UNIX timestamp in seconds with decimal precision.
o As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal places of precision).
DATES
The Date value is used to specify a particular year, month, and day, in the form YYYY-MM-DD. However, it doesn't provide the time of the day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
String Types
STRING
The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
VARCHAR
The varchar is a variable-length type whose range lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
Complex Types

Type    Description
Struct  It is similar to a C struct or an object where fields are accessed using the "dot" notation.
Map     It contains key-value tuples where the fields are accessed using array notation.
Array   It is a collection of values of a similar type that are indexable using zero-based integers.
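A hedged HiveQL sketch combining these complex types in one table definition (the table and column names are made up):

hive> CREATE TABLE employee_complex (
    >   name STRING,
    >   skills ARRAY<STRING>,
    >   grades MAP<STRING, INT>,
    >   address STRUCT<city:STRING, zip:INT>
    > );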
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city, or country. The advantage of partitioning is that since the data is stored in slices, the query response time becomes faster.
As we know, Hadoop is used to handle huge amounts of data, so it is always required to use the best approach to deal with it. The partitioning in Hive is the best example of it.
Let's assume we have data of 10 million students studying in an institute. Now, we have to fetch the students of a particular course. If we use a traditional approach, we have to go through the entire data. This leads to performance degradation. In such a case, we can adopt a better approach, i.e., partitioning in Hive, and divide the data among the different datasets based on particular columns.
The partitioning in Hive can be executed in two ways:
o Static partitioning
o Dynamic partitioning

Static Partitioning
In static or manual partitioning, it is required to pass the values of the partitioned columns manually while loading the data into the table. Hence, the data file doesn't contain the partitioned columns.
Example of Static Partitioning
o First, select the database in which we want to create a table.
1. hive> use test;
o Create the table and provide the partitioned columns by using the following command:
1. hive> create table student (id int, name string, age int, institute string)
2. partitioned by (course string)
3. row format delimited
4. fields terminated by ',';
o Let's retrieve the information associated with the table.
1. hive> describe student;
o Load the data into the table and pass the values of the partition columns with it by using the following command:
1. hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
2. partition(course= "java");
Loading data to table test.student partition (course=java)
Partition test.student{course=java} stats: [numFiles=1, numRows=0, totalSize=122, rawDataSize=0]
OK
Time taken: 8.057 seconds
Here, we are partitioning the students of an institute based on courses.
o Load the data of another file into the same table and pass the values of the partition columns with it by using the following command:
1. hive> load data local inpath '/home/codegyani/hive/student_details2' into table student
2. partition(course= "hadoop");
Loading data to table test.student partition (course=hadoop)
Partition test.student{course=hadoop} stats: [numFiles=1, numRows=0, totalSize=75, rawDataSize=0]
OK
Time taken: 2.402 seconds
In the following screenshot, we can see that the table student is divided into two categories.
[Screenshot: the HDFS directory /user/hive/warehouse/test.db/student contains two subdirectories, course=hadoop and course=java.]

o Let's retrieve the entire data of the table by using the following command:
1. hive> select * from student;

OK
6 "Chris"    22 javatpoint hadoop
7 "Hariss"   21 javatpoint hadoop
8 "Angelina" 24 NULL       hadoop
1 "Gaurav"   24 javatpoint java
2 "John"     22 javatpoint java
3 "William"  21 javatpoint java
4 "Steve"    24 javatpoint java
5 "Roman"    23 javatpoint java
Time taken: 4.85 seconds, Fetched: 8 row(s)

o Now, try to retrieve the data based on partitioned columns by using the following command:
1. hive> select * from student where course="java";
In this case, we are not examining the entire data. Hence, this approach improves query response time.
o Let's also retrieve the data of another partitioned dataset by using the following command:
1. hive> select * from student where course= "hadoop";
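The example above walks through static partitioning; for completeness, here is a hedged sketch of dynamic partitioning, where Hive derives the partition values from the data itself. The staging table stud_demo is hypothetical; the two set commands are standard Hive properties:

1. hive> set hive.exec.dynamic.partition=true;
2. hive> set hive.exec.dynamic.partition.mode=nonstrict;
3. hive> insert into table student partition(course)
4. select id, name, age, institute, course from stud_demo;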
HBase is a data model that is similar to Google's big table, designed to provide quick random access to huge amounts of structured data. This tutorial provides an introduction to HBase, the procedures to set up HBase on Hadoop file systems, and ways to interact with the HBase shell. It also describes how to connect to HBase using Java, and how to perform basic operations on HBase using Java.
Since 1970, RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a distributed file system for storing big data, and MapReduce to process it. Hadoop excels in storing and processing huge data of various formats such as arbitrary, semi-, or even unstructured.

Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. A huge dataset, when processed, results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).

Hadoop Random Access Databases
Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google's big table, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.

HBase and HDFS

HDFS                                                                    HBase
HDFS is a distributed file system suitable for storing large files.    HBase is a database built on top of the HDFS.
HDFS does not support fast individual record lookups.                  HBase provides fast lookups for larger tables.
It provides high latency batch processing.                             It provides low latency access to single rows from billions of records (random access).
It provides only sequential access of data.                            HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key value pairs. A table can have multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase:
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
Given below is an example schema of a table in HBase.

[Table: example schema - each Rowid maps to several column families, and each column family contains columns such as col1, col2, and col3.]
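A hedged HBase shell sketch of this model, mirroring the emp example in the column-family image shown in the next subsection (the table and family names are illustrative):

hbase(main):001:0> create 'emp', 'personal', 'professional'
hbase(main):002:0> put 'emp', '1', 'personal:name', 'raju'
hbase(main):003:0> put 'emp', '1', 'personal:city', 'hyderabad'
hbase(main):004:0> put 'emp', '1', 'professional:designation', 'manager'
hbase(main):005:0> get 'emp', '1'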

Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. Shortly, they will have column families.

Row-Oriented Database                                       Column-Oriented Database
It is suitable for Online Transaction Processing (OLTP).    It is suitable for Online Analytical Processing (OLAP).
Such databases are designed for a small number of rows and columns.    Column-oriented databases are designed for huge tables.

The following image shows column families in a column-oriented database:

[Figure: COLUMN FAMILIES - the row key empid, with the column family personal data (name, city) and the column family professional data (designation, salary):]

empid  name    city       designation  salary
1      raju    hyderabad  manager      50,000
2      ravi    chennai    sr.engineer  30,000
3      rajesh  delhi      jr.engineer  25,000
HBase and RDBMS

HBase                                                        RDBMS
HBase is schema-less; it doesn't have the concept of a fixed columns schema; it defines only column families.    An RDBMS is governed by its schema, which describes the whole structure of tables.
It is built for wide tables. HBase is horizontally scalable.    It is thin and built for small tables. It is hard to scale.
No transactions are there in HBase.                          RDBMS is transactional.
It has de-normalized data.                                   It will have normalized data.
It is good for semi-structured as well as structured data.   It is good for structured data.

Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.

Where to Use HBase
• Apache HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable acts upon the Google File System, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase History

Year       Event
Nov 2006   Google released the paper on BigTable.
Feb 2007   Initial HBase prototype was created as a Hadoop contribution.
Oct 2007   The first usable HBase along with Hadoop 0.15.0 was released.
Jan 2008   HBase became the subproject of Hadoop.
Oct 2008   HBase 0.18.1 was released.
Jan 2009   HBase 0.19.0 was released.
Sept 2009  HBase 0.20.0 was released.
May 2010   HBase became an Apache top-level project.
HBase tables are split into regions and are served by the region servers. Regions are vertically divided by column families into "Stores". Stores are saved as files in HDFS. Shown below is the architecture of HBase.
Note: The term 'store' is used for regions to explain the storage structure.

[Figure: HBase architecture - clients communicate through ZooKeeper with the master server and the region servers; the regions are stored in Hadoop HDFS.]
HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.

MasterServer

The master server -


• Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the busy servers and shifts
the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as creation of tables and column families.

Regions

Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that -
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under it.
• Decide the size of the region by following the region size thresholds.
When we take a deeper look into the region server, it contains regions and stores as shown below:
[Figure: a region server contains multiple regions; each region holds stores, and each store consists of a memstore and store files (HFiles).]
The store contains the memstore and HFiles. The memstore is just like a cache memory. Anything that is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed.

Zookeeper
• Zookeeper is an open-source project that provides services like maintaining configuration information, naming, providing distributed synchronization, etc.
• Zookeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via Zookeeper.
• In pseudo and standalone modes, HBase itself will take care of Zookeeper.
