Apache Pig

The Pig Hadoop framework has four main components:
1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The parser is responsible for checking the syntax of the script, along with other miscellaneous checks. The parser gives an output in the form of a Directed Acyclic Graph (DAG) that contains the Pig Latin statements, together with other logical operators represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for the DAG is passed to a logical optimizer, which carries out the logical optimizations.
3. Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
4. Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on the Hadoop cluster to produce the desired results.
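The "DAG with operators as nodes" idea can be made concrete with a tiny sketch. This is not Pig's actual internals; it only illustrates, under invented names, how a script's operators form a dependency graph that later stages walk in order.

```python
# Illustrative sketch (not Pig internals): the parser's output is a DAG whose
# nodes are Pig Latin operators; downstream stages visit it in dependency order.

def topo_order(edges):
    """Return nodes of a DAG, given as {node: [dependencies]}, in dependency order."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in edges.get(node, []):
            visit(dep)
        order.append(node)

    for node in edges:
        visit(node)
    return order

# A script like LOAD -> FILTER -> GROUP -> FOREACH -> STORE becomes:
dag = {
    "STORE": ["FOREACH"],
    "FOREACH": ["GROUP"],
    "GROUP": ["FILTER"],
    "FILTER": ["LOAD"],
    "LOAD": [],
}

print(topo_order(dag))  # LOAD comes first, STORE last
```

Each operator appears only after everything it depends on, which is the property the later planning stages rely on.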
By default, Pig Hadoop chooses to run MapReduce jobs, in which access is required to the Hadoop cluster and the HDFS installation. But there is also a local mode in which all the files are installed and run using a localhost and the local file system. You can run the local mode using the following command:

pig -x local
Apache Pig vs. MapReduce:
• Apache Pig: Less development effort is needed. MapReduce: More development effort is required.
• Apache Pig: It allows nested data types like map, tuple and bag. MapReduce: It does not allow nested data types.
Applications of Apache Pig:
• Pig scripting is used for exploring large datasets.
• It provides support for ad-hoc queries across large datasets.
• It is used in the prototyping of large dataset processing algorithms.
• It is required to process time-sensitive data loads.
• It is used for collecting large amounts of data in the form of search logs and web crawls.
• It is used where analytical insights are needed using sampling.
Types of Data Models in Apache Pig: It consists of the following 4 types of data models:
• Atom: It is an atomic data value which is used to store as a string. The main use of this model is that it can be used both as a number and as a string.
• Tuple: It is an ordered set of fields.
• Bag: It is a collection of tuples.
• Map: It is a set of key/value pairs.
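The four data models have rough Python analogues, which may help make them concrete. This is only an analogy with made-up values, not Pig syntax.

```python
# Rough Python analogues of Pig's four data models (illustrative only).

atom = "42"                          # atom: stored as a string...
as_number = int(atom)                # ...but usable as a number too

tup = ("Alice", 22, "java")          # tuple: an ordered set of fields

bag = {("Alice", 22), ("Bob", 25)}   # bag: an unordered collection of tuples

mp = {"course": "java", "city": "Pune"}  # map: key/value pairs

print(as_number, tup[0], ("Bob", 25) in bag, mp["course"])
```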
Pig Architecture
The Architecture of Pig consists of two components:
1. Pig Latin, which is a language
2. A runtime environment, for running Pig Latin programs.
A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation by the Hadoop Pig execution environment. Underneath, the results of these transformations are a series of MapReduce jobs of which the programmer is unaware. So, in a way, Pig in Hadoop allows the programmer to focus on the data rather than the nature of execution.
Pig Latin is a relatively stiffened language which uses familiar keywords from data processing, e.g., Join, Group and Filter.
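The "series of transformations applied to the data" idea can be sketched in plain Python. The records and field names below are invented; the point is only that each step (a Filter, a Group) transforms the whole dataset, and how it runs is left to the engine.

```python
# Minimal sketch of a Pig-style data flow: whole-dataset transformations.
# Data and field names are invented for illustration.

records = [
    {"name": "Alice", "course": "java"},
    {"name": "Bob", "course": "hadoop"},
    {"name": "Chris", "course": "java"},
]

# Analogue of: FILTER students BY course == 'java';
filtered = [r for r in records if r["course"] == "java"]

# Analogue of: GROUP students BY course;
grouped = {}
for r in records:
    grouped.setdefault(r["course"], []).append(r["name"])

print(len(filtered), sorted(grouped))
```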
[Figure: PIG architecture — logical plan transformation, physical plan transformation]
Execution modes:
Pig in Hadoop has two execution modes:
1. Local mode: In this mode, the Hadoop Pig language runs in a single JVM and makes use of the local file system. This mode is suitable only for the analysis of small datasets using Pig in Hadoop.
2. MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (the cluster may be pseudo or fully distributed). MapReduce mode with the fully distributed cluster is useful for running Pig on large datasets.
(Source: https://www.tutorialspoint.com/apache_pig/pig_latin_basics.htm)
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type.
Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by position. Like tuples, a bag can, but is not required to, have a schema associated with it. In the case of a bag, the schema describes all tuples within the bag.
Nulls
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, or an error occurred in processing.
Hive is a data warehouse system which is used to analyze structured data. It is built on the top of Hadoop. It was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language) which get internally converted to MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).
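How an SQL-like query "gets internally converted to MapReduce" can be sketched in miniature. This is a simplified conceptual model, not Hive's compiler: it expresses `SELECT course, COUNT(*) ... GROUP BY course` as a map phase, a shuffle, and a reduce phase, over invented rows.

```python
# Sketch of SELECT course, COUNT(*) FROM student GROUP BY course
# expressed as map / shuffle / reduce phases (conceptual, not Hive code).

from collections import defaultdict

rows = [("Alice", "java"), ("Bob", "hadoop"), ("Chris", "java")]

# Map phase: emit (course, 1) for every input row.
mapped = [(course, 1) for _, course in rows]

# Shuffle: group intermediate pairs by key.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce phase: sum the counts for each course.
counts = {key: sum(values) for key, values in shuffled.items()}

print(counts)  # {'java': 2, 'hadoop': 1}
```

The same split into a per-record map step and a per-key reduce step is what lets the real jobs run in parallel across a cluster.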
Features of Hive
These are the following features of Hive :
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark
jobs.
o It is capable of analyzing large datasets stored in HDFS .
o It allows different storage types such as plain text, RCFile, and HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem .
o It supports user-defined functions (UDFs) where the user can provide their own functionality.
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.
Differences between Hive and Pig
Hive Architecture
The following architecture explains the flow of submission of query into Hive.
[Figure: Hive architecture — Hive Client, Hive Services, MetaStore, HDFS]
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients such as:
o Thrift Server - It is a cross-language service provider platform that serves requests from all those programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata of each column and its type information, the serializers and deserializers which are used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from different clients and provides them to the Hive Driver.
o Hive Driver - It receives queries from different sources like the web UI, CLI, Thrift, and the JDBC/ODBC driver. It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
HIVE Data Types
Hive data types are categorized into numeric types, string types, misc types, and complex types. A list of Hive data types is given below.
Integer Types
BIGINT: 8-byte signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
Decimal Type
Date/Time Types
TIMESTA MP
o It supports traditional UNIX timestamp with optional nanosecond precision.
o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with decimal
precision.
o As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision).
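The three TIMESTAMP interpretations above can be mirrored with Python's datetime module. This is only an illustration of the semantics (Hive parses these itself); note that Python's datetime keeps microseconds, so the 9 fractional digits of the string form are truncated to 6 here.

```python
# Mirroring Hive's three TIMESTAMP interpretations with Python's datetime.

from datetime import datetime, timezone

# Integer -> UNIX timestamp in seconds.
ts_int = datetime.fromtimestamp(1000000000, tz=timezone.utc)

# Float -> UNIX timestamp in seconds with decimal precision.
ts_float = datetime.fromtimestamp(1000000000.5, tz=timezone.utc)

# String -> "YYYY-MM-DD HH:MM:SS.fffffffff"; truncate 9 digits to the
# 6 microsecond digits that datetime can represent.
raw = "2001-09-09 01:46:40.123456789"
head, frac = raw.split(".")
ts_str = datetime.strptime(head + "." + frac[:6], "%Y-%m-%d %H:%M:%S.%f")

print(ts_int.isoformat(), ts_float.microsecond, ts_str.microsecond)
```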
DATES
The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD. However, it does not provide the time of the day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
String Types
STRING
The string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
VARCHAR
The varchar is a variable-length type whose range lies between 1 and 65535, which specifies the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length type whose maximum length is fixed at 255.
Complex Types
Struct: It is similar to a C struct or an object where fields are accessed using the "dot" notation.
Map: It contains key-value tuples where the fields are accessed using array notation.
Array: It is a collection of values of a similar type that are indexable using zero-based integers.
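The access notation for the three complex types has direct Python analogues, shown below purely as an analogy (the field names and values are invented): struct fields use dot notation, map entries use bracket notation, and arrays use zero-based indices.

```python
# Python analogues of Hive's complex types (illustrative only).

from types import SimpleNamespace

# Struct: fields accessed with "dot" notation.
address = SimpleNamespace(city="Pune", zip="411001")

# Map: key-value pairs accessed with array (bracket) notation.
scores = {"math": 91, "physics": 84}

# Array: similar-typed values, zero-based indexing.
courses = ["java", "hadoop", "hive"]

print(address.city, scores["math"], courses[0])
```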
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city or country. The advantage of partitioning is that since the data is stored in slices, the query response time becomes faster.
As we know that Hadoop is used to handle huge amounts of data, it is always required to use the best approach to deal with it. The partitioning in Hive is the best example of it.
Let's assume we have data of 10 million students studying in an institute. Now, we have to fetch the students of a particular course. If we use a traditional approach, we have to go through the entire data. This leads to performance degradation. In such a case, we can adopt the better approach, i.e., partitioning in Hive, and divide the data among the different datasets based on particular columns.
The partitioning in Hive can be executed in two ways -
o Static partitioning
o Dynamic partitioning
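Why the "slices" speed things up can be sketched in a few lines. The rows and course names below are invented; the sketch contrasts a full scan with reading only one partition's slice.

```python
# Sketch of partition pruning: a query on `course` reads only that slice
# instead of scanning all rows. Data and names invented for illustration.

students = [
    ("1", "Alice", "java"), ("2", "Bob", "hadoop"),
    ("3", "Chris", "java"), ("4", "Dana", "hadoop"),
]

# Traditional approach: full scan over every row.
full_scan = [row for row in students if row[2] == "java"]

# Partitioned layout: one bucket per distinct course value.
partitions = {}
for row in students:
    partitions.setdefault(row[2], []).append(row)

# Partitioned query: touch only the "java" slice.
pruned = partitions["java"]

print(full_scan == pruned)  # same answer, far less data read
```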
o Load the data into the table and pass the values of partition columns with it by using the following command: -
1. hive> load data local inpath '/home/codegyani/hive/student_details1' into table student
2. partition(course= "java");
o Let's retrieve the entire data of the table by using the following command: -
1. hive> select * from student;
o Now, try to retrieve the data based on partitioned columns by using the following command: -
hive> select * from student where course="java";
In this case, we are not examining the entire data. Hence, this approach improves query response time.
o Let's also retrieve the data of another partitioned dataset by using the following command: -
1. hive> select * from student where course="hadoop";
HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. This tutorial provides an introduction to HBase, the procedures to set up HBase on Hadoop File Systems, and ways to interact with the HBase shell. It also describes how to connect to HBase using Java, and how to perform basic operations on HBase using Java.
Since 1970, the RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a distributed file system for storing big data, and MapReduce to process it. Hadoop excels in storing and processing huge data of various formats such as arbitrary, semi-, or even unstructured.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs. A huge dataset when processed results in another huge dataset, which should also be processed sequentially. At this point, a new solution is needed to access any point of data in a single unit of time (random access).
HDFS vs. HBase:
• HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of the HDFS.
• HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables.
• HDFS provides high latency batch processing; there is no concept of random access. HBase provides low latency access to single rows from billions of records (random access).
• HDFS provides only sequential access of data. HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
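The sequential-versus-random-access contrast can be shown in miniature. This is purely illustrative, not HBase code: a sequential lookup touches every record on the way to its target, while a hash-table index (the role HBase's indexed HFiles play) jumps straight to it.

```python
# Sequential scan vs. hash-indexed lookup, in miniature (not HBase code).

records = [("row%05d" % i, "value%d" % i) for i in range(10000)]

def sequential_lookup(key):
    """Scan records in order; return (value, records touched)."""
    steps = 0
    for k, v in records:
        steps += 1
        if k == key:
            return v, steps
    return None, steps

# Hash-table index: expected O(1) per lookup.
index = dict(records)

value, steps = sequential_lookup("row09999")
print(steps, index["row09999"] == value)  # the scan touched all 10000 records
```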
The following image shows column families in a column-oriented database:
[Figure: rows keyed by row key with columns empid and name, plus column families "personal data" (city) and "professional data" (designation, salary)]
HBase vs. RDBMS:
• HBase is schema-less; it doesn't have the concept of a fixed columns schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin and built for small tables, and is hard to scale.
Feature s of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase History
Year Event
Oct 2007: The first usable HBase along with Hadoop 0.15.0 was released.
[Figure: HBase architecture — client and Zookeeper communicating with the master server and region servers, on top of Hadoop HDFS]
HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.
MasterServer
Regions
Regions are nothing but tables that are split up and spread across the region servers.
Region server
The region servers have regions that -
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under it.
• Decide the size of the region by following the region size thresholds.
When we take a deeper look into the region server, it contains regions and stores as shown below:
[Figure: region server — regions containing stores; each store holds a memstore and store files (HFiles)]
The store contains the memstore and HFiles. The memstore is just like a cache memory: anything that is entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed.
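The memstore-then-flush write path can be sketched as a toy store. This only mirrors the idea just described, not HBase's real implementation: writes land in an in-memory dict, and once it fills up, its contents are persisted as a sorted, immutable "HFile" block and the memstore is cleared.

```python
# Toy memstore/HFile write path (conceptual sketch, not HBase code).

class ToyStore:
    def __init__(self, flush_threshold=3):
        self.memstore = {}          # in-memory writes land here first
        self.hfiles = []            # each flush appends one sorted "file"
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memstore contents as a sorted block, then clear it.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:    # newest data is still in memory
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None

store = ToyStore()
for i in range(4):
    store.put("row%d" % i, i)

print(len(store.hfiles), len(store.memstore), store.get("row0"))
```

After four puts with a threshold of three, one flush has happened: three rows live in an HFile block, the fourth is still in the memstore, and reads consult the memstore before the files.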
Zookeeper
• Zookeeper is an open-source project that provides services like maintaining configuration information, naming, providing distributed synchronization, etc.
• Zookeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients communicate with region servers via Zookeeper.
• In pseudo and standalone modes, HBase itself will take care of Zookeeper.