
UG BigData
Dr. Veeramanikandan, Asst. Prof., Dept. of CS, TKGAC, VDM

BigData Frameworks – Hive & Pig in Hadoop
Unit - V

Introduction to Hive

Birth of Hive

Facebook played an active role in the birth of Hive, as Facebook uses Hadoop to handle Big Data. Hadoop uses MapReduce to process data. Previously, users needed to write lengthy, complex code to process and analyze data, and not everyone was well versed in Java and other programming languages. On the other hand, many individuals were comfortable writing queries in SQL. For this reason, there was a need for a language similar to SQL, which was already familiar to most users. This is how the Hive Query Language, also known as HiveQL, came to be.

What is Hive in Hadoop?

Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. Hive
uses a query language called HiveQL, which is similar to SQL.

Fig: Hive operation

The image above shows a user writing queries in HiveQL, which are then converted into MapReduce tasks before the data is processed and analyzed. HiveQL works on structured data, such as numbers, addresses, dates, and names, and it allows multiple users to query data simultaneously.

So, what do we do with semi-structured and unstructured data like emails, images, and videos? Enter Apache Pig.


Introduction to Pig

Pig also came into existence to solve issues with MapReduce. Let's take a closer look at Apache Pig.

Birth of Pig

Although MapReduce helped process and analyze Big Data faster, it had its flaws. Individuals who were unfamiliar with programming often found it challenging to write lengthy Java code. Eventually, maintaining and optimizing the code became a difficult task, and as a result, processing time increased.

This was the reason Yahoo faced problems when it came to processing and analyzing large datasets. Apache Pig was developed to analyze large datasets without writing time-consuming, complex Java code. Pig was explicitly developed for non-programmers.

What is Pig in Hadoop?

Pig is a scripting platform that runs on Hadoop clusters, designed to process and analyze large datasets. Pig uses a language called Pig Latin, which is similar to SQL but does not require as much code to analyze data. Although it resembles SQL, it has significant differences: roughly 10 lines of Pig Latin are equivalent to 200 lines of Java, which, in turn, results in shorter development times.
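
To get a feel for this compactness, here is a hedged Pig Latin sketch (the file path, field names, and schema are all hypothetical) that loads a tab-separated access log, keeps the server errors, and counts them per URL:

Code:

logs   = LOAD 'hdfs:///data/access_log' USING PigStorage('\t')
         AS (user:chararray, url:chararray, status:int);
errors = FILTER logs BY status >= 500;
by_url = GROUP errors BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(errors) AS hits;
ranked = ORDER counts BY hits DESC;
STORE ranked INTO 'hdfs:///output/error_counts';

An equivalent hand-written MapReduce job would need separate mapper, reducer, and driver classes in Java to produce the same result.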

Fig: Pig operation


What stands out about Pig is that it operates on various types of data: structured, semi-structured, and unstructured. Whichever of these you are working with, Pig takes care of it all.

Many people wonder what makes Pig better than Hive, and Hive does have its own advantages over Pig. The comparison below contrasts their features to help you make a more informed decision about which platform best suits your requirements.

Hive vs. Pig

The following table compares the advantages of Hive with the advantages of Pig:

Features

1. Language: Hive uses a declarative language called HiveQL. Pig Latin is a procedural data-flow language.
2. Schema: Hive supports schema. Creating a schema is not required to store data in Pig.
3. Data processing: Hive is used for batch processing. Pig is a high-level data-flow language.
4. Partitions: Hive supports partitions. Pig does not support partitions, although there is an option for filtering.
5. Web interface: Hive has a web interface. Pig does not support a web interface.
6. User specification: Data analysts are the primary users of Hive. Programmers and researchers use Pig.
7. Used for: Hive is used for reporting. Pig is used for programming.
8. Type of data: Hive works on structured data only. Pig works on structured, semi-structured, and unstructured data.
9. Operates on: Hive works on the server side of the cluster. Pig works on the client side of the cluster.
10. Avro file format: Hive does not support Avro. Pig supports Avro.
11. Loading speed: Hive takes time to load but executes quickly. Pig loads data quickly.
12. JDBC/ODBC: Supported by Hive, but limited. Unsupported by Pig.

Fig: Hive vs. Pig comparison table

Both Hive and Pig are excellent data analysis tools—one is not necessarily better than the other,
but they do have different capabilities and features. Depending on your job role, business
requirements, and budget, you can choose either of these Big Data analysis platforms.


Difference between Pig and Hive:

1. Pig operates on the client side of a cluster. Hive operates on the server side of a cluster.
2. Pig uses the Pig Latin language. Hive uses the HiveQL language.
3. Pig is a procedural data-flow language. Hive is a declarative, SQL-like language.
4. Pig was developed by Yahoo. Hive was developed by Facebook.
5. Pig is used by researchers and programmers. Hive is mainly used by data analysts.
6. Pig is used to handle structured and semi-structured data. Hive is mainly used to handle structured data.
7. Pig is used for programming. Hive is used for creating reports.
8. Pig scripts end with the .pig extension. In Hive, all extensions are supported.
9. Pig does not support partitioning. Hive supports partitioning.
10. Pig loads data quickly. Hive loads data slowly.
11. Pig does not support JDBC. Hive supports JDBC.
12. Pig does not support ODBC. Hive supports ODBC.
13. Pig does not have a dedicated metadata database. Hive uses a dedicated SQL-DDL-like language, defining tables beforehand.
14. Pig supports the Avro file format. Hive does not support the Avro file format.
15. Pig is suitable for complex and nested data structures. Hive is suitable for batch-processing OLAP systems.
16. Pig does not support schema to store data. Hive supports schema for data insertion in tables.
17. In Pig it is very easy to write UDFs to calculate matrices. Hive also supports UDFs, but they are much harder to debug.

What is Hive
Hive is a data warehouse system used to analyze structured data. It is built on top of Hadoop and was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in distributed storage. It runs SQL-like queries called HQL (Hive Query Language), which are internally converted to MapReduce jobs.

Using Hive, we can skip the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs).
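
As a quick illustration, here is a minimal sketch combining DDL and DML (the table, columns, and file path are hypothetical):

Code:

-- DDL: define a simple comma-delimited table.
CREATE TABLE IF NOT EXISTS employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- DML: load a local file into the table, then query it.
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;
SELECT name, salary FROM employees WHERE salary > 50000;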

Features of Hive
The following are the features of Hive:

o Hive is fast and scalable.


o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or Spark
jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and HBase.


o It uses indexing to accelerate queries.


o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs), through which users can provide their own functionality.

Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.

Differences between Hive and Pig

Hive | Pig

Hive is commonly used by data analysts. | Pig is commonly used by programmers.
It follows SQL-like queries. | It follows the data-flow language.
It can handle structured data. | It can handle semi-structured data.
It works on the server side of the HDFS cluster. | It works on the client side of the HDFS cluster.
Hive is slower than Pig. | Pig is comparatively faster than Hive.


Hive Architecture
The following architecture explains the flow of query submission into Hive.

Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:

o Thrift Server - It is a cross-language service provider platform that serves requests from all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows applications that support the ODBC protocol to connect to Hive.

Hive Services
The following are the services provided by Hive:

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata for each column and its type, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from different clients and forwards them to the Hive Driver.
o Hive Driver - It receives queries from different sources such as the Web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks. The execution engine then executes the incoming tasks in the order of their dependencies (see the sketch after this list).
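
A quick way to see the compiler and optimizer at work is EXPLAIN, which prints the generated plan without running the query (the table name here is hypothetical):

Code:

EXPLAIN
SELECT class, count(*) AS total
FROM students
GROUP BY class;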

What Is Hadoop Hive Query Language

Hive Query Language

• HiveQL is the Hive Query Language.
• Hive offers no support for row-level inserts, updates, and deletes.
• Hive does not support transactions.
• Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs.
• DDL and DML are the two parts of HiveQL.
• Data Definition Language (DDL) is used for creating, altering, and dropping databases, tables, views, functions, and indexes.


• Data Manipulation Language (DML) is used to put data into Hive tables, to extract data to the file system, and to explore and manipulate data with queries, grouping, filtering, joining, etc.

Databases in Hive:

• A database in Hive is essentially just a catalog or namespace of tables.
• Databases are very useful for larger clusters with multiple teams and users, as a way of avoiding table name collisions.
• Hive provides commands such as the following (a short sketch follows this list):
o CREATE DATABASE dbname -- to create a database in Hive
o USE dbname -- to switch to the database in Hive
o DROP DATABASE dbname -- to delete the database in Hive
o SHOW DATABASES -- to see the list of databases
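
Code:

-- A minimal sketch of these commands; "sales_db" is a hypothetical name.
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;
SHOW DATABASES;
DROP DATABASE IF EXISTS sales_db CASCADE;  -- CASCADE also drops the tables inside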

Simple Selects

In Hive, querying data is performed with a SELECT statement. A SELECT statement has six key components:

1. SELECT column names
2. FROM table-name
3. WHERE conditions
4. GROUP BY column names
5. HAVING conditions
6. ORDER BY column names

Simple Selects – Selecting Rows

In addition to limiting the columns returned by a query, you can also limit the rows returned. The simplest case is to state how many rows are wanted using the LIMIT clause.

Code:

SELECT anonid, fueltypes, acorn_type
FROM geog_all
LIMIT 10;


1. HiveQL query for the information_schema database

Hive queries can be written to get information about Hive privileges, tables, views, or columns. The information_schema data is a read-only and user-friendly way to know the state of the system, similar to the data in a sys database.

Example:

Code:

SELECT * FROM information_schema.columns WHERE table_schema = 'database_name';

This will retrieve all the columns in the database specified.

2. Creation and loading of data into a table

The bulk load operation is used to insert data into managed tables, as Hive does not support row-level insert, delete, or update.

Code:

LOAD DATA LOCAL INPATH '$Home/students_address' OVERWRITE INTO TABLE students
PARTITION (class = "12", section = "science");

With the above command, a directory is first created for the partition, and then all the files are copied into that directory. The keyword LOCAL specifies that the data is present in the local file system. The PARTITION keyword can be omitted if the table does not have a partition key. The Hive query will not check that the data being loaded matches the schema of the table.

The INSERT command is used to load data from a query into a table. The OVERWRITE keyword is used to replace the data in a table. In Hive v0.8.0 or later, data is appended to the table if the OVERWRITE keyword is omitted.

Code:

INSERT OVERWRITE TABLE students
PARTITION (class = "12", section = "science")
SELECT * FROM students_data WHERE class = "12" AND section = "science";

All the partitions of the table students_data can be dynamically inserted by setting the properties below:

Code:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.max.dynamic.partitions.pernode = 1000;
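
With those properties set, a hedged sketch of a dynamic-partition insert looks like this (the column names are hypothetical); the partition values are taken from the trailing columns of the SELECT:

Code:

INSERT OVERWRITE TABLE students
PARTITION (class, section)
SELECT name, roll_number, class, section
FROM students_data;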

A CREATE TABLE ... AS SELECT (CTAS) statement will also create a table, with the schema taken from the SELECT clause.
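
Code:

-- A minimal CTAS sketch; the schema of passed_students comes from the SELECT.
CREATE TABLE passed_students AS
SELECT roll_number, class, section
FROM students_data
WHERE class = "12";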


3. Merge data in tables

Data can be merged from tables using classic SQL joins: inner, full outer, left, and right joins.

Code:

SELECT a.roll_number, a.class, a.section
FROM students AS a
INNER JOIN pass_table AS b
ON a.roll_number = b.roll_number;

This will return the class and section of all roll numbers that have passed. Using a left join instead will return the grade for passing students and NULL for the failed ones.

Code:

SELECT a.roll_number, a.class, a.section, b.grade
FROM students AS a
LEFT JOIN pass_table AS b
ON a.roll_number = b.roll_number;

UNION ALL and UNION are also used to append data from two tables. However, a few things must be taken care of when doing so; in particular, the schemas of both tables should be the same. UNION appends the tables and returns unique records, while UNION ALL returns all records, including duplicates.
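
A hedged sketch, assuming two tables with identical schemas (the table names are hypothetical):

Code:

-- UNION ALL keeps duplicates; replace it with UNION to de-duplicate.
SELECT roll_number, class, section FROM students_2022
UNION ALL
SELECT roll_number, class, section FROM students_2023;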


4. Ordering a table
The ORDER BY clause enables total ordering of the data set by passing all data through one reducer. This may take a long time for large tables, so the SORT BY clause can be used instead to achieve partial ordering, by sorting the data within each reducer.

Code:

SELECT customer_id, spends
FROM customer
ORDER BY spends DESC
LIMIT 100;

This will return the top 100 customers with the highest spends.
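
For contrast, a minimal SORT BY sketch on the same table; each reducer sorts only its own share of the data, so the overall output is just partially ordered:

Code:

SELECT customer_id, spends
FROM customer
SORT BY spends DESC;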

5. Aggregation of data in a table

Aggregation is done using aggregate functions, which return a single value after computing over many rows. Examples are count(col), sum(col), avg(col), min(col), max(col), stddev_pop(col), percentile_approx(int_expr, P, NB) (where NB is the number of histogram bins used for estimation), and collect_set(col), which returns a set of the column's values with duplicate elements removed.

The setting that helps improve aggregation performance is hive.map.aggr = true.

The GROUP BY clause is used with an aggregate function.

Example:

Code:

SELECT year(date_yy), avg(spends)
FROM customer_spends
WHERE merchant = "Retail"
GROUP BY year(date_yy);


The HAVING clause is used to restrict the output from a GROUP BY, which would otherwise be done using a subquery.
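
For example, a hedged sketch restricting the grouped output (the threshold is illustrative):

Code:

SELECT year(date_yy) AS yr, avg(spends) AS avg_spend
FROM customer_spends
GROUP BY year(date_yy)
HAVING avg(spends) > 1000;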

6. Conditional statements
The CASE...WHEN...THEN clause is similar to if-else statements and performs a conditional operation on any column in a query.

For example:

Code:

SELECT customer,
  CASE
    WHEN percentage < 40 THEN "Fail"
    WHEN percentage >= 40 AND percentage < 80 THEN "Average"
    ELSE "Excellent"
  END AS rank
FROM students;

Frequently Asked Questions (for your reference)

Q1. What queries are used in Hive?

A. Hive supports the Hive Query Language (HQL), which is very similar to SQL. It supports the usual insert, update, delete, and merge SQL statements to query data in Hive (row-level updates and deletes require ACID transactional tables in later Hive versions).

Q2. What are the benefits of Hive?


A. Hive is built on top of Apache Hadoop, which makes it an apt tool for analyzing Big Data. It also supports various types of connectors, making it easier for developers to query Hive data from different programming languages.

Q3. What is the difference between Hive and MapReduce?

A. Hive is a data warehousing system that provides an SQL-like querying language called HiveQL, while MapReduce is a programming model and software framework for processing large datasets in a distributed computing environment. Hive also provides a schema for data stored in the Hadoop Distributed File System (HDFS), making it easier to manage and analyze large datasets.

Features of Apache Pig

Let's look at the various features of Pig.

1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes this process easy: in Pig, the queries are converted to MapReduce internally.

2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

3) Extensibility
Users can write user-defined functions containing their own logic to execute over the data set.


4) Flexible
It can easily handle structured as well as unstructured data.

5) In-built operators
It contains various types of operators, such as sort, filter, and join.

Differences between Apache MapReduce and Pig

1. Apache MapReduce is a low-level data processing tool. Apache Pig is a high-level data-flow tool.
2. In MapReduce, it is required to develop complex programs using Java or Python. In Pig, it is not required to develop complex programs.
3. It is difficult to perform data operations in MapReduce. Pig provides built-in operators to perform data operations like union, sorting, and ordering.
4. MapReduce doesn't allow nested data types. Pig provides nested data types like tuple, bag, and map.

Advantages of Apache Pig

o Less code - Pig needs fewer lines of code to perform any operation.
o Reusability - Pig code is flexible enough to be reused.
o Nested data types - Pig provides a useful concept of nested data types like tuple, bag, and map (see the sketch below).
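
A small hedged sketch of those nested types (the file path and field names are hypothetical); TOKENIZE turns a string into a bag of single-field tuples:

Code:

-- Each input line: a user id and a comma-separated tag string.
raw  = LOAD 'hdfs:///data/users.txt' USING PigStorage('\t')
       AS (id:chararray, tags:chararray);
-- TOKENIZE returns a bag of tuples, one tuple per tag.
bags = FOREACH raw GENERATE id, TOKENIZE(tags, ',') AS tag_bag;
DUMP bags;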

Fundamentals of HBase & ZooKeeper

ZooKeeper
ZooKeeper acts as the bridge for communication across the HBase architecture. It is responsible for keeping track of all the Region Servers and the regions within them. Monitoring which Region Servers and HMaster are active and which have failed is also part of ZooKeeper's duties. When it finds that a Region Server has failed, it triggers the HMaster to take the necessary actions. On the other hand, if the HMaster itself fails, it triggers an inactive HMaster, which becomes active after the alert. Every user, and even the HMaster, needs to go through ZooKeeper to access the Region Servers and the data within. ZooKeeper stores the .META file, which contains a list of all the Region Servers. ZooKeeper's responsibilities include:
• Establishing communication across the Hadoop cluster
• Maintaining configuration information
• Tracking Region Server and HMaster failure
• Maintaining Region Server information

Figure – Architecture of HBase


The three components of HBase are described below:

HMaster –
HMaster is the implementation of the Master Server in HBase. It is the process that assigns regions to Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server instances present in the cluster. In a distributed environment, the Master runs several background threads. HMaster has many features, such as controlling load balancing and failover.

Region Server –
HBase tables are divided horizontally by row key range into Regions. Regions are the basic building blocks of an HBase cluster; they hold a table's distributed data and are comprised of column families. Region Servers run on the HDFS DataNodes present in the Hadoop cluster. A Region Server is responsible for handling, managing, and executing reads and writes of HBase operations on its set of regions. The default size of a region is 256 MB.

ZooKeeper –
ZooKeeper is like a coordinator in HBase. It provides services like maintaining configuration information, naming, providing distributed synchronization, server failure notification, etc. Clients communicate with Region Servers via ZooKeeper.
• ZooKeeper is the centralized service intended to support configuration information and naming; it provides distributed synchronization and group services.

IBM InfoSphere BigInsights

At the end of 2011, IBM released the InfoSphere BigInsights and InfoSphere Streams software, which allows clients to quickly gain insight into the information streams relevant to their business.
BigInsights is a data analysis platform that allows companies to turn complex, Internet-scale data sets into knowledge. An easily installed Apache Hadoop distribution, along with a set of related tools for application development, data transfer, and cluster management, is part of this platform.
In addition to the above-mentioned products, the BigInsights distribution includes the following IBM technologies:
• BigSheets is a browser-based, spreadsheet-style interface for searching and analyzing data using the full power of Hadoop; it allows users to collect and analyze data easily. It contains built-in data viewers able to work with several widespread formats, including JSON, CSV (comma-separated values), and TSV (tab-separated values).
• Text Analytics is a pre-built library of text annotators for common business objects. It contains a rich language and tools for creating custom annotators.
• Adaptive MapReduce is a solution developed by IBM Research to accelerate the execution of small MapReduce tasks by changing the way they are processed.

InfoSphere platform
InfoSphere is a comprehensive information integration platform that includes means of data storage and analysis, information integration tools, master data management tools, lifecycle management tools, and means of protecting data and ensuring its confidentiality. InfoSphere makes the application development process more effective, allowing organizations to save time, reduce integration costs, and increase the quality of their information.


The BigInsights product, as part of the IBM Big Data platform, contains integration points with the platform's other components, including storage systems, data integration, governance mechanisms, and third-party tools for data analysis. It is also possible to integrate BigInsights with the InfoSphere Streams platform.
