Introduction to Hive
Birth of Hive
Facebook played an active role in the birth of Hive, as Facebook uses Hadoop to handle Big Data.
Hadoop uses MapReduce to process data. Previously, users needed to write lengthy, complex
code to process and analyze data, and not everyone was well-versed in Java and other complex
programming languages. On the other hand, many individuals were comfortable writing
queries in SQL. For this reason, there was a need for a language similar to SQL that would be
familiar to most users. This is how the Hive Query Language, also known as HiveQL,
came to be.
Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. Hive
uses a query language called HiveQL, which is similar to SQL.
A user writes queries in HiveQL, which are then converted into MapReduce tasks; the data is
then processed and analyzed. HiveQL works on structured data, such as numbers, addresses,
dates, and names, and it allows multiple users to query data simultaneously.
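Since HiveQL reads like ordinary SQL, a minimal query is enough to show the idea; the table and column names below are hypothetical:

```sql
-- Count visits per country from a hypothetical page_views table stored in HDFS
SELECT country, COUNT(*) AS visits
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY country;
```

Hive compiles a statement like this into one or more MapReduce jobs before executing it on the cluster.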
So, what do we do with semi-structured and unstructured data like emails, images, videos? Enter
Apache Pig.
UG BigData Dr.Veeramanikandan, Asst.Prof, Dept. of CS, TKGAC, VDM
Introduction to Pig
Pig also came into existence to solve issues with MapReduce. Let’s take a close look at Apache
Pig.
Birth of Pig
Although MapReduce helped process and analyze Big Data faster, it had its flaws. Individuals
who were unfamiliar with programming often found it challenging to write lengthy Java code.
Eventually, maintaining and optimizing the code became difficult, and as a result, the
processing time increased.
This was the reason Yahoo faced problems when it came to processing and analyzing large
datasets. Apache Pig was developed to analyze large datasets without time-consuming and
complex Java code. Pig was explicitly developed for non-programmers.
Pig is a scripting platform that runs on Hadoop clusters, designed to process and analyze large
datasets. Pig uses a language called Pig Latin, which is similar to SQL but has significant
differences, and it requires far less code to analyze data. Roughly, 10 lines of Pig Latin code
can be equivalent to 200 lines of Java. This, in turn, results in shorter development times.
What stands out about Pig is that it operates on various types of data: structured, semi-structured,
and unstructured.
Many people wonder what makes Pig better than Hive. Each has its advantages over the other
in a few ways, and comparing their features can help you make a more informed decision
about which platform best suits your requirements.
The following table compares the features of Hive and Pig:

Feature               | Hive                                 | Pig
5. Web interface      | Hive has a web interface             | Pig does not support a web interface
6. User specification | Data analysts are the primary users  | Programmers and researchers use Pig
Both Hive and Pig are excellent data analysis tools—one is not necessarily better than the other,
but they do have different capabilities and features. Depending on your job role, business
requirements, and budget, you can choose either of these Big Data analysis platforms.
Pig                                                          | Hive
4. It was developed by Yahoo.                                | It was developed by Facebook.
5. It is used by researchers and programmers.                | It is mainly used by data analysts.
6. It is used to handle structured and semi-structured data. | It is mainly used to handle structured data.
7. It is used for programming.                               | It is used for creating reports.
13. Pig does not have a dedicated metadata database.         | Hive makes use of an exact variation of a dedicated SQL-DDL language by defining tables beforehand.
16. Pig does not support schema to store data.               | Hive supports schema for data insertion in tables.
What is Hive
Hive is a data warehouse system used to analyze structured data. It is built on top of
Hadoop and was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which get
internally converted into MapReduce jobs.
Using Hive, we can skip the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDFs).
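As a sketch of the DDL and DML support just mentioned, the following defines a table over delimited files and queries it (the table name, columns, and delimiter are illustrative):

```sql
-- DDL: define a table whose data lives as comma-delimited files in HDFS
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- DML: query the table with ordinary SQL-like syntax
SELECT name, salary
FROM employees
WHERE salary > 50000;
```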
Features of Hive
The following are the features of Hive:
Limitations of Hive
o Hive is not capable of handling real-time data.
o It is not designed for online transaction processing.
o Hive queries have high latency.
Hive Architecture
The following architecture explains the flow of a query submitted to Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients, such as:
o Thrift Server - It is a cross-language service provider platform that serves requests from
all programming languages that support Thrift.
o JDBC Driver - It is used to establish a connection between Hive and Java applications. The
JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows applications that support the ODBC protocol to connect to
Hive.
Hive Services
The following are the services provided by Hive:
o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides
a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores the structure information of the various
tables and partitions in the warehouse. It also includes metadata about columns and their
types, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as the Apache Thrift Server. It accepts requests from different
clients and passes them to the Hive Driver.
o Hive Driver - It receives queries from different sources, such as the web UI, CLI, Thrift, and
the JDBC/ODBC drivers, and transfers them to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of
map-reduce tasks and HDFS tasks. The execution engine then executes the incoming tasks
in the order of their dependencies.
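To see the plan that the compiler and optimizer produce for a query, Hive provides the EXPLAIN statement (the table name here is hypothetical):

```sql
-- Print the stages of work Hive plans for this query
EXPLAIN
SELECT department, COUNT(*)
FROM employees
GROUP BY department;
```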
• Data Manipulation Language is used to load data into Hive tables, to extract data to the file
system, and to explore and manipulate data with queries, grouping, filtering, joining,
etc.
Databases in Hive:
Simple Selects
The information_schema database provides a read-only, user-friendly way to inspect the state of the system.
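For instance, assuming a Hive version that ships the information_schema database (Hive 3.0 and later), the tables in a schema can be listed with an ordinary query:

```sql
-- List the tables registered in the default database
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'default';
```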
Example:
Code:
SELECT * FROM students;
This will retrieve all the columns in the specified table (the table name here is illustrative).
Code:
LOAD DATA LOCAL INPATH '/home/user/students_data' INTO TABLE students PARTITION (class = '10');
With the above command, a directory is first created for the partition, and then all the files are
copied into that directory (the path and partition key above are illustrative). The keyword "LOCAL"
specifies that the data is present in the local file system. The "PARTITION" keyword can be
omitted if the table does not have a partition key. The Hive query will not check whether the
data being loaded matches the schema of the table.
The "INSERT" command is used to load data from a query into a table. The "OVERWRITE"
keyword is used to replace the data in a table; in Hive v0.8.0 or later, INSERT INTO (without
OVERWRITE) appends the data to the table instead of replacing it.
All the partitions of the table students_data can be dynamically inserted by setting the
properties below:
Code:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
The CREATE TABLE ... AS SELECT (CTAS) clause will also create a table, with the schema
taken from the SELECT clause.
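A minimal CTAS sketch, with hypothetical table names:

```sql
-- The schema of top_students is inferred from the SELECT clause
CREATE TABLE top_students AS
SELECT roll_number, grade
FROM students
WHERE grade = 'A';
```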
Code:
SELECT a.roll_number, b.class, b.section
FROM passed_students a JOIN students b
ON a.roll_number = b.roll_number;
This will return the class and section of all the roll numbers who have passed (the table names
here are illustrative). Using a left join instead will return the "grade" for only the passed
students and NULL for the failed ones.
Code:
SELECT a.roll_number, b.grade
FROM students a LEFT JOIN passed_students b
ON a.roll_number = b.roll_number;
UNION ALL and UNION are also used to append the data present in two tables. However, a few
things need to be taken care of when doing so; for example, the schemas of both tables should
be the same. UNION appends the tables and returns unique records, while UNION ALL returns
all the records, including duplicates.
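A minimal sketch of the two set operations, assuming two tables with identical schemas (the table names are hypothetical):

```sql
-- UNION ALL keeps duplicates; UNION returns unique records only
SELECT roll_number FROM students_2023
UNION ALL
SELECT roll_number FROM students_2024;
```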
4. Ordering a table
The ORDER BY clause enables total ordering of the data set by passing all data through one reducer.
This may take a long time for large tables, so the SORT BY clause can be used instead to
sort the data within each reducer rather than globally.
Code:
SELECT customer_id, spends FROM customer ORDER BY spends DESC LIMIT 100;
This will return the top 100 customers with the highest spends.
5. Aggregating a table
Aggregate functions perform a computation over many rows and return a single value. These
include count(col), sum(col), avg(col), min(col), max(col), count(DISTINCT col), and
collect_set(col), which returns a set of the column's elements with duplicates removed.
The set property which helps in improving the performance of aggregation is hive.map.aggr =
true; it enables partial aggregation in the map phase.
Example:
Code:
SELECT year(date_yy), AVG(spends) FROM customer_spends WHERE merchant = 'Retail' GROUP BY year(date_yy);
The HAVING clause is used to restrict the output from a GROUP BY; it expresses directly what would otherwise require a subquery.
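For example, a HAVING clause can restrict groups directly (the table and column names are hypothetical):

```sql
-- Keep only customers whose total spend exceeds 1000
SELECT customer_id, SUM(spends) AS total_spends
FROM customer_spends
GROUP BY customer_id
HAVING SUM(spends) > 1000;
```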
6. Conditional statements
The CASE…WHEN…THEN clause is similar to if-else statements and performs a conditional
evaluation.
For example:
Code:
SELECT customer,
CASE WHEN percentage >= 40 AND percentage < 80 THEN 'Average'
     ELSE 'Excellent'
END AS category
FROM student_scores;
(The table and column names here are illustrative.)
A. Hive supports the Hive Query Language (HQL). HQL is very similar to SQL. It
supports the usual INSERT, UPDATE, DELETE, and MERGE SQL statements to query data in
Hive.
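As a sketch, those DML statements look like standard SQL; note that in Hive, UPDATE, DELETE, and MERGE require the table to be transactional (ACID), and the table name below is hypothetical:

```sql
-- Assumes employees is a transactional (ACID) table
UPDATE employees SET salary = salary * 1.10 WHERE id = 7;
DELETE FROM employees WHERE id = 9;
```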
A. Hive is built on top of Apache Hadoop. This makes it an apt tool for analyzing Big
data. It also supports various types of connectors, making it easier for developers to
query Hive data using different programming languages.
A. Hive is a data warehousing system that provides SQL-like querying language called
HiveQL, while MapReduce is a programming model and software framework used for
processing large datasets in a distributed computing environment. Hive also provides a
schema for data stored in Hadoop Distributed File System (HDFS), making it easier to
manage and analyze large datasets.
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-programmers. Pig makes
this process easy: in Pig, the queries are converted to MapReduce internally.
2) Optimization opportunities
The way tasks are encoded permits the system to optimize their execution automatically, allowing
the user to focus on semantics rather than efficiency.
3) Extensibility
Users can write user-defined functions containing their own logic to execute over the data
set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators, such as sort, filter, and join.
Hive does not allow nested data types, whereas Pig provides nested data types like tuple, bag, and map.
ZooKeeper
ZooKeeper acts as the bridge for communication across the HBase architecture. It is
responsible for keeping track of all the Region Servers and the regions that are within them.
Monitoring which Region Servers and which HMaster are active, and which have failed, is also part of
ZooKeeper's duties. When it finds that a Region Server has failed, it triggers the HMaster to take
the necessary actions. On the other hand, if the HMaster itself fails, ZooKeeper triggers the inactive
HMaster, which becomes active after the alert. Every user, and even the HMaster, needs to go through
ZooKeeper to access Region Servers and the data within them. ZooKeeper stores the .META file, which
contains a list of all the Region Servers. ZooKeeper's responsibilities include:
• Establishing communication across the Hadoop cluster
• Maintaining configuration information
• Tracking Region Server and HMaster failure
• Maintaining Region Server information
HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns
regions to Region Servers and handles DDL (create, delete table) operations. It monitors all Region
Server instances present in the cluster. In a distributed environment, the Master runs several
background threads. HMaster has many features, such as controlling load balancing, failover, etc.
Region Server –
HBase tables are divided horizontally by row-key range into Regions. Regions are the basic
building elements of an HBase cluster: they hold the distribution of tables and are composed of
column families. A Region Server runs on an HDFS DataNode present in the Hadoop cluster.
A Region Server is responsible for several things, such as handling, managing, and executing
reads and writes of HBase operations on its set of regions. The default size of a region is
256 MB.
Zookeeper –
It is like a coordinator in HBase. It provides services such as maintaining configuration information,
naming, providing distributed synchronization, server failure notification, etc. Clients
communicate with Region Servers via ZooKeeper.
• ZooKeeper is the centralized service intended to support configuration information
and naming; it provides distributed synchronization and group services.
InfoSphere platform
InfoSphere is a comprehensive information integration platform that includes means of data
storage and analysis, an information integration tool, a master data management tool,
lifecycle management tools, and means of protecting and ensuring the confidentiality of data.
InfoSphere makes the application development process more effective, allowing
organizations to save time, reduce integration costs, and increase the quality of information.
The BigInsights product, being part of the IBM Big Data platform, contains integration points
with its other components, including storage systems, data integration, management
mechanisms, and third-party tools for data analysis. BigInsights can also integrate with
the InfoSphere Streams platform.