

Unit IV - Hive Architecture


What is Hive?
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and it makes querying and analysis easy.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and
developed it further as an open-source project under the name Apache Hive. It is used by many companies.
For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
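As a quick illustration of the SQL-like language mentioned above, a HiveQL query such as the following (the table and column names here are hypothetical) runs as a batch job over data stored in HDFS:

```sql
-- Hypothetical employee table; HiveQL reads like standard SQL
SELECT dept, AVG(salary) AS avg_salary
FROM employee
GROUP BY dept;
```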

Architecture of Hive
The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each unit:

Unit Name: Operation

User Interface: Hive is data warehouse infrastructure software that enables interaction between the
user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line,
and Hive HDInsight (on Windows Server).

Meta Store: Hive chooses a respective database server to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL and is used for querying the schema information in
the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce
program: instead of writing MapReduce code in Java, we can write a query for the MapReduce job and
process it.

Execution Engine: the conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution
Engine. The execution engine processes the query and generates the same results as MapReduce. It
uses the flavor of MapReduce.

HDFS or HBASE: the Hadoop Distributed File System or HBase is the data storage technique used to
store the data in the file system.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:

Step No. Operation


1 Execute Query
The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database
driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and
build the query plan, i.e., the requirement of the query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of executing the job is a MapReduce job. The execution engine sends the
job to the JobTracker, which resides in the Name node, and the JobTracker assigns this job to
TaskTrackers, which reside in Data nodes. Here, the query executes the MapReduce job.
7.1 Metadata Ops
Meanwhile, during execution, the execution engine can perform metadata operations with the
Metastore.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.

Comparison with Traditional Database:

1. Hive: enforces schema on READ; it does not verify the schema while the data is being loaded.
   Traditional database: an RDBMS enforces schema on WRITE; the table schema is enforced at data load
   time, and data that does not conform to the schema is rejected.

2. Hive: very easily scalable at low cost.
   Traditional database: not as scalable; scaling up is costly.

3. Hive: based on the Hadoop notion of write once, read many times.
   Traditional database: we can read and write many times.

4. Hive: record-level updates are not possible, because Hive was built to operate over HDFS data
   using MapReduce, where full-table scans are the norm and a table update is achieved by
   transforming the data into a new table.
   Traditional database: record-level updates, insertions, deletes, transactions, and indexes are
   possible.

5. Hive: suited for data warehouse applications, where relatively static data is analyzed, fast
   response times are not required, and the data is not changing rapidly.
   Traditional database: an RDBMS is best suited for dynamic data analysis and for workloads where
   fast responses are expected.

6. Hive: OLTP (On-line Transaction Processing) is not yet supported, but OLAP (On-line Analytical
   Processing) is. To overcome this limitation, HBase is being integrated with Hive to support
   record-level operations.
   Traditional database: both OLTP and OLAP are supported in an RDBMS.

7. Hive: can handle hundreds of petabytes very easily.
   Traditional database: the maximum data size allowed is typically in the tens of terabytes.
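Schema-on-read (the first point of the comparison) is easiest to see with an external table: the DDL below merely projects a schema onto files that already exist in HDFS, and nothing is validated or rejected at load time. The path, table, and column names here are illustrative.

```sql
-- Schema is declared, not enforced: files under LOCATION are only
-- interpreted against this schema when a query reads them.
CREATE EXTERNAL TABLE sales (id INT, amount FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/sales';
```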

What is Hive: Apache Hive is a data warehouse system built to work on Hadoop. It is used for
querying and managing large datasets residing in distributed storage. Before becoming an open-source
project of Apache Hadoop, Hive originated at Facebook. It provides a mechanism to project structure
onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive is
used because the tables in Hive are similar to tables in a relational database. All Hadoop sub-projects such
as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux-
flavored OS.
Uses of Hive:
1. Apache Hive enables SQL-like querying of data in Hadoop's distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It imposes structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in
other data storage systems such as Apache HBase.
Limitations of Hive:
• Hive is not designed for Online Transaction Processing (OLTP); it is used only for Online Analytical
Processing (OLAP).
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, subqueries are not supported.
Components of Hive:

1. Metastore: Hive stores the schema of the Hive tables in the Hive Metastore. The Metastore is used to hold
all the information about the tables and partitions that are in the warehouse. By default, the Metastore
runs in the same process as the Hive service, and the default Metastore database is Derby.
2. SerDe: a Serializer/Deserializer gives instructions to Hive on how to process a record.
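As an illustration of declaring a SerDe explicitly, a table can name the class that parses each record. The `OpenCSVSerde` below is bundled with Hive 0.14 and later; the table and column names are illustrative.

```sql
-- The SerDe, not Hive itself, decides how each stored record is
-- split into the declared columns.
CREATE TABLE csv_products (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
```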

Hive Commands: Hive commands are categorized into Data Definition Language and Data Manipulation
Language.
Data Definition Language (DDL): DDL statements are used to build and modify the tables and other
objects in the database.

Example : CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE Statements.

Go to the Hive shell by giving the command sudo hive, and enter the command ‘create database
<database name>’ to create a new database in Hive.

To list the databases in the Hive warehouse, enter the command ‘show databases’.

The command to use a database is USE <database name>.

Copy the input data to HDFS from the local file system by using the copyFromLocal command.

The following command creates a table in the location “/user/hive/warehouse/retail.db”.

Describe provides information about the schema of the table.
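The commands in this section were shown as screenshots in the original notes. A sketch of the sequence they describe might look as follows; the table name and columns are assumed for illustration, and the database follows the retail.db example above:

```sql
CREATE DATABASE retail;
SHOW DATABASES;
USE retail;
-- Table and columns are illustrative; LOCATION matches the path above
CREATE TABLE txnrecords (txnno INT, custno INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/retail.db/txnrecords';
DESCRIBE txnrecords;
```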

Data Manipulation Language (DML): DML statements are used to retrieve, store, modify, delete, insert
and update data in the database.

Example : LOAD, INSERT Statements.

Syntax : LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <tablename>;

The LOAD operation is used to move data into the corresponding Hive table. If the keyword LOCAL is
specified, the load command takes a local file system path. If the keyword LOCAL is not specified,
we have to use the HDFS path of the file.

Here are some examples of the LOAD DATA LOCAL command.
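The original examples are screenshots; a sketch of equivalent commands would be as follows. The file paths and table name are illustrative:

```sql
-- From the local file system (the file is copied into the warehouse)
LOAD DATA LOCAL INPATH '/home/cloudera/txns.csv' INTO TABLE txnrecords;

-- From HDFS (the file is moved into the table's directory)
LOAD DATA INPATH '/user/cloudera/txns.csv' INTO TABLE txnrecords;
```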

After loading the data into the Hive table, we can apply Data Manipulation statements or aggregate
functions to retrieve the data.

Example to count the number of records (Aggregate Function)

The COUNT aggregate function is used to count the total number of records in a table.
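A minimal count query, assuming the illustrative table name used earlier:

```sql
-- Total number of records in the table
SELECT COUNT(*) FROM txnrecords;
```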

SELECT Statement: SELECT statement is used to retrieve the data from a table.

Syntax: The syntax for the SELECT statement is:


SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
SELECT is the projection operator in SQL. It includes:
 SELECT scans the table specified by the FROM clause
 WHERE gives the condition of what to filter
 GROUP BY gives a list of columns which specify how to aggregate the records
 CLUSTER BY, DISTRIBUTE BY, SORT BY specify the sort order and algorithm
 LIMIT specifies the maximum number of records to retrieve

Example: Assume we have the employee table as given below, with fields named Id, Name, Salary,
Designation, and Dept. Generate a query to retrieve the employee details who earn a salary of more than
Rs 30000.
----------------------------------------------------------------------------------
| ID | Name | Salary | Designation | Dept |
-----------------------------------------------------------------------------------
| 1201 | Gopal | 45000 | Technical manager | TP |
| 1202 | Manisha | 45000 | Proofreader | PR |
| 1203 | Masthanvali | 40000 | Technical writer | TP |
| 1204 | Krian | 40000 | Hr Admin | HR |
| 1205 | Kranthi | 30000 | Op Admin | Admin |
----------------------------------------------------------------------------------

The following query retrieves the employee details using the above scenario:

hive> SELECT * FROM employee WHERE salary>30000;

On successful execution of the query, you will see the following output:
----------------------------------------------------------------------------------
| ID | Name | Salary | Designation | Dept |
-----------------------------------------------------------------------------------
| 1201 | Gopal | 45000 | Technical manager | TP |
| 1202 | Manisha | 45000 | Proofreader | PR |
| 1203 | Masthanvali | 40000 | Technical writer | TP |
| 1204 | Krian | 40000 | Hr Admin | HR |
----------------------------------------------------------------------------------

WHERE Clauses: WHERE clause is used to filter the result set by using predicate operators and logical
operators. Functions can also be used to compute the condition.
 List of Predicate Operators
 List of Logical Operators
 List of Functions

Here’s an example query that uses a WHERE clause.

SELECT name FROM products WHERE name = 'stone of jordan';

GROUP BY Clauses: GROUP BY clause is frequently used with aggregate functions, to group the
result set by columns and apply aggregate functions over each group. Functions can also be used to
compute the grouping key.
 List of Aggregate Functions
 List of Functions
Here’s an example query that groups and counts by category.

SELECT category, count(1) FROM products GROUP BY category;

HAVING Clause: the HAVING clause lets you filter the groups produced by GROUP BY, by
applying predicate operators to each group.
 List of Predicate Operators

An example query that groups and counts by category, and then retrieves only the groups with count > 10:
SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10;

INSERT Statement: It is used to insert data into a table.

The syntax is:



INSERT OVERWRITE TABLE tablename1 select_statement1 FROM from_statement;


INSERT INTO TABLE tablename1 select_statement1 FROM from_statement;
WITH tablename1 AS (select_statement1 FROM from_statement) INSERT OVERWRITE/INTO
TABLE tablename2 select_statement2 FROM from_statement2;
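A concrete (and purely illustrative) use of the first two forms, copying rows from one table into another:

```sql
-- Replace the contents of high_value with the selected rows
INSERT OVERWRITE TABLE high_value
SELECT * FROM txnrecords WHERE amount > 1000;

-- Append to high_value instead of overwriting it
INSERT INTO TABLE high_value
SELECT * FROM txnrecords WHERE amount > 5000;
```

The difference between the two is that OVERWRITE first deletes the table's existing data, while INTO appends to it.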

MapReduce scripts

Hive scripts are used to execute a set of Hive commands collectively. This helps reduce the time and
effort spent in writing and executing each command manually. This section is a step-by-step guide to
writing your first Hive script and executing it.

Hive supports scripting from version 0.10.0 onwards. The Cloudera Distribution for Hadoop
(CDH4) quick-start VM comes with Hive 0.10.0 pre-installed (the CDH3 demo VM uses Hive 0.9.0 and hence
cannot run Hive scripts).

Execute the following steps to create your first Hive Script:

Step 1: Writing a script

Open a terminal in your Cloudera CDH4 distribution and give the following command to create a Hive
script.

command: gedit sample.sql

The Hive script file should be saved with the .sql extension to enable execution.

Edit the file and write a few Hive commands that will be executed using this script.

In this sample script, we will create a table, describe it, load data into the table, and retrieve data
from the table.

 Create a table ‘product’ in Hive:

command: create table product (productid int, productname string, price float, category string)
row format delimited fields terminated by ‘,’ ;

Here {productid, productname, price, category} are the columns in the ‘product’ table.

“fields terminated by ‘,’” indicates that the columns in the input file are separated by the ‘,’
delimiter. You can use other delimiters as well. For example, the records in an input file can be separated by
a newline (‘\n’) character.

 Describe the Table :

command: describe product;



 Load the data into the Table:

To load the data into the table, create an input file that contains the records to be inserted
into the table.

command: sudo gedit input.txt

Create a few records in the input text file as shown in the figure.

Command: load data local inpath ‘/home/cloudera/input.txt’ into table product;

 Retrieving the data:

To retrieve the data, use the SELECT command.

command: select * from product;

The above command will retrieve all the records from the table ‘product’.

The script should look as shown in the following image:
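Since the image is not reproduced here, sample.sql would look roughly like this, assembled from the commands described above:

```sql
-- sample.sql: create, describe, load, and query the product table
create table product (productid int, productname string, price float, category string)
row format delimited fields terminated by ',';
describe product;
load data local inpath '/home/cloudera/input.txt' into table product;
select * from product;
```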

Save the sample.sql file and close the editor. You are now ready to execute your first Hive script.

Step 2: Execute the Hive Script

Execute the hive script using the following command:

Command: hive -f /home/cloudera/sample.sql

While executing the script, make sure that you give the entire path of the script location; if the
script is present in the current directory, the file name alone is enough.

The following image shows that all the commands were executed successfully.


JOIN:
JOIN is a clause that is used for combining specific fields from two tables by using values common to
each one. It is used to combine records from two or more tables in the database.

Syntax

join_table:

table_reference JOIN table_factor [join_condition]


| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference
join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition]

Example

We will use the following two tables in this chapter. Consider the following table named
CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Consider another table ORDERS as follows:
+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
There are different types of joins given as follows:

 JOIN
 LEFT OUTER JOIN
 RIGHT OUTER JOIN
 FULL OUTER JOIN

JOIN

The JOIN clause is used to combine and retrieve records from multiple tables. JOIN is the same as INNER
JOIN in SQL. A JOIN condition is typically expressed using the primary keys and foreign keys of the tables.
The following query executes a JOIN on the CUSTOMERS and ORDERS tables, and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+

LEFT OUTER JOIN

The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in
the right table. This means, if the ON clause matches 0 (zero) records in the right table, the JOIN still
returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or
NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+

RIGHT OUTER JOIN

The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches
in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in
the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or
NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables.

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+

FULL OUTER JOIN

The HiveQL FULL OUTER JOIN combines the records of both the left and the right tables. The joined
table contains all the records from both tables, filling in NULL values for missing matches on
either side.
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
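The join syntax shown earlier also lists LEFT SEMI JOIN, which is not demonstrated above. It returns rows from the left table that have at least one match in the right table (an efficient alternative to IN/EXISTS, which older Hive versions did not support); only left-table columns may appear in the SELECT list. Using the same tables, a sketch:

```sql
-- Customers that have placed at least one order; each matching
-- customer appears only once, regardless of how many orders match
SELECT c.ID, c.NAME
FROM CUSTOMERS c
LEFT SEMI JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
```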
