Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each unit:
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
What is Hive: Apache Hive is a data warehouse system built to work on Hadoop. It is used for querying and managing large datasets residing in distributed storage. Before becoming an open-source project of Apache Hadoop, Hive originated at Facebook. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL). Hive is popular because its tables are similar to tables in a relational database. All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system, so you need to install a Linux-flavored OS.
Uses of Hive:
1. Apache Hive enables SQL-style analysis over data in distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It imposes structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.
Limitations of Hive:
• Hive is not designed for online transaction processing (OLTP); it is only used for online analytical processing (OLAP).
• Hive supports overwriting or appending data, but not row-level updates and deletes.
• In Hive, subqueries are not supported.
Components of Hive:
1. Metastore: Hive stores the schema of Hive tables in the Hive Metastore. The Metastore holds all the information about the tables and partitions in the warehouse. By default, the Metastore runs in the same process as the Hive service, and the default Metastore database is Derby.
2. SerDe: A Serializer/Deserializer gives Hive instructions on how to process a record.
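For example, a table can be declared with an explicit SerDe. A minimal sketch (the table and column names are illustrative; `OpenCSVSerde` ships with standard Hive distributions, assuming it is available in your build):

```sql
-- Sketch: declare a table whose rows are parsed by a CSV SerDe
CREATE TABLE csv_events (
  event_id   INT,
  event_name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
```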
Hive Commands: Hive commands are categorized into data definition Language and Data Manipulation
Language.
Data Definition Language (DDL): DDL statements are used to build and modify tables and other objects in the database.
Go to the Hive shell by giving the command sudo hive, and enter the command 'create database <database name>;' to create a new database in Hive.
To list the databases in the Hive warehouse, enter the command 'show databases;'.
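A minimal DDL session along these lines (the database name is illustrative):

```sql
-- Create a new database and confirm it exists
CREATE DATABASE retail;
SHOW DATABASES;
-- Switch to it before creating tables
USE retail;
```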
Copy the input data to HDFS from the local file system by using the hadoop fs -copyFromLocal <local path> <HDFS path> command.
Data Manipulation Language (DML): DML statements are used to retrieve, store, modify, delete, insert
and update data in the database.
Syntax: LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <table name>;
The LOAD operation is used to move data into the corresponding Hive table. If the keyword LOCAL is specified, the load command takes a local file system path; if it is not specified, the HDFS path of the file must be used.
Here is an example of the LOAD DATA LOCAL command.
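For instance, a sketch of loading data into a hypothetical employee table (the file paths and table name are illustrative):

```sql
-- Move rows from a local text file into the employee table
LOAD DATA LOCAL INPATH '/home/user/employee.txt' INTO TABLE employee;

-- Loading from HDFS instead: omit the LOCAL keyword
LOAD DATA INPATH '/user/hive/input/employee.txt' INTO TABLE employee;
```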
After loading the data into the Hive table, we can apply Data Manipulation statements or aggregate functions to retrieve the data.
The COUNT aggregate function is used to count the total number of records in a table.
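A sketch of the COUNT aggregate over the same hypothetical employee table:

```sql
-- Total number of records in the employee table
SELECT COUNT(*) FROM employee;
```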
SELECT Statement: The SELECT statement is used to retrieve data from a table.
Example: Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000.
+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+
The following query retrieves the employee details using the above scenario:
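The original shows the query as a screenshot; reconstructed from the described scenario, it would be:

```sql
-- Employees earning more than Rs 30000
SELECT * FROM employee WHERE salary > 30000;
```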
On successful execution of the query, you will see the following output:
+------+-------------+--------+-------------------+------+
| ID   | Name        | Salary | Designation       | Dept |
+------+-------------+--------+-------------------+------+
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Krian       | 40000  | Hr Admin          | HR   |
+------+-------------+--------+-------------------+------+
WHERE Clause: The WHERE clause is used to filter the result set by using predicate operators and logical operators. Functions can also be used to compute the condition.
List of Predicate Operators
List of Logical Operators
List of Functions
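As an illustration, a WHERE clause combining a predicate operator, a logical operator, and a function over the employee table from the example above:

```sql
-- Filter using a function (upper), predicates (=, >=) and a logical operator (AND)
SELECT name, salary
FROM employee
WHERE upper(dept) = 'TP' AND salary >= 40000;
```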
GROUP BY Clause: The GROUP BY clause is frequently used with aggregate functions to group the result set by one or more columns and apply an aggregate function to each group. Functions can also be used to compute the grouping key.
List of Aggregate Functions
List of Functions
Here’s an example query that groups and counts by category.
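The example query is not reproduced in the original; assuming a hypothetical products table with a category column, it would look like this:

```sql
-- Count products per category
SELECT category, count(1) AS cnt
FROM products
GROUP BY category;
```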
HAVING Clause: The HAVING clause lets you filter the groups produced by GROUP BY, by applying predicate operators to each group.
List of Predicate Operators
An example query that groups and counts by category, and then retrieves only counts greater than 10:
SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10;
Hive Scripts
Hive scripts are used to execute a set of Hive commands collectively. This helps in reducing the time and effort invested in writing and executing each command manually. This section is a step-by-step guide to writing your first Hive script and executing it.
Hive supports scripting from version 0.10.0 onward. The Cloudera Distribution for Hadoop (CDH4) quick-start VM comes with Hive 0.10.0 pre-installed (the CDH3 demo VM uses Hive 0.9.0 and hence cannot run Hive scripts).
Open a terminal in your Cloudera CDH4 distribution and create a Hive script file with a text editor. The Hive script file should be saved with a .sql extension to enable execution.
Edit the file and write the Hive commands that will be executed using this script.
In this sample script, we will create a table, describe it, load data into the table, and retrieve the data from this table.
command: create table product (productid int, productname string, price float, category string)
row format delimited fields terminated by ',';
Here {productid, productname, price, category} are the columns in the 'product' table.
"Fields terminated by ','" indicates that the columns in the input file are separated by the ',' delimiter. You can use other delimiters also. For example, the records in an input file can be separated by a newline ('\n') character.
To load the data into the table, create an input file containing the records that need to be inserted into the table.
Create few records in the input text file as shown in the figure.
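Putting the steps together, sample.sql would contain something like the following (the load path is illustrative):

```sql
-- sample.sql: create the table, inspect it, load data, and read it back
CREATE TABLE product (
  productid   INT,
  productname STRING,
  price       FLOAT,
  category    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

DESCRIBE product;

LOAD DATA LOCAL INPATH '/home/cloudera/input.txt' INTO TABLE product;

SELECT * FROM product;
```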
The SELECT command at the end of the script will retrieve all the records from the table 'product'.
Save the sample.sql file and close the editor. You are now ready to execute your first Hive script with the command hive -f sample.sql. While executing the script, make sure that you give the entire path of the script location. As the sample script is present in the current directory, the complete path of the script is not needed here.
On running the script, all the commands should execute successfully.
JOIN:
JOIN is a clause that is used for combining specific fields from two tables by using values common to
each one. It is used to combine records from two or more tables in the database.
Syntax
join_table:
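The grammar is omitted in the original; a sketch reconstructed from the Hive language manual:

```
join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition]
```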
Example
We will use the following two tables in this chapter. Consider the following table named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Consider another table ORDERS as follows:
+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
There are different types of joins given as follows:
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. JOIN is the same as INNER JOIN in SQL. A JOIN condition is typically specified using the primary keys and foreign keys of the tables.
The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
LEFT OUTER JOIN
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table. This means, if the ON clause matches 0 (zero) records in the right table, the JOIN still returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right table, or
NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+----+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+----+----------+--------+---------------------+
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or
NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and ORDER tables.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
FULL OUTER JOIN
The HiveQL FULL OUTER JOIN combines the records of both the left and the right tables. The joined table contains all the records from both tables, filling in NULL values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between CUSTOMER and ORDER tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+------+----------+--------+---------------------+