
Chapter 4: Data Access Components

Hive

1
References
1. Achari, Shiva. Hadoop Essentials. Packt Publishing, 2015. ProQuest Ebook Central, https://ebookcentral.proquest.com/lib/shctom/detail.action?docID=2039889
2. https://www.edureka.co/blog/hive-tutorial/
3. https://www.guru99.com/introduction-hive.html
4. https://cwiki.apache.org/confluence/display/Hive/Tutorial
5. https://www.simplilearn.com/tutorials/hadoop-tutorial/hive
2
What is Hive?
 Hive is a data warehousing infrastructure built on Apache Hadoop that provides an SQL-like language for
querying and analyzing Big Data.
 Hive provides a mechanism to project structure onto the data and query it using an SQL-like language
called HiveQL.
 Hive uses MapReduce for processing and HDFS for storage and retrieval of data.
 Hive is used for analyzing structured and semi-structured data.
 SQL commands in Hive are written in HiveQL.
 HiveQL queries are converted into MapReduce jobs by the Hive compiler.
 Apache Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User
Defined Functions (UDFs).
 Hive is not designed for online transaction processing; it is best used for traditional data warehousing tasks.

3
Reference #1: Page 83 & Reference #2 and #4
Advantage of Hive
 It is an efficient ETL tool
 Provides querying and analysis capabilities
 HiveQL is similar to SQL, and thus easy to understand
 Performs analytics on large datasets and works well for complex queries
 Reduces the need to write complex MapReduce programs to process data using
Hadoop

4
Where Not to use Hive?
 When the data to be processed is less than a GB
 When a schema cannot easily be determined for the data, or the data has no schema
 When a response is needed in seconds, i.e., for low-latency applications
 When the problem can be solved with an RDBMS

5
Hive Features
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-type query language called HiveQL or HQL.
 It is fast, scalable, and extensible.

6
Hive architecture
 Hive architecture comprises the following components:

Image Source: https://www.simplilearn.com/tutorials/hadoop-tutorial/hive


7
Hive architecture
 Hive Clients:
 Hive provides different drivers for communication with different types of
applications.
 For Thrift-based applications, it provides a Thrift client for communication.

 For Java applications, it provides JDBC drivers.

 For all other types of applications, it provides ODBC drivers. These clients and
drivers in turn communicate with the Hive server in the Hive services.

Reference #3
8
Hive architecture
 Hive Services:
 Hive CLI (Command Line Interface): This is the default shell provided by
the Hive where you can execute your Hive queries and commands directly.
 Apache Hive Web Interfaces: Apart from the command line interface, Hive
also provides a web-based GUI for executing Hive queries and commands.
 Hive Server: The Hive server is built on Apache Thrift and is therefore also
referred to as the Thrift Server; it allows different clients to submit requests to
Hive and retrieve the results.

Reference #3
9
Hive architecture
 Hive Services:
 Apache Hive Driver: It is responsible for receiving the queries submitted through the
CLI, the web UI, Thrift, ODBC or JDBC interfaces by a client.
 Then, the driver passes the query to the compiler where parsing, type checking and
semantic analysis takes place with the help of schema present in the metastore.
 In the next step, an optimized logical plan is generated in the form of a DAG (Directed
Acyclic Graph) of map-reduce tasks and HDFS tasks.
 Finally, the execution engine executes these tasks in the order of their dependencies,
using Hadoop.

Reference #3
10
Hive architecture
 Hive Services:
 Metastore: The Metastore acts as a central repository for storing all the Hive metadata
information.
 The Metastore stores all the details about tables, partitions, schemas, columns, types,
and so on, which are required for read/write operations on the data present in HDFS.
 The Metastore is critical for Hive: without it, the structural design details cannot be
retrieved and the data cannot be accessed. Hence, the Metastore is backed up regularly.
 Hive ensures that the Metastore is not directly accessed by the Mappers and Reducers of a job;
instead, the needed information is passed through an XML plan that is generated by the compiler
and contains the information required at runtime.
Reference #1 Page 84 and Reference #3
11
Job Execution in Hive

Image Source -Reference #3


12
Job Execution in Hive
1. The query is executed from the UI (User Interface).
2. The driver interacts with the compiler to get the plan. (Here, "plan" refers to the query execution
process and the gathering of its related metadata.)
3. The compiler creates the plan for the job to be executed, and communicates with the Metastore
to request the metadata.
4. The Metastore sends the metadata information back to the compiler.
5. The compiler communicates the proposed plan for executing the query to the driver.
6. The driver sends the execution plan to the execution engine.
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. It executes
the plan step by step, completing the dependent tasks for every task in the plan. The
results of tasks are stored in a temporary location, and in the final step the data is moved to the
desired location.

Reference #3
13
Job Execution in Hive
 The EE first contacts the NameNode and then the DataNodes to get the values stored in the tables.
 The EE fetches the desired records from the DataNodes. The actual table data resides only in the
DataNodes; from the NameNode the EE fetches only the metadata information for the
query.
 It collects the actual data related to the query from the DataNodes.
 The EE communicates bi-directionally with the Metastore present in Hive to
perform DDL (Data Definition Language) operations. Here, DDL operations such as CREATE,
DROP and ALTER on tables and databases are done. The Metastore stores only information
about database names, table names and column names.
 The EE in turn communicates with Hadoop daemons such as the NameNode,
DataNodes, and JobTracker to execute the query on top of the Hadoop file system.
8. The driver fetches the results.
9. The results are sent to the execution engine; once the results are fetched from the DataNodes, the EE
sends them back to the driver and on to the UI (front end).

14
Data Units in Hive
 Hive data is organized into:
 Databases: Namespaces that avoid naming conflicts for tables, views, partitions,
columns, and so on. Databases can also be used to enforce security for a user or group
of users.
 Tables: Homogeneous units of data which have the same schema.

 Partitions: Each Table can have one or more partition Keys which determines how the
data is stored.
 Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based
on the value of a hash function of some column of the Table.
Reference #4
15
Data Units in Hive

Image Source - Reference #5


16
Data Units in Hive

Image Source - https://data-flair.training/blogs/hive-data-model/


17
Tables in Hive
 Tables in Hive are the same as the tables present in a relational database.
 There are two types of tables in Hive:
 Managed Table :
 Hive is responsible for managing the data of a managed table.
 If we load data from a file present in HDFS into a Hive managed table and then issue a
DROP command on it, the table along with its metadata is deleted. The data
belonging to the dropped managed table then no longer exists anywhere in HDFS and cannot
be retrieved by any means.
 To move the data, issue the LOAD command; it moves the data from the HDFS file location to the Hive
warehouse directory.
 The default path of the warehouse directory is set to /user/hive/warehouse.
 The data of a Hive table resides in warehouse_directory/table_name (HDFS).
Reference #2
18
Tables in Hive
 External Table :
 Hive is not responsible for managing the data.
 In this case, when we issue the LOAD command, Hive moves the data into its
warehouse directory.
 Then, Hive creates the metadata information for the external table.
 Now, if we issue a DROP command on the external table, only the metadata information
regarding the external table is deleted.
 Therefore, the data of that external table can still be retrieved from the warehouse
directory using HDFS commands.

Reference #2
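The difference in DROP behaviour above can be sketched as follows (table names and HDFS paths are illustrative, not from the source):

```sql
-- Managed table: Hive owns the data; DROP deletes data and metadata
CREATE TABLE student_managed (sid INT, sname STRING);
LOAD DATA INPATH '/data/stu.txt' INTO TABLE student_managed;
DROP TABLE student_managed;   -- the data under the warehouse directory is deleted too

-- External table: Hive owns only the metadata; DROP leaves the files in HDFS
CREATE EXTERNAL TABLE student_ext (sid INT, sname STRING)
LOCATION '/data/student_ext/';
DROP TABLE student_ext;       -- files under /data/student_ext/ survive
```

External tables are therefore the safer choice when the same HDFS files are shared with other tools.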
19
Modes of Hive
 Hive operates in two modes depending on the number and size of data nodes.
They are:
 Local Mode - Used when Hadoop has one data node, and the amount of data is
small. Here, the processing will be very fast on smaller datasets, which are
present in local machines.
 MapReduce Mode - Used when the data in Hadoop is spread across multiple
data nodes. Processing large datasets is more efficient in this mode.

Reference #5
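As a rough illustration, Hive can be told to run small jobs in local mode via session settings; the property names below are standard Hive configuration properties, but the thresholds are illustrative and defaults vary by version:

```sql
-- Let Hive automatically run sufficiently small queries as local-mode jobs
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- max total input size for local mode (128 MB)
SET hive.exec.mode.local.auto.input.files.max=4;         -- max number of input files for local mode
```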
20
Hive vs RDBMS
 Hive enforces schema on read; an RDBMS enforces schema on write.
 Hive data size is in petabytes; RDBMS data size is in terabytes.
 Hive is based on the notion of write once, read many times; an RDBMS is based on the notion of
reading and writing many times.
 Hive resembles a traditional database by supporting SQL, but it is not a database; it is a
data warehouse. An RDBMS is a type of database management system based on the relational model of
data.
 Hive is easily scalable at low cost; an RDBMS is not scalable at low cost.

Reference #5
21
Hive Query Language : Data Types (Numeric Type)
 TINYINT (1 Byte Signed Integer)
 SMALLINT (2 Byte Signed Integer)
 INT/INTEGER (4 Byte Signed Integer)
 BIGINT (8 Byte Signed Integer)
 FLOAT (4 Byte Single Precision Floating Point Number)
 DOUBLE (8 Byte Double Precision Floating Point Number)
 DECIMAL
 NUMERIC (Same as Decimal)
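For illustration, a table declaration exercising these numeric types might look like this (the table and column names are made up):

```sql
CREATE TABLE measurements (
  sensor_id  INT,            -- 4-byte signed integer
  reading_no BIGINT,         -- 8-byte signed integer
  flag       TINYINT,        -- 1-byte signed integer
  room       SMALLINT,       -- 2-byte signed integer
  temp_c     FLOAT,          -- single-precision floating point
  pressure   DOUBLE,         -- double-precision floating point
  cost       DECIMAL(10,2)   -- exact decimal with precision and scale
);
```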

22
Hive Query Language : Data Types (Date/Time Type)
 TIMESTAMP yyyy-mm-dd hh:mm:ss[.f…]
 DATE YYYY-MM-DD
 INTERVAL e.g., INTERVAL '1' DAY
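A small sketch of these types in use (exact INTERVAL arithmetic support varies with the Hive version):

```sql
SELECT CURRENT_DATE                       AS today,     -- DATE
       CURRENT_TIMESTAMP                  AS now,       -- TIMESTAMP
       CURRENT_DATE + INTERVAL '1' DAY    AS tomorrow;  -- date arithmetic with INTERVAL
```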

23
Hive Query Language : Data Types (String Type)

 STRING supports C-style escaping within the strings

 VARCHAR requires a length specifier (between 1 and 65535)
 CHAR similar to VARCHAR but fixed-length

24
Hive Query Language : Data Types (Complex Type)

 Arrays - An ordered collection of elements of the same type. The syntax is:
array<data_type>
 Maps - A collection of key-value pairs, and the syntax is
map<primitive_type, data_type>
 Structs - A collection of named fields, each with an optional comment. Syntax:
struct<col_name : data_type [COMMENT col_comment], …>
 Unions - A collection of heterogeneous data types. Syntax:
uniontype<data_type, data_type, …>
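Putting the complex types together, a hypothetical table could be declared and queried as follows (names and access expressions are illustrative):

```sql
CREATE TABLE employee_profile (
  name    STRING,
  skills  ARRAY<STRING>,                       -- e.g. ['hive','hdfs']
  phones  MAP<STRING, STRING>,                 -- e.g. {'home':'...','work':'...'}
  address STRUCT<street:STRING, city:STRING>,  -- named fields
  misc    UNIONTYPE<INT, STRING>               -- holds one of several types
);

-- Array elements and map values are accessed with [], struct fields with dot notation:
SELECT skills[0], phones['work'], address.city FROM employee_profile;
```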

25
Hive Query Language : Queries
CREATING DATABASE
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment] [LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, …)]

Hive> CREATE DATABASE UTAS;

SHOWING DATABASE
Hive> SHOW DATABASES;

26
Hive Query Language : Queries
DROPPING DATABASE
DROP (DATABASE | SCHEMA) [IF EXISTS] database_name [RESTRICT |
CASCADE];
Hive> DROP DATABASE IF EXISTS UTAS;
USING DATABASE
USE database_name;
Hive> USE UTAS;

27
Hive Query Language : Queries
CREATING A TABLE
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT
col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type
[COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY
(col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format]
[STORED AS file_format] [LOCATION hdfs_path] [TBLPROPERTIES
(property_name=property_value, ...)] [AS select_statement]

Hive> CREATE TABLE STUDENT (SID INT, SNAME STRING);

28
Hive Query Language : Queries
LOADING DATA IN TABLE
LOAD DATA [LOCAL] INPATH '<the table data location>' [OVERWRITE] INTO
TABLE <table_name> [PARTITION (partcol1=val1, partcol2=val2, …)]
Hive> LOAD DATA LOCAL INPATH '/USER/CLOUDERA/STU.TXT' INTO TABLE
STUDENT;
DISPLAYING CONTENTS OF TABLE
Hive> select * from student;
ALTERING A TABLE
ALTER TABLE <table_name> ADD COLUMNS (column type);
Hive> alter table student add columns (grade string);
RENAMING A TABLE
ALTER TABLE <table_name> RENAME TO <new table_name>
Hive > alter table student rename to students;
DROPPING TABLE
DROP TABLE <table_name>
Hive > drop table students;
29
Hive Query Language : SELECT Operation
 SELECT: SELECT is the projection operator in SQL. The clauses used for this
function are:
 SELECT scans the table specified by the FROM clause
 WHERE gives the condition of what to filter
 GROUP BY gives a list of columns which then specify how to aggregate the
records
 CLUSTER BY, DISTRIBUTE BY, and SORT BY specify the sort order and
algorithm
 LIMIT specifies the number of records to retrieve
SELECT [ALL | DISTINCT] select_expr, select_expr, … FROM table_reference
[WHERE where_condition] [GROUP BY col_list] [HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT
number];
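For example, several of the clauses above can be combined on the STUDENT table created earlier (the grade column is assumed from the ALTER TABLE example):

```sql
SELECT grade, COUNT(*) AS n_students
FROM student
WHERE sid > 0               -- filter rows before aggregation
GROUP BY grade              -- aggregate per grade
HAVING COUNT(*) > 1         -- filter aggregated groups
SORT BY grade               -- sort within each reducer
LIMIT 10;                   -- cap the number of returned records
```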
30
Hive Query Language : JOINS
HiveQL supports the following types of joins:
 JOIN
 LEFT OUTER JOIN
 RIGHT OUTER JOIN
 FULL OUTER JOIN
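A sketch of these join types on two hypothetical tables, student and marks, joined on sid:

```sql
-- Inner join: only students that have at least one matching mark
SELECT s.sname, m.score
FROM student s JOIN marks m ON (s.sid = m.sid);

-- Left outer join: all students; score is NULL where no mark exists
SELECT s.sname, m.score
FROM student s LEFT OUTER JOIN marks m ON (s.sid = m.sid);

-- Full outer join: all rows from both sides, matched where possible
SELECT s.sname, m.score
FROM student s FULL OUTER JOIN marks m ON (s.sid = m.sid);
```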

31
Hive Query Language : Aggregation
 HiveQL supports aggregations and also allows for multiple aggregations to be
done at the same time.
 The possible aggregators are:
 count(*), count(expr), count(DISTINCT expr[, expr_.])
 sum(col), sum(DISTINCT col)
 avg(col), avg(DISTINCT col)
 min(col)
 max(col)
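Several aggregations can indeed be computed in a single query; for instance, assuming a hypothetical sales table with cust and amount columns:

```sql
SELECT COUNT(*)             AS n_rows,      -- all rows
       COUNT(DISTINCT cust) AS n_customers, -- distinct values
       SUM(amount)          AS total,
       AVG(amount)          AS average,
       MIN(amount)          AS smallest,
       MAX(amount)          AS largest
FROM sales;
```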

32
Hive Query Language : Built-In Functions
Hive has numerous built-in functions and some of its widely used functions
are:
 concat(string A, string B,...)
 substr(string A, int start)
 round(double a)
 upper(string A), lower(string A)
 trim(string A)
 to_date(string timestamp)
 year(string date), month(string date), day(string date)
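A quick illustrative query using some of these functions (the input literals are made up):

```sql
SELECT concat('Hello, ', 'Hive') AS greeting,  -- 'Hello, Hive'
       substr('HiveQL', 5)       AS tail,      -- 'QL'
       round(3.7)                AS rounded,   -- rounds to the nearest integer
       upper('hive')             AS shouted,   -- 'HIVE'
       trim('  padded  ')        AS trimmed,   -- 'padded'
       year('2015-06-01')        AS yr;        -- 2015
```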

33
Hive: Partitioning
 Partitions are a way to divide a table into coarse-grained parts,
based on the value of a partition column such as 'date'.
 Using partitions, you can make queries on slices of the data faster.
 A table can have one or more partition columns. A separate data directory
is created for each distinct value combination of the partition columns.
 Partitions are defined at the time of creation of the table.
 Usage
 Use the PARTITIONED BY clause with a list of column
definitions.
 Partitions can be added or removed using the ALTER TABLE statement.
 Partitions can be viewed by using SHOW PARTITIONS (e.g., SHOW PARTITIONS logs;)
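The usage above can be sketched as follows (the logs table and dt column are illustrative):

```sql
-- The partition column 'dt' becomes a directory level in HDFS,
-- not a regular data column in the files
CREATE TABLE logs (line STRING)
PARTITIONED BY (dt STRING);

-- Add and remove partitions explicitly
ALTER TABLE logs ADD PARTITION (dt='2015-01-01');
ALTER TABLE logs DROP PARTITION (dt='2015-01-01');

-- List the partitions of the table
SHOW PARTITIONS logs;
```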

34
Hive: Bucketing
We just discussed that partitioning can distribute the data unevenly; in practice, an even
distribution is unlikely. With bucketing, however, we can achieve an almost
even distribution of the data for processing. Bucketing hashes a column's value to assign
each record to a bucket, so records with the same value always land in the same
bucket, and one bucket can hold multiple groups of values. Bucketing also provides control
over the number of files, as we have to specify the number of buckets while creating the
table, using CLUSTERED BY (month) INTO #noofBuckets
BUCKETS.
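A minimal bucketed-table sketch following the CLUSTERED BY form above (the table, columns, and the choice of 4 buckets are illustrative):

```sql
-- Records are assigned to one of 4 buckets by hash(month) mod 4,
-- so every record with the same month lands in the same bucket file
CREATE TABLE sales_bucketed (id INT, amount DOUBLE, month STRING)
CLUSTERED BY (month) INTO 4 BUCKETS;

-- Older Hive versions require this setting so that inserts honour the bucket count:
SET hive.enforce.bucketing=true;
```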

35
