Hive Full Lecture

Shivajirao Kadam Institute of Technology
and Management, Indore (M.P.)

Department of Computer Science and Engineering
Subject: Data Analytics [CS 503]
Lecture on “All About Hive”
Hive
Motivation
Yahoo worked on Pig to facilitate application deployment on Hadoop.
Their need was mainly focused on unstructured data.
Simultaneously, Facebook started working on deploying warehouse solutions on Hadoop, which resulted in Hive.
Hive Look
What is Hive
• Apache Hive is an open-source data warehouse system built on top of Hadoop to summarize, query, and analyze Big Data.
• Used for data analysis
• Created for users comfortable with SQL
• Used for managing and querying structured data
• No need to learn Java
What is Hive
• Apache Hive is for reading, writing, and managing large dataset files that are stored directly in either the Apache Hadoop Distributed File System (HDFS) or other data storage systems such as Apache HBase.
• Hive enables SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements for data query and analysis.
• It is designed to make MapReduce programming easier because you don’t have to know and write lengthy Java code.
What is Hive
• Instead, you can write queries more simply in HQL, and Hive can then create the map and reduce functions.
• Hive generally runs on your workstation and converts your SQL query into a series of jobs for execution on a Hadoop cluster.
• Apache Hive organizes data into tables. This provides a means for attaching structure to data stored in HDFS.
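To make the idea of Hive creating the map and reduce functions more concrete, the following is a minimal Python sketch of how a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be expressed as a map step, a shuffle step, and a reduce step. It is an illustration only, not the code Hive actually generates; the table name, column names, sample rows, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [
    ("asha", "sales"),
    ("ravi", "hr"),
    ("meena", "sales"),
    ("john", "it"),
]


def map_phase(row):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    _name, dept = row
    yield dept, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values of one key; summing the 1s implements COUNT(*)."""
    return key, sum(values)


def run_group_by_count(rows):
    """Run map, shuffle, and reduce in sequence on an in-memory list of rows."""
    pairs = (pair for row in rows for pair in map_phase(row))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling run_group_by_count(ROWS) returns {'sales': 2, 'hr': 1, 'it': 1}. On a real cluster the map and reduce steps run in parallel on many machines; here everything runs in a single process to keep the data flow visible.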
Limitations of Hive
• Hive is not:
  ◦ A relational database
  ◦ Designed for OnLine Transaction Processing (OLTP)
  ◦ A language for real-time queries and row-level updates
• Latency for Apache Hive queries is generally very high.
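Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It supports user-defined functions (UDFs), through which users can provide their own functionality.
• Hive is scalable, and it is fast for batch processing of large datasets rather than for low-latency queries.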

How Hive Differs from Pig
Apache Hive | Apache Pig
It can handle structured data. | It can handle semi-structured data.
It works on the server side of an HDFS cluster. | It works on the client side of an HDFS cluster.
Hive is slower than Pig. | Pig is comparatively faster than Hive.
Hive uses Hive Query Language (HQL). | Pig uses the Pig Latin language.
It was developed by Facebook. | It was developed by Yahoo.
In Hive, all extensions are supported. | Pig scripts end with the .pig extension.
It supports JDBC/ODBC drivers. | It does not support JDBC/ODBC drivers.

Where to use Hive
1) Data Mining
2) Document Indexing
3) Predictive Modeling
4) Customer-Facing UI
Important characteristics of Hive
1) In Hive, tables and databases are created first, and then data is loaded into these tables.
2) Hadoop programming works on flat files, so Hive can use directory structures to
"partition" data and improve performance on certain queries.
3) An important component of Hive is the Metastore, which stores schema
information. The Metastore typically resides in a relational database. We can interact
with Hive using methods such as:
• Web GUI
• Java Database Connectivity (JDBC) interface
4) Generally, HQL syntax is similar to the SQL syntax that most data analysts are
familiar with. The sample query below displays all the records present in the named
table.
Sample query: SELECT * FROM <TableName>;
5) Hive supports several file formats, including the four covered here: TEXTFILE,
SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
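Characteristic 2 above, using directory structures to partition data, can be sketched in Python as follows. Hive lays out partitioned tables as nested key=value directories; the specific warehouse paths, the partition columns (country, year), and the helper names below are hypothetical and chosen only for illustration. The sketch shows why a filter on a partition column lets whole directories be skipped without reading any file.

```python
from pathlib import PurePosixPath

# Hypothetical warehouse layout: one directory level per partition column.
FILES = [
    "/warehouse/sales/country=IN/year=2023/part-0000",
    "/warehouse/sales/country=IN/year=2024/part-0000",
    "/warehouse/sales/country=US/year=2024/part-0000",
]


def partition_values(path):
    """Extract the key=value directory components of a file path as a dict."""
    parts = PurePosixPath(path).parts
    return dict(part.split("=", 1) for part in parts if "=" in part)


def prune(files, **wanted):
    """Keep only the files whose partition directories match every filter."""
    return [
        path
        for path in files
        if all(partition_values(path).get(key) == value
               for key, value in wanted.items())
    ]
```

For example, prune(FILES, country="IN", year="2024") keeps only the second path, so a query filtering on both columns would touch one directory instead of three.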
Hive Architecture
• Hive Client: Hive supports different types of clients.
  ◦ Thrift Server - A cross-language service provider platform that serves requests from all programming languages that support Thrift.
  ◦ JDBC Driver - Used to establish a connection between Hive and Java applications. The JDBC driver class is org.apache.hadoop.hive.jdbc.HiveDriver (HiveServer2 uses org.apache.hive.jdbc.HiveDriver).
  ◦ ODBC Driver - Allows applications that support the ODBC protocol to connect to Hive.
Hive Architecture
• Hive Services: Hive provides different types of services.
  ◦ Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
  ◦ Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
  ◦ Hive MetaStore - A central repository that stores the structure information of the tables and partitions in the warehouse. It also holds column metadata and type information, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
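The kind of information the MetaStore holds (column names and types, how to deserialize the data, and where the files live) can be pictured with a small Python sketch. The TableMeta class, the CASTS mapping, the field delimiter used as a stand-in for serializer/deserializer settings, and the sample table are all hypothetical; the real MetaStore keeps this information in relational tables and uses pluggable SerDe classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    """Hypothetical stand-in for one MetaStore record."""
    name: str
    columns: tuple      # ((column_name, type_name), ...)
    field_delim: str    # simplified stand-in for SerDe settings
    location: str       # HDFS directory that holds the data files


# Simplified mapping from declared column types to Python converters.
CASTS = {"int": int, "string": str, "double": float}


def deserialize(line, meta):
    """Turn one raw text line into a typed row using the stored schema."""
    fields = line.rstrip("\n").split(meta.field_delim)
    if len(fields) != len(meta.columns):
        raise ValueError(f"expected {len(meta.columns)} fields, got {len(fields)}")
    return {
        name: CASTS[type_name](raw)
        for (name, type_name), raw in zip(meta.columns, fields)
    }


EMPLOYEES = TableMeta(
    name="employees",
    columns=(("id", "int"), ("name", "string"), ("salary", "double")),
    field_delim=",",
    location="hdfs:///warehouse/employees",
)
```

With this record, deserialize("7,asha,5200.5\n", EMPLOYEES) yields {'id': 7, 'name': 'asha', 'salary': 5200.5}. The raw file stays untyped text; the structure is applied only when the data is read, using the metadata.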
Hive Architecture
• Hive Services: Hive provides different types of services.
  ◦ Hive Server - Also referred to as the Apache Thrift Server. It accepts requests from different clients and passes them to the Hive Driver.
  ◦ Hive Driver - Receives queries from different sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
  ◦ Hive Compiler - Parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
  ◦ Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
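Executing a DAG of tasks in the order of their dependencies can be illustrated with a short Python sketch based on the standard library's graphlib module (Python 3.9 or later). The task names and the plan below are hypothetical, and the sketch only computes a valid order; it does not reflect how Hive's execution engine is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
PLAN = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_to_hdfs": {"aggregate"},
}


def execution_order(plan):
    """Return the tasks in an order where every dependency comes first."""
    return list(TopologicalSorter(plan).static_order())


def run(plan):
    """'Execute' each task in dependency order and return the execution log."""
    log = []
    for task in execution_order(plan):
        log.append(f"ran {task}")  # a real engine would submit a job here
    return log
```

Both scan tasks have no dependencies, so either may run first; the join always follows both scans, and write_to_hdfs always runs last. A cycle in the plan would make TopologicalSorter raise graphlib.CycleError, which matches the requirement that the plan be acyclic.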
Thank You
