You are on page 1of 17

Shivajirao Kadam Institute of Technology

and Management, Indore (M.P.)


Department of Computer Science and Engineering
Subject: Data Analytics [CS 503]

Lecture
on
“All About Hive”
Hive
Motivation
Yahoo worked on Pig to facilitate application deployment on
Hadoop.

Their need mainly was focused on unstructured data

Simultaneously Facebook started working on deploying warehouse


solutions on Hadoop that resulted in Hive
Hive Look
What is Hive
 Apache Hive is an open source data warehouse system built on top
of Hadoop to summarize Big Data.

Used for Data Analysis

Created for users comfortable with SQL

Used for Managing and Querying Structured Data

No Need to Learn Java


What is Hive
 Apache Hive is for reading, writing and managing large data set
files that are stored directly in either the Apache Hadoop Distributed
File System (HDFS) or other data storage systems such as Apache
HBase

 Hive enables SQL developers to write Hive Query Language (HQL)


statements that are similar to standard SQL statements for data query
and analysis.

It is designed to make MapReduce programming easier because you


don’t have to know and write lengthy Java code.
What is Hive
 Instead, you can write queries more simply in HQL, and Hive can
then create the map and reduce the functions.

The Hive generally runs on your workstation and converts your SQL
query into a series of jobs for execution on a Hadoop cluster.

Apache Hive organizes data into tables. This provides a means for
attaching the structure to data stored in HDFS.
Limitation of Hive
 Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates.
 Latency for Apache Hive queries is generally very high
Features of Hive
 It stores schema in a database and processed data into
HDFS

It is designed for OLAP

It supports user-defined functions (UDFs) where user can


provide its functionality.

Hive is fast and scalable.


How Hive Different from Pig

Apache Hive Apache Pig


It can handle structured data. It can handle semi-structured data.
It works on server-side of HDFS cluster. It works on client-side of HDFS cluster.
Hive is slower than Pig. Pig is comparatively faster than Hive.
Hive uses Hive Query Llanguage Pig uses pig-latin language.
It was developed by Facebook. It was developed by Yahoo
In HIve, all extensions are supported. Pig scripts end with .pig extension.

It support JDBC/ODBC drivers It does not support JDBC/ODBC drivers


Where to use Hive
1) Data Mining

2) Document Indexing

3) Predictive Modeling

4) Custom Facing UI
Important characteristics of Hive
1) In Hive, tables and databases are created first and then data is loaded into these tables.
2) Hadoop's programming works on flat files. So, Hive can use directory structures to
"partition" data to improve performance on certain queries.
3) A new and important component of Hive i.e. Metastore used for storing schema
information. This Metastore typically resides in a relational database. We can interact
with Hive using methods like
 Web GUI
 Java Database Connectivity (JDBC) interface

4) Generally, HQL syntax is similar to the SQL syntax that most data analysts are
familiar with. The Sample query below display all the records present in mentioned
table name.

Sample query : Select * from <TableName>

5) Hive supports four file formats those are TEXTFILE, SEQUENCEFILE, ORC and
RCFILE (Record Columnar File).
Hive Architecture
Hive Architecture
 Hive Client: Support Different types of client
 Thrift Server - It is a cross-language service provider platform
that serves the request from all those programming languages
that supports Thrift.

 JDBC Driver - It is used to establish a connection between hive


and Java applications. The JDBC Driver is present in the class
org.apache.hadoop.hive.jdbc.HiveDriver.

 ODBC Driver - It allows the applications that support the ODBC


protocol to connect to Hive.
Hive Architecture
 Hive Services: Provided different types of Services
 Hive CLI - The Hive CLI (Command Line Interface) is a shell where we
can execute Hive queries and commands.

 Hive Web User Interface - The Hive Web UI is just an alternative of Hive
CLI. It provides a web-based GUI for executing Hive queries and
commands.

 Hive MetaStore - It is a central repository that stores all the structure


information of various tables and partitions in the warehouse. It also
includes metadata of column and its type information, the serializers and
deserializers which is used to read and write data and the corresponding
HDFS files where the data is stored.
Hive Architecture
 Hive Services: Provided different types of Services
 Hive Server - It is referred to as Apache Thrift Server. It accepts the request
from different clients and provides it to Hive Driver.

 Hive Driver - It receives queries from different sources like web UI, CLI,
Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.

 Hive Compiler - The purpose of the compiler is to parse the query and
perform semantic analysis on the different query blocks and expressions. It
converts HiveQL statements into MapReduce jobs.

 Hive Execution Engine - Optimizer generates the logical plan in the form of
DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine
executes the incoming tasks in the order of their dependencies .
Thank You

You might also like