
GLOBAL INSTITUTE OF TECHNOLOGY

Subject Name with code: Big Data Analytics (BDA) (7AID4-01)

Presented by: Mr. Hemant Mittal

Assistant Professor, CSE

Branch and SEM - IT/ VII SEM

Department of

Computer Science & Engineering

UNIT-5

Hadoop Data to HIVE

Outline
5.1 Hadoop Data to HIVE

5.1.1 Hive Application

5.1.2 Hive Architectures

5.1.3 Features and Techniques

5.2 Seeing How the Hive is Put Together

5.2.1 Hive Installation

5.3 Getting Started with Apache Hive

5.3.1 Apache Introduction

5.3.2 Apache Architecture

5.3.3 Apache clients and server application

5.3.4 Apache Installation

5.5 Examining the Hive Clients

5.6 Working with Hive Data Types

5.7 Creating and Managing Databases and Tables

5.8 Seeing How the Hive Data Manipulation

Language Works

5.9 Querying and Analyzing Data

UNIT 5

Hadoop Data to HIVE


HADOOP Data to HIVE

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
This is a brief tutorial that provides an introduction to using Apache Hive QL with the Hadoop Distributed File System. It can be your first step towards becoming a successful Hadoop developer with Hive.

Hadoop

Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two core modules: MapReduce and the Hadoop Distributed File System (HDFS).
 MapReduce: It is a parallel programming model for processing large amounts of
structured, semi-structured, and unstructured data on large clusters of commodity
hardware.
 HDFS: Hadoop Distributed File System is a part of Hadoop framework, used to store
and process the datasets. It provides a fault-tolerant file system to run on commodity
hardware.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that
are used to help Hadoop modules.
 Sqoop: It is used to import and export data between HDFS and RDBMS.
 Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
 Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.
Note: There are various ways to execute MapReduce operations:

 The traditional approach using Java MapReduce program for structured, semi-structured,
and unstructured data.
 The scripting approach for MapReduce to process structured and semi structured data
using Pig.
 The Hive Query Language (Hive QL or HQL) for MapReduce to process structured data using Hive (see the example below).
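As a brief, hedged illustration of the Hive approach, the single HiveQL statement below (which assumes a hypothetical employee table already exists) is compiled by Hive into one or more MapReduce jobs, so no Java MapReduce code needs to be written:

hive> SELECT destination, COUNT(*) AS emp_count
FROM employee
GROUP BY destination;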

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not

 A relational database
 A design for Online Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

Features of Hive

 It stores schema in a database and processed data in HDFS.


 It is designed for OLAP.
 It provides SQL type language for querying called Hive QL or HQL.
 It is familiar, fast, scalable, and extensible.

Architecture of Hive

The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following describes each unit and its operation:

User Interface: Hive is data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

Hive QL Process Engine: Hive QL is similar to SQL for querying schema information in the Meta store. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and have Hive process it.

Execution Engine: The conjunction part of the Hive QL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results that are the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBASE is the data storage technique used to store data in the file system.

Working of Hive

The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:

Step No.    Operation

1 Execute Query

The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.

2 Get Plan

The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan, i.e., the requirements of the query.

3 Get Metadata

The compiler sends metadata request to Meta store (any database).

4 Send Metadata

Meta store sends metadata as a response to the compiler.

5 Send Plan

The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.

6 Execute Plan

The driver sends the execute plan to the execution engine.

7 Execute Job

Internally, the execution of the plan is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the Name node, and the JobTracker assigns the job to TaskTrackers, which reside on Data nodes. Here, the query is executed as a MapReduce job.

7.1 Metadata Ops

Meanwhile in execution, the execution engine can execute metadata operations with
Meta store.

8 Fetch Result

The execution engine receives the results from Data nodes.

9 Send Results

The execution engine sends those resultant values to the driver.

10 Send Results

The driver sends the results to Hive Interfaces.

Moving data from HDFS to Apache Hive

In CDP Private Cloud Base, you can import data from diverse data sources into HDFS, perform
ETL processes, and then query the data in Apache Hive.

1. Ingest the data.


In CDP Private Cloud Base, you create a single Sqoop import command that imports data
from a relational database into HDFS.
2. Convert the data to ORC format.
In CDP Private Cloud Base, to use Hive to query data in HDFS, you apply a schema to
the data and then store the data in ORC format (a brief sketch of this step follows the list).
3. Incrementally update the imported data.
In CDP Private Cloud Base, updating imported tables involves importing incremental
changes made to the original table using Sqoop and then merging changes with the tables
imported into Hive.
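A minimal HiveQL sketch of the ORC conversion in step 2, assuming the imported data already sits in a text-format staging table named staging_sales (a hypothetical name):

hive> CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM staging_sales;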

The three important functionalities for which Hive is deployed are data summarization, data analysis, and data querying. The query language supported by Hive is Hive QL, which translates SQL-like queries into MapReduce jobs for deployment on Hadoop. Hive QL also supports MapReduce scripts that can be plugged into the queries. Hive increases schema design flexibility and also supports data serialization and deserialization.

Hive is best suited for batch jobs over large sets of append-only data, such as web logs. It cannot work for online transaction processing (OLTP) systems, since it does not provide real-time querying or row-level updates.

Criteria                 Hive

Query language           Hive QL (SQL-like)
Used for                 Creating reports
Area of deployment       Server side
Supported data type      Structured
Integration              JDBC and BI tools

The most important file formats supported by Hive are listed below (an example of choosing a format follows the list):

 Flat files or text files
 Sequence files consisting of binary key–value pairs
 RC files (Record Columnar files) that store the columns of a table in a columnar format
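As a brief, hedged illustration, the file format is selected with the STORED AS clause when a table is created; the table name log_data and its columns below are placeholders only:

hive> CREATE TABLE log_data (ip STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;    -- or TEXTFILE / RCFILE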

Architecture of Apache Hive

Major Components of Hive Architecture

Meta store: It is the repository of metadata. This metadata consists of data for each table like its
location and schema. It also holds the information for partition metadata which lets you monitor
various distributed data progresses in the cluster. This data is generally present in the relational
databases. The metadata keeps track of the data, replicates it, and provides a backup in the case
of data loss.

Driver: The driver receives Hive QL statements and works like a controller. It monitors the
progress and life cycle of various executions by creating sessions. The driver stores the metadata
that is generated while executing the Hive QL statement. When the reducing operation is
completed by the Map Reduce job, the driver collects the data points and query results.

Compiler: The compiler is assigned the task of converting a Hive QL query into MapReduce input. It determines the steps and tasks needed to produce the Hive QL output as required by MapReduce.

Optimizer: The optimizer performs various transformation steps on the execution plan, for aggregation and pipeline conversion, for example converting multiple joins into a single join. It is also responsible for splitting tasks while transforming data, before the reduce operations, for improved efficiency and scalability.

Executor: The executor executes tasks after the compilation and optimization steps. It directly
interacts with the Hadoop Job Tracker for scheduling the tasks to be run.

CLI, UI, and Thrift Server: The command-line interface (CLI) and the user interface (UI) allow external users to submit queries, monitor processes, and issue instructions to Hive. The Thrift Server lets other clients interact with Hive.

Understanding Hive QL

The query language used in Apache Hive is called Hive QL, but it is not exactly a structured
query language (SQL). However, Hive QL offers various other extensions that are not part of
SQL. It is possible to have full ACID properties in Hive QL, along with update, insert, and delete
functionalities. You can do multi-table inserts and use syntax like ‘create table as select’ and
‘create table like’ in Hive QL, but it has only basic support for indexes. Hive QL does not support online transaction processing or materialized views, and it offers only limited subquery support. The way in which Hive stores and queries data closely resembles regular
databases. But since Hive is built on top of the Hadoop ecosystem, it has to adhere to the rules
set forth by the Hadoop framework.
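For example, the 'create table as select' and 'create table like' syntax mentioned above can be sketched as follows (the table names employee, employee_copy, and employee_like are illustrative assumptions):

hive> CREATE TABLE employee_copy AS SELECT eid, name FROM employee;
hive> CREATE TABLE employee_like LIKE employee;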

Why use Apache Hive?

Apache Hive is mainly used for data querying, analysis, and summarization. It helps improve developers' productivity, although this usually comes at the cost of increased latency. Hive QL is a variant of SQL, and a very good one indeed; it stands tall when compared to SQL systems implemented in traditional databases. Hive has many user-defined functions that offer effective ways of solving problems. It is easily possible to connect Hive queries to various Hadoop packages like RHive, RHIPE, and even Apache Mahout. Also, it greatly helps the developer community work with complex analytical processing and challenging data formats.

Data warehouse refers to a system used for reporting and data analysis. What this means is
inspecting, cleaning, transforming, and modeling data with the goal of discovering useful
information and suggesting conclusions. Data analysis has multiple aspects and approaches,
encompassing diverse techniques under a variety of names in different domains.

Hive allows users to access data simultaneously and, at the same time, improves the response time, i.e., the time a system or a functional unit takes to react to a given input. In fact, Hive typically has a much faster response time than most other types of queries. Hive is also highly flexible, as more commodity nodes can easily be added in response to adding more clusters of data without any drop in performance.

What are the advantages of learning Hive?

Apache Hive lets you work with Hadoop in a very efficient manner. It is a complete data
warehouse infrastructure that is built on top of the Hadoop framework. Hive is uniquely
deployed to come up with querying of data, powerful data analysis, and data summarization
while working with large volumes of data. The integral part of Hive is Hive QL which is an
SQL-like interface that is used extensively to query data that is stored in databases.

Hive has a distinct advantage of deploying high-speed data reads and writes within data
warehouses while managing large datasets distributed across multiple locations, all thanks to its
SQL-like features. Hive provides a structure to the data that is already stored in the database.
Users are able to connect with Hive using a command-line tool and a JDBC driver.

What is Hive?

Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System (HDFS). Hive makes it easy to perform operations such as:

 Data encapsulation
 Ad-hoc queries
 Analysis of huge datasets

Important characteristics of Hive

 In Hive, tables and databases are created first and then data is loaded into these tables.
 Hive, as a data warehouse, is designed for managing and querying only structured data that is stored in tables.
 While dealing with structured data, MapReduce doesn't have optimization and usability features such as UDFs, but the Hive framework does. Query optimization refers to an effective way of executing a query in terms of performance.
 Hive's SQL-inspired language separates the user from the complexity of MapReduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns, and schema, for ease of learning.
 Hadoop's programming works on flat files, so Hive can use directory structures to "partition" data to improve performance on certain queries.
 An important component of Hive, the Meta store, is used for storing schema information. This Meta store typically resides in a relational database. We can interact with Hive using methods such as:
o Web GUI
o Java Database Connectivity (JDBC) interface
 Most interactions tend to take place over a command-line interface (CLI). Hive provides a CLI to write Hive queries using the Hive Query Language (HQL).
 Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The sample query below displays all the records present in the specified table.
o Sample query: Select * from <Table Name>
 Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).
 For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.

Some of the key points about Hive:

 The major difference between HQL and SQL is that a Hive query executes on Hadoop's infrastructure rather than on a traditional database.
 Hive query execution is carried out as a series of automatically generated MapReduce jobs.
 Hive supports partition and bucket concepts for easy retrieval of data when the client executes a query.
 Hive supports custom UDFs (User Defined Functions) for data cleansing, filtering, etc. Hive UDFs can be defined according to the requirements of the programmers (a brief usage sketch follows this list).
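As a hedged sketch of using a custom UDF, suppose a jar named my_udfs.jar containing a class com.example.hive.udf.CleanString has already been written (both names are assumptions); it could be registered and used roughly like this:

hive> ADD JAR /home/user/my_udfs.jar;
hive> CREATE TEMPORARY FUNCTION clean_string AS 'com.example.hive.udf.CleanString';
hive> SELECT clean_string(name) FROM employee;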

Hive Vs Relational Databases:-

By using Hive, we can perform some specialized functionality that is not achieved in relational databases. For a huge amount of data, running into petabytes, it is important to be able to query it and get results in seconds. Hive does this quite efficiently: it processes the queries fast and produces results in seconds.

Let us now see what makes Hive so fast.

Some key differences between Hive and relational databases are the following:

Relational databases enforce "schema on write": a table is created first, and data must conform to its schema when it is inserted. On relational database tables, functions like insertions, updates, and modifications can be performed.

Hive, by contrast, is "schema on read". In Hive versions below 0.13, functions like update and modification do not work, because a Hive query in a typical cluster runs on multiple Data Nodes, and it is not possible to update and modify data across multiple nodes.

Also, Hive follows a "write once, read many" pattern: data is typically loaded once and queried many times.

NOTE: However, newer versions of Hive come with updated features. From Hive 0.14 onward, Hive provides UPDATE and DELETE options as new features (a brief sketch follows).
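As an illustrative sketch only: on Hive 0.14 or later, UPDATE and DELETE work on tables created with ACID/transactional support (typically ORC-backed, bucketed tables with transactional properties enabled). With such a table in place, the statements look like ordinary SQL; the table and column names below are assumptions:

hive> UPDATE employee SET salary = '50000' WHERE eid = 1201;
hive> DELETE FROM employee WHERE eid = 1205;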

Hive Architecture

Hive consists of mainly three core parts:

1. Hive Clients
2. Hive Services
3. Hive Storage and Computing

Hive Clients:

Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.

For Java-related applications, it provides JDBC drivers, and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.

Hive Services:

Client interactions with Hive can be performed through Hive Services. If the client wants to
perform any query related operations in Hive, it has to communicate through Hive Services.

The CLI (command-line interface) acts as a Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in the Hive services, as shown in the architecture diagram above.

The driver present in the Hive services represents the main driver; it communicates with all types of JDBC, ODBC, and other client-specific applications. The driver processes requests from the different applications and passes them to the meta store and file systems for further processing.

Hive Storage and Computing:

Hive services such as Meta store, File system, and Job Client in turn communicate with Hive storage and perform the following actions:

 Metadata information of tables created in Hive is stored in Hive "Meta storage database".
 Query results and data loaded in the tables are going to be stored in Hadoop cluster on
HDFS.

Hive Installation:

All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux-flavored OS. The following simple steps are executed for Hive installation:

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hive. Let us verify java installation
using the following command:
$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If java is not installed in your system, then follow the steps given below for installing java.

Installing Java

Step I:
Download java (JDK <latest version> - X64.tar.gz) by visiting the following
link http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.
Step II:
Generally you will find the downloaded java file in the Downloads folder. Verify it and extract
the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:
To make Java available to all the users, you have to move it to the location "/usr/local/". Open root, and type the following commands.
$ su
Password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step IV:
For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc

Step V:
Use the following commands to configure java alternatives:
# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2

# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2

# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2

# alternatives --set java /usr/local/java/bin/java

# alternatives --set javac /usr/local/java/bin/javac

# alternatives --set jar /usr/local/java/bin/jar


Now verify the installation using the command java -version from the terminal as explained
above.

Step 2: Verifying Hadoop Installation

Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop
installation using the following command:
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
Hadoop 2.4.1 Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:

Downloading Hadoop

Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following
commands.
$ su
Password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit

Installing Hadoop in Pseudo Distributed Mode

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.
Step I: Setting up Hadoop
You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc

Step II: Hadoop Configuration
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration
files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, memory allocated for the file system, memory limit for storing the data, and the size of read/write buffers.
Open core-site.xml and add the following properties in between the <configuration> and </configuration> tags.
<configuration>

   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>

</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode path of your local file systems. It specifies the place where you want to store the Hadoop infrastructure.
Let us assume the following data.
dfs.replication (data replication value) = 1

(In the following path, /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)

Namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)

Datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration> and </configuration> tags in this file.
<configuration>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>

</configuration>
Note: In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration> and </configuration> tags in this file.
<configuration>

   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>

</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template named mapred-site.xml.template. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open the mapred-site.xml file and add the following properties in between the <configuration> and </configuration> tags in this file.
<configuration>

   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>

</configuration>

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.


Step I: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step II: Verifying Hadoop dfs


The following command is used to start dfs. Executing this command will start your Hadoop
file system.
$ start-dfs.sh
The expected output is as follows:
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step III: Verifying Yarn Script


The following command is used to start the yarn script. Executing this command will start your
yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step IV: Accessing Hadoop on Browser


The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services on your browser.
http://localhost:50070/

Step V: Verify all applications for cluster
The default port number to access all applications of cluster is 8088. Use the following url to
visit this service.
http://localhost:8088/

Step 3: Downloading Hive


We use hive-0.14.0 in this tutorial. You can download it by visiting the following link: http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the /Downloads directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz" for this tutorial. The following command is used to verify the download:
$ cd Downloads
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz

Step 4: Installing Hive

The following steps are required for installing Hive on your system. Let us assume the Hive
archive is downloaded onto the /Downloads directory.
Extracting and verifying the Hive Archive
The following command is used to verify the download and extract the Hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to the /usr/local/hive directory

We need to copy the files as the super user "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.
$ su -

Password:

# cd /home/user/Downloads
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up environment for Hive


You can set up the Hive environment by appending the following lines to the ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:
The following command is used to execute the ~/.bashrc file.
$ source ~/.bashrc

Step 5: Configuring Hive

To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in
the $HIVE_HOME/conf directory. The following commands redirect to Hive config folder
and copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to
configure Meta store. We use Apache Derby database.

Step 6: Downloading and Installing Apache Derby

Follow the steps given below to download and install Apache Derby:
Downloading Apache Derby
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:

db-derby-10.4.2.0-bin.tar.gz

Extracting and verifying Derby archive


The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying files to the /usr/local/derby directory

We need to copy the files as the super user "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory:
$ su -
Password:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up environment for Derby


You can set up the Derby environment by appending the following lines to the ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute the ~/.bashrc file:
$ source ~/.bashrc

Create a directory to store Meta store


Create a directory named data in the $DERBY_HOME directory to store Meta store data.
$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.

Step 7: Configuring Meta store of Hive

Configuring Meta store means specifying to Hive where the database is stored. You can do this
by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy
the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:
<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Step 8: Verifying Hive Installation

Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS.
Here, we use the /user/hive/warehouse folder. You need to set write permission for these
newly created folders as shown below:
chmod g+w
Now set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The following commands are used to verify Hive installation:
$ cd $HIVE_HOME
$ bin/hive
On successful installation of Hive, you get to see the following response:
Logging initialized using configuration in jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-
0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
Hive>
The following sample command is executed to display all the tables:
Hive> show tables;
OK
Time taken: 2.798 seconds
Hive>
This chapter takes you through the different data types in Hive, which are involved in the table
creation. All the data types in Hive are classified into four types, given as follows:

 Column Types
 Literals
 Null Values
 Complex Types

Column Types

Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds
the range of INT, you need to use BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:

Type        Postfix    Example

TINYINT     Y          10Y
SMALLINT    S          10S
INT         -          10
BIGINT      L          10L

String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive contains two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.
The following table depicts the string data types:

Data Type    Length

VARCHAR      1 to 65535
CHAR         255

Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format "yyyy-mm-dd hh:mm:ss.ffffffffff".
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for representing immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)

Union Types
Union is a collection of heterogeneous data types. You can create an instance using create union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2: ["three","four"]}
{3 :{"a":5,"b":"five"}}
{2: ["six","seven"]}
{3 :{"a":8,"b":"eight"}}
{0:9}
{1:10.0}

Literals

The following literals are used in Hive:


Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data is
composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.

Null Value

Missing values are represented by the special value NULL.

Complex Types

The Hive complex data types are as follows:


Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>

Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>

Structs
Structs in Hive are collections of named fields, each of which can carry an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
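Putting the column types and complex types together, a hedged example of a table definition (all names are illustrative) might look like this:

hive> CREATE TABLE employee_details (
eid INT,
name STRING,
skills ARRAY<STRING>,
phone MAP<STRING, STRING>,
address STRUCT<city:STRING, pin:INT>
);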
Hive is a database technology that can define databases and tables to analyze structured data.
The theme for structured data analysis is to store the data in a tabular manner, and pass queries
to analyze it. This chapter explains how to create a Hive database. Hive contains a default database named default.

Create Database Statement

Create Database is a statement used to create a database in Hive. A database in Hive is


a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause that prevents an error if a database with the same name already exists. We can use SCHEMA in place of DATABASE in this command.
The following query is executed to create a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
Or
hive> CREATE SCHEMA userdb;
The following query is used to verify the list of databases:
hive> SHOW DATABASES;
default
userdb

JDBC Program
The JDBC program to create a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDb {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      // execute the CREATE DATABASE statement
      stmt.executeQuery("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");

      con.close();
   }
}
Save the program in a file named HiveCreateDb.java. The following commands are used to compile and execute this program.
$ javac HiveCreateDb.java
$ java HiveCreateDb

Output:
Database userdb created successfully.
This chapter describes how to drop a database in Hive. The usage of SCHEMA and
DATABASE are same.

Drop Database Statement

Drop Database is a statement that drops all the tables and deletes the database. Its syntax is as
follows:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
The following queries are used to drop a database. Let us assume that the database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE. It means dropping respective tables
before dropping the database.
Hive> DROP DATABASE IF EXISTS userdb CASCADE;
The following query drops the database using SCHEMA.
hive> DROP SCHEMA userdb;
This clause was added in Hive 0.6.

JDBC Program
The JDBC program to drop a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveDropDb {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      // execute the DROP DATABASE statement
      stmt.executeQuery("DROP DATABASE userdb");
      System.out.println("Drop userdb database successful.");

      con.close();
   }
}
Save the program in a file named HiveDropDb.java. Given below are the commands to compile and execute this program.
$ javac HiveDropDb.java
$ java HiveDropDb

Output:
Drop userdb database successful.
This chapter explains how to create a table and how to insert data into it. The conventions of
creating a table in HIVE are quite similar to creating a table using SQL.

Create Table Statement

Create Table is a statement used to create a table in Hive. The syntax and example are as
follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Example
Let us assume you need to create a table named employee using CREATE TABLE statement.
The following table lists the fields and their data types in employee table:

Sr.No    Field Name     Data Type

1        eid            int
2        name           String
3        salary         Float
4        designation    String

The following data describes the comment and row-format fields, such as the field terminator, line terminator, and stored file type.
COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already
exists.
On successful creation of table, you get to see the following response:

OK
Time taken: 5.905 seconds
Hive>
JDBC Program
The JDBC program to create a table is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateTable {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute the CREATE TABLE statement
      stmt.executeQuery("CREATE TABLE IF NOT EXISTS "
         + " employee (eid int, name String, "
         + " salary String, destination String)"
         + " COMMENT 'Employee details'"
         + " ROW FORMAT DELIMITED"
         + " FIELDS TERMINATED BY '\\t'"
         + " LINES TERMINATED BY '\\n'"
         + " STORED AS TEXTFILE;");
      System.out.println("Table employee created.");

      con.close();
   }
}
Save the program in a file named HiveCreateTable.java. The following commands are used to compile and execute this program.
$ javac HiveCreateTable.java
$ java HiveCreateTable

Output
Table employee created.

Load Data Statement

Generally, after creating a table in SQL, we can insert data using the Insert statement. But in
Hive, we can insert data using the LOAD DATA statement.
While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are two ways to load data: one is from the local file system and the other is from the Hadoop file system.
Syntax
The syntax for load data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

 LOCAL is an identifier to specify the local path. It is optional.
 OVERWRITE is optional; it overwrites the existing data in the table.
 PARTITION is optional.
Example
We will insert the following data into the table. It is a text file
named sample.txt in /home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin
The following query loads the given text into the table.
Hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 15.905 seconds
Hive>
JDBC Program
Given below is the JDBC program to load the given data into the table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveLoadData {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute the LOAD DATA statement
      stmt.executeQuery("LOAD DATA LOCAL INPATH '/home/user/sample.txt' "
         + "OVERWRITE INTO TABLE employee;");
      System.out.println("Load Data into employee successful");

      con.close();
   }
}
Save the program in a file named HiveLoadData.java. Use the following commands to compile and execute this program.
$ javac HiveLoadData.java
$ java HiveLoadData
This chapter explains how to alter the attributes of a table such as changing its table name,
changing column names, adding columns, and deleting or replacing columns.

Alter Table Statement

It is used to alter a table in Hive.


Syntax
The statement takes any of the following syntaxes based on what attributes we wish to modify
in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec [, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type

ALTER TABLE name REPLACE COLUMNS (col_spec [, col_spec ...])

Rename To… Statement

The following query renames the table from employee to EMP.


Hive> ALTER TABLE employee RENAME TO EMP;
JDBC Program
The JDBC program to rename a table is as follows.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterRenameTo {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute the ALTER TABLE ... RENAME statement
      stmt.executeQuery("ALTER TABLE employee RENAME TO emp;");
      System.out.println("Table Renamed Successfully");

      con.close();
   }
}
Save the program in a file named HiveAlterRenameTo.java. Use the following commands to compile and execute this program.
$ javac HiveAlterRenameTo.java
$ java HiveAlterRenameTo
Output:
Table renamed successfully.

Change Statement

The following table contains the fields of the employee table and shows the fields to be changed.

Field Name     Convert from Data Type     Change Field Name     Convert to Data Type

eid            int                        eid                   int
name           String                     ename                 String
salary         Float                      salary                Double
designation    String                     designation           String

The following queries rename the column name and column data type using the above data:
Hive> ALTER TABLE employee CHANGE name ename String;
Hive> ALTER TABLE employee CHANGE salary salary Double;
JDBC Program
Given below is the JDBC program to change a column.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterChangeColumn {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute the ALTER TABLE ... CHANGE statements
      stmt.executeQuery("ALTER TABLE employee CHANGE name ename String;");
      stmt.executeQuery("ALTER TABLE employee CHANGE salary salary Double;");
      System.out.println("Change column successful.");

      con.close();
   }
}
Save the program in a file named HiveAlterChangeColumn.java. Use the following commands to compile and execute this program.
$ javac HiveAlterChangeColumn.java
$ java HiveAlterChangeColumn
Output:
Change column successful.

Add Columns Statement

The following query adds a column named dept to the employee table.
hive> ALTER TABLE employee ADD COLUMNS (
dept STRING COMMENT 'Department name');

JDBC Program

The JDBC program to add a column to a table is given below.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterAddColumn {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute the ALTER TABLE ... ADD COLUMNS statement
      stmt.executeQuery("ALTER TABLE employee ADD COLUMNS "
         + " (dept STRING COMMENT 'Department name');");
      System.out.println("Add column successful.");

      con.close();
   }
}
Save the program in a file named HiveAlterAddColumn.java. Use the following commands to compile and execute this program.
$ javac HiveAlterAddColumn.java
$ java HiveAlterAddColumn
Output:
Add column successful.

Replace Statement

The following query deletes all the columns from the employee table and replaces them with the empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (
empid Int,
name String);

JDBC Program

Given below is the JDBC program that replaces the eid column with empid and the ename column with name.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveAlterReplaceColumn {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute the ALTER TABLE ... REPLACE COLUMNS statement
      stmt.executeQuery("ALTER TABLE employee REPLACE COLUMNS "
         + " (empid Int,"
         + " name String);");
      System.out.println("Replace column successful");

      con.close();
   }
}
Save the program in a file named HiveAlterReplaceColumn.java. Use the following commands to compile and execute this program.
$ javac HiveAlterReplaceColumn.java
$ java HiveAlterReplaceColumn

Output:
Replace column successful.
This chapter describes how to drop a table in Hive. When you drop a managed (internal) table from the Hive Meta store, Hive removes both the table data and its metadata. When you drop an external table (whose data resides outside the warehouse, for example in the local file system or elsewhere on HDFS), only the metadata is removed and the underlying data is left in place.

Drop Table Statement

The syntax is as follows:


DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
Hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
Hive>

JDBC Program
The following JDBC program drops the employee table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveDropTable {

   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {

      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute the DROP TABLE statement
      stmt.executeQuery("DROP TABLE IF EXISTS employee;");
      System.out.println("Drop table successful.");

      con.close();
   }
}
Save the program in a file named HiveDropTable.java. Use the following commands to compile and execute this program.
$ javac HiveDropTable.java
$ java HiveDropTable

Output:
Drop table successful
The following query is used to verify the list of tables:
Hive> SHOW TABLES;
emp
OK
Time taken: 2.1 seconds
Hive>

Hive organizes tables into partitions. It is a way of dividing a table into related parts based on
the values of partitioned columns such as date, city, and department. Using partition, it is easy
to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function applied to some column of a table.
For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. A query would search the whole table for the required information. However, if you partition the employee data by the year and store it in a separate file, the query processing time is reduced.
The following example shows how to partition a file and its data:
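A hedged HiveQL sketch of the same idea, assuming an employee_part table partitioned by yoj, with a local data file for the year 2012 (all names are illustrative):

hive> CREATE TABLE employee_part (id INT, name STRING, dept STRING)
PARTITIONED BY (yoj INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH '/home/user/emp2012.txt'
INTO TABLE employee_part PARTITION (yoj=2012);
hive> SELECT * FROM employee_part WHERE yoj = 2012;

Bucketing would be expressed by adding a clause such as CLUSTERED BY (id) INTO 4 BUCKETS to the CREATE TABLE statement.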

The Job execution flow in Hive with Hadoop can be understood from the following steps.

The data flow in Hive behaves in the following pattern:

1. Executing the query from the UI (User Interface).

2. The driver interacts with the compiler to get the plan. (Here, "plan" refers to the query execution process and the gathering of its related metadata information.)
3. The compiler creates the plan for the job to be executed and communicates with the Meta store to request metadata.
4. The Meta store sends the metadata information back to the compiler.
5. The compiler sends the proposed plan to execute the query back to the driver.
6. The driver sends the execution plan to the execution engine.
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query, including the DFS operations:

 The EE first contacts the Name Node and then the Data Nodes to get the values stored in tables.
 The EE fetches the desired records from the Data Nodes. The actual data of the tables resides on the Data Nodes only; from the Name Node it fetches only the metadata information for the query.
 It collects the actual data from the Data Nodes related to the mentioned query.
 The Execution Engine (EE) communicates bi-directionally with the Meta store present in Hive to perform DDL (Data Definition Language) operations. DDL operations such as CREATE, DROP, and ALTER on tables and databases are done here. The Meta store stores information about database names, table names, and column names only.
 The Execution Engine (EE) in turn communicates with Hadoop daemons such as the Name Node, Data Nodes, and JobTracker to execute the query on top of the Hadoop file system.

8. Fetching the results from the driver.

9. Sending the results to the execution engine. Once the results are fetched from the Data Nodes to the EE, it sends the results back to the driver and then to the UI (front end).

Hive is continuously in contact with the Hadoop file system and its daemons via the execution engine. The dotted arrow in the job flow diagram shows the execution engine's communication with the Hadoop daemons.

Different modes of Hive

Hive can operate in two modes depending on the size of data nodes in Hadoop.

These modes are,

 Local mode
 Map reduce mode

When to use Local mode:

 If the Hadoop installed under pseudo mode with having one data node we use Hive in this
mode
 If the data size is smaller in term of limited to single local machine, we can use this mode
 Processing will be very fast on smaller data sets present in the local machine

When to use Map reduce mode:

 If Hadoop is having multiple data nodes and data is distributed across different node we
use Hive in this mode
 It will perform on large amount of data sets and query going to execute in parallel way
 Processing of large data sets with better performance can be achieved through this mode

In Hive, we can set this property to mention which mode Hive can work? By default, it works on
Map Reduce mode and for local mode you can have the following setting.

To make Hive work in local mode, set:

SET mapred.job.tracker=local;

From Hive version 0.7 onward, Hive also supports a mode that runs MapReduce jobs in local mode automatically (see the example below).
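The automatic behavior is controlled by a standard Hive configuration property; a hedged example of enabling it from the Hive shell:

hive> SET hive.exec.mode.local.auto=true;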

What is Hive Server2 (HS2)?

HiveServer2 (HS2) is a server interface that performs the following functions:

 Enables remote clients to execute queries against Hive
 Retrieves the results of those queries

The latest versions have some advanced features, based on Thrift RPC, such as:

 Multi-client concurrency
 Authentication

Summary:

Hive is an ETL and data warehouse tool on top of the Hadoop ecosystem, used for processing structured and semi-structured data.

 Hive is a database present in the Hadoop ecosystem that performs DDL and DML operations, and it provides a flexible query language, HQL, for better querying and processing of data.
 It provides many features compared to an RDBMS, which has certain limitations.

For user-specific logic to meet client requirements:

 It provides the option of writing and deploying custom scripts and user-defined functions.
 In addition, it provides partitions and buckets for storage-specific logic.

Installation and Configuration


You can install a stable release of Hive by downloading a tar ball, or you can download the
source code and build Hive from that.

Running HiveServer2 and Beeline

Requirements

 Java 1.7
Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with
Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
 Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward).
Hive versions up to 0.13 also supported Hadoop 0.20.x and 0.23.x.
 Hive is commonly used in production Linux and Windows environments. Mac is a commonly
used development environment. The instructions in this document are applicable to Linux and
Mac. Using Hive on Windows would require slightly different steps.

Installing Hive from a Stable Release


Start by downloading the most recent stable release of Hive from one of the Apache download
mirrors (see Hive Releases).
Next you need to unpack the tarball. This will result in the creation of a subdirectory
named hive-x.y.z (where x.y.z is the release number):
$ tar -xzvf hive-x.y.z.tar.gz
Set the environment variable HIVE_HOME to point to the installation directory:
$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source


The Hive GIT repository with the most recent Hive code is located here: git clone https://git-wip-
us.apache.org/repos/asf/hive.git (the master branch).
All release versions are in branches named "branch-0.#", "branch-1.#", or the upcoming
"branch-2.#", with the exception of release 0.8.1, which is in "branch-0.8-r2". Any branches with
other names are feature branches for works in progress. See Understanding Hive Branches for
details.
As of 0.13, Hive is built using Apache Maven.
Compile Hive on master
To build the current Hive code from the master branch:
$ git clone https://git-wip-us.apache.org/repos/asf/hive.git
$ cd hive
$ mvn clean package -Pdist [-DskipTests -Dmaven.javadoc.skip=true]
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
hcatalog/ (hcatalog installation)
scripts/ (upgrade scripts for hive-metastore)
Here, {version} refers to the current Hive version.
If building Hive source using Maven (mvn), we will refer to the directory
"/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-
bin" as <install-dir> for the rest of this document.

Compile Hive on branch-1


In branch-1, Hive supports both Hadoop 1.x and 2.x. You will need to specify which version of
Hadoop to build against via a Maven profile. To build against Hadoop 1.x use the
profile hadoop-1; for Hadoop 2.x use hadoop-2. For example, to build against Hadoop 1.x, the
above mvn command becomes:
$ mvn clean package -Phadoop-1,dist

Compile Hive Prior to 0.13 on Hadoop 0.20


Prior to Hive 0.13, Hive was built using Apache Ant. To build an older version of Hive on
Hadoop 0.20:
$ svn co http://svn.apache.org/repos/asf/hive/branches/branch-{version} hive
$ cd hive
$ ant clean package
$ cd build/dist
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
hcatalog/ (hcatalog installation)
scripts/ (upgrade scripts for hive-metastore)

If using Ant, we will refer to the directory "build/dist" as <install-dir>.

Compile Hive Prior to 0.13 on Hadoop 0.23


To build Hive with Ant against Hadoop 0.23, 2.0.0, or another version, build with the appropriate
flag; some examples are below:
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive
Hive uses Hadoop, so:

 you must have Hadoop in your path OR


 export HADOOP_HOME=<hadoop-install-dir>

Runtime Configuration

 Hive queries are executed as map-reduce jobs and, therefore, the behavior of such queries
can be controlled by the Hadoop configuration variables.

 The Hive CLI (deprecated) and Beeline command 'SET' can be used to set any Hadoop (or Hive)
configuration variable. For example:
 beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;
 beeline> SET -v;
The latter shows all the current settings. Without the -v option only the variables that differ from
the base Hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode


Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the
Map-Reduce cluster indicated by the variable:
mapred.job.tracker
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty
option to run map-reduce jobs locally on the user's workstation. This can be very useful to run
queries over small data sets – in such cases local mode execution is usually significantly faster
than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely,
local mode only runs with one reducer and can be very slow processing larger data sets.
Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can
enable the following option:
hive> SET mapreduce.framework.name=local;

In addition, mapred.local.dir should point to a path that's valid on the local machine (for
example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating
local disk space.)
Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode
automatically. The relevant options
are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max,
and hive.exec.mode.local.auto.tasks.max:
hive> SET hive.exec.mode.local.auto=false;
Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-
reduce job in a query and may run it locally if the following thresholds are satisfied:

 The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB
by default)
 The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
 The total number of reduce tasks required is 1 or 0.

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input
to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs
may be run locally.
Note that there may be differences in the runtime environment of Hadoop server nodes and the
machine running the Hive client (because of different jvm versions or different software
libraries). This can cause unexpected behavior/errors while running in local mode. Also note that
local mode execution is done in a separate, child jvm (of the Hive client). If the user so wishes,
the maximum amount of memory for this child jvm can be controlled via the
option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop
determine the default memory limits of the child jvm.
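As a quick illustration, the options above can be combined in one session so that Hive decides on its own
when a job is small enough to run locally. The values shown simply restate the documented defaults
explicitly; adjust them to your cluster:

hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB
hive> SET hive.exec.mode.local.auto.tasks.max=4;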

Hive Logging
Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The
default logging level is WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0, the
default logging level is INFO.
The logs are stored in the directory /tmp/<user.name>:

 /tmp/<user.name>/hive.log
Note: In local mode, prior to hive 0.13.0 the log file name was ".log" instead of "hive.log". This
bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).

To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hive-log4j.properties.
Make sure the directory has the sticky bit set (chmod 1777 <dir>).

 hive.log.dir=<other location>

What is Apache Hive?

Apache Hive is a data warehouse system developed at Facebook to process huge amounts of
structured data in Hadoop. To process data with plain Hadoop, we need to write complex
map-reduce functions, which is not an easy task for most developers. Hive makes this work much
easier for us.

It uses a query language called Hive QL, which is very similar to SQL. We just write SQL-like
commands, and at the back end Hive automatically converts them into map-reduce jobs, as the short
example below shows.
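As a minimal sketch of this idea (the table and column names are hypothetical), the single statement
below is all the developer writes; Hive compiles it into one or more map-reduce jobs with no mapper
or reducer code required:

SELECT dept, AVG(salary) AS avg_salary  -- hypothetical columns
FROM employees                          -- hypothetical table
GROUP BY dept;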

Apache Hive Architecture

Let’s have a look at the following diagram, which shows the architecture.

 Hive Clients: Hive allows us to write applications using different types of clients, such as the
Thrift server and the JDBC driver for Java, and it also supports applications that use the ODBC
protocol.

 Hive Services: As developers, when we wish to process data we use Hive services such as the
Hive CLI (Command Line Interface). In addition, Hive provides a web-based interface for
running Hive applications.
 Hive Driver: It receives queries from multiple sources, such as Thrift, JDBC, and ODBC clients
via the Hive server, and directly from the Hive CLI and the web-based UI. After receiving a
query, it transfers it to the compiler.
 Hive QL Engine: It receives the query from the compiler and converts the SQL-like query into
map-reduce jobs.
 Meta Store: Here Hive stores meta-information about the databases, such as the schema of each
table, the data types of the columns, and the location in HDFS (see the example after this list).
 HDFS: The Hadoop Distributed File System, used to store the actual data.
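To see the kind of information the Meta Store keeps, the following statements can be run from any Hive
client; the database and table names are hypothetical, and DESCRIBE FORMATTED prints the schema,
column types, and HDFS location recorded in the metastore:

SHOW DATABASES;
USE sales_db;                   -- hypothetical database
SHOW TABLES;
DESCRIBE FORMATTED employees;   -- hypothetical table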

Working of Apache Hive

Now, let’s have a look at how Hive works on top of the Hadoop framework.

1. In the first step, we write the query using the web interface or the command-line
interface of Hive, which sends it to the driver for execution.
2. In the next step, the driver sends the received query to the compiler, where the compiler
verifies the syntax.
3. Once the syntax verification is done, the compiler requests metadata from the meta store.
4. The meta store returns information such as the database, the tables, and the data types of the
columns relevant to the query back to the compiler.
5. The compiler checks the requirements against the metadata received from the meta store and
sends the execution plan to the driver.
6. The driver sends the execution plan to the Hive QL process engine, where the engine
converts the query into map-reduce jobs.
7. After the query is converted into map-reduce jobs, the engine sends the task information to
Hadoop, where processing of the query begins; at the same time, it updates the
metadata about the map-reduce job in the meta store.
8. Once the processing is done, the execution engine receives the results of the query.
9. The execution engine transfers the results back to the driver, which finally sends them to
the Hive user interface, where we can see the results.

Data Types in Apache Hive

Hive data types are divided into the following 5 different categories:

1. Numeric Type: TINYINT, SMALLINT, INT, BIGINT


2. Date/Time Types: TIMESTAMP, DATE, INTERVAL
3. String Types: STRING, VARCHAR, CHAR
4. Complex Types: STRUCT, MAP, UNION, ARRAY
5. Misc Types: BOOLEAN, BINARY

Here is a short example that uses several of these types.
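This is only a small sketch; the table and column names below are hypothetical, but it shows how a
single CREATE TABLE statement can mix primitive and complex types:

CREATE TABLE IF NOT EXISTS customer_profiles (
  id            BIGINT,
  name          STRING,
  is_active     BOOLEAN,
  signup_date   DATE,
  last_login    TIMESTAMP,
  phone_numbers ARRAY<STRING>,
  preferences   MAP<STRING, STRING>,
  address       STRUCT<street:STRING, city:STRING, zip:STRING>);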

Create and Drop Database

Creating and dropping a database is very simple and similar to SQL. We need to assign a
unique name to each database in Hive. If a database with the same name already exists, Hive
reports an error; to suppress this, you can add the keywords IF NOT EXISTS after the
DATABASE keyword.

CREATE DATABASE <<database name>> ;

Dropping a database is also very simple; you just need to write DROP DATABASE followed by the
name of the database to be dropped. If you try to drop a database that doesn’t exist, Hive gives you
a SemanticException error.

DROP DATABASE <<database name>> ;
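A small sketch with a hypothetical database name; IF NOT EXISTS and IF EXISTS suppress the errors
mentioned above, and CASCADE drops the database even if it still contains tables:

CREATE DATABASE IF NOT EXISTS sales_db;     -- no error if sales_db already exists
DROP DATABASE IF EXISTS sales_db CASCADE;   -- no error if absent; CASCADE also removes its tables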

Create Table

We use the create table statement to create a table and the complete syntax is as follows.

CREATE TABLE IF NOT EXISTS <<database name.>><<table name>>

(column_name_1 data_type_1,

column_name_2 data_type_2,

column_name_n data_type_n)

ROW FORMAT DELIMITED FIELDS

TERMINATED BY '\t'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;

If you are already using the database (via USE), you are not required to write
database_name.table_name; in that case, you can write just the table name. In the case of Big Data,
most of the time we import the data from external files, so here we can pre-define the delimiter used
in the file and the line terminator, and we can also define how we want to store the table.
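As a concrete instance of the syntax above (the database, table, and column names are only examples),
for a tab-separated text file of employee records:

CREATE TABLE IF NOT EXISTS sales_db.employees (
  emp_id INT,
  name   STRING,
  dept   STRING,
  salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;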

There are two different types of Hive tables: internal (managed) tables and external tables.

Load Data into Table

Now that the tables have been created, it’s time to load data into them. We can load data from
any local file on our system using the following syntax.

LOAD DATA LOCAL INPATH <<path of file on your local system>>

INTO TABLE

<<Database name.>><<table name>>;

When we work with a huge amount of data, there is a possibility of unmatched data types
in some of the rows. In that case, Hive will not throw an error; rather, it will fill those fields with
NULL values. This is a very useful feature, as loading big data files into Hive is an expensive
process, and we do not want to reload the entire dataset because of a few bad rows.
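A concrete instance of the syntax above; the file path, database, and table names are only examples, and
the commented variant with OVERWRITE replaces any existing data instead of appending to it:

LOAD DATA LOCAL INPATH '/home/hduser/data/employees.tsv'
INTO TABLE sales_db.employees;
-- LOAD DATA LOCAL INPATH '/home/hduser/data/employees.tsv' OVERWRITE INTO TABLE sales_db.employees;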

Alter Table

In Hive, we can make multiple modifications to existing tables, such as renaming a table or
adding more columns to it. The commands to alter a table are very similar to the corresponding
SQL commands.

Here is the syntax to rename the table:

ALTER TABLE <<table name>> RENAME TO <<new name>> ;

Syntax to add more columns to a table:

-- To add more columns

ALTER TABLE <<table name>> ADD COLUMNS

(new_column_name_1 data_type_1,

new_column_name_2 data_type_2,

new_column_name_n data_type_n) ;
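A concrete usage of both forms, continuing with the example employees table used earlier (all names
are hypothetical):

ALTER TABLE employees RENAME TO staff;
ALTER TABLE staff ADD COLUMNS (
  email     STRING,
  hire_date DATE);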

Advantages/Disadvantages of Apache Hive

 Uses an SQL-like query language that is already familiar to most developers, which makes
Hive easy to use.
 It is highly scalable; you can use it to process data of practically any size.
 Supports multiple databases such as MySQL, Derby, PostgreSQL, and Oracle for its metastore.
 Supports multiple data formats and also allows indexing, partitioning, and bucketing for
query optimization (see the sketch after this list).
 Can only deal with cold (batch) data and is not suited to processing real-time data.
 It is comparatively slower than some of its competitors; if your use case is mostly about
batch processing, then Hive works well.
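A minimal sketch (table and column names hypothetical) of the partitioning and bucketing mentioned
above: partitions split the data into separate HDFS directories by column value, while buckets hash rows
into a fixed number of files, both of which help prune data at query time:

CREATE TABLE IF NOT EXISTS sales_db.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE)
PARTITIONED BY (order_date STRING)           -- one directory per date value
CLUSTERED BY (customer_id) INTO 8 BUCKETS    -- rows hashed into 8 files per partition
STORED AS ORC;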

