
Apache Hive

Hive
• Data warehousing package built on top of Hadoop.
• Used for data analysis on structured data.
• Targeted towards users comfortable with SQL.
• Its query language, HiveQL, is similar to SQL.
• Abstracts the complexity of Hadoop.
• No Java is required.
• Originally developed by Facebook.
Features of Hive
How is it Different from SQL
• The major difference is that a Hive query executes on a Hadoop infrastructure rather than a traditional database.
• This allows Hive to handle huge data sets - data sets so large that high-end, expensive, traditional databases would fail.
• Internally, a Hive query is executed as a series of automatically generated MapReduce jobs.
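Because each query compiles down to MapReduce jobs, the generated plan can be inspected with EXPLAIN. A minimal sketch (the table name records is hypothetical):
  -- Shows the stages (MapReduce jobs) Hive generates for the query.
  EXPLAIN SELECT year, COUNT(*) FROM records GROUP BY year;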
When not to use Hive
• Semi-structured or completely unstructured data.
• Hive is not designed for online transaction processing.
• It is best for batch jobs over large sets of data.
• Latency for Hive queries is generally high (on the order of minutes), even when data sets are very small (say, a few hundred megabytes).
• It cannot be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data.
Install Hive
• To install Hive:
• Untar the .gz file: tar -xvzf hive-0.13.0-bin.tar.gz
• To initialize the environment variables, export the following:
• export HADOOP_HOME=/home/usr/hadoop-0.20.2
(Specifies the location of the Hadoop installation directory.)
• export HIVE_HOME=/home/usr/hive-0.13.0-bin
(Specifies the location of the Hive installation directory.)
• export PATH=$PATH:$HIVE_HOME/bin

Hive configurations
• Hive's default configuration is stored in the hive-default.xml file in the conf directory.
• Hive comes configured to use Derby as the metastore.
Hive Modes
To start the Hive shell, type hive and press Enter.
• Hive in local mode
 No HDFS is required; all files run on the local file system.
 hive> SET mapred.job.tracker=local;
• Hive in MapReduce (Hadoop) mode
 hive> SET mapred.job.tracker=master:9001;
Introducing data types
• The primitive data types in Hive include integers, Boolean, floating point, Date, Timestamp, and Strings.
• The table below lists the size of each data type:

  Type      Size
  --------  ------------------------------------------
  TINYINT   1 byte
  SMALLINT  2 bytes
  INT       4 bytes
  BIGINT    8 bytes
  FLOAT     4 bytes (single-precision floating point)
  DOUBLE    8 bytes (double-precision floating point)
  BOOLEAN   TRUE/FALSE value
  STRING    Max size is 2 GB

• Complex data types: ARRAY, MAP, STRUCT
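As an illustration, one table can use all three complex types together; the table and column names below are hypothetical:
  CREATE TABLE employees_complex (
    name       STRING,
    skills     ARRAY<STRING>,                -- e.g. elements separated by '|'
    deductions MAP<STRING,FLOAT>,            -- e.g. key:value pairs
    address    STRUCT<city:STRING, zip:INT>
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':';

  -- Elements are accessed with [index], ['key'], and dot notation:
  SELECT skills[0], deductions['tax'], address.city FROM employees_complex;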
Configuring Hive
• Hive is configured using an XML configuration file called hive-site.xml, located in Hive's conf directory.
• Execution engines
 Hive was originally written to use MapReduce as its execution engine, and that is still the default.
 Apache Tez can also be used as the execution engine, and work is underway to support Spark too. Both Tez and Spark are general directed acyclic graph (DAG) engines that offer more flexibility and higher performance than MapReduce.
 It's easy to switch the execution engine on a per-query basis, so you can see the effect of a different engine on a particular query.
 The execution engine is controlled by the hive.execution.engine property, which defaults to "mr" (for MapReduce).
 To set Hive to use Tez: hive> SET hive.execution.engine=tez;
Hive Architecture
Components
• Thrift Client
 It is possible to interact with hive by using any
programming language that usages Thrift server. For e.g.
 Python
 Ruby
• JDBC Driver
 Hive provides a pure java JDBC driver for java application
to connect to hive , defined in the class
org.hadoop.hive.jdbc.HiveDriver
• ODBC Driver
 An ODBC driver allows application that supports ODBC
protocol
Components
• Metastore
 This is the central repository for Hive metadata.
 By default, Hive is configured to use Derby as the metastore. As a result of this configuration, a metastore_db directory is created in each working folder.
• Problems with the default metastore
 Users cannot see the tables created by others if they do not use the same metastore_db.
 Only one embedded Derby database can access the database files at any given point in time.
 This results in only one open Hive session with a metastore; it is not possible to have multiple sessions with Derby as the metastore.
Solution
 We can use a standalone database, either on the same machine or on a remote machine, as the metastore; any JDBC-compliant database can be used.
 MySQL is a popular choice for the standalone metastore.
Configuring MySQL as metastore
 Install MySQL Admin/Client
 Create a Hadoop user and grant permissions to the user
 mysql -u root -p
 mysql> CREATE USER 'hadoop'@'localhost' IDENTIFIED BY 'hadoop';
 mysql> GRANT ALL ON *.* TO 'hadoop'@'localhost' WITH GRANT OPTION;
 Modify the following properties in hive-site.xml to use MySQL instead of Derby. This creates a database in MySQL by the name "Hive":
 name : javax.jdo.option.ConnectionURL
 value : jdbc:mysql://localhost:3306/Hive?createDatabaseIfNotExist=true
 name : javax.jdo.option.ConnectionDriverName
 value : com.mysql.jdbc.Driver
 name : javax.jdo.option.ConnectionUserName
 value : hadoop
 name : javax.jdo.option.ConnectionPassword
 value : hadoop
Hive Program Structure
• The Hive Shell
 The shell is the primary way that we will interact with Hive, by issuing
commands in HiveQL.
 HiveQL is heavily influenced by MySQL, so if you are familiar with
MySQL, you should feel at home using Hive.
 The command must be terminated with a semicolon to tell Hive to
execute it.
 HiveQL is generally case insensitive.
 The Tab key will autocomplete Hive keywords and functions.
• Hive can run in non-interactive mode.
 Use the -f option to run the commands in a specified file:
 hive -f script.hql
 For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required:
 hive -e 'SELECT * FROM dummy'
Ser-de
• A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De).
• The Serializer takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.
• The Serializer is used when writing data, such as through an INSERT-SELECT statement.
• The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate.
• The Deserializer is used at query time to execute SELECT statements.
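For example, a specific SerDe can be named in the CREATE TABLE statement. A sketch using the CSV SerDe bundled with later Hive releases (0.14+); the table name is hypothetical:
  -- Parse quoted CSV input with a dedicated SerDe instead of plain delimiters.
  CREATE TABLE csv_table (name STRING, city STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  STORED AS TEXTFILE;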
Hive Tables
A Hive table is logically made up of the data being stored in HDFS and the associated metadata describing the layout of the data, which is kept in the metastore.
• Managed Table
 When you create a managed table in Hive and load data into it, the data is moved into Hive's warehouse directory.
 CREATE TABLE managed_table (dummy STRING);
 LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• External Table
 Alternatively, you may create an external table, which tells Hive to refer to data at an existing location outside the warehouse directory.
 The location of the external data is specified at table creation time:
 CREATE EXTERNAL TABLE external_table (dummy STRING)
 LOCATION '/user/tom/external_table';
 LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
• When you drop an external table, Hive leaves the data untouched and only deletes the metadata.
• Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
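To check whether an existing table is managed or external, DESCRIBE FORMATTED reports the table type:
  -- Look for 'Table Type:' (MANAGED_TABLE or EXTERNAL_TABLE) in the output.
  DESCRIBE FORMATTED managed_table;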
Storage Format
Text File
When you create a table with no ROW FORMAT or STORED AS clauses, the default format is delimited text with one row per line. For example, a comma-delimited text table can be declared explicitly with:
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
STORED AS
 INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Storage Format
RC: Record Columnar File
The RC format was designed for clusters with MapReduce in mind. It is a huge step up from standard text files: a mature format with ways to ingest data into the cluster without ETL, and it is supported in several Hadoop system components.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS
 INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
Storage Format
ORC: Optimized Row Columnar File
The ORC format appeared in Hive 0.11. As the name implies, it is more optimized than the RC format. If you want to keep query speed high and compress the data as much as possible, then ORC is best. From Hive 0.11 onwards it can be selected with the shorthand clause:
STORED AS ORC
Practice Session
• CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
  e.g. hive> CREATE SCHEMA testdb;
• SHOW DATABASES;
• DROP SCHEMA userdb;
• CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [ROW FORMAT row_format]
  [STORED AS file_format]
• Loading data
  LOAD DATA [LOCAL] INPATH 'hdfs_file_or_directory_path' INTO TABLE table_name
Create Table
• Managed Table
  CREATE TABLE Student (sno int, sname string, year int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• External Table
  CREATE EXTERNAL TABLE Student (sno int, sname string, year int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/external_table';
Load Data to table
To load local files into the Hive table location:
• LOAD DATA LOCAL INPATH '/home/cloudera/SampleDataFile/student_marks.csv' INTO TABLE Student;
To load a file located in HDFS into the Hive table location:
• LOAD DATA INPATH '/user/cloudera/Student_Year.csv' INTO TABLE Student;
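A quick way to confirm that a load succeeded is to query the table (a sketch, assuming the Student table above):
  -- Spot-check a few rows and count the total loaded.
  SELECT * FROM Student LIMIT 5;
  SELECT COUNT(*) FROM Student;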
Table Commands
• Insert Data
 INSERT OVERWRITE TABLE targettable
SELECT col1, col2 FROM source; (to overwrite data)
 INSERT INTO TABLE targettable
SELECT col1, col2 FROM source; (to append data)
• Multi-table insert
 FROM sourcetable
INSERT OVERWRITE TABLE table1
SELECT col1, col2 WHERE condition1
INSERT OVERWRITE TABLE table2
SELECT col1, col2 WHERE condition2;
• CREATE TABLE ... AS SELECT
 CREATE TABLE table1 AS SELECT col1, col2 FROM source;
• Create a new table with the same schema as an existing table
 CREATE TABLE newtable LIKE existingtable;
Database Commands
• To display all created databases
 SHOW DATABASES;
• To create a new database with default properties
 CREATE DATABASE DBName;
• To create a database with a comment
 CREATE DATABASE DBName COMMENT 'holds backup data';
• To use a database
 USE DBName;
• To view the database details
 DESCRIBE DATABASE EXTENDED DBName;
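To remove a database that still contains tables, Hive accepts a CASCADE modifier (DBName is a placeholder):
  -- Without CASCADE, dropping a non-empty database fails.
  DROP DATABASE IF EXISTS DBName CASCADE;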
Table Commands
• To list all tables
  SHOW TABLES;
• To display all contents of a table
  SELECT * FROM <table-name>;
  SELECT * FROM Student_Year WHERE year = 2011;
• To display header information along with the data
  SET hive.cli.print.header=true;
• Using GROUP BY
  SELECT year, COUNT(sno) FROM Student_Year GROUP BY year;
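Grouped results can be filtered with HAVING, for instance (a sketch against the same table):
  -- Keep only years with more than 10 students.
  SELECT year, COUNT(sno) FROM Student_Year
  GROUP BY year HAVING COUNT(sno) > 10;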
Table Commands
• SubQueries
 A subquery is a SELECT statement that is embedded in another SQL
statement.
 Hive has limited support for subqueries, permitting a subquery in the
FROM clause of a SELECT statement, or in the WHERE clause in certain
cases.
 The following query finds the average maximum temperature for
every year and weather station:
SELECT year, AVG(max_temperature)
FROM (
SELECT year, MAX(temperature) AS max_temperature
FROM records2
GROUP BY year
) mt
GROUP BY year;
Table Commands
Alter table
• To add a column
 ALTER TABLE student ADD COLUMNS (Year string);
• To modify a column
 ALTER TABLE table_name CHANGE old_col_name new_col_name new_data_type;
• To rename a table
 ALTER TABLE Employee RENAME TO emp;
• To drop a partition
 ALTER TABLE MyTable DROP PARTITION (age=17);
• To drop a table
 DROP TABLE operatordetails;
• To describe the table schema
 DESC Employee;
 DESCRIBE EXTENDED Employee; (displays detailed information)
Partitioning in Hive
• Using partitions, you can make it faster to execute queries on slices of the data.
• A table can have one or more partition columns.
• A separate data directory is created for each distinct value combination in the partition columns.
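For illustration, for a table partitioned by a pyear column (as in the examples below), the layout on HDFS would look roughly like this; the exact paths depend on the configured warehouse directory:
  /user/hive/warehouse/student_partnew/pyear=2011/
  /user/hive/warehouse/student_partnew/pyear=2012/
  /user/hive/warehouse/student_partnew/pyear=2013/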
Partitioning in Hive
• Partitions are defined at the time of creating a table; the PARTITIONED BY clause is used to create partitions.
Static Partition (Example-1)
 CREATE TABLE student_partnew (name STRING,id
int,marks String) PARTITIONED BY (pyear STRING) row
format delimited fields terminated by ',';
 LOAD DATA LOCAL INPATH '/home/notroot/std_2011.csv'
INTO TABLE student_partnew PARTITION (pyear='2011');
 LOAD DATA LOCAL INPATH '/home/notroot/std_2012.csv'
INTO TABLE student_partnew PARTITION (pyear='2012');
 LOAD DATA LOCAL INPATH '/home/notroot/std_2013.csv'
INTO TABLE student_partnew PARTITION (pyear='2013');
Partitioning in Hive
Static Partition (Example-2)
• CREATE TABLE student_New (id int, name string, marks int, year int) row format delimited fields terminated by ',';
• LOAD DATA local INPATH '/home/notroot/Sandeep/DataSamples/Student_new.csv' INTO table Student_New;
• CREATE TABLE student_part (id int, name string, marks int) PARTITIONED BY (pyear STRING);
• INSERT into TABLE student_part PARTITION (pyear='2012') SELECT id, name, marks FROM student_new WHERE year=2012;
SHOW Partitions
• SHOW PARTITIONS month_part;
Partitioning in Hive
Dynamic Partition
• To enable dynamic partitions:
 set hive.exec.dynamic.partition=true;
(Enables dynamic partitions; it is false by default.)
 set hive.exec.dynamic.partition.mode=nonstrict;
(Allows all partition columns to be determined dynamically; strict mode requires at least one static partition column.)
 set hive.exec.max.dynamic.partitions.pernode=300;
(The default value is 100; modify it according to the number of partitions expected in your case.)
 set hive.exec.max.created.files=150000;
(The default value is 100000, but for larger tables the number of created files can exceed it, so we may have to increase this setting.)
Partitioning in Hive
• CREATE TABLE Stage_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, EYEAR STRING, EMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• LOAD DATA local INPATH '/home/notroot/Sandeep/DataSamples/user_info.csv' INTO TABLE Stage_oper_Month;
• CREATE TABLE Fact_oper_Month (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear STRING, eMONTH STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month INSERT OVERWRITE TABLE Fact_oper_Month PARTITION (opr_status, eyear, eMONTH) SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH DISTRIBUTE BY opr_status, eyear, eMONTH;
• Select from the partitioned table:
 SELECT oper_id, oper_name, oper_dept FROM Fact_oper_Month WHERE eyear='2010' AND emonth='1';
Bucketing Features
• Partitioning gives effective results when there are a limited number of partitions of comparatively equal size.
• To overcome this limitation of partitioning, Hive provides the bucketing concept, another technique for decomposing table data sets into more manageable parts.
• Bucketing is based on (hash function of the bucketed column) mod (total number of buckets).
• Use the CLUSTERED BY clause to divide the table into buckets.
• Bucketing can be done along with partitioning on Hive tables, and even without partitioning.
• Bucketed tables create almost equally distributed data file parts.
• To populate a bucketed table, we need to set the property
 set hive.enforce.bucketing = true;
Bucketing Advantages
• Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a fraction of the data for testing and debugging purposes when the original data sets are very large.
• As the data files are equal-sized parts, map-side joins are faster on bucketed tables than on non-bucketed tables. In a map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
• Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables.
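To let Hive actually use map-side joins on bucketed tables, the bucket map join optimization must be enabled (a sketch; both tables should be bucketed on the join key):
  -- Allow a mapper to join one bucket of the small table against the matching bucket.
  set hive.optimize.bucketmapjoin = true;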
Bucketing Example
• We can create bucketed tables with the help of the CLUSTERED BY clause and the optional SORTED BY clause in the CREATE TABLE statement, and the DISTRIBUTE BY clause in the load statement.
• CREATE TABLE Month_bucketed (oper_id string, Creation_Date string, oper_name String, oper_age int, oper_dept String, oper_dept_id int, opr_status string, eyear string, emonth string) CLUSTERED BY (oper_id) SORTED BY (oper_id, Creation_Date) INTO 10 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
• Similar to partitioned tables, we cannot directly load bucketed tables with the LOAD DATA [LOCAL] INPATH command; rather, we need to use an INSERT OVERWRITE TABLE ... SELECT ... FROM clause from another table to populate the bucketed tables.
• INSERT OVERWRITE TABLE Month_bucketed SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept, oper_dept_id, opr_status, EYEAR, EMONTH FROM stage_oper_month DISTRIBUTE BY oper_id SORT BY oper_id, Creation_Date;
Partitioning with Bucketing
• CREATE TABLE Month_Part_bucketed (oper_id string,
Creation_Date string, oper_name String, oper_age int,oper_dept
String, oper_dept_id int) PARTITIONED BY (opr_status string, eyear
STRING, eMONTH STRING) CLUSTERED BY(oper_id) SORTED BY
(oper_id,Creation_Date) INTO 12 BUCKETS ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
• FROM Stage_oper_Month stg INSERT OVERWRITE TABLE
Month_Part_bucketed PARTITION(opr_status, eyear, eMONTH)
SELECT stg.oper_id, stg.Creation_Date, stg.oper_name,
stg.oper_age, stg.oper_dept, stg.oper_dept_id, stg.opr_status,
stg.EYEAR, stg.EMONTH DISTRIBUTE BY opr_status, eyear,
eMONTH;
Note: Unlike partition columns (which are not included in the table column definitions), bucketed columns are included in the table definition, as shown in the above code for the oper_id and Creation_Date columns.
Table Sampling in Hive
Table sampling in Hive is nothing but extracting a small fraction of data from the original large data set. It is similar to the LIMIT operator in Hive.
Difference between LIMIT and TABLESAMPLE in Hive:
 In many cases a LIMIT clause executes the entire query and then returns only a limited number of results.
 Sampling, by contrast, selects only a portion of the data to run the query over.
To see the performance difference between bucketed and non-bucketed tables:
 Query-1: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (BUCKET 12 OUT OF 12 ON oper_id);
 Query-2: SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM stage_oper_month LIMIT 18;
Note: Query-1 should generally perform faster than Query-2.
To perform block (percentage) sampling with Hive:
 SELECT oper_id, Creation_Date, oper_name, oper_age, oper_dept FROM month_bucketed TABLESAMPLE (1 PERCENT);
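True random sampling can also be expressed with a bucket clause over rand(), even on non-bucketed tables (a sketch):
  -- Sample roughly one tenth of the rows at random.
  SELECT oper_id, oper_name FROM stage_oper_month
  TABLESAMPLE (BUCKET 1 OUT OF 10 ON rand()) s;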
