
Hive

Hive is a data warehouse software tool that facilitates reading, writing and managing large datasets
residing in distributed storage using SQL.

Structure can also be projected onto data already present in storage (HDFS, S3, etc.).

A command-line tool and a JDBC driver are provided to connect users to Hive.

Architecture

Data Types

1. Numeric Types

 TINYINT (1-byte signed integer, from -128 to 127)
 SMALLINT (2-byte signed integer, from -32,768 to 32,767)
 INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
 BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
 FLOAT (4-byte single precision floating point number)
 DOUBLE (8-byte double precision floating point number)
 DECIMAL (Hive 0.13.0 introduced user definable precision and scale)

2. Date/Time Types

 TIMESTAMP
 DATE

3. String Types

 STRING
 VARCHAR
 CHAR

4. Misc Types

 BOOLEAN
 BINARY

Apart from these primitive data types, Hive offers some complex data types, which are listed below:

5. Complex Types

 arrays: ARRAY<data_type>
 maps: MAP<primitive_type, data_type>
 structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
 union: UNIONTYPE<data_type, data_type, ...>
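
For illustration, a table using these complex types might look like the sketch below; the table and column
names are hypothetical, and the collection/map-key delimiters only matter for delimited text storage:

create table customer_contacts (
customer_id int,
phone_numbers array<string>,
attributes map<string,string>,
address struct<street:string, city:string, zip:string>)
row format delimited fields terminated by ','
collection items terminated by '|'
map keys terminated by ':'
stored as textfile;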

Create Database Statement


Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or
a collection of tables. The syntax for this statement is as follows:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

# create db with tables having data in text format


create database retaildb_txt;
use retaildb_txt;
To drop a database, we use DROP DATABASE <dbname>;

# create table orders in text format


create table orders(
order_id int,
order_date string,
order_customer_id int,
order_status string)
row format delimited fields terminated by ','
stored as textfile;

# load data from local file system


load data local inpath 'Downloads/data-master/retail_db/orders' into table orders;

# create table order_items in text format


create table order_items(
order_item_id int,
order_item_order_id int,
order_item_product_id int,
order_item_quantity int,
order_item_subtotal float,
order_item_product_price float)
row format delimited fields terminated by ','
stored as textfile;

# load data
load data local inpath 'Downloads/data-master/retail_db/order_items' into table order_items;

# create db with tables having data in orc format


create database retaildb_orc;
use retaildb_orc;

# create table in orc format in another db


create table orders(
order_id int,
order_date string,
order_customer_id int,
order_status string)
stored as orc;

# insert data from retaildb_txt


insert into table orders select * from retaildb_txt.orders limit 100;

# create customer table in retaildb_txt


create table customers (
customer_id int,
customer_fname varchar(45),
customer_lname varchar(45),
customer_email varchar(45),
customer_password varchar(45),
customer_street varchar(255),
customer_city varchar(45),
customer_state varchar(45),
customer_zipcode varchar(45)
) row format delimited fields terminated by ','
stored as textfile;

# load data in customer table


load data local inpath 'Downloads/data-master/retail_db/customers' into table customers;

# date and timestamp functions - here we don't need to specify any table name
select current_date;
select current_timestamp;
select day(current_date);
select month(current_timestamp);
select date_format('2014-04-02 00:00:00.0','yyyyMM');

# other relational db sql clauses like CASE, ORDER BY etc. are available in hive

# by default, ACID transactions (UPDATE, DELETE) are not enabled in hive; we need to modify
# config files to enable them
# Joins
select o.*,c.* from customers c left outer join
orders o on o.order_customer_id = c.customer_id
limit 10;

select o.*,c.* from customers c inner join
orders o on o.order_customer_id = c.customer_id
limit 10;

# aggregation using group by, having and sum


select o.order_id,o.order_date, o.order_status,
round(sum(oi.order_item_subtotal),2) order_revenue
from orders o join order_items oi
on o.order_id = oi.order_item_order_id
where o.order_status in ('COMPLETE','CLOSED')
group by o.order_id, o.order_date, o.order_status
having sum(oi.order_item_subtotal) >= 1000;

# use distribute by to send records with the same column value to the same reducer


select o.order_id,o.order_date, o.order_status,
round(sum(oi.order_item_subtotal),2) order_revenue
from orders o join order_items oi
on o.order_id = oi.order_item_order_id
where o.order_status in ('COMPLETE','CLOSED')
group by o.order_id, o.order_date, o.order_status
having sum(oi.order_item_subtotal) >= 1000
distribute by o.order_date sort by o.order_date, order_revenue desc;

# case example
select order_status,
case when order_status in ('CLOSED','COMPLETE') then 'NO ACTION'
when order_status in
('PENDING','ON_HOLD','PAYMENT_REVIEW','PENDING_PAYMENT','PROCESSING') then
'PENDING_ACTION'
else 'RISKY'
end from orders limit 10;

# analytical functions
select * from (
select o.order_id, o.order_date, o.order_status, oi.order_item_subtotal,
round(sum(oi.order_item_subtotal) over (partition by o.order_id), 2)
order_revenue,
oi.order_item_subtotal/round(sum(oi.order_item_subtotal) over (partition by
o.order_id), 2) pct_revenue,
round(avg(oi.order_item_subtotal) over (partition by o.order_id), 2)
avg_revenue,
rank() over (partition by o.order_id order by oi.order_item_subtotal desc)
rnk_revenue,
dense_rank() over (partition by o.order_id order by oi.order_item_subtotal
desc) dense_rnk_revenue,
percent_rank() over (partition by o.order_id order by oi.order_item_subtotal
desc) pct_rnk_revenue,
row_number() over (partition by o.order_id order by oi.order_item_subtotal
desc) rn_orderby_revenue,
row_number() over (partition by o.order_id) rn_revenue
from orders o join order_items oi
on o.order_id = oi.order_item_order_id
where o.order_status in ('COMPLETE', 'CLOSED')) q
where order_revenue >= 1000
order by order_date, order_revenue desc, rnk_revenue
limit 10;

# lead, lag
select * from (
select o.order_id, o.order_date, o.order_status, oi.order_item_subtotal,
round(sum(oi.order_item_subtotal) over (partition by o.order_id), 2)
order_revenue,
oi.order_item_subtotal/round(sum(oi.order_item_subtotal) over (partition by
o.order_id), 2) pct_revenue,
round(avg(oi.order_item_subtotal) over (partition by o.order_id), 2)
avg_revenue,
rank() over (partition by o.order_id order by oi.order_item_subtotal desc)
rnk_revenue,
dense_rank() over (partition by o.order_id order by oi.order_item_subtotal
desc) dense_rnk_revenue,
percent_rank() over (partition by o.order_id order by oi.order_item_subtotal
desc) pct_rnk_revenue,
row_number() over (partition by o.order_id order by oi.order_item_subtotal
desc) rn_orderby_revenue,
row_number() over (partition by o.order_id) rn_revenue,
lead(oi.order_item_subtotal) over (partition by o.order_id order by
oi.order_item_subtotal desc) lead_order_item_subtotal,
lag(oi.order_item_subtotal) over (partition by o.order_id order by
oi.order_item_subtotal desc) lag_order_item_subtotal,
first_value(oi.order_item_subtotal) over (partition by o.order_id order by
oi.order_item_subtotal desc) first_order_item_subtotal,
last_value(oi.order_item_subtotal) over (partition by o.order_id order by
oi.order_item_subtotal desc) last_order_item_subtotal
from orders o join order_items oi
on o.order_id = oi.order_item_order_id
where o.order_status in ('COMPLETE', 'CLOSED')) q
where order_revenue >= 1000
order by order_date, order_revenue desc, rnk_revenue
limit 10;

# external tables on data located in hdfs


An external table is a table for which Hive does not manage the storage; the data is stored outside the Hive
warehouse, and Hive keeps only the table's metadata.
A managed (internal) table has its data stored in the Hive warehouse; Hive keeps both the metadata and
the data.
If we drop an external table, only its definition (metadata) is deleted in Hive; the data in external
storage is not deleted. For a managed table, dropping the table deletes both the metadata and the data
in the Hive warehouse.
Table Type                           ACID   File Format   INSERT   UPDATE/DELETE
Managed: CRUD transactional          Yes    ORC           Yes      Yes
Managed: Insert-only transactional   Yes    Any           Yes      No
Managed: Temporary                   No     Any           Yes      No
External                             No     Any           Yes      No

create external table orders(
order_id int,
order_date string,
order_customer_id int,
order_status string)
row format delimited fields terminated by ','
location '/usr/retail_db/orders/retail_db/orders';
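
To confirm whether a table is managed or external, we can inspect its metadata; a minimal check against the
orders table above:

describe formatted orders;

The Table Type field in the output shows MANAGED_TABLE or EXTERNAL_TABLE, along with the table location.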

# create partitioned table


# we can have multiple columns as partition criteria.
# columns used for partitioning should not be part of the table ddl column list.
# in hive, one folder is created for every partition.
CREATE TABLE orders (
order_id int,
order_date string,
order_customer_id int,
order_status string
)
PARTITIONED BY (order_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET;

# to add partition in hive table statically


alter table orders add partition (order_month='2014-01');
# we can add partitions statically if we know them in advance; static partitions
# take less time to load
# read more on the data-flair training blog
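
To verify which partitions exist after the ALTER TABLE above, we can list them; dropping a partition works
similarly (a minimal sketch against the same table):

show partitions orders;
alter table orders drop partition (order_month='2014-01');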

# to load data in static partition


insert into retaildb_orc.orders partition (order_month='2014-01')
select order_id, order_date, order_customer_id, order_status
from retaildb_txt.orders where substr(order_date,1,7) = '2014-01';

# to let partitions be created dynamically at load time


insert into retaildb_orc.orders partition (order_month)
select order_id, order_date, order_customer_id, order_status,
substr(order_date,1,7) as order_month
from retaildb_txt.orders;

# create bucketed tables


# we can create bucketed tables as well, with or without partitions
# in case of bucketing, the bucketing column is part of the table ddl.
# we need to specify the number of buckets we want in the table. while loading data,
# the data gets distributed into buckets with the help of hashing. bucketing ensures
# that the same values of the column land in the same bucket.
# we need to set the properties below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.enforce.bucketing = true;

CREATE TABLE orders_bucket (
order_id int,
order_date string,
order_customer_id int,
order_status string
)
CLUSTERED BY (order_id) INTO 16 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

# insert data into clustered table


insert into orders_bucket
select * from retaildb_txt.orders;

# CTAS - Create table AS SELECT


create table orders_orc
row format delimited fields terminated by ':'
stored as orc
as select * from retaildb_txt.orders;

# creating a table with a custom file format / SerDe - for example orc


CREATE TABLE `orders_orc`(
`order_id` int,
`order_date` string,
`order_customer_id` int,
`order_status` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'field.delim'=':',
'serialization.format'=':')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

# create table with custom delimiter and custom null values


create table t (i int, j int)
row format delimited fields terminated by '\u0001'
null defined as '-1';

Partitioning Vs Bucketing
 In partitioning, a separate folder is created for each partition value, while a single bucket can hold
more than one value.
 Partitioning is used when the table is mostly queried on a column with few distinct values, such as
country or geography, while bucketing is used when a column has a large number of unique values, like
date or employee_id, and we want statistics such as average or max sales per day. Bucketing is also
well suited for sampling (see the example below).
 Partitioning is good for a high volume of data per column value, while bucketing is good for a low
volume of data per column value.
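
For example, a bucketed table can be sampled bucket-wise with TABLESAMPLE; a minimal sketch against the
orders_bucket table created earlier (the bucket numbers are illustrative):

select * from orders_bucket tablesample(bucket 1 out of 16 on order_id);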

Map Side Join


A map side join is favorable when we have one small table and one big table. A local task reads the small
table and stores it in an in-memory hashtable, then moves the hashtable file to Hadoop's distributed cache.
When the original map/reduce job for the join starts, each mapper loads the hashtable from its local disk.
This way there is no need for a reduce task, as the join happens in the map phase only.

We cannot convert a full outer join to a map join. A right/left outer join can be converted to a map join
only if the small table is less than 25 MB in size.
We can also use a hint in the query to force a map join in Hive. In the example below, the smaller table b
is named in the hint, forcing it to be cached manually.

Select /*+ MAPJOIN(b) */ a.key, a.value from a join b on a.key = b.key


Hive can also automatically convert a shuffle/common join to a map join. Three related parameters
control this behavior:
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;
Bucket Map Join
When all tables in the join are large and all of them are bucketed on the join columns, we use a bucket
map join.
In this type of join, the number of buckets in one table should be a multiple of the number of buckets in
the other table.

For example, if a table has 2 buckets then the other table must have 2, 4, 6 (or any multiple of 2) buckets.
In this case, only the matching buckets of the smaller table are replicated or cached on the mappers, and
the join is performed in the map phase only, similar to a map join.

Bucket map join is disabled by default in Hive. We need to enable it and use a hint, as below:
set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(b2) */ b1.* from b1,b2 where b1.col0=b2.col0;

Skewed join
https://weidongzhou.wordpress.com/2017/06/08/join-type-in-hive-skewed-join/

Skewed joins are useful when one of the tables in the join is skewed on the joining column. Suppose there
are 2 tables A and B, and both have the skewed value "mytest" in the joining column, with table B having
fewer rows for that value than table A.
The first step is to scan table B and save all rows with key "mytest" in an in-memory hashtable. Then a set
of mappers reads table A and performs the following:
1. If a row has the skewed key, it uses the hashed version of B for the join.
Here a map side join is performed.
2. Otherwise, it sends the row to a reducer that performs the join. That reducer gets the rows
from table B by scanning it, as in a normal join.
To use a skewed join, you need to understand your data and query. Set the parameter
hive.optimize.skewjoin to true. The parameter hive.skewjoin.key is optional and defaults to 100000.
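
A minimal sketch of enabling a skewed join for the hypothetical tables A and B above (the join column
name id is assumed):

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
select a.*, b.* from A a join B b on a.id = b.id;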

Sort Merge Bucket Join


There are several scenarios when we can use a Hive Sort Merge Bucket (SMB) join:
 All tables are large.
 All tables are bucketed on the join columns.
 All tables are sorted by the join columns.
 The number of buckets is the same for all tables.
To convert an SMB join to an SMB map join, the parameters below need to be set (a table DDL sketch
follows the settings).
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
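
For reference, a bucketed and sorted table suitable for an SMB join might be declared as in the sketch
below; the table name orders_smb is hypothetical and reuses the orders columns from earlier:

create table orders_smb (
order_id int,
order_date string,
order_customer_id int,
order_status string)
clustered by (order_id) sorted by (order_id) into 16 buckets
stored as orc;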

Hive Optimization techniques/Performance tuning


1. Partitioning tables
2. De-normalizing data
3. Compressing map/reduce output
4. Map join
5. Bucketing
6. Input format selection: ORC rather than text or JSON
7. Parallel execution: execute MapReduce jobs in parallel.
SET hive.exec.parallel=true;
8. Vectorization: vectorization allows Hive to process a batch of rows together instead of processing
one row at a time.
SET hive.vectorized.execution.enabled=true;
9. Unit testing: in Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries and more. To a
large extent, it is possible to verify the correctness of your whole HiveQL query by running quick local
unit tests without even touching a Hadoop cluster.
10. Sampling: sampling allows users to take a subset of a dataset and analyze it without having to analyze
the entire data set. Hive offers a built-in TABLESAMPLE clause that allows you to sample your tables
(see the example below).
https://www.qubole.com/blog/hive-best-practices/
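
As a sketch of item 10, TABLESAMPLE can also sample a percentage of the input data (block sampling); the
table name and percentage are illustrative, and block sampling support depends on the input format:

select * from orders tablesample(10 percent);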

Hive Serde
https://medium.com/@gohitvaranasi/how-does-apache-hive-serde-work-behind-the-scenes-a-theoretical-approach-e67636f08a2a
Sqoop
Sqoop is a tool designed to transfer data between HDFS and relational databases. It is used to import data
from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop
file system to relational databases.

Configuration
Step 6: Configuring Sqoop
To configure Sqoop with Hadoop, you need to edit the sqoop-env.sh file, which is placed in the
$SQOOP_HOME/conf directory. First, change to the Sqoop config directory and create sqoop-env.sh from the
template using the following commands:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh

Open sqoop-env.sh and edit the following lines:


export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop

Step 7: Download and Configure mysql-connector-java


We can download the mysql-connector-java-5.1.30.tar.gz file and use the following commands to extract the
tarball and move mysql-connector-java-5.1.30-bin.jar to the /usr/lib/sqoop/lib directory.
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
password:

# cd mysql-connector-java-5.1.30
# mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib

Step 8: Verifying Sqoop


The following command is used to verify the Sqoop version.
$ cd $SQOOP_HOME/bin
$ sqoop-version

Eval - to run any query against a database server, we can use the sqoop eval command. It can be used to
run a select query or an insert query.
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"

$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
-e "INSERT INTO employee VALUES(1007,'Gem','UI dev',15000,'TP')"

Import
https://data-flair.training/blogs/sqoop-import/
Sqoop import is used to move data from an RDBMS to HDFS or a Hive table.
The command below imports data from a MySQL table to an HDFS location.
$ sqoop import \
--connect jdbc:uri \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
-m is the number of mapper tasks to execute in parallel
--target-dir is the HDFS target directory

The command below imports data into a Hive table (note the --hive-import flag):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES \
--username SomeUser \
--password xxxxxx \
--hive-import

Selecting specific columns from the EMPLOYEES table:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES \
--columns "employee_id,first_name,last_name,job_title"

Incremental import
https://medium.com/datadriveninvestor/incremental-data-load-using-apache-sqoop-3c259308f65c
Sqoop jobs store metadata such as last-value, incremental-mode, file-format, output-directory, etc.,
which acts as a reference when loading data incrementally.

sqoop job --create incrementalappendImportJob -- import --connect \
jdbc:mysql://localhost/test --username root --password hortonworks1 --table stocks \
--target-dir /user/hirw/sqoop/stocks_append --incremental append --check-column id \
-m 1

We can execute the sqoop job using the command below:

sqoop job --exec incrementalappendImportJob

Every time we run the above sqoop job, it fetches only the latest records, based on the stored metadata
value incremental.last.value (which is the current maximum value of the id column in the table).

To import all tables from the corp database:

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp

Export
https://sites.google.com/site/hadoopbigdataoverview/sqoop/sqoop-export-commands
Export moves data from an HDFS file or a Hive table to an RDBMS.
Example - the command below inserts new rows and updates existing rows based on the column id
(update and insert):
sqoop export \
--connect jdbc:mysql://localhost/test \
--username root \
--table employee \
--export-dir /sqooptest/dataset.txt \
--update-key id \
--update-mode allowinsert ;

The command below moves data from Hive to an RDBMS:

$SQOOP_HOME/bin> ./sqoop export --connect jdbc:mysql://192.x.x.x/test \
--table sales --username biadmin --password biadmin \
--export-dir /user/hive/warehouse/customersales.db/sales/Sales.csv \
--input-fields-terminated-by ',' --input-lines-terminated-by '\n' -m 1

We can also do validation, i.e. check the row counts at source and target, using the --validate option.
The main purpose of validation in Sqoop is to validate the data copied by either import or export, by
comparing the row counts of the source and the target after the copy.

$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate
