Hive
Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can also be projected onto data already present in storage (HDFS, S3, etc.). A command-line tool and a JDBC driver are provided to connect users to Hive.
Data Types
1. Numeric Types
TINYINT
SMALLINT
INT
BIGINT
FLOAT
DOUBLE
DECIMAL
2. Date/Time Types
TIMESTAMP
DATE
3. String Types
STRING
VARCHAR
CHAR
4. Misc Types
BOOLEAN
BINARY
Apart from these primitive data types, Hive offers some complex data types, which are listed below (a usage sketch follows the list):
5. Complex Types
arrays: ARRAY<data_type>
maps: MAP<primitive_type, data_type>
structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
union: UNIONTYPE<data_type, data_type, ...>
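As an illustration, here is a minimal sketch of a table DDL using these complex types (the table and column names are hypothetical):
create table customer_profiles (
name STRING,
phone_numbers ARRAY<STRING>,
attributes MAP<STRING, STRING>,
address STRUCT<street : STRING, city : STRING, zip : INT>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';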
# load data
load data local inpath 'Downloads/data-master/retail_db/order_items' into table order_items;
# date and timestamp functions - here we don't need to specify any table name
select current_date;
select current_timestamp;
select day(current_date);
select month(current_timestamp);
select date_format('2014-04-02 00:00:00.0', 'yyyyMM'); -- lowercase 'yyyy': uppercase 'YYYY' means week-year in SimpleDateFormat patterns
# other relational db SQL clauses like CASE, ORDER BY, etc. are available in Hive
# case example
select order_status,
case
when order_status in ('CLOSED', 'COMPLETE') then 'NO ACTION'
when order_status in ('PENDING', 'ON_HOLD', 'PAYMENT_REVIEW', 'PENDING_PAYMENT', 'PROCESSING') then 'PENDING_ACTION'
else 'RISKY'
end as action
from orders limit 10;
# analytical functions
select * from (
select o.order_id, o.order_date, o.order_status, oi.order_item_subtotal,
round(sum(oi.order_item_subtotal) over (partition by o.order_id), 2) order_revenue,
oi.order_item_subtotal / round(sum(oi.order_item_subtotal) over (partition by o.order_id), 2) pct_revenue,
round(avg(oi.order_item_subtotal) over (partition by o.order_id), 2) avg_revenue,
rank() over (partition by o.order_id order by oi.order_item_subtotal desc) rnk_revenue,
dense_rank() over (partition by o.order_id order by oi.order_item_subtotal desc) dense_rnk_revenue,
percent_rank() over (partition by o.order_id order by oi.order_item_subtotal desc) pct_rnk_revenue,
row_number() over (partition by o.order_id order by oi.order_item_subtotal desc) rn_orderby_revenue,
row_number() over (partition by o.order_id) rn_revenue
from orders o join order_items oi
on o.order_id = oi.order_item_order_id
where o.order_status in ('COMPLETE', 'CLOSED')) q
where order_revenue >= 1000
order by order_date, order_revenue desc, rnk_revenue
limit 10;
# lead, lag, first_value, last_value
select * from (
select o.order_id, o.order_date, o.order_status, oi.order_item_subtotal,
round(sum(oi.order_item_subtotal) over (partition by o.order_id), 2) order_revenue,
oi.order_item_subtotal / round(sum(oi.order_item_subtotal) over (partition by o.order_id), 2) pct_revenue,
round(avg(oi.order_item_subtotal) over (partition by o.order_id), 2) avg_revenue,
rank() over (partition by o.order_id order by oi.order_item_subtotal desc) rnk_revenue,
dense_rank() over (partition by o.order_id order by oi.order_item_subtotal desc) dense_rnk_revenue,
percent_rank() over (partition by o.order_id order by oi.order_item_subtotal desc) pct_rnk_revenue,
row_number() over (partition by o.order_id order by oi.order_item_subtotal desc) rn_orderby_revenue,
row_number() over (partition by o.order_id) rn_revenue,
lead(oi.order_item_subtotal) over (partition by o.order_id order by oi.order_item_subtotal desc) lead_order_item_subtotal,
lag(oi.order_item_subtotal) over (partition by o.order_id order by oi.order_item_subtotal desc) lag_order_item_subtotal,
first_value(oi.order_item_subtotal) over (partition by o.order_id order by oi.order_item_subtotal desc) first_order_item_subtotal,
-- note: with the default window frame, last_value returns the current row's value;
-- add 'rows between unbounded preceding and unbounded following' for the true last value
last_value(oi.order_item_subtotal) over (partition by o.order_id order by oi.order_item_subtotal desc) last_order_item_subtotal
from orders o join order_items oi
on o.order_id = oi.order_item_order_id
where o.order_status in ('COMPLETE', 'CLOSED')) q
where order_revenue >= 1000
order by order_date, order_revenue desc, rnk_revenue
limit 10;
Partitioning Vs Bucketing
Separate directories are created in storage for each value of the partition column, while a single bucket (file) can hold rows with many different values of the bucketing column.
Partitioning is used when we want a table queried mostly on a low-cardinality column, such as country or geography. Bucketing is used when a column has a large number of unique values, such as date or employee_id, and we want per-value statistics such as average or max sales per day; it is also the most suitable technique for sampling.
Partitioning works well when each partition value still holds a high volume of data, while bucketing is the better choice when the volume of data per value of the column is low. A sketch of both on one table follows below.
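As an illustration, here is a minimal sketch of a table that is both partitioned and bucketed (the table name and columns are hypothetical):
# partition by a low-cardinality column, bucket by a high-cardinality one
create table sales (
sale_id INT,
employee_id INT,
sale_date STRING,
amount DOUBLE
)
partitioned by (country STRING)
clustered by (employee_id) into 8 buckets;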
Map Join
A full outer join cannot be converted to a map join; only a left or right outer join can be converted, and only if the small table is under 25 MB in size (the default of hive.mapjoin.smalltable.filesize).
We can also use a hint to force a map join in Hive. In the bucket map join example below, the smaller table (b2) is the one put in the hint, forcing it to be cached on the mappers manually.
Bucket Map Join
For a bucket map join, both tables must be bucketed on the join column, and the number of buckets in one table must be a multiple of the number of buckets in the other. For example, if one table has 2 buckets, then the other table must have 2, 4, 6, or any other multiple of 2. In this case, only the matching buckets of the small table are replicated (cached) on each mapper, and the join is performed entirely in the map phase, similar to a plain map join.
Bucket map join is disabled by default in Hive. We need to enable it and request it with a hint, like below:
set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(b2) */ b1.* from b1,b2 where b1.col0=b2.col0;
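For context, a minimal sketch of how the two tables in the hint example might be created (the column names and bucket counts are assumptions, not from the source):
# both tables bucketed on the join column; bucket counts are multiples of each other
create table b1 (col0 INT, col1 STRING) clustered by (col0) into 2 buckets;
create table b2 (col0 INT, col1 STRING) clustered by (col0) into 4 buckets;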
Skewed join
https://weidongzhou.wordpress.com/2017/06/08/join-type-in-hive-skewed-join/
Skewed joins are useful when one of the tables in the join is skewed on the joining column. Suppose there are 2 tables A and B, and both have the skewed value "mytest" in the joining column, with table B having fewer rows for that key than table A.
The first step is to scan table B and save all rows with the key "mytest" in an in-memory hashtable. Then a set of mappers reads table A and performs the following tasks:
1. If a row has the skewed key, it uses the hashed version of B for the join; here a map-side join is performed.
2. Otherwise, it sends the row to a reducer, which performs the join; the same reducer gets the matching rows of table B by scanning it, as in a normal join.
To use a skewed join, you need to understand your data and query. Set the parameter hive.optimize.skewjoin to true. The parameter hive.skewjoin.key is optional; it is the number of rows per key above which a key is treated as skewed, and it defaults to 100000. A configuration sketch follows below.
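A minimal sketch of enabling it (the table and column names are hypothetical):
# treat any join key with more than 100000 rows as skewed
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
select a.*, b.* from table_a a join table_b b on a.joincol = b.joincol;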
Hive Serde
https://medium.com/@gohitvaranasi/how-does-apache-hive-serde-work-behind-the-scenes-a-theoretical-approach-e67636f08a2a
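In short, a SerDe (serializer/deserializer) tells Hive how to turn the bytes of a file into rows and columns and back. A minimal sketch using the built-in OpenCSVSerde (the table and columns are hypothetical):
# parse CSV files with the built-in OpenCSVSerde (all columns are read as strings)
create table csv_orders (order_id STRING, order_status STRING)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
stored as textfile;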
Sqoop
Sqoop is a tool designed to transfer data between HDFS and relational databases. It is used to import data from relational databases such as MySQL or Oracle into Hadoop HDFS, and to export data from the Hadoop file system to relational databases.
Configuration
To configure Sqoop with Hadoop, edit the sqoop-env.sh file in the $SQOOP_HOME/conf directory. First, change to the Sqoop config directory and rename the template file using the following commands:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
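Then set the Hadoop locations inside sqoop-env.sh; for example (the paths below are assumptions for illustration):
# sqoop-env.sh - point Sqoop at the local Hadoop installation (example paths)
export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce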
To let Sqoop talk to MySQL, also place the MySQL JDBC connector jar in Sqoop's lib directory:
# cd mysql-connector-java-5.1.30
# mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Eval – To run any query against a database server, we can use the sqoop eval command. It can be used to run a select query or an insert query.
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
--query "SELECT * FROM employee LIMIT 3"
$ sqoop eval \
--connect jdbc:mysql://localhost/db \
--username root \
-e "INSERT INTO employee VALUES(1007,'Gem','UI dev',15000,'TP')"
Import
https://data-flair.training/blogs/sqoop-import/
Sqoop import is used to move data from an RDBMS to HDFS or a Hive table.
The command below imports the result of a free-form query from MySQL to an HDFS location.
$ sqoop import \
--connect jdbc:uri \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
-m 1 \
--target-dir /user/foo/joinresults
-m is the number of mapper tasks to execute in parallel (with more than one mapper, a free-form query also needs --split-by)
--target-dir is the HDFS target directory
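For comparison, a plain single-table import looks like this (the connection string, credentials, and table are assumptions for illustration):
$ sqoop import \
--connect jdbc:mysql://localhost/test \
--username root \
--table employee \
--target-dir /user/foo/employee \
--split-by id \
-m 2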
Incremental import
https://medium.com/datadriveninvestor/incremental-data-load-using-apache-sqoop-3c259308f65c
Sqoop jobs store metadata such as last-value, incremental-mode, file-format, output-directory, etc., which acts as the reference point when loading data incrementally.
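A minimal sketch of an incremental append import (the connection details, table, and starting last-value are assumptions):
$ sqoop import \
--connect jdbc:mysql://localhost/test \
--username root \
--table employee \
--target-dir /user/foo/employee \
--incremental append \
--check-column id \
--last-value 1005
# pulls only rows with id > 1005; saving this as a sqoop job makes
# Sqoop record the new last-value automatically after each run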
Export
https://sites.google.com/site/hadoopbigdataoverview/sqoop/sqoop-export-commands
Export will move data from an HDFS file or Hive table to an RDBMS.
Example - the command below inserts new rows and updates existing rows based on the column id (update and insert, i.e. upsert):
sqoop export \
--connect jdbc:mysql://localhost/test \
--username root \
--table employee \
--export-dir /sqooptest/dataset.txt \
--update-key id \
--update-mode allowinsert
We can also perform validation, i.e. check the row counts at source and target, using the --validate option. Validation in Sqoop exists mainly to verify the data copied: after either an import or an export, it compares the row counts of the source and the target.
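A minimal sketch (the connection details and table are assumptions):
$ sqoop import \
--connect jdbc:mysql://localhost/test \
--username root \
--table employee \
--target-dir /user/foo/employee \
--validate
# compares source and target row counts after the copy and aborts on mismatch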