
LESSON-14

Cloudera - Hive
• The class starts with a review of the previous class and Q&A (30)
(What is Cloudera / its components / Hive commands)

• Ensure all students have successfully installed Cloudera (30)

Today’s topics (in Cloudera):

Lesson Plan

1. A few practice exercises on Hive over Cloudera
2. Create, load and display an internal table
3. Hive partitioning
4. Hive bucketing (practice in Cloudera)
5. UDF (Hive functions)

Remember

• The Hive shell prompt looks like “hive>” in Linux. Alternatively, use beeline, which works through HiveServer2.
• !connect jdbc:hive2:// # in beeline: connect to HiveServer2
• To terminate the Hive shell, use the command exit; or quit;
• All Hive commands end with “;”
• A successful command prints “OK” and the time taken for the execution, along with any other output.
• Hive commands shown below are written in red/blue font just to make them easy to recognise.

Create, show, use and display database


Create database

1. create database my_db; # create a database “my_db”

2. show databases; # The command will show the name of the database created

3. use my_db; # any table created from now on will be under this db

4. set hive.cli.print.current.db=true; # displays the current db in the command prompt

Create, show, alter, insert value and drop table and db


Create table

1. create table employee(empid int, empname string); # create a table

2. show tables; # The command will show the name of the table created

Alter table
1. alter table employee rename to emp; # changes the table name

2. show tables; # shows the new table name

3. insert into emp values (101, 'ram'), (102, 'shyam'); # insert values directly

4. select * from emp; # shows the table contents

Drop table
1. drop table emp;

# Deletes the named table. For an internal (managed) table, both the schema and the data are deleted.
For an external table, only the schema is deleted; the data is not lost.

2. DROP DATABASE my_db CASCADE; # deletes the db along with the tables under it

3. show databases; # the deleted database no longer appears

# Note: if the database has no tables, DROP DATABASE my_db; alone is enough to delete it.

Create, load and display internal table

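# Create a text file on the desktop

1. go to the desktop
2. open a new file
3. type the lines below, pressing TAB between name and age (the table uses '\t' as its field delimiter)
beti 3
bitti 5
bute 7
4. save the file as name_age (the load command in step 5 uses exactly this name; if you save it with the .txt extension, use name_age.txt in the path)
hive> create table age(name string, age int)
row format delimited
fields terminated by '\t'
stored as textfile;

5. load data local inpath '/home/cloudera/Desktop/name_age' into table age; # load the file into the table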


6. select * from age; # Show the table

Partitioning

Hive partitioning is a way of splitting a table into sub-tables (partitions) based on the values of specific columns such as date, city, department or year, as chosen when the table is created.

With partitioning, querying a portion of the data is faster, because the query engine has to scan less data.

There are 2 types of Hive partitioning:


1. Static partitioning or manual partitioning
2. Dynamic partitioning

Hive Static Partitioning

• Loading input data files one by one into a partitioned table, naming the target partition each time, is static partitioning.
• Static partitioning saves time in loading data compared to dynamic partitioning.
• Static mode is the default: hive.exec.dynamic.partition.mode is strict unless changed (for example in hive-site.xml).
• We can perform static partitioning on a Hive managed table or an external table.

Hive Dynamic Partitioning

• A single insert into a partitioned table, where Hive works out the partitions from the data, is known as dynamic partitioning.
• Usually, dynamic partitioning loads the data from a non-partitioned table.
• Dynamic partitioning takes more time in loading data compared to static partitioning.
• Dynamic partitioning is suitable when a large amount of data is already stored in a table.
• We can’t perform alter on the Dynamic partition.
• We can perform dynamic partitioning on a Hive external table or a managed table.
• We have to enable the dynamic partitioning properties, as below:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

Static Partitioning
# First create a file

# Open the hadoop terminal

Be in hadoopuser

(if not, run su - hadoopuser)

nano /tmp/info # open the editor to create a file “info” under the dir “/tmp”
101,Rahim,Abdul,TP,2012
102,Costa,Joseph,HR,2012
103,Akter,Gulnahar,TP,2012
104,Johar,Kiran,SC,2013
105,Mahmood,Faisal,HR,2013
106,Roma,Sharmin,HR,2013
107,Akter,Shorna,PR,2014
108,Rajen,Krish,SL,2014
109,Fajle,Kabeer,PR,2014

# Ctrl+X, Y, then Enter # save the file

# Open the hive shell, or open a new terminal and start the hive shell

hive # type hive and hit Enter; the prompt changes to hive>

hive> create table tab_1(id int, first_name string, last_name string, dept string)
partitioned by (year string)
row format delimited
fields terminated by ','; # create the “tab_1” schema with partition key year; the partition column must not be repeated in the column list

hive> describe tab_1; # View the schema for “tab_1”

# Note that the schema shows an additional column “year”. Hive adds this column automatically from the partitioned by key.

# By default Hive saves tables under /user/hive/warehouse in HDFS, so tab_1 can be viewed in that default directory.

# Now we have to split the file “info” into 3 files info1, info2 and info3 holding the data for 2012, 2013 and 2014 respectively, and then load them into tab_1 with the following commands
load data local inpath '/tmp/info1' overwrite into table tab_1 partition (year='2012');
load data local inpath '/tmp/info2' overwrite into table tab_1 partition (year='2013');
load data local inpath '/tmp/info3' overwrite into table tab_1 partition (year='2014');
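
The split of “info” into per-year files can also be scripted instead of done by hand. The following is a minimal Python sketch and not part of the original lesson: the function name split_by_year is hypothetical, and it assumes the year is the last comma-separated field of each line, as in the sample data above.

```python
from collections import defaultdict

def split_by_year(lines):
    """Group comma-separated records by their last field (the year)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        year = line.rsplit(",", 1)[-1]
        groups[year].append(line)
    return dict(groups)

sample = [
    "101,Rahim,Abdul,TP,2012",
    "104,Johar,Kiran,SC,2013",
    "107,Akter,Shorna,PR,2014",
    "102,Costa,Joseph,HR,2012",
]
by_year = split_by_year(sample)
for year in sorted(by_year):
    print(year, len(by_year[year]))
```

Writing each group to its own file in year order would give the three files that the load commands expect.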

Dynamic Partitioning

# Creation of table tab_1 (if tab_1 from the static partitioning exercise still exists, drop it first with drop table tab_1;)

hive> create table tab_1(id int, first_name string, last_name string, dept string, year string)
row format delimited
fields terminated by ',';

# Loading data into the created table

hive> load data local inpath '/tmp/info' overwrite into table tab_1;

# Creation of partition table

hive> create table tab_part(id int, fname string, lname string, dept string)
PARTITIONED BY(year string)
row format delimited
fields terminated by ',';

# For dynamic partitioning we have to enable it and set the partition mode to nonstrict


hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

# Loading data into partition table


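hive> INSERT OVERWRITE TABLE tab_part
PARTITION(year)
SELECT id,first_name,last_name,dept,year
from tab_1;
# tab_1 names its columns first_name and last_name; they fill fname and lname of tab_part by position

# Hive now does the actual processing and forms the partitions, using year as the partition key

# There will be 3 partition outputs in HDFS storage, one directory per year (year=2012, year=2013, year=2014). We can check this.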
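
To picture what the dynamic insert produces, the Python sketch below routes each row to a directory named after its partition value. It is purely illustrative and not Hive's actual code. The path /user/hive/warehouse/tab_part assumes the default warehouse location and the default database, and the function name partition_dirs is hypothetical.

```python
def partition_dirs(rows, table_dir="/user/hive/warehouse/tab_part", key="year"):
    """Map each partition directory to the rows stored in it.

    The partition value is taken from the last field of each row. It is
    kept in the directory name, not in the stored data.
    """
    layout = {}
    for row in rows:
        *data, value = row.split(",")
        path = f"{table_dir}/{key}={value}"
        layout.setdefault(path, []).append(",".join(data))
    return layout

rows = [
    "101,Rahim,Abdul,TP,2012",
    "104,Johar,Kiran,SC,2013",
    "105,Mahmood,Faisal,HR,2013",
    "107,Akter,Shorna,PR,2014",
]
for path, stored in sorted(partition_dirs(rows).items()):
    print(path, stored)
```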

Bucketing

If we partition a table by date and the data covers 5 years, there will be 5 x 365 = 1825 partitions (sub-tables). Working with that many partitions becomes impractical.

It may also happen that some partitions hold very little data, so partitioning brings no benefit for them.

What is Bucketing

To overcome this problem of partitioning, Hive provides the concept of bucketing, which lets the user divide a table's data into more manageable parts (buckets).

Thus, Hive bucketing is a way of dividing data into a specified number of buckets, so that the user can work with the data in a more manageable way.

The user sets the number of buckets, and each row is assigned to a bucket from the value of the bucketing column. Bucketing is also very helpful for data sampling.

With partitioning and bucketing, querying a portion of the data becomes more efficient, because the query engine only has to scan a specific zone with less data.

We can apply both partitioning and bucketing in one create table command, or apply only bucketing if required.
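
As a rough illustration of how rows are spread over buckets, the Python sketch below assigns each row to bucket hash(id) mod number-of-buckets. It assumes the simple scheme in which the hash of an integer column is the integer itself. Actual Hive versions may use a different hash function, so the bucket numbers are illustrative only. The function name bucket_for is hypothetical.

```python
def bucket_for(key, num_buckets):
    """Return the bucket index for an integer key (hash taken as the key itself)."""
    # Masking keeps the value non-negative before taking the remainder.
    return (key & 0x7FFFFFFF) % num_buckets

ids = [101, 102, 103, 104, 105, 106]
buckets = {}
for i in ids:
    buckets.setdefault(bucket_for(i, 5), []).append(i)
print(buckets)
```

Because the bucket depends only on the key, all rows with the same id always land in the same bucket, which is what makes sampling by bucket possible.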

Today we shall practice Hive on Cloudera

o Create a file “student” in the /home/cloudera path with the nano student command
o Open the browser in Cloudera, sign in to Gmail, and use my mail to copy the text and create the file
o Create table “std” with schema (id int, name string, year int, dept string)
o Load data into table “std” from file “student”
o Enable dynamic partitioning with two commands
o Create table “buck_std” with partitioning (by dept) and bucketing (by id)
o Load data into table “buck_std” from table “std”
o Check in the terminal with the hadoop fs -ls command

Be in cloudera home
nano student

# copy text from my mail and paste it in the file.

Switch to another terminal and start the hive shell

hive> create table std(id int, name string, year int, dept string)
row format delimited
fields terminated by ','
stored as textfile;

hive> load data local inpath '/home/cloudera/student' into table std;


hive> select * from std;

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.enforce.bucketing=true;

hive> create table buck_std(id int, name string, year int)
partitioned by(dept string)
clustered by(id) into 5 buckets
row format delimited
fields terminated by ','
stored as textfile;
# Note that dept, the partition key, does not appear in the column list of create table; it is declared in partitioned by. In the select that loads the table, the partition key is kept in the last position.

hive> insert into table buck_std
partition(dept)
select id,name,year,dept
from std;

Browse localhost:50070
Check /user/hive/warehouse/buck_std……….

Switch to the cloudera home terminal

hadoop fs -ls /
hadoop fs -ls /user/hive/warehouse
hadoop fs -ls /user/hive/warehouse/buck_std/……

# Finally we reach the bucket file 000000_0 and can view it with the hadoop fs -cat command
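
The listing reached by these commands can be pictured with the following Python sketch. It rests on several assumptions: the default warehouse directory, partition directories of the form dept=<value>, and one file per bucket named 000000_0, 000001_0 and so on (the lesson itself only names 000000_0). The dept values HR and TP are hypothetical, since the contents of the student file are not shown here, and the function name bucket_file_paths is also hypothetical.

```python
def bucket_file_paths(table_dir, part_key, part_values, num_buckets):
    """List the bucket files expected under each partition directory."""
    return [
        f"{table_dir}/{part_key}={value}/{bucket:06d}_0"
        for value in part_values
        for bucket in range(num_buckets)
    ]

# Hypothetical department values; 5 buckets as in the create table command.
paths = bucket_file_paths("/user/hive/warehouse/buck_std", "dept", ["HR", "TP"], 5)
for p in paths:
    print(p)
```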

Bucketing only

# We can also do bucketing alone, without partitioning, but first we have to enable bucketing as below
hive> set hive.enforce.bucketing=true;
hive> create table buckstd(id int, name string, year int, dept string)
clustered by(id) into 5 buckets
row format delimited
fields terminated by ','
stored as textfile;

hive> insert overwrite table buckstd select * from std;

# We can browse localhost:50070 to check….

Hive functions and UDF

There are 3 types of built-in Hive functions

1. Standard functions: like round(), concat(), reverse(), ucase(), floor() etc.

select round(1.1111, 2); or select concat(fname, lname) as full_name from emp;

2. Aggregate functions: like sum(), avg(), min(), max() etc.

select sum(sal) from payroll;

3. Table-generating functions: like explode()

select id, name, skill from employee lateral view explode(skills) s as skill;
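
The difference between the three kinds of function is mainly in how many rows go in and come out. The Python sketch below mimics that behaviour on a small in-memory list. It is only an analogy, not Hive code: the employee rows, the sal field and the helper explode_rows are all hypothetical.

```python
def explode_rows(rows, array_field, out_field):
    """Mimic lateral view explode(): emit one output row per array element."""
    result = []
    for row in rows:
        for item in row[array_field]:
            new_row = {k: v for k, v in row.items() if k != array_field}
            new_row[out_field] = item
            result.append(new_row)
    return result

employees = [
    {"id": 1, "name": "ram", "sal": 100, "skills": ["hive", "pig"]},
    {"id": 2, "name": "shyam", "sal": 250, "skills": ["sqoop"]},
]

# Standard function: one output value per input row, like ucase(name).
upper_names = [e["name"].upper() for e in employees]

# Aggregate function: many input rows collapse to one value, like sum(sal).
total_sal = sum(e["sal"] for e in employees)

# Table-generating function: one input row can become several output rows.
exploded = explode_rows(employees, "skills", "skill")

print(upper_names, total_sal, len(exploded))
```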

Hive thus comes with a comprehensive library of functions. Yet in practice we may need a function that is not available in the Hive built-in library. In that case we have to create a custom function to process records, which is called a User Defined Function or UDF.

There are 3 types of UDF

1. Regular UDF: operates on a single row and produces a single row as its output.
2. User Defined Aggregate Function or UDAF: works on multiple input rows and creates a single output row.
3. User Defined Table-generating Function or UDTF: operates on a single row and produces multiple rows, i.e. output in table format.

How to create UDF

Java environment: Import libraries -> Create Java program -> Convert it to a jar file
Hive shell: Add the jar file -> Create temp function

If we are going to create a UDF we must follow 2 rules

1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF. Therefore, before creating the UDF we must add the imports below in the Java program

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

2. A UDF must implement at least one evaluate() method, written as a Java function with its argument

public Text evaluate(final Text s) {
    if (s == null) { return null; } // pass null input through unchanged
    return new Text(s.toString().toLowerCase()); // convert to lower case
}

Let us list what to do

a) Import org.apache.hadoop.hive.ql.exec.UDF; and org.apache.hadoop.io.Text;
b) Create a Java program with a class, in which we write our UDF
c) Configure the build path of the class by adding the external hadoop jars
d) Add all the external hive jars
e) Write the Java function (evaluate) with its argument
f) Once the project or program is complete, export it to create a jar file
g) Add this jar file in the Hive shell (using the add jar command)
h) Finally, create a temporary function (say lower_letter)
i) Use it to convert upper-case letters to lower-case letters.
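
Before packaging the jar, the expected behaviour of lower_letter can be pinned down with a few quick checks. The Python sketch below only mirrors the logic of the Java evaluate() method from rule 2 (null in, null out; otherwise lower case). It is not a Hive UDF, and the name evaluate is reused here purely for illustration.

```python
def evaluate(s):
    """Mirror of the Java evaluate(): None stays None, text is lower-cased."""
    if s is None:
        return None
    return s.lower()

print(evaluate("HIVE on Cloudera"))
```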

Link Hive UDF

https://www.youtube.com/watch?v=ncai9SNCE2c

Useful link for hive command in cloudera


https://www.youtube.com/watch?v=nsA9LJBi9dE

Partitioning and bucketing


https://www.guru99.com/hive-partitions-buckets-example.html

Bucketing – hindi – Sandeep patil


https://www.youtube.com/watch?v=ObuCKKd_5Aw
