Cloudera - Hive
Class will start by refreshing the previous class with Q&A…. (30 min)
(What is Cloudera / its components / Hive commands)
Remember
The Hive shell prompt looks like "hive>" in Linux (or use Beeline) # uses HiveServer2
!connect jdbc:hive2:// # connect to HiveServer2
To terminate the Hive shell, use the command exit; or quit;
All Hive commands end with " ; "
A successful command execution will show "OK" and the time taken for the execution, along with other data.
Hive commands shown below were written in a red/blue font just to make them easy to recognize as Hive commands.
1. show databases; # The command will show the names of the databases created
2. use my_db; # any table created from now on will be under this db
3. show tables; # The command will show the names of the tables created
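The list above uses my_db and, further down, a table employee that is never created in these notes; a minimal sketch of the steps that likely came first (the employee schema is an assumption):

```sql
-- Hypothetical setup for the session above (names taken from the notes)
create database my_db;
use my_db;
create table employee(id int, name string);
show tables;   -- should now list "employee"
```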
Alter table
1. alter table employee rename to emp; # Will change the name.
2. show tables; # The command will show the name of the table altered.
3. insert into emp values (101, 'ram'), (102, 'shyam'); # insert values directly
Drop table
1. drop table emp;
# Will delete the table mentioned. If it is an internal (managed) table, it deletes the schema and the data
altogether. For an external table, the data is not lost; only the schema is deleted.
2. DROP DATABASE my_db CASCADE; # will delete the db along with tables under it.
# Note: if the database has no tables, it can be deleted by the DROP DATABASE my_db; command alone.
# Create a text file in desktop and give name with file extension .txt
1. go to desktop
2. open a file
3. type
beti 3
bitti 5
bute 7
4. name the file as name_age
hive> create table age(name string, age int)
row format delimited
fields terminated by '\t'
stored as textfile;
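To actually populate the age table with the name_age file, the usual command is load data; the path below is an assumption (adjust it to wherever the file was actually saved):

```sql
-- Load the tab-delimited desktop file into the "age" table
-- (path is an assumption; point it at the real location of name_age)
load data local inpath '/home/hadoopuser/Desktop/name_age.txt'
overwrite into table age;

select * from age;   -- should return the three rows typed above
```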
Partitioning
A Hive partition is a way of splitting a table into sub-tables based on the values of specific columns such as date, city,
department, year, etc. (as decided or commanded).
Using partitions, it is easy to query a portion of the data, since the query engine has to scan less data.
Inserting input data files individually into a partitioned table is static partitioning.
Static partitioning saves time in loading data compared to dynamic partitioning.
The static mode is set by default in hive-site.xml.
We can perform static partitioning on a Hive managed table or an external table.
Static Partitioning
# At first create a file
Be logged in as hadoopuser.
nano /tmp/info # open the editor to create a file "info" under the dir "/tmp"
101,Rahim,Abdul,TP,2012
102,Costa,Joseph,HR,2012
103,Akter,Gulnahar,TP,2012
104,Johar,Kiran,SC,2013
105,Mahmood,Faisal,HR,2013
106,Roma,Sharmin,HR,2013
107,Akter,Shorna,PR,2014
108,Rajen,Krish,SL,2014
109,Fajle,Kabeer,PR,2014
# Open hive shell, or open a new terminal and activate hive shell
hive> create table tab_1(id int, first_name string, last_name string, dept string)
partitioned by (year string)
row format delimited
fields terminated by ','; # Create the "tab_1" schema with a partition key
# Please note that the schema shows an additional column "year". Hive creates this column automatically from the
partitioned-by key, so it must not be repeated in the main column list.
# By default Hive saves files under /user/hive/warehouse in HDFS, therefore this table tab_1 can be viewed in the default
directory.
# Now we have to split the file "info" into 3 files info1, info2, and info3 containing the data of years 2012, 2013, and 2014.
Then start loading data into tab_1 with the following commands:
load data local inpath '/tmp/info1' overwrite into table tab_1 partition (year='2012');
load data local inpath '/tmp/info2' overwrite into table tab_1 partition (year='2013');
load data local inpath '/tmp/info3' overwrite into table tab_1 partition (year='2014');
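After the three loads, the partitions can be verified from the Hive shell; a quick check might look like:

```sql
show partitions tab_1;                     -- expect year=2012, year=2013, year=2014
select * from tab_1 where year = '2013';   -- scans only the 2013 partition
```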
Dynamic Partitioning
hive> create table tab_part(id int, fname string, lname string, dept string)
partitioned by (year string)
row format delimited
fields terminated by ',';
# The actual processing and formation of partitions based on year as the partition key happens at insert time.
# There are going to be 3 partition directories in HDFS storage, named by year. We can check this.
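With dynamic partitioning, the partitions are derived from the data at insert time rather than named in the load command. A sketch, assuming the data was first staged in the tab_1 table from the static section (the two set commands are the standard ones for enabling dynamic partitions):

```sql
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

-- Hive routes each row to the partition matching its last selected column (year)
insert overwrite table tab_part partition (year)
select id, first_name, last_name, dept, year from tab_1;
```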
Bucketing
If we partition a table date-wise and it holds data for 5 years, there will be 5 × 365 = 1,825 partitions or
sub-tables. Working with that many partitions becomes impractical.
It may also happen that some partitions hold very little data, so they bring no real benefit.
What is Bucketing
To overcome this problem with partitioning, Hive provides the bucketing concept, which allows the user to divide table data sets
into more manageable parts (buckets).
Thus, Hive bucketing is a way of dividing data into a specified number of buckets, allowing the user to manipulate data in a more
manageable way.
The user can set the number of buckets (the size of the manageable parts), and bucketing is very helpful for data sampling.
Using partitioning and bucketing together, querying a portion of the data becomes more efficient, since the query engine has to
scan only a specified zone with less data.
We can combine partitioning and bucketing in one create-table command, or we can use only bucketing if required.
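A sketch of combining both in one create-table statement (table and column names here are illustrative, not from the notes):

```sql
-- Partition by year, then bucket each partition's rows by hashing id
create table emp_part_buck(id int, name string, dept string)
partitioned by (year string)
clustered by (id) into 4 buckets
row format delimited
fields terminated by ',';
```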
Today we shall practice Hive on Cloudera.
Be in the cloudera home directory.
nano student # create the sample data file
hive> set hive.exec.dynamic.partition = true;
hive> set hive.exec.dynamic.partition.mode = nonstrict;
hive> set hive.enforce.bucketing=true;
Browse localhost:50070.
Check /user/hive/warehouse/buck_std……….
Bucketing only
# We can also do bucketing alone, without partitioning, but first we have to enable bucketing as below
hive> set hive.enforce.bucketing=true;
hive> create table buckstd(id int, name string, year int, dept string)
clustered by (id) into 5 buckets
row format delimited
fields terminated by ','
stored as textfile;
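load data does not redistribute rows into buckets; a bucketed table is normally populated with insert … select from a plain staging table. A sketch, assuming the student file is first loaded into a non-bucketed staging table (the name student_stage and the file path are assumptions):

```sql
-- Stage the raw file first, then let Hive hash rows into the 5 buckets
create table student_stage(id int, name string, year int, dept string)
row format delimited fields terminated by ',';
load data local inpath '/home/cloudera/student' into table student_stage;

set hive.enforce.bucketing = true;
insert overwrite table buckstd select * from student_stage;
```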
# We can browse localhost:50070 to check….
Hive comes with a comprehensive library of built-in functions. Yet, in reality we may need to apply some
function that is not available among Hive's built-in commands or library. In that case, we have to create custom functions to
process records; such a function is called a User Defined Function, or UDF.
1. Regular UDF: operates on a single row and produces a single row as its output.
2. User Defined Aggregate Function (UDAF): works on multiple input rows and creates a single output row.
3. User Defined Table-generating Function (UDTF): operates on a single row and produces multiple rows, i.e.,
a table.
1. A regular UDF extends the class org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method, written with an argument when creating the Java program/function.
For example (the class name Lower is illustrative):
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
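Once the class is compiled and packaged into a jar, it is registered and called from the Hive shell roughly as follows (the jar path and function name are assumptions; the class name must match the compiled UDF class):

```sql
add jar /tmp/my_udf.jar;
create temporary function my_lower as 'Lower';  -- fully qualified class name of the UDF
select my_lower(name) from age;
```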
https://www.youtube.com/watch?v=ncai9SNCE2c