Cloudera - Hive
Class will start by refreshing the previous class with Q&A…. (30 min)
(What is Cloudera / its components / Hive commands)
Remember
The Hive shell prompt looks like "hive>" in Linux (or use Beeline) # uses HiveServer2
!connect jdbc:hive2:// # connect to HiveServer2
To terminate the Hive shell, use the command exit; or quit;
All Hive commands end with " ; "
A successful command execution will show "OK" and the time taken for the execution, along with other data.
Hive commands shown below were written in a red/blue font just to make them easy to recognize as Hive commands.
1. show databases; # The command will show the names of the databases created
2. use my_db; # any table created from now on will be under this db
3. show tables; # The command will show the names of the tables created
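The list above uses my_db and, further down, a table employee that is never created in these notes; a minimal sketch of the steps that likely came first (the employee schema is an assumption):

```sql
-- Hypothetical setup for the session above (names taken from the notes)
create database my_db;
use my_db;
create table employee(id int, name string);
show tables;   -- should now list "employee"
```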
Alter table
1. alter table employee rename to emp; # Will change the name.
2. show tables; # The command will show the name of the table altered.
3. insert into emp values (101, 'ram'), (102, 'shyam'); # insert values directly
Drop table
1. drop table emp;
# Will delete the table mentioned. If it is an internal (managed) table, it deletes the schema and the data
altogether. For an external table, the data is not lost; only the schema is deleted.
2. DROP DATABASE my_db CASCADE; # will delete the db along with tables under it.
# Note: if the database has no tables, it can be deleted by the DROP DATABASE my_db; command alone.
# Create a text file in desktop and give name with file extension .txt
1. go to desktop
2. open a file
3. type
beti 3
bitti 5
bute 7
4. name the file as name_age
hive> create table age(name string, age int)
row format delimited
fields terminated by '\t'
stored as textfile;
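To actually populate the age table with the name_age file, the usual command is load data; the path below is an assumption (adjust it to wherever the file was actually saved):

```sql
-- Load the tab-delimited desktop file into the "age" table
-- (path is an assumption; point it at the real location of name_age)
load data local inpath '/home/hadoopuser/Desktop/name_age.txt'
overwrite into table age;

select * from age;   -- should return the three rows typed above
```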
Partitioning
A Hive partition is a way of splitting a table into sub-tables based on the values of specific columns such as date, city,
department, year, etc. (as decided or commanded).
Using partitions, it is easy to query a portion of the data, since the query engine has to scan less data.
Inserting input data files individually into a partitioned table is static partitioning.
Static partitioning saves time in loading data compared to dynamic partitioning.
The static mode is set by default in hive-site.xml.
We can perform static partitioning on a Hive managed table or an external table.
Static Partitioning
# At first create a file
Be logged in as hadoopuser.
nano /tmp/info # open the editor to create a file "info" under the dir "/tmp"
101,Rahim,Abdul,TP,2012
102,Costa,Joseph,HR,2012
103,Akter,Gulnahar,TP,2012
104,Johar,Kiran,SC,2013
105,Mahmood,Faisal,HR,2013
106,Roma,Sharmin,HR,2013
107,Akter,Shorna,PR,2014
108,Rajen,Krish,SL,2014
109,Fajle,Kabeer,PR,2014
# Open hive shell, or open a new terminal and activate hive shell
hive> create table tab_1(id int, first_name string, last_name string, dept string)
partitioned by (year string)
row format delimited
fields terminated by ','; # Create the "tab_1" schema with a partition key
# Please note that the schema shows an additional column "year". Hive creates this column automatically from the
partitioned-by key, so it must not be repeated in the main column list.
# By default Hive saves files under /user/hive/warehouse in HDFS, therefore this table tab_1 can be viewed in the default
directory.
# Now we have to split the file "info" into 3 files info1, info2, and info3 containing the data of years 2012, 2013, and 2014.
Then start loading data into tab_1 with the following commands:
load data local inpath '/tmp/info1' overwrite into table tab_1 partition (year='2012');
load data local inpath '/tmp/info2' overwrite into table tab_1 partition (year='2013');
load data local inpath '/tmp/info3' overwrite into table tab_1 partition (year='2014');
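After the three loads, the partitions can be verified from the Hive shell; a quick check might look like:

```sql
show partitions tab_1;                     -- expect year=2012, year=2013, year=2014
select * from tab_1 where year = '2013';   -- scans only the 2013 partition
```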
Dynamic Partitioning
hive> create table tab_part(id int, fname string, lname string, dept string)
partitioned by (year string)
row format delimited
fields terminated by ',';
# The actual processing and formation of partitions based on year as the partition key happens at insert time.
# There are going to be 3 partition directories in HDFS storage, named by year. We can check this.
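With dynamic partitioning, the partitions are derived from the data at insert time rather than named in the load command. A sketch, assuming the data was first staged in the tab_1 table from the static section (the two set commands are the standard ones for enabling dynamic partitions):

```sql
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

-- Hive routes each row to the partition matching its last selected column (year)
insert overwrite table tab_part partition (year)
select id, first_name, last_name, dept, year from tab_1;
```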
Bucketing
If we partition a table date-wise and it holds data for 5 years, there will be 5 × 365 = 1,825 partitions or
sub-tables. Working with that many partitions becomes impractical.
It may also happen that some partitions hold very little data, so they bring no real benefit.
What is Bucketing
To overcome this problem with partitioning, Hive provides the bucketing concept, which allows the user to divide table data sets
into more manageable parts (buckets).
Thus, Hive bucketing is a way of dividing data into a specified number of buckets, allowing the user to manipulate data in a more
manageable way.
The user can set the number of buckets (the size of the manageable parts), and bucketing is very helpful for data sampling.
Using partitioning and bucketing together, querying a portion of the data becomes more efficient, since the query engine has to
scan only a specified zone with less data.
We can combine partitioning and bucketing in one create-table command, or we can use only bucketing if required.
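A sketch of combining both in one create-table statement (table and column names here are illustrative, not from the notes):

```sql
-- Partition by year, then bucket each partition's rows by hashing id
create table emp_part_buck(id int, name string, dept string)
partitioned by (year string)
clustered by (id) into 4 buckets
row format delimited
fields terminated by ',';
```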
Today we shall practice Hive on Cloudera.
Be in the cloudera home directory.
nano student # create the sample data file
hive> set hive.exec.dynamic.partition = true;
hive> set hive.exec.dynamic.partition.mode = nonstrict;
hive> set hive.enforce.bucketing=true;
Browse localhost:50070.
Check /user/hive/warehouse/buck_std……….
Bucketing only
# We can also do bucketing alone, without partitioning, but first we have to enable bucketing as below
hive> set hive.enforce.bucketing=true;
hive> create table buckstd(id int, name string, year int, dept string)
clustered by (id) into 5 buckets
row format delimited
fields terminated by ','
stored as textfile;
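load data does not redistribute rows into buckets; a bucketed table is normally populated with insert … select from a plain staging table. A sketch, assuming the student file is first loaded into a non-bucketed staging table (the name student_stage and the file path are assumptions):

```sql
-- Stage the raw file first, then let Hive hash rows into the 5 buckets
create table student_stage(id int, name string, year int, dept string)
row format delimited fields terminated by ',';
load data local inpath '/home/cloudera/student' into table student_stage;

set hive.enforce.bucketing = true;
insert overwrite table buckstd select * from student_stage;
```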
# We can browse localhost:50070 to check….
Hive comes with a comprehensive library of built-in functions. Yet, in reality we may need to apply some
function that is not available among Hive's built-in commands or library. In that case, we have to create custom functions to
process records; such a function is called a User Defined Function, or UDF.
1. Regular UDF: operates on a single row and produces a single row as its output.
2. User Defined Aggregate Function (UDAF): works on multiple input rows and creates a single output row.
3. User Defined Table-generating Function (UDTF): operates on a single row and produces multiple rows, i.e.,
a table.
1. A regular UDF extends the class org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method, written with an argument when creating the Java program/function.
For example (the class name Lower is illustrative):
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}
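Once the class is compiled and packaged into a jar, it is registered and called from the Hive shell roughly as follows (the jar path and function name are assumptions; the class name must match the compiled UDF class):

```sql
add jar /tmp/my_udf.jar;
create temporary function my_lower as 'Lower';  -- fully qualified class name of the UDF
select my_lower(name) from age;
```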
https://www.youtube.com/watch?v=ncai9SNCE2c