Pig Programming:
Introduction to pig:
Pig represents Big Data as data flows. Pig is a high-level platform or
tool which is used to process large datasets. It provides a high level
of abstraction for processing over MapReduce, along with a high-level
scripting language, known as Pig Latin, which is used to develop the
data analysis code. To process data stored in HDFS, programmers write
scripts using the Pig Latin language. Internally, the Pig Engine (a
component of Apache Pig) converts all these scripts into specific map
and reduce tasks, but these are not visible to the programmers, in
order to provide a high level of abstraction. Pig Latin and Pig Engine
are the two main components of the Apache Pig tool. The result of Pig
is always stored in HDFS.
Need of Pig: One limitation of MapReduce is that the development
cycle is very long. Writing the mapper and reducer, compiling and
packaging the code, submitting the job and retrieving the output is a
time-consuming task. Apache Pig reduces the development time by
using a multi-query approach. Pig is also beneficial for programmers
who do not come from a Java background: 200 lines of Java code can
often be written in only about 10 lines of Pig Latin (a short
illustration follows the list below). Programmers who already know
SQL need less effort to learn Pig Latin.
It uses a multi-query approach, which reduces the length of the code.
Pig Latin is a SQL-like language.
It provides many built-in operators.
It provides nested data types (tuples, bags, maps).
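As a rough illustration of this conciseness, here is a minimal word-count sketch in Pig Latin (the input path and field names are assumptions for the example, not part of the original material):
-- Load lines of text from HDFS (path is illustrative)
lines = LOAD '/data/input.txt' AS (line:chararray);
-- Split each line into individual words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count their occurrences
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE wordcount INTO '/data/wordcount_output';
The equivalent hand-written MapReduce job in Java would need a mapper class, a reducer class and a driver, which is the difference in effort referred to above.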
Evolution of Pig: In 2006, Apache Pig was developed by Yahoo's
researchers. At that time, the main idea behind developing Pig was to execute
MapReduce jobs on extremely large datasets. In 2007, it moved to the
Apache Software Foundation (ASF), which made it an open-source project.
The first version (0.1) of Pig came out in 2008. The latest version of
Apache Pig is 0.17, released in 2017.
Features of pig:
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-
programmers. Pig makes this process easy. In Pig, the queries are
converted to MapReduce internally.
2) Optimization opportunities
The way in which tasks are encoded permits the system to optimize their
execution automatically, allowing the user to focus on semantics
rather than efficiency.
3) Extensibility
Users can write user-defined functions (UDFs) containing their own
logic to execute over the data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators, such as sort, filter and join.
Application of pig:
Pig scripting is used for exploring large datasets.
It provides support for ad-hoc queries across large datasets.
It is used in prototyping algorithms for processing large datasets.
It is used to process time-sensitive data loads.
It is used for collecting large amounts of data in the form of search logs and web
crawls.
It is used where analytical insights are needed through sampling.
Pig Architecture:
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It
is a high-level data processing language which provides a rich set of data types
and operators to perform various operations on the data.
To perform a particular task using Pig, programmers need to write
a Pig script using the Pig Latin language and execute it using any of the
execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these
scripts go through a series of transformations applied by the Pig
Framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs,
and thus, it makes the programmer’s job easy. The architecture of Apache Pig is
shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache
Pig framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out
logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted
order, where they are executed to produce the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex
non-atomic datatypes such as map and tuple. Given below is the
diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known
as an Atom. It is stored as a string and can be used as a string or a
number. int, long, float, double, chararray, and bytearray are the atomic values
of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields
can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of
tuples (non-unique) is known as a bag. Each tuple can have any
number of fields (flexible schema). A bag is represented by ‘{}’. It is
similar to a table in RDBMS, but unlike a table in RDBMS, it is not
necessary that every tuple contain the same number of fields or that
the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be
of type chararray and should be unique. The value might be of any
type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is
no guarantee that tuples are processed in any particular order).
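As a small combined illustration (the values are assumed for the example), a single tuple in a relation can mix atoms, an inner bag, and a map:
(1, Raja, {(9848022338), (9848022339)}, [city#Hyderabad, age#30])
Here 1 and Raja are atomic fields, the inner bag holds two phone-number tuples, and the map stores key-value pairs.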
Pig data types:
Primitive Data type:
Type − Description − Example
int − Signed 32-bit integer − 2
long − Signed 64-bit integer − 15L or 15l
float − 32-bit floating point − 2.5f or 2.5F
double − 64-bit floating point − 1.5 or 1.5e2 or 1.5E2
chararray − Character array (string) − 'hello'
bytearray − BLOB (byte array)
tuple − Ordered set of fields − (12,43)
bag − Collection of tuples − {(12,43),(54,28)}
map − Set of key-value pairs − [open#apache]
int − Signed 32-bit integer, similar to Integer in Java.
long − Signed 64-bit integer, similar to Long in Java.
float − Signed 32-bit floating point, similar to Float in Java.
double − Signed 64-bit floating point, similar to Double in Java.
chararray − An array of characters in Unicode UTF-8 format, similar
to String in Java.
bytearray − A blob of bytes. When the type of a field is not
specified, it defaults to bytearray.
boolean − A value that is either true or false.
Complex Data type
Complex data types are built by combining other data types. The
complex data types are as follows −
Data Type − Definition − Code − Example
Tuple − An ordered set of fields; a tuple is written with parentheses. − (field[, field ...]) − (1,2)
Bag − A group of tuples is called a bag; it is represented by curly braces. − {tuple[, tuple ...]} − {(1,2), (3,4)}
Map − A set of key-value pairs; a map is represented by square brackets. − [key#value] − ['keyname'#'valuename']
Key − The index used to look up an element; the key must be
unique and of type chararray.
Value − Any data type can be stored as a value, and each key has
particular data associated with it. A map is written with square
brackets, with # separating a key from its value and commas
separating multiple key-value pairs.
Null values − A value that is missing or unknown; any data type can
be null. Pig handles null values in a way similar to SQL. Pig produces
null values when data is missing or when an error occurs during data
processing. Null can also be used as a placeholder value of your choice.
Defining schema:
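A schema is typically defined in the AS clause of a LOAD statement, naming each field and its data type; DESCRIBE then prints the schema of the relation. A minimal sketch, reusing the student_data.txt file that appears in the following sections:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> DESCRIBE student;
student: {id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray}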
Storing data through Pig:
Once data has been loaded into Apache Pig, you can store it in the
file system using the store operator. This section explains how to
store data in Apache Pig using the Store operator.
Syntax:
Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING
function];
Example
Assume we have a file student_data.txt in HDFS with the following
content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator
as shown below.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray,
phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS
directory “/pig_Output/” as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ '
USING PigStorage (',');
Output
After executing the store statement, you will get the following output.
A directory is created with the specified name and the data will be
stored in it.
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
MapReduceLau ncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt
Features
2.6.0 0.15.0 Hadoop 2015-10-0 13:03:03 2015-10-05
13:05:05 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime
AvgMapTime MedianMapTime
job_14459_06 1 0 n/a n/a n/a n/a
MaxReduceTime MinReduceTime AvgReduceTime
MedianReducetime Alias Feature
0 0 0 0 student MAP_ONLY
OutPut folder
hdfs://localhost:9000/pig_Output/
Input(s): Successfully read 0 records from:
"hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in:
"hdfs://localhost:9000/pig_Output"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_1443519499159_0006
2015-10-05 13:06:06,192 [main] INFO
org.apache.pig.backend.hadoop.executionengine
.mapReduceLayer.MapReduceLau ncher - Success!
Verification
You can verify the stored data as shown below.
Step 1
First of all, list out the files in the directory named pig_output using
the ls command as shown below.
hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup     0 2015-10-05 13:03
hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup   224 2015-10-05 13:03
hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing
the store statement.
Step 2
Using cat command, list the contents of the file named part-m-
00000 as shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
Reading data through pig:
In general, Apache Pig works on top of Hadoop. It is an analytical
tool that analyzes large datasets that exist in the Hadoop File System.
To analyze data using Apache Pig, we have to initially load the data
into Apache Pig. This chapter explains how to load data to Apache
Pig from HDFS.
Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores
the results back in HDFS. Therefore, let us start HDFS and create the
following sample data in HDFS.
Student ID  First Name  Last Name    Phone       City
001         Rajiv       Reddy        9848022337  Hyderabad
002         siddarth    Battacharya  9848022338  Kolkata
003         Rajesh      Khanna       9848022339  Delhi
004         Preethi     Agarwal      9848022330  Pune
005         Trupthi     Mohanthy     9848022336  Bhuwaneshwar
006         Archana     Mishra       9848022335  Chennai
The above dataset contains personal details like id, first name, last
name, phone number and city, of six students.
Step 1: Verifying Hadoop
First of all, verify the installation using Hadoop version command, as
shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the PATH
variable, then you will get the following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using
/home/Hadoop/hadoop/share/hadoop/common/hadoop
common-2.6.0.jar
Step 2: Starting HDFS
Browse through the sbin directory of Hadoop and start yarn and
Hadoop dfs (distributed file system) as shown below.
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-namenode-
localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-datanode-
localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoopsecondarynamenode-
localhost.localdomain.out
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoopresourcemanager-
localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarnHadoop-nodemanager-
localhost.localdomain.out
Step 3: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the
command mkdir. Create a new directory in HDFS with the
name pig_data in the required path as shown below.
$ cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data
Step 4: Placing the data in HDFS
The input file of Pig contains each tuple/record in individual lines.
And the entities of the record are separated by a delimiter (In our
example we used “,”).
In the local file system, create an input file student_data.txt containing
data as shown below.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Now, move the file from the local file system to HDFS
using put command as shown below. (You can
use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt
hdfs://localhost:9000/pig_data/
Verifying the file
You can use the cat command to verify whether the file has been
moved into the HDFS, as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat
hdfs://localhost:9000/pig_data/student_data.txt
Output
You can see the content of the file as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load
native-hadoop
library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
The Load Operator
You can load data into Apache Pig from the file system (HDFS/
Local) using LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator.
On the left-hand side, we need to mention the name of the
relation where we want to store the data, and on the right-hand side,
we define how and from where the data is to be loaded. Given below is the syntax of
the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
Where,
relation_name − We have to mention the relation in which we want
to store the data.
Input file path − We have to mention the HDFS directory where the
file is stored. (In MapReduce mode)
function − We have to choose a function from the set of load
functions provided by Apache Pig (BinStorage, JsonLoader,
PigStorage, TextLoader).
Schema − We have to define the schema of the data. We can define
the required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
Note − If we load the data without specifying a schema, the columns
are addressed as $0, $1, $2, and so on.
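As a brief sketch of this positional style (the file path is the one used elsewhere in this section), fields can be referenced as $0, $1, $2 in a FOREACH statement when no schema has been declared:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
grunt> names = FOREACH student GENERATE $1, $2;
grunt> DUMP names;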
Example
As an example, let us load the data in student_data.txt in Pig under
the schema named Student using the LOAD command.
Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in
MapReduce mode as shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType :
LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType :
MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked
MAPREDUCE as the ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main -
Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015,
11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main -
Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils
- Default bootup file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main]
INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000
grunt>
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by
executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray,
lastname:chararray, phone:chararray,
city:chararray );
Following is the description of the above statement.
Relation name − We have stored the data in the schema student.
Input file path − We are reading data from the file student_data.txt,
which is in the /pig_data/ directory of HDFS.
Storage function − We have used the PigStorage() function. It loads
and stores data as structured text files. It takes as a parameter the
delimiter by which each entity of a tuple is separated; by default,
it takes '\t' as the delimiter.
Schema − We have stored the data using the following schema:
column − id, firstname, lastname, phone, city
datatype − int, chararray, chararray, chararray, chararray
Pig operators:
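A brief assumed sketch of a few commonly used Pig Latin operators, reusing the student relation loaded in the earlier sections:
-- FILTER: keep only the students from Delhi
delhi_students = FILTER student BY city == 'Delhi';
-- GROUP and FOREACH: count students per city
by_city = GROUP student BY city;
city_count = FOREACH by_city GENERATE group AS city, COUNT(student) AS total;
-- ORDER and LIMIT: sort by id and keep the first three records
ordered = ORDER student BY id ASC;
top3 = LIMIT ordered 3;
DUMP top3;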
Performing inner and outer join in Pig:
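A minimal sketch of inner and outer joins in Pig Latin, assuming two relations customers(id:int, name:chararray) and orders(oid:int, cust_id:int, amount:int) have already been loaded:
-- Inner join: keeps only the matching rows from both relations
inner_joined = JOIN customers BY id, orders BY cust_id;
-- Left outer join: keeps all customers, with nulls where no order matches
left_joined = JOIN customers BY id LEFT OUTER, orders BY cust_id;
-- Right outer join: keeps all orders
right_joined = JOIN customers BY id RIGHT OUTER, orders BY cust_id;
-- Full outer join: keeps all rows from both relations
full_joined = JOIN customers BY id FULL OUTER, orders BY cust_id;
DUMP left_joined;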
Introduction to Hive:
Hive is a data warehouse system which is used to analyze structured
data. It is built on top of Hadoop and was developed by Facebook.
Hive provides the functionality of reading, writing, and managing
large datasets residing in distributed storage. It runs SQL-like queries,
called HQL (Hive Query Language), which are internally converted into
MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach
of writing complex MapReduce programs. Hive supports Data
Definition Language (DDL), Data Manipulation Language (DML),
and User Defined Functions (UDF).
Features of Hive:
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly
transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and
HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop
ecosystem.
o It supports user-defined functions (UDFs), through which users can
plug in their own functionality.
Application of Hive:
Architecture of Hive:
Hive Client
Hive allows writing applications in various languages, including Java, Python, and
C++. It supports different types of clients such as:-
o Thrift Server - It is a cross-language service provider platform
that serves requests from all those programming languages
that support Thrift.
o JDBC Driver - It is used to establish a connection between hive
and Java applications. The JDBC Driver is present in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the
ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:-
o Hive CLI - The Hive CLI (Command Line Interface) is a shell
where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an
alternative of Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the
structural information of the various tables and partitions in the
warehouse. It also includes metadata about columns and their
types, the serializers and deserializers used to read and write
data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It
accepts the request from different clients and provides it to Hive
Driver.
o Hive Driver - It receives queries from different sources like
web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the
queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the
query and perform semantic analysis on the different query
blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan
in the form of a DAG of MapReduce tasks and HDFS tasks. In the
end, the execution engine executes the incoming tasks in the
order of their dependencies.
Components of Hive:
Hive shell:
HiveQL:
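A minimal assumed sketch of running HiveQL statements from the Hive shell, using the userdb database and employee table that are created in the sections below:
$ hive
hive> SHOW DATABASES;
hive> USE userdb;
hive> SELECT * FROM employee LIMIT 3;
hive> SELECT destination, count(*) FROM employee GROUP BY destination;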
Hive Databases and Tables:
Create Database is a statement used to create a database in
Hive. A database in Hive is a namespace or a collection of tables.
The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause which prevents an error
if a database with the same name already exists. We can use
SCHEMA in place of DATABASE in this command. The following
query is executed to create a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;
The following query is used to verify a databases list:
hive> SHOW DATABASES;
default
userdb
JDBC Program
The JDBC program to create a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveCreateDb {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      stmt.executeQuery("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");
      con.close();
   }
}
Save the program in a file named HiveCreateDb.java. The
following commands are used to compile and execute this
program.
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
Database userdb created successfully.
Drop Database Statement
Drop Database is a statement that drops all the tables and
deletes the database. Its syntax is as follows:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];
The following queries are used to drop a database. Let us assume
that the database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE. It means
dropping respective tables before dropping the database.
hive> DROP DATABASE IF EXISTS userdb CASCADE;
The following query drops the database using SCHEMA.
hive> DROP SCHEMA userdb;
This clause was added in Hive 0.6.
JDBC Program
The JDBC program to drop a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveDropDb {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      stmt.executeQuery("DROP DATABASE userdb");
      System.out.println("Drop userdb database successful.");
      con.close();
   }
}
Save the program in a file named HiveDropDb.java. Given below
are the commands to compile and execute this program.
$ javac HiveDropDb.java
$ java HiveDropDb
Output:
Drop userdb database successful.
Create Table is a statement used to create a table in Hive. The
syntax and example are as follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]
table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table
named employee using CREATE TABLE statement. The following
table lists the fields and their data types in employee table:
Sr.No Field Name Data Type
1 Eid int
2 Name String
3 Salary Float
4 Designation string
The following clauses add a comment and specify the row format
(field terminator and line terminator) and the storage file type.
COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The following query creates a table named employee using the
above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT ‘Employee details’
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement
in case the table already exists.
On successful creation of table, you get to see the following
response:
OK
Time taken: 5.905 seconds
hive>
JDBC Program
The JDBC program to create a table is given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveCreateTable {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      stmt.executeQuery("CREATE TABLE IF NOT EXISTS "
         + " employee ( eid int, name String, "
         + " salary String, destination String)"
         + " COMMENT 'Employee details'"
         + " ROW FORMAT DELIMITED"
         + " FIELDS TERMINATED BY '\\t'"
         + " LINES TERMINATED BY '\\n'"
         + " STORED AS TEXTFILE");
      System.out.println("Table employee created.");
      con.close();
   }
}
Save the program in a file named HiveCreateTable.java. The
following commands are used to compile and execute this
program.
$ javac HiveCreateTable.java
$ java HiveCreateTable
Output
Table employee created.
Load Data Statement
Generally, after creating a table in SQL, we can insert data using
the Insert statement. But in Hive, we can insert data using the
LOAD DATA statement.
While inserting data into Hive, it is better to use LOAD DATA to
store bulk records. There are two ways to load data: one is from
local file system and second is from Hadoop file system.
Syntax
The syntax for load data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL is an optional identifier to specify the local path.
OVERWRITE is optional; it overwrites the existing data in the table.
PARTITION is optional.
Example
We will insert the following data into the table. It is a text file
named sample.txt in /home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;
On successful execution, you get to see the following response:
OK
Time taken: 15.905 seconds
hive>
JDBC Program
Given below is the JDBC program to load given data into the
table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveLoadData {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      stmt.executeQuery("LOAD DATA LOCAL INPATH '/home/user/sample.txt' "
         + "OVERWRITE INTO TABLE employee");
      System.out.println("Load Data into employee successful");
      con.close();
   }
}
Save the program in a file named HiveLoadData.java. Use the
following commands to compile and execute this program.
$ javac HiveLoadData.java
$ java HiveLoadData
Output:
Load Data into employee successful
Data types:
Column Types
Column types are used as the column data types of Hive. They are as
follows:
Integral Types
Integer type data can be specified using integral data types, INT.
When the data range exceeds the range of INT, you need to use
BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
Type Postfix Example
TINYINT Y 10Y
SMALLINT S 10S
INT - 10
BIGINT L 10L
String Types
String type data types can be specified using single quotes (' ') or
double quotes (" "). It contains two data types: VARCHAR and
CHAR. Hive follows C-types escape characters.
The following table depicts various CHAR data types:
Data Type Length
VARCHAR 1 to 65535
CHAR 255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond
precision. It supports java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and format “yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form
{{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of
Java. It is used for representing immutable arbitrary-precision numbers. The
syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an
instance using create union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points.
Generally, this type of data is composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher
range than the DOUBLE data type. The range of the decimal type is
approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT
col_comment], ...>
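A small sketch of declaring these complex types in a table definition and reading them back (the table and column names are assumptions for the example):
hive> CREATE TABLE employee_profile (
         name STRING,
         skills ARRAY<STRING>,
         phone MAP<STRING, STRING>,
         address STRUCT<city:STRING, pin:INT>)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY '#';
hive> SELECT skills[0], phone['home'], address.city FROM employee_profile;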
Operations in Hive:
Performing Inner and Outer join in Hive:
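A minimal sketch of inner and outer joins in HiveQL, assuming two tables customers(id, name) and orders(oid, cust_id, amount):
-- Inner join: only the matching rows from both tables
SELECT c.name, o.amount FROM customers c JOIN orders o ON (c.id = o.cust_id);
-- Left outer join: all customers, NULL where no matching order exists
SELECT c.name, o.amount FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.cust_id);
-- Right outer join: all orders
SELECT c.name, o.amount FROM customers c RIGHT OUTER JOIN orders o ON (c.id = o.cust_id);
-- Full outer join: all rows from both tables
SELECT c.name, o.amount FROM customers c FULL OUTER JOIN orders o ON (c.id = o.cust_id);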
Built in functions in Hive:
Hive supports the following built-in functions:
BIGINT round(double a) − It returns the rounded BIGINT value of the double.
BIGINT floor(double a) − It returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a) − It returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed) − It returns a random number that changes from row to row.
string concat(string A, string B, ...) − It returns the string resulting from concatenating B after A.
string substr(string A, int start) − It returns the substring of A starting from the start position till the end of string A.
string substr(string A, int start, int length) − It returns the substring of A starting from the start position with the given length.
string upper(string A) − It returns the string resulting from converting all characters of A to upper case.
string ucase(string A) − Same as above.
string lower(string A) − It returns the string resulting from converting all characters of A to lower case.
string lcase(string A) − Same as above.
string trim(string A) − It returns the string resulting from trimming spaces from both ends of A.
string ltrim(string A) − It returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
string rtrim(string A) − It returns the string resulting from trimming spaces from the end (right-hand side) of A.
string regexp_replace(string A, string B, string C) − It returns the string resulting from replacing all substrings in B that match the Java regular expression syntax with C.
int size(Map<K.V>) − It returns the number of elements in the map type.
int size(Array<T>) − It returns the number of elements in the array type.
value of <type> cast(<expr> as <type>) − It converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
string from_unixtime(int unixtime) − It converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string to_date(string timestamp) − It returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date) − It returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date) − It returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date) − It returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string get_json_object(string json_string, string path) − It extracts the JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted JSON object. It returns NULL if the input JSON string is invalid.
Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
On successful execution of query, you get to see the following
response:
3.0
floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following
response:
2.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following
response:
3.0
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage
of these functions is as same as the SQL aggregate functions.
BIGINT count(*), count(expr) − count(*) returns the total number of retrieved rows.
DOUBLE sum(col), sum(DISTINCT col) − It returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE avg(col), avg(DISTINCT col) − It returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE min(col) − It returns the minimum value of the column in the group.
DOUBLE max(col) − It returns the maximum value of the column in the group.
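For instance, a short assumed sketch against the employee table from the earlier sections (treating salary as a numeric value):
hive> SELECT count(*), min(salary), max(salary), avg(salary), sum(salary) FROM employee;
hive> SELECT destination, count(*), avg(salary) FROM employee GROUP BY destination;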
Database operators in Hive:
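A brief assumed sketch of relational, logical and arithmetic operators in HiveQL, again using the employee table:
hive> SELECT * FROM employee WHERE salary >= 40000 AND destination LIKE '%Admin%';
hive> SELECT eid, name FROM employee WHERE eid = 1205 OR name = 'Gopal';
hive> SELECT name, salary + 5000 FROM employee WHERE salary IS NOT NULL;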
Hive vs RDBMS:
RDBMS | Hive
It is used to maintain a database. | It is used to maintain a data warehouse.
It uses SQL (Structured Query Language). | It uses HQL (Hive Query Language).
Schema is fixed in RDBMS. | Schema varies in Hive.
Normalized data is stored. | Both normalized and de-normalized data can be stored.
Tables in RDBMS are sparse. | Tables in Hive are dense.
It doesn't support partitioning. | It supports automatic partitioning.
No partition method is used. | The sharding method is used for partitioning.
Example of Hive:
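A small end-to-end sketch tying together the statements covered above (paths and values are taken from the earlier sections and are illustrative only):
hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> USE userdb;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
hive> SELECT name, destination FROM employee WHERE eid > 1202;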