Pig Programming:
Introduction to pig:
Pig represents Big Data as data flows. Pig is a high-level platform or
tool which is used to process large datasets. It provides a high level
of abstraction for processing over MapReduce, along with a high-level
scripting language, known as Pig Latin, which is used to develop the
data analysis code. To process data stored in HDFS, programmers write
scripts using the Pig Latin language. Internally, the Pig Engine (a
component of Apache Pig) converts all these scripts into specific map
and reduce tasks, but these are not visible to the programmers, in
order to provide a high level of abstraction. Pig Latin and Pig Engine
are the two main components of the Apache Pig tool. The result of Pig
is always stored in HDFS.
Need of Pig: One limitation of MapReduce is that the development
cycle is very long. Writing the mapper and reducer, compiling and
packaging the code, submitting the job and retrieving the output is a
time-consuming task. Apache Pig reduces the development time by
using a multi-query approach. Pig is also beneficial for programmers
who do not come from a Java background: 200 lines of Java code can
often be written in only about 10 lines of Pig Latin (a short
illustration follows the list below). Programmers who already know
SQL need less effort to learn Pig Latin.
It uses a multi-query approach, which reduces the length of the code.
Pig Latin is a SQL-like language.
It provides many built-in operators.
It provides nested data types (tuples, bags, maps).
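As a rough illustration of this conciseness, here is a minimal word-count sketch in Pig Latin (the input path and field names are assumptions for the example, not part of the original material):
-- Load lines of text from HDFS (path is illustrative)
lines = LOAD '/data/input.txt' AS (line:chararray);
-- Split each line into individual words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count their occurrences
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE wordcount INTO '/data/wordcount_output';
The equivalent hand-written MapReduce job in Java would need a mapper class, a reducer class and a driver, which is the difference in effort referred to above.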
Evolution of Pig: In 2006, Apache Pig was developed by Yahoo's
researchers. At that time, the main idea behind developing Pig was to execute
MapReduce jobs on extremely large datasets. In 2007, it moved to the
Apache Software Foundation (ASF), which made it an open-source project.
The first version (0.1) of Pig came out in 2008. The latest version of
Apache Pig is 0.17, released in 2017.
Features of pig:
1) Ease of programming
Writing complex Java programs for MapReduce is quite tough for non-
programmers. Pig makes this process easy. In Pig, the queries are
converted to MapReduce internally.
2) Optimization opportunities
The way in which tasks are encoded permits the system to optimize their
execution automatically, allowing the user to focus on semantics
rather than efficiency.
3) Extensibility
Users can write user-defined functions (UDFs) containing their own
logic to execute over the data set.
4) Flexible
It can easily handle structured as well as unstructured data.
5) In-built operators
It contains various types of operators, such as sort, filter and join.
Application of pig:
Pig scripting is used for exploring large datasets.
It provides support for ad-hoc queries across large datasets.
It is used in prototyping algorithms for processing large datasets.
It is used to process time-sensitive data loads.
It is used for collecting large amounts of data in the form of search logs and web
crawls.
It is used where analytical insights are needed through sampling.
Pig Architecture:
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It
is a high-level data processing language which provides a rich set of data types
and operators to perform various operations on the data.
To perform a particular task using Pig, programmers need to write
a Pig script using the Pig Latin language and execute it using any of the
execution mechanisms (Grunt shell, UDFs, Embedded). After execution, these
scripts go through a series of transformations applied by the Pig
Framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs,
and thus, it makes the programmer’s job easy. The architecture of Apache Pig is
shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache
Pig framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the
script, does type checking, and other miscellaneous checks. The output of the
parser will be a DAG (directed acyclic graph), which represents the Pig Latin
statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and
the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out
logical optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted
order, where they are executed to produce the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex
non-atomic datatypes such as map and tuple. Given below is the
diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of its data type, is known
as an Atom. It is stored as a string and can be used as a string or a
number. int, long, float, double, chararray, and bytearray are the atomic values
of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple; the fields
can be of any type. A tuple is similar to a row in a table of an RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of
tuples (non-unique) is known as a bag. Each tuple can have any
number of fields (flexible schema). A bag is represented by ‘{}’. It is
similar to a table in RDBMS, but unlike a table in RDBMS, it is not
necessary that every tuple contain the same number of fields or that
the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be
of type chararray and should be unique. The value might be of any
type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is
no guarantee that tuples are processed in any particular order).
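As a small combined illustration (the values are assumed for the example), a single tuple in a relation can mix atoms, an inner bag, and a map:
(1, Raja, {(9848022338), (9848022339)}, [city#Hyderabad, age#30])
Here 1 and Raja are atomic fields, the inner bag holds two phone-number tuples, and the map stores key-value pairs.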
Pig data types:
Primitive Data type:
Type − Description − Example
int − Signed 32-bit integer − 2
long − Signed 64-bit integer − 15L or 15l
float − 32-bit floating point − 2.5f or 2.5F
double − 64-bit floating point − 1.5 or 1.5e2 or 1.5E2
chararray − Character array (string) − 'hello'
bytearray − BLOB (byte array)
tuple − Ordered set of fields − (12,43)
bag − Collection of tuples − {(12,43),(54,28)}
map − Set of key-value pairs − [open#apache]
int − Signed 32-bit integer, similar to Integer in Java.
long − Signed 64-bit integer, similar to Long in Java.
float − Signed 32-bit floating point, similar to Float in Java.
double − Signed 64-bit floating point, similar to Double in Java.
chararray − An array of characters in Unicode UTF-8 format, similar
to String in Java.
bytearray − A blob of bytes. When the type of a field is not
specified, it defaults to bytearray.
boolean − A value that is either true or false.
Complex Data type
Complex data types are built by combining other data types. The
complex data types are as follows −
Data Type − Definition − Code − Example
Tuple − An ordered set of fields; a tuple is written with parentheses. − (field[, field ...]) − (1,2)
Bag − A group of tuples is called a bag; it is represented by curly braces. − {tuple[, tuple ...]} − {(1,2), (3,4)}
Map − A set of key-value pairs; a map is represented by square brackets. − [key#value] − ['keyname'#'valuename']
Key − The index used to look up an element; the key must be
unique and of type chararray.
Value − Any data type can be stored as a value, and each key has
particular data associated with it. A map is written with square
brackets, with # separating a key from its value and commas
separating multiple key-value pairs.
Null values − A value that is missing or unknown; any data type can
be null. Pig handles null values in a way similar to SQL. Pig produces
null values when data is missing or when an error occurs during data
processing. Null can also be used as a placeholder value of your choice.
Defining schema:
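A schema is typically defined in the AS clause of a LOAD statement, naming each field and its data type; DESCRIBE then prints the schema of the relation. A minimal sketch, reusing the student_data.txt file that appears in the following sections:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> DESCRIBE student;
student: {id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray}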
Storing data through Pig:
Once data has been loaded into Apache Pig, you can store it in the
file system using the store operator. This section explains how to
store data in Apache Pig using the Store operator.
Syntax:
Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING
function];
Example
Assume we have a file student_data.txt in HDFS with the following
content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator
as shown below.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray,
phone:chararray,
city:chararray );
Now, let us store the relation in the HDFS
directory “/pig_Output/” as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ '
USING PigStorage (',');
Output
After executing the store statement, you will get the following output.
A directory is created with the specified name and the data will be
stored in it.
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
MapReduceLau ncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt
Features
2.6.0 0.15.0 Hadoop 2015-10-0 13:03:03 2015-10-05
13:05:05 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime
AvgMapTime MedianMapTime
job_14459_06 1 0 n/a n/a n/a n/a
MaxReduceTime MinReduceTime AvgReduceTime
MedianReducetime Alias Feature
0 0 0 0 student MAP_ONLY
OutPut folder
hdfs://localhost:9000/pig_Output/
Input(s): Successfully read 0 records from:
"hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in:
"hdfs://localhost:9000/pig_Output"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_1443519499159_0006
2015-10-05 13:06:06,192 [main] INFO
org.apache.pig.backend.hadoop.executionengine
.mapReduceLayer.MapReduceLau ncher - Success!
Verification
You can verify the stored data as shown below.
Step 1
First of all, list out the files in the directory named pig_output using
the ls command as shown below.
hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
-rw-r--r--   1 Hadoop supergroup     0 2015-10-05 13:03
hdfs://localhost:9000/pig_Output/_SUCCESS
-rw-r--r--   1 Hadoop supergroup   224 2015-10-05 13:03
hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing
the store statement.
Step 2
Using cat command, list the contents of the file named part-m-
00000 as shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000'
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai
Reading data through pig:
In general, Apache Pig works on top of Hadoop. It is an analytical
tool that analyzes large datasets that exist in the Hadoop File System.
To analyze data using Apache Pig, we have to initially load the data
into Apache Pig. This chapter explains how to load data to Apache
Pig from HDFS.
Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores
the results back in HDFS. Therefore, let us start HDFS and create the
following sample data in HDFS.
Student ID  First Name  Last Name    Phone       City
001         Rajiv       Reddy        9848022337  Hyderabad
002         siddarth    Battacharya  9848022338  Kolkata
003         Rajesh      Khanna       9848022339  Delhi
004         Preethi     Agarwal      9848022330  Pune
005         Trupthi     Mohanthy     9848022336  Bhuwaneshwar
006         Archana     Mishra       9848022335  Chennai
The above dataset contains personal details like id, first name, last
name, phone number and city, of six students.
Step 1: Verifying Hadoop
First of all, verify the installation using Hadoop version command, as
shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the PATH
variable, then you will get the following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using
/home/Hadoop/hadoop/share/hadoop/common/hadoop
common-2.6.0.jar
Step 2: Starting HDFS
Browse through the sbin directory of Hadoop and start yarn and
Hadoop dfs (distributed file system) as shown below.
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-namenode-
localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-datanode-
localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoopsecondarynamenode-
localhost.localdomain.out
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to
/home/Hadoop/hadoop/logs/yarn-Hadoopresourcemanager-
localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarnHadoop-nodemanager-
localhost.localdomain.out
Step 3: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the
command mkdir. Create a new directory in HDFS with the
name pig_data in the required path as shown below.
$ cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data
Step 4: Placing the data in HDFS
The input file of Pig contains each tuple/record in individual lines.
And the entities of the record are separated by a delimiter (In our
example we used “,”).
In the local file system, create an input file student_data.txt containing
data as shown below.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Now, move the file from the local file system to HDFS
using put command as shown below. (You can
use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt
hdfs://localhost:9000/pig_data/
Verifying the file
You can use the cat command to verify whether the file has been
moved into the HDFS, as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat
hdfs://localhost:9000/pig_data/student_data.txt
Output
You can see the content of the file as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load
native-hadoop
library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
The Load Operator
You can load data into Apache Pig from the file system (HDFS/
Local) using LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator.
On the left-hand side, we need to mention the name of the
relation where we want to store the data, and on the right-hand side,
we define how and from where the data is to be loaded. Given below is the syntax of
the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
Where,
relation_name − We have to mention the relation in which we want
to store the data.
Input file path − We have to mention the HDFS directory where the
file is stored. (In MapReduce mode)
function − We have to choose a function from the set of load
functions provided by Apache Pig (BinStorage, JsonLoader,
PigStorage, TextLoader).
Schema − We have to define the schema of the data. We can define
the required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
Note − If we load the data without specifying a schema, the columns
are addressed as $0, $1, $2, and so on.
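As a brief sketch of this positional style (the file path is the one used elsewhere in this section), fields can be referenced as $0, $1, $2 in a FOREACH statement when no schema has been declared:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
grunt> names = FOREACH student GENERATE $1, $2;
grunt> DUMP names;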
Example
As an example, let us load the data in student_data.txt in Pig under
the schema named Student using the LOAD command.
Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in
MapReduce mode as shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType :
LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType :
MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked
MAPREDUCE as the ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main -
Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015,
11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main -
Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils
- Default bootup file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main]
INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://localhost:9000
grunt>
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by
executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD
'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray,
lastname:chararray, phone:chararray,
city:chararray );
Following is the description of the above statement.
Relation name − We have stored the data in the schema student.
Input file path − We are reading data from the file student_data.txt,
which is in the /pig_data/ directory of HDFS.
Storage function − We have used the PigStorage() function. It loads
and stores data as structured text files. It takes as a parameter the
delimiter by which each entity of a tuple is separated; by default,
it takes '\t' as the delimiter.
Schema − We have stored the data using the following schema:
column − id, firstname, lastname, phone, city
datatype − int, chararray, chararray, chararray, chararray
Pig operators:
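A brief assumed sketch of a few commonly used Pig Latin operators, reusing the student relation loaded in the earlier sections:
-- FILTER: keep only the students from Delhi
delhi_students = FILTER student BY city == 'Delhi';
-- GROUP and FOREACH: count students per city
by_city = GROUP student BY city;
city_count = FOREACH by_city GENERATE group AS city, COUNT(student) AS total;
-- ORDER and LIMIT: sort by id and keep the first three records
ordered = ORDER student BY id ASC;
top3 = LIMIT ordered 3;
DUMP top3;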
Performing inner and outer join in Pig:
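A minimal sketch of inner and outer joins in Pig Latin, assuming two relations customers(id:int, name:chararray) and orders(oid:int, cust_id:int, amount:int) have already been loaded:
-- Inner join: keeps only the matching rows from both relations
inner_joined = JOIN customers BY id, orders BY cust_id;
-- Left outer join: keeps all customers, with nulls where no order matches
left_joined = JOIN customers BY id LEFT OUTER, orders BY cust_id;
-- Right outer join: keeps all orders
right_joined = JOIN customers BY id RIGHT OUTER, orders BY cust_id;
-- Full outer join: keeps all rows from both relations
full_joined = JOIN customers BY id FULL OUTER, orders BY cust_id;
DUMP left_joined;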
Introduction to Hive:
Hive is a data warehouse system which is used to analyze structured
data. It is built on top of Hadoop and was developed by Facebook.
Hive provides the functionality of reading, writing, and managing
large datasets residing in distributed storage. It runs SQL-like queries,
called HQL (Hive Query Language), which are internally converted into
MapReduce jobs.
Using Hive, we can skip the requirement of the traditional approach
of writing complex MapReduce programs. Hive supports Data
Definition Language (DDL), Data Manipulation Language (DML),
and User Defined Functions (UDF).
Features of Hive:
o Hive is fast and scalable.
o It provides SQL-like queries (i.e., HQL) that are implicitly
transformed to MapReduce or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It allows different storage types such as plain text, RCFile, and
HBase.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop
ecosystem.
o It supports user-defined functions (UDFs), through which users can
plug in their own functionality.
Application of Hive:
Architecture of Hive:
Hive Client
Hive allows writing applications in various languages, including Java, Python, and
C++. It supports different types of clients such as:-
o Thrift Server - It is a cross-language service provider platform
that serves requests from all those programming languages
that support Thrift.
o JDBC Driver - It is used to establish a connection between hive
and Java applications. The JDBC Driver is present in the class
org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the
ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:-
o Hive CLI - The Hive CLI (Command Line Interface) is a shell
where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an
alternative of Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the
structural information of the various tables and partitions in the
warehouse. It also includes metadata about columns and their
types, the serializers and deserializers used to read and write
data, and the corresponding HDFS files where the data is stored.
o Hive Server - It is referred to as Apache Thrift Server. It
accepts the request from different clients and provides it to Hive
Driver.
o Hive Driver - It receives queries from different sources like
web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the
queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the
query and perform semantic analysis on the different query
blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan
in the form of a DAG of MapReduce tasks and HDFS tasks. In the
end, the execution engine executes the incoming tasks in the
order of their dependencies.
Components of Hive:
Hive shell:
HiveQL:
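A minimal assumed sketch of running HiveQL statements from the Hive shell, using the userdb database and employee table that are created in the sections below:
$ hive
hive> SHOW DATABASES;
hive> USE userdb;
hive> SELECT * FROM employee LIMIT 3;
hive> SELECT destination, count(*) FROM employee GROUP BY destination;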
Hive Databases and Tables:
Create Database is a statement used to create a database in
Hive. A database in Hive is a namespace or a collection of tables.
The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause which prevents an error
if a database with the same name already exists. We can use
SCHEMA in place of DATABASE in this command. The following
query is executed to create a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;
The following query is used to verify a databases list:
hive> SHOW DATABASES;
default
userdb
JDBC Program
The JDBC program to create a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveCreateDb {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      stmt.executeQuery("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");
      con.close();
   }
}
Save the program in a file named HiveCreateDb.java. The
following commands are used to compile and execute this
program.
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
Database userdb created successfully.
Drop Database Statement
Drop Database is a statement that drops all the tables and
deletes the database. Its syntax is as follows:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];
The following queries are used to drop a database. Let us assume
that the database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE. It means
dropping respective tables before dropping the database.
hive> DROP DATABASE IF EXISTS userdb CASCADE;
The following query drops the database using SCHEMA.
hive> DROP SCHEMA userdb;
This clause was added in Hive 0.6.
JDBC Program
The JDBC program to drop a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveDropDb {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      stmt.executeQuery("DROP DATABASE userdb");
      System.out.println("Drop userdb database successful.");
      con.close();
   }
}
Save the program in a file named HiveDropDb.java. Given below
are the commands to compile and execute this program.
$ javac HiveDropDb.java
$ java HiveDropDb
Output:
Drop userdb database successful.
Create Table is a statement used to create a table in Hive. The
syntax and example are as follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]
table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Example
Let us assume you need to create a table
named employee using CREATE TABLE statement. The following
table lists the fields and their data types in employee table:
Sr.No Field Name Data Type
1 Eid int
2 Name String
3 Salary Float
4 Designation string
The following clauses add a comment and specify the row format
(field terminator and line terminator) and the storage file type.
COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The following query creates a table named employee using the
above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT ‘Employee details’
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;
If you add the option IF NOT EXISTS, Hive ignores the statement
in case the table already exists.
On successful creation of table, you get to see the following
response:
OK
Time taken: 5.905 seconds
hive>
JDBC Program
The JDBC program to create a table is given example.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveCreateTable {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      stmt.executeQuery("CREATE TABLE IF NOT EXISTS "
         + " employee ( eid int, name String, "
         + " salary String, destination String)"
         + " COMMENT 'Employee details'"
         + " ROW FORMAT DELIMITED"
         + " FIELDS TERMINATED BY '\\t'"
         + " LINES TERMINATED BY '\\n'"
         + " STORED AS TEXTFILE");
      System.out.println("Table employee created.");
      con.close();
   }
}
Save the program in a file named HiveCreateTable.java. The
following commands are used to compile and execute this
program.
$ javac HiveCreateTable.java
$ java HiveCreateTable
Output
Table employee created.
Load Data Statement
Generally, after creating a table in SQL, we can insert data using
the Insert statement. But in Hive, we can insert data using the
LOAD DATA statement.
While inserting data into Hive, it is better to use LOAD DATA to
store bulk records. There are two ways to load data: one is from
local file system and second is from Hadoop file system.
Syntax
The syntax for load data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
LOCAL is an optional identifier to specify the local path.
OVERWRITE is optional; it overwrites the existing data in the table.
PARTITION is optional.
Example
We will insert the following data into the table. It is a text file
named sample.txt in /home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Kiran 40000 Hr Admin
1205 Kranthi 30000 Op Admin
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;
On successful execution, you get to see the following response:
OK
Time taken: 15.905 seconds
hive>
JDBC Program
Given below is the JDBC program to load given data into the
table.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveLoadData {
   private static String driverName =
      "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws
      SQLException, ClassNotFoundException {
      // Register driver and create driver instance
      Class.forName(driverName);

      // get connection
      Connection con = DriverManager.getConnection(
         "jdbc:hive://localhost:10000/userdb", "", "");

      // create statement
      Statement stmt = con.createStatement();

      // execute statement
      stmt.executeQuery("LOAD DATA LOCAL INPATH '/home/user/sample.txt' "
         + "OVERWRITE INTO TABLE employee");
      System.out.println("Load Data into employee successful");
      con.close();
   }
}
Save the program in a file named HiveLoadData.java. Use the
following commands to compile and execute this program.
$ javac HiveLoadData.java
$ java HiveLoadData
Output:
Load Data into employee successful
Data types:
Column Types
Column types are used as the column data types of Hive. They are as
follows:
Integral Types
Integer type data can be specified using integral data types, INT.
When the data range exceeds the range of INT, you need to use
BIGINT and if the data range is smaller than the INT, you use
SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
Type Postfix Example
TINYINT Y 10Y
SMALLINT S 10S
INT - 10
BIGINT L 10L
String Types
String type data types can be specified using single quotes (' ') or
double quotes (" "). It contains two data types: VARCHAR and
CHAR. Hive follows C-types escape characters.
The following table depicts various CHAR data types:
Data Type Length
VARCHAR 1 to 65535
CHAR 255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond
precision. It supports java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and format “yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form
{{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of
Java. It is used for representing immutable arbitrary-precision numbers. The
syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an
instance using create union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points.
Generally, this type of data is composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher
range than the DOUBLE data type. The range of the decimal type is
approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT
col_comment], ...>
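A small sketch of declaring these complex types in a table definition and reading them back (the table and column names are assumptions for the example):
hive> CREATE TABLE employee_profile (
         name STRING,
         skills ARRAY<STRING>,
         phone MAP<STRING, STRING>,
         address STRUCT<city:STRING, pin:INT>)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY '#';
hive> SELECT skills[0], phone['home'], address.city FROM employee_profile;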
Operations in Hive:
Performing Inner and Outer join in Hive:
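A minimal sketch of inner and outer joins in HiveQL, assuming two tables customers(id, name) and orders(oid, cust_id, amount):
-- Inner join: only the matching rows from both tables
SELECT c.name, o.amount FROM customers c JOIN orders o ON (c.id = o.cust_id);
-- Left outer join: all customers, NULL where no matching order exists
SELECT c.name, o.amount FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.cust_id);
-- Right outer join: all orders
SELECT c.name, o.amount FROM customers c RIGHT OUTER JOIN orders o ON (c.id = o.cust_id);
-- Full outer join: all rows from both tables
SELECT c.name, o.amount FROM customers c FULL OUTER JOIN orders o ON (c.id = o.cust_id);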
Built in functions in Hive:
Hive supports the following built-in functions:
BIGINT round(double a) − It returns the rounded BIGINT value of the double.
BIGINT floor(double a) − It returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a) − It returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed) − It returns a random number that changes from row to row.
string concat(string A, string B, ...) − It returns the string resulting from concatenating B after A.
string substr(string A, int start) − It returns the substring of A starting from the start position till the end of string A.
string substr(string A, int start, int length) − It returns the substring of A starting from the start position with the given length.
string upper(string A) − It returns the string resulting from converting all characters of A to upper case.
string ucase(string A) − Same as above.
string lower(string A) − It returns the string resulting from converting all characters of A to lower case.
string lcase(string A) − Same as above.
string trim(string A) − It returns the string resulting from trimming spaces from both ends of A.
string ltrim(string A) − It returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
string rtrim(string A) − It returns the string resulting from trimming spaces from the end (right-hand side) of A.
string regexp_replace(string A, string B, string C) − It returns the string resulting from replacing all substrings in B that match the Java regular expression syntax with C.
int size(Map<K.V>) − It returns the number of elements in the map type.
int size(Array<T>) − It returns the number of elements in the array type.
value of <type> cast(<expr> as <type>) − It converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) converts the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
string from_unixtime(int unixtime) − It converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
string to_date(string timestamp) − It returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date) − It returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date) − It returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date) − It returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
string get_json_object(string json_string, string path) − It extracts the JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted JSON object. It returns NULL if the input JSON string is invalid.
Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
On successful execution of query, you get to see the following
response:
3.0
floor() function
hive> SELECT floor(2.6) from temp;
On successful execution of the query, you get to see the following
response:
2.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following
response:
3.0
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage
of these functions is as same as the SQL aggregate functions.
BIGINT count(*), count(expr) − count(*) returns the total number of retrieved rows.
DOUBLE sum(col), sum(DISTINCT col) − It returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE avg(col), avg(DISTINCT col) − It returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE min(col) − It returns the minimum value of the column in the group.
DOUBLE max(col) − It returns the maximum value of the column in the group.
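For instance, a short assumed sketch against the employee table from the earlier sections (treating salary as a numeric value):
hive> SELECT count(*), min(salary), max(salary), avg(salary), sum(salary) FROM employee;
hive> SELECT destination, count(*), avg(salary) FROM employee GROUP BY destination;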
Database operators in Hive:
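A brief assumed sketch of relational, logical and arithmetic operators in HiveQL, again using the employee table:
hive> SELECT * FROM employee WHERE salary >= 40000 AND destination LIKE '%Admin%';
hive> SELECT eid, name FROM employee WHERE eid = 1205 OR name = 'Gopal';
hive> SELECT name, salary + 5000 FROM employee WHERE salary IS NOT NULL;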
Hive vs RDBMS:
RDBMS | Hive
It is used to maintain a database. | It is used to maintain a data warehouse.
It uses SQL (Structured Query Language). | It uses HQL (Hive Query Language).
Schema is fixed in RDBMS. | Schema varies in Hive.
Normalized data is stored. | Both normalized and de-normalized data can be stored.
Tables in RDBMS are sparse. | Tables in Hive are dense.
It doesn't support partitioning. | It supports automatic partitioning.
No partition method is used. | The sharding method is used for partitioning.
Example of Hive:
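A small end-to-end sketch tying together the statements covered above (paths and values are taken from the earlier sections and are illustrative only):
hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> USE userdb;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
hive> SELECT name, destination FROM employee WHERE eid > 1202;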