
Module-6

HIVE

www.edureka.co/big-data-and-hadoop
Course Topics
 Module 1 » Understanding Big Data and Hadoop
 Module 2 » Hadoop Architecture and HDFS
 Module 3 » Hadoop MapReduce Framework
 Module 4 » Advanced MapReduce
 Module 5 » PIG
 Module 6 » HIVE
 Module 7 » Advanced HIVE and HBase
 Module 8 » Advanced HBase
 Module 9 » Processing Distributed Data with Apache Spark
 Module 10 » Oozie and Hadoop Project

Objectives
At the end of this module, you will be able to:

 Understand what Hive is and its use cases

 Analyze the differences between Hive and Pig

 Understand the Hive architecture and Hive components

 Analyze the limitations of Hive

 Implement primitive and complex types in Hive

 Understand the Hive data model

 Perform basic Hive operations

 Execute Hive scripts and Hive UDFs

Hive Background
 Started at Facebook
 Data was collected by nightly cron jobs into an Oracle DB
 "ETL" via hand-coded Python
 Grew from tens of GBs (2006) to 1 TB/day of new data (2007), now 10x that

[Diagram: a Scribe server tier and a MySQL server tier feed a data collection server and an Oracle database]
Hive Use Case @ Facebook
Challenge…
 » > 950 million users
 » > 70k queries per day
 » > 300 million photos per day
 » > 500 Terabytes of data per day

Traditional RDBMS could not keep up, and MapReduce is hard to program, but users know SQL well.

Solution… HIVE:
 » Easy to plug in custom mapper/reducer code
 » Tables can be partitioned and bucketed
 » Hive tables can be defined directly on HDFS
 » Schema flexibility and evolution
 » Extensible: types, formats, functions, scripts
 » JDBC/ODBC drivers are available
What is Hive?
 Data warehousing package built on top of Hadoop
 Used for data analysis
 Targeted towards users comfortable with SQL
 Its query language is similar to SQL and is called HiveQL
 For managing and querying structured data
 Abstracts the complexity of Hadoop
 No need to learn Java and the Hadoop APIs
 Developed by Facebook and contributed to the community
 Facebook analyzed several terabytes of data every day using Hive

What is Hive? (Contd.)

 Defines a SQL-like query language called QL
 Provides a data warehouse infrastructure
 Allows programmers to plug in custom mappers and reducers
 Provides tools to enable easy data ETL

Where to use Hive?

HIVE applications:
 » Data mining
 » Log processing
 » Document indexing
 » Customer-facing business intelligence
 » Predictive modeling and hypothesis testing

Why go for Hive When Pig is there?

 PigLatin: Procedural data-flow language
» A = LOAD 'mydata';
» DUMP A;

 HiveQL: Declarative SQL-like language
» SELECT * FROM mytable;

 Pig is used by programmers and researchers; Hive is used by analysts generating daily reports.
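To make the contrast concrete, here is the same small report written both ways. The file, table and column names are illustrative, not from the slides:

```sql
-- HiveQL: declarative; you state the result you want
SELECT category, COUNT(*)
FROM mytable
WHERE amount > 100
GROUP BY category;

-- PigLatin: procedural; you spell out each step of the data flow
--   A = LOAD 'mydata' AS (category:chararray, amount:int);
--   B = FILTER A BY amount > 100;
--   C = GROUP B BY category;
--   D = FOREACH C GENERATE group, COUNT(B);
--   DUMP D;
```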

Why go for Hive When Pig is there? (Contd.)

Features                         Hive                 Pig

Language                         SQL-like             PigLatin
Schemas/Types                    Yes (explicit)       Yes (implicit)
Partitions                       Yes                  No
Server                           Optional (Thrift)    No
User Defined Functions (UDF)     Yes (Java)           Yes (Java)
Custom Serializer/Deserializer   Yes                  Yes
DFS Direct Access                Yes (implicit)       Yes (explicit)
Join/Order/Sort                  Yes                  Yes
Shell                            Yes                  Yes
Streaming                        Yes                  Yes
Web Interface                    Yes                  No
JDBC/ODBC                        Yes (limited)        No

Hive Architecture
[Architecture diagram]
 Applications (Karmasphere, Hue, Qubole and others) connect through the Hive Thrift Client, the Hive JDBC Driver or the Hive ODBC Driver
 These reach Hive via the CLI, the Hive Web Interface (HWI) or the Thrift Server
 The Driver (compiles, optimizes, executes) works with the Metastore
 Jobs execute on Hadoop: Resource Manager, Name Node and the DFS
Hive Components

HIVE components:
 » Shell
 » Driver
 » Compiler
 » Execution Engine
 » Metastore

Metastore
 Embedded metastore: the driver, the metastore and an embedded Derby database all run inside the Hive service JVM
 Local metastore: the metastore runs inside the Hive service JVM, but its data lives in a separate MySQL database
 Remote metastore: the metastore runs in its own server JVM, backed by MySQL; drivers connect to it over the network
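As a sketch, a MySQL-backed (local) metastore is configured in hive-site.xml through the standard JDO connection properties; the host name, database and credentials below are placeholders:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepass</value>
  </property>
</configuration>
```

For a remote metastore, clients instead point hive.metastore.uris at the metastore server, e.g. thrift://metastore-host:9083.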

Limitations of HIVE

 Not designed for online transaction processing
 Does not offer real-time queries and row-level updates
 Latency for Hive queries is generally very high (minutes)
 Provides acceptable (not optimal) latency for interactive data browsing

Abilities of HIVE Query Language
Hive Query Language provides the basic SQL-like operations

 Ability to filter rows from a table using a WHERE clause
 Ability to do equi-joins between two tables
 Ability to store the results of a query into another table
 Ability to manage tables and partitions (create, drop and alter)
 Ability to store the results of a query in a Hadoop DFS directory
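Each of these abilities maps to ordinary HiveQL. A sketch using the txnrecords table from later in this module (the customers and big_spenders tables are hypothetical):

```sql
-- Filter rows with a WHERE clause
SELECT * FROM txnrecords WHERE amount > 100;

-- Equi-join between two tables
SELECT t.txnno, c.name
FROM txnrecords t JOIN customers c ON (t.custno = c.custno);

-- Store the results of a query into another table
INSERT OVERWRITE TABLE big_spenders
SELECT custno, SUM(amount) FROM txnrecords GROUP BY custno;
```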

Differences with Traditional RDBMS
 Schema on Read vs Schema on Write

» Hive does not verify the data when it is loaded, but rather when a query is issued.
» Schema on read makes for a very fast initial load, since the data does not have to be read, parsed and
serialized to disk in the database’s internal format. The load operation is just a file copy or move.

 No updates, transactions or indexes.
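Schema on read can be observed directly: nothing is validated at load time, and unparsable values only surface as NULL when queried. A minimal sketch (table name and file path are illustrative):

```sql
CREATE TABLE readings (id INT, temp DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- The load is just a file copy/move; nothing is parsed or verified here,
-- even if some rows hold non-numeric text in the temp column
LOAD DATA LOCAL INPATH '/tmp/readings.csv' INTO TABLE readings;

-- The schema is applied only now; values that do not parse come back as NULL
SELECT * FROM readings;
```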

Type System

Primitive Types

 Boolean type
» BOOLEAN – TRUE/FALSE
 Integers
» TINYINT – 1 byte integer
» SMALLINT – 2 byte integer
» INT – 4 byte integer
» BIGINT – 8 byte integer
 Floating point numbers
» FLOAT – single precision
» DOUBLE – double precision
 String type
» STRING – sequence of characters
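A table touching each primitive family might look like this; the table and column names are illustrative:

```sql
CREATE TABLE sensor_events (
  active  BOOLEAN,   -- TRUE/FALSE
  level   TINYINT,   -- 1-byte integer
  port    SMALLINT,  -- 2-byte integer
  hits    INT,       -- 4-byte integer
  total   BIGINT,    -- 8-byte integer
  ratio   FLOAT,     -- single precision
  price   DOUBLE,    -- double precision
  label   STRING     -- sequence of characters
);
```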

Complex Types
 Complex Types can be built up from primitive types and other composite types using the following three operators:

» Structs: accessed using the DOT (.) notation.
» Maps (key-value tuples): accessed using the ['element name'] notation.
» Arrays (indexable lists): elements accessed using the [n] notation, where n is a zero-based index into the array.
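A sketch combining all three complex types in one table (names are illustrative):

```sql
CREATE TABLE employees (
  name   STRING,
  addr   STRUCT<city:STRING, zip:STRING>,
  phones MAP<STRING, STRING>,
  skills ARRAY<STRING>
);

SELECT
  addr.city,          -- struct: DOT notation
  phones['mobile'],   -- map: ['element name'] notation
  skills[0]           -- array: [n] notation, zero-based
FROM employees;
```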

Hive Data Models
 Databases (in the order of granularity)
» Namespaces

 Tables
» Schemas in namespaces

 Partitions
» Determine how data is stored in HDFS
» Group data based on the value of some column
» Can have one or more partition columns

 Buckets or Clusters
» Partitions divided further into buckets based on some other column
» Used for data sampling

(Example table columns: timestamp, userid, referer_url, page_url, IP)

Partitions
 Partition means dividing a table into coarse-grained parts based on the value of a partition column, such as a date. This makes it faster to do queries on slices of the data.

» Partition keys determine how the data is stored
» Each unique value of the partition keys defines a partition of the table
» Partitions are often named after dates, for convenience
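A sketch of a date-partitioned table; the table name and file path are illustrative:

```sql
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

-- Each load targets one partition, i.e. one HDFS subdirectory
LOAD DATA LOCAL INPATH '/tmp/logs-2014-01-01'
INTO TABLE logs PARTITION (dt = '2014-01-01');

-- Filtering on the partition column scans only that slice of the data
SELECT COUNT(*) FROM logs WHERE dt = '2014-01-01';
```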

Buckets
 Buckets give extra structure to the data that may be used for more efficient queries.

» A join of two tables that are bucketed on the same columns, including the join column, can be implemented as a Map Side Join.

» Bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.
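A sketch of a bucketed table and a bucket-based sample; the table name and bucket count are illustrative:

```sql
-- Rows are assigned to one of 32 buckets by hashing the userid column
CREATE TABLE user_actions (userid INT, action STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Sampling: evaluate a query on 1 of the 32 buckets instead of the full table
SELECT * FROM user_actions
TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
```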

[Diagram: data is divided into buckets based on the value of a hash function of some columns of the table]

Create Database and Table
 Create a database.
» CREATE DATABASE retail;

 Use the database.
» USE retail;

Create Database and Table (Contd.)
 Create a table for storing transactional records.
» CREATE TABLE txnrecords (txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

External Tables
 Creates the table in another HDFS location, not in the warehouse directory

 Not managed by Hive

» CREATE EXTERNAL TABLE external_Table (dummy STRING) LOCATION '/user/notroot/external_table';

 Need to specify the HDFS location

 Hive does not delete the data (the HDFS files) even when the table is dropped

 It leaves the data untouched; only the metadata about the table is deleted

Load Data
 Load the data into the table.
» LOAD DATA LOCAL INPATH '/home/edureka/txns' OVERWRITE INTO TABLE txnrecords;

 Describe the metadata, or schema, of the table.
» DESCRIBE txnrecords;

Queries
 Select
» SELECT COUNT(*) FROM txnrecords;

 Aggregation
» SELECT COUNT(DISTINCT category) FROM txnrecords;

 Grouping
» SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
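Grouped results can also be filtered on the aggregate itself using HAVING; the threshold below is illustrative:

```sql
-- Keep only categories whose total spend exceeds 5000
SELECT category, SUM(amount) AS total
FROM txnrecords
GROUP BY category
HAVING SUM(amount) > 5000;
```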

Managing Outputs
 Inserting output into another table.
» INSERT OVERWRITE TABLE results SELECT * FROM txnrecords;

 Inserting output into a local directory.
» INSERT OVERWRITE LOCAL DIRECTORY 'results' SELECT * FROM txnrecords;

 Inserting output into HDFS.
» INSERT OVERWRITE DIRECTORY '/results' SELECT * FROM txnrecords;

Hive Command Blog

http://www.edureka.co/blog/hive-commands/

Hive Script
 Hive scripts are used to execute a set of Hive commands collectively. This helps reduce the time and effort invested in writing and executing each command manually.

 Hive supports scripting from version 0.10.0 onwards (example script file: myqueries.sql).

Hive Script (Contd.)
 Command to execute the Hive script: hive -f myqueries.sql

 The script runs all the queries one by one in a single go and saves the output in the hive/output directory.
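As a sketch, myqueries.sql could simply collect the statements used earlier in this module into one file:

```sql
-- myqueries.sql: run with `hive -f myqueries.sql`
CREATE DATABASE IF NOT EXISTS retail;
USE retail;

CREATE TABLE IF NOT EXISTS txnrecords (txnno INT, txndate STRING,
  custno INT, amount DOUBLE, category STRING, product STRING,
  city STRING, state STRING, spendby STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/edureka/txns' OVERWRITE INTO TABLE txnrecords;

SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
```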

Hive Script Blog

http://www.edureka.co/blog/apache-hadoop-hive-script/

Joining Two Tables
User Table

Id | Email         | Language | Location
1  | edureka@1.com | EN       | US
2  | edureka@2.com | EN       | GB
3  | edureka@3.com | FR       | FR

Transaction Table

Id | Product Id | UserId | Purchase Amount | Item Description
1  | Prod-1     | 1      | 300             | A jumper
2  | Prod-1     | 2      | 300             | A jumper
3  | Prod-1     | 2      | 300             | A jumper
4  | Prod-2     | 3      | 100             | A rubber chicken
5  | Prod-1     | 3      | 300             | A jumper
Joining Two Tables (Contd.)

Prod-1 was bought by users 1, 2 and 3 (locations US, GB and FR); Prod-2 was bought only by user 3 (location FR).
Joining Two Tables (Contd.)

Joining the two tables and counting the distinct user locations per product gives:

Product | Locations
Prod-1  | 3
Prod-2  | 1
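In HiveQL this result could be produced with a join and an aggregate; the table and column names below are illustrative, since the slides do not show the actual DDL:

```sql
SELECT t.product_id,
       COUNT(DISTINCT u.location) AS locations
FROM transactions t
JOIN users u ON (t.userid = u.id)
GROUP BY t.product_id;
```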

Hive UDF

Revisiting Use Case in Healthcare
Steps:
 » Load the CSV file into Hive
 » Hive stores the data internally on HDFS
 » Read the data from the Hive table
 » De-identify the columns with a Hive script and store the data back in a Hive table

HealthCare UDF

package myudf;

import javax.crypto.*;
import javax.crypto.spec.SecretKeySpec;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import org.apache.commons.codec.binary.Base64;

// ... (UDF class declaration and evaluate() method omitted on the slide)

// Encrypts one column value with AES so it can be stored de-identified
private String encrypt(String strToEncrypt, byte[] key) throws NoSuchAlgorithmException,
        NoSuchPaddingException, InvalidKeyException, IllegalBlockSizeException, BadPaddingException {
    Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
    SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
    cipher.init(Cipher.ENCRYPT_MODE, secretKey);
    String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
    return encryptedString.trim();
}
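To see the encryption logic in isolation, here is a self-contained sketch of the same AES/ECB round trip using only the JDK (java.util.Base64 stands in for the Commons Codec class used in the UDF; the 16-byte key is a made-up example):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

public class AesRoundTrip {
    // Mirrors the UDF's encrypt(): AES/ECB with PKCS5 padding, Base64-encoded
    public static String encrypt(String s, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return Base64.getEncoder().encodeToString(cipher.doFinal(s.getBytes("UTF-8")));
    }

    // The inverse operation, used later in the module to verify the output
    public static String decrypt(String s, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"));
        return new String(cipher.doFinal(Base64.getDecoder().decode(s)), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "0123456789abcdef".getBytes("UTF-8"); // 16-byte AES-128 key
        String cipherText = encrypt("John Doe", key);
        System.out.println(cipherText);
        System.out.println(decrypt(cipherText, key)); // prints John Doe
    }
}
```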

HealthCare UDF (Contd.)
 Adding the myudf jar.

 Creating the healthCareSampleDS table and loading the health_Sample_dataset1.csv file into the table.

HealthCare UDF (Contd.)
 Creating a function deIdentify for the UDF.

 Creating the healthCareSampleDSDeidentified table, applying our UDF to all the attributes.

HealthCare UDF (Contd.)
 Storing the output in the local directory hive/output.

HealthCare UDF (Contd.)
 Storing the output on HDFS in the out directory.

HealthCare UDF (Contd.)
 The output after decrypting the healthcare dataset.

Assignment for Hive
Refer to the documents present in the LMS under Assignment.

 Execute the "Calculating a Stock's Covariance" assignment

Pre-work
Go through http://www.edureka.in/blog/map-side-join-vs-join/

Practice the Hive Health Care use case

Agenda for Next Class
 Joins in Hive
 Dynamic Partitioning in Hive
 Custom MapReduce Scripts
 Hive UDF
 Introduction to HBase
 HBase Storage Architecture
 Cluster Deployment

Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!

Please spare a few minutes to take the survey after the webinar.

