
Module-6

HIVE

www.edureka.co/big-data-and-hadoop
Course Topics
 Module 1 » Understanding Big Data and Hadoop
 Module 2 » Hadoop Architecture and HDFS
 Module 3 » Hadoop MapReduce Framework
 Module 4 » Advanced MapReduce
 Module 5 » PIG
 Module 6 » HIVE
 Module 7 » Advanced HIVE and HBase
 Module 8 » Advanced HBase
 Module 9 » Processing Distributed Data with Apache Spark
 Module 10 » Oozie and Hadoop Project

Objectives
At the end of this module, you will be able to:

 Understand what Hive is and its use cases

 Analyze the differences between Hive and Pig

 Understand the Hive architecture and Hive components

 Analyze the limitations of Hive

 Implement primitive and complex types in Hive

 Understand the Hive data model

 Perform basic Hive operations

 Execute Hive scripts and Hive UDFs

Hive Background
 Started at Facebook
 Data was collected by nightly cron jobs into an Oracle DB
 "ETL" via hand-coded Python
 Grew from tens of GBs (2006) to 1 TB/day of new data (2007), now 10x that

[Diagram: a Scribe server tier and a MySQL server tier feed a data collection server and an Oracle database]
Hive Use Case @ Facebook
Challenge…
 » > 950 million users
 » > 70k queries per day
 » > 300 million photos per day
 » > 500 Terabytes of data per day

Traditional RDBMS could not keep up, and MapReduce is hard to program, but users know SQL well.

Solution… HIVE:
 » Easy to plug in custom mapper/reducer code
 » Tables can be partitioned and bucketed
 » Hive tables can be defined directly on HDFS
 » Schema flexibility and evolution
 » Extensible: types, formats, functions, scripts
 » JDBC/ODBC drivers are available
What is Hive?
 Data warehousing package built on top of Hadoop
 Used for data analysis
 Targeted towards users comfortable with SQL
 Its query language is similar to SQL and is called HiveQL
 For managing and querying structured data
 Abstracts the complexity of Hadoop
 No need to learn Java and the Hadoop APIs
 Developed by Facebook and contributed to the community
 Facebook analyzed several terabytes of data every day using Hive

What is Hive? (Contd.)

 Defines a SQL-like query language called QL
 Provides a data warehouse infrastructure
 Allows programmers to plug in custom mappers and reducers
 Provides tools to enable easy data ETL

Where to use Hive?

HIVE applications:
 » Data mining
 » Log processing
 » Document indexing
 » Customer-facing business intelligence
 » Predictive modeling and hypothesis testing

Why go for Hive When Pig is there?

 PigLatin: Procedural data-flow language
» A = LOAD 'mydata';
» DUMP A;

 HiveQL: Declarative SQL-like language
» SELECT * FROM mytable;

 Pig is used by programmers and researchers; Hive is used by analysts generating daily reports.
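To make the contrast concrete, here is the same small report written both ways. The file, table and column names are illustrative, not from the slides:

```sql
-- HiveQL: declarative; you state the result you want
SELECT category, COUNT(*)
FROM mytable
WHERE amount > 100
GROUP BY category;

-- PigLatin: procedural; you spell out each step of the data flow
--   A = LOAD 'mydata' AS (category:chararray, amount:int);
--   B = FILTER A BY amount > 100;
--   C = GROUP B BY category;
--   D = FOREACH C GENERATE group, COUNT(B);
--   DUMP D;
```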

Why go for Hive When Pig is there? (Contd.)

Features                         Hive                 Pig

Language                         SQL-like             PigLatin
Schemas/Types                    Yes (explicit)       Yes (implicit)
Partitions                       Yes                  No
Server                           Optional (Thrift)    No
User Defined Functions (UDF)     Yes (Java)           Yes (Java)
Custom Serializer/Deserializer   Yes                  Yes
DFS Direct Access                Yes (implicit)       Yes (explicit)
Join/Order/Sort                  Yes                  Yes
Shell                            Yes                  Yes
Streaming                        Yes                  Yes
Web Interface                    Yes                  No
JDBC/ODBC                        Yes (limited)        No

Hive Architecture
[Architecture diagram]
 Applications (Karmasphere, Hue, Qubole and others) connect through the Hive Thrift Client, the Hive JDBC Driver or the Hive ODBC Driver
 These reach Hive via the CLI, the Hive Web Interface (HWI) or the Thrift Server
 The Driver (compiles, optimizes, executes) works with the Metastore
 Jobs execute on Hadoop: Resource Manager, Name Node and the DFS
Hive Components

HIVE components:
 » Shell
 » Driver
 » Compiler
 » Execution Engine
 » Metastore

Metastore
 Embedded metastore: the driver, the metastore and an embedded Derby database all run inside the Hive service JVM
 Local metastore: the metastore runs inside the Hive service JVM, but its data lives in a separate MySQL database
 Remote metastore: the metastore runs in its own server JVM, backed by MySQL; drivers connect to it over the network
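As a sketch, a MySQL-backed (local) metastore is configured in hive-site.xml through the standard JDO connection properties; the host name, database and credentials below are placeholders:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepass</value>
  </property>
</configuration>
```

For a remote metastore, clients instead point hive.metastore.uris at the metastore server, e.g. thrift://metastore-host:9083.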

Limitations of HIVE

 Not designed for online transaction processing
 Does not offer real-time queries and row-level updates
 Latency for Hive queries is generally very high (minutes)
 Provides acceptable (not optimal) latency for interactive data browsing

Abilities of HIVE Query Language
Hive Query Language provides the basic SQL-like operations

 Ability to filter rows from a table using a WHERE clause
 Ability to do equi-joins between two tables
 Ability to store the results of a query into another table
 Ability to manage tables and partitions (create, drop and alter)
 Ability to store the results of a query in a Hadoop DFS directory
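Each of these abilities maps to ordinary HiveQL. A sketch using the txnrecords table from later in this module (the customers and big_spenders tables are hypothetical):

```sql
-- Filter rows with a WHERE clause
SELECT * FROM txnrecords WHERE amount > 100;

-- Equi-join between two tables
SELECT t.txnno, c.name
FROM txnrecords t JOIN customers c ON (t.custno = c.custno);

-- Store the results of a query into another table
INSERT OVERWRITE TABLE big_spenders
SELECT custno, SUM(amount) FROM txnrecords GROUP BY custno;
```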

Differences with Traditional RDBMS
 Schema on Read vs Schema on Write

» Hive does not verify the data when it is loaded, but rather when a query is issued.
» Schema on read makes for a very fast initial load, since the data does not have to be read, parsed and
serialized to disk in the database’s internal format. The load operation is just a file copy or move.

 No updates, transactions or indexes.
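Schema on read can be observed directly: nothing is validated at load time, and unparsable values only surface as NULL when queried. A minimal sketch (table name and file path are illustrative):

```sql
CREATE TABLE readings (id INT, temp DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- The load is just a file copy/move; nothing is parsed or verified here,
-- even if some rows hold non-numeric text in the temp column
LOAD DATA LOCAL INPATH '/tmp/readings.csv' INTO TABLE readings;

-- The schema is applied only now; values that do not parse come back as NULL
SELECT * FROM readings;
```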

Type System

Primitive Types

 Boolean type
» BOOLEAN – TRUE/FALSE
 Integers
» TINYINT – 1 byte integer
» SMALLINT – 2 byte integer
» INT – 4 byte integer
» BIGINT – 8 byte integer
 Floating point numbers
» FLOAT – single precision
» DOUBLE – double precision
 String type
» STRING – sequence of characters
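A table touching each primitive family might look like this; the table and column names are illustrative:

```sql
CREATE TABLE sensor_events (
  active  BOOLEAN,   -- TRUE/FALSE
  level   TINYINT,   -- 1-byte integer
  port    SMALLINT,  -- 2-byte integer
  hits    INT,       -- 4-byte integer
  total   BIGINT,    -- 8-byte integer
  ratio   FLOAT,     -- single precision
  price   DOUBLE,    -- double precision
  label   STRING     -- sequence of characters
);
```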

Complex Types
 Complex Types can be built up from primitive types and other composite types using the following three operators:

» Structs: accessed using the DOT (.) notation.
» Maps (key-value tuples): accessed using the ['element name'] notation.
» Arrays (indexable lists): elements accessed using the [n] notation, where n is a zero-based index into the array.
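A sketch combining all three complex types in one table (names are illustrative):

```sql
CREATE TABLE employees (
  name   STRING,
  addr   STRUCT<city:STRING, zip:STRING>,
  phones MAP<STRING, STRING>,
  skills ARRAY<STRING>
);

SELECT
  addr.city,          -- struct: DOT notation
  phones['mobile'],   -- map: ['element name'] notation
  skills[0]           -- array: [n] notation, zero-based
FROM employees;
```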

Hive Data Models
 Databases (in the order of granularity)
» Namespaces

 Tables
» Schemas in namespaces

 Partitions
» Determine how data is stored in HDFS
» Group data based on the value of some column
» Can have one or more partition columns

 Buckets or Clusters
» Partitions divided further into buckets based on some other column
» Used for data sampling

(Example table columns: timestamp, userid, referer_url, page_url, IP)

Partitions
 Partition means dividing a table into coarse-grained parts based on the value of a partition column, such as a date. This makes it faster to do queries on slices of the data.

» Partition keys determine how the data is stored
» Each unique value of the partition keys defines a partition of the table
» Partitions are often named after dates, for convenience
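A sketch of a date-partitioned table; the table name and file path are illustrative:

```sql
CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);

-- Each load targets one partition, i.e. one HDFS subdirectory
LOAD DATA LOCAL INPATH '/tmp/logs-2014-01-01'
INTO TABLE logs PARTITION (dt = '2014-01-01');

-- Filtering on the partition column scans only that slice of the data
SELECT COUNT(*) FROM logs WHERE dt = '2014-01-01';
```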

Buckets
 Buckets give extra structure to the data that may be used for more efficient queries.

» A join of two tables that are bucketed on the same columns, including the join column, can be implemented as a Map Side Join.

» Bucketing by user ID means we can quickly evaluate a user-based query by running it on a randomized sample of the total set of users.
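A sketch of a bucketed table and a bucket-based sample; the table name and bucket count are illustrative:

```sql
-- Rows are assigned to one of 32 buckets by hashing the userid column
CREATE TABLE user_actions (userid INT, action STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Sampling: evaluate a query on 1 of the 32 buckets instead of the full table
SELECT * FROM user_actions
TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid);
```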

[Diagram: data is divided into buckets based on the value of a hash function of some columns of the table]

Create Database and Table
 Create a database.
» CREATE DATABASE retail;

 Use the database.
» USE retail;

Create Database and Table (Contd.)
 Create a table for storing transactional records.
» CREATE TABLE txnrecords (txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

External Tables
 Creates the table in another HDFS location, not in the warehouse directory

 Not managed by Hive

» CREATE EXTERNAL TABLE external_Table (dummy STRING) LOCATION '/user/notroot/external_table';

 Need to specify the HDFS location

 Hive does not delete the data (the HDFS files) even when the table is dropped

 It leaves the data untouched; only the metadata about the table is deleted

Load Data
 Load the data into the table.
» LOAD DATA LOCAL INPATH '/home/edureka/txns' OVERWRITE INTO TABLE txnrecords;

 Describe the metadata, or schema, of the table.
» DESCRIBE txnrecords;

Queries
 Select
» SELECT COUNT(*) FROM txnrecords;

 Aggregation
» SELECT COUNT(DISTINCT category) FROM txnrecords;

 Grouping
» SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
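Grouped results can also be filtered on the aggregate itself using HAVING; the threshold below is illustrative:

```sql
-- Keep only categories whose total spend exceeds 5000
SELECT category, SUM(amount) AS total
FROM txnrecords
GROUP BY category
HAVING SUM(amount) > 5000;
```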

Managing Outputs
 Inserting output into another table.
» INSERT OVERWRITE TABLE results SELECT * FROM txnrecords;

 Inserting output into a local directory.
» INSERT OVERWRITE LOCAL DIRECTORY 'results' SELECT * FROM txnrecords;

 Inserting output into HDFS.
» INSERT OVERWRITE DIRECTORY '/results' SELECT * FROM txnrecords;

Hive Command Blog

http://www.edureka.co/blog/hive-commands/

Hive Script
 Hive scripts are used to execute a set of Hive commands collectively. This helps reduce the time and effort invested in writing and executing each command manually.

 Hive supports scripting from version 0.10.0 onwards (example script file: myqueries.sql).

Hive Script (Contd.)
 Command to execute the Hive script: hive -f myqueries.sql

 The script runs all the queries one by one in a single go and saves the output in the hive/output directory.
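As a sketch, myqueries.sql could simply collect the statements used earlier in this module into one file:

```sql
-- myqueries.sql: run with `hive -f myqueries.sql`
CREATE DATABASE IF NOT EXISTS retail;
USE retail;

CREATE TABLE IF NOT EXISTS txnrecords (txnno INT, txndate STRING,
  custno INT, amount DOUBLE, category STRING, product STRING,
  city STRING, state STRING, spendby STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/edureka/txns' OVERWRITE INTO TABLE txnrecords;

SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
```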

Hive Script Blog

http://www.edureka.co/blog/apache-hadoop-hive-script/

Joining Two Tables
User Table

Id | Email         | Language | Location
1  | edureka@1.com | EN       | US
2  | edureka@2.com | EN       | GB
3  | edureka@3.com | FR       | FR

Transaction Table

Id | Product Id | UserId | Purchase Amount | Item Description
1  | Prod-1     | 1      | 300             | A jumper
2  | Prod-1     | 2      | 300             | A jumper
3  | Prod-1     | 2      | 300             | A jumper
4  | Prod-2     | 3      | 100             | A rubber chicken
5  | Prod-1     | 3      | 300             | A jumper
Joining Two Tables (Contd.)

Prod-1 was bought by users 1, 2 and 3 (locations US, GB and FR); Prod-2 was bought only by user 3 (location FR).
Joining Two Tables (Contd.)

Joining the two tables and counting the distinct user locations per product gives:

Product | Locations
Prod-1  | 3
Prod-2  | 1
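In HiveQL this result could be produced with a join and an aggregate; the table and column names below are illustrative, since the slides do not show the actual DDL:

```sql
SELECT t.product_id,
       COUNT(DISTINCT u.location) AS locations
FROM transactions t
JOIN users u ON (t.userid = u.id)
GROUP BY t.product_id;
```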

Hive UDF

Revisiting Use Case in Healthcare
Steps:
 » Load the CSV file into Hive
 » Hive stores the data internally on HDFS
 » Read the data from the Hive table
 » De-identify the columns with a Hive script and store the data back in a Hive table

HealthCare UDF

package myudf;

import javax.crypto.*;
import javax.crypto.spec.SecretKeySpec;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import org.apache.commons.codec.binary.Base64;

// ... (UDF class declaration and evaluate() method omitted on the slide)

// Encrypts one column value with AES so it can be stored de-identified
private String encrypt(String strToEncrypt, byte[] key) throws NoSuchAlgorithmException,
        NoSuchPaddingException, InvalidKeyException, IllegalBlockSizeException, BadPaddingException {
    Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
    SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
    cipher.init(Cipher.ENCRYPT_MODE, secretKey);
    String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
    return encryptedString.trim();
}
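To see the encryption logic in isolation, here is a self-contained sketch of the same AES/ECB round trip using only the JDK (java.util.Base64 stands in for the Commons Codec class used in the UDF; the 16-byte key is a made-up example):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

public class AesRoundTrip {
    // Mirrors the UDF's encrypt(): AES/ECB with PKCS5 padding, Base64-encoded
    public static String encrypt(String s, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return Base64.getEncoder().encodeToString(cipher.doFinal(s.getBytes("UTF-8")));
    }

    // The inverse operation, used later in the module to verify the output
    public static String decrypt(String s, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"));
        return new String(cipher.doFinal(Base64.getDecoder().decode(s)), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "0123456789abcdef".getBytes("UTF-8"); // 16-byte AES-128 key
        String cipherText = encrypt("John Doe", key);
        System.out.println(cipherText);
        System.out.println(decrypt(cipherText, key)); // prints John Doe
    }
}
```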

HealthCare UDF (Contd.)
 Adding the myudf jar.

 Creating the healthCareSampleDS table and loading the health_Sample_dataset1.csv file into the table.

HealthCare UDF (Contd.)
 Creating a function deIdentify for the UDF.

 Creating the healthCareSampleDSDeidentified table, applying our UDF to all the attributes.

HealthCare UDF (Contd.)
 Storing the output in the local directory hive/output.

HealthCare UDF (Contd.)
 Storing the output on HDFS in the out directory.

HealthCare UDF (Contd.)
 The output after decrypting the healthcare dataset.

Assignment for Hive
Refer to the documents present in the LMS under Assignment.

 Execute the "Calculating a Stock's Covariance" assignment

Pre-work
Go through http://www.edureka.in/blog/map-side-join-vs-join/

Practice the Hive Health Care use case

Agenda for Next Class
 Joins in Hive
 Dynamic Partitioning in Hive
 Custom MapReduce Scripts
 Hive UDF
 Introduction to HBase
 HBase Storage Architecture
 Cluster Deployment

Survey
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us make the course better!

Please spare a few minutes to take the survey after the webinar.

