How To Make The Best Use Of Live Sessions

• Please log in 10 minutes before the class starts and check your internet connection to avoid any network issues during the LIVE session

• All participants will be on mute by default to avoid background noise. However, the instructor will unmute you if required. Please use the “Questions” tab on your webinar tool to interact with the instructor at any point during the class

• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic

• If you want to connect to your Personal Learning Manager (PLM), dial +917618772501

• We have a dedicated support team to assist with all your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772

• Your feedback is very much appreciated. Please share feedback after each class; it will help us enhance your learning experience

Copyright © edureka and/or its affiliates. All rights reserved.


Big Data & Hadoop Certification Training


Course Outline
▪ Understanding Big Data and Hadoop
▪ Hadoop Architecture and HDFS
▪ Hadoop MapReduce Framework
▪ Advance MapReduce
▪ Pig
▪ Hive
▪ Advance Hive and HBase
▪ Advance HBase
▪ Processing Distributed Data with Apache Spark
▪ Apache Oozie and Hadoop Project


Module 6: Hive


Topics
Following are the topics covered in this module:
▪ What is Hive?
▪ Hive Use Cases
▪ Hive Architecture
▪ Hive Components
▪ Limitations of Hive
▪ Type System
▪ Hive Data Models
▪ Creating Database and Tables
▪ Hive Queries
▪ Hive Scripts
▪ Joining Two Tables
▪ Hive UDF


Objectives
At the end of this module, you will be able to:

▪ Understand what Hive is and its use cases
▪ Analyze the difference between Hive and Pig
▪ Understand Hive architecture and Hive components
▪ Analyze the limitations of Hive
▪ Implement primitive and complex types in Hive
▪ Understand the Hive data model
▪ Perform basic Hive operations
▪ Execute Hive scripts and Hive UDFs


Hive Background
▪ Started at Facebook
▪ Data was collected by nightly cron jobs into Oracle DB
▪ “ETL” via hand-coded Python
▪ Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that

Diagram labels: Scribe Server Tier, MySQL Server Tier, Data Collection Server, Oracle Database


Hive Use Case @ Facebook

What is Hive?

▪ Data warehousing package built on top of Hadoop
▪ Used for data analysis
▪ Targeted towards users comfortable with SQL
▪ Its query language, HiveQL, is similar to SQL
▪ For managing and querying structured data
▪ Abstracts the complexity of Hadoop
▪ No need to learn Java and the Hadoop APIs
▪ Developed by Facebook and contributed to the community
▪ Facebook analyzed several terabytes of data every day using Hive


What is Hive? (Contd.)

▪ Data warehouse infrastructure
▪ Defines a SQL-like query language called QL
▪ Allows programmers to plug in custom mappers and reducers
▪ Provides tools to enable easy data ETL


Where to Use Hive?

Hive Applications:
▪ Data Mining
▪ Log Processing
▪ Document Indexing
▪ Customer-facing Business Intelligence
▪ Predictive Modeling, Hypothesis Testing


Why Go for Hive When Pig is There?

Pig Latin
▪ Procedural data-flow language
▪ A = load 'mydata';
▪ Dump A;
▪ Pig is used by programmers and researchers

HiveQL
▪ Declarative SQLish language
▪ SELECT * FROM mytable;
▪ Hive is used by analysts generating daily reports


Why Go for Hive When Pig is There? (Contd.)
Features                         Hive               Pig
Language                         SQL-like           Pig Latin
Schemas/Types                    Yes (explicit)     Yes (implicit)
Partitions                       Yes                No
Server                           Optional (Thrift)  No
User Defined Functions (UDF)     Yes (Java)         Yes (Java)
Custom Serializer/Deserializer   Yes                Yes
DFS Direct Access                Yes (implicit)     Yes (explicit)
Join/Order/Sort                  Yes                Yes
Shell                            Yes                Yes
Streaming                        Yes                Yes
Web Interface                    Yes                No
JDBC/ODBC                        Yes (limited)      No

Hive Architecture
Layers shown in the diagram, from top to bottom:
▪ Front-end tools: Karmasphere, Hue, Qubole, others
▪ Client applications: Thrift application (Hive Thrift Client), JDBC application (Hive JDBC Driver), ODBC application (Hive ODBC Driver)
▪ Hive services: CLI, HWI, Thrift Server
▪ Driver (compiles, optimizes, executes) and Metastore
▪ Hadoop: Master (Resource Manager), DFS (Name Node)

Hive Components

▪ Shell
▪ Driver
▪ Compiler
▪ Execution Engine
▪ Metastore


Metastore
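▪ Embedded Metastore: the Driver and the Metastore run inside the Hive service JVM and use an embedded Derby database
▪ Local Metastore: each Hive service JVM runs its own Driver and Metastore, and the Metastores connect to a MySQL database
▪ Remote Metastore: the Driver runs in the Hive service JVM, while the Metastore servers run in their own JVMs and connect to MySQL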


Limitations of Hive

▪ Not designed for online transaction processing
▪ Does not offer real-time queries and row-level updates
▪ Latency for Hive queries is generally very high (minutes)
▪ Provides acceptable (not optimal) latency for interactive data browsing


Abilities of Hive Query Language
Hive Query Language provides the basic SQL-like operations:
▪ Ability to filter rows from a table using a ‘where’ clause
▪ Ability to store the results of a query into another table
▪ Ability to do equi-joins between two tables
▪ Ability to manage tables and partitions (create, drop & alter)
▪ Ability to store the results of a query in a Hadoop DFS directory


Differences with Traditional RDBMS

▪ Schema on read (Hive) vs schema on write (traditional RDBMS)

▪ Hive does not verify the data when it is loaded, but rather when a query is issued.

▪ Schema on read makes for a very fast initial load, since the data does not have to be read, parsed and serialized to disk in the database’s internal format. The load operation is just a file copy or move.

▪ No updates, transactions or indexes.
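
The schema-on-read idea can be sketched in plain Java (this is not Hive code). Loading only stores the raw lines, and the column types are applied when a query reads them. The class name, the sample rows and the two-column layout (txnno INT, amount DOUBLE) are assumptions made for illustration. A value that does not parse is read as NULL instead of being rejected at load time, which mirrors how Hive's default delimited text format treats malformed fields.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics schema on read with an in-memory "file".
class SchemaOnReadSketch {
    // "Load" step: raw lines are stored as-is, with no parsing or validation.
    static List<String> load(List<String> rawLines) {
        return new ArrayList<>(rawLines);
    }

    // "Query" step: the assumed schema (txnno INT, amount DOUBLE) is applied while reading.
    // A row whose amount cannot be parsed is skipped, like a NULL in SUM.
    static double sumAmount(List<String> table) {
        double total = 0.0;
        for (String line : table) {
            String[] fields = line.split(",");
            Double amount = parseDoubleOrNull(fields.length > 1 ? fields[1] : null);
            if (amount != null) {
                total += amount;
            }
        }
        return total;
    }

    static Double parseDoubleOrNull(String s) {
        if (s == null) {
            return null;
        }
        try {
            return Double.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null; // malformed value is read as NULL
        }
    }

    public static void main(String[] args) {
        // The bad second row is accepted at load time and only noticed at query time.
        List<String> table = load(List.of("1,10.5", "2,not-a-number", "3,4.5"));
        System.out.println(sumAmount(table)); // prints 15.0
    }
}
```

The load step never inspects the rows, which is why it can be as cheap as a file copy; the cost of parsing is paid on every query instead.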

Type System

Primitive Types
▪ Integers: TINYINT (1-byte integer), SMALLINT (2-byte integer), INT (4-byte integer), BIGINT (8-byte integer)
▪ Boolean Type: BOOLEAN (TRUE/FALSE)
▪ Floating Point Numbers: FLOAT (single precision), DOUBLE (double precision)
▪ String Type: STRING (sequence of characters)


Complex Types
▪ Complex types can be built up from primitive types and other composite types using the following three type constructors:
▪ Structs: fields are accessed using the DOT (.) notation.
▪ Maps (key-value tuples): values are accessed using the ['element name'] notation.
▪ Arrays (indexable lists): elements are accessed using the [n] notation, where n is a zero-based index into the array.

Hive Data Models
Hive data (in the order of granularity): Databases, Tables
Example table columns: timestamp, userid, referer_url, page_url, IP


Hive Data Models
Databases
▪ Namespaces

Tables
▪ Schemas in namespaces

Partitions
▪ How data is stored in HDFS
▪ Grouping data based on some column
▪ Can have one or more columns

Buckets or Clusters
▪ Partitions divided further into buckets based on some other column
▪ Used for data sampling


Partitions
Partitioning means dividing a table into coarse-grained parts based on the value of a partition column, such as a date. This makes queries on slices of the data faster.

Partitions
▪ Partition keys determine how the data is stored
▪ Each unique value of the partition keys defines a partition of the table
▪ Partitions are often named after dates for convenience
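
As a rough illustration of how partition keys determine where data is stored, the Java sketch below builds the directory path for one partition. Hive keeps each partition in a subdirectory named key=value under the table directory. The table name logs and the partition columns dt and country are hypothetical, and the sketch skips the escaping Hive applies to special characters in partition values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows the key=value directory layout of a partitioned table.
class PartitionPathSketch {
    // Builds <tableDir>/<key1>=<value1>/<key2>=<value2>/... in partition-key order.
    static String partitionPath(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Hypothetical table "logs" partitioned by dt and then country.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("dt", "2015-01-01");
        spec.put("country", "US");
        System.out.println(partitionPath("/user/hive/warehouse/logs", spec));
        // prints /user/hive/warehouse/logs/dt=2015-01-01/country=US
    }
}
```

A query that filters on dt only has to read the matching subdirectory, which is where the speed-up on slices of the data comes from.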


Buckets
▪ Buckets give extra structure to the data that may be used for more efficient queries.
▪ A join of two tables that are bucketed on the same columns, including the join column, can be implemented as a map-side join.
▪ Bucketing by user ID means a user-based query can be evaluated quickly by running it on a randomized sample of the total set of users.

Buckets (Cluster)
▪ Data is divided into buckets based on the value of a hash function of some columns of the table.

Creating Database And Tables


Create Database and Table
Create Database:
▪ CREATE DATABASE retail;
Use Database:
▪ USE retail;


Create Database and Table (Contd.)

Create table for storing transactional records:
▪ CREATE TABLE txnrecords(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;


External Tables

Create the table in another HDFS location and not in the warehouse directory
▪ For an external table, Hive is not responsible for managing the data
▪ CREATE EXTERNAL TABLE external_Table (dummy STRING) LOCATION 'path/to/hdfs/directory';
▪ The LOCATION clause specifies the HDFS location where the table data resides
▪ Hive does not delete the table data (the HDFS files) when the table is dropped
▪ It leaves the data untouched; only the metadata about the table is deleted


Load Data into Table and Describe Schema
Load the data into the table:
▪ LOAD DATA LOCAL INPATH '/home/edureka/txns' OVERWRITE INTO TABLE txnrecords;

Describe the metadata or schema of the table:
▪ DESCRIBE txnrecords;


Hive Queries


Queries
Select
▪ SELECT COUNT(*) FROM txnrecords;


Queries
Aggregation:
▪ SELECT COUNT(DISTINCT category) FROM txnrecords;


Queries
Grouping:
▪ SELECT category, SUM( amount ) FROM txnrecords GROUP BY category;


Managing Outputs
Inserting Output into another table:

▪ INSERT OVERWRITE TABLE results SELECT * FROM txnrecords;

Inserting Output into local file:

▪ INSERT OVERWRITE LOCAL DIRECTORY 'results' SELECT * FROM txnrecords;

Inserting Output into HDFS:

▪ INSERT OVERWRITE DIRECTORY '/results' SELECT * FROM txnrecords;


Hive Command Blog

http://www.edureka.co/blog/hive-commands/

Hive Script
▪ Hive scripts are used to execute a set of Hive commands collectively. This reduces the time and effort spent writing and executing each command manually.
▪ Hive supports scripting from Hive 0.10.0 onwards.
▪ Diagram: the script file myqueries.sql is passed to hive.


Hive Script (Contd.)
▪ Command to execute the Hive script: hive -f myqueries.sql

▪ The script runs and executes all the queries one after another in a single run.
▪ The final output is saved in the /user/hive/warehouse/healthdb.db/healthcaresampledsdeidentified directory.


Hive Script Blog

http://www.edureka.co/blog/apache-hadoop-hive-script/


Joining Two Tables
User Table
Id   Email           Language   Location
1    edureka@1.com   EN         US
2    edureka@2.com   EN         GB
3    edureka@3.com   FR         FR

Transaction Table
Id   Product Id   UserId   Purchase Amount   Item Description
1    Prod-1       1        300               A jumper
2    Prod-1       2        300               A jumper
3    Prod-1       2        300               A jumper
4    Prod-2       3        100               A rubber chicken
5    Prod-1       3        300               A jumper

Joining Two Tables (Contd.)
Result of joining the Transaction Table to the User Table on the user Id and counting the distinct user locations for each product:

Product   Locations
Prod-1    3
Prod-2    1
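
The result for this example (Prod-1 in 3 locations, Prod-2 in 1) can be reproduced with a small Java sketch of the join logic. It is a plain in-memory illustration, not Hive code: each transaction row is matched to its user by user Id, and the distinct locations are collected per product. The class and method names are hypothetical; the rows are the ones from the two tables.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: in-memory equi-join followed by a distinct count per product.
class JoinSketch {
    // userLocation: user Id -> Location; transactions: rows of {Product Id, UserId}.
    static Map<String, Integer> distinctLocationsPerProduct(
            Map<Integer, String> userLocation, String[][] transactions) {
        Map<String, Set<String>> locationsByProduct = new TreeMap<>();
        for (String[] txn : transactions) {
            String product = txn[0];
            String location = userLocation.get(Integer.parseInt(txn[1])); // join on user Id
            if (location == null) {
                continue; // inner join: transactions without a matching user are dropped
            }
            locationsByProduct.computeIfAbsent(product, k -> new HashSet<>()).add(location);
        }
        Map<String, Integer> counts = new TreeMap<>();
        locationsByProduct.forEach((product, locations) -> counts.put(product, locations.size()));
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, String> users = new HashMap<>();
        users.put(1, "US");
        users.put(2, "GB");
        users.put(3, "FR");
        String[][] txns = {
            {"Prod-1", "1"}, {"Prod-1", "2"}, {"Prod-1", "2"}, {"Prod-2", "3"}, {"Prod-1", "3"}
        };
        System.out.println(distinctLocationsPerProduct(users, txns)); // prints {Prod-1=3, Prod-2=1}
    }
}
```

User 2 bought Prod-1 twice, but GB is counted once, which is why Prod-1 has four transactions and only three locations.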


Hive UDF


Revisiting Use Case in Healthcare
▪ Load the CSV file into Hive; Hive stores the data internally on HDFS
▪ Hive script: read the data from the Hive table
▪ Hive script: de-identify the columns and store the data back in a Hive table


HealthCare UDF
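package myudf;

import java.nio.charset.StandardCharsets;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import javax.crypto.BadPaddingException;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.NoSuchPaddingException;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64; // assumed: Apache Commons Codec

// Helper method of the UDF class (the enclosing class is not shown on the slide).
// Encrypts one field value with AES and returns the ciphertext as a Base64 string.
private String encrypt(String strToEncrypt, byte[] key) throws NoSuchAlgorithmException, NoSuchPaddingException,
        InvalidKeyException, IllegalBlockSizeException, BadPaddingException
{
    Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
    SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
    cipher.init(Cipher.ENCRYPT_MODE, secretKey);
    // Use an explicit charset so the result does not depend on the platform default.
    String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes(StandardCharsets.UTF_8)));
    System.out.println("------------encryptedString" + encryptedString);
    return encryptedString.trim();
}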
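
For experimenting outside Hive, the encrypt helper can be paired with a matching decrypt in a standalone Java class. This is a sketch under stated assumptions rather than the course's UDF: the class name, the decrypt method and the 16-byte demo key are hypothetical, and it uses java.util.Base64 from the JDK in place of the Base64 helper class used on the slide.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only: standalone AES round trip with the same cipher settings as the slide.
class DeidentifySketch {
    static String encrypt(String plain, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] encrypted = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(encrypted);
    }

    // Hypothetical counterpart: reverses encrypt() when given the same key.
    static String decrypt(String encoded, byte[] key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] decrypted = cipher.doFinal(Base64.getDecoder().decode(encoded));
        return new String(decrypted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // 16 bytes gives an AES-128 key; demo value only, never hard-code real keys.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        String token = encrypt("John Smith", key);
        System.out.println(token.equals(encrypt("John Smith", key))); // prints true
        System.out.println(decrypt(token, key));                      // prints John Smith
    }
}
```

ECB mode is deterministic, so equal inputs always produce equal tokens. That keeps de-identified columns usable for grouping and joins, but it also reveals which rows share a value, so a different cipher mode may be preferable where that matters.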


HealthCare UDF (Contd.)
▪ Adding the myudf jar.
▪ Creating the healthCareSampleDS table and loading the health_Sample_dataset1.csv file into the table.
▪ Creating a function deIdentify for the UDF.
▪ Creating the healthCareSampleDSDeidentified table, applying our UDF on all the attributes.
▪ Storing the output in the local directory hive/output.
▪ Storing the output on HDFS in the out directory.
▪ Viewing the output after decrypting the healthcare dataset.


Assignment for Hive
Refer to the documents in the LMS under Assignment.

▪ Execute the Calculating Stock’s Covariance assignment


Pre-work
Go through: http://www.edureka.in/blog/map-side-join-vs-join/

Practice Hive Health Care Use-Case


Agenda for Next Class
▪ Joins in Hive
▪ Dynamic Partitioning in Hive
▪ Custom MapReduce Scripts
▪ Hive UDF
▪ Introduction to HBase
▪ HBase Storage Architecture
▪ Cluster Deployment
