How To Make The Best Use of Live Sessions: Log in 10 Mins Before

How To Make The Best Use Of Live Sessions
• Please log in 10 mins before the class starts and check your internet connection to avoid any network issues during the LIVE session
• All participants will be on mute by default to avoid any background noise. However, the instructor will unmute you if required. Please use the “Questions” tab on your webinar tool to interact with the instructor at any point during the class
• Feel free to ask and answer questions to make your learning interactive. The instructor will address your queries at the end of the ongoing topic
• If you want to connect to your Personal Learning Manager (PLM), dial +917618772501
• We have a dedicated support team to assist with all your queries. You can reach us anytime on the numbers below:
US: 1855 818 0063 (Toll-Free) | India: +91 9019117772
• Your feedback is very much appreciated. Please share feedback after each class, which will help us enhance your learning experience
Copyright © edureka and/or its affiliates. All rights reserved.

Big Data & Hadoop Certification Training

Course Outline
▪ Understanding Big Data and Hadoop
▪ Hadoop Architecture and HDFS
▪ Hadoop MapReduce Framework
▪ Advance MapReduce
▪ Pig
▪ Hive
▪ Advance Hive and HBase
▪ Advance HBase
▪ Processing Distributed Data with Apache Spark
▪ Apache Oozie and Hadoop Project

Module 6: Hive

Topics
Following are the topics covered in this module:
▪ What is Hive?
▪ Hive Use Cases
▪ Hive Architecture
▪ Hive Components
▪ Limitations of Hive
▪ Type System
▪ Hive Data Models
▪ Creating Database and Tables
▪ Hive Queries
▪ Hive Scripts
▪ Joining Two Tables
▪ Hive UDF

Objectives
At the end of this module, you will be able to:
▪ Understand What is Hive and its Use Cases
▪ Analyze difference between Hive and Pig
▪ Understand Hive Architecture and Hive Components
▪ Analyze limitations of Hive
▪ Implement Primitive and Complex types in Hive
▪ Understand Hive Data Model
▪ Perform basic Hive operations
▪ Execute Hive scripts and Hive UDFs

Hive Background
▪ Started at Facebook
▪ Data was collected by nightly cron jobs into Oracle DB
▪ “ETL” via hand-coded Python
▪ Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that
[Diagram: Scribe Server Tier, MySQL Server Tier, Data Collection Server, Oracle Database Server]

Hive Use Case @ Facebook

What Is Hive?

What is Hive?
▪ Data Warehousing package built on top of Hadoop
▪ Used for data analysis
▪ Targeted towards users comfortable with SQL
▪ Its query language, HiveQL, is similar to SQL
▪ For managing and querying structured data
▪ Abstracts complexity of Hadoop
▪ No need to learn Java and Hadoop APIs
▪ Developed by Facebook and contributed to community
▪ Facebook analyzed several Terabytes of data every day using Hive

What is Hive? (Contd.)
▪ Data Warehouse Infrastructure
▪ Defines SQL-Like Query Language called QL
▪ Allows programmers to plug in custom mappers and reducers
▪ Provides tools to enable easy data ETL

Where to Use Hive?
Hive Applications:
▪ Data Mining
▪ Log Processing
▪ Document Indexing
▪ Customer-facing Business Intelligence
▪ Predictive Modeling, Hypothesis Testing

Why Go for Hive When Pig is There?
Pig Latin
▪ Procedural data-flow language
▪ A = load 'mydata';
▪ Dump A;
▪ Pig is used by Programmers and Researchers
HiveQL
▪ Declarative SQLish language
▪ SELECT * FROM mytable;
▪ Hive is used by Analysts generating daily reports

Why Go for Hive When Pig is There? (Contd.)
Features | Hive | Pig
Language | SQL-like | PigLatin
Schemas/Types | Yes (explicit) | Yes (implicit)
Partitions | Yes | No
Server | Optional (Thrift) | No
User Defined Functions (UDF) | Yes (Java) | Yes (Java)
Custom Serializer/Deserializer | Yes | Yes
DFS Direct Access | Yes (implicit) | Yes (explicit)
Join/Order/Sort | Yes | Yes
Shell | Yes | Yes
Streaming | Yes | Yes
Web Interface | Yes | No
JDBC/ODBC | Yes (limited) | No

Hive Architecture

Hive Architecture
[Diagram: Hive architecture, top to bottom]
▪ Tools: Karmasphere, Hue, Qubole, Others…
▪ Client applications: Thrift Application (Hive Thrift Client), JDBC Application (Hive JDBC Driver), ODBC Application (Hive ODBC Driver)
▪ Hive services: CLI, HWI, Thrift Server
▪ Driver (compiles, optimizes, executes) and Metastore
▪ Hadoop: Master (*Resource Manager), DFS (Name Node)

Hive Components

Hive Components
Hive Components:
▪ Shell
▪ Driver
▪ Compiler
▪ Execution Engine
▪ Metastore

Metastore
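[Diagram: three metastore configurations]
▪ Embedded Metastore: Driver, Metastore and Derby all run inside the Hive Service JVM
▪ Local Metastore: each Hive Service JVM runs its own Driver and Metastore, which connect to MySQL
▪ Remote Metastore: the Hive Service JVMs run the Driver; the Metastore runs in separate Metastore Server JVMs, which connect to MySQL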

Limitations of Hive
▪ Not designed for online transaction processing
▪ Does not offer real-time queries and row level updates
▪ Latency for Hive queries is generally very high (minutes)
▪ Provides acceptable (not optimal) latency for interactive data browsing

Abilities of Hive Query Language
Hive Query Language provides the basic SQL-like operations
▪ Ability to filter rows from a table using a ‘where’ clause
▪ Ability to store the results of a query into another table
▪ Ability to do equi-joins between two tables
▪ Ability to manage tables and partitions (create, drop & alter)
▪ Ability to store the results of a query in a Hadoop dfs directory

Differences with Traditional RDBMS
▪ Schema on Read vs Schema on Write
▪ Hive does not verify the data when it is loaded, but rather when a query is issued.
▪ Schema on read makes for a very fast initial load, since the data does not have to be read, parsed and serialized to disk in the database’s internal format. The load operation is just a file copy or move.
▪ No Updates, Transactions and Indexes.
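The schema-on-read idea can be illustrated with a small Python sketch. This is not Hive code: the `SCHEMA` list, the `load` and `query` functions, and the sample rows are all hypothetical, and turning a malformed field into `None` only mimics the way unparsable values typically show up as NULL when Hive reads a text file.

```python
import csv
import io

# Hypothetical schema: (column name, parser applied at query time).
SCHEMA = [("txnno", int), ("amount", float), ("category", str)]

def load(raw_text):
    """'Load' step: the data is stored untouched, like a file copy or move."""
    return raw_text  # no parsing and no validation happen here

def query(stored_text):
    """Query step: the schema is applied only now, while the data is read."""
    rows = []
    for fields in csv.reader(io.StringIO(stored_text)):
        row = {}
        for (name, cast), value in zip(SCHEMA, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # a malformed field only surfaces at read time
        rows.append(row)
    return rows

raw = "1,40.5,Games\n2,not-a-number,Books\n"
table = load(raw)        # succeeds even though the second row is malformed
result = query(table)    # the problem becomes visible only here
print(result)
```

A schema-on-write system would instead reject or convert the second row during the load, which is why its loads are slower but its queries can trust the stored format.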

Type System

Type System
Primitive Types
▪ Integers:
TINYINT – 1 byte integer
SMALLINT – 2 byte integer
INT – 4 byte integer
BIGINT – 8 byte integer
▪ Boolean Type:
BOOLEAN – TRUE/FALSE
▪ Floating Point Numbers:
FLOAT – Single Precision
DOUBLE – Double Precision
▪ String Type:
STRING – Sequence of characters

Complex Types
▪ Complex Types can be built up from primitive types and other composite types using the following three operators:
▪ Structs: fields can be accessed using the DOT (.) notation.
▪ Maps (key-value tuples): values can be accessed using the [‘element name’] notation.
▪ Arrays (indexable lists): elements can be accessed using the [n] notation, where n is a zero-based index into the array.
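As a rough analogy only, the three access notations map onto familiar Python structures. The column names and values below are made up for illustration; in Hive the same accesses would be written inside a query.

```python
from collections import namedtuple

# Hypothetical STRUCT with two fields.
Address = namedtuple("Address", ["street", "city"])

# One hypothetical row holding a column of each complex type.
row = {
    "address": Address(street="1 Main St", city="Pune"),   # like STRUCT<street, city>
    "phones": {"home": "555-0100", "work": "555-0199"},    # like MAP<STRING, STRING>
    "skills": ["hive", "pig", "hbase"],                    # like ARRAY<STRING>
}

city = row["address"].city      # struct field, as in address.city
work = row["phones"]["work"]    # map lookup, as in phones['work']
first = row["skills"][0]        # array element, as in skills[0] (zero-based)

print(city, work, first)
```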

Hive Data Models

Hive Data Models
Hive Data (In the order of granularity)
▪ Databases
▪ Tables
[Diagram: example table with columns timestamp, Userid, referer_url, page_url, IP]

Hive Data Models
Databases
▪ Namespaces
Tables
▪ Schemas in namespaces
Partitions
▪ How data is stored in HDFS
▪ Grouping data based on some column
▪ Can have one or more columns
Buckets or Clusters
▪ Partitions divided further into buckets based on some other column
▪ Used for data sampling

Partitions
Partitioning means dividing a table into coarse-grained parts based on the value of a partition column such as a date. This makes it faster to run queries on slices of the data.
Partitions
▪ Partition keys determine how the data is stored
▪ Each unique value of the partition keys defines a partition of the table
▪ Partitions are named after dates for convenience
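The effect of partitioning can be sketched in Python. The rows, the `txndate` partition column and both function names are assumptions made for illustration; the `column=value` key only imitates the way Hive names partition directories.

```python
from collections import defaultdict

# Hypothetical rows: (txndate, custno, amount); txndate is the partition column.
rows = [
    ("2011-01-01", 4001, 40.0),
    ("2011-01-01", 4002, 15.5),
    ("2011-01-02", 4001, 99.0),
]

def partition_by_date(rows):
    """Group rows into one 'directory' per distinct partition-key value."""
    partitions = defaultdict(list)
    for txndate, custno, amount in rows:
        partitions[f"txndate={txndate}"].append((custno, amount))
    return dict(partitions)

def total_for_date(partitions, txndate):
    """Read only the matching partition instead of scanning every row."""
    matching = partitions.get(f"txndate={txndate}", [])
    return sum(amount for _, amount in matching)

partitions = partition_by_date(rows)
print(sorted(partitions))
print(total_for_date(partitions, "2011-01-01"))
```

A query that filters on the partition column touches a single group, which is why queries on slices of the data get faster.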

Buckets
▪ Buckets give extra structure to the data that may be used for more efficient queries.
▪ A join of two tables that are bucketed on the same columns, including the join column, can be implemented as a Map Side Join.
▪ Bucketing by user ID means we can quickly evaluate a user based query by running it on a randomized sample of the total set of users.
[Diagram: Data is split into Buckets (Cluster), based on the value of a hash function of some columns of the table.]
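Bucket assignment can be sketched as a hash of the bucketing column taken modulo the number of buckets. The bucket count, the user IDs and the function names below are hypothetical, and using the integer itself as its hash value is a simplifying assumption.

```python
NUM_BUCKETS = 4  # hypothetical number of buckets

def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Pick a bucket from a hash of the bucketing column.

    Simplifying assumption: an integer user ID is used as its own hash value.
    """
    return user_id % num_buckets

def bucketize(user_ids, num_buckets=NUM_BUCKETS):
    """Distribute user IDs over the buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for uid in user_ids:
        buckets[bucket_for(uid, num_buckets)].append(uid)
    return buckets

users = list(range(1, 21))   # user IDs 1..20
buckets = bucketize(users)
sample = buckets[0]          # reading one bucket yields about a quarter of the users
print(sample)
```

Because the same user ID always lands in the same bucket, two tables bucketed the same way on the join column can be joined bucket by bucket.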

Creating Database And Tables

Create Database and Table
Create Database:
▪ CREATE DATABASE retail;
Use Database:
▪ USE retail;

Create Database and Table (Contd.)
Create table for storing transactional records:
▪ CREATE TABLE txnrecords(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

External Tables
Create the table in another HDFS location and not in the warehouse directory
▪ For an external table, Hive is not responsible for managing the data
▪ CREATE EXTERNAL TABLE external_Table (dummy STRING) LOCATION 'path/to/hdfs/directory';
▪ The LOCATION clause specifies the HDFS location where the table data resides
▪ Hive does not delete the table data (the HDFS files) even when the table is dropped
▪ It leaves the data untouched; only the metadata about the table is deleted

Load Data into Table and Describe Schema
Load the data into the table.
▪ LOAD DATA LOCAL INPATH '/home/edureka/txns' OVERWRITE INTO TABLE txnrecords;
Describing metadata or schema of the table.

▪ DESCRIBE txnrecords;

Hive Queries

Queries
Select:
▪ SELECT COUNT(*) FROM txnrecords;

Queries
Aggregation:
▪ SELECT COUNT (DISTINCT category) FROM txnrecords;

Queries
Grouping:
▪ SELECT category, SUM( amount ) FROM txnrecords GROUP BY category;
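To make clear what the three queries above compute, here is a plain Python equivalent over a few made-up rows. The sample data is hypothetical, and the sketch says nothing about how Hive executes these queries on the cluster.

```python
from collections import defaultdict

# Hypothetical sample of txnrecords, reduced to (category, amount).
txnrecords = [("Games", 40.0), ("Books", 10.0), ("Games", 60.0), ("Toys", 25.0)]

# SELECT COUNT(*) FROM txnrecords;
row_count = len(txnrecords)

# SELECT COUNT(DISTINCT category) FROM txnrecords;
distinct_categories = len({category for category, _ in txnrecords})

# SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
totals = defaultdict(float)
for category, amount in txnrecords:
    totals[category] += amount

print(row_count, distinct_categories, dict(totals))
```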

Managing Outputs
Inserting Output into another table:
▪ INSERT OVERWRITE TABLE results SELECT * FROM txnrecords;
Inserting Output into local file:
▪ INSERT OVERWRITE LOCAL DIRECTORY 'results' SELECT * FROM txnrecords;
Inserting Output into HDFS:
▪ INSERT OVERWRITE DIRECTORY '/results' SELECT * FROM txnrecords;

Hive Command Blog
http://www.edureka.co/blog/hive-commands/

Hive Script

Hive Script
▪ Hive scripts are used to execute a set of Hive commands collectively. This helps in reducing the time and effort invested in writing and executing each command manually.
▪ Hive supports scripting from Hive 0.10.0 and above versions.
[Diagram: myqueries.sql is run as a hive script]

Hive Script (Contd.)
▪ Command to execute the hive script: hive -f myqueries.sql
▪ The script runs and executes all the queries one by one in a single go.
▪ The final output is saved in /user/hive/warehouse/healthdb.db/healthcaresampledsdeidentified directory.
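A script run can also be driven from Python, as in the sketch below. The query text reuses statements from earlier slides rather than the healthcare script, and `write_script`, `hive_command` and `run_script` are hypothetical helper names; only the `hive -f <file>` command itself comes from the slide.

```python
import subprocess
import tempfile
from pathlib import Path

# Statements reused from earlier slides; the real script contents may differ.
QUERIES = """USE retail;
SELECT COUNT(*) FROM txnrecords;
SELECT category, SUM(amount) FROM txnrecords GROUP BY category;
"""

def write_script(directory, name="myqueries.sql"):
    """Save all queries into one script file."""
    path = Path(directory) / name
    path.write_text(QUERIES)
    return path

def hive_command(script_path):
    """Build the command line from the slide: hive -f <script>."""
    return ["hive", "-f", str(script_path)]

def run_script(script_path):
    """Execute the script; requires a working Hive installation."""
    return subprocess.run(hive_command(script_path), check=True)

workdir = tempfile.mkdtemp()
script = write_script(workdir)
command = hive_command(script)
print(command)
```

Calling `run_script(script)` on a machine with Hive installed runs every statement in the file in order, which is the same effect as typing `hive -f myqueries.sql` in a shell.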

Hive Script Blog
http://www.edureka.co/blog/apache-hadoop-hive-script/

Joining Two Tables
User Table
Id | Email | Language | Location
1 | edureka@1.com | EN | US
2 | edureka@2.com | EN | GB
3 | edureka@3.com | FR | FR
Transaction Table
Id | Product Id | UserId | Purchase Amount | Item Description
1 | Prod-1 | 1 | 300 | A jumper
4 | Prod-2 | 3 | 100 | A rubber chicken

Joining Two Tables (Contd.)
[Diagram: each row of the User Table is matched in turn against the Transaction Table on the user Id, producing joined rows that combine the user columns with the product columns]
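The join of the two sample tables can be traced with a short Python sketch. The data is copied from the User and Transaction tables above; the dictionary keys and the `equi_join` function are illustrative names, and the hash-lookup approach is one simple way to do an inner equi-join, not necessarily how Hive plans it.

```python
# Rows copied from the User Table on the slide.
users = [
    {"id": 1, "email": "edureka@1.com", "language": "EN", "location": "US"},
    {"id": 2, "email": "edureka@2.com", "language": "EN", "location": "GB"},
    {"id": 3, "email": "edureka@3.com", "language": "FR", "location": "FR"},
]

# Rows copied from the Transaction Table on the slide.
transactions = [
    {"id": 1, "product_id": "Prod-1", "user_id": 1, "amount": 300, "item": "A jumper"},
    {"id": 4, "product_id": "Prod-2", "user_id": 3, "amount": 100, "item": "A rubber chicken"},
]

def equi_join(users, transactions):
    """Inner equi-join on users.id = transactions.user_id via a hash lookup."""
    users_by_id = {user["id"]: user for user in users}
    joined = []
    for txn in transactions:
        user = users_by_id.get(txn["user_id"])
        if user is not None:  # keep only transactions with a matching user
            joined.append((user["id"], user["email"], txn["product_id"], txn["amount"]))
    return joined

result = equi_join(users, transactions)
for row in result:
    print(row)
```

User 2 has no transaction, so it does not appear in the inner join result. Holding the smaller table in memory and streaming the other one past it is also the basic idea behind a map side join.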

Hive UDF

Revisiting Use Case in Healthcare
▪ Load CSV file into Hive
▪ Hive stores the data internally on HDFS
▪ Read data from Hive table
▪ De-identify columns and store the data back in a Hive table (Hive Script)

HealthCare UDF
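package myudf;

import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import javax.crypto.BadPaddingException;
import javax.crypto.Cipher;
import javax.crypto.IllegalBlockSizeException;
import javax.crypto.NoSuchPaddingException;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;

// Excerpt: helper method of the UDF class (the class declaration is not shown on the slide).
private String encrypt(String strToEncrypt, byte[] key) throws NoSuchAlgorithmException, NoSuchPaddingException,
        InvalidKeyException, IllegalBlockSizeException, BadPaddingException
{
    // Encrypt the value with AES and return it as a Base64 string.
    Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
    SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
    cipher.init(Cipher.ENCRYPT_MODE, secretKey);
    String encryptedString = Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
    System.out.println("------------encryptedString" + encryptedString);
    return encryptedString.trim();
}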

HealthCare UDF (Contd.)
▪ Adding myudf jar:
▪ Creating healthCareSampleDS table and loading health_Sample_dataset1.csv file in the table:

▪ Creating a function deIdentify for the UDF.
▪ Creating healthCareSampleDSDeidentified table, applying our UDF on all the attributes.

▪ Storing the output in a local directory hive/output

▪ Storing the output on HDFS in out directory.

▪ The output after decrypting the healthcare dataset.

Assignment for Hive
Refer to the documents present in the LMS under Assignment.
▪ Execute the Calculating Stock’s Covariance Assignment

Pre-work
Go through: http://www.edureka.in/blog/map-side-join-vs-join/
Practice Hive Health Care Use-Case

Agenda for Next Class
▪ Joins in Hive
▪ Dynamic Partitioning in Hive
▪ Custom MapReduce Scripts
▪ Hive UDF
▪ Introduction to HBase
▪ HBase Storage Architecture
▪ Cluster Deployment

