
UNIT – V: FRAMEWORKS

Objective: To introduce programming tools such as Pig and Hive in the Hadoop ecosystem.

Frameworks: Applications on Big Data Using Pig and Hive – Data processing operators in Pig – Hive services – HiveQL – Querying Data in Hive – Fundamentals of HBase and ZooKeeper – IBM InfoSphere BigInsights and Streams.

What is Hadoop?

Big Data consists of data in different formats, such as Excel spreadsheets, reports, log files,
videos, etc. Traditional databases failed to store, process, and analyze Big Data. The Hadoop
framework made this job easier with the help of various components in its ecosystem.

The Hadoop Distributed File System (HDFS) is where we store Big Data in a distributed
manner. Hadoop MapReduce is responsible for processing large volumes of data in a parallelly
distributed manner, and YARN in Hadoop acts as the resource management unit.

Apart from those Hadoop components, the Hadoop ecosystem has other capabilities that help
with Big Data processing. The following comprise the Hadoop ecosystem:
1. HDFS
2. HBase
3. Sqoop
4. Flume
5. Spark
6. Hadoop MapReduce
7. Pig
8. Impala
9. Hive
10. Cloudera Search
11. Oozie
12. Hue

Hive and Pig are two higher-level tools in this ecosystem for interacting with data stored in HDFS. Hive
is a data warehousing system which exposes an SQL-like language called HiveQL. Pig is an
analysis platform which provides a dataflow language called Pig Latin.

Pig Latin: The scripting language used in Apache Pig.


Data Lakehouse: A new type of data architecture that offers the best features of data
warehouses and data lakes.
Hadoop: An open-source software framework used for distributed storage and processing of
large datasets.
ETL: Extract, Transform, Load - a data integration process.
APACHE PIG:

PIG: High Level Data Processing

Pig is a scripting platform that runs on Hadoop clusters, designed to process and analyze large
datasets. Pig uses a language called Pig Latin, which is similar to SQL but requires far less code
to analyze data. Although it resembles SQL, it differs significantly: it is a procedural data-flow
language rather than a declarative one. Roughly 10 lines of Pig Latin are equivalent to about 200
lines of Java MapReduce code, which results in shorter development times.

Pig is a platform that allows developers to create complex data transformations using a high-level
language called Pig Latin, which is then converted into a series of MapReduce jobs to be executed
on Hadoop.
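As a quick illustration, the classic word-count flow can be written in a few lines of Pig Latin. This is a minimal sketch; the input file name and field names are assumed for illustration:

grunt> lines = LOAD 'input.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word; -- split each line into words
grunt> grouped = GROUP words BY word;                                  -- collect identical words together
grunt> counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt; -- count each group
grunt> STORE counts INTO 'wordcount_output'; -- only now does Pig launch the MapReduce jobs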

Key Features of Pig:


Pig offers several features that make it well-suited for data processing tasks:

i. High-Level Abstraction: Pig Latin simplifies the development of data processing
tasks by providing a high-level, easy-to-understand language.
ii. Extensibility: Pig supports custom functions, also known as User-Defined Functions
(UDFs), allowing developers to extend its functionality.
iii. Optimized Execution: Pig automatically optimizes the execution plan for a given
script, improving performance and resource utilization.

Pig Architecture: Key Components

The primary components of Pig’s architecture include:


Parser: The component responsible for parsing Pig Latin scripts and converting them into a
logical plan.
Optimizer: The component that optimizes the logical plan by applying various optimization
rules, such as predicate pushdown and projection pruning.
Compiler: The component that translates the optimized logical plan into a series of MapReduce
jobs.
Execution Engine: The component that executes the generated MapReduce jobs on the Hadoop
cluster.

Apache Hive:

Hive: SQL-like Data Querying and Analysis

Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. Hive
uses a query language called HiveQL, which is similar to SQL.

Hive is a data warehousing solution built on top of Hadoop, providing a SQL-like query language
called HiveQL for querying and analyzing data stored in HDFS or other storage systems.
Key Features of Hive
Hive offers several features that make it an ideal choice for data querying and analysis:
SQL-like Syntax: HiveQL allows users familiar with SQL to easily query and analyze data in
Hadoop.
Extensibility: Hive supports custom UDFs, User-Defined Aggregated Functions (UDAFs), and
User-Defined Table-Generating Functions (UDTFs) to extend its functionality.
Optimized Execution: Hive leverages query optimization techniques, such as cost-based
optimization and join optimizations, to improve query performance.

Hive Architecture: Key Components

The primary components of Hive’s architecture include:


Driver: The component responsible for managing the lifecycle of a HiveQL query, including
parsing, optimization, and execution.
Metastore: The component that stores metadata about the tables, partitions, and columns in the
Hive warehouse.
Query Compiler: The component that translates a HiveQL query into a series of MapReduce or
Tez jobs.
Execution Engine: The component that executes the generated jobs on the Hadoop cluster,
utilizing MapReduce or Apache Tez as the underlying processing framework.

Applications of Pig and Hive

Pig and Hive are widely used in various data processing and analysis scenarios:
Data Transformation: Pig is well-suited for complex data transformations, such as cleansing,
normalization, and enrichment of raw data.

Ad-hoc Data Analysis: Hive is ideal for ad-hoc data analysis, allowing users to quickly query
and analyze large datasets using familiar SQL-like syntax.

ETL Pipelines: Both Pig and Hive can be integrated into ETL pipelines for data extraction,
transformation, and loading, providing robust solutions for data processing and analysis.
Machine Learning and Data Science: Pig and Hive can be used to preprocess data for machine
learning algorithms or perform exploratory data analysis in data science projects.

Data Warehousing: Hive is particularly useful for building data warehouses on top of Hadoop,
providing a scalable and cost-effective solution for storing and analyzing large volumes of
structured data.

Types of Data Models in Apache Pig: It consists of the following four data models (a short Pig Latin illustration follows the list):
 Atom: An atomic data value, stored as a string. The main advantage of this model is that it can be used both as a number and as a string.
 Tuple: An ordered set of fields.
 Bag: A collection of tuples.
 Map: A set of key/value pairs.
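The short sketch below (relation name, input file, and field names are assumed) shows how these models appear in practice: each record is a tuple, the relation as a whole is a bag of tuples, and the details field is a map of key/value pairs.

grunt> student = LOAD 'student.txt' USING PigStorage(',')
AS (name:chararray, age:int, details:map[chararray]);
grunt> DUMP student; -- with sample input, each record prints as a tuple, e.g. (raju,30,[city#Hyderabad])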

Hive vs. Pig

The following comparison summarizes the features of Hive and Pig:

1. Language: Hive uses a declarative language called HiveQL; with Pig, a procedural data flow language (Pig Latin) is used.
2. Schema: Hive supports schemas; creating a schema is not required to store data in Pig.
3. Data Processing: Hive is used for batch processing; Pig is a high-level data-flow language.
4. Partitions: Hive supports partitions; Pig does not support partitions, although there is an option for filtering.
5. Web interface: Hive has a web interface; Pig does not support a web interface.
6. User Specification: Data analysts are the primary users of Hive; programmers and researchers use Pig.
7. Used for: Hive is used for reporting; Pig is used for programming.
8. Type of data: Hive works on structured data and does not work on other types of data; Pig works on structured, semi-structured, and unstructured data.
9. Operates on: Hive works on the server side of the cluster; Pig works on the client side of the cluster.
10. Avro File Format: Hive does not support Avro; Pig supports Avro.
11. Loading Speed: Hive takes time to load but executes quickly; Pig loads data quickly.
12. JDBC/ODBC: Supported by Hive, but limited; unsupported by Pig.

DATA PROCESSING OPERATORS IN PIG

Apache Pig Operators:

Pig Latin – Statements

While processing data using Pig Latin, statements are the basic constructs.

 These statements work with relations. They include expressions and schemas.
 Every statement ends with a semicolon (;).
 We will perform various operations using operators provided by Pig Latin, through
statements.
 Except LOAD and STORE, while performing all other operations, Pig Latin statements
take a relation as input and produce another relation as output.
 As soon as you enter a Load statement in the Grunt shell, its semantic checking is
carried out. To see the contents of the relation, you need to use the Dump operator. Only
after the dump operation is performed will the MapReduce job that loads the data from
the file system be carried out (an illustration follows the Load example below).
Example

Given below is a Pig Latin statement, which loads data to Apache Pig.

grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
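To check the load, the Describe and Dump operators discussed above can then be applied to the same relation (a sketch; the actual output depends on the contents of student_data.txt):

grunt> DESCRIBE Student_data; -- prints the schema of the relation
grunt> DUMP Student_data;     -- triggers the MapReduce job and prints the loaded tuples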
Pig Latin – Data types
The table given below describes the Pig Latin data types.

S.N. Data Type Description & Example


1 int Represents a signed 32-bit integer.
2 long Represents a signed 64-bit integer.
3 float Represents a signed 32-bit floating point.
4 double Represents a 64-bit floating point.
5 chararray Represents a character array (string) in Unicode UTF-8 format.
Example : ‘tutorials point’
6 bytearray Represents a byte array (blob).
7 boolean Represents a Boolean value.
Example : true/ false.
8 datetime Represents a date-time.
Example : 1970-01-01T00:00:00.000+00:00
9 biginteger Represents a Java BigInteger.
Example : 60708090709
10 bigdecimal Represents a Java BigDecimal.
Example : 185.98376256272893883

Complex Types
11 Tuple A tuple is an ordered set of fields.
Example : (raja, 30)
12 Bag A bag is a collection of tuples.
Example : {(raju,30),(Mohhammad,45)}
13 Map A Map is a set of key-value pairs.
Example : [ ‘name’#’Raju’, ‘age’#30]

Pig Latin – Arithmetic Operators

The arithmetic operators of Pig Latin are described below. Suppose a = 10 and b = 20 (a short bincond example follows the list).

+ (Addition): Adds values on either side of the operator. Example: a + b gives 30.
− (Subtraction): Subtracts the right-hand operand from the left-hand operand. Example: a − b gives −10.
* (Multiplication): Multiplies values on either side of the operator. Example: a * b gives 200.
/ (Division): Divides the left-hand operand by the right-hand operand. Example: b / a gives 2.
% (Modulus): Divides the left-hand operand by the right-hand operand and returns the remainder. Example: b % a gives 0.
?: (Bincond): Evaluates a Boolean expression. It has three operands: variable x = (expression) ? value1 if true : value2 if false. Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20, and if a != 1 the value of b is 30.
CASE WHEN ... THEN ... ELSE ... END (Case): The case operator is equivalent to a nested bincond operator. Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
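As a short illustration of the bincond operator inside a FOREACH statement (this reuses the Student_data relation loaded earlier; the threshold is assumed):

grunt> banded = FOREACH Student_data GENERATE firstname, (id > 100 ? 'senior' : 'junior') AS category;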

Pig Latin – Comparison Operators

The comparison operators of Pig Latin are described below (a short example using the matches operator follows the list).

== (Equal): Checks whether the values of two operands are equal; if yes, then the condition becomes true. Example: (a == b) is not true.
!= (Not Equal): Checks whether the values of two operands are equal; if the values are not equal, then the condition becomes true. Example: (a != b) is true.
> (Greater than): Checks whether the value of the left operand is greater than the value of the right operand; if yes, then the condition becomes true. Example: (a > b) is not true.
< (Less than): Checks whether the value of the left operand is less than the value of the right operand; if yes, then the condition becomes true. Example: (a < b) is true.
>= (Greater than or equal to): Checks whether the value of the left operand is greater than or equal to the value of the right operand; if yes, then the condition becomes true. Example: (a >= b) is not true.
<= (Less than or equal to): Checks whether the value of the left operand is less than or equal to the value of the right operand; if yes, then the condition becomes true. Example: (a <= b) is true.
matches (Pattern matching): Checks whether the string on the left-hand side matches the constant on the right-hand side. Example: f1 matches '.*tutorial.*'
Pig Latin – Type Construction Operators

The type construction operators of Pig Latin are described below.

(): Tuple constructor operator; used to construct a tuple. Example: (Raju, 30)
{}: Bag constructor operator; used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[]: Map constructor operator; used to construct a map. Example: [name#Raja, age#30]

Pig Latin – Relational Operations

The relational operators of Pig Latin are described below (a combined example follows the list).

Loading and Storing
LOAD: To load data from the file system (local/HDFS) into a relation.
STORE: To save a relation to the file system (local/HDFS).

Filtering
FILTER: To remove unwanted rows from a relation.
DISTINCT: To remove duplicate rows from a relation.
FOREACH, GENERATE: To generate data transformations based on columns of data.
STREAM: To transform a relation using an external program.

Grouping and Joining
JOIN: To join two or more relations.
COGROUP: To group the data in two or more relations.
GROUP: To group the data in a single relation.
CROSS: To create the cross product of two or more relations.

Sorting
ORDER: To arrange a relation in sorted order based on one or more fields (ascending or descending).
LIMIT: To get a limited number of tuples from a relation.

Combining and Splitting
UNION: To combine two or more relations into a single relation.
SPLIT: To split a single relation into two or more relations.

Diagnostic Operators
DUMP: To print the contents of a relation on the console.
DESCRIBE: To describe the schema of a relation.
EXPLAIN: To view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE: To view the step-by-step execution of a series of statements.
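The following minimal sketch (file name, schema, and salary threshold are assumed for illustration) chains several of these operators together:

grunt> emp = LOAD 'emp.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, salary:int);
grunt> high = FILTER emp BY salary > 50000;        -- keep only the higher salaries
grunt> by_dept = GROUP high BY dept;               -- group the filtered rows by department
grunt> avg_sal = FOREACH by_dept GENERATE group AS dept, AVG(high.salary) AS avg_salary;
grunt> ordered = ORDER avg_sal BY avg_salary DESC; -- sort departments by average salary
grunt> top3 = LIMIT ordered 3;                     -- keep the top three departments
grunt> STORE top3 INTO 'top_departments';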

Hive Components, Services and Architecture

Apache Hive architecture consists mainly of three components:

1. Hive Client

2. Hive Services

3. Hive Storage and Computing

The following architecture explains the flow of query submission into Hive.

Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients, such as:

o Thrift Server - a cross-language service provider platform that serves requests from all
programming languages that support Thrift.
o JDBC Driver - used to establish a connection between Hive and Java applications. The
JDBC driver is provided by the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - allows applications that support the ODBC protocol to connect to Hive.

HIVE SERVICES

The following are the services provided by Hive:-

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute
Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It
provides a web-based GUI for executing Hive queries and commands.
o Hive MetaStore - A central repository that stores all the structure information of the
various tables and partitions in the warehouse. It also includes column metadata and type
information, the serializers and deserializers used to read and write data, and the
corresponding HDFS files where the data is stored.
o Hive Server - Also referred to as the Apache Thrift Server. It accepts requests from
different clients and forwards them to the Hive Driver.
o Hive Driver - Receives queries from different sources such as the web UI, CLI, Thrift, and
JDBC/ODBC drivers, and transfers the queries to the compiler.
o Hive Compiler - Parses the query and performs semantic analysis on the different query
blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of
MapReduce tasks and HDFS tasks. The execution engine then executes the incoming
tasks in the order of their dependencies.

HiveQL
What is Hive Query Language?
Hive Query Language (HQL) is a SQL-like scripting language developed for data querying and
analysis in Hadoop clusters. It simplifies the complexity of writing complex MapReduce jobs by
providing a familiar SQL abstraction for big data.
Hive Query Language (HiveQL): HiveQL is a query language for Hive to analyze and process
structured data in a Meta-store. It is a mixture of SQL-92, MySQL, and Oracle's SQL. It is very
much similar to SQL and highly scalable.
Difference between SQL and HiveQL

1. Structured Query Language (SQL): SQL is a domain-specific language used in
programming and designed for managing data held in a relational database management system
(RDBMS). It is useful in handling structured data, i.e., data incorporating relations among
entities and variables. SQL is a standard language for storing, manipulating, and retrieving data
in databases.
2. Hive Query Language (HiveQL): HiveQL is a query language for Hive to analyze and
process structured data in a Metastore. It is a mixture of SQL-92, MySQL, and Oracle's SQL;
it is very similar to SQL and highly scalable. It reuses familiar concepts from the relational
database world, such as tables, rows, columns, and schema, to ease learning. Hive supports four
file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).

The differences between SQL and HiveQL are as follows (a short HiveQL illustration follows the comparison):

Update commands in table structure: SQL supports UPDATE, DELETE, and INSERT; HiveQL supports UPDATE, DELETE, and INSERT.
Manages: SQL manages relational data; HiveQL manages data structures.
Transactions: Fully supported in SQL; limited support in HiveQL.
Indexes: Supported in both.
Data types: SQL contains a total of five data types, i.e., integral, floating-point, fixed-point, text and binary strings, and temporal; HiveQL contains Boolean, integral, floating-point, fixed-point, timestamp (nanosecond precision), date, text and binary strings, temporal, array, map, struct, and union.
Functions: Both provide hundreds of built-in functions.
MapReduce: Not supported in SQL; supported in HiveQL.
Multitable inserts: Not supported in SQL; supported in HiveQL.
Create table ... as select: Not supported in SQL; supported in HiveQL.
Select command: Supported in SQL; supported in HiveQL with a SORT BY clause for partial ordering and LIMIT to restrict the number of rows returned.
Joins: Supported in SQL; HiveQL supports inner joins, outer joins, semi joins, map joins, and cross joins.
Subqueries: Supported in SQL; in HiveQL they are only used in FROM, WHERE, or HAVING clauses.
Views: Can be updated in SQL; read-only in HiveQL (cannot be updated).
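As a brief illustration of two HiveQL features that standard SQL-92 lacks (a sketch; the table and column names are assumed, and both target tables are assumed to exist for the multi-table insert), a table can be created directly from a query, and a single pass over a source table can feed several inserts:

Sql>
CREATE TABLE high_salary_employees AS
SELECT emp_name, emp_salary FROM employee WHERE emp_salary > 75000;

FROM employee
INSERT OVERWRITE TABLE high_salary_employees SELECT emp_name, emp_salary WHERE emp_salary > 75000
INSERT OVERWRITE TABLE low_salary_employees SELECT emp_name, emp_salary WHERE emp_salary <= 75000;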
Querying and Analyzing Data in Hive

Querying and analyzing data in Hive involves using Hive Query Language (HQL) to interact
with data stored in Hive tables. Hive is a data warehousing and SQL-like querying tool that
provides an SQL-like interface for querying and analyzing data stored in Hadoop Distributed
File System (HDFS) or other compatible storage systems. Here are the steps to query and
analyze data in Hive:

1. Data Ingestion:

 Data is typically ingested into Hive from various sources, including HDFS, external
databases, or data streams.

2. Data Definition:

 Define the schema of your data by creating Hive tables. You can specify the table name,
column names, data types, and storage format. Hive supports both structured and semi-
structured data.

Example:

Sql>
CREATE TABLE employee (
emp_id INT,
emp_name STRING,
emp_salary FLOAT,
department_id INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

3. Data Loading:

 Load data into Hive tables using the LOAD DATA command or by inserting data
directly.

Example

Sql>

LOAD DATA INPATH '/user/hadoop/employee_data.csv' INTO TABLE employee;

4. Querying Data:
 Use HQL to query data from Hive tables. You can write SQL-like queries to retrieve,
filter, and transform data.

Example:

Sql>
SELECT emp_name, emp_salary FROM employee WHERE emp_salary > 50000;

5. Aggregations and Grouping:

 Hive supports aggregation functions (e.g., SUM, AVG, COUNT) and GROUP BY
clauses for summarizing data.

Example:

Sql>
SELECT department_id, AVG(emp_salary) AS avg_salary FROM employee
GROUP BY department_id;

6. Joins:

 You can perform joins between Hive tables to combine data from multiple sources.

Example:

Sql>
SELECT e.emp_name,d.department_name FROM employee e

JOIN department d ON e.department_id = d.department_id;

7. Data Transformation:

 Hive allows you to transform and process data using user-defined functions (UDFs) and
built-in functions.

Example:

Sql>
SELECT emp_name, UPPER(emp_name) AS uppercase_name FROM employee;

8. Storing Results:

 You can store the results of queries in Hive tables for further analysis or reporting.

Example:

Sql>
INSERT OVERWRITE TABLE high_salary_employees
SELECT emp_name,emp_salary FROM employee WHERE emp_salary > 75000;

9. Running Queries:

 Submit Hive queries using the Hive command-line interface (CLI) or through Hive client
libraries and interfaces in programming languages like Python or Java.
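For example (a sketch; the connection URL is assumed), a query can be run non-interactively from the Hive CLI or through Beeline:

$ hive -e "SELECT COUNT(*) FROM employee;"
$ beeline -u jdbc:hive2://localhost:10000 -e "SELECT COUNT(*) FROM employee;"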

10. Monitoring and Optimization: – Monitor query performance and optimize Hive queries by
creating appropriate indexes, partitions, and tuning configurations.
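For example (a sketch; the partitioned table name and partition column are assumed), partitioning the employee data by department lets Hive read only the matching partition instead of scanning the whole table:

Sql>
CREATE TABLE employee_part (
emp_id INT,
emp_name STRING,
emp_salary FLOAT
)
PARTITIONED BY (department STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- a filter on the partition column prunes all other partitions
SELECT emp_name, emp_salary FROM employee_part WHERE department = 'Sales';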

HBase Architecture

Apache ZooKeeper monitors the system, while the HBase Master assigns regions and performs
load balancing. Region Servers serve data for reads and writes; they run on the different
machines of the Hadoop cluster, and each consists of Regions, an HLog (write-ahead log),
Stores, a MemStore, and various files.

HBase architecture has 3 main components: HMaster, Region Server, Zookeeper.

Figure – Architecture of HBase


All the 3 components are described below:

1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns
regions to Region Servers and handles DDL operations (creating and deleting tables). It
monitors all Region Server instances present in the cluster. In a distributed environment, the
Master runs several background threads. HMaster is also responsible for features such as load
balancing and failover.

2. Region Server –
HBase tables are divided horizontally by row key range into Regions. Regions are the basic
building elements of an HBase cluster: they hold a portion of a table's data and are comprised
of column families. A Region Server runs on an HDFS DataNode in the Hadoop cluster, and its
regions are responsible for handling, managing, and executing read and write HBase operations
on that set of regions. The default size of a region is 256 MB.

3. Zookeeper –
It acts as a coordinator in HBase. It provides services such as maintaining configuration
information, naming, distributed synchronization, and server failure notification. Clients
communicate with region servers via ZooKeeper.

Advantages of HBase –

1. Can store large data sets


2. Database can be shared
3. Cost-effective from gigabytes to petabytes
4. High availability through failover and replication

Disadvantages of HBase –

1. No support for SQL structure


2. No transaction support
3. Sorted only on key
4. Memory issues on the cluster

Comparison between HBase and HDFS:

 HBase provides low latency access while HDFS provides high latency operations.

 HBase supports random reads and writes, while HDFS supports write-once, read-many
access.
 HBase is accessed through shell commands, Java API, REST, Avro or Thrift API while
HDFS is accessed through MapReduce jobs.
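For instance (a minimal sketch; the table name, column family, and values are assumed), basic interactive access through the HBase shell looks like this:

hbase> create 'employee', 'personal'                     # create a table with one column family
hbase> put 'employee', 'row1', 'personal:name', 'Raju'   # write a single cell
hbase> get 'employee', 'row1'                            # random read of one row
hbase> scan 'employee'                                   # scan all rows in the table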
Difference Between HBase and Hive

1. HBase: Apache HBase is a NoSQL key/value store which runs on top of HDFS. Hive: Apache Hive is a data warehouse infrastructure that provides SQL-like functionality on top of Hadoop.
2. HBase: Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. Hive: Despite providing SQL functionality, Hive does not run in real time; its queries execute as batch MapReduce jobs.
3. HBase: HBase provides interactive queries. Hive: Hive does not provide interactive querying yet; it only runs batch processes on Hadoop.

Features of HBase architecture :

Distributed and Scalable: HBase is designed to be distributed and scalable, which means it
can handle large datasets and can scale out horizontally by adding more nodes to the cluster.

Column-oriented Storage: HBase stores data in a column-oriented manner, which means data
is organized by columns rather than rows. This allows for efficient data retrieval and
aggregation.

Hadoop Integration: HBase is built on top of Hadoop, which means it can leverage Hadoop’s
distributed file system (HDFS) for storage and MapReduce for data processing.

Consistency and Replication: HBase provides strong consistency guarantees for read and
write operations, and supports replication of data across multiple nodes for fault tolerance.

Built-in Caching: HBase has a built-in caching mechanism that can cache frequently accessed
data in memory, which can improve query performance.

Compression: HBase supports compression of data, which can reduce storage requirements
and improve query performance.

Flexible Schema: HBase supports flexible schemas, which means the schema can be updated
on the fly without requiring a database schema migration.

Note – HBase is extensively used for online operations that require real-time access; for example,
in banking applications it can handle real-time data updates such as those made through ATM machines.

IBM InfoSphere BigInsights and Streams

At the end of 2011, IBM released InfoSphere BigInsights and InfoSphere Streams, software that
allows clients to quickly gain insight into the information streams relevant to their business.
BigInsights is a data analysis platform that allows companies to turn complex, Internet-scale data
sets into knowledge. The platform includes an easily installed Apache Hadoop distribution along
with the related tools needed for application development, data transfer, and cluster management.
Thanks to its simplicity and scalability, Hadoop, an open-source implementation of the MapReduce
infrastructure, has earned recognition across many industries and sciences. In addition to Hadoop,
the following open-source technologies are part of BigInsights (all of them, except for Jaql, are
Apache Software Foundation projects):
 Pig is a platform that includes a high-level language for describing programs that analyze large
data sets. Pig includes a compiler that transforms Pig applications into sequences of MapReduce
jobs executed in the Hadoop environment.
 Hive is a data warehousing solution built on top of the Hadoop environment. It implements
familiar relational database concepts such as tables, columns, and partitions, and includes a set of
SQL-like statements (HiveQL) for working with data in the unstructured Hadoop environment.
Hive queries are compiled into MapReduce jobs executed on Hadoop.
 Jaql is a query language with an SQL-like interface, developed by IBM and intended for
JavaScript Object Notation (JSON). Jaql handles nesting well, is highly function-oriented, and is
extremely flexible. It is well suited to working with loosely structured data; it also serves as an
interface to HBase column storage and is used for text analysis.
 HBase is a column-oriented, non-SQL data storage environment intended to support large,
sparsely populated tables in Hadoop.
 Flume is a distributed, reliable, and highly available service for efficiently moving large volumes
of generated data. Flume is well suited to collecting event logs from several systems and moving
them into the Hadoop Distributed File System (HDFS) as they are generated.
 Lucene is a search-engine library that provides high-performance, full-text search.
 Avro is a data serialization technology that uses JSON to define data types and protocols and
stores data in a compact binary format.
 ZooKeeper is a centralized service for maintaining configuration information and naming; it
provides distributed synchronization and group services.
 Oozie is a workflow scheduling system for organizing and managing the execution of Apache
Hadoop jobs.
