
LESSON-8

Files, Hive Introduction

Class will start by refreshing the previous class with Q&A (15 minutes).

Today’s topics:

• Files used in Hadoop (all 6 types)
• How to choose a file format for my cluster
• Hive introduction (what is Hive)
• Hive architecture / components / Meta Store
• Hive tables

Files
There are six file formats commonly used in Hadoop. When processing data we have to specify the file
format for the input and output data.
• Text file format
• Sequence file format
• Avro file format
• Parquet file format
• RC file format
• ORC file format
Text (CSV files, TSV files, space-separated values; lines terminated by '\n')

These are plain text files. Each line is a record, and lines are terminated by the newline character '\n'.
Text files give good write performance but slow reads. They do not support block compression. Text files
are inherently splittable on the '\n' character, and they have only limited support for schema evolution.
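
A minimal sketch (table and column names are hypothetical) of how a delimited text table is typically declared in Hive:

-- Plain text storage: one record per line, fields separated by commas
CREATE TABLE customers_txt (
  customerno STRING,
  accountno  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;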

Sequence Files

A Hadoop SequenceFile is a flat file format consisting of serialized key-value pairs. This is the same
format in which data is stored internally during processing, so it requires less space. It has good read
and write performance and supports block-level compression. Sequence files are splittable.

There are 3 types of sequence file:

• Uncompressed key/value records.
• Record compressed – only the values are compressed, not the keys.
• Block compressed – keys and values are collected into separate 'blocks' and compressed.
Example – raw dump of a SequenceFile (a header describing the key/value types, followed by serialized customerno/accountno records):

D^@@@@@w@@@@v@@@@customerno@@@accountno@@
^L001 ABC99901^@@@@@N@@@@
^L002 ABC99902^@@@@@N@@@@
^L003 ABC99903^@@@@@N@@@@
^L004 ABC99904^@@@@@N@@@@
^L005 ABC99905^@@@@@N@@@@
^L006 ABC99906^@@@@@N@@@@
^L007 ABC99907^@@@@@N@@@@
^L008 ABC99908^@@@@@N@@@@
^L009 ABC99909^@@@@@N@@@@
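
A minimal sketch (hypothetical table name; the compression settings shown are standard Hadoop/Hive properties) of a SequenceFile-backed table, e.g. for intermediate data passed between jobs:

-- Enable block-level compression for SequenceFile output
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

CREATE TABLE customers_seq (
  customerno STRING,
  accountno  STRING
)
STORED AS SEQUENCEFILE;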

Avro Files
Avro is a row-based storage format for Hadoop that is widely used as a serialization/deserialization
framework. Avro stores the data definition (schema/metadata) in JSON format, making it easy to read
and interpret by any program. The data itself is stored in a compact and efficient serialized binary
format. Its read and write performance is moderate. It is the best choice if the file schema is going
to change frequently.
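
A minimal sketch (hypothetical table name; STORED AS AVRO requires Hive 0.14 or later) of an Avro-backed Hive table for the Iris data shown in the samples below:

-- Hive derives the Avro schema from the column definitions
CREATE TABLE iris_avro (
  sepallength DOUBLE,
  sepalwidth  DOUBLE,
  petallength DOUBLE,
  petalwidth  DOUBLE,
  class       STRING
)
STORED AS AVRO;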

ARFF (Attribute-Relation File Format)

@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {setosa,versicolor,virginica}
@DATA
5.1,3.5,1.4,0.2,setosa

4.9,3.0,1.4,0.2,virginica
4.7,3.2,1.3,0.2,virginica
4.6,3.1,1.5,0.2,setosa

Iris Dataset - JSON Encoded AVRO Format

{"type":"record","name":"Iris","namespace":"org.apache.samoa.avro.iris","fields":[{"name":"sepalle
ngth","type":"double"},{"name":"sepalwidth","type":"double"},{"name":"petallength","type":"double"
},{"name":"petalwidth","type":"double"},{"name":"class","type":{"type":"enum","name":"Labels","sym
bols":["setosa","versicolor","virginica"]}}]}
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
{"sepallength":3.0,"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
{"sepallength":4.7,"sepalwidth":3.2,"petallength":1.3,"petalwidth":0.2,"class":"virginica"}
{"sepallength":3.1,"sepalwidth":1.5,"petallength":4.6,"petalwidth":0.2,"class":"setosa"}

Iris Dataset - Binary Encoded AVRO Format

Objavro.schema{"type":"record","name":"Iris","namespace":"org.apache.samoa.avro.iris","fields":[{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},{"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},{"name":"class","type":{"type":"enum","name":"Labels","symbols":["setosa","versicolor","virginica"]}}]}
[... followed by the same four records in compact binary encoding, which is not human-readable ...]

Parquet Files
Parquet is a column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to the
other columnar storage file formats available in Hadoop, namely RCFile and ORC. It is compatible
with most of the data processing frameworks in the Hadoop environment.
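
A minimal sketch (hypothetical table; STORED AS PARQUET is supported natively in Hive 0.13 and later) of a Parquet-backed table for the same Iris data:

CREATE TABLE iris_parquet (
  sepallength DOUBLE,
  sepalwidth  DOUBLE,
  petallength DOUBLE,
  petalwidth  DOUBLE,
  class       STRING
)
STORED AS PARQUET;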

RC Files (Record columnar)
RCFile is a Record Columnar file. It has a data placement structure that determines how relational
tables are stored on computer clusters. RCFile first splits the data into row groups of around 4 MB,
and then stores each row group (data chunk) in column-major order to get better query performance.
The RC file format is used with HBase, Cassandra and Hive. It offers fast data loading, fast query
processing and efficient use of storage space.
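
A minimal sketch (hypothetical table name) of an RCFile-backed table in Hive:

CREATE TABLE customers_rc (
  customerno STRING,
  accountno  STRING
)
STORED AS RCFILE;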

ORC Files (Optimized RC)

ORC (Optimized RC) basically takes the entire RCFile concept and adds features to make it better. The
ORC file format provides a highly efficient way to store Hive data. It was designed to overcome the
limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading,
writing, and processing data.
The ORC file format has many advantages over RC files, such as block-mode compression, the ability to
split files, and concurrent reads of the same file using separate RecordReaders.
While saving, ORC breaks the file horizontally into sets of rows, and the rows are further divided
vertically by column for better query performance. These row groups are called stripes. The stripe
size is typically 64 MB and can go up to 256 MB.

ORC maintains a file footer with stripe-level and file-level indexes, often called statistics; this
feature is what makes ORC powerful for data queries.

The row-level index data consists of min and max values for each group of rows. The file footer contains
the stripe-level and file-level indexes. The postscript section contains file information such as the
file length, the compression parameters and the size of the compressed footer.

Whenever a Hive query has a WHERE clause, Hive first checks the file-level index; if the condition can
be satisfied, it then checks the stripe-level index, and finally scans only the part of the file that
matches the WHERE condition. This sharply reduces the work needed to answer a query; this way of pushing
the filter down to the storage layer is called “Predicate Pushdown”.
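
A minimal sketch (hypothetical table name; hive.optimize.ppd and the orc.compress table property are standard Hive settings) of an ORC table and a query that benefits from predicate pushdown:

-- Predicate pushdown is normally enabled by default
SET hive.optimize.ppd=true;

CREATE TABLE customers_orc (
  customerno STRING,
  accountno  STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

-- Only stripes whose min/max statistics can match this predicate are read
SELECT * FROM customers_orc
WHERE customerno = 'L005';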
How to choose a file format for my cluster

In general, select:
• If you are storing intermediate data between MapReduce jobs – choose the Sequence file format.
• If query performance against the data is most important – choose ORC (Hortonworks/Hive) or Parquet (Cloudera/Impala).
• If your schema is going to change over time – choose Avro.
• If you are going to extract data from Hadoop to bulk-load into a database – choose a text file (CSV).

(Figure: decision tree for choosing a file format)
Hive Introduction
Hive was originally created by Facebook and is now an Apache project. Hive deals with huge volumes of
data. Hive is a data warehousing package built on Hadoop, used for structured data query and analysis
with HQL (Hive Query Language), which is very similar to SQL (Structured Query Language).

However, Hive can also manage unstructured and semi-structured data with the help of MapReduce.

Hive stores data on Hadoop (i.e. HDFS), so it is fault tolerant.

Hive is used for data analysis (data mining, document indexing, predictive modeling and reporting).
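
For example (a hypothetical table; the syntax mirrors SQL), a simple HQL analysis query looks like this:

-- Count the records per class in a table named iris
SELECT class, COUNT(*) AS cnt
FROM iris
GROUP BY class;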

Architecture

• The total architecture sits on top of Hadoop.
• It consists of 3 core parts: a) Hive Clients, b) Hive Services, c) Hive Storage.

Theoretically we can set up the HiveServer and the MetaStore server all on the Master Node. However,
in a production scenario, placing them on a Master or Slave node is not a good idea.
We should set up the HiveServer on a dedicated node so it will not interfere with NameNode processes.
The MetaStore is a separate daemon which can either be embedded with the HiveServer or be set up
separately as a dedicated database service (the recommended way).
Beeline (a Hive client) can run in embedded mode as well as remotely. Remote HiveServer2 mode is
recommended for production use, as it is more secure. Beeline is a JDBC client tool used to connect to
HiveServer2, and it does not require users to be granted direct HDFS/metastore access.

Component

• User Interface (UI) – provides an interface between the user and Hive.
• Driver – manages the life cycle of an HQL statement.
• Compiler – converts the HQL statement into MapReduce jobs.
• Meta Store – keeps all Hive metadata.
• Execution Engine – bridges Hive and Hadoop for everything.

Work Flow

1. The UI submits the query to the Driver.
2. The Driver interacts with the Compiler to get the plan and the metadata.
3. The Compiler creates the plan for the job to be executed and asks the Meta Store for the metadata.
4. The Meta Store sends the metadata back to the Compiler.
5. The Compiler tells the Driver to execute the query as per the proposed plan.
6. The Driver sends the execution plan to the Execution Engine.
7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query and perform
the DFS operations.
8. The UI fetches the results from the Driver.
9. The Driver communicates with the EE, and once the EE has fetched the results from the data nodes,
they are sent back to the UI.
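
As a quick illustration (hypothetical table), the plan that the compiler produces for a query can be inspected with EXPLAIN:

-- Show the execution plan instead of running the query
EXPLAIN
SELECT class, COUNT(*) AS cnt
FROM iris
GROUP BY class;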

EE bridging with Cluster

• The EE first contacts the Name Node to get the result from the tables.
• The EE fetches the desired records from the Data Nodes.
• It collects the actual table data from the Data Nodes.
• The EE communicates bi-directionally with the Meta Store in Hive to perform operations.
• The EE in turn communicates with Hadoop daemons such as the Name Node, Data Nodes and Job Tracker to
execute the query on top of the Hadoop file system.

Meta Store

Hive has 3 types of Meta Store:

• Embedded Meta Store
• Local Meta Store
• Remote Meta Store

Every Meta Store setup has 3 components:

• Driver – sends the metadata.
• Meta Store service/server – serves the metadata.
• Database – stores the metadata.

Embedded: used only for study or experiments; it does not support multiple users and is not suitable for production.

Local: overcomes the limitations of embedded mode and is more secure than embedded.

Remote: the most secure option; the Driver, the Meta Store server and the database are on different machines.

Hive Table
Hive has two types of table:
1) Managed (internal) table – the table is created with its data stored in the Hive data warehouse.
2) External table – the table is created over data kept in external storage.

A table has 2 parts:
• Schema or structure
• Data

When an internal table is dropped (DROP TABLE), both the schema and the data are deleted.
When an external table is dropped, only the schema is deleted; the data remains in HDFS.
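
A minimal sketch (table names and the HDFS path are hypothetical) contrasting the two table types:

-- Managed table: Hive stores the data under its own warehouse directory
CREATE TABLE sales_managed (
  id     INT,
  amount DOUBLE
)
STORED AS ORC;

-- External table: the data stays at the given HDFS location
CREATE EXTERNAL TABLE sales_external (
  id     INT,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales/';

-- DROP TABLE sales_managed;    removes both the schema and the data
-- DROP TABLE sales_external;   removes only the schema; the files under /data/sales/ remain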

Useful links

Files

https://www.youtube.com/watch?v=1rOSia_r8hI

Hive

https://www.guru99.com/introduction-hive.html

