8-Lesson Files Hive
Class will start by refreshing the previous class with Q&A. (15)
Today’s topics:
Files
There are six file formats commonly used in Hadoop. While processing data, we have to specify
the file format for the input and output data.
Text file format
Sequence file format
Avro file format
Parquet file format
RC file format
ORC file format
Text (CSV files, TSV files, space-separated values; '\n' line separator)
These are plain text files. Each line is a record, and lines are terminated by the newline
character '\n'. These files give us good write performance but slow reads. They do not support block
compression. Text files are inherently splittable on the '\n' character. They offer only limited
support for schema evolution.
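The record-per-line structure is exactly why text files are splittable: any reader can break the file on '\n' with no index or footer. A minimal sketch using Python's standard `csv` module (the sample records are hypothetical):

```python
import csv
import io

# Hypothetical sample: three CSV records, one per line, '\n' terminated.
data = "L001,ABC99901\nL002,ABC99902\nL003,ABC99903\n"

# Each line is one record, so a reader can split the file on '\n'
# anywhere without extra metadata -- this is what makes text splittable.
records = list(csv.reader(io.StringIO(data)))
print(records)
```

Note that nothing in the file describes the column types, which is why schema evolution support is limited.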
Sequence Files
A Hadoop Sequence File is a flat file structure consisting of serialized key-value pairs. This is the same
format in which the data is stored internally during processing, so it requires less space. It has good
read and write performance. It supports block-level compression. Sequence files are splittable.
Block compressed - keys and values are collected separately into 'blocks' and each block is
compressed as a unit.
D^@@@@@w@@@@v@@@@customerno@@@accountno@@
^L001 ABC99901^@@@@@N@@@@
^L002 ABC99902^@@@@@N@@@@
^L003 ABC99903^@@@@@N@@@@
^L004 ABC99904^@@@@@N@@@@
^L005 ABC99905^@@@@@N@@@@
^L006 ABC99906^@@@@@N@@@@
^L007 ABC99907^@@@@@N@@@@
^L008 ABC99908^@@@@@N@@@@
^L009 ABC99909^@@@@@N@@@@
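The idea of block compression can be sketched with the standard `zlib` module. This is only an illustration of the principle, not the real SequenceFile binary layout:

```python
import zlib

# Illustrative sketch (not the actual SequenceFile format): keys and
# values are gathered into separate blocks, and each block is
# compressed as a whole rather than record by record.
records = [("L001", "ABC99901"), ("L002", "ABC99902"), ("L003", "ABC99903")]

key_block = "\n".join(k for k, _ in records).encode()
value_block = "\n".join(v for _, v in records).encode()

compressed_keys = zlib.compress(key_block)
compressed_values = zlib.compress(value_block)

# Decompressing a block recovers all its records at once.
print(zlib.decompress(compressed_keys).decode().split("\n"))
```

Compressing whole blocks beats per-record compression because the compressor sees more repeated context at once.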
Avro Files
Avro is a row-based storage format for Hadoop which is widely used as a serialization/
deserialization framework. Avro stores the data definition (schema/metadata) in JSON format,
making it easy to read and interpret by any program. The data itself is stored in a compact,
efficient serialized binary format. Its read and write performance is moderate. It is the best
choice if the file schema is going to change frequently.
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {setosa,versicolor,virginica}
@DATA
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,virginica
4.7,3.2,1.3,0.2,virginica
4.6,3.1,1.5,0.2,setosa
{"type":"record","name":"Iris","namespace":"org.apache.samoa.avro.iris","fields":
[{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},
{"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},
{"name":"class","type":{"type":"enum","name":"Labels","symbols":
["setosa","versicolor","virginica"]}}]}
{"sepallength":5.1,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
{"sepallength":3.0,"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
{"sepallength":4.7,"sepalwidth":3.2,"petallength":1.3,"petalwidth":0.2,"class":"virginica"}
{"sepallength":3.1,"sepalwidth":1.5,"petallength":4.6,"petalwidth":0.2,"class":"setosa"}
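Because the Avro schema is plain JSON, any JSON parser can inspect the data definition; no Avro library is needed just to read the schema. A small sketch using the schema shown above:

```python
import json

# The Avro schema is ordinary JSON, so the stdlib json module can
# load it and expose the declared fields and types.
schema = json.loads("""
{"type":"record","name":"Iris","namespace":"org.apache.samoa.avro.iris","fields":
 [{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},
  {"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},
  {"name":"class","type":{"type":"enum","name":"Labels","symbols":
   ["setosa","versicolor","virginica"]}}]}
""")

field_names = [f["name"] for f in schema["fields"]]
print(schema["name"], field_names)
```

This self-describing property is what makes Avro friendly to frequent schema changes: the schema travels with the file.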
The same data stored as a binary Avro file begins with the magic bytes 'Obj' and the embedded JSON
schema, followed by the binary-encoded records (unprintable as text):
Objavro.schema {"type":"record","name":"Iris","namespace":"org.apache.samoa.avro.iris","fields":
[{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},
{"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},
{"name":"class","type":{"type":"enum","name":"Labels","symbols":
["setosa","versicolor","virginica"]}}]}
(binary record data omitted)
Parquet Files
Parquet is a column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to the
other columnar-storage file formats available in Hadoop, namely RCFile and ORC. It is compatible
with most of the data processing frameworks in the Hadoop environment.
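The essence of columnar storage can be sketched in a few lines: rows are transposed so that all values of one column sit together, which is why a query reading a single column never touches the others. The sample records below are hypothetical, and this is only a toy model, not the real Parquet binary layout:

```python
# Toy model of columnar layout (not the real Parquet format): rows are
# transposed into one list per column, so a scan of "accountno" reads
# only that column's values.
rows = [
    {"customerno": "L001", "accountno": "ABC99901"},
    {"customerno": "L002", "accountno": "ABC99902"},
    {"customerno": "L003", "accountno": "ABC99903"},
]

# Row layout -> column layout.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns["accountno"])
```

Grouping a column's values together also compresses better, since similar values sit side by side.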
ORC Files
An ORC (Optimized RC) file basically takes the entire RCFile concept and adds features to make it
better. The ORC file format provides a highly efficient way to store Hive data. It was designed to
overcome the limitations of the other Hive file formats. Using ORC files improves performance when
Hive is reading, writing, and processing data.
The ORC file format has many advantages over RC files, such as block-mode compression, the ability
to split files, and concurrent reads of the same file using separate RecordReaders.
While saving, ORC breaks the file into sets of rows horizontally, and the rows are further divided
vertically for better query performance. These row groups are called stripes. The size of a stripe is
64 MB, or higher, up to 256 MB.
ORC maintains a file footer with stripe-level and file-level indexes, often called statistics; this
feature makes ORC powerful for data queries.
The row-level index data consists of min and max values for each row group. The file footer contains
the stripe-level and file-level indexes. The postscript section contains file information such as the
length of the file, the compression parameters, and the size of the compressed footer.
Whenever a Hive query with a WHERE clause is issued, Hive first checks the file-level index; if the
condition can be met, Hive then checks the stripe-level index, and finally scans data only over the
parts of the whole file that match the WHERE condition. This drastically reduces query effort; this
way of pushing the filter down is called "Predicate Pushdown".
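The stripe-skipping behaviour above can be modelled in a few lines. This is a toy simulation of the idea, not ORC's actual reader code; the stripe contents are hypothetical:

```python
# Toy model of ORC predicate pushdown: each stripe carries min/max
# statistics, so a WHERE filter can skip stripes whose value range
# cannot possibly match, instead of scanning every row.
stripes = [
    {"min": 1,   "max": 100, "rows": list(range(1, 101))},
    {"min": 101, "max": 200, "rows": list(range(101, 201))},
    {"min": 201, "max": 300, "rows": list(range(201, 301))},
]

def query_greater_than(threshold):
    """Return rows > threshold, scanning only stripes whose max can match."""
    scanned = 0
    result = []
    for stripe in stripes:
        if stripe["max"] <= threshold:   # stripe-level stats say: skip
            continue
        scanned += 1
        result.extend(r for r in stripe["rows"] if r > threshold)
    return result, scanned

rows, stripes_scanned = query_greater_than(250)
print(len(rows), stripes_scanned)   # only 1 of the 3 stripes is scanned
```

The filter on 250 eliminates two of the three stripes using their statistics alone, which is exactly the saving predicate pushdown gives on real ORC files.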
How to choose a file format for my cluster
In general, select:
• If you are storing intermediate data between MapReduce jobs:
choose Sequence file.
• If you are going to extract data from Hadoop to bulk load into a database:
choose text file (CSV).
Hive Introduction
Hive was originally created by Facebook and is presently owned by Apache. Hive deals with huge
amounts of data.
Hive is a data warehousing package built on Hadoop, used for structured data query and analysis
with HQL (Hive Query Language), which is very similar to SQL (Structured Query Language).
However, Hive can also manage unstructured and semi-structured data with the help of MapReduce.
Hive is used for data analysis (data mining, document indexing, predictive modeling, and reporting).
Architecture
Component
User Interface (UI) – Provides an interface between the user and Hive.
Driver – Manages the life cycle of an HQL statement.
Compiler – Converts HQL statements into MapReduce jobs.
Meta Store – Keeps all Hive metadata.
Execution Engine – Bridges Hive and Hadoop for everything.
Work Flow
The Execution Engine (EE) first contacts the Name Node to get the result from the tables.
The EE then fetches the desired records from the Data Nodes.
It collects the actual data (tables) from the Data Nodes.
The EE communicates bi-directionally with the Meta Store present in Hive to perform operations.
The EE in turn communicates with Hadoop daemons such as the Name Node, Data Nodes, and Job
Tracker to execute the query on top of the Hadoop file system.
Meta Store
Embedded: Used only for study or experiments; does not support multiple users. Not suitable for
production.
Local: Overcomes the limitations of Embedded and is more secure than Embedded.
Remote: The most secure. The Driver, Meta Store Server, and Database run on different machines.
Hive Table
Hive has two types of tables:
1) Managed (internal) table – the table is created with its data in the Hive data warehouse.
2) External table – the table is created with data from external storage.
During DROP TABLE on an internal (managed) table, both the schema and the data are deleted.
But when an external table is dropped, only the schema is deleted; the data remains in HDFS.
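The two DROP behaviours can be simulated with ordinary files. This is a toy model using hypothetical paths and a dict standing in for the Meta Store, not real Hive internals:

```python
import shutil
import tempfile
from pathlib import Path

# Toy model of DROP TABLE semantics: dropping a managed table deletes
# schema + data; dropping an external table deletes only the schema entry.
warehouse = Path(tempfile.mkdtemp())   # stands in for the Hive warehouse dir
metastore = {}                         # stands in for the Meta Store

def create_table(name, external=False, location=None):
    path = Path(location) if external else warehouse / name
    path.mkdir(parents=True, exist_ok=True)
    (path / "part-0000").write_text("L001,ABC99901\n")   # hypothetical data
    metastore[name] = {"external": external, "location": path}

def drop_table(name):
    entry = metastore.pop(name)        # the schema entry always goes
    if not entry["external"]:          # the data goes only for managed tables
        shutil.rmtree(entry["location"])
    return entry["location"]

create_table("managed_t")
create_table("external_t", external=True, location=tempfile.mkdtemp())

managed_loc = drop_table("managed_t")
external_loc = drop_table("external_t")
print(managed_loc.exists(), external_loc.exists())   # False True
```

After both drops, the managed table's directory is gone while the external table's data directory still exists, mirroring the rule stated above.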
Useful link
Files
https://www.youtube.com/watch?v=1rOSia_r8hI
Hive
https://www.guru99.com/introduction-hive.html