
Performance metrics of Hive Queries
Version 1.0

Comparison of Performance Metrics of Hive Queries using Text, Avro and Parquet File Formats

Table of Contents
1 Introduction
1.1 Overview
1.2 Introduction to Apache Avro file format
1.3 Introduction to Apache Parquet file format

2 Case Study
2.1 Objective
2.2 Extracting the data from Teradata
2.3 Loading the data extracted from sqoop into Hive tables
2.4 Conclusion

Modified: 24/09/2015 07:33


2014 Accenture. All Rights Reserved.

Last modified by: Guard,


Mohammed Danesh


1 INTRODUCTION
1.1 Overview

The purpose of this document is to discuss the performance trade-offs of Hive queries
using different file formats and to suggest the best file format to use for Hive storage.

1.2 Introduction to Apache Avro file format


Avro is an Apache open source project that provides data serialization and data
exchange services for Hadoop. These services can be used together or independently.
Using Avro, big data can be exchanged between programs written in any language.
Using the serialization service, programs can efficiently serialize data into files or into
messages. Avro stores both the data definition and the data together in one message or
file, making it easy for programs to dynamically understand the information stored in an
Avro file or message. The data definition is stored in JSON format, which makes it easy to
read and interpret; the data itself is stored in binary format, which makes it compact and
efficient. Avro files include markers that can be used to split large data sets into subsets
suitable for MapReduce processing. Some data exchange services use a code generator to
interpret the data definition and produce code to access the data. Avro doesn't require
this step, making it ideal for scripting languages.

Avro supports a rich set of primitive data types, including numeric types, binary data and
strings, and a number of complex types, including arrays, maps, enumerations and
records. A sort order can also be defined for the data. A key feature of Avro is robust
support for data schemas that change over time, often called schema evolution. Avro
cleanly handles schema changes like missing fields, added fields and changed fields; as a
result, old programs can read new data and new programs can read old data. Avro
includes APIs for Java, Python, Ruby, C, C++ and more. Data stored using Avro can easily
be passed from a program written in one language to a program written in another
language, even from a compiled language like C to a scripting language like Pig.
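
The schema evolution behaviour described above can be illustrated with a small sketch. It
does not use the Avro library; it only mimics, in plain Java, the idea that a reader with a
newer schema takes the fields it knows from the written record, falls back to a default for
fields the writer did not have, and skips fields it no longer knows. The class name, the
field names (id, country, legacy_flag) and the default values are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: plain maps stand in for Avro records and schemas.
class SchemaEvolutionSketch {

    // readerDefaults maps every field of the reader's schema to its default value.
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // Field present in the written data: use it. Missing: use the default.
            result.put(name, written.containsKey(name) ? written.get(name) : field.getValue());
        }
        // Fields that exist only in the written data are skipped.
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> oldRecord = new LinkedHashMap<>();
        oldRecord.put("id", 7);
        oldRecord.put("legacy_flag", true);   // dropped from the newer schema

        Map<String, Object> readerDefaults = new LinkedHashMap<>();
        readerDefaults.put("id", 0);
        readerDefaults.put("country", "unknown");   // added in the newer schema

        System.out.println(resolve(oldRecord, readerDefaults));   // {id=7, country=unknown}
    }
}
```

In real Avro files the same resolution is driven by the writer schema stored in the file
and the reader schema supplied by the program.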

1.3 Introduction to Apache Parquet file format


Apache Parquet is a columnar storage format available to any project in the Hadoop
ecosystem, regardless of the choice of data processing framework, data model or
programming language. Parquet was designed to make the advantages of compressed,
efficient columnar data representation available to any project in the Hadoop ecosystem.
Parquet is built from the ground up with complex nested data structures in mind, and
uses the record shredding and assembly algorithm described in the Dremel paper. This
approach is superior to simple flattening of nested name spaces.

Parquet is built to support very efficient compression and encoding schemes. Parquet
allows compression schemes to be specified on a per-column level, and is future-proofed
to allow adding more encodings as they are invented and implemented. Parquet is built to
be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and
we are not interested in playing favorites. We believe that an efficient, well-implemented
columnar storage substrate should be useful to all frameworks without the cost of
extensive and difficult-to-set-up dependencies.
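
To make the term "columnar" concrete, the following is a minimal sketch in plain Java. It
is not Parquet's actual on-disk format (which adds row groups, encodings and nested-data
handling); it only contrasts the order in which cells are stored in a row-oriented and a
column-oriented layout. The class name and the three sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: contrasts row-oriented and column-oriented cell ordering.
class ColumnarLayoutSketch {

    // Row-oriented layout: all values of one record are stored next to each other.
    static List<String> rowLayout(String[][] rows) {
        List<String> cells = new ArrayList<>();
        for (String[] row : rows) {
            for (String value : row) {
                cells.add(value);
            }
        }
        return cells;
    }

    // Column-oriented layout: all values of one column are stored next to each other.
    static List<String> columnLayout(String[][] rows) {
        List<String> cells = new ArrayList<>();
        int columnCount = rows[0].length;
        for (int c = 0; c < columnCount; c++) {
            for (String[] row : rows) {
                cells.add(row[c]);
            }
        }
        return cells;
    }

    // In the columnar layout, column c is one contiguous slice of the stored cells.
    static List<String> readColumn(List<String> columnar, int c, int rowCount) {
        return columnar.subList(c * rowCount, (c + 1) * rowCount);
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"1", "IN", "10.5"},
            {"2", "US", "20.0"},
            {"3", "IN", "7.25"}
        };
        System.out.println(rowLayout(rows));      // [1, IN, 10.5, 2, US, 20.0, 3, IN, 7.25]
        List<String> columnar = columnLayout(rows);
        System.out.println(columnar);             // [1, 2, 3, IN, US, IN, 10.5, 20.0, 7.25]
        System.out.println(readColumn(columnar, 1, rows.length));   // [IN, US, IN]
    }
}
```

A query that touches only one column can read a single contiguous slice in the columnar
layout, and values of the same type sit together, which is what makes per-column
compression effective.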

2 CASE STUDY
2.1 Objective
Load the data from a Teradata table of approximately 40 GB into Hive
tables to compare query execution times across different
file formats.

2.2 Extracting the data from Teradata

The data was extracted from the Teradata table using different storage formats, with and
without compression enabled.
Sqoop Commands:
The commands below can be used to extract the data as Textfile, Textfile with
compression, Avro, and Avro with compression.

Since sqoop doesn't support importing the data in Parquet format, we need to convert
the data extracted by sqoop into Parquet format.
Two approaches to convert the data into Parquet format are listed below:
i. Converting a CSV file into Parquet data

1) Get the data file from the Teradata database in CSV format
2) Process the data file using Pig to convert the CSV file into Parquet file format
3) Once the conversion is done, create a Parquet Hive table based on the schema

4) Load the converted Parquet file into the Parquet table

5) Description of Parquet Table

6) Simple Select Query on Parquet Table

7) Complex Select Query on Parquet Table

8) Table Size

ii. Converting an Avro data file into Parquet data file

1) Use the source code below to create a jar that converts an Avro file into a
Parquet data file
Main java Code
File Name : Avro2Parquet.java
package com.cloudera.science.avro2parquet;

import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import parquet.avro.AvroParquetOutputFormat;
import parquet.avro.AvroSchemaConverter;
import parquet.hadoop.metadata.CompressionCodecName;

public class Avro2Parquet extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    Path schemaPath = new Path(args[0]);
    Path inputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    Job job = new Job(getConf());
    job.setJarByClass(getClass());
    Configuration conf = job.getConfiguration();

    FileSystem fs = FileSystem.get(conf);
    InputStream in = fs.open(schemaPath);
    Schema avroSchema = new Schema.Parser().parse(in);

    System.out.println(new AvroSchemaConverter().convert(avroSchema).toString());

    FileInputFormat.addInputPath(job, inputPath);
    job.setInputFormatClass(AvroKeyInputFormat.class);

    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setOutputPath(job, outputPath);
    AvroParquetOutputFormat.setSchema(job, avroSchema);
    AvroParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
    AvroParquetOutputFormat.setCompressOutput(job, true);

    /* Impala likes Parquet files to have only a single row group.
     * Setting the block size to a larger value helps ensure this to
     * be the case, at the expense of buffering the output of the
     * entire mapper's split in memory.
     *
     * It would be better to set this based on the files' block size,
     * using fs.getFileStatus or fs.listStatus.
     */
    AvroParquetOutputFormat.setBlockSize(job, 500 * 1024 * 1024);

    job.setMapperClass(Avro2ParquetMapper.class);
    job.setNumReduceTasks(0);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Avro2Parquet(), args);
    System.exit(exitCode);
  }
}

Mapper java Code
File Name : Avro2ParquetMapper.java

package com.cloudera.science.avro2parquet;

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class Avro2ParquetMapper extends
    Mapper<AvroKey<GenericRecord>, NullWritable, Void, GenericRecord> {

  @Override
  protected void map(AvroKey<GenericRecord> key, NullWritable value,
      Context context) throws IOException, InterruptedException {
    // Pass each Avro record through unchanged; the output format writes it as Parquet.
    context.write(null, key.datum());
  }
}

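2) Compile the code into a jar using the commands below

javac -d . -classpath `hadoop classpath`:. Avro2Parquet.java Avro2ParquetMapper.java
jar cvf avro2parquet.jar com

The -d . option writes the class files into the com/cloudera/science/avro2parquet
package directory, so that the jar matches the class name used in the next step.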

3) Use the jar to convert from Avro data format into Parquet data format
hadoop jar <avro2parquet jar file> \
com.cloudera.science.avro2parquet.Avro2Parquet \
<and generic options to the JVM> \
hdfs:///path/to/avro/schema.avsc \
hdfs:///path/to/avro/data \
hdfs:///output/path

The stats below show the file size comparison across the different file storage types.

As can be seen, file size decreases drastically as we move from Text file -> Snappy
compression and from Avro file -> Avro compression; in the case of Parquet, even without
any compression, the file size comes down to 85% of the original size.

2.3 Loading the data extracted from sqoop into the Hive tables
1) Loading the text file into Hive Table
This is a straightforward load of the data into the Hive table.
2) Loading the Avro file into Hive Table
CREATE EXTERNAL TABLE avro_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<hdfs_path>'
TBLPROPERTIES (
'avro.schema.url'='<hdfs_path>/avro_schema.avsc');

As shown above, creating an Avro Hive table requires specifying the Avro schema.
Defining the Avro schema by hand can be difficult because it needs a thorough knowledge
of JSON; hence the schema can be extracted from the Avro data file itself instead of
being defined manually for each of the tables.
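
Extracting the schema from the data file is possible because, per the Avro specification,
every Avro container file carries its writer schema as JSON in the file header under the
metadata key avro.schema. The following is a minimal sketch in plain Java that reads only
that header; it assumes the data file has been copied to a local path passed as the first
argument, and the class name AvroSchemaPeek is hypothetical. In practice, the getschema
command of avro-tools or the Avro library's DataFileReader is the usual shortcut.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Sketch: reads the header of an Avro container file and returns the embedded schema.
class AvroSchemaPeek {

    // Avro encodes a long as a variable-length, zigzag-encoded integer.
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated Avro header");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return (raw >>> 1) ^ -(raw & 1);   // undo the zigzag encoding
    }

    // Avro strings and byte arrays are a length followed by that many bytes.
    static byte[] readBytes(InputStream in) throws IOException {
        byte[] buf = new byte[(int) readLong(in)];
        new DataInputStream(in).readFully(buf);
        return buf;
    }

    // Returns the JSON schema from the header, or null if the key is absent.
    static String readSchemaJson(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        new DataInputStream(in).readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("not an Avro container file");
        }
        String schema = null;
        // The file metadata is an Avro map: blocks of key/value pairs, ended by a 0 count.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {
                count = -count;
                readLong(in);   // a negative count is followed by the block size in bytes
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                byte[] value = readBytes(in);
                if (key.equals("avro.schema")) {
                    schema = new String(value, StandardCharsets.UTF_8);
                }
            }
        }
        return schema;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(readSchemaJson(in));
        }
    }
}
```

The printed JSON can be saved as the .avsc file that the Avro Hive table definition
points to.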

3) Loading the data into Parquet based Hive Table

The Parquet Hive table is created with the same structure as the Avro table, so that it
uses the same schema as the Avro table.
CREATE TABLE Parquet_table
ROW FORMAT
SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
AS SELECT * FROM AVRO_TABLE WHERE 1=2;

Load the converted Parquet data into the newly created Parquet Hive table.
The stats below show the query response time across the different file storage types.

2.4 Conclusion

Comparing all three formats, Parquet compresses the data file to a great extent and its
query response time is also much faster than that of the other two formats; hence Parquet
looks to be the clear winner in this scenario.
Since Parquet is columnar, it might not be as efficient as in the above use case
when entire rows need to be accessed.
Revision History

Date        Version   Description   Author
27-Nov-14   1.0       Created       Mohammed Danesh Guard
