
STORAGE FORMATS IN HADOOP

FACTORS TO CONSIDER WHILE CHOOSING A FILE STORAGE FORMAT

1. Compression strategy:
a. Compression speed
b. Compression ratio (how compact the compressed output is)
c. Splittable compression (especially for parallel processing)
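The speed/ratio trade-off can be illustrated with the compression codecs in the Python standard library. (Hadoop's usual codecs such as Snappy, LZO, and gzip-via-native-libs are not available in the stdlib, so gzip, bz2, and lzma stand in here purely to show the spectrum; splittability depends on the codec and container, not on this sketch.)

```python
import gzip
import bz2
import lzma

# Repetitive text, typical of log data, compresses well.
data = b"2024-01-01 INFO request served\n" * 10_000

# Compare compression ratios (compressed size / original size) across codecs.
results = {}
for name, compress in [("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    compressed = compress(data)
    results[name] = len(compressed) / len(data)

for name, ratio in results.items():
    print(f"{name}: ratio {ratio:.4f}")
```

Slower codecs generally buy a better ratio; which end of the spectrum to pick depends on whether the data is write-once/read-many or frequently rewritten.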

2. Type conversion overhead associated with storing data in text files
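A small stdlib sketch of that overhead: a text file forces a parse (bytes to string to number) on every read, whereas a binary format stores the machine representation directly. The field values here are made up for illustration.

```python
import struct

# The same two fields stored two ways.
text_row = b"1234567,89.5\n"                    # text encoding
binary_row = struct.pack(">id", 1234567, 89.5)  # fixed-width binary encoding

# Text path: split the line and convert each field individually.
f1, f2 = text_row.rstrip(b"\n").split(b",")
parsed = (int(f1), float(f2))

# Binary path: one fixed-layout unpack, no per-field string parsing.
unpacked = struct.unpack(">id", binary_row)

assert parsed == unpacked == (1234567, 89.5)
```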

3. File splitting (for example, it is difficult to split structured files such as XML or JSON)
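Why plain text splits easily: a reader starting at an arbitrary byte offset can simply scan forward to the next newline to find a record boundary, as in this sketch. XML and JSON have no such single-delimiter boundary, so a split placed mid-document cannot recover the record structure the same way.

```python
text = b"row1,a\nrow2,b\nrow3,c\nrow4,d\n"

def next_record_offset(buf: bytes, offset: int) -> int:
    """Return the offset of the first complete record at or after `offset`."""
    if offset == 0:
        return 0
    nl = buf.find(b"\n", offset - 1)
    return nl + 1 if nl != -1 else len(buf)

# A split starting mid-record (byte 10, inside "row2") is adjusted forward
# to the start of the next full record.
start = next_record_offset(text, 10)
print(text[start:])
```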

4. Columnar vs Row storage
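The difference between the two layouts can be sketched in a few lines: in a row layout each record's fields sit together, while in a columnar layout all values of one field sit together, so a query touching only one column reads one contiguous block instead of every record.

```python
# Row layout: fields of each record stored together.
rows = [("alice", 30, "NY"), ("bob", 25, "SF"), ("carol", 35, "LA")]

# Columnar layout: all values of each field stored together.
columns = {
    "name": [r[0] for r in rows],
    "age":  [r[1] for r in rows],
    "city": [r[2] for r in rows],
}

# Projecting one field from the row store visits every record...
ages_from_rows = [r[1] for r in rows]
# ...while in the column store the field is already contiguous.
ages_from_columns = columns["age"]

assert ages_from_rows == ages_from_columns
```

Columnar formats such as Parquet and ORC exploit this layout for column pruning and per-column compression.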

5. Support outside the Hadoop system or outside a particular application: e.g., ORC is supported
mainly by Hive, with limited or no support in non-Hive interfaces such as Impala, Pig, or Java.

6. Human-readable (text) vs non-readable (binary) format

7. File size, small or large: this is a classic use case for the SequenceFile format. Storing a
large number of small files in Hadoop can cause a couple of issues. One is excessive memory use
on the NameNode, because the metadata for each file stored in HDFS is held in memory. Another
potential issue arises when processing the data in these files: many small files can lead to
many processing tasks, causing excessive processing overhead. Because Hadoop is optimized for
large files, packing smaller files into a SequenceFile makes the storage and processing of
these files much more efficient.
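The packing idea can be sketched with the stdlib: many small "files" become length-prefixed (key, value) records in a single container, just as a SequenceFile stores (filename, contents) pairs. This is a hypothetical simplification; a real SequenceFile also has a header, sync markers, and optional compression.

```python
import struct

def pack(files: dict) -> bytes:
    """Pack {name: payload} pairs into one length-prefixed container."""
    out = bytearray()
    for name, payload in files.items():
        key = name.encode()
        out += struct.pack(">II", len(key), len(payload)) + key + payload
    return bytes(out)

def unpack(blob: bytes) -> dict:
    """Recover the {name: payload} pairs from the container."""
    files, pos = {}, 0
    while pos < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + klen]; pos += klen
        val = blob[pos:pos + vlen]; pos += vlen
        files[key.decode()] = val
    return files

small_files = {"a.txt": b"hello", "b.txt": b"world"}
container = pack(small_files)
assert unpack(container) == small_files   # one blob, many logical files
```

From the NameNode's point of view, the container is a single file with a single metadata entry, however many logical files it holds.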

8. Serialization format and inter-language communication: Thrift, Protocol Buffers, and Avro are
serialization frameworks that facilitate data exchange between services written in different
languages. Thrift and Protocol Buffers have several drawbacks: they do not support internal
compression of records, they are not splittable, and they lack native MapReduce support; an
external API is needed with these two to provide those functions. Avro addresses all of these
drawbacks.
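The core idea these frameworks share can be sketched with the stdlib alone: the schema, not each record, carries the field names and types, so the wire format stays compact and any language that knows the schema can decode it. The schema below is a made-up example, not Avro's actual encoding.

```python
import struct

# Schema: ordered (field name, struct format) pairs, shared by writer and reader.
SCHEMA = [("user_id", "q"), ("score", "d")]
FMT = ">" + "".join(fmt for _, fmt in SCHEMA)

def encode(record: dict) -> bytes:
    """Serialize only the values; the schema supplies names and types."""
    return struct.pack(FMT, *(record[name] for name, _ in SCHEMA))

def decode(data: bytes) -> dict:
    """Reattach field names from the schema while decoding."""
    values = struct.unpack(FMT, data)
    return dict(zip((name for name, _ in SCHEMA), values))

rec = {"user_id": 42, "score": 3.5}
wire = encode(rec)
assert decode(wire) == rec
assert len(wire) == 16   # 8-byte int + 8-byte double; no field names on the wire
```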

9. Schema Evolution
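Schema evolution is the ability to change the schema without breaking old data. A minimal sketch of the mechanism Avro formalizes as writer/reader schema resolution: the reader's schema adds a field with a default, so records written under the old schema remain readable. The field names here are hypothetical.

```python
# Reader schema: `region` was added after old records were written,
# so it carries a default value.
READER_SCHEMA = {
    "fields": [
        {"name": "user_id"},
        {"name": "score"},
        {"name": "region", "default": "unknown"},
    ]
}

def resolve(record: dict, schema: dict) -> dict:
    """Fill in defaults for fields the (older) record lacks."""
    out = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"missing field {name!r} with no default")
    return out

old_record = {"user_id": 42, "score": 3.5}   # written before `region` existed
print(resolve(old_record, READER_SCHEMA))
```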

10. Data types supported by the storage format: are complex nested types supported?

11. Is metadata needed as part of storage format?

12. Failure handling: an important aspect of the various file formats is how well they handle
corruption; some formats handle it better than others. A few examples are listed below:
a. Columnar formats, while often efficient, do not work well in the event of failure, since
this can lead to incomplete rows.
b. Sequence files will be readable up to the first failed row, but will not be recoverable
after that row.
c. Avro provides the best failure handling; in the event of a bad record, the read will
continue at the next sync point, so failures only affect a portion of a file.
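The sync-point mechanism behind (c) can be sketched with the stdlib: records are separated by a marker, a damaged block fails its check, and reading resumes at the next marker so only that block is lost. The marker and the `isascii` "checksum" are hypothetical stand-ins; Avro actually uses a random 16-byte sync marker and per-block structure.

```python
SYNC = b"\x00SYNC\x00"   # hypothetical marker; Avro uses a random 16-byte one

def write_blocks(records):
    """Join records with sync markers between them."""
    return SYNC.join(records)

def read_blocks(blob):
    """Yield records, skipping any block that fails a simple validity check."""
    for block in blob.split(SYNC):
        if block.isascii():          # stand-in for a real checksum
            yield block

good = [b"rec1", b"rec2", b"rec3"]
blob = bytearray(write_blocks(good))
blob[len(b"rec1") + len(SYNC)] = 0xFF   # corrupt the first byte of rec2

recovered = list(read_blocks(bytes(blob)))
assert recovered == [b"rec1", b"rec3"]  # only the damaged block is lost
```

Contrast this with point (b): without sync markers, everything after the corrupt row would be unrecoverable.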

Sample Use Case for Reading and writing to storage formats using Hive:

