1. Compression strategy:
a. Compression speed
b. Compression ratio (most compact vs. least compact)
c. Splittable compression (especially for parallel processing)
3. File splitting (for example, it is difficult to split structured files like XML or JSON)
5. Support outside the Hadoop system or outside a particular application: e.g., ORC is supported
mainly in Hive, with limited or no support in non-Hive interfaces such as Impala, Pig, or Java.
7. File size, smaller or larger: this example is a special use case for the sequence file
format. Storing a large number of small files in Hadoop can cause a couple of issues. One is
excessive memory use on the NameNode, because metadata for each file stored in HDFS is
held in memory. Another potential issue is in processing the data in these files: many small
files can lead to many processing tasks, causing excessive overhead. Because Hadoop is
optimized for large files, packing smaller files into a SequenceFile makes the storage and
processing of these files much more efficient.
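The NameNode memory point above can be made concrete with some back-of-the-envelope arithmetic. The figure of roughly 150 bytes of NameNode heap per namespace object (file or block) is a commonly cited rule of thumb, not an exact number, and the function below is only an illustrative sketch:

```python
# Rough sketch of why many small files strain the NameNode.
# The ~150 bytes-per-object figure is a commonly cited rule of
# thumb for NameNode heap usage, not an exact measurement.

BYTES_PER_OBJECT = 150  # approximate heap cost per file or block object

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by file + block metadata."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT

# One million small files: ~300 MB of heap just for metadata.
print(namenode_heap_bytes(1_000_000))          # 300000000

# The same data packed into a single SequenceFile of ~8 blocks.
print(namenode_heap_bytes(1, blocks_per_file=8))  # 1350
```

Even with the rough constant, the gap of several orders of magnitude shows why packing small files pays off.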
9. Schema Evolution
10. Data types supported by the storage format: are complex nested types supported?
12. Failure handling: an important aspect of the various file formats is failure handling; some
formats handle corruption better than others. A few examples are listed below:
a. Columnar formats, while often efficient, do not work well in the event of failure, since
this can lead to incomplete rows.
b. Sequence files will be readable to the first failed row, but will not be recoverable after
that row.
c. Avro provides the best failure handling; in the event of a bad record, the read will
continue at the next sync point, so failures only affect a portion of a file.
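The sync-point idea behind Avro's recovery behavior can be sketched with a toy reader. This is not the real Avro container format or API; the marker bytes and "decoding" step below are invented purely for illustration:

```python
# Toy illustration of sync-marker recovery, loosely modeled on the
# idea behind Avro's sync points. Not the real Avro file format:
# the marker and block encoding are invented for this example.

SYNC = b"\x00SYNC\x00"

def write_blocks(records):
    """Join record blocks with a sync marker between them."""
    return SYNC.join(records)

def read_blocks(data):
    """Yield parseable blocks, skipping corrupted ones.

    A reader that hits a bad block seeks forward to the next sync
    marker and continues, so a failure only loses part of the file.
    """
    for block in data.split(SYNC):
        try:
            yield block.decode("utf-8")  # decoding stands in for record parsing
        except UnicodeDecodeError:
            continue  # corrupted block: resume at the next sync point

data = write_blocks([b"row1", b"\xff\xfe garbage", b"row3"])
print(list(read_blocks(data)))  # ['row1', 'row3']
```

The middle block is invalid UTF-8, so the reader drops it and continues, mirroring how a sync-point format limits the blast radius of corruption.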
Sample Use Case for Reading and writing to storage formats using Hive:
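As a sketch of what such a use case could look like, the script below generates Hive DDL for several storage formats. The table and column names are made up for illustration; the STORED AS TEXTFILE/SEQUENCEFILE/AVRO/ORC clauses are real Hive syntax:

```python
# Sketch of Hive DDL for reading and writing different storage
# formats. Table and column names are hypothetical; STORED AS
# TEXTFILE / SEQUENCEFILE / AVRO / ORC are real Hive clauses.

def create_table_ddl(table, columns, stored_as):
    """Build a CREATE TABLE statement for a given storage format."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return f"CREATE TABLE {table} ({cols}) STORED AS {stored_as}"

columns = [("id", "INT"), ("name", "STRING")]

for fmt in ("TEXTFILE", "SEQUENCEFILE", "AVRO", "ORC"):
    print(create_table_ddl(f"users_{fmt.lower()}", columns, fmt))

# Converting data between formats in Hive is then a plain INSERT ... SELECT:
print("INSERT INTO users_orc SELECT id, name FROM users_textfile")
```

Run in a Hive session, statements like these let the same rows be written once per format, which is a convenient way to compare size, query speed, and tool support for each format.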