You are on page 1of 20

Lesson 

: File Formats for


Storing and Quering
data

1/
File Formats

Raw or record-based formats :
 Plain text file
 Structured texte file : CSV, JSON, ...
 SequenceFile
 Avro
 ...

Column-based formats :
 RC /ORC
 Parquet
Plain Text File

Are human readable and easily parsable

Are slow to read and write.

Data size is relatively bulky and not as
efficient to query.

No metadata is stored in the text files so we
need to know how the structure of the fields.

Text files are not splittable after compression

Limited support for schema evolution: new
fields can only be appended at the end of the
records and existing fields can never be
removed.
Structured text file

Are human readable and easily parsable

Exp : CSV, TSV, XML or JSON files.
Sequence File (for images,
videos, ..)

Row-based

Consisting of binary key/value pairs

More compact than text files

You can’t perform specified key editing, adding,
removal: files are append only

Support splitting even when the data inside the file is
compressed

The sequence file reader will read until a sync
marker is reached ensuring that a record is read as a
whole

do not store metadata, so the only schema evolution
option is appending new fields

Three different SequenceFile formats

 Uncompressed key/value records.


 Record compressed key/value records - only
'values' are compressed here.
 Block compressed key/value records - both keys
and values are collected in 'blocks' separately and
compressed. The size of the 'block' is configurable
Sequence File structure
Sequence File: the header
AVRO

Raw base

Compact binary format,

Offers supports dynamic schema discovery and
schema evolution

Widely used as a serialization platform,

Interoperable across multiple languages,

Compressible and splittable.

Offers complex data structures like nested types.
AVRO

Direct mapping from/to JSON

Interoperability: can serialize into Avro/Binary or Avro/Json

Provides rich data structures

Map keys can only be strings (could be seen as a
limitation)

Extensible schema language

Untagged data

Bindings for a wide variety of programming languages

Dynamic typing

Provides a remote procedure call

Supports block compression

Avro files are splittable
AVRO structure
RCfile (Record Colunm)
ORC (optimized Row

Columnar)
Column-oriented

Lightweight indexes stored within the file

Ability to skip row groups that don’t pass predicate
filtering

Block-mode compression based on data type

Includes basic statistics (min, max, sum, count) on
columns

Concurrent reads of the same file using separate
RecordReaders

Ability to split files without scanning for markers

Metadata is stored using protobuf, which allows schema
evolution by supporting addition and removal of fields
ORC structure
Parquet

Column-oriented

Efficient in terms of disk I/O and memory utilization

Efficiently encoding of nested structures

Provides extensible support for per-column encodings.

Provides extensibility of storing multiple types of data in
column data.

Offers better write performance by storing metadata at
the end of the file.

Records in columns are homogeneous so it’s easier to
apply encoding schemes.

Parquet supports Avro files via object model converters
that map an external object model to Parquet’s internal
data types
Parquet structure
File formats

18 /

You might also like