Lesson: File Formats For Storing and Quering Data

Uploaded by

Hajri

0% found this document useful (0 votes)

4 views20 pages

Original Title

Leçon_FileFormats-c7

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

0% found this document useful (0 votes)

4 views20 pages

Lesson: File Formats For Storing and Quering Data

Uploaded by

Hajri

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Jump to Page

You are on page 1of 20

Search inside document

Lesson

: File Formats for

Storing and Quering
data

1/
File Formats

Raw or record-based formats :
 Plain text file
 Structured texte file : CSV, JSON, ...
 SequenceFile
 Avro
 ...

Column-based formats :
 RC /ORC
 Parquet
Plain Text File

Are human readable and easily parsable

Are slow to read and write.

Data size is relatively bulky and not as
efficient to query.

No metadata is stored in the text files so we
need to know how the structure of the fields.

Text files are not splittable after compression

Limited support for schema evolution: new
fields can only be appended at the end of the
records and existing fields can never be
removed.
Structured text file

Are human readable and easily parsable

Exp : CSV, TSV, XML or JSON files.
Sequence File (for images,
videos, ..)

Row-based

Consisting of binary key/value pairs

More compact than text files

You can’t perform specified key editing, adding,
removal: files are append only

Support splitting even when the data inside the file is
compressed

The sequence file reader will read until a sync
marker is reached ensuring that a record is read as a
whole

do not store metadata, so the only schema evolution
option is appending new fields

Three different SequenceFile formats

 Uncompressed key/value records.

 Record compressed key/value records - only
'values' are compressed here.
 Block compressed key/value records - both keys
and values are collected in 'blocks' separately and
compressed. The size of the 'block' is configurable
Sequence File structure
Sequence File: the header
AVRO

Raw base

Compact binary format,

Offers supports dynamic schema discovery and
schema evolution

Widely used as a serialization platform,

Interoperable across multiple languages,

Compressible and splittable.

Offers complex data structures like nested types.
AVRO

Direct mapping from/to JSON

Interoperability: can serialize into Avro/Binary or Avro/Json

Provides rich data structures

Map keys can only be strings (could be seen as a
limitation)

Extensible schema language

Untagged data

Bindings for a wide variety of programming languages

Dynamic typing

Provides a remote procedure call

Supports block compression

Avro files are splittable
AVRO structure
RCfile (Record Colunm)
ORC (optimized Row

Columnar)
Column-oriented

Lightweight indexes stored within the file

Ability to skip row groups that don’t pass predicate
filtering

Block-mode compression based on data type

Includes basic statistics (min, max, sum, count) on
columns

Concurrent reads of the same file using separate
RecordReaders

Ability to split files without scanning for markers

Metadata is stored using protobuf, which allows schema
evolution by supporting addition and removal of fields
ORC structure
Parquet

Column-oriented

Efficient in terms of disk I/O and memory utilization

Efficiently encoding of nested structures

Provides extensible support for per-column encodings.

Provides extensibility of storing multiple types of data in
column data.

Offers better write performance by storing metadata at
the end of the file.

Records in columns are homogeneous so it’s easier to
apply encoding schemes.

Parquet supports Avro files via object model converters
that map an external object model to Parquet’s internal
data types
Parquet structure
File formats

18 /

Iseries Overview - III
Document10 pages
Iseries Overview - III
Anonymous 7r2OlOFV
No ratings yet
Mirza Ghalib
Document36 pages
Mirza Ghalib
mailwriter19
No ratings yet
Course 2 - Intro To Data Science
Document113 pages
Course 2 - Intro To Data Science
Hajri
0% (1)
Chicago Citation Style
Document8 pages
Chicago Citation Style
Jeff Eunus
No ratings yet
Managing Multimedia and Unstructured Data in the Oracle Database
From Everand
Managing Multimedia and Unstructured Data in the Oracle Database
Marcelle Kratochvil
No ratings yet
Course 4 - Watson Studio
Document62 pages
Course 4 - Watson Studio
Hajri
No ratings yet
Course 1 - Big Data Ecosystem
Document663 pages
Course 1 - Big Data Ecosystem
Parimi baby anitha
No ratings yet
Course 1 - Big Data Ecosystem
Document663 pages
Course 1 - Big Data Ecosystem
Parimi baby anitha
No ratings yet
B2 UNIT 3 Test Standard
Document8 pages
B2 UNIT 3 Test Standard
hi
100% (1)
Direct and Indirect Questions
Document8 pages
Direct and Indirect Questions
Iuli Julie
No ratings yet
Word List For Levels A1-C1
Document3,155 pages
Word List For Levels A1-C1
Ugur Kaya
No ratings yet
Disk Storage, File Structure and Hashing
Document30 pages
Disk Storage, File Structure and Hashing
Kapil Singhal
No ratings yet
XML Programming: The Ultimate Guide to Fast, Easy, and Efficient Learning of XML Programming
From Everand
XML Programming: The Ultimate Guide to Fast, Easy, and Efficient Learning of XML Programming
Christopher Right
Rating: 2.5 out of 5 stars
2.5/5 (2)
Bigdata Fileformats
Document12 pages
Bigdata Fileformats
Madhavan Eyunni
No ratings yet
Common data formats for big data include Avro, Parquet, JSON
Document7 pages
Common data formats for big data include Avro, Parquet, JSON
Mahmoud Elmahdy
No ratings yet
File Formats: Critical and You Need To Consider The Format, Compression and Your Data
Document3 pages
File Formats: Critical and You Need To Consider The Format, Compression and Your Data
Parveen Kumari
No ratings yet
Azure Storages
Document12 pages
Azure Storages
golden
No ratings yet
Data Base
Document1 page
Data Base
Awais Bin Asif
No ratings yet
Understanding Different Types of File Formats-En
Document2 pages
Understanding Different Types of File Formats-En
consciousconsultingthefirm
No ratings yet
Fundamental File Structure Concepts
Document48 pages
Fundamental File Structure Concepts
Venkat S
No ratings yet
Unit 4 - File Systems in Operating System
Document15 pages
Unit 4 - File Systems in Operating System
Sujal Mishra
No ratings yet
08 Task Performance 1-OS
Document4 pages
08 Task Performance 1-OS
marjorie Arroyo
No ratings yet
CMP 225
Document27 pages
CMP 225
Emmanuel Elizabeth
No ratings yet
Text File Types and Formats Explained
Document4 pages
Text File Types and Formats Explained
Michael Vladislav
No ratings yet
Oops
Document19 pages
Oops
satya kamble
No ratings yet
File Formats in Big Data
Document13 pages
File Formats in Big Data
Meghna Sharma
No ratings yet
Data File Structure
Document2 pages
Data File Structure
vianfulloflife
No ratings yet
LabVIEW File IO PDF
Document9 pages
LabVIEW File IO PDF
caovanchung
No ratings yet
Text File
Document4 pages
Text File
Remi Drigo
No ratings yet
Zip Files in Csharp
Document169 pages
Zip Files in Csharp
Porla Rechucha
No ratings yet
Inls 623 - Database Systems Ii - File Structures, Indexing, and Hashing
Document41 pages
Inls 623 - Database Systems Ii - File Structures, Indexing, and Hashing
Akriti Agrawal
No ratings yet
Basic File Structure
Document17 pages
Basic File Structure
Saloni Vani
No ratings yet
SDF SDF SDF
Document2 pages
SDF SDF SDF
bopufouria
No ratings yet
CH6_1
Document14 pages
CH6_1
tester
No ratings yet
Files Chapter Seven
Document24 pages
Files Chapter Seven
qthermal
No ratings yet
Presentation 1
Document44 pages
Presentation 1
PawanKumar
No ratings yet
Aniket Pandey Presentation File Management 1
Document15 pages
Aniket Pandey Presentation File Management 1
Aniket pandey
No ratings yet
File Organization Methods and Techniques
Document6 pages
File Organization Methods and Techniques
dhiraj100
No ratings yet
FS Lab Mini Project Placement Statistics
Document44 pages
FS Lab Mini Project Placement Statistics
Tripti kumari
No ratings yet
Technical Note TN1150 - HFS Plus Volume Format
Document58 pages
Technical Note TN1150 - HFS Plus Volume Format
zos1568003
No ratings yet
File Handling
Document25 pages
File Handling
Utkarsh Singh Sikarwar
No ratings yet
Operating Systems Unit - 5: I/O and File Management
Document48 pages
Operating Systems Unit - 5: I/O and File Management
Shailesh Tripathi
No ratings yet
File Systems Fundamentals
Document5 pages
File Systems Fundamentals
BhubanAnandaChhatriya
No ratings yet
15 Files 1
Document36 pages
15 Files 1
Karthikeya
No ratings yet
DP900 Chapter1 Notes
Document10 pages
DP900 Chapter1 Notes
ellesonntseare
No ratings yet
Chapter 5 Managing Files of Records
Document16 pages
Chapter 5 Managing Files of Records
JoshuaChrisEspiritu
No ratings yet
File Management 1
Document11 pages
File Management 1
Aniket pandey
No ratings yet
Hadoop File Formats - YoussefEtman
Document8 pages
Hadoop File Formats - YoussefEtman
Israa
No ratings yet
Ignou Bca Cs 06 Solved Assignment 2012
Document10 pages
Ignou Bca Cs 06 Solved Assignment 2012
Vikas Sengar
No ratings yet
C++ Data File Handling
Document11 pages
C++ Data File Handling
Royal Green Achrol
No ratings yet
ICS 143 - Principles of Operating Systems
Document51 pages
ICS 143 - Principles of Operating Systems
ram_lalit
No ratings yet
Os Unit-4 (Bca)
Document40 pages
Os Unit-4 (Bca)
Bot Id
No ratings yet
X Way Forensic
Document4 pages
X Way Forensic
Zeljko Tomic
No ratings yet
Chapter 4 Summery
Document14 pages
Chapter 4 Summery
Puja Ranasinghe
No ratings yet
Text File: Verification
Document5 pages
Text File: Verification
Bman
No ratings yet
Text File - Wikipedia
Document19 pages
Text File - Wikipedia
Igammm 12
No ratings yet
CMP 222 File Organization and Management Course Overview
Document19 pages
CMP 222 File Organization and Management Course Overview
Emmanuel Elizabeth
No ratings yet
Mamo
Document4 pages
Mamo
James
No ratings yet
Fundamentos de Las Tecnologías de La Información Uc3m Tema 4
Document5 pages
Fundamentos de Las Tecnologías de La Información Uc3m Tema 4
claudiazdeandres
No ratings yet
Data Structures PDF
Document622 pages
Data Structures PDF
kamayani_pr
100% (4)
Labview File
Document6 pages
Labview File
Tadveer Singh Hora
No ratings yet
Storage Formats in Hadoop
Document4 pages
Storage Formats in Hadoop
Aadi Manchanda
No ratings yet
Chapter 5: File System 5. File Concepts
Document13 pages
Chapter 5: File System 5. File Concepts
mehari
No ratings yet
Unit#5
Document24 pages
Unit#5
aamirneyazi12
No ratings yet
Discuss Various Features of XML: Extended Hierarchical Format Data Definition
Document7 pages
Discuss Various Features of XML: Extended Hierarchical Format Data Definition
Azhar Khas
No ratings yet
File System-1
Document11 pages
File System-1
Leslie Qwer
No ratings yet
File Systems: Design and Implementation
Document63 pages
File Systems: Design and Implementation
Sasi Murugesan
No ratings yet
Big Data Technologies - Data Storage
Document91 pages
Big Data Technologies - Data Storage
Gotchi
No ratings yet
Beginning XML
From Everand
Beginning XML
Joe Fawcett
Rating: 3 out of 5 stars
3/5 (1)
Leçon: Spark For Distributed Computing (Alternative To Hadoop Mapreduce)
Document21 pages
Leçon: Spark For Distributed Computing (Alternative To Hadoop Mapreduce)
Hajri
No ratings yet
Phil-IRI SILENT Form 1 - Pretest Grade VI Reading
Document2 pages
Phil-IRI SILENT Form 1 - Pretest Grade VI Reading
Benedick Damaso
No ratings yet
Sabatun Naimi by Mufti Zulfiqar Khan
Document29 pages
Sabatun Naimi by Mufti Zulfiqar Khan
Mohammad Izharun Nabi Hussaini
No ratings yet
Properties of the Ism—Status and Exercises
Document4 pages
Properties of the Ism—Status and Exercises
sfalam
No ratings yet
Pragmatic Transfer
Document20 pages
Pragmatic Transfer
Dian Anggraini
No ratings yet
Toefl PBT Test Answer Sheet Template
Document1 page
Toefl PBT Test Answer Sheet Template
Youth Blue
No ratings yet
Intelektualnost I Umjetnost Renesanse
Document16 pages
Intelektualnost I Umjetnost Renesanse
ljubisa121
No ratings yet
Age of Chaucer
Document2 pages
Age of Chaucer
Venkates DSwamy
No ratings yet
ESP Handbook
Document24 pages
ESP Handbook
Tch Dante
No ratings yet
Differentiating Types of Apraxia in Children with Learning Disabilities
Document8 pages
Differentiating Types of Apraxia in Children with Learning Disabilities
Laura
No ratings yet
Adjectives X Adverbs Review
Document2 pages
Adjectives X Adverbs Review
neidefbg
No ratings yet
First year baccalaureate lesson plan on internet addiction
Document79 pages
First year baccalaureate lesson plan on internet addiction
Samir Amrani
No ratings yet
Grammar Tenses and Voice Changes
Document6 pages
Grammar Tenses and Voice Changes
Vanessa Karina
No ratings yet
Spoken English Course Outline
Document2 pages
Spoken English Course Outline
ranjhajogi
No ratings yet
Evaluation and Treatment of Aphasia Among The Elderly With Stroke
Document11 pages
Evaluation and Treatment of Aphasia Among The Elderly With Stroke
mgpastor
No ratings yet
IT2402: Fundamentals of Software Engineering: University of Colombo, Sri Lanka
Document11 pages
IT2402: Fundamentals of Software Engineering: University of Colombo, Sri Lanka
Angana
No ratings yet
Cic - gc.CA-Language Test Equivalency Charts
Document3 pages
Cic - gc.CA-Language Test Equivalency Charts
Mahmoz
No ratings yet
P 1
Document2 pages
P 1
alsquatrevents
No ratings yet
G31T-LM2: Page Title of Schematic
Document41 pages
G31T-LM2: Page Title of Schematic
Xuân Trường
100% (1)
Dalai Lama Teaches on Four Seals and Path to Liberation
Document17 pages
Dalai Lama Teaches on Four Seals and Path to Liberation
dimeotan
No ratings yet
Java Package
Document10 pages
Java Package
5025linux
No ratings yet
City of Hartford Accounting-Outside Legal Services 2017-2018
Document21 pages
City of Hartford Accounting-Outside Legal Services 2017-2018
kevinhfd
50% (2)
Result Gazette IX-2016
Document94 pages
Result Gazette IX-2016
WsxQaz
No ratings yet
Rethinking Modular Multi-Exponentiation Algorithms
Document15 pages
Rethinking Modular Multi-Exponentiation Algorithms
fsd fsdfds
No ratings yet
Manual Dynare
Document99 pages
Manual Dynare
Juan Palomino
No ratings yet
What is Rhetoric
Document6 pages
What is Rhetoric
Bazura Baharuddin
No ratings yet