Pandas I

Pandas and PyTables
Ezequiel Cimadevilla Álvarez

ezequiel.cimadevilla@unican.es
Santander Meteorology Group

ETSI Caminos, Department of Applied Mathematics and Computer Sciences
University of Cantabria
Avenida de los Castros s/n
39005 Santander, Spain
http://www.meteo.unican.es
Máster Data Science/Ciencia de Datos - 2020/2021

Outline
Recap
Pandas
● Series
● DataFrame
● Index
PyTables
Exercises
Recap
Following the contents of the course
● Python - Basics and OOP

● Multidimensional data - NumPy, netCDF, HDF5 (h5py)
● Relational model - SQL, OLAP (ROLAP)
● Tabular data - Pandas, PyTables
Recap
There is a certain overlap between the different data models/structures
● When would you suggest using SQL ROLAP instead of NumPy?
● When would you suggest using SQL tables instead of Pandas?
Are there criteria that we can use to decide between databases and Python libraries?
Pandas
Pandas
● High-level library for easy and fast analysis of tabular data in Python
● Built on top of NumPy (like most data science libraries in Python)
● It provides:
○ Data structures with labeled axes supporting data alignment
○ Time series functionality
○ Arithmetic operations and reductions (e.g. summing along an axis) that preserve the metadata (axis labels)
○ Flexible handling of missing data
○ Relational database-style operations (as in SQL), e.g. joins and group-by
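A minimal sketch of the features listed above, using made-up values and hypothetical labels (x, y, z, station, temp, rain, city). It is only an illustration, not part of the course notebook.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Data alignment: arithmetic matches labels, not positions.
# "x" has no partner in b, so the result is NaN for that label.
total = a + b

# Missing data handling: fill the gap explicitly.
filled = total.fillna(0.0)

# Time series functionality: a DataFrame indexed by dates.
df = pd.DataFrame(
    {"temp": [15.0, 17.5], "rain": [0.0, 3.2]},
    index=pd.date_range("2021-01-01", periods=2, freq="D"),
)

# Reductions keep the labels of the remaining axis (here the column names).
col_sums = df.sum(axis=0)

# Relational-style operation: join two tables on a shared key.
left = pd.DataFrame({"station": ["A", "B"], "temp": [15.0, 17.5]})
right = pd.DataFrame({"station": ["A", "B"], "city": ["Santander", "Bilbao"]})
joined = left.merge(right, on="station")
```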
Pandas
Three basic data structures
● Series
○ One-dimensional array object with an index
● DataFrame
○ Two-dimensional array object with index and columns
● Index
○ Holds the labels of the index and columns in Series and DataFrame
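The three structures can be sketched as follows. The labels and values are invented for illustration.

```python
import pandas as pd

# Series: one-dimensional values plus an index of labels.
s = pd.Series([3.1, 4.5, 2.2], index=["a", "b", "c"], name="value")

# DataFrame: two-dimensional, with a row index and column labels.
df = pd.DataFrame(
    {"value": [3.1, 4.5, 2.2], "flag": [True, False, True]},
    index=["a", "b", "c"],
)

# Both axes of a DataFrame are labelled by Index objects.
row_labels = df.index
col_labels = df.columns

# Selecting one column of a DataFrame gives back a Series.
column = df["value"]

# Label-based access to a single element.
element = df.loc["b", "value"]
```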
Pandas
Open the notebook and follow the examples
PyTables
PyTables
● Python interface to HDF5 files
○ Similar to h5py
● Focused on relational data (tables)
● NumPy deals with large datasets in-memory
● PyTables uses NumPy containers as in-memory buffers to push the I/O bandwidth towards the platform limits
● It does not support transactions, so be careful when writing data in parallel
● It provides different types of containers
PyTables
Group
● Hierarchically-addressable container for HDF5 nodes
● Same as HDF5 Groups
● Similar to operating system directories
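A minimal sketch of building a group hierarchy, assuming PyTables is installed. The file name and the group names (observations, station_a, station_b) are hypothetical.

```python
import os
import tempfile

import tables

# Write to a temporary location so the sketch leaves no files behind in the cwd.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")

with tables.open_file(path, mode="w", title="Group demo") as h5:
    # Groups are addressed like directories, starting at the root "/".
    obs = h5.create_group("/", "observations", title="Raw observations")

    # The parent can be given as a Group object or as a path string.
    h5.create_group(obs, "station_a")
    h5.create_group("/observations", "station_b")

    # Walk the hierarchy and collect the full path of every group.
    group_paths = [g._v_pathname for g in h5.walk_groups("/")]
```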

PyTables
Table
● Lets you deal with heterogeneous datasets
● Allows compression
● Enlargeable
● Supports nested types
● Good performance for reading/writing data
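A sketch of a compressed, enlargeable table with heterogeneous columns. The record layout (Reading) and the helper append_rows are hypothetical names chosen for this example.

```python
import os
import tempfile

import tables


# Hypothetical record layout: string, integer and float columns together.
# pos= fixes the column order explicitly.
class Reading(tables.IsDescription):
    station = tables.StringCol(8, pos=0)
    day = tables.Int32Col(pos=1)
    temp = tables.Float64Col(pos=2)


def append_rows(table, records):
    """Append (station, day, temp) tuples to the table and flush buffers."""
    row = table.row
    for station, day, temp in records:
        row["station"] = station
        row["day"] = day
        row["temp"] = temp
        row.append()
    table.flush()


path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")
filters = tables.Filters(complevel=5, complib="zlib")  # enable compression

with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "readings", Reading, filters=filters)

    # Tables are enlargeable: rows can be appended in several batches.
    append_rows(table, [("A", 0, 15.0), ("B", 0, 17.5)])
    append_rows(table, [("A", 1, 21.0)])

    n_rows = table.nrows
    temps = table.col("temp").tolist()
    stations = [s.decode() for s in table.col("station")]
```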

PyTables
Array
● Provides quick and dirty array handling
● No compression allowed
● Not enlargeable
● Can be used only with relatively small datasets (i.e. those that fit in memory)
● It provides the fastest I/O speed.
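A short sketch of storing and reading back a plain Array. The node name grid and the data are made up.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "array_demo.h5")
grid = np.arange(12, dtype=np.float64).reshape(3, 4)

with tables.open_file(path, mode="w") as h5:
    # The whole NumPy array is written in one go; its shape is then fixed.
    h5.create_array("/", "grid", grid, title="Small in-memory grid")

with tables.open_file(path, mode="r") as h5:
    data = h5.root.grid.read()      # read everything back as a NumPy array
    first_row = h5.root.grid[0, :]  # or slice directly on the node
```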

PyTables
CArray
● Provides compressed array support
● Not enlargeable
● Good speed when reading/writing
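A sketch of a CArray: the shape is declared up front and the data is written chunk by chunk with compression. The node name field and the compression settings are assumptions for the example.

```python
import os
import tempfile

import tables

path = os.path.join(tempfile.mkdtemp(), "carray_demo.h5")
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w") as h5:
    # Fixed shape (not enlargeable), stored in compressed chunks.
    carr = h5.create_carray(
        "/", "field",
        atom=tables.Float32Atom(),
        shape=(100, 100),
        filters=filters,
    )

    # Only the region that is assigned gets written; the rest keeps
    # the atom's default value (0 for numeric atoms).
    carr[10:20, :] = 1.0

    block_sum = float(carr[10:20, :].sum())
    total_sum = float(carr[:, :].sum())
    shape = carr.shape
```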

PyTables
EArray
● Most general array support
● Compressible and enlargeable
● It is pretty fast at extending and very good at reading
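A sketch of an EArray that grows along its first dimension. The node name series and the batch sizes are made up.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "earray_demo.h5")
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w") as h5:
    # A 0 in the shape marks the enlargeable dimension.
    earr = h5.create_earray(
        "/", "series",
        atom=tables.Float64Atom(),
        shape=(0, 3),
        filters=filters,
    )

    # Each append extends the array along the enlargeable dimension.
    earr.append(np.ones((2, 3)))
    earr.append(np.zeros((5, 3)))

    n_rows = earr.nrows
    final_shape = earr.shape
    first_two = earr[:2, :]
```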

PyTables
VLArray
● Supports collections of homogeneous data with a variable number of entries
● Compressible and enlargeable
● I/O is not very fast
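A sketch of a VLArray holding rows of different lengths. The node name events and the values are invented.

```python
import os
import tempfile

import tables

path = os.path.join(tempfile.mkdtemp(), "vlarray_demo.h5")

with tables.open_file(path, mode="w") as h5:
    # Every entry has the same element type but its own length.
    vl = h5.create_vlarray("/", "events", atom=tables.Int32Atom())
    vl.append([1, 2, 3])
    vl.append([4])
    vl.append([5, 6])

    lengths = [len(row) for row in vl]  # iterate over the stored rows
    second = vl[1].tolist()             # rows come back as NumPy arrays
```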

PyTables
Comparison with h5py
● h5py is an attempt to map the HDF5 feature set to NumPy as closely as possible
○ It also provides access to nearly all of the HDF5 C API
● PyTables builds up an additional abstraction layer on top of HDF5 and NumPy
○ Enhanced type system, engine for enabling complex queries, efficient computational kernel and advanced indexing capabilities
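As an illustration of the query engine mentioned above, here is a sketch of an in-kernel query on a table. The column names (day, temp) and the variable threshold are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

# A NumPy structured array can serve as both description and initial data.
records = np.array(
    [(0, 15.0), (1, 17.5), (2, 21.0), (3, 9.5)],
    dtype=[("day", "i4"), ("temp", "f8")],
)

with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "readings", obj=records)

    # The condition is a string evaluated by PyTables on the stored columns,
    # so only matching rows are returned as a NumPy structured array.
    hits = table.read_where(
        "(temp > threshold) & (day >= 1)",
        condvars={"threshold": 16.0},
    )
    hit_days = hits["day"].tolist()
```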
PyTables
Comparison with relational SQL databases
● No support for relationships (beyond the hierarchical one)
● No support for transactional features
● PyTables is more focused on speed and dealing with really large datasets
● PyTables can be best viewed as a teammate of a relational database
○ Remember OLAP (ROLAP) cubes and ETL processes
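One way to picture the teammate role is a small ETL step: extract rows from a relational database, transform them in Pandas, and load them into an HDF5 file through the PyTables-backed Pandas interface. This is only a sketch under assumptions: the readings table, its columns and the Kelvin conversion are invented, and an in-memory SQLite database stands in for a real one.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Extract: a throwaway SQLite database stands in for the relational source.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (station_id INTEGER, temp REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 15.0), (2, 17.5), (1, 21.0)],
)
con.commit()
df = pd.read_sql_query("SELECT station_id, temp FROM readings", con)
con.close()

# Transform: derive a new column (Celsius to Kelvin, purely illustrative).
df["temp_k"] = df["temp"] + 273.15

# Load: pandas writes HDF5 through PyTables; format="table" with
# data_columns makes the "temp" column queryable on disk.
path = os.path.join(tempfile.mkdtemp(), "etl_demo.h5")
df.to_hdf(path, key="readings", mode="w", format="table", data_columns=["temp"])

# Read back only the rows that satisfy a condition.
warm = pd.read_hdf(path, key="readings", where="temp > 16")
```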

