
Pandas and PyTables

Ezequiel Cimadevilla Álvarez


ezequiel.cimadevilla@unican.es

Santander Meteorology Group


ETSI Caminos, Department of Applied Mathematics and Computer Sciences
University of Cantabria
Avenida de los Castros s/n
39005 Santander, Spain
http://www.meteo.unican.es

Máster Data Science/Ciencia de Datos - 2020/2021


Outline
Recap

Pandas

● Series
● DataFrame
● Index

PyTables

Exercises
Recap
Following the contents of the course

● Python - Basics and OOP
● Multidimensional data - NumPy, netCDF, HDF5 (h5py)
● Relational model - SQL, OLAP (ROLAP)
● Tabular data - Pandas, PyTables
Recap
There is some overlap between the different data models/structures

● When would you suggest using SQL ROLAP instead of NumPy?
● When would you suggest using SQL tables instead of Pandas?

Are there criteria that we can use to decide between databases and Python
libraries?
Pandas
Pandas
● High-level library for easy and fast analysis of tabular data in Python
● Built on top of NumPy (like most data science libraries in Python)
● It provides:
○ Data structures with labeled axes supporting data alignment
○ Time series functionality
○ Arithmetic operations and reductions (e.g. summing along an axis) that preserve the
metadata (axis labels)
○ Flexible handling of missing data
○ Relational database-style operations (as in SQL: joins, group-by)
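
The features listed above can be illustrated with a minimal sketch. The station codes, city names and values are made up for illustration and are not taken from the course notebook.

```python
import numpy as np
import pandas as pd

# Label alignment: values are matched by index label, not by position.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])
aligned = a + b                     # labels present in only one operand give NaN
filled = a.add(b, fill_value=0.0)   # treat a missing label as 0 instead

# Missing data: reductions skip NaN by default.
total = aligned.sum()

# Reductions keep the axis labels: the result is indexed by column name.
df = pd.DataFrame({"temp": [14.0, 16.0, 18.0], "rain": [0.0, 2.5, 1.5]})
column_means = df.mean(axis=0)

# Time series: resample six daily values into two-day sums.
days = pd.date_range("2021-01-01", periods=6, freq="D")
daily = pd.Series(np.arange(6.0), index=days)
two_day = daily.resample("2D").sum()

# SQL-like operations: a join followed by a group-by aggregation.
stations = pd.DataFrame({"station": ["SDR", "BIO"],
                         "city": ["Santander", "Bilbao"]})
obs = pd.DataFrame({"station": ["SDR", "SDR", "BIO"],
                    "temp": [14.0, 16.0, 18.0]})
joined = obs.merge(stations, on="station", how="left")
mean_by_city = joined.groupby("city")["temp"].mean()
```

The merge plays the role of a SQL `JOIN` and the `groupby` that of `GROUP BY`; both results stay labeled, so `mean_by_city["Santander"]` can be read by name.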
Pandas
Three basic data structures

● Series
○ One-dimensional array object with an index
● DataFrame
○ Two-dimensional array object with an index (rows) and columns
● Index
○ Used to label the rows and columns of Series and DataFrame
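
A short sketch of how the three structures relate, assuming made-up weekday labels and measurements:

```python
import pandas as pd

# An Index holds the labels; here it is shared by a Series and a DataFrame.
labels = pd.Index(["mon", "tue", "wed"], name="day")

# Series: one-dimensional values plus an index.
temperature = pd.Series([14.5, 16.0, 15.2], index=labels, name="temperature")

# DataFrame: two-dimensional, with an index for rows and another for columns.
frame = pd.DataFrame(
    {"temperature": [14.5, 16.0, 15.2], "humidity": [80, 72, 75]},
    index=labels,
)

# Selecting one column yields a Series that shares the row index.
column = frame["temperature"]

# Label-based access to a single cell: row "tue", column "humidity".
value = frame.loc["tue", "humidity"]
```

Both `frame.index` and `frame.columns` are `Index` objects, which is why rows and columns can be aligned and selected by label in the same way.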
Pandas
Open the notebook and follow the examples
PyTables
PyTables
● Python interface to HDF5 files
○ Similar to h5py
● Focused on tabular data (tables)
● NumPy deals with large datasets in-memory
● PyTables uses NumPy containers as in-memory buffers to push the I/O
bandwidth towards the platform limits
● It doesn’t support transactions, so be careful when writing data in
parallel
● It provides different types of containers
PyTables
Group

● Hierarchically-addressable container for HDF5 nodes

● Same as HDF5 Groups

● Similar to operating system directories
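
A minimal sketch of the directory analogy, assuming a temporary file and hypothetical group names (`stations`, `santander`):

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "groups.h5")  # hypothetical file name
    with tb.open_file(path, mode="w", title="Group demo") as h5:
        # Groups nest like directories under the root group "/".
        stations = h5.create_group("/", "stations", title="All stations")
        santander = h5.create_group(stations, "santander")
        h5.create_array(santander, "temps", np.array([14.5, 16.0, 15.2]))

        # Nodes can be addressed by path, like files in a directory tree...
        by_path = h5.get_node("/stations/santander/temps").read()
        # ...or through attribute access ("natural naming").
        by_name = h5.root.stations.santander.temps.read()
        # Walk the whole hierarchy and collect every node path.
        paths = [node._v_pathname for node in h5.walk_nodes("/")]
```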


PyTables
Table

● Lets you deal with heterogeneous datasets

● Allows compression

● Enlargeable

● Supports nested types

● Good performance for reading/writing data
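
A sketch of a compressed, enlargeable table with heterogeneous columns. The `Observation` description, its column names and the zlib settings are assumptions chosen for illustration.

```python
import os
import tempfile

import tables as tb


class Observation(tb.IsDescription):
    """Hypothetical row layout mixing integer, float and string columns."""
    day = tb.Int32Col(pos=0)
    temperature = tb.Float64Col(pos=1)
    station = tb.StringCol(8, pos=2)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table(
            "/", "observations", Observation,
            filters=tb.Filters(complevel=5, complib="zlib"),  # compression
        )

        # Fill rows one by one through the table's row buffer.
        row = table.row
        for day, temp in enumerate([14.5, 16.0, 15.2]):
            row["day"] = day
            row["temperature"] = temp
            row["station"] = "SDR"
            row.append()
        table.flush()
        rows_before = table.nrows

        # Enlargeable: more rows can be appended later.
        table.append([(3, 17.1, "BIO"), (4, 13.9, "BIO")])
        table.flush()
        rows_after = table.nrows

        # The condition string is evaluated inside PyTables (in-kernel query).
        warm_days = [int(r["day"]) for r in table.where("temperature > 15.0")]
        column_names = table.colnames
```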


PyTables
Array

● Provides quick and dirty array handling

● No compression allowed

● Not enlargeable

● Can be used only with relatively small datasets (i.e. those that fit in memory)

● It provides the fastest I/O speed.
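
As a rough illustration, an `Array` stores an existing NumPy array in one call and reads it back whole or by slice. The file and node names below are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

grid = np.arange(12, dtype=np.float64).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "array.h5")  # hypothetical file name

    # The shape is fixed at creation time from the in-memory data.
    with tb.open_file(path, mode="w") as h5:
        h5.create_array("/", "grid", grid, title="Small in-memory grid")

    with tb.open_file(path, mode="r") as h5:
        node = h5.root.grid
        stored_shape = node.shape
        whole = node.read()   # read everything into a NumPy array
        second_row = node[1]  # NumPy-style slicing reads only that part
```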


PyTables
CArray

● Provides compressed array support

● Not enlargeable

● Good speed when reading/writing
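
A sketch of a `CArray`: the shape is declared up front, data is stored in compressed chunks, and slices can be written afterwards. The shape and zlib settings are arbitrary choices for the example.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "carray.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5:
        field = h5.create_carray(
            "/", "field",
            atom=tb.Float64Atom(),   # element type
            shape=(1000, 100),       # fixed shape, cannot grow later
            filters=tb.Filters(complevel=5, complib="zlib"),
        )
        # Only the slices that are written get filled with data.
        field[0:10, :] = np.ones((10, 100))

        written_sum = float(field[0:10, :].sum())
        # Unwritten elements read back as the atom's default value (0.0).
        untouched_sum = float(field[10:, :].sum())
        level = field.filters.complevel
        shape = field.shape
```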


PyTables
EArray

● Most general array support

● Compressible and enlargeable

● It is pretty fast at extending and very good at reading
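
A sketch of an `EArray` growing along its first dimension. The dimension declared with length 0 is the enlargeable one; the names and sizes are assumptions.

```python
import os
import tempfile

import numpy as np
import tables as tb

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "earray.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5:
        series = h5.create_earray(
            "/", "series",
            atom=tb.Float64Atom(),
            shape=(0, 3),  # 0 marks the dimension that can be extended
            filters=tb.Filters(complevel=5, complib="zlib"),
        )
        # Each append adds rows along the enlargeable dimension.
        series.append(np.ones((2, 3)))
        series.append(np.full((4, 3), 2.0))

        final_shape = series.shape
        last_row = series[-1].tolist()
        total = float(series[:].sum())
```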


PyTables
VLArray

● Supports collections of homogeneous data with a variable number of entries

● Compressible and enlargeable

● I/O is not very fast
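
A sketch of a `VLArray` holding rows of different lengths that share one element type. The values are made up.

```python
import os
import tempfile

import tables as tb

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vlarray.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5:
        ragged = h5.create_vlarray(
            "/", "ragged",
            atom=tb.Int32Atom(),  # every entry is an int32
            filters=tb.Filters(complevel=5, complib="zlib"),
        )
        # Each append adds one row; rows may differ in length.
        ragged.append([1, 2, 3])
        ragged.append([4])
        ragged.append([5, 6])

        n_rows = ragged.nrows
        lengths = [len(r) for r in ragged]
        first = ragged[0].tolist()
```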


PyTables
Comparison with h5py

● h5py is an attempt to map the HDF5 feature set to NumPy as closely as possible

○ It also provides access to nearly all of the HDF5 C API

● PyTables builds up an additional abstraction layer on top of HDF5 and NumPy

○ Enhanced type system, engine for enabling complex queries, efficient
computational kernel and advanced indexing capabilities
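
The query engine and indexing mentioned above can be sketched as follows. The data is synthetic, and the NumPy mask at the end is the kind of filtering one would typically do by hand after loading the data with a lower-level interface.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Synthetic records: 100 days, temperatures cycling through 10, 12, ..., 28.
days = np.arange(100, dtype=np.int32)
records = np.zeros(100, dtype=[("day", "i4"), ("temperature", "f8")])
records["day"] = days
records["temperature"] = 10.0 + (days % 10) * 2.0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "query.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5:
        # Create the table directly from the structured array.
        table = h5.create_table("/", "obs", obj=records)
        table.flush()

        # Index one column so that queries on it can avoid a full scan.
        table.cols.day.create_index()
        indexed = table.cols.day.is_indexed

        # The condition is a string evaluated by PyTables, not by Python.
        hits = table.read_where("(temperature > 20.0) & (day < 50)")

# The same selection done in memory with a NumPy boolean mask.
expected = records[(records["temperature"] > 20.0) & (records["day"] < 50)]
```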
PyTables
Comparison with relational SQL databases

● No support for relationships (beyond the hierarchical one)

● No support for transactional features

● PyTables is more focused on speed and dealing with really large datasets

● PyTables can be best viewed as a teammate of a relational database

○ Remember OLAP (ROLAP) cubes and ETL processes
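
One way to read the "teammate" idea is a small ETL step: relationships stay in the SQL database, and a flattened (joined) result is exported to PyTables for fast analysis. The sketch below assumes an in-memory SQLite database with a hypothetical two-table schema.

```python
import os
import sqlite3
import tempfile

import numpy as np
import tables as tb

# Extract: the relational side keeps the station/observation relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE station (id INTEGER PRIMARY KEY, latitude REAL);
    CREATE TABLE observation (
        station_id INTEGER REFERENCES station(id),
        day INTEGER,
        temperature REAL
    );
    INSERT INTO station VALUES (1, 43.46), (2, 43.26);
    INSERT INTO observation VALUES (1, 0, 14.5), (1, 1, 16.0), (2, 0, 18.0);
""")
rows = conn.execute("""
    SELECT o.station_id, s.latitude, o.day, o.temperature
    FROM observation AS o
    JOIN station AS s ON s.id = o.station_id
    ORDER BY o.station_id, o.day
""").fetchall()
conn.close()

# Transform: turn the joined rows into a NumPy structured array.
dtype = np.dtype([("station_id", "i4"), ("latitude", "f8"),
                  ("day", "i4"), ("temperature", "f8")])
facts = np.array(rows, dtype=dtype)

# Load: store the denormalized table in HDF5 and query it there.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "facts.h5")  # hypothetical file name
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "facts", obj=facts)
        table.flush()
        stored_rows = table.nrows
        north = table.read_where("latitude > 43.4")
        north_mean = float(north["temperature"].mean())
```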


