Stop Using Pandas to Read/Write Data — This Alternative is 7 Times Faster

Read and write CSV datasets 7 times faster than with Pandas

Dario Radečić · Oct 2021 · 5 min read

Photo by Casey Horner on Unsplash

I love Python’s Pandas library. It’s my preferred way to analyze, transform, and
preprocess data. But boy is it slow when it comes to reading and saving data files. It’s a
huge time waster, especially if your datasets measure gigabytes in size.

Picture this — you want to go over gigabytes of CSV data, stored either locally or in the
cloud. You’ll do the analysis with Pandas, even though you know it’s slow as hell when
reading CSV files. As a result, you spend most of the time waiting for reads and writes
to finish. There’s a better way.

It’s called PyArrow — an amazing Python binding for the Apache Arrow project. It
introduces faster data read/write times and doesn’t otherwise interfere with your data
analysis pipeline. It’s the best of both worlds, as you can still use Pandas for further
calculations.

You can install PyArrow with either pip or Anaconda:

pip install pyarrow

conda install -c conda-forge pyarrow
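
Once installed, a quick import is enough to confirm the package is available (this just prints whatever version you happen to have):

import pyarrow as pa

print(pa.__version__)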

Looking for a video version? You’re in luck:

PyArrow vs. Pandas for managing CSV files - How to Speed Up Data Loa…

Let’s create a dummy dataset


Let’s start with the library imports. You’ll need quite a few today:

import random
import string
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv
from datetime import datetime


We’ll create a somewhat large dataset next. It will contain around 11M rows of date, float,
and string values. The date information is completely made up, ranging from 2000 to
2021 in minute intervals. The other columns are completely random as well:

def gen_random_string(length: int = 32) -> str:
    # Random string of uppercase letters and digits
    return ''.join(random.choices(
        string.ascii_uppercase + string.digits, k=length
    ))


# Minute-level date range from 2000 to 2021
dt = pd.date_range(
    start=datetime(2000, 1, 1),
    end=datetime(2021, 1, 1),
    freq='min'
)

np.random.seed(42)

df_size = len(dt)
print(f'Dataset length: {df_size}')

df = pd.DataFrame({
    'date': dt,
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size),
    'str1': [gen_random_string() for x in range(df_size)],
    'str2': [gen_random_string() for x in range(df_size)]
})

Here’s what it looks like:

Image 1 — Dummy dataset (image by author)

That’s 11,046,241 rows of mixed data types, so the resulting CSV files will be quite
beefy.
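
If you want to sanity-check the result on your end, a quick look at the shape and dtypes should match the numbers above (the random values themselves will differ):

print(df.shape)   # (11046241, 8)
print(df.dtypes)
print(df.head())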


Read/Write CSV files with Pandas


We’ll use Pandas as a baseline solution. It’s what you’d use if libraries like PyArrow
didn’t exist. This section contains only code — you’ll find comparisons and charts later
in the article.

Use the following code to save our dataset df to a CSV file:

df.to_csv('csv_pandas.csv', index=False)

You can save some disk space if you don’t mind a slower write. Pandas’ to_csv() function
has an optional compression argument. Here’s how to use it to save the dataset in
csv.gz format:

df.to_csv('csv_pandas.csv.gz', index=False, compression='gzip')

Finally, you can read both versions by using the read_csv() function:

df1 = pd.read_csv('csv_pandas.csv')

df2 = pd.read_csv('csv_pandas.csv.gz')
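
The charts later in the article report wall-clock times, but the measurement code isn’t shown. A minimal sketch with time.perf_counter() (not necessarily how the original numbers were produced) would look like this:

import time

# Rough wall-clock timing for the Pandas write and read (numbers will vary by machine)
start = time.perf_counter()
df.to_csv('csv_pandas.csv', index=False)
print(f'Pandas CSV write: {time.perf_counter() - start:.2f} s')

start = time.perf_counter()
df1 = pd.read_csv('csv_pandas.csv')
print(f'Pandas CSV read: {time.perf_counter() - start:.2f} s')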

Nothing new or interesting here, but I wanted to cover all bases. Let’s see how PyArrow
works next.

Read/Write CSV files with PyArrow


Here’s a thing you should know about PyArrow — it can’t handle datetime columns.
You’ll have to convert the date attribute to a timestamp. Here’s how:

df_pa = df.copy()

# Convert the datetime column to integer Unix timestamps (seconds)
df_pa['date'] = df_pa['date'].values.astype(np.int64) // 10 ** 9


And here’s what the dataset looks like after the change:

Image 2 — Dummy dataset after timestamp conversion (image by author)

It’s still the same information, just presented differently. You can now convert the
DataFrame to a PyArrow Table. It’s a necessary step before you can dump the dataset to
disk:

df_pa_table = pa.Table.from_pandas(df_pa)

The conversion takes 1.52 seconds on my machine (M1 MacBook Pro) and will be
included in the comparison charts.

Use PyArrow’s csv.write_csv() function to dump the dataset:

csv.write_csv(df_pa_table, 'csv_pyarrow.csv')

Adding compression requires a bit more code:

with pa.CompressedOutputStream('csv_pyarrow.csv.gz', 'gzip') as out:
    csv.write_csv(df_pa_table, out)

You can read both the compressed and uncompressed datasets with the csv.read_csv()
function:

df_pa_1 = csv.read_csv('csv_pyarrow.csv')

df_pa_2 = csv.read_csv('csv_pyarrow.csv.gz')
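
The same kind of timing sketch works for the PyArrow side (again, just an illustration of how you could measure it, with results that will vary by machine):

import time

# Rough wall-clock timing for the PyArrow write and read
start = time.perf_counter()
csv.write_csv(df_pa_table, 'csv_pyarrow.csv')
print(f'PyArrow CSV write: {time.perf_counter() - start:.2f} s')

start = time.perf_counter()
df_pa_1 = csv.read_csv('csv_pyarrow.csv')
print(f'PyArrow CSV read: {time.perf_counter() - start:.2f} s')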


Both will be read as a pyarrow.Table, so use the following command to convert them to a
Pandas DataFrame:

df_pa_1 = df_pa_1.to_pandas()
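
If you need the date column back as actual datetimes after the round trip, you can undo the earlier conversion (a small sketch, assuming the same seconds-based timestamps):

# Turn the Unix-second integers back into a datetime column
df_pa_1['date'] = pd.to_datetime(df_pa_1['date'], unit='s')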

And that’s all you should know for today. Let’s see how these compare in performance.

Pandas vs. PyArrow — Which one should you use?


If your data is big, use PyArrow every time. It’s that simple. Here’s why.

The following chart shows the time needed to save the DataFrame with Pandas and
PyArrow — both uncompressed and compressed versions:

Image 3 — Pandas vs. PyArrow save time in seconds (Pandas CSV: 54.5; Pandas CSV.GZ: 182; PyArrow CSV: 7.76;
PyArrow CSV.GZ: 84) (image by author)

That’s around a 7X speed increase for uncompressed files and around 2X for compressed
files. I knew what was going to happen and I’m still impressed.

Next, let’s compare the read times — how long does it take to read CSV files for Pandas
and PyArrow:


Image 4 — Pandas vs. PyArrow read time in seconds (Pandas CSV: 17.8; Pandas CSV.GZ: 28; PyArrow CSV: 2.44;
PyArrow CSV.GZ: 9.09) (image by author)

We get a similar performance boost — around 7X for the uncompressed datasets and
around 3X for the compressed ones. I’m lost for words. It’s an identical file format for
Pete’s sake.

Finally, let’s see how the file sizes differ. There shouldn’t be any differences between the
libraries:

Image 5 — Pandas vs. PyArrow file size in GB (Pandas CSV: 2.01; Pandas CSV.GZ: 1.12; PyArrow CSV: 1.96; PyArrow
CSV.GZ: 1.13) (image by author)

There are slight differences in the uncompressed versions, but that’s likely because
we’re storing datetime objects with Pandas and integers with PyArrow. Nothing to
write home about, as expected.
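
If you want to verify the sizes on disk yourself, a short sketch with os.path.getsize() (using the filenames from earlier) does the trick:

import os

# Print the size of each file in gigabytes
for path in ['csv_pandas.csv', 'csv_pandas.csv.gz', 'csv_pyarrow.csv', 'csv_pyarrow.csv.gz']:
    size_gb = os.path.getsize(path) / 1024 ** 3
    print(f'{path}: {size_gb:.2f} GB')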

Conclusion

To summarize, if your apps save/load data from disk frequently, then it’s a wise
decision to leave these operations to PyArrow. Heck, it’s 7 times faster for the identical
file format. Imagine what would happen if we introduced the Parquet file format to the
mix. That’s what the next article will cover.

What are your thoughts on PyArrow? Have you experienced any pain points in daily
use? Please share your thoughts in the comment section below.

Loved the article? Become a Medium member to continue learning without limits. I’ll
receive a portion of your membership fee if you use the following link, with no extra cost to
you.

Join Medium with my referral link - Dario Radečić (medium.com)

Stay connected
Sign up for my newsletter

Subscribe on YouTube

Connect on LinkedIn

