Stop Using Pandas to Read/Write Data — This Alternative is 7 Times Faster | by Dario Radečić | Oct 2021 | Towards Data Science
https://towardsdatascience.com/stop-using-pandas-to-read-write-data-this-alternative-is-7-times-faster-893301633475
I love Python’s Pandas library. It’s my preferred way to analyze, transform, and
preprocess data. But boy is it slow when it comes to reading and saving data files. It’s a
huge time waster, especially if your datasets measure gigabytes in size.
Picture this — you want to go over gigabytes of CSV data, stored either locally or on the
cloud. You’ll do the analysis with Pandas, even though you know it’s slow as hell when
reading CSV files. As a result, you spend most of the time waiting for reads and writes
to finish. There’s a better way.
It’s called PyArrow — an amazing Python binding for the Apache Arrow project. It introduces faster data read/write times and doesn’t otherwise interfere with your data analysis pipeline. It’s the best of both worlds, as you can still use Pandas for further calculations.
from datetime import datetime
import random
import string

import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow import csv
We’ll create a somewhat large dataset next. It will contain around 11M rows of date, float, and string values. The date information is completely made up, ranging from 2000 to 2021 in minute intervals. The other columns are completely random as well:
def gen_random_string(length: int = 32) -> str:
    """Generate a random string of uppercase letters and digits."""
    return ''.join(random.choices(
        string.ascii_uppercase + string.digits, k=length
    ))

dt = pd.date_range(
    start=datetime(2000, 1, 1),
    end=datetime(2021, 1, 1),
    freq='min'
)

np.random.seed(42)
df_size = len(dt)

df = pd.DataFrame({
    'date': dt,
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size),
    'str': [gen_random_string() for _ in range(df_size)]
})
That’s 11,046,241 rows of mixed data types, so the resulting CSV files will be quite
beefy.
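The row count follows directly from the chosen date range: a minute-frequency range from 2000-01-01 to 2021-01-01, with both endpoints included, spans 21 years containing 6 leap days. A quick sanity check (variable names here are just for illustration):

```python
from datetime import datetime

import pandas as pd

# minute-frequency range, both endpoints inclusive
dt = pd.date_range(start=datetime(2000, 1, 1), end=datetime(2021, 1, 1), freq='min')

# 21 years (including 6 leap days) * 1440 minutes/day, plus the inclusive endpoint
expected = (21 * 365 + 6) * 1440 + 1
print(len(dt))
assert len(dt) == expected
```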
https://towardsdatascience.com/stop-using-pandas-to-read-write-data-this-alternative-is-7-times-faster-893301633475 4/10
26/10/2021 13:02 Stop Using Pandas to Read/Write Data — This Alternative is 7 Times Faster | by Dario Radečić | Oct, 2021 | Towards Data S…
df.to_csv('csv_pandas.csv', index=False)
You can save yourself some disk space if you don’t care about write speed. Pandas’s to_csv() function has an optional compression argument. Here’s how to use it to write a gzip-compressed version of the dataset:

df.to_csv('csv_pandas.csv.gz', index=False, compression='gzip')
Finally, you can read both versions by using the read_csv() function:
df1 = pd.read_csv('csv_pandas.csv')
df2 = pd.read_csv('csv_pandas.csv.gz')
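Note that Pandas infers the compression from the .gz extension on both read and write, so the explicit compression argument is optional. A minimal round trip with a throwaway file (the file name and DataFrame here are just for illustration):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'tiny.csv.gz')
df_tiny = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})

df_tiny.to_csv(path, index=False)   # gzip inferred from the .gz extension
back = pd.read_csv(path)            # inferred on read as well
assert back.equals(df_tiny)
```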
Nothing new or interesting here, but I wanted to cover all bases. Let’s see how PyArrow
works next.
# work on a copy so the original DataFrame stays untouched
df_pa = df.copy()
# convert datetimes to Unix timestamps (nanoseconds -> seconds since epoch)
df_pa['date'] = df_pa['date'].values.astype(np.int64) // 10 ** 9
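The conversion is lossless down to one-second resolution, so the original datetimes can be recovered later with pd.to_datetime and unit='s'. A quick sketch on a two-element series:

```python
import numpy as np
import pandas as pd

stamps = pd.Series(pd.to_datetime(['2000-01-01 00:00', '2000-01-01 00:01']))
epoch = stamps.values.astype(np.int64) // 10 ** 9   # nanoseconds -> seconds
restored = pd.to_datetime(epoch, unit='s')          # back to datetimes
assert (restored == stamps.values).all()
```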
It’s still the same information after the change, just presented differently. You can now convert the DataFrame to a PyArrow Table. It’s a necessary step before you can dump the dataset to disk:
df_pa_table = pa.Table.from_pandas(df_pa)
The conversion takes 1.52 seconds on my machine (M1 MacBook Pro) and will be included in the comparison charts.
Saving the table to an uncompressed CSV file is a single function call:

csv.write_csv(df_pa_table, 'csv_pyarrow.csv')

For a compressed version, write through a compressed output stream:

with pa.CompressedOutputStream('csv_pyarrow.csv.gz', 'gzip') as out:
    csv.write_csv(df_pa_table, out)
You can read both the compressed and uncompressed datasets with the csv.read_csv() function:
df_pa_1 = csv.read_csv('csv_pyarrow.csv')
df_pa_2 = csv.read_csv('csv_pyarrow.csv.gz')
Both will be read in as a pyarrow.Table, so use the following command to convert them to a Pandas DataFrame:
df_pa_1 = df_pa_1.to_pandas()
And that’s all the code you need for today. Let’s see how the two libraries compare in performance.
The following chart shows the time needed to save the DataFrame with Pandas and
PyArrow — both uncompressed and compressed versions:
Image 3 — Pandas vs. PyArrow save time in seconds (Pandas CSV: 54.5; Pandas CSV.GZ: 182; PyArrow CSV: 7.76;
PyArrow CSV.GZ: 84) (image by author)
That’s around a 7X speed increase for uncompressed files and around a 2X increase for compressed files. I knew what was going to happen, and I’m still impressed.
Next, let’s compare the read times — how long does it take to read CSV files for Pandas
and PyArrow:
Image 4 — Pandas vs. PyArrow read time in seconds (Pandas CSV: 17.8; Pandas CSV.GZ: 28; PyArrow CSV: 2.44;
PyArrow CSV.GZ: 9.09) (image by author)
We get a similar performance boost — around 7X for the uncompressed datasets and
around 3X for the compressed ones. I’m lost for words. It’s an identical file format for
Pete’s sake.
Finally, let’s see how the file sizes differ. There shouldn’t be any differences between the libraries:
Image 5 — Pandas vs. PyArrow file size in GB (Pandas CSV: 2.01; Pandas CSV.GZ: 1.12; PyArrow CSV: 1.96; PyArrow
CSV.GZ: 1.13) (image by author)
There are slight differences in the uncompressed versions, but that’s likely because
we’re storing datetime objects with Pandas and integers with PyArrow. Nothing to
write home about, as expected.
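You can verify sizes like these yourself with os.path.getsize. A small self-contained sketch (file names and the helper are just for illustration; the gzip version should always come out smaller):

```python
import os
import tempfile

import pandas as pd

def size_mb(path: str) -> float:
    """Return the file size in megabytes."""
    return os.path.getsize(path) / 1024 ** 2

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, 'size_plain.csv')
packed = os.path.join(tmp, 'size_packed.csv.gz')

size_df = pd.DataFrame({'a': range(10_000)})
size_df.to_csv(plain, index=False)
size_df.to_csv(packed, index=False)   # gzip inferred from the extension

print(f'{size_mb(plain):.3f} MB vs {size_mb(packed):.3f} MB')
assert size_mb(packed) < size_mb(plain)
```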
Conclusion
To summarize, if your apps save/load data from disk frequently, it’s a wise decision to leave these operations to PyArrow. Heck, it’s 7 times faster for the identical file format. Imagine we introduced the Parquet file format to the mix. That’s what the next article will cover.
What are your thoughts on PyArrow? Have you experienced any pain points in daily
use? Please share your thoughts in the comment section below.
Loved the article? Become a Medium member to continue learning without limits. I’ll
receive a portion of your membership fee if you use the following link, with no extra cost to
you.
Stay connected
Sign up for my newsletter
Subscribe on YouTube
Connect on LinkedIn