Stop Using Pandas to Read/Write Data — This Alternative is 7 Times Faster | by Dario Radečić | Oct 2021 | Towards Data Science
https://towardsdatascience.com/stop-using-pandas-to-read-write-data-this-alternative-is-7-times-faster-893301633475
I love Python’s Pandas library. It’s my preferred way to analyze, transform, and
preprocess data. But boy is it slow when it comes to reading and saving data files. It’s a
huge time waster, especially if your datasets measure gigabytes in size.
Picture this — you want to go over gigabytes of CSV data, stored either locally or on the
cloud. You’ll do the analysis with Pandas, even though you know it’s slow as hell when
reading CSV files. As a result, you spend most of the time waiting for reads and writes
to finish. There’s a better way.
It’s called PyArrow — an amazing Python binding for the Apache Arrow project. It introduces faster data read/write times and doesn’t otherwise interfere with your data analysis pipeline. It’s the best of both worlds, as you can still use Pandas for further calculations.
from datetime import datetime
import random
import string

import numpy as np
import pandas as pd
import pyarrow as pa
from pyarrow import csv
We’ll create a somewhat large dataset next. It will contain around 11M rows of date, float, and string values. The date information is completely made up, ranging from 2000 to 2021 in minute intervals. The other columns are completely random as well:
def gen_random_string(length: int = 32) -> str:
    """Generate a random string of uppercase letters and digits."""
    return ''.join(random.choices(
        string.ascii_uppercase + string.digits, k=length
    ))

dt = pd.date_range(
    start=datetime(2000, 1, 1),
    end=datetime(2021, 1, 1),
    freq='min'
)

np.random.seed(42)
df_size = len(dt)

df = pd.DataFrame({
    'date': dt,
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size),
    'str': [gen_random_string() for _ in range(df_size)]
})
That’s 11,046,241 rows of mixed data types, so the resulting CSV files will be quite
beefy.
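The row count follows directly from the chosen date range: a minute-frequency range from 2000-01-01 to 2021-01-01, with both endpoints included, spans 21 years containing 6 leap days. A quick sanity check (variable names here are just for illustration):

```python
from datetime import datetime

import pandas as pd

# minute-frequency range, both endpoints inclusive
dt = pd.date_range(start=datetime(2000, 1, 1), end=datetime(2021, 1, 1), freq='min')

# 21 years (including 6 leap days) * 1440 minutes/day, plus the inclusive endpoint
expected = (21 * 365 + 6) * 1440 + 1
print(len(dt))
assert len(dt) == expected
```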
https://towardsdatascience.com/stop-using-pandas-to-read-write-data-this-alternative-is-7-times-faster-893301633475 4/10
26/10/2021 13:02 Stop Using Pandas to Read/Write Data — This Alternative is 7 Times Faster | by Dario Radečić | Oct, 2021 | Towards Data S…
df.to_csv('csv_pandas.csv', index=False)
You can save yourself some disk space if you don’t care about write speed. Pandas’s to_csv() function has an optional compression argument. Here’s how to use it to write a gzip-compressed version of the dataset:

df.to_csv('csv_pandas.csv.gz', index=False, compression='gzip')
Finally, you can read both versions by using the read_csv() function:
df1 = pd.read_csv('csv_pandas.csv')
df2 = pd.read_csv('csv_pandas.csv.gz')
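Note that Pandas infers the compression from the .gz extension on both read and write, so the explicit compression argument is optional. A minimal round trip with a throwaway file (the file name and DataFrame here are just for illustration):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'tiny.csv.gz')
df_tiny = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})

df_tiny.to_csv(path, index=False)   # gzip inferred from the .gz extension
back = pd.read_csv(path)            # inferred on read as well
assert back.equals(df_tiny)
```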
Nothing new or interesting here, but I wanted to cover all bases. Let’s see how PyArrow
works next.
# work on a copy so the original DataFrame stays untouched
df_pa = df.copy()
# convert datetimes to Unix timestamps (nanoseconds -> seconds since epoch)
df_pa['date'] = df_pa['date'].values.astype(np.int64) // 10 ** 9
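The conversion is lossless down to one-second resolution, so the original datetimes can be recovered later with pd.to_datetime and unit='s'. A quick sketch on a two-element series:

```python
import numpy as np
import pandas as pd

stamps = pd.Series(pd.to_datetime(['2000-01-01 00:00', '2000-01-01 00:01']))
epoch = stamps.values.astype(np.int64) // 10 ** 9   # nanoseconds -> seconds
restored = pd.to_datetime(epoch, unit='s')          # back to datetimes
assert (restored == stamps.values).all()
```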
It’s still the same information after the change, just presented differently. You can now convert the DataFrame to a PyArrow Table. It’s a necessary step before you can dump the dataset to disk:
df_pa_table = pa.Table.from_pandas(df_pa)
The conversion takes 1.52 seconds on my machine (M1 MacBook Pro) and will be included in the comparison charts.
Saving the table to an uncompressed CSV file is a single function call:

csv.write_csv(df_pa_table, 'csv_pyarrow.csv')

For a compressed version, write through a compressed output stream:

with pa.CompressedOutputStream('csv_pyarrow.csv.gz', 'gzip') as out:
    csv.write_csv(df_pa_table, out)
You can read both the compressed and uncompressed datasets with the csv.read_csv() function:
df_pa_1 = csv.read_csv('csv_pyarrow.csv')
df_pa_2 = csv.read_csv('csv_pyarrow.csv.gz')
Both will be read in as a pyarrow.Table, so use the following command to convert them to a Pandas DataFrame:
df_pa_1 = df_pa_1.to_pandas()
And that’s all the code you need for today. Let’s see how the two libraries compare in performance.
The following chart shows the time needed to save the DataFrame with Pandas and
PyArrow — both uncompressed and compressed versions:
Image 3 — Pandas vs. PyArrow save time in seconds (Pandas CSV: 54.5; Pandas CSV.GZ: 182; PyArrow CSV: 7.76;
PyArrow CSV.GZ: 84) (image by author)
That’s around a 7X speed increase for uncompressed files and around a 2X increase for compressed files. I knew what was going to happen, and I’m still impressed.
Next, let’s compare the read times — how long does it take to read CSV files for Pandas
and PyArrow:
Image 4 — Pandas vs. PyArrow read time in seconds (Pandas CSV: 17.8; Pandas CSV.GZ: 28; PyArrow CSV: 2.44;
PyArrow CSV.GZ: 9.09) (image by author)
We get a similar performance boost — around 7X for the uncompressed datasets and
around 3X for the compressed ones. I’m lost for words. It’s an identical file format for
Pete’s sake.
Finally, let’s see how the file sizes differ. There shouldn’t be any differences between the libraries:
Image 5 — Pandas vs. PyArrow file size in GB (Pandas CSV: 2.01; Pandas CSV.GZ: 1.12; PyArrow CSV: 1.96; PyArrow
CSV.GZ: 1.13) (image by author)
There are slight differences in the uncompressed versions, but that’s likely because
we’re storing datetime objects with Pandas and integers with PyArrow. Nothing to
write home about, as expected.
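You can verify sizes like these yourself with os.path.getsize. A small self-contained sketch (file names and the helper are just for illustration; the gzip version should always come out smaller):

```python
import os
import tempfile

import pandas as pd

def size_mb(path: str) -> float:
    """Return the file size in megabytes."""
    return os.path.getsize(path) / 1024 ** 2

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, 'size_plain.csv')
packed = os.path.join(tmp, 'size_packed.csv.gz')

size_df = pd.DataFrame({'a': range(10_000)})
size_df.to_csv(plain, index=False)
size_df.to_csv(packed, index=False)   # gzip inferred from the extension

print(f'{size_mb(plain):.3f} MB vs {size_mb(packed):.3f} MB')
assert size_mb(packed) < size_mb(plain)
```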
Conclusion
To summarize, if your apps save/load data from disk frequently, it’s a wise decision to leave these operations to PyArrow. Heck, it’s 7 times faster for the identical file format. Imagine we introduced the Parquet file format to the mix. That’s what the next article will cover.
What are your thoughts on PyArrow? Have you experienced any pain points in daily
use? Please share your thoughts in the comment section below.
Loved the article? Become a Medium member to continue learning without limits. I’ll
receive a portion of your membership fee if you use the following link, with no extra cost to
you.
Stay connected
Sign up for my newsletter
Subscribe on YouTube
Connect on LinkedIn