

Pandas 2.0: A Game-Changer for Data Scientists?
The Top 5 Features for Efficient Data Manipulation

Miriam Santos
Published in Towards Data Science
7 min read · Jun 27


This April, pandas 2.0.0 was officially launched, making huge waves across the data science community.
Photo by Yancy Min on Unsplash.

Due to its extensive functionality and versatility, pandas has secured a place in
every data scientist’s heart.
From data input/output to data cleaning and transformation, it’s nearly impossible to
think about data manipulation without import pandas as pd , right?

Now, bear with me: with such a buzz around LLMs over the past months, I have
somehow let slide the fact that pandas has just undergone a major release! Yep,
pandas 2.0 is out and came with guns blazing!

Although I wasn’t aware of all the hype, the Data-Centric AI Community promptly
came to the rescue:

The 2.0 release seems to have created quite an impact in the data science community, with a lot of users
praising the modifications added in the new version. Screenshot by Author.

Fun fact: Were you aware this release was in the making for an astonishing 3 years? Now
that’s what I call “commitment to the community”!

So what does pandas 2.0 bring to the table? Let’s dive right into it!

1. Performance, Speed, and Memory-Efficiency


As we all know, pandas was built on top of numpy, which was never designed as a
backend for dataframe libraries. For that reason, one of the major limitations of
pandas was handling in-memory processing for larger datasets.

In this release, the big change comes from the introduction of the Apache Arrow
backend for pandas data.

Essentially, Arrow is a standardized in-memory columnar data format with available
libraries for several programming languages (C, C++, R, Python, among others). For
Python there is PyArrow, which is built on the C++ implementation of Arrow and is
therefore fast!

So, long story short, PyArrow addresses the memory constraints of the 1.X versions
and allows us to conduct faster, more memory-efficient data operations, especially
for larger datasets.

Here’s a comparison between reading the data without and with the pyarrow
backend, using the Hacker News dataset, which is around 650 MB (License CC BY-
NC-SA 4.0):
Comparison of read_csv(): Using the pyarrow backend is 35X faster. Snippet by Author.
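The original snippet is an image; below is a minimal sketch of the comparison it illustrates, assuming the dataset is saved locally as hacker_news.csv (the file name and timing code are mine, not from the original):

```python
import time
import pandas as pd

# Default (numpy-backed) read
start = time.perf_counter()
df_numpy = pd.read_csv("hacker_news.csv")
print(f"numpy backend:   {time.perf_counter() - start:.2f}s")

# PyArrow parser engine + Arrow-backed dtypes (both new options in pandas 2.0)
start = time.perf_counter()
df_arrow = pd.read_csv("hacker_news.csv", engine="pyarrow", dtype_backend="pyarrow")
print(f"pyarrow backend: {time.perf_counter() - start:.2f}s")
```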

As you can see, using the new backend makes reading the data nearly 35x faster.
Other aspects worth pointing out:

Without the pyarrow backend, each column/feature is stored as its own unique
data type: numeric features are stored as int64 or float64 while string values
are stored as objects;

With pyarrow, all features use the Arrow dtypes: note the [pyarrow] annotation
and the different types of data: int64, float64, string, timestamp, and double:
df.info(): Investigating the dtypes of each DataFrame. Snippet by Author.
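To check this yourself, here is a quick inspection (df_numpy and df_arrow are the frames from the sketch above; the exact dtypes depend on the dataset):

```python
# Compare how the two backends store the same columns
print(df_numpy.dtypes)  # e.g. int64, float64, object
print(df_arrow.dtypes)  # e.g. int64[pyarrow], double[pyarrow], string[pyarrow], timestamp[...][pyarrow]

df_numpy.info(memory_usage="deep")  # object (string) columns dominate the memory footprint
df_arrow.info()                     # Arrow-backed strings are typically much more compact
```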

2. Arrow Data Types and Numpy Indices


Beyond reading data, which is the simplest case, you can expect additional
improvements for a series of other operations, especially those involving string
operations, since pyarrow ’s implementation of the string datatype is quite efficient:

Comparing string operations: showcasing the efficiency of arrow’s implementation. Snippet by Author.
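As an illustration, here is a rough sketch of that kind of comparison; the title column is an assumption about the Hacker News data, and the frames come from the earlier read example:

```python
import time

def time_str_ops(series):
    """Time a couple of chained string operations on a Series."""
    start = time.perf_counter()
    series.str.lower().str.contains("python")
    return time.perf_counter() - start

print(f"object dtype:          {time_str_ops(df_numpy['title']):.3f}s")
print(f"string[pyarrow] dtype: {time_str_ops(df_arrow['title']):.3f}s")
```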
In fact, Arrow has more (and better support for) data types than numpy , which are
needed outside the scientific (numerical) scope: dates and times, duration, binary,
decimals, lists, and maps. Skimming through the equivalence between pyarrow-
backed and numpy data types might actually be a good exercise in case you want to
learn how to leverage them.

It is also now possible to hold more numpy numeric types in indices.

The traditional int64, uint64, and float64 have opened up space for Index values of
all numpy numeric dtypes, so we can, for instance, specify their 32-bit versions
instead:

Leveraging 32-bit numpy indices, making the code more memory-efficient. Snippet by Author.
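A small sketch of the idea (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Before 2.0, Index was restricted to int64/uint64/float64;
# now any numpy numeric dtype can back it.
idx = pd.Index(np.arange(1_000_000), dtype="int32")
s = pd.Series(np.random.rand(1_000_000).astype("float32"), index=idx)

print(s.index.dtype)              # int32
print(s.memory_usage(deep=True))  # roughly half of the int64/float64 equivalent
```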
This is a welcome change since indices are one of the most used functionalities in
pandas , allowing users to filter, join, and shuffle data, among other data operations.
Essentially, the lighter the Index is, the more efficient those processes will be!

3. Easier Handling of Missing Values


Being built on top of numpy made it hard for pandas to handle missing values in a
hassle-free, flexible way, since numpy does not support null values for some data
types.

For instance, integers are automatically converted to floats, which is not ideal:
Missing Values: Conversion to float. Snippet by Author.

Note how points automatically changes from int64 to float64 after the
introduction of a single None value.
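A minimal sketch of what the snippet shows (the points column and its values are my own illustration):

```python
import pandas as pd

df = pd.DataFrame({"points": [1, 2, 3]})
print(df["points"].dtype)          # int64

df_missing = pd.DataFrame({"points": [1, None, 3]})
print(df_missing["points"].dtype)  # float64 -- numpy integers cannot represent NaN
```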

There is nothing worse for a data flow than wrong data types, especially within a
data-centric AI paradigm.

Erroneous data types directly impact data preparation decisions, cause
incompatibilities between different chunks of data, and, even when they pass
silently, they might compromise certain operations that return nonsensical results.

As an example, at the Data-Centric AI Community, we're currently working on a
project around synthetic data for data privacy. One of the features, NOC (number of
children), has missing values and is therefore automatically converted to float
when the data is loaded. Then, when passing the data into a generative model as a
float, we might get output values as decimals such as 2.5 — unless you're a
mathematician with 2 kids, a newborn, and a weird sense of humor, having 2.5
children is not OK.

In pandas 2.0, we can leverage dtype_backend='numpy_nullable', where missing values
are accounted for without any dtype changes, so we can keep our original data types
(a nullable Int64 in this case, rather than a cast to float64):
Leveraging ‘numpy_nullable’, pandas 2.0 can handle missing values without changing the original data types.
Snippet by Author.
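A sketch of how this looks in practice, using a tiny in-memory CSV instead of a real file (the column names are made up):

```python
import io
import pandas as pd

csv = io.StringIO("name,points\nAna,1\nBob,\nCai,3")  # Bob's points are missing

df_default = pd.read_csv(csv)
print(df_default["points"].dtype)    # float64 -- the missing value forces a cast

csv.seek(0)
df_nullable = pd.read_csv(csv, dtype_backend="numpy_nullable")
print(df_nullable["points"].dtype)   # Int64 -- nullable integer; the gap shows up as <NA>
```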

It might seem like a subtle change, but under the hood it means that pandas can now
natively use Arrow's implementation of dealing with missing values. This makes
operations much more efficient, since pandas doesn't have to implement its own
version for handling null values for each data type.

4. Copy-On-Write Optimization
Pandas 2.0 also adds a new lazy copy mechanism that defers copying DataFrames
and Series objects until they are modified.

This means that certain methods will return views rather than copies when copy-
on-write is enabled, which improves memory efficiency by minimizing
unnecessary data duplication.

It also means you need to be extra careful when using chained assignments.

If the copy-on-write mode is enabled, chained assignments will not work because
they point to a temporary object that is the result of an indexing operation (which
under copy-on-write behaves as a copy).

When copy_on_write is disabled, operations like slicing may change the original df
if the new dataframe is changed:


Disabled Copy-on-Write: original dataframe is changed in chained assignments. Snippet by Author.

When copy_on_write is enabled, a copy is created at assignment, and therefore the
original dataframe is never changed. Pandas 2.0 will raise a ChainedAssignmentError
in these situations to avoid silent bugs:


Enabled Copy-on-Write: original dataframe is not changed in chained assignments. Snippet by Author.
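A minimal sketch of both behaviors (the column names and values are my own illustration):

```python
import pandas as pd

# Copy-on-write disabled: selecting a column returns a view,
# so modifying it silently modifies the original DataFrame.
pd.set_option("mode.copy_on_write", False)
df = pd.DataFrame({"score": [1, 2, 3]})
scores = df["score"]
scores.iloc[0] = 100
print(df["score"].iloc[0])   # 100 -- the original df was changed

# Copy-on-write enabled: a copy is made on write, the original stays intact,
# and chained assignments raise a ChainedAssignmentError instead of failing silently.
pd.set_option("mode.copy_on_write", True)
df = pd.DataFrame({"score": [1, 2, 3]})
scores = df["score"]
scores.iloc[0] = 100
print(df["score"].iloc[0])   # 1 -- df is untouched

df[df["score"] > 1]["score"] = 0   # ChainedAssignmentError; df remains unchanged
```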

5. Optional Dependencies
When using pip , version 2.0 gives us the flexibility to install optional
dependencies, which is a plus in terms of customization and optimization of
resources.

We can tailor the installation to our specific requirements, without spending disk
space on what we don’t really need.

Plus, it saves a lot of “dependency headaches”, reducing the likelihood of
compatibility issues or conflicts with other packages we may have in our
development environments:
Installing optional dependencies. Snippet by Author.
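For reference, a sketch of what this looks like; the extras names below (performance, parquet, feather, all) come from the pandas installation docs, so check them against the version you are installing:

```bash
# Minimal install, then pull in only the extras you actually need
pip install pandas==2.0.0
pip install "pandas[performance]==2.0.0"      # bottleneck, numba, numexpr
pip install "pandas[parquet,feather]==2.0.0"  # Arrow-based file formats

# Or everything at once
pip install "pandas[all]==2.0.0"
```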

Taking it for a spin!


Yet, the question lingered: is the buzz really justified? I was curious to see whether
pandas 2.0 provided significant improvements with respect to some packages I use
on a daily basis: ydata-profiling, matplotlib, seaborn, scikit-learn.

From those, I decided to take ydata-profiling for a spin — it has just added support
for pandas 2.0, which seemed like a must-have for the community! In the new
release, users can rest assured that their pipelines won't break if they're using
pandas 2.0, and that's a major plus! But what else?

Truth be told, ydata-profiling has been one of my favorite tools for exploratory
data analysis, and it's a nice and quick benchmark too — a single line of code on my
side, but under the hood it is full of computations that, as a data scientist, I
need to work out — descriptive statistics, histogram plotting, analyzing
correlations, and so on. So what better way than testing the impact of the pyarrow
engine on all of those at once, with minimal effort?

Benchmarking with ydata-profiling. Snippet by Author.
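A sketch of what such a benchmark can look like (ProfileReport and to_file are ydata-profiling's standard API; the file name is the same assumption as before):

```python
import time
import pandas as pd
from ydata_profiling import ProfileReport

for engine in ("c", "pyarrow"):   # "c" is the default read_csv parser engine
    start = time.perf_counter()
    df = pd.read_csv("hacker_news.csv", engine=engine)
    read_time = time.perf_counter() - start

    start = time.perf_counter()
    ProfileReport(df, title=f"Hacker News ({engine} engine)").to_file(f"report_{engine}.html")
    print(f"{engine}: read={read_time:.1f}s, profile={time.perf_counter() - start:.1f}s")
```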

Again, reading the data is definitely faster with the pyarrow engine, although
creating the data profile has not changed significantly in terms of speed.

Yet, the differences may lie in memory efficiency, for which we'd have to run a
different analysis. Also, we could further investigate the type of analysis being
conducted over the data: for some operations, the difference between the 1.5.2 and
2.0 versions seems negligible.

But the main thing I noticed that might make a difference in this regard is that
ydata-profiling is not yet leveraging the pyarrow data types. This update could have
a great impact on both speed and memory, and it is something I look forward to in
future developments!

The Verdict: Performance, Flexibility, Interoperability!


This new pandas 2.0 release brings a lot of flexibility and performance optimization
with subtle, yet crucial modifications “under the hood”.

Maybe they are not “flashy” for newcomers to the field of data manipulation, but
they sure as hell are like water in the desert for veteran data scientists who used
to jump through hoops to overcome the limitations of the previous versions.

Wrapping it up, these are the main advantages introduced in the new release:

Performance Optimization: With the introduction of the Apache Arrow backend, more
numpy dtypes for indices, and the copy-on-write mode;

Added flexibility and customization: Allowing users to control optional
dependencies and take advantage of the Apache Arrow data types (including
nullability from the get-go!);

Interoperability: Perhaps a less “acclaimed” advantage of the new version, but one
with huge impact. Since Arrow is language-independent, in-memory data can be
transferred between programs built not only on Python, but also on R, Spark, and
others using the Apache Arrow backend!

And there you have it, folks! I hope this wrap-up has quieted down some of your
questions around pandas 2.0 and its applicability to our data manipulation tasks.

I'm still curious whether you have found major differences in your daily coding
with the introduction of pandas 2.0 as well! If you're up to it, come and find me
at the Data-Centric AI Community and let me know your thoughts! See you there?

About me
Ph.D., Machine Learning Researcher, Educator, Data Advocate, and overall “jack-of-
all-trades”. Here on Medium, I write about Data-Centric AI and Data Quality,
educating the Data Science & Machine Learning communities on how to move from
imperfect to intelligent data.

Developer Relations @ YData | Data-Centric AI Community | GitHub | Instagram |
Google Scholar | LinkedIn
