This April, pandas 2.0.0 was officially launched, making huge waves across the data science community.
Photo by Yancy Min on Unsplash.
Due to its extensive functionality and versatility, pandas has secured a place in
every data scientist’s heart.
From data input/output to data cleaning and transformation, it’s nearly impossible to
think about data manipulation without import pandas as pd, right?
Now, bear with me: with such a buzz around LLMs over the past months, I have
somehow let slide the fact that pandas has just undergone a major release! Yep,
pandas 2.0 is out and came with guns blazing!
Although I wasn’t aware of all the hype, the Data-Centric AI Community promptly
came to the rescue:
The 2.0 release seems to have created quite an impact in the data science community, with a lot of users
praising the modifications added in the new version. Screenshot by Author.
Fun fact: Were you aware this release was in the making for an astonishing 3 years? Now
that’s what I call “commitment to the community”!
So what does pandas 2.0 bring to the table? Let’s dive right into it!
In this release, the big change comes from the introduction of the Apache Arrow
backend for pandas data.
So, long story short, PyArrow addresses the memory constraints of the 1.X
versions and allows us to conduct faster and more memory-efficient data
operations, especially for larger datasets.
Here’s a comparison between reading the data without and with the pyarrow
backend, using the Hacker News dataset, which is around 650 MB (License CC BY-
NC-SA 4.0):
Comparison of read_csv(): Using the pyarrow backend is 35X faster. Snippet by Author.
As you can see, using the new backend makes reading the data nearly 35x faster.
Other aspects worth pointing out:
- Without the pyarrow backend, each column/feature is stored using numpy data
types: numeric features are stored as int64 or float64, while string values
are stored as object;
- With pyarrow, all features use the Arrow dtypes: note the [pyarrow]
annotation and the richer set of types: int64, float64, string, timestamp,
and double.
df.info(): Investigating the dtypes of each DataFrame. Snippet by Author.
Comparing string operations: showcasing the efficiency of arrow’s implementation. Snippet by Author.
In fact, Arrow has more (and better support for) data types than numpy , which are
needed outside the scientific (numerical) scope: dates and times, duration, binary,
decimals, lists, and maps. Skimming through the equivalence between pyarrow-
backed and numpy data types might actually be a good exercise in case you want to
learn how to leverage them.
Leveraging 32-bit numpy indices, making the code more memory-efficient. Snippet by Author.
This is a welcome change since indices are one of the most used functionalities in
pandas , allowing users to filter, join, and shuffle data, among other data operations.
Essentially, the lighter the Index is, the more efficient those processes will be!
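A quick sketch of those lighter indices (variable names are illustrative):

```python
import numpy as np
import pandas as pd

# pandas 2.0 Index can hold 32-bit numpy dtypes directly,
# instead of upcasting everything to int64/float64
idx32 = pd.Index(np.array([1, 2, 3], dtype=np.int32))
idx64 = pd.Index([1, 2, 3])  # still defaults to int64

print(idx32.dtype, idx32.memory_usage())  # int32, half the bytes
print(idx64.dtype, idx64.memory_usage())
```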
For instance, integers are automatically converted to floats, which is not ideal:
Missing Values: Conversion to float. Snippet by Author.
Note how points automatically changes from int64 to float64 after the
introduction of a single None value.
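In code, the upcast looks like this (a minimal reproduction, not the article's original snippet):

```python
import pandas as pd

points = pd.Series([1, 2, 3])
print(points.dtype)  # int64

# A single missing value forces the whole column to float64
points_with_missing = pd.Series([1, 2, None])
print(points_with_missing.dtype)  # float64
```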
There is nothing worse for a data flow than wrong typesets, especially within a data-
centric AI paradigm. Imagine a column that should hold integers but is silently cast
to float when the data is loaded. Then, when passing the data into a generative
model as a float, we might get decimal output values such as 2.5, and unless you’re
a mathematician with 2 kids, a newborn, and a weird sense of humor, having 2.5
children is not OK.
In pandas 2.0, we can leverage dtype_backend='numpy_nullable', where missing values
are accounted for without any dtype changes, so we can keep our original data types
(int64 in this case):
Leveraging ‘numpy_nullable’, pandas 2.0 can handle missing values without changing the original data types.
Snippet by Author.
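A minimal sketch of that behavior, using a tiny inline CSV (the column names are illustrative):

```python
import io
import pandas as pd

# Tiny CSV with a missing value in an integer column
csv_data = "name,points\nanna,1\nbob,2\ncarol,\n"

# Default backend: the missing value upcasts 'points' to float64
df_default = pd.read_csv(io.StringIO(csv_data))

# numpy_nullable backend: 'points' stays integer (nullable Int64, with <NA>)
df_nullable = pd.read_csv(io.StringIO(csv_data), dtype_backend="numpy_nullable")

print(df_default["points"].dtype)   # float64
print(df_nullable["points"].dtype)  # Int64
```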
It might seem like a subtle change, but under the hood it means that now pandas
can natively use Arrow’s implementation of dealing with missing values. This
makes operations much more efficient, since pandas doesn’t have to implement its
own version for handling null values for each data type.
4. Copy-On-Write Optimization
Pandas 2.0 also adds a new lazy copy mechanism that defers copying DataFrames
and Series objects until they are modified.
This means that certain methods will return views rather than copies when copy-
on-write is enabled, which improves memory efficiency by minimizing
unnecessary data duplication.
It also means you need to be extra careful when using chained assignments.
If the copy-on-write mode is enabled, chained assignments will not work because
they point to a temporary object that is the result of an indexing operation (which
under copy-on-write behaves as a copy).
When copy_on_write is disabled, modifying a slice may also change the original df
it was derived from.
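A small sketch of copy-on-write in action (the option is guarded with try/except since newer pandas versions enable the behavior by default):

```python
import pandas as pd

# Copy-on-write is opt-in in pandas 2.0
try:
    pd.set_option("mode.copy_on_write", True)
except Exception:
    pass  # newer versions may have it always on

df = pd.DataFrame({"a": [1, 2, 3]})

# Selecting a column returns a lazy view; writing to it triggers a copy,
# leaving the parent DataFrame untouched
subset = df["a"]
subset.iloc[0] = 100

print(df["a"].tolist())  # [1, 2, 3] -- df is unchanged
print(subset.tolist())   # [100, 2, 3]
```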
5. Optional Dependencies
When using pip , version 2.0 gives us the flexibility to install optional
dependencies, which is a plus in terms of customization and optimization of
resources.
We can tailor the installation to our specific requirements, without spending disk
space on what we don’t really need.
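For illustration, a few of the optional dependency groups; the exact extras names and their contents are listed in the pandas installation guide:

```shell
# Install only the optional dependency groups you need
pip install "pandas[performance]"       # bottleneck, numba, numexpr
pip install "pandas[excel]"             # openpyxl, xlsxwriter, ...
pip install "pandas[postgresql, aws]"   # SQL driver, s3fs
```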
From those, I decided to take ydata-profiling for a spin; it has just added support
for pandas 2.0, which seemed like a must-have for the community! In the new
release, users can rest assured that their pipelines won’t break if they’re using pandas
2.0, and that’s a major plus! But what else?
Truth be told, ydata-profiling has been one of my top favorite tools for exploratory
data analysis, and it’s a nice and quick benchmark too — a 1-line of code on my side,
but under the hood it is full of computations that as a data scientist I need to work
out — descriptive statistics, histogram plotting, analyzing correlations, and so on.
So what better way than testing the impact of the pyarrow engine on all of those at
once with minimal effort?
Again, reading the data is definitely better with the pyarrow engine, although
creating the data profile has not changed significantly in terms of speed.
Yet, the differences may lie in memory efficiency, for which we’d have to run a
different analysis. Also, we could further investigate the type of analysis being
conducted over the data: for some operations, the difference between the 1.5.2 and
2.0 versions seems negligible.
But the main thing I noticed that might make a difference in this regard is that
ydata-profiling is not yet leveraging the pyarrow data types. This update could have a
great impact on both speed and memory, and it is something I look forward to in
future developments!
Maybe they are not “flashy” for newcomers to the field of data manipulation, but
they sure as hell are like water in the desert for veteran data scientists who used to
jump through hoops to overcome the limitations of the previous versions.
Wrapping it up, these are the main advantages introduced in the new release:
And there you have it, folks! I hope this wrap-up has quieted down some of your
questions around pandas 2.0 and its applicability to our data manipulation tasks.
I’m still curious whether you have found major differences in your daily coding with
the introduction of pandas 2.0 as well! If you’re up for it, come and find me at the
Data-Centric AI Community and let me know your thoughts! See you there?
About me
Ph.D., Machine Learning Researcher, Educator, Data Advocate, and overall “jack-of-
all-trades”. Here on Medium, I write about Data-Centric AI and Data Quality,
educating the Data Science & Machine Learning communities on how to move from
imperfect to intelligent data.
Miriam Santos in Towards Data Science