10/11/2020 Comprehensive Guide to Grouping and Aggregating with Pandas - Practical Business Python

Practical Business Python


Taking care of business, one python script at a time

Mon 09 November 2020

Comprehensive Guide to Grouping and Aggregating with Pandas

Posted by Chris Moffitt in articles

Introduction
One of the most basic analysis functions is grouping and aggregating data. In some cases, this level of analysis may be sufficient to answer business questions. In other instances, this activity might be the first step in a more complex data science analysis. In pandas, the groupby function can be combined with one or more aggregation functions to quickly and easily summarize data. The concept is deceptively simple, and most new pandas users will understand it. However, they might be surprised at how useful complex aggregation functions can be for supporting sophisticated analysis.

https://pbpython.com/groupby-agg.html

This article will quickly summarize the basic pandas aggregation functions and show examples of
more complex custom aggregations. Whether you are a new or more experienced pandas user, I
think you will learn a few things from this article.

Aggregating
In the context of this article, an aggregation function is one which takes multiple individual values
and returns a summary. In the majority of the cases, this summary is a single value.

The most common aggregation functions are a simple average or summation of values. As of
pandas 0.20, you may call an aggregation function on one or more columns of a DataFrame.

Here’s a quick example of calculating the total and average fare using the Titanic dataset (loaded
from seaborn):

import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')

df['fare'].agg(['sum', 'mean'])

sum 28693.949300
mean 32.204208
Name: fare, dtype: float64

This simple concept is a necessary building block for more complex analysis.

There are multiple ways to call an aggregation function, and it is worth discussing them. As shown above, you may pass a list of functions to apply to one or more columns of data.

What if you want to perform the analysis on only a subset of columns? There are two other options for aggregations: using a dictionary or a named aggregation.

Here is a comparison of the three options:


It is important to be aware of these options and know which one to use when.

Choosing an aggregation approach

As a general rule, I prefer to use dictionaries for aggregations.

The tuple approach is limited by only being able to apply one aggregation at a time to a specific column. If I need to rename columns, I use the rename function after the aggregations are complete. In some specific instances, the list approach is a useful shortcut. I will reiterate, though, that I think the dictionary approach is the most robust for the majority of situations.
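To make the comparison concrete, here is a sketch of the three call styles side by side. The tiny synthetic frame (and its town/fare column names) is my own stand-in for the Titanic data, not from the article:

```python
import pandas as pd

# Small synthetic frame standing in for the Titanic data
df = pd.DataFrame({'town': ['S', 'S', 'C'],
                   'fare': [10.0, 20.0, 30.0]})

# 1. List style: functions applied to each selected column
list_style = df.groupby('town')['fare'].agg(['sum', 'mean'])

# 2. Dictionary style: map column names to function lists
dict_style = df.groupby('town').agg({'fare': ['sum', 'mean']})

# 3. Named aggregation: (column, function) tuples control the output names
named_style = df.groupby('town').agg(fare_sum=('fare', 'sum'),
                                     fare_mean=('fare', 'mean'))
```

All three produce the same numbers; they differ in which columns you can target and what the result columns are called.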

Groupby
Now that we know how to use aggregations, we can combine this with groupby to
summarize data.

Basic math
The most common built-in aggregation functions are basic math functions, including sum, mean, median, minimum, maximum, standard deviation, variance, mean absolute deviation and product.


We can apply all these functions to the fare while grouping by the embark_town :

agg_func_math = {
    'fare': ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'mad', 'prod']
}
df.groupby(['embark_town']).agg(agg_func_math).round(2)

This is all relatively straightforward math.

As an aside, I have not found a good use for the prod function, which computes the product of all the values in a group; I am including it for the sake of completeness.

One other useful shortcut is to use describe to run multiple built-in aggregations at one time:

agg_func_describe = {'fare': ['describe']}


df.groupby(['embark_town']).agg(agg_func_describe).round(2)

Counting


After basic math, counting is the next most common aggregation I perform on grouped data. In some ways, this can be a little trickier than the basic math. Here are three examples of counting:

agg_func_count = {'embark_town': ['count', 'nunique', 'size']}


df.groupby(['deck']).agg(agg_func_count)

The major distinction to keep in mind is that count will not include NaN values whereas size
will. Depending on the data set, this may or may not be a useful distinction. In addition, the
nunique function will exclude NaN values in the unique counts. Keep reading for an example of
how to include NaN in the unique value counts.
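The count/size distinction is easiest to see on data with a deliberate missing value. This tiny frame and its deck/town names are my own illustration, not the Titanic data:

```python
import numpy as np
import pandas as pd

# Synthetic frame: deck 'A' has one missing town value
df = pd.DataFrame({'deck': ['A', 'A', 'A', 'B'],
                   'town': ['S', 'S', np.nan, 'C']})

counts = df.groupby('deck').agg({'town': ['count', 'size', 'nunique']})
# For deck 'A': count skips the NaN, size includes it, nunique ignores it
```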

First and last
In this example, we can select the highest and lowest fare by embarked town. One important point to remember is that you must sort the data first if you want first and last to pick the max and min values.

agg_func_selection = {'fare': ['first', 'last']}

df.sort_values(by=['fare'],
               ascending=False).groupby(['embark_town']).agg(agg_func_selection)


In the example above, I would recommend using max and min but I am including first and last for the sake of completeness. In other applications (such as time series analysis) you may want to select the first and last values for further analysis.

Another selection approach is to use idxmax and idxmin to select the index value that
corresponds to the maximum or minimum value.

agg_func_max_min = {'fare': ['idxmax', 'idxmin']}


df.groupby(['embark_town']).agg(agg_func_max_min)

We can check the results:

df.loc[[258, 378]]

Here’s another shortcut trick you can use to see the rows with the max fare :


df.loc[df.groupby('class')['fare'].idxmax()]

The above example is one of those places where the list-based aggregation is a useful shortcut.

Other libraries
You are not limited to the aggregation functions in pandas. For instance, you could use stats
functions from scipy or numpy.

Here is an example of calculating the mode and skew of the fare data.

from scipy.stats import skew, mode


agg_func_stats = {'fare': [skew, mode, pd.Series.mode]}
df.groupby(['embark_town']).agg(agg_func_stats)

The mode results are interesting. The scipy.stats mode function returns the most frequent value as
well as the count of occurrences. If you just want the most frequent value, use pd.Series.mode.

The key point is that you can use any function you want, as long as it knows how to interpret the array of pandas values and return a single value.
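As a quick illustration of that contract, any callable that reduces a Series to one value can be dropped into agg. The fare_range function and the small synthetic frame below are my own, not from the article's dataset:

```python
import pandas as pd

def fare_range(s):
    # Any callable that reduces a Series to a single value works in agg
    return s.max() - s.min()

# Synthetic stand-in data
df = pd.DataFrame({'town': ['S', 'S', 'C'],
                   'fare': [10.0, 25.0, 7.0]})

result = df.groupby('town').agg({'fare': [fare_range]})
```

The result column is labeled with the function's name, just as with the built-in aggregations.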

Working with text
When working with text, the counting functions will work as expected. You can also use scipy’s
mode function on text data.


One interesting application is that if you have a small number of distinct values, you can use python’s set function to display the full list of unique values.

This summary of the class and deck shows how this approach can be useful for some data sets.

agg_func_text = {'deck': ['nunique', mode, set]}

df.groupby(['class']).agg(agg_func_text)

Custom functions
The pandas standard aggregation functions and pre-built functions from the python ecosystem will meet many of your analysis needs. However, you will likely want to create your own custom aggregation functions. There are four methods for creating them.

To illustrate the differences, let’s calculate the 25th percentile of the data using four approaches:

First, we can use a partial function:

from functools import partial

# Use partial
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = '25%'

Next, we define our own function (which is a small wrapper around quantile):

# Define a function
def percentile_25(x):
    return x.quantile(.25)

We can define a lambda function and give it a name:


# Define a lambda function
lambda_25 = lambda x: x.quantile(.25)
lambda_25.__name__ = 'lambda_25%'

Or, define the lambda inline:

# Use a lambda function inline
agg_func = {
    'fare': [q_25, percentile_25, lambda_25, lambda x: x.quantile(.25)]
}

df.groupby(['embark_town']).agg(agg_func).round(2)

As you can see, the results are the same, but the column labels are all a little different. This is an area of programmer preference, but I encourage you to be familiar with the options since you will encounter most of them in online solutions.

Choosing a custom function style

I prefer to use custom functions or inline lambdas.

Like many other areas of programming, this is an element of style and preference but I encourage
you to pick one or two approaches and stick with them for consistency.

Custom function examples
As shown above, there are multiple approaches to developing custom aggregation functions. I will go through a few specific, useful examples to highlight how they are frequently used.


In most cases, the functions are lightweight wrappers around built-in pandas functions. Part of the reason you need to write them is that there is no way to pass arguments to aggregations. Some examples should clarify this point.

If you want to count the number of null values, you could use this function:

def count_nulls(s):
    return s.size - s.count()

If you want to include NaN values in your unique counts, you need to pass dropna=False to the
nunique function.

def unique_nan(s):
    return s.nunique(dropna=False)

Here is a summary of all the values together:

agg_func_custom_count = {
    'embark_town': ['count', 'nunique', 'size', unique_nan, count_nulls, set]
}
df.groupby(['deck']).agg(agg_func_custom_count)

If you want to calculate the 90th percentile, use quantile:

def percentile_90(x):
    return x.quantile(.9)


If you want to calculate a trimmed mean where the lowest and highest 10 percent of values are excluded, use the scipy stats function trim_mean:

from scipy.stats import trim_mean

def trim_mean_10(x):
    return trim_mean(x, 0.1)

If you want the largest value, regardless of the sort order (see the notes above about first and last):

def largest(x):
    return x.nlargest(1)

This is equivalent to max but I will show another example of nlargest below to highlight the difference.
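As a small sketch of that difference (on synthetic data, not the Titanic set): max collapses each group to one value, while nlargest(n) keeps the n largest rows per group:

```python
import pandas as pd

# Synthetic data for illustration
df = pd.DataFrame({'grp': ['a', 'a', 'a', 'b', 'b'],
                   'val': [1, 5, 3, 2, 9]})

# max returns a single value per group ...
maxes = df.groupby('grp')['val'].max()

# ... while nlargest(n) keeps the n largest rows within each group,
# sorted in descending order
top2 = df.groupby('grp')['val'].nlargest(2)
```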

I wrote about sparklines before. Refer to that article for install instructions. Here’s how to
incorporate them into an aggregate function for a unique view of the data:

import numpy as np
from sparklines import sparklines

def sparkline_str(x):
    bins = np.histogram(x)[0]
    return ''.join(sparklines(bins))

Here they are all put together:

agg_func_largest = {
    'fare': [percentile_90, trim_mean_10, largest, sparkline_str]
}
df.groupby(['class', 'embark_town']).agg(agg_func_largest)


The nlargest and nsmallest functions can be useful for summarizing the data in various
scenarios. Here is code to show the total fares for the top 10 and bottom 10 individuals:

def top_10_sum(x):
    return x.nlargest(10).sum()

def bottom_10_sum(x):
    return x.nsmallest(10).sum()

agg_func_top_bottom_sum = {
    'fare': [top_10_sum, bottom_10_sum]
}
df.groupby('class').agg(agg_func_top_bottom_sum)


Using this approach can be useful when applying the Pareto principle to your own data.

Custom functions with multiple columns


If you have a scenario where you want to run multiple aggregations across columns, then you may want to use groupby combined with apply, as described in this Stack Overflow answer.

Using this method, you will have access to all of the columns of the data and can choose the
appropriate aggregation approach to build up your resulting DataFrame (including the
column labels):

def summary(x):
    result = {
        'fare_sum': x['fare'].sum(),
        'fare_mean': x['fare'].mean(),
        'fare_range': x['fare'].max() - x['fare'].min()
    }
    return pd.Series(result).round(0)

df.groupby(['class']).apply(summary)

Using apply with groupby gives maximum flexibility over all aspects of the results. However, there is a downside: the apply function is slow, so this approach should be used sparingly.
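When each output is a reduction of a single column, named aggregation can often replace apply entirely and avoid the slowdown. A sketch on a small synthetic frame (the frame and output names are my own, not the article's):

```python
import pandas as pd

# Synthetic stand-in for the Titanic data
df = pd.DataFrame({'cls': ['x', 'x', 'y'],
                   'fare': [10.0, 30.0, 5.0]})

# The same kind of summary expressed as a single agg call -- no apply needed
summary = df.groupby('cls').agg(
    fare_sum=('fare', 'sum'),
    fare_mean=('fare', 'mean'),
    fare_range=('fare', lambda s: s.max() - s.min()))
```

apply remains the right tool when one output needs values from several columns at once.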

Working with group objects


Once you group and aggregate the data, you can do additional calculations on the grouped objects.

For the first example, we can figure out what percentage of the total fares collected can be attributed to each embark_town and class combination. We use assign and a lambda function to add a pct_total column:


df.groupby(['embark_town', 'class']).agg({
    'fare': 'sum'
}).assign(pct_total=lambda x: x / x.sum())

One important thing to keep in mind is that you can actually do this more simply using a
pd.crosstab as described in my previous article:

pd.crosstab(df['embark_town'],
            df['class'],
            values=df['fare'],
            aggfunc='sum',
            normalize=True)

While we are talking about crosstab , a useful concept to keep in mind is that agg functions can
be combined with pivot tables too.


Here’s a quick example:

pd.pivot_table(data=df,
               index=['embark_town'],
               columns=['class'],
               aggfunc=agg_func_top_bottom_sum)

Sometimes you will need to do multiple groupby’s to answer your question. For instance, if we
wanted to see a cumulative total of the fares, we can group and aggregate by town and class then
group the resulting object and calculate a cumulative sum:

fare_group = df.groupby(['embark_town', 'class']).agg({'fare': 'sum'})


fare_group.groupby(level=0).cumsum()


This may be a little tricky to understand: we first total the fares by town and class, then group the resulting frame by the town level of its index and compute a running sum within each town.
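A miniature synthetic version of the same two-step grouping may make the mechanics easier to follow (the data below is my own, not the Titanic set):

```python
import pandas as pd

# Tiny synthetic stand-in for the town/class fare data
df = pd.DataFrame({'town': ['C', 'C', 'S', 'S'],
                   'cls': ['1st', '2nd', '1st', '2nd'],
                   'fare': [10.0, 20.0, 5.0, 15.0]})

# Step 1: aggregate fares by town and class (MultiIndex result)
grp = df.groupby(['town', 'cls']).agg({'fare': 'sum'})

# Step 2: regroup on the outer index level (town) and accumulate within it
running = grp.groupby(level=0).cumsum()
```

The cumulative sum restarts at each new town because level=0 makes town the grouping key for the second pass.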

Here’s another example where we want to summarize daily sales data and convert it to a
cumulative daily and quarterly view. Refer to the Grouper article if you are not familiar with using
pd.Grouper() :

In the first example, we want to include total daily sales as well as a cumulative quarter amount:


sales = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/da

daily_sales = sales.groupby([pd.Grouper(key='date', freq='D')]).agg(
    daily_sales=('ext price', 'sum')).reset_index()

daily_sales['quarter_sales'] = daily_sales.groupby(
    pd.Grouper(key='date', freq='Q')).agg({'daily_sales': 'cumsum'})

To understand this, you need to look at the quarter boundary (end of March through start of April)
to get a good sense of what is going on.

If you want to just get a cumulative quarterly total, you can chain multiple groupby functions.

First, group the daily results, then group those results by quarter and use a cumulative sum:

sales.groupby([pd.Grouper(key='date', freq='D')]).agg(
    daily_sales=('ext price', 'sum')).groupby(
        pd.Grouper(freq='Q')).agg({
            'daily_sales': 'cumsum'
        }).rename(columns={'daily_sales': 'quarterly_sales'})


In this example, I included the named aggregation approach to rename the variable to clarify that it
is now daily sales. I then group again and use the cumulative sum to get a running sum for the
quarter. Finally, I rename the column to quarterly sales.

Admittedly this is a bit tricky to understand. However, if you take it step by step and build out the
function and inspect the results at each step, you will start to get the hang of it. Don’t
be discouraged!

Flattening Hierarchical Column Indices


By default, pandas creates a hierarchical column index on the summary DataFrame. Here is what I
am referring to:

df.groupby(['embark_town', 'class']).agg({'fare': ['sum', 'mean']}).round(0)


At some point in the analysis process you will likely want to “flatten” the columns so that there is a single row of names.

I have found that the following approach works best for me. I use the parameter as_index=False
when grouping, then build a new collapsed column name.

Here is the code:

multi_df = df.groupby(['embark_town', 'class'],
                      as_index=False).agg({'fare': ['sum', 'mean']})

multi_df.columns = [
    '_'.join(col).rstrip('_') for col in multi_df.columns.values
]

After flattening, the frame has a single level of column names: embark_town, class, fare_sum and fare_mean.


I prefer to use _ as my separator but you could use other values. Just keep in mind that it will be
easier for your subsequent analysis if the resulting column names do not have spaces.

Subtotals
One process that is not straightforward with grouping and aggregating in pandas is adding a
subtotal. If you want to add subtotals, I recommend the sidetable package. Here is how you can
summarize fares by class , embark_town and sex with a subtotal at each level as well as a
grand total at the bottom:

import sidetable
df.groupby(['class', 'embark_town', 'sex']).agg({'fare': 'sum'}).stb.subtotal()


sidetable also allows customization of the subtotal levels and resulting labels. Refer to the package
documentation for more examples of how sidetable can summarize your data.

Summary
Thanks for reading this article. There is a lot of detail here, but that is due to how many different uses there are for grouping and aggregating data with pandas. My hope is that this post becomes a useful resource that you can bookmark and come back to when you get stuck with a challenging problem of your own.

If you have other common techniques you use frequently, please let me know in the comments. If I get some broadly useful ones, I will include them in this post or in an updated article.

image credit: Herman Traub


Tags pandas 



Ⓒ 2014-2020 Practical Business Python  •  Site built using Pelican  •  Theme based on VoidyBootstrap by RKI
