← All articles
PythonData Analysis

Pandas GroupBy: A Practical Guide to Split, Apply, Combine

Every 'per-group' question in Pandas goes through groupby. This guide builds the split-apply-combine mental model, then works through agg, named aggregations, multi-column grouping, transform, and filter on a small dataset you can regenerate yourself.

Almost every interesting question you can ask of a dataset is a per-group question. Not “what’s the total revenue?” but “what’s the revenue per city?” Not “what’s the average order?” but “what’s the average order per product?” In Pandas, all of those questions go through one method: groupby().

groupby is also where a lot of people quietly lose confidence in Pandas. The method returns a strange lazy object instead of data, the results sometimes come back with weird indexes, and there are three different ways (agg, transform, filter) to do what looks like the same thing. (If some of that indexing confusion sounds familiar, our post on common Pandas gotchas covers the related traps.) This guide builds the mental model first, then works through each tool on a small dataset you can regenerate on your own machine.

The Mental Model: Split, Apply, Combine

Every groupby operation, no matter how complicated it looks, is three steps:

  1. Split the rows into groups based on the values in one or more columns.
  2. Apply a function to each group independently — sum it, average it, rank it, or test it.
  3. Combine the per-group results back into a single Series or DataFrame.
Diagram of the split-apply-combine model: a table of orders is split into three groups by city, a sum is applied to each group independently, and the per-group results are combined into one summary table.

You only ever write the apply step. Pandas handles the splitting and combining for you — which is exactly why the syntax feels terse: orders.groupby("city")["revenue"].sum() is a full split-apply-combine pipeline in one line.

Keep this model in mind, because the difference between agg, transform, and filter is only in the combine step: agg returns one row per group, transform returns one value per original row, and filter returns the surviving original rows.

A Dataset You Can Reproduce

Rather than downloading anything, let’s generate a dataset with a fixed seed so your numbers match mine exactly. Imagine a small coffee chain with shops in three cities, selling five items:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

menu = {
    "espresso": ("drink", 2.50),
    "latte": ("drink", 3.80),
    "flat white": ("drink", 4.20),
    "croissant": ("food", 2.90),
    "banana bread": ("food", 3.40),
}

items = rng.choice(list(menu), size=n)
orders = pd.DataFrame({
    "city": rng.choice(["Amsterdam", "Berlin", "Lisbon"], size=n, p=[0.45, 0.35, 0.20]),
    "item": items,
    "category": [menu[i][0] for i in items],
    "quantity": rng.integers(1, 4, size=n),
})
orders["revenue"] = (orders["quantity"] * orders["item"].map(lambda i: menu[i][1])).round(2)

orders.head()
        city        item category  quantity  revenue
0     Berlin    espresso    drink         2      5.0
1     Lisbon   croissant     food         1      2.9
2     Berlin   croissant     food         1      2.9
3  Amsterdam  flat white    drink         1      4.2
4  Amsterdam  flat white    drink         2      8.4

Five hundred orders, five columns. (The outputs in this post come from pandas 3.0 — everything shown also works on pandas 2.x.)

Your First GroupBy

Revenue per city — the classic:

orders.groupby("city")["revenue"].sum()
city
Amsterdam    1506.7
Berlin       1109.3
Lisbon        655.9
Name: revenue, dtype: float64

Read it right to left: take the revenue column, within each city group, and sum it. The result is a Series whose index is the group keys. That index placement surprises people, and we’ll deal with it properly in a moment.

What groupby() Actually Returns

If you stop halfway, you don’t get data:

g = orders.groupby("city")
g
<pandas.api.typing.DataFrameGroupBy object at 0x10c74bd90>

Nothing has been computed yet. A DataFrameGroupBy object is a plan — it knows how the rows split, and it waits for you to say what to apply. (The official groupby user guide has the full list of what these objects support, if you want to go deeper than this post.) It’s still useful on its own, though:

g.size()
city
Amsterdam    222
Berlin       177
Lisbon       101
dtype: int64

And you can pull out one group as a regular DataFrame when you want to inspect it:

g.get_group("Lisbon").head(3)
      city          item category  quantity  revenue
1   Lisbon     croissant     food         1      2.9
6   Lisbon      espresso    drink         3      7.5
11  Lisbon  banana bread     food         1      3.4

Notice the original row labels (1, 6, 11) are preserved — get_group is literally the “split” step, handed back to you.

More Than One Number per Group: agg

One aggregation per group is rarely enough. agg takes a list:

orders.groupby("city")["revenue"].agg(["sum", "mean", "count"])
              sum      mean  count
city
Amsterdam  1506.7  6.786937    222
Berlin     1109.3  6.267232    177
Lisbon      655.9  6.494059    101

Better still, named aggregations let you aggregate different columns and name the outputs in one call. This is the form I reach for in real analysis code, because the result is self-documenting:

orders.groupby("city").agg(
    total_revenue=("revenue", "sum"),
    avg_order=("revenue", "mean"),
    orders=("revenue", "count"),
).round(2)
           total_revenue  avg_order  orders
city
Amsterdam         1506.7       6.79     222
Berlin            1109.3       6.27     177
Lisbon             655.9       6.49     101

Each keyword argument is new_column_name=(source_column, function). No renaming step afterwards, no guessing what sum referred to.

Grouping by Multiple Columns

Pass a list of columns and the groups become every observed combination:

orders.groupby(["city", "category"])["revenue"].sum()
city       category
Amsterdam  drink       927.3
           food        579.4
Berlin     drink       746.3
           food        363.0
Lisbon     drink       427.1
           food        228.8
Name: revenue, dtype: float64

That result has a two-level MultiIndex, which is powerful but awkward if you just want a flat table to keep working with. You have two good options. If you want a plain DataFrame, ask for it up front:

orders.groupby(["city", "category"], as_index=False)["revenue"].sum()
        city category  revenue
0  Amsterdam    drink    927.3
1  Amsterdam     food    579.4
2     Berlin    drink    746.3
3     Berlin     food    363.0
4     Lisbon    drink    427.1
5     Lisbon     food    228.8

And if you want a comparison grid — cities as rows, categories as columns — unstack() pivots the inner index level into columns:

orders.groupby(["city", "category"])["revenue"].sum().unstack()
category   drink   food
city
Amsterdam  927.3  579.4
Berlin     746.3  363.0
Lisbon     427.1  228.8

transform: Group Results at Row Level

Here’s the one that unlocks a whole class of problems. agg collapses each group to one row. But sometimes you want the group-level number attached to every original row — for example, “what share of its city’s revenue is this order?” That’s transform:

orders["city_revenue"] = orders.groupby("city")["revenue"].transform("sum")
orders["pct_of_city"] = (orders["revenue"] / orders["city_revenue"] * 100).round(2)

orders[["city", "item", "revenue", "city_revenue", "pct_of_city"]].head()
        city        item  revenue  city_revenue  pct_of_city
0     Berlin    espresso      5.0        1109.3         0.45
1     Lisbon   croissant      2.9         655.9         0.44
2     Berlin   croissant      2.9        1109.3         0.26
3  Amsterdam  flat white      4.2        1506.7         0.28
4  Amsterdam  flat white      8.4        1506.7         0.56

Same split, same sum — but the combine step broadcasts each group’s total back onto its rows, so the result is exactly as long as the original DataFrame and can be assigned straight into a new column. Any time you catch yourself computing an aggregate and then merging it back onto the original table, transform does it in one step.

filter: Keep or Drop Whole Groups

The third combine flavor: filter runs a test per group and keeps the original rows of the groups that pass. Which items have an average order value above €8?

big = orders.groupby("item").filter(lambda grp: grp["revenue"].mean() > 8)
big["item"].unique()
<StringArray>
['flat white']
Length: 1, dtype: str

Only the flat white makes the cut, and big contains its complete, unaggregated order rows — ready for further analysis. The lambda receives each group as a DataFrame and must return a single True or False.

Three Gotchas Worth Knowing

Groups come back sorted. By default, groupby sorts group keys alphabetically, not by first appearance. If you don’t need sorting, sort=False keeps encounter order and can be faster on large data:

orders.groupby("city", sort=False)["revenue"].sum()
city
Berlin       1109.3
Lisbon        655.9
Amsterdam    1506.7
Name: revenue, dtype: float64

Missing keys silently vanish. Rows where the grouping column is NaN are dropped from the result entirely — your totals just quietly shrink. If missing keys are meaningful in your data, pass dropna=False and they come back as their own group:

orders_with_gaps.groupby("city", dropna=False)["revenue"].sum()
city
Amsterdam    1448.1
Berlin       1036.9
Lisbon        636.2
NaN           150.7
Name: revenue, dtype: float64

“Top item per group” is idxmax, not sort-and-hope. To find each city’s best-selling item, aggregate to (city, item) totals first, then let idxmax pick the winner within each city:

per_item = orders.groupby(["city", "item"])["revenue"].sum()
per_item.groupby("city").idxmax()
city
Amsterdam    (Amsterdam, flat white)
Berlin               (Berlin, latte)
Lisbon          (Lisbon, flat white)
Name: revenue, dtype: object

Amsterdam and Lisbon run on flat whites; Berlin is a latte town.

Wrapping Up

Everything groupby does fits one model: split rows by key, apply a function per group, combine the results. The three tools differ only in what “combine” means:

  • agg → one row per group (summaries, reports)
  • transform → one value per original row (group stats as new columns)
  • filter → the original rows of groups that pass a test

Add as_index=False for flat results, named aggregations for readable code, and dropna=False when missing keys matter, and you can answer nearly any per-group question without leaving a single method chain.

If you want to build this into a fuller data-analysis workflow — indexing, cleaning, joining, and visualizing with real datasets — the Pandas Data Analysis lessons in our free Python for Data Analytics course pick up exactly where this post leaves off.

More from the blog