Lesson 15 - GroupBy and Aggregation

Category-Based Analysis

You can filter, sort, and transform data. Now you will learn GroupBy—the most powerful pandas feature for analyzing data by categories. Group by country, sector, or any category, then calculate statistics for each group.

By the end of this lesson, you will be able to:

  • Group data by one or more columns
  • Calculate statistics for each group (mean, sum, count, etc.)
  • Use .agg() for multiple aggregations
  • Understand the split-apply-combine pattern
  • Find top performers within categories
  • Create complex multi-level aggregations

GroupBy operations answer questions like “What is the average revenue per sector?” or “Which country has the highest total profits?” Let’s master this essential skill.


Understanding the Split-Apply-Combine Pattern

GroupBy operations follow a three-step pattern:

  1. Split - Divide data into groups based on category values
  2. Apply - Calculate a statistic for each group independently
  3. Combine - Merge the results back into a single structure

This pattern is fundamental to category-based analysis in data science.

Visual Representation

Original DataFrame:
company          sector        revenues
TechCorp Global  Technology    125000
FreshMart Inc    Food          89000
AutoDrive Motors Automotive    156000
FirstBank Holdings Financials  234000
PowerGen Energy  Energy        178000

↓ SPLIT by sector ↓

Technology: [125000]
Food: [89000]
Automotive: [156000]
Financials: [234000]
Energy: [178000]

↓ APPLY mean() ↓

Technology: 125000
Food: 89000
Automotive: 156000
Financials: 234000
Energy: 178000

↓ COMBINE ↓

sector
Technology    125000
Food           89000
Automotive    156000
Financials    234000
Energy        178000

Sample Data

Let’s create an expanded dataset for demonstration:

import pandas as pd
import numpy as np

# Create companies dataset
companies = pd.DataFrame({
    'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy',
                'MediPharm Solutions', 'RetailHub Ltd', 'SkyWings Airlines', 'SteelCore Industries', 'NetLink Telecom',
                'LogiTrans Services', 'SolarNova Energy'],
    'sector': ['Technology', 'Food', 'Automotive', 'Financials', 'Energy',
               'Healthcare', 'Retail', 'Transportation', 'Materials', 'Technology',
               'Transportation', 'Energy'],
    'revenues': [125000, 89000, 156000, 234000, 178000,
                 98000, 112000, 187000, 145000, 165000,
                 134000, 142000],
    'profits': [12000, 8500, -3000, 45000, 23000,
                15000, 9800, 21000, 18000, 28000,
                16000, 19000],
    'employees': [1200, 890, 2300, 5600, 3400,
                  2100, 4500, 8900, 3200, 6700,
                  5400, 2800],
    'country': ['USA', 'USA', 'USA', 'UK', 'Germany',
                'USA', 'UK', 'Germany', 'USA', 'UK',
                'UK', 'Germany']
})

print(f"Dataset: {len(companies)} companies")
companies

This dataset has 12 companies across multiple sectors and countries—perfect for demonstrating GroupBy operations.


Basic GroupBy Operations

Group data by one column and calculate a single statistic.

Grouping by Single Column

# Group by country and calculate mean revenues
avg_revenue_by_country = companies.groupby('country')['revenues'].mean()

print("Average revenue by country:")
print(avg_revenue_by_country)

Output:

Average revenue by country:
country
Germany    169000.0
UK         158250.0
USA        125400.0
Name: revenues, dtype: float64

Each country gets one value—the average of all revenues in that group.

Different Aggregations

# Sum of profits by sector
total_profit_by_sector = companies.groupby('sector')['profits'].sum()

print("Total profits by sector:")
print(total_profit_by_sector.sort_values(ascending=False))

Output:

Total profits by sector:
sector
Financials        45000
Energy            42000
Transportation    37000
Technology        40000
...
Name: profits, dtype: int64

Counting Group Sizes

# Count companies per sector
companies_per_sector = companies.groupby('sector').size()

print("Number of companies per sector:")
print(companies_per_sector)

Output:

Number of companies per sector:
sector
Automotive        1
Energy            2
Financials        1
Food              1
Healthcare        1
Materials         1
Retail            1
Technology        2
Transportation    2
dtype: int64

The .size() method counts how many rows are in each group.


Multiple Statistics with .agg()

Calculate multiple statistics for each group using .agg().

Multiple Stats on Single Column

# Multiple statistics on revenues by country
revenue_stats_by_country = companies.groupby('country')['revenues'].agg(['mean', 'sum', 'count'])

print("Revenue statistics by country:")
print(revenue_stats_by_country)

Output:

Revenue statistics by country:
              mean     sum  count
country
Germany  169000.0  507000      3
UK       158250.0  633000      4
USA      125400.0  627000      5

Each row shows mean, sum, and count for that country’s revenues.

Different Stats for Different Columns

# Different statistics for different columns
summary = companies.groupby('sector').agg({
    'revenues': 'sum',
    'profits': 'mean',
    'employees': ['sum', 'count']
})

print("Sector summary:")
print(summary)

Output:

           revenues  profits employees
                sum     mean       sum count
sector
Automotive   156000  -3000.0      2300     1
Energy       320000  21000.0      6200     2
Financials   234000  45000.0      5600     1
...

This creates a multi-level column structure with different aggregations per column.

Flattening Column Names

# Flatten column names for easier access
summary_flat = companies.groupby('sector').agg({
    'revenues': 'sum',
    'profits': 'mean',
    'employees': 'sum'
})

# Rename columns for clarity
summary_flat.columns = ['total_revenue', 'avg_profit', 'total_employees']

print("Flattened summary:")
print(summary_flat)

Output:

Flattened summary:
                total_revenue  avg_profit  total_employees
sector
Automotive             156000     -3000.0             2300
Energy                 320000     21000.0             6200
Financials             234000     45000.0             5600
Food                    89000      8500.0              890
Healthcare              98000     15000.0             2100
...

Now each column has a single meaningful name.


Common Aggregation Functions

Pandas provides many built-in aggregation functions:

Basic Statistics

FunctionDescriptionExample
'sum'TotalTotal revenues per sector
'mean'AverageAverage profit per country
'median'Middle valueMedian employees per sector
'count'Number of valuesCompanies per country
'size'Group size (includes NaN)Total rows per group
'min'Smallest valueSmallest revenue in each sector
'max'Largest valueLargest profit in each country
'std'Standard deviationRevenue variability per sector
'var'VarianceProfit variance per country
'first'First valueFirst company alphabetically
'last'Last valueLast company alphabetically

Example Using Multiple Functions

# Comprehensive statistics
detailed_stats = companies.groupby('country').agg({
    'revenues': ['mean', 'min', 'max'],
    'profits': ['sum', 'mean'],
    'employees': 'sum'
})

print("Detailed statistics by country:")
print(detailed_stats)

Output:

        revenues                  profits            employees
            mean    min    max       sum     mean        sum
country
Germany   169000 142000 187000     63000  21000.0      15100
UK        158250 112000 234000     99800  24950.0      24700
USA       125400  89000 156000     61300  12260.0      13190

Custom Aggregation Functions

Define custom functions for specialized aggregations.

Using Lambda Functions

# Custom aggregation with lambda
custom_agg = companies.groupby('sector')['profits'].agg([
    'sum',
    'mean',
    ('profit_range', lambda x: x.max() - x.min())
])

print("Custom aggregations:")
print(custom_agg)

Output shows sum, mean, and the range (max - min) of profits for each sector.

Named Custom Functions

# Define custom function
def coefficient_of_variation(series):
    """Calculate coefficient of variation (std / mean)"""
    return series.std() / series.mean() if series.mean() != 0 else 0

# Apply custom function
cv_by_sector = companies.groupby('sector')['revenues'].agg([
    'mean',
    'std',
    ('cv', coefficient_of_variation)
])

print("Revenue variability by sector:")
print(cv_by_sector.round(2))

This calculates how variable revenues are within each sector.


Grouping by Multiple Columns

Group by more than one column for deeper analysis.

Two-Level Grouping

# Group by country AND sector
country_sector_avg = companies.groupby(['country', 'sector'])['revenues'].mean()

print("Average revenue by country and sector:")
print(country_sector_avg)

Output:

country  sector
Germany  Energy           160000.0
         Transportation   187000.0
UK       Financials       234000.0
         Retail           112000.0
         Technology       165000.0
         Transportation   134000.0
USA      Automotive       156000.0
         Food              89000.0
         Healthcare        98000.0
         Materials        145000.0
         Technology       125000.0
Name: revenues, dtype: float64

This creates a hierarchical index (MultiIndex) with country as the first level and sector as the second.

Multi-Column Aggregation

# Multiple columns with multiple groups
country_sector_df = companies.groupby(['country', 'sector']).agg({
    'revenues': 'mean',
    'profits': 'sum',
    'company': 'count'
}).rename(columns={'company': 'num_companies'})

print("Country-Sector breakdown:")
print(country_sector_df)

Output:

                        revenues  profits  num_companies
country sector
Germany Energy           160000.0    42000              2
        Transportation   187000.0    21000              1
UK      Financials       234000.0    45000              1
        Retail           112000.0     9800              1
        Technology       165000.0    28000              1
        Transportation   134000.0    16000              1
USA     Automotive       156000.0    -3000              1
        Food              89000.0     8500              1
        Healthcare        98000.0    15000              1
        Materials        145000.0    18000              1
        Technology       125000.0    12000              1

Resetting Index

# Reset index to make it a regular DataFrame
country_sector_flat = country_sector_df.reset_index()

print("Flattened multi-group result:")
print(country_sector_flat.head())

Output:

  country         sector  revenues  profits  num_companies
0  Germany         Energy  160000.0    42000              2
1  Germany  Transportation  187000.0    21000              1
2       UK     Financials  234000.0    45000              1
3       UK         Retail  112000.0     9800              1
4       UK     Technology  165000.0    28000              1

Now it’s a regular DataFrame without the hierarchical index.


Finding Top Performers

Identify the best in each category using GroupBy.

Top Company in Each Sector

# Find company with highest revenue in each sector
top_companies = (companies.sort_values(['sector', 'revenues'], ascending=[True, False])
                 .groupby('sector')
                 .first()
                 [['company', 'revenues']])

print("Top revenue company per sector:")
print(top_companies)

Output shows the highest-revenue company in each sector.

Best Sector in Each Country

# Find best performing sector in each country
best_sectors = (companies.groupby(['country', 'sector'])['profits']
                .sum()
                .reset_index()
                .sort_values(['country', 'profits'], ascending=[True, False])
                .groupby('country')
                .first()
                .reset_index())

print("Best performing sector per country:")
print(best_sectors)

This identifies which sector has the highest total profits in each country.


Practice Exercises

Apply GroupBy operations with these exercises.

Exercise 1: Basic GroupBy

  1. Find the average number of employees per sector
  2. Find the total revenues per country
  3. Count how many companies are in each country
# Your code here

Exercise 2: Using .agg()

Group by sector and calculate:

  1. Sum of revenues
  2. Mean of profits
  3. Min, max, and mean of employees

Display all in one DataFrame.

# Your code here

Hint

Use a dictionary in .agg() to specify different aggregations for different columns!

Exercise 3: Multiple Groups

  1. Group by country and sector, calculate average revenues
  2. Find which country-sector combination has the highest total profits
  3. Count companies in each country-sector combination
# Your code here

Exercise 4: Challenge - Comprehensive Analysis

  1. Add a profit_margin column to the original DataFrame (profits / revenues * 100)
  2. Group by country and find:
    • Total number of companies
    • Total revenues
    • Average profit margin
    • Total employees
  3. Sort by total revenues (descending)
  4. Display the top country with all its statistics
# Your code here

Summary

You now master GroupBy and aggregation operations. Let’s review the key concepts.

Key Concepts

Split-Apply-Combine Pattern

  • Split data into groups
  • Apply calculation to each group
  • Combine results into output

Basic GroupBy

  • .groupby('column') creates groups
  • Add aggregation: .mean(), .sum(), .count()
  • Result is a Series or DataFrame

Multiple Aggregations

  • .agg(['func1', 'func2']) for multiple stats
  • .agg({'col1': 'func1', 'col2': 'func2'}) for different columns
  • Custom functions with lambdas or named functions

Multiple Groups

  • .groupby(['col1', 'col2']) for hierarchical grouping
  • Creates MultiIndex result
  • Use .reset_index() to flatten

Syntax Reference

# Basic groupby
df.groupby('col')['value'].mean()          # Single statistic
df.groupby('col').size()                    # Count per group

# Multiple statistics
df.groupby('col')['value'].agg(['mean', 'sum', 'count'])

# Different stats for different columns
df.groupby('col').agg({
    'col1': 'sum',
    'col2': 'mean',
    'col3': ['min', 'max']
})

# Multiple grouping columns
df.groupby(['col1', 'col2'])['value'].mean()

# Reset index to regular DataFrame
df.groupby('col')['value'].mean().reset_index()

# Custom aggregations
df.groupby('col')['value'].agg([
    'sum',
    ('custom_name', lambda x: x.max() - x.min())
])

Common Patterns

# Sales by region
df.groupby('region')['sales'].sum()

# Average score per student
df.groupby('student')['score'].mean()

# Product performance
df.groupby('product').agg({
    'revenue': 'sum',
    'units_sold': 'sum',
    'rating': 'mean'
})

# Top performer per category
df.groupby('category')['sales'].idxmax()

# Multi-level analysis
df.groupby(['year', 'month'])['sales'].sum()

Important Reminders

  • Groups are created automatically based on unique values
  • Missing values (NaN) form their own group (unless excluded)
  • Order matters in multi-column groupby (first column is primary)
  • .size() vs .count(): size includes NaN, count doesn’t
  • Reset index to convert result back to regular DataFrame
  • Column names can become complex with multiple aggregations

Next Steps

You can now group data and calculate category-based statistics. In the next lesson, you will learn Pivot Tables—a powerful way to reshape and summarize data in a spreadsheet-like format.

Continue to Lesson 16 - Pivot Tables

Learn to create Excel-style pivot tables for data summarization

Back to Lesson 14 - String Operations

Review text data manipulation and cleaning


Master Category-Based Analysis

GroupBy is the most powerful pandas feature for understanding data patterns across categories. Every question about “average per group” or “total by category” uses GroupBy.

Use GroupBy to unlock insights hidden in categorical data!