Lesson 15 - GroupBy and Aggregation
On this page
- Category-Based Analysis
- Understanding the Split-Apply-Combine Pattern
- Sample Data
- Basic GroupBy Operations
- Multiple Statistics with .agg()
- Common Aggregation Functions
- Custom Aggregation Functions
- Grouping by Multiple Columns
- Finding Top Performers
- Practice Exercises
- Summary
- Next Steps
- Master Category-Based Analysis
Category-Based Analysis
You can filter, sort, and transform data. Now you will learn GroupBy—the most powerful pandas feature for analyzing data by categories. Group by country, sector, or any category, then calculate statistics for each group.
By the end of this lesson, you will be able to:
- Group data by one or more columns
- Calculate statistics for each group (mean, sum, count, etc.)
- Use
.agg()for multiple aggregations - Understand the split-apply-combine pattern
- Find top performers within categories
- Create complex multi-level aggregations
GroupBy operations answer questions like “What is the average revenue per sector?” or “Which country has the highest total profits?” Let’s master this essential skill.
Understanding the Split-Apply-Combine Pattern
GroupBy operations follow a three-step pattern:
- Split - Divide data into groups based on category values
- Apply - Calculate a statistic for each group independently
- Combine - Merge the results back into a single structure
This pattern is fundamental to category-based analysis in data science.
Visual Representation
Original DataFrame:
company sector revenues
TechCorp Global Technology 125000
FreshMart Inc Food 89000
AutoDrive Motors Automotive 156000
FirstBank Holdings Financials 234000
PowerGen Energy Energy 178000
↓ SPLIT by sector ↓
Technology: [125000]
Food: [89000]
Automotive: [156000]
Financials: [234000]
Energy: [178000]
↓ APPLY mean() ↓
Technology: 125000
Food: 89000
Automotive: 156000
Financials: 234000
Energy: 178000
↓ COMBINE ↓
sector
Technology 125000
Food 89000
Automotive 156000
Financials 234000
Energy 178000Sample Data
Let’s create an expanded dataset for demonstration:
import pandas as pd
import numpy as np
# Create companies dataset
companies = pd.DataFrame({
'company': ['TechCorp Global', 'FreshMart Inc', 'AutoDrive Motors', 'FirstBank Holdings', 'PowerGen Energy',
'MediPharm Solutions', 'RetailHub Ltd', 'SkyWings Airlines', 'SteelCore Industries', 'NetLink Telecom',
'LogiTrans Services', 'SolarNova Energy'],
'sector': ['Technology', 'Food', 'Automotive', 'Financials', 'Energy',
'Healthcare', 'Retail', 'Transportation', 'Materials', 'Technology',
'Transportation', 'Energy'],
'revenues': [125000, 89000, 156000, 234000, 178000,
98000, 112000, 187000, 145000, 165000,
134000, 142000],
'profits': [12000, 8500, -3000, 45000, 23000,
15000, 9800, 21000, 18000, 28000,
16000, 19000],
'employees': [1200, 890, 2300, 5600, 3400,
2100, 4500, 8900, 3200, 6700,
5400, 2800],
'country': ['USA', 'USA', 'USA', 'UK', 'Germany',
'USA', 'UK', 'Germany', 'USA', 'UK',
'UK', 'Germany']
})
print(f"Dataset: {len(companies)} companies")
companiesThis dataset has 12 companies across multiple sectors and countries—perfect for demonstrating GroupBy operations.
Basic GroupBy Operations
Group data by one column and calculate a single statistic.
Grouping by Single Column
# Group by country and calculate mean revenues
avg_revenue_by_country = companies.groupby('country')['revenues'].mean()
print("Average revenue by country:")
print(avg_revenue_by_country)Output:
Average revenue by country:
country
Germany 169000.0
UK 158250.0
USA 125400.0
Name: revenues, dtype: float64Each country gets one value—the average of all revenues in that group.
Different Aggregations
# Sum of profits by sector
total_profit_by_sector = companies.groupby('sector')['profits'].sum()
print("Total profits by sector:")
print(total_profit_by_sector.sort_values(ascending=False))Output:
Total profits by sector:
sector
Financials 45000
Energy 42000
Transportation 37000
Technology 40000
...
Name: profits, dtype: int64Counting Group Sizes
# Count companies per sector
companies_per_sector = companies.groupby('sector').size()
print("Number of companies per sector:")
print(companies_per_sector)Output:
Number of companies per sector:
sector
Automotive 1
Energy 2
Financials 1
Food 1
Healthcare 1
Materials 1
Retail 1
Technology 2
Transportation 2
dtype: int64The .size() method counts how many rows are in each group.
Multiple Statistics with .agg()
Calculate multiple statistics for each group using .agg().
Multiple Stats on Single Column
# Multiple statistics on revenues by country
revenue_stats_by_country = companies.groupby('country')['revenues'].agg(['mean', 'sum', 'count'])
print("Revenue statistics by country:")
print(revenue_stats_by_country)Output:
Revenue statistics by country:
mean sum count
country
Germany 169000.0 507000 3
UK 158250.0 633000 4
USA 125400.0 627000 5Each row shows mean, sum, and count for that country’s revenues.
Different Stats for Different Columns
# Different statistics for different columns
summary = companies.groupby('sector').agg({
'revenues': 'sum',
'profits': 'mean',
'employees': ['sum', 'count']
})
print("Sector summary:")
print(summary)Output:
revenues profits employees
sum mean sum count
sector
Automotive 156000 -3000.0 2300 1
Energy 320000 21000.0 6200 2
Financials 234000 45000.0 5600 1
...This creates a multi-level column structure with different aggregations per column.
Flattening Column Names
# Flatten column names for easier access
summary_flat = companies.groupby('sector').agg({
'revenues': 'sum',
'profits': 'mean',
'employees': 'sum'
})
# Rename columns for clarity
summary_flat.columns = ['total_revenue', 'avg_profit', 'total_employees']
print("Flattened summary:")
print(summary_flat)Output:
Flattened summary:
total_revenue avg_profit total_employees
sector
Automotive 156000 -3000.0 2300
Energy 320000 21000.0 6200
Financials 234000 45000.0 5600
Food 89000 8500.0 890
Healthcare 98000 15000.0 2100
...Now each column has a single meaningful name.
Common Aggregation Functions
Pandas provides many built-in aggregation functions:
Basic Statistics
| Function | Description | Example |
|---|---|---|
'sum' | Total | Total revenues per sector |
'mean' | Average | Average profit per country |
'median' | Middle value | Median employees per sector |
'count' | Number of values | Companies per country |
'size' | Group size (includes NaN) | Total rows per group |
'min' | Smallest value | Smallest revenue in each sector |
'max' | Largest value | Largest profit in each country |
'std' | Standard deviation | Revenue variability per sector |
'var' | Variance | Profit variance per country |
'first' | First value | First company alphabetically |
'last' | Last value | Last company alphabetically |
Example Using Multiple Functions
# Comprehensive statistics
detailed_stats = companies.groupby('country').agg({
'revenues': ['mean', 'min', 'max'],
'profits': ['sum', 'mean'],
'employees': 'sum'
})
print("Detailed statistics by country:")
print(detailed_stats)Output:
revenues profits employees
mean min max sum mean sum
country
Germany 169000 142000 187000 63000 21000.0 15100
UK 158250 112000 234000 99800 24950.0 24700
USA 125400 89000 156000 61300 12260.0 13190Custom Aggregation Functions
Define custom functions for specialized aggregations.
Using Lambda Functions
# Custom aggregation with lambda
custom_agg = companies.groupby('sector')['profits'].agg([
'sum',
'mean',
('profit_range', lambda x: x.max() - x.min())
])
print("Custom aggregations:")
print(custom_agg)Output shows sum, mean, and the range (max - min) of profits for each sector.
Named Custom Functions
# Define custom function
def coefficient_of_variation(series):
"""Calculate coefficient of variation (std / mean)"""
return series.std() / series.mean() if series.mean() != 0 else 0
# Apply custom function
cv_by_sector = companies.groupby('sector')['revenues'].agg([
'mean',
'std',
('cv', coefficient_of_variation)
])
print("Revenue variability by sector:")
print(cv_by_sector.round(2))This calculates how variable revenues are within each sector.
Grouping by Multiple Columns
Group by more than one column for deeper analysis.
Two-Level Grouping
# Group by country AND sector
country_sector_avg = companies.groupby(['country', 'sector'])['revenues'].mean()
print("Average revenue by country and sector:")
print(country_sector_avg)Output:
country sector
Germany Energy 160000.0
Transportation 187000.0
UK Financials 234000.0
Retail 112000.0
Technology 165000.0
Transportation 134000.0
USA Automotive 156000.0
Food 89000.0
Healthcare 98000.0
Materials 145000.0
Technology 125000.0
Name: revenues, dtype: float64This creates a hierarchical index (MultiIndex) with country as the first level and sector as the second.
Multi-Column Aggregation
# Multiple columns with multiple groups
country_sector_df = companies.groupby(['country', 'sector']).agg({
'revenues': 'mean',
'profits': 'sum',
'company': 'count'
}).rename(columns={'company': 'num_companies'})
print("Country-Sector breakdown:")
print(country_sector_df)Output:
revenues profits num_companies
country sector
Germany Energy 160000.0 42000 2
Transportation 187000.0 21000 1
UK Financials 234000.0 45000 1
Retail 112000.0 9800 1
Technology 165000.0 28000 1
Transportation 134000.0 16000 1
USA Automotive 156000.0 -3000 1
Food 89000.0 8500 1
Healthcare 98000.0 15000 1
Materials 145000.0 18000 1
Technology 125000.0 12000 1Resetting Index
# Reset index to make it a regular DataFrame
country_sector_flat = country_sector_df.reset_index()
print("Flattened multi-group result:")
print(country_sector_flat.head())Output:
country sector revenues profits num_companies
0 Germany Energy 160000.0 42000 2
1 Germany Transportation 187000.0 21000 1
2 UK Financials 234000.0 45000 1
3 UK Retail 112000.0 9800 1
4 UK Technology 165000.0 28000 1Now it’s a regular DataFrame without the hierarchical index.
Finding Top Performers
Identify the best in each category using GroupBy.
Top Company in Each Sector
# Find company with highest revenue in each sector
top_companies = (companies.sort_values(['sector', 'revenues'], ascending=[True, False])
.groupby('sector')
.first()
[['company', 'revenues']])
print("Top revenue company per sector:")
print(top_companies)Output shows the highest-revenue company in each sector.
Best Sector in Each Country
# Find best performing sector in each country
best_sectors = (companies.groupby(['country', 'sector'])['profits']
.sum()
.reset_index()
.sort_values(['country', 'profits'], ascending=[True, False])
.groupby('country')
.first()
.reset_index())
print("Best performing sector per country:")
print(best_sectors)This identifies which sector has the highest total profits in each country.
Practice Exercises
Apply GroupBy operations with these exercises.
Exercise 1: Basic GroupBy
- Find the average number of employees per sector
- Find the total revenues per country
- Count how many companies are in each country
# Your code hereExercise 2: Using .agg()
Group by sector and calculate:
- Sum of revenues
- Mean of profits
- Min, max, and mean of employees
Display all in one DataFrame.
# Your code hereHint
Use a dictionary in .agg() to specify different aggregations for different columns!
Exercise 3: Multiple Groups
- Group by country and sector, calculate average revenues
- Find which country-sector combination has the highest total profits
- Count companies in each country-sector combination
# Your code hereExercise 4: Challenge - Comprehensive Analysis
- Add a
profit_margincolumn to the original DataFrame (profits / revenues * 100) - Group by country and find:
- Total number of companies
- Total revenues
- Average profit margin
- Total employees
- Sort by total revenues (descending)
- Display the top country with all its statistics
# Your code hereSummary
You now master GroupBy and aggregation operations. Let’s review the key concepts.
Key Concepts
Split-Apply-Combine Pattern
- Split data into groups
- Apply calculation to each group
- Combine results into output
Basic GroupBy
.groupby('column')creates groups- Add aggregation:
.mean(),.sum(),.count() - Result is a Series or DataFrame
Multiple Aggregations
.agg(['func1', 'func2'])for multiple stats.agg({'col1': 'func1', 'col2': 'func2'})for different columns- Custom functions with lambdas or named functions
Multiple Groups
.groupby(['col1', 'col2'])for hierarchical grouping- Creates MultiIndex result
- Use
.reset_index()to flatten
Syntax Reference
# Basic groupby
df.groupby('col')['value'].mean() # Single statistic
df.groupby('col').size() # Count per group
# Multiple statistics
df.groupby('col')['value'].agg(['mean', 'sum', 'count'])
# Different stats for different columns
df.groupby('col').agg({
'col1': 'sum',
'col2': 'mean',
'col3': ['min', 'max']
})
# Multiple grouping columns
df.groupby(['col1', 'col2'])['value'].mean()
# Reset index to regular DataFrame
df.groupby('col')['value'].mean().reset_index()
# Custom aggregations
df.groupby('col')['value'].agg([
'sum',
('custom_name', lambda x: x.max() - x.min())
])Common Patterns
# Sales by region
df.groupby('region')['sales'].sum()
# Average score per student
df.groupby('student')['score'].mean()
# Product performance
df.groupby('product').agg({
'revenue': 'sum',
'units_sold': 'sum',
'rating': 'mean'
})
# Top performer per category
df.groupby('category')['sales'].idxmax()
# Multi-level analysis
df.groupby(['year', 'month'])['sales'].sum()Important Reminders
- Groups are created automatically based on unique values
- Missing values (NaN) form their own group (unless excluded)
- Order matters in multi-column groupby (first column is primary)
.size()vs.count(): size includes NaN, count doesn’t- Reset index to convert result back to regular DataFrame
- Column names can become complex with multiple aggregations
Next Steps
You can now group data and calculate category-based statistics. In the next lesson, you will learn Pivot Tables—a powerful way to reshape and summarize data in a spreadsheet-like format.
Continue to Lesson 16 - Pivot Tables
Learn to create Excel-style pivot tables for data summarization
Back to Lesson 14 - String Operations
Review text data manipulation and cleaning
Master Category-Based Analysis
GroupBy is the most powerful pandas feature for understanding data patterns across categories. Every question about “average per group” or “total by category” uses GroupBy.
Use GroupBy to unlock insights hidden in categorical data!