Lesson 5 - Comparing Distributions
Welcome to Comparing Distributions
Most real questions are comparisons. Does this group earn more than that one? Is one machine producing heavier parts than another? You rarely care about a single distribution in isolation; you want to lay two or three side by side and see how they differ. Doing that fairly is trickier than it looks, because the groups you compare are almost never the same size.
In this lesson you will compare a numeric measurement across the three penguin species, build summaries and charts that put the groups next to each other, and learn the one habit that keeps unequal group sizes from quietly lying to you.
By the end of this lesson, you will be able to:
- Summarize a numeric variable across groups with groupby and describe
- Explain why relative frequency, not raw counts, is the honest way to compare groups of different sizes
- Read overlaid histograms and frequency polygons to see how distributions overlap or separate
- Use boxplots to compare medians, spread, and outliers across many groups at once
You only need a little Python and pandas. Let’s begin.
Summarizing a Variable Across Groups
The same penguin dataset returns, with body measurements for three species near Palmer Station, Antarctica. The species do not appear in equal numbers, and that fact drives the whole lesson:
import pandas as pd
penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins["species"].value_counts())

species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

There are 152 Adelie but only 68 Chinstrap, more than twice as many of the first group. Keep that ratio in mind; it is the trap we will spring later.
The fastest way to compare a numeric variable across groups is to split it by the grouping column and describe each piece at once. Here is flipper_length_mm summarized per species:
summary = penguins.groupby("species")["flipper_length_mm"].describe()
print(summary[["count", "mean", "std", "min", "max"]].round(1))

           count   mean  std    min    max
species
Adelie     151.0  190.0  6.5  172.0  210.0
Chinstrap   68.0  195.8  7.1  178.0  212.0
Gentoo     123.0  217.2  6.5  203.0  231.0

One call gives a side-by-side report. Read across the mean column: Adelie flippers average 190.0 mm, Chinstrap 195.8 mm, and Gentoo 217.2 mm. Gentoo birds clearly have the longest flippers, by a wide margin. The count column also reminds you that two measurements are missing (151 and 123 instead of 152 and 124), which describe skips automatically.
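The gap between the count column and the species totals can be checked directly. The following is a minimal sketch, not part of the original lesson: the helper name missing_per_group and the toy table are illustrative assumptions, and the toy values are made up and are not penguin measurements.

```python
import numpy as np
import pandas as pd


def missing_per_group(df, group_col, value_col):
    """Count rows, non-missing values, and missing values of value_col per group."""
    grouped = df.groupby(group_col)[value_col]
    out = pd.DataFrame({
        "rows": grouped.size(),          # size() counts every row, NaN included
        "non_missing": grouped.count(),  # count() skips NaN, as describe() does
    })
    out["missing"] = out["rows"] - out["non_missing"]
    return out


# Toy data for illustration only (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B"],
    "flipper_length_mm": [190.0, np.nan, 185.0, 210.0, 215.0],
})
print(missing_per_group(toy, "species", "flipper_length_mm"))
```

Calling missing_per_group(penguins, "species", "flipper_length_mm") on the lesson's DataFrame should show which species the two missing measurements belong to.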
When you only want a couple of statistics, agg is cleaner than a full describe:
print(penguins.groupby("species")["body_mass_g"].agg(["mean", "median"]).round(1))

             mean  median
species
Adelie     3700.7  3700.0
Chinstrap  3733.1  3700.0
Gentoo     5076.0  5000.0

Adelie and Chinstrap are nearly twins in body mass (means of 3700.7 g and 3733.1 g, identical medians of 3700 g), while Gentoo tower over both at 5076.0 g. Numbers like these are the destination, but a table cannot show you shape: where each group clusters, how much it spreads, or where the groups overlap. For that we need pictures.
Why Raw Counts Mislead
The natural first chart is a histogram of each group’s flipper lengths, drawn on the same axes so you can compare them. Let’s do exactly that, with raw counts on the vertical axis:
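One way such a chart could be drawn is sketched here. This is an assumption about the approach, not the lesson's own plotting code: the helper names, the choice of 12 bins, and the toy values are all illustrative, and the plotting function assumes matplotlib is installed. The key detail is that every group is counted on the same bin edges.

```python
import numpy as np
import pandas as pd


def shared_bin_counts(df, group_col, value_col, bins=12):
    """Raw counts per group, computed on one shared set of bin edges."""
    # Edges come from the whole column so every group uses identical bins.
    edges = np.histogram_bin_edges(df[value_col].dropna(), bins=bins)
    counts = {
        name: np.histogram(part[value_col].dropna(), bins=edges)[0]
        for name, part in df.groupby(group_col)
    }
    table = pd.DataFrame(counts, index=pd.Index(edges[:-1], name="bin_start"))
    return edges, table


def plot_overlaid_counts(df, group_col, value_col, bins=12):
    """Overlay one semi-transparent raw-count histogram per group."""
    # Imported here so the counting helper works without matplotlib installed.
    import matplotlib.pyplot as plt

    edges, _ = shared_bin_counts(df, group_col, value_col, bins)
    fig, ax = plt.subplots()
    for name, part in df.groupby(group_col):
        ax.hist(part[value_col].dropna(), bins=edges, alpha=0.5, label=name)
    ax.set_xlabel(value_col)
    ax.set_ylabel("Count")
    ax.legend()
    return ax


# Toy data for illustration only (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "species": ["A", "A", "A", "A", "B", "B"],
    "flipper_length_mm": [180.0, 185.0, 190.0, 195.0, 200.0, 210.0],
})
edges, counts = shared_bin_counts(toy, "species", "flipper_length_mm", bins=3)
print(counts)
```

With the lesson's data, plot_overlaid_counts(penguins, "species", "flipper_length_mm") would produce the overlaid raw-count view.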
The picture is genuinely useful — you can already see Gentoo separating off to the right — but the heights are not comparable. The Adelie bars reach higher than the Chinstrap bars in part for a boring reason: there are 152 Adelie and only 68 Chinstrap, so Adelie has more than twice as many penguins to stack into every bar. Taller bars do not mean Adelie flippers are “more concentrated” than Chinstrap flippers; they mostly mean the Adelie group is bigger.
We can make the problem concrete. Count how many penguins of each species fall in the 188–192 mm range:
ade = penguins[penguins["species"] == "Adelie"]["flipper_length_mm"].dropna()
chi = penguins[penguins["species"] == "Chinstrap"]["flipper_length_mm"].dropna()
print("Adelie count in 188-192:", ((ade >= 188) & (ade <= 192)).sum())
print("Chinstrap count in 188-192:", ((chi >= 188) & (chi <= 192)).sum())

Adelie count in 188-192: 45
Chinstrap count in 188-192: 10

Raw counts say Adelie wins this range 45 to 10, more than four to one. But that comparison is rigged, because there are far more Adelie penguins overall. The fair question is what fraction of each species lands there:

print("Adelie proportion:", round(((ade >= 188) & (ade <= 192)).mean(), 3))
print("Chinstrap proportion:", round(((chi >= 188) & (chi <= 192)).mean(), 3))

Adelie proportion: 0.298
Chinstrap proportion: 0.147

Now the gap is two to one (29.8% vs 14.7%), not four to one. The shape of the difference survives, but its size was inflated by the count comparison. Relative frequency, the proportion of a group that falls in each bin rather than the raw number, is the fix. It rescales every group to sum to 1, so a small group and a large group can be compared as if they were the same size.
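The same rescaling can be applied to every bin at once. The sketch below is one possible way to do it with pd.cut and pd.crosstab; the helper name relative_frequency_table, the bin edges, and the toy values are assumptions made for illustration. The two toy groups have the same shape but different sizes, so their relative frequencies should come out identical even though their raw counts do not.

```python
import numpy as np
import pandas as pd


def relative_frequency_table(df, group_col, value_col, edges):
    """Proportion of each group that falls in each bin (every column sums to 1)."""
    # pd.cut assigns each value to a right-closed interval such as (1, 2].
    binned = pd.cut(df[value_col], bins=edges)
    # normalize="columns" divides each group's counts by that group's total.
    return pd.crosstab(binned, df[group_col], normalize="columns")


# Toy data for illustration only: "big" is "small" with every value repeated.
toy = pd.DataFrame({
    "species": ["small"] * 4 + ["big"] * 8,
    "value": [1, 2, 2, 3] + [1, 1, 2, 2, 2, 2, 3, 3],
})
table = relative_frequency_table(toy, "species", "value", edges=[0, 1, 2, 3])
print(table)
```

Values outside the supplied edges, and missing values, are left out of the table, so the edges should cover the full range of the column being compared.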
Unequal groups are the rule, not the exception
Real groups are almost never the same size: more of one customer segment, more of one product, more of one region. Whenever you overlay or stack distributions of different-sized groups using raw counts, the bigger group tends to look taller simply because it has more members, and it is easy to mistake “more numerous” for “more concentrated.” Switch to relative frequency, density, or proportion before you compare.
Frequency Polygons and Relative Frequency
Overlaid histograms get cluttered fast — bars hide each other, and three colors fighting for the same bins is hard to read. A frequency polygon solves both problems. Instead of bars, you plot a single point at the top of each bin and connect the points with lines, giving each group one clean curve. Pair that with relative frequency on the vertical axis and the comparison becomes honest and legible:
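A frequency polygon of this kind could be built as sketched below. This is an assumed implementation rather than the lesson's own: the helper names, the 12-bin default, and the toy values are illustrative, and the plotting function assumes matplotlib is installed. Each group contributes one point per bin, placed at the bin midpoint, with height equal to the group's relative frequency in that bin.

```python
import numpy as np
import pandas as pd


def polygon_points(values, edges):
    """Bin midpoints and relative frequencies for one group's frequency polygon."""
    edges = np.asarray(edges, dtype=float)
    counts, _ = np.histogram(pd.Series(values).dropna(), bins=edges)
    midpoints = (edges[:-1] + edges[1:]) / 2
    # Guard against an empty group so the division never hits zero.
    heights = counts / max(counts.sum(), 1)
    return midpoints, heights


def plot_frequency_polygons(df, group_col, value_col, bins=12):
    """Draw one relative-frequency line per group on shared bins."""
    # Imported here so polygon_points works without matplotlib installed.
    import matplotlib.pyplot as plt

    edges = np.histogram_bin_edges(df[value_col].dropna(), bins=bins)
    fig, ax = plt.subplots()
    for name, part in df.groupby(group_col):
        x, y = polygon_points(part[value_col], edges)
        ax.plot(x, y, marker="o", label=name)
    ax.set_xlabel(value_col)
    ax.set_ylabel("Relative frequency")
    ax.legend()
    return ax


# Toy values for illustration only (hypothetical, not the real dataset).
mids, heights = polygon_points([1, 2, 2, 3], edges=[0.5, 1.5, 2.5, 3.5])
print(mids, heights)
```

With the lesson's data, plot_frequency_polygons(penguins, "species", "flipper_length_mm") would give one curve per species.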
Two things jump out now that the size distortion is gone. First, the Adelie and Chinstrap curves overlap heavily — their flipper lengths live in the same neighborhood, with Chinstrap shifted only slightly right. Second, the Gentoo curve is an island, well separated from the other two. That separation is the key fact for any system that wants to tell the species apart by flipper length alone.
The separation is so clean it is nearly perfect. Compare the lowest Gentoo flipper to the highest of the other two species:
print("Gentoo min flipper:", penguins[penguins["species"] == "Gentoo"]["flipper_length_mm"].min())
print("Adelie max flipper:", ade.max())
print("Chinstrap max flipper:", chi.max())

Gentoo min flipper: 203.0
Adelie max flipper: 210.0
Chinstrap max flipper: 212.0

The shortest Gentoo flipper is 203 mm, while almost every Adelie and Chinstrap sits below 200 mm; only a handful of the largest ones reach into Gentoo territory at all. A threshold around 206 mm would split Gentoo from the rest with very few mistakes. You could never have read that off the raw-count histogram with any confidence; the relative-frequency view makes it obvious.
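How few mistakes a cut near 206 mm makes can be counted rather than eyeballed. The sketch below is an illustrative assumption, not the lesson's method: the helper name threshold_mistakes, the "at or above the threshold means Gentoo" rule, and the toy values are all made up for the example, and the lesson does not report the actual error count, so none is claimed here.

```python
import pandas as pd


def threshold_mistakes(df, threshold, target="Gentoo",
                       group_col="species", value_col="flipper_length_mm"):
    """Count errors for the rule: value >= threshold is labeled as target."""
    data = df[[group_col, value_col]].dropna()
    predicted_target = data[value_col] >= threshold
    is_target = data[group_col] == target
    return {
        # Target birds that fall below the cut.
        "missed": int((is_target & ~predicted_target).sum()),
        # Other species that land at or above the cut.
        "false_alarms": int((~is_target & predicted_target).sum()),
        "total": len(data),
    }


# Toy data for illustration only (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap"],
    "flipper_length_mm": [205.0, 215.0, 220.0, 190.0, 208.0, 195.0],
})
print(threshold_mistakes(toy, threshold=206))
```

Running threshold_mistakes(penguins, 206) on the lesson's DataFrame, and trying a few nearby thresholds, would show how the two kinds of error trade off.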
Boxplots: A Compact Comparison
Frequency polygons show full shape, but when you have many groups — or just want a quick read on center, spread, and oddballs — a boxplot is the most compact comparison there is. Each box summarizes one group with five numbers: the median line, the box edges at the first and third quartiles (the interquartile range, or IQR), the whiskers reaching out to the typical extremes, and individual points for outliers beyond them.
Here is body mass compared across the three species:
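A sketch of how the boxes could be computed and drawn follows. It is an assumed implementation: the helper names and toy values are illustrative, the 1.5 IQR whisker rule is the common convention the lesson also uses for outliers, and the plotting function assumes matplotlib is installed.

```python
import numpy as np
import pandas as pd


def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    values = pd.Series(values).dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers stop at the most extreme values still inside the fences.
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": sorted(outliers.tolist()),
    }


def plot_boxplots(df, group_col, value_col):
    """Draw one box per group, side by side."""
    # Imported here so box_stats works without matplotlib installed.
    import matplotlib.pyplot as plt

    groups = {name: part[value_col].dropna() for name, part in df.groupby(group_col)}
    fig, ax = plt.subplots()
    ax.boxplot(list(groups.values()))
    # Boxes are drawn at positions 1..n by default, so label those ticks.
    ax.set_xticks(range(1, len(groups) + 1))
    ax.set_xticklabels(list(groups.keys()))
    ax.set_ylabel(value_col)
    return ax


# Toy values for illustration only (hypothetical, not the real dataset).
stats = box_stats([1, 2, 3, 4, 5, 100])
print(stats)
```

With the lesson's data, plot_boxplots(penguins, "species", "body_mass_g") would draw the three boxes next to each other.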
One small chart carries the whole story. The Adelie and Chinstrap boxes sit at nearly the same low height and overlap heavily — consistent with their almost-identical medians of 3700 g. The Gentoo box floats far above both, centered near 5000 g, and its box is taller, meaning Gentoo body mass is also more spread out. You can confirm the spread directly with the IQR:
for sp in ["Adelie", "Chinstrap", "Gentoo"]:
    mass = penguins[penguins["species"] == sp]["body_mass_g"].dropna()
    q1, q3 = mass.quantile([0.25, 0.75])
    print(sp, "IQR:", q3 - q1)

Adelie IQR: 650.0
Chinstrap IQR: 462.5
Gentoo IQR: 800.0

Gentoo’s middle 50% spans 800 g, against just 462.5 g for Chinstrap, the tightest of the three. The boxplot showed you that wider Gentoo box at a glance; the numbers put a figure on it.
The boxplot also flags outliers automatically — points lying more than 1.5 IQRs beyond a quartile. Notice that only the Chinstrap group shows stray points, one high and one low:
mass = penguins[penguins["species"] == "Chinstrap"]["body_mass_g"].dropna()
q1, q3 = mass.quantile([0.25, 0.75])
iqr = q3 - q1
print(mass[(mass < q1 - 1.5 * iqr) | (mass > q3 + 1.5 * iqr)].tolist())

[4800.0, 2700.0]

One unusually heavy Chinstrap at 4800 g and one unusually light one at 2700 g fall outside the whiskers; Adelie and Gentoo have none. That is the kind of detail a table of means would hide completely, and a boxplot surfaces it without any extra work.
Which chart when?
Use overlaid histograms or frequency polygons when you want to see full shape — bumps, skew, where groups overlap. Use boxplots when you want a compact comparison of center, spread, and outliers across many groups, or when full curves would be too busy. Reach for relative frequency in the first case and you stay honest about unequal group sizes.
Practice Exercises
Exercise 1: Grouped summary of bill length
Build a side-by-side summary of bill_length_mm for the three species using groupby and describe, and report each species’ mean. Which species has the longest bills on average, and which two are closest?
Hint
Use penguins.groupby("species")["bill_length_mm"].describe() and pull the columns you care about with [["count", "mean", "std"]].round(1). Compare the mean column across the three rows.
Exercise 2: Counts vs proportions
Among penguins with body mass above 4500 g, count how many are Gentoo versus Adelie. Then compute, within each species, the proportion above 4500 g. Explain why the raw counts give a misleading impression and the proportions correct it.
Hint
For counts, filter with penguins[penguins["body_mass_g"] > 4500]["species"].value_counts(). For the within-species proportion, use penguins.groupby("species")["body_mass_g"].apply(lambda m: (m > 4500).mean()). The species with more penguins overall can rack up a big raw count even if a small fraction of it qualifies.
Exercise 3: Boxplot reading
Draw boxplots of flipper_length_mm by species (any plotting library), then confirm what you see: print each species’ median and IQR. Does the chart match the near-perfect Gentoo separation you found earlier in the lesson?
Hint
Compute the median with groupby("species")["flipper_length_mm"].median() and the IQR by subtracting the 0.25 quantile from the 0.75 quantile within each group. The Gentoo box should sit almost entirely above the other two.
Summary
You learned to compare a numeric variable across groups without getting fooled. A grouped describe lays the groups side by side in numbers; relative frequency lets you compare distributions fairly even when the groups differ in size, because raw counts tend to make the larger group look taller. Overlaid histograms and frequency polygons reveal full shape, showing that Adelie and Chinstrap overlap while Gentoo separates cleanly, and boxplots pack center, spread, and outliers for many groups into one compact picture.
Key Concepts
- Grouped summary — splitting a variable by a category and describing each piece, e.g. groupby(...).describe().
- Relative frequency — the proportion of a group falling in each bin, used so unequal-sized groups can be compared fairly.
- Frequency polygon — a line connecting the top of each bin, giving each group one clean curve instead of bars.
- Boxplot — a five-number summary (median, quartiles/IQR, whiskers, outliers) drawn as a box, ideal for comparing many groups at once.
- Interquartile range (IQR) — the spread of the middle 50% of a group, from the first to the third quartile.
- Outlier — a value lying more than 1.5 IQRs beyond a quartile, flagged as a point on a boxplot.
Why This Matters
Almost every analysis you will do is a comparison between groups, and almost every set of groups is unequal in size. The instinct to compare raw counts is exactly where honest-looking charts go wrong — the bigger group wins by default. Knowing to switch to relative frequency, and knowing which chart reveals shape versus which gives a compact summary, is what lets you say “these groups really do differ” and have it be true.
Next Steps
Continue to Lesson 6 - Guided Project: Separating Penguin Species
Put grouped summaries, relative frequency, and boxplots to work in a full guided project that tells the species apart from their measurements.
Continue Building Your Skills
You can now put two or three distributions next to each other and read the real differences between them — overlap, separation, spread, and the odd stray value — without letting unequal group sizes deceive you. Next you will bring every idea from this module together in a guided project, using these same tools to tell the penguin species apart from their measurements alone.