Lesson 5 - Comparing Distributions

Welcome to Comparing Distributions

Most real questions are comparisons. Does this group earn more than that one? Is one machine producing heavier parts than another? You rarely care about a single distribution in isolation; you want to lay two or three side by side and see how they differ. Doing that fairly is trickier than it looks, because the groups you compare are almost never the same size.

In this lesson you will compare a numeric measurement across the three penguin species, build summaries and charts that put the groups next to each other, and learn the one habit that keeps unequal group sizes from quietly lying to you.

By the end of this lesson, you will be able to:

  • Summarize a numeric variable across groups with groupby and describe
  • Explain why relative frequency, not raw counts, is the honest way to compare groups of different sizes
  • Read overlaid histograms and frequency polygons to see how distributions overlap or separate
  • Use boxplots to compare medians, spread, and outliers across many groups at once

You only need a little Python and pandas. Let’s begin.


Summarizing a Variable Across Groups

The same penguin dataset returns, with body measurements for three species near Palmer Station, Antarctica. The species do not appear in equal numbers, and that fact drives the whole lesson:

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins["species"].value_counts())
species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

There are 152 Adelie but only 68 Chinstrap — more than twice as many of the first group. Keep that ratio in mind; it is the trap we will spring later.

The fastest way to compare a numeric variable across groups is to split it by the grouping column and describe each piece at once. Here is flipper_length_mm summarized per species:

summary = penguins.groupby("species")["flipper_length_mm"].describe()
print(summary[["count", "mean", "std", "min", "max"]].round(1))
           count   mean  std    min    max
species
Adelie     151.0  190.0  6.5  172.0  210.0
Chinstrap   68.0  195.8  7.1  178.0  212.0
Gentoo     123.0  217.2  6.5  203.0  231.0

One call gives a side-by-side report. Read down the mean column: Adelie flippers average 190.0 mm, Chinstrap 195.8 mm, and Gentoo 217.2 mm. Gentoo birds clearly have the longest flippers, by a wide margin. The count column also shows that two measurements are missing (151 and 123 instead of 152 and 124), because describe skips missing values automatically.
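To see why a count can fall short of the group size, here is a minimal sketch on a tiny made-up table (the name toy and its values are illustrative assumptions, not the penguin data). It contrasts size, which counts rows, with count, which counts only non-missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" has one missing measurement.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, 20.0],
})

grouped = toy.groupby("group")["value"]
rows = grouped.size()          # rows per group, missing values included
non_missing = grouped.count()  # non-missing values only, as describe() reports
missing = rows - non_missing   # how many measurements each group lacks

print(pd.DataFrame({"rows": rows, "non_missing": non_missing, "missing": missing}))
```

Comparing the two columns is a quick way to spot which groups have gaps before you trust their averages.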

When you only want a couple of statistics, agg is cleaner than a full describe:

print(penguins.groupby("species")["body_mass_g"].agg(["mean", "median"]).round(1))
             mean  median
species
Adelie     3700.7  3700.0
Chinstrap  3733.1  3700.0
Gentoo     5076.0  5000.0

Adelie and Chinstrap are nearly twins in body mass — means of 3700.7 g and 3733.1 g, identical medians of 3700 g — while Gentoo tower over both at 5076.0 g. Numbers like these are the destination, but a table cannot show you shape: where each group clusters, how much it spreads, or where the groups overlap. For that we need pictures.


Why Raw Counts Mislead

The natural first chart is a histogram of each group’s flipper lengths, drawn on the same axes so you can compare them. Let’s do exactly that, with raw counts on the vertical axis:

Overlaid histograms of flipper length for the three penguin species, with the Gentoo bars sitting far to the right and the Adelie bars towering over the others because there are more Adelie penguins.
Raw-count histograms: the Adelie bars look dominant, but partly because there are simply more Adelie penguins (152) than Chinstrap (68), not because their distribution is taller in any meaningful sense.

The picture is genuinely useful — you can already see Gentoo separating off to the right — but the heights are not comparable. The Adelie bars reach higher than the Chinstrap bars in part for a boring reason: there are 152 Adelie and only 68 Chinstrap, so Adelie has more than twice as many penguins to stack into every bar. Taller bars do not mean Adelie flippers are “more concentrated” than Chinstrap flippers; they mostly mean the Adelie group is bigger.

We can make the problem concrete. Count how many penguins of each species fall in the 188–192 mm range:

ade = penguins[penguins["species"] == "Adelie"]["flipper_length_mm"].dropna()
chi = penguins[penguins["species"] == "Chinstrap"]["flipper_length_mm"].dropna()

print("Adelie  count in 188-192:", ((ade >= 188) & (ade <= 192)).sum())
print("Chinstrap count in 188-192:", ((chi >= 188) & (chi <= 192)).sum())
Adelie  count in 188-192: 45
Chinstrap count in 188-192: 10

Raw counts say Adelie wins this range 45 to 10 — more than four to one. But that comparison is rigged, because there are far more Adelie penguins overall. The fair question is what fraction of each species lands there:

print("Adelie  proportion:", round(((ade >= 188) & (ade <= 192)).mean(), 3))
print("Chinstrap proportion:", round(((chi >= 188) & (chi <= 192)).mean(), 3))
Adelie  proportion: 0.298
Chinstrap proportion: 0.147

Now the gap is two to one (29.8% vs 14.7%), not four to one. The shape of the difference survives, but its size was inflated by the count comparison. Relative frequency — the proportion of a group that falls in each bin, rather than the raw number — is the fix. It rescales every group to sum to 1, so a small group and a large group can be compared as if they were the same size.
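The same rescaling can be applied to every bin at once. The sketch below is one possible way to do it, using numpy.histogram on two made-up groups (the arrays big and small, the bin edges, and the helper relative_frequency are all hypothetical names for illustration):

```python
import numpy as np

# Hypothetical measurements for a large and a small group (not the penguin data).
big = np.array([188, 189, 190, 190, 191, 192, 195, 196, 199, 201], dtype=float)
small = np.array([190, 194, 196, 199], dtype=float)

# Both groups must share the same bin edges to be comparable.
edges = np.array([185, 190, 195, 200, 205])

def relative_frequency(values, edges):
    """Return the proportion of the binned values that falls in each bin.

    Values outside the edges are ignored by np.histogram, so the
    proportions are relative to the values that landed in some bin.
    """
    counts, _ = np.histogram(values, bins=edges)
    return counts / counts.sum()

big_rel = relative_frequency(big, edges)
small_rel = relative_frequency(small, edges)
print("big  :", big_rel)
print("small:", small_rel)
```

Each result sums to 1, so the two groups can be read on the same vertical scale even though one has ten values and the other only four.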

Unequal groups are the rule, not the exception

Real groups are almost never the same size: more of one customer segment, more of one product, more of one region. Whenever you overlay or stack distributions of different-sized groups using raw counts, the bigger group tends to look taller, and it is easy to mistake “more numerous” for “more concentrated.” Switch to relative frequency, density, or proportion before you compare.


Frequency Polygons and Relative Frequency

Overlaid histograms get cluttered fast — bars hide each other, and three colors fighting for the same bins is hard to read. A frequency polygon solves both problems. Instead of bars, you plot a single point at the top of each bin and connect the points with lines, giving each group one clean curve. Pair that with relative frequency on the vertical axis and the comparison becomes honest and legible:

Frequency polygons of flipper length for the three species using proportion within species on the vertical axis. Each curve peaks at a similar height, and the Gentoo curve sits far to the right, well separated from the overlapping Adelie and Chinstrap curves.
Relative-frequency polygons: because each curve is scaled to its own group, the three peaks reach comparable heights and you can read the real story — Adelie and Chinstrap overlap heavily on the left, while Gentoo sits cleanly apart on the right.
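A chart like this can be drawn in a few lines. The following is a rough sketch with matplotlib, not the code behind the figure above: it uses synthetic groups, and the helper frequency_polygon, the group names, and the bin edges are assumptions made for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

def frequency_polygon(ax, values, edges, label):
    """Plot one relative-frequency polygon on ax and return the plotted heights."""
    counts, _ = np.histogram(values, bins=edges)
    proportions = counts / counts.sum()
    midpoints = (edges[:-1] + edges[1:]) / 2  # one point per bin, at its center
    ax.plot(midpoints, proportions, marker="o", label=label)
    return proportions

# Hypothetical groups of different sizes (synthetic, not the penguin data).
rng = np.random.default_rng(0)
groups = {
    "large group": rng.normal(190, 6, size=150),
    "small group": rng.normal(215, 6, size=60),
}
edges = np.arange(160, 250, 5)  # shared bin edges for every group

fig, ax = plt.subplots()
for name, values in groups.items():
    frequency_polygon(ax, values, edges, name)
ax.set_xlabel("measurement")
ax.set_ylabel("proportion within group")
ax.legend()
```

In a notebook you can drop the matplotlib.use line; in a script, call plt.show() or fig.savefig(...) to see the result. Because each curve is scaled by its own group total, the 60-value group is not dwarfed by the 150-value group.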

Two things jump out now that the size distortion is gone. First, the Adelie and Chinstrap curves overlap heavily — their flipper lengths live in the same neighborhood, with Chinstrap shifted only slightly right. Second, the Gentoo curve is an island, well separated from the other two. That separation is the key fact for any system that wants to tell the species apart by flipper length alone.

The separation is so clean it is nearly perfect. Compare the lowest Gentoo flipper to the highest of the other two species:

print("Gentoo  min flipper:", penguins[penguins["species"] == "Gentoo"]["flipper_length_mm"].min())
print("Adelie  max flipper:", ade.max())
print("Chinstrap max flipper:", chi.max())
Gentoo  min flipper: 203.0
Adelie  max flipper: 210.0
Chinstrap max flipper: 212.0

The shortest Gentoo flipper is 203 mm, while most Adelie and Chinstrap flippers fall below 200 mm; only the largest reach into Gentoo territory, topping out at 210 mm and 212 mm. A threshold around 206 mm would split Gentoo from the rest with relatively few mistakes. You could not have read that off the raw-count histogram with any confidence; the relative-frequency view makes it obvious.
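One way to check a candidate threshold is to count how many birds it would put on the wrong side. Here is a minimal sketch on made-up values (the table toy and the helper threshold_errors are hypothetical, and the numbers are not taken from the penguin data):

```python
import pandas as pd

# Hypothetical flipper lengths (mm); values are made up for illustration.
toy = pd.DataFrame({
    "species": ["Gentoo"] * 5 + ["Other"] * 7,
    "flipper_length_mm": [203, 210, 215, 220, 230, 180, 188, 192, 197, 201, 205, 210],
})

def threshold_errors(df, threshold):
    """Count birds that a 'Gentoo if flipper >= threshold' rule gets wrong."""
    predicted_gentoo = df["flipper_length_mm"] >= threshold
    is_gentoo = df["species"] == "Gentoo"
    return int((predicted_gentoo != is_gentoo).sum())

# Try a few thresholds and compare the number of mistakes.
for t in [200, 206, 212]:
    print(t, "mm ->", threshold_errors(toy, t), "errors")
```

A low threshold lets long-flippered birds from other species through, while a high one starts rejecting real Gentoo, so the error count is worth checking on both sides of your first guess.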


Boxplots: A Compact Comparison

Frequency polygons show full shape, but when you have many groups, or just want a quick read on center, spread, and oddballs, a boxplot is the most compact comparison there is. Each box summarizes one group with a few numbers: the median (the line inside the box), the first and third quartiles (the box edges, which span the interquartile range, or IQR), and the whiskers, which reach out to the most extreme values that are not outliers. Outliers beyond the whiskers are drawn as individual points.

A single annotated horizontal boxplot labelling the minimum, Q1, median, Q3, and maximum, with the box marked as the IQR, the two whiskers, and one outlier point beyond the right whisker.
The anatomy of a boxplot: the box spans Q1 to Q3 (the IQR), the line inside is the median, whiskers reach the typical extremes, and points beyond them are outliers.
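Every part of that anatomy can be computed by hand. The sketch below does so for a small made-up sample, assuming the common convention that whiskers stop at the most extreme value within 1.5 IQRs of the box (the variable names and the data are illustrative):

```python
import pandas as pd

# Hypothetical sample with one unusually large value.
values = pd.Series([2, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)

# The box: first quartile, median, third quartile.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: anything beyond these limits is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, median, q3, "IQR:", iqr)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers.tolist())
```

Note that the whiskers end at actual data points, not at the fences themselves, which is why the two whiskers of a boxplot are often different lengths.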

Here is body mass compared across the three species:

Side-by-side boxplots of body mass for the three penguin species. The Adelie and Chinstrap boxes sit low and overlap heavily, while the Gentoo box sits much higher with a wider spread. The Chinstrap group shows two outlier points.
Boxplots of body mass: Adelie and Chinstrap overlap almost completely down low, Gentoo sits far above both with a wider box, and only Chinstrap shows outliers, with one stray point above its whiskers and one below.

One small chart carries the whole story. The Adelie and Chinstrap boxes sit at nearly the same low height and overlap heavily — consistent with their almost-identical medians of 3700 g. The Gentoo box floats far above both, centered near 5000 g, and its box is taller, meaning Gentoo body mass is also more spread out. You can confirm the spread directly with the IQR:

for sp in ["Adelie", "Chinstrap", "Gentoo"]:
    mass = penguins[penguins["species"] == sp]["body_mass_g"].dropna()
    q1, q3 = mass.quantile([0.25, 0.75])
    print(sp, "IQR:", q3 - q1)
Adelie IQR: 650.0
Chinstrap IQR: 462.5
Gentoo IQR: 800.0

Gentoo’s middle 50% spans 800 g, against just 462.5 g for Chinstrap — the tightest of the three. The boxplot showed you that wider Gentoo box at a glance; the numbers put a figure on it.

The boxplot also flags outliers automatically — points lying more than 1.5 IQRs beyond a quartile. Notice that only the Chinstrap group shows stray points, one high and one low:

mass = penguins[penguins["species"] == "Chinstrap"]["body_mass_g"].dropna()
q1, q3 = mass.quantile([0.25, 0.75])
iqr = q3 - q1
print(mass[(mass < q1 - 1.5 * iqr) | (mass > q3 + 1.5 * iqr)].tolist())
[4800.0, 2700.0]

One unusually heavy Chinstrap at 4800 g and one unusually light one at 2700 g fall outside the whiskers; Adelie and Gentoo have none. That is the kind of detail a table of means would hide completely, and a boxplot surfaces it without any extra work.
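To run the same check on every group without repeating the code, the rule can be wrapped in a small function. This is a sketch on made-up data; the helper iqr_outliers and the table toy are hypothetical names, not part of pandas:

```python
import pandas as pd

def iqr_outliers(values):
    """Return the values lying more than 1.5 IQRs beyond a quartile."""
    values = values.dropna()
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask].tolist()

# Hypothetical data: group "b" contains one extreme value.
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 6,
    "value": [10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 60],
})

# Apply the rule to each group separately, since fences depend on the group.
outliers_by_group = {name: iqr_outliers(g["value"]) for name, g in toy.groupby("group")}
print(outliers_by_group)
```

With the lesson's data, the same helper could be applied to body_mass_g grouped by species; an empty list for a group means its boxplot shows no stray points.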

Which chart when?

Use overlaid histograms or frequency polygons when you want to see full shape — bumps, skew, where groups overlap. Use boxplots when you want a compact comparison of center, spread, and outliers across many groups, or when full curves would be too busy. Reach for relative frequency in the first case and you stay honest about unequal group sizes.


Practice Exercises

Exercise 1: Grouped summary of bill length

Build a side-by-side summary of bill_length_mm for the three species using groupby and describe, and report each species’ mean. Which species has the longest bills on average, and which two are closest?

Hint

Use penguins.groupby("species")["bill_length_mm"].describe() and pull the columns you care about with [["count", "mean", "std"]].round(1). Compare the mean column across the three rows.

Exercise 2: Counts vs proportions

Among penguins with body mass above 4500 g, count how many are Gentoo versus Adelie. Then compute, within each species, the proportion above 4500 g. Explain why the raw counts give a misleading impression and the proportions correct it.

Hint

For counts, filter with penguins[penguins["body_mass_g"] > 4500]["species"].value_counts(). For the within-species proportion, use penguins.groupby("species")["body_mass_g"].apply(lambda m: (m.dropna() > 4500).mean()); the dropna keeps penguins with a missing mass out of the denominator. The species with more penguins overall can rack up a big raw count even if a small fraction of it qualifies.

Exercise 3: Boxplot reading

Draw boxplots of flipper_length_mm by species (any plotting library), then confirm what you see: print each species’ median and IQR. Does the chart match the near-perfect Gentoo separation you found earlier in the lesson?

Hint

Compute the median with groupby("species")["flipper_length_mm"].median() and the IQR by subtracting the 0.25 quantile from the 0.75 quantile within each group. The Gentoo box should sit almost entirely above the other two.


Summary

You learned to compare a numeric variable across groups without getting fooled. A grouped describe lays the groups side by side in numbers; relative frequency lets you compare distributions fairly even when the groups differ in size, because raw counts tend to make the larger group look taller. Overlaid histograms and frequency polygons reveal full shape, showing that Adelie and Chinstrap overlap while Gentoo separates cleanly, and boxplots pack center, spread, and outliers for many groups into one compact picture.

Key Concepts

  • Grouped summary — splitting a variable by a category and describing each piece, e.g. groupby(...).describe().
  • Relative frequency — the proportion of a group falling in each bin, used so unequal-sized groups can be compared fairly.
  • Frequency polygon — a line connecting the top of each bin, giving each group one clean curve instead of bars.
  • Boxplot — a chart of the five-number summary (minimum, Q1, median, Q3, maximum) drawn as a box with whiskers, with outliers shown as individual points; ideal for comparing many groups at once.
  • Interquartile range (IQR) — the spread of the middle 50% of a group, from the first to the third quartile.
  • Outlier — a value lying more than 1.5 IQRs beyond a quartile, flagged as a point on a boxplot.

Why This Matters

Almost every analysis you will do is a comparison between groups, and almost every set of groups is unequal in size. The instinct to compare raw counts is exactly where honest-looking charts go wrong — the bigger group wins by default. Knowing to switch to relative frequency, and knowing which chart reveals shape versus which gives a compact summary, is what lets you say “these groups really do differ” and have it be true.


Next Steps

Continue to Lesson 6 - Guided Project: Separating Penguin Species

Put grouped summaries, relative frequency, and boxplots to work in a full guided project that tells the species apart from their measurements.


Continue Building Your Skills

You can now put two or three distributions next to each other and read the real differences between them — overlap, separation, spread, and the odd stray value — without letting unequal group sizes deceive you. Next you will bring every idea from this module together in a guided project, using these same tools to tell the penguin species apart from their measurements alone.