Lesson 2 - Multiple Comparisons

Welcome to Multiple Comparisons

In the last lesson, looking at one test many times inflated your false-positive rate. This lesson is the same disease from a different source: looking at many tests once. You launch an experiment and, because you have a dashboard full of numbers, you check the click rate, the signup rate, revenue, retention, time on page, and fifteen more. The moment any one of them crosses p < 0.05, you declare a win. Every individual test is computed correctly at α = 0.05, just like before. But asking twenty questions and celebrating the one that answered “yes” is not the same as asking one question, and the gap between them is enormous. On data with no real effect at all, twenty metrics give you a better-than-even chance (about 64%) of fooling yourself.

By the end of this lesson, you will be able to:

  • Explain why testing many metrics, variants, or segments inflates the false-positive rate
  • Compute the family-wise error rate with the formula 1 - (1 - α)^m and confirm it by simulation
  • Apply the Bonferroni correction and see it restore the intended error rate
  • Choose between Bonferroni, Holm, and false-discovery-rate control for a given situation

Let’s watch twenty harmless metrics manufacture a winner.


Why Many Tests Break the Test

Recall the α = 0.05 promise: if there’s no real effect, a single test wrongly declares significance 5% of the time. That 5% is a per-test budget. Now run m independent tests on null data, where every one of them should come back “not significant.” Each has a 95% chance of behaving — but the chance that all of them behave is 0.95 multiplied by itself m times, and that shrinks fast. The probability that at least one falsely fires — the family-wise error rate — is:

P(at least one false positive) = 1 - (1 - α)^m

For m = 1 that’s the familiar 0.05. For m = 20 it’s 1 - 0.95^20 ≈ 0.64. You didn’t change any single test; you just gave the noise twenty independent chances to trip the wire, and one of them usually does.
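
To get a feel for how quickly the formula grows, here is a minimal Python sketch that evaluates 1 - (1 - α)^m for a few family sizes. The function name family_wise_error_rate is ours, not from any library, and the numbers only hold under the independence assumption stated above.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent null tests."""
    return 1.0 - (1.0 - alpha) ** m


# Illustrative family sizes; any positive integer works.
for m in (1, 5, 10, 20):
    print(f"m = {m:>2}: FWER = {family_wise_error_rate(0.05, m):.3f}")
```

At α = 0.05 this prints roughly 0.050, 0.226, 0.401, and 0.642: the per-test budget never changes, but the family's risk climbs with every test added.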

The tests don’t have to be labelled “metrics” for this to bite. Many variants (comparing five design arms against control is many tests), many segments (slicing the result by country, device, and browser), and even many days all multiply your chances the same way. Anywhere you run several comparisons and are willing to act on any of them, the family-wise error rate is the number that governs your risk — not the 5% printed next to each individual result.

Figure: a grid of 20 metric results computed from null (no real effect) data, with one result (p = 0.03) flagged as a false positive. At least one of the twenty dips below 0.05 about 64% of the time, but requiring p < 0.05/20 pulls the family-wise error back near the intended 5%.

Watch the Inflation

Let’s prove it rather than assert it. We simulate 40,000 “families”, and within each family we run m = 20 independent null metrics: both arms share the same 10% rate at n = 2000 per arm, so every “significant” result is a false positive by construction. For each metric we compute the two-proportion p-value. Then we ask two questions of each family: does any metric cross the raw 0.05 threshold, and does any metric cross the Bonferroni threshold 0.05/20 = 0.0025?

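import numpy as np
from scipy import stats

def two_prop_p(c1, n1, c2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)  # pooled rate under the null
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 2 * (1 - stats.norm.cdf(np.abs((p2 - p1) / se)))

rng = np.random.default_rng(202)
families, m, n = 40000, 20, 2000

# Both arms draw from the same 10% rate, so any "significant" metric is a false positive.
c1 = rng.binomial(n, 0.10, size=(families, m))
c2 = rng.binomial(n, 0.10, size=(families, m))
pvals = two_prop_p(c1, n, c2, n)  # shape: (families, m)

# A family counts as a failure if any of its m metrics crosses the threshold.
print("P(>=1 false positive):", round(float(np.mean((pvals < 0.05).any(axis=1))), 3))
print("theory 1-0.95^20:     ", round(1 - 0.95**20, 3))
print("with Bonferroni 0.05/20:", round(float(np.mean((pvals < 0.05 / m).any(axis=1))), 3))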

Running it:

P(>=1 false positive): 0.645
theory 1-0.95^20:      0.642
with Bonferroni 0.05/20: 0.049

Testing 20 null metrics gives a 64.5% chance of at least one false “winner”, right on the theoretical value 1 - 0.95^20 = 0.642. Nothing is broken about any single test; the family just had twenty chances and cashed one in about two times out of three. The Bonferroni correction (require p < α/m, here 0.05/20 = 0.0025) pulls the family-wise error back to 4.9%, essentially the 5% you intended for the whole family. You bought yourself protection by making each individual test harder to pass, in exact proportion to how many you ran.

The cleanest defense: designate one primary metric

Corrections work, but the simplest fix is to have fewer tests to correct. Back in Module 2 you chose a single primary metric — this is why. If the decision to ship rests on one pre-registered test, there is no family and no multiplicity: you make one comparison at α = 0.05 and you’re done. Your guardrails and secondary metrics are still worth watching, but you monitor them, you don’t treat each one as its own decision at α = 0.05. Reserve the corrections below for the genuine cases where you must judge several comparisons at once — and beware the sneaky version, “many segments”: slicing one result twenty ways and celebrating the single significant slice is multiple comparisons wearing a disguise.


Which Correction to Use

Bonferroni is the simplest guarantee, but it’s not the only one, and it’s deliberately conservative. Three tools cover almost every practical case:

  • Bonferroni — require p < α/m. Dead simple, always controls the family-wise error, but the most conservative: with large m the threshold gets so strict you lose power to detect real effects.
  • Holm — a step-down variant that sorts the p-values from smallest to largest and tests them against progressively looser thresholds (α/m, α/(m-1), …, α), stopping at the first one that fails. It gives the same family-wise guarantee as Bonferroni but is uniformly more powerful: it rejects everything Bonferroni rejects, and sometimes more. There’s no reason to prefer plain Bonferroni over Holm; reach for Holm by default when you have a handful of metrics.
  • False Discovery Rate (Benjamini-Hochberg) — a different, gentler goal. Instead of controlling the chance of any false positive, it controls the expected proportion of false positives among the results you flag. When you have many tests (screening dozens or hundreds of metrics or segments) and can tolerate a few false discoveries in exchange for far more power, FDR is the right tradeoff.
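
The difference between Bonferroni and Holm is easiest to see side by side. The sketch below is a hand-rolled illustration, not a library API: the function names and the five p-values are hypothetical, and the strict `<` comparison is chosen only to match the "p < α/m" convention used in this lesson.

```python
import numpy as np


def bonferroni_reject(pvals, alpha=0.05):
    """Reject every test whose p-value is below the single threshold alpha/m."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals < alpha / pvals.size


def holm_reject(pvals, alpha=0.05):
    """Holm step-down: walk the sorted p-values and stop at the first failure."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                # indices from smallest to largest p
    thresholds = alpha / (m - np.arange(m))  # alpha/m, alpha/(m-1), ..., alpha/1
    passed = pvals[order] < thresholds
    # Number of leading passes; everything after the first failure is retained.
    n_reject = m if passed.all() else int(np.argmin(passed))
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject


# Hypothetical p-values for five decision metrics (made up for illustration).
example = [0.035, 0.001, 0.30, 0.015, 0.012]
print("Bonferroni rejects:", bonferroni_reject(example).sum())
print("Holm rejects:      ", holm_reject(example).sum())
```

On these made-up values Bonferroni rejects only the 0.001 result (threshold 0.05/5 = 0.01), while Holm also rejects 0.012 and 0.015 because each later comparison faces a looser threshold (0.0125, then about 0.0167) before the procedure stops at 0.035.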

The practical rule: for a small number of decision metrics, use Holm (or Bonferroni); for large-scale screening where you’re generating hypotheses rather than making one ship decision, use FDR control.
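
For the screening case, Benjamini-Hochberg can be sketched as a step-up procedure: sort the p-values, compare the k-th smallest against q·k/m, and reject everything up to the largest k that passes. The code below is a minimal hand-written illustration under that description; the function name, the symbol q for the target false discovery rate, and the example p-values are all assumptions made for this sketch.

```python
import numpy as np


def benjamini_hochberg_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject up to the largest rank k with p_(k) <= q*k/m."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                  # indices from smallest to largest p
    thresholds = q * np.arange(1, m + 1) / m   # q*1/m, q*2/m, ..., q
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.nonzero(below)[0].max())    # largest sorted position that passes
        reject[order[: k + 1]] = True          # reject it and every smaller p-value
    return reject


# Hypothetical p-values (made up for illustration).
example = [0.035, 0.001, 0.30, 0.015, 0.012]
print("BH rejects:", benjamini_hochberg_reject(example).sum(), "of", len(example))
```

On these values BH flags four of the five results, more than a family-wise correction would allow at the same level. That extra power is the point of the tradeoff: the procedure accepts that a small share of the flagged results may be false discoveries.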


Practice Exercises

Exercise 1: What happens at 40 metrics?

The simulation tested 20 null metrics and got a 64.5% family-wise error rate. Without rerunning, is testing 40 metrics better, worse, or the same — and what does the theory formula say?

Hint

Worse. Every additional independent test is another 5% chance to fire on null data, so the family-wise error only grows. The formula gives it exactly: 1 - 0.95^40 ≈ 0.87 — an 87% chance of at least one false winner. And Bonferroni would tighten the threshold to 0.05/40 = 0.00125 to compensate, which is why more tests cost you power even when you correct.

Exercise 2: Is the Bonferroni 4.9% a problem?

The Bonferroni result came out to 0.049, not exactly 0.050. Does that mean the correction slightly overshot and is broken?

Hint

No. The 0.049 is an estimate of the true family-wise rate from 40,000 simulated families, so it carries sampling noise, and Bonferroni is intentionally conservative: it guarantees the rate is at most α, often a touch under. For 20 independent tests the exact rate is 1 - (1 - 0.0025)^20 ≈ 0.0488, so landing just below 0.05 is exactly the expected, correct behavior. The uncorrected 0.645, by contrast, isn’t noise around 0.05; it’s a real, structural inflation caused by running twenty tests.
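
The size of that sampling noise can be checked with a few lines of arithmetic. This is a rough sketch that assumes each null p-value is uniformly distributed (the normal approximation in the simulation only delivers that approximately) and treats each simulated family as an independent yes/no draw; the variable names are ours.

```python
import math

alpha, m, families = 0.05, 20, 40_000
observed = 0.049  # Bonferroni result reported by the simulation

# Exact family-wise rate for m independent tests, each run at alpha/m.
exact_fwer = 1 - (1 - alpha / m) ** m

# Monte Carlo standard error of a proportion estimated from `families` draws.
mc_se = math.sqrt(exact_fwer * (1 - exact_fwer) / families)

print(f"exact Bonferroni FWER: {exact_fwer:.4f}")
print(f"Monte Carlo std error: {mc_se:.4f}")
print(f"observed - exact:      {observed - exact_fwer:+.4f}")
```

The exact rate comes out near 0.0488 with a standard error of about 0.0011, so an observed 0.049 sits well within one standard error of what the theory predicts.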

Exercise 3: Spot the multiplicity

A teammate says: “The overall test wasn’t significant, but when I broke it down by country, the effect was significant in Germany at p = 0.02, so we should roll out there.” What’s wrong, and what should they have done?

Hint

That’s multiple comparisons in disguise. Slicing by country turns one test into many: with, say, a dozen country slices there is roughly a 46% chance (1 - 0.95^12) that at least one crosses p < 0.05 on pure noise. The p = 0.02 for Germany hasn’t been corrected for the number of segments examined. The clean move is to pre-register segments of interest and apply a correction (Holm across the segment tests), or treat the finding as a hypothesis to confirm in a fresh, dedicated experiment rather than a decision.


Summary

Multiple comparisons inflate the false-positive rate the same way peeking did, but from a different source: instead of looking many times at one test, you look once at many tests. The α = 0.05 guarantee is a per-test budget, so with m independent tests the chance of at least one false positive — the family-wise error rate — is 1 - (1 - α)^m, which for m = 20 is about 64%. A simulation of 40,000 families of 20 null metrics made it concrete: the uncorrected family-wise error was 0.645 (matching the theory’s 0.642), while the Bonferroni correction of p < 0.05/20 pulled it back to 0.049. The cleanest defense is to designate one primary metric up front so there’s a single decision; when you genuinely must judge several comparisons, use Holm for a handful and false-discovery-rate control for large-scale screening.

Key Concepts

  • Family-wise error rate — the chance that any of m tests falsely fires; equals 1 - (1 - α)^m on independent nulls.
  • Sources of multiplicity — many metrics, many variants, many segments, many days all multiply your chances.
  • Bonferroni correction — require p < α/m; simple and safe, but conservative and costs power as m grows.
  • Holm and FDR — Holm is a uniformly more powerful drop-in for a few tests; Benjamini-Hochberg FDR is better for large-scale screening.

Why This Matters

Every team with a dashboard is running multiple comparisons whether they name it or not — a dozen metrics, a segmentation breakdown, a weekly re-check — and each extra look is another chance to ship something that does nothing. The teams that stay honest do two things: they decide before the experiment which single metric makes the call, and when they truly need to test several comparisons, they correct for how many. Understanding why twenty null metrics manufacture a winner 64% of the time — and that a one-line threshold change fixes it — is what keeps a “significant result” from being the loudest coincidence in a noisy table. Next, you’ll meet a subtler trap where the data isn’t lying about any single number, yet the aggregate still points the wrong way.


Next Steps

Continue to Lesson 3 - Simpson's Paradox

When a treatment wins in every segment but loses overall — how aggregation can reverse the truth.

Back to Module Overview

Return to the Pitfalls and Validity module overview


Continue Building Your Skills

You’ve seen how testing twenty harmless metrics produces a false winner 64% of the time, and how a single primary metric (or a Bonferroni, Holm, or FDR correction) keeps the family-wise error under control. That’s the second way to inflate your error rate without any single test being wrong. In Lesson 3 you’ll face a stranger failure: a change that appears to win inside every subgroup yet loses when the groups are combined, and why the fix is about how you slice, not how many times.