Lesson 3 - Simpson's Paradox
Welcome to Simpson’s Paradox
Here’s a result that looks impossible. You segment your experiment by user type, and the control arm converts better among new users and better among returning users — it wins in every subgroup you look at. Then you check the overall number and the treatment is ahead. Not tied — clearly ahead. Every subgroup says control, the aggregate says treatment, and both are computed from the exact same data with the exact same arithmetic. This is Simpson’s paradox, and it isn’t a rounding glitch or a bug in your code. It’s a real feature of how weighted averages combine, and in an A/B test it’s usually trying to tell you something important.
By the end of this lesson, you will be able to:
- Recognize Simpson’s paradox: a treatment that loses in every subgroup yet wins overall
- Explain the mechanism — an imbalanced segment mix between arms
- Understand why the paradox should not appear under clean randomization
- Use the paradox as a diagnostic for a broken split rather than a story to chase
Let’s watch the reversal happen.
A Treatment That Loses Everywhere and Wins Overall
Segment a simple experiment by user type. Conversion rate is just conversions divided by users, so there’s nothing hidden in the numbers:
| Segment | Control | Treatment | Winner |
|---|---|---|---|
| New users | 200 / 2000 = 10.0% | 90 / 1000 = 9.0% | control |
| Returning users | 300 / 1000 = 30.0% | 570 / 2000 = 28.5% | control |
| Overall | 500 / 3000 = 16.7% | 660 / 3000 = 22.0% | treatment |
Read down the winner column and it’s genuinely strange. Control beats treatment among new users. Control beats treatment among returning users. There is no third group. And yet, pool everyone together and the treatment is more than five points ahead (22.0% vs 16.7%). Let’s confirm the arithmetic isn’t sleight of hand by computing every rate directly:
# (conversions, users) per arm, keyed by segment
seg = {
    "new users": {"control": (200, 2000), "treatment": (90, 1000)},
    "returning users": {"control": (300, 1000), "treatment": (570, 2000)},
}

# Per-segment conversion rates
for name, arms in seg.items():
    (cc, cn), (tc, tn) = arms["control"], arms["treatment"]
    print(f"{name}: control {cc/cn:.1%} treatment {tc/tn:.1%}")

# Pool the raw counts across segments, then recompute the rates
cc = sum(a["control"][0] for a in seg.values())
cn = sum(a["control"][1] for a in seg.values())
tc = sum(a["treatment"][0] for a in seg.values())
tn = sum(a["treatment"][1] for a in seg.values())
print(f"overall: control {cc/cn:.1%} treatment {tc/tn:.1%}")
Running it:
new users: control 10.0% treatment 9.0%
returning users: control 30.0% treatment 28.5%
overall: control 16.7% treatment 22.0%
Both subgroup comparisons favor control, the aggregate favors treatment, and every number is exactly what the counts say it is. So where does the reversal come from?
The Mechanism: An Imbalanced Mix
The reversal has nothing to do with treatment being better and everything to do with who ended up in each arm. Look at the segment sizes, not the rates. Control’s 3000 users are mostly new users (2000 of them), the low-converting group. Treatment’s 3000 users are mostly returning users (2000 of them), the high-converting group. Returning users convert at about three times the rate of new users regardless of which arm they’re in, so treatment’s overall rate is a weighted average that leans heavily on the good segment, while control’s leans on the weak one.
That favorable mix inflates treatment’s aggregate enough to overtake control, even though treatment is worse within each segment. The overall number is an average weighted by segment size, and because the weights differ between arms, the aggregate can point the opposite way from every subgroup. Change the mix and the paradox vanishes; nothing about the segment-level rates has to change at all. Aggregating over an imbalanced mix is the whole trick.
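To make the weighted-average mechanism concrete, here is a minimal sketch that splits each arm's overall rate into segment weights times segment rates, then recomputes both arms under a single shared mix. The counts come from the table above; the helper names (`rates`, `mix`, `weighted_rate`) and the choice of control's mix as the shared one are illustrative assumptions, not part of the lesson.

```python
# (conversions, users) per arm and segment, copied from the lesson's table.
counts = {
    "control":   {"new": (200, 2000), "returning": (300, 1000)},
    "treatment": {"new": (90, 1000),  "returning": (570, 2000)},
}

def rates(arm):
    """Conversion rate within each segment of one arm."""
    return {s: conv / n for s, (conv, n) in counts[arm].items()}

def mix(arm):
    """Share of the arm's users that falls in each segment (the weights)."""
    total = sum(n for _, n in counts[arm].values())
    return {s: n / total for s, (_, n) in counts[arm].items()}

def weighted_rate(segment_rates, weights):
    """Overall rate as a weighted average: sum of weight * segment rate."""
    return sum(weights[s] * segment_rates[s] for s in segment_rates)

# Each arm under its own mix reproduces the "Overall" row of the table.
own = {arm: weighted_rate(rates(arm), mix(arm)) for arm in counts}

# Hypothetical: give both arms control's mix and keep every segment rate fixed.
common = {arm: weighted_rate(rates(arm), mix("control")) for arm in counts}

print(f"own mix:     control {own['control']:.1%} treatment {own['treatment']:.1%}")
print(f"control mix: control {common['control']:.1%} treatment {common['treatment']:.1%}")
```

Under each arm's own mix this reproduces 16.7% vs 22.0%. With control's mix applied to both arms, treatment drops to 15.5% and control is ahead again, even though no segment-level rate was touched.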
Never segment on a variable the treatment changed
User type here is a pre-treatment attribute — someone was new or returning before they ever saw a variant, so splitting on it is fair game. Conditioning on a post-treatment variable — one the treatment itself influenced, like “users who added an item to cart” or “sessions longer than 30 seconds” — is a different and worse mistake. When the treatment changes who lands in a segment, comparing arms within that segment compares non-comparable groups and can manufacture a paradox out of thin air. Segment only on attributes fixed before assignment.
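A small sketch of how conditioning on a post-treatment variable can mislead. All numbers below are hypothetical and invented for illustration. The assumed setup: both arms receive the same pre-treatment population, the treatment makes adding to cart easier for low-intent users, and the purchase rate after adding to cart is identical across arms for each intent type.

```python
# Hypothetical counts, invented for illustration (not from the lesson's table).
# Each arm gets the same pre-treatment population: 200 high-intent, 800 low-intent.
# Tuple layout: (users, users who added to cart, purchasers among those carters).
arms = {
    "control":   {"high intent": (200, 160, 80), "low intent": (800, 80, 8)},
    "treatment": {"high intent": (200, 160, 80), "low intent": (800, 320, 32)},
}

def summarize(arm):
    users = sum(u for u, _, _ in arms[arm].values())
    carters = sum(c for _, c, _ in arms[arm].values())
    buyers = sum(b for _, _, b in arms[arm].values())
    return {
        "overall": buyers / users,          # compares all randomized users
        "among carters": buyers / carters,  # conditions on a post-treatment outcome
    }

result = {arm: summarize(arm) for arm in arms}
for arm, r in result.items():
    print(f"{arm}: overall {r['overall']:.1%}, among carters {r['among carters']:.1%}")
```

In this made-up scenario treatment wins overall (11.2% vs 8.8%) yet looks far worse among users who added to cart (23.3% vs 36.7%). Nobody converts worse under treatment; the treatment simply let more low-intent users into the segment, so the two arms' carters are no longer comparable groups.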
Why This Should Not Happen (and What It Means When It Does)
Here’s the key point for A/B testing. In a clean randomized experiment, randomization balances the segment mix across arms (Module 1) — control and treatment should each get roughly the same proportion of new and returning users, because assignment is independent of everything about the user. When the mix is balanced, the weights match, and Simpson’s paradox from a pre-treatment attribute simply cannot flip the result. The example above has 2000 new users in control but only 1000 in treatment; that lopsided split is not what randomization produces.
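As a rough illustration of why randomization balances the mix, the sketch below assigns a hypothetical population by independent fair coin flips and reports the share of new users in each arm. The population size, the 50/50 split of user types, and the seed are all assumptions chosen for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 60,000 users, half new and half returning.
users = ["new"] * 30_000 + ["returning"] * 30_000

# Assign each user by an independent fair coin flip, ignoring user type.
assigned = {"control": [], "treatment": []}
for user_type in users:
    arm = "treatment" if rng.random() < 0.5 else "control"
    assigned[arm].append(user_type)

# Share of new users in each arm: this is the segment mix the lesson refers to.
new_share = {
    arm: members.count("new") / len(members) for arm, members in assigned.items()
}
for arm, share in new_share.items():
    print(f"{arm}: {len(assigned[arm])} users, {share:.1%} new")
```

Because assignment never looks at user type, both shares should land close to 50%. Compare that with the lesson's table, where new users make up 2000 of 3000 in control (66.7%) but only 1000 of 3000 in treatment (33.3%).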
So when you do see this reversal on a pre-treatment variable, treat it as a red flag, not a discovery. It means the mix is imbalanced across arms, and under proper randomization it shouldn’t be. The usual culprits are a broken or imbalanced assignment (a sample ratio mismatch, covered in Lesson 4), non-random assignment, or segmenting on a post-treatment variable you should never have conditioned on. The paradox is doing double duty: it’s a statistical curiosity and a diagnostic. The move is to hold off on acting on either view and investigate why the mix differs between arms, because that disagreement is evidence of a bug. Only an aggregate from a split you have confirmed to be sound deserves your trust.
The practical guidance follows directly. Always check whether your arms have the same mix on key pre-treatment dimensions — user type, device, country, tenure. If a segmented view contradicts a clean randomized aggregate, suspect a broken split before you believe the segments. And don’t go the other way and segment-hunt for a story: slicing until some subgroup tells the tale you want is just multiple comparisons in disguise (Lesson 2), and it will always find you a narrative.
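One way to check the mix on a pre-treatment dimension is a chi-square test of independence between segment and arm. The sketch below is a minimal standard-library version, assuming a 2x2 table (two segments, two arms); the function name `mix_balance_test` is hypothetical, and the p-value shortcut via `math.erfc` only holds for one degree of freedom.

```python
import math

# Users per (segment, arm), taken from the lesson's table.
table = {
    "new":       {"control": 2000, "treatment": 1000},
    "returning": {"control": 1000, "treatment": 2000},
}

def mix_balance_test(table):
    """Pearson chi-square test of independence between segment and arm.

    Returns (chi2, p_value). The p-value formula below is only valid for a
    2x2 table, which has one degree of freedom.
    """
    arms = list(next(iter(table.values())))
    row_totals = {seg: sum(cells.values()) for seg, cells in table.items()}
    col_totals = {arm: sum(table[seg][arm] for seg in table) for arm in arms}
    grand = sum(row_totals.values())

    chi2 = 0.0
    for seg in table:
        for arm in arms:
            # Count expected in this cell if segment and arm were independent.
            expected = row_totals[seg] * col_totals[arm] / grand
            chi2 += (table[seg][arm] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = mix_balance_test(table)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

For the lesson's counts every cell is expected to hold 1500 users, so the statistic is about 666.7 and the p-value is vanishingly small: this mix is not what random assignment produces. Note that both arms total 3000 users here, so a check on arm totals alone would not flag the problem; the imbalance only shows up once the counts are broken out by segment.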
Practice Exercises
Exercise 1: Where does the reversal actually come from?
The treatment loses among new users (9.0% vs 10.0%) and among returning users (28.5% vs 30.0%). If it’s worse in both groups, how can its overall rate be higher?
Hint
Because the overall rate is an average weighted by segment size, and the arms have different mixes. Treatment’s users are mostly high-converting returning users (2000 of 3000), so its aggregate is pulled up toward the ~28.5% segment. Control’s users are mostly low-converting new users (2000 of 3000), so its aggregate is dragged down toward the ~10% segment. The mix, not any real advantage, produces the flip.
Exercise 2: Which number do you trust?
You ran what you believed was a properly randomized test, and the aggregate favors treatment while both subgroups (new and returning users) favor control. Do you ship treatment because the aggregate wins?
Hint
No — you stop and investigate first. Under clean randomization the mix should be balanced, so a Simpson reversal on a pre-treatment attribute means it isn’t balanced, which points to a bug: a sample ratio mismatch or non-random assignment. You can’t trust the aggregate if the split that produced it is broken. Fix the assignment, confirm the mix is balanced, then read the result.
Exercise 3: Spot the bad segmentation
A teammate segments the experiment by “users who completed onboarding” — a step the new treatment made much easier — and finds a paradox. Is this the same situation as user type?
Hint
No, it’s worse. “Completed onboarding” is a post-treatment variable: the treatment changed who ends up in that segment, so the two arms’ onboarding-completers aren’t comparable groups. Conditioning on it can manufacture a paradox that has nothing to do with a broken split. The fix is to never segment on anything the treatment could have influenced — stick to attributes fixed before assignment.
Summary
Simpson’s paradox is a reversal where a treatment loses in every subgroup yet wins in aggregate. The verified example made it concrete: control beat treatment among new users (10.0% vs 9.0%) and among returning users (30.0% vs 28.5%), but treatment won overall (22.0% vs 16.7%) — every number computed correctly from the same counts. The cause is an imbalanced segment mix: treatment’s users were mostly high-converting returning users while control’s were mostly low-converting new users, so the overall rate — a weighted average — leaned in treatment’s favor despite it being worse in each segment. In a clean randomized experiment, randomization balances that mix, so this paradox from a pre-treatment attribute shouldn’t appear. When it does, it’s a red flag for a broken split (a sample ratio mismatch), non-random assignment, or segmenting on a post-treatment variable.
Key Concepts
- Simpson’s paradox — the subgroup winner reverses in the aggregate.
- Mix imbalance is the cause — the overall rate is a size-weighted average, and unequal weights across arms flip the comparison.
- Randomization prevents it — balanced mixes on pre-treatment attributes mean no reversal.
- The paradox is a diagnostic — its appearance signals a broken split, not a real effect; find the bug first, and trust the aggregate only once the split checks out.
Why This Matters
Simpson’s paradox is dangerous in two directions. If you don’t know about it, an imbalanced test can make a losing treatment look like a winner in the aggregate, or make a treatment that wins in every segment look like a loser overall. And if you do segment aggressively, the paradox gives you a tempting story to tell about any subgroup you like. The professional response is neither to panic at the reversal nor to cherry-pick a segment, but to recognize that a paradox on a pre-treatment variable means the randomization is suspect, and to go check the split before believing anything. That check is the subject of the next lesson.
Next Steps
Continue to Lesson 4 - Sanity Checks and Threats to Validity
How to catch a broken split with a sample ratio mismatch check, and the other threats that quietly invalidate a result.
Continue Building Your Skills
You’ve seen how an imbalanced segment mix can flip the aggregate against every subgroup, and why that flip is a warning that your randomization is broken rather than a result to act on. The natural next question is how to detect that broken split before it fools you. In Lesson 4 you’ll learn the sample ratio mismatch check and the wider family of sanity checks and threats to validity that keep your experiments trustworthy.