Lesson 3 - Why Randomization Works
Welcome to Why Randomization Works
In Lesson 1 you watched a coin flip rescue a comparison that motivation had corrupted, and in Lesson 2 you framed that flip as the split between control and treatment. But why does the flip work? It’s easy to say “randomization balances the groups” and move on. The claim is stronger and stranger than that: a random split makes the two groups statistically equivalent on every characteristic at once — the traits you carefully measured, and the ones you never thought to. That second half is the whole game. It’s the one thing an observational analysis can never buy, no matter how many columns you record. This lesson opens up that claim, shows you how to verify the half you can see, and explains why the half you can’t see comes along for free.
By the end of this lesson, you will be able to:
- Explain why random assignment balances every covariate simultaneously, measured or not
- Run a balance check to confirm the measured covariates came out even
- Say why randomization beats observational “controlling for” — you can only control for what you recorded
- Describe why balance holds on average and tightens as the sample grows
Let’s look at what a fair split actually produces.
Balance You Can See, Balance You Can’t
When you assign users by a coin flip, you’re not steering anyone into a group based on who they are. The flip doesn’t know a user’s motivation, whether they’re a returning visitor, or what device they’re on — and because it ignores all of that, none of it can pile up on one side. Each trait gets scattered across control and treatment in roughly the same proportion it has in the whole population. That’s balance.
The subtle part is that this happens to every trait at the same time, whether or not it appears in your data. You can check the traits you recorded — that’s a balance check, coming next. But the traits you didn’t record are balanced by the exact same mechanism; you just can’t see the receipt. Contrast this with the observational approach from Lesson 1, where the fix for a confounder was to “control for” it statistically. That only works for confounders you actually measured and put in a column. Randomization needs no column. It handles the unknown confounders — the ones you’d never think to record — for free.
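To put numbers on that contrast, here is a minimal sketch under made-up assumptions. A hidden `motivation` trait raises both the outcome and the chance that a user opts in, `returning` is the only recorded column, and the true effect is fixed at a hypothetical 0.10. The helper names (`outcome`, `naive_diff`, `stratified_diff`) and all coefficients are illustrative only; stratifying on `returning` stands in for "controlling for" a recorded column.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.10                    # hypothetical effect, chosen for illustration

motivation = rng.random(n)            # hidden trait: never recorded in a column
returning = rng.random(n) < 0.40      # recorded trait

def outcome(treat):
    # The outcome depends on the hidden trait, the recorded trait, and treatment.
    noise = rng.normal(0, 0.1, n)
    return 0.5 * motivation + 0.05 * returning + true_effect * treat + noise

def naive_diff(y, treat):
    # Plain difference in mean outcome between treated and untreated users.
    return float(y[treat].mean() - y[~treat].mean())

def stratified_diff(y, treat, stratum):
    # "Control for" a recorded column: compare within each stratum,
    # then average the within-stratum gaps weighted by stratum size.
    diffs, weights = [], []
    for value in (False, True):
        mask = stratum == value
        diffs.append(y[mask & treat].mean() - y[mask & ~treat].mean())
        weights.append(mask.mean())
    return float(np.average(diffs, weights=weights))

# Observational world: more motivated users are more likely to opt in.
opt_in = rng.random(n) < motivation
y_obs = outcome(opt_in)
obs_naive = naive_diff(y_obs, opt_in)
obs_adjusted = stratified_diff(y_obs, opt_in, returning)

# Randomized world: a coin flip that ignores every trait.
coin = rng.random(n) < 0.50
rand_est = naive_diff(outcome(coin), coin)

print(f"true effect              {true_effect:.3f}")
print(f"observational, naive     {obs_naive:.3f}")
print(f"observational, adjusted  {obs_adjusted:.3f}")
print(f"randomized               {rand_est:.3f}")
```

Under these assumptions both observational estimates should land well above 0.10 (roughly 0.27), because adjusting for `returning` does nothing about `motivation`. The randomized estimate should sit close to 0.10 even though `motivation` was never recorded.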
Run a Balance Check
Let’s make this concrete. We’ll create 10,000 users, each carrying three traits: a hidden motivation score, whether they’re a returning user, and whether they’re on mobile. Then we assign treat by a coin flip that is independent of all three — it never looks at any of them. If randomization does what we claim, every trait should land nearly equal across the two groups.
import numpy as np

rng = np.random.default_rng(11)
n = 10000
motivation = rng.random(n)        # a hidden trait
returning = rng.random(n) < 0.40  # 40% are returning users
mobile = rng.random(n) < 0.60     # 60% on mobile
treat = rng.random(n) < 0.50      # random assignment, independent of all covariates

def bal(x):
    return x[treat].mean(), x[~treat].mean()

for name, x in [("avg motivation", motivation), ("% returning", returning), ("% mobile", mobile)]:
    t, c = bal(x)
    print(f"{name:16s} treat={t:.3f} control={c:.3f} diff={abs(t-c):.3f}")
print("group sizes:", int(treat.sum()), int((~treat).sum()))

Running it:
avg motivation treat=0.497 control=0.493 diff=0.003
% returning treat=0.401 control=0.406 diff=0.005
% mobile treat=0.610 control=0.598 diff=0.012
group sizes: 4996 5004

Every trait lands nearly equal. Average motivation differs by 0.003, the share of returning users by 0.005, and the share on mobile by 0.012, all tiny. (Each diff is computed from the unrounded means, so it can be off by 0.001 from the difference of the rounded values printed beside it.) The group sizes are close to 50/50 but not exact: 4996 versus 5004. That’s not a bug. A fair coin flipped 10,000 times almost never gives you exactly 5000 heads; a small wobble is expected sampling variation, and it’s the same reason the trait diffs aren’t exactly zero either.
Here is the teaching point that’s easy to miss: treat was generated without ever referencing a single covariate. The line rng.random(n) < 0.50 doesn’t know motivation, returning, or mobile exist. Yet all three came out balanced. So any fourth trait you didn’t include — say, time zone, or past purchase history, or something you couldn’t even name — would balance by the same mechanism. That’s the receipt you can’t print but can trust.
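One way to make that point tangible is to fix the assignment first and only afterwards generate traits it could not possibly have seen. The sketch below does that. The trait names, rates, and the seed are hypothetical and not part of the lesson's dataset, so the numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(7)   # arbitrary seed for this sketch
n = 10000
treat = rng.random(n) < 0.50     # the assignment is fixed before any trait exists

# Traits "discovered" only after the split; names and rates are made up.
late_traits = {
    "% night_owl": rng.random(n) < 0.25,
    "timezone offset": rng.integers(-8, 9, n),   # whole hours, -8 through 8
    "past purchases": rng.poisson(2.0, n),
}

late_diffs = {}
for name, x in late_traits.items():
    t, c = x[treat].mean(), x[~treat].mean()
    late_diffs[name] = abs(t - c)
    print(f"{name:16s} treat={t:.3f} control={c:.3f} diff={abs(t - c):.3f}")
```

Each gap should be small relative to the trait's own scale, even though the coin flip was finished before these traits were created.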
Balance is guaranteed on average, not in every single split
Randomization balances covariates in expectation, that is, averaged over the many ways the coins could have landed. Any one split has a little wobble (that’s why the diffs above aren’t zero), and the wobble shrinks as the sample grows, roughly in proportion to 1/√n: 10,000 users balance tightly, but 40 users could easily land lopsided by pure chance. That’s why sample size matters so much, and it’s the thread we pick up in Module 3. Running a balance check on your measured covariates after randomizing is a standard sanity check: if a covariate comes out off by more than chance can plausibly explain, it’s a sign to look at your randomization, not to “adjust” it away.
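To watch the wobble shrink, the sketch below repeats the coin-flip split many times at several sample sizes and measures how much the gap in mobile share varies from split to split. The function name `mobile_gap_spread`, the number of repetitions, and the seed are assumptions made for illustration. The rule of thumb in the last column is the usual standard error of a difference in proportions, assuming an even split.

```python
import numpy as np

rng = np.random.default_rng(3)

def mobile_gap_spread(n, n_splits=1000, p_mobile=0.60):
    """Std. dev. of the treat-minus-control gap in mobile share across repeated splits."""
    gaps = []
    for _ in range(n_splits):
        mobile = rng.random(n) < p_mobile
        treat = rng.random(n) < 0.50
        if treat.all() or (~treat).all():
            continue  # skip the (vanishingly rare) split with an empty group
        gaps.append(mobile[treat].mean() - mobile[~treat].mean())
    return float(np.std(gaps))

for n in (40, 400, 4000, 10000):
    spread = mobile_gap_spread(n)
    approx = 2 * np.sqrt(0.60 * 0.40 / n)   # sqrt(p(1-p)(1/n_t + 1/n_c)) with n_t = n_c = n/2
    print(f"n={n:6d}  simulated spread={spread:.3f}  rule of thumb={approx:.3f}")
```

The simulated spread should track the rule of thumb closely: about 0.15 at 40 users and about 0.01 at 10,000, so the 0.012 gap seen in the balance check is roughly one typical wobble.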
Practice Exercises
Exercise 1: Why did an unmeasured trait balance?
Suppose your users also have a night_owl trait that you never recorded and never put in the code. After the coin-flip assignment above, is night_owl balanced across the treatment and control groups? Why or why not?
Hint
Yes, it’s balanced — by the exact same mechanism that balanced motivation, returning, and mobile. The coin flip assigns users without looking at any of their traits, so no trait can concentrate on one side, whether or not you recorded it. That’s the difference from observational analysis: “controlling for” only works on columns you have, but randomization balances the trait you never measured just as well as the ones you did.
Exercise 2: Is 4996 vs 5004 a problem?
The coin flip produced 4996 users in treatment and 5004 in control instead of an exact 5000/5000 split. A teammate worries the randomization is broken. Are they right?
Hint
No. A fair coin flipped 10,000 times almost never lands on exactly 5000 heads — a small deviation like 4996/5004 is expected sampling variation, the same variation that makes the trait diffs slightly nonzero. A split that came out exactly 5000/5000 every time would actually be suspicious. What matters is that the imbalance is small relative to the sample, which it is here.
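The claim about exact splits can be checked with a short calculation. This sketch treats the treatment count as a Binomial(10,000, 0.5) draw, which is what the lesson's independent coin flips imply, and uses only the standard library.

```python
from math import comb, sqrt

n = 10_000
p_exact = comb(n, n // 2) / 2**n   # P(exactly 5000 heads) for a fair coin
sd = sqrt(n * 0.5 * 0.5)           # standard deviation of the treatment count
z = abs(4996 - n / 2) / sd         # how far 4996 sits from 5000, in sd units

print(f"P(exactly 5000) = {p_exact:.4f}")
print(f"sd of the count = {sd:.0f}")
print(f"4996 is {z:.2f} sd from 5000")
```

An exact 5000/5000 split happens less than 1% of the time, while the count typically wanders by about 50 in either direction, so a deviation of 4 is well inside ordinary variation.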
Exercise 3: Small sample, honest doubt
You rerun the balance check with n = 40 instead of 10,000 and see % mobile come out treat=0.70, control=0.50 — a 0.20 gap. Does this mean randomization failed?
Hint
No — it means the sample is too small for balance to have tightened up yet. Randomization balances covariates on average across the many possible splits, but any single split of just 40 users can land lopsided by chance. The mechanism is fine; the sample is thin. Grow the sample and the gap shrinks toward zero. This is exactly why sample size gets its own treatment in Module 3.
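A quick back-of-the-envelope check shows how ordinary that 0.20 gap is. The sketch assumes the 40 users split evenly into 20 and 20 (the exercise does not state the group sizes) and uses the lesson's 60% mobile rate with the standard error formula for a difference in proportions.

```python
from math import sqrt

p_mobile = 0.60            # population share on mobile, from the lesson
n_treat = n_control = 20   # assumed even split of the 40 users

# Standard error of the treat-minus-control gap in a proportion.
se_gap = sqrt(p_mobile * (1 - p_mobile) * (1 / n_treat + 1 / n_control))
observed_gap = 0.70 - 0.50

print(f"standard error of the gap = {se_gap:.3f}")
print(f"observed gap = {observed_gap / se_gap:.1f} standard errors")
```

Under these assumptions the observed gap is only a little more than one standard error, which is the kind of deviation chance produces routinely at this sample size.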
Summary
Random assignment works because it makes the treatment and control groups statistically equivalent on every characteristic at once — the ones you measured and the ones you didn’t. We verified the measured half with a balance check: a coin flip that never referenced any trait still produced average motivation within 0.003, returning-user share within 0.005, and mobile share within 0.012 across the groups, with sizes at a near-even 4996 versus 5004. Because that flip ignored every covariate, any covariate you didn’t include is balanced by the same mechanism — which is precisely what observational “controlling for” can’t do, since it only reaches the confounders you recorded. Balance holds on average and tightens as the sample grows, so a small split can wobble by chance — a preview of why sample size matters.
Key Concepts
- Balance — a random split spreads every trait evenly across groups, in roughly its population proportion.
- Balance check — comparing measured covariates across groups after randomizing, to confirm the split came out even.
- Unmeasured confounders come free — the same flip that balances recorded traits balances the ones you never recorded.
- Balance on average — equivalence holds in expectation and tightens with sample size; small samples can wobble by chance.
Why This Matters
The reason an A/B test can license a causal claim while a dashboard can’t comes down to this one property: randomization neutralizes confounders you never even knew existed. Understanding why — not just that it “balances the groups” — is what lets you trust an experiment’s result and defend it when someone asks “but did you account for X?” The answer is that you didn’t have to. It also tells you what to check after you randomize (measured balance) and what to watch out for (small samples that can split unevenly by chance). Next, you’ll assemble these pieces into the full anatomy of an A/B test.
Next Steps
Continue to Lesson 4 - Anatomy of an A/B Test
Put the pieces together: the full structure of an A/B test, from hypothesis to metric to decision.
Back to Module Overview
Return to The Logic of Experiments module overview
Continue Building Your Skills
You’ve seen why a coin flip does more than split users in two: it balances every characteristic across the groups at once, so the traits you measured and the traits you never recorded come out even. You verified the visible half with a balance check, and learned that balance holds on average and tightens with sample size. Next you’ll bring the whole picture together — hypothesis, groups, randomization unit, and metric — into the anatomy of a complete A/B test.