Lesson 3 - Why Randomization Works

Welcome to Why Randomization Works

In Lesson 1 you watched a coin flip rescue a comparison that motivation had corrupted, and in Lesson 2 you framed that flip as the split between control and treatment. But why does the flip work? It’s easy to say “randomization balances the groups” and move on. The claim is stronger and stranger than that: a random split makes the two groups statistically equivalent on every characteristic at once — the traits you carefully measured, and the ones you never thought to. That second half is the whole game. It’s the one thing an observational analysis can never buy, no matter how many columns you record. This lesson opens up that claim, shows you how to verify the half you can see, and explains why the half you can’t see comes along for free.

By the end of this lesson, you will be able to:

Explain why random assignment balances every covariate simultaneously, measured or not
Run a balance check to confirm the measured covariates came out even
Say why randomization beats observational “controlling for” — you can only control for what you recorded
Describe why balance holds on average and tightens as the sample grows

Let’s look at what a fair split actually produces.

Balance You Can See, Balance You Can’t

When you assign users by a coin flip, you’re not steering anyone into a group based on who they are. The flip doesn’t know a user’s motivation, whether they’re a returning visitor, or what device they’re on — and because it ignores all of that, none of it can pile up on one side. Each trait gets scattered across control and treatment in roughly the same proportion it has in the whole population. That’s balance.

The subtle part is that this happens to every trait at the same time, whether or not it appears in your data. You can check the traits you recorded — that’s a balance check, coming next. But the traits you didn’t record are balanced by the exact same mechanism; you just can’t see the receipt. Contrast this with the observational approach from Lesson 1, where the fix for a confounder was to “control for” it statistically. That only works for confounders you actually measured and put in a column. Randomization needs no column. It handles the unknown confounders — the ones you’d never think to record — for free.
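To see the contrast in miniature, here is a minimal sketch that compares a self-selected group with a coin-flip group on the same hidden trait. The opt-in rule (users opt in with probability equal to their motivation) and the seed are assumptions made up for illustration, not something taken from Lesson 1.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility
n = 10_000
motivation = rng.random(n)       # hidden trait, uniform on [0, 1)

# Hypothetical self-selection: more motivated users are more likely to opt in.
opt_in = rng.random(n) < motivation
# Coin flip: assignment ignores motivation entirely.
coin = rng.random(n) < 0.5

def gap(x, group):
    """Difference in means between the group and everyone else."""
    return x[group].mean() - x[~group].mean()

selfselect_gap = gap(motivation, opt_in)
coin_gap = gap(motivation, coin)
print(f"self-selected gap in motivation: {selfselect_gap:+.3f}")
print(f"coin-flip gap in motivation:     {coin_gap:+.3f}")
```

Under this opt-in rule the self-selected groups should differ in average motivation by roughly one third, while the coin-flip gap should sit close to zero. Nothing in the coin-flip line had to know that motivation exists.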

Run a Balance Check

Let’s make this concrete. We’ll create 10,000 users, each carrying three traits: a hidden motivation score, whether they’re a returning user, and whether they’re on mobile. Then we assign treat by a coin flip that is independent of all three — it never looks at any of them. If randomization does what we claim, every trait should land nearly equal across the two groups.

import numpy as np
rng = np.random.default_rng(11)
n = 10000
motivation = rng.random(n)             # a hidden trait
returning  = rng.random(n) < 0.40      # 40% are returning users
mobile     = rng.random(n) < 0.60      # 60% on mobile
treat = rng.random(n) < 0.50           # random assignment, independent of all covariates

def bal(x):
    """Return (treatment mean, control mean) for one covariate."""
    return x[treat].mean(), x[~treat].mean()
for name, x in [("avg motivation", motivation), ("% returning", returning), ("% mobile", mobile)]:
    t, c = bal(x)
    print(f"{name:16s} treat={t:.3f}  control={c:.3f}  diff={abs(t-c):.3f}")
print("group sizes:", int(treat.sum()), int((~treat).sum()))

Running it:

avg motivation   treat=0.497  control=0.493  diff=0.003
% returning      treat=0.401  control=0.406  diff=0.005
% mobile         treat=0.610  control=0.598  diff=0.012
group sizes: 4996 5004

Every trait lands nearly equal. Average motivation differs by 0.003, the share of returning users by 0.005, and the share on mobile by 0.012, all tiny. (The diff is computed before rounding, which is why 0.497 and 0.493 print a diff of 0.003 rather than 0.004.) The group sizes are close to 50/50 but not exact: 4996 versus 5004. That’s not a bug. A fair coin flipped 10,000 times gives exactly 5000 heads less than 1% of the time; a small wobble is expected sampling variation, and it’s the same reason the trait diffs aren’t exactly zero either.

Figure: a balance table comparing control and treatment on three characteristics: avg motivation (0.500 vs 0.500), % returning users (40% vs 41%), and % on mobile (60% vs 59%). Each row shows the two group values joined by an approximately-equal sign, with a note that every characteristic lands nearly equal across the groups, including ones you never measured.

Caption: A balance check confirms the measured covariates came out even across control and treatment, and the same coin flip that balanced these also balanced the traits you never recorded.

Here is the teaching point that’s easy to miss: treat was generated without ever referencing a single covariate. The line rng.random(n) < 0.50 doesn’t know motivation, returning, or mobile exist. Yet all three came out balanced. So any fourth trait you didn’t include — say, time zone, or past purchase history, or something you couldn’t even name — would balance by the same mechanism. That’s the receipt you can’t print but can trust.
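One way to make that claim tangible is to fix the assignment first and only afterwards invent traits it could not have seen. The sketch below does that; the traits `night_owl` and `past_purchases`, their rates, and both seeds are hypothetical, and the assignment here is a fresh coin flip rather than the exact `treat` array from the run above.

```python
import numpy as np

n = 10_000
coin = np.random.default_rng(7)          # arbitrary seed for the assignment
treat = coin.random(n) < 0.50            # assignment is fixed first, from the coin alone

# A separate generator stands in for "the world": traits nobody recorded.
world = np.random.default_rng(2024)      # arbitrary seed
night_owl = world.random(n) < 0.30       # hypothetical: 30% are night owls
past_purchases = world.poisson(2.0, n)   # hypothetical: about 2 past purchases on average

diffs = {}
for name, x in [("% night owl", night_owl), ("avg past purchases", past_purchases)]:
    t, c = x[treat].mean(), x[~treat].mean()
    diffs[name] = abs(t - c)
    print(f"{name:20s} treat={t:.3f}  control={c:.3f}  diff={diffs[name]:.3f}")
```

Because `treat` was drawn before these traits were generated, and from a different generator, it cannot have used them. Both traits should still come out close to even across the groups.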

Balance is guaranteed on average, not in every single split

Randomization balances covariates in expectation, that is, averaged over the many ways the coins could have landed. Any one split has a little wobble (that’s why the diffs above aren’t zero), and the wobble shrinks as the sample grows, roughly in proportion to 1/√n: 10,000 users balance tightly, but 40 users could easily land lopsided by pure chance. That’s why sample size matters so much, and it’s the thread we pick up in Module 3. Running a balance check on your measured covariates after randomizing is a standard sanity check. If a covariate comes out badly off in a large sample, treat it as a sign to inspect how the assignment was generated, not as something to “adjust” away.
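The shrinking wobble can be checked by re-randomizing many times at several sample sizes and measuring how much the gap in % mobile varies from split to split. This is a rough sketch: the 60% mobile rate comes from the lesson, while the seed, the number of repetitions, and the helper name `mobile_gap` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
reps = 500                       # simulated re-randomizations per sample size

def mobile_gap(n):
    """One random split of n users: treatment-minus-control share on mobile."""
    mobile = rng.random(n) < 0.60
    treat = rng.random(n) < 0.50
    if treat.all() or not treat.any():   # one group came out empty (vanishingly rare)
        return np.nan
    return mobile[treat].mean() - mobile[~treat].mean()

spread = {}
for n in (40, 400, 10_000):
    gaps = np.array([mobile_gap(n) for _ in range(reps)])
    spread[n] = np.nanstd(gaps)          # typical size of the gap across splits
    print(f"n={n:6d}  std dev of gap = {spread[n]:.3f}")
```

With two roughly equal groups, the standard deviation of the gap is approximately sqrt(4 · 0.6 · 0.4 / n), which works out to about 0.15, 0.05, and 0.01 for these three sizes. The simulated spread should land near those values, dropping by about a factor of sqrt(10) ≈ 3.2 for each tenfold increase in n.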

Practice Exercises

Exercise 1: Why did an unmeasured trait balance?

Suppose your users also have a night_owl trait that you never recorded and never put in the code. After the coin-flip assignment above, is night_owl balanced across the treatment and control groups? Why or why not?

Hint

Yes, it’s balanced — by the exact same mechanism that balanced motivation, returning, and mobile. The coin flip assigns users without looking at any of their traits, so no trait can concentrate on one side, whether or not you recorded it. That’s the difference from observational analysis: “controlling for” only works on columns you have, but randomization balances the trait you never measured just as well as the ones you did.

Exercise 2: Is 4996 vs 5004 a problem?

The coin flip produced 4996 users in treatment and 5004 in control instead of an exact 5000/5000 split. A teammate worries the randomization is broken. Are they right?

Hint

No. A fair coin flipped 10,000 times almost never lands on exactly 5000 heads — a small deviation like 4996/5004 is expected sampling variation, the same variation that makes the trait diffs slightly nonzero. A split that came out exactly 5000/5000 every time would actually be suspicious. What matters is that the imbalance is small relative to the sample, which it is here.
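The numbers behind this answer can be worked out exactly from the binomial distribution using only the standard library. The sketch assumes, as in the lesson, a fair coin and 10,000 independent flips.

```python
from math import comb, sqrt

n = 10_000
p_exact = comb(n, n // 2) / 2**n   # P(exactly 5000 heads) for a fair coin
sd = sqrt(n * 0.5 * 0.5)           # binomial standard deviation of the treatment count
z = (4996 - n / 2) / sd            # distance of 4996 from 5000, in standard deviations

print(f"P(exactly 5000 in treatment) = {p_exact:.4f}")
print(f"std dev of treatment count   = {sd:.0f}")
print(f"4996 is {z:+.2f} std devs from 5000")
```

An exact 5000/5000 split has a probability of about 0.008, and the treatment count typically wanders about 50 users either side of 5000. A count of 4996 is well under a tenth of a standard deviation from the center, so it is about as unremarkable as a split can be.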

Exercise 3: Small sample, honest doubt

You rerun the balance check with n = 40 instead of 10,000 and see % mobile come out treat=0.70, control=0.50 — a 0.20 gap. Does this mean randomization failed?

Hint

No — it means the sample is too small for balance to have tightened up yet. Randomization balances covariates on average across the many possible splits, but any single split of just 40 users can land lopsided by chance. The mechanism is fine; the sample is thin. Grow the sample and the gap shrinks toward zero. This is exactly why sample size gets its own treatment in Module 3.
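To gauge how ordinary a 0.20 gap is at this sample size, one can repeat the 40-user split many times and count how often the gap in % mobile reaches 0.20 or more. This is a sketch under the lesson's setup (60% mobile, fair coin); the seed and the number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)   # arbitrary seed
n, reps = 40, 5_000

big_gaps = 0
valid = 0
for _ in range(reps):
    mobile = rng.random(n) < 0.60
    treat = rng.random(n) < 0.50
    if treat.all() or not treat.any():   # skip the (vanishingly rare) empty group
        continue
    valid += 1
    gap = abs(mobile[treat].mean() - mobile[~treat].mean())
    big_gaps += int(gap >= 0.20)

share = big_gaps / valid
print(f"share of 40-user splits with a gap of 0.20 or more: {share:.2f}")
```

A normal approximation suggests somewhere around one split in five should show a gap this large, so seeing it once is no evidence that the coin is broken.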

Summary

Random assignment works because it makes the treatment and control groups statistically equivalent on every characteristic at once — the ones you measured and the ones you didn’t. We verified the measured half with a balance check: a coin flip that never referenced any trait still produced average motivation within 0.003, returning-user share within 0.005, and mobile share within 0.012 across the groups, with sizes at a near-even 4996 versus 5004. Because that flip ignored every covariate, any covariate you didn’t include is balanced by the same mechanism — which is precisely what observational “controlling for” can’t do, since it only reaches the confounders you recorded. Balance holds on average and tightens as the sample grows, so a small split can wobble by chance — a preview of why sample size matters.

Key Concepts

Balance — a random split spreads every trait evenly across groups, in roughly its population proportion.
Balance check — comparing measured covariates across groups after randomizing, to confirm the split came out even.
Unmeasured confounders come free — the same flip that balances recorded traits balances the ones you never recorded.
Balance on average — equivalence holds in expectation and tightens with sample size; small samples can wobble by chance.

Why This Matters

The reason an A/B test can license a causal claim while a dashboard can’t comes down to this one property: randomization neutralizes confounders you never even knew existed. Understanding why — not just that it “balances the groups” — is what lets you trust an experiment’s result and defend it when someone asks “but did you account for X?” The answer is that you didn’t have to. It also tells you what to check after you randomize (measured balance) and what to watch out for (small samples that can split unevenly by chance). Next, you’ll assemble these pieces into the full anatomy of an A/B test.

Next Steps

Continue to Lesson 4 - Anatomy of an A/B Test

Put the pieces together: the full structure of an A/B test, from hypothesis to metric to decision.

Back to Module Overview

Return to The Logic of Experiments module overview

Continue Building Your Skills

You’ve seen why a coin flip does more than split users in two: it balances every characteristic across the groups at once, so the traits you measured and the traits you never recorded come out even. You verified the visible half with a balance check, and learned that balance holds on average and tightens with sample size. Next you’ll bring the whole picture together — hypothesis, groups, randomization unit, and metric — into the anatomy of a complete A/B test.

