Lesson 1 - Significance and the Two Errors
Welcome to Significance and the Two Errors
An experiment ends in a decision: ship the change, or don’t. Like any yes/no call made under uncertainty, it can be wrong in two different ways — and they’re not symmetric. You can see a difference that isn’t really there and ship a change that does nothing. Or you can miss a difference that is there and kill a change that would have helped. Sizing an experiment is fundamentally about controlling how often each of those mistakes happens, so before we can compute a sample size, we need to name the two errors precisely and see what the “significance level” you’ve heard about actually controls. That’s this lesson.
By the end of this lesson, you will be able to:
- Distinguish a Type I error (false positive) from a Type II error (false negative)
- Explain what the significance level α controls and what β is
- Read the decision matrix that relates the truth to your call
- See, in simulation, that a p<0.05 test’s false-positive rate really is ~5%
Let’s start with the two ways to be wrong.
Two Ways to Be Wrong
Reality is one of two things — the change has a real effect, or it doesn’t — and your test says one of two things — significant, or not. Cross them and you get four outcomes: two correct, two errors.
- A Type I error is a false positive: the change really does nothing, but noise happened to produce a difference big enough that you called it significant and shipped a dud. Its rate is α (alpha).
- A Type II error is a false negative: the change really works, but your test didn’t have enough evidence to call it, so you missed a winner. Its rate is β (beta).
The two correct outcomes are the other diagonal: correctly detecting a real effect (rate 1−β, which we call power), and correctly seeing nothing when there’s nothing (rate 1−α).
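The four cells of the decision matrix can be written down as a tiny lookup. This is only an illustrative sketch: the function name `classify_outcome` and its two boolean arguments are hypothetical, not part of any library.

```python
def classify_outcome(effect_is_real: bool, test_significant: bool) -> str:
    """Map (truth, test verdict) to one cell of the decision matrix.

    Hypothetical helper for illustration only.
    """
    if test_significant and not effect_is_real:
        return "Type I error (false positive), rate α"
    if not test_significant and effect_is_real:
        return "Type II error (false negative), rate β"
    if test_significant and effect_is_real:
        return "correct detection, rate 1 − β (power)"
    return "correct non-detection, rate 1 − α"


# Print all four combinations of truth and verdict.
for effect_is_real in (False, True):
    for test_significant in (False, True):
        label = classify_outcome(effect_is_real, test_significant)
        print(f"real effect={effect_is_real!s:5}  significant={test_significant!s:5}  ->  {label}")
```

Each rate is conditional on the truth: α and 1 − α describe what happens when there is no effect, while β and 1 − β describe what happens when there is one.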
What the Significance Level Controls
Here’s the key thing people miss: you choose α directly. The significance level — almost always 0.05 — is a threshold you set on the p-value, and it is your Type I error rate. Setting α = 0.05 is a decision to accept a 5% false-positive rate: if the change truly does nothing, a p<0.05 test will still call it significant about 1 time in 20, purely by chance.
That’s not a flaw to be fixed; it’s a dial you’re setting. Lower α (say 0.01) means fewer false positives — you cry wolf less — but, as you’ll see next lesson, it makes real effects harder to detect, raising β. The two errors trade off, and α is the lever you control first. β you’ll control mostly through sample size, which is where this module is heading.
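One way to see why a lower α makes real effects harder to detect is to look at the bar the test statistic has to clear. The sketch below assumes a two-sided z-test (the kind used later in this lesson) and uses Python's standard-library `statistics.NormalDist`; the helper name `critical_z` is hypothetical.

```python
from statistics import NormalDist


def critical_z(alpha: float) -> float:
    """Smallest |z| a two-sided z-test must reach to be significant at level alpha.

    Hypothetical helper for illustration only.
    """
    return NormalDist().inv_cdf(1 - alpha / 2)


# A stricter alpha raises the bar that |z| has to clear.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}  ->  need |z| >= {critical_z(alpha):.3f}")
```

At α = 0.05 the bar is about 1.96; at α = 0.01 it rises to about 2.58. Noise clears the higher bar less often, but so does a genuine effect measured on the same amount of data.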
Watch the False-Positive Rate
The claim “α = 0.05 means a 5% false-positive rate” isn’t just a definition — we can measure it. We’ll run 20,000 “A/A” experiments: two groups drawn from the exact same 10% conversion rate, so the null is true by construction — there is no effect. Then we count how often a two-proportion test at p<0.05 wrongly calls it significant.
    import numpy as np
    from scipy.stats import norm

    def two_prop_p(c1, n1, c2, n2):
        """Two-sided p-value for a difference in rates (pooled z-test)."""
        p1, p2 = c1 / n1, c2 / n2
        p = (c1 + c2) / (n1 + n2)  # pooled rate under the null
        se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
        z = (p2 - p1) / se
        return 2 * (1 - norm.cdf(np.abs(z)))

    rng = np.random.default_rng(101)
    sims, n, p = 20000, 3839, 0.10  # BOTH arms have the SAME true rate -> null is true
    c1 = rng.binomial(n, p, size=sims)  # conversions in arm 1, one draw per experiment
    c2 = rng.binomial(n, p, size=sims)  # conversions in arm 2, one draw per experiment
    false_positive_rate = np.mean(two_prop_p(c1, n, c2, n) < 0.05)
    print(f"false-positive rate: {false_positive_rate:.4f}")

Running it:

    false-positive rate: 0.0517

0.0517 is almost exactly the 5% we set with α. There was nothing to find in any of these 20,000 experiments, yet about 1 in 20 produced a “significant” difference from noise alone. That’s the Type I error rate made concrete: it’s the price of admission for using a threshold at all, and it’s exactly the number you chose. (The tiny gap from 0.0500 is itself just sampling noise: run more experiments and it tightens toward 0.05.)
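How much wobble should 20,000 simulated experiments leave? A rough check, assuming the test's true false-positive rate is exactly α and treating each simulated experiment as an independent yes/no trial, is the standard error of a proportion. The values below are taken from the simulation; the variable names are just for illustration.

```python
import math

# Values taken from the simulation: target rate, number of A/A runs, measured rate.
alpha, sims, observed = 0.05, 20_000, 0.0517

# Each simulated experiment is a Bernoulli(alpha) "did it fire falsely?" trial,
# so the estimated rate has standard error sqrt(alpha * (1 - alpha) / sims).
se = math.sqrt(alpha * (1 - alpha) / sims)
gap_in_se = (observed - alpha) / se
low, high = alpha - 1.96 * se, alpha + 1.96 * se

print(f"standard error of the estimate: {se:.5f}")
print(f"observed gap from alpha: {gap_in_se:.2f} standard errors")
print(f"range covering about 95% of runs: [{low:.4f}, {high:.4f}]")
```

The standard error comes out near 0.0015, so 0.0517 sits roughly 1.1 standard errors above 0.05, well inside the range a correctly calibrated test would produce. Because the standard error shrinks with the square root of the number of simulations, quadrupling the runs would halve the wobble.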
This is why you don’t run twenty tests and celebrate the winner
If a single true-null test fires falsely 5% of the time, then running many tests makes a false positive somewhere likely: check 20 dead-end ideas and you’d expect about one to look “significant” by chance, and if the tests are independent, the chance that at least one does is about 64%. This is the seed of the multiple-comparisons problem you’ll tackle in the pitfalls module. For now, just hold the intuition: α isn’t the chance your result is wrong. It’s how often the test cries wolf when there’s no wolf at all, per test.
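The arithmetic behind that intuition is short. The sketch below assumes every test has a true null and that the tests are independent of one another, which real experiments only approximate; the function name `chance_of_any_false_positive` is hypothetical.

```python
def chance_of_any_false_positive(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across n_tests independent true-null tests.

    Hypothetical helper for illustration only; assumes independence.
    """
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
for n_tests in (1, 5, 20, 50):
    expected = alpha * n_tests  # expected number of false positives
    any_fp = chance_of_any_false_positive(alpha, n_tests)
    print(f"{n_tests:>2} tests: expect {expected:.2f} false positives, "
          f"P(at least one) = {any_fp:.3f}")
```

With 20 tests the expected number of false positives is 1, and the chance of seeing at least one is about 0.64. The per-test rate is still 5%; it is the number of chances to cry wolf that has grown.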
Practice Exercises
Exercise 1: Name the error
Lumen ships a new page that truly has no effect on conversion, but the test came back p = 0.03 and the team rolled it out. Which error is this, and what’s its rate called?
Hint
A Type I error — a false positive. The change does nothing, but noise produced a significant-looking result and a dud got shipped. Its rate is α, the significance level, which the team implicitly set at 0.05 by using the p<0.05 threshold.
Exercise 2: The other mistake
A different change truly lifts conversion by 2 points, but Lumen’s test came back p = 0.20 and they killed it. Name the error, its rate, and the quantity “1 minus that rate.”
Hint
A Type II error — a false negative: a real effect went undetected. Its rate is β. The complement, 1 − β, is the power of the test: the probability of correctly detecting the effect. This experiment was likely underpowered — too few users to see a 2-point lift reliably — which is exactly what sample sizing fixes.
Exercise 3: Why 0.0517 and not 0.0500?
The simulation set the null to be true and used α = 0.05, yet the measured false-positive rate was 0.0517, not exactly 0.0500. Is the test broken?
Hint
No — 0.0517 is an estimate of the true 0.05 rate from 20,000 simulated experiments, and estimates carry sampling noise. The expected rate is 0.05; any single run wobbles around it, and the wobble shrinks as you simulate more experiments. It’s the same reason an observed conversion rate isn’t exactly the true rate — which is the whole reason we need significance tests in the first place.
Summary
Every experiment’s yes/no decision can be wrong two ways. A Type I error (false positive, rate α) ships a change that does nothing; a Type II error (false negative, rate β) misses a change that works. You set α directly as the significance level (choosing 0.05 is choosing to accept a 5% false-positive rate), and a simulation of 20,000 true-null A/A experiments confirmed it: the p<0.05 test fired falsely in 5.17% of them, right on target. The correct-detection rate, 1 − β, is power, and it’s controlled mostly by sample size, the subject of this module. The two errors trade off against each other, so sizing an experiment means deciding how much of each you’ll tolerate.
Key Concepts
- Type I error (α) — a false positive; calling noise a real effect. You set its rate as the significance level.
- Type II error (β) — a false negative; missing a real effect.
- Power (1 − β) — the probability of correctly detecting a true effect.
- α is a choice — 0.05 means a 5% false-positive rate, confirmed at 0.0517 in simulation.
Why This Matters
Sample sizing is impossible until you’ve named what you’re controlling: the two error rates. Teams that skip this run tests that are quietly set up to mislead: underpowered ones that miss real wins (high β), or a flurry of unadjusted tests that manufacture false winners (compounding α). Understanding that α is a dial you set, and that β is the one you buy down with data, is what makes the rest of this module (power and sample size) a set of deliberate choices instead of guesses. Next, you’ll focus on power itself: what it is, and the levers that raise it.
Next Steps
Continue to Lesson 2 - Statistical Power
What power is, and the four levers — effect size, sample size, significance, and variance — that control it.
Continue Building Your Skills
You can now name the two ways an experiment goes wrong — Type I (α, a false positive you set the rate for) and Type II (β, a missed real effect) — and you’ve watched a p<0.05 test cry wolf 5% of the time when there was nothing to find. Next you’ll turn to power, 1−β: what it means to have a good chance of detecting a real effect, and the levers that raise it before you ever compute a sample size.