Lesson 2 - Statistical Power

Welcome to Statistical Power

Last lesson you met the two errors and learned that α, the false-positive rate, is a dial you set directly. Its counterpart, β — the chance you miss a real effect — is different: you don’t set it, you buy it down. The quantity you actually design around isn’t β itself but its complement, power = 1 − β: the probability that when there really is an effect, your test catches it. A test with low power is a test that will quietly kill real winners, and the frustrating part is you’d never know — a non-significant result looks the same whether the effect was absent or just too small to see with the data you had. This lesson defines power precisely and walks through the four levers that move it, so that by the time you compute a sample size you understand exactly what you’re purchasing.

By the end of this lesson, you will be able to:

  • Define power as 1 − β and explain the 80% convention
  • Name the four levers that control power and which direction each pushes
  • Explain why loosening α raises power but isn’t a free lever
  • See why sample size is the lever you’ll actually turn to hit a power target

Let’s start with what power is.


What Power Is

Power is the probability that your test returns a significant result when a real effect truly exists. It’s the good outcome in the right-hand column of last lesson’s decision matrix: the effect is there, and you caught it. Formally it’s 1 − β, one minus the Type II error rate, and it always refers to detecting some specific effect — power to detect a tiny effect and power to detect a large one are different numbers for the same experiment.

The near-universal convention is to design for 80% power. That’s a deliberate choice to accept a 20% miss rate: even a correctly-run experiment on a genuinely effective change will fail to reach significance about one time in five. Eighty percent is a norm, not a law of nature — it’s a widely-agreed default that balances the cost of collecting more data against the cost of missing wins. For high-stakes decisions, where missing a real effect is expensive, teams often design for 90% power instead, accepting only a 10% miss rate but paying for it with a larger sample.
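
To put a rough number on what the step from 80% to 90% power costs, here is a minimal sketch. It assumes a two-sided z-test at α = 0.05 and relies on the usual normal-approximation result that, with effect size and variance held fixed, the required sample size is proportional to (z for 1 − α/2 plus z for the power target) squared. The function name `relative_sample_cost` is illustrative and not part of the lesson.

```python
from statistics import NormalDist


def relative_sample_cost(power_target, baseline_power=0.80, alpha=0.05):
    """Sample-size multiplier for power_target relative to baseline_power.

    Assumption: two-sided z-test, so required n is proportional to
    (z_{1 - alpha/2} + z_{power})^2 with effect size and variance fixed.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    target = (z_alpha + z(power_target)) ** 2
    baseline = (z_alpha + z(baseline_power)) ** 2
    return target / baseline


ratio = relative_sample_cost(0.90)
print(f"90% power needs about {ratio:.2f}x the sample of 80% power")
```

Under these assumptions the multiplier comes out near 1.34, so designing for 90% instead of 80% power costs roughly a third more users per arm.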

The key mental shift from Lesson 1: you chose α outright, but you design toward a power target. Power isn’t a single knob — it’s an outcome that several forces push around at once. There are four of them.


The Four Levers

Four things move power. Picture power sitting at the center, with four levers pulling on it.

[Figure: power (1 − β) at the center with four levers pulling on it. Sample size up: more users, more power. Effect size up: bigger effects are easier to detect. Significance level α up: a looser bar raises power but also raises false positives. Variance down: less noise means more power.]
Four levers move power, but only more data and less noise raise it for free; loosening α buys power by trading away false positives.

1. Effect size: bigger true effects are easier to catch. A large real difference stands out from noise, so the same experiment has more power to detect a 10% → 14% lift than a 10% → 10.5% lift. You don’t control the true effect, which is a fact about the world, but you do choose the minimum detectable effect (MDE), the smallest lift you insist on being able to see. Designing around a smaller MDE is like asking your test to spot a fainter signal: it demands more of everything else.

2. Sample size: more users, more power. This is the lever you actually control. Every additional user sharpens your estimate of each group’s rate, which makes a real difference easier to separate from noise. Hold everything else fixed and power rises with every increase in n, though with diminishing returns as it approaches 100%. This is the whole reason sample sizing exists: it’s the mechanism by which you buy the power you want, and it’s why the rest of this module is really about computing one number.

3. Significance level α: a looser bar raises power, but not for free. Relax α from 0.05 to 0.10 and you lower the threshold a result must clear, so real effects clear it more often and power goes up. Tighten α to 0.01 and the opposite happens: real effects struggle to reach the stricter bar, and power drops. But this is a trade-off, not a gift. The same loosening that helps real effects through also lets more noise through, so you gain power by accepting more Type I errors. You’re not creating power; you’re swapping one error for the other.

4. Variance: less noise, more power. The noisier your metric, the harder it is to tell a real difference from random wobble, so a high-variance metric needs more data to reach the same power. For a proportion (a conversion rate), the variance isn’t separate from the metric: it’s baked into the baseline rate as p(1 − p), largest near 50% and smaller toward the extremes. For other kinds of metrics you can sometimes reduce variance by cleaning up the metric or removing outliers, and less noise always means more power for the same n.
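
The direction of three of these levers can be checked with a small sketch. It assumes a two-sided two-proportion z-test with the unpooled normal approximation and holds α at 0.05; the helper `approx_power` and the scenario numbers are illustrative choices, not values from the lesson.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Uses the unpooled normal approximation and ignores the negligible
    chance of a significant result in the wrong direction.
    """
    norm = NormalDist()
    # Standard error of the difference in rates; p(1 - p) is the variance term.
    se = sqrt((p_control * (1 - p_control)
               + p_treatment * (1 - p_treatment)) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(abs(p_treatment - p_control) / se - z_crit)


# Each scenario changes one lever relative to the reference row.
scenarios = {
    "reference: 10% -> 12%, n = 2,000": (0.10, 0.12, 2000),
    "larger effect: 10% -> 14%": (0.10, 0.14, 2000),
    "more users: n = 4,000": (0.10, 0.12, 4000),
    "noisier baseline: 50% -> 52%": (0.50, 0.52, 2000),
}
for label, (p_c, p_t, n) in scenarios.items():
    print(f"{label:35s} power ~ {approx_power(p_c, p_t, n):.2f}")
```

The last scenario keeps the same 2-point lift and the same n but moves the baseline to 50%, where p(1 − p) is largest, so the approximate power falls even though nothing else changed.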

Only two of the four levers are truly free

Look closely and the four levers split into two kinds. More data and less noise raise power at no cost to your error rates — they’re the honest levers. Effect size you don’t control (you only pick the MDE you design around), and loosening α raises power only by buying it with more false positives. So when someone says “just relax the significance threshold to hit 80% power,” they’re not finding free power — they’re quietly agreeing to cry wolf more often. The genuinely free ways to gain power are to collect more users or measure them with less noise.
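
The α trade-off can be made concrete with a short sketch that holds the experiment fixed and varies only α. It assumes a two-sided two-proportion z-test under the unpooled normal approximation, applied to a 10% → 12% lift with 3,839 users per arm; the function name `power_at_alpha` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_at_alpha(alpha, p_control=0.10, p_treatment=0.12, n_per_arm=3839):
    """Approximate two-sided power for a fixed experiment at a given alpha."""
    norm = NormalDist()
    se = sqrt((p_control * (1 - p_control)
               + p_treatment * (1 - p_treatment)) / n_per_arm)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf((p_treatment - p_control) / se - z_crit)


for alpha in (0.01, 0.05, 0.10):
    # alpha is the false-positive rate by construction, so it moves in
    # the same direction as power: both rise or both fall together.
    print(f"alpha = {alpha:.2f}  false-positive rate = {alpha:.0%}  "
          f"power ~ {power_at_alpha(alpha):.0%}")
```

Under these assumptions power comes out at roughly 59%, 80%, and 88% for α of 0.01, 0.05, and 0.10. Every gain in power in that sequence is matched by a higher false-positive rate, which is the swap described above.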


Why Sample Size Is the Lever You Turn

Line the four levers up against what you can actually change and a pattern appears. Effect size is fixed by the problem — you can pick a smaller MDE, but only by demanding more data, so it circles back to sample size. α you already set in Lesson 1, and pushing it around trades false positives for power rather than giving you power outright. Variance is mostly a property of your metric; you can trim it at the margins, but for a conversion rate it’s locked to the baseline p(1 − p) you inherit.

That leaves one lever you can freely turn to hit any power target you like: sample size. Fix the effect you care about (the MDE), fix α at 0.05, take the variance the metric hands you, and pick your target: 80% or 90% power. Everything is pinned except n, so n is what you solve for. That’s not a coincidence; it’s the whole logic of sample sizing. As a concrete anchor, the standard formula (Lesson 3) says that detecting a 10% → 12% lift with a two-sided test at α = 0.05 needs about 3,839 users per arm to reach roughly 80% power. That’s the same 3,839 the A/A simulation used last lesson. Change the MDE or the target power and that number moves, but the shape of the calculation is always the same: everything fixed, solve for n.
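
As a preview of the "everything fixed, solve for n" step, the sketch below uses one common normal-approximation formula for comparing two proportions (two-sided test, unpooled variance). It is an assumption that this matches the formula Lesson 3 presents in every detail, and the name `sample_size_per_arm` is illustrative; it does reproduce the 3,839 figure quoted above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Users per arm for a two-sided two-proportion z-test.

    n = (z_{1 - alpha/2} + z_{power})^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2
    rounded up to the next whole user.
    """
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    variance_sum = (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment))
    delta = p_treatment - p_control
    return ceil(z_total ** 2 * variance_sum / delta ** 2)


n = sample_size_per_arm(0.10, 0.12)
print(n)  # 3839
```

Every input is pinned (baseline, MDE, α, power target) and n is the only output, which mirrors the logic of the paragraph above. Passing `power=0.90` or a smaller lift returns a larger n.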

You don’t have to trust the formula on faith, either. Because power is just a probability, you can estimate it the same way Lesson 1 estimated the false-positive rate — simulate many experiments where the effect is truly present and count how often the test catches it. That’s exactly what Lesson 4 does. Here, hold the intuition: power is a number you can both compute and measure, and sample size is how you dial it to where you want it.
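
The "measure it" side can be sketched in a few lines. It simulates experiments in which the 10% → 12% effect is truly present and counts how often a two-sided two-proportion z-test reaches significance. This is an illustrative approach, not necessarily the one Lesson 4 uses; the function name, the unpooled standard error, the 300 simulated experiments, and the seed are all assumptions made to keep the example small.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_power(p_control, p_treatment, n_per_arm, alpha=0.05,
                   n_sims=300, seed=0):
    """Estimate power as the share of simulated experiments that reach
    significance with a two-sided two-proportion z-test (unpooled SE)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        # Draw each arm's conversions one simulated user at a time.
        conv_c = sum(rng.random() < p_control for _ in range(n_per_arm))
        conv_t = sum(rng.random() < p_treatment for _ in range(n_per_arm))
        rate_c, rate_t = conv_c / n_per_arm, conv_t / n_per_arm
        se = sqrt((rate_c * (1 - rate_c) + rate_t * (1 - rate_t)) / n_per_arm)
        if se > 0 and abs(rate_t - rate_c) / se > z_crit:
            hits += 1
    return hits / n_sims


estimate = simulate_power(0.10, 0.12, 3839)
print(f"estimated power ~ {estimate:.2f}")
```

With only 300 simulated experiments the estimate carries a couple of percentage points of sampling noise, so expect a value in the neighborhood of 0.80 rather than exactly 0.80. Raising `n_sims` tightens it at the cost of run time.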


Practice Exercises

Exercise 1: Read the miss rate

Lumen designs an experiment for 80% power to detect a real 2-point lift. The change genuinely works. Before running it, what’s the chance the test fails to reach significance, and what is that quantity called?

Hint

The chance of missing it is 20% — that’s β, the Type II error rate, and it’s exactly 1 minus the 80% power they designed for. Designing for 80% power is accepting a 1-in-5 miss rate on a genuinely effective change. If that risk is too high for the decision at hand, they’d design for 90% power (a 10% miss rate) and pay for it with a larger sample.

Exercise 2: Which lever, which direction

A teammate proposes hitting the power target by changing α from 0.05 to 0.10 instead of collecting more users. Does that raise power, and what’s the catch?

Hint

Yes, it raises power — a looser α lowers the bar a result must clear, so real effects reach significance more often. But it’s not free: the same loosened threshold lets more noise through too, so you gain power by accepting more Type I errors (false positives). You’re trading one error for the other, not creating power. The free levers are more data and less noise.

Exercise 3: The one lever left

Of the four levers — effect size, sample size, α, variance — which is the one you actually turn to reach a chosen power target, and why aren’t the others available for the job?

Hint

Sample size. The effect size is fixed by the world (you only pick the MDE, which just pushes back onto n); α is set separately and trades false positives for power rather than granting it; and variance is mostly a property of the metric — for a rate it’s locked to the baseline p(1 − p). With those pinned, n is the free variable you solve for to hit 80% or 90% power. That solve is Lesson 3.


Summary

Power = 1 − β is the probability that your test detects a real effect when one truly exists, and the standard is to design for 80% power — accepting a 20% miss rate — or 90% for high-stakes calls. Four levers move power: effect size (bigger effects are easier to catch), sample size (more users, more power), significance level α (a looser bar raises power), and variance (less noise, more power). But the levers aren’t equal: you don’t control the true effect, and loosening α buys power only by accepting more false positives, so the only free ways to gain power are more data and less noise. Since effect size, α, and variance are largely fixed by the problem, sample size is the lever you turn to hit your target — which is exactly what the next two lessons compute.

Key Concepts

  • Power (1 − β) — the probability of detecting a real effect; design for 80%, or 90% when the stakes are high.
  • Four levers — effect size, sample size, α, and variance each move power.
  • Only two are free — more data and less noise raise power at no cost; loosening α trades away false positives.
  • Sample size is the solve — with the other levers fixed, n is what you tune to reach a power target.

Why This Matters

An underpowered test can be worse than no test: it burns traffic and time, then hands you a non-significant result that quietly buries a real win, and you can’t tell that outcome from a genuine null. Understanding the four levers is what lets you avoid that trap deliberately, and understanding that only data and noise are free is what stops teams from “hitting 80% power” by secretly loosening α and multiplying their false positives. Once you see that sample size is the honest lever, the sample-size formula stops looking like a magic incantation and becomes what it is: the answer to “how many users buy me the power I want?” That’s Lesson 3.


Next Steps

Continue to Lesson 3 - Computing Sample Size

Turn the power target into a number — the sample-size formula and the inputs it needs.


Continue Building Your Skills

You can now define power as 1 − β, explain why 80% is the working default, and name the four levers — effect size, sample size, α, and variance — that push it around, along with the honest fact that only more data and less noise are free. With sample size identified as the lever you actually turn, you’re ready to make it concrete: next you’ll meet the formula that takes your MDE, baseline, α, and power target and hands back the exact number of users each arm needs.