Lesson 4 - Power Curves and Simulation

Welcome to Power Curves and Simulation

The last three lessons built up to a formula that hands you a sample size for a target power. Formulas are fast, but they can also feel like a black box: you plug in numbers, one comes out, and you take it on faith. This lesson is the payoff: instead of trusting the formula, we’ll measure power directly. The idea is disarmingly simple. If you simulate many experiments where a real effect genuinely exists, run your significance test on each one, and count the fraction that come back significant, that fraction estimates the power at that sample size. There is no algebraic approximation, and every assumption sits in code you can inspect; the only error left is sampling noise, which shrinks as you run more simulations. We’ll run this for real, watch the Monte-Carlo estimate land right on the formula’s answer, and then step back to read the power curve that ties every sample size to the power it buys.

By the end of this lesson, you will be able to:

  • Estimate statistical power directly by simulating experiments with a known effect
  • Confirm that the sample-size formula and a Monte-Carlo simulation agree
  • Read an S-shaped power curve and pick the sample size where it crosses your target
  • Explain why simulation is the general tool that works even where no formula exists

Let’s measure power instead of just computing it.


Power Is Something You Can Count

Power is a probability: the chance a test detects a real effect. And any probability can be estimated by doing the thing many times and counting how often it happens. So here’s the recipe. Build a world where the effect is real — a control rate of 10% and a treatment rate of 12%, a genuine 2-point lift. Draw thousands of experiments from that world, each with n users per arm. Run the same two-proportion test from Lesson 1 on every one. The fraction that reach p<0.05 is the power at that n, because in every single experiment there truly was something to find.

We reuse two_prop_p unchanged and wrap the simulation in one function:

import numpy as np
from scipy.stats import norm

def two_prop_p(c1, n1, c2, n2):          # two-sided p-value for a difference in rates
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * (1 - norm.cdf(np.abs(z)))

def simulate_power(p_c, p_t, n, sims=40000, seed=202):
    rng = np.random.default_rng(seed)
    c1 = rng.binomial(n, p_c, size=sims)     # control arm: true rate p_c
    c2 = rng.binomial(n, p_t, size=sims)     # treatment arm: true rate p_t (a REAL effect)
    return float(np.mean(two_prop_p(c1, n, c2, n) < 0.05))

for n in (1500, 3839, 8000):
    print(f"n={n}: power={simulate_power(0.10, 0.12, n):.3f}")

Running it — 40,000 seeded Monte-Carlo experiments at each sample size — gives:

n=1500: power=0.416
n=3839: power=0.802
n=8000: power=0.982

Look at the middle row. Back in Lesson 3, the formula prescribed n = 3,839 users per arm to detect this 10%→12% lift with 80% power. Here, a completely independent Monte-Carlo simulation, with no sample-size formula anywhere in the code, just draw-test-count, lands on 0.802. The algebra and the brute-force experiment agree to within sampling noise. That mutual confirmation is the whole point: the formula wasn’t a black box after all, and the simulation isn’t a hack. They’re two routes to the same answer, and they meet.
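For reference, here is a minimal sketch of the formula side of that comparison. Lesson 3 isn’t reproduced in this lesson, so this assumes its formula is the common unpooled-variance normal approximation; the function name sample_size_per_arm is ours, chosen for illustration.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_arm(p_c, p_t, alpha=0.05, power=0.80):
    """Users per arm for a two-sided two-proportion test.

    ASSUMPTION: unpooled-variance normal approximation,
    n = (z_{1-alpha/2} + z_{power})^2 * (p_c*q_c + p_t*q_t) / (p_t - p_c)^2.
    """
    z_alpha = norm.ppf(1 - alpha / 2)      # critical value for the two-sided test
    z_beta = norm.ppf(power)               # quantile matching the target power
    variance = p_c * (1 - p_c) + p_t * (1 - p_t)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_t - p_c) ** 2)

n = sample_size_per_arm(0.10, 0.12)
print(n)
```

Under that assumption the raw value is about 3,838.1, which rounds up to 3,839, the same n the simulation was run at.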

The other two rows show the cost of getting n wrong. At n = 1,500, power is only 0.416: an underpowered test that misses a real 2-point lift more than half the time. Run that experiment and a genuine winner has worse-than-coin-flip odds of showing up as significant. At n = 8,000, power climbs to 0.982: technically excellent, but overpowered. You’ve bought far more power than the 80% target asked for, spending more than 4,000 extra users per arm to shave a false-negative rate that was already at the target.


Reading the Power Curve

Compute simulate_power across a whole range of sample sizes and plot power against n, and the points trace a characteristic S-shaped curve. It starts near α (0.05 here) when n is tiny, because a test with almost no data flags a real effect barely more often than it flags pure noise. It rises steeply through the middle, then flattens as it approaches 1.0, where more users add almost nothing.

[Figure: S-shaped power curve, estimated by Monte-Carlo simulation, showing power against users per arm for detecting a 10%→12% lift. Marked points: 1,500 users (power 0.416, underpowered); 3,839 users (power 0.802, the formula's n, on the 0.80 target line); 8,000 users (power 0.982, overpowered).]
Caption: The power curve for a 10%→12% lift. Power rises with sample size, and the formula's n=3,839 lands right on the 0.80 target line, as the simulation confirms.
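The points behind a curve like this can be generated with a short loop. The sketch below repeats the lesson’s two functions so it runs on its own; the grid of sample sizes, the reduced sims=20000, and the helper name power_curve are our own choices, and the text bars stand in for a real plot.

```python
import numpy as np
from scipy.stats import norm

def two_prop_p(c1, n1, c2, n2):          # two-sided p-value for a difference in rates
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * (1 - norm.cdf(np.abs(z)))

def simulate_power(p_c, p_t, n, sims=20000, seed=202):
    rng = np.random.default_rng(seed)
    c1 = rng.binomial(n, p_c, size=sims)     # control arm
    c2 = rng.binomial(n, p_t, size=sims)     # treatment arm (real effect)
    return float(np.mean(two_prop_p(c1, n, c2, n) < 0.05))

def power_curve(p_c, p_t, ns):
    """Return (n, estimated power) pairs for each sample size in ns."""
    return [(n, simulate_power(p_c, p_t, n)) for n in ns]

grid = [250, 500, 1000, 1500, 2500, 3839, 5500, 8000]   # illustrative grid
curve = power_curve(0.10, 0.12, grid)
for n, pw in curve:
    bar = "#" * int(round(pw * 40))          # crude text rendering of the curve
    print(f"n={n:>5}  power={pw:.3f}  {bar}")
```

Feeding the same pairs to any plotting library gives the S-shape: short bars at the left, a steep climb through the middle, and a plateau near 1.0.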

Reading it is the practical skill. You pick your target power — 0.80 is the near-universal convention — draw a horizontal line, and find where the curve crosses it. That crossing is your sample size. Everything to the left of it is the gambling zone: too few users, and you’re running an experiment that will probably miss a real effect. Everything far to the right is the overspending zone: real power gains have flattened out, and each extra user buys almost nothing while still costing you traffic and time. The curve turns “how many users?” from a guess into a reading: find the target line, find the crossing, that’s your n.
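Finding the crossing can also be done numerically. This is a rough sketch, assuming a coarse grid and straight-line interpolation between simulated points; the grid values and sims=20000 are our own choices, and the result carries Monte-Carlo noise, so treat it as a reading off the curve rather than an exact answer.

```python
import numpy as np
from scipy.stats import norm

def two_prop_p(c1, n1, c2, n2):          # two-sided p-value for a difference in rates
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * (1 - norm.cdf(np.abs(z)))

def simulate_power(p_c, p_t, n, sims=20000, seed=202):
    rng = np.random.default_rng(seed)
    c1 = rng.binomial(n, p_c, size=sims)
    c2 = rng.binomial(n, p_t, size=sims)
    return float(np.mean(two_prop_p(c1, n, c2, n) < 0.05))

target = 0.80
grid = np.array([2000, 3000, 4000, 5000, 6000])          # illustrative grid
powers = np.array([simulate_power(0.10, 0.12, int(n)) for n in grid])

# np.interp needs increasing x-values; here x is power and y is n,
# so we are reading the curve "sideways" at the target line.
n_cross = float(np.interp(target, powers, grid))
print(f"curve crosses {target:.2f} near n = {n_cross:.0f} users per arm")
```

Because the curve bends downward in this region, a straight line between grid points tends to place the crossing slightly to the right of the true one; a finer grid around the crossing tightens the reading.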

Simulation Works Where Formulas Don’t

The reason to reach for simulation isn’t just to double-check the formula — though it does that beautifully here. It’s that the clean formula only exists for simple cases like a difference in two proportions. The moment your metric gets realistic — a ratio metric (revenue per user), clustered randomization, a CUPED-adjusted outcome, or any heavily skewed, non-normal distribution — the tidy algebra falls apart or requires heroic approximations. Simulation doesn’t care. If you can generate data from your assumed world and run your test on it, you can count power. The draw-test-count loop is a general, near-universal tool; the formula is the special case.
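To make that concrete, here is a hedged sketch of the same draw-test-count loop for revenue per user. Everything about the assumed world is hypothetical: purchases happen at the 10% and 12% rates used above, basket sizes are lognormal with made-up parameters, and Welch’s t-test stands in for whatever test your team actually runs.

```python
import numpy as np
from scipy.stats import ttest_ind

def simulate_revenue_power(p_c, p_t, n, sims=1000, seed=7,
                           log_mean=3.0, log_sigma=1.0, alpha=0.05):
    """Monte-Carlo power for revenue per user.

    ASSUMPTION: a zero-inflated lognormal world (most users spend nothing,
    buyers have a heavily skewed basket size). Parameters are illustrative.
    """
    rng = np.random.default_rng(seed)

    def draw_arm(buy_rate):
        buys = rng.random((sims, n)) < buy_rate                 # who purchases at all
        spend = rng.lognormal(log_mean, log_sigma, (sims, n))   # skewed basket size
        return buys * spend                                     # non-buyers contribute 0

    control = draw_arm(p_c)
    treatment = draw_arm(p_t)
    # One Welch t-test per simulated experiment (each row is one experiment)
    p_values = ttest_ind(treatment, control, axis=1, equal_var=False).pvalue
    return float(np.mean(p_values < alpha))

power = simulate_revenue_power(0.10, 0.12, n=4000)       # real effect present
false_pos = simulate_revenue_power(0.10, 0.10, n=4000)   # no effect: should sit near alpha
print(f"power={power:.3f}  false-positive rate={false_pos:.3f}")
```

Only draw_arm and the test changed; the counting step is identical to simulate_power. Running the no-effect case alongside is a cheap sanity check that the chosen test holds its false-positive rate on skewed data.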


Practice Exercises

Exercise 1: What is the fraction counting?

In simulate_power, the treatment arm is drawn with a true rate p_t strictly higher than p_c. Given that, explain in one sentence why np.mean(two_prop_p(...) < 0.05) estimates power rather than the false-positive rate.

Hint

Because a real effect exists in every simulated experiment — the two arms have genuinely different true rates — so a “significant” result is a correct detection, not a false alarm. The fraction of runs that reach p<0.05 is therefore the probability of correctly detecting the effect, which is exactly power. (Contrast Lesson 1, where both arms shared one rate, so the same fraction measured the false-positive rate α instead.)
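The contrast is easy to check by running the same function twice, once with equal rates and once with the real lift. This sketch repeats the lesson’s functions so it runs on its own; the variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def two_prop_p(c1, n1, c2, n2):          # two-sided p-value for a difference in rates
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * (1 - norm.cdf(np.abs(z)))

def simulate_power(p_c, p_t, n, sims=40000, seed=202):
    rng = np.random.default_rng(seed)
    c1 = rng.binomial(n, p_c, size=sims)
    c2 = rng.binomial(n, p_t, size=sims)
    return float(np.mean(two_prop_p(c1, n, c2, n) < 0.05))

# Same code, different world: only the true treatment rate changes.
false_positive_rate = simulate_power(0.10, 0.10, 3839)   # no effect to find
power = simulate_power(0.10, 0.12, 3839)                 # real 2-point lift
print(f"no effect:   fraction significant = {false_positive_rate:.3f}")
print(f"real effect: fraction significant = {power:.3f}")
```

The first number should sit close to α = 0.05 and the second close to 0.80: the counting is identical, and what the fraction means depends entirely on whether the simulated world contains an effect.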

Exercise 2: The 0.802 coincidence that isn’t

The formula gave n=3,839 for 80% power, and the simulation returned 0.802. Is it a coincidence that they’re this close, and what would make the simulation’s number wobble?

Hint

It’s not a coincidence — both are estimating the same true power, so they should agree; that’s the confirmation. The 0.802 is a Monte-Carlo estimate from 40,000 experiments, so it carries sampling noise and would wobble slightly if you changed the seed or the number of simulations. Run more experiments and it tightens toward the true value near 0.80; the small gap from exactly 0.800 is just that noise, not disagreement.
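The size of that wobble can be estimated with the usual binomial standard error for a proportion, sqrt(p(1−p)/sims). The helper name below is ours, and the 1.96 multiplier assumes a normal approximation for the interval.

```python
import math

def mc_standard_error(power_hat, sims):
    """Standard error of a Monte-Carlo proportion estimated from `sims` runs."""
    return math.sqrt(power_hat * (1 - power_hat) / sims)

se = mc_standard_error(0.802, 40_000)
low, high = 0.802 - 1.96 * se, 0.802 + 1.96 * se   # approximate 95% interval
print(f"standard error = {se:.4f}")
print(f"approx. 95% interval = ({low:.3f}, {high:.3f})")
```

The standard error comes out at about 0.002, giving an interval of roughly 0.798 to 0.806, so 0.802 is well within noise of the 0.80 target. Quadrupling the number of simulations halves the standard error.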

Exercise 3: Which sample size, and why not the biggest?

Of n = 1,500, 3,839, and 8,000, which does the formula pick for 80% power — and why is picking 8,000 “to be safe” a mistake?

Hint

The formula picks n = 3,839, the smallest sample that reaches the 0.80 target (its simulated power is 0.802). Choosing 8,000 raises power to 0.982, but that’s overpowered: you’d more than double the users per arm to buy a sliver of extra detection you didn’t need, spending traffic and calendar time that could have gone to the next experiment. The point of sizing is to hit your target, not to exceed it — 1,500 gambles (0.416) and 8,000 overspends, so 3,839 is the deliberate choice.


Summary

Power isn’t only a formula; it’s a quantity you can measure. Build a world where the effect is real, simulate many experiments at a given n, run the significance test on each, and the fraction reaching p<0.05 estimates the power. We ran this for real as a seeded NumPy/SciPy Monte-Carlo simulation: at the formula’s prescribed n = 3,839 the simulation returned 0.802, an independent confirmation that the algebra and the brute-force experiment agree. The other sample sizes showed the stakes: 1,500 gave only 0.416 power (underpowered, misses a real lift over half the time) and 8,000 gave 0.982 (overpowered, wasted traffic). Plot power against n and you get an S-shaped power curve; pick the sample size where it crosses your 0.80 target. And because simulation only needs you to generate data and run a test, it’s the general tool that keeps working for ratio metrics, clustered designs, and CUPED, where no clean formula exists.

Key Concepts

  • Power by simulation — draw many experiments with a real effect, run the test, count the fraction that reach p<0.05; that fraction is the power.
  • Formula–simulation agreement — the formula’s n=3,839 and the Monte-Carlo 0.802 confirm each other, so neither is a black box.
  • The power curve — an S-shaped rise from ~α toward 1.0; pick the n where it crosses your target power.
  • Simulation is general — it validates the formula and handles complex designs (ratio metrics, clustering, CUPED) that have no clean formula.

Why This Matters

A sample-size formula you can’t check is a sample-size formula you can’t fully trust — and the day your metric stops being a simple proportion, the formula won’t exist at all. Learning to estimate power by simulation gives you both a way to verify the tidy cases and a way to size the messy ones that dominate real experimentation. Teams that can only plug into a formula are stuck the moment they measure revenue per user or randomize by store instead of by visitor; teams that can simulate size any experiment they can describe. Next, you’ll put all of it together on Lumen’s real experiment — turning a business question into a target effect, a power goal, and a defensible sample size.


Next Steps

Continue to Lesson 5 - Guided Project: Size Lumen's Experiment

Put the formula and the simulation together to size a real Lumen experiment end to end.


Continue Building Your Skills

You can now measure power directly instead of only computing it: draw experiments with a real effect, run the test, count the significant ones. You’ve watched a Monte-Carlo simulation land on 0.802 at exactly the n=3,839 the formula prescribed for 80% power. You can read a power curve to find the sample size that hits your target without gambling or overspending, and you know that simulation is the tool that keeps working when the formula runs out. Next you’ll bring every piece of this module together to size a real Lumen experiment from scratch.