Lesson 2 - Sequential Testing

Welcome to Sequential Testing

In Module 6 you saw the peeking problem in its full ugliness: if you check a running experiment repeatedly and stop the instant p < 0.05, your false-positive rate balloons far above the 5% you signed up for. Each look is another chance for noise to cross the line, and those chances add up. The advice that followed was blunt — commit to a sample size, look once. But that advice fights human nature and good business sense. Teams genuinely want to stop early: to stop wasting traffic on a clear winner, and to kill a losing variant before it costs more revenue. Sequential testing resolves the tension. It doesn’t forbid peeking — it makes peeking valid, by replacing the single-look threshold with a stricter boundary that budgets your error across every look you take.

By the end of this lesson, you will be able to:

  • Explain why repeated looks at the 1.96 line inflate the false-positive rate
  • Describe a group-sequential design and the constant Pocock boundary
  • Calibrate a corrected boundary by simulation so family-wise error stays at 5%
  • Stop an experiment early — at the first boundary crossing — without cheating

Let’s start with why the ordinary threshold breaks.


Why Looking Repeatedly Breaks the Threshold

The single-look test is calibrated for exactly one decision: collect your data, compute z, reject the null if |z| > 1.96. That 1.96 is tuned so that under the null, the chance of crossing it is 5%. The moment you look more than once, that logic collapses — you now have five (or fifty) chances to cross the same line, and under the null some of those crossings are pure noise. The false-positive rate is no longer 5%; it’s the chance that any look crosses, which is much larger.
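
A quick back-of-the-envelope calculation shows the direction of the problem. The sketch below assumes, purely for illustration, that the looks are independent 5% tests; the helper name independent_bound is hypothetical. Real looks reuse the data from earlier looks, so they are correlated and the true inflation is smaller than this figure, but it grows with the number of looks in the same way. For five looks the independent calculation gives about 0.226.

```python
def independent_bound(K, alpha=0.05):
    """Chance that at least one of K independent alpha-level looks crosses under the null.

    Assumption for illustration only: real sequential looks share data, so they
    are correlated and the actual any-look error rate is lower than this value.
    """
    return 1 - (1 - alpha) ** K

for K in (1, 5, 20):
    print(K, round(independent_bound(K), 3))
```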

The fix is not to ban looking — it’s to raise the bar. If you know in advance you’ll look K times, use a higher boundary at each look, calibrated so the overall (family-wise) false-positive rate across all K looks stays at 5%. This is a group-sequential design. The simplest version keeps the same boundary at every look — a constant threshold — and is called a Pocock boundary. The whole trick is spending your 5% error budget across the looks instead of blowing it all on each one.

Figure: |z| across 5 looks as data accumulates, with the naive 1.96 line (red, dashed) and the calibrated 2.41 boundary (green, solid). The example path rises above 1.96 at looks 3 and 4, where a naive tester would stop and declare a false winner, but stays below 2.41, so the experiment correctly keeps going.
Looking five times at the ordinary 1.96 line lets noise cross too often (14% false positives); a calibrated boundary at |z| > 2.41 restores the 5% guarantee while still letting you stop the moment it’s crossed.

Calibrating the Boundary by Simulation

Let’s check all of this by simulation. We run 40,000 null experiments: A/A tests where both arms have the same 10% conversion rate, so there is no real effect. Data arrives in batches of 400 users per arm, and we peek K = 5 times as it accumulates, once after each batch. First we measure how often naive peeking (checking |z| > 1.96 at each look) falsely declares a winner. Then we find the constant boundary c that pushes that error back down to 5%.

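import numpy as np

def two_prop_z(c1, n1, c2, n2):
    """Pooled two-proportion z statistic (works elementwise on arrays)."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)                   # pooled conversion rate
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

rng = np.random.default_rng(7)
# experiments, looks, users per arm per batch, shared conversion rate
N, K, batch, p0 = 40000, 5, 400, 0.10
# conversions per batch for control (ic) and treatment (it); same p0, so A/A
ic = rng.binomial(batch, p0, size=(N, K))
it = rng.binomial(batch, p0, size=(N, K))
cc, ct = ic.cumsum(1), it.cumsum(1)             # cumulative conversions at each look
cn = batch * np.arange(1, K + 1)                # cumulative users per arm at each look
z = two_prop_z(cc, cn, ct, cn)                  # shape (N, K): one z per look
maxz = np.abs(z).max(1)                         # largest |z| over the K looks
naive = np.mean((np.abs(z) > 1.96).any(1))      # peek at 1.96 each look
c = np.quantile(maxz, 0.95)                     # calibrated boundary
corrected = np.mean((np.abs(z) > c).any(1))
print(round(float(naive), 3), round(float(c), 3), round(float(corrected), 3))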

Running it:

0.14 2.41 0.05

Read those three numbers carefully. Peeking five times at the ordinary 1.96 line gives a 0.140 false-positive rate: 14%, nearly triple the 5% you thought you had. The calibrated constant boundary is c = 2.410: requiring |z| > 2.41 at every look brings the family-wise error back to 0.050. (It lands on 0.050 by construction, because c is the 95th percentile of the max |z| in these same simulated runs.) That boundary corresponds to a per-look nominal alpha of about 0.016, from 2*(1 - Φ(2.41)), so each individual look uses a much stricter threshold than the naive 0.05. That strictness is precisely what pays for the right to look five times.
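
The per-look alpha can be checked directly from the normal tail. The sketch below uses scipy.stats.norm and takes the rounded boundary 2.41 as its input. It also computes, as a point of comparison that the lesson itself does not use, the Bonferroni boundary you would get by splitting 0.05 evenly over five looks and ignoring the correlation between them.

```python
from scipy.stats import norm

c = 2.41   # calibrated constant boundary reported above (rounded)
K = 5      # number of planned looks

# Two-sided tail area beyond |z| = c: the nominal alpha used at each look.
per_look_alpha = 2 * norm.sf(c)

# Comparison only: Bonferroni splits alpha evenly and ignores that looks are correlated.
bonferroni_z = norm.isf(0.05 / K / 2)

print(round(per_look_alpha, 4), round(bonferroni_z, 3))
```

The Bonferroni boundary comes out near 2.576, higher than 2.41. Calibrating on the joint behaviour of the looks buys a lower bar than a correction that treats them as unrelated.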

The payoff: with the 2.41 boundary in hand, you can now watch the experiment and stop the moment |z| crosses it — at look 2, look 3, whenever — and your 5% guarantee still holds. That’s valid early stopping. The price is a slightly higher bar than 1.96, but that’s a bargain compared to the two bad alternatives: silently inflating your error to 14%, or being forbidden from looking at all.
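
As a minimal sketch of that stopping rule, the helper below (first_crossing is a hypothetical name, and the z values are made up for illustration) returns the first look at which |z| exceeds a given boundary. The example path mimics the figure: it pokes above 1.96 mid-experiment but never reaches 2.41.

```python
import numpy as np

def first_crossing(z_path, boundary):
    """Return the 1-based look at which |z| first exceeds the boundary, or None."""
    crossed = np.abs(np.asarray(z_path, dtype=float)) > boundary
    return int(np.argmax(crossed)) + 1 if crossed.any() else None

# Hypothetical z values at five looks (invented for this example).
z_path = [0.8, 1.7, 2.1, 2.2, 1.5]

naive_stop = first_crossing(z_path, 1.96)   # naive rule: stops at look 3
valid_stop = first_crossing(z_path, 2.41)   # calibrated rule: never stops
print(naive_stop, valid_stop)               # prints: 3 None
```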

Pocock vs. O’Brien-Fleming

The constant boundary we simulated is a Pocock design: the same threshold at every look. That has two costs. Early looks, based on thin data, face the same bar as later ones, so a Pocock design stops early fairly readily; and the final look still has to clear 2.41 rather than something close to 1.96, which costs power if the experiment runs to its planned end. O’Brien-Fleming boundaries instead start very strict and relax toward the end: they almost never stop on the first sliver of data, and by the final look the boundary is near the familiar 1.96. That shape is often preferred because it protects against over-reacting to early noise while costing very little at the planned end. Both shapes can be reproduced, at least approximately, by alpha-spending functions, which let you choose how much of your 0.05 budget to spend at each look. Modern platforms go further with always-valid p-values and confidence sequences (for example, mSPRT) that let you look continuously. The unifying takeaway: peeking is fine as long as the boundary was designed for it.
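
The same simulate-and-take-a-quantile idea can calibrate an O’Brien-Fleming shape. The sketch below makes two assumptions that are not in the lesson’s code: it uses a Gaussian approximation of the null (the z statistic at look k is a running sum of k independent standard normal batch scores divided by sqrt(k), which holds for equal-sized batches), and it takes the O’Brien-Fleming boundary at look k to be C * sqrt(K / k). The names obf_C and obf_bounds are illustrative.

```python
import numpy as np

K, n_sims = 5, 40000
rng = np.random.default_rng(0)
looks = np.arange(1, K + 1)

# Null z paths under the Gaussian approximation (assumes equal batch sizes).
z = rng.standard_normal((n_sims, K)).cumsum(axis=1) / np.sqrt(looks)

# Pocock: one constant c, calibrated as the 95th percentile of max_k |z_k|.
pocock_c = np.quantile(np.abs(z).max(axis=1), 0.95)

# O'Brien-Fleming shape: reject when |z_k| > C * sqrt(K / k), which is the same
# as |z_k| * sqrt(k / K) > C, so calibrate C on the max of the rescaled path.
obf_C = np.quantile((np.abs(z) * np.sqrt(looks / K)).max(axis=1), 0.95)
obf_bounds = obf_C * np.sqrt(K / looks)

print("Pocock:", np.round(np.full(K, pocock_c), 2))
print("OBF:   ", np.round(obf_bounds, 2))
```

Under these assumptions the Pocock row should sit near 2.41 at every look, while the O’Brien-Fleming row should start above 4 and fall to just over 2 at the final look.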


Practice Exercises

Exercise 1: Why is 14% the number that matters?

The naive procedure crossed 1.96 on 14% of the null experiments. Why is that the figure that should alarm you, rather than the fact that any single look has a 5% error?

Hint

Each individual look really does have a 5% chance of crossing 1.96 under the null — nothing is wrong with any single test in isolation. The problem is the decision rule: “stop and declare a winner if any of the five looks crosses.” That rule fires whenever at least one look crosses, and with five correlated chances the probability of at least one crossing climbs to 14%. False-positive control is about the whole procedure, not one look — and the procedure, not the individual test, is what you actually deploy.
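
One way to see the distinction is to measure both rates on the same simulated null experiments. This is a sketch under a Gaussian approximation of the null (equal batch sizes assumed), not the binomial simulation used earlier, so the numbers will differ slightly from the lesson’s run.

```python
import numpy as np

K, n_sims = 5, 40000
rng = np.random.default_rng(1)
looks = np.arange(1, K + 1)

# Null z paths: running sum of standard normal batch scores over sqrt(k).
z = rng.standard_normal((n_sims, K)).cumsum(axis=1) / np.sqrt(looks)

crossed = np.abs(z) > 1.96
per_look = crossed.mean(axis=0)          # error rate of each look taken alone
any_look = crossed.any(axis=1).mean()    # error rate of "stop at any crossing"

print(np.round(per_look, 3), round(float(any_look), 3))
```

Each entry of per_look should hover around 0.05, while any_look should come out well above it.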

Exercise 2: Where does 2.41 come from?

The calibrated boundary is c = np.quantile(maxz, 0.95). Explain in words what maxz is and why its 95th percentile is exactly the boundary you want.

Hint

For each null experiment, maxz is the largest |z| observed across its five looks — the closest that experiment ever came to falsely crossing. The decision rule fires exactly when this maximum exceeds the boundary. So if you set the boundary at the 95th percentile of maxz, then by construction only 5% of null experiments have a max that exceeds it — which means the family-wise false-positive rate is 5%. Calibrating on the maximum is what accounts for all five looks at once.

Exercise 3: What if you plan more looks?

Your teammate wants to peek 20 times instead of 5. Without rerunning the code, what happens to the naive error rate and to the calibrated boundary c?

Hint

More looks means more chances for noise to cross 1.96, so the naive error rate climbs even higher than 14% — toward and past 20% as K grows. To compensate, the calibrated boundary c must move higher than 2.41, because you’re now spending your fixed 5% budget across 20 correlated looks instead of 5, so each look must be stricter. The lesson: the boundary isn’t a universal constant — it depends on how many times you plan to look, which is why you decide K (or your spending function) before you start.
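
If you want to check your answer afterwards, a sketch like the following compares K = 5 with K = 20. It assumes a Gaussian approximation of the null with equal-sized batches, and the helper name naive_rate_and_boundary is hypothetical; it is not the lesson’s binomial simulation, so treat the output as indicative.

```python
import numpy as np

def naive_rate_and_boundary(K, n_sims=40000, seed=2):
    """Return (naive any-look error at 1.96, calibrated constant boundary) for K looks."""
    rng = np.random.default_rng(seed)
    looks = np.arange(1, K + 1)
    # Null z paths: running sum of standard normal batch scores over sqrt(k).
    z = rng.standard_normal((n_sims, K)).cumsum(axis=1) / np.sqrt(looks)
    max_abs_z = np.abs(z).max(axis=1)
    naive = float(np.mean(max_abs_z > 1.96))   # any look crosses 1.96
    c = float(np.quantile(max_abs_z, 0.95))    # boundary giving 5% family-wise error
    return naive, c

for K in (5, 20):
    naive, c = naive_rate_and_boundary(K)
    print(K, round(naive, 3), round(c, 3))
```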


Summary

Sequential testing turns early stopping from a statistical mistake into a valid decision rule. Module 6 showed that peeking at the ordinary |z| > 1.96 line and stopping at the first crossing inflates the false-positive rate; in our simulation of 40,000 null experiments with five looks, it reached 14%. A group-sequential design fixes this by raising the bar: the constant Pocock boundary of |z| > 2.41 (a per-look alpha of about 0.016) brings the family-wise error back to 5% in the simulation, while still letting you stop the instant the boundary is crossed. You pay a slightly stricter threshold for the right to look repeatedly, a trade far better than either inflating your error or being forbidden to peek at all.

Key Concepts

  • Peeking inflates error — five looks at 1.96 push the false-positive rate from 5% to ~14%.
  • Group-sequential design — use a higher boundary at each look so family-wise error stays at 5%.
  • Pocock boundary — a constant threshold (here 2.41) calibrated as the 95th percentile of the max |z|.
  • Valid early stopping — stop the moment the calibrated boundary is crossed, guarantee intact.

Why This Matters

Early stopping is one of the most valuable things an experimentation platform can offer: it lets you cut losers fast, ship winners sooner, and free traffic for the next test — turning experimentation into a faster feedback loop. Naive peeking promises all of that but quietly corrupts your error rate, which is why so many “significant” early calls fail to replicate. Sequential methods give you the speed and the guarantee, which is why Pocock, O’Brien-Fleming, alpha-spending, and always-valid confidence sequences are standard in modern platforms. The deeper lesson echoes the whole course: the validity of a test lives in the decision rule, not any single number — design the rule for how you’ll actually use it, and peeking becomes a tool instead of a trap. Next, you’ll step outside the frequentist frame entirely and ask a different question with Bayesian A/B testing.


Next Steps

Continue to Lesson 3 - Bayesian A/B Testing

A different question entirely: the probability that B beats A, updated as data arrives — and what that buys you over p-values.



Continue Building Your Skills

You’ve seen how a stricter, well-designed boundary makes peeking honest: five looks at 2.41 instead of 1.96 keeps your false-positive rate at 5% while letting you stop the moment the evidence lands. That’s the frequentist way to earn the right to look early. Next you’ll reframe the whole problem: Bayesian A/B testing lets you ask directly for the probability that B beats A and update it continuously as data flows in — a natural fit for the peeking instinct, with its own set of trade-offs to understand.