Lesson 1 - Peeking and Optional Stopping

Welcome to Peeking and Optional Stopping

Here’s a habit that feels responsible and is quietly disastrous: you launch an experiment, and because you’re eager (and the dashboard is right there), you check it every day. The moment it crosses p < 0.05, you stop and declare a winner. What could be wrong with stopping as soon as you have significance? Everything. That single, intuitive behavior — peeking and optional stopping — is probably the most common way careful teams ship false positives, and it doesn’t announce itself: every individual p-value is computed correctly. In this lesson you’ll watch it inflate your error rate almost four-fold, on data where there’s nothing to find at all.

By the end of this lesson, you will be able to:

  • Explain why repeatedly testing and stopping early inflates the false-positive rate
  • Quantify the inflation with a simulation of null experiments
  • Distinguish a fixed-horizon analysis from optional stopping
  • Name the correct fixes: pre-set sample sizes and sequential tests

Let’s watch the damage.


Why Peeking Breaks the Test

Recall what α = 0.05 promises: if there’s no real effect, a single test wrongly declares significance 5% of the time. That guarantee is about one look. But a p-value computed on accumulating data doesn’t sit still — it wanders, drifting up and down as each new user arrives and the sample grows. Give a wandering value enough chances to cross the 0.05 line, and eventually, by pure chance, it will — even when the two groups are truly identical.

Peeking exploits exactly that. Each time you check and are willing to stop, you give the noise another opportunity to dip below the threshold. You’re no longer asking “is this one result significant?” — you’re asking “did it cross the line at any of my checks?”, and the answer is “yes” far more than 5% of the time. The more often you look, the worse it gets.

Figure: A p-value wanders as data accumulates over the 10 checkpoints of a null A/A test, with a dashed line at p = 0.05. Checking repeatedly and stopping at the first dip below 0.05 catches these random crossings: over 40,000 null experiments, one honest look at the end gives a 5.1% false-positive rate, while stopping at the first dip gives 19.0%, nearly four times more false winners.
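To see the wandering for yourself, here is a minimal sketch that traces the running p-value of one A/A experiment across its 10 checkpoints. It assumes the same pooled two-proportion z-test and the same setup as the simulation in this lesson (200 users per arm per batch, a 10% rate in both arms). The seed is arbitrary, and whether this particular path dips below 0.05 depends on it.

```python
import numpy as np
from scipy import stats

def two_prop_p(c1, n1, c2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 2 * stats.norm.sf(np.abs((p2 - p1) / se))

rng = np.random.default_rng(7)       # arbitrary seed, for reproducibility only
K, batch, p0 = 10, 200, 0.10         # 10 checkpoints, 200 users per arm per batch

# One A/A experiment: both arms convert at the same true rate p0.
cum_c = rng.binomial(batch, p0, size=K).cumsum()
cum_t = rng.binomial(batch, p0, size=K).cumsum()
cum_n = batch * np.arange(1, K + 1)  # users per arm at each checkpoint

path = two_prop_p(cum_c, cum_n, cum_t, cum_n)
for k, p in enumerate(path, start=1):
    flag = "  <- below 0.05" if p < 0.05 else ""
    print(f"checkpoint {k:2d}  n/arm={cum_n[k - 1]:5d}  p={p:.3f}{flag}")
```

Rerun it with a few different seeds: most paths stay above 0.05 throughout, but some dip below it at an interim checkpoint and climb back by the end.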

Watch the Inflation

Let’s prove it rather than assert it. We simulate 40,000 A/A experiments in which both groups are drawn from the same 10% rate, so any “significant” result is a false positive by construction. Users arrive in 10 batches of 200 per arm, and we compute the running two-proportion p-value at each of the 10 checkpoints. Then we compare two policies: one honest look at the end, versus peeking and stopping the first time we see p < 0.05.

import numpy as np
from scipy import stats

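def two_prop_p(c1, n1, c2, n2):
    # Two-sided p-value from a pooled two-proportion z-test.
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)                       # pooled rate under the null
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))   # standard error of p2 - p1
    return 2 * (1 - stats.norm.cdf(np.abs((p2 - p1) / se)))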

rng = np.random.default_rng(101)
N, K, batch, p0 = 40000, 10, 200, 0.10       # A/A: both arms share rate p0

inc_c = rng.binomial(batch, p0, size=(N, K))  # conversions per batch
inc_t = rng.binomial(batch, p0, size=(N, K))
cum_c, cum_t = inc_c.cumsum(1), inc_t.cumsum(1)
cum_n = batch * np.arange(1, K + 1)           # users per arm at each checkpoint
pvals = two_prop_p(cum_c, cum_n, cum_t, cum_n)   # shape (N, 10)

single_look = np.mean(pvals[:, -1] < 0.05)             # look once, at the end
peeked      = np.mean((pvals < 0.05).any(axis=1))      # stop at first significant look
print(f"single fixed look:  {single_look:.3f}")
print(f"peeking (10 looks): {peeked:.3f}")

Running it:

single fixed look:  0.051
peeking (10 looks): 0.190

The honest single look gives 0.051, within simulation noise of the 5% we signed up for. Peeking at 10 checkpoints gives 0.190: nearly one in five of these completely null experiments produces a “significant winner” if you stop at the first crossing. Same data, same correct p-value formula; the only thing that changed is the decision to keep looking and stop early. That’s the entire mistake, and it nearly quadrupled the false-positive rate.
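A natural guess is that 10 looks at α = 0.05 behave like 10 independent tests, which would give 1 − 0.95^10 ≈ 40% false positives. The sketch below compares that independence figure with the simulated rate for 1, 2, 5, and 10 looks. It reuses this lesson's setup with a smaller run of 20,000 experiments and an arbitrary seed, and the look schedules are illustrative choices that each end at the final checkpoint.

```python
import numpy as np
from scipy import stats

def two_prop_p(c1, n1, c2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 2 * stats.norm.sf(np.abs((p2 - p1) / se))

rng = np.random.default_rng(2024)    # arbitrary seed
N, K, batch, p0, alpha = 20000, 10, 200, 0.10, 0.05

cum_c = rng.binomial(batch, p0, size=(N, K)).cumsum(axis=1)
cum_t = rng.binomial(batch, p0, size=(N, K)).cumsum(axis=1)
cum_n = batch * np.arange(1, K + 1)
sig = two_prop_p(cum_c, cum_n, cum_t, cum_n) < alpha   # (N, K) booleans

# Checkpoint indices inspected under each schedule (index 9 is the final look).
schedules = {1: [9], 2: [4, 9], 5: [1, 3, 5, 7, 9], 10: list(range(K))}

rates = {}
for looks, idx in schedules.items():
    rates[looks] = sig[:, idx].any(axis=1).mean()      # significant at any look
    independent = 1 - (1 - alpha) ** looks             # if looks were independent
    print(f"{looks:2d} looks: simulated {rates[looks]:.3f}   "
          f"independent tests {independent:.3f}")
```

The simulated rates should land below the independence figure, because successive looks share most of their data and are strongly correlated, yet they still climb well above 5% as looks are added.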

This is why you commit to a sample size in advance

The fix isn’t “never look”: teams need to monitor experiments for bugs and disasters. The fix is to decide the stopping rule before you start and not let a peek trigger a decision it wasn’t designed for. That’s exactly why Module 3 had you compute a sample size up front: you run to that pre-set number of users and analyze once. If you genuinely need to check as you go and be able to stop early, that’s a real, solved problem, but it requires methods built for repeated looks: sequential tests (alpha-spending, group-sequential designs) that budget the error across looks, or Bayesian approaches with a decision rule fixed in advance. What you can’t do is run a fixed-horizon test and treat every daily peek as a valid stopping point.
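As a rough illustration of what budgeting the error across looks means, the sketch below applies the simplest possible budget: a Bonferroni split that requires p < 0.05/10 = 0.005 at each of the 10 looks. This is a stand-in chosen for simplicity, not a proper alpha-spending or group-sequential boundary. It keeps the overall rate at or below 5% but is conservative, so a purpose-built design would give up less power. It assumes the same A/A setup as this lesson, with 20,000 experiments and an arbitrary seed.

```python
import numpy as np
from scipy import stats

def two_prop_p(c1, n1, c2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 2 * stats.norm.sf(np.abs((p2 - p1) / se))

rng = np.random.default_rng(5)       # arbitrary seed
N, K, batch, p0, alpha = 20000, 10, 200, 0.10, 0.05

cum_c = rng.binomial(batch, p0, size=(N, K)).cumsum(axis=1)
cum_t = rng.binomial(batch, p0, size=(N, K)).cumsum(axis=1)
cum_n = batch * np.arange(1, K + 1)
pvals = two_prop_p(cum_c, cum_n, cum_t, cum_n)

# Naive peeking: stop the first time p < alpha at any look.
naive = (pvals < alpha).any(axis=1).mean()
# Bonferroni budget: stop only if p < alpha / K at some look.
bonferroni = (pvals < alpha / K).any(axis=1).mean()

print(f"naive peeking (p < {alpha}):         {naive:.3f}")
print(f"Bonferroni looks (p < {alpha / K}):  {bonferroni:.3f}")
```

The stricter per-look threshold is the price of being allowed to stop early: each look spends part of a fixed 5% budget instead of getting a fresh 5% of its own.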


Practice Exercises

Exercise 1: Why does looking more hurt more?

The simulation peeked 10 times and got a 19% false-positive rate. Without rerunning, would peeking 30 times be better, worse, or the same?

Hint

Worse. Every additional look is another chance for the wandering p-value to dip below 0.05 by chance, so the probability of ever crossing only grows as you add checkpoints. Peeking 30 times would push the false-positive rate above 19% (toward ~25–30%). In the limit of continuous monitoring with a fixed threshold, a null result will cross 0.05 almost surely if you wait long enough.
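If you want to check this answer afterwards, one idealized sketch skips the conversion data entirely. It assumes equal-sized batches and a z-statistic that is exactly normal, so that under the null the statistic at look k is a sum of k independent standard normal increments divided by √k. Those are simplifying assumptions rather than this lesson's binomial simulation, and the seed and run size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(11)            # arbitrary seed
N, max_looks, z_crit = 20000, 30, 1.96     # 1.96 is the two-sided 5% cutoff

# Each equal-sized batch contributes an independent N(0, 1) increment under the null.
steps = rng.standard_normal((N, max_looks))
k = np.arange(1, max_looks + 1)
z = steps.cumsum(axis=1) / np.sqrt(k)      # z-statistic at each look
crossed = np.abs(z) > z_crit               # nominally significant at that look

fpr_10 = crossed[:, :10].any(axis=1).mean()   # stop at first crossing, 10 looks
fpr_30 = crossed.any(axis=1).mean()           # stop at first crossing, 30 looks

print(f"10 looks: {fpr_10:.3f}")
print(f"30 looks: {fpr_30:.3f}")
```

Because the 30-look rule includes the first 10 looks, its false-positive rate can never be lower than the 10-look rate on the same simulated paths.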

Exercise 2: Is the single-look 0.051 a problem?

The single fixed look came out to 0.051, not exactly 0.050. Does that mean even the honest analysis is slightly broken?

Hint

No. The 0.051 is an estimate of the true single-look rate from 40,000 simulated experiments, and estimates carry sampling noise. The true rate is very close to 0.05 (the z-test is a large-sample approximation, so it need not be exactly 0.05), and running more experiments tightens the estimate toward it. The single-look analysis is correct. The 0.190 from peeking, by contrast, isn’t noise around 0.05; it’s a real, structural inflation caused by the stopping rule.
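The size of that sampling noise is easy to estimate. Assuming the true single-look rate is close to 0.05, a proportion observed across 40,000 simulated experiments has a binomial standard error of sqrt(0.05 × 0.95 / 40,000). The sketch below compares the reported 0.051 (a figure rounded to three decimals) against it.

```python
import math

N = 40_000          # simulated experiments in the lesson
alpha = 0.05        # nominal false-positive rate
observed = 0.051    # single-look rate reported in the lesson (rounded)

mc_se = math.sqrt(alpha * (1 - alpha) / N)   # binomial standard error of a proportion
gap = (observed - alpha) / mc_se             # distance from 0.05 in standard errors

print(f"Monte Carlo SE:  {mc_se:.4f}")       # Monte Carlo SE:  0.0011
print(f"gap in SE units: {gap:.2f}")         # gap in SE units: 0.92
```

A gap of less than one standard error is ordinary simulation noise. By the same yardstick, 0.190 sits more than a hundred standard errors away from 0.05.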

Exercise 3: Spot the peeking

A teammate says: “I set up the test for 7 days, but on day 3 it hit p = 0.04, so I called it and shipped.” What’s wrong, and what should they have done?

Hint

They peeked and stopped early on a fixed-horizon test. The p = 0.04 on day 3 is exactly the kind of random early crossing that pushed the false-positive rate to 19% in the 10-look simulation. The plan was 7 days (a pre-set horizon), so the correct move is to run the full 7 days and analyze once at the end. If early stopping is genuinely needed, they’d have to design a sequential test up front, not improvise one from a daily dashboard.


Summary

Peeking, meaning repeatedly checking an experiment and stopping the moment it crosses significance, is probably the most common way correct calculations still produce false conclusions. The α = 0.05 guarantee applies to a single look, but a p-value on accumulating data wanders, and every additional check gives that noise another chance to dip below 0.05. A simulation of 40,000 null A/A experiments made it concrete: one honest look at the end gave a 0.051 false-positive rate, in line with the promised 5%, while peeking at 10 checkpoints and stopping early gave 0.190, nearly four times higher, on data with no real effect at all. The fix is to commit to a sample size in advance and analyze once, or to use a proper sequential test if you need to stop early.

Key Concepts

  • Peeking / optional stopping — checking repeatedly and stopping at the first significant result.
  • A p-value wanders — on accumulating data it drifts, so it will cross 0.05 by chance if given enough looks.
  • Inflation is real, not noise — 10 looks turned a 5% error rate into 19% on pure nulls.
  • The fixes — pre-set the sample size and analyze once, or use a sequential/group-sequential test.

Why This Matters

Peeking is insidious precisely because it feels like diligence and every number along the way is computed correctly — so it survives review and shows up in real, consequential decisions. A team that stops experiments early “when they look significant” is running at a hidden false-positive rate several times higher than they think, and will ship a stream of changes that do nothing. Understanding why early stopping breaks the guarantee — and that the fix is a stopping rule fixed in advance — is what keeps your significant results actually meaningful. Next, you’ll see the same error-inflation from a different source: testing many things at once.


Next Steps

Continue to Lesson 2 - Multiple Comparisons

Testing many metrics or variants inflates the false-positive rate too — and how corrections rein it back in.


Continue Building Your Skills

You’ve seen how peeking turns a 5% false-positive rate into 19% on data with no effect, and why committing to a sample size in advance is the cure. That’s one way to accidentally inflate your error rate; the next is just as common — running many tests at once. In Lesson 2 you’ll watch 20 harmless metrics produce a false “winner” 64% of the time, and learn how to correct for it.