Lesson 5 - Guided Project: Audit a Lumen Experiment

Welcome to the Guided Project

Across this module you took the failure modes apart one at a time: peeking inflates a 5% error rate to 19%, testing 20 metrics manufactures a false winner 64% of the time, Simpson’s paradox flips the aggregate against every subgroup, and a sample ratio mismatch quietly invalidates the whole assignment. Now you put them back together as one thing you actually do at work: a validity audit. Lumen’s growth team just handed you an experiment with a “significant winner” and a deploy request. Your job is not to admire the p-value — it’s to decide whether the experiment behind the p-value is trustworthy at all. You’ll build a repeatable audit checklist and run it on this exact experiment, and by the end you’ll have a clear, defensible verdict: ship, or don’t.

By the end of this project, you will be able to:

  • Run a sample ratio mismatch check as the first gate on any experiment, before reading the result
  • Interrogate the stopping rule to catch peeking-inflated significance
  • Recompute a segmented result to expose a Simpson’s-paradox reversal driven by a bad mix
  • Assemble a reusable audit checklist that orders the checks so validity comes before significance

We’ll build the audit in stages, reusing the exact tests you wrote earlier in the module. Let’s audit Lumen’s experiment.


Stage 1: The Claim, and the Audit Mindset

Here’s what landed in your inbox. Lumen ran a new onboarding flow as treatment against the current flow as control, on a planned 50/50 split. The dashboard says treatment’s conversion rate is higher, the difference is “significant,” and the growth team wants to ship this week. The instinct is to open the result, see p < 0.05, and nod.

Resist it. A significant result is only trustworthy if the experiment that produced it is valid — and validity is a property of the plumbing, not the p-value. A p-value computed on a broken experiment is a correct calculation on the wrong data, and it will look exactly as convincing as a real one. So the audit checks the plumbing first. Here is the checklist, in the order you run it:

  1. SRM — did users actually split the way the design intended?
  2. Stopping rule — how did the team decide to stop, and does that rule preserve the error rate?
  3. Primary metric only — is the “winner” the one pre-declared metric, or one slice found among many?
  4. Subgroup sanity — does the aggregate agree with the segments, or is a mix effect fooling us?

Only after all four pass do you read the headline. Run them in this order because the earlier a check fails, the more completely it invalidates everything downstream. Start with the one that can void the entire experiment.


Stage 2: SRM Check First

The very first number to check isn’t the conversion rate — it’s the counts. A planned 50/50 split should land close to 50/50. Lumen’s experiment reported 5,200 users in control and 4,800 in treatment. That’s 52/48. It looks like a rounding wobble. It isn’t. With 10,000 users, a chi-squared test against the expected 50/50 tells you whether that gap is plausibly chance.

from scipy import stats

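def srm_pvalue(n_a, n_b, expected_ratio=0.5):
    """Chi-squared goodness-of-fit p-value for an observed two-arm split."""
    total = n_a + n_b
    # Expected counts if assignment had followed the planned ratio.
    exp = [total * expected_ratio, total * (1 - expected_ratio)]
    return float(stats.chisquare([n_a, n_b], f_exp=exp).pvalue)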

print(f"reported split 5200/4800: SRM p = {srm_pvalue(5200, 4800):.2e}")
print(f"healthy split  4988/5012: SRM p = {srm_pvalue(4988, 5012):.3f}")

Running it:

reported split 5200/4800: SRM p = 6.33e-05
healthy split  4988/5012: SRM p = 0.810

The reported split gives p = 6.33e-05 — about six in a hundred thousand. A 52/48 split is not a wobble at this sample size; it’s a signal that the randomization or logging is broken. For contrast, a genuinely healthy split of 4988/5012 gives p = 0.810 — exactly the kind of harmless noise you expect around 50/50. That contrast is the whole point: SRM isn’t about the size of the imbalance, it’s about whether the imbalance is explainable by chance given how many users you have.
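To see that sample size, not the ratio itself, drives the verdict, here is a minimal sketch that holds the split at 52/48 and varies only the total. The totals of 100 and 1,000 are illustrative values, not Lumen data, and srm_pvalue is repeated so the block runs on its own.

```python
from scipy import stats

def srm_pvalue(n_a, n_b, expected_ratio=0.5):
    """Chi-squared goodness-of-fit p-value for an observed two-arm split."""
    total = n_a + n_b
    exp = [total * expected_ratio, total * (1 - expected_ratio)]
    return float(stats.chisquare([n_a, n_b], f_exp=exp).pvalue)

# Same 52/48 ratio at three totals. Only 10,000 is Lumen's; the others are hypothetical.
results = {}
for total in (100, 1_000, 10_000):
    n_a = round(total * 0.52)
    n_b = total - n_a
    results[total] = srm_pvalue(n_a, n_b)
    print(f"{n_a:>5}/{n_b:<5} (n={total:>6}): SRM p = {results[total]:.3g}")
```

At 100 or 1,000 users a 52/48 split sits comfortably within chance; only at 10,000 does the same ratio become a hard fail.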

This is why SRM is check #1. A failed SRM means users did not get assigned the way you designed, so control and treatment are no longer comparable populations — and every downstream number, including that significant p-value, is computed on a corrupted comparison. Strictly, the audit could stop right here: the experiment is broken, fix the pipeline and re-run. But it’s worth walking the rest of the checks, because a failed SRM predicts exactly the kind of damage you’ll find next.

Figure: SRM check against a planned 50/50 split. The healthy 4988/5012 split gives a chi-squared p-value of 0.81 and is marked OK; the reported 5200/4800 split gives p = 6.33e-05, well below the 0.001 threshold, and is marked FAIL: investigate the assignment pipeline before trusting any downstream result.

Stage 3: The Peeking Check

Next, ask how the team decided to stop. This isn’t a code check; it’s a question, and the answer determines whether the “significance” is real. If the experiment was designed for a fixed number of users but someone watched the dashboard and stopped it the first day it crossed p < 0.05, then the significance is inflated, not earned.

You proved why in Lesson 1. A p-value on accumulating data wanders, and every extra look gives it another chance to dip below 0.05 by pure noise. Here’s the evidence you cite when you ask the question:

import numpy as np
from scipy import stats

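def two_prop_p(c1, n1, c2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)                      # pooled rate under the null
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # standard error of p2 - p1
    return 2 * (1 - stats.norm.cdf(np.abs((p2 - p1) / se)))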

rng = np.random.default_rng(101)
N, K, batch, p0 = 40000, 10, 200, 0.10          # A/A: both arms share the same rate
inc_c = rng.binomial(batch, p0, size=(N, K))
inc_t = rng.binomial(batch, p0, size=(N, K))
cum_c, cum_t = inc_c.cumsum(1), inc_t.cumsum(1)
cum_n = batch * np.arange(1, K + 1)
pvals = two_prop_p(cum_c, cum_n, cum_t, cum_n)

single_look = np.mean(pvals[:, -1] < 0.05)          # one honest look at the end
peeked      = np.mean((pvals < 0.05).any(axis=1))   # stop at first significant look
print(f"single fixed look:   {single_look:.3f}")
print(f"peeking (10 looks):  {peeked:.3f}")

Running it:

single fixed look:   0.051
peeking (10 looks):  0.190

On 40,000 simulated experiments where there is nothing to find, one honest look gives a 0.051 false-positive rate, in line with the promised 0.05; stopping at the first significant result among 10 looks gives 0.190. So the question to Lumen’s team is direct: was the sample size set in advance and analyzed once, or did you stop when it looked good? If it’s the latter, this “significant winner” carries a false-positive rate closer to 19% than 5%, and the significance tells you far less than it claims. The fix is the one from Module 3: pre-register the sample size, run to it, and analyze exactly once.
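If the team answers “we only checked a couple of times,” it helps to know how fast the inflation grows. The sketch below reuses the same A/A setup and varies the number of evenly spaced looks while the final sample size stays fixed. The seed, the 20,000 simulations, and the look schedules are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def two_prop_p(c1, n1, c2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p1, p2 = c1 / n1, c2 / n2
    p = (c1 + c2) / (n1 + n2)
    se = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return 2 * stats.norm.sf(np.abs((p2 - p1) / se))

rng = np.random.default_rng(7)            # arbitrary seed (assumption)
N, K, batch, p0 = 20_000, 10, 200, 0.10   # A/A test: no true effect
cum_c = rng.binomial(batch, p0, size=(N, K)).cumsum(axis=1)
cum_t = rng.binomial(batch, p0, size=(N, K)).cumsum(axis=1)
cum_n = batch * np.arange(1, K + 1)
sig = two_prop_p(cum_c, cum_n, cum_t, cum_n) < 0.05   # shape (N, K)

# Column indices of the looks; every schedule includes the final look (index 9).
schedules = {1: [9], 2: [4, 9], 5: [1, 3, 5, 7, 9], 10: list(range(10))}
rates = {k: float(sig[:, idx].any(axis=1).mean()) for k, idx in schedules.items()}
for k, rate in rates.items():
    print(f"{k:>2} look(s): false-positive rate = {rate:.3f}")
```

Even a single extra look pushes the error rate above the nominal 5%, which is why “we only peeked once or twice” does not settle the stopping-rule question.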


Stage 4: The Subgroup and Simpson Check

The growth team also sends a segmented view “to be thorough” — treatment wins overall, they say, and here’s the breakdown. Recompute it yourself rather than trusting the summary. Lumen has two segments, new and returning users:

seg = {
    "new users":       {"control": (200, 2000), "treatment": (90, 1000)},
    "returning users": {"control": (300, 1000), "treatment": (570, 2000)},
}

for name, arms in seg.items():
    cc, cn = arms["control"]      # (conversions, users)
    tc, tn = arms["treatment"]
    winner = "treatment" if tc / tn > cc / cn else "CONTROL"
    print(f"{name:16s} control {cc/cn:.1%}  treatment {tc/tn:.1%}  -> {winner} wins")

# Pool conversions and users across segments to get the aggregate rates.
cc = sum(a["control"][0] for a in seg.values())
cn = sum(a["control"][1] for a in seg.values())
tc = sum(a["treatment"][0] for a in seg.values())
tn = sum(a["treatment"][1] for a in seg.values())
overall = "TREATMENT" if tc / tn > cc / cn else "control"
print(f"{'OVERALL':16s} control {cc/cn:.1%}  treatment {tc/tn:.1%}  -> {overall} wins")

Running it:

new users        control 10.0%  treatment 9.0%  -> CONTROL wins
returning users  control 30.0%  treatment 28.5%  -> CONTROL wins
OVERALL          control 16.7%  treatment 22.0%  -> TREATMENT wins

Look at what just happened. Treatment loses in every single segment (9.0% vs 10.0% for new users, 28.5% vs 30.0% for returning) yet wins overall, 22.0% vs 16.7%. That’s Simpson’s paradox, and it’s not magic: it’s the mix. Two-thirds of control’s users are low-converting new users, while two-thirds of treatment’s are high-converting returning users, so the aggregate reflects the mix difference more than any real effect of the flow. And notice where that mix difference comes from: under working randomization both arms would draw the same blend of new and returning users, so a mix that differs by arm is exactly what broken assignment produces. Stage 2’s failed SRM and Stage 4’s misleading aggregate are the same wound seen from two angles.
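One way to make “it’s the mix” concrete is to print each arm’s segment shares and then compare the arms on a common mix. The sketch below uses direct standardization, a technique this lesson does not name, with the pooled segment shares as the common weights. That choice of weights is an assumption made for illustration.

```python
# Segment data from the lesson: (conversions, users) per arm.
seg = {
    "new users":       {"control": (200, 2000), "treatment": (90, 1000)},
    "returning users": {"control": (300, 1000), "treatment": (570, 2000)},
}
arms = ("control", "treatment")

# 1) Mix: what share of each arm's users falls in each segment?
arm_total = {a: sum(seg[s][a][1] for s in seg) for a in arms}
mix = {a: {s: seg[s][a][1] / arm_total[a] for s in seg} for a in arms}
for a in arms:
    shares = ", ".join(f"{s} {mix[a][s]:.1%}" for s in seg)
    print(f"{a:9s} mix: {shares}")

# 2) Direct standardization: apply each arm's segment rates to ONE common mix
#    (pooled segment shares across both arms, an illustrative choice).
grand_total = sum(arm_total.values())
weight = {s: sum(seg[s][a][1] for a in arms) / grand_total for s in seg}
adjusted = {
    a: sum(weight[s] * seg[s][a][0] / seg[s][a][1] for s in seg) for a in arms
}
for a in arms:
    print(f"{a:9s} mix-adjusted rate: {adjusted[a]:.2%}")
```

Put on the same 50/50 blend of new and returning users, control converts at 20.00% and treatment at 18.75%, so the adjusted comparison agrees with the segments and the overall “win” disappears.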

One more caution while you’re in the segments: don’t go the other way and start hunting for a segment where treatment wins. Lesson 2 showed that slicing enough ways guarantees a “significant” slice by chance alone. A subgroup view is for sanity-checking the aggregate against the segments, not for shopping until you find a winner. If you do report a segment, it has to be one you pre-declared, and a lone significant slice among many is a multiple-comparisons artifact until a correction or a replication says otherwise.

Validity before significance

The single habit this whole audit trains is this: check the plumbing before you read the number. A p-value is a summary of the data you collected; if the collection was broken (wrong split, wrong stopping rule, wrong comparison), then the summary is a precise description of the wrong thing. That’s why the checklist runs SRM, stopping rule, primary metric, and segment sanity first, and only reads the headline result last. A significant result on an invalid experiment isn’t a weak result. It’s not a result at all.


Stage 5: The Audit Readout

Write the verdict the way you’d hand it to the growth team. Every check told the same story from a different direction:

  • SRM: FAIL. The 5200/4800 split gives p = 6.33e-05. Assignment is broken; control and treatment are not comparable populations.
  • Stopping rule: SUSPECT. If the team stopped at the first significant look on a fixed-horizon test, the significance is inflated toward the 0.190 false-positive rate, not the 0.05 it claims.
  • Segment sanity: FAIL. Treatment loses in every segment (9.0% vs 10.0%, 28.5% vs 30.0%) but wins overall (22.0% vs 16.7%) purely because the mix differs by arm — the direct symptom of the broken assignment.

Recommendation: do not ship. The headline “significant winner” is an artifact of a broken experiment, not evidence that the new flow is better. Concretely: fix the assignment pipeline so the split is genuinely random, re-run with a pre-set sample size, and analyze once on the single pre-declared primary metric. When you re-run, the SRM check should pass (p well above 0.001), and only then does the conversion result mean anything.

Here is the reusable checklist — run it on every experiment, in this order, before you trust a result:

VALIDITY AUDIT  (run top to bottom; stop and fix on the first hard fail)
  1. SRM               chi-squared on the split; p > 0.001 or the test is void
  2. Stopping rule     sample size pre-set and analyzed once? no dashboard peeking
  3. Primary metric    read the ONE pre-declared metric, not a slice found later
  4. Segment sanity    do segments agree with the aggregate? watch for Simpson's
  5. THEN read result  only now is the p-value worth interpreting

That ordering is the lesson. Validity is a gate, not a footnote: the p-value is the last thing you look at, because it’s the only thing that means nothing until everything above it passes.
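The checklist can also be sketched as a small function that runs the gates in order and stops at the first hard fail. This is one possible shape, not a prescribed implementation: run_audit and its parameter names are hypothetical, the stopping-rule and primary-metric checks are reduced to yes/no answers you get from the team, and “segments agree” is simplified to “every segment picks the same winner as the aggregate.”

```python
from scipy import stats

def run_audit(n_control, n_treatment, analyzed_once, metric_predeclared,
              segment_rates, overall_rates, srm_alpha=0.001):
    """Run the checklist top to bottom and stop at the first hard fail.

    segment_rates: {segment: (control_rate, treatment_rate)}
    overall_rates: (control_rate, treatment_rate)
    Returns a list of (check_name, "PASS" | "FAIL") in the order run.
    """
    # SRM: chi-squared against the planned 50/50 split.
    half = (n_control + n_treatment) / 2
    srm_p = stats.chisquare([n_control, n_treatment], f_exp=[half, half]).pvalue

    # Simplified segment sanity: each segment names the same winner as the aggregate.
    overall_sign = overall_rates[1] > overall_rates[0]
    segments_agree = all((t > c) == overall_sign for c, t in segment_rates.values())

    checks = [
        ("SRM", srm_p > srm_alpha),
        ("Stopping rule", analyzed_once),
        ("Primary metric", metric_predeclared),
        ("Segment sanity", segments_agree),
    ]
    report = []
    for name, passed in checks:
        report.append((name, "PASS" if passed else "FAIL"))
        if not passed:
            break  # a hard fail voids everything downstream
    return report

# Lumen's counts and rates; the two booleans are hypothetical answers from the team.
lumen = run_audit(
    5200, 4800, analyzed_once=False, metric_predeclared=True,
    segment_rates={"new": (0.10, 0.09), "returning": (0.30, 0.285)},
    overall_rates=(500 / 3000, 660 / 3000),
)
print(lumen)
```

For Lumen the report ends at the SRM line, which mirrors the readout: once the first gate fails, nothing below it is worth scoring until the pipeline is fixed.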


Practice Exercises

Exercise 1: What SRM threshold, and why not 0.05?

The audit flags SRM at p < 0.001, not the usual p < 0.05. Why the stricter threshold for this particular check?

Hint

Because SRM is a health check you run on every experiment, and a true, correctly randomized split will still produce p < 0.05 about 5% of the time by pure chance. If you halted every experiment on p < 0.05, you’d cry “broken pipeline” on one in twenty perfectly good tests. A real SRM — a genuine plumbing bug — produces a screamingly small p-value (Lumen’s was 6.33e-05), so p < 0.001 catches the real breaks while almost never false-alarming on a healthy split like 4988/5012 (p = 0.81). You want this gate to fire on bugs, not on noise.
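A quick simulation makes the trade-off visible. The sketch below generates healthy experiments only, each assigning 10,000 users with a fair coin, and counts how often each threshold would raise a false alarm. The seed and the 50,000 simulated experiments are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
n_users, n_experiments = 10_000, 50_000

# Each healthy experiment sends every user to control with probability 0.5.
n_control = rng.binomial(n_users, 0.5, size=n_experiments)
expected = n_users / 2
# Chi-squared statistic for a two-cell split: both cells deviate by the same amount.
chi2_stat = 2 * (n_control - expected) ** 2 / expected
p_values = stats.chi2.sf(chi2_stat, df=1)

alarm_005 = float(np.mean(p_values < 0.05))
alarm_0001 = float(np.mean(p_values < 0.001))
print(f"false alarms at p < 0.05:  {alarm_005:.4f}")
print(f"false alarms at p < 0.001: {alarm_0001:.4f}")
```

At 0.05 roughly one healthy experiment in twenty gets flagged; at 0.001 it is roughly one in a thousand, while a break like Lumen’s still trips the gate.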

Exercise 2: Design the re-run to avoid all three pitfalls

Lumen fixes the assignment pipeline and asks you to design the re-run so this audit passes cleanly. What do you specify up front?

Hint

Three commitments, all made before launch. (1) Fixed sample size: compute the required n from the power analysis in Module 3, run to it, and analyze once — this kills peeking. (2) One pre-declared primary metric: name the single metric that decides ship/no-ship before you see any data — this kills segment- and metric-hunting. (3) An SRM check as the first gate on the new data — if the split still fails, you don’t even look at the result, you go back to the pipeline. Optionally pre-register any segments you’ll report and correct for them. The through-line: decide the rules while you’re still honest, i.e. before the data can tempt you.
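For commitment (1), a fixed sample size can be sketched with the standard two-proportion formula. This assumes the Module 3 power analysis has this form, and the planning inputs (10% baseline, a lift to 11%, 80% power) are hypothetical values, not numbers from Lumen. n_per_arm is an illustrative helper name.

```python
import math
from scipy import stats

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    p_bar = (p_control + p_treatment) / 2
    # Standard deviation terms under the null (pooled) and the alternative.
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    effect = abs(p_treatment - p_control)
    return math.ceil(((z_alpha * null_sd + z_power * alt_sd) / effect) ** 2)

# Hypothetical planning inputs: 10% baseline, detect a lift to 11%.
required = n_per_arm(0.10, 0.11)
print(f"run to {required:,} users per arm, then analyze once")
```

Whatever inputs you use, the point is that the number is written down before launch, so nobody can claim the test was “done” on the first good-looking day.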

Exercise 3: Would Bonferroni rescue a segment-hunted result?

Suppose the team hadn’t shown the aggregate at all — instead they sliced 20 ways and found one segment where treatment “significantly” wins. Would applying a Bonferroni correction make that finding trustworthy?

Hint

Bonferroni fixes the false-positive problem, not the broken-experiment problem. Testing a lone slice at 0.05/20 correctly stops you from being fooled by one lucky segment out of twenty — so a Bonferroni-surviving segment is real evidence of a difference in that slice. But it does nothing for the SRM: if assignment is broken, even a Bonferroni-corrected segment result is computed on non-comparable populations. So the correction is necessary for the multiple-comparisons pitfall and useless for the validity pitfall. Fix the plumbing first (Stage 2), then a correction makes a pre-declared multi-metric analysis honest.
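The arithmetic behind that first claim is short enough to check by hand. The sketch below assumes the 20 slices are independent tests with no true effect, which real overlapping segments only approximate.

```python
alpha, k = 0.05, 20   # 20 slices, each tested at the 5% level

# Chance of at least one false positive across k independent null tests.
fwer_uncorrected = 1 - (1 - alpha) ** k

# Bonferroni: test each slice at alpha / k instead.
per_test = alpha / k
fwer_bonferroni = 1 - (1 - per_test) ** k

print(f"uncorrected: family-wise error = {fwer_uncorrected:.1%}")
print(f"Bonferroni:  per-test alpha = {per_test:.4f}, family-wise error = {fwer_bonferroni:.1%}")
```

The correction pulls the family-wise error from roughly 64% back under 5%, but it only operates on the p-values, so it cannot repair a comparison that the SRM check has already voided.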


Summary

You turned four separate pitfalls into one repeatable validity audit and ran it on a real request to ship. Lumen’s experiment reported a significant winner on a planned 50/50 split, but the audit checked the plumbing before the p-value. The SRM check on the reported 5200/4800 split gave p = 6.33e-05 — a hard fail, versus p = 0.810 for a healthy 4988/5012 split — meaning assignment was broken and the comparison invalid. The stopping-rule check cited the peeking simulation, where an honest single look gives a 0.051 false-positive rate but stopping at the first of 10 looks gives 0.190, so any dashboard-triggered “significance” is inflated. The subgroup check recomputed the segments: treatment loses in every one (new: 9.0% vs 10.0%; returning: 28.5% vs 30.0%) yet wins overall (22.0% vs 16.7%), a Simpson’s reversal driven by the very mix imbalance a failed SRM produces. Verdict: do not ship — fix the pipeline, re-run with a pre-set sample size, and analyze once on the pre-declared primary metric.

Key Concepts

  • Validity is a gate before significance: read the p-value last, after the SRM, stopping-rule, primary-metric, and segment checks pass.
  • SRM is check #1: a chi-squared p-value of 6.33e-05 on the split voids the whole experiment; flag at p < 0.001, not 0.05.
  • Ask about the stopping rule: a fixed-horizon test stopped at the first significant look carries a ~0.190 error rate dressed up as 0.05.
  • Segment sanity, not segment hunting: a Simpson’s reversal (loses in every slice, wins overall) exposes a bad mix; don’t shop for a winning slice.

Why This Matters

Real growth teams don’t hand you clean experiments labeled “invalid” — they hand you a significant winner and a deadline, and the plumbing failures are silent. Every number in Lumen’s experiment was computed correctly; the experiment was still worthless, and only the audit caught it. Running these four checks in order, before you interpret a single result, is what separates a team that ships real wins from one that ships a stream of changes that do nothing while congratulating itself on its p-values. This audit is the habit that makes every earlier lesson in the module actually protect a decision.


Next Steps

Continue to Module 7 - Beyond Basic A/B

Variance reduction (CUPED), sequential testing, Bayesian A/B, and multi-armed bandits.


Continue Building Your Skills

You can now take any “significant winner” and decide whether it deserves your trust — by checking the split, the stopping rule, and the segments before you ever read the headline. That audit is the capstone of everything this module taught: validity is a property you verify, not assume. The next module moves past the basic two-arm test to the methods that make experiments faster and sharper — variance reduction with CUPED, honest sequential testing, Bayesian A/B, and multi-armed bandits. On to Beyond Basic A/B.