Lesson 2 - Running and Validating the Experiment

Welcome Back

In Lesson 1 you signed the contract: a hypothesis (the onboarding checklist lifts 7-day activation by at least 3 points), a metric hierarchy (primary: activation rate; guardrails: support tickets and 30-day retention), the user as the randomization unit, and a computed sample size of 3,394 per arm — 6,788 total. The design is done. Now Lumen runs it.

Running an experiment is two things: enroll users and assign each one to a group, and collect the outcome for each. This lesson simulates both, so you can see the raw material an analysis actually receives. But there’s a step before the analysis that trips up teams who are eager to see if their idea worked — validation. Before you look at whether the checklist lifted activation, you check that the experiment is even trustworthy. That discipline — validity before the p-value — is the whole point of this lesson.

By the end of this lesson, you will be able to:

  • Simulate a realistic 50/50 assignment (a coin flip, not a forced-even split)
  • Generate activation outcomes from a known true rate per arm
  • Run the sample ratio mismatch (SRM) check and interpret its p-value
  • Explain why validity is checked before the result, not after

Let’s run it.


Running the Experiment

Assignment in the real world is a coin flip per user, not a promise of exactly equal groups. Each of the 6,788 users is independently sent to control or treatment with 50% probability, so the two arms come out close to even but rarely exactly so. Then each user either activates or doesn’t within seven days: control at its true baseline of 25%, treatment at a true rate of 28%, the 3-point lift the checklist is worth in this simulation. In real life you never know these true rates; here we set them so we can check whether the analysis recovers them.

import numpy as np
rng = np.random.default_rng(2026)                # fixed seed so the run is reproducible
n = 3394                                         # planned sample size per arm (Lesson 1)
total = 2 * n                                    # 6,788 users enrolled
is_control = rng.random(total) < 0.50            # independent 50/50 coin flip per user
n_c = int(is_control.sum())                      # users who landed in control
n_t = total - n_c                                # everyone else is in treatment
act_c = int((rng.random(n_c) < 0.25).sum())      # control: true activation rate 25%
act_t = int((rng.random(n_t) < 0.28).sum())      # treatment: true activation rate 28%
print(f"assigned : control {n_c}  treatment {n_t}")
print(f"activated: control {act_c}/{n_c} = {act_c/n_c:.4f}  treatment {act_t}/{n_t} = {act_t/n_t:.4f}")

Running it:

assigned : control 3328  treatment 3460
activated: control 832/3328 = 0.2500  treatment 980/3460 = 0.2832

Notice the assignment is not 3,394 / 3,394. The coin flip handed us 3,328 control and 3,460 treatment — a gap of 132 users. That’s not a bug; it’s exactly the kind of chance variation a per-user random split produces. But a lopsided split is also what a broken pipeline looks like, so before you read anything into the outcomes, you have to ask whether this split is innocent chance or a red flag.
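One rough way to put the 132-user gap in scale, before any formal test, is to compare it with the spread a fair coin produces. The sketch below is an illustration rather than part of the lesson's pipeline: it treats the control count as a binomial draw with 6,788 trials and probability 0.5, and the variable names are chosen here for readability.

```python
import math

total = 6788          # users enrolled (3,394 per arm planned)
n_control = 3328      # observed control count from the simulation

expected = total * 0.5                 # expected size of each arm
sd = math.sqrt(total * 0.5 * 0.5)      # binomial standard deviation of one arm's size
z = (n_control - expected) / sd        # observed deviation in SD units

print(f"expected per arm: {expected:.0f}, SD: {sd:.1f}, observed deviation: {z:.2f} SD")
# expected per arm: 3394, SD: 41.2, observed deviation: -1.60 SD
```

A deviation of about 1.6 standard deviations is well within what a fair coin produces on its own, which is why the split by itself is not alarming.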

[Figure: the five-stage capstone pipeline (Design, Run, Validate, Analyze, Decide). This lesson covers Run (assign users 50/50 and collect their 7-day activation outcomes) and Validate (run the SRM check and sanity checks before any analysis). A scenario strip notes the onboarding-checklist brief: baseline activation 25%, minimum detectable effect +3 points, 3,394 users per arm, and an observed split of 3,328 control and 3,460 treatment.]
Caption: Two stages of the workflow in action. Run produces the raw assignment and outcomes, and Validate gates them: the SRM check decides whether the split is fair enough to trust before the analysis begins.

Validate Before You Analyze

The design calls for a 50/50 split; the observed split is 3,328 / 3,460. The sample ratio mismatch (SRM) check from Module 6 asks a precise question: if assignment were truly a fair coin flip, how surprised should we be to see a gap this large? A chi-square goodness-of-fit test against the expected even split gives the answer.

from scipy import stats
total = n_c + n_t
srm_p = stats.chisquare([n_c, n_t], f_exp=[total/2, total/2]).pvalue
print(f"SRM p-value: {srm_p:.3f}")

Running it:

SRM p-value: 0.109

A p-value of 0.109 means a split this uneven (or worse) happens about 11% of the time under a fair coin, which is completely unremarkable. SRM uses a deliberately strict alarm threshold of around 0.001, not the usual 0.05, so that a healthy experiment raises a false alarm only about once in a thousand runs. A p-value above that threshold gives no evidence that the randomization is broken. At 0.109 we’re nowhere near the alarm, so the split is consistent with a fair 50/50 assignment and the experiment passes the gate for analysis.
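For a two-arm split the chi-square test has one degree of freedom, so the statistic and its p-value can be cross-checked by hand using only the standard library. This is a sketch for verification, not a replacement for scipy's chisquare; the function name srm_p_value is hypothetical.

```python
import math

def srm_p_value(n_control: int, n_treatment: int) -> tuple[float, float]:
    """Chi-square goodness-of-fit against an even split (1 degree of freedom)."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = srm_p_value(3328, 3460)
print(f"chi-square: {chi2:.3f}  p-value: {p:.3f}")
# chi-square: 2.567  p-value: 0.109
```

Each arm sits 66 users from the expected 3,394, so the statistic is 2 × 66² / 3,394 ≈ 2.567, which reproduces the p-value of 0.109.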

The order here is the lesson. You ran the SRM check before looking at the activation numbers — before you had any chance to be seduced by a result you were hoping for. That sequencing is a safeguard, not a formality.

Validity is a gate, not a footnote

Had the SRM check failed (say p < 0.001, with the split coming out 3,100 / 3,688), the correct response would have been to stop, not to analyze anyway and mention the imbalance in a caveat. A failed SRM means something in the assignment or logging pipeline is broken: a logging bug, a redirect that drops users unevenly, a bot filter that hits one arm harder. When the split is untrustworthy, every downstream number is contaminated, because the two groups are no longer comparable. So you fix the pipeline and re-run; you do not analyze garbage and hope the caveat covers you. Checking validity before the result (Module 6) is what makes the eventual p-value mean something.
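The gate idea can be sketched as a function that returns a go/no-go verdict before any outcome data is touched. This is a minimal illustration under the lesson's ~0.001 threshold; the names SRM_ALARM and srm_gate are hypothetical, and the p-value uses the one-degree-of-freedom chi-square tail rather than a scipy call.

```python
import math

SRM_ALARM = 0.001  # strict SRM threshold described in this lesson

def srm_gate(n_control: int, n_treatment: int, alarm: float = SRM_ALARM) -> bool:
    """Return True if the split passes the SRM check, False if analysis must stop."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 degree of freedom
    return p_value >= alarm

for split in [(3328, 3460), (3100, 3688)]:
    verdict = "analyze" if srm_gate(*split) else "stop and fix the pipeline"
    print(split, "->", verdict)
# (3328, 3460) -> analyze
# (3100, 3688) -> stop and fix the pipeline
```

Keeping the verdict separate from the activation counts means the decision to proceed cannot be influenced by the result.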

The SRM check isn’t the only validity consideration — it’s just the one that depends on the data you collected. Several others were already handled back in the design. The ~1-week runtime (Lesson 1’s sample size at ~1,000 users/day) deliberately spans all seven days of the week, so day-of-week seasonality can’t masquerade as an effect. The guardrail metrics — support tickets and 30-day retention — are monitored throughout, so a “win” that quietly breaks something gets caught. And the fixed sample size means the team analyzes exactly once, at 6,788 users, with no peeking along the way. Validity isn’t a single test; it’s a set of commitments, most made before kickoff and one confirmed the moment the data lands.
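The runtime claim is simple arithmetic and can be checked directly. The sketch below assumes the enrollment rate of roughly 1,000 users per day quoted above is constant; the variable names are illustrative.

```python
import math

total_users = 6788       # fixed sample size from Lesson 1
users_per_day = 1000     # approximate enrollment rate quoted in the lesson

days_needed = math.ceil(total_users / users_per_day)
print(days_needed)       # 7
```

Seven enrollment days cover each day of the week once, which is what the design relies on.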


Practice Exercises

Exercise 1: Read the split

The assignment came out 3,328 control and 3,460 treatment, not the planned 3,394 / 3,394. Is that a problem?

Hint

No. Assignment is an independent 50% coin flip per user, so the arm sizes fluctuate around the expected 3,394 — they almost never land exactly even. A 132-user gap out of 6,788 is ordinary chance variation, and the SRM p-value of 0.109 confirms it’s consistent with a fair split. What would be a problem is a split so lopsided that SRM fires (p < 0.001), which signals a broken assignment mechanism rather than luck.

Exercise 2: What SRM actually protects

Why run the SRM check before looking at the activation results, rather than glancing at both together?

Hint

Because validity is a precondition for the result to mean anything, and checking it first removes the temptation to rationalize. If the two arms aren’t a fair split, they aren’t comparable, so any difference in activation could be an artifact of who landed in each group rather than the checklist. Running SRM first turns it into a gate: pass and you may analyze; fail and you stop and fix the pipeline. Peeking at the outcome first invites you to wave a failed check through because you like what you see.

Exercise 3: A failed SRM

Suppose the split had come back 3,050 control / 3,738 treatment and SRM returned a p-value far below 0.001. What do you do, and what do you not do?

Hint

You stop and investigate the assignment pipeline — a p-value that far below the 0.001 alarm says the imbalance is almost certainly not chance. Look for a logging gap, an uneven redirect, a bot filter hitting one arm, or a bucketing bug. What you do not do is proceed to the z-test and report the activation lift with a footnote about the imbalance. A broken split contaminates every downstream number because the groups aren’t comparable, so the “result” would be meaningless no matter how significant it looked.


Summary

With the design signed, the experiment runs in two moves: assign each of the 6,788 users to control or treatment by an independent 50/50 coin flip, and collect each user’s 7-day activation outcome. The simulation gave a realistic uneven split — 3,328 control, 3,460 treatment — with observed activation of 25.00% in control and 28.32% in treatment. Before touching those outcomes, you ran the sample ratio mismatch check: a chi-square test of the split against a fair 50/50 returned p = 0.109, far above the strict ~0.001 SRM alarm, so the assignment is trustworthy and the experiment is valid to analyze. Other validity threats were pre-empted by design — a full-week runtime for seasonality, monitored guardrails, and a fixed sample size that forbids peeking. The result now has permission to be believed.

Key Concepts

  • Coin-flip assignment: a per-user 50% split lands close to even, not exactly even; a small gap is normal.
  • Simulated outcomes: generated from true rates of 25% (control) and 28% (treatment); the analysis will try to recover this gap.
  • SRM before the result: p = 0.109 is far above the ~0.001 alarm, so the split is consistent with a fair assignment and the test can be analyzed.
  • Validity is a gate: a failed SRM means stop and fix the pipeline, never analyze and caveat.

Why This Matters

The most disciplined thing a data scientist does in an experiment is refuse to look at the result until the experiment has earned it. Simulating the run makes the raw material concrete — an uneven split and two activation counts — and the SRM check turns “the split looks a bit off” into a precise, defensible yes-or-no about validity. Teams that skip this step and analyze whatever comes out ship conclusions built on broken randomization and never know it. You checked validity first, it passed, and only now are you allowed to ask the real question. Next, you’ll finally answer it: does the 25.00%-vs-28.32% gap represent a real lift, and by how much?


Next Steps

Continue to Lesson 3 - Analyzing the Results

With the experiment validated, run the z-test and confidence interval to decide whether the activation lift is real.


Continue Building Your Skills

You ran the experiment Lesson 1 designed: a realistic coin-flip assignment that landed 3,328 / 3,460, and activation outcomes of 25.00% versus 28.32%. Then, before reading anything into those numbers, you ran the SRM check — p = 0.109, comfortably valid — and confirmed the other validity commitments were already handled by the design. The experiment has passed the gate. In the next lesson you’ll do the analysis it has earned: a two-proportion z-test and a confidence interval to turn the observed 3.3-point gap into a claim you can defend.