Lesson 1 - The Brief and the Design
Welcome to the Capstone
This is where it all comes together. Over seven modules you learned each piece of experimentation in isolation; now you’ll run one complete experiment the way a data scientist actually does — from a one-line brief to a written ship decision. No new theory, just the whole workflow applied in sequence, with every number computed for real. The scenario is fresh so you’re not just repeating earlier lessons: Lumen’s product team has a question, and by the end of this module you’ll have answered it rigorously and written the readout.
The brief: “We built a new onboarding checklist. Does it get more new users active in their first week? Should we ship it?” That’s it — vague, business-flavored, and exactly how real experiments start. This lesson turns it into a design. And it starts, as everything in this course has argued, before any data exists.
By the end of this lesson, you will be able to:
- Turn a business brief into a specific, testable hypothesis
- Choose the primary, guardrail, and secondary metrics for the test
- State the randomization unit and justify it
- Compute the sample size the experiment needs before it runs
Let’s design the experiment.
From Brief to Design
A design is a set of decisions made up front, so the experiment can only tell the truth. Here’s the full design for Lumen’s test, each piece drawing on an earlier module:
- Hypothesis (Module 2): The new onboarding checklist increases the 7-day activation rate by at least 3 percentage points. Specific (one change: the checklist), directional (increases, by ≥3 points), falsifiable (if activation is flat or lower, it’s refuted).
- Primary metric: the 7-day activation rate — the share of new users who complete a key milestone within their first week. It’s a proportion, measurable per user, and sensitive to an onboarding change.
- Guardrail metrics: support-ticket rate (a confusing checklist shouldn’t spike tickets) and 30-day retention (short-term activation shouldn’t come at the cost of longer-term engagement). These must not regress even if activation rises.
- Randomization unit: the user. Onboarding is a per-user experience, and each user should see one consistent version across every visit in their first week.
- Decision rule: ship only if the primary metric rises significantly, the lift clears the 3-point bar, and no guardrail regresses. The rule is fixed now, before any data exists.
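One way to keep the decision rule honest is to write it as code before launch. The following is a minimal sketch, not Lumen's actual tooling: the function name `should_ship`, its parameters, and the guardrail keys are hypothetical, and it assumes that "clears the 3-point bar" means the observed lift is at least 3 points.

```python
def should_ship(lift, p_value, guardrails_regressed, mde=0.03, alpha=0.05):
    """Hypothetical pre-registered ship rule for the checklist test.

    lift: observed treatment-minus-control difference in activation rate
    p_value: p-value from the primary-metric test
    guardrails_regressed: dict mapping guardrail name -> True if it regressed
    """
    significant = p_value < alpha and lift > 0   # primary metric rises significantly
    clears_bar = lift >= mde                     # assumption: point estimate >= 3 points
    guardrails_ok = not any(guardrails_regressed.values())
    return significant and clears_bar and guardrails_ok


# Illustrative inputs only, not experiment results.
clean = {"support_ticket_rate": False, "retention_30d": False}
ticket_spike = {"support_ticket_rate": True, "retention_30d": False}

print(should_ship(0.035, 0.01, clean))         # True: all three conditions hold
print(should_ship(0.035, 0.01, ticket_spike))  # False: a guardrail regressed
print(should_ship(0.015, 0.01, clean))         # False: significant but below the bar
```

All three conditions are combined with `and`, so a strong activation lift cannot compensate for a regressed guardrail.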
Sizing the Experiment
The design isn’t complete until you know how many users you need. From Module 3: the sample size follows from the baseline rate, the minimum detectable effect, the significance level, and the power. Lumen’s current 7-day activation rate is about 25%, and the team decided a +3 point lift is the smallest worth shipping. At the standard 5% significance and 80% power:
import math
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return math.ceil((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

n = n_per_arm(0.25, 0.28)  # baseline 25% -> target 28% (a +3 point lift)
print(f"required sample size: {n:,} per arm ({2 * n:,} total)")

Running it:

required sample size: 3,394 per arm (6,788 total)

Lumen needs 3,394 users per arm (6,788 total) to have an 80% chance of detecting a real 3-point lift at 5% significance. That number is the design’s final commitment: the team runs until it has that many users per arm, then analyzes once. If Lumen gets around 1,000 new users a day, that’s roughly a week of traffic, which is conveniently also long enough to smooth over day-of-week effects (Module 6).
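The "roughly a week" estimate is simple arithmetic, sketched below. It assumes every new user is enrolled and split evenly across two arms; rounding the duration up to whole weeks is an added convention here, not something the lesson prescribes.

```python
import math

n_per_arm_required = 3394   # from the sample-size calculation above
arms = 2                    # control and treatment
daily_new_users = 1000      # the lesson's rough traffic figure

total_users = n_per_arm_required * arms
days_needed = math.ceil(total_users / daily_new_users)

# Assumption: round up to full weeks so every weekday is represented equally.
run_days = math.ceil(days_needed / 7) * 7

print(f"{total_users:,} users -> {days_needed} days of traffic -> run for {run_days} days")
```

With these inputs the script prints `6,788 users -> 7 days of traffic -> run for 7 days`. If only a fraction of new users were eligible for the test, `daily_new_users` would shrink and the run would lengthen proportionally.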
The design is a contract you sign before kickoff
Everything above — the hypothesis, the metrics, the 3-point threshold, the 3,394-per-arm sample size, the decision rule — is fixed now, before a single user is enrolled. That’s not bureaucracy; it’s the entire defense against the pitfalls of Module 6. A pre-committed sample size prevents peeking. A single pre-declared primary metric prevents multiple-comparisons fishing. A decision rule written in advance prevents moving the goalposts to fit whatever the data happens to show. When you write the design down and stick to it, the experiment’s conclusion means something. When you improvise it after seeing results, it doesn’t. The rest of this capstone simply executes this contract.
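One lightweight way to make the contract concrete is to record the design as an immutable object before launch. This is only an illustrative sketch: the `ExperimentDesign` class, its field names, and the metric identifiers are hypothetical, and a real team might equally use a signed-off document or a config file under version control.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ExperimentDesign:
    """Hypothetical record of the pre-committed design."""
    hypothesis: str
    primary_metric: str
    guardrail_metrics: tuple
    randomization_unit: str
    baseline_rate: float
    mde: float          # minimum detectable effect, as a proportion
    alpha: float
    power: float
    n_per_arm: int


design = ExperimentDesign(
    hypothesis="New onboarding checklist raises 7-day activation by >= 3 points",
    primary_metric="activation_rate_7d",
    guardrail_metrics=("support_ticket_rate", "retention_30d"),
    randomization_unit="user",
    baseline_rate=0.25,
    mde=0.03,
    alpha=0.05,
    power=0.80,
    n_per_arm=3394,
)

try:
    design.mde = 0.02   # attempting to move the goalposts after sign-off
except FrozenInstanceError:
    print("design is frozen: the MDE cannot be changed after sign-off")
```

Freezing the object does not stop anyone from writing a new design, but it does make any change an explicit, visible act rather than a silent edit.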
Practice Exercises
Exercise 1: Critique the hypothesis
Is “the new onboarding checklist increases 7-day activation by at least 3 points” a good hypothesis? Name the properties that make it testable.
Hint
Yes. It’s specific (one change — the checklist — and one named primary metric, 7-day activation rate), directional (increases, and by a stated amount), and falsifiable (if activation comes back flat or lower, the hypothesis is refuted). The 3-point threshold also encodes practical significance, so it doubles as the bar for the decision rule. That’s exactly the Module 2 standard.
Exercise 2: Why guardrails?
Activation is the primary metric. Why also track support tickets and 30-day retention, when the team only cares about getting users active?
Hint
Because a change can help the primary metric while quietly hurting something that matters more. A pushy checklist might drive activation up but confuse users (more support tickets) or produce shallow early engagement that doesn’t last (lower 30-day retention). Guardrails catch a “win” that’s actually a loss elsewhere — the decision rule ships only if the primary rises and no guardrail regresses.
Exercise 3: Resize it
Suppose the team decides a +2 point lift is worth shipping, not +3. Without computing exactly, will the required sample size go up or down, and roughly by how much?
Hint
Up, and by a lot. Sample size scales with roughly the inverse square of the effect you want to detect, so shrinking the MDE from 3 points to 2 (a factor of 2/3) multiplies the sample size by about (3/2)² ≈ 2.25. The ~3,394 per arm would balloon to roughly 7,600 per arm. Smaller effects are far more expensive to detect: the MDE is the biggest lever on cost (Module 3).
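To check the rule of thumb against the full formula, the calculation can be rerun with a 27% target. The sketch below uses the same formula as the lesson's `n_per_arm`, but swaps `scipy.stats.norm.ppf` for the standard library's `statistics.NormalDist().inv_cdf` so that it runs without SciPy.

```python
import math
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    # Same two-proportion formula as the lesson; inv_cdf plays the role of norm.ppf.
    std_normal = NormalDist()
    z_a = std_normal.inv_cdf(1 - alpha / 2)
    z_b = std_normal.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * variance / (p2 - p1) ** 2)


n3 = n_per_arm(0.25, 0.28)   # +3 point lift
n2 = n_per_arm(0.25, 0.27)   # +2 point lift

print(f"+3 points: {n3:,} per arm")
print(f"+2 points: {n2:,} per arm ({n2 / n3:.2f}x)")
```

This prints 3,394 and 7,547 per arm, a ratio of about 2.22. The exact figure lands slightly under the 2.25 shortcut because the variance term p2(1 - p2) also shrinks a little when the target moves from 28% to 27%.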
Summary
The capstone runs one experiment end to end, and it begins — like every real experiment — with a design written before any data. Lumen’s brief (“does a new onboarding checklist lift 7-day activation?”) became a full design: a hypothesis (increase activation by ≥3 points), a metric hierarchy (primary: 7-day activation rate; guardrails: support tickets and 30-day retention; unit: the user), a decision rule fixed in advance, and a computed sample size of 3,394 per arm (6,788 total) for 80% power to detect a 3-point lift at 5% significance. That design is a contract that pre-empts the validity pitfalls: a fixed sample size stops peeking, a single primary metric stops multiple comparisons, and a pre-written decision rule stops goalpost-moving. With the contract signed, the experiment is ready to run.
Key Concepts
- Brief → design — turn a vague business question into specific, pre-committed decisions.
- Metric hierarchy — one primary metric, guardrails that must not regress, secondary for context.
- Sample size up front — 3,394 per arm here; run to it, then analyze once.
- The design is a contract — fixing it before data is the defense against Module 6’s pitfalls.
Why This Matters
The design phase is where experiments are won or lost, long before any analysis. A sharp hypothesis, the right metrics, and a pre-committed sample size and decision rule are what make the eventual result trustworthy — and what a rushed team skips, then wonders why its “significant wins” don’t hold up. Doing it properly, as you just did, is the difference between an experiment that answers a question and one that merely produces a number. Next, you’ll run the experiment: simulate the data and — before peeking at whether it worked — check that it’s even valid.
Next Steps
Continue to Lesson 2 - Running and Validating the Experiment
Simulate the assignment and outcomes, then run the sample-ratio-mismatch check before trusting a single number.
Continue Building Your Skills
You turned a one-line brief into a complete, pre-committed design — hypothesis, metrics, randomization unit, decision rule, and a sample size of 3,394 users per arm. That contract is what makes everything downstream trustworthy. Next you’ll run the experiment: simulate the 50/50 assignment and the activation outcomes, and then — before looking at whether the checklist worked — run the validity checks that decide whether the result can be believed at all.