Lesson 1 - The Brief and the Design

Welcome to the Capstone

This is where it all comes together. Over seven modules you learned each piece of experimentation in isolation; now you’ll run one complete experiment the way a data scientist actually does — from a one-line brief to a written ship decision. No new theory, just the whole workflow applied in sequence, with every number computed for real. The scenario is fresh so you’re not just repeating earlier lessons: Lumen’s product team has a question, and by the end of this module you’ll have answered it rigorously and written the readout.

The brief: “We built a new onboarding checklist. Does it get more new users active in their first week? Should we ship it?” That’s it — vague, business-flavored, and exactly how real experiments start. This lesson turns it into a design. And it starts, as everything in this course has argued, before any data exists.

By the end of this lesson, you will be able to:

Turn a business brief into a specific, testable hypothesis
Choose the primary, guardrail, and secondary metrics for the test
State the randomization unit and justify it
Compute the sample size the experiment needs before it runs

Let’s design the experiment.

From Brief to Design

A design is a set of decisions made up front, so the experiment can only tell the truth. Here’s the full design for Lumen’s test, each piece drawing on an earlier module:

Hypothesis (Module 2): The new onboarding checklist increases the 7-day activation rate by at least 3 percentage points. Specific (one change: the checklist), directional (increases, by ≥3 points), falsifiable (if activation is flat or lower, it’s refuted).
Primary metric: the 7-day activation rate — the share of new users who complete a key milestone within their first week. It’s a proportion, measurable per user, and sensitive to an onboarding change.
Guardrail metrics: support-ticket rate (a confusing checklist shouldn’t spike tickets) and 30-day retention (short-term activation shouldn’t come at the cost of longer-term engagement). These must not regress even if activation rises.
Randomization unit: the user. Onboarding is a per-user experience, and each user should see one consistent version across every visit in their first week.
Decision rule: ship only if the primary metric rises significantly and clears the 3-point bar, and no guardrail regresses. This rule is fixed now, before any data exists.
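The per-user randomization described above can be sketched with deterministic hashing. This is a minimal illustration, not Lumen's actual assignment service: the function name `assign_arm`, the experiment key `"onboarding_checklist"`, and the user-ID format are all hypothetical. It shows one common way to give each user a stable 50/50 assignment, so they see the same version on every visit in their first week.

```python
import hashlib

def assign_arm(user_id: str, experiment: str = "onboarding_checklist") -> str:
    """Map a user to 'control' or 'treatment' deterministically (hypothetical helper)."""
    # Hash the experiment key together with the user ID so the same user
    # always lands in the same arm, and a different experiment key gives a different split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# The same user gets the same arm on every call.
print(assign_arm("user-123") == assign_arm("user-123"))  # True
```

Because the arm depends only on the user ID and the experiment key, no assignment table has to be stored to keep a returning user in the same arm.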

Figure: The capstone workflow, in the order a real experiment runs. 1. Design (hypothesis, metrics, sample size; Modules 2 and 3). 2. Run (assign 50/50, collect outcomes; Module 1). 3. Validate (SRM and sanity checks first; Module 6). 4. Analyze (z-test, CI, Bayesian; Modules 4 and 7). 5. Decide (the readout and ship call; Modules 4 to 7). Scenario: does a new onboarding checklist lift Lumen's 7-day activation rate? Baseline 25%, smallest lift worth shipping +3 points, 3,394 users per arm. Validity is checked before the p-value; the decision comes last.

Sizing the Experiment

The design isn’t complete until you know how many users you need. From Module 3: the sample size follows from the baseline rate, the minimum detectable effect, the significance level, and the power. Lumen’s current 7-day activation rate is about 25%, and the team decided a +3 point lift is the smallest worth shipping. At the standard 5% significance and 80% power:

import math
from scipy.stats import norm

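def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Users needed per arm for a two-sided test comparing two proportions."""
    z_a = norm.ppf(1 - alpha / 2)               # critical value for two-sided significance
    z_b = norm.ppf(power)                       # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)    # sum of the two arms' Bernoulli variances
    return math.ceil((z_a + z_b) ** 2 * variance / (p2 - p1) ** 2)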

n = n_per_arm(0.25, 0.28)          # baseline 25% -> target 28% (a +3 point lift)
print(f"required sample size: {n:,} per arm ({2 * n:,} total)")

Running it:

required sample size: 3,394 per arm (6,788 total)

Lumen needs 3,394 users per arm — 6,788 total — to have an 80% chance of detecting a real 3-point lift at 5% significance. That number is the design’s final commitment: the team runs until it has that many users per arm, then analyzes once. If Lumen gets around 1,000 new users a day, that’s roughly a week of traffic — conveniently also long enough to smooth over day-of-week effects (Module 6).
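The traffic arithmetic above can be written out as a short sketch. The helper name `enrollment_days` is hypothetical, and the calculation assumes flat daily traffic with every new user eligible and split evenly between the two arms. It also makes explicit one consequence of the metric definition: the last user enrolled still needs a full 7-day window before their activation outcome is known.

```python
import math

def enrollment_days(n_per_arm: int, daily_new_users: int, arms: int = 2) -> int:
    """Whole days of traffic needed to fill every arm (hypothetical helper)."""
    return math.ceil(arms * n_per_arm / daily_new_users)

days = enrollment_days(3394, 1000)   # 1,000 new users/day is the lesson's rough figure
print(f"enrollment: {days} days")
# The last enrolled user needs 7 more days before 7-day activation can be measured.
print(f"primary metric fully observed after: {days + 7} days")
```

Under these assumptions enrollment takes 7 days, and the primary metric is complete for every user about a week after enrollment closes.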

The design is a contract you sign before kickoff

Everything above — the hypothesis, the metrics, the 3-point threshold, the 3,394-per-arm sample size, the decision rule — is fixed now, before a single user is enrolled. That’s not bureaucracy; it’s the entire defense against the pitfalls of Module 6. A pre-committed sample size prevents peeking. A single pre-declared primary metric prevents multiple-comparisons fishing. A decision rule written in advance prevents moving the goalposts to fit whatever the data happens to show. When you write the design down and stick to it, the experiment’s conclusion means something. When you improvise it after seeing results, it doesn’t. The rest of this capstone simply executes this contract.
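One way to make the contract tangible is to freeze the design in code before kickoff. The sketch below is an illustration only: the class name `ExperimentDesign`, its field names, and the `should_ship` method are hypothetical, and it assumes the 3-point bar is applied to the observed lift. The inputs in the usage lines are made-up values to exercise the rule, not results from Lumen's experiment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen: fields cannot be reassigned after creation
class ExperimentDesign:
    primary_metric: str = "7-day activation rate"
    min_lift: float = 0.03    # smallest lift worth shipping (3 points)
    alpha: float = 0.05       # significance level
    n_per_arm: int = 3394     # pre-committed sample size

    def should_ship(self, lift: float, p_value: float, guardrail_regressed: bool) -> bool:
        """Apply the pre-written decision rule to the final results."""
        significant = p_value < self.alpha
        clears_bar = lift >= self.min_lift   # assumption: bar applies to the observed lift
        return significant and clears_bar and not guardrail_regressed

design = ExperimentDesign()
# Illustrative inputs only, not experiment results.
print(design.should_ship(lift=0.035, p_value=0.01, guardrail_regressed=False))  # True
print(design.should_ship(lift=0.035, p_value=0.01, guardrail_regressed=True))   # False
```

Because the dataclass is frozen, an attempt to change `min_lift` or `n_per_arm` after the fact raises an error, which mirrors the idea that the goalposts cannot move once data arrives.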

Practice Exercises

Exercise 1: Critique the hypothesis

Is “the new onboarding checklist increases 7-day activation by at least 3 points” a good hypothesis? Name the properties that make it testable.

Hint

Yes. It’s specific (one change — the checklist — and one named primary metric, 7-day activation rate), directional (increases, and by a stated amount), and falsifiable (if activation comes back flat or lower, the hypothesis is refuted). The 3-point threshold also encodes practical significance, so it doubles as the bar for the decision rule. That’s exactly the Module 2 standard.

Exercise 2: Why guardrails?

Activation is the primary metric. Why also track support tickets and 30-day retention, when the team only cares about getting users active?

Hint

Because a change can help the primary metric while quietly hurting something that matters more. A pushy checklist might drive activation up but confuse users (more support tickets) or produce shallow early engagement that doesn’t last (lower 30-day retention). Guardrails catch a “win” that’s actually a loss elsewhere — the decision rule ships only if the primary rises and no guardrail regresses.

Exercise 3: Resize it

Suppose the team decides a +2 point lift is worth shipping, not +3. Without computing exactly, will the required sample size go up or down, and roughly by how much?

Hint

Up — a lot. Sample size scales with roughly the inverse square of the effect you want to detect, so shrinking the MDE from 3 points to 2 (a factor of 2/3) multiplies the sample size by about (3/2)² ≈ 2.25. The ~3,394 per arm would balloon to roughly 7,600 per arm. Smaller effects are far more expensive to detect — the MDE is the biggest lever on cost (Module 2).
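To check the rule of thumb against the exact formula, the sizing function can be rerun with a 27% target. This sketch repeats the lesson's formula but swaps in the standard library's `statistics.NormalDist` for the normal quantiles, so it runs without SciPy; that substitution is a choice made here, not part of the lesson's code.

```python
import math
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Same formula as the lesson's sizing code, with standard-library normal quantiles."""
    z = NormalDist()
    z_a, z_b = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    return math.ceil((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

n3 = n_per_arm(0.25, 0.28)   # +3 point lift
n2 = n_per_arm(0.25, 0.27)   # +2 point lift
print(f"+3 points: {n3:,} per arm")
print(f"+2 points: {n2:,} per arm ({n2 / n3:.2f}x)")
```

The exact figure lands at about 7,500 per arm, roughly 2.2 times the original. It comes out slightly below the 2.25 rule of thumb because the variance term also shrinks a little when the target rate drops from 28% to 27%.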

Summary

The capstone runs one experiment end to end, and it begins — like every real experiment — with a design written before any data. Lumen’s brief (“does a new onboarding checklist lift 7-day activation?”) became a full design: a hypothesis (increase activation by ≥3 points), a metric hierarchy (primary: 7-day activation rate; guardrails: support tickets and 30-day retention; unit: the user), a decision rule fixed in advance, and a computed sample size of 3,394 per arm (6,788 total) for 80% power to detect a 3-point lift at 5% significance. That design is a contract that pre-empts the validity pitfalls: a fixed sample size stops peeking, a single primary metric stops multiple comparisons, and a pre-written decision rule stops goalpost-moving. With the contract signed, the experiment is ready to run.

Key Concepts

Brief → design — turn a vague business question into specific, pre-committed decisions.
Metric hierarchy — one primary metric, guardrails that must not regress, secondary for context.
Sample size up front — 3,394 per arm here; run to it, then analyze once.
The design is a contract — fixing it before data is the defense against Module 6’s pitfalls.

Why This Matters

The design phase is where experiments are won or lost, long before any analysis. A sharp hypothesis, the right metrics, and a pre-committed sample size and decision rule are what make the eventual result trustworthy — and what a rushed team skips, then wonders why its “significant wins” don’t hold up. Doing it properly, as you just did, is the difference between an experiment that answers a question and one that merely produces a number. Next, you’ll run the experiment: simulate the data and — before peeking at whether it worked — check that it’s even valid.

Next Steps

Continue to Lesson 2 - Running and Validating the Experiment

Simulate the assignment and outcomes, then run the sample-ratio-mismatch check before trusting a single number.

Back to Module Overview

Return to the Capstone module overview

Continue Building Your Skills

You turned a one-line brief into a complete, pre-committed design — hypothesis, metrics, randomization unit, decision rule, and a sample size of 3,394 users per arm. That contract is what makes everything downstream trustworthy. Next you’ll run the experiment: simulate the 50/50 assignment and the activation outcomes, and then — before looking at whether the checklist worked — run the validity checks that decide whether the result can be believed at all.
