Lesson 4 - Anatomy of an A/B Test

Welcome to the Anatomy of an A/B Test

You know why we run experiments (to earn causal claims) and you know the frame (control and treatment, split by a randomization unit). Now we assemble the whole thing. A complete A/B test isn’t just “show two versions and see what happens” — it’s a small set of decisions made before a single user is bucketed: what you believe, what you’ll change, what you’ll measure, what you won’t let get worse, and what result will make you ship. Skip any of these and the test gets slippery — you end up staring at numbers wondering what they mean, or worse, deciding what they mean after you’ve seen them. This lesson names each part using Lumen’s signup-page test, and by the end you’ll have set up the exact experiment you’ll analyze in the next lesson.

By the end of this lesson, you will be able to:

  • Write a hypothesis that is specific, directional, and falsifiable
  • Identify the randomization unit, primary metric, and guardrail metrics of a test
  • State a decision rule in advance and explain why that keeps the test honest
  • Generate the seeded Lumen experiment data you’ll analyze in the guided project

Let’s take the test apart, one component at a time.


The Five Parts of a Test

Lumen wants to know whether a redesigned signup page will get more visitors to create an account. That’s the goal. To turn a goal into an experiment, we specify five parts.

1. The hypothesis. A hypothesis is a specific, directional, falsifiable statement about what the change will do. “Make signups better” is not a hypothesis — it’s a wish. It doesn’t say what “better” means, which direction to expect, or how you’d know if you were wrong. Compare: “The new signup page increases signup conversion rate compared to the current page.” That’s specific (signup conversion rate), directional (increases), and falsifiable (if conversion doesn’t go up, the hypothesis failed). You can only test a claim sharp enough to be proven wrong.

2. The randomization unit. This is what gets randomly assigned to control or treatment. For Lumen, that is the user. Each visitor is independently coin-flipped into the current page (control, “A”) or the new page (treatment, “B”), and stays in that bucket for the whole test. We chose the user back in Lesson 2 because it’s the unit that experiences the change and produces the outcome we care about. Randomizing users is what balances motivation and every other confounder across the two groups on average, so any imbalance that remains is chance rather than systematic bias.

3. The primary metric. This is the one number the decision hinges on. For Lumen it’s signup conversion rate — the fraction of visitors in a group who create an account. A good primary metric has two properties: it’s measurable per unit (each user either signs up or doesn’t, so we can compute a rate per group), and it’s sensitive to the change (a better signup page should plausibly move signups). Picking one primary metric up front stops you from fishing through a dozen numbers until one looks good.

4. Guardrail metrics. These are things that must not get worse, even if the primary improves. A flashier signup page might lift conversion but slow the page down, or attract sign-ups who never finish a course, or drive up refunds. So Lumen watches guardrails like page load time, downstream course-completion rate, and refund rate. A win on the primary that wrecks a guardrail isn’t a win. We’ll treat guardrails fully in a later module — for now, just know that a real test names them before it starts.

5. The decision rule. This is the part everyone forgets: decide in advance what result will make you ship. Lumen’s rule: ship the new page if signup conversion improves and the improvement is statistically significant, and no guardrail metric regresses. Everything — the direction, the metric, the bar — is written down before the data arrives.
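
One way to make “written down before the data arrives” concrete is to record the five parts in an immutable object before any user is bucketed. The sketch below is only an illustration: ExperimentPlan and its field names are hypothetical, not part of Lumen’s tooling or of any library.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ExperimentPlan:
    """Hypothetical record of the five parts, fixed before the test runs."""
    hypothesis: str
    randomization_unit: str
    primary_metric: str
    guardrail_metrics: Tuple[str, ...]
    decision_rule: str


lumen_plan = ExperimentPlan(
    hypothesis=("The new signup page increases signup conversion rate "
                "compared to the current page."),
    randomization_unit="user",
    primary_metric="signup conversion rate",
    guardrail_metrics=("page load time",
                       "course-completion rate",
                       "refund rate"),
    decision_rule=("Ship if signup conversion improves, the improvement is "
                   "statistically significant, and no guardrail regresses."),
)

# Attempting to edit the plan after the fact raises an error.
try:
    lumen_plan.primary_metric = "whichever metric moved"
    plan_was_edited = True
except FrozenInstanceError:
    plan_was_edited = False
```

Freezing the record is a design choice, not a requirement: it simply makes “changing the metric after seeing the data” an explicit error instead of a quiet edit.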

Figure: Lumen’s signup test in one picture. Users are randomly split by a coin flip into Control (A), which sees the current signup page, and Treatment (B), which sees the new signup page. The same metric, signup conversion rate, is measured for each group, and the two rates are compared to see whether the new page did better.
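
The split can be sketched in code. The lesson only says that each user is coin-flipped and then stays in the same bucket; hashing the user ID, as below, is one common way to get both properties, and it is an assumption here rather than Lumen’s stated method. The function name assign_group, the experiment label, and the user IDs are all hypothetical.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "signup-page-test") -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing (experiment, user_id) behaves like a fair coin flip per user,
    and the same user always gets the same answer, so they stay in one bucket.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 1 else "control"


# Hypothetical user IDs, for illustration only.
users = [f"user-{i}" for i in range(10_000)]
groups = [assign_group(u) for u in users]
share_treatment = groups.count("treatment") / len(groups)
```

With this approach no assignment table has to be stored: asking again for the same user returns the same group, and share_treatment should come out close to one half.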

Decide Before You Run

Notice the throughline: hypothesis → design (unit, metrics) → decision rule, and every one of them is fixed before a single user is bucketed. That ordering isn’t bureaucracy — it’s what keeps the experiment honest. If you write the decision rule first, the data can only confirm or deny a claim you already committed to. If you wait until the numbers are in, you’ll find yourself reaching for whichever metric happened to move, or lowering the bar because the result is so close. An experiment designed in advance answers a question; an experiment interpreted after the fact just launders a decision you’d already made.
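
Lumen’s decision rule is simple enough to write as a function before the test starts. This is a minimal sketch: should_ship and its argument names are hypothetical, and how “statistically significant” gets determined is left to the analysis in the next lesson.

```python
def should_ship(primary_improved: bool,
                improvement_is_significant: bool,
                guardrails_regressed: int) -> bool:
    """Lumen's pre-committed rule: every condition must hold to ship."""
    return (primary_improved
            and improvement_is_significant
            and guardrails_regressed == 0)


# The rule gives a fixed answer for each outcome, decided in advance.
clear_win = should_ship(True, True, 0)            # True
not_significant = should_ship(True, False, 0)     # False
broke_a_guardrail = should_ship(True, True, 1)    # False
```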

Deciding after you see the data invites bias

The most tempting mistake in all of experimentation is choosing the rule after looking at the result: glancing at the numbers and then deciding they’re “good enough,” or checking the test every day and stopping the moment it looks significant. That second habit is called peeking, and it quietly inflates your false-positive rate: given enough looks, random noise will eventually cross any line you draw. We’ll dedicate real time to it in Module 6. For now, the fix is simple and free: write the decision rule down before you run, and hold yourself to it.
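
To see the inflation, here is a rough simulation of an A/A test, where both groups share the same true 10% rate and any “significant” result is a false positive. It assumes a two-sided two-proportion z-test at the 5% level (|z| > 1.96) and ten evenly spaced looks of 500 users per group. The test, the number of looks, the number of simulated experiments, and the seed are illustrative choices, not ones fixed by this lesson.

```python
import numpy as np

rng_sim = np.random.default_rng(0)       # illustrative seed
n_sims, n_looks, block = 2000, 10, 500   # 10 looks x 500 users = 5,000 per group
p_true = 0.10                            # same true rate in BOTH groups (A/A)

# Cumulative signups per group at each look, for every simulated experiment.
a = rng_sim.binomial(block, p_true, size=(n_sims, n_looks)).cumsum(axis=1)
b = rng_sim.binomial(block, p_true, size=(n_sims, n_looks)).cumsum(axis=1)
n_seen = block * np.arange(1, n_looks + 1)   # users per group seen at each look

# Two-proportion z statistic with a pooled rate, computed at every look.
p_pool = (a + b) / (2 * n_seen)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_seen)
z = (b - a) / n_seen / se
significant = np.abs(z) > 1.96

fp_final = significant[:, -1].mean()         # one planned look at the end
fp_peeking = significant.any(axis=1).mean()  # stop at the first "significant" look
```

Under these assumptions, fp_peeking should come out well above fp_final, even though there is no real difference to find in any of the simulated tests.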


Set Up Lumen’s Experiment

Let’s build the exact experiment you’ll analyze in the next lesson. Lumen runs the test with 5,000 users in each group. Each user either signs up or doesn’t: a single yes/no outcome, which is a Bernoulli trial. In this simulation, the current page truly converts 10% of visitors and the new page truly converts 12%. In a real test you would never see those true rates, so we’ll act as if we don’t know them: the whole job of the experiment is to detect the 2-percentage-point difference from the data alone.

import numpy as np
rng = np.random.default_rng(7)
n_c = n_t = 5000                      # 5000 users per group
conv_control   = rng.random(n_c) < 0.10   # current page ~10% convert
conv_treatment = rng.random(n_t) < 0.12   # new page ~12% convert

Reading it line by line:

  • rng = np.random.default_rng(7) seeds a random number generator so the experiment is reproducible — you and everyone else running this get the same “users.”
  • n_c = n_t = 5000 sets the sample size: 5,000 users in control, 5,000 in treatment.
  • rng.random(n_c) < 0.10 draws 5,000 numbers uniformly in [0, 1) and marks each as True when it lands below 0.10. Since a uniform draw falls below 0.10 with probability 0.10, this gives each control user a 10% chance of converting: one Bernoulli outcome per user. conv_control is a length-5,000 array of True/False (signed up / didn’t).
  • rng.random(n_t) < 0.12 does the same for treatment, but with a 12% threshold — the new page’s true rate.
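
As a quick check on the thresholding trick, the sketch below uses a separate generator and a much larger sample than the experiment (both are illustrative choices, so Lumen’s data stays untouched) to confirm that about 10% of uniform draws land below 0.10.

```python
import numpy as np

check_rng = np.random.default_rng(123)   # separate seed: not Lumen's data
draws = check_rng.random(1_000_000)      # uniform values in [0, 1)
below = draws < 0.10                     # True where a draw lands under 0.10

share_below = below.mean()               # mean of booleans = fraction that are True
```

Taking the mean of a boolean array works because NumPy treats True as 1 and False as 0, which is the same reason conv_control.mean() gives a conversion rate.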

That’s the whole setup. We’re deliberately not computing the result here — measuring the effect is the job of the guided project in Lesson 5. When we get there, the first move will be to compare the two observed rates, conv_control.mean() versus conv_treatment.mean(), and then ask the harder question: is the gap real, or could it be noise? Notice that because these are random draws, the observed rates won’t be exactly 0.10 and 0.12 — that gap between the truth and what you measure is exactly what the rest of this course is about.
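
How far from the truth can an observed rate land? As a sketch, the code below replays the control group many times (10,000 hypothetical replays, each with 5,000 users at a true 10% rate) and compares the spread of the observed rates with the textbook standard error sqrt(p(1 - p)/n). The replay count and seed are assumptions for illustration, and none of this touches Lumen’s seeded dataset.

```python
import numpy as np

rng_replay = np.random.default_rng(2024)   # illustrative seed
p_true, n_users, n_replays = 0.10, 5000, 10_000

# Each replay: number of signups among n_users Bernoulli(p_true) users.
signups = rng_replay.binomial(n_users, p_true, size=n_replays)
observed_rates = signups / n_users

simulated_spread = observed_rates.std()
theoretical_se = np.sqrt(p_true * (1 - p_true) / n_users)   # about 0.0042
```

The two numbers should agree closely: with 5,000 users, an observed rate usually lands within a fraction of a percentage point of the true rate, but rarely exactly on it.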


Practice Exercises

Exercise 1: Fix the hypothesis

A teammate proposes the hypothesis “the new page will improve the funnel.” Explain what’s wrong with it, and rewrite it as a proper hypothesis for Lumen’s test.

Hint

It’s vague and not falsifiable: “improve” isn’t a direction on a named metric, and “the funnel” isn’t one measurable number, so you couldn’t say whether the test succeeded or failed. A proper version: “The new signup page increases signup conversion rate compared to the current page.” It names the metric (signup conversion rate), the direction (increases), and a clear failure condition (conversion doesn’t rise).

Exercise 2: Primary or guardrail?

Lumen is testing the new page. Classify each as the primary metric or a guardrail: (a) signup conversion rate, (b) page load time, (c) refund rate. Why does the distinction matter for the decision?

Hint

(a) is the primary: it’s the one number the decision hinges on, and the change is meant to move it. (b) and (c) are guardrails: things that must not get worse even if conversion goes up. The distinction matters because a win on the primary that regresses a guardrail (a higher-converting page that’s much slower to load, or that drives more refunds) is not a win; the decision rule requires the primary to improve and the guardrails to hold.

Exercise 3: Why decide in advance?

The test finishes and signup conversion is up, but only slightly. A manager suggests, “Let’s just look and decide if it’s enough to ship.” Why is that risky, and what should have happened instead?

Hint

Deciding after seeing the data invites bias — you’ll be tempted to move the bar to match the result you’re hoping for, and repeated looking (peeking) inflates the chance of shipping on noise. The decision rule should have been written before the test ran: ship only if conversion improves, the improvement is statistically significant, and no guardrail regresses. With that rule fixed in advance, the borderline result has a predetermined answer, and no one gets to reinterpret it after the fact.


Summary

A complete A/B test is assembled from five parts, all fixed before you run. The hypothesis is a specific, directional, falsifiable statement (“the new signup page increases signup conversion rate”), not a vague goal. The randomization unit (the user, for Lumen) is what gets coin-flipped into control or treatment. The primary metric (signup conversion rate) is the one number the decision hinges on: measurable per unit and sensitive to the change. Guardrail metrics (page load time, course-completion rate, refund rate) are things that must not get worse even if the primary improves. And the decision rule (ship if the primary improves significantly with no guardrail regression) is committed to in advance, because deciding after seeing the data invites bias and peeking. We also generated Lumen’s seeded experiment: 5,000 users per group, a current page at a true 10% and a new page at a true 12%, as Bernoulli outcomes per user. That is the exact setup you’ll analyze next.

Key Concepts

  • Hypothesis: a specific, directional, falsifiable claim about what the change will do to a named metric.
  • Randomization unit: what gets randomly assigned to control or treatment; for Lumen, the user.
  • Primary metric: the single number the decision hinges on; measurable per unit and sensitive to the change.
  • Guardrail metrics: quantities that must not regress even if the primary improves.
  • Decision rule: the ship/no-ship criteria fixed in advance, which is what keeps the experiment honest.

Why This Matters

Most tests that go wrong don’t fail on the statistics — they fail because a part was missing: no real hypothesis, an ambiguous primary metric, guardrails no one thought about, or a decision made after the fact. Naming all five parts up front is the difference between an experiment that answers a question and a dashboard that rationalizes a decision you’d already made. With Lumen’s experiment now fully specified and its data generated, you’re ready to actually run the analysis: compare the groups, and decide whether the gap you see is real.


Next Steps

Continue to Lesson 5 - Guided Project: Your First Experiment on Lumen

Run the analysis: compare the two groups' conversion rates and decide whether the difference is real.


Continue Building Your Skills

You’ve taken an A/B test apart into its five parts — hypothesis, randomization unit, primary metric, guardrails, and a decision rule fixed before you run — and seen how putting them in that order is what keeps an experiment honest. You also built Lumen’s experiment: 5,000 users per group, a true 10% control rate and 12% treatment rate, generated as Bernoulli outcomes per user. Next you’ll take that exact data and run your first analysis end to end — measuring the gap, and asking whether it’s real.