Lesson 5 - Guided Project: Design Lumen's Pricing Experiment
Welcome to the Guided Project
Across this module you took an experiment apart one decision at a time: sharpening a fuzzy goal into a hypothesis (Lesson 1), choosing the one primary metric plus the guardrails that protect everything else (Lesson 2), picking the randomization unit that keeps buckets clean (Lesson 3), and setting the minimum detectable effect that drives how much data you need (Lesson 4). Now you’ll put all four back together on one real decision. Lumen wants to redesign its pricing page, hoping more free users click through to a paid plan. Your job isn’t to run that test yet — it’s to write the design doc: the complete, fixed-up-front spec that says exactly what you’re testing, how you’ll measure it, who gets bucketed, and how many users you need before you can trust the answer. By the end you’ll have a doc a stakeholder could sign off on and an engineer could build from.
By the end of this project, you will be able to:
- Write a specific, directional, falsifiable hypothesis for a pricing experiment
- Choose a primary metric, guardrails, and a secondary metric, plus a decision rule
- Justify randomizing by user rather than by session
- Compute the required sample size per arm with scipy and read how the MDE moves it
We’ll build it in stages, reusing the exact pieces you wrote earlier in the module. Let’s design Lumen’s pricing experiment.
Stage 1: The Hypothesis
Start where Lesson 1 said to start: turn the goal into a claim that data could prove wrong. “Make the pricing page convert better” names no change, no metric, and no threshold — it can’t fail, so it can’t be tested. Sharpen it:
Hypothesis: Lumen’s new pricing page increases the free-to-paid upgrade rate by at least 2 percentage points versus the current page.
Check it against the three marks. It’s specific: one change (the new pricing page), one primary metric (upgrade rate). It’s directional: it predicts an increase, and by at least 2 points. And it’s falsifiable: if the new page converts the same or worse, or lifts the rate by less than 2 points, the hypothesis is refuted. The threshold matters most here. “At least 2 points” is the smallest lift Lumen would bother shipping for, and in Stage 4 that exact number becomes the input that decides how many users the test needs.
Framed as statisticians do it: the null (H₀) says the new page’s true upgrade rate equals the current page’s — any difference is noise. The alternative (H₁) says the new page’s rate is higher. The experiment will measure how surprising the observed gap would be if the null were true. That’s the claim; everything downstream is pointed at it.
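Written compactly, the two statements look like this. The symbols are shorthand introduced here, not notation from earlier lessons: p_C is the true upgrade rate of the current page and p_T is the true upgrade rate of the new page.

```latex
H_0:\; p_T = p_C
\qquad\text{versus}\qquad
H_1:\; p_T > p_C
```

The 2-point threshold isn’t part of either statement; it comes back in Stage 4 as the effect size the test is sized to detect.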
Stage 2: The Metrics
A hypothesis names one primary metric, but a real design tracks three tiers — the one you decide on, the ones that must not break, and the ones that explain what happened. This is the Lesson 2 hierarchy.
- Primary — upgrade conversion rate. The share of free users who upgrade to paid. This is the metric the hypothesis is about and the one the ship decision hinges on. One primary, no committee.
- Guardrails — refund rate and support-ticket rate. A pricing page that inflates upgrades by confusing or over-promising to users will show up here: buyers who churn back out (refunds) or users who can’t tell what they bought (support tickets). Guardrails don’t need to improve; they must simply not regress.
- Secondary — ARPU (average revenue per user). More upgrades is the goal, but ARPU tells you whether the extra upgrades are worth as much — a new page that pushes users toward a cheaper plan could lift upgrades while ARPU stalls. Secondary metrics don’t decide; they explain.
State the decision rule now, in the doc, before any data: ship the new page only if the primary metric wins its significance test and no guardrail regresses. A primary win with a spiked refund rate is not a win — it’s a problem you’d have shipped blind if you hadn’t named the guardrail up front.
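The rule is simple enough to write down as code, which is one way to make sure nobody reinterprets it later. This is a minimal sketch: the function and metric names are hypothetical, and how you decide that the primary “won” or that a guardrail “regressed” is left to the analysis that comes later in the course.

```python
def should_ship(primary_wins: bool, guardrail_regressed: dict[str, bool]) -> bool:
    """Ship only if the primary metric wins AND no guardrail regressed."""
    return primary_wins and not any(guardrail_regressed.values())


# A primary win with a spiked refund rate is still a "no".
print(should_ship(True, {"refund_rate": True, "support_ticket_rate": False}))   # False
# A primary win with clean guardrails is a "yes".
print(should_ship(True, {"refund_rate": False, "support_ticket_rate": False}))  # True
```

Note that ARPU never appears in the function. A secondary metric explains the result; it doesn’t get a vote.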
Stage 3: The Randomization Unit
Lesson 3’s question: what do you flip the coin on? For a pricing page, randomize by user, not by session or by pageview.
The reason is consistency of experience. A free user might visit the pricing page three times over a week before upgrading. If you randomized per session, the same person could see the new page Monday, the old page Wednesday, and the new page again Friday — a jarring, incoherent experience, and worse, an unanalyzable one: which page gets “credit” for the eventual upgrade? Randomizing by user assigns a person to one variant and keeps them there across every visit, so the whole path to upgrade happens inside a single, clean bucket.
You’d implement this by hashing a stable user_id into a bucket, so a given user always lands in the same arm no matter how many times they return. Session- or pageview-level randomization is fine for changes that are self-contained within one visit (a button color, say), but a pricing decision unfolds over multiple visits — so the unit has to be the user.
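Here is a rough sketch of what hash-based bucketing could look like. The function name, the experiment label, the choice of SHA-256, and the 50/50 split are all assumptions for illustration, not Lumen’s actual implementation.

```python
import hashlib


def assign_arm(user_id: str, experiment: str = "pricing_page_v2") -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Prefixing the experiment label (hypothetical) means a different
    # experiment would shuffle the same users into different buckets.
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # 0..99
    return "treatment" if bucket < 50 else "control"


# The same user lands in the same arm on every visit.
print(assign_arm("user_123") == assign_arm("user_123"))  # True
```

Because the arm is computed from the user_id rather than stored per session, there is nothing to keep in sync: any service that knows the user_id can recompute the assignment.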
Stage 4: The MDE and Sample Size
Now the number that makes the design real. From Stage 1, the minimum detectable effect is +2 percentage points. Lumen’s current pricing page converts free users at a 10% baseline, so the test needs enough users to reliably tell a true 10% → 12% lift apart from noise. That’s a sample-size calculation on two proportions — the standard normal approximation from Lesson 4:
import math
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    # Users needed per arm to detect a shift from rate p1 to rate p2
    # (normal approximation for two proportions, two-sided alpha).
    z_a = norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_b = norm.ppf(power)          # quantile for the desired power
    return math.ceil((z_a + z_b)**2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)**2)

print(n_per_arm(0.10, 0.12))  # 3839
print(n_per_arm(0.10, 0.11))  # 14749

Run for real with scipy, this prints 3839 and 14749. So detecting the +2-point lift needs 3,839 users per arm (7,678 total across control and treatment) at the usual 5% significance and 80% power.
Look at what the second print call says. If Lumen tightened the MDE to +1 point (a 10% → 11% lift), the requirement jumps to 14,749 per arm: almost four times the users to detect half the effect. That’s Lesson 4 made concrete: sample size scales with the inverse square of the effect you want to catch, so halving the MDE roughly quadruples the cost. Deciding the smallest lift worth shipping isn’t a statistical nicety. It directly sets how long the experiment runs and how many users it burns.
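To see the inverse-square pattern across more than two points, you can sweep the MDE. The sketch below repeats the same formula but swaps scipy’s norm.ppf for the standard library’s statistics.NormalDist().inv_cdf, which returns the same normal quantiles, so it runs without any install. The set of MDE values is an arbitrary choice for illustration.

```python
import math
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    # Same formula as the lesson's function; inv_cdf stands in for norm.ppf.
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


baseline = 0.10
n_at_2pt = n_per_arm(baseline, baseline + 0.02)

# Each halving of the MDE should cost roughly four times the users.
for mde in (0.04, 0.02, 0.01):
    n = n_per_arm(baseline, baseline + mde)
    print(f"MDE +{mde * 100:.0f} pt: {n:>7,} per arm ({n / n_at_2pt:.2f}x the +2 pt design)")
```

The ratio from +2 to +1 point comes out a little under 4 because the variance term in the numerator also shrinks slightly as the target rate moves from 12% to 11%.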
Two dials, alpha=0.05 and power=0.80, sit inside the function. Module 3 is where you’ll learn what they mean and how changing them moves the count; for the design doc, the defaults are the standard choice, and the point here is that the number is computed, not guessed.
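For reference, the arithmetic inside n_per_arm can be written out as a formula. The symbols are shorthand introduced here: p_1 is the baseline rate, p_2 is the target rate, and z_q is the standard normal quantile at probability q (the value norm.ppf(q) returns). The second line plugs in the Stage 4 numbers.

```latex
\begin{aligned}
n &= \left\lceil
      \frac{\left(z_{1-\alpha/2} + z_{\text{power}}\right)^{2}
            \,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}
           {(p_2 - p_1)^{2}}
     \right\rceil \\[6pt]
  &= \left\lceil
      \frac{(1.95996 + 0.84162)^{2}\,(0.0900 + 0.1056)}{(0.12 - 0.10)^{2}}
     \right\rceil
   \approx \left\lceil \frac{7.8489 \times 0.1956}{0.0004} \right\rceil
   = \lceil 3838.1 \rceil = 3839
\end{aligned}
```

The α/2 in the first quantile means the function sizes the test for a two-sided comparison, the more conservative convention, even though the hypothesis in Stage 1 is directional.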
The design doc is the contract
Write the whole design — hypothesis, metrics, decision rule, MDE, sample size — before the experiment runs, and treat it as a contract you don’t renegotiate mid-flight. The temptation once data starts arriving is to move the goalposts: lower the threshold that suddenly looks out of reach, promote a secondary metric that happened to move, stop early because the primary is briefly ahead. Every one of those quietly turns a real test into a story fitted to noise. The doc fixes the target while you still can’t see where the arrows will land, which is the only time fixing it is honest.
Stage 5: The Design Doc, Assembled
Put the four decisions into one readable spec — this is the artifact you’d circulate for sign-off:
Experiment: Lumen New Pricing Page
- Hypothesis: The new pricing page increases the free-to-paid upgrade rate by at least 2 percentage points versus the current page.
- Randomization unit: User (hashed user_id), so each user sees one consistent variant across all visits.
- Primary metric: Free-to-paid upgrade conversion rate.
- Guardrail metrics: Refund rate, support-ticket rate — must not regress.
- Secondary metric: ARPU (average revenue per user) — context only.
- Baseline: 10% upgrade rate. MDE: +2 percentage points (target 12%).
- Sample size: 3,839 users per arm; 7,678 total (α = 0.05, power = 0.80, computed with scipy).
- Decision rule: Ship only if the primary metric wins its significance test and no guardrail regresses.
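If your team keeps experiment specs in code, one way to make the “contract” literal is an immutable record. This is only a sketch of the idea: the class and field names are hypothetical, and the values are copied from the spec above.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ExperimentDesign:
    name: str
    hypothesis: str
    randomization_unit: str
    primary_metric: str
    guardrail_metrics: tuple[str, ...]
    secondary_metrics: tuple[str, ...]
    baseline_rate: float
    mde_abs: float  # minimum detectable effect, in absolute rate units
    alpha: float
    power: float
    n_per_arm: int

    @property
    def n_total(self) -> int:
        return 2 * self.n_per_arm  # control + treatment


design = ExperimentDesign(
    name="Lumen New Pricing Page",
    hypothesis="New pricing page lifts free-to-paid upgrade rate by >= 2 points",
    randomization_unit="user_id",
    primary_metric="upgrade_conversion_rate",
    guardrail_metrics=("refund_rate", "support_ticket_rate"),
    secondary_metrics=("arpu",),
    baseline_rate=0.10,
    mde_abs=0.02,
    alpha=0.05,
    power=0.80,
    n_per_arm=3839,
)
print(design.n_total)  # 7678

try:
    design.mde_abs = 0.01  # moving the goalposts mid-flight
except FrozenInstanceError:
    print("The design is fixed; changing it means writing a new doc.")
```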
That’s a complete design — every decision fixed, every number justified. But notice what it deliberately does not contain, because those pieces belong to the modules ahead:
- The power calculation, in full. You have the sample size, but not yet the machinery behind α, power, and how the n_per_arm count actually falls out. That’s Module 3: Power and Sample Size.
- The analysis. Nothing here runs a significance test, because there’s no data yet. Once the experiment collects its 7,678 users, you’ll test whether the primary difference is real and check the guardrails. That’s Module 4.
This lesson is the design — the contract you sign before bucketing a single user. Running it and analyzing the result come next.
Practice Exercises
Exercise 1: Resize for a healthier funnel
Suppose Lumen’s baseline upgrade rate is actually 15%, not 10%, but you still want to detect a +2-point lift (15% → 17%). Recompute the required sample size per arm. Is it larger or smaller than the 3,839 you got at a 10% baseline?
Hint
Call n_per_arm(0.15, 0.17). It returns 5,271 per arm, more than at the 10% baseline, which surprises people. The reason is the variance term p*(1-p): a proportion’s variance is largest at 50%, so moving the baseline from 10% up to 15% makes each user noisier and demands a bigger sample to see the same 2-point lift. The MDE isn’t the only thing that moves the count; the baseline does too.
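To see the variance term on its own, you can print p*(1-p) at a few baselines. The helper name below is made up for this sketch, and the 30% and 50% baselines are included only to show where the curve peaks.

```python
def bernoulli_variance(p: float) -> float:
    """Variance of a single yes/no outcome with success probability p."""
    return p * (1 - p)


for p in (0.10, 0.15, 0.30, 0.50):
    print(f"baseline {p:.0%}: p*(1-p) = {bernoulli_variance(p):.4f}")
# baseline 10%: p*(1-p) = 0.0900
# baseline 15%: p*(1-p) = 0.1275
# baseline 30%: p*(1-p) = 0.2100
# baseline 50%: p*(1-p) = 0.2500
```

Since this term sits in the numerator of the sample-size formula, a noisier baseline raises the required count even when the MDE stays at 2 points.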
Exercise 2: Add a third guardrail
The design has two guardrails (refund rate, support-ticket rate). Name a third guardrail specific to a pricing change, and explain what failure it would catch that the first two miss.
Hint
A strong choice is plan-downgrade rate or paid-plan churn within 30 days — a page that nudges users into a plan too expensive for them can win on upgrade rate today and lose them next month, which neither refunds nor tickets fully catch. Another is checkout-abandonment rate: a confusing new layout might drive more clicks to upgrade while more people bail at payment. A guardrail is any metric a “win” could quietly break — name the failure mode first, then the metric that would reveal it.
Exercise 3: Weigh the cost of a tighter MDE
Lumen’s PM asks: “Can we detect a +1-point lift instead of +2?” Using the numbers from Stage 4, quantify the cost and give the PM a one-line recommendation.
Hint
n_per_arm(0.10, 0.11) returns 14,749 per arm versus 3,839 for +2 points — about 3.8× the users, so nearly four times the runtime (or four times the traffic) to catch an effect half the size. The recommendation: only tighten to +1 point if a 1-point lift is genuinely worth shipping and Lumen has the traffic to run that long; otherwise the +2-point design answers the real decision far faster. The MDE is a business call about the smallest lift worth acting on, not a knob to make small just because you can.
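If it helps the conversation with the PM, you can turn the two counts into calendar time. The sketch below assumes a made-up figure of 500 new eligible users reaching the pricing page per day, split evenly across two arms, with every user counted once; substitute Lumen’s real traffic before quoting any dates.

```python
import math


def days_to_fill(n_per_arm: int, daily_eligible_users: int, arms: int = 2) -> int:
    """Days needed to enroll n_per_arm users in every arm, assuming an even split."""
    return math.ceil(n_per_arm * arms / daily_eligible_users)


DAILY_USERS = 500  # hypothetical traffic figure, not from the lesson

print(days_to_fill(3839, DAILY_USERS))   # 16 days for the +2-point design
print(days_to_fill(14749, DAILY_USERS))  # 59 days for the +1-point design
```

Under that assumption the +1-point design runs about two months instead of about two weeks, which is usually the number a PM actually needs to make the call.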
Summary
You assembled a complete experiment design for Lumen’s pricing page by reusing all four decisions from this module. You wrote a specific, directional, falsifiable hypothesis (“the new pricing page lifts the free-to-paid upgrade rate by at least 2 points”), then built its metric hierarchy — a single primary (upgrade rate), two guardrails that must not regress (refund rate, support-ticket rate), and a secondary for context (ARPU) — bound together by a decision rule that requires a primary win and no guardrail regression. You chose to randomize by user so each person sees one consistent variant across visits. And you set the MDE at +2 points on a 10% baseline and computed the sample size with scipy: 3,839 users per arm, 7,678 total — with n_per_arm(0.10, 0.11) = 14,749 showing that a stricter +1-point target costs nearly four times as much. Every number here was computed for real with scipy, not estimated. The result is a design doc fixed before any user is bucketed — the honest contract that makes the running and analysis ahead trustworthy.
Key Concepts
- Design doc as contract — hypothesis, metrics, unit, MDE, and sample size all fixed before running, so you can’t refit the target to the data.
- Metric hierarchy with a decision rule — one primary decides, guardrails must not regress, secondary explains; ship only on a primary win with clean guardrails.
- Randomize by user for pricing — one consistent variant per user across visits keeps the path to upgrade inside a single clean bucket.
- MDE drives sample size — halving the detectable effect roughly quadruples the users needed (3,839 vs. 14,749 per arm).
Why This Matters
This is the deliverable a real experimentation team produces before touching a line of production code: a design doc a stakeholder signs off on and an engineer builds from. It forces every hard decision — what counts as success, what’s allowed to break, how long you’ll run — to be made while you still can’t see the result, which is exactly when those decisions are honest. Teams that skip this step end up debating what the experiment “meant” after the fact, with the numbers already in hand and the temptation to spin them fully loaded. You now have the discipline and the arithmetic to avoid that. Next, Module 3 opens up the sample-size machinery you used here as a black box — significance, power, and where that 3,839 actually comes from.
Next Steps
Continue to Module 3 - Power and Sample Size
Significance, power, and computing exactly how many users your experiment needs.
Continue Building Your Skills
You can now write a full experiment design end to end — a sharp hypothesis, a metric hierarchy with a decision rule, a randomization unit, and a scipy-computed sample size — and hold all of it fixed before a single user is bucketed. That design doc is the foundation everything else rests on. Next, Module 3 cracks open the part you took on faith here: what significance and power actually mean, and how the required sample size falls out of them.