Lesson 2 - Choosing Your Metrics
Welcome to Choosing Your Metrics
Last lesson you sharpened Lumen’s goal into a hypothesis: the new signup page increases signup conversion rate by at least 2 points. That sentence already names a metric — conversion rate — and it’s tempting to stop there. But a real experiment measures more than one number. It watches the metric the decision depends on, and it watches a handful of others to make sure the win isn’t quietly costing you something worse. The mistake isn’t measuring too little; it’s measuring everything with equal weight, so that when the numbers disagree you have no rule for what to do. This lesson gives you that rule: a metric hierarchy where each number has a job, and only one of them decides whether you ship.
By the end of this lesson, you will be able to:
- Separate metrics into primary, guardrail, and secondary tiers, each with a distinct job
- Explain what makes a good primary metric: aligned, sensitive, and measurable per unit
- Compute conversion, ARPU/ARPPU, and a latency guardrail from a seeded, reproducible sample
- Apply the decision rule: ship only if the primary wins and no guardrail regresses
Let’s start with the three tiers and why exactly one of them gets to decide.
The Metric Hierarchy
An experiment can report on dozens of numbers, but they don’t all mean the same thing. Sort them into three tiers before the test runs:
- Primary metric — there is exactly one, and the decision hinges on it. For Lumen’s signup test, that’s signup conversion rate. If it doesn’t move, the experiment failed, full stop. Naming a single primary metric up front is what stops you from running the test, scanning everything that moved, and declaring victory on whatever happened to wiggle.
- Guardrail metrics — numbers that must not get worse, even if the primary improves. A faster-converting signup page is a bad trade if it doubled page load time, spiked the refund rate, or pulled in users who never finish a course. Guardrails are the “not at any cost” clause: page load time, refund rate, downstream course-completion.
- Secondary metrics — context that helps you explain the result but doesn’t drive the decision. Average revenue per user (ARPU) and time-to-signup tell a richer story about what changed and why, but you don’t ship or kill on them. They’re the “interesting, and here’s what probably happened” tier.
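One way to make the tiers explicit before the test runs is to write them down as data and check that exactly one metric is tagged primary. The sketch below is illustrative only: the `Tier` and `Metric` types, the `validate_hierarchy` helper, and the metric names are hypothetical, not part of any Lumen codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PRIMARY = "primary"      # the one metric the decision hinges on
    GUARDRAIL = "guardrail"  # must not get worse
    SECONDARY = "secondary"  # context only, never decides


@dataclass(frozen=True)
class Metric:
    name: str
    tier: Tier


def validate_hierarchy(metrics):
    """Return the primary metric's name; raise unless there is exactly one."""
    primaries = [m.name for m in metrics if m.tier is Tier.PRIMARY]
    if len(primaries) != 1:
        raise ValueError(f"expected exactly one primary metric, got {primaries}")
    return primaries[0]


# Hypothetical names for the metrics listed in the three tiers above.
lumen_metrics = [
    Metric("signup_conversion_rate", Tier.PRIMARY),
    Metric("page_load_time", Tier.GUARDRAIL),
    Metric("refund_rate", Tier.GUARDRAIL),
    Metric("course_completion", Tier.GUARDRAIL),
    Metric("arpu", Tier.SECONDARY),
    Metric("time_to_signup", Tier.SECONDARY),
]

print(validate_hierarchy(lumen_metrics))
```

Tagging a second metric as primary makes the check fail loudly, which is the point: the hard choice gets made before any data arrives.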
What Makes a Good Primary Metric
Not every number can carry the weight of a decision. A good primary metric has three properties:
- Aligned with the true goal — it measures the thing you actually care about, not a proxy that’s easy to move. Raw signup-button clicks is a vanity metric: a flashier button can lift clicks while completed signups stay flat or fall. Conversion rate — users who actually finish — is aligned with the goal; clicks are aligned with the button.
- Sensitive — it actually moves when the change works, and it moves within the experiment’s window. A metric that only shifts after months (annual retention, lifetime value) can’t be your primary in a two-week test; there’s nothing to detect. Pick something that responds to the change on the timescale you can actually run.
- Measurable per randomization unit — you need to compute it for each user (or whatever unit you randomize on) so you can compare treatment and control cleanly. A metric you can only compute in aggregate is hard to test statistically. Conversion rate is a clean per-user 0/1; that’s part of why it makes a good primary.
The recurring traps are the mirror image of these: vanity metrics that move without the goal moving (clicks, impressions), and downstream metrics that are aligned with the goal but too slow or rare to shift inside the test window. Both feel principled and both quietly break the experiment.
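To see why "measurable per randomization unit" matters, here is a minimal sketch that starts from per-user 0/1 outcomes and derives a conversion rate and a standard error for each group. The data is simulated, the 11% and 13% rates and the group sizes are assumptions made up for illustration, and the standard-error formula is the usual one for a proportion, not something specific to Lumen.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

# Simulated per-user outcomes: 1.0 = finished signup, 0.0 = did not.
# The 11% and 13% rates are assumptions for illustration, not Lumen results.
control = (rng.random(4000) < 0.11).astype(float)
treatment = (rng.random(4000) < 0.13).astype(float)


def rate_and_se(outcomes):
    """Conversion rate and its standard error from per-user 0/1 outcomes."""
    p = outcomes.mean()
    se = np.sqrt(p * (1 - p) / outcomes.size)
    return p, se


p_c, se_c = rate_and_se(control)
p_t, se_t = rate_and_se(treatment)

diff = p_t - p_c
se_diff = np.sqrt(se_c**2 + se_t**2)  # independent groups, so variances add

print(f"control:    {p_c:.4f} (SE {se_c:.4f})")
print(f"treatment:  {p_t:.4f} (SE {se_t:.4f})")
print(f"difference: {diff:.4f} (SE {se_diff:.4f})")
```

Because every user contributes one value, the uncertainty of each group's rate falls straight out of the data. A metric that exists only as a single aggregate number per group gives you nothing comparable to work with.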
Real Metrics From Lumen’s Data
To make the tiers concrete, here are metrics computed from a seeded, reproducible sample of simulated Lumen signups: 8,000 users, each with a conversion flag, a revenue figure, and a page load time. Because the random seed is fixed at 3, rerunning the code on the same NumPy version reproduces these exact numbers:
import numpy as np
rng = np.random.default_rng(3)
n = 8000
converted = rng.random(n) < 0.11
revenue = np.where(converted, rng.lognormal(mean=3.3, sigma=0.4, size=n), 0.0)
load_ms = rng.normal(820, 130, n)
print("conversion rate:", round(converted.mean(), 4)) # PRIMARY
print("ARPU: ", round(revenue.mean(), 2)) # SECONDARY (per user)
print("ARPPU: ", round(revenue[converted].mean(), 2)) # per converter
print("p95 load (ms): ", round(float(np.percentile(load_ms, 95)))) # GUARDRAIL

Output:
conversion rate: 0.1195
ARPU: 3.46
ARPPU: 28.99
p95 load (ms): 1034

Read these top to bottom as the hierarchy in action. Conversion rate (0.1195, about 12%) is the primary, the one number Lumen’s ship decision hinges on. The two revenue numbers are secondary, and they teach a lesson about denominators: ARPU is $3.46 per user, but ARPPU is $28.99 per converter. Same revenue, wildly different figures, because ARPU divides by everyone and ARPPU divides only by people who paid. If you report “revenue per user” without saying which denominator, you can be off by 8x and not know it, so always name the unit. Finally, p95 load is 1034ms: a guardrail. You’d watch it so a conversion win doesn’t sneak in on the back of a slower page; if the treatment lifts conversion but pushes p95 past your threshold, the win isn’t clean.
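The two revenue figures are tied together by the conversion rate: because non-converters contribute zero revenue, ARPU equals conversion rate times ARPPU. With the numbers reported above, 0.1195 × 28.99 ≈ 3.46, and the ratio ARPPU / ARPU is 1 / 0.1195, roughly 8.4. The sketch below rebuilds the same seeded sample and checks that relationship. The variable names are illustrative.

```python
import numpy as np

# Rebuild the same seeded sample used in the lesson's snippet.
rng = np.random.default_rng(3)
n = 8000
converted = rng.random(n) < 0.11
revenue = np.where(converted, rng.lognormal(mean=3.3, sigma=0.4, size=n), 0.0)

conversion_rate = converted.mean()
arpu = revenue.mean()              # total revenue / all users
arppu = revenue[converted].mean()  # total revenue / converters only

# Non-converters add nothing to total revenue, so the two figures
# differ only by the share of users who converted.
print("ARPU:              ", round(float(arpu), 2))
print("conversion x ARPPU:", round(float(conversion_rate * arppu), 2))
print("ARPPU / ARPU:      ", round(float(arppu / arpu), 2))
```

The first two printed values match, which is a quick sanity check worth running whenever both figures appear in the same report.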
One primary is a discipline, not a limitation
It feels safer to declare two or three primary metrics — surely more targets means more chances to win. It’s the opposite. With several “primary” metrics and no rule for what happens when they disagree, every outcome becomes arguable: conversion up but ARPU flat, is that a win? You’ll decide after the fact, which is exactly the post-hoc storytelling the hypothesis was supposed to prevent. One primary forces the hard choice — what does this experiment actually decide? — while you can still think clearly, before any data is in.
The Decision Rule
The hierarchy only pays off if it produces a decision, so tie the tiers together into one rule fixed before the test runs:
Ship only if the primary metric wins AND no guardrail regresses.
Both conditions are required. A primary win with a guardrail regression isn’t a win — a signup page that converts better but doubles p95 load or spikes refunds has traded a visible gain for a hidden cost. And a flat or losing primary doesn’t ship no matter how nice the secondary metrics look; a lovely ARPU story can’t rescue a conversion rate that didn’t move. Secondary metrics never appear in the rule at all — they explain the result, they don’t decide it. Writing this rule down in advance is what stops the post-experiment negotiation where every number gets reinterpreted until something justifies the launch.
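The rule is small enough to write as a function, which also makes it obvious that secondary metrics have no way in. This is a minimal sketch: the `should_ship` name, its arguments, and the guardrail keys are hypothetical, and it assumes the "primary wins" and "guardrail regressed" verdicts were already reached by whatever test you fixed in advance.

```python
def should_ship(primary_wins, guardrail_regressed):
    """Ship only if the primary metric wins AND no guardrail regresses.

    primary_wins: bool, the pre-agreed verdict on the primary metric.
    guardrail_regressed: dict mapping guardrail name -> True if it got worse.
    Secondary metrics are deliberately not a parameter.
    """
    regressed = sorted(name for name, bad in guardrail_regressed.items() if bad)
    if not primary_wins:
        return False, "primary metric did not win"
    if regressed:
        return False, "guardrail regression: " + ", ".join(regressed)
    return True, "primary won and no guardrail regressed"


# Hypothetical outcomes, for illustration only.
print(should_ship(True, {"p95_load_ms": False, "refund_rate": False}))
print(should_ship(True, {"p95_load_ms": True, "refund_rate": False}))
print(should_ship(False, {"p95_load_ms": False, "refund_rate": False}))
```

Only the first call returns a ship decision. The second is blocked by the latency guardrail, and the third never gets as far as the guardrails because the primary did not win.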
Practice Exercises
Exercise 1: Assign the tiers
Lumen is testing a new course-recommendation widget on the signup page. You have these metrics available: signup conversion rate, page load time, ARPU, refund rate, and time-to-signup. Assign each to primary, guardrail, or secondary.
Hint
Primary: signup conversion rate — the one number the ship decision hinges on. Guardrails: page load time and refund rate — a heavier widget could slow the page or attract worse-fit signups, and neither may get worse. Secondary: ARPU and time-to-signup — useful context (did revenue or friction change?) but they don’t drive the decision. Exactly one primary; the rest split by “must not regress” versus “helps explain.”
Exercise 2: Spot the vanity metric
A teammate proposes using “clicks on the signup button” as the primary metric because it’s easy to measure and moves fast. What’s the risk, and what would you use instead?
Hint
Button clicks is a vanity metric — it’s not aligned with the goal. A flashier button can lift clicks while completed signups stay flat or even fall, so a “win” on clicks could mask a loss on the thing you actually care about. Use signup conversion rate (users who finish) as the primary; it’s aligned with the goal, sensitive within the test window, and measurable per user. Fast and easy to measure doesn’t make a metric the right one.
Exercise 3: Name the denominator
Someone reports that Lumen’s revenue per user is “about $29” and someone else says it’s “about $3.50.” Both ran the numbers correctly. Explain how both can be right, and why it matters for a metric definition.
Hint
They’re using different denominators. ARPU ($3.46) divides total revenue by all users, including the ~88% who never converted. ARPPU ($28.99) divides only by converters. Same revenue, an 8x difference in the reported figure. It matters because a metric isn’t defined until its denominator is: “revenue per user” is ambiguous, and comparing an ARPU in treatment against an ARPPU in control would produce a nonsense result. Always pin down the unit — per user or per converter — before you report or compare.
Summary
An experiment measures more than one number, and the discipline is giving each number a job. The metric hierarchy has three tiers: exactly one primary metric the decision hinges on (Lumen’s signup conversion rate), guardrail metrics that must not regress even if the primary improves (page load, refund rate, downstream completion), and secondary metrics that explain the result without driving it (ARPU, time-to-signup). A good primary metric is aligned with the true goal (not a vanity metric like clicks), sensitive enough to move within the test window (not a too-slow downstream metric), and measurable per randomization unit. Computed on the seeded Lumen sample, conversion rate is 0.1195; ARPU is $3.46 per user against an ARPPU of $28.99 per converter, a reminder to name your denominator; and the p95 load of 1034ms serves as a guardrail. All of it feeds one rule, fixed before the test: ship only if the primary wins and no guardrail regresses.
Key Concepts
- Metric hierarchy — one primary metric decides, guardrails protect, secondary metrics explain.
- Good primary metric — aligned with the goal, sensitive within the window, measurable per unit.
- ARPU vs. ARPPU — same revenue, different denominators; always name the unit you’re dividing by.
- Decision rule — ship only if the primary wins and no guardrail regresses; secondary metrics never decide.
Why This Matters
Choosing metrics badly is how a “successful” experiment ships something harmful: a vanity primary that moves without the goal moving, or no guardrails so a conversion win quietly hides a latency or refund regression. The metric hierarchy and its decision rule turn a pile of numbers into an unambiguous answer — and fixing them before the test runs is what stops the post-hoc negotiation where every metric gets reinterpreted until the launch looks justified. It also sets up everything after: the primary metric you pick here is exactly what determines the unit you randomize on and how much data you’ll need. Next, you’ll choose that randomization unit — the thing you actually split into treatment and control.
Next Steps
Continue to Lesson 3 - The Randomization Unit
Decide what you actually split into treatment and control — users, sessions, or something else — and why the choice matters.
Continue Building Your Skills
You can now sort an experiment’s numbers into a hierarchy — one primary metric the decision hinges on, guardrails that must not regress, and secondary metrics that explain — and you’ve seen real metrics computed from Lumen’s data, denominators and all. You also have the rule that ties them together: ship only if the primary wins and no guardrail gets worse. Next you’ll pick the randomization unit — the thing you split into treatment and control — which the metric you just chose helps determine.