Lesson 1 - Variance Reduction with CUPED

Welcome to Variance Reduction with CUPED

In Module 3 you learned the iron law of sample size: to detect a smaller effect, you need dramatically more users, because the noise in your metric drowns small signals. What if you could reduce that noise directly, not by collecting more data, but by using data you already had before the experiment even started? That’s the idea behind CUPED (Controlled-experiment Using Pre-Experiment Data), one of the highest-leverage techniques in modern experimentation. If your metric is predictable from a user’s history, CUPED strips out the predictable part and leaves a quieter signal. With a covariate correlated around 0.7 with the outcome, the test reaches the same power with roughly half the users. This lesson builds the adjustment and verifies the saving in a simulation.

By the end of this lesson, you will be able to:

  • Explain why removing predictable variation shrinks a metric’s variance
  • Build the CUPED adjustment from a pre-experiment covariate
  • Show that variance drops by approximately the correlation squared
  • Connect the variance reduction to a smaller required sample size

Let’s start with the core idea.


Subtract What You Already Knew

Most metrics are partly predictable. A user who spent a lot last month tends to spend a lot this month; a user who was highly active before the experiment tends to stay active. That predictability is the key. If you can guess part of a user’s outcome from their history, then the surprising part — the part your experiment might actually be moving — is smaller and less noisy than the raw outcome.

CUPED formalizes this. Take a pre-experiment covariate X — any metric measured before the experiment that correlates with your outcome Y (last month’s spend, prior activity, historical conversion). Then define an adjusted metric:

Y' = Y − θ(X − X̄), where θ = cov(X, Y) / var(X)

The term θ(X − X̄) is the part of Y you could have predicted from X; subtracting it removes that predictable variation. Crucially, because X was measured before the experiment, it can’t have been affected by the treatment — so subtracting it doesn’t change the effect you’re measuring, only the noise around it.

[Figure: two distributions. On the left, a wide raw-metric distribution labeled 'Raw metric (wide)' with std err 0.0201 and a note that the effect is hard to see in the spread. An arrow labeled Y' = Y − θ(X − X̄), 'remove pre-period predictable part', points to the right, where a much narrower distribution labeled 'Adjusted metric (narrow)' has std err 0.0144 and 'same effect, far less noise'. A summary bar reads: variance falls by about ρ² (corr 0.7 gives −48%), the standard error drops 28%, so the same experiment reaches significance with roughly half the users; θ = cov(X, Y)/var(X).]
CUPED subtracts the part of the metric predictable from a pre-experiment covariate. The effect estimate is unchanged, but the distribution tightens dramatically — here variance falls 48% and the standard error 28%, because X and Y are correlated at 0.7.

How Much Noise Disappears

The size of the win is governed by one number: the correlation ρ between X and Y. The variance of the adjusted metric is var(Y') = var(Y)·(1 − ρ²) — so the fraction of variance you remove is ρ². A covariate correlated 0.7 with your outcome removes about half the variance; one correlated 0.9 removes 81%.
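
The 1 − ρ² result follows from a short derivation, sketched here for a general coefficient θ. It treats X̄ as a constant, so it drops out of the variance; the symbols are the same X, Y, θ, and ρ used above.

```latex
\begin{aligned}
\operatorname{var}(Y') &= \operatorname{var}\bigl(Y - \theta (X - \bar{X})\bigr)
  = \operatorname{var}(Y) - 2\theta\,\operatorname{cov}(X, Y) + \theta^{2}\operatorname{var}(X) \\
\frac{d}{d\theta}\operatorname{var}(Y') &= -2\operatorname{cov}(X, Y) + 2\theta\,\operatorname{var}(X) = 0
  \;\Longrightarrow\; \theta = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \\
\operatorname{var}(Y') &= \operatorname{var}(Y) - \frac{\operatorname{cov}(X, Y)^{2}}{\operatorname{var}(X)}
  = \operatorname{var}(Y)\,(1 - \rho^{2}),
  \qquad \rho = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}}
\end{aligned}
```

So θ = cov(X, Y)/var(X) is the choice of coefficient that minimizes the adjusted variance, and at that choice exactly a fraction ρ² of the variance is removed.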

Let’s verify it. We simulate an experiment with a true effect of 0.20, where the outcome Y correlates 0.7 with a pre-experiment covariate X, then compare the raw and CUPED analyses:

import numpy as np

rng = np.random.default_rng(5)
n, rho, delta = 5000, 0.7, 0.20

def make_arm(shift):
    x = rng.normal(0, 1, n)                                  # pre-experiment covariate
    eps = rng.normal(0, np.sqrt(1 - rho**2), n)             # unpredictable part
    return x, shift + rho * x + eps                          # Y = shift + rho*X + noise

xc, yc = make_arm(0.0)
xt, yt = make_arm(delta)                                     # treatment adds the effect
x, y = np.concatenate([xc, xt]), np.concatenate([yc, yt])

theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)      # CUPED coefficient
yc_adj = yc - theta * (xc - x.mean())
yt_adj = yt - theta * (xt - x.mean())

var_reduction = 1 - np.var(np.r_[yc_adj, yt_adj], ddof=1) / np.var(y, ddof=1)
se_raw   = np.sqrt(yc.var(ddof=1)/n + yt.var(ddof=1)/n)
se_cuped = np.sqrt(yc_adj.var(ddof=1)/n + yt_adj.var(ddof=1)/n)
print(f"variance reduction: {var_reduction:.3f}   (rho^2 = {rho**2:.2f})")
print(f"std error  raw {se_raw:.4f}  ->  cuped {se_cuped:.4f}")

Running it:

variance reduction: 0.480   (rho^2 = 0.49)
std error  raw 0.0201  ->  cuped 0.0144

The variance fell 48%, close to the predicted ρ² = 0.49, and the standard error dropped from 0.0201 to 0.0144, a 28% reduction. Here’s why that matters so much: for a fixed effect size and power, the required sample size scales with the metric’s variance, which is the square of the standard error at a given n. Cutting the SE by 28% therefore means you need only about (0.0144/0.0201)² ≈ 0.51 of the users, roughly half, to reach the same power. Same experiment, same effect, half the traffic, just by using data you already had.
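
To see how the saving depends on the covariate, here is a minimal sketch that turns a correlation into the theoretical variance, standard-error, and sample-size factors. The helper name cuped_savings is hypothetical, and the numbers are the idealized 1 − ρ² values, not results from real data.

```python
import math

def cuped_savings(rho: float) -> dict:
    """Theoretical CUPED savings for a covariate correlated rho with the outcome."""
    variance_factor = 1 - rho**2              # var(Y') / var(Y)
    se_factor = math.sqrt(variance_factor)    # SE scales with the square root of variance
    return {
        "variance_reduction": rho**2,
        "se_reduction": 1 - se_factor,
        "sample_size_factor": variance_factor,  # required n scales with variance
    }

for rho in (0.3, 0.5, 0.7, 0.9):
    s = cuped_savings(rho)
    print(f"rho={rho:.1f}  variance -{s['variance_reduction']:.0%}  "
          f"SE -{s['se_reduction']:.1%}  sample size x{s['sample_size_factor']:.2f}")
```

The sample-size factor equals the variance factor, which is why a 49% variance cut and "half the users" are the same statement.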

CUPED is free power — with one honest caveat

CUPED is close to a free lunch: it’s unbiased (the effect estimate is unchanged in expectation), it only needs a pre-experiment covariate you almost certainly already log, and the variance reduction is real, not a trick. The one caveat is the word pre-experiment: the covariate must be measured before assignment, so the treatment can’t have influenced it. Using a covariate contaminated by the treatment (say, activity during the experiment) reintroduces exactly the kind of bias the whole course warns against, because the adjustment then subtracts part of the real effect along with the noise. Pick something from before the user entered the test, such as their history, and the win is genuine. The best covariate is usually the same metric measured in a pre-period, because a user’s past value of a metric tends to correlate strongly with their current value.
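
The caveat can be illustrated with a small simulation. This is a sketch under assumed parameters (50,000 users per arm, a true effect of 0.20, and a made-up "in-experiment activity" covariate that tracks the outcome); the helper names cuped_effect and make_arm are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 50_000, 0.20          # users per arm and true effect (illustrative values)

def cuped_effect(xc, yc, xt, yt):
    """Difference in CUPED-adjusted means, with theta estimated on pooled data."""
    x, y = np.concatenate([xc, xt]), np.concatenate([yc, yt])
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return (yt - theta * (xt - x.mean())).mean() - (yc - theta * (xc - x.mean())).mean()

def make_arm(effect):
    pre = rng.normal(0, 1, n)                                  # measured before assignment
    y = effect + 0.7 * pre + rng.normal(0, np.sqrt(0.51), n)   # outcome
    during = y + rng.normal(0, 0.5, n)                         # in-experiment activity: shifts with the treatment
    return pre, during, y

pre_c, dur_c, y_c = make_arm(0.0)
pre_t, dur_t, y_t = make_arm(delta)

raw = y_t.mean() - y_c.mean()
clean = cuped_effect(pre_c, y_c, pre_t, y_t)          # valid: pre-experiment covariate
contaminated = cuped_effect(dur_c, y_c, dur_t, y_t)   # invalid: covariate affected by treatment

print(f"raw effect:              {raw:.3f}")
print(f"CUPED, pre-period X:     {clean:.3f}")
print(f"CUPED, in-experiment X:  {contaminated:.3f}")
```

Under these assumptions the pre-period adjustment should land near the true 0.20, while the in-experiment covariate pulls the estimate well below it: the treatment shifted that covariate too, so θ times that shift is subtracted from the effect.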


Practice Exercises

Exercise 1: How good must the covariate be?

Your teammate wants to use a covariate correlated only 0.3 with the outcome. Roughly what variance reduction would CUPED give, and is it worth it?

Hint

Variance reduction ≈ ρ² = 0.3² = 0.09, so about 9%, which is modest. The standard error would fall to √0.91 ≈ 0.954 of its original value, a drop of only ~4.6%, and the required sample would shrink by about 9%. It’s not nothing, and CUPED is cheap to apply, so it may still be worth it, but the big wins come from covariates correlated 0.6+ (ρ² ≥ 0.36). The usual high-correlation choice is the same metric measured in a pre-period.

Exercise 2: Why doesn’t CUPED change the effect?

Subtracting θ(X − X̄) from every user’s outcome changes their individual numbers. Why doesn’t it change the estimated treatment effect (the difference in group means)?

Hint

Because X is a pre-experiment covariate, randomization makes its distribution the same in both groups on average, so the adjustment θ(X − X̄) has the same average in each arm and cancels out of the difference in means. It removes noise symmetrically from both groups without shifting the gap between them. (In finite samples it even helpfully corrects small chance imbalances in X.) What it can’t do is subtract something the treatment affected — that would remove real effect, not just noise.
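
The cancellation can be checked numerically. In this sketch (simulated data with assumed parameters, not taken from the lesson’s run), the CUPED estimate equals the raw difference in means minus θ times the chance imbalance in X between arms, and that imbalance is close to zero under randomization.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000                                                  # users per arm (illustrative)

x_c, x_t = rng.normal(0, 1, n), rng.normal(0, 1, n)        # pre-experiment covariate
y_c = 0.7 * x_c + rng.normal(0, 0.7, n)                    # control outcome
y_t = 0.2 + 0.7 * x_t + rng.normal(0, 0.7, n)              # treatment outcome, true effect 0.2

x, y = np.concatenate([x_c, x_t]), np.concatenate([y_c, y_t])
theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

raw_effect = y_t.mean() - y_c.mean()
cuped_effect = (y_t - theta * (x_t - x.mean())).mean() - (y_c - theta * (x_c - x.mean())).mean()
imbalance = x_t.mean() - x_c.mean()                        # chance difference in X between arms

# The X-bar terms cancel, leaving: cuped = raw - theta * imbalance.
print(f"raw effect:    {raw_effect:.4f}")
print(f"CUPED effect:  {cuped_effect:.4f}")
print(f"raw - theta * imbalance: {raw_effect - theta * imbalance:.4f}")
print(f"imbalance in X: {imbalance:.4f}")
```

The last two effect lines match exactly, and the only difference from the raw estimate is the small correction for whatever imbalance in X randomization happened to produce.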

Exercise 3: From variance to sample size

CUPED cut your standard error by 28%. Your original test needed 8,000 users per arm. Roughly how many does the CUPED-adjusted test need for the same power?

Hint

Sample size scales with the square of the standard error, and the SE fell to 0.72 of its original value (a 28% cut). So you need about 0.72² ≈ 0.52 of the users: roughly 4,100 per arm instead of 8,000. The variance reduction (48%) maps directly onto the sample-size saving (~48%), which is why CUPED is so valuable when traffic is scarce or tests are slow.


Summary

CUPED cuts the noise in a metric by subtracting the part you could have predicted from a pre-experiment covariate X: the adjusted metric is Y' = Y − θ(X − X̄) with θ = cov(X, Y)/var(X). Because X is measured before assignment, the adjustment leaves the treatment effect unchanged while removing a fraction of the variance equal to about ρ², the squared correlation between X and Y. On simulated data with ρ = 0.7, variance fell 48% and the standard error dropped from 0.0201 to 0.0144. Since sample size scales with the SE squared, that roughly halves the users needed for the same power. It’s one of the closest things to free statistical power in experimentation, with the single requirement that the covariate come from before the experiment.

Key Concepts

  • CUPED adjustment — Y' = Y − θ(X − X̄), θ = cov(X,Y)/var(X), using a pre-experiment covariate.
  • Variance reduction ≈ ρ² — a covariate correlated 0.7 removes ~49% of the variance.
  • Effect unchanged — the adjustment cancels from the difference in means, so it’s unbiased.
  • Half the sample — sample size scales with SE², so a 28% SE cut roughly halves the traffic needed.

Why This Matters

Traffic is the scarcest resource in experimentation: it caps how many tests you can run and how fast you learn. With a pre-period covariate correlated around 0.7 with the outcome, CUPED roughly doubles your experimentation throughput: the same infrastructure runs about twice as many tests, or each test finishes in about half the time, at very little cost. It is widely used on large-scale experimentation platforms for exactly this reason. And it reframes a deep idea from earlier in the course: reducing noise is just as powerful as increasing signal, and sometimes far cheaper. Next, you’ll tackle the other big frustration, the temptation to stop early, with sequential testing done correctly.


Next Steps

Continue to Lesson 2 - Sequential Testing

The correct way to peek at a running test and stop early — with the error control naive peeking destroys.


Continue Building Your Skills

You’ve seen how CUPED turns a pre-experiment covariate into free statistical power — a 48% variance reduction and roughly half the required sample size, just by subtracting predictable noise. That’s one way to make experiments cheaper. Next you’ll make them faster to conclude: sequential testing lets you look at a running experiment and stop the moment there’s enough evidence, with the valid error control that the naive peeking of Module 6 threw away.