Lesson 5 - Guided Project: A Smarter Lumen Experiment

Welcome to the Guided Project

Across this module you collected four advanced tools, each one an answer to a specific frustration you’ve felt running Lumen’s experiments. Too many users needed? CUPED buys back power without extra traffic. Tired of waiting for a fixed sample? Sequential testing lets you stop early, and correctly. Stakeholders glaze over at “p < 0.05”? A Bayesian readout speaks in probabilities they actually understand. Too many variants, and every day of even-split traffic costs money? A bandit earns while it learns. In this guided project you’ll point three of these tools at Lumen and, just as importantly, decide which tool belongs in which situation. The goal isn’t to use everything at once; it’s to reach for the right instrument.

By the end of this project, you will be able to:

  • Apply CUPED with a pre-experiment covariate to shrink a Lumen test to roughly half the sample
  • Report a conversion result as a Bayesian probability and credible interval a stakeholder can read
  • Route traffic across three headline variants with Thompson sampling and measure the regret it saves
  • Choose the right advanced tool for a given situation — and know when not to reach for one

We’ll build it in stages, reusing the exact pieces from Lessons 1, 3, and 4. Let’s make Lumen’s experiments smarter.


Stage 1: The Setup — Four Frustrations, Four Tools

Lumen’s growth team wants to move faster: more experiments per quarter, and the ability to test several ideas at once instead of one lonely A/B at a time. But every ambition runs into a wall you’ve met before. Here is the map for this project — each frustration and the tool that answers it:

  • “We never have enough users.” A test that needs 16,000 users takes three weeks, and Lumen doesn’t want to wait that long. → CUPED (Lesson 1): use pre-experiment data to cut the noise and roughly halve the sample.
  • “We want to stop the moment we know.” Waiting for a fixed sample when the winner is already obvious feels wasteful — but peeking naively inflates false positives. → Sequential testing (Lesson 2): a calibrated boundary that lets you look repeatedly and stop early, safely.
  • “Nobody understands the readout.” “We failed to reject the null at α = 0.05” wins no arguments in a launch meeting. → Bayesian analysis (Lesson 3): report P(treatment is better) and an expected uplift instead.
  • “We have five ideas and one homepage.” Splitting traffic evenly across five headlines means most users see a loser the whole time. → Multi-armed bandit (Lesson 4): send more traffic to whatever is winning, so you earn while you learn.

This project works three of these on Lumen’s data (CUPED, Bayesian, and the bandit), then closes with a decision guide covering all four. Every number below comes from running the numpy code shown alongside it.
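
The peeking problem named in the second frustration can be seen in a small A/A simulation. This is a minimal sketch under assumptions that are not taken from Lumen’s data: unit-variance outcomes, 1,000 users per arm, ten evenly spaced looks, and an arbitrary seed. It compares the false-positive rate of testing once at the end with the rate you get by stopping at the first look where |z| > 1.96.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed for this sketch
sims, n, looks = 2000, 1000, 10           # hypothetical A/A setup: no true effect
checkpoints = np.arange(1, looks + 1) * (n // looks)   # users per arm at each look

# Running sums give the difference in means at every checkpoint in one pass.
control = rng.normal(0, 1, (sims, n)).cumsum(axis=1)
treatment = rng.normal(0, 1, (sims, n)).cumsum(axis=1)
mean_diff = (treatment[:, checkpoints - 1] - control[:, checkpoints - 1]) / checkpoints
z = mean_diff / np.sqrt(2 / checkpoints)  # per-arm variance is 1 by construction

fp_final = np.mean(np.abs(z[:, -1]) > 1.96)          # one test at the planned sample
fp_peek = np.mean((np.abs(z) > 1.96).any(axis=1))    # stop at the first "significant" look
print(f"false positives  single look {fp_final:.3f}   ten peeks {fp_peek:.3f}")
```

The single-look rate should sit near the nominal 5%, while the ten-peek rate comes out several times higher. That gap is what a calibrated sequential boundary is designed to remove.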


Stage 2: CUPED to Shrink the Test

Lumen’s next experiment measures a continuous engagement score, and the team is bracing for a big sample. But Lumen has been logging that same score for months — so each user arrives with a history, and history is predictable. That’s exactly what CUPED from Lesson 1 exploits: take a pre-experiment covariate X correlated with the outcome Y, and subtract off the part of Y you could have predicted from it. Here Lumen’s pre-period engagement correlates 0.7 with in-experiment engagement — a strong covariate.

import numpy as np

rng = np.random.default_rng(5)
n, rho, delta = 5000, 0.7, 0.20          # correlation 0.7, true effect 0.20

def make_arm(shift):
    x = rng.normal(0, 1, n)                          # pre-experiment engagement
    eps = rng.normal(0, np.sqrt(1 - rho**2), n)      # the unpredictable part
    return x, shift + rho * x + eps                  # Y = shift + rho*X + noise

xc, yc = make_arm(0.0)
xt, yt = make_arm(delta)                             # treatment adds the effect
x, y = np.concatenate([xc, xt]), np.concatenate([yc, yt])

theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)     # CUPED coefficient
yc_adj = yc - theta * (xc - x.mean())
yt_adj = yt - theta * (xt - x.mean())

var_reduction = 1 - np.var(np.r_[yc_adj, yt_adj], ddof=1) / np.var(y, ddof=1)
se_raw   = np.sqrt(yc.var(ddof=1)/n + yt.var(ddof=1)/n)
se_cuped = np.sqrt(yc_adj.var(ddof=1)/n + yt_adj.var(ddof=1)/n)
print(f"variance reduction: {var_reduction:.3f}")
print(f"std error  raw {se_raw:.4f}  ->  cuped {se_cuped:.4f}")

Running it:

variance reduction: 0.480
std error  raw 0.0201  ->  cuped 0.0144

CUPED removed 48% of the variance, right on the predicted ρ² = 0.49, and dropped the standard error from 0.0201 to 0.0144. Required sample size is proportional to the outcome variance, which at a fixed n is the square of the standard error, so the factor (0.0144/0.0201)² ≈ 0.51 means Lumen needs roughly half the users to reach the same power. The three-week test becomes a test of about eleven days, and the effect estimate stays unbiased because the covariate was measured before assignment. All of this comes from data Lumen was already logging. This is the tool you reach for almost every time, and it costs no extra traffic.
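
For readers who want the reasoning behind the ρ² figure, the derivation can be sketched as follows. The subscripts on n are notation introduced here for illustration, and the sample mean of X is treated as a constant.

```latex
\begin{aligned}
Y^{\text{cuped}} &= Y - \theta\,(X - \bar{X}),
  \qquad \theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} \\
\operatorname{Var}\!\left(Y^{\text{cuped}}\right)
  &= \operatorname{Var}(Y) - 2\theta\,\operatorname{Cov}(X, Y) + \theta^{2}\operatorname{Var}(X)
   = \operatorname{Var}(Y)\,\bigl(1 - \rho^{2}\bigr) \\
\frac{n_{\text{cuped}}}{n_{\text{raw}}}
  &= \frac{\operatorname{Var}\!\left(Y^{\text{cuped}}\right)}{\operatorname{Var}(Y)}
   = 1 - \rho^{2} = 1 - 0.7^{2} = 0.51
\end{aligned}
```

The last line uses the fact that, for a fixed effect size and power, the required sample size is proportional to the outcome variance.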


Stage 3: A Bayesian Readout of the Conversion Result

Lumen’s homepage conversion test from earlier modules is done: 503 / 5000 conversions on control, 613 / 5000 on treatment. Module 4 already gave you the frequentist verdict — a confidence interval that excludes zero. But in the launch meeting, “the 95% CI excludes zero” lands flat. The Bayesian readout from Lesson 3 says the same thing in language a stakeholder actually wants: how likely is it that treatment is better, and by how much? We put a uniform Beta(1,1) prior on each rate, form the Beta-Binomial posteriors, and sample the difference:

import numpy as np

rng = np.random.default_rng(11)
n, conv_c, conv_t = 5000, 503, 613
draws = 400_000

pc = rng.beta(1 + conv_c, 1 + (n - conv_c), draws)   # posterior for control
pt = rng.beta(1 + conv_t, 1 + (n - conv_t), draws)   # posterior for treatment

diff = pt - pc
p_better = np.mean(diff > 0)
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"P(treatment > control) = {p_better:.4f}")
print(f"posterior mean uplift  = {diff.mean():.4f}")
print(f"95% credible interval  = [{lo:.4f}, {hi:.4f}]")

Running it:

P(treatment > control) = 0.9998
posterior mean uplift  = 0.0220
95% credible interval  = [0.0096, 0.0344]

Now the readout writes itself: there is a 99.98% probability that treatment beats control, with an expected uplift of +2.2 percentage points, and we’re 95% sure the true gain lies between +0.96 and +3.44 points. Notice that credible interval [0.0096, 0.0344] is essentially the same interval the frequentist test gave in Module 4 — the two methods agree on the numbers, they just tell the story differently. But “99.98% likely to be better, worth about 2.2 points” is a sentence anyone in the room can act on, and it plugs straight into an expected-loss decision. Same evidence, far clearer decision.
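
One way the expected-loss decision mentioned above could look is sketched here. It reuses the Stage 3 posteriors and defines the loss of a choice as the conversion rate given up, in expectation, if that choice turns out to be the worse arm. The names `loss_ship` and `loss_keep` are illustrative, and any decision threshold is something a team would have to set for itself.

```python
import numpy as np

rng = np.random.default_rng(11)
n, conv_c, conv_t = 5000, 503, 613
draws = 400_000

pc = rng.beta(1 + conv_c, 1 + (n - conv_c), draws)   # posterior for control
pt = rng.beta(1 + conv_t, 1 + (n - conv_t), draws)   # posterior for treatment

# Expected loss: average shortfall in conversion rate when the chosen arm is worse.
loss_ship = np.mean(np.maximum(pc - pt, 0))   # cost of shipping treatment
loss_keep = np.mean(np.maximum(pt - pc, 0))   # cost of keeping control
print(f"expected loss  ship treatment {loss_ship:.6f}   keep control {loss_keep:.6f}")
```

Shipping treatment risks almost nothing in expectation, while keeping control gives up roughly the full posterior mean uplift. The rule is then to ship when the expected loss of shipping falls below a tolerance chosen in advance.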


Stage 4: A Bandit for the Multi-Variant Rollout

Now the ambitious experiment. Lumen wants to test three homepage headlines at once, and here’s the twist: this is a short-lived campaign, so every user who sees a losing headline is money left on the table. A fixed even split would send a third of traffic to each variant for the whole run — including the two that lose. The multi-armed bandit from Lesson 4 does better: with Thompson sampling, it keeps a posterior over each headline’s conversion rate, samples from those posteriors each round, and serves whichever variant looks best that round. As evidence accumulates, traffic flows toward the winner on its own. The true rates are 0.10 / 0.12 / 0.11 — headline B (index 1) is best:

import numpy as np

rng = np.random.default_rng(13)
true = np.array([0.10, 0.12, 0.11])       # B (index 1) is the best headline
T = 30000
a = np.ones(3); b = np.ones(3)            # Beta(1,1) prior per headline
pulls = np.zeros(3, dtype=int); reward = 0

for _ in range(T):
    theta = rng.beta(a, b)                # sample a rate for each headline
    arm = int(np.argmax(theta))           # serve whichever looks best now
    r = int(rng.random() < true[arm])     # 1 if that user converted, else 0
    a[arm] += r; b[arm] += 1 - r          # update that headline's posterior
    pulls[arm] += 1; reward += r

best = true.max()
regret_ts = best * T - reward                     # reward lost vs always-best
regret_uniform = (best - true.mean()) * T         # even split's regret
print(f"pulls per headline: {pulls.tolist()}")
print(f"share to the best:  {pulls[1]/T:.0%}")
print(f"regret  Thompson {regret_ts:.0f}   even split {regret_uniform:.0f}")

Running it:

pulls per headline: [528, 24755, 4717]
share to the best:  83%
regret  Thompson 124   even split 299

The bandit sent 83% of traffic (24,755 of 30,000 pulls) to the winning headline and starved the worst one down to just 528 impressions, all without anyone declaring a winner by hand. Its total regret in this run was 124 lost conversions, versus 299 for a fixed even split: the bandit gave up less than half the reward the naive rollout would have. Over a short campaign with several variants, that difference is real revenue. Keep in mind that the Thompson figure is a realized value from one seeded run, so it includes conversion noise, while the even-split figure is an expectation computed from the true rates.
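
To compare the two policies on the same footing, the expected regret of each can be computed from its allocation alone: each pull of a headline costs the gap between the best true rate and that headline’s true rate. The sketch below takes the Stage 4 pull counts as given; `pulls_even` is a hypothetical exact three-way split of the same 30,000 users.

```python
import numpy as np

true = np.array([0.10, 0.12, 0.11])
pulls_ts = np.array([528, 24755, 4717])    # allocation reported in Stage 4
pulls_even = np.full(3, 10_000)            # hypothetical exact even split

gaps = true.max() - true                   # per-pull cost of each headline
expected_regret_ts = float(pulls_ts @ gaps)      # 528*0.02 + 4717*0.01
expected_regret_even = float(pulls_even @ gaps)  # 10000*0.02 + 10000*0.01
print(f"expected regret  Thompson {expected_regret_ts:.2f}   even split {expected_regret_even:.2f}")
```

This gives an expected regret of 57.73 for the Thompson allocation and 300 for the even split, which agrees with the Stage 4 even-split figure to within one conversion. The realized Thompson regret is larger than its expectation here because it also absorbs the random noise in who happened to convert.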

[Figure: horizontal bar chart of Thompson-sampling traffic allocation after 30,000 rounds across three homepage headlines with true conversion rates 0.10, 0.12, and 0.11. Headline A: 528 pulls; headline B (the best): 24,755 pulls, or 83%; headline C: 4,717 pulls. A callout compares total regret: 124 for Thompson sampling versus 299 for a fixed even split.]
Caption: Thompson sampling routed 83% of traffic to the best headline (24,755 of 30,000 pulls) and cut regret to 124 lost conversions versus 299 for an even split, earning while it learns instead of paying full price to explore losers.

Advanced tools on a shaky foundation still lie

None of these tools rescues a broken experiment. CUPED shrinks the noise, but if your randomization has a sample-ratio mismatch (Module 6), CUPED just gives you a tighter, more confident wrong answer. A Bayesian posterior is only as honest as the metric it’s built on — feed it a badly chosen or gameable metric (Module 5) and it will report high confidence in the wrong thing. And a bandit optimizes ruthlessly toward whatever short-term reward you hand it, so a novelty effect or a metric that ignores long-term retention will lead it straight off a cliff. These four tools are additive to the solid A/B foundation from Modules 1–6 — trustworthy randomization, a good metric, honest stopping — never a substitute for it. Get the foundation right first; then these make it faster and smarter.


Stage 5: Choosing the Right Tool

You now have three verified results and a fourth tool from Lesson 2 in your pocket. The skill that matters most isn’t running any one of them — it’s knowing which to reach for. Here’s the decision guide:

  • CUPED: use it almost always. If you have any pre-experiment covariate correlated with your outcome (usually the same metric measured in a pre-period), apply CUPED. It’s unbiased, cheap, and buys power without extra traffic. The variance reduction is roughly ρ², so Lumen’s 48% reflects a covariate correlated 0.7 with the outcome; a weaker covariate buys less. The only requirement is that the covariate come from before assignment. There’s rarely a reason not to use it.
  • Sequential testing: when you need to monitor and stop early. Reach for a calibrated group-sequential boundary when the cost of waiting for a fixed sample is high and you want the option to stop the moment the evidence is decisive, while keeping the test rigorous and error-controlled. Don’t use naive peeking; that’s the Module 6 sin. Use the calibrated boundary that spends your α correctly.
  • Bayesian: for intuitive reporting and expected-loss decisions. When you need stakeholders to understand the result, or you want to make a decision by weighing the expected cost of being wrong, report P(better), the expected uplift, and a credible interval. It agrees with the frequentist math (as Stage 3 showed) but speaks in probabilities people can act on.
  • Bandit: when you’re choosing among many short-lived variants and want to minimize regret. Use it when the goal is to earn during the test: many variants, short campaigns, and more concern for total reward than for a clean readout. Do not use a bandit when you need an unbiased, clean effect estimate (its adaptive allocation biases naive estimates), or when novelty or long-term effects mean early winners aren’t real winners. For those, a plain, patient A/B test is still the right call.
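
Since sequential testing is the one tool this project does not run, here is a minimal sketch of what “calibrated boundary” can mean. It assumes five equally spaced looks and a constant critical value at every look (a Pocock-style boundary), which may differ from the boundary Lesson 2 used. The critical value is found by simulating the z-statistic path under the null and taking the 95th percentile of its largest absolute value.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed for this sketch
sims, looks, alpha = 50_000, 5, 0.05       # hypothetical monitoring plan

# Under the null with equal-sized batches, the z-statistic at look k is the
# running sum of k independent standard-normal increments divided by sqrt(k).
increments = rng.normal(0, 1, (sims, looks))
z = increments.cumsum(axis=1) / np.sqrt(np.arange(1, looks + 1))

max_abs_z = np.abs(z).max(axis=1)
boundary = np.quantile(max_abs_z, 1 - alpha)    # one critical value for every look
naive_rate = np.mean(max_abs_z > 1.96)          # error rate if 1.96 is reused at each look
print(f"calibrated boundary {boundary:.3f}   naive false-positive rate {naive_rate:.3f}")
```

The calibrated value lands well above 1.96, which is the price of being allowed to look five times: each look has to clear a higher bar so that the overall false-positive rate stays at 5%.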

None of these replaces the foundation from Modules 1–6. They sit on top of it: CUPED makes the same test cheaper, sequential makes it faster to conclude, Bayesian makes it clearer to report, and the bandit makes multi-variant rollouts less wasteful. Solid randomization, a trustworthy metric, and honest stopping come first — always.


Practice Exercises

Exercise 1: Pick Lumen’s CUPED covariate

Stage 2 assumed a covariate correlated 0.7 with the outcome. For Lumen’s engagement experiment, which specific pre-experiment metric would you choose as the covariate, and why would it correlate strongly with the outcome?

Hint

The best covariate is almost always the same metric measured in a pre-period — here, each user’s engagement score over the two weeks before they entered the experiment. Behavior is sticky: engaged users stay engaged, so pre-period engagement correlates strongly (often 0.6–0.8) with in-experiment engagement, which is exactly the 0.7 Stage 2 used. The one rule is that it must be measured before assignment, so the treatment can’t have touched it. Avoid anything logged during the experiment — that reintroduces the confounding the whole course warns against.
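
To judge a candidate covariate, it can help to see how quickly the payoff falls as the correlation weakens. The sketch below applies the ρ² rule to a few illustrative correlations and to the 16,000-user test from Stage 1. The function name and the list of correlations are assumptions chosen for this example.

```python
def cuped_sample_fraction(rho: float) -> float:
    """Fraction of the original sample still needed after CUPED (1 - rho^2)."""
    return 1.0 - rho ** 2

baseline_users = 16_000                      # the Stage 1 test size
for rho in (0.3, 0.5, 0.7, 0.9):             # illustrative covariate correlations
    users = baseline_users * cuped_sample_fraction(rho)
    print(f"rho {rho:.1f}: variance cut {rho**2:.0%}, users needed {users:,.0f}")
```

A covariate correlated 0.3 saves under a tenth of the sample, while one correlated 0.7 saves about half, which is why the same metric from a pre-period is usually the first thing to try.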

Exercise 2: When would the bandit make the wrong call?

Lumen’s bandit crowned headline B in Stage 4. Describe a realistic scenario where the same bandit would confidently route traffic to a headline that is actually worse in the long run.

Hint

A novelty effect is the classic trap. Suppose a flashy new headline gets a burst of clicks in its first days simply because it’s different — high early conversion that fades once regulars grow used to it. The bandit only sees short-term reward, so it piles traffic onto the novelty winner while the novelty is still fresh, then keeps exploiting it after the effect has decayed. The same failure hits any metric that ignores long-term value: a headline that boosts sign-ups but attracts users who churn immediately. When early reward doesn’t predict long-run value, a patient fixed A/B with a long-horizon metric beats the bandit.
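
The novelty trap can be simulated with a small change to the Stage 4 loop. Everything numeric here is an assumption made for illustration: a steady headline converting at 0.12, a novelty headline that converts at 0.20 for the first 5,000 users and then drops abruptly to 0.08, and a 15,000-user campaign. The sketch counts where traffic goes after the novelty has worn off.

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed for this sketch
T, decay_at = 15_000, 5_000                 # hypothetical campaign length and novelty window

def true_rates(t):
    # Arm 0: steady headline. Arm 1: novelty headline that fades (assumed step decay).
    return np.array([0.12, 0.20 if t < decay_at else 0.08])

a = np.ones(2); b = np.ones(2)              # Beta(1,1) prior per headline
late_pulls = np.zeros(2, dtype=int)         # traffic served after the novelty fades

for t in range(T):
    arm = int(np.argmax(rng.beta(a, b)))    # Thompson sampling, as in Stage 4
    r = int(rng.random() < true_rates(t)[arm])
    a[arm] += r; b[arm] += 1 - r
    if t >= decay_at:
        late_pulls[arm] += 1

print(f"pulls after the novelty fades: steady {late_pulls[0]}, novelty {late_pulls[1]}")
```

Because the novelty headline’s posterior was built on thousands of inflated early observations, it takes many disappointing pulls to drag it back below the steady headline. A stationary Beta posterior has no notion that old evidence has gone stale.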

Exercise 3: Recompute P(better) with a skeptical prior

Stage 3 used a uniform Beta(1,1) prior and found P(treatment > control) = 0.9998. A skeptical stakeholder says “you assumed nothing; I think most changes do nothing.” How would you encode that skepticism, and would it flip the conclusion?

Hint

Encode skepticism with a stronger prior centered near the control rate. For example, replace Beta(1,1) with something like Beta(100, 900) on each arm, a prior that already “believes” the rate is about 0.10 and takes real data to move. Re-run the same sampling with those prior parameters added to the counts. With 5,000 users per arm and a clear gap of about 2 points, the data still overwhelms a moderately skeptical prior, so P(better) stays very high: the conclusion holds, just slightly less emphatically. That robustness is the answer to the skeptic: show that even under their prior, the evidence wins.
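
The re-run described in the hint could look like the sketch below. It keeps the Stage 3 data and sampling and swaps in the Beta(100, 900) prior suggested above; the variable names `prior_a` and `prior_b` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n, conv_c, conv_t = 5000, 503, 613
draws = 400_000
prior_a, prior_b = 100, 900                 # skeptical prior centered near 0.10

# Same Beta-Binomial update as Stage 3, with the prior counts added in.
pc = rng.beta(prior_a + conv_c, prior_b + (n - conv_c), draws)
pt = rng.beta(prior_a + conv_t, prior_b + (n - conv_t), draws)

diff = pt - pc
p_better = np.mean(diff > 0)
print(f"P(treatment > control) = {p_better:.4f}")
print(f"posterior mean uplift  = {diff.mean():.4f}")
```

The prior pulls both arms toward 0.10, so the estimated uplift shrinks a little below the 2.2 points from Stage 3, but the probability that treatment is better stays well above 99%.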


Summary

You pointed the advanced toolkit at Lumen and let each tool answer the frustration it was built for. CUPED used a pre-experiment covariate correlated 0.7 with the outcome to cut variance by 48%, dropping the standard error from 0.0201 to 0.0144 — roughly half the sample for the same power. A Bayesian readout of Lumen’s 503/5000 vs 613/5000 conversion result reported P(treatment > control) = 0.9998, an expected uplift of +0.0220, and a 95% credible interval of [0.0096, 0.0344] — matching the frequentist CI from Module 4 but in language a stakeholder can act on. A Thompson-sampling bandit over three headlines (true rates 0.10/0.12/0.11) sent 83% of 30,000 pulls ([528, 24755, 4717]) to the winner and cut regret to 124 lost conversions versus 299 for an even split. And the closing decision guide told you which to reach for — and when a plain A/B test is still the right answer.

Key Concepts

  • CUPED for power: a pre-period covariate cut variance by 48% and roughly halved Lumen’s required sample while leaving the effect estimate unbiased.
  • Bayesian for clarity: P(better) = 0.9998 and a credible interval report the same evidence in decision-ready language.
  • Bandit for regret: Thompson sampling routed 83% of traffic to the best headline, with regret of 124 versus 299 for an even split.
  • Right tool, right situation: CUPED almost always, sequential to stop early, Bayesian to report, bandit for many short-lived variants, and none of them a substitute for a sound A/B foundation.

Why This Matters

This is what “advanced experimentation” actually looks like in practice: not a single silver bullet, but a small kit of tools you deploy selectively on top of a trustworthy A/B foundation. CUPED lets Lumen run twice as many tests on the same traffic; the Bayesian readout gets experiments acted on instead of argued over; the bandit turns a wasteful multi-variant rollout into one that earns while it learns. The judgment you practiced here — matching tool to situation, and knowing when not to reach for one — is what separates a team that runs experiments from a team that runs them well. With Modules 1–6 as the foundation and this toolkit on top, you’re ready for the capstone: designing and reading out a complete experiment end to end.


Next Steps

Continue to Module 8 - Capstone

Design, simulate, analyze, and write the decision readout for a complete experiment, end to end.

Back to Module Overview

Return to the Beyond Basic A/B module overview


Continue Building Your Skills

You just made one Lumen experiment faster with CUPED, clearer with a Bayesian readout, and smarter with a bandit — and, most importantly, you learned to choose among them instead of reaching for all of them at once. That judgment is the real payoff of this module. Next comes the capstone, where you’ll bring the entire course together: design an experiment from a business question, size it, simulate it, analyze it with the tools you now trust, and write the decision readout a real team would ship on.