Lesson 4 - Multi-Armed Bandits
Welcome to Multi-Armed Bandits
A classic A/B test makes a quiet, expensive promise: it sends half your traffic to the loser for the entire duration of the test, even long after it’s fairly clear which arm is better. Every one of those exposures to the worse variant is a conversion or a dollar you didn’t have to give up. A multi-armed bandit breaks that promise on purpose. Instead of holding a fixed split until the end, it reallocates traffic toward the better-performing arms as it learns, so it earns while it experiments. The cost is subtle: you trade a clean, unbiased effect estimate for higher total reward, and this lesson is about making that trade with your eyes open. You’ll build Thompson sampling, one of the most widely used bandit methods, and measure how much it saves in a simulated three-arm test.
By the end of this lesson, you will be able to:
- Explain the explore-exploit tradeoff and why a fixed A/B test sits at one extreme
- Build a Thompson sampling bandit from Beta posteriors
- Measure regret and compare a bandit against a fixed even split
- Decide when a bandit is the right tool and when a classic A/B test is
Let’s start with the tension at the heart of it.
Explore, or Exploit?
Every round of an experiment forces a choice. You can explore — try each arm enough times to actually know its rate — or you can exploit — send this user to the arm that currently looks best. You need both. Pure exploration is a fixed A/B test: it learns the rates beautifully but earns poorly, because it keeps routing half its traffic to a loser it has already half-diagnosed. Pure exploitation is the opposite failure: it locks onto whichever arm jumped ahead early, which might just be noise, and never gathers the evidence to correct itself. A good bandit balances the two — exploring aggressively when it’s uncertain, and concentrating traffic on the leader as the evidence hardens.
Thompson sampling solves this with a surprisingly elegant trick, and it reuses the Beta posteriors from Lesson 3. Keep one Beta posterior per arm, representing your current belief about that arm’s conversion rate. Each round: sample a rate from each arm’s posterior, play the arm whose sample came out highest, observe the reward, and update that arm’s posterior. Early on the posteriors are wide, so the random samples jump around and different arms win — that is the exploration. As evidence accumulates, the best arm’s posterior tightens around a high value and it wins the sampling most of the time, so traffic concentrates on it automatically. Exploration fades exactly as fast as the winner becomes clear, with no schedule to tune.
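To make "exploration fades as the winner becomes clear" concrete, here is a minimal sketch that estimates how often each arm would win the sampling step under two sets of posteriors. The helper name `win_probabilities` and the early and late counts are made up for illustration; they are not taken from the lesson's simulation.

```python
import numpy as np

def win_probabilities(successes, failures, n_draws=100_000, seed=0):
    """Estimate how often each arm's posterior sample is the highest."""
    rng = np.random.default_rng(seed)
    a = 1 + np.asarray(successes, dtype=float)   # Beta(1, 1) prior plus observed successes
    b = 1 + np.asarray(failures, dtype=float)    # Beta(1, 1) prior plus observed failures
    theta = rng.beta(a, b, size=(n_draws, a.size))   # one sampled rate per arm per draw
    winners = theta.argmax(axis=1)                   # arm that would be played on each draw
    return np.bincount(winners, minlength=a.size) / n_draws

# Hypothetical counts: 10 pulls per arm early on, 10,000 pulls per arm later.
early = win_probabilities([1, 2, 1], [9, 8, 9])
late = win_probabilities([1000, 1200, 1100], [9000, 8800, 8900])
print("early:", early.round(3))
print("late: ", late.round(3))
```

With only a handful of pulls, every arm wins the draw a meaningful share of the time. With thousands, the middle arm wins almost every draw, which is the concentration of traffic described above.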
Measuring the Win: Regret
To compare strategies fairly, we need one number, and the right one is regret: the conversions you gave up versus an oracle that always played the best arm. Regret is the bandit’s core metric — it rolls exploration cost and exploitation reward into a single quantity to minimize. A fixed even split across three arms has predictable regret, because on average it wastes a fixed fraction of traffic on the two losers; the bandit’s regret is whatever it fails to claw back.
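In symbols, regret can be written as follows. The notation is an assumption introduced here, not the lesson's own: $\mu_a$ is the true rate of arm $a$, $\mu^{*}$ is the largest of those rates, $r_t$ is the 0/1 reward observed in round $t$, and $N_a(T)$ is the number of times arm $a$ was played in $T$ rounds.

```latex
\[
\text{Regret}_T \;=\; T\,\mu^{*} \;-\; \sum_{t=1}^{T} r_t ,
\qquad
\mathbb{E}\!\left[\text{Regret}_T\right]
\;=\; \sum_{a} \left(\mu^{*} - \mu_a\right)\,\mathbb{E}\!\left[N_a(T)\right]
\]
\[
\text{Even split over } K \text{ arms: } N_a(T) = T/K
\;\Longrightarrow\;
\mathbb{E}\!\left[\text{Regret}_T\right]
\;=\; T\left(\mu^{*} - \frac{1}{K}\sum_{a}\mu_a\right)
\]
```

The first expression is what a single run reports. The expected form shows that regret comes only from pulls of suboptimal arms, each weighted by its gap to the best arm.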
Let’s build a Thompson sampler and measure it. Three arms have true rates 0.10, 0.12, 0.11 — arm B is the best — and we run 30,000 rounds:
import numpy as np

rng = np.random.default_rng(13)
true = np.array([0.10, 0.12, 0.11])    # true conversion rates; arm B (0.12) is best
T = 30000                              # number of rounds
a = np.ones(3); b = np.ones(3)         # Beta(1,1) prior per arm
pulls = np.zeros(3, int); reward = 0

for _ in range(T):
    theta = rng.beta(a, b)             # sample a rate per arm
    arm = int(np.argmax(theta))        # play the best sample
    r = int(rng.random() < true[arm])  # observe a 0/1 reward
    a[arm] += r; b[arm] += 1 - r       # update posterior
    pulls[arm] += 1; reward += r

regret = true.max()*T - reward         # conversions lost vs. always playing the best arm
print("pulls:", pulls.tolist(), " regret:", int(regret))
Running it:
pulls: [528, 24755, 4717]  regret: 124
The bandit sent 24,755 of 30,000 rounds (about 83%) to the true best arm, gave the clear loser (arm A) only 528 pulls, and spent more (4,717) distinguishing the close runner-up (arm C, at 0.11) from the winner. Its total regret in this run was 124 lost conversions. Now compare a fixed even 1/3 split, which routes 10,000 rounds to each arm regardless of what it learns. Its expected regret is (best − mean(true))·T = (0.12 − 0.11)·30000 = 300, roughly 2.4x the bandit’s. The bandit’s advantage comes entirely from not paying to explore the losers once it’s confident. This is not luck: for Bernoulli rewards, Thompson sampling’s expected regret is known to grow only logarithmically with the number of rounds, while a fixed split’s grows linearly.
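The even-split figure is an expected value, while the bandit's 124 is one realized run, so it can help to put both on the same footing. The sketch below computes the expected regret implied by a set of pull counts. The function name `expected_regret` is an assumption for illustration; the pull counts are the ones reported above.

```python
import numpy as np

def expected_regret(true_rates, pulls):
    """Expected conversions lost: each pull costs its arm's gap to the best arm."""
    true_rates = np.asarray(true_rates, dtype=float)
    pulls = np.asarray(pulls, dtype=float)
    gaps = true_rates.max() - true_rates    # zero for the best arm
    return float(gaps @ pulls)

true_rates = [0.10, 0.12, 0.11]
bandit = expected_regret(true_rates, [528, 24755, 4717])      # pulls from the run above
even = expected_regret(true_rates, [10000, 10000, 10000])     # fixed 1/3 split
print(f"bandit: {bandit:.1f}  even split: {even:.1f}")
```

By this measure the bandit's allocation costs about 58 conversions in expectation, against about 300 for the even split. The realized 124 is higher because it also includes the run-to-run noise of the conversions themselves, which at these rates is on the order of 50 conversions over 30,000 rounds.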
Bandit or A/B test? Optimize regret, or optimize learning
A bandit and an A/B test optimize different things, so pick by what you need. Bandits minimize regret — they shine for short-lived decisions where earning while learning matters: headline and creative selection, promo choice, or picking among many variants where a clean effect estimate isn’t the point. Classic fixed A/B tests maximize learning — they’re better when you need a rigorous, unbiased read on the effect size for a lasting decision, when novelty or delayed effects are in play (a bandit can chase a short-term winner that fades), or when long-term downstream metrics matter more than the immediate reward the bandit optimizes. One more honest caveat: adaptive allocation breaks the fixed-sample assumptions behind classical significance tests, so a naive bandit complicates any p-value you try to read off it. Use a bandit to earn, an A/B test to know.
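The loss of a clean estimate can be seen in a small simulation. The sketch below runs many short Thompson sampling experiments on two arms with the same true rate, then averages each arm's naive observed rate (successes divided by pulls). Everything here is an illustrative assumption: the function name `naive_estimates`, the rate of 0.5, and the horizon of 60 rounds are not from the lesson.

```python
import numpy as np

def naive_estimates(true_rates, rounds, reps, seed=0):
    """Average per-arm observed rate across many independent Thompson sampling runs."""
    rng = np.random.default_rng(seed)
    true_rates = np.asarray(true_rates, dtype=float)
    k = true_rates.size
    a = np.ones((reps, k))             # Beta(1, 1) prior, one row per replication
    b = np.ones((reps, k))
    rows = np.arange(reps)
    for _ in range(rounds):
        theta = rng.beta(a, b)                         # sample a rate per arm, per run
        arm = theta.argmax(axis=1)                     # arm played in each run
        r = (rng.random(reps) < true_rates[arm]).astype(float)
        a[rows, arm] += r                              # update only the played arm
        b[rows, arm] += 1.0 - r
    successes = a - 1.0
    pulls = a + b - 2.0
    with np.errstate(invalid="ignore", divide="ignore"):
        means = successes / pulls                      # NaN if an arm was never pulled
    return np.nanmean(means, axis=0)

estimates = naive_estimates([0.5, 0.5], rounds=60, reps=4000)
print("average observed rate per arm:", estimates.round(3))
```

Under a fixed split these averages would center on 0.5. Under adaptive allocation they tend to come out lower, because an arm that starts unluckily gets few further pulls and its low estimate is never corrected, while an arm that starts luckily gets more pulls and drifts back toward its true rate.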
Practice Exercises
Exercise 1: Why does the runner-up get more traffic than the clear loser?
Arm A (0.10) got 528 pulls but arm C (0.11) got 4,717 — nearly nine times as many — even though both are losers. Why does the bandit spend so much more traffic on C than on A?
Hint
Because C is closer to the winner, so it’s harder to rule out. Arm A’s rate (0.10) is far below B’s (0.12), so its posterior separates quickly and it rarely produces the highest sample — the bandit stops exploring it fast. Arm C (0.11) overlaps B far more, so its posterior samples keep occasionally beating B’s, and the bandit has to keep pulling it to stay confident. This is a feature, not a bug: the bandit spends its exploration budget exactly where the decision is genuinely uncertain, not on arms already clearly beaten.
Exercise 2: What would pure exploitation have done?
Suppose instead of sampling, you always played the arm with the highest observed rate so far (greedy). Why is that risky, and how does Thompson sampling avoid the trap?
Hint
A greedy strategy can lock onto a false leader. If arm A happens to convert on its first couple of pulls by chance, its observed rate is high, greedy keeps playing it, and it may never gather enough data on B to discover B is actually better — it exploits an early fluke forever. Thompson sampling avoids this because a wide early posterior keeps producing high samples for the under-explored arms too, forcing the bandit to keep testing B until the evidence, not one lucky start, decides the winner. The randomness is the safeguard.
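A rough simulation shows how often the lock-in happens. This is a sketch under stated assumptions: the greedy rule below pulls each arm once and then always plays the highest observed rate, breaking exact ties at random, and the name `greedy_lock_in` and the 2,000-round horizon are hypothetical choices rather than anything specified in the exercise.

```python
import numpy as np

def greedy_lock_in(true_rates, rounds, reps, seed=0):
    """Fraction of greedy runs whose most-pulled arm is not the true best arm."""
    rng = np.random.default_rng(seed)
    true_rates = np.asarray(true_rates, dtype=float)
    k = true_rates.size
    rows = np.arange(reps)
    wins = np.zeros((reps, k))
    pulls = np.zeros((reps, k))
    for t in range(rounds):
        if t < k:
            arm = np.full(reps, t)                     # warm-up: pull each arm once
        else:
            rates = wins / pulls                       # observed rate so far
            jitter = 1e-9 * rng.random(rates.shape)    # breaks exact ties at random
            arm = (rates + jitter).argmax(axis=1)      # always exploit the leader
        r = rng.random(reps) < true_rates[arm]
        wins[rows, arm] += r
        pulls[rows, arm] += 1
    return float(np.mean(pulls.argmax(axis=1) != true_rates.argmax()))

wrong = greedy_lock_in([0.10, 0.12, 0.11], rounds=2000, reps=2000)
print(f"greedy runs stuck on a non-best arm: {wrong:.0%}")
```

Under this rule, the first arm to record a conversion usually keeps a positive observed rate while the others sit at zero, so greedy never returns to them. With rates this close, that first conversion comes from a non-best arm in a majority of runs.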
Exercise 3: Would you use a bandit here?
Your team is choosing between two new checkout flows. The winner will ship to all users for the next two years, and leadership wants a defensible estimate of exactly how much it lifts revenue. Bandit or A/B test?
Hint
Use a classic A/B test. This is a lasting decision that needs a rigorous, unbiased effect estimate — how much revenue the winner lifts — which is precisely what a bandit trades away for reward. There’s no earning-while-learning benefit worth chasing over a two-week test when the flow ships for two years, and adaptive allocation would muddy the significance read leadership wants. Bandits are for short-lived, many-variant, “just pick the winner and move on” decisions; a foundational choice with a long shelf life and a scrutinized effect size is squarely A/B-test territory.
Summary
A multi-armed bandit replaces a fixed even split with adaptive allocation: it routes traffic toward better-performing arms as it learns, so it earns while it experiments instead of sending half its traffic to the loser for the whole test. The engine is the explore-exploit tradeoff, and Thompson sampling balances it elegantly: keep a Beta posterior per arm, and each round sample a rate from each, play the highest sample, and update. On three arms (rates 0.10, 0.12, 0.11), the bandit sent about 83% of 30,000 rounds to the true winner and finished with regret 124, against an expected 300 for a fixed 1/3 split, which loses about 2.4x as much reward. The catch is what you give up: bandits optimize regret, A/B tests optimize learning, and adaptive allocation complicates the clean significance testing a lasting decision needs.
Key Concepts
- Explore-exploit tradeoff: try each arm enough to know its rate, while sending traffic to the current best; a fixed A/B test is pure exploration.
- Thompson sampling: sample a rate from each arm’s Beta posterior, play the highest, observe, update; exploration fades as the winner sharpens.
- Regret: conversions given up versus always playing the best arm; the bandit’s core metric to minimize (124 vs an expected 300 here).
- Bandit vs A/B: bandits win for short-lived, many-variant, earn-while-learning decisions; A/B tests win when you need an unbiased effect size for a lasting choice.
Why This Matters
Most experimentation is framed as “run the test, read the result, then act” — but for a huge class of decisions, the acting can’t wait for the reading, because every day on the wrong variant costs real money. Bandits collapse that gap: they act on partial evidence, continuously, and pay only for the exploration the decision genuinely requires. Knowing when to reach for one — and, just as importantly, when a classic A/B test’s clean, unbiased read is worth the wasted traffic — is what separates a mechanical experimenter from someone who matches the method to the stakes. Next, you’ll put the whole module together in a guided project, designing a smarter Lumen experiment that uses these tools where they actually fit.
Next Steps
Continue to Lesson 5 - Guided Project: A Smarter Lumen Experiment
Bring variance reduction, sequential testing, and bandits together in one end-to-end experiment design.
Continue Building Your Skills
You’ve seen how a multi-armed bandit turns learning into earning: it routed about 83% of traffic to the true winner and cut regret from an expected 300 to 124 by refusing to keep paying for losers it had already diagnosed. Just as important, you’ve seen its limit: a bandit optimizes reward, not the clean unbiased effect estimate a lasting decision needs. Next, in the module’s guided project, you’ll design a smarter Lumen experiment from scratch, deciding where CUPED buys you power, where sequential testing lets you stop early, and where a bandit earns while it learns.