Lesson 3 - Bayesian A/B Testing
Welcome to Bayesian A/B Testing
Back in Module 4 you ran a proper frequentist test on Lumen’s checkout page and got a p-value. It was small, you rejected the null, you shipped. But think about what that p-value actually told you: how surprising this data would be if the treatment had no effect at all. That’s a strange, backwards quantity — a statement about a world where nothing happened. When you walked into the product review, nobody asked “how surprising is this data under the null?” They asked two plain questions: “What’s the probability the new page is actually better?” and “How much better?” The frequentist p-value answers neither directly. Bayesian A/B testing answers both, in the language people already speak.
By the end of this lesson, you will be able to:
- Explain why a p-value doesn’t answer the question stakeholders actually ask
- Build Beta-Binomial posteriors for each arm from a uniform prior
- Compute the probability the treatment beats control, and a credible interval on the lift
- Interpret a credible interval correctly — and see why it matches the frequentist result
Let’s start with the question you actually want answered.
From “How Surprising?” to “How Likely?”
A p-value is the answer to a convoluted question. It fixes a hypothetical world where the treatment does nothing, then measures how extreme your data looks in that world. Useful, rigorous — and almost nobody’s actual question. What a product owner wants is a probability about the thing itself: given the data we collected, how likely is it that the new page’s true conversion rate is higher? That is a question about the unknown rate, conditioned on the data — exactly the quantity frequentist statistics refuses to give you and Bayesian statistics is built to deliver.
The Bayesian move is to treat each arm’s true conversion rate as an unknown quantity we hold beliefs about, and to update those beliefs with data. For conversion rates this is remarkably clean, because the Beta distribution is the conjugate prior for binomial (converted / didn’t-convert) data. Start with a prior belief about an arm’s true rate; observe conversions; get a posterior belief — and the posterior is just another Beta, with the counts folded in.
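If you want to see why the posterior stays a Beta, the algebra takes one line. This is a sketch in notation introduced here, not the lesson's own: θ is an arm's true conversion rate, Beta(a, b) is the prior, and conv conversions are observed out of n visitors.

```latex
\begin{aligned}
\pi(\theta \mid \text{conv}, n)
  &\propto \underbrace{\theta^{\text{conv}} (1-\theta)^{\,n-\text{conv}}}_{\text{binomial likelihood}}
     \;\times\; \underbrace{\theta^{\,a-1} (1-\theta)^{\,b-1}}_{\text{Beta}(a,\,b)\ \text{prior}} \\
  &= \theta^{\,(a+\text{conv})-1} (1-\theta)^{\,(b+n-\text{conv})-1}
  \quad\Longrightarrow\quad
  \theta \mid \text{conv}, n \;\sim\; \text{Beta}\bigl(a+\text{conv},\; b+(n-\text{conv})\bigr)
\end{aligned}
```

The exponents simply add, which is all "folding the counts in" means.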
Concretely: start each arm at Beta(1, 1), which is uniform on [0, 1] — every conversion rate equally plausible, i.e. no prior knowledge. After observing conv conversions out of n visitors, the posterior is:
Beta(1 + conv, 1 + (n − conv))
That’s your updated belief about that arm’s true rate — a full distribution, not a point estimate. To compare two arms, you sample many draws from each posterior and ask how often the treatment’s rate exceeds the control’s. No null world, no test statistic — just two distributions and a direct comparison.
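The update itself is small enough to write as a helper. Here is a minimal sketch, assuming nothing beyond the formula above; the function names `beta_posterior` and `beta_mean` are illustrative, not part of any library.

```python
def beta_posterior(conv, n, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior after `conv` conversions in `n` visitors."""
    if not 0 <= conv <= n:
        raise ValueError("conv must be between 0 and n")
    # Conjugate update: add conversions to a, non-conversions to b.
    return prior_a + conv, prior_b + (n - conv)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Control arm from this lesson: 503 conversions out of 5,000 visitors.
a, b = beta_posterior(503, 5000)
print(a, b)                       # parameters of Beta(504, 4498) under the uniform prior
print(round(beta_mean(a, b), 4))  # posterior mean of the control rate
```

Passing a different `prior_a` and `prior_b` is all it takes to swap in a more opinionated prior.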
Computing the Answer on Lumen’s Data
Let’s run it on the exact conversion data from Module 4: 503 conversions out of 5,000 in control, 613 out of 5,000 in treatment. We build a posterior for each arm, draw hundreds of thousands of samples from both, and read off the two things stakeholders asked for.
import numpy as np
rng = np.random.default_rng(11)
n, conv_c, conv_t = 5000, 503, 613
draws = 400000
pc = rng.beta(1+conv_c, 1+(n-conv_c), draws) # control posterior
pt = rng.beta(1+conv_t, 1+(n-conv_t), draws) # treatment posterior
p_better = np.mean(pt > pc)
diff = pt - pc
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"P(treatment > control) = {p_better:.4f}")
print(f"posterior mean uplift = {diff.mean():.4f}")
print(f"95% credible interval = [{lo:.4f}, {hi:.4f}]")
Running it:
P(treatment > control) = 0.9998
posterior mean uplift = 0.0220
95% credible interval = [0.0096, 0.0344]
Read those straight out loud. There is a 99.98% posterior probability that the new page’s true conversion rate is higher than the old one’s. The posterior mean uplift is +2.20 percentage points. And the 95% credible interval on the lift is [+0.96, +3.44] points. No hedging about hypothetical null worlds: these are direct statements about the quantities you care about.
The credible interval is where Bayesian phrasing really pays off. You can say: “there’s a 95% probability the true lift is between +0.96 and +3.44 points.” That is the intuitive reading people wrongly attach to a frequentist confidence interval — and here, because it’s a credible interval, the intuitive reading is actually the correct one. The interval is a statement about the unknown lift given the data, which is exactly what it sounds like it should be.
Credible interval vs confidence interval — and the prior
A frequentist 95% confidence interval makes a statement about the procedure: across many hypothetical repetitions, 95% of such intervals would contain the true value. It says nothing directly about this particular interval, so the “95% probability the truth is in here” reading is technically wrong. A Bayesian 95% credible interval says exactly that: given the data and the prior, there’s a 95% probability the true lift lies inside. With a flat prior and this much data the two intervals have nearly the same endpoints; what changes is that the plain-language reading becomes the honest one. The one caveat is the prior. Beta(1, 1) is flat, so the data speaks for itself. A stronger prior, say Beta(50, 450) encoding “we’ve historically converted around 10%,” carries the weight of about 500 pseudo-observations and pulls small-sample results toward 10%, which usefully tames early noise but genuinely shapes the answer. Choose it honestly; with lots of data (as here) the prior barely matters.
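To get a feel for how much the prior matters, you can compare posterior means under both priors on a small and a large sample. This is a sketch under stated assumptions: the 10-out-of-50 sample is hypothetical, chosen only to look like a lucky early result, and `posterior_mean` is an illustrative helper.

```python
def posterior_mean(conv, n, prior_a, prior_b):
    """Posterior mean of the rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + conv) / (prior_a + prior_b + n)


priors = {"Beta(1, 1)": (1, 1), "Beta(50, 450)": (50, 450)}

# Hypothetical small sample (10 of 50) next to the lesson's control arm (503 of 5,000).
samples = {"small": (10, 50), "large": (503, 5000)}

for label, (conv, n) in samples.items():
    for name, (a, b) in priors.items():
        mean = posterior_mean(conv, n, a, b)
        print(f"{label:5s} {name:14s} posterior mean = {mean:.4f}")
```

On the small sample the flat prior leaves the estimate near 21%, while Beta(50, 450) pulls it to about 11%. On the 5,000-visitor sample both priors land close to 10.1%, which is the "barely matters" regime.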
Two Philosophies, One Answer
Here’s the reassuring part. Compare the Bayesian credible interval [0.0096, 0.0344] to the frequentist 95% confidence interval you computed in Module 4: [0.0097, 0.0343]. They are essentially identical. Same data, two entirely different philosophies, and the same practical answer — because with a uniform prior and a large sample, the Bayesian posterior and the frequentist sampling distribution converge.
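You can check the agreement yourself by computing both intervals side by side. The sketch below assumes the Module 4 interval was an unpooled Wald interval for a difference of proportions; that is a guess about the earlier lesson, not something stated here. It also uses the standard library's `random.betavariate` and 100,000 draws instead of NumPy, purely to stay dependency-free.

```python
import math
import random

n, conv_c, conv_t = 5000, 503, 613

# Frequentist side: unpooled Wald interval for a difference of proportions
# (an assumption about what Module 4 used).
rate_c, rate_t = conv_c / n, conv_t / n
se = math.sqrt(rate_c * (1 - rate_c) / n + rate_t * (1 - rate_t) / n)
z = 1.959964  # two-sided 95% normal quantile
wald_lo = (rate_t - rate_c) - z * se
wald_hi = (rate_t - rate_c) + z * se

# Bayesian side: equal-tailed 95% interval from posterior draws, uniform priors.
rng = random.Random(11)
draws = 100_000
diffs = sorted(
    rng.betavariate(1 + conv_t, 1 + n - conv_t)
    - rng.betavariate(1 + conv_c, 1 + n - conv_c)
    for _ in range(draws)
)
cred_lo = diffs[int(0.025 * draws)]
cred_hi = diffs[int(0.975 * draws)]

print(f"Wald 95% CI           = [{wald_lo:.4f}, {wald_hi:.4f}]")
print(f"95% credible interval = [{cred_lo:.4f}, {cred_hi:.4f}]")
```

Under that assumption the Wald line reproduces the [0.0097, 0.0343] quoted above, and the sampled credible interval should sit within Monte Carlo noise of it.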
That’s the honest takeaway: Bayesian A/B testing is not “more powerful” than the frequentist test. It won’t find effects the frequentist method misses on the same data. What it offers is a different, often more intuitive language — a probability the treatment is better, a credible interval you can read literally — plus two things that compose more naturally with real decisions.
First, decision framing via expected loss. Beyond P(better), you can compute the expected loss of shipping each arm: if you pick treatment and you’re actually wrong, how much conversion do you expect to give up? Average that regret over the posterior, and you get a business-friendly stopping rule — ship when the expected loss of doing so falls below a small threshold (say, less than 0.1 points of expected forgone conversion). That’s a far more natural criterion than “p < 0.05,” because it’s denominated in the thing the business cares about.
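Expected loss is easy to estimate from the same kind of posterior draws. What follows is a minimal sketch, assuming uniform priors and the 0.1-point example threshold from the paragraph above; `expected_loss` and `THRESHOLD` are illustrative names, and the standard library's `random.betavariate` stands in for NumPy.

```python
import random


def expected_loss(conv_c, conv_t, n, draws=100_000, seed=11):
    """Monte Carlo expected loss (in conversion-rate units) of shipping each arm.

    Uses uniform Beta(1, 1) priors on both arms, as in the lesson.
    """
    rng = random.Random(seed)
    loss_ship_t = 0.0  # rate given up if we ship treatment but control is better
    loss_ship_c = 0.0  # rate given up if we ship control but treatment is better
    for _ in range(draws):
        pc = rng.betavariate(1 + conv_c, 1 + n - conv_c)
        pt = rng.betavariate(1 + conv_t, 1 + n - conv_t)
        loss_ship_t += max(pc - pt, 0.0)
        loss_ship_c += max(pt - pc, 0.0)
    return loss_ship_t / draws, loss_ship_c / draws


THRESHOLD = 0.001  # 0.1 percentage points, the example threshold from the text

loss_t, loss_c = expected_loss(503, 613, 5000)
print(f"expected loss if we ship treatment = {loss_t:.6f}")
print(f"expected loss if we ship control   = {loss_c:.6f}")
print("ship treatment" if loss_t < THRESHOLD else "keep collecting data")
```

Note the asymmetry: the loss only counts draws where the shipped arm is actually worse, so a likely winner has an expected loss close to zero while the likely loser's is close to the whole uplift.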
Second, peeking. The fixed-horizon peeking problem from Module 6 is fundamentally a frequentist artifact — it comes from the null-hypothesis testing framework’s need to control error rates across a pre-committed sample size. A posterior is just your current belief given the data so far; looking at it more often doesn’t corrupt it the way repeated significance tests inflate false positives. This is not a license to stop the instant P(better) twitches above 95% — you can still fool yourself by stopping on noise, and early posteriors are wide for a reason — but the machinery doesn’t punish honest monitoring the same way.
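The caveat about stopping on noise can be made concrete with a small A/A simulation, where both arms share the same true rate and any declared winner is a fluke. This is only a sketch under assumed settings: a 10% true rate, a look every 500 visitors, a 95% cutoff, and a normal approximation to P(treatment > control) in place of posterior sampling to keep it fast. The function names are illustrative.

```python
import math
import random


def prob_better(conv_c, n_c, conv_t, n_t):
    """Normal approximation to P(treatment > control) under Beta(1, 1) priors."""
    def mean_var(conv, n):
        a, b = 1 + conv, 1 + n - conv
        return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

    mc, vc = mean_var(conv_c, n_c)
    mt, vt = mean_var(conv_t, n_t)
    z = (mt - mc) / math.sqrt(vc + vt)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def simulate(n_experiments=200, rate=0.10, batch=500, looks=10, cutoff=0.95, seed=3):
    """Run A/A tests and count how often P(better) exceeds the cutoff."""
    rng = random.Random(seed)
    crossed_any_look = crossed_final_look = 0
    for _ in range(n_experiments):
        conv_c = conv_t = n = 0
        crossed = False
        p = 0.5
        for _ in range(looks):
            # Add one batch of simulated visitors to each arm, then take a look.
            conv_c += sum(rng.random() < rate for _ in range(batch))
            conv_t += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = prob_better(conv_c, n, conv_t, n)
            crossed = crossed or p > cutoff
        crossed_any_look += crossed
        crossed_final_look += p > cutoff
    return crossed_any_look / n_experiments, crossed_final_look / n_experiments


any_look, final_look = simulate()
print(f"declared a winner at some look:     {any_look:.1%}")
print(f"declared a winner at the last look: {final_look:.1%}")
```

The first rate can never be lower than the second, because the last look is one of the looks. How much higher it comes out depends on the settings, and that gap is the cost of stopping the moment the posterior crosses a line.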
Practice Exercises
Exercise 1: What did the p-value actually say?
Your Module 4 test on this data returned a small p-value. A stakeholder reads it as “there’s only a tiny chance the new page isn’t better.” Why is that reading wrong, and what quantity does say that?
Hint
The p-value is P(data this extreme | no effect) — a probability about the data assuming the null is true, not a probability about the hypothesis. It never says “the chance the new page isn’t better is tiny.” The quantity that does is the Bayesian posterior: P(treatment > control) = 0.9998, so P(treatment not better) ≈ 0.02%. That’s the number the stakeholder was reaching for — and only the Bayesian analysis actually produces it.
Exercise 2: Where does Beta(1 + conv, 1 + (n − conv)) come from?
Control saw 503 conversions in 5,000 visitors, so its posterior is Beta(504, 4498). Explain the two parameters in plain terms, and what Beta(1, 1) contributes.
Hint
The two parameters are the prior’s parameters plus the raw counts: 504 = 1 + 503 conversions, and 4498 = 1 + (5000 − 503) = 1 + 4497 non-conversions. The extra +1 on each side is the Beta(1, 1) prior. Its density is constant on [0, 1], so it favors no conversion rate over any other; in the posterior mean, 504 / 5002, it carries the same weight as one imaginary conversion and one imaginary non-conversion. With 5,000 real visitors, those two phantom observations are negligible, which is exactly why the answer here is data-driven and the prior barely moves it.
Exercise 3: A stronger prior on a small sample
Suppose you’d only collected 50 visitors per arm and Lumen has years of history showing conversion hovers near 10%. Would you keep Beta(1, 1), and what does a stronger prior buy you?
Hint
With only 50 visitors, Beta(1, 1) lets pure noise swing the posterior wildly — a couple of lucky conversions could imply an implausible 20% rate. A prior like Beta(50, 450) encodes “we expect around 10%” with the weight of ~500 pseudo-observations, so it pulls the small, noisy sample back toward reality and tightens the posterior. The trade-off is honesty: that prior genuinely shapes the conclusion, so you must be able to defend it from real historical data, not wishful thinking. Once you have thousands of visitors, the prior’s influence fades and the choice stops mattering.
Summary
Bayesian A/B testing replaces the p-value’s backwards question (“how surprising is this data if there’s no effect?”) with the direct one stakeholders actually ask (“what’s the probability the new page is better, and by how much?”). For conversion data it’s clean because Beta is the conjugate prior for binomial outcomes: starting from a uniform Beta(1, 1), each arm’s posterior is Beta(1 + conv, 1 + (n − conv)). Sampling both posteriors on Lumen’s data (503/5000 control, 613/5000 treatment) gives a 99.98% probability the treatment is better, a posterior mean uplift of +2.20 percentage points, and a 95% credible interval of [+0.96, +3.44] points that you can read literally as a probability about the lift. That credible interval essentially matches the frequentist confidence interval from Module 4: same data, same answer, plainer language.
Key Concepts
- Direct probability: Bayesian methods return P(treatment > control), the quantity a p-value can’t give.
- Beta-Binomial conjugacy: the posterior is Beta(1 + conv, 1 + (n − conv)) from a uniform Beta(1, 1) prior.
- Credible interval: you can say “95% probability the true lift is in here,” unlike a confidence interval.
- Convergence: with a flat prior and large n, the Bayesian and frequentist intervals agree.
Why This Matters
Most of the pain in communicating experiment results is translation: turning a p-value and a confidence interval into sentences a product owner can act on without quietly misreading them. Bayesian A/B testing removes the translation step — its outputs are the sentences people want, and they’re correct as stated. It also composes naturally with decisions: expected loss turns “is it significant?” into “how much do we expect to give up if we ship the wrong arm?”, a stopping rule denominated in business terms rather than error rates. Bayesian isn’t a silver bullet — on the same data it finds the same effects — but as a language for decisions under uncertainty it’s often the clearer one. Next you’ll go one step further, to methods that don’t just measure the winner but actively route traffic toward it while the test runs.
Next Steps
Continue to Lesson 4 - Multi-Armed Bandits
Stop wasting traffic on losing variants — algorithms that learn and allocate toward the winner while the experiment is still running.
Continue Building Your Skills
You’ve seen how Bayesian A/B testing swaps a p-value for the two answers people actually want — a direct probability the treatment is better and a credible interval on the lift you can read literally — while landing on the same practical result as the frequentist test. That’s a better language for decisions. Next you’ll change the mechanics of the experiment itself: multi-armed bandits stop treating every variant as equal for the full run and instead shift traffic toward whichever arm is winning, trading some measurement precision for far less regret while the test is live.