Lesson 1 - The Two-Sample T-Test
Welcome to the Two-Sample T-Test
Lumen’s new signup page lifted conversion — you proved that in the last module. But conversion isn’t the only thing the business cares about. Does the new page also change revenue per user? That’s a different kind of metric: not a rate (converted or not) but an average of a continuous amount (dollars). The two-proportion z-test you just mastered doesn’t apply — you can’t pool “rates” when every user has a dollar figure. Comparing two averages needs its own tool: the two-sample t-test. In this lesson you’ll build it and run it on Lumen’s revenue data — and get a result that isn’t as clean as last module’s, which is exactly why the rest of this module exists.
By the end of this lesson, you will be able to:
- Explain why a mean metric needs a t-test rather than a proportion test
- Describe how a difference in means becomes a t statistic
- Run a two-sample t-test with scipy and read the result
- Interpret a borderline p-value and know what questions it leaves open
Let’s start with why means are different.
Why Means Need a Different Test
A proportion is built from yes/no outcomes, and its variability is fixed by the rate itself — a 10% rate has a known, formula-driven spread. A mean is different: it’s the average of a continuous quantity, and its variability depends on how spread out the individual values are. Two experiments can have the same average revenue but wildly different reliability — one where everyone spends about $16, another where most spend $2 and a few spend $500. The average alone doesn’t tell you how much to trust it; the spread does.
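To make the contrast concrete, here is a minimal sketch with made-up numbers (every array below is hypothetical, not Lumen's data). For 0/1 outcomes the variance is pinned to p(1−p); for dollar amounts, two samples can share a mean and still have very different spreads, and therefore very different standard errors.

```python
import numpy as np

# Hypothetical 0/1 conversion outcomes: 1 conversion out of 10 users (a 10% rate).
converted = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
p = converted.mean()
# For 0/1 data the population variance (ddof=0) is fully determined by the rate.
var_binary = converted.var()
var_from_rate = p * (1 - p)

# Two hypothetical revenue samples with the same mean ($16) but different spread.
tight = np.array([15.0, 16.0, 17.0, 16.0, 16.0])
wide = np.array([2.0, 2.0, 2.0, 2.0, 72.0])

def standard_error(x):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return x.std(ddof=1) / np.sqrt(x.size)

print(f"binary: variance={var_binary:.4f}  p(1-p)={var_from_rate:.4f}")
print(f"tight:  mean={tight.mean():.2f}  SE={standard_error(tight):.2f}")
print(f"wide:   mean={wide.mean():.2f}  SE={standard_error(wide):.2f}")
```

Both revenue samples average $16, but the wide one has a far larger standard error, so its average deserves much less trust.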
So the t-test follows the same “effect over noise” logic as the z-test, but measures the noise from the data’s own variability:
- The effect is the difference in group means, mean₂ − mean₁.
- The noise is the standard error of that difference, built from each group’s variance and size: SE = √(s₁²/n₁ + s₂²/n₂).
- The t statistic is their ratio: t = (mean₂ − mean₁) / SE, which tells you how many standard errors apart the two averages are.
Turning t into a p-value uses Student’s t-distribution (close to the normal for large samples), and scipy does it for us.
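The effect, noise, and ratio described above can be computed by hand and checked against scipy. The sketch below uses two small made-up samples (hypothetical values, not Lumen's data) and assumes the unequal-variance (Welch) form of the test, whose degrees of freedom come from the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# Hypothetical per-user revenue samples (illustrative values only).
group1 = np.array([12.0, 15.5, 9.0, 20.0, 14.0, 11.5, 18.0, 13.0])  # control
group2 = np.array([16.0, 30.0, 8.0, 22.5, 41.0, 12.0, 19.0])         # treatment

# Each group's contribution to the noise: s^2 / n (sample variance, ddof=1).
v1 = group1.var(ddof=1) / group1.size
v2 = group2.var(ddof=1) / group2.size

effect = group2.mean() - group1.mean()   # mean2 - mean1
se = np.sqrt(v1 + v2)                    # sqrt(s1^2/n1 + s2^2/n2)
t_manual = effect / se

# Welch-Satterthwaite degrees of freedom, then a two-sided p-value.
df = (v1 + v2) ** 2 / (v1**2 / (group1.size - 1) + v2**2 / (group2.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# scipy's Welch test should agree with the hand calculation.
result = stats.ttest_ind(group2, group1, equal_var=False)
print(f"manual: t = {t_manual:.3f}  p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.3f}  p = {result.pvalue:.4f}")
```

Passing the treatment group first makes the sign of t match mean₂ − mean₁.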
Running It on Lumen’s Revenue
Lumen measured revenue per user for both groups. Revenue is right-skewed — most users spend a little, a few spend a lot — so we model it with a lognormal draw. The new page (treatment) went to a smaller slice of traffic and produced more variable spending. We compute the means and run the t-test with scipy, using equal_var=False (Welch’s t-test — the safe default you’ll understand fully in Lesson 2):
import numpy as np
from scipy import stats
rng = np.random.default_rng(3)  # fixed seed so the simulated draws are reproducible
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000) # control: current page
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500) # treatment: new page
print(f"control mean ${rev_c.mean():.2f} sd ${rev_c.std(ddof=1):.2f} n={rev_c.size}")
print(f"treatment mean ${rev_t.mean():.2f} sd ${rev_t.std(ddof=1):.2f} n={rev_t.size}")
result = stats.ttest_ind(rev_t, rev_c, equal_var=False)  # Welch's test; treatment first, so a lift gives t > 0
print(f"difference = ${rev_t.mean() - rev_c.mean():.2f} t = {result.statistic:.3f} p = {result.pvalue:.4f}")

Running it:
control mean $15.62 sd $12.42 n=6000
treatment mean $16.84 sd $23.89 n=1500
difference = $1.22 t = 1.908 p = 0.0566

The treatment’s average revenue is $16.84 versus $15.62, a $1.22 lift. But the t-test returns t = 1.908 and p = 0.0566, just above the 0.05 threshold. At the 5% significance level, this difference is not statistically significant: we can’t rule out that a gap of this size arose from noise alone. The average went up, but the test says “not so fast.”
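As a sanity check, the test can be approximately reproduced from the printed summary statistics alone. This sketch uses scipy's ttest_ind_from_stats. Because the inputs are rounded to cents, expect values close to the t and p reported above, not an exact match.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the printed output (rounded to cents).
mean_c, sd_c, n_c = 15.62, 12.42, 6000   # control
mean_t, sd_t, n_t = 16.84, 23.89, 1500   # treatment

# Standard error of the difference: sqrt(s_c^2/n_c + s_t^2/n_t).
se = np.sqrt(sd_c**2 / n_c + sd_t**2 / n_t)
t_by_hand = (mean_t - mean_c) / se

# Welch's test from summary statistics (treatment first, so a lift gives t > 0).
result = stats.ttest_ind_from_stats(
    mean_t, sd_t, n_t,
    mean_c, sd_c, n_c,
    equal_var=False,
)
print(f"SE = ${se:.3f}  t (by hand) = {t_by_hand:.3f}")
print(f"scipy: t = {result.statistic:.3f}  p = {result.pvalue:.4f}")
```

Most of the standard error comes from the treatment term, since that group is both smaller and more variable.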
A borderline result is a prompt, not a verdict
p = 0.0566 is agonizingly close to 0.05 — and that closeness is a signal to look harder, not to fudge the threshold. Two very different questions are hiding inside this result. First: did we even measure the noise correctly? Notice the treatment group’s standard deviation ($23.89) is nearly double the control’s ($12.42) — unequal spreads that the choice of t-test handles differently (Lesson 2). Second: with revenue this skewed, is the mean even the right thing to compare (Lesson 3)? A borderline p-value on a skewed metric is precisely where careless analysis ships the wrong decision. The rest of this module is about not doing that.
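One way to see the skew concretely: for a lognormal distribution with parameters μ and σ, the median is exp(μ) and the mean is exp(μ + σ²/2), so the gap between the two grows with σ. The sketch below plugs in the parameters from this lesson's simulation. It describes the simulated model only, and the helper name lognormal_summary is made up for illustration.

```python
import numpy as np

def lognormal_summary(mu, sigma):
    """Theoretical median, mean, and SD of a lognormal(mu, sigma) distribution."""
    median = np.exp(mu)
    mean = np.exp(mu + sigma**2 / 2)
    sd = mean * np.sqrt(np.exp(sigma**2) - 1)
    return median, mean, sd

# Parameters taken from the simulation code in this lesson.
groups = {"control": (2.50, 0.70), "treatment": (2.28, 1.05)}

for name, (mu, sigma) in groups.items():
    median, mean, sd = lognormal_summary(mu, sigma)
    print(f"{name:9s} median ${median:.2f}  mean ${mean:.2f}  sd ${sd:.2f}")
```

In this simulated model the treatment has the higher theoretical mean but the lower theoretical median, which is one reason to ask whether the mean is the right summary for data this skewed.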
Practice Exercises
Exercise 1: Why not a proportion test?
Lumen already has a two-proportion z-test from last module. Why can’t it reuse that test to compare revenue per user between the two groups?
Hint
Because revenue per user isn’t a proportion — it’s a continuous dollar amount, not a yes/no outcome. The z-test’s standard error comes from the rate formula p(1−p), which only makes sense for 0/1 data. A mean’s variability depends on how spread out the actual values are (the variance s²), so it needs the t-test, whose standard error is built from each group’s variance and size.
Exercise 2: Same means, different result?
Two experiments both show a treatment mean of $16.84 and a control mean of $15.62. In one, spending is tightly clustered; in the other, it’s wildly spread out. Would the t-test treat them the same?
Hint
No. The t statistic is (difference) / SE, and the SE grows with the variance. The tightly clustered experiment has a small SE, so the same $1.22 gap is many standard errors out and likely significant. The widely spread one has a large SE, so the same gap is only a fraction of a standard error out and likely not significant. Identical averages, opposite conclusions, because the spread decides how much to trust the gap.
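A quick sketch of this answer, using the two means from the exercise. The standard deviations and the per-group sample size are assumptions chosen for illustration, since the exercise does not specify them.

```python
from scipy import stats

# Means from the exercise; the SDs and sample size below are hypothetical.
mean_c, mean_t = 15.62, 16.84
n = 1500  # assumed users per group

# Assumed SD shared by both groups in each scenario.
scenarios = {"tight": 5.0, "wide": 60.0}

results = {}
for label, sd in scenarios.items():
    # Welch's t-test computed from summary statistics.
    res = stats.ttest_ind_from_stats(mean_t, sd, n, mean_c, sd, n, equal_var=False)
    results[label] = res
    print(f"{label:5s} sd=${sd:.0f}  t = {res.statistic:.2f}  p = {res.pvalue:.4f}")
```

Under these assumed spreads, the same $1.22 gap is clearly significant in the tight scenario and nowhere near significant in the wide one.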
Exercise 3: Read the borderline
Lumen’s test gave p = 0.0566. A teammate says “that’s basically 0.05, let’s call it significant and ship.” What’s wrong with that, and what should you do instead?
Hint
The threshold was fixed at 0.05 before the test (Module 3), and 0.0566 is above it — nudging it to fit the result is exactly the kind of after-the-fact reasoning that manufactures false positives. Instead of moving the goalposts, investigate why it’s borderline: check whether the test handled the unequal variances correctly (Lesson 2), and whether the skewed distribution makes the mean misleading in the first place (Lesson 3). The right move is more scrutiny, not a rounded-down p-value.
Summary
Mean metrics (revenue, time, order value) are averages of continuous quantities, so the two-proportion z-test doesn’t apply; you need the two-sample t-test. It follows the same effect-over-noise logic: the effect is the difference in means (mean₂ − mean₁), the noise is the standard error built from each group’s variance and size (√(s₁²/n₁ + s₂²/n₂)), and the t statistic is their ratio. Run on Lumen’s revenue data with scipy, the test found a $1.22 average lift but t = 1.908, p = 0.0566: just above 0.05, so not significant. Crucially, the average rose while the test withheld judgment, and two red flags in the data (a treatment standard deviation nearly double the control’s, and heavy skew) mean this borderline result needs the deeper analysis the rest of the module provides.
Key Concepts
- Mean metrics need the t-test: proportions use p(1−p); means use the data’s own variance.
- Effect over noise: t = (mean₂ − mean₁) / √(s₁²/n₁ + s₂²/n₂).
- Spread decides trust: the same difference in means can be significant or not depending on variance.
- Borderline p (0.0566): above 0.05 is not significant; investigate, don’t round down.
Why This Matters
Revenue, engagement time, and order value are the metrics executives actually care about, and they’re all means — so the t-test is as essential to A/B testing as the proportion test. But mean metrics are treacherous in ways rates aren’t: they’re sensitive to variance and badly distorted by skew, and a naive t-test on borderline, skewed data is one of the most common ways real experiments reach a wrong, expensive conclusion. Learning to run the test and to distrust a too-convenient average is what separates a reliable analysis from a misleading one. Next, you’ll see how the choice of t-test — Welch versus Student — flips Lumen’s borderline result, and why one of them is almost always the right call.
Next Steps
Continue to Lesson 2 - Welch's T-Test and Unequal Variances
Why Welch's t-test is the safe default — and how it overturns a false 'significant' on Lumen's revenue data.
Back to Module Overview
Return to the Analyzing Mean Metrics module overview
Continue Building Your Skills
You built the two-sample t-test (difference in means over a standard error that depends on spread) and ran it on Lumen’s revenue to find a $1.22 lift that lands at a borderline p = 0.0566. Two warning signs in the data (a treatment standard deviation nearly double the control’s, and heavy skew) tell you this result needs a closer look. Next you’ll start with the first: how Welch’s t-test handles the unequal variances, and why it’s the version you should almost always use.