Lesson 1 - The Two-Sample T-Test

Welcome to the Two-Sample T-Test

Lumen’s new signup page lifted conversion; you proved that in the last module. But conversion isn’t the only thing the business cares about. Does the new page also change revenue per user? That’s a different kind of metric: not a rate (converted or not) but an average of a continuous amount (dollars). The two-proportion z-test you just mastered doesn’t apply, because its standard error is built from a yes/no rate, and revenue gives every user a dollar figure instead. Comparing two averages needs its own tool: the two-sample t-test. In this lesson you’ll build it and run it on Lumen’s revenue data, and get a result that isn’t as clean as last module’s, which is exactly why the rest of this module exists.

By the end of this lesson, you will be able to:

Explain why a mean metric needs a t-test rather than a proportion test
Describe how a difference in means becomes a t statistic
Run a two-sample t-test with scipy and read the result
Interpret a borderline p-value and know what questions it leaves open

Let’s start with why means are different.

Why Means Need a Different Test

A proportion is built from yes/no outcomes, and its variability is fixed by the rate itself — a 10% rate has a known, formula-driven spread. A mean is different: it’s the average of a continuous quantity, and its variability depends on how spread out the individual values are. Two experiments can have the same average revenue but wildly different reliability — one where everyone spends about $16, another where most spend $2 and a few spend $500. The average alone doesn’t tell you how much to trust it; the spread does.
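To make the "same average, different reliability" point concrete, here is a minimal sketch using made-up data (not Lumen's). Both hypothetical groups average exactly $16, but one is tightly clustered and the other is mostly $2 spenders plus a few $500 spenders. The helper name standard_error is introduced only for this illustration.

```python
import numpy as np

# Hypothetical data: two groups of 498 users, both averaging exactly $16.
tight = np.tile([15.0, 17.0], 249)            # everyone spends about $16
lumpy = np.concatenate([np.full(484, 2.0),    # most spend $2
                        np.full(14, 500.0)])  # a few spend $500

def standard_error(x):
    """Standard error of the mean: sample sd divided by sqrt(n)."""
    return x.std(ddof=1) / np.sqrt(x.size)

se_tight = standard_error(tight)
se_lumpy = standard_error(lumpy)

for name, x, se in [("tight", tight, se_tight), ("lumpy", lumpy, se_lumpy)]:
    print(f"{name}: mean ${x.mean():.2f}   sd ${x.std(ddof=1):.2f}   SE ${se:.2f}")
```

The two means are identical, yet the standard error of the lumpy group is far larger, so its average deserves much less trust.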

So the t-test follows the same “effect over noise” logic as the z-test, but measures the noise from the data’s own variability:

The effect is the difference in group means, mean₂ − mean₁.
The noise is the standard error of that difference, built from each group’s variance and size: SE = √(s₁²/n₁ + s₂²/n₂).
The t statistic is their ratio: t = (mean₂ − mean₁) / SE — how many standard errors apart the two averages are.
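The three steps above can be sketched by hand. This example plugs in the rounded summary statistics that this lesson reports for Lumen's two groups (means, standard deviations, and sizes); the variable names are chosen for illustration only.

```python
import math

# Summary statistics as reported in this lesson's output (rounded to cents).
mean_c, sd_c, n_c = 15.62, 12.42, 6000   # control: current page
mean_t, sd_t, n_t = 16.84, 23.89, 1500   # treatment: new page

effect = mean_t - mean_c                            # difference in group means
se = math.sqrt(sd_c**2 / n_c + sd_t**2 / n_t)       # standard error of that difference
t_stat = effect / se                                # standard errors between the averages

print(f"effect = ${effect:.2f}   SE = ${se:.3f}   t = {t_stat:.2f}")
```

Because the inputs are rounded to cents, the result lands close to, but not exactly on, the t = 1.908 that scipy reports from the full data.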

[Figure: two overlapping distributions of revenue per user on a dollar axis, a control distribution with mean $15.62 and a treatment distribution with mean $16.84, with dashed lines at the means and the +$1.22 gap between them highlighted. A formula box reads t = (mean₂ − mean₁) / SE.] Caption: The two-sample t-test compares two averages by the same effect-over-noise logic as the z-test, but the noise now comes from each group's own spread, which is why variability matters as much as the averages themselves.

Turning t into a p-value uses Student’s t-distribution (close to the normal for large samples), and scipy does it for us.
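As a rough illustration of "close to the normal for large samples", the sketch below converts the same t value into a two-sided p-value under several degrees of freedom and under the normal distribution. The degrees-of-freedom values are hypothetical, picked only to show the trend.

```python
from scipy import stats

t_value = 1.908  # the t statistic from this lesson's output

# Two-sided p-value: chance of a statistic at least this far from 0 in either direction.
p_by_df = {}
for df in (10, 100, 1000):   # hypothetical degrees of freedom, for illustration only
    p_by_df[df] = 2 * stats.t.sf(t_value, df)
    print(f"t-distribution, df={df:<5} p = {p_by_df[df]:.4f}")

p_normal = 2 * stats.norm.sf(t_value)
print(f"normal distribution       p = {p_normal:.4f}")
```

With few degrees of freedom the t-distribution has heavier tails and gives a larger p-value; as the degrees of freedom grow, the p-value converges toward the normal one.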

Running It on Lumen’s Revenue

Lumen measured revenue per user for both groups. Revenue is right-skewed (most users spend a little, a few spend a lot), so we simulate it here with lognormal draws. The new page (treatment) went to a smaller slice of traffic and produced more variable spending. We compute the means and run the t-test with scipy, using equal_var=False (Welch’s t-test, the safe default you’ll understand fully in Lesson 2):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)   # control: current page
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)   # treatment: new page

print(f"control   mean ${rev_c.mean():.2f}   sd ${rev_c.std(ddof=1):.2f}   n={rev_c.size}")
print(f"treatment mean ${rev_t.mean():.2f}   sd ${rev_t.std(ddof=1):.2f}   n={rev_t.size}")

result = stats.ttest_ind(rev_t, rev_c, equal_var=False)
print(f"difference = ${rev_t.mean() - rev_c.mean():.2f}   t = {result.statistic:.3f}   p = {result.pvalue:.4f}")

Running it:

control   mean $15.62   sd $12.42   n=6000
treatment mean $16.84   sd $23.89   n=1500
difference = $1.22   t = 1.908   p = 0.0566

The treatment’s average revenue is $16.84 versus $15.62, a $1.22 lift. But the t-test returns t = 1.908 and p = 0.0566, just above the 0.05 threshold. At the 5% significance level, this difference is not statistically significant: we can’t rule out that a gap of this size arose from noise alone. The average went up, but the test says “not so fast.”
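To check that scipy is applying the effect-over-noise formula from earlier, here is a sketch that rebuilds the t statistic and p-value by hand on the same simulated data and compares them with scipy's result. The degrees of freedom use the Welch-Satterthwaite approximation, which is what a Welch test relies on; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same simulated data as in the lesson's code (same seed, same draw order).
rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)

# Each group's contribution to the squared standard error: s^2 / n.
v_c = rev_c.var(ddof=1) / rev_c.size
v_t = rev_t.var(ddof=1) / rev_t.size

t_manual = (rev_t.mean() - rev_c.mean()) / np.sqrt(v_c + v_t)

# Welch-Satterthwaite approximation to the degrees of freedom.
df = (v_c + v_t) ** 2 / (v_c**2 / (rev_c.size - 1) + v_t**2 / (rev_t.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

result = stats.ttest_ind(rev_t, rev_c, equal_var=False)
print(f"by hand: t = {t_manual:.3f}   p = {p_manual:.4f}")
print(f"scipy:   t = {result.statistic:.3f}   p = {result.pvalue:.4f}")
```

The two lines should agree, which confirms that the scipy call is a shortcut for the formula rather than a different test.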

A borderline result is a prompt, not a verdict

p = 0.0566 is agonizingly close to 0.05 — and that closeness is a signal to look harder, not to fudge the threshold. Two very different questions are hiding inside this result. First: did we even measure the noise correctly? Notice the treatment group’s standard deviation ($23.89) is nearly double the control’s ($12.42) — unequal spreads that the choice of t-test handles differently (Lesson 2). Second: with revenue this skewed, is the mean even the right thing to compare (Lesson 3)? A borderline p-value on a skewed metric is precisely where careless analysis ships the wrong decision. The rest of this module is about not doing that.
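Both warning signs can be measured directly. The sketch below regenerates the lesson's simulated data and reports the ratio of the spreads plus two quick skew checks: scipy's sample skewness and the gap between mean and median. It is a diagnostic sketch only; what to do about either finding is the subject of the next two lessons.

```python
import numpy as np
from scipy import stats

# Same simulated data as in the lesson's code (same seed, same draw order).
rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)

# Warning sign 1: unequal spreads.
sd_ratio = rev_t.std(ddof=1) / rev_c.std(ddof=1)
var_ratio = sd_ratio ** 2          # variance ratio is the sd ratio squared
print(f"sd ratio (treatment / control)       = {sd_ratio:.2f}")
print(f"variance ratio (treatment / control) = {var_ratio:.2f}")

# Warning sign 2: right skew (positive skewness, mean pulled above the median).
skew_c = stats.skew(rev_c)
skew_t = stats.skew(rev_t)
for name, x, sk in [("control", rev_c, skew_c), ("treatment", rev_t, skew_t)]:
    print(f"{name:<9} skewness = {sk:.2f}   mean ${x.mean():.2f}   median ${np.median(x):.2f}")
```

Note that a standard deviation that is nearly double corresponds to a variance that is nearly four times as large, because variance is the standard deviation squared.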

Practice Exercises

Exercise 1: Why not a proportion test?

Lumen already has a two-proportion z-test from last module. Why can’t it reuse that test to compare revenue per user between the two groups?

Hint

Because revenue per user isn’t a proportion — it’s a continuous dollar amount, not a yes/no outcome. The z-test’s standard error comes from the rate formula p(1−p), which only makes sense for 0/1 data. A mean’s variability depends on how spread out the actual values are (the variance s²), so it needs the t-test, whose standard error is built from each group’s variance and size.
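A small sketch of that contrast, on made-up data: for 0/1 outcomes the variance is pinned to p(1−p), so the rate alone determines the spread, while two sets of dollar amounts with the same mean can have very different variances. The seed, the 10% rate, and the spend arrays are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)   # hypothetical seed, for illustration only

# 0/1 outcomes with roughly a 10% conversion rate.
converted = (rng.random(10_000) < 0.10).astype(float)
p = converted.mean()
print(f"rate p = {p:.4f}   variance = {converted.var():.4f}   p(1-p) = {p * (1 - p):.4f}")

# Dollar amounts: same mean, very different spread.
spend_a = np.array([15.0, 16.0, 17.0])
spend_b = np.array([1.0, 2.0, 45.0])
for name, x in [("spend_a", spend_a), ("spend_b", spend_b)]:
    print(f"{name}: mean ${x.mean():.2f}   variance {x.var(ddof=1):.2f}")
```

For the binary data the variance and p(1−p) match; for the dollar data the mean says nothing about the variance, which therefore has to be estimated from the values themselves.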

Exercise 2: Same means, different result?

Two experiments both show a treatment mean of $16.84 and a control mean of $15.62. In one, spending is tightly clustered; in the other, it’s wildly spread out. Would the t-test treat them the same?

Hint

No. The t statistic is (difference) / SE, and the SE grows with the variance. The tightly-clustered experiment has a small SE, so the same $1.22 gap is many standard errors out — likely significant. The wildly-spread one has a large SE, so the same gap is only a fraction of a standard error out — likely not significant. Identical averages, opposite conclusions, because the spread decides how much to trust the gap.
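The answer can be checked numerically with scipy's ttest_ind_from_stats, which runs the test from summary statistics. The exercise does not give spreads or group sizes, so this sketch assumes Lumen's group sizes and two hypothetical standard deviations ($5 and $60, applied to both groups).

```python
from scipy import stats

# Means from the exercise; group sizes borrowed from Lumen's experiment (an assumption).
mean_c, n_c = 15.62, 6000
mean_t, n_t = 16.84, 1500

results = {}
for label, sd in [("tightly clustered", 5.0), ("wildly spread", 60.0)]:  # hypothetical sds
    results[label] = stats.ttest_ind_from_stats(
        mean_t, sd, n_t,      # treatment: mean, sd, size
        mean_c, sd, n_c,      # control: mean, sd, size
        equal_var=False,      # Welch's t-test, as in the lesson
    )
    r = results[label]
    print(f"{label:<18} sd ${sd:.0f}   t = {r.statistic:.2f}   p = {r.pvalue:.4f}")
```

Under these assumed spreads, the same $1.22 gap is clearly significant in the tight case and nowhere near significant in the wide one.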

Exercise 3: Read the borderline

Lumen’s test gave p = 0.0566. A teammate says “that’s basically 0.05, let’s call it significant and ship.” What’s wrong with that, and what should you do instead?

Hint

The threshold was fixed at 0.05 before the test (Module 3), and 0.0566 is above it — nudging it to fit the result is exactly the kind of after-the-fact reasoning that manufactures false positives. Instead of moving the goalposts, investigate why it’s borderline: check whether the test handled the unequal variances correctly (Lesson 2), and whether the skewed distribution makes the mean misleading in the first place (Lesson 3). The right move is more scrutiny, not a rounded-down p-value.

Summary

Mean metrics (revenue, time, order value) are averages of continuous quantities, so the two-proportion z-test doesn’t apply; you need the two-sample t-test. It follows the same effect-over-noise logic: the effect is the difference in means (mean₂ − mean₁), the noise is the standard error built from each group’s variance and size (√(s₁²/n₁ + s₂²/n₂)), and the t statistic is their ratio. Run on Lumen’s revenue data with scipy, the test found a $1.22 average lift but t = 1.908, p = 0.0566, which is just above 0.05 and therefore not significant. Crucially, the average rose while the test withheld judgment, and two red flags in the data (a treatment standard deviation nearly double the control’s, and heavy skew) mean this borderline result needs the deeper analysis the rest of the module provides.

Key Concepts

Mean metrics need the t-test — proportions use p(1−p); means use the data’s own variance.
Effect over noise — t = (mean₂ − mean₁) / √(s₁²/n₁ + s₂²/n₂).
Spread decides trust — the same difference in means can be significant or not depending on variance.
Borderline p (0.0566) — above 0.05 is not significant; investigate, don’t round down.

Why This Matters

Revenue, engagement time, and order value are the metrics executives actually care about, and they’re all means — so the t-test is as essential to A/B testing as the proportion test. But mean metrics are treacherous in ways rates aren’t: they’re sensitive to variance and badly distorted by skew, and a naive t-test on borderline, skewed data is one of the most common ways real experiments reach a wrong, expensive conclusion. Learning to run the test and to distrust a too-convenient average is what separates a reliable analysis from a misleading one. Next, you’ll see how the choice of t-test — Welch versus Student — flips Lumen’s borderline result, and why one of them is almost always the right call.

Next Steps

Continue to Lesson 2 - Welch's T-Test and Unequal Variances

Why Welch's t-test is the safe default — and how it overturns a false 'significant' on Lumen's revenue data.

Back to Module Overview

Return to the Analyzing Mean Metrics module overview

Continue Building Your Skills

You built the two-sample t-test (difference in means over a standard error that depends on spread) and ran it on Lumen’s revenue to find a $1.22 lift that lands at a borderline p = 0.0566. Two warning signs in the data (a treatment standard deviation nearly double the control’s, and heavy skew) tell you this result needs a closer look. Next you’ll start with the first: how Welch’s t-test handles the unequal variances, and why it’s the version you should almost always use.
