Lesson 2 - Welch's T-Test and Unequal Variances

Welcome to Welch’s T-Test

Last lesson ended on a cliffhanger: Lumen’s revenue lift landed at p = 0.0566, just above 0.05 and therefore not significant. But that number depended on a choice we made almost silently: equal_var=False. Change that one argument and the p-value changes with it. In fact, scipy’s default (equal_var=True) would have returned p = 0.0063 on the very same data: comfortably significant, “ship it.” Both come from a two-sample t-test. Only one is correct for this data. This lesson is about which one, and why. The answer hinges on a single fact about Lumen’s data: the two groups don’t have the same spread, and when the smaller group is also the noisier one, the wrong t-test lies to you.

By the end of this lesson, you will be able to:

  • Distinguish Student’s (pooled) t-test from Welch’s (unpooled) t-test
  • Explain why unequal variances make the pooled version anti-conservative
  • Run both versions in scipy and see them disagree on Lumen’s data
  • Choose Welch (equal_var=False) as your default and justify it

Let’s start with what “pooling” actually does.


Two Versions of the Same Test

The two-sample t-test needs an estimate of the noise in the difference of means. There are two ways to build it:

  • Student’s t-test assumes both groups have the same underlying variance. It pools them — combines both groups’ spread into one shared variance estimate — and uses that for both sides of the difference. scipy: ttest_ind(..., equal_var=True).
  • Welch’s t-test makes no such assumption. It lets each group keep its own variance, plugging s₁²/n₁ and s₂²/n₂ in separately, and adjusts the degrees of freedom with the Welch-Satterthwaite formula to match. scipy: ttest_ind(..., equal_var=False).
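
To make the two recipes concrete, here is a minimal sketch that computes both by hand and checks the result against scipy. The helper name two_sample_t and the synthetic data are assumptions made for illustration; they are not part of the Lumen analysis.

```python
import numpy as np
from scipy import stats


def two_sample_t(a, b, pooled):
    """Return (t, df) for a two-sample t-test, computed by hand."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    v1, v2 = a.var(ddof=1), b.var(ddof=1)   # sample variances s1^2, s2^2
    diff = a.mean() - b.mean()

    if pooled:
        # Student: one shared variance, weighted by each group's degrees of freedom
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        # Welch: each group keeps its own variance
        se = np.sqrt(v1 / n1 + v2 / n2)
        # Welch-Satterthwaite approximation to the degrees of freedom
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    return diff / se, df


# Illustrative synthetic data (not Lumen's): small noisy group vs. large calm group
rng = np.random.default_rng(0)
small_noisy = rng.normal(10.0, 4.0, 200)
big_calm = rng.normal(10.0, 1.0, 2000)

for pooled in (True, False):
    t_hand, df_hand = two_sample_t(small_noisy, big_calm, pooled)
    p_hand = 2 * stats.t.sf(abs(t_hand), df_hand)   # two-sided p-value
    res = stats.ttest_ind(small_noisy, big_calm, equal_var=pooled)
    print(f"equal_var={pooled}: t={t_hand:.4f} df={df_hand:.1f} p={p_hand:.4f} "
          f"(scipy t={res.statistic:.4f} p={res.pvalue:.4f})")
```

The hand-computed t and p should match scipy for both settings of equal_var. The only differences between the two branches are the standard error and the degrees of freedom.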

When the two groups really do have equal spread, pooling is a small efficiency win and the two tests nearly agree. The trouble starts when the spreads differ — and differ in a particular, dangerous way.

Look at Lumen’s data. Control is a big group (n = 6,000) with a tight spread (sd $12.42). Treatment is a small group (n = 1,500) with a wide spread (sd $23.89). Pooling averages the two variances weighted by sample size (strictly, by n − 1), so the huge, calm control group dominates the pooled estimate and the small, noisy treatment group’s real variability gets drowned out. The noise in the difference of means is driven mostly by the small group’s own term, s²/n, and that is exactly where the pooled variance is too small. A too-small noise estimate means the t statistic comes out too big, and the p-value too small. Student’s test over-claims significance: it’s anti-conservative exactly when the smaller group is the more variable one.

[Figure: Student’s pooled t-test (p = 0.0063, “significant, ship it,” marked wrong) versus Welch’s t-test (p = 0.0566, “not significant, don’t ship,” marked correct) on Lumen’s control (n = 6,000, sd $12.42) and treatment (n = 1,500, sd $23.89) groups.]
When the smaller group is the noisier one, pooling lets the big tight group mask the small noisy one — Student over-claims significance while Welch reports the honest result.

Welch avoids this by never pooling: the treatment group’s $23.89 spread stays attached to the treatment group, where it belongs.
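
The size of the effect can be checked with plain arithmetic. This sketch uses only the sample sizes and standard deviations quoted above; the variable names are illustrative, and because the inputs are rounded, the results are approximate.

```python
import math

# Summary statistics quoted in the lesson (rounded)
n_c, sd_c = 6000, 12.42   # control: large, tight
n_t, sd_t = 1500, 23.89   # treatment: small, wide

# Student: pooled variance, weighted by degrees of freedom (n - 1)
pooled_var = ((n_c - 1) * sd_c**2 + (n_t - 1) * sd_t**2) / (n_c + n_t - 2)
se_pooled = math.sqrt(pooled_var * (1 / n_c + 1 / n_t))

# Welch: each group's variance divided by its own sample size
se_welch = math.sqrt(sd_c**2 / n_c + sd_t**2 / n_t)

print(f"pooled sd       = {math.sqrt(pooled_var):.2f}")
print(f"SE (Student)    = {se_pooled:.3f}")
print(f"SE (Welch)      = {se_welch:.3f}")
print(f"Welch / Student = {se_welch / se_pooled:.2f}")
```

The pooled standard error comes out around $0.44 and the unpooled one around $0.64, a ratio of roughly 1.43. For the same difference in means, that makes Student’s t statistic about 43% larger than Welch’s.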


Watching the Two Tests Disagree

Enough theory — run both on the exact same revenue data from Lesson 1 (seed 3) and watch them split:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rev_c = rng.lognormal(2.50, 0.70, 6000)   # control:   n=6000, sd ~$12.42
rev_t = rng.lognormal(2.28, 1.05, 1500)   # treatment: n=1500, sd ~$23.89

welch   = stats.ttest_ind(rev_t, rev_c, equal_var=False)
student = stats.ttest_ind(rev_t, rev_c, equal_var=True)
print(f"Welch   t = {welch.statistic:.3f}  p = {welch.pvalue:.4f}")
print(f"Student t = {student.statistic:.3f}  p = {student.pvalue:.4f}")

Running it:

Welch   t = 1.908  p = 0.0566
Student t = 2.734  p = 0.0063

Same data, same $1.22 lift, two completely different verdicts. Student’s t-test says p = 0.0063, well under 0.05, so “significant, ship the revenue win!” Welch’s t-test says p = 0.0566, above 0.05, so “not significant, don’t ship.” One of these is wrong, and it’s Student’s: its significance is inflated by a pooling assumption the data plainly violates. The treatment standard deviation is nearly double the control’s (so its variance is roughly 3.7 times larger); pooling them into one number understates the real noise in the difference, pushing t from Welch’s honest 1.908 up to a misleading 2.734. Had you trusted the default, you’d ship a revenue lift that isn’t statistically supported, and later wonder why it never showed up in the quarterly numbers.
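
One way to see that Student is the one over-claiming is to simulate experiments where there is truly no difference and count how often each test cries “significant.” The sketch below is an illustration under stated assumptions, not a re-analysis of Lumen’s data: it uses normal data rather than skewed revenue, scaled-down group sizes (600 vs. 150), standard deviations of 1 and 2, 2,000 trials, and an arbitrary seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
trials = 2000
n_big, n_small = 600, 150          # same 4:1 ratio as Lumen, scaled down
sd_big, sd_small = 1.0, 2.0        # the smaller group is the noisier one

# Both groups share the same true mean, so every "significant" result
# is a false positive.
big = rng.normal(0.0, sd_big, size=(trials, n_big))
small = rng.normal(0.0, sd_small, size=(trials, n_small))

# One test per row (axis=1), i.e. one test per simulated experiment
student = stats.ttest_ind(small, big, axis=1, equal_var=True)
welch = stats.ttest_ind(small, big, axis=1, equal_var=False)

fpr_student = np.mean(student.pvalue < 0.05)
fpr_welch = np.mean(welch.pvalue < 0.05)
print(f"Student false-positive rate: {fpr_student:.3f}")
print(f"Welch   false-positive rate: {fpr_welch:.3f}")
```

Under these assumptions Student’s false-positive rate should land well above the nominal 5%, while Welch’s stays close to it. That gap is what “anti-conservative” means in practice.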

scipy’s default is the trap

stats.ttest_ind defaults to equal_var=True, which is Student’s pooled test. Some other stats tools do the same, though not all (R’s t.test, for example, defaults to Welch). That means the convenient scipy call, the one you get by not thinking about it, is the one that over-claims on unequal-variance data. You have to actively pass equal_var=False to get Welch. This is why Lesson 1 quietly used equal_var=False from the start: in scipy the safe version is not the default, so you must ask for it every time.
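
If remembering the argument every time feels fragile, one option is to bind it once. This is a hypothetical convenience helper, not something scipy provides; the name welch_ttest and the sample data are assumptions for illustration.

```python
from functools import partial

import numpy as np
from scipy import stats

# Hypothetical project-level helper: Welch unless the caller explicitly overrides it
welch_ttest = partial(stats.ttest_ind, equal_var=False)

rng = np.random.default_rng(1)
a = rng.normal(0.0, 3.0, 100)    # illustrative data only
b = rng.normal(0.0, 1.0, 1000)

result = welch_ttest(a, b)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

A caller can still pass equal_var=True explicitly, so the helper changes the default without removing the choice.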

One clarification that trips people up: the two tests don’t always diverge. With equal sample sizes the t-statistic is identical for both — only the degrees of freedom differ, so the p-values barely move. The sharp disagreement you see here needs unequal n and unequal variance together, which is exactly Lumen’s situation: a small, noisy treatment against a large, calm control.
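
The equal-n case is easy to check. The sketch below uses illustrative synthetic data (not Lumen’s) with 500 observations per group and a threefold difference in standard deviation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Illustrative data: equal sample sizes, very different spreads
x = rng.normal(0.3, 1.0, 500)
y = rng.normal(0.0, 3.0, 500)

student = stats.ttest_ind(x, y, equal_var=True)
welch = stats.ttest_ind(x, y, equal_var=False)

print(f"Student t = {student.statistic:.6f}  p = {student.pvalue:.4f}")
print(f"Welch   t = {welch.statistic:.6f}  p = {welch.pvalue:.4f}")
```

With n₁ = n₂ the pooled and unpooled standard errors are algebraically the same, so the two t statistics agree up to floating-point error. The p-values differ only through the degrees of freedom.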


Practice Exercises

Exercise 1: What does “pooling” pool?

Student’s t-test “pools” the variances. In plain terms, what does that mean, and why does it require an assumption Welch’s test doesn’t?

Hint

Pooling combines both groups’ spread into a single shared variance estimate — one number used for both sides of the difference. That only makes sense if both groups truly have the same underlying variance, so Student’s test assumes equal variances. Welch’s test makes no such assumption: it keeps each group’s variance separate (s₁²/n₁ + s₂²/n₂) and adjusts the degrees of freedom to match, so it works whether the spreads are equal or not.

Exercise 2: Why does Student over-claim here?

On Lumen’s data the smaller group (treatment, n = 1,500) has the larger variance (sd $23.89). Explain step by step why that specific combination makes Student’s p-value too small.

Hint

Pooling weights each group’s variance by its sample size, so the big, low-variance control group (n = 6,000) dominates the pooled estimate and the small, high-variance treatment group gets drowned out. The pooled variance ends up too small — it understates the real noise in the difference. A smaller noise estimate makes the t statistic larger (2.734 instead of 1.908) and the p-value smaller (0.0063 instead of 0.0566). The result looks significant when it isn’t — Student is anti-conservative exactly when the smaller group is the noisier one.
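
The same steps can be traced numerically from the summary statistics alone, using scipy’s ttest_ind_from_stats. Only the difference in means matters, so this sketch sets the control mean to 0 and the treatment mean to the $1.22 lift; that placement is an assumption for illustration. Because the inputs are rounded, the output will be close to, but not exactly, the values from the raw data.

```python
from scipy import stats

# Summary statistics quoted in the lesson (rounded)
lift = 1.22                 # treatment mean minus control mean
n_t, sd_t = 1500, 23.89     # treatment: small, noisy
n_c, sd_c = 6000, 12.42     # control: large, calm

student = stats.ttest_ind_from_stats(lift, sd_t, n_t, 0.0, sd_c, n_c, equal_var=True)
welch = stats.ttest_ind_from_stats(lift, sd_t, n_t, 0.0, sd_c, n_c, equal_var=False)

print(f"Student t = {student.statistic:.2f}  p = {student.pvalue:.4f}")
print(f"Welch   t = {welch.statistic:.2f}  p = {welch.pvalue:.4f}")
```

Student’s t should come out near 2.74 with p below 0.01, and Welch’s near 1.91 with p just above 0.05, in line with the results on the raw data.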

Exercise 3: When would they agree?

A teammate says “Welch and Student always give different answers, so why not just pick the one I like?” When do the two tests actually coincide, and why does that undercut the argument?

Hint

They coincide when the assumption Student relies on holds — equal variances — and also, for the t-statistic itself, when the two groups have equal sample sizes (then only the degrees of freedom differ and the p-values barely move). So they don’t “always” disagree; they disagree precisely when Student’s assumption is violated, which is when Student is wrong. You don’t pick the one you like — you pick Welch, because it’s right in both cases: it nearly matches Student when variances are equal and protects you when they’re not.


Summary

The two-sample t-test comes in two forms. Student’s (equal_var=True) assumes both groups share one variance and pools them into a single estimate; Welch’s (equal_var=False) lets each group keep its own variance and adjusts the degrees of freedom with the Welch-Satterthwaite formula. The difference is decisive when — as with Lumen — the smaller group is also the more variable one: pooling lets the big, calm control mask the small, noisy treatment, understating the real noise, inflating t, and shrinking the p-value. On the identical seed-3 data, Student returned t = 2.734, p = 0.0063 (“ship it”) while Welch returned t = 1.908, p = 0.0566 (“don’t”). Welch is the correct verdict; Student over-claimed. The rule is simple: use Welch by default. It costs almost nothing when variances are truly equal and protects you when they aren’t — and since scipy (like many packages) defaults to Student, you must pass equal_var=False yourself.

Key Concepts

  • Student pools, Welch doesn’t: Student assumes one shared variance; Welch keeps each group’s own.
  • Unequal n + unequal variance is the danger zone: with equal sample sizes the t-statistics coincide; the tests diverge sharply only when both differ.
  • Small-and-noisy breaks pooling: when the smaller group has the larger variance, the pooled estimate is too small, so Student’s p is too small (anti-conservative).
  • Welch is the safe default: pass equal_var=False; it nearly matches Student when variances are equal and protects you when they aren’t.

Why This Matters

The gap between “ship a revenue win” and “don’t ship” came down to one keyword argument — and the default value points the wrong way. Real A/B tests are full of unequal variances: a new feature that helps some users a lot and others not at all, a pricing change that widens the spread of order values, a treatment rolled out to a smaller holdout. In all of these the pooled t-test can manufacture significance that isn’t there, and because it’s the default, it does so silently. Making Welch your reflex — and knowing exactly when it matters — is one of the cheapest, highest-leverage habits in experiment analysis. But Welch fixes the variance problem; it still compares means. Lumen’s revenue is heavily skewed, and the next lesson asks whether the mean is even the right thing to test.


Next Steps

Continue to Lesson 3 - Skewed Metrics and the Mann-Whitney Test

Welch fixed the variance problem, but the metric is still skewed — when the mean itself misleads, reach for a rank-based test.


Continue Building Your Skills

You saw the two-sample t-test split into two versions — Student’s pooled test and Welch’s unpooled one — and watched them reach opposite decisions on Lumen’s revenue data because the smaller treatment group carried the larger variance. Student’s default over-claimed at p = 0.0063; Welch reported the honest p = 0.0566. The takeaway is a habit: pass equal_var=False every time. Next you’ll confront the second red flag from Lesson 1 — the heavy skew — and ask whether comparing means was ever the right move at all.