Lesson 4 - Confidence Intervals for the Difference in Means

Welcome to Confidence Intervals for Means

Back in Module 4 you learned that a p-value is a thin answer: it tells you whether an effect is bigger than noise, but not how big the effect plausibly is. For proportions, the confidence interval fixed that — it gave you a range of plausible lifts in percentage points. Mean metrics deserve the same treatment, only now the range is in dollars. Lumen’s t-test found a $1.22 average revenue lift at a borderline p = 0.0566. That number, on its own, doesn’t tell a product manager whether to ship. A confidence interval will: it converts the borderline result into a plain-English range of plausible dollar effects — and, as you’ll see, that range is exactly what makes “don’t ship yet” the honest call.

By the end of this lesson, you will be able to:

  • Build a Welch 95% confidence interval for a difference in means, in dollars
  • Compute the Welch-Satterthwaite degrees of freedom and the t-quantile it feeds
  • Cross-check your manual interval against scipy’s built-in confidence_interval()
  • Explain why an interval that includes $0 is the same verdict as a non-significant t-test

Let’s turn the borderline p-value into dollars.


Building the Welch Confidence Interval

The recipe is the same shape as every confidence interval you’ve built: a point estimate, plus and minus a margin of error. The point estimate is the difference in means, diff = mean_t − mean_c. The margin is a critical value times the standard error:

  • The standard error is the same one the t-test used: SE = √(s_c²/n_c + s_t²/n_t).
  • The critical value t* is the t-quantile at 97.5% (for a 95%, two-sided interval) — but at the Welch-Satterthwaite degrees of freedom, not n_c + n_t − 2. Because the two groups have unequal variances, the df is a fractional number that accounts for that imbalance (the same machinery from Lesson 2).
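Written out in symbols, the two ingredients above combine as follows. This is a sketch in standard notation: the bars for sample means and the symbol ν for the Welch-Satterthwaite degrees of freedom are typesetting choices made here, not notation the lesson itself uses.

```latex
\[
\text{diff} = \bar{x}_t - \bar{x}_c, \qquad
SE = \sqrt{\frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}}
\]
\[
\nu = \frac{\left(\dfrac{s_c^2}{n_c} + \dfrac{s_t^2}{n_t}\right)^2}
           {\dfrac{(s_c^2/n_c)^2}{n_c - 1} + \dfrac{(s_t^2/n_t)^2}{n_t - 1}}, \qquad
t^* = t_{0.975,\,\nu}
\]
\[
\text{95\% CI} = \text{diff} \pm t^* \cdot SE
\]
```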

Put together, the 95% interval is diff ± t* · SE. Here it is on Lumen’s revenue data:

import numpy as np
from scipy import stats

# Simulated revenue per user: right-skewed (lognormal), unequal sizes and spreads
rng = np.random.default_rng(3)
rev_c = rng.lognormal(2.50, 0.70, 6000)   # control
rev_t = rng.lognormal(2.28, 1.05, 1500)   # treatment

mc, mt = rev_c.mean(), rev_t.mean()
vc, vt = rev_c.var(ddof=1), rev_t.var(ddof=1)   # sample variances (n - 1 denominator)
nc, nt = rev_c.size, rev_t.size
se = np.sqrt(vc/nc + vt/nt)                     # unpooled (Welch) standard error
df = (vc/nc + vt/nt)**2 / ((vc/nc)**2/(nc-1) + (vt/nt)**2/(nt-1))   # Welch-Satterthwaite
tcrit = stats.t.ppf(0.975, df)                  # critical value for a two-sided 95% interval
diff = mt - mc                                  # treatment minus control
print(f"difference ${diff:.2f}  SE ${se:.3f}  df {df:.0f}  t* {tcrit:.3f}")
print(f"95% CI: [${diff - tcrit*se:.2f}, ${diff + tcrit*se:.2f}]")

# scipy computes the same CI directly (confidence_interval() needs a recent SciPy release):
ci = stats.ttest_ind(rev_t, rev_c, equal_var=False).confidence_interval()
print(f"scipy CI: [${ci.low:.2f}, ${ci.high:.2f}]")

Running it:

difference $1.22  SE $0.637  df 1706  t* 1.961
95% CI: [$-0.03, $2.47]
scipy CI: [$-0.03, $2.47]

The manual interval and scipy’s built-in ttest_ind(...).confidence_interval() agree to the penny — a reassuring cross-check that the formula and the library are computing the same thing, so you can trust either.
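If you expect to build this interval more than once, the same arithmetic can be wrapped in a small function. The sketch below is one possible packaging, not part of the lesson's code: the name welch_ci, its confidence parameter, and the seed-0 data are all assumptions made for illustration, so its numbers will not match Lumen's.

```python
import numpy as np
from scipy import stats


def welch_ci(treatment, control, confidence=0.95):
    """Welch confidence interval for mean(treatment) - mean(control).

    Hypothetical helper for illustration; returns (diff, low, high).
    """
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    a_t = t.var(ddof=1) / t.size   # s_t^2 / n_t
    a_c = c.var(ddof=1) / c.size   # s_c^2 / n_c
    se = np.sqrt(a_t + a_c)
    # Welch-Satterthwaite degrees of freedom
    df = (a_t + a_c) ** 2 / (a_t ** 2 / (t.size - 1) + a_c ** 2 / (c.size - 1))
    # Two-sided interval: (1 - confidence) / 2 of the probability in each tail
    tcrit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    diff = t.mean() - c.mean()
    return diff, diff - tcrit * se, diff + tcrit * se


# Made-up data (not Lumen's): same lognormal shapes as the lesson, different seed
rng = np.random.default_rng(0)
control = rng.lognormal(2.50, 0.70, 6000)
treatment = rng.lognormal(2.28, 1.05, 1500)

for level in (0.90, 0.95, 0.99):
    diff, low, high = welch_ci(treatment, control, confidence=level)
    print(f"{level:.0%} CI: [{low:.2f}, {high:.2f}]  (diff {diff:.2f})")
```

Raising the confidence level widens the interval around the same point estimate. The 95% level is the one that pairs with a two-sided test at α = 0.05.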

Figure: A number line in dollars with $0 marked “no effect.” The 95% Welch confidence interval bar spans -$0.03 to +$2.47, with the observed +$1.22 point estimate marked inside it. The interval barely includes $0, which is the same verdict as the non-significant Welch p = 0.0566: a positive point estimate that still can’t rule out no effect.

Notice the df: 1706, not n_c + n_t − 2 = 7498. The Welch-Satterthwaite formula gives a fractional df (printed here rounded to a whole number). The treatment group is both smaller and more variable, so its s_t²/n_t term dominates the formula and pulls the df toward n_t − 1 = 1499. This is the Welch correction for the unequal spreads you diagnosed in Lesson 2.
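To see how the df responds to imbalance, it helps to try the formula on a few inputs. The sketch below uses only plain Python. The function name welch_df and every variance in it are illustrative assumptions, not Lumen's sample values. A useful fact to check against: the Welch df always lies between the smaller group's n − 1 and the pooled n_c + n_t − 2.

```python
def welch_df(var_a, n_a, var_b, n_b):
    """Welch-Satterthwaite degrees of freedom from two sample variances and sizes."""
    a = var_a / n_a
    b = var_b / n_b
    return (a + b) ** 2 / (a ** 2 / (n_a - 1) + b ** 2 / (n_b - 1))


n_c, n_t = 6000, 1500  # group sizes from the lesson

# Illustrative variances (assumptions, not Lumen's actual sample values)
balanced = welch_df(100.0, 1500, 100.0, 1500)   # equal sizes, equal spreads
same_var = welch_df(100.0, n_c, 100.0, n_t)     # lesson sizes, equal spreads
noisy_t = welch_df(150.0, n_c, 580.0, n_t)      # lesson sizes, treatment about 4x noisier

print(f"equal n, equal variance:  df = {balanced:.1f}")
print(f"lesson n, equal variance: df = {same_var:.1f}")
print(f"lesson n, noisy treatment: df = {noisy_t:.1f}")
print(f"bounds for lesson n: {n_t - 1} to {n_c + n_t - 2}")
```

With equal sizes and equal variances the formula reproduces the pooled df exactly. Unequal sizes alone already lower it, and a noisier small group lowers it further toward that group's n − 1.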


Reading the Interval

The interval is [-$0.03, +$2.47]. Read it plainly: given this data, the true mean revenue lift is plausibly anywhere from about 3 cents negative to $2.47 positive. Because that range includes $0 (barely), a true effect of exactly zero — or even a tiny loss — is still on the table. The point estimate of +$1.22 sits inside the interval and looks encouraging, but the interval is honest about the uncertainty around it: you cannot rule out “no lift.”

This is not a coincidence, and it’s not a contradiction with the t-test — it’s the duality between the two-sided test and the confidence interval, exactly as you saw for proportions in Module 4. A 95% interval includes $0 if and only if the two-sided test is non-significant at the 5% level. Lumen’s Welch p was 0.0566 (above 0.05, not significant), so the 95% interval must include $0. The p-value and the interval are two views of the same evidence; here they agree that the lift, if real, is too small and too uncertain to claim.

The p-value and the interval are the same verdict

A 95% confidence interval and a two-sided test at α = 0.05 are built from the same standard error and the same critical value, so they always agree: the interval excludes $0 exactly when p < 0.05, and includes $0 exactly when p ≥ 0.05. This is why you never need to “reconcile” a borderline p-value with an interval that includes zero — they’re the same statement told two ways. The interval just says more: instead of a bare “not significant,” it tells you the plausible effect ranges from -$0.03 to +$2.47, which is what a decision-maker actually needs to weigh the risk of shipping.
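The duality is easy to check empirically. The sketch below is an illustration on made-up normal data, not Lumen's revenue: it runs scipy's Welch test, builds the matching 95% interval by hand, and compares the two verdicts. The function name welch_p_and_ci, the seed, and the group parameters are all assumptions chosen for the demonstration.

```python
import numpy as np
from scipy import stats


def welch_p_and_ci(treatment, control, alpha=0.05):
    """Return (p_value, low, high) from a Welch test and its matching interval."""
    a_t = treatment.var(ddof=1) / treatment.size
    a_c = control.var(ddof=1) / control.size
    se = np.sqrt(a_t + a_c)
    df = (a_t + a_c) ** 2 / (
        a_t ** 2 / (treatment.size - 1) + a_c ** 2 / (control.size - 1)
    )
    diff = treatment.mean() - control.mean()
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    p_value = stats.ttest_ind(treatment, control, equal_var=False).pvalue
    return p_value, diff - tcrit * se, diff + tcrit * se


rng = np.random.default_rng(42)  # arbitrary seed; the data below is made up
results = []
for true_shift in (0.0, 0.1, 0.3, 0.6):
    control = rng.normal(10.0, 3.0, 400)
    treatment = rng.normal(10.0 + true_shift, 5.0, 150)
    p, low, high = welch_p_and_ci(treatment, control)
    significant = bool(p < 0.05)
    excludes_zero = bool(low > 0 or high < 0)
    results.append((significant, excludes_zero))
    print(f"shift {true_shift:.1f}: p = {p:.4f}, CI = [{low:.2f}, {high:.2f}], "
          f"significant = {significant}, excludes 0 = {excludes_zero}")
```

In every row the two boolean columns should match, whatever the p-value turns out to be, because both come from the same standard error and the same critical value.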

For the decision, this is decisive in the quiet way real analysis usually is. The interval can’t rule out no effect, so the evidence for a revenue lift falls short of the 5% standard: a positive point estimate is not the same as a proven one. Combined with Lesson 3’s finding that the typical user actually spends less under the new page, the revenue case for shipping is weak on both counts: the mean lift is unproven, and the median moved the wrong way. The interval doesn’t tell you to ship; it tells you not to, yet.


Practice Exercises

Exercise 1: Why not just report the p-value?

Lumen’s t-test already gave p = 0.0566. What does the confidence interval [-$0.03, +$2.47] add that the p-value alone doesn’t?

Hint

The p-value only answers “is the effect bigger than noise?” — a yes/no, and here the answer is “no.” The interval answers “how big is the effect, plausibly?” in the units the business cares about: dollars. It says the true mean lift is somewhere between -$0.03 and +$2.47. That range lets a product manager weigh the upside (up to ~$2.47 per user) against the real possibility of no effect at all — a decision the bare p-value can’t inform.

Exercise 2: Predict the p-value from the interval

A teammate tells you that a 95% CI for a different metric’s difference in means is [+$0.40, +$3.10]. Without re-running the test, can you say whether that difference is significant at the 5% level, and how do you know?

Hint

Yes, it’s significant. By the p-value↔CI duality, a 95% interval excludes $0 exactly when the two-sided test is significant at 5%. This interval runs from +$0.40 to +$3.10 — entirely above $0 — so $0 is not a plausible value, which means p < 0.05. You can read significance straight off whether the interval contains zero, no p-value needed.
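You can go one step further and recover a rough p-value from the interval alone. The sketch below assumes the interval is symmetric around the point estimate and that the df is large enough for the normal quantile to stand in for the t quantile. Your teammate did not report the df, so the result is an approximation, not the exact Welch p. The function names are hypothetical.

```python
from statistics import NormalDist


def approx_p_from_ci(low, high, confidence=0.95):
    """Approximate two-sided p-value for H0: difference = 0, given a CI.

    Assumes a symmetric interval and a large df, so the normal quantile
    stands in for the t quantile (an approximation, not the exact Welch p).
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    estimate = (low + high) / 2          # midpoint = point estimate
    se = (high - low) / (2 * z_crit)     # half-width = z_crit * SE
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def excludes_zero(low, high):
    """True when the whole interval sits on one side of zero."""
    return low > 0 or high < 0


print(excludes_zero(0.40, 3.10))                # teammate's interval
print(excludes_zero(-0.03, 2.47))               # Lumen's interval
print(round(approx_p_from_ci(0.40, 3.10), 3))   # rough p for the teammate's metric
```

For the teammate's interval the approximate p lands near 0.01, comfortably below 0.05. For Lumen's interval the same calculation lands just above 0.05, in line with the reported 0.0566.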

Exercise 3: Why isn’t df equal to n_c + n_t − 2?

The code prints df = 1706, but the two groups together have 7,500 users, so n_c + n_t − 2 = 7498. Why is the Welch df so much smaller?

Hint

n_c + n_t − 2 is the pooled (Student) degrees of freedom, which assumes both groups share one variance. Welch doesn’t assume that. The treatment group is both smaller and far more variable than the control, so its s_t²/n_t term dominates the Welch-Satterthwaite formula, pulling the df toward that group’s n_t − 1 = 1499 and producing a fractional value (here about 1706). That lower df gives a slightly larger critical value (t* = 1.961 instead of ~1.960), a small penalty for the unequal spreads: the same Welch correction from Lesson 2, now shaping the interval’s width.
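To put a size on that penalty, compare the critical values directly. This sketch plugs in the rounded df and SE printed earlier in the lesson and holds the SE fixed, which is a simplifying assumption: a pooled analysis would also compute a different standard error.

```python
from scipy import stats

df_welch = 1706     # rounded Welch-Satterthwaite df printed in the lesson
df_pooled = 7498    # n_c + n_t - 2
se = 0.637          # rounded standard error printed in the lesson

t_welch = stats.t.ppf(0.975, df_welch)
t_pooled = stats.t.ppf(0.975, df_pooled)
z = stats.norm.ppf(0.975)   # limiting value as df grows without bound

print(f"t* at df {df_welch}: {t_welch:.4f}")
print(f"t* at df {df_pooled}: {t_pooled:.4f}")
print(f"normal quantile: {z:.4f}")
# Extra margin of error from using the Welch df, holding the SE fixed
print(f"extra margin: ${(t_welch - t_pooled) * se:.4f}")
```

At these sample sizes the difference in critical values costs a fraction of a cent of margin. The Welch correction matters here mainly through the unpooled standard error, and the df adjustment becomes more visible when the groups are small.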


Summary

A confidence interval turns a bare p-value into a range of plausible effects in real units. For a difference in means, the 95% Welch interval is diff ± t* · SE, where SE = √(s_c²/n_c + s_t²/n_t) and t* is the t-quantile at the Welch-Satterthwaite degrees of freedom — a fractional df (1706 here, not 7498) that accounts for the unequal variances. On Lumen’s revenue data the interval is [-$0.03, +$2.47], matching scipy’s built-in confidence_interval() to the penny. Because that range includes $0, a true lift of zero is still plausible — which is exactly the same verdict as the non-significant p = 0.0566, thanks to the duality between the two-sided test and the interval. The point estimate (+$1.22) looks positive, but the interval can’t rule out no effect, so Lumen has no evidence of a revenue lift.

Key Concepts

  • CI for a difference in means — diff ± t* · SE, reported in dollars, not a yes/no.
  • Welch-Satterthwaite df — a fractional df from the unequal variances (1706, not n_c + n_t − 2).
  • p-value↔CI duality — a 95% interval includes $0 exactly when the two-sided test is non-significant at 5%.
  • Includes $0 → no evidence — a positive point estimate inside a zero-crossing interval is not a proven lift.

Why This Matters

Executives don’t ship on p-values; they ship on effect sizes. A confidence interval is the tool that reports a mean effect in the currency the business speaks — dollars per user — and, just as importantly, reports the uncertainty around it. Lumen’s interval barely crossing $0 is the difference between “we found a $1.22 lift” and the honest “we can’t rule out no lift at all,” and shipping on the former when the truth is the latter is how teams roll out changes that quietly earn nothing. Reading the interval, not just the p-value, is what keeps a borderline result from becoming a costly mistake. Next, you’ll put every piece of this module together on a full revenue experiment from scratch.


Next Steps

Continue to Lesson 5 - Guided Project: Analyze Lumen's Revenue Experiment

Put the t-test, Welch's correction, skew diagnostics, and the confidence interval together on a complete revenue experiment.


Continue Building Your Skills

You built the Welch 95% confidence interval for Lumen’s revenue lift — diff ± t* · SE at the Welch-Satterthwaite degrees of freedom — and landed at [-$0.03, +$2.47], an interval that barely includes $0 and therefore agrees exactly with the borderline p = 0.0566. The point estimate looks positive, but the interval can’t rule out no effect, so the revenue case for shipping stays weak. Next you’ll bring the whole module together — t-test, Welch’s correction, skew, and this interval — on a complete revenue experiment, start to finish.