Lesson 3 - Confidence Intervals for the Difference

Welcome to Confidence Intervals for the Difference

The z-test told Lumen the new signup page beats the old one — p = 0.00048, not noise. But “not noise” isn’t a plan. Before anyone ships, product wants to know how much better: is this a rounding-error win worth skipping, or a real lift worth the migration cost? A p-value can’t answer that. It’s a yes/no verdict that throws away the one number the decision actually hinges on — the size of the effect. The confidence interval puts that number back. Instead of “the effect is real,” it says “the true lift is plausibly somewhere between here and here.” In this lesson you’ll build the 95% CI for the difference in two proportions, run it on Lumen’s data, and see how it lines up exactly with the p-value from Lesson 1.

By the end of this lesson, you will be able to:

  • Build a 95% confidence interval for the difference in two proportions
  • Explain why the CI uses the unpooled standard error while the test pooled
  • Interpret “95% confidence” correctly — and avoid the common misreading
  • Connect the CI to the p-value through their duality

Let’s turn the verdict into a range.


From a Verdict to a Range

A significance test asks a null question: if the change did nothing, how surprising is this data? To answer it, Lesson 1 pooled both groups into one shared rate — because “the change did nothing” means both groups have the same true rate. A confidence interval asks a different question: what is the effect, and how uncertain am I about it? Here we are not assuming the null. We’re estimating the actual difference, so the two groups are allowed to have genuinely different rates — and we estimate each group’s noise on its own, then combine. That’s the unpooled standard error:

  • Group rates p₁ = c₁/n₁ and p₂ = c₂/n₂, estimated separately.
  • Unpooled SE: √[ p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂ ] — each group contributes its own variance.
  • The interval: (p₂ − p₁) ± z · SE, where z = 1.96 for 95% confidence.

[Figure: number line of the difference in signup conversion, in percentage points, with 0 labeled 'no effect'. The 95% confidence interval bar runs from +0.97 to +3.43 points, with the observed +2.20-point estimate at its center.]
The confidence interval places the whole plausible range of the true lift on a number line. Lumen's spans +0.97 to +3.43 points — narrow, and entirely above zero, so the effect is both real and comfortably positive.

The center of the interval is your best single guess, the observed difference. The width is your uncertainty — set by the standard error and how confident you want to be. Where that interval falls relative to 0 is what turns it into a decision.
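
To make the pooled-versus-unpooled distinction concrete, here is a minimal sketch that computes both standard errors on Lumen’s counts. It assumes group 1 is the old page and group 2 the new one, and that Lesson 1’s test used the standard pooled form √[p̂(1−p̂)(1/n₁ + 1/n₂)]. The variable names are illustrative.

```python
import math

# Lumen's counts from this lesson (assumed: group 1 = old page, group 2 = new page).
c1, n1, c2, n2 = 503, 5000, 613, 5000
p1, p2 = c1 / n1, c2 / n2

# Unpooled SE: each group keeps its own rate (used for the interval).
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Pooled SE: one shared rate, as the null hypothesis assumes
# (assumed form of the Lesson 1 test).
p_pool = (c1 + c2) / (n1 + n2)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(f"unpooled SE = {se_unpooled:.6f}")
print(f"pooled SE   = {se_pooled:.6f}")
```

With balanced groups and similar rates like these, the two values land very close together (both near 0.0063), which is why the test and the interval tell the same story.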


Building the Interval

The whole computation is one standard error and one multiplier. We estimate each rate on its own, add their variances, take the square root, and step out z standard errors on each side of the observed difference:

import numpy as np
from scipy.stats import norm

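def diff_ci(c1, n1, c2, n2, conf=0.95):
    """Confidence interval for p2 - p1 from conversion counts c and sample sizes n."""
    p1, p2 = c1/n1, c2/n2                          # each group's own rate
    se = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)     # UNPOOLED
    z = norm.ppf(1 - (1-conf)/2)                   # 1.96 for 95%
    diff = p2 - p1                                 # observed lift
    # Plain floats so the printed tuple reads cleanly under NumPy 2.x
    return float(diff - z*se), float(diff + z*se)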

print(diff_ci(503, 5000, 613, 5000))

Notice what sits under the square root: p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂, the sum of the two groups’ variances, with each group’s rate kept separate and no pooled rate anywhere. That’s the deliberate difference from Lesson 1. The norm.ppf(1 - (1-conf)/2) line turns a confidence level into a multiplier: for 95% it splits the leftover 5% into two tails of 2.5% each and returns the z that leaves 2.5% beyond it, the familiar 1.96.
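
As a quick illustration of how the confidence level maps to the multiplier, the sketch below evaluates the same norm.ppf expression at three common levels. The helper name z_multiplier is hypothetical, introduced only for this example.

```python
from scipy.stats import norm

def z_multiplier(conf):
    """Two-sided critical value: leaves (1 - conf)/2 in each tail."""
    return norm.ppf(1 - (1 - conf) / 2)

for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%} confidence -> z = {z_multiplier(conf):.3f}")
```

Higher confidence means a larger multiplier and therefore a wider interval: roughly 1.645 for 90%, 1.960 for 95%, and 2.576 for 99%.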


Reading Lumen’s Interval

Running it on the same 503/5000 versus 613/5000 counts:

(0.009664..., 0.034335...)

The 95% confidence interval for the lift is approximately (0.0097, 0.0343) — in plain terms, [+0.97 points, +3.43 points] around the observed +2.20 point lift. Two things matter. First, the width: the plausible true lift runs from about +1 point to about +3.4 points. That’s a real, decision-relevant range — a +1-point lift and a +3.4-point lift might call for different ship decisions, and the CI tells you the data can’t yet distinguish them. Second, the position: the entire interval sits above 0. Zero — “no effect” — is not a plausible value for the true difference. That’s the same conclusion the z-test reached, arrived at from the other direction.

What ‘95% confidence’ actually means

The tempting reading — “there’s a 95% probability the true lift is between +0.97 and +3.43” — is wrong. The true difference is a fixed number; it’s either in this particular interval or it isn’t. What “95% confidence” describes is the procedure: if you reran the whole experiment many times and built an interval each time, about 95% of those intervals would capture the true difference. This one interval is a single draw from that reliable process. The practical takeaway is unchanged — the plausible lift is roughly +1 to +3.4 points and the whole range is above 0 — but the probability lives in the method, not in this one interval.
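
The “repeated procedure” reading can be checked with a small simulation. This is a sketch under stated assumptions: the true rates (10% and 12%) are hypothetical values picked for the demo, since in a real experiment they are unknown, and the seed and number of simulated experiments are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical "true" rates for the simulation; unknown in a real experiment.
true_p1, true_p2, n = 0.10, 0.12, 5000
true_diff = true_p2 - true_p1
n_sims = 20_000

# Simulate conversion counts for both groups in every repeated experiment.
c1 = rng.binomial(n, true_p1, size=n_sims)
c2 = rng.binomial(n, true_p2, size=n_sims)
p1, p2 = c1 / n, c2 / n

# Build the unpooled 95% interval for each simulated experiment.
se = np.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
z = norm.ppf(0.975)
lower, upper = (p2 - p1) - z * se, (p2 - p1) + z * se

# Fraction of intervals that contain the true difference.
coverage = np.mean((lower <= true_diff) & (true_diff <= upper))
print(f"share of intervals containing the true difference: {coverage:.3f}")
```

The printed share should land close to 0.95. Each individual interval either contains the true difference or it does not; the 95% shows up only across the whole collection.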


The Duality with the P-Value

The CI and the p-value aren’t two separate tools. They’re two views of the same evidence, bound by a rule: a 95% CI excludes 0 exactly when the two-sided test built on the same standard error is significant at the 5% level. Lumen’s interval [0.0097, 0.0343] excludes 0, and Lesson 1’s p-value was 0.00048 < 0.05. One asks “is 0 inside the interval?”, the other asks “is the difference far enough from 0?”, and with a shared standard error those are the same question. Here the test pooled and the interval did not, so the two standard errors differ very slightly; the verdicts can diverge only in borderline cases where p sits almost exactly at 0.05, and Lumen’s result is nowhere near that line.
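
The agreement can be checked numerically. The sketch below computes the interval and two p-values on Lumen’s counts: one using the interval’s own unpooled standard error, and one using the pooled form that Lesson 1’s test is assumed to have used. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

c1, n1, c2, n2 = 503, 5000, 613, 5000
p1, p2 = c1 / n1, c2 / n2
diff = p2 - p1

# Unpooled SE (interval) and pooled SE (assumed form of the Lesson 1 test).
se_unpooled = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
p_pool = (c1 + c2) / (n1 + n2)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# 95% interval and whether it excludes 0.
z_crit = norm.ppf(0.975)
lower, upper = diff - z_crit * se_unpooled, diff + z_crit * se_unpooled
ci_excludes_zero = bool(lower > 0 or upper < 0)

# Two-sided p-values: twice the upper tail beyond |z|.
p_unpooled = 2 * norm.sf(abs(diff / se_unpooled))
p_pooled = 2 * norm.sf(abs(diff / se_pooled))

print(f"CI excludes 0:    {ci_excludes_zero}")
print(f"p (unpooled SE):  {p_unpooled:.5f}")
print(f"p (pooled SE):    {p_pooled:.5f}")
```

The unpooled p-value is below 0.05 precisely when the interval excludes 0, because both use the same standard error. The pooled p-value differs only slightly and gives the same verdict here.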

So why compute both? Because the CI carries strictly more information. The p-value collapses the result to a single yes/no and hides the effect size; the CI keeps the size and its uncertainty out in the open. “p = 0.0005” and “the lift is +2.2 points, plausibly +1 to +3.4” describe the same experiment — but only the second one lets you weigh the win against the cost of shipping.

One clarification on units: this CI is in absolute percentage points — the raw difference p₂ − p₁. You could instead report a relative lift by dividing by the baseline (a +2.2-point rise on a 10.06% base is about +22% relative), and build a CI for that. Both are valid; they answer different questions. Just always say which one you mean — “+2 points” and “+20%” describe the same result and sound wildly different. We keep the focus on the absolute difference CI here, because it’s the quantity the z-test and the ship decision are built on.
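
For reference, the two ways of stating the lift can be computed side by side. This sketch covers only the point estimates; it does not attempt a confidence interval for the relative lift, which needs a different method than the one in this lesson.

```python
# Lumen's counts (assumed: group 1 = old page, group 2 = new page).
c1, n1, c2, n2 = 503, 5000, 613, 5000
p1, p2 = c1 / n1, c2 / n2

absolute_lift = p2 - p1             # difference in rates
relative_lift = absolute_lift / p1  # difference as a share of the baseline

print(f"absolute lift: {absolute_lift * 100:+.2f} points")
print(f"relative lift: {relative_lift:+.1%}")
```

This prints +2.20 points and +21.9%: the same result expressed in two units.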


Practice Exercises

Exercise 1: Why unpooled here?

Lesson 1’s z-test pooled both groups to estimate the standard error; this lesson’s CI does not. What changed?

Hint

The question changed. The z-test computes a p-value assuming the null is true — and if both groups share one rate, the best estimate of that rate uses all the data pooled together. The CI drops the null assumption and estimates the actual effect, so the two groups are allowed different true rates; each contributes its own variance, p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂. Same data, two standard errors, two questions.

Exercise 2: The 95% misreading

A teammate says, “The 95% CI is [0.97, 3.43] points, so there’s a 95% chance the true lift is in that range.” What’s wrong, and what’s the right statement?

Hint

The true lift is a fixed (if unknown) number — it’s either in this interval or it isn’t, so no probability attaches to this interval. The 95% describes the procedure: across many repeated experiments, about 95% of the intervals it produces would contain the true difference. This interval is one reliable draw from that process. Practically you’d still say: the plausible lift is about +1 to +3.4 points, entirely above 0.

Exercise 3: Predict the p-value from the CI

Without recomputing it, you’re told a 95% CI for a difference is [-0.4 points, +2.6 points]. Is the two-sided test significant at the 5% level? How do you know?

Hint

No. The interval includes 0, and by the duality a 95% CI excludes 0 exactly when the two-sided test is significant at 5%. Since 0 is a plausible value here, the p-value is above 0.05. The CI even tells you why the result is inconclusive: the true effect could be slightly negative or moderately positive — the data can’t rule out “no effect.”
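
If you want a rough number rather than just a verdict, the interval itself implies an approximate p-value. The sketch below assumes the interval is symmetric and was built as estimate ± 1.96·SE, so the implied p-value is the one based on the interval’s own standard error.

```python
from scipy.stats import norm

# The 95% CI from the exercise, in percentage points.
lower, upper = -0.4, 2.6

center = (lower + upper) / 2        # point estimate
half_width = (upper - lower) / 2    # z * SE
se = half_width / norm.ppf(0.975)   # back out the standard error
z = center / se
p_value = 2 * norm.sf(abs(z))       # two-sided p-value

print(f"estimate = {center:+.2f} points, SE = {se:.3f}, p = {p_value:.2f}")
```

The implied p-value comes out around 0.15, comfortably above 0.05, which matches what the interval already showed by containing 0.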


Summary

A confidence interval for the difference turns the z-test’s yes/no verdict into a plausible range for the true effect. Because we’re estimating the actual difference rather than assuming the null, we use the unpooled standard error √[p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂] and form the interval (p₂ − p₁) ± 1.96·SE. On Lumen’s 503/5000 versus 613/5000 data, the 95% CI is [+0.97, +3.43] points around the +2.20-point lift — narrow and entirely above 0. “95% confidence” is a property of the procedure, not of this one interval. And by duality, a 95% CI excludes 0 exactly when the two-sided test is significant at 5%: this interval excludes 0, matching Lesson 1’s p = 0.00048. The CI and p-value are the same evidence — but the CI also shows the effect’s size and its uncertainty.

Key Concepts

  • Unpooled standard error — estimate each group’s rate separately when measuring the effect, since you’re no longer assuming the null.
  • The interval: (p₂ − p₁) ± z·SE, with z = 1.96 for 95%; center is the estimate, width is the uncertainty.
  • Correct interpretation — 95% confidence is a property of the repeated procedure, not a probability about this single interval.
  • Duality — a 95% CI excludes 0 exactly when the two-sided test is significant at the 5% level.

Why This Matters

A p-value is a gate; a confidence interval is a measurement. Shipping decisions turn on how big an effect is — a lift barely above 0 may not be worth the engineering cost, while a comfortably positive one is an easy yes — and the CI is the tool that surfaces that size along with its honest uncertainty. It also inoculates you against the two most common analysis mistakes: reading “significant” as “large,” and reading “not significant” as “no effect.” With Lumen’s lift now bounded at [+0.97, +3.43] points, you have exactly what the next lesson needs — a real effect size, with its range, ready to weigh against the cost of shipping.


Next Steps

Continue to Lesson 4 - Making the Ship Decision

Turn the effect size and its confidence interval into an actual go/no-go call, weighing lift against cost and risk.

Back to Module Overview

Return to the Analyzing Proportion Metrics module overview


Continue Building Your Skills

You built the 95% confidence interval for the difference in two proportions — unpooled standard error, ± 1.96·SE — and ran it on Lumen’s experiment to bound the lift at [+0.97, +3.43] points, entirely above 0 and in perfect agreement with the p-value. Next you’ll put that number to work: turning a statistically significant, well-bounded lift into an actual ship decision, where effect size, cost, and risk all meet.