Lesson 3 - Analyzing the Results
Now You Can Look
In Lesson 2 you did the hard, unglamorous thing: you checked that the experiment was valid before checking whether it worked. The sample-ratio-mismatch test passed, the guardrails held, and the assignment was clean. That’s what earns you the right to look at the primary metric — and now you can.
Here’s the validated data. Of 3,328 users in control, 832 activated within 7 days: a 25.00% activation rate. Of 3,460 users in treatment, 980 activated: 28.32%. On the face of it, the checklist looks like it helped. But “looks like” isn’t an answer. The primary metric is a proportion, so the tool is the one you built in Module 4: the two-proportion z-test, plus a confidence interval to say how big the effect is and how sure we are.
By the end of this lesson, you will be able to:
- Run a two-proportion z-test on a validated A/B result
- Build a 95% confidence interval for the lift and read it correctly
- Connect the p-value and the CI (the duality from Module 4)
- Separate statistical significance from practical significance — honestly
Let’s read the result.
The Two-Proportion Z-Test
The question is whether a 25.00% → 28.32% difference is more than noise. The z-test compares the two proportions against the variability you’d expect from sampling alone. The CI then reports the lift as a range rather than a single point. Both come straight from Module 4:
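import math
from scipy.stats import norm

def two_prop(c1, n1, c2, n2):
    """Two-proportion z-test; returns (difference, z, two-sided p-value)."""
    p1, p2 = c1/n1, c2/n2
    # Pooled rate: under the null hypothesis both groups share one rate.
    p = (c1+c2)/(n1+n2)
    se = math.sqrt(p*(1-p)*(1/n1+1/n2))
    z = (p2-p1)/se
    return p2-p1, z, 2*(1-norm.cdf(abs(z)))

def diff_ci(c1, n1, c2, n2):
    """95% confidence interval for the difference p2 - p1."""
    p1, p2 = c1/n1, c2/n2
    # Unpooled standard error: each group keeps its own variance.
    se = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    z = norm.ppf(0.975)  # two-sided 95% critical value, about 1.96
    d = p2-p1
    return d - z*se, d + z*se

# Control: 832 of 3,328 activated; treatment: 980 of 3,460.
diff, z, p = two_prop(832, 3328, 980, 3460)
lo, hi = diff_ci(832, 3328, 980, 3460)
print(f"difference = {diff:.4f} z = {z:.3f} p = {p:.5f}")
print(f"95% CI for the lift: [{lo:.4f}, {hi:.4f}]")

Running it: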
difference = 0.0332 z = 3.095 p = 0.00197
95% CI for the lift: [0.0122, 0.0543]

The observed lift is +3.32 percentage points (25.00% → 28.32%). The z-statistic is 3.095, meaning the difference is about three standard errors away from zero, and the p-value is 0.00197, well below the pre-committed 0.05 threshold. So the result is statistically significant: a difference this large is very unlikely to be sampling noise if the checklist truly did nothing. The 95% CI for the lift is [+1.22, +5.43] points, and it excludes 0. That’s not a coincidence: a p-value below 0.05 and a 95% CI that excludes zero are two views of the same fact, the p-value/CI duality from Module 4. The test says there is an effect; the interval says how big it plausibly is.
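In symbols, the two functions in this section compute the quantities below. The notation is introduced here for illustration and may differ from Module 4: $\hat p_1$ and $\hat p_2$ are the control and treatment rates, $\hat p$ is the pooled rate, and $\Phi$ is the standard normal CDF. Note that the test statistic uses the pooled standard error while the interval uses the unpooled one.

```latex
\hat p_1 = \frac{c_1}{n_1}, \qquad
\hat p_2 = \frac{c_2}{n_2}, \qquad
\hat p = \frac{c_1 + c_2}{n_1 + n_2}

z = \frac{\hat p_2 - \hat p_1}
         {\sqrt{\hat p\,(1 - \hat p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad
p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)

\text{95\% CI:}\quad
(\hat p_2 - \hat p_1) \;\pm\; z_{0.975}
\sqrt{\frac{\hat p_1 (1 - \hat p_1)}{n_1} + \frac{\hat p_2 (1 - \hat p_2)}{n_2}},
\qquad z_{0.975} \approx 1.96
```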
Significant, But How Big?
Statistical significance answers “is there an effect?” — but the decision Lumen actually cares about is “is the effect big enough to ship?” That’s practical significance, and the design already set the bar: the minimum detectable effect was +3 points, the smallest lift the team said was worth shipping. So compare the CI to that bar, not just to zero.
The point estimate is +3.32 points, so it clears the 3-point bar. The CI’s upper bound is +5.43, comfortably above it. But the CI’s lower bound is +1.22, which sits below the 3-point line. Read honestly, that means the effect is clearly real and positive (the whole interval is above zero), and our best single guess beats the practical bar, but there’s genuine uncertainty about whether the true lift is as large as the +3 points the team hoped for. It could plausibly be as small as about 1.2 points. The experiment pinned down the sign of the effect firmly; it pinned down the magnitude less precisely. The interval’s half-width is about 2.1 points, so with a point estimate only 0.32 points above the bar, the interval was bound to reach below it.
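The comparison above can be written as a small check. This is a minimal sketch: `read_interval` is a hypothetical helper introduced here for illustration, not something from Module 4, and all values are proportions (0.03 means 3 percentage points).

```python
def read_interval(lo, hi, mde):
    """Describe where a CI for the lift sits relative to zero and to the MDE.

    Hypothetical helper for illustration only.
    """
    if lo > 0:
        vs_zero = "entirely above zero"
    elif hi < 0:
        vs_zero = "entirely below zero"
    else:
        vs_zero = "straddles zero"

    if lo >= mde:
        vs_mde = "entirely at or above the MDE"
    elif hi < mde:
        vs_mde = "entirely below the MDE"
    else:
        vs_mde = "straddles the MDE"
    return vs_zero, vs_mde

# Interval and bar from this lesson: [+1.22, +5.43] points vs. a +3-point MDE.
vs_zero, vs_mde = read_interval(0.0122, 0.0543, 0.03)
print(f"vs. zero: {vs_zero}; vs. MDE: {vs_mde}")
# prints: vs. zero: entirely above zero; vs. MDE: straddles the MDE
```

Keeping the two comparisons separate mirrors the two questions in the text: one answer for statistical significance, another for practical significance.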
Excludes zero, but straddles the bar
It’s tempting to stop at “p = 0.002, significant, ship it.” Resist that. Statistical significance and practical significance are different questions, and this result answers them differently. The CI excludes zero cleanly, so we’re confident the checklist helps. But the CI does not sit entirely above the +3-point MDE (its lower end is +1.22), so we can’t claim with 95% confidence that the true lift is at least the 3 points the team wanted. That’s not a reason to reject the result; the best estimate clears the bar and the effect is unambiguously positive. It is a reason to report the range honestly rather than the point estimate alone. A readout that says “+3.32 points (95% CI +1.22 to +5.43), significant” is fine; one that implies the true lift is guaranteed to be ≥3 points is overselling. The full ship decision, which weighs this result against the guardrails and a Bayesian cross-check, comes in Lesson 4.
Practice Exercises
Exercise 1: Read the interval aloud
The 95% CI for the lift is [+1.22, +5.43] points. Say what it means in one sentence — and what it does not mean.
Hint
It means: our best estimate of the true lift is +3.32 points, and the data are consistent with a true lift anywhere from about +1.22 to +5.43 points (with 95% confidence). It does not mean there’s a 95% probability the true lift is in that range — the true lift is a fixed number, not a random one. The 95% refers to the procedure: intervals built this way capture the true value 95% of the time. And it does not mean every value in the range is equally likely — values near the center are more plausible than values near the edges (Module 4).
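The claim that the 95% refers to the procedure can be checked by simulation. The sketch below is illustrative only: the true rates (25% and 28%), the per-arm sample size, and the number of simulated experiments are assumptions chosen to keep it fast, and it uses the standard library's `NormalDist` in place of `scipy.stats.norm`. It repeats the experiment many times with a known true lift and counts how often the interval contains it.

```python
import math
import random
from statistics import NormalDist

def wald_ci(c1, n1, c2, n2, level=0.95):
    """Unpooled normal-approximation CI for p2 - p1 (same form as diff_ci)."""
    p1, p2 = c1 / n1, c2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    d = p2 - p1
    return d - crit * se, d + crit * se

def coverage(true_p1, true_p2, n, sims, seed=0):
    """Share of simulated experiments whose 95% CI contains the true lift."""
    rng = random.Random(seed)
    true_lift = true_p2 - true_p1
    hits = 0
    for _ in range(sims):
        # Draw each user's activation as an independent Bernoulli trial.
        c1 = sum(rng.random() < true_p1 for _ in range(n))
        c2 = sum(rng.random() < true_p2 for _ in range(n))
        lo, hi = wald_ci(c1, n, c2, n)
        hits += lo <= true_lift <= hi
    return hits / sims

# Assumed values for the simulation, not taken from the experiment's design.
rate = coverage(true_p1=0.25, true_p2=0.28, n=1000, sims=1000)
print(f"share of intervals containing the true lift: {rate:.3f}")
```

The true lift never changes across runs; only the interval moves. The share should land close to 0.95, which is the sense in which the interval has "95% confidence."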
Exercise 2: p-value and CI agree
The p-value is 0.00197 and the 95% CI excludes 0. Explain why those two facts had to agree.
Hint
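They’re the same test seen two ways: the duality from Module 4. A 95% CI contains exactly the effect values you would not reject at the 5% level, provided the test and the interval use the same standard error. Under that condition, “p < 0.05” and “the 95% CI excludes the null value of 0” are logically equivalent: whenever one holds, the other must. In this lesson’s code the test uses the pooled standard error and the interval uses the unpooled one. The two differ by well under 1% here, so the two views could disagree only in a borderline case with p very close to 0.05, and p = 0.00197 is nowhere near that boundary. If the p-value had been, say, 0.08, the CI would have straddled zero.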
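A minimal numerical sketch of the duality, assuming the unpooled standard error is used for both the test and the interval (the function names here are illustrative, and `NormalDist` from the standard library stands in for `scipy.stats.norm`). If the interval is built at confidence level 1 − p instead of 95%, its lower bound lands exactly on zero, which is what "the CI contains the values you would not reject" means at the boundary.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def unpooled_se(c1, n1, c2, n2):
    p1, p2 = c1 / n1, c2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def p_value(c1, n1, c2, n2):
    """Two-sided p-value for H0: no difference, using the unpooled SE."""
    d = c2 / n2 - c1 / n1
    z = d / unpooled_se(c1, n1, c2, n2)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def ci(c1, n1, c2, n2, level):
    """CI for the difference at the given confidence level, same SE."""
    d = c2 / n2 - c1 / n1
    crit = STD_NORMAL.inv_cdf(0.5 + level / 2)
    half = crit * unpooled_se(c1, n1, c2, n2)
    return d - half, d + half

counts = (832, 3328, 980, 3460)
p = p_value(*counts)
lo95, hi95 = ci(*counts, level=0.95)
lo_edge, hi_edge = ci(*counts, level=1 - p)

print(f"p below 0.05: {p < 0.05}; 95% CI excludes 0: {lo95 > 0}")
print(f"lower bound of the (1 - p) interval is zero: {abs(lo_edge) < 1e-9}")
```

The p-value from this sketch differs slightly from the lesson's 0.00197 because that figure comes from the pooled standard error.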
Exercise 3: The lower bound below the bar
Suppose a teammate says, “The CI’s lower bound is +1.22, which is under our +3-point MDE, so the test failed — we shouldn’t ship.” Is that right?
Hint
No, that’s too harsh. The test did not fail: the result is significant (p = 0.002), the effect is clearly positive (CI excludes 0), and the point estimate (+3.32) clears the 3-point bar. What the lower bound tells us is narrower: we can’t be 95% confident the true lift is at least 3 points, only that it’s at least about 1.2. That’s a statement about precision, not a verdict. The right move is to report the effect and its range honestly and fold it into the decision (Lesson 4), not to declare failure. Powering a study for a 3-point MDE means it has a good chance of detecting a true 3-point lift; it does not mean a significant result proves the lift is at least 3 points.
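To make the last point concrete, here is a rough power calculation. It is a sketch under stated assumptions: the lesson does not give the design's baseline rate or power target, so the 25% baseline is borrowed from the observed control rate, the sample sizes are the realized ones, and `approx_power` is a hypothetical helper using the normal approximation with an unpooled standard error.

```python
import math
from statistics import NormalDist

def approx_power(p1, lift, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Hypothetical helper: normal approximation, unpooled standard error.
    """
    p2 = p1 + lift
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the z-statistic exceeds +crit when the true lift is `lift`.
    # The opposite tail is negligible for a positive lift and is ignored.
    return 1 - NormalDist().cdf(crit - lift / se)

# Assumed baseline of 25% and a true lift exactly at the +3-point MDE.
power = approx_power(p1=0.25, lift=0.03, n1=3328, n2=3460)
print(f"approximate power at a true +3-point lift: {power:.2f}")
```

Under these assumptions the power comes out at roughly 0.8. Power describes the chance of a significant result; it says nothing about where the confidence interval's lower bound will land. Even when the true lift is exactly 3 points, the point estimate falls below 3 about half the time.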
Summary
With the experiment validated in Lesson 2, this lesson finally read the primary metric. Control activated 25.00% (832/3,328), treatment 28.32% (980/3,460): an observed lift of +3.32 percentage points. Because the metric is a proportion, the analysis used the two-proportion z-test from Module 4: z = 3.095, p = 0.00197, comfortably below 0.05, so the result is statistically significant. The 95% CI for the lift is [+1.22, +5.43] points, which excludes zero (the p-value/CI duality) but does not sit entirely above the +3-point MDE. Practically: the effect is real and positive, and the best estimate clears the practical bar, but the true lift could plausibly be as small as ~1.2 points. The honest readout is “significant, best estimate above the bar, with the range noted” and not “guaranteed ≥3 points.” The ship decision itself waits for Lesson 4.
Key Concepts
- Two-proportion z-test — the right test when the primary metric is a rate; here z = 3.095, p = 0.00197.
- Confidence interval for the lift — [+1.22, +5.43] points reports magnitude and uncertainty, not just a yes/no.
- p-value/CI duality — p < 0.05 and a 95% CI excluding 0 are the same fact seen two ways.
- Significance vs. practical significance — the CI excludes 0 (significant) but straddles the +3-point MDE (magnitude uncertain).
Why This Matters
A p-value alone tells you whether an effect is distinguishable from noise; it doesn’t tell you whether the effect is worth the engineering cost, the risk, or the maintenance of shipping a new feature. The confidence interval is what turns a significant result into a business decision, because it puts the effect on the same scale as the bar you set in the design. Reading it honestly, which means celebrating that the interval excludes zero while admitting its lower end dips below your MDE, is the difference between an analyst who reports the truth and one who reports the headline. Next, you’ll cross-check this frequentist result with a Bayesian view, weigh the guardrails, and write the actual ship decision.
Next Steps
Continue to Lesson 4 - The Cross-Check and the Decision
Cross-check the z-test with a Bayesian view, weigh the guardrails, and write the ship decision the brief asked for.
Continue Building Your Skills
You took a validated experiment and read its result the way a data scientist should: a two-proportion z-test for significance (z = 3.095, p = 0.00197), and a confidence interval for the lift ([+1.22, +5.43] points) to report magnitude and uncertainty. You didn’t stop at the p-value — you compared the interval to the practical bar and told the truth about where it falls. Next, you’ll pressure-test this conclusion from a second angle with a Bayesian cross-check, confirm the guardrails held, and combine everything into the written ship decision the brief demanded from the very start.