Lesson 3 - Analyzing the Results

Now You Can Look

In Lesson 2 you did the hard, unglamorous thing: you checked that the experiment was valid before checking whether it worked. The sample-ratio-mismatch test passed, the guardrails held, and the assignment was clean. That’s what earns you the right to look at the primary metric — and now you can.

Here’s the validated data. Of 3,328 users in control, 832 activated within 7 days: a 25.00% activation rate. Of 3,460 users in treatment, 980 activated: 28.32%. On the face of it, the checklist looks like it helped. But “looks like” isn’t an answer. The primary metric is a proportion, so the tool is the one you built in Module 4: the two-proportion z-test, plus a confidence interval to say how big the effect is and how sure we are.

By the end of this lesson, you will be able to:

  • Run a two-proportion z-test on a validated A/B result
  • Build a 95% confidence interval for the lift and read it correctly
  • Connect the p-value and the CI (the duality from Module 4)
  • Separate statistical significance from practical significance — honestly

Let’s read the result.


The Two-Proportion Z-Test

The question is whether a 25.00% → 28.32% difference is more than noise. The z-test compares the two proportions against the variability you’d expect from sampling alone. The CI then reports the lift as a range rather than a single point. Both come straight from Module 4:

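import math
from scipy.stats import norm

def two_prop(c1, n1, c2, n2):
    """Two-proportion z-test: returns (difference, z, two-sided p-value)."""
    p1, p2 = c1 / n1, c2 / n2
    # Pooled rate: under the null, both groups share one activation rate
    p = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # norm.sf(x) equals 1 - norm.cdf(x), with better accuracy in the tail
    return p2 - p1, z, 2 * norm.sf(abs(z))

def diff_ci(c1, n1, c2, n2):
    """95% confidence interval for the difference p2 - p1."""
    p1, p2 = c1 / n1, c2 / n2
    # Unpooled SE: the interval does not assume the two rates are equal
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(0.975)  # about 1.96 for a two-sided 95% interval
    d = p2 - p1
    return d - z * se, d + z * se

# Control: 832 of 3,328 activated. Treatment: 980 of 3,460 activated.
diff, z, p = two_prop(832, 3328, 980, 3460)
lo, hi = diff_ci(832, 3328, 980, 3460)
print(f"difference = {diff:.4f}   z = {z:.3f}   p = {p:.5f}")
print(f"95% CI for the lift: [{lo:.4f}, {hi:.4f}]")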

Running it:

difference = 0.0332   z = 3.095   p = 0.00197
95% CI for the lift: [0.0122, 0.0543]

The observed lift is +3.32 percentage points (25.00% → 28.32%). The z-statistic is 3.095 — the difference is about three standard errors away from zero — and the p-value is 0.00197, well below the pre-committed 0.05 threshold. So the result is statistically significant: a difference this large is very unlikely to be sampling noise if the checklist truly did nothing. The 95% CI for the lift is [+1.22, +5.43] points, and it excludes 0. That’s not a coincidence: a p-value below 0.05 and a 95% CI that excludes zero are two views of the same fact — the p-value/CI duality from Module 4. The test says there is an effect; the interval says how big it plausibly is.
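
For readers who want to retrace the arithmetic by hand, the worked values below follow the same steps as the two functions. They are rounded, and the symbols (the pooled rate written as \hat p, and the labels "pooled" and "unpooled" on the standard errors) are notation chosen here for illustration rather than taken from Module 4.

```latex
\begin{align*}
\hat p_1 &= \frac{832}{3328} = 0.2500, \qquad
\hat p_2 = \frac{980}{3460} \approx 0.2832, \qquad
\hat d = \hat p_2 - \hat p_1 \approx 0.03324 \\
\hat p &= \frac{832 + 980}{3328 + 3460} = \frac{1812}{6788} \approx 0.2669 \\
\mathrm{SE}_{\text{pooled}} &= \sqrt{\hat p\,(1 - \hat p)\left(\frac{1}{3328} + \frac{1}{3460}\right)} \approx 0.01074 \\
z &= \frac{\hat d}{\mathrm{SE}_{\text{pooled}}} \approx \frac{0.03324}{0.01074} \approx 3.095 \\
\mathrm{SE}_{\text{unpooled}} &= \sqrt{\frac{\hat p_1 (1 - \hat p_1)}{3328} + \frac{\hat p_2 (1 - \hat p_2)}{3460}} \approx 0.01072 \\
\text{95\% CI} &= \hat d \pm 1.960 \cdot \mathrm{SE}_{\text{unpooled}} \approx 0.03324 \pm 0.02102 \approx [0.0122,\ 0.0543]
\end{align*}
```

Note that the test uses the pooled standard error while the interval uses the unpooled one. Here they differ only in the fourth significant digit, which is why the test and the interval tell the same story.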

[Figure: Control 7-day activation 25.00% (832/3,328) versus treatment 28.32% (980/3,460), with the 95% confidence interval for the lift spanning +1.22 to +5.43 points, the observed +3.32-point estimate marked, and a dashed MDE marker at +3 points; z = 3.10, p = 0.0020.]
The validated result and the 95% CI for the lift: the whole interval sits above zero, so the effect is significant, and the +3.32-point best estimate clears the +3-point bar, though the interval's lower end dips below it.

Significant, But How Big?

Statistical significance answers “is there an effect?” — but the decision Lumen actually cares about is “is the effect big enough to ship?” That’s practical significance, and the design already set the bar: the minimum detectable effect was +3 points, the smallest lift the team said was worth shipping. So compare the CI to that bar, not just to zero.

The point estimate is +3.32 points, so it clears the 3-point bar. The CI’s upper bound is +5.43, comfortably above it. But the CI’s lower bound is +1.22, which sits below the 3-point line. Read honestly, that means the effect is clearly real and positive (the whole interval is above zero), and our best single guess beats the practical bar, but there’s genuine uncertainty about whether the true lift is as large as the +3 points the team hoped for. It could plausibly be as small as about 1.2 points. The experiment nailed down the direction firmly; it pinned down the magnitude less precisely, because the interval reaches about 2.1 points either side of the estimate and a 3.32-point effect is only a touch above the 3-point effect the study was powered to detect.
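
The three comparisons in that paragraph can be written down as explicit checks. This is a minimal sketch: `read_interval` and its dictionary keys are hypothetical names introduced here for illustration, and the inputs are the rounded values reported above.

```python
def read_interval(lo, hi, estimate, mde):
    """Compare a confidence interval for the lift to zero and to the practical bar."""
    return {
        # Statistical significance: the interval does not contain 0
        "excludes_zero": lo > 0 or hi < 0,
        # Best single guess versus the practical bar
        "estimate_clears_mde": estimate >= mde,
        # Strongest claim: even the low end of the interval clears the bar
        "whole_ci_above_mde": lo >= mde,
    }

# Values from this lesson, expressed as proportions (3 points = 0.03)
reading = read_interval(lo=0.0122, hi=0.0543, estimate=0.0332, mde=0.03)
for check, passed in reading.items():
    print(f"{check}: {passed}")
```

For this result the first two checks pass and the third does not, which is the "excludes zero, but straddles the bar" pattern.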

Excludes zero, but straddles the bar

It’s tempting to stop at “p = 0.002, significant, ship it.” Resist that. Statistical significance and practical significance are different questions, and this result answers them differently. The CI excludes zero cleanly, so we’re confident the checklist helps. But the CI does not sit entirely above the +3-point MDE (its lower end is +1.22), so we can’t claim with 95% confidence that the true lift is at least the 3 points the team wanted. That’s not a reason to reject the result; the best estimate clears the bar and the effect is unambiguously positive. It is a reason to report the range honestly rather than the point estimate alone. A readout that says “+3.32 points, significant, 95% CI +1.22 to +5.43” is fine; one that pretends the true lift is guaranteed to be ≥3 points is overselling. The full ship decision, which weighs this against the guardrails and a Bayesian cross-check, comes in Lesson 4.


Practice Exercises

Exercise 1: Read the interval aloud

The 95% CI for the lift is [+1.22, +5.43] points. Say what it means in one sentence — and what it does not mean.

Hint

It means: our best estimate of the true lift is +3.32 points, and the data are consistent with a true lift anywhere from about +1.22 to +5.43 points (with 95% confidence). It does not mean there’s a 95% probability the true lift is in that range — the true lift is a fixed number, not a random one. The 95% refers to the procedure: intervals built this way capture the true value 95% of the time. And it does not mean every value in the range is equally likely — values near the center are more plausible than values near the edges (Module 4).
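
The "95% refers to the procedure" point can be checked by simulation. The sketch below assumes a known truth (a 25% control rate and a 28% treatment rate, chosen here purely for illustration), replays the experiment many times at this lesson's sample sizes, and counts how often the interval captures the true lift. `simulate_coverage` is a hypothetical helper, and it uses NumPy, which the lesson's own code does not import.

```python
import numpy as np
from scipy.stats import norm

def simulate_coverage(p1, p2, n1, n2, sims=20_000, level=0.95, seed=0):
    """Share of simulated experiments whose CI contains the true lift p2 - p1."""
    rng = np.random.default_rng(seed)
    # Draw the number of activations in each arm for every simulated experiment
    c1 = rng.binomial(n1, p1, size=sims)
    c2 = rng.binomial(n2, p2, size=sims)
    ph1, ph2 = c1 / n1, c2 / n2
    # Same unpooled standard error as the interval in this lesson
    se = np.sqrt(ph1 * (1 - ph1) / n1 + ph2 * (1 - ph2) / n2)
    zcrit = norm.ppf(0.5 + level / 2)
    d = ph2 - ph1
    true_lift = p2 - p1
    covered = (d - zcrit * se <= true_lift) & (true_lift <= d + zcrit * se)
    return float(covered.mean())

# Assumed truth for the simulation: 25% control, 28% treatment (a 3-point lift)
coverage = simulate_coverage(0.25, 0.28, 3328, 3460)
print(f"share of intervals that captured the true lift: {coverage:.3f}")
```

The share should land close to 0.95. Any single interval either contains the true lift or it does not; the 95% describes how often the recipe succeeds across repetitions.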

Exercise 2: p-value and CI agree

The p-value is 0.00197 and the 95% CI excludes 0. Explain why those two facts had to agree.

Hint

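They’re the same test seen two ways: the duality from Module 4. A 95% CI contains exactly the effect values you would not reject at the 5% level, so “p < 0.05” and “the 95% CI excludes the null value of 0” are equivalent whenever the test and the interval use the same standard error. One caveat for this lesson’s code: two_prop uses the pooled standard error (about 0.01074) and diff_ci uses the unpooled one (about 0.01072), so here the equivalence is approximate rather than exact. The two values are nearly identical, so the test and the interval can disagree only in razor-thin borderline cases, and this result (z = 3.095) is nowhere near the border. If the p-value had been, say, 0.08, the CI would have straddled zero.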
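
The claim that a 95% CI contains exactly the values you would not reject can be checked directly. This is a minimal sketch, assuming the test and the interval share the unpooled standard error so that the equivalence is exact; `wald_p_value` and `wald_ci` are hypothetical helper names, not Module 4 functions, and the candidate null values are arbitrary.

```python
import math
from scipy.stats import norm

def unpooled_se(c1, n1, c2, n2):
    p1, p2 = c1 / n1, c2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def wald_p_value(c1, n1, c2, n2, d0=0.0):
    """Two-sided p-value for the null hypothesis p2 - p1 == d0."""
    d = c2 / n2 - c1 / n1
    z = (d - d0) / unpooled_se(c1, n1, c2, n2)
    return 2 * norm.sf(abs(z))

def wald_ci(c1, n1, c2, n2, level=0.95):
    """Confidence interval for p2 - p1 built from the same standard error."""
    d = c2 / n2 - c1 / n1
    half = norm.ppf(0.5 + level / 2) * unpooled_se(c1, n1, c2, n2)
    return d - half, d + half

data = (832, 3328, 980, 3460)
lo, hi = wald_ci(*data)
results = []
# Arbitrary candidate values for the true lift, some inside the CI and some outside
for d0 in (0.0, 0.010, 0.015, 0.030, 0.050, 0.060):
    p = wald_p_value(*data, d0=d0)
    inside = lo <= d0 <= hi
    results.append((d0, p, inside))
    print(f"null lift = {d0:.3f}   p = {p:.4f}   inside 95% CI: {inside}")
```

Every candidate value inside the interval has p ≥ 0.05, and every value outside it has p < 0.05.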

Exercise 3: The lower bound below the bar

Suppose a teammate says, “The CI’s lower bound is +1.22, which is under our +3-point MDE, so the test failed — we shouldn’t ship.” Is that right?

Hint

No, that’s too harsh. The test did not fail: the result is significant (p = 0.002), the effect is clearly positive (CI excludes 0), and the point estimate (+3.32) clears the 3-point bar. What the lower bound tells us is narrower: we can’t be 95% confident the true lift is at least 3 points, only that it’s at least about 1.2. That’s a statement about precision, not a verdict. The right move is to report the effect and its range honestly and fold it into the decision (Lesson 4), not to declare failure. Powering a study for a 3-point MDE means it has a good chance of detecting a true 3-point effect, not that a significant result proves the lift is at least 3 points.
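
The gap between "can detect a 3-point effect" and "can prove the lift is at least 3 points" can be put in numbers. The sketch below uses a normal approximation and assumes a true control rate of 25% and a true lift of exactly 3 points (assumptions made here for illustration; the lesson does not state the design's power target). `approx_power` and `chance_ci_clears_bar` are hypothetical helpers.

```python
import math
from scipy.stats import norm

def unpooled_se(p1, p2, n1, n2):
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def approx_power(p1, lift, n1, n2, alpha=0.05):
    """Approximate chance that a two-sided test at level alpha rejects zero lift."""
    se = unpooled_se(p1, p1 + lift, n1, n2)
    zcrit = norm.ppf(1 - alpha / 2)
    shift = lift / se
    # Reject when the z-statistic lands in either tail
    return norm.sf(zcrit - shift) + norm.cdf(-zcrit - shift)

def chance_ci_clears_bar(p1, lift, bar, n1, n2, alpha=0.05):
    """Approximate chance that the CI's lower bound lands at or above the bar."""
    se = unpooled_se(p1, p1 + lift, n1, n2)
    zcrit = norm.ppf(1 - alpha / 2)
    return norm.sf(zcrit - (lift - bar) / se)

# Assumed truth: 25% baseline and a lift exactly equal to the 3-point MDE
power = approx_power(0.25, 0.03, 3328, 3460)
clears = chance_ci_clears_bar(0.25, 0.03, 0.03, 3328, 3460)
print(f"chance of a significant result:       {power:.2f}")
print(f"chance the whole CI sits above +3 pts: {clears:.3f}")
```

Under these assumptions the study would detect the effect most of the time (roughly 0.80), yet the whole interval would sit above the +3-point bar only about 2.5% of the time. A lower bound under the MDE is therefore the expected outcome when the true lift is near the MDE, not a sign of failure.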


Summary

With the experiment validated in Lesson 2, this lesson finally read the primary metric. Control activated 25.00% (832/3,328), treatment 28.32% (980/3,460): an observed lift of +3.32 percentage points. Because the metric is a proportion, the analysis used the two-proportion z-test from Module 4: z = 3.095, p = 0.00197, comfortably below 0.05, so the result is statistically significant. The 95% CI for the lift is [+1.22, +5.43] points, which excludes zero (the p-value/CI duality) but does not sit entirely above the +3-point MDE. Practically: the effect is real and positive, and the best estimate clears the practical bar, but the true lift could plausibly be as small as ~1.2 points. The honest readout is “significant, best estimate above the bar, range noted” (not “guaranteed ≥3 points”), and the ship decision itself waits for Lesson 4.

Key Concepts

  • Two-proportion z-test — the right test when the primary metric is a rate; here z = 3.095, p = 0.00197.
  • Confidence interval for the lift — [+1.22, +5.43] points reports magnitude and uncertainty, not just a yes/no.
  • p-value/CI duality — p < 0.05 and a 95% CI excluding 0 are the same fact seen two ways.
  • Significance vs. practical significance — the CI excludes 0 (significant) but straddles the +3-point MDE (magnitude uncertain).

Why This Matters

A p-value alone tells you an effect exists; it doesn’t tell you whether the effect is worth the engineering cost, the risk, or the maintenance of shipping a new feature. The confidence interval is what turns a significant result into a business decision, because it puts the effect on the same scale as the bar you set in the design. Reading it honestly — celebrating that the interval excludes zero while admitting its lower end dips below your MDE — is the difference between an analyst who reports the truth and one who reports the headline. Next, you’ll cross-check this frequentist result with a Bayesian view, weigh the guardrails, and write the actual ship decision.


Next Steps

Continue to Lesson 4 - The Cross-Check and the Decision

Cross-check the z-test with a Bayesian view, weigh the guardrails, and write the ship decision the brief asked for.



Continue Building Your Skills

You took a validated experiment and read its result the way a data scientist should: a two-proportion z-test for significance (z = 3.095, p = 0.00197), and a confidence interval for the lift ([+1.22, +5.43] points) to report magnitude and uncertainty. You didn’t stop at the p-value — you compared the interval to the practical bar and told the truth about where it falls. Next, you’ll pressure-test this conclusion from a second angle with a Bayesian cross-check, confirm the guardrails held, and combine everything into the written ship decision the brief demanded from the very start.