Lesson 1 - The Two-Proportion Z-Test

Welcome to the Two-Proportion Z-Test

Lumen ran the signup experiment you designed and sized. The results are in: out of 5,000 users each, the current page converted 503 and the new page converted 613 — 10.06% versus 12.26%. The new page looks better. But you already know the trap from Module 3: observed rates aren’t true rates, and a gap this size could, in principle, be noise. So the question isn’t “which number is bigger” — it’s “is this difference large enough that chance alone probably didn’t produce it?” The tool that answers it for proportions is the two-proportion z-test, and in this lesson you’ll build it from its parts and run it on exactly this data.

By the end of this lesson, you will be able to:

  • Compute the difference in conversion rates and its pooled standard error
  • Form the z statistic and explain what it measures
  • Turn z into a two-sided p-value with scipy
  • Run the full test on Lumen’s experiment and read the result

Let’s build the test one piece at a time.


The Idea: How Many Standard Errors Out?

Every significance test has the same shape: measure the effect, measure the noise, and take their ratio. For two proportions:

  • The effect is the difference in rates, p₂ − p₁ — for Lumen, 0.1226 − 0.1006 = 0.0220.
  • The noise is the standard error — how much that difference would wobble from sample to sample if you reran the experiment. Under the null (no real effect), both groups share one true rate, so we estimate it by pooling both groups: p_pool = (c₁ + c₂) / (n₁ + n₂), and the standard error is √[ p_pool(1−p_pool) (1/n₁ + 1/n₂) ].
  • The z statistic is their ratio: z = (p₂ − p₁) / SE. It says how many standard errors the observed difference sits from zero.
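
To make the three pieces concrete, here is a minimal sketch that works through them by hand on Lumen’s counts (503 and 613 conversions out of 5,000 each). It uses only Python’s standard library, and the variable names are illustrative choices rather than anything the lesson fixes.

```python
import math

# Lumen's counts from the experiment results
c1, n1 = 503, 5000   # control: current page
c2, n2 = 613, 5000   # treatment: new page

# Effect: difference in observed conversion rates
p1, p2 = c1 / n1, c2 / n2
effect = p2 - p1

# Noise: pooled standard error, assuming the null (one shared true rate)
p_pool = (c1 + c2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# z: how many standard errors the effect sits from zero
z = effect / se

print(f"effect = {effect:.4f}")   # 0.0220
print(f"p_pool = {p_pool:.4f}")   # 0.1116
print(f"SE     = {se:.5f}")       # 0.00630
print(f"z      = {z:.2f}")        # 3.49
```

The standard error is roughly 0.63 percentage points, so a 2.2-point difference is about three and a half of them.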

Figure: the null distribution of z, a bell curve centered at 0 (no effect), with a red line in the right tail at the observed z = 3.49 and a small shaded tail area beyond it (p = 0.0005).
The z-test places the observed difference on the null distribution, the bell of differences you'd expect if the change did nothing. Lumen's difference sits 3.49 standard errors out, so far into the tail that noise alone almost never gets there.

If the null were true, z would usually land near 0 — most reruns give small differences. A z far out in the tail means the observed difference is the kind of thing that almost never happens by chance, which is evidence the change had a real effect.
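
One way to see this is to simulate many experiments in which the null really is true and look at where z lands. The sketch below is an illustration under stated assumptions, not part of the lesson’s analysis: it treats Lumen’s pooled rate (0.1116) as the single true rate for both groups, and the seed and the number of simulated experiments are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, chosen only for this sketch
n = 5000                         # users per group, as in Lumen's experiment
shared_rate = 0.1116             # assumption: pooled rate used as the one true rate
n_sims = 20_000                  # number of simulated null experiments (arbitrary)

# Under the null, both groups draw conversions from the same true rate
c1 = rng.binomial(n, shared_rate, size=n_sims)
c2 = rng.binomial(n, shared_rate, size=n_sims)

# Same z statistic as in the lesson, computed for every simulated experiment
p_pool = (c1 + c2) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z_null = (c2 - c1) / n / se

# How often does pure noise reach at least as far out as Lumen's z?
frac_extreme = np.mean(np.abs(z_null) >= 3.49)

print(f"mean z under the null: {z_null.mean():.3f}")
print(f"std of z under the null: {z_null.std():.3f}")
print(f"share of null experiments with |z| >= 3.49: {frac_extreme:.5f}")
```

The simulated z values should cluster around 0 with a spread close to 1, and only a tiny share of them should reach 3.49 in either direction.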


From z to a P-Value

The z statistic becomes a p-value: the probability, if the null were true, of seeing a difference at least this extreme in either direction. Because z is approximately standard normal under the null when the samples are this large, we read that probability off the normal curve’s tails. A two-sided test counts both tails (the change could help or hurt), so we double the one-tail area:

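import numpy as np
from scipy.stats import norm

def two_prop_ztest(c1, n1, c2, n2):
    """Two-proportion z-test; returns (difference, z, two-sided p-value)."""
    p1, p2 = c1 / n1, c2 / n2                      # observed rate in each group
    p_pool = (c1 + c2) / (n1 + n2)                 # pooled rate under the null
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p2 - p1) / se                             # effect in standard-error units
    p = 2 * (1 - norm.cdf(abs(z)))                 # two-sided p-value
    return p2 - p1, z, p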

That’s the whole test: a few lines of arithmetic and one call to the normal CDF. The p_pool step is the key subtlety: under the null hypothesis both groups have the same rate, so we estimate that shared rate from all the data combined before measuring the noise.
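
As a cross-check on the last step, the same two-sided tail area can be reproduced with only the standard library, because 2·(1 − Φ(|z|)) is mathematically equal to erfc(|z|/√2). The sketch below is not part of the lesson’s function: the helper name is hypothetical, and z = 3.49 is the rounded value quoted for Lumen earlier in this lesson.

```python
import math
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z (hypothetical helper)."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), but it avoids subtracting
    # two nearly equal numbers when |z| is large.
    return math.erfc(abs(z) / math.sqrt(2))

z = 3.49                                      # Lumen's z, rounded
p_erfc = two_sided_p(z)
p_cdf = 2 * (1 - NormalDist().cdf(abs(z)))    # same formula as the lesson's code

print(f"erfc form: {p_erfc:.5f}")
print(f"cdf form:  {p_cdf:.5f}")
```

At a z of this size the two forms agree closely; the erfc form mainly matters for much larger z, where 1 minus the CDF rounds to zero.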


Running It on Lumen

Now the payoff. We regenerate Lumen’s experiment with the same seed used when we designed it, so the counts match exactly — 503 and 613 conversions — and run the test:

rng = np.random.default_rng(7)
conv_c = rng.random(5000) < 0.10        # control: current page
conv_t = rng.random(5000) < 0.12        # treatment: new page
c1, c2 = int(conv_c.sum()), int(conv_t.sum())

diff, z, p = two_prop_ztest(c1, 5000, c2, 5000)
print(f"control {c1}/5000 = {c1/5000:.4f}, treatment {c2}/5000 = {c2/5000:.4f}")
print(f"difference = {diff:.4f}   z = {z:.3f}   p = {p:.5f}")

Running it:

control 503/5000 = 0.1006, treatment 613/5000 = 0.1226
difference = 0.0220   z = 3.493   p = 0.00048

There’s the verdict. The 2.2-point lift sits z = 3.49 standard errors above zero, and the two-sided p-value is 0.00048 — about 5 in 10,000. If the new page truly did nothing, a difference this large would almost never appear; since it did appear, we reject “no effect.” The lift you observed back in Module 1 isn’t noise: it’s statistically significant.

Why pool for the test but not later

Notice we pooled both groups to estimate the standard error. That’s deliberate: the p-value asks “if the null is true, how surprising is this?” — and if the null is true, both groups share one rate, best estimated from all the data. When you later build a confidence interval for the difference (Lesson 3), you’ll use a different, unpooled standard error, because there you’re no longer assuming the null — you’re estimating the actual effect. Same data, two standard errors, for two different questions. It’s a subtlety worth remembering.
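
To see how close the two standard errors are on this data, here is a minimal sketch comparing them. The unpooled formula used below, √[p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂], is the standard textbook form and is an assumption here; Lesson 3 may present it differently.

```python
import math

c1, n1 = 503, 5000   # control: current page
c2, n2 = 613, 5000   # treatment: new page
p1, p2 = c1 / n1, c2 / n2

# Pooled: assumes the null, so one shared rate estimated from all the data
p_pool = (c1 + c2) / (n1 + n2)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Unpooled (assumed textbook form): each group keeps its own observed rate
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(f"pooled SE   = {se_pooled:.6f}")
print(f"unpooled SE = {se_unpooled:.6f}")
```

On Lumen’s data the two differ by less than 0.1%, so the choice changes little numerically here; the distinction is about which question each one answers.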


Practice Exercises

Exercise 1: What does z measure?

Lumen’s test gave z = 3.49. In plain language, what does that number say about the observed difference?

Hint

It says the observed difference in conversion rates is 3.49 standard errors above zero. In other words, it is 3.49 times as large as the typical sample-to-sample wobble you’d expect if there were no real effect. Differences that far out on the null bell curve are extremely rare by chance, which is why the p-value is tiny and the result is significant.

Exercise 2: Why double the tail?

The code computes p = 2 * (1 - norm.cdf(abs(z))). Why the factor of 2?

Hint

It’s a two-sided test: before the experiment, the new page could have been better or worse, so a “surprising” result is one far from zero in either direction. 1 - norm.cdf(abs(z)) is the area in one tail; doubling it counts both tails. If you had a strict one-directional hypothesis, you’d use a one-sided p-value (no doubling) — but two-sided is the safe default and the norm.
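
A minimal sketch of the difference, using Lumen’s z of about 3.49 and the standard library’s NormalDist. The one-sided version assumes the alternative hypothesis “the new page is better” (p₂ > p₁), which is an illustrative choice and not one the lesson commits to.

```python
from statistics import NormalDist

z = 3.49                      # Lumen's z statistic, rounded
phi = NormalDist().cdf        # standard normal CDF

p_one_sided = 1 - phi(z)              # upper tail only: "new page is better"
p_two_sided = 2 * (1 - phi(abs(z)))   # both tails: "new page is different"

print(f"one-sided p = {p_one_sided:.5f}")
print(f"two-sided p = {p_two_sided:.5f}")
```

For a positive z, the two-sided p-value is exactly twice the one-sided one.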

Exercise 3: Pooled vs. unpooled

The test estimates the standard error with a pooled rate (c1 + c2) / (n1 + n2). Under what assumption is pooling the right thing to do?

Hint

Pooling is right under the null hypothesis — that both groups have the same true rate. Since the p-value is computed assuming the null is true, estimating one shared rate from all the data is the correct standard error for that question. When you drop the null assumption to estimate the actual effect size (the confidence interval), you no longer pool.


Summary

The two-proportion z-test turns two conversion rates into a verdict by comparing the effect to the noise. The effect is the difference in rates (p₂ − p₁); the noise is the pooled standard error √[p_pool(1−p_pool)(1/n₁ + 1/n₂)], pooled because the null assumes both groups share one rate; and the z statistic is their ratio — how many standard errors the difference sits from zero. Converting z to a two-sided p-value with the normal CDF gives the probability of a difference this extreme under “no effect.” Run on Lumen’s real seeded experiment (503/5000 vs 613/5000), the test returns a 0.0220 difference, z = 3.49, and p = 0.00048 — the 2.2-point lift is statistically significant, not noise.

Key Concepts

  • Effect over noise: every test is (difference) / (standard error); here that ratio is z.
  • Pooled standard error: under the null both groups share a rate, so pool to estimate the noise.
  • z statistic: how many standard errors the observed difference is from zero.
  • Two-sided p-value: 2·(1 − Φ(|z|)); the chance of a difference this extreme if the change did nothing.

Why This Matters

The two-proportion z-test is the workhorse of A/B testing: conversion, signup, click-through, and retention rates are all proportions, and this is how you decide whether a change to any of them is real. Building it from its parts, rather than calling a black-box function, is what lets you trust the output and debug it when a library disagrees. But a p-value alone is a blunt instrument: it tells you the data are hard to explain by chance without telling you how big the effect is or how often you’d be fooled. The next two lessons sharpen it, first by reading the p-value honestly, then by adding a confidence interval that quantifies the effect itself.


Next Steps

Continue to Lesson 2 - Reading the P-Value

What a p-value really means, the misinterpretations to avoid, and one-sided vs two-sided tests.


Continue Building Your Skills

You built the two-proportion z-test from its parts — difference, pooled standard error, z, and p-value — and ran it on Lumen’s real experiment to find a statistically significant 2.2-point lift (z = 3.49, p = 0.0005). Next you’ll make sure you can read that p-value correctly: what it does and doesn’t mean, and the misinterpretations that lead teams to over- or under-trust their results.