Lesson 5 - Guided Project: Analyze Lumen's Signup Experiment
Welcome to the Guided Project
Across this course you designed Lumen’s signup experiment (Module 2), sized it so it could actually detect the effect you cared about (Module 3), and then, in this module, built every analysis tool by hand: the two-proportion z-test, the honest reading of a p-value, the confidence interval for the difference, and the decision framework that turns statistics into a call. Now you close the loop. The experiment ran; the data is in. In this project you’ll take Lumen’s signup test from raw counts all the way to a shipped verdict — rates, z-test, confidence interval, decision — and write the experiment readout, the short document a real analyst hands to the team. By the end you won’t just have a p-value; you’ll have a defensible decision and the paper trail behind it.
By the end of this project, you will be able to:
- Load an experiment’s results and compute the conversion rates for each arm
- Run the two-proportion z-test and read its z statistic and p-value
- Build the 95% confidence interval for the difference and connect it to significance
- Apply the four-check decision framework and write the experiment readout that ships
We’ll build it in stages, reusing the exact functions you wrote earlier in the module. Let’s analyze Lumen’s experiment.
Stage 1: Load the Result
Start with the data. Lumen’s signup experiment split 10,000 users evenly: 5,000 saw the current signup page (control) and 5,000 saw the redesigned one (treatment). We regenerate it with the same seed used when the experiment was designed and sized, so the counts match every earlier module exactly — nothing has drifted.
import numpy as np
from scipy.stats import norm
rng = np.random.default_rng(7) # SAME seed as M1/M2/M3
c1 = int((rng.random(5000) < 0.10).sum()) # control conversions
c2 = int((rng.random(5000) < 0.12).sum()) # treatment conversions
print(f"control: {c1}/5000 = {c1/5000:.4f}")
print(f"treatment: {c2}/5000 = {c2/5000:.4f}")
Running it:
control: 503/5000 = 0.1006
treatment: 613/5000 = 0.1226
There’s the result. The current page converted 503 of 5,000 users (10.06%) and the redesign converted 613 of 5,000 (12.26%). The new page looks better by a bit over two points. But you already know the discipline from Module 3: an observed gap is not a proven gap. Before anyone celebrates, we have to ask whether a difference this size could plausibly be noise. That’s Stage 2.
Stage 2: The Z-Test
The two-proportion z-test (Lesson 1) answers “could this gap be chance?” by comparing the effect — the difference in rates — to the noise — the pooled standard error. Their ratio is the z statistic, and the normal curve turns z into a two-sided p-value.
def two_prop_ztest(c1, n1, c2, n2):
    p1, p2 = c1/n1, c2/n2
    pp = (c1+c2)/(n1+n2) # pooled rate, under the null
    se = np.sqrt(pp*(1-pp)*(1/n1+1/n2)) # pooled standard error
    z = (p2-p1)/se
    return p2-p1, z, 2*(1-norm.cdf(abs(z))) # difference, z, two-sided p-value
diff, z, p = two_prop_ztest(c1, 5000, c2, 5000)
print(f"difference = {diff:.4f} z = {z:.3f} p = {p:.5f}")
Running it:
difference = 0.0220 z = 3.493 p = 0.00048
The 2.20-point lift sits z = 3.49 standard errors above zero, and the two-sided p-value is 0.00048, roughly 5 in 10,000. If the redesign truly did nothing, a difference this large would almost never appear by chance; since it did appear, we reject “no effect.” The result is statistically significant. That settles whether there’s an effect. It doesn’t yet tell us how big the effect is, and the size is what the ship decision hangs on. On to the interval.
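As a hand check on the function's output, the arithmetic can be written out from the counts reported in Stage 1. The subscript labels below are notation chosen for this sketch, not taken from the lesson's code.

```latex
\begin{align*}
\hat{p}_{\text{pooled}} &= \frac{c_1 + c_2}{n_1 + n_2} = \frac{503 + 613}{5000 + 5000} = 0.1116 \\
\mathrm{SE}_{\text{pooled}} &= \sqrt{\hat{p}_{\text{pooled}}\,(1 - \hat{p}_{\text{pooled}})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
  = \sqrt{0.1116 \times 0.8884 \times 0.0004} \approx 0.006297 \\
z &= \frac{p_2 - p_1}{\mathrm{SE}_{\text{pooled}}} = \frac{0.1226 - 0.1006}{0.006297} \approx 3.49
\end{align*}
```

The lift is about three and a half times the size of the noise, which is why the p-value is so small.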
Stage 3: The Confidence Interval
The p-value confirms the lift is real; the confidence interval (Lesson 3) tells us how large it plausibly is. It uses the unpooled standard error — because here we’re estimating the actual effect, not testing the null, so we don’t assume both groups share one rate.
def diff_ci(c1, n1, c2, n2, conf=0.95):
    p1, p2 = c1/n1, c2/n2
    se = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2) # unpooled SE
    z = norm.ppf(1-(1-conf)/2) # critical value, about 1.96 for 95%
    d = p2-p1
    return d - z*se, d + z*se
lo, hi = diff_ci(c1, 5000, c2, 5000)
print(f"95% CI for the difference: [{lo:.4f}, {hi:.4f}]")
print(f" = [{lo*100:+.2f}, {hi*100:+.2f}] percentage points")
Running it:
95% CI for the difference: [0.0097, 0.0343]
= [+0.97, +3.43] percentage points
The best estimate of the lift is +2.20 points, and we’re 95% confident the true lift lies between +0.97 and +3.43 points. Notice the interval sits entirely above zero: it excludes 0. That’s not a coincidence; it’s the duality you saw in Lesson 3. A result significant at the 5% level has a 95% CI that excludes zero, and vice versa. (Because the test uses the pooled SE and the interval uses the unpooled SE, the two can disagree in rare borderline cases. Here the two standard errors are nearly identical and the result is far from the boundary, so they agree.) The p-value and the interval are two views of the same fact. The interval just adds what the p-value can’t: a range for the effect itself, which is exactly what the decision needs.
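To make the link between the test and the interval concrete, the sketch below recomputes both standard errors side by side. It hard-codes the counts reported in Stage 1 rather than regenerating them from the seed, and the names `se_pooled`, `se_unpooled`, `significant`, and `excludes_zero` are illustrative, not part of the lesson's functions.

```python
import numpy as np
from scipy.stats import norm

# Counts as reported in Stage 1 (hard-coded here, not regenerated from the seed).
c1, n1 = 503, 5000   # control
c2, n2 = 613, 5000   # treatment

p1, p2 = c1 / n1, c2 / n2
d = p2 - p1

# Pooled SE: used by the z-test, assumes one shared rate under the null.
pp = (c1 + c2) / (n1 + n2)
se_pooled = np.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))

# Unpooled SE: used by the confidence interval, each arm keeps its own rate.
se_unpooled = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Two-sided p-value from the pooled SE, 95% interval from the unpooled SE.
p_value = 2 * (1 - norm.cdf(abs(d / se_pooled)))
z_crit = norm.ppf(0.975)
lo, hi = d - z_crit * se_unpooled, d + z_crit * se_unpooled

significant = p_value < 0.05
excludes_zero = lo > 0 or hi < 0

print(f"pooled SE   = {se_pooled:.6f}")
print(f"unpooled SE = {se_unpooled:.6f}")
print(f"significant at 5%: {significant}, CI excludes 0: {excludes_zero}")
```

With rates this close together the two standard errors differ only in the sixth decimal place, so the test and the interval tell the same story.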
Stage 4: The Decision
Now apply the four-check framework from Lesson 4. Statistics don’t ship features; decisions do, and a decision weighs four things in order.
- Statistically significant? Yes. p = 0.00048 is well below 0.05, and the 95% CI [+0.97, +3.43] excludes zero. The lift is real, not noise.
- Practically significant? This is the judgment call. Lumen set a minimum detectable effect of +2 points back in Module 3 — the smallest lift worth shipping. The point estimate, +2.20 points, clears that bar. But the confidence interval’s lower bound, +0.97, sits below it. So the honest reading is: the effect is real and positive, our best estimate meets the goal, but there’s genuine uncertainty at the low end — the true lift could be as small as about one point.
- Guardrails OK? Assume the guardrail metrics (page load, error rate, downstream activation) held steady — no regressions flagged. In a real analysis you’d confirm this explicitly before shipping.
- The call. Ship. The result is a clearly positive, statistically significant lift whose best estimate meets the bar Lumen set. That’s a win. We ship it while being honest about the downside: the true effect might land nearer +1 point than +2, so we note that in the readout and can keep monitoring the metric post-launch — or extend the test if the team wants the interval tightened before committing further investment.
The discipline here is not to overstate. “Significant” does not mean “huge and certain.” Lumen’s lift is real and worth shipping, and the interval’s low end is a caveat, not a veto. A good analyst ships the win and names the uncertainty in the same breath.
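One way to keep the four checks honest is to write them down as code. The sketch below is a hypothetical helper, not something built earlier in the module: the name `four_check_decision`, the "HOLD" label, and the exact wording of the notes are all assumptions made for illustration. It judges practical significance on the point estimate, as this stage does, and attaches a caveat when the interval's lower bound falls below the bar.

```python
def four_check_decision(diff, p_value, ci_low, mde, guardrails_ok, alpha=0.05):
    """Hypothetical helper: apply the checks in order, return (verdict, notes)."""
    # Check 1: statistical significance.
    if p_value >= alpha:
        return "HOLD", ["Not statistically significant."]
    # Check 2: practical significance, judged on the point estimate.
    if diff < mde:
        return "HOLD", ["Best estimate is below the minimum worthwhile lift."]
    # Check 3: guardrails.
    if not guardrails_ok:
        return "HOLD", ["Guardrail regression: fix and re-test, or escalate."]
    # Check 4: the call, with a caveat when the interval dips below the bar.
    notes = []
    if ci_low < mde:
        notes.append("CI lower bound is below the bar: monitor post-launch.")
    return "SHIP", notes


# Lumen's numbers from Stages 2 and 3, with the +2-point bar expressed as 0.02.
verdict, notes = four_check_decision(
    diff=0.0220, p_value=0.00048, ci_low=0.0097, mde=0.02, guardrails_ok=True
)
print(verdict, notes)
```

A function like this does not replace judgment. It only makes the order of the checks, and the caveat, hard to skip.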
Stage 5: The Readout
The analysis is done. The deliverable isn’t the p-value floating in a Slack message — it’s a short, structured experiment readout the whole team can read and act on. Here’s the document, assembled from everything above:
EXPERIMENT READOUT — Lumen Signup Page Redesign
================================================
RESULT
  Control (current page):  503 / 5,000 = 10.06%
  Treatment (new page):    613 / 5,000 = 12.26%
  Observed lift:           +2.20 percentage points
SIGNIFICANCE
  Two-proportion z-test:   z = 3.49, p = 0.00048
  Significant at the 5% level. The lift is not noise.
EFFECT SIZE
  95% confidence interval: [+0.97, +3.43] points
  Best estimate +2.20; true lift very likely between ~1 and ~3.4 points.
  Interval excludes 0 (consistent with the significant p-value).
DECISION: SHIP
  Rationale: A statistically significant, positive lift whose best
             estimate (+2.20) meets the pre-registered +2-point MDE.
  Caveat:    The interval's lower bound (+0.97) is below the +2-point
             bar, so the true lift could be as small as ~1 point.
             Guardrails held. Recommend shipping and monitoring the
             signup rate post-launch; extend the test if a tighter
             estimate is needed before further investment.
That’s the product of the whole module, and of the whole course so far. It names the result, shows it is very unlikely to be chance, quantifies how big it is, and turns all of that into a decision with a rationale and an honest caveat. Anyone on the team can read it in thirty seconds and know what happened and what to do next.
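If you write readouts often, the numeric sections can be generated straight from the analysis outputs so that the document never drifts from the numbers. The sketch below is one possible shape for that. `format_readout` is a hypothetical helper, not part of the module's code, and it covers only the numeric lines; the rationale and caveat still need to be written by a person.

```python
def format_readout(c1, n1, c2, n2, z, p, lo, hi, verdict):
    """Hypothetical formatter: build the numeric sections of a readout."""
    p1, p2 = c1 / n1, c2 / n2
    lines = [
        "RESULT",
        f"  Control:   {c1:,} / {n1:,} = {p1:.2%}",
        f"  Treatment: {c2:,} / {n2:,} = {p2:.2%}",
        f"  Observed lift: {(p2 - p1) * 100:+.2f} percentage points",
        "SIGNIFICANCE",
        f"  Two-proportion z-test: z = {z:.2f}, p = {p:.5f}",
        "EFFECT SIZE",
        f"  95% confidence interval: [{lo * 100:+.2f}, {hi * 100:+.2f}] points",
        f"DECISION: {verdict}",
    ]
    return "\n".join(lines)


# Values as reported in Stages 1 to 4 of this project.
text = format_readout(503, 5000, 613, 5000, 3.493, 0.00048, 0.0097, 0.0343, "SHIP")
print(text)
```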
The readout is the product
The most common analyst mistake isn’t a math error — it’s stopping at the p-value. “p = 0.0005, ship it 🚀” in a Slack thread is not a decision anyone can audit, revisit, or trust six months later. The readout is the actual deliverable: a short doc that states the result, the evidence, the effect size with its interval, and a decision with a rationale and a named caveat. A clear readout beats a dramatic p-value every time, because it survives the meeting, the follow-up question, and the post-launch review. Write the readout — that’s the job.
Practice Exercises
Exercise 1: Report the relative lift with a CI
The readout gives the absolute lift (+2.20 points). Stakeholders often want the relative lift — the percent increase over control. Compute it, and give a rough confidence interval for it too.
Hint
The relative lift is (p2 - p1) / p1 = 0.0220 / 0.1006 ≈ 0.219, a ~21.9% relative increase in signups. For a quick interval, divide the absolute CI bounds by the control rate: [0.0097, 0.0343] / 0.1006 ≈ [+9.6%, +34.1%]. This shortcut treats the control rate as a fixed number, so it slightly understates the true uncertainty. Even so, the interval is wide, which is exactly why you report the absolute effect as the primary number and the relative one as color.
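A minimal sketch of that calculation, assuming the counts from Stage 1 and the rounded absolute interval from Stage 3 (the variable names are illustrative):

```python
# Counts as reported in Stage 1.
c1, n1 = 503, 5000   # control
c2, n2 = 613, 5000   # treatment
p1, p2 = c1 / n1, c2 / n2

# Relative lift: the absolute difference as a fraction of the control rate.
rel_lift = (p2 - p1) / p1

# Rough interval: scale the (rounded) absolute 95% CI by the control rate.
# This treats p1 as a fixed number, which is a simplification.
abs_lo, abs_hi = 0.0097, 0.0343
rel_lo, rel_hi = abs_lo / p1, abs_hi / p1

print(f"relative lift = {rel_lift:.1%}")
print(f"rough 95% CI  = [{rel_lo:+.1%}, {rel_hi:+.1%}]")
```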
Exercise 2: Redo the interval at 90% confidence
Recompute the confidence interval with conf=0.90 instead of 0.95. Is it wider or narrower, and why? Does it change the ship decision?
Hint
Call diff_ci(c1, 5000, c2, 5000, conf=0.90). A 90% interval uses a smaller z multiplier (norm.ppf(0.95) ≈ 1.645 instead of 1.96), so it’s narrower: you’re asking for less confidence, so you can commit to a tighter range. It still excludes zero (a result significant at the 5% level is also significant at the 10% level), so the ship decision doesn’t change. Narrower isn’t “better,” though: it reflects a lower confidence level, not more data.
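A short sketch of the comparison, restating `diff_ci` from Stage 3 so the snippet runs on its own and hard-coding the Stage 1 counts (the `widths` dictionary is just a convenience for this example):

```python
import numpy as np
from scipy.stats import norm

def diff_ci(c1, n1, c2, n2, conf=0.95):
    """Confidence interval for p2 - p1 using the unpooled standard error."""
    p1, p2 = c1 / n1, c2 / n2
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(1 - (1 - conf) / 2)   # critical value for the chosen level
    d = p2 - p1
    return d - z * se, d + z * se

widths = {}
for conf in (0.95, 0.90):
    lo, hi = diff_ci(503, 5000, 613, 5000, conf=conf)
    widths[conf] = hi - lo
    print(f"{conf:.0%} CI: [{lo*100:+.2f}, {hi*100:+.2f}] points, "
          f"width = {(hi - lo)*100:.2f}")

lo90, hi90 = diff_ci(503, 5000, 613, 5000, conf=0.90)
```

The 90% interval comes out at roughly +1.2 to +3.2 points. It is tighter than the 95% one, still entirely above zero, and still reaches below the +2-point bar.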
Exercise 3: What if a guardrail had regressed?
The decision assumed guardrails held. Suppose the readout instead showed the new page’s median load time rose noticeably and downstream activation dipped. How would that change the verdict, even with the same significant +2.20-point signup lift?
Hint
A guardrail regression can override a winning primary metric. If the signup lift comes at the cost of slower pages or fewer users completing activation, you’re trading a headline win for a real cost, and shipping might be net-negative even though the signup lift is significant. The honest move is to not ship as-is: flag the trade-off in the readout, quantify the guardrail hit, and either fix the regression and re-test, or escalate the trade-off to the team rather than deciding it silently. Significance on the primary metric is necessary, not sufficient.
Summary
You analyzed Lumen’s signup experiment end to end and produced the deliverable a real analyst ships. Starting from the seeded result — 503/5,000 control vs 613/5,000 treatment, 10.06% vs 12.26% — you ran the two-proportion z-test to get a 0.0220 difference, z = 3.49, p = 0.00048 (significant), built the 95% confidence interval [+0.97, +3.43] points with the unpooled SE (excluding zero, matching the p-value), applied the four-check decision framework, and wrote the experiment readout. The verdict: ship — a clearly positive, significant lift whose best estimate meets the +2-point bar — with an honest caveat that the true lift could be as small as ~1 point. Every number in this lesson was computed for real with numpy and scipy on the seed-7 experiment.
Key Concepts
- Rates first — turn raw counts into conversion rates before doing anything else (503/5,000 = 10.06%).
- Z-test for significance — is the gap real? Difference over pooled SE gives z = 3.49, p = 0.00048.
- CI for effect size — how big is the gap? The unpooled 95% CI [+0.97, +3.43] excludes zero, matching significance.
- Readout over p-value — the shipped product is a structured decision doc: result, significance, effect size, decision, caveat.
Why This Matters
This is the full arc of an experiment analysis, and it’s exactly what you’ll do on the job: raw counts to rates, rates to a test, a test to an interval, an interval to a decision, and a decision to a document. Teams that stop at the p-value ship on gut feel and can’t defend the call later; teams that write the readout make decisions that survive scrutiny. You now have the whole loop for proportion metrics — design, size, analyze, decide — which is the backbone of trustworthy experimentation. The next module carries the same discipline to metrics that are means, not proportions: revenue, time-on-page, order value, where the two-sample t-test replaces the z-test but the arc is identical.
Next Steps
Continue to Module 5 - Analyzing Mean Metrics
Revenue and time metrics with the two-sample t-test — when the metric is a mean, not a proportion.
Continue Building Your Skills
You took Lumen’s signup experiment from raw counts to a shipped readout — computing the rates, running the z-test, building the confidence interval, applying the decision framework, and writing the decision doc that closes the loop. That’s the complete workflow for a proportion metric. The next module keeps the exact same arc but changes the metric type: revenue per user, time on page, average order value — continuous outcomes where you compare means with the two-sample t-test. Same discipline, new tool. On to analyzing mean metrics.