Lesson 4 - Making the Ship Decision
Welcome to the Ship Decision
You’ve built the two-proportion z-test, learned to read the p-value honestly, and wrapped the effect in a confidence interval. Lumen’s numbers are settled: the new signup page lifted conversion by +2.20 points, with p = 0.00048 and a 95% CI of [+0.97, +3.43] points. So — do you ship it? The temptation is to glance at the tiny p-value, declare victory, and roll it out. But significance is necessary, not sufficient. A result can be statistically real and still be the wrong thing to ship: too small to justify the cost, or a primary win that quietly broke something else. This lesson turns a test result into a decision using four checks, then walks all four honestly on Lumen — including the part that’s genuinely borderline.
By the end of this lesson, you will be able to:
- Run the four ship-decision checks: significance, practical significance, guardrails, and the confidence interval
- Compare an observed effect to the MDE you set at design time, using the CI rather than just the point estimate
- Make an honest ship call on a borderline result without overstating certainty
- Recognize the three anti-patterns that lead teams to ship — or kill — the wrong thing
Let’s turn the number into a decision.
Four Checks, Not One
Shipping on the p-value alone is the most common mistake in experimentation. A complete decision runs four checks in order, and a “no” on any of them changes the call:
- Statistically significant? Is p < α, and does the confidence interval exclude 0? If both hold, the effect is real, not something noise alone would routinely produce. If not, you can’t rule out chance, and you may simply be underpowered (Lesson 2’s warning that “not significant” is not “proven no effect”).
- Practically significant? Is the effect big enough to be worth it? Compare it to the minimum detectable effect (MDE) you committed to back in Module 2 when you sized the test. A real but tiny lift may not justify the engineering, maintenance, and risk of rolling out a change.
- Guardrails OK? Did anything critical regress — page load time, refund rate, support tickets, latency? Even a real, large primary win shouldn’t ship if it broke a guardrail. The primary metric is what you wanted to move; guardrails are what you promised not to break.
- Read the interval, not just the point. The point estimate is one number; the CI is the range of effects the data is consistent with. The decision lives in that range — especially where the bounds fall relative to your MDE.
Notice what this sequence encodes: significance gets you into the decision, but it doesn’t end it. Only a “yes” on the first three checks (real, worthwhile, and safe), judged against the full interval rather than the point estimate alone, leads to shipping.
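The four checks can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the names `ExperimentResult` and `ship_decision`, the field names, and the returned labels are all hypothetical, and the function treats the practical-significance check as a flag for judgment rather than a hard gate.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    """Hypothetical container for a finished test; all lifts in percentage points."""
    p_value: float         # two-sided p-value from the z-test
    point_estimate: float  # observed lift
    ci_low: float          # lower bound of the confidence interval
    ci_high: float         # upper bound of the confidence interval
    guardrails_ok: bool    # True only if every guardrail was actually verified


def ship_decision(r: ExperimentResult, mde: float, alpha: float = 0.05) -> str:
    # Check 1: statistically significant? Needs p < alpha AND a CI that excludes 0.
    excludes_zero = r.ci_low > 0 or r.ci_high < 0
    if not (r.p_value < alpha and excludes_zero):
        return "inconclusive: cannot rule out chance"
    if r.ci_high < 0:
        return "no-ship: significant negative effect"

    # Check 2: practically significant? Compare the best estimate to the MDE.
    if r.point_estimate < mde:
        return "judgment call: real but below the MDE"

    # Check 3: guardrails OK? A primary win does not buy back a regression.
    if not r.guardrails_ok:
        return "no-ship: guardrail regressed"

    # Check 4: read the interval. Does the lower bound also clear the MDE?
    if r.ci_low < mde:
        return "ship, flagging a borderline interval"
    return "ship"


# Usage with the Lumen numbers quoted in this lesson (guardrails assumed fine).
lumen = ExperimentResult(p_value=0.00048, point_estimate=2.20,
                         ci_low=0.97, ci_high=3.43, guardrails_ok=True)
print(ship_decision(lumen, mde=2.0))
```

The labels are deliberately coarse. The function's job is to make sure no check is skipped, while the weighing of cost and risk stays with the person making the call.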
Applying the Four Checks to Lumen
Now run the checks on the numbers you already computed. No new statistics here — just judgment applied to the verified results.
Check 1 — Statistically significant? Yes. p = 0.00048 is far below the standard α = 0.05, and the 95% CI [+0.97, +3.43] excludes 0. The lift is real, not noise. ✓
Check 2 — Practically significant? This is where honesty matters. At design time you set an MDE of +2 points — the smallest lift worth detecting and, implicitly, worth shipping. The observed point estimate is +2.20, which clears the bar. But the CI’s lower bound is +0.97, which sits below the 2-point MDE. So while the best estimate beats the practical bar, the data is also consistent with a true lift as small as ~1 point — below what you decided was worth it. The point estimate passes; the interval doesn’t fully clear the bar. ⚠
Check 3 — Guardrails OK? For this scenario we’ll assume the guardrails held — load time, refunds, and support volume all steady. (State this as an assumption; in a real analysis you’d verify each one before trusting it.) ✓
Check 4 — Read the interval. The plausible range is +0.97 to +3.43 points. The center of that range is comfortably positive and above the MDE, but the lower tail dips under it. This is a clearly positive, significant effect whose best guess is worth shipping — with real uncertainty about whether the true lift clears the practical bar.
The pragmatic read: a significant, clearly-positive lift whose best estimate beats the MDE, with no guardrail regression, is normally a ship — but you ship flagging the uncertainty, not claiming certainty. The true lift could be as small as ~1 point, and you shouldn’t pretend otherwise. Two reasonable paths: ship and monitor (roll out, watch the metric on live traffic, and be ready to react), or run longer to tighten the interval and pull the lower bound above the MDE before committing. Both are defensible; what isn’t defensible is either declaring “+2.20, done” as if the point estimate were the whole story, or killing a real, positive win purely because one bound is borderline.
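To weigh the run-longer path, it helps to have a rough sense of how much more data it would take. The sketch below rests on two loudly stated assumptions: the CI half-width shrinks in proportion to 1/sqrt(n), as it does for the normal-approximation interval used in this module, and the point estimate stays at its current value, which it will not do exactly in practice. The function name `sample_multiplier` is hypothetical.

```python
def sample_multiplier(ci_low: float, ci_high: float, mde: float) -> float:
    """Factor by which the sample would need to grow for the CI's lower
    bound to reach the MDE, assuming the point estimate does not move."""
    point = (ci_low + ci_high) / 2        # midpoint of a symmetric interval
    half_width = (ci_high - ci_low) / 2   # current margin of error
    target_half_width = point - mde       # margin at which lower bound == MDE
    if target_half_width <= 0:
        raise ValueError("point estimate does not exceed the MDE")
    # Half-width scales with 1/sqrt(n), so n scales with the squared ratio.
    return max(1.0, (half_width / target_half_width) ** 2)


# Lumen: CI of [+0.97, +3.43] points against a 2-point MDE.
factor = sample_multiplier(0.97, 3.43, mde=2.0)
print(f"Sample would need to grow by roughly {factor:.0f}x")
```

For Lumen the half-width is 1.23 points and the target is 0.20 points, so the factor is about (1.23 / 0.20)², or roughly 38 times the current sample. Under these assumptions, tightening the interval that far is expensive, which is worth knowing when comparing it with ship-and-monitor.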
Significant is not the same as worth it
With a large enough sample, any nonzero effect eventually becomes statistically significant — the CI shrinks until it excludes 0 even for a trivial 0.1-point lift. That’s why practical significance is a separate check. A result can be significant (p < α, CI excludes 0) yet fail to clear the MDE — real, but too small to be worth the cost. Significance answers “is it there?”; the MDE answers “is it big enough to act on?” You need both, and they are not the same question.
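A short sketch makes the sample-size effect concrete. The rates here are hypothetical (10.0% control versus 10.1% treatment, a 0.1-point lift), and the observed rates are held fixed at those values so that only the sample size changes. The helper `diff_ci` is an illustrative name; it uses the unpooled standard error for the interval.

```python
import math


def diff_ci(p_a: float, p_b: float, n_per_arm: int, z: float = 1.96):
    """95% normal-approximation CI for p_b - p_a with equal arm sizes."""
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm)
    return diff - z * se, diff + z * se


# Same trivial 0.1-point lift, increasingly large samples.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    lo, hi = diff_ci(0.100, 0.101, n)
    excludes_zero = lo > 0 or hi < 0
    print(f"n per arm = {n:>10,}: CI = [{lo * 100:+.3f}, {hi * 100:+.3f}] points, "
          f"excludes 0: {excludes_zero}")
```

At 10,000 users per arm the interval comfortably straddles 0; by 1,000,000 per arm it excludes 0, even though the lift is still only 0.1 points and nowhere near a 2-point MDE.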
Three Anti-Patterns to Avoid
The four checks exist to head off three recurring mistakes:
- Shipping on significance alone. Seeing p < 0.05 and rolling out without checking effect size or guardrails. A significant result can be practically trivial, or can hide a guardrail regression that outweighs the primary win. Green p-value, red guardrail: still a no-ship.
- Killing a real win for being “only” borderline. Discarding a significant, positive lift because the CI’s lower bound dips slightly below the MDE, without weighing the cost. If the change is cheap and low-risk, a probable ~2-point lift is worth taking even with a soft lower bound. Practical significance is a judgment against cost and risk, not a hard gate.
- Reading “not significant” as “proven no effect.” (The Lesson 2 callback.) A non-significant result means you couldn’t rule out chance, often because the test was underpowered; it does not mean the effect is zero. Don’t report an inconclusive test as a confirmed null.
Each anti-pattern comes from collapsing a four-part decision into one number. The discipline is to run all four checks, every time, and to say out loud where the uncertainty lives.
Practice Exercises
Exercise 1: Green p-value, red guardrail
A test on a new checkout flow shows a significant +1.8-point lift in purchase rate (p = 0.003, CI excludes 0), clearing the 1-point MDE. But median page load time rose by 400 ms — a guardrail you’d committed to protecting. Ship or not?
Hint
Don’t ship as-is. Checks 1 and 2 pass — the lift is real and clears the MDE — but Check 3 fails: a guardrail regressed. A 400 ms load-time hit can suppress conversion elsewhere and hurt the experience, and it may even erode the very lift you measured over time. The primary win doesn’t buy back a broken guardrail. Fix the latency and re-test, rather than shipping the regression alongside the win.
Exercise 2: Point estimate vs. the interval
Lumen’s lift is +2.20 points with a 95% CI of [+0.97, +3.43], against a 2-point MDE. A teammate says, “The lift is 2.20, which beats our 2-point bar, so it’s a clean ship.” What’s incomplete about that reasoning?
Hint
It treats the point estimate as the whole answer and ignores the interval. Yes, +2.20 exceeds the 2-point MDE — but the CI’s lower bound of +0.97 sits below it, so the data is also consistent with a true lift under the practical bar. The decision should acknowledge that the true effect could be as small as ~1 point. It’s still likely a ship (significant, positive, best estimate above the MDE, guardrails fine), but “clean” overstates the certainty — the honest version flags the soft lower bound and considers running longer or shipping-and-monitoring.
Exercise 3: A flat result
A different experiment returns p = 0.42, with a 95% CI of [−1.1, +2.6] points. A stakeholder concludes, “No effect — the change does nothing, kill it.” Is that the right read?
Hint
No — that’s the “not significant means proven no effect” trap from Lesson 2. The result is non-significant (the CI includes 0), so you can’t rule out chance — but the interval also includes lifts as large as +2.6 points, which would be well worth having. The test simply didn’t have the power to resolve the effect. The honest read is “inconclusive,” and the right move is usually to run longer or with more traffic, not to declare the effect zero.
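The reasoning in this hint can be sketched as a check on where the interval's bounds fall. This is an illustration only: the function name `read_flat_result` and its labels are hypothetical, and because Exercise 3 does not state an MDE, the 2-point `worthwhile_lift` used below is an assumption borrowed from the Lumen scenario.

```python
def read_flat_result(ci_low: float, ci_high: float, worthwhile_lift: float) -> str:
    """Describe a result by where its CI sits relative to 0 and to a lift
    that would be worth having (all values in percentage points)."""
    if ci_low > 0 or ci_high < 0:
        return "significant: run the four checks"
    # CI includes 0. Does it still include lifts large enough to matter?
    if ci_high >= worthwhile_lift:
        return "inconclusive: a worthwhile lift is still plausible"
    # CI includes 0 and excludes lifts of the worthwhile size.
    return "non-significant: the data is not consistent with a worthwhile lift"


# Exercise 3: CI of [-1.1, +2.6] points; the 2-point bar is an assumption.
print(read_flat_result(-1.1, 2.6, worthwhile_lift=2.0))
```

Only the last branch comes close to supporting “kill it,” and even then the claim is that a lift of the worthwhile size looks unlikely, not that the effect is exactly zero.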
Summary
A ship decision is four checks, not one. Statistical significance (p < α and the CI excludes 0) tells you the effect is real; practical significance compares that effect to the MDE you set at design time to ask whether it’s big enough to be worth it; guardrails confirm nothing critical regressed; and reading the full confidence interval — not just the point estimate — shows the range of effects the data supports. On Lumen, the call is honestly borderline: p = 0.00048 and a CI of [+0.97, +3.43] make the +2.20-point lift clearly real and positive, and the point estimate beats the 2-point MDE — but the lower bound dips below it, so the true lift could be as small as ~1 point. With guardrails assumed fine, that’s normally a ship, made while flagging the uncertainty, with running-longer as a reasonable alternative. The lesson’s core discipline is to avoid the three anti-patterns: shipping on significance alone, killing a real win for being borderline, and mistaking “not significant” for “no effect.”
Key Concepts
- Significance is necessary, not sufficient: p < α and CI-excludes-0 get you into the decision, not through it.
- Practical significance: compare the effect to the MDE from design; real-but-tiny may not be worth the cost.
- Guardrails: a primary win doesn’t ship if it broke something you promised to protect.
- Decide on the interval: the CI’s bounds relative to the MDE carry the decision, not the point estimate alone.
Why This Matters
The gap between a green p-value and a good decision is where experimentation programs earn or lose their credibility. Teams that ship on significance alone accumulate trivial changes and hidden regressions; teams that demand a perfectly clean interval kill real wins and grind to a halt. The four checks — and the honesty to say “significant, positive, but borderline on the MDE” out loud — are what let you act decisively and accurately on a result like Lumen’s. Next, you’ll put every skill from this module together in a guided project: analyzing Lumen’s signup experiment end to end, from raw counts to a defensible ship decision.
Next Steps
Continue to Lesson 5 - Guided Project: Analyze Lumen's Signup Experiment
Put it all together: from raw counts to z-test, p-value, confidence interval, and a defensible ship decision, end to end.
Continue Building Your Skills
You turned Lumen’s result into a decision using four checks — significance, practical significance against the MDE, guardrails, and the full confidence interval — and made an honest ship call on a genuinely borderline outcome without overstating certainty. Next you’ll run the entire module workflow yourself in a guided project, carrying Lumen’s signup experiment from raw conversion counts all the way to a defensible ship/no-ship recommendation.