Lesson 4 - Making the Ship Decision

Welcome to the Ship Decision

You’ve built the two-proportion z-test, learned to read the p-value honestly, and wrapped the effect in a confidence interval. Lumen’s numbers are settled: the new signup page lifted conversion by +2.20 points, with p = 0.00048 and a 95% CI of [+0.97, +3.43] points. So — do you ship it? The temptation is to glance at the tiny p-value, declare victory, and roll it out. But significance is necessary, not sufficient. A result can be statistically real and still be the wrong thing to ship: too small to justify the cost, or a primary win that quietly broke something else. This lesson turns a test result into a decision using four checks, then walks all four honestly on Lumen — including the part that’s genuinely borderline.

By the end of this lesson, you will be able to:

Run the four ship-decision checks: significance, practical significance, guardrails, and the confidence interval
Compare an observed effect to the MDE you set at design time, using the CI rather than just the point estimate
Make an honest ship call on a borderline result without overstating certainty
Recognize the three anti-patterns that lead teams to ship — or kill — the wrong thing

Let’s turn the number into a decision.

Four Checks, Not One

Shipping on the p-value alone is the most common mistake in experimentation. A complete decision runs four checks in order, and a “no” on any of them changes the call:

Statistically significant? Is p < α, and does the confidence interval exclude 0? If both hold, noise alone would rarely produce an effect this large, so you can treat it as real. If not, you can’t rule out chance, and you may simply be underpowered (Lesson 2’s warning that “not significant” is not “proven no effect”).
Practically significant? Is the effect big enough to be worth it? Compare it to the minimum detectable effect (MDE) you committed to back in Module 2 when you sized the test. A real but tiny lift may not justify the engineering, maintenance, and risk of rolling out a change.
Guardrails OK? Did anything critical regress — page load time, refund rate, support tickets, latency? Even a real, large primary win shouldn’t ship if it broke a guardrail. The primary metric is what you wanted to move; guardrails are what you promised not to break.
Read the interval, not just the point. The point estimate is one number; the CI is the range of effects the data is consistent with. The decision lives in that range — especially where the bounds fall relative to your MDE.

[Figure: a decision flow with three checks in sequence. 1. Significant? (p < α and CI excludes 0). 2. Big enough? (effect ≥ MDE, worth it). 3. Guardrails OK? (nothing critical regressed). All yes leads to SHIP the change (real, worthwhile, safe). Side outcomes: "Not significant" (can't rule out chance; don't ship, maybe underpowered) and "Significant but tiny, or a guardrail hit" (real but too small, or it broke something; usually don't ship).]
Caption: Significance is necessary, not sufficient: a ship needs a real effect, an effect big enough to matter, and no guardrail broken. All three.

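Notice what the flow encodes: significance gets you into the decision, but it doesn’t end it. Only a “yes” on all three gates (real, worthwhile, and safe) leads to shipping. The fourth check, reading the interval, is not a separate gate in the diagram; it governs how confidently you can answer the first two.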
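The flow can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the names ExperimentResult and ship_decision, the wording of the returned strings, and the choice to report a soft lower bound as "ship, flagging uncertainty" are all assumptions made for the example. Guardrail status is passed in as a flag because the lesson treats it as something you verify separately.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    """Hypothetical container for the numbers the checks need."""
    p_value: float         # two-sided p-value from the z-test
    point_estimate: float  # observed lift, in percentage points
    ci_low: float          # lower CI bound, in percentage points
    ci_high: float         # upper CI bound, in percentage points
    guardrails_ok: bool    # True only if every guardrail was verified


def ship_decision(result: ExperimentResult, mde: float, alpha: float = 0.05) -> str:
    """Walk the checks in the lesson's order and return a hedged recommendation."""
    # Check 1: statistically significant (p < alpha and the CI excludes 0)?
    ci_excludes_zero = result.ci_low > 0 or result.ci_high < 0
    if not (result.p_value < alpha and ci_excludes_zero):
        return "inconclusive: cannot rule out chance (possibly underpowered)"
    if result.ci_high < 0:
        return "do not ship: significant negative effect"

    # Check 2: practically significant (best estimate vs. the design-time MDE)?
    if result.point_estimate < mde:
        return "usually do not ship: real but below the MDE; weigh cost and risk"

    # Check 3: guardrails.
    if not result.guardrails_ok:
        return "do not ship as-is: a guardrail regressed"

    # Check 4: read the interval, not just the point estimate.
    if result.ci_low < mde:
        return "ship, flagging uncertainty: the CI lower bound is below the MDE"
    return "ship: the whole interval clears the MDE"


# Lumen's verified numbers, with guardrails assumed fine as in the lesson.
lumen = ExperimentResult(
    p_value=0.00048, point_estimate=2.20, ci_low=0.97, ci_high=3.43, guardrails_ok=True
)
print(ship_decision(lumen, mde=2.0))
```

The function returns text rather than a boolean on purpose: a bare True would hide exactly the uncertainty the fourth check exists to surface.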

Applying the Four Checks to Lumen

Now run the checks on the numbers you already computed. No new statistics here — just judgment applied to the verified results.

Check 1 — Statistically significant? Yes. p = 0.00048 is far below the standard α = 0.05, and the 95% CI [+0.97, +3.43] excludes 0. The lift is real, not noise. ✓

Check 2 — Practically significant? This is where honesty matters. At design time you set an MDE of +2 points — the smallest lift worth detecting and, implicitly, worth shipping. The observed point estimate is +2.20, which clears the bar. But the CI’s lower bound is +0.97, which sits below the 2-point MDE. So while the best estimate beats the practical bar, the data is also consistent with a true lift as small as ~1 point — below what you decided was worth it. The point estimate passes; the interval doesn’t fully clear the bar. ⚠

Check 3 — Guardrails OK? For this scenario we’ll assume the guardrails held — load time, refunds, and support volume all steady. (State this as an assumption; in a real analysis you’d verify each one before trusting it.) ✓

Check 4 — Read the interval. The plausible range is +0.97 to +3.43 points. The center of that range is comfortably positive and above the MDE, but the lower tail dips under it. This is a clearly positive, significant effect whose best guess is worth shipping — with real uncertainty about whether the true lift clears the practical bar.
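The comparison in Checks 2 and 4 is simple arithmetic, and it can help to lay it out explicitly. The sketch below uses only Lumen's numbers from this lesson; the function name position_vs_mde and the dictionary keys are hypothetical, chosen for illustration.

```python
def position_vs_mde(point_estimate: float, ci_low: float, ci_high: float, mde: float) -> dict:
    """Report where the point estimate and each CI bound sit relative to the MDE.

    All values are in percentage points.
    """
    return {
        "point_clears_mde": point_estimate >= mde,
        "lower_bound_clears_mde": ci_low >= mde,
        "upper_bound_clears_mde": ci_high >= mde,
        # How far the lower bound falls short of the MDE (0.0 if it clears it).
        "shortfall_at_lower_bound": max(0.0, mde - ci_low),
    }


lumen_position = position_vs_mde(point_estimate=2.20, ci_low=0.97, ci_high=3.43, mde=2.0)
for name, value in lumen_position.items():
    print(f"{name}: {value}")
```

For Lumen the point estimate and upper bound clear the 2-point bar while the lower bound falls about 1.03 points short, which is the "point estimate passes, interval doesn't fully clear" situation described above.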

The pragmatic read: a significant, clearly-positive lift whose best estimate beats the MDE, with no guardrail regression, is normally a ship — but you ship flagging the uncertainty, not claiming certainty. The true lift could be as small as ~1 point, and you shouldn’t pretend otherwise. Two reasonable paths: ship and monitor (roll out, watch the metric on live traffic, and be ready to react), or run longer to tighten the interval and pull the lower bound above the MDE before committing. Both are defensible; what isn’t defensible is either declaring “+2.20, done” as if the point estimate were the whole story, or killing a real, positive win purely because one bound is borderline.
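Before choosing the run-longer path, it is worth a rough estimate of what it would cost. The sketch below rests on loud assumptions that the lesson does not state: the point estimate stays at +2.20, the underlying conversion rates and traffic split do not change, and the CI half-width shrinks in proportion to 1/sqrt(n). The function name is hypothetical, and the result is a back-of-envelope figure, not a power calculation.

```python
def sample_multiplier_to_clear_mde(
    point_estimate: float, ci_low: float, ci_high: float, mde: float
) -> float:
    """Estimate how many times more data is needed for the lower bound to reach the MDE.

    ASSUMPTIONS (illustrative only): the point estimate does not move, and the
    CI half-width scales as 1 / sqrt(sample size).
    """
    half_width = (ci_high - ci_low) / 2
    # The lower bound reaches the MDE when the half-width equals this gap.
    target_half_width = point_estimate - mde
    if target_half_width <= 0:
        raise ValueError("point estimate does not exceed the MDE; more data cannot help here")
    return (half_width / target_half_width) ** 2


multiplier = sample_multiplier_to_clear_mde(
    point_estimate=2.20, ci_low=0.97, ci_high=3.43, mde=2.0
)
print(f"Approximate sample-size multiplier: {multiplier:.1f}x")
```

Under these assumptions the half-width would have to fall from 1.23 points to 0.20 points, which works out to roughly 38 times the current sample. In practice the point estimate will also move as data accumulates, so treat this only as a sense of scale when weighing "run longer" against "ship and monitor".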

Significant is not the same as worth it

With a large enough sample, any nonzero effect eventually becomes statistically significant — the CI shrinks until it excludes 0 even for a trivial 0.1-point lift. That’s why practical significance is a separate check. A result can be significant (p < α, CI excludes 0) yet fail to clear the MDE — real, but too small to be worth the cost. Significance answers “is it there?”; the MDE answers “is it big enough to act on?” You need both, and they are not the same question.
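A quick numerical sketch makes the point concrete. The 10.0% baseline and the per-arm sample sizes below are hypothetical, picked only to show the trend; the interval is the usual normal-approximation (unpooled) CI for a difference of two proportions with z = 1.96.

```python
import math


def diff_ci_points(p_control: float, p_treatment: float, n_per_arm: int, z: float = 1.96):
    """95% CI for the difference of two proportions, in percentage points.

    Uses the unpooled standard error and equal sample sizes per arm.
    """
    diff = p_treatment - p_control
    se = math.sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treatment * (1 - p_treatment) / n_per_arm
    )
    return (diff - z * se) * 100, (diff + z * se) * 100


# Hypothetical rates: 10.0% control vs 10.1% treatment, a true lift of 0.1 points.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    low, high = diff_ci_points(0.100, 0.101, n)
    excludes_zero = low > 0 or high < 0
    print(f"n per arm = {n:>10,}: CI = [{low:+.3f}, {high:+.3f}] points, excludes 0: {excludes_zero}")
```

As n grows the interval tightens around +0.1 and eventually excludes 0, yet every bound stays far below a 2-point MDE: significant, but not worth acting on.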

Three Anti-Patterns to Avoid

The four checks exist to head off three recurring mistakes:

Shipping on significance alone. Seeing p < 0.05 and rolling out without checking effect size or guardrails. A significant result can be practically trivial, or can hide a guardrail regression that outweighs the primary win. Green p-value, red guardrail — still a no-ship.
Killing a real win for being “only” borderline. Discarding a significant, positive lift because the CI’s lower bound dips slightly below the MDE, without weighing the cost. If the change is cheap and low-risk, a probable ~2-point lift is worth taking even with a soft lower bound. Practical significance is a judgment against cost and risk, not a hard gate.
Reading “not significant” as “proven no effect.” (The Lesson 2 callback.) A non-significant result means you couldn’t rule out chance, often because the test was underpowered, not that the effect is zero. Don’t report an inconclusive test as evidence that the change does nothing.

Each anti-pattern comes from collapsing a four-part decision into one number. The discipline is to run all four checks, every time, and to say out loud where the uncertainty lives.

Practice Exercises

Exercise 1: Green p-value, red guardrail

A test on a new checkout flow shows a significant +1.8-point lift in purchase rate (p = 0.003, CI excludes 0), clearing the 1-point MDE. But median page load time rose by 400 ms — a guardrail you’d committed to protecting. Ship or not?

Hint

Don’t ship as-is. Checks 1 and 2 pass — the lift is real and clears the MDE — but Check 3 fails: a guardrail regressed. A 400 ms load-time hit can suppress conversion elsewhere and hurt the experience, and it may even erode the very lift you measured over time. The primary win doesn’t buy back a broken guardrail. Fix the latency and re-test, rather than shipping the regression alongside the win.

Exercise 2: Point estimate vs. the interval

Lumen’s lift is +2.20 points with a 95% CI of [+0.97, +3.43], against a 2-point MDE. A teammate says, “The lift is 2.20, which beats our 2-point bar, so it’s a clean ship.” What’s incomplete about that reasoning?

Hint

It treats the point estimate as the whole answer and ignores the interval. Yes, +2.20 exceeds the 2-point MDE — but the CI’s lower bound of +0.97 sits below it, so the data is also consistent with a true lift under the practical bar. The decision should acknowledge that the true effect could be as small as ~1 point. It’s still likely a ship (significant, positive, best estimate above the MDE, guardrails fine), but “clean” overstates the certainty — the honest version flags the soft lower bound and considers running longer or shipping-and-monitoring.

Exercise 3: A flat result

A different experiment returns p = 0.42, with a 95% CI of [−1.1, +2.6] points. A stakeholder concludes, “No effect — the change does nothing, kill it.” Is that the right read?

Hint

No — that’s the “not significant means proven no effect” trap from Lesson 2. The result is non-significant (the CI includes 0), so you can’t rule out chance — but the interval also includes lifts as large as +2.6 points, which would be well worth having. The test simply didn’t have the power to resolve the effect. The honest read is “inconclusive,” and the right move is usually to run longer or with more traffic, not to declare the effect zero.
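The reading in this hint can be sketched in code. Two things here are assumptions rather than part of the exercise: the 2-point threshold for a "worthwhile" lift (the exercise states no MDE) and the function names. The p-value back-calculation assumes a symmetric normal-approximation interval and the same standard error for the test and the CI, so it only approximates the reported p = 0.42.

```python
import math


def approx_p_from_ci(ci_low: float, ci_high: float, z_crit: float = 1.96) -> float:
    """Approximate two-sided p-value implied by a symmetric 95% CI."""
    estimate = (ci_low + ci_high) / 2
    se = (ci_high - ci_low) / (2 * z_crit)
    z = estimate / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def read_flat_result(ci_low: float, ci_high: float, worthwhile_lift: float) -> str:
    """Distinguish 'inconclusive' from 'a worthwhile lift looks unlikely'."""
    if ci_low > 0 or ci_high < 0:
        return "significant: the CI excludes 0"
    if ci_high >= worthwhile_lift:
        return "inconclusive: the CI includes 0 and also lifts worth having"
    return "no worthwhile lift supported: even the upper bound is below the bar"


print(f"Implied p-value: {approx_p_from_ci(-1.1, 2.6):.2f}")
# worthwhile_lift=2.0 is a hypothetical bar, not given in the exercise.
print(read_flat_result(-1.1, 2.6, worthwhile_lift=2.0))
```

The implied p-value lands close to the stated 0.42, and because the upper bound of +2.6 exceeds the assumed bar, the result reads as inconclusive rather than as evidence of no effect.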

Summary

A ship decision is four checks, not one. Statistical significance (p < α and the CI excludes 0) tells you the effect is real; practical significance compares that effect to the MDE you set at design time to ask whether it’s big enough to be worth it; guardrails confirm nothing critical regressed; and reading the full confidence interval — not just the point estimate — shows the range of effects the data supports. On Lumen, the call is honestly borderline: p = 0.00048 and a CI of [+0.97, +3.43] make the +2.20-point lift clearly real and positive, and the point estimate beats the 2-point MDE — but the lower bound dips below it, so the true lift could be as small as ~1 point. With guardrails assumed fine, that’s normally a ship, made while flagging the uncertainty, with running-longer as a reasonable alternative. The lesson’s core discipline is to avoid the three anti-patterns: shipping on significance alone, killing a real win for being borderline, and mistaking “not significant” for “no effect.”

Key Concepts

Significance is necessary, not sufficient — p < α and CI-excludes-0 get you into the decision, not through it.
Practical significance — compare the effect to the MDE from design; real-but-tiny may not be worth the cost.
Guardrails — a primary win doesn’t ship if it broke something you promised to protect.
Decide on the interval — the CI’s bounds relative to the MDE carry the decision, not the point estimate alone.

Why This Matters

The gap between a green p-value and a good decision is where experimentation programs earn or lose their credibility. Teams that ship on significance alone accumulate trivial changes and hidden regressions; teams that demand a perfectly clean interval kill real wins and grind to a halt. The four checks — and the honesty to say “significant, positive, but borderline on the MDE” out loud — are what let you act decisively and accurately on a result like Lumen’s. Next, you’ll put every skill from this module together in a guided project: analyzing Lumen’s signup experiment end to end, from raw counts to a defensible ship decision.

Next Steps

Continue to Lesson 5 - Guided Project: Analyze Lumen's Signup Experiment

Put it all together: from raw counts to z-test, p-value, confidence interval, and a defensible ship decision, end to end.

Continue Building Your Skills

You turned Lumen’s result into a decision using four checks — significance, practical significance against the MDE, guardrails, and the full confidence interval — and made an honest ship call on a genuinely borderline outcome without overstating certainty. Next you’ll run the entire module workflow yourself in a guided project, carrying Lumen’s signup experiment from raw conversion counts all the way to a defensible ship/no-ship recommendation.
