Lesson 4 - The Cross-Check and the Decision

From a Number to a Decision

Lesson 3 gave you the frequentist verdict: p = 0.00197 and a 95% confidence interval of [+1.22, +5.43] points — the checklist significantly lifts activation, by an observed +3.32 points. That’s a strong result. But a strong result is not yet a decision. Before you tell Lumen to ship, you do two things a careful data scientist always does: you get a second opinion in a different language, and you check the result against the decision rule you wrote at design time — not against your gut, and not against “the p-value was small.”

This lesson closes the loop. First a Bayesian cross-check (Module 7) restates the finding as a probability a stakeholder can actually parse. Then you run the full scorecard — validity, significance, effect size, corroboration, guardrails — and make the call.

By the end of this lesson, you will be able to:

  • Compute a Bayesian posterior for each arm and a credible interval for the difference
  • Read “the probability the treatment is truly better” as a second opinion on the p-value
  • Apply a five-check decision scorecard against the pre-set rule
  • State a ship decision that survives scrutiny, not just a low p-value

Let’s get the second opinion.


The Bayesian Cross-Check

The frequentist test answers a slightly awkward question: if there were no effect, how surprising is this data? Useful, but not the sentence a stakeholder wants. The Bayesian readout (Module 7) answers the question they actually asked: what’s the probability the treatment is truly better? Same validated data — control 832/3328, treatment 980/3460 — a different philosophy, and a plainer answer. We put a flat Beta(1, 1) prior on each arm’s rate, update it with the observed counts, and sample the two posteriors:

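import numpy as np

rng = np.random.default_rng(2026)  # fixed seed so the run is reproducible

# Observed counts: users and activations per arm
n_c, act_c = 3328, 832  # control
n_t, act_t = 3460, 980  # treatment
draws = 400_000

# Flat Beta(1, 1) prior + binomial data -> Beta(1 + successes, 1 + failures)
pc = rng.beta(1 + act_c, 1 + (n_c - act_c), draws)
pt = rng.beta(1 + act_t, 1 + (n_t - act_t), draws)

# Share of posterior draws in which treatment beats control
print(f"P(treatment > control) = {np.mean(pt > pc):.4f}")

# Central 95% credible interval for the lift
lo, hi = np.percentile(pt - pc, [2.5, 97.5])
print(f"95% credible interval  = [{lo:.4f}, {hi:.4f}]")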

Running it:

P(treatment > control) = 0.9991
95% credible interval  = [0.0122, 0.0542]

Two numbers, both reassuring. Given the data and the flat prior, there is a 99.9% posterior probability that the checklist truly lifts activation: not “we reject the null,” but a direct probability that the treatment is better. The 95% credible interval for the lift is [+1.22, +5.42] points, which essentially matches the frequentist confidence interval from Lesson 3 ([+1.22, +5.43]). Same data, two philosophies, the same answer. That agreement is what you hope to see: when a Bayesian and a frequentist analysis of the same experiment land on the same interval, it suggests that neither method is doing something strange and that the conclusion does not hinge on the choice of framework. And “99.9% probability it’s better” is the sentence you can put in front of a stakeholder without a footnote.
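As a sanity check on the sampled numbers, the same posterior quantities can be approximated in closed form. The sketch below is not part of the lesson's code. It assumes the two Beta posteriors are close enough to normal at these sample sizes, and the helper names `beta_mean_var` and `normal_cdf` are made up for illustration.

```python
import math

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n_c, act_c = 3328, 832  # control
n_t, act_t = 3460, 980  # treatment

# Flat Beta(1, 1) prior updated with the observed counts
mean_c, var_c = beta_mean_var(1 + act_c, 1 + n_c - act_c)
mean_t, var_t = beta_mean_var(1 + act_t, 1 + n_t - act_t)

# Assumption: the posterior of the difference is approximately normal
diff_mean = mean_t - mean_c
diff_sd = math.sqrt(var_c + var_t)

p_better = normal_cdf(diff_mean / diff_sd)
lo = diff_mean - 1.96 * diff_sd
hi = diff_mean + 1.96 * diff_sd

print(f"P(treatment > control) ~ {p_better:.4f}")
print(f"95% interval (approx)  ~ [{lo:.4f}, {hi:.4f}]")
```

If the approximation holds, both lines should land close to the sampled values above. A large gap would point to a bug in one of the two calculations rather than to a real disagreement.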


The Decision Scorecard

Now the decision. A ship call is not one number — it’s a scorecard, run against the rule written at design time (Lesson 1: ship only if the primary metric rises significantly and clears the 3-point bar and no guardrail regresses). Five checks, each from a module of this course:

  1. Valid? SRM p = 0.109 (Lesson 2): the 3328/3460 split is within chance, so there is no sign that randomization broke. Yes, a trustworthy experiment.
  2. Statistically significant? p = 0.00197 and the 95% CI [+1.22, +5.43] excludes 0 (Lesson 3). Yes.
  3. Practically meaningful? The observed +3.32 points beats the pre-set +3-point MDE. (Honest caveat: the CI’s lower bound, +1.22, is below the 3-point bar, so the true lift could be smaller than we’d want — but it’s clearly positive, and our best estimate clears the bar.) Yes.
  4. Second opinion agrees? Bayesian P(better) = 99.9% and the credible interval matches the frequentist CI. Yes.
  5. Guardrails held? No regression in support tickets or 30-day retention. Yes.
Figure: The decision scorecard. Five checks, all green, decided against the rule written at design time; the verdict banner reads SHIP IT.

Every row is green. The verdict: SHIP the onboarding checklist. It is valid, statistically significant, practically meaningful, corroborated by a second method, and safe on the guardrails.
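The scorecard can also be written down as code, which makes the rule hard to bend after the fact. The following is a minimal sketch, not the course's implementation. It assumes a 50/50 intended split for the SRM test, a pooled two-proportion z-test for significance, and a Wald interval for the lift; the `Scorecard` class and the thresholds `ALPHA`, `SRM_ALPHA`, and `P_BETTER_BAR` are hypothetical choices for illustration. The guardrail result and the Bayesian probability are taken from the lesson as given.

```python
import math
from dataclasses import dataclass

def two_sided_p(z):
    """Two-sided p-value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

@dataclass
class Scorecard:
    valid: bool
    significant: bool
    meaningful: bool
    corroborated: bool
    guardrails_held: bool

    def verdict(self):
        rows = (self.valid, self.significant, self.meaningful,
                self.corroborated, self.guardrails_held)
        return "SHIP" if all(rows) else "DO NOT SHIP"

n_c, act_c = 3328, 832   # control
n_t, act_t = 3460, 980   # treatment
ALPHA = 0.05             # assumed significance level
SRM_ALPHA = 0.01         # hypothetical SRM alarm threshold
MDE = 0.03               # the 3-point bar from the design-time rule
P_BETTER = 0.9991        # Bayesian result reported in this lesson
P_BETTER_BAR = 0.95      # hypothetical bar for "second opinion agrees"

# Row 1: SRM chi-square test with 1 degree of freedom, assuming a 50/50 split
expected = (n_c + n_t) / 2
chi2 = ((n_c - expected) ** 2 + (n_t - expected) ** 2) / expected
srm_p = two_sided_p(math.sqrt(chi2))

# Row 2: two-proportion z-test with a pooled standard error (assumed test)
rate_c, rate_t = act_c / n_c, act_t / n_t
lift = rate_t - rate_c
pooled = (act_c + act_t) / (n_c + n_t)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
p_value = two_sided_p(lift / se_pooled)

# 95% Wald interval for the lift (unpooled standard error)
se = math.sqrt(rate_c * (1 - rate_c) / n_c + rate_t * (1 - rate_t) / n_t)
ci = (lift - 1.96 * se, lift + 1.96 * se)

card = Scorecard(
    valid=srm_p > SRM_ALPHA,
    significant=p_value < ALPHA and ci[0] > 0,
    meaningful=lift >= MDE,                 # Row 3: point estimate vs the bar
    corroborated=P_BETTER > P_BETTER_BAR,   # Row 4
    guardrails_held=True,                   # Row 5: per the lesson, no regression
)

print(f"SRM p = {srm_p:.3f}, p = {p_value:.5f}, "
      f"CI = [{ci[0]:+.4f}, {ci[1]:+.4f}], lift = {lift:+.4f}")
print(card.verdict())
```

The thresholds are fixed as constants at the top so that they are set before the data is read, mirroring the design-time rule.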

A decision is a scorecard, not a p-value

The temptation, once you see p = 0.002, is to stop and ship. Don’t. A low p-value clears only one row of the scorecard. A result can be significant but trivial (a +0.1 point lift on a huge sample), significant but invalid (a failed SRM check upstream), or significant but harmful (activation up, retention down). The scorecard exists precisely to catch those cases. Here all five checks pass, so it’s a clean ship, but the discipline is what matters, not the outcome. You run the same five checks every time, decide against the bar you set before seeing data, and let the scorecard, not the excitement of a small number, make the call.
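To make the "significant but trivial" case concrete, here is a small sketch with invented numbers: 10 million users per arm and activation rates of 25.0% versus 25.1%. The counts and the function name `two_prop_p_value` are hypothetical, and the pooled z-test is an assumed choice of test.

```python
import math

def two_prop_p_value(act_c, n_c, act_t, n_t):
    """Lift and two-sided p-value from a pooled two-proportion z-test."""
    lift = act_t / n_t - act_c / n_c
    pooled = (act_c + act_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    return lift, math.erfc(abs(lift / se) / math.sqrt(2.0))

# Hypothetical experiment: 10 million users per arm, 25.0% vs 25.1%
n = 10_000_000
lift, p_value = two_prop_p_value(2_500_000, n, 2_510_000, n)

MDE = 0.03  # the 3-point bar used in this lesson

print(f"lift = {lift * 100:+.2f} points, p = {p_value:.2e}")
print("significant:", p_value < 0.05)
print("clears the bar:", lift >= MDE)
```

The p-value is tiny because the sample is enormous, yet the lift is a thirtieth of the bar. Row 2 of the scorecard would pass and row 3 would fail.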


Practice Exercises

Exercise 1: Two intervals, one answer

The frequentist CI is [+1.22, +5.43] and the Bayesian credible interval is [+1.22, +5.42]. They’re nearly identical — but they mean different things. What does each one actually claim?

Hint

The confidence interval is a statement about the procedure: if we repeated this experiment many times, 95% of the intervals built this way would contain the true lift. The credible interval is a statement about the lift itself: given this data and a flat prior, there’s a 95% probability the true lift is between +1.22 and +5.42. The credible interval is the one people intuitively think a CI means. That they coincide here isn’t luck — with a lot of data and a flat prior, the two converge, which is exactly why the agreement is reassuring rather than surprising.
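The "statement about the procedure" reading can be checked by simulation. The sketch below repeats a hypothetical experiment many times and counts how often the 95% Wald interval contains the true lift. The true rates (25% and 28%), the arm size of 400, and the number of repetitions are all assumptions chosen to keep the run fast; they are not Lumen's values.

```python
import math
import random

random.seed(7)

TRUE_C, TRUE_T = 0.25, 0.28   # hypothetical true activation rates
N = 400                       # hypothetical users per arm (small, for speed)
REPS = 2000                   # number of simulated experiments
true_lift = TRUE_T - TRUE_C

def simulate_arm(rate, n):
    """Number of activations among n users with the given true rate."""
    return sum(random.random() < rate for _ in range(n))

covered = 0
for _ in range(REPS):
    rate_c = simulate_arm(TRUE_C, N) / N
    rate_t = simulate_arm(TRUE_T, N) / N
    lift = rate_t - rate_c
    se = math.sqrt(rate_c * (1 - rate_c) / N + rate_t * (1 - rate_t) / N)
    # Does this experiment's 95% interval contain the true lift?
    if lift - 1.96 * se <= true_lift <= lift + 1.96 * se:
        covered += 1

coverage = covered / REPS
print(f"share of intervals containing the true lift: {coverage:.3f}")
```

The share should come out near 0.95. Any single interval either contains the true lift or it does not; the 95% describes the long-run behavior of the recipe.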

Exercise 2: The one honest caveat

The scorecard passes the “practically meaningful” row, but with a caveat: the CI lower bound (+1.22) is below the +3-point MDE. Why note it, if the row still passes?

Hint

Because intellectual honesty is part of the readout. The best estimate (+3.32) clears the bar, so the row passes — but the data is also consistent with a true lift of only ~1.2 points, which is below the threshold you called “worth shipping.” Stating that keeps you from overselling: you’re shipping because the point estimate clears the bar and everything else is green, not because you’ve proven the lift is at least 3 points. A stakeholder who later sees a +1.5 point lift in production shouldn’t feel misled — the caveat told them the range up front.
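One way to quantify the caveat is to ask the posterior a sharper question: how likely is it that the true lift is at least 3 points, not merely above zero? The sketch below reuses the flat Beta(1, 1) prior from the cross-check but draws samples with the standard library instead of NumPy; the variable names and the number of draws are illustrative assumptions.

```python
import random

random.seed(2026)

n_c, act_c = 3328, 832  # control
n_t, act_t = 3460, 980  # treatment
MDE = 0.03              # the 3-point bar
DRAWS = 100_000

clears_bar = 0
for _ in range(DRAWS):
    # One posterior draw per arm: Beta(1 + successes, 1 + failures)
    pc = random.betavariate(1 + act_c, 1 + n_c - act_c)
    pt = random.betavariate(1 + act_t, 1 + n_t - act_t)
    if pt - pc >= MDE:
        clears_bar += 1

p_clears = clears_bar / DRAWS
print(f"P(true lift >= 3 points) = {p_clears:.3f}")
```

Expect a probability well short of the 99.9% reported for "any lift at all." That gap is exactly what the caveat communicates.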

Exercise 3: Break one row

Suppose everything is the same except 30-day retention dropped 2 points in the treatment arm. Walk through the scorecard. What’s the decision?

Hint

Rows 1 through 4 still pass — the experiment is valid, the activation lift is significant, meaningful, and corroborated. But row 5 (guardrails) now fails: retention regressed. The decision rule from Lesson 1 was “ship only if… no guardrail regresses,” so the verdict flips to do not ship (or hold and investigate). This is the whole point of guardrails and of the scorecard: a genuine, significant win on the primary metric is not enough if it comes at the cost of something you promised to protect. One red row overrides four green ones.
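The "one red row overrides four green ones" logic is small enough to write out. This is a sketch of the rule as described in this exercise, with row names invented for illustration.

```python
# Scorecard rows for the hypothetical scenario in this exercise
rows = {
    "valid": True,
    "significant": True,
    "meaningful": True,
    "corroborated": True,
    "guardrails_held": False,   # 30-day retention dropped 2 points
}

# Every row must pass; a single failure blocks the ship
failed = [name for name, passed in rows.items() if not passed]
verdict = "SHIP" if not failed else "DO NOT SHIP"

print(verdict, "| failed rows:", ", ".join(failed) or "none")
```

Reporting which row failed, not just the verdict, tells the team where to investigate next.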


Summary

A significant result is the start of a decision, not the end of one. This lesson gave Lumen’s finding a Bayesian cross-check — on the same validated data (control 832/3328, treatment 980/3460), the posterior probability the treatment is truly better is 99.9%, with a credible interval of [+1.22, +5.42] that matches the frequentist CI almost exactly. Then it ran the decision scorecard: valid (SRM p = 0.109), significant (p = 0.00197, CI excludes 0), practically meaningful (+3.32 beats the +3-point MDE), corroborated (Bayesian agrees), and safe (guardrails held). All five green, decided against the rule written at design time — so the verdict is a clean SHIP. The discipline is the lesson: a ship decision is a scorecard, not a small p-value.

Key Concepts

  • Bayesian cross-check — restate the result as P(treatment better) and a credible interval; agreement with the CI is reassuring.
  • Credible vs confidence interval — different claims that converge with lots of data and a flat prior.
  • The decision scorecard — validity, significance, effect size vs the bar, corroboration, guardrails.
  • Decide against the pre-set rule — the design-time decision rule makes the call, not the excitement of a low p-value.

Why This Matters

Teams often ship bad changes not because their statistics are wrong but because they stop at the p-value. A low p-value clears one row of a five-row scorecard; the other four (validity, effect size against a pre-set bar, a second opinion from a different method, and guardrails) are what separate a decision that holds up from a “significant win” that quietly breaks something. Running the full scorecard against a rule fixed before seeing the data is what makes a ship decision defensible in a room full of skeptics. Next, you’ll assemble everything (design, validation, analysis, cross-check, and decision) into the written readout that Lumen’s stakeholders actually read.


Next Steps

Continue to Lesson 5 - The Full Readout and Course Wrap-Up

Assemble the design, validation, analysis, cross-check, and decision into the one-page readout — and close out the course.


Continue Building Your Skills

You gave a strong frequentist result a Bayesian second opinion — 99.9% probability the checklist truly helps, with a credible interval that matches the CI — and then ran the five-check scorecard that turns a number into a decision: valid, significant, practically meaningful, corroborated, and safe. The verdict was a clean ship, but the transferable skill is the scorecard itself, which would have flagged a broken, trivial, or guardrail-busting result just as clearly. Next you’ll write it all up: the one-page readout that carries the whole experiment — brief, design, validity, analysis, cross-check, and decision — to the people who have to act on it.