Lesson 5 - The Full Readout and Course Wrap-Up
Welcome to the Guided Project
This is the last lesson of the last module, the finale of the whole course. Across Lessons 1 to 4 you ran Lumen’s onboarding experiment end to end: you turned a one-line brief into a design with a pre-committed sample size, you ran the experiment and validated the data before looking at the result, you analyzed the proportions with a z-test and a confidence interval, and you cross-checked the whole thing with a Bayesian readout. Every number is now in hand. What’s left is the part that most people skip and every serious team demands: writing it all up.
The deliverable of an experiment is not a p-value dropped in a chat message. It’s a readout — a short, self-contained document that states the question, the design, the validity, the result, and the decision with its rationale and caveats, so a decision-maker can act on it and, months later, an auditor can check it. This lesson assembles Lumen’s entire experiment into that one document. Then we’ll step back and celebrate finishing the course.
By the end of this project, you will be able to:
- Explain why the readout — not the p-value — is the real deliverable of an experiment
- Assemble a full experiment’s numbers into one coherent, self-contained document
- Write a model readout with question, design, validity, result, Bayesian cross-check, decision, and caveats
- Map each course module to the step in the readout it made possible
Let’s write the readout and close the course.
Stage 1: Why the Readout Matters
Imagine you’ve done everything right — designed carefully, sized properly, validated, analyzed, cross-checked — and then you communicate the result by typing “it’s significant, p=0.002, ship it” into a Slack thread. What happens? The decision-maker asks: significant compared to what? How big was the lift? Was the test even valid? What could go wrong? And in six months, when someone wants to know why Lumen shipped that checklist, the answer has scrolled off the top of a channel and is gone.
A readout fixes all of that. It is one artifact that answers every question a reasonable person would ask, in a fixed order:
- The question — what business decision does this experiment inform?
- The design — the hypothesis, metrics, unit, and sample size, so a reader knows what was committed before data.
- The validity — the sanity checks (like SRM) that decide whether the numbers can be trusted at all.
- The result — the effect, its uncertainty, and its statistical and practical significance.
- The decision — the call, its rationale, and the caveats that qualify it.
That’s the whole job. A readout is not a research paper and it’s not a dashboard — it’s a decision document. It’s short enough to read in two minutes and complete enough that nobody has to ask a follow-up. Writing it forces you to be honest: if you can’t state the design cleanly, you didn’t have one; if you skip the validity line, you’re hiding something. The readout is where rigor becomes visible.
Stage 2: Assembling the Numbers
Before writing prose, let’s gather every number from Lessons 1 to 4 in one place — all computed for real with numpy and scipy, all consistent with each other. Here is the complete ledger of the capstone:
- Design (Lesson 1) — baseline 7-day activation 25%; smallest lift worth shipping +3 points; at 5% significance and 80% power, required n = 3,394 per arm (6,788 total).
- Run (Lesson 2) — assigned 50/50; ran to control 3,328 and treatment 3,460 users. Activations: control 832/3,328 = 25.00%, treatment 980/3,460 = 28.32%.
- Validity (Lesson 2) — sample-ratio-mismatch check on the split: p = 0.109. Comfortably above the 0.001 alarm line, so the randomization is intact and the experiment is valid to analyze.
- Analysis (Lesson 3) — two-proportion z-test on the lift: difference = +3.32 points, z = 3.095, p = 0.00197, 95% CI [+1.22, +5.43] points.
- Bayesian cross-check (Lesson 4) — P(treatment better) = 0.9991, 95% credible interval [+1.22, +5.42] points — essentially identical to the frequentist interval, which is exactly the reassurance you want.
- Guardrails — support-ticket rate and 30-day retention both held; no regression.
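The ledger above can be recomputed in a few lines. What follows is a minimal sketch, not the notebook from Lessons 1 to 4. It assumes the standard unpooled normal-approximation formula for the sample size, a chi-square goodness-of-fit test against an even split for SRM, a pooled standard error for the z-test, and an unpooled standard error for the confidence interval. The variable names are illustrative.

```python
import math
from scipy import stats

# Design: per-arm sample size for detecting 25% -> 28% (assumed formula).
p_base, mde = 0.25, 0.03
alpha, power = 0.05, 0.80
p_target = p_base + mde
z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
z_beta = stats.norm.ppf(power)
variance_sum = p_base * (1 - p_base) + p_target * (1 - p_target)
n_per_arm = math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde ** 2)

# Run: observed counts from the ledger.
n_c, x_c = 3328, 832    # control users, control activations
n_t, x_t = 3460, 980    # treatment users, treatment activations

# Validity: SRM check. chisquare() defaults to equal expected counts,
# which matches the intended 50/50 assignment.
srm_stat, srm_p = stats.chisquare([n_c, n_t])

# Analysis: two-proportion z-test with a pooled standard error.
p_c, p_t = x_c / n_c, x_t / n_t
diff = p_t - p_c
p_pool = (x_c + x_t) / (n_c + n_t)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = diff / se_pool
p_value = 2 * stats.norm.sf(abs(z))

# 95% confidence interval for the lift, using the unpooled standard error.
se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z_crit = stats.norm.ppf(0.975)
ci_low, ci_high = diff - z_crit * se_unpooled, diff + z_crit * se_unpooled

print(f"n per arm: {n_per_arm}")
print(f"SRM p-value: {srm_p:.3f}")
print(f"lift: {diff * 100:+.2f} pts, z = {z:.3f}, p = {p_value:.5f}")
print(f"95% CI: [{ci_low * 100:+.2f}, {ci_high * 100:+.2f}] pts")
```

Under those assumptions the script should land on the ledger’s figures to the precision shown there. If your own notebook used a different sample-size formula or a continuity correction, expect small differences.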
Two things stand out. First, every downstream number depends on the validity check passing: if SRM had failed, none of the analysis would mean anything, which is why it comes first. Second, the frequentist and Bayesian intervals agree to within a hundredth of a point ([+1.22, +5.43] versus [+1.22, +5.42]). That agreement isn’t a coincidence. With this much data and no strongly informative prior, the data dominate the posterior and the two frameworks converge. Seeing them land in the same place shows the conclusion does not hinge on one method’s assumptions; it is a robustness check on the same data, not a second, independent piece of evidence.
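To see why the two intervals land so close together, here is a hedged sketch of a Bayesian cross-check on the same counts. The lesson does not restate which prior Lesson 4 used, so this assumes a uniform Beta(1, 1) prior on each arm and estimates the posterior of the lift by Monte Carlo. The seed and draw count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
draws = 1_000_000

n_c, x_c = 3328, 832    # control users, control activations
n_t, x_t = 3460, 980    # treatment users, treatment activations

# Assumption: Beta(1, 1) prior, so each posterior is
# Beta(1 + successes, 1 + failures).
post_c = rng.beta(1 + x_c, 1 + n_c - x_c, size=draws)
post_t = rng.beta(1 + x_t, 1 + n_t - x_t, size=draws)

# Posterior of the lift, the share of draws where treatment beats
# control, and the central 95% credible interval.
lift = post_t - post_c
prob_better = (lift > 0).mean()
cred_low, cred_high = np.percentile(lift, [2.5, 97.5])

print(f"P(treatment better): {prob_better:.4f}")
print(f"95% credible interval: [{cred_low * 100:+.2f}, {cred_high * 100:+.2f}] pts")
```

With thousands of users per arm, a uniform prior adds the equivalent of only two observations to each arm, which is why the credible interval sits almost exactly on the confidence interval.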
The verdict these numbers point to: SHIP. Now let’s write it up properly.
Stage 3: The Readout
Here is the model deliverable — the entire Lumen experiment as one self-contained readout. This is what you hand a decision-maker, and what an auditor reads in six months. Notice the order: question, design, validity, result, cross-check, decision, caveats — nothing out of place, nothing hidden.
Experiment Readout: Onboarding Checklist v2
Question. Does the new onboarding checklist increase the share of new users who become active in their first week, enough to justify shipping it to everyone?
Design. Hypothesis: the checklist lifts the 7-day activation rate by at least 3 percentage points. Primary metric: 7-day activation rate (a per-user proportion). Guardrail metrics: support-ticket rate and 30-day retention (must not regress). Randomization unit: the user, held consistent across their first week. Minimum detectable effect: +3 points. Sample size: 3,394 users per arm (6,788 total) for 80% power at 5% significance — committed before launch.
Validity. Sample-ratio-mismatch check on the 3,328 / 3,460 split: p = 0.109, well above the 0.001 threshold. The randomization is intact; the experiment is valid to analyze.
Result. Activation rose from 25.00% (control, 832/3,328) to 28.32% (treatment, 980/3,460), a lift of +3.32 percentage points. Two-proportion z-test: z = 3.095, p = 0.0020. 95% confidence interval for the lift: [+1.22, +5.43] points. The effect is statistically significant (the interval excludes zero), and its point estimate clears the +3-point practical bar.
Bayesian cross-check. Probability the treatment is genuinely better: 99.9%. 95% credible interval: [+1.22, +5.42] points — matching the frequentist interval, which corroborates the result under different assumptions.
Decision: SHIP. The lift is statistically significant, its point estimate clears the pre-committed +3-point bar, and no guardrail regressed, so all three conditions of the decision rule are met. The Bayesian cross-check agrees that the effect is real.
Caveats. The confidence interval’s lower bound is +1.22 points, so the true lift could be meaningfully smaller than the observed +3.32 — the practical bar is cleared by the point estimate, not guaranteed by the whole interval. Recommend monitoring 30-day retention for at least 30 days post-launch to confirm the early-activation gain is durable and not shallow engagement that fades.
Read it top to bottom: a stranger to the project now knows exactly what was asked, what was committed, whether to trust the numbers, what the numbers say, and what to watch after launch. That’s a readout. That’s the product.
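The decision line in the readout can be expressed as a small, explicit rule. The sketch below is inferred from the readout itself (significant lift, point estimate at or above the pre-committed bar, no guardrail regression). The exact rule written down in Lesson 1 is not reproduced here, and the `Readout` class and `decide` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    """Numbers the decision line depends on (lifts in percentage points)."""
    lift_pts: float        # observed lift, treatment minus control
    ci_low_pts: float      # lower bound of the 95% confidence interval
    mde_pts: float         # pre-committed practical bar
    guardrails_held: bool  # True if no guardrail metric regressed


def decide(r: Readout) -> str:
    """Return "SHIP" only when all three conditions hold, else "HOLD"."""
    significant = r.ci_low_pts > 0        # interval excludes zero, on the positive side
    practical = r.lift_pts >= r.mde_pts   # point estimate clears the bar
    safe = r.guardrails_held              # no guardrail regressed
    return "SHIP" if (significant and practical and safe) else "HOLD"


lumen = Readout(lift_pts=3.32, ci_low_pts=1.22, mde_pts=3.0, guardrails_held=True)
print(decide(lumen))  # SHIP
```

Writing the rule as code before launch is one way to make the pre-commitment concrete: the inputs change when the data arrive, but the rule does not.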
The readout is the product
It’s tempting to think the analysis is the deliverable and the write-up is a formality. It’s the reverse. The analysis lives in your notebook; the readout is what leaves your desk and changes what the company does. A brilliant analysis communicated as “p=0.002, ship” is worth less than a merely correct one written up as a clear, caveated readout — because only the second one lets a decision-maker act with confidence and lets a future teammate reconstruct why. When you finish an experiment, you are not done until the readout is written. The readout is the product.
Stage 4: What Made This Trustworthy
Step back and look at what made that readout believable. It wasn’t one clever move — it was doing every step, in order, before the decision. Each line of the readout is a module of this course cashed in:
- Module 1 — Randomization. Assigning users 50/50 at random is what lets the readout say the checklist caused the lift, not merely that active users happened to see it. No randomization, no causal claim — everything else is just correlation.
- Module 2 — Design and metrics. The hypothesis, the single primary metric, and the guardrails came from Module 2. That’s why the readout has one clean thing to test and a rule for what would count as harm.
- Module 3 — Sample size. The 3,394-per-arm figure, committed up front, is what let the team run to a fixed size and analyze once — no peeking, no stopping early on a lucky wobble.
- Module 4 — Proportion analysis. The two-proportion z-test, the difference, and the confidence interval are the Module 4 machinery applied to activation. This is the “Result” line.
- Module 5 — Mean metrics. Not used here because the metric was a rate — but if the question had been about revenue per user or sessions per week, the readout’s Result line would use the Module 5 t-test and mean-difference interval instead. Same shape, different tool.
- Module 6 — Validity and SRM. The SRM check at p = 0.109 is Module 6. Running it first, before looking at the result, is what earns the right to trust everything below it.
- Module 7 — Bayesian cross-check. The 99.9% probability and the credible interval are Module 7, giving a second read on the same data under a different framework.
The rigor didn’t come from any single technique. It came from the discipline of the sequence: design before data, validity before analysis, analysis before decision, decision before you let yourself feel good about the result. Skip any step and the readout develops a crack an auditor can find. Do all of them, in order, and the conclusion holds up.
Stage 5: Course Wrap-Up
Take a breath and look at how far you’ve come, because you just finished the entire course.
You started at the very foundation: why randomize at all — the insight that a coin flip is what turns “these two groups differ” into “this change caused the difference.” From there you learned to design an experiment: sharpen a vague brief into a testable hypothesis, choose a primary metric and guardrails, and pick a randomization unit. You learned to size it, so you commit to a sample and analyze once instead of peeking your way to a false win. You learned to analyze both proportions (z-tests on rates) and means (t-tests on continuous metrics), and to report effects with confidence intervals rather than bare p-values. You learned the pitfalls — sample ratio mismatch, peeking, multiple comparisons, novelty effects — and how validity checks catch a result that only looks real. You picked up advanced techniques, including the Bayesian view that reads the same data as a probability of improvement. And in this capstone you ran one experiment end to end, from a one-line brief to a written ship decision, using every one of those tools in sequence.
That arc — why randomize → design → size → analyze proportions → analyze means → avoid pitfalls → advanced techniques → full capstone — is the complete anatomy of trustworthy experimentation, and you worked through every stage yourself.
But the most valuable thing you built isn’t a technique. It’s a mindset:
- Skepticism toward flattering numbers. A shiny “significant lift” is a hypothesis, not a fact, until you’ve checked that the test was valid and the effect clears a bar that matters. You now instinctively ask “is this real, or does it just look good?”
- The discipline to decide before the data. You commit the hypothesis, the sample size, and the decision rule up front — so the experiment can only tell you the truth, not flatter whatever story you wish were true.
- The judgment to combine statistics with business context. A p-value doesn’t ship a feature; a person does, weighing significance, practical size, guardrails, and cost. You’ve practiced being that person.
That mindset is portable. It works on onboarding checklists and pricing pages, on email subject lines and model rollouts, on any question where you can randomize and measure. The math will always be there in the notebook when you need it — but the habit of thinking like an experimenter is what you carry into every decision from here.
So go run your own experiments. Start small: one hypothesis, one metric, a sample size you commit to, a decision rule you write before you look. Then write the readout. You have everything you need.
Practice Exercises
Exercise 1: Rewrite the readout for a non-technical exec
Your VP of Product doesn’t care about z-scores. In three sentences, no jargon, tell them what happened and what you recommend.
Hint
Lead with the decision and the business fact, drop the statistics, keep the caveat. Something like: “The new onboarding checklist lifted first-week activation from 25% to 28% — about a 3-point gain — in a clean test on nearly 7,000 users, with no downside to support load or retention. We’re confident this is a real improvement, so I recommend shipping it to everyone. The gain could turn out a bit smaller than 3 points, so we’ll watch 30-day retention for a month to confirm it sticks.” Same result, zero jargon, decision first.
Exercise 2: What if a guardrail had regressed?
Suppose everything held except that 30-day retention dropped in the treatment arm. The activation lift is still +3.32 points and significant. What does the readout’s decision line say now?
Hint
The decision rule ships only if the primary rises and no guardrail regresses, so the answer is do not ship as-is, even though the primary metric showed a significant, practically meaningful lift. The readout would say something like: “Decision: HOLD. Activation rose +3.32 points (significant), but 30-day retention regressed, violating a pre-committed guardrail. Shipping would trade a durable outcome for a short-term one.” Then recommend a fix (investigate whether the checklist front-loads shallow activity) and a re-test. A guardrail regression turns a “win” into a warning, which is exactly what guardrails are for.
Exercise 3: Design a follow-up to confirm the effect size
The lift is significant, but the confidence interval runs from +1.22 to +5.43 points — a wide range. Design a follow-up experiment to pin the effect down more precisely.
Hint
Precision comes from sample size: to halve the width of the interval you need roughly four times the users (width scales with 1/√n). So a confirmatory experiment would set a precision target (say, resolving the lift to within ±1 point instead of the current ~±2.1), recompute the sample size for that target using the Module 3 approach (it will be substantially larger, a bit more than four times the current size), and run again on fresh traffic. Because you now expect an effect near +3 points, you can also frame it as a confirmation with pre-registered success criteria, and add the 30-day retention guardrail as a co-primary so the follow-up settles durability at the same time.
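The 1/√n scaling in the hint can be turned into a rough sizing calculation. This is a sketch under stated assumptions: equal arm sizes, the observed rates (25.00% and 28.32%) used as planning values, and a normal-approximation interval. The ±1-point target is the example figure from the hint, not a recommendation.

```python
import math
from scipy import stats

# Planning values: the observed rates from the capstone run.
p_c = 832 / 3328
p_t = 980 / 3460

target_half_width = 0.01          # resolve the lift to within +/- 1 point
z_crit = stats.norm.ppf(0.975)    # 95% two-sided critical value

# Half-width = z * sqrt((p_c*q_c + p_t*q_t) / n), solved for n per arm.
variance_sum = p_c * (1 - p_c) + p_t * (1 - p_t)
n_follow_up = math.ceil(z_crit ** 2 * variance_sum / target_half_width ** 2)

planned_n = 3394                  # per-arm size committed in Lesson 1
print(f"n per arm for +/-1 point: {n_follow_up}")
print(f"ratio to original design: {n_follow_up / planned_n:.1f}x")
```

The result comes out at roughly 15,000 users per arm, a little over four times the original design, because the current half-width is slightly more than 2 points rather than exactly 2.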
Summary
The finale assembled Lumen’s entire capstone experiment into one deliverable — the readout — because the real product of an experiment is not a p-value but a self-contained document a decision-maker can act on and an auditor can check. Every number, verified with numpy and scipy across Lessons 1 to 4, went into it: a design committing 3,394 users per arm, a run that produced 25.00% -> 28.32% activation, a validity check (SRM p = 0.109, passed), a significant result (+3.32 points, z = 3.095, p = 0.0020, 95% CI [+1.22, +5.43]), a Bayesian cross-check (99.9% probability of improvement, credible interval [+1.22, +5.42]), and a SHIP decision with the honest caveat that the true lift could be as small as ~1.2 points. The readout is trustworthy because it maps, line by line, onto the course: randomization for the causal claim, design and metrics, a pre-committed sample size, proportion analysis, validity checks, and the Bayesian cross-check — every step done in order, before the decision.
Key Concepts
- The readout is the deliverable — question, design, validity, result, decision, caveats, in that fixed order; not a p-value in a chat.
- Validity before result — the SRM check earns the right to trust everything below it; run it first.
- Two frameworks, one answer — frequentist and Bayesian intervals agreeing to a hundredth of a point corroborates the effect.
- Rigor is the sequence — trust came from doing every module’s step, in order, before the ship call — not from any one technique.
Why This Matters
A great analysis that never becomes a clear, caveated readout doesn’t change what a company does — it dies in a notebook. The skill you practiced here, turning a completed experiment into a two-minute document that a VP can act on and a future teammate can audit, is what separates analysts whose work ships from analysts whose work is admired and ignored. And the discipline underneath it, deciding before the data and checking validity before celebrating, is the whole reason experimentation is trustworthy at all. You now have both the tools and the habits. That’s what it takes to run experiments people can bet the roadmap on.
Continue Building Your Skills
You did it — you finished A/B Testing & Experimentation. Look back at the whole journey. You started with the single idea that makes experiments work at all: randomize, and a difference becomes a cause. You learned to design a test and choose its metrics, to size it so you analyze once instead of peeking, to analyze both proportions and means and report them with confidence intervals, to spot the pitfalls that make a fake result look real, and to reach for advanced techniques like the Bayesian view. Then you tied it all together in a capstone: one experiment, from a one-line brief to a validated, cross-checked, written SHIP decision — every number computed for real.
More than the mechanics, you built the mindset of an experimenter: skeptical of flattering numbers, disciplined enough to decide before the data, and wise enough to weigh statistics against business context. That mindset outlasts any single tool and travels to any decision you can randomize and measure. So take it and go — pick a real question, commit a hypothesis and a sample size, run the test, and write the readout. Congratulations, and happy experimenting.