Lesson 2 - Reading the P-Value
Welcome to Reading the P-Value
Last lesson you built the two-proportion z-test and it handed Lumen a number: p = 0.00048. It’s tempting to file that away as “the probability the result was a fluke” and move on — but that reading is wrong, and the wrong reading is exactly how teams talk themselves into believing weak results and dismissing real ones. A p-value is a precise thing that answers a narrow question, and almost every everyday phrasing of it says something the p-value doesn’t. In this lesson you’ll learn to read the number honestly: what it is, the four things it isn’t, the threshold you compare it to, and how the one-sided-versus-two-sided choice changes it — worked on two Lumen experiments that give opposite verdicts.
By the end of this lesson, you will be able to:
- State precisely what a p-value is — and name the four misinterpretations to avoid
- Apply the p-versus-α decision rule and explain what “not significant” does and doesn’t prove
- Choose between a one-sided and two-sided test, and default correctly
- Read two contrasting real results and see why a big p-value isn’t proof of “no effect”
Let’s start with the definition, because everything else follows from it.
What a P-Value Is — and Is Not
A p-value is the probability of observing a difference at least as extreme as the one you saw, if the null hypothesis were true (that is, if the change did nothing). That “if” is the whole point. The p-value lives inside a hypothetical world where there’s no effect, and it measures how surprising your data would be in that world. A small p-value means the data would be strange if nothing had changed, and strange data under “no effect” is evidence against “no effect.” Lumen’s p = 0.00048 says: if the new page made no difference, a gap at least this large, in either direction, would show up about 5 times in 10,000 reruns. That’s surprising enough to reject the “no effect” story.
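To make the definition concrete, here is a minimal sketch of the tail-area calculation behind Lumen’s number. It assumes the Lesson 1 test pools both arms to estimate the standard error, and it uses the standard library’s `statistics.NormalDist` so the snippet has no dependencies. The helper name `two_sided_p` is hypothetical, not the lesson’s own function.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(x_c, n_c, x_t, n_t):
    """Two-sided p-value for a difference in proportions.

    Assumed stand-in for Lesson 1's test: pooled standard error, normal tails.
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    pooled = (x_c + x_t) / (n_c + n_t)      # shared rate in the "no effect" world
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Area in both tails beyond |z| under the standard normal curve
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = two_sided_p(503, 5000, 613, 5000)
print(f"z = {z:.2f}, p = {p:.5f}")
print(f"about {p * 10_000:.1f} in 10,000 no-effect reruns would look at least this extreme")
```

If the pooling assumption matches Lesson 1, this should land close to the z = 3.49 and p = 0.00048 quoted above. Note what the code never does: it never computes a probability about the hypothesis, only about data generated under it.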
What trips people up is that this narrow statement gets swapped for four wrong ones:
- It is NOT the probability the null is true. The p-value assumes the null and asks about the data; it never turns around and reports a probability about the hypothesis. “p = 0.00048” is not “a 0.048% chance there’s no effect.”
- It is NOT the probability the result was due to chance. Same swap in different clothes. The p-value is computed assuming chance is the only thing going on — it can’t also be the probability that chance is the explanation.
- It does NOT measure the size or importance of the effect. A p-value shrinks as your sample grows, so with enough traffic a trivially small, business-irrelevant lift can post a tiny p. This is the practical-versus-statistical-significance point from Module 2: significant does not mean big enough to matter.
- “p = 0.05” does NOT mean “95% chance the effect is real.” There’s no 95% anywhere in the p-value. That number is a confidence-level idea from intervals (Lesson 3), not a restatement of 1 − p.
Keep the left panel in mind and you’ll avoid every mistake on the right: the p-value is a statement about data under a hypothesis, never about the hypothesis itself.
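The third misreading is easy to see in numbers. The sketch below holds a tiny lift fixed and only grows the sample; the rates (10.0% vs 10.1%) and sample sizes are hypothetical, and the helper assumes equal-sized arms and a pooled standard error.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(rate_c, rate_t, n_per_arm):
    """Two-sided z-test p-value for two equal-sized arms (pooled SE)."""
    pooled = (rate_c + rate_t) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (rate_t - rate_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: the same 0.1-point lift at growing sample sizes
results = {}
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    results[n] = two_sided_p(0.100, 0.101, n)
    print(f"n per arm = {n:>10,}  lift = 0.001  p = {results[n]:.4g}")
```

The lift never changes, yet p falls from clearly non-significant to vanishingly small as traffic grows. The p-value is tracking how much data you have as well as how big the effect is, so it can’t stand in for effect size.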
The Threshold: Comparing p to α
A p-value is a continuous measure of surprise, but a decision is binary — ship or don’t. To cross from one to the other you pick a significance threshold α before running the experiment, then apply one rule: if p < α, reject the null and call the result significant; otherwise, you fail to reject it. The convention is α = 0.05, and Lumen’s p = 0.00048 clears it easily.
That α isn’t arbitrary — recall from Module 3 that α is the Type I error rate: the probability of declaring a winner when the change actually did nothing. Setting α = 0.05 is a standing decision to accept a 5% false-positive rate on true nulls. Lower α (say 0.01) means fewer false alarms but a harder bar to clear.
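One way to see α as a false-positive rate is to run many A/A tests, where both arms share the same true conversion rate, and count how often the rule declares a winner. This is a rough simulation sketch: the 10% rate, 500 users per arm, and 2,000 reruns are hypothetical choices, and the z-test is an assumed pooled-SE stand-in for Lesson 1’s function.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_sided_p(x_c, n_c, x_t, n_t):
    """Two-sided p-value for a difference in proportions (pooled SE)."""
    pooled = (x_c + x_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:                      # no conversions in either arm: nothing to test
        return 1.0
    z = (x_t / n_t - x_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def conversions(n, rate, rng):
    """Number of converters among n users who each convert with probability `rate`."""
    return sum(rng.random() < rate for _ in range(n))

rng = random.Random(0)               # fixed seed so reruns of this script agree
alpha, n, rate, reruns = 0.05, 500, 0.10, 2000

false_positives = 0
for _ in range(reruns):
    x_c = conversions(n, rate, rng)  # control arm
    x_t = conversions(n, rate, rng)  # "treatment" arm with the same true rate
    if two_sided_p(x_c, n, x_t, n) < alpha:
        false_positives += 1

false_positive_rate = false_positives / reruns
print(f"A/A tests flagged significant: {false_positive_rate:.3f} (alpha = {alpha})")
```

The flagged share should hover near α, give or take simulation noise. Every one of those “winners” is a change that did nothing.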
The dangerous half of the rule is the other side. When p ≥ α, the honest phrase is “not significant” — and that means we couldn’t rule out chance, not we proved there’s no effect. Absence of evidence is not evidence of absence. A non-significant test is consistent with “no effect” and also consistent with “a real effect we didn’t have the power to detect.” Which one you’re in depends on how much data you collected — as the next section makes concrete.
Two Real Results: Significant vs. Underpowered
Reuse the two_prop_ztest from Lesson 1 — same function, no changes — and run it on two experiments. First, Lumen’s real one from last lesson: 503/5000 vs 613/5000 gives difference 0.0220, z = 3.49, p = 0.00048. Significant: strong evidence against “no effect.”
Now a second experiment with a similar gap but far less data: control 150/1500 = 0.1000, treatment 168/1500 = 0.1120, a 1.2-point lift on less than a third of the traffic:
diff, z, p = two_prop_ztest(150, 1500, 168, 1500)
print(f"difference = {diff:.4f} z = {z:.3f} p = {p:.4f}")
# difference = 0.0120 z = 1.068 p = 0.2857
Same shape of result (a roughly one-point lift for the treatment), but the verdict flips: p = 0.2857, nowhere near 0.05, not significant. If the change did nothing, a difference at least this size would show up about 29% of the time from noise alone, which is not surprising at all.
Here’s the trap this pair is built to spring. It’s wrong to read p = 0.2857 as “the new page is no better.” The experiment didn’t find no effect; it failed to find one, and it failed because 1,500 users per arm can’t resolve a 1-point difference from noise. The lift is right there in the data (0.0120); the sample is simply too small for the z-test to trust it. That’s an underpowered experiment (Module 3), and its non-significant p is a statement about the experiment’s resolution, not about reality. If the true lift really is about 1.2 points, ten times the traffic would likely turn it significant.
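The ten-times-the-traffic claim can be checked with a rough power calculation. The sketch below uses the usual normal approximation and assumes the true rates equal the observed 10.0% and 11.2%, which is an assumption for illustration, not something the experiment established. `approx_power` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(rate_c, rate_t, n_per_arm, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)          # about 1.96 for alpha = 0.05
    se = sqrt(rate_c * (1 - rate_c) / n_per_arm
              + rate_t * (1 - rate_t) / n_per_arm)
    shift = abs(rate_t - rate_c) / se             # true effect in standard-error units
    # Chance the observed z lands beyond either critical value
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)

power_small = approx_power(0.100, 0.112, 1_500)
power_large = approx_power(0.100, 0.112, 15_000)
print(f"power at  1,500 per arm: {power_small:.2f}")
print(f"power at 15,000 per arm: {power_large:.2f}")
```

Under those assumptions the small experiment would detect the lift only about one time in five, while the larger one would catch it more than nine times in ten. A non-significant result from the first design says very little about whether the effect exists.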
A big p-value doesn’t clear the change
When a test comes back “not significant,” resist the urge to announce “the change had no effect.” The correct summary is narrower: this experiment couldn’t distinguish the observed difference from chance. Before concluding anything, ask the power question — was the sample large enough to detect an effect you’d care about? A p of 0.29 on 1,500 users per arm and a p of 0.29 on 150,000 users per arm mean very different things: the first is likely underpowered, the second is real evidence the effect is small. The p-value alone can’t tell them apart — which is exactly why Lesson 3 adds a confidence interval that shows the range of effects still on the table.
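To see why the same p means different things at different sample sizes, you can work backwards from p to the observed gap that would produce it. This is a sketch under stated assumptions: equal arms, a pooled standard error, and the small experiment’s pooled rate of 0.106 reused for the hypothetical 150,000-per-arm case.

```python
from math import sqrt
from statistics import NormalDist

def gap_for_p(p_value, pooled_rate, n_per_arm):
    """Observed difference that would yield the given two-sided p-value."""
    z = NormalDist().inv_cdf(1 - p_value / 2)     # z-score with this much two-tail area
    se = sqrt(pooled_rate * (1 - pooled_rate) * 2 / n_per_arm)
    return z * se

gaps = {n: gap_for_p(0.2857, 0.106, n) for n in (1_500, 150_000)}
for n, gap in gaps.items():
    print(f"n per arm = {n:>7,}  p = 0.2857 corresponds to an observed gap of {gap:.4f}")
```

At 1,500 per arm the gap behind p = 0.2857 is about 1.2 points; at 150,000 per arm it is roughly a tenth of that. The larger experiment had the resolution to see a 1-point effect and didn’t, which is why its non-significant result carries real information.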
One-Sided vs. Two-Sided
There’s one more choice baked into the p-value: which direction counts as “extreme.” Lesson 1’s two_prop_ztest doubles the tail — p = 2 * (1 - norm.cdf(abs(z))) — because it’s a two-sided test: before the experiment the new page could have helped or hurt, so a surprising result is one far from zero in either direction. Two-sided is the default, and it’s the safe one.
A one-sided test counts only one tail: it asks “is the treatment better?” and ignores the possibility of “worse.” When the result falls in the direction you’re testing, dropping the other tail halves the p-value, so a one-sided test is more powerful: it clears α with a smaller effect. But that extra power is only legitimate if you genuinely care about a single direction and you commit to it before running the experiment. Deciding to go one-sided after seeing which way the result leans is a way to manufacture significance: it inflates your true Type I error above the α you thought you set, up to twice α if you always pick the side the data landed on.
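Both halves of that argument can be sketched in a few lines. The z-score of 1.8 is a hypothetical value, not one of Lumen’s results, and the simulation draws z directly from a standard normal, which is what the test statistic approximately follows when the null is true.

```python
import random
from statistics import NormalDist

norm = NormalDist()

def p_two_sided(z):
    """Counts both tails: surprising in either direction."""
    return 2 * (1 - norm.cdf(abs(z)))

def p_one_sided(z):
    """Counts only the 'treatment is better' tail (z > 0), fixed in advance."""
    return 1 - norm.cdf(z)

z = 1.8  # hypothetical z-score
print(f"two-sided p = {p_two_sided(z):.3f}, one-sided p = {p_one_sided(z):.3f}")

# Choosing the side after peeking: under a true null, always test the tail
# that the observed z happened to fall in.
rng = random.Random(0)
alpha, reruns = 0.05, 100_000
hits = sum((1 - norm.cdf(abs(rng.gauss(0, 1)))) < alpha for _ in range(reruns))
post_hoc_rate = hits / reruns
print(f"false-positive rate with a post-hoc side: {post_hoc_rate:.3f} (nominal alpha = {alpha})")
```

The same z fails the two-sided test but passes the one-sided one, which is the extra power. The simulation shows the cost of claiming that power after the fact: the false-positive rate comes out near 0.10, double the nominal 0.05.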
The recommendation is simple: default to two-sided. In A/B testing you almost always care about a change going the wrong way (a “better” button that quietly tanks checkout is a disaster), and the two-sided test protects you from it. Reach for one-sided only in the rare case where the wrong direction is genuinely irrelevant — and decide it up front.
Practice Exercises
Exercise 1: Say it correctly
A teammate summarizes Lumen’s result as “p = 0.00048, so there’s a 99.95% chance the new page is better.” What’s wrong with that sentence, and how would you fix it?
Hint
It commits two of the classic errors: it treats the p-value as a probability about the hypothesis (“chance the page is better”) and it invents a “99.95% = 1 − p” confidence that isn’t there. The honest phrasing: if the new page were no better, a lift this large would appear only about 5 times in 10,000 — that’s surprising enough to reject “no effect.” The p-value is a statement about the data under the null, not about how likely the effect is.
Exercise 2: What does “not significant” prove?
The underpowered test returned p = 0.2857. Does this prove the new page is no better than the old one? Explain.
Hint
No. “Not significant” means the experiment couldn’t rule out chance, not that it proved no effect — absence of evidence isn’t evidence of absence. Here the difference (0.0120) is real in the data; there just weren’t enough users (1,500 per arm) to distinguish it from noise. The experiment is underpowered, so its big p-value reflects low resolution, not a genuine null. The right move is to check the power/sample size, not to declare the change worthless.
Exercise 3: When is one-sided allowed?
Under what conditions is it legitimate to use a one-sided test, and why is two-sided the default?
Hint
A one-sided test is valid only if you truly care about one direction of effect and you commit to that direction before running the experiment. It’s more powerful because it puts all the tail area on one side, but choosing the direction after seeing the data inflates your real Type I error. Two-sided is the default because in A/B testing a change can hurt as well as help, and you want to catch a regression — so a “surprising” result should count in either direction.
Summary
A p-value is the probability of a difference at least this extreme if the null were true: a statement about the data under “no effect,” never about the hypothesis. That distinction rules out the four common misreadings: it is not the probability the null is true, not the probability the result was due to chance, not a measure of the effect’s size, and “p = 0.05” is not “95% chance it’s real.” The decision rule compares p to a pre-chosen α (the Type I error rate, usually 0.05): p < α means significant, while “not significant” means couldn’t rule out chance, not proved no effect. Run on two Lumen experiments, the z-test returns p = 0.00048 (significant) for the real study and p = 0.2857 for a similar-gap but underpowered one. That contrast shows a big p-value doesn’t prove the change did nothing; it only means this experiment couldn’t detect a difference. Finally, default to a two-sided test; use one-sided only when you truly care about one direction and commit to it before running.
Key Concepts
- What p is — probability of data this extreme if the null were true; small p is evidence against “no effect.”
- What p is not — not P(null true), not P(due to chance), not the effect’s size, not “95% it’s real.”
- p vs. α — p < α means significant; “not significant” means couldn’t rule out chance, not proved no effect.
- Two-sided by default — counts both directions; one-sided is more powerful but only if committed to up front.
Why This Matters
The p-value is the most reported, and most misreported, number in A/B testing, and reading it wrong costs both ways: teams ship changes on flukes they mistook for certainty, and they kill good ideas whose experiments were simply too small. The underpowered Lumen result is the everyday version of that second mistake: an observed 1.2-point lift buried under a p of 0.29, one bad summary away from “the change didn’t work.” Reading the p-value honestly (as a narrow statement about data under the null, bounded by the α you set and the power you had) is what separates a trustworthy experiment culture from cargo-cult statistics. But even read correctly, a single p-value still won’t tell you how big the effect is or how uncertain you should be about it. That’s the job of the confidence interval, next.
Next Steps
Continue to Lesson 3 - Confidence Intervals for the Difference
Move past the yes/no verdict to a range: how big the effect plausibly is, and how sure you can be.
Continue Building Your Skills
You learned to read a p-value for what it is — the probability of data this extreme under the null — and to reject the four misreadings that fool teams, applying the p-versus-α rule to two contrasting Lumen results (p = 0.00048 significant, p = 0.2857 underpowered). Next you’ll go past the yes/no verdict entirely: a confidence interval turns the difference into a range of plausible effects, telling you not just whether the change is real but how big it might be and how much uncertainty is left.