Lesson 3 - Bayes' Theorem
Welcome to Bayes’ Theorem
In the last lesson you saw that $P(A \mid B)$ and $P(B \mid A)$ are not the same number. That raises an obvious question: if you happen to know one of them, can you recover the other? Often you can, and Bayes’ theorem is the formula that does it. It is one of the most useful results in all of statistics, the engine behind spam filters, medical diagnosis, and the machine-learning classifier you will build in the next lesson.
In this lesson you will derive Bayes’ theorem from the conditional-probability formula you already know, learn the four names every Bayesian calculation uses, and work the famous medical-test problem in Python. The answer will surprise you.
By the end of this lesson, you will be able to:
- Derive Bayes’ theorem from the definition of conditional probability
- Name the prior, likelihood, evidence, and posterior in any problem
- Compute a posterior probability in Python, including the full medical-test example
- Explain why the base rate of an event can completely change what a test result means
You only need the conditional-probability ideas from Lessons 1 and 2, plus a little Python. Let’s begin.
Deriving Bayes’ Theorem
Recall the definition of conditional probability from Lesson 1. The probability of $A$ given $B$ is the joint probability of both events divided by the probability of the thing you conditioned on:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
The same definition works the other way around. The probability of $B$ given $A$ is:
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
Look closely: both formulas contain the same joint probability $P(A \cap B)$. That shared term is the bridge between the two conditionals. Solve the second equation for $P(A \cap B)$:
$$P(A \cap B) = P(B \mid A)\,P(A)$$
Now substitute that into the first equation, and you have Bayes’ theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
That is the whole derivation. Bayes’ theorem is not a new law of probability; it is just the conditional-probability formula rearranged so you can flip the direction of the conditioning. If you know $P(B \mid A)$ but want $P(A \mid B)$, this is how you get there.
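One way to convince yourself that the rearrangement is sound is to check it on numbers. The sketch below uses a made-up joint distribution over two events (the four probabilities are illustrative assumptions, not data from this lesson) and compares the conditional computed directly from the joint with the one obtained by flipping the other conditional.

```python
# Hypothetical joint probabilities for two events A and B (they sum to 1).
p_a_and_b = 0.20          # P(A and B)
p_a_and_not_b = 0.10      # P(A and not B)
p_not_a_and_b = 0.30      # P(not A and B)
p_not_a_and_not_b = 0.40  # P(not A and not B)

# Marginals: sum the joint over the other event.
p_a = p_a_and_b + p_a_and_not_b   # P(A)
p_b = p_a_and_b + p_not_a_and_b   # P(B)

# The two conditionals, straight from the definition.
p_a_given_b = p_a_and_b / p_b     # P(A | B)
p_b_given_a = p_a_and_b / p_a     # P(B | A)

# Flip the conditional: recover P(A | B) from P(B | A), P(A), and P(B).
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print("P(B | A)           :", round(p_b_given_a, 4))
print("P(A | B), direct   :", round(p_a_given_b, 4))
print("P(A | B), via Bayes:", round(p_a_given_b_bayes, 4))
```

The two conditionals differ from each other, yet the direct and flipped versions of the same conditional agree, which is all the derivation claims.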
The Vocabulary: Prior, Likelihood, Evidence, Posterior
Bayes’ theorem comes with four named pieces. Learning the names is worth the effort, because every application — medical tests, spam filters, classifiers — uses exactly this vocabulary.
Read the formula again with the names attached:
$$\underbrace{P(A \mid B)}_{\text{posterior}} = \frac{\overbrace{P(B \mid A)}^{\text{likelihood}} \; \overbrace{P(A)}^{\text{prior}}}{\underbrace{P(B)}_{\text{evidence}}}$$
- The prior $P(A)$ is what you believed about $A$ before seeing any evidence. In the medical example it is the disease’s prevalence in the population.
- The likelihood $P(B \mid A)$ is how probable the evidence is if $A$ were true. For a test, it is the sensitivity: how often a sick person tests positive.
- The evidence $P(B)$ is the overall probability of seeing $B$ at all, across every way it could happen. It acts as a normalizer.
- The posterior $P(A \mid B)$ is your updated belief about $A$ after the evidence arrives. It is the number you actually want.
The big idea hiding in those four words: Bayes’ theorem is how you update a belief when new evidence arrives. You start with a prior, you observe something, and the posterior is your revised prior. That single move powers an enormous amount of modern statistics and machine learning.
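To make the four names concrete, here is a minimal sketch of the update as a function whose arguments carry those names. The function name `posterior` and the input numbers are assumptions chosen only for illustration.

```python
def posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Return the updated belief: likelihood * prior / evidence."""
    if not 0 < evidence <= 1:
        raise ValueError("evidence must be a probability greater than 0")
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
belief_before = 0.30  # prior
belief_after = posterior(prior=belief_before, likelihood=0.80, evidence=0.50)

print("before the evidence:", belief_before)
print("after the evidence :", round(belief_after, 2))
```

Passing the pieces by keyword keeps the vocabulary visible at the call site, which helps when the same calculation reappears with different events.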
Computing the evidence
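The one term that is not handed to you directly is the evidence $P(B)$. You almost always compute it with the law of total probability: $B$ can happen either when $A$ is true or when $A$ is false, so add up both routes.
$$P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)$$
Plugging that expanded denominator into Bayes’ theorem gives the form you will actually code:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}$$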
The two terms in the denominator are the true positives and the false positives. Keep that picture in mind — it is the key to the result we are about to compute.
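The two routes can be kept separate in code so you can see how much each one contributes. This is a small sketch with a hypothetical helper name, `evidence_routes`, and hypothetical inputs; it is not the only way to organize the calculation.

```python
def evidence_routes(prior, likelihood, likelihood_if_false):
    """Split the evidence into its two routes for a yes/no hypothesis."""
    via_true = likelihood * prior                  # evidence seen, hypothesis true
    via_false = likelihood_if_false * (1 - prior)  # evidence seen, hypothesis false
    return via_true, via_false

# Hypothetical inputs, for illustration only.
via_true, via_false = evidence_routes(
    prior=0.20, likelihood=0.90, likelihood_if_false=0.10
)
evidence = via_true + via_false
posterior = via_true / evidence  # share of the evidence that came via the true route

print("via true route :", round(via_true, 4))
print("via false route:", round(via_false, 4))
print("evidence       :", round(evidence, 4))
print("posterior      :", round(posterior, 4))
```

Seen this way, the posterior is simply the fraction of the evidence that arrived through the route where the hypothesis is true.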
The Medical Test Example
Here is the problem that makes Bayes’ theorem famous. A disease affects 1% of people, so the prevalence is $P(D) = 0.01$. A test for it has:
- Sensitivity 99%: a sick person tests positive 99% of the time, so $P(+ \mid D) = 0.99$.
- Specificity 95%: a healthy person correctly tests negative 95% of the time, so the false-positive rate is $P(+ \mid \neg D) = 1 - 0.95 = 0.05$.
You take the test and it comes back positive. What is the probability you actually have the disease? Most people, including many doctors, guess somewhere around 95% or 99%. Let’s compute the real answer.
# Givens
prevalence = 0.01 # P(D)
sensitivity = 0.99 # P(+ | D)
specificity = 0.95
false_pos = 1 - specificity # P(+ | not D) = 0.05
# Bayes' theorem with the evidence expanded via total probability
prior = prevalence
likelihood = sensitivity
evidence = sensitivity * prevalence + false_pos * (1 - prevalence)
posterior = likelihood * prior / evidence
print(round(posterior, 4))
Output:
0.1667
A positive test means only a 16.7% chance that you actually have the disease. Not 99%, not 95%, but about one in six. The other five out of six positive results are false alarms. This is the counterintuitive headline of the whole lesson, and it is not a trick of the arithmetic. To see exactly where it comes from, let’s drop the abstractions and count people.
The natural-frequencies intuition
Probabilities like 0.01 and 0.99 are slippery to reason about. Whole numbers are not. So imagine a concrete population of 10,000 people and walk the numbers through:
N = 10_000
diseased = N * prevalence # 100 people actually have it
healthy = N * (1 - prevalence) # 9,900 do not
true_positives = diseased * sensitivity # sick people who test +
false_positives = healthy * false_pos # healthy people who test +
# Round for display: false_pos = 1 - 0.95 carries a tiny floating-point error
print("true positives :", round(true_positives, 1))
print("false positives:", round(false_positives, 1))
print("all positives :", round(true_positives + false_positives, 1))
print("P(D | +) :", round(true_positives / (true_positives + false_positives), 4))
Output:
true positives : 99.0
false positives: 495.0
all positives : 594.0
P(D | +) : 0.1667
Now the result is obvious. Out of 10,000 people, only 100 have the disease, and 99 of them test positive. But among the 9,900 healthy people, 5% (that is, 495 people) test positive by mistake. So when you line up everyone who got a positive result, you have 99 truly sick people drowning in a crowd of 495 false alarms. Your chance of being one of the truly sick is $99 / (99 + 495) = 99/594 \approx 0.167$.
The false positives win not because the test is bad — 95% specificity is quite good — but because the healthy group is so much larger than the sick group. A small error rate applied to a huge population still produces a lot of false alarms. That mismatch is the heart of the result, and it has a name.
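If counting an idealized 10,000 people still feels too tidy, you can let chance do the counting. The following is a rough simulation sketch, not part of the original calculation: the function name, the population size of 200,000, and the fixed seed are all assumptions made for illustration. Each simulated person is sick with probability 1% and then tests positive with the matching rate.

```python
import random

def simulate_positive_tests(n_people, prevalence, sensitivity, false_pos, seed=0):
    """Simulate n_people taking the test; return (true positives, false alarms)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_pos = 0
    false_alarms = 0
    for _ in range(n_people):
        sick = rng.random() < prevalence
        p_positive = sensitivity if sick else false_pos
        if rng.random() < p_positive:
            if sick:
                true_pos += 1
            else:
                false_alarms += 1
    return true_pos, false_alarms

true_pos, false_alarms = simulate_positive_tests(
    n_people=200_000, prevalence=0.01, sensitivity=0.99, false_pos=0.05
)
estimate = true_pos / (true_pos + false_alarms)

print("true positives    :", true_pos)
print("false alarms      :", false_alarms)
print("simulated P(D | +):", round(estimate, 3))
```

The exact counts depend on the seed, but the simulated fraction should land close to the 0.1667 computed above, with false alarms outnumbering true positives by roughly five to one.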
Why Base Rates Matter
The prior, the base rate of the disease, is doing the heavy lifting in that calculation. The test never changed; the low prevalence alone made the posterior small. So what happens as the disease becomes rarer or more common? Sweep the prevalence from 0.1% up to 50%:
def posterior_given_prevalence(p, sens=0.99, fp=0.05):
    evidence = sens * p + fp * (1 - p)
    return sens * p / evidence
for p in [0.001, 0.01, 0.10, 0.50]:
    print(f"prevalence {p:6.3f} -> P(D | +) = {posterior_given_prevalence(p):.4f}")
Output:
prevalence  0.001 -> P(D | +) = 0.0194
prevalence  0.010 -> P(D | +) = 0.1667
prevalence  0.100 -> P(D | +) = 0.6875
prevalence  0.500 -> P(D | +) = 0.9519
The same positive test means a 1.9% chance of disease at 0.1% prevalence, 16.7% at 1% prevalence, but 68.8% once 10% of people are affected, and over 95% when half the population has it. Identical test, wildly different conclusions, driven entirely by the prior.
This is why the prior is not optional. Ignoring the base rate is one of the most common mistakes in reasoning about probability — it has its own name, the base-rate fallacy. A test result only updates your belief; it does not replace it. Where you start determines where you land.
The base-rate fallacy
Judging a positive test by its sensitivity alone (“the test is 99% accurate, so I’m 99% likely to be sick”) throws away the prior and badly overstates the risk. Bayes’ theorem forces you to keep the base rate in the calculation, which is exactly why the honest answer is 16.7% and not 99%.
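Because a result updates your belief rather than replacing it, a natural follow-up is to ask what a second positive result would do. The sketch below is a hedged illustration, assuming the second test behaves as an independent run with the same sensitivity and false-positive rate (real repeat tests may not satisfy that). The posterior after the first test is fed back in as the prior for the second.

```python
def update(prior, sens=0.99, fp=0.05):
    """One Bayesian update after a positive result."""
    evidence = sens * prior + fp * (1 - prior)
    return sens * prior / evidence

belief = 0.01  # start from the 1% base rate
history = []
for test_number in (1, 2):
    belief = update(belief)  # the previous posterior becomes the next prior
    history.append(belief)
    print(f"after positive test {test_number}: P(D | results so far) = {belief:.4f}")
```

Under that independence assumption, the first positive moves the belief from 1% to about 16.7%, and the second moves it to roughly 80%. The starting point still matters at every step, but repeated evidence can overcome a low base rate.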
Bayes Recovers a Known Conditional
Bayes’ theorem is not magic — it is bookkeeping that has to agree with the data. A good way to build trust in it is to compute a conditional you could also measure directly, and check that Bayes returns the same number. We will use the penguins dataset and flip a species/island conditional.
import pandas as pd
penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
# The three ingredients Bayes needs
p_biscoe = (penguins["island"] == "Biscoe").mean() # prior P(Biscoe)
p_gentoo = (penguins["species"] == "Gentoo").mean() # evidence P(Gentoo)
biscoe = penguins[penguins["island"] == "Biscoe"]
p_gentoo_g_biscoe = (biscoe["species"] == "Gentoo").mean() # likelihood P(Gentoo | Biscoe)
print("P(Biscoe) :", round(p_biscoe, 4))
print("P(Gentoo) :", round(p_gentoo, 4))
print("P(Gentoo | Biscoe):", round(p_gentoo_g_biscoe, 4))
Output:
P(Biscoe) : 0.4884
P(Gentoo) : 0.3605
P(Gentoo | Biscoe): 0.7381
We have the likelihood $P(\text{Gentoo} \mid \text{Biscoe})$, the prior $P(\text{Biscoe})$, and the evidence $P(\text{Gentoo})$. Bayes’ theorem flips them into $P(\text{Biscoe} \mid \text{Gentoo})$, the probability that a Gentoo penguin lives on Biscoe:
bayes_result = p_gentoo_g_biscoe * p_biscoe / p_gentoo
# Compute the same thing directly, by restricting to Gentoos
gentoo = penguins[penguins["species"] == "Gentoo"]
direct_result = (gentoo["island"] == "Biscoe").mean()
print("Bayes P(Biscoe | Gentoo):", round(bayes_result, 4))
print("Direct P(Biscoe | Gentoo):", round(direct_result, 4))
Output:
Bayes P(Biscoe | Gentoo): 1.0
Direct P(Biscoe | Gentoo): 1.0
Both give 1.0: every Gentoo in this dataset lives on Biscoe Island, so given that a penguin is a Gentoo, it is certainly on Biscoe. Bayes’ theorem recovered that fact from three separate marginal and conditional probabilities, none of which mentioned it directly. When the formula and the raw data agree, you can trust the formula in the harder cases where you cannot measure the answer directly, like the medical test, where you never get to observe who is truly sick.
Practice Exercises
Exercise 1: A less accurate test
Rework the medical-test calculation for a screening test with sensitivity 90% and specificity 90%, keeping the disease prevalence at 1%. Compute $P(D \mid +)$. Is a positive result more or less informative than with the original 99%/95% test?
Hint
Set sensitivity = 0.90 and false_pos = 1 - 0.90, then reuse evidence = sensitivity * prevalence + false_pos * (1 - prevalence) and posterior = sensitivity * prevalence / evidence. With more false positives, the posterior should drop below 16.7%.
Exercise 2: Find the break-even prevalence
Using the original test (sensitivity 99%, false-positive rate 5%), find roughly the prevalence at which a positive result first gives you a better-than-even chance of disease, that is, where $P(D \mid +)$ crosses 0.5.
Hint
Loop over a range of prevalences, e.g. for p in [0.02, 0.04, 0.05, 0.06, 0.08]:, print posterior_given_prevalence(p), and watch for the value where it passes 0.5. It happens just below a prevalence of 0.05.
Exercise 3: Flip another penguin conditional
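You know $P(\text{Biscoe} \mid \text{Gentoo})$ from the lesson. Use Bayes’ theorem to instead compute $P(\text{Adelie} \mid \text{Dream})$ from $P(\text{Dream} \mid \text{Adelie})$, $P(\text{Adelie})$, and $P(\text{Dream})$, then verify it against the direct calculation.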
Hint
The likelihood is (penguins[penguins["species"]=="Adelie"]["island"]=="Dream").mean(), the prior is (penguins["species"]=="Adelie").mean(), and the evidence is (penguins["island"]=="Dream").mean(). Multiply likelihood by prior, divide by evidence, then compare to restricting on island == "Dream".
Summary
Bayes’ theorem falls straight out of conditional probability: because $P(A \mid B)$ and $P(B \mid A)$ share the joint probability $P(A \cap B)$, you can rearrange the definition to flip the direction of conditioning. The result is $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$, with four named parts: prior, likelihood, evidence, and posterior. Worked on the classic medical test, it delivers a famously counterintuitive answer: a positive result on a 99%-sensitive test means only a 16.7% chance of disease, because the disease is rare and false positives swamp the true ones. Raise the prevalence and the same test becomes far more convincing, which shows that base rates decide what evidence means. Above all, Bayes’ theorem is the mechanism for updating a belief when evidence arrives.
Key Concepts
- Bayes’ theorem: $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$, derived by rearranging conditional probability.
- Prior: your belief about $A$ before seeing the evidence (e.g. disease prevalence).
- Likelihood: how probable the evidence is if $A$ is true (e.g. test sensitivity).
- Evidence: the total probability of the evidence, often computed with the law of total probability.
- Posterior: your updated belief after the evidence arrives.
- Base-rate fallacy: ignoring the prior and overstating how much a positive test tells you.
Why This Matters
Every system that reasons under uncertainty leans on this formula. Spam filters update the odds that an email is junk as each suspicious word appears; diagnostic tools weigh a symptom against how common a disease is; recommendation engines revise what they think you’ll like with every click. Understanding prior, likelihood, and posterior is what lets you read those results honestly instead of being fooled by a confident-sounding number — and it is the exact foundation of the classifier you build next.
Next Steps
Continue to Lesson 4 - The Naive Bayes Algorithm
Turn Bayes' theorem into a working classifier and see how a 'naive' independence assumption makes it fast and surprisingly accurate.
Continue Building Your Skills
You now hold one of the most quietly powerful tools in statistics: a way to flip a conditional, weigh evidence against a base rate, and turn a fresh observation into an updated belief. In the next lesson you will hand that formula a whole table of features and watch it become a classifier — the Naive Bayes algorithm — predicting categories from data with nothing more than the probabilities you just learned to compute.