Lesson 3 - Bayes' Theorem

Welcome to Bayes’ Theorem

In the last lesson you saw that $P(A \mid B)$ and $P(B \mid A)$ are not the same number. That raises an obvious question: if you happen to know one of them, can you recover the other? You can, provided you also know the individual probabilities $P(A)$ and $P(B)$, and Bayes’ theorem is the formula that does it. It is one of the most useful results in all of statistics, the engine behind spam filters, medical diagnosis, and the machine-learning classifier you will build in the next lesson.

In this lesson you will derive Bayes’ theorem from the conditional-probability formula you already know, learn the four names every Bayesian calculation uses, and work the famous medical-test problem in Python. The answer will surprise you.

By the end of this lesson, you will be able to:

  • Derive Bayes’ theorem from the definition of conditional probability
  • Name the prior, likelihood, evidence, and posterior in any problem
  • Compute a posterior probability in Python, including the full medical-test example
  • Explain why the base rate of an event can completely change what a test result means

You only need the conditional-probability ideas from Lessons 1 and 2, plus a little Python. Let’s begin.


Deriving Bayes’ Theorem

Recall the definition of conditional probability from Lesson 1. The probability of $A$ given $B$ is the joint probability of both events divided by the probability of the thing you conditioned on:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The same definition works the other way around. The probability of $B$ given $A$ is:

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Look closely: both formulas contain the same joint probability $P(A \cap B)$. That shared term is the bridge between the two conditionals. Solve the second equation for $P(A \cap B)$:

$$P(A \cap B) = P(B \mid A)\,P(A)$$

Now substitute that into the first equation, and you have Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

That is the whole derivation. Bayes’ theorem is not a new law of probability. It is just the conditional-probability formula rearranged so you can flip the direction of the conditioning. If you know $P(B \mid A)$ but want $P(A \mid B)$, this is how you get there.
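As a quick numerical check of the derivation, the sketch below starts from a small joint distribution over two events and confirms that both sides of Bayes’ theorem agree. The four joint probabilities are made-up illustrative values, not data from this course.

```python
# Hypothetical joint distribution over two events A and B (illustrative values only).
joint = {
    (True, True): 0.12,    # P(A and B)
    (True, False): 0.18,   # P(A and not B)
    (False, True): 0.28,   # P(not A and B)
    (False, False): 0.42,  # P(not A and not B)
}

# Marginals: sum the joint over the other event.
p_a = sum(p for (a, b), p in joint.items() if a)
p_b = sum(p for (a, b), p in joint.items() if b)

# Conditionals straight from the definition.
p_a_given_b = joint[(True, True)] / p_b
p_b_given_a = joint[(True, True)] / p_a

# Bayes' theorem: rebuild P(A | B) from P(B | A), P(A), and P(B).
bayes = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 4), round(bayes, 4))  # 0.3 0.3
```

The direct conditional and the Bayes reconstruction match because both are built from the same joint probability $P(A \cap B)$.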


The Vocabulary: Prior, Likelihood, Evidence, Posterior

Bayes’ theorem comes with four named pieces. Learning the names is worth the effort, because every application — medical tests, spam filters, classifiers — uses exactly this vocabulary.

Read the formula again with the names attached:

$$\underbrace{P(A \mid B)}_{\text{posterior}} = \frac{\overbrace{P(B \mid A)}^{\text{likelihood}}\;\overbrace{P(A)}^{\text{prior}}}{\underbrace{P(B)}_{\text{evidence}}}$$

  • The prior $P(A)$ is what you believed about $A$ before seeing any evidence. In the medical example it is the disease’s prevalence in the population.
  • The likelihood $P(B \mid A)$ is how probable the evidence $B$ is if $A$ were true. For a test, it is the sensitivity: how often a sick person tests positive.
  • The evidence $P(B)$ is the overall probability of seeing $B$ at all, across every way it could happen. It acts as a normalizer.
  • The posterior $P(A \mid B)$ is your updated belief about $A$ after the evidence arrives. It is the number you actually want.

The big idea hiding in those four words: Bayes’ theorem is how you update a belief when new evidence arrives. You start with a prior, you observe something, and the posterior is your revised belief. That single move powers an enormous amount of modern statistics and machine learning.

Figure: Evidence (a positive test) updates the 1% prior into a 16.7% posterior by multiplying by the likelihood and dividing by the evidence.
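To make the four names concrete, here is a minimal sketch of a single Bayesian update written with those names as arguments. The function `bayes_update` is a hypothetical helper, not part of any library, and the evidence value 0.0594 is simply taken as given here (the lesson computes it in the sections that follow).

```python
def bayes_update(prior, likelihood, evidence):
    """Return the posterior P(A | B) from the three named ingredients."""
    if evidence <= 0:
        raise ValueError("evidence P(B) must be positive")
    return likelihood * prior / evidence

# Numbers from the medical-test example used in this lesson.
prior = 0.01        # P(D): prevalence
likelihood = 0.99   # P(+ | D): sensitivity
evidence = 0.0594   # P(+): overall chance of a positive test, taken as given here

posterior = bayes_update(prior, likelihood, evidence)
print(f"prior {prior:.3f} -> posterior {posterior:.3f}")  # prior 0.010 -> posterior 0.167
```

Keeping the three inputs as separate named arguments makes it harder to mix up the likelihood with the posterior, which is the most common slip in problems like this.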

Computing the evidence

The one term that is not handed to you directly is the evidence $P(B)$. You almost always compute it with the law of total probability: $B$ can happen either when $A$ is true or when $A$ is false, so add up both routes.

$$P(B) = P(B \mid A)\,P(A) + P(B \mid \text{not } A)\,P(\text{not } A)$$

Plugging that expanded denominator into Bayes’ theorem gives the form you will actually code:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \text{not } A)\,P(\text{not } A)}$$

In a testing problem, the two terms in the denominator are the probability of a true positive and the probability of a false positive. Keep that picture in mind, because it is the key to the result we are about to compute.
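The normalizing role of the evidence can be seen in a short sketch: dividing each route by their sum yields posteriors for $A$ and for not $A$ that add up to 1. The function name and the input values below are hypothetical, chosen only for the demonstration.

```python
def posteriors(prior, p_b_given_a, p_b_given_not_a):
    """Return (P(A | B), P(not A | B)) using the law of total probability."""
    true_route = p_b_given_a * prior               # B happens and A is true
    false_route = p_b_given_not_a * (1 - prior)    # B happens and A is false
    evidence = true_route + false_route            # P(B)
    return true_route / evidence, false_route / evidence

# Illustrative inputs (hypothetical values).
p_a, p_not_a = posteriors(prior=0.2, p_b_given_a=0.9, p_b_given_not_a=0.3)
print(round(p_a, 4), round(p_not_a, 4), round(p_a + p_not_a, 4))  # 0.4286 0.5714 1.0
```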


The Medical Test Example

Here is the problem that makes Bayes’ theorem famous. A disease affects 1% of people, so the prevalence is $P(D) = 0.01$. A test for it has:

  • Sensitivity 99%: a sick person tests positive 99% of the time, so $P(+ \mid D) = 0.99$.
  • Specificity 95%: a healthy person correctly tests negative 95% of the time, so the false-positive rate is $P(+ \mid \text{not } D) = 0.05$.

You take the test and it comes back positive. What is the probability you actually have the disease? Most people, including many doctors, guess somewhere around 95% or 99%. Let’s compute the real answer.

# Givens
prevalence   = 0.01      # P(D)
sensitivity  = 0.99      # P(+ | D)
specificity  = 0.95
false_pos    = 1 - specificity  # P(+ | not D) = 0.05

# Bayes' theorem with the evidence expanded via total probability
prior      = prevalence
likelihood = sensitivity
evidence   = sensitivity * prevalence + false_pos * (1 - prevalence)

posterior = likelihood * prior / evidence
print(round(posterior, 4))
0.1667

A positive test means only a 16.7% chance that you actually have the disease. Not 99%, not 95% — about one in six. The other five out of six positive results are false alarms. This is the counterintuitive headline of the whole lesson, and it is not a trick of the arithmetic. To see exactly where it comes from, let’s drop the abstractions and count people.

The natural-frequencies intuition

Probabilities like 0.01 and 0.99 are slippery to reason about. Whole numbers are not. So imagine a concrete population of 10,000 people and walk the numbers through:

N = 10_000
diseased = N * prevalence          # 100 people actually have it
healthy  = N * (1 - prevalence)    # 9,900 do not

true_positives  = diseased * sensitivity   # sick people who test +
false_positives = healthy  * false_pos     # healthy people who test +

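# Round for display: 1 - 0.95 is not exactly 0.05 in floating point,
# so the raw products can carry a tiny trailing error.
print("true positives :", round(true_positives, 1))
print("false positives:", round(false_positives, 1))
print("all positives  :", round(true_positives + false_positives, 1))
print("P(D | +)       :", round(true_positives / (true_positives + false_positives), 4))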
true positives : 99.0
false positives: 495.0
all positives  : 594.0
P(D | +)       : 0.1667

Now the result is obvious. Out of 10,000 people, only 100 have the disease, and 99 of them test positive. But among the 9,900 healthy people, 5% (that is, 495 people) test positive by mistake. So when you line up everyone who got a positive result, you have 99 truly sick people drowning in a crowd of 495 false alarms. Your chance of being one of the truly sick is $99 / (99 + 495) = 99 / 594 \approx 16.7\%$.
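The ratio 99/594 reduces exactly to one in six. A small sketch with Python’s standard `fractions` module redoes the Bayes calculation in exact arithmetic, so no rounding is involved:

```python
from fractions import Fraction

# Exact values from the lesson's medical-test example.
prevalence = Fraction(1, 100)     # P(D)
sensitivity = Fraction(99, 100)   # P(+ | D)
false_pos = Fraction(5, 100)      # P(+ | not D)

evidence = sensitivity * prevalence + false_pos * (1 - prevalence)
posterior = sensitivity * prevalence / evidence

print(posterior)          # 1/6
print(float(posterior))   # 0.16666666666666666
```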

Figure: Tree diagram splitting 10,000 people into 100 diseased and 9,900 healthy, then into positive and negative test results. A positive test produces 99 true positives but 495 false positives, so only 99 of the 594 positives are truly sick, which is why P(disease | positive) is just 16.7%.
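If counting expected people still feels abstract, the same answer can be approached by simulation. The sketch below is an illustration, not part of the original lesson code: it draws a random population with the lesson’s prevalence, sensitivity, and false-positive rate, then measures what share of positive tests belong to sick people. The population size and seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

prevalence, sensitivity, false_pos_rate = 0.01, 0.99, 0.05
n_people = 200_000  # arbitrary; larger populations give steadier estimates

true_pos = 0
false_pos = 0
for _ in range(n_people):
    sick = random.random() < prevalence
    # Chance of a positive result depends on whether this person is sick.
    p_positive = sensitivity if sick else false_pos_rate
    if random.random() < p_positive:
        if sick:
            true_pos += 1
        else:
            false_pos += 1

# Share of positive tests that come from truly sick people.
estimate = true_pos / (true_pos + false_pos)
print(round(estimate, 3))
```

The estimate should land close to 1/6, with some run-to-run wobble that shrinks as `n_people` grows.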

The false positives win not because the test is bad — 95% specificity is quite good — but because the healthy group is so much larger than the sick group. A small error rate applied to a huge population still produces a lot of false alarms. That mismatch is the heart of the result, and it has a name.


Why Base Rates Matter

The prior $P(D)$, the base rate of the disease, is doing the heavy lifting in that calculation. The test itself is quite accurate; it is the low prevalence that makes the posterior small. So what happens if the disease is rarer or more common? The function below recomputes the posterior at four prevalences, from 0.1% up to 50%:

def posterior_given_prevalence(p, sens=0.99, fp=0.05):
    evidence = sens * p + fp * (1 - p)
    return sens * p / evidence

for p in [0.001, 0.01, 0.10, 0.50]:
    print(f"prevalence {p:6.3f}  ->  P(D | +) = {posterior_given_prevalence(p):.4f}")
prevalence  0.001  ->  P(D | +) = 0.0194
prevalence  0.010  ->  P(D | +) = 0.1667
prevalence  0.100  ->  P(D | +) = 0.6875
prevalence  0.500  ->  P(D | +) = 0.9519

The same positive test means a 1.9% chance of disease when only 0.1% of people are affected, 16.7% at 1% prevalence, 68.8% once 10% of people are affected, and over 95% when half the population has it. Identical test, wildly different conclusions, driven entirely by the prior.
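The counting view from the previous section explains these numbers too. As a rough sketch, the hypothetical helper `positives_per_10k` below reports how many true and false positives to expect among 10,000 people at each prevalence, using the same test (sensitivity 0.99, false-positive rate 0.05):

```python
def positives_per_10k(prevalence, sens=0.99, fp=0.05, n=10_000):
    """Return (expected true positives, expected false positives) among n people."""
    sick = n * prevalence
    healthy = n - sick
    return sick * sens, healthy * fp

for p in [0.001, 0.01, 0.10, 0.50]:
    tp, fp_count = positives_per_10k(p)
    share = tp / (tp + fp_count)  # P(D | +) as a ratio of head counts
    print(f"prevalence {p:5.3f}: {tp:7.1f} true vs {fp_count:6.1f} false  ->  P(D | +) = {share:.4f}")
```

As prevalence rises, the pool of healthy people who can generate false alarms shrinks while the pool of sick people grows, so true positives overtake false positives.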

Figure: The posterior P(disease | positive) plotted against prevalence. The curve climbs steeply: at 1% prevalence a positive test means 16.7%, but at 10% it already means 68.8%. The base rate sets the meaning of the result.

This is why the prior is not optional. Ignoring the base rate is one of the most common mistakes in reasoning about probability — it has its own name, the base-rate fallacy. A test result only updates your belief; it does not replace it. Where you start determines where you land.

The base-rate fallacy

Judging a positive test by its sensitivity alone (“the test is 99% accurate, so I’m 99% likely to be sick”) throws away the prior and badly overstates the risk. Bayes’ theorem forces you to keep the base rate in the calculation, which is exactly why the honest answer is 16.7% and not 99%.


Bayes Recovers a Known Conditional

Bayes’ theorem is not magic — it is bookkeeping that has to agree with the data. A good way to build trust in it is to compute a conditional you could also measure directly, and check that Bayes returns the same number. We will use the penguins dataset and flip a species/island conditional.

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")

# The three ingredients Bayes needs
p_biscoe          = (penguins["island"] == "Biscoe").mean()      # prior P(Biscoe)
p_gentoo          = (penguins["species"] == "Gentoo").mean()     # evidence P(Gentoo)
biscoe            = penguins[penguins["island"] == "Biscoe"]
p_gentoo_g_biscoe = (biscoe["species"] == "Gentoo").mean()       # likelihood P(Gentoo | Biscoe)

print("P(Biscoe)         :", round(p_biscoe, 4))
print("P(Gentoo)         :", round(p_gentoo, 4))
print("P(Gentoo | Biscoe):", round(p_gentoo_g_biscoe, 4))
P(Biscoe)         : 0.4884
P(Gentoo)         : 0.3605
P(Gentoo | Biscoe): 0.7381

We have the likelihood $P(\text{Gentoo} \mid \text{Biscoe})$, the prior $P(\text{Biscoe})$, and the evidence $P(\text{Gentoo})$. Bayes’ theorem flips them into $P(\text{Biscoe} \mid \text{Gentoo})$, the probability that a Gentoo penguin lives on Biscoe:

bayes_result = p_gentoo_g_biscoe * p_biscoe / p_gentoo

# Compute the same thing directly, by restricting to Gentoos
gentoo = penguins[penguins["species"] == "Gentoo"]
direct_result = (gentoo["island"] == "Biscoe").mean()

print("Bayes  P(Biscoe | Gentoo):", round(bayes_result, 4))
print("Direct P(Biscoe | Gentoo):", round(direct_result, 4))
Bayes  P(Biscoe | Gentoo): 1.0
Direct P(Biscoe | Gentoo): 1.0

Both give exactly 1.0 — every Gentoo in this dataset lives on Biscoe Island, so given that a penguin is a Gentoo, it is certainly on Biscoe. Bayes’ theorem recovered that fact from three separate marginal and conditional probabilities, none of which mentioned it directly. When the formula and the raw data agree, you can trust the formula in the harder cases where you cannot measure the answer directly — like the medical test, where you never get to observe who is truly sick.
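Because the Gentoo result is exactly 1.0, it can be reassuring to see the same agreement in a case where the answer is a less tidy number. The sketch below uses a tiny set of invented (species, island) records, not the penguins dataset, and plain Python counting instead of pandas.

```python
from collections import Counter

# Hypothetical (species, island) records, invented for illustration only.
records = (
    [("Gentoo", "Biscoe")] * 6
    + [("Adelie", "Biscoe")] * 2
    + [("Adelie", "Dream")] * 3
    + [("Chinstrap", "Dream")] * 4
    + [("Adelie", "Torgersen")] * 5
)
n = len(records)

pair_counts = Counter(records)
species_counts = Counter(species for species, _ in records)
island_counts = Counter(island for _, island in records)

# Ingredients for flipping P(Adelie | Biscoe) into P(Biscoe | Adelie).
prior = island_counts["Biscoe"] / n                                       # P(Biscoe)
evidence = species_counts["Adelie"] / n                                   # P(Adelie)
likelihood = pair_counts[("Adelie", "Biscoe")] / island_counts["Biscoe"]  # P(Adelie | Biscoe)

bayes_result = likelihood * prior / evidence
direct_result = pair_counts[("Adelie", "Biscoe")] / species_counts["Adelie"]

print(round(bayes_result, 4), round(direct_result, 4))  # 0.2 0.2
```

Here 2 of the 10 invented Adelie records are on Biscoe, so both routes give 0.2.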


Practice Exercises

Exercise 1: A less accurate test

Rework the medical-test calculation for a screening test with sensitivity 90% and specificity 90%, keeping the disease prevalence at 1%. Compute $P(D \mid +)$. Is a positive result more or less informative than with the original 99%/95% test?

Hint

Set sensitivity = 0.90 and false_pos = 1 - 0.90, then reuse evidence = sensitivity * prevalence + false_pos * (1 - prevalence) and posterior = sensitivity * prevalence / evidence. With more false positives, the posterior should drop below 16.7%.

Exercise 2: Find the break-even prevalence

Using the original test (sensitivity 99%, false-positive rate 5%), find roughly the prevalence at which a positive result first gives you a better-than-even chance of disease, that is, where $P(D \mid +)$ crosses 0.5.

Hint

Loop over a range of prevalences, e.g. for p in [0.02, 0.04, 0.05, 0.06, 0.08]:, print posterior_given_prevalence(p), and watch for the value where it passes 0.5. It happens just below a prevalence of 0.05.

Exercise 3: Flip another penguin conditional

You know $P(\text{Gentoo} \mid \text{Biscoe})$ from the lesson. Use Bayes’ theorem to instead compute $P(\text{Adelie} \mid \text{Dream})$ from $P(\text{Dream} \mid \text{Adelie})$, $P(\text{Adelie})$, and $P(\text{Dream})$, then verify it against the direct calculation.

Hint

The likelihood is (penguins[penguins["species"]=="Adelie"]["island"]=="Dream").mean(), the prior is (penguins["species"]=="Adelie").mean(), and the evidence is (penguins["island"]=="Dream").mean(). Multiply likelihood by prior, divide by evidence, then compare to restricting on island == "Dream".


Summary

Bayes’ theorem falls straight out of conditional probability: because $P(A \mid B)$ and $P(B \mid A)$ share the joint probability $P(A \cap B)$, you can rearrange the definition to flip the direction of conditioning. The result is $P(A \mid B) = P(B \mid A)\,P(A) / P(B)$, with four named parts: prior, likelihood, evidence, and posterior. Worked on the classic medical test, it delivers a famously counterintuitive answer: a positive result on a 99%-sensitive test means only a 16.7% chance of disease, because the disease is rare and false positives swamp the true ones. Raise the prevalence and the same test becomes far more convincing, which shows that base rates decide what evidence means. Above all, Bayes’ theorem is the mechanism for updating a belief when evidence arrives.

Key Concepts

  • Bayes’ theorem: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$, derived by rearranging conditional probability.
  • Prior $P(A)$: your belief about $A$ before seeing the evidence (e.g. disease prevalence).
  • Likelihood $P(B \mid A)$: how probable the evidence is if $A$ is true (e.g. test sensitivity).
  • Evidence $P(B)$: the total probability of the evidence, often computed with the law of total probability.
  • Posterior $P(A \mid B)$: your updated belief after the evidence arrives.
  • Base-rate fallacy: ignoring the prior and overstating how much a positive test tells you.

Why This Matters

Every system that reasons under uncertainty leans on this formula. Spam filters update the odds that an email is junk as each suspicious word appears; diagnostic tools weigh a symptom against how common a disease is; recommendation engines revise what they think you’ll like with every click. Understanding prior, likelihood, and posterior is what lets you read those results honestly instead of being fooled by a confident-sounding number — and it is the exact foundation of the classifier you build next.


Next Steps

Continue to Lesson 4 - The Naive Bayes Algorithm

Turn Bayes' theorem into a working classifier and see how a 'naive' independence assumption makes it fast and surprisingly accurate.



Continue Building Your Skills

You now hold one of the most quietly powerful tools in statistics: a way to flip a conditional, weigh evidence against a base rate, and turn a fresh observation into an updated belief. In the next lesson you will hand that formula a whole table of features and watch it become a classifier — the Naive Bayes algorithm — predicting categories from data with nothing more than the probabilities you just learned to compute.