Lesson 1 - Conditional Probability Fundamentals

Welcome to Conditional Probability

Most useful probabilities come with a condition attached. You rarely ask “what is the chance of rain?” in the abstract — you ask “what is the chance of rain given that the sky is grey?” The extra information changes the answer. Conditional probability is the tool for updating what you believe once you learn that something is true, and it is the single idea that everything in Bayesian reasoning is built on.

In this lesson you will make that idea concrete with a real dataset of penguins. You will condition on which island a penguin lives on, watch a probability move as you add that knowledge, and compute every number yourself in Python so the formula stops being abstract.

By the end of this lesson, you will be able to:

  • Explain what $P(A \mid B)$ means and read it as “the probability of A given B”
  • Use the formula $P(A \mid B) = P(A \cap B) / P(B)$ and compute it directly from data
  • Show how conditioning on new information changes a probability
  • Recognize that $P(A \mid B)$ is generally not $P(B \mid A)$, and see how conditioning relates to independence

You only need a little Python and pandas. Let’s begin.


What Conditional Probability Means

A plain probability $P(A)$ measures how likely an event $A$ is across the whole population. A conditional probability, written $P(A \mid B)$, measures how likely $A$ is once you already know that another event $B$ happened. The vertical bar reads as the word “given”: $P(A \mid B)$ is “the probability of $A$ given $B$.”

The key mental move is this: conditioning on $B$ means you throw away every outcome where $B$ is false and ask your question only inside the survivors. You shrink the world down to the rows where $B$ holds, then look at how often $A$ happens there.

[Figure: two panels. The first shows the full sample space with overlapping events A and B; the second shows the same picture with everything outside B greyed out.]
Conditioning on B greys out everything outside B, so B becomes the new universe, the denominator shrinks from the whole space down to B, and P(A | B) equals the A-and-B overlap divided by B.
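Before turning to real data, the "throw away everything outside B" move can be tried on a tiny sample space. The sketch below is only an illustration: the fair six-sided die and the two events (A = "even", B = "greater than 3") are made up for this example and are not part of the penguin dataset.

```python
from fractions import Fraction

# Hypothetical toy example: one roll of a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]

def is_even(x):          # event A
    return x % 2 == 0

def is_above_three(x):   # event B
    return x > 3

# Plain probability: A measured against the whole sample space.
p_a = Fraction(sum(is_even(x) for x in outcomes), len(outcomes))

# Conditioning: discard every outcome where B is false...
survivors = [x for x in outcomes if is_above_three(x)]
# ...then ask how often A happens among the survivors.
p_a_given_b = Fraction(sum(is_even(x) for x in survivors), len(survivors))

print(survivors)      # [4, 5, 6]
print(p_a)            # 1/2
print(p_a_given_b)    # 2/3
```

The only thing that changed between the two probabilities is the denominator: six outcomes for the plain probability, three survivors for the conditional one.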

Load the dataset for this module — body measurements of penguins from three species near Palmer Station, Antarctica — and look at how species and island relate:

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(pd.crosstab(penguins["island"], penguins["species"]))
species    Adelie  Chinstrap  Gentoo
island
Biscoe         44          0     124
Dream          56         68       0
Torgersen      52          0       0

This table is the entire world we will reason about. Every cell is a real count of penguins. Notice already that the islands are not interchangeable: Gentoos appear only on Biscoe, Chinstraps only on Dream, and Torgersen is pure Adelie. That structure is exactly what conditioning will let us exploit.


Computing P(A | B) From the Data

The unconditional probability of drawing a Gentoo from all 344 penguins is just its share of the whole dataset:

print(round((penguins["species"] == "Gentoo").mean(), 4))
0.3605

So with no extra information, $P(\text{Gentoo}) = 0.3605$, a little over one in three. Now suppose someone tells you the penguin lives on Biscoe. Conditioning means we keep only the Biscoe rows and ask how many are Gentoo:

biscoe = penguins[penguins["island"] == "Biscoe"]
print(round((biscoe["species"] == "Gentoo").mean(), 4))
0.7381

The probability jumped from 0.3605 to 0.7381. That is $P(\text{Gentoo} \mid \text{Biscoe})$: among Biscoe penguins, almost three in four are Gentoo. Learning the island more than doubled the chance. The boolean mask penguins["island"] == "Biscoe" did the conditioning by discarding the other two islands, and .mean() of a True/False column gives the proportion that are True.
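To see what the mask and .mean() are doing without any library in the way, here is a minimal plain-Python sketch of the same two steps. It assumes the counts from the island-by-species table are typed in by hand rather than downloaded; the names `records` and `share` are hypothetical helpers introduced only for this illustration.

```python
# Counts copied by hand from the island-by-species table.
counts = {
    ("Biscoe", "Adelie"): 44, ("Biscoe", "Gentoo"): 124,
    ("Dream", "Adelie"): 56, ("Dream", "Chinstrap"): 68,
    ("Torgersen", "Adelie"): 52,
}

# Expand the counts into one (island, species) record per penguin.
records = [pair for pair, n in counts.items() for _ in range(n)]

def share(rows, predicate):
    """Proportion of rows for which predicate is true (what .mean() does)."""
    return sum(predicate(r) for r in rows) / len(rows)

def is_gentoo(row):
    return row[1] == "Gentoo"

# Unconditional: every record is in the denominator.
p_gentoo = share(records, is_gentoo)

# Conditional: filter first (the "mask"), then take the same share.
biscoe_rows = [r for r in records if r[0] == "Biscoe"]
p_gentoo_given_biscoe = share(biscoe_rows, is_gentoo)

print(len(records), len(biscoe_rows))                       # 344 168
print(round(p_gentoo, 4), round(p_gentoo_given_biscoe, 4))  # 0.3605 0.7381
```

The question asked (`is_gentoo`) is identical in both calls; only the set of rows it is asked about differs.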

The formula behind the mask

What .mean() computed has a name. The definition of conditional probability is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Here $P(A \cap B)$ is the probability that both $A$ and $B$ happen (the “intersection”), and $P(B)$ is the probability of the condition on its own, which must be greater than zero for the ratio to be defined. Dividing by $P(B)$ is the mathematical version of shrinking the world down to $B$. Let’s confirm it gives the same 0.7381:

p_both = ((penguins["species"] == "Gentoo") & (penguins["island"] == "Biscoe")).mean()
p_biscoe = (penguins["island"] == "Biscoe").mean()
print(round(p_both, 4), round(p_biscoe, 4), round(p_both / p_biscoe, 4))
0.3605 0.4884 0.7381

The intersection probability is 0.3605 (the same value as $P(\text{Gentoo})$, because every Gentoo in this dataset lives on Biscoe), the probability of Biscoe is 0.4884, and their ratio is 0.7381, identical to the masked answer. Restricting to a subgroup and applying the formula are two views of the same operation.
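One way to see why the two views must agree is to write each probability as a count divided by the dataset size. In the short derivation below, $n_{A \cap B}$, $n_B$ and $N$ are notation introduced here (not used elsewhere in the lesson) for the number of rows where both events hold, the number of rows where $B$ holds, and the total number of rows.

```latex
\begin{aligned}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)}
             = \frac{n_{A \cap B} / N}{n_B / N}
             = \frac{n_{A \cap B}}{n_B} \\[4pt]
P(\text{Gentoo} \mid \text{Biscoe}) &= \frac{124 / 344}{168 / 344}
             = \frac{124}{168} \approx 0.7381
\end{aligned}
```

The dataset size $N$ cancels, which is why counting inside the Biscoe rows gives the same number as dividing two whole-dataset probabilities.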

Conditioning can pin a probability to certainty

Conditioning does not always nudge a probability gently. Sometimes it removes all doubt. Look at Torgersen:

torgersen = penguins[penguins["island"] == "Torgersen"]
print(round((torgersen["species"] == "Adelie").mean(), 4))
1.0

$P(\text{Adelie} \mid \text{Torgersen}) = 1.0$. Every penguin on Torgersen is an Adelie, so once you know the island is Torgersen, the species is certain. Compare that with Dream, a mixed island:

dream = penguins[penguins["island"] == "Dream"]
print(round((dream["species"] == "Adelie").mean(), 4))
0.4516

$P(\text{Adelie} \mid \text{Dream}) = 0.4516$. Dream is split between Adelie and Chinstrap, so the same question gives a very different answer depending on which condition you impose. The figure below shows the full picture: how the species mix changes as you move from island to island.

[Figure: stacked bar chart of the percentage of each penguin species within each of the three islands. Biscoe is 73.8% Gentoo and 26.2% Adelie, Dream is 54.8% Chinstrap and 45.2% Adelie, and Torgersen is 100% Adelie.]
Each bar conditions on one island and shows the species shares inside it. The heights are the conditional probabilities: P(Gentoo | Biscoe) = 73.8%, P(Adelie | Dream) = 45.2%, and P(Adelie | Torgersen) = 100%. Conditioning on a different island reshuffles the whole distribution.
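The bar heights can be reproduced by dividing each row of the count table by its own total. The sketch below does that in plain Python, assuming the counts are typed in from the table shown earlier; `species_given_island` is a hypothetical helper name, not a pandas function.

```python
# Counts copied by hand from the island-by-species table.
counts = {
    "Biscoe":    {"Adelie": 44, "Chinstrap": 0,  "Gentoo": 124},
    "Dream":     {"Adelie": 56, "Chinstrap": 68, "Gentoo": 0},
    "Torgersen": {"Adelie": 52, "Chinstrap": 0,  "Gentoo": 0},
}

def species_given_island(table):
    """Divide each row by its own total to get P(species | island)."""
    result = {}
    for island, row in table.items():
        total = sum(row.values())  # size of the conditioned "world"
        result[island] = {sp: n / total for sp, n in row.items()}
    return result

conditional = species_given_island(counts)
for island, dist in conditional.items():
    shares = ", ".join(f"{sp} {p:.1%}" for sp, p in dist.items())
    print(f"{island}: {shares}")
# Biscoe: Adelie 26.2%, Chinstrap 0.0%, Gentoo 73.8%
# Dream: Adelie 45.2%, Chinstrap 54.8%, Gentoo 0.0%
# Torgersen: Adelie 100.0%, Chinstrap 0.0%, Gentoo 0.0%
```

Each row is its own probability distribution and sums to 1, because each island is treated as a separate universe.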

Why we divide by P(B)

Dividing by $P(B)$ rescales the probabilities inside $B$ so they add back up to 1. The raw share of Biscoe-and-Gentoo penguins in the whole dataset is only 0.3605, but inside the Biscoe world it has to compete only with other Biscoe penguins, so it grows to 0.7381. The denominator is what turns a slice of the whole into a probability within the subgroup.
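The rescaling can be checked with exact fractions. This is a small sketch under the assumption that the Biscoe row of the count table and the dataset size of 344 are entered by hand; it shows that the joint probabilities inside Biscoe add up only to $P(\text{Biscoe})$, and that dividing by that value makes them add up to 1.

```python
from fractions import Fraction

N = 344  # total penguins in the dataset
# Biscoe row of the island-by-species table.
biscoe_counts = {"Adelie": 44, "Chinstrap": 0, "Gentoo": 124}

# Joint probabilities P(species and Biscoe): slices of the whole dataset.
joint = {sp: Fraction(n, N) for sp, n in biscoe_counts.items()}
p_biscoe = sum(joint.values())   # the slices only add up to P(Biscoe)

# Dividing by P(Biscoe) rescales the slices so they add up to 1.
conditional = {sp: p / p_biscoe for sp, p in joint.items()}

print(float(p_biscoe))                          # about 0.4884
print(sum(conditional.values()))                # 1
print(round(float(conditional["Gentoo"]), 4))   # 0.7381
```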


P(A | B) Is Not P(B | A)

Here is the trap that conditional probability sets, and the reason it deserves a whole lesson before we reach Bayes’ theorem: the order of the condition matters. $P(A \mid B)$ and $P(B \mid A)$ are different questions, and they usually have different answers.

We already found $P(\text{Gentoo} \mid \text{Biscoe}) = 0.7381$. Now flip it: among Gentoos, what fraction live on Biscoe?

gentoo = penguins[penguins["species"] == "Gentoo"]
print(round((gentoo["island"] == "Biscoe").mean(), 4))
1.0

$P(\text{Biscoe} \mid \text{Gentoo}) = 1.0$. Every Gentoo lives on Biscoe, so given a Gentoo you can be certain of the island. But the reverse, $P(\text{Gentoo} \mid \text{Biscoe}) = 0.7381$, is far from certain, because Biscoe also houses plenty of Adelies. Two conditional probabilities that look like mirror images, 1.0 versus 0.74, describe completely different facts.

Confusing the two directions is one of the most common reasoning errors in statistics. “Every Gentoo is on Biscoe” does not mean “every Biscoe penguin is a Gentoo.” Keeping these straight is the whole motivation for Bayes’ theorem, which is the formal recipe for converting one direction into the other. You will meet it in the next lessons.
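Written as counts, the two directions share a numerator and differ only in the denominator. The sketch below makes that explicit with exact fractions; the variable names are illustrative and the counts are taken from the island-by-species table.

```python
from fractions import Fraction

# Counts copied by hand from the island-by-species table.
n_gentoo_and_biscoe = 124    # penguins that are both Gentoo and on Biscoe
n_biscoe = 44 + 0 + 124      # every Biscoe penguin, any species
n_gentoo = 124 + 0 + 0       # every Gentoo, any island

# Same numerator, different denominators.
p_gentoo_given_biscoe = Fraction(n_gentoo_and_biscoe, n_biscoe)
p_biscoe_given_gentoo = Fraction(n_gentoo_and_biscoe, n_gentoo)

print(p_gentoo_given_biscoe, round(float(p_gentoo_given_biscoe), 4))  # 31/42 0.7381
print(p_biscoe_given_gentoo)                                          # 1
```

The 44 Adelies on Biscoe enlarge the Biscoe denominator but not the Gentoo one, and that alone accounts for the gap between 0.74 and 1.0.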


Conditioning and Independence

Sometimes conditioning on $B$ changes nothing at all. When that happens, $A$ and $B$ are independent: knowing $B$ tells you nothing new about $A$. Stated precisely, whenever $P(B) > 0$, $A$ and $B$ are independent when

$$P(A \mid B) = P(A)$$

In words, the conditional probability equals the plain probability — the condition is irrelevant. Our penguin data is a study in the opposite: island and species are strongly dependent, which is exactly why conditioning moved the numbers so much.

p_gentoo = (penguins["species"] == "Gentoo").mean()
p_gentoo_given_biscoe = (penguins[penguins["island"] == "Biscoe"]["species"] == "Gentoo").mean()
p_gentoo_given_dream = (penguins[penguins["island"] == "Dream"]["species"] == "Gentoo").mean()
print(round(p_gentoo, 4), round(p_gentoo_given_biscoe, 4), round(p_gentoo_given_dream, 4))
0.3605 0.7381 0.0

$P(\text{Gentoo}) = 0.3605$, but conditioning on Biscoe pushes it up to 0.7381 and conditioning on Dream drops it to 0.0. Because $P(\text{Gentoo} \mid \text{Biscoe}) \neq P(\text{Gentoo})$, the two events are not independent: island carries real information about species. If species were independent of island, all three numbers would have been the same, and conditioning would have been a waste of effort.

That contrast is worth holding onto: conditional probability is only powerful when events are dependent. The more a condition reshapes a probability, the more that condition is worth knowing.
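The independence condition can be turned into a small check on counts. The function below is a sketch, not a statistical test: `is_independent` is a hypothetical helper, the tolerance is an arbitrary choice, and the second set of counts is invented purely to show what an independent pair of events looks like.

```python
def is_independent(n_a_and_b, n_a, n_b, n_total, tol=1e-9):
    """Return True when P(A | B) equals P(A), up to a small tolerance."""
    p_a = n_a / n_total
    p_a_given_b = n_a_and_b / n_b
    return abs(p_a_given_b - p_a) < tol

# Penguins: A = Gentoo, B = Biscoe (counts from the lesson's table).
print(is_independent(n_a_and_b=124, n_a=124, n_b=168, n_total=344))  # False

# Hypothetical, made-up counts in which B carries no information about A:
# 30 of 100 items have A overall, and 12 of the 40 items in B have A.
print(is_independent(n_a_and_b=12, n_a=30, n_b=40, n_total=100))     # True
```

With sampled data the two proportions will rarely match exactly even when the underlying events are independent, so an exact-equality check like this one is best read as an illustration of the definition.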


Practice Exercises

Exercise 1: Condition on a different island

Compute $P(\text{Chinstrap} \mid \text{Dream})$, the probability that a Dream penguin is a Chinstrap. Then compare it to the unconditional $P(\text{Chinstrap})$. Did conditioning on Dream raise or lower the probability?

Hint

Restrict first with dream = penguins[penguins["island"] == "Dream"], then take (dream["species"] == "Chinstrap").mean(). Get the unconditional version from the full dataset with (penguins["species"] == "Chinstrap").mean().

Exercise 2: Verify the formula by hand

Confirm $P(\text{Adelie} \mid \text{Dream}) = 0.4516$ using the formula $P(A \mid B) = P(A \cap B) / P(B)$ instead of a mask. Compute the intersection probability and $P(\text{Dream})$ separately, then divide.

Hint

The intersection is ((penguins["species"] == "Adelie") & (penguins["island"] == "Dream")).mean() and the condition is (penguins["island"] == "Dream").mean(). Dividing the first by the second should give 0.4516.

Exercise 3: Flip a conditional probability

You know $P(\text{Biscoe} \mid \text{Gentoo}) = 1.0$. Now compute the reverse, $P(\text{Gentoo} \mid \text{Biscoe})$, and explain in one sentence why the two are so different even though they involve the same two events.

Hint

For $P(\text{Gentoo} \mid \text{Biscoe})$, restrict to Biscoe and take the share of Gentoos. The asymmetry comes from the denominators: there are more Biscoe penguins than Gentoo penguins, because Biscoe also holds Adelies.


Summary

You met conditional probability, the probability of an event $A$ once you know that another event $B$ is true, written $P(A \mid B)$ and read “A given B.” Computing it means restricting your world to the rows where $B$ holds and asking how often $A$ happens there, which the formula $P(A \mid B) = P(A \cap B) / P(B)$ captures exactly. You saw conditioning move $P(\text{Gentoo})$ from 0.36 up to 0.74 given Biscoe and pin $P(\text{Adelie} \mid \text{Torgersen})$ to 1.0, you learned that $P(A \mid B)$ is generally not $P(B \mid A)$, and you connected conditioning to independence, which holds exactly when the condition changes nothing.

Key Concepts

  • Conditional probability: $P(A \mid B)$, the probability of $A$ given that $B$ is true.
  • The formula: $P(A \mid B) = P(A \cap B) / P(B)$; dividing by $P(B)$ restricts the world to $B$.
  • Conditioning from data: keep only the rows where $B$ holds, then take the share where $A$ holds.
  • Asymmetry: $P(A \mid B) \neq P(B \mid A)$ in general; the direction of the bar matters.
  • Independence: $A$ and $B$ are independent exactly when $P(A \mid B) = P(A)$, so the condition adds no information.

Why This Matters

Every model that updates a belief from evidence (spam filters, medical tests, recommendation engines, fraud detection) runs on conditional probability. Knowing how to condition correctly, and never confusing $P(A \mid B)$ with $P(B \mid A)$, is what separates a trustworthy inference from a confident mistake. It is also the exact foundation Bayes’ theorem is built on, which is where this module heads next.


Next Steps

Continue to Lesson 2 - Conditional Probability Intermediate

Go deeper with the multiplication rule, chained conditions, and building the bridge to Bayes' theorem.


Continue Building Your Skills

You now know how to update a probability the moment you learn a new fact — the quiet engine behind every system that reasons under uncertainty. Next you will chain conditions together and watch the path toward Bayes’ theorem take shape, turning “given B” from a single step into a full method for learning from evidence.