Lesson 4 - Chi-Squared Test of Independence

Welcome to the Chi-Squared Test of Independence

So far you have tested claims about numbers — means, proportions, differences. But a great deal of real data is categorical: a penguin’s species, a car’s country of origin, a customer’s plan tier. When you have two categorical variables, the natural question is whether they are related. Does species depend on which island a penguin lives on? Does engine size depend on where a car was built?

The chi-squared test of independence answers exactly that question. In this lesson you will count two categorical variables against each other, work out what their table would look like if they were unrelated, and measure how far reality strays from that expectation.

By the end of this lesson, you will be able to:

  • Build a contingency table of two categorical variables with pd.crosstab
  • Compute the expected counts that independence would predict
  • Calculate the chi-squared statistic and its degrees of freedom by hand
  • Run the test with scipy.stats.chi2_contingency and interpret the result honestly

You only need pandas and scipy. Let’s begin.


Two Categorical Variables

Every test you have met so far compared numbers. The chi-squared test of independence compares categories. It starts from a question of the form: are these two labels related, or do they vary independently of each other?

Load the penguins dataset and look at the two categorical columns we will study — species and island:

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins[["species", "island"]].head())
  species     island
0  Adelie  Torgersen
1  Adelie  Torgersen
2  Adelie  Torgersen
3  Adelie  Torgersen
4  Adelie  Torgersen

We want to know whether species and island are independent. If they were, then knowing a penguin’s island would tell you nothing about its likely species — the species mix on every island would look the same. Let’s count and see.


The Contingency Table

A contingency table (also called a cross-tabulation) counts how often each combination of two categories occurs. pandas builds one in a single call:

observed = pd.crosstab(penguins["species"], penguins["island"])
print(observed)
island     Biscoe  Dream  Torgersen
species
Adelie         44     56         52
Chinstrap       0     68          0
Gentoo        124      0          0

Each cell is an observed count — the number of penguins with that species and that island. Already the table tells a vivid story: Gentoo penguins appear only on Biscoe, and Chinstrap only on Dream. That is the opposite of independence; species is clearly tied to island. The heatmap below makes the pattern impossible to miss.

Figure: Heatmap of the species by island contingency table, with darker blue cells for larger counts. Whole rows collapse onto single islands (Gentoo to Biscoe, Chinstrap to Dream) while Adelie is spread across all three, which is exactly what dependence looks like.
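One way to check the "same species mix on every island" idea directly is to turn each island's column into proportions. The sketch below rebuilds the table from the counts printed above so that it runs without downloading the CSV; the name species_mix is only illustrative.

```python
import pandas as pd

# Observed counts copied from the contingency table above
observed = pd.DataFrame(
    [[44, 56, 52], [0, 68, 0], [124, 0, 0]],
    index=pd.Index(["Adelie", "Chinstrap", "Gentoo"], name="species"),
    columns=pd.Index(["Biscoe", "Dream", "Torgersen"], name="island"),
)

# Divide each column by its total: the species mix within each island
species_mix = observed / observed.sum(axis=0)
print(species_mix.round(2))
```

Under independence the three columns would be nearly identical. With the loaded data, pd.crosstab(penguins["species"], penguins["island"], normalize="columns") should give the same proportions.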

Margins: the row and column totals

The totals along the edges of the table — called marginal totals — are what the test uses to build its expectation. Add them with margins=True:

print(pd.crosstab(penguins["species"], penguins["island"], margins=True))
island     Biscoe  Dream  Torgersen  All
species
Adelie         44     56         52  152
Chinstrap       0     68          0   68
Gentoo        124      0          0  124
All           168    124         52  344

There are $N = 344$ penguins in total. The row totals (152, 68, 124) give the overall species mix; the column totals (168, 124, 52) give the overall island mix. Hold on to these: independence is defined entirely by them.
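If you already hold a table without margins, the totals can also be recovered by summing along each axis. This is a small sketch using the counts shown above; the variable names are illustrative, not part of the lesson's code.

```python
import pandas as pd

# Observed counts copied from the contingency table above
observed = pd.DataFrame(
    [[44, 56, 52], [0, 68, 0], [124, 0, 0]],
    index=pd.Index(["Adelie", "Chinstrap", "Gentoo"], name="species"),
    columns=pd.Index(["Biscoe", "Dream", "Torgersen"], name="island"),
)

row_totals = observed.sum(axis=1)      # one total per species
col_totals = observed.sum(axis=0)      # one total per island
n = int(observed.to_numpy().sum())     # grand total

# Marginal proportions: the overall species mix and island mix
print((row_totals / n).round(3))
print((col_totals / n).round(3))
```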


Expected Counts Under Independence

The test asks a precise hypothetical question: if species and island were independent, how many penguins would we expect in each cell?

Under independence, the probability of landing in a cell is just the row probability times the column probability, $(R_i / N)(C_j / N)$. Multiply that by $N$ and the row and column totals do all the work. For the cell in row $i$ and column $j$:

$$E_{ij} = \frac{R_i \, C_j}{N}$$

where $R_i$ is that row’s total, $C_j$ is that column’s total, and $N$ is the grand total.

Working one cell by hand

Take Adelie on Biscoe. The Adelie row total is $R_i = 152$, the Biscoe column total is $C_j = 168$, and $N = 344$:

$$E_{\text{Adelie, Biscoe}} = \frac{152 \times 168}{344} \approx 74.23$$

If species and island were independent, we would expect about 74 Adelie penguins on Biscoe. We actually observed only 44 — a big gap. The whole test is just adding up gaps like this one across every cell.

scipy computes the full table of expected counts for you:

from scipy.stats import chi2_contingency

chi2, p_value, dof, expected = chi2_contingency(observed)
print(pd.DataFrame(expected, index=observed.index,
                   columns=observed.columns).round(2))
island     Biscoe  Dream  Torgersen
species
Adelie      74.23  54.79      22.98
Chinstrap   33.21  24.51      10.28
Gentoo      60.56  44.70      18.74

Notice every row of expected counts has the same island proportions — that is what “independent” means. Compare this to the observed table full of zeros, and you can feel how far reality is from the independence story.
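The expected table can also be rebuilt from the margins alone, which makes the formula $E_{ij} = R_i C_j / N$ concrete. The sketch below uses an outer product of the row and column totals; expected_by_hand and row_props are illustrative names, and the counts are copied from the observed table above.

```python
import numpy as np
import pandas as pd

# Observed counts copied from the contingency table above
observed = pd.DataFrame(
    [[44, 56, 52], [0, 68, 0], [124, 0, 0]],
    index=pd.Index(["Adelie", "Chinstrap", "Gentoo"], name="species"),
    columns=pd.Index(["Biscoe", "Dream", "Torgersen"], name="island"),
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# E_ij = R_i * C_j / N for every cell at once
expected_by_hand = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=observed.index,
    columns=observed.columns,
)
print(expected_by_hand.round(2))

# Each expected row, rescaled to proportions, has the same island mix
row_props = expected_by_hand.div(expected_by_hand.sum(axis=1), axis=0)
print(row_props.round(3))
```

The first printout should agree with the table returned by chi2_contingency; the second shows three identical rows, which is the independence assumption written as numbers.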


The Chi-Squared Statistic and Degrees of Freedom

To turn those gaps into a single number, the chi-squared statistic sums the squared difference between observed and expected in each cell, scaled by the expected count:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Squaring makes every gap positive, and dividing by $E$ puts each cell on a fair footing: a gap of 30 matters far more where you expected 10 than where you expected 1000. A $\chi^2$ near 0 means observed and expected agree (consistent with independence); a large $\chi^2$ means they diverge.

One cell’s contribution

For Adelie on Biscoe, $O = 44$ and $E = 74.23$:

$$\frac{(44 - 74.23)^2}{74.23} = \frac{(-30.23)^2}{74.23} \approx 12.31$$

That single cell already contributes 12.31. Summing this quantity over all nine cells gives the full statistic:

# (O - E)^2 / E for every cell; `expected` is the array returned by chi2_contingency above
contributions = (observed.values - expected) ** 2 / expected
print(round(contributions.sum(), 2))
299.55
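It can help to see which cells drive that total. The following sketch labels each cell's contribution; it recomputes the expected counts from the margins instead of reusing scipy's array, purely so the block runs on its own, and the names o, e, and total are illustrative.

```python
import numpy as np
import pandas as pd

# Observed counts copied from the contingency table above
observed = pd.DataFrame(
    [[44, 56, 52], [0, 68, 0], [124, 0, 0]],
    index=pd.Index(["Adelie", "Chinstrap", "Gentoo"], name="species"),
    columns=pd.Index(["Biscoe", "Dream", "Torgersen"], name="island"),
)

o = observed.to_numpy()
e = np.outer(o.sum(axis=1), o.sum(axis=0)) / o.sum()  # expected counts

# One (O - E)^2 / E value per cell, with species and island labels
contributions = pd.DataFrame(
    (o - e) ** 2 / e, index=observed.index, columns=observed.columns
)
print(contributions.round(2))

total = contributions.to_numpy().sum()
print(round(total, 2))
```

Cells where the observed count is 0 contribute exactly their expected count, and the Adelie on Dream cell, where observed and expected nearly agree, contributes almost nothing.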

Degrees of freedom

The degrees of freedom count how many cells are free to vary once the margins are fixed. For a table with $r$ rows and $c$ columns:

$$\text{dof} = (r - 1)(c - 1)$$

With 3 species and 3 islands that is $(3-1)(3-1) = 4$. Degrees of freedom set the scale of the chi-squared distribution we compare against, so a $\chi^2$ of 10 means something very different on 1 degree of freedom than on 40.
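To make the "same statistic, different meaning" point concrete, the sketch below compares the upper-tail probability of a statistic of 10 at 1 and at 40 degrees of freedom using scipy.stats.chi2.sf. The helper dof_for_table is a hypothetical name introduced here for illustration.

```python
from scipy.stats import chi2


def dof_for_table(n_rows, n_cols):
    """Degrees of freedom for an r x c contingency table: (r - 1)(c - 1)."""
    return (n_rows - 1) * (n_cols - 1)


print(dof_for_table(3, 3))  # the 3 x 3 species-by-island table

# Probability of a chi-squared value of at least 10 under each distribution
p_dof1 = chi2.sf(10, df=1)
p_dof40 = chi2.sf(10, df=40)
print(f"dof = 1:  {p_dof1:.4f}")
print(f"dof = 40: {p_dof40:.4f}")
```

On 1 degree of freedom a value of 10 is rare; on 40 it is well below what chance alone typically produces.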

Reading the p-value

chi2_contingency returns the statistic, the p-value, the degrees of freedom, and the expected table all at once:

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.1f}")
print(f"dof         = {dof}")
print(f"p-value     = {p_value:.2e}")
chi-squared = 299.6
dof         = 4
p-value     = 1.35e-63

The p-value is the probability of seeing a $\chi^2$ at least this large if species and island really were independent. Here it is $1.35 \times 10^{-63}$, which is astronomically small. We reject the independence hypothesis: the data give overwhelming evidence that species and island are dependent. That matches what the raw table screamed at us, with Gentoo confined to Biscoe and Chinstrap to Dream.
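The link between the statistic and the p-value can be checked by hand: the p-value is the upper-tail area of a chi-squared distribution with the table's degrees of freedom. This is a sketch that assumes the observed counts shown earlier; p_from_tail is an illustrative name.

```python
import math

from scipy.stats import chi2, chi2_contingency

# Observed species-by-island counts from the contingency table above
observed = [[44, 56, 52], [0, 68, 0], [124, 0, 0]]
stat, p_value, dof, _ = chi2_contingency(observed)

# P(chi-squared with `dof` degrees of freedom >= observed statistic)
p_from_tail = chi2.sf(stat, df=dof)
print(f"{p_from_tail:.2e}")
print(math.isclose(p_from_tail, p_value, rel_tol=1e-9))
```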


A Second Example: Cars

The method works on any pair of categorical columns. Let’s ask whether a car’s engine size (number of cylinders) depends on its country of origin:

cars = pd.read_csv("https://datatweets.com/datasets/cars.csv")
origin_cyl = pd.crosstab(cars["origin"], cars["cylinders"])
print(origin_cyl)
cylinders  3   4  5   6    8
origin
europe     0  63  3   4    0
japan      4  69  0   6    0
usa        0  72  0  74  103

The pattern is just as stark as before: every 8-cylinder car in the dataset is from the USA, while European and Japanese cars cluster at 4 cylinders. Run the test:

chi2, p_value, dof, expected = chi2_contingency(origin_cyl)
print(f"chi-squared = {chi2:.1f}")
print(f"dof         = {dof}")
print(f"p-value     = {p_value:.2e}")
chi-squared = 180.1
dof         = 8
p-value     = 9.80e-35

Here the table is $3 \times 5$, so $\text{dof} = (3-1)(5-1) = 8$. The p-value of $9.80 \times 10^{-35}$ is again far too small to attribute to chance: cylinder count depends on origin. US cars skew heavily toward large 8-cylinder engines, and no 8-cylinder car in this dataset comes from Europe or Japan.


Cautions: What the Test Does Not Tell You

The chi-squared test is easy to run and easy to over-read. Keep three warnings in mind.

Significance is not strength

A tiny p-value says the data are very hard to square with independence. It says nothing about how strong the relationship is. A barely-there association and an iron-clad one can both produce $p < 0.001$. To measure strength you need an effect-size measure such as Cramér’s V, not the p-value.
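As a rough illustration, the sketch below compares two made-up 2×2 tables (hypothetical counts, not from any dataset in this lesson). Both clear $p < 0.001$, yet their Cramér’s V values are far apart. It uses the formula $V = \sqrt{\chi^2 / (N (k-1))}$ with $k$ the smaller table dimension; cramers_v is an illustrative helper, and correction=False turns off the continuity correction that scipy otherwise applies to 2×2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Return (Cramér's V, p-value) for a contingency table of counts."""
    table = np.asarray(table)
    stat, p, _, _ = chi2_contingency(table, correction=False)
    k = min(table.shape)  # smaller of the number of rows and columns
    v = float(np.sqrt(stat / (table.sum() * (k - 1))))
    return v, p


# Hypothetical counts: a weak pattern in a big sample, a strong one in a small sample
weak_big = [[5200, 4800], [4800, 5200]]
strong_small = [[90, 10], [10, 90]]

v_weak, p_weak = cramers_v(weak_big)
v_strong, p_strong = cramers_v(strong_small)
print(f"weak, N = 20000: V = {v_weak:.2f}, p = {p_weak:.1e}")
print(f"strong, N = 200: V = {v_strong:.2f}, p = {p_strong:.1e}")
```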

Large samples make trivial effects “significant”

Look back at the $\chi^2$ formula: for a fixed pattern of proportions, the statistic grows in direct proportion to the sample size. Feed the test enough rows and even a microscopic, practically meaningless deviation from independence will cross the significance threshold. With big data, statistically significant and important drift apart, so always ask how large the effect actually is.
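A quick experiment shows the effect. The counts below are hypothetical: a 51% versus 49% split, first with 2,000 rows and then with the same proportions and 100 times as many rows. The continuity correction is switched off (correction=False) so that the statistic scales cleanly.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts with a very small departure from independence
small = np.array([[510, 490], [490, 510]])
large = small * 100  # identical proportions, 100 times the sample size

stat_small, p_small, _, _ = chi2_contingency(small, correction=False)
stat_large, p_large, _, _ = chi2_contingency(large, correction=False)

print(f"N = {small.sum():>7}: chi-squared = {stat_small:.1f}, p = {p_small:.2e}")
print(f"N = {large.sum():>7}: chi-squared = {stat_large:.1f}, p = {p_large:.2e}")
```

The pattern in the data has not changed at all, but the statistic is 100 times larger and the p-value moves from unremarkable to tiny.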

A relationship is not causation

The test detects that two variables move together, not why. Origin and cylinders are related, but origin does not magically stamp cylinders into an engine — design choices, regulations, and markets drive both. Chi-squared finds association; explaining it is your job, not the test’s.

One assumption to check

The chi-squared approximation is only reliable when expected counts are reasonably large — a common rule of thumb is that every expected cell should be at least 5. When several expected counts fall below that, the p-value can be misleading; use Fisher’s exact test instead. Always inspect the expected table that chi2_contingency returns.
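Checking this takes only a few lines. The sketch below applies the rule of thumb to the cars table from the second example, with the counts copied from the printed output; n_small is an illustrative name.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed origin-by-cylinders counts copied from the cars example above
origin_cyl = pd.DataFrame(
    [[0, 63, 3, 4, 0], [4, 69, 0, 6, 0], [0, 72, 0, 74, 103]],
    index=pd.Index(["europe", "japan", "usa"], name="origin"),
    columns=pd.Index([3, 4, 5, 6, 8], name="cylinders"),
)

_, _, _, expected = chi2_contingency(origin_cyl)
expected = pd.DataFrame(expected, index=origin_cyl.index, columns=origin_cyl.columns)
print(expected.round(2))

# Count the cells that fall below the rule-of-thumb threshold of 5
n_small = int((expected < 5).to_numpy().sum())
print(f"{n_small} of {expected.size} expected counts are below 5")
```

All the small expected counts sit in the rare 3-cylinder and 5-cylinder columns, so the cars p-value deserves more caution than its size suggests, even though the pattern in the well-populated columns is unmistakable.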


Practice Exercises

Exercise 1: Sex and island

Build a contingency table of penguin sex against island, then run chi2_contingency. Is there evidence that sex depends on island? Does the result make biological sense?

Hint

Drop missing values first with clean = penguins.dropna(subset=["sex"]), then pd.crosstab(clean["sex"], clean["island"]). A large p-value means you cannot reject independence — which is what you would expect, since males and females share every island.

Exercise 2: Expected count by hand

Using the cars origin by cylinders table, compute the expected count for 4-cylinder USA cars by hand with $E_{ij} = R_i C_j / N$, then confirm it against the expected array from chi2_contingency.

Hint

The USA row total and the 4-cylinder column total come from pd.crosstab(..., margins=True); $N$ is the grand total. Compare your number to the matching entry of the returned expected table.

Exercise 3: Measure the strength

For the species-by-island table, compute Cramér’s V to quantify how strong the association is, using $V = \sqrt{\chi^2 / (N \cdot (k-1))}$, where $k$ is the smaller of the number of rows and columns. How does a strength measure differ from the p-value?

Hint

Here $N = 344$, $\chi^2 = 299.55$, and $k = 3$, so $V = \sqrt{299.55 / (344 \times 2)}$. A V near 1 means a very strong association; the p-value only told you the association was real, not that it was strong.


Summary

You learned to test whether two categorical variables are related. You built a contingency table with pd.crosstab, computed the expected counts independence would predict from the row and column totals, and summed the scaled, squared gaps into the chi-squared statistic. With degrees of freedom $(r-1)(c-1)$, scipy.stats.chi2_contingency turned that statistic into a p-value. Penguin species proved strongly dependent on island, and car engine size strongly dependent on origin. Most importantly, you learned what the test does not say: significance is not strength, big samples inflate significance, and association is never proof of cause.

Key Concepts

  • Contingency table: a cross-tabulation counting every combination of two categorical variables.
  • Expected count: the cell count predicted under independence, $E_{ij} = R_i C_j / N$.
  • Chi-squared statistic: $\chi^2 = \sum (O-E)^2 / E$, measuring total divergence from independence.
  • Degrees of freedom: $(r-1)(c-1)$, the number of freely varying cells given fixed margins.
  • Test of independence: rejects independence when $\chi^2$ is large and the p-value small.

Why This Matters

Categorical relationships are everywhere in real data work — which marketing channel drives which plan, which region buys which product, which treatment pairs with which outcome. The chi-squared test of independence is the standard first tool for asking “are these two labels connected?”, and knowing its cautions keeps you from mistaking a tiny p-value for a big, causal, or important finding.


Next Steps

Continue to Lesson 5 - Guided Project: Japanese vs American Cars

Put hypothesis and chi-squared tests to work in a full, open-ended analysis comparing two groups of cars.


Continue Building Your Skills

You can now take two columns of labels and ask, rigorously, whether they are connected. In the next lesson you will stop running tests one at a time and pull everything together — sampling, hypothesis tests, and chi-squared — into a single guided investigation of how Japanese and American cars really differ.