Lesson 3 - Chi-Squared Goodness of Fit

Welcome to Chi-Squared Goodness of Fit

So far you have tested claims about numbers — a mean, a difference between groups. But a lot of data is not numeric at all. It is categorical: a species, a color, a survey answer, the face of a die. When you want to ask “does this category breakdown match what I expected?”, the tool you reach for is the chi-squared goodness-of-fit test.

In this lesson you will use it to answer a concrete question about real data: are the three penguin species in our dataset equally common, or does one dominate? You will work the test out by hand, watch every piece of arithmetic, then confirm it in one line with scipy.

By the end of this lesson, you will be able to:

  • Recognize when a question calls for a goodness-of-fit test on a categorical variable
  • Turn expected proportions into expected counts and compare them to observed counts
  • Compute the chi-squared statistic and explain what it measures
  • Read the result against degrees of freedom and a significance level, and confirm it with scipy.stats.chisquare

You only need a little Python, pandas, and scipy. Let’s begin.


When to Use a Goodness-of-Fit Test

A goodness-of-fit test asks a single question: does the distribution of a categorical variable match a set of expected proportions? You have a variable that sorts each observation into one of several categories, a count for each category, and a theory about how those counts should be split. The test measures how badly reality and theory disagree.

Some questions it answers:

  • Are the six faces of a die equally likely, or is it loaded?
  • Do customer arrivals split evenly across the seven days of the week?
  • Does a sample’s ethnicity breakdown match the known population proportions?

Our question for this lesson has the same shape. Load the penguins and count the species:

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins["species"].value_counts())
species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

There are 344 penguins, split very unevenly: 152 Adelie, 124 Gentoo, only 68 Chinstrap. The question we will test is “are the three species equally likely?” — that is, is each species meant to be one-third of the population, with the gaps we see just sampling noise? Or is the imbalance real?

Stating the hypotheses

Every test starts with a null hypothesis $H_0$, the dull “nothing special” claim we try to disprove, and an alternative $H_1$.

  • $H_0$: the three species are equally likely; each occurs with probability $1/3$.
  • $H_1$: the species are not equally likely; at least one probability differs from $1/3$.

The goodness-of-fit test will tell us whether the data give us enough evidence to reject $H_0$.


Observed vs Expected Counts

The test compares two sets of numbers for each category: what you saw and what you would expect if $H_0$ were true.

The observed counts $O_i$ are simply the value counts you already have: 152, 124, 68. The expected counts $E_i$ are what each category should contain under the null. If all three species are equally likely, each one should hold one-third of the 344 birds:

$$E_i = N \times p_i = 344 \times \frac{1}{3} \approx 114.67$$

So under $H_0$, we would expect about 114.67 penguins of each species. (Expected counts do not have to be whole numbers: they are a theoretical average, not an actual headcount.)

import numpy as np

observed = np.array([152, 124, 68])      # Adelie, Gentoo, Chinstrap
N = observed.sum()                        # 344
expected = np.array([N/3, N/3, N/3])      # 114.67 each
print("Observed:", observed)
print("Expected:", expected.round(2))
Observed: [152 124  68]
Expected: [114.67 114.67 114.67]
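The same formula, $E_i = N \times p_i$, also covers nulls that are not uniform. Here is a minimal sketch, assuming a made-up 50/30/20 split that is purely illustrative and does not come from the dataset:

```python
import numpy as np

observed = np.array([152, 124, 68])          # Adelie, Gentoo, Chinstrap
N = observed.sum()                           # 344

# Hypothetical null proportions, chosen only for illustration; they must sum to 1
proportions = np.array([0.5, 0.3, 0.2])
assert np.isclose(proportions.sum(), 1.0)

expected = N * proportions                   # E_i = N * p_i
print("Expected:", expected.round(2))        # roughly 172, 103.2 and 68.8
print("Total expected:", round(float(expected.sum()), 2))
```

Whatever proportions you choose, the expected counts add up to the same total as the observed counts, which is a quick sanity check before running the test.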

It helps to see those two bars side by side for each species. The larger the gap between the blue (observed) and grey (expected) bar, the more that species defies the equal-split theory.

[Figure: grouped bar chart comparing observed penguin species counts (152 Adelie, 124 Gentoo, 68 Chinstrap) against the expected count of about 115 per species under a uniform model (grey). Adelie runs well above the expected count and Chinstrap well below; the goodness-of-fit test asks whether those gaps are too big to be chance.]

Adelie overshoots the expected count by about 37, Chinstrap undershoots by about 47, and Gentoo sits only about 9 above it. The eye says these gaps look large, but “looks large” is not a decision rule. We need to convert the gaps into a single number we can judge against chance.


The Chi-Squared Statistic

The chi-squared statistic $\chi^2$ rolls every observed-vs-expected gap into one number:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Read it one piece at a time, because every part is there for a reason:

  • $(O_i - E_i)$ is the raw gap for a category: how far the count fell from expectation.
  • We square it so that overshoots and undershoots both count as positive distance and cannot cancel out, and so that big gaps are punished far more than small ones.
  • We divide by $E_i$ to scale the gap. A gap of 20 is huge when you only expected 30, but trivial when you expected 5000. Dividing by the expected count puts every category on the same footing.
  • We sum across all categories to get one total deviation.

When the data match the null perfectly, every $O_i = E_i$, each term is zero, and $\chi^2 = 0$. The more reality drifts from theory, the larger $\chi^2$ grows.
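To make the scaling step concrete, here is a small sketch of a single category's contribution. The helper name `term` and the counts are illustrative only; they reproduce the "gap of 20" comparison from the list above.

```python
def term(observed_count, expected_count):
    """One category's contribution to chi-squared: (O - E)^2 / E."""
    return (observed_count - expected_count) ** 2 / expected_count

small_base = term(50, 30)        # gap of 20 when only 30 were expected
large_base = term(5020, 5000)    # same gap of 20 when 5000 were expected

print(round(small_base, 2))      # 13.33
print(round(large_base, 2))      # 0.08
```

The raw gap is identical in both cases, yet the first contributes well over a hundred times more to the statistic, because it is large relative to what was expected.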

Computing it by hand

Let’s plug the penguin numbers straight into the formula, one species at a time:

$$\chi^2 = \frac{(152 - 114.67)^2}{114.67} + \frac{(124 - 114.67)^2}{114.67} + \frac{(68 - 114.67)^2}{114.67}$$
chi2 = (((observed - expected) ** 2) / expected).sum()
print("Per-category terms:", (((observed - expected) ** 2) / expected).round(3))
print("Chi-squared:", round(chi2, 3))
Per-category terms: [12.155  0.76  18.992]
Chi-squared: 31.907

The total is $\chi^2 \approx 31.9$. Notice where it comes from: Chinstrap alone contributes 18.99, because its 47-bird shortfall, squared and scaled, dominates the statistic. Gentoo, sitting close to expectation, adds only 0.76. The statistic automatically weights the categories that disagree most.

But is 31.9 big? On its own the number means nothing until we know what scale to read it on — and that scale is set by the degrees of freedom.


Degrees of Freedom

The degrees of freedom (df) tell us how many category counts are free to vary once the total is fixed. For a goodness-of-fit test with $k$ categories:

$$\text{df} = k - 1$$

Why $k - 1$ and not $k$? Because the counts must sum to $N$. With 344 penguins, once you know any two species’ counts, the third is forced: it is whatever is left over. Only $k - 1 = 2$ of the three counts can move freely, so we have 2 degrees of freedom.

k = len(observed)     # 3 categories
df = k - 1            # 2
print("Degrees of freedom:", df)
Degrees of freedom: 2

The df picks out which chi-squared distribution our statistic should be compared against. That distribution describes how large $\chi^2$ tends to get purely by chance when $H_0$ is true. A statistic that lands far out in its tail is evidence against the null.
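One way to see what "purely by chance" means is to simulate it. The sketch below is an illustration, not part of the test itself: it draws many samples of 344 birds from a world where the null really is true, computes the statistic for each, and compares the simulated 95th percentile with the theoretical chi-squared distribution on 2 df. The seed and the number of simulations are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)              # arbitrary seed
N, k = 344, 3
expected = np.full(k, N / k)

# 20,000 simulated samples of 344 birds, each drawn with equal species probabilities
sim_counts = rng.multinomial(N, [1/3, 1/3, 1/3], size=20_000)
sim_chi2 = ((sim_counts - expected) ** 2 / expected).sum(axis=1)

empirical_cutoff = np.quantile(sim_chi2, 0.95)
theoretical_cutoff = stats.chi2.ppf(0.95, df=k - 1)
share_as_extreme = (sim_chi2 >= 31.907).mean()

print("Simulated 95th percentile:  ", round(float(empirical_cutoff), 2))
print("Theoretical 95th percentile:", round(float(theoretical_cutoff), 2))
print("Share of simulations at or above 31.907:", share_as_extreme)
```

The two percentiles should land close together, and essentially none of the simulated statistics should come near the value observed for the penguins.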

A different df for a different test

The $k - 1$ rule is specific to the goodness-of-fit test. The chi-squared test of independence (next lesson) uses a different formula, $(\text{rows} - 1)(\text{columns} - 1)$. Always match the df rule to the test you are running.


Reading the Result

Now we turn $\chi^2 = 31.9$ on 2 df into a decision. The p-value is the probability of seeing a statistic this large or larger if the null were true. Compute it from the chi-squared distribution’s upper tail:

from scipy import stats

p_value = stats.chi2.sf(chi2, df)   # sf = survival function = upper tail
print("p-value:", p_value)
p-value: 1.1789300370444807e-07

The p-value is $1.18 \times 10^{-7}$, about one in eight million. If the three species really were equally likely, a gap as extreme as the one we saw would almost never happen by chance.
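For the special case of 2 degrees of freedom, the upper tail has a simple closed form, $P(\chi^2 \ge x) = e^{-x/2}$, so the p-value can be checked without scipy. This is a side check under that assumption only; the shortcut does not hold for other df. The statistic below is the rounded value 31.907, so the last digits differ slightly from the full-precision result.

```python
import math
from scipy import stats

chi2_stat = 31.907                            # rounded statistic from the hand calculation

p_closed_form = math.exp(-chi2_stat / 2)      # valid for df = 2 only
p_scipy = stats.chi2.sf(chi2_stat, 2)         # general upper-tail probability

print("Closed form:", p_closed_form)
print("scipy:      ", p_scipy)
```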

Comparing against alpha

We decide by comparing the p-value to a significance level $\alpha$, the risk of a false alarm we are willing to accept. The usual choice is $\alpha = 0.05$.

alpha = 0.05
print("Reject H0?", p_value < alpha)
Reject H0? True

Since $1.18 \times 10^{-7}$ is far below 0.05, we reject $H_0$. The penguin species are not equally common; the imbalance we saw is real, not sampling noise.

There is an equivalent way to decide, by comparing the statistic to a critical value, the $\chi^2$ cutoff at which the upper 5% of the distribution begins:

critical = stats.chi2.ppf(1 - alpha, df)
print("Critical value:", round(critical, 3))
print("Reject H0?", chi2 > critical)
Critical value: 5.991
Reject H0? True

Our statistic of 31.9 sails past the 5.99 cutoff, so we reject — the same verdict. The two routes always agree: a small p-value and a statistic past the critical value are two views of the same fact.
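To see that agreement for more than one value, the sketch below runs both decision rules over a handful of statistics on 2 df. The test values are arbitrary, picked to sit on either side of the cutoff.

```python
from scipy import stats

alpha, df = 0.05, 2
critical = stats.chi2.ppf(1 - alpha, df)      # about 5.991

verdicts = []
for stat in [0.5, 3.0, 5.9, 6.1, 31.907]:     # arbitrary illustrative statistics
    by_p_value = bool(stats.chi2.sf(stat, df) < alpha)
    by_critical = bool(stat > critical)
    verdicts.append((stat, by_p_value, by_critical))
    print(stat, "reject by p-value:", by_p_value, "| reject by critical value:", by_critical)
```

Every row shows the same verdict in both columns: 5.9 falls just short of the cutoff and is not rejected, while 6.1 is just past it and is.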

Confirming with scipy

You will rarely do this by hand in practice. scipy.stats.chisquare does the whole calculation — statistic and p-value — in one call. By default it assumes equal expected proportions, exactly our null:

chi2_sp, p_sp = stats.chisquare([152, 124, 68])
print("scipy chi-squared:", round(chi2_sp, 3))
print("scipy p-value:", p_sp)
scipy chi-squared: 31.907
scipy p-value: 1.1789300370444807e-07

Identical to our hand calculation: $\chi^2 = 31.907$ and $p = 1.18 \times 10^{-7}$. Working it out by hand once teaches you what the test means; from here on, the one-liner is all you need.
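If you prefer to make the null explicit rather than rely on the default, you can pass the expected counts through the `f_exp` argument. A short sketch, reusing the penguin counts, suggests the two calls give the same answer; the result object also exposes its values by name as `statistic` and `pvalue`.

```python
import numpy as np
from scipy import stats

observed = np.array([152, 124, 68])
expected = np.full(3, observed.sum() / 3)     # equal split, written out explicitly

default_result = stats.chisquare(observed)
explicit_result = stats.chisquare(observed, f_exp=expected)

print("Default: ", round(float(default_result.statistic), 3), default_result.pvalue)
print("Explicit:", round(float(explicit_result.statistic), 3), explicit_result.pvalue)
```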


A Second Example: Is the Die Fair?

The penguin test used the default equal-proportions null. Let’s run the test again on a fresh problem to cement it — and to see what failing to reject looks like. We will roll a simulated die many times and ask whether all six faces are equally likely.

A fair die

First, roll a genuinely fair die 600 times. With six faces, each should appear about $600 / 6 = 100$ times:

rng = np.random.default_rng(0)
fair_rolls = rng.integers(1, 7, size=600)
fair_counts = np.bincount(fair_rolls, minlength=7)[1:]   # counts for faces 1-6
print("Fair-die counts:", fair_counts)

chi2_fair, p_fair = stats.chisquare(fair_counts)
print("Chi-squared:", round(chi2_fair, 2), " p-value:", round(p_fair, 3))
Fair-die counts: [100  88  90  99 109 114]
Chi-squared: 5.22  p-value: 0.39

The faces came out near 100 each, but not exactly: 88 here, 114 there. That scatter is normal sampling variation. The statistic is only 5.22, with a p-value of 0.39, well above 0.05. With $k = 6$ faces the df is $6 - 1 = 5$, and the critical value is 11.07; our 5.22 is nowhere near it. We fail to reject $H_0$: the data are consistent with a fair die. (Failing to reject does not prove the die is fair; it just means we did not find enough evidence against it.)
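The critical value quoted above can be reproduced with the same `ppf` call used for the penguins, now with 5 degrees of freedom. The counts in this sketch are copied from the fair-die output rather than re-simulated.

```python
from scipy import stats

fair_counts = [100, 88, 90, 99, 109, 114]     # copied from the output above
expected_per_face = sum(fair_counts) / 6      # 100.0

chi2_fair = sum((c - expected_per_face) ** 2 / expected_per_face for c in fair_counts)
critical_5df = stats.chi2.ppf(0.95, df=5)     # about 11.07

print("Chi-squared:", round(chi2_fair, 2))
print("Critical value (df=5):", round(float(critical_5df), 2))
print("Reject H0?", bool(chi2_fair > critical_5df))
```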

A loaded die

Now rig the die so a six lands half the time, then run the same test:

rng = np.random.default_rng(0)
weights = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]          # face 6 is loaded
loaded_rolls = rng.choice([1, 2, 3, 4, 5, 6], size=600, p=weights)
loaded_counts = np.bincount(loaded_rolls, minlength=7)[1:]
print("Loaded-die counts:", loaded_counts)

chi2_loaded, p_loaded = stats.chisquare(loaded_counts)
print("Chi-squared:", round(chi2_loaded, 1), " p-value:", p_loaded)
Loaded-die counts: [ 54  61  45  57  57 326]
Chi-squared: 614.4  p-value: 1.5958543008871274e-130

Face 6 turned up 326 times against an expected 100, and the statistic explodes to 614.4, with a p-value so small ($1.6 \times 10^{-130}$) it is effectively zero. We reject $H_0$ with overwhelming evidence: this die is loaded. The same test, the same single line of code, cleanly separates the fair die from the rigged one.
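As with the penguins, it can help to break the statistic into its per-face terms to see which face drives it. This sketch reuses the loaded-die counts printed above.

```python
import numpy as np

loaded_counts = np.array([54, 61, 45, 57, 57, 326])   # copied from the output above
expected = loaded_counts.sum() / 6                     # 100 per face under a fair die

terms = (loaded_counts - expected) ** 2 / expected     # one term per face
share_face_6 = terms[5] / terms.sum()

print("Per-face terms:", terms.round(2))
print("Total:", round(float(terms.sum()), 2))
print("Share from face 6:", round(float(share_face_6), 2))
```

Face 6 alone contributes about 510.76 of the 614.36 total. The other five faces each add a noticeable amount too, because a die that favors six must shortchange everything else.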


Assumptions to Respect

The chi-squared test is forgiving, but it rests on a few conditions. Break them and the p-value stops being trustworthy.

  • Counts, not percentages. The test runs on raw frequencies. Feeding it proportions (which sum to 1) gives a meaningless statistic. Always pass actual counts.
  • Independent observations. Each penguin, each die roll must be a separate event. Measuring the same individual twice, or rolls that influence each other, violates this assumption.
  • Expected counts large enough. The chi-squared distribution is an approximation that only holds when expected counts are not tiny. A common rule of thumb: every $E_i$ should be at least 5. Our smallest expected count was 114.67 (penguins) and 100 (die), so we are comfortably safe. When some expected counts fall below 5, merge sparse categories or use an exact test (for example an exact multinomial test) instead.
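The first condition is easy to violate by accident, so here is a small sketch of what goes wrong. Dividing every count by $N$ also divides the statistic by $N$, so the same data that gave overwhelming evidence as counts look unremarkable as proportions.

```python
import numpy as np
from scipy import stats

counts = np.array([152, 124, 68])
props = counts / counts.sum()                 # proportions that sum to 1

stat_counts, p_counts = stats.chisquare(counts)
stat_props, p_props = stats.chisquare(props)  # the mistake: proportions passed as if they were counts

print("Counts:      chi-squared", round(float(stat_counts), 3), " p-value", p_counts)
print("Proportions: chi-squared", round(float(stat_props), 3), " p-value", round(float(p_props), 3))
```

scipy raises no error in the second call; it simply treats the proportions as if fewer than one observation had been collected in total, and the p-value comes out large.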

Watch the expected counts, not the observed ones

The “at least 5” rule applies to the expected counts $E_i$, not the observed ones. You can observe zero in a category and still be fine, as long as you expected a reasonable number there.
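A quick check of this rule before running the test might look like the sketch below. The helper name `expected_counts_ok` is hypothetical, and the threshold of 5 is the rule of thumb quoted above rather than a hard limit.

```python
import numpy as np

def expected_counts_ok(n_total, proportions, minimum=5):
    """Return (all expected counts >= minimum, the expected counts)."""
    expected = n_total * np.asarray(proportions, dtype=float)
    return bool((expected >= minimum).all()), expected

# Penguins: 344 birds, equal split across three species
ok_penguins, exp_penguins = expected_counts_ok(344, [1/3, 1/3, 1/3])

# A hypothetical small experiment: 20 rolls of a die assumed fair
ok_small, exp_small = expected_counts_ok(20, [1/6] * 6)

print("Penguins OK?", ok_penguins, exp_penguins.round(2))
print("20 rolls OK?", ok_small, exp_small.round(2))
```

Note that the check only looks at the sample size and the null proportions; the observed counts never enter it.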


Practice Exercises

Exercise 1: Test the islands

The penguins live on three islands: Biscoe, Dream, and Torgersen. Count how many penguins live on each, then run a goodness-of-fit test of whether the islands are equally populated. Report the chi-squared statistic, the p-value, and your decision at $\alpha = 0.05$.

Hint

Get the counts with penguins["island"].value_counts(), then pass them straight to stats.chisquare(...). There are still 3 categories, so df is again $3 - 1 = 2$.

Exercise 2: Test against custom proportions

scipy.stats.chisquare accepts an f_exp argument for non-uniform expectations. Suppose a guidebook claims penguins split 45% Adelie, 35% Gentoo, 20% Chinstrap. Test the observed counts [152, 124, 68] against those proportions instead of an equal split.

Hint

Turn the proportions into expected counts that sum to 344: expected = np.array([0.45, 0.35, 0.20]) * 344, then call stats.chisquare([152, 124, 68], f_exp=expected). Because the claim is close to reality, you should fail to reject this time.

Exercise 3: Shrink the sample

Repeat the loaded-die simulation but with only 30 rolls instead of 600. Run the test and compare the p-value to the 600-roll version. Does the test still detect the loading? What does this tell you about sample size?

Hint

Reuse the loaded-die code with size=30. Watch your expected counts: with 30 rolls you expect only $30/6 = 5$ per face, right at the edge of the assumption. A real effect can hide when the sample is small.


Summary

You learned to test whether a categorical variable follows an expected distribution. The goodness-of-fit test compares observed counts against expected counts built from a null hypothesis, squeezes every gap into one chi-squared statistic $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, and reads it against the chi-squared distribution with $k - 1$ degrees of freedom. A statistic far out in the tail, with a p-value below $\alpha$, means the data disagree with the null. The penguin species turned out to be unequal ($\chi^2 = 31.9$, $p = 1.18 \times 10^{-7}$); a fair die passed ($p = 0.39$) while a loaded one failed spectacularly.
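As a recap, the steps of the lesson can be gathered into one small function. This is a sketch for study purposes, with a hypothetical name and return format; in practice `scipy.stats.chisquare` does the same job.

```python
import numpy as np
from scipy import stats

def goodness_of_fit(observed, proportions=None, alpha=0.05):
    """Chi-squared goodness-of-fit test computed step by step."""
    observed = np.asarray(observed, dtype=float)
    k = observed.size
    if proportions is None:                   # default null: equal proportions
        proportions = np.full(k, 1 / k)
    expected = observed.sum() * np.asarray(proportions, dtype=float)   # E_i = N * p_i
    statistic = float(((observed - expected) ** 2 / expected).sum())
    df = k - 1
    p_value = float(stats.chi2.sf(statistic, df))                      # upper tail
    return {"statistic": statistic, "df": df,
            "p_value": p_value, "reject": p_value < alpha}

result = goodness_of_fit([152, 124, 68])
print(result)
```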

Key Concepts

  • Goodness-of-fit test: checks whether a categorical variable’s distribution matches expected proportions.
  • Observed count $O_i$: the actual number of items in a category.
  • Expected count $E_i$: the number a category should hold under the null, $N \times p_i$.
  • Chi-squared statistic: $\sum_i (O_i - E_i)^2 / E_i$; zero when reality matches theory, larger as they diverge.
  • Degrees of freedom: $k - 1$ for a goodness-of-fit test with $k$ categories.
  • Expected-count assumption: every $E_i$ should be at least 5 for the approximation to hold.

Why This Matters

Categorical data is everywhere — survey responses, product categories, error codes, A/B test outcomes. The chi-squared goodness-of-fit test is the standard, lightweight way to ask “is this breakdown what we expected?” and back the answer with evidence instead of a hunch. It is one line of scipy, but knowing what the statistic measures and which assumptions it leans on is what lets you trust the verdict.


Next Steps

Continue to Lesson 4 - Chi-Squared Test of Independence

Extend the chi-squared idea from one variable to two, and test whether two categorical variables are related.


Continue Building Your Skills

You can now test a single categorical variable against any expected distribution and defend the result. Next you will take the same chi-squared machinery and point it at two categorical variables at once, asking the deeper question: are these two things independent, or does one tell you something about the other?