Lesson 3 - Chi-Squared Goodness of Fit
Welcome to Chi-Squared Goodness of Fit
So far you have tested claims about numbers — a mean, a difference between groups. But a lot of data is not numeric at all. It is categorical: a species, a color, a survey answer, the face of a die. When you want to ask “does this category breakdown match what I expected?”, the tool you reach for is the chi-squared goodness-of-fit test.
In this lesson you will use it to answer a concrete question about real data: are the three penguin species in our dataset equally common, or does one dominate? You will work the test out by hand, watch every piece of arithmetic, then confirm it in one line with scipy.
By the end of this lesson, you will be able to:
- Recognize when a question calls for a goodness-of-fit test on a categorical variable
- Turn expected proportions into expected counts and compare them to observed counts
- Compute the chi-squared statistic and explain what it measures
- Read the result against degrees of freedom and a significance level, and confirm it with scipy.stats.chisquare
You only need a little Python, pandas, and scipy. Let’s begin.
When to Use a Goodness-of-Fit Test
A goodness-of-fit test asks a single question: does the distribution of a categorical variable match a set of expected proportions? You have a variable that sorts each observation into one of several categories, a count for each category, and a theory about how those counts should be split. The test measures how badly reality and theory disagree.
Some questions it answers:
- Are the six faces of a die equally likely, or is it loaded?
- Do customer arrivals split evenly across the seven days of the week?
- Does a sample’s ethnicity breakdown match the known population proportions?
Our question for this lesson is in the same shape. Load the penguins and count the species:
import pandas as pd
penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins["species"].value_counts())

species
Adelie 152
Gentoo 124
Chinstrap 68
Name: count, dtype: int64

There are 344 penguins, split very unevenly: 152 Adelie, 124 Gentoo, only 68 Chinstrap. The question we will test is “are the three species equally likely?” That is, is each species meant to be one-third of the population, with the gaps we see just sampling noise? Or is the imbalance real?
Stating the hypotheses
Every test starts with a null hypothesis $H_0$, the dull “nothing special” claim we try to disprove, and an alternative $H_1$.
- $H_0$: the three species are equally likely, so each occurs with probability $1/3$.
- $H_1$: the species are not equally likely, so at least one probability differs from $1/3$.
The goodness-of-fit test will tell us whether the data give us enough evidence to throw out $H_0$.
Observed vs Expected Counts
The test compares two sets of numbers for each category: what you saw and what you would expect if $H_0$ were true.
The observed counts are simply the value counts you already have: 152, 124, 68. The expected counts are what each category should contain under the null. If all three species are equally likely, each one should hold one-third of the 344 birds:
$$E_i = N \times p_i = 344 \times \frac{1}{3} \approx 114.67$$
So under $H_0$, we would expect about 114.67 penguins of each species. (Expected counts do not have to be whole numbers: they are a theoretical average, not an actual headcount.)
import numpy as np
observed = np.array([152, 124, 68]) # Adelie, Gentoo, Chinstrap
N = observed.sum() # 344
expected = np.array([N/3, N/3, N/3]) # 114.67 each
print("Observed:", observed)
print("Expected:", expected.round(2))

Observed: [152 124 68]
Expected: [114.67 114.67 114.67]

It helps to see those two bars side by side for each species. The larger the gap between the blue and grey bars, the more that species defies the equal-split theory.
Adelie overshoots the expected line by about 37, Chinstrap undershoots by about 47, and Gentoo sits closest to it, about 9 above. The eye says these gaps look large, but “looks large” is not a decision rule. We need to convert the gaps into a single number we can judge against chance.
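The size of each gap can be checked with a short sketch. It assumes the species counts printed earlier; the name `gaps` is introduced here for illustration only.

```python
import numpy as np

species = ["Adelie", "Gentoo", "Chinstrap"]
observed = np.array([152, 124, 68])        # counts from the value_counts() output
expected = np.full(3, observed.sum() / 3)  # equal split under the null

# Signed gap per species: positive means more birds than expected, negative means fewer.
gaps = observed - expected
for name, gap in zip(species, gaps):
    print(f"{name:<10} {gap:+.2f}")

# The signed gaps cancel out, so adding them up directly would tell us nothing.
print("Gaps sum to zero:", np.isclose(gaps.sum(), 0.0))
```

Because the signed gaps always cancel, any useful summary has to remove the sign before adding them together.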
The Chi-Squared Statistic
The chi-squared statistic rolls every observed-vs-expected gap into one number:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Here $O_i$ is the observed count and $E_i$ the expected count for category $i$, and $k$ is the number of categories.
Read it one piece at a time, because every part is there for a reason:
- $O_i - E_i$ is the raw gap for a category: how far the count fell from expectation.
- We square it so that overshoots and undershoots both count as positive distance and cannot cancel out, and so that big gaps are punished far more than small ones.
- We divide by $E_i$ to scale the gap. A gap of 20 is huge when you only expected 30, but trivial when you expected 5000. Dividing by the expected count puts every category on the same footing.
- We sum across all categories to get one total deviation.
When the data match the null perfectly, every $O_i = E_i$, each term is zero, and $\chi^2 = 0$. The more reality drifts from theory, the larger $\chi^2$ grows.
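To make the scaling point concrete, here is a minimal sketch. The helper name `chi_squared_statistic` is an assumption made for this illustration, not something the lesson or scipy defines, and the single-category calls show one term of the sum rather than a complete test.

```python
import numpy as np

def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(((observed - expected) ** 2 / expected).sum())

# A perfect match between observed and expected gives exactly zero.
perfect = chi_squared_statistic([50, 50], [50, 50])
print(perfect)

# The same raw gap of 20 weighs very differently depending on the expectation.
small_expectation = chi_squared_statistic([50], [30])      # 20^2 / 30
large_expectation = chi_squared_statistic([5020], [5000])  # 20^2 / 5000
print(round(small_expectation, 2), round(large_expectation, 2))
```

The first gap contributes about 13.33 to the statistic and the second only 0.08, even though both are 20 counts wide.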
Computing it by hand
Let’s plug the penguin numbers straight into the formula, one species at a time:
terms = (observed - expected) ** 2 / expected # one term per species
chi2 = terms.sum() # total deviation
print("Per-category terms:", terms.round(3))
print("Chi-squared:", round(chi2, 3))

Per-category terms: [12.155 0.76 18.992]
Chi-squared: 31.907

The total is $\chi^2 \approx 31.91$. Notice where it comes from: Chinstrap alone contributes 18.99, because its 47-bird shortfall, squared and scaled, dominates the statistic. Gentoo, sitting almost on expectation, adds only 0.76. The statistic automatically weights the categories that disagree most.
But is 31.9 big? On its own the number means nothing until we know what scale to read it on — and that scale is set by the degrees of freedom.
Degrees of Freedom
The degrees of freedom (df) tell us how many category counts are free to vary once the total is fixed. For a goodness-of-fit test with $k$ categories:
$$df = k - 1$$
Why $k - 1$ and not $k$? Because the counts must sum to $N$. With 344 penguins, once you know any two species’ counts, the third is forced: it is whatever is left over. Only $k - 1 = 2$ of the three counts can move freely, so we have 2 degrees of freedom.
k = len(observed) # 3 categories
df = k - 1 # 2
print("Degrees of freedom:", df)

Degrees of freedom: 2

The df picks out which chi-squared distribution our statistic should be compared against. That distribution describes how large $\chi^2$ tends to get purely by chance when $H_0$ is true. A statistic that lands far out in its tail is evidence against the null.
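One way to see what "by chance" looks like is to simulate it. The sketch below is an illustration, not part of the lesson's own code: it draws many samples of 344 birds from a population where the three species really are equally likely, and computes the statistic for each. The seed and the number of simulations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, used only for repeatability
n, k = 344, 3
expected = np.full(k, n / k)

# Draw 20,000 samples of 344 birds from a population where H0 really holds.
samples = rng.multinomial(n, [1 / k] * k, size=20_000)

# One chi-squared statistic per simulated sample.
simulated = ((samples - expected) ** 2 / expected).sum(axis=1)

print("Mean of simulated statistics:", round(simulated.mean(), 2))
print("Share at or above 31.9:", (simulated >= 31.9).mean())
```

The mean of a chi-squared distribution equals its degrees of freedom, so the simulated mean should sit close to 2. Values anywhere near 31.9 should be extremely rare in the simulation.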
A different df for a different test
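The rule $df = k - 1$ is specific to the goodness-of-fit test. The chi-squared test of independence (next lesson) uses a different formula, $df = (r - 1)(c - 1)$, where $r$ and $c$ are the numbers of rows and columns in the contingency table. Always match the df rule to the test you are running.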
Reading the Result
Now we turn $\chi^2 = 31.91$ on 2 df into a decision. The p-value is the probability of seeing a statistic this large or larger if the null were true. Compute it from the chi-squared distribution’s upper tail:
from scipy import stats
p_value = stats.chi2.sf(chi2, df) # sf = survival function = upper tail
print("p-value:", p_value)

p-value: 1.1789300370444807e-07

The p-value is $1.18 \times 10^{-7}$, about one in eight million. If the three species really were equally likely, a gap as extreme as the one we saw would almost never happen by chance.
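For the special case of 2 degrees of freedom, the upper tail has a simple closed form, $P(\chi^2 \ge x) = e^{-x/2}$, which gives an independent check on the scipy call. The sketch below recomputes the statistic from the penguin counts so that it runs on its own; the shortcut holds only for df = 2.

```python
import math
import numpy as np
from scipy import stats

observed = np.array([152, 124, 68])
expected = np.full(3, observed.sum() / 3)
chi2_stat = ((observed - expected) ** 2 / expected).sum()

p_scipy = stats.chi2.sf(chi2_stat, 2)     # upper tail from scipy
p_closed_form = math.exp(-chi2_stat / 2)  # closed form, valid for df = 2 only

print("scipy:      ", p_scipy)
print("closed form:", p_closed_form)
```

The two values should agree to many decimal places. For other degrees of freedom there is no such one-line formula, so `stats.chi2.sf` is the practical route.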
Comparing against alpha
We decide by comparing the p-value to a significance level $\alpha$, the risk of a false alarm we are willing to accept. The usual choice is $\alpha = 0.05$.
alpha = 0.05
print("Reject H0?", p_value < alpha)

Reject H0? True

Since $p \approx 1.18 \times 10^{-7}$ is far below 0.05, we reject $H_0$. The penguin species are not equally common; the imbalance we saw is real, not sampling noise.
There is an equivalent way to decide: compare the statistic to a critical value, the cutoff at which the upper 5% of the distribution begins:
critical = stats.chi2.ppf(1 - alpha, df)
print("Critical value:", round(critical, 3))
print("Reject H0?", chi2 > critical)

Critical value: 5.991
Reject H0? True

Our statistic of 31.9 sails past the 5.99 cutoff, so we reject $H_0$, the same verdict. The two routes always agree: a small p-value and a statistic past the critical value are two views of the same fact.
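The agreement between the two decision rules can be checked directly. This is a small sketch: the list of trial statistics is made up for illustration, with a couple of values placed just either side of the cutoff.

```python
from scipy import stats

alpha, df = 0.05, 2
critical = stats.chi2.ppf(1 - alpha, df)

# The upper-tail area beyond the critical value is alpha by construction.
print("Tail area at the critical value:", round(stats.chi2.sf(critical, df), 4))

# Both rules give the same verdict for every trial statistic.
verdicts = []
for statistic in [0.5, 5.0, 5.99, 6.0, 31.9]:
    by_p_value = bool(stats.chi2.sf(statistic, df) < alpha)
    by_critical = bool(statistic > critical)
    verdicts.append((statistic, by_p_value, by_critical))
    print(statistic, by_p_value, by_critical)
```

The two boolean columns match row by row, because the critical value is defined as the point whose upper-tail area equals $\alpha$.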
Confirming with scipy
You will rarely do this by hand in practice. scipy.stats.chisquare does the whole calculation — statistic and p-value — in one call. By default it assumes equal expected proportions, exactly our null:
chi2_sp, p_sp = stats.chisquare([152, 124, 68])
print("scipy chi-squared:", round(chi2_sp, 3))
print("scipy p-value:", p_sp)

scipy chi-squared: 31.907
scipy p-value: 1.1789300370444807e-07

Identical to our hand calculation: $\chi^2 = 31.907$ and $p \approx 1.18 \times 10^{-7}$. Working it out by hand once teaches you what the test means; from here on, the one-liner is all you need.
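The default null can also be spelled out explicitly through the `f_exp` argument of `scipy.stats.chisquare`. The sketch below shows that both forms give the same answer for the penguin counts; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([152, 124, 68])
expected = np.full(3, observed.sum() / 3)  # equal split, written out by hand

default_result = stats.chisquare(observed)                   # f_exp omitted
explicit_result = stats.chisquare(observed, f_exp=expected)  # f_exp spelled out

# The result exposes the two numbers as named fields as well as by unpacking.
print(default_result.statistic, default_result.pvalue)
print(explicit_result.statistic, explicit_result.pvalue)
```

Writing `f_exp` out costs one extra line, but it states in the code which null hypothesis is being tested.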
A Second Example: Is the Die Fair?
The penguin test used the default equal-proportions null. Let’s run the test again on a fresh problem to cement it — and to see what failing to reject looks like. We will roll a simulated die many times and ask whether all six faces are equally likely.
A fair die
First, roll a genuinely fair die 600 times. With six faces, each should appear about $600 / 6 = 100$ times:
rng = np.random.default_rng(0)
fair_rolls = rng.integers(1, 7, size=600)
fair_counts = np.bincount(fair_rolls, minlength=7)[1:] # counts for faces 1-6
print("Fair-die counts:", fair_counts)
chi2_fair, p_fair = stats.chisquare(fair_counts)
print("Chi-squared:", round(chi2_fair, 2), " p-value:", round(p_fair, 3))

Fair-die counts: [100 88 90 99 109 114]
Chi-squared: 5.22 p-value: 0.39

The faces came out near 100 each, but not exactly: 88 here, 114 there. That scatter is normal sampling variation. The statistic is only 5.22, with a p-value of 0.39, well above 0.05. With $k = 6$ faces the df is $6 - 1 = 5$, and the critical value is 11.07; our 5.22 is nowhere near it. We fail to reject $H_0$: the data are perfectly consistent with a fair die. (Failing to reject does not prove the die is fair. It just means we found no evidence against it.)
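The df and critical value quoted for the fair die can be reproduced from the printed counts. This sketch hard-codes those counts rather than re-running the simulation, so it does not depend on the random number generator.

```python
import numpy as np
from scipy import stats

fair_counts = np.array([100, 88, 90, 99, 109, 114])  # counts from the output above
k = len(fair_counts)
df = k - 1
expected = fair_counts.sum() / k  # 100 per face

statistic = ((fair_counts - expected) ** 2 / expected).sum()
critical = stats.chi2.ppf(0.95, df)  # cutoff for alpha = 0.05

print("df:", df)
print("Statistic:", round(statistic, 2))
print("Critical value:", round(critical, 2))
print("Reject H0?", bool(statistic > critical))
```

With 5 degrees of freedom the cutoff is higher than the 5.99 used for the penguins, which is why the df has to be recomputed for every new problem.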
A loaded die
Now rig the die so a six lands half the time, then run the same test:
rng = np.random.default_rng(0)
weights = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5] # face 6 is loaded
loaded_rolls = rng.choice([1, 2, 3, 4, 5, 6], size=600, p=weights)
loaded_counts = np.bincount(loaded_rolls, minlength=7)[1:]
print("Loaded-die counts:", loaded_counts)
chi2_loaded, p_loaded = stats.chisquare(loaded_counts)
print("Chi-squared:", round(chi2_loaded, 1), " p-value:", p_loaded)

Loaded-die counts: [ 54 61 45 57 57 326]
Chi-squared: 614.4 p-value: 1.5958543008871274e-130

Face 6 turned up 326 times against an expected 100, and the statistic explodes to 614.4, with a p-value so small ($\approx 1.6 \times 10^{-130}$) it is effectively zero. We reject $H_0$ with overwhelming evidence: this die is loaded. The same test, the same single line of code, cleanly separates the fair die from the rigged one.
Assumptions to Respect
The chi-squared test is forgiving, but it rests on three conditions. Break them and the p-value stops being trustworthy.
- Counts, not percentages. The test runs on raw frequencies. Feeding it proportions (which sum to 1) gives a meaningless statistic. Always pass actual counts.
- Independent observations. Each penguin, each die roll must be a separate event. Measuring the same individual twice, or rolls that influence each other, violates the test.
- Expected counts large enough. The chi-squared distribution is an approximation that only holds when expected counts are not tiny. A common rule of thumb: every $E_i$ should be at least 5. Our smallest expected counts were 114.67 (penguins) and 100 (die), so we are comfortably safe. When some expected counts fall below 5, merge sparse categories or use an exact multinomial test instead.
Watch the expected counts, not the observed ones
The “at least 5” rule applies to the expected counts $E_i$, not the observed ones. You can observe zero in a category and still be fine, as long as you expected a reasonable number there.
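Two of these conditions are easy to check in code. The sketch below first shows what goes wrong when proportions are passed instead of counts, then applies the rule of thumb to a set of expected counts. The helper `smallest_expected_ok` and the 24-roll example are hypothetical, introduced only for this illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([152, 124, 68])
proportions = observed / observed.sum()

# Pitfall: the same data as proportions shrinks the statistic by a factor of N.
wrong_stat, wrong_p = stats.chisquare(proportions)
right_stat, right_p = stats.chisquare(observed)
print("From proportions:", round(wrong_stat, 4), round(wrong_p, 3))
print("From counts:     ", round(right_stat, 3), right_p)

def smallest_expected_ok(expected, minimum=5):
    """Hypothetical helper: True when every expected count meets the rule of thumb."""
    return bool(np.min(expected) >= minimum)

penguins_ok = smallest_expected_ok(np.full(3, 344 / 3))  # 114.67 per species
tiny_die_ok = smallest_expected_ok(np.full(6, 24 / 6))   # 24 rolls: 4 per face
print(penguins_ok, tiny_die_ok)
```

With proportions the statistic is divided by 344 and the test no longer rejects, even though the underlying data are unchanged. That is why raw counts are required.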
Practice Exercises
Exercise 1: Test the islands
The penguins live on three islands: Biscoe, Dream, and Torgersen. Count how many penguins live on each, then run a goodness-of-fit test of whether the islands are equally populated. Report the chi-squared statistic, the p-value, and your decision at $\alpha = 0.05$.
Hint
Get the counts with penguins["island"].value_counts(), then pass them straight to stats.chisquare(...). There are still 3 categories, so df is again $3 - 1 = 2$.
Exercise 2: Test against custom proportions
scipy.stats.chisquare accepts an f_exp argument for non-uniform expectations. Suppose a guidebook claims penguins split 45% Adelie, 35% Gentoo, 20% Chinstrap. Test the observed counts [152, 124, 68] against those proportions instead of an equal split.
Hint
Turn the proportions into expected counts that sum to 344: expected = np.array([0.45, 0.35, 0.20]) * 344, then call stats.chisquare([152, 124, 68], f_exp=expected). Because the claim is close to reality, you should fail to reject this time.
Exercise 3: Shrink the sample
Repeat the loaded-die simulation but with only 30 rolls instead of 600. Run the test and compare the p-value to the 600-roll version. Does the test still detect the loading? What does this tell you about sample size?
Hint
Reuse the loaded-die code with size=30. Watch your expected counts: with 30 rolls you expect only $30 / 6 = 5$ per face, right at the edge of the assumption. A real effect can hide when the sample is small.
Summary
You learned to test whether a categorical variable follows an expected distribution. The goodness-of-fit test compares observed counts against expected counts built from a null hypothesis, squeezes every gap into one chi-squared statistic $\chi^2$, and reads it against the chi-squared distribution with $k - 1$ degrees of freedom. A statistic far out in the tail, with a p-value below $\alpha$, means the data disagree with the null. The penguin species turned out to be unequal ($\chi^2 \approx 31.9$, $p \approx 1.2 \times 10^{-7}$); a fair die passed ($p = 0.39$) while a loaded one failed spectacularly.
Key Concepts
- Goodness-of-fit test: checks whether a categorical variable’s distribution matches expected proportions.
- Observed count: the actual number of items in a category, $O_i$.
- Expected count: the number a category should hold under the null, $E_i = N \times p_i$.
- Chi-squared statistic: $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$; zero when reality matches theory, larger as they diverge.
- Degrees of freedom: $k - 1$ for a goodness-of-fit test with $k$ categories.
- Expected-count assumption: every $E_i$ should be at least 5 for the approximation to hold.
Why This Matters
Categorical data is everywhere — survey responses, product categories, error codes, A/B test outcomes. The chi-squared goodness-of-fit test is the standard, lightweight way to ask “is this breakdown what we expected?” and back the answer with evidence instead of a hunch. It is one line of scipy, but knowing what the statistic measures and which assumptions it leans on is what lets you trust the verdict.
Next Steps
Continue to Lesson 4 - Chi-Squared Test of Independence
Extend the chi-squared idea from one variable to two, and test whether two categorical variables are related.
Back to Module Overview
Return to the Statistical Inference module overview
Continue Building Your Skills
You can now test a single categorical variable against any expected distribution and defend the result. Next you will take the same chi-squared machinery and point it at two categorical variables at once, asking the deeper question: are these two things independent, or does one tell you something about the other?