Lesson 3 - Chi-Squared Goodness of Fit

Welcome to Chi-Squared Goodness of Fit

So far you have tested claims about numbers — a mean, a difference between groups. But a lot of data is not numeric at all. It is categorical: a species, a color, a survey answer, the face of a die. When you want to ask “does this category breakdown match what I expected?”, the tool you reach for is the chi-squared goodness-of-fit test.

In this lesson you will use it to answer a concrete question about real data: are the three penguin species in our dataset equally common, or does one dominate? You will work the test out by hand, watch every piece of arithmetic, then confirm it in one line with scipy.

By the end of this lesson, you will be able to:

Recognize when a question calls for a goodness-of-fit test on a categorical variable
Turn expected proportions into expected counts and compare them to observed counts
Compute the chi-squared statistic and explain what it measures
Read the result against degrees of freedom and a significance level, and confirm it with scipy.stats.chisquare

You only need a little Python, pandas, and scipy. Let’s begin.

When to Use a Goodness-of-Fit Test

A goodness-of-fit test asks a single question: does the distribution of a categorical variable match a set of expected proportions? You have a variable that sorts each observation into one of several categories, a count for each category, and a theory about how those counts should be split. The test measures how badly reality and theory disagree.

Some questions it answers:

Are the six faces of a die equally likely, or is it loaded?
Do customer arrivals split evenly across the seven days of the week?
Does a sample’s ethnicity breakdown match the known population proportions?

Our question for this lesson is in the same shape. Load the penguins and count the species:

import pandas as pd

penguins = pd.read_csv("https://datatweets.com/datasets/penguins.csv")
print(penguins["species"].value_counts())

species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

There are 344 penguins, split very unevenly: 152 Adelie, 124 Gentoo, only 68 Chinstrap. The question we will test is “are the three species equally likely?” — that is, is each species meant to be one-third of the population, with the gaps we see just sampling noise? Or is the imbalance real?
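Before any formal test, it can help to put the observed shares next to the equal-split share the question implies. The snippet below is a minimal sketch that hardcodes the counts printed above rather than reloading the CSV; the names `counts`, `shares`, and `null_share` are illustrative only.

```python
# Species counts copied from the value_counts() output above.
counts = {"Adelie": 152, "Gentoo": 124, "Chinstrap": 68}

total = sum(counts.values())
null_share = 1 / len(counts)          # equal split: one-third per species

# Observed share of each species in the sample.
shares = {species: n / total for species, n in counts.items()}

for species, share in shares.items():
    print(f"{species:<10} observed share = {share:.3f}   equal-split share = {null_share:.3f}")
```

Adelie holds about 44% of the sample and Chinstrap about 20%, against the 33.3% each would hold under an equal split.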

Stating the hypotheses

Every test starts with a null hypothesis $H_0$, the dull “nothing special” claim we try to disprove, and an alternative $H_1$.

$H_0$: the three species are equally likely, so each occurs with probability $1/3$.
$H_1$: the species are not equally likely, so at least one probability differs from $1/3$.

The goodness-of-fit test will tell us whether the data give us enough evidence to throw out $H_0$.

Observed vs Expected Counts

The test compares two sets of numbers for each category: what you saw and what you would expect if $H_0$ were true.

The observed counts $O_i$ are simply the value counts you already have: 152, 124, 68. The expected counts $E_i$ are what each category should contain under the null. If all three species are equally likely, each one should hold one-third of the 344 birds:

$$E_i = N \times p_i = 344 \times \frac{1}{3} \approx 114.67$$

So under $H_0$, we would expect about 114.67 penguins of each species. (Expected counts do not have to be whole numbers: they are a theoretical average, not an actual headcount.)

import numpy as np

observed = np.array([152, 124, 68])      # Adelie, Gentoo, Chinstrap
N = observed.sum()                        # 344
expected = np.array([N/3, N/3, N/3])      # 114.67 each
print("Observed:", observed)
print("Expected:", expected.round(2))

Observed: [152 124  68]
Expected: [114.67 114.67 114.67]
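The same $E_i = N \times p_i$ rule works for any set of null proportions, not just an equal split. Here is a small sketch of a helper that does the conversion; `expected_counts` is a hypothetical name, and the 50/30/20 split is made up purely to show the arithmetic.

```python
import numpy as np

def expected_counts(total, proportions):
    """Hypothetical helper: turn null proportions into expected counts N * p_i."""
    p = np.asarray(proportions, dtype=float)
    if not np.isclose(p.sum(), 1.0):
        raise ValueError("proportions must sum to 1")
    return total * p

uniform = expected_counts(344, [1/3, 1/3, 1/3])
unequal = expected_counts(344, [0.5, 0.3, 0.2])   # made-up proportions, illustration only

print("Uniform null:", uniform.round(2))
print("Unequal null:", unequal.round(2))
```

Whatever the proportions, the expected counts always add back up to the sample size, which is a quick sanity check worth running.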

It helps to see the two bars side by side for each species. The larger the gap between the blue and grey bars, the more that species defies the equal-split theory.

[Figure: grouped bar chart of observed penguin species counts (152 Adelie, 124 Gentoo, 68 Chinstrap) against the expected count of about 115 per species under a uniform model (grey). Adelie runs well above the expected bar and Chinstrap well below it; the goodness-of-fit test asks whether those gaps are too big to be chance.]

Adelie overshoots the expected count by about 37, Chinstrap undershoots by about 47, and Gentoo sits only about 9 above it. The eye says these gaps look large, but “looks large” is not a decision rule. We need to convert the gaps into a single number we can judge against chance.
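One reason the raw gaps cannot simply be added up: because observed and expected counts share the same total, the gaps always cancel. The sketch below recomputes them from the counts above (variable names are illustrative) and shows that their sum is zero up to floating-point error, while the squared gaps do not cancel.

```python
import numpy as np

observed = np.array([152, 124, 68])          # Adelie, Gentoo, Chinstrap
expected = np.full(3, observed.sum() / 3)    # equal split under H0

gaps = observed - expected                   # raw observed-minus-expected gaps
squared_gaps = gaps ** 2

print("Gaps:", gaps.round(2))
print("Sum of gaps:", gaps.sum())
print("Sum of squared gaps:", squared_gaps.sum().round(2))
```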

The Chi-Squared Statistic

The chi-squared statistic $\chi^2$ rolls every observed-vs-expected gap into one number:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Read it one piece at a time, because every part is there for a reason:

$(O_i - E_i)$ is the raw gap for a category — how far the count fell from expectation.
We square it so that overshoots and undershoots both count as positive distance and cannot cancel out, and so that big gaps are punished far more than small ones.
We divide by $E_i$ to scale the gap. A gap of 20 is huge when you only expected 30, but trivial when you expected 5000. Dividing by the expected count puts every category on the same footing.
We sum across all categories to get one total deviation.

When the data match the null perfectly, every $O_i = E_i$, each term is zero, and $\chi^2 = 0$. The more reality drifts from theory, the larger $\chi^2$ grows.
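The effect of dividing by $E_i$ is easy to check numerically. This sketch uses a hypothetical one-term helper, `chi_squared_term`, and the same gap of 20 against two very different expected counts.

```python
def chi_squared_term(observed, expected):
    """One category's contribution to the statistic: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

small_base = chi_squared_term(50, 30)        # gap of 20 against an expected 30
large_base = chi_squared_term(5020, 5000)    # gap of 20 against an expected 5000
perfect = chi_squared_term(30, 30)           # observed equals expected

print(round(small_base, 2), round(large_base, 2), perfect)
```

The same 20-count gap contributes about 13.33 when 30 were expected but only 0.08 when 5000 were expected, and a perfect match contributes exactly zero.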

Computing it by hand

Let’s plug the penguin numbers straight into the formula, one species at a time:

$$\chi^2 = \frac{(152 - 114.67)^2}{114.67} + \frac{(124 - 114.67)^2}{114.67} + \frac{(68 - 114.67)^2}{114.67} \approx 12.155 + 0.760 + 18.992 = 31.907$$

terms = (observed - expected) ** 2 / expected   # one term per species
chi2 = terms.sum()
print("Per-category terms:", terms.round(3))
print("Chi-squared:", round(chi2, 3))

Per-category terms: [12.155  0.76  18.992]
Chi-squared: 31.907

The total is $\chi^2 = 31.9$. Notice where it comes from: Chinstrap alone contributes 18.99, because its 47-bird shortfall, squared and scaled, dominates the statistic. Gentoo, sitting close to expectation, adds only 0.76. The statistic automatically gives the most weight to the categories that disagree most.
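To make "dominates" concrete, each term can be expressed as a share of the total. This is a small sketch over the same counts; the `species`, `terms`, and `shares` names are illustrative.

```python
import numpy as np

species = ["Adelie", "Gentoo", "Chinstrap"]
observed = np.array([152, 124, 68])
expected = np.full(3, observed.sum() / 3)

terms = (observed - expected) ** 2 / expected   # per-species contributions
shares = terms / terms.sum()                    # fraction of the total statistic

for name, term, share in zip(species, terms, shares):
    print(f"{name:<10} term = {term:6.3f}   share of chi-squared = {share:.1%}")
```

Chinstrap accounts for roughly 60% of the statistic, Adelie for most of the rest, and Gentoo for only a couple of percent.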

But is 31.9 big? On its own the number means nothing until we know what scale to read it on — and that scale is set by the degrees of freedom.

Degrees of Freedom

The degrees of freedom (df) tell us how many category counts are free to vary once the total is fixed. For a goodness-of-fit test with $k$ categories:

$$\text{df} = k - 1$$

Why $k - 1$ and not $k$? Because the counts must sum to $N$. With 344 penguins, once you know any two species’ counts, the third is forced: it is whatever is left over. Only $k - 1 = 2$ of the three counts can move freely, so we have 2 degrees of freedom.

k = len(observed)     # 3 categories
df = k - 1            # 2
print("Degrees of freedom:", df)

Degrees of freedom: 2

The df picks out which chi-squared distribution our statistic should be compared against. That distribution describes how large $\chi^2$ tends to get purely by chance when $H_0$ is true. A statistic that lands far out in its tail is evidence against the null.
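One way to see what "purely by chance" looks like is to simulate it. The sketch below draws many samples of 344 penguins from a world where $H_0$ is exactly true and computes the statistic for each; the seed and the number of simulations are arbitrary choices, not part of the lesson's method.

```python
import numpy as np

rng = np.random.default_rng(42)               # arbitrary seed
n_sims, N, k = 20_000, 344, 3

# Each row is one simulated sample of 344 penguins under an equal split.
sim_counts = rng.multinomial(N, [1/3, 1/3, 1/3], size=n_sims)
expected = N / k
sim_chi2 = ((sim_counts - expected) ** 2 / expected).sum(axis=1)

print("Mean simulated statistic:", sim_chi2.mean().round(2))
print("Largest simulated statistic:", sim_chi2.max().round(2))
```

The simulated statistics average close to 2, which matches a known property of the chi-squared distribution (its mean equals its df), and it would be very surprising for any of them to come near the 31.9 we observed.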

A different df for a different test

The $k - 1$ rule is specific to the goodness-of-fit test. The chi-squared test of independence (next lesson) uses a different formula, $(\text{rows} - 1)(\text{columns} - 1)$. Always match the df rule to the test you are running.

Reading the Result

Now we turn $\chi^2 = 31.9$ on 2 df into a decision. The p-value is the probability of seeing a statistic this large or larger if the null were true. Compute it from the chi-squared distribution’s upper tail:

from scipy import stats

p_value = stats.chi2.sf(chi2, df)   # sf = survival function = upper tail
print("p-value:", p_value)

p-value: 1.1789300370444807e-07

The p-value is $1.18 \times 10^{-7}$, roughly one in 8.5 million. If the three species really were equally likely, a gap as extreme as the one we saw would almost never happen by chance.
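For the special case of 2 degrees of freedom, the upper tail has a simple closed form, $P(\chi^2 \ge x) = e^{-x/2}$, so the p-value can be checked without scipy. The sketch below assumes the rounded statistic 31.907 from the hand calculation; the shortcut does not apply to other df values.

```python
import math
from scipy import stats

chi2_stat = 31.907                           # statistic from the hand calculation, rounded

p_closed_form = math.exp(-chi2_stat / 2)     # closed form, valid only for df = 2
p_scipy = stats.chi2.sf(chi2_stat, 2)        # general upper-tail probability

print("closed form:", p_closed_form)
print("scipy:      ", p_scipy)
```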

Comparing against alpha

We decide by comparing the p-value to a significance level $\alpha$, the probability of a false alarm (rejecting a true $H_0$) that we are willing to accept. The usual choice is $\alpha = 0.05$.

alpha = 0.05
print("Reject H0?", p_value < alpha)

Reject H0? True

Since $1.18 \times 10^{-7}$ is far below 0.05, we reject $H_0$. The evidence says the penguin species are not equally common: the imbalance we saw is too large to put down to sampling noise.

There is an equivalent way to decide, by comparing the statistic to a critical value, the point on the $\chi^2$ axis where the upper 5% of the distribution begins:

critical = stats.chi2.ppf(1 - alpha, df)
print("Critical value:", round(critical, 3))
print("Reject H0?", chi2 > critical)

Critical value: 5.991
Reject H0? True

Our statistic of 31.9 sails past the 5.99 cutoff, so we reject — the same verdict. The two routes always agree: a small p-value and a statistic past the critical value are two views of the same fact.
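The agreement between the two decision rules can be checked directly. This sketch applies both rules to a handful of arbitrary statistics on 2 df, chosen so that some fall on each side of the cutoff.

```python
from scipy import stats

alpha, df = 0.05, 2
critical = stats.chi2.ppf(1 - alpha, df)      # cutoff for the upper 5% tail

decisions = []
for stat in [0.5, 3.0, 5.9, 6.1, 31.907]:     # arbitrary values on both sides of the cutoff
    by_p_value = bool(stats.chi2.sf(stat, df) < alpha)
    by_critical = bool(stat > critical)
    decisions.append((by_p_value, by_critical))
    print(f"chi2 = {stat:>6}: p-value rule -> {by_p_value}, critical-value rule -> {by_critical}")
```

Every row gives the same answer under both rules, because the critical value is by construction the statistic whose p-value equals $\alpha$.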

Confirming with scipy

You will rarely do this by hand in practice. scipy.stats.chisquare does the whole calculation — statistic and p-value — in one call. By default it assumes equal expected proportions, exactly our null:

chi2_sp, p_sp = stats.chisquare([152, 124, 68])
print("scipy chi-squared:", round(chi2_sp, 3))
print("scipy p-value:", p_sp)

scipy chi-squared: 31.907
scipy p-value: 1.1789300370444807e-07

Identical to our hand calculation: $\chi^2 = 31.907$ and $p = 1.18 \times 10^{-7}$. Working it out by hand once teaches you what the test means; from here on, the one-liner is all you need.
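The default null can also be spelled out explicitly. As a sketch, the call below passes the equal-split expected counts through the `f_exp` argument and reads the result by attribute name (`statistic`, `pvalue`) instead of unpacking the tuple; it should reproduce the numbers above.

```python
import numpy as np
from scipy import stats

observed = np.array([152, 124, 68])
expected = np.full(3, observed.sum() / 3)     # the equal split, written out explicitly

result = stats.chisquare(observed, f_exp=expected)
print("statistic:", round(result.statistic, 3))
print("p-value:  ", result.pvalue)
```

Writing the expectation out this way makes the null visible in the code, which is useful once the expected proportions stop being equal.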

A Second Example: Is the Die Fair?

The penguin test used the default equal-proportions null. Let’s run the test again on a fresh problem to cement it — and to see what failing to reject looks like. We will roll a simulated die many times and ask whether all six faces are equally likely.

A fair die

First, roll a genuinely fair die 600 times. With six faces, each should appear about $600 / 6 = 100$ times:

rng = np.random.default_rng(0)
fair_rolls = rng.integers(1, 7, size=600)
fair_counts = np.bincount(fair_rolls, minlength=7)[1:]   # counts for faces 1-6
print("Fair-die counts:", fair_counts)

chi2_fair, p_fair = stats.chisquare(fair_counts)
print("Chi-squared:", round(chi2_fair, 2), " p-value:", round(p_fair, 3))

Fair-die counts: [100  88  90  99 109 114]
Chi-squared: 5.22  p-value: 0.39

The faces came out near 100 each, but not exactly: 88 here, 114 there. That scatter is normal sampling variation. The statistic is only 5.22, with a p-value of 0.39, well above 0.05. With $k = 6$ faces the df is $6 - 1 = 5$, and the critical value is 11.07; our 5.22 is nowhere near it. We fail to reject $H_0$: the data are consistent with a fair die. (Failing to reject does not prove the die is fair; it just means we found no evidence against it.)
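The fair-die numbers can be reproduced from the printed counts with the same hand formula used for the penguins. This sketch hardcodes those counts rather than re-running the simulation.

```python
import numpy as np
from scipy import stats

fair_counts = np.array([100, 88, 90, 99, 109, 114])   # counts printed above
expected = fair_counts.sum() / len(fair_counts)        # 100 per face

chi2_fair = ((fair_counts - expected) ** 2 / expected).sum()
df = len(fair_counts) - 1                               # 6 faces -> 5 df
critical = stats.chi2.ppf(0.95, df)
p_fair = stats.chi2.sf(chi2_fair, df)

print("Chi-squared:", round(chi2_fair, 2))
print("Critical value (df=5):", round(critical, 2))
print("p-value:", round(p_fair, 3))
```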

A loaded die

Now rig the die so a six lands half the time, then run the same test:

rng = np.random.default_rng(0)
weights = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]          # face 6 is loaded
loaded_rolls = rng.choice([1, 2, 3, 4, 5, 6], size=600, p=weights)
loaded_counts = np.bincount(loaded_rolls, minlength=7)[1:]
print("Loaded-die counts:", loaded_counts)

chi2_loaded, p_loaded = stats.chisquare(loaded_counts)
print("Chi-squared:", round(chi2_loaded, 1), " p-value:", p_loaded)

Loaded-die counts: [ 54  61  45  57  57 326]
Chi-squared: 614.4  p-value: 1.5958543008871274e-130

Face 6 turned up 326 times against an expected 100, and the statistic explodes to 614.4, with a p-value so small ($1.6 \times 10^{-130}$) it is effectively zero. We reject $H_0$ with overwhelming evidence: this die is loaded. The same test, the same single line of code, cleanly separates the fair die from the rigged one.
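A useful follow-up is to test the same loaded counts against the weights that actually generated them. The sketch below hardcodes the printed counts and passes the true expected counts through `f_exp`; since this null is correct, the statistic should come out small.

```python
import numpy as np
from scipy import stats

loaded_counts = np.array([54, 61, 45, 57, 57, 326])    # counts printed above
weights = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])     # the weights used to rig the die
expected = weights * loaded_counts.sum()                # 60 per face, 300 for face 6

result = stats.chisquare(loaded_counts, f_exp=expected)
print("Chi-squared:", round(result.statistic, 2))
print("p-value:", round(result.pvalue, 3))
```

Against the correct model the statistic drops to about 6.92, below the 11.07 critical value for 5 df, so the test finds no evidence against it. The verdict always depends on which null you state, not only on the data.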

Assumptions to Respect

The chi-squared test is forgiving, but it rests on a couple of conditions. Break them and the p-value stops being trustworthy.

Counts, not percentages. The test runs on raw frequencies. Feeding it proportions (which sum to 1) gives a meaningless statistic. Always pass actual counts.
Independent observations. Each penguin, each die roll must be a separate event. Measuring the same individual twice, or rolls that influence each other, violates this assumption.
Expected counts large enough. The chi-squared distribution is an approximation that only holds when expected counts are not tiny. A common rule of thumb: every $E_i$ should be at least 5. Our smallest expected counts were 114.67 (penguins) and 100 (die), so we are comfortably safe. When some expected counts fall below 5, merge sparse categories or use an exact multinomial test instead. (Fisher’s exact test, often mentioned here, is designed for contingency tables rather than a single categorical variable.)
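The first and third conditions are easy to check in code. The sketch below shows what goes wrong when proportions are passed instead of counts, then applies the rule-of-thumb check on expected counts; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

counts = np.array([152, 124, 68])
proportions = counts / counts.sum()           # sums to 1: the wrong input for this test

right = stats.chisquare(counts)               # raw counts
wrong = stats.chisquare(proportions)          # same data expressed as proportions

print("counts:      chi2 =", round(right.statistic, 3), " p =", right.pvalue)
print("proportions: chi2 =", round(wrong.statistic, 3), " p =", round(wrong.pvalue, 3))

# Rule-of-thumb check on the expected counts under an equal split.
expected = np.full(len(counts), counts.sum() / len(counts))
print("Smallest expected count:", expected.min().round(2), " at least 5:", bool(expected.min() >= 5))
```

With proportions the statistic shrinks by a factor of 344, to about 0.093, and the p-value jumps to about 0.95, so a highly significant result would be reported as no evidence at all.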

Watch the expected counts, not the observed ones

The “at least 5” rule applies to the expected counts $E_i$, not the observed ones. You can observe zero in a category and still be fine, as long as you expected a reasonable number there.
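As a small illustration with made-up counts, the sample below has one empty category, yet every expected count is 10, so the rule of thumb is still met and the test can be run as usual.

```python
import numpy as np
from scipy import stats

observed = np.array([0, 12, 18])              # hypothetical counts; one category is empty
expected = np.full(3, observed.sum() / 3)     # 10 per category under an equal split

rule_ok = bool((expected >= 5).all())         # the rule looks at expected counts only
result = stats.chisquare(observed)

print("Expected counts:", expected, " rule satisfied:", rule_ok)
print("Chi-squared:", round(result.statistic, 2), " p-value:", result.pvalue)
```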

Practice Exercises

Exercise 1: Test the islands

The penguins live on three islands: Biscoe, Dream, and Torgersen. Count how many penguins live on each, then run a goodness-of-fit test of whether the islands are equally populated. Report the chi-squared statistic, the p-value, and your decision at $\alpha = 0.05$.

Hint

Get the counts with penguins["island"].value_counts(), then pass them straight to stats.chisquare(...). There are still 3 categories, so df is again $3 - 1 = 2$.

Exercise 2: Test against custom proportions

scipy.stats.chisquare accepts an f_exp argument for non-uniform expectations. Suppose a guidebook claims penguins split 45% Adelie, 35% Gentoo, 20% Chinstrap. Test the observed counts [152, 124, 68] against those proportions instead of an equal split.

Hint

Turn the proportions into expected counts that sum to 344: expected = np.array([0.45, 0.35, 0.20]) * 344, then call stats.chisquare([152, 124, 68], f_exp=expected). Because the claim is close to reality, you should fail to reject this time.

Exercise 3: Shrink the sample

Repeat the loaded-die simulation but with only 30 rolls instead of 600. Run the test and compare the p-value to the 600-roll version. Does the test still detect the loading? What does this tell you about sample size?

Hint

Reuse the loaded-die code with size=30. Watch your expected counts: with 30 rolls you expect only $30/6 = 5$ per face — right at the edge of the assumption. A real effect can hide when the sample is small.

Summary

You learned to test whether a categorical variable follows an expected distribution. The goodness-of-fit test compares observed counts against expected counts built from a null hypothesis, squeezes every gap into one chi-squared statistic $\chi^2 = \sum (O_i - E_i)^2 / E_i$, and reads it against the chi-squared distribution with $k - 1$ degrees of freedom. A statistic far out in the tail, equivalently a p-value below $\alpha$, means the data disagree with the null. The penguin species turned out to be unequal ($\chi^2 = 31.9$, $p = 1.18 \times 10^{-7}$); a fair die passed ($p = 0.39$) while a loaded one failed spectacularly.
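The whole workflow fits in one short function. The following is a sketch only: `goodness_of_fit` is a hypothetical helper written for this recap, not a scipy function, and it defaults to an equal split when no proportions are given.

```python
import numpy as np
from scipy import stats

def goodness_of_fit(observed, proportions=None, alpha=0.05):
    """Hypothetical recap helper: run the goodness-of-fit steps by hand."""
    observed = np.asarray(observed, dtype=float)
    k = len(observed)
    if proportions is None:
        proportions = np.full(k, 1 / k)                    # default null: equal split
    expected = observed.sum() * np.asarray(proportions, dtype=float)
    statistic = ((observed - expected) ** 2 / expected).sum()
    df = k - 1
    p_value = stats.chi2.sf(statistic, df)                 # upper-tail probability
    return {
        "statistic": statistic,
        "df": df,
        "p_value": p_value,
        "min_expected": expected.min(),                    # compare with the "at least 5" rule
        "reject": bool(p_value < alpha),
    }

penguin_result = goodness_of_fit([152, 124, 68])
print(penguin_result)
```

In everyday use `stats.chisquare` does the same job; the function is only meant to show the steps in the order the lesson introduced them.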

Key Concepts

Goodness-of-fit test: checks whether a categorical variable’s distribution matches expected proportions.
Observed count $O_i$: the actual number of items in a category.
Expected count $E_i$: the number a category should hold under the null, $N \times p_i$.
Chi-squared statistic: $\sum_i (O_i - E_i)^2 / E_i$; zero when reality matches theory, larger as they diverge.
Degrees of freedom: $k - 1$ for a goodness-of-fit test with $k$ categories.
Expected-count assumption: every $E_i$ should be at least 5 for the approximation to hold.

Why This Matters

Categorical data is everywhere — survey responses, product categories, error codes, A/B test outcomes. The chi-squared goodness-of-fit test is the standard, lightweight way to ask “is this breakdown what we expected?” and back the answer with evidence instead of a hunch. It is one line of scipy, but knowing what the statistic measures and which assumptions it leans on is what lets you trust the verdict.

Next Steps

Continue to Lesson 4 - Chi-Squared Test of Independence

Extend the chi-squared idea from one variable to two, and test whether two categorical variables are related.

Back to Module Overview

Return to the Statistical Inference module overview

Continue Building Your Skills

You can now test a single categorical variable against any expected distribution and defend the result. Next you will take the same chi-squared machinery and point it at two categorical variables at once, asking the deeper question: are these two things independent, or does one tell you something about the other?