Lesson 2 - Hypothesis Testing

Welcome to Hypothesis Testing

You will constantly face a deceptively simple question: is this difference real, or did it just happen by chance? Japanese cars in our dataset average about 10 more miles per gallon than American ones. That gap looks big — but samples wobble, and a gap this size could, in principle, be a fluke of which cars happened to land in each group. Hypothesis testing is the formal machinery for answering that question with a number instead of a hunch.

In this lesson you will take a real gap, assume it is meaningless, simulate the world many thousands of times under that assumption, and watch how often pure chance reproduces what you actually saw. When the answer is “essentially never,” you have evidence the gap is real.

By the end of this lesson, you will be able to:

  • State a null and an alternative hypothesis and choose a test statistic
  • Build a null distribution with a permutation test and compute a p-value
  • Pick a significance level $\alpha$ and decide between one- and two-tailed tests
  • Recognize Type I and Type II errors, run a fast parametric t-test, and avoid p-hacking

You only need pandas, numpy, and a bit of scipy. Let’s begin.


The Logic of a Hypothesis Test

Every hypothesis test runs on one slightly backwards idea: to argue that something is happening, you start by assuming that nothing is. You pretend the effect you care about does not exist, then ask how strange your real data would be in that pretend world. If the data would be wildly unusual under “nothing is happening,” you reject that assumption. If the data fits comfortably, you have no case.

This is the same logic a court uses. The defendant is presumed innocent (nothing happened); the prosecution must show the evidence would be absurd if they were innocent. You never prove guilt with certainty — you show that innocence is too hard to believe.
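The same backwards logic can be sketched on a toy problem before touching the cars data. Everything below is made up for illustration and is not part of the lesson's dataset: a hypothetical coin that showed 60 heads in 100 flips, and 10,000 simulated "fair coin" worlds.

```python
import random

# Hypothetical toy example (not from the cars data): 60 heads in 100 flips.
observed_heads = 60
n_flips = 100
n_sims = 10_000

# Pretend world: the coin is fair, so "nothing is happening".
rng = random.Random(0)
as_extreme = 0
for _ in range(n_sims):
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    # Count simulated results at least as far from 50 heads as what we saw.
    if abs(heads - n_flips / 2) >= abs(observed_heads - n_flips / 2):
        as_extreme += 1

share = as_extreme / n_sims
print("Share of fair-coin worlds as extreme as ours:", share)
```

A small share means the "nothing is happening" assumption fits the data poorly; a large share means the data sits comfortably in that pretend world.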

Load the data and look at the gap we want to explain:

import pandas as pd

cars = pd.read_csv("https://datatweets.com/datasets/cars.csv")

jp = cars[cars["origin"] == "japan"]["mpg"]
us = cars[cars["origin"] == "usa"]["mpg"]

print("Japan:", len(jp), "cars, mean mpg", round(jp.mean(), 2))
print("USA:  ", len(us), "cars, mean mpg", round(us.mean(), 2))
Japan: 79 cars, mean mpg 30.45
USA:   249 cars, mean mpg 20.08

Japanese cars average 30.45 mpg and American cars 20.08 mpg. That is our observed effect, and everything below is about deciding whether to take it seriously.


Null and Alternative Hypotheses

A hypothesis test compares two competing statements about the population, not just the sample in front of you.

The null hypothesis $H_0$ is the boring claim: there is no effect, no difference, nothing to see. Here it says the true mean mpg is the same for Japanese and American cars, and the 10-mpg gap we measured is just sampling noise.

$$H_0: \mu_{\text{japan}} = \mu_{\text{usa}}$$

The alternative hypothesis $H_a$ is the interesting claim you would need evidence to support: the means really differ.

$$H_a: \mu_{\text{japan}} \neq \mu_{\text{usa}}$$

Notice the asymmetry. We never set out to prove $H_a$. We try to make $H_0$ look untenable, and if we succeed, $H_a$ is what remains. Failing to reject $H_0$ is not the same as proving it; it just means we lacked the evidence to rule it out.

The test statistic

A test statistic is a single number that summarizes the effect, computed from your sample. It is the quantity you will compare against chance. For comparing two group means, the natural choice is simply the difference in means:

observed_diff = jp.mean() - us.mean()
print(round(observed_diff, 2))
10.37

Our observed test statistic is 10.37 mpg. The whole test now reduces to one question: if $H_0$ were true, how often would chance alone hand us a difference as extreme as 10.37?


Building the Null Distribution with a Permutation Test

To answer that, we need to know what differences chance can produce when $H_0$ is true. The collection of all those chance outcomes is the null distribution. A permutation test builds it directly, with almost no assumptions.

The idea is beautifully literal. If $H_0$ is true and origin makes no difference to mpg, then the “japan” and “usa” labels are meaningless tags we could swap around freely. So we shuffle the origin labels, randomly reassigning which cars count as Japanese, recompute the difference in means, and repeat thousands of times. Each shuffle is one possible world in which the labels truly don’t matter.

Here is a single shuffle:

import numpy as np

pool = cars[cars["origin"].isin(["japan", "usa"])].copy()

rng = np.random.default_rng(0)
pool["shuffled_origin"] = rng.permutation(pool["origin"].values)

group_means = pool.groupby("shuffled_origin")["mpg"].mean()
print(round(group_means["japan"] - group_means["usa"], 3))
-0.866

With the labels scrambled, the “difference” is −0.87 mpg — close to zero, exactly as you would expect when origin carries no real information. One shuffle proves nothing; we need the full distribution. Repeat 10,000 times:

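mpg = pool["mpg"].values
n_jp = (pool["origin"] == "japan").sum()  # size of the "japan" group

rng = np.random.default_rng(0)
perm_diffs = np.empty(10000)
for i in range(10000):
    # Shuffling the mpg values is equivalent to shuffling the origin labels
    shuffled = rng.permutation(mpg)
    # The first n_jp values play the role of "japan", the rest "usa"
    perm_diffs[i] = shuffled[:n_jp].mean() - shuffled[n_jp:].mean()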

print("Largest chance difference:", round(np.abs(perm_diffs).max(), 2))
Largest chance difference: 3.46

Across 10,000 shuffled worlds, the most extreme difference chance ever produced was about 3.46 mpg — and most clustered near zero. Our real gap of 10.37 is in a different league entirely. The figure makes this unmistakable:

Figure: Histogram of 10,000 shuffled mean-mpg differences between Japan and USA (blue), centered near zero and spanning roughly minus four to plus four, with the observed difference of 10.37 marked by an orange line far to the right of every shuffled value. No shuffle comes close to the observed gap, which is what a vanishingly small p-value looks like.

The blue histogram is the world of “origin doesn’t matter.” The orange line is reality. They do not overlap.
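The shuffle loop can be wrapped in a small reusable function. This is only a sketch: the function name `permutation_diffs` is hypothetical, and the two samples are made-up stand-ins generated from a normal distribution, not the real cars values.

```python
import numpy as np

def permutation_diffs(group_a, group_b, n_perm=10_000, seed=0):
    """Return n_perm shuffled differences in means (first group minus second)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled = np.concatenate([a, b])
    n_a = len(a)
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle the pooled values, then split at the original group size.
        shuffled = rng.permutation(pooled)
        diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    return diffs

# Made-up stand-in data, NOT the real cars values.
rng = np.random.default_rng(1)
fake_a = rng.normal(30, 6, size=79)
fake_b = rng.normal(20, 6, size=249)

observed = fake_a.mean() - fake_b.mean()
null_diffs = permutation_diffs(fake_a, fake_b, n_perm=2_000)
print("Observed difference:", round(observed, 2))
print("Largest shuffled difference:", round(np.abs(null_diffs).max(), 2))
```

With the lesson's data you would pass `jp` and `us` instead of the stand-in arrays and plot `null_diffs` as the histogram.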


The P-Value and the Significance Level

The p-value is the probability, assuming $H_0$ is true, of seeing a test statistic at least as extreme as the one you observed. It puts a number on “how surprising is my data under the null?” A small p-value means your data is hard to explain by chance alone.

Figure: A bell-shaped null distribution centered at zero, with the observed test statistic marked far out on the right and the tail areas beyond it shaded orange. The p-value is that shaded tail area: the probability of a result this extreme if the null hypothesis were true.

We compute it by counting how many shuffled differences were as extreme as 10.37, in either direction:

# Two-tailed: count shuffles at least as far from zero as the observed gap
count_as_extreme = np.sum(np.abs(perm_diffs) >= abs(observed_diff))
p_value = count_as_extreme / len(perm_diffs)
print("Shuffles >= observed:", count_as_extreme)
print("p-value:", p_value)
Shuffles >= observed: 0
p-value: 0.0

Out of 10,000 shuffles, not one produced a gap as large as 10.37. Our estimated p-value is 0.0, which means the true value is below what 10,000 simulations can resolve (roughly 1 in 10,000); it is tiny but not literally zero. Chance essentially never manufactures a difference this big.
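A reported p-value of exactly 0 claims more than a finite simulation can show. One widely used convention, added here as an assumption rather than something the lesson's code does, counts the observed labeling as one more permutation so the estimate can never be zero. The sketch below uses the counts printed above.

```python
# Values taken from the lesson's output: 0 of 10,000 shuffles reached the observed gap.
count_as_extreme = 0
n_perm = 10_000

naive_p = count_as_extreme / n_perm
# Add-one convention: treat the observed labeling as one more permutation,
# so the estimate can never be exactly zero.
adjusted_p = (count_as_extreme + 1) / (n_perm + 1)

print("Naive estimate:   ", naive_p)
print("Adjusted estimate:", adjusted_p)
print("A cautious way to report it: p <", 1 / n_perm)
```

Either way the conclusion is unchanged; the adjustment only makes the reported number honest about the simulation's resolution.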

What a p-value is not

The p-value is not the probability that $H_0$ is true, and it is not the probability your result was a fluke. It is strictly the probability of data this extreme given that $H_0$ holds. Keeping that direction straight prevents most of the misunderstandings about hypothesis testing.

Choosing a significance level

Before you peek at the p-value, you fix a threshold called the significance level $\alpha$, the bar the evidence must clear. The most common choice is $\alpha = 0.05$. The decision rule is simple:

$$p < \alpha \;\Rightarrow\; \text{reject } H_0, \qquad p \geq \alpha \;\Rightarrow\; \text{fail to reject } H_0$$

Here $p \approx 0$ is far below 0.05, so we reject $H_0$. The mpg difference between Japanese and American cars is statistically significant: it is too large to credibly blame on chance. Note that $\alpha$ is a choice, not a law of nature. A stricter setting (say, medicine) might demand $\alpha = 0.01$, and you must commit to it before seeing the result.
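The decision rule is short enough to write as a function. The helper name `decide` is hypothetical; the only point of the sketch is that the comparison is strict and that the threshold is fixed before any p-value is computed.

```python
def decide(p_value, alpha=0.05):
    """Apply the rule above: reject H0 only when p is strictly below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

ALPHA = 0.05  # chosen before looking at any p-value

print(decide(0.0, ALPHA))    # the permutation estimate from the lesson
print(decide(0.05, ALPHA))   # exactly at the threshold is not below it
print(decide(0.03, 0.01))    # a stricter alpha can change the verdict for the same p
```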


One-Tailed and Two-Tailed Tests

The test above was two-tailed: our alternative was $\mu_{\text{japan}} \neq \mu_{\text{usa}}$, so a surprising result could fall on either side, with Japanese cars much higher or much lower. That is why we counted shuffles with np.abs(...), sweeping up both tails of the null distribution.

A one-tailed test asks a directional question. If your alternative were specifically $\mu_{\text{japan}} > \mu_{\text{usa}}$ (“Japanese cars get better mileage”), you would count only shuffles that reached at least +10.37, ignoring the negative tail:

$$H_a: \mu_{\text{japan}} > \mu_{\text{usa}}$$

One-tailed tests have more power to detect an effect in the predicted direction, but they come with a discipline: you must choose the direction from theory before seeing the data. Picking the tail after looking at which way the result went is a form of cheating that inflates your false-positive rate. When in doubt, use a two-tailed test — it is the more conservative, more defensible default.
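The difference between the two counts is easiest to see on a null distribution where the observed value is less extreme than in the cars example. The array and the observed value below are made up for illustration; they stand in for `perm_diffs` and `observed_diff`.

```python
import numpy as np

# Stand-in null distribution (made up): shuffled differences centered on zero.
rng = np.random.default_rng(0)
fake_perm_diffs = rng.normal(0, 1.0, size=10_000)
fake_observed = 1.8  # hypothetical observed difference, in the predicted direction

# Two-tailed: extreme in either direction.
p_two = np.mean(np.abs(fake_perm_diffs) >= abs(fake_observed))
# One-tailed (H_a: first mean > second mean): only the upper tail counts.
p_one = np.mean(fake_perm_diffs >= fake_observed)

print("two-tailed p:", p_two)
print("one-tailed p:", p_one)
```

For a roughly symmetric null distribution the one-tailed value comes out near half the two-tailed one, which is why choosing the tail after seeing the data can flip a borderline verdict.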


Type I and Type II Errors

Because a test decides under uncertainty, it can be wrong in two distinct ways. Lining them up against the truth gives four outcomes:

                       $H_0$ is actually true           $H_0$ is actually false
Reject $H_0$           Type I error (false positive)    Correct
Fail to reject $H_0$   Correct                          Type II error (false negative)

Figure: The four outcomes of a test as a two-by-two grid. A Type I error ($\alpha$) and a Type II error ($\beta$) are the two ways to be wrong (orange cells), while the green cells, including the test's power, are correct decisions.

A Type I error is a false alarm: you reject a true null and “discover” an effect that isn’t there. When $H_0$ is true, the probability of this error is $\alpha$, so choosing $\alpha = 0.05$ means accepting a 5% chance of crying wolf when nothing is happening.

A Type II error is a miss: you fail to reject a false null and overlook a real effect. Its probability is written $\beta$, and $1 - \beta$ is the test’s power, its ability to catch effects that truly exist.

The two trade off. At a fixed sample size, lowering $\alpha$ to guard against false positives makes you more skeptical, which raises $\beta$ and lets more real effects slip past. The main honest way to reduce both is to collect a larger sample. For our cars, with $p \approx 0$, the evidence against $H_0$ is overwhelming, so a false alarm is very hard to believe.
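Both error rates can be estimated by simulation. The sketch below is an illustration under stated assumptions: the data are made-up normal samples with standard deviation 1, the helper `rejection_rate` is hypothetical, and the test used is SciPy's Welch t-test (`stats.ttest_ind` with `equal_var=False`) applied row by row.

```python
import numpy as np
from scipy import stats

def rejection_rate(true_diff, n_per_group, alpha=0.05, n_sims=2_000, seed=0):
    """Share of simulated experiments in which a Welch t-test rejects H0."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment; all numbers are made up.
    a = rng.normal(true_diff, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    p_values = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue
    return np.mean(p_values < alpha)

# H0 true (no difference): every rejection is a Type I error, so the rate is near alpha.
print("Type I rate, n=20:", rejection_rate(0.0, 20))
# H0 false (difference of 0.5 sd): the rejection rate is the power, 1 - beta.
print("Power, n=20:", rejection_rate(0.5, 20))
print("Power, n=80:", rejection_rate(0.5, 80))
# A stricter alpha lowers power at the same sample size (beta goes up).
print("Power, n=20, alpha=0.01:", rejection_rate(0.5, 20, alpha=0.01))
```

The pattern to look for is the one described above: a stricter threshold costs power, and a larger sample buys it back.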


The T-Test: A Faster Parametric Counterpart

The permutation test is wonderfully assumption-light, but it ran 10,000 simulations. The t-test reaches the same kind of answer with a formula instead of brute force. It is a parametric test: it assumes the data follows a roughly normal model, and in exchange it gives you a p-value instantly.

Because our two groups have very different sizes and spreads, we use Welch’s t-test, which does not assume equal variances:

from scipy import stats

t_stat, p_value = stats.ttest_ind(jp, us, equal_var=False)
print("t-statistic:", round(t_stat, 2))
print("p-value:", p_value)
t-statistic: 13.02
p-value: 1.0109535398282052e-25

The t-statistic of 13.02 says the gap is about 13 standard errors away from zero, which is astronomically far. The p-value, $1.0 \times 10^{-25}$, agrees with the permutation test: the difference is real. Two very different methods, the same verdict, which is exactly the reassurance you want.
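To see why the t-statistic reads as "standard errors away from zero", the Welch formula can be written out by hand. This is a sketch of the formula as it is usually stated, with a hypothetical helper `welch_t` and made-up stand-in samples rather than the real cars data.

```python
import numpy as np

def welch_t(x, y):
    """Welch t-statistic and approximate degrees of freedom for two samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared standard error contributed by each group (sample variance / n).
    vx = x.var(ddof=1) / len(x)
    vy = y.var(ddof=1) / len(y)
    # Difference in means divided by its standard error.
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Made-up stand-in samples, NOT the real cars data.
rng = np.random.default_rng(0)
fake_jp = rng.normal(30, 6, size=79)
fake_us = rng.normal(20, 6.5, size=249)

t_stat, df = welch_t(fake_jp, fake_us)
print("t-statistic:", round(t_stat, 2))
print("degrees of freedom:", round(df, 1))
```

The numerator is the lesson's test statistic (the difference in means); the denominator rescales it by how much that difference would wobble from sample to sample.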

Permutation test or t-test?

Reach for the t-test when your data is roughly normal and your samples are not tiny — it is fast and standard. Reach for the permutation test when you distrust the normality assumption, have small or skewed samples, or use an unusual statistic (a median, a ratio) that has no neat formula. When both apply and they agree, your conclusion is on very solid ground.
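As an illustration of the "unusual statistic" case, the shuffle loop works unchanged when the mean is swapped for a median. The helper `permutation_p_value` is hypothetical and the skewed samples are made up (drawn from exponential distributions), not taken from the cars data.

```python
import numpy as np

def permutation_p_value(x, y, statistic, n_perm=5_000, seed=0):
    """Two-tailed permutation p-value for statistic(x) - statistic(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = statistic(x) - statistic(y)
    pooled = np.concatenate([x, y])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        diff = statistic(shuffled[:len(x)]) - statistic(shuffled[len(x):])
        # Count shuffles at least as extreme as the observed difference.
        hits += abs(diff) >= abs(observed)
    return observed, hits / n_perm

# Made-up skewed samples, NOT taken from the cars data.
rng = np.random.default_rng(42)
fake_x = rng.exponential(scale=3.0, size=30)
fake_y = rng.exponential(scale=1.0, size=30)

obs_median_diff, p_median = permutation_p_value(fake_x, fake_y, np.median)
print("Observed difference in medians:", round(obs_median_diff, 2))
print("Permutation p-value:", p_median)
```

Passing `np.mean` instead of `np.median` recovers the difference-in-means test from earlier in the lesson.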


Two Caveats That Matter More Than the Math

A test can be technically correct and still mislead you. Guard against these two traps.

Statistical significance is not practical importance. A small p-value tells you an effect is unlikely to be chance alone, not that it is large. With a big enough sample, a trivial half-mpg difference can come out “highly significant” while meaning nothing to a car buyer. Always report the effect size (here, a 10.37-mpg gap) alongside the p-value, and ask whether it actually matters in the real world.
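The half-mpg scenario can be simulated. The numbers below are invented for illustration (200,000 hypothetical cars per group, a true gap of 0.5 mpg), and the p-value uses a large-sample normal approximation, a z-test stand-in for the lesson's t-test that is reasonable at this sample size.

```python
import math
import numpy as np

# Made-up data: a true gap of only 0.5 mpg, but 200,000 cars per group.
rng = np.random.default_rng(0)
n = 200_000
group_a = rng.normal(25.5, 6.0, size=n)
group_b = rng.normal(25.0, 6.0, size=n)

diff = group_a.mean() - group_b.mean()
std_err = math.sqrt(group_a.var(ddof=1) / n + group_b.var(ddof=1) / n)
z = diff / std_err
# Large-sample normal approximation to the two-tailed p-value.
p_approx = math.erfc(abs(z) / math.sqrt(2))

print("Effect size (difference in means):", round(diff, 2), "mpg")
print("z-statistic:", round(z, 1))
print("approximate p-value:", p_approx)
```

The p-value is minuscule while the effect stays around half an mpg, which is the gap between "significant" and "important".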

Beware p-hacking. If you test 20 unrelated comparisons at $\alpha = 0.05$ and none of the effects is real, you should still expect about one to come up “significant” purely by chance. Hunting through many tests, subgroups, or variable combinations until something crosses 0.05, then reporting only that one, manufactures false discoveries. The honest practice is to decide your hypothesis and your $\alpha$ before you look, and to be upfront about how many tests you ran.
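The arithmetic behind the 20-comparison example is worth writing out. The sketch assumes every null hypothesis is true and, for the second number, that the 20 tests are independent.

```python
alpha = 0.05
n_tests = 20

# If every null hypothesis is true, each test has probability alpha of a false positive.
expected_false_positives = n_tests * alpha
# Assuming independent tests, the chance that at least one comes up "significant":
p_at_least_one = 1 - (1 - alpha) ** n_tests

print("Expected false positives:", expected_false_positives)
print("P(at least one false positive):", round(p_at_least_one, 3))
```

So the expected count is one, and the chance of at least one false alarm is about 64%, far above the 5% that a single test promises.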

The cardinal sin

Choosing your hypothesis, your tail, or your significance level after seeing the data turns a rigorous test into a rationalization. Commit to all three in advance. A test only protects you if it could have come out against you.


Practice Exercises

Exercise 1: Test Europe against the USA

Repeat the analysis comparing European cars to American ones. Compute the difference in mean mpg, then run a Welch t-test. Is the difference statistically significant at $\alpha = 0.05$? Is it larger or smaller than the Japan–USA gap?

Hint

Filter with eu = cars[cars["origin"] == "europe"]["mpg"] and reuse us from the lesson. The difference is eu.mean() - us.mean(), and the test is stats.ttest_ind(eu, us, equal_var=False). You should find a smaller gap of about 7.8 mpg that is still very significant.

Exercise 2: Make it one-tailed

Using the perm_diffs array from the permutation test, compute a one-tailed p-value for the alternative $\mu_{\text{japan}} > \mu_{\text{usa}}$. How does it compare to the two-tailed value, and why?

Hint

Drop the absolute value and count only shuffles that reached the observed difference: np.mean(perm_diffs >= observed_diff). With no shuffle anywhere near 10.37, the one-tailed p-value is also essentially 0 — the same conclusion, because all the action is in one tail.

Exercise 3: A different test statistic

Test whether Japanese and American cars differ in weight instead of mpg. Run a Welch t-test on the weight column for the two groups and interpret the sign of the t-statistic.

Hint

Use cars[cars["origin"] == "japan"]["weight"] and the USA equivalent, then stats.ttest_ind(jp_w, us_w, equal_var=False). A large negative t-statistic (around −18) means Japanese cars are significantly lighter, which helps explain their better mileage.


Summary

You learned the core logic of inference: assume the null hypothesis $H_0$ (that nothing is happening) and ask how surprising your data would be in that world. You chose a test statistic (the difference in means), built a null distribution with a permutation test by shuffling the origin labels, and read off a p-value by counting how often chance matched your result. For the cars data, chance never came close to the observed 10.37-mpg gap ($p \approx 0$), so we rejected $H_0$ at $\alpha = 0.05$. A Welch t-test confirmed it in one line ($t \approx 13.0$, $p \approx 10^{-25}$). Along the way you met one- vs two-tailed tests, Type I and Type II errors, and two warnings: significance is not importance, and p-hacking fabricates discoveries.

Key Concepts

  • Null hypothesis ($H_0$): the default claim that there is no effect or difference.
  • Alternative hypothesis ($H_a$): the claim you need evidence to support.
  • Test statistic: a single number summarizing the effect (here, the difference in means).
  • Null distribution: the spread of test-statistic values chance produces when $H_0$ is true.
  • Permutation test: building that distribution by repeatedly shuffling group labels.
  • P-value: the probability, under $H_0$, of data at least as extreme as observed.
  • Significance level ($\alpha$): the pre-chosen threshold a p-value must beat to reject $H_0$.
  • Type I / Type II error: a false positive (rejecting a true $H_0$) / a false negative (missing a real effect).

Why This Matters

Hypothesis testing is how a result graduates from “interesting in this sample” to “trustworthy about the world.” Whether you are evaluating an A/B test, a drug trial, or a model’s improvement, the same questions apply: what is the null, what is the statistic, how surprising is the data, and is the effect big enough to care about? Master this and you stop guessing whether a difference is real and start measuring it.


Next Steps

Continue to Lesson 3 - Chi-Squared Goodness of Fit

Move from comparing means to testing whether counts match an expected pattern, using the chi-squared distribution.


Continue Building Your Skills

You can now take a difference, assume it is nothing, and let simulation tell you whether to believe it — the heart of every honest claim in data work. Next you will extend that same surprise-under-the-null logic to categories and counts, asking not “are these means different?” but “does this pattern of frequencies fit what I expected?”