Lesson 5 - Guided Project: Do Japanese and American Cars Really Differ?
Welcome to the Guided Project
You’re settling a bar argument with statistics. Someone insists that the Japanese cars of the 1970s and 80s were simply more fuel-efficient than American ones; someone else says it’s a stereotype, that the few extra miles per gallon could be chance or cherry-picked memory. You have data, and you have a full inference toolkit. Time to settle it for real.
Using everything from this module — confidence intervals, a permutation test, a t-test, an effect size, and a chi-squared test of independence — you’ll investigate whether 1970s-80s Japanese and American cars genuinely differed in fuel economy, or whether the gap could be random noise. Then you’ll bring in a second question: is engine size tied to where a car was built? And because this is real-world, observational data, you’ll close by being honest about what these tests can and cannot prove.
By the end of this lesson, you will be able to:
- Estimate each group’s true mean with a confidence interval and read whether two intervals overlap
- State a null and alternative hypothesis and test a difference of means with a permutation test
- Confirm that result with a Welch t-test and quantify the size of the effect with Cohen’s d
- Use a chi-squared test of independence to test whether two categorical variables are related
- Draw conclusions from observational data while naming the confounders that limit them
You only need pandas, numpy, and scipy.stats. Let’s settle it.
Step 1: Explore the Two Groups
Start by loading the cars and isolating the two origins under debate. The dataset records fuel economy in mpg along with engine and weight details for cars from the USA, Japan, and Europe; we’ll set Europe aside and focus the argument on Japan versus the USA.
import pandas as pd
import numpy as np

cars = pd.read_csv("https://datatweets.com/datasets/cars.csv")
jp = cars[cars["origin"] == "japan"]["mpg"].dropna()
us = cars[cars["origin"] == "usa"]["mpg"].dropna()
print(f"Japanese: n={len(jp)} mean={jp.mean():.2f} median={jp.median():.1f}")
print(f"American: n={len(us)} mean={us.mean():.2f} median={us.median():.1f}")
print(f"Difference in means: {jp.mean() - us.mean():.2f} mpg")

Japanese: n=79 mean=30.45 median=31.6
American: n=249 mean=20.08 median=18.5
Difference in means: 10.37 mpg

The headline gap is large. The 79 Japanese cars average 30.45 mpg; the 249 American cars average 20.08 mpg, a difference of 10.37 mpg. The medians tell the same story (31.6 vs 18.5), so this isn’t a handful of outliers dragging a mean around. But a difference in two sample means is not yet proof of a difference in the populations they came from. The first inferential move is to put honest error bars on each estimate.
A confidence interval turns each sample mean into a range that, with 95% confidence, contains the group’s true mean. Build one for each origin:
from scipy import stats
def mean_ci(x, conf=0.95):
    m = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))          # standard error of the mean
    t = stats.t.ppf((1 + conf) / 2, len(x) - 1)   # two-sided critical t value
    return m - t * se, m + t * se

print("Japanese 95% CI:", tuple(round(v, 2) for v in mean_ci(jp)))
print("American 95% CI:", tuple(round(v, 2) for v in mean_ci(us)))

Japanese 95% CI: (29.09, 31.82)
American 95% CI: (19.28, 20.88)

Now the picture sharpens. The plausible range for true Japanese mpg is 29.09 to 31.82; for American mpg it is 19.28 to 20.88. These two intervals don’t just fail to overlap: there’s a roughly 8 mpg no-man’s-land between them where neither group’s true mean is likely to sit.
When two 95% confidence intervals are this far from touching, a formal test is almost a formality — but “almost” isn’t “is,” so let’s run it properly.
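As a sanity check on the hand-written interval, the same range can be obtained from scipy's t-distribution helper. The sketch below is illustrative only: it uses a synthetic sample (made-up values, not the cars data) and re-declares `mean_ci` so it runs on its own.

```python
import numpy as np
from scipy import stats

def mean_ci(x, conf=0.95):
    """t-based confidence interval for the mean, as defined in this step."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))
    t = stats.t.ppf((1 + conf) / 2, len(x) - 1)
    return m - t * se, m + t * se

# Synthetic mpg-like values, for illustration only (NOT the cars dataset).
rng = np.random.default_rng(0)
sample = rng.normal(loc=30, scale=6, size=79)

manual = mean_ci(sample)
# stats.t.interval takes the confidence level first, then df, loc and scale.
builtin = stats.t.interval(0.95, len(sample) - 1,
                           loc=sample.mean(), scale=stats.sem(sample))

print("manual :", manual)
print("builtin:", builtin)
```

If the two tuples agree, the formula in `mean_ci` is doing what it should. `stats.sem` uses `ddof=1` by default, which matches the `x.std(ddof=1)` in the manual version.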
Non-overlapping intervals are suggestive, not a test
Two confidence intervals that don’t overlap are strong evidence of a real difference, but the overlap rule is a rough heuristic, not a hypothesis test. Intervals can even overlap slightly while the difference is still significant. We use the intervals to build intuition, then test the difference directly in the next steps.
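To see why the overlap rule is only a heuristic, here is a small sketch with hypothetical summary statistics (the means, standard deviation, and sample size are invented for illustration). The two 95% intervals overlap, yet a Welch t-test on the same numbers is significant at the 5% level.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics, chosen only to illustrate the point.
mean_a, mean_b = 10.0, 11.0
sd, n = 2.0, 50

se = sd / np.sqrt(n)
t_crit = stats.t.ppf(0.975, n - 1)
ci_a = (mean_a - t_crit * se, mean_a + t_crit * se)
ci_b = (mean_b - t_crit * se, mean_b + t_crit * se)
overlap = ci_a[1] > ci_b[0]   # True when the intervals share some range

# Welch t-test computed directly from the summary statistics.
res = stats.ttest_ind_from_stats(mean_a, sd, n, mean_b, sd, n, equal_var=False)
t_stat, p_val = res.statistic, res.pvalue

print(f"CI A: ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"CI B: ({ci_b[0]:.3f}, {ci_b[1]:.3f})")
print(f"Intervals overlap: {overlap}")
print(f"Welch t = {t_stat:.2f}, p = {p_val:.4f}")
```

The reason is that the standard error of a difference is the square root of the sum of the two squared standard errors, which is smaller than the sum of the two interval half-widths. Checking for overlap is therefore a stricter bar than the test itself.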
Step 2: State Hypotheses and Run a Permutation Test
Every hypothesis test begins by writing down what you’re testing. We frame the argument as two competing claims about the true difference in mean mpg:
- Null hypothesis : there is no real difference; Japanese and American cars have the same true mean mpg, and the 10.37 we observed is just an accident of which cars landed in each group.
- Alternative hypothesis : there is a real difference in true mean mpg between the two origins.
A permutation test turns the null hypothesis into a simulation you can actually run. If the labels “Japanese” and “American” were meaningless, then shuffling them onto the same 328 mpg values at random should produce a difference of means just as extreme as 10.37 reasonably often. So we pool every car’s mpg, repeatedly reshuffle the origin labels, and record the difference each time. The share of shuffles that match or beat our observed gap is the p-value.
observed = jp.mean() - us.mean()
pool = np.concatenate([jp.values, us.values])
n_jp = len(jp)
rng = np.random.default_rng(7)
n_iter = 20000
diffs = np.empty(n_iter)
for i in range(n_iter):
    rng.shuffle(pool)  # shuffle in place: breaks any link between mpg and origin
    diffs[i] = pool[:n_jp].mean() - pool[n_jp:].mean()

# two-sided p-value: share of shuffles at least as extreme as the observed gap
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"Observed difference: {observed:.2f} mpg")
print(f"Largest difference in 20,000 shuffles: {np.abs(diffs).max():.2f} mpg")
print(f"Permutation p-value: {p_value:.4f}")

Observed difference: 10.37 mpg
Largest difference in 20,000 shuffles: 3.97 mpg
Permutation p-value: 0.0000

This is about as decisive as a result gets. Across 20,000 random shuffles, the biggest difference chance ever produced was 3.97 mpg, and our real gap is 10.37. Not one shuffled world came anywhere near it, so the printed p-value is 0.0000. Strictly, a simulation with 20,000 shuffles can only show that p is below 1 in 20,000 (0.00005), not that it is exactly zero. The figure makes the verdict visual:
When the observed effect lands completely off the edge of the null distribution, we reject the null hypothesis: the difference in fuel economy is real, not an artifact of how the cars happened to be grouped.
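The shuffle loop can be wrapped into a reusable function. The version below is a minimal sketch, not the lesson's own code: the function name `permutation_p_value`, its defaults, and the synthetic inputs are assumptions for illustration. It also applies the common add-one correction, so the reported p-value is never exactly zero.

```python
import numpy as np

def permutation_p_value(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pool = np.concatenate([a, b])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(pool)  # shuffled copy; pool itself is untouched
        diff = shuffled[:len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction: treats the observed labelling as one of the permutations.
    return (hits + 1) / (n_iter + 1)

# Synthetic groups for illustration only (NOT the cars dataset).
rng = np.random.default_rng(1)
far_apart = permutation_p_value(rng.normal(30, 6, 79), rng.normal(20, 6, 249), n_iter=2000)
same_pop = permutation_p_value(rng.normal(20, 6, 79), rng.normal(20, 6, 249), n_iter=2000)

print(f"Groups with a 10-unit gap: p = {far_apart:.4f}")
print(f"Groups from one population: p = {same_pop:.4f}")
```

With the correction, the smallest p-value the function can return is 1 / (n_iter + 1), which makes the resolution limit of the simulation explicit.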
Step 3: Confirm with a t-Test and Measure the Effect Size
A permutation test makes almost no assumptions, which is its strength. A t-test asks the same question through a formula instead of a simulation, and agreement between the two is a good cross-check. Because the groups have different sizes and different spreads, we use Welch’s t-test, which doesn’t assume equal variances:
t_stat, p_t = stats.ttest_ind(jp, us, equal_var=False)
print(f"Welch t = {t_stat:.2f}")
print(f"p-value = {p_t:.2e}")

Welch t = 13.02
p-value = 1.01e-25

The two methods agree emphatically. A t-statistic of 13.02 means the gap is thirteen standard errors wide, and the p-value of 1.01 × 10⁻²⁵ is a decimal point followed by two dozen zeros before the first real digit. Both tests point the same way: the difference is not chance.
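For readers who want to see what the formula inside Welch's test looks like, here is a sketch that computes the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with scipy. The helper name `welch_t` and the synthetic samples are assumptions for illustration, not part of the lesson's dataset.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's t statistic, degrees of freedom, and two-sided p-value (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / len(a)   # squared standard error of mean(a)
    vb = b.var(ddof=1) / len(b)   # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Synthetic samples with unequal sizes and spreads (illustration only).
rng = np.random.default_rng(3)
a = rng.normal(22, 6, 40)
b = rng.normal(20, 7, 60)

t_manual, df_manual, p_manual = welch_t(a, b)
ref = stats.ttest_ind(a, b, equal_var=False)

print(f"manual: t = {t_manual:.4f}, df = {df_manual:.1f}, p = {p_manual:.4g}")
print(f"scipy : t = {ref.statistic:.4f}, p = {ref.pvalue:.4g}")
```

The statistic is simply the difference in means divided by its standard error, which is why a value of 13 can be read as "thirteen standard errors wide".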
But “statistically significant” only answers whether there’s a difference, not how big it is. With 328 cars, even a trivial gap could clear the significance bar. To measure the size of the effect on a scale that doesn’t depend on sample size, compute Cohen’s d — the difference in means expressed in pooled standard deviations:
n1, n2 = len(jp), len(us)
pooled_sd = np.sqrt(((n1 - 1) * jp.var(ddof=1) +
                     (n2 - 1) * us.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (jp.mean() - us.mean()) / pooled_sd
print(f"Pooled SD: {pooled_sd:.2f} mpg")
print(f"Cohen's d: {cohens_d:.2f}")

Pooled SD: 6.31 mpg
Cohen's d: 1.64

A Cohen’s d of 1.64 is a large effect by any convention: anything above 0.8 is considered large, and we’re more than double that. In plain terms, the average Japanese car sits about 1.6 standard deviations of mpg above the average American car. This isn’t a difference you’d need a microscope and a huge sample to detect; it’s a gap you could practically see by eye in a 1980s parking lot. Both the significance and the magnitude check out.
Significance and size answer different questions
A p-value tells you whether an effect is distinguishable from chance; an effect size tells you whether it’s big enough to care about. A massive sample can make a tiny, meaningless difference “significant,” so always report both. Here they happen to agree — the effect is both real and large — but in your own work they often won’t, and the effect size is usually what matters for a decision.
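The point about large samples can be demonstrated with simulated data. In the sketch below, two synthetic groups of 100,000 values differ by only 0.05 units on a scale where the standard deviation is 1. All numbers are invented for illustration, and `cohens_d` is a hypothetical helper that mirrors the pooled-SD formula used in this step.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * np.var(a, ddof=1) +
                  (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Synthetic data: a tiny true difference, but a very large sample.
rng = np.random.default_rng(42)
a = rng.normal(20.05, 1.0, 100_000)
b = rng.normal(20.00, 1.0, 100_000)

t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)
d = cohens_d(a, b)

print(f"p-value   = {p_val:.2e}")   # far below 0.05
print(f"Cohen's d = {d:.3f}")       # far below the 0.2 'small' convention
```

The test flags the difference as significant, while the effect size shows it is negligible in practical terms. This is the situation the note above warns about.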
Step 4: A Second Question — Is Engine Size Tied to Origin?
We’ve settled the mpg argument. But why were the Japanese cars so much thriftier? A natural suspect is engine size: more cylinders generally means more fuel burned. If Japanese and American cars carried systematically different engines, that would be a clue. This is a question about two categorical variables — origin and number of cylinders — so the right tool is the chi-squared test of independence.
Start with a cross-tabulation of how many cars of each origin have each cylinder count:
ct = pd.crosstab(cars["origin"], cars["cylinders"])
print(ct)

cylinders  3   4  5   6    8
origin
europe     0  63  3   4    0
japan      4  69  0   6    0
usa        0  72  0  74  103

The pattern is stark even before any test. Japanese cars cluster overwhelmingly at 4 cylinders and never appear with 8. American cars spread across 4, 6, and 8 cylinders, including 103 eight-cylinder cars, an engine size that no Japanese or European car in the data has. The chi-squared test asks whether this lopsidedness is more than we’d expect if origin and cylinder count were unrelated:
chi2, p_chi, dof, expected = stats.chi2_contingency(ct)
print(f"chi-squared = {chi2:.1f}")
print(f"degrees of freedom = {dof}")
print(f"p-value = {p_chi:.2e}")

chi-squared = 180.1
degrees of freedom = 8
p-value = 9.80e-35

A chi-squared statistic of 180.1 on 8 degrees of freedom gives a p-value of 9.80 × 10⁻³⁵, so engine size and origin are very much not independent. The heatmap shows exactly where the relationship lives:
So we have two confirmed findings: Japanese cars got meaningfully better mpg, and origin is strongly tied to engine size, with American cars skewing toward the thirsty 8-cylinder engines. It’s tempting to stitch these into a tidy causal story — but that’s exactly the moment to slow down.
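To make the chi-squared statistic less of a black box, the sketch below rebuilds it from the counts in the cross-tabulation printed above: expected counts under independence are (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The counts are typed in by hand from that output; the check on small expected counts reflects a common rule of thumb (expected counts of at least 5), included here as a caution and not as part of the lesson's analysis.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the cross-tabulation in this step
# (rows: europe, japan, usa; columns: 3, 4, 5, 6, 8 cylinders).
observed = np.array([
    [0, 63, 3,  4,   0],
    [4, 69, 0,  6,   0],
    [0, 72, 0, 74, 103],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# Cross-check against scipy's implementation.
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

# Rule-of-thumb check: how many cells have an expected count below 5?
small_cells = int((expected < 5).sum())

print(f"manual: chi2 = {chi2_manual:.1f}, dof = {dof}, p = {p_manual:.2e}")
print(f"scipy : chi2 = {chi2_scipy:.1f}, dof = {dof_scipy}, p = {p_scipy:.2e}")
print(f"cells with expected count < 5: {small_cells} of {observed.size}")
```

The sparse 3- and 5-cylinder columns are what produce the small expected counts, so the exact p-value should be read as approximate. With a statistic this large, dropping or merging those columns is unlikely to change the conclusion, but it is worth re-running the test that way to confirm.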
Step 5: Conclusions, With Caveats
Here’s what the data genuinely supports:
- The mpg difference is real and large. A permutation test (p ≈ 0) and a Welch t-test (p ≈ 1.01 × 10⁻²⁵) both reject chance, and Cohen’s d ≈ 1.64 says the gap is big, not just detectable.
- Engine size is strongly associated with origin (chi-squared p ≈ 9.80 × 10⁻³⁵), with American cars carrying the big 8-cylinder engines that Japanese cars in this data never used.
And here’s what it does not prove — the part most analyses skip:
- This is observational data, not an experiment. Nobody randomly assigned cars to an origin. The two groups differ in many ways at once, so we can’t cleanly credit any single cause.
- Confounders are everywhere. Engine size, vehicle weight, and model year all move together and all affect mpg. American cars in this era were heavier and built around larger engines; “origin” is entangled with all of it. The mpg gap is almost certainly partly an engine-and-weight gap wearing an origin label.
- Correlation is not causation. We’ve shown that origin and mpg are related, and that origin and cylinders are related. We have not shown that being Japanese-built causes better mileage in some intrinsic way. The honest statement is that these design choices traveled together in the 1970s-80s fleet.
- It’s a snapshot of one era. These conclusions describe 1970s-80s cars in this dataset. They say nothing about today’s vehicles, where the picture has changed completely.
That’s the discipline inference demands: state the effect, quantify your confidence, and then name out loud the things your data can’t rule out. A result you can defend with its limits attached is worth more than an overconfident headline.
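One simple way to probe a suspected confounder is to compare the groups within each level of it. The sketch below uses a tiny hand-made table (toy numbers, not the cars data) and a hypothetical helper, `gap_by_stratum`, to show how an overall gap can shrink once cars are compared like for like. The column names mirror those used in this lesson.

```python
import pandas as pd

def gap_by_stratum(df, stratum_col, group_col, value_col, group_a, group_b):
    """Mean of group_a minus mean of group_b within each level of stratum_col.

    Levels where either group has no rows are dropped.
    """
    means = df.groupby([stratum_col, group_col])[value_col].mean().unstack(group_col)
    means = means.dropna(subset=[group_a, group_b])
    return means[group_a] - means[group_b]

# Toy data for illustration only (NOT the cars dataset).
toy = pd.DataFrame({
    "origin":    ["japan"] * 5 + ["usa"] * 7,
    "cylinders": [4, 4, 4, 4, 6,   4, 4, 6, 8, 8, 8, 8],
    "mpg":       [30, 32, 31, 29, 22,   28, 29, 20, 14, 15, 13, 16],
})

overall_gap = (toy.loc[toy["origin"] == "japan", "mpg"].mean()
               - toy.loc[toy["origin"] == "usa", "mpg"].mean())
gaps = gap_by_stratum(toy, "cylinders", "origin", "mpg", "japan", "usa")

print(f"Overall gap: {overall_gap:.2f} mpg")
print("Gap within each cylinder count:")
print(gaps)
```

In this toy table the overall gap is about 9.5 mpg, but within the 4- and 6-cylinder groups it is 2.0 mpg each, because most of the raw difference comes from one group having 8-cylinder cars and the other having none. Whether the real cars data behaves this way is an empirical question this sketch does not answer.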
Take It Further
You’ve settled the original argument; now pressure-test it with the same toolkit:
- Control for a confounder. Filter to only 4-cylinder cars and re-run the mpg comparison between Japan and the USA. If the gap shrinks a lot, engine size was carrying much of the difference; if it survives, origin is doing more on its own.
- Bring weight into it. Compare the mean weight of Japanese and American cars with their own confidence intervals. How much of the mpg story is really a weight story?
- Test a third group. Run the permutation test and t-test on Japan versus Europe instead of the USA. Is the difference still significant? Is the effect size as large?
- Quantify the uncertainty in the gap itself. Bootstrap a confidence interval directly for the difference in means (resample each group with replacement, take the difference, repeat). Does it exclude zero, and by how much?
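The bootstrap recipe in the last item (resample each group with replacement, take the difference, repeat) can be sketched as follows. This is a minimal percentile bootstrap under stated assumptions: the function name `bootstrap_diff_ci`, its defaults, and the synthetic inputs are illustrative, and the real exercise would pass in the `jp` and `us` series from Step 1.

```python
import numpy as np

def bootstrap_diff_ci(a, b, n_boot=5000, conf=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group separately, with replacement, at its own size.
        ra = rng.choice(a, size=len(a), replace=True)
        rb = rng.choice(b, size=len(b), replace=True)
        diffs[i] = ra.mean() - rb.mean()
    alpha = (1 - conf) / 2
    lo, hi = np.quantile(diffs, [alpha, 1 - alpha])
    return lo, hi

# Synthetic stand-ins for the two groups (illustration only, NOT the cars data).
rng = np.random.default_rng(5)
a = rng.normal(30, 6.0, 79)
b = rng.normal(20, 6.4, 249)

lo, hi = bootstrap_diff_ci(a, b)
print(f"Observed difference: {a.mean() - b.mean():.2f}")
print(f"95% bootstrap CI for the difference: ({lo:.2f}, {hi:.2f})")
```

An interval that sits entirely above zero says the same thing as the hypothesis tests, with the added benefit of showing a plausible range for the size of the gap.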
Summary
You settled a real argument with statistics. You estimated each origin’s true mpg with a confidence interval and saw the two ranges sit far apart; you stated explicit hypotheses and ran a permutation test that no shuffle could match (p ≈ 0); you confirmed it with a Welch t-test (t = 13.02) and measured a large effect with Cohen’s d ≈ 1.64. Then you opened a second question and used a chi-squared test of independence to show engine size is strongly tied to origin (p ≈ 9.80 × 10⁻³⁵). Finally, you did the hardest part: you named the confounders and refused to overclaim causation from observational data.
Key Concepts
- Confidence interval — a range that, with stated confidence, contains a group’s true mean; non-overlapping intervals suggest a real difference.
- Permutation test — simulates the null by shuffling labels, then asks how often chance matches the observed effect.
- Welch t-test — a formula-based test of a difference in means that allows unequal variances and group sizes.
- Effect size (Cohen’s d) — the difference in means in standard-deviation units; measures how big, independent of sample size.
- Chi-squared test of independence — tests whether two categorical variables are related, using a contingency table.
- Confounding — a third variable (weight, era, engine size) entangled with both your groups and your outcome, blocking causal claims from observational data.
Why This Matters
This project is the whole module doing real work at once. Estimating with confidence intervals, separating signal from noise with hypothesis tests, sizing the effect, and testing a relationship between categories — these are the exact moves behind A/B test readouts, medical trial summaries, and every “is this real?” question an analyst is paid to answer. And the caveats matter as much as the tests: the analysts people trust are the ones who quantify their uncertainty and admit what their data can’t prove.
Next Steps
Continue to Machine Learning — put your statistics foundation to work
You can describe data, quantify uncertainty, and test claims. Next, use that foundation to build models that learn from data.
Back to the course overview
Return to the Statistics & Probability course and revisit any module.
Congratulations — You’ve Finished the Course
Take a moment, because you just closed a real arc. You started by describing data — means, medians, spread, distributions — learning to summarize a pile of numbers into something a person can hold in their head. You moved on to probability, the grammar of chance, and learned to reason about uncertainty instead of fearing it. Then in this final module you put the two together into inference: using a sample to make claims about a population, putting honest error bars on your estimates, and separating a genuine effect from random noise.
That progression — describe it, quantify the uncertainty, test the claim — is the backbone of every serious analysis you’ll ever do. You can now look at a difference and answer the question that actually matters: is it real, how big is it, and how sure are you? Carry the last habit from this project with you above all the others: state your result, attach its limits, and never let a tidy story outrun your data. Well done — and onward to machine learning, where this foundation is exactly what makes the models make sense.