Lesson 6 - Guided Project: Profiling Fuel Economy

Welcome to the Guided Project

The year is 1983. You’re a data analyst at an auto magazine, and your editor has handed you a spreadsheet of every car the magazine has reviewed since 1970 — fuel economy, weight, engine size, and where each model was built. The question on the cover this month is simple: how has fuel economy changed, and who builds the most efficient cars?

You don’t get to reach for regression or machine learning. You have only the descriptive tools from this module — means, medians, and modes; range, variance, and standard deviation; and z-scores. That turns out to be plenty. A good analyst can answer the editor’s question with a handful of summaries and three honest charts, and that is exactly what you’ll build here.

By the end of this lesson, you will be able to:

  • Summarize a real variable with center and spread, and read the story its skew tells
  • Compute and visualize a year-by-year trend from grouped means
  • Compare groups fairly using means, medians, and standard deviations side by side
  • Use z-scores to find the genuine standouts in a dataset

You only need pandas, plus matplotlib for the charts. Every technique comes straight from the earlier lessons in this module. Let’s begin.


Step 1: Load and Inspect the Data

Start by loading the cars and taking the measure of what you have.

import pandas as pd

cars = pd.read_csv("https://datatweets.com/datasets/cars.csv")
print(cars.shape)
print(cars["origin"].value_counts())
(398, 9)
origin
usa       249
japan      79
europe     70
Name: count, dtype: int64

There are 398 cars, each described by nine columns: mpg, cylinders, displacement, horsepower, weight, acceleration, model_year, origin, and name. The origin column splits the fleet three ways: 249 American, 79 Japanese, and 70 European cars. That imbalance matters. US cars make up about 63% of the rows (249 of 398), so any “overall” number is largely an American number wearing a disguise. We’ll keep that in mind when we compare regions later.

The model_year column runs from 70 to 82 (read it as 1970–1982), which is the timeline our trend will follow. Before summarizing, check for missing values:

print(cars[["mpg", "horsepower", "weight"]].isna().sum())
mpg           0
horsepower    6
weight        0
dtype: int64

Only horsepower has gaps: six cars whose engine power was never recorded. Our profile leans on mpg, weight, and model_year, all of which are complete (the check above shows this for mpg and weight; adding model_year to the same isna() call lets you verify it too), so we can work with the full 398 rows. We’ll only worry about the missing horsepower if a calculation actually touches that column.
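To see why the gaps are harmless until a calculation touches them, here is a minimal sketch of how pandas treats missing values. The tiny frame is made up for illustration and is not taken from cars.csv.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for cars; the values are illustrative only.
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 22.0],
    "horsepower": [130.0, np.nan, 65.0, np.nan],
})

# Count the gaps per column, as in the check above.
missing = toy.isna().sum()

# Reductions such as mean() skip NaN by default (skipna=True),
# so this averages only the two recorded horsepower values.
hp_mean = toy["horsepower"].mean()

# To make the exclusion explicit, drop the rows that lack this one column.
complete = toy.dropna(subset=["horsepower"])

print(missing)
print(hp_mean, len(complete))
```

Dropping rows only for the column you are about to use keeps every other summary on the full set of rows.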

Treat the file as the whole fleet

These 398 cars are a sample of every car ever made, but for this project we treat them as the complete population the magazine reviewed. That lets us describe them directly — every mean and standard deviation here is a fact about this fleet, not an estimate of a larger one.
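One practical wrinkle with the population framing: pandas’ Series.std() divides by n - 1 by default (ddof=1), which is the sample formula. The sketch below, on illustrative values that are not from cars.csv, shows how to ask for the population version with ddof=0. The lesson’s printed outputs appear to use the default, and with 398 rows the two differ by only about a tenth of a percent.

```python
import pandas as pd

# Illustrative values, not taken from cars.csv.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std()            # pandas default: ddof=1, divides by n - 1
population_std = values.std(ddof=0)  # population formula, divides by n

print(round(sample_std, 3), round(population_std, 3))
```

Whichever you choose, use the same one throughout so that z-scores and group comparisons stay consistent.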


Step 2: Center — Mean vs Median, and the Skew Story

The first thing the editor wants is a single headline number: how efficient is the typical car? This is exactly the mean-versus-median question from earlier in the module, and the cars dataset shows why you should never report just one.

print("mean   ", round(cars["mpg"].mean(), 2))
print("median ", round(cars["mpg"].median(), 1))
print("mode   ", cars["mpg"].mode().tolist())
mean    23.51
median  23.0
mode    [13.0]

The mean fuel economy is 23.51 mpg and the median is 23.0 mpg — close, but the mean sits a little higher. The mode is 13.0 mpg, a value far below both. That gap between a low mode and a higher mean is the fingerprint of a right-skewed distribution: a thick cluster of thirsty, low-mpg cars on the left, with a thinner tail of very efficient cars stretching to the right.
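The same ordering can be reproduced on a handful of numbers. This is a small sketch with made-up mpg values (a thick low cluster and a thin high tail), not data from the file.

```python
import pandas as pd

# Made-up mpg values: many thirsty cars, a few very efficient ones.
toy_mpg = pd.Series([13, 13, 14, 15, 18, 20, 23, 26, 30, 44], dtype=float)

mean = toy_mpg.mean()
median = toy_mpg.median()
modes = toy_mpg.mode().tolist()   # mode() returns every tied value, hence a list
skew = toy_mpg.skew()

# In this right-skewed toy series the three line up as mode < median < mean.
print(modes, median, mean, round(skew, 2))
```

The ordering is a common symptom of right skew rather than a guarantee, which is why the lesson confirms it with skew() next.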

Confirm the skew directly and look at the quartiles:

print("skew    ", round(cars["mpg"].skew(), 2))
print(cars["mpg"].describe()[["min", "25%", "50%", "75%", "max"]].round(1))
skew     0.46
min       9.0
25%      17.5
50%      23.0
75%      29.0
max      46.6

The skew of +0.46 is positive, confirming the right-leaning tail. Read the five-number summary as a story: the most fuel-hungry car manages just 9 mpg, the thriftiest stretches to 46.6 mpg, and the middle half of the fleet lives between 17.5 and 29 mpg. The distance from the median up to the max (23.6) is far longer than from the median down to the min (14.0) — that long upper reach is the skew pulling the mean above the median.
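The reach comparison is plain subtraction on the five-number summary. A short sketch, using the values printed above:

```python
# Five-number summary values copied from the output above.
mpg_min, q1, median, q3, mpg_max = 9.0, 17.5, 23.0, 29.0, 46.6

lower_reach = median - mpg_min   # distance from the median down to the minimum
upper_reach = mpg_max - median   # distance from the median up to the maximum
iqr = q3 - q1                    # width of the middle half of the fleet

print(round(lower_reach, 1), round(upper_reach, 1), iqr)
```

The upper reach is well over one and a half times the lower reach, which is the lopsidedness the skew statistic summarizes in a single number.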

For a skewed variable like this, the median is the more honest “typical” value, because the handful of ultra-efficient cars drags the mean upward. We’ll report 23 mpg as the typical figure but keep the mean handy, since it’s what most of the comparisons below will use.


Step 3: The Trend Over Model Year

Now the heart of the story. Did fuel economy actually improve across the 1970s? A single mean can’t answer that — we need one mean per year. This is groupby doing exactly what it was built for:

by_year = cars.groupby("model_year")["mpg"].mean().round(2)
print(by_year)
model_year
70    17.69
71    21.25
72    18.71
73    17.10
74    22.70
75    20.27
76    21.57
77    23.38
78    24.06
79    25.09
80    33.70
81    30.33
82    31.71
Name: mpg, dtype: float64

The trend is unmistakable. In 1970 the average car got 17.69 mpg; by 1982 it got 31.71 mpg, a rise of about 79% in twelve years, so the average new car came close to doubling its fuel economy. The early 1970s wobble (1973 actually dips to 17.10), then the yearly mean rises every year after 1975 through 1980, with a dramatic jump from 25.09 to 33.70 mpg in 1980 as the oil-shock fuel-economy push took hold. It eases back to 30.33 in 1981 and 31.71 in 1982, still well above any year in the 1970s.
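If you want to put numbers on the wobble and the jump, year-over-year differences do the job. This sketch rebuilds the yearly means from the output above so that it runs on its own; with the real data you would call diff() on by_year directly.

```python
import pandas as pd

# Yearly means copied from the groupby output above.
by_year = pd.Series(
    [17.69, 21.25, 18.71, 17.10, 22.70, 20.27, 21.57,
     23.38, 24.06, 25.09, 33.70, 30.33, 31.71],
    index=pd.Index(range(70, 83), name="model_year"),
    name="mpg",
)

change = by_year.diff()                      # year-over-year change in mean mpg
biggest_jump_year = change.idxmax()          # year with the largest increase
ratio = by_year.loc[82] / by_year.loc[70]    # 1982 mean relative to 1970

print(biggest_jump_year, round(change.max(), 2), round(ratio, 2))
```

The first entry of change is NaN because 1970 has no previous year; idxmax() and max() skip it by default.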

A column of numbers hints at the trend; a chart makes it land. Plot the yearly means as a line:

import matplotlib.pyplot as plt

ax = by_year.plot(marker="o", color="#0067c0", figsize=(7, 4.2))
ax.set_title("Average Fuel Economy Climbed Steadily, 1970–1982")
ax.set_xlabel("Model year")
ax.set_ylabel("Mean mpg")
ax.spines[["top", "right"]].set_visible(False)
plt.show()

Figure: Average fuel economy per model year. After a bumpy early-1970s start, the line climbs from about 17.7 mpg in 1970 to about 31.7 mpg in 1982, with a sharp jump in 1980.

This single chart is the magazine’s cover story. Fuel economy didn’t drift — it surged, especially after 1979. But an average hides who drove the change. For that, we compare regions.


Step 4: Compare the Regions of Origin

Were all three regions equally efficient, or did some build thriftier cars than others? Comparing groups fairly means looking at center and spread together — the mean tells you where a group sits, the standard deviation tells you how tightly its cars cluster there. Build the whole comparison in one agg:

profile = cars.groupby("origin").agg(
    mpg_mean=("mpg", "mean"),
    mpg_median=("mpg", "median"),
    mpg_std=("mpg", "std"),
    weight_mean=("weight", "mean"),
    weight_std=("weight", "std"),
).round(2)
print(profile)
        mpg_mean  mpg_median  mpg_std  weight_mean  weight_std
origin
europe     27.89        26.5     6.72      2423.30      490.04
japan      30.45        31.6     6.09      2221.23      320.50
usa        20.08        18.5     6.40      3361.93      794.79

Three clean findings fall out of this table:

  • US cars are the heaviest and least efficient. They average just 20.08 mpg while weighing 3,362 lbs — over 1,100 lbs more than the Japanese cars. Their median (18.5) sits well below their mean, so the American fleet is itself right-skewed, with a few efficient outliers lifting the average.
  • Japanese cars are the most efficient. At 30.45 mpg mean (and a 31.6 median), they get half again as many miles per gallon as US cars, on the lightest bodies in the fleet (2,221 lbs).
  • European cars sit in between at 27.89 mpg and 2,423 lbs — closer to Japan than to the US, but a notch thirstier than Japan.

The standard deviations matter just as much as the means. Notice US weight has by far the widest spread (794.79 lbs vs 320.50 for Japan and 490.04 for Europe): America built everything from compact coupes to enormous sedans, while Japan built a tight, consistent lineup of light cars. The fuel-economy spreads are similar across regions (6.09 to 6.72 mpg), which makes the means directly comparable. The roughly 10-mpg gap between US and Japanese cars is more than 1.5 standard deviations, far too large to be noise swamping everything. The 2.56-mpg gap between Europe and Japan is less than half a standard deviation, so that ranking is much softer.
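The claims in the bullets can be recomputed straight from the table. The sketch below copies the group summaries from the output above. Dividing each gap by the average of the three standard deviations is only a rough yardstick chosen for this illustration, not a method taken from the lesson.

```python
import pandas as pd

# Group summaries copied from the profile table above.
profile = pd.DataFrame(
    {
        "mpg_mean": [27.89, 30.45, 20.08],
        "mpg_std": [6.72, 6.09, 6.40],
        "weight_mean": [2423.30, 2221.23, 3361.93],
    },
    index=pd.Index(["europe", "japan", "usa"], name="origin"),
)

# Claims from the bullets, recomputed.
weight_gap = profile.loc["usa", "weight_mean"] - profile.loc["japan", "weight_mean"]
mpg_ratio = profile.loc["japan", "mpg_mean"] / profile.loc["usa", "mpg_mean"]

# How far each region's mean sits below Japan's, in units of a typical spread.
# (Assumption: the plain average of the three standard deviations is "typical".)
typical_std = profile["mpg_std"].mean()
gap_in_sds = (profile.loc["japan", "mpg_mean"] - profile["mpg_mean"]) / typical_std

print(round(weight_gap, 2), round(mpg_ratio, 2))
print(gap_in_sds.round(2))
```

Read this way, the US sits about 1.6 typical standard deviations below Japan, while Europe sits about 0.4 below.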

A boxplot shows these distributions side by side, spread and all:

Figure: Boxplots of mpg by region of origin. The US box sits lowest (around 18 to 20 mpg) and is right-skewed; Japan sits highest (around 30 mpg), with Europe just below it.

The reason behind the gap is no mystery — it’s weight. Plot every car’s weight against its mpg and the relationship is immediate:

Figure: Scatter plot of weight against mpg, colored by origin. Heavier cars get fewer miles per gallon. US cars (blue) cluster in the heavy, low-mpg corner; Japanese cars (orange) sit light and efficient.

Heavier cars burn more fuel: the cloud slopes clearly downward, and the regions separate along that slope. A scatter plot shows association rather than cause, but in this data the US efficiency gap looks less like a mystery of engineering and more like a consequence of building bigger, heavier cars.
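The lesson shows the boxplot and the scatter plot without their code. One way to draw both is sketched below. The helper name plot_region_charts, the toy frame, the labels, and the green used for Europe are assumptions made for illustration. Only the blue for US cars and orange for Japanese cars come from the caption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Blue (US) and orange (Japan) follow the caption; green for Europe is an assumption.
COLORS = {"usa": "tab:blue", "japan": "tab:orange", "europe": "tab:green"}

def plot_region_charts(cars):
    """Draw mpg boxplots by origin and a weight-vs-mpg scatter; return both axes."""
    fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(11, 4.2))

    # One box per region, all sharing the same mpg axis.
    cars.boxplot(column="mpg", by="origin", ax=ax_box, grid=False)
    ax_box.set_title("Fuel economy by region")
    ax_box.set_xlabel("Origin")
    ax_box.set_ylabel("mpg")
    fig.suptitle("")  # clear the automatic "Boxplot grouped by origin" title

    # One scatter layer per region so each gets its own color and legend entry.
    for origin, group in cars.groupby("origin"):
        ax_scatter.scatter(group["weight"], group["mpg"],
                           color=COLORS.get(origin, "gray"), label=origin, alpha=0.7)
    ax_scatter.set_xlabel("Weight (lbs)")
    ax_scatter.set_ylabel("mpg")
    ax_scatter.legend(title="Origin")
    return ax_box, ax_scatter

# Made-up rows standing in for cars; the values are illustrative only.
toy = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "japan",
               "europe", "europe", "europe"],
    "weight": [3400, 4100, 2900, 2100, 2300, 1950, 2400, 2700, 2200],
    "mpg":    [18.0, 14.0, 22.0, 33.0, 30.0, 36.0, 27.0, 24.0, 30.0],
})
ax_box, ax_scatter = plot_region_charts(toy)
```

With the real data you would pass cars instead of toy and finish with plt.show().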


Step 5: Crown the Standouts with Z-Scores

Means and medians describe the fleet, but the editor also wants names: which specific cars are the genuine standouts? This is what z-scores are for. A z-score restates a car’s mpg as the number of standard deviations it sits from the fleet average:

$$z = \frac{x - \mu}{\sigma}$$

A z-score of +2 means a car is two standard deviations more efficient than the fleet average, which puts it well out on the thrifty end. Compute it for every car against the overall mpg mean and standard deviation:

mu = cars["mpg"].mean()
sigma = cars["mpg"].std()
cars["mpg_z"] = (cars["mpg"] - mu) / sigma

most = cars.sort_values("mpg_z", ascending=False).head(5)
print(most[["name", "mpg", "origin", "mpg_z"]].round(2).to_string(index=False))
                name  mpg origin  mpg_z
           mazda glc 46.6  japan   2.95
 honda civic 1500 gl 44.6  japan   2.70
vw rabbit c (diesel) 44.3 europe   2.66
           vw pickup 44.0 europe   2.62
  vw dasher (diesel) 43.4 europe   2.54

The most efficient car the magazine ever reviewed is the Mazda GLC at 46.6 mpg, a z-score of +2.95, almost three full standard deviations above the fleet mean. That’s an extraordinary standout: under a roughly normal distribution, fewer than two cars in a thousand would sit that far above the mean (mpg is right-skewed, so treat this as a rough yardstick). Every car in the top five is Japanese or European, in line with the regional gap from Step 4. Now the other end:
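The “fewer than two in a thousand” figure is the upper tail of a standard normal curve beyond z = 2.95. Here is a minimal check using only the standard library. It assumes normality, which the right-skewed mpg column satisfies only loosely.

```python
import math

z = 2.95          # the Mazda GLC's z-score from the table above
n_cars = 398      # fleet size from Step 1

# Upper-tail probability P(Z > z) for a standard normal variable.
tail = 0.5 * math.erfc(z / math.sqrt(2))

print(round(tail * 1000, 2), "per thousand")
print(round(tail * n_cars, 2), "cars expected this far out in a fleet of", n_cars)
```

The tail works out to roughly 1.6 per thousand, so a perfectly normal fleet of 398 would be expected to contain less than one such car.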

least = cars.sort_values("mpg_z").head(5)
print(least[["name", "mpg", "origin", "mpg_z"]].round(2).to_string(index=False))
            name  mpg origin  mpg_z
        hi 1200d  9.0    usa  -1.86
       ford f250 10.0    usa  -1.73
       chevy c20 10.0    usa  -1.73
chevrolet impala 11.0    usa  -1.60
oldsmobile omega 11.0    usa  -1.60

The thirstiest cars are all American, led by a 9-mpg vehicle at z = −1.86. Notice the asymmetry: the standouts on the efficient end reach almost +3, while the worst guzzlers only reach about −1.9. That’s the right skew from Step 2 showing up again — the distribution has room to stretch far on the efficient side but bumps against a floor near 9 mpg on the thirsty side.

Why z-scores beat raw mpg for ranking

Saying “the Mazda gets 46.6 mpg” is informative, but “the Mazda is 2.95 standard deviations above average” tells you how unusual that is relative to the whole fleet. The z-score puts every car on a common, unit-free scale, which is exactly what you need to crown a standout fairly.
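“Unit-free” can be checked directly: convert mpg to kilometres per litre and the z-scores do not move. The sketch below uses the ten mpg values from the two tables above, so its z-scores are relative to those ten cars only, not to the whole fleet. The helper name zscore is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# mpg values of the ten cars listed in the two tables above.
mpg = pd.Series([46.6, 44.6, 44.3, 44.0, 43.4, 9.0, 10.0, 10.0, 11.0, 11.0])

def zscore(s):
    """Standard deviations from the mean, using pandas' default std (ddof=1)."""
    return (s - s.mean()) / s.std()

KM_PER_L_PER_MPG = 0.425144   # 1 US mpg is about 0.425 km per litre
km_per_l = mpg * KM_PER_L_PER_MPG

# Rescaling multiplies the mean and the standard deviation by the same factor,
# so the z-scores come out the same in either unit.
same = np.allclose(zscore(mpg), zscore(km_per_l))
print(same)
```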

Take It Further

You’ve answered the cover story, but the same descriptive tools open more questions:

  • Z-scores within a region. Recompute z-scores using each car’s own region’s mean and standard deviation (with groupby). The most efficient US car might be unremarkable overall but a standout among American cars — a different and useful story.
  • Trend by region. Repeat Step 3’s yearly means but split by origin. Did US cars improve faster than Japanese ones, or did everyone climb together?
  • A different variable. Run the same center-and-spread profile on weight or displacement across the years. Engines shrank as mpg rose — can you show it with means and standard deviations alone?
  • Quantify the weight link. Bucket cars into weight ranges and report mean mpg per bucket. How many mpg do you lose, on average, for every extra 500 lbs?
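As a starting point for the first and last ideas above, here is a sketch of within-region z-scores (groupby with transform) and mean mpg per weight bucket (pd.cut). The toy frame, the new column names, and the 500-lb bin edges are assumptions for illustration. Swap in cars and your own edges.

```python
import pandas as pd

# Made-up stand-in for cars; the values are illustrative only.
toy = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "japan"],
    "weight": [4200, 3500, 2600, 2400, 2100, 1900],
    "mpg":    [13.0, 18.0, 28.0, 27.0, 32.0, 37.0],
})

# Z-score of each car against its own region's mean and standard deviation.
grouped = toy.groupby("origin")["mpg"]
toy["mpg_z_region"] = (toy["mpg"] - grouped.transform("mean")) / grouped.transform("std")

# Mean mpg per 500-lb weight bucket; the bin edges are an arbitrary choice.
edges = list(range(1500, 5000, 500))   # 1500, 2000, ..., 4500
toy["weight_bucket"] = pd.cut(toy["weight"], bins=edges)
mpg_by_bucket = toy.groupby("weight_bucket", observed=True)["mpg"].mean()

print(toy[["origin", "mpg", "mpg_z_region"]].round(2))
print(mpg_by_bucket)
```

transform returns one value per original row, which is what lets the group mean and standard deviation line up with each car’s mpg.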

Summary

You profiled a fleet of 398 cars using nothing but the descriptive tools from this module. You found the typical car gets about 23 mpg and learned to trust the median over the mean because the distribution is right-skewed (skew +0.46, mode just 13). You turned model_year into a trend, showing average economy nearly doubled from 17.69 mpg in 1970 to 31.71 mpg in 1982. You compared regions on center and spread — US cars heaviest and least efficient at 20.08 mpg, Japan most efficient at 30.45 mpg, Europe between at 27.89 — and tied the gap to weight. Finally, z-scores crowned the Mazda GLC (z = +2.95) as the standout and confirmed the thirstiest cars were all American.

Key Concepts

  • Mean vs median under skew: when a distribution leans, the median is the more honest “typical” value and the mean–median gap reveals the skew.
  • Grouped means as a trend: groupby on a time column turns a single average into a story over time.
  • Center and spread together: comparing groups fairly means reading the mean and the standard deviation, never one alone.
  • Z-scores: restate a value as standard deviations from the mean to rank standouts on a common, unit-free scale.

Why This Matters

Long before anyone fits a model, the questions that decide a project are exactly the ones you just answered: what’s typical, is it changing, who differs from whom, and which cases are genuinely unusual. A profile built from means, medians, spreads, and z-scores is fast, transparent, and explainable to an editor — or a stakeholder — in a sentence. That instinct to summarize center and spread first, then chart the trend, then name the outliers, is the foundation every later technique is built on.


Next Steps

Continue to Module 3 - Probability Fundamentals (next in the course)

Move from describing data to reasoning about chance — the language of uncertainty that powers every inference ahead.

Back to Module Overview

Return to the Measures of Center & Variability module overview


Continue Building Your Skills

You just did real analyst work: a magazine-ready profile of fuel economy built from means, medians, spreads, and z-scores, with every claim traced back to a number you can defend. That discipline — describe the center, measure the spread, chart the trend, name the standouts — is the backbone of every analysis you’ll ever run. Next you’ll add the language of chance, and start reasoning not just about what is in your data, but about what could happen.