Lesson 1 - The Mean
Welcome to The Mean
Ask anyone for “the average” and they will hand you the mean: add up the numbers, divide by how many there are. It is the most reported statistic in the world, and the one we trust almost without thinking. But that easy familiarity hides a few sharp edges — the mean can be pulled around by a single extreme value, and a single overall average can quietly bury big differences between groups.
In this lesson you will compute the mean by hand and with pandas, see why it sits at the exact balance point of your data, watch it get tugged by skew, and discover that an overall mean is really a blend of its subgroups in disguise. You will use a real dataset of 1970s and 80s cars to do it.
By the end of this lesson, you will be able to:
- Write the formula for the mean and tell a population mean apart from a sample mean
- Compute the mean of a column in pandas and reproduce it from the raw formula
- Explain why the mean is the data’s balance point — the deviations around it sum to zero
- Recognize when the mean is misleading, and see how an overall mean blends its subgroup means
You only need a little Python and pandas. Let’s begin.
What the Mean Is
The arithmetic mean is the sum of all the values divided by how many values there are. For a set of numbers $x_1, x_2, \dots, x_n$, the mean is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
That symbol $\bar{x}$ (“x-bar”) is the sample mean: the mean of the data you actually have in front of you. When the numbers represent an entire population rather than a sample, we write the mean as $\mu$ (the Greek letter “mu”) and use $N$ for the population size:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
The arithmetic is identical; the symbols just tell the reader whether you are describing a sample or a whole population. As you saw in the first module, $\mu$ is a fixed fact about a population, while $\bar{x}$ is an estimate that shifts from sample to sample.
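As a quick check of the formula, the sketch below applies it by hand to five mpg values taken from the first rows of this lesson's cars dataset (18, 15, 18, 16, 17) and compares the result with the standard-library `statistics.fmean`. The variable names are illustrative, and five values are far too few to say anything about the full dataset.

```python
from statistics import fmean

# Five mpg values from the first rows of the cars dataset
sample = [18.0, 15.0, 18.0, 16.0, 17.0]

n = len(sample)        # how many values
total = sum(sample)    # sum of the values
x_bar = total / n      # the formula: sum divided by count

print(total, n, x_bar)
print(fmean(sample))   # standard-library mean, for comparison
```

Both lines should report the same mean, since `fmean` carries out the same sum-over-count arithmetic.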
Let’s load the cars dataset and meet the column we will summarize.
import pandas as pd
cars = pd.read_csv("https://datatweets.com/datasets/cars.csv")
print(cars.shape)
print(cars[["mpg", "cylinders", "weight", "origin", "name"]].head())
(398, 9)
    mpg  cylinders  weight origin                       name
0  18.0          8    3504    usa  chevrolet chevelle malibu
1  15.0          8    3693    usa          buick skylark 320
2  18.0          8    3436    usa         plymouth satellite
3  16.0          8    3433    usa              amc rebel sst
4  17.0          8    3449    usa                ford torino
These are 398 cars from the 1970s and early 80s, each with its fuel economy (mpg, miles per gallon), engine size, weight, and region of origin. We will focus on mpg, a single number that captures how thirsty each car is.
Computing the Mean in pandas
pandas gives you the mean of any column with a single method call.
mean_mpg = cars["mpg"].mean()
print(round(mean_mpg, 2))
23.51
So the average car in this dataset gets 23.51 miles per gallon. Because this is the mean of every car we have, you could reasonably call it $\mu$ for this population of 398 vehicles.
To prove there is no magic inside .mean(), build it from the formula yourself — the total of the column divided by its length:
n = len(cars["mpg"])
total = cars["mpg"].sum()
print(round(total / n, 2))
23.51
Identical. The .mean() method is just this formula wrapped up for convenience.
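One caveat applies to columns with missing values, such as horsepower (see Exercise 2). By default pandas skips missing values in both `.mean()` and `.sum()`, while `len()` still counts those rows, so the hand-built version can come out too low. The sketch below uses a small made-up Series (not the real horsepower data) to show the difference; `.count()` returns the number of non-missing values.

```python
import pandas as pd

# Made-up values with one missing entry (not the real horsepower data)
hp = pd.Series([130.0, 165.0, float("nan"), 150.0])

method_mean = hp.mean()             # skips NaN: 445 / 3
naive_mean = hp.sum() / len(hp)     # len() counts the NaN row: 445 / 4
safe_mean = hp.sum() / hp.count()   # count() excludes NaN: 445 / 3

print(round(method_mean, 2), round(naive_mean, 2), round(safe_mean, 2))
```

Dividing by `.count()` rather than `len()` keeps the manual formula in step with `.mean()` whenever a column has gaps.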
The mean ignores nothing
Every value in the column contributes to the mean. That is a strength — it uses all your data — but also the source of its biggest weakness: a single very large or very small value moves the mean for everyone. The median, which you will meet next lesson, does not share that vulnerability.
The Mean as a Balance Point
Here is the idea that makes the mean special. Imagine each data value as a weight placed along a ruler. The mean is the exact point where the ruler balances — the values above it and the values below it cancel out perfectly.
We can show this precisely. A deviation is the signed distance from a value to the mean, $x_i - \bar{x}$. If the mean is truly the balance point, then all those deviations, the positive ones above the mean and the negative ones below it, must add up to zero.
deviations = cars["mpg"] - cars["mpg"].mean()
print(round(deviations.sum(), 6))
0.0
The deviations sum to zero. This is not a coincidence of this dataset; it is true for any set of numbers, and it follows straight from the formula:
$$\sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$
Because $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the total of the values equals $n\bar{x}$, and the deviations cancel exactly. That balancing act is what makes the mean the natural “center of mass” of your data. As you will see in a later lesson, it is also why those deviations have to be squared before we can measure spread, since on their own they always cancel to nothing.
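To see that the balance belongs to the mean and to no other point, the sketch below sums the deviations of five mpg values (from the first rows of the cars table) around the mean and around an arbitrary other pivot, 17. The same algebra predicts that the sum around any pivot $c$ is $n(\bar{x} - c)$, which is zero only when $c = \bar{x}$. The helper name `sum_of_deviations` is made up for this illustration.

```python
import math

values = [18.0, 15.0, 18.0, 16.0, 17.0]
n = len(values)
x_bar = sum(values) / n

def sum_of_deviations(data, pivot):
    """Add up the signed distances from each value to the pivot."""
    return sum(x - pivot for x in data)

around_mean = sum_of_deviations(values, x_bar)   # zero, up to float rounding
around_17 = sum_of_deviations(values, 17.0)      # n * (x_bar - 17), not zero

print(math.isclose(around_mean, 0.0, abs_tol=1e-9))
print(around_17, n * (x_bar - 17.0))
```

The comparison against zero uses a small tolerance because floating-point arithmetic can leave a tiny residue even when the cancellation is exact on paper.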
When the Mean Misleads
The balance-point property is elegant, but it is also the mean’s Achilles’ heel. Because every value pulls on the balance, a few extreme values can drag the mean away from where most of the data actually sits. This happens whenever a distribution is skewed — stretched out toward one side.
The weight column is a good example. Compare its mean to its median, the middle value when the data is sorted:
print("mean ", round(cars["weight"].mean(), 1))
print("median", cars["weight"].median())
mean  2970.4
median 2803.5
The mean weight is about 167 pounds heavier than the median. That gap is a fingerprint of skew: a tail of very heavy cars (the big V8 sedans of the era) pulls the mean upward, while the median, which depends only on the middle of the sorted data and not on how heavy the heaviest cars are, stays put. The figure below shows the distribution with both markers.
When data is roughly symmetric, the mean and median nearly agree and the mean is a fine summary. When data is skewed or has outliers, the mean can tell a story most of your data would not recognize. Knowing which situation you are in is exactly why the next lesson pairs the mean with the median.
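The pull of a single extreme value is easy to reproduce on a tiny example. The sketch below takes five mpg values from the first rows of the cars table, appends one made-up extreme value of 60 (not a real car in the dataset), and compares how far the mean and the median move, using Python's standard `statistics` module.

```python
from statistics import mean, median

base = [18.0, 15.0, 18.0, 16.0, 17.0]
with_outlier = base + [60.0]   # 60 is a hypothetical extreme value

mean_shift = mean(with_outlier) - mean(base)
median_shift = median(with_outlier) - median(base)

print("mean  ", mean(base), "->", mean(with_outlier))
print("median", median(base), "->", median(with_outlier))
```

In this toy case the mean moves by about 7 mpg while the median moves by half an mpg, which is the same pattern the weight column shows on a larger scale.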
An Overall Mean Is a Blend of Subgroups
There is one more thing the mean hides, and it is the most useful insight in this lesson. A single overall average is almost never the whole story, because it is silently averaging across groups that may be very different from each other.
Our cars come from three regions. Look at the average mpg within each, along with how many cars each region contributes:
by_origin = cars.groupby("origin")["mpg"].agg(["mean", "count"]).round(2)
print(by_origin)
         mean  count
origin
europe  27.89     70
japan   30.45     79
usa     20.08    249
These groups are worlds apart. Japanese cars average 30.45 mpg; American cars average just 20.08 mpg, more than ten miles per gallon lower. The overall mean of 23.51 sits closer to the USA figure, and now you can see why: the USA contributes 249 of the 398 cars, so it dominates the average.
In fact, the overall mean is exactly the subgroup means combined, each weighted by how many cars are in its group. Multiply each group’s mean by its count, add those up, and divide by the total count:
blend = (by_origin["mean"] * by_origin["count"]).sum() / by_origin["count"].sum()
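print(round(blend, 2))
23.51
The blend reproduces the overall mean of 23.51. The match here is to two decimal places because the group means were rounded before blending; with unrounded group means the identity is exact. This is no accident: the overall mean is a weighted average of its subgroup means, where the weights are the group sizes:
$$\bar{x} = \frac{\sum_{g} n_g \bar{x}_g}{\sum_{g} n_g}$$
Here $n_g$ is the number of values in group $g$ and $\bar{x}_g$ is that group's mean.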
That single equation is why the overall figure leans toward the largest group. It also points straight at the next lesson: when the things you are averaging carry different weights — different group sizes, different importance, different reliability — the plain mean is no longer the right tool, and you need the weighted mean.
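A common slip is to average the group means directly, as if every group were the same size. The sketch below uses the rounded group means and counts from the by_origin table and compares the plain average of the three means with the count-weighted blend. The dictionary layout and variable names are just one illustrative way to write it.

```python
# Rounded group means and counts from the by_origin table
groups = {
    "europe": (27.89, 70),
    "japan": (30.45, 79),
    "usa": (20.08, 249),
}

# Plain average of the three means: treats every group as equally large
unweighted = sum(m for m, _ in groups.values()) / len(groups)

# Count-weighted blend: each mean counts once per car in its group
total_count = sum(c for _, c in groups.values())
weighted = sum(m * c for m, c in groups.values()) / total_count

print(round(unweighted, 2), round(weighted, 2))
```

The unweighted figure comes out near 26.14, well above 23.51, because it gives the 70 European and 79 Japanese cars the same say as the 249 American ones.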
Practice Exercises
Exercise 1: Mean acceleration, two ways
Compute the mean of the acceleration column using .mean(), then reproduce the same number from the raw formula (sum divided by count). Confirm the two match to two decimal places.
Hint
The method version is cars["acceleration"].mean(). The manual version is cars["acceleration"].sum() / len(cars["acceleration"]). Wrap both in round(..., 2) before comparing.
Exercise 2: Deviations always cancel
Pick the horsepower column and show that the deviations from its mean sum to (approximately) zero. The column has a few missing values, so drop them first. Why might the result print as a tiny number like 1e-13 instead of an exact 0?
Hint
Use hp = cars["horsepower"].dropna(), then (hp - hp.mean()).sum(). The tiny leftover is floating-point rounding — computers store decimals with limited precision, so the cancellation is exact in math but off by a hair in binary. Round it to see the intended 0.0.
Exercise 3: Mean mpg by model year
Group the cars by model_year and compute the mean mpg for each year. Did fuel economy improve over the 1970–82 span? Then check that the overall mean still equals the year means blended by their counts.
Hint
Use cars.groupby("model_year")["mpg"].agg(["mean", "count"]). To verify the blend, multiply each year’s mean by its count, sum that, and divide by the total count — it should land back on 23.51.
Summary
You met the most familiar statistic of all and looked past its simplicity. The mean is the sum of the values over their count, written $\bar{x}$ for a sample and $\mu$ for a population. It sits at the data’s balance point, where the deviations around it always sum to zero, which makes it the natural center of a symmetric distribution. But that same property makes it sensitive to skew and outliers, which is why the mean of car weight sat well above the median. And crucially, an overall mean is a weighted blend of its subgroup means: the cars’ 23.51 mpg leaned toward the largest group, the USA, exactly as the group-size weights predict.
Key Concepts
- Arithmetic mean: the sum of all values divided by the number of values, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- Population mean: the mean $\mu$ of an entire population; the sample mean $\bar{x}$ estimates it.
- Deviation: the signed distance from a value to the mean, $x_i - \bar{x}$; deviations always sum to zero.
- Balance point: the mean is the value at which the data balances, its center of mass.
- Skew: an asymmetric tail that pulls the mean away from the median.
- Weighted blend: an overall mean equals its subgroup means weighted by their group sizes.
Why This Matters
The mean is the number you will report, read, and be misled by more than any other. Knowing that it balances the data, that a long tail can drag it somewhere unrepresentative, and that it secretly leans toward your biggest subgroup is what keeps you from reporting an “average” that no one in your data would recognize. Every dashboard headline and KPI rests on these instincts.
Next Steps
Continue to Lesson 2 - The Weighted Mean and the Median
Average values that carry different weights, and meet the median — the center that ignores outliers.
Continue Building Your Skills
You now know what the mean really measures — and the three ways it can quietly mislead you, from skew to lopsided subgroups. Next you will pick up the two tools that handle exactly those cases: the weighted mean, for when your values do not all count equally, and the median, for when a few extreme values would otherwise steer the average off course.