Lesson 2 - The Weighted Mean and the Median
Welcome to the Weighted Mean and the Median
The ordinary mean treats every value as equally important and is easily dragged around by a few extreme numbers. Most real datasets break both of those assumptions: some values deserve more weight than others, and a handful of giants can pull the average somewhere no typical value lives. This lesson gives you two tools that fix exactly those problems.
You will use a dataset of 398 cars to compute a weighted mean that combines group averages correctly, and a median that shrugs off the skew that distorts the ordinary mean. Along the way you will see one of the most useful diagnostics in all of descriptive statistics: comparing the mean and the median to detect the shape of your data in a single glance.
By the end of this lesson, you will be able to:
- Compute a weighted mean and explain when each value should count more
- Combine subgroup means by group size instead of averaging them naively
- Find the median for both odd and even sample sizes
- Use the gap between the mean and the median to diagnose skew
You only need a little Python, pandas, and NumPy. Let’s begin.
The Weighted Mean
The ordinary mean adds up every value and divides by how many there are, giving each value the same say. But sometimes values should not count equally — one might represent 200 cars and another only 70. A weighted mean lets each value carry an importance, or weight, that reflects how much it should contribute.
The formula multiplies each value $x_i$ by its weight $w_i$, sums those products, and divides by the total weight:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
When every weight is the same, the weights cancel and you are back to the ordinary mean. The weighted mean is simply the ordinary mean’s more honest sibling: it admits that not all values represent the same amount of the world.
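The recipe described above can be sketched in plain Python. This is a minimal illustration, not the lesson's own code: the helper name `weighted_mean` and the two example values (20 and 30 mpg, standing in for 200 and 70 cars) are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical numbers: two averages representing 200 and 70 cars.
values = [20.0, 30.0]
weights = [200, 70]

weighted = weighted_mean(values, weights)   # pulled toward the larger group
equal = weighted_mean(values, [1, 1])       # equal weights cancel out

print(round(weighted, 2))  # 22.59
print(equal)               # 25.0
```

With equal weights the result is the ordinary mean of 25.0; with the unequal weights it moves toward the value that represents more cars.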
When each value should count more
The most common reason to weight is combining group averages of different sizes. Imagine you already know the average fuel economy (mpg) for cars from each region of origin, but not the value for every individual car. To get the overall average, you cannot just average the three regional numbers — each region contains a very different number of cars.
Load the data and compute the mean mpg and the count for each origin:
import pandas as pd
import numpy as np
cars = pd.read_csv("https://datatweets.com/datasets/cars.csv")
summary = cars.groupby("origin")["mpg"].agg(["mean", "count"]).round(2)
print(summary)
        mean  count
origin
europe  27.89     70
japan   30.45     79
usa     20.08    249
There are three group means (27.89, 30.45, and 20.08), but they describe wildly different numbers of cars. USA alone accounts for 249 of the 398 cars, more than the other two regions combined.
Why the naive average is wrong
A tempting shortcut is to average the three group means directly:
naive = summary["mean"].mean()
print(round(naive, 2))
26.14
That gives 26.14 mpg, and it is wrong. It silently pretends each region is equally common, treating the 70 European cars as if they mattered as much as the 249 American ones. Since the large USA group has the lowest mpg, ignoring its size inflates the overall figure.
The correct overall mean weights each group mean by its number of cars:
means = summary["mean"]
counts = summary["count"]
weighted = (means * counts).sum() / counts.sum()
print(round(weighted, 2))
23.51
The weighted mean is 23.51 mpg, far below the naive 26.14. You can confirm it is the true overall average by computing the mean straight from all 398 individual cars:
print(round(cars["mpg"].mean(), 2))
23.51
They match to two decimal places. (The group means in summary were rounded to two decimals; weight the unrounded means and the agreement is exact up to floating-point error.) That is the whole point: a weighted mean of group averages reproduces the mean you would get from the raw data, while a naive average of group averages does not.
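A short derivation shows why the match is no coincidence. The notation is introduced here for illustration: group $g$ contains $n_g$ cars with mean $\bar{x}_g$, and $N$ is the total number of cars.

```latex
\bar{x}_g = \frac{1}{n_g} \sum_{i \in g} x_i
\quad\Longrightarrow\quad
n_g \, \bar{x}_g = \sum_{i \in g} x_i

\frac{\sum_g n_g \, \bar{x}_g}{\sum_g n_g}
= \frac{\sum_g \sum_{i \in g} x_i}{N}
= \frac{1}{N} \sum_{i=1}^{N} x_i
= \bar{x}
```

Multiplying each group mean by its size recovers that group's total, so summing over groups rebuilds the grand total, and dividing by $N$ gives the ordinary mean of all cars.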
Never average averages blindly
Averaging group means only gives the right overall mean when every group is the same size. Whenever the groups differ in size — different regions, different store counts, different class enrollments — you must weight each group mean by its size, or your “average” will quietly favor the smaller groups.
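The rule can be checked on a toy example. The sketch below uses only the standard library and made-up numbers (not the cars data); the helper name `compare_averages` is hypothetical.

```python
from statistics import mean

def compare_averages(groups):
    """Return (naive average of group means, size-weighted mean, true overall mean)."""
    group_means = [mean(g) for g in groups]
    sizes = [len(g) for g in groups]
    naive = mean(group_means)
    weighted = sum(m * n for m, n in zip(group_means, sizes)) / sum(sizes)
    overall = mean([x for g in groups for x in g])
    return naive, weighted, overall

# Hypothetical data: the same two group means (15 and 35),
# first with equal group sizes, then with unequal sizes.
equal_groups = [[10, 20], [30, 40]]
unequal_groups = [[10, 20], [30, 40, 30, 40, 30, 40]]

print(compare_averages(equal_groups))
print(compare_averages(unequal_groups))
```

With equal group sizes all three numbers agree at 25. Once the second group grows to six values, the naive average stays at 25 while the weighted and overall means both move to 30.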
Computing it with NumPy
NumPy has the weighted mean built in through np.average, which takes a weights argument:
weighted = np.average(summary["mean"], weights=summary["count"])
print(round(weighted, 2))
23.51
Same answer, less arithmetic. Use np.average(values, weights=...) whenever you have a value for each group and a count (or any other importance) to weight it by. With the weights left out, np.average returns the ordinary mean.
The Median
The mean is sensitive: one enormous value can drag it far from where most of the data sits. The median is the antidote. It is the middle value of the data when every observation is lined up in order — half the values fall below it, half above. Because it cares only about position in the sorted order, not the size of the numbers, no single extreme value can pull it around.
Computing the median for odd and even samples
When there is an odd number of values, the median is the single one in the middle. With five values sorted as 15, 18, 21, 24, 31, the third value is the center:
small = [15, 18, 21, 24, 31]
print(np.median(small))
21.0
When there is an even number of values, there is no single middle, so the median is the average of the two middle values. With 15, 18, 21, 24, the two central values are 18 and 21, and their average is 19.5:
small_even = [15, 18, 21, 24]
print(np.median(small_even))
19.5
In pandas you rarely sort by hand; the .median() method handles odd and even counts automatically:
print(cars["mpg"].median())
23.0
The median mpg is 23.0, almost identical to the mean of 23.51. When the mean and median sit close together, the data is roughly symmetric. The interesting cases are when they pull apart.
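To see what the library calls are doing, the odd and even rules can be written out by hand. This is a sketch for illustration; the function name `median_by_hand` is hypothetical, and in practice np.median or .median() is the better choice.

```python
def median_by_hand(values):
    """Sort the values, then take the middle one (odd count)
    or the average of the two middle ones (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand([15, 18, 21, 24, 31]))  # 21.0
print(median_by_hand([24, 15, 21, 18]))      # 19.5 (input order does not matter)
```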
Why the median resists skew and outliers
The cars’ weight is a classic skewed variable: most cars are light to medium, but a tail of heavy gas-guzzlers stretches far to the right. Watch what that does to the two measures of center:
print("mean ", round(cars["weight"].mean(), 1))
print("median", cars["weight"].median())
mean  2970.4
median 2803.5
The mean weight is 2970.4 lbs, but the median is 2803.5 lbs, about 167 lbs lighter. The heavy tail inflates the mean, dragging it above the typical car, while the median stays anchored where the bulk of the data actually lives. The median reports a more representative “center” precisely because the extreme heavy cars cannot vote with their size, only with their position.
Horsepower tells the same story even more loudly:
print("mean ", round(cars["horsepower"].mean(), 1))
print("median", cars["horsepower"].median())
mean  104.5
median 93.5
The mean horsepower is 104.5 but the median is just 93.5. A relatively small number of high-powered engines pulls the mean up by 11 horsepower, while the median ignores them and reports the genuine midpoint.
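The same effect shows up in miniature when a single extreme value is added to a short list. The numbers below are invented for illustration and do not come from the cars dataset.

```python
from statistics import mean, median

# Hypothetical horsepower-like values.
typical = [70, 80, 90, 95, 100]
with_outlier = typical + [400]   # add one very powerful engine

mean_before, median_before = mean(typical), median(typical)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("mean  ", mean_before, "->", round(mean_after, 1))
print("median", median_before, "->", median_after)
```

One extra value moves the mean from 87 to about 139.2, a jump of more than 50, while the median only shifts from 90 to 92.5.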
When the median is the honest choice
For skewed quantities like income, house prices, response times, or car weight, the median usually describes a “typical” value better than the mean. News reports quote median household income for exactly this reason: a few billionaires would send the mean soaring above what any ordinary household earns.
The median for open-ended distributions
There is one more situation where the median is not just better but necessary: open-ended distributions, where the largest (or smallest) category has no exact upper bound. If a survey records income as brackets ending in “$200,000 or more,” you cannot compute a mean — you do not know the actual values inside that top bracket. But you can still find the median, because locating the middle value only requires knowing the order, not the exact size of the extremes. The median is defined whenever you can rank the data, even when the tails are unmeasured.
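For bracketed data, locating the median comes down to walking the cumulative counts until the halfway point is reached. The sketch below assumes made-up bracket labels and counts, and the helper name `median_bracket` is hypothetical.

```python
def median_bracket(labels, counts):
    """Return the label of the bracket holding the middle observation.

    For an even total this is the bracket of the lower of the two
    middle observations; the upper one may fall in the next bracket.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    half = total / 2
    running = 0
    for label, count in zip(labels, counts):
        running += count
        if running >= half:
            return label

# Hypothetical survey: the top bracket has no upper bound.
labels = ["under $50,000", "$50,000 to $99,999",
          "$100,000 to $199,999", "$200,000 or more"]
counts = [40, 35, 15, 10]

print(median_bracket(labels, counts))  # $50,000 to $99,999
```

The open-ended top bracket never enters the calculation except through its count, which is why the median survives where the mean cannot be computed.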
Mean versus median as a skew diagnostic
Putting the two measures side by side gives you a fast, no-plot way to detect the shape of a distribution:
- Mean ≈ median → the data is roughly symmetric.
- Mean > median → a tail stretches to the right; the data is right-skewed (weight: 2970.4 vs 2803.5).
- Mean < median → a tail stretches to the left; the data is left-skewed.
The intuition is simple: the mean follows the tail, while the median stays put. So whichever side the mean sits on tells you which way the data is skewed, and the size of the gap hints at how strong the skew is. Before you ever draw a histogram, comparing the mean and median gives you a useful first read on a variable’s shape. Treat it as a rule of thumb, though: multimodal or heavily discrete data can break it, so confirm with a plot when the shape matters.
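The three cases above can be wrapped in a small helper. This is one possible sketch: the function name `skew_hint`, the choice to scale the gap by the standard deviation, and the tolerance of 0.1 are all illustrative assumptions rather than a standard definition.

```python
from statistics import mean, median, pstdev

def skew_hint(values, tolerance=0.1):
    """Label the likely skew from the mean-median gap.

    The gap is divided by the standard deviation so the result does not
    depend on units. The 0.1 tolerance is an arbitrary illustrative cutoff.
    """
    spread = pstdev(values)
    if spread == 0:
        return "roughly symmetric"
    gap = (mean(values) - median(values)) / spread
    if gap > tolerance:
        return "right-skewed"
    if gap < -tolerance:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical samples.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]
left_tail = [-v for v in right_tail]
balanced = [1, 2, 3, 4, 5]

print(skew_hint(right_tail))  # right-skewed
print(skew_hint(left_tail))   # left-skewed
print(skew_hint(balanced))    # roughly symmetric
```

On real data you could pass a column such as cars["weight"] after dropping missing values.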
Practice Exercises
Exercise 1: Weight a different grouping
The mean mpg differs by number of cylinders, and the cylinder groups are very different sizes. Group the cars by cylinders, compute the mean mpg and count for each group, then combine those group means into a single overall mean using np.average. Confirm your weighted result equals cars["mpg"].mean(), and check that a naive average of the group means does not.
Hint
Build the table with cars.groupby("cylinders")["mpg"].agg(["mean", "count"]), then call np.average(table["mean"], weights=table["count"]). Compare that to table["mean"].mean() and to cars["mpg"].mean() to see which one matches.
Exercise 2: Diagnose skew from mean and median
Compute the mean and median of the displacement column. Based only on those two numbers, decide whether displacement is right-skewed, left-skewed, or roughly symmetric — and explain how the gap told you.
Hint
Use cars["displacement"].mean() and cars["displacement"].median(). If the mean sits well above the median, a right tail is pulling it up. (You should find a mean near 193.4 and a median of 148.5 — a large gap.)
Exercise 3: Find a symmetric variable
Not every variable is skewed. Compute the mean and median of acceleration and compare them. Is this variable more symmetric than weight? What does the small gap suggest about whether the mean or median you report would matter much here?
Hint
Use cars["acceleration"].mean() and cars["acceleration"].median(). When the two land almost on top of each other (mean about 15.57, median 15.5), the distribution is nearly symmetric and the choice of center barely changes the answer.
Summary
You learned two measures of center that handle the cases the ordinary mean cannot. The weighted mean lets each value count according to its importance, which is essential when combining group averages of different sizes — averaging the three origin means naively gave 26.14 mpg, while correctly weighting by car count gave the true 23.51. The median, the middle value of the sorted data, resists skew and outliers because it depends on position rather than magnitude: for right-skewed car weight, the median of 2803.5 lbs sat well below the mean of 2970.4 lbs and described a typical car far better. Comparing the mean and median is itself a quick diagnostic for the shape of a distribution.
Key Concepts
- Weighted mean — $\bar{x}_w = \sum w_i x_i \,/\, \sum w_i$; each value contributes in proportion to its weight.
- Weight — a number expressing how much a value should count, often a group’s size.
- Combining group means — weight each group mean by its group size, never average group means directly.
- Median — the middle value of the sorted data; the average of the two middle values when the count is even.
- Resistance — the median barely moves when extreme values are added or changed, because it depends on rank, not magnitude.
- Skew diagnostic — mean > median means right-skew, mean < median means left-skew, mean ≈ median means symmetric.
Why This Matters
Real data is rarely symmetric and rarely comes pre-balanced. Salaries, prices, wait times, and file sizes all skew, and analyses constantly need to roll up averages across regions, cohorts, or experiments of unequal size. Reaching for a weighted mean when groups differ in size, and for the median when a tail would mislead the mean, is the difference between a summary number that reflects reality and one that quietly flatters it.
Next Steps
Continue to Lesson 3 - The Mode
Meet the third measure of center — the most frequent value — and learn when it beats the mean and median.
Continue Building Your Skills
You can now pick the right average for the job — weighting values that represent more of the world, and switching to the median when a skewed tail would lead the mean astray. Next you will add the third and final measure of center, the mode, and see when the most common value tells you something the mean and median cannot.