Lesson 2 - The Weighted Mean and the Median

Welcome to the Weighted Mean and the Median

The ordinary mean treats every value as equally important and is easily dragged around by a few extreme numbers. Real datasets often expose both weaknesses: some values deserve more weight than others, and a handful of giants can pull the average somewhere no typical value lives. This lesson gives you two tools that address exactly those problems.

You will use a dataset of 398 cars to compute a weighted mean that combines group averages correctly, and a median that shrugs off the skew that distorts the ordinary mean. Along the way you will see one of the most useful diagnostics in all of descriptive statistics: comparing the mean and the median to detect the shape of your data in a single glance.

By the end of this lesson, you will be able to:

Compute a weighted mean and explain when some values should count more than others
Combine subgroup means by group size instead of averaging them naively
Find the median for both odd and even sample sizes
Use the gap between the mean and the median to diagnose skew

You only need a little Python, pandas, and NumPy. Let’s begin.

The Weighted Mean

The ordinary mean adds up every value and divides by how many there are, giving each value the same say. But sometimes values should not count equally — one might represent 200 cars and another only 70. A weighted mean lets each value carry an importance, or weight, that reflects how much it should contribute.

The formula multiplies each value $x_i$ by its weight $w_i$, sums those products, and divides by the total weight:

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$

When every weight is the same, the weights cancel and you are back to the ordinary mean. The weighted mean is simply the ordinary mean’s more honest sibling: it admits that not all values represent the same amount of the world.
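
As a minimal sketch of the formula, the helper below multiplies each value by its weight, sums the products, and divides by the total weight. The function name `weighted_mean` and the sample numbers are illustrative assumptions, not part of the cars dataset. The last lines check that equal weights reproduce the ordinary mean.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical values and weights, chosen only for illustration.
values = [10.0, 20.0, 30.0]
weights = [1, 1, 2]

result = weighted_mean(values, weights)   # (10 + 20 + 60) / 4
equal = weighted_mean(values, [5, 5, 5])  # equal weights cancel out
plain = sum(values) / len(values)         # ordinary mean

print(result)        # 22.5
print(equal, plain)  # 20.0 20.0
```

Doubling the weight on 30 pulls the result from 20.0 up to 22.5, which is the whole effect a weight has.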

When each value should count more

The most common reason to weight is combining group averages of different sizes. Imagine you already know the average fuel economy (mpg) for cars from each region of origin, but not the value for every individual car. To get the overall average, you cannot just average the three regional numbers — each region contains a very different number of cars.

Load the data and compute the mean mpg and the count for each origin:

import pandas as pd
import numpy as np

cars = pd.read_csv("https://datatweets.com/datasets/cars.csv")

summary = cars.groupby("origin")["mpg"].agg(["mean", "count"]).round(2)
print(summary)

         mean  count
origin
europe  27.89     70
japan   30.45     79
usa     20.08    249

There are three group means — 27.89, 30.45, and 20.08 — but they describe wildly different numbers of cars. USA alone accounts for 249 of the 398 cars, more than the other two regions combined.

Why the naive average is wrong

A tempting shortcut is to average the three group means directly:

naive = summary["mean"].mean()
print(round(naive, 2))

26.14

That gives 26.14 mpg — and it is wrong. It silently pretends each region is equally common, treating the 70 European cars as if they mattered as much as the 249 American ones. Since the large USA group has the lowest mpg, ignoring its size inflates the overall figure.

The correct overall mean weights each group mean by its number of cars:

$$\bar{x}_w = \frac{(27.89 \times 70) + (30.45 \times 79) + (20.08 \times 249)}{70 + 79 + 249}$$

means = summary["mean"]
counts = summary["count"]
weighted = (means * counts).sum() / counts.sum()
print(round(weighted, 2))

23.51

The weighted mean is 23.51 mpg — far below the naive 26.14. You can confirm it is the true overall average by computing the mean straight from all 398 individual cars:

print(round(cars["mpg"].mean(), 2))

23.51

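They match to two decimal places. (The group means in the table were rounded, so the two results can differ slightly in later digits; with unrounded group means they agree up to floating-point error.) That is the whole point: a weighted mean of group averages, weighted by group size, reproduces the mean you would get from the raw data, while a naive average of group averages does not.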

Never average averages blindly

Averaging group means only gives the right overall mean when every group is the same size. Whenever the groups differ in size — different regions, different store counts, different class enrollments — you must weight each group mean by its size, or your “average” will quietly favor the smaller groups.
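
The rule can be checked with a small sketch. The group means and sizes below are hypothetical numbers chosen only to illustrate it: with equal group sizes the naive and weighted results agree, and with unequal sizes they drift apart.

```python
import numpy as np

# Hypothetical group means, used only to illustrate the rule.
group_means = np.array([10.0, 20.0, 40.0])

equal_sizes = np.array([50, 50, 50])
unequal_sizes = np.array([10, 40, 150])

# Naive approach: every group counts the same regardless of size.
naive = group_means.mean()

# Weighted approach: each group mean counts in proportion to its size.
weighted_equal = (group_means * equal_sizes).sum() / equal_sizes.sum()
weighted_unequal = (group_means * unequal_sizes).sum() / unequal_sizes.sum()

print(round(naive, 2), round(weighted_equal, 2), round(weighted_unequal, 2))
```

Here the largest group also has the highest mean, so the naive figure understates the true overall mean; in the cars data the largest group has the lowest mean, so the naive figure overstates it.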

Computing it with NumPy

NumPy has the weighted mean built in through np.average, which takes a weights argument:

weighted = np.average(summary["mean"], weights=summary["count"])
print(round(weighted, 2))

23.51

Same answer, less arithmetic. Use np.average(values, weights=...) whenever you have a value for each group and a count (or any other importance) to weight it by. With the weights left out, np.average returns the ordinary mean.
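
Weights do not have to be raw counts. As a small sketch using the group means and counts printed in the summary table above, the block below rescales the counts into shares that sum to 1 and shows that the weighted mean is unchanged, because only the relative sizes of the weights matter. The variable names are illustrative.

```python
import numpy as np

# Group means and counts as printed in the lesson's summary table.
means = np.array([27.89, 30.45, 20.08])
counts = np.array([70, 79, 249])

# Weighting by raw counts.
by_count = np.average(means, weights=counts)

# Weighting by shares of the total: the same weights rescaled to sum to 1.
shares = counts / counts.sum()
by_share = np.average(means, weights=shares)

# No weights at all: np.average falls back to the ordinary mean.
unweighted = np.average(means)

print(round(by_count, 2), round(by_share, 2), round(unweighted, 2))
```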

The Median

The mean is sensitive: one enormous value can drag it far from where most of the data sits. The median is the antidote. It is the middle value of the data when every observation is lined up in order — half the values fall below it, half above. Because it cares only about position in the sorted order, not the size of the numbers, no single extreme value can pull it around.

Computing the median for odd and even samples

When there is an odd number of values, the median is the single one in the middle. With five values sorted as 15, 18, 21, 24, 31, the third value is the center:

small = [15, 18, 21, 24, 31]
print(np.median(small))

21.0

When there is an even number of values, there is no single middle, so the median is the average of the two middle values. With 15, 18, 21, 24, the two central values are 18 and 21, and their average is 19.5:

small_even = [15, 18, 21, 24]
print(np.median(small_even))

19.5
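
To make the odd and even rules concrete, here is a minimal hand-rolled version. The function name `median_by_hand` is an illustrative assumption; in practice np.median does the same job.

```python
def median_by_hand(values):
    """Median via sorting: the middle value for an odd count,
    the mean of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2


# The inputs are deliberately unsorted: sorting is part of the definition.
odd_result = median_by_hand([31, 15, 24, 18, 21])
even_result = median_by_hand([24, 15, 21, 18])

print(odd_result, even_result)  # 21.0 19.5
```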

In pandas you rarely sort by hand — the .median() method handles odd and even automatically:

print(cars["mpg"].median())

23.0

The median mpg is 23.0, almost identical to the mean of 23.51. When the mean and median sit close together, the data is roughly symmetric. The interesting cases are when they pull apart.

Why the median resists skew and outliers

The cars’ weight is a classic skewed variable: most cars are light to medium, but a tail of heavy gas-guzzlers stretches far to the right. Watch what that does to the two measures of center:

print("mean  ", round(cars["weight"].mean(), 1))
print("median", cars["weight"].median())

mean   2970.4
median 2803.5

The mean weight is 2970.4 lbs, but the median is 2803.5 lbs — about 167 lbs lighter. The heavy tail inflates the mean, dragging it above the typical car, while the median stays anchored where the bulk of the data actually lives. The median reports a more representative “center” precisely because the extreme heavy cars cannot vote with their size, only with their position.
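
The "position, not size" idea can be seen on a tiny list. The sketch below takes the five values 15, 18, 21, 24, 31 and swaps the largest for a hypothetical extreme value of 310: the mean jumps, while the median does not move because the middle position is still occupied by 21.

```python
import numpy as np

original = [15, 18, 21, 24, 31]
with_outlier = [15, 18, 21, 24, 310]  # hypothetical: the largest value is now extreme

mean_before = np.mean(original)
mean_after = np.mean(with_outlier)
median_before = np.median(original)
median_after = np.median(with_outlier)

print(mean_before, mean_after)      # the mean rises from about 21.8 to about 77.6
print(median_before, median_after)  # the median stays at 21.0
```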

Figure: Histogram of car weight with a long right tail. A solid line marks the median at about 2,804 lbs and a dashed line marks the mean at about 2,970 lbs, to the right of the median. The weight distribution is right-skewed: the long tail of heavy cars pulls the mean above the median, so the median better represents a typical car.

Horsepower tells the same story even more loudly:

print("mean  ", round(cars["horsepower"].mean(), 1))
print("median", cars["horsepower"].median())

mean   104.5
median 93.5

The mean horsepower is 104.5 but the median is just 93.5. A relatively small number of high-powered engines pulls the mean up by 11 horsepower, while the median ignores them and reports the genuine midpoint.

When the median is the honest choice

For skewed quantities like income, house prices, response times, or car weight, the median usually describes a “typical” value better than the mean. News reports quote median household income for exactly this reason: a few billionaires would send the mean soaring above what any ordinary household earns.

The median for open-ended distributions

There is one more situation where the median is not just better but necessary: open-ended distributions, where the highest (or lowest) category has no exact bound. If a survey records income in brackets ending in “$200,000 or more,” you cannot compute a mean, because you do not know the actual values inside that top bracket. But you can still find the median, because locating the middle value only requires knowing the order, not the exact size of the extremes. As long as the middle observation does not itself fall in an open-ended bracket, the median is defined whenever you can rank the data, even when the tails are unmeasured.
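
A rough sketch of that idea: with only bracket counts, you can walk up the cumulative totals until you pass the halfway point and report the bracket that contains the middle respondent. The bracket labels, counts, and the helper name `median_bracket` are all hypothetical, made up for illustration.

```python
# Hypothetical survey counts per income bracket, listed from lowest to highest.
brackets = [
    ("under $25,000", 120),
    ("$25,000 to $49,999", 260),
    ("$50,000 to $99,999", 310),
    ("$100,000 to $199,999", 180),
    ("$200,000 or more", 30),  # open-ended: no upper bound, so no mean
]


def median_bracket(brackets):
    """Return the label of the bracket that contains the middle-ranked respondent.

    If the halfway point falls exactly on a bracket boundary, this returns the
    lower of the two neighbouring brackets.
    """
    total = sum(count for _, count in brackets)
    if total == 0:
        raise ValueError("brackets must contain at least one respondent")
    halfway = total / 2
    running = 0
    for label, count in brackets:
        running += count
        if running >= halfway:
            return label


result = median_bracket(brackets)
print(result)  # $50,000 to $99,999
```

With 900 respondents, the cumulative counts are 120, 380, and 690, so the 450th respondent sits in the third bracket. The open-ended top bracket never enters the calculation.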

Mean versus median as a skew diagnostic

Putting the two measures side by side gives you a fast, no-plot way to detect the shape of a distribution:

Mean ≈ median → the data is roughly symmetric.
Mean > median → a tail stretches to the right; the data is right-skewed (weight: 2970.4 vs 2803.5).
Mean < median → a tail stretches to the left; the data is left-skewed.

The intuition is simple: the mean follows the tail, while the median stays put. So whichever side the mean sits on tells you which way the data is skewed, and the size of the gap hints at how strong the skew is. Before you ever draw a histogram, comparing the mean and median gives you a quick first read on a variable’s shape. Treat it as a rule of thumb: it can mislead for data with several peaks or only a few distinct values, so confirm with a plot when the shape matters.
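
The three cases can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `skew_hint`, the choice to scale the gap by the standard deviation, and the `tolerance` cutoff of 0.1 are illustrative choices, not a standard rule. The sample lists are made up.

```python
import numpy as np


def skew_hint(values, tolerance=0.1):
    """Label skew from the mean-median gap, scaled by the standard deviation.

    `tolerance` is a hypothetical cutoff chosen for illustration, not a standard.
    """
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]  # ignore missing values
    spread = data.std()
    if spread == 0:
        return "roughly symmetric"
    gap = (data.mean() - np.median(data)) / spread
    if gap > tolerance:
        return "right-skewed"
    if gap < -tolerance:
        return "left-skewed"
    return "roughly symmetric"


right = skew_hint([1, 2, 2, 3, 3, 4, 20])        # mean 5 sits above median 3
left = skew_hint([1, 17, 18, 18, 19, 19, 20])    # mean 16 sits below median 18
symmetric = skew_hint([1, 2, 3, 4, 5])           # mean and median are both 3

print(right, "|", left, "|", symmetric)
```

Scaling by the standard deviation lets one cutoff serve variables measured in different units, such as pounds and horsepower.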

Practice Exercises

Exercise 1: Weight a different grouping

The mean mpg differs by number of cylinders, and the cylinder groups are very different sizes. Group the cars by cylinders, compute the mean mpg and count for each group, then combine those group means into a single overall mean using np.average. Confirm your weighted result equals cars["mpg"].mean(), and check that a naive average of the group means does not.

Hint

Build the table with cars.groupby("cylinders")["mpg"].agg(["mean", "count"]), then call np.average(table["mean"], weights=table["count"]). Compare that to table["mean"].mean() and to cars["mpg"].mean() to see which one matches.

Exercise 2: Diagnose skew from mean and median

Compute the mean and median of the displacement column. Based only on those two numbers, decide whether displacement is right-skewed, left-skewed, or roughly symmetric — and explain how the gap told you.

Hint

Use cars["displacement"].mean() and cars["displacement"].median(). If the mean sits well above the median, a right tail is pulling it up. (You should find a mean near 193.4 and a median of 148.5 — a large gap.)

Exercise 3: Find a symmetric variable

Not every variable is skewed. Compute the mean and median of acceleration and compare them. Is this variable more symmetric than weight? What does the small gap suggest about whether the mean or median you report would matter much here?

Hint

Use cars["acceleration"].mean() and cars["acceleration"].median(). When the two land almost on top of each other (mean about 15.57, median 15.5), the distribution is nearly symmetric and the choice of center barely changes the answer.

Summary

You learned two measures of center that handle the cases the ordinary mean cannot. The weighted mean lets each value count according to its importance, which is essential when combining group averages of different sizes — averaging the three origin means naively gave 26.14 mpg, while correctly weighting by car count gave the true 23.51. The median, the middle value of the sorted data, resists skew and outliers because it depends on position rather than magnitude: for right-skewed car weight, the median of 2803.5 lbs sat well below the mean of 2970.4 lbs and described a typical car far better. Comparing the mean and median is itself a quick diagnostic for the shape of a distribution.

Key Concepts

Weighted mean: $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$; each value contributes in proportion to its weight.
Weight: a number expressing how much a value should count, often a group’s size.
Combining group means: weight each group mean by its group size; averaging group means directly is only correct when the groups are the same size.
Median: the middle value of the sorted data; the average of the two middle values when the count is even.
Resistance: the median is unaffected by how large the extreme values are, because it depends on rank, not magnitude.
Skew diagnostic: mean > median suggests right-skew, mean < median suggests left-skew, mean ≈ median suggests rough symmetry.

Why This Matters

Real data is rarely symmetric and rarely comes pre-balanced. Salaries, prices, wait times, and file sizes all skew, and analyses constantly need to roll up averages across regions, cohorts, or experiments of unequal size. Reaching for a weighted mean when groups differ in size, and for the median when a tail would mislead the mean, is the difference between a summary number that reflects reality and one that quietly flatters it.

Next Steps

Continue to Lesson 3 - The Mode

Meet the third measure of center — the most frequent value — and learn when it beats the mean and median.

Back to Module Overview

Return to the Measures of Center & Variability module overview

Continue Building Your Skills

You can now pick the right average for the job — weighting values that represent more of the world, and switching to the median when a skewed tail would lead the mean astray. Next you will add the third and final measure of center, the mode, and see when the most common value tells you something the mean and median cannot.

