Lesson 3 - Skewed Metrics and the Mann-Whitney Test
Welcome to Skewed Metrics
The last two lessons compared Lumen’s revenue by its average — and found a $1.22 lift that Welch’s t-test called not significant. But there’s a deeper problem hiding underneath that number, one that no choice of t-test can fix: revenue is heavily right-skewed. Most users spend a little, a handful of “whales” spend enormously, and that long tail drags the average toward itself. When a metric is skewed like this, the mean stops describing the typical user — and, alarmingly, it can move in the opposite direction from what most of your users actually experience. In this lesson you’ll watch that happen on Lumen’s exact data, then meet the Mann-Whitney U test, a rank-based tool built for the question the mean can’t answer.
By the end of this lesson, you will be able to:
- Explain why right-skew makes the mean misleading for the typical user
- Show that a higher mean can coexist with a lower median on the same data
- Run the Mann-Whitney U test with scipy and read its effect size
- Decide which question a decision actually needs answered: total revenue or the typical user
Let’s start with the number that should stop you cold.
When the Average Lies
Here is Lumen’s revenue data from Lesson 1 — the same two groups, the same seed — but now we look past the mean to the median, the spend of the middle user:
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)            # same seed as Lesson 1
rev_c = rng.lognormal(2.50, 0.70, 6000)   # control: 6,000 users
rev_t = rng.lognormal(2.28, 1.05, 1500)   # treatment: 1,500 users, wider spread

# Compare the two groups by mean and by median (the middle user's spend)
print(f"mean: control ${rev_c.mean():.2f} treatment ${rev_t.mean():.2f}")
print(f"median: control ${np.median(rev_c):.2f} treatment ${np.median(rev_t):.2f}")
Running it:
mean: control $15.62 treatment $16.84
median: control $12.26 treatment $9.39
Look carefully, because these two lines disagree. The mean says treatment wins: $16.84 versus $15.62. But the median says treatment loses, and not by a little: the typical treatment user spends $9.39, nearly three dollars less than the typical control user’s $12.26. Both numbers are computed correctly on the same data. So which group is better?
The resolution is the skew. The treatment has a fatter tail — recall its standard deviation was nearly double the control’s. Its distribution shifted a lot of users down toward low spending (that’s the lower median) while stretching a few users way up into big spenders (that’s what lifts the mean). The average went up not because typical users spend more, but because a small number of whales spend a great deal, and the mean gives every dollar equal weight regardless of how few users produced it. The average, in other words, is lying about the typical user — it’s telling the truth only about the total.
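To see the mechanism in isolation, here is a minimal sketch on made-up numbers. The five-user groups below are hypothetical toy data, not Lumen’s, chosen so the arithmetic can be checked by hand; it uses only Python’s standard library.

```python
from statistics import mean, median

# Hypothetical toy spends (NOT Lumen's data): five users per group.
control = [10, 11, 12, 13, 14]
treatment = [5, 6, 7, 8, 100]   # four low spenders plus one whale

print(f"mean:   control {mean(control):.2f}  treatment {mean(treatment):.2f}")
print(f"median: control {median(control):.2f}  treatment {median(treatment):.2f}")

# Drop the single whale and the treatment mean falls below control's.
no_whale = [x for x in treatment if x < 100]
print(f"treatment mean without the whale: {mean(no_whale):.2f}")
```

Four of the five toy treatment users spend less than every control user, yet the treatment mean (25.20) is more than double the control mean (12.00) because one user contributes 100. The median (7 versus 12) is untouched by how large that one value is.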
When the mean and the median tell opposite stories, you need a test that compares groups without letting a few whales dominate: one that asks how users compare head-to-head rather than how the dollars add up. That’s the Mann-Whitney U test.
The Mann-Whitney U Test
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is nonparametric: instead of comparing means, it throws away the raw dollar magnitudes and works only with ranks. It pools every value from both groups, sorts them, and asks a simple question — do one group’s values tend to sit higher in the ranking than the other’s? Equivalently: if you pick a random user from each group, how often does the treatment user out-spend the control user? Because it uses ranks and not magnitudes, a single $500 whale counts as exactly one high rank — it can’t drag the result the way it drags a mean. That’s what makes the test robust to skew and outliers.
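The two descriptions above (ranking the pooled values, and counting head-to-head wins) give the same number, and a by-hand sketch makes that concrete. The helper names `u_by_pairs` and `u_from_ranks` and the toy data are hypothetical, written only for illustration; this is not how scipy implements the test internally.

```python
from bisect import bisect_left, bisect_right

def u_by_pairs(x, y):
    """U for x: count pairs where the x value is larger (a tie counts as half)."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def u_from_ranks(x, y):
    """Same U from ranks: sum x's ranks in the pooled sort, minus the smallest possible sum."""
    pooled = sorted(list(x) + list(y))
    # Average 1-based rank of v among the pooled values (ties share the mean rank).
    rank_sum_x = sum((bisect_left(pooled, v) + bisect_right(pooled, v) + 1) / 2
                     for v in x)
    return rank_sum_x - len(x) * (len(x) + 1) / 2

# Hypothetical toy data (NOT Lumen's): one whale in the treatment group.
control = [10, 11, 12, 13, 14]
treatment = [5, 6, 7, 8, 100]
big_whale = [5, 6, 7, 8, 100_000]   # same users, whale made 1,000x larger

u = u_by_pairs(treatment, control)
print(u, u_from_ranks(treatment, control))    # both routes agree
print(u / (len(treatment) * len(control)))    # share of pairs treatment wins
print(u_by_pairs(big_whale, control))         # unchanged by the whale's size
```

On this toy data both routes give U = 5, so the treatment user wins 5 of the 25 possible pairs (0.2). Inflating the whale from 100 to 100,000 leaves U at 5, because the whale still holds exactly one top rank.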
scipy runs it in one call:
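# Treatment goes first, so the reported U counts the pairs the treatment user wins
u = stats.mannwhitneyu(rev_t, rev_c, alternative="two-sided")
print(f"U = {u.statistic:.0f} p = {u.pvalue:.2e}")
# U / (n1 * n2) turns that count into a share of all treatment-vs-control pairs
print(f"P(random treatment user > random control user) = {u.statistic/(rev_t.size*rev_c.size):.3f}")
Running it: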
U = 3782666 p = 1.14e-21
P(random treatment user > random control user) = 0.420
Two things jump out. First, the p-value is 1.14e-21: this is not borderline like the t-test’s 0.0566; it is overwhelmingly significant. There is a real, reliable difference in the ranks of the two groups. Second, and read this carefully, the difference points the opposite way from the mean. That last line, U / (n1 · n2) = 0.420, is the common-language effect size, also called the probability of superiority: a randomly chosen treatment user out-spends a randomly chosen control user only 42% of the time. Flip it around: 58% of the time, the control user spends more. On a per-user basis, control is ahead, decisively, even though the treatment’s average is higher.
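The 42% and 58% figures can be checked with plain arithmetic on the reported output, using the identity that the two groups’ U statistics sum to n1 · n2. The variable names below are illustrative; the numbers are the ones printed above.

```python
# Figures taken from the lesson's reported output.
u_treatment = 3_782_666        # U for the treatment group
n_t, n_c = 1500, 6000          # treatment and control sample sizes

n_pairs = n_t * n_c                  # every treatment user vs every control user
u_control = n_pairs - u_treatment    # the two U statistics sum to n1 * n2

p_treatment_wins = u_treatment / n_pairs
p_control_wins = u_control / n_pairs
# With no tied values (effectively guaranteed for continuous lognormal draws),
# these two shares are pure win rates and add up to 1.
print(f"pairs compared: {n_pairs:,}")
print(f"treatment wins: {p_treatment_wins:.3f}   control wins: {p_control_wins:.3f}")
```

Of the 9,000,000 possible head-to-head pairs, the treatment user wins about 3.78 million and the control user about 5.22 million, which is where 0.420 and 0.580 come from.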
Which average are you optimizing?
The t-test and the Mann-Whitney test disagree here because they answer different questions, and both answers are correct. The mean is about total revenue: sum every dollar and divide by users. A handful of whales genuinely can lift total revenue even while most users spend less, and if you’re forecasting quarterly revenue that’s the number you want. The median and the Mann-Whitney test are about the typical user’s experience: the median reports the middle user’s spend, and Mann-Whitney ignores how large the whales are and asks who wins the head-to-head matchup more often. Neither is “more correct.” Before you pick a test, decide which question the decision rests on: is Lumen optimizing total revenue, or the experience of its median customer? The data can’t answer that; the business must.
Reading Two Stories at Once
It’s tempting to declare Mann-Whitney the winner — “the t-test was borderline and this one is p ≈ 1e-21, so trust the ranks.” Resist that. The rank test is not confirming or denying the t-test; it is measuring a different thing, and treating it as a tiebreaker gets the logic exactly backwards.
A few points to hold together honestly:
- The t-test on the mean is still valid. With samples this large, the Central Limit Theorem makes the sampling distribution of the mean approximately normal even though the underlying revenue is badly skewed. So the t-test isn’t “wrong” or invalidated by skew: it’s correctly telling you that the $1.22 difference in means is not clearly distinguishable from noise (p = 0.0566). Mann-Whitney is not more rigorous; it just answers the typical-user question instead of the total-revenue question.
- The two disagree because the treatment did two things at once. It shifted a lot of mass down toward low spenders (dropping the median and the probability of superiority to 0.42) and stretched a thin tail up toward big spenders (lifting the mean). One change hurts the typical user; the other helps the total. That’s why the mean and the ranks point different ways.
- Report both, and when they disagree, dig in. A single number would have shipped the wrong decision here whichever one you’d chosen alone. Seeing the mean and the median and the probability of superiority is what reveals that the treatment is a trade — worse for most users, potentially better for total revenue via a few whales.
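One way to make “report both” a habit is a small helper that always returns the mean view and the typical-user view together. This is a sketch, not a prescribed workflow: `compare_groups` and its dictionary keys are hypothetical names, and the toy input is made up. It relies on `scipy.stats.ttest_ind` with `equal_var=False` (Welch’s t-test) and on `scipy.stats.mannwhitneyu` reporting U for the first sample, as in the call used earlier in this lesson.

```python
import numpy as np
from scipy import stats

def compare_groups(treatment, control):
    """Return the mean view and the typical-user view side by side."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    welch = stats.ttest_ind(t, c, equal_var=False)            # test on the means
    mwu = stats.mannwhitneyu(t, c, alternative="two-sided")   # test on the ranks
    return {
        "mean_diff": t.mean() - c.mean(),            # total-revenue question
        "welch_p": welch.pvalue,
        "median_diff": np.median(t) - np.median(c),  # typical-user question
        "p_superiority": mwu.statistic / (t.size * c.size),
        "mwu_p": mwu.pvalue,
    }

# Hypothetical toy data (NOT Lumen's): four low spenders and one whale.
report = compare_groups([5, 6, 7, 8, 100], [10, 11, 12, 13, 14])
for name, value in report.items():
    print(f"{name:>14}: {value:.4g}")
```

A positive `mean_diff` alongside a negative `median_diff` and a `p_superiority` below 0.5 is the signature of the trade described above. Calling `compare_groups(rev_t, rev_c)` would apply the same report to Lumen’s arrays.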
So when should you reach for Mann-Whitney? When you care about the typical unit rather than the total, or when the mean is so dominated by a few outliers that you no longer trust it to describe your users. It’s not a replacement for the t-test — it’s the second lens that keeps a skewed average from fooling you.
Practice Exercises
Exercise 1: Higher mean, lower median
On Lumen’s data the treatment has a higher mean ($16.84 vs $15.62) but a lower median ($9.39 vs $12.26). How is that possible on a single dataset, and what does it tell you about the two groups?
Hint
It’s possible because revenue is right-skewed and the mean and median measure different things. The median is the middle user, unaffected by how large the biggest spenders are; the mean gives every dollar equal weight, so a thin tail of whales pulls it upward. A lower treatment median with a higher treatment mean means the treatment pushed most users toward lower spending while stretching a few users into very high spending — the typical user got worse, but the total got propped up by outliers.
Exercise 2: What does 0.420 mean?
The Mann-Whitney output gives U / (n1·n2) = 0.420. A teammate reads this as “the treatment is 42% better.” What does the number actually mean, and who is winning?
Hint
It’s the probability of superiority, not a percentage improvement. It says that if you pick one random treatment user and one random control user, the treatment user spends more only 42% of the time — which means the control user spends more 58% of the time. So on a head-to-head, typical-user basis, control is winning. The teammate has it backwards: 0.420 below 0.5 is evidence against the treatment for the per-user question, not for it.
Exercise 3: Does p ≈ 1e-21 overrule the t-test?
The t-test gave p = 0.0566 (not significant) and Mann-Whitney gives p ≈ 1e-21 (wildly significant). Does the tiny Mann-Whitney p-value mean the t-test was wrong and the treatment “really” loses?
Hint
No. They test different hypotheses, so one can’t overrule the other. The t-test asks whether the means differ; thanks to the Central Limit Theorem it’s valid even on skewed data, and it correctly says the evidence for a difference in means is borderline. Mann-Whitney asks whether one group’s values tend to rank higher; it correctly says control’s typically do. Both are right about their own question. The decision isn’t “which test wins” but “which question matters”: total revenue (mean) or the typical user (ranks). Report both and let the business goal choose.
Summary
Skewed metrics — revenue, time on site, order value — carry a long right tail of big spenders, and under that skew the mean stops describing the typical user. On Lumen’s exact data the treatment mean is higher ($16.84 vs $15.62) while its median is lower ($9.39 vs $12.26): the average rose only because a thin tail of whales dragged it up, not because typical users spend more. The Mann-Whitney U test answers the question the mean can’t — using ranks instead of magnitudes, it found a random treatment user out-spends a random control user just 42% of the time (p ≈ 1e-21), so on a per-user basis control wins. Crucially, this doesn’t overrule the t-test: the mean is about total revenue and the ranks are about the typical user, and both answers are correct. Report both, and when they disagree, dig into why.
Key Concepts
- Skew breaks the mean — a fat right tail pulls the average above the typical user, so the mean can rise while the median falls.
- Mann-Whitney uses ranks — it compares whether one group’s values tend to be larger, robust to outliers because a whale is just one high rank.
- Probability of superiority (0.420) — the common-language effect size: how often a random treatment user beats a random control user; below 0.5 means control wins.
- Different questions, both correct — the mean answers total revenue, the ranks answer typical user; choose by what the decision needs, not by which p-value is smaller.
Why This Matters
Revenue and engagement are the metrics businesses live and die by, and they are almost always skewed — which means a naive comparison of averages is one of the easiest ways to ship a change that quietly makes most of your users worse off while a few whales flatter the total. Learning to read the median and the Mann-Whitney effect size alongside the mean is what lets you see that trade-off instead of averaging it away, and to state honestly that a treatment can help total revenue and hurt the typical user at the same time. The intellectually honest move is never to pick the test that agrees with you — it’s to report both and force the decision to name which question it’s answering. Next, you’ll put a full uncertainty range around the difference in means with a confidence interval, so the borderline result gets a range, not just a verdict.
Next Steps
Continue to Lesson 4 - Confidence Intervals for the Difference in Means
Turn the borderline p-value into a range: how wide is the plausible band around Lumen's $1.22 lift, and does it cross zero?
Back to Module Overview
Return to the Analyzing Mean Metrics module overview
Continue Building Your Skills
You saw the average lie: on Lumen’s own data the treatment mean is higher while its median is lower, because a thin tail of big spenders drags the average above the typical user. The Mann-Whitney U test cut through the skew with ranks and found control winning the head-to-head 58% of the time — a different question from the mean, and both answers correct. The discipline is to report both and let the business goal decide which matters. Next you’ll put a confidence interval around the difference in means, so Lumen’s $1.22 lift comes with a full range of plausible values rather than a single borderline verdict.