Lesson 5 - Guided Project: Analyze Lumen's Revenue Experiment

Welcome to the Guided Project

Across this module you took mean metrics apart: the two-sample t-test, Welch’s correction for unequal variances, confidence intervals for a difference in means, and the Mann-Whitney test for when skew makes the average lie. Now you’ll put all four to work on one decision — Lumen’s revenue experiment — and reach a verdict the raw average would have gotten wrong. The new signup page lifted conversion (you proved that in Module 4). The question here is different and trickier: did it lift revenue per user? The average says yes. By the end of this project you’ll know why the honest answer is “no — and possibly the opposite.”

By the end of this project, you will be able to:

  • Run a complete mean-metric analysis: Welch t-test, confidence interval, and a rank check
  • Recognize when a skewed average is misleading and reach for the median and Mann-Whitney
  • Reconcile a metric where the mean and the typical user disagree
  • Write an experiment readout that surfaces the tension instead of cherry-picking the average

We’ll build it in stages, reusing the exact tools from earlier in the module. Let’s analyze the revenue.


Stage 1: The Data

Lumen recorded revenue per user for both groups over the experiment window. The new page (treatment) went to a smaller slice of traffic (1,500 users against 6,000 for control), and the spending is heavily right-skewed, as money metrics usually are. We regenerate the same seeded data used throughout the module:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)   # control: current page
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)   # treatment: new page

print(f"control   mean ${rev_c.mean():.2f}   sd ${rev_c.std(ddof=1):.2f}   n={rev_c.size}")
print(f"treatment mean ${rev_t.mean():.2f}   sd ${rev_t.std(ddof=1):.2f}   n={rev_t.size}")

Output:
control   mean $15.62   sd $12.42   n=6000
treatment mean $16.84   sd $23.89   n=1500

The treatment average is $16.84 versus $15.62 — a $1.22 lift. A tempting headline. But note the two warning signs you’ve learned to watch for: the treatment’s standard deviation ($23.89) is nearly double the control’s, and revenue is skewed. Both mean we can’t take the average at face value.
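Both warning signs can be put into numbers before any test is run. The sketch below is one way to do that and is not part of the module's original code: it regenerates the Stage 1 data, then reports sample skewness, the mean-to-median ratio, and the share of revenue that comes from the top 1% of users. The helper top_share and the 1% cutoff are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Same seeded data as Stage 1.
rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)   # control
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)   # treatment

def top_share(x, frac=0.01):
    """Share of total revenue from the top `frac` of users (hypothetical helper)."""
    k = max(1, int(round(frac * x.size)))
    return np.sort(x)[-k:].sum() / x.sum()

for name, x in [("control", rev_c), ("treatment", rev_t)]:
    print(f"{name:9s} skewness {stats.skew(x):5.2f}   "
          f"mean/median {x.mean() / np.median(x):.2f}   "
          f"top 1% share {top_share(x):.1%}")

# Ratio of sample standard deviations: values well above 1 argue for Welch.
sd_ratio = rev_t.std(ddof=1) / rev_c.std(ddof=1)
print(f"sd ratio (treatment / control) = {sd_ratio:.2f}")
```

A mean-to-median ratio well above 1 and a large top-1% share both say the same thing: a few users carry a disproportionate part of the average.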


Stage 2: The Welch T-Test

Because the variances are unequal and the groups are different sizes, we use Welch’s t-test — equal_var=False — exactly as Lesson 2 established:

# Treatment is passed first, so a positive t means the treatment mean is higher
welch = stats.ttest_ind(rev_t, rev_c, equal_var=False)
print(f"Welch t = {welch.statistic:.3f}   p = {welch.pvalue:.4f}")

Output:
Welch t = 1.908   p = 0.0566

p = 0.0566: not significant at the 5% level. The $1.22 lift is not large enough, relative to the noise, to rule out chance. And recall the trap from Lesson 2: Student’s pooled t-test on this same data returns p = 0.0063, a confident “significant, ship it!” that cannot be trusted. The pooled test assumes both groups share one variance. Here the pooled estimate is dominated by the large, tight control group, so it understates the noise contributed by the small, volatile treatment group, and the standard error comes out too small. Trusting the default (equal_var=True) would have shipped a revenue win that isn’t statistically there. Welch keeps you honest.
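To see where the two p-values part ways, it can help to compute both standard errors by hand. The following is a minimal sketch using the textbook pooled and Welch formulas on the Stage 1 data; the variable names are illustrative, not taken from the lesson.

```python
import numpy as np
from scipy import stats

# Same seeded data as Stage 1.
rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)

n_c, n_t = rev_c.size, rev_t.size
v_c, v_t = rev_c.var(ddof=1), rev_t.var(ddof=1)
diff = rev_t.mean() - rev_c.mean()

# Student: one pooled variance, weighted by group size, so the big control
# group dominates it.
v_pooled = ((n_c - 1) * v_c + (n_t - 1) * v_t) / (n_c + n_t - 2)
se_pooled = np.sqrt(v_pooled * (1 / n_c + 1 / n_t))

# Welch: each group's variance is divided by its own sample size.
se_welch = np.sqrt(v_c / n_c + v_t / n_t)

student = stats.ttest_ind(rev_t, rev_c, equal_var=True)
welch = stats.ttest_ind(rev_t, rev_c, equal_var=False)

print(f"pooled SE {se_pooled:.3f}   t = {diff / se_pooled:.3f}   p = {student.pvalue:.4f}")
print(f"Welch  SE {se_welch:.3f}   t = {diff / se_welch:.3f}   p = {welch.pvalue:.4f}")
```

The difference in means is identical in both tests; only the denominator changes. When the smaller group is also the noisier one, the pooled standard error is the smaller of the two, which inflates t and shrinks p.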


Stage 3: The Confidence Interval

A p-value alone is thin. The confidence interval for the difference in means (Lesson 4) shows the plausible size of the lift, in dollars:

# 95% by default; pass confidence_level=... for a different level
ci = welch.confidence_interval()
print(f"95% CI for the mean difference: [${ci.low:.2f}, ${ci.high:.2f}]")

Output:
95% CI for the mean difference: [$-0.03, $2.47]

The interval runs from -$0.03 to +$2.47 — it includes $0. So a true lift of zero (or even a hair negative) remains plausible; we simply don’t have evidence of a revenue increase. This is the same story the p-value told, now in dollars: the CI includes 0 exactly because Welch’s test is non-significant. The point estimate looks positive, but the uncertainty swallows it.
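The interval can also be rebuilt by hand, which makes the link to the p-value concrete. This is a sketch under the standard Welch-Satterthwaite approximation for the degrees of freedom; the variable names are illustrative, and the result should agree with welch.confidence_interval() up to rounding.

```python
import numpy as np
from scipy import stats

# Same seeded data as Stage 1.
rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)

n_c, n_t = rev_c.size, rev_t.size
a = rev_c.var(ddof=1) / n_c     # control's share of the squared standard error
b = rev_t.var(ddof=1) / n_t     # treatment's share

diff = rev_t.mean() - rev_c.mean()
se = np.sqrt(a + b)

# Welch-Satterthwaite degrees of freedom.
df = (a + b) ** 2 / (a ** 2 / (n_c - 1) + b ** 2 / (n_t - 1))

# Two-sided 95% interval: estimate plus or minus t_crit standard errors.
t_crit = stats.t.ppf(0.975, df)
low, high = diff - t_crit * se, diff + t_crit * se

welch = stats.ttest_ind(rev_t, rev_c, equal_var=False)
print(f"diff ${diff:.2f}   SE ${se:.3f}   df {df:.0f}")
print(f"95% CI [${low:.2f}, ${high:.2f}]   Welch p = {welch.pvalue:.4f}")
```

Because the interval and the test share the same standard error and degrees of freedom, the 95% interval contains 0 exactly when the two-sided p-value is at or above 0.05.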


Stage 4: The Skew Check and Mann-Whitney

Here’s where a careful analyst earns their keep. Revenue is skewed, so before trusting any conclusion about the mean, we check the median and run the rank-based Mann-Whitney test (Lesson 3):

# Medians describe the typical user; the mean is pulled by the tail
print(f"median: control ${np.median(rev_c):.2f}   treatment ${np.median(rev_t):.2f}")

# Treatment goes first, so U counts the (treatment, control) pairs that treatment wins
u = stats.mannwhitneyu(rev_t, rev_c, alternative="two-sided")
prob_sup = u.statistic / (rev_t.size * rev_c.size)   # U / (n_t * n_c)
print(f"U = {u.statistic:.0f}   p = {u.pvalue:.2e}")
print(f"P(random treatment user > random control user) = {prob_sup:.3f}")

Output:
median: control $12.26   treatment $9.39
U = 3782666   p = 1.14e-21
P(random treatment user > random control user) = 0.420

This flips the story. The treatment median is lower — $9.39 versus $12.26 — so the typical user spends less under the new page. Mann-Whitney is overwhelmingly significant (p ≈ 10⁻²¹) and points toward control: a randomly chosen treatment user out-spends a randomly chosen control user only 42% of the time — meaning 58% of head-to-head matchups favor control. The higher mean was a mirage produced by a fatter tail of a few big spenders; it did not reflect a better experience for most users.

[Figure: a right-skewed revenue distribution for the treatment group, with the median marked at $9.39 near the bulk of users and the mean marked at $16.84 pulled to the right by a long tail of big spenders. Two boxes contrast "the mean says treatment wins" ($16.84 vs $15.62, up $1.22, driven by the tail) with "the ranks say control wins" (a random treatment user out-spends a control user just 42% of the time).]
Caption: The treatment's higher mean is dragged up by a few big spenders in the tail, while its lower median and the Mann-Whitney result show the typical user spends less. The average and the typical user disagree.
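Two quick checks make the reversal easier to trust. The sketch below, which is an illustration rather than part of the lesson's code, first recomputes the 42% figure by brute force over every treatment-control pair, then compares the two groups percentile by percentile to show where treatment falls behind and where it pulls ahead. The choice of percentiles is an assumption.

```python
import numpy as np
from scipy import stats

# Same seeded data as Stage 1.
rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)

# Brute force: compare every treatment user with every control user
# (1500 x 6000 = 9 million pairs) and take the fraction treatment wins.
pairwise = (rev_t[:, None] > rev_c[None, :]).mean()

u = stats.mannwhitneyu(rev_t, rev_c, alternative="two-sided")
prob_sup = u.statistic / (rev_t.size * rev_c.size)
print(f"pairwise win rate {pairwise:.3f}   U / (n_t * n_c) {prob_sup:.3f}")

# Percentile-by-percentile comparison of the two distributions.
qs = [25, 50, 75, 90, 99]
pct_c = np.percentile(rev_c, qs)
pct_t = np.percentile(rev_t, qs)
for q, c, t in zip(qs, pct_c, pct_t):
    print(f"p{q:<2d}  control ${c:7.2f}   treatment ${t:7.2f}")
```

If the lower percentiles favor control and only the top percentiles favor treatment, the higher treatment mean is coming from the tail, which is the pattern the figure describes.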

Stage 5: The Readout and Decision

Now assemble the findings into a readout — the deliverable a real analyst ships:

Lumen new-page experiment: revenue readout

Average revenue per user was $1.22 higher in treatment (control $15.62 → treatment $16.84). However, the lift is not statistically significant (Welch’s t-test, p = 0.057; 95% CI [-$0.03, +$2.47] includes zero). Moreover, the metric is heavily skewed and the picture reverses for the typical user: the median is lower in treatment (control $12.26 → treatment $9.39), and a Mann-Whitney test strongly favors control (p ≈ 10⁻²¹; a random treatment user out-spends a control user only 42% of the time).

Recommendation: do not ship on the revenue case. There is no significant per-user revenue gain, and the typical user appears to spend less. The apparent average lift is an artifact of a few high spenders, not a broad improvement.

The decision: do not ship on the revenue case. And note the nuance against Module 4: the same new page significantly lifted signup conversion. So the honest, complete story is not “the page is good” or “the page is bad.” It is that the page wins more signups but does not improve, and may worsen, revenue per user. That tension is precisely what a good readout surfaces, and precisely what a glance at the +$1.22 average would have buried.

An average is not a summary

For money and time metrics, a single average almost never tells the whole story, because a handful of extreme values can move it independently of what happens to everyone else. Total revenue can rise while most users spend less; average session time can climb because of a few marathon sessions while the typical visit shortens. The discipline this module teaches — check the median, run a rank test, look at the distribution — isn’t optional polish; it’s how you avoid shipping a change that the average flattered and the reality didn’t support. When the mean and the median disagree, that disagreement is the finding.
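A toy example shows how far apart the two summaries can drift. The numbers below are made up for illustration and have nothing to do with Lumen's data: ten users each spend $10, then nine of them drop to $8 while one jumps to $60.

```python
import numpy as np

before = np.full(10, 10.0)               # ten users, $10 each
after = np.array([8.0] * 9 + [60.0])     # nine users drop to $8, one jumps to $60

for label, x in [("before", before), ("after", after)]:
    print(f"{label:6s} total ${x.sum():.0f}   mean ${x.mean():.2f}   median ${np.median(x):.2f}")

# Fraction of users who individually spend less than before.
share_spending_less = (after < before).mean()
print(f"share of users spending less: {share_spending_less:.0%}")
```

Total revenue rises from $100 to $132 and the mean from $10.00 to $13.20, while the median falls from $10.00 to $8.00 and 90% of users spend less.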


Practice Exercises

Exercise 1: Analyze log-revenue

Instead of raw dollars, run the Welch t-test on np.log(rev_c) and np.log(rev_t). Why might a team analyze log-revenue for a skewed metric, and what does the test now actually compare?

Hint

Taking logs pulls in the long right tail, making the distribution far more symmetric, so the mean of the logs is a stabler, less outlier-driven summary. The t-test on it compares geometric means (roughly, the typical multiplicative level) rather than arithmetic means, which often better reflects the typical user. The catch: a difference in mean log-revenue doesn’t translate directly back to a difference in dollars. Exponentiating it gives a ratio of geometric means (a percentage change), so you trade interpretability in raw units for robustness. It’s a common, defensible choice for skewed money metrics.
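One possible way to set up the log-revenue analysis is sketched below; treat it as a starting point rather than the reference solution. It runs Welch's test on the logs and back-transforms the difference into a ratio of geometric means. Since the Stage 1 data were generated with log-means of 2.28 and 2.50, that ratio should land near exp(-0.22) ≈ 0.80.

```python
import numpy as np
from scipy import stats

# Same seeded data as Stage 1.
rng = np.random.default_rng(3)
rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=6000)
rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=1500)

log_c, log_t = np.log(rev_c), np.log(rev_t)

# Welch's test on the log scale (the variances still differ after logging).
welch_log = stats.ttest_ind(log_t, log_c, equal_var=False)

# exp(mean of logs) is the geometric mean; exp(difference) is their ratio.
gm_c = np.exp(log_c.mean())
gm_t = np.exp(log_t.mean())
gm_ratio = np.exp(log_t.mean() - log_c.mean())

print(f"Welch on logs: t = {welch_log.statistic:.3f}   p = {welch_log.pvalue:.2e}")
print(f"geometric mean: control ${gm_c:.2f}   treatment ${gm_t:.2f}")
print(f"ratio treatment / control = {gm_ratio:.3f}")
```

A ratio below 1 reads as "the typical multiplicative level of spending is lower in treatment," which is a different question from whether the arithmetic mean in dollars went up.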

Exercise 2: Which metric should decide?

Suppose Lumen’s finance team cares only about total revenue (they’ll happily take a few big spenders), while the product team cares about the typical user’s experience. Could the same experiment support shipping for one and not the other?

Hint

Yes — that’s exactly the mean-vs-median split. Total revenue is mean × users, so if the mean genuinely rose (even from the tail), finance’s metric improves; the product team’s typical-user metric (median / Mann-Whitney) got worse. Here the mean lift isn’t even significant, so the finance case is weak too — but the general lesson stands: different stakeholders optimize different summaries of the same distribution, and the “right” decision depends on which question the business is actually asking. Naming the metric up front (Module 2) is how you avoid this fight after the fact.

Exercise 3: More traffic

If Lumen ran the new page to far more users and the effect sizes stayed the same, what would happen to the Welch p-value and the confidence interval — and would that resolve the disagreement between mean and median?

Hint

More data shrinks the standard error, so the p-value would drop and the confidence interval would narrow — likely making the +$1.22 mean lift statistically significant and pushing the CI above 0. But it would not resolve the mean-vs-median conflict: more data makes both the mean gap and the median gap more certain, so you’d have a significant mean increase and a significant typical-user decrease at the same time. Sample size sharpens the estimates; it doesn’t make a skewed metric tell one story.
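A small simulation can illustrate the hint. The sketch below draws a hypothetical experiment with ten times the traffic from the same two distributions used in Stage 1; the 10x sizes, the helper run_experiment, and the seed for the larger run are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def run_experiment(n_c, n_t, seed):
    """Draw revenue from the Stage 1 distributions and summarize both gaps."""
    rng = np.random.default_rng(seed)
    rev_c = rng.lognormal(mean=2.50, sigma=0.70, size=n_c)
    rev_t = rng.lognormal(mean=2.28, sigma=1.05, size=n_t)
    welch = stats.ttest_ind(rev_t, rev_c, equal_var=False)
    se = np.sqrt(rev_c.var(ddof=1) / n_c + rev_t.var(ddof=1) / n_t)
    return {
        "mean_diff": rev_t.mean() - rev_c.mean(),
        "se": se,
        "welch_p": welch.pvalue,
        "median_diff": np.median(rev_t) - np.median(rev_c),
    }

small = run_experiment(6000, 1500, seed=3)      # the original experiment
large = run_experiment(60000, 15000, seed=0)    # hypothetical 10x traffic, arbitrary seed

for label, r in [("original", small), ("10x traffic", large)]:
    print(f"{label:12s} mean diff ${r['mean_diff']:+.2f}   SE ${r['se']:.3f}   "
          f"Welch p = {r['welch_p']:.2e}   median diff ${r['median_diff']:+.2f}")
```

With ten times the users the standard error shrinks by roughly a factor of the square root of 10, so the mean gap becomes much easier to detect, while the median gap stays negative. More traffic sharpens both findings without reconciling them.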


Summary

You analyzed Lumen’s revenue experiment end to end and reached a decision the raw average would have gotten wrong. The mean rose +$1.22 ($15.62 → $16.84), but Welch’s t-test returned p = 0.0566 (not significant — while Student’s pooled test wrongly said 0.0063), and the 95% CI [-$0.03, +$2.47] includes zero. The skew check overturned the headline: the treatment median fell ($12.26 → $9.39), and Mann-Whitney (p ≈ 10⁻²¹) showed a random treatment user out-spends a control user only 42% of the time — the higher mean was a fat-tail artifact. The verdict: do not ship on revenue, even though the same page significantly lifted conversion in Module 4. Every number here was computed for real with numpy and scipy. The readout surfaces the genuine tension between signups and per-user revenue rather than hiding it behind a flattering average.

Key Concepts

  • Full mean-metric workflow — Welch t-test, confidence interval, then a median/rank check.
  • Welch over Student — the pooled test falsely called this revenue lift significant.
  • Mean vs. typical user — a higher mean with a lower median; Mann-Whitney favored control.
  • Honest readout — report the tension (signups up, revenue not) instead of the convenient number.

Why This Matters

This is the analysis that separates a trustworthy experimentation practice from one that ships flattering-average illusions. Revenue and engagement decisions ride on mean metrics, and mean metrics are exactly where skew and unequal variance conspire to mislead — a naive t-test on this data would have shipped a “revenue win” that most users didn’t experience. Knowing to run Welch, quantify with a CI, and cross-check the average against the median and a rank test is what lets you give a straight answer when the data is messy and the stakes are real. Next, you’ll study the traps that invalidate experiments before the analysis even begins — peeking, multiple comparisons, and Simpson’s paradox.


Next Steps

Continue to Module 6 - Pitfalls and Validity

Peeking, multiple comparisons, Simpson's paradox, and the traps that invalidate real experiments.

Back to Module Overview

Return to the Analyzing Mean Metrics module overview


Continue Building Your Skills

You ran a complete mean-metric analysis and made the honest call: Lumen’s new page doesn’t improve per-user revenue and may lower it for the typical user, even as it wins more signups. That kind of disciplined, skeptical analysis — Welch, confidence interval, median, and Mann-Whitney together — is what money metrics demand. Next you’ll step back from the tests themselves to the traps that can invalidate an experiment before you ever compute a p-value: peeking, multiple comparisons, and Simpson’s paradox.