Lesson 4 - Forecast Intervals and Their Honesty
Welcome to Forecast Intervals and Their Honesty
Every forecast in this course has come with a point prediction and, since Module 5’s get_forecast, a confidence interval alongside it. A stated 95% interval is a specific, testable claim: across many forecasts, the actual value should land inside it about 95% of the time. This lesson checks that claim directly, using the same walk-forward machinery from Lessons 2 and 3, and finds a real, meaningful failure alongside a real, meaningful success.
By the end of this lesson, you will be able to:
- Define empirical coverage and explain why checking it requires many forecasts, not one
- Backtest a model’s stated confidence interval across several origins
- Distinguish a poorly calibrated interval from one that is merely uninformatively wide
- Explain why a good forecast interval needs both correct coverage and narrow width
Let’s put a stated confidence level to the test.
What Coverage Means, and Why One Forecast Cannot Check It
A 95% interval does not promise that any single forecast’s interval will be correct. It promises that if you compute many such intervals, about 95% of them will contain the true value. Checking that promise on one forecast is meaningless: the actual value either falls inside or it does not, and a single coin flip tells you nothing about whether the coin is fair. This is exactly Lesson 1’s argument, now applied to intervals instead of point forecasts: you need many forecasts, at several different origins, to measure whether a stated confidence level is honest.
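The coin-flip argument can be made concrete with a small simulation. This is only a sketch under idealized assumptions that are not part of the lesson's data: the actual values are independent draws from a normal distribution, and the interval is built from the true standard deviation, so its real coverage is about 95% by construction. The function name `running_coverage` is made up for this illustration.

```python
import numpy as np

def running_coverage(n_forecasts, sigma=1.0, z=1.96, seed=0):
    """Empirical coverage of a correctly specified 95% interval (toy setup).

    Assumption: each actual value is an independent draw from N(0, sigma^2),
    and the interval is [-z * sigma, +z * sigma].
    """
    rng = np.random.default_rng(seed)
    actual = rng.normal(0.0, sigma, n_forecasts)
    hit = np.abs(actual) <= z * sigma  # one True/False per forecast
    return hit.mean()

# With one forecast the result can only be 0.0 or 1.0; it settles toward
# 0.95 only as the number of forecasts grows.
for n in (1, 10, 100, 10_000):
    print(n, round(running_coverage(n), 3))
```

A single forecast can only ever report 0% or 100% coverage, which is why it cannot confirm or refute a 95% claim.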
Backtesting SARIMA’s Interval
Reuse the six origins from Lessons 2 and 3, and check each one’s 95% interval against what actually happened:
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def cyclepath():
    # Synthetic monthly trip counts: linear trend + yearly seasonality + noise
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()
h = 6                             # forecast horizon, in months
origins = list(range(60, 96, 6))  # six origins: 60, 66, ..., 90
total, inside, widths = 0, 0, []
for origin in origins:
    # Train on everything before the origin, test on the next h months
    train, test = y.iloc[:origin], y.iloc[origin:origin + h]
    model = SARIMAX(train, order=(1, 1, 0), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
    ci = model.get_forecast(steps=h).conf_int(alpha=0.05)  # 95% interval
    lo, hi = ci.iloc[:, 0].values, ci.iloc[:, 1].values
    # Count the test months whose actual value fell inside the interval
    hits = ((test.values >= lo) & (test.values <= hi)).sum()
    inside += hits
    total += h
    widths.extend((hi - lo).tolist())
print(total, inside, round(inside / total * 100, 1))
print(round(np.mean(widths), 1))

36 36 100.0
2144.6

Every single one of the 36 forecasted months, across all six origins, fell inside SARIMA’s stated 95% interval, and the average interval width was about 2,145 trips. Empirical coverage of 100% against a nominal 95% is on the conservative side, suggesting an interval slightly wider than strictly necessary, but with only 36 forecasts it is consistent with the stated level (95% of 36 is about 34). It is an honest, trustworthy interval: when SARIMA said it was 95% confident, the actual value showed up inside that range every time.
A Genuinely Under-Covering Interval
Not every model’s stated confidence is this trustworthy. Fit a plain random-walk-with-drift model, order (0, 1, 0) with a linear trend term and no seasonal structure at all, the same way at each origin:
from statsmodels.tsa.arima.model import ARIMA

# Reuses y, h, and origins from the SARIMA backtest above
total_rw, inside_rw, widths_rw = 0, 0, []
for origin in origins:
    train, test = y.iloc[:origin], y.iloc[origin:origin + h]
    # Random walk with drift: first difference plus a trend term, no seasonality
    model = ARIMA(train, order=(0, 1, 0), trend="t").fit()
    ci = model.get_forecast(steps=h).conf_int(alpha=0.05)  # 95% interval
    lo, hi = ci.iloc[:, 0].values, ci.iloc[:, 1].values
    hits = ((test.values >= lo) & (test.values <= hi)).sum()
    inside_rw += hits
    total_rw += h
    widths_rw.extend((hi - lo).tolist())
print(total_rw, inside_rw, round(inside_rw / total_rw * 100, 1))
print(round(np.mean(widths_rw), 1))

36 30 83.3
8743.4

Only 30 of 36 forecasted months, 83.3%, fell inside this model’s stated 95% interval, a real, meaningful shortfall from the 95% it claims. This model has no seasonal term at all, so its point forecasts miss the seasonal swing directly, and its interval, despite being about four times as wide as SARIMA’s (8,743 versus 2,145), is still not wide enough to reliably cover a systematically biased point forecast. A model whose stated confidence does not match its actual reliability is not just less accurate; it actively misleads anyone who trusts the number it reports.
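Why a wider interval can still under-cover is easier to see in a stripped-down setting. The sketch below is a toy model, not the lesson's backtest: it assumes the actual value is normally distributed around zero with standard deviation `sigma`, and that the forecast interval is centered at a fixed offset `bias` from the truth. The helper names `normal_cdf` and `coverage_with_bias` are invented for the example, and in the lesson the bias is seasonal and changes month to month rather than staying constant.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def coverage_with_bias(bias, half_width, sigma=1.0):
    """P(actual in [bias - half_width, bias + half_width]) for actual ~ N(0, sigma^2).

    Assumption: a constant bias and normal errors, purely for illustration.
    """
    upper = (bias + half_width) / sigma
    lower = (bias - half_width) / sigma
    return normal_cdf(upper) - normal_cdf(lower)

print(round(coverage_with_bias(0.0, 1.96), 3))  # unbiased, standard 95% half-width
print(round(coverage_with_bias(2.0, 1.96), 3))  # same width, center off by 2 sigma
print(round(coverage_with_bias(2.0, 3.0), 3))   # wider interval, still off-center
```

In this toy setting the unbiased interval covers about 95% of the time, the off-center one of the same width covers roughly 48%, and widening it by half only brings coverage back to about 84%. Extra width is an inefficient cure for a point forecast that is aimed at the wrong place.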
Coverage Is Not the Whole Story: Sharpness Matters Too
There is a third case worth checking, because it shows that correct coverage alone is not enough to call an interval good. Fit a non-seasonal ARIMA, (1, 1, 1), the same specification that struggled in Module 5:
# Reuses y, h, origins, and the ARIMA import from the earlier blocks
total_ns, inside_ns, widths_ns = 0, 0, []
for origin in origins:
    train, test = y.iloc[:origin], y.iloc[origin:origin + h]
    # Non-seasonal ARIMA(1, 1, 1) with no trend term
    model = ARIMA(train, order=(1, 1, 1), trend="n").fit()
    ci = model.get_forecast(steps=h).conf_int(alpha=0.05)  # 95% interval
    lo, hi = ci.iloc[:, 0].values, ci.iloc[:, 1].values
    hits = ((test.values >= lo) & (test.values <= hi)).sum()
    inside_ns += hits
    total_ns += h
    widths_ns.extend((hi - lo).tolist())
print(round(inside_ns / total_ns * 100, 1))
print(round(np.mean(widths_ns), 1))

100.0
10731.8

This model also achieves 100% coverage, technically just as calibrated as SARIMA. But its average interval width is 10,732, about five times as wide as SARIMA’s 2,145. An interval that is wide enough will cover almost anything, so correct coverage by itself does not mean the interval is useful. A weather forecast that says “between 20 and 100 degrees, 95% confident” is technically well calibrated and completely useless. The property that distinguishes a genuinely good interval from a merely wide one is called sharpness: how narrow the interval is, given that it still achieves its stated coverage. SARIMA is the only one of these three models with both properties at once: correct coverage and the sharpest interval.
| Model | Empirical coverage (nominal 95%) | Average interval width |
|---|---|---|
| SARIMA | 100.0% | 2,144.6 |
| Non-seasonal ARIMA(1,1,1) | 100.0% | 10,731.8 |
| Random walk with drift | 83.3% | 8,743.4 |
Two questions, not one
Checking a forecast interval always means asking two separate questions. First: does it cover the truth as often as it claims (calibration)? Second, and only meaningful once the first is answered well: how narrow is it while still covering the truth (sharpness)? The random-walk model failed the first question outright. The non-seasonal ARIMA passed the first question but failed the second: technically honest, practically useless. SARIMA is the only one here that passed both, which is exactly what makes its stated 95% confidence something you can actually act on.
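Because both questions are asked of the same set of backtested intervals, it can be convenient to answer them in one place. The helper below is one possible way to package the two calculations the loops in this lesson already perform. The name `interval_report` and the dictionary layout are assumptions made for this sketch, and the numbers in the example call are invented to show the mechanics, not results from this lesson.

```python
import numpy as np

def interval_report(actual, lower, upper, nominal=0.95):
    """Answer both questions for one set of backtested intervals.

    Hypothetical helper: returns empirical coverage (calibration)
    and mean width (sharpness) side by side.
    """
    actual, lower, upper = (np.asarray(a, dtype=float) for a in (actual, lower, upper))
    inside = (actual >= lower) & (actual <= upper)
    return {
        "n": int(actual.size),
        "nominal": nominal,
        "coverage": float(inside.mean()),             # question 1: calibration
        "mean_width": float((upper - lower).mean()),  # question 2: sharpness
    }

# Made-up values purely to show the call; not data from this lesson.
report = interval_report(
    actual=[10, 12, 15, 11],
    lower=[8, 11, 9, 12],
    upper=[12, 14, 14, 16],
)
print(report)
```

In a backtest like the ones above, you would collect `test.values`, `lo`, and `hi` across all origins and pass the concatenated arrays in once per model, which makes it harder to report coverage without the width that gives it meaning.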
Practice Exercises
Exercise 1: Reading a coverage number below nominal
A model’s 90% interval, backtested across 40 forecasts, covers the actual value 30 times. Is this model’s interval trustworthy?
Hint
No. 30 out of 40 is 75% empirical coverage against a stated 90%, a real, meaningful shortfall, similar in spirit to this lesson’s random-walk-with-drift result (83.3% against a stated 95%). A gap this size across 40 forecasts is unlikely to be explained by noise alone; it suggests the model is overconfident, with intervals that are too narrow, or centered in the wrong place, for the errors it actually makes. Its stated interval should not be trusted at face value until the underlying model is improved or the interval is recalibrated.
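The "unlikely to be explained by noise alone" judgment can be given a rough number with a binomial calculation. This is a sketch under a strong assumption that is not stated in the exercise: the 40 forecasts are treated as independent, each with a 90% chance of being covered. Backtested forecasts that share an origin are usually correlated, so the result is a guide rather than a formal test. The function name `prob_at_most` is made up for this example.

```python
from math import comb

def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p).

    Read as: the chance of seeing k or fewer covered forecasts out of n
    if the interval's true coverage really were p (independence assumed).
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Exercise 1: 30 hits out of 40 against a stated 90% interval
p_value = prob_at_most(30, 40, 0.90)
print(round(p_value, 4))
```

Under those assumptions the probability comes out at roughly half a percent, which supports reading the shortfall as a calibration problem rather than bad luck.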
Exercise 2: Comparing two well-calibrated intervals
Two models both achieve exactly 95% empirical coverage on their 95% intervals. Model A’s average width is 500; Model B’s is 2,000. Which is the better forecasting model, all else equal?
Hint
Model A. When two intervals both achieve correct coverage, the sharper (narrower) one gives you more precise, more useful information about where the true value is likely to fall. Correct coverage is necessary but not sufficient; once two models both clear that bar, sharpness is the tiebreaker. That is exactly the comparison that separated SARIMA from the non-seasonal ARIMA in this lesson: both covered every backtested value, but SARIMA’s interval was about one fifth as wide.
Exercise 3: Why not just make every interval extremely wide?
If wide intervals are easier to get correct coverage from, why not always report a very wide interval to be safe?
Hint
Because an extremely wide interval, while technically hard to miss, stops being useful for any real decision. The non-seasonal ARIMA’s 10,732-trip-wide interval in this lesson is a real example: it is honest, but knowing that next month’s ridership will be “somewhere in a 10,000-trip range” is not something an operations team can act on the way they could act on SARIMA’s much tighter 2,145-trip range. The goal is not maximum safety through vagueness. It is the narrowest interval that still keeps its promise, which is precisely why sharpness matters as much as coverage.
Summary
A stated 95% forecast interval is a testable claim about empirical coverage, checkable only across many forecasts, never a single one. Backtested across all 36 forecasts from this module’s six origins, SARIMA’s 95% interval achieved 100% coverage at an average width of 2,144.6 trips. A random-walk-with-drift model achieved only 83.3% coverage, a real, measured failure of its stated confidence, despite an interval about four times as wide (8,743.4). A non-seasonal ARIMA also achieved 100% coverage, but only by being about five times as wide as SARIMA (10,731.8): technically calibrated but far less useful. Good forecast intervals need both correct coverage and narrow width, called sharpness, and only SARIMA achieved both here.
Key Concepts
- Empirical coverage: the fraction of backtested forecasts where the actual value fell inside the stated interval; only measurable across many forecasts.
- Under-coverage: an interval that misses more often than its stated confidence level allows, a real failure that a single test set can hide.
- Sharpness: how narrow an interval is, given that it still achieves correct coverage; a wide interval can be technically calibrated and still uninformative.
- Calibration and sharpness together: a genuinely trustworthy interval needs both, not just one.
Why This Matters
A model’s point forecast being accurate is only half of what makes it trustworthy in practice. The other half is whether its stated uncertainty means what it says, and this lesson showed both a real failure of that promise (the random-walk model) and a real case where the promise was kept but uselessly (the non-seasonal ARIMA), alongside a model that got both right (SARIMA). Checking coverage and sharpness together, across many backtested forecasts, is exactly how you would catch either failure before deploying a model that reports confidence it has not actually earned. The guided project now brings every tool from this module together, backtesting every model this course has built across the same six origins, with a result that changes which model looks best.
Next Steps
Continue to Lesson 5 - Guided Project: Backtesting Every Model From the Course
Backtest seasonal-naive, SARIMA, and Holt-Winters across six origins, and watch the single-split ranking from Modules 6 and 7 get overturned.
Back to Module Overview
Return to the Evaluation and Backtesting module overview
Continue Building Your Skills
You have now checked both halves of what makes a forecast trustworthy: whether the point prediction is accurate across many origins, and whether the stated confidence interval around it actually means what it claims. The guided project puts every model this course has built through the full battery of tools from this module, and the ranking it produces is not the one Modules 6 and 7 reported.