Lesson 4 - Diagnostics: Is the Model Adequate?
Welcome to Diagnostics: Is the Model Adequate?
Lesson 3’s forecast looked great — it tracked the seasonal curve almost exactly. But “looks right on the test set” isn’t proof a model is sound; it could be fitting this particular year well by luck while leaving real structure unexplained. Diagnostics answer the harder question: are the model’s residuals white noise? If the residuals still contain a pattern, the model missed something, no matter how good one forecast looks. This is the exact check Module 4’s preview model failed — and this lesson shows the finished model passing it.
By the end of this lesson, you will be able to:
- State what a residual diagnostic is checking: are the residuals white noise?
- Run and interpret the Ljung-Box test for leftover autocorrelation
- Read the summary’s built-in normality and heteroskedasticity checks
- Distinguish an adequate model from one that only looks right, and judge borderline cases honestly
Let’s interrogate the model.
What Adequate Means: White-Noise Residuals
A model has extracted all the predictable structure when its residuals — the gaps between what it predicted and what happened — look like pure random noise. Concretely, adequate residuals have four properties:
- Zero mean — the model isn’t systematically over- or under-predicting.
- No autocorrelation — no leftover pattern the model could have used but didn’t.
- Roughly normal — errors are symmetric and not dominated by wild outliers.
- Constant variance — the error size doesn’t grow or shrink over time (homoskedasticity).
Start with the first: the residual mean.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
def cyclepath():
    # Synthetic monthly trips: linear trend + additive seasonality + noise
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90*t
    seasonal = 3200*np.sin(2*np.pi*(t-3)/12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend+seasonal+noise).astype(int), index=idx, name="trips")
y = cyclepath()
train = y.iloc[:-12]  # hold out the final 12 months
res = SARIMAX(train, order=(1, 1, 0), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
resid = res.resid.iloc[res.loglikelihood_burn:]  # drop the burn-in period
print(round(resid.mean(), 2))  # -13.64
print(round(resid.std(), 1))  # 399.5
The residual mean is -13.64, essentially zero relative to a series that runs in the thousands and a residual standard deviation of about 400. There is no systematic bias. (We drop the initial loglikelihood_burn observations, which the state-space model uses to initialize and whose residuals aren’t meaningful.)
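One way to put a number on “essentially zero” is to compare the mean with its standard error. The sketch below plugs in the two values printed above and assumes n = 71 residuals (84 training months minus a 13-observation burn-in). That count is an assumption, though it is consistent with the ±0.233 band used later in this lesson. It also treats the residuals as roughly independent.

```python
import math

# Values printed in the lesson
resid_mean = -13.64
resid_std = 399.5
# Assumption: 84 training months minus a 13-observation burn-in
n = 71

std_error = resid_std / math.sqrt(n)  # standard error of the mean
t_ratio = resid_mean / std_error      # distance from zero, in standard errors

print(round(std_error, 1))
print(round(t_ratio, 2))
# True when the mean is indistinguishable from zero at the 5% level
print(abs(t_ratio) < 1.96)
```

If the ratio sits well inside ±1.96, as it does under these assumptions, the mean gives no evidence of systematic over- or under-prediction.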
The Ljung-Box Test
The most important diagnostic is the Ljung-Box test, which you met at the very end of Module 4. Its null hypothesis is that the residuals are white noise — no autocorrelation up to a chosen lag — so a large p-value (above 0.05) is what you want: it means you can’t reject whiteness, i.e. no leftover structure was detected.
from statsmodels.stats.diagnostic import acorr_ljungbox
lb = acorr_ljungbox(resid, lags=[6, 12, 18], return_df=True)
print(lb.round(4))
    lb_stat  lb_pvalue
6   11.0310     0.0874
12  15.6228     0.2091
18  23.0097     0.1902
All three p-values are above 0.05 (0.087 at lag 6, 0.209 at lag 12, 0.190 at lag 18), so the test fails to reject white noise at every horizon. The model’s residuals show no significant leftover autocorrelation, including at the all-important seasonal lag 12. This is the milestone Module 4’s preview model couldn’t reach:
preview = SARIMAX(train, order=(0, 0, 0), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
pr = preview.resid.iloc[preview.loglikelihood_burn:]
print(round(acorr_ljungbox(pr, lags=[12], return_df=True)["lb_pvalue"].iloc[0], 5))  # 0.0
The Module 4 preview model SARIMA(0,0,0)(1,1,0)[12] scores a Ljung-Box p-value of 0.0 at lag 12, decisively rejecting whiteness, with a residual lag-1 autocorrelation of 0.335 still sitting there unexplained. That model beat the non-seasonal ARIMAs on AIC but was not adequate: it left real structure on the table. Adding the non-seasonal AR term and first differencing (going from (0,0,0) to (1,1,0)) is what cleaned up those residuals and turned a promising-but-inadequate model into an adequate one.
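To see what the test is doing under the hood, here is a minimal hand-rolled sketch of the Ljung-Box statistic, run on two synthetic series rather than the Cyclepath residuals. The helper name ljung_box and both series are illustrative assumptions. The sketch uses h degrees of freedom with no adjustment for fitted parameters; statsmodels’ acorr_ljungbox offers a model_df argument for that adjustment.

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, h):
    """Return (Q, p-value) for the Ljung-Box test over lags 1..h."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.sum(x**2)
    # Sample autocorrelations at lags 1..h
    rho = np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, h + 1)])
    # Q = n(n+2) * sum_k rho_k^2 / (n - k)
    q = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, h + 1)))
    # Under the white-noise null, Q is approximately chi-squared with h df
    return q, chi2.sf(q, df=h)

rng = np.random.default_rng(0)
white = rng.normal(size=500)  # pure noise: no structure to find

# AR(1) series with coefficient 0.8: autocorrelated by construction
shocks = rng.normal(size=500)
ar = np.zeros(500)
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + shocks[t]

q_white, p_white = ljung_box(white, 12)
q_ar, p_ar = ljung_box(ar, 12)
print(f"white noise: Q={q_white:.2f}, p={p_white:.4f}")
print(f"AR(1):       Q={q_ar:.2f}, p={p_ar:.4g}")
```

The autocorrelated series produces a far larger Q and a p-value near zero, the same signature the preview model’s residuals showed.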
The Summary’s Built-In Diagnostics
statsmodels computes several diagnostics automatically and prints them at the bottom of .summary():
print(res.summary().tables[2])
===================================================================================
Ljung-Box (L1) (Q):                   0.13   Jarque-Bera (JB):                 0.30
Prob(Q):                              0.72   Prob(JB):                         0.86
Heteroskedasticity (H):               0.76   Skew:                            -0.05
Prob(H) (two-sided):                  0.52   Kurtosis:                         3.30
===================================================================================
Three checks, all reassuring:
- Jarque-Bera tests normality: Prob(JB) = 0.86 (well above 0.05) means the residuals are consistent with a normal distribution. Skew is -0.05 (near zero, symmetric) and kurtosis is 3.30 (near the normal value of 3).
- Heteroskedasticity (H) tests whether the variance changes over time: Prob(H) = 0.52 means no evidence of changing variance. The error size is stable across the series, matching Cyclepath’s confirmed additive structure.
- Ljung-Box (L1) is a quick one-lag version of the test you just ran in full: Prob(Q) = 0.72, so no lag-1 autocorrelation was detected.
Together with the multi-lag Ljung-Box, these say the residuals pass all four white-noise properties: zero mean, no autocorrelation, normal, constant variance.
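The Jarque-Bera figures in the summary can be checked by hand from the reported skew and kurtosis. The sketch below assumes 71 residuals enter the test (84 training months minus a 13-observation burn-in), which the lesson does not state directly, and uses the rounded values from the table, so small rounding differences are possible.

```python
import math

# Values read from the summary table
skew = -0.05
kurtosis = 3.30
# Assumption: 71 residuals (84 training months minus a 13-step burn-in)
n = 71

# Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4)
jb = n / 6 * (skew**2 + (kurtosis - 3) ** 2 / 4)
# Under normality JB is approximately chi-squared with 2 degrees of freedom,
# whose survival function has the closed form exp(-x / 2)
p_value = math.exp(-jb / 2)

print(round(jb, 2))       # 0.3
print(round(p_value, 2))  # 0.86
```

Both numbers match the table, which shows that the statistic only measures how far skew and kurtosis sit from the normal values of 0 and 3, scaled by the sample size.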
An Honest Wrinkle
No real model is perfect, and it’s worth being honest about a small blemish. If you inspect the individual residual autocorrelations, a couple of lags sit just outside the significance band:
from statsmodels.tsa.stattools import acf
ra = acf(resid, nlags=6, fft=True)
band = 1.96 / np.sqrt(len(resid))
print(round(band, 3)) # 0.233
print([(i, round(ra[i], 3)) for i in range(1, 7) if abs(ra[i]) > band])
# [(2, -0.24), (3, -0.255)]
Lags 2 and 3 marginally cross the ±0.233 band (-0.24 and -0.255). Does this sink the model? No, for two reasons rooted in earlier lessons. First, the joint Ljung-Box test, which properly accounts for testing many lags at once, still passes (p = 0.209 at lag 12): it weighs all lags together rather than flagging isolated crossings. Second, recall Module 4’s multiple-testing lesson: when you check many lags, a couple will cross the band by chance alone (about 42% of pure-noise series show at least one such crossing). Two marginal, barely-significant spikes with no seasonal meaning, against an otherwise-clean residual set and a passing joint test, is well within that expectation. The honest verdict: adequate, not flawless. The model is good enough to trust, provided you note exactly where its small imperfection lies rather than pretending it’s spotless.
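The multiple-testing point can be illustrated with a short sketch. The first part uses the textbook shortcut 1 - 0.95^k, which assumes k independent checks. The second simulates pure-noise series and counts how often at least one lag leaves the band. The series length (71), the number of lags (10), and the helper name any_crossing are illustrative assumptions, so the simulated share is not expected to reproduce the 42% figure exactly.

```python
import numpy as np

# Analytic shortcut, assuming k independent checks at the 5% level
for k in (6, 10, 12):
    print(k, round(1 - 0.95**k, 3))

def any_crossing(x, max_lag):
    """True if any sample autocorrelation at lags 1..max_lag leaves +/-1.96/sqrt(n)."""
    x = x - x.mean()
    denom = np.sum(x**2)
    band = 1.96 / np.sqrt(len(x))
    rho = [np.sum(x[k:] * x[:-k]) / denom for k in range(1, max_lag + 1)]
    return bool(np.max(np.abs(rho)) > band)

rng = np.random.default_rng(0)
n_series, n_obs, max_lag = 2000, 71, 10  # illustrative choices
share = np.mean(
    [any_crossing(rng.normal(size=n_obs), max_lag) for _ in range(n_series)]
)
print(round(float(share), 3))  # share of pure-noise series with a crossing
```

Either way, a substantial fraction of series with no structure at all show a spike outside the band, which is why isolated marginal crossings are weak evidence on their own.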
Adequate is the goal, not perfect
Chasing a model whose every single residual autocorrelation sits inside the band is usually a mistake — it leads to over-parameterized models that fit the training data’s noise (the overfitting trap from Module 5’s ARIMA(2,1,2)). The right standard is the joint Ljung-Box test plus the normality and variance checks, read together. A model that passes those, with only minor isolated wrinkles explainable by multiple testing, is adequate. Report the wrinkle honestly, but don’t add terms to chase it — that trades a real, stable model for a fragile one.
Practice Exercises
Exercise 1: Read a Ljung-Box result
A SARIMA’s residuals give a Ljung-Box p-value of 0.003 at lag 12. Is the model adequate? What would you do?
Hint
No — a p-value of 0.003 is well below 0.05, so the test rejects white noise: there’s significant leftover autocorrelation, meaning the model missed real structure (exactly like the Module 4 preview model at p = 0.000). You’d examine where the residual autocorrelation sits — a spike at a short lag suggests adding a non-seasonal AR or MA term; a spike at the seasonal lag suggests strengthening the seasonal orders — then refit and re-test. The Ljung-Box test doesn’t just say “inadequate,” it points you toward what to add by which lags still show structure.
Exercise 2: Which check catches growing variance?
Cyclepath is additive (constant seasonal swing), so its variance is stable. If you fit a SARIMA to a multiplicative series without log-transforming it first, which diagnostic would most likely flag the problem?
Hint
The heteroskedasticity test (H) — it checks whether residual variance changes over time. A multiplicative series has a seasonal swing that grows with the level (Module 2/3), so a model that assumes constant variance would leave residuals that are small early and large later, which the H test would flag with a low Prob(H). The fix is the one from Module 4 of the stationarity module: log-transform the series first to stabilize the variance, then fit the SARIMA. This is why the heteroskedasticity check exists — it catches exactly the additive-vs-multiplicative mismatch.
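As a rough illustration of how such a check works, the sketch below compares the residual sum of squares in the last third of a series with that in the first third, in the spirit of the break-variance idea behind the summary’s H statistic. The helper break_variance and both synthetic series are assumptions made for illustration; the library’s exact handling (for example, how it sizes the thirds or treats burn-in) may differ.

```python
import numpy as np
from scipy.stats import f

def break_variance(resid):
    """H = sum of squares (last third) / sum of squares (first third), two-sided F p-value."""
    resid = np.asarray(resid, dtype=float)
    h = len(resid) // 3
    stat = np.sum(resid[-h:] ** 2) / np.sum(resid[:h] ** 2)
    # Two-sided: an H far from 1 in either direction counts as evidence
    p_value = 2 * min(f.cdf(stat, h, h), f.sf(stat, h, h))
    return stat, min(p_value, 1.0)

rng = np.random.default_rng(1)
n = 300
stable = rng.normal(0, 1.0, n)                          # constant error size
growing = rng.normal(0, 1.0, n) * np.linspace(1, 5, n)  # error size grows over time

for name, series in [("stable", stable), ("growing", growing)]:
    stat, p = break_variance(series)
    print(f"{name}: H={stat:.2f}, Prob(H)={p:.4f}")
```

The growing-variance series yields an H well above 1 and a very small Prob(H), the pattern you would expect from residuals of an untransformed multiplicative series.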
Exercise 3: A good forecast but a failing test
A model forecasts your test year with low error but fails the Ljung-Box test (p = 0.01). Should you trust it?
Hint
Be cautious. A low test error on one held-out year is encouraging but not conclusive — it could be that this particular year happened to suit the model. Failing Ljung-Box means the residuals contain leftover structure, so the model is provably not extracting everything it could, and its good performance may not generalize to other years or longer horizons. The disciplined move is to treat the failing diagnostic as the more fundamental signal: try to fix the residual structure (add the terms the residual autocorrelation points to) and see if you get a model that both forecasts well and passes diagnostics. When a good forecast and a failing test disagree, the failing test is the warning worth heeding — much like AIC vs. test error in Module 5.
Summary
Model adequacy means the residuals are white noise: zero mean, no autocorrelation, roughly normal, constant variance. The SARIMA(1,1,0)(1,1,0)[12] residuals had a near-zero mean (-13.64) and passed the Ljung-Box test at every horizon (p = 0.087, 0.209, 0.190 at lags 6, 12, 18) — where Module 4’s preview model (0,0,0)(1,1,0) failed at p = 0.000 with a leftover lag-1 autocorrelation of 0.335. The summary’s built-in checks confirmed normality (Jarque-Bera p = 0.86) and constant variance (heteroskedasticity p = 0.52). An honest wrinkle — residual autocorrelations at lags 2 and 3 marginally crossing the band — is tolerated by the joint Ljung-Box test and explainable by Module 4’s multiple-testing logic, making the model adequate, not flawless.
Key Concepts
- White-noise residuals — the four properties (zero mean, no autocorrelation, normal, constant variance) that define an adequate model.
- Ljung-Box test — a large p-value (> 0.05) means no leftover autocorrelation detected; the finished model passes at 0.209.
- Summary diagnostics — Jarque-Bera (normality) and heteroskedasticity (constant variance) checks come built in.
- Adequate, not perfect — judge by the joint test plus normality/variance; tolerate minor isolated wrinkles rather than overfitting to chase them.
Why This Matters
Diagnostics are what separate a forecaster who knows their model is sound from one who’s merely hoping. The Ljung-Box test is the single most important habit in this module: it caught that Module 4’s promising preview model was secretly inadequate, and it certified that the finished model genuinely extracts the structure it should. Learning to read these tests — and to judge borderline cases honestly rather than overfitting to make every last spike disappear — is the difference between a model you can defend and one that will surprise you in production. Next, the capstone ties everything together: fitting, diagnosing, and forecasting a SARIMA end to end, and finally beating the seasonal-naive baseline that has stood since Module 1.
Next Steps
Continue to Lesson 5 - Guided Project: A SARIMA Forecast for Cyclepath
Fit, diagnose, and forecast a SARIMA end to end — and beat the 5.9% seasonal-naive baseline for the first time in the course.
Back to Module Overview
Return to the Seasonality: SARIMA module overview
Continue Building Your Skills
You can now interrogate a model rather than trust it on looks — running Ljung-Box, reading the built-in normality and variance checks, and judging a borderline residual wrinkle honestly. Next, the capstone brings the whole module together: fit the SARIMA, confirm it passes diagnostics, forecast the held-out year, and compare against the seasonal-naive baseline that has been the bar to beat since the very first module.