Lesson 5 - Guided Project: An ARIMA Forecast for Cyclepath

Welcome to the Guided Project

This module built the whole ARIMA family from parts, and Module 4 handed you a shortlist of orders. Now you fit real models to Cyclepath, forecast the year you held out, and — crucially — measure the result against the bar that’s been waiting since Module 1: the seasonal-naive baseline at 5.9% MAPE. This capstone is different from the others in one important way: it ends in an honest negative result. That’s not a failure of the work; it’s the single clearest motivation for the next module, and learning to recognize and report it is a real forecasting skill.

By the end of this project, you will be able to:

  • Fit and forecast multiple non-seasonal ARIMA candidates on a real series
  • Score forecasts against a baseline using MAPE on a held-out test set
  • Recognize when AIC and out-of-sample error disagree, and why
  • Diagnose a too-good-to-be-true result as a degenerate model, and draw the right conclusion

Let’s fit some models — and be honest about what they can and can’t do.


Stage 1: Setup and the Bar to Beat

Rebuild Cyclepath, split off the last 12 months, and re-establish the seasonal-naive baseline from Module 1:

import warnings
import numpy as np, pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Silences all warnings, including the statsmodels convergence warning that matters in Stage 4
warnings.filterwarnings("ignore")

def cyclepath():
    """Synthetic monthly trips: linear trend + 12-month sine + Gaussian noise (seed 42)."""
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90*t
    seasonal = 3200*np.sin(2*np.pi*(t-3)/12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend+seasonal+noise).astype(int), index=idx, name="trips")

y = cyclepath()
train, test = y.iloc[:-12], y.iloc[-12:]

def mape(a, f):
    """Mean absolute percentage error of forecast f against actuals a, in percent."""
    return np.mean(np.abs((a - f) / a)) * 100

seasonal_naive = pd.Series(train.iloc[-12:].values, index=test.index)
print(round(mape(test, seasonal_naive), 1))    # 5.9

The bar is 5.9% MAPE — the “replay last year” forecast from Module 1. Any ARIMA worth its complexity has to beat it.
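The one-line baseline above works because the forecast horizon equals the season length. A minimal sketch of a slightly more general version, assuming NumPy only, might look like the following. The helper name seasonal_naive_forecast and the toy numbers are illustrative and are not part of the Cyclepath code.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (breaks down if any actual is 0)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

def seasonal_naive_forecast(history, horizon, season=12):
    """Hypothetical helper: repeat the last full season until `horizon` values are filled."""
    last_season = np.asarray(history, dtype=float)[-season:]
    return np.resize(last_season, horizon)  # np.resize cycles through the source values

# Toy data with a 4-step season (made up for illustration)
history = [10, 20, 30, 40, 12, 22, 32, 42]
actual = [12, 24, 30, 40, 15, 20]

forecast = seasonal_naive_forecast(history, horizon=6, season=4)
score = mape(actual, forecast)
print(forecast, round(score, 2))
```

With season=12 and horizon=12 this reduces to the same "replay last year" forecast used for the 5.9% bar.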


Stage 2: Fit the Candidates

Fit a range of non-seasonal ARIMA candidates, each with a drift term (trend="t") to handle Cyclepath’s climb. Score each one two ways: by AIC (in-sample fit) and by the MAPE of its 12-month forecast against the test set (out-of-sample accuracy):

for order in [(0,1,0), (1,1,0), (0,1,1), (1,1,1)]:
    res = ARIMA(train, order=order, trend="t").fit()
    fc = res.forecast(steps=12)
    print(f"ARIMA{order}  AIC={res.aic:7.2f}  MAPE={mape(test, fc):4.1f}")

ARIMA(0, 1, 0)  AIC=1421.58  MAPE=16.0
ARIMA(1, 1, 0)  AIC=1363.72  MAPE=27.0
ARIMA(0, 1, 1)  AIC=1387.26  MAPE=17.9
ARIMA(1, 1, 1)  AIC=1365.36  MAPE=26.0

Two things jump out, and both are important:

  1. Not one candidate comes close to 5.9%. The best test MAPE is 16.0% (the plain drift model, ARIMA(0,1,0)) — nearly three times worse than the simple seasonal-naive baseline. Adding AR and MA terms makes it worse, not better: ARIMA(1,1,0) and ARIMA(1,1,1) land at 27% and 26%.

  2. AIC and test MAPE disagree. ARIMA(1,1,0) has the best AIC (1363.72) but the worst test MAPE (27.0%), with ARIMA(1,1,1) right behind it on both counts (1365.36 and 26.0%), while ARIMA(0,1,0) has the worst AIC (1421.58) but the best test MAPE (16.0%). The two rankings are exact reverses of each other: the lower a model’s AIC, the worse it generalizes.
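To make the disagreement concrete, here is a small sketch that ranks the four candidates both ways. It uses only the scores printed in the Stage 2 output; the dictionary layout and variable names are illustrative choices.

```python
# Scores copied from the Stage 2 output above
results = {
    (0, 1, 0): {"aic": 1421.58, "mape": 16.0},
    (1, 1, 0): {"aic": 1363.72, "mape": 27.0},
    (0, 1, 1): {"aic": 1387.26, "mape": 17.9},
    (1, 1, 1): {"aic": 1365.36, "mape": 26.0},
}
BASELINE_MAPE = 5.9  # seasonal-naive bar from Stage 1

# Lower is better for both criteria
by_aic = sorted(results, key=lambda order: results[order]["aic"])
by_mape = sorted(results, key=lambda order: results[order]["mape"])

print("best to worst by AIC: ", by_aic)
print("best to worst by MAPE:", by_mape)
print("rankings exactly reversed:", by_aic == by_mape[::-1])

# How far the best candidate is from the baseline
best_ratio = results[by_mape[0]]["mape"] / BASELINE_MAPE
print(f"best candidate error is {best_ratio:.2f}x the baseline")
```

Sorting the same table under each criterion is a cheap habit: when the two orderings come out as mirror images, in-sample fit is rewarding something that does not carry over to the forecast.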


Stage 3: Why AIC and Test Error Disagree

This disagreement is worth understanding, because it’s a trap. AIC scores in-sample fit on the differenced series, with a penalty for each extra parameter: how well the model explains the month-to-month changes it was trained on. The AR and MA terms in ARIMA(1,1,0) and ARIMA(1,1,1) do genuinely reduce that in-sample error, earning a better AIC. But what they’re fitting is the seasonal wiggle leaking into the differenced series (the lag-12 structure Module 3 measured), and they fit it as short-lag AR/MA structure, which extrapolates wrongly over a 12-month forecast. The plain drift model (0,1,0) has no such terms to misuse, so it just extends the trend in a straight line: less impressive in-sample, but less catastrophically wrong out-of-sample.

The lesson: AIC is a guide for comparing models on the data they were trained on, not a guarantee of forecast accuracy. When you have a held-out test set, always check out-of-sample error too. When the two disagree this sharply, it’s usually a sign the model class is missing something structural. Here, that something is seasonality.
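The "seasonal leakage" is easy to see on a stripped-down series. The sketch below reuses the trend and seasonal recipe from Stage 1 but leaves out the noise (an assumption made so the numbers are clean), takes the first difference, and measures how strongly the differenced values correlate with themselves at a few lags. The helper lag_corr is a hypothetical name.

```python
import numpy as np

t = np.arange(84)  # 84 training months, matching the train/test split
# Noise-free version of the Stage 1 recipe (assumption: noise dropped for clarity)
y = 9000 + 90 * t + 3200 * np.sin(2 * np.pi * (t - 3) / 12)

d = np.diff(y)  # first difference: what d=1 hands to the AR and MA terms

def lag_corr(x, k):
    """Pearson correlation between x[t] and x[t-k]."""
    return float(np.corrcoef(x[k:], x[:-k])[0, 1])

corr = {k: lag_corr(d, k) for k in (1, 6, 12)}
for k, value in corr.items():
    print(f"lag {k:2d}: {value:+.3f}")
```

Differencing removes the trend but leaves a full 12-month wave behind: the differenced series is strongly positively correlated at lag 1, almost perfectly negatively correlated at lag 6, and almost perfectly positively correlated at lag 12. A short-lag AR or MA term can only latch onto the lag-1 part of that pattern.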

[Figure: A line chart of the Cyclepath series showing the training portion rising with seasonal waves, then the held-out 2023 test portion continuing the seasonal rise and fall (actual, in black). Two forecasts are overlaid on the test region: a nearly flat line labeled 'ARIMA(0,1,0)+drift, MAPE 16.0%' that misses the seasonal swing entirely, and a wavy line labeled 'seasonal-naive, MAPE 5.9%' that tracks the actual curve closely. A legend notes the seasonal-naive baseline wins decisively.]
The core result: the best non-seasonal ARIMA (flat, MAPE 16.0%) can extend the trend but is blind to the seasonal swing, while the far simpler seasonal-naive forecast (wavy, MAPE 5.9%) tracks the actual 2023 curve. A non-seasonal model has no term that can reproduce the yearly rise and fall.

Stage 4: The Model That “Beats” the Baseline — and Why It’s a Mirage

There’s one non-seasonal specification that does beat 5.9%. Fit ARIMA(2,1,2) and it scores a stunning 1.9% MAPE. Before celebrating, inspect how it does it:

import cmath
res = ARIMA(train, order=(2,1,2), trend="t").fit()
fc = res.forecast(steps=12)
print(round(mape(test, fc), 1))                        # 1.9

print(np.round(np.abs(res.arroots), 3))                # [1. 1.]   <- on the unit circle
for r in res.arroots:
    if abs(r.imag) > 1e-6:
        print(round(2*np.pi / abs(cmath.phase(r)), 2)) # 12.03     <- a 12-month cycle
        break

The MAPE is real: 1.9%, genuinely beating the baseline. But look at the roots. The AR polynomial’s roots sit on the unit circle (modulus 1.000 to three decimals), with a phase corresponding to a period of 12.03 months. The model beats the baseline by driving its parameters to a degenerate boundary that synthesizes a 12-month cycle out of an AR(2) term, essentially reinventing seasonality through the back door. The fit also raises a convergence warning, because the optimizer struggles at that boundary. Note that the warnings.filterwarnings("ignore") call in Stage 1 hides it, so remove that call if you want to see it. Roots on the unit circle mean the model is on the edge of non-stationarity, where forecasts are fragile and the fit is unstable.
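To see how an AR(2) term can manufacture a 12-month cycle, consider a hand-built example. The coefficients below are chosen for illustration (they are not the fitted ARIMA(2,1,2) values): phi1 = 2*cos(2*pi/12) and phi2 = -1 place both AR roots exactly on the unit circle at a 12-step period.

```python
import numpy as np

# Illustrative coefficients (assumed, not the fitted values) for
# x[t] = phi1 * x[t-1] + phi2 * x[t-2]
phi1, phi2 = 2 * np.cos(2 * np.pi / 12), -1.0

# Roots of the AR polynomial 1 - phi1*z - phi2*z**2
# (np.roots expects coefficients from the highest power down)
roots = np.roots([-phi2, -phi1, 1.0])
modulus = np.abs(roots)
period = float(2 * np.pi / np.abs(np.angle(roots[0])))

# Iterate the noise-free recursion from arbitrary starting values
x = [1.0, 0.5]
for _ in range(34):
    x.append(phi1 * x[-1] + phi2 * x[-2])
x = np.array(x)

print(np.round(modulus, 3), round(period, 2))
print("repeats every 12 steps:", np.allclose(x[:24], x[12:36]))
```

With roots strictly outside the unit circle the same recursion would decay toward zero. Pinned on the circle, it oscillates forever with a fixed period, which is why such a fit can imitate seasonality and also why it sits on the edge of non-stationarity.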

A too-good non-seasonal result is a red flag, not a win

When a non-seasonal model suddenly forecasts a strongly seasonal series well, the right instinct is suspicion, not celebration. ARIMA(2,1,2) here isn’t capturing Cyclepath’s structure the right way — it’s contorting a general-purpose ARMA into a makeshift seasonal model through unit-circle roots, triggering convergence warnings and sitting on the knife-edge of instability. It works on this particular test year, but it’s the wrong tool used in a way that happens to fit. The honest read is: the fact that you need a degenerate ARMA to fake a 12-month cycle is itself proof that the seasonality should be modeled explicitly — with a seasonal term designed for exactly this, which is what SARIMA provides.


Stage 5: The Takeaway

Step back and state the honest result plainly:

  1. No stable non-seasonal ARIMA beats the seasonal-naive baseline. The best well-behaved candidate (ARIMA(0,1,0)+drift) manages 16.0% MAPE against the baseline’s 5.9% — worse by a factor of nearly three.
  2. Adding AR/MA terms made it worse, not better, because they fit the seasonal leakage in the differenced series as short-lag structure and extrapolate it wrongly.
  3. The one model that beats the baseline is a mirage — ARIMA(2,1,2) fakes a 12-month cycle with unit-circle roots and a convergence warning, which is a red flag, not a solution.

The conclusion isn’t that ARIMA is useless — it handled Cyclepath’s trend correctly through differencing, and it’s the essential foundation for everything ahead. The conclusion is that a non-seasonal model is structurally blind to seasonality: it has no term whose job is to capture “this July resembles last July.” Cyclepath’s dominant structure is exactly that seasonal pattern, so a non-seasonal ARIMA is the wrong shape for it, no matter how you tune the orders.

That is precisely the gap SARIMA fills. Module 6 extends ARIMA with seasonal AR, MA, and differencing terms — (P, D, Q)[s] alongside the familiar (p, d, q) — giving the model an explicit mechanism for the 12-month structure this capstone kept running into. The seasonal-AR(1) candidate that won Module 4’s preview (AIC 1123.93, far below anything here) is where that story picks up.
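As a small preview of why a seasonal term helps, the sketch below compares an ordinary first difference with a lag-12 seasonal difference on a noise-free copy of the Stage 1 recipe. Dropping the noise is an assumption made for clarity, and the example is only an illustration of the idea that Module 6 develops properly.

```python
import numpy as np

t = np.arange(96)
# Noise-free version of the Stage 1 recipe (assumption: noise dropped for clarity)
y = 9000 + 90 * t + 3200 * np.sin(2 * np.pi * (t - 3) / 12)

first_diff = np.diff(y)            # y[t] - y[t-1], the d = 1 of ARIMA
seasonal_diff = y[12:] - y[:-12]   # y[t] - y[t-12], seasonal differencing with s = 12

print("first difference std:   ", round(float(first_diff.std()), 1))
print("seasonal difference std:", round(float(seasonal_diff.std()), 6))
print("seasonal difference mean:", round(float(seasonal_diff.mean()), 1))
```

The first difference still swings by more than a thousand trips from month to month, because the seasonal wave survives it. The lag-12 difference cancels the wave entirely and leaves only the yearly trend increment of 12 × 90 = 1080 trips.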


Practice Exercises

Exercise 1: Report the result honestly

Your manager asks whether the ARIMA model is “ready to ship” after you fit ARIMA(0,1,0)+drift at 16% MAPE. How do you report this, given the 5.9% baseline?

Hint

Honestly: “No — it’s worse than the simple ‘replay last year’ baseline (16% error versus 5.9%), so shipping it would be a step backward from a forecast we could make with one line of code and no model.” The disciplined move is to report the baseline comparison prominently, explain why the ARIMA falls short (it can’t capture seasonality), and recommend the seasonal model (SARIMA) as the next step. Reporting a 16% MAPE without the baseline context would make a bad model look acceptable — the baseline is what turns a bare number into a verdict, exactly as Module 1 established.

Exercise 2: Trust AIC or test MAPE?

ARIMA(1,1,1) had a better AIC than ARIMA(0,1,0) but a worse test MAPE. For choosing a model to actually deploy for forecasting, which should you weight more heavily, and why?

Hint

Weight the out-of-sample test MAPE more heavily for a deployment decision, because it directly measures what you actually care about: how accurately the model forecasts data it hasn’t seen. AIC is an in-sample score — useful for comparing models efficiently when you can’t afford a full held-out test, but it can favor a model that fits the training data’s quirks (here, seasonal leakage) in ways that don’t generalize. When you have a proper held-out test set, as here, its error is the more trustworthy guide. Ideally you’d use both: AIC to shortlist quickly, held-out error to make the final call — which is exactly the workflow Module 4 previewed and Module 6 will formalize.

Exercise 3: Spot the degenerate model elsewhere

Besides a convergence warning and unit-circle roots, what other symptom might tip you off that a model is “faking it” rather than genuinely fitting?

Hint

A few tells: wildly large or near-canceling coefficients (ARIMA(2,1,2)’s AR and MA terms nearly cancel each other, a classic sign of an over-parameterized model straining), forecast intervals that behave strangely (collapsing too narrow or exploding), instability under small data changes (refit on a slightly different training window and the coefficients swing dramatically), or an out-of-sample result too good to be plausible given how a simpler model performs. The general principle: when a model’s behavior looks pathological even though its score looks great, trust the behavior. A genuinely good fit is stable, converges cleanly, and has interpretable coefficients — not roots pinned to the edge of stationarity.
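Two of these tells can be checked mechanically from a model's AR and MA roots. The sketch below is a hypothetical helper, not a statsmodels function, and its tolerance is an arbitrary illustrative threshold. In practice you could pass it the arrays that a fitted statsmodels result exposes as res.arroots (used in Stage 4) and res.maroots.

```python
import numpy as np

def root_diagnostics(ar_roots, ma_roots, tol=0.05):
    """Hypothetical helper: flag unit-circle roots and near-cancelling AR/MA root pairs.

    `tol` is an arbitrary illustrative threshold, not a statistical test.
    """
    ar = np.atleast_1d(np.asarray(ar_roots, dtype=complex))
    ma = np.atleast_1d(np.asarray(ma_roots, dtype=complex))
    all_roots = np.concatenate([ar, ma])
    on_circle = bool(np.any(np.abs(np.abs(all_roots) - 1.0) < tol))
    # Smallest distance between any AR root and any MA root
    gap = float(np.abs(ar[:, None] - ma[None, :]).min()) if ar.size and ma.size else float("inf")
    return {"on_unit_circle": on_circle, "min_ar_ma_gap": gap, "near_cancelling": gap < tol}

# Made-up roots for illustration only (not taken from any fitted model)
healthy = root_diagnostics(ar_roots=[1.8], ma_roots=[-2.5])
suspect = root_diagnostics(ar_roots=[0.866 + 0.5j, 0.866 - 0.5j],
                           ma_roots=[0.87 + 0.49j, 0.87 - 0.49j])
print(healthy)
print(suspect)
```

An AR root that nearly coincides with an MA root means the two factors almost cancel, so the extra parameters add little beyond numerical strain. A flag from this helper is a prompt to look closer, not a verdict on its own.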


Summary

You fit non-seasonal ARIMA candidates to Cyclepath and scored their forecasts against the 5.9% seasonal-naive baseline. The honest result: no stable non-seasonal model beats it. The best (ARIMA(0,1,0)+drift) reached only 16.0% MAPE, and adding AR/MA terms made things worse (as high as 27.0%) by fitting seasonal leakage as short-lag structure. AIC and test MAPE disagreed: ARIMA(1,1,0) won on AIC (1363.72) but came last on MAPE (27.0%), a reminder that in-sample fit doesn’t guarantee out-of-sample accuracy. The one model that beat the baseline, ARIMA(2,1,2) at 1.9%, did so via a degenerate solution that fakes a 12-month cycle (AR roots on the unit circle, period 12.03, convergence warning), which is a red flag, not a win. The structural conclusion: a non-seasonal ARIMA handles trend but is blind to seasonality, which is exactly the gap SARIMA fills.

Key Concepts

  • Baseline comparison is the verdict — a 16% MAPE only means something next to the 5.9% baseline it fails to beat.
  • AIC vs. out-of-sample error — they can disagree; the held-out test error is the more trustworthy guide for deployment.
  • Seasonal leakage misfits — non-seasonal AR/MA terms fit the seasonal wiggle in the differenced series and extrapolate it wrongly.
  • Degenerate models — unit-circle roots, convergence warnings, and near-canceling coefficients signal a model faking structure it wasn’t built for.

Why This Matters

The most valuable skill this capstone teaches isn’t a technique — it’s the discipline to reach, recognize, and honestly report a negative result. A non-seasonal ARIMA that loses to a one-line baseline is not a wasted effort; it’s a precise diagnosis of what the data needs (explicit seasonal terms) and a rigorous justification for the more complex model coming next. Forecasters who skip this step and reach for complexity without proving it’s needed end up shipping models that lose to “replay last year” without ever knowing it. You now know exactly why Cyclepath needs SARIMA — not as an assumption, but as a measured, demonstrated conclusion. Module 6 builds it.


Next Steps

Continue to Module 6 - Seasonality: SARIMA

Add explicit seasonal terms to ARIMA and finally beat the seasonal-naive baseline on Cyclepath.


Continue Building Your Skills

You’ve reached the honest, demonstrated conclusion this module was built toward: a non-seasonal ARIMA, however carefully fit and tuned, can’t cleanly beat a simple seasonal baseline on a seasonal series — because it has no term for seasonality at all. Next, Module 6 adds exactly that: seasonal AR, MA, and differencing terms that model the 12-month structure explicitly, turning ARIMA into SARIMA and finally beating the 5.9% bar for real.