Lesson 4 - Fitting and Backtesting Candidate Models

Welcome to Fitting and Backtesting Candidate Models

Lesson 3 established that first differencing alone is enough for stationarity, and found real evidence that a seasonal term is still needed despite that. This lesson turns those findings into fitted models, tests several SARIMA specifications against a single held-out block, and then, following Module 8’s discipline rather than trusting that one split, backtests the leading candidates across several origins before recommending anything.

By the end of this lesson, you will be able to:

  • Fit several SARIMA specifications and compare them on held-out accuracy
  • Explain why AIC cannot fairly compare models built with different differencing orders
  • Fit a multiplicative Holt-Winters model on this series
  • Backtest both model families across multiple origins and read the result honestly

Let’s fit and compare.


Five SARIMA Candidates, One Held-Out Block

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def lantern_vine():
    idx = pd.date_range("2020-01-06", periods=208, freq="W-MON")
    t = np.arange(208)
    rng = np.random.default_rng(7)
    growth_rate = np.where(t < 104, 0.008, 0.003)
    log_level = np.cumsum(growth_rate)
    level = 500 * np.exp(log_level)
    seasonal_factor = 1 + 0.35 * np.sin(2 * np.pi * (t - 35) / 52)
    noise_factor = rng.normal(1, 0.04, 208)
    y = level * seasonal_factor * noise_factor
    return pd.Series(np.round(y).astype(int), index=idx, name="units_sold")

y = lantern_vine()
h = 26
train, test = y.iloc[:-h], y.iloc[-h:]
logtrain = np.log(train)
def mape(a, f):
    # Mean absolute percentage error, in percent; a = actuals, f = forecasts.
    return np.mean(np.abs((a - f) / a)) * 100

seasonal_naive = pd.Series(train.iloc[-52:-52 + h].values, index=test.index)
print(round(mape(test, seasonal_naive), 2))

candidates = [
    ((1, 1, 0), (1, 0, 0, 52)),
    ((1, 1, 0), (0, 0, 0, 52)),
    ((0, 1, 1), (0, 0, 1, 52)),
    ((1, 1, 1), (1, 0, 0, 52)),
    ((1, 1, 0), (1, 1, 0, 52)),
]
for order, sorder in candidates:
    m = SARIMAX(logtrain, order=order, seasonal_order=sorder).fit(disp=False)
    fc = np.exp(m.forecast(h))
    print(order, sorder, "AIC=", round(m.aic, 2), "MAPE=", round(mape(test, fc), 2))
15.68

(1, 1, 0) (1, 0, 0, 52)   AIC= -517.39  MAPE= 23.39
(1, 1, 0) (0, 0, 0, 52)   AIC= -494.66  MAPE= 39.39
(0, 1, 1) (0, 0, 1, 52)   AIC= -501.08  MAPE= 34.16
(1, 1, 1) (1, 0, 0, 52)   AIC= -503.3   MAPE= 39.13
(1, 1, 0) (1, 1, 0, 52)   AIC= -371.84  MAPE=  3.74

The seasonal-naive bar for this series is 15.68%, already far higher than Cyclepath’s 5.9%, a direct consequence of a noisier, faster-growing series. Four of the five SARIMA candidates score 23% or worse, not even beating the baseline. Only the last one, with both a seasonal AR term and a seasonal difference (seasonal_order=(1, 1, 0, 52)), reaches 3.74% MAPE, a decisive winner among these five, and more than four times better than the baseline.
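The comparison against the baseline can be restated as plain arithmetic. The sketch below uses only the MAPE values printed above; the helper names `error_ratio` and `relative_reduction` are illustrative and not part of any library.

```python
# MAPE values (percent) copied from the printed results above.
baseline_mape = 15.68                      # seasonal-naive
best_sarima_mape = 3.74                    # order=(1, 1, 0), seasonal_order=(1, 1, 0, 52)
other_mapes = [23.39, 39.39, 34.16, 39.13] # the four candidates with D=0

def error_ratio(baseline, model):
    """How many times larger the baseline error is than the model error."""
    return baseline / model

def relative_reduction(baseline, model):
    """Fraction of the baseline error that the model removes."""
    return 1 - model / baseline

ratio = error_ratio(baseline_mape, best_sarima_mape)
reduction = relative_reduction(baseline_mape, best_sarima_mape)
all_others_lose = all(m > baseline_mape for m in other_mapes)

print(f"baseline error is {ratio:.2f}x the model error")
print(f"model removes {reduction:.1%} of the baseline error")
print("every D=0 candidate is worse than the baseline:", all_others_lose)
```

Reporting both the ratio and the relative reduction avoids the ambiguity of "times better", and the last check makes the "not even beating the baseline" claim explicit.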


A Trap: AIC Cannot Compare These Fairly

Look again at the AIC column. The winning model, at 3.74% MAPE, has the worst (highest) AIC of the five, -371.84, while the first candidate, with a dramatically worse 23.39% MAPE, has the best AIC, -517.39. This is not another AIC-versus-accuracy contradiction like Module 7’s. It is a different, specific trap. AIC is built from the log-likelihood, which is a sum over the observations the model is actually fitted to, and the winning model, using D=1, differenced away 52 observations that the other four candidates, using D=0, kept. Comparing AIC across models fitted to different amounts of data, which is exactly what different D values produce, is not a valid comparison, regardless of what the raw numbers seem to suggest.
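To see the disagreement concretely, the five results can be ranked both ways. This is a small sketch over the numbers in the printed table above; the tuple layout and variable names are illustrative choices.

```python
# (order, seasonal_order, AIC, test MAPE %) copied from the printed table above.
results = [
    ((1, 1, 0), (1, 0, 0, 52), -517.39, 23.39),
    ((1, 1, 0), (0, 0, 0, 52), -494.66, 39.39),
    ((0, 1, 1), (0, 0, 1, 52), -501.08, 34.16),
    ((1, 1, 1), (1, 0, 0, 52), -503.30, 39.13),
    ((1, 1, 0), (1, 1, 0, 52), -371.84, 3.74),
]

by_aic = sorted(results, key=lambda r: r[2])    # lowest AIC first
by_mape = sorted(results, key=lambda r: r[3])   # lowest MAPE first

print("rank | by AIC                     | by MAPE")
for rank, (a, m) in enumerate(zip(by_aic, by_mape), start=1):
    print(rank, "|", a[0], a[1], "|", m[0], m[1])

# AIC is only compared within a group that shares the same d and D.
same_differencing = [r for r in results if r[0][1] == 1 and r[1][1] == 0]
best_within_group = min(same_differencing, key=lambda r: r[2])
print("best AIC among the d=1, D=0 candidates:", best_within_group[:2])
```

The model ranked last by AIC is ranked first by MAPE. Restricting the AIC comparison to the four candidates that share d=1 and D=0 is the only slice of this table where AIC values are on a common footing.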

AIC only compares models trained on the same data

AIC is only meaningful when comparing models fit to the same dataset. Every model in Module 6’s SARIMA search shared the same d and D, so their AIC values were genuinely comparable. Here, changing D from 0 to 1 changes how many observations the model is effectively trained on: the 182 training weeks leave 181 rows after first differencing alone, versus 129 after both a first and a seasonal difference. The AIC values are therefore computed over different-sized datasets and cannot be compared to each other. When differencing choices differ, held-out accuracy, not AIC, is the fair way to compare, which is exactly why this lesson led with test MAPE rather than AIC.
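The counting argument can be checked by hand. The sketch below is an illustration of the reasoning only: it does not reproduce how statsmodels computes its likelihood internally, the `difference`, `gaussian_loglik`, and `aic` helpers are hypothetical, and the residual scale `sigma` is an assumed value. It takes the 182-week training block from the single split (208 weeks minus the 26 held out), counts the rows left after each differencing choice, and then shows that two models with identical per-observation fit still receive different AIC values when one of them has fewer rows.

```python
import math

def difference(values, lag=1):
    """Return values[t] - values[t - lag]; the first `lag` points are lost."""
    return [values[t] - values[t - lag] for t in range(lag, len(values))]

n_train = 208 - 26                           # 182 training weeks in the single split
series = [float(t) for t in range(n_train)]  # placeholder values; only the length matters

after_d1 = difference(series, lag=1)         # d=1, D=0
after_d1_D1 = difference(after_d1, lag=52)   # d=1, D=1 with a 52-week season
print(len(series), len(after_d1), len(after_d1_D1))  # 182 181 129

def gaussian_loglik(n_obs, sigma):
    """Log-likelihood of n_obs Gaussian residuals whose mean square equals sigma**2."""
    return n_obs * (-0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5)

def aic(loglik, n_params):
    """AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * loglik

sigma = 0.05  # assumed residual scale on the log series, identical for both models
aic_d0 = aic(gaussian_loglik(len(after_d1), sigma), n_params=2)
aic_d1 = aic(gaussian_loglik(len(after_d1_D1), sigma), n_params=2)
print(round(aic_d0, 1), round(aic_d1, 1))
```

With the same parameter count and the same fit quality per row, the model with more rows ends up with the lower AIC simply because its log-likelihood sums more terms. That is why a gap in AIC between D=0 and D=1 models says nothing reliable about which one forecasts better.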


A Multiplicative Holt-Winters Model

from statsmodels.tsa.holtwinters import ExponentialSmoothing

hw = ExponentialSmoothing(
    train, trend="add", seasonal="mul", seasonal_periods=52,
    initialization_method="estimated",
).fit()
fc_hw = hw.forecast(h)
print(round(mape(test, fc_hw), 2))
3.85

Holt-Winters, with a multiplicative season matching Lesson 2’s finding directly, scores 3.85%, close behind SARIMA’s 3.74% on this single split. Both models comfortably beat the 15.68% baseline. On a single test SARIMA edges ahead, the same shape as the original Cyclepath comparison in Modules 6 and 7, before Module 8 backtested it properly.


Backtesting Both, Properly

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=6, test_size=13)
sn_scores, sarima_scores, hw_scores = [], [], []

for train_idx, test_idx in tscv.split(y):
    tr, te = y.iloc[train_idx], y.iloc[test_idx]
    sn = pd.Series(tr.iloc[-52:-52 + len(test_idx)].values, index=te.index)
    sn_scores.append(mape(te, sn))

    m = SARIMAX(np.log(tr), order=(1, 1, 0), seasonal_order=(1, 1, 0, 52)).fit(disp=False)
    sarima_scores.append(mape(te, np.exp(m.forecast(len(test_idx)))))

    hwm = ExponentialSmoothing(
        tr, trend="add", seasonal="mul", seasonal_periods=52,
        initialization_method="estimated").fit()
    hw_scores.append(mape(te, hwm.forecast(len(test_idx))))

for name, scores in [("seasonal-naive", sn_scores), ("SARIMA", sarima_scores), ("Holt-Winters", hw_scores)]:
    print(f"{name:16s} mean={np.mean(scores):.2f}  std={np.std(scores):.2f}")
seasonal-naive   mean=16.44  std=2.65
SARIMA           mean=4.39   std=1.10
Holt-Winters     mean=3.76   std=0.66

Backtested across six origins with a 13-week horizon, the exact ranking Module 8 found on Cyclepath repeats here, on a different series with a different structure: Holt-Winters wins on both average accuracy (3.76% versus 4.39%) and stability (standard deviation 0.66 versus 1.10). The single-split test had shown SARIMA narrowly ahead, 3.74% versus 3.85%, and once again, that narrow single-split edge does not survive a proper backtest.
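It helps to see exactly which weeks each origin trains and tests on. The sketch below is a pure-Python helper, hypothetically named `expanding_window_folds`, intended to mirror the expanding-window layout that `TimeSeriesSplit(n_splits=6, test_size=13)` produces on 208 observations with its default gap of 0. It is an illustration of the layout, not a replacement for the scikit-learn splitter used above.

```python
def expanding_window_folds(n_obs, n_splits, test_size):
    """Yield (train_positions, test_positions) with a growing training window."""
    first_test_start = n_obs - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for this many folds")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # Training always starts at position 0 and ends just before the test block.
        yield range(0, test_start), range(test_start, test_start + test_size)

folds = list(expanding_window_folds(n_obs=208, n_splits=6, test_size=13))
for i, (train_pos, test_pos) in enumerate(folds, start=1):
    print(f"fold {i}: train weeks 0-{train_pos[-1]} ({len(train_pos)} obs), "
          f"test weeks {test_pos[0]}-{test_pos[-1]}")

train_sizes = [len(train_pos) for train_pos, _ in folds]
```

Under this layout the training windows grow from 130 to 195 weeks. The smallest window covers two and a half seasonal cycles, and the fifth fold trains on the same 182 weeks as the single split but is scored on only the first 13 of its 26 held-out weeks, which is one reason the backtest and single-split numbers are not directly interchangeable.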

Figure: Cyclepath and Lantern & Vine side by side. The same reversal Module 8 found on Cyclepath repeats on Lantern & Vine: a single split narrowly favors SARIMA, but backtesting across six origins favors Holt-Winters on both accuracy and consistency, on a series with a genuinely different structure.

Practice Exercises

Exercise 1: Why compare four candidates without a seasonal difference at all?

Given that the candidate with a seasonal difference won so decisively, what was the point of testing the other four?

Hint

Testing weaker candidates alongside the eventual winner is exactly what makes the winner’s margin meaningful rather than assumed. Note that three of the four losing candidates do carry a seasonal AR or MA term; what they all lack is the seasonal difference (D=1). Without those four comparisons, you would not know whether the seasonal difference was doing genuine, substantial work (a gap of roughly 20 to 36 MAPE points) or only a marginal improvement. This is the same discipline Module 6 followed when comparing SARIMA against every non-seasonal ARIMA from Module 5: the size of the gap is itself evidence about how much a specific model choice matters.

Exercise 2: When would comparing AIC across differencing choices be fine?

Is it ever valid to compare AIC between two models with different d or D values?

Hint

Not directly, as a rule of thumb. AIC is built from a log-likelihood computed over however many observations a model was actually trained on after differencing, so any change in d or D changes that count and breaks a fair comparison. If you need to compare across differencing choices, the correct move is what this lesson did: use out-of-sample accuracy on a common held-out set instead, since forecast error is measured on the same test data regardless of how the model was internally differenced during training.

Exercise 3: Is the Cyclepath-Lantern & Vine agreement a coincidence?

Holt-Winters beat SARIMA on backtested accuracy for both Cyclepath and Lantern & Vine. Does this mean Holt-Winters is simply the better model family in general?

Hint

Not necessarily, and Module 8 already raised this caution once. Two series both favoring Holt-Winters is stronger evidence than one, but both series here are still synthetic, seeded, and built with fairly regular, repeating seasonal patterns, closer to what Holt-Winters’ smoothing assumptions handle well than a series with irregular or evolving seasonality might be. The honest conclusion is that Holt-Winters has now won a properly backtested comparison twice, which is meaningful evidence, but a genuinely different kind of series could still favor SARIMA’s explicit autoregressive structure instead.


Summary

Five SARIMA specifications on log Lantern & Vine sales split sharply: only the one with a seasonal AR term and a seasonal difference (order=(1,1,0), seasonal_order=(1,1,0,52)) reached 3.74% MAPE; the other four scored 23% or worse, not even beating the 15.68% seasonal-naive baseline. Their AIC values could not be fairly compared, since different D values train on different amounts of differenced data, a distinct trap from the AIC-versus-accuracy issues seen earlier in this course. A multiplicative Holt-Winters model scored a close 3.85% on the same single split. Backtested properly across six origins, Holt-Winters won again, mean 3.76% (std 0.66) against SARIMA’s mean 4.39% (std 1.10), the same reversal Module 8 found on Cyclepath, now confirmed on a second, structurally different series.

Key Concepts

  • A seasonal difference can be decisive: the four SARIMA candidates without one lost by roughly 20 to 36 percentage points of MAPE to the one that had it.
  • AIC cannot compare different differencing orders: changing d or D changes the effective training data size, breaking a fair AIC comparison; use held-out accuracy instead.
  • The single-split result reversed again: a proper backtest across six origins favored a different model than one held-out block did, exactly as on Cyclepath.
  • Two series, one confirmed lesson: Holt-Winters’ backtested advantage over SARIMA now holds on two structurally different series, stronger evidence than either alone.

Why This Matters

This lesson combines nearly everything the course has built: SARIMA order search from Module 6, Holt-Winters from Module 7, Module 8’s backtesting discipline, and the AIC pitfall, which this lesson adds as a genuinely new, more advanced caution. All of it is applied fresh to a series with a different frequency, a different structure, and a real regime change. That the backtesting reversal repeats here is not a coincidence to wave away. It is the evidence-based conclusion this entire course has been building toward: test properly, more than once, before trusting a ranking. The guided project now closes out the course with the final step every real forecasting project needs, a genuine forecast, reported honestly.


Next Steps

Continue to Lesson 5 - Guided Project: The Final Report

Retrain the winning model on the full series, forecast six months forward, and write up the complete pipeline as a final report.



Continue Building Your Skills

You have found, and properly backtested, a model that beats Lantern & Vine’s baseline by a wide margin, and confirmed Module 8’s Holt-Winters finding a second time on a genuinely different series. The final lesson retrains the winning model on every week of data available and produces a real forecast of the next six months, written up the way an actual analyst would report it.
