Lesson 5 - Guided Project: Backtesting Every Model From the Course
Welcome to the Guided Project
Modules 6 and 7 each built a model, scored it on the same held-out year, and reported a winner: SARIMA at 1.06% MAPE, beating Holt-Winters’ 1.57%. Every lesson in this module has been building toward checking whether that comparison holds up under scrutiny. This guided project runs the full walk-forward backtest (six origins, a six-month horizon) on every model this course has built, and the answer is not the one Modules 6 and 7 gave.
By the end of this project, you will be able to:
- Backtest multiple models across the same set of origins and compare them properly
- Explain why a single-split comparison and a backtested comparison can disagree
- Read both the mean and the variability of backtested scores as part of a model comparison
- State, with evidence, which model this course’s series actually favors
Let’s find out which model really wins.
Stage 1: Set Up the Shared Backtest
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def cyclepath():
    # 96 monthly observations: January 2016 through December 2023
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()

def mape(a, f):
    # Mean absolute percentage error, in percent (a = actuals, f = forecasts)
    return np.mean(np.abs((a - f) / a)) * 100

# Six expanding-window folds, each testing on the next six months
tscv = TimeSeriesSplit(n_splits=6, test_size=6)

These are the same six folds every other lesson in this module has used, so all three models are compared on identical training and test data at every origin.
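To make the six origins concrete, the sketch below works out which calendar months each fold tests on. It uses plain arithmetic rather than scikit-learn, on the assumption that TimeSeriesSplit with its default gap of 0 tiles the test windows over the last n_splits * test_size observations and trains on everything before each window. The helper names month_of and expanding_folds are hypothetical, introduced only for this illustration.

```python
from datetime import date

N_OBS, N_SPLITS, TEST_SIZE = 96, 6, 6  # values from the setup above
START_YEAR = 2016                      # the series starts in January 2016


def month_of(i):
    """Map a 0-based position in the monthly series to its calendar month."""
    return date(START_YEAR + i // 12, i % 12 + 1, 1)


def expanding_folds(n_obs, n_splits, test_size):
    """Yield (test_start, test_end) positions, oldest fold first.

    Assumption: mirrors TimeSeriesSplit(gap=0, max_train_size=None).
    The training window for each fold is every position before test_start.
    """
    first_test = n_obs - n_splits * test_size
    for k in range(n_splits):
        start = first_test + k * test_size
        yield start, start + test_size - 1


folds = list(expanding_folds(N_OBS, N_SPLITS, TEST_SIZE))
for k, (start, end) in enumerate(folds, 1):
    print(f"fold {k}: train on {start} months, "
          f"test {month_of(start):%Y-%m} to {month_of(end):%Y-%m}")
```

Under that assumption the first fold trains on 60 months and tests on January to June 2021, and the last fold trains on 90 months and tests on July to December 2023. Running `list(tscv.split(y))` against the real series is the way to confirm this layout.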
Stage 2: Backtest All Three Models
sn_scores, sarima_scores, hw_scores = [], [], []

for train_idx, test_idx in tscv.split(y):
    train, test = y.iloc[train_idx], y.iloc[test_idx]

    # Seasonal-naive: repeat the same months from one year earlier.
    # This slice works because the horizon (6) is shorter than the season (12).
    seasonal_naive = pd.Series(train.iloc[-12:-12 + len(test_idx)].values, index=test.index)
    sn_scores.append(mape(test, seasonal_naive))

    sarima = SARIMAX(train, order=(1, 1, 0), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
    sarima_scores.append(mape(test, sarima.forecast(len(test_idx))))

    hw = ExponentialSmoothing(
        train, trend="add", seasonal="add", seasonal_periods=12,
        initialization_method="estimated").fit()
    hw_scores.append(mape(test, hw.forecast(len(test_idx))))

# Per-origin scores, then mean and (population) standard deviation across origins
for name, scores in [("seasonal-naive", sn_scores), ("SARIMA", sarima_scores), ("Holt-Winters", hw_scores)]:
    print(f"{name:16s} {[float(round(s, 2)) for s in scores]} mean={np.mean(scores):.2f} std={np.std(scores):.2f}")

seasonal-naive   [7.48, 5.75, 7.49, 7.29, 6.57, 5.27] mean=6.64 std=0.87
SARIMA           [2.12, 3.49, 3.74, 1.38, 0.82, 1.31] mean=2.14 std=1.11
Holt-Winters     [1.96, 2.46, 0.99, 0.96, 1.44, 1.71] mean=1.59 std=0.53

Stage 3: The Reversal
Modules 6 and 7’s single-split test, which held out the final 12 months of the series (all of 2023), reported SARIMA at 1.06% MAPE and Holt-Winters at 1.57%, a clear win for SARIMA. Backtested across six different origins, the ranking flips: Holt-Winters averages 1.59% MAPE against SARIMA’s 2.14%. The gap in stability is larger still: Holt-Winters’ standard deviation across origins (0.53) is less than half of SARIMA’s (1.11). Holt-Winters stayed between 0.96% and 2.46% at every origin, while SARIMA’s accuracy swung considerably depending on which six months it was forecasting, from a strong 0.82% at one origin to a much weaker 3.74% at another.
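The summary figures quoted above can be re-derived from the six per-origin scores. The sketch below does this with the standard library only, using the rounded scores from the Stage 2 output, so it is a check on the arithmetic rather than a rerun of the models. It uses pstdev, the population standard deviation, because that is what np.std computes by default.

```python
from statistics import mean, pstdev

# Per-origin MAPE values (percent), copied from the Stage 2 output.
scores = {
    "seasonal-naive": [7.48, 5.75, 7.49, 7.29, 6.57, 5.27],
    "SARIMA": [2.12, 3.49, 3.74, 1.38, 0.82, 1.31],
    "Holt-Winters": [1.96, 2.46, 0.99, 0.96, 1.44, 1.71],
}

# mean, population std, best and worst origin for each model
summary = {
    name: (mean(v), pstdev(v), min(v), max(v)) for name, v in scores.items()
}

for name, (avg, sd, best, worst) in summary.items():
    print(f"{name:16s} mean={avg:.2f} std={sd:.2f} range={best:.2f}..{worst:.2f}")

# How much tighter is Holt-Winters than SARIMA across origins?
std_ratio = summary["Holt-Winters"][1] / summary["SARIMA"][1]
print(f"Holt-Winters std / SARIMA std = {std_ratio:.2f}")
```

With only six origins, the sample standard deviation (statistics.stdev, or np.std with ddof=1) would come out somewhat larger for every model, but it scales all three by the same factor, so the comparison between them does not change.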
Stage 4: Why Did the Ranking Flip?
This is not a contradiction, and neither result was computed incorrectly. Look at SARIMA’s six scores again: 2.12, 3.49, 3.74, 1.38, 0.82, 1.31. The single Module 6 split covered 2023, the same period as the last two origins in this sequence, where SARIMA posted its two best scores, 0.82% and 1.31%. Earlier origins, forecasting the second half of 2021 and the first half of 2022, scored considerably worse: 3.49% and 3.74%. By the accident of which twelve months it held out, Module 6’s single test sampled from the favorable end of SARIMA’s actual range, not its typical performance.
Holt-Winters’ scores are more tightly clustered (0.96% to 2.46%) around a lower average, so whichever single origin you happened to test it on, the answer would have looked similar. That consistency is exactly what a lower standard deviation across origins measures, and it is exactly what a single split cannot distinguish from a lucky or unlucky draw.
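One way to see how much a single origin can mislead is to ask which model would have “won” at each origin taken alone. The sketch below does that with the rounded scores from Stage 2. The half-year labels are an assumption derived from the fold layout (six six-month test windows ending in December 2023), not something printed by the backtest itself.

```python
# Per-origin MAPE values (percent), copied from the Stage 2 output.
sarima = [2.12, 3.49, 3.74, 1.38, 0.82, 1.31]
holt_winters = [1.96, 2.46, 0.99, 0.96, 1.44, 1.71]

# Assumed test windows for the six folds, oldest first.
windows = ["2021-H1", "2021-H2", "2022-H1", "2022-H2", "2023-H1", "2023-H2"]

# The model with the lower MAPE "wins" that origin.
winners = [
    "SARIMA" if s < h else "Holt-Winters" for s, h in zip(sarima, holt_winters)
]

for window, s, h, who in zip(windows, sarima, holt_winters, winners):
    print(f"{window}: SARIMA {s:.2f}  Holt-Winters {h:.2f}  -> {who}")

sarima_wins = [w for w, who in zip(windows, winners) if who == "SARIMA"]
print("SARIMA wins at:", sarima_wins)
```

On these numbers SARIMA has the lower error only at the two 2023 origins, the same year the single split held out, and Holt-Winters has the lower error at the other four. The single split was one twelve-month forecast rather than two six-month ones, so the figures are not directly interchangeable, but the pattern is consistent with the explanation above.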
Neither number was wrong, but one was incomplete
Modules 6 and 7 reported real, correctly computed numbers on real, correctly held-out data. Nothing about that was a mistake. The mistake would be treating that single comparison as the final word on which model is better, exactly the trap Lesson 1 named at the start of this module. The backtested comparison does not overrule the single split’s arithmetic; it reveals that the single split sampled from a wider range of outcomes than one test alone could show, and that on average, across that wider range, Holt-Winters is the more reliable choice for this series.
Stage 5: The Takeaway
Step back and see what this project, and this entire module, produced:
- A properly backtested ranking: Holt-Winters (mean 1.59%, std 0.53) outperforms SARIMA (mean 2.14%, std 1.11) on both average accuracy and consistency once tested across six origins instead of one.
- A reversed conclusion, explained rather than dismissed: Modules 6 and 7’s single-split result was real; it simply sampled from the favorable end of SARIMA’s actual range of outcomes, which only a multi-origin backtest could reveal.
- A complete evaluation toolkit: walk-forward validation, expanding and rolling windows, TimeSeriesSplit, and empirical coverage checks, all validated on the same real series this course has used from the start.
The wider lesson of this whole module, and in some ways of this whole course: a model’s reported accuracy is only as trustworthy as the evaluation behind it. Every earlier module built real, working models with real, honest single-split numbers, and every one of those numbers was worth reporting at the time. But the discipline this module adds, testing across many origins rather than trusting one, is what would have caught this exact reversal before it mattered, if this had been a real forecasting decision rather than a teaching exercise.
Practice Exercises
Exercise 1: Would more origins change the answer again?
If this backtest used twelve origins instead of six, spaced more closely together, would you expect the ranking to flip back in SARIMA’s favor?
Hint
Not based on anything in this lesson’s evidence. More origins would refine the estimate of each model’s mean and standard deviation, likely making both more precise, but there is no reason to expect the underlying pattern, Holt-Winters being more consistently accurate on this specific series, to reverse simply because more data points were used to measure it. If anything, more origins would make you more confident in whichever ranking holds up, not less. The only way to know for certain would be to actually run it.
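If you do want to run it, note that TimeSeriesSplit(n_splits=12, test_size=6) would not space the origins more closely: it would keep the test windows non-overlapping and reach further back, so the first fold would train on only 24 months. A minimal sketch of the alternative, assuming origins three months apart with overlapping six-month test windows, is below. The helper closer_origins is hypothetical and not part of scikit-learn.

```python
def closer_origins(n_obs, horizon, step, n_origins):
    """Return half-open (test_start, test_stop) positions, oldest origin first.

    Hypothetical helper for illustration. Everything before test_start is
    the training window, as in an expanding-window backtest.
    """
    last_start = n_obs - horizon  # the latest origin that still fits a full horizon
    starts = [last_start - k * step for k in reversed(range(n_origins))]
    if starts[0] <= 0:
        raise ValueError("not enough history for that many origins")
    return [(s, s + horizon) for s in starts]


# Twelve origins, three months apart, each forecasting six months ahead.
origins = closer_origins(n_obs=96, horizon=6, step=3, n_origins=12)
for test_start, test_stop in origins:
    print(f"train on first {test_start} months, "
          f"test on positions {test_start}..{test_stop - 1}")
```

In the Stage 2 loop, each pair would be used as `train, test = y.iloc[:test_start], y.iloc[test_start:test_stop]`. Because neighboring test windows share three months, the twelve scores are not independent of one another, so they sharpen the estimate by less than twelve separate windows would.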
Exercise 2: Does this mean SARIMA is a worse model in general?
Does this backtest prove that Holt-Winters is simply a better forecasting method than SARIMA?
Hint
No. It shows that Holt-Winters is the better choice for this specific series, tested this specific way. A different real series, one with a less stable seasonal pattern, a genuine structural trend change, or autocorrelation structure that SARIMA’s explicit AR and MA terms are well suited to capture, could easily favor SARIMA once properly backtested. The result here is specific evidence about Cyclepath, not a universal ranking of the two model families. This is the same caution Module 7 raised when it first compared them on the single split.
Exercise 3: What would you tell someone deploying one of these models tomorrow?
Based on everything in this module, what would you recommend to someone who needs to pick one model to deploy for Cyclepath starting tomorrow?
Hint
Recommend Holt-Winters. It is both more accurate on average and meaningfully more consistent across the six backtested origins, and consistency matters as much as the average when you cannot know in advance which months you will be forecasting. Also recommend continuing to backtest as new data arrives rather than trusting this result indefinitely: a series’ behavior in production can change in ways that a historical backtest, however thorough, cannot anticipate. This is the same caution about drift that Lesson 2 raised when comparing expanding and rolling windows.
Summary
Backtested across the same six origins with a six-month horizon, seasonal-naive averaged 6.64% MAPE (std 0.87), SARIMA averaged 2.14% (std 1.11), and Holt-Winters averaged 1.59% (std 0.53). Modules 6 and 7’s single-split test had ranked SARIMA (1.06%) ahead of Holt-Winters (1.57%) on 2023 alone. The backtested comparison reverses that ranking: Holt-Winters is both more accurate on average and more than twice as stable across origins. Module 6’s single test happened to sample from the favorable end of SARIMA’s real range of outcomes, a range only visible once you test across several different points in time.
Key Concepts
- Backtesting can overturn a single-split conclusion — a properly conducted multi-origin comparison is a genuinely different, more complete kind of evidence than one test.
- Mean and standard deviation both matter — Holt-Winters won not just on average error but on consistency, which a single split cannot measure at all.
- A reversed result does not mean the earlier work was wrong — Modules 6 and 7’s numbers were real; they were simply incomplete without the fuller picture this module provides.
- Evaluation is specific to the series — this result favors Holt-Winters for Cyclepath specifically, not as a general ranking of the two model families.
Why This Matters
This is the most important result in the entire course, not because Holt-Winters “won”, but because of what it demonstrates about evaluation itself: even a careful, correctly computed model comparison can point the wrong way if it rests on a single test period. Every technique in this module, walk-forward validation, expanding and rolling windows, TimeSeriesSplit, and empirical interval coverage, exists to catch exactly this kind of situation before it costs something real. You now have a complete forecasting toolkit: the classical model families from Modules 5 through 7, and the evaluation discipline from this module to know which one to actually trust.
Next Steps
Continue to Module 9 - Capstone
Apply the full pipeline to a brand new series with its own structure: weekly data, multiplicative seasonality, and a genuine change in growth rate.
Continue Building Your Skills
You now have a full evaluation toolkit, walk-forward validation, expanding and rolling windows, TimeSeriesSplit, and forecast interval checks, and you have seen it change a real conclusion about which model to trust. The capstone ahead applies everything from this entire course, exploration, decomposition, stationarity, autocorrelation, model fitting, and now honest evaluation, to a brand new series you have not seen before.