Lesson 2 - Walk-Forward Validation

Welcome to Walk-Forward Validation

Lesson 1 showed that a model’s error swings meaningfully depending on which single year gets held out. Walk-forward validation is the standard fix: instead of splitting the series once, you split it several times, at a sequence of points moving forward through history, refitting the model fresh at each one and forecasting a fixed horizon ahead. The result is not one accuracy number but a small collection of them, which is exactly what Lesson 1’s exercises argued you actually need.

By the end of this lesson, you will be able to:

  • Explain what an origin is and how walk-forward validation moves through a series
  • Implement an expanding-window backtest by hand
  • Implement a rolling-window backtest by hand and explain how it differs
  • Read a mean and standard deviation across origins as a model’s real expected performance

Let’s build the loop.


Origins, Expanding, and Rolling

A walk-forward backtest picks a sequence of origins, points in time where training stops and forecasting begins. At each origin, you train on data up to that point, forecast a fixed number of steps ahead (the horizon), and score the forecast against what actually happened. Then you move the origin forward and repeat. There are two ways to set the training window as the origin advances:

  • Expanding window: training always starts at the very first observation, so it keeps growing longer at each origin.
  • Rolling window: training keeps a fixed size, so as the origin advances, the oldest months are dropped to make room for the newest ones.
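To make the two definitions concrete, the sketch below computes only the index ranges each scheme would train and test on, without fitting any model. The helper name window_bounds and its parameters are illustrative assumptions, not part of any library. Positions are zero-based and end-exclusive, matching Python slicing, and the numbers (96 monthly points, origins every six months from 60, a six-month horizon) are the ones this lesson uses.

```python
def window_bounds(origin, horizon, window_size=None):
    """Return (train_start, train_end, test_end) as end-exclusive positions.

    window_size=None means an expanding window (training starts at 0).
    An integer window_size means a rolling window of that many points.
    """
    if window_size is None:
        train_start = 0
    else:
        # Clamp at 0 so a window longer than the available history
        # falls back to using everything before the origin.
        train_start = max(0, origin - window_size)
    return train_start, origin, origin + horizon


# Illustrative setup: 96 monthly points, origins every 6 months from 60.
for origin in range(60, 96, 6):
    exp = window_bounds(origin, horizon=6)
    roll = window_bounds(origin, horizon=6, window_size=60)
    print(f"origin={origin}  expanding train=[{exp[0]}, {exp[1]})  "
          f"rolling train=[{roll[0]}, {roll[1]})  test=[{origin}, {exp[2]})")
```

At origin 60 the two schemes coincide, since both train on positions 0 through 59. From origin 66 onward the rolling window's start moves forward by six each time while the expanding window's start stays at 0.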

Expanding Window, By Hand

Pick six origins, with training lengths running from 60 to 90 months in steps of six, each followed by a six-month forecast:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def cyclepath():
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

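y = cyclepath()

def mape(a, f):
    """Mean absolute percentage error of forecast f against actuals a, in percent."""
    return np.mean(np.abs((a - f) / a)) * 100

h = 6                              # forecast horizon, in months
origins = list(range(60, 96, 6))   # training lengths: 60, 66, 72, 78, 84, 90
expanding_scores = []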

for origin in origins:
    train, test = y.iloc[:origin], y.iloc[origin:origin + h]
    model = SARIMAX(train, order=(1, 1, 0), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
    fc = model.forecast(h)
    score = mape(test, fc)
    expanding_scores.append(score)
    print(f"origin={origin}  test={test.index[0].date()}..{test.index[-1].date()}  MAPE={score:.2f}")

print(round(np.mean(expanding_scores), 2), round(np.std(expanding_scores), 2))
origin=60  test=2021-01-01..2021-06-01  MAPE=2.12
origin=66  test=2021-07-01..2021-12-01  MAPE=3.49
origin=72  test=2022-01-01..2022-06-01  MAPE=3.74
origin=78  test=2022-07-01..2022-12-01  MAPE=1.38
origin=84  test=2023-01-01..2023-06-01  MAPE=0.82
origin=90  test=2023-07-01..2023-12-01  MAPE=1.31

2.14 1.11

Six origins give six different training sizes (60 through 90 months), each refitting SARIMA from scratch and forecasting the following six months. The scores range from 0.82% to 3.74%, averaging 2.14% with a standard deviation of 1.11 percentage points. Notice something Module 6 never surfaced: the origin at 84 months, forecasting January through June 2023, scores 0.82%, better than the 1.06% Module 6 reported for the full twelve months of 2023. Backtesting does not just add more test points; it also reveals that a model’s accuracy can depend on the forecast horizon itself, not only on which year is tested.
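The summary line can be reproduced from the six printed scores alone. The sketch below uses Python's statistics module instead of NumPy, purely to stay dependency-free: pstdev is the population form, which is what np.std computes by default (ddof=0), and stdev is the sample form (ddof=1). Because the inputs are the rounded scores from the output above, treat the result as a check on the arithmetic, not a rerun of the backtest.

```python
import statistics

# Expanding-window MAPE scores as printed in the lesson output (two decimals)
scores = [2.12, 3.49, 3.74, 1.38, 0.82, 1.31]

mean = statistics.mean(scores)
pop_sd = statistics.pstdev(scores)    # divides by n, like np.std(scores)
sample_sd = statistics.stdev(scores)  # divides by n - 1, like np.std(scores, ddof=1)

print(round(mean, 2), round(pop_sd, 2), round(sample_sd, 2))
# prints: 2.14 1.11 1.22
```

With only six origins the two forms differ visibly (1.11 versus 1.22), so it is worth stating which one a reported spread uses.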

Figure: An expanding window. Six horizontal bars, one per origin, share the same left edge (2016). The blue training portion grows from 60 months in the top bar to 90 months in the bottom bar, and each is followed immediately by a fixed six-month orange test segment that slides to the right as training grows.

Rolling Window, By Hand

A rolling window keeps the same six origins and the same six-month horizon, but fixes the training size at 60 months, dropping the oldest data as the origin advances:

window_size = 60      # fixed training length in months; must not exceed the first origin
rolling_scores = []

for origin in origins:
    train = y.iloc[origin - window_size:origin]
    test = y.iloc[origin:origin + h]
    model = SARIMAX(train, order=(1, 1, 0), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
    fc = model.forecast(h)
    rolling_scores.append(mape(test, fc))

print(round(np.mean(rolling_scores), 2), round(np.std(rolling_scores), 2))
2.24 1.19

The rolling-window average, 2.24%, is close to but slightly worse than the expanding window’s 2.14%, and its standard deviation (1.19) is a touch higher too. The first origin cannot contribute to that gap: with 60 months of history and a 60-month window, both methods train on exactly the same data there, so the difference comes entirely from the five later origins. The direction of the gap matches how Cyclepath was built. It has a perfectly stable trend and an identical seasonal wave every year, so the earliest months are exactly as relevant to forecasting 2023 as the most recent ones, and throwing them away, which is what a rolling window does, removes useful information without buying anything in return. The expanding window’s small edge is what that stability predicts, though with a gap this small relative to the spread across origins it is suggestive, not conclusive.

When would a rolling window win instead?

A rolling window earns its keep when a series’ behavior genuinely changes over time: a real shift in trend, a seasonal pattern that gradually evolves, or a structural break such as a change in business conditions. In that situation, the oldest data reflects a version of the process that no longer applies, and an expanding window would keep diluting the model’s fit with information that has gone stale. Cyclepath was deliberately built without any such drift, which is why the expanding window edges ahead here. On a real series, this is a genuine judgment call, and testing both, exactly as this lesson just did, is the way to find out which one your specific series rewards.
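A toy illustration of that situation, under loudly stated assumptions: the series below is synthetic and noise-free, with a level that jumps from 100 to 200 at position 60, and the "model" is just the mean of the training window, standing in for SARIMA to keep the sketch dependency-free. Nothing here comes from Cyclepath, and the 24-point window is an arbitrary choice.

```python
# Synthetic series with a structural break: level 100 for 60 points, then 200.
series = [100.0] * 60 + [200.0] * 36

def mean_forecast_mape(train, test):
    """Forecast every test point with the training mean; return MAPE in percent."""
    forecast = sum(train) / len(train)
    return sum(abs((a - forecast) / a) for a in test) / len(test) * 100

h, window_size = 6, 24
expanding, rolling = [], []
for origin in range(66, 96, 6):
    test = series[origin:origin + h]
    # Expanding: everything before the origin, including the stale pre-break level
    expanding.append(mean_forecast_mape(series[:origin], test))
    # Rolling: only the most recent window_size points
    rolling.append(mean_forecast_mape(series[max(0, origin - window_size):origin], test))

print("expanding:", [round(s, 1) for s in expanding])
print("rolling:  ", [round(s, 1) for s in rolling])
```

In this contrived case the rolling error is lower at every origin and falls to zero once the window lies entirely after the break, while the expanding error is still about 33% at the last origin because the pre-break data never leaves the training set.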


Practice Exercises

Exercise 1: Why six origins and not just one more?

Why does averaging across six origins give a more trustworthy accuracy estimate than simply picking two held-out years instead of one?

Hint

Two years would be an improvement over one, but the core problem, that any small handful of specific origins carries its own noise, does not go away just by doubling the count. More origins reduce the influence of any single unlucky or lucky test period on the overall average, and they also let you measure the spread of outcomes (the standard deviation), not just the two individual numbers. There is no magic minimum number of origins; more is generally more trustworthy, balanced against how much computation and how much historical data you actually have.

Exercise 2: Predict what a longer rolling window would do

If the rolling window’s fixed size were increased from 60 months to 84 months, would you expect its scores to move closer to or further from the expanding window’s scores?

Hint

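Closer. As the rolling window’s fixed size approaches the amount of data the expanding window uses at each origin, the two methods converge, since a rolling window nearly as long as all available history discards very little. With an 84-month window the effect is strong: the origins at 60 through 84 months have at most 84 months of history, so the window, clamped to the start of the series, covers everything and those five folds match the expanding ones exactly. Only the origin at 90 months drops anything, namely the first six months. The loop as written would need that clamp, because origin - window_size is negative for the first four origins and the slice would come back empty. Once the window is at least as long as the history at the last origin (90 months here), the two methods are identical at every origin.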
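The convergence can be checked by counting how many months of history a rolling window leaves out at each origin. The helper below is a hypothetical illustration, and it assumes the window is clamped to the start of the series whenever it is longer than the available history.

```python
origins = list(range(60, 96, 6))  # 60, 66, 72, 78, 84, 90

def dropped_months(origin, window_size):
    """Months of history before the origin that a rolling window does not train on."""
    return max(0, origin - window_size)

for window_size in (60, 84):
    print(window_size, [dropped_months(o, window_size) for o in origins])
# prints:
# 60 [0, 6, 12, 18, 24, 30]
# 84 [0, 0, 0, 0, 0, 6]
```

A row of zeros means the rolling and expanding windows train on identical data at that origin, so with an 84-month window only the last fold can differ.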

Exercise 3: Reading the origin-to-origin pattern

Looking at the expanding-window scores (2.12, 3.49, 3.74, 1.38, 0.82, 1.31), the second and third origins (66 and 72 months) score noticeably worse than the rest. Does that necessarily mean something is wrong with the model?

Hint

Not necessarily. With only six origins, some variation between them is expected from ordinary noise, the same phenomenon Lesson 1 demonstrated with three different held-out years for seasonal-naive. Before concluding anything is wrong, you would want more origins to see whether this is a consistent pattern (for instance, if forecasts starting in January consistently score worse than forecasts starting in July) or just noise settling differently on two adjacent folds. A pattern that repeats across many origins is worth investigating; a couple of noisier folds among six generally is not.
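One quick check along those lines is to group the six expanding-window scores by the month their test period starts in. The origins, start months, and scores below are copied from the printed output earlier in this lesson; with only three scores per group this is a sanity check, not a test of significance.

```python
from collections import defaultdict
from statistics import mean

# (origin, test start month, MAPE) copied from the expanding-window output
results = [
    (60, "Jan", 2.12), (66, "Jul", 3.49), (72, "Jan", 3.74),
    (78, "Jul", 1.38), (84, "Jan", 0.82), (90, "Jul", 1.31),
]

by_start = defaultdict(list)
for _, start_month, score in results:
    by_start[start_month].append(score)

for start_month, scores in by_start.items():
    print(start_month, scores, round(mean(scores), 2))
```

The January-start folds average about 2.23% and the July-start folds about 2.06%, and each group contains both a high and a low score, so these six folds show no clear start-month pattern.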


Summary

Walk-forward validation tests a model at a sequence of origins moving forward through a series, refitting fresh at each one and forecasting a fixed horizon ahead, rather than testing once. An expanding window always starts training at the beginning of the series; a rolling window keeps a fixed training size, dropping the oldest data as the origin advances. Built by hand on Cyclepath with six origins and a six-month horizon, SARIMA’s expanding-window MAPE averaged 2.14% (standard deviation 1.11, range 0.82% to 3.74%), while a 60-month rolling window scored slightly worse, 2.24% (standard deviation 1.19), evidence that Cyclepath’s stable, non-drifting structure rewards keeping all the historical data rather than discarding it.

Key Concepts

  • Origin — the point in time where training stops and a forecast begins; a walk-forward backtest tests several.
  • Expanding window — training always starts at the beginning of the series and grows at each origin.
  • Rolling window — training keeps a fixed size, dropping the oldest data as the origin advances.
  • Mean and standard deviation across origins — a far more complete description of expected accuracy than any single split’s number.

Why This Matters

A walk-forward backtest is what turns Lesson 1’s warning into a working tool: instead of reporting one number and hoping it was representative, you now have a distribution of outcomes across six genuinely different training and test periods, and a principled way to compare an expanding window against a rolling one when a series’ stability is in question. Next, Lesson 3 shows that this exact loop is standard enough to have a name and a built-in implementation, TimeSeriesSplit, and formalizes it so you never have to hand-write the origin loop again.


Next Steps

Continue to Lesson 3 - TimeSeriesSplit and Formalizing the Loop

Replace the hand-rolled origin loop with scikit-learn's TimeSeriesSplit, and confirm it produces the exact same folds.



Continue Building Your Skills

You have now built a walk-forward backtest by hand, in both its expanding and rolling forms, and measured a real, if narrow, advantage for the expanding window on Cyclepath’s stable structure. Next, you will see that this exact loop has a standard, well-tested implementation you can reach for directly, and confirm it produces the identical six folds you just built by hand.
