Lesson 3 - TimeSeriesSplit and Formalizing the Loop

Welcome to TimeSeriesSplit and Formalizing the Loop

Lesson 2’s walk-forward loop (six origins, an expanding window, hand-written slicing) is exactly the pattern scikit-learn built a dedicated tool for. TimeSeriesSplit generates the same sequence of expanding training windows and fixed-size test blocks automatically, which means less index arithmetic to get right and a standard, well-tested implementation that other people’s code already expects.

By the end of this lesson, you will be able to:

  • Generate expanding-window folds with TimeSeriesSplit instead of hand-written index ranges
  • Confirm that TimeSeriesSplit reproduces exactly the same folds built by hand in Lesson 2
  • Backtest the seasonal-naive baseline properly and compare it against Lesson 1’s single-split numbers
  • Explain what n_splits, test_size, and gap each control

Let’s formalize the loop.


TimeSeriesSplit Reproduces the Same Folds

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=6, test_size=6)
for train_idx, test_idx in tscv.split(np.arange(96)):
    print(train_idx[0], "-", train_idx[-1], " (n=", len(train_idx), ")  test:", test_idx[0], "-", test_idx[-1])
0 - 59  (n= 60 )  test: 60 - 65
0 - 65  (n= 66 )  test: 66 - 71
0 - 71  (n= 72 )  test: 72 - 77
0 - 77  (n= 78 )  test: 78 - 83
0 - 83  (n= 84 )  test: 84 - 89
0 - 89  (n= 90 )  test: 90 - 95

Six folds, training sizes 60, 66, 72, 78, 84, and 90 months, each followed by a 6-month test block. These are exactly the origins built by hand in Lesson 2. TimeSeriesSplit did not introduce a different backtesting scheme; it produced the identical loop, indexed automatically instead of by hand. n_splits=6 asked for six folds, and test_size=6 fixed each test block at six observations. scikit-learn then placed the six test blocks at the end of the series and worked out the training boundaries from there, using an expanding window, which is TimeSeriesSplit’s default behavior.
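The fold-for-fold match can also be checked in code rather than by eye. The sketch below is a minimal version of that check. The hand-written list comprehension is a stand-in for the Lesson 2 loop, not a copy of it, and the names manual_folds, sklearn_folds, and folds_match are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_obs, horizon, first_origin = 96, 6, 60

# Hand-written folds: an assumed stand-in for the Lesson 2 loop.
# Each origin trains on everything before it and tests on the next 6 months.
manual_folds = [
    (np.arange(0, origin), np.arange(origin, origin + horizon))
    for origin in range(first_origin, n_obs, horizon)
]

# The same folds as generated by scikit-learn
tscv = TimeSeriesSplit(n_splits=6, test_size=6)
sklearn_folds = list(tscv.split(np.arange(n_obs)))

# Compare the two fold lists index array by index array
folds_match = len(manual_folds) == len(sklearn_folds) and all(
    np.array_equal(m_train, s_train) and np.array_equal(m_test, s_test)
    for (m_train, m_test), (s_train, s_test) in zip(manual_folds, sklearn_folds)
)
print(folds_match)  # True
```

Comparing the index arrays directly is stricter than comparing only the first and last index of each fold, since it would also catch a gap or a duplicate inside a fold.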


Backtesting Seasonal-Naive With the Formal Loop

Use these folds to settle the question Lesson 1 raised: what is seasonal-naive’s real expected accuracy, not just its accuracy on one particular year?

import pandas as pd

def cyclepath():
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

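y = cyclepath()

def mape(a, f):
    # Mean absolute percentage error of forecast f against actuals a, in percent
    return np.mean(np.abs((a - f) / a)) * 100

scores = []
for train_idx, test_idx in tscv.split(y):
    train, test = y.iloc[train_idx], y.iloc[test_idx]
    # Seasonal-naive: reuse the values from 12 months before each test month.
    # This slice assumes the test block is shorter than 12 observations.
    seasonal_naive = pd.Series(train.iloc[-12:-12 + len(test_idx)].values, index=test.index)
    scores.append(mape(test, seasonal_naive))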

print([float(round(s, 2)) for s in scores])
print(round(np.mean(scores), 2), round(np.std(scores), 2))
[7.48, 5.75, 7.49, 7.29, 6.57, 5.27]
6.64 0.87

Across all six folds, seasonal-naive’s MAPE ranges from 5.27% to 7.49%, averaging 6.64% with a standard deviation of 0.87 percentage points. Compare that to Lesson 1’s three single-year numbers: 6.61%, 7.39%, and 5.92%. The backtested mean, 6.64%, sits close to all three, which makes sense, since each of them was one sample of the same underlying variation this fuller backtest now measures directly. But it is a meaningfully different, and more honest, number than any single one of them: it is not “how did seasonal-naive do on 2023,” it is “how does seasonal-naive typically do, and by how much does that typically vary.”
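One detail worth knowing when reading the 0.87: np.std divides by n by default (ddof=0), which treats the six folds as the whole population. The sketch below recomputes the summary from the rounded fold scores printed above and also shows the sample standard deviation (ddof=1), which divides by n - 1 and comes out somewhat larger with only six folds. The variable names are illustrative.

```python
import numpy as np

# Fold-level MAPE values, copied from the rounded output above
scores = [7.48, 5.75, 7.49, 7.29, 6.57, 5.27]

mean_mape = np.mean(scores)
population_std = np.std(scores)       # ddof=0, NumPy's default
sample_std = np.std(scores, ddof=1)   # divides by n - 1 instead of n

print(round(mean_mape, 2), round(population_std, 2), round(sample_std, 2))
```

Neither choice is wrong for a six-fold backtest, but it helps to state which one a reported spread uses, since the two differ noticeably when there are only a handful of folds.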

Two identical-looking rows of six blue-to-orange bar pairs, labeled 'hand-rolled loop (Lesson 2)' on top and 'TimeSeriesSplit (Lesson 3)' below, with matching origin boundaries at 60, 66, 72, 78, 84, and 90 months for both rows, and a checkmark confirming the two rows produce identical fold boundaries.
TimeSeriesSplit and the Lesson 2 hand-rolled loop produce the exact same six folds. The tool does not change the backtesting logic; it removes the need to write and re-check the index arithmetic by hand.

Reading TimeSeriesSplit’s Parameters

Three parameters cover most of what you need:

  • n_splits: how many origins (folds) to generate. More folds mean a more thorough backtest, at the cost of more model fits.
  • test_size: how many observations each test block contains, the forecast horizon at every origin. This lesson and Lesson 2 both used 6.
  • gap: an optional number of observations to skip between the end of training and the start of testing, useful when a real forecasting pipeline has a delay between when data is available and when a forecast is actually needed.
tscv_gap = TimeSeriesSplit(n_splits=6, test_size=6, gap=1)
first_train, first_test = next(tscv_gap.split(np.arange(96)))
print(first_train[-1], first_test[0])
58 60

With gap=1, the first test block still starts at index 60, but training now stops at index 58 instead of 59, leaving a one-month buffer unused between them. TimeSeriesSplit defaults to gap=0, which is what Lessons 2 and 3 have used throughout, matching a situation where a forecast can be made the moment the most recent month’s data arrives.
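The same one-month buffer applies at every origin, not only the first. As a small sketch, the loop below lists the last training index, the first test index, and the training size for all six folds with gap=1. The name gap_folds is illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv_gap = TimeSeriesSplit(n_splits=6, test_size=6, gap=1)

# (last training index, first test index, training size) for every fold
gap_folds = [
    (int(train_idx[-1]), int(test_idx[0]), len(train_idx))
    for train_idx, test_idx in tscv_gap.split(np.arange(96))
]

for last_train, first_test, n_train in gap_folds:
    print(last_train, first_test, n_train)
# 58 60 59
# 64 66 65
# 70 72 71
# 76 78 77
# 82 84 83
# 88 90 89
```

The test blocks stay where they were with gap=0. Only the training windows shrink, by one observation each.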

TimeSeriesSplit expands the window by default

By default, TimeSeriesSplit always grows the training window. It has no separate rolling-window mode like the one Lesson 2 built by hand, but its max_train_size parameter caps every training fold at the most recent max_train_size observations, which gives a fixed-size rolling window once the folds are long enough to reach the cap. If a rolling window is what you need, on a series where the underlying process genuinely drifts, you can set that parameter, write the slicing yourself exactly as Lesson 2 did, or trim each TimeSeriesSplit training fold down to a fixed size after the fact. Knowing the tool’s actual scope, rather than assuming it does everything, is part of using it correctly.
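As a sketch of the trimming approach, the code below keeps only the last 60 training observations of each fold. The 60-month window is an assumed value chosen for illustration, not a number taken from Lesson 2. For comparison, it also builds the folds with scikit-learn's max_train_size parameter, which caps the training window at the most recent observations and yields the same index ranges here.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

window = 60  # assumed rolling-window length, for illustration only

# Option 1: trim each expanding training fold after the fact
tscv = TimeSeriesSplit(n_splits=6, test_size=6)
trimmed = [
    (train_idx[-window:], test_idx)
    for train_idx, test_idx in tscv.split(np.arange(96))
]

# Option 2: let scikit-learn cap the training size directly
capped = list(
    TimeSeriesSplit(n_splits=6, test_size=6, max_train_size=window).split(np.arange(96))
)

for train_idx, test_idx in trimmed:
    print(train_idx[0], "-", train_idx[-1], " (n=", len(train_idx), ")  test:", test_idx[0], "-", test_idx[-1])
```

Every training fold now holds 60 observations, and the start of the window moves forward by 6 months per fold while the test blocks stay unchanged.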


Practice Exercises

Exercise 1: Predict the effect of more splits

If you called TimeSeriesSplit(n_splits=10, test_size=6) on the same 96-month series, what would you expect to happen to the earliest fold’s training size?

Hint

With 10 folds of 6 months each needed at the end of the series, and the last fold’s test block ending at the very last observation, the test blocks together occupy the final 60 months. Working backward from the total length, that leaves 96 - 60 = 36 months for the earliest fold’s training window, well below the 60 months it had with six folds. In general, requesting more folds with the same fixed test size pushes the first origin earlier, giving that fold’s model less data to train on. It is worth checking that the earliest fold still has enough history to fit a meaningful model.
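To check a prediction for this exercise, the training sizes can be listed directly. This is a minimal sketch on a 96-element index array, and the names tscv_10 and train_sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

tscv_10 = TimeSeriesSplit(n_splits=10, test_size=6)

# Training size at each of the 10 origins
train_sizes = [len(train_idx) for train_idx, _ in tscv_10.split(np.arange(96))]
print(train_sizes)  # [36, 42, 48, 54, 60, 66, 72, 78, 84, 90]
```

The last six sizes are the same origins as before. The four extra folds are added at the start of the series, where the training history is shortest.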

Exercise 2: When would gap matter in practice?

Describe a real forecasting situation where setting gap=1 (or higher) would be the honest choice, rather than gap=0.

Hint

Any situation where there is a real delay between when a month’s data becomes available and when a forecast actually needs to be produced and acted on. For instance, ridership figures for a given month might not be finalized and available until partway through the following month, while a forecast for two months out needs to be issued as soon as the current month closes. Setting gap=0 in that situation would let the backtest use data that would not actually have been available yet in a real deployment, silently making the backtest look better than a real forecast could ever perform.

Exercise 3: Why does the backtested mean differ from any single year?

The backtested mean (6.64%) does not exactly equal 2021’s (6.61%), 2022’s (7.39%), or 2023’s (5.92%) individual numbers. Why not, given that the three single-split years overlap with three of the six backtest origins?

Hint

The backtest’s six folds use a 6-month horizon, forecasting half a year at a time, while Lesson 1’s three single-split numbers each forecast a full 12-month year at once. Different horizons are not directly comparable fold for fold, even on overlapping calendar periods, since forecasting further ahead is generally a harder problem. That is the same reason Lesson 2 noted a shorter-horizon fold (0.82% for six months) scoring better than the full twelve-month result (1.06%) covering some of the same months. The backtested mean is a summary over a specifically defined horizon and set of origins, not a recomputation of the earlier three numbers under a different label.


Summary

TimeSeriesSplit(n_splits=6, test_size=6) generates the exact same six expanding-window origins built by hand in Lesson 2, confirmed fold for fold: training sizes 60 through 90 months, each followed by a 6-month test block. Backtesting seasonal-naive across all six folds gives a mean MAPE of 6.64% with a standard deviation of 0.87, a properly summarized version of the variation Lesson 1 first measured across three single-year splits (5.92%, 6.61%, 7.39%). n_splits, test_size, and gap cover the parameters most backtests need. TimeSeriesSplit expands the training window by default; a rolling window like the one Lesson 2 also built requires capping the training size with max_train_size or writing the slicing by hand.

Key Concepts

  • TimeSeriesSplit: generates expanding-window folds automatically; confirmed to match a hand-rolled loop exactly.
  • n_splits, test_size, gap: control how many folds, how large each test block is, and whether a buffer separates training from testing.
  • Expanding by default: TimeSeriesSplit grows the training window unless max_train_size caps it; any other rolling-window scheme still requires hand-written slicing.
  • A backtested mean is a proper summary: combining multiple origins into one mean and standard deviation is a more honest description than any single split.

Why This Matters

Having a standard, checkable tool for the walk-forward loop matters beyond convenience: it means the backtesting logic in your code matches what anyone reading it, including a future version of yourself, already expects TimeSeriesSplit to do, rather than a bespoke loop that needs to be re-verified every time it is touched. With the mechanics now standard, the guided project can spend its attention on the results themselves rather than the plumbing. First, though, Lesson 4 turns to a different kind of honesty check: not just whether a model’s point forecast is accurate, but whether its stated uncertainty, its forecast interval, can actually be trusted.


Next Steps

Continue to Lesson 4 - Forecast Intervals and Their Honesty

Check whether a model's stated confidence intervals actually contain the truth as often as they claim.



Continue Building Your Skills

You now have a standard, verified way to backtest any model across multiple origins, confirmed to match the manual loop from Lesson 2 exactly. Next, you will apply this same multi-origin discipline to a different question: not just whether a model’s point forecast is accurate, but whether the confidence interval it reports alongside that forecast means what it claims to mean.
