Lesson 1 - Why One Split Is Not Enough

Welcome to Why One Split Is Not Enough

Since Module 1, this course has followed one pattern: split Cyclepath once, train on the first 84 months, test on the last 12, and report a single accuracy number. Every model comparison (SARIMA against Holt-Winters, the airline model against the featured SARIMA) has rested on that same single split. This lesson asks a question that should have been sitting underneath all of it: what if the last 12 months happened to be an easy year, or a hard one, and a different held-out year would have told a different story?

By the end of this lesson, you will be able to:

  • Explain what a single train/test split can and cannot tell you about a model
  • Measure how much a model’s reported error changes when a different year is held out
  • Distinguish a model’s true expected accuracy from one sample of its accuracy
  • Explain why this motivates testing across several origins instead of one

Let’s measure the swing directly.


Same Model, Same Series, Different Year

The seasonal-naive baseline is the simplest model in this course: forecast each month as the same month last year. Its error should depend only on how well that simple pattern held during whatever year you happen to test it on. Hold out three different years, one at a time, and check:

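import numpy as np
import pandas as pd

def cyclepath():
    # Synthetic monthly series: 96 months starting January 2016
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)  # fixed seed, so the series is reproducible
    trend = 9000 + 90 * t  # linear trend, +90 trips per month
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)  # identical 12-month wave every year
    noise = rng.normal(0, 350, 96)  # Gaussian noise, standard deviation 350
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()

def mape(a, f):
    # Mean absolute percentage error of forecast f against actuals a
    return np.mean(np.abs((a - f) / a)) * 100

# Hold out 2021, 2022, and 2023 in turn (positions 60, 72, 84), training on everything before
for holdout_start in range(60, 96, 12):
    train, test = y.iloc[:holdout_start], y.iloc[holdout_start:holdout_start + 12]
    # Seasonal-naive forecast: repeat the last 12 training months
    seasonal_naive = pd.Series(train.iloc[-12:].values, index=test.index)
    print(test.index[0].date(), test.index[-1].date(), round(mape(test, seasonal_naive), 2))

Output:

2021-01-01 2021-12-01 6.61
2022-01-01 2022-12-01 7.39
2023-01-01 2023-12-01 5.92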

Same model, same series, same twelve-month horizon, three different answers: 6.61%, 7.39%, and 5.92%. The 2023 result, 5.92%, is the number this course has quoted since Module 1, rounded to the 5.9% baseline every later model was measured against. But it was never the only possible answer. Hold out 2022 instead, and the same seasonal-naive baseline looks meaningfully worse at 7.39%, about 25% higher in relative terms (1.47 percentage points) than the number this course has treated as fixed since the beginning.
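The 25% figure is a relative difference, not a gap in percentage points. A quick check of the arithmetic, using only the two MAPE values printed above (the variable names are illustrative, not part of the lesson's code):

```python
# MAPE values copied from the lesson's printed output
baseline_2023 = 5.92   # the number quoted since Module 1
holdout_2022 = 7.39    # same model, different held-out year

# Gap in percentage points versus gap relative to the 2023 baseline
absolute_gap = holdout_2022 - baseline_2023
relative_gap = absolute_gap / baseline_2023 * 100

print(f"{absolute_gap:.2f} points, {relative_gap:.1f}% relative")
# prints: 1.47 points, 24.8% relative
```

Keeping the two apart matters when comparing models later: a 1.47-point gap sounds small, but it is roughly a quarter of the baseline's own error.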

Figure: Three vertical bars, one per held-out year, showing the seasonal-naive baseline's MAPE: 2021 at 6.61 percent, 2022 at 7.39 percent (the tallest bar), and 2023 at 5.92 percent (the shortest bar, labeled as the value this course has quoted since Module 1). A horizontal band spans from the shortest to the tallest bar, labeled 'the real range a single split can land in'.
Caption: The same seasonal-naive baseline, tested on three different held-out years, gives three different answers. The 5.9% this course has quoted since Module 1 was never guaranteed; it was one sample from this range.

This Is Not a Flaw in Seasonal-Naive

It would be easy to read this as a problem specific to a simple baseline, but the issue is structural, not about which model you pick. Every model in this course, however sophisticated, was fit once and tested on whichever twelve months happened to come last. A model’s true, underlying accuracy is a property of how well it captures the process generating the series. A single test period is one noisy observation of that underlying accuracy, not the accuracy itself. This is the same distinction between a parameter and a sample estimate of it that shows up throughout statistics.

What changed between 2021, 2022, and 2023?

Nothing structural. Cyclepath’s generating process, a fixed trend slope and an identical seasonal wave every year, is exactly the same across all three years, by construction. The only thing that differs is the specific run of random noise that landed in each twelve-month window. That is precisely the point: even on a series with no real regime change, no genuine shift in behavior, held-out accuracy still varies meaningfully from year to year, purely from noise. On a real series, where the underlying process might also genuinely shift over time, a single split’s number is even less trustworthy on its own.
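One way to see the "one sample" idea directly is to keep the generating process and the held-out year fixed and redraw only the noise. The sketch below is an illustration under assumptions, not something the course itself runs: `cyclepath_with_seed` and `seasonal_naive_mape_2023` are hypothetical helpers that copy the lesson's recipe but let the caller choose the seed, and the choice of 200 seeds is arbitrary.

```python
import numpy as np
import pandas as pd

def cyclepath_with_seed(seed):
    # Hypothetical variant of the lesson's cyclepath(): same trend, same
    # seasonal wave, same noise level, but a caller-chosen random seed.
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(seed)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

def mape(a, f):
    # Mean absolute percentage error, as defined in the lesson
    return np.mean(np.abs((a - f) / a)) * 100

def seasonal_naive_mape_2023(y):
    # Train on the first 84 months, hold out the last 12 (2023)
    train, test = y.iloc[:84], y.iloc[84:]
    forecast = pd.Series(train.iloc[-12:].values, index=test.index)
    return mape(test, forecast)

# Same process, same held-out year, 200 different noise draws
mapes = np.array([seasonal_naive_mape_2023(cyclepath_with_seed(s)) for s in range(200)])

print(f"mean {mapes.mean():.2f}%, std {mapes.std(ddof=1):.2f}%, "
      f"min {mapes.min():.2f}%, max {mapes.max():.2f}%")
```

Each entry in `mapes` is what a single split would have reported under a different run of noise. The mean approximates the baseline's expected accuracy for that year, and the spread shows how far any one reported number can sit from it.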


What This Motivates

None of this means the single-split numbers from earlier modules were wrong. They were computed correctly and reported honestly. What it means is that a single number is not the whole picture, and a more complete picture requires testing a model at more than one point in time. The rest of this module builds exactly that: Lesson 2 tests a model repeatedly as its training window moves forward through the series, Lesson 3 formalizes that loop with a standard tool, and the guided project applies it to every model this course has built, with a result that will change how you read the conclusions of Modules 6 and 7.


Practice Exercises

Exercise 1: Why does 2022 look harder?

Seasonal-naive’s error was highest for 2022 (7.39%). Without recomputing anything, what does that tell you, and what does it not tell you?

Hint

It tells you that 2022’s actual monthly values happened to deviate more from “the same month last year” than 2021’s or 2023’s did. That is purely a property of the random noise in 2022 and in the 2021 values used as its forecast, since Cyclepath’s true seasonal pattern is identical every year by construction. It does not tell you that seasonal-naive is a worse model in general, or that something changed structurally in 2022. Reading too much into a single year’s number, in either direction, good or bad, is exactly the mistake this lesson is warning against.

Exercise 2: Would a better model remove this variation?

If you repeated this same three-year comparison using SARIMA instead of seasonal-naive, would you expect its MAPE to be identical across 2021, 2022, and 2023?

Hint

No. Every model, however accurate on average, is still being tested against one specific year’s random noise each time, so some fold-to-fold variation is unavoidable even for a very good model. What you would hope for from a better model is a smaller swing between years, and a lower average error, not perfectly identical numbers across every single year. Lesson 2 measures exactly this kind of variation for SARIMA directly.

Exercise 3: What would you report instead of one number?

Given that a single split can land anywhere in a real range, what would be a more honest way to report a model’s expected accuracy?

Hint

Report a summary across several test periods rather than one: an average error across multiple held-out windows, along with some measure of how much it varies from window to window, such as a standard deviation or a minimum-to-maximum range. That combination, a typical error plus how much it moves around, is a far more honest description of what to expect from a model than any single held-out year’s number, which is exactly what walk-forward validation, starting in the next lesson, is built to produce.
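As a minimal sketch of that kind of report, the snippet below summarizes the three seasonal-naive MAPE values already printed in this lesson. The variable names are illustrative, and three windows is a very small sample, so the standard deviation here is only a rough indication of spread.

```python
import numpy as np

# Seasonal-naive MAPE for held-out 2021, 2022, and 2023 (from the lesson's output)
mapes = np.array([6.61, 7.39, 5.92])

mean_mape = mapes.mean()
std_mape = mapes.std(ddof=1)          # sample standard deviation across windows
spread = mapes.max() - mapes.min()    # minimum-to-maximum range

print(f"MAPE {mean_mape:.2f}% (std {std_mape:.2f}, "
      f"range {mapes.min():.2f}% to {mapes.max():.2f}%, spread {spread:.2f})")
# prints: MAPE 6.64% (std 0.74, range 5.92% to 7.39%, spread 1.47)
```

Reported this way, the baseline reads as "about 6.6%, give or take 0.7 points" rather than a fixed 5.9%.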


Summary

Every model in this course was scored on one held-out year, 2023, giving the seasonal-naive baseline its now-familiar 5.9% MAPE. Testing the exact same model against 2021 and 2022 instead gives 6.61% and 7.39%, a real swing of about 25% relative to the number quoted since Module 1. Cyclepath’s underlying process is identical across all three years by construction, so this variation comes purely from the random noise each specific year happened to contain, not from any genuine change in the series. A single train/test split reports one sample of a model’s accuracy, not the accuracy itself, and that distinction motivates testing a model across several points in time rather than one.

Key Concepts

  • A single split is one sample — a model’s held-out error on one test period is a noisy estimate of its true expected accuracy, not the accuracy itself.
  • Variation without regime change — Cyclepath’s identical process across 2021 to 2023 still produces meaningfully different held-out errors, purely from noise.
  • This applies to every model — the issue is structural, not a flaw specific to seasonal-naive or to any one model in this course.
  • The fix is more origins — a trustworthy accuracy estimate needs several held-out test periods, not just one.

Why This Matters

Every accuracy number this course has reported so far, SARIMA’s 1.06%, Holt-Winters’ 1.57%, was true and correctly computed, but each was also just one sample of what a broader test would show. Recognizing that up front is what keeps a model comparison honest: a model that wins by a narrow margin on one held-out year might lose on another, and the only way to know is to check. Next, Lesson 2 builds the standard fix, walk-forward validation, testing a model repeatedly as its training window advances through the series, and you will see exactly how much this changes the picture for a real model.


Next Steps

Continue to Lesson 2 - Walk-Forward Validation

Test a model at several points in time instead of one, with an expanding and a rolling training window.

Back to Module Overview

Return to the Evaluation and Backtesting module overview


Continue Building Your Skills

You have now measured, directly and on the same series this whole course has used, that a single held-out year is not a reliable enough basis for judging a model. Next, you will build the standard fix: testing a model repeatedly, at several different points as its training data grows, instead of at just one.
