Lesson 4 - Splitting Time Series and Baselines

Welcome to Splitting Time Series and Baselines

Before you fit anything fancy, you need two things: a way to test a forecast honestly, and something simple to beat. Both are easy to get wrong. In Lesson 1 you saw why a random split leaks the future — here you’ll do the split the right way, chronologically, holding out the most recent months as the test set. Then you’ll build two “baselines”: forecasts so dumb they take one line of code each. Their job isn’t to be good — it’s to set the bar. If the ARIMA model you build in a later module can’t beat a one-line seasonal-naive guess, it isn’t earning its complexity, and you should know that before you ship it. This lesson gives you the honest split and the baselines every model must clear.

By the end of this lesson, you will be able to:

  • Split a time series chronologically into train and test, holding out the last H months
  • Define the forecast horizon and explain why the test set must come after the train set
  • Build naive and seasonal-naive baseline forecasts in a line of code each
  • Compute MAE, RMSE, and MAPE, and read seasonal-naive as the bar a real model must beat

Let’s start with the split that doesn’t lie.


The Chronological Split and the Horizon

You forecast the future from the past, so your test must sit later than your training data — full stop. No shuffle, no train_test_split(random_state=…). You slice off the last H observations as the test set and keep everything before them for training. H is the forecast horizon: how many steps ahead you commit to predicting. For Cyclepath we hold out the last 12 months — a full year — so the test set exercises every season.

import numpy as np
import pandas as pd

def cyclepath():
    """Synthetic monthly ridership: linear trend + yearly seasonality + noise."""
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")   # 96 month-start dates
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()
h = 12                          # forecast HORIZON: hold out the last 12 months
train, test = y.iloc[:-h], y.iloc[-h:]
print(len(train), train.index[-1].date())   # 84 2022-12-01
print(len(test), test.index[0].date(), test.index[-1].date())  # 12 2023-01-01 2023-12-01

This prints 84 2022-12-01 and 12 2023-01-01 2023-12-01: 84 months of training that end in December 2022, and 12 months of test running through all of 2023. Notice what’s absent — there’s no shuffle and no random seed on the split, because reordering would scatter 2023 into the training data and let the model peek at the future it’s supposed to predict (exactly the leaky split from Lesson 1). Slicing by position on a time-ordered index keeps past and future cleanly separated.
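
The same slice can be wrapped in a small helper that refuses to split a series whose index is out of time order. This is only a sketch: the name chronological_split and its two checks are assumptions made for illustration, not part of the lesson's code, and the toy series stands in for Cyclepath.

```python
import numpy as np
import pandas as pd

def chronological_split(y, h):
    """Hold out the last h observations of a time-ordered Series (hypothetical helper)."""
    if not y.index.is_monotonic_increasing:
        raise ValueError("sort the series by time before splitting")
    if not 0 < h < len(y):
        raise ValueError("h must be between 1 and len(y) - 1")
    return y.iloc[:-h], y.iloc[-h:]

# Toy monthly series on the same calendar as Cyclepath (values are placeholders)
idx = pd.date_range("2016-01-01", periods=96, freq="MS")
toy = pd.Series(np.arange(96), index=idx)

train_part, test_part = chronological_split(toy, h=12)
print(len(train_part), len(test_part))                   # 84 12
print(train_part.index.max() < test_part.index.min())    # True
```

The second print is the property that matters: every training timestamp is earlier than every test timestamp.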

One honest split gives you one score, which can be lucky or unlucky. For robust evaluation you’d slide the split forward repeatedly — train on a growing window, test the next H months, and average — a technique called walk-forward validation. You’ll get the full treatment in a later module (Evaluation & Backtesting); for now a single held-out year is enough to establish your baselines.
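
As a preview, the growing-window idea can be sketched as a generator of split positions. The function name, the min_train=60 starting size, and the choice to step forward by one horizon per fold are all assumptions for illustration; the later module may set these up differently.

```python
def walk_forward_splits(n, h, min_train):
    """Yield (train_end, test_end) positions for an expanding-window backtest.

    Each fold trains on positions [0, train_end) and tests on [train_end, test_end).
    """
    train_end = min_train
    while train_end + h <= n:
        yield train_end, train_end + h
        train_end += h          # grow the training window by one horizon

folds = list(walk_forward_splits(n=96, h=12, min_train=60))
print(folds)   # [(60, 72), (72, 84), (84, 96)]
```

With a Series y, fold (a, b) corresponds to y.iloc[:a] for training and y.iloc[a:b] for testing. The last fold, (84, 96), is the single split used in this lesson.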

[Figure: the series is split into a shaded TRAIN region covering the first 84 months and a TEST region covering the last 12 months. The actual test line continues the seasonal up-and-down pattern. A flat red dashed 'naive (last value)' forecast holds a single level and misses the seasonality entirely, while a green dashed 'seasonal-naive (last year)' forecast rises and falls with the actual line and tracks it closely. Two metric boxes compare Naive MAE 3,497, MAPE 19.0% against Seasonal-naive MAE 998, MAPE 5.9%, roughly 3.5x better.]
Figure caption: Train on the first 84 months, test on the last 12. The flat naive forecast ignores seasonality and sits far from the actual values, while seasonal-naive replays last year's pattern and tracks the test line closely, with about 3.5x lower MAE.

Two Baselines Every Model Must Beat

A baseline is a forecast so simple it’s almost embarrassing — and that’s the point. If your sophisticated model can’t beat it, the sophistication is decoration. Two baselines cover almost every series:

  • Naive: forecast every future month as the last observed value. That’s it: whatever ridership was in December 2022, predict that exact number for all twelve months of 2023.
  • Seasonal-naive: forecast each month as the value from one full season earlier. With monthly data and yearly seasonality, that means “this coming July looks like last July,” month by month across the horizon.

naive = pd.Series(train.iloc[-1], index=test.index)                     # last training value, repeated
seasonal_naive = pd.Series(train.iloc[-12:].values, index=test.index)   # last 12 training months, replayed

Naive is a flat line — it holds December’s level all year and completely ignores that summers surge and winters slump. Seasonal-naive is a replay of the previous year, so it rises and falls with the calendar exactly the way Cyclepath does. On a series with strong seasonality, that difference is enormous, and the metrics will show it.
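
The seasonal-naive one-liner lines up because the horizon (12) equals the season length (12). For other horizons, one possible generalization cycles through the last season as many times as needed. The function names, the m parameter, and the toy data here are assumptions for illustration, not part of the lesson's code.

```python
import numpy as np
import pandas as pd

def naive_forecast(train, index):
    """Repeat the last observed value at every future timestamp."""
    return pd.Series(train.iloc[-1], index=index, name="naive")

def seasonal_naive_forecast(train, index, m=12):
    """Replay the last m observations, cycling if the horizon is longer than m."""
    h = len(index)
    last_season = train.iloc[-m:].to_numpy()
    reps = -(-h // m)                       # ceiling division
    return pd.Series(np.tile(last_season, reps)[:h], index=index, name="seasonal_naive")

# Toy data: 24 months with values 1..24, then an 18-month horizon
past = pd.Series(np.arange(1, 25), index=pd.date_range("2020-01-01", periods=24, freq="MS"))
future = pd.date_range("2022-01-01", periods=18, freq="MS")

print(naive_forecast(past, future).tolist()[:3])
# [24, 24, 24]
print(seasonal_naive_forecast(past, future).tolist())
# [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 13, 14, 15, 16, 17, 18]
```

When the horizon is 12 and m is 12, seasonal_naive_forecast reduces to the lesson's one-liner.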


Scoring Forecasts: MAE, RMSE, and MAPE

A forecast needs a number. Three error metrics do almost all the work, each answering a slightly different question. MAE (mean absolute error) is the average size of the miss, in the data’s own units — trips. RMSE (root mean squared error) squares the errors before averaging, so it penalizes large misses more than small ones. MAPE (mean absolute percentage error) expresses the miss as a percentage of the actual value, so it’s unit-free and easy to communicate.
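
Written out, with a_t the actual value and f_t the forecast at test step t, and H steps in the test set (the symbols are chosen here to match the a and f arguments of the functions in this section):

```latex
\mathrm{MAE} = \frac{1}{H}\sum_{t=1}^{H} \lvert a_t - f_t \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{H}\sum_{t=1}^{H} (a_t - f_t)^2}
\qquad
\mathrm{MAPE} = \frac{100}{H}\sum_{t=1}^{H} \left\lvert \frac{a_t - f_t}{a_t} \right\rvert
```

MAPE divides by the actual value, so it is undefined whenever some a_t is zero and becomes unstable when actuals are close to zero.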

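def mae(a, f):  return np.mean(np.abs(a - f))               # average miss, in trips
def rmse(a, f): return np.sqrt(np.mean((a - f) ** 2))       # squares errors, so big misses weigh more
def mape(a, f): return np.mean(np.abs((a - f) / a)) * 100   # percent of actual; actuals must be nonzero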
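
To check the three functions by hand, it helps to run them on a tiny made-up example where the arithmetic is easy to follow. The numbers here are invented for illustration and have nothing to do with Cyclepath; the functions are repeated so the block runs on its own.

```python
import numpy as np

def mae(a, f):  return np.mean(np.abs(a - f))
def rmse(a, f): return np.sqrt(np.mean((a - f) ** 2))
def mape(a, f): return np.mean(np.abs((a - f) / a)) * 100

actual   = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 180.0, 400.0])   # misses of 10, 20, and 0

print(mae(actual, forecast))              # 10.0   -> (10 + 20 + 0) / 3
print(round(rmse(actual, forecast), 2))   # 12.91  -> sqrt((100 + 400 + 0) / 3)
print(round(mape(actual, forecast), 2))   # 6.67   -> (10% + 10% + 0%) / 3
```

RMSE comes out above MAE because the single miss of 20 counts four times as much as the miss of 10 once squared.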

Run all three on both baselines against the held-out year and you get:

naive           MAE 3,497   RMSE 4,219   MAPE 19.0%
seasonal-naive  MAE   998   RMSE 1,024   MAPE  5.9%
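
One way to produce a table like this is to loop over the two baselines and score each against the test set. This sketch repeats the lesson's cyclepath(), split, baselines, and metric functions so it runs on its own; the dictionary layout and print formatting are assumptions. No expected output is shown because the exact figures depend on the random generator.

```python
import numpy as np
import pandas as pd

def cyclepath():
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

def mae(a, f):  return np.mean(np.abs(a - f))
def rmse(a, f): return np.sqrt(np.mean((a - f) ** 2))
def mape(a, f): return np.mean(np.abs((a - f) / a)) * 100

y = cyclepath()
h = 12
train, test = y.iloc[:-h], y.iloc[-h:]

forecasts = {
    "naive": pd.Series(train.iloc[-1], index=test.index),
    "seasonal-naive": pd.Series(train.iloc[-12:].values, index=test.index),
}

# Score every baseline against the same held-out year
scores = {}
for name, f in forecasts.items():
    scores[name] = {"MAE": mae(test, f), "RMSE": rmse(test, f), "MAPE": mape(test, f)}
    s = scores[name]
    print(f"{name:<15} MAE {s['MAE']:>6,.0f}   RMSE {s['RMSE']:>6,.0f}   MAPE {s['MAPE']:>5.1f}%")
```

Keeping the results in a dictionary makes it easy to add a later model's forecast under a new key and score it on exactly the same test set.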

Read those numbers carefully, because they set the agenda for the rest of the course. Seasonal-naive cuts the error by roughly 3.5x on MAE (998 versus 3,497) and roughly 3.2x on MAPE (5.9% versus 19.0%), and it earns that entirely by capturing the yearly pattern the flat forecast throws away. This is now the bar. Any ARIMA or SARIMA model you build later must beat seasonal-naive’s 5.9% MAPE to be worth its added complexity. A model that can’t clear a one-line replay of last year is not adding value, no matter how impressive its equations look.
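
It can help to see where seasonal-naive's remaining error comes from. Assuming the generating formula in cyclepath() and ignoring its noise and rounding, the seasonal term repeats exactly every 12 months, so last year's value differs from this year's only by 12 months of trend: 90 × 12 = 1,080 trips. A sketch of that check on the noise-free signal:

```python
import numpy as np

t = np.arange(96)
signal = 9000 + 90 * t + 3200 * np.sin(2 * np.pi * (t - 3) / 12)   # trend + seasonality, no noise

h = 12
train_signal, test_signal = signal[:-h], signal[-h:]
seasonal_naive = train_signal[-12:]            # replay the previous 12 months

errors = test_signal - seasonal_naive
print(np.round(errors, 6))   # every entry is 1080: twelve months of trend at 90 trips/month
```

That 1,080 is close to the reported MAE of 998, which suggests most of seasonal-naive's miss on this series is the year of growth it does not account for. A model that captures trend as well as seasonality therefore has room to beat it.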

Always beat the baseline — or explain why you can’t

Compute the naive and seasonal-naive baselines before you fit anything, and write their MAPE at the top of your notebook. From then on, every model reports its own MAPE right next to that number. If your fancy model doesn’t beat seasonal-naive, one of three things is true: your series has weak seasonality, your model is misconfigured, or the extra complexity genuinely isn’t buying you anything — and in that last case, ship the one-liner. The baseline isn’t a formality you clear once; it’s the yardstick you hold up against every result for the rest of the project.
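
A small formatter can make that side-by-side reporting a habit. The helper name, its arguments, and the output wording are hypothetical; the two MAPE values in the usage line are the ones reported in this lesson.

```python
def versus_baseline(model_name, model_mape, baseline_mape, baseline_name="seasonal-naive"):
    """Format a model's MAPE next to the baseline's, with the relative change (hypothetical helper)."""
    change = (baseline_mape - model_mape) / baseline_mape * 100   # positive means the model is better
    verdict = "beats" if model_mape < baseline_mape else "does not beat"
    return (f"{model_name}: MAPE {model_mape:.1f}% vs {baseline_name} {baseline_mape:.1f}% "
            f"({verdict} baseline, {change:+.1f}%)")

# Seasonal-naive measured against plain naive, using the lesson's numbers
print(versus_baseline("seasonal-naive", 5.9, 19.0, baseline_name="naive"))
# seasonal-naive: MAPE 5.9% vs naive 19.0% (beats baseline, +68.9%)
```

Later models would be passed in the same way, with seasonal-naive's 5.9% as the default baseline to clear.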


Practice Exercises

Exercise 1: Why not just shuffle and split 80/20?

A teammate wants to run train_test_split(y, test_size=0.2, random_state=0) on the Cyclepath series “like we always do.” Why is that wrong here, and what do you do instead?

Hint

A random 80/20 split scatters future months (2023) into the training set and past months into the test set, so the model trains on data from after the period it’s scored on — it sees the future, and the test score is meaningless. Forecasting requires a chronological split: sort by time and hold out the last H observations (y.iloc[:-h], y.iloc[-h:]). Train on the earliest months, test on the latest, mirroring how a real forecast runs — you never have tomorrow’s data when you predict tomorrow.
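
The leak is easy to demonstrate on the dates alone. This sketch imitates a random 80/20 split with a NumPy permutation rather than calling train_test_split, so the exact rows differ from what scikit-learn would pick; the seed and variable names are assumptions.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", periods=96, freq="MS")   # same calendar as Cyclepath

rng = np.random.default_rng(0)
shuffled = rng.permutation(len(idx))
n_test = round(0.2 * len(idx))                 # 19 of 96 months
test_dates = idx[shuffled[:n_test]]
train_dates = idx[shuffled[n_test:]]

# Leak check: does any training month come after the earliest test month?
leaks = train_dates.max() > test_dates.min()
print(n_test, leaks)                           # 19 True

# Chronological alternative: every training month precedes every test month
h = 12
print(idx[:-h].max() < idx[-h:].min())         # True
```

A shuffled split fails this check unless the test months happen to be exactly the most recent ones, which is the chronological split anyway.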

Exercise 2: Why does seasonal-naive crush naive here?

Naive scores 19.0% MAPE and seasonal-naive scores 5.9% on the same held-out year. What property of Cyclepath makes the gap so large, and when would the gap shrink?

Hint

Cyclepath has strong yearly seasonality — big summer peaks, deep winter troughs — so a flat “last value” forecast is badly wrong for most of the year, while “same month last year” lands close. The gap is large precisely because seasonality is strong. On a series with little or no seasonality (say, a slow random walk), naive and seasonal-naive would score similarly, and plain naive might even win because it uses the most recent value rather than one from a year ago.
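
A quick simulation can illustrate the no-seasonality case. The setup is an assumption chosen for illustration: 500 independent random walks with unit-variance steps, each split 84/12 like Cyclepath, with both baselines scored by MAE pooled over all walks. No expected output is shown because the figures depend on the random generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n, h, m = 500, 96, 12, 12

# Each row is one random walk: the cumulative sum of independent steps
walks = rng.normal(size=(n_sims, n)).cumsum(axis=1)
train, test = walks[:, :-h], walks[:, -h:]

naive = np.repeat(train[:, -1:], h, axis=1)     # last value, held flat
seasonal_naive = train[:, -m:]                  # replay of the previous 12 steps

naive_mae = np.abs(test - naive).mean()
seasonal_mae = np.abs(test - seasonal_naive).mean()
print(round(float(naive_mae), 2), round(float(seasonal_mae), 2))
```

On average across the simulated walks, plain naive comes out ahead: with no seasonal pattern to replay, a value from 12 steps back is simply staler than the most recent one. On any single walk either baseline can win.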

Exercise 3: Reading MAE against RMSE

For seasonal-naive, MAE is 998 and RMSE is 1,024 — very close. For naive, MAE is 3,497 and RMSE is 4,219 — RMSE noticeably larger. What does the size of the RMSE-minus-MAE gap tell you?

Hint

RMSE squares errors before averaging, so it’s inflated by a few large misses; when RMSE sits well above MAE, the errors are uneven — some months are badly wrong. Naive’s wider gap (4,219 vs 3,497) reflects its huge misses at the seasonal peaks and troughs. Seasonal-naive’s tight gap (1,024 vs 998) says its errors are small and consistent across the year. Report MAE for the typical miss, RMSE when big misses are especially costly, and MAPE to communicate error as a plain percentage.
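
The effect can be reproduced with two made-up error patterns that share the same MAE. The last two lines use the identity that mean squared error equals squared MAE plus the variance of the absolute errors, applied to the rounded metrics reported in this lesson, so the results are approximate. The helper name is an assumption.

```python
import numpy as np

def mae_rmse(errors):
    errors = np.asarray(errors, dtype=float)
    return np.mean(np.abs(errors)), np.sqrt(np.mean(errors ** 2))

patterns = {
    "even":   [1000] * 12,              # the same miss every month
    "uneven": [0] * 6 + [2000] * 6,     # same average miss, concentrated in half the months
}
for name, errs in patterns.items():
    m, r = mae_rmse(errs)
    print(f"{name}: MAE {m:.0f}, RMSE {r:.0f}")
# even: MAE 1000, RMSE 1000
# uneven: MAE 1000, RMSE 1414

# Spread (standard deviation) of the absolute errors implied by sqrt(RMSE^2 - MAE^2)
print(int(round(np.sqrt(1024**2 - 998**2))))    # 229   seasonal-naive: misses are nearly constant
print(int(round(np.sqrt(4219**2 - 3497**2))))   # 2360  naive: misses vary widely month to month
```

RMSE equals MAE only when every miss has the same size; the wider the spread of misses, the further RMSE pulls above MAE.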


Summary

Forecasting is tested with a chronological split, never a random one: you hold out the last H observations (H is the forecast horizon) as a test set that sits after your training data, so the model is scored the way it will actually run. For Cyclepath you kept 84 months of training through December 2022 and tested on all of 2023. Then you built two baselines: naive (predict the last observed value) and seasonal-naive (predict the value from one season earlier). Scored with MAE (average miss in units), RMSE (penalizes big misses), and MAPE (unit-free percentage), seasonal-naive beat naive by roughly 3.5x on MAE (998 versus 3,497) and roughly 3.2x on MAPE (5.9% versus 19.0%), because it captures the yearly pattern naive ignores. That 5.9% is now the bar: any model you build later must beat it to justify its complexity.

Key Concepts

  • Chronological split — slice off the last H observations; test must be later than train, never shuffled.
  • Forecast horizon (H) — how many steps ahead you predict; here, 12 months.
  • Naive & seasonal-naive — one-line baselines: last value, and same period last season.
  • MAE / RMSE / MAPE — typical miss in units, squared-penalty on big misses, and unit-free percentage.

Why This Matters

Most forecasting failures are either an honesty failure or a value failure — a leaky split that makes a mediocre model look brilliant, or a complex model that never actually beat a one-line guess. A chronological split and a pair of baselines close both gaps in a handful of lines: you get a score you can trust and a bar every future model must clear. This is the discipline that keeps the rest of the course grounded — when you fit SARIMA and read its MAPE, you’ll know instantly whether it earned its place by comparing it to the 5.9% you established today. Next, you’ll put all four foundation lessons together on the full Cyclepath series in a hands-on guided project.


Next Steps

Continue to Lesson 5 - Guided Project: Meet the Cyclepath Series

Put it all together: load the full Cyclepath series in pandas, explore its trend and seasonality, split it, and beat the baselines end to end.


Continue Building Your Skills

You can now split a series the way forecasting demands — chronologically, holding out a full horizon — and set the bar with naive and seasonal-naive baselines scored on MAE, RMSE, and MAPE. That gives you both an honest test and a number to beat. Next you’ll bring the whole module together in a guided project: load Cyclepath, explore its structure hands-on, split it, and clear the baselines yourself, so you’re ready to start modeling in earnest.