Lesson 5 - Guided Project: Meet the Cyclepath Series

Welcome to the Guided Project

Across this module you took time series apart piece by piece: what makes them a distinct discipline (order, autocorrelation, split by time), the pandas DatetimeIndex with resampling and rolling windows, and the chronological split with naive baselines that keep you honest. Now you’ll put it all back together on one series — the one you’ll model for the rest of the course. Cyclepath, the fictional city bike-share, is going to go from a line of generator code to a fully explored series with an honest test set and a baseline every future model has to beat. By the end, you’ll have built the series, seen its trend and yearly summer peaks, split off the last year without leaking it, and measured exactly how well the dumbest reasonable forecast does — the number ARIMA, SARIMA, and everything else will be judged against.

By the end of this project, you will be able to:

  • Generate the canonical Cyclepath monthly series with a DatetimeIndex and confirm its shape
  • Explore trend and seasonality with summary stats, resampling to yearly totals, and a rolling mean
  • Split the series chronologically, holding out the last 12 months as an honest test set
  • Establish naive and seasonal-naive baselines with MAE and MAPE: the bar every model must beat

We’ll build it in stages, reusing the exact tools from each lesson in the module. Let’s meet Cyclepath.


Stage 1: Generate the Series

Start with the data itself. Cyclepath is seeded synthetic — generated in code with a fixed random seed — so every number you compute below matches the lesson exactly and you can rerun the whole analysis end to end. The generator builds a monthly DatetimeIndex (Lesson 2), adds a steady upward trend, a yearly seasonal wave that peaks in summer, and a little noise, then rounds to whole trips.

import numpy as np
import pandas as pd

def cyclepath():
    # 96 month-start timestamps: 2016-01 through 2023-12
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)                      # fixed seed for reproducibility
    trend = 9000 + 90 * t                                # steady growth of 90 trips per month
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)   # yearly wave, peaking in July
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()
print(len(y), y.index.freqstr, y.index[0].date(), y.index[-1].date())  # 96 MS 2016-01-01 2023-12-01
print(y.iloc[:3].tolist())                                             # [5907, 5955, 7843]

That freq="MS" means month start — one point stamped at the first of each month. You get exactly 96 monthly points spanning 2016-01 through 2023-12 (eight full years), and the first three values are [5907, 5955, 7843] — the winter trough climbing toward spring. This is the series. Everything else in this project, and in this course, operates on y.
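Before exploring, it can be worth confirming that the index really is a clean monthly grid. The checks below are a minimal sketch, not part of the lesson's own code: the names expected_index and checks are illustrative, and the generator is repeated only so the block runs on its own.

```python
import numpy as np
import pandas as pd

def cyclepath():
    # Same generator as in the lesson, repeated so this block is self-contained.
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()

# Illustrative integrity checks (hypothetical names, not from the lesson).
expected_index = pd.date_range(y.index[0], y.index[-1], freq="MS")
checks = {
    "sorted": bool(y.index.is_monotonic_increasing),   # time runs forward
    "unique": bool(y.index.is_unique),                 # no duplicated months
    "no_gaps": bool(y.index.equals(expected_index)),   # every month is present
    "no_missing": not bool(y.isna().any()),            # no NaN values
    "full_years": len(y) % 12 == 0,                    # whole years only
}
print(checks)
```

If any entry came back False on a real dataset, that would be the moment to fix the index before resampling or splitting, since both operations assume an ordered, gap-free timeline.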


Stage 2: Explore It

Before you split or model anything, look at what you have (Lesson 3). Start with the one-line summary:

print(f"min {y.min():,}  mean {y.mean():,.0f}  max {y.max():,}")   # min 5,907  mean 13,267  max 20,533

Ridership ranges from a low of 5,907 to a high of 20,533, averaging about 13,267 trips a month. A line plot of y would show two things at once: a steady upward trend — the city adopting bikes as the network grows — and a yearly rhythm of tall summer peaks and deep winter troughs that repeats eight times. That’s exactly the trend-plus-seasonality structure Lesson 1 promised.
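If you can't plot right away, one rough way to see the yearly rhythm numerically is to average ridership by calendar month. This is a sketch under the assumption that a month-of-year average is a fair stand-in for the seasonal shape; because the series also trends upward, later months in the year sit slightly higher than the pure seasonal wave would suggest. The names profile, peak_month, and trough_month are illustrative.

```python
import numpy as np
import pandas as pd

def cyclepath():
    # Same generator as in the lesson, repeated so this block is self-contained.
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()

# Average trips for each calendar month (1 = January ... 12 = December).
profile = y.groupby(y.index.month).mean()
peak_month = int(profile.idxmax())
trough_month = int(profile.idxmin())

print(profile.round(0).astype(int).to_dict())
print("peak month:", peak_month, "trough month:", trough_month)
```

The peak should land in mid-summer and the trough in mid-winter, matching the tall summer peaks and deep winter troughs a line plot would show.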

To see the growth on its own, resample the monthly series to yearly totals — collapsing the seasonal wiggle so only the trend remains:

yearly = y.resample("YS").sum()
print(yearly.values)   # [113347 127335 140950 153225 165816 177929 191528 203501]

Annual trips climb every single year: 113,347 → 127,335 → 140,950 → 153,225 → 165,816 → 177,929 → 191,528 → 203,501 (2016 through 2023) — roughly 80% growth over eight years, with no reversals. To smooth the monthly series without collapsing it, overlay a 12-month rolling mean (Lesson 2), which averages each month with the eleven before it, cancelling the seasonal cycle and leaving the trend:
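To put numbers on "climbs every single year", the yearly totals printed above can be turned into year-over-year growth rates. The sketch below hardcodes those published totals so it runs without the generator; yoy and total_growth are illustrative names.

```python
import pandas as pd

# Yearly totals as printed in the lesson, 2016 through 2023.
yearly = pd.Series(
    [113347, 127335, 140950, 153225, 165816, 177929, 191528, 203501],
    index=pd.date_range("2016-01-01", periods=8, freq="YS"),
    name="trips",
)

# Percentage change from each year to the next (the first year has no predecessor).
yoy = yearly.pct_change().dropna() * 100
# Growth from the first full year to the last.
total_growth = (yearly.iloc[-1] / yearly.iloc[0] - 1) * 100

print(yoy.round(1).tolist())
print(round(total_growth, 1))  # 79.5
```

Every year-over-year figure is positive, and the growth rate shrinks over time because the trend adds a fixed number of trips per month to an ever larger base.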

ma = y.rolling(12).mean()
print(f"{ma.dropna().iloc[0]:,.0f} -> {ma.dropna().iloc[-1]:,.0f}")   # 9,446 -> 16,958

The first 11 values are NaN (a 12-month window needs 12 points), and from there the rolling mean rises smoothly from about 9,446 to 16,958. That is the trend line under the seasonal swing, up roughly 80% as Cyclepath grows.
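Why does a 12-month window cancel the seasonal cycle? A quick way to convince yourself is to apply the same rolling mean to the generator's noise-free components separately. This is a sketch using the trend and sine terms from cyclepath() without the noise; seasonal_ma, trend_ma, max_leftover, and lag are illustrative names.

```python
import numpy as np
import pandas as pd

t = np.arange(96)
idx = pd.date_range("2016-01-01", periods=96, freq="MS")

# The generator's deterministic pieces, without noise.
seasonal = pd.Series(3200 * np.sin(2 * np.pi * (t - 3) / 12), index=idx)
trend = pd.Series(9000 + 90 * t, index=idx, dtype=float)

# A 12-month window always spans exactly one full cycle of the sine wave.
seasonal_ma = seasonal.rolling(12).mean()
max_leftover = float(seasonal_ma.abs().max())
print(max_leftover < 1e-6)  # True: the seasonal term averages out to zero

# The same trailing window applied to the trend lags the current value.
trend_ma = trend.rolling(12).mean()
lag = (trend - trend_ma).dropna()
print(round(float(lag.iloc[0]), 6))  # 495.0
```

The second result is worth noticing: a trailing 12-month mean sits 5.5 months behind the current trend level (5.5 months times 90 trips per month is 495 trips), so the rolling mean shows where the trend was about half a year ago, not where it is today.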


Stage 3: Split Chronologically

Now the move that keeps you honest (Lesson 4). To measure a forecast fairly you need a test set the model has never seen — and for a time series that test set must be later in time than the training data, never a random sample. You hold out the last 12 months (all of 2023) as the test set and train on everything before it:

h = 12
train, test = y.iloc[:-h], y.iloc[-h:]
print(len(train), train.index[-1].date())   # 84 2022-12-01
print(len(test),  test.index[0].date(), test.index[-1].date())  # 12 2023-01-01 2023-12-01

That gives 84 training months (through 2022-12) and 12 test months (all of 2023). Notice there’s no shuffle — the split is a single clean cut in time, so nothing from 2023 leaks backward into training. This mimics reality exactly: you stand at the end of 2022 and forecast the year you haven’t lived yet.
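The "no leakage" claim can be checked mechanically rather than taken on trust. Below is a minimal sketch of a split helper with guard rails; chrono_split is a hypothetical function, not something from pandas or the lesson, and the stand-in series only borrows Cyclepath's index because the values don't matter for this check.

```python
import numpy as np
import pandas as pd

def chrono_split(series, horizon):
    """Hold out the last `horizon` points (hypothetical helper, not from the lesson)."""
    if not series.index.is_monotonic_increasing:
        raise ValueError("sort the series by time before splitting")
    if not 0 < horizon < len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series.iloc[:-horizon], series.iloc[-horizon:]

# Stand-in series on Cyclepath's monthly index; placeholder values only.
idx = pd.date_range("2016-01-01", periods=96, freq="MS")
stand_in = pd.Series(np.arange(96), index=idx, name="trips")

train, test = chrono_split(stand_in, 12)

# Every training timestamp must come strictly before every test timestamp.
no_leak = train.index.max() < test.index.min()
no_overlap = train.index.intersection(test.index).empty
print(len(train), len(test), no_leak, no_overlap)  # 84 12 True True
```

Refusing to split an unsorted series is a deliberate choice here: slicing by position only means "earlier versus later" when the rows are already in time order.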

Figure: The chronological split, marked by a vertical dashed line: train on the first 84 months (through 2022-12), test on the last 12 (2023). The flat naive forecast (last value) ignores seasonality and lands far off at 19.0% MAPE; the seasonal-naive forecast (last year repeated) replays last year's shape and hugs the actual 2023 curve at 5.9% MAPE, the bar every model must beat.

Stage 4: Baseline It

A test set is only useful next to a baseline — the dumbest reasonable forecast, so you know whether a fancy model is actually earning its complexity (Lesson 4). Two baselines matter here. Naive forecasts every future month as the last value observed. Seasonal-naive forecasts each month as the value from 12 months earlier — replaying last year’s shape, which is exactly right for a seasonal series.

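# Naive: repeat the last training value (December 2022) across the test year.
naive = pd.Series(train.iloc[-1], index=test.index)
# Seasonal-naive: replay the last 12 training months (2022) onto 2023.
seasonal_naive = pd.Series(train.iloc[-12:].values, index=test.index)

def mae(a, f):
    return np.mean(np.abs(a - f))               # mean absolute error, in trips

def mape(a, f):
    return np.mean(np.abs((a - f) / a)) * 100   # mean absolute percentage error

print(round(mae(test, naive)), round(mape(test, naive), 1))                    # 3497 19.0
print(round(mae(test, seasonal_naive)), round(mape(test, seasonal_naive), 1))  # 998 5.9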

The plain naive forecast is a flat line at December 2022’s value — it ignores seasonality entirely and misses by a mile: MAE 3,497 trips, MAPE 19.0%. The seasonal-naive forecast replays 2022’s monthly pattern and tracks 2023 closely: MAE 998 trips, MAPE 5.9% — more than three times better, because Cyclepath is seasonal and last year’s shape is a strong guess for this year’s. That 5.9% MAPE is the bar. Any model you build for the rest of the course has to beat it, or it isn’t worth its complexity.
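MAE and MAPE treat every miss in proportion to its size. If you also want a metric that punishes large misses more heavily, RMSE is a common companion. The sketch below is an optional extra rather than part of the lesson's scoreboard, and it uses a tiny made-up example (the actual and forecast arrays are hypothetical) so each number can be checked by hand.

```python
import numpy as np

def mae(a, f):
    return float(np.mean(np.abs(a - f)))

def rmse(a, f):
    # Square the errors before averaging, then take the root to return to trips.
    return float(np.sqrt(np.mean((a - f) ** 2)))

def mape(a, f):
    return float(np.mean(np.abs((a - f) / a)) * 100)

# Hypothetical toy values, not Cyclepath data.
actual = np.array([100.0, 200.0, 300.0])
forecast = np.array([110.0, 190.0, 330.0])

print(round(mae(actual, forecast), 2))   # 16.67
print(round(rmse(actual, forecast), 2))  # 19.15
print(round(mape(actual, forecast), 2))  # 8.33
```

RMSE comes out larger than MAE here because the single 30-unit miss dominates once errors are squared. The same rmse function accepts the test and forecast Series from this stage unchanged.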

The baseline is your yardstick

Seasonal-naive at 5.9% MAPE isn’t just a number — it’s the yardstick you measure every future model against. A model that can’t beat this simple “same month last year” forecast has learned nothing worth the added complexity, tuning, and risk, no matter how sophisticated it looks. This is the discipline that separates real forecasting from cargo-cult modeling: always establish a naive baseline first, then make each new model prove it’s better. If your ARIMA can’t beat replaying last year, ship the baseline.


Stage 5: The Takeaway

Step back and look at what this project produced. You now hold three things you’ll use for the entire rest of the course:

  1. The Cyclepath series — 96 months of seeded, reproducible monthly ridership with a clear upward trend and strong yearly seasonality, built from one cyclepath() function.
  2. An honest chronological test set — the last 12 months (2023) held out with no shuffle and no leakage, so any accuracy you report is the accuracy you’d actually get standing at the end of 2022.
  3. A baseline to beat — seasonal-naive at 5.9% MAPE, the bar every future model is judged against.

That’s the whole point of Module 1: not a model yet, but the scaffolding that makes every future model trustworthy. From here, the work is beating the baseline honestly. Next up, Module 2 takes this exact series and decomposes it — separating the trend, the seasonality, and the leftover residual into their own components — so you can see, and eventually model, each piece on its own.


Practice Exercises

Exercise 1: Change the horizon to 24 months

Instead of holding out the last 12 months, hold out the last 24 (all of 2022 and 2023). Re-split, recompute the seasonal-naive baseline, and check whether the MAPE gets better or worse. Why might a longer test period be harder?

Hint

Set h = 24 and re-run the split: train, test = y.iloc[:-h], y.iloc[-h:], giving 72 training months and 24 test months. Seasonal-naive always replays the most recent full season, so for a 24-month horizon you repeat the last 12 training months (2021) twice: pd.Series(np.tile(train.iloc[-12:].values, 2), index=test.index). Replaying the last 24 training months instead would forecast 2022 from 2020, which is a lag of 24 rather than 12. A longer horizon is usually harder because you’re forecasting further from the data you have, and the trend keeps rising: the second forecast year reuses values that are two years old, so it underestimates a growing series more the further out you go.
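One common way to extend seasonal-naive past a single season is to repeat the most recent full season as many times as the horizon needs. The sketch below applies that to a 24-month hold-out; the variable names (m, reps, last_season, mape_24) are illustrative, and no expected MAPE is shown so you can compare your own result against the 12-month figure.

```python
import numpy as np
import pandas as pd

def cyclepath():
    # Same generator as in the lesson, repeated so this block is self-contained.
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()
h, m = 24, 12                                  # horizon and season length, in months
train, test = y.iloc[:-h], y.iloc[-h:]

# Repeat the last full training season until it covers the horizon.
last_season = train.iloc[-m:].to_numpy()
reps = int(np.ceil(h / m))
seasonal_naive = pd.Series(np.tile(last_season, reps)[:h], index=test.index)

mape_24 = float(np.mean(np.abs((test - seasonal_naive) / test)) * 100)
print(len(train), len(test))
print(round(mape_24, 1))
```

Both forecast years are copies of 2021, so the second year is compared against data two years newer than its source.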

Exercise 2: Compute a rolling standard deviation

You overlaid a 12-month rolling mean to see the trend. Now compute a 12-month rolling standard deviation with y.rolling(12).std() and describe what it tells you. Is the size of the seasonal swing constant over the years, or growing?

Hint

The rolling std measures how much the series varies within each 12-month window, which is essentially the amplitude of the seasonal swing plus noise. If it grows over time, the summer-to-winter gap is widening as ridership climbs, which is a sign of multiplicative seasonality (the swing scales with the level) rather than additive. For Cyclepath specifically, the generator adds a fixed 3,200-trip sine wave to the trend, so expect the rolling std to hold at roughly the same level across the years: additive seasonality. That distinction is exactly what Module 2 formalizes when it asks whether to model the series as trend-plus-seasonality or trend-times-seasonality.
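A starting point for this exercise might look like the sketch below. It compares the first and last complete windows, which both cover January through December and are therefore directly comparable; roll_std and valid are illustrative names, and no expected values are shown.

```python
import numpy as np
import pandas as pd

def cyclepath():
    # Same generator as in the lesson, repeated so this block is self-contained.
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()

# Standard deviation within each trailing 12-month window.
roll_std = y.rolling(12).std()
valid = roll_std.dropna()

# First complete window is calendar 2016, last is calendar 2023.
print(f"{valid.iloc[0]:,.0f} -> {valid.iloc[-1]:,.0f}")
print(f"min {valid.min():,.0f}  max {valid.max():,.0f}")
```

When reading the result, look at the overall level across years rather than month-to-month wiggles: windows that start in different months mix the trend and the seasonal wave slightly differently, so some variation is expected even when the swing itself is constant.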

Exercise 3: Why lag 12, not lag 1?

Seasonal-naive forecasts each month using the value from 12 months earlier, not the value from 1 month earlier. On Cyclepath’s monthly series, why is lag 12 the right choice? What lag would you use for daily data with a weekly cycle?

Hint

The lag should match the length of the seasonal cycle. Cyclepath repeats every 12 months (each July resembles last July far more than it resembles this June), so lag 12 replays the matching month. Lag 1 only ever has the last observed month to repeat, so across a 12-month horizon it collapses to a flat line: that’s the plain naive forecast that scored 19.0%. For daily data with a weekly cycle, the season length is 7, so you’d forecast each day as the value from 7 days earlier (last Monday predicts this Monday).
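The daily case can be sketched with the season length as a parameter. Everything in this example is hypothetical: the seasonal_naive helper is not a library function, and the daily counts are made-up values with a perfectly repeating weekly pattern, chosen so the result is easy to verify.

```python
import numpy as np
import pandas as pd

def seasonal_naive(train, horizon, season_length, future_index):
    """Repeat the last full season of `train` across the horizon (hypothetical helper)."""
    last_season = train.iloc[-season_length:].to_numpy()
    reps = -(-horizon // season_length)          # ceiling division
    return pd.Series(np.tile(last_season, reps)[:horizon], index=future_index)

# Four weeks of made-up daily counts, Monday through Sunday (2024-01-01 is a Monday).
idx = pd.date_range("2024-01-01", periods=28, freq="D")
weekly_pattern = np.array([50, 52, 54, 53, 60, 90, 85])
daily = pd.Series(np.tile(weekly_pattern, 4), index=idx)

train, test = daily.iloc[:-7], daily.iloc[-7:]

# Lag 7: last Monday predicts this Monday, and so on through the week.
forecast = seasonal_naive(train, horizon=7, season_length=7, future_index=test.index)
# Plain naive for comparison: last Sunday's value held flat for the whole week.
naive = pd.Series(train.iloc[-1], index=test.index)

print(float(np.mean(np.abs(test - forecast))))  # 0.0
print(float(np.mean(np.abs(test - naive))))     # 23.0
```

The zero error is an artifact of the perfectly repeating toy data, not something to expect in practice. The point is the contrast: matching the lag to the cycle length captures the weekend spike that the flat forecast misses.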


Summary

You built and explored the Cyclepath series end to end and set the baseline the rest of the course must beat. You generated the series with cyclepath() — 96 monthly points on a DatetimeIndex from 2016-01 to 2023-12, starting [5907, 5955, 7843]. You explored it: min 5,907, mean about 13,267, max 20,533; a line plot revealing upward trend plus yearly summer peaks; yearly totals climbing 113,347 → 203,501; and a 12-month rolling mean rising from about 9,446 to 16,958. You split it chronologically — 84 training months through 2022-12, 12 test months across 2023, no shuffle, no leakage. And you baselined it: naive scored MAE 3,497 / MAPE 19.0%, while seasonal-naive scored MAE 998 / MAPE 5.9% — all computed for real with pandas and numpy. Seasonal-naive at 5.9% is the bar. You now have the series, an honest test set, and a baseline to beat.

Key Concepts

  • Seeded synthetic series — Cyclepath is generated with a fixed seed, so every number is reproducible and matches exactly on rerun.
  • Explore before you model — summary stats, a line plot, yearly resampling, and a rolling mean reveal the trend and seasonality you’ll exploit.
  • Chronological hold-out — cut off the last 12 months as a test set with no shuffle, so reported accuracy reflects real forecasting.
  • Naive baselines as the bar — seasonal-naive (lag 12) at 5.9% MAPE is the yardstick every future model must beat to justify its complexity.

Why This Matters

This project is the foundation the whole course rests on. Without a fixed series, an honest test set, and a baseline, every model you build afterward would be measuring itself against nothing — and a model measured against nothing can fool you into shipping something worse than replaying last year. By establishing seasonal-naive at 5.9% MAPE now, you’ve given every future technique — decomposition, ARIMA, SARIMA, backtesting — a concrete target and an honest scoreboard. Get this scaffolding right, as you just did, and the rest of the course is a disciplined march of “did this actually beat the baseline?” Next, Module 2 decomposes Cyclepath into its trend, seasonality, and residual so you can model each piece deliberately.


Next Steps

Continue to Module 2 - Components and Decomposition

Separate a series into trend, seasonality, and residual — additive vs multiplicative, and STL.


Continue Building Your Skills

You now have the Cyclepath series, an honest chronological test set, and a baseline to beat — seasonal-naive at 5.9% MAPE — which is exactly the scaffolding every serious forecasting project starts from. With the foundation in place, the next module stops treating the series as one signal and pulls it apart: isolating the trend, the yearly seasonality, and the leftover residual so each can be understood and modeled on its own. On to components and decomposition.