Lesson 5 - Guided Project: Making Cyclepath Stationary

Welcome to the Guided Project

This module built up every piece separately: what stationarity requires (Lesson 1), how to test for it formally (Lesson 2), how regular differencing removes trend but leaves seasonality behind (Lesson 3), and how seasonal differencing targets that leftover structure — while combining it with regular differencing turned out to overdo it (Lesson 4). Now you’ll run the whole decision process on Cyclepath in one pass, the way you’d actually approach a new series: test, try a fix, test again, compare alternatives, and choose with evidence rather than habit.

By the end of this project, you will be able to:

  • Run the ADF test on a raw series and interpret a failing result
  • Compare multiple differencing strategies side by side using both the ADF test and variance
  • Identify the differencing choice that stationarizes a series most efficiently
  • Explain why “passes the ADF test” is necessary but not sufficient for choosing a final transformation

Let’s make Cyclepath stationary — properly.


Stage 1: Rebuild the Series

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, acf

def cyclepath():
    # Monthly series: 96 months starting January 2016.
    idx = pd.date_range("2016-01-01", periods=96, freq="MS")
    t = np.arange(96)
    rng = np.random.default_rng(42)
    trend = 9000 + 90 * t                                # rises by 90 trips per month
    seasonal = 3200 * np.sin(2 * np.pi * (t - 3) / 12)   # 12-month cycle
    noise = rng.normal(0, 350, 96)
    return pd.Series(np.round(trend + seasonal + noise).astype(int), index=idx, name="trips")

y = cyclepath()

Stage 2: Test the Raw Series

def adf_report(name, s):
    s = s.dropna()
    stat, pval, *_ = adfuller(s, autolag="AIC")
    print(f"{name:16s} n={len(s):3d}  ADF={stat:7.3f}  p={pval:.4f}  var={s.var():10.1f}")
    return pval

adf_report("raw", y)
# raw              n= 96  ADF= -0.920  p=0.7815  var=11766025.3

p = 0.7815, nowhere near significant. The ADF test cannot reject a unit root, so raw Cyclepath has to be treated as non-stationary. That is the starting point every fix in this project has to improve on.


Stage 3: Try Regular Differencing

d1 = y.diff()
adf_report("d=1", d1)
# d=1              n= 95  ADF= -8.642  p=0.0000  var= 1567486.3

a_d1 = acf(d1.dropna(), nlags=12, fft=True)
print(round(a_d1[12], 3))   # 0.801

Passes the ADF test convincingly — but that lag-12 autocorrelation of 0.801 is a loud signal that seasonality is still fully present. Regular differencing alone isn’t the answer, even though it technically clears the bar.
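
To see what a lag-12 autocorrelation actually measures, it can be computed by hand. The sketch below uses an illustrative pure 12-month sine wave rather than Cyclepath, and a hypothetical helper named lag_autocorr. The formula is the usual unadjusted sample estimator, which is my understanding of what acf computes by default.

```python
import numpy as np

def lag_autocorr(x, k):
    # Sample autocorrelation at lag k: the sum of products of deviations
    # k steps apart, divided by the total sum of squared deviations.
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return float(np.sum(dev[:-k] * dev[k:]) / np.sum(dev ** 2))

# Illustrative series (an assumption, not Cyclepath): a pure 12-month cycle.
t = np.arange(96)
wave = np.sin(2 * np.pi * t / 12)

print(round(lag_autocorr(wave, 12), 3))  # 0.875: values one year apart move together
print(round(lag_autocorr(wave, 6), 3))   # strongly negative: half a year apart they oppose
```

Even a perfectly seasonal series scores 0.875 rather than 1.0 here, because only 84 of the 96 points have a partner twelve steps later. Seen that way, a lag-12 value of 0.801 on the 95-point differenced series is close to the ceiling for this estimator.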


Stage 4: Try Seasonal Differencing Alone

D1 = y.diff(12)
adf_report("D=1", D1)
# D=1              n= 84  ADF= -4.689  p=0.0001  var=  124680.0

a_D1 = acf(D1.dropna(), nlags=12, fft=True)
print(round(a_D1[12], 3))   # -0.417

Also passes, with a p-value comfortably under 0.05. Its variance of 124,680 is dramatically lower than regular differencing’s 1,567,486, and its lag-12 autocorrelation is -0.417, roughly half the magnitude of the 0.801 that regular differencing left behind. This is looking like the strongest candidate so far.


Stage 5: Try Combining Both

d1D1 = y.diff().diff(12)
adf_report("d=1,D=1", d1D1)
# d=1,D=1          n= 83  ADF= -4.542  p=0.0002  var=  257768.8

Also passes, but its variance, 257,768.8, is more than double that of seasonal differencing alone (124,680.0). Combining both differences didn’t buy anything the seasonal difference alone hadn’t already achieved, and it cost real variance in the process.
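
The doubling is what the algebra predicts for a series built like this one. The derivation below is a sketch that assumes the Stage 1 structure (a linear trend, an exactly periodic seasonal term, and independent noise with variance σ²) and ignores the integer rounding.

```latex
\begin{aligned}
y_t &= a + b\,t + s_t + \varepsilon_t, \qquad s_t = s_{t-12}, \qquad \operatorname{Var}(\varepsilon_t) = \sigma^2 \\[4pt]
\nabla_{12}\, y_t &= y_t - y_{t-12} = 12b + \varepsilon_t - \varepsilon_{t-12},
  \qquad \operatorname{Var}(\nabla_{12}\, y_t) = 2\sigma^2 \\[4pt]
\nabla \nabla_{12}\, y_t &= \varepsilon_t - \varepsilon_{t-1} - \varepsilon_{t-12} + \varepsilon_{t-13},
  \qquad \operatorname{Var}(\nabla \nabla_{12}\, y_t) = 4\sigma^2
\end{aligned}
```

The seasonal difference already removes both the trend (leaving the constant 12b) and the seasonal term, so the extra regular difference has only noise left to act on, and it doubles that noise variance. The reported ratio, 257,768.8 / 124,680.0 ≈ 2.07, is close to this factor of 2; sample variances will not match the theoretical values exactly.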

[Figure: comparison of the four candidates. raw: p = 0.7815, fails, variance about 11.8 million. d=1: p < 0.0001, passes, variance about 1.57 million, flagged for a leftover lag-12 autocorrelation of 0.801. D=1: p = 0.0001, passes, variance about 125 thousand, marked as the winner. d=1 and D=1 combined: p = 0.0002, passes, variance about 258 thousand, flagged for overdifferencing because its variance exceeds that of D=1.]
Four candidates, one clear winner: seasonal differencing alone passes the ADF test with the lowest variance of any option and a smaller lag-12 autocorrelation than regular differencing. The combined regular-plus-seasonal difference passes too, but at a real variance cost that buys nothing in return.

Stage 6: Decide

Lay all four candidates side by side:

Series      ADF p-value   Passes?   Variance        Lag-12 ACF
raw         0.7815        No        11,766,025.3    not computed
d=1         < 0.0001      Yes       1,567,486.3     0.801
D=1         0.0001        Yes       124,680.0       -0.417
d=1, D=1    0.0002        Yes       257,768.8       not computed

Three of the four candidates pass the ADF test, which means the test alone can’t make this decision, exactly the situation Lesson 4 warned about. Bringing in variance as a tiebreaker settles it decisively: seasonal differencing alone (D=1) achieves the lowest variance of any passing option, by a wide margin, and its lag-12 autocorrelation (-0.417) is about half the magnitude of the 0.801 that regular differencing leaves behind. Regular differencing alone leaves too much seasonal structure; combining both differences overcorrects. For Cyclepath, D=1 is the answer.

stationary = D1.dropna()
print(len(stationary), round(stationary.mean(), 1), round(stationary.var(), 1))
# 84 1073.3 124680.0
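
The mean of about 1,073 is not arbitrary. The trend adds 90 trips per month, so values twelve months apart differ by 12 × 90 = 1,080 on average, and the seasonal term cancels exactly. A quick check on a noise-free copy of the generator (an assumption made here to isolate the effect):

```python
import numpy as np
import pandas as pd

t = np.arange(96)
# Noise-free version of the Stage 1 series: trend plus seasonal term only.
clean = pd.Series(9000 + 90 * t + 3200 * np.sin(2 * np.pi * (t - 3) / 12))

seasonal_diff = clean.diff(12).dropna()
print(len(seasonal_diff))               # 84
print(round(seasonal_diff.mean(), 6))   # 1080.0
print(round(seasonal_diff.std(), 6))    # spread is floating-point error only
```

With noise added back, the differenced series scatters around that constant, which is why the stationary series has a non-zero mean even though its trend is gone.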

Stage 7: The Takeaway

Step back and look at what this project produced. You now hold the specific, evidence-backed transformation that carries forward into the rest of the course:

  1. A confirmed diagnosis: raw Cyclepath fails the ADF test (p = 0.7815), non-stationary exactly as Module 2’s decomposition predicted.
  2. A rejected shortcut: regular differencing alone passes ADF but leaves a lag-12 autocorrelation of 0.801; passing the test isn’t the same as being done.
  3. A winning transformation: seasonal differencing alone (y.diff(12)), with 84 points, mean 1,073.3, and variance 124,680.0. It is the lowest-variance option that passes ADF, and its lag-12 autocorrelation (-0.417) is about half the magnitude left by regular differencing.

That’s the whole point of Module 3: stationarity isn’t fixed by reflexively differencing until a test passes. It’s a decision made with evidence, comparing real alternatives on real numbers. Next up, Module 4 reads ACF and PACF plots, and it will read them from exactly this stationary series, D1, to choose the orders for the ARIMA and SARIMA models that Modules 5 and 6 build.

Where this hands off to Module 4

Module 4 doesn’t start over — it picks up the stationary series (D1, seasonally differenced) built here and asks a new question of it: which specific lags show significant autocorrelation, and what does that imply about how many AR and MA terms an ARIMA model needs? The ACF value already computed at lag 12 (-0.417) is a preview of exactly the kind of number Module 4 will read systematically, across every lag, to choose model orders instead of guessing them.


Practice Exercises

Exercise 1: Would the raw series ever be usable?

Is there any circumstance where you’d fit an ARIMA-family model directly to raw, non-stationary Cyclepath without any differencing?

Hint

Not directly to the undifferenced series with a plain ARMA model — but this is actually what the “I” in ARIMA is for: rather than differencing by hand and fitting to the result, you can hand the raw series to an ARIMA model with its differencing order (d) set appropriately, and the model differences internally before fitting. Either way, some form of differencing has to happen before the AR/MA structure is estimated; ARIMA’s d parameter (Module 5) is just a way of folding the differencing decision you made by hand in this project into the model specification itself.

Exercise 2: What if variance had favored the combined series?

Suppose, hypothetically, d=1, D=1 had come out with the lowest variance instead of D=1 alone. Would that change which series you’d carry forward?

Hint

Yes — the decision rule in this lesson wasn’t “always prefer seasonal differencing alone,” it was “prefer whichever ADF-passing option has the lowest variance and the least leftover structure.” If the evidence had favored the combined series instead, that would be the right one to carry forward, and calling D=1 alone the winner in that scenario would be applying a rule mechanically instead of following the actual data — precisely the mistake this lesson’s evidence-based comparison is designed to avoid.
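
The variance half of that rule is simple enough to write down. The sketch below applies it to the figures reported in this lesson and then to Exercise 2’s hypothetical; the helper name pick_candidate and the made-up variance of 100,000 are illustrative only. It does not cover leftover autocorrelation, which still has to be inspected separately.

```python
# Results reported in this lesson: (name, ADF p-value, variance).
REPORTED = [
    ("raw", 0.7815, 11766025.3),
    ("d=1", 0.0000, 1567486.3),
    ("D=1", 0.0001, 124680.0),
    ("d=1,D=1", 0.0002, 257768.8),
]

def pick_candidate(results, alpha=0.05):
    # Hypothetical helper: keep the candidates that pass ADF,
    # then return the name of the one with the lowest variance.
    passing = [r for r in results if r[1] < alpha]
    if not passing:
        return None
    return min(passing, key=lambda r: r[2])[0]

print(pick_candidate(REPORTED))  # D=1

# Exercise 2's hypothetical: give the combined series a made-up lower variance.
hypothetical = [
    r if r[0] != "d=1,D=1" else ("d=1,D=1", 0.0002, 100000.0)
    for r in REPORTED
]
print(pick_candidate(hypothetical))  # d=1,D=1
```

The rule never mentions a preferred transformation by name, so the answer changes as soon as the numbers do.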

Exercise 3: A series with no seasonality at all

If a series had a trend but no seasonality whatsoever, would this project’s four-way comparison still make sense to run?

Hint

Seasonal differencing (D=1) and the combined option wouldn’t be meaningful choices for a series with no seasonal period: there’d be no lag-12-style relationship to target, so y.diff(12) would subtract values twelve steps apart for no structural reason and discard twelve observations in the process. For a trend-only series, the comparison collapses to raw versus d=1 alone: test the raw series, difference once if it fails, and check the ADF test and autocorrelation at a few early lags rather than a specific seasonal one.


Summary

You ran the complete stationarity decision process on Cyclepath. Raw failed the ADF test outright (p = 0.7815). Regular differencing passed (p < 0.0001) but left a lag-12 autocorrelation of 0.801, strong evidence that it removed the trend without touching seasonality. Seasonal differencing alone also passed (p = 0.0001), with a variance of 124,680, the lowest of any candidate, and a lag-12 autocorrelation of -0.417, about half the magnitude. Combining both passed too (p = 0.0002) but with a variance of 257,768.8, more than double the seasonal-alone option: a clear case of overdifferencing. The evidence points to one winner: seasonal differencing alone, y.diff(12), which now becomes the stationary series the rest of the course builds on.

Key Concepts

  • Evidence-based differencing — when multiple options pass the ADF test, use variance and leftover autocorrelation to choose between them, not the test alone.
  • Regular vs. seasonal differencing — each targets a different kind of non-stationarity (trend vs. a fixed seasonal lag); apply the one that matches the structure actually present.
  • Overdifferencing is measurable — it shows up as increased variance without further improving the ADF result, exactly as combining both differences did here.
  • This module’s output feeds Module 4 — the stationary series built here (D1) is what ACF and PACF plots get read from next.

Why This Matters

“Difference until the ADF test passes” is a common shortcut, but this project showed three different transformations all clear that bar while differing by a factor of more than 12 in variance (1,567,486.3 versus 124,680.0) and by a wide margin in leftover structure. The test alone would have left you unable to choose. Building the habit of comparing real alternatives on real numbers, rather than stopping at the first passing p-value, is what makes a stationarity decision something you can defend rather than something you got lucky with. With a properly stationarized series in hand, Module 4 turns to ACF and PACF, the tools for reading exactly which lags matter, which is what turns “the series is stationary” into “here are the ARIMA orders to try.”


Next Steps

Continue to Module 4 - Autocorrelation: ACF and PACF

Read ACF and PACF plots to identify AR and MA structure and choose ARIMA orders.


Continue Building Your Skills

You now have a stationary Cyclepath series — seasonally differenced, ADF-confirmed, and chosen with real evidence over two other passing alternatives. Next, Module 4 reads ACF and PACF plots from exactly this series to answer the question stationarity alone can’t: how many autoregressive and moving-average terms does Cyclepath actually need?