Evaluation and Backtesting on DATATWEETS

https://datatweets.com/courses/time-series-forecasting/evaluation-and-backtesting/
Copyright (c) 2025 Datatweets
Sun, 05 Jul 2026 09:00:00 +0200

Lesson 1 - Why One Split Is Not Enough
https://datatweets.com/courses/time-series-forecasting/evaluation-and-backtesting/lesson-1-why-one-split-is-not-enough/
Sun, 05 Jul 2026 09:00:00 +0200

Since Module 1, every model in this course has been judged on a single twelve-month test period, 2023. Holding out 2021 or 2022 instead, with the exact same seasonal-naive baseline, gives a MAPE of 6.61% or 7.39% rather than 5.9%, a relative swing of up to about 25%. A single split reports one sample from a range of possible outcomes. This lesson measures that range directly, on the same series the course has used throughout, before building the tools to test more thoroughly.

Lesson 2 - Walk-Forward Validation
https://datatweets.com/courses/time-series-forecasting/evaluation-and-backtesting/lesson-2-walk-forward-validation/
Sun, 05 Jul 2026 09:00:00 +0200

Walk-forward validation refits a model repeatedly, at a sequence of origins moving forward through the series, instead of once. An expanding window keeps every training point from the start; a rolling window keeps only the most recent fixed number of months. Built by hand on Cyclepath with six origins and a six-month horizon, SARIMA’s expanding-window MAPE averages 2.14% with a standard deviation of 1.11 percentage points, ranging from 0.82% to 3.74% depending on the origin.
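The difference between the two window types is easiest to see as index ranges. The following is a minimal sketch, not the lesson's own code: the function name walk_forward_windows is hypothetical, and the series length of 96 months, the 60-month first training window, and the 60-month rolling width are illustrative assumptions chosen so that six origins with a six-month horizon fit exactly.

```python
def walk_forward_windows(n_obs, first_train_size, horizon, n_origins, rolling=False):
    """Yield (train_range, test_range) index ranges, one pair per forecast origin.

    Origins advance by `horizon` steps, so the test blocks do not overlap.
    With rolling=True the training window keeps a fixed width of
    `first_train_size`; otherwise it always starts at index 0 (expanding).
    """
    for k in range(n_origins):
        origin = first_train_size + k * horizon  # index of the first test point
        if origin + horizon > n_obs:
            raise ValueError("not enough observations for this origin")
        train_start = origin - first_train_size if rolling else 0
        yield range(train_start, origin), range(origin, origin + horizon)


# Assumed setup: 96 monthly observations, six origins, six-month horizon.
expanding = list(walk_forward_windows(96, 60, 6, 6))
rolling = list(walk_forward_windows(96, 60, 6, 6, rolling=True))

for (exp_train, test), (roll_train, _) in zip(expanding, rolling):
    print(
        f"test months {test.start}-{test.stop - 1}: "
        f"expanding train = {len(exp_train)} months, "
        f"rolling train = {len(roll_train)} months (starts at month {roll_train.start})"
    )
```

In a real backtest, the model would be refit on each training range and scored on the matching test range; only the training range differs between the two schemes.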
A rolling window of the same size scores slightly worse (mean 2.24%), evidence that Cyclepath’s stable, non-drifting structure rewards keeping all the historical data rather than discarding the oldest months.

Lesson 3 - TimeSeriesSplit and Formalizing the Loop
https://datatweets.com/courses/time-series-forecasting/evaluation-and-backtesting/lesson-3-timeseriessplit-and-formalizing-the-loop/
Sun, 05 Jul 2026 09:00:00 +0200

scikit-learn’s TimeSeriesSplit(n_splits=6, test_size=6) reproduces, fold for fold, the exact same six expanding-window origins built by hand in Lesson 2: train sizes from 60 to 90 months in steps of 6, each followed by a 6-month test block. Using it to backtest the seasonal-naive baseline confirms Lesson 1’s warning with a proper multi-origin summary: a mean MAPE of 6.64% with a standard deviation of 0.87 percentage points, a far more honest description than any single year’s 5.9%, 6.61%, or 7.39%.

Lesson 4 - Forecast Intervals and Their Honesty
https://datatweets.com/courses/time-series-forecasting/evaluation-and-backtesting/lesson-4-forecast-intervals-and-their-honesty/
Sun, 05 Jul 2026 09:00:00 +0200

A stated 95% forecast interval is a testable claim: across many forecasts, the actual value should fall inside it about 95% of the time. The observed fraction is the interval’s empirical coverage. Backtested across all 36 forecasts from Module 8’s six origins, SARIMA’s 95% interval achieves 100% coverage at an average width of about 2,145 trips. A non-seasonal ARIMA also achieves 100% coverage, but only by being about five times as wide (10,732 trips), which makes it uninformative rather than wrong. A plain random-walk-with-drift model achieves neither: its coverage is only 83.3%, missing the actual value on 6 of 36 forecasts, despite an interval nearly four times as wide as SARIMA’s.
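Empirical coverage and average width are both simple to compute once the backtest has collected actual values and interval bounds. The sketch below is an illustration only: interval_report is a hypothetical helper, and the numbers are made-up toy values, not Cyclepath data. The only thing taken from the text is the pattern of 6 misses out of 36 forecasts.

```python
def interval_report(actuals, lowers, uppers):
    """Return (empirical coverage, mean interval width) for paired forecasts."""
    if not actuals or not (len(actuals) == len(lowers) == len(uppers)):
        raise ValueError("inputs must be non-empty and of equal length")
    # A forecast counts as covered when the actual lies inside [lower, upper].
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actuals, lowers, uppers))
    widths = [hi - lo for lo, hi in zip(lowers, uppers)]
    return hits / len(actuals), sum(widths) / len(widths)


# Toy data (assumed, not from the course): 36 forecasts, 6 of which miss.
actuals = [100.0] * 36
lowers = [90.0] * 30 + [101.0] * 6   # the last six intervals sit above the actual
uppers = [110.0] * 30 + [121.0] * 6

coverage, mean_width = interval_report(actuals, lowers, uppers)
print(f"coverage = {coverage:.1%}, mean width = {mean_width:.1f}")
# prints: coverage = 83.3%, mean width = 20.0
```

Reporting the two numbers together matters because either one can be improved at the expense of the other: widening every interval raises coverage, and narrowing every interval lowers it.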
Good intervals need both correct coverage and narrow width, and only backtesting across many forecasts can check either one.

Lesson 5 - Guided Project: Backtesting Every Model From the Course
https://datatweets.com/courses/time-series-forecasting/evaluation-and-backtesting/lesson-5-guided-project-backtesting-every-model/
Sun, 05 Jul 2026 09:00:00 +0200

The Module 8 capstone backtests every forecasting model this course built across six origins with a six-month horizon. Seasonal-naive averages 6.64% MAPE (std 0.87), SARIMA averages 2.14% (std 1.11), and Holt-Winters averages 1.59% (std 0.53). The single-split test in Modules 6 and 7 had ranked SARIMA (1.06%) ahead of Holt-Winters (1.57%). Properly backtested across six different origins, Holt-Winters is the more accurate model on average, and its MAPE varies less than half as much from origin to origin, directly overturning the single-split conclusion this course reported.
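The six-origin, six-month protocol summarized above can be sketched with scikit-learn’s TimeSeriesSplit and the seasonal-naive baseline. This is a sketch under stated assumptions, not the capstone’s code: the series is synthetic (96 months of trend, yearly seasonality, and noise) rather than Cyclepath, the helpers seasonal_naive and mape are hypothetical names, and the sample standard deviation (ddof=1) is an assumption, since the source does not say which convention it uses. The resulting scores will therefore not match the figures quoted in the lessons.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in series (assumption): 96 monthly values.
rng = np.random.default_rng(0)
months = np.arange(96)
y = 20_000 + 50 * months + 3_000 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 300, 96)


def seasonal_naive(train, horizon, season=12):
    """Forecast each step with the value observed one season earlier."""
    last_season = train[-season:]
    return np.array([last_season[h % season] for h in range(horizon)])


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)


# Six expanding-window folds, each with a 6-month test block.
tscv = TimeSeriesSplit(n_splits=6, test_size=6)
train_sizes, scores = [], []
for train_idx, test_idx in tscv.split(y):
    forecast = seasonal_naive(y[train_idx], horizon=len(test_idx))
    train_sizes.append(len(train_idx))
    scores.append(mape(y[test_idx], forecast))

print("train sizes:", train_sizes)
print(f"MAPE mean = {np.mean(scores):.2f}%, std = {np.std(scores, ddof=1):.2f}")
```

Swapping seasonal_naive for any other fit-and-forecast function leaves the loop unchanged, which is what makes the per-model means and standard deviations directly comparable.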