Lesson 6 - Guided Project: Forecasting the S&P 500

Welcome to the Capstone Project

This is the project that ties the whole module together. Over the previous lessons you learned how recurrent networks process sequences, how to turn a raw time series into supervised windows, and why LSTMs and GRUs outperform a plain RNN on long sequences. Now you will assemble all of those pieces into one complete forecasting pipeline on real financial data: the monthly closing level of the S&P 500 index.

By the end of this lesson, you will be able to:

  • Load and inspect a real univariate time series and prepare it for forecasting
  • Build fixed-length input windows and split a time series without leaking the future into the past
  • Scale a target correctly using statistics learned from the training set only
  • Build, compile, and train an LSTM forecaster with Keras and TensorFlow
  • Evaluate a forecaster with RMSE and MAE and read a predicted-vs-actual chart critically
  • Explain honestly what a trend-following model can and cannot do, and why it is dangerous to trade on it

You should be comfortable with NumPy, pandas, and the windowing and LSTM ideas from the earlier lessons in this module. Let’s build something end to end.


The Project: Forecasting an Index

Imagine you have joined a small research team that studies market behavior. Someone hands you a single file containing the monthly closing level of the S&P 500 going back to 1950 and asks a deceptively simple question: given the last year of values, can a model predict where the index goes next?

This is a forecasting problem, and it is the perfect capstone because it forces you to use every skill from this module in sequence. You will:

  1. Load and explore the raw series so you understand its shape and scale.
  2. Window it into fixed-length input/output pairs, the way a supervised model needs.
  3. Split by time into a training set and a test set, keeping the future strictly out of the past.
  4. Scale the target so the network trains smoothly.
  5. Train an LSTM to map a 12-month window to the next month’s value.
  6. Evaluate it on the held-out test set and visualize the predictions.
  7. Interpret the result honestly, including its very real limitations.

The most important word above is honestly. It is easy to make a forecasting chart look impressive. It is much harder to say clearly what the model has actually learned. We will do both.

This is an educational project, not trading advice

We are building this model to practice the full sequence-modeling workflow on real data, nothing more. Nothing in this lesson is investment advice, and as you will see at the end, a model like this is not something you would trade on. Treat the financial framing as a vivid example, not a strategy.


Step 1: Load and Explore the Data

The dataset is a single CSV with one row per month: a date and the index’s closing level for that month. Start by loading it and looking at its size and range.

import numpy as np
import pandas as pd

# download: https://datatweets.com/datasets/sp500_monthly.csv
df = pd.read_csv("sp500_monthly.csv", parse_dates=["date"])
df = df.sort_values("date").reset_index(drop=True)

print("Rows:", len(df))
print("From:", df["date"].min().date(), "to:", df["date"].max().date())
print("Price range:", round(df["price"].min(), 1), "to", round(df["price"].max(), 1))
# Output:
# Rows: 917
# From: 1950-01-01 to: 2026-05-01
# Price range: 16.9 to 7412.6

There are 917 monthly observations spanning from January 1950 to May 2026. The index has grown from about 16.9 to over 7,400, a range of more than two orders of magnitude. That enormous range is the first thing to notice, and it tells you immediately that scaling will matter.

Sorting by date before anything else is not optional. Time series forecasting depends on order, and a model trained on shuffled rows would be meaningless. Always confirm the data is sorted ascending by time.
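
One way to make that confirmation explicit is a pair of checks right after loading. The sketch below uses a tiny stand-in frame with the same date and price column names as the lesson's CSV; the frame name demo and the deliberately shuffled rows are illustrative assumptions, not part of the lesson's pipeline.

```python
import pandas as pd

# Stand-in data: deliberately out of order to show what the check catches.
demo = pd.DataFrame({
    "date": pd.to_datetime(["1950-03-01", "1950-01-01", "1950-02-01"]),
    "price": [17.3, 16.9, 17.0],
})

was_sorted = demo["date"].is_monotonic_increasing   # False for the raw frame
demo = demo.sort_values("date").reset_index(drop=True)

# After sorting, dates must be ascending and each month must appear once.
is_sorted = demo["date"].is_monotonic_increasing
has_duplicates = demo["date"].duplicated().any()

print(was_sorted, is_sorted, has_duplicates)   # False True False
```

On the real file you would run the last two checks on df and stop if either fails.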

A quick look at the series makes its character obvious.

print(df.head(3))
print(df.tail(3))
# Output (illustrative formatting):
#         date   price
# 0 1950-01-01   16.9
# 1 1950-02-01   17.0
# 2 1950-03-01   17.3
#           date    price
# 914 2026-03-01  6740.3
# 915 2026-04-01  6981.1
# 916 2026-05-01  7412.6

The series rises over the long run, but it does not rise smoothly. It has long bull markets, sharp crashes, and flat stretches. That mixture, a strong overall trend layered on top of unpredictable short-term swings, is exactly what makes this both a good teaching example and a hard real problem.
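
To see how differently the same series behaves at its two ends, it can help to compare month-over-month changes in points and in percent. A minimal sketch using only the six values printed in the head and tail above; the helper name monthly_changes is an assumption made for illustration.

```python
import numpy as np

# First three and last three monthly closes from the printed head/tail.
early = np.array([16.9, 17.0, 17.3])
late = np.array([6740.3, 6981.1, 7412.6])

def monthly_changes(levels):
    """Return (point change, percent change) between consecutive months."""
    points = np.diff(levels)
    percent = 100.0 * points / levels[:-1]
    return points, percent

early_pts, early_pct = monthly_changes(early)
late_pts, late_pct = monthly_changes(late)

print("1950 changes (points): ", np.round(early_pts, 1))
print("2026 changes (points): ", np.round(late_pts, 1))
print("1950 changes (percent):", np.round(early_pct, 2))
print("2026 changes (percent):", np.round(late_pct, 2))
```

In points, the recent moves are more than a thousand times larger than the early ones; in percent, the gap is far smaller. That mismatch is part of why raw index points are an awkward scale for a network.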

Why a single column?

Real market data often comes with open, high, low, close, and volume columns. To keep the focus on the sequence-modeling workflow you learned in this module, we forecast a single value per month, the closing level. A univariate series is the cleanest way to practice windowing, splitting, scaling, and training without extra bookkeeping.


Step 2: Build Windows

A neural network cannot consume an open-ended time series directly. As you saw earlier in the module, you convert the series into a stack of fixed-length windows: each input is a slice of the recent past, and the matching target is the value that came next.

Here you use a window of 12 months. The model sees one full year of history and predicts the following month’s level. The helper below slides a window of length window across the series and pairs each window with the next value.

def make_windows(series, window=12):
    """Turn a 1-D series into (X, y) supervised windows."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])   # 12 consecutive months
        y.append(series[i + window])       # the month right after
    return np.array(X), np.array(y)

Each input X[i] is a vector of 12 consecutive monthly values, and each target y[i] is the single value immediately after that window. Slide forward one month and repeat. This is the same windowing idea from the previous lessons, applied to real data.
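
A toy run makes the pairing concrete. The sketch below repeats the helper so it runs on its own and applies it to a six-value stand-in series with a window of 3 instead of 12.

```python
import numpy as np

def make_windows(series, window=12):
    """Same helper as above, repeated so this snippet runs on its own."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

toy = np.arange(6)              # stand-in series: 0, 1, 2, 3, 4, 5
X_toy, y_toy = make_windows(toy, window=3)

print(X_toy)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]
print(y_toy)
# [3 4 5]
```

A series of length n always yields n - window pairs, which is why six values and a window of 3 give three rows.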

A recurrent layer in Keras expects each sample to have shape (timesteps, features). Here you have 12 timesteps and 1 feature per step, so every window must be reshaped to (12, 1). You will do that reshape right after the split, once the data is scaled.
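
The reshape itself only adds a trailing feature axis. A small sketch on placeholder values (the array name X_flat and the choice of five windows are assumptions for illustration):

```python
import numpy as np

window = 12
# Stand-in for a stack of 5 windows; the values are placeholders.
X_flat = np.arange(5 * window, dtype="float32").reshape(5, window)

# Add a trailing feature axis: (samples, timesteps) -> (samples, timesteps, 1)
X_3d = X_flat.reshape(-1, window, 1)

print(X_flat.shape, "->", X_3d.shape)          # (5, 12) -> (5, 12, 1)
# The values are untouched; only the layout changes.
print(np.array_equal(X_3d[:, :, 0], X_flat))   # True
```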


Step 3: Split by Time

This is the single most important step in any forecasting project, and the easiest to get wrong.

In ordinary machine learning you split data randomly. You must not do that here. Randomly shuffling the months would put future observations into the training set and past observations into the test set, letting the model “peek” at the future. Its scores would look spectacular and mean nothing. This is a classic form of data leakage.

Instead you split chronologically: an early stretch becomes training, and the most recent stretch becomes the test set. The model only ever learns from the past and is only ever judged on a future it has not seen.

window = 12
prices = df["price"].values.astype("float32")

# Chronological split: train on the earlier portion, test on the most recent.
split = int(len(prices) * 0.80)
train_raw = prices[:split]
test_raw  = prices[split - window:]   # overlap by `window` so test windows are complete

Notice the small overlap of window months at the boundary. The first test window needs the 12 months that precede the first test target, and some of those months live at the end of the training period. Including them in test_raw lets you build complete test windows without ever using a future value to predict an earlier one. The targets themselves never overlap between the two sets.
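
The boundary arithmetic can be checked without any price data. The sketch below swaps the prices for their own positions (an illustrative trick, not part of the lesson's pipeline), assuming the 917-row series and the 80% split used above.

```python
import numpy as np

n, window = 917, 12
positions = np.arange(n)            # stand-in for prices: each value is its own index

split = int(n * 0.80)               # 733
train_idx = positions[:split]
test_idx = positions[split - window:]

# Target positions produced by sliding a 12-month window over each piece.
train_targets = train_idx[window:]
test_targets = test_idx[window:]

print("split index:   ", split)
print("train targets: ", train_targets[0], "to", train_targets[-1], f"({len(train_targets)})")
print("test targets:  ", test_targets[0], "to", test_targets[-1], f"({len(test_targets)})")
print("shared targets:", len(np.intersect1d(train_targets, test_targets)))
```

The training targets stop at position 732, the test targets start at 733, and no target appears in both sets, even though the two raw slices share twelve input months.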

The cardinal rule of time series

Never let the future leak into the past. Split by time, fit your scaler on the training portion only, and build test windows from data that genuinely precedes each test target. A random split on time series data will give you a beautiful, completely dishonest result.
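
To see how badly a random split breaks this rule, you can count how many test targets fall earlier in time than the latest training target. The sketch below works on target positions only; the seed, the helper name, and the simplification of splitting windows rather than raw months are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen arbitrarily for repeatability
n_windows = 905                       # 917 months minus a 12-month window
target_month = np.arange(n_windows)   # position in time of each window's target
cut = int(n_windows * 0.80)

# Random split: shuffle first, then cut.
shuffled = rng.permutation(target_month)
rand_train, rand_test = shuffled[:cut], shuffled[cut:]

# Chronological split: cut without shuffling.
chron_train, chron_test = target_month[:cut], target_month[cut:]

def share_of_test_before_train_end(train, test):
    """Fraction of test targets that fall earlier than the latest training target."""
    return float(np.mean(test < train.max()))

rand_share = share_of_test_before_train_end(rand_train, rand_test)
chron_share = share_of_test_before_train_end(chron_train, chron_test)

print("random split:       ", round(rand_share, 2))
print("chronological split:", chron_share)
```

With the chronological split the share is exactly zero. With the random split almost every test month is surrounded by training months on both sides, so the model is effectively interpolating rather than forecasting.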


Step 4: Scale the Target

Recall that the index ranges from about 17 to over 7,400. Networks train far better when their inputs and targets live in a small, consistent range, so you rescale the values with a MinMaxScaler, which maps the range it was fitted on to [0, 1].

The non-negotiable detail: fit the scaler on the training data only, then apply that same transform to the test data. If you fit on the full series, the maximum value (which sits at the very end of the data, in the test period) leaks information about the future into the scaling of the training set.
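
The effect is easy to see on a toy series. The sketch below reimplements min-max scaling in plain NumPy, using the formula (x - min) / (max - min) that MinMaxScaler applies with its default settings; the eight values and the helper names fit_minmax and apply_minmax are made up for illustration.

```python
import numpy as np

series = np.array([10., 12., 15., 20., 30., 45., 70., 100.])  # made-up rising series
train, test = series[:6], series[6:]

def fit_minmax(values):
    """Learn the min and max that define the scaling."""
    return values.min(), values.max()

def apply_minmax(values, lo, hi):
    """Scale with (x - lo) / (hi - lo), as min-max scaling does."""
    return (values - lo) / (hi - lo)

# Correct: statistics come from the training portion only.
lo, hi = fit_minmax(train)
test_scaled_demo = apply_minmax(test, lo, hi)
print("train-only fit, test scaled:", np.round(test_scaled_demo, 3))

# Leaky: statistics come from the full series, including the test period.
lo_all, hi_all = fit_minmax(series)
leaky_train_max = apply_minmax(train, lo_all, hi_all).max()
print("full-series fit, train max: ", round(float(leaky_train_max), 3))
```

With the train-only fit, the test values land above 1, which is expected: the model is being asked about levels higher than anything it trained on. With the full-series fit, the training data is squeezed below about 0.39, so the scaling itself reveals that much higher values are coming.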

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_raw.reshape(-1, 1)).flatten()
test_scaled  = scaler.transform(test_raw.reshape(-1, 1)).flatten()

# Build windows on the scaled series
X_train, y_train = make_windows(train_scaled, window)
X_test,  y_test  = make_windows(test_scaled,  window)

# Reshape to (samples, timesteps, features) for the LSTM
X_train = X_train.reshape(-1, window, 1)
X_test  = X_test.reshape(-1, window, 1)

print("Train windows:", X_train.shape)
print("Test windows: ", X_test.shape)
# Output:
# Train windows: (721, 12, 1)
# Test windows:  (184, 12, 1)

You now have 721 training windows and 184 test windows, each of shape (12, 1): twelve timesteps, one feature per step. This is exactly the shape an LSTM layer expects. The split gives the model decades of history to learn from while reserving the most recent years as a genuine, unseen test.

Scale the target, predict in scaled space

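Because you scaled the values, the model’s predictions also come out in scaled units rather than index points. Only the training values are guaranteed to fall in [0, 1]: the scaler was fit on the training portion, so any test value above the training maximum maps to a number greater than 1. To report errors in real index points, you invert the scaling with scaler.inverse_transform before computing RMSE and MAE. We do exactly that in the evaluation step.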
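
The inversion is just the scaling formula run backwards. A NumPy-only sketch, assuming a hypothetical training range (the numbers train_min and train_max below are placeholders, not values from the lesson's data):

```python
import numpy as np

train_min, train_max = 16.9, 1300.0   # hypothetical training range, for illustration only
span = train_max - train_min

def to_scaled(points):
    """Min-max scaling with the training statistics."""
    return (points - train_min) / span

def to_points(scaled):
    """Inverse of min-max scaling: undo the divide, then undo the shift."""
    return scaled * span + train_min

actual_points = np.array([2000.0, 4000.0])   # made-up test levels
pred_points = np.array([1900.0, 4300.0])     # made-up predictions

# Round trip: scaling and then inverting recovers the original values.
print(np.allclose(to_points(to_scaled(actual_points)), actual_points))   # True

# An absolute error in scaled units converts to points by multiplying by the span.
scaled_mae = np.mean(np.abs(to_scaled(actual_points) - to_scaled(pred_points)))
points_mae = np.mean(np.abs(actual_points - pred_points))
print(np.isclose(scaled_mae * span, points_mae))                         # True
```

This is why a scaled MAE that looks tiny can still correspond to hundreds of index points.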


Step 5: Build and Train the LSTM

Now the part everything has been building toward. You will use a compact LSTM: one recurrent layer to read the 12-month sequence, followed by a single dense output that produces next month’s value.

An LSTM is the right tool here because, as you learned earlier in the module, its gated memory cell carries information across many timesteps and is far less prone to the vanishing gradients that cripple a plain RNN. A year of monthly history is well within its comfort zone.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

tf.random.set_seed(42)

model = keras.Sequential([
    layers.Input(shape=(window, 1)),   # 12 timesteps, 1 feature
    layers.LSTM(64),                   # gated recurrent layer
    layers.Dense(1),                   # single regression output
])

model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()
# Output (abridged; the parameter totals print below the table):
# Model: "sequential"
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #
# =================================================================
#  lstm (LSTM)                 (None, 64)                16896
#  dense (Dense)               (None, 1)                 65
# =================================================================

A few choices worth understanding:

  • Loss is mean squared error (MSE). Forecasting a continuous value is a regression task, and MSE penalizes large misses heavily, which is what you want.
  • Optimizer is Adam. It adapts the learning rate per parameter and is a reliable default for recurrent networks.
  • metrics=["mae"] reports mean absolute error alongside the loss, which is easier to interpret because it is in the same units as the (scaled) target.
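
The parameter counts printed in the summary above can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gate-like transforms (input, forget, output, and the candidate cell update), each with input weights, recurrent weights, and a bias; the two helper names are made up for illustration.

```python
def lstm_param_count(units, features):
    """Four transforms, each with input weights, recurrent weights, and biases."""
    per_transform = units * features + units * units + units
    return 4 * per_transform

def dense_param_count(inputs, outputs):
    """One weight per input-output pair plus one bias per output."""
    return inputs * outputs + outputs

print(lstm_param_count(units=64, features=1))   # 16896
print(dense_param_count(inputs=64, outputs=1))  # 65
```

Almost all of the LSTM's parameters come from the 64 x 64 recurrent weights, which is why the count barely depends on the window length.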

Now train. You run for 60 epochs, which gives the LSTM enough passes over the data to settle into a stable fit.

history = model.fit(
    X_train, y_train,
    epochs=60,
    batch_size=32,
    verbose=0,          # set to 1 to watch the loss curve live
)

print("Final training loss (MSE, scaled):", round(history.history["loss"][-1], 5))
# Output: a small positive value close to zero (scaled MSE)

Because the loss is measured in scaled space, its absolute number is not very meaningful on its own. What matters is the error in real index points, which you compute next on data the model has never seen.


Step 6: Evaluate on the Test Set

Training loss tells you almost nothing about forecasting skill. The honest question is: how well does the model predict months it never saw during training? You answer that on the held-out test set, after inverting the scaling so the errors are in real index points.

You will report two standard regression metrics:

  • RMSE (root mean squared error): the square root of the average squared error. It is in the same units as the target and punishes large misses.
  • MAE (mean absolute error): the average absolute miss. It is easier to read as a typical error size.

For predictions $\hat{y}_i$ and actual values $y_i$ over $n$ test points, they are defined as:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Predict in scaled space, then invert back to index points
pred_scaled = model.predict(X_test, verbose=0)
pred = scaler.inverse_transform(pred_scaled).flatten()
actual = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()

rmse = np.sqrt(mean_squared_error(actual, pred))
mae  = mean_absolute_error(actual, pred)

print(f"Test RMSE: {rmse:.1f}")
print(f"Test MAE:  {mae:.1f}")
# Output:
# Test RMSE: 507.0
# Test MAE:  353.1
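
Both metrics can also be computed straight from their definitions, which makes the difference between them easy to see. A NumPy sketch on made-up values (the function names rmse_from_definition and mae_from_definition are illustrative; the scikit-learn calls above give the same numbers):

```python
import numpy as np

def rmse_from_definition(y_true, y_pred):
    """Square the errors, average them, then take the square root."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae_from_definition(y_true, y_pred):
    """Average the absolute errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([100.0, 100.0, 100.0, 100.0])        # made-up values
even_misses = np.array([110.0, 90.0, 110.0, 90.0])     # every miss is 10
one_big_miss = np.array([100.0, 100.0, 100.0, 140.0])  # a single miss of 40

print(mae_from_definition(y_true, even_misses), rmse_from_definition(y_true, even_misses))    # 10.0 10.0
print(mae_from_definition(y_true, one_big_miss), rmse_from_definition(y_true, one_big_miss))  # 10.0 20.0
```

Both forecasts have the same MAE, but the RMSE doubles when the error is concentrated in one month. A test RMSE (507.0) well above the test MAE (353.1) is therefore a hint that some months are missed by much more than the typical amount.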

The LSTM achieves a test RMSE of 507.0 and a test MAE of 353.1 index points. To put that in context, the index on the test set sits in the thousands, so a typical absolute miss of around 350 points is about 5% of the most recent level (353.1 / 7,412.6) and a larger share of the lower levels earlier in the test period. That is a respectable result for a single-layer model fed only the last 12 monthly values, and it comfortably beats the plain RNN baselines you saw earlier in the module.

But the headline numbers only tell part of the story. You need to see the predictions to understand what the model is really doing.


Step 7: Visualize Predicted vs. Actual

A scatter of predicted level against actual level is one of the most revealing plots in all of forecasting. Each point is one test month: its horizontal position is what actually happened, and its vertical position is what the model predicted. If the model were perfect, every point would land exactly on the diagonal line $\hat{y} = y$.

import matplotlib.pyplot as plt

lo = min(actual.min(), pred.min())
hi = max(actual.max(), pred.max())

plt.figure(figsize=(7, 7))
plt.scatter(actual, pred, s=18, alpha=0.6)
plt.plot([lo, hi], [lo, hi], "r--", label="perfect prediction")
plt.xlabel("Actual index level")
plt.ylabel("Predicted index level")
plt.title("LSTM: Predicted vs. Actual (Test Set)")
plt.legend()
plt.tight_layout()
plt.show()

Figure: Predicted versus actual index level on the held-out test set; points hug the diagonal, showing the model tracks the broad level well.

The points cluster tightly around the diagonal, which confirms what the metrics suggested: the model captures the overall level of the index well. When the true value is high, the prediction is high; when it is low, the prediction is low. The LSTM has clearly learned the dominant feature of the series, its long-run trend.

Look closer, though, and you can see the scatter widen at the high end, where the index moves fastest and most erratically. Those are the months where short-term swings are largest, and they are exactly the months the model struggles with. That observation leads to the most important discussion in this whole lesson.
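
One way to check the widening impression numerically is to bucket the test months by actual level and compare the mean absolute error per bucket. The sketch below runs on synthetic stand-in arrays; with the real arrays from Step 6 you would pass actual and pred instead. The helper name mae_by_level and the choice of three buckets are assumptions.

```python
import numpy as np

def mae_by_level(actual, pred, n_buckets=3):
    """Mean absolute error within equal-count buckets ordered by actual level."""
    order = np.argsort(actual)
    abs_err = np.abs(actual - pred)[order]
    return [float(chunk.mean()) for chunk in np.array_split(abs_err, n_buckets)]

# Synthetic stand-in: noise that grows in proportion to the level.
rng = np.random.default_rng(0)
actual_demo = np.linspace(1000.0, 7000.0, 180)
pred_demo = actual_demo * (1.0 + rng.normal(0.0, 0.05, size=actual_demo.size))

low, mid, high = mae_by_level(actual_demo, pred_demo)
print(f"low: {low:.0f}  mid: {mid:.0f}  high: {high:.0f}")
```

If the high bucket's error is several times the low bucket's on the real data, the scatter really does fan out at the top rather than just looking that way.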


Reading the Result Honestly

It would be easy to stop here, point at the tight diagonal, and declare victory. That would be a mistake, and learning to resist it is part of becoming a real practitioner.

Here is what the model has genuinely learned and what it has not.

What it does well. The S&P 500 has trended upward over decades. A model that takes the last 12 months as input can learn a simple, powerful pattern: next month is usually close to this month. Because the index moves gradually most of the time, “predict something near the recent level, nudged in the recent direction” is a strong strategy for tracking the broad trend. That is why the scatter hugs the diagonal and the errors look modest relative to the index level.

What it cannot do. The model cannot predict the short-term moves that actually matter for trading. An RMSE of 507 points is small next to an index in the thousands, but it is enormous next to a single month’s typical change. The market’s month-to-month wiggles, the very thing a trader would need to call correctly, are dominated by news, sentiment, and shocks that simply are not present in twelve past values. The model effectively follows the trend with a lag; it does not anticipate turns.

Why you should never trade on this

A trend-following forecaster looks great on a chart and is nearly useless for trading. Because next month is usually close to this month, the model “wins” most of the time simply by predicting more of the same, but it has no edge on the unpredictable swings where money is actually made or lost. Worse, it knows only the regimes present in its training data: a novel crash or a structural shift would blindside it completely. Backtests on past data routinely flatter such models. Treat this project as a lesson in the limits of forecasting, not as a strategy.

This honest reckoning is the real payoff of the project. You assembled a complete, technically correct pipeline, you produced a model with sensible error metrics, and you can still explain precisely why it would be reckless to act on its forecasts. Holding both of those truths at once is what separates a careful practitioner from someone who is fooled by a pretty chart.


Practice Exercises

These extend the project. Reuse the variables you built above, and experiment.

Exercise 1: Change the Window Size

The lesson used a 12-month window. Rebuild the data with a 24-month window, retrain the same LSTM for 60 epochs, and compare the test RMSE and MAE. Does giving the model two years of history instead of one help?

# Your code here: set window = 24, rebuild windows and shapes,
# retrain the model, and recompute RMSE/MAE on the test set.

Hint

Set window = 24, then redo Steps 3 through 6 exactly: re-split with the new overlap (test_raw = prices[split - window:]), re-fit the scaler on the new training portion, rebuild and reshape X_train/X_test to (-1, window, 1), and retrain. Watch how the window length flows through every shape downstream.

Exercise 2: Add a Baseline to Beat

A forecaster is only impressive if it beats the obvious guess. Build a naive baseline that predicts next month equals this month (the last value in each window), and compute its RMSE and MAE on the same test set. Compare it to the LSTM’s 507.0 / 353.1.

# Your code here: take the last value of each test window as the prediction,
# invert the scaling, and compute RMSE and MAE against `actual`.

Hint

The naive prediction for window i is its last timestep: X_test[:, -1, 0]. Invert it with scaler.inverse_transform(...), then reuse mean_squared_error and mean_absolute_error against actual. If the LSTM barely beats this baseline, that tells you a lot about how much it has really learned versus simply echoing the last value.

Exercise 3: Train Longer and Watch for Diminishing Returns

Retrain the original 12-month LSTM for 120 epochs instead of 60, and compare the test RMSE and MAE. Does doubling the training time meaningfully improve the held-out error, or are you just spending compute?

# Your code here: rebuild the model, fit with epochs=120, then
# recompute RMSE and MAE on the test set and compare to 507.0 / 353.1.

Hint

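Re-instantiate the model (so you start from fresh weights), set epochs=120 in model.fit, and recompute the metrics exactly as in Step 6. More epochs help only up to a point; beyond it the held-out error flattens or worsens as the model starts memorizing training quirks. Always judge improvement on held-out data, never on the training loss. In a real project you would tune the epoch count on a validation slice taken from the end of the training period, so the test set stays untouched for the final score.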


Summary

You built a complete time-series forecasting pipeline from raw data to an evaluated, critiqued model. Let’s review what you put together.

Key Concepts

The End-to-End Workflow

  • Forecasting follows a fixed sequence: load and explore, window, split by time, scale, train, evaluate, interpret
  • Each step from this module slotted into one larger pipeline; the capstone is integration, not new theory

Preparing a Time Series

  • Sort by time first, then build fixed-length windows that pair recent history with the next value
  • A window of 12 months produced inputs of shape (12, 1), exactly what an LSTM expects
  • Split chronologically, never randomly, and overlap the boundary by one window so test windows are complete

Avoiding Leakage

  • Fit the scaler on the training portion only, then transform the test set with that same scaler
  • Build every test window from data that genuinely precedes its target
  • A random split or a full-series scaler leaks the future and produces dishonest scores

Building and Evaluating the LSTM

  • A single LSTM(64) layer plus a Dense(1) output forecasts the next monthly value
  • Compile with MSE loss and Adam; train 60 epochs; predict in scaled space and invert before scoring
  • The model reached a test RMSE of 507.0 and MAE of 353.1 index points
  • A predicted-vs-actual scatter that hugs the diagonal confirms the model tracks the broad level

Interpreting Honestly

  • The model learned the long-run trend, essentially “next month is near this month”
  • It cannot predict the short-term swings that matter for trading, and it would fail on unseen regimes
  • A good-looking chart and sensible metrics do not make a model safe to act on

Why This Matters

The skills in this project generalize far beyond stock indices. Demand forecasting, energy load prediction, sensor monitoring, and capacity planning all follow the same pipeline: window the series, split by time without leaking, scale carefully, train a recurrent model, and evaluate on a genuine future. Master this workflow once and you can adapt it to almost any sequential forecasting problem you meet.

Just as important is the habit of critical interpretation. The most valuable thing you can take from this lesson is not the LSTM architecture; it is the discipline to build a technically sound model and still ask hard questions about what it really learned. That skepticism is what protects you, and anyone relying on your work, from being fooled by a confident-looking forecast.


Next Steps

You have completed the sequence models module by shipping a full forecasting project end to end. Recurrent networks are one of the two great families of sequence models; next you turn to language, where the same sequential thinking powers a different set of tools.

Continue to the Next Module - Natural Language Processing

Apply sequential modeling to text: tokenization, embeddings, and language models.


Keep Building Your Skills

You did something genuinely hard in this lesson: you assembled every concept from the module into one working system, produced real results on real data, and then had the discipline to explain exactly why those results should be treated with caution. That combination of technical competence and honest judgment is what professional machine learning work actually looks like. Carry both forward, because every model you build from here will need both the skill to make it work and the wisdom to know what it cannot do.