Lesson 5 - Time-Series Forecasting with RNNs

Welcome to Time-Series Forecasting

This lesson takes everything you know about recurrent networks and points it at one of the most natural problems for them: forecasting a value through time. You will work with the real S&P 500 monthly price series, turn it into a supervised learning problem, train an LSTM to predict the next month from the previous twelve, and measure how well it actually does on years it has never seen.

By the end of this lesson, you will be able to:

  • Turn a single time series into supervised (input window -> next value) training pairs
  • Explain why time-series data demands a temporal split, never a random one
  • Scale a series correctly by fitting the scaler on the training portion only
  • Build, train, and run a one-step LSTM forecaster in Keras
  • Evaluate a forecast with RMSE and MAE, and reason about why financial forecasting is hard

You should be comfortable with NumPy, pandas, and the basics of building an LSTM in Keras from the earlier lessons in this module. Let’s begin.


Why RNNs Fit Time Series

A time series is a sequence of measurements recorded in order through time: a stock index sampled monthly, electricity demand sampled hourly, a patient’s heart rate sampled per second. The defining property is that order matters. Shuffle the rows of a customer table and nothing breaks, but shuffle a time series and you destroy the very thing you want to learn from, which is how the past flows into the future.

Recurrent networks are built for exactly this. An RNN reads a sequence one step at a time, carrying a hidden state that summarizes everything it has seen so far. That hidden state is a compressed memory of the recent past, and the network uses it to predict what comes next. An LSTM (long short-term memory) is a more capable recurrent cell that adds gates to control what gets remembered and what gets forgotten, which lets it hold onto useful signal over longer stretches without the gradient problems a plain RNN suffers from.

Time-series forecasting is the task of modeling an existing sequence so you can predict its future values. In this lesson your sequence is the monthly closing level of the S&P 500, and your goal is to forecast next month’s level from a window of recent months. That kind of forecast is useful far beyond finance: the same workflow predicts demand, traffic, energy load, and cash flow.

One series, many lessons

Forecasting looks different from the classification you may have done earlier. There is no separate label column. The thing you predict, the price, is also the thing you feed in, just shifted in time. That single idea shapes every preparation step in this lesson.
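The "same series, shifted in time" idea can be made concrete with a toy array. This is only an illustrative sketch: the values are made up and are not taken from the S&P 500 file.

```python
import numpy as np

# Toy series standing in for monthly prices (values are made up).
series = np.array([10.0, 11.0, 13.0, 12.0, 15.0])

inputs = series[:-1]    # value at time t
targets = series[1:]    # value at time t + 1: the same series, shifted by one step

for x, y in zip(inputs, targets):
    print(f"input {x:5.1f} -> target {y:5.1f}")
```

Every target also appears as an input one row later, which is why there is no separate label column to load.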


The Dataset

You will use a long monthly history of the S&P 500 index. Each row is one month, and the column you care about records the index level at that point in time. Load it with pandas and parse the date column as the index, which makes everything that follows easier to plot and reason about.

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers

# download: https://datatweets.com/datasets/sp500_monthly.csv
df = pd.read_csv("sp500_monthly.csv", parse_dates=["date"], index_col="date")

print("Rows:", len(df))
print("From", df.index.min().date(), "to", df.index.max().date())
print("Price range:", round(df["price"].min(), 1), "to", round(df["price"].max(), 1))
# Output:
# Rows: 917
# From 1950-01-01 to 2026-05-01
# Price range: 16.9 to 7412.6

There are 917 monthly observations spanning from January 1950 to May 2026. The index starts near 17 and climbs past 7,400, an enormous range that immediately tells you two things. First, this series has a strong upward trend, so the future genuinely lives in a different numeric region than the past. Second, that huge spread means scaling will matter a great deal before you hand the numbers to a neural network.

A picture is the right first step with any series. Plotting the level through time shows the long climb, the major drawdowns, and the overall shape your model will be asked to extend.

Figure: LSTM one-step forecast of the S&P 500 plotted against the actual index, with the train/test split marked. An LSTM trained on the early decades forecasts the held-out years one month at a time, tracking the actual index closely after the train/test boundary.

That chart is where you are headed: the blue line is the actual index, and the model’s forecast rides on top of it across the held-out years to the right of the split. To get there, you first have to reshape this single column into something an RNN can learn from.


Windowing: From a Series to Supervised Pairs

An RNN does not consume a raw column of numbers. It expects examples, where each example is a short sequence (a window) of recent values and a target that comes right after it. Building those examples from a flat series is called windowing, and it is the step that turns forecasting into ordinary supervised learning.

The idea is simple. Pick a window_size, say 12 months. Slide that window along the series. Each position gives you one training example: the 12 values inside the window are the input, and the very next value is the target you want to predict.

series:   [ p0  p1  p2  p3  p4  p5  ... ]   (window_size = 3)

example 1   input = [p0 p1 p2]   target = p3
example 2   input = [p1 p2 p3]   target = p4
example 3   input = [p2 p3 p4]   target = p5

Here is a small helper that does exactly this. It walks the series and, at each step, collects the window of recent values and the next value as the target.

def make_windows(series, window_size):
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i : i + window_size])   # the window of recent values
        y.append(series[i + window_size])       # the next value (the target)
    return np.array(X), np.array(y)

Notice what make_windows is really doing: it converts one long sequence into many overlapping (input -> next value) pairs. The X values and the y values come from the same series; the only difference is that each y sits one step in the future relative to its window. This is the heart of one-step forecasting, predicting a single time step ahead.
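To check the helper against the diagram above, here is a small sketch that runs it on a six-value toy series with a window of 3. The toy values 0 to 5 stand in for p0 to p5. The vectorized variant at the end is an optional alternative, assuming NumPy 1.20 or newer for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, window_size):
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i : i + window_size])   # the window of recent values
        y.append(series[i + window_size])       # the next value (the target)
    return np.array(X), np.array(y)

toy = np.arange(6, dtype="float32")             # stands in for p0..p5
X, y = make_windows(toy, window_size=3)
print(X.tolist())   # [[0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(y.tolist())   # [3.0, 4.0, 5.0]

# Optional vectorized equivalent: build every window at once, then drop the
# last one because it has no following value to serve as its target.
X_fast = sliding_window_view(toy, 3)[:-1]
y_fast = toy[3:]
```

Six values with a window of 3 yield three pairs, matching the three examples in the diagram.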

You will use a window of 12 months so each prediction is based on a full year of recent history.

window_size = 12

Window size is a real hyperparameter

The window size is unlike most hyperparameters because it is baked into the data itself, not the model. Too short, and the network has almost no context to work with. Too long, and you create fewer examples and drown recent signal in stale history. Twelve months is a sensible starting point for a monthly series; treat it as a knob worth tuning.
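The "fewer examples" side of that trade-off is simple arithmetic: a series of n values yields n - window_size pairs. The sketch below uses a hypothetical 120-month series purely for illustration.

```python
n_months = 120   # hypothetical series length (ten years of monthly data)

counts = {}
for window_size in (6, 12, 24):
    counts[window_size] = n_months - window_size   # one pair per window position
    print(f"window_size={window_size:2d} -> {counts[window_size]} examples")
```

The loss in examples is small on a long series, so in practice the bigger effect of a longer window is usually how much older history each prediction has to sift through.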


The Temporal Split: The Rule You Cannot Break

In ordinary machine learning you split data randomly so the training and test sets are fair samples of the same population. With a time series you must never do that. A random split would scatter future months into your training set, letting the model peek at the future it is supposed to predict. That is data leakage in its purest form, and it produces forecasts that look brilliant in your notebook and fail completely in the real world.

The correct approach is a temporal split: train on the earlier portion of the series and test on the later portion, with the boundary fixed in time. The model learns from the past and is judged on a future it genuinely has not seen.

|<---------- TRAIN (earlier years) ---------->|<--- TEST (later years) --->|
1950 ............................................ 2011 .................. 2026
                                              split point

You will hold back roughly the last 20 percent of months for testing. Split the raw series before doing anything else, because every later step, scaling included, must learn only from the training portion.

prices = df["price"].values.astype("float32")

split = int(len(prices) * 0.80)
train_raw = prices[:split]      # earlier years
test_raw  = prices[split:]      # held-out later years

print("Train months:", len(train_raw))
print("Test months: ", len(test_raw))
# Output:
# Train months: 733
# Test months:  184

A random split is silent sabotage

The danger of a random time-series split is that nothing errors out. Your code runs, your metrics look great, and you only discover the problem when the model meets real future data and collapses. Make the temporal split the very first thing you do, and let it govern every step that follows.
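One way to see the leak without any model is to compare month indices under the two kinds of split. This is a minimal sketch that works on positions only (0 is the first month, 916 the last); the random seed is an arbitrary choice.

```python
import numpy as np

n_months = 917                       # length of the series in this lesson
split = int(n_months * 0.80)
idx = np.arange(n_months)            # month positions, oldest first

# Temporal split: every training month precedes every test month.
train_t, test_t = idx[:split], idx[split:]

# Random split (what NOT to do): months are shuffled before splitting.
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_months)
train_r, test_r = shuffled[:split], shuffled[split:]

# Count training months that come AFTER the earliest test month.
leaked = int((train_r > test_r.min()).sum())

print("temporal: latest train month", train_t.max(), "< earliest test month", test_t.min())
print("random:   training months later than the earliest test month:", leaked)
```

Under the temporal split that count is zero by construction. Under the random split it is large, and each of those months is a piece of the future sitting in the training set.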


Scaling: Fit on Train Only

That price range, from about 17 to over 7,400, is far too wide for a neural network to handle comfortably. Networks train best when inputs sit in a small, consistent range, so you rescale the series into the interval $[0, 1]$ with a min-max scaler. For a value $x$, the transform is:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

The non-negotiable detail is where $x_{\min}$ and $x_{\max}$ come from. They must be learned from the training portion only. If you fit the scaler on the whole series, the maximum value, which lives in the recent test years, leaks backward into training and quietly tells the model how high prices eventually go.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(train_raw.reshape(-1, 1))   # learn min/max from TRAIN ONLY

train_scaled = scaler.transform(train_raw.reshape(-1, 1)).flatten()
test_scaled  = scaler.transform(test_raw.reshape(-1, 1)).flatten()

Because the index keeps rising, the test years contain values above the training maximum, so some scaled test values will exceed 1. That is expected and correct: it honestly reflects that the model is being asked to forecast into territory beyond anything it trained on, which is exactly what real forecasting demands.
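The difference between the two fitting choices is easy to see on a tiny rising series. The sketch below applies the min-max formula by hand with NumPy; the ten values are made up, and min_max is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np

series = np.linspace(10.0, 100.0, 10)      # made-up rising series: 10, 20, ..., 100
train, test = series[:8], series[8:]       # train: 10..80, test: 90 and 100

def min_max(x, lo, hi):
    """Apply the min-max transform with an explicitly supplied min and max."""
    return (x - lo) / (hi - lo)

# Correct: min and max come from the training portion only.
test_ok = min_max(test, train.min(), train.max())

# Leaky: min and max come from the whole series, test years included.
test_leaky = min_max(test, series.min(), series.max())

print("train-only fit:", test_ok)
print("whole-series fit:", test_leaky)
```

With the train-only fit both test values land above 1 (about 1.14 and 1.29). With the whole-series fit the largest test value is squeezed to exactly 1, which is only possible because the scaler has already seen the future maximum.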

Now window each scaled portion separately and reshape into the 3-D form an RNN expects, (samples, timesteps, features). Here you have one feature (the price) at each of the 12 timesteps.

X_train, y_train = make_windows(train_scaled, window_size)
X_test,  y_test  = make_windows(test_scaled,  window_size)

X_train = X_train.reshape(X_train.shape[0], window_size, 1)
X_test  = X_test.reshape(X_test.shape[0],  window_size, 1)

print("X_train:", X_train.shape)
print("X_test: ", X_test.shape)
# Output:
# X_train: (721, 12, 1)
# X_test:  (172, 12, 1)

The training set holds 721 windowed examples (733 training months minus the 12-month window), each a 12-month sequence of a single feature. Windowing always costs you window_size examples at the start of each portion, because the first prediction needs a full window of history behind it.
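The reshape only adds a trailing feature axis; it does not move any values. A small sketch on a made-up array of 4 windows with 3 timesteps each:

```python
import numpy as np

X = np.arange(12, dtype="float32").reshape(4, 3)   # 4 windows, 3 timesteps each
X_3d = X.reshape(X.shape[0], X.shape[1], 1)        # (samples, timesteps, features)

print(X.shape, "->", X_3d.shape)                   # (4, 3) -> (4, 3, 1)

# Equivalent spelling: append a new axis of length 1.
X_alt = X[..., np.newaxis]
```

Each window keeps its values in the same order. The extra axis simply says "one feature per timestep", and it is where additional features such as volume would go in a multivariate setup.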


Building and Training the LSTM

With the data shaped correctly, the model itself is short. You stack an LSTM layer to read each 12-month window, a small Dense layer to mix what it found, and a single output neuron that produces the one number you want: next month’s scaled price.

model = tf.keras.Sequential([
    layers.Input(shape=(window_size, 1)),
    layers.LSTM(64, activation="tanh"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                       # one-step forecast: a single value
])

model.compile(optimizer="adam", loss="mean_squared_error")
model.summary()

The output layer has a single node with no activation, because you are predicting a continuous value, not a category. The loss is mean squared error, the natural choice for regression: it penalizes large misses much more harshly than small ones.
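As a sanity check on model.summary(), the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM layout (four weight blocks: three gates plus the candidate cell update, each with input weights, recurrent weights, and a bias) and default Keras settings. The helper names are made up for this illustration.

```python
def lstm_params(input_dim, units):
    # Four blocks, each with input weights (input_dim x units),
    # recurrent weights (units x units), and a bias (units).
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_in, n_out):
    # One weight per input-output pair, plus one bias per output.
    return n_in * n_out + n_out

lstm = lstm_params(input_dim=1, units=64)
dense_hidden = dense_params(64, 32)
dense_out = dense_params(32, 1)
total = lstm + dense_hidden + dense_out

print(lstm, dense_hidden, dense_out, total)   # 16896 2080 33 19009
```

Nearly all of the roughly 19,000 parameters sit in the LSTM layer, and most of those are recurrent weights, since the input has only one feature.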

Train for 60 epochs. Each epoch is one full pass over the training windows, and because this is a small model on a modest dataset, the passes are quick.

history = model.fit(
    X_train, y_train,
    epochs=60,
    batch_size=16,
    verbose=0,
)

print("Final training loss:", round(float(history.history["loss"][-1]), 5))
# Output:
# a small positive MSE on the scaled series

The training loss is reported on the scaled series, so its raw value (a small number near zero) is not directly meaningful in index points. To judge the model honestly, you need to bring the predictions back to the original price scale and measure them on the held-out years.
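Because min-max scaling is a linear transform, an error on the scaled series maps back to index points by multiplying by the training range, x_max - x_min. The sketch below checks that relationship numerically. The min, max, and scaled values are hypothetical placeholders, not numbers from the lesson's dataset.

```python
import numpy as np

# HYPOTHETICAL training min and max, chosen only for illustration.
x_min, x_max = 20.0, 1300.0

# Made-up scaled targets and predictions.
true_scaled = np.array([0.50, 0.60, 0.70])
pred_scaled = np.array([0.52, 0.57, 0.71])

def unscale(s):
    """Invert the min-max transform."""
    return s * (x_max - x_min) + x_min

# Route 1: take RMSE on the scaled values, then stretch by the training range.
mse_scaled = np.mean((true_scaled - pred_scaled) ** 2)
rmse_points = np.sqrt(mse_scaled) * (x_max - x_min)

# Route 2: un-scale first, then compute RMSE directly in index points.
rmse_direct = np.sqrt(np.mean((unscale(true_scaled) - unscale(pred_scaled)) ** 2))

print(round(float(rmse_points), 2), round(float(rmse_direct), 2))
```

Both routes agree, so a tiny scaled MSE can still correspond to a sizable error in points when the training range is wide.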


Forecasting and Evaluating

Run the trained model on the test windows to get scaled predictions, then reverse the scaler to return them to real index levels. You invert both the predictions and the true targets so the comparison happens in index points, not in the abstract $[0, 1]$ range.

pred_scaled = model.predict(X_test, verbose=0)

# Un-scale predictions and true values back to real index levels
pred = scaler.inverse_transform(pred_scaled).flatten()
true = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()

Now measure the error. Two metrics are standard for forecasting, and they answer slightly different questions.

Root mean squared error (RMSE) is the square root of the average squared error. Because it squares before averaging, it punishes large misses heavily, so it is sensitive to occasional big errors:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Mean absolute error (MAE) is just the average size of the errors, treating a 100-point miss as exactly twice as bad as a 50-point miss:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Both come straight out of scikit-learn.

from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = mean_squared_error(true, pred) ** 0.5
mae  = mean_absolute_error(true, pred)

print(f"Test RMSE: {rmse:.1f}")
print(f"Test MAE:  {mae:.1f}")
# Output:
# Test RMSE: 507.0
# Test MAE:  353.1

On the held-out years the LSTM lands a test RMSE of 507.0 and MAE of 353.1 index points. Read those numbers in context. The index in the test period sits in the thousands, so being off by a few hundred points on average, for a model that has never seen these years, is a genuinely respectable one-step forecast. RMSE can never be smaller than MAE, so what matters is the size of the gap: RMSE (507.0) sits well above MAE (353.1), which tells you the errors are uneven and the model makes a handful of larger misses, most likely around the sharp turns where the index changes direction abruptly.
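A toy comparison shows how the RMSE-to-MAE gap reflects uneven errors. The two error arrays below are made up so that both have the same MAE; the helper functions are written out by hand to mirror the formulas above.

```python
import numpy as np

def rmse(errors):
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    return float(np.mean(np.abs(errors)))

even = np.array([10.0, -10.0, 10.0, -10.0])   # every miss is the same size
spiky = np.array([0.0, 0.0, 0.0, 40.0])       # one large miss, otherwise perfect

print("even : MAE", mae(even), "RMSE", rmse(even))     # even : MAE 10.0 RMSE 10.0
print("spiky: MAE", mae(spiky), "RMSE", rmse(spiky))   # spiky: MAE 10.0 RMSE 20.0
```

When all misses are the same size the two metrics coincide. Concentrating the same total error into one miss leaves MAE unchanged and doubles RMSE.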

The forecast chart you saw earlier makes this concrete: across the test region the predicted line tracks the actual index closely, lagging slightly at sudden reversals but capturing the overall path well.

Why these numbers, and not R-squared

RMSE and MAE are reported in the same units as the index, so a stakeholder can read “off by about 350 points” directly. That interpretability is why they are the default for forecasting. A scale-free score like R-squared is useful for comparing across very different series, but it hides the magnitude in index points that usually matters most in a forecast.


One-Step vs. Multi-Step, and Why Finance Is Hard

Everything above is one-step forecasting: predict a single time step ahead, then stop. At every test point the model receives a window of real recent history and predicts the next value. This is the easiest forecasting setting and the right place to start.

Multi-step forecasting asks for several steps into the future at once. The usual approach is to forecast one step, append that prediction to the window, and forecast again from the partly-imagined window, repeating as far out as you need. The catch is that errors compound: each prediction is built on top of previous predictions rather than real data, so small mistakes snowball and the forecast drifts further from reality the further out you push.
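The recursive loop described above can be sketched independently of any particular model. In this illustration, recursive_forecast and toy_predictor are hypothetical names: the predictor is a deliberately simple stand-in (continue the most recent change), not the lesson's LSTM.

```python
import numpy as np

def recursive_forecast(predict_next, window, n_steps):
    """Forecast n_steps ahead by feeding each prediction back into the window."""
    window = np.asarray(window, dtype="float64").copy()
    forecasts = []
    for _ in range(n_steps):
        nxt = float(predict_next(window))       # one-step forecast from the current window
        forecasts.append(nxt)
        window = np.append(window[1:], nxt)     # drop the oldest value, append the prediction
    return np.array(forecasts)

def toy_predictor(window):
    # Hypothetical stand-in model: assume the latest change repeats.
    return window[-1] + (window[-1] - window[-2])

print(recursive_forecast(toy_predictor, [1.0, 2.0, 3.0], n_steps=3))   # [4. 5. 6.]
```

After the first step the window already contains a predicted value, and after window-length steps it contains nothing but predictions. That is the mechanism behind compounding error.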

That compounding is one reason financial forecasting is so hard, but it is not the deepest one. Markets are close to a random walk: next month’s level is roughly this month’s level plus noise that is extremely difficult to predict. If the next move were reliably forecastable, traders would act on it and erase the very edge they found. So a strong one-step model often succeeds mostly by predicting that “next month looks a lot like this month,” which is informative but humble. Add genuine non-stationarity, regime changes, crashes, and the influence of news that no price history could contain, and you have a problem where modest, honest error bars are the mark of a serious model, not a weak one.
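A simulated random walk makes the "current level plus noise" point concrete. This is a sketch on synthetic data, with an arbitrary seed and unit-variance Gaussian steps. It is not a claim about the actual statistics of the S&P 500.

```python
import numpy as np

rng = np.random.default_rng(42)
steps = rng.normal(loc=0.0, scale=1.0, size=5000)   # unpredictable period-to-period changes
walk = np.cumsum(steps)                             # random walk: each level = previous + noise

naive_pred = walk[:-1]      # forecast "next value equals current value"
actual = walk[1:]
errors = actual - naive_pred

rmse = float(np.sqrt(np.mean(errors ** 2)))
print("naive RMSE:", round(rmse, 3), "| noise standard deviation: 1.0")
```

On a pure random walk the naive forecast's error is exactly the noise term, so its RMSE sits close to the noise standard deviation and no model can do systematically better. The closer a real series is to this behavior, the less room a learned model has to improve on the last-value forecast.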

Beware the lag illusion

A one-step price forecast can look stunning on a chart precisely because predicting “about the same as last month” is a strong baseline for a trending series. Before celebrating, compare your model against that naive last-value baseline. If the LSTM cannot clearly beat “next month equals this month,” its impressive-looking line is mostly tracking, not forecasting.


The Full Workflow in One Place

Here is the entire pipeline condensed into a single runnable script. It is a template you can point at almost any single-series forecasting problem by changing the file, the column, and the window size.

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

def make_windows(series, window_size):
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i : i + window_size])
        y.append(series[i + window_size])
    return np.array(X), np.array(y)

# 1. Load the series
df = pd.read_csv("sp500_monthly.csv", parse_dates=["date"], index_col="date")
prices = df["price"].values.astype("float32")

# 2. TEMPORAL split (never random)
split = int(len(prices) * 0.80)
train_raw, test_raw = prices[:split], prices[split:]

# 3. Scale, fitting on TRAIN ONLY
scaler = MinMaxScaler().fit(train_raw.reshape(-1, 1))
train_scaled = scaler.transform(train_raw.reshape(-1, 1)).flatten()
test_scaled  = scaler.transform(test_raw.reshape(-1, 1)).flatten()

# 4. Window into (12 months -> next month) pairs
window_size = 12
X_train, y_train = make_windows(train_scaled, window_size)
X_test,  y_test  = make_windows(test_scaled,  window_size)
X_train = X_train.reshape(-1, window_size, 1)
X_test  = X_test.reshape(-1, window_size, 1)

# 5. Build and train the LSTM
model = tf.keras.Sequential([
    layers.Input(shape=(window_size, 1)),
    layers.LSTM(64, activation="tanh"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.fit(X_train, y_train, epochs=60, batch_size=16, verbose=0)

# 6. Forecast, un-scale, and evaluate in real units
pred = scaler.inverse_transform(model.predict(X_test, verbose=0)).flatten()
true = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()
print(f"Test RMSE: {mean_squared_error(true, pred) ** 0.5:.1f}")
print(f"Test MAE:  {mean_absolute_error(true, pred):.1f}")
# Output:
# Test RMSE: 507.0
# Test MAE:  353.1

Six steps: load, split through time, scale on train only, window, train, evaluate in real units. That ordering is the whole discipline of time-series forecasting.


Practice Exercises

Try these before checking the hints.

Exercise 1: Build a Naive Baseline

Before trusting any model, you need a baseline. Implement the simplest forecast possible: predict that next month equals this month. For the test windows, that means the prediction is just the last value in each window. Compute its RMSE and MAE on the real (un-scaled) index and compare them to the LSTM’s 507.0 / 353.1.

# Your code here (reuse X_test, y_test, scaler from the lesson)

Hint

The last value in each window is X_test[:, -1, 0]. Inverse-transform both that array and y_test with scaler.inverse_transform(...), then pass them to mean_squared_error and mean_absolute_error. If the LSTM cannot clearly beat this baseline, it is mostly tracking rather than forecasting.

Exercise 2: Change the Window Size

Re-run the workflow with window_size = 6 and again with window_size = 24, leaving the model unchanged. Print the test RMSE for each. Does giving the model more or less recent history help on this series?

# Your code here (rebuild X_train, X_test for each window_size, then refit)

Hint

Wrap steps 4 through 6 of the full script in a loop over [6, 12, 24]. Remember that changing the window changes the number of examples and the input shape, so rebuild X_train, X_test, and re-create the model inside the loop. Watch how the metric moves rather than chasing one exact number.

Exercise 3: A Multi-Step Forecast

Generate a 6-month-ahead forecast from the final training window using the recursive approach: predict one step, append the prediction to the window, drop the oldest value, and repeat six times. Print the six forecasted (un-scaled) values.

# Your code here (use the trained model, the last window of train_scaled, and window_size)

Hint

Start from window = train_scaled[-window_size:].copy(). In a loop, reshape it to (1, window_size, 1), call model.predict, append the scalar prediction, and slice off the first element so the window stays length 12. Collect the predictions, then inverse-transform them at the end. Notice how each step builds on the previous prediction, which is exactly why errors compound.


Summary

You built a complete time-series forecasting pipeline on a real, decades-long price series and evaluated it honestly on years the model never saw. Let’s review what you learned.

Key Concepts

Time Series and RNNs

  • A time series is ordered through time, and that order carries the signal you want to learn
  • RNNs and especially LSTMs are built for sequences, carrying a hidden state that summarizes the recent past
  • Forecasting predicts future values of a sequence; the value you predict is also the value you feed in, just shifted in time

Windowing

  • Windowing slides a fixed-length window over the series to create (input window -> next value) supervised pairs
  • The window size is a hyperparameter baked into the data, not the model
  • You always lose window_size examples at the start, since the first prediction needs a full history behind it

The Temporal Split

  • Time-series data demands a temporal split: train on earlier data, test on later data
  • A random split leaks the future into training and produces forecasts that fail in the real world
  • Split first, before scaling or windowing, so every later step respects the boundary

Scaling Without Leakage

  • Fit the scaler on the training portion only, then apply it to both portions
  • With a trending series, scaled test values can legitimately exceed 1
  • Inverse-transform predictions before evaluating so errors are reported in real units

Building and Evaluating

  • A one-step LSTM forecaster is Input -> LSTM -> Dense -> Dense(1) with MSE loss
  • RMSE punishes large misses; MAE reports the typical error size; both are in the index’s own units
  • The lesson’s LSTM reached test RMSE 507.0 and MAE 353.1 over 60 epochs on the held-out years

Why This Matters

The discipline you practiced here, splitting through time, scaling on train only, and judging on a genuine future, is what separates a forecast you can trust from one that merely looks good in a notebook. The most common and most damaging mistakes in forecasting are leaks: a random split, a scaler fit on the whole series, a window that peeks ahead. Get the ordering right and your metrics mean something; get it wrong and they lie to you.

Just as important is the honesty about limits. Markets behave close to a random walk, multi-step errors compound, and the most useful one-step model is often the humble one that says “next month resembles this month.” Knowing that keeps you from overselling a pretty chart and pushes you toward the right comparisons, starting with a naive baseline. That mindset, real numbers and honest expectations, is exactly what the guided project asks you to put into practice next.


Next Steps

You now have the full forecasting workflow and a working LSTM. In the guided project, you will take the wheel and apply all of it end to end on the S&P 500, making and defending your own modeling choices.

Continue to Lesson 6 - Guided Project: Forecasting the S&P 500

Put the full forecasting workflow into practice and build your own LSTM forecaster end to end.



Keep Building Your Skills

You have crossed an important threshold: you can now take a raw series of numbers through time and turn it into a trustworthy forecast, with metrics you can defend. The exact model will change from problem to problem, but the workflow, temporal split, train-only scaling, careful windowing, and evaluation in real units, stays the same. Carry that discipline into the guided project, and you will be forecasting like a practitioner rather than hoping like a beginner.