Lesson 2 - Basic RNN Architecture
Welcome to Recurrent Neural Networks
In the previous lesson you saw why sequence models exist: ordinary feed-forward networks treat every input independently, but many real problems, such as language, audio, and time series, carry information in their order. This lesson opens the box and shows you how the simplest sequence model, the recurrent neural network (RNN), actually works. You will learn the recurrence relation that gives an RNN its memory, see how it unrolls through time, and then build and train a real SimpleRNN in Keras to forecast the S&P 500 stock index from windows of past monthly prices.
By the end of this lesson, you will be able to:
- Explain the RNN recurrence relation and what the hidden state represents
- Describe how an RNN unrolls through time and how backpropagation through time trains it
- Turn a single time series into supervised sliding windows of shape (samples, timesteps, features)
- Scale a time series correctly by fitting the scaler on the training data only
- Build, train, and evaluate a Keras SimpleRNN and interpret RMSE and MAE on a forecast
You should be comfortable with basic Python, NumPy, and pandas, and you should have read Lesson 1 of this module. Some exposure to neural networks (layers, weights, activation functions) will help. Let’s begin.
What Makes a Network “Recurrent”
A standard feed-forward network maps an input to an output in one shot. Feed it a vector, it produces an answer, and it forgets everything the moment the next input arrives. That is fine when each example stands alone. It is a problem when the meaning of the current input depends on what came before, which is exactly the situation with a time series like a stock index: this month’s price only makes sense in the context of the months leading up to it.
A recurrent neural network solves this by adding a loop. As it reads a sequence one step at a time, it keeps a running summary of everything it has seen so far. That summary is called the hidden state, written $h_t$, and it is the network’s memory.
At each timestep $t$, the RNN combines two things: the current input $x_t$ and the previous hidden state $h_{t-1}$. It blends them, squashes the result through an activation function, and produces a new hidden state $h_t$. That new state is then carried forward to the next step. The single equation that defines this is the recurrence relation:
$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$$
Read it slowly. $W_x$ is a weight matrix applied to the current input, $W_h$ is a weight matrix applied to the previous hidden state, and $b$ is a bias. The activation function here is $\tanh$, which keeps the hidden state values in a manageable range (between -1 and 1). The crucial detail is the $h_{t-1}$ term: the new state depends on the old state, so information from early in the sequence can ripple all the way to the end.
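To make the recurrence concrete, here is a minimal NumPy sketch of a single step. The function name `rnn_step`, the symbol names `W_x`, `W_h`, and `b`, the random weights, and the 4-unit size are all illustrative assumptions, not what Keras does internally or the sizes used later in this lesson.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One application of the recurrence: tanh(W_x x_t + W_h h_prev + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
units, features = 4, 1                       # illustrative sizes only
W_x = rng.normal(scale=0.5, size=(units, features))
W_h = rng.normal(scale=0.5, size=(units, units))
b = np.zeros(units)

h_prev = np.zeros(units)                     # no memory yet
x_t = np.array([0.3])                        # one input value for this timestep
h_t = rnn_step(x_t, h_prev, W_x, W_h, b)
print(h_t.shape)                             # (4,)
```

Every entry of `h_t` lies strictly between -1 and 1 because of the tanh, whatever the input value is.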
The hidden state is the memory
The hidden state is a fixed-size vector that compresses everything the network has read so far into one summary. Its length is a choice you make (the number of units in the RNN layer). A larger hidden state can remember more, but costs more parameters and is harder to train.
One Set of Weights, Reused Every Step
A subtle but important point: the input weights $W_x$, the recurrent weights $W_h$, and the bias $b$ do not change from one timestep to the next. The same weights are applied at every step of the sequence. This is called weight sharing, and it is what lets an RNN handle sequences of any length with a fixed, small number of parameters. Whether your sequence is 12 steps long or 1,200, the network uses the same handful of weights over and over.
Unrolling Through Time
The loop in an RNN is easy to draw but hard to reason about. The standard trick is to unroll it: instead of one cell with an arrow pointing back to itself, you draw a copy of the cell for each timestep, laid out left to right, with the hidden state flowing from one copy to the next.
Each box in the unrolled diagram is the same cell with the same weights. At step 1 the cell reads $x_1$ and an initial hidden state $h_0$ (usually all zeros) and produces $h_1$. At step 2 it reads $x_2$ and $h_1$ and produces $h_2$. This continues to the final step, where the last hidden state $h_T$ summarizes the whole sequence. For a forecasting task like ours, that final hidden state is what gets passed to an output layer to predict the next value.
Unrolling is not a different model. It is just a way of seeing the loop as a deep chain so that you can apply ordinary backpropagation to it.
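In code, unrolling is just a loop. The sketch below is an illustration under assumed names and random weights (`rnn_forward`, `W_x`, `W_h`, `b`, 3 units), not the Keras implementation. It runs the same cell over a 5-step and a 50-step sequence without changing a single weight.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Unroll the recurrence over xs, which has shape (timesteps, features)."""
    h = np.zeros(W_h.shape[0])               # initial hidden state: all zeros
    states = []
    for x_t in xs:                           # the same weights at every step
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)                  # shape (timesteps, units)

rng = np.random.default_rng(1)
units = 3                                    # illustrative size
W_x = rng.normal(scale=0.5, size=(units, 1))
W_h = rng.normal(scale=0.5, size=(units, units))
b = np.zeros(units)

short = rnn_forward(rng.normal(size=(5, 1)), W_x, W_h, b)
long = rnn_forward(rng.normal(size=(50, 1)), W_x, W_h, b)
print(short.shape, long.shape)               # (5, 3) (50, 3)

final_state = long[-1]                       # the summary handed to an output layer
```

The last row of the returned array plays the role of the final hidden state described above.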
Backpropagation Through Time
How does an RNN learn its weights? The same way any neural network does, by gradient descent, but with one wrinkle. Because the unrolled network is a chain of repeated cells, the gradient of the loss has to flow backward through every timestep. This procedure is called backpropagation through time (BPTT).
Conceptually, BPTT works like this:
- Run the sequence forward, computing $h_1, \dots, h_T$ and the final prediction.
- Compare the prediction to the true value and compute the loss.
- Propagate the error backward from the last step to the first, accumulating how much each shared weight contributed to the loss at every timestep.
- Sum those contributions and update the weights once.
Because the same weights appear at every step, their gradient is the sum of the gradients from all steps. That is the only real difference from ordinary backpropagation.
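The four steps above can be sketched for the smallest possible case: an RNN with one hidden unit and scalar weights, where the final hidden state is used directly as the prediction. Everything here (the names `forward`, `loss`, `bptt`, the toy inputs, the half-squared-error loss) is an assumption chosen for illustration. The finite-difference check at the end is a common way to confirm that hand-written gradients are right.

```python
import math

def forward(xs, w_x, w_h, b):
    """Return hidden states [h_0, h_1, ..., h_T] for a one-unit RNN."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1] + b))
    return hs

def loss(xs, y, w_x, w_h, b):
    """Half squared error between the final hidden state and the target."""
    return 0.5 * (forward(xs, w_x, w_h, b)[-1] - y) ** 2

def bptt(xs, y, w_x, w_h, b):
    """Gradients for the shared weights, summed over all timesteps."""
    hs = forward(xs, w_x, w_h, b)
    g_wx = g_wh = g_b = 0.0
    dh = hs[-1] - y                          # error at the last hidden state
    for t in range(len(xs), 0, -1):          # walk backward: T, T-1, ..., 1
        da = dh * (1.0 - hs[t] ** 2)         # back through the tanh
        g_wx += da * xs[t - 1]               # this timestep's contribution
        g_wh += da * hs[t - 1]
        g_b += da
        dh = da * w_h                        # pass the error to the previous state
    return g_wx, g_wh, g_b

xs, y = [0.1, 0.4, -0.2, 0.3], 0.25          # toy sequence and target
params = (0.7, -0.5, 0.1)                    # toy w_x, w_h, b
analytic = bptt(xs, y, *params)

# Finite-difference check of each gradient
eps = 1e-6
numeric = []
for i in range(3):
    up, down = list(params), list(params)
    up[i] += eps
    down[i] -= eps
    numeric.append((loss(xs, y, *up) - loss(xs, y, *down)) / (2 * eps))

print(analytic)
print(numeric)
```

The two printed triples should agree to several decimal places, which is the sign that summing per-timestep contributions gives the correct gradient for the shared weights.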
Why plain RNNs struggle with long sequences
When the error flows back through many timesteps, it gets multiplied by the recurrent weights again and again. If those multipliers are small, the gradient shrinks toward zero (the vanishing gradient problem) and early timesteps barely learn; if they are large, it can explode. This is why a plain SimpleRNN has trouble remembering things from far in the past. The next lesson introduces GRU and LSTM cells, which were designed specifically to fix this.
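A rough way to feel the effect is to look at what repeated multiplication does to a number. The sketch below is a deliberate simplification: it treats the recurrent weight as a single scalar and ignores the tanh derivative, which is at most 1 and would only shrink the result further. The function name `backprop_factor` is made up for this illustration.

```python
def backprop_factor(recurrent_weight, steps):
    """Size of the factor an error picks up after `steps` multiplications."""
    return abs(recurrent_weight) ** steps

for w in (0.5, 1.0, 1.5):
    factors = [backprop_factor(w, n) for n in (1, 12, 50)]
    print(w, factors)
```

With a weight of 0.5, the factor after 50 steps is on the order of 1e-15, so the earliest timesteps receive almost no learning signal. With a weight of 1.5 it is in the hundreds of millions, which is the exploding case.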
The Forecasting Problem
Enough theory. Let’s put a SimpleRNN to work on a real series. You will forecast the S&P 500, a stock-market index that tracks 500 large U.S. companies, using monthly closing values stretching back to 1950.
You can download the dataset and load it with pandas.
import pandas as pd
# download: https://datatweets.com/datasets/sp500_monthly.csv
df = pd.read_csv("sp500_monthly.csv", parse_dates=["date"])
df = df.sort_values("date").reset_index(drop=True)
print("Rows:", len(df))
print("From", df["date"].min().date(), "to", df["date"].max().date())
print("Price range:", round(df["close"].min(), 1), "to", round(df["close"].max(), 1))
# Output:
# Rows: 917
# From 1950-01-01 to 2026-05-01
# Price range: 16.9 to 7412.6
There are 917 monthly observations, one per month from January 1950 through May 2026. The close column is the value you want to predict. Plotting it shows the long upward march of the index, with the familiar dips around major downturns.
This is a genuine sequence problem. The value next month depends heavily on the recent trajectory, so an RNN, which reads the recent history before predicting, is a natural fit.
From a Series to Supervised Windows
An RNN cannot consume a raw 1D series directly. You have to reshape the problem into supervised examples, each with an input sequence and a target. The standard technique for time series is windowing: slide a fixed-length window across the series, use the values inside the window as the input, and use the value immediately after the window as the target.
Here you will use a window of 12 months. Each training example is therefore “the last 12 monthly closes” and its target is “the close in month 13.” Sliding the window forward one month at a time generates a fresh example each step.
import numpy as np
WINDOW = 12
def make_windows(series, window):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])   # 12 months of input
        y.append(series[i + window])       # the next month (target)
    return np.array(X), np.array(y)
Before windowing, you must split the series into train and test, and the split has to respect time. You cannot shuffle a time series, because that would let the model peek at the future to predict the past. Instead you slice chronologically: the earliest portion is for training, the most recent portion is for testing.
prices = df["close"].values.astype("float32")
# Chronological split: train on the past, test on the most recent years
split = int(len(prices) * 0.8)
train_series = prices[:split]
test_series = prices[split - WINDOW:]  # overlap by WINDOW so the first
                                       # test window has 12 real months
Notice the test slice starts WINDOW months before the split point. That overlap is deliberate: the very first test prediction needs 12 prior months as its input, and those months legitimately come from before the test region. The targets themselves never leak into training.
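A toy series makes both the windowing and the overlap easy to check by eye. This sketch reuses the logic of make_windows with plain lists; the 10-value series, the window of 3, and the split point of 7 are made-up values for illustration only.

```python
def make_windows(series, window):
    """Slide a fixed-length window; the value right after it is the target."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])
        y.append(series[i + window])
    return X, y

series = list(range(10))          # toy "prices": 0, 1, ..., 9
window = 3
split = 7                         # hypothetical split: indices 0-6 train, 7-9 test

train = series[:split]
test = series[split - window:]    # starts 3 values early: [4, 5, 6, 7, 8, 9]

X_train, y_train = make_windows(train, window)
X_test, y_test = make_windows(test, window)

print(X_train[0], "->", y_train[0])   # [0, 1, 2] -> 3
print(X_test[0], "->", y_test[0])     # [4, 5, 6] -> 7
print(y_train, y_test)                # [3, 4, 5, 6] [7, 8, 9]
```

The first test window borrows its inputs (4, 5, 6) from the training region, but every test target (7, 8, 9) sits at or after the split point, so no target appears in both sets.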
Scaling: Fit on Train Only
Neural networks train far better when inputs are on a small, consistent scale, and the S&P 500 ranges from about 17 to over 7,400. You will use a MinMaxScaler to squeeze values into the range $[0, 1]$.
The golden rule from Lesson 1 applies here just as strictly: fit the scaler on the training data only, then apply the same transform to the test data. If you fit on the full series, information about future test values leaks into the scaling and your evaluation becomes dishonest.
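The difference between the two approaches shows up even on a handful of numbers. The sketch below hand-rolls min-max scaling in plain Python (the helpers `fit_min_max` and `scale` and the toy values are assumptions for illustration; they are not the scikit-learn API).

```python
def fit_min_max(values):
    """Return the (min, max) that a min-max scaler would learn."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map lo to 0 and hi to 1, linearly."""
    return [(v - lo) / (hi - lo) for v in values]

train = [10.0, 20.0, 30.0, 40.0]     # toy "past"
test = [50.0, 60.0]                  # toy "future", above anything in train

# Honest: statistics come from the training data only
lo, hi = fit_min_max(train)
print(scale(train, lo, hi))
print(scale(test, lo, hi))           # values above 1

# Leaky: statistics computed over train + test
lo_all, hi_all = fit_min_max(train + test)
print(scale(train, lo_all, hi_all))  # training inputs now depend on the test maximum
```

When the scaler is fit on the training data only, test values from a rising series land above 1. That is expected and honest: the model really has not seen values that large. In the leaky version, the training inputs are compressed by a maximum that lies in the future.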
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_series.reshape(-1, 1)).ravel()
test_scaled = scaler.transform(test_series.reshape(-1, 1)).ravel()
Now build the windows from the scaled series and reshape the inputs into the 3D form that recurrent layers require: (samples, timesteps, features). Here timesteps is 12 (the window length) and features is 1 (a single price per step).
X_train, y_train = make_windows(train_scaled, WINDOW)
X_test, y_test = make_windows(test_scaled, WINDOW)
# Add the feature dimension: (samples, timesteps) -> (samples, timesteps, 1)
X_train = X_train.reshape(X_train.shape[0], WINDOW, 1)
X_test = X_test.reshape(X_test.shape[0], WINDOW, 1)
print("Train windows:", X_train.shape)
print("Test windows: ", X_test.shape)
# Output:
# Train windows: (721, 12, 1)
# Test windows: (184, 12, 1)
You now have 721 training windows and 184 test windows, each one a sequence of 12 monthly prices with a single feature per step. That trailing 1 is the num_features dimension: at every timestep the network sees exactly one number, the scaled close.
The shape every RNN expects
Recurrent layers in Keras always expect input shaped (samples, timesteps, features). For a single-variable time series, features is 1. If you forget the feature dimension and pass a 2D array, you will get a shape error before training even starts.
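A quick way to add the missing dimension is shown below on a toy array (two made-up windows of 12 steps). Indexing with `np.newaxis` is one option; an explicit reshape gives the same result.

```python
import numpy as np

X_2d = np.arange(24, dtype="float32").reshape(2, 12)   # 2 toy windows of 12 steps
print(X_2d.shape)                                      # (2, 12)

X_3d = X_2d[..., np.newaxis]                           # same as X_2d.reshape(2, 12, 1)
print(X_3d.shape)                                      # (2, 12, 1)
```

The values are untouched; only the shape changes, so each timestep now holds a length-1 feature vector instead of a bare number.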
Building the SimpleRNN
With the data shaped correctly, the model itself is short. You will use a SimpleRNN layer with 32 units, followed by a single Dense output node that produces the one-step-ahead forecast.
import tensorflow as tf
from tensorflow import keras
model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, 1)),          # 12 timesteps, 1 feature
    keras.layers.SimpleRNN(32, activation="tanh"),
    keras.layers.Dense(1),                          # one numeric output
])
model.compile(optimizer="adam", loss="mse")
model.summary()
A few things to notice. The Input layer declares the shape of one example, (12, 1), so Keras knows each input is a 12-step sequence with one feature. The SimpleRNN(32) layer maintains a 32-dimensional hidden state and, by default, returns only the final hidden state $h_T$, which is exactly what you want for a single forecast. The Dense(1) layer turns that 32-number summary into a single predicted price. Because this is regression, the loss is mean squared error and there is no activation on the output.
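The parameter counts for this architecture can be worked out by hand from the recurrence: one input weight matrix, one recurrent weight matrix, and one bias vector for the SimpleRNN, plus weights and a bias for the Dense layer. The helper names below are made up for this sketch, and the totals should line up with what model.summary() reports, assuming default settings with biases enabled.

```python
def simple_rnn_params(units, features):
    """Input weights + recurrent weights + biases of one SimpleRNN layer."""
    return units * features + units * units + units

def dense_params(inputs, outputs):
    """Weights + biases of one Dense layer."""
    return inputs * outputs + outputs

rnn = simple_rnn_params(units=32, features=1)   # 32 + 1024 + 32
head = dense_params(inputs=32, outputs=1)       # 32 + 1
print(rnn, head, rnn + head)                    # 1088 33 1121
```

The window length does not appear anywhere in the calculation. That is weight sharing at work: a 12-step or a 24-step input uses the same 1,088 recurrent-layer parameters.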
Training the Model
Training is one call to .fit(). You will run for 40 epochs, meaning the network passes over all 721 training windows 40 times, nudging its shared weights with backpropagation through time after each batch.
history = model.fit(
    X_train, y_train,
    epochs=40,
    batch_size=32,
    verbose=0,  # set to 1 to watch the loss fall each epoch
)
print("Final training loss:", round(float(history.history["loss"][-1]), 5))
# Output:
# a small positive MSE on the scaled targets
The loss is computed on the scaled targets, so it is a tiny number near zero and not directly meaningful in index points. To judge the model in real terms, you need to make predictions, undo the scaling, and compare against the actual prices.
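Because min-max scaling is a linear map, an error measured on the scaled values converts to index points by multiplying by the span (max minus min) the scaler learned. The sketch below checks that identity on toy numbers; the scaler statistics and the scaled values are hypothetical, not taken from the S&P 500 data.

```python
import math

train_min, train_max = 100.0, 500.0          # hypothetical scaler statistics
span = train_max - train_min

def inverse(v):
    """Undo min-max scaling for one value."""
    return v * span + train_min

y_true_scaled = [0.20, 0.50, 0.90]           # toy scaled targets
y_pred_scaled = [0.25, 0.40, 0.95]           # toy scaled predictions
n = len(y_true_scaled)

# Shortcut: take the root of the scaled MSE, then multiply by the span
mse_scaled = sum((p - t) ** 2 for p, t in zip(y_pred_scaled, y_true_scaled)) / n
rmse_from_loss = math.sqrt(mse_scaled) * span

# Long way: invert the scaling first, then score
errors = [inverse(p) - inverse(t) for p, t in zip(y_pred_scaled, y_true_scaled)]
rmse_direct = math.sqrt(sum(e ** 2 for e in errors) / n)

print(rmse_from_loss, rmse_direct)
```

Both routes give the same RMSE up to floating-point rounding. The shortcut is handy for a rough reading of a training loss, while inverting the predictions is still needed for plots and for metrics such as MAE.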
Evaluating the Forecast
The model predicts on the scaled range, so you must invert the scaler to bring predictions back to actual index points before scoring them.
# Predict on the test windows, then invert scaling back to real prices
pred_scaled = model.predict(X_test, verbose=0)
y_pred = scaler.inverse_transform(pred_scaled).ravel()
y_true = scaler.inverse_transform(y_test.reshape(-1, 1)).ravel()
Two standard metrics for regression on a continuous target are root mean squared error (RMSE) and mean absolute error (MAE). RMSE squares the errors before averaging, so it punishes large misses harder; MAE is the plain average absolute miss, in the same units as the target.
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = mean_squared_error(y_true, y_pred) ** 0.5
mae = mean_absolute_error(y_true, y_pred)
print(f"SimpleRNN test RMSE: {rmse:.1f}")
print(f"SimpleRNN test MAE: {mae:.1f}")
# Output:
# SimpleRNN test RMSE: 1007.8
# SimpleRNN test MAE: 641.6
On the held-out recent years, the SimpleRNN lands at an RMSE of about 1007.8 index points and an MAE of about 641.6. Plotting the forecast against the actual values tells the real story.
The model clearly tracks the trend: when the index rises, the prediction rises, and when it falls, the prediction follows. But look closely and you will see the forecast lags. It reacts a step late, consistently undershooting on the way up and overshooting on the way down. That lag, plus the large RMSE relative to MAE, tells you the plain RNN handles the gentle direction but misses the timing and magnitude of fast moves.
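The gap between RMSE and MAE is itself informative. The toy example below uses two made-up error lists with the same MAE: one where every miss is the same size, and one where most misses are small but a single miss is large.

```python
import math

def rmse(errors):
    """Root mean squared error of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

steady = [3.0, -3.0, 3.0, -3.0, 3.0]      # every miss is the same size
spiky = [1.0, -1.0, 1.0, -1.0, 11.0]      # same MAE, one large miss

print(mae(steady), rmse(steady))          # 3.0 3.0
print(mae(spiky), rmse(spiky))            # 3.0 5.0
```

When all errors have the same magnitude, RMSE equals MAE. The further RMSE rises above MAE, the more the total error is concentrated in a few large misses, which is the pattern the lesson's forecast shows during fast market moves.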
Why a lagging forecast still teaches something
A model that lags by predicting “roughly last month again” is a classic baseline trap for time series. It looks plausible because prices change slowly month to month. The lesson is not that the RNN failed, but that a SimpleRNN with a short window struggles to anticipate turns. The next lesson’s gated cells (GRU and LSTM) will cut these errors substantially on the very same data.
Putting It All Together
Here is the entire pipeline, from raw CSV to a scored forecast, condensed into one runnable script. This is a template you can reuse for any single-variable time series.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
WINDOW = 12
# 1. Load and sort the series
df = pd.read_csv("sp500_monthly.csv", parse_dates=["date"]) # download: https://datatweets.com/datasets/sp500_monthly.csv
df = df.sort_values("date").reset_index(drop=True)
prices = df["close"].values.astype("float32")
# 2. Chronological split (never shuffle a time series)
split = int(len(prices) * 0.8)
train_series = prices[:split]
test_series = prices[split - WINDOW:]
# 3. Scale, fitting on TRAIN only
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_series.reshape(-1, 1)).ravel()
test_scaled = scaler.transform(test_series.reshape(-1, 1)).ravel()
# 4. Build sliding windows
def make_windows(series, window):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)
X_train, y_train = make_windows(train_scaled, WINDOW)
X_test, y_test = make_windows(test_scaled, WINDOW)
X_train = X_train.reshape(X_train.shape[0], WINDOW, 1)
X_test = X_test.reshape(X_test.shape[0], WINDOW, 1)
# 5. Define and train the SimpleRNN
model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, 1)),
    keras.layers.SimpleRNN(32, activation="tanh"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=40, batch_size=32, verbose=0)
# 6. Predict, invert scaling, and score
y_pred = scaler.inverse_transform(model.predict(X_test, verbose=0)).ravel()
y_true = scaler.inverse_transform(y_test.reshape(-1, 1)).ravel()
print(f"RMSE: {mean_squared_error(y_true, y_pred) ** 0.5:.1f}")
print(f"MAE: {mean_absolute_error(y_true, y_pred):.1f}")
# Output:
# RMSE: 1007.8
# MAE: 641.6
In about 40 lines you loaded a real index, framed it as a supervised windowing problem, scaled it honestly, trained a recurrent network, and measured it on years it never saw. That is the complete RNN forecasting workflow.
Practice Exercises
Now it is your turn. Try these before checking the hints.
Exercise 1: Change the Window Length
The lesson used a 12-month window. Rebuild the windows with WINDOW = 24 (two years of history per example) and retrain the same SimpleRNN(32). Print the new train and test window shapes and the test RMSE. Does a longer memory help?
WINDOW = 24
# Rebuild train_scaled / test_scaled with the new WINDOW overlap,
# then make_windows, reshape, train, and score.
# Your code here
Hint
Remember that test_series = prices[split - WINDOW:] depends on WINDOW, so recompute it after changing the window. The reshape becomes X_train.reshape(X_train.shape[0], 24, 1), and the Input shape becomes (24, 1). A longer window gives the network more context but also fewer total windows.
Exercise 2: Compare Against a Naive Baseline
A simple baseline for any time series is “predict next month equals this month.” Compute the RMSE of this naive forecast on the test set: use the last value in each test window as the prediction. Is the SimpleRNN actually beating “do nothing”?
# The last value in each test window is X_test[:, -1, 0] (still scaled)
# Your code here
Hint
Take naive_scaled = X_test[:, -1, 0], invert it with scaler.inverse_transform(naive_scaled.reshape(-1, 1)).ravel(), then compute mean_squared_error(y_true, naive) ** 0.5. Comparing a model to a naive baseline is one of the most important habits in time-series work.
Exercise 3: Resize the Hidden State
The hidden state size is a hyperparameter. Train two more models, one with SimpleRNN(8) and one with SimpleRNN(64), keeping everything else the same. Print the test RMSE for each. Does a bigger hidden state always win?
for units in [8, 64]:
    # build, compile, fit, predict, invert, and print RMSE
    # Your code here
    pass  # placeholder so the loop is valid Python until you fill it in
Hint
Only the SimpleRNN(units, ...) line changes inside the loop. Wrap the build-train-score steps in the loop body and print units alongside the RMSE. You will often find that more units help up to a point and then stop, or even hurt, because the model starts to overfit the limited training history.
Summary
Congratulations! You have built and trained your first recurrent neural network on a real time series. Let’s review what you learned.
Key Concepts
How an RNN Works
- An RNN keeps a hidden state $h_t$, a fixed-size summary of everything it has read so far
- The recurrence relation $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$ blends the current input with the previous state
- The same weights are reused at every step (weight sharing), so one small set of parameters handles sequences of any length
Training Through Time
- Unrolling redraws the loop as a left-to-right chain of identical cells
- Backpropagation through time flows the gradient back across every timestep and sums each shared weight’s contribution
- Plain RNNs suffer from vanishing or exploding gradients over long sequences, which limits their memory
Framing a Time Series
- Windowing turns one series into supervised examples: 12 months in, the next month out
- A time series must be split chronologically, never shuffled, to avoid peeking at the future
- Recurrent layers expect input shaped (samples, timesteps, features); here (721, 12, 1) for train and (184, 12, 1) for test
- Scale with MinMaxScaler fit on the training data only, and invert the scaling before scoring
Building and Evaluating
- A SimpleRNN(32) plus a Dense(1) output is a complete one-step forecaster
- compile(optimizer="adam", loss="mse") then .fit() trains it; .predict() produces forecasts
- The model reached a test RMSE of 1007.8 and MAE of 641.6, tracking the trend but lagging behind sharp moves
Why This Matters
The recurrence relation you learned here is the seed of every sequence model. GRUs, LSTMs, and even the attention mechanisms behind modern language models all exist to do one thing better than the SimpleRNN: carry useful information across many steps without the gradient vanishing. Understanding the plain RNN first means those later cells are not mysterious black boxes but targeted fixes to a problem you have now seen with your own eyes.
You also practiced the discipline that separates trustworthy time-series work from accidental self-deception: splitting chronologically, scaling on the training set only, and always comparing against a naive baseline. The SimpleRNN’s lagging forecast is not a failure; it is a clear, honest measurement that gives you something concrete to improve. In the next lesson you will improve it dramatically.
Next Steps
You now understand how a recurrent network remembers, unrolls, and learns, and you have trained one end to end. In the next lesson you will swap the SimpleRNN for gated cells that were designed to remember much longer, and watch the error on this exact dataset fall sharply.
Continue to Lesson 3 - Advanced RNN Architecture: GRU and LSTM
Replace the SimpleRNN with gated cells that fix the vanishing-gradient problem and cut forecast error.
Keep Building Your Skills
You have moved from understanding why sequences need special models to building one that actually forecasts real market data. The recurrence relation, unrolling, and backpropagation through time are the foundation everything else in this module rests on. Keep the mental picture of the hidden state flowing forward through the unrolled chain, because it is the same picture, just dressed up, that explains the more powerful cells you are about to meet.