Lesson 3 - Advanced RNN Architecture: GRU and LSTM

Welcome to Gated Recurrent Networks

In the previous lesson you built a SimpleRNN and watched it forecast the S&P 500 from a window of past values. It worked, but it also showed its limits: a plain recurrent cell struggles to hold onto information across many timesteps. This lesson explains exactly why that happens and introduces the two architectures that fix it, LSTM and GRU. You will see the gate equations that give these cells their memory, then put all three networks head to head on the same forecasting task.

By the end of this lesson, you will be able to:

Explain the vanishing-gradient problem and why it cripples plain RNNs on long sequences
Describe how a gated RNN uses gates to decide what to remember and what to forget
Write out the LSTM cell state with its forget, input, and output gates
Write out the GRU with its reset and update gates, and explain how it differs from LSTM
Build and compare SimpleRNN, GRU, and LSTM forecasters in Keras on the real S&P 500 series

You should have completed Lessons 1 and 2 of this module, be comfortable with NumPy and pandas, and know how to build a basic Sequential model in Keras. Let’s begin.

Why Simple RNNs Forget

A recurrent network processes a sequence one step at a time, carrying a hidden state $h_t$ forward from step to step. At each timestep it folds in the new input and updates its memory:

h_t = \tanh(W_x x_t + W_h h_{t-1} + b)

This looks elegant, and for short sequences it works. The trouble appears during training. To learn, the network propagates error gradients backward through every timestep, a process called backpropagation through time. Each step multiplies the gradient by the same recurrent weight matrix $W_h$ and by the derivative of the $\tanh$ activation, which is always less than or equal to 1.

The danger lies in compounding. Multiply a number smaller than 1 by itself fifty times and it collapses toward zero: $0.9^{50}$ is roughly 0.005. That is precisely what happens to the gradient as it travels back through a long sequence whenever the per-step factor (recurrent weight times $\tanh$ derivative) stays below 1 in magnitude. By the time the error signal reaches the early timesteps, it has effectively vanished. The network never learns how those early inputs should influence the final output. This is the vanishing-gradient problem, and it is the single biggest reason plain RNNs cannot capture long-range dependencies.
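The shrinking effect is easy to check numerically. The sketch below uses a scalar tanh RNN with made-up weights (`w_x`, `w_h`) and random inputs, all of which are illustrative assumptions and not values from the lesson. It accumulates the per-step factors $w_h (1 - h_t^2)$ that backpropagation through time multiplies together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})
w_x, w_h = 0.5, 0.9
steps = 50
inputs = rng.normal(size=steps)

h = 0.0
grad = 1.0      # running product of per-step chain-rule factors
history = []    # |product| after each step
for x_t in inputs:
    h = np.tanh(w_x * x_t + w_h * h)
    # Chain rule for one step: d h_t / d h_{t-1} = w_h * (1 - h_t^2)
    grad *= w_h * (1.0 - h ** 2)
    history.append(abs(grad))

print(f"Upper bound 0.9**{steps}:       {0.9 ** steps:.2e}")
print(f"Gradient factor after 10 steps: {history[9]:.2e}")
print(f"Gradient factor after {steps} steps: {history[-1]:.2e}")
```

Because the $\tanh$ derivative never exceeds 1, the accumulated factor can be no larger than $0.9^{50}$ here, and in practice it is smaller still.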

The flip side: exploding gradients

If the recurrent weights are large instead of small, repeated multiplication can blow the gradient up exponentially instead of shrinking it, producing wild, unstable updates. This is the exploding-gradient problem. It is usually easier to manage (you can clip gradients to a maximum size), but it comes from the same root cause: applying the same transformation over and over through a deep unrolled network.
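Clipping can be sketched in a few lines of NumPy. The helper name `clip_by_norm` and the example gradient are hypothetical; in Keras the same idea is usually reached through the optimizer's `clipnorm` argument, for example `tf.keras.optimizers.Adam(clipnorm=1.0)`.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is kept."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

exploded = np.array([300.0, -400.0])   # L2 norm is 500
clipped = clip_by_norm(exploded, max_norm=1.0)

print("clipped gradient:", clipped)
print("clipped norm:    ", np.linalg.norm(clipped))
```

The clipped vector points the same way as the original but has length 1, so one bad batch cannot throw the weights far off course.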

The consequence is concrete. A SimpleRNN can usually remember the last few timesteps, but anything further back fades. For a problem where the answer depends on context many steps earlier, the plain cell simply cannot hold the thread. The fix is to redesign the cell so that information can flow across many steps without being squashed at every one. That redesign is the gated RNN.

The Idea of a Gated Cell

A gate is just a small neural layer that outputs values between 0 and 1, produced by a sigmoid activation:

\sigma(z) = \frac{1}{1 + e^{-z}}

You can read a gate’s output as a set of dials. A value near 0 means “block this,” and a value near 1 means “let this through.” By multiplying a stream of information element-wise by a gate, the cell learns to selectively keep, update, or discard each piece of its memory at every timestep.
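A tiny NumPy sketch makes the dial picture concrete. The memory values and gate pre-activations below are made up for illustration; in a real cell the pre-activations come from learned weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

memory = np.array([2.0, -1.5, 0.7])          # hypothetical stored values
gate_logits = np.array([-10.0, 0.0, 10.0])   # hypothetical pre-activations

gate = sigmoid(gate_logits)   # close to 0, exactly 0.5, close to 1
gated = gate * memory         # element-wise: block, halve, pass through

print("gate: ", np.round(gate, 4))
print("gated:", np.round(gated, 4))
```

The first entry is almost erased, the second is halved, and the third passes through nearly unchanged.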

This is the key insight. Instead of forcing all information through a single $\tanh$ bottleneck at every step (which is what shrinks the gradient), a gated cell maintains a protected memory channel and uses gates to control what flows in and out of it. Because that channel can pass information forward almost unchanged, gradients can travel back across many timesteps without vanishing. Two famous designs implement this idea: the LSTM and the GRU.

Long Short-Term Memory (LSTM)

The Long Short-Term Memory cell, introduced in 1997, was the first widely successful gated RNN. Its defining feature is a separate cell state $C_t$, a memory channel that runs straight through the cell with only minor, gated edits at each step. Alongside the usual hidden state $h_t$, this cell state is what lets an LSTM remember things for a long time.

An LSTM uses three gates to manage that memory. At each timestep it looks at the new input $x_t$ and the previous hidden state $h_{t-1}$ and computes:

The forget gate decides what to erase from the old cell state:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

The input gate decides which new information to write, paired with a candidate update $\tilde{C}_t$:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)

The cell state is then updated by forgetting some old memory and adding some new:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

Finally, the output gate decides what part of the updated memory becomes the new hidden state:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

h_t = o_t \odot \tanh(C_t)

Here $\odot$ is element-wise multiplication. Notice the crucial line: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. When the forget gate stays near 1 and the input gate near 0, the cell state passes forward essentially unchanged. That additive, gated highway is what protects the gradient and gives the LSTM its long memory.
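The six equations above map almost line for line onto NumPy. The following is a minimal sketch of a single LSTM step, not how Keras implements it. The dictionary layout of `W` and `b`, the sizes, and the bias values are assumptions chosen to show the "forget near 1, input near 0" case.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W[k]: (units, units + features); b[k]: (units,)."""
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])       # input gate
    c_tilde = np.tanh(W["C"] @ concat + b["C"])   # candidate update
    c_t = f_t * c_prev + i_t * c_tilde            # new cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                      # new hidden state
    return h_t, c_t

units, features = 4, 1
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(units, units + features)) for k in "fiCo"}
b = {k: np.zeros(units) for k in "fiCo"}
b["f"] += 10.0   # push the forget gate toward 1 (illustrative choice)
b["i"] -= 10.0   # push the input gate toward 0 (illustrative choice)

c = rng.normal(size=units)
h = np.zeros(units)
c_start = c.copy()
for x in [0.3, -1.2, 0.8, 0.1, -0.5]:
    h, c = lstm_step(np.array([x]), h, c, W, b)

print("Largest change in cell state:", np.max(np.abs(c - c_start)))
```

With the gates pinned this way, the cell state barely moves over five steps, which is the protected-memory behavior the update equation is designed to allow.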

In Keras, all of that machinery is a single line. You never implement the gates by hand:

from tensorflow.keras import layers

# An LSTM layer with 32 units; Keras handles all three gates internally
layer = layers.LSTM(32)

Gated Recurrent Unit (GRU)

The Gated Recurrent Unit, proposed in 2014, is a streamlined alternative. It folds the LSTM’s ideas into a simpler design: there is no separate cell state, and it uses only two gates instead of three. Fewer gates means fewer parameters, less memory, and faster training, while keeping most of the long-range benefit.

A GRU computes an update gate and a reset gate:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)

r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)

The reset gate controls how much of the past hidden state feeds into the candidate state $\tilde{h}_t$:

\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)

The update gate then blends the old hidden state with the new candidate, deciding how much to keep versus refresh:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

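That single blending equation plays the role of the LSTM’s forget and input gates at once: $1 - z_t$ sets how much of the old state to keep, and $z_t$ sets how much of the new candidate to write. The GRU has no counterpart to the output gate, so its full hidden state is exposed at every step. When $z_t$ is near 0, the GRU carries $h_{t-1}$ forward almost untouched, the same protected-channel trick that defeats the vanishing gradient.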

# A GRU layer with 32 units; just two gates, fewer parameters than LSTM
layer = layers.GRU(32)
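For comparison with the Keras one-liner, here is a minimal NumPy sketch of one GRU step that follows the equations in this section. It is illustrative only and not the Keras implementation. The dictionary layout of `W` and `b`, the sizes, and the bias on the update gate are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU step. W[k]: (units, units + features); b[k]: (units,)."""
    concat = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    z_t = sigmoid(W["z"] @ concat + b["z"])                # update gate
    r_t = sigmoid(W["r"] @ concat + b["r"])                # reset gate
    reset_concat = np.concatenate([r_t * h_prev, x_t])     # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(W["h"] @ reset_concat + b["h"])      # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde            # blended hidden state

units, features = 4, 1
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(units, units + features)) for k in "zrh"}
b = {k: np.zeros(units) for k in "zrh"}
b["z"] -= 10.0   # push the update gate toward 0 (illustrative choice)

h = rng.uniform(-0.5, 0.5, size=units)
h_start = h.copy()
for x in [0.3, -1.2, 0.8, 0.1, -0.5]:
    h = gru_step(np.array([x]), h, W, b)

print("Largest change in hidden state:", np.max(np.abs(h - h_start)))
```

With $z_t$ held near 0 the hidden state is carried forward almost unchanged, matching the behavior described above.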

GRU or LSTM: which should you pick?

There is no universal winner. GRUs are lighter and often train faster, which helps on smaller datasets or tight compute budgets. LSTMs, with their dedicated cell state, sometimes edge ahead on very long sequences. The honest answer is to try both and let your validation metric decide, which is exactly what you will do in this lesson.
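The "lighter" claim can be made concrete by counting weights straight from the equations in this lesson: each gate or candidate needs one weight matrix of shape (units, units + features) plus one bias vector. The helper names below are hypothetical, and the counts follow the textbook formulas. Keras's default GRU (`reset_after=True`) keeps an extra bias vector per block, so `model.summary()` may report a slightly higher number for the GRU.

```python
def block_params(units, features):
    # One weight matrix (units x (units + features)) plus one bias vector
    return units * (units + features) + units

def simple_rnn_params(units, features):
    return block_params(units, features)        # a single tanh block

def gru_params(units, features):
    return 3 * block_params(units, features)    # z, r, and candidate h

def lstm_params(units, features):
    return 4 * block_params(units, features)    # f, i, o, and candidate C

for name, fn in [("SimpleRNN", simple_rnn_params),
                 ("GRU", gru_params),
                 ("LSTM", lstm_params)]:
    print(f"{name:<10}{fn(32, 1)}")
# SimpleRNN 1088
# GRU       3264
# LSTM      4352
```

At 32 units and one input feature, the GRU carries three quarters of the LSTM's recurrent parameters.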

Setting Up the S&P 500 Forecasting Task

To compare the three architectures fairly, you will train each one on the same problem with the same settings. You will reuse the S&P 500 monthly series from earlier in this module: predict the next month’s index level from a sliding window of the previous twelve months.

Start by loading the data.

import pandas as pd
import numpy as np

# download: https://datatweets.com/datasets/sp500_monthly.csv
df = pd.read_csv("sp500_monthly.csv", parse_dates=["date"])
df = df.sort_values("date").reset_index(drop=True)

print("Rows:", len(df))
print("Range:", df["date"].min().date(), "to", df["date"].max().date())
print("Price min/max:", round(df["close"].min(), 1), "/", round(df["close"].max(), 1))
# Output:
# Rows: 917
# Range: 1950-01-01 to 2026-05-01
# Price min/max: 16.9 / 7412.6

The series holds 917 monthly closing values stretching from January 1950 to May 2026, climbing from about 16.9 to a high near 7412.6. That enormous range is the heart of the challenge: a model that captures the long upward trend will forecast far more accurately than one that loses the thread.

Next, turn the flat series into supervised windows. Each input is twelve consecutive months, and the target is the thirteenth.

def make_windows(series, window=12):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])
        y.append(series[i + window])
    X = np.array(X).reshape(-1, window, 1)  # (samples, timesteps, features)
    y = np.array(y)
    return X, y

values = df["close"].values.astype("float32")
X, y = make_windows(values, window=12)

# Chronological split: train on the past, test on the most recent stretch
split = len(X) - 184
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print("Train windows:", X_train.shape)
print("Test windows: ", X_test.shape)
# Output:
# Train windows: (721, 12, 1)
# Test windows:  (184, 12, 1)

You now have 721 training windows and 184 test windows, each of shape (12, 1): twelve timesteps with one feature per step. The split is chronological, never shuffled. With time series you must train on the past and test on the future, otherwise the model peeks at information it could not have had in real life.
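To see exactly what the windowing produces, it helps to run it on a series small enough to read by eye. The sketch below repeats the `make_windows` logic on a made-up 15-value series (`toy` is purely illustrative).

```python
import numpy as np

def make_windows(series, window=12):
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])   # the inputs: `window` consecutive values
        y.append(series[i + window])       # the target: the value right after them
    return np.array(X).reshape(-1, window, 1), np.array(y)

toy = np.arange(15, dtype="float32")       # 0, 1, ..., 14
X_toy, y_toy = make_windows(toy, window=12)

print(X_toy.shape)       # (3, 12, 1)
print(X_toy[0, :, 0])    # first window: values 0 through 11
print(y_toy)             # [12. 13. 14.]
```

Fifteen values yield three windows, because each window needs twelve inputs plus one target. The same arithmetic gives 917 - 12 = 905 windows on the real series, split 721 and 184.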

Never shuffle a time series before splitting

The standard train_test_split with random shuffling is fine for independent observations, but it is a serious leak for sequential data. If a future window lands in your training set, the model effectively sees the answer before it forecasts it, and your test score becomes meaningless. Always split time series by time.

Building the Three Models

Because Keras exposes SimpleRNN, GRU, and LSTM through the same interface, you can define all three with one helper function and swap only the recurrent layer. Every other detail stays identical so the comparison is fair.

import tensorflow as tf
from tensorflow.keras import layers

def build_model(recurrent_layer):
    model = tf.keras.Sequential([
        recurrent_layer,                  # the only line that changes
        layers.Dense(16, activation="relu"),
        layers.Dense(1),                  # regression: predict one value
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Same number of units (32) and same input shape for all three
rnn_model = build_model(layers.SimpleRNN(32, input_shape=(12, 1)))
gru_model = build_model(layers.GRU(32, input_shape=(12, 1)))
lstm_model = build_model(layers.LSTM(32, input_shape=(12, 1)))

Each network has a 32-unit recurrent layer, a 16-unit ReLU hidden layer, and a single linear output for the regression target. The optimizer (adam) and loss (mean squared error) are the same throughout. The only difference between the three is the cell type, which is exactly the variable you want to isolate.

Now train each model for the same number of epochs.

EPOCHS = 40

rnn_model.fit(X_train, y_train, epochs=EPOCHS, verbose=0)
gru_model.fit(X_train, y_train, epochs=EPOCHS, verbose=0)
lstm_model.fit(X_train, y_train, epochs=EPOCHS, verbose=0)

print("Training complete for all three models.")
# Output: Training complete for all three models.

Forty epochs each, same data, same architecture apart from the cell. Whatever differences you see in the results come from the recurrent layer alone.

Comparing Performance

To score each model you will use root mean squared error (RMSE), which reports the typical forecast error in the same units as the index itself (points). Lower is better.

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

Compute it on the held-out test set for all three models.

def rmse(model, X, y):
    preds = model.predict(X, verbose=0).ravel()
    return np.sqrt(np.mean((y - preds) ** 2))

print(f"SimpleRNN  test RMSE: {rmse(rnn_model,  X_test, y_test):.1f}")
print(f"GRU        test RMSE: {rmse(gru_model,  X_test, y_test):.1f}")
print(f"LSTM       test RMSE: {rmse(lstm_model, X_test, y_test):.1f}")
# Output:
# SimpleRNN  test RMSE: 917.1
# GRU        test RMSE: 568.9
# LSTM       test RMSE: 423.4

The pattern is dramatic. Moving from the plain SimpleRNN to a GRU cuts the typical error from about 917 points to 569, a reduction of roughly 38 percent, and the LSTM drives it down further to 423, less than half of the SimpleRNN's error. Both gated cells beat the plain RNN by a wide margin on the same task with the same training budget.

Figure: Bar chart comparing test RMSE of SimpleRNN, GRU, and LSTM on the S&P 500 forecasting task. Gated cells substantially reduce forecasting error: the GRU cuts the SimpleRNN's RMSE by about 38 percent and the LSTM by more than half.

You can confirm the same story with mean absolute error (MAE), the average absolute gap between prediction and truth, which is less sensitive to large outliers than RMSE.

def mae(model, X, y):
    preds = model.predict(X, verbose=0).ravel()
    return np.mean(np.abs(y - preds))

print(f"SimpleRNN  test MAE: {mae(rnn_model,  X_test, y_test):.1f}")
print(f"GRU        test MAE: {mae(gru_model,  X_test, y_test):.1f}")
print(f"LSTM       test MAE: {mae(lstm_model, X_test, y_test):.1f}")
# Output:
# SimpleRNN  test MAE: 578.0
# GRU        test MAE: 377.4
# LSTM       test MAE: 293.5

MAE gives the same ranking: SimpleRNN worst at 578, GRU in the middle at 377, LSTM best at 294. Both metrics agree, so the ordering is not an artifact of one scoring choice.
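The difference in outlier sensitivity between the two metrics shows up clearly on toy numbers. In the sketch below, `rmse_score` and `mae_score` are hypothetical helpers that take arrays directly (unlike the lesson's model-based `rmse` and `mae`), and the two error patterns are invented for illustration.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae_score(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.zeros(4)
even_misses = np.array([1.0, 1.0, 1.0, 1.0])    # four small errors
one_big_miss = np.array([0.0, 0.0, 0.0, 4.0])   # one large error, same total

for name, preds in [("even misses", even_misses), ("one big miss", one_big_miss)]:
    print(f"{name:<13} MAE={mae_score(y_true, preds):.1f}  "
          f"RMSE={rmse_score(y_true, preds):.1f}")
# even misses   MAE=1.0  RMSE=1.0
# one big miss  MAE=1.0  RMSE=2.0
```

Both patterns have the same MAE, but squaring the errors makes RMSE double when the total error is concentrated in a single large miss.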

Why the gap is so large here

The S&P 500 has a strong, persistent upward trend, so the most useful signal often lives several months back in the window. The SimpleRNN’s short memory is exactly the wrong tool for that. The gated cells, which can carry information across the full twelve-step window, exploit that long-range structure and pull far ahead. On a problem dominated by very short-term noise, the gap would be smaller.

A small caveat worth internalizing: neural-network training involves random weight initialization, so your exact numbers may shift from run to run. What usually stays stable is the ordering. Gated cells beat the plain RNN, and that is the durable lesson, not any single decimal.
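Another reference point that fits this comparison is a persistence forecast, which predicts that next month's value equals the last value in the window. It needs no training, so it is fully deterministic. The helper `persistence_rmse` and the synthetic ramp below are assumptions for illustration. No result for the real S&P 500 series is claimed here.

```python
import numpy as np

def persistence_rmse(X, y):
    """RMSE of predicting each target with the last value of its window."""
    preds = X[:, -1, 0]
    return np.sqrt(np.mean((y - preds) ** 2))

# Synthetic stand-in series: a ramp that rises by exactly 2 per step
series = 2.0 * np.arange(40, dtype="float32")
window = 12
X_demo = np.array(
    [series[i : i + window] for i in range(len(series) - window)]
).reshape(-1, window, 1)
y_demo = series[window:]

print(persistence_rmse(X_demo, y_demo))   # 2.0
```

On the lesson's arrays the call would be `persistence_rmse(X_test, y_test)`. Setting that number beside the three network scores shows how much each model adds over simply repeating the last observation.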

Practice Exercises

Try these before checking the hints. Reuse the X_train, y_train, X_test, y_test, and build_model definitions from the lesson.

Exercise 1: Identify the Gates

Without running any code, write down which gates belong to an LSTM and which belong to a GRU, and state in one sentence what role the LSTM’s cell state plays that the GRU has no direct equivalent for.

Hint

An LSTM has three gates (forget, input, output) plus a dedicated cell state $C_t$ that acts as a long-lived memory channel. A GRU has two gates (reset, update) and folds memory into its single hidden state, so it has no separate cell state.

Exercise 2: Shrink the GRU and Watch the Error

Build a new GRU model with only 8 units instead of 32, train it for 40 epochs, and print its test RMSE. Does cutting the capacity hurt the forecast?

# Your code here (reuse build_model, X_train, y_train, X_test, y_test)

Hint

Call build_model(layers.GRU(8, input_shape=(12, 1))), then .fit(X_train, y_train, epochs=40, verbose=0), then reuse the rmse helper from the lesson. Expect the error to rise compared to the 32-unit GRU’s 568.9, since fewer units mean less capacity to model the trend.

Exercise 3: Stack Two LSTM Layers

A single recurrent layer is not your only option. Build a model with two stacked LSTM layers (the first must pass its full sequence forward), train it, and compare its test RMSE to the single-layer LSTM’s 423.4.

# Your code here

Hint

The first LSTM needs return_sequences=True so it outputs a value at every timestep for the second LSTM to consume: layers.LSTM(32, return_sequences=True, input_shape=(12, 1)) followed by layers.LSTM(16), then your Dense layers. Stacking adds capacity but does not always lower error, so compare honestly against the single-layer result.

Summary

You now understand why plain RNNs fall short on long sequences and how gated cells solve the problem, and you have proven it on real data. Let’s review.

Key Concepts

The Vanishing-Gradient Problem

Backpropagation through time repeatedly multiplies the gradient by the recurrent weights and the $\tanh$ derivative
Repeated multiplication by values below 1 shrinks the gradient toward zero, so early timesteps stop learning
This is why a SimpleRNN cannot capture long-range dependencies; the mirror problem, exploding gradients, comes from the same repeated multiplication

Gated Cells

A gate is a sigmoid layer outputting values in $[0, 1]$ that act as dials to keep, update, or discard information
A protected, gated memory channel lets information (and gradients) flow across many steps without being squashed

LSTM

Maintains a separate cell state $C_t$ as a long-term memory highway
Uses three gates: forget ($f_t$), input ($i_t$), and output ($o_t$)
The additive update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ is what preserves long-range information

GRU

A simpler design with no separate cell state and only two gates: reset ($r_t$) and update ($z_t$)
The blend $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ covers the roles of LSTM’s forget and input gates in one step, with fewer parameters
Often trains faster than LSTM; relative accuracy depends on the problem

The Keras Pattern

layers.SimpleRNN, layers.GRU, and layers.LSTM share one interface, so you swap a single line to change cells
Split time series chronologically, never with random shuffling
Score regression forecasts with RMSE and MAE, both in the units of the target

Why This Matters

The numbers you saw point to a real architectural advantage. Gated cells cut the S&P 500 forecast error sharply (SimpleRNN 917.1, GRU 568.9, LSTM 423.4), the GRU by about 38 percent and the LSTM by more than half, because they can hold context across the whole input window while the plain RNN forgets. That ability to remember the relevant past is what makes LSTM and GRU the default choice for sequence problems in language, audio, and finance.

Just as important is the habit you practiced: always compare a more complex model against a simpler baseline. The SimpleRNN was your baseline, and only by measuring against it could you confirm that the extra machinery of a gated cell actually earned its keep. That discipline keeps you from reaching for complexity you do not need, and it is exactly how professionals validate every modeling choice.

Next Steps

You have seen how gated cells give recurrent networks real memory. Next, you will add a convolutional front end to extract local patterns before the recurrent layer ever sees the sequence, combining the strengths of both.

Continue to Lesson 4 - Adding Convolutional Layers to RNNs

Learn how a 1D convolution can extract local features and boost a recurrent forecaster.

Back to Module Overview

Return to the Sequence Models module overview.

Keep Building Your Skills

You have crossed an important threshold: you now understand not just how to call an LSTM or GRU, but why they exist and what their gates actually do. That understanding is what lets you reason about a model instead of guessing at it. Keep the comparison mindset close as you continue: every new architecture you meet should be measured against a simpler one, and the gate equations you learned here will keep showing up as sequence models grow more sophisticated.
