Lesson 4 - Backpropagation

Welcome to Backpropagation

In the previous lesson you trained a network with gradient descent: you nudged each weight a little in the direction that lowers the loss. But that step assumed you already had the gradient, the partial derivative of the loss with respect to every single weight. This lesson opens that box. You will learn backpropagation, the algorithm that computes all of those gradients efficiently in one backward sweep through the network.

By the end of this lesson, you will be able to:

Explain what a gradient is and why every weight needs its own partial derivative
Describe the forward pass and why it must cache activations for the backward pass
Apply the chain rule to send the loss gradient backward through a layer
Derive the gradients for a 2-layer network by hand: dz2, dW2, da1, dz1, and dW1
Implement the full backward pass in NumPy and check it against a numerical gradient

You should be comfortable with NumPy, matrix multiplication, and the forward pass of a small network from the earlier lessons in this module. A little calculus (the chain rule) helps, and we reintroduce it as we go. Let’s begin.

Why Gradients Are Hard

Training a neural network means minimizing a loss. To minimize anything with gradient descent you need its gradient: how the loss changes as you wiggle each parameter. A small network already has hundreds of weights, and a large one has billions. Computing each partial derivative independently, by perturbing one weight at a time and re-running the whole network, would be hopelessly slow.

Backpropagation is the trick that makes this fast. It computes all the gradients in a single backward pass that costs about the same as one forward pass. The key insight is that the network is just a long chain of simple operations, and the chain rule lets you reuse work as you move backward through that chain.
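To put rough numbers on that comparison, here is a small sketch that counts the parameters of the 8-16-1 network used later in this lesson and the forward passes a perturb-and-rerun estimate would need. The helper name `count_params` is an illustrative assumption, and the count assumes the central-difference scheme (two forward passes per parameter) that the gradient-checking section uses.

```python
def count_params(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

n_params = count_params([8, 16, 1])
# A central difference perturbs each parameter up and down: two forward passes each.
passes_numeric = 2 * n_params

print("parameters:", n_params)
print("forward passes for a numerical gradient:", passes_numeric)
# Output:
# parameters: 161
# forward passes for a numerical gradient: 322
```

Backpropagation obtains the same 161 partial derivatives from one forward pass and one backward pass.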

Think of the network as a recipe. The forward pass runs the recipe top to bottom: multiply by weights, add biases, apply activations, and finally compute the loss. The backward pass runs the recipe in reverse, asking at each step, “how much did this operation contribute to the final error?” and handing that blame back to the operation before it.

Figure: Backpropagation. A forward pass flows left to right through the computational graph to make a prediction, then a backward pass flows right to left, carrying gradients through each operation by the chain rule.

That picture, a forward flow of values and a backward flow of gradients, is the whole idea. Everything else is bookkeeping with the chain rule.

The Computational Graph

To reason about gradients cleanly, it helps to see the network not as “layers” but as a sequence of tiny operations, each of which you can differentiate on its own. This is the computational graph: a list of the elementary operations performed on the input, in order.

Our 2-layer classifier performs these operations on a batch of inputs $X$:

z_1 = X W_1 + b_1

a_1 = \text{ReLU}(z_1)

z_2 = a_1 W_2 + b_2

\hat{y} = \sigma(z_2)

Here $W_1, b_1$ are the first layer’s weights and bias, ReLU is the hidden activation, $W_2, b_2$ are the second layer’s parameters, and $\sigma$ is the sigmoid that turns the final score into a probability. The loss $L$ compares $\hat{y}$ to the true labels.

The reason we lay it out this way is simple: if you know the derivative of each individual operation, the chain rule says you can multiply them together to get the derivative of the whole thing. Backpropagation walks this list in reverse, multiplying one local derivative at a time.
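As a toy illustration of "walk the list in reverse", the sketch below stores a scalar graph as a Python list of operations, each paired with its local derivative. The graph itself (scale, shift, tanh) and the names `ops`, `forward`, and `backward` are assumptions chosen for brevity; they are not the lesson's network.

```python
import math

# Each entry: (name, forward function, local derivative given the op's input).
# Toy scalar graph: x -> 3 * x -> (+ 1) -> tanh
ops = [
    ("scale", lambda v: 3.0 * v, lambda v: 3.0),
    ("shift", lambda v: v + 1.0, lambda v: 1.0),
    ("tanh", math.tanh, lambda v: 1.0 - math.tanh(v) ** 2),
]

def forward(x):
    """Run the ops in order, caching the input each op received."""
    cache = []
    for _, fn, _ in ops:
        cache.append(x)
        x = fn(x)
    return x, cache

def backward(cache):
    """Walk the list in reverse, multiplying one local derivative at a time."""
    grad = 1.0  # derivative of the output with respect to itself
    for (_, _, local), v in zip(reversed(ops), reversed(cache)):
        grad *= local(v)
    return grad

x0 = 0.2
out, cache = forward(x0)
analytic = backward(cache)

# Central finite difference as an independent check
h = 1e-6
numeric = (forward(x0 + h)[0] - forward(x0 - h)[0]) / (2 * h)
print(abs(analytic - numeric) < 1e-6)
# Output:
# True
```

The backward function never differentiates the whole expression at once; it only needs each operation's local derivative and the cached input.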

Frameworks build this graph for you

PyTorch and TensorFlow record exactly this graph as you run the forward pass, then replay it backward automatically. Building the backward pass by hand once, as you will here, demystifies what a call such as PyTorch's loss.backward() actually does under the hood.

The Chain Rule, Quickly

The chain rule is the engine of backpropagation, so it is worth restating in the form we need. If a value $c$ depends on $b$, and $b$ depends on $a$, then the rate of change of $c$ with respect to $a$ is the product of the local rates:

\frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a}

In a network, the loss $L$ sits at the far right and an early weight like $W_1$ sits at the far left. To get $\frac{\partial L}{\partial W_1}$ you multiply the local derivatives of every operation between them:

\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}

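In the network the backward pass runs right to left, but in this product its steps appear left to right: it starts with $\frac{\partial L}{\partial z_2}$, the factor closest to the loss, and works toward $W_1$. The clever part is that this leftmost factor is shared with the gradient for $W_2$, so you compute it once and reuse it. Backpropagation is the chain rule plus this reuse.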

We will name each gradient after the quantity it belongs to, not the step it comes from. So dz2 means $\frac{\partial L}{\partial z_2}$, the gradient of the loss with respect to $z_2$. This naming keeps the code readable.
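The reuse and the naming convention can be previewed on a scalar version of the network with one unit per layer and a single example. This is a sketch under stated assumptions: the numeric values are made up for illustration, and it borrows the output gradient `dz2 = y_hat - y` that Step 1 of the backward pass explains later.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative scalar values (assumptions, not taken from the lesson's data)
x, y = 1.5, 1.0
b1, b2 = 0.1, -0.2
w1, w2 = 0.4, -0.7

def loss_fn(w1, w2):
    """Forward pass of the scalar network, returning the cross-entropy loss."""
    z1 = x * w1 + b1
    a1 = max(z1, 0.0)
    z2 = a1 * w2 + b2
    y_hat = sigmoid(z2)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Forward pass, keeping the intermediates
z1 = x * w1 + b1
a1 = max(z1, 0.0)
z2 = a1 * w2 + b2
y_hat = sigmoid(z2)

# Backward pass: dz2 is computed once and feeds both weight gradients
dz2 = y_hat - y                        # dL/dz2
dw2 = dz2 * a1                         # dL/dw2 = dL/dz2 * dz2/dw2
da1 = dz2 * w2                         # dL/da1 = dL/dz2 * dz2/da1
dz1 = da1 * (1.0 if z1 > 0 else 0.0)   # dL/dz1 = dL/da1 * da1/dz1
dw1 = dz1 * x                          # dL/dw1 = dL/dz1 * dz1/dw1

# Central finite differences as an independent check
h = 1e-6
num_dw1 = (loss_fn(w1 + h, w2) - loss_fn(w1 - h, w2)) / (2 * h)
num_dw2 = (loss_fn(w1, w2 + h) - loss_fn(w1, w2 - h)) / (2 * h)
print(abs(dw1 - num_dw1) < 1e-6, abs(dw2 - num_dw2) < 1e-6)
# Output:
# True True
```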

Setting Up the Data

To make this concrete you will run a real forward and backward pass on the Pima Indians Diabetes dataset. Each row describes a patient with eight medical measurements, and the target is whether the patient was diagnosed with diabetes. It is a binary classification problem, a perfect fit for the sigmoid output above.

import numpy as np
import pandas as pd

# download: https://datatweets.com/datasets/diabetes.csv
df = pd.read_csv("diabetes.csv")

print("shape:", df.shape)
print("outcome balance:", dict(df["Outcome"].value_counts().sort_index()))
# Output:
# shape: (768, 9)
# outcome balance: {0: 500, 1: 268}

The dataset has 768 patients and 9 columns: eight features and the Outcome label. About one third of the patients (268 of 768) have a positive diagnosis, so the classes are imbalanced but usable.

Split the columns into a feature matrix $X$ and a target vector $y$ , and standardize the features so every column has mean 0 and standard deviation 1. Standardizing matters here because the raw features (glucose in the hundreds, pedigree near zero) live on wildly different scales, and that makes gradients hard to balance.

X = df.drop(columns="Outcome").values.astype(float)
y = df["Outcome"].values.reshape(-1, 1).astype(float)

# Standardize each feature column: (x - mean) / std
X = (X - X.mean(axis=0)) / X.std(axis=0)

print("X shape:", X.shape)
print("y shape:", y.shape)
# Output:
# X shape: (768, 8)
# y shape: (768, 1)

The target is reshaped to a column vector of shape (768, 1) so it lines up with the network’s output. Keeping shapes explicit will save you from a lot of confusion in the backward pass.

The Forward Pass (and Why It Caches)

Before you can run anything backward, you run it forward. Initialize the two layers, then compute each operation in the graph in order. The hidden layer has 16 units, matching the network from the previous lesson.

np.random.seed(0)
n_features = X.shape[1]   # 8
n_hidden = 16

# Small random weights, zero biases
W1 = np.random.randn(n_features, n_hidden) * 0.1
b1 = np.zeros((1, n_hidden))
W2 = np.random.randn(n_hidden, 1) * 0.1
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0)

# Forward pass, caching every intermediate value
z1 = X @ W1 + b1          # pre-activation of layer 1
a1 = relu(z1)             # hidden activation
z2 = a1 @ W2 + b2         # pre-activation of layer 2
y_hat = sigmoid(z2)       # predicted probability

Notice that you keep z1, a1, z2, and y_hat around in variables. That is the cache, and it is not optional. The backward pass needs these forward values to compute its local derivatives. For example, the gradient of ReLU depends on the sign of z1, and the gradient of $z_2$ with respect to $W_2$ depends on a1. Throwing those away would force you to recompute the forward pass during the backward pass.
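One common way to make the cache explicit is to have the forward pass return it alongside the prediction. The sketch below is one possible arrangement, not the lesson's required structure: the `forward` function, the dictionary keys, and the `_demo` arrays (random stand-ins for the real data) are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Run the forward pass and return the prediction plus a cache dict."""
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)
    cache = {"X": X, "z1": z1, "a1": a1, "z2": z2, "y_hat": y_hat}
    return y_hat, cache

# Random stand-ins with the lesson's layer sizes (5 rows instead of 768)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5, 8))
W1_demo = rng.standard_normal((8, 16)) * 0.1
b1_demo = np.zeros((1, 16))
W2_demo = rng.standard_normal((16, 1)) * 0.1
b2_demo = np.zeros((1, 1))

y_hat_demo, cache_demo = forward(X_demo, W1_demo, b1_demo, W2_demo, b2_demo)
for name, value in cache_demo.items():
    print(name, value.shape)
# Output:
# X (5, 8)
# z1 (5, 16)
# a1 (5, 16)
# z2 (5, 1)
# y_hat (5, 1)
```

A backward function can then take `cache_demo` as its input, which makes it hard to mix forward values from two different parameter settings by accident.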

The loss for binary classification is binary cross-entropy, averaged over the batch of $N$ examples:

L = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \Big]

N = X.shape[0]
eps = 1e-8  # avoid log(0)
loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

print(f"initial loss: {loss:.4f}")
# Output:
# initial loss: 0.6931

With small random weights the network outputs roughly 0.5 for everyone, so the loss starts near $-\log(0.5) \approx 0.6931$ . That is exactly what an untrained binary classifier should look like, and it confirms the forward pass is wired up correctly.

A loss near 0.693 is a sanity check

For binary cross-entropy, a freshly initialized network that predicts close to 0.5 for every example should report a loss near $\ln 2 \approx 0.6931$ . If your starting loss is wildly different, suspect a bug in the forward pass or the loss before you ever touch backpropagation.
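The baseline is easy to confirm in isolation. In this short sketch the helper `bce` and the four labels are illustrative assumptions; it evaluates the loss when every prediction is exactly 0.5.

```python
import numpy as np

def bce(y, p, eps=1e-8):
    """Binary cross-entropy averaged over the batch."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

y_demo = np.array([[0.0], [1.0], [1.0], [0.0]])  # arbitrary labels
p_half = np.full_like(y_demo, 0.5)               # a prediction of 0.5 for everyone

print(f"{bce(y_demo, p_half):.4f} vs ln 2 = {np.log(2):.4f}")
# Output:
# 0.6931 vs ln 2 = 0.6931
```

The result does not depend on the labels: when the prediction is 0.5, both terms of the loss contribute $-\log 0.5$.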

The Backward Pass, Step by Step

Now the heart of the lesson. You will compute the gradient of the loss with respect to every parameter by walking the graph in reverse. Each step uses the chain rule to combine the gradient flowing in from the right with the local derivative of one operation.

Step 1: Gradient at the Output, `dz2`

The first step combines two operations at once: the sigmoid and the cross-entropy loss. Computing them separately is fiddly, but together they collapse into something beautiful. The gradient of the loss with respect to the pre-activation $z_2$ is just the prediction minus the truth:

\frac{\partial L}{\partial z_2} = \frac{1}{N}(\hat{y} - y)

The $\frac{1}{N}$ appears because the loss is an average over the batch. This is the same “cancellation” that makes softmax with cross-entropy reduce to $p - y$ ; for the sigmoid with binary cross-entropy you get the identical clean form.

dz2 = (y_hat - y) / N        # shape (768, 1)
print("dz2 shape:", dz2.shape)
# Output:
# dz2 shape: (768, 1)

This dz2 is the error signal entering the backward pass. Everything downstream is this signal multiplied by local derivatives.
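For readers who want to see where the cancellation comes from, here is a short derivation for a single example. It uses the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$; the per-example loss symbol $\ell$ is notation introduced only for this derivation.

```latex
\ell = -\big[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\big], \qquad \hat{y} = \sigma(z_2)

\frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}

\frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1 - \hat{y})

\frac{\partial \ell}{\partial z_2} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) = \hat{y} - y
```

The sigmoid's derivative cancels the denominator exactly. Because $L$ is the average of $\ell$ over $N$ examples, each row of the batch gradient picks up the factor $\frac{1}{N}$.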

Step 2: Gradients of the Second Layer, `dW2` and `db2`

The second layer computed $z_2 = a_1 W_2 + b_2$ . For a matrix multiply $z_2 = a_1 W_2$ , the local derivative tells you that the gradient on $W_2$ is the cached input transposed, times the incoming gradient:

\frac{\partial L}{\partial W_2} = a_1^{\top} \frac{\partial L}{\partial z_2}

The bias was simply added to every row, so its gradient is the incoming gradient summed over the batch, where $z_2^{(i)}$ denotes row $i$ of $z_2$:

\frac{\partial L}{\partial b_2} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_2^{(i)}}

dW2 = a1.T @ dz2                       # shape (16, 1)
db2 = np.sum(dz2, axis=0, keepdims=True)  # shape (1, 1)

print("dW2 shape:", dW2.shape)
print("db2 shape:", db2.shape)
# Output:
# dW2 shape: (16, 1)
# db2 shape: (1, 1)

A useful rule of thumb: a gradient always has the same shape as the thing it is the gradient for. dW2 matches W2 at (16, 1), and db2 matches b2 at (1, 1). If a shape ever fails to line up, you have a transpose or a missing sum somewhere.
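That rule of thumb is easy to automate. The helper below is a hypothetical convenience (`check_grad_shapes` is not part of the lesson's code) that compares each gradient's shape with its parameter's shape; the arrays are stand-ins filled with ones and zeros.

```python
import numpy as np

def check_grad_shapes(params, grads):
    """Raise ValueError if any gradient's shape differs from its parameter's."""
    for name, value in params.items():
        g = grads["d" + name]
        if g.shape != value.shape:
            raise ValueError(f"d{name} has shape {g.shape}, expected {value.shape}")

# Stand-in arrays with the lesson's layer-2 shapes
n_rows = 768
a1_demo = np.ones((n_rows, 16))
dz2_demo = np.ones((n_rows, 1))
params = {"W2": np.zeros((16, 1)), "b2": np.zeros((1, 1))}

good = {"dW2": a1_demo.T @ dz2_demo,
        "db2": np.sum(dz2_demo, axis=0, keepdims=True)}
check_grad_shapes(params, good)  # passes silently

bad = {"dW2": a1_demo.T @ dz2_demo,
       "db2": dz2_demo}          # forgot to sum over the batch
try:
    check_grad_shapes(params, bad)
except ValueError as err:
    print(err)
# Output:
# db2 has shape (768, 1), expected (1, 1)
```

An explicit check like this matters because NumPy broadcasting would otherwise let a mis-shaped `db2` update `b2` without raising an error.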

Step 3: Push the Gradient Back to the Hidden Layer, `da1`

To keep going you need the gradient with respect to the hidden activation $a_1$ , because $a_1$ is what the first layer produced. The matrix multiply sends the gradient backward through the weights:

\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_2} W_2^{\top}

da1 = dz2 @ W2.T            # shape (768, 16)
print("da1 shape:", da1.shape)
# Output:
# da1 shape: (768, 16)

This is the moment the error “propagates back” a layer. The blame for the loss, expressed at the output, is now expressed at the hidden activations. Note how dW2 (a gradient on a parameter) uses $a_1^\top$ , while da1 (a gradient flowing further back) uses $W_2^\top$ . Both come from the same matrix-multiply rule, just for different inputs.
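Both matrix-multiply rules can be verified numerically on a tiny example. This sketch makes several assumptions: `A`, `W`, and `G` are random stand-ins for `a1`, `W2`, and `dz2`, the scalar loss is constructed so that its gradient with respect to `Z = A @ W` is exactly `G`, and `numeric_grad` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # plays the role of a1
W = rng.standard_normal((3, 2))   # plays the role of W2
G = rng.standard_normal((4, 2))   # an arbitrary incoming gradient dL/dZ

def scalar_loss():
    # With Z = A @ W, the gradient of sum(Z * G) with respect to Z is G
    return np.sum((A @ W) * G)

dW = A.T @ G   # gradient on the parameter: input transposed, times incoming
dA = G @ W.T   # gradient flowing back: incoming, times weights transposed

def numeric_grad(f, M, h=1e-6):
    """Central-difference gradient of the scalar f() with respect to array M."""
    grad = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        old = M[idx]
        M[idx] = old + h
        plus = f()
        M[idx] = old - h
        minus = f()
        M[idx] = old  # restore the entry before moving on
        grad[idx] = (plus - minus) / (2 * h)
    return grad

num_dW = numeric_grad(scalar_loss, W)
num_dA = numeric_grad(scalar_loss, A)
print(np.allclose(dW, num_dW, atol=1e-6), np.allclose(dA, num_dA, atol=1e-6))
# Output:
# True True
```

Note the shapes: `dW` is (3, 2) like `W`, and `dA` is (4, 3) like `A`, which is the only way the two transposes can be arranged so the products are defined.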

Step 4: Through the ReLU, `dz1`

The hidden activation came from a ReLU: $a_1 = \max(z_1, 0)$ . ReLU passes the gradient straight through wherever its input was positive, and blocks it (sets it to zero) wherever its input was negative or zero. Its local derivative is 1 for positive inputs and 0 otherwise:

\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \odot \mathbb{1}[z_1 > 0]

where $\odot$ is element-wise multiplication and $\mathbb{1}[z_1 > 0]$ is 1 where $z_1$ was positive. This is exactly why you cached z1: you need its sign now.

dz1 = da1 * (z1 > 0)       # shape (768, 16)
print("dz1 shape:", dz1.shape)
# Output:
# dz1 shape: (768, 16)

Any hidden unit that was “off” in the forward pass (a $z_1$ that was zero or negative) receives no gradient. That is the famous gating behavior of ReLU, and it falls right out of the chain rule.
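A two-by-two example makes the gating concrete. The numbers below are invented for illustration; the `_demo` suffix keeps them separate from the lesson's real arrays.

```python
import numpy as np

z1_demo = np.array([[-1.0, 2.0],
                    [0.0, 3.0]])      # cached pre-activations
da1_demo = np.array([[10.0, 20.0],
                     [30.0, 40.0]])   # gradient arriving from the next layer

mask = z1_demo > 0                    # True where the unit was "on"
dz1_demo = da1_demo * mask            # booleans act as 0 and 1 in the product

print(mask.tolist())
print(dz1_demo.tolist())
# Output:
# [[False, True], [False, True]]
# [[0.0, 20.0], [0.0, 40.0]]
```

The first column had non-positive inputs, so its gradient is zeroed; the second column passes through unchanged.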

Step 5: Gradients of the First Layer, `dW1` and `db1`

The first layer is structurally identical to the second, so the formulas are the same, with $X$ playing the role that $a_1$ played before and $z_1^{(i)}$ denoting row $i$ of $z_1$:

\frac{\partial L}{\partial W_1} = X^{\top} \frac{\partial L}{\partial z_1}, \qquad \frac{\partial L}{\partial b_1} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_1^{(i)}}

dW1 = X.T @ dz1                       # shape (8, 16)
db1 = np.sum(dz1, axis=0, keepdims=True)  # shape (1, 16)

print("dW1 shape:", dW1.shape)
print("db1 shape:", db1.shape)
# Output:
# dW1 shape: (8, 16)
# db1 shape: (1, 16)

And that is the full backward pass. Starting from one scalar loss, you produced a gradient for every parameter, dW1, db1, dW2, db2, in five short steps, each one a local derivative times an incoming gradient.

Why it is efficient

Look at how dz2 was computed once and then reused for both dW2 and da1, and how dz1 was reused for both dW1 and (if there were more layers) the next gradient back. Backpropagation never recomputes a shared factor. That reuse is precisely why one backward pass costs about the same as one forward pass, no matter how many parameters you have.

The Complete Backward Pass

Here is the entire forward and backward pass collected into one runnable block. This is the exact code that sits inside the training loop from the previous lesson; gradient descent simply subtracts a fraction of each gradient from its parameter.

import numpy as np
import pandas as pd

# 1. Data
df = pd.read_csv("diabetes.csv")  # download: https://datatweets.com/datasets/diabetes.csv
X = df.drop(columns="Outcome").values.astype(float)
y = df["Outcome"].values.reshape(-1, 1).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)
N = X.shape[0]

# 2. Parameters
np.random.seed(0)
W1 = np.random.randn(8, 16) * 0.1
b1 = np.zeros((1, 16))
W2 = np.random.randn(16, 1) * 0.1
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 3. Forward pass (cache z1, a1, z2, y_hat)
z1 = X @ W1 + b1
a1 = np.maximum(z1, 0)
z2 = a1 @ W2 + b2
y_hat = sigmoid(z2)

# 4. Backward pass
dz2 = (y_hat - y) / N            # output error
dW2 = a1.T @ dz2                 # layer-2 weight gradient
db2 = np.sum(dz2, axis=0, keepdims=True)
da1 = dz2 @ W2.T                 # push gradient back a layer
dz1 = da1 * (z1 > 0)            # through ReLU
dW1 = X.T @ dz1                  # layer-1 weight gradient
db1 = np.sum(dz1, axis=0, keepdims=True)

print("dW1:", dW1.shape, "| dW2:", dW2.shape)
# Output:
# dW1: (8, 16) | dW2: (16, 1)

Read top to bottom, the forward pass builds values; read the backward block top to bottom, and you are walking the graph right to left. The symmetry between the two layers is the same pattern repeated, which is exactly why this scales to networks with dozens of layers.
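To show how the block sits inside a training loop, here is a minimal sketch that repeats forward pass, backward pass, and update on synthetic data. Everything specific in it is an assumption for illustration: the function name `train_demo`, the random data and its labeling rule, the 300 steps, and the learning rate of 0.1. It does not reproduce the previous lesson's loop or any result on the diabetes dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_demo(X, y, n_hidden=16, lr=0.1, steps=300, seed=0):
    """Full-batch gradient descent on the 2-layer network; returns the loss history."""
    rng = np.random.default_rng(seed)
    N, n_features = X.shape
    W1 = rng.standard_normal((n_features, n_hidden)) * 0.1
    b1 = np.zeros((1, n_hidden))
    W2 = rng.standard_normal((n_hidden, 1)) * 0.1
    b2 = np.zeros((1, 1))
    losses = []
    for _ in range(steps):
        # Forward pass (cache z1, a1, y_hat)
        z1 = X @ W1 + b1
        a1 = np.maximum(z1, 0)
        z2 = a1 @ W2 + b2
        y_hat = sigmoid(z2)
        losses.append(-np.mean(y * np.log(y_hat + 1e-8)
                               + (1 - y) * np.log(1 - y_hat + 1e-8)))
        # Backward pass: the same seven lines as in the block above
        dz2 = (y_hat - y) / N
        dW2 = a1.T @ dz2
        db2 = np.sum(dz2, axis=0, keepdims=True)
        da1 = dz2 @ W2.T
        dz1 = da1 * (z1 > 0)
        dW1 = X.T @ dz1
        db1 = np.sum(dz1, axis=0, keepdims=True)
        # Update only after every gradient has been computed
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return losses

# Synthetic stand-in data: 200 rows, 8 features, labels from a simple linear rule
rng = np.random.default_rng(0)
X_s = rng.standard_normal((200, 8))
y_s = (X_s[:, :1] + 0.5 * X_s[:, 1:2] > 0).astype(float)

losses = train_demo(X_s, y_s)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The updates come last in each iteration on purpose: changing `W2` before `da1` is computed would mix old and new weights in the same backward pass.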

Checking the Gradient

How do you know the backward pass is correct? You compare it against a numerical gradient: nudge one weight by a tiny amount $\epsilon$ , measure how much the loss changes, and divide. By the definition of a derivative, this finite difference should match your analytic gradient:

\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}

def loss_for(W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)
    z2 = a1 @ W2 + b2
    p = sigmoid(z2)
    eps = 1e-8
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Numerically estimate the gradient of one weight in W2
eps = 1e-5
i, j = 0, 0
W2_plus = W2.copy();  W2_plus[i, j]  += eps
W2_minus = W2.copy(); W2_minus[i, j] -= eps
numeric = (loss_for(W1, b1, W2_plus, b2) - loss_for(W1, b1, W2_minus, b2)) / (2 * eps)

print("analytic dW2[0,0]:", round(float(dW2[i, j]), 6))
print("numeric  dW2[0,0]:", round(float(numeric), 6))
print("close?", np.allclose(dW2[i, j], numeric, atol=1e-6))
# Output (final line only; the two printed gradient values are omitted):
# close? True

When the analytic and numerical gradients agree, you can trust your backward pass. Gradient checking is slow (the central difference needs two extra forward passes for every weight it checks), so you only use it to verify code once, never during real training. But it is the single best way to catch a backpropagation bug.
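The same check extends to every parameter at once. The sketch below is one way to do it under stated assumptions: it runs on a small random batch rather than the real dataset, uses larger random weights so few ReLU inputs sit near the kink at zero, and the names `loss_and_grads`, `X_c`, `y_c`, and `max_err` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(params, X, y):
    """Forward and backward pass; returns the loss and [dW1, db1, dW2, db2]."""
    W1, b1, W2, b2 = params
    N = X.shape[0]
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)
    z2 = a1 @ W2 + b2
    y_hat = sigmoid(z2)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    dz2 = (y_hat - y) / N
    dW2 = a1.T @ dz2
    db2 = np.sum(dz2, axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    dW1 = X.T @ dz1
    db1 = np.sum(dz1, axis=0, keepdims=True)
    return loss, [dW1, db1, dW2, db2]

# Small random problem (an assumption; not the diabetes data)
rng = np.random.default_rng(0)
X_c = rng.standard_normal((20, 8))
y_c = rng.integers(0, 2, size=(20, 1)).astype(float)
params = [rng.standard_normal((8, 16)) * 0.5,
          rng.standard_normal((1, 16)) * 0.1,
          rng.standard_normal((16, 1)) * 0.5,
          rng.standard_normal((1, 1)) * 0.1]

_, grads = loss_and_grads(params, X_c, y_c)

h = 1e-6
max_err = 0.0
for P, dP in zip(params, grads):
    for idx in np.ndindex(P.shape):
        old = P[idx]
        P[idx] = old + h
        plus, _ = loss_and_grads(params, X_c, y_c)
        P[idx] = old - h
        minus, _ = loss_and_grads(params, X_c, y_c)
        P[idx] = old  # restore before checking the next entry
        numeric = (plus - minus) / (2 * h)
        max_err = max(max_err, abs(numeric - dP[idx]))

print("largest absolute difference:", max_err)
```

If a ReLU input happens to lie within `h` of zero, the finite difference straddles the kink and can disagree with the analytic value even when the code is correct, so an isolated mismatch is worth re-checking with a different `h` before hunting for a bug.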

Forgetting to cache is the classic bug

The most common backpropagation mistake is using the wrong cached value, or recomputing an activation with updated weights. Your backward pass must use the same z1, a1, and y_hat that the forward pass produced. If you update a weight before finishing the backward pass, every gradient downstream becomes wrong, and a gradient check will catch it.

Practice Exercises

Try these before checking the hints. They use the same diabetes.csv and the variables defined above.

Exercise 1: Gradient-Check `dW1`

The lesson gradient-checked one entry of dW2. Do the same for dW1[0, 0]: perturb that weight by $\pm \epsilon$ , estimate the numerical gradient, and confirm it matches your analytic dW1[0, 0].

# Your code here (reuse loss_for, X, y, and the cached forward values)

Hint

Copy W1 into W1_plus and W1_minus, add and subtract eps at index [0, 0], then call loss_for(W1_plus, b1, W2, b2) and loss_for(W1_minus, b1, W2, b2). Divide the difference by 2 * eps and compare to dW1[0, 0] with np.allclose(..., atol=1e-6). It should print True.

Exercise 2: Take One Gradient Descent Step

You now have every gradient. Update all four parameters with a learning rate of 0.1, re-run the forward pass, and confirm the loss went down from the initial 0.6931.

lr = 0.1
# Your code here: update W1, b1, W2, b2, then recompute the loss

Hint

Subtract the gradient times the learning rate from each parameter, for example W1 = W1 - lr * dW1, and do the same for b1, W2, b2. Then recompute z1, a1, z2, y_hat and the loss. After one step the loss should be slightly below 0.6931, confirming the gradients point downhill.

Exercise 3: Swap ReLU for No Activation

Remove the ReLU so the hidden pre-activation passes through unchanged (a1 = z1). Adjust only the backward pass and explain in a comment what happens to dz1.

# Forward: a1 = z1  (identity instead of ReLU)
# Your code here: recompute the backward pass for this change

Hint

With an identity activation, the local derivative is 1 everywhere, so dz1 = da1 (no masking by z1 > 0). Everything else stays the same. This shows that the ReLU mask is the only thing that gates the gradient; remove the nonlinearity and the two layers collapse into a single affine map feeding the sigmoid.
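The collapse can be checked directly. In this sketch the random arrays and the names `W_eff` and `b_eff` are assumptions introduced for illustration; it compares the two-layer computation without ReLU against a single layer built from the products of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
X_l = rng.standard_normal((6, 8))
W1_l, b1_l = rng.standard_normal((8, 16)), rng.standard_normal((1, 16))
W2_l, b2_l = rng.standard_normal((16, 1)), rng.standard_normal((1, 1))

# Two layers with an identity activation in between
z2_two_layers = (X_l @ W1_l + b1_l) @ W2_l + b2_l

# The same map written as one affine layer
W_eff = W1_l @ W2_l           # shape (8, 1)
b_eff = b1_l @ W2_l + b2_l    # shape (1, 1)
z2_one_layer = X_l @ W_eff + b_eff

print(np.allclose(z2_two_layers, z2_one_layer))
# Output:
# True
```

Without the nonlinearity, the 16 hidden units add parameters but no expressive power: the model is logistic regression on the 8 input features.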

Summary

You opened the box that gradient descent depends on. You now know exactly how a single loss value becomes a gradient for every weight in the network. Let’s review.

Key Concepts

The Big Idea

A network is a chain of simple operations, the computational graph
Backpropagation computes every gradient in one backward sweep, reusing shared factors so it costs about as much as one forward pass

Forward Pass

Runs the graph left to right: $z_1 = XW_1 + b_1$ , $a_1 = \text{ReLU}(z_1)$ , $z_2 = a_1 W_2 + b_2$ , $\hat{y} = \sigma(z_2)$
Must cache intermediate values (z1, a1, z2, y_hat) because the backward pass needs them
A freshly initialized binary classifier should report a loss near $\ln 2 \approx 0.6931$

The Chain Rule

$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_2}\cdot\frac{\partial z_2}{\partial a_1}\cdot\frac{\partial a_1}{\partial z_1}\cdot\frac{\partial z_1}{\partial W_1}$, with the factors written left to right in the order the backward pass computes them
Each gradient is an incoming gradient times one local derivative

Backward Pass (2-Layer Net)

dz2 = (y_hat - y) / N: sigmoid and cross-entropy collapse to prediction minus truth, scaled by 1/N
dW2 = a1.T @ dz2, db2 = np.sum(dz2, axis=0, keepdims=True): matrix-multiply and bias rules
da1 = dz2 @ W2.T: push the gradient back a layer
dz1 = da1 * (z1 > 0): ReLU gates the gradient by the cached sign
dW1 = X.T @ dz1, db1 = np.sum(dz1, axis=0, keepdims=True): same rules, with X as input

Checking Your Work

A gradient always has the same shape as the parameter it belongs to
Verify the backward pass with a numerical gradient (a finite difference); use it to debug, never during training

Why This Matters

Backpropagation is the algorithm that made modern deep learning possible. The major frameworks (PyTorch, TensorFlow, JAX) are, at their core, engines that apply the chain rule automatically to the graph of operations in your forward pass. When you call loss.backward() in PyTorch, the same kind of step-by-step computation you just wrote by hand runs automatically, for every layer.

Understanding it by hand pays off in practice. When training stalls because of vanishing or exploding gradients, dead ReLU units that never receive gradient, or shape mismatches that silently broadcast, you will recognize the cause because you know what flows backward and why. The next lesson builds directly on this: once you can compute gradients, the question becomes how to use them well, which is the job of optimizers.

Next Steps

You can now compute the gradient of every weight in a network. The natural next question is how to turn those gradients into good updates, faster and more reliably than plain gradient descent.

Continue to Lesson 5 - Optimizers: SGD, RMSprop, and Adam

Learn how modern optimizers use gradients to train networks faster and more stably.

Back to Module Overview

Return to the Deep Learning Foundations module overview.

Keep Building Your Skills

You just implemented the algorithm used to train virtually every modern neural network. The forward pass that caches and the backward pass that propagates error are not magic; they are the chain rule applied carefully, operation by operation. Hold on to the picture of a value flowing forward and a gradient flowing back, because every architecture you meet from here (deeper networks, convolutions, transformers) is just a bigger computational graph running the same forward and backward dance.
