Lesson 4 - Backpropagation
Welcome to Backpropagation
In the previous lesson you trained a network with gradient descent: you nudged each weight a little in the direction that lowers the loss. But that step assumed you already had the gradient, the partial derivative of the loss with respect to every single weight. This lesson opens that box. You will learn backpropagation, the algorithm that computes all of those gradients efficiently in one backward sweep through the network.
By the end of this lesson, you will be able to:
- Explain what a gradient is and why every weight needs its own
- Describe the forward pass and why it must cache activations for the backward pass
- Apply the chain rule to send the loss gradient backward through a layer
- Derive the gradients for a 2-layer network by hand: dz2, dW2, da1, dz1, and dW1
- Implement the full backward pass in NumPy and check it against a numerical gradient
You should be comfortable with NumPy, matrix multiplication, and the forward pass of a small network from the earlier lessons in this module. A little calculus (the chain rule) helps, and we reintroduce it as we go. Let’s begin.
Why Gradients Are Hard
Training a neural network means minimizing a loss. To minimize anything with gradient descent you need its gradient: how the loss changes as you wiggle each parameter. A small network already has hundreds of weights, and a large one has billions. Computing each partial derivative independently, by perturbing one weight at a time and re-running the whole network, would be hopelessly slow.
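To put a number on "hopelessly slow", the sketch below counts the forward passes a one-weight-at-a-time approach would need for the 8-16-1 network used later in this lesson. The helper name count_params is hypothetical, and the count assumes a central difference, which costs two forward passes per parameter.

```python
# Hypothetical helper: count the parameters of a fully connected network
# described by its layer sizes, e.g. [8, 16, 1].
def count_params(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weights
        total += n_out         # biases
    return total

n_params = count_params([8, 16, 1])

# Assumption: central differences, so two forward passes per parameter.
naive_forward_passes = 2 * n_params

print("parameters:", n_params)
print("forward passes for one naive gradient:", naive_forward_passes)
```

Even this tiny network needs hundreds of forward passes per gradient when done naively, and the count grows linearly with the number of parameters.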
Backpropagation is the trick that makes this fast. It computes all the gradients in a single backward pass that costs about the same as one forward pass. The key insight is that the network is just a long chain of simple operations, and the chain rule lets you reuse work as you move backward through that chain.
Think of the network as a recipe. The forward pass runs the recipe top to bottom: multiply by weights, add biases, apply activations, and finally compute the loss. The backward pass runs the recipe in reverse, asking at each step, “how much did this operation contribute to the final error?” and handing that blame back to the operation before it.
That picture, a forward flow of values and a backward flow of gradients, is the whole idea. Everything else is bookkeeping with the chain rule.
The Computational Graph
To reason about gradients cleanly, it helps to see the network not as “layers” but as a sequence of tiny operations, each of which you can differentiate on its own. This is the computational graph: a list of the elementary operations performed on the input, in order.
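Our 2-layer classifier performs these operations on a batch of inputs $X$:
$$z_1 = X W_1 + b_1$$
$$a_1 = \mathrm{ReLU}(z_1)$$
$$z_2 = a_1 W_2 + b_2$$
$$\hat{y} = \sigma(z_2)$$
$$L = \mathrm{BCE}(\hat{y}, y)$$
Here $W_1, b_1$ are the first layer’s weights and bias, ReLU is the hidden activation, $W_2, b_2$ are the second layer’s parameters, and $\sigma$ is the sigmoid that turns the final score into a probability. The loss $L$ compares $\hat{y}$ to the true labels $y$.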
The reason we lay it out this way is simple: if you know the derivative of each individual operation, the chain rule says you can multiply them together to get the derivative of the whole thing. Backpropagation walks this list in reverse, multiplying one local derivative at a time.
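A minimal sketch of that idea, using a made-up scalar chain rather than the lesson's network: each operation is stored with its own local derivative, the forward loop caches the input each operation saw, and the backward loop multiplies the local derivatives in reverse. The operations and the input value are illustrative assumptions.

```python
# Each entry: (name, forward function, local derivative w.r.t. its input).
# This particular chain is illustrative, not the lesson's network.
graph = [
    ("scale",  lambda v: 3.0 * v,      lambda v: 3.0),
    ("shift",  lambda v: v - 1.0,      lambda v: 1.0),
    ("relu",   lambda v: max(v, 0.0),  lambda v: 1.0 if v > 0 else 0.0),
    ("square", lambda v: v * v,        lambda v: 2.0 * v),
]

x = 2.0

# Forward: run the list in order, caching the input each operation saw.
cache = []
value = x
for name, forward, _ in graph:
    cache.append(value)
    value = forward(value)
output = value

# Backward: walk the list in reverse, multiplying one local derivative at a time.
grad = 1.0  # d(output)/d(output)
for (name, _, local_grad), cached_input in zip(reversed(graph), reversed(cache)):
    grad *= local_grad(cached_input)

print("output:", output)
print("d(output)/dx:", grad)
```

The function here is $(3x - 1)^2$, whose derivative $6(3x - 1)$ equals 30 at $x = 2$, matching what the reverse walk produces.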
Frameworks build this graph for you
PyTorch and TensorFlow record exactly this graph as you run the forward pass, then replay it backward automatically. Building the backward pass by hand once, as you will here, demystifies what loss.backward() actually does under the hood.
The Chain Rule, Quickly
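The chain rule is the engine of backpropagation, so it is worth restating in the form we need. If a value $L$ depends on $u$, and $u$ depends on $w$, then the rate of change of $L$ with respect to $w$ is the product of the local rates:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial u} \cdot \frac{\partial u}{\partial w}$$
In a network, the loss $L$ sits at the far right and an early weight like $W_1$ sits at the far left. To get $\partial L / \partial W_1$ you multiply the local derivatives of every operation between them:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$$
The factors appear in the order the backward pass visits them: starting at the loss and walking the network from right to left. The clever part is that the leftmost factors, $\partial L / \partial \hat{y}$ and $\partial \hat{y} / \partial z_2$, are shared with the gradient for $W_2$, so you compute each one once and reuse it. Backpropagation is the chain rule plus this reuse.
We will name each gradient after the quantity it belongs to, not the step it comes from. So dz2 means $\partial L / \partial z_2$, the gradient of the loss with respect to $z_2$. This naming keeps the code readable.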
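The reuse and the naming convention can be seen on a scalar version of the network. The sketch below is an illustration under stated assumptions: one example, no biases, and arbitrary values for x, y, w1, and w2. It computes dz2 once, reuses it for both weights, and compares the results with central finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w1, w2, x, y):
    # Scalar version of the lesson's graph (biases omitted for brevity).
    a1 = max(w1 * x, 0.0)
    y_hat = sigmoid(w2 * a1)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Illustrative values, not taken from the lesson.
x, y = 1.5, 1.0
w1, w2 = 0.8, -0.4

# Forward pass with cached intermediates.
z1 = w1 * x
a1 = max(z1, 0.0)
z2 = w2 * a1
y_hat = sigmoid(z2)

# Backward pass: dz2 is computed once and reused for both weights.
dz2 = y_hat - y                        # dL/dz2 for a single example
dw2 = dz2 * a1                         # dL/dw2
da1 = dz2 * w2                         # dL/da1
dz1 = da1 * (1.0 if z1 > 0 else 0.0)   # dL/dz1, gated by the ReLU
dw1 = dz1 * x                          # dL/dw1

# Central finite differences as an independent check.
h = 1e-6
num_dw1 = (loss_fn(w1 + h, w2, x, y) - loss_fn(w1 - h, w2, x, y)) / (2 * h)
num_dw2 = (loss_fn(w1, w2 + h, x, y) - loss_fn(w1, w2 - h, x, y)) / (2 * h)

print("dw1 analytic vs numeric:", dw1, num_dw1)
print("dw2 analytic vs numeric:", dw2, num_dw2)
```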
Setting Up the Data
To make this concrete you will run a real forward and backward pass on the Pima Indians Diabetes dataset. Each row describes a patient with eight medical measurements, and the target is whether the patient was diagnosed with diabetes. It is a binary classification problem, a perfect fit for the sigmoid output above.
import numpy as np
import pandas as pd
# download: https://datatweets.com/datasets/diabetes.csv
df = pd.read_csv("diabetes.csv")
print("shape:", df.shape)
print("outcome balance:", dict(df["Outcome"].value_counts().sort_index()))
# Output:
# shape: (768, 9)
# outcome balance: {0: 500, 1: 268}
The dataset has 768 patients and 9 columns: eight features and the Outcome label. About one third of the patients (268 of 768) have a positive diagnosis, so the classes are imbalanced but usable.
Split the columns into a feature matrix $X$ and a target vector $y$, and standardize the features so every column has mean 0 and standard deviation 1. Standardizing matters here because the raw features (glucose in the hundreds, pedigree near zero) live on wildly different scales, and that makes gradients hard to balance.
X = df.drop(columns="Outcome").values.astype(float)
y = df["Outcome"].values.reshape(-1, 1).astype(float)
# Standardize each feature column: (x - mean) / std
X = (X - X.mean(axis=0)) / X.std(axis=0)
print("X shape:", X.shape)
print("y shape:", y.shape)
# Output:
# X shape: (768, 8)
# y shape: (768, 1)
The target is reshaped to a column vector of shape (768, 1) so it lines up with the network’s output. Keeping shapes explicit will save you from a lot of confusion in the backward pass.
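The sketch below shows why the reshape matters, using small made-up arrays rather than the dataset. If the labels stay a flat vector, NumPy broadcasting turns a simple subtraction into a full matrix without raising any error.

```python
import numpy as np

y_hat = np.full((4, 1), 0.5)               # network output: one column
y_flat = np.array([0.0, 1.0, 1.0, 0.0])    # labels left as a flat vector
y_col = y_flat.reshape(-1, 1)              # labels as a column, as in the lesson

bad = y_hat - y_flat     # (4, 1) minus (4,) broadcasts to (4, 4)
good = y_hat - y_col     # (4, 1) minus (4, 1) stays (4, 1)

print("mismatched shapes give:", bad.shape)
print("matching shapes give:  ", good.shape)
```

Because the mismatched version runs without complaint, a wrong gradient can go unnoticed until you check shapes explicitly.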
The Forward Pass (and Why It Caches)
Before you can run anything backward, you run it forward. Initialize the two layers, then compute each operation in the graph in order. The hidden layer has 16 units, matching the network from the previous lesson.
np.random.seed(0)
n_features = X.shape[1] # 8
n_hidden = 16
# Small random weights, zero biases
W1 = np.random.randn(n_features, n_hidden) * 0.1
b1 = np.zeros((1, n_hidden))
W2 = np.random.randn(n_hidden, 1) * 0.1
b2 = np.zeros((1, 1))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
def relu(z):
    return np.maximum(z, 0)
# Forward pass, caching every intermediate value
z1 = X @ W1 + b1 # pre-activation of layer 1
a1 = relu(z1) # hidden activation
z2 = a1 @ W2 + b2 # pre-activation of layer 2
y_hat = sigmoid(z2) # predicted probability
Notice that you keep z1, a1, z2, and y_hat around in variables. That is the cache, and it is not optional. The backward pass needs these forward values to compute its local derivatives. For example, the gradient of ReLU depends on the sign of z1, and the gradient of $z_2$ with respect to $W_2$ depends on a1. Throwing those away would force you to recompute the forward pass during the backward pass.
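One way to make the cache explicit is to have the forward pass return it. The sketch below is one possible arrangement, not the lesson's code: the function name forward, the dictionary layout, and the synthetic 5-row input are all assumptions made for illustration.

```python
import numpy as np

def forward(X, params):
    """Run the 2-layer forward pass and return the prediction plus a cache."""
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)
    z2 = a1 @ W2 + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))
    # Everything the backward pass will need, stored under one name.
    cache = {"X": X, "z1": z1, "a1": a1, "z2": z2, "y_hat": y_hat}
    return y_hat, cache

# Synthetic stand-in for the standardized features (assumption: 5 rows, 8 columns).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5, 8))
params = (
    rng.standard_normal((8, 16)) * 0.1, np.zeros((1, 16)),
    rng.standard_normal((16, 1)) * 0.1, np.zeros((1, 1)),
)

y_hat_demo, cache = forward(X_demo, params)
for name, value in cache.items():
    print(name, value.shape)
```

Bundling the cached values this way makes it harder to mix values from different forward passes by accident.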
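The loss for binary classification is binary cross-entropy, averaged over the batch of $N$ examples:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$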
N = X.shape[0]
eps = 1e-8 # avoid log(0)
loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
print(f"initial loss: {loss:.4f}")
# Output:
# initial loss: 0.6931
With small random weights the network outputs roughly 0.5 for everyone, so the loss starts near $\ln 2 \approx 0.693$. That is exactly what an untrained binary classifier should look like, and it confirms the forward pass is wired up correctly.
A loss near 0.693 is a sanity check
For binary cross-entropy, a freshly initialized network that predicts close to 0.5 for every example should report a loss near $\ln 2 \approx 0.693$. If your starting loss is wildly different, suspect a bug in the forward pass or the loss before you ever touch backpropagation.
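The value comes straight from the loss formula: when every prediction is exactly 0.5, each example contributes $-\log 0.5 = \ln 2$ regardless of its label. A quick check with made-up labels:

```python
import numpy as np

y_demo = np.array([[0.0], [1.0], [1.0], [0.0]])   # made-up labels
p_demo = np.full_like(y_demo, 0.5)                # a classifier that always says 0.5

bce = -np.mean(y_demo * np.log(p_demo) + (1 - y_demo) * np.log(1 - p_demo))

print(f"BCE at p = 0.5: {bce:.4f}")
print(f"ln 2:           {np.log(2):.4f}")
```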
The Backward Pass, Step by Step
Now the heart of the lesson. You will compute the gradient of the loss with respect to every parameter by walking the graph in reverse. Each step uses the chain rule to combine the gradient flowing in from the right with the local derivative of one operation.
Step 1: Gradient at the Output, dz2
The first step combines two operations at once: the sigmoid and the cross-entropy loss. Computing them separately is fiddly, but together they collapse into something beautiful. The gradient of the loss with respect to the pre-activation $z_2$ is just the prediction minus the truth:
$$\frac{\partial L}{\partial z_2} = \frac{1}{N} (\hat{y} - y)$$
The $1/N$ appears because the loss is an average over the batch. This is the same “cancellation” that makes softmax with cross-entropy reduce to $\hat{y} - y$; for the sigmoid with binary cross-entropy you get the identical clean form.
dz2 = (y_hat - y) / N # shape (768, 1)
print("dz2 shape:", dz2.shape)
# Output:
# dz2 shape: (768, 1)
This dz2 is the error signal entering the backward pass. Everything downstream is this signal multiplied by local derivatives.
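For readers who want to see where "prediction minus truth" comes from, here is a short derivation for a single example, written with $\ell$ for the per-example loss and $z$ for the output pre-activation. It is a standard calculation, included as a worked check rather than as part of the lesson's code.

```latex
% Single example with prediction \hat{y} = \sigma(z) and label y
\ell = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]

% Local derivative of the loss with respect to the prediction
\frac{\partial \ell}{\partial \hat{y}}
  = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})}

% Local derivative of the sigmoid
\frac{\partial \hat{y}}{\partial z} = \hat{y}\,(1 - \hat{y})

% Chain rule: the \hat{y}(1 - \hat{y}) factors cancel
\frac{\partial \ell}{\partial z}
  = \frac{\partial \ell}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \hat{y} - y

% Averaging over N examples, L = \frac{1}{N} \sum_i \ell_i
\frac{\partial L}{\partial z_i} = \frac{1}{N} (\hat{y}_i - y_i)
```

The cancellation is also numerically convenient, because the code never has to divide by $\hat{y}(1 - \hat{y})$, which can be close to zero.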
Step 2: Gradients of the Second Layer, dW2 and db2
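The second layer computed $z_2 = a_1 W_2 + b_2$. For a matrix multiply $Z = A W$, the local derivative tells you that the gradient on $W$ is the cached input transposed, times the incoming gradient:
$$\frac{\partial L}{\partial W_2} = a_1^{\top} \, \frac{\partial L}{\partial z_2}$$
The bias was simply added to every row, so its gradient is the incoming gradient summed over the batch:
$$\frac{\partial L}{\partial b_2} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_2^{(i)}}$$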
dW2 = a1.T @ dz2 # shape (16, 1)
db2 = np.sum(dz2, axis=0, keepdims=True) # shape (1, 1)
print("dW2 shape:", dW2.shape)
print("db2 shape:", db2.shape)
# Output:
# dW2 shape: (16, 1)
# db2 shape: (1, 1)
A useful rule of thumb: a gradient always has the same shape as the thing it is the gradient for. dW2 matches W2 at (16, 1), and db2 matches b2 at (1, 1). If a shape ever fails to line up, you have a transpose or a missing sum somewhere.
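The rule of thumb is easy to turn into an automatic check. The helper below, check_grad_shapes, is a hypothetical name introduced for this sketch. It assumes each gradient is stored under the parameter's name prefixed with "d", and the arrays are arbitrary stand-ins with the lesson's layer-2 shapes.

```python
import numpy as np

def check_grad_shapes(params, grads):
    """Raise if any gradient's shape differs from its parameter's shape."""
    for name, value in params.items():
        grad = grads["d" + name]
        if grad.shape != value.shape:
            raise ValueError(f"d{name} has shape {grad.shape}, expected {value.shape}")
    return True

# Stand-in arrays with the lesson's layer-2 shapes; the values are arbitrary.
rng = np.random.default_rng(0)
a1_demo = rng.standard_normal((6, 16))
dz2_demo = rng.standard_normal((6, 1))
params = {"W2": np.zeros((16, 1)), "b2": np.zeros((1, 1))}

grads = {
    "dW2": a1_demo.T @ dz2_demo,
    "db2": np.sum(dz2_demo, axis=0, keepdims=True),
}
print("shapes ok:", check_grad_shapes(params, grads))

# Putting the transpose on the wrong factor produces the wrong shape,
# and the check catches it.
wrong = {"dW2": dz2_demo.T @ a1_demo, "db2": grads["db2"]}
try:
    check_grad_shapes(params, wrong)
except ValueError as err:
    print("caught:", err)
```

A shape check like this cannot prove a gradient is correct, but it catches the most common transpose and missing-sum mistakes cheaply.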
Step 3: Push the Gradient Back to the Hidden Layer, da1
To keep going you need the gradient with respect to the hidden activation $a_1$, because $a_1$ is what the first layer produced. The matrix multiply sends the gradient backward through the weights:
$$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_2} \, W_2^{\top}$$
da1 = dz2 @ W2.T # shape (768, 16)
print("da1 shape:", da1.shape)
# Output:
# da1 shape: (768, 16)
This is the moment the error “propagates back” a layer. The blame for the loss, expressed at the output, is now expressed at the hidden activations. Note how dW2 (a gradient on a parameter) uses $a_1^{\top}$, while da1 (a gradient flowing further back) uses $W_2^{\top}$. Both come from the same matrix-multiply rule, just for different inputs.
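Both transposes can be read off by writing the matrix multiply element by element. The derivation below uses generic names ($A$ for the layer input, $W$ for the weights, $Z$ for the output) and is a standard index calculation, offered as a worked check.

```latex
% Forward: each output entry is a sum over the shared index j
Z_{ik} = \sum_{j} A_{ij} W_{jk}

% Gradient on the weights: W_{jk} touches Z_{ik} for every row i
\frac{\partial L}{\partial W_{jk}}
  = \sum_{i} \frac{\partial L}{\partial Z_{ik}} \, A_{ij}
  \quad\Longrightarrow\quad
  \frac{\partial L}{\partial W} = A^{\top} \, \frac{\partial L}{\partial Z}

% Gradient on the input: A_{ij} touches Z_{ik} for every column k
\frac{\partial L}{\partial A_{ij}}
  = \sum_{k} \frac{\partial L}{\partial Z_{ik}} \, W_{jk}
  \quad\Longrightarrow\quad
  \frac{\partial L}{\partial A} = \frac{\partial L}{\partial Z} \, W^{\top}
```

With $A = a_1$, $W = W_2$, and $Z = z_2$, these are the expressions for dW2 and da1 used in the code.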
Step 4: Through the ReLU, dz1
The hidden activation came from a ReLU: $a_1 = \max(0, z_1)$. ReLU passes the gradient straight through wherever its input was positive, and blocks it (sets it to zero) wherever its input was negative or zero. Its local derivative is 1 for positive inputs and 0 otherwise:
$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \odot \mathbb{1}[z_1 > 0]$$
where $\odot$ is element-wise multiplication and $\mathbb{1}[z_1 > 0]$ is 1 where $z_1$ was positive. This is exactly why you cached z1: you need its sign now.
dz1 = da1 * (z1 > 0) # shape (768, 16)
print("dz1 shape:", dz1.shape)
# Output:
# dz1 shape: (768, 16)
Any hidden unit that was “off” in the forward pass (a negative or zero $z_1$) receives no gradient. That is the famous gating behavior of ReLU, and it falls right out of the chain rule.
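The gating is easiest to see on a tiny hand-made array. The values below are invented for illustration; note that an input of exactly zero is treated as "off", matching the `z1 > 0` mask used in the lesson.

```python
import numpy as np

z1_demo = np.array([[ 1.5, -0.2, 0.0],
                    [-3.0,  0.7, 2.0]])
da1_demo = np.array([[0.1, 0.2, 0.3],
                     [0.4, 0.5, 0.6]])

mask = z1_demo > 0             # True where the unit was active in the forward pass
dz1_demo = da1_demo * mask     # gradient passes only through active units

print(mask.astype(int))
print(dz1_demo)
```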
Step 5: Gradients of the First Layer, dW1 and db1
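The first layer is structurally identical to the second, so the formulas are the same, with $X$ playing the role that $a_1$ played before:
$$\frac{\partial L}{\partial W_1} = X^{\top} \, \frac{\partial L}{\partial z_1}, \qquad \frac{\partial L}{\partial b_1} = \sum_{i=1}^{N} \frac{\partial L}{\partial z_1^{(i)}}$$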
dW1 = X.T @ dz1 # shape (8, 16)
db1 = np.sum(dz1, axis=0, keepdims=True) # shape (1, 16)
print("dW1 shape:", dW1.shape)
print("db1 shape:", db1.shape)
# Output:
# dW1 shape: (8, 16)
# db1 shape: (1, 16)
And that is the full backward pass. Starting from one scalar loss, you produced a gradient for every parameter, dW1, db1, dW2, db2, in five short steps, each one a local derivative times an incoming gradient.
Why it is efficient
Look at how dz2 was computed once and then reused for both dW2 and da1, and how dz1 was reused for both dW1 and (if there were more layers) the next gradient back. Backpropagation never recomputes a shared factor. That reuse is precisely why one backward pass costs about the same as one forward pass, no matter how many parameters you have.
The Complete Backward Pass
Here is the entire forward and backward pass collected into one runnable block. This is the exact code that sits inside the training loop from the previous lesson; gradient descent simply subtracts a fraction of each gradient from its parameter.
import numpy as np
import pandas as pd
# 1. Data
df = pd.read_csv("diabetes.csv") # download: https://datatweets.com/datasets/diabetes.csv
X = df.drop(columns="Outcome").values.astype(float)
y = df["Outcome"].values.reshape(-1, 1).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)
N = X.shape[0]
# 2. Parameters
np.random.seed(0)
W1 = np.random.randn(8, 16) * 0.1
b1 = np.zeros((1, 16))
W2 = np.random.randn(16, 1) * 0.1
b2 = np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# 3. Forward pass (cache z1, a1, z2, y_hat)
z1 = X @ W1 + b1
a1 = np.maximum(z1, 0)
z2 = a1 @ W2 + b2
y_hat = sigmoid(z2)
# 4. Backward pass
dz2 = (y_hat - y) / N # output error
dW2 = a1.T @ dz2 # layer-2 weight gradient
db2 = np.sum(dz2, axis=0, keepdims=True)
da1 = dz2 @ W2.T # push gradient back a layer
dz1 = da1 * (z1 > 0) # through ReLU
dW1 = X.T @ dz1 # layer-1 weight gradient
db1 = np.sum(dz1, axis=0, keepdims=True)
print("dW1:", dW1.shape, "| dW2:", dW2.shape)
# Output:
# dW1: (8, 16) | dW2: (16, 1)
Read top to bottom, the forward pass builds values; read the backward block top to bottom, and you are walking the graph right to left. The symmetry between the two layers is the same pattern repeated, which is exactly why this scales to networks with dozens of layers.
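To see the backward pass doing its job inside a training loop, the sketch below repeats the same five steps for 200 iterations of plain gradient descent. It runs on synthetic data so that it needs no download: the random features, the rule that generates the labels, the learning rate, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)
    z2 = a1 @ W2 + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))
    return z1, a1, y_hat

def bce(y, y_hat, eps=1e-8):
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

# Synthetic stand-in for the standardized diabetes features (an assumption).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = (X[:, :1] + 0.5 * X[:, 1:2] > 0).astype(float)   # shape (200, 1)
N = X.shape[0]

W1 = rng.standard_normal((8, 16)) * 0.1
b1 = np.zeros((1, 16))
W2 = rng.standard_normal((16, 1)) * 0.1
b2 = np.zeros((1, 1))

lr = 0.1
losses = []
for step in range(200):
    # Forward pass (cache z1, a1, y_hat)
    z1, a1, y_hat = forward(X, W1, b1, W2, b2)
    losses.append(bce(y, y_hat))

    # Backward pass: the same five steps as in the lesson
    dz2 = (y_hat - y) / N
    dW2 = a1.T @ dz2
    db2 = np.sum(dz2, axis=0, keepdims=True)
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)
    dW1 = X.T @ dz1
    db1 = np.sum(dz1, axis=0, keepdims=True)

    # Gradient descent: update only after every gradient is computed
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"loss at step 0:   {losses[0]:.4f}")
print(f"loss at step 199: {losses[-1]:.4f}")
```

All four updates happen after the last gradient is computed, so every gradient in a step is based on the same set of weights.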
Checking the Gradient
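How do you know the backward pass is correct? You compare it against a numerical gradient: nudge one weight by a tiny amount $\varepsilon$ in each direction, measure how much the loss changes, and divide. By the definition of a derivative, this finite difference should closely match your analytic gradient:
$$\frac{\partial L}{\partial w} \approx \frac{L(w + \varepsilon) - L(w - \varepsilon)}{2\varepsilon}$$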
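def loss_for(W1, b1, W2, b2):
    # Recompute the forward pass and return the loss for the given parameters
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)
    z2 = a1 @ W2 + b2
    p = sigmoid(z2)
    eps = 1e-8 # avoid log(0); local to this function
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))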
# Numerically estimate the gradient of one weight in W2
eps = 1e-5
i, j = 0, 0
W2_plus = W2.copy(); W2_plus[i, j] += eps
W2_minus = W2.copy(); W2_minus[i, j] -= eps
numeric = (loss_for(W1, b1, W2_plus, b2) - loss_for(W1, b1, W2_minus, b2)) / (2 * eps)
print("analytic dW2[0,0]:", round(float(dW2[i, j]), 6))
print("numeric dW2[0,0]:", round(float(numeric), 6))
print("close?", np.allclose(dW2[i, j], numeric, atol=1e-6))
# Output:
# close? True
When the analytic and numerical gradients agree, you can trust your backward pass. Gradient checking is slow (it re-runs the forward pass twice for every weight), so you only use it to verify code once, never during real training. But it is the single best way to catch a backpropagation bug.
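The single-entry check extends naturally to every parameter. The sketch below is one way to do that on a deliberately tiny synthetic problem, so the loop over all 161 parameters stays fast. The function names loss_fn, backward, and numerical_grad, the random data, and the step size h are assumptions introduced for this example.

```python
import numpy as np

def loss_fn(params, X, y, eps=1e-8):
    W1, b1, W2, b2 = params
    a1 = np.maximum(X @ W1 + b1, 0)
    p = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def backward(params, X, y):
    # Analytic gradients, in the same order as params: dW1, db1, dW2, db2.
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0)
    y_hat = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))
    dz2 = (y_hat - y) / X.shape[0]
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    return [X.T @ dz1, dz1.sum(axis=0, keepdims=True),
            a1.T @ dz2, dz2.sum(axis=0, keepdims=True)]

def numerical_grad(params, X, y, h=1e-5):
    # Central difference for every entry of every parameter array.
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + h
            plus = loss_fn(params, X, y)
            p[idx] = old - h
            minus = loss_fn(params, X, y)
            p[idx] = old            # restore the weight before moving on
            g[idx] = (plus - minus) / (2 * h)
        grads.append(g)
    return grads

# Small synthetic problem (an assumption; the lesson uses diabetes.csv).
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 8))
y = rng.integers(0, 2, size=(6, 1)).astype(float)
params = [rng.standard_normal((8, 16)) * 0.1, np.zeros((1, 16)),
          rng.standard_normal((16, 1)) * 0.1, np.zeros((1, 1))]

analytic = backward(params, X, y)
numeric = numerical_grad(params, X, y)
max_abs_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print("largest absolute difference:", max_abs_err)
```

One caveat: ReLU is not differentiable at zero, so if a perturbation happens to push a pre-activation across zero, that entry can disagree slightly even when the backward pass is correct.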
Forgetting to cache is the classic bug
The most common backpropagation mistake is using the wrong cached value, or recomputing an activation with updated weights. Your backward pass must use the same z1, a1, and y_hat that the forward pass produced. If you update a weight before finishing the backward pass, every gradient downstream becomes wrong, and a gradient check will catch it.
Practice Exercises
Try these before checking the hints. They use the same diabetes.csv and the variables defined above.
Exercise 1: Gradient-Check dW1
The lesson gradient-checked one entry of dW2. Do the same for dW1[0, 0]: perturb that weight by $\pm\varepsilon$, estimate the numerical gradient, and confirm it matches your analytic dW1[0, 0].
# Your code here (reuse loss_for, X, y, and the cached forward values)
Hint
Copy W1 into W1_plus and W1_minus, add and subtract eps at index [0, 0], then call loss_for(W1_plus, b1, W2, b2) and loss_for(W1_minus, b1, W2, b2). Divide the difference by 2 * eps and compare to dW1[0, 0] with np.allclose(..., atol=1e-6). It should print True.
Exercise 2: Take One Gradient Descent Step
You now have every gradient. Update all four parameters with a learning rate of 0.1, re-run the forward pass, and confirm the loss went down from the initial 0.6931.
lr = 0.1
# Your code here: update W1, b1, W2, b2, then recompute the loss
Hint
Subtract the gradient times the learning rate from each parameter, for example W1 = W1 - lr * dW1, and do the same for b1, W2, b2. Then recompute z1, a1, z2, y_hat and the loss. After one step the loss should be slightly below 0.6931, confirming the gradients point downhill.
Exercise 3: Swap ReLU for No Activation
Remove the ReLU so the hidden pre-activation passes through unchanged (a1 = z1). Adjust only the backward pass and explain in a comment what happens to dz1.
# Forward: a1 = z1 (identity instead of ReLU)
# Your code here: recompute the backward pass for this change
Hint
With an identity activation, the local derivative is 1 everywhere, so dz1 = da1 (no masking by z1 > 0). Everything else stays the same. This shows that the ReLU mask is the only thing that gates the gradient; remove the nonlinearity and the two layers collapse into one linear map.
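The collapse mentioned in the hint can be checked directly on the forward pass. The sketch below uses random stand-in arrays, and the names W_eff and b_eff are introduced only for this illustration. It compares two stacked linear layers with the single linear map they are equivalent to.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((5, 8))
W1, b1 = rng.standard_normal((8, 16)), rng.standard_normal((1, 16))
W2, b2 = rng.standard_normal((16, 1)), rng.standard_normal((1, 1))

# Two layers with an identity activation in between
two_layer = (X_demo @ W1 + b1) @ W2 + b2

# The equivalent single linear map
W_eff = W1 @ W2            # shape (8, 1)
b_eff = b1 @ W2 + b2       # shape (1, 1)
one_layer = X_demo @ W_eff + b_eff

print("same output?", np.allclose(two_layer, one_layer))
```

This is why the nonlinearity matters: without it, the 16 hidden units add parameters but no extra expressive power.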
Summary
You opened the box that gradient descent depends on. You now know exactly how a single loss value becomes a gradient for every weight in the network. Let’s review.
Key Concepts
The Big Idea
- A network is a chain of simple operations, the computational graph
- Backpropagation computes every gradient in one backward sweep, reusing shared factors so it costs about as much as one forward pass
Forward Pass
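- Runs the graph left to right: $z_1 = X W_1 + b_1$, $a_1 = \mathrm{ReLU}(z_1)$, $z_2 = a_1 W_2 + b_2$, $\hat{y} = \sigma(z_2)$
- Must cache intermediate values (z1, a1, z2, y_hat) because the backward pass needs them
- A freshly initialized binary classifier should report a loss near $\ln 2 \approx 0.693$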
The Chain Rule
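- $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial u} \cdot \frac{\partial u}{\partial w}$, applied from the loss backward (right to left through the network)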
- Each gradient is an incoming gradient times one local derivative
Backward Pass (2-Layer Net)
- dz2 = (y_hat - y) / N: sigmoid and cross-entropy collapse to prediction minus truth
- dW2 = a1.T @ dz2, db2 = dz2.sum(axis=0, keepdims=True): matrix-multiply and bias rules
- da1 = dz2 @ W2.T: push the gradient back a layer
- dz1 = da1 * (z1 > 0): ReLU gates the gradient by the cached sign
- dW1 = X.T @ dz1, db1 = dz1.sum(axis=0, keepdims=True): same rules, with X as input
Checking Your Work
- A gradient always has the same shape as the parameter it belongs to
- Verify the backward pass with a numerical gradient (a finite difference); use it to debug, never during training
Why This Matters
Backpropagation is the algorithm that made modern deep learning possible. Every framework you will ever use, PyTorch, TensorFlow, JAX, is at its core an engine that records the computational graph during the forward pass and replays it backward with the chain rule. When you call loss.backward(), the exact five-step computation you just wrote by hand runs automatically, for every layer.
Understanding it by hand pays off in practice. When training stalls because of vanishing or exploding gradients, dead ReLU units that never receive gradient, or shape mismatches that silently broadcast, you will recognize the cause because you know what flows backward and why. The next lesson builds directly on this: once you can compute gradients, the question becomes how to use them well, which is the job of optimizers.
Next Steps
You can now compute the gradient of every weight in a network. The natural next question is how to turn those gradients into good updates, faster and more reliably than plain gradient descent.
Continue to Lesson 5 - Optimizers: SGD, RMSprop, and Adam
Learn how modern optimizers use gradients to train networks faster and more stably.
Back to Module Overview
Return to the Deep Learning Foundations module overview.
Keep Building Your Skills
You just implemented the algorithm that trains virtually every modern neural network. The forward pass that caches and the backward pass that propagates error are not magic; they are the chain rule applied carefully, operation by operation. Hold on to the picture of a value flowing forward and a gradient flowing back, because every architecture you meet from here, deeper networks, convolutions, transformers, is just a bigger computational graph running the same forward and backward dance.