Lesson 3 - Gradient Descent for Neural Networks
On this page
- Welcome to How Networks Learn
- The Problem: Good Weights Don’t Appear by Magic
- Measuring Error: The Loss Function
- The Loss Surface and the Gradient
- Gradient Descent: Stepping Downhill
- The Learning Rate: The Size of Your Steps
- Training a Network on Real Data
- Watching the Learning Rate in Action
- Practice Exercises
- Summary
- Next Steps
- Keep Building Your Skills
Welcome to How Networks Learn
In the previous lessons you built a network’s forward pass: data flows in, gets multiplied by weights, passes through activation functions, and produces a prediction. But there was a missing piece. Where do those weights come from? In this lesson you will learn the answer: a network learns its weights by minimizing a loss function with an algorithm called gradient descent. You will train a small network from scratch in NumPy on a real medical dataset and watch its error fall, epoch after epoch.
By the end of this lesson, you will be able to:
- Explain what a loss function is and why binary cross-entropy is used for classification
- Describe what a gradient is and what it tells you about the loss surface
- State the gradient descent update rule and explain why we step in the negative gradient direction
- Explain the role of the learning rate and what happens when it is too small or too large
- Train a small NumPy neural network with full-batch gradient descent and evaluate it on unseen data
You should have completed Lessons 1 and 2 of this module and be comfortable with the forward pass, activation functions, and basic NumPy. Let’s begin.
The Problem: Good Weights Don’t Appear by Magic
A neural network is a long chain of multiplications and additions, controlled by its weights and biases. In the forward pass, those parameters decide what prediction comes out the other end. If the weights are good, the predictions are accurate. If the weights are random, the predictions are garbage.
When you create a network, the weights are random. So the network starts out useless. Training is the process of nudging those random numbers, a little at a time, until the predictions become accurate. To do that, you need two things:
- A way to measure how wrong the current predictions are. That is the loss function.
- A way to adjust the weights so the loss goes down. That is gradient descent.
Everything in this lesson is built from those two ideas. Let’s start with measuring error.
Measuring Error: The Loss Function
A loss function takes the network’s predictions and the true answers, and returns a single number that says how badly the network is doing. A large loss means the predictions are far from the truth. A small loss means they are close. The entire goal of training is to make this number as small as possible.
The right loss function depends on the task. In this lesson you are doing binary classification: predict whether a patient has diabetes (1) or not (0). The network outputs a probability between 0 and 1 (thanks to a sigmoid on the final layer), and you compare that probability to the true 0/1 label.
Why Not Just Use Squared Error?
For predicting a number, the natural loss is squared error, $(y - \hat{y})^2$. It works well for regression. But for classification it has a quiet flaw. Imagine the true label is 1 and the network confidently predicts 0.01. Squared error penalizes this as just $(1 - 0.01)^2 \approx 0.98$, and no prediction can ever cost more than 1, so an extremely confident wrong answer costs barely more than a moderately confident one. A confident wrong answer should be punished much harder than a hesitant one.
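To make the saturation concrete, here is a minimal sketch that evaluates squared error for a true label of 1 as the prediction gets more and more confidently wrong. The helper name `squared_error` and the chosen probabilities are illustrative assumptions, not part of the lesson's code.

```python
def squared_error(y_true, y_pred):
    """Squared error for a single prediction (illustrative helper)."""
    return (y_true - y_pred) ** 2

# The true label is 1 in every case; only the predicted probability changes.
for y_pred in [0.4, 0.1, 0.01, 0.0001]:
    print(f"prediction={y_pred:<7} squared error={squared_error(1.0, y_pred):.4f}")
```

The penalty creeps toward 1 and stops there: moving from a prediction of 0.1 to 0.0001 makes the network far more confidently wrong, yet the squared error barely changes.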
Binary Cross-Entropy
The loss designed for this is binary cross-entropy (BCE), also called log loss. For a single example with true label $y$ and predicted probability $\hat{y}$, it is:
$$L = -\left[\, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \,\right]$$
Read it one case at a time. If the true label is $y = 1$, only the first term survives and the loss is $-\log(\hat{y})$. When $\hat{y}$ is close to 1, $-\log(\hat{y})$ is near 0 (small loss, good). When $\hat{y}$ drifts toward 0, $-\log(\hat{y})$ shoots up toward infinity (huge loss, a confident mistake). If the true label is $y = 0$, only the second term survives, $-\log(1 - \hat{y})$, and it punishes confident predictions of 1 in the same way.
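The loss over a whole dataset of $n$ examples is just the average:
$$L = -\frac{1}{n} \sum_{i=1}^{n} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$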
In code, this is a one-liner once you have predictions and labels:
import numpy as np

def bce_loss(y_true, y_pred):
    # Clip to avoid log(0), which is undefined
    eps = 1e-8
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

# A confident correct prediction has tiny loss
print(round(bce_loss(np.array([1.0]), np.array([0.99])), 4))
# Output: 0.0101

# A confident WRONG prediction has huge loss
print(round(bce_loss(np.array([1.0]), np.array([0.01])), 4))
# Output: 4.6052

Notice the asymmetry: the confident correct guess costs almost nothing, while the confident wrong guess costs over 4. That is exactly the behavior you want from a classification loss.
Why clip the predictions?
The logarithm of 0 is negative infinity. If the network ever outputs an exact 0 or 1, the loss becomes infinite and the whole computation breaks. Clipping predictions into a tiny safe range like $[10^{-8},\ 1 - 10^{-8}]$ keeps the math stable without meaningfully changing the result. This is a standard trick in numerical implementations of the loss.
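The effect of clipping can be seen directly with a small sketch. It recomputes the BCE expression inline, with and without clipping, for a single hypothetical prediction of exactly 0 on a positive example; the variable names are assumptions made for this illustration.

```python
import numpy as np

y_true = np.array([1.0])
y_pred = np.array([0.0])  # the network is certain, and wrong

# Without clipping, log(0) evaluates to -inf, so the loss is infinite.
# errstate only silences NumPy's divide-by-zero warning for this demo.
with np.errstate(divide="ignore", invalid="ignore"):
    raw_loss = -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

# With clipping, the prediction is nudged to 1e-8 and the loss stays finite.
eps = 1e-8
clipped = np.clip(y_pred, eps, 1 - eps)
clipped_loss = -np.mean(
    y_true * np.log(clipped) + (1 - y_true) * np.log(1 - clipped)
)

print("Unclipped loss:", raw_loss)
print("Clipped loss:  ", round(float(clipped_loss), 4))
```

The clipped loss is still very large (about 18.4), so the mistake is punished heavily, but it is a finite number that the rest of the training code can work with.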
The Loss Surface and the Gradient
Now you can measure error. The next question is how to reduce it. Here is the key mental picture.
The loss depends on the weights. If you change a weight, the loss changes. Imagine plotting the loss on the vertical axis and a weight on the horizontal axis. You get a curve, and somewhere on that curve is a lowest point, the minimum, where the weight produces the smallest possible error. Your job is to find it.
With millions of weights this surface lives in millions of dimensions and you cannot picture it. But the principle stays the same in any number of dimensions: there is a low region you want to reach, and you are currently standing somewhere on the slope.
The Gradient: Which Way Is Downhill?
The gradient is the tool that tells you which way is downhill. Formally, the gradient of the loss with respect to a weight, written $\partial L / \partial w$, is the slope of the loss curve at your current position. It answers a precise question: if I nudge this weight up a tiny bit, does the loss go up or down, and how steeply?
- If the gradient is positive, the loss increases as the weight increases, so the weight is currently too high and you should decrease it.
- If the gradient is negative, the loss decreases as the weight increases, so the weight is too low and you should increase it.
- If the gradient is near zero, you are at the bottom of a valley (or a flat spot), and there is little left to improve.
In both useful cases the rule is the same: move the weight in the direction opposite to the gradient. That is the heart of gradient descent.
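The three cases above can be checked numerically. The sketch below uses a hypothetical one-weight loss, $(w - 3)^2$, whose minimum sits at $w = 3$, and estimates the slope with a central finite difference. Both the toy loss and the helper `numerical_gradient` are assumptions made for illustration; the lesson's network gets its gradients a different way.

```python
def loss(w):
    # Hypothetical one-weight loss with its minimum at w = 3
    return (w - 3.0) ** 2

def numerical_gradient(f, w, h=1e-5):
    # Central difference: slope estimated from two nearby points
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [5.0, 1.0, 3.0]:
    g = numerical_gradient(loss, w)
    if abs(g) < 1e-6:
        advice = "near a minimum, little left to improve"
    elif g > 0:
        advice = "weight too high, decrease it"
    else:
        advice = "weight too low, increase it"
    print(f"w={w}: gradient={g:+.3f} -> {advice}")
```

At $w = 5$ the slope is roughly $+4$, at $w = 1$ it is roughly $-4$, and at $w = 3$ it is essentially zero, matching the three bullet points.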
One gradient per parameter
A real network has a gradient for every single weight and bias. The full collection of these slopes is “the gradient” with a capital G: it is a vector pointing in the direction of steepest increase in loss. Stepping in the opposite direction takes the steepest path downhill. Computing all these gradients efficiently is the job of backpropagation, which is the entire focus of the next lesson. For now, treat the gradient as a black box that hands you the slope for each parameter.
Gradient Descent: Stepping Downhill
Gradient descent is almost embarrassingly simple. You stand on the loss surface, look at the gradient to find the downhill direction, take a small step that way, and repeat. Each step lowers the loss a little, and after enough steps you settle near a minimum.
The update rule for a single weight is:
$$w_{\text{new}} = w_{\text{old}} - \eta \, \frac{\partial L}{\partial w}$$
Read it left to right. The new weight equals the old weight, minus the gradient scaled by $\eta$ (the Greek letter eta), which is the learning rate. The minus sign is what makes you go downhill: you subtract the gradient, so you move opposite to the direction of steepest increase. The same rule applies to every weight and every bias in the network, each using its own gradient.
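A single update worked by hand may help. The numbers here are purely illustrative assumptions: a current weight of 2.0, a learning rate of 0.1, and a gradient of either $+4.0$ or $-4.0$.

```latex
% Positive gradient: the weight is too high, so the update lowers it
w_{\text{new}} = 2.0 - 0.1 \times 4.0 = 1.6

% Negative gradient: the weight is too low, so the update raises it
w_{\text{new}} = 2.0 - 0.1 \times (-4.0) = 2.4
```

In both cases the weight moves against the sign of the gradient, and the learning rate decides how far.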
The full algorithm is a loop:
1. Forward pass. Run the data through the network to get predictions.
2. Compute the loss. Measure how wrong the predictions are with BCE.
3. Compute the gradients. Find the slope of the loss for every parameter.
4. Update. Nudge every parameter one small step downhill using the rule above.
5. Repeat. Go back to step 1 and do it again.
Each full pass over the training data is called an epoch. You typically run for hundreds or thousands of epochs, and with each one the loss creeps lower as the weights settle into a good configuration.
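Before applying the loop to a real network, it can help to see it run on the simplest possible case. This is a minimal sketch on a hypothetical one-weight loss, $(w - 3)^2$, with the gradient written out by hand; the starting weight, learning rate, and step count are arbitrary choices for illustration.

```python
def loss(w):
    # Toy loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # Exact slope of the toy loss
    return 2.0 * (w - 3.0)

w = 0.0        # arbitrary starting weight
lr = 0.1       # learning rate
history = []

for epoch in range(50):
    history.append(loss(w))   # measure the loss at the current weight
    w -= lr * grad(w)         # step opposite to the gradient

print(f"Final weight: {w:.4f}")
print(f"Loss went from {history[0]:.4f} to {history[-1]:.6f}")
```

The weight ends up very close to 3 and the recorded loss shrinks at every step, which is the same pattern you should expect from the network's loss curve when the learning rate is well chosen.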
What does full-batch mean?
In this lesson you compute the gradient using the entire training set on every step. That is called full-batch (or batch) gradient descent. It gives a smooth, stable loss curve because every update reflects all the data. Later, when datasets get large, you will switch to using small random subsets per step, which is faster and adds helpful noise. That variation is covered in Lesson 5; here, full-batch keeps the picture clean.
The Learning Rate: The Size of Your Steps
The learning rate is the single most important knob in gradient descent. It controls how big a step you take each time. Get it right and the network converges quickly and smoothly. Get it wrong and training either crawls forever or explodes.
Think of descending a foggy hill. The gradient tells you which way is down, but the learning rate decides how far you stride before checking again.
- Too small (say 0.001): each step barely moves you. You will get to the bottom eventually, but it takes an enormous number of epochs. Training crawls.
- Just right (say 0.1 for this problem): you take confident steps and reach a good minimum in a reasonable number of epochs.
- Too large (say 1.0): you overstep the minimum and land on the far slope, possibly higher than where you started. The loss bounces around or grows without bound. Training is unstable.
Why does too large a step backfire? The gradient is only the slope at your current point. It does not know that the slope changes as you move. Take a huge step and you can sail right over the bottom of the valley and up the other side, ending up with worse loss than before. You will see this happen with real numbers shortly.
There is no universal best learning rate. It depends on the data, the network, and the loss. The practical approach is to try a few values spanning orders of magnitude (0.01, 0.1, 1.0) and compare the loss curves, which is exactly what you will do at the end of this lesson.
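The three regimes show up even on a one-weight toy problem. The sketch below minimizes a hypothetical loss, $w^2$, with three learning rates. The specific thresholds are properties of this toy loss only (here anything above 1.0 diverges), so the values differ from the ones that suit the diabetes network; the function name `run_gradient_descent` is an assumption for this example.

```python
def run_gradient_descent(lr, steps=20, w_start=1.0):
    """Minimize the toy loss L(w) = w**2 and return the loss after each step."""
    w = w_start
    losses = []
    for _ in range(steps):
        w -= lr * 2.0 * w       # the gradient of w**2 is 2w
        losses.append(w ** 2)
    return losses

for lr in [0.01, 0.1, 1.1]:
    losses = run_gradient_descent(lr)
    print(f"lr={lr:<4} loss after 20 steps = {losses[-1]:.4g}")
```

With the smallest rate the loss has barely moved after 20 steps, with the middle rate it is close to zero, and with the largest rate each step overshoots by more than the last, so the loss grows instead of shrinking.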
Training a Network on Real Data
Enough theory. You will now train a small network with plain gradient descent on the Pima Indians Diabetes dataset, a classic medical dataset where the task is to predict whether a patient has diabetes from eight health measurements.
Loading and Preparing the Data
import numpy as np
import pandas as pd
# download: https://datatweets.com/datasets/diabetes.csv
df = pd.read_csv("diabetes.csv")
print("Shape:", df.shape)
print("Outcome balance:", dict(df["Outcome"].value_counts().sort_index()))
# Output:
# Shape: (768, 9)
# Outcome balance: {0: 500, 1: 268}

The dataset has 768 patients and 9 columns: 8 features and the binary Outcome target. About one third of the patients have diabetes (268 out of 768), so the classes are imbalanced but not extreme.
The features have wildly different scales (glucose readings in the hundreds, a diabetes pedigree score below 1). As you learned in the Foundations module, neural networks train far better on standardized inputs, so you scale each feature to mean 0 and standard deviation 1, and split into training and test sets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="Outcome").to_numpy(dtype=float)
y = df["Outcome"].to_numpy(dtype=float).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on train only
X_test = scaler.transform(X_test)        # apply same transform to test

print("Train:", X_train.shape, " Test:", X_test.shape)
# Output:
# Train: (576, 8)  Test: (192, 8)

Building the Network in NumPy
You will build a small network with one hidden layer of 16 ReLU units and a single sigmoid output. The forward pass is exactly what you learned in Lesson 2: multiply by weights, add a bias, apply an activation, repeat.
def init_params(n_inputs, n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    # Small random weights; biases start at zero
    W1 = rng.standard_normal((n_inputs, n_hidden)) * 0.1
    b1 = np.zeros((1, n_hidden))
    W2 = rng.standard_normal((n_hidden, 1)) * 0.1
    b2 = np.zeros((1, 1))
    return [W1, b1, W2, b2]

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    z1 = X @ W1 + b1      # hidden pre-activation
    a1 = relu(z1)         # hidden activation
    z2 = a1 @ W2 + b2     # output pre-activation
    y_hat = sigmoid(z2)   # output probability
    # Return the cache too; we need it for the gradients
    return y_hat, (z1, a1, z2)

For the backward pass you need the gradients of the BCE loss with respect to every parameter. The next lesson derives these step by step; here you simply use them. The convenient fact is that for a sigmoid output paired with BCE loss, the gradient at the output simplifies to $\hat{y} - y$, which makes the rest of the chain clean.
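You do not have to take that simplification entirely on faith. The following sketch checks it numerically for a single example: it treats the BCE loss as a function of the output pre-activation $z$, estimates the slope with a finite difference, and compares it with the sigmoid output minus the label. The helper names `bce_from_logit` and `numerical_slope`, and the test points, are assumptions made for this check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_from_logit(z, y):
    # BCE of one example, written as a function of the pre-activation z
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def numerical_slope(z, y, h=1e-6):
    # Central finite difference of the loss with respect to z
    return (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)

for z, y in [(0.5, 1.0), (-1.2, 0.0), (2.0, 0.0)]:
    analytic = sigmoid(z) - y          # the claimed shortcut
    numeric = numerical_slope(z, y)    # slope measured directly
    print(f"z={z:+.1f}, y={y:.0f}: analytic={analytic:+.6f} numeric={numeric:+.6f}")
```

The two columns agree to several decimal places at each test point, which is what the backward pass below relies on when it starts from the difference between prediction and label.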
def backward(params, cache, X, y, y_hat):
    W1, b1, W2, b2 = params
    z1, a1, z2 = cache
    n = X.shape[0]
    # Gradient at the output: sigmoid + BCE simplifies to (y_hat - y)
    dz2 = (y_hat - y) / n
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    # Propagate back through the hidden layer
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)  # ReLU passes gradient only where input was positive
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)
    return [dW1, db1, dW2, db2]

The Training Loop
Now assemble the gradient descent loop. Each epoch runs a forward pass, computes the loss, computes the gradients, and applies the update rule to every parameter.
def train(X, y, n_hidden=16, lr=0.1, epochs=2000, seed=0):
    params = init_params(X.shape[1], n_hidden, seed=seed)
    history = []
    for epoch in range(epochs):
        # 1. Forward pass
        y_hat, cache = forward(params, X)
        # 2. Loss
        loss = bce_loss(y, y_hat)
        history.append(loss)
        # 3. Gradients
        grads = backward(params, cache, X, y, y_hat)
        # 4. Update every parameter: w <- w - lr * grad
        for i in range(len(params)):
            params[i] -= lr * grads[i]
    return params, history

params, history = train(X_train, y_train, n_hidden=16, lr=0.1, epochs=2000)
print(f"Final training loss: {history[-1]:.4f}")
# Output: Final training loss: 0.4323

The loss starts high, because the weights are random, and falls steadily to 0.4323 as gradient descent walks the weights downhill over 2000 epochs. Plotting the recorded loss makes the descent vivid.
The shape of this curve is the signature of healthy training: a fast initial drop while the easy gains are picked up, then a gentle flattening as the weights settle near a minimum. A smooth, monotonically decreasing curve like this tells you the learning rate is in a good range.
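If you want a quick check that does not depend on eyeballing a plot, a recorded loss history can be summarized in a few lines. The helper `describe_history` is a hypothetical addition, not part of the lesson's code, and the two short lists are made-up loss values used only to show the contrast between a smooth and an erratic curve.

```python
import numpy as np

def describe_history(history):
    """Summarize a recorded loss curve (hypothetical helper)."""
    h = np.asarray(history, dtype=float)
    steps = np.diff(h)  # change in loss from one epoch to the next
    return {
        "start": float(h[0]),
        "end": float(h[-1]),
        "monotonic": bool(np.all(steps <= 0)),
        "largest_increase": float(max(steps.max(), 0.0)),
    }

# Made-up loss values for illustration only
smooth = [0.69, 0.60, 0.52, 0.47, 0.44, 0.43]
bouncy = [0.69, 0.55, 0.71, 0.50, 0.66, 0.60]

print(describe_history(smooth))
print(describe_history(bouncy))
```

Passing the `history` list returned by `train` to a helper like this gives a numeric counterpart to the visual check: a monotonic curve with no upward jumps suggests the learning rate is in a workable range.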
Evaluating the Network
A low training loss is encouraging, but the real test is performance on data the network never saw. Convert the predicted probabilities to 0/1 labels with a 0.5 threshold and measure accuracy on both sets.
def accuracy(params, X, y):
    y_hat, _ = forward(params, X)
    preds = (y_hat >= 0.5).astype(float)
    return np.mean(preds == y)

train_acc = accuracy(params, X_train, y_train)
test_acc = accuracy(params, X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
# Output:
# Train accuracy: 0.807
# Test accuracy: 0.750

The network reaches 80.7 percent accuracy on training data and 75.0 percent on the held-out test set. The modest gap between the two is exactly what you hope to see: the network learned real patterns that generalize, rather than memorizing the training rows. You built and trained a working neural network from nothing but NumPy and the gradient descent update rule.
Why the test accuracy is what counts
Training accuracy only tells you how well the network fits data it has already seen. The test accuracy of 0.750 is the honest number, because those 192 patients played no part in setting the weights. A network that scores high in training but poorly on test has overfit, a failure mode you will learn to fight with regularization in Lesson 6.
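One useful reference point when reading an accuracy figure on imbalanced data is the score of a model that always predicts the majority class. The sketch below computes that baseline from the class counts printed when the dataset was loaded (500 negatives, 268 positives). Treat it as an approximate yardstick: it assumes the test split keeps roughly the same class proportions, which the stratified split is intended to do.

```python
# Class counts reported when the dataset was loaded
n_negative, n_positive = 500, 268

# Accuracy of always predicting the more common class
baseline = max(n_negative, n_positive) / (n_negative + n_positive)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
# Output: Majority-class baseline accuracy: 0.651
```

A test accuracy of 0.750 therefore sits clearly above what a model could reach by ignoring the features entirely.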
Watching the Learning Rate in Action
You now have a working network, so you can finally see for yourself why the learning rate matters so much. Train the same network three times, changing only $\eta$, and compare the loss curves.
for lr in [0.01, 0.1, 1.0]:
    _, hist = train(X_train, y_train, n_hidden=16, lr=lr, epochs=2000)
    print(f"lr={lr:<4} final loss={hist[-1]:.4f}")
# Output:
# lr=0.01 final loss=0.5793
# lr=0.1  final loss=0.4323
# lr=1.0  final loss=0.6014

Three very different stories from a single changed number:
- lr = 0.01 (too small). The steps are tiny, so after 2000 epochs the loss has only crawled down to 0.5793. It is still heading in the right direction, it just needs far more epochs to get anywhere. Slow but stable.
- lr = 0.1 (just right). The loss drops cleanly to 0.4323, the best of the three. The steps are large enough to make real progress but small enough to stay stable.
- lr = 1.0 (too large). The steps overshoot the minimum and the loss bounces around, ending at 0.6014, worse than the well-tuned run despite the same number of epochs. The updates are so aggressive that progress is erratic.
Seeing all three curves on one chart makes the trade-off unmistakable.
There is no magic learning rate
A learning rate that works beautifully on this network may be far too large or too small for a different one. The values 0.01, 0.1, and 1.0 are a sensible starting grid to scan, not universal answers. Always plot the loss curve: a smooth decline means you are in a good range, a crawling curve means go higher, and a noisy or rising curve means go lower.
Practice Exercises
Try these before checking the hints. Reuse the functions and the scaled X_train, y_train, X_test, y_test from the lesson.
Exercise 1: Compute Loss by Hand
Without using your bce_loss function, compute the binary cross-entropy for a single example where the true label is 0 and the predicted probability is 0.2. Then confirm it with bce_loss. Which surviving term of the formula applies here?
import numpy as np
y_true = 0.0
y_pred = 0.2
# Your code here

Hint
With $y = 0$, the first term vanishes and only $-\log(1 - \hat{y})$ remains. Compute -np.log(0.8), which is about 0.2231, then check it matches bce_loss(np.array([0.0]), np.array([0.2])).
Exercise 2: Find a Better Learning Rate
The lesson scanned 0.01, 0.1, and 1.0. Add a finer scan around the winner by training with learning rates 0.05, 0.1, and 0.3, and print the final training loss for each. Which one gives the lowest loss after 2000 epochs?
# Your code here (reuse train, X_train, y_train)

Hint
Loop over [0.05, 0.1, 0.3], call train(X_train, y_train, n_hidden=16, lr=lr, epochs=2000), and print hist[-1]. This is exactly how practitioners narrow down a learning rate: start with a coarse grid spanning orders of magnitude, then zoom in around the best value.
Exercise 3: Does Training Longer Help?
Train the well-tuned network (lr=0.1) for 5000 epochs instead of 2000, and compare the final training loss and test accuracy to the lesson’s results. Does the extra training keep helping, or does it level off?
# Your code here

Hint
Call train(X_train, y_train, n_hidden=16, lr=0.1, epochs=5000), then reuse the accuracy function on the returned params. You should see the training loss dip a little below 0.4323, while test accuracy barely moves, a sign the network is near its ceiling on this data and that more epochs alone are not the answer.
Summary
You just taught a neural network to learn, from scratch, using nothing but a loss function and gradient descent. Let’s review what you learned.
Key Concepts
The Loss Function
- A loss function turns predictions and true labels into a single number measuring how wrong the network is
- Binary cross-entropy is the standard loss for binary classification: it punishes confident wrong predictions far more harshly than hesitant ones
- The training goal is to make the loss as small as possible
The Gradient
- The gradient is the slope of the loss with respect to a weight: it points in the direction of steepest increase in loss
- A positive gradient means the weight is too high; a negative gradient means it is too low; a near-zero gradient means you are near a minimum
Gradient Descent
- The update rule is $w_{\text{new}} = w_{\text{old}} - \eta \, \partial L / \partial w$: step every parameter opposite to its gradient
- The loop is forward pass, compute loss, compute gradients, update, repeat; one full pass over the data is an epoch
- Full-batch gradient descent uses the entire training set for every update, giving a smooth loss curve
The Learning Rate
- The learning rate sets the step size: too small and training crawls, too large and it overshoots and becomes unstable
- There is no universal best value; scan several (0.01, 0.1, 1.0) and read the loss curves
- On the diabetes data, $\eta = 0.1$ reached a final loss of 0.4323, with 0.807 train accuracy and 0.750 test accuracy
Why This Matters
Gradient descent is the engine inside essentially every neural network ever trained, from the small NumPy model you built here to systems with hundreds of billions of parameters. The scale changes dramatically, but the core loop does not: measure the loss, find the gradient, step downhill, repeat. Once you understand that a network “learns” by repeatedly nudging its weights to lower a loss, the rest of deep learning becomes a series of refinements on this one idea.
You also got a hands-on feel for the learning rate, the knob that beginners most often get wrong. Watching the same network crawl, converge, and destabilize as you changed only $\eta$ is a lesson that sticks. The one piece you took on faith was how the gradients were computed. That is no accident, because deriving them efficiently is a topic important enough to deserve its own lesson, and it is the one you are about to start.
Next Steps
You now understand how a network learns by minimizing a loss with gradient descent. The missing piece, computing the gradients efficiently through every layer, is exactly what comes next.
Continue to Lesson 4 - Backpropagation
Learn how the gradients are actually computed, layer by layer, with the chain rule.
Keep Building Your Skills
You have moved from a network that merely makes predictions to one that genuinely learns from its mistakes. The loss function, the gradient, and the gradient descent update rule are the three ideas every deep learning system is built on, and you now have them not as abstractions but as code you wrote and ran yourself. Keep this loop in your head as you continue: measure, find the slope, step downhill, repeat. Master it here on a small NumPy network, and the largest models in the world will feel like the same idea, just scaled up.