Lesson 5 - Optimizers: SGD, RMSprop, and Adam
Welcome to Optimizers
In the last lesson you trained a neural network with plain gradient descent: compute the gradient, take a small step downhill, repeat. It works, but it is rarely the fastest or most stable way to learn. This lesson introduces the optimizers that practitioners actually reach for, including the one that trains almost every modern model. You will see why plain gradient descent struggles, then build three improvements on top of it: momentum, RMSprop, and Adam.
By the end of this lesson, you will be able to:
- Explain what an optimizer does and why stochastic gradient descent (SGD) alone is often slow and noisy
- Describe how momentum smooths the update path using an exponential moving average of past gradients
- Describe how RMSprop gives each parameter its own adaptive step size
- Explain how Adam combines momentum and RMSprop, including bias correction
- Implement all four optimizers in NumPy and compare their training-loss curves on a real dataset
You should be comfortable with the gradient descent training loop from the previous lesson, basic NumPy, and the idea of a loss function and its gradient. Let’s begin.
What an Optimizer Actually Does
Training a neural network means searching for the weights that make the loss as small as possible. An optimizer is the rule that decides, at each step, how to change the weights given the current gradient. Every optimizer answers the same question: I know which direction is downhill right now; how big a step should I take, and in exactly what direction?
Plain gradient descent gives the simplest possible answer: step a fixed fraction of the gradient, directly opposite to it. The fraction is the learning rate $\eta$. For a weight $w$ with loss gradient $g = \partial L / \partial w$, the update is:
$$w \leftarrow w - \eta\, g$$
When you compute the gradient on a small minibatch of examples rather than the entire dataset, this is called stochastic gradient descent, or SGD. The word stochastic (random) refers to the fact that each minibatch gives a slightly different, noisy estimate of the true gradient.
That noise is the problem. SGD has no memory. At every step it reacts only to the current minibatch’s gradient, so when the gradient jumps around from batch to batch, the weights jump around too. The path to the minimum zig-zags instead of heading straight down, and you waste iterations correcting overshoots.
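To make the update rule and the effect of gradient noise concrete, here is a minimal sketch on a one-dimensional toy loss, L(w) = (w - 3)^2. The toy loss, the starting point, the noise level, and the helper name `descend` are all illustrative assumptions and are not part of the lesson's network.

```python
import random

def descend(lr, steps, noise_std=0.0, seed=0):
    """Gradient descent on L(w) = (w - 3)**2, optionally with noisy gradients."""
    rng = random.Random(seed)
    w = 0.0
    path = [w]
    for _ in range(steps):
        grad = 2 * (w - 3)                 # exact gradient of the toy loss
        grad += rng.gauss(0.0, noise_std)  # stand-in for minibatch noise
        w = w - lr * grad                  # the plain gradient descent update
        path.append(w)
    return path

clean = descend(lr=0.1, steps=50)
noisy = descend(lr=0.1, steps=50, noise_std=2.0)

print("clean final w:", round(clean[-1], 4))
print("noisy final w:", round(noisy[-1], 4))
```

With exact gradients the weight settles smoothly at the minimum near 3. With noise added, the same rule keeps wandering around the minimum, because each step reacts only to the current noisy gradient.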
Optimizer vs. learning rate
The optimizer is the strategy for updating weights; the learning rate is the single most important number that strategy uses. The same learning rate that is perfect for SGD can be far too large for an adaptive optimizer like Adam, which is why you will see different learning rates for different optimizers later in this lesson.
The Dataset: Predicting Diabetes
Throughout this lesson you will train the same small neural network with four different optimizers and watch how each one drives the loss down. Using one fixed problem makes the comparison fair: any difference in the loss curves comes from the optimizer, not the data.
You will use the real Diabetes dataset, where each row is a patient and the goal is to predict whether that patient tested positive for diabetes from eight medical measurements.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# download: https://datatweets.com/datasets/diabetes.csv
df = pd.read_csv("diabetes.csv")
print("Shape:", df.shape)
print("Outcome balance:")
print(df["Outcome"].value_counts())
# Output:
# Shape: (768, 9)
# Outcome balance:
# Outcome
# 0 500
# 1 268
# Name: count, dtype: int64

There are 768 patients and 9 columns: eight features plus the Outcome target (1 for a positive diabetes test, 0 for negative). About 35 percent of patients are positive, so the classes are uneven but not extreme.
Just as in earlier lessons, you split the data, then scale the features so each one has mean 0 and standard deviation 1. Scaling matters even more here: optimizers that adapt their step size per parameter behave best when all features live on a comparable scale.
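As a rough picture of what the scaling step computes, the sketch below standardizes a single feature column by hand. The readings are invented for illustration, and `fit_standardizer` and `standardize` are hypothetical helper names. The key point is that the mean and standard deviation come from the training column only and are then reused on the test column.

```python
import math

def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature column."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, math.sqrt(var)

def standardize(column, mean, std):
    """Apply previously learned statistics to any column."""
    return [(x - mean) / std for x in column]

# Hypothetical readings, invented for illustration only
train_col = [90.0, 110.0, 130.0, 150.0]
test_col = [100.0, 160.0]

mean, std = fit_standardizer(train_col)          # statistics from TRAIN only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)   # reuse the train statistics

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, which is expected: it is measured against the training statistics.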
X = df.drop(columns=["Outcome"]).values
y = df["Outcome"].values.reshape(-1, 1).astype(float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std on TRAIN only
X_test = scaler.transform(X_test)        # apply the SAME transform to test
print("Train:", X_train.shape, " Test:", X_test.shape)
# Output:
# Train: (614, 8)  Test: (154, 8)

A Network You Can Swap Optimizers Into
To compare optimizers cleanly, you need a network whose layers expose their gradients rather than updating themselves. That way the optimizer, not the layer, owns the update rule, and you can drop in any optimizer without touching the network.
The Dense layer below does a forward pass and a backward pass, but its backward method returns the weight and bias gradients instead of applying them. A separate update method applies whatever change the optimizer hands back.
import math

class Dense:
    def __init__(self, input_size, output_size, activation=True, seed=0):
        self.add_activation = activation
        np.random.seed(seed)
        # Initialize weights uniformly in [-k, k], with k = sqrt(1 / input_size)
        k = math.sqrt(1 / input_size)
        self.weights = np.random.rand(input_size, output_size) * (2 * k) - k
        self.bias = np.zeros((1, output_size))
        self.prev_input = None
        self.output = None

    def forward(self, x):
        self.prev_input = x.copy()
        x = x @ self.weights + self.bias
        if self.add_activation:
            x = np.maximum(x, 0)  # ReLU
        self.output = x.copy()
        return x

    def backward(self, grad):
        if self.add_activation:
            grad = grad * np.heaviside(self.output, 0)  # gradient of ReLU
        w_grad = self.prev_input.T @ grad
        b_grad = np.sum(grad, axis=0, keepdims=True)
        next_grad = grad @ self.weights.T
        # Return the parameter gradients instead of applying them here
        return [w_grad, b_grad], next_grad

    def update(self, w_update, b_update):
        # Apply whatever change the optimizer computed
        self.weights += w_update
        self.bias += b_update

A small helper builds a two-layer network: 8 inputs to 16 hidden units with ReLU, then 16 hidden units to a single output that you push through a sigmoid for a probability.
def build_network():
    return [
        Dense(8, 16),
        Dense(16, 1, activation=False),
    ]

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x, layers):
    for layer in layers:
        x = layer.forward(x)
    return x

def backward(grad, layers):
    layer_grads = []
    for layer in reversed(layers):  # output layer first
        param_grads, grad = layer.backward(grad)
        layer_grads.append(param_grads)
    return layer_grads

With the network in place, each optimizer below is just a small class that takes the returned gradients and decides what update to apply.
SGD: The Baseline
Start with plain SGD so you have something to beat. The optimizer normalizes the weight gradient by the batch size to get the average gradient, scales it by the learning rate, and negates it so update subtracts it from the weights.
class SGD:
    def __init__(self, lr):
        self.lr = lr

    def step(self, layer_grads, layers, batch_size):
        for (w_grad, b_grad), layer in zip(layer_grads, reversed(layers)):
            w_grad = w_grad / batch_size
            w_update = -self.lr * w_grad
            b_update = -self.lr * b_grad
            layer.update(w_update, b_update)

Here is the training loop. It runs minibatches through the network, computes the gradient of the binary cross-entropy loss, and lets the optimizer update the weights. It records the average training loss each epoch so you can plot it later.
def bce_loss(pred, y):
    p = np.clip(sigmoid(pred), 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def train(optimizer, epochs=100, batch_size=16, seed=0):
    layers = build_network()
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        n_batches = 0
        for i in range(0, len(X_train), batch_size):
            xb = X_train[i:i + batch_size]
            yb = y_train[i:i + batch_size]
            pred = forward(xb, layers)
            # Gradient of BCE w.r.t. the pre-sigmoid output
            grad = sigmoid(pred) - yb
            layer_grads = backward(grad, layers)
            optimizer.step(layer_grads, layers, len(xb))
            epoch_loss += bce_loss(pred, yb)
            n_batches += 1
        losses.append(epoch_loss / n_batches)
    return layers, losses

def test_accuracy(layers):
    probs = sigmoid(forward(X_test, layers))
    preds = (probs >= 0.5).astype(int)
    return np.mean(preds == y_test)

layers, sgd_losses = train(SGD(lr=0.1))
print(f"SGD final loss={sgd_losses[-1]:.4f} test acc={test_accuracy(layers):.3f}")
# Output:
# SGD final loss=0.4384 test acc=0.750

SGD reaches a final training loss of 0.4384 and a test accuracy of 0.750. That is a perfectly reasonable result, and it is the bar the next three optimizers have to clear. Watch the loss, not just the final number: SGD gets there, but it gets there slowly and unevenly.
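The training loop uses sigmoid(pred) - yb as the gradient of the binary cross-entropy loss with respect to the pre-sigmoid output. One way to convince yourself of that shortcut is to compare it with a finite-difference estimate on a few single examples. The sketch below does this with scalar math only. The helper names and the test points are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy for one example, given the pre-sigmoid output z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def numeric_grad(z, y, h=1e-6):
    """Central finite-difference estimate of d(loss)/dz."""
    return (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2 * h)

for z, y in [(-2.0, 1.0), (0.5, 0.0), (1.5, 1.0)]:
    analytic = sigmoid(z) - y          # the shortcut used in the training loop
    numeric = numeric_grad(z, y)
    print(f"z={z:+.1f} y={y:.0f} analytic={analytic:+.6f} numeric={numeric:+.6f}")
```

The two columns should agree to several decimal places, which is why the loop never needs to differentiate through the sigmoid and the logarithm separately.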
Momentum: Smoothing the Path
SGD’s weakness is that it forgets. Each step reacts only to the current noisy gradient, so the path zig-zags. Momentum fixes this by giving the optimizer a memory of where it has been heading.
The idea is an exponential moving average (EMA). Instead of acting on the raw gradient, you keep a running, decaying average of recent gradients. Think of predicting tomorrow’s weather: today’s temperature matters most, but yesterday’s and last week’s still carry information. An EMA blends them, weighting recent values more heavily and letting old ones fade.
The classic EMA of some quantity $x$ is:
$$\text{EMA}_t = \beta \cdot \text{EMA}_{t-1} + (1 - \beta) \cdot x_t$$
where $\beta$ (between 0 and 1) is the decay coefficient: a higher $\beta$ means the average changes more slowly and remembers further back. You can see this smoothing on a simple sequence.
values = np.array([8, 9, 9, 8, 2, 3, 7, 3, 7, 2], dtype=float)
ema = values[0]
beta = 0.5
smoothed = []
for v in values:
    ema = beta * ema + (1 - beta) * v
    smoothed.append(round(float(ema), 3))  # float() keeps the printed list plain
print(smoothed)
# Output:
# [8.0, 8.5, 8.75, 8.375, 5.188, 4.094, 5.547, 4.273, 5.637, 3.818]

The smoothed sequence swings far less than the raw values, exactly the calming effect you want on noisy gradients.
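To see how the decay coefficient controls the amount of smoothing, the sketch below reruns the same toy sequence with three values of beta and measures the spread (maximum minus minimum) of each smoothed sequence. The helpers `ema_sequence` and `spread` are illustrative names, not part of the lesson's code.

```python
def ema_sequence(values, beta):
    """Exponential moving average, seeded with the first value as in the lesson."""
    ema = values[0]
    out = []
    for v in values:
        ema = beta * ema + (1 - beta) * v
        out.append(ema)
    return out

def spread(seq):
    return max(seq) - min(seq)

values = [8, 9, 9, 8, 2, 3, 7, 3, 7, 2]
for beta in (0.0, 0.5, 0.9):
    smoothed = ema_sequence(values, beta)
    print(f"beta={beta}: spread={spread(smoothed):.2f}")
# Output:
# beta=0.0: spread=7.00
# beta=0.5: spread=4.93
# beta=0.9: spread=1.95
```

With beta=0.0 the average simply copies the raw values. As beta rises, the sequence reacts less to each new value and the spread shrinks.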
For momentum the update is written with a velocity vector $v$ that accumulates gradient history, then the weights move along that velocity:
$$v_t = \beta\, v_{t-1} - \eta\, g_t$$
$$w \leftarrow w + v_t$$
Here $g_t$ is the current gradient, $v_{t-1}$ is the previous velocity, $\beta$ controls how quickly it decays, and $\eta$ is the learning rate. Because old gradients keep getting multiplied by $\beta$ each step, their influence shrinks exponentially over time, hence the "exponential" in exponential moving average. In code, you store one velocity array per parameter and update it before applying it.
class Momentum:
    def __init__(self, lr, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.velocities = None

    def step(self, layer_grads, layers, batch_size):
        if self.velocities is None:
            # One zero-initialized velocity per parameter array
            self.velocities = [[np.zeros_like(w), np.zeros_like(b)]
                               for w, b in layer_grads]
        for idx, ((w_grad, b_grad), layer) in enumerate(
                zip(layer_grads, reversed(layers))):
            w_grad = w_grad / batch_size
            v_w, v_b = self.velocities[idx]
            v_w = self.beta * v_w - self.lr * w_grad
            v_b = self.beta * v_b - self.lr * b_grad
            layer.update(v_w, v_b)
            self.velocities[idx] = [v_w, v_b]

layers, momentum_losses = train(Momentum(lr=0.1, beta=0.9))
print(f"Momentum final loss={momentum_losses[-1]:.4f} "
      f"test acc={test_accuracy(layers):.3f}")
# Output:
# Momentum final loss=0.3845 test acc=0.719

Momentum drives the training loss down to 0.3845, noticeably lower than SGD’s 0.4384, and it gets there faster because consistent gradient directions reinforce each other while noise cancels out. Notice the test accuracy slipped to 0.719: a lower training loss does not automatically mean a better model, a tension you will explore properly in the next lesson on regularization.
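The way consistent gradient directions reinforce each other can also be seen without any noise at all. The sketch below runs the same velocity update on a made-up elongated bowl that is 25 times steeper along y than along x. The toy loss, the starting point, and the settings lr=0.03 and beta=0.9 are assumptions chosen for illustration. It says nothing about the Diabetes results.

```python
def loss(x, y):
    """An elongated bowl: 25x steeper along y than along x (toy loss)."""
    return 0.5 * (x ** 2 + 25 * y ** 2)

def grad(x, y):
    return x, 25 * y

def run(lr, beta, steps):
    """beta=0 is plain gradient descent; beta>0 adds a velocity term."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx - lr * gx   # velocity accumulates past gradients
        vy = beta * vy - lr * gy
        x, y = x + vx, y + vy      # weights move along the velocity
    return loss(x, y)

plain = run(lr=0.03, beta=0.0, steps=200)
with_momentum = run(lr=0.03, beta=0.9, steps=200)
print(f"plain GD loss after 200 steps: {plain:.2e}")
print(f"momentum loss after 200 steps: {with_momentum:.2e}")
```

The learning rate has to stay small enough for the steep direction, so plain gradient descent crawls along the shallow one. With momentum, the small but consistent gradients along the shallow direction add up in the velocity, and the loss after the same number of steps ends up far lower.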
A good default for beta
$\beta = 0.9$ is the standard starting point for momentum, meaning roughly the last ten gradients dominate the average. Raise it toward 0.99 for smoother but more sluggish updates; lower it for a more reactive but noisier path.
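The "last ten gradients" rule of thumb comes from the weights an EMA places on past values: the gradient from k steps ago is weighted by (1 - beta) * beta^k, and 1 / (1 - beta) is a common estimate of the memory length. The sketch below tabulates this for a few values of beta. The helper name `ema_weights` is an illustrative assumption. The lesson's Momentum class omits the (1 - beta) factor, which rescales the weights but leaves their relative sizes unchanged.

```python
def ema_weights(beta, n_terms):
    """Weight an EMA places on the value from k steps ago, k = 0..n_terms-1."""
    return [(1 - beta) * beta ** k for k in range(n_terms)]

for beta in (0.5, 0.9, 0.99):
    window = 1 / (1 - beta)                    # rule-of-thumb memory length
    recent_share = sum(ema_weights(beta, 10))  # weight on the 10 newest values
    print(f"beta={beta}: window ~{window:.0f} steps, "
          f"10 newest gradients carry {recent_share:.0%} of the weight")
```

For beta = 0.9 the ten newest gradients carry 1 - 0.9^10, about 65 percent, of the total weight. For beta = 0.99 the same ten carry only about 10 percent, which is why high values feel sluggish.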
RMSprop: Per-Parameter Step Sizes
Momentum smooths which direction you move, but every weight still shares the same learning rate. That is wasteful. Some weights sit on steep parts of the loss surface and need small steps; others sit on flat parts and could take large ones. A single global learning rate has to compromise.
RMSprop (root mean square propagation) gives each parameter its own adaptive step size. The trick is to track an exponential moving average of the squared gradient for each parameter, then divide the update by the square root of that average. A weight whose gradient has been consistently large gets a big denominator and therefore a small step; a weight with tiny gradients gets a small denominator and a relatively larger step. The result automatically balances the step sizes across parameters.
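You keep a running squared-gradient average $s$ and divide each step by its square root:
$$s_t = \beta\, s_{t-1} + (1 - \beta)\, g_t^2$$
$$w \leftarrow w - \eta\, \frac{g_t}{\sqrt{s_t} + \epsilon}$$
where $g_t$ is the current gradient. The small constant $\epsilon$ (typically $10^{-8}$) sits in the denominator only to prevent division by something too close to zero. Notice that the gradient is now divided by its own recent magnitude, so each step is roughly the size of the learning rate itself even when the gradient is tiny, which is why RMSprop usually needs a smaller base learning rate than SGD.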
class RMSprop:
    def __init__(self, lr, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.sq_avgs = None

    def step(self, layer_grads, layers, batch_size):
        if self.sq_avgs is None:
            # One zero-initialized squared-gradient average per parameter array
            self.sq_avgs = [[np.zeros_like(w), np.zeros_like(b)]
                            for w, b in layer_grads]
        for idx, ((w_grad, b_grad), layer) in enumerate(
                zip(layer_grads, reversed(layers))):
            w_grad = w_grad / batch_size
            s_w, s_b = self.sq_avgs[idx]
            s_w = self.beta * s_w + (1 - self.beta) * np.square(w_grad)
            s_b = self.beta * s_b + (1 - self.beta) * np.square(b_grad)
            w_update = -self.lr * w_grad / (np.sqrt(s_w) + self.eps)
            b_update = -self.lr * b_grad / (np.sqrt(s_b) + self.eps)
            layer.update(w_update, b_update)
            self.sq_avgs[idx] = [s_w, s_b]

layers, rmsprop_losses = train(RMSprop(lr=0.01, beta=0.9))
print(f"RMSprop final loss={rmsprop_losses[-1]:.4f} "
      f"test acc={test_accuracy(layers):.3f}")
# Output:
# RMSprop final loss=0.3249 test acc=0.729

With a learning rate of just 0.01, RMSprop reaches a training loss of 0.3249, lower than both SGD and momentum, with test accuracy 0.729. The per-parameter scaling lets it make confident progress on flat directions that plain SGD would crawl across.
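The per-parameter scaling is easiest to see on two scalar parameters whose gradients differ by a factor of 100. The sketch below is a simplified scalar version of the RMSprop update. The constant gradient values and the helper names are assumptions made for illustration.

```python
import math

def sgd_step(lr, grad):
    return -lr * grad

def rmsprop_steps(lr, grads, beta=0.9, eps=1e-8):
    """Return the update applied at each step for one scalar parameter."""
    s = 0.0
    updates = []
    for g in grads:
        s = beta * s + (1 - beta) * g * g   # running mean of squared gradients
        updates.append(-lr * g / (math.sqrt(s) + eps))
    return updates

steep = [10.0] * 50   # hypothetical parameter with consistently large gradients
flat = [0.1] * 50     # hypothetical parameter with consistently small gradients

print("SGD steps:    ", sgd_step(0.01, steep[-1]), sgd_step(0.01, flat[-1]))
print("RMSprop steps:", round(rmsprop_steps(0.01, steep)[-1], 5),
      round(rmsprop_steps(0.01, flat)[-1], 5))
```

Under SGD the two steps differ by the same factor of 100 as the gradients. Under RMSprop both parameters end up taking steps of nearly the same size, close to the learning rate, because each gradient is divided by its own running magnitude.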
Adaptive optimizers want smaller learning rates
Because RMSprop and Adam divide each update by the gradient’s recent magnitude, the effective step can be large even when the base learning rate is small. Reusing SGD’s learning rate of 0.1 with these optimizers will often make training diverge. Start adaptive optimizers around 0.01 or lower.
Adam: Momentum Meets RMSprop
You now have two complementary ideas. Momentum tracks the first moment of the gradient (its running mean, telling you which way to go). RMSprop tracks the second moment (its running mean square, telling you how big the gradient tends to be). Adam (adaptive moment estimation) keeps both at once: it steps in the smoothed momentum direction and scales that step per parameter using the squared-gradient average. It is the default optimizer for the vast majority of modern deep learning, including large language models.
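Adam maintains a first-moment estimate $m$ and a second-moment estimate $v$, each with its own decay coefficient:
$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$
There is one subtlety. Both $m$ and $v$ start at zero, so during the first few iterations they are biased toward zero and badly underestimate the true averages. Adam corrects this with bias correction, dividing by $1 - \beta^t$, where $t$ is the step number:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Early on, $1 - \beta^t$ is small, so dividing by it boosts the estimates up to where they should be; as $t$ grows, $\beta^t$ shrinks toward zero and the correction factor approaches 1, leaving the estimates untouched. The final update combines the two corrected moments:
$$w \leftarrow w - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$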
You can watch bias correction at work on the same toy sequence, tracking both the mean and the (uncentered) variance.
values = np.array([8, 9, 9, 8, 2, 3, 7, 3, 7, 2], dtype=float)
m, v = 0.0, 0.0
beta1, beta2 = 0.5, 0.9
means, varis = [], []
for t, g in enumerate(values, start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)  # bias-corrected (uncentered) variance
    means.append(round(float(m_hat), 2))
    varis.append(round(float(v_hat), 1))
print("mean:", means[:5])
print("var :", varis[:5])
# Output:
# mean: [8.0, 8.67, 8.86, 8.4, 5.1]
# var : [64.0, 72.9, 75.9, 72.5, 55.7]

Without the / (1 - beta ** t) term, the very first mean estimate would be only 4.0 instead of 8.0, half the true value. Bias correction pulls it back to the right scale immediately. Here is the full optimizer.
class Adam:
    def __init__(self, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.moments = None
        self.t = 0

    def step(self, layer_grads, layers, batch_size):
        if self.moments is None:
            # [first moment, second moment] per parameter
            self.moments = [[[np.zeros_like(w), np.zeros_like(b)],
                             [np.zeros_like(w), np.zeros_like(b)]]
                            for w, b in layer_grads]
        self.t += 1
        for idx, ((w_grad, b_grad), layer) in enumerate(
                zip(layer_grads, reversed(layers))):
            w_grad = w_grad / batch_size
            (m_w, m_b), (v_w, v_b) = self.moments[idx]
            # First moment (momentum) and second moment (RMSprop), per param
            m_w = self.beta1 * m_w + (1 - self.beta1) * w_grad
            m_b = self.beta1 * m_b + (1 - self.beta1) * b_grad
            v_w = self.beta2 * v_w + (1 - self.beta2) * np.square(w_grad)
            v_b = self.beta2 * v_b + (1 - self.beta2) * np.square(b_grad)
            # Bias correction
            mhat_w = m_w / (1 - self.beta1 ** self.t)
            mhat_b = m_b / (1 - self.beta1 ** self.t)
            vhat_w = v_w / (1 - self.beta2 ** self.t)
            vhat_b = v_b / (1 - self.beta2 ** self.t)
            w_update = -self.lr * mhat_w / (np.sqrt(vhat_w) + self.eps)
            b_update = -self.lr * mhat_b / (np.sqrt(vhat_b) + self.eps)
            layer.update(w_update, b_update)
            self.moments[idx] = [[m_w, m_b], [v_w, v_b]]

layers, adam_losses = train(Adam(lr=0.01))
print(f"Adam final loss={adam_losses[-1]:.4f} "
      f"test acc={test_accuracy(layers):.3f}")
# Output:
# Adam final loss=0.3139 test acc=0.708

Adam reaches the lowest training loss of all four, 0.3139, and it descends the most smoothly. Its test accuracy here is 0.708, the lowest of the group, a reminder that Adam’s eagerness to minimize training loss can mean it overfits a small dataset. Techniques to counter that are exactly the subject of the next lesson.
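The bias-correction factor used in this section can be sanity-checked on the simplest possible input, a gradient that never changes. In that case a zero-initialized EMA equals the true value times (1 - beta^t), so dividing by that factor should recover the true value at every step. The constant gradient of 2.0 and the helper name below are illustrative assumptions.

```python
def ema_with_correction(grads, beta):
    """Zero-initialized EMA, returned both raw and bias-corrected at every step."""
    m = 0.0
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g
        raw.append(m)
        corrected.append(m / (1 - beta ** t))
    return raw, corrected

raw, corrected = ema_with_correction([2.0] * 20, beta=0.9)
for t in (1, 5, 20):
    print(f"t={t:2d} raw={raw[t - 1]:.3f} corrected={corrected[t - 1]:.3f}")
# Output:
# t= 1 raw=0.200 corrected=2.000
# t= 5 raw=0.819 corrected=2.000
# t=20 raw=1.757 corrected=2.000
```

The raw average is still well short of 2.0 after twenty steps, while the corrected value is right from the first step. With real, changing gradients the correction is not exact, but it removes the same startup bias.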
Adam’s defaults are famous for a reason
The values $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ work so well across so many problems that they are the defaults, or very close to them, in the major deep learning frameworks. When in doubt, start with Adam and these settings, and tune only the learning rate.
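One reason the learning rate is the main knob left to tune is that Adam's update size barely depends on the gradient's scale. The sketch below works out the very first update (t = 1) for a single scalar parameter, using the default coefficients with an assumed learning rate of 0.01. The function name `adam_first_step` and the gradient values are illustrative.

```python
import math

def adam_first_step(grad, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Very first Adam update for one scalar parameter (t = 1)."""
    m = (1 - beta1) * grad        # first moment after one step from zero
    v = (1 - beta2) * grad ** 2   # second moment after one step from zero
    m_hat = m / (1 - beta1)       # bias correction with t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

for g in (1000.0, 1.0, 0.001, -0.001):
    print(f"grad={g:>8}: first update = {adam_first_step(g):+.6f}")
```

Whether the gradient is 1000 or 0.001, the first update has a magnitude of about 0.01, the learning rate, and only its sign follows the gradient. Later steps are not exactly this size, but the learning rate keeps setting the scale.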
Comparing All Four Optimizers
Because every run used the same network, data, and loss, you can put the four training-loss curves on a single chart and read off exactly how each optimizer behaves.
The pattern is clear. Plain SGD is the slowest and settles highest. Momentum descends faster and lands lower by smoothing the path. The two adaptive optimizers, RMSprop and Adam, pull away from both by giving each parameter its own step size, with Adam reaching the lowest training loss of all.
| Optimizer | Learning rate | Final training loss | Test accuracy |
|---|---|---|---|
| SGD | 0.1 | 0.4384 | 0.750 |
| Momentum | 0.1 | 0.3845 | 0.719 |
| RMSprop | 0.01 | 0.3249 | 0.729 |
| Adam | 0.01 | 0.3139 | 0.708 |
Read this table carefully, because it tells two stories. Going down the training loss column, each optimizer beats the last: adaptive methods are genuinely better at minimizing the objective you gave them. But the test accuracy column does not follow the same order. Plain SGD, the worst at minimizing training loss, generalizes best on this small dataset, while Adam, the best at minimizing training loss, generalizes worst.
That is not a contradiction; it is the central lesson of model training. A faster, more powerful optimizer fits the training data more aggressively, which on a small dataset can mean fitting noise. Driving training loss to the floor is not the goal. Generalizing to new patients is. You will learn the tools to close that gap, without giving up the speed of adaptive optimizers, in the next lesson.
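The two stories in the table can be read off mechanically by ranking the optimizers on each column. The sketch below copies the four rows from the comparison table and sorts them both ways. The variable names are illustrative.

```python
# (final training loss, test accuracy) copied from the comparison table
results = {
    "SGD": (0.4384, 0.750),
    "Momentum": (0.3845, 0.719),
    "RMSprop": (0.3249, 0.729),
    "Adam": (0.3139, 0.708),
}

by_train_loss = sorted(results, key=lambda name: results[name][0])
by_test_acc = sorted(results, key=lambda name: results[name][1], reverse=True)

print("best -> worst by training loss:", by_train_loss)
print("best -> worst by test accuracy:", by_test_acc)
# Output:
# best -> worst by training loss: ['Adam', 'RMSprop', 'Momentum', 'SGD']
# best -> worst by test accuracy: ['SGD', 'RMSprop', 'Momentum', 'Adam']
```

On this run the best and worst positions swap between the two rankings, which is the gap between fitting and generalizing that the paragraph above describes.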
Practice Exercises
Now it is your turn. Try these before checking the hints.
Exercise 1: Tune Momentum’s Beta
Run the momentum optimizer twice more, once with beta=0.5 and once with beta=0.99, keeping lr=0.1. Print the final training loss for each and compare them to the beta=0.9 result from the lesson.
# Reuse Momentum, train, and test_accuracy from the lesson
# Your code here

Hint
Call train(Momentum(lr=0.1, beta=0.5)) and train(Momentum(lr=0.1, beta=0.99)), then read losses[-1] from each return value. A higher beta remembers more history and smooths the path more, but can overshoot if it is too high.
Exercise 2: Give Adam the SGD Learning Rate
Train Adam with lr=0.1 instead of 0.01 and observe what happens to the loss curve. Why does the larger learning rate hurt Adam so much more than it hurt SGD?
# Your code here
layers, losses = train(Adam(lr=0.1))
print("Adam lr=0.1 final loss:", round(losses[-1], 4))

Hint
Adam already divides each update by $\sqrt{\hat{v}}$, the square root of its bias-corrected squared-gradient average, so each step is roughly the size of the base learning rate no matter how small the gradient is. A base learning rate of 0.1 makes the steps far too large and the loss will be much higher (or unstable) than the smooth descent you saw at 0.01. This is why adaptive optimizers use smaller learning rates than SGD.
Exercise 3: Build RMSprop Without Per-Parameter Scaling
Modify RMSprop so it divides the update by a single shared scalar (the mean of all squared gradients in a layer) instead of an element-wise array. Train it and compare the final loss to the real RMSprop. What does the per-parameter version buy you?
# Start from the RMSprop class and change how s_w / s_b are used
# Your code here

Hint
Replace np.sqrt(s_w) with np.sqrt(np.mean(s_w)), a single number, so every weight in the layer shares one step size. You should find it does worse than true RMSprop, because the whole point of RMSprop is to scale each parameter independently based on its own gradient history.
Summary
Congratulations! You have implemented four optimizers from scratch and seen exactly how each one improves on the last. Let’s review what you learned.
Key Concepts
What an Optimizer Does
- An optimizer is the rule for updating weights given the current gradient
- The learning rate sets the base step size and is the most important hyperparameter
- SGD uses the raw minibatch gradient and has no memory, so its path is slow and noisy
Momentum
- Tracks an exponential moving average of past gradients (the first moment)
- The decay coefficient $\beta$ controls how much history is remembered; 0.9 is the standard default
- Smooths the update direction, descending faster and more stably than plain SGD
RMSprop
- Tracks an EMA of squared gradients (the second moment) per parameter
- Divides each update by the square root of that average, $\sqrt{s}$, giving every weight its own adaptive step size
- Needs a smaller base learning rate than SGD because the effective step is amplified
Adam
- Combines momentum (first moment) and RMSprop (second moment) in one optimizer
- Applies bias correction, dividing by $1 - \beta^t$ at step $t$, so early estimates are not stuck near zero
- Reaches the lowest training loss and is the default optimizer for modern deep learning
The Bigger Picture
- Adaptive optimizers minimize training loss fastest, but lower training loss is not the goal
- On a small dataset, the most aggressive optimizer can generalize worst (Adam’s lower test accuracy here)
Why This Matters
Optimizers are the engine of every neural network you will ever train. Understanding them is the difference between blindly accepting a framework’s default and knowing why Adam usually wins, why it wants a smaller learning rate, and when plain SGD might actually generalize better. Almost every model you encounter, from a small classifier to a large language model, is trained by some descendant of the four optimizers you just built by hand.
You also saw the most important caveat in all of model training: a lower training loss is not automatically a better model. The adaptive optimizers drove the loss down fastest, yet plain SGD generalized best on this small dataset. That gap between fitting the training data and performing on new data is the single most important problem in machine learning, and closing it is exactly what the next lesson is about.
Next Steps
You now understand how momentum, RMSprop, and Adam train networks faster than plain gradient descent, and you have seen the overfitting risk that comes with aggressive optimization. In the next lesson, you will learn the techniques that keep powerful optimizers from memorizing the training data.
Continue to Lesson 6 - Regularizing Neural Networks
Learn weight decay and other techniques to close the gap between training and test performance.
Keep Building Your Skills
You have gone from a single update rule to a full family of optimizers, and you understand the trade-offs between them rather than just their names. Every time you call optimizer="adam" in a framework from now on, you will know exactly what is happening inside: smoothed momentum, per-parameter scaling, and bias correction, all working together. Keep that mental model close as you move into regularization, because the speed of these optimizers and the discipline of regularization are two halves of training a model that actually works on data it has never seen.