Lesson 5 - Optimizers: SGD, RMSprop, and Adam

Welcome to Optimizers

In the last lesson you trained a neural network with plain gradient descent: compute the gradient, take a small step downhill, repeat. It works, but it is rarely the fastest or most stable way to learn. This lesson introduces the optimizers that practitioners actually reach for, including the one that trains almost every modern model. You will see why plain gradient descent struggles, then build three improvements on top of it: momentum, RMSprop, and Adam.

By the end of this lesson, you will be able to:

Explain what an optimizer does and why stochastic gradient descent (SGD) alone is often slow and noisy
Describe how momentum smooths the update path using an exponential moving average of past gradients
Describe how RMSprop gives each parameter its own adaptive step size
Explain how Adam combines momentum and RMSprop, including bias correction
Implement all four optimizers in NumPy and compare their training-loss curves on a real dataset

You should be comfortable with the gradient descent training loop from the previous lesson, basic NumPy, and the idea of a loss function and its gradient. Let’s begin.

What an Optimizer Actually Does

Training a neural network means searching for the weights that make the loss as small as possible. An optimizer is the rule that decides, at each step, how to change the weights given the current gradient. Every optimizer answers the same question: I know which direction is downhill right now; how big a step should I take, and in exactly what direction?

Plain gradient descent gives the simplest possible answer: step a fixed fraction of the gradient, directly opposite to it. The fraction is the learning rate $\alpha$. For a weight $w$ with loss gradient $\nabla L(w_t)$, the update is:

$$w_{t+1} = w_t - \alpha \, \nabla L(w_t)$$
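
To make the update rule concrete, here is a minimal sketch on a made-up one-dimensional loss $L(w) = w^2$, whose gradient is $2w$. The loss, the function name `gradient_descent`, and the two learning rates are illustrative assumptions and are not part of the lesson's network.

```python
def gradient_descent(w0, lr, steps):
    """Run plain gradient descent on the toy loss L(w) = w**2."""
    w = w0
    for _ in range(steps):
        grad = 2 * w          # dL/dw for L(w) = w**2
        w = w - lr * grad     # w_{t+1} = w_t - alpha * grad
    return w

w_small = gradient_descent(w0=5.0, lr=0.1, steps=50)   # shrinks toward 0
w_large = gradient_descent(w0=5.0, lr=1.1, steps=50)   # overshoots further each step

print(abs(w_small) < 1e-3, abs(w_large) > 5.0)
# True True
```

On this loss each step multiplies $w$ by $(1 - 2\alpha)$, so any learning rate between 0 and 1 converges and anything above 1 diverges. Real loss surfaces are not this tidy, but the same trade-off applies.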

When you compute the gradient on a small minibatch of examples rather than the entire dataset, this is called stochastic gradient descent, or SGD. The word stochastic (random) refers to the fact that each minibatch gives a slightly different, noisy estimate of the true gradient.

That noise is the problem. SGD has no memory. At every step it reacts only to the current minibatch’s gradient, so when the gradient jumps around from batch to batch, the weights jump around too. The path to the minimum zig-zags instead of heading straight down, and you waste iterations correcting overshoots.
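
The noise is easy to see on a toy problem. The sketch below uses synthetic data and a one-weight linear model (both assumptions made for illustration, not the lesson's network) and compares the full-dataset gradient with the gradients from individual minibatches.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)
y = 3.0 * x + rng.normal(scale=0.5, size=64)   # synthetic targets, true slope 3

def mse_gradient(w, xb, yb):
    """Gradient of mean((w * x - y)**2) with respect to the single weight w."""
    return np.mean(2 * (w * xb - yb) * xb)

w = 0.0
full_grad = mse_gradient(w, x, y)

# Split the 64 examples into 8 minibatches of 8 and compute one gradient per batch
batch_grads = np.array([
    mse_gradient(w, x[i:i + 8], y[i:i + 8]) for i in range(0, 64, 8)
])

print("full gradient:      ", round(float(full_grad), 3))
print("minibatch gradients:", np.round(batch_grads, 3))
print("mean of minibatches:", round(float(batch_grads.mean()), 3))
```

Because the batches are equal-sized, the minibatch gradients average out to the full gradient, but each individual one lands somewhere around it. SGD follows those individual estimates one at a time.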

Optimizer vs. learning rate

The optimizer is the strategy for updating weights; the learning rate is the single most important number that strategy uses. The same learning rate that is perfect for SGD can be far too large for an adaptive optimizer like Adam, which is why you will see different learning rates for different optimizers later in this lesson.

The Dataset: Predicting Diabetes

Throughout this lesson you will train the same small neural network with four different optimizers and watch how each one drives the loss down. Using one fixed problem makes the comparison fair: any difference in the loss curves comes from the optimizer, not the data.

You will use the real Diabetes dataset, where each row is a patient and the goal is to predict whether that patient tested positive for diabetes from eight medical measurements.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# download: https://datatweets.com/datasets/diabetes.csv
df = pd.read_csv("diabetes.csv")

print("Shape:", df.shape)
print("Outcome balance:")
print(df["Outcome"].value_counts())
# Output:
# Shape: (768, 9)
# Outcome balance:
# Outcome
# 0    500
# 1    268
# Name: count, dtype: int64

There are 768 patients and 9 columns: eight features plus the Outcome target (1 for a positive diabetes test, 0 for negative). About 35 percent of patients are positive, so the classes are uneven but not extreme.

Just as in earlier lessons, you split the data, then scale the features so each one has mean 0 and standard deviation 1. Scaling matters even more here: optimizers that adapt their step size per parameter behave best when all features live on a comparable scale.

X = df.drop(columns=["Outcome"]).values
y = df["Outcome"].values.reshape(-1, 1).astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn mean/std on TRAIN only
X_test = scaler.transform(X_test)         # apply the SAME transform to test

print("Train:", X_train.shape, " Test:", X_test.shape)
# Output:
# Train: (614, 8)  Test: (154, 8)

A Network You Can Swap Optimizers Into

To compare optimizers cleanly, you need a network whose layers expose their gradients rather than updating themselves. That way the optimizer, not the layer, owns the update rule, and you can drop in any optimizer without touching the network.

The Dense layer below does a forward pass and a backward pass, but its backward method returns the weight and bias gradients instead of applying them. A separate update method applies whatever change the optimizer hands back.

import math

class Dense:
    def __init__(self, input_size, output_size, activation=True, seed=0):
        self.add_activation = activation
        np.random.seed(seed)
        # Initialize weights uniformly in [-k, k], with k = sqrt(1 / input_size)
        k = math.sqrt(1 / input_size)
        self.weights = np.random.rand(input_size, output_size) * (2 * k) - k
        self.bias = np.zeros((1, output_size))
        self.prev_input = None
        self.output = None

    def forward(self, x):
        self.prev_input = x.copy()
        x = x @ self.weights + self.bias
        if self.add_activation:
            x = np.maximum(x, 0)          # ReLU
        self.output = x.copy()
        return x

    def backward(self, grad):
        if self.add_activation:
            grad = grad * np.heaviside(self.output, 0)   # gradient of ReLU
        w_grad = self.prev_input.T @ grad
        b_grad = np.sum(grad, axis=0, keepdims=True)
        next_grad = grad @ self.weights.T
        return [w_grad, b_grad], next_grad

    def update(self, w_update, b_update):
        # Apply whatever change the optimizer computed
        self.weights += w_update
        self.bias += b_update

A small helper builds a two-layer network: 8 inputs to 16 hidden units with ReLU, then 16 hidden units to a single output that you push through a sigmoid for a probability.

def build_network():
    return [
        Dense(8, 16),
        Dense(16, 1, activation=False),
    ]

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x, layers):
    for layer in layers:
        x = layer.forward(x)
    return x

def backward(grad, layers):
    layer_grads = []
    for layer in reversed(layers):          # output layer first
        param_grads, grad = layer.backward(grad)
        layer_grads.append(param_grads)
    return layer_grads

With the network in place, each optimizer below is just a small class that takes the returned gradients and decides what update to apply.
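
Since every optimizer relies on the gradients that backward returns, it is worth checking them numerically. The sketch below is a standalone finite-difference check for a single linear layer with no activation. The helper `loss_fn`, the squared-error loss, and the random shapes are assumptions for illustration and are not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # batch of 5 examples, 3 features
W = rng.normal(size=(3, 2))        # weights of a linear layer with 2 outputs
b = np.zeros((1, 2))
target = rng.normal(size=(5, 2))

def loss_fn(W, b):
    """Summed squared error of the linear layer's output."""
    out = X @ W + b
    return 0.5 * np.sum((out - target) ** 2)

# Analytic gradients, using the same formulas as Dense.backward
grad_out = (X @ W + b) - target            # dLoss/dOutput
w_grad = X.T @ grad_out
b_grad = np.sum(grad_out, axis=0, keepdims=True)

# Central finite differences for every weight
eps = 1e-6
w_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        w_numeric[i, j] = (loss_fn(W_plus, b) - loss_fn(W_minus, b)) / (2 * eps)

print("max weight-gradient error:", np.max(np.abs(w_grad - w_numeric)))
```

If the analytic and numeric gradients disagree by more than a tiny rounding error, the bug is in the backward pass, not in the optimizer, which is a useful thing to rule out before comparing optimizers.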

SGD: The Baseline

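Start with plain SGD so you have something to beat. The optimizer divides the weight gradient by the batch size to get the average gradient, scales it by the learning rate, and negates it, so that layer.update (which adds whatever it is given) moves the weights against the gradient. Note that the code leaves the bias gradient as a sum over the batch rather than an average, so the biases take proportionally larger steps than the weights.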

class SGD:
    def __init__(self, lr):
        self.lr = lr

    def step(self, layer_grads, layers, batch_size):
        for (w_grad, b_grad), layer in zip(layer_grads, reversed(layers)):
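            w_grad = w_grad / batch_size      # average the weight gradient over the batch
            w_update = -self.lr * w_grad
            # b_grad stays a sum over the batch (it is not divided by batch_size)
            b_update = -self.lr * b_grad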
            layer.update(w_update, b_update)

Here is the training loop. It runs minibatches through the network, computes the gradient of the binary cross-entropy loss, and lets the optimizer update the weights. It records the average training loss each epoch so you can plot it later.

def bce_loss(pred, y):
    p = np.clip(sigmoid(pred), 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def train(optimizer, epochs=100, batch_size=16, seed=0):
    layers = build_network()
    losses = []
    for epoch in range(epochs):
        epoch_loss = 0.0
        n_batches = 0
        for i in range(0, len(X_train), batch_size):
            xb = X_train[i:i + batch_size]
            yb = y_train[i:i + batch_size]

            pred = forward(xb, layers)
            # Gradient of BCE w.r.t. the pre-sigmoid output
            grad = sigmoid(pred) - yb
            layer_grads = backward(grad, layers)
            optimizer.step(layer_grads, layers, len(xb))

            epoch_loss += bce_loss(pred, yb)
            n_batches += 1
        losses.append(epoch_loss / n_batches)
    return layers, losses

def test_accuracy(layers):
    probs = sigmoid(forward(X_test, layers))
    preds = (probs >= 0.5).astype(int)
    return np.mean(preds == y_test)

layers, sgd_losses = train(SGD(lr=0.1))
print(f"SGD   final loss={sgd_losses[-1]:.4f}  test acc={test_accuracy(layers):.3f}")
# Output:
# SGD   final loss=0.4384  test acc=0.750

SGD reaches a final training loss of 0.4384 and a test accuracy of 0.750. That is a perfectly reasonable result, and it is the bar the next three optimizers have to clear. Watch the loss, not just the final number: SGD gets there, but it gets there slowly and unevenly.
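
The training loop uses sigmoid(pred) - yb as the gradient of the loss with respect to the pre-sigmoid output. A quick finite-difference check on made-up logits suggests why that shortcut is valid for the summed binary cross-entropy. The logit values and the helper `summed_bce` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def summed_bce(logits, y):
    """Binary cross-entropy summed (not averaged) over the batch."""
    p = sigmoid(logits)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

logits = np.array([[-1.5], [0.2], [2.0]])   # hypothetical pre-sigmoid outputs
y = np.array([[0.0], [1.0], [1.0]])

analytic = sigmoid(logits) - y              # the expression used in train()

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    up, down = logits.copy(), logits.copy()
    up[i, 0] += eps
    down[i, 0] -= eps
    numeric[i, 0] = (summed_bce(up, y) - summed_bce(down, y)) / (2 * eps)

print(np.round(analytic.ravel(), 4))
print(np.round(numeric.ravel(), 4))
```

The two printed rows should agree. Because this is the gradient of the summed loss, the optimizers divide the weight gradient by the batch size afterwards to turn it into an average.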

Momentum: Smoothing the Path

SGD’s weakness is that it forgets. Each step reacts only to the current noisy gradient, so the path zig-zags. Momentum fixes this by giving the optimizer a memory of where it has been heading.

The idea is an exponential moving average (EMA). Instead of acting on the raw gradient, you keep a running, decaying average of recent gradients. Think of predicting tomorrow’s weather: today’s temperature matters most, but yesterday’s and last week’s still carry information. An EMA blends them, weighting recent values more heavily and letting old ones fade.

The classic EMA of some quantity $g$ is:

$$\text{ema}_t = \beta \, \text{ema}_{t-1} + (1 - \beta)\, g_t$$

where $\beta$ (between 0 and 1) is the decay coefficient: higher $\beta$ means the average changes more slowly and remembers further back. You can see this smoothing on a simple sequence.

values = np.array([8, 9, 9, 8, 2, 3, 7, 3, 7, 2], dtype=float)

ema = values[0]
beta = 0.5
smoothed = []
for v in values:
    ema = beta * ema + (1 - beta) * v
    smoothed.append(round(float(ema), 3))   # float() so the list prints plain numbers

print(smoothed)
# Output:
# [8.0, 8.5, 8.75, 8.375, 5.188, 4.094, 5.547, 4.273, 5.637, 3.818]

The smoothed sequence swings far less than the raw values, exactly the calming effect you want on noisy gradients.

For momentum the update is written with a velocity vector $v_t$ that accumulates gradient history, then the weights move along that velocity:

$$\begin{aligned} v_t &= \beta \, v_{t-1} - \alpha \, \nabla L(w_t) \\ w_{t+1} &= w_t + v_t \end{aligned}$$

Here $v_{t-1}$ is the previous velocity, $\beta$ controls how quickly it decays, and $\alpha$ is the learning rate. Because old gradients are multiplied by $\beta$ again at every step, their influence shrinks exponentially over time, which is the exponential part of the name. Note that this form scales the new gradient by $\alpha$ rather than by $(1 - \beta)$, so the velocity is an exponentially decaying sum of past steps rather than a normalized average. In code, you store one velocity array per parameter and update it before applying it.

class Momentum:
    def __init__(self, lr, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.velocities = None

    def step(self, layer_grads, layers, batch_size):
        if self.velocities is None:
            # One zero-initialized velocity per parameter array
            self.velocities = [[np.zeros_like(w), np.zeros_like(b)]
                               for w, b in layer_grads]
        for idx, ((w_grad, b_grad), layer) in enumerate(
                zip(layer_grads, reversed(layers))):
            w_grad = w_grad / batch_size
            v_w, v_b = self.velocities[idx]
            v_w = self.beta * v_w - self.lr * w_grad
            v_b = self.beta * v_b - self.lr * b_grad
            layer.update(v_w, v_b)
            self.velocities[idx] = [v_w, v_b]

layers, momentum_losses = train(Momentum(lr=0.1, beta=0.9))
print(f"Momentum  final loss={momentum_losses[-1]:.4f}  "
      f"test acc={test_accuracy(layers):.3f}")
# Output:
# Momentum  final loss=0.3845  test acc=0.719

Momentum drives the training loss down to 0.3845, noticeably lower than SGD’s 0.4384, and it gets there faster because consistent gradient directions reinforce each other while noise cancels out. Notice the test accuracy slipped to 0.719: a lower training loss does not automatically mean a better model, a tension you will explore properly in the next lesson on regularization.
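
One consequence of the velocity rule is easy to miss: under a steady gradient, the velocity keeps growing until it is about $1 / (1 - \beta)$ times a single SGD step. The sketch below checks that with a hypothetical constant gradient of 1.0. The numbers are illustrative only, since real gradients change at every step.

```python
lr, beta = 0.1, 0.9
grad = 1.0            # hypothetical constant gradient
v = 0.0

for step in range(200):
    v = beta * v - lr * grad      # same velocity rule as the Momentum class

sgd_step = -lr * grad
print("single SGD step:   ", sgd_step)
print("momentum velocity: ", round(v, 4))
print("limit -lr/(1-beta):", round(-lr * grad / (1 - beta), 4))
```

With $\beta = 0.9$ that is a roughly tenfold larger step in directions where the gradient is consistent, which helps explain both momentum's speed and its tendency to overshoot when $\beta$ is pushed toward 0.99.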

A good default for beta

$\beta = 0.9$ is the standard starting point for momentum, meaning roughly the last ten gradients dominate the average. Raise it toward 0.99 for smoother but more sluggish updates; lower it for a more reactive but noisier path.
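
The "last ten gradients" rule of thumb comes from the weights an exponential average assigns to past values. As a rough check, the sketch below computes those weights for a normalized EMA and adds up the share carried by the ten most recent gradients. The helper `ema_weights` and the 50-step horizon are arbitrary choices for illustration.

```python
import numpy as np

def ema_weights(beta, horizon=50):
    """Weight an EMA places on the gradient from k steps ago, k = 0..horizon-1."""
    k = np.arange(horizon)
    return (1 - beta) * beta ** k

for beta in (0.5, 0.9, 0.99):
    recent_share = ema_weights(beta)[:10].sum()   # equals 1 - beta**10
    print(f"beta={beta}: last 10 gradients carry {recent_share:.1%} of the weight")
```

With $\beta = 0.9$ the ten most recent gradients carry roughly 65 percent of the weight. At 0.99 they carry under 10 percent, which is why a high $\beta$ feels sluggish.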

RMSprop: Per-Parameter Step Sizes

Momentum smooths which direction you move, but every weight still shares the same learning rate. That is wasteful. Some weights sit on steep parts of the loss surface and need small steps; others sit on flat parts and could take large ones. A single global learning rate has to compromise.

RMSprop (root mean square propagation) gives each parameter its own adaptive step size. The trick is to track an exponential moving average of the squared gradient for each parameter, then divide the update by the square root of that average. A weight whose gradient has been consistently large gets a big denominator and therefore a small step; a weight with tiny gradients gets a small denominator and a relatively larger step. The result automatically balances the step sizes across parameters.

You keep a running squared-gradient average $s_t$:

$$\begin{aligned} s_t &= \beta \, s_{t-1} + (1 - \beta)\, \nabla L(w_t)^2 \\ w_{t+1} &= w_t - \alpha \, \frac{\nabla L(w_t)}{\sqrt{s_t} + \epsilon} \end{aligned}$$

The small constant $\epsilon$ (typically $10^{-8}$) sits in the denominator only to prevent division by something too close to zero. Notice that dividing by $\sqrt{s_t}$ normalizes each gradient by its own recent magnitude, so every parameter moves by roughly $\alpha$ per step no matter how small its raw gradient is. That is why RMSprop usually needs a smaller base learning rate than SGD.

class RMSprop:
    def __init__(self, lr, beta=0.9, eps=1e-8):
        self.lr = lr
        self.beta = beta
        self.eps = eps
        self.sq_avgs = None

    def step(self, layer_grads, layers, batch_size):
        if self.sq_avgs is None:
            self.sq_avgs = [[np.zeros_like(w), np.zeros_like(b)]
                            for w, b in layer_grads]
        for idx, ((w_grad, b_grad), layer) in enumerate(
                zip(layer_grads, reversed(layers))):
            w_grad = w_grad / batch_size
            s_w, s_b = self.sq_avgs[idx]
            s_w = self.beta * s_w + (1 - self.beta) * np.square(w_grad)
            s_b = self.beta * s_b + (1 - self.beta) * np.square(b_grad)
            w_update = -self.lr * w_grad / (np.sqrt(s_w) + self.eps)
            b_update = -self.lr * b_grad / (np.sqrt(s_b) + self.eps)
            layer.update(w_update, b_update)
            self.sq_avgs[idx] = [s_w, s_b]

layers, rmsprop_losses = train(RMSprop(lr=0.01, beta=0.9))
print(f"RMSprop  final loss={rmsprop_losses[-1]:.4f}  "
      f"test acc={test_accuracy(layers):.3f}")
# Output:
# RMSprop  final loss=0.3249  test acc=0.729

With a learning rate of just 0.01, RMSprop reaches a training loss of 0.3249, lower than both SGD and momentum, with test accuracy 0.729. The per-parameter scaling lets it make confident progress on flat directions that plain SGD would crawl across.
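
To see the balancing effect in isolation, consider two parameters whose gradients differ by a factor of 10,000. The gradients below are hypothetical and held constant, which is a simplifying assumption for illustration.

```python
import numpy as np

lr, beta, eps = 0.01, 0.9, 1e-8
grads = np.array([100.0, 0.01])    # hypothetical: one steep and one flat parameter
s = np.zeros_like(grads)

for step in range(100):
    s = beta * s + (1 - beta) * grads ** 2      # running average of squared gradients
    update = -lr * grads / (np.sqrt(s) + eps)   # RMSprop step

print("SGD steps:    ", -lr * grads)
print("RMSprop steps:", update)
```

Plain SGD would move the first parameter 10,000 times further than the second, while RMSprop moves both by roughly the learning rate. This is also why the base learning rate matters so much for adaptive methods: it is close to the actual per-step change in each weight.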

Adaptive optimizers want smaller learning rates

Because RMSprop and Adam divide each update by the gradient’s recent magnitude, the effective step can be large even when the base learning rate is small. Reusing SGD’s learning rate of 0.1 with these optimizers will often make training diverge. Start adaptive optimizers around 0.01 or lower.

Adam: Momentum Meets RMSprop

You now have two complementary ideas. Momentum tracks the first moment of the gradient (its running mean, telling you which way to go). RMSprop tracks the second moment (its running mean square, telling you how big the gradient tends to be). Adam (adaptive moment estimation) keeps both at once: it steps in the smoothed momentum direction and scales that step per parameter using the squared-gradient average. Adam and close variants of it, such as AdamW, are the default choice for much of modern deep learning, including large language models.

Adam maintains a first-moment estimate $m_t$ and a second-moment estimate $v_t$, each with its own decay coefficient:

$$\begin{aligned} m_t &= \beta_1 \, m_{t-1} + (1 - \beta_1)\, \nabla L(w_t) \\ v_t &= \beta_2 \, v_{t-1} + (1 - \beta_2)\, \nabla L(w_t)^2 \end{aligned}$$

There is one subtlety. Both $m$ and $v$ start at zero, so during the first few iterations they are biased toward zero and badly underestimate the true averages. Adam corrects this with bias correction, dividing by $1 - \beta^t$, where $t$ is the step number:

$$\begin{aligned} \hat{m}_t &= \frac{m_t}{1 - \beta_1^{\,t}} \\ \hat{v}_t &= \frac{v_t}{1 - \beta_2^{\,t}} \end{aligned}$$

Early on, $1 - \beta^t$ is small, so dividing by it boosts the estimates up to where they should be; as $t$ grows, $\beta^t$ shrinks toward zero and the correction fades to 1, leaving the estimates untouched. The final update combines the two corrected moments:

$$w_{t+1} = w_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
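
A minimal sketch of these equations as a single function may help. The name `adam_step` and the example gradients are hypothetical, chosen for illustration, and this is not the lesson's Adam class.

```python
import numpy as np

def adam_step(grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for an array of parameters; returns (update, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v

grad = np.array([250.0, -0.004])          # hypothetical gradients of very different size
m = np.zeros_like(grad)
v = np.zeros_like(grad)

update, m, v = adam_step(grad, m, v, t=1)
print(update)
```

On the very first step the bias-corrected moments reduce to the gradient and its square, so the update is about the learning rate in size for both parameters and points opposite to each gradient's sign.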

You can watch bias correction at work on the same toy sequence, tracking both the mean and the (uncentered) variance.

values = np.array([8, 9, 9, 8, 2, 3, 7, 3, 7, 2], dtype=float)

m, v = 0.0, 0.0
beta1, beta2 = 0.5, 0.9
means, varis = [], []
for t, g in enumerate(values, start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)         # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)         # bias-corrected variance
    means.append(round(float(m_hat), 2))
    varis.append(round(float(v_hat), 1))

print("mean:", means[:5])
print("var :", varis[:5])
# Output:
# mean: [8.0, 8.67, 8.86, 8.4, 5.1]
# var : [64.0, 72.9, 75.9, 72.5, 55.7]

Without the / (1 - beta ** t) term, the very first mean estimate would be only 4.0 instead of 8.0, half the true value. Bias correction pulls it back to the right scale immediately. Here is the full optimizer.

class Adam:
    def __init__(self, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.moments = None
        self.t = 0

    def step(self, layer_grads, layers, batch_size):
        if self.moments is None:
            # [first moment, second moment] per parameter
            self.moments = [[[np.zeros_like(w), np.zeros_like(b)],
                             [np.zeros_like(w), np.zeros_like(b)]]
                            for w, b in layer_grads]
        self.t += 1
        for idx, ((w_grad, b_grad), layer) in enumerate(
                zip(layer_grads, reversed(layers))):
            w_grad = w_grad / batch_size
            (m_w, m_b), (v_w, v_b) = self.moments[idx]

            # First moment (momentum) and second moment (RMSprop), per param
            m_w = self.beta1 * m_w + (1 - self.beta1) * w_grad
            m_b = self.beta1 * m_b + (1 - self.beta1) * b_grad
            v_w = self.beta2 * v_w + (1 - self.beta2) * np.square(w_grad)
            v_b = self.beta2 * v_b + (1 - self.beta2) * np.square(b_grad)

            # Bias correction
            mhat_w = m_w / (1 - self.beta1 ** self.t)
            mhat_b = m_b / (1 - self.beta1 ** self.t)
            vhat_w = v_w / (1 - self.beta2 ** self.t)
            vhat_b = v_b / (1 - self.beta2 ** self.t)

            w_update = -self.lr * mhat_w / (np.sqrt(vhat_w) + self.eps)
            b_update = -self.lr * mhat_b / (np.sqrt(vhat_b) + self.eps)
            layer.update(w_update, b_update)
            self.moments[idx] = [[m_w, m_b], [v_w, v_b]]

layers, adam_losses = train(Adam(lr=0.01))
print(f"Adam  final loss={adam_losses[-1]:.4f}  "
      f"test acc={test_accuracy(layers):.3f}")
# Output:
# Adam  final loss=0.3139  test acc=0.708

Adam reaches the lowest training loss of all four, 0.3139, and it descends the most smoothly. Its test accuracy here is 0.708, the lowest of the group, a reminder that Adam’s eagerness to minimize training loss can mean it overfits a small dataset. Techniques to counter that are exactly the subject of the next lesson.

Adam’s defaults are famous for a reason

The values $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ work so well across so many problems that they are the defaults, or very close to them, in the major deep learning frameworks. When in doubt, start with Adam and these settings, and tune only the learning rate.
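
These defaults also determine how long bias correction matters. As a rough illustration, the sketch below prints the correction factor $1 / (1 - \beta^t)$ for each moment at a few step counts. The step counts are arbitrary.

```python
beta1, beta2 = 0.9, 0.999

for t in (1, 10, 100, 1000, 10000):
    c1 = 1 / (1 - beta1 ** t)     # factor applied to the first moment
    c2 = 1 / (1 - beta2 ** t)     # factor applied to the second moment
    print(f"t={t:>5}  first-moment factor={c1:8.3f}  second-moment factor={c2:9.3f}")
```

With $\beta_1 = 0.9$ the first-moment correction is essentially gone after about a hundred steps, while with $\beta_2 = 0.999$ the second-moment correction is still noticeable after a thousand.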

Comparing All Four Optimizers

Because every run used the same network, data, and loss, you can put the four training-loss curves on a single chart and read off exactly how each optimizer behaves.

Figure: Training loss per epoch for SGD, Momentum, RMSprop, and Adam on the Diabetes dataset. Adaptive methods drive the loss lower and faster than plain SGD.

The pattern is clear. Plain SGD is the slowest and settles highest. Momentum descends faster and lands lower by smoothing the path. The two adaptive optimizers, RMSprop and Adam, pull away from both by giving each parameter its own step size, with Adam reaching the lowest training loss of all.

Optimizer	Learning rate	Final training loss	Test accuracy
SGD	0.1	0.4384	0.750
Momentum	0.1	0.3845	0.719
RMSprop	0.01	0.3249	0.729
Adam	0.01	0.3139	0.708

Read this table carefully, because it tells two stories. Going down the training loss column, each optimizer beats the last: adaptive methods are genuinely better at minimizing the objective you gave them. But the test accuracy column does not follow the same order. Plain SGD, the worst at minimizing training loss, generalizes best on this small dataset, while Adam, the best at minimizing training loss, generalizes worst.

That is not a contradiction; it is the central lesson of model training. A faster, more powerful optimizer fits the training data more aggressively, which on a small dataset can mean fitting noise. Driving training loss to the floor is not the goal. Generalizing to new patients is. You will learn the tools to close that gap, without giving up the speed of adaptive optimizers, in the next lesson.
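
The two stories are easy to check mechanically. This sketch simply re-sorts the numbers reported in the table by each column. The dictionary layout is an illustrative choice, and no new results are computed.

```python
# Results copied from the comparison table in this lesson
results = {
    "SGD":      {"lr": 0.1,  "train_loss": 0.4384, "test_acc": 0.750},
    "Momentum": {"lr": 0.1,  "train_loss": 0.3845, "test_acc": 0.719},
    "RMSprop":  {"lr": 0.01, "train_loss": 0.3249, "test_acc": 0.729},
    "Adam":     {"lr": 0.01, "train_loss": 0.3139, "test_acc": 0.708},
}

by_loss = sorted(results, key=lambda name: results[name]["train_loss"])
by_acc = sorted(results, key=lambda name: results[name]["test_acc"], reverse=True)

print("best training loss first:", by_loss)
print("best test accuracy first:", by_acc)
```

The first list puts Adam first and SGD last; the second puts SGD first and Adam last.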

Practice Exercises

Now it is your turn. Try these before checking the hints.

Exercise 1: Tune Momentum’s Beta

Run the momentum optimizer twice more, once with beta=0.5 and once with beta=0.99, keeping lr=0.1. Print the final training loss for each and compare them to the beta=0.9 result from the lesson.

# Reuse Momentum, train, and test_accuracy from the lesson
# Your code here

Hint

Call train(Momentum(lr=0.1, beta=0.5)) and train(Momentum(lr=0.1, beta=0.99)), then read losses[-1] from each return value. A higher beta remembers more history and smooths the path more, but can overshoot if it is too high.

Exercise 2: Give Adam the SGD Learning Rate

Train Adam with lr=0.1 instead of 0.01 and observe what happens to the loss curve. Why does the larger learning rate hurt Adam so much more than it hurt SGD?

# Your code here
layers, losses = train(Adam(lr=0.1))
print("Adam lr=0.1 final loss:", round(losses[-1], 4))

Hint

Adam divides each update by $\sqrt{\hat{v}_t}$, which normalizes away the gradient's magnitude, so each weight moves by roughly the base learning rate on every step. A base learning rate of 0.1 makes those steps far too large, and the loss will likely be much higher (or unstable) than the smooth descent you saw at 0.01. SGD's steps, by contrast, shrink along with the gradient. This is why adaptive optimizers use smaller learning rates than SGD.

Exercise 3: Build RMSprop Without Per-Parameter Scaling

Modify RMSprop so it divides the update by a single shared scalar (the mean of all squared gradients in a layer) instead of an element-wise array. Train it and compare the final loss to the real RMSprop. What does the per-parameter version buy you?

# Start from the RMSprop class and change how s_w / s_b are used
# Your code here

Hint

Replace np.sqrt(s_w) with np.sqrt(np.mean(s_w)), a single number, so every weight in the layer shares one step size, and do the same for s_b. You will likely find it does worse than true RMSprop, because the whole point of RMSprop is to scale each parameter independently based on its own gradient history.

Summary

Congratulations! You have implemented four optimizers from scratch and seen exactly how each one improves on the last. Let’s review what you learned.

Key Concepts

What an Optimizer Does

An optimizer is the rule for updating weights given the current gradient
The learning rate $\alpha$ sets the base step size and is the most important hyperparameter
SGD uses the raw minibatch gradient and has no memory, so its path is slow and noisy

Momentum

Tracks an exponential moving average of past gradients (the first moment)
The decay coefficient $\beta$ controls how much history is remembered; 0.9 is the standard default
Smooths the update direction, descending faster and more stably than plain SGD

RMSprop

Tracks an EMA of squared gradients (the second moment) per parameter
Divides each update by $\sqrt{s_t} + \epsilon$, giving every weight its own adaptive step size
Needs a smaller base learning rate than SGD because the effective step is amplified

Adam

Combines momentum (first moment) and RMSprop (second moment) in one optimizer
Applies bias correction with $1 - \beta^t$ so early estimates are not stuck near zero
Reaches the lowest training loss and is the default optimizer for modern deep learning

The Bigger Picture

Adaptive optimizers minimize training loss fastest, but lower training loss is not the goal
On a small dataset, the most aggressive optimizer can generalize worst (Adam’s lower test accuracy here)

Why This Matters

Optimizers are the engine of every neural network you will ever train. Understanding them is the difference between blindly accepting a framework’s default and knowing why Adam usually wins, why it wants a smaller learning rate, and when plain SGD might actually generalize better. Almost every model you encounter, from a small classifier to a large language model, is trained by some descendant of the four optimizers you just built by hand.

You also saw the most important caveat in all of model training: a lower training loss is not automatically a better model. The adaptive optimizers drove the loss down fastest, yet plain SGD generalized best on this small dataset. That gap between fitting the training data and performing on new data is the single most important problem in machine learning, and closing it is exactly what the next lesson is about.

Next Steps

You now understand how momentum, RMSprop, and Adam train networks faster than plain gradient descent, and you have seen the overfitting risk that comes with aggressive optimization. In the next lesson, you will learn the techniques that keep powerful optimizers from memorizing the training data.

Continue to Lesson 6 - Regularizing Neural Networks

Learn weight decay and other techniques to close the gap between training and test performance.

Keep Building Your Skills

You have gone from a single update rule to a full family of optimizers, and you understand the trade-offs between them rather than just their names. Every time you call optimizer="adam" in a framework from now on, you will know exactly what is happening inside: smoothed momentum, per-parameter scaling, and bias correction, all working together. Keep that mental model close as you move into regularization, because the speed of these optimizers and the discipline of regularization are two halves of training a model that actually works on data it has never seen.
