Lesson 3 - Regularization in Deep Learning
Welcome to Regularization in Deep Learning
In the last lesson you built a convolutional neural network that scored well on Fashion-MNIST. But a high score can hide a problem: a model that memorizes its training data instead of learning general patterns. This lesson teaches you how to recognize that problem and how to fix it with regularization, the family of techniques that keep deep models honest.
By the end of this lesson, you will be able to:
- Recognize overfitting in a CNN by comparing training and validation curves
- Explain what regularization is and why deep models need it
- Apply dropout to randomly disable units during training
- Apply data augmentation with RandomFlip and RandomTranslation to expand your effective dataset
- Understand where batch normalization and early stopping fit in
- Judge a model by the train/validation gap, not just raw test accuracy
You should be comfortable building and training a CNN in Keras, as covered in the previous lesson, and know basic NumPy. Let’s begin.
The Problem: When a Good Score Lies
Imagine you train a CNN and it reaches 95 percent accuracy on the data it learned from. That sounds excellent. But when you show it new images it has never seen, it only gets 88 percent right. The model did not learn what a sneaker looks like; it partly memorized the specific sneakers in your training set.
This is overfitting: the model fits the training data too closely, capturing noise and quirks that do not transfer to new examples. The opposite failure, underfitting, happens when a model is too simple to capture the real pattern at all. The goal sits between them: a model that learns the genuine structure and generalizes to unseen data.
Deep networks are especially prone to overfitting because they have enormous capacity. A CNN can easily have millions of parameters, far more than the number of training images. With that much capacity, the network has more than enough room to memorize.
How to Spot Overfitting
You diagnose overfitting by training with a validation set, a slice of data the model does not learn from, and watching two curves over the training epochs:
- Training accuracy: how well the model does on data it learns from.
- Validation accuracy: how well it does on held-out data.
Early in training both curves rise together. That is healthy learning. The warning sign appears when the training curve keeps climbing while the validation curve flattens or drops. The growing distance between them is the train/validation gap, and a widening gap is the signature of overfitting.
The gap is what matters
A model with 99 percent training accuracy and 99 percent validation accuracy is wonderful. A model with 99 percent training accuracy and 80 percent validation accuracy is overfitting badly, even though its training number looks better. Always read the two curves together, never the training number alone.
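As a minimal sketch of reading the two curves together, the helper below takes a dictionary shaped like the history.history object that Keras returns from fit and reports the gap at each epoch. The name gap_by_epoch and the sample numbers are made up for illustration; they are not results from this lesson.

```python
def gap_by_epoch(history):
    """Return training-minus-validation accuracy for every epoch.

    `history` is assumed to be a dict with "accuracy" and "val_accuracy"
    lists of equal length, like the `history.history` dict from Keras.
    """
    train = history["accuracy"]
    val = history["val_accuracy"]
    return [round(t - v, 3) for t, v in zip(train, val)]


# Hypothetical curves: both rise at first, then validation flattens.
example_history = {
    "accuracy":     [0.80, 0.86, 0.90, 0.93, 0.95],
    "val_accuracy": [0.79, 0.85, 0.87, 0.88, 0.88],
}

gaps = gap_by_epoch(example_history)
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: gap = {gap:+.3f}")
```

In this invented run the gap grows from 0.01 to 0.07 while training accuracy keeps improving, which is the pattern to watch for.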
Regularization is the set of techniques that shrink this gap. Instead of letting the network memorize, you constrain it so it must find patterns that actually generalize. In this lesson you will use two of the most effective and widely used techniques for vision: dropout and data augmentation.
Setting Up: Fashion-MNIST
You will work with Fashion-MNIST, a dataset of 70,000 grayscale images of clothing items (shirts, sneakers, bags, and so on), each 28 by 28 pixels, sorted into 10 classes. It is a drop-in replacement for the classic handwritten-digit dataset but harder, which makes overfitting easy to demonstrate.
Keras can download and load it in one call.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
# downloads on first use: keras.datasets.fashion_mnist.load_data()
(x_train_full, y_train_full), (x_test_full, y_test_full) = keras.datasets.fashion_mnist.load_data()
print("Full train:", x_train_full.shape)
print("Full test: ", x_test_full.shape)
# Output:
# Full train: (60000, 28, 28)
# Full test:  (10000, 28, 28)
To keep training fast enough to run on a laptop and to make overfitting clearly visible, you will use a subset: 15,000 training images and 3,000 test images. A smaller dataset overfits more readily, which is exactly what you want for a regularization lesson.
# Take a subset so training is fast and overfitting is easy to see
x_train = x_train_full[:15000]
y_train = y_train_full[:15000]
x_test = x_test_full[:3000]
y_test = y_test_full[:3000]
# Scale pixels to [0, 1] and add the single channel dimension
x_train = (x_train.astype("float32") / 255.0)[..., np.newaxis]
x_test = (x_test.astype("float32") / 255.0)[..., np.newaxis]
print("Train subset:", x_train.shape)
print("Test subset: ", x_test.shape)
# Output:
# Train subset: (15000, 28, 28, 1)
# Test subset:  (3000, 28, 28, 1)
The new trailing 1 is the channel dimension. These images are grayscale, so they have one channel; color images would have three. Convolutional layers expect this dimension to be present.
The Baseline: A Strong but Overfitting CNN
Start by building a straightforward CNN with no regularization at all. This gives you a baseline to improve on, and it lets you see overfitting in its natural state.
def build_baseline():
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    return model

baseline = build_baseline()
baseline.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
This is the same kind of architecture you built last lesson: two convolution-and-pooling blocks to extract features, a flatten, and two dense layers to classify. Nothing here fights overfitting; the model is free to memorize.
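To make the capacity argument concrete, here is a rough hand count of the baseline's trainable parameters in plain Python. It assumes the Keras defaults of unpadded ("valid") convolutions and pooling that rounds down; the helper names conv2d_params and dense_params are illustrative, not part of Keras.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return kernel_size * kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return inputs * units + units


# Spatial size shrinks 28 -> 26 -> 13 -> 11 -> 5 with unpadded 3x3
# convolutions and 2x2 max pooling (pooling rounds down).
size = 28
size = size - 2      # Conv2D(32, 3) with "valid" padding
size = size // 2     # MaxPooling2D(2)
size = size - 2      # Conv2D(64, 3) with "valid" padding
size = size // 2     # MaxPooling2D(2)
flattened = size * size * 64

total_params = (
    conv2d_params(3, 1, 32)
    + conv2d_params(3, 32, 64)
    + dense_params(flattened, 128)
    + dense_params(128, 10)
)
train_images = 15000 * 80 // 100  # 80% remain after a 20% validation hold-out

print("Flattened features:", flattened)        # 1600
print("Trainable parameters:", total_params)   # 225034
print("Training images:", train_images)        # 12000
```

Even this small network has roughly 19 parameters for every training image it learns from, and most of them sit in the first dense layer.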
Now train it, holding out 20 percent of the training data as a validation set so you can watch the two curves.
history = baseline.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=15,
    batch_size=64,
    verbose=2,
)
test_loss, test_acc = baseline.evaluate(x_test, y_test, verbose=0)
print(f"Baseline test accuracy: {test_acc:.3f}")
# Output:
# Baseline test accuracy: 0.883
The baseline reaches a test accuracy of about 0.883. On its own that looks great. The trouble shows up when you compare it to how the model did on its training data.
final_train_acc = history.history["accuracy"][-1]
final_val_acc = history.history["val_accuracy"][-1]
print(f"Final training accuracy: {final_train_acc:.3f}")
print(f"Final validation accuracy: {final_val_acc:.3f}")
# Output:
# Final training accuracy: 0.949
# Final validation accuracy: 0.883
There it is. The model scores 0.949 on data it trained on but only 0.883 on data it did not. That roughly 6.6-point gap is overfitting: the network has learned details specific to the training images that do not carry over.
Don’t celebrate the training number
It is tempting to report 0.949 because it is the bigger number. But you will never deploy a model onto its own training set; in production it only ever sees new data. The honest estimate of real-world performance is the validation or test accuracy, and the gap between that estimate and the training accuracy tells you how much the model is fooling itself.
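The validation_split=0.2 argument used when fitting the baseline can be pictured with a short sketch. Keras documents that the validation samples are taken from the end of the arrays, before any shuffling; the function below imitates that behavior on index ranges. The name split_for_validation is hypothetical, and the rounding is an approximation of what Keras does internally.

```python
def split_for_validation(num_samples, validation_split):
    """Imitate how a validation set is carved off the end of the data.

    Returns (train_indices, val_indices) as ranges. The validation rows
    are the last ones, chosen before any shuffling takes place.
    """
    split_at = int(num_samples * (1.0 - validation_split))
    return range(0, split_at), range(split_at, num_samples)


train_idx, val_idx = split_for_validation(15000, 0.2)
print("Rows used for fitting:   ", len(train_idx))  # 12000
print("Rows used for validation:", len(val_idx))    # 3000
print("First validation row:    ", val_idx[0])      # 12000
```

Because the split is positional, arrays that are sorted by label should be shuffled before calling fit; otherwise the validation slice may contain only a few classes.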
Regularization Technique 1: Dropout
The first and simplest fix is dropout. During each training step, a dropout layer randomly sets a fraction of its inputs to zero. If you set a dropout rate of 0.5, then on every step roughly half of the units passing through that layer are switched off, chosen at random.
Why would deliberately breaking the network help? Because it stops any single unit from becoming indispensable. If a neuron cannot rely on its neighbors always being present, it has to learn features that are useful on their own. The effect is like training a huge ensemble of slightly different smaller networks and averaging them, which is a powerful defense against memorization.
Two details matter:
- The dropout rate is a hyperparameter. A common choice is 0.2 to 0.3 after convolutional blocks and 0.4 to 0.5 before the final classifier, where most parameters live.
- Dropout is only active during training. At evaluation time Keras automatically turns it off so that every unit contributes to the prediction. You do not have to manage this switch yourself.
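The two details above can be sketched in plain Python. This is a simplified stand-in for what a dropout layer does, not the Keras implementation: it works on a flat list, and the function name and the rng argument are illustrative. It does follow the "inverted dropout" convention that Keras documents, in which surviving values are scaled up by 1 / (1 - rate) during training.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a flat list of activations.

    During training each value is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    At evaluation time the input passes through untouched.
    """
    if not training or rate == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)

print("training:  ", train_out)  # each entry is either 0.0 or 2.0
print("evaluation:", eval_out)   # identical to the input
```

The scaling is why nothing needs to change at evaluation time: the layer already compensated for the missing units while training.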
In Keras a dropout layer is a single line, layers.Dropout(rate), that you insert into the model.
# A dropout layer that zeros out 30% of its inputs during training
layers.Dropout(0.3)
You will add dropout to the model in a moment, together with the second technique. First, look at the other major tool for vision problems.
Regularization Technique 2: Data Augmentation
Dropout constrains the model. Data augmentation attacks overfitting from the other side: it expands the data.
The root cause of overfitting is often simply too few examples. The cleanest fix is more data, but collecting and labeling new images is expensive. Data augmentation gives you the next best thing. It applies small, random, label-preserving transformations to your existing images, so the model sees a slightly different version every epoch and can never memorize any single picture.
For clothing images, two transformations are safe and effective:
- Random horizontal flip (RandomFlip("horizontal")): a mirrored sneaker is still a sneaker, so flipping left-to-right creates a valid new example.
- Random translation (RandomTranslation): shifting an item a few pixels up, down, or sideways teaches the model that position should not change the label.
You add these as layers at the very front of the model. They transform each batch on the fly during training and, like dropout, automatically do nothing at evaluation time, so your test images are left untouched.
data_augmentation = models.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
], name="data_augmentation")
Choose augmentations that preserve the label
Augmentation only helps if the transformed image still belongs to the same class. A horizontally flipped shirt is still a shirt, so a horizontal flip is fine here. But a vertical flip would turn a standing boot upside down, which you would rarely if ever see in real photos, and on a digit dataset flipping a 3 would create something that is no longer a 3. Always pick transformations that match the kind of variation your model will actually face.
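To see what the two transformations do to the pixels, here is a small plain-Python sketch on a tiny grid. It is only an illustration of the idea: the functions flip_horizontal and translate are hypothetical helpers, the shifts are fixed rather than random, and the zero fill for uncovered pixels is an assumption of this sketch (RandomTranslation has its own fill_mode setting).

```python
def flip_horizontal(image):
    """Mirror each row left-to-right."""
    return [list(reversed(row)) for row in image]


def translate(image, dx, dy, fill=0):
    """Shift an image dx pixels right and dy pixels down (negative = left/up).

    Pixels that move outside the frame are dropped, and uncovered pixels
    are set to `fill`.
    """
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = image[y][x]
    return shifted


# A tiny 3x4 "image" with an asymmetric pattern in the top-left corner
image = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
]

print(flip_horizontal(image))        # [[0, 0, 2, 1], [0, 0, 4, 3], [0, 0, 0, 0]]
print(translate(image, dx=1, dy=1))  # [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0]]
```

With height_factor=0.1 and width_factor=0.1 on a 28-pixel image, the random shifts stay within roughly 2.8 pixels in each direction, small enough that the item remains recognizable.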
Putting It Together: The Regularized Model
Now combine both techniques into one model. You prepend the augmentation block, then add dropout after each pooling stage and before the classifier. The convolutional backbone stays identical to the baseline, so any change in behavior comes from the regularization, not the architecture.
def build_regularized():
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        # Data augmentation: random variations every epoch (training only)
        layers.RandomFlip("horizontal"),
        layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
        # Same convolutional backbone as the baseline
        layers.Conv2D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Dropout(0.25),
        layers.Conv2D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ])
    return model

regularized = build_regularized()
regularized.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
Train it exactly as before, with the same epochs and batch size, so the comparison is fair.
history_reg = regularized.fit(
    x_train, y_train,
    validation_split=0.2,
    epochs=15,
    batch_size=64,
    verbose=2,
)
reg_test_loss, reg_test_acc = regularized.evaluate(x_test, y_test, verbose=0)
reg_train_acc = history_reg.history["accuracy"][-1]
print(f"Regularized training accuracy: {reg_train_acc:.3f}")
print(f"Regularized test accuracy: {reg_test_acc:.3f}")
# Output:
# Regularized training accuracy: 0.848
# Regularized test accuracy: 0.872
Look closely at the two numbers. The training accuracy dropped from 0.949 to 0.848, and the test accuracy dropped slightly from 0.883 to 0.872. At first glance the regularized model seems worse. It is not, and understanding why is the heart of this lesson.
Reading the Results: Why a Smaller Gap Wins
Line up the two models side by side.
| Model | Training accuracy | Test accuracy | Train/test gap |
|---|---|---|---|
| Baseline (no regularization) | 0.949 | 0.883 | ~0.066 |
| Regularized (dropout + augmentation) | 0.848 | 0.872 | ~0.024 |
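The gap column can be recomputed from the other two columns. The short sketch below does so using only the accuracies reported in this lesson, and it prints the signed difference as well as its size; the variable names are illustrative.

```python
results = {
    "Baseline (no regularization)": {"train": 0.949, "test": 0.883},
    "Regularized (dropout + augmentation)": {"train": 0.848, "test": 0.872},
}

gaps = {}
for name, acc in results.items():
    # Positive means training accuracy is higher than test accuracy.
    signed_gap = round(acc["train"] - acc["test"], 3)
    gaps[name] = signed_gap
    print(f"{name}: train - test = {signed_gap:+.3f}, size = {abs(signed_gap):.3f}")
```

The sign is worth printing: a negative value means test accuracy is above training accuracy.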
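The baseline’s training accuracy towers over its test accuracy: a wide gap, the clear mark of memorization. The regularized model’s training and test accuracies sit close together: the size of the gap shrank from about 6.6 points to roughly 2.4 points. Notice that the regularized model’s gap also runs the other way, with test accuracy above training accuracy. That is expected, because Keras reports training accuracy with dropout and augmentation switched on, while evaluation switches both off. The chart below shows the same story as training curves.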
So why prefer the regularized model when its test accuracy is a hair lower?
Because the small gap means the model learned real, transferable patterns. The baseline’s high training accuracy is partly an illusion built on memorized details that will not appear in new data. Its test score of 0.883 is fragile: shift the data even slightly, and it can fall. The regularized model is not leaning on memorized quirks, so its 0.872 is a more trustworthy estimate of how it will behave on genuinely new images.
There are concrete reasons a small gap is the better bet in practice:
- Reliability under shift. Real-world data is never identical to your training set. A model that generalizes degrades gracefully; a memorizing model can collapse.
- Headroom to grow. The regularized model has not been pushed to its limit, because dropout and augmentation make each epoch harder. Training it for more epochs will often raise both curves further, something the overfit baseline is unlikely to manage.
- Honest evaluation. A small gap tells you the test number means what it says.
Optimize the gap, then the accuracy
A useful habit: first regularize until the train/validation gap is small, then work on raising both curves together (more epochs, a bigger model, a better learning rate). Chasing raw accuracy on an overfitting model is chasing a number that will not survive contact with real data.
Two More Tools to Know
Dropout and augmentation are the workhorses, but two other techniques belong in your vocabulary.
Batch Normalization
A batch normalization layer normalizes the activations flowing through the network, rescaling them to a stable mean and variance using the statistics of the current mini-batch. This keeps the signal well-behaved as it moves through deep stacks of layers, which speeds up training and lets you use higher learning rates. As a side effect, the small batch-to-batch noise it introduces has a mild regularizing influence.
In Keras it is one line, typically placed between a convolutional or dense layer and its activation:
layers.BatchNormalization()
Batch normalization behaves differently in training and evaluation: during training it uses each batch’s own statistics, while at evaluation it uses a running average accumulated over training. Keras handles that switch for you. You will see batch normalization used heavily in the deeper architectures of the next lesson.
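The difference between the two modes can be sketched for a single feature in plain Python. This is a simplified illustration, not the Keras implementation: the learned scale and offset are left out, the function names are hypothetical, and the momentum and eps defaults are assumptions chosen to resemble typical settings.

```python
import math


def batch_norm_train(batch, moving_mean, moving_var, momentum=0.99, eps=1e-3):
    """Normalize one feature over a mini-batch and update running statistics.

    Returns (normalized_batch, new_moving_mean, new_moving_var).
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    new_mean = momentum * moving_mean + (1 - momentum) * mean
    new_var = momentum * moving_var + (1 - momentum) * var
    return normalized, new_mean, new_var


def batch_norm_eval(batch, moving_mean, moving_var, eps=1e-3):
    """At evaluation time, normalize with the running statistics instead."""
    return [(x - moving_mean) / math.sqrt(moving_var + eps) for x in batch]


batch = [2.0, 4.0, 6.0, 8.0]
normalized, moving_mean, moving_var = batch_norm_train(batch, 0.0, 1.0)

print("mean of normalized batch:", sum(normalized) / len(normalized))
print("running mean:", moving_mean, "running variance:", moving_var)
print("evaluation output:", batch_norm_eval(batch, moving_mean, moving_var))
```

After a single step the running statistics have barely moved from their starting values, which is why they only become a good summary of the data after many training batches.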
Early Stopping
Early stopping does not change the model; it changes when you stop training. You monitor a validation metric, and once it has failed to improve for a set number of epochs, training halts. If you ask for it, the weights from the best epoch are then restored. This prevents the model from continuing to train past the point where it begins to overfit, and it saves time.
You implement it as a callback passed to fit:
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=3,                  # stop if it doesn't improve for 3 epochs
    restore_best_weights=True,
)
# history = model.fit(..., callbacks=[early_stop])
Here patience=3 means training tolerates three epochs with no improvement before stopping, and restore_best_weights=True rewinds to the best checkpoint. Early stopping pairs naturally with dropout and augmentation; you rarely use any of these in isolation.
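The patience rule is easy to trace by hand. The sketch below replays a made-up list of validation losses and reports when training would stop and which epoch held the best loss. It is a simplified model of the logic, not the Keras callback itself, and the function name simulate_early_stopping is hypothetical.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (epochs_run, best_epoch), both counted from 1.

    Training stops as soon as `patience` consecutive epochs pass without
    the validation loss dropping below the best value seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Invented validation losses: improvement stalls after epoch 3.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
epochs_run, best_epoch = simulate_early_stopping(losses, patience=3)
print("Epochs run:", epochs_run)   # 6
print("Best epoch:", best_epoch)   # 3
```

Note that the seventh value, which would have been a new best, is never reached. A larger patience trades extra training time for a lower chance of stopping too soon.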
Regularizers combine
These techniques are not competitors. Production vision models routinely stack data augmentation, dropout, batch normalization, and early stopping all at once. Each addresses overfitting from a different angle: augmentation grows the data, dropout and batch normalization constrain the network, and early stopping limits how long it can drift.
Practice Exercises
Try these before checking the hints. Reuse the x_train, y_train, x_test, y_test arrays prepared earlier.
Exercise 1: Measure the Baseline Gap
Train the baseline model from this lesson and compute the gap between its final training accuracy and its test accuracy. Print all three numbers.
baseline = build_baseline()
baseline.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
# Your code here: fit with validation_split=0.2, then compute the gap
Hint
Call history = baseline.fit(x_train, y_train, validation_split=0.2, epochs=15, batch_size=64). Get the final training accuracy with history.history["accuracy"][-1] and the test accuracy with baseline.evaluate(x_test, y_test, verbose=0)[1]. Subtract them; you should see roughly 0.949 minus 0.883, a gap of about 0.066.
Exercise 2: Vary the Dropout Rate
Dropout rate is a hyperparameter. Build the regularized model but change the final Dropout(0.5) before the classifier to Dropout(0.2). Train it and compare the train/test gap to the original. Does a lighter dropout rate let the model memorize more?
# Your code here: rebuild build_regularized() with a smaller final dropout rate
Hint
Copy build_regularized() and change only the last layers.Dropout(0.5) to layers.Dropout(0.2). Train with the same settings, then compare the new train/test gap to the roughly 0.024 gap from the lesson, where test accuracy sat slightly above training accuracy. A smaller rate weakens the regularization, so expect the training accuracy to creep back up relative to the test accuracy, moving the model back toward the baseline’s overfitting pattern.
Exercise 3: Add Early Stopping
Add an EarlyStopping callback to the regularized model so training halts when the validation loss stops improving. Set epochs=30 but let early stopping decide when to actually stop, and print how many epochs ran.
regularized = build_regularized()
regularized.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
# Your code here: create the callback and pass it to fit
Hint
Create cb = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True) and pass callbacks=[cb] to fit. The number of epochs that ran is len(history.history["loss"]); with augmentation and dropout slowing memorization, training may use most of the 30 epochs before stopping.
Summary
You learned how to diagnose and fix overfitting in convolutional neural networks. Let’s review the key ideas.
Key Concepts
Overfitting and Generalization
- Overfitting is when a model fits the training data too closely and fails to generalize to new data
- You spot it by comparing training and validation accuracy: a widening gap is the warning sign
- The honest measure of performance is validation or test accuracy, never the training number alone
Regularization
- Regularization is any technique that reduces the train/validation gap so the model learns transferable patterns
- These techniques combine; production models often use several at once
Dropout
- Randomly zeros a fraction of units during training, controlled by the dropout rate
- Forces the network to learn redundant, robust features instead of depending on any single unit
- Active only during training; Keras turns it off automatically at evaluation
Data Augmentation
- Applies random, label-preserving transformations like RandomFlip and RandomTranslation to grow the effective dataset
- The model sees fresh variations every epoch and cannot memorize individual images
- Only choose transformations that keep the label valid
Batch Normalization and Early Stopping
- Batch normalization stabilizes activations, speeding training and adding mild regularization
- Early stopping halts training once the validation metric stops improving, and can restore the best weights
Reading Results
- On Fashion-MNIST the baseline scored 0.949 train / 0.883 test (a wide gap)
- Adding dropout and augmentation gave 0.848 train / 0.872 test (a much smaller gap)
- A slightly lower test accuracy with a far smaller gap usually generalizes more reliably
Why This Matters
Out in the real world, a model only ever sees data it was not trained on. A network that memorizes its training set can post an impressive training score and then fail the moment it meets a slightly different image, a different camera, or a new season of products. Regularization is what makes deep learning dependable rather than merely impressive in a notebook.
The shift in mindset is the lasting lesson here. Beginners chase the highest accuracy number; experienced practitioners watch the gap between training and validation, because that gap predicts how a model will behave once it leaves their hands. Dropout, augmentation, batch normalization, and early stopping are the levers that let you trade a little memorized performance for a lot of real-world reliability, and learning when to pull them is a core skill for every deep learning project that follows.
Next Steps
You can now diagnose overfitting and fight it with the regularization toolkit. In the next lesson you will move beyond simple stacks of layers and study the deeper, smarter architectures that power modern computer vision.
Keep Building Your Skills
Regularization is one of those skills that separates a model that works in a notebook from a model you can trust in production. Every CNN you build from now on will face the same tension between fitting the data and generalizing beyond it, and the techniques you practiced here, dropout and augmentation above all, are your reliable answer. Keep watching that train/validation gap; it will guide you toward better models far more honestly than any single accuracy number ever could.