Lesson 6 - Regularizing Neural Networks
Welcome to Regularizing Neural Networks
This lesson is the capstone of the from-scratch foundations module. You have already built a neural network in NumPy, trained it with gradient descent, and sped it up with better optimizers. Now you will confront the single biggest obstacle to making a deep network actually useful: overfitting. You will watch a network memorize its training data, diagnose the problem with a train/test gap, and then fix it by adding L2 weight decay directly to your own backpropagation code.
By the end of this lesson, you will be able to:
- Explain why deeper, wider networks are especially prone to overfitting
- Diagnose overfitting by comparing training accuracy and test accuracy
- Derive how L2 regularization changes the loss and the weight gradients
- Implement L2 weight decay from scratch in a NumPy network
- Describe how dropout and early stopping fight overfitting from different angles
You should have completed the earlier lessons in this module, so you are comfortable with forward propagation, backpropagation, and the gradient descent update. Basic Python and NumPy are assumed. Let’s begin.
Why Deep Networks Overfit
In an earlier lesson you saw that a model is only useful if it generalizes, meaning it performs well on data it has never seen. A model that scores perfectly on the data it trained on but poorly on new data has not learned anything reusable. It has simply memorized.
Deep networks are unusually good at memorizing. The reason is capacity. Every weight in the network is a free parameter the model can tune, and a network with many wide layers has thousands of them. With that many knobs, the network can bend itself into a shape that passes exactly through every training point, including the noise and the quirks that will never repeat in new data.
Think of it like a student preparing for an exam. A student who understands the material can answer questions they have never seen. A student who memorized last year’s answer key will ace last year’s exam and fail this year’s. The answer-key student has high capacity for recall and zero capacity for generalization. A high-capacity network left unchecked behaves exactly like the answer-key student.
The trade-off is real and unavoidable:
- Too little capacity (a tiny network) and the model cannot capture the real patterns. It underfits, scoring poorly on both training and test data.
- Too much capacity (a large network with no restraint) and the model memorizes. It scores nearly perfectly on training data while test performance lags far behind.
The sweet spot is a network with enough capacity to learn the real structure, plus a mechanism that discourages it from memorizing the noise. That mechanism is regularization, and it is the subject of this lesson.
Capacity is not the enemy
The answer is rarely “use a smaller network.” Modern practice is to build a network with plenty of capacity and then apply regularization to keep it honest. A large, well-regularized network typically beats a small, unregularized one, because it can represent rich patterns while still being prevented from memorizing.
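To put a rough number on capacity, the sketch below counts the trainable parameters of a network with one hidden layer. The helper name count_params is hypothetical, and the layer sizes (8 inputs, 64 or 4 hidden units, 1 output) are assumptions chosen to match the network sizes used later in this lesson.

```python
def count_params(n_features, n_hidden, n_out=1):
    """Count weights and biases in a network with one hidden layer."""
    weights = n_features * n_hidden + n_hidden * n_out
    biases = n_hidden + n_out
    return weights + biases


# Illustrative sizes: 8 input features, 1 output unit
print(count_params(8, 64))  # 641
print(count_params(8, 4))   # 41
```

With 64 hidden units the network already has 641 adjustable numbers, which is a generous budget for a dataset with fewer than a thousand rows.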
Seeing Overfitting on Real Data
Let’s make overfitting concrete. You will train a deliberately roomy network, a single hidden layer with 64 units, on the real Diabetes dataset and watch it memorize.
The Diabetes dataset records eight medical measurements for 768 patients, along with a binary outcome that marks whether each patient was diagnosed with diabetes. It is the same kind of binary classification problem you have worked with throughout this module: small enough that a 64-unit network can easily overpower it, which is exactly what makes it perfect for studying overfitting.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# download: https://datatweets.com/datasets/diabetes.csv
df = pd.read_csv("diabetes.csv")
print("Shape:", df.shape)
print(df["outcome"].value_counts().to_dict())
# Output:
# Shape: (768, 9)
# {0: 500, 1: 268}
```
There are 768 patients, 500 without a diabetes diagnosis and 268 with one. Now prepare the data the same way you have throughout this module: split off a test set, then standardize the features using statistics learned on the training set only.
```python
feature_cols = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "diabetes_pedigree", "age",
]
X = df[feature_cols].values
y = df["outcome"].values.reshape(-1, 1)  # column vector for the network

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std on TRAIN only
X_test = scaler.transform(X_test)        # apply the SAME transform to test
print("Train:", X_train.shape, "Test:", X_test.shape)
# Output:
# Train: (576, 8) Test: (192, 8)
```
A Network Built to Memorize
Here is a compact NumPy network with one hidden layer of 64 ReLU units and a sigmoid output, trained with binary cross-entropy. This is the same machinery you built earlier in the module, gathered into one place so you can run it end to end.
```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(0, z)


def init_params(n_features, n_hidden, seed=1):
    rng = np.random.default_rng(seed)
    # small random weights, biases at zero
    W1 = rng.standard_normal((n_features, n_hidden)) * 0.1
    b1 = np.zeros((1, n_hidden))
    W2 = rng.standard_normal((n_hidden, 1)) * 0.1
    b2 = np.zeros((1, 1))
    return W1, b1, W2, b2


def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1
    A1 = relu(Z1)
    Z2 = A1 @ W2 + b2
    A2 = sigmoid(Z2)  # predicted probabilities
    return Z1, A1, Z2, A2
```
The forward pass returns the hidden pre-activations Z1, the hidden activations A1, the output pre-activation Z2, and the final probabilities A2. You will need the intermediate values during backpropagation.
Now the training loop. For the moment, ignore the two regularization terms, which are marked with L2 term comments (they multiply by l2=0.0, so they do nothing yet). You will switch them on in the next section.
```python
def train(X, y, n_hidden=64, lr=0.1, epochs=2000, l2=0.0):
    n = X.shape[0]
    W1, b1, W2, b2 = init_params(X.shape[1], n_hidden)
    for epoch in range(epochs):
        # ----- forward -----
        Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)
        # ----- backward (binary cross-entropy) -----
        dZ2 = (A2 - y) / n
        dW2 = A1.T @ dZ2 + l2 * W2            # L2 term on W2
        db2 = dZ2.sum(axis=0, keepdims=True)
        dA1 = dZ2 @ W2.T
        dZ1 = dA1 * (Z1 > 0)                  # ReLU derivative
        dW1 = X.T @ dZ1 + l2 * W1             # L2 term on W1
        db1 = dZ1.sum(axis=0, keepdims=True)
        # ----- gradient descent update -----
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1, b1, W2, b2


def accuracy(X, y, params):
    *_, A2 = forward(X, *params)
    preds = (A2 > 0.5).astype(int)
    return (preds == y).mean()
```
Train it with no regularization and check both accuracies.
```python
params = train(X_train, y_train, n_hidden=64, lr=0.1, epochs=2000, l2=0.0)
print(f"train acc = {accuracy(X_train, y_train, params):.3f}")
print(f"test acc = {accuracy(X_test, y_test, params):.3f}")
# Output:
# train acc = 1.000
# test acc = 0.719
```
There it is. The network classifies every single training patient correctly (accuracy 1.000) but reaches only 0.719 on the test set. A gap that large is the textbook signature of overfitting: the network has memorized the 576 training patients instead of learning patterns that transfer to the 192 it has never seen.
Perfect training accuracy is a warning, not a victory
It is tempting to celebrate a training accuracy of 1.000. Resist that urge. Perfect performance on the data the model learned from almost always means the model has fit the noise. The number that matters is test accuracy, and here it is about 28 points lower (0.719 against 1.000). Always judge a model by how it does on data it has never seen.
L2 Weight Decay
How do you stop a network from memorizing? The most direct tool is L2 regularization, also called weight decay. The idea is elegant: large weights are what let a network contort itself to pass through every training point, so you add a penalty to the loss that grows as the weights grow. The network now has to balance two goals, fitting the data and keeping its weights small, and small weights produce smoother, more general functions.
Changing the Loss
Your original loss was binary cross-entropy, which we will call $L_{\text{BCE}}$. L2 regularization adds a term proportional to the sum of squared weights:

$$L = L_{\text{BCE}} + \frac{\lambda}{2} \sum_{w} w^2$$

Here $\lambda$ (lambda) is the regularization strength, a hyperparameter you choose, and the sum runs over every entry $w$ of the weight matrices. When $\lambda = 0$ the penalty vanishes and you are back to ordinary training. As $\lambda$ grows, the network is pushed harder toward small weights. The factor of $\frac{1}{2}$ is there purely so the derivative comes out clean, as you are about to see.
Note that the penalty applies to the weights, not the biases. Biases only shift activations and do not contribute to the model’s capacity to overfit in the same way, so they are conventionally left unpenalized.
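The regularized loss can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the lesson's training code: the helper names l2_penalty and bce, the small weight matrices, and the example predictions are all made up for the demonstration.

```python
import numpy as np


def l2_penalty(weight_matrices, lam):
    """Return (lam / 2) * sum of squared weights. Biases are not passed in."""
    return 0.5 * lam * sum(float(np.sum(W ** 2)) for W in weight_matrices)


def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


# Made-up weights and predictions, purely for illustration
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0], [-1.0]])
y_true = np.array([[1.0], [0.0]])
y_prob = np.array([[0.9], [0.2]])

data_loss = bce(y_true, y_prob)
penalty = l2_penalty([W1, W2], lam=0.05)   # 0.5 * 0.05 * 15.25
total_loss = data_loss + penalty
print(round(penalty, 5))  # 0.38125
```

Setting lam=0.0 makes the penalty zero, so total_loss falls back to the plain cross-entropy.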
Changing the Gradient
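The reason this is so easy to implement is what the penalty does to the gradient. The derivative of $\frac{\lambda}{2} w^2$ with respect to $w$ is simply $\lambda w$. So for every weight matrix $W$, the gradient of the regularized loss $L$ gains one extra term on top of the data gradient $\frac{\partial L_{\text{BCE}}}{\partial W}$:

$$\frac{\partial L}{\partial W} = \frac{\partial L_{\text{BCE}}}{\partial W} + \lambda W$$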
That is the whole change. Look back at the training loop you already wrote: the lines dW2 = A1.T @ dZ2 + l2 * W2 and dW1 = X.T @ dZ1 + l2 * W1 are exactly this formula. The + l2 * W is the L2 term. With l2=0.0 it did nothing; now you simply pass a positive value.
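One way to convince yourself of this derivative is a finite-difference check. The sketch below compares the analytic gradient lam * W against a numerical estimate; the helper names and the random test matrix are assumptions made for this illustration.

```python
import numpy as np


def l2_penalty(W, lam):
    """(lam / 2) * sum of squared entries of a single weight matrix."""
    return 0.5 * lam * np.sum(W ** 2)


def numerical_grad(f, W, h=1e-6):
    """Central-difference estimate of df/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * h)
    return grad


rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # arbitrary test weights
lam = 0.05

analytic = lam * W
numeric = numerical_grad(lambda M: l2_penalty(M, lam), W)
print(np.allclose(analytic, numeric, atol=1e-8))  # True
```

The same trick works for checking any hand-derived gradient, which makes it a handy debugging tool for from-scratch backpropagation.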
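The name weight decay comes from looking at the update step. Substitute the new gradient into the gradient descent rule, writing $\eta$ for the learning rate (lr in the code):

$$W \leftarrow W - \eta\left(\frac{\partial L_{\text{BCE}}}{\partial W} + \lambda W\right) = (1 - \eta\lambda)\,W - \eta\,\frac{\partial L_{\text{BCE}}}{\partial W}$$

Every step, before applying the data gradient, each weight is first multiplied by $(1 - \eta\lambda)$, a number slightly less than one. The weights decay toward zero on every update unless the data gradient actively pushes back. Useful weights survive because the data keeps reinforcing them; useless weights that only fit noise quietly shrink away.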
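The two ways of writing the update can be checked against each other numerically. In this sketch the weights and the data gradient are random stand-ins, not values from the lesson's network; only lr=0.1 and the strength 0.05 are taken from the lesson's settings.

```python
import numpy as np

lr, lam = 0.1, 0.05                       # learning rate and L2 strength
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))           # stand-in weights
data_grad = rng.standard_normal((4, 3))   # stand-in for the cross-entropy gradient

# Form 1: plain gradient descent on the regularized gradient
step_a = W - lr * (data_grad + lam * W)
# Form 2: shrink the weights first, then apply the data gradient
step_b = (1 - lr * lam) * W - lr * data_grad
print(np.allclose(step_a, step_b))  # True

# With no data gradient pushing back, weights shrink geometrically
decay = 1 - lr * lam                      # 0.995 per step
W_decayed = W * decay ** 2000             # 2000 steps of pure decay
print(np.abs(W_decayed).max() < 1e-3 * np.abs(W).max())  # True
```

A per-step factor of 0.995 looks harmless, but compounded over 2000 updates it leaves almost nothing of a weight the data never reinforces.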
Training With L2
You do not need to write any new code. Just pass a nonzero l2 to the same train function.
```python
params_l2 = train(X_train, y_train, n_hidden=64, lr=0.1, epochs=2000, l2=0.05)
print(f"train acc = {accuracy(X_train, y_train, params_l2):.3f}")
print(f"test acc = {accuracy(X_test, y_test, params_l2):.3f}")
# Output:
# train acc = 0.790
# test acc = 0.724
```
Compare this to the unregularized run. Training accuracy fell from a perfect 1.000 down to 0.790, while test accuracy rose from 0.719 to 0.724. At first glance, giving up training accuracy might look like a loss, but it is exactly what you want. The network stopped memorizing the training patients, and the price it paid in training accuracy bought a small gain in test accuracy and, more importantly, a model you can actually trust.
| Model | Train accuracy | Test accuracy | Train/test gap |
|---|---|---|---|
| No regularization | 1.000 | 0.719 | 0.281 |
| L2 weight decay ($\lambda = 0.05$) | 0.790 | 0.724 | 0.066 |
The gap collapsed from 0.281 to 0.066. That shrinking gap, not the raw accuracy, is the real measure of success here. A model whose training and test scores sit close together is one that has learned transferable patterns rather than memorized answers.
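The gap column is simple arithmetic, and it is worth computing alongside every accuracy you report. The sketch below reproduces it from the accuracies in the table; the helper name train_test_gap is hypothetical.

```python
def train_test_gap(train_acc, test_acc):
    """Train accuracy minus test accuracy; larger means more overfitting."""
    return train_acc - test_acc


# Accuracies copied from the table above
results = {
    "No regularization": (1.000, 0.719),
    "L2 weight decay": (0.790, 0.724),
}
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: gap = {train_test_gap(train_acc, test_acc):.3f}")
# No regularization: gap = 0.281
# L2 weight decay: gap = 0.066
```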
The clearest way to see the effect is to plot the loss on both sets across training. The figure below shows the train and test loss with and without L2. Without regularization, the training loss keeps dropping while the test loss bottoms out and then drifts upward, the curves fanning apart as the network memorizes. With L2, the two curves stay close together, exactly the behavior of a model that generalizes.
Choosing lambda
The regularization strength is a hyperparameter you tune, just like the learning rate. Too small and it has no effect; too large and it crushes the weights so hard the network underfits, dragging both training and test accuracy down. A common approach is to try a few values spaced apart, such as 0.001, 0.01, 0.05, and 0.1, and keep the one that gives the best test (or, more properly, validation) performance.
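Tuning on a validation set means carving a third split out of the data. The sketch below is one minimal way to do that with NumPy index shuffling. The function name three_way_split and the 20% fractions are assumptions for illustration, and unlike the lesson's train_test_split call it does not stratify by class.

```python
import numpy as np


def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train / validation / test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(768)
print(len(train_idx), len(val_idx), len(test_idx))  # 460 154 154
```

You would then fit the scaler and the network on the training rows, pick the regularization strength by validation accuracy, and evaluate on the test rows only once at the end.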
Two More Tools: Dropout and Early Stopping
L2 weight decay is the workhorse of regularization, but it is not the only tool. Two others appear in nearly every deep learning project, and although you will implement them with a framework in later modules, it is worth understanding the idea behind each now.
Dropout
Dropout fights overfitting by randomly switching off neurons during training. On each forward pass, each neuron in a layer is kept with some probability (say 0.8) and otherwise set to zero for that pass. The set of dropped neurons changes every batch.
Why does this help? Picture a band where one virtuoso guitarist always covers for everyone else’s mistakes. The band sounds great until that guitarist gets sick, and then it falls apart. Dropout is like randomly benching a different musician at every rehearsal: it forces every player to learn the whole song rather than leaning on one neighbor. In network terms, no single neuron can become a crutch the others depend on, so the network is pushed to build redundant, robust pathways instead of fragile memorized ones.
A crucial detail: dropout is only active during training. At test time, every neuron participates, because now you want the full, trained network making predictions. This is why frameworks distinguish a “training mode” from an “evaluation mode.”
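A minimal NumPy sketch of the idea follows. It uses the common "inverted dropout" variant, which rescales the surviving activations during training so nothing needs to change at test time. That rescaling is an assumption added here, not something derived in this lesson, and the function name and the all-ones activations are illustrative.

```python
import numpy as np


def dropout(A, keep_prob=0.8, training=True, rng=None):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training:
        return A                                 # evaluation mode: every unit participates
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(A.shape) < keep_prob       # True means "keep this unit"
    return A * mask / keep_prob                  # rescale so the expected activation is unchanged


rng = np.random.default_rng(0)
A1 = np.ones((1000, 64))                         # stand-in for hidden activations
A1_train = dropout(A1, keep_prob=0.8, training=True, rng=rng)
A1_eval = dropout(A1, training=False)

print(np.isclose((A1_train == 0).mean(), 0.2, atol=0.02))  # roughly 20% of units dropped
print(np.isclose(A1_train.mean(), 1.0, atol=0.02))         # mean preserved by rescaling
print(np.array_equal(A1_eval, A1))                         # evaluation leaves activations alone
```

In a full implementation the same mask would also be applied to the gradient flowing back through the layer, so dropped units receive no update for that pass.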
Early Stopping
Early stopping is the simplest regularizer of all, and you have already seen its mechanism in the loss curves above. As training proceeds, test (validation) loss typically falls, reaches a minimum, and then starts to climb again as the network begins memorizing. Early stopping says: watch the validation performance, remember the best model you have seen, and stop training once it has gone some number of epochs (the patience) without improving. You then restore the best-scoring weights rather than the final, overfit ones.
It costs almost nothing, requires no change to the network, and pairs well with every other technique. In practice, L2, dropout, and early stopping are often used together, each attacking overfitting from a different direction.
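The patience rule can be sketched independently of any network. The function below walks through a list of validation losses and reports where the best model was and where training would stop; the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # in real training, copy the weights here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # patience exhausted: stop
    return best_epoch, len(val_losses) - 1        # never triggered: ran to the end


# Made-up validation losses: fall, bottom out at epoch 3, then climb
losses = [0.70, 0.60, 0.55, 0.52, 0.53, 0.54, 0.56, 0.60]
print(early_stopping_epoch(losses, patience=3))  # (3, 6)
```

Training stops at epoch 6, three epochs after the last improvement, and the weights saved at epoch 3 are the ones you keep.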
Different tools, same goal
L2 shrinks the weights, dropout breaks fragile co-dependencies between neurons, and early stopping halts before memorization sets in. They are complementary, not competing. Production networks often combine all three, and the more your model overfits, the more aggressively you can turn each of them up.
Practice Exercises
Now it is your turn. Try these before checking the hints. Reuse the train, accuracy, and forward functions and the scaled X_train, X_test, y_train, y_test from the lesson.
Exercise 1: Sweep the Regularization Strength
Train the 64-unit network for several values of l2 and print the train and test accuracy for each. Watch how stronger regularization trades training accuracy for a smaller gap.
```python
for lam in [0.0, 0.01, 0.05, 0.2]:
    # Your code here: train with l2=lam, then print train and test accuracy
    pass
```
Hint
Inside the loop, call params = train(X_train, y_train, n_hidden=64, lr=0.1, epochs=2000, l2=lam), then print accuracy(X_train, y_train, params) and accuracy(X_test, y_test, params). You should see l2=0.0 give a train accuracy of 1.000 with test 0.719, and l2=0.05 give train 0.790 with test 0.724. Larger values like 0.2 push training accuracy down further.
Exercise 2: Does a Smaller Network Overfit Less?
Instead of regularizing, shrink the network. Train with no L2 but only n_hidden=4 hidden units, and compare its train/test gap to the unregularized 64-unit network from the lesson.
```python
# Your code here: train with n_hidden=4, l2=0.0, then report both accuracies
```
Hint
Call train(X_train, y_train, n_hidden=4, lr=0.1, epochs=2000, l2=0.0). A 4-unit network has far less capacity, so it is unlikely to reach a perfect 1.000 training accuracy. Its train/test gap should be much smaller than the 64-unit network’s gap of 0.281, illustrating that limiting capacity is another way to reduce overfitting, just a blunter one than L2.
Exercise 3: Verify the Weight-Decay Effect
L2 should produce smaller weights. After training one model with l2=0.0 and one with l2=0.05, compare the average magnitude of the first weight matrix W1 for each. Confirm that regularization really does shrink the weights.
```python
# Your code here: train both models, then compare np.abs(W1).mean() for each
```
Hint
The train function returns (W1, b1, W2, b2), so unpack W1 = params[0] for each model and compute np.abs(W1).mean(). The L2 model’s average weight magnitude should be noticeably smaller than the unregularized model’s, which is the weight decay you derived earlier showing up in the trained parameters.
Summary
Congratulations! You have completed the from-scratch foundations module. You built a network that overfits, diagnosed it, and fixed it with regularization you implemented yourself. Let’s review what you learned.
Key Concepts
Why Networks Overfit
- Overfitting means a model memorizes its training data and fails to generalize to new data
- Deep, wide networks have high capacity (many free parameters), making them especially prone to memorizing
- The fix is rarely a smaller network; it is a high-capacity network plus regularization
Diagnosing Overfitting
- Compare training accuracy to test accuracy; a large gap is the warning sign
- The unregularized 64-unit network reached 1.000 training accuracy but only 0.719 on the test set
- Perfect training accuracy is a red flag, not a success
L2 Weight Decay
- L2 adds $\frac{\lambda}{2} \sum_{w} w^2$ to the loss, penalizing large weights
- The gradient gains one term, $\lambda W$, so implementation is a single addition per weight matrix
- The update rule multiplies each weight by $(1 - \eta\lambda)$ every step, where $\eta$ is the learning rate, hence the name weight decay
- With $\lambda = 0.05$, training accuracy dropped to 0.790 but test rose to 0.724 and the train/test gap collapsed from 0.281 to 0.066
Dropout and Early Stopping
- Dropout randomly disables neurons during training to prevent fragile co-dependencies; it is off at test time
- Early stopping halts training once validation performance stops improving and restores the best weights
- L2, dropout, and early stopping are complementary and often used together
Why This Matters
Every neural network you will ever build, from a tiny tabular classifier to a giant image or language model, faces the same tension: enough capacity to learn, but not so much freedom that it memorizes. Regularization is how practitioners resolve that tension, and it is the difference between a model that dazzles on training data and crumbles in production, and one that holds up on data it has never seen.
By implementing L2 in your own backpropagation code, you saw that regularization is not a mysterious framework switch. It is a small, principled change to the loss and the gradient, one you derived from scratch and can now reason about with confidence. When you move on to high-level frameworks and call a single weight_decay argument, you will know exactly what it is doing under the hood.
Next Steps
You have finished building neural networks from the ground up: forward propagation, backpropagation, gradient descent, optimizers, and now regularization. You understand what every line is doing because you wrote it yourself. From here, you will trade hand-written NumPy for a production framework that handles the mechanics automatically, so you can build deeper, more powerful models faster.
Continue to the Deep Learning with PyTorch Module
Move from NumPy to PyTorch and build the same networks with automatic differentiation and far less code.
Keep Building Your Skills
You have reached the end of the from-scratch journey, and that foundation will pay off for the rest of your career. Frameworks make it easy to stack layers and flip on regularization without a second thought, but you now understand what each switch actually changes in the math. That understanding is what separates someone who can copy a tutorial from someone who can debug a model that will not generalize. Carry it forward: every powerful technique you meet next is built on exactly the pieces you have now implemented with your own hands.