Lesson 5 - Deep Networks and Regularization

Welcome to Deep Networks and Regularization

In the previous lesson you wrote a full PyTorch training loop and trained a small classifier from start to finish. This lesson takes the next step: you will make your network deeper, discover that depth alone can hurt you, and then learn the two regularization tools every practitioner reaches for — dropout and weight decay. You will see, on a real and deliberately small dataset, how these tools narrow the gap between training and test performance.

By the end of this lesson, you will be able to:

  • Explain what overfitting is and why deeper networks on small datasets are especially prone to it
  • Build a deeper multi-layer PyTorch classifier with nn.Sequential
  • Add nn.Dropout layers and understand how random neuron dropping forces generalization
  • Apply L2 regularization through the optimizer’s weight_decay argument
  • Compare a no-regularization model against a dropout-plus-weight-decay model and read the difference off the train/test loss curves

This lesson assumes you are comfortable with PyTorch tensors, nn.Sequential, and the basic training loop from Lesson 4, plus basic pandas. Let’s begin.


The Problem with Going Deeper

When a model is not learning enough, the natural instinct is to make it bigger: more layers, more neurons. More capacity means the network can represent more complicated functions, and sometimes that is exactly what you need. But capacity is a double-edged sword.

A network with thousands of parameters and only a few hundred training examples has more than enough room to simply memorize the answer for every training row. When that happens, training accuracy climbs toward perfection while performance on new, unseen data stalls or even gets worse. This failure mode is called overfitting: the model learns the noise in the training set instead of the underlying pattern.

The opposite failure, underfitting, is when the model is too simple to capture the pattern at all, so it does poorly everywhere. The art of training neural networks is steering between these two.

underfitting            good fit              overfitting
--------------          --------------        --------------
too simple              captures the          memorizes the
high train error        real pattern          training data
high test error         low train error       very low train error
                        low test error        high test error

The clearest symptom of overfitting is a gap between training and test performance. If your model scores beautifully on data it trained on but poorly on data it has never seen, the difference between those two numbers is your overfitting gap, and shrinking that gap is exactly what regularization is for.

Why a small dataset is the perfect teacher

Overfitting is easiest to see when a flexible model meets a small dataset. With only a few hundred rows, a deep network can practically memorize the training set, so the train/test gap appears quickly and dramatically. That makes a small, real dataset an ideal place to learn the regularization tools you will later apply to much larger problems.


The Dataset: Indian IPO Listing Gains

You will work with the Indian IPO dataset, a record of companies that went public on Indian stock exchanges. Each row describes one initial public offering (IPO) along with a few financial and demand-side signals, and your job is to predict whether the stock produced a positive listing gain — that is, whether it closed its first trading day above its offer price.

This is a binary classification problem: the target is 1 if the IPO listed at a gain and 0 otherwise. It is also a genuinely hard, genuinely small problem, which makes it the perfect stage for studying overfitting.

import pandas as pd

# download: https://datatweets.com/datasets/indian_ipo.csv
df = pd.read_csv("indian_ipo.csv")

print("Shape:", df.shape)
# Output: Shape: (319, 10)

Just 319 rows. That is tiny by deep learning standards, and it is the whole point: a deep network will be tempted to memorize these rows rather than learn a rule that transfers to future IPOs.

Take a look at how the target is distributed.

# 1 = listed at a gain, 0 = listed flat or at a loss
print(df["listing_gain"].value_counts())
# Output:
# listing_gain
# 1    174
# 0    145
# Name: count, dtype: int64

print("gain rate:", round(df["listing_gain"].mean(), 3))
# Output: gain rate: 0.545

About 54.5 percent of these IPOs listed at a gain. That gives you a baseline to beat: a lazy model that always predicts “gain” would be right roughly 54.5 percent of the time. Any network worth keeping has to do better than that naive guess on unseen data.
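The baseline figure can be reproduced from the class counts alone. The sketch below is a minimal illustration: `majority_baseline` is a helper name invented here, and the counts are copied from the value_counts() output above.

```python
def majority_baseline(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total

# Class counts copied from the value_counts() output in this lesson
counts = {1: 174, 0: 145}
baseline = majority_baseline(counts)
print(f"majority-class baseline: {baseline:.3f}")
# majority-class baseline: 0.545
```

Strictly speaking, the fair comparison for a test score is the majority rate within the test split itself, which can differ slightly from the full-dataset figure.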

Figure: Bar chart of positive versus non-positive IPO listing gains. The Indian IPO dataset is fairly balanced, with a slight majority of IPOs listing at a gain.

A balanced target is convenient: it means accuracy is a reasonable headline metric, and it means the train/test gap you will study reflects real learning rather than a quirk of class imbalance.


Preparing the Data

The workflow here mirrors what you already know from earlier lessons: select numeric features, build a numeric target, split into train and test sets, scale the features, and convert everything to PyTorch tensors.

import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features: numeric IPO signals. Target: did it list at a gain?
feature_cols = [c for c in df.columns if c != "listing_gain"]
X = df[feature_cols].values
y = df["listing_gain"].values

# Hold out 30% for an honest test of generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Scale features: fit on TRAIN only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to float32 tensors for PyTorch
X_train_t = torch.tensor(X_train, dtype=torch.float32)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32)

print("Train rows:", X_train_t.shape[0], "Test rows:", X_test_t.shape[0])
# Output: Train rows: 223 Test rows: 96

Two details carry over from earlier lessons and matter just as much here. First, stratify=y keeps the gain/no-gain ratio the same in both splits, so your test score is a fair estimate. Second, the scaler is fit on the training set only. Fitting it on the full dataset would leak information from the test set into training and inflate your numbers.

Never let the test set leak in

Regularization is about honest generalization, so any leakage quietly defeats the purpose. Fit the scaler on X_train, then transform both sets with those same statistics. The test set should be touched exactly once, at the very end, to measure performance.
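To make the "fit on train, transform both" rule concrete, here is a hand-rolled sketch of what standardization does. The helper names `fit_scaler` and `transform` and the toy numbers are invented for illustration; the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values using statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]

# Toy numbers, invented for illustration (not taken from the IPO data)
train = [10.0, 12.0, 14.0, 16.0]
test = [20.0, 8.0]

mu, sigma = fit_scaler(train)              # statistics come from TRAIN only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)   # test reuses the train statistics

print("train mean after scaling:", round(mean(train_scaled), 6))
print("test values after scaling:", [round(v, 3) for v in test_scaled])
```

The scaled training values have mean zero by construction, while the test values generally do not. That is expected: the test set is measured against the training distribution, exactly as new data would be.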


A Deeper Classifier in PyTorch

Now build the network. For binary classification you want the final layer to produce a single number that can be turned into a probability. You will keep the raw output (a logit) and pair it with a loss function that applies the sigmoid internally, which is the numerically stable way to do this in PyTorch.

Here is a deliberately deep architecture — several hidden layers, each followed by a ReLU activation.

import torch.nn as nn

n_features = X_train_t.shape[1]

# A deep network: lots of capacity for a small dataset
plain_model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1),   # single logit for binary classification
)

Notice there is no activation on the final layer. The output is a raw logit; the loss function will handle the sigmoid.
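It helps to put a number on "lots of capacity". The sketch below counts weights and biases for this stack of fully connected layers. It assumes 9 input features (the 10 columns minus the target); if your feature count differs, the total changes accordingly. `count_parameters` is a helper name invented here.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + bias vector
    return total

# ASSUMPTION: 9 input features (10 columns minus the target column)
n_features_assumed = 9
sizes = [n_features_assumed, 64, 32, 16, 1]
n_params = count_parameters(sizes)
print(n_params)
# 3265
```

Under that assumption the network has 3,265 trainable parameters for 223 training rows, roughly 15 parameters per example, which is plenty of room to memorize.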

For the loss, use nn.BCEWithLogitsLoss. It combines a sigmoid and binary cross-entropy in one numerically stable step, so you feed it logits directly. For the optimizer, stick with Adam.

import torch.optim as optim

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(plain_model.parameters(), lr=0.01)

Binary cross-entropy is the right loss for probabilities: it punishes confident wrong predictions hard (predicting a 0.95 probability of a gain when the IPO actually lost) while being gentle on uncertain ones near 0.5. Mathematically, for a single example with true label $y$ and predicted probability $\hat{y}$, the loss is:

$$\mathcal{L} = -\big[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\big]$$

When $y = 1$ the term $-\log(\hat{y})$ blows up as $\hat{y}$ approaches 0, which is the heavy penalty for being confidently wrong.
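A few worked values make the shape of this loss tangible. The sketch below evaluates binary cross-entropy for a true label of 1 at three predicted probabilities, and also computes the same loss directly from the logit using one common numerically stable formulation. The function names are invented for illustration, and the stable form is shown as an example of the idea rather than as PyTorch's exact internals.

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for one example, from a predicted probability."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_from_logit(y, z):
    """The same loss computed from the logit z, in a form that never takes log(0)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# True label is 1 (the IPO listed at a gain)
for y_hat in (0.95, 0.5, 0.05):
    z = math.log(y_hat / (1 - y_hat))   # logit that corresponds to y_hat
    print(f"y_hat={y_hat:.2f}  bce={bce(1, y_hat):.4f}  from_logit={bce_from_logit(1, z):.4f}")
# y_hat=0.95  bce=0.0513  from_logit=0.0513
# y_hat=0.50  bce=0.6931  from_logit=0.6931
# y_hat=0.05  bce=2.9957  from_logit=2.9957

# A very confident wrong answer: logit 40 when the true label is 0.
# The probability form would hit log(0) here, because sigmoid(40) rounds to 1.0.
print(bce_from_logit(0, 40.0))
```

Being confidently right costs about 0.05, sitting on the fence costs about 0.69, and being confidently wrong costs about 3.0. The last line shows why working from logits is the safer route for extreme predictions.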

Watching It Overfit

Train this model and track the loss on both the training set and the held-out test set after every epoch. The test loss is your window into generalization: as long as it keeps falling, the model is still learning something useful. When it stops falling — or starts rising — while training loss keeps dropping, that is overfitting happening in real time.

def evaluate_loss(model, X_t, y_t):
    model.eval()
    with torch.no_grad():
        logits = model(X_t).squeeze()
        return criterion(logits, y_t).item()

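n_epochs = 200
plain_history = {"train": [], "test": []}   # per-epoch losses for plotting
for epoch in range(n_epochs):
    plain_model.train()
    optimizer.zero_grad()
    logits = plain_model(X_train_t).squeeze()
    loss = criterion(logits, y_train_t)
    loss.backward()
    optimizer.step()

    # Record both losses after this epoch's update
    plain_history["train"].append(evaluate_loss(plain_model, X_train_t, y_train_t))
    plain_history["test"].append(evaluate_loss(plain_model, X_test_t, y_test_t))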

# Final losses on both sets
plain_train_loss = evaluate_loss(plain_model, X_train_t, y_train_t)
plain_test_loss = evaluate_loss(plain_model, X_test_t, y_test_t)

print(f"Plain model - train loss: {plain_train_loss:.4f}")
print(f"Plain model - test loss:  {plain_test_loss:.4f}")
# Output:
# Plain model - train loss: 0.3382
# Plain model - test loss:  0.7100

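There it is. The training loss has dropped to 0.3382, but the test loss sits at 0.7100, more than twice as high. For reference, a model that outputs a probability of 0.5 for every IPO would score ln 2 ≈ 0.693, so on unseen data this network does slightly worse than having no opinion at all. The network is fitting the 223 training rows beautifully while failing to carry that performance over to the 96 test rows. That divergence (low train loss, high test loss) is the unmistakable signature of overfitting on a small dataset.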

You can confirm the same story in the accuracy.

def accuracy(model, X_t, y_t):
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(X_t).squeeze())
        preds = (probs > 0.5).float()
        return (preds == y_t).float().mean().item()

print(f"Plain model - test accuracy: {accuracy(plain_model, X_test_t, y_test_t):.3f}")
# Output: Plain model - test accuracy: 0.562

About 0.562 on the test set — only a hair above the 0.545 baseline. All that depth bought almost nothing in terms of generalization. The model memorized; it did not learn. Time to fix it.


Tool 1: Dropout

The first regularizer is dropout, and the idea behind it is surprisingly intuitive. During each training step, dropout randomly switches off a fraction of the neurons in a layer — sets their outputs to zero — so the network can never rely too heavily on any single neuron or any single pathway.

Think of a relay team where one runner is so fast that everyone else slacks off. If that runner pulls a muscle, the team collapses. Dropout is like randomly benching a different runner every practice: it forces every member to stay sharp, because nobody knows who will be sitting out next time. The result is a network whose “knowledge” is spread across many neurons rather than concentrated in a fragile few — and spread-out knowledge generalizes better.

In PyTorch you add it with nn.Dropout(p), where p is the probability that any given neuron is dropped.

# During training, ~30% of incoming activations are zeroed at random
drop = nn.Dropout(0.3)

A crucial detail: dropout is active only during training. At evaluation time you want every neuron contributing, so PyTorch automatically turns dropout off when you call model.eval(), and turns it back on with model.train(). This is exactly why those mode-switching calls matter.

train() and eval() are not optional with dropout

Dropout behaves differently in the two modes. In train() mode it randomly zeros neurons and scales the surviving activations by 1 / (1 - p) so the expected output stays the same; in eval() mode it passes every activation through unchanged. If you forget to call model.eval() before measuring test performance, dropout will keep randomly zeroing activations, and your scores will be noisy and usually worse than they really are. Always set the mode explicitly.
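You can watch the two modes directly on a tensor of ones. This is a small standalone sketch, separate from the lesson's model; the seed only makes repeated runs drop the same units. In training mode nn.Dropout zeros some entries and scales the survivors by 1 / (1 - p); in evaluation mode it returns its input unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)          # only so repeated runs drop the same units
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                  # training mode: dropout is active
train_out = drop(x)
# Each entry is either 0 (dropped) or 1 / (1 - p) = 2 (kept and scaled up)
print("train:", train_out)

drop.eval()                   # evaluation mode: dropout is a no-op
eval_out = drop(x)
print("eval: ", eval_out)
```

The scaling in training mode keeps the expected activation the same in both modes, which is why no correction is needed at evaluation time.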

You place dropout after the activation function, so you are dropping fully activated features rather than interfering with the linear layer’s raw output. A common pattern is to use a slightly heavier rate in the wide early layers (where most of the parameters live) and a lighter rate later.


Tool 2: Weight Decay (L2 Regularization)

The second regularizer works on a completely different principle. Instead of randomly removing neurons, weight decay gently pushes every weight in the network toward zero. The reasoning: overfitted models tend to have large, extreme weights that let them carve out wild, jagged decision boundaries to fit individual training points. If you penalize large weights, the model is nudged toward smoother, simpler functions that generalize better.

Concretely, weight decay adds a penalty to the loss proportional to the sum of the squared weights. If $\mathcal{L}_0$ is your ordinary loss and $w$ are the weights, the regularized loss is:

$$\mathcal{L} = \mathcal{L}_0 + \lambda \sum_i w_i^2$$

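The hyperparameter $\lambda$ controls how hard you push: larger values mean stronger regularization. Because the penalty is the squared L2 norm of the weights, it is called L2 regularization, and because its gradient shrinks every weight a little on each update, it is also called weight decay. Under plain gradient descent the two names describe the same update. With adaptive optimizers such as Adam they can differ slightly: optim.Adam's weight_decay argument adds the L2 term to the gradient, while optim.AdamW applies a decoupled decay directly to the weights. This lesson uses the optim.Adam form throughout.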

The beautiful part in PyTorch is that you do not implement any of this by hand. You simply pass weight_decay to the optimizer.

# L2 regularization, applied automatically on every step
# (`model` stands for whichever nn.Module you are training)
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-3)

That single argument tells Adam to decay the weights toward zero on every update, with no change to your training loop at all.
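To see the shrinking effect in isolation, the sketch below takes a single optimizer step on a loss whose gradient is zero, so the only thing moving the weights is the decay term. It deliberately swaps in plain SGD instead of Adam, because SGD's arithmetic is easy to check by hand; the toy weights and the large weight_decay value are chosen purely for illustration.

```python
import torch
import torch.optim as optim

w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
# Plain SGD is used here (instead of Adam) so the arithmetic is easy to follow
opt = optim.SGD([w], lr=0.1, weight_decay=0.5)

opt.zero_grad()
loss = (w * 0.0).sum()   # a loss whose gradient with respect to w is zero
loss.backward()
opt.step()               # update: w <- w - lr * (grad + weight_decay * w)

print(w.data)            # each weight is multiplied by 1 - 0.1 * 0.5 = 0.95
```

Note the convention: PyTorch adds weight_decay * w to the gradient, which corresponds to a penalty of (weight_decay / 2) times the sum of squared weights. The factor of 2 that appears when differentiating the squared penalty is folded into the coefficient.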

Two tools, one goal

Dropout and weight decay attack overfitting from different angles — one removes neurons, the other shrinks weights — and they combine well. A good default starting point is a dropout rate around 0.3 plus a small weight decay such as 1e-3, then adjust based on how the train/test gap responds.


Putting It Together: Regularized vs. Plain

Now build the regularized version of the same deep network. It has the identical layer sizes as the plain model so the comparison is fair — the only differences are the nn.Dropout layers and the weight_decay in the optimizer.

reg_model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Dropout(0.3),        # drop 30% of activations
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(16, 1),       # no dropout before the final output
)

criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(reg_model.parameters(), lr=0.01, weight_decay=1e-3)

Train it with exactly the same loop you used before; only the model being optimized changes. The discipline of switching between train() and eval() now genuinely affects behavior, because of the dropout layers.

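reg_history = {"train": [], "test": []}   # per-epoch losses for plotting
for epoch in range(n_epochs):
    reg_model.train()           # dropout ON
    optimizer.zero_grad()
    logits = reg_model(X_train_t).squeeze()
    loss = criterion(logits, y_train_t)
    loss.backward()
    optimizer.step()

    # Record both losses after this epoch's update (evaluated with dropout OFF)
    reg_history["train"].append(evaluate_loss(reg_model, X_train_t, y_train_t))
    reg_history["test"].append(evaluate_loss(reg_model, X_test_t, y_test_t))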

# evaluate_loss() calls model.eval() internally, so dropout is OFF here
reg_train_loss = evaluate_loss(reg_model, X_train_t, y_train_t)
reg_test_loss = evaluate_loss(reg_model, X_test_t, y_test_t)

print(f"Regularized - train loss: {reg_train_loss:.4f}")
print(f"Regularized - test loss:  {reg_test_loss:.4f}")

The plot below compares the two runs side by side. On the left is the plain model: training loss dives toward 0.34 while test loss peels away and climbs — the classic overfitting fan. On the right is the regularized model: the training loss no longer collapses (dropout makes the training task harder on purpose), but the two curves stay close together, and the test loss settles lower than the plain model’s. That tighter spacing between the curves is regularization doing its job.

Figure: Train versus test loss curves for a plain deep network compared to one with dropout and weight decay. Without regularization the train and test losses diverge sharply; with dropout plus weight decay the gap narrows and the test loss improves.

The key takeaway is not that regularization makes the training loss lower — it usually makes it higher, because you have deliberately made the network’s job harder. The takeaway is that it makes the gap smaller. A model with a small train/test gap is a model you can trust on data it has never seen.

Reading the Curves Like a Practitioner

These two curves are a diagnostic you will use for the rest of your deep learning career:

  • Train and test both high, still falling: underfitting or undertraining. Train longer, or add capacity.
  • Train low, test much higher: overfitting. Add dropout, increase weight decay, get more data, or reduce capacity.
  • Train and test both low and close together: a healthy fit. This is the goal.

When you saw the plain model’s test loss climb to 0.71 while training loss sank to 0.34, you were reading the second case. Adding dropout and weight decay moves you toward the third.
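The three cases can be written down as a rough rule of thumb. The sketch below is illustrative only: `diagnose` is a made-up helper, and the `high` and `gap` thresholds are arbitrary assumptions rather than standard values, so tune them to the loss scale of your own problem.

```python
def diagnose(train_loss, test_loss, high=0.6, gap=0.15):
    """Rough label for a train/test loss pair.

    `high` and `gap` are illustrative thresholds, not standard values.
    """
    if test_loss - train_loss > gap:
        return "overfitting"
    if train_loss > high:
        return "underfitting or undertraining"
    return "healthy fit"

# Final losses of the plain model reported in this lesson
print(diagnose(0.3382, 0.7100))
# overfitting
```

A single pair of final losses is a coarse summary. The full curves also tell you whether the losses are still falling, which is what separates undertraining from a model that has simply run out of capacity.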

Regularization is not magic

On a dataset this small and this noisy, no amount of regularization will turn IPO prediction into a solved problem — the signal is genuinely weak. What regularization reliably buys you is honesty: a model whose test performance you can trust, rather than a memorizer that looks impressive on training data and falls apart in production. Closing the gap matters more than chasing the lowest possible training loss.


Practice Exercises

Try these before checking the hints. Reuse the tensors and helper functions from the lesson.

Exercise 1: Vary the Dropout Rate

Rebuild the regularized model three times, using dropout rates of 0.1, 0.3, and 0.5 (keep weight_decay=1e-3 fixed). Train each for 200 epochs and print the final train and test loss for each. How does the gap between train and test loss change as the dropout rate increases?

for p in [0.1, 0.3, 0.5]:
    # Build a model with nn.Dropout(p) after each ReLU
    # Train for 200 epochs, then print train/test loss
    pass

Hint

Wrap the model-building and training code from the lesson in the for p in [...] loop, passing p into each nn.Dropout(p). Create a fresh optimizer for each model. You should see the train loss rise as p grows (heavier dropout makes training harder), while the train/test gap shrinks — until very heavy dropout starts hurting both.
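One possible way to organize the loop is a small factory function that takes the dropout rate as an argument. The sketch below covers only the model-building half of the exercise; `build_model` is a hypothetical helper name, and the 9 input features are an assumption that you should replace with X_train_t.shape[1] in your own code.

```python
import torch
import torch.nn as nn

def build_model(n_features, p):
    """Same layer sizes as the lesson's network, with Dropout(p) after each ReLU."""
    return nn.Sequential(
        nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(p),
        nn.Linear(64, 32), nn.ReLU(), nn.Dropout(p),
        nn.Linear(32, 16), nn.ReLU(), nn.Dropout(p),
        nn.Linear(16, 1),              # single logit, no dropout before the output
    )

# ASSUMPTION: 9 input features; in the lesson use X_train_t.shape[1] instead
models = {p: build_model(9, p) for p in (0.1, 0.3, 0.5)}
shapes = {}
for p, m in models.items():
    out = m(torch.zeros(4, 9))         # dummy batch of 4 rows
    shapes[p] = tuple(out.shape)
    print(p, shapes[p])
```

Each model still needs its own optimizer and its own 200-epoch training run before you compare the losses.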

Exercise 2: Weight Decay on Its Own

Train the deep network with no dropout but with weight_decay=1e-2 in the Adam optimizer. Compare its train/test loss gap to the plain model (which had weight_decay=0 and no dropout). Does L2 regularization alone narrow the gap?

# Build the plain architecture (no Dropout layers)
# optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-2)
# Train 200 epochs, then print train and test loss

Hint

Use the exact same nn.Sequential as plain_model (no dropout), but change only the optimizer’s weight_decay from 0 to 1e-2. The plain model finished at a train loss of 0.3382 with a much higher test loss; with stronger weight decay you should see the train loss rise and the gap shrink, confirming that L2 alone is a real regularizer.

Exercise 3: Compare Test Accuracy

Using the accuracy() helper from the lesson, print the test accuracy of both the plain model and the regularized model. Then compare both to the naive baseline of always predicting “gain” (0.545). Which models actually beat the baseline?

# print plain model test accuracy
# print regularized model test accuracy
# compare both to the 0.545 baseline

Hint

Call accuracy(plain_model, X_test_t, y_test_t) and the same for reg_model. The plain model lands around 0.562, barely above the 0.545 baseline. The point of the exercise is to see that beating a naive baseline on a hard, small dataset is genuinely difficult, and that the honest train/test gap matters as much as the headline accuracy.


Summary

You took a network from overfitting to honest generalization using the two regularizers that power real production models. Let’s review what you learned.

Key Concepts

Overfitting and the Train/Test Gap

  • Overfitting is when a model memorizes the training data and fails to generalize; underfitting is when it is too simple to learn the pattern at all
  • The symptom of overfitting is a large gap between training and test performance: low train loss but high test loss
  • Deep networks on small datasets overfit fast, which makes the Indian IPO dataset (319 rows) an ideal place to study the problem

Building a Deep Classifier in PyTorch

  • Stack nn.Linear and nn.ReLU layers in nn.Sequential, ending with a single logit output (no final activation)
  • Use nn.BCEWithLogitsLoss, which applies the sigmoid internally for numerical stability
  • Track loss on both train and test sets each epoch to see generalization in real time

Dropout

  • nn.Dropout(p) randomly zeros a fraction p of activations during training, forcing the network to spread its knowledge across many neurons
  • Dropout is active in train() mode and off in eval() mode, so those mode calls now genuinely matter
  • Place dropout after the activation, and skip it before the final output layer

Weight Decay (L2)

  • Weight decay adds a penalty $\lambda \sum_i w_i^2$ that shrinks large weights toward zero, favoring smoother functions
  • In PyTorch you enable it with one argument: optim.Adam(..., weight_decay=1e-3)
  • “Weight decay” and “L2 regularization” name the same idea; with optim.Adam, weight_decay is applied as an L2 term added to the gradient

Reading the Curves

  • Both curves high and falling → underfitting; train low, test high → overfitting; both low and close → a healthy fit
  • Regularization usually raises training loss while shrinking the train/test gap — the gap is what you care about

Why This Matters

Every serious neural network you will ever build has more capacity than its dataset strictly requires, which means overfitting is not an occasional nuisance but the default condition you must actively fight. Dropout and weight decay are the two most widely used tools for that fight, and they are baked into PyTorch so cleanly — one layer, one optimizer argument — that there is no excuse not to use them.

Just as importantly, you learned to diagnose with train and test curves rather than chasing a single accuracy number. A model that scores 99 percent on training data and 60 percent on test data is worse than a model that scores 75 percent on both, because only the second one will hold up when real, unseen IPOs arrive. Reading the gap, not just the headline metric, is what separates a practitioner from someone who is fooling themselves.


Next Steps

You now have the full toolkit: deep architectures, a working training loop, and the regularization techniques that keep deep models honest. In the next lesson you will put all of it together in a complete, end-to-end guided project on the same IPO data.

Continue to Lesson 6 - Guided Project: Predicting IPO Listing Gains with PyTorch

Apply everything you have learned in a complete, end-to-end PyTorch classification project.



Keep Building Your Skills

You have crossed an important threshold: you can now build a deep network and keep it from fooling you. That second skill is the one professionals lean on every day, because real datasets are always smaller and noisier than we would like. Carry the train/test gap mindset into everything you do next — whenever you train a model, watch both curves, ask whether the gap is acceptable, and reach for dropout and weight decay when it is not. Master that habit, and your models will earn the trust that makes them worth deploying.