Lesson 4 - Training Neural Networks
Welcome to Training Neural Networks
In the last lesson you assembled a neural network with nn.Sequential and pushed data through it. The model could produce numbers, but those numbers were meaningless: the weights were still random, so the predictions were essentially guesses. This lesson is where the model comes alive. You will write the training loop that turns a random network into one that has learned real patterns, and you will train it on an actual dataset to predict whether an Indian IPO closes higher on its first trading day.
By the end of this lesson, you will be able to:
- Choose the right loss function for binary classification and explain why BCEWithLogitsLoss is used
- Create an Adam optimizer and connect it to a model’s parameters
- Implement the full training step: zero_grad -> forward -> loss -> backward -> step
- Train a model for multiple epochs while tracking the loss
- Evaluate a trained model honestly with accuracy and AUC, and read a training curve
- Recognize when a problem is genuinely hard and a modest score is the honest result
You should already be comfortable with tensors, nn.Sequential, and the idea of a forward pass from the earlier lessons in this module. Let’s begin.
How a Neural Network Learns
A neural network learns by repeating one short cycle thousands of times. Each pass through the cycle nudges the model’s weights a tiny bit in a direction that makes its predictions slightly less wrong. Do that often enough and the random network slowly becomes a useful one.
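To make the idea of "nudging a weight in a direction that makes it less wrong" concrete, here is a minimal sketch in plain Python with no PyTorch involved. The single weight, the target value of 3.0, and the step size of 0.1 are illustrative assumptions and have nothing to do with the IPO model trained later.

```python
# Toy illustration: one weight, one "wrongness" score, many small nudges.
# All names and values here are hypothetical, chosen only for illustration.

def loss(w: float) -> float:
    """Squared distance between the weight and the value it should reach."""
    return (w - 3.0) ** 2

def gradient(w: float) -> float:
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0            # start from an arbitrary "untrained" value
step_size = 0.1    # how far each nudge moves the weight
history = [loss(w)]

for _ in range(100):
    w -= step_size * gradient(w)   # move against the gradient
    history.append(loss(w))

print(f"final w = {w:.4f}, final loss = {history[-1]:.6f}")
```

Each pass shrinks the remaining error by the same fraction, so the loss falls quickly at first and then flattens out as the weight approaches its target.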
The cycle has five steps that repeat for every batch of data:
- Clear old gradients so this batch starts fresh.
- Forward pass: feed the batch through the model to get predictions.
- Compute the loss: measure how wrong those predictions are with a single number.
- Backward pass: let autograd compute how each parameter should change to reduce the loss.
- Update parameters: the optimizer applies those changes.
PyTorch deliberately makes you write these five steps yourself. Other frameworks hide them behind a single .fit() call, but PyTorch’s explicit loop means you can see exactly what happens at each stage. That transparency makes debugging far easier and gives you room to customize training when you need to.
Before you can run the cycle, you need two supporting pieces: a loss function that scores how wrong the predictions are, and an optimizer that knows how to apply gradients. The rest of this lesson builds both, wires them into the loop, and trains a real model.
The Problem: Predicting IPO First-Day Gains
When a company goes public, its shares start trading on an exchange for the first time. Investors care intensely about one question: will the stock close its first day above its offer price (a “listing gain”) or not? A reliable predictor would be worth a fortune, which is exactly why this is a hard problem. If it were easy, the market would have already priced the edge away.
You will use the real Indian IPO dataset, which records first-day outcomes for IPOs listed on Indian exchanges along with features describing each offering.
import pandas as pd
# download: https://datatweets.com/datasets/indian_ipo.csv
df = pd.read_csv("indian_ipo.csv")
print("Shape:", df.shape)
# Output: Shape: (319, 10)
The dataset has 319 rows and 10 columns. Each row is one IPO. The features describe the offering (issue size, price band, subscription demand, and so on), and the target records whether the stock gained on day one.
# The target: 1 = closed higher on day one, 0 = did not
print(df["listing_gain"].value_counts())
# Output:
# listing_gain
# 1 174
# 0 145
# Name: count, dtype: int64
print("gain rate:", round(df["listing_gain"].mean(), 3))
# Output: gain rate: 0.545
About 54.5 percent of IPOs gained on their first day. This is a mildly imbalanced binary classification problem. That number is also your baseline: a lazy model that always predicts “gain” would be right 54.5 percent of the time. Any model worth keeping must beat that, and as you will see, beating it convincingly is not easy.
Why a baseline matters
Always compute the majority-class rate before you train anything. If 95 percent of your examples were one class, a model could score 95 percent accuracy by ignoring the data entirely. The baseline tells you what “doing nothing” looks like, so you can judge whether your model actually learned something.
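As a rough sketch of that check, the helper below computes the accuracy of always predicting the most common class. The function name majority_baseline is a hypothetical helper, and the label list is rebuilt from the class counts printed earlier (174 gains, 145 non-gains) rather than read from the CSV.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Rebuild the class counts reported above: 174 gains and 145 non-gains.
labels = [1] * 174 + [0] * 145
label, baseline = majority_baseline(labels)
print(f"always predict {label}: accuracy {baseline:.3f}")
```

With a real DataFrame the same number comes from the value counts of the target column; the point is to have it written down before any model is trained.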
Assume the data has already been cleaned, scaled, and split into PyTorch tensors X_train, y_train, X_test, and y_test, just as you did in the previous lesson. The targets are shaped as a column of floats (0.0 or 1.0), which is what the binary loss function expects.
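The target shape is easy to get wrong, so here is a small sketch of one way to build a column of float targets. The raw_labels list is a hypothetical stand-in for the real label column, and the previous lesson may have used a different but equivalent conversion.

```python
import torch

# Hypothetical raw labels as they might come out of a DataFrame column.
raw_labels = [1, 0, 1, 1, 0]

# Float dtype and an explicit column shape, so the targets line up with the
# (n, 1) logits produced by a model ending in nn.Linear(..., 1).
y = torch.tensor(raw_labels, dtype=torch.float32).unsqueeze(1)

print(y.shape)   # torch.Size([5, 1])
print(y.dtype)   # torch.float32
```

Keeping logits and targets in the same (n, 1) shape avoids shape-mismatch errors and accidental broadcasting when the loss is computed.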
The Loss Function
The loss function turns a batch of predictions and their true labels into a single number that measures how wrong the model is. Training is nothing more than driving that number down.
For regression you used nn.MSELoss, which penalizes squared errors. Binary classification needs something different. The natural loss for “yes/no” problems is binary cross-entropy, which rewards the model for being both correct and confident, and punishes it harshly for being confident and wrong.
PyTorch offers two ways to compute it, and the choice matters:
- nn.BCELoss expects probabilities between 0 and 1, so your model must end in a Sigmoid layer.
- nn.BCEWithLogitsLoss expects raw scores (called logits) straight from the final Linear layer, and applies the sigmoid internally.
You will use BCEWithLogitsLoss. It is the recommended choice because it combines the sigmoid and the cross-entropy into one numerically stable step. Computed separately, the sigmoid saturates to exactly 0 or 1 when a logit is large in magnitude, and the logarithm that follows can then produce infinite or NaN values; the combined version is formulated to avoid that.
import torch.nn as nn
loss_fn = nn.BCEWithLogitsLoss()
This means your model’s last layer should be a plain nn.Linear(..., 1) with no sigmoid attached. The model outputs one raw logit per example, and the loss function handles the rest.
The formula behind binary cross-entropy, for a single example with true label $y$ and predicted probability $p$, is:

$$\mathcal{L}(y, p) = -\bigl[\, y \log p + (1 - y) \log(1 - p) \,\bigr]$$
When the true label is 1, only the first term survives, and the loss grows as the predicted probability drops toward 0. When the label is 0, only the second term survives, punishing high predicted probabilities. Either way, confident correct answers cost almost nothing, while confident mistakes cost a lot.
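The behaviour described here can be checked numerically. The sketch below evaluates the per-example binary cross-entropy in plain Python, writing the label as y and the predicted probability as p; the function name bce and the sample probabilities are illustrative choices.

```python
import math

def bce(y: float, p: float) -> float:
    """Binary cross-entropy for one example: -(y*log(p) + (1-y)*log(1-p)).

    y is the true label (0 or 1); p is the predicted probability, strictly
    between 0 and 1.
    """
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The true label is 1 in every case; only the predicted probability changes.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<4}  loss = {bce(1, p):.4f}")
```

A confident correct prediction (p = 0.99) costs about 0.01, a coin flip costs about 0.69, and a confident mistake (p = 0.01) costs about 4.6, which is the asymmetry the paragraph describes.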
Logits, not probabilities
A logit is the raw, unbounded score a network produces before any squashing function. Passing it through a sigmoid converts it to a probability between 0 and 1. With BCEWithLogitsLoss you feed it logits directly during training, then apply torch.sigmoid yourself only when you want human-readable probabilities at evaluation time.
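To see the relationship between the two loss classes, the sketch below feeds the same hypothetical logits through both routes: raw logits into nn.BCEWithLogitsLoss, and sigmoid probabilities into nn.BCELoss. For moderate logits like these the two values should agree to within floating-point tolerance; the specific numbers are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical logits and labels for four examples, shaped (4, 1).
logits = torch.tensor([[-2.0], [0.0], [1.5], [3.0]])
targets = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

probs = torch.sigmoid(logits)          # squash logits into (0, 1)

loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_from_probs = nn.BCELoss()(probs, targets)

print(probs.squeeze(1))
print(loss_from_logits.item(), loss_from_probs.item())
```

The two routes differ in numerical behaviour only when logits become very large in magnitude, which is where the fused version is the safer choice.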
The Optimizer
Gradients tell you which direction to move each parameter to reduce the loss, but gradients alone change nothing. You need an optimizer to actually apply them and update the weights.
PyTorch provides several optimizers in torch.optim. The two you will meet most often are:
- Adam: the versatile default. It adapts the step size for each parameter automatically and usually converges quickly with little tuning.
- SGD: classic stochastic gradient descent. It can reach excellent final results but often needs more careful tuning and more epochs.
You will use Adam. Creating it takes two things: the parameters to update (retrieved from the model) and a learning rate.
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.001)
The learning rate controls how big each update step is. Think of walking down a hill in fog:
- Too small (like 0.00001): you creep along and take forever to reach the bottom.
- Too large (like 1.0): you leap wildly and may overshoot, bouncing up the far slope.
- Just right (often near 0.001): you make steady downhill progress.
A learning rate of 0.001 is the default value for Adam in PyTorch and a sensible starting point for many problems, though it is worth revisiting if the loss stalls or jumps around. With both the loss function and optimizer in hand, you are ready to write the training step itself.
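The fog analogy can be reproduced with a few lines of plain Python. This sketch uses ordinary gradient descent on the toy function f(w) = w**2, not Adam, so the exact thresholds are an assumption of the toy setup; Adam rescales its steps adaptively and would behave differently at the same learning rates. The function name descend and the starting point of 5.0 are hypothetical.

```python
def descend(lr: float, steps: int = 50, start: float = 5.0) -> float:
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        w -= lr * 2.0 * w      # the gradient of w**2 is 2w
    return w

for lr in (0.00001, 0.1, 1.0):
    print(f"lr = {lr:<8} final w = {descend(lr):.4f}")
```

With the tiny rate the weight has barely left its starting point after 50 steps, with 0.1 it is essentially at the minimum, and with 1.0 every step flips w to the opposite side of the valley so it never gets any closer.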
The Five-Step Training Cycle
Here is the heart of the lesson: the exact sequence of calls that constitutes one training step on one batch. Read it slowly, because every PyTorch training loop you ever write is built from these five lines.
optimizer.zero_grad() # 1. clear old gradients
logits = model(batch_X) # 2. forward pass -> raw scores
loss = loss_fn(logits, batch_y) # 3. measure how wrong we are
loss.backward() # 4. compute gradients via autograd
optimizer.step() # 5. update the parameters
Let’s unpack each line.
Step 1: optimizer.zero_grad(). PyTorch accumulates gradients by default, adding new ones onto whatever is already stored. If you skip this step, gradients from previous batches pile up and corrupt the update. Clearing them at the start of each batch keeps every step clean.
Step 2: model(batch_X). The forward pass. The batch flows through each layer in order and emerges as a batch of raw logits, one per example.
Step 3: loss_fn(logits, batch_y). The loss function compares predictions to the true labels and returns a single number. Lower is better; a perfect model would score zero, which essentially never happens.
Step 4: loss.backward(). This is where autograd shines. Starting from the loss, PyTorch walks backward through every operation that produced it and computes the gradient of the loss with respect to each parameter, storing it in that parameter’s .grad attribute.
Step 5: optimizer.step(). The optimizer reads the freshly computed .grad values and nudges every parameter in the direction that reduces the loss.
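Steps 4 and 5 are easier to trust once you have checked them by hand. The sketch below uses a single hypothetical weight and plain SGD instead of Adam, purely so that the update rule is simple enough to verify with arithmetic; none of these values come from the IPO model.

```python
import torch

# One trainable weight and one training example (values are hypothetical).
w = torch.tensor([2.0], requires_grad=True)
x, target = torch.tensor([3.0]), torch.tensor([0.0])

optimizer = torch.optim.SGD([w], lr=0.1)   # plain SGD keeps the arithmetic easy to follow

optimizer.zero_grad()
loss = ((w * x - target) ** 2).sum()       # (2*3 - 0)^2 = 36
loss.backward()                            # d(loss)/dw = 2 * (w*x - target) * x = 36

grad = w.grad.clone()                      # copy the gradient before the update
optimizer.step()                           # w <- w - lr * grad = 2.0 - 0.1 * 36

print(grad.item(), w.item())
```

The gradient stored in w.grad matches the hand-computed derivative, and the new weight is the old one minus the learning rate times that gradient. Adam follows the same read-the-gradient, update-the-parameter pattern with a more elaborate step rule.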
The Order Is Not Optional
The sequence has to run in this order, and it helps to understand why:
- You must clear gradients before the forward pass, or they accumulate across batches.
- You need predictions before you can compute a loss.
- You need a loss before you can compute gradients.
- You need gradients before the optimizer can update anything.
Notice what you do not do: you never compute gradients by hand (autograd does it), you never update weights manually (the optimizer does it), and you never track which parameters to touch (the optimizer already knows them all).
The most common training bug
Forgetting optimizer.zero_grad() is the classic PyTorch mistake. Because gradients accumulate silently, training still runs and produces no error, but the model learns erratically or not at all. If your loss refuses to drop, check that you are clearing gradients on every batch first.
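The accumulation behaviour is easy to observe directly. In this small sketch a single hypothetical tensor w has backward() called twice without any clearing in between, and then its gradient is reset by hand; in a real loop, optimizer.zero_grad() plays that resetting role for every parameter.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()      # the gradient of 3*w with respect to w is 3

(3.0 * w).backward()       # no clearing in between
second = w.grad.item()     # the new 3 was added onto the old 3

w.grad.zero_()             # manual reset for this one tensor
cleared = w.grad.item()

print(first, second, cleared)   # 3.0 6.0 0.0
```

The second value is double the true gradient even though the computation was identical, which is exactly the silent corruption described above.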
Training and Evaluation Modes
One more habit to build: models have two modes. Call model.train() before training and model.eval() before evaluating.
model.train() # training mode
model.eval() # evaluation mode
Plain Linear and ReLU layers behave the same in both modes, so the switch has no visible effect yet. But some specialized layers behave differently during training and evaluation, and you will meet them in the next lesson. Setting the mode explicitly now is good practice that will save you from subtle bugs later.
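As a quick sanity check, the sketch below toggles the mode on a small hypothetical model and inspects the model.training flag that the two calls set. The layer sizes and the random input batch are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(5, 4)   # hypothetical batch: 5 examples, 4 features

model.train()
in_train_mode = model.training          # True after model.train()
with torch.no_grad():
    out_train = model(x)

model.eval()
in_eval_mode = model.training           # False after model.eval()
with torch.no_grad():
    out_eval = model(x)

print(in_train_mode, in_eval_mode, torch.allclose(out_train, out_eval))
```

Because this model contains only Linear and ReLU layers, the outputs match in both modes; the flag matters once mode-dependent layers are added.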
Training for Multiple Epochs
One pass through every batch in the training set is called an epoch. A single epoch is rarely enough; the model needs to see the data many times to refine its understanding. So you wrap the five-step cycle in two loops: an outer loop over epochs and an inner loop over batches.
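The two-loop structure can be sketched with a DataLoader as follows. Everything here is a stand-in chosen for illustration: the data is synthetic, and the layer sizes, batch size, learning rate, and epoch count are arbitrary assumptions rather than the settings used for the IPO model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 examples, 4 features, binary targets shaped (64, 1).
X = torch.randn(64, 4)
y = (X[:, :1] > 0).float()

loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(30):                      # outer loop: epochs
    model.train()
    running = 0.0
    for batch_X, batch_y in loader:          # inner loop: batches
        optimizer.zero_grad()
        loss = loss_fn(model(batch_X), batch_y)
        loss.backward()
        optimizer.step()
        running += loss.item() * batch_X.size(0)
    epoch_losses.append(running / len(X))    # average loss per example

print(f"first epoch: {epoch_losses[0]:.4f}, last epoch: {epoch_losses[-1]:.4f}")
```

Weighting each batch loss by its size before averaging keeps the per-epoch figure correct even when the final batch is smaller than the rest.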
To keep the IPO example focused, you will train on the full training set as one batch each epoch, which is fine for a dataset this small. The structure is identical to looping over a DataLoader; there is just one “batch” per epoch.
import torch
num_epochs = 200
train_losses = []
test_losses = []
for epoch in range(num_epochs):
    # --- training ---
    model.train()
    optimizer.zero_grad()
    logits = model(X_train)
    loss = loss_fn(logits, y_train)
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())
    # --- track test loss (no gradients) ---
    model.eval()
    with torch.no_grad():
        test_logits = model(X_test)
        test_loss = loss_fn(test_logits, y_test)
    test_losses.append(test_loss.item())
    if (epoch + 1) % 50 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}] "
              f"train loss: {loss.item():.4f} test loss: {test_loss.item():.4f}")
print(f"\nFinal training loss: {train_losses[-1]:.4f}")
# Output:
# Epoch [50/200] train loss: 0.5928 test loss: 0.6480
# Epoch [100/200] train loss: 0.4644 test loss: 0.6593
# Epoch [150/200] train loss: 0.3848 test loss: 0.6822
# Epoch [200/200] train loss: 0.3382 test loss: 0.7079
# Final training loss: 0.3382
A few things are worth noticing. The training loss falls steadily, from around 0.59 at epoch 50 down to 0.3382 at epoch 200, which tells you the optimizer is doing its job and the model is fitting the training data better and better.
The test loss tells a different and more honest story. It dips early, then begins to rise even as the training loss keeps falling. That widening gap is the unmistakable signature of overfitting: the model is memorizing quirks of the training set that do not generalize to new IPOs. The training curve below makes this divergence visible.
Reading a training curve
A smooth, steadily falling training loss means the learning rate is healthy. A test loss that turns and rises while the training loss keeps dropping means the model has started memorizing rather than learning. Lesson 5 introduces regularization techniques such as dropout that directly fight this gap.
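The divergence can also be read straight off the numbers. This sketch copies the four logged loss values from the training output above and computes the train/test gap at each checkpoint. Only every 50th epoch was logged, so the "lowest test loss" it reports is the lowest among those checkpoints, not necessarily the true minimum.

```python
# Loss values copied from the training log above, recorded every 50 epochs.
epochs = [50, 100, 150, 200]
train_loss = [0.5928, 0.4644, 0.3848, 0.3382]
test_loss = [0.6480, 0.6593, 0.6822, 0.7079]

gaps = [round(te - tr, 4) for tr, te in zip(train_loss, test_loss)]
best_epoch = epochs[test_loss.index(min(test_loss))]

for e, g in zip(epochs, gaps):
    print(f"epoch {e:>3}: test - train = {g:.4f}")
print("lowest logged test loss at epoch", best_epoch)
```

The gap grows at every checkpoint, from about 0.06 to about 0.37, while the test loss itself never improves after epoch 50.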
Evaluating the Model
A falling training loss feels encouraging, but the only number that matters is how the model performs on data it has never seen. You evaluate on the test set in evaluation mode, with gradients disabled.
For a binary classifier you care about two metrics:
- Accuracy: the fraction of test examples classified correctly. You get predictions by converting logits to probabilities with torch.sigmoid, then thresholding at 0.5.
- AUC (area under the ROC curve): how well the model ranks positives above negatives, across every possible threshold. AUC of 0.5 means random guessing; 1.0 means perfect separation. It is often more informative than accuracy because it does not depend on a single cutoff.
from sklearn.metrics import accuracy_score, roc_auc_score
model.eval()
with torch.no_grad():
    test_logits = model(X_test)
    test_probs = torch.sigmoid(test_logits) # logits -> probabilities
    test_preds = (test_probs >= 0.5).float() # probabilities -> 0/1
y_true = y_test.numpy()
acc = accuracy_score(y_true, test_preds.numpy())
auc = roc_auc_score(y_true, test_probs.numpy())
print(f"Test accuracy: {acc:.3f}")
print(f"Test AUC: {auc:.3f}")
# Output:
# Test accuracy: 0.562
# Test AUC: 0.618
So the model lands at 56.2 percent accuracy and an AUC of 0.618. Let’s be honest about what those numbers mean.
The accuracy of 0.562 is barely above the 0.545 majority-class baseline. In plain terms, your carefully trained neural network is only a hair better than always betting “gain.” The AUC of 0.618 is more reassuring: because it sits clearly above 0.5, the model has genuinely learned something; it ranks IPOs that gained slightly higher than those that did not, more often than chance would. But 0.618 is still a long way from a confident predictor.
This is not a bug, and it is not a failure of your code. IPO first-day movement is genuinely hard to predict. Markets are noisy, the dataset is small (319 IPOs), and much of what drives a first-day pop, such as overall market sentiment on the listing day, simply is not in these features. A modest score here is the truthful result, and reporting it honestly is exactly what a professional does.
Accuracy can flatter a model
On a 54.5 percent baseline, an “impressive-sounding” 56 percent accuracy is almost worthless on its own. Always compare accuracy against the majority-class rate, and pair it with a ranking metric like AUC. The two numbers together tell you whether the model learned a real signal or just learned to guess the common class.
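The ranking interpretation of AUC can be made concrete with a small sketch. The function below, a hypothetical helper named pairwise_auc, counts how often a positive example receives a higher score than a negative one across all positive/negative pairs. It is a quadratic-time illustration of the idea, not a replacement for roc_auc_score, and the four labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which matches the usual definition of ROC AUC.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for four IPOs.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]
print(pairwise_auc(labels, scores))   # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. Note that a 0.5 threshold would classify only two of these four examples correctly, so the two metrics really do measure different things.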
Putting It All Together
Here is the complete workflow, from raw data to evaluated model, condensed into one runnable script. This is a template you can adapt for any binary classification problem in PyTorch.
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score, roc_auc_score
# 1. Load (already cleaned, scaled, and split into tensors upstream)
# download: https://datatweets.com/datasets/indian_ipo.csv
# X_train, y_train, X_test, y_test are float tensors; y has shape (n, 1)
# 2. Build a model that ends in a single raw-logit output (no sigmoid)
model = nn.Sequential(
nn.Linear(X_train.shape[1], 32),
nn.ReLU(),
nn.Linear(32, 16),
nn.ReLU(),
nn.Linear(16, 1),
)
# 3. Loss function and optimizer
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# 4. Train
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    logits = model(X_train)
    loss = loss_fn(logits, y_train)
    loss.backward()
    optimizer.step()
print(f"Final training loss: {loss.item():.4f}")
# Output: Final training loss: 0.3382
# 5. Evaluate
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(X_test))
    preds = (probs >= 0.5).float()
print(f"Test accuracy: {accuracy_score(y_test.numpy(), preds.numpy()):.3f}")
print(f"Test AUC: {roc_auc_score(y_test.numpy(), probs.numpy()):.3f}")
# Output:
# Test accuracy: 0.562
# Test AUC: 0.618
In a couple dozen lines you defined a loss, created an optimizer, ran the five-step cycle for 200 epochs, and evaluated the result honestly. That is the entire PyTorch training process.
Practice Exercises
Try these before checking the hints.
Exercise 1: Pick the Right Loss
You are building a binary classifier whose final layer is nn.Linear(16, 1) with no sigmoid attached. A teammate suggests using nn.BCELoss(). Explain in your own words why nn.BCEWithLogitsLoss() is the better choice here, and what you would have to change to make nn.BCELoss() work correctly instead.
# Write your explanation as a comment, then sketch the model change.
Hint
BCELoss expects probabilities in [0, 1], so it would require adding nn.Sigmoid() as the last layer. BCEWithLogitsLoss takes raw logits and fuses the sigmoid with the cross-entropy in one numerically stable step, avoiding overflow and NaN losses. To use BCELoss you would append nn.Sigmoid() to the model and feed probabilities to the loss.
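One possible shape for the model change is sketched below. To keep it short, the model is reduced to the final nn.Linear(16, 1) layer from the exercise followed by the added sigmoid; the random batch and targets are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Variant that works with nn.BCELoss: the model itself ends in a sigmoid.
model = nn.Sequential(
    nn.Linear(16, 1),
    nn.Sigmoid(),
)
loss_fn = nn.BCELoss()

x = torch.randn(8, 16)                        # hypothetical batch of 8 examples
y = torch.randint(0, 2, (8, 1)).float()       # hypothetical 0/1 targets

probs = model(x)                              # already probabilities, not logits
loss = loss_fn(probs, y)
print(probs.min().item(), probs.max().item(), loss.item())
```

With this variant you would no longer call torch.sigmoid at evaluation time, since the model's outputs are already probabilities.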
Exercise 2: Spot the Bug
The training loop below runs without errors, but the loss barely moves. Find the bug and fix it.
for epoch in range(200):
    model.train()
    logits = model(X_train)
    loss = loss_fn(logits, y_train)
    loss.backward()
    optimizer.step()
Hint
The loop never calls optimizer.zero_grad(), so gradients from every epoch accumulate on top of each other and the updates become garbage. Add optimizer.zero_grad() as the first line inside the loop, before the forward pass.
Exercise 3: Probabilities and Predictions
Given a tensor test_logits of raw model outputs, write the two lines that convert them first into probabilities and then into hard 0/1 predictions using a threshold of 0.5.
import torch
# test_logits is a tensor of raw logits
# Your code here
Hint
Apply torch.sigmoid(test_logits) to get probabilities, then compare to the threshold: (probs >= 0.5).float(). The first line squashes logits into [0, 1]; the second turns each probability into a 0 or 1 label.
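A short check of that answer, using three hypothetical logits chosen so that one falls exactly on the threshold:

```python
import torch

test_logits = torch.tensor([-2.0, 0.0, 3.0])   # hypothetical raw outputs

probs = torch.sigmoid(test_logits)             # logits -> probabilities
preds = (probs >= 0.5).float()                 # probabilities -> 0/1 labels

print(probs)
print(preds)   # tensor([0., 1., 1.])
```

A logit of 0.0 maps to a probability of exactly 0.5, and because the comparison is >= it is labelled 1. Thresholding probabilities at 0.5 is therefore the same as thresholding logits at 0.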
Summary
You have implemented a complete PyTorch training loop from scratch and used it to train and evaluate a real classifier. Let’s review what you learned.
Key Concepts
The Learning Cycle
- A network learns by repeating five steps per batch: zero_grad -> forward -> loss -> backward -> step
- PyTorch makes the loop explicit, giving you full visibility and control over training
- An epoch is one complete pass over the training data; models need many epochs to learn
Loss Functions
- Binary classification uses binary cross-entropy, which rewards confident correct answers and punishes confident mistakes
- BCEWithLogitsLoss takes raw logits and fuses the sigmoid for numerical stability, so the model’s last layer is a plain Linear with no sigmoid
Optimizers
- The optimizer applies gradients to update parameters; gradients alone change nothing
- Adam is a strong default; the learning rate controls step size, with 0.001 a safe starting point
- Always call optimizer.zero_grad() first, because PyTorch accumulates gradients by default
Training and Evaluation
- Use model.train() for training and model.eval() with torch.no_grad() for evaluation
- Convert logits to probabilities with torch.sigmoid, then threshold at 0.5 for hard predictions
- Judge a classifier with accuracy and AUC, and always compare accuracy to the majority-class baseline
Why This Matters
The five-step training loop is the foundation of essentially every model you will ever build in PyTorch, from a tiny classifier like this one to a large network with millions of parameters. The architecture changes, the data changes, but the cycle of clear, forward, measure, backpropagate, update stays exactly the same. Once it is in your fingers, you can read and write any PyTorch training code.
Just as important is the lesson in honesty. Your IPO model reached 56.2 percent accuracy and an AUC of 0.618, only modestly better than guessing the majority class. That is not a coding failure; it is the truth about a genuinely hard, noisy problem on a small dataset. Real practitioners report these numbers plainly, compare them against a baseline, and resist the temptation to dress up a weak result. The rising test loss you saw also flagged overfitting, the very problem the next lesson tackles head-on.
Next Steps
You can now train and evaluate a neural network end to end. Next you will make those networks deeper and learn the regularization techniques that close the train/test gap you observed here.
Continue to Lesson 5 - Deep Networks and Regularization
Go deeper with more layers and fight overfitting using dropout and other regularization techniques.
Keep Building Your Skills
You wrote a real training loop today and watched a random network become a working model, one epoch at a time. The five-step cycle is now yours, and so is something subtler but just as valuable: the discipline to evaluate honestly and call a hard problem hard. Carry both forward. As your networks grow deeper in the lessons ahead, the loop will stay familiar, and your judgment about what the numbers really mean will keep getting sharper.