Lesson 4 - Multi-Layer Deep Learning Models

Welcome to Multi-Layer Deep Learning Models

In the previous lesson you built a single-hidden-layer network and trained it end to end with the Keras Sequential API. This lesson takes the next step: stacking several Dense layers into a genuinely deep model, training it well, and learning to read the signals your training run gives you. You will also meet the most common problem a deeper model creates, overfitting, and the two tools you reach for most often to fix it: Dropout and L2 regularization.

By the end of this lesson, you will be able to:

  • Stack multiple Dense layers into a deep network with the Sequential API
  • Compile and train a model with validation_data to monitor generalization
  • Read the History object and plot training versus validation loss
  • Recognize overfitting from the gap between training and validation curves
  • Apply Dropout and a kernel_regularizer (L2) to reduce overfitting

You should already be comfortable building a one-layer Keras model, compiling it with an optimizer and loss, and calling fit. Basic Python, pandas, and NumPy are assumed. Let’s begin.


From Shallow to Deep

A shallow network has a single hidden layer between its input and output. A deep network stacks several hidden layers, one feeding into the next. Each extra layer gives the model the capacity to compose simpler patterns into more complex ones: the first layer might learn rough combinations of your raw features, the next layer combinations of those, and so on.

That extra capacity is a double-edged sword. More layers and more nodes mean more weights to fit, which lets the model capture subtle structure, but also makes it easier for the model to memorize the training data instead of learning patterns that generalize. Most of this lesson is about getting the benefit of depth without paying the full price of memorization.
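The stacking described above can be sketched in plain NumPy to show that a deep network is just one layer's output fed into the next. This is a minimal illustration, not the Keras implementation: the layer widths, the random weights, and the helper names (dense, relu, sigmoid) are all assumptions chosen for the example, and the ReLU and sigmoid activations are explained later in this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, w, b, activation):
    """One dense layer: a linear transform followed by an activation."""
    return activation(x @ w + b)

# Hypothetical layer widths: 6 inputs -> 32 -> 16 -> 8 -> 1 output
sizes = [6, 32, 16, 8, 1]
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.normal(size=(5, 6))  # 5 made-up rows with 6 features each
h = x
for i, (w, b) in enumerate(params):
    is_last = i == len(params) - 1
    # Hidden layers use ReLU; the final layer uses sigmoid
    h = dense(h, w, b, sigmoid if is_last else relu)

print(h.shape)  # (5, 1): one value per row
```

Each pass through the loop consumes the previous layer's output, which is all "stacking" means. The weights here are random, so the outputs carry no meaning until training adjusts them.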

Figure: A multi-layer dense network with an input layer, several hidden layers, and a single output node. A deep network stacks several Dense hidden layers between the input and the output.

The Sequential API makes stacking layers almost trivial. You already know model.add(...); building a deep model just means calling it more than once. Everything else about the workflow, compile, fit, and evaluate, stays exactly the same.


The Problem: Predicting IPO Listing Gains

To make this concrete, you will work with a real and genuinely difficult problem. The Indian IPO dataset records initial public offerings on Indian exchanges. For each IPO you have how heavily it was subscribed by different investor categories, its issue size, and its issue price. The target, Listing_Gains, is 1 if the stock closed higher than its issue price on its first trading day and 0 otherwise.

Predicting whether an IPO will pop on day one is hard. Markets are noisy, the dataset is small, and much of what drives a first-day move is not in these columns. That makes it an honest teacher: it will show you overfitting clearly, and it will keep you humble about what accuracy to expect.

import pandas as pd

# download: https://datatweets.com/datasets/indian_ipo.csv
df = pd.read_csv("indian_ipo.csv")

print("Shape:", df.shape)
print(df["Listing_Gains"].value_counts())
# Output:
# Shape: (319, 10)
# Listing_Gains
# 1    174
# 0    145
# Name: count, dtype: int64

There are 319 IPOs and 10 columns. About 55 percent of them listed with a gain, so the classes are reasonably balanced. A model that always guessed “gain” would score roughly 0.55 accuracy, which is the baseline any real model must beat to be worth anything.

gain_rate = df["Listing_Gains"].mean()
print(f"gain rate: {gain_rate:.3f}")
# Output: gain rate: 0.545

Figure: Bar chart of IPOs that gained versus did not gain on listing day. The Indian IPO dataset is fairly balanced: roughly 55 percent of listings gained on day one.

Why a hard dataset is good for learning

On an easy dataset, almost any architecture scores well, and overfitting hides because both curves look fine. A hard, small dataset like this one makes the training-versus-validation gap obvious, which is exactly what you need in order to see the concepts this lesson teaches.


Preparing the Data

The features are all numeric: subscription multiples for the three investor categories, the combined total, the issue size, and the issue price. Neural networks train far more reliably when their inputs are on a similar scale, so you standardize the features. You then split into training and test sets.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

feature_cols = [
    "Issue_Size", "Subscription_QIB", "Subscription_HNI",
    "Subscription_RII", "Subscription_Total", "Issue_Price",
]

X = df[feature_cols].astype(np.float32).values
y = df["Listing_Gains"].astype(np.float32).values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Standardize: fit on train only, then apply to both
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_test = scaler.transform(X_test).astype(np.float32)

print("Train:", X_train.shape, " Test:", X_test.shape)
# Output:
# Train: (239, 6)  Test: (80, 6)

Notice the discipline that carries over from every model you have built: the scaler is fit on the training data only, then applied to both sets. Letting the scaler see the test data would leak information and inflate your score.
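To see what "fit on train only" means mechanically, here is a small NumPy sketch of the same standardization. It assumes made-up data in place of the IPO features and uses the population standard deviation (ddof=0), which is what StandardScaler computes by default.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data standing in for the real features
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# The statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training statistics

print(train_scaled.mean(axis=0).round(6))  # ~0 by construction
print(test_scaled.mean(axis=0).round(3))   # near 0, but not exactly 0
```

The test set ends up close to zero mean without being exactly centered, and that is expected: it was scaled with numbers it had no part in producing.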


Stacking Dense Layers

Now build a deep model. With the Sequential API you simply add hidden layers one after another. Each Dense layer applies a linear transform followed by an activation; here you use ReLU for the hidden layers because it trains quickly and rarely saturates.

Because this is a binary classification problem, the output layer has a single node with a sigmoid activation, which squashes the model’s output into a probability between 0 and 1.

import tensorflow as tf

tf.random.set_seed(42)

model = tf.keras.Sequential(name="ipo_mlp")
model.add(tf.keras.layers.Input(shape=(X_train.shape[1],)))
model.add(tf.keras.layers.Dense(32, activation="relu"))
model.add(tf.keras.layers.Dense(16, activation="relu"))
model.add(tf.keras.layers.Dense(8, activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.summary()
# Output:
# Model: "ipo_mlp"
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #
# =================================================================
#  dense (Dense)               (None, 32)                224
#  dense_1 (Dense)             (None, 16)                528
#  dense_2 (Dense)             (None, 8)                 136
#  dense_3 (Dense)             (None, 1)                 9
# =================================================================
# Total params: 897

The summary() output is worth reading. Each row is a layer, and the Param # column counts its weights. The first layer maps 6 inputs to 32 nodes: that is $6 \times 32 = 192$ weights plus 32 biases, which is 224. The whole network has 897 trainable parameters. For 239 training rows, that is already more parameters than examples, which is your first hint that overfitting will be a real concern.
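The same arithmetic can be checked for every layer with a few lines of plain Python. The helper name dense_params is made up for this sketch; the layer widths are the ones used in the model above.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output node."""
    return n_in * n_out + n_out

sizes = [6, 32, 16, 8, 1]  # input width, then each Dense layer's width
per_layer = [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

print(per_layer)       # [224, 528, 136, 9]
print(sum(per_layer))  # 897
```

The per-layer counts match the Param # column, and the total matches the 897 reported by summary().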

Why the Output Uses Sigmoid

For a single-output binary classifier, the model needs to produce a number it can compare against a 0/1 label. A raw linear output could be any real number, so you pass it through the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This maps any value $z$ to the range $(0, 1)$, which you interpret as the probability that the IPO gains on listing day. Values above 0.5 become a predicted “gain,” values below become “no gain.”
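A short NumPy sketch makes the squashing and the 0.5 threshold concrete. The raw outputs below are hypothetical values, not numbers produced by the lesson's model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])  # hypothetical raw outputs
p = sigmoid(z)
labels = (p > 0.5).astype(int)  # strictly above 0.5 counts as "gain"

print(p.round(3).tolist())  # [0.018, 0.5, 0.982]
print(labels.tolist())      # [0, 0, 1]
```

A raw output of exactly 0 lands on 0.5, the point of maximum uncertainty; this sketch assigns it to "no gain", which is a convention rather than a rule.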


Compiling and Training with Validation Data

Compiling attaches three things to the model: an optimizer (how weights are updated), a loss (what the optimizer minimizes), and metrics (what you watch). For binary classification with a sigmoid output, the natural loss is binary cross-entropy. You also track accuracy and AUC so you can judge the model from more than one angle.
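To build intuition for what binary cross-entropy rewards and punishes, the sketch below computes it by hand in NumPy. The labels and predicted probabilities are invented for illustration, and the clipping constant eps is an assumption added to avoid taking log(0); it is not taken from this lesson.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # keep log() finite
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

print(round(binary_cross_entropy(y, confident_right), 4))  # 0.1054
print(round(binary_cross_entropy(y, confident_wrong), 4))  # 2.3026
```

Confident correct predictions cost little, while confident wrong ones are penalized heavily, which is why this loss pairs naturally with a probability output.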

The crucial new ingredient is validation_data. When you pass a held-out set to fit, Keras evaluates the model on it after every epoch without ever training on it. That gives you a running readout of how well the model generalizes, epoch by epoch.

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=200,
    batch_size=16,
    verbose=0,
)

print("final training loss:", round(history.history["loss"][-1], 4))
# Output:
# final training loss: 0.2529

After 200 epochs the training loss has fallen to 0.2529. That looks great, but a low training loss on its own tells you nothing about generalization. To judge that, you need the validation side of the story, which lives in the History object that fit returns.

validation_data versus validation_split

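You can let Keras carve out a validation slice automatically with validation_split=0.2, which holds back the last 20 percent of your training data. Passing validation_data explicitly, as above, gives you full control over exactly which examples are used and is the right choice when you have a dedicated test or validation set. This lesson reuses the test set as validation data to keep the example small; in a real project you would hold out a separate validation set so that the test set stays untouched until the final evaluation.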
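Here is a rough sketch of which rows validation_split would hold back for this lesson's 239 training rows. It assumes the cut falls at floor(n × (1 − split)), which mirrors Keras' documented "last samples, before shuffling" behavior; the exact rounding is an assumption of this sketch.

```python
import math

n_rows = 239            # training rows in this lesson
validation_split = 0.2

# Assumed cut point: everything from split_at onward is held out
split_at = math.floor(n_rows * (1 - validation_split))
train_idx = range(0, split_at)
val_idx = range(split_at, n_rows)

print(len(train_idx), len(val_idx))  # 191 48
print(val_idx[0], val_idx[-1])       # 191 238
```

Because the slice is taken from the end before any shuffling, data that is sorted (by date, for example) would yield an unrepresentative validation set, which is one more reason to pass validation_data explicitly.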


Reading the History Object

fit returns a History object whose .history attribute is a plain dictionary. Each key is a metric name, and each value is a list with one entry per epoch. Because you passed validation_data, every training metric has a matching val_ twin.

print(history.history.keys())
# Output:
# dict_keys(['loss', 'accuracy', 'auc', 'val_loss', 'val_accuracy', 'val_auc'])

Plotting loss against val_loss is the single most useful diagnostic in deep learning. It tells you whether the model is still learning, has converged, or has started to overfit.

import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy")
plt.legend()
plt.show()

Figure: Training and validation loss over 200 epochs. Training loss keeps falling while validation loss flattens and drifts upward, the signature of overfitting.

Read the two curves together:

  • Both falling in early epochs: the model is learning useful patterns.
  • Training loss keeps falling, validation loss flattens then rises: the model has stopped learning general patterns and started memorizing the training set. The growing gap between the curves is the textbook signature of overfitting.

That is exactly what you see here. Training loss marches down toward 0.25, but validation loss bottoms out early and then drifts up. The model is fitting the training IPOs better and better while getting worse on IPOs it has not seen.
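The "growing gap" can also be read numerically rather than by eye. The sketch below uses a hypothetical dictionary shaped like history.history; the loss values are invented to mimic the pattern described above and are not this lesson's actual results.

```python
# Hypothetical values shaped like history.history
fake_history = {
    "loss":     [0.70, 0.62, 0.55, 0.47, 0.40, 0.33],
    "val_loss": [0.70, 0.66, 0.65, 0.66, 0.69, 0.73],
}

# Gap per epoch: how much worse the model does on held-out data
gaps = [
    round(v - t, 2)
    for t, v in zip(fake_history["loss"], fake_history["val_loss"])
]
widening = all(later >= earlier for earlier, later in zip(gaps, gaps[1:]))

print(gaps)      # [0.0, 0.04, 0.1, 0.19, 0.29, 0.4]
print(widening)  # True
```

A gap that only grows while training loss keeps falling is the numeric version of the two curves pulling apart.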


Evaluating Honestly

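evaluate runs the trained model on the test set. By default it returns the loss followed by every metric you compiled, in order; with return_dict=True it returns the same values keyed by name.

# return_dict=True labels each value by name; in newer Keras releases
# model.metrics_names may not list every compiled metric individually
results = model.evaluate(X_test, y_test, verbose=0, return_dict=True)
for name, value in results.items():
    print(f"{name}: {value:.3f}")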
# Output:
# loss: 0.715
# accuracy: 0.537
# auc: 0.591

The test accuracy is 0.537 and the AUC is 0.591. Be honest with yourself about what these numbers mean. An accuracy of 0.537 is slightly below the roughly 0.545 you would get by always guessing “gain,” and an AUC of 0.591 (where 0.5 is random and 1.0 is perfect) says the model has found only a faint signal. That is not a failure of your code; it is the nature of the problem: first-day IPO moves are dominated by factors these six columns do not capture.

This is one of the most valuable lessons in applied deep learning. A model can train beautifully, with a training loss of 0.2529, and still generalize poorly. The training curve told you why before you ever ran evaluate.
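The comparison against naive reference points can be made explicit with numbers already printed in this lesson. This is a small sketch, assuming the full-dataset class counts are a fair stand-in for the test split (the stratified split keeps the ratio close, though not identical).

```python
import math

# Class counts printed earlier in the lesson
n_gain, n_no_gain = 174, 145
baseline_acc = max(n_gain, n_no_gain) / (n_gain + n_no_gain)

model_acc = 0.537   # test accuracy reported above
model_loss = 0.715  # test loss reported above

# Cross-entropy of a model that always predicts probability 0.5
chance_loss = math.log(2)

print(f"majority baseline accuracy: {baseline_acc:.3f}")         # 0.545
print(f"model minus baseline: {model_acc - baseline_acc:+.3f}")  # -0.008
print(f"always-0.5 loss: {chance_loss:.3f}")                     # 0.693
```

Both reference points tell the same story: the model's test accuracy sits just under the majority baseline, and its test loss of 0.715 is slightly worse than what a constant 0.5 prediction would score.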

A low training loss is not success

It is tempting to celebrate when training loss drops. But the only number that matters is performance on data the model has never seen. Always evaluate on a held-out set, and always look at the validation curve. A widening gap between training and validation loss means your impressive training loss is being bought with memorization, not learning.


Fighting Overfitting

The training curve diagnosed overfitting. Now you treat it. Regularization is any technique that discourages the model from memorizing, trading a little training-set performance for better generalization. You will use the two most common tools, often together.

Dropout

A Dropout layer randomly switches off a fraction of the nodes feeding into it on each training step. With, say, Dropout(0.3), each node has a 30 percent chance of being zeroed out for that step. Because the network can never rely on any single node always being present, it is forced to spread what it learns across many nodes, which makes it more robust. Dropout is active only during training; at evaluation time all nodes are used.
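The mechanics can be sketched in NumPy. This is an illustration of "inverted" dropout, the variant Keras documents for its Dropout layer: surviving values are scaled up by 1 / (1 − rate) during training so the expected total stays the same. The function name and the made-up activations are assumptions of this sketch.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a random fraction, rescale the survivors."""
    if not training:
        return x  # evaluation: every node is used, unchanged
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)  # made-up layer outputs

train_out = dropout(activations, rate=0.3, training=True, rng=rng)
eval_out = dropout(activations, rate=0.3, training=False, rng=rng)

print((train_out == 0).mean())  # roughly 0.3 of the values are zeroed
print(train_out.mean())         # roughly 1.0, thanks to the rescaling
print(eval_out.mean())          # 1.0: nothing is dropped at evaluation
```

The rescaling is what lets the same weights work at evaluation time without any adjustment, even though no nodes are dropped then.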

L2 Regularization (Weight Decay)

A kernel_regularizer adds a penalty to the loss based on the size of a layer’s weights. L2 regularization adds the sum of the squared weights, scaled by a small factor $\lambda$:

$$\text{loss}_{\text{total}} = \text{loss}_{\text{data}} + \lambda \sum_{i} w_i^2$$

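Because large weights now cost something, the optimizer is nudged toward smaller weights. Smaller weights mean a smoother, less wiggly decision boundary that is less able to memorize individual training points. The penalty’s gradient, $2 \lambda w_i$, pulls each weight toward zero on every update, which is why L2 regularization is often called weight decay. Strictly speaking, the two are identical only for plain gradient descent; with adaptive optimizers such as Adam they differ slightly.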
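A few lines of NumPy show the size of the penalty term and the pull it exerts on each weight. The kernel values below are hypothetical; the factor 0.01 matches the L2 strength used in this lesson's code.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lam times the sum of squared weights, as in the formula above."""
    return lam * float(np.sum(weights ** 2))

w = np.array([[1.0, -2.0], [0.5, 0.0]])  # hypothetical kernel
lam = 0.01

penalty = l2_penalty(w, lam)
grad = 2 * lam * w  # gradient of the penalty with respect to each weight

print(round(penalty, 4))  # 0.0525
print(grad.tolist())      # [[0.02, -0.04], [0.01, 0.0]]
```

The gradient is proportional to the weight itself, so the largest weights are pushed toward zero the hardest and a weight of zero is left alone.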

Putting Them Together

Here is the same architecture with Dropout between the hidden layers and an L2 penalty on each Dense kernel.

from tensorflow.keras import regularizers

tf.random.set_seed(42)

reg_model = tf.keras.Sequential(name="ipo_mlp_reg")
reg_model.add(tf.keras.layers.Input(shape=(X_train.shape[1],)))
reg_model.add(tf.keras.layers.Dense(
    32, activation="relu",
    kernel_regularizer=regularizers.l2(0.01)))
reg_model.add(tf.keras.layers.Dropout(0.3))
reg_model.add(tf.keras.layers.Dense(
    16, activation="relu",
    kernel_regularizer=regularizers.l2(0.01)))
reg_model.add(tf.keras.layers.Dropout(0.3))
reg_model.add(tf.keras.layers.Dense(
    8, activation="relu",
    kernel_regularizer=regularizers.l2(0.01)))
reg_model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

reg_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

reg_history = reg_model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=200,
    batch_size=16,
    verbose=0,
)

Now compare the two runs side by side. The chart below plots training and validation loss for the original model and the regularized model on the same axes.

Figure: Training and validation loss curves with and without Dropout and L2 regularization. With Dropout and L2, the training curve sits higher but the gap to the validation curve narrows, which is the goal of regularization.

Look at what regularization changes. In the original model the training loss dives well below the validation loss, opening a wide gap. In the regularized model the training loss is higher, because Dropout and the L2 penalty deliberately make the training task harder, but the gap between training and validation loss is much smaller. That is the trade you are making on purpose: you give up some training-set fit to keep the model honest on unseen data.

On a problem this hard, regularization will not turn 0.537 accuracy into something spectacular, because the limiting factor is the signal in the data, not the model’s willingness to memorize. What it does reliably is shrink the overfitting gap, which is the part you control. On richer datasets, that same shrinking gap often translates directly into better test scores.

Regularization is a knob, not a switch

The dropout rate (here 0.3) and the L2 strength (here 0.01) are hyperparameters. Too little and overfitting persists; too much and the model cannot learn even the real patterns, a state called underfitting. The right amount is found by experiment, watching the validation curve as you adjust them.


Practice Exercises

Try these before checking the hints. Reuse the prepared X_train, X_test, y_train, and y_test from the lesson.

Exercise 1: Build a Deeper Model

Add one more hidden layer to the original architecture: insert a Dense layer with 4 nodes and ReLU activation just before the output layer, so the model has four hidden layers (32, 16, 8, 4). Compile it the same way, print model.summary(), and note how the total parameter count changes.

import tensorflow as tf

# Your code here

Hint

Start from the lesson’s Sequential model and add model.add(tf.keras.layers.Dense(4, activation="relu")) after the 8-node layer and before the final sigmoid layer. The extra layer adds its own weights, so the total parameter count in summary() will rise above 897.

Exercise 2: Read the History Object

After training a model with validation_data, find the epoch at which validation loss was lowest. Print that epoch number and the validation loss value there. This is the point at which the model generalized best, before overfitting set in.

# Your code here (assume `history` from model.fit)

Hint

The validation losses are in history.history["val_loss"], a plain Python list. Use import numpy as np and best_epoch = int(np.argmin(history.history["val_loss"])), then index the list with best_epoch to get the lowest loss. Remember epochs are zero-indexed here.

Exercise 3: Tune the Dropout Rate

Try three dropout rates, 0.1, 0.3, and 0.5, in the regularized architecture. For each, train the model and print the test accuracy. Which rate gives the best result on this dataset, and does more dropout always help?

for rate in [0.1, 0.3, 0.5]:
    # build, compile, fit, evaluate with this dropout rate
    # Your code here
    pass

Hint

Wrap the model-building code in the loop and pass rate to each tf.keras.layers.Dropout(rate) call. After fit, use reg_model.evaluate(X_test, y_test, verbose=0) and print the accuracy entry. On a dataset this small and noisy, expect the scores to wobble; the point is to see that the largest dropout rate is not automatically the best.


Summary

You took a single-hidden-layer model and grew it into a genuinely deep network, then learned to train it well and keep it honest. Let’s review.

Key Concepts

Building Deep Models

  • A deep network stacks multiple Dense hidden layers; with the Sequential API you just call model.add more than once
  • More layers and nodes add capacity, which raises both the model’s power and its risk of overfitting
  • A binary classifier ends in a single sigmoid node that outputs a probability between 0 and 1

Training and Monitoring

  • Compile with an optimizer, a loss (binary_crossentropy for binary tasks), and metrics like accuracy and AUC
  • Pass validation_data to fit so Keras measures generalization after every epoch
  • fit returns a History object whose .history dictionary holds per-epoch values for every metric and its val_ twin

Diagnosing Overfitting

  • Plot training versus validation loss; a widening gap means the model is memorizing, not generalizing
  • A low training loss (here 0.2529) is not success; only held-out performance counts
  • On the IPO data the model reached just 0.537 test accuracy and 0.591 AUC, because the signal in the features is genuinely weak

Regularization

  • Dropout randomly zeroes nodes during training, forcing the network not to rely on any single node
  • An L2 kernel_regularizer penalizes large weights, pushing toward a smoother decision boundary
  • Regularization raises training loss on purpose to shrink the train-validation gap; its strength is a hyperparameter to tune

Why This Matters

Every serious deep learning project lives or dies on the gap between training and validation performance. The skills you practiced here, reading a History object, recognizing overfitting from two curves, and reaching for Dropout and L2 to close the gap, are the daily work of training neural networks, no matter the framework or the dataset.

Just as important is the honesty this lesson modeled. The IPO model trained beautifully and still could not beat the always-guess-“gain” baseline, and that is fine: it taught you that a good workflow surfaces the truth about your data instead of hiding it. A practitioner who can look at a training curve and say “this is overfitting, and no amount of tuning will fix the weak signal here” is far more valuable than one who only celebrates a low training loss.


Next Steps

You can now build, train, monitor, and regularize deep Sequential models. Next you will outgrow the Sequential API and learn the Functional API, which lets you build models with branches, multiple inputs, and shared layers, the architectures that Sequential simply cannot express.

Continue to Lesson 5 - Deep Learning with the Keras Functional API

Build flexible model topologies with multiple inputs, branches, and shared layers.


Keep Building Your Skills

You have crossed an important threshold: you can now build deep models and, just as importantly, tell when they are deceiving you. The training curve is your most trusted instrument, so make a habit of plotting it on every model you train. As you move on to more flexible architectures, carry this discipline with you: build, train, watch the validation curve, regularize when the gap opens, and always judge your model on data it has never seen.