Lesson 4 - Advanced CNN Architectures

Welcome to Advanced CNN Architectures

In the previous lessons you built a working convolutional neural network and learned how to fight overfitting with regularization. Now you will go deeper, literally. This lesson is about architecture: how you stack convolutional layers into reusable blocks, how batch normalization keeps deep networks trainable, how global average pooling replaces the flatten-then-dense bottleneck, and what makes classic designs like VGG and ResNet work. You will build a deeper batch-normalized CNN on Fashion-MNIST and compare it head-to-head with your earlier models.

By the end of this lesson, you will be able to:

  • Stack convolutional layers into repeatable blocks to build deeper networks
  • Explain what batch normalization does and why it makes deep CNNs train faster and more stably
  • Use global average pooling instead of Flatten to cut parameters and reduce overfitting
  • Describe the core ideas behind VGG-style blocks and residual (skip) connections
  • Build, train, and evaluate a deeper batch-normalized CNN and compare architectures fairly

You should already be comfortable with the basics of CNNs (convolution, pooling, and a simple Keras model) and with regularization techniques like dropout. Let’s go deeper.


Why Go Deeper?

A convolutional network learns a hierarchy of features. The first layers pick up edges and simple textures. Deeper layers combine those into shapes like a sleeve, a heel, or a buckle. Deeper still, the network assembles those parts into whole-object concepts like “sneaker” or “coat.” Each new convolutional block gives the network another chance to combine simpler patterns into more abstract ones.

That is the promise of depth: more layers means more representational power. A two-layer network can only build features that are two compositions deep, while a twenty-layer network can express far richer combinations. The breakthrough image classifiers of the 2010s were all built on this idea.

But depth comes with a catch. As you stack more layers, two problems appear:

  • Optimization gets harder. Gradients have to travel back through every layer during training. In a very deep network they can shrink toward zero (the vanishing gradient problem), so the early layers barely learn, or blow up (the exploding gradient problem), which makes training unstable.
  • Overfitting gets easier. More layers means more parameters, and more parameters means more capacity to memorize the training set instead of generalizing.

Most of the architectural ideas in this lesson exist to solve exactly these two problems. Batch normalization and careful initialization make deep networks trainable. Global average pooling and dropout keep them from overfitting. Residual connections let you stack dozens of layers without the early ones going dark.
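The first of these two problems compounds quickly with depth. The sketch below is only an illustration: it assumes every layer scales the gradient by the same constant factor, which real networks do not do, and the function name and the factors 0.5 and 1.5 are made up for this example.

```python
def gradient_scale(per_layer_factor, depth):
    """Gradient magnitude after passing back through `depth` layers,
    assuming (hypothetically) each layer multiplies it by the same factor."""
    return per_layer_factor ** depth

for depth in (2, 10, 20):
    shrink = gradient_scale(0.5, depth)   # factor below 1: vanishing
    grow = gradient_scale(1.5, depth)     # factor above 1: exploding
    print(f"depth={depth:2d}  vanishing={shrink:.2e}  exploding={grow:.2e}")
```

Under this simplification, 20 layers leave less than a millionth of the original gradient at a factor of 0.5, and inflate it more than three-thousand-fold at a factor of 1.5.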

Architecture is a set of trade-offs

There is no single “best” network. A good architecture balances capacity (enough layers to learn the patterns) against trainability (gradients that flow) and generalization (not memorizing noise). The techniques below are the standard tools for striking that balance.


Loading Fashion-MNIST

You will work with Fashion-MNIST, a drop-in replacement for the classic handwritten-digits dataset. It contains 28x28 grayscale images of clothing in 10 categories: t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. It is harder than digits but small enough to train on quickly, which makes it perfect for comparing architectures.

To keep training fast and the comparison fair, you will use a fixed subset: 15,000 training images and 3,000 test images.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Downloads the dataset on first use, then loads it from the local cache
(X_train_full, y_train_full), (X_test_full, y_test_full) = \
    keras.datasets.fashion_mnist.load_data()

# Add a channel dimension and scale pixels to [0, 1]
X_train_full = X_train_full[..., np.newaxis].astype("float32") / 255.0
X_test_full = X_test_full[..., np.newaxis].astype("float32") / 255.0

# Use a fixed subset for fast, fair comparisons
X_train = X_train_full[:15000]
y_train = y_train_full[:15000]
X_test = X_test_full[:3000]
y_test = y_test_full[:3000]

print("full train", X_train_full.shape, "using subset", X_train.shape,
      "test subset", X_test.shape)
# Output:
# full train (60000, 28, 28, 1) using subset (15000, 28, 28, 1) test subset (3000, 28, 28, 1)

Every model in this lesson trains on the same 15,000 images and is scored on the same 3,000 test images. That is the only way to tell whether a deeper architecture actually helps, rather than just having seen different data.

Keep the comparison honest

When you compare architectures, change one thing at a time and hold everything else fixed: the same data, the same preprocessing, the same number of epochs, the same optimizer. Otherwise you cannot tell whether a score went up because of the architecture or because of an unrelated change.


Building Blocks of Deeper Networks

Before building the full model, you need three architectural ideas: convolutional blocks, batch normalization, and global average pooling.

Stacking Convolutions into Blocks

The most reliable way to add depth is to group layers into a repeatable block and stack copies of it. A typical block looks like this:

[ Conv -> BatchNorm -> ReLU ] -> [ Conv -> BatchNorm -> ReLU ] -> MaxPool

Each block does two convolutions at the same spatial resolution, then halves the height and width with pooling. As you go deeper you usually double the number of filters each time you pool, because the spatial map is shrinking and you want more channels to carry richer features. A common progression is 32 filters, then 64, then 128.
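As a quick check of that progression, the following sketch traces the spatial size and channel count after each block for a 28x28 input. It assumes "same"-padded convolutions (which leave the size unchanged) and 2x2 max pooling with stride 2 (which floors odd sizes); trace_stages is a helper name invented for this example.

```python
def trace_stages(size, filters=(32, 64, 128)):
    """Return (height/width, channels) after each block.

    Assumes 'same'-padded convolutions (size unchanged) followed by
    2x2 max pooling with stride 2, which rounds odd sizes down.
    """
    shapes = []
    for f in filters:
        size = size // 2          # pooling halves height and width
        shapes.append((size, f))
    return shapes

print(trace_stages(28))
# [(14, 32), (7, 64), (3, 128)]
```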

This pattern, two or three small $3 \times 3$ convolutions followed by pooling, with the filter count doubling at each stage, is exactly the recipe behind the VGG family of networks. VGG showed that a stack of small $3 \times 3$ filters is both more expressive and more parameter-efficient than a few large filters. Two stacked $3 \times 3$ convolutions see the same input region as one $5 \times 5$ convolution, but with fewer parameters ($18C^2$ weights instead of $25C^2$ when every layer has $C$ input and output channels) and an extra nonlinearity in between.
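The arithmetic behind that comparison can be checked in a few lines. This is a sketch under simplifying assumptions: stride-1 convolutions, the same number of input and output channels in every layer, and biases ignored. The helper names and the channel count of 64 are chosen only for illustration.

```python
def conv_weights(kernel, channels, n_layers=1):
    """Weights (ignoring biases) in a stack of square convolutions,
    assuming `channels` input and output channels throughout."""
    return n_layers * kernel * kernel * channels * channels

def receptive_field(kernel, n_layers):
    """Width of the input region seen by one output pixel (stride 1)."""
    return 1 + n_layers * (kernel - 1)

C = 64  # hypothetical channel count
print(receptive_field(3, 2), receptive_field(5, 1))   # 5 5
print(conv_weights(3, C, n_layers=2))                 # 73728
print(conv_weights(5, C))                             # 102400
```

Both options cover a 5x5 input region, but the stacked version needs 72% of the weights.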

Batch Normalization

As a network gets deeper, the distribution of values flowing into each layer keeps shifting during training, because every layer’s weights are changing at once. This makes optimization slow and finicky. Batch normalization fixes this by normalizing each layer’s inputs, per mini-batch, to have roughly zero mean and unit variance, then learning a scale and shift so the layer can recover any distribution it actually needs.

For a feature value $x$ in a mini-batch, batch normalization computes:

$$\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

Here $\mu_{\text{batch}}$ and $\sigma^2_{\text{batch}}$ are the mean and variance computed over the current batch, $\epsilon$ is a tiny constant for numerical stability, and $\gamma$ and $\beta$ are learned parameters. The practical payoffs are large:

  • Faster training. Networks converge in fewer epochs because gradients behave better.
  • More stability. You can often use higher learning rates without the network diverging.
  • A mild regularizing effect. The per-batch noise in the statistics acts a little like dropout.

In Keras you add it as a layer, layers.BatchNormalization(), most commonly placed right after a convolution and before the activation.
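To make the formula concrete, here is a minimal NumPy sketch of the training-time computation for convolutional feature maps. It is not how Keras implements the layer internally; it assumes channels-last inputs, computes one mean and variance per channel over the batch and spatial axes, and uses eps=1e-3 as an illustrative constant. The function name batch_norm_train is invented for this example.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize x of shape (batch, height, width, channels) per channel."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)    # batch mean per channel
    var = x.var(axis=(0, 1, 2), keepdims=True)    # batch variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta                   # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 14, 14, 8))
y = batch_norm_train(x, gamma=np.ones(8), beta=np.zeros(8))

# With gamma=1 and beta=0, each channel ends up close to mean 0 and std 1
print(y.mean(axis=(0, 1, 2)).round(3))
print(y.std(axis=(0, 1, 2)).round(3))
```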

Where do non-trainable parameters come from?

A batch normalization layer keeps a running estimate of the mean and variance to use at prediction time, when the output for one image should not depend on which other images happen to share its batch. Those running estimates are updated during training but not by backpropagation, so Keras reports them as non-trainable parameters. The learned scale $\gamma$ and shift $\beta$, by contrast, are trainable.
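The split between the two kinds of parameters is easier to see in a toy implementation. The class below is a simplified sketch, not the Keras layer: it handles only inputs of shape (batch, features), has no backpropagation, and uses an exponential moving average with momentum=0.99 as an assumed update rule. TinyBatchNorm and its attribute names are invented for this example.

```python
import numpy as np

class TinyBatchNorm:
    """Per-feature batch norm for inputs of shape (batch, features)."""

    def __init__(self, features, momentum=0.99, eps=1e-3):
        self.gamma = np.ones(features)         # trainable scale
        self.beta = np.zeros(features)         # trainable shift
        self.moving_mean = np.zeros(features)  # non-trainable running mean
        self.moving_var = np.ones(features)    # non-trainable running variance
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Running estimates are updated by averaging, not by gradients
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

bn = TinyBatchNorm(features=4)
rng = np.random.default_rng(0)
for _ in range(1000):
    bn(rng.normal(loc=2.0, scale=0.5, size=(64, 4)), training=True)

# At prediction time a single example is normalized with the running estimates
single = np.full((1, 4), 2.0)
out = bn(single, training=False)
print(bn.moving_mean.round(2), out.round(2))
```

Each feature carries four numbers: two that an optimizer would train (gamma, beta) and two that are only tracked (moving mean and variance), which is why half of a batch normalization layer's parameters show up as non-trainable.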

Global Average Pooling

In a basic CNN, you usually end with Flatten followed by a dense layer. The problem is that flattening a feature map can produce an enormous vector. If your last convolutional block outputs a $7 \times 7 \times 128$ tensor, flattening gives $7 \times 7 \times 128 = 6272$ values, and connecting that to a dense layer creates a huge pile of parameters, most of which exist only to overfit.

Global average pooling is the modern alternative. Instead of flattening, it collapses each feature map to a single number by averaging it. A $7 \times 7 \times 128$ tensor becomes a vector of just 128 values, one average per channel.

Flatten:                 7 x 7 x 128  ->  6272 values  (then a big Dense layer)
GlobalAveragePooling2D:  7 x 7 x 128  ->   128 values  (one average per channel)

This has two benefits. It slashes the parameter count, which fights overfitting, and it builds in a useful prior: each channel becomes a single “how much of this feature is present anywhere in the image” score. For datasets where each image contains one centered object, that summary is exactly what you want. In Keras it is layers.GlobalAveragePooling2D().
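The difference is easy to reproduce with plain NumPy. This sketch uses random values in place of real feature maps and, as an assumption for the parameter count, attaches a 10-way dense classifier directly to each representation.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((2, 7, 7, 128))        # (batch, height, width, channels)

flat = features.reshape(len(features), -1)   # what Flatten produces
gap = features.mean(axis=(1, 2))             # what global average pooling produces
print(flat.shape, gap.shape)                 # (2, 6272) (2, 128)

# Weights + biases of a 10-way Dense classifier attached to each
print(flat.shape[1] * 10 + 10)               # 62730
print(gap.shape[1] * 10 + 10)                # 1290
```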


A Baseline to Beat

To judge a deeper network, you need something to compare it against. Here is a compact baseline CNN, the kind of model you built in earlier lessons: two convolutional layers, pooling, a dense layer, and a softmax output.

def make_baseline():
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

baseline = make_baseline()
history = baseline.fit(X_train, y_train, epochs=10,
                       validation_data=(X_test, y_test), verbose=0)

test_loss, test_acc = baseline.evaluate(X_test, y_test, verbose=0)
print(f"baseline test acc={test_acc:.3f}  "
      f"final train acc={history.history['accuracy'][-1]:.3f}  "
      f"val acc={history.history['val_accuracy'][-1]:.3f}")
# Output:
# baseline test acc=0.883  final train acc=0.949  val acc=0.883

The baseline reaches 0.883 test accuracy. Notice the gap between training accuracy (0.949) and test accuracy (0.883): the model fits the training set noticeably better than the test set, which is the signature of mild overfitting. You also have a regularized version of this network from the previous lesson, which trades a little training accuracy for a smaller gap and lands at 0.872 test accuracy. Both are reference points for what follows.


Building a Deeper Batch-Normalized CNN

Now put the building blocks together. This network is deeper than the baseline: three convolutional stages with the filter count doubling (32, 64, 128), batch normalization after every convolution, and global average pooling instead of Flatten.

def conv_block(x, filters):
    """Two Conv -> BatchNorm -> ReLU layers, then downsample."""
    x = layers.Conv2D(filters, 3, padding="same",
                      kernel_initializer="he_normal")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 3, padding="same",
                      kernel_initializer="he_normal")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D()(x)

def make_deep_bn():
    inputs = keras.Input(shape=(28, 28, 1))
    x = conv_block(inputs, 32)    # 28x28 -> 14x14
    x = conv_block(x, 64)         # 14x14 -> 7x7
    x = conv_block(x, 128)        # 7x7   -> 3x3
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

deep_bn = make_deep_bn()
deep_bn.summary()

A few design choices are worth calling out:

  • The network is written with the Functional API (keras.Model(inputs, outputs)) rather than Sequential. The Functional API lets you define a reusable conv_block function and call it as many times as you like, and it becomes essential once a model has branches, such as skip connections, that Sequential cannot express.
  • Each convolution uses the He Normal initializer (kernel_initializer="he_normal"). It was designed for ReLU networks and helps keep activations well-scaled in deep models.
  • GlobalAveragePooling2D replaces Flatten, so the dense classifier sees only 128 features instead of the $3 \times 3 \times 128 = 1152$ values a flattened map would contain.
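It can help to work out by hand where the parameters of each model live. The sketch below uses the standard counting rules (a convolution has kernel x kernel x in x out weights plus one bias per filter, a dense layer has in x out weights plus biases, and batch normalization keeps four values per channel) and applies them to the two architectures defined in this lesson. The helper names are invented for this example, and the totals should agree with what summary() reports.

```python
def conv_params(in_ch, out_ch, kernel=3):
    return kernel * kernel * in_ch * out_ch + out_ch   # weights + biases

def dense_params(in_units, out_units):
    return in_units * out_units + out_units            # weights + biases

def bn_params(channels):
    return 4 * channels   # gamma, beta, moving mean, moving variance

# Baseline: Conv(32) -> pool -> Conv(64) -> pool -> Flatten -> Dense(64) -> Dense(10)
baseline_head = dense_params(7 * 7 * 64, 64)
baseline_total = (conv_params(1, 32) + conv_params(32, 64)
                  + baseline_head + dense_params(64, 10))

# Deep BN model: three blocks of two Conv+BN pairs, then GAP -> Dense(10)
deep_total, in_ch = 0, 1
for filters in (32, 64, 128):
    for _ in range(2):
        deep_total += conv_params(in_ch, filters) + bn_params(filters)
        in_ch = filters
deep_total += dense_params(128, 10)

print(baseline_total, baseline_head)   # 220234 200768
print(deep_total)                      # 289514
```

In the baseline, the first dense layer after Flatten holds about 91% of all parameters. In the deeper model the classifier head is only 1,290 parameters, and almost everything sits in the convolutions.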

Now train it on the same subset, for the same number of epochs, and evaluate it on the same test set.

deep_bn.fit(X_train, y_train, epochs=10,
            validation_data=(X_test, y_test), verbose=0)

deep_loss, deep_acc = deep_bn.evaluate(X_test, y_test, verbose=0)
print(f"deep BN-CNN test acc={deep_acc:.3f}")
# Output:
# deep BN-CNN test acc=0.866

The deeper batch-normalized network reaches 0.866 test accuracy.


Comparing the Architectures

Now line up all three models on the same test set: the baseline, the regularized version, and the deeper batch-normalized network.

Figure: bar chart of test accuracy on the Fashion-MNIST subset for the baseline CNN, the regularized CNN, and the deeper batch-normalized CNN.

Model                    Test accuracy
Baseline CNN             0.883
Regularized CNN          0.872
Deep CNN + batch norm    0.866

This is a result worth sitting with, because it is the opposite of what beginners expect. The deepest, most modern-looking model scored the lowest of the three. Depth, batch normalization, and global average pooling did not buy you accuracy here. Why not?

The honest answer is that the dataset is too easy for the extra capacity to pay off. Fashion-MNIST images are tiny (28x28), grayscale, and centered, with a single object and no background clutter. A compact two-block CNN already captures almost everything there is to learn from 15,000 such images. Adding more layers gives the model more capacity, but there is no extra signal in the data for that capacity to fit, so the deeper model spends some of it fitting noise. Batch normalization and global average pooling also add their own mild regularization, which on an already-easy problem nudges accuracy down a touch rather than up.

Deeper is not automatically better

A bigger network is not a free win. On small, simple datasets, extra depth and capacity often hurt: the model overfits or the regularization built into modern layers slightly underfits. Match the size of your network to the difficulty of your problem and the amount of data you have.

So when do these techniques shine? On harder datasets: larger color images, many classes, cluttered backgrounds, and millions of training examples. There, a compact CNN runs out of capacity, gradients struggle to reach early layers, and overfitting from a giant flatten-then-dense head becomes real. That is exactly the regime batch normalization, global average pooling, and residual connections were invented for. The techniques you just practiced are the ones that scale to ImageNet-sized problems, even if they do not win on a small grayscale subset. Learning them on a fast, friendly dataset means you already know the tools when you move to a hard one.


The Idea Behind Residual Connections

There is one more architectural idea you should understand conceptually, even though you will not need it on Fashion-MNIST: the residual connection, the key insight behind the ResNet family.

Recall the problem with very deep networks: as you stack more layers, gradients have to travel back through all of them, and they tend to fade. Past a certain depth, adding layers actually made networks worse, not because of overfitting but because the deeper network was harder to optimize. In principle a deeper network can always match a shallower one (the extra layers could just learn the identity), but in practice plain stacks could not even learn that.

Residual connections solve this with a small, elegant change. Instead of asking a block to compute a brand-new output $y$ from its input $x$, you ask it to compute only the change it wants to add, and you add the input back in:

$$y = x + \text{block}(x)$$

The path that carries $x$ straight through is called a skip connection or shortcut. It has two effects. First, if a block has nothing useful to add, it can simply learn $\text{block}(x) \approx 0$ and pass the input along unchanged, so extra layers no longer have to make the network worse. Second, the shortcut gives gradients a direct, uninterrupted route back to earlier layers, which helps keep them from vanishing. With this one trick, researchers trained networks over a hundred layers deep and kept improving.

   x ─────────────────────────────┐  (skip connection)
   │                               │
   └─> Conv -> BN -> ReLU -> Conv -> BN ──> (+) -> ReLU -> y
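The gradient route through the shortcut can be made concrete with the chain rule. This short derivation uses the idealized form $y = x + \text{block}(x)$ and ignores the final ReLU shown in the diagram; $\mathcal{L}$ (the training loss) and $I$ (the identity matrix) are symbols introduced here for illustration.

```latex
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial y}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial\,\text{block}(x)}{\partial x} \right)
```

The identity term passes the incoming gradient through unchanged, so even when the block's own Jacobian is small, the earlier layers still receive a usable signal.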

A network built from these blocks is a Residual Network, or ResNet. You will not gain anything by adding skip connections to a three-block model on Fashion-MNIST, but the moment you reach for a genuinely deep network on a hard dataset, residual connections become essential. Keep the equation $y = x + \text{block}(x)$ in your back pocket.
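For reference, here is one way the diagrammed block might be written with the Keras Functional API. Treat it as a sketch rather than the ResNet reference implementation: the function name residual_block is invented here, and the 1x1 convolution on the shortcut is an assumption added so that the addition still works when the number of channels changes.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """Conv -> BN -> ReLU -> Conv -> BN, add the input back, then ReLU."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same",
                      kernel_initializer="he_normal")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same",
                      kernel_initializer="he_normal")(y)
    y = layers.BatchNormalization()(y)
    # The addition needs matching shapes; when the channel count changes,
    # project the shortcut with a 1x1 convolution (an assumption of this sketch).
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([shortcut, y])
    return layers.ReLU()(y)

inputs = keras.Input(shape=(28, 28, 1))
x = residual_block(inputs, 32)   # channels change 1 -> 32: shortcut is projected
x = residual_block(x, 32)        # channels match: plain identity shortcut
model = keras.Model(inputs, x)
print(model.output_shape)        # (None, 28, 28, 32)
```

Because the block has two paths that meet at the Add layer, it cannot be expressed with Sequential, which is one reason the deeper model in this lesson was written with the Functional API.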

From a function to a residual

The shift in framing is the whole point. A plain block must learn the full mapping from input to output. A residual block only has to learn the difference between them. Learning a small correction on top of “leave the input alone” turns out to be far easier for an optimizer than learning the full transformation from scratch.


Practice Exercises

Try these before checking the hints. Reuse the X_train, y_train, X_test, and y_test arrays from the lesson.

Exercise 1: Add a Fourth Convolutional Stage

Modify make_deep_bn to add a fourth conv_block with 256 filters after the 128-filter block, then train and evaluate it for 10 epochs. Does adding even more depth help on this small dataset?

# Your code here: copy make_deep_bn, add conv_block(x, 256) right before
# GlobalAveragePooling2D, train for 10 epochs, and print the test accuracy.

Hint

The input is only 28x28, so each conv_block halves the spatial size: 28 -> 14 -> 7 -> 3 -> 1. Insert x = conv_block(x, 256) right before GlobalAveragePooling2D. With the dataset already easy for the three-block model, expect the four-block version to land around the same accuracy or slightly lower, reinforcing the lesson that more depth is not automatically better here.

Exercise 2: Batch Normalization vs. No Batch Normalization

Build a version of the deep network with the BatchNormalization layers removed and everything else identical. Train both for 5 epochs and compare not just final accuracy but how quickly each one reaches a decent validation accuracy.

# Your code here: write a conv_block_no_bn that skips BatchNormalization,
# build the model, and compare its training to the batch-normalized version.

Hint

Keep the fit call’s history object for both models and look at history.history['val_accuracy'] epoch by epoch. The main thing batch normalization buys you on an easy dataset is not a higher final score but faster, smoother convergence in the early epochs.

Exercise 3: Flatten vs. Global Average Pooling

Replace GlobalAveragePooling2D in the deep network with Flatten followed by Dense(128, activation="relu"). Print model.count_params() for both versions and compare their test accuracy. How much does the parameter count grow?

# Your code here: build one model with GlobalAveragePooling2D and one with
# Flatten + Dense, then print count_params() and test accuracy for each.

Hint

Use model.count_params() to read the total number of parameters. The Flatten version connects every spatial position of the final feature map to the dense layer, so it has many more parameters than the global-average-pooling version. More parameters on this small dataset usually means more overfitting, not better test accuracy.


Summary

You moved from a single working CNN to thinking like an architect: choosing how deep to go, how to keep a deep network trainable, and how to keep it from overfitting. Let’s review.

Key Concepts

Going Deeper

  • Depth lets a network build a richer hierarchy of features, from edges to parts to whole objects
  • The two costs of depth are harder optimization (vanishing or exploding gradients) and easier overfitting (more parameters)
  • The right depth depends on the difficulty of the problem and the amount of data, not on a desire for the biggest model

Architectural Building Blocks

  • Group layers into a repeatable block and stack copies, doubling the filter count each time you pool; this is the VGG recipe
  • Batch normalization normalizes each layer’s inputs per batch, giving faster, more stable training and a mild regularizing effect
  • Global average pooling collapses each feature map to one number, replacing Flatten and cutting parameters dramatically
  • Batch normalization’s running mean and variance appear as non-trainable parameters; its learned scale and shift are trainable

Classic Architectures

  • VGG stacks small $3 \times 3$ convolutions, which are more expressive and efficient than a few large filters
  • Residual connections add the input back to a block’s output, $y = x + \text{block}(x)$, so a block can fall back to the identity and gradients have a direct path to earlier layers
  • A network built from residual blocks is a ResNet; the skip connections are what make very deep networks trainable

Comparing Fairly

  • Hold the data, preprocessing, epochs, and optimizer fixed; change only the architecture
  • On the Fashion-MNIST subset the baseline (0.883) beat both the regularized (0.872) and the deep batch-normalized (0.866) models

Why This Matters

The most important takeaway from this lesson is counterintuitive: your deepest, most sophisticated model scored the lowest. That is not a failure of the techniques; it is a lesson about matching tools to problems. Batch normalization, global average pooling, and residual connections were invented to make very deep networks trainable on very hard datasets. On a small, clean, grayscale dataset, a compact CNN already extracts almost all the available signal, so the extra capacity has nothing useful to learn and the built-in regularization slightly underfits.

Knowing this saves you real time and money. When a problem is small, reach for a small model first. When a problem is genuinely hard (large color images, many classes, millions of examples), the techniques you practiced here are exactly what unlock progress. You learned them on a fast, friendly dataset so that you already have them ready when the hard problem arrives.


Next Steps

You now know how to build deeper networks and, just as importantly, when not to. But what if you do not have millions of images to train a deep network from scratch? In the next lesson you will learn transfer learning, where you reuse a network already trained on a huge dataset and adapt it to your own problem.



Keep Building Your Skills

You have learned the architectural vocabulary that every deep learning practitioner uses: blocks, batch normalization, global average pooling, and skip connections. Just as valuable, you have seen with your own results that a bigger network is not automatically a better one. Carry that judgment forward. The best machine learning engineers are not the ones who build the deepest models, but the ones who match the model to the problem, and you just took a real step toward that.