Lesson 1 - Introduction to Convolutional Neural Networks

Welcome to Computer Vision

This lesson introduces you to convolutional neural networks (CNNs), the family of models behind many modern computer vision systems, from face unlock on your phone to medical image diagnosis. You will learn why ordinary dense networks struggle with images, what a convolution actually does, and the handful of ideas (kernels, feature maps, parameter sharing, and translation invariance) that make CNNs so effective. Along the way you will meet Fashion-MNIST, the clothing image dataset you will use throughout this module.

By the end of this lesson, you will be able to:

Explain how a digital image is stored as a grid of pixel values
Describe why fully connected (dense) networks scale badly on image data
Define a convolution and compute one by hand with a small kernel
Explain local receptive fields, feature maps, parameter sharing, and translation invariance
Load and explore the Fashion-MNIST dataset with Keras

You should be comfortable with basic Python and NumPy, and have seen a simple neural network before (layers, weights, and training). Let’s begin.

How a Computer Sees an Image

Your brain processes the visual world effortlessly. You glance at a photo and instantly know it shows a sneaker, a coat, or a face. A computer has no such intuition. To a machine, an image is nothing more than a grid of numbers.

Every digital image is stored as a matrix of values. Each cell in that matrix is a pixel, short for “picture element.” A pixel holds a number describing how bright that point is. If an image is 1920 pixels wide and 1080 pixels tall, it contains $1920 \times 1080 = 2{,}073{,}600$ pixels, and in the simplest (grayscale) case the device stores one number per pixel.

For a grayscale image, each pixel stores a single value from 0 to 255. A value of 0 is pure black, 255 is pure white, and everything in between is a shade of gray. A color image is stored a little differently: it stacks three grids, one each for red, green, and blue intensity, so a color pixel is described by three numbers instead of one. The number of stacked grids is called the number of channels: a grayscale image has one channel, a color image has three.

The clothing images you will work with in this module are grayscale and small, just 28 by 28 pixels, so each image is a single 28-by-28 grid of numbers between 0 and 255. That is 784 numbers per image.

Channels in one sentence

A grayscale image has shape (height, width, 1) and a color image has shape (height, width, 3). The final number is the channel count, and it is the one beginners forget most often when defining a model’s input shape.
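To make these shapes concrete, here is a minimal NumPy sketch. The arrays are synthetic stand-ins filled with random values (the names `gray` and `color` are made up for this illustration), not real photographs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for images: 8-bit values in the range 0..255
gray = rng.integers(0, 256, size=(28, 28, 1), dtype=np.uint8)     # (height, width, 1)
color = rng.integers(0, 256, size=(200, 200, 3), dtype=np.uint8)  # (height, width, 3)

print(gray.shape, gray.size)    # (28, 28, 1) 784
print(color.shape, color.size)  # (200, 200, 3) 120000

# One pixel: a single brightness for grayscale, three intensities (R, G, B) for color
print(gray[5, 5].shape)   # (1,)
print(color[5, 5].shape)  # (3,)
```

The last axis is the channel count in both cases, which is the number a model's input shape must include.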

Why Dense Networks Struggle With Images

If you have built a neural network before, you have used dense (also called fully connected) layers, where every input connects to every neuron with its own weight. Dense networks work beautifully for tabular data. For images, they fall apart for two related reasons.

Reason 1: The Parameter Explosion

To feed an image into a dense layer, you first have to flatten it into a long vector. A tiny 28-by-28 grayscale image flattens into 784 values. If the first hidden layer has just 128 neurons, that single layer already needs:

$$784 \times 128 + 128 = 100{,}480 \text{ parameters}$$

That is manageable. But real photographs are not 28 by 28. A modest 200-by-200 color image flattens into $200 \times 200 \times 3 = 120{,}000$ values. Connect that to 128 neurons and the first layer alone needs over 15 million weights, before you have added a single additional layer. The number of parameters grows with the size of the image, and large models with too many parameters are slow to train and quick to overfit.
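The arithmetic above is easy to check yourself. The helper below is a small sketch (the function name `dense_layer_params` is invented for this example) that counts one weight per input per neuron, plus one bias per neuron.

```python
def dense_layer_params(n_inputs: int, n_neurons: int) -> int:
    """Weights (one per input per neuron) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


small = dense_layer_params(28 * 28 * 1, 128)    # 28x28 grayscale image
large = dense_layer_params(200 * 200 * 3, 128)  # 200x200 color image

print(f"{small:,}")  # 100,480
print(f"{large:,}")  # 15,360,128
```

Changing only the image size changes the count by more than two orders of magnitude, even though the layer still has just 128 neurons.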

Reason 2: Throwing Away Spatial Structure

The deeper problem is more subtle. The moment you flatten an image, you destroy the information about which pixels are next to each other. Pixel (5, 5) and pixel (5, 6) sit side by side in the picture, but after flattening they are just two arbitrary positions in a long list. A dense layer has no idea they are neighbors.
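One way to see what flattening does is to label every pixel with the position it ends up at in the flattened vector. This is only an illustrative sketch using NumPy's default row-by-row flattening order; `flat_index` is a made-up name.

```python
import numpy as np

height, width = 28, 28

# Label every pixel with its own position in the flattened vector
flat_index = np.arange(height * width).reshape(height, width)

print(flat_index[5, 5])  # 145
print(flat_index[5, 6])  # 146  (right-hand neighbor: 1 step away)
print(flat_index[6, 5])  # 173  (neighbor directly below: a full row, 28 steps, away)
```

The pixel directly below (5, 5) lands 28 positions away, and in any case a dense layer gives every position its own independent weight, so nothing in the layer encodes which positions were adjacent in the picture.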

This matters because the meaning in an image lives in local patterns: an edge, a corner, a curve, the texture of fabric. These patterns are made of nearby pixels working together. A model that treats every pixel as independent has to relearn the same edge detector separately for every possible location in the image. That is wasteful and fragile.

We need an approach that respects two facts about images: meaningful patterns are local, and the same pattern can appear anywhere in the frame. That approach is the convolution.

What Is a Convolution?

A convolution is a small, repeatable operation that scans an image looking for a specific local pattern. Here is the whole idea in one sentence: you slide a tiny grid of numbers, called a kernel (or filter), across the image, and at each position you multiply the kernel against the patch of pixels underneath it and add up the result.

Each output number answers one question: “How strongly does the pattern in this kernel appear at this spot?” Doing this everywhere produces a new grid called a feature map, which highlights where the pattern was found.

Figure: A convolution slides a small 3-by-3 kernel across the input image, computing one output value per position to build a smaller feature map.

The small patch of the image that a single output value depends on is called its local receptive field. A 3-by-3 kernel has a 3-by-3 receptive field: each output value looks at only 9 input pixels, not the entire image. This is exactly the locality that dense layers lacked.

Computing a Convolution by Hand

The best way to understand a convolution is to compute one yourself. Let’s use a small binary “image” (just 0s and 1s) and a 3-by-3 kernel.

import numpy as np

image = np.array([[1, 1, 0, 0, 0, 0, 1, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 0, 0, 0, 0, 0, 1],
                  [1, 0, 0, 0, 0, 0, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 1, 0, 0, 0, 0, 1, 1]])

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

This particular kernel is a classic vertical edge detector. Notice that its left column is negative and its right column is positive. Where the image is dark on the left and bright on the right (a vertical edge), the kernel produces a large positive value; where the edge runs the other way, from bright to dark, it produces a large negative value; and where the image is flat, the positives and negatives cancel out to zero.

The kernel is 3 by 3 and the image is 8 by 8. The kernel can slide to 6 distinct horizontal positions and 6 vertical positions, so the output feature map will be 6 by 6. We loop over every position, multiply element by element, and sum.

# The kernel fits in 6 horizontal and 6 vertical positions -> a 6x6 output
conv_output = np.zeros((6, 6), dtype=int)

for i in range(6):          # slide top to bottom
    for j in range(6):      # slide left to right
        patch = image[i:i+3, j:j+3]          # the 3x3 receptive field
        conv_output[i, j] = np.sum(patch * kernel)

print(conv_output)
# Output:
# [[-1  2  0  0 -2  1]
#  [-1  3  0  0 -3  1]
#  [-3  1  0  0 -1  3]
#  [-3  1  0  0 -1  3]
#  [-1  3  0  0 -3  1]
#  [-1  2  0  0 -2  1]]

Look at the structure of that output. The large-magnitude values cluster on the left and right columns, exactly where the original shape had vertical edges, while the flat middle columns are zero. The kernel found the vertical edges in the image. That is feature extraction in action.
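The nested loop generalizes to any image and kernel size: with no padding and a stride of 1, an H-by-W image and a kh-by-kw kernel give an output of shape (H - kh + 1, W - kw + 1). The sketch below is one possible vectorized version, assuming NumPy 1.20 or later for `sliding_window_view`; the function name `conv2d_valid` is our own, not a library function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` with no padding and a stride of 1."""
    # windows has shape (H - kh + 1, W - kw + 1, kh, kw): one patch per output cell
    windows = sliding_window_view(image, kernel.shape)
    # Multiply each patch by the kernel element-wise, then sum over the patch axes
    return (windows * kernel).sum(axis=(-2, -1))


image = np.array([[1, 1, 0, 0, 0, 0, 1, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 0, 0, 0, 0, 0, 1],
                  [1, 0, 0, 0, 0, 0, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 1, 0, 0, 0, 0, 1, 1]])

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (6, 6)
print(feature_map[0])     # [-1  2  0  0 -2  1]
```

It produces the same 6-by-6 feature map as the explicit loop; the loop is easier to read, while the windowed version avoids Python-level iteration.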

Let’s confirm one cell by hand. The top-left output value comes from the top-left 3-by-3 patch of the image:

$$\begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \odot \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$

Multiplying element by element and summing gives $(-1) + 0 + 0 + (-2) + 0 + 2 + (-1) + 0 + 1 = -1$, which matches the -1 in the top-left corner of conv_output.
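If you want to see the nine individual products rather than just their sum, a short check like the following works. It is a sketch of the same single-cell calculation, with the top-left patch typed in directly.

```python
import numpy as np

# Top-left 3x3 patch of the 8x8 image
patch = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [1, 0, 1]])

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# `*` on NumPy arrays is element-wise multiplication, not a matrix product
products = patch * kernel
print(products)
print(products.sum())  # -1
```

Note that `patch @ kernel` would compute a matrix product instead, which is not what a convolution does.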

One kernel, one pattern

Transposing this kernel (swapping rows for columns) turns it into a horizontal edge detector instead. Each kernel detects exactly one kind of pattern. That is why a real convolutional layer uses many kernels at once, so it can look for many patterns in parallel.

From Convolutions to Convolutional Networks

A convolutional neural network is a network whose early layers are convolutional layers instead of dense layers. Each convolutional layer slides one or more kernels across its input and produces a feature map for each kernel. An activation function such as ReLU is then applied to each feature map, just as in a dense network. The result is the layer’s set of feature maps, the patterns it found.
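As a small illustration of the activation step, here is ReLU applied to the feature map computed by hand earlier in this lesson. This is a NumPy sketch of the idea only; a real Keras layer applies the activation for you.

```python
import numpy as np

# The 6x6 feature map from the vertical edge detector
feature_map = np.array([[-1, 2, 0, 0, -2, 1],
                        [-1, 3, 0, 0, -3, 1],
                        [-3, 1, 0, 0, -1, 3],
                        [-3, 1, 0, 0, -1, 3],
                        [-1, 3, 0, 0, -3, 1],
                        [-1, 2, 0, 0, -2, 1]])

# ReLU keeps positive responses and replaces negative ones with 0
activated = np.maximum(feature_map, 0)
print(activated[0])  # [0 2 0 0 0 1]
```

After ReLU, only the positions where this kernel responded positively (dark on the left, bright on the right) remain active.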

The crucial difference from the dense networks you know: in a convolutional layer you do not have a separate weight for every input pixel. Instead, the kernel is the weight matrix, and the same small kernel is reused at every position as it slides. This single idea, called parameter sharing, is what makes CNNs both efficient and powerful.

Parameter Sharing

Recall that a dense layer connecting a 200-by-200 color image to 128 neurons needed over 15 million weights. Now compare that to a convolutional layer. A single 3-by-3 kernel on a color image has $3 \times 3 \times 3 = 27$ weights, plus one bias, for 28 parameters total. Even with 64 different kernels in the layer, you have only $64 \times 28 = 1{,}792$ parameters, regardless of how large the image is.

The reason is that the kernel is shared. The same 9 (or 27) numbers are applied at every location, so the parameter count depends on the kernel size and the number of kernels, not on the size of the image. This is a dramatic reduction, and it is also a form of built-in regularization: with fewer free parameters, the model is far less prone to overfitting.
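The contrast is easy to tabulate. The sketch below uses a hypothetical helper, `conv_layer_params`, that counts one weight per kernel cell per input channel plus one bias per kernel, and compares it with the dense-layer count for a few image sizes.

```python
def conv_layer_params(kernel_h: int, kernel_w: int, in_channels: int, n_kernels: int) -> int:
    """(weights per kernel + 1 bias) times the number of kernels."""
    return (kernel_h * kernel_w * in_channels + 1) * n_kernels


print(conv_layer_params(3, 3, 3, 1))   # 28
print(conv_layer_params(3, 3, 3, 64))  # 1792

# The convolutional count never mentions the image size; the dense count does
for side in (28, 200, 1000):
    conv = conv_layer_params(3, 3, 3, 64)
    dense = side * side * 3 * 128 + 128
    print(f"{side}x{side} color image: conv={conv:,}  dense={dense:,}")
```

The image size appears nowhere in `conv_layer_params`, which is parameter sharing expressed as arithmetic.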

Translation Invariance

Parameter sharing buys you a second, almost magical property. Because the same kernel scans the entire image, a pattern is detected the same way no matter where it appears. A vertical-edge kernel finds a vertical edge in the top-left corner using the exact same weights it uses in the bottom-right corner.

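This is called translation invariance: shifting an object around the frame does not stop the network from recognizing it. (Strictly speaking, the convolution itself is translation equivariant, meaning that shifting the input shifts the feature map by the same amount; pooling and later layers turn that into approximate invariance.) A dense network would have to learn the same feature over and over for each possible position. A CNN learns it once and applies it everywhere. For real images, where a sneaker might sit anywhere in the photo, this is exactly the behavior you want.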
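You can watch this happen on a toy example. The sketch below places a bright square on a dark background, moves it, and convolves both versions with the vertical edge kernel from this lesson. The image, the shift, and the helper `conv2d_valid` are all invented for the illustration, and `sliding_window_view` assumes NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_valid(image, kernel):
    """No padding, stride 1: one output value per kernel-sized patch."""
    windows = sliding_window_view(image, kernel.shape)
    return (windows * kernel).sum(axis=(-2, -1))


kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# A bright 3x3 square on a dark 10x10 background
base = np.zeros((10, 10), dtype=int)
base[2:5, 2:5] = 1

# The same square moved 4 rows down and 3 columns to the right
shifted = np.roll(base, shift=(4, 3), axis=(0, 1))

fm_base = conv2d_valid(base, kernel)
fm_shifted = conv2d_valid(shifted, kernel)

# Location of the first strongest response in each feature map
peak_base = tuple(int(v) for v in np.unravel_index(np.argmax(fm_base), fm_base.shape))
peak_shifted = tuple(int(v) for v in np.unravel_index(np.argmax(fm_shifted), fm_shifted.shape))

print(fm_base.max(), fm_shifted.max())  # 4 4
print(peak_base, peak_shifted)          # (2, 0) (6, 3)
```

The same weights give the same peak response in both images, and the peak simply moves by the same (4, 3) offset as the square.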

The Kernels Are Learned

In our hand example, we chose the edge-detector kernel ourselves. In a real CNN, you do not hand-pick kernel values. You specify only how many kernels a layer should have and how big each one is; the network learns the kernel weights automatically through training, just like any other weights, using backpropagation.
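To get a feel for what "learned" means, here is a deliberately tiny sketch. It is not how Keras trains a layer internally: it uses a single kernel, no bias, no activation, a synthetic random image, and plain gradient descent on mean squared error, with the gradient written out by hand. Every name and hyperparameter here (`learned`, `learning_rate`, the 500 steps) is an assumption chosen for the illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))  # synthetic input, not real image data

# The kernel we pretend not to know; it only generates the training target
target_kernel = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]], dtype=float)

# All 3x3 patches of the image: shape (14, 14, 3, 3)
patches = sliding_window_view(image, (3, 3))
target_map = np.einsum("ijab,ab->ij", patches, target_kernel)

learned = np.zeros((3, 3))  # start with an uninformative kernel
learning_rate = 0.05

for step in range(500):
    pred = np.einsum("ijab,ab->ij", patches, learned)  # forward pass
    error = pred - target_map
    # Gradient of the mean squared error with respect to each kernel weight
    grad = 2 * np.einsum("ij,ijab->ab", error, patches) / error.size
    learned -= learning_rate * grad

print(np.round(learned, 2))
```

Starting from all zeros, the nine weights drift toward the edge-detector values purely because doing so reduces the error. A real CNN does the same thing for many kernels at once, with the gradients computed by backpropagation.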

As training proceeds, the early layers tend to learn simple, generic patterns like edges and color blobs, while deeper layers combine those into more complex features like textures, fabric patterns, and eventually whole-object shapes such as a collar or a heel. You will see this layered architecture in detail in the next lesson.

The three big ideas

Everything about why CNNs work for images reduces to three connected ideas: local receptive fields (each output looks at a small patch), parameter sharing (one kernel reused everywhere), and translation invariance (patterns are found regardless of position). Keep these three in mind and the rest of this module will feel intuitive.

Meet the Dataset: Fashion-MNIST

For the rest of this module you will work with Fashion-MNIST, a dataset of small grayscale photos of clothing. It is a popular benchmark because it is small enough to train quickly yet hard enough to be interesting, and it is noticeably harder than the handwritten-digit MNIST dataset it was designed to replace.

Fashion-MNIST contains 70,000 images in total: 60,000 for training and 10,000 for testing. Every image is 28 by 28 pixels, grayscale (one channel), and belongs to exactly one of 10 clothing categories:

Label	Class	Label	Class
0	T-shirt/top	5	Sandal
1	Trouser	6	Shirt
2	Pullover	7	Sneaker
3	Dress	8	Bag
4	Coat	9	Ankle boot

Keras ships this dataset, so you can download and load it with a single function call.

from tensorflow import keras
import numpy as np

# load_data() downloads the dataset on the first call and reuses the cached copy afterwards
(X_train_full, y_train_full), (X_test_full, y_test_full) = \
    keras.datasets.fashion_mnist.load_data()

print("Full train images:", X_train_full.shape)
print("Full test images: ", X_test_full.shape)
# Output:
# Full train images: (60000, 28, 28)
# Full test images:  (10000, 28, 28)

The training images arrive as a NumPy array of shape (60000, 28, 28): 60,000 images, each a 28-by-28 grid. The labels are integers from 0 to 9.

Adding the Channel Dimension and Taking a Subset

Convolutional layers in Keras expect each image to carry an explicit channel dimension, so a grayscale image should have shape (28, 28, 1) rather than (28, 28). We add that final axis. Training on all 60,000 images can be slow without a GPU, so for the lessons in this module we use a smaller, representative subset of 15,000 training images and 3,000 test images. This keeps experiments fast while still being plenty of data to learn from.

# Add the channel axis: (28, 28) -> (28, 28, 1)
X_train_full = X_train_full[..., np.newaxis]
X_test_full = X_test_full[..., np.newaxis]

# Use a subset to keep training fast
X_train = X_train_full[:15000]
y_train = y_train_full[:15000]
X_test = X_test_full[:3000]
y_test = y_test_full[:3000]

print("Train subset:", X_train.shape)
print("Test subset: ", X_test.shape)
# Output:
# Train subset: (15000, 28, 28, 1)
# Test subset:  (3000, 28, 28, 1)

You now have a training subset of shape (15000, 28, 28, 1) and a test subset of shape (3000, 28, 28, 1). That trailing 1 is the channel dimension you just added.

Don’t forget to normalize later

Pixel values here range from 0 to 255. Neural networks train far more reliably when inputs are scaled to a small range, typically by dividing by 255 to put every value in [0, 1]. We will handle normalization when we build and train the actual model in the next lesson; just remember that raw pixel values are not yet ready for training.
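As a preview of what that scaling looks like, here is a minimal sketch on a synthetic batch of random 8-bit values (not the real dataset). The names `raw` and `scaled` are placeholders for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic batch shaped like four Fashion-MNIST images with a channel axis
raw = rng.integers(0, 256, size=(4, 28, 28, 1), dtype=np.uint8)

# Convert to float first, then divide so every value lands in [0, 1]
scaled = raw.astype("float32") / 255.0

print(scaled.dtype, scaled.min() >= 0.0, scaled.max() <= 1.0)  # float32 True True
```

Converting to a float type before dividing keeps the result in floating point at the precision you chose, rather than leaving it as 8-bit integers.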

Looking at the Images

Numbers in an array are hard to reason about, so let’s actually look at the clothing. The grid below shows a sample of Fashion-MNIST images with their class labels.

Figure: A sample of Fashion-MNIST images labeled with their classes: low resolution, grayscale, and surprisingly varied within each clothing category.

Notice the challenge. Items in the same class vary in shape, orientation, and lighting; a pullover and a coat can look almost identical at 28 by 28 pixels; and the object is not always centered. This variety is exactly why we need a model that detects local patterns wherever they appear, rather than memorizing fixed pixel positions. You can reproduce a view like this yourself with Matplotlib.

import matplotlib.pyplot as plt

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

plt.figure(figsize=(8, 4))
for i in range(8):
    plt.subplot(2, 4, i + 1)
    plt.imshow(X_train[i].squeeze(), cmap="gray")
    plt.title(class_names[y_train[i]])
    plt.axis("off")
plt.tight_layout()
plt.show()

The .squeeze() call drops the trailing channel dimension so Matplotlib sees a plain 28-by-28 grid, and cmap="gray" renders the single channel in grayscale.
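One detail worth knowing: called without arguments, `.squeeze()` removes every axis of length 1, not just the channel axis. The sketch below uses placeholder arrays of zeros to show the difference.

```python
import numpy as np

one_image = np.zeros((28, 28, 1))
print(one_image.squeeze().shape)         # (28, 28)
print(one_image.squeeze(axis=-1).shape)  # (28, 28)

# A batch that happens to contain a single image
batch_of_one = np.zeros((1, 28, 28, 1))
print(batch_of_one.squeeze().shape)         # (28, 28)  batch axis dropped too
print(batch_of_one.squeeze(axis=-1).shape)  # (1, 28, 28)  only the channel axis dropped
```

For a single image, as in the plotting loop, both forms behave the same; passing `axis=-1` is the safer habit when an array might also have a batch axis of length 1.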

Practice Exercises

Try these before checking the hints. They reinforce the core ideas without building a full model yet.

Exercise 1: Detect Horizontal Edges

The lesson used a vertical edge detector. Build the horizontal edge detector by transposing that kernel, then convolve it with the same 8-by-8 image and print the 6-by-6 output. Compare the result to the vertical version.

import numpy as np

image = np.array([[1, 1, 0, 0, 0, 0, 1, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 0, 0, 0, 0, 0, 1],
                  [1, 0, 0, 0, 0, 0, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 0, 1, 1, 1, 1, 0, 1],
                  [1, 1, 0, 0, 0, 0, 1, 1]])

# Your code here

Hint

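Create the horizontal kernel with kernel = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]]) (or transpose the vertical one with .T). Then reuse the same nested loop from the lesson: build a (6, 6) zero array, slide the kernel with image[i:i+3, j:j+3], multiply, and sum. In this image the horizontal edges are the tops and bottoms of the two bright bars, so the largest-magnitude values (4 and -4) now fill the middle columns of every row, exactly the columns the vertical detector left at zero. The sign tells you whether the image goes from dark to bright or from bright to dark as you move down.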

Exercise 2: Count the Parameters

Compute how many parameters a dense first layer versus a convolutional first layer would need for a 28-by-28-by-1 Fashion-MNIST image. For the dense layer assume 128 neurons. For the convolutional layer assume 32 kernels of size 3-by-3. Print both numbers.

# Your code here (just arithmetic, no Keras needed)

Hint

A dense layer needs inputs * neurons + neurons parameters, where inputs = 28 * 28 * 1 = 784, giving 784 * 128 + 128 = 100480. A convolutional layer needs (kernel_h * kernel_w * channels + 1) * num_kernels, which is (3 * 3 * 1 + 1) * 32 = 320. The convolutional layer uses hundreds of times fewer parameters thanks to parameter sharing.

Exercise 3: Inspect the Class Distribution

Load Fashion-MNIST, take the 15,000-image training subset, and check how many examples each of the 10 classes has. Is the dataset balanced?

from tensorflow import keras
import numpy as np

(X_train_full, y_train_full), _ = keras.datasets.fashion_mnist.load_data()

# Your code here

Hint

Slice the labels with y_train = y_train_full[:15000], then call np.unique(y_train, return_counts=True) to get each class and its count. You should see all 10 classes present with roughly similar counts, so the subset is close to balanced, which means plain accuracy will be a sensible metric later.

Summary

You have built the conceptual foundation for everything that follows in this module. Let’s review what you learned.

Key Concepts

How Images Are Stored

An image is a grid of pixels, each holding a brightness value (0 to 255 for grayscale)
The number of stacked grids is the channel count: 1 for grayscale, 3 for color (RGB)

Why Dense Networks Struggle

Flattening an image creates a huge number of parameters that grows with image size
Flattening also destroys spatial structure, so the network cannot exploit local patterns

Convolutions

A convolution slides a small kernel (filter) over the image, multiplying and summing at each position
Each output value depends only on a small patch, its local receptive field
The full set of outputs is a feature map that highlights where a pattern appears
One kernel detects one pattern; real layers use many kernels at once

Why CNNs Work

Parameter sharing: one kernel is reused everywhere, so parameter count depends on kernel size and count, not image size
Translation invariance: a pattern is detected regardless of where it sits in the image
Kernel weights are learned during training, not set by hand

The Dataset

Fashion-MNIST: 60,000 train / 10,000 test grayscale 28-by-28 images, 10 clothing classes
You added a channel axis to get shape (28, 28, 1) and took a 15000 / 3000 subset for speed
Raw pixels still need normalization before training

Why This Matters

Convolutional neural networks are the workhorse of computer vision, and the three ideas you learned here, local receptive fields, parameter sharing, and translation invariance, explain why they work so well. They are not arbitrary engineering tricks; they are a direct response to the structure of images, where meaning is local and patterns can appear anywhere.

Understanding these foundations now will pay off repeatedly. When you later tune kernel sizes, stack many layers, add pooling, or reach for a pretrained model, you will be making informed decisions rather than copying recipes. Every advanced technique in this module is built on the simple sliding-kernel operation you computed by hand in this lesson.

Next Steps

You now understand what a convolution does and why CNNs are the right tool for images. In the next lesson, you will assemble these pieces into a working architecture, stacking convolutional layers, pooling, and a classifier head, and train it on Fashion-MNIST in Keras.

Continue to Lesson 2 - CNN Architecture

Stack convolutional and pooling layers into a complete CNN and train it in Keras.

Back to Module Overview

Return to the Computer Vision with CNNs module overview.

Keep Building Your Skills

You have taken your first real step into computer vision. The convolution you computed by hand is the same operation running billions of times inside the largest vision models in the world; the only difference is scale and learned kernels. Hold on to the intuition you built here, that a CNN scans for local patterns and reuses what it learns everywhere, and the architectures, regularization tricks, and transfer-learning techniques in the coming lessons will fall neatly into place.
