CNNs: The Visual Cortex of AI
Imagine trying to explain a cat to a computer. You could say “it has whiskers” or “pointy ears.” But how does a computer see whiskers? It sees a grid of numbers—pixels.
Convolutional Neural Networks (CNNs) are the architecture that allows machines to “see” by detecting patterns—first edges, then shapes, and finally complex objects like cats or cars. They are the backbone of modern Computer Vision.
1. The Convolution Operation
At the heart of a CNN is the Convolution. Instead of processing an image as a flat list of pixels (like a standard neural network), a CNN preserves the spatial relationship between pixels.
It uses a small matrix called a Kernel (or Filter) that slides over the image. At each position, it performs a dot product—multiplying the kernel values with the image pixel values and summing them up. This produces a Feature Map.
Interactive Convolution Visualizer
Watch how a 3×3 kernel slides over a 5×5 input image to produce a 3×3 feature map. The example kernel is a vertical edge detector: it activates (produces high values) wherever there is a vertical difference in the input.
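The sliding dot product can be sketched in a few lines of NumPy. The 5×5 input below (dark left half, bright right half) and the vertical-edge kernel are illustrative values chosen for this sketch:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no-padding) 2D convolution with stride 1."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product of the kernel with the patch beneath it
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 5x5 input: dark left half (0s), bright right half (1s)
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# Vertical edge detector
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[3. 3. 0.]
#  [3. 3. 0.]
#  [3. 3. 0.]]
```

The feature map lights up (value 3) exactly where the kernel straddles the dark-to-bright boundary and is zero where the input is uniform.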
2. Stride and Padding
When the kernel moves, how far does it step?
Stride
Stride is the number of pixels the kernel moves at each step.
- Stride 1: Moves 1 pixel at a time (the standard setting).
- Stride 2: Moves 2 pixels at a time, skipping every other position. This roughly halves the output dimensions.
Padding
If we apply a 3×3 kernel to a 5×5 image, we get a 3×3 output. The image shrinks! Also, pixels on the edge are used less than pixels in the center. Padding adds extra border pixels (usually zeros) around the input image.
- “Valid” Padding: No padding (image shrinks).
- “Same” Padding: Padding added so output size equals input size (assuming stride 1).
The formula for output size (O) given input size (W), kernel size (K), padding (P), and stride (S) is:

O = ⌊(W − K + 2P) / S⌋ + 1
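The output-size formula is easy to sanity-check in code. Here is a minimal helper (written for illustration) applied to the cases discussed above:

```python
def conv_output_size(w, k, p, s):
    """Output size of a convolution: O = floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(5, 3, 0, 1))  # "Valid" padding: 5x5 input -> 3 (shrinks)
print(conv_output_size(5, 3, 1, 1))  # "Same" padding:  5x5 input -> 5 (preserved)
print(conv_output_size(5, 3, 1, 2))  # Stride 2:        5x5 input -> 3 (roughly halved)
```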
3. Pooling Layers
After convolution, we often want to downsample the feature maps to reduce computation and make the network robust to small translations (if the cat moves 2 pixels to the left, it’s still a cat).
Pooling aggregates information from a local region.
- Max Pooling: Takes the maximum value in the window. This captures the most prominent features (e.g., the strongest edge).
- Average Pooling: Takes the average value. This smooths the features.
[!TIP] Max Pooling is the standard choice in CNNs because it acts as a feature-selection mechanism, keeping only the strongest activation in each window.
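The difference between the two pooling modes shows up clearly on a tiny tensor. A sketch using PyTorch's functional API, with hand-picked values so each 2×2 window is easy to check:

```python
import torch
import torch.nn.functional as F

# A single 4x4 "image", shaped [batch, channels, height, width]
x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 0., 1., 1.],
                    [0., 0., 1., 1.]]]])

max_out = F.max_pool2d(x, kernel_size=2, stride=2)
avg_out = F.avg_pool2d(x, kernel_size=2, stride=2)

print(max_out)  # keeps the strongest value in each 2x2 window: [[4, 8], [0, 1]]
print(avg_out)  # smooths each 2x2 window: [[2.5, 6.5], [0, 1]]
```

Both outputs are 2×2: the 2×2 window with stride 2 halves each spatial dimension, but max pooling keeps the peak (4, 8) while average pooling blurs it (2.5, 6.5).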
4. PyTorch Implementation
Here is how we implement a simple CNN block in PyTorch.
```python
import torch
import torch.nn as nn

# Define a simple CNN block
class SimpleCNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(SimpleCNNBlock, self).__init__()
        # Convolution: 3x3 kernel, stride 1, padding 1 ("same" padding)
        self.conv = nn.Conv2d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=3,
            stride=1,
            padding=1
        )
        # Activation function
        self.relu = nn.ReLU()
        # Max pooling: 2x2 window, stride 2 (halves the spatial dimensions)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        # x shape: [Batch, In_Channels, Height, Width]
        x = self.conv(x)
        x = self.relu(x)
        x = self.pool(x)
        # Output shape: [Batch, Out_Channels, Height/2, Width/2]
        return x

# Example usage
# Input: batch of 8 images, 3 color channels (RGB), 64x64 pixels
input_tensor = torch.randn(8, 3, 64, 64)

# Create block: 3 input channels -> 16 output filters
block = SimpleCNNBlock(in_channels=3, out_channels=16)
output = block(input_tensor)

print(f"Input Shape: {input_tensor.shape}")   # [8, 3, 64, 64]
print(f"Output Shape: {output.shape}")        # [8, 16, 32, 32]
```
Key Parameters Breakdown
- in_channels: Depth of the input (e.g., 3 for an RGB image, 1 for grayscale).
- out_channels: Number of kernels to learn (i.e., the depth of the output feature maps).
- kernel_size: Size of the sliding window (usually 3 or 5).
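One useful consequence of these parameters: the number of learnable weights in a convolutional layer is out_channels × (in_channels × kernel_size² + 1), independent of the image resolution. A quick check against PyTorch:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# 16 filters, each with 3*3*3 weights plus one bias: 16 * (3*3*3 + 1) = 448
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 448
```

This is why CNNs scale so well: the same 448 parameters are reused at every spatial position, whether the input is 64×64 or 1024×1024.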
5. Summary
- Convolution preserves spatial structure and detects features using learnable kernels.
- Stride controls how fast the kernel moves and effectively downsamples.
- Padding prevents image shrinkage and loss of edge information.
- Pooling (usually Max) reduces dimensionality and provides translation invariance.
In the next chapter, we will see how stacking these blocks gave rise to the famous architectures that revolutionized AI.