Module Review: Computer Vision
[!NOTE] This review summarizes the core ideas of convolutional neural networks for computer vision: the convolution operation, pooling, deep architectures, and transfer learning.
1. Key Takeaways
- CNNs preserve spatial structure: Unlike fully connected networks, CNNs process images as 2D grids using kernels (filters).
- The convolution operation: A dot product between a filter and a local region of the input. Each filter detects features such as edges or textures.
- Pooling reduces size: Max pooling downsamples feature maps, keeping only the strongest activations and adding some translation invariance.
- Depth matters: Deeper networks (VGG, ResNet) generally perform better but are harder to train.
- ResNet solved depth: Residual connections (x + F(x)) allow gradients to flow through very deep networks without vanishing.
- Transfer Learning is king: Instead of training from scratch, we use models pretrained on ImageNet and adapt them (Feature Extraction or Fine-Tuning).
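The takeaways above can be sketched as a minimal forward pass. This is a toy example (the input size and channel counts are hypothetical), showing how a conv layer with padding preserves spatial size while max pooling halves it:

```python
import torch
import torch.nn as nn

# Toy input: a batch of one 3-channel 32x32 "image" (hypothetical size)
x = torch.randn(1, 3, 32, 32)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

h = torch.relu(conv(x))  # padding=1 keeps spatial size: (1, 16, 32, 32)
h = pool(h)              # max pooling halves H and W:   (1, 16, 16, 16)

print(h.shape)  # torch.Size([1, 16, 16, 16])
```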
2. Interactive Flashcards
Test your knowledge before moving on.
What is the purpose of Stride in a CNN?
Stride controls how many pixels the kernel moves at each step. A stride > 1 reduces the spatial dimensions of the output (downsampling).
Why do we use Max Pooling?
To reduce the spatial size of the representation (reducing parameters and computation) and to make the network robust to small translations in the input.
What is the "Vanishing Gradient Problem"?
In deep networks, gradients can become extremely small as they are backpropagated, stopping early layers from learning. ResNet solves this with skip connections.
What is the difference between Feature Extraction and Fine-Tuning?
Feature Extraction freezes all convolutional layers and trains only the classifier. Fine-Tuning unfreezes some top convolutional layers to adapt them to the new task.
What is a 1×1 Convolution used for?
It is used to change the number of channels (depth) without changing the height or width. It acts as a bottleneck to reduce computation (popularized by Inception).
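The channel-reduction behavior of a 1×1 convolution is easy to verify in a quick sketch (the 256→64 bottleneck sizes here are hypothetical):

```python
import torch
import torch.nn as nn

# A 1x1 convolution changes only the channel dimension
bottleneck = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

x = torch.randn(1, 256, 14, 14)
y = bottleneck(x)
print(y.shape)  # torch.Size([1, 64, 14, 14]) - height and width unchanged
```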
3. PyTorch Cheat Sheet
| Operation | Code Snippet |
|---|---|
| Convolution | nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1) |
| Max Pooling | nn.MaxPool2d(kernel_size=2, stride=2) |
| ReLU Activation | nn.ReLU() |
| Flatten | nn.Flatten() (before FC layer) |
| Load ResNet | model = torchvision.models.resnet18(weights='DEFAULT') |
| Freeze Parameters | for p in model.parameters(): p.requires_grad = False |
| Replace FC | model.fc = nn.Linear(model.fc.in_features, num_classes) |
4. Next Steps
You now understand how machines see. But what happens when the data isn’t an image, but a sequence—like text or audio?
In the next module, Sequence Models, we will explore Recurrent Neural Networks (RNNs) and LSTMs.