DL App: Neural Network Layers
[!NOTE] This module explores the core principles of DL App: Neural Network Layers, building the dense layer up from first principles of linear algebra and geometry.
1. Introduction: The Building Block
A Neural Network is just a chain of Linear Algebra operations interspersed with non-linear functions. The core component is the Dense Layer (or Fully Connected Layer).
Mathematically, a layer transforms an input vector x into an output vector y = σ(Wx + b), where:
- x: Input Vector (Shape: N × 1).
- W: Weight Matrix (Shape: M × N). This rotates and stretches the input space.
- b: Bias Vector (Shape: M × 1). This shifts the space (translation).
- σ: Activation Function (e.g., ReLU). This bends or folds the space.
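Putting the pieces together: y = σ(Wx + b). A minimal shape-check sketch in PyTorch, with assumed sizes N = 3 and M = 2:

```python
import torch

N, M = 3, 2            # input dim, output dim (assumed for illustration)
x = torch.randn(N, 1)  # input vector, shape N x 1
W = torch.randn(M, N)  # weight matrix, shape M x N
b = torch.randn(M, 1)  # bias vector, shape M x 1

z = W @ x + b          # linear step: rotate/stretch, then shift
y = torch.relu(z)      # activation: fold the space

print(y.shape)         # torch.Size([2, 1])
```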
The Manifold Hypothesis
Why does this work? Real-world data (like images of cats) lies on a low-dimensional “manifold” (a crumpled sheet) inside a high-dimensional space. The goal of the neural network is to uncrumple this sheet so that the classes (cats vs dogs) can be separated by a simple line.
2. The Activation Function (Non-Linear)
Without σ, a deep network would collapse into a single linear map, since W2(W1x) = (W2W1)x = W_new x. The activation function introduces non-linearity.
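This collapse is easy to verify numerically; a small sketch with two randomly chosen weight matrices (sizes are arbitrary):

```python
import torch

x = torch.randn(4, 1)
W1 = torch.randn(3, 4)
W2 = torch.randn(2, 3)

# Two stacked linear layers with no activation in between...
deep = W2 @ (W1 @ x)

# ...collapse into a single equivalent matrix
W_new = W2 @ W1
shallow = W_new @ x

print(torch.allclose(deep, shallow, atol=1e-6))  # True
```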
A. ReLU (Rectified Linear Unit)
- Effect: Folds the space along the axes. Points in the negative quadrant get squashed to zero.
- Pros: Efficient, solves Vanishing Gradient.
- Cons: “Dead ReLU” (if a neuron's pre-activation is always negative, its gradient is zero and it stops learning).
B. Leaky ReLU
- Effect: Similar to ReLU, but allows a tiny “leak” for negative values.
- Pros: Fixes the “Dead ReLU” problem.
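The difference is easy to see on a few sample pre-activations; a sketch using PyTorch's nn.ReLU and nn.LeakyReLU (0.01 is LeakyReLU's default slope):

```python
import torch
import torch.nn as nn

z = torch.tensor([-2.0, -0.5, 0.0, 1.0])

relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)

# ReLU zeros every negative value; Leaky ReLU keeps a small signal:
# -2.0 -> -0.02, -0.5 -> -0.005, so the gradient never fully dies.
print(relu(z))
print(leaky(z))
```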
C. Sigmoid / Tanh
- Effect: Squashes space into a bounded range: (0, 1) for Sigmoid, (-1, 1) for Tanh.
- Pros: Smooth, probability-like.
- Cons: Vanishing Gradient. Notice in the visualizer how large inputs get squashed into a tiny region where the slope is almost zero? That kills learning.
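Autograd makes this concrete: the gradient of sigmoid is σ'(z) = σ(z)(1 − σ(z)), which shrinks rapidly as |z| grows. A short sketch:

```python
import torch

z = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)
y = torch.sigmoid(z)
y.sum().backward()

# Gradients: 0.25 at z = 0, already tiny at z = 5 (~0.0066),
# and vanishingly small at z = 10 (~4.5e-5).
print(z.grad)
```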
3. Interactive Visualizer: The Neural Fold v3.0
Below, we visualize a single layer with 2 inputs and 2 neurons. We start with a grid of points (Blue).
- Linear Step: Apply Wx. (Shear/Rotate).
- Activation Step: Apply σ(z).
[!TIP] Try it yourself: Switch between ReLU, Leaky ReLU, and Sigmoid. Observe how ReLU folds the space like a piece of paper, while Leaky ReLU bends it slightly.
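The two visualizer steps can also be reproduced offline; a sketch applying the same default weight matrix used in the code section below to a small grid of points:

```python
import torch

# Grid of 2-D points covering [-1, 1] x [-1, 1]
xs = torch.linspace(-1, 1, 5)
grid = torch.cartesian_prod(xs, xs)          # shape (25, 2)

W = torch.tensor([[1.0, 0.5], [-0.5, 1.0]])  # visualizer default

z = grid @ W.T         # linear step: shear/rotate every point
y = torch.relu(z)      # activation step: fold negatives onto the axes

# After the fold, the whole grid lives in the non-negative quadrant
print((y >= 0).all())
```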
4. Code: Building a Layer in PyTorch
In PyTorch, nn.Linear implements the affine map Wx + b. Note that for a batch of row vectors it actually computes x @ W.T + b, the row-vector form of the same operation.
import torch
import torch.nn as nn
# 1. Define a Dense Layer
# Input Features: 2 (x, y coordinates)
# Output Neurons: 2 (transformed coordinates)
layer = nn.Linear(in_features=2, out_features=2)
# Set weights manually to match visualizer defaults
# W = [[1.0, 0.5], [-0.5, 1.0]]
# b = [0, 0]
with torch.no_grad():
    layer.weight = nn.Parameter(torch.tensor([[1.0, 0.5], [-0.5, 1.0]]))
    layer.bias = nn.Parameter(torch.zeros(2))
# 2. Define Input Data (Batch of 3 points)
# Point 1: (1, 0)
# Point 2: (0, 1)
# Point 3: (1, 1)
x = torch.tensor([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
# 3. Pass through Linear Layer
z = layer(x)
print("Linear Output (z):\n", z)
# Each row is W @ point, e.g. (1, 0) -> (1.0, -0.5)
# 4. Apply Activation (ReLU)
activation = nn.ReLU()
y = activation(z)
print("\nActivated Output (y):\n", y)
# Negative values become 0
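Stacking several such layers, with an activation folding the space between each linear step, gives a full network. A minimal sketch using nn.Sequential (the hidden size of 8 is arbitrary):

```python
import torch
import torch.nn as nn

# Two dense layers with a ReLU fold in between
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(3, 2)  # batch of 3 points
out = model(x)
print(out.shape)       # torch.Size([3, 2])
```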
5. Summary
- W (Weights): Linearly transforms the space (Rotate/Scale/Shear).
- b (Bias): Translates the space.
- Activation: Non-linearly warps the space.
- ReLU: Folds space. Good for Deep Learning.
- Sigmoid: Squashes space. Good for probability output, bad for deep layers (Vanishing Gradient).