DL App: Neural Network Layers

[!NOTE] This module explores the core principles of Neural Network Layers: how the weight matrix, bias, and activation function each transform the input space, and how to express a layer in PyTorch.

1. Introduction: The Building Block

A Neural Network is just a chain of Linear Algebra operations interspersed with non-linear functions. The core component is the Dense Layer (or Fully Connected Layer).

Mathematically, a layer transforms an input vector x into an output vector y:

y = σ(Wx + b)
  • x: Input Vector (Shape: N × 1).
  • W: Weight Matrix (Shape: M × N). This rotates and stretches the input space.
  • b: Bias Vector (Shape: M × 1). This shifts the space (translation).
  • σ: Activation Function (e.g., ReLU). This bends or folds the space.
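The transform above can be written directly in a few lines of PyTorch. This is a minimal sketch with arbitrary example numbers (N = 3 inputs, M = 2 outputs), just to make the shapes concrete:

```python
import torch

x = torch.tensor([1.0, -2.0, 0.5])            # input vector, shape (N,) = (3,)
W = torch.tensor([[0.2, -1.0,  0.3],
                  [0.5,  0.4, -0.6]])          # weight matrix, shape (M, N) = (2, 3)
b = torch.tensor([0.1, -0.1])                  # bias vector, shape (M,) = (2,)

z = W @ x + b          # linear step: rotate/stretch, then shift
y = torch.relu(z)      # activation step: fold negative values to zero

print(z)  # tensor([ 2.4500, -0.7000])
print(y)  # tensor([2.4500, 0.0000])
```

Note how the output lives in M = 2 dimensions regardless of the input size, and how ReLU zeroes out only the negative component.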

The Manifold Hypothesis

Why does this work? Real-world data (like images of cats) lies on a low-dimensional “manifold” (a crumpled sheet) inside a high-dimensional space. The goal of the neural network is to uncrumple this sheet so that the classes (cats vs dogs) can be separated by a simple line.


2. The Activation Function (Non-Linear)

Without σ, a deep network would collapse into one big linear map, since W2(W1x) = (W2W1)x = Wnew x. The activation function introduces the non-linearity that prevents this collapse.
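This collapse is easy to verify numerically. A sketch with arbitrary random weights: two bias-free linear steps with no activation in between are exactly one matrix product.

```python
import torch

torch.manual_seed(0)

W1 = torch.randn(4, 3)   # first layer: 3 -> 4
W2 = torch.randn(2, 4)   # second layer: 4 -> 2
x  = torch.randn(3)

# Two linear steps with no activation in between...
two_steps = W2 @ (W1 @ x)

# ...equal a single linear step with Wnew = W2 @ W1
W_new    = W2 @ W1
one_step = W_new @ x

print(torch.allclose(two_steps, one_step))  # True
```

No matter how many layers you stack, without activations the whole network has the expressive power of a single matrix.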

A. ReLU (Rectified Linear Unit)

ReLU(z) = max(0, z)
  • Effect: Folds the space along the axes. Points in the negative quadrant get squashed to zero.
  • Pros: Efficient, solves Vanishing Gradient.
  • Cons: “Dead ReLU” (if a neuron’s pre-activation is always negative, its output and gradient are always zero, so it stops learning).
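The “dead” behavior is visible directly in the gradient. A small autograd sketch: for a negative pre-activation, ReLU’s output is zero and so is its gradient, so no learning signal flows back.

```python
import torch

z = torch.tensor([-2.0, 3.0], requires_grad=True)
y = torch.relu(z)
y.sum().backward()

print(y)       # tensor([0., 3.])
print(z.grad)  # tensor([0., 1.]) -- zero gradient where z < 0
```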

B. Leaky ReLU

LReLU(z) = max(0.01z, z)
  • Effect: Similar to ReLU, but allows a tiny “leak” for negative values.
  • Pros: Fixes the “Dead ReLU” problem.
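The same autograd check shows the leak. A sketch using PyTorch's default negative slope of 0.01: Leaky ReLU keeps a small but nonzero gradient on the negative side, so the neuron can recover.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, 3.0], requires_grad=True)
y = F.leaky_relu(z, negative_slope=0.01)
y.sum().backward()

print(y)       # tensor([-0.0200, 3.0000])
print(z.grad)  # tensor([0.0100, 1.0000]) -- small but nonzero gradient for z < 0
```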

C. Sigmoid / Tanh

  • Effect: Squashes space into a bounded range [0, 1] or [-1, 1].
  • Pros: Smooth, probability-like.
  • Cons: Vanishing Gradient. In the visualizer, large inputs get squashed into a tiny region where the slope is almost zero; that near-zero gradient kills learning in deep stacks.
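The vanishing gradient can be checked directly. Sigmoid's derivative is σ(z)(1 − σ(z)), which peaks at 0.25 at z = 0 and collapses toward zero as |z| grows; a quick autograd sketch:

```python
import torch

z = torch.tensor([0.0, 2.0, 10.0], requires_grad=True)
y = torch.sigmoid(z)
y.sum().backward()

print(y)       # outputs squashed into (0, 1), approaching 1.0 as z grows
print(z.grad)  # 0.25 at z=0, then shrinking rapidly toward zero
```

In a deep stack these per-layer factors multiply, so the gradient reaching early layers can become vanishingly small.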

3. Interactive Visualizer: The Neural Fold v3.0

Below, we visualize a single layer with 2 inputs and 2 neurons. We start with a grid of points (Blue).

  1. Linear Step: Apply Wx. (Shear/Rotate).
  2. Activation Step: Apply σ(z).

[!TIP] Try it yourself: Switch between ReLU, Leaky ReLU, and Sigmoid. Observe how ReLU folds the space like a piece of paper, while Leaky ReLU bends it slightly.

[Visualizer controls: Weights (W), Activation. Blue dots: input grid; green lines: transformed grid.]
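The visualizer's two steps can be reproduced in a few lines. A sketch that builds a small grid of points and applies the same default W used in this module (no bias), first the linear shear and then the ReLU fold:

```python
import torch

# Build a small 2-D grid of points (the "blue dots")
xs = torch.linspace(-1, 1, 5)
grid = torch.cartesian_prod(xs, xs)   # shape (25, 2): every (x, y) pair

# Default weights from this module, no bias
W = torch.tensor([[ 1.0, 0.5],
                  [-0.5, 1.0]])

# 1. Linear step: shear/rotate every grid point (row vectors, so x @ W.T)
linear = grid @ W.T

# 2. Activation step: fold the sheared grid into the non-negative quadrant
folded = torch.relu(linear)

print(grid.shape, folded.shape)  # torch.Size([25, 2]) torch.Size([25, 2])
```

Plotting `grid` against `folded` reproduces the before/after picture: points that landed in a negative coordinate get pinned to the axes.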

4. Code: Building a Layer in PyTorch

In PyTorch, nn.Linear handles the affine step (Wx + b). Note that it stores its weight with shape (out_features, in_features) and computes x @ W.T + b on a batch of row vectors, which is equivalent to Wx + b applied to each point.

import torch
import torch.nn as nn

# 1. Define a Dense Layer
# Input Features: 2 (x, y coordinates)
# Output Neurons: 2 (transformed coordinates)
layer = nn.Linear(in_features=2, out_features=2)

# Set weights manually to match visualizer defaults
# W = [[1.0, 0.5], [-0.5, 1.0]]
# b = [0, 0]
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[1.0, 0.5], [-0.5, 1.0]]))
    layer.bias.zero_()

# 2. Define Input Data (Batch of 3 points)
# Point 1: (1, 0)
# Point 2: (0, 1)
# Point 3: (1, 1)
x = torch.tensor([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0]
])

# 3. Pass through Linear Layer
z = layer(x)
print("Linear Output (z):\n", z)
# z = [[1.0, -0.5], [0.5, 1.0], [1.5, 0.5]] -- each row is W @ point + b

# 4. Apply Activation (ReLU)
activation = nn.ReLU()
y = activation(z)

print("\nActivated Output (y):\n", y)
# y = [[1.0, 0.0], [0.5, 1.0], [1.5, 0.5]] -- the single negative value is clipped to 0
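Stacking such layers with activations in between is how the single-layer example scales up. A minimal two-layer sketch (the hidden width of 8 is an arbitrary choice):

```python
import torch
import torch.nn as nn

# Two dense layers with a ReLU between them -- without the ReLU,
# this would collapse into a single linear map
model = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(3, 2)   # batch of 3 two-dimensional points
y = model(x)
print(y.shape)          # torch.Size([3, 2])
```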

5. Summary

  • W (Weights): Linearly transforms the space (Rotate/Scale/Shear).
  • b (Bias): Translates the space.
  • Activation: Non-linearly warps the space.
  • ReLU: Folds space. Good for Deep Learning.
  • Sigmoid: Squashes space. Good for probability output, bad for deep layers (Vanishing Gradient).