Capstone: Transformers & VAEs

[!NOTE] This capstone module shows how Transformers and VAEs are built from the mathematical tools developed earlier in the course: linear algebra, calculus, information theory, and statistics.

1. Introduction: The Modern Titans

This entire module leads to this moment. The two architectures that define modern AI:

  1. Transformers (GPT, BERT, Claude): The masters of Sequence and Context.
  2. Variational Autoencoders (VAEs, Stable Diffusion): The masters of Generation and Latent Space.

They rely entirely on the math we just covered: Dot Products (Linear Algebra), Softmax (Calculus), Entropy (Info Theory), and Gaussian Distributions (Stats).


2. Transformers: Attention is All You Need

Before 2017, we processed text sequentially (RNNs). “Read word 1, then word 2, then word 3…” This was slow and forgot long-term context.

The Transformer reads the entire sentence at once (in parallel). But how does it know “it” refers to “animal”?

The Core Mechanism: Self-Attention

Imagine a database lookup.

  • Query (Q): What am I looking for? (e.g., “it”).
  • Key (K): What defines this word? (e.g., “animal”, “street”).
  • Value (V): The actual content (The meaning).

We compute the similarity between the Query and every Key using a Dot Product.

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) V
  1. QKᵀ: Compute similarity scores (Dot Product).
  2. Divide by √d_k: Scale the scores so large dot products don’t saturate the softmax (which would make gradients vanish).
  3. Softmax: Convert scores to probabilities (sum to 1).
  4. Multiply by V: Get the weighted sum of meanings.
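The four steps above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention for a single sequence, not a production implementation (no batching, masking, or learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # steps 1-2: dot-product similarity, scaled
    weights = softmax(scores)        # step 3: each row sums to 1
    return weights @ V               # step 4: weighted sum of the values

# Toy example: 3 tokens, each with a 4-dimensional representation.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)  # shape (3, 4): one context vector per token
```

Each row of `weights` is a probability distribution over the keys, which is exactly what the visualizer in the next section draws as lines between words.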

3. Interactive Visualizer: The Attention Graph

See how Self-Attention links words. Sentence: “The animal didn’t cross the street because it was too tired.”

Instructions:

  1. Hover over any word in the bottom row (Query).
  2. See which words light up in the top row (Keys).
    • “it”: Strong link to “animal” (Resolution).
    • “tired”: Strong link to “animal” (Adjective association).
    • “cross”: Link to “street” (Object).

[!TIP] Try it yourself: Hover over the word “it” at the bottom. You will see a strong line connecting to “animal” at the top. This is the model resolving the pronoun!


4. Multi-Head Attention & Positional Encoding

Why “Multi-Head”?

A single attention mechanism might focus on Grammar (e.g., “it” → “animal”). But we also need to understand Context (e.g., “tired” → “cross”). We use multiple “Heads” (usually 8 or 12) to look at the sentence from different perspectives simultaneously.

Each head projects the embeddings into a different Subspace (Linear Algebra!), attends to different features, and then we concatenate the results.
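The split / attend / concatenate pattern can be sketched as follows. For brevity each head here simply slices the embedding into its own subspace; real models instead apply learned Q/K/V projection matrices per head before attending:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads):
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads
    head_outputs = []
    for h in range(n_heads):
        # Each head works in its own subspace of the embedding.
        sub = X[:, h * d_head:(h + 1) * d_head]
        scores = sub @ sub.T / np.sqrt(d_head)
        head_outputs.append(softmax(scores) @ sub)
    # Concatenate the heads back to the full model dimension.
    return np.concatenate(head_outputs, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, d_model = 8
out = multi_head_self_attention(X, n_heads=2)  # shape (5, 8)
```

Because the heads are independent, each one is free to learn a different relationship (grammar in one subspace, context in another) before the results are merged.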

Positional Encoding

Since the Transformer reads everything at once, it doesn’t know the order of the words. “The dog bit the man” looks the same as “The man bit the dog”. To fix this, we add a signal to the embeddings based on their position, using Sine and Cosine waves of different frequencies (echoes of Fourier analysis!).

Visualization of Positional Encoding Matrix (Y-axis: Time Steps, X-axis: Embedding Dimensions)
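The sinusoidal scheme can be generated in a few lines. This sketch follows the standard formulation, where pairs of dimensions share a frequency drawn from a geometric progression (the base constant 10000 comes from the original Transformer paper):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]       # time steps (the Y-axis above)
    i = np.arange(d_model // 2)[None, :]        # dimension pairs (the X-axis)
    freq = 1.0 / (10000 ** (2 * i / d_model))   # one frequency per pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(pos * freq)            # even dims: sine
    pe[:, 1::2] = np.cos(pos * freq)            # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)  # added elementwise to the token embeddings
```

Because every position gets a unique pattern of phases, the model can recover word order, and nearby positions produce similar vectors, which makes relative distances easy to learn.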

5. VAE: The Reparameterization Trick

Variational Autoencoders try to compress data into a Latent Space (z). This latent space is a map of concepts (e.g., “Smile Vector”, “Age Vector”).

The Problem

We need to sample z from a distribution N(μ, σ²) to generate new images. But we can’t backpropagate gradients through a random sampling node: randomness breaks the chain rule.

The Solution: Reparameterization

Move the randomness aside. Instead of sampling directly, we define:

z = μ + σ ⊙ ε

Where ε ~ N(0, 1) (Standard Noise). Now gradients can flow through μ and σ to update the encoder, treating ε as a constant input.
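In code, the trick is a one-line shift-and-scale of standard noise. A NumPy sketch (frameworks like PyTorch apply the same idea, with autograd tracking μ and σ; having the encoder output log-variance rather than σ is a common stability convention assumed here):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps, with eps ~ N(0, 1).
    # All randomness lives in eps; mu and sigma are deterministic
    # transforms of the encoder output, so gradients can flow through them.
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
log_var = np.zeros(2)                 # sigma = 1
z = reparameterize(mu, log_var, rng)  # one sample from N(mu, 1)
```

Averaged over many draws, the samples center on μ with spread σ, exactly as if we had sampled N(μ, σ²) directly, but now the sampling step is differentiable with respect to both parameters.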

The Loss Function (ELBO)

We minimize the sum of two terms (the negative ELBO):

  1. Reconstruction Loss: Does the output look like the input? (MSE or Cross-Entropy).
  2. KL Divergence: Does the latent space look like a standard Gaussian? (Regularization).

Loss = ||x − x′||² + D_KL( N(μ, σ²) || N(0, 1) )
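Both terms have simple closed forms when the reconstruction loss is MSE and the latent distributions are diagonal Gaussians. A sketch (the β-weighting of the KL term used by some variants is omitted):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term: squared error between input and reconstruction.
    recon = np.sum((x - x_recon) ** 2)
    # Closed-form KL divergence D_KL( N(mu, sigma^2) || N(0, 1) )
    # for a diagonal Gaussian parameterized by mu and log-variance.
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

x = np.array([0.2, 0.8])
# Perfect reconstruction with a standard-normal latent gives zero loss.
loss = vae_loss(x, x, mu=np.zeros(2), log_var=np.zeros(2))
```

The KL term is zero only when μ = 0 and σ = 1; any drift of the latent code away from the standard Gaussian is penalized, which is what keeps the latent space smooth enough to sample from.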

6. Summary

  • Transformers: Use dot products to find relationships between words.
  • Multi-Head Attention: Learning multiple relationships (Grammar, Meaning) in parallel.
  • Positional Encoding: Adding sine waves to give words an order.
  • VAEs: Use Gaussian tricks to generate new data from a structured latent space.