Module Review: Transformers

1. Key Takeaways

  • Self-Attention: Allows the model to look at all words simultaneously, solving the long-term dependency problem of RNNs.
  • Query, Key, Value: The core retrieval mechanism. Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V.
  • Multi-Head Attention: Runs multiple attention mechanisms in parallel to capture different relationships (grammar, semantics, coreference).
  • Positional Encodings: Since Self-Attention is permutation invariant, we inject position information via sine/cosine waves.
  • Pre-Training objectives:
      • BERT (Encoder): Masked LM, good for understanding.
      • GPT (Decoder): Causal LM, good for generation.
      • T5 (Encoder-Decoder): Span corruption, good for translation/summarization.
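The attention formula above can be sketched directly in numpy. This is a minimal illustration of scaled dot-product attention, not a production implementation; the shapes (3 tokens, d_k = 4) are toy values chosen here, not from the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 3 tokens with d_k = 4 (illustrative shapes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query
```

Every query attends to all keys at once, which is exactly the "look at all words simultaneously" property from the first bullet.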

2. Flashcards

Test your knowledge of the Transformer architecture.

Time Complexity of Self-Attention?

O(N² · d)

Why do we scale by √dk?

To prevent dot products from getting too large, which would push Softmax into regions with small gradients.
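A quick numerical check of this flashcard's claim, assuming query/key entries drawn from a standard normal distribution: the variance of an unscaled dot product grows with d_k, while dividing by √d_k brings it back to roughly 1, keeping softmax away from its saturated, small-gradient regions.

```python
import numpy as np

rng = np.random.default_rng(1)
for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=-1)
    # Unscaled variance ≈ d_k; after dividing by sqrt(d_k), variance ≈ 1.
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
```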

Difference between BERT and GPT?

BERT is Encoder-only (Bidirectional context). GPT is Decoder-only (Unidirectional/Causal context).
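The causal (unidirectional) context in GPT-style decoders is enforced with a lower-triangular mask applied to the attention scores before softmax. A small sketch with an assumed toy sequence length of 4 and uniform scores:

```python
import numpy as np

n = 4  # toy sequence length (assumed for illustration)
scores = np.zeros((n, n))  # pretend attention scores, all equal

# GPT-style causal mask: token i may attend only to tokens j <= i.
causal = np.tril(np.ones((n, n), dtype=bool))
masked = np.where(causal, scores, -np.inf)  # block future positions

# Softmax over each row: masked (future) positions get exactly zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)
```

BERT-style encoders simply skip this mask, so every token attends to both its left and right context.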

What replaces RNNs/LSTMs?

Self-Attention (Transformers)

3. Cheat Sheet

| Concept   | Description                  | Formula / Note             |
| --------- | ---------------------------- | -------------------------- |
| Attention | Weighted sum of values       | softmax(QKᵀ / √d_k)·V      |
| Encoder   | Bi-directional processing    | Used in BERT               |
| Decoder   | Uni-directional (causal)     | Used in GPT                |
| Residual  | Skip connection              | x + Sublayer(x)            |
| LayerNorm | Normalizes features          | Stabilizes training        |
| FFN       | Feed-forward network         | Two linear layers with ReLU |
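The last three rows of the cheat sheet combine into one sublayer pattern. A minimal sketch, assuming post-norm ordering (LayerNorm applied after the residual add, as in the original Transformer) and toy dimensions chosen here for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: two linear layers with a ReLU between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def sublayer(x, W1, b1, W2, b2):
    # Residual + normalization per the cheat sheet: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + ffn(x, W1, b1, W2, b2))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes (assumed, not from the notes)
x = rng.normal(size=(3, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = sublayer(x, W1, b1, W2, b2)
print(y.shape)  # (3, 8): same shape in and out, so blocks can stack
```

The residual path lets gradients flow around the FFN, which is a large part of why deep Transformer stacks train stably.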

4. Next Steps

Now that you understand the architecture, it’s time to build one.