Module Review: Transformers
1. Key Takeaways
- Self-Attention: Allows the model to look at all words simultaneously, solving the long-term dependency problem of RNNs.
- Query, Key, Value: The core retrieval mechanism. Attention(Q, K, V) = softmax(QKᵀ / √dk)V.
- Multi-Head Attention: Runs multiple attention mechanisms in parallel to capture different relationships (grammar, semantics, coreference).
- Positional Encodings: Since Self-Attention is permutation invariant, we inject position information via sine/cosine waves.
- Pre-Training:
  - BERT (Encoder): Masked LM, good for understanding.
  - GPT (Decoder): Causal LM, good for generation.
  - T5 (Encoder-Decoder): Span corruption, good for translation/summarization.
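The attention formula and the sinusoidal positional encodings above can be sketched in a few lines of NumPy. This is a minimal illustration, not library code; the function names are my own:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of values

def positional_encoding(seq_len, d_model):
    """Sine/cosine positional encodings injected into the input embeddings."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dims: sine
    pe[:, 1::2] = np.cos(angles)                        # odd dims: cosine
    return pe
```

Multi-head attention is just this `attention` run h times on learned projections of Q, K, V, with the outputs concatenated and projected again.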
2. Flashcards
Test your knowledge of the Transformer architecture.
Time Complexity of Self-Attention?
O(N² · d)
Why do we scale by √dk?
To prevent dot products from getting too large, which would push Softmax into regions with small gradients.
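The gradient argument in this answer is easy to verify numerically (a quick sketch of my own, not from the course materials): scaling the same logits up, as large dot products do, pushes softmax toward a one-hot distribution, where gradients to the non-argmax entries vanish.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])

p_small = softmax(logits)       # probabilities stay spread out
p_large = softmax(logits * 20)  # large magnitudes: nearly one-hot
```

Dividing by √dk keeps the dot products in the "spread out" regime regardless of the key dimension.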
Difference between BERT and GPT?
BERT is Encoder-only (Bidirectional context). GPT is Decoder-only (Unidirectional/Causal context).
What replaces RNNs/LSTMs?
Self-Attention (Transformers)
3. Cheat Sheet
| Concept | Description | Formula / Note |
|---|---|---|
| Attention | Weighted sum of values | softmax(QKᵀ / √dk)V |
| Encoder | Bi-directional processing | Used in BERT |
| Decoder | Uni-directional (Causal) | Used in GPT |
| Residual | Skip connection | x + Sublayer(x) |
| LayerNorm | Normalizes features | Stabilizes training |
| FFN | Feed-Forward Network | Two linear layers with ReLU |
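The last three rows of the cheat sheet compose into one Transformer sub-layer: LayerNorm(x + Sublayer(x)), the post-norm arrangement from the original architecture. A minimal NumPy sketch with the FFN as the sub-layer (names and shapes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: two linear layers with ReLU between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    """Residual connection, then LayerNorm: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + ffn(x, W1, b1, W2, b2))
```

A full encoder layer is the same pattern twice: once with multi-head self-attention as the sub-layer, once with this FFN.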
4. Next Steps
Now that you understand the architecture, it's time to build one.