This module covered the Transformer architecture, a cornerstone of modern Deep Learning that addresses the limitations of recurrent sequence models. Through interactive visualizations, we explored the inner workings of Self-Attention, Multi-Head Attention, and large-scale Pre-Training objectives.

1. Key Takeaways

  • Self-Attention: Allows the model to look at all words simultaneously, solving the long-term dependency problem of RNNs.
  • Query, Key, Value: The core retrieval mechanism. Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V.
  • Multi-Head Attention: Runs multiple attention mechanisms in parallel to capture different relationships (grammar, semantics, coreference).
  • Positional Encodings: Since Self-Attention is permutation invariant, we inject position information via sine/cosine waves.
  • Pre-Training objectives:
      ◦ BERT (Encoder): Masked LM, good for understanding.
      ◦ GPT (Decoder): Causal LM, good for generation.
      ◦ T5 (Encoder-Decoder): Span corruption, good for translation/summarization.
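The attention formula in the takeaways above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not a framework implementation; the toy shapes (3 tokens, dₖ = 4) are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per input token
```

Each output row is a convex combination of the value vectors, with weights given by how strongly that token's query matches every key.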

2. Flashcards

Test your knowledge of the Transformer architecture.

Time Complexity of Self-Attention?

O(N² · d), where N is the sequence length and d the model dimension.

Why do we scale by √dₖ?

To prevent dot products from getting too large, which would push Softmax into regions with small gradients.
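The saturation effect described in this answer is easy to demonstrate numerically. A hedged sketch: for random vectors with unit-variance entries, dot products grow in magnitude with dₖ, and sharper logits always concentrate the softmax on one entry (the values below depend on the arbitrary seed, but the ordering does not).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512
rng = np.random.default_rng(1)
q = rng.standard_normal(d_k)
keys = rng.standard_normal((8, d_k))

raw = keys @ q                 # entries have std ~ sqrt(d_k) ~ 22.6
scaled = raw / np.sqrt(d_k)    # entries brought back to std ~ 1

p_raw = softmax(raw)           # typically saturated: one weight dominates
p_scaled = softmax(scaled)     # stays soft: gradient can flow to all keys
print(p_raw.max(), p_scaled.max())
```

Because dividing by √dₖ is just a temperature increase, the unscaled distribution is always at least as peaked as the scaled one; near-one-hot softmax outputs have near-zero gradients, which is exactly what the scaling avoids.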

Difference between BERT and GPT?

BERT is Encoder-only (Bidirectional context). GPT is Decoder-only (Unidirectional/Causal context).
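The causal/bidirectional distinction comes down to a mask on the attention scores. A minimal sketch (uniform scores for clarity, so the surviving weights are easy to read): GPT-style decoders add this mask before the softmax, while BERT-style encoders omit it.

```python
import numpy as np

n = 5  # sequence length
# Causal mask: position i may attend only to positions <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal

scores = np.zeros((n, n))
scores[mask] = -np.inf         # blocked (future) positions get -inf before softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Row i spreads attention uniformly over tokens 0..i; future tokens get weight 0.
print(weights.round(2))
```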

What replaces RNNs/LSTMs?

Self-Attention (Transformers)

3. Cheat Sheet

| Concept   | Description               | Formula / Note              |
|-----------|---------------------------|-----------------------------|
| Attention | Weighted sum of values    | softmax(QKᵀ / √dₖ)V         |
| Encoder   | Bi-directional processing | Used in BERT                |
| Decoder   | Uni-directional (causal)  | Used in GPT                 |
| Residual  | Skip connection           | x + Sublayer(x)             |
| LayerNorm | Normalizes features       | Stabilizes training         |
| FFN       | Feed-forward network      | Two linear layers with ReLU |
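The Residual, LayerNorm, and FFN rows of the cheat sheet combine into one sublayer. The sketch below assumes the original post-norm wiring, LayerNorm(x + Sublayer(x)), and omits LayerNorm's learned gain/bias for brevity; shapes and weights are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: two linear layers with ReLU in between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def ffn_sublayer(x, W1, b1, W2, b2):
    # Residual (skip connection) followed by LayerNorm: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + ffn(x, W1, b1, W2, b2))

d_model, d_ff, n = 8, 32, 3
rng = np.random.default_rng(2)
x = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
y = ffn_sublayer(x, W1, b1, W2, b2)
print(y.shape)  # (3, 8): shape preserved, so sublayers can be stacked
```

The residual path lets gradients bypass the sublayer, and the normalization keeps activations in a stable range, which is why deep stacks of these blocks train reliably.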

4. Quick Revision

  • RNN bottleneck: Sequential nature prevents parallelization.
  • Self-Attention formula: softmax(QKᵀ / √dₖ)V.
  • Query, Key, Value: In self-attention, all three are projections of the same input, computed with separate learned weight matrices.
  • Positional Encodings: Since attention has no notion of sequence order, we inject sine and cosine waves.
  • Multi-Head Attention: Multiple scaled dot-product attentions in parallel, improving representational capacity.
  • Layer Normalization & Residuals: Essential for stabilizing the training of deep architectures.
  • Pre-Training Objectives: BERT (Masked LM), GPT (Causal LM), and T5 (Seq2Seq Span Corruption).
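The sinusoidal encodings from the revision list above can be generated directly from their definition, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). A short sketch (sequence length and d_model chosen arbitrarily):

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """Sinusoidal PE: even columns use sin, odd columns use cos."""
    pos = np.arange(n_pos)[:, None]                   # (n_pos, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)     # one frequency per pair
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16): one encoding vector per position, added to embeddings
```

Each dimension pair oscillates at a different frequency, so every position gets a unique fingerprint that the permutation-invariant attention layers can use to recover word order.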

5. Next Steps

Now that you understand the architecture, it’s time to build one.