Module Review: Transformers

1. Key Takeaways

  • Self-Attention: Allows the model to look at all words simultaneously, solving the long-term dependency problem of RNNs.
  • Query, Key, Value: The core retrieval mechanism. Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V.
  • Multi-Head Attention: Runs multiple attention mechanisms in parallel to capture different relationships (grammar, semantics, coreference).
  • Positional Encodings: Since Self-Attention is permutation invariant, we inject position information via sine/cosine waves.
  • Pre-Training objectives:
      • BERT (Encoder): Masked LM, good for understanding.
      • GPT (Decoder): Causal LM, good for generation.
      • T5 (Encoder-Decoder): Span corruption, good for translation/summarization.
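The attention formula above can be sketched directly in numpy. This is a minimal illustration of scaled dot-product attention, not a production implementation; the shapes (3 tokens, d_k = 4) are toy values chosen here, not from the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 3 tokens with d_k = 4 (illustrative shapes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query
```

Every query attends to all keys at once, which is exactly the "look at all words simultaneously" property from the first bullet.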

2. Flashcards

Test your knowledge of the Transformer architecture.

Time Complexity of Self-Attention?

O(N² · d)

Why do we scale by √dk?

To prevent dot products from getting too large, which would push Softmax into regions with small gradients.
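A quick numerical check of this flashcard's claim, assuming query/key entries drawn from a standard normal distribution: the variance of an unscaled dot product grows with d_k, while dividing by √d_k brings it back to roughly 1, keeping softmax away from its saturated, small-gradient regions.

```python
import numpy as np

rng = np.random.default_rng(1)
for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=-1)
    # Unscaled variance ≈ d_k; after dividing by sqrt(d_k), variance ≈ 1.
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
```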

Difference between BERT and GPT?

BERT is Encoder-only (Bidirectional context). GPT is Decoder-only (Unidirectional/Causal context).
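The causal (unidirectional) context in GPT-style decoders is enforced with a lower-triangular mask applied to the attention scores before softmax. A small sketch with an assumed toy sequence length of 4 and uniform scores:

```python
import numpy as np

n = 4  # toy sequence length (assumed for illustration)
scores = np.zeros((n, n))  # pretend attention scores, all equal

# GPT-style causal mask: token i may attend only to tokens j <= i.
causal = np.tril(np.ones((n, n), dtype=bool))
masked = np.where(causal, scores, -np.inf)  # block future positions

# Softmax over each row: masked (future) positions get exactly zero weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)
```

BERT-style encoders simply skip this mask, so every token attends to both its left and right context.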

What replaces RNNs/LSTMs?

Self-Attention (Transformers)

3. Cheat Sheet

| Concept   | Description                  | Formula / Note             |
| --------- | ---------------------------- | -------------------------- |
| Attention | Weighted sum of values       | softmax(QKᵀ / √d_k)·V      |
| Encoder   | Bi-directional processing    | Used in BERT               |
| Decoder   | Uni-directional (causal)     | Used in GPT                |
| Residual  | Skip connection              | x + Sublayer(x)            |
| LayerNorm | Normalizes features          | Stabilizes training        |
| FFN       | Feed-forward network         | Two linear layers with ReLU |
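The last three rows of the cheat sheet combine into one sublayer pattern. A minimal sketch, assuming post-norm ordering (LayerNorm applied after the residual add, as in the original Transformer) and toy dimensions chosen here for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: two linear layers with a ReLU between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def sublayer(x, W1, b1, W2, b2):
    # Residual + normalization per the cheat sheet: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + ffn(x, W1, b1, W2, b2))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes (assumed, not from the notes)
x = rng.normal(size=(3, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = sublayer(x, W1, b1, W2, b2)
print(y.shape)  # (3, 8): same shape in and out, so blocks can stack
```

The residual path lets gradients flow around the FFN, which is a large part of why deep Transformer stacks train stably.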

4. Next Steps

Now that you understand the architecture, it’s time to build one.