# Pre-Training
> [!IMPORTANT]
> The “Magic” of modern AI comes from Self-Supervised Learning. We don’t need labeled data; we just need text.
## 1. The Paradigm Shift
Before 2018, we trained models from scratch for every task.
- Old Way: Initialize random weights → Train on specific task (e.g., Sentiment Analysis).
- New Way (Transfer Learning): Pre-train on the entire internet → Fine-tune on specific task.
This process allows the model to learn grammar, facts, and reasoning during pre-training, which it then adapts to downstream tasks.
## 2. Encoder-Only: BERT (Auto-Encoding)
BERT (Bidirectional Encoder Representations from Transformers) is designed to understand text. It sees the entire sentence at once.
### Objective: Masked Language Modeling (MLM)
We randomly hide 15% of the tokens and ask the model to guess them.
- Input: `The [MASK] sat on the mat.`
- Target: `cat`
This forces the model to use bidirectional context (left and right) to infer the missing word.
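As a sketch of how the MLM loss is restricted to the masked positions: in PyTorch, labels set to `-100` are skipped by `cross_entropy` (its default `ignore_index`), so unmasked tokens contribute nothing to the loss. The tensors below are made up for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size = 10
# Fake model logits for a 6-token sentence: (seq_len, vocab_size)
logits = torch.randn(6, vocab_size)

# Labels are -100 everywhere except the one masked position,
# whose true token id is (say) 4.
labels = torch.full((6,), -100)
labels[1] = 4  # only position 1 was masked

# cross_entropy skips positions labeled -100 (its default ignore_index),
# so the loss is computed on the masked token only.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
```

Because only position 1 survives the `ignore_index` filter, this loss equals the cross-entropy of that single position.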
## 3. Decoder-Only: GPT (Auto-Regressive)
GPT (Generative Pre-trained Transformer) is designed to generate text. A causal attention mask means it can only attend to tokens on its left; it cannot look ahead.
### Objective: Causal Language Modeling (CLM)
Predict the next token based on all previous tokens.
- Input: `The cat sat on the`
- Target: `mat`
This is harder than MLM because the model has less context (only the past), but it enables text generation.
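In practice, "predict the next token" becomes a shift-by-one between logits and labels: the prediction at position *t* is scored against the token at position *t + 1*. A minimal sketch with made-up token ids, assuming PyTorch:

```python
import torch
import torch.nn.functional as F

vocab_size = 10
# Made-up token ids for a 6-token sentence: (batch=1, seq_len=6)
tokens = torch.tensor([[2, 5, 7, 3, 2, 8]])
# Fake model output: (batch, seq_len, vocab_size)
logits = torch.randn(1, 6, vocab_size)

# Causal LM loss: position t predicts token t+1, so shift by one.
shift_logits = logits[:, :-1, :]  # predictions for positions 0..4
shift_labels = tokens[:, 1:]      # targets are the *next* tokens
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
```

Note that the last position produces a prediction with no target, and the first token is never predicted, which is why both tensors shrink to length 5.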
## 4. Encoder-Decoder: T5 (Seq2Seq)
T5 (Text-to-Text Transfer Transformer) treats every task as a text-to-text problem.
### Objective: Span Corruption
Similar to BERT, but it masks spans of text and asks the decoder to generate the missing spans.
- Input: `The <X> sat on the <Y>.`
- Target: `<X> cat <Y> mat <Z>`

Each corrupted span is replaced in the input by a unique sentinel token (`<X>`, `<Y>`, …), and the decoder reproduces the sentinels followed by the missing text, ending with a final sentinel (`<Z>`).
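A toy sketch of this corruption step, with a hypothetical `corrupt_spans` helper and hardcoded span positions (T5's released vocabulary names its sentinels `<extra_id_0>`, `<extra_id_1>`, …, playing the role of `<X>`, `<Y>` above):

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel; build the target."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    corrupted, target = [], []
    prev_end = 0
    for i, (start, end) in enumerate(spans):
        corrupted.extend(tokens[prev_end:start])  # keep uncorrupted text
        corrupted.append(sentinels[i])            # sentinel marks the hole
        target.append(sentinels[i])               # target: sentinel + hidden span
        target.extend(tokens[start:end])
        prev_end = end
    corrupted.extend(tokens[prev_end:])
    target.append(sentinels[len(spans)])          # final sentinel ends the target
    return corrupted, target

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
inp, tgt = corrupt_spans(tokens, [(1, 2), (5, 6)])
# inp: ['The', '<extra_id_0>', 'sat', 'on', 'the', '<extra_id_1>', '.']
# tgt: ['<extra_id_0>', 'cat', '<extra_id_1>', 'mat', '<extra_id_2>']
```

Real T5 pre-training samples the spans randomly (corrupting about 15% of tokens with a mean span length of 3); the fixed spans here are just for readability.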
## 5. Scaling Laws
Why do models keep getting bigger?
Researchers found that model performance (loss) scales as a power law with:
- N: Number of parameters.
- D: Dataset size.
- C: Compute used.
> [!TIP]
> The “Chinchilla” scaling laws suggest that for every doubling of model size, you should also double the training data to be compute-optimal.
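As a sketch, the single-variable power laws from Kaplan et al. (2020) take the form (coefficients approximate, each holding when the other factors are not the bottleneck):

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
$$

with fitted exponents around $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.095$. The Chinchilla analysis (Hoffmann et al., 2022) found that compute-optimal training scales $N$ and $D$ roughly equally with compute, working out to on the order of 20 training tokens per parameter.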
## 6. Implementation: Masking Logic
Here is how you might implement the MLM masking logic in PyTorch.
```python
import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """
    Prepare masked token inputs/labels for masked language modeling:
    80% [MASK], 10% random, 10% original.
    """
    labels = inputs.clone()
    # Create a mask probability array
    probability_matrix = torch.full(labels.shape, mlm_probability)
    # Find special tokens (CLS, SEP, PAD) -> do NOT mask them
    special_tokens_mask = [
        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True)
        for val in labels.tolist()
    ]
    probability_matrix.masked_fill_(
        torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0
    )
    # Select tokens to mask
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # we only compute loss on masked tokens
    # 80% of the time, replace with [MASK]
    indices_replaced = (
        torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    )
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
    # 10% of the time, replace with a random word
    # (0.5 here = half of the remaining 20% of masked positions)
    indices_random = (
        torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
        & masked_indices
        & ~indices_replaced
    )
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]
    # The remaining 10% stay as the original word (but we still predict them)
    return inputs, labels
```
## 7. Summary
| Model | Architecture | Objective | Best For |
|---|---|---|---|
| BERT | Encoder-Only | Masked LM | Understanding, Classification, NER |
| GPT | Decoder-Only | Causal LM | Generation, Completion |
| T5 | Enc-Dec | Span Corruption | Translation, Summarization |