Pre-Training

[!IMPORTANT] The “Magic” of modern AI comes from Self-Supervised Learning. We don’t need labeled data; we just need text.

1. The Paradigm Shift

Before 2018, we trained models from scratch for every task.

  • Old Way: Initialize random weights → Train on specific task (e.g., Sentiment Analysis).
  • New Way (Transfer Learning): Pre-train on the entire internet → Fine-tune on specific task.

This process allows the model to learn grammar, facts, and reasoning during pre-training, which it then adapts to downstream tasks.

2. Encoder-Only: BERT (Auto-Encoding)

BERT (Bidirectional Encoder Representations from Transformers) is designed to understand text. It sees the entire sentence at once.

Objective: Masked Language Modeling (MLM)

We randomly hide 15% of the tokens and ask the model to guess them.

  • Input: The [MASK] sat on the mat.
  • Target: cat

This forces the model to use bidirectional context (left and right) to infer the missing word.
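As a concrete sketch (plain Python, with a toy vocabulary made up for illustration), an MLM training pair consists of the masked input plus labels that are ignored everywhere except at the masked position, using -100 as PyTorch's ignore_index:

```python
def make_mlm_example(tokens, mask_pos, vocab):
    """Build a (masked input, labels) pair for one masked position.

    Labels are -100 (PyTorch's ignore_index) everywhere except the mask,
    so loss is computed only on the hidden token.
    """
    labels = [-100] * len(tokens)
    labels[mask_pos] = vocab[tokens[mask_pos]]  # the true token id to predict
    masked = list(tokens)
    masked[mask_pos] = "[MASK]"
    return masked, labels

# Toy vocabulary, illustrative only
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5, ".": 6}
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]

masked, labels = make_mlm_example(tokens, 1, vocab)
# masked -> ["The", "[MASK]", "sat", "on", "the", "mat", "."]
# labels -> [-100, 1, -100, -100, -100, -100, -100]
```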

Interactive: Masked Word Predictor

Try to be a BERT model. Guess the masked word based on context.

Sentence: "The [MASK] is chasing the mouse."


3. Decoder-Only: GPT (Auto-Regressive)

GPT (Generative Pre-trained Transformer) is designed to generate text. A causal attention mask prevents it from looking ahead at future tokens.

Objective: Causal Language Modeling (CLM)

Predict the next token based on all previous tokens.

  • Input: The cat sat on the
  • Target: mat

This is harder than MLM because the model has less context (only the past), but it enables text generation.
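A minimal sketch of how causal LM training pairs are formed: the target sequence is just the input shifted one position to the left, so each position predicts the next token (the token ids below are made up for illustration):

```python
import torch

# Token ids for "The cat sat on the mat" (made-up ids)
input_ids = torch.tensor([10, 42, 7, 3, 10, 55])

# At position t the model predicts token t+1,
# so inputs and targets are offset by one.
inputs = input_ids[:-1]   # "The cat sat on the"
targets = input_ids[1:]   # "cat sat on the mat"
```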

4. Encoder-Decoder: T5 (Seq2Seq)

T5 (Text-to-Text Transfer Transformer) treats every task as a text-to-text problem.

Objective: Span Corruption

Similar to BERT's MLM, but contiguous spans of text are each replaced by a single sentinel token, and the decoder generates the missing spans in order.

  • Input: The <X> sat on the <Y>.
  • Target: <X> cat <Y> mat <Z>
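The example above can be reproduced with a small sketch (plain Python; the sentinel names <X>, <Y>, <Z> follow T5's convention of unique sentinel tokens, and the span positions are chosen by hand here rather than sampled randomly as in real pre-training):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    The encoder input keeps surrounding text with sentinels in place of spans;
    the decoder target lists each sentinel followed by the removed tokens,
    ending with a final sentinel.
    """
    sentinels = ["<X>", "<Y>", "<Z>", "<W>"]
    inp, tgt = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        inp += tokens[prev:start] + [sentinels[i]]
        tgt += [sentinels[i]] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    tgt += [sentinels[len(spans)]]  # closing sentinel
    return inp, tgt

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
inp, tgt = span_corrupt(tokens, [(1, 2), (5, 6)])
# inp -> ["The", "<X>", "sat", "on", "the", "<Y>", "."]
# tgt -> ["<X>", "cat", "<Y>", "mat", "<Z>"]
```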

5. Scaling Laws

Why do models keep getting bigger?

Researchers found that model performance (loss) scales as a power law with:

  1. N: Number of parameters.
  2. D: Dataset size.
  3. C: Compute used.

[!TIP] The “Chinchilla” scaling laws suggest that for every doubling of model size, you should also double the training data to be compute-optimal.
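A back-of-the-envelope sketch of that rule of thumb (the ~20-tokens-per-parameter ratio is an approximation drawn from the Chinchilla results, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training-token budget for a model of n_params parameters."""
    return n_params * tokens_per_param

# A 70B-parameter model would want roughly 1.4 trillion training tokens.
print(chinchilla_optimal_tokens(70e9))  # 1.4e12

# Doubling the parameter count doubles the optimal token budget.
assert chinchilla_optimal_tokens(2 * 70e9) == 2 * chinchilla_optimal_tokens(70e9)
```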

6. Implementation: Masking Logic

Here is how you might implement the MLM masking logic in PyTorch.

import torch

def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
  """
  Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
  """
  labels = inputs.clone()

  # Create a mask array
  probability_matrix = torch.full(labels.shape, mlm_probability)

  # Find special tokens (CLS, SEP, PAD) -> Do NOT mask them
  special_tokens_mask = [
    tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True)
    for val in labels.tolist()
  ]
  probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)

  # Select tokens to mask
  masked_indices = torch.bernoulli(probability_matrix).bool()
  labels[~masked_indices] = -100  # We only compute loss on masked tokens

  # 80% of the time, replace with [MASK]
  indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
  inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

  # 10% of the time (half of the remaining 20%), replace with a random word
  indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
  random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
  inputs[indices_random] = random_words[indices_random]

  # The remaining 10% stay as the original word (but we still predict them)

  return inputs, labels

7. Summary

| Model | Architecture | Objective | Best For |
|-------|--------------|-----------|----------|
| BERT | Encoder-Only | Masked LM | Understanding, Classification, NER |
| GPT | Decoder-Only | Causal LM | Generation, Completion |
| T5 | Encoder-Decoder | Span Corruption | Translation, Summarization |