Seq2Seq Models and Attention

Imagine you are a professional United Nations interpreter. Someone is speaking a long, complex sentence in French, and your job is to translate it into English. If you wait until they finish the entire paragraph before you start translating, you’re going to forget the beginning. Instead, you listen, and as you translate each word, you focus your attention on the specific part of the speaker’s sentence that is relevant right now.

This is exactly how Sequence-to-Sequence (Seq2Seq) models with Attention work.

Many tasks involve mapping a sequence to another sequence:

  • Translation: English sentence → French sentence
  • Summarization: Long article → Short summary
  • Chatbot: User question → Bot answer

These are handled by Seq2Seq models, which rely on the Encoder-Decoder architecture.

1. Encoder-Decoder Architecture

The standard Seq2Seq model historically consisted of two RNNs (or LSTMs/GRUs):

  1. Encoder: Reads the input sequence one step at a time and produces a final hidden state (the context vector). This vector is supposed to capture the “meaning” of the entire input.
  2. Decoder: Takes the context vector as its initial state and generates the output sequence one token at a time.
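The Encoder half of this pipeline can be sketched in a few lines of PyTorch (a minimal, illustrative class with made-up sizes, not from any particular library):

```python
import torch
import torch.nn as nn

# Minimal Encoder sketch: read the input token IDs and return the
# final hidden state, which becomes the Decoder's initial state
# (the "context vector").
class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input_ids):
        # input_ids: (batch, seq_len)
        embedded = self.embedding(input_ids)   # (batch, seq_len, hidden)
        outputs, hidden = self.gru(embedded)   # hidden: (1, batch, hidden)
        return outputs, hidden                 # hidden is the context vector

encoder = EncoderRNN(vocab_size=1000, hidden_size=64)
ids = torch.randint(0, 1000, (1, 7))           # a 7-token input sentence
outputs, context = encoder(ids)
print(context.shape)  # torch.Size([1, 1, 64])
```

Note that `context` has a fixed size (here 64) no matter how long the input is, which is exactly the bottleneck discussed next.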

🚨 The Information Bottleneck

The core problem with early Seq2Seq models is the Information Bottleneck. The Encoder is forced to compress an arbitrarily long input sequence into a single, fixed-length context vector.

The Lemonade Stand Analogy: Imagine trying to pour a 10-gallon bucket of water (a long sentence) into a single 12-ounce cup (the context vector) before handing it to the Decoder. Most of the water spills. For sentences longer than ~20 words, early Seq2Seq models catastrophically failed because they simply "forgot" the beginning of the sentence by the time they reached the end.

2. The Attention Mechanism

War Story: In 2014, researchers at Jacobs University Bremen and the Université de Montréal were struggling with Machine Translation. Their models were hitting a hard accuracy ceiling on long sentences. The breakthrough? Stop trying to compress everything into one vector. Instead, give the Decoder an “open-book exam”.

Attention solves the bottleneck problem. Instead of relying on just the final hidden state, Attention allows the Decoder to “look back” at the entire sequence of Encoder hidden states at every step of generation.

At each decoder step t, the model:

  1. Calculates an attention score (similarity) between the current decoder state and every encoder state.
  2. Normalizes these scores using Softmax to get attention weights.
  3. Computes a context vector as a weighted sum of encoder states.
  4. Uses this context vector to predict the next word.

Mathematically (Dot-Product Attention):

  • $\text{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s$
  • $\alpha_{ts} = \text{softmax}_s\big(\text{score}(h_t, \bar{h}_s)\big)$
  • $c_t = \sum_s \alpha_{ts} \, \bar{h}_s$

Where $h_t$ is the decoder state at step $t$ and $\bar{h}_s$ are the encoder states.
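These three equations translate almost line-for-line into code. A minimal sketch of one decoder step, assuming random states and illustrative sizes:

```python
import torch
import torch.nn.functional as F

# One decoder step of dot-product attention, matching the equations
# above. h_t is the decoder state, h_bar the stack of encoder states.
hidden_size, src_len = 64, 7
h_t = torch.randn(hidden_size)             # decoder state at step t
h_bar = torch.randn(src_len, hidden_size)  # encoder states, one row per source word

scores = h_bar @ h_t               # score(h_t, h_bar_s) for every s, shape (src_len,)
alpha = F.softmax(scores, dim=0)   # attention weights over the source, sum to 1
c_t = alpha @ h_bar                # context vector: weighted sum, shape (hidden_size,)
```

The context vector `c_t` is recomputed at every decoder step, so each output word gets its own blend of the encoder states.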

Anatomy of Attention

To understand the math intuitively, let’s break down the roles each vector plays in the Attention operation:

  • Query ($h_t$): What the Decoder is currently looking for. “I just generated the word ‘black’; I need to find the noun it describes.”
  • Key ($\bar{h}_s$): What each Encoder state is holding. “I am the word ‘chat’ (cat); I am a noun.”
  • Score: The dot product measures how similar the Query is to a Key. High similarity means high relevance.
  • Value: The actual content of the Encoder state that gets passed to the Decoder if the score is high. (In basic dot-product attention, Key and Value are the same vector.)
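This retrieval view can be sketched with hypothetical 4-dimensional vectors: the query matches the key it is most similar to, and the returned context is dominated by that key’s value.

```python
import torch
import torch.nn.functional as F

# Toy Query/Key/Value retrieval (made-up vectors for illustration).
query = torch.tensor([1.0, 0.0, 0.0, 0.0])   # what the decoder is looking for
keys = torch.stack([
    torch.tensor([0.9, 0.1, 0.0, 0.0]),      # similar to the query
    torch.tensor([0.0, 0.0, 1.0, 0.0]),      # unrelated to the query
])
values = keys   # basic dot-product attention: Key and Value are the same vector

weights = F.softmax(keys @ query, dim=0)  # similarity scores -> weights
context = weights @ values                # mostly the first key's value
print(weights)                            # first weight is the larger one
```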

Worked Example: Attention Alignment

Consider how attention shifts across the input sentence as the translation is generated:

Input (French): "Le chat noir est sur le tapis"

Translation (English): "The black cat is on the mat"

When the Decoder generates "black", the attention weights peak on "noir"; when it generates "cat", they peak on "chat". Notice that the alignment is learned, not hard-coded: French places the adjective after the noun, and attention handles the reordering automatically.

3. PyTorch Implementation (Seq2Seq with Attention)

Here’s a simplified conceptual implementation of an Attention Decoder in PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=10):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)

        # Attention weights layer
        # Takes current input and previous hidden state
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)

        # Combine layer for attention-applied context and input
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)

        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        # input: (1, 1) - single word index
        # hidden: (1, 1, hidden_size)
        # encoder_outputs: (1, max_length, hidden_size)

        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Calculate attention weights
        # attn_weights: (1, max_length)
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

        # Apply attention to encoder outputs
        # attn_applied: (1, 1, hidden_size)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs)

        # Combine embedded input and context vector
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        log_softmax = F.log_softmax(self.out(output[0]), dim=1)
        return log_softmax, hidden, attn_weights

# Note: This is a simplified educational example.
# Real implementations often use 'PackedSequence' and more complex batching.
```

Anatomy of the PyTorch Code

Let’s trace the tensor shapes to understand exactly what is happening during a single step of the Decoder:

  1. Input & Embedding: We take the single input token (e.g., shape [1, 1]) and embed it into a dense vector: embedded shape [1, 1, hidden_size].
  2. Attention Weights (self.attn): We concatenate the embedded input and the previous hidden state, and pass it through a Linear layer to output a score for each word in the input sequence up to max_length.
    • Note: The softmax ensures these weights sum to 1. Shape: [1, max_length].
  3. Applying Attention (torch.bmm): This is the core of the mechanism. Batch Matrix Multiplication (bmm) multiplies the attention weights [1, 1, max_length] by the actual encoder outputs [1, max_length, hidden_size].
    • The result is attn_applied (the Context Vector) of shape [1, 1, hidden_size]. It is a weighted sum of the encoder outputs!
  4. Combining: We concatenate the original embedded input with our new, highly-focused Context Vector, pass it through a linear layer (attn_combine), and feed that into the GRU.
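Step 3 can be verified in isolation with dummy tensors (illustrative sizes matching the shape comments above):

```python
import torch

# Verifying the bmm step on its own: multiplying attention weights
# by the encoder outputs yields the context vector.
hidden_size, max_length = 64, 10
attn_weights = torch.softmax(torch.randn(1, max_length), dim=1)  # (1, max_length)
encoder_outputs = torch.randn(1, max_length, hidden_size)        # (1, max_length, hidden)

# (1, 1, max_length) x (1, max_length, hidden) -> (1, 1, hidden)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs)
print(attn_applied.shape)  # torch.Size([1, 1, 64])
```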

[!TIP] Edge Case: Padding Masks. In a real-world system, sequences are batched, meaning shorter sentences are padded with <PAD> tokens. We must apply an Attention Mask (setting padding scores to $-\infty$ before the softmax) to ensure the model never “attends” to padding tokens.
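A minimal sketch of such a mask, using made-up scores where the last two positions are padding:

```python
import torch
import torch.nn.functional as F

# Padding positions get a score of -inf, so softmax assigns them
# exactly zero attention weight.
scores = torch.tensor([[2.0, 1.0, 0.5, 0.3, 0.1]])
pad_mask = torch.tensor([[False, False, False, True, True]])  # last two are <PAD>

masked = scores.masked_fill(pad_mask, float("-inf"))
weights = F.softmax(masked, dim=1)
print(weights)  # padded positions receive 0 attention; the rest still sum to 1
```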

4. Summary

  • Seq2Seq models map input sequences to output sequences.
  • Encoder-Decoder is the standard architecture.
  • Attention allows the decoder to focus on specific parts of the input sequence, solving the bottleneck problem.
  • Transformers (which we will cover in a later module) take the idea of Attention to the extreme, discarding RNNs entirely.