Seq2Seq Models and Attention

Many tasks involve mapping a sequence to another sequence:

  • Translation: English sentence → French sentence
  • Summarization: Long article → Short summary
  • Chatbot: User question → Bot answer

These are handled by Sequence-to-Sequence (Seq2Seq) models, typically using an Encoder-Decoder architecture.

1. Encoder-Decoder Architecture

The standard Seq2Seq model consists of two RNNs (or LSTMs/GRUs):

  1. Encoder: Reads the input sequence one step at a time and produces a final hidden state (context vector). This vector is supposed to capture the “meaning” of the entire input.
  2. Decoder: Takes the context vector as its initial state and generates the output sequence one token at a time.

[!WARNING] The Information Bottleneck: The Encoder must compress the entire input sequence into a single fixed-length vector. For long sentences, this is very difficult, and information is often lost.
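The bottleneck is easy to see in code. Below is a minimal sketch (toy sizes, with random tensors standing in for embedded tokens, both hypothetical): the encoder's final hidden state is the only thing handed to the decoder.

```python
import torch
import torch.nn as nn

hidden_size = 8  # toy dimension for illustration

encoder = nn.GRU(input_size=hidden_size, hidden_size=hidden_size)
decoder = nn.GRU(input_size=hidden_size, hidden_size=hidden_size)

# 5 input steps, batch of 1; randn stands in for embedded source tokens
src = torch.randn(5, 1, hidden_size)
_, context = encoder(src)        # context: (1, 1, hidden_size) -- the bottleneck

# 3 output steps (teacher forcing); the decoder starts from the context vector
tgt = torch.randn(3, 1, hidden_size)
out, _ = decoder(tgt, context)
print(out.shape)                 # torch.Size([3, 1, 8])
```

However long the source sequence is, `context` stays a single (1, 1, hidden_size) tensor, which is exactly the fixed-length compression the warning above describes.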

2. The Attention Mechanism

Attention solves the bottleneck problem. Instead of relying on just the final hidden state, Attention allows the Decoder to “look back” at the entire sequence of Encoder hidden states at every step of generation.

At each decoder step t, the model:

  1. Calculates an attention score (similarity) between the current decoder state and every encoder state.
  2. Normalizes these scores using Softmax to get attention weights.
  3. Computes a context vector as a weighted sum of encoder states.
  4. Uses this context vector to predict the next word.

Mathematically (Dot-Product Attention):

  • score(h_t, h̄_s) = h_tᵀ · h̄_s
  • α_ts = softmax_s(score(h_t, h̄_s))
  • c_t = Σ_s α_ts · h̄_s

Where h_t is the decoder hidden state at step t and h̄_s is the encoder hidden state at input position s.
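The equations above can be run on toy numbers (all dimensions here are hypothetical):

```python
import torch
import torch.nn.functional as F

# 4 encoder hidden states h̄_s and one decoder state h_t, each of dimension 3
encoder_states = torch.randn(4, 3)
h_t = torch.randn(3)

scores = encoder_states @ h_t        # dot-product scores, shape (4,)
alpha = F.softmax(scores, dim=0)     # attention weights, non-negative, sum to 1
c_t = alpha @ encoder_states         # context vector: weighted sum, shape (3,)

print(alpha)  # one weight per encoder position
print(c_t)
```

Note that the softmax is taken over the encoder positions s, so the weights for a given decoder step always sum to 1.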

[Interactive: Attention Visualizer — for the translation "Le chat noir est sur le tapis" → "The black cat is on the mat", selecting each English (decoder) word highlights which French (encoder) words the model attends to.]

3. PyTorch Implementation (Seq2Seq with Attention)

Here’s a simplified conceptual implementation of an Attention Decoder in PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
  def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=10):
    super(AttnDecoderRNN, self).__init__()
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.dropout_p = dropout_p
    self.max_length = max_length

    self.embedding = nn.Embedding(self.output_size, self.hidden_size)

    # Attention weights layer
    # Takes current input and previous hidden state
    self.attn = nn.Linear(self.hidden_size * 2, self.max_length)

    # Combine layer for attention applied context and input
    self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)

    self.dropout = nn.Dropout(self.dropout_p)
    self.gru = nn.GRU(self.hidden_size, self.hidden_size)
    self.out = nn.Linear(self.hidden_size, self.output_size)

  def forward(self, input, hidden, encoder_outputs):
    # input: (1, 1) - single word index
    # hidden: (1, 1, hidden_size)
    # encoder_outputs: (1, max_length, hidden_size)

    embedded = self.embedding(input).view(1, 1, -1)
    embedded = self.dropout(embedded)

    # Calculate attention weights
    # attn_weights: (1, max_length)
    attn_weights = F.softmax(
      self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

    # Apply attention to encoder outputs
    # attn_applied: (1, 1, hidden_size)
    attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                             encoder_outputs)

    # Combine embedded input and context vector
    output = torch.cat((embedded[0], attn_applied[0]), 1)
    output = self.attn_combine(output).unsqueeze(0)

    output = F.relu(output)
    output, hidden = self.gru(output, hidden)

    log_softmax = F.log_softmax(self.out(output[0]), dim=1)
    return log_softmax, hidden, attn_weights

# Note: This is a simplified educational example.
# Real implementations often use 'PackedSequence' and more complex batching.

4. Summary

  • Seq2Seq models map input sequences to output sequences.
  • Encoder-Decoder is the standard architecture.
  • Attention allows the decoder to focus on specific parts of the input sequence, solving the bottleneck problem.
  • Transformers (which we will cover in a later module) take the idea of Attention to the extreme, discarding RNNs entirely.