Seq2Seq Models and Attention
Many tasks involve mapping a sequence to another sequence:
- Translation: English sentence → French sentence
- Summarization: Long article → Short summary
- Chatbot: User question → Bot answer
These are handled by Sequence-to-Sequence (Seq2Seq) models, typically using an Encoder-Decoder architecture.
1. Encoder-Decoder Architecture
The standard Seq2Seq model consists of two RNNs (or LSTMs/GRUs):
- Encoder: Reads the input sequence one step at a time and produces a final hidden state (context vector). This vector is supposed to capture the “meaning” of the entire input.
- Decoder: Takes the context vector as its initial state and generates the output sequence one token at a time.
> [!WARNING]
> **The Information Bottleneck:** The Encoder must compress the entire input sequence into a single fixed-length vector. For long sentences this is very difficult, and information is often lost.
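To make the Encoder side concrete, here is a minimal sketch of a GRU encoder in PyTorch. The vocabulary size, hidden size, and `batch_first` layout are illustrative choices, not fixed by the text; the final GRU hidden state plays the role of the context vector described above.

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Reads input token ids step by step; the final hidden state is the context vector."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)   # (batch, seq_len, hidden_size)
        outputs, hidden = self.gru(embedded)   # hidden: (1, batch, hidden_size)
        return outputs, hidden                 # hidden == fixed-length "context vector"

# Hypothetical sizes for illustration: vocab of 100 words, hidden size 16
encoder = EncoderRNN(input_size=100, hidden_size=16)
outputs, context = encoder(torch.randint(0, 100, (1, 7)))  # a 7-token sentence
```

Note that `context` has the same shape regardless of whether the input is 7 tokens or 70, which is exactly the bottleneck the warning above describes.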
2. The Attention Mechanism
Attention solves the bottleneck problem. Instead of relying on just the final hidden state, Attention allows the Decoder to “look back” at the entire sequence of Encoder hidden states at every step of generation.
At each decoder step t, the model:
- Calculates an attention score (similarity) between the current decoder state and every encoder state.
- Normalizes these scores using Softmax to get attention weights.
- Computes a context vector as a weighted sum of encoder states.
- Uses this context vector to predict the next word.
Mathematically (Dot-Product Attention):
score(h<sub>t</sub>, h̄<sub>s</sub>) = h<sub>t</sub><sup>T</sup> · h̄<sub>s</sub>

α<sub>ts</sub> = softmax(score(h<sub>t</sub>, h̄<sub>s</sub>))

c<sub>t</sub> = Σ<sub>s</sub> α<sub>ts</sub> · h̄<sub>s</sub>

Where h<sub>t</sub> is the current decoder state and h̄<sub>s</sub> are the encoder hidden states.
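The three equations above can be sketched directly in PyTorch. The sizes below (hidden size 8, source length 5) and the random tensors are placeholders for a trained model's states:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, src_len = 8, 5

h_t = torch.randn(1, hidden_size)           # current decoder state
h_bar = torch.randn(src_len, hidden_size)   # all encoder hidden states

# 1. Dot-product score between the decoder state and every encoder state
scores = h_t @ h_bar.T                      # (1, src_len)
# 2. Softmax over source positions -> attention weights (sum to 1)
alpha = F.softmax(scores, dim=1)            # (1, src_len)
# 3. Context vector: weighted sum of encoder states
c_t = alpha @ h_bar                         # (1, hidden_size)
```

Because `alpha` sums to 1, `c_t` is a convex combination of the encoder states, weighted toward whichever source positions score highest against the current decoder state.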
Interactive: Attention Visualizer
See how the attention mechanism focuses on different parts of the input sentence when generating the translation.
Input (French): "Le chat noir est sur le tapis"
Translation (English): "The black cat is on the mat"
In such a visualizer, each English word (decoder step) highlights the French words (encoder states) the model attends to: for example, when generating "black", the highest attention weight falls on "noir", even though the adjective order differs between the two languages.
3. PyTorch Implementation (Seq2Seq with Attention)
Here’s a simplified conceptual implementation of an Attention Decoder in PyTorch.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=10):
        super().__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        # Attention layer: scores each source position from the
        # current input embedding and the previous hidden state
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        # Combines the attended context vector with the input embedding
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        # input: (1, 1) - a single word index
        # hidden: (1, 1, hidden_size)
        # encoder_outputs: (1, max_length, hidden_size)
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Attention weights over source positions: (1, max_length)
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

        # Weighted sum of encoder outputs: (1, 1, hidden_size)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs)

        # Combine the embedded input and the context vector
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)
        output = F.relu(output)

        output, hidden = self.gru(output, hidden)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)
        return log_probs, hidden, attn_weights

# Note: this is a simplified educational example.
# Real implementations often use PackedSequence and more sophisticated batching.
```
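At inference time, such a decoder is driven one token at a time in a greedy loop. The sketch below shows only the loop structure; `decoder_step` is a hypothetical stand-in for a trained `AttnDecoderRNN.forward`, and the token ids, vocabulary size, and random weights are all placeholders:

```python
import torch

SOS, EOS, vocab, hidden_size, max_len = 0, 1, 20, 8, 10

torch.manual_seed(0)
encoder_outputs = torch.randn(1, max_len, hidden_size)  # stand-in encoder states
W = torch.randn(hidden_size, vocab)                     # dummy output projection

def decoder_step(token, hidden, encoder_outputs):
    """Dummy stand-in for AttnDecoderRNN.forward: returns log-probs and new hidden."""
    hidden = torch.tanh(hidden + token)     # fake state update for illustration
    logits = hidden.view(1, -1) @ W
    return torch.log_softmax(logits, dim=1), hidden

hidden = torch.zeros(1, 1, hidden_size)     # would come from the encoder
token = SOS                                 # decoding starts from a start-of-sentence token
result = []
for _ in range(max_len):
    log_probs, hidden = decoder_step(token, hidden, encoder_outputs)
    token = log_probs.argmax(dim=1).item()  # greedy: pick the most likely word
    if token == EOS:
        break
    result.append(token)
```

The key point is that each step feeds the previously generated token back in as the next input; during training, this is usually replaced by teacher forcing, where the ground-truth previous token is fed instead.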
4. Summary
- Seq2Seq models map input sequences to output sequences.
- Encoder-Decoder is the standard architecture.
- Attention allows the decoder to focus on specific parts of the input sequence, solving the bottleneck problem.
- Transformers (which we will cover in a later module) take the idea of Attention to the extreme, discarding RNNs entirely.