Module Review: Sequence Models

1. Key Takeaways

  • Sequence Matters: Traditional feed-forward networks (FNNs) cannot handle variable-length sequences or capture temporal dependencies. RNNs are designed for this.
  • Memory: RNNs maintain a hidden state (h<sub>t</sub>) that acts as memory, updated at each time step.
  • Training Challenges: Training RNNs with BPTT often leads to vanishing or exploding gradients.
  • LSTMs & GRUs: These advanced architectures mitigate the vanishing gradient problem using gating mechanisms. LSTMs use Forget, Input, and Output gates; GRUs use Update and Reset gates.
  • Seq2Seq: The Encoder-Decoder architecture is standard for tasks like translation and summarization.
  • Attention: The Attention Mechanism allows the decoder to access the entire input sequence, solving the information bottleneck problem of fixed-length context vectors.
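The "memory" bullet above can be sketched in a few lines of NumPy. This is a minimal illustration of the recurrent update h<sub>t</sub> = tanh(W<sub>xh</sub> x<sub>t</sub> + W<sub>hh</sub> h<sub>t-1</sub>); the dimensions, random weights, and inputs are illustrative, not from the module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration)
input_dim, hidden_dim, seq_len = 3, 4, 5

W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1

xs = rng.standard_normal((seq_len, input_dim))
h = np.zeros(hidden_dim)  # h_0: the initial hidden state ("memory")

for x_t in xs:
    # h_t = tanh(W_xh x_t + W_hh h_{t-1}) -- the same state is reused at every step
    h = np.tanh(W_xh @ x_t + W_hh @ h)

print(h.shape)  # (4,)
```

Note that the same two weight matrices are shared across all time steps; only the hidden state changes, which is what lets the network consume sequences of any length.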

2. Flashcards

Test your understanding of the core concepts.

What is BPTT?

Backpropagation Through Time. It's the standard training algorithm for RNNs, unrolling the network across time steps to compute gradients.
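The unrolling described above can be made concrete with a scalar RNN (toy weights and inputs assumed here): the loss gradient with respect to the shared recurrent weight is a sum of contributions from every time step, which we can verify against a finite-difference estimate.

```python
import numpy as np

def forward(w_h, w_x, xs):
    """Unroll a scalar RNN; return all hidden states (h_0 = 0)."""
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w_x * x + w_h * hs[-1]))
    return hs

w_h, w_x = 0.5, 0.8          # toy parameters (assumed)
xs = [0.1, -0.3, 0.7, 0.2]   # toy input sequence (assumed)

# BPTT for the loss L = h_T: walk backward through the unrolled steps
hs = forward(w_h, w_x, xs)
grad, dh = 0.0, 1.0                 # dL/dh_T = 1
for t in range(len(xs), 0, -1):
    local = 1 - hs[t] ** 2          # derivative of tanh at step t
    grad += dh * local * hs[t - 1]  # step t's contribution to dL/dw_h
    dh *= local * w_h               # carry the gradient back to h_{t-1}

# Sanity check with a central finite difference
eps = 1e-6
fd = (forward(w_h + eps, w_x, xs)[-1] - forward(w_h - eps, w_x, xs)[-1]) / (2 * eps)
print(abs(grad - fd) < 1e-8)
```

The repeated factor `local * w_h` in the backward loop is also why gradients vanish or explode: it is multiplied in once per time step.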

Why use LSTM over RNN?

LSTMs mitigate the vanishing gradient problem using gates that control what enters and leaves the cell state, allowing them to capture long-term dependencies that vanilla RNNs miss.

What is the "Forget Gate"?

A sigmoid layer in an LSTM that decides what information to discard from the cell state (0 = forget, 1 = keep).

What problem does Attention solve?

The Information Bottleneck. It allows the decoder to "look back" at all encoder states, not just the final context vector.
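That "look back" can be sketched as dot-product attention over a stack of encoder states (dimensions and random vectors here are illustrative, not from the module): score each encoder state against the decoder state, softmax the scores into weights, and take the weighted sum as the context vector.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, src_len = 4, 6  # toy sizes (assumed)

h_enc = rng.standard_normal((src_len, hidden_dim))  # all encoder hidden states
h_dec = rng.standard_normal(hidden_dim)             # current decoder state

# Dot-product score for each source position
scores = h_enc @ h_dec

# Softmax -> attention weights alpha_ts (non-negative, sum to 1)
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()

# Context vector: weighted sum of ALL encoder states, not just the last one
c_t = alphas @ h_enc

print(np.isclose(alphas.sum(), 1.0))  # True
print(c_t.shape)  # (4,)
```

Because `c_t` is recomputed at every decoding step with fresh weights, the decoder is no longer limited to a single fixed-length summary of the input.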

3. Cheat Sheet

| Concept | Equation / Description |
| --- | --- |
| RNN Update | h<sub>t</sub> = tanh(W<sub>xh</sub> x<sub>t</sub> + W<sub>hh</sub> h<sub>t-1</sub>) |
| LSTM Forget | f<sub>t</sub> = &sigma;(W<sub>f</sub> &middot; [h<sub>t-1</sub>, x<sub>t</sub>]) |
| LSTM Input | i<sub>t</sub> = &sigma;(W<sub>i</sub> &middot; [h<sub>t-1</sub>, x<sub>t</sub>]) |
| LSTM Candidate | C&#x0303;<sub>t</sub> = tanh(W<sub>C</sub> &middot; [h<sub>t-1</sub>, x<sub>t</sub>]) |
| LSTM Cell | C<sub>t</sub> = f<sub>t</sub> * C<sub>t-1</sub> + i<sub>t</sub> * C&#x0303;<sub>t</sub> |
| Attention Score | score = h<sub>decoder</sub><sup>T</sup> &middot; h<sub>encoder</sub> (dot product) |
| Context Vector | c<sub>t</sub> = &Sigma;<sub>s</sub> &alpha;<sub>ts</sub> h<sub>s</sub> (weighted sum; &alpha;<sub>ts</sub> = softmax of scores) |
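The LSTM rows of the cheat sheet can be chained into one forward step. A minimal NumPy sketch, assuming toy dimensions and random weights, with biases omitted to match the equations above; the candidate cell state C̃<sub>t</sub> = tanh(W<sub>C</sub> &middot; [h<sub>t-1</sub>, x<sub>t</sub>]) follows the standard LSTM formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
input_dim, hidden_dim = 3, 4          # toy sizes (assumed)
concat = hidden_dim + input_dim       # size of [h_{t-1}, x_t]

# One weight matrix per gate, all acting on the concatenated [h_{t-1}, x_t]
W_f = rng.standard_normal((hidden_dim, concat)) * 0.1
W_i = rng.standard_normal((hidden_dim, concat)) * 0.1
W_C = rng.standard_normal((hidden_dim, concat)) * 0.1  # candidate weights

h_prev = np.zeros(hidden_dim)
C_prev = np.zeros(hidden_dim)
x_t = rng.standard_normal(input_dim)

hx = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
f_t = sigmoid(W_f @ hx)               # forget gate: 0 = discard, 1 = keep
i_t = sigmoid(W_i @ hx)               # input gate: how much new info to admit
C_tilde = np.tanh(W_C @ hx)           # candidate cell state
C_t = f_t * C_prev + i_t * C_tilde    # cell update from the cheat sheet

print(C_t.shape)  # (4,)
```

The additive form of the cell update (`f_t * C_prev + i_t * C_tilde`) is the key design choice: gradients can flow through the `f_t * C_prev` term without passing through a squashing nonlinearity at every step.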

4. Next Steps

You have mastered the fundamentals of sequence modeling with RNNs. However, RNNs are sequential by nature: each step depends on the previous one, so computation cannot be parallelized across time steps, making them slow to train on modern hardware (GPUs).

In the next module, we will explore Transformers, an architecture that relies entirely on Attention and discards recurrence, revolutionizing NLP and Deep Learning.

Start Module 05: Transformers