Goodfellow et al., Chapter 10

Recurrence & Sequences

Unfolding computation graphs, vanilla RNNs, the vanishing gradient problem, LSTM, GRU, and the road to modern sequence models.

Prerequisites: Chapter 6 (feedforward networks, backprop), Chapter 8 (optimization, gradient clipping).
10
Chapters
4+
Simulations
10
Quizzes

Chapter 0: Why Sequences?

Not all data is fixed-size. Sentences have variable length. Stock prices unfold over time. Music is a stream of notes. To process data where order matters and length varies, we need models that handle sequences.

A feedforward network takes a fixed-size input and produces a fixed-size output. It has no notion of "before" or "after." A recurrent neural network (RNN) processes sequences one element at a time, maintaining a hidden state that carries information from past elements to inform future processing.

The key idea: An RNN shares the same weights across all time steps. It reads element x1, updates its hidden state, reads x2, updates again, and so on. The hidden state is the network's "memory" — a compressed summary of everything it has seen so far.
Sequence-to-Vector
Sentiment: "I love this movie" → Positive
Sequence-to-Sequence
Translation: "Bonjour le monde" → "Hello world"
Vector-to-Sequence
Image captioning: [image] → "A cat sitting on a mat"
What is the fundamental advantage of RNNs over feedforward networks for sequences?

Chapter 1: Unfolding

An RNN can be understood through unfolding (or unrolling). The recurrence ht = f(ht-1, xt) defines a compact, recursive computation. Unfolding it across time steps produces an equivalent feedforward graph where each time step is a layer.

ht = f(ht-1, xt; θ)

The crucial constraint: the function f and its parameters θ are shared across all time steps. This is parameter sharing in the time dimension, analogous to how CNNs share filters across space. It allows the network to generalize to sequences of any length, including lengths not seen during training.

Why unfolding matters for training: Once unfolded, we can apply standard backpropagation to the resulting deep feedforward graph. This is called backpropagation through time (BPTT). The gradients flow backward from the output through each time step. The total gradient for the shared parameters is the sum of gradients from all time steps.
RNN Unfolding

The compact recurrence on the left unfolds into a deep feedforward graph on the right. Same weights at every step.

Sequence length4
What does "unfolding" an RNN mean?

Chapter 2: The Vanilla RNN

The simplest RNN computes the hidden state as a linear transformation of the previous hidden state and the current input, followed by a nonlinearity (typically tanh):

ht = tanh(Whh ht-1 + Wxh xt + bh)
yt = Why ht + by

Three weight matrices define the entire network: Wxh maps input to hidden state, Whh maps previous hidden state to current hidden state, and Why maps hidden state to output. The same three matrices are reused at every time step.

Hidden state as memory: The hidden state ht is a fixed-size vector (say, 256 dimensions) that must encode everything relevant about the entire sequence seen so far. This is an incredibly compressed representation. At each step, the network must decide what to remember, what to update, and what to forget — all through the fixed function tanh(Whh h + Wxh x + b).

The hidden state size is the main capacity control. Too small and the network cannot remember enough. Too large and it overfits and is expensive to compute. Typical sizes range from 128 to 1024 dimensions.

Vanilla RNN Forward Pass

Watch the hidden state evolve as each input arrives. The hidden state color encodes its values.

What do the three weight matrices in a vanilla RNN control?

Chapter 3: The Vanishing Gradient Problem

When you unfold an RNN over T time steps, the gradient must flow backward through T applications of the weight matrix Whh. The gradient at time step 1 depends on the product WhhT. This product either explodes or vanishes exponentially with T.

∂hT/∂h1 = ∏t=2T ∂ht/∂ht-1 ≈ (Whh)T-1

If the largest eigenvalue of Whh is greater than 1, the product explodes. If it is less than 1, it vanishes to zero. In practice, this means vanilla RNNs cannot learn long-range dependencies — information from 50 time steps ago has effectively zero gradient signal.

The core problem: The gradient of the loss with respect to early time steps passes through many matrix multiplications. Like repeatedly multiplying a number by 0.9 gives 0.9100 ≈ 0.00003, the gradient signal decays exponentially. The network cannot learn that word 1 in a sentence affects the meaning at word 100. This is the vanishing gradient problem.

Gradient clipping handles the exploding case: if ||g|| > threshold, rescale g ← threshold · g / ||g||. This prevents catastrophic parameter updates. But clipping does not help the vanishing case — you cannot amplify a signal that is already zero. That requires a fundamentally different architecture: the LSTM.

Gradient Flow Through Time

Watch the gradient magnitude decay as it flows backward through time. Longer sequences have weaker gradients at early steps.

Sequence length20
Eigenvalue magnitude0.95
Why can't vanilla RNNs learn long-range dependencies?

Chapter 4: LSTM

The Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) solves the vanishing gradient problem with one elegant idea: a cell state ct that flows through time with additive updates instead of multiplicative ones.

The LSTM has three gates that control information flow:

ft = σ(Wf[ht-1, xt] + bf)     (forget gate)
it = σ(Wi[ht-1, xt] + bi)     (input gate)
ot = σ(Wo[ht-1, xt] + bo)     (output gate)
t = tanh(Wc[ht-1, xt] + bc)     (candidate)
ct = ft ⊙ ct-1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
Why this solves vanishing gradients: The cell state update ct = ft ⊙ ct-1 + it ⊙ c̃t is additive. When the forget gate ft is close to 1 and the input gate it is close to 0, the cell state passes through unchanged: ct ≈ ct-1. The gradient flows through unchanged too. This creates a "gradient highway" that can carry information across hundreds of time steps.

Forget gate ft decides what to erase from the cell state. When processing a new subject in a sentence, it erases the old subject. Input gate it decides what new information to write. Output gate ot decides what part of the cell state to expose as the hidden state output.

LSTM Cell

Watch information flow through the LSTM gates. The cell state (top line) flows with minimal interference. Gates control what enters and exits.

Forget gate0.90
Input gate0.30
Output gate0.70
How does the LSTM solve the vanishing gradient problem?

Chapter 5: GRU

The Gated Recurrent Unit (Cho et al., 2014) is a simplified alternative to the LSTM. It merges the cell state and hidden state into a single vector and uses only two gates instead of three:

zt = σ(Wz[ht-1, xt])     (update gate)
rt = σ(Wr[ht-1, xt])     (reset gate)
t = tanh(W[rt ⊙ ht-1, xt])     (candidate)
ht = (1 − zt) ⊙ ht-1 + zt ⊙ h̃t

The update gate zt is like a combined forget-and-input gate: it interpolates between the old hidden state and the new candidate. When zt ≈ 0, the hidden state passes through unchanged (like LSTM with forget gate ≈ 1). When zt ≈ 1, the old state is fully replaced.

GRU vs LSTM: GRU has fewer parameters (2 gates vs 3, no separate cell state), making it faster to train and less prone to overfitting on small datasets. Performance is comparable to LSTM on most tasks. In practice: use LSTM when you have plenty of data and need maximum capacity. Use GRU when you want efficiency. Neither dominates across all tasks.

The reset gate rt controls how much of the previous hidden state is used to compute the candidate. When rt ≈ 0, the candidate ignores the previous state entirely, effectively "resetting" the memory. This is useful when the current input marks a clear break from previous context.

How does the GRU simplify the LSTM?

Chapter 6: Bidirectional RNNs

A standard RNN reads the sequence left-to-right. But for many tasks, future context is just as important as past context. Consider the sentence: "He said the bank of the river was muddy." Understanding that "bank" means "riverbank" (not a financial institution) requires seeing "river" which comes after "bank."

A bidirectional RNN runs two separate RNNs: one reading left-to-right (forward), one reading right-to-left (backward). At each position, the outputs of both are concatenated:

ht = [ht ; ht]
When to use bidirectional: Bidirectional RNNs require access to the entire sequence, so they work for tasks where the full input is available (classification, tagging, encoding). They do not work for autoregressive generation (language modeling, translation decoding) where you produce tokens one at a time and cannot see the future.

Bidirectional LSTMs (BiLSTMs) were the dominant architecture for NLP tasks like named entity recognition, part-of-speech tagging, and sentiment analysis before transformers. BERT (2018) can be seen as a bidirectional transformer encoder — the same idea applied to attention instead of recurrence.

Deep RNNs stack multiple recurrent layers. The hidden state of layer l at time t feeds into layer l+1 at the same time step. Typical depth is 2-4 layers. Deeper than 4 layers rarely helps for RNNs (unlike CNNs), partly because the unfolded network is already very deep in the time dimension.

Why can't bidirectional RNNs be used for autoregressive generation?

Chapter 7: Encoder-Decoder

Many tasks map a variable-length input to a variable-length output: translation, summarization, conversation. The encoder-decoder (or sequence-to-sequence) architecture handles this with two RNNs.

The encoder reads the entire input sequence and compresses it into a fixed-size context vector c (usually the final hidden state). The decoder then generates the output sequence one element at a time, conditioned on c and its own previous outputs.

Encoder: ht = fenc(ht-1, xt),   c = hT
Decoder: st = fdec(st-1, yt-1, c),   yt = g(st)
The bottleneck problem: Compressing an entire input sequence into a single fixed-size vector c is a severe limitation. For long sequences, the context vector cannot possibly retain all relevant information. This motivated attention (Bahdanau et al., 2014): instead of using only the final hidden state, the decoder learns to attend to all encoder hidden states at each decoding step. Attention was the key innovation that eventually led to the Transformer.

Teacher forcing is the standard training technique: during training, feed the ground-truth previous token yt-1 to the decoder, not its own prediction. This prevents error accumulation but creates a train-test mismatch (at test time, the model must use its own predictions). Scheduled sampling gradually transitions from teacher forcing to self-generated inputs during training.

Encoder-Decoder Architecture

The encoder reads the input and produces a context vector. The decoder generates the output sequence from the context. The bottleneck is visible as the single point where information must squeeze through.

What is the main limitation of the basic encoder-decoder architecture?

Chapter 8: Sequence Memory Playground

Test how well different architectures remember information across time. A signal appears at the start of the sequence; the network must reproduce it at the end. Vanilla RNNs forget; LSTMs remember.

Memory Test: Vanilla RNN vs LSTM

A "memory signal" is injected at step 1. Can the network recall it at the last step? The gradient magnitude shows why LSTMs succeed.

Sequence length20
RNN eigenvalue0.95
Experiments: (1) Set length=10, eigenvalue=0.95 — the vanilla RNN barely remembers. (2) Set length=40 — the vanilla RNN signal is near zero. (3) Notice the LSTM maintains nearly full signal regardless of length, thanks to its additive cell state. (4) Set eigenvalue=1.0 — the vanilla RNN now preserves the signal (but in practice, eigenvalue=1.0 is unstable).
In the memory test, why does the LSTM maintain its signal strength regardless of sequence length?

Chapter 9: Connections

RNNs pioneered sequence modeling and many of their ideas live on in modern architectures:

ConceptWhere It Appears
Recurrence / hidden stateState Space Models (Mamba, S4) bring back recurrence with linear complexity. Transformers replaced RNNs for most tasks but recurrence is returning.
Gating (LSTM/GRU)Gating mechanisms appear everywhere: GLU in transformers, SE-Net squeeze-excitation, gated convolutions in WaveNet.
Vanishing gradientsSolved by residual connections (Ch 9, ResNet), skip connections, and normalization. Same problem, same solution strategy.
Encoder-decoderThe Transformer itself is an encoder-decoder. T5, BART, and all seq2seq models. Also VAEs (Ch 7) encode then decode.
AttentionBorn from RNN encoder-decoder limitations. Scaled to self-attention in Transformers. Now the dominant paradigm for sequences.
Teacher forcingStandard for training all autoregressive models including GPT. RLHF is partly a remedy for train-test mismatch.
Bidirectional processingBERT = bidirectional transformer encoder. BiLSTMs → masked self-attention. Same idea, different mechanism.
What you should take away: RNNs process sequences through recurrence and shared weights. Vanilla RNNs suffer from vanishing gradients; LSTMs solve this with additive cell state updates and gating. The encoder-decoder + attention paradigm born from RNN limitations directly led to the Transformer, which dominates modern deep learning.

Up next: Chapter 11: Practical Methodology — how to actually build, debug, and tune deep learning systems in the real world.

What key limitation of RNN encoder-decoders directly led to the development of the Transformer?