Ch 10: Recurrence & Sequence Modeling — Goodfellow Deep Learning

Chapter 0: Why Sequences?

Not all data is fixed-size. Sentences have variable length. Stock prices unfold over time. Music is a stream of notes. To process data where order matters and length varies, we need models that handle sequences.

A feedforward network takes a fixed-size input and produces a fixed-size output. It has no notion of "before" or "after." A recurrent neural network (RNN) processes sequences one element at a time, maintaining a hidden state that carries information from past elements to inform future processing.

The key idea: An RNN shares the same weights across all time steps. It reads element x₁, updates its hidden state, reads x₂, updates again, and so on. The hidden state is the network's "memory" — a compressed summary of everything it has seen so far.

Sequence-to-Vector. Sentiment: "I love this movie" → Positive
Sequence-to-Sequence. Translation: "Bonjour le monde" → "Hello world"
Vector-to-Sequence. Image captioning: [image] → "A cat sitting on a mat"

What is the fundamental advantage of RNNs over feedforward networks for sequences?

RNNs maintain a hidden state across time steps, allowing them to process variable-length sequences while retaining information about past elements
RNNs have more parameters
RNNs are faster to train

Chapter 1: Unfolding

An RNN can be understood through unfolding (or unrolling). The recurrence h_t = f(h_{t-1}, x_t) defines a compact, recursive computation. Unfolding it across time steps produces an equivalent feedforward graph where each time step is a layer.

h_t = f(h_{t-1}, x_t; θ)

The crucial constraint: the function f and its parameters θ are shared across all time steps. This is parameter sharing in the time dimension, analogous to how CNNs share filters across space. It allows the network to generalize to sequences of any length, including lengths not seen during training.

Why unfolding matters for training: Once unfolded, we can apply standard backpropagation to the resulting deep feedforward graph. This is called backpropagation through time (BPTT). The gradients flow backward from the output through each time step. The total gradient for the shared parameters is the sum of gradients from all time steps.
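The claim that the shared-parameter gradient is a sum over time steps can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: a scalar hidden state, a loss equal to the final state h_T, and the hypothetical helper names forward and bptt_grad_w.

```python
import math

def forward(w, u, xs, h0=0.0):
    """Unrolled scalar recurrence h_t = tanh(w*h_{t-1} + u*x_t); returns [h_0, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def bptt_grad_w(w, u, xs, h0=0.0):
    """Gradient of L = h_T with respect to the shared weight w, via BPTT."""
    hs = forward(w, u, xs, h0)
    grad_h = 1.0                      # dL/dh_T
    contributions = []                # one term per time step where w is used
    for t in range(len(xs), 0, -1):   # walk backward: t = T, ..., 1
        local = 1.0 - hs[t] ** 2      # derivative of tanh at step t
        contributions.append(grad_h * local * hs[t - 1])
        grad_h *= local * w           # propagate to dL/dh_{t-1}
    return sum(contributions), contributions

xs = [0.5, -0.3, 0.8, 0.1]            # sequence length 4
total, per_step = bptt_grad_w(0.7, 1.2, xs)

# Central finite difference as an independent check of the summed gradient.
eps = 1e-6
numeric = (forward(0.7 + eps, 1.2, xs)[-1] - forward(0.7 - eps, 1.2, xs)[-1]) / (2 * eps)
print(total, numeric)
```

The two printed numbers should agree closely, which is what "sum of gradients from all time steps" means in practice: every use of w in the unrolled graph contributes one term.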

Figure: RNN Unfolding. The compact recurrence on the left unfolds into a deep feedforward graph on the right, with the same weights at every step (shown for sequence length 4).

What does "unfolding" an RNN mean?

Expanding the recurrence across time steps into an equivalent feedforward computation graph, where each time step becomes a layer with shared weights
Removing the hidden state from the network
Increasing the number of parameters

Chapter 2: The Vanilla RNN

The simplest RNN computes the hidden state as a linear transformation of the previous hidden state and the current input, followed by a nonlinearity (typically tanh):

h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
y_t = W_hy h_t + b_y

Three weight matrices define the entire network: W_xh maps input to hidden state, W_hh maps previous hidden state to current hidden state, and W_hy maps hidden state to output. The same three matrices are reused at every time step.
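A minimal sketch of this forward pass, written with plain Python lists so it has no dependencies. The helper names (matvec, rnn_step, rnn_forward), the tiny sizes (hidden 2, input 1, output 1), and the weight values are all illustrative assumptions, not part of the original lesson.

```python
import math

def matvec(W, v):
    """Multiply a list-of-rows matrix W by a vector v."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    pre = [r + s + b for r, s, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x), b_h)]
    return [math.tanh(a) for a in pre]

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Apply the same three matrices at every step; returns all outputs and the last state."""
    h = [0.0] * len(b_h)
    ys = []
    for x in xs:
        h = rnn_step(h, x, W_hh, W_xh, b_h)
        ys.append([o + b for o, b in zip(matvec(W_hy, h), b_y)])  # y_t = W_hy h_t + b_y
    return ys, h

W_xh = [[0.5], [-0.3]]                  # input -> hidden
W_hh = [[0.1, 0.2], [-0.2, 0.1]]        # previous hidden -> current hidden
W_hy = [[1.0, -1.0]]                    # hidden -> output
b_h, b_y = [0.0, 0.0], [0.0]

short_ys, _ = rnn_forward([[1.0], [0.5]], W_hh, W_xh, W_hy, b_h, b_y)
long_ys, h_final = rnn_forward([[1.0], [0.5], [-1.0], [0.2], [0.0]], W_hh, W_xh, W_hy, b_h, b_y)
print(len(short_ys), len(long_ys), h_final)
```

Because the weights are reused at every step, the same function handles a length-2 and a length-5 sequence without any change.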

Hidden state as memory: The hidden state h_t is a fixed-size vector (say, 256 dimensions) that must encode everything relevant about the entire sequence seen so far. This is an incredibly compressed representation. At each step, the network must decide what to remember, what to update, and what to forget — all through the fixed function tanh(W_hh h + W_xh x + b).

The hidden state size is the main capacity control. Too small and the network cannot remember enough. Too large and it overfits and is expensive to compute. Typical sizes range from 128 to 1024 dimensions.

Figure: Vanilla RNN Forward Pass. The hidden state evolves as each input arrives; its color encodes its values.

What do the three weight matrices in a vanilla RNN control?

W_xh maps input to hidden, W_hh maps previous hidden to current hidden (recurrence), W_hy maps hidden to output
They control the learning rate at each time step
They define different layers in a feedforward network

Chapter 3: The Vanishing Gradient Problem

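When you unfold an RNN over T time steps, the gradient reaching time step 1 must flow backward through T-1 recurrent Jacobians, each of which contains the weight matrix W_hh. It therefore depends on roughly the (T-1)-th power of W_hh (a matrix power, not a transpose). This product tends to either explode or vanish exponentially with T.

∂h_T/∂h_1 = ∏_{t=2}^{T} ∂h_t/∂h_{t-1} ≈ (W_hh)^{T-1}

The approximation drops the tanh derivative, which is at most 1 and so can only shrink the product further. If the largest eigenvalue magnitude (the spectral radius) of W_hh is greater than 1, the product can explode. If it is less than 1, the product vanishes toward zero. In practice, this means vanilla RNNs struggle to learn long-range dependencies: with a spectral radius of 0.9, the gradient signal from 50 time steps ago is scaled by about 0.9^50 ≈ 0.005.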

The core problem: The gradient of the loss with respect to early time steps passes through many matrix multiplications. Just as repeatedly multiplying a number by 0.9 gives 0.9¹⁰⁰ ≈ 0.00003, the gradient signal decays exponentially. The network then struggles to learn that word 1 in a sentence affects the meaning at word 100. This is the vanishing gradient problem.
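To see the decay with the tanh nonlinearity included, here is a small sketch for a scalar RNN. The recurrence h_t = tanh(w*h_{t-1} + x_t), the constant input 0.1, and the function name jacobian_product are assumptions chosen for illustration.

```python
import math

def jacobian_product(w, xs, h0=0.0):
    """d h_T / d h_1 for the scalar RNN h_t = tanh(w*h_{t-1} + x_t)."""
    h = math.tanh(w * h0 + xs[0])     # h_1
    product = 1.0
    for x in xs[1:]:                  # steps t = 2, ..., T
        h = math.tanh(w * h + x)
        product *= w * (1.0 - h * h)  # d h_t / d h_{t-1}
    return product

inputs = [0.1] * 101                  # T = 101, so the product has 100 factors
actual = jacobian_product(0.9, inputs)
bound = 0.9 ** 100                    # pure weight-only decay, roughly 0.00003
print(actual, bound)
```

The tanh derivative is never larger than 1, so the actual product is smaller still than the weight-only figure 0.9¹⁰⁰.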

Gradient clipping handles the exploding case: if ||g|| > threshold, rescale g ← threshold · g / ||g||. This prevents catastrophic parameter updates. But clipping does not help the vanishing case — you cannot amplify a signal that is already zero. That requires a fundamentally different architecture: the LSTM.
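The clipping rule can be written in a few lines. This is a sketch of norm-based clipping on a flat list of gradient values; the function name clip_by_norm is hypothetical, and real frameworks apply the same idea across all parameter tensors at once.

```python
import math

def clip_by_norm(grad, threshold):
    """If ||g|| > threshold, rescale g to have norm equal to threshold; otherwise leave it."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)

big = clip_by_norm([3.0, 4.0], threshold=1.0)     # norm 5, gets rescaled
small = clip_by_norm([0.3, 0.4], threshold=1.0)   # norm 0.5, left unchanged
print(big, small)
```

Clipping changes only the length of the gradient, not its direction, which is why it tames explosions without otherwise altering the update.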

Figure: Gradient Flow Through Time. The gradient magnitude decays as it flows backward through time; longer sequences have weaker gradients at early steps. Controls: sequence length (default 20), eigenvalue magnitude (default 0.95).

Why can't vanilla RNNs learn long-range dependencies?

Gradients are multiplied by the recurrent weight matrix at each time step; if its eigenvalue magnitudes are below 1, the gradient vanishes exponentially, destroying the learning signal from distant past inputs
They do not have enough parameters
The tanh activation is too slow to compute

Chapter 4: LSTM

The Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) solves the vanishing gradient problem with one elegant idea: a cell state c_t that flows through time with additive updates instead of multiplicative ones.

The LSTM has three gates that control information flow:

f_t = σ(W_f[h_{t-1}, x_t] + b_f)     (forget gate)
i_t = σ(W_i[h_{t-1}, x_t] + b_i)     (input gate)
o_t = σ(W_o[h_{t-1}, x_t] + b_o)     (output gate)
c̃_t = tanh(W_c[h_{t-1}, x_t] + b_c)     (candidate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

Why this solves vanishing gradients: The cell state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t is additive. When the forget gate f_t is close to 1 and the input gate i_t is close to 0, the cell state passes through unchanged: c_t ≈ c_{t-1}. The gradient flows through unchanged too. This creates a "gradient highway" that can carry information across hundreds of time steps.

Forget gate f_t decides what to erase from the cell state. When processing a new subject in a sentence, it erases the old subject. Input gate i_t decides what new information to write. Output gate o_t decides what part of the cell state to expose as the hidden state output.
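The six LSTM equations translate almost line for line into code. The sketch below uses plain lists, a hypothetical parameter dictionary, and hand-picked biases that force the forget gate near 1 and the input gate near 0, so the "pass through unchanged" behavior is easy to observe. None of these values come from the lesson.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * vi for w, vi in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(h_prev, c_prev, x, p):
    hx = h_prev + x                                            # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], hx)]   # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], hx)]   # input gate
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], hx)]   # output gate
    c_tilde = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], hx)]  # candidate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hidden size 2, input size 1. The biases push f toward 1 and i toward 0.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {
    "W_f": zeros, "b_f": [20.0, 20.0],
    "W_i": zeros, "b_i": [-20.0, -20.0],
    "W_o": zeros, "b_o": [0.0, 0.0],
    "W_c": [[0.3, -0.1, 0.8], [0.2, 0.4, -0.5]], "b_c": [0.0, 0.0],
}
h, c = [0.0, 0.0], [0.7, -0.4]
for x in [[1.0], [-2.0], [0.5]] * 10:                          # 30 time steps
    h, c = lstm_step(h, c, x, params)
print(c)
```

After 30 steps with varying inputs, the cell state should still be very close to its starting value [0.7, -0.4], while the output gate controls how much of it shows up in h.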

Figure: LSTM Cell. Information flows through the LSTM gates; the cell state (top line) flows with minimal interference, and the gates control what enters and exits. Controls: forget gate (default 0.90), input gate (default 0.30), output gate (default 0.70).

How does the LSTM solve the vanishing gradient problem?

The cell state uses additive updates controlled by gates; when the forget gate is ~1, the cell state and its gradient pass through unchanged, creating a gradient highway across time
It uses a larger hidden state
It clips the gradients automatically

Chapter 5: GRU

The Gated Recurrent Unit (Cho et al., 2014) is a simplified alternative to the LSTM. It merges the cell state and hidden state into a single vector and uses only two gates instead of three:

z_t = σ(W_z[h_{t-1}, x_t])     (update gate)
r_t = σ(W_r[h_{t-1}, x_t])     (reset gate)
h̃_t = tanh(W[r_t ⊙ h_{t-1}, x_t])     (candidate)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

The update gate z_t is like a combined forget-and-input gate: it interpolates between the old hidden state and the new candidate. When z_t ≈ 0, the hidden state passes through unchanged (like LSTM with forget gate ≈ 1). When z_t ≈ 1, the old state is fully replaced.

GRU vs LSTM: GRU has fewer parameters (2 gates vs 3, no separate cell state), making it faster to train and less prone to overfitting on small datasets. Performance is comparable to LSTM on most tasks. In practice: use LSTM when you have plenty of data and need maximum capacity. Use GRU when you want efficiency. Neither dominates across all tasks.
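A quick count makes the size difference concrete. This sketch assumes each gate or candidate is one weight matrix over the concatenation [h_{t-1}, x_t] plus one bias vector, and picks a hidden size of 256 with an input size of 128 purely as an example; library implementations may count biases differently.

```python
def gated_rnn_params(hidden, inputs, num_blocks, bias=True):
    """Each block is one weight matrix over [h_{t-1}, x_t], plus an optional bias."""
    per_block = hidden * (hidden + inputs) + (hidden if bias else 0)
    return num_blocks * per_block

lstm = gated_rnn_params(256, 128, num_blocks=4)   # forget, input, output, candidate
gru = gated_rnn_params(256, 128, num_blocks=3)    # update, reset, candidate
print(lstm, gru, gru / lstm)
```

Under these assumptions the GRU has three quarters of the LSTM's recurrent-layer parameters, regardless of the sizes chosen.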

The reset gate r_t controls how much of the previous hidden state is used to compute the candidate. When r_t ≈ 0, the candidate ignores the previous state entirely, effectively "resetting" the memory. This is useful when the current input marks a clear break from previous context.
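The GRU equations can be sketched the same way. The version below follows the bias-free form given above; the helper names, the tiny sizes (hidden 2, input 1), and the weight values are illustrative assumptions. Two update-gate settings are compared: one that drives z_t toward 0 and one that drives it toward 1.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_z, W_r, W):
    hx = h_prev + x                                       # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, hx)]             # update gate
    r = [sigmoid(a) for a in matvec(W_r, hx)]             # reset gate
    gated = [rk * hk for rk, hk in zip(r, h_prev)] + x    # [r_t ⊙ h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(W, gated)]    # candidate
    h = [(1.0 - zk) * hk + zk * ck for zk, hk, ck in zip(z, h_prev, h_tilde)]
    return h, h_tilde

h_prev, x = [0.6, -0.3], [1.0]
W_r = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
W = [[0.5, -0.2, 0.3], [0.1, 0.4, -0.6]]
W_z_keep = [[0.0, 0.0, -20.0], [0.0, 0.0, -20.0]]        # z close to 0
W_z_replace = [[0.0, 0.0, 20.0], [0.0, 0.0, 20.0]]       # z close to 1

kept, _ = gru_step(h_prev, x, W_z_keep, W_r, W)
replaced, candidate = gru_step(h_prev, x, W_z_replace, W_r, W)
print(kept, replaced, candidate)
```

With z_t near 0 the new state should match the old one, and with z_t near 1 it should match the candidate, which is the interpolation the update gate performs.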

How does the GRU simplify the LSTM?

It merges the cell state and hidden state into one, uses two gates (update and reset) instead of three, with the update gate interpolating between old and new states
It removes all gates
It uses a larger hidden state

Chapter 6: Bidirectional RNNs

A standard RNN reads the sequence left-to-right. But for many tasks, future context is just as important as past context. Consider the sentence: "He said the bank of the river was muddy." Understanding that "bank" means "riverbank" (not a financial institution) requires seeing "river" which comes after "bank."

A bidirectional RNN runs two separate RNNs: one reading left-to-right (forward), one reading right-to-left (backward). At each position, the outputs of both are concatenated:

h_t = [h_t^→ ; h_t^←]
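The concatenation can be illustrated with two scalar RNNs. In this sketch the function names and the weights (0.5, 1.0) are assumptions; the point is only that the backward state at a position depends on later inputs, while the forward state does not.

```python
import math

def run_rnn(xs, w, u):
    """Scalar RNN h_t = tanh(w*h_{t-1} + u*x_t); returns the state at every position."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states

def bidirectional(xs, fwd=(0.5, 1.0), bwd=(0.5, 1.0)):
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]     # reverse again to re-align positions
    return [(f, b) for f, b in zip(forward, backward)]

a = bidirectional([1.0, 0.5, -0.2])
b = bidirectional([1.0, 0.5, 0.9])               # only the last input differs
print(a[0], b[0])
```

At position 0 the forward halves are identical, but the backward halves differ, because the backward RNN has already read the changed final input.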

When to use bidirectional: Bidirectional RNNs require access to the entire sequence, so they work for tasks where the full input is available (classification, tagging, encoding). They do not work for autoregressive generation (language modeling, translation decoding) where you produce tokens one at a time and cannot see the future.

Bidirectional LSTMs (BiLSTMs) were the dominant architecture for NLP tasks like named entity recognition, part-of-speech tagging, and sentiment analysis before transformers. BERT (2018) can be seen as a bidirectional transformer encoder — the same idea applied to attention instead of recurrence.

Deep RNNs stack multiple recurrent layers. The hidden state of layer l at time t feeds into layer l+1 at the same time step. Typical depth is 2-4 layers. Deeper than 4 layers rarely helps for RNNs (unlike CNNs), partly because the unfolded network is already very deep in the time dimension.
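Stacking can be sketched with scalar layers. Here each layer is a hypothetical (w, u) pair, and the state of layer l at time t becomes the input of layer l+1 at the same time step; the values are arbitrary.

```python
import math

def stacked_rnn(xs, layers):
    """layers: list of (w, u) pairs, one per layer, each with a scalar hidden state."""
    h = [0.0] * len(layers)
    top_outputs = []
    for x in xs:
        inp = x
        for l, (w, u) in enumerate(layers):
            h[l] = math.tanh(w * h[l] + u * inp)
            inp = h[l]                 # layer l at time t feeds layer l+1 at time t
        top_outputs.append(inp)
    return top_outputs

xs = [0.4, -0.1, 0.7]
one_layer = stacked_rnn(xs, [(0.5, 1.0)])
three_layers = stacked_rnn(xs, [(0.5, 1.0), (0.3, 0.8), (0.2, 1.1)])
print(one_layer, three_layers)
```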

Why can't bidirectional RNNs be used for autoregressive generation?

They require access to the full sequence (including future tokens), which is not available during generation when tokens are produced one at a time
They are too slow
They have too many parameters

Chapter 7: Encoder-Decoder

Many tasks map a variable-length input to a variable-length output: translation, summarization, conversation. The encoder-decoder (or sequence-to-sequence) architecture handles this with two RNNs.

The encoder reads the entire input sequence and compresses it into a fixed-size context vector c (usually the final hidden state). The decoder then generates the output sequence one element at a time, conditioned on c and its own previous outputs.

Encoder: h_t = f_enc(h_{t-1}, x_t), c = h_T
Decoder: s_t = f_dec(s_{t-1}, y_{t-1}, c), y_t = g(s_t)
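A toy version of these two equations, using scalar states. The specific forms of f_enc, f_dec, and g, the coefficient values, and the fixed number of decoding steps are all assumptions made for illustration; a real decoder would typically stop when it emits an end-of-sequence token.

```python
import math

def encode(xs, w=0.5, u=1.0):
    """h_t = f_enc(h_{t-1}, x_t); the context c is the final hidden state h_T."""
    h = 0.0
    for x in xs:
        h = math.tanh(w * h + u * x)
    return h

def decode(c, steps, a=0.5, b=0.3, d=1.0, v=2.0):
    """s_t = f_dec(s_{t-1}, y_{t-1}, c) and y_t = g(s_t), here with g(s) = v*s."""
    s, y_prev, ys = 0.0, 0.0, []
    for _ in range(steps):
        s = math.tanh(a * s + b * y_prev + d * c)
        y_prev = v * s
        ys.append(y_prev)
    return ys

c = encode([0.2, -0.4, 0.9, 0.1, 0.3])   # five inputs summarized in one number
outputs = decode(c, steps=3)             # three outputs generated from that summary
print(c, outputs)
```

Input and output lengths are decoupled: everything the decoder knows about the input arrives through the single value c.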

The bottleneck problem: Compressing an entire input sequence into a single fixed-size vector c is a severe limitation. For long sequences, the context vector cannot possibly retain all relevant information. This motivated attention (Bahdanau et al., 2014): instead of using only the final hidden state, the decoder learns to attend to all encoder hidden states at each decoding step. Attention was the key innovation that eventually led to the Transformer.
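The idea of attending to all encoder states can be sketched as a weighted average. Note the simplification: this example scores states with a plain dot product, whereas Bahdanau et al. used a small learned network for scoring. The function names and vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return a context vector: encoder states averaged with softmax(dot-product) weights."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return context, weights

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
context, weights = attend([0.0, 2.0], encoder_states)
print(weights, context)
```

The context is recomputed at every decoding step, so the decoder is no longer limited to one fixed summary of the input.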

Teacher forcing is the standard training technique: during training, feed the ground-truth previous token y_{t-1} to the decoder, not its own prediction. This prevents error accumulation but creates a train-test mismatch (at test time, the model must use its own predictions). Scheduled sampling gradually transitions from teacher forcing to self-generated inputs during training.
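The difference between the two regimes shows up in a toy decoder. Here the "model" is a hypothetical lookup rule that counts upward but makes one mistake (it predicts 4 after seeing 2); the target sequence and the function name run_decoder are likewise invented for the example.

```python
def run_decoder(step_fn, start, targets, teacher_forcing):
    """Predict one token per target position, feeding either the truth or the prediction back in."""
    prev, preds = start, []
    for target in targets:
        pred = step_fn(prev)
        preds.append(pred)
        prev = target if teacher_forcing else pred
    return preds

def imperfect_model(y):
    return 4 if y == 2 else y + 1      # correct everywhere except after token 2

targets = [1, 2, 3, 4, 5]
forced = run_decoder(imperfect_model, 0, targets, teacher_forcing=True)
free = run_decoder(imperfect_model, 0, targets, teacher_forcing=False)
forced_errors = sum(p != t for p, t in zip(forced, targets))
free_errors = sum(p != t for p, t in zip(free, targets))
print(forced, forced_errors, free, free_errors)
```

With teacher forcing the single mistake stays isolated; when the model consumes its own predictions, the same mistake shifts every later output, which is the error accumulation described above.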

Figure: Encoder-Decoder Architecture. The encoder reads the input and produces a context vector. The decoder generates the output sequence from the context. The bottleneck is visible as the single point where information must squeeze through.

What is the main limitation of the basic encoder-decoder architecture?

The entire input sequence must be compressed into a single fixed-size context vector, which cannot retain all relevant information for long sequences
The encoder and decoder cannot share weights
It can only handle fixed-length outputs

Chapter 8: Sequence Memory Playground

Test how well different architectures remember information across time. A signal appears at the start of the sequence; the network must reproduce it at the end. Vanilla RNNs forget; LSTMs remember.

Figure: Memory Test: Vanilla RNN vs LSTM. A "memory signal" is injected at step 1. Can the network recall it at the last step? The gradient magnitude shows why LSTMs succeed. Controls: sequence length (default 20), RNN eigenvalue (default 0.95).

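Experiments: (1) Set length=10, eigenvalue=0.95: the vanilla RNN signal has already weakened (pure geometric decay leaves roughly 60% of it). (2) Set length=40: the signal drops much further, to roughly 13% under the same decay, and keeps shrinking as length grows. (3) Notice the LSTM maintains nearly full signal regardless of length, thanks to its additive cell state. (4) Set eigenvalue=1.0: the vanilla RNN now preserves the signal, but in practice an eigenvalue of exactly 1.0 is hard to maintain during training, and any drift above or below it brings back explosion or decay.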
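The experiments can be approximated offline with two one-line recurrences. This is a sketch under strong assumptions: the vanilla RNN is treated as a linear scalar recurrence, the LSTM is reduced to its cell-state update with a fixed forget gate of 0.999, and the signal is written once at step 1. The interactive demo may compute its values differently.

```python
def vanilla_recall(eigenvalue, length):
    """Linear recurrence h_t = eigenvalue*h_{t-1} + x_t with x_1 = 1 and zeros afterward."""
    h = 0.0
    for t in range(1, length + 1):
        x = 1.0 if t == 1 else 0.0
        h = eigenvalue * h + x
    return h

def lstm_cell_recall(forget, length):
    """Cell state c_t = f*c_{t-1} + i_t*x_t, writing only at step 1 (i_1 = 1, then i_t = 0)."""
    c = 0.0
    for t in range(1, length + 1):
        i, x = (1.0, 1.0) if t == 1 else (0.0, 0.0)
        c = forget * c + i * x
    return c

for length in (10, 40):
    print(length, vanilla_recall(0.95, length), lstm_cell_recall(0.999, length))
```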

In the memory test, why does the LSTM maintain its signal strength regardless of sequence length?

The LSTM's cell state passes information through additive updates; with forget gate near 1, information is preserved without multiplicative decay
The LSTM uses a larger hidden state
The LSTM trains faster

Chapter 9: Connections

RNNs pioneered sequence modeling and many of their ideas live on in modern architectures:

Concept	Where It Appears
Recurrence / hidden state	State Space Models (Mamba, S4) bring back recurrence with linear complexity. Transformers replaced RNNs for most tasks but recurrence is returning.
Gating (LSTM/GRU)	Gating mechanisms appear everywhere: GLU in transformers, SE-Net squeeze-excitation, gated convolutions in WaveNet.
Vanishing gradients	Mitigated in deep feedforward networks by residual connections (Ch 9, ResNet), skip connections, and normalization. Same problem, same solution strategy.
Encoder-decoder	The original Transformer is itself an encoder-decoder, as are T5, BART, and other seq2seq models. VAEs (Ch 7) also encode then decode.
Attention	Born from RNN encoder-decoder limitations. Scaled to self-attention in Transformers. Now the dominant paradigm for sequences.
Teacher forcing	Standard for training autoregressive models, including GPT. RLHF can be viewed as partly addressing the resulting train-test mismatch, since it trains on the model's own generations.
Bidirectional processing	BERT = bidirectional transformer encoder. BiLSTMs → unmasked (bidirectional) self-attention, trained with masked-token prediction. Same idea, different mechanism.

What you should take away: RNNs process sequences through recurrence and shared weights. Vanilla RNNs suffer from vanishing gradients; LSTMs solve this with additive cell state updates and gating. The encoder-decoder + attention paradigm born from RNN limitations directly led to the Transformer, which dominates modern deep learning.

Up next: Chapter 11: Practical Methodology — how to actually build, debug, and tune deep learning systems in the real world.

What key limitation of RNN encoder-decoders directly led to the development of the Transformer?

The fixed-size context vector bottleneck motivated attention, which was then generalized to self-attention, eliminating recurrence entirely in the Transformer
RNNs could not process text
RNNs used too much memory
