Unfolding computation graphs, vanilla RNNs, the vanishing gradient problem, LSTM, GRU, and the road to modern sequence models.
Not all data is fixed-size. Sentences have variable length. Stock prices unfold over time. Music is a stream of notes. To process data where order matters and length varies, we need models that handle sequences.
A feedforward network takes a fixed-size input and produces a fixed-size output. It has no notion of "before" or "after." A recurrent neural network (RNN) processes sequences one element at a time, maintaining a hidden state that carries information from past elements to inform future processing.
An RNN can be understood through unfolding (or unrolling). The recurrence ht = f(ht-1, xt) defines a compact, recursive computation. Unfolding it across time steps produces an equivalent feedforward graph where each time step is a layer.
The crucial constraint: the function f and its parameters θ are shared across all time steps. This is parameter sharing in the time dimension, analogous to how CNNs share filters across space. It allows the network to generalize to sequences of any length, including lengths not seen during training.
The compact recurrence on the left unfolds into a deep feedforward graph on the right. Same weights at every step.
The simplest RNN computes the hidden state as a linear transformation of the previous hidden state and the current input, followed by a nonlinearity (typically tanh):
Three weight matrices define the entire network: Wxh maps input to hidden state, Whh maps previous hidden state to current hidden state, and Why maps hidden state to output. The same three matrices are reused at every time step.
The hidden state size is the main capacity control. Too small and the network cannot remember enough. Too large and it overfits and is expensive to compute. Typical sizes range from 128 to 1024 dimensions.
Watch the hidden state evolve as each input arrives. The hidden state color encodes its values.
When you unfold an RNN over T time steps, the gradient must flow backward through T applications of the weight matrix Whh. The gradient at time step 1 depends on the product WhhT. This product either explodes or vanishes exponentially with T.
If the largest eigenvalue of Whh is greater than 1, the product explodes. If it is less than 1, it vanishes to zero. In practice, this means vanilla RNNs cannot learn long-range dependencies — information from 50 time steps ago has effectively zero gradient signal.
Gradient clipping handles the exploding case: if ||g|| > threshold, rescale g ← threshold · g / ||g||. This prevents catastrophic parameter updates. But clipping does not help the vanishing case — you cannot amplify a signal that is already zero. That requires a fundamentally different architecture: the LSTM.
Watch the gradient magnitude decay as it flows backward through time. Longer sequences have weaker gradients at early steps.
The Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) solves the vanishing gradient problem with one elegant idea: a cell state ct that flows through time with additive updates instead of multiplicative ones.
The LSTM has three gates that control information flow:
Forget gate ft decides what to erase from the cell state. When processing a new subject in a sentence, it erases the old subject. Input gate it decides what new information to write. Output gate ot decides what part of the cell state to expose as the hidden state output.
Watch information flow through the LSTM gates. The cell state (top line) flows with minimal interference. Gates control what enters and exits.
The Gated Recurrent Unit (Cho et al., 2014) is a simplified alternative to the LSTM. It merges the cell state and hidden state into a single vector and uses only two gates instead of three:
The update gate zt is like a combined forget-and-input gate: it interpolates between the old hidden state and the new candidate. When zt ≈ 0, the hidden state passes through unchanged (like LSTM with forget gate ≈ 1). When zt ≈ 1, the old state is fully replaced.
The reset gate rt controls how much of the previous hidden state is used to compute the candidate. When rt ≈ 0, the candidate ignores the previous state entirely, effectively "resetting" the memory. This is useful when the current input marks a clear break from previous context.
A standard RNN reads the sequence left-to-right. But for many tasks, future context is just as important as past context. Consider the sentence: "He said the bank of the river was muddy." Understanding that "bank" means "riverbank" (not a financial institution) requires seeing "river" which comes after "bank."
A bidirectional RNN runs two separate RNNs: one reading left-to-right (forward), one reading right-to-left (backward). At each position, the outputs of both are concatenated:
Bidirectional LSTMs (BiLSTMs) were the dominant architecture for NLP tasks like named entity recognition, part-of-speech tagging, and sentiment analysis before transformers. BERT (2018) can be seen as a bidirectional transformer encoder — the same idea applied to attention instead of recurrence.
Deep RNNs stack multiple recurrent layers. The hidden state of layer l at time t feeds into layer l+1 at the same time step. Typical depth is 2-4 layers. Deeper than 4 layers rarely helps for RNNs (unlike CNNs), partly because the unfolded network is already very deep in the time dimension.
Many tasks map a variable-length input to a variable-length output: translation, summarization, conversation. The encoder-decoder (or sequence-to-sequence) architecture handles this with two RNNs.
The encoder reads the entire input sequence and compresses it into a fixed-size context vector c (usually the final hidden state). The decoder then generates the output sequence one element at a time, conditioned on c and its own previous outputs.
Teacher forcing is the standard training technique: during training, feed the ground-truth previous token yt-1 to the decoder, not its own prediction. This prevents error accumulation but creates a train-test mismatch (at test time, the model must use its own predictions). Scheduled sampling gradually transitions from teacher forcing to self-generated inputs during training.
The encoder reads the input and produces a context vector. The decoder generates the output sequence from the context. The bottleneck is visible as the single point where information must squeeze through.
Test how well different architectures remember information across time. A signal appears at the start of the sequence; the network must reproduce it at the end. Vanilla RNNs forget; LSTMs remember.
A "memory signal" is injected at step 1. Can the network recall it at the last step? The gradient magnitude shows why LSTMs succeed.
RNNs pioneered sequence modeling and many of their ideas live on in modern architectures:
| Concept | Where It Appears |
|---|---|
| Recurrence / hidden state | State Space Models (Mamba, S4) bring back recurrence with linear complexity. Transformers replaced RNNs for most tasks but recurrence is returning. |
| Gating (LSTM/GRU) | Gating mechanisms appear everywhere: GLU in transformers, SE-Net squeeze-excitation, gated convolutions in WaveNet. |
| Vanishing gradients | Solved by residual connections (Ch 9, ResNet), skip connections, and normalization. Same problem, same solution strategy. |
| Encoder-decoder | The Transformer itself is an encoder-decoder. T5, BART, and all seq2seq models. Also VAEs (Ch 7) encode then decode. |
| Attention | Born from RNN encoder-decoder limitations. Scaled to self-attention in Transformers. Now the dominant paradigm for sequences. |
| Teacher forcing | Standard for training all autoregressive models including GPT. RLHF is partly a remedy for train-test mismatch. |
| Bidirectional processing | BERT = bidirectional transformer encoder. BiLSTMs → masked self-attention. Same idea, different mechanism. |
Up next: Chapter 11: Practical Methodology — how to actually build, debug, and tune deep learning systems in the real world.