Feedforward networks see one snapshot at a time. RNNs remember. They carry a hidden state forward through time, letting them process language, video, and anything that unfolds step by step.
Every neural network you've seen so far takes a fixed-size input and produces a fixed-size output. A convolutional net takes a 224×224 image and outputs a class label. A fully-connected net takes a feature vector and outputs a number. But what happens when your input is a sequence — a sentence, a video, a melody?
A sentence can be 5 words or 50. A video can be 30 frames or 300. You can't just pad everything to the maximum length and hope for the best — that wastes memory, blurs the boundary between content and padding, and ignores the temporal structure entirely.
Data where the order matters. "The cat sat on the mat" and "mat the on sat cat the" contain the same words but carry entirely different meanings. The information lives not just in the elements, but in their arrangement over time.
Not every problem is "sequence in, sequence out." The CS 231n lecture identifies five fundamental patterns:
| Pattern | Input | Output | Example |
|---|---|---|---|
| One-to-One | Fixed | Fixed | Image classification (not recurrent) |
| One-to-Many | Fixed | Sequence | Image captioning: one image → many words |
| Many-to-One | Sequence | Fixed | Sentiment analysis: many words → one label |
| Many-to-Many (sync) | Sequence | Sequence | Video frame labeling: one output per frame |
| Many-to-Many (async) | Sequence | Sequence | Machine translation: read all, then generate |
Feedforward nets can only handle one-to-one. For everything else, we need a network that can carry information across time steps. That's the recurrent neural network.
A recurrent network processes one element at a time, maintaining a hidden state that summarizes everything it has seen so far. Think of it as reading a book one word at a time while keeping notes. Your "notes" (the hidden state) evolve with each word, carrying context forward.
Consider classifying movie reviews. The sentence "Not bad at all" contains the word "bad" but is positive. A bag-of-words model (feedforward) might focus on "bad" and predict negative. An RNN reads "Not" first, updates its hidden state to expect a reversal, then reads "bad" and correctly interprets the negation. The hidden state carries the context of "Not" forward to modify the meaning of "bad."
You could concatenate all time steps into one giant vector and feed it to an MLP. Problems: (1) the input dimension changes with sequence length, (2) the number of parameters explodes (a separate weight for every position), and (3) the network can't generalize a pattern at position 5 to position 50. RNNs solve all three by sharing weights across time.
An RNN is a neural network with a loop. At each time step t, it takes an input xt, combines it with the previous hidden state ht−1, and produces a new hidden state ht. The same weights are used at every step — the network is like a factory that processes one item at a time using the same machinery.
Where ht is the hidden state at time t (a vector, say of dimension 128), Whh is the hidden-to-hidden weight matrix (how the previous state transforms), Wxh is the input-to-hidden weight matrix (how the current input influences the state), and bh is the bias. The tanh nonlinearity squashes the result to [−1, +1].
The output yt is a simple linear projection of the hidden state. If you're classifying, you might take the final hT and project it to class logits. If you need an output at every step, project every ht.
The matrices Whh, Wxh, and Why are the same at every time step. The network learns one set of transition rules that apply universally. This means it can handle sequences of any length with a fixed number of parameters, and patterns learned at position 3 automatically generalize to position 300.
It helps to "unroll" the loop: draw a separate copy of the RNN cell for each time step. This reveals the computational graph as a chain of repeated transformations:
In the unrolled view, you can see that the hidden state h0 flows through every cell. Information from x1 can influence y5 — but only by passing through four hidden states. This chain is both the RNN's power (memory) and its weakness (we'll see why in Chapter 4).
Input sequence: [0, 1, 0, 1, 1, 1, 0]. Task: output 1 whenever the current input is 1 and the previous input was also 1.
The hidden state needs to remember the previous input. Let h be a 3D vector [current, previous, bias]. We can set Wxh to copy x into "current," Whh to copy "current" into "previous," and Why to compute max(current + previous − 1, 0). The RNN outputs: [0, 0, 0, 0, 1, 1, 0]. This is exactly Karpathy's example from the lecture — a hand-crafted RNN that solves a sequential pattern matching task.
Think of ht as a summary of everything the network has seen up to time t. It's not a perfect recording — it's a compressed, learned representation. What gets remembered depends on the weights, which are learned from data. A well-trained RNN keeps the important information and discards the noise.
Once unrolled, an RNN is just a very deep feedforward network with shared weights. Training it means computing gradients of the loss with respect to those shared weights. The algorithm is called Backpropagation Through Time (BPTT) — standard backprop applied to the unrolled graph.
For a many-to-many task with loss at every step:
Suppose T = 3, and the loss at each step depends on W through ht. Then ∂L/∂W = ∂L1/∂W + ∂L2/∂W + ∂L3/∂W. Each term ∂Lt/∂W involves backpropagating through all steps from t down to 1 (because ht depends on ht−1, which depends on ht−2, etc.). The total gradient is the sum of all these contributions.
For a document with 10,000 words, BPTT requires storing all 10,000 hidden states (for the backward pass) and backpropagating through a 10,000-layer deep network. This is both memory-prohibitive and numerically unstable.
Instead of backpropagating through the entire sequence, break it into chunks of length k (typically 20–100). Run the forward pass through the chunk, compute and backpropagate the loss, update weights, then start the next chunk — carrying the hidden state forward but truncating the gradient computation at chunk boundaries.
Truncated BPTT makes training feasible but introduces a bias: the network can carry information over arbitrary distances in the forward pass (through h), but it can't learn to use information beyond k steps away (because gradients are truncated). Long-range dependencies exist in the representation but are invisible to the optimizer.
If you only have a loss at the final time step (e.g., sentiment classification), truncated BPTT still works: you compute the loss at the end and backpropagate through the last k steps. The hidden state at each chunk boundary still carries upstream gradients from its loss contribution, even though earlier chunks don't directly feel the final loss.
Here is the fundamental obstacle that plagued RNNs for two decades. When you backpropagate from ht to ht−1, the chain rule says:
To backpropagate from step T all the way to step 1, you multiply this matrix T−1 times. That's the crux of the problem.
The tanh derivative, 1 − tanh2(z), is always between 0 and 1. When you multiply by WhhT and the result has singular values less than 1, each multiplication shrinks the gradient. After 50 steps, you're multiplying 50 numbers less than 1 together: the gradient is essentially zero.
If the largest singular value of Whh is greater than 1, the opposite happens: each multiplication grows the gradient. After 50 steps, you have a number so large it causes NaN or wildly unstable updates.
Suppose the effective per-step gradient multiplier is 0.7 (tanh derivative × W magnitude). After 10 steps: 0.710 = 0.028 — the gradient is 2.8% of its original size. After 20 steps: 0.720 = 0.0008 — less than 0.1%. After 50 steps: 0.750 ≈ 1.8 × 10−8. The network literally cannot learn that a word 50 positions ago matters.
Now suppose the multiplier is 1.3. After 10 steps: 1.310 = 13.8. After 20 steps: 1.320 = 190. After 50 steps: 1.350 ≈ 497,929. Your gradient update is half a million times too large. A single step could destroy all learned weights.
Exploding gradients have a simple fix: gradient clipping. If the gradient norm exceeds a threshold, scale it down to that threshold while preserving direction.
Gradient clipping prevents explosions but does nothing for vanishing gradients. A gradient of 10−8 is too small to learn from, and clipping can't make it larger. To fix vanishing gradients, we need a fundamentally different architecture. That's the LSTM.
Vanishing gradients in RNNs are the same problem as vanishing gradients in deep feedforward networks (like a 100-layer MLP). ResNets solved it for feedforward nets using skip connections — a shortcut that lets gradients bypass layers. LSTMs solve it for RNNs using a remarkably similar idea: an additive path through the cell state that lets gradients flow unimpeded.
Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM) in 1997, specifically to solve the vanishing gradient problem. The key insight: give the network an explicit memory channel — the cell state ct — that information can flow through with minimal transformation. Then use learned gates to control what enters, what leaves, and what gets erased.
An RNN variant with two state vectors: the hidden state ht (used for output) and the cell state ct (long-term memory). Three sigmoid gates — forget, input, and output — control information flow. The cell state acts as a highway for gradients.
All four computations begin the same way: concatenate ht−1 and xt, multiply by a weight matrix, add a bias, then apply an activation. The gates are:
The sigmoid σ gates output values in [0, 1] — think of them as dials controlling "how much." The tanh gate gt produces the actual content, values in [−1, +1].
Read this as: "Take the old cell state, forget some of it (multiply by ft, which is 0 for things to erase and 1 for things to keep), then add new information (it ⊙ gt, which is the input gate controlling how much of the candidate gets written)."
The hidden state is a filtered version of the cell state. The output gate decides which parts of the memory to expose to the outside world. The tanh squashes ct to [−1, +1] first.
The critical equation is ct = ft ⊙ ct−1 + it ⊙ gt. Backpropagating through this:
In a vanilla RNN, the gradient flows through ∂ht/∂ht−1 = diag(tanh') · WhhT — a matrix multiplication at every step, which causes exponential growth or decay.
In an LSTM, the gradient through the cell state is just ∂ct/∂ct−1 = diag(ft). This is a diagonal matrix — element-wise scaling, not a matrix multiply. If ft = 1 (forget nothing), the gradient passes through unchanged. No shrinking, no exploding.
Over T steps: ∂cT/∂c1 = ∏t=2T ft. If the network learns to keep ft close to 1 for important information, gradients flow over hundreds of steps with minimal attenuation. This is the additive gradient highway — directly analogous to ResNet's skip connections.
ResNet: y = x + F(x). The gradient of y with respect to x is 1 + F'(x). The 1 is the highway. LSTM cell state: ct = ft ⊙ ct−1 + it ⊙ gt. The gradient of ct with respect to ct−1 is ft — close to 1 when the network wants to remember. Same principle: additive paths preserve gradients. This connection was noticed retrospectively; Highway Networks (Srivastava et al., 2015) explicitly linked the two ideas.
Suppose at step 1, the network encounters the word "not" and writes −1 into cell dimension 7 (a negation flag). For the next 49 steps, ft[7] = 1 and it[7] = 0 for all t (the gates learn to preserve this flag and not overwrite it). At step 50, the network reads "bad" and uses c50[7] = −1 to correctly predict positive sentiment.
The gradient for this: ∂c50[7]/∂c1[7] = ∏t=250 ft[7] = 149 = 1. Perfect gradient flow. A vanilla RNN would have gradient ~0 after 50 steps.
If the forget gate learns to be small (f ≈ 0), gradients still vanish through the cell state. LSTM provides the capacity to preserve long-range gradients, but it doesn't guarantee the network will learn to use that capacity. In practice, LSTMs handle sequences of hundreds of steps well, but struggle beyond ~1000 steps.
A crucial implementation trick: initialize the forget gate bias large (e.g., bf = 1 or 2). This makes ft ≈ 1 at the start of training, so the cell state preserves everything by default. The network then learns what to forget, rather than what to remember. Without this, LSTMs often fail to learn long-range dependencies because they erase information before discovering it matters.
The Gated Recurrent Unit (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single vector, and reducing three gates to two. It often matches LSTM performance with fewer parameters.
The update gate zt plays a dual role: it's both the forget gate and the input gate. When zt = 0, the hidden state is copied forward unchanged (like f = 1, i = 0 in LSTM). When zt = 1, the hidden state is completely replaced by the candidate (like f = 0, i = 1). The constraint z + (1−z) = 1 means the GRU always does a convex combination — it can't simultaneously forget and add at full strength.
The reset gate rt controls how much of the previous hidden state to use when computing the candidate. When rt = 0, the candidate ignores history entirely, making the unit act like a plain feedforward layer on xt alone.
Suppose zt = 0.3 for all dimensions. Then ht = 0.7 · ht−1 + 0.3 · h̃t. The network keeps 70% of the old state and blends in 30% new. For a dimension where z = 0, the state is copied forward perfectly — unlimited memory for that dimension.
| Feature | LSTM | GRU |
|---|---|---|
| State vectors | 2 (ht and ct) | 1 (ht) |
| Gates | 3 (σ) + 1 (tanh) = 4 | 2 (σ) + 1 (tanh) = 3 |
| Parameters | 4 × (n2 + nm + n) | 3 × (n2 + nm + n) |
| Memory highway | Separate cell state ct | Interpolation of ht |
| Forget ⊥ input | Independent (f and i are separate) | Coupled (z and 1−z) |
| Performance | Slight edge on very long sequences | Competitive, faster training |
In practice, LSTM and GRU perform similarly on most tasks. GRU trains faster (fewer parameters) and is easier to tune. LSTM has more expressive power because forget and input are independent. Rule of thumb: start with GRU for speed; switch to LSTM if you need to model very long-range dependencies where the decoupled gates help.
∂ht/∂ht−1 from ht = (1−zt) ⊙ ht−1 + zt ⊙ h̃t gives a term (1−zt) on the direct path. When zt ≈ 0 (the update gate is closed), this term is ≈ 1, and the gradient flows through unimpeded — the same highway principle as LSTM's cell state. The difference: GRU's highway goes through the hidden state directly, while LSTM separates it into a dedicated cell state channel.
The most intuitive application of RNNs: predict the next character in a text. Given "hell", predict "o". Given "hello", predict " " (space). This is a many-to-many task: at every time step, the network reads one character and predicts the next.
Define a vocabulary of all possible characters. For English text, this might be 65 characters: a–z, A–Z, 0–9, space, and common punctuation. Each character is represented as a one-hot vector — a vector of length 65 with a 1 at the character's index and 0 everywhere else.
Vocabulary: [h, e, l, o]. Indices: h=0, e=1, l=2, o=3.
Training sequence: "hello". Input → target pairs:
"h" → "e", "e" → "l", "l" → "l", "l" → "o"
At each step, the RNN reads one character, updates its hidden state, and outputs a probability distribution over the next character. The loss is cross-entropy between the prediction and the true next character.
Multiplying a weight matrix Wembed by a one-hot vector just extracts one column of W. This is called an embedding lookup — each character gets its own learned vector. Instead of spending computation on a matrix multiply, implementations just index directly into the embedding table.
At test time, you feed the model a seed character, sample from the output distribution to get the next character, feed that back as input, and repeat. The model generates character by character, using its own predictions as input.
Before applying softmax, divide the logits by a temperature parameter T. High T (> 1) flattens the distribution, making all characters roughly equally likely (more random, more creative). Low T (< 1) sharpens the distribution, concentrating probability on the top prediction (more deterministic, more repetitive). T = 1 is the default (model's natural confidence).
Logits: [2.0, 1.0, 0.5, 0.1] for characters [a, b, c, d].
T = 1.0: softmax → [0.47, 0.17, 0.10, 0.07]. Model picks "a" about half the time.
T = 0.5: divide logits by 0.5 → [4.0, 2.0, 1.0, 0.2]. softmax → [0.82, 0.11, 0.04, 0.02]. Model almost always picks "a." Very repetitive.
T = 2.0: divide by 2 → [1.0, 0.5, 0.25, 0.05]. softmax → [0.35, 0.21, 0.17, 0.13]. More uniform. Random, creative, sometimes nonsensical.
Karpathy's famous 2015 blog post trained a char-RNN on various corpora with remarkable results:
Trained on Shakespeare: the model learns to open and close quotes, use iambic meter patterns, and create plausible (if meaningless) dialogue. Trained on Linux source code: it learns to balance braces, indent properly, write syntactically plausible C functions, and even produce fake comments. Trained on LaTeX: it generates valid (but nonsensical) math papers with proper equation environments. The hidden states learn interpretable features: Karpathy found individual neurons that tracked quote depth, line length, and if-statement nesting.
Image captioning is a one-to-many task: one image in, a sequence of words out. The architecture is beautifully simple — use a CNN to encode the image, then use an RNN to decode a caption word by word.
During training, the RNN is fed the ground-truth word at each step (teacher forcing), and the loss is cross-entropy between the predicted word distribution and the true next word. At test time, the model uses its own predictions — just like the character-level LM.
Image: a photo of two people walking on a beach with surfboards.
Step 0: CNN encodes the image → feature vector v. Feed <START>.
Step 1: RNN predicts "Two" (conditioned on v + <START>).
Step 2: Feed "Two" → predict "people".
Step 3: Feed "people" → predict "walking".
Step 4: "walking" → "on".
Step 5: "on" → "the".
Step 6: "the" → "beach".
Step 7: "beach" → "with".
Step 8: "with" → "surfboards".
Step 9: "surfboards" → <END>.
Output: "Two people walking on the beach with surfboards."
Just like character-level models use learned embeddings, word-level models embed each word as a dense vector (typically 256–512 dimensions). These embeddings capture semantic relationships: "king" and "queen" have similar embeddings, "king" − "man" + "woman" ≈ "queen." The embedding matrix is learned jointly with the rest of the model.
The CNN-encoder + RNN-decoder pattern generalizes far beyond captioning:
| Task | Encoder | Decoder |
|---|---|---|
| Image Captioning | CNN (image) | RNN (words) |
| Video Captioning | CNN per frame + RNN (frames) | RNN (words) |
| VQA (Visual QA) | CNN (image) + RNN (question) | MLP (answer) |
| Machine Translation | RNN (source language) | RNN (target language) |
| Visual Navigation | CNN (view) + RNN (instructions) | RNN (actions) |
The entire image is compressed into a single vector v. For complex scenes, this bottleneck loses important spatial details. Attention mechanisms (the subject of CS 231n Lecture 8) solve this by letting the RNN "look at" different parts of the image at each decoding step. This leads directly to the Transformer architecture.
Here's a live character-level RNN running in your browser. Type a seed text, adjust the temperature, and watch the network generate characters one at a time. The hidden state evolves at each step, and the probability distribution over the next character is displayed in real time.
This is a small model with a vocabulary of lowercase letters, space, and basic punctuation, pretrained on pattern-like text. The hidden state visualization shows how the network's internal representation changes as it generates each character.
Low temperature (0.3): The model becomes very confident and repetitive. It tends to generate the most common patterns over and over. High temperature (2.5): The output becomes nearly random — a mix of characters with little structure. Medium temperature (0.8–1.2): The sweet spot. Structured enough to form word-like patterns, random enough to be creative.
The top bar shows the generated text growing character by character. The middle section shows the hidden state as a heatmap — each row is a dimension of the hidden vector, colored by its activation (blue negative, yellow positive). The bottom section shows the probability distribution over the next character. The highlighted bar is the character that was sampled.
You now understand the complete story of recurrent neural networks — from the basic idea of carrying state through time, to the gradient pathologies that crippled vanilla RNNs, to the gated architectures that solved them.
Sequences require memory across time steps (vanilla RNN). Training via BPTT unrolls the network and backpropagates through time, but repeated matrix multiplication causes gradients to vanish or explode (the gradient problem). Gradient clipping fixes explosions but not vanishing. LSTM introduces a cell state with an additive update path: ct = f ⊙ ct−1 + i ⊙ g. This lets gradients flow through multiplication by f instead of W — an additive highway that preserves long-range information. GRU simplifies this to a single-state interpolation. Both architectures power language modeling, image captioning, translation, and every sequence task before transformers.
| Architecture | Year | States | Gates | Key Property |
|---|---|---|---|---|
| Vanilla RNN | 1990 | h | 0 | Simple but gradients vanish |
| LSTM | 1997 | h, c | 3+1 | Cell state highway for gradients |
| GRU | 2014 | h | 2+1 | Simpler, competitive performance |
| Transformer | 2017 | — | — | Parallel, attention-based, dominant |
| Advantages | Disadvantages |
|---|---|
| Can process any length input | Recurrent computation is inherently sequential (slow) |
| Model size doesn't grow with input length | Difficult to access information from many steps back |
| Same weights at every step (symmetry) | Cannot parallelize across time steps (unlike Transformer) |
| Computation uses info from many steps back (in theory) | Vanishing gradients limit practical memory span |
Attention Mechanisms (CS 231n Lecture 8): Instead of compressing all history into a fixed-size hidden state, let the decoder "attend" to any part of the input at each step. This eliminates the bottleneck and allows the model to look up relevant information directly.
Transformers: Replace recurrence entirely with self-attention. Every position attends to every other position in parallel. No sequential computation, no vanishing gradients (attention is a weighted sum, not a product chain). This is why transformers dominate modern NLP and vision.
State Space Models (Mamba, etc.): Modern architectures that bring back the idea of a hidden state but with linear-time computation and selective gating. They combine the efficiency of RNNs (linear in sequence length) with the ability to handle long-range dependencies. An active area of research as of 2025.
| Year | Paper | Contribution |
|---|---|---|
| 1990 | Elman, "Finding Structure in Time" | The vanilla RNN (Elman network) |
| 1994 | Bengio et al., "Learning Long-Term Dependencies with Gradient Descent is Difficult" | Formalized vanishing gradient problem |
| 1997 | Hochreiter & Schmidhuber, "Long Short-Term Memory" | LSTM architecture |
| 2014 | Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" | GRU architecture |
| 2015 | Vinyals et al., "Show and Tell" | CNN + LSTM image captioning |
| 2015 | Karpathy, "The Unreasonable Effectiveness of RNNs" | Char-RNN blog post & min-char-rnn.py |
| 2016 | Karpathy, Johnson, & Fei-Fei, "Visualizing and Understanding Recurrent Networks" | Interpretable RNN cells |
RNNs carry memory through time via a hidden state, but vanilla gradients vanish exponentially — LSTM's gated cell state provides an additive highway that preserves gradients the same way ResNet's skip connections preserve them through depth.