← Gleams
Stanford CS 231n · Lecture 7 · Recurrent Neural Networks

Processing Sequences with Recurrent Networks

Feedforward networks see one snapshot at a time. RNNs remember. They carry a hidden state forward through time, letting them process language, video, and anything that unfolds step by step.

Sequential data Vanishing gradients LSTM & GRU gates Character-level LM
Roadmap

What You'll Master

Chapter 01

Why Sequences Break Feedforward Nets

Every neural network you've seen so far takes a fixed-size input and produces a fixed-size output. A convolutional net takes a 224×224 image and outputs a class label. A fully-connected net takes a feature vector and outputs a number. But what happens when your input is a sequence — a sentence, a video, a melody?

A sentence can be 5 words or 50. A video can be 30 frames or 300. You can't just pad everything to the maximum length and hope for the best — that wastes memory, blurs the boundary between content and padding, and ignores the temporal structure entirely.

Definition
Sequential Data

Data where the order matters. "The cat sat on the mat" and "mat the on sat cat the" contain the same words but carry entirely different meanings. The information lives not just in the elements, but in their arrangement over time.

The Five Patterns of Sequence Processing

Not every problem is "sequence in, sequence out." The CS 231n lecture identifies five fundamental patterns:

PatternInputOutputExample
One-to-OneFixedFixedImage classification (not recurrent)
One-to-ManyFixedSequenceImage captioning: one image → many words
Many-to-OneSequenceFixedSentiment analysis: many words → one label
Many-to-Many (sync)SequenceSequenceVideo frame labeling: one output per frame
Many-to-Many (async)SequenceSequenceMachine translation: read all, then generate

Feedforward nets can only handle one-to-one. For everything else, we need a network that can carry information across time steps. That's the recurrent neural network.

The Core Idea

A recurrent network processes one element at a time, maintaining a hidden state that summarizes everything it has seen so far. Think of it as reading a book one word at a time while keeping notes. Your "notes" (the hidden state) evolve with each word, carrying context forward.

Worked Example — Why Order Matters

Consider classifying movie reviews. The sentence "Not bad at all" contains the word "bad" but is positive. A bag-of-words model (feedforward) might focus on "bad" and predict negative. An RNN reads "Not" first, updates its hidden state to expect a reversal, then reads "bad" and correctly interprets the negation. The hidden state carries the context of "Not" forward to modify the meaning of "bad."

Why Not Just Flatten?

You could concatenate all time steps into one giant vector and feed it to an MLP. Problems: (1) the input dimension changes with sequence length, (2) the number of parameters explodes (a separate weight for every position), and (3) the network can't generalize a pattern at position 5 to position 50. RNNs solve all three by sharing weights across time.

Chapter 02

The Vanilla RNN

An RNN is a neural network with a loop. At each time step t, it takes an input xt, combines it with the previous hidden state ht−1, and produces a new hidden state ht. The same weights are used at every step — the network is like a factory that processes one item at a time using the same machinery.

RNN Hidden State Update ht = tanh(Whh · ht−1 + Wxh · xt + bh)

Where ht is the hidden state at time t (a vector, say of dimension 128), Whh is the hidden-to-hidden weight matrix (how the previous state transforms), Wxh is the input-to-hidden weight matrix (how the current input influences the state), and bh is the bias. The tanh nonlinearity squashes the result to [−1, +1].

RNN Output yt = Why · ht + by

The output yt is a simple linear projection of the hidden state. If you're classifying, you might take the final hT and project it to class logits. If you need an output at every step, project every ht.

Definition
Weight Sharing

The matrices Whh, Wxh, and Why are the same at every time step. The network learns one set of transition rules that apply universally. This means it can handle sequences of any length with a fixed number of parameters, and patterns learned at position 3 automatically generalize to position 300.

Unrolling Through Time

It helps to "unroll" the loop: draw a separate copy of the RNN cell for each time step. This reveals the computational graph as a chain of repeated transformations:

RNN Unrolling Through Time
Watch the hidden state propagate through time steps. Click "Step" to advance, "Reset" to start over. The same weights W are applied at every step.
Step: 0

In the unrolled view, you can see that the hidden state h0 flows through every cell. Information from x1 can influence y5 — but only by passing through four hidden states. This chain is both the RNN's power (memory) and its weakness (we'll see why in Chapter 4).

Worked Example — Detecting Repeated 1s

Input sequence: [0, 1, 0, 1, 1, 1, 0]. Task: output 1 whenever the current input is 1 and the previous input was also 1.

The hidden state needs to remember the previous input. Let h be a 3D vector [current, previous, bias]. We can set Wxh to copy x into "current," Whh to copy "current" into "previous," and Why to compute max(current + previous − 1, 0). The RNN outputs: [0, 0, 0, 0, 1, 1, 0]. This is exactly Karpathy's example from the lecture — a hand-crafted RNN that solves a sequential pattern matching task.

The Hidden State Is Memory

Think of ht as a summary of everything the network has seen up to time t. It's not a perfect recording — it's a compressed, learned representation. What gets remembered depends on the weights, which are learned from data. A well-trained RNN keeps the important information and discards the noise.

Chapter 03

Training RNNs: Backpropagation Through Time

Once unrolled, an RNN is just a very deep feedforward network with shared weights. Training it means computing gradients of the loss with respect to those shared weights. The algorithm is called Backpropagation Through Time (BPTT) — standard backprop applied to the unrolled graph.

The Computational Graph

For a many-to-many task with loss at every step:

BPTT — Full Sequence
  1. Forward pass: Unroll the RNN for all T time steps. Compute h1, h2, ..., hT and outputs y1, ..., yT.
  2. Compute loss: L = L1 + L2 + ... + LT, summing the per-step losses.
  3. Backward pass: Backpropagate gradients from L through the entire unrolled graph, back to h0.
  4. Accumulate gradients: Since W is shared, ∂L/∂W = ∑t=1T ∂L/∂W at step t. Sum contributions from every time step.
  5. Update: W ← W − α · ∂L/∂W.
Worked Example — Gradient Accumulation

Suppose T = 3, and the loss at each step depends on W through ht. Then ∂L/∂W = ∂L1/∂W + ∂L2/∂W + ∂L3/∂W. Each term ∂Lt/∂W involves backpropagating through all steps from t down to 1 (because ht depends on ht−1, which depends on ht−2, etc.). The total gradient is the sum of all these contributions.

The Problem: Long Sequences

For a document with 10,000 words, BPTT requires storing all 10,000 hidden states (for the backward pass) and backpropagating through a 10,000-layer deep network. This is both memory-prohibitive and numerically unstable.

Definition
Truncated BPTT

Instead of backpropagating through the entire sequence, break it into chunks of length k (typically 20–100). Run the forward pass through the chunk, compute and backpropagate the loss, update weights, then start the next chunk — carrying the hidden state forward but truncating the gradient computation at chunk boundaries.

Truncated BPTT Forward: h flows through the full sequence (never reset)
Backward: gradients flow back at most k steps

The hidden state carries long-range information, but the gradient can only "see" k steps of history.
The Trade-off

Truncated BPTT makes training feasible but introduces a bias: the network can carry information over arbitrary distances in the forward pass (through h), but it can't learn to use information beyond k steps away (because gradients are truncated). Long-range dependencies exist in the representation but are invisible to the optimizer.

Many-to-One with Truncated BPTT

If you only have a loss at the final time step (e.g., sentiment classification), truncated BPTT still works: you compute the loss at the end and backpropagate through the last k steps. The hidden state at each chunk boundary still carries upstream gradients from its loss contribution, even though earlier chunks don't directly feel the final loss.

Chapter 04

The Gradient Problem

Here is the fundamental obstacle that plagued RNNs for two decades. When you backpropagate from ht to ht−1, the chain rule says:

Gradient Through One Step ∂ht / ∂ht−1 = diag(1 − tanh2(zt)) · WhhT

where zt = Whh ht−1 + Wxh xt + b

To backpropagate from step T all the way to step 1, you multiply this matrix T−1 times. That's the crux of the problem.

Gradient Over Many Steps ∂hT / ∂h1 = ∏t=2T diag(1 − tanh2(zt)) · WhhT

Why Gradients Vanish

The tanh derivative, 1 − tanh2(z), is always between 0 and 1. When you multiply by WhhT and the result has singular values less than 1, each multiplication shrinks the gradient. After 50 steps, you're multiplying 50 numbers less than 1 together: the gradient is essentially zero.

Why Gradients Explode

If the largest singular value of Whh is greater than 1, the opposite happens: each multiplication grows the gradient. After 50 steps, you have a number so large it causes NaN or wildly unstable updates.

Vanishing & Exploding Gradients
Adjust sequence length and Whh singular value to see how gradient magnitude changes across time steps. The gradient at step 1 is what the network uses to learn long-range dependencies.
15 0.90
Worked Example — Vanishing in Numbers

Suppose the effective per-step gradient multiplier is 0.7 (tanh derivative × W magnitude). After 10 steps: 0.710 = 0.028 — the gradient is 2.8% of its original size. After 20 steps: 0.720 = 0.0008 — less than 0.1%. After 50 steps: 0.750 ≈ 1.8 × 10−8. The network literally cannot learn that a word 50 positions ago matters.

Worked Example — Exploding in Numbers

Now suppose the multiplier is 1.3. After 10 steps: 1.310 = 13.8. After 20 steps: 1.320 = 190. After 50 steps: 1.350 ≈ 497,929. Your gradient update is half a million times too large. A single step could destroy all learned weights.

Gradient Clipping: Fixing Explosions

Exploding gradients have a simple fix: gradient clipping. If the gradient norm exceeds a threshold, scale it down to that threshold while preserving direction.

Gradient Clipping if ||g|| > threshold:
  g ← g × threshold / ||g||
Clipping Doesn't Fix Vanishing

Gradient clipping prevents explosions but does nothing for vanishing gradients. A gradient of 10−8 is too small to learn from, and clipping can't make it larger. To fix vanishing gradients, we need a fundamentally different architecture. That's the LSTM.

The Deep Connection

Vanishing gradients in RNNs are the same problem as vanishing gradients in deep feedforward networks (like a 100-layer MLP). ResNets solved it for feedforward nets using skip connections — a shortcut that lets gradients bypass layers. LSTMs solve it for RNNs using a remarkably similar idea: an additive path through the cell state that lets gradients flow unimpeded.

Chapter 05

LSTM — Gated Memory

Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM) in 1997, specifically to solve the vanishing gradient problem. The key insight: give the network an explicit memory channel — the cell state ct — that information can flow through with minimal transformation. Then use learned gates to control what enters, what leaves, and what gets erased.

Definition
LSTM Cell

An RNN variant with two state vectors: the hidden state ht (used for output) and the cell state ct (long-term memory). Three sigmoid gates — forget, input, and output — control information flow. The cell state acts as a highway for gradients.

The Four Gates

All four computations begin the same way: concatenate ht−1 and xt, multiply by a weight matrix, add a bias, then apply an activation. The gates are:

Gate Computations ft = σ(Wf · [ht−1, xt] + bf)    ← Forget gate (what to erase)
it = σ(Wi · [ht−1, xt] + bi)    ← Input gate (what to write)
ot = σ(Wo · [ht−1, xt] + bo)    ← Output gate (what to reveal)
gt = tanh(Wg · [ht−1, xt] + bg)    ← Candidate values (content to write)

The sigmoid σ gates output values in [0, 1] — think of them as dials controlling "how much." The tanh gate gt produces the actual content, values in [−1, +1].

Cell State Update

Cell State Update ct = ft ⊙ ct−1 + it ⊙ gt

⊙ denotes element-wise multiplication

Read this as: "Take the old cell state, forget some of it (multiply by ft, which is 0 for things to erase and 1 for things to keep), then add new information (it ⊙ gt, which is the input gate controlling how much of the candidate gets written)."

Hidden State Output

Hidden State ht = ot ⊙ tanh(ct)

The hidden state is a filtered version of the cell state. The output gate decides which parts of the memory to expose to the outside world. The tanh squashes ct to [−1, +1] first.

LSTM Gates — Information Flow
Watch information flow through the LSTM cell. Click "Step" to advance through time steps. Observe how the forget gate erases, the input gate writes, and the cell state accumulates memory.
Step: 0

Why LSTM Solves Vanishing Gradients

The critical equation is ct = ft ⊙ ct−1 + it ⊙ gt. Backpropagating through this:

Gradient Through Cell State ∂ct / ∂ct−1 = ft

Element-wise multiplication by forget gate. NO matrix multiply by W.
Derivation — Why This Changes Everything

In a vanilla RNN, the gradient flows through ∂ht/∂ht−1 = diag(tanh') · WhhT — a matrix multiplication at every step, which causes exponential growth or decay.

In an LSTM, the gradient through the cell state is just ∂ct/∂ct−1 = diag(ft). This is a diagonal matrix — element-wise scaling, not a matrix multiply. If ft = 1 (forget nothing), the gradient passes through unchanged. No shrinking, no exploding.

Over T steps: ∂cT/∂c1 = ∏t=2T ft. If the network learns to keep ft close to 1 for important information, gradients flow over hundreds of steps with minimal attenuation. This is the additive gradient highway — directly analogous to ResNet's skip connections.

The ResNet Connection

ResNet: y = x + F(x). The gradient of y with respect to x is 1 + F'(x). The 1 is the highway. LSTM cell state: ct = ft ⊙ ct−1 + it ⊙ gt. The gradient of ct with respect to ct−1 is ft — close to 1 when the network wants to remember. Same principle: additive paths preserve gradients. This connection was noticed retrospectively; Highway Networks (Srivastava et al., 2015) explicitly linked the two ideas.

Worked Example — LSTM Remembering Over 50 Steps

Suppose at step 1, the network encounters the word "not" and writes −1 into cell dimension 7 (a negation flag). For the next 49 steps, ft[7] = 1 and it[7] = 0 for all t (the gates learn to preserve this flag and not overwrite it). At step 50, the network reads "bad" and uses c50[7] = −1 to correctly predict positive sentiment.

The gradient for this: ∂c50[7]/∂c1[7] = ∏t=250 ft[7] = 149 = 1. Perfect gradient flow. A vanilla RNN would have gradient ~0 after 50 steps.

LSTM Doesn't Guarantee No Vanishing

If the forget gate learns to be small (f ≈ 0), gradients still vanish through the cell state. LSTM provides the capacity to preserve long-range gradients, but it doesn't guarantee the network will learn to use that capacity. In practice, LSTMs handle sequences of hundreds of steps well, but struggle beyond ~1000 steps.

Practical Detail: Forget Gate Bias

A crucial implementation trick: initialize the forget gate bias large (e.g., bf = 1 or 2). This makes ft ≈ 1 at the start of training, so the cell state preserves everything by default. The network then learns what to forget, rather than what to remember. Without this, LSTMs often fail to learn long-range dependencies because they erase information before discovering it matters.

Chapter 06

GRU — The Streamlined Alternative

The Gated Recurrent Unit (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single vector, and reducing three gates to two. It often matches LSTM performance with fewer parameters.

GRU Equations zt = σ(Wz · [ht−1, xt] + bz)    ← Update gate
rt = σ(Wr · [ht−1, xt] + br)    ← Reset gate
t = tanh(Wh · [rt ⊙ ht−1, xt] + bh)    ← Candidate
ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t    ← Final state

How GRU Maps to LSTM

The update gate zt plays a dual role: it's both the forget gate and the input gate. When zt = 0, the hidden state is copied forward unchanged (like f = 1, i = 0 in LSTM). When zt = 1, the hidden state is completely replaced by the candidate (like f = 0, i = 1). The constraint z + (1−z) = 1 means the GRU always does a convex combination — it can't simultaneously forget and add at full strength.

The reset gate rt controls how much of the previous hidden state to use when computing the candidate. When rt = 0, the candidate ignores history entirely, making the unit act like a plain feedforward layer on xt alone.

Worked Example — GRU as Interpolation

Suppose zt = 0.3 for all dimensions. Then ht = 0.7 · ht−1 + 0.3 · h̃t. The network keeps 70% of the old state and blends in 30% new. For a dimension where z = 0, the state is copied forward perfectly — unlimited memory for that dimension.

FeatureLSTMGRU
State vectors2 (ht and ct)1 (ht)
Gates3 (σ) + 1 (tanh) = 42 (σ) + 1 (tanh) = 3
Parameters4 × (n2 + nm + n)3 × (n2 + nm + n)
Memory highwaySeparate cell state ctInterpolation of ht
Forget ⊥ inputIndependent (f and i are separate)Coupled (z and 1−z)
PerformanceSlight edge on very long sequencesCompetitive, faster training
Which One to Choose?

In practice, LSTM and GRU perform similarly on most tasks. GRU trains faster (fewer parameters) and is easier to tune. LSTM has more expressive power because forget and input are independent. Rule of thumb: start with GRU for speed; switch to LSTM if you need to model very long-range dependencies where the decoupled gates help.

GRU Gradient Flow

∂ht/∂ht−1 from ht = (1−zt) ⊙ ht−1 + zt ⊙ h̃t gives a term (1−zt) on the direct path. When zt ≈ 0 (the update gate is closed), this term is ≈ 1, and the gradient flows through unimpeded — the same highway principle as LSTM's cell state. The difference: GRU's highway goes through the hidden state directly, while LSTM separates it into a dedicated cell state channel.

Chapter 07

Character-Level Language Modeling

The most intuitive application of RNNs: predict the next character in a text. Given "hell", predict "o". Given "hello", predict " " (space). This is a many-to-many task: at every time step, the network reads one character and predicts the next.

The Setup

Define a vocabulary of all possible characters. For English text, this might be 65 characters: a–z, A–Z, 0–9, space, and common punctuation. Each character is represented as a one-hot vector — a vector of length 65 with a 1 at the character's index and 0 everywhere else.

Worked Example — Karpathy's "hello"

Vocabulary: [h, e, l, o]. Indices: h=0, e=1, l=2, o=3.

Training sequence: "hello". Input → target pairs:

"h" → "e", "e" → "l", "l" → "l", "l" → "o"

At each step, the RNN reads one character, updates its hidden state, and outputs a probability distribution over the next character. The loss is cross-entropy between the prediction and the true next character.

Embedding Layer

Multiplying a weight matrix Wembed by a one-hot vector just extracts one column of W. This is called an embedding lookup — each character gets its own learned vector. Instead of spending computation on a matrix multiply, implementations just index directly into the embedding table.

Embedding Lookup Wembed · one_hot(k) = Wembed[:, k]   (just column k)

Sampling — Generating Text

At test time, you feed the model a seed character, sample from the output distribution to get the next character, feed that back as input, and repeat. The model generates character by character, using its own predictions as input.

Definition
Temperature

Before applying softmax, divide the logits by a temperature parameter T. High T (> 1) flattens the distribution, making all characters roughly equally likely (more random, more creative). Low T (< 1) sharpens the distribution, concentrating probability on the top prediction (more deterministic, more repetitive). T = 1 is the default (model's natural confidence).

Temperature Scaling p(chari) = exp(zi / T) / ∑j exp(zj / T)
Worked Example — Temperature in Action

Logits: [2.0, 1.0, 0.5, 0.1] for characters [a, b, c, d].

T = 1.0: softmax → [0.47, 0.17, 0.10, 0.07]. Model picks "a" about half the time.

T = 0.5: divide logits by 0.5 → [4.0, 2.0, 1.0, 0.2]. softmax → [0.82, 0.11, 0.04, 0.02]. Model almost always picks "a." Very repetitive.

T = 2.0: divide by 2 → [1.0, 0.5, 0.25, 0.05]. softmax → [0.35, 0.21, 0.17, 0.13]. More uniform. Random, creative, sometimes nonsensical.

What Char-RNNs Learn

Karpathy's famous 2015 blog post trained a char-RNN on various corpora with remarkable results:

Emergent Structure

Trained on Shakespeare: the model learns to open and close quotes, use iambic meter patterns, and create plausible (if meaningless) dialogue. Trained on Linux source code: it learns to balance braces, indent properly, write syntactically plausible C functions, and even produce fake comments. Trained on LaTeX: it generates valid (but nonsensical) math papers with proper equation environments. The hidden states learn interpretable features: Karpathy found individual neurons that tracked quote depth, line length, and if-statement nesting.

Chapter 08

Image Captioning — CNN + RNN

Image captioning is a one-to-many task: one image in, a sequence of words out. The architecture is beautifully simple — use a CNN to encode the image, then use an RNN to decode a caption word by word.

The Show-and-Tell Architecture

CNN Encoder + RNN Decoder
  1. Encode: Pass the image through a pretrained CNN (e.g., VGG or ResNet). Remove the final classification layer. The output is a feature vector v (e.g., 4096-dimensional from VGG's fc7).
  2. Initialize: Project v to the RNN's hidden dimension and use it as h0. Alternatively, add a term Wih · v to the hidden state update at every time step.
  3. Decode: Feed a <START> token as x0. The RNN predicts the first word. Feed that word as x1. The RNN predicts the second word. Continue until the model predicts an <END> token.
Modified RNN for Captioning ht = tanh(Wxh · xt + Whh · ht−1 + Wih · v + b)

v = CNN(image) is added at every time step, keeping the image "in view"

During training, the RNN is fed the ground-truth word at each step (teacher forcing), and the loss is cross-entropy between the predicted word distribution and the true next word. At test time, the model uses its own predictions — just like the character-level LM.

Worked Example — Captioning a Beach Image

Image: a photo of two people walking on a beach with surfboards.

Step 0: CNN encodes the image → feature vector v. Feed <START>.

Step 1: RNN predicts "Two" (conditioned on v + <START>).

Step 2: Feed "Two" → predict "people".

Step 3: Feed "people" → predict "walking".

Step 4: "walking" → "on".

Step 5: "on" → "the".

Step 6: "the" → "beach".

Step 7: "beach" → "with".

Step 8: "with" → "surfboards".

Step 9: "surfboards" → <END>.

Output: "Two people walking on the beach with surfboards."

Word Embeddings

Just like character-level models use learned embeddings, word-level models embed each word as a dense vector (typically 256–512 dimensions). These embeddings capture semantic relationships: "king" and "queen" have similar embeddings, "king" − "man" + "woman" ≈ "queen." The embedding matrix is learned jointly with the rest of the model.

Beyond Captioning

The CNN-encoder + RNN-decoder pattern generalizes far beyond captioning:

TaskEncoderDecoder
Image CaptioningCNN (image)RNN (words)
Video CaptioningCNN per frame + RNN (frames)RNN (words)
VQA (Visual QA)CNN (image) + RNN (question)MLP (answer)
Machine TranslationRNN (source language)RNN (target language)
Visual NavigationCNN (view) + RNN (instructions)RNN (actions)
The Bottleneck

The entire image is compressed into a single vector v. For complex scenes, this bottleneck loses important spatial details. Attention mechanisms (the subject of CS 231n Lecture 8) solve this by letting the RNN "look at" different parts of the image at each decoding step. This leads directly to the Transformer architecture.

Chapter 09

Showcase: Character-Level RNN

Here's a live character-level RNN running in your browser. Type a seed text, adjust the temperature, and watch the network generate characters one at a time. The hidden state evolves at each step, and the probability distribution over the next character is displayed in real time.

This is a small model with a vocabulary of lowercase letters, space, and basic punctuation, pretrained on pattern-like text. The hidden state visualization shows how the network's internal representation changes as it generates each character.

Character-Level RNN Generator
Type seed text and click "Generate" to produce characters. Adjust temperature to control randomness. Watch the hidden state and output probabilities update in real time.
1.0
What to Try

Low temperature (0.3): The model becomes very confident and repetitive. It tends to generate the most common patterns over and over. High temperature (2.5): The output becomes nearly random — a mix of characters with little structure. Medium temperature (0.8–1.2): The sweet spot. Structured enough to form word-like patterns, random enough to be creative.

What You're Seeing

The top bar shows the generated text growing character by character. The middle section shows the hidden state as a heatmap — each row is a dimension of the hidden vector, colored by its activation (blue negative, yellow positive). The bottom section shows the probability distribution over the next character. The highlighted bar is the character that was sampled.

Chapter 10

Summary & Connections

You now understand the complete story of recurrent neural networks — from the basic idea of carrying state through time, to the gradient pathologies that crippled vanilla RNNs, to the gated architectures that solved them.

The Core Chain

The Logic in One Paragraph

Sequences require memory across time steps (vanilla RNN). Training via BPTT unrolls the network and backpropagates through time, but repeated matrix multiplication causes gradients to vanish or explode (the gradient problem). Gradient clipping fixes explosions but not vanishing. LSTM introduces a cell state with an additive update path: ct = f ⊙ ct−1 + i ⊙ g. This lets gradients flow through multiplication by f instead of W — an additive highway that preserves long-range information. GRU simplifies this to a single-state interpolation. Both architectures power language modeling, image captioning, translation, and every sequence task before transformers.

Key Equations

Vanilla RNN ht = tanh(Whh ht−1 + Wxh xt + b)
LSTM Cell State ct = ft ⊙ ct−1 + it ⊙ gt
ht = ot ⊙ tanh(ct)
GRU ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t

RNN Variants Compared

ArchitectureYearStatesGatesKey Property
Vanilla RNN1990h0Simple but gradients vanish
LSTM1997h, c3+1Cell state highway for gradients
GRU2014h2+1Simpler, competitive performance
Transformer2017Parallel, attention-based, dominant

RNN Advantages and Disadvantages

AdvantagesDisadvantages
Can process any length inputRecurrent computation is inherently sequential (slow)
Model size doesn't grow with input lengthDifficult to access information from many steps back
Same weights at every step (symmetry)Cannot parallelize across time steps (unlike Transformer)
Computation uses info from many steps back (in theory)Vanishing gradients limit practical memory span

What Comes Next

Attention Mechanisms (CS 231n Lecture 8): Instead of compressing all history into a fixed-size hidden state, let the decoder "attend" to any part of the input at each step. This eliminates the bottleneck and allows the model to look up relevant information directly.

Transformers: Replace recurrence entirely with self-attention. Every position attends to every other position in parallel. No sequential computation, no vanishing gradients (attention is a weighted sum, not a product chain). This is why transformers dominate modern NLP and vision.

State Space Models (Mamba, etc.): Modern architectures that bring back the idea of a hidden state but with linear-time computation and selective gating. They combine the efficiency of RNNs (linear in sequence length) with the ability to handle long-range dependencies. An active area of research as of 2025.

References

YearPaperContribution
1990Elman, "Finding Structure in Time"The vanilla RNN (Elman network)
1994Bengio et al., "Learning Long-Term Dependencies with Gradient Descent is Difficult"Formalized vanishing gradient problem
1997Hochreiter & Schmidhuber, "Long Short-Term Memory"LSTM architecture
2014Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder"GRU architecture
2015Vinyals et al., "Show and Tell"CNN + LSTM image captioning
2015Karpathy, "The Unreasonable Effectiveness of RNNs"Char-RNN blog post & min-char-rnn.py
2016Karpathy, Johnson, & Fei-Fei, "Visualizing and Understanding Recurrent Networks"Interpretable RNN cells
The One Sentence

RNNs carry memory through time via a hidden state, but vanilla gradients vanish exponentially — LSTM's gated cell state provides an additive highway that preserves gradients the same way ResNet's skip connections preserve them through depth.