CS 231n — Recurrent Neural Networks: Sequences from Scratch

Roadmap

What You'll Master

01Why Sequences Break Feedforward Nets 02The Vanilla RNN 03Training: BPTT 04The Gradient Problem 05LSTM: Gated Memory 06GRU: The Streamlined Alternative 07Character-Level Language Models 08Image Captioning 09Showcase: Char-RNN Generator 10Summary & Connections

Chapter 01

Why Sequences Break Feedforward Nets

Every neural network you've seen so far takes a fixed-size input and produces a fixed-size output. A convolutional net takes a 224×224 image and outputs a class label. A fully-connected net takes a feature vector and outputs a number. But what happens when your input is a sequence — a sentence, a video, a melody?

A sentence can be 5 words or 50. A video can be 30 frames or 300. You can't just pad everything to the maximum length and hope for the best — that wastes memory, blurs the boundary between content and padding, and ignores the temporal structure entirely.

Definition

Sequential Data

Data where the order matters. "The cat sat on the mat" and "mat the on sat cat the" contain the same words but carry entirely different meanings. The information lives not just in the elements, but in their arrangement over time.

The Five Patterns of Sequence Processing

Not every problem is "sequence in, sequence out." The CS 231n lecture identifies five fundamental patterns:

Pattern	Input	Output	Example
One-to-One	Fixed	Fixed	Image classification (not recurrent)
One-to-Many	Fixed	Sequence	Image captioning: one image → many words
Many-to-One	Sequence	Fixed	Sentiment analysis: many words → one label
Many-to-Many (sync)	Sequence	Sequence	Video frame labeling: one output per frame
Many-to-Many (async)	Sequence	Sequence	Machine translation: read all, then generate

Feedforward nets can only handle one-to-one. For everything else, we need a network that can carry information across time steps. That's the recurrent neural network.

The Core Idea

A recurrent network processes one element at a time, maintaining a hidden state that summarizes everything it has seen so far. Think of it as reading a book one word at a time while keeping notes. Your "notes" (the hidden state) evolve with each word, carrying context forward.

Worked Example — Why Order Matters

Consider classifying movie reviews. The sentence "Not bad at all" contains the word "bad" but is positive. A bag-of-words model (feedforward) might focus on "bad" and predict negative. An RNN reads "Not" first, updates its hidden state to expect a reversal, then reads "bad" and correctly interprets the negation. The hidden state carries the context of "Not" forward to modify the meaning of "bad."

Why Not Just Flatten?

You could concatenate all time steps into one giant vector and feed it to an MLP. Problems: (1) the input dimension changes with sequence length, (2) the number of parameters explodes (a separate weight for every position), and (3) the network can't generalize a pattern at position 5 to position 50. RNNs solve all three by sharing weights across time.

Chapter 02

The Vanilla RNN

An RNN is a neural network with a loop. At each time step t, it takes an input x_t, combines it with the previous hidden state h_t−1, and produces a new hidden state h_t. The same weights are used at every step — the network is like a factory that processes one item at a time using the same machinery.

RNN Hidden State Update h_t = tanh(W_hh · h_t−1 + W_xh · x_t + b_h)

Where h_t is the hidden state at time t (a vector, say of dimension 128), W_hh is the hidden-to-hidden weight matrix (how the previous state transforms), W_xh is the input-to-hidden weight matrix (how the current input influences the state), and b_h is the bias. The tanh nonlinearity squashes the result to [−1, +1].

RNN Output y_t = W_hy · h_t + b_y

The output y_t is a simple linear projection of the hidden state. If you're classifying, you might take the final h_T and project it to class logits. If you need an output at every step, project every h_t.

Definition

Weight Sharing

The matrices W_hh, W_xh, and W_hy are the same at every time step. The network learns one set of transition rules that apply universally. This means it can handle sequences of any length with a fixed number of parameters, and patterns learned at position 3 automatically generalize to position 300.

Unrolling Through Time

It helps to "unroll" the loop: draw a separate copy of the RNN cell for each time step. This reveals the computational graph as a chain of repeated transformations:

RNN Unrolling Through Time

Watch the hidden state propagate through time steps. Click "Step" to advance, "Reset" to start over. The same weights W are applied at every step.

Step: 0

In the unrolled view, you can see that the hidden state h₀ flows through every cell. Information from x₁ can influence y₅ — but only by passing through four hidden states. This chain is both the RNN's power (memory) and its weakness (we'll see why in Chapter 4).

Worked Example — Detecting Repeated 1s

Input sequence: [0, 1, 0, 1, 1, 1, 0]. Task: output 1 whenever the current input is 1 and the previous input was also 1.

The hidden state needs to remember the previous input. Let h be a 3D vector [current, previous, bias]. We can set W_xh to copy x into "current," W_hh to copy "current" into "previous," and W_hy to compute max(current + previous − 1, 0). The RNN outputs: [0, 0, 0, 0, 1, 1, 0]. This is exactly Karpathy's example from the lecture — a hand-crafted RNN that solves a sequential pattern matching task.

The Hidden State Is Memory

Think of h_t as a summary of everything the network has seen up to time t. It's not a perfect recording — it's a compressed, learned representation. What gets remembered depends on the weights, which are learned from data. A well-trained RNN keeps the important information and discards the noise.

Chapter 03

Training RNNs: Backpropagation Through Time

Once unrolled, an RNN is just a very deep feedforward network with shared weights. Training it means computing gradients of the loss with respect to those shared weights. The algorithm is called Backpropagation Through Time (BPTT) — standard backprop applied to the unrolled graph.

The Computational Graph

For a many-to-many task with loss at every step:

BPTT — Full Sequence

Forward pass: Unroll the RNN for all T time steps. Compute h₁, h₂, ..., h_T and outputs y₁, ..., y_T.
Compute loss: L = L₁ + L₂ + ... + L_T, summing the per-step losses.
Backward pass: Backpropagate gradients from L through the entire unrolled graph, back to h₀.
Accumulate gradients: Since W is shared, ∂L/∂W = ∑_t=1^T ∂L/∂W at step t. Sum contributions from every time step.
Update: W ← W − α · ∂L/∂W.

Worked Example — Gradient Accumulation

Suppose T = 3, and the loss at each step depends on W through h_t. Then ∂L/∂W = ∂L₁/∂W + ∂L₂/∂W + ∂L₃/∂W. Each term ∂L_t/∂W involves backpropagating through all steps from t down to 1 (because h_t depends on h_t−1, which depends on h_t−2, etc.). The total gradient is the sum of all these contributions.

The Problem: Long Sequences

For a document with 10,000 words, BPTT requires storing all 10,000 hidden states (for the backward pass) and backpropagating through a 10,000-layer deep network. This is both memory-prohibitive and numerically unstable.

Definition

Truncated BPTT

Instead of backpropagating through the entire sequence, break it into chunks of length k (typically 20–100). Run the forward pass through the chunk, compute and backpropagate the loss, update weights, then start the next chunk — carrying the hidden state forward but truncating the gradient computation at chunk boundaries.

Truncated BPTT Forward: h flows through the full sequence (never reset)
Backward: gradients flow back at most k steps

The hidden state carries long-range information, but the gradient can only "see" k steps of history.

The Trade-off

Truncated BPTT makes training feasible but introduces a bias: the network can carry information over arbitrary distances in the forward pass (through h), but it can't learn to use information beyond k steps away (because gradients are truncated). Long-range dependencies exist in the representation but are invisible to the optimizer.

Many-to-One with Truncated BPTT

If you only have a loss at the final time step (e.g., sentiment classification), truncated BPTT still works: you compute the loss at the end and backpropagate through the last k steps. The hidden state at each chunk boundary still carries upstream gradients from its loss contribution, even though earlier chunks don't directly feel the final loss.

Chapter 04

The Gradient Problem

Here is the fundamental obstacle that plagued RNNs for two decades. When you backpropagate from h_t to h_t−1, the chain rule says:

Gradient Through One Step ∂h_t / ∂h_t−1 = diag(1 − tanh²(z_t)) · W_hh^T

where z_t = W_hh h_t−1 + W_xh x_t + b

To backpropagate from step T all the way to step 1, you multiply this matrix T−1 times. That's the crux of the problem.

Gradient Over Many Steps ∂h_T / ∂h₁ = ∏_t=2^T diag(1 − tanh²(z_t)) · W_hh^T

Why Gradients Vanish

The tanh derivative, 1 − tanh²(z), is always between 0 and 1. When you multiply by W_hh^T and the result has singular values less than 1, each multiplication shrinks the gradient. After 50 steps, you're multiplying 50 numbers less than 1 together: the gradient is essentially zero.

Why Gradients Explode

If the largest singular value of W_hh is greater than 1, the opposite happens: each multiplication grows the gradient. After 50 steps, you have a number so large it causes NaN or wildly unstable updates.

Vanishing & Exploding Gradients

Adjust sequence length and W_hh singular value to see how gradient magnitude changes across time steps. The gradient at step 1 is what the network uses to learn long-range dependencies.

Seq Length: 15 Singular value: 0.90

Worked Example — Vanishing in Numbers

Suppose the effective per-step gradient multiplier is 0.7 (tanh derivative × W magnitude). After 10 steps: 0.7¹⁰ = 0.028 — the gradient is 2.8% of its original size. After 20 steps: 0.7²⁰ = 0.0008 — less than 0.1%. After 50 steps: 0.7⁵⁰ ≈ 1.8 × 10⁻⁸. The network literally cannot learn that a word 50 positions ago matters.

Worked Example — Exploding in Numbers

Now suppose the multiplier is 1.3. After 10 steps: 1.3¹⁰ = 13.8. After 20 steps: 1.3²⁰ = 190. After 50 steps: 1.3⁵⁰ ≈ 497,929. Your gradient update is half a million times too large. A single step could destroy all learned weights.

Gradient Clipping: Fixing Explosions

Exploding gradients have a simple fix: gradient clipping. If the gradient norm exceeds a threshold, scale it down to that threshold while preserving direction.

Gradient Clipping if ||g|| > threshold:
g ← g × threshold / ||g||

Clipping Doesn't Fix Vanishing

Gradient clipping prevents explosions but does nothing for vanishing gradients. A gradient of 10⁻⁸ is too small to learn from, and clipping can't make it larger. To fix vanishing gradients, we need a fundamentally different architecture. That's the LSTM.

The Deep Connection

Vanishing gradients in RNNs are the same problem as vanishing gradients in deep feedforward networks (like a 100-layer MLP). ResNets solved it for feedforward nets using skip connections — a shortcut that lets gradients bypass layers. LSTMs solve it for RNNs using a remarkably similar idea: an additive path through the cell state that lets gradients flow unimpeded.

Chapter 05

LSTM — Gated Memory

Hochreiter and Schmidhuber proposed the Long Short-Term Memory (LSTM) in 1997, specifically to solve the vanishing gradient problem. The key insight: give the network an explicit memory channel — the cell state c_t — that information can flow through with minimal transformation. Then use learned gates to control what enters, what leaves, and what gets erased.

Definition

LSTM Cell

An RNN variant with two state vectors: the hidden state h_t (used for output) and the cell state c_t (long-term memory). Three sigmoid gates — forget, input, and output — control information flow. The cell state acts as a highway for gradients.

The Four Gates

All four computations begin the same way: concatenate h_t−1 and x_t, multiply by a weight matrix, add a bias, then apply an activation. The gates are:

Gate Computations f_t = σ(W_f · [h_t−1, x_t] + b_f)    ← Forget gate (what to erase)
i_t = σ(W_i · [h_t−1, x_t] + b_i)    ← Input gate (what to write)
o_t = σ(W_o · [h_t−1, x_t] + b_o)    ← Output gate (what to reveal)
g_t = tanh(W_g · [h_t−1, x_t] + b_g)    ← Candidate values (content to write)

The sigmoid σ gates output values in [0, 1] — think of them as dials controlling "how much." The tanh gate g_t produces the actual content, values in [−1, +1].

Cell State Update

Cell State Update c_t = f_t ⊙ c_t−1 + i_t ⊙ g_t

⊙ denotes element-wise multiplication

Read this as: "Take the old cell state, forget some of it (multiply by f_t, which is 0 for things to erase and 1 for things to keep), then add new information (i_t ⊙ g_t, which is the input gate controlling how much of the candidate gets written)."

Hidden State Output

Hidden State h_t = o_t ⊙ tanh(c_t)

The hidden state is a filtered version of the cell state. The output gate decides which parts of the memory to expose to the outside world. The tanh squashes c_t to [−1, +1] first.

LSTM Gates — Information Flow

Watch information flow through the LSTM cell. Click "Step" to advance through time steps. Observe how the forget gate erases, the input gate writes, and the cell state accumulates memory.

Step: 0

Why LSTM Solves Vanishing Gradients

The critical equation is c_t = f_t ⊙ c_t−1 + i_t ⊙ g_t. Backpropagating through this:

Gradient Through Cell State ∂c_t / ∂c_t−1 = f_t

Element-wise multiplication by forget gate. NO matrix multiply by W.

Derivation — Why This Changes Everything

In a vanilla RNN, the gradient flows through ∂h_t/∂h_t−1 = diag(tanh') · W_hh^T — a matrix multiplication at every step, which causes exponential growth or decay.

In an LSTM, the gradient through the cell state is just ∂c_t/∂c_t−1 = diag(f_t). This is a diagonal matrix — element-wise scaling, not a matrix multiply. If f_t = 1 (forget nothing), the gradient passes through unchanged. No shrinking, no exploding.

Over T steps: ∂c_T/∂c₁ = ∏_t=2^T f_t. If the network learns to keep f_t close to 1 for important information, gradients flow over hundreds of steps with minimal attenuation. This is the additive gradient highway — directly analogous to ResNet's skip connections.

The ResNet Connection

ResNet: y = x + F(x). The gradient of y with respect to x is 1 + F'(x). The 1 is the highway. LSTM cell state: c_t = f_t ⊙ c_t−1 + i_t ⊙ g_t. The gradient of c_t with respect to c_t−1 is f_t — close to 1 when the network wants to remember. Same principle: additive paths preserve gradients. This connection was noticed retrospectively; Highway Networks (Srivastava et al., 2015) explicitly linked the two ideas.

Worked Example — LSTM Remembering Over 50 Steps

Suppose at step 1, the network encounters the word "not" and writes −1 into cell dimension 7 (a negation flag). For the next 49 steps, f_t[7] = 1 and i_t[7] = 0 for all t (the gates learn to preserve this flag and not overwrite it). At step 50, the network reads "bad" and uses c₅₀[7] = −1 to correctly predict positive sentiment.

The gradient for this: ∂c₅₀[7]/∂c₁[7] = ∏_t=2⁵⁰ f_t[7] = 1⁴⁹ = 1. Perfect gradient flow. A vanilla RNN would have gradient ~0 after 50 steps.

LSTM Doesn't Guarantee No Vanishing

If the forget gate learns to be small (f ≈ 0), gradients still vanish through the cell state. LSTM provides the capacity to preserve long-range gradients, but it doesn't guarantee the network will learn to use that capacity. In practice, LSTMs handle sequences of hundreds of steps well, but struggle beyond ~1000 steps.

Practical Detail: Forget Gate Bias

A crucial implementation trick: initialize the forget gate bias large (e.g., b_f = 1 or 2). This makes f_t ≈ 1 at the start of training, so the cell state preserves everything by default. The network then learns what to forget, rather than what to remember. Without this, LSTMs often fail to learn long-range dependencies because they erase information before discovering it matters.

Chapter 06

GRU — The Streamlined Alternative

The Gated Recurrent Unit (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single vector, and reducing three gates to two. It often matches LSTM performance with fewer parameters.

GRU Equations z_t = σ(W_z · [h_t−1, x_t] + b_z)    ← Update gate
r_t = σ(W_r · [h_t−1, x_t] + b_r)    ← Reset gate
h̃_t = tanh(W_h · [r_t ⊙ h_t−1, x_t] + b_h)    ← Candidate
h_t = (1 − z_t) ⊙ h_t−1 + z_t ⊙ h̃_t    ← Final state

How GRU Maps to LSTM

The update gate z_t plays a dual role: it's both the forget gate and the input gate. When z_t = 0, the hidden state is copied forward unchanged (like f = 1, i = 0 in LSTM). When z_t = 1, the hidden state is completely replaced by the candidate (like f = 0, i = 1). The constraint z + (1−z) = 1 means the GRU always does a convex combination — it can't simultaneously forget and add at full strength.

The reset gate r_t controls how much of the previous hidden state to use when computing the candidate. When r_t = 0, the candidate ignores history entirely, making the unit act like a plain feedforward layer on x_t alone.

Worked Example — GRU as Interpolation

Suppose z_t = 0.3 for all dimensions. Then h_t = 0.7 · h_t−1 + 0.3 · h̃_t. The network keeps 70% of the old state and blends in 30% new. For a dimension where z = 0, the state is copied forward perfectly — unlimited memory for that dimension.

Feature	LSTM	GRU
State vectors	2 (h_t and c_t)	1 (h_t)
Gates	3 (σ) + 1 (tanh) = 4	2 (σ) + 1 (tanh) = 3
Parameters	4 × (n² + nm + n)	3 × (n² + nm + n)
Memory highway	Separate cell state c_t	Interpolation of h_t
Forget ⊥ input	Independent (f and i are separate)	Coupled (z and 1−z)
Performance	Slight edge on very long sequences	Competitive, faster training

Which One to Choose?

In practice, LSTM and GRU perform similarly on most tasks. GRU trains faster (fewer parameters) and is easier to tune. LSTM has more expressive power because forget and input are independent. Rule of thumb: start with GRU for speed; switch to LSTM if you need to model very long-range dependencies where the decoupled gates help.

GRU Gradient Flow

∂h_t/∂h_t−1 from h_t = (1−z_t) ⊙ h_t−1 + z_t ⊙ h̃_t gives a term (1−z_t) on the direct path. When z_t ≈ 0 (the update gate is closed), this term is ≈ 1, and the gradient flows through unimpeded — the same highway principle as LSTM's cell state. The difference: GRU's highway goes through the hidden state directly, while LSTM separates it into a dedicated cell state channel.

Chapter 07

Character-Level Language Modeling

The most intuitive application of RNNs: predict the next character in a text. Given "hell", predict "o". Given "hello", predict " " (space). This is a many-to-many task: at every time step, the network reads one character and predicts the next.

The Setup

Define a vocabulary of all possible characters. For English text, this might be 65 characters: a–z, A–Z, 0–9, space, and common punctuation. Each character is represented as a one-hot vector — a vector of length 65 with a 1 at the character's index and 0 everywhere else.

Worked Example — Karpathy's "hello"

Vocabulary: [h, e, l, o]. Indices: h=0, e=1, l=2, o=3.

Training sequence: "hello". Input → target pairs:

"h" → "e", "e" → "l", "l" → "l", "l" → "o"

At each step, the RNN reads one character, updates its hidden state, and outputs a probability distribution over the next character. The loss is cross-entropy between the prediction and the true next character.

Embedding Layer

Multiplying a weight matrix W_embed by a one-hot vector just extracts one column of W. This is called an embedding lookup — each character gets its own learned vector. Instead of spending computation on a matrix multiply, implementations just index directly into the embedding table.

Embedding Lookup W_embed · one_hot(k) = W_embed[:, k] (just column k)

Sampling — Generating Text

At test time, you feed the model a seed character, sample from the output distribution to get the next character, feed that back as input, and repeat. The model generates character by character, using its own predictions as input.

Definition

Temperature

Before applying softmax, divide the logits by a temperature parameter T. High T (> 1) flattens the distribution, making all characters roughly equally likely (more random, more creative). Low T (< 1) sharpens the distribution, concentrating probability on the top prediction (more deterministic, more repetitive). T = 1 is the default (model's natural confidence).

Temperature Scaling p(char_i) = exp(z_i / T) / ∑_j exp(z_j / T)

Worked Example — Temperature in Action

Logits: [2.0, 1.0, 0.5, 0.1] for characters [a, b, c, d].

T = 1.0: softmax → [0.57, 0.21, 0.13, 0.09]. Model picks "a" a bit more than half the time.

T = 0.5: divide logits by 0.5 → [4.0, 2.0, 1.0, 0.2]. softmax → [0.82, 0.11, 0.04, 0.02]. Model almost always picks "a." Very repetitive.

T = 2.0: divide by 2 → [1.0, 0.5, 0.25, 0.05]. softmax → [0.41, 0.25, 0.19, 0.16]. More uniform. Random, creative, sometimes nonsensical.

What Char-RNNs Learn

Karpathy's famous 2015 blog post trained a char-RNN on various corpora with remarkable results:

Emergent Structure

Trained on Shakespeare: the model learns to open and close quotes, use iambic meter patterns, and create plausible (if meaningless) dialogue. Trained on Linux source code: it learns to balance braces, indent properly, write syntactically plausible C functions, and even produce fake comments. Trained on LaTeX: it generates valid (but nonsensical) math papers with proper equation environments. The hidden states learn interpretable features: Karpathy found individual neurons that tracked quote depth, line length, and if-statement nesting.

Chapter 08

Image Captioning — CNN + RNN

Image captioning is a one-to-many task: one image in, a sequence of words out. The architecture is beautifully simple — use a CNN to encode the image, then use an RNN to decode a caption word by word.

The Show-and-Tell Architecture

CNN Encoder + RNN Decoder

Encode: Pass the image through a pretrained CNN (e.g., VGG or ResNet). Remove the final classification layer. The output is a feature vector v (e.g., 4096-dimensional from VGG's fc7).
Initialize: Project v to the RNN's hidden dimension and use it as h₀. Alternatively, add a term W_ih · v to the hidden state update at every time step.
Decode: Feed a <START> token as x₀. The RNN predicts the first word. Feed that word as x₁. The RNN predicts the second word. Continue until the model predicts an <END> token.

Modified RNN for Captioning h_t = tanh(W_xh · x_t + W_hh · h_t−1 + W_ih · v + b)

v = CNN(image) is added at every time step, keeping the image "in view"

During training, the RNN is fed the ground-truth word at each step (teacher forcing), and the loss is cross-entropy between the predicted word distribution and the true next word. At test time, the model uses its own predictions — just like the character-level LM.

Worked Example — Captioning a Beach Image

Image: a photo of two people walking on a beach with surfboards.

Step 0: CNN encodes the image → feature vector v. Feed <START>.

Step 1: RNN predicts "Two" (conditioned on v + <START>).

Step 2: Feed "Two" → predict "people".

Step 3: Feed "people" → predict "walking".

Step 4: "walking" → "on".

Step 5: "on" → "the".

Step 6: "the" → "beach".

Step 7: "beach" → "with".

Step 8: "with" → "surfboards".

Step 9: "surfboards" → <END>.

Output: "Two people walking on the beach with surfboards."

Word Embeddings

Just like character-level models use learned embeddings, word-level models embed each word as a dense vector (typically 256–512 dimensions). These embeddings capture semantic relationships: "king" and "queen" have similar embeddings, "king" − "man" + "woman" ≈ "queen." The embedding matrix is learned jointly with the rest of the model.

Beyond Captioning

The CNN-encoder + RNN-decoder pattern generalizes far beyond captioning:

Task	Encoder	Decoder
Image Captioning	CNN (image)	RNN (words)
Video Captioning	CNN per frame + RNN (frames)	RNN (words)
VQA (Visual QA)	CNN (image) + RNN (question)	MLP (answer)
Machine Translation	RNN (source language)	RNN (target language)
Visual Navigation	CNN (view) + RNN (instructions)	RNN (actions)

The Bottleneck

The entire image is compressed into a single vector v. For complex scenes, this bottleneck loses important spatial details. Attention mechanisms (the subject of CS 231n Lecture 8) solve this by letting the RNN "look at" different parts of the image at each decoding step. This leads directly to the Transformer architecture.

Chapter 09

Showcase: Character-Level RNN

Here's a live character-level RNN running in your browser. Type a seed text, adjust the temperature, and watch the network generate characters one at a time. The hidden state evolves at each step, and the probability distribution over the next character is displayed in real time.

This is a small model with a vocabulary of lowercase letters, space, and basic punctuation, pretrained on pattern-like text. The hidden state visualization shows how the network's internal representation changes as it generates each character.

Character-Level RNN Generator

Type seed text and click "Generate" to produce characters. Adjust temperature to control randomness. Watch the hidden state and output probabilities update in real time.

Seed: Temp: 1.0

What to Try

Low temperature (0.3): The model becomes very confident and repetitive. It tends to generate the most common patterns over and over. High temperature (2.5): The output becomes nearly random — a mix of characters with little structure. Medium temperature (0.8–1.2): The sweet spot. Structured enough to form word-like patterns, random enough to be creative.

What You're Seeing

The top bar shows the generated text growing character by character. The middle section shows the hidden state as a heatmap — each row is a dimension of the hidden vector, colored by its activation (blue negative, yellow positive). The bottom section shows the probability distribution over the next character. The highlighted bar is the character that was sampled.

Chapter 10

Summary & Connections

You now understand the complete story of recurrent neural networks — from the basic idea of carrying state through time, to the gradient pathologies that crippled vanilla RNNs, to the gated architectures that solved them.

The Core Chain

The Logic in One Paragraph

Sequences require memory across time steps (vanilla RNN). Training via BPTT unrolls the network and backpropagates through time, but repeated matrix multiplication causes gradients to vanish or explode (the gradient problem). Gradient clipping fixes explosions but not vanishing. LSTM introduces a cell state with an additive update path: c_t = f ⊙ c_t−1 + i ⊙ g. This lets gradients flow through multiplication by f instead of W — an additive highway that preserves long-range information. GRU simplifies this to a single-state interpolation. Both architectures power language modeling, image captioning, translation, and every sequence task before transformers.

Key Equations

Vanilla RNN h_t = tanh(W_hh h_t−1 + W_xh x_t + b)

LSTM Cell State c_t = f_t ⊙ c_t−1 + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

GRU h_t = (1 − z_t) ⊙ h_t−1 + z_t ⊙ h̃_t

RNN Variants Compared

Architecture	Year	States	Gates	Key Property
Vanilla RNN	1990	h	0	Simple but gradients vanish
LSTM	1997	h, c	3+1	Cell state highway for gradients
GRU	2014	h	2+1	Simpler, competitive performance
Transformer	2017	—	—	Parallel, attention-based, dominant

RNN Advantages and Disadvantages

Advantages	Disadvantages
Can process any length input	Recurrent computation is inherently sequential (slow)
Model size doesn't grow with input length	Difficult to access information from many steps back
Same weights at every step (symmetry)	Cannot parallelize across time steps (unlike Transformer)
Computation uses info from many steps back (in theory)	Vanishing gradients limit practical memory span

What Comes Next

Attention Mechanisms (CS 231n Lecture 8): Instead of compressing all history into a fixed-size hidden state, let the decoder "attend" to any part of the input at each step. This eliminates the bottleneck and allows the model to look up relevant information directly.

Transformers: Replace recurrence entirely with self-attention. Every position attends to every other position in parallel. No sequential computation, no vanishing gradients (attention is a weighted sum, not a product chain). This is why transformers dominate modern NLP and vision.

State Space Models (Mamba, etc.): Modern architectures that bring back the idea of a hidden state but with linear-time computation and selective gating. They combine the efficiency of RNNs (linear in sequence length) with the ability to handle long-range dependencies. An active area of research as of 2025.

References

Year	Paper	Contribution
1990	Elman, "Finding Structure in Time"	The vanilla RNN (Elman network)
1994	Bengio et al., "Learning Long-Term Dependencies with Gradient Descent is Difficult"	Formalized vanishing gradient problem
1997	Hochreiter & Schmidhuber, "Long Short-Term Memory"	LSTM architecture
2014	Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder"	GRU architecture
2015	Vinyals et al., "Show and Tell"	CNN + LSTM image captioning
2015	Karpathy, "The Unreasonable Effectiveness of RNNs"	Char-RNN blog post & min-char-rnn.py
2016	Karpathy, Johnson, & Fei-Fei, "Visualizing and Understanding Recurrent Networks"	Interpretable RNN cells

The One Sentence

RNNs carry memory through time via a hidden state, but vanilla gradients vanish exponentially — LSTM's gated cell state provides an additive highway that preserves gradients the same way ResNet's skip connections preserve them through depth.

Processing Sequences with Recurrent Networks

What You'll Master

Why Sequences Break Feedforward Nets

The Five Patterns of Sequence Processing

The Vanilla RNN

Unrolling Through Time

Training RNNs: Backpropagation Through Time

The Computational Graph

The Problem: Long Sequences

The Gradient Problem

Why Gradients Vanish

Why Gradients Explode

Gradient Clipping: Fixing Explosions

LSTM — Gated Memory

The Four Gates

Cell State Update

Hidden State Output

Why LSTM Solves Vanishing Gradients

Practical Detail: Forget Gate Bias

GRU — The Streamlined Alternative

How GRU Maps to LSTM

Character-Level Language Modeling

The Setup

Embedding Layer

Sampling — Generating Text

What Char-RNNs Learn

Image Captioning — CNN + RNN

The Show-and-Tell Architecture

Beyond Captioning

Showcase: Character-Level RNN

Summary & Connections

The Core Chain

Key Equations

RNN Variants Compared

RNN Advantages and Disadvantages

What Comes Next

References