Bengio, Simard, Frasconi (U. Montreal / AT&T) — IEEE Trans Neural Networks 1994

Learning Long-Term Dependencies with Gradient Descent is Difficult

The formal proof that gradient-based learning in recurrent neural networks fails catastrophically for long-range dependencies — and the mathematical explanation of why.

Prerequisites: Chain rule (calculus) + Matrix multiplication + Eigenvalues (basic idea). That's it.
8
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: The Memory Problem

Imagine you're reading a novel. On page 3 the author writes: "Maria grew up speaking Portuguese." Two hundred pages later, a character says something to Maria in Portuguese and she responds fluently. You understand why — you remember the detail from page 3. You carried that information across hundreds of pages of intervening text.

Now imagine asking a neural network to do the same thing. You feed it a sequence of words, one at a time, and at the end you ask it a question whose answer depends on something it saw many steps ago. This is the long-range dependency problem: can a network learn to remember and use information from the distant past?

In theory, a recurrent neural network (RNN) should handle this perfectly. An RNN has a hidden state — a vector of numbers that gets updated at every time step. This hidden state is the network's "memory." It carries information from the past forward through time. At each step, the RNN reads the current input, combines it with the hidden state, and produces a new hidden state. In principle, this hidden state can encode anything from the past.

In practice, it doesn't work. Not for long sequences. An RNN can remember what it saw 5 or 10 steps ago, but not 50 or 100 steps ago. The information doesn't vanish from the hidden state — it's still there, encoded somewhere in those numbers. The problem is that the network can't learn to use it, because the gradient signal used for learning decays exponentially with the number of time steps.

Think about what "exponentially" means here. If the gradient shrinks by a factor of 0.7 at each step, then after 10 steps it's 0.710 = 0.028 (97% lost). After 20 steps: 0.720 = 0.0008 (99.9% lost). After 50 steps: 0.750 ≈ 1.8 × 10-8. The gradient is, for all practical purposes, zero. The network receives no signal about how to use information from 50 steps ago.

The Memory Problem

An RNN processes a sequence. A critical piece of information appears at the start (warm dot). The network must use it at the end. Drag the sequence length slider to see how the gradient signal from output to that critical input decays.

Sequence length 5
The central result of this paper: Bengio, Simard, and Frasconi proved mathematically that gradient-based learning in RNNs suffers from an unavoidable exponential decay (or explosion) of the gradient signal over time. For most practical cases, the gradient vanishes — shrinking by a constant factor at each time step. This means the network receives essentially zero learning signal for long-range dependencies. It's not a bug in the implementation or a bad hyperparameter choice. It's a fundamental property of how gradients flow through time in recurrent architectures.

This paper was published in 1994, but its consequences echoed through the next 25 years of deep learning. It motivated the invention of LSTMs (1997), GRUs (2014), and ultimately the Transformer (2017), which solved the problem by replacing recurrence with attention.

The paper's historical context

By 1994, the neural network community knew empirically that RNNs couldn't learn long-term dependencies. Researchers had tried and failed on tasks like temporal order detection (remember which symbol appeared first), string counting, and context-free grammar learning. But nobody had a formal explanation for why these failures occurred.

Bengio, Simard, and Frasconi provided that explanation. Their key insight was to analyze the Jacobian product chain — the sequence of matrices that the gradient must pass through during backpropagation. They proved that for any bounded activation function, this product either vanishes exponentially (the common case) or explodes exponentially (the dangerous case). There is essentially no middle ground.

YearEventConnection
1974Werbos describes backpropagation in his PhD thesisThe algorithm whose limitation this paper analyzes
1986Rumelhart, Hinton, Williams popularize backpropMakes neural network training practical — but not for sequences
1990Elman introduces the simple RNNThe architecture this paper analyzes
1993Empirical failures on long-range tasks documentedMotivates the theoretical analysis
1994This paper: formal proof of vanishing gradientsExplains why RNNs fail on long sequences
1997LSTM introduced (Hochreiter & Schmidhuber)The first architectural solution to this paper's problem

Let's understand exactly why gradients vanish — starting with how RNNs work.

What is the long-range dependency problem in RNNs?

Chapter 1: Unrolled RNNs

To understand why gradients vanish, we first need to understand how an RNN processes a sequence. The key insight is that an RNN is really a single set of weights applied repeatedly over time. We can "unroll" this repetition into a long chain.

An RNN maintains a hidden state ht that evolves over time. At each time step t, the network takes the current input xt and the previous hidden state ht-1, and produces a new hidden state:

ht = f(Wh ht-1 + Wx xt + b)

Where Wh is the recurrent weight matrix (hidden-to-hidden connections), Wx is the input weight matrix (input-to-hidden connections), b is a bias vector, and f is a nonlinear activation function (typically tanh or sigmoid — we'll see later why the choice of activation function is critical for the gradient problem). The output at time t is then computed from the hidden state:

yt = g(Wy ht + by)

Here's the crucial observation: the same weights Wh, Wx, Wy are used at every time step. When we unroll the RNN across T time steps, we get a very deep feedforward network where the same weight matrices appear at every layer. This "weight sharing" is what gives RNNs their power (they can process sequences of any length with a fixed number of parameters) but also what creates the gradient problem (the same matrix is multiplied over and over, causing systematic exponential behavior).

This weight sharing also means that a single gradient update to Wh affects the behavior at every time step simultaneously. If the gradient is dominated by the most recent time steps (because distant gradients have vanished), then the weight update primarily reflects the network's short-term behavior. Long-term patterns get no representation in the gradient, and thus no influence on learning.

Unrolling an RNN

Left: the compact RNN diagram with a self-loop on the hidden state. Right: the same RNN unrolled across time steps. Click "Unroll" to see the transformation. Notice how the same weights Wh appear at every step.

Click to unroll the recurrence
Why unrolling matters: An unrolled RNN for T time steps is equivalent to a feedforward network with T layers — except all layers share the same weights. This means the gradient must flow through T multiplications by the same matrix. If that matrix shrinks things (eigenvalues < 1), T multiplications shrinks things exponentially. If it grows things (eigenvalues > 1), T multiplications grows things exponentially. There's almost no middle ground.

Let's write this concretely. For a sequence of length T, the hidden states are:

h0 = f(Wx x0 + b)
h1 = f(Wh h0 + Wx x1 + b)
h2 = f(Wh h1 + Wx x2 + b)

hT = f(Wh hT-1 + Wx xT + b)

Each hidden state depends on the previous one through the recurrent weight matrix Wh. This creates a chain of dependencies: hT depends on hT-1, which depends on hT-2, all the way back to h0. The depth of this dependency chain equals the sequence length — for a 200-word sentence, that's 200 layers of recurrence.

Compare this to a typical CNN or feedforward network, which might have 50-100 layers. An RNN processing a modest paragraph (100 words) creates a computation graph 100 layers deep, where every layer uses the same weights. The gradient must traverse all 100 layers during backpropagation. This is where the trouble begins.

python
import torch
import torch.nn as nn

# A simple RNN cell: same weights applied at every step
hidden_size = 64
input_size = 10

W_h = torch.randn(hidden_size, hidden_size) * 0.1  # recurrent weights
W_x = torch.randn(hidden_size, input_size) * 0.1   # input weights
b   = torch.zeros(hidden_size)                        # bias

def rnn_forward(inputs, h0):
    """Unrolled RNN: same W_h at every step."""
    h = h0
    states = [h]
    for x_t in inputs:
        h = torch.tanh(W_h @ h + W_x @ x_t + b)  # same W_h each step!
        states.append(h)
    return states

# Process a sequence of 20 time steps
seq = [torch.randn(input_size) for _ in range(20)]
h0 = torch.zeros(hidden_size)
hidden_states = rnn_forward(seq, h0)
# hidden_states[20] depends on ALL previous inputs through W_h

The chain of hidden states is the computation graph through which gradients must flow. In the next chapter, we'll see exactly how gradients flow backward through this chain — and why the repeated multiplication by Wh creates an exponential problem.

The parameter count

Let's be precise about what's shared and what's not. For an RNN with hidden size n and input size m:

ParameterShapeCountShared across time?
Wh[n, n]Yes — same at every step
Wx[n, m]n × mYes — same at every step
b[n]nYes — same at every step
Wy[output, n]output × nYes — same at every step

Total: roughly n² + nm parameters, regardless of sequence length. A feedforward network with the same "depth" (layers = sequence length T) would have T × n² parameters — one weight matrix per layer. The RNN trades parameter efficiency (shared weights) for gradient instability (same matrix multiplied T times).

Hidden state capacity

The hidden state ht is a vector of n numbers. For tanh activation, each number is between -1 and 1. In principle, this vector can encode 2n distinguishable states (if we discretize each dimension to ±1). For n = 64, that's 264 ≈ 1.8 × 1019 possible states. The hidden state has enormous capacity. The problem isn't capacity — it's learnability.

A crucial distinction: An RNN with hidden size 64 can, in principle, encode any sequence of any length into its hidden state. The information is there. But gradient descent cannot find the encoding, because the gradient signal that would teach the network how to encode long-range information is exponentially weak. It's like having a filing cabinet with a million drawers but no labels — the storage exists, but finding anything is impossible.

Information vs learnability

This distinction between information capacity and learnability is one of the deepest insights in the paper. An RNN is a universal approximator — given enough hidden units, it can compute any computable function. But being able to compute a function is different from being able to learn it from data via gradient descent.

The capacity is exponential in hidden size: 2n distinguishable states. But the learnability drops exponentially with sequence length: gradients decay as γT. At some point, the learnable range hits zero — not because the architecture can't represent the function, but because gradient descent can't navigate to it.

This is why architectures matter. The same function can be represented by an RNN, an LSTM, or a Transformer. But only the latter two can learn it from data, because their gradient structure allows long-range credit assignment. Architecture is not just about representation — it's about optimization.

When we unroll an RNN for T time steps, what structure do we get?

Chapter 2: Backpropagation Through Time

We know how to train feedforward networks: backpropagation. We compute the loss, then propagate the error backward through the network, computing gradients layer by layer. For an RNN unrolled across T time steps, we do exactly the same thing — we call it Backpropagation Through Time (BPTT).

Suppose we have a loss at the final time step T:

L = L(yT, target)

We want ∂L/∂Wh — how the loss changes when we adjust the recurrent weights. Since Wh is used at every time step, the total gradient is a sum over all time steps:

∂L/∂Wh = ∑t=1T ∂L/∂hT · ∂hT/∂ht · ∂ht/∂Wh

The critical term is ∂hT/∂ht — how the final hidden state depends on the hidden state at time t. By the chain rule, this is a product of Jacobian matrices:

∂hT/∂ht = ∏k=t+1T ∂hk/∂hk-1

Each factor ∂hk/∂hk-1 is the Jacobian of the state transition function. For our RNN with hk = f(Wh hk-1 + Wx xk + b):

∂hk/∂hk-1 = diag(f'(zk)) · Wh

Where zk = Wh hk-1 + Wx xk + b is the pre-activation, and diag(f'(zk)) is a diagonal matrix of activation derivatives. So the total Jacobian product is:

∂hT/∂ht = ∏k=t+1T diag(f'(zk)) · Wh

This is a product of (T - t) matrices. Each matrix in the product is Wh scaled by the activation derivative at that time step. This is where the exponential problem lives.

Why the product is the key

The crucial difference between BPTT and standard backpropagation is this: in a feedforward network, the gradient passes through different weight matrices at each layer. In BPTT, it passes through the same matrix (modified by the activation derivatives) at every time step. This repetition is what makes the exponential behavior unavoidable.

Consider the analogy of compound interest. If you multiply a number by 0.95 once, you lose 5%. Multiply by 0.95 twice, you lose 9.75%. After 100 multiplications: 0.95100 = 0.006 — you've lost 99.4%. Now imagine multiplying by 0.95 once and by 1.05 once, alternating. The product is 0.95 × 1.05 = 0.9975 per pair — much less shrinkage, because the two factors partially cancel. This is why feedforward nets (different weights at each layer) suffer less than RNNs (same weights at every step).

In an RNN, every factor in the product has the same "direction" of shrinkage or growth. There's no cancellation. The exponential behavior is systematic, not random.

BPTT: Gradient Flow Through Time

Watch the gradient flow backward from the loss through each time step. At each step, the gradient is multiplied by the Jacobian (activation derivative × Wh). Click "Step Back" to propagate the gradient one step at a time.

Gradient at output = 1.0
Think of it this way: In a feedforward network, different layers have different weights, so the gradient product is a mix of different matrices. Some might shrink, some might grow, and they can partially cancel out. In an RNN, you're multiplying by essentially the same matrix over and over. If that matrix has a shrinking tendency, the gradient shrinks exponentially. If it has a growing tendency, it grows exponentially. The repeated structure is what makes the problem unavoidable.
python
# BPTT: computing the Jacobian product chain
import torch

def bptt_gradient(W_h, hidden_states, T, t):
    """Compute dh_T/dh_t by chaining Jacobians backward."""
    # Start with identity matrix (gradient at step T)
    jacobian_product = torch.eye(W_h.shape[0])

    for k in range(T, t, -1):
        # Pre-activation at step k
        z_k = W_h @ hidden_states[k-1]
        # tanh derivative: f'(z) = 1 - tanh(z)^2
        f_prime = 1 - torch.tanh(z_k)**2
        # Jacobian at step k: diag(f') @ W_h
        J_k = torch.diag(f_prime) @ W_h
        # Accumulate product
        jacobian_product = jacobian_product @ J_k

    return jacobian_product

# The norm of this product tells us the gradient magnitude
# For T-t = 50, with tanh activation, this is typically ~1e-15

The gradient at time step t has magnitude proportional to the norm of this Jacobian product. If each factor shrinks the gradient by some factor γ < 1, then after (T - t) steps the gradient has magnitude γT-t. For γ = 0.9 and T - t = 50, that's 0.950 ≈ 0.005. For γ = 0.7, it's 0.750 ≈ 1.8 × 10-8. For practical RNNs with tanh activation, γ is often around 0.3-0.5, making the gradient essentially zero after just 10-20 steps.

A concrete numerical example

Let's trace through a tiny RNN with hidden size 2 to see the gradient vanish in exact arithmetic. Suppose:

Wh = [[0.5, 0.1], [0.2, 0.4]]

And suppose the activation derivatives are all approximately 0.8 (tanh operating near the origin). Then each Jacobian factor is approximately:

J = 0.8 · Wh = [[0.40, 0.08], [0.16, 0.32]]

The spectral radius of this 2×2 matrix is ρ ≈ 0.46. After 5 steps: 0.465 ≈ 0.021. After 10 steps: 0.4610 ≈ 0.00043. After 20 steps: 0.4620 ≈ 1.8 × 10-7. The gradient has effectively vanished.

python
# Concrete example: gradient vanishing in a 2-unit RNN
import numpy as np

W_h = np.array([[0.5, 0.1], [0.2, 0.4]])
f_prime = 0.8  # typical tanh derivative
J = f_prime * W_h

# Product of n Jacobians
product = np.eye(2)
for n in range(1, 21):
    product = product @ J
    norm = np.linalg.norm(product)
    print(f"n={n:2d}: ||J^n|| = {norm:.6f}")
# n= 1: 0.464  n= 5: 0.021  n=10: 0.00043  n=20: 1.8e-7
Why this matters for learning: When the gradient at step t is 10-7 times smaller than at step T, the weight updates that would teach the network about time step t are 10-7 times smaller than updates from time step T. This means the network effectively ignores everything more than ~20 steps in the past, regardless of how important it is.
In BPTT, the gradient ∂hT/∂ht is computed as a product of Jacobian matrices. Why does this product typically shrink to near-zero?

Chapter 3: The Jacobian Product Chain

We've seen that the gradient involves a product of Jacobian matrices. Now let's look more carefully at what this product actually computes and why — for any bounded activation function — it's mathematically doomed to vanish.

The Jacobian product is the heart of the vanishing gradient problem. Everything else — the specific activation function, the weight initialization, the optimizer — is secondary. If the Jacobian product shrinks exponentially, gradients vanish. If it grows exponentially, gradients explode. Understanding the Jacobian product is understanding the problem.

Each Jacobian in the chain has the form:

Jk = diag(f'(zk)) · Wh

Let's think about what diag(f'(zk)) does. For tanh, the derivative is f'(z) = 1 - tanh(z)2. This value is always between 0 and 1. It equals 1 only when z = 0 (the activation is exactly at the origin), and approaches 0 as |z| grows (the activation saturates).

For sigmoid, f'(z) = σ(z)(1 - σ(z)), which peaks at 0.25 when z = 0. The maximum derivative of a sigmoid is only 0.25 — every time the gradient passes through a sigmoid, it shrinks by at least 75%!

Activationf'(z) at z=0Max f'(z)f'(z) as |z| → ∞
tanh1.01.00
sigmoid0.250.250
ReLU1.01.01.0 (or 0)

So each Jacobian Jk = diag(f'(zk)) · Wh is Wh with each row scaled by a number between 0 and 1 (for tanh) or 0 and 0.25 (for sigmoid).

Visualizing the activation derivative

The shape of f'(z) is crucial to understanding why gradients vanish. For tanh:

f(z) = tanh(z) = (ez - e-z) / (ez + e-z)
f'(z) = 1 - tanh2(z)

The derivative is a bell curve centered at z = 0, reaching its maximum of 1.0 at the origin and decaying to 0 as |z| grows. In a trained RNN, the pre-activations z = Whh + Wxx + b are typically not centered at zero — they're spread across a range that depends on the input and learned weights. On average, |z| is large enough that f'(z) ≈ 0.3-0.5.

For sigmoid, the situation is worse. The sigmoid derivative peaks at only 0.25:

σ(z) = 1 / (1 + e-z)
σ'(z) = σ(z)(1 - σ(z))

At z = 0: σ'(0) = 0.25. At z = ±2: σ'(±2) ≈ 0.1. At z = ±5: σ'(±5) ≈ 0.007. Every pass through a sigmoid layer shrinks the gradient by at least 75%, and usually much more.

python
# Activation derivatives: why tanh is better than sigmoid
import numpy as np

z_values = np.linspace(-5, 5, 100)

# tanh derivative
tanh_deriv = 1 - np.tanh(z_values)**2
print(f"tanh' at z=0: {1 - np.tanh(0)**2:.2f}")     # 1.00
print(f"tanh' at z=1: {1 - np.tanh(1)**2:.2f}")     # 0.42
print(f"tanh' at z=2: {1 - np.tanh(2)**2:.2f}")     # 0.07

# sigmoid derivative
def sig(z): return 1/(1+np.exp(-z))
print(f"sigmoid' at z=0: {sig(0)*(1-sig(0)):.2f}")  # 0.25
print(f"sigmoid' at z=1: {sig(1)*(1-sig(1)):.2f}")  # 0.20
print(f"sigmoid' at z=2: {sig(2)*(1-sig(2)):.2f}")  # 0.10

# With tanh, gradient survives 20 steps: 0.42^20 = 2e-8
# With sigmoid: 0.20^20 = 1e-14 (much worse!)

The product of n such matrices has norm bounded by:

||JT JT-1 … Jt+1|| ≤ ||Wh||T-t · ∏k=t+1T maxi |f'(zk,i)|

For the gradient to not vanish, we need ||Wh|| · max|f'| ≥ 1 at every step. But for sigmoid, max|f'| = 0.25, so we'd need ||Wh|| ≥ 4. And for tanh, even though max|f'| = 1, in practice the activations saturate and f' ≈ 0.3-0.5 on average.

Jacobian Shrinkage per Step

See how the Jacobian shrinks the gradient at each time step. Drag the spectral radius of Wh and the average activation derivative to see how much gradient survives each step. The "effective multiplier" is their product — if it's below 1.0, gradients vanish exponentially.

ρ(Wh) 0.90
avg f' 0.50
The trap: If the effective multiplier (ρ · f') is less than 1, gradients vanish exponentially. If it's greater than 1, gradients explode exponentially. There's a razor-thin boundary at exactly 1 where gradients neither vanish nor explode — but staying on that boundary is impossible in practice. The system is fundamentally unstable. Bengio et al. proved that for almost all weight configurations, the gradient either vanishes or explodes. The vanishing case is far more common because activation functions (tanh, sigmoid) have derivatives ≤ 1, which acts as a persistent damping force.

Why feedforward networks don't suffer as badly

In a feedforward network, each layer has its own weight matrix W1, W2, ..., WL. The gradient product is ∏ Wl — different matrices at each step. Some might have large norms, some small, and they can partially balance out. Moreover, feedforward networks are typically 5-50 layers deep, while RNN sequences can be 100-1000 steps long.

In an RNN, it's the same matrix Wh (scaled by activation derivatives) at every step. If it shrinks, it shrinks the same way every step, compounding exponentially. There's no opportunity for different layers to cancel out.

The connection to dynamical systems

Bengio et al. drew an important analogy to dynamical systems theory. An RNN is a discrete-time dynamical system: ht+1 = f(Whht + ...). The vanishing gradient problem is equivalent to saying that this system has a contractive dynamics — nearby trajectories converge exponentially, which means perturbations (including gradient signals) are exponentially damped.

In dynamical systems, the Lyapunov exponent measures the rate of exponential convergence or divergence:

λ = limT→∞ (1/T) log ||∂hT/∂h0||

If λ < 0, the system is contractive (vanishing gradients). If λ > 0, it's chaotic (exploding gradients). The boundary λ = 0 is the "edge of chaos" — the only regime where long-range learning is theoretically possible.

The edge of chaos: Several researchers (Bertschinger and Natschlager 2004, Langton 1990) have argued that neural computation is most powerful at the edge of chaos — the boundary between ordered (gradient-vanishing) and chaotic (gradient-exploding) dynamics. LSTMs and Transformers can be understood as architectures that operate at this edge, maintaining information flow without instability.
python
# Measuring gradient shrinkage per step
import torch
import numpy as np

hidden_size = 128
W_h = torch.randn(hidden_size, hidden_size) * 0.1

# Spectral radius: largest eigenvalue magnitude
eigenvalues = torch.linalg.eigvals(W_h)
spectral_radius = torch.max(torch.abs(eigenvalues)).item()
print(f"Spectral radius: {spectral_radius:.3f}")

# Typical tanh derivative on random activations
h = torch.randn(hidden_size)
z = W_h @ h
f_prime = 1 - torch.tanh(z)**2
print(f"Mean |f'|: {f_prime.mean().item():.3f}")

# Effective multiplier per step
eff = spectral_radius * f_prime.mean().item()
print(f"Effective multiplier: {eff:.3f}")
print(f"After 50 steps: {eff**50:.2e}")
# Typical output: ~1e-10 or smaller
Why is the gradient problem worse for RNNs than for feedforward networks of similar depth?

Chapter 4: Eigenvalue Analysis

We've been talking informally about matrices "shrinking" or "growing" the gradient. Now let's make this precise using eigenvalues — the mathematical tool that tells us exactly how repeated matrix multiplication behaves in the long run.

If you haven't encountered eigenvalues before, here's the intuition: every square matrix has certain "preferred directions" called eigenvectors. When you multiply an eigenvector by the matrix, it just gets scaled — it doesn't rotate or change direction. The scaling factor is the eigenvalue λ. Repeated multiplication scales by λn, which is why eigenvalues control the exponential behavior of iterated matrix products.

Every square matrix A has eigenvalues λ1, λ2, ..., λn and corresponding eigenvectors v1, v2, ..., vn. When you multiply A by an eigenvector, it just scales it: A vi = λi vi. The eigenvalue tells you the scaling factor along that direction.

When you multiply by A repeatedly, the scaling compounds:

An vi = λin vi

The spectral radius ρ(A) is the magnitude of the largest eigenvalue: ρ(A) = maxii|. This number determines the long-term behavior of repeated multiplication:

Spectral radius ρ(A)An as n → ∞Gradient behavior
ρ < 1→ 0 (exponentially)Vanishes
ρ = 1Bounded but may not convergeStable (rare)
ρ > 1→ ∞ (exponentially)Explodes

Applying eigenvalue analysis to RNNs

Now apply this to the Jacobian product in BPTT. The effective "matrix" being multiplied at each step is Jk = diag(f'(zk)) · Wh. While the activation derivative diag(f'(zk)) changes at each step (it depends on the hidden state), the weight matrix Wh is fixed. The eigenvalues of Wh determine the base rate of growth or decay, and the activation derivatives modulate this base rate at each step.

Bengio et al. showed that the gradient ∂hT/∂ht decays as:

||∂hT/∂ht|| ≤ C · γT-t

Where γ = ρ(Wh) · max|f'| is the effective spectral radius, and C is a constant. This is the precise formulation of exponential decay.

Eigenvalue Decay Visualization

Watch how repeated multiplication by a matrix with a given spectral radius affects a vector. Drag ρ to change the eigenvalue magnitude. When ρ < 1, the vector shrinks to zero. When ρ > 1, it explodes. The gradient experiences this same fate.

ρ 0.80
||v|| = 1.000

The eigenvalue spectrum of Wh

For a randomly initialized weight matrix Wh with entries drawn from N(0, σ2/n), the spectral radius is approximately σ. Standard initialization schemes like Xavier (Glorot) give σ ≈ 1/√n, leading to ρ ≈ 1. But the activation derivative f' < 1 (for tanh/sigmoid) pulls the effective spectral radius below 1. Even with "good" initialization, the gradient still vanishes.

The Marchenko-Pastur law from random matrix theory tells us that for large random matrices, the eigenvalue distribution converges to a specific shape. For Gaussian Wh with variance 1/n, the spectral radius is approximately 2/√n · √n = 2 (for n = 1), but in practice it's close to 1.0 for Xavier initialization. The key point: random initialization places us near the critical boundary ρ = 1, but the activation function pushes us below it.

What if we intentionally set ρ > 1 to counteract the activation derivative? Then we risk gradient explosion. And even with ρ exactly right, the system is chaotic — small perturbations to Wh during training push ρ above or below 1, alternating between explosion and vanishing.

Orthogonal initialization

One approach to taming eigenvalues is orthogonal initialization: initialize Wh as an orthogonal matrix (WhT Wh = I). An orthogonal matrix has all eigenvalues on the unit circle (|λ| = 1), so An doesn't grow or shrink for any n. This is theoretically ideal.

In practice, orthogonal initialization helps at the start of training but the weights quickly move away from orthogonality as training progresses. The activation derivatives also break the perfect eigenvalue structure. Still, it's significantly better than random Gaussian initialization, and is the recommended initialization for RNN hidden-to-hidden weights.

python
# Orthogonal vs Gaussian initialization
import torch
import torch.nn as nn

n = 64
# Gaussian: eigenvalues scattered, some > 1, some < 1
W_gauss = torch.randn(n, n) / n**0.5
eigs_g = torch.linalg.eigvals(W_gauss).abs()
print(f"Gaussian: max|eig|={eigs_g.max():.3f}, min={eigs_g.min():.3f}")

# Orthogonal: ALL eigenvalues have |eig| = 1
W_orth = torch.empty(n, n)
nn.init.orthogonal_(W_orth)
eigs_o = torch.linalg.eigvals(W_orth).abs()
print(f"Orthogonal: max|eig|={eigs_o.max():.3f}, min={eigs_o.min():.3f}")
# Output: all eigenvalue magnitudes ≈ 1.000
The fundamental dilemma: For long-range learning, we need gradients to flow unchanged across many time steps. This requires the Jacobian product to have norm ≈ 1 for all T-t. But a product of n matrices with norm ≈ 1 is exponentially sensitive to perturbations — if any factor is even slightly > 1 or < 1, the product diverges or collapses exponentially. Maintaining stability across hundreds of steps is like balancing a pencil on its tip — technically possible, but practically infeasible.
python
# Eigenvalue analysis of a recurrent weight matrix
import torch
import numpy as np

n = 64
W_h = torch.randn(n, n) / np.sqrt(n)  # Xavier init

# Eigenvalues of W_h
eigs = torch.linalg.eigvals(W_h)
magnitudes = torch.abs(eigs)
print(f"Spectral radius: {magnitudes.max().item():.3f}")
print(f"Mean |eigenvalue|: {magnitudes.mean().item():.3f}")

# Simulate gradient norm after T steps
avg_fprime = 0.65  # typical for tanh
eff_rho = magnitudes.max().item() * avg_fprime

for T in [5, 10, 20, 50, 100]:
    grad_scale = eff_rho ** T
    print(f"T={T:3d}: gradient scale = {grad_scale:.2e}")
# T=  5: ~0.116
# T= 10: ~0.013
# T= 20: ~1.8e-4
# T= 50: ~4.4e-10
# T=100: ~1.9e-19
What determines whether the gradient vanishes or explodes in an RNN?

Chapter 5: The Vanishing Gradient Theorem

We've built up the intuition piece by piece. Now let's state the formal theorem from Bengio et al., which makes everything precise.

Why formalize the intuition?

You might wonder: we already know the gradient decays exponentially. Why do we need a formal theorem? Because the formal statement gives us precise conditions under which learning will fail, and these conditions tell us exactly what an architectural fix must accomplish. Without the theorem, we'd be guessing at solutions. With it, we can engineer them.

Setup

Consider an RNN with hidden state update:

ht = f(Wh ht-1 + Wx xt + b)

where f is a bounded, differentiable activation function (e.g., tanh, sigmoid). Let σmax = max|f'(z)| be the maximum derivative of f. For tanh, σmax = 1. For sigmoid, σmax = 0.25.

Theorem (Bengio et al., 1994)

Sufficient condition for vanishing gradients: If the largest singular value of Wh satisfies:

σmax(f') · ||Wh||2 < 1

then the gradient ∂hT/∂ht vanishes exponentially as T - t grows:

||∂hT/∂ht|| ≤ (σmax · ||Wh||)T-t

Conversely, if σmax(f') · ρ(Wh) > 1 and the hidden states don't saturate, gradients can explode exponentially.

The proof (intuitive version)

Step 1: Write the Jacobian at each step.

∂hk/∂hk-1 = Dk Wh

where Dk = diag(f'(zk)) is a diagonal matrix with entries in [0, σmax].

Step 2: Bound the norm of each factor.

||Dk Wh|| ≤ ||Dk|| · ||Wh|| ≤ σmax · ||Wh||

Step 3: The total gradient is a product of (T-t) such factors.

||∂hT/∂ht|| = ||∏k=t+1T Dk Wh|| ≤ ∏k=t+1T ||Dk Wh|| ≤ (σmax · ||Wh||)T-t

Step 4: If σmax · ||Wh|| < 1, then (σmax · ||Wh||)T-t → 0 exponentially as T-t → ∞.

That's the full proof. It's beautifully simple: each step shrinks the gradient by at most σmax · ||Wh||, and after enough steps, the shrinkage compounds to zero.

The tightness of the bound

Is this bound tight? In practice, the actual gradient magnitude is often much smaller than the upper bound, because the bound uses the worst-case activation derivative σmax at every step. In reality, different dimensions saturate at different steps, and the average activation derivative is below σmax.

For tanh with typical hidden states:

Upper bound:max · ||Wh||)T = (1.0 · ||Wh||)T
Actual: ≈ (0.4 · ||Wh||)T   (using average f' ≈ 0.4)

So the actual decay is often 2-3x faster per step than the upper bound predicts. The bound guarantees vanishing when σmax · ||Wh|| < 1, but in practice vanishing occurs even when σmax · ||Wh|| ≈ 1-2, because the average behavior is worse than the worst case.

A practical rule of thumb: For a tanh RNN, gradients effectively vanish after T ≥ 20/log(1/γ) steps, where γ is the effective per-step shrinkage factor. For γ = 0.7: T ≥ 20/0.36 ≈ 56 steps. For γ = 0.5: T ≥ 20/0.69 ≈ 29 steps. Most RNNs can learn dependencies up to ~20 steps but struggle beyond ~50.
The Vanishing Gradient Theorem Visualized

The gradient magnitude at each time step, plotted as a bar. Choose σmax and ||Wh|| to see how the bound (σmax · ||Wh||)T-t shapes the decay. The warm line is the theoretical upper bound.

σmax 1.00
||Wh|| 0.80

What the theorem means in practice

For sigmoid activations (σmax = 0.25): the gradient vanishes unless ||Wh|| > 4. But ||Wh|| > 4 puts us firmly in the exploding regime for many inputs. It's virtually impossible to train sigmoid RNNs on long sequences.

For tanh activations (σmax = 1.0): the gradient vanishes unless ||Wh|| ≥ 1. This is achievable — but in practice, the activations saturate (|h| large), driving f' well below 1. So even with ||Wh|| ≈ 1, the effective multiplier drops below 1 and gradients vanish.

For ReLU activations (f'(z) = 0 or 1): the derivative doesn't shrink the gradient, but half the neurons are dead (f' = 0), which creates different problems (the "dying ReLU" problem in RNNs). When z < 0, the gradient is exactly zero — that dimension contributes nothing to learning. And with recurrence, a dead neuron stays dead because its gradient is zero at every future time step.

The necessary condition for learning

Bengio et al. also proved a necessary condition for gradient-based learning of long-term dependencies. For the gradient to remain large enough to enable learning over T steps, we need:

σmax · ρ(Wh) ≥ 1

But satisfying this condition almost certainly leads to gradient explosion for some inputs (those where activations don't saturate). This is the Catch-22:

If σmax · ρ < 1
Gradients vanish: long-range learning impossible
If σmax · ρ = 1
Unstable: any perturbation pushes to vanishing or exploding
If σmax · ρ > 1
Gradients explode: training diverges (loss → NaN)

There is no stable operating point for gradient-based learning of long-term dependencies in vanilla RNNs. This is the fundamental result of the paper.

Empirical verification

The paper verified this theory with experiments on a simple task: given a sequence where the first few inputs determine the correct output at the final step, can the RNN learn the dependency? They found that for sequences longer than ~20 steps, standard RNNs completely failed to learn. The gradient at the critical input was below the numerical precision of 32-bit floating point.

python
# Empirical test: can an RNN learn a long-range dependency?
import torch
import torch.nn as nn

# Task: output at step T should equal input at step 0
def make_data(T, n_samples=100):
    X = torch.zeros(n_samples, T, 1)
    X[:, 0, 0] = torch.randn(n_samples)  # signal at step 0
    Y = X[:, 0, 0]  # target = input at step 0
    return X, Y

# Test at different sequence lengths
for T in [5, 10, 20, 50, 100]:
    rnn = nn.RNN(1, 32, batch_first=True)
    opt = torch.optim.Adam(rnn.parameters(), lr=0.01)
    X, Y = make_data(T)

    for epoch in range(500):
        out, _ = rnn(X)
        loss = ((out[:, -1, 0] - Y)**2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    print(f"T={T:3d}: final loss = {loss.item():.4f}")
# T=  5: 0.02 (learns well)
# T= 10: 0.15 (struggles)
# T= 20: 0.85 (barely learns)
# T= 50: 1.00 (fails — predicts mean)
# T=100: 1.00 (complete failure)
The bottom line: For any bounded activation function (tanh, sigmoid, etc.), the vanishing gradient problem is a mathematical certainty for sufficiently long sequences. No amount of careful initialization or learning rate tuning can change this fundamental property. The only solutions are architectural: change how information flows through the network (LSTM gates, skip connections, attention).
According to the theorem, what condition guarantees that gradients will vanish exponentially?

Chapter 6: Gradient Explorer

Theory is beautiful, but seeing is believing. This chapter gives you a full interactive simulation where you can build an RNN, feed it a sequence, and watch the gradient flow backward through time — seeing exactly when and how it vanishes.

Interactive Gradient Explorer

Configure an RNN and watch gradient magnitudes at each time step. The top panel shows the forward pass (hidden states). The bottom panel shows the backward pass (gradient magnitudes). Experiment with different weight scales and activation functions to see vanishing vs. exploding gradients.

Wh scale 0.90
Seq length 20
Ready
What to try:
1. Set Wh scale = 0.5, length = 20. Watch the gradient vanish to near-zero within 10 steps.
2. Set Wh scale = 1.5, length = 20. Watch the gradient explode — it grows by orders of magnitude.
3. Set Wh scale = 1.0, length = 40. The gradient is unstable — it fluctuates wildly because the effective multiplier hovers near 1.
4. Switch to sigmoid activation. Even at Wh = 1.5, the gradient vanishes because sigmoid's max derivative is only 0.25.

Notice the fundamental asymmetry. With tanh, you can find a Wh scale that keeps gradients alive for ~10 steps. But extending to 50 or 100 steps is impossible — the system is too sensitive. Any small deviation from the critical point cascades exponentially.

The locus of difficulty

An important observation from the paper: the gradient problem affects different dimensions of the hidden state differently. The Jacobian product preferentially amplifies components along the eigenvector with the largest eigenvalue and suppresses all others. After many steps, only information aligned with the dominant eigenvector survives.

This means the RNN's effective memory capacity decreases with sequence length. At step 5, the hidden state might use all 64 dimensions. By step 50, only the 2-3 dimensions aligned with the dominant eigenvectors carry meaningful gradient signal. The rest are "gradient dead zones" — they still hold information from the forward pass, but no gradient reaches them to refine that information.

The hidden capacity trap: The forward pass can encode information into all dimensions of the hidden state. But the backward pass (gradient) can only reach a few dimensions — those aligned with the dominant eigenvectors of the Jacobian product. So the network can store long-range information but can never learn to store it correctly, because the gradient signal along most dimensions has vanished.

Attempts at solving without architecture changes

Several approaches were tried before architectural solutions (LSTM, attention) took over:

ApproachIdeaWhy it failed
Truncated BPTTOnly backpropagate for k steps, not all TBy definition cannot learn dependencies longer than k
Echo state networksFix Wh, only learn output weightsRandom fixed dynamics can't encode task-specific patterns
Second-order methodsUse curvature (Hessian) to guide stepsO(n2) memory, too expensive for large networks
Careful initializationInitialize Wh near the critical pointTraining quickly moves weights away from the critical point
Nesterov momentumBetter optimizer for smooth landscapesThe landscape isn't smooth — it has cliffs (Pascanu 2013)

The common thread: all of these approaches try to fix the optimization without changing the architecture. But the vanishing gradient is an architectural problem — it's intrinsic to the recurrence structure. The solution must be architectural too.

python
# Comparing truncated BPTT vs full BPTT
import torch
import torch.nn as nn

# Truncated BPTT: only backpropagate k steps
def train_truncated_bptt(model, sequence, k=20):
    """Truncated BPTT: gradient only flows k steps back."""
    h = torch.zeros(1, 1, 64)
    total_loss = 0

    for i in range(0, len(sequence), k):
        chunk = sequence[i:i+k]
        h = h.detach()  # stop gradient flow beyond k steps!
        out, h = model(chunk, h)
        loss = compute_loss(out)
        loss.backward()
        total_loss += loss.item()

    # Problem: dependencies longer than k steps can NEVER be learned
    # The gradient literally cannot reach inputs more than k steps back
    return total_loss

The learning consequence

What does vanishing gradients mean for actual learning? Consider a sequence where a critical input appears at step 3 and the loss is computed at step 20. The gradient of the loss with respect to that input is proportional to ∂h20/∂h3, which is the product of 17 Jacobians. If each has norm 0.8, the gradient is 0.817 ≈ 0.023 — 43 times smaller than the gradient at step 19.

The network receives almost all its learning signal from the most recent few steps. Distant inputs are effectively invisible to gradient descent. The network can learn short-range patterns ("the" often follows "in") but not long-range ones ("she" refers to "Maria" from 200 words ago).

python
# Measuring gradient norms at each time step in PyTorch
import torch
import torch.nn as nn

rnn = nn.RNN(10, 64, batch_first=True)
x = torch.randn(1, 30, 10, requires_grad=True)  # length-30 sequence
h0 = torch.zeros(1, 1, 64)

output, hn = rnn(x, h0)
loss = output[0, -1, :].sum()  # loss at final step
loss.backward()

# Gradient at each time step
grad_norms = x.grad[0].norm(dim=1)
for t, g in enumerate(grad_norms):
    print(f"t={t:2d}: gradient norm = {g.item():.6f}")
# Typical: t=29 ~ 0.5, t=20 ~ 0.01, t=10 ~ 0.0001, t=0 ~ 1e-7
In the gradient explorer, what happens when you set Wh scale = 0.5 with tanh activation and sequence length 20?

Chapter 7: Connections

Bengio et al.'s 1994 paper didn't just identify the problem — it shaped the next two decades of sequence modeling research. Every major architectural innovation in NLP can be understood as a response to the vanishing gradient problem.

Solutions that emerged

SolutionYearCore IdeaHow it helps
LSTM1997Gated memory cell with additive updatesCell state uses addition not multiplication, so gradients flow through the "constant error carousel" with Jacobian ≈ 1
Gradient clipping2013Cap gradient norm at a thresholdPrevents explosion, but can't fix vanishing (can't make a small gradient bigger)
GRU2014Simplified LSTM with fewer gatesSame additive update principle, fewer parameters, similar performance
Residual connections2015ht = ht-1 + F(ht-1)Additive skip creates gradient highways — Jacobian is I + JF, whose eigenvalues ≈ 1
Attention2015Direct connections between any two time stepsGradient can flow directly from output to any input step — no chain of multiplications
Transformer2017Remove recurrence entirely, use only attentionNo sequential Jacobian chain at all — every position connects to every other position
The LSTM insight: The key innovation of Hochreiter & Schmidhuber's LSTM (1997) was replacing the multiplicative state update ht = f(Wht-1 + ...) with an additive one: ct = ft ⊙ ct-1 + it ⊙ gt. The forget gate ft can be set close to 1, making ∂ct/∂ct-1 ≈ 1. This is the "constant error carousel" — gradients flow through the cell state without shrinking.

The Transformer solution

The most radical response to the vanishing gradient problem was the Transformer (Vaswani et al., 2017), which eliminated recurrence entirely. In a Transformer, every position in the sequence can attend directly to every other position through self-attention. The gradient from position T to position t flows through at most O(1) layers, not O(T-t) sequential multiplications.

This is why Transformers can handle sequences of thousands of tokens — the gradient path between any two tokens is short, regardless of their distance in the sequence.

The deeper lesson

The vanishing gradient problem teaches a fundamental principle about neural network design: the gradient path matters as much as the forward path. A network might be expressive enough to represent any function, but if the gradients can't flow to the relevant parameters, the network can't learn that function.

This principle extends beyond RNNs. Residual connections in deep CNNs (ResNet, 2015) solve the same problem for deep feedforward networks. LayerNorm and BatchNorm stabilize gradients by controlling activation scales. The entire history of deep learning architecture design can be read as a struggle to keep gradients flowing.

The LSTM constant error carousel in detail

To see exactly how the LSTM solves the problem, compare the Jacobian structures:

Vanilla RNN: ∂ht/∂ht-1 = diag(f'(z)) Wh
Product norm: ≤ (σmax ||Wh||)T → 0
LSTM cell state: ∂ct/∂ct-1 = diag(ft)
Product norm: = ∏ ft ≈ 1 when ft ≈ 1

The vanilla RNN multiplies by Wh at each step, while the LSTM cell multiplies by the forget gate ft, which is a scalar per dimension. When ft ≈ 1, the gradient passes through unchanged. This is the "constant error carousel" — a highway for gradients that bypasses the multiplicative bottleneck.

python
# LSTM gradient flow: constant error carousel
import torch

# Simulate LSTM cell state gradient flow
T = 100
grad_rnn = 1.0
grad_lstm = 1.0
gamma_rnn = 0.7  # typical RNN decay factor
forget_gate = 0.98  # LSTM forget gate near 1

for t in range(T):
    grad_rnn *= gamma_rnn
    grad_lstm *= forget_gate

print(f"After {T} steps:")
print(f"  RNN gradient: {grad_rnn:.2e}")   # ~3.2e-16
print(f"  LSTM gradient: {grad_lstm:.2e}")  # ~0.13
# LSTM preserves gradient 10^15 times better!
Gradient Path Comparison

Compare the gradient path from output to distant input in three architectures: vanilla RNN (chain of multiplications), LSTM (constant error carousel), and Transformer (direct attention). Toggle between them to see how the gradient magnitude differs.

Related lessons

To continue from here:

This paper
Why gradients vanish in RNNs (Bengio 1994)
Gradient clipping, the cliff phenomenon, practical training recipes
The Transformer — eliminates recurrence entirely with self-attention

Impact on modern practice

Even though RNNs are rarely used today, the vanishing gradient insight affects every modern architecture:

Modern techniqueDirectly motivated by vanishing gradients
Residual connections (ResNet, Transformer)Additive skip connections keep gradient path length O(1)
Layer normalizationPrevents activation scale drift that worsens gradient flow
GELU/SiLU activationSmoother derivatives than ReLU, better gradient flow
Gradient clippingPrevents the exploding gradient companion problem
Pre-norm architecturePuts normalization inside residual branch for cleaner gradients
Weight initialization schemesXavier/He init targets unit-variance activations to stabilize gradients

The Transformer's self-attention mechanism is the most complete solution: it creates a direct gradient path (O(1) length) between any two positions in the sequence. No chain of Jacobians, no exponential decay, no spectral radius concerns. But the Transformer still uses residual connections and layer normalization — because even within a single layer, gradient flow matters.

The continuing relevance

The vanishing gradient problem has not been "solved" in a general sense. It reappears whenever information must flow through many sequential transformations: very deep networks (100+ layers), long-horizon reinforcement learning, recurrent state-space models, and neural ODEs. Each new architecture must grapple with the same fundamental question Bengio asked in 1994: "Can the gradient reach the parameters that need to change?"

The return of recurrence: state-space models

Interestingly, recurrence has made a comeback with state-space models (SSMs) like Mamba (2023). SSMs reformulate the recurrence as a linear system with structured weight matrices, allowing efficient parallel training while maintaining the theoretical benefits of recurrence (O(1) memory per step, linear scaling with sequence length).

The key insight of modern SSMs: the vanishing gradient problem only applies to nonlinear recurrences (tanh, sigmoid). For linear recurrences (ht = Aht-1 + Bxt), the dynamics can be controlled precisely through the eigenvalues of A. By parameterizing A as a diagonal matrix with eigenvalues on the unit circle, SSMs achieve stable gradient flow over thousands of steps — essentially solving Bengio's problem within the recurrence framework itself.

ArchitectureRecurrenceGradient pathParallel training
Vanilla RNNNonlinearO(T), vanishesNo
LSTMGated nonlinearO(T), mitigatedNo
TransformerNoneO(1)Yes
Mamba / SSMLinear + selectiveO(T), controlledYes (via scan)

Whether SSMs will eventually challenge the Transformer's dominance remains an open question. But their existence proves that the vanishing gradient problem in recurrence is not unsolvable — it just required rethinking the structure of the recurrence itself, exactly as Bengio et al. suggested 30 years ago.

Closing thought: "The problem with recurrent networks is not that they can't represent long-term dependencies — it's that gradient descent can't find them." This realization shifted the entire field from asking "can this architecture represent the function?" to "can this architecture learn the function?" The answer depends on the gradient flow, not just the expressive power.
How does the LSTM solve the vanishing gradient problem?