The formal proof that gradient-based learning in recurrent neural networks fails catastrophically for long-range dependencies — and the mathematical explanation of why.
Imagine you're reading a novel. On page 3 the author writes: "Maria grew up speaking Portuguese." Two hundred pages later, a character says something to Maria in Portuguese and she responds fluently. You understand why — you remember the detail from page 3. You carried that information across hundreds of pages of intervening text.
Now imagine asking a neural network to do the same thing. You feed it a sequence of words, one at a time, and at the end you ask it a question whose answer depends on something it saw many steps ago. This is the long-range dependency problem: can a network learn to remember and use information from the distant past?
In theory, a recurrent neural network (RNN) should handle this perfectly. An RNN has a hidden state — a vector of numbers that gets updated at every time step. This hidden state is the network's "memory." It carries information from the past forward through time. At each step, the RNN reads the current input, combines it with the hidden state, and produces a new hidden state. In principle, this hidden state can encode anything from the past.
In practice, it doesn't work. Not for long sequences. An RNN can remember what it saw 5 or 10 steps ago, but not 50 or 100 steps ago. The information doesn't vanish from the hidden state — it's still there, encoded somewhere in those numbers. The problem is that the network can't learn to use it, because the gradient signal used for learning decays exponentially with the number of time steps.
Think about what "exponentially" means here. If the gradient shrinks by a factor of 0.7 at each step, then after 10 steps it's 0.710 = 0.028 (97% lost). After 20 steps: 0.720 = 0.0008 (99.9% lost). After 50 steps: 0.750 ≈ 1.8 × 10-8. The gradient is, for all practical purposes, zero. The network receives no signal about how to use information from 50 steps ago.
An RNN processes a sequence. A critical piece of information appears at the start (warm dot). The network must use it at the end. Drag the sequence length slider to see how the gradient signal from output to that critical input decays.
This paper was published in 1994, but its consequences echoed through the next 25 years of deep learning. It motivated the invention of LSTMs (1997), GRUs (2014), and ultimately the Transformer (2017), which solved the problem by replacing recurrence with attention.
By 1994, the neural network community knew empirically that RNNs couldn't learn long-term dependencies. Researchers had tried and failed on tasks like temporal order detection (remember which symbol appeared first), string counting, and context-free grammar learning. But nobody had a formal explanation for why these failures occurred.
Bengio, Simard, and Frasconi provided that explanation. Their key insight was to analyze the Jacobian product chain — the sequence of matrices that the gradient must pass through during backpropagation. They proved that for any bounded activation function, this product either vanishes exponentially (the common case) or explodes exponentially (the dangerous case). There is essentially no middle ground.
| Year | Event | Connection |
|---|---|---|
| 1974 | Werbos describes backpropagation in his PhD thesis | The algorithm whose limitation this paper analyzes |
| 1986 | Rumelhart, Hinton, Williams popularize backprop | Makes neural network training practical — but not for sequences |
| 1990 | Elman introduces the simple RNN | The architecture this paper analyzes |
| 1993 | Empirical failures on long-range tasks documented | Motivates the theoretical analysis |
| 1994 | This paper: formal proof of vanishing gradients | Explains why RNNs fail on long sequences |
| 1997 | LSTM introduced (Hochreiter & Schmidhuber) | The first architectural solution to this paper's problem |
Let's understand exactly why gradients vanish — starting with how RNNs work.
To understand why gradients vanish, we first need to understand how an RNN processes a sequence. The key insight is that an RNN is really a single set of weights applied repeatedly over time. We can "unroll" this repetition into a long chain.
An RNN maintains a hidden state ht that evolves over time. At each time step t, the network takes the current input xt and the previous hidden state ht-1, and produces a new hidden state:
Where Wh is the recurrent weight matrix (hidden-to-hidden connections), Wx is the input weight matrix (input-to-hidden connections), b is a bias vector, and f is a nonlinear activation function (typically tanh or sigmoid — we'll see later why the choice of activation function is critical for the gradient problem). The output at time t is then computed from the hidden state:
Here's the crucial observation: the same weights Wh, Wx, Wy are used at every time step. When we unroll the RNN across T time steps, we get a very deep feedforward network where the same weight matrices appear at every layer. This "weight sharing" is what gives RNNs their power (they can process sequences of any length with a fixed number of parameters) but also what creates the gradient problem (the same matrix is multiplied over and over, causing systematic exponential behavior).
This weight sharing also means that a single gradient update to Wh affects the behavior at every time step simultaneously. If the gradient is dominated by the most recent time steps (because distant gradients have vanished), then the weight update primarily reflects the network's short-term behavior. Long-term patterns get no representation in the gradient, and thus no influence on learning.
Left: the compact RNN diagram with a self-loop on the hidden state. Right: the same RNN unrolled across time steps. Click "Unroll" to see the transformation. Notice how the same weights Wh appear at every step.
Let's write this concretely. For a sequence of length T, the hidden states are:
Each hidden state depends on the previous one through the recurrent weight matrix Wh. This creates a chain of dependencies: hT depends on hT-1, which depends on hT-2, all the way back to h0. The depth of this dependency chain equals the sequence length — for a 200-word sentence, that's 200 layers of recurrence.
Compare this to a typical CNN or feedforward network, which might have 50-100 layers. An RNN processing a modest paragraph (100 words) creates a computation graph 100 layers deep, where every layer uses the same weights. The gradient must traverse all 100 layers during backpropagation. This is where the trouble begins.
python import torch import torch.nn as nn # A simple RNN cell: same weights applied at every step hidden_size = 64 input_size = 10 W_h = torch.randn(hidden_size, hidden_size) * 0.1 # recurrent weights W_x = torch.randn(hidden_size, input_size) * 0.1 # input weights b = torch.zeros(hidden_size) # bias def rnn_forward(inputs, h0): """Unrolled RNN: same W_h at every step.""" h = h0 states = [h] for x_t in inputs: h = torch.tanh(W_h @ h + W_x @ x_t + b) # same W_h each step! states.append(h) return states # Process a sequence of 20 time steps seq = [torch.randn(input_size) for _ in range(20)] h0 = torch.zeros(hidden_size) hidden_states = rnn_forward(seq, h0) # hidden_states[20] depends on ALL previous inputs through W_h
The chain of hidden states is the computation graph through which gradients must flow. In the next chapter, we'll see exactly how gradients flow backward through this chain — and why the repeated multiplication by Wh creates an exponential problem.
Let's be precise about what's shared and what's not. For an RNN with hidden size n and input size m:
| Parameter | Shape | Count | Shared across time? |
|---|---|---|---|
| Wh | [n, n] | n² | Yes — same at every step |
| Wx | [n, m] | n × m | Yes — same at every step |
| b | [n] | n | Yes — same at every step |
| Wy | [output, n] | output × n | Yes — same at every step |
Total: roughly n² + nm parameters, regardless of sequence length. A feedforward network with the same "depth" (layers = sequence length T) would have T × n² parameters — one weight matrix per layer. The RNN trades parameter efficiency (shared weights) for gradient instability (same matrix multiplied T times).
The hidden state ht is a vector of n numbers. For tanh activation, each number is between -1 and 1. In principle, this vector can encode 2n distinguishable states (if we discretize each dimension to ±1). For n = 64, that's 264 ≈ 1.8 × 1019 possible states. The hidden state has enormous capacity. The problem isn't capacity — it's learnability.
This distinction between information capacity and learnability is one of the deepest insights in the paper. An RNN is a universal approximator — given enough hidden units, it can compute any computable function. But being able to compute a function is different from being able to learn it from data via gradient descent.
The capacity is exponential in hidden size: 2n distinguishable states. But the learnability drops exponentially with sequence length: gradients decay as γT. At some point, the learnable range hits zero — not because the architecture can't represent the function, but because gradient descent can't navigate to it.
This is why architectures matter. The same function can be represented by an RNN, an LSTM, or a Transformer. But only the latter two can learn it from data, because their gradient structure allows long-range credit assignment. Architecture is not just about representation — it's about optimization.
We know how to train feedforward networks: backpropagation. We compute the loss, then propagate the error backward through the network, computing gradients layer by layer. For an RNN unrolled across T time steps, we do exactly the same thing — we call it Backpropagation Through Time (BPTT).
Suppose we have a loss at the final time step T:
We want ∂L/∂Wh — how the loss changes when we adjust the recurrent weights. Since Wh is used at every time step, the total gradient is a sum over all time steps:
The critical term is ∂hT/∂ht — how the final hidden state depends on the hidden state at time t. By the chain rule, this is a product of Jacobian matrices:
Each factor ∂hk/∂hk-1 is the Jacobian of the state transition function. For our RNN with hk = f(Wh hk-1 + Wx xk + b):
Where zk = Wh hk-1 + Wx xk + b is the pre-activation, and diag(f'(zk)) is a diagonal matrix of activation derivatives. So the total Jacobian product is:
This is a product of (T - t) matrices. Each matrix in the product is Wh scaled by the activation derivative at that time step. This is where the exponential problem lives.
The crucial difference between BPTT and standard backpropagation is this: in a feedforward network, the gradient passes through different weight matrices at each layer. In BPTT, it passes through the same matrix (modified by the activation derivatives) at every time step. This repetition is what makes the exponential behavior unavoidable.
Consider the analogy of compound interest. If you multiply a number by 0.95 once, you lose 5%. Multiply by 0.95 twice, you lose 9.75%. After 100 multiplications: 0.95100 = 0.006 — you've lost 99.4%. Now imagine multiplying by 0.95 once and by 1.05 once, alternating. The product is 0.95 × 1.05 = 0.9975 per pair — much less shrinkage, because the two factors partially cancel. This is why feedforward nets (different weights at each layer) suffer less than RNNs (same weights at every step).
In an RNN, every factor in the product has the same "direction" of shrinkage or growth. There's no cancellation. The exponential behavior is systematic, not random.
Watch the gradient flow backward from the loss through each time step. At each step, the gradient is multiplied by the Jacobian (activation derivative × Wh). Click "Step Back" to propagate the gradient one step at a time.
python # BPTT: computing the Jacobian product chain import torch def bptt_gradient(W_h, hidden_states, T, t): """Compute dh_T/dh_t by chaining Jacobians backward.""" # Start with identity matrix (gradient at step T) jacobian_product = torch.eye(W_h.shape[0]) for k in range(T, t, -1): # Pre-activation at step k z_k = W_h @ hidden_states[k-1] # tanh derivative: f'(z) = 1 - tanh(z)^2 f_prime = 1 - torch.tanh(z_k)**2 # Jacobian at step k: diag(f') @ W_h J_k = torch.diag(f_prime) @ W_h # Accumulate product jacobian_product = jacobian_product @ J_k return jacobian_product # The norm of this product tells us the gradient magnitude # For T-t = 50, with tanh activation, this is typically ~1e-15
The gradient at time step t has magnitude proportional to the norm of this Jacobian product. If each factor shrinks the gradient by some factor γ < 1, then after (T - t) steps the gradient has magnitude γT-t. For γ = 0.9 and T - t = 50, that's 0.950 ≈ 0.005. For γ = 0.7, it's 0.750 ≈ 1.8 × 10-8. For practical RNNs with tanh activation, γ is often around 0.3-0.5, making the gradient essentially zero after just 10-20 steps.
Let's trace through a tiny RNN with hidden size 2 to see the gradient vanish in exact arithmetic. Suppose:
And suppose the activation derivatives are all approximately 0.8 (tanh operating near the origin). Then each Jacobian factor is approximately:
The spectral radius of this 2×2 matrix is ρ ≈ 0.46. After 5 steps: 0.465 ≈ 0.021. After 10 steps: 0.4610 ≈ 0.00043. After 20 steps: 0.4620 ≈ 1.8 × 10-7. The gradient has effectively vanished.
python # Concrete example: gradient vanishing in a 2-unit RNN import numpy as np W_h = np.array([[0.5, 0.1], [0.2, 0.4]]) f_prime = 0.8 # typical tanh derivative J = f_prime * W_h # Product of n Jacobians product = np.eye(2) for n in range(1, 21): product = product @ J norm = np.linalg.norm(product) print(f"n={n:2d}: ||J^n|| = {norm:.6f}") # n= 1: 0.464 n= 5: 0.021 n=10: 0.00043 n=20: 1.8e-7
We've seen that the gradient involves a product of Jacobian matrices. Now let's look more carefully at what this product actually computes and why — for any bounded activation function — it's mathematically doomed to vanish.
The Jacobian product is the heart of the vanishing gradient problem. Everything else — the specific activation function, the weight initialization, the optimizer — is secondary. If the Jacobian product shrinks exponentially, gradients vanish. If it grows exponentially, gradients explode. Understanding the Jacobian product is understanding the problem.
Each Jacobian in the chain has the form:
Let's think about what diag(f'(zk)) does. For tanh, the derivative is f'(z) = 1 - tanh(z)2. This value is always between 0 and 1. It equals 1 only when z = 0 (the activation is exactly at the origin), and approaches 0 as |z| grows (the activation saturates).
For sigmoid, f'(z) = σ(z)(1 - σ(z)), which peaks at 0.25 when z = 0. The maximum derivative of a sigmoid is only 0.25 — every time the gradient passes through a sigmoid, it shrinks by at least 75%!
| Activation | f'(z) at z=0 | Max f'(z) | f'(z) as |z| → ∞ |
|---|---|---|---|
| tanh | 1.0 | 1.0 | 0 |
| sigmoid | 0.25 | 0.25 | 0 |
| ReLU | 1.0 | 1.0 | 1.0 (or 0) |
So each Jacobian Jk = diag(f'(zk)) · Wh is Wh with each row scaled by a number between 0 and 1 (for tanh) or 0 and 0.25 (for sigmoid).
The shape of f'(z) is crucial to understanding why gradients vanish. For tanh:
The derivative is a bell curve centered at z = 0, reaching its maximum of 1.0 at the origin and decaying to 0 as |z| grows. In a trained RNN, the pre-activations z = Whh + Wxx + b are typically not centered at zero — they're spread across a range that depends on the input and learned weights. On average, |z| is large enough that f'(z) ≈ 0.3-0.5.
For sigmoid, the situation is worse. The sigmoid derivative peaks at only 0.25:
At z = 0: σ'(0) = 0.25. At z = ±2: σ'(±2) ≈ 0.1. At z = ±5: σ'(±5) ≈ 0.007. Every pass through a sigmoid layer shrinks the gradient by at least 75%, and usually much more.
python # Activation derivatives: why tanh is better than sigmoid import numpy as np z_values = np.linspace(-5, 5, 100) # tanh derivative tanh_deriv = 1 - np.tanh(z_values)**2 print(f"tanh' at z=0: {1 - np.tanh(0)**2:.2f}") # 1.00 print(f"tanh' at z=1: {1 - np.tanh(1)**2:.2f}") # 0.42 print(f"tanh' at z=2: {1 - np.tanh(2)**2:.2f}") # 0.07 # sigmoid derivative def sig(z): return 1/(1+np.exp(-z)) print(f"sigmoid' at z=0: {sig(0)*(1-sig(0)):.2f}") # 0.25 print(f"sigmoid' at z=1: {sig(1)*(1-sig(1)):.2f}") # 0.20 print(f"sigmoid' at z=2: {sig(2)*(1-sig(2)):.2f}") # 0.10 # With tanh, gradient survives 20 steps: 0.42^20 = 2e-8 # With sigmoid: 0.20^20 = 1e-14 (much worse!)
The product of n such matrices has norm bounded by:
For the gradient to not vanish, we need ||Wh|| · max|f'| ≥ 1 at every step. But for sigmoid, max|f'| = 0.25, so we'd need ||Wh|| ≥ 4. And for tanh, even though max|f'| = 1, in practice the activations saturate and f' ≈ 0.3-0.5 on average.
See how the Jacobian shrinks the gradient at each time step. Drag the spectral radius of Wh and the average activation derivative to see how much gradient survives each step. The "effective multiplier" is their product — if it's below 1.0, gradients vanish exponentially.
In a feedforward network, each layer has its own weight matrix W1, W2, ..., WL. The gradient product is ∏ Wl — different matrices at each step. Some might have large norms, some small, and they can partially balance out. Moreover, feedforward networks are typically 5-50 layers deep, while RNN sequences can be 100-1000 steps long.
In an RNN, it's the same matrix Wh (scaled by activation derivatives) at every step. If it shrinks, it shrinks the same way every step, compounding exponentially. There's no opportunity for different layers to cancel out.
Bengio et al. drew an important analogy to dynamical systems theory. An RNN is a discrete-time dynamical system: ht+1 = f(Whht + ...). The vanishing gradient problem is equivalent to saying that this system has a contractive dynamics — nearby trajectories converge exponentially, which means perturbations (including gradient signals) are exponentially damped.
In dynamical systems, the Lyapunov exponent measures the rate of exponential convergence or divergence:
If λ < 0, the system is contractive (vanishing gradients). If λ > 0, it's chaotic (exploding gradients). The boundary λ = 0 is the "edge of chaos" — the only regime where long-range learning is theoretically possible.
python # Measuring gradient shrinkage per step import torch import numpy as np hidden_size = 128 W_h = torch.randn(hidden_size, hidden_size) * 0.1 # Spectral radius: largest eigenvalue magnitude eigenvalues = torch.linalg.eigvals(W_h) spectral_radius = torch.max(torch.abs(eigenvalues)).item() print(f"Spectral radius: {spectral_radius:.3f}") # Typical tanh derivative on random activations h = torch.randn(hidden_size) z = W_h @ h f_prime = 1 - torch.tanh(z)**2 print(f"Mean |f'|: {f_prime.mean().item():.3f}") # Effective multiplier per step eff = spectral_radius * f_prime.mean().item() print(f"Effective multiplier: {eff:.3f}") print(f"After 50 steps: {eff**50:.2e}") # Typical output: ~1e-10 or smaller
We've been talking informally about matrices "shrinking" or "growing" the gradient. Now let's make this precise using eigenvalues — the mathematical tool that tells us exactly how repeated matrix multiplication behaves in the long run.
If you haven't encountered eigenvalues before, here's the intuition: every square matrix has certain "preferred directions" called eigenvectors. When you multiply an eigenvector by the matrix, it just gets scaled — it doesn't rotate or change direction. The scaling factor is the eigenvalue λ. Repeated multiplication scales by λn, which is why eigenvalues control the exponential behavior of iterated matrix products.
Every square matrix A has eigenvalues λ1, λ2, ..., λn and corresponding eigenvectors v1, v2, ..., vn. When you multiply A by an eigenvector, it just scales it: A vi = λi vi. The eigenvalue tells you the scaling factor along that direction.
When you multiply by A repeatedly, the scaling compounds:
The spectral radius ρ(A) is the magnitude of the largest eigenvalue: ρ(A) = maxi |λi|. This number determines the long-term behavior of repeated multiplication:
| Spectral radius ρ(A) | An as n → ∞ | Gradient behavior |
|---|---|---|
| ρ < 1 | → 0 (exponentially) | Vanishes |
| ρ = 1 | Bounded but may not converge | Stable (rare) |
| ρ > 1 | → ∞ (exponentially) | Explodes |
Now apply this to the Jacobian product in BPTT. The effective "matrix" being multiplied at each step is Jk = diag(f'(zk)) · Wh. While the activation derivative diag(f'(zk)) changes at each step (it depends on the hidden state), the weight matrix Wh is fixed. The eigenvalues of Wh determine the base rate of growth or decay, and the activation derivatives modulate this base rate at each step.
Bengio et al. showed that the gradient ∂hT/∂ht decays as:
Where γ = ρ(Wh) · max|f'| is the effective spectral radius, and C is a constant. This is the precise formulation of exponential decay.
Watch how repeated multiplication by a matrix with a given spectral radius affects a vector. Drag ρ to change the eigenvalue magnitude. When ρ < 1, the vector shrinks to zero. When ρ > 1, it explodes. The gradient experiences this same fate.
For a randomly initialized weight matrix Wh with entries drawn from N(0, σ2/n), the spectral radius is approximately σ. Standard initialization schemes like Xavier (Glorot) give σ ≈ 1/√n, leading to ρ ≈ 1. But the activation derivative f' < 1 (for tanh/sigmoid) pulls the effective spectral radius below 1. Even with "good" initialization, the gradient still vanishes.
The Marchenko-Pastur law from random matrix theory tells us that for large random matrices, the eigenvalue distribution converges to a specific shape. For Gaussian Wh with variance 1/n, the spectral radius is approximately 2/√n · √n = 2 (for n = 1), but in practice it's close to 1.0 for Xavier initialization. The key point: random initialization places us near the critical boundary ρ = 1, but the activation function pushes us below it.
What if we intentionally set ρ > 1 to counteract the activation derivative? Then we risk gradient explosion. And even with ρ exactly right, the system is chaotic — small perturbations to Wh during training push ρ above or below 1, alternating between explosion and vanishing.
One approach to taming eigenvalues is orthogonal initialization: initialize Wh as an orthogonal matrix (WhT Wh = I). An orthogonal matrix has all eigenvalues on the unit circle (|λ| = 1), so An doesn't grow or shrink for any n. This is theoretically ideal.
In practice, orthogonal initialization helps at the start of training but the weights quickly move away from orthogonality as training progresses. The activation derivatives also break the perfect eigenvalue structure. Still, it's significantly better than random Gaussian initialization, and is the recommended initialization for RNN hidden-to-hidden weights.
python # Orthogonal vs Gaussian initialization import torch import torch.nn as nn n = 64 # Gaussian: eigenvalues scattered, some > 1, some < 1 W_gauss = torch.randn(n, n) / n**0.5 eigs_g = torch.linalg.eigvals(W_gauss).abs() print(f"Gaussian: max|eig|={eigs_g.max():.3f}, min={eigs_g.min():.3f}") # Orthogonal: ALL eigenvalues have |eig| = 1 W_orth = torch.empty(n, n) nn.init.orthogonal_(W_orth) eigs_o = torch.linalg.eigvals(W_orth).abs() print(f"Orthogonal: max|eig|={eigs_o.max():.3f}, min={eigs_o.min():.3f}") # Output: all eigenvalue magnitudes ≈ 1.000
python # Eigenvalue analysis of a recurrent weight matrix import torch import numpy as np n = 64 W_h = torch.randn(n, n) / np.sqrt(n) # Xavier init # Eigenvalues of W_h eigs = torch.linalg.eigvals(W_h) magnitudes = torch.abs(eigs) print(f"Spectral radius: {magnitudes.max().item():.3f}") print(f"Mean |eigenvalue|: {magnitudes.mean().item():.3f}") # Simulate gradient norm after T steps avg_fprime = 0.65 # typical for tanh eff_rho = magnitudes.max().item() * avg_fprime for T in [5, 10, 20, 50, 100]: grad_scale = eff_rho ** T print(f"T={T:3d}: gradient scale = {grad_scale:.2e}") # T= 5: ~0.116 # T= 10: ~0.013 # T= 20: ~1.8e-4 # T= 50: ~4.4e-10 # T=100: ~1.9e-19
We've built up the intuition piece by piece. Now let's state the formal theorem from Bengio et al., which makes everything precise.
You might wonder: we already know the gradient decays exponentially. Why do we need a formal theorem? Because the formal statement gives us precise conditions under which learning will fail, and these conditions tell us exactly what an architectural fix must accomplish. Without the theorem, we'd be guessing at solutions. With it, we can engineer them.
Consider an RNN with hidden state update:
where f is a bounded, differentiable activation function (e.g., tanh, sigmoid). Let σmax = max|f'(z)| be the maximum derivative of f. For tanh, σmax = 1. For sigmoid, σmax = 0.25.
Step 1: Write the Jacobian at each step.
where Dk = diag(f'(zk)) is a diagonal matrix with entries in [0, σmax].
Step 2: Bound the norm of each factor.
Step 3: The total gradient is a product of (T-t) such factors.
Step 4: If σmax · ||Wh|| < 1, then (σmax · ||Wh||)T-t → 0 exponentially as T-t → ∞.
That's the full proof. It's beautifully simple: each step shrinks the gradient by at most σmax · ||Wh||, and after enough steps, the shrinkage compounds to zero.
Is this bound tight? In practice, the actual gradient magnitude is often much smaller than the upper bound, because the bound uses the worst-case activation derivative σmax at every step. In reality, different dimensions saturate at different steps, and the average activation derivative is below σmax.
For tanh with typical hidden states:
So the actual decay is often 2-3x faster per step than the upper bound predicts. The bound guarantees vanishing when σmax · ||Wh|| < 1, but in practice vanishing occurs even when σmax · ||Wh|| ≈ 1-2, because the average behavior is worse than the worst case.
The gradient magnitude at each time step, plotted as a bar. Choose σmax and ||Wh|| to see how the bound (σmax · ||Wh||)T-t shapes the decay. The warm line is the theoretical upper bound.
For sigmoid activations (σmax = 0.25): the gradient vanishes unless ||Wh|| > 4. But ||Wh|| > 4 puts us firmly in the exploding regime for many inputs. It's virtually impossible to train sigmoid RNNs on long sequences.
For tanh activations (σmax = 1.0): the gradient vanishes unless ||Wh|| ≥ 1. This is achievable — but in practice, the activations saturate (|h| large), driving f' well below 1. So even with ||Wh|| ≈ 1, the effective multiplier drops below 1 and gradients vanish.
For ReLU activations (f'(z) = 0 or 1): the derivative doesn't shrink the gradient, but half the neurons are dead (f' = 0), which creates different problems (the "dying ReLU" problem in RNNs). When z < 0, the gradient is exactly zero — that dimension contributes nothing to learning. And with recurrence, a dead neuron stays dead because its gradient is zero at every future time step.
Bengio et al. also proved a necessary condition for gradient-based learning of long-term dependencies. For the gradient to remain large enough to enable learning over T steps, we need:
But satisfying this condition almost certainly leads to gradient explosion for some inputs (those where activations don't saturate). This is the Catch-22:
There is no stable operating point for gradient-based learning of long-term dependencies in vanilla RNNs. This is the fundamental result of the paper.
The paper verified this theory with experiments on a simple task: given a sequence where the first few inputs determine the correct output at the final step, can the RNN learn the dependency? They found that for sequences longer than ~20 steps, standard RNNs completely failed to learn. The gradient at the critical input was below the numerical precision of 32-bit floating point.
python # Empirical test: can an RNN learn a long-range dependency? import torch import torch.nn as nn # Task: output at step T should equal input at step 0 def make_data(T, n_samples=100): X = torch.zeros(n_samples, T, 1) X[:, 0, 0] = torch.randn(n_samples) # signal at step 0 Y = X[:, 0, 0] # target = input at step 0 return X, Y # Test at different sequence lengths for T in [5, 10, 20, 50, 100]: rnn = nn.RNN(1, 32, batch_first=True) opt = torch.optim.Adam(rnn.parameters(), lr=0.01) X, Y = make_data(T) for epoch in range(500): out, _ = rnn(X) loss = ((out[:, -1, 0] - Y)**2).mean() opt.zero_grad(); loss.backward(); opt.step() print(f"T={T:3d}: final loss = {loss.item():.4f}") # T= 5: 0.02 (learns well) # T= 10: 0.15 (struggles) # T= 20: 0.85 (barely learns) # T= 50: 1.00 (fails — predicts mean) # T=100: 1.00 (complete failure)
Theory is beautiful, but seeing is believing. This chapter gives you a full interactive simulation where you can build an RNN, feed it a sequence, and watch the gradient flow backward through time — seeing exactly when and how it vanishes.
Configure an RNN and watch gradient magnitudes at each time step. The top panel shows the forward pass (hidden states). The bottom panel shows the backward pass (gradient magnitudes). Experiment with different weight scales and activation functions to see vanishing vs. exploding gradients.
Notice the fundamental asymmetry. With tanh, you can find a Wh scale that keeps gradients alive for ~10 steps. But extending to 50 or 100 steps is impossible — the system is too sensitive. Any small deviation from the critical point cascades exponentially.
An important observation from the paper: the gradient problem affects different dimensions of the hidden state differently. The Jacobian product preferentially amplifies components along the eigenvector with the largest eigenvalue and suppresses all others. After many steps, only information aligned with the dominant eigenvector survives.
This means the RNN's effective memory capacity decreases with sequence length. At step 5, the hidden state might use all 64 dimensions. By step 50, only the 2-3 dimensions aligned with the dominant eigenvectors carry meaningful gradient signal. The rest are "gradient dead zones" — they still hold information from the forward pass, but no gradient reaches them to refine that information.
Several approaches were tried before architectural solutions (LSTM, attention) took over:
| Approach | Idea | Why it failed |
|---|---|---|
| Truncated BPTT | Only backpropagate for k steps, not all T | By definition cannot learn dependencies longer than k |
| Echo state networks | Fix Wh, only learn output weights | Random fixed dynamics can't encode task-specific patterns |
| Second-order methods | Use curvature (Hessian) to guide steps | O(n2) memory, too expensive for large networks |
| Careful initialization | Initialize Wh near the critical point | Training quickly moves weights away from the critical point |
| Nesterov momentum | Better optimizer for smooth landscapes | The landscape isn't smooth — it has cliffs (Pascanu 2013) |
The common thread: all of these approaches try to fix the optimization without changing the architecture. But the vanishing gradient is an architectural problem — it's intrinsic to the recurrence structure. The solution must be architectural too.
python # Comparing truncated BPTT vs full BPTT import torch import torch.nn as nn # Truncated BPTT: only backpropagate k steps def train_truncated_bptt(model, sequence, k=20): """Truncated BPTT: gradient only flows k steps back.""" h = torch.zeros(1, 1, 64) total_loss = 0 for i in range(0, len(sequence), k): chunk = sequence[i:i+k] h = h.detach() # stop gradient flow beyond k steps! out, h = model(chunk, h) loss = compute_loss(out) loss.backward() total_loss += loss.item() # Problem: dependencies longer than k steps can NEVER be learned # The gradient literally cannot reach inputs more than k steps back return total_loss
What does vanishing gradients mean for actual learning? Consider a sequence where a critical input appears at step 3 and the loss is computed at step 20. The gradient of the loss with respect to that input is proportional to ∂h20/∂h3, which is the product of 17 Jacobians. If each has norm 0.8, the gradient is 0.817 ≈ 0.023 — 43 times smaller than the gradient at step 19.
The network receives almost all its learning signal from the most recent few steps. Distant inputs are effectively invisible to gradient descent. The network can learn short-range patterns ("the" often follows "in") but not long-range ones ("she" refers to "Maria" from 200 words ago).
python # Measuring gradient norms at each time step in PyTorch import torch import torch.nn as nn rnn = nn.RNN(10, 64, batch_first=True) x = torch.randn(1, 30, 10, requires_grad=True) # length-30 sequence h0 = torch.zeros(1, 1, 64) output, hn = rnn(x, h0) loss = output[0, -1, :].sum() # loss at final step loss.backward() # Gradient at each time step grad_norms = x.grad[0].norm(dim=1) for t, g in enumerate(grad_norms): print(f"t={t:2d}: gradient norm = {g.item():.6f}") # Typical: t=29 ~ 0.5, t=20 ~ 0.01, t=10 ~ 0.0001, t=0 ~ 1e-7
Bengio et al.'s 1994 paper didn't just identify the problem — it shaped the next two decades of sequence modeling research. Every major architectural innovation in NLP can be understood as a response to the vanishing gradient problem.
| Solution | Year | Core Idea | How it helps |
|---|---|---|---|
| LSTM | 1997 | Gated memory cell with additive updates | Cell state uses addition not multiplication, so gradients flow through the "constant error carousel" with Jacobian ≈ 1 |
| Gradient clipping | 2013 | Cap gradient norm at a threshold | Prevents explosion, but can't fix vanishing (can't make a small gradient bigger) |
| GRU | 2014 | Simplified LSTM with fewer gates | Same additive update principle, fewer parameters, similar performance |
| Residual connections | 2015 | ht = ht-1 + F(ht-1) | Additive skip creates gradient highways — Jacobian is I + JF, whose eigenvalues ≈ 1 |
| Attention | 2015 | Direct connections between any two time steps | Gradient can flow directly from output to any input step — no chain of multiplications |
| Transformer | 2017 | Remove recurrence entirely, use only attention | No sequential Jacobian chain at all — every position connects to every other position |
The most radical response to the vanishing gradient problem was the Transformer (Vaswani et al., 2017), which eliminated recurrence entirely. In a Transformer, every position in the sequence can attend directly to every other position through self-attention. The gradient from position T to position t flows through at most O(1) layers, not O(T-t) sequential multiplications.
This is why Transformers can handle sequences of thousands of tokens — the gradient path between any two tokens is short, regardless of their distance in the sequence.
The vanishing gradient problem teaches a fundamental principle about neural network design: the gradient path matters as much as the forward path. A network might be expressive enough to represent any function, but if the gradients can't flow to the relevant parameters, the network can't learn that function.
This principle extends beyond RNNs. Residual connections in deep CNNs (ResNet, 2015) solve the same problem for deep feedforward networks. LayerNorm and BatchNorm stabilize gradients by controlling activation scales. The entire history of deep learning architecture design can be read as a struggle to keep gradients flowing.
To see exactly how the LSTM solves the problem, compare the Jacobian structures:
The vanilla RNN multiplies by Wh at each step, while the LSTM cell multiplies by the forget gate ft, which is a scalar per dimension. When ft ≈ 1, the gradient passes through unchanged. This is the "constant error carousel" — a highway for gradients that bypasses the multiplicative bottleneck.
python # LSTM gradient flow: constant error carousel import torch # Simulate LSTM cell state gradient flow T = 100 grad_rnn = 1.0 grad_lstm = 1.0 gamma_rnn = 0.7 # typical RNN decay factor forget_gate = 0.98 # LSTM forget gate near 1 for t in range(T): grad_rnn *= gamma_rnn grad_lstm *= forget_gate print(f"After {T} steps:") print(f" RNN gradient: {grad_rnn:.2e}") # ~3.2e-16 print(f" LSTM gradient: {grad_lstm:.2e}") # ~0.13 # LSTM preserves gradient 10^15 times better!
Compare the gradient path from output to distant input in three architectures: vanilla RNN (chain of multiplications), LSTM (constant error carousel), and Transformer (direct attention). Toggle between them to see how the gradient magnitude differs.
To continue from here:
Even though RNNs are rarely used today, the vanishing gradient insight affects every modern architecture:
| Modern technique | Directly motivated by vanishing gradients |
|---|---|
| Residual connections (ResNet, Transformer) | Additive skip connections keep gradient path length O(1) |
| Layer normalization | Prevents activation scale drift that worsens gradient flow |
| GELU/SiLU activation | Smoother derivatives than ReLU, better gradient flow |
| Gradient clipping | Prevents the exploding gradient companion problem |
| Pre-norm architecture | Puts normalization inside residual branch for cleaner gradients |
| Weight initialization schemes | Xavier/He init targets unit-variance activations to stabilize gradients |
The Transformer's self-attention mechanism is the most complete solution: it creates a direct gradient path (O(1) length) between any two positions in the sequence. No chain of Jacobians, no exponential decay, no spectral radius concerns. But the Transformer still uses residual connections and layer normalization — because even within a single layer, gradient flow matters.
The vanishing gradient problem has not been "solved" in a general sense. It reappears whenever information must flow through many sequential transformations: very deep networks (100+ layers), long-horizon reinforcement learning, recurrent state-space models, and neural ODEs. Each new architecture must grapple with the same fundamental question Bengio asked in 1994: "Can the gradient reach the parameters that need to change?"
Interestingly, recurrence has made a comeback with state-space models (SSMs) like Mamba (2023). SSMs reformulate the recurrence as a linear system with structured weight matrices, allowing efficient parallel training while maintaining the theoretical benefits of recurrence (O(1) memory per step, linear scaling with sequence length).
The key insight of modern SSMs: the vanishing gradient problem only applies to nonlinear recurrences (tanh, sigmoid). For linear recurrences (ht = Aht-1 + Bxt), the dynamics can be controlled precisely through the eigenvalues of A. By parameterizing A as a diagonal matrix with eigenvalues on the unit circle, SSMs achieve stable gradient flow over thousands of steps — essentially solving Bengio's problem within the recurrence framework itself.
| Architecture | Recurrence | Gradient path | Parallel training |
|---|---|---|---|
| Vanilla RNN | Nonlinear | O(T), vanishes | No |
| LSTM | Gated nonlinear | O(T), mitigated | No |
| Transformer | None | O(1) | Yes |
| Mamba / SSM | Linear + selective | O(T), controlled | Yes (via scan) |
Whether SSMs will eventually challenge the Transformer's dominance remains an open question. But their existence proves that the vanishing gradient problem in recurrence is not unsolvable — it just required rethinking the structure of the recurrence itself, exactly as Bengio et al. suggested 30 years ago.