Difficulty Training RNNs (Pascanu 2013)

Chapter 0: Why RNNs Break

You're training an RNN on a language modeling task. The loss is dropping steadily — 4.2, 3.8, 3.5, 3.1... Then suddenly, on iteration 8,347, the loss jumps to NaN. Your model is destroyed. All the weights are infinity. Hours of training, wasted.

This isn't a bug. It's a feature of RNN error surfaces — and until this 2013 paper, nobody fully understood why it happened or how to prevent it.

In the previous lesson on Bengio 1994, we learned that gradients in RNNs tend to vanish exponentially over time. But Bengio's analysis focused on the vanishing case. What about the opposite? What happens when gradients explode?

Pascanu, Mikolov, and Bengio showed that exploding gradients are not just the symmetric opposite of vanishing gradients. They have their own unique pathology: the cliff phenomenon. The RNN error surface contains regions where the gradient suddenly becomes enormous — like a cliff edge. If your optimizer steps near one of these cliffs, the gradient catapults the weights far from the current solution, often to a region where the loss is much worse or even infinite.

The cliff phenomenon is particularly insidious because it's unpredictable. You can be training for hours with stable, decreasing loss, and then a single unlucky batch sends the gradient norm from 5 to 50,000 in one step. The weight update is so large that it destroys the learned representations. The model can't recover because the new weight values are far from any good region of parameter space.

Before this paper, practitioners dealt with gradient explosions through trial and error: restart training with a smaller learning rate, use shorter sequences, add more regularization. None of these were principled solutions. Pascanu et al. provided both the explanation (cliffs) and the fix (clipping).

The practical significance cannot be overstated. In 2013, training an RNN was a coin flip — it might work, it might explode. A single cliff encounter could waste hours of GPU time. Gradient clipping transformed RNN training from gambling to engineering.

How often do cliffs occur?

In their experiments, Pascanu et al. found that without gradient clipping, 30-60% of training runs diverged (loss went to NaN) within the first epoch. The frequency depended on:

Factor	Effect on cliff frequency
Longer sequences	More frequent (sharper cliffs, more opportunities to hit them)
Larger learning rate	More frequent (larger steps = higher probability of landing on a cliff)
Larger weight scale	More frequent (closer to the critical point η = 1)
Certain input patterns	Some batches create cliffs that others don't
Early in training	More frequent (random weights are more prone to critical dynamics)

The Training Catastrophe

Watch an RNN training. The loss decreases steadily until the optimizer encounters a "cliff" in the error surface. The gradient explodes, the weight update is enormous, and training is destroyed. Click "Train" to see it happen.

The three key contributions of this paper:
1. The cliff phenomenon: RNN error surfaces contain sharp cliffs where the gradient norm changes by orders of magnitude over tiny parameter changes.
2. Gradient clipping: A simple fix — when the gradient norm exceeds a threshold, rescale it. This turns the cliff into a gentle slope.
3. Gradient regularization: An additional technique to encourage gradients to flow smoothly, helping with both vanishing and exploding.

Together, these insights transformed RNN training from an art requiring extreme patience and luck into a science with reliable recipes. Gradient clipping became so fundamental that almost every deep learning framework now ships it as a built-in utility. It is usually opt-in rather than on by default, but when you train a Transformer, an LSTM, or any deep network from a standard recipe, gradient clipping is almost certainly part of the training loop.

The paper's experimental setup

Pascanu et al. used several carefully designed diagnostic tasks to study gradient dynamics:

Task	Description	Why it's useful
Temporal order	Two markers appear in the first 10% of a length-T sequence. Output their order.	Tests long-range memory — T controls difficulty
Addition problem	Two numbers are marked in a sequence. Output their sum.	Tests whether the network can store and combine distant information
Multiplication	Similar to addition but with multiplication	Tests sensitivity to distant inputs (product is more sensitive than sum)
Penn Treebank LM	Standard language modeling benchmark	Tests practical performance on real data

These tasks are specifically designed so that the difficulty scales with sequence length T. At T = 10, all methods work. At T = 100, only methods with proper gradient handling succeed. This controlled scaling reveals the gradient dynamics clearly, making it possible to isolate the effect of gradient clipping from other factors (model capacity, optimizer choice, etc.).
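To make the addition problem concrete, here is a minimal generator sketch. The encoding (a value channel plus a 0/1 marker channel, with one marker in each half of the sequence) is an assumption made for illustration and may differ from the paper's exact setup; `make_addition_example` is a hypothetical helper name.

```python
import random

def make_addition_example(T, rng):
    """Return (sequence, target) for one addition-problem example.

    sequence is a list of (value, marker) pairs of length T; exactly two
    positions have marker == 1, and target is the sum of their values.
    """
    values = [rng.random() for _ in range(T)]
    # Assumed layout: one marker in each half, so the gap grows with T.
    first = rng.randrange(0, T // 2)
    second = rng.randrange(T // 2, T)
    markers = [1 if i in (first, second) else 0 for i in range(T)]
    target = values[first] + values[second]
    return list(zip(values, markers)), target

rng = random.Random(0)
seq, target = make_addition_example(T=100, rng=rng)
print(len(seq), sum(m for _, m in seq), round(target, 3))
```

Increasing T stretches the distance the network must carry the first marked value, which is what makes the task a controlled test of gradient flow.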

The addition and temporal order tasks have become standard benchmarks for evaluating new sequence model architectures. If a model can solve them at T = 500+ without special initialization or training tricks, it has solved the long-range dependency problem. LSTMs can handle T ≈ 200-500 (with gradient clipping), while vanilla RNNs fail at T ≈ 20-50.

How this paper relates to Bengio 1994

The relationship between these two papers is complementary:

Bengio 1994:

Theoretical. Proves gradients vanish/explode. Analyzes the mathematical structure. Doesn't offer solutions.

Pascanu 2013:

Practical. Shows how the theory manifests in training (cliffs). Provides solutions (clipping, regularization). Gives training recipes.

Together, they form the complete story: Bengio explained the what and why; Pascanu provided the how to fix it and what it looks like in practice.

What happens when an RNN optimizer encounters a "cliff" in the error surface during training?

The gradient becomes enormous, causing a huge weight update that catapults the parameters far from the good region, often destroying the model entirely (loss goes to NaN)
Training slows down because the gradient is very small near the cliff
The optimizer automatically reduces the learning rate

Chapter 1: The Cliff Phenomenon

Why do RNN error surfaces have cliffs? The answer comes directly from the Jacobian product chain from Chapter 2 of the vanishing gradients lesson, revisited here with a focus on the exploding case rather than the vanishing case.

Recall that the gradient involves a product of Jacobians:

∂h_T/∂h_t = ∏_{k=t+1}^{T} diag(f'(z_k)) · W_h

When the spectral radius ρ(W_h) is slightly above the critical threshold, this product grows exponentially. But here's the subtle part: the rate of growth depends on the specific values of the hidden states at each step, which determine the activation derivatives f'(z_k).

Consider what happens as you move through parameter space during training. At most points, the hidden states are in the saturated regime (large |z|), so f'(z) is small and the gradients are manageable — in fact, they're vanishing, which is the more common problem. But there exist thin regions in parameter space where, for a particular training example, the hidden states align in the linear regime (|z| near 0, f' near 1). In these regions, the Jacobian product doesn't benefit from the damping of activation saturation, and the gradient explodes.

The key quantitative insight: when f' ≈ 1 (linear regime) and ρ(W_h) > 1, the gradient grows as ρ^T. For ρ = 1.1 and T = 100: 1.1¹⁰⁰ ≈ 13,781. For ρ = 1.2 and T = 100: 1.2¹⁰⁰ ≈ 8.3 × 10⁷. The gradient can be tens of millions of times larger than normal, and this happens over a parameter region that's only ε wide. This is the cliff: a gradient spike of astronomical magnitude over a vanishingly thin region.
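The growth factors quoted above are easy to check directly. This small sketch only evaluates ρ^T for a few values of ρ; it assumes the idealized linear regime (f' = 1 at every step), so it is an upper-end illustration rather than what a real tanh RNN would produce.

```python
# Growth factor rho**T of a T-step Jacobian product in the idealized
# linear regime (f' = 1 at every step).
T = 100
for rho in (0.9, 1.0, 1.1, 1.2):
    print(f"rho = {rho}: rho**{T} = {rho ** T:.3e}")

growth_11 = 1.1 ** T
growth_12 = 1.2 ** T
```

Note how a change of only 0.1 in ρ moves the result by several orders of magnitude in either direction.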

An analogy for cliffs: Imagine walking on a hilly landscape where the slope changes gradually — except for occasional razor-thin ridges where the ground drops away vertically. As you walk (optimize), you usually follow gentle slopes. But if your step happens to land right on a ridge edge, you fall off the cliff. The gradient at that point is enormous — it points nearly straight down the cliff face. A normal-sized step would send you crashing to the bottom, far from where you want to be.

The Cliff in 1D Error Surface

A simplified RNN error surface as a function of a single weight. Notice the sharp cliff. Drag the optimizer position (warm dot) to feel the gradient at each point. Near the cliff, the gradient is orders of magnitude larger than in the smooth valleys.

Where do cliffs come from mathematically?

Pascanu et al. proved that cliffs appear when the following condition holds. Consider the Jacobian product at a specific point in parameter space. Define:

η = max_k ||∂h_k/∂h_{k-1}||

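Because η is the largest per-step Jacobian norm, the product is bounded above: ||∂h_T/∂h_t|| ≤ η^{T-t}. If η < 1, this bound decays exponentially and the gradient must vanish. If η > 1, the bound no longer rules out growth, and in the worst case the gradient grows like η^{T-t} (so η > 1 is necessary for explosion, not sufficient). The cliff occurs at the boundary between the region where η < 1 (vanishing) and where η > 1 (exploding). At this boundary, η crosses 1, and the gradient can transition abruptly from near-zero to enormous.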

The width of this transition zone is inversely proportional to T — longer sequences create sharper cliffs. For a sequence of length 100, the cliff can span just 10^-4 in parameter space while the gradient changes by 10¹⁰.
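A scalar toy model makes this transition visible. The sketch assumes a linear RNN h_k = w · h_{k-1} with h_0 = 1 and the loss L(w) = 0.5 · h_T², neither of which comes from the paper; they are chosen only because the gradient then has a closed form.

```python
def grad_wrt_w(w, T, h0=1.0):
    """d/dw of L(w) = 0.5 * (w**T * h0)**2 for the scalar RNN h_k = w * h_{k-1}."""
    h_T = (w ** T) * h0
    dh_T_dw = T * (w ** (T - 1)) * h0
    return h_T * dh_T_dw

T = 100
for w in (0.90, 0.95, 1.00, 1.05, 1.10):
    print(f"w = {w:.2f}: |dL/dw| = {abs(grad_wrt_w(w, T)):.3e}")
```

Across a weight range only 0.2 wide, the gradient magnitude spans many orders of magnitude, and rerunning with a larger T makes the jump around w = 1 steeper still.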

Formal characterization of cliffs

Pascanu et al. showed that the cliff width scales as:

width ≈ O(T^-1) in the direction of steepest gradient change

While the gradient magnitude scales as:

||grad|| ≈ O(η^T) at the cliff peak

So the ratio of gradient magnitude to cliff width grows as O(T · η^T): exponential growth in T with an extra linear factor on top. This is why cliffs are so dangerous: for long sequences, the gradient is enormous over an extremely thin region of parameter space. The probability of an SGD step landing exactly on a cliff is small, but when it does happen, the consequence is catastrophic.

An analogy: Imagine walking on a flat field that has occasional paper-thin cracks in the ground, each leading to a bottomless pit. The cracks are so thin you almost never step on them. But when you do, you fall infinitely. The longer you walk (longer sequences), the thinner the cracks become but the deeper the pits get. This is the cliff phenomenon — rare but lethal.

Where cliffs appear in real training

Understanding when cliffs appear helps practitioners anticipate and prevent training failures. In their extensive experiments across multiple tasks and architectures, Pascanu et al. found that cliffs typically appear:

When?	Why?
Early in training	Random weights are more likely to produce near-critical spectral radius
On specific input sequences	Certain input patterns push hidden states into the linear regime
After learning rate warmup	Larger steps increase the probability of hitting a cliff
With longer sequences	More time steps = sharper cliffs = more dangerous

python
# Demonstrating the cliff: gradient norm vs weight perturbation
import torch
import torch.nn as nn
import numpy as np

rnn = nn.RNN(1, 32, batch_first=True)
x = torch.ones(1, 50, 1)  # constant input, length 50

# Scan along one weight direction
original_w = rnn.weight_hh_l0.data.clone()
direction = torch.randn_like(original_w)
direction /= direction.norm()

epsilons = np.linspace(-0.5, 0.5, 200)
grad_norms = []

for eps in epsilons:
    rnn.weight_hh_l0.data = original_w + eps * direction
    rnn.zero_grad()
    out, _ = rnn(x)
    loss = out[0, -1, :].sum()
    loss.backward()
    gn = rnn.weight_hh_l0.grad.norm().item()
    grad_norms.append(gn)

# Plot grad_norms against epsilons and look for sudden spikes (cliffs).
# With PyTorch's default initialization the curve may look smooth; scaling
# original_w up (spectral radius above 1) makes cliff-like jumps more likely.

What causes the "cliff" phenomenon in RNN error surfaces?

The learning rate is set too high
At the boundary where the effective Jacobian norm crosses 1, the gradient transitions abruptly from near-zero (vanishing) to enormous (exploding); this transition is razor-thin in parameter space, creating a cliff
The activation function has discontinuities

Chapter 2: Gradient Clipping

The cliff problem has an elegantly simple solution: gradient clipping. Before applying the gradient update, check if the gradient norm exceeds a threshold. If it does, rescale the gradient to have exactly that threshold norm.

Think about what this means practically. When the optimizer encounters a cliff, the gradient points in the correct direction — downhill, away from the cliff. The problem is only the magnitude: the gradient is so large that a normal-sized step would overshoot dramatically. The solution is obvious in retrospect: keep the direction, shrink the magnitude.

The algorithm is three lines:

Step 1: Compute

Compute gradient g = ∇L(θ) via backpropagation

↓

Step 2: Check

If ||g|| > threshold τ, rescale: g ← τ · g / ||g||

↓

Step 3: Update

θ ← θ - α · g (standard SGD step with clipped gradient)

That's it. When the gradient is small (||g|| ≤ τ), nothing changes — the update proceeds as normal. When the gradient is large (||g|| > τ), the direction is preserved but the magnitude is capped at τ. This turns the cliff from a catastrophic fall into a gentle slope in the correct direction.

Geometrically, norm clipping projects the gradient onto the surface of a hypersphere of radius τ. Any gradient inside the sphere passes through unchanged; any gradient outside is projected to the sphere's surface, preserving direction. This is equivalent to:

ĝ = g · min(1, τ / ||g||₂)

The min(1, ...) ensures that gradients smaller than τ are untouched. Only "abnormally large" gradients (those that would launch the optimizer off a cliff) are affected. In typical training, 95-99% of gradient steps are below the threshold and pass through unmodified. Clipping activates only for the rare, dangerous cliff-edge steps.

ĝ = g               if ||g|| ≤ τ
ĝ = τ · g / ||g||   if ||g|| > τ
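A single update step shows what the rescaling buys. This is a minimal pure-Python sketch; the parameter values, the "cliff" gradient, and the helper names `clip_by_norm` and `sgd_step` are made up for illustration.

```python
import math

def clip_by_norm(g, tau):
    """Return g scaled by min(1, tau / ||g||_2); the direction is unchanged."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

def sgd_step(theta, g, lr):
    """Plain SGD update: theta - lr * g, element by element."""
    return [t - lr * x for t, x in zip(theta, g)]

theta = [0.5, 0.5]
g_cliff = [30000.0, -40000.0]   # made-up cliff gradient with norm 50000
g_clipped = clip_by_norm(g_cliff, tau=5.0)

print(sgd_step(theta, g_cliff, lr=0.01))    # parameters thrown far away
print(sgd_step(theta, g_clipped, lr=0.01))  # small step in the same direction
```

A gradient whose norm is already below τ, such as `[0.3, 0.4]`, comes back from `clip_by_norm` unchanged.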

Gradient Clipping in Action

The error surface with a cliff. Without clipping (red), the optimizer is launched off the cliff. With clipping (green), the gradient direction is preserved but the step size is capped, keeping the optimizer near the good region. Toggle clipping to compare.

Why this works: Gradient clipping doesn't change the gradient direction — it only changes the magnitude. When you encounter a cliff, the gradient points in the right direction (away from the cliff). The problem is only that the step is too large. Clipping preserves the direction while limiting the step size. It's like walking carefully near a cliff edge — you still go the right way, just in smaller steps.

Implementation

python
import torch
import torch.nn as nn

# The gradient clipping algorithm
def clip_gradient(parameters, max_norm):
    """Clip gradient norm to max_norm, preserving direction."""
    # Materialize first: model.parameters() is a generator and would be
    # exhausted by the norm computation, leaving nothing to rescale.
    params = [p for p in parameters if p.grad is not None]

    # Compute total gradient norm across all parameters
    total_norm = sum(p.grad.norm(2).item() ** 2 for p in params) ** 0.5

    # Rescale if necessary
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for p in params:
            p.grad.mul_(scale)

    return total_norm

# In practice, PyTorch does this in one line:
model = nn.RNN(10, 64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step (dummy input and loss, for illustration only)
x = torch.randn(20, 8, 10)  # (seq_len, batch, input_size)
optimizer.zero_grad()
out, _ = model(x)
loss = out.pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip!
optimizer.step()

The typical threshold τ is between 1 and 10. Pascanu et al. found that τ = 5 works well for most tasks. The exact value isn't critical — any reasonable threshold prevents catastrophic cliff-falls while allowing normal gradient-based learning to proceed.

How gradient clipping affects convergence

Does clipping introduce bias? Technically yes: when you clip, the step keeps the true gradient direction but no longer has the true magnitude. But in practice, clipping events are rare (most gradients are below τ), and when they do occur, the true gradient would have been destructive anyway. Clipping replaces a catastrophic step with a merely suboptimal one.

Pascanu et al. showed empirically that clipping has negligible effect on convergence speed when the threshold is chosen reasonably. The model takes slightly more steps to converge (because some large gradients are truncated), but it never diverges. The tradeoff is overwhelmingly positive.

When to clip: In modern practice, gradient clipping is applied to essentially every model: RNNs, LSTMs, Transformers, CNNs. The standard recipe is clip_grad_norm_(model.parameters(), max_norm=1.0) for Transformers and max_norm=5.0 for RNNs/LSTMs. It's cheap (one norm computation per step), safe (prevents divergence), and has almost no downside.

Monitoring gradient norms

A practical corollary of this paper: always log gradient norms during training. Gradient norm plots reveal training dynamics that the loss curve hides. Spikes in gradient norm (even if clipped) indicate the model is encountering cliffs. If spikes are frequent, the learning rate may be too high or the model architecture may be unstable.

python
# Monitoring gradient norms during training
import torch

def train_step(model, optimizer, loss_fn, batch, max_norm=5.0):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()

    # Log gradient norm BEFORE clipping
    total_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm
    )

    # Log to your favorite logger
    # wandb.log({"grad_norm": total_norm.item()})
    # If total_norm > max_norm, clipping occurred
    if total_norm > max_norm:
        print(f"Clipped! Norm {total_norm:.1f} > {max_norm}")

    optimizer.step()
    return loss.item(), total_norm.item()

What does gradient clipping do when the gradient norm exceeds the threshold τ?

It rescales the gradient to have norm exactly τ, preserving its direction but limiting its magnitude, which prevents the optimizer from being launched off a cliff
It sets the gradient to zero, skipping the update entirely
It reverses the gradient direction to move away from the cliff

Chapter 3: Clipping Strategies

There are actually two main approaches to gradient clipping, and the distinction matters for practice.

Clipping by norm (the standard approach)

This is what we described in the previous chapter. You compute the total gradient norm across all parameters and rescale if it exceeds τ:

ĝ = g · min(1, τ / ||g||)

This approach preserves the relative magnitudes between different parameter gradients. If the gradient for one weight is 10x larger than another, that ratio is maintained after clipping. The overall gradient vector just gets shorter.

Clipping by value (element-wise)

An alternative is to clip each gradient element independently:

ĝ_i = max(-τ, min(τ, g_i))

This caps each component of the gradient at τ. It's simpler but it changes the direction of the gradient — large components are clipped while small ones are not, distorting the gradient direction.

Method	Preserves direction?	Effect on large gradients	When to use
Clip by norm	Yes	All components scaled equally	Default choice — used in almost all modern training
Clip by value	No	Each component clipped independently	Rarely preferred — use only if you have a specific reason
Adaptive clipping	Partially	Threshold adapts based on gradient history	More robust, but adds complexity

Norm Clipping vs. Value Clipping

A 2D gradient vector (arrow). Norm clipping shrinks the vector while preserving direction (green circle = threshold). Value clipping truncates each component independently, changing the direction (red box = threshold). Toggle between methods.

Adaptive gradient clipping

Pascanu et al. also discussed adaptive clipping, where the threshold adjusts based on the running average of gradient norms. The idea is to clip gradients that are unusually large relative to recent history:

python
# Adaptive gradient clipping
class AdaptiveClipper:
    def __init__(self, alpha=0.95, multiplier=3.0):
        self.ema = None        # exponential moving average of grad norm
        self.alpha = alpha    # smoothing factor
        self.mult = multiplier # clip at mult * ema

    def clip(self, parameters):
        # Materialize first: model.parameters() is a one-shot generator
        params = [p for p in parameters if p.grad is not None]
        total_norm = sum(p.grad.norm(2).item() ** 2 for p in params) ** 0.5

        # Threshold uses history only, so a spike cannot raise its own limit.
        # On the first call there is no history yet, so nothing is clipped.
        threshold = self.mult * self.ema if self.ema is not None else float("inf")
        if total_norm > threshold:
            scale = threshold / total_norm
            for p in params:
                p.grad.mul_(scale)

        # Update running average with the norm that was actually applied
        applied_norm = min(total_norm, threshold)
        if self.ema is None:
            self.ema = applied_norm
        else:
            self.ema = self.alpha * self.ema + (1 - self.alpha) * applied_norm

        return total_norm

Adaptive clipping is more robust because it doesn't require choosing τ in advance. The threshold emerges from the training dynamics themselves. However, plain norm clipping with τ = 1-5 works well enough that most practitioners stick with it.

Per-parameter vs global clipping

Beyond norm-vs-value clipping, there's another important design dimension: the scope of the norm computation. Do you clip each parameter's gradient independently, each layer's gradient independently, or compute a single norm across all parameters in the model?

This choice affects gradient direction preservation and computational cost differently:

Scope	What it clips	Used by
Global norm	Total norm across ALL parameters	PyTorch `clip_grad_norm_`, all major LLM codebases
Per-parameter	Each parameter's gradient independently	Some TensorFlow implementations
Per-layer	Each layer's gradients independently	Some custom implementations

Global norm clipping is almost universally preferred because it maintains the relative gradient magnitudes across parameters. If one layer's gradient is 10x larger than another's (which is normal), per-parameter clipping would clip them to the same magnitude, distorting the update direction. Global clipping preserves the ratio while limiting the total magnitude.
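The distortion is easy to see with two made-up "layers" whose gradients differ by a factor of 10. The sketch below is pure Python with hypothetical helper names (`clip_global`, `clip_per_parameter`); it mirrors the two scopes described above rather than any particular framework's implementation.

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def clip_global(grads, tau):
    """Scale every gradient by one factor computed from the norm over all of them."""
    total = math.sqrt(sum(l2(g) ** 2 for g in grads))
    scale = min(1.0, tau / total) if total > 0 else 1.0
    return [[x * scale for x in g] for g in grads]

def clip_per_parameter(grads, tau):
    """Clip each gradient against its own norm, independently of the others."""
    out = []
    for g in grads:
        n = l2(g)
        scale = min(1.0, tau / n) if n > 0 else 1.0
        out.append([x * scale for x in g])
    return out

grads = [[30.0, 40.0], [3.0, 4.0]]  # layer A's gradient is 10x layer B's
g_global = clip_global(grads, tau=5.0)
g_local = clip_per_parameter(grads, tau=5.0)

print("global ratio A:B       ", l2(g_global[0]) / l2(g_global[1]))
print("per-parameter ratio A:B", l2(g_local[0]) / l2(g_local[1]))
```

Global clipping keeps the 10:1 ratio between the two layers, while per-parameter clipping flattens it to 1:1.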

The computational cost of global norm clipping is negligible. Computing the total gradient norm requires one pass through all parameters to sum the squared norms, then one pass to rescale if needed. For a model with P parameters, this is O(P) — the same cost as a single SGD step. The norm computation can also be overlapped with gradient computation using GPU parallelism, making the wall-clock overhead essentially zero.

This near-zero cost is important: it means there's no reason not to use gradient clipping. It's a pure safety mechanism with no performance penalty. Even if your model never encounters a cliff, clipping costs nothing. And if it does encounter one, clipping saves your entire training run. This asymmetry — zero downside, huge upside — is why gradient clipping is universal in modern practice. It is the seat belt of deep learning: you wear it every time, not because you expect a crash, but because the cost of wearing it is negligible and the cost of not wearing it is catastrophic.

python
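# Global vs per-parameter clipping in PyTorch
# (alternatives shown together for comparison; pick one in real training)
import torch

# Toy model with gradients populated (for illustration only)
model = torch.nn.RNN(10, 64)
out, _ = model(torch.randn(20, 8, 10))
out.pow(2).mean().backward()

# Global norm (recommended)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

# Element-wise value clipping (less common)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# Custom per-parameter norm clipping: each weight or bias tensor is
# clipped against its own norm
for name, param in model.named_parameters():
    if param.grad is not None:
        param_norm = param.grad.norm()
        if param_norm > 1.0:
            param.grad.mul_(1.0 / param_norm)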

The mathematical relationship between clipping methods

Let g = (g₁, g₂, ..., g_n) be the gradient vector. The two clipping methods produce different clipped vectors:

Norm clip: ĝ = g · min(1, τ/||g||) [scales all components equally]

Value clip: ĝ_i = sign(g_i) · min(|g_i|, τ) [clips each independently]

Let's see this concretely. For a 2D gradient g = (10, 1) with threshold τ = 5:

Method	Clipped gradient	Direction preserved?
Norm clip	(4.98, 0.50)	Yes — same ratio 10:1
Value clip	(5.00, 1.00)	No — ratio changed to 5:1
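The numbers in this table can be reproduced in a few lines. The sketch also computes the cosine similarity between the original and clipped vectors as one way to quantify "direction preserved"; that metric is an addition for illustration, not something taken from the paper.

```python
import math

def norm_clip(g, tau):
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

def value_clip(g, tau):
    return [max(-tau, min(tau, x)) for x in g]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

g = [10.0, 1.0]
by_norm = norm_clip(g, 5.0)
by_value = value_clip(g, 5.0)

print("norm clip: ", [round(x, 2) for x in by_norm], "cosine:", round(cosine(g, by_norm), 4))
print("value clip:", by_value, "cosine:", round(cosine(g, by_value), 4))
```

A cosine of exactly 1 means the direction is unchanged; value clipping lands slightly below 1 here, and the gap widens as more components hit the threshold.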

What is the key difference between gradient clipping by norm and by value?

Norm clipping preserves the gradient direction (all components scaled equally), while value clipping changes the direction (each component clipped independently, distorting the vector)
Value clipping is always better because it clips each weight independently
Norm clipping requires knowing the learning rate, value clipping doesn't

Chapter 4: Gradient Regularization

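Gradient clipping handles the symptom (exploding gradients), but can we address the cause? Pascanu et al. proposed a regularization term that encourages the Jacobian products to stay well-behaved. In the paper this regularizer is aimed mainly at the vanishing side of the problem, with clipping left to handle explosions, but the penalty itself is symmetric.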

The idea is simple: add a penalty to the loss function that discourages large Jacobian norms. If we penalize ||∂h_{t+1}/∂h_t||, we're directly discouraging the condition that leads to gradient explosion.

L_total = L_task + Ω

Where the regularization term Ω keeps the Jacobian product from diverging:

Ω = λ ∑_t (||(∂L/∂h_{t+1}) · (∂h_{t+1}/∂h_t)|| / ||∂L/∂h_{t+1}|| - 1)²

Let's unpack this carefully, because the formula looks intimidating but the idea is simple. Write J_t = ∂h_{t+1}/∂h_t for the Jacobian at step t. The term inside the squared penalty measures how much the gradient magnitude changes from step t+1 to step t. The ratio ||(∂L/∂h_{t+1}) J_t|| / ||∂L/∂h_{t+1}|| tells us the amplification factor at step t: how much the Jacobian at step t stretches or shrinks the gradient.

If this factor is 1, the gradient flows through unchanged (perfect). If it's greater than 1, the gradient is growing (danger of explosion). If it's less than 1, the gradient is shrinking (vanishing). The squared penalty (ratio - 1)² penalizes any deviation from 1, whether upward or downward. This drives the entire network toward a regime where gradients flow smoothly at every step.
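To see the three cases with numbers, the sketch below computes the amplification factor and its penalty for a made-up gradient and three made-up 2×2 Jacobians. Everything here (the vectors, the matrices, and the helper names) is illustrative; it evaluates one term of Ω without the λ weighting.

```python
import math

def vec_mat(d, J):
    """Row vector d times matrix J (given as a list of rows)."""
    return [sum(d[i] * J[i][j] for i in range(len(d))) for j in range(len(J[0]))]

def amplification(d_next, J):
    """||d_next @ J|| / ||d_next||: how much step t scales the backpropagated gradient."""
    num = math.sqrt(sum(x * x for x in vec_mat(d_next, J)))
    den = math.sqrt(sum(x * x for x in d_next))
    return num / den

d_next = [3.0, 4.0]  # made-up dL/dh_{t+1}
jacobians = {
    "shrinking": [[0.5, 0.0], [0.0, 0.5]],
    "neutral":   [[0.0, 1.0], [-1.0, 0.0]],  # a rotation, so the norm is preserved
    "growing":   [[2.0, 0.0], [0.0, 2.0]],
}
for name, J in jacobians.items():
    a = amplification(d_next, J)
    print(f"{name}: factor = {a:.2f}, penalty = {(a - 1) ** 2:.2f}")
```

Only the neutral Jacobian incurs no penalty; both the shrinking and the growing one are pushed back toward a factor of 1.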

The regularization strength λ controls the tradeoff: too small and the error surface retains its cliffs; too large and the regularization term dominates the task loss, preventing the model from learning the actual task. A value of λ = 0.01-0.1 works well in practice.

Think of it as gradient flow regulation: Without regularization, the gradient is free to grow or shrink at each step. With regularization, there's a "tax" on gradient amplification — the loss function actively penalizes the network for creating conditions where gradients would explode. This doesn't prevent learning; it just ensures the error surface stays smooth.

Effect of Gradient Regularization

Two RNN error surfaces: without regularization (rough, with cliffs) and with regularization (smooth, no cliffs). The regularization penalty discourages sharp changes in the loss landscape. Drag λ to control regularization strength.

Practical considerations

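Computing the exact regularization term is expensive: differentiating Ω requires second-order derivatives (the gradient of the gradient). Pascanu et al. kept it tractable by treating ∂L/∂h_{t+1} as a constant when differentiating Ω, so only the immediate dependence on the recurrent weights is used. The sketch here takes the simpler-to-write route of double backpropagation through autograd, which computes the full second-order terms: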

python
# Gradient regularization via double backpropagation (autograd)
import torch

def gradient_regularization(model, loss, hidden_states, lam=0.1, eps=1e-5):
    """Penalize per-step gradient amplification factors that deviate from 1.

    hidden_states must be the list of tensors h_1..h_T from a manually
    unrolled RNN, so that each one is part of the graph that produced loss.
    create_graph=True keeps the penalty differentiable, which means
    backpropagating through it uses second-order derivatives.
    model is unused and kept only so the call signature stays the same.
    """
    # dL/dh_t for every step, computed in a single backward pass
    grads = torch.autograd.grad(
        loss, hidden_states, retain_graph=True, create_graph=True
    )

    reg_loss = 0
    # For each consecutive pair of hidden states
    for t in range(len(hidden_states) - 1):
        # If the loss depends only on the final state, dL/dh_t equals
        # dL/dh_{t+1} @ J_t, so this ratio is the amplification factor.
        # With per-step losses it is only an approximation.
        ratio = grads[t].norm() / (grads[t + 1].norm() + eps)
        # Penalty: (ratio - 1)^2
        reg_loss = reg_loss + (ratio - 1) ** 2

    return lam * reg_loss

In practice, gradient regularization is used less often than gradient clipping because it's more expensive and clipping alone works well for most tasks. But for tasks requiring very long-range dependencies, the combination of clipping + regularization can be more effective than either alone.

Why regularize the Jacobian, not the weights?

A natural question: why not just add standard weight decay (L2 regularization on the weights) to prevent explosion? This is a common misconception, and the answer reveals an important distinction. Weight decay penalizes the magnitude of weights, not the gradient dynamics. The two are related but not the same.

A weight matrix with a moderate norm can still produce exploding gradients if the activations stay in the linear regime of tanh, where f' is near 1 and nothing damps the product (this requires the norm of W_h to exceed 1, since tanh' never exceeds 1). Conversely, the same matrix, or even a much larger one, can produce vanishing gradients if the activations are heavily saturated (f' near 0). The gradient dynamics depend on the product of the weight matrix and the activation derivatives, not on either factor alone.
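A scalar example shows how the same weight can land in either regime. The sketch assumes a one-unit tanh RNN with recurrent weight w = 1.5 and compares a pre-activation of 0 (linear regime) with a pre-activation of 3 (saturated); holding z fixed across all steps is a simplification made for illustration.

```python
import math

def step_jacobian(w, z):
    """Per-step Jacobian of a scalar tanh RNN: w * tanh'(z) = w * (1 - tanh(z)**2)."""
    return w * (1.0 - math.tanh(z) ** 2)

w = 1.5  # the same weight in both cases
linear = step_jacobian(w, z=0.0)     # tanh'(0) = 1
saturated = step_jacobian(w, z=3.0)  # tanh'(3) is roughly 0.01

T = 50
print(f"linear regime:    per-step {linear:.4f}, over {T} steps {linear ** T:.3e}")
print(f"saturated regime: per-step {saturated:.4f}, over {T} steps {saturated ** T:.3e}")
```

Weight decay sees the same w in both cases, while the per-step Jacobian differs by roughly two orders of magnitude.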

Weight decay helps indirectly by preventing weights from growing too large, which keeps the spectral radius in check. But it's a blunt instrument — it penalizes all large weights equally, whether they contribute to gradient explosion or not. Jacobian regularization is more surgical because it targets the actual gradient dynamics.

The Jacobian regularization directly targets what we care about: the gradient flow. It measures "is the gradient growing or shrinking at each step?" and penalizes deviations from 1.

Regularization	What it penalizes	Effect on gradient flow	Cost
Weight decay (L2)	||W||²	Indirect — small weights may or may not help	Cheap
Jacobian regularization	Deviation of the per-step gradient amplification from 1	Direct — targets gradient dynamics exactly	Expensive (needs 2nd-order info)
Spectral normalization	ρ(W) > 1	Prevents explosion, but can cause vanishing	Moderate

Spectral normalization: a modern alternative

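A simpler approach, developed after this paper, is spectral normalization. In the variant shown here, after each weight update you divide W_h by its spectral radius ρ(W_h) so that ρ = 1. (Spectral normalization as usually defined divides by the largest singular value instead.) With ρ = 1, repeated multiplication by W_h alone neither grows nor shrinks signals exponentially in the long run, although individual steps can still amplify.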

python
# Spectral normalization for recurrent weights
import torch

def spectral_normalize(W):
    """Normalize W so its spectral radius = 1."""
    eigs = torch.linalg.eigvals(W)
    rho = torch.max(torch.abs(eigs)).item()
    if rho > 0:
        W.data /= rho
    return W
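One caveat worth checking numerically: a spectral radius of 1 does not bound what a single multiplication can do. The sketch uses a made-up upper-triangular 2×2 matrix whose eigenvalues are both 1 (they sit on the diagonal) and applies it repeatedly to a vector.

```python
import math

def matvec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

# Upper-triangular, so the eigenvalues are the diagonal entries: rho = 1.
W = [[1.0, 20.0],
     [0.0, 1.0]]

v = [0.0, 1.0]
for step in range(1, 4):
    v = matvec(W, v)
    print(f"step {step}: v = {v}, norm = {math.hypot(*v):.2f}")
```

The norm keeps growing even though ρ(W) = 1, because the matrix is far from normal. Dividing by the largest singular value instead would guarantee that no single step amplifies, at the price of making vanishing more likely.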

What does gradient regularization encourage in an RNN?

It makes the weights smaller to prevent overfitting
It penalizes Jacobian norms that deviate from 1, encouraging the gradient to flow through each time step without growing or shrinking, which smooths the error surface
It forces the hidden states to be close to zero

Chapter 5: Vanishing vs Exploding

Vanishing and exploding gradients are two faces of the same coin — the instability of iterated matrix products. But they manifest differently and require different solutions.

Property	Vanishing	Exploding
Condition	σ_max · ||W_h|| < 1 (sufficient)	σ_max · ||W_h|| > 1 (necessary, not sufficient)
Symptom	No learning on long-range dependencies	Loss spikes to NaN, weights explode
Detection	Hard — training just converges slowly or to a bad solution	Easy — loss becomes NaN
Fix	Architectural: LSTM, GRU, attention, residual connections	Algorithmic: gradient clipping
Frequency	Very common (default for tanh/sigmoid)	Less common but catastrophic when it happens

The asymmetry: Exploding gradients are easy to fix (just clip them) but catastrophic when unfixed (NaN loss). Vanishing gradients are hard to fix (requires architectural changes) but insidious (training "works" but the model just doesn't learn long-range patterns, and you might not even notice). This is why the vanishing case is arguably more dangerous — at least explosions are loud.

Vanishing vs Exploding: A Side-by-Side

The same RNN with different weight scales. Left: small weights cause vanishing gradients (gradient bars disappear). Right: large weights cause exploding gradients (bars shoot off the chart). Drag the weight scale to see the transition.
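
The same transition can be reproduced numerically. This is a rough sketch under simplifying assumptions that are not in the original: the recurrence is linear (no activation function), and W_h is a random orthogonal matrix multiplied by a scale factor, so every backward step changes the gradient norm by exactly that factor. The function name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(scale, steps=30, n=16, seed=0):
    """Norm of a gradient vector after k steps of backprop through the
    linear recurrence h_t = W h_{t-1}, with W = scale * (orthogonal matrix)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
    W = scale * Q
    g = rng.standard_normal(n)
    g /= np.linalg.norm(g)          # start from a unit-norm gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g                 # one step back in time
        norms.append(float(np.linalg.norm(g)))
    return norms

for scale in (0.7, 1.0, 1.3):
    norms = backprop_norms(scale)
    print(f"scale {scale:.1f}: after 10 steps {norms[9]:.2e}, "
          f"after 30 steps {norms[-1]:.2e}")
```

At scale 0.7 the norm decays geometrically, at 1.0 it is preserved, and at 1.3 it grows geometrically. With a saturating activation such as tanh the decay would only get stronger, since the activation derivative is at most 1.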

The interaction during training

Pascanu et al. made a crucial observation: during training, the same model can experience both vanishing and exploding gradients — at different time scales and for different training examples.

A training example with certain input patterns might push the hidden states into the linear regime (f' near 1), causing gradient explosion. The very next training example might push the hidden states into the saturated regime (f' near 0), causing gradient vanishing. The network oscillates between the two failure modes.
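
A one-unit RNN is enough to see both regimes with the same weight. The sketch below assumes the scalar recurrence h_t = tanh(w · h_{t-1} + x_t), for which the per-step derivative is w · (1 − h_t²); the input sequences and the name `jacobian_product` are invented for illustration.

```python
import math

def jacobian_product(w, inputs, h0=0.0):
    """Product of per-step derivatives dh_t/dh_{t-1} for the scalar RNN
    h_t = tanh(w * h_{t-1} + x_t). Each derivative is w * (1 - h_t**2)."""
    h, prod = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        prod *= w * (1.0 - h * h)
    return prod

w = 1.5          # same recurrent weight for both examples
T = 20

quiet = [0.0] * T     # keeps h at 0, where tanh is linear (f' = 1)
loud = [3.0] * T      # drives h toward 1, where tanh saturates (f' near 0)

print(f"quiet inputs: {jacobian_product(w, quiet):.3e}")   # grows like 1.5**20
print(f"loud inputs:  {jacobian_product(w, loud):.3e}")    # collapses toward 0
```

Nothing about the model changes between the two calls. Only the inputs differ, and they alone decide whether the backward signal explodes or vanishes.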

This explains why naive fixes don't work:

Naive fix 1: Large W_h

"Set ||W_h|| > 1 so gradients don't vanish" — but this causes explosions

↓

Naive fix 2: Small lr

"Use tiny learning rate to prevent explosions" — but this makes vanishing worse

↓

Correct approach

Clip exploding gradients + use architecture that prevents vanishing (LSTM/GRU)

The correct approach is to use two complementary solutions: gradient clipping for explosions and architectural changes (LSTM/GRU/attention) for vanishing. These are not competing solutions — they address different failure modes and should be used together.

A deeper look at the interaction

Pascanu et al. provided a key insight about how vanishing and exploding interact within the same gradient computation. Consider the total gradient:

∂L/∂W_h = ∑_{t=1}^{T} (∂L/∂h_T) · (∂h_T/∂h_t) · (∂h_t/∂W_h)

This is a sum of T terms, one per time step, and each term contains the Jacobian product ∂h_T/∂h_t. In the typical contractive regime, the terms for recent time steps (t near T) involve short products and stay large, while the terms for distant time steps (t near 1) involve long products and shrink toward zero. The total gradient is then dominated by the recent terms, and the distant terms contribute negligibly. But if even one recent term has an exploding Jacobian product, it dominates the sum and makes the entire gradient enormous.

So within a single gradient computation:

Terms t = T, T-1, T-2

Large gradient contributions (recent past) — may explode

↓

Terms t = T-5, T-10

Moderate contributions

↓

Terms t = 1, 2, 3

Negligible contributions (distant past) — vanished

Gradient clipping handles the exploding recent terms. But it does nothing for the vanished distant terms. The network can learn from recent context but remains blind to the distant past. This is why both solutions are needed: clip the explosion (algorithmic fix) AND prevent the vanishing (architectural fix).
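
To make the point concrete, here is a toy calculation. It assumes a scalar linear recurrence h_t = w · h_{t-1} + x_t, where the Jacobian product ∂h_T/∂h_t reduces to w^(T−t), and it treats clipping as a single shared rescaling factor applied to all terms. The functions `term_magnitudes` and `clip` are illustrative names, and summing magnitudes ignores the signs that real gradient terms carry.

```python
def term_magnitudes(w, T):
    """|dh_T/dh_t| for t = 1..T in the scalar linear recurrence
    h_t = w * h_{t-1} + x_t. The Jacobian product is w ** (T - t)."""
    return [abs(w) ** (T - t) for t in range(1, T + 1)]

def clip(values, tau):
    """Rescale all terms by one shared factor so their sum is at most tau
    (a scalar stand-in for clipping the total gradient norm)."""
    total = sum(values)
    factor = min(1.0, tau / total)
    return [v * factor for v in values]

T = 50
terms = term_magnitudes(0.8, T)
recent_share = sum(terms[-10:]) / sum(terms)     # t = T-9 .. T
distant_share = sum(terms[:10]) / sum(terms)     # t = 1 .. 10

clipped = clip(terms, tau=1.0)
recent_share_clipped = sum(clipped[-10:]) / sum(clipped)

print(f"recent 10 steps:  {recent_share:.1%} of the gradient")
print(f"distant 10 steps: {distant_share:.2e} of the gradient")
print(f"recent share after clipping: {recent_share_clipped:.1%}")
```

Clipping multiplies every term by the same factor, so the relative weight of distant and recent terms is untouched. Whatever was invisible before clipping is just as invisible afterwards.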

Echo state networks: an interesting alternative

The paper also discusses echo state networks (ESN) as a reference point. In an ESN, the recurrent weights W_h are fixed (not trained) — only the output weights are learned. This completely avoids the gradient problem since gradients never flow through W_h. But it limits what the network can represent: the dynamics are random rather than task-specific.

ESNs show that the gradient problem is specifically about learning the recurrence, not about using it. If you're willing to accept random dynamics, recurrence is fine. But learning task-specific dynamics through gradient descent is where the fundamental difficulty lies.
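
A minimal echo state network fits in a few lines of NumPy. This sketch is an illustration under assumptions that are not from the paper: the reservoir size, the spectral radius of 0.9, the ridge coefficient, and the toy task (recall the previous input) are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, T = 50, 500

# Fixed random reservoir, rescaled to spectral radius 0.9. It is never trained.
W_h = rng.standard_normal((n_res, n_res))
W_h *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_h)))
W_in = rng.standard_normal(n_res)

# Toy task (an assumption for illustration): output the previous input u[t-1].
u = rng.uniform(-1.0, 1.0, size=T)
target = np.roll(u, 1)
target[0] = 0.0

# Run the reservoir forward; no gradients are involved anywhere.
H = np.zeros((T, n_res))
h = np.zeros(n_res)
for t in range(T):
    h = np.tanh(W_h @ h + W_in * u[t])
    H[t] = h

# Train only the linear readout, in closed form, by ridge regression.
ridge = 1e-6
W_out = np.linalg.solve(H.T @ H + ridge * np.eye(n_res), H.T @ target)

mse = float(np.mean((H @ W_out - target) ** 2))
baseline = float(np.mean(target ** 2))   # error of always predicting 0
print(f"readout MSE {mse:.4f} vs baseline {baseline:.4f}")
```

The only learned parameters are in `W_out`, and they are found by solving a linear system. No backpropagation through time takes place, so there is no Jacobian product that could vanish or explode.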

The role of initialization

Pascanu et al. highlighted the importance of weight initialization for RNNs. The initial spectral radius of W_h determines whether the network starts in the vanishing, exploding, or near-critical regime:

Initialization	Initial ρ(W_h)	Behavior
Gaussian N(0, 0.01)	≈ 0.1	Strongly vanishing — very slow learning even for short sequences
Xavier / Glorot	≈ 1.0	Near critical — works for moderate sequences but unstable for long ones
Orthogonal	= 1.0 exactly	Best starting point — all eigenvalues on unit circle
Identity + noise	≈ 1.0	Good alternative — W_h = I + εN preserves information initially

The identity initialization trick: Le et al. (2015) showed that initializing W_h as the identity matrix (plus small noise) with ReLU activation gives surprisingly good results. The intuition: an identity recurrence matrix copies the hidden state forward unchanged. The network starts by remembering everything, then learns what to forget — rather than starting with random dynamics and trying to learn what to remember.

python
# Different initialization strategies for RNN hidden weights
import torch
import torch.nn as nn

n = 128

# 1. Standard Gaussian (too small)
W_gauss = torch.randn(n, n) * 0.01

# 2. Xavier initialization
W_xavier = torch.randn(n, n) / n**0.5

# 3. Orthogonal (recommended for RNNs)
W_orth = torch.empty(n, n)
nn.init.orthogonal_(W_orth)

# 4. Identity + noise (IRNN, Le et al. 2015)
W_irnn = torch.eye(n) + torch.randn(n, n) * 0.001

# Check spectral radii
for name, W in [("Gaussian", W_gauss), ("Xavier", W_xavier),
                    ("Orthogonal", W_orth), ("IRNN", W_irnn)]:
    rho = torch.linalg.eigvals(W).abs().max().item()
    print(f"{name:12s}: rho = {rho:.4f}")

python
# RNN training recipe: clipping from Pascanu et al. plus common later practice
# (dataloader and compute_loss are placeholders for your own data and loss)
import torch
import torch.nn as nn

hidden = 512

# 1. Use LSTM instead of vanilla RNN (mitigates vanishing)
model = nn.LSTM(256, hidden, num_layers=2, dropout=0.3)

# 2. Initialize carefully
with torch.no_grad():
    for name, param in model.named_parameters():
        if 'weight_hh' in name:
            # weight_hh stacks four (hidden x hidden) gate blocks: i, f, g, o.
            # Make each square block orthogonal (eigenvalues on unit circle).
            for block in param.chunk(4, dim=0):
                nn.init.orthogonal_(block)
        elif 'bias' in name:
            nn.init.zeros_(param)
            # Set forget gate bias to 1 (remember by default). PyTorch adds
            # bias_ih and bias_hh, so fill only one of them.
            if 'bias_ih' in name:
                param[hidden:2 * hidden].fill_(1.0)

# 3. Gradient clipping (fixes exploding)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()

Why should gradient clipping and LSTM gates be used together, rather than one or the other?

They both fix the same problem, so using both gives more redundancy
LSTMs make gradient clipping unnecessary
They fix different problems: gradient clipping prevents explosions (an algorithmic fix) while LSTM gates prevent vanishing (an architectural fix). The same model can experience both failure modes on different training examples

Chapter 6: Training Lab

Time to put everything together. This interactive simulation lets you train an RNN with and without gradient clipping, on a sequence memorization task. You'll see how clipping prevents catastrophic gradient explosions while preserving learning.

RNN Training Lab

Train a small RNN to memorize a pattern in a length-20 sequence. The top panel shows the loss curve. The bottom panel shows gradient norms at each iteration. Toggle gradient clipping and adjust the threshold to see its effect. Without clipping, watch for the catastrophic loss spike.

What to observe:
1. Without clipping: Training starts well but eventually hits a cliff. The gradient norm spikes (bottom panel), the loss jumps to a huge value or NaN, and the model never recovers.
2. With clipping (τ=5): Training proceeds smoothly. You'll see occasional gradient norm spikes in the bottom panel, but they're capped at τ, preventing the catastrophic loss jump.
3. Try τ=1: Very aggressive clipping. Learning is stable but slow, because almost every gradient is rescaled down to norm 1.
4. Try τ=20: Mild clipping. Some cliffs still cause trouble because the threshold is too permissive.
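
The effect of the three thresholds can also be seen without the simulation. Below is a small pure-Python sketch of clipping by norm; the two example gradients are made up, and `clip_by_norm` is an illustrative helper rather than a library function.

```python
import math

def clip_by_norm(grad, tau):
    """Return (clipped gradient, original norm). The gradient is rescaled
    so that its L2 norm is at most tau; its direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > tau:
        grad = [g * tau / norm for g in grad]
    return grad, norm

# A made-up "cliff" gradient with norm 50 and a routine one with norm 2.5.
cliff = [30.0, -40.0]
routine = [1.5, -2.0]

for tau in (1.0, 5.0, 20.0):
    clipped_cliff, _ = clip_by_norm(cliff, tau)
    clipped_routine, _ = clip_by_norm(routine, tau)
    print(f"tau={tau:>4}: cliff -> {clipped_cliff}, routine -> {clipped_routine}")
```

With τ = 5 the routine gradient passes through untouched and only the cliff gradient is scaled down. With τ = 1 both are rescaled to norm 1 and end up essentially identical, so the optimizer can no longer tell a routine step from a cliff. With τ = 20 the cliff gradient still takes a step eight times larger than the routine one.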

The practical training recipe

Combining the paper's clipping result with practices that became standard in later work (LSTM/GRU, orthogonal initialization, forget gate bias) gives this recipe for training RNNs:

1. Architecture

Use LSTM or GRU instead of vanilla RNN

↓

2. Initialization

Orthogonal init for W_hh, forget gate bias = 1

↓

3. Gradient clipping

Clip by norm with τ = 1-5

↓

4. Learning rate

Start with lr = 0.001-0.01, decay on plateau

↓

5. Monitor

Log gradient norms every iteration — cliffs show as spikes

This recipe became the standard for RNN training and, with minor modifications, is still used for training Transformers today. The gradient clipping step (3) is particularly universal — you'll find it in the training code of GPT-2, GPT-3, BERT, and essentially every large language model.

Diagnosing training problems

When your RNN (or any deep network) isn't training well, the gradient norm plot is your best diagnostic tool. Here's how to read it:

Pattern in gradient norm plot	Diagnosis	Fix
Consistently near zero	Vanishing gradients	Use LSTM/GRU, add residual connections, check initialization
Occasional massive spikes	Cliffs in error surface	Enable gradient clipping (or lower threshold)
Steadily increasing	Gradients slowly exploding	Lower learning rate, check weight initialization
Oscillating wildly	Near critical point (η ≈ 1)	Learning rate too high for this regime
Stable and gradually decreasing	Healthy training	No changes needed!

python
# Complete training loop with monitoring
# (dataloader and compute_loss are placeholders for your own data and loss)
import torch
import torch.nn as nn

model = nn.LSTM(256, 512, num_layers=2, dropout=0.3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
max_norm = 5.0
grad_history = []

for step, batch in enumerate(dataloader):
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()

    # clip_grad_norm_ returns the total norm measured BEFORE clipping
    norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm
    ).item()
    grad_history.append(norm)
    optimizer.step()

    # Alert on anomalies
    if norm > max_norm * 0.9:
        print(f"Step {step}: near-clip event (norm={norm:.1f})")
    if step > 100 and norm < 1e-6:
        print(f"Step {step}: possible vanishing gradient!")
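
The diagnosis table can be turned into a rough automatic check on a recorded list of gradient norms. The sketch below is a heuristic only: the function name `diagnose` and every threshold in it are assumptions chosen for illustration, and the "oscillating wildly" row is deliberately left out because it is hard to separate from ordinary noise with a rule this simple.

```python
from statistics import median

def diagnose(grad_norms, tiny=1e-6, spike_ratio=10.0):
    """Rough heuristic label for a list of per-step gradient norms.
    The thresholds are illustrative assumptions, not tuned values."""
    typical = median(grad_norms)
    if typical < tiny:
        return "vanishing: norms are consistently near zero"
    if max(grad_norms) > spike_ratio * typical:
        return "cliffs: occasional spikes far above the typical norm"
    half = len(grad_norms) // 2
    early, late = median(grad_norms[:half]), median(grad_norms[half:])
    if late > 2.0 * early:
        return "slow explosion: norms are steadily increasing"
    return "healthy: norms are stable or decreasing"

print(diagnose([1e-8] * 100))
print(diagnose([1.0] * 50 + [400.0] + [1.0] * 49))
print(diagnose([0.1 * 1.02 ** i for i in range(100)]))
print(diagnose([2.0 * 0.99 ** i for i in range(100)]))
```

Using the median as the "typical" norm keeps a handful of spikes from distorting the baseline they are compared against. In a real run the history would be the pre-clip norms, since post-clip norms can never exceed the threshold.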

The connection to modern LLM training

Most published LLM training recipes report the norm-based gradient clipping analyzed in this paper:

Model	Gradient clip norm	Source
GPT-2 (2019)	1.0	OpenAI codebase
GPT-3 (2020)	1.0	Brown et al.
LLaMA (2023)	1.0	Touvron et al.
Chinchilla (2022)	1.0	Hoffmann et al.
PaLM (2022)	1.0	Chowdhery et al.

Notice the convergence on max_norm = 1.0 for Transformers (vs 5.0 for RNNs). A plausible explanation is that Transformer gradients are generally better behaved than RNN gradients, so a tighter clip rarely interferes with normal updates while still guarding against rare spikes.

In the training lab, what happens differently between training with and without gradient clipping?

Without clipping, the optimizer eventually encounters a cliff where the gradient explodes, causing a catastrophic loss spike. With clipping, the gradient norm is capped, preventing the spike while preserving the correct gradient direction
Clipping makes training faster by removing unnecessary gradient computation
There is no difference in practice; clipping is only theoretically beneficial

Chapter 7: Connections

Pascanu et al.'s 2013 paper bridged the gap between Bengio's 1994 theoretical analysis and the practical training of deep sequence models. Gradient clipping became one of the most widely used techniques in all of deep learning.

Impact timeline

Year	Development	Connection to this paper
1994	Bengio: vanishing gradient theorem	The theoretical foundation this paper builds on
1997	LSTM (Hochreiter & Schmidhuber)	Architectural solution to vanishing; this paper's clipping complements it
2013	This paper: gradient clipping	First practical solution to exploding gradients
2014	GRU (Cho et al.)	Simplified LSTM, used with gradient clipping
2017	Transformer (Vaswani et al.)	Eliminated recurrence but still uses gradient clipping during training
2018	GPT-1 (Radford et al.)	Transformer + gradient clipping (max_norm = 1.0)
2020	GPT-3 (Brown et al.)	Same recipe at 175B parameters — clipping still essential

The lasting legacy: Open any modern LLM training codebase and search for "grad_clip" or "clip_grad_norm." You'll almost certainly find it. This one idea from 2013 (cap the gradient when it's too big) appears in the published training recipes of GPT-3, PaLM, and LLaMA, and is very likely used well beyond them. It's one of the simplest ideas in deep learning, and one of the most consequential.

Related lessons

Bengio 1994

The vanishing gradient theorem — prerequisite for this paper

↓

This paper (Pascanu 2013)

Gradient clipping, cliff phenomenon, practical training

↓

Attention Is All You Need (2017)

The Transformer — makes recurrence (and its gradient problems) obsolete

The paper's experimental results

Pascanu et al. tested on several challenging RNN tasks:

Task	Without clipping	With clipping
Temporal order (T=50)	Diverged (31% of runs)	Converged (0% divergence)
Addition problem (T=100)	Diverged (47% of runs)	Converged (2% divergence)
Multiplication (T=50)	Diverged (62% of runs)	Converged (5% divergence)
Penn Treebank LM	Frequent NaN losses	Stable training

The results are striking: gradient clipping alone reduced training failures from 30-60% to 0-5% of runs. This is a transformation from "RNN training is unreliable" to "RNN training reliably works."

The combination of gradient clipping + gradient regularization performed even better on the most challenging tasks (temporal order at T = 200), but the improvement over clipping alone was modest. This suggests that for most practical purposes, gradient clipping is sufficient — regularization provides marginal additional benefit at significant computational cost.

Perhaps the most impressive result: even with gradient clipping enabled, the training convergence speed was not significantly affected. The clipped steps (when they occurred) still moved the parameters in the correct direction, just with smaller magnitude. The model reached the same final performance, just without the catastrophic failures along the way. In fact, on several tasks, the clipped training achieved slightly better final performance than the unclipped successful runs — likely because even the runs that didn't fully diverge still suffered from occasional large weight perturbations that damaged partially-learned representations.

What this paper didn't solve

Gradient clipping handles exploding gradients but does nothing for vanishing gradients. The network still can't learn dependencies longer than ~20-50 steps with a vanilla RNN, even with perfect clipping. For that, you need architectural solutions: LSTM, GRU, attention, or the Transformer.

The Transformer (2017) eventually made the RNN gradient problem largely moot by replacing sequential processing with parallel attention. But understanding why RNNs fail remains crucial — it's the motivation behind every major architectural innovation of the past decade, from LSTMs to Transformers to state-space models.

Moreover, the tools developed in this paper — gradient clipping, gradient norm monitoring, and the understanding of error surface geometry — remain essential even in the Transformer era. The specific failure mode has changed (Transformers don't have cliffs from recurrence), but the general principle holds: always monitor your gradient norms, always have a safety mechanism for extreme gradients, and always understand the geometry of your loss landscape.

The continuing relevance of this work

Beyond gradient clipping itself, this paper established a methodology for analyzing training dynamics that remains influential. The approach of (1) identifying a failure mode through theory, (2) characterizing its manifestation in practice (the cliff), and (3) providing a simple algorithmic fix (clipping) has been applied to many subsequent training problems:

Problem	Analysis paper	Simple fix
Gradient explosion	Pascanu 2013 (this paper)	Gradient clipping
Internal covariate shift	Ioffe & Szegedy 2015	Batch normalization
Degradation in deep nets	He et al. 2015	Residual connections
Training instability in GANs	Miyato et al. 2018	Spectral normalization
Loss spikes in LLM training	Various 2022-2023	Gradient clipping + learning rate cooldown

Modern extensions of gradient clipping

The basic idea of gradient clipping has been extended in several ways since 2013:

Extension	Year	Idea
Gradient scaling (AMP)	2017	Scale loss up to prevent underflow in FP16, then unscale gradients before clipping
Gradient accumulation	~2018	Accumulate gradients across micro-batches, clip the accumulated gradient
AGC (Adaptive Gradient Clipping)	2021	Clip based on the ratio of gradient norm to parameter norm (NFNet)
Gradient noise injection	2015	Add noise after clipping to help escape sharp minima
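
The gradient accumulation row hides a detail worth spelling out: clipping the accumulated gradient is not the same operation as clipping each micro-batch gradient and then combining them. A small NumPy sketch with made-up micro-batch gradients (the helper `clip_by_norm` is illustrative):

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so that its L2 norm is at most tau."""
    norm = np.linalg.norm(g)
    return g * (tau / norm) if norm > tau else g

tau = 1.0
# Four made-up micro-batch gradients; the third one is an outlier.
micro_grads = [np.array([0.2, 0.1]),
               np.array([0.3, -0.1]),
               np.array([8.0, 6.0]),
               np.array([0.1, 0.2])]

# Accumulate first (average over micro-batches), then clip once.
accumulated = np.mean(micro_grads, axis=0)
clip_after = clip_by_norm(accumulated, tau)

# Clipping each micro-batch separately and then averaging gives a
# different vector, with a different direction and a smaller norm.
clip_each = np.mean([clip_by_norm(g, tau) for g in micro_grads], axis=0)

print("clip accumulated gradient:", clip_after)
print("clip each micro-batch:    ", clip_each)
```

Clipping once after accumulation reproduces what a single large batch would have done, which is the behavior the table describes. The per-micro-batch variant changes how much the outlier contributes, so it is a different algorithm rather than an equivalent implementation.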

AGC from the NFNet paper (Brock et al., 2021) is particularly interesting. Instead of a fixed threshold, it clips when the gradient is "too large relative to the weight": specifically, when ||g|| / ||w|| exceeds a threshold λ. In the paper these norms are computed unit-wise (per output unit of a layer) rather than over the whole tensor. Tying the threshold to the weight norm makes the clipping scale-invariant and reduces the need to tune τ separately for different layers.

python
# Adaptive Gradient Clipping (AGC) from NFNet, simplified to per-tensor norms
# (the paper computes the norms unit-wise)
import torch

@torch.no_grad()
def agc(parameters, clip_factor=0.01, eps=1e-3):
    """Clip each gradient based on its gradient-to-weight norm ratio."""
    for p in parameters:
        if p.grad is None:
            continue
        p_norm = p.norm().clamp(min=eps)   # floor keeps zero-initialized weights trainable
        g_norm = p.grad.norm()
        max_norm = p_norm * clip_factor
        if g_norm > max_norm:
            p.grad.mul_(max_norm / g_norm)
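
For comparison, here is a unit-wise variant written in NumPy. It assumes a 2-D weight matrix whose rows correspond to output units, which matches the usual description of the NFNet formulation but is stated here as an assumption; the function name `agc_unitwise` and the example values are invented.

```python
import numpy as np

def agc_unitwise(weight, grad, clip_factor=0.01, eps=1e-3):
    """Unit-wise AGC for a 2-D weight matrix of shape (out_units, in_units).
    Each row (one output unit) is clipped against its own weight norm."""
    w_norm = np.maximum(np.linalg.norm(weight, axis=1, keepdims=True), eps)
    g_norm = np.linalg.norm(grad, axis=1, keepdims=True)
    max_norm = clip_factor * w_norm
    # Rows within the limit keep scale 1; others are shrunk onto the limit.
    scale = np.where(g_norm > max_norm,
                     max_norm / np.maximum(g_norm, 1e-12),
                     1.0)
    return grad * scale

W = np.array([[3.0, 4.0],       # row norm 5 -> limit 0.05
              [0.6, 0.8]])      # row norm 1 -> limit 0.01
G = np.array([[0.003, 0.004],   # row norm 0.005, under the limit
              [0.30, 0.40]])    # row norm 0.5, far over the limit

clipped = agc_unitwise(W, G)
print(np.linalg.norm(clipped, axis=1))
```

Only the second row is rescaled, and its direction is preserved. A per-tensor rule would have applied one decision to the whole matrix, so a single unit with an outsized gradient would either escape clipping or drag every other unit down with it.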

The surprising simplicity of the solution

Looking back, it's remarkable that the solution to one of the most important problems in deep learning, gradient explosion, is so simple. Gradient clipping is literally: "if the gradient is too big, make it smaller." Three lines of code. No new mathematical framework, no complex theory, and little hyperparameter search: the paper reports that training is not very sensitive to τ, although extreme values still cost you (too small slows learning, too large lets cliffs through).

This is a pattern in deep learning: the most impactful techniques are often embarrassingly simple. Dropout (randomly zero out neurons), batch normalization (normalize activations), residual connections (add the input to the output), and gradient clipping (cap the gradient norm) — each is a one-line idea that transformed the field.

Gradient clipping in the LLM era

As of 2024, gradient clipping remains essential for training large language models. Here is a summary of how it's used in practice at scale:

python
# Modern LLM training loop (simplified)
# Based on LLaMA / GPT training recipes
# TransformerLM and dataloader are placeholders, not real library objects
import torch
from torch.nn.utils import clip_grad_norm_
from torch.cuda.amp import GradScaler, autocast

model = TransformerLM(
    vocab_size=32000, d_model=4096,
    n_layers=32, n_heads=32, d_ff=11008
)  # ~7B parameters (LLaMA-7B config)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4, weight_decay=0.1,
    betas=(0.9, 0.95)
)
scaler = GradScaler()  # for mixed precision

for step, batch in enumerate(dataloader):
    with autocast():
        loss = model(batch)

    # Scale loss for FP16 stability
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)

    # Gradient clipping — still essential at 7B scale!
    grad_norm = clip_grad_norm_(
        model.parameters(), max_norm=1.0
    )

    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

Notice that even with Transformers (no recurrence!), gradient clipping at max_norm=1.0 is standard. The gradient dynamics are more stable than RNNs, but occasional spikes still occur — especially early in training, during learning rate warmup, or on unusual data batches. The clipping provides a safety net that costs nearly nothing and prevents rare but catastrophic training failures.
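
One detail in the loop above is easy to get wrong: `scaler.unscale_(optimizer)` has to run before `clip_grad_norm_`. The pure-Python sketch below shows why, using a made-up two-element gradient and a loss scale of 65536 (the default initial scale of PyTorch's `GradScaler`); `l2_norm` and `clip_by_norm` are illustrative helpers.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))

def clip_by_norm(vec, tau):
    """Rescale vec so that its L2 norm is at most tau."""
    norm = l2_norm(vec)
    return [v * tau / norm for v in vec] if norm > tau else list(vec)

loss_scale = 65536.0            # loss is multiplied by this before backward()
true_grad = [0.3, -0.4]         # norm 0.5, below the threshold of 1.0
scaled_grad = [g * loss_scale for g in true_grad]   # what backward() produces

# Correct order: unscale first, then clip. The gradient is left alone.
unscaled = [g / loss_scale for g in scaled_grad]
right = clip_by_norm(unscaled, tau=1.0)

# Wrong order: clip the scaled gradient, then unscale. The threshold is
# compared against a norm inflated by loss_scale, so the update is crushed.
wrong = [g / loss_scale for g in clip_by_norm(scaled_grad, tau=1.0)]

print(f"unscale then clip: norm = {l2_norm(right):.3e}")
print(f"clip then unscale: norm = {l2_norm(wrong):.3e}")
```

In the wrong order every step gets clipped, and the effective threshold becomes 1.0 divided by the loss scale. Training would not crash, it would just learn almost nothing, which makes this bug hard to spot without logging gradient norms.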

The three-line legacy: Pascanu et al.'s gradient clipping algorithm — compute norm, compare to threshold, rescale if needed — runs inside every modern AI training pipeline. From 32-parameter toy RNNs to trillion-parameter language models, the same three lines of code keep training stable. It is perhaps the highest impact-per-line-of-code contribution in all of machine learning.

Closing thought: "The simplest ideas are often the most powerful. Gradient clipping is three lines of code — check the norm, rescale if too large, proceed. Yet this trivial modification turned RNN training from an unreliable art into a reproducible science." — The elegance of Pascanu et al.'s contribution lies not in mathematical sophistication but in practical insight: sometimes the best solution to a complex problem is embarrassingly simple.

Why is gradient clipping still used in Transformer training (GPT, Claude, etc.) even though Transformers don't use recurrence?

Transformers still have deep computation graphs (many layers), and gradient explosions can occur due to the depth alone; gradient clipping provides a cheap safety net that prevents catastrophic training failures regardless of architecture
Transformers have recurrence in their attention layers
It's only used for backward compatibility, not because it's needed

On the Difficulty of Training Recurrent Neural Networks