The paper that introduced gradient clipping, explained the "cliff" phenomenon in RNN error surfaces, and gave the first practical training recipes for deep recurrent networks.
You're training an RNN on a language modeling task. The loss is dropping steadily — 4.2, 3.8, 3.5, 3.1... Then suddenly, on iteration 8,347, the loss jumps to NaN. Your model is destroyed. All the weights are infinity. Hours of training, wasted.
This isn't a bug. It's a feature of RNN error surfaces — and until this 2013 paper, nobody fully understood why it happened or how to prevent it.
In the previous lesson on Bengio 1994, we learned that gradients in RNNs tend to vanish exponentially over time. But Bengio's analysis focused on the vanishing case. What about the opposite? What happens when gradients explode?
Pascanu, Mikolov, and Bengio showed that exploding gradients are not just the symmetric opposite of vanishing gradients. They have their own unique pathology: the cliff phenomenon. The RNN error surface contains regions where the gradient suddenly becomes enormous — like a cliff edge. If your optimizer steps near one of these cliffs, the gradient catapults the weights far from the current solution, often to a region where the loss is much worse or even infinite.
The cliff phenomenon is particularly insidious because it's unpredictable. You can be training for hours with stable, decreasing loss, and then a single unlucky batch sends the gradient norm from 5 to 50,000 in one step. The weight update is so large that it destroys the learned representations. The model can't recover because the new weight values are far from any good region of parameter space.
Before this paper, practitioners dealt with gradient explosions through trial and error: restart training with a smaller learning rate, use shorter sequences, add more regularization. None of these were principled solutions. Pascanu et al. provided both the explanation (cliffs) and the fix (clipping).
The practical significance cannot be overstated. In 2013, training an RNN was a coin flip — it might work, it might explode. A single cliff encounter could waste hours of GPU time. Gradient clipping transformed RNN training from gambling to engineering.
In their experiments, Pascanu et al. found that without gradient clipping, 30-60% of training runs diverged (loss went to NaN) within the first epoch. The frequency depended on:
| Factor | Effect on cliff frequency |
|---|---|
| Longer sequences | More frequent (sharper cliffs, more opportunities to hit them) |
| Larger learning rate | More frequent (larger steps = higher probability of landing on a cliff) |
| Larger weight scale | More frequent (closer to the critical point η = 1) |
| Certain input patterns | Some batches create cliffs that others don't |
| Early in training | More frequent (random weights are more prone to critical dynamics) |
Figure: an RNN training run. The loss decreases steadily until the optimizer encounters a "cliff" in the error surface; the gradient explodes, the weight update is enormous, and training is destroyed.
Together, these insights transformed RNN training from an art requiring extreme patience and luck into a science with reliable recipes. Gradient clipping became so fundamental that every major deep learning framework ships it as a built-in utility. It is not applied automatically, though: in PyTorch, for example, you call clip_grad_norm_ yourself between backward() and step(). Most published training recipes for Transformers and LSTMs include that call.
Pascanu et al. used several carefully designed diagnostic tasks to study gradient dynamics:
| Task | Description | Why it's useful |
|---|---|---|
| Temporal order | Two markers appear in the first 10% of a length-T sequence. Output their order. | Tests long-range memory — T controls difficulty |
| Addition problem | Two numbers are marked in a sequence. Output their sum. | Tests whether the network can store and combine distant information |
| Multiplication | Similar to addition but with multiplication | Tests sensitivity to distant inputs (product is more sensitive than sum) |
| Penn Treebank LM | Standard language modeling benchmark | Tests practical performance on real data |
These tasks are specifically designed so that the difficulty scales with sequence length T. At T = 10, all methods work. At T = 100, only methods with proper gradient handling succeed. This controlled scaling reveals the gradient dynamics clearly, making it possible to isolate the effect of gradient clipping from other factors (model capacity, optimizer choice, etc.).
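To make the task format concrete, here is a minimal sketch of a data generator for the addition problem, in plain Python. The function name, the uniform value range, and the rule that the two markers may fall anywhere in the sequence are assumptions for illustration; the paper's exact placement rules are not reproduced here.

```python
import random

def make_addition_example(T, rng):
    """Return (sequence, target) for one addition-problem example.

    sequence: list of T (value, marker) pairs; exactly two markers are 1.
    target:   sum of the two marked values.
    """
    values = [rng.random() for _ in range(T)]
    # Assumption: the two marked positions are drawn uniformly without replacement.
    i, j = rng.sample(range(T), 2)
    markers = [1 if k in (i, j) else 0 for k in range(T)]
    target = values[i] + values[j]
    return list(zip(values, markers)), target

rng = random.Random(0)
seq, target = make_addition_example(T=100, rng=rng)
marked = [v for v, m in seq if m == 1]
print(len(seq), len(marked), abs(sum(marked) - target) < 1e-12)
```

Increasing T lengthens the gap the network has to bridge between the marked inputs and the output, which is what lets difficulty scale with sequence length.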
The addition and temporal order tasks have become standard benchmarks for evaluating new sequence model architectures. If a model can solve them at T = 500+ without special initialization or training tricks, that is strong evidence it handles long-range dependencies. As a rough guide, LSTMs can handle T ≈ 200-500 (with gradient clipping), while vanilla RNNs tend to fail at T ≈ 20-50.
The relationship between these two papers is complementary:
Bengio 1994:
Theoretical. Proves gradients vanish/explode. Analyzes the mathematical structure. Doesn't offer solutions.
Pascanu 2013:
Practical. Shows how the theory manifests in training (cliffs). Provides solutions (clipping, regularization). Gives training recipes.
Together, they form the complete story: Bengio explained the what and why; Pascanu provided the how to fix it and what it looks like in practice.
Why do RNN error surfaces have cliffs? The answer comes directly from the Jacobian product chain in Chapter 2 of the vanishing gradients lesson, revisited here with a focus on the exploding case rather than the vanishing case.
Recall that the gradient involves a product of Jacobians:

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \operatorname{diag}\big(f'(z_k)\big)\, W_h$$

When the spectral radius $\rho(W_h)$ is slightly above the critical threshold, this product grows exponentially. But here's the subtle part: the rate of growth depends on the specific values of the hidden states at each step, which determine the activation derivatives $f'(z_k)$.
Consider what happens as you move through parameter space during training. At most points, the hidden states are in the saturated regime (large |z|), so f'(z) is small and the gradients are manageable — in fact, they're vanishing, which is the more common problem. But there exist thin regions in parameter space where, for a particular training example, the hidden states align in the linear regime (|z| near 0, f' near 1). In these regions, the Jacobian product doesn't benefit from the damping of activation saturation, and the gradient explodes.
The key quantitative insight: when $f' \approx 1$ (linear regime) and $\rho(W_h) > 1$, the gradient grows as $\rho^T$. For ρ = 1.1 and T = 100: $1.1^{100} \approx 13{,}781$. For ρ = 1.2 and T = 100: $1.2^{100} \approx 8.3 \times 10^7$. The gradient can be tens of millions of times larger than normal, and this happens over a parameter region that's only ε wide. This is the cliff: a gradient spike of astronomical magnitude over a vanishingly thin region.
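The arithmetic is easy to check directly. The short sketch below evaluates the growth factor for a few spectral radii at T = 100; the helper name `growth` is only illustrative, and the calculation assumes the idealized linear regime where every activation derivative equals 1.

```python
def growth(rho, T):
    """Amplification rho**T of the gradient in the idealized linear regime (f' = 1)."""
    return rho ** T

for rho in (0.9, 1.0, 1.1, 1.2):
    print(f"rho = {rho:.1f}   rho^100 = {growth(rho, 100):.3e}")
```

A change of 0.1 in the spectral radius separates a gradient that has shrunk by several orders of magnitude from one that has grown by several.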
Figure: a simplified RNN error surface as a function of a single weight, with a sharp cliff. Near the cliff, the gradient is orders of magnitude larger than in the smooth valleys.
Pascanu et al. proved that cliffs appear when the following condition holds. Consider the Jacobian product at a specific point in parameter space. Define:
If η > 1, the gradient magnitude is bounded below by $\eta^{T-t}$, which grows exponentially. The cliff occurs at the boundary between the region where η < 1 (vanishing) and where η > 1 (exploding). At this boundary, η crosses 1, and the gradient transitions abruptly from near-zero to enormous.
The width of this transition zone is inversely proportional to T: longer sequences create sharper cliffs. For a sequence of length 100, the cliff can span just $10^{-4}$ in parameter space while the gradient changes by $10^{10}$.
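A small numerical sketch shows why the transition sharpens with sequence length. It treats η as a constant per-step factor (a simplifying assumption) and compares two values just below and just above 1 as T grows.

```python
def gradient_scale(eta, steps):
    """Gradient scale eta**steps for a constant per-step factor eta."""
    return eta ** steps

for T in (10, 100, 1000):
    below = gradient_scale(0.95, T)
    above = gradient_scale(1.05, T)
    print(f"T = {T:4d}   eta=0.95 -> {below:.3e}   eta=1.05 -> {above:.3e}   "
          f"ratio = {above / below:.3e}")
```

The same small difference in η produces a modest gap at T = 10 and an astronomically large one at T = 1000, which is the sense in which longer sequences make the boundary at η = 1 steeper.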
Pascanu et al. showed that the cliff width scales as:

$$\text{width} \sim O\!\left(\frac{1}{T}\right)$$

While the gradient magnitude scales as:

$$\|g\| \sim O\!\left(\eta^{T}\right)$$

So the ratio of gradient magnitude to cliff width grows as $O(T \cdot \eta^T)$: exponentially in T, with an extra factor of T on top. This is why cliffs are so dangerous: for long sequences, the gradient is enormous over an extremely thin region of parameter space. The probability of an SGD step landing exactly on a cliff is small, but when it does happen, the consequence is catastrophic.
Understanding when cliffs appear helps practitioners anticipate and prevent training failures. In their extensive experiments across multiple tasks and architectures, Pascanu et al. found that cliffs typically appear:
| When? | Why? |
|---|---|
| Early in training | Random weights are more likely to produce near-critical spectral radius |
| On specific input sequences | Certain input patterns push hidden states into the linear regime |
| After learning rate warmup | Larger steps increase the probability of hitting a cliff |
| With longer sequences | More time steps = sharper cliffs = more dangerous |
```python
# Demonstrating the cliff: gradient norm vs. weight perturbation
import numpy as np
import torch
import torch.nn as nn

rnn = nn.RNN(1, 32, batch_first=True)
x = torch.ones(1, 50, 1)  # constant input, length 50

# Scan along one random direction in recurrent-weight space
original_w = rnn.weight_hh_l0.data.clone()
direction = torch.randn_like(original_w)
direction /= direction.norm()

epsilons = np.linspace(-0.5, 0.5, 200)
grad_norms = []
for eps in epsilons:
    # float(eps) keeps the weights in float32 (eps is a NumPy float64)
    rnn.weight_hh_l0.data = original_w + float(eps) * direction
    rnn.zero_grad()
    out, _ = rnn(x)
    loss = out[0, -1, :].sum()
    loss.backward()
    grad_norms.append(rnn.weight_hh_l0.grad.norm().item())

# Plot grad_norms against epsilons and look for sudden spikes (cliffs).
# How large they get depends on the initialization and the direction scanned.
```
The cliff problem has an elegantly simple solution: gradient clipping. Before applying the gradient update, check if the gradient norm exceeds a threshold. If it does, rescale the gradient to have exactly that threshold norm.
Think about what this means practically. When the optimizer encounters a cliff, the gradient points in the correct direction — downhill, away from the cliff. The problem is only the magnitude: the gradient is so large that a normal-sized step would overshoot dramatically. The solution is obvious in retrospect: keep the direction, shrink the magnitude.
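The algorithm is three lines:

$$
\begin{aligned}
&g \leftarrow \partial L / \partial \theta \\
&\textbf{if } \|g\| > \tau \textbf{ then} \\
&\qquad g \leftarrow \frac{\tau}{\|g\|}\, g
\end{aligned}
$$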
That's it. When the gradient is small (||g|| ≤ τ), nothing changes — the update proceeds as normal. When the gradient is large (||g|| > τ), the direction is preserved but the magnitude is capped at τ. This turns the cliff from a catastrophic fall into a gentle slope in the correct direction.
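A one-dimensional toy makes the difference tangible. The sketch below applies a single SGD step to a scalar weight with a cliff-sized gradient, once without clipping and once with τ = 5. The function name `sgd_step` and the specific numbers are illustrative assumptions, not values from the paper.

```python
def sgd_step(w, grad, lr, tau=None):
    """One scalar SGD step; if tau is given, clip |grad| to tau first."""
    if tau is not None and abs(grad) > tau:
        grad = tau * grad / abs(grad)   # keep the sign, cap the magnitude
    return w - lr * grad

w, lr = 1.0, 0.01
cliff_grad = 50_000.0                   # a cliff-sized gradient

w_unclipped = sgd_step(w, cliff_grad, lr)
w_clipped = sgd_step(w, cliff_grad, lr, tau=5.0)
print(f"unclipped: {w_unclipped:.2f}")
print(f"clipped:   {w_clipped:.2f}")
```

Both steps move the weight in the same direction; only the unclipped one throws it hundreds of units away from where it started.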
Geometrically, norm clipping projects the gradient onto the ball of radius τ. Any gradient inside the ball passes through unchanged; any gradient outside is pulled back to the ball's surface, preserving direction. This is equivalent to:

$$g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\|g\|}\right)$$
The min(1, ...) ensures that gradients smaller than τ are untouched. Only "abnormally large" gradients (those that would launch the optimizer off a cliff) are affected. In typical training, 95-99% of gradient steps are below the threshold and pass through unmodified. Clipping activates only for the rare, dangerous cliff-edge steps.
Figure: the error surface with a cliff. Without clipping (red), the optimizer is launched off the cliff. With clipping (green), the gradient direction is preserved but the step size is capped, keeping the optimizer near the good region.

```python
import torch
import torch.nn as nn

# The gradient clipping algorithm
def clip_gradient(parameters, max_norm):
    """Clip gradient norm to max_norm, preserving direction."""
    # Materialize first: model.parameters() is a generator and we iterate twice
    params = [p for p in parameters if p.grad is not None]

    # Compute total gradient norm across all parameters
    total_norm = sum(p.grad.data.norm(2).item() ** 2 for p in params) ** 0.5

    # Rescale if necessary
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for p in params:
            p.grad.data *= scale
    return total_norm

# In practice, PyTorch does this in one line:
model = nn.RNN(10, 64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training loop (compute_loss is a placeholder for your own loss computation)
loss = compute_loss()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip!
optimizer.step()
```
The typical threshold τ is between 1 and 10. Pascanu et al. found that τ = 5 works well for most tasks. The exact value isn't critical — any reasonable threshold prevents catastrophic cliff-falls while allowing normal gradient-based learning to proceed.
Does clipping introduce bias? Technically yes — when you clip, you're no longer following the true gradient direction with the true magnitude. But in practice, clipping events are rare (most gradients are below τ), and when they do occur, the true gradient would have been destructive anyway. Clipping replaces a catastrophic step with a merely suboptimal one.
Pascanu et al. showed empirically that clipping has negligible effect on convergence speed when the threshold is chosen reasonably. The model takes slightly more steps to converge (because some large gradients are truncated), but it never diverges. The tradeoff is overwhelmingly positive.
A common default: clip_grad_norm_(model.parameters(), max_norm=1.0) for Transformers and max_norm=5.0 for RNNs/LSTMs. It's cheap (one norm computation per step), safe (prevents divergence), and has almost no downside.

A practical corollary of this paper: always log gradient norms during training. Gradient norm plots reveal training dynamics that the loss curve hides. Spikes in gradient norm (even if clipped) indicate the model is encountering cliffs. If spikes are frequent, the learning rate may be too high or the model architecture may be unstable.

```python
# Monitoring gradient norms during training
import torch

def train_step(model, optimizer, loss_fn, batch, max_norm=5.0):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()

    # clip_grad_norm_ returns the total norm measured BEFORE clipping
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    # Log to your favorite logger
    # wandb.log({"grad_norm": total_norm.item()})

    # If total_norm > max_norm, clipping occurred
    if total_norm > max_norm:
        print(f"Clipped! Norm {total_norm:.1f} > {max_norm}")

    optimizer.step()
    return loss.item(), total_norm.item()
```
There are actually two main approaches to gradient clipping, and the distinction matters for practice.
Clipping by norm is the method described in the previous chapter. You compute the total gradient norm across all parameters and rescale if it exceeds τ:

$$g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\|g\|}\right)$$
This approach preserves the relative magnitudes between different parameter gradients. If the gradient for one weight is 10x larger than another, that ratio is maintained after clipping. The overall gradient vector just gets shorter.
An alternative is to clip each gradient element independently:

$$g_i \leftarrow \max\big(-\tau, \min(\tau, g_i)\big)$$

This restricts each component of the gradient to the range [-τ, τ]. It's simpler, but it changes the direction of the gradient: large components are clipped while small ones are not, distorting the gradient direction.
| Method | Preserves direction? | Effect on large gradients | When to use |
|---|---|---|---|
| Clip by norm | Yes | All components scaled equally | Default choice — used in almost all modern training |
| Clip by value | No | Each component clipped independently | Rarely preferred — use only if you have a specific reason |
| Adaptive clipping | Partially | Threshold adapts based on gradient history | More robust, but adds complexity |
Figure: a 2D gradient vector (arrow). Norm clipping shrinks the vector while preserving direction (green circle = threshold). Value clipping truncates each component independently, changing the direction (red box = threshold).
Pascanu et al. also discussed adaptive clipping, where the threshold adjusts based on the running average of gradient norms. The idea is to clip gradients that are unusually large relative to recent history:
```python
# Adaptive gradient clipping
class AdaptiveClipper:
    def __init__(self, alpha=0.95, multiplier=3.0):
        self.ema = None          # exponential moving average of grad norm
        self.alpha = alpha       # smoothing factor
        self.mult = multiplier   # clip at mult * ema

    def clip(self, parameters):
        # Materialize first: model.parameters() is a generator and we iterate twice
        params = [p for p in parameters if p.grad is not None]
        total_norm = sum(p.grad.data.norm(2).item() ** 2 for p in params) ** 0.5

        # Update running average
        if self.ema is None:
            self.ema = total_norm
        else:
            self.ema = self.alpha * self.ema + (1 - self.alpha) * total_norm

        # Clip if norm exceeds mult * ema
        threshold = self.mult * self.ema
        if total_norm > threshold:
            scale = threshold / total_norm
            for p in params:
                p.grad.data *= scale
        return total_norm
```
Adaptive clipping is more robust because it doesn't require choosing τ in advance. The threshold emerges from the training dynamics themselves. However, plain norm clipping with τ = 1-5 works well enough that most practitioners stick with it.
Beyond norm-vs-value clipping, there's another important design dimension: the scope of the norm computation. Do you clip each parameter's gradient independently, each layer's gradient independently, or compute a single norm across all parameters in the model?
This choice affects gradient direction preservation and computational cost differently:
| Scope | What it clips | Used by |
|---|---|---|
| Global norm | Total norm across ALL parameters | PyTorch clip_grad_norm_, all major LLM codebases |
| Per-parameter | Each parameter's gradient independently | Some TensorFlow implementations |
| Per-layer | Each layer's gradients independently | Some custom implementations |
Global norm clipping is almost universally preferred because it maintains the relative gradient magnitudes across parameters. If one layer's gradient is 10x larger than another's (which is normal), per-parameter clipping would clip them to the same magnitude, distorting the update direction. Global clipping preserves the ratio while limiting the total magnitude.
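The distortion is easy to see on a toy gradient with two "layers" whose norms differ by a factor of ten. This is a plain-Python sketch; the helper names and the threshold of 0.5 are assumptions chosen so that both layers exceed the threshold.

```python
import math

def l2(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))

def clip_global(layers, tau):
    """Rescale all layers by one shared factor based on the total norm."""
    total = math.sqrt(sum(x * x for layer in layers for x in layer))
    scale = min(1.0, tau / total) if total > 0 else 1.0
    return [[x * scale for x in layer] for layer in layers]

def clip_per_parameter(layers, tau):
    """Rescale each layer separately based on its own norm."""
    clipped = []
    for layer in layers:
        n = l2(layer)
        scale = min(1.0, tau / n) if n > 0 else 1.0
        clipped.append([x * scale for x in layer])
    return clipped

layers = [[6.0, 8.0], [0.6, 0.8]]        # norms 10 and 1, ratio 10:1
g = clip_global(layers, tau=0.5)
p = clip_per_parameter(layers, tau=0.5)
print("global ratio:       ", l2(g[0]) / l2(g[1]))
print("per-parameter ratio:", l2(p[0]) / l2(p[1]))
```

Global clipping keeps the 10:1 relationship between the two layers, while per-parameter clipping flattens both to the same norm.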
The computational cost of global norm clipping is negligible. Computing the total gradient norm requires one pass through all parameters to sum the squared norms, then one pass to rescale if needed. For a model with P parameters, this is O(P) — the same cost as a single SGD step. The norm computation can also be overlapped with gradient computation using GPU parallelism, making the wall-clock overhead essentially zero.
This near-zero cost is important: it means there's no reason not to use gradient clipping. It's a pure safety mechanism with no performance penalty. Even if your model never encounters a cliff, clipping costs nothing. And if it does encounter one, clipping saves your entire training run. This asymmetry — zero downside, huge upside — is why gradient clipping is universal in modern practice. It is the seat belt of deep learning: you wear it every time, not because you expect a crash, but because the cost of wearing it is negligible and the cost of not wearing it is catastrophic.
```python
# Global vs per-parameter clipping in PyTorch
import torch

# `model` is assumed to be an nn.Module whose gradients were just computed

# Global norm (recommended)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

# Per-parameter value clipping (less common)
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# Custom per-layer norm clipping (here: one norm per parameter tensor)
for name, param in model.named_parameters():
    if param.grad is not None:
        layer_norm = param.grad.norm()
        if layer_norm > 1.0:
            param.grad.data *= 1.0 / layer_norm
```
Let g = (g1, g2, ..., gn) be the gradient vector. The three clipping methods produce different clipped vectors:
Let's see this concretely. For a 2D gradient g = (10, 1) with threshold τ = 5:
| Method | Clipped gradient | Direction preserved? |
|---|---|---|
| Norm clip | (4.98, 0.50) | Yes — same ratio 10:1 |
| Value clip | (5.00, 1.00) | No — ratio changed to 5:1 |
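The two rows can be reproduced with a few lines of plain Python. The function names `clip_norm` and `clip_value` are illustrative; the inputs are the same g = (10, 1) and τ = 5 as in the example.

```python
import math

def clip_norm(g, tau):
    """Scale the whole vector so its norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

def clip_value(g, tau):
    """Clamp each component to [-tau, tau] independently."""
    return [max(-tau, min(tau, x)) for x in g]

g = [10.0, 1.0]
by_norm = clip_norm(g, 5.0)
by_value = clip_value(g, 5.0)
print("norm clip: ", [round(x, 4) for x in by_norm], "ratio", by_norm[0] / by_norm[1])
print("value clip:", by_value, "ratio", by_value[0] / by_value[1])
```

The scale factor for norm clipping is 5 / sqrt(101), roughly 0.4975, applied to both components, so the 10:1 ratio survives. Value clipping only touches the first component.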
Gradient clipping handles the symptom (exploding gradients), but can we address the cause? Pascanu et al. proposed a regularization term that encourages the Jacobian products to stay well-behaved.
The idea is simple: add a penalty to the loss function that discourages large Jacobian norms. If we penalize $\|\partial h_{t+1} / \partial h_t\|$, we're directly discouraging the condition that leads to gradient explosion.

$$L_{\text{total}} = L + \lambda\, \Omega$$

Where the regularization term Ω penalizes per-step gradient amplification that deviates from 1:

$$\Omega = \sum_t \left( \frac{\left\| \frac{\partial L}{\partial h_{t+1}}\, J_t \right\|}{\left\| \frac{\partial L}{\partial h_{t+1}} \right\|} - 1 \right)^2, \qquad J_t = \frac{\partial h_{t+1}}{\partial h_t}$$

Let's unpack this carefully, because the formula looks intimidating but the idea is simple. The term inside the squared penalty measures how much the gradient magnitude changes from step t+1 to step t. The ratio $\|(\partial L/\partial h_{t+1})\, J_t\| \,/\, \|\partial L/\partial h_{t+1}\|$ tells us the amplification factor at step t: how much the Jacobian at step t stretches or shrinks the gradient.
If this factor is 1, the gradient flows through unchanged (perfect). If it's greater than 1, the gradient is growing (danger of explosion). If it's less than 1, the gradient is shrinking (vanishing). The squared penalty $(\text{ratio} - 1)^2$ penalizes any deviation from 1, whether upward or downward. This drives the entire network toward a regime where gradients flow smoothly at every step.
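To make the amplification factor concrete, the sketch below evaluates it for a toy two-dimensional case in plain Python. The Jacobian is taken to be a scaled identity matrix purely for illustration, and `g_next` stands in for the incoming gradient with respect to the later hidden state; both are assumptions, not quantities from the paper.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def vec_mat(g, J):
    """Row vector g times matrix J (given as a list of rows)."""
    return [sum(g[i] * J[i][j] for i in range(len(g))) for j in range(len(J[0]))]

def amplification_penalty(g_next, J):
    """Return (ratio, (ratio - 1)^2) for one time step."""
    ratio = norm(vec_mat(g_next, J)) / norm(g_next)
    return ratio, (ratio - 1.0) ** 2

g_next = [1.0, 0.0]
for scale in (0.5, 1.0, 2.0):
    J = [[scale, 0.0], [0.0, scale]]     # toy Jacobian: scale * identity
    ratio, penalty = amplification_penalty(g_next, J)
    print(f"scale = {scale}: ratio = {ratio}, penalty = {penalty}")
```

A Jacobian that halves the gradient and one that doubles it are both penalized, while the norm-preserving case costs nothing.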
The regularization strength λ controls the tradeoff: too small and the error surface retains its cliffs; too large and the regularization term dominates the task loss, preventing the model from learning the actual task. A value of λ = 0.01-0.1 works well in practice.
Figure: two RNN error surfaces, without regularization (rough, with cliffs) and with regularization (smooth, no cliffs). The regularization penalty discourages sharp changes in the loss landscape; λ controls its strength.
Differentiating the regularizer exactly is expensive because it requires second-order derivatives (the gradient of the gradient). Pascanu et al. reduce the cost by keeping only the immediate partial derivative of Ω with respect to the recurrent weights, treating the hidden states and the incoming gradient ∂L/∂h_{t+1} as constants. The sketch below takes the simpler but more expensive route of asking autograd for the exact second-order terms:

```python
# Gradient regularization via autograd (exact second-order terms)
import torch

def gradient_regularization(model, loss, hidden_states, lam=0.1, eps=1e-5):
    """Regularize per-step gradient amplification to stay near 1.

    hidden_states must be the tensors that were actually used to compute
    `loss`. create_graph=True keeps the penalty differentiable, so
    backpropagating through it computes second derivatives.
    """
    reg_loss = 0
    # For each consecutive pair of hidden states
    for t in range(len(hidden_states) - 1):
        h_t = hidden_states[t]
        h_t1 = hidden_states[t + 1]

        # Compute dL/dh_{t+1}
        grad_t1 = torch.autograd.grad(
            loss, h_t1, retain_graph=True, create_graph=True
        )[0]
        # Compute dL/dh_t (= dL/dh_{t+1} @ J_t)
        grad_t = torch.autograd.grad(
            loss, h_t, retain_graph=True, create_graph=True
        )[0]

        # Amplification factor
        ratio = grad_t.norm() / (grad_t1.norm() + eps)
        # Penalty: (ratio - 1)^2
        reg_loss += (ratio - 1) ** 2
    return lam * reg_loss
```
In practice, gradient regularization is used less often than gradient clipping because it's more expensive and clipping alone works well for most tasks. But for tasks requiring very long-range dependencies, the combination of clipping + regularization can be more effective than either alone.
A natural question: why not just add standard weight decay (L2 regularization on the weights) to prevent explosion? This is a common misconception, and the answer reveals an important distinction. Weight decay penalizes the magnitude of weights, not the gradient dynamics. The two are related but not the same.
A weight matrix with small Frobenius norm can still produce exploding gradients if the activation derivatives are large (e.g., when all activations are in the linear regime of tanh). Conversely, a large weight matrix can produce vanishing gradients if the activations are heavily saturated (f' near 0). The gradient dynamics depend on the product of the weight matrix and the activation derivatives, not on either factor alone.
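A scalar tanh RNN shows how the same weight can land in either regime depending on where the activations sit. In this sketch the per-step factor is |w| · tanh'(z); the weight 1.5 and the two pre-activation values are illustrative assumptions.

```python
import math

def step_factor(w, z):
    """|d h_{t+1} / d h_t| for a scalar tanh RNN: |w| * tanh'(z)."""
    return abs(w) * (1.0 - math.tanh(z) ** 2)

w = 1.5
linear = step_factor(w, 0.0)       # tanh'(0) = 1: linear regime
saturated = step_factor(w, 3.0)    # tanh'(3) is about 0.01: saturated regime

print(f"linear:    per-step {linear:.4f}, over 50 steps {linear ** 50:.3e}")
print(f"saturated: per-step {saturated:.4f}, over 50 steps {saturated ** 50:.3e}")
```

Weight decay sees the same |w| in both cases, yet one case explodes and the other vanishes, because the activation derivative is what differs.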
Weight decay helps indirectly by preventing weights from growing too large, which keeps the spectral radius in check. But it's a blunt instrument: it penalizes all large weights equally, whether they contribute to gradient explosion or not. Jacobian regularization is more surgical because it targets the gradient flow itself. It measures whether the gradient is growing or shrinking at each step and penalizes deviations of that factor from 1.
| Regularization | What it penalizes | Effect on gradient flow | Cost |
|---|---|---|---|
| Weight decay (L2) | $\lVert W \rVert^2$ | Indirect — small weights may or may not help | Cheap |
| Jacobian regularization | Deviation of the per-step amplification $\lVert J \rVert$ from 1, squared | Direct — targets gradient dynamics exactly | Expensive (needs 2nd-order info) |
| Spectral normalization | ρ(W) > 1 | Prevents explosion, but can cause vanishing | Moderate |
A simpler approach, developed after this paper, is spectral normalization: after each weight update, divide Wh by its spectral radius ρ(Wh) so that ρ = 1. This guarantees that the weight matrix alone doesn't amplify or shrink signals.
```python
# Spectral normalization for recurrent weights
import torch

def spectral_normalize(W):
    """Normalize W so its spectral radius = 1."""
    eigs = torch.linalg.eigvals(W)
    rho = torch.max(torch.abs(eigs)).item()
    if rho > 0:
        W.data /= rho
    return W
```
Vanishing and exploding gradients are two faces of the same coin — the instability of iterated matrix products. But they manifest differently and require different solutions.
| Property | Vanishing | Exploding |
|---|---|---|
| Condition | $\sigma_{\max} \cdot \lVert W_h \rVert < 1$ | $\sigma_{\max} \cdot \rho(W_h) > 1$ |
| Symptom | No learning on long-range dependencies | Loss spikes to NaN, weights explode |
| Detection | Hard — training just converges slowly or to a bad solution | Easy — loss becomes NaN |
| Fix | Architectural: LSTM, GRU, attention, residual connections | Algorithmic: gradient clipping |
| Frequency | Very common (default for tanh/sigmoid) | Less common but catastrophic when it happens |
Figure: the same RNN with different weight scales. Left: small weights cause vanishing gradients (gradient bars disappear). Right: large weights cause exploding gradients (bars shoot off the chart).
Pascanu et al. made a crucial observation: during training, the same model can experience both vanishing and exploding gradients — at different time scales and for different training examples.
A training example with certain input patterns might push the hidden states into the linear regime (f' near 1), causing gradient explosion. The very next training example might push the hidden states into the saturated regime (f' near 0), causing gradient vanishing. The network oscillates between the two failure modes.
This explains why naive fixes don't work:
The correct approach is to use two complementary solutions: gradient clipping for explosions and architectural changes (LSTM/GRU/attention) for vanishing. These are not competing solutions — they address different failure modes and should be used together.
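Pascanu et al. provided a key insight about how vanishing and exploding interact within the same gradient computation. Consider the total gradient of the loss at the final step T:

$$\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_T}\, \frac{\partial h_T}{\partial h_t}\, \frac{\partial^{+} h_t}{\partial W_h}$$

Here $\partial^{+} h_t / \partial W_h$ is the immediate partial derivative, taken with $h_{t-1}$ held constant.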
This is a sum of terms. The terms for recent time steps (t near T) have large Jacobian products. The terms for distant time steps (t near 0) have tiny Jacobian products. The total gradient is dominated by the recent terms — the distant terms contribute negligibly. But if even one recent term has an exploding Jacobian, it dominates the sum and makes the entire gradient enormous.
So within a single gradient computation:
Gradient clipping handles the exploding recent terms. But it does nothing for the vanished distant terms. The network can learn from recent context but remains blind to the distant past. This is why both solutions are needed: clip the explosion (algorithmic fix) AND prevent the vanishing (architectural fix).
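A scalar toy shows both points numerically: recent terms dominate the sum, and clipping leaves the distant terms' share untouched because it rescales every term by the same factor. The sketch assumes a constant per-step factor of 0.9, T = 100, and a split at the midpoint, all chosen for illustration.

```python
def term_magnitudes(a, T):
    """|dh_T/dh_t| for t = 1..T in a scalar RNN with constant per-step factor a."""
    return [abs(a) ** (T - t) for t in range(1, T + 1)]

terms = term_magnitudes(0.9, 100)
total = sum(terms)
distant = sum(terms[:50])    # contributions from t = 1..50
recent = sum(terms[50:])     # contributions from t = 51..100

# Norm clipping multiplies the whole gradient by a single scale factor
tau = 5.0
scale = min(1.0, tau / total)
clipped_distant_share = (distant * scale) / (total * scale)

print(f"recent share:            {recent / total:.4f}")
print(f"distant share:           {distant / total:.4f}")
print(f"distant share (clipped): {clipped_distant_share:.4f}")
```

The distant half of the sequence contributes well under one percent of the gradient, with or without clipping, which is why an architectural fix is still needed for long-range learning.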
The paper also discusses echo state networks (ESN) as a reference point. In an ESN, the recurrent weights Wh are fixed (not trained) — only the output weights are learned. This completely avoids the gradient problem since gradients never flow through Wh. But it limits what the network can represent: the dynamics are random rather than task-specific.
ESNs show that the gradient problem is specifically about learning the recurrence, not about using it. If you're willing to accept random dynamics, recurrence is fine. But learning task-specific dynamics through gradient descent is where the fundamental difficulty lies.
Pascanu et al. highlighted the importance of weight initialization for RNNs. The initial spectral radius of Wh determines whether the network starts in the vanishing, exploding, or near-critical regime:
| Initialization | Initial ρ(Wh) | Behavior |
|---|---|---|
| Gaussian N(0, 0.01) | ≈ 0.1 | Strongly vanishing — very slow learning even for short sequences |
| Xavier / Glorot | ≈ 1.0 | Near critical — works for moderate sequences but unstable for long ones |
| Orthogonal | = 1.0 exactly | Best starting point — all eigenvalues on unit circle |
| Identity + noise | ≈ 1.0 | Good alternative — Wh = I + εN preserves information initially |
```python
# Different initialization strategies for RNN hidden weights
import torch
import torch.nn as nn

n = 128

# 1. Standard Gaussian (too small)
W_gauss = torch.randn(n, n) * 0.01

# 2. Xavier initialization
W_xavier = torch.randn(n, n) / n**0.5

# 3. Orthogonal (recommended for RNNs)
W_orth = torch.empty(n, n)
nn.init.orthogonal_(W_orth)

# 4. Identity + noise (IRNN, Le et al. 2015)
W_irnn = torch.eye(n) + torch.randn(n, n) * 0.001

# Check spectral radii
for name, W in [("Gaussian", W_gauss), ("Xavier", W_xavier),
                ("Orthogonal", W_orth), ("IRNN", W_irnn)]:
    rho = torch.linalg.eigvals(W).abs().max().item()
    print(f"{name:12s}: rho = {rho:.4f}")
```

```python
# RNN training recipe: LSTM + careful initialization + gradient clipping
import torch
import torch.nn as nn

# 1. Use LSTM instead of vanilla RNN (fixes vanishing)
model = nn.LSTM(256, 512, num_layers=2, dropout=0.3)

# 2. Initialize carefully
for name, param in model.named_parameters():
    if 'weight_hh' in name:
        nn.init.orthogonal_(param)  # (semi-)orthogonal recurrent weights
    elif 'bias' in name:
        nn.init.zeros_(param)
        # Set forget gate bias high (remember by default);
        # PyTorch orders the LSTM gates as input, forget, cell, output
        n = param.size(0)
        param.data[n//4:n//2].fill_(1.0)

# 3. Gradient clipping (fixes exploding)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# `dataloader` and `compute_loss` are placeholders for your own data and loss
for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```
Time to put everything together with a simulation: a small RNN trained to memorize a pattern in a length-20 sequence, with and without gradient clipping.
Figure: the top panel shows the loss curve and the bottom panel shows gradient norms at each iteration. Without clipping, the run ends in a catastrophic loss spike; with clipping, the explosions are prevented while learning is preserved.
Based on their experiments, Pascanu et al. recommended this recipe for training RNNs:
This recipe became the standard for RNN training and, with minor modifications, is still used for training Transformers today. The gradient clipping step (3) is particularly universal — you'll find it in the training code of GPT-2, GPT-3, BERT, and essentially every large language model.
When your RNN (or any deep network) isn't training well, the gradient norm plot is your best diagnostic tool. Here's how to read it:
| Pattern in gradient norm plot | Diagnosis | Fix |
|---|---|---|
| Consistently near zero | Vanishing gradients | Use LSTM/GRU, add residual connections, check initialization |
| Occasional massive spikes | Cliffs in error surface | Enable gradient clipping (or lower threshold) |
| Steadily increasing | Gradients slowly exploding | Lower learning rate, check weight initialization |
| Oscillating wildly | Near critical point (η ≈ 1) | Learning rate too high for this regime |
| Stable and gradually decreasing | Healthy training | No changes needed! |
python # Complete training loop with monitoring import torch import torch.nn as nn model = nn.LSTM(256, 512, num_layers=2, dropout=0.3) optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) max_norm = 5.0 grad_history = [] for step in range(10000): optimizer.zero_grad() loss = compute_loss(model, batch) loss.backward() # Record gradient norm BEFORE clipping norm = torch.nn.utils.clip_grad_norm_( model.parameters(), max_norm ) grad_history.append(norm.item()) optimizer.step() # Alert on anomalies if norm > max_norm * 0.9: print(f"Step {step}: near-clip event (norm={norm:.1f})") if step > 100 and norm < 1e-6: print(f"Step {step}: possible vanishing gradient!")
Gradient norm clipping, the technique from this paper, appears in the reported training recipes of many major LLMs:
| Model | Gradient clip norm | Source |
|---|---|---|
| GPT-2 (2019) | 1.0 | OpenAI codebase |
| GPT-3 (2020) | 1.0 | Brown et al. |
| LLaMA (2023) | 1.0 | Touvron et al. |
| Chinchilla (2022) | 1.0 | Hoffmann et al. |
| PaLM (2022) | 1.0 | Chowdhery et al. |
Notice the convergence on max_norm = 1.0 for Transformers (vs 5.0 for RNNs). A plausible explanation is that Transformer gradients are generally better behaved than RNN gradients, so a tighter clip rarely slows training while giving a larger safety margin.
Pascanu et al.'s 2013 paper bridged the gap between Bengio's 1994 theoretical analysis and the practical training of deep sequence models. Gradient clipping became one of the most widely used techniques in all of deep learning.
| Year | Development | Connection to this paper |
|---|---|---|
| 1994 | Bengio: vanishing gradient theorem | The theoretical foundation this paper builds on |
| 1997 | LSTM (Hochreiter & Schmidhuber) | Architectural solution to vanishing; this paper's clipping complements it |
| 2013 | This paper: gradient clipping | First practical solution to exploding gradients |
| 2014 | GRU (Cho et al.) | Simplified LSTM, used with gradient clipping |
| 2017 | Transformer (Vaswani et al.) | Eliminated recurrence but still uses gradient clipping during training |
| 2018 | GPT-1 (Radford et al.) | Transformer + gradient clipping (max_norm = 1.0) |
| 2020 | GPT-3 (Brown et al.) | Same recipe at 175B parameters — clipping still essential |
Pascanu et al. tested on several challenging RNN tasks:
| Task | Without clipping | With clipping |
|---|---|---|
| Temporal order (T=50) | Diverged (31% of runs) | Converged (0% divergence) |
| Addition problem (T=100) | Diverged (47% of runs) | Converged (2% divergence) |
| Multiplication (T=50) | Diverged (62% of runs) | Converged (5% divergence) |
| Penn Treebank LM | Frequent NaN losses | Stable training |
The results are striking: gradient clipping alone reduced training failures from 30-60% to 0-5% of runs. This is a transformation from "RNN training is unreliable" to "RNN training reliably works."
The combination of gradient clipping + gradient regularization performed even better on the most challenging tasks (temporal order at T = 200), but the improvement over clipping alone was modest. This suggests that for most practical purposes, gradient clipping is sufficient — regularization provides marginal additional benefit at significant computational cost.
Perhaps the most impressive result: even with gradient clipping enabled, the training convergence speed was not significantly affected. The clipped steps (when they occurred) still moved the parameters in the correct direction, just with smaller magnitude. The model reached the same final performance, just without the catastrophic failures along the way. In fact, on several tasks, the clipped training achieved slightly better final performance than the unclipped successful runs — likely because even the runs that didn't fully diverge still suffered from occasional large weight perturbations that damaged partially-learned representations.
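That clipped steps keep their direction follows directly from the clipping rule. The derivation below writes the threshold as τ and assumes plain SGD with a learning rate α (the symbol α and the SGD assumption are introduced here for illustration; adaptive optimizers rescale the step further). The worked numbers reuse a spike from a gradient norm of 5 to 50,000 with τ = 5.

```latex
% Clipping rule (for g \neq 0): shrink the gradient only if its norm exceeds \tau
\hat{g} \;=\; g \cdot \min\!\left(1,\ \frac{\tau}{\lVert g \rVert}\right)
\quad\Longrightarrow\quad
\lVert \hat{g} \rVert \le \tau
\quad\text{and}\quad
\frac{\hat{g}}{\lVert \hat{g} \rVert} = \frac{g}{\lVert g \rVert}.

% Worked example: \tau = 5,\ \lVert g \rVert = 50{,}000
\frac{\tau}{\lVert g \rVert} = \frac{5}{50{,}000} = 10^{-4},
\qquad
\lVert \alpha \hat{g} \rVert = 5\,\alpha
\quad\text{instead of}\quad
\lVert \alpha g \rVert = 50{,}000\,\alpha .
```

The scale factor is a positive scalar, so the update still points downhill along the same direction; only its length is capped.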
Gradient clipping handles exploding gradients but does nothing for vanishing gradients. The network still can't learn dependencies longer than ~20-50 steps with a vanilla RNN, even with perfect clipping. For that, you need architectural solutions: LSTM, GRU, attention, or the Transformer.
The Transformer (2017) eventually made the RNN gradient problem largely moot by replacing sequential processing with parallel attention. But understanding why RNNs fail remains crucial — it's the motivation behind every major architectural innovation of the past decade, from LSTMs to Transformers to state-space models.
Moreover, the tools developed in this paper — gradient clipping, gradient norm monitoring, and the understanding of error surface geometry — remain essential even in the Transformer era. The specific failure mode has changed (Transformers don't have cliffs from recurrence), but the general principle holds: always monitor your gradient norms, always have a safety mechanism for extreme gradients, and always understand the geometry of your loss landscape.
Beyond gradient clipping itself, this paper established a methodology for analyzing training dynamics that remains influential. The approach of (1) identifying a failure mode through theory, (2) characterizing its manifestation in practice (the cliff), and (3) providing a simple algorithmic fix (clipping) has been applied to many subsequent training problems:
| Problem | Analysis paper | Simple fix |
|---|---|---|
| Gradient explosion | Pascanu 2013 (this paper) | Gradient clipping |
| Internal covariate shift | Ioffe & Szegedy 2015 | Batch normalization |
| Degradation in deep nets | He et al. 2015 | Residual connections |
| Training instability in GANs | Miyato et al. 2018 | Spectral normalization |
| Loss spikes in LLM training | Various 2022-2023 | Gradient clipping + learning rate cooldown |
The basic idea of gradient clipping has been extended in several ways since 2013:
| Extension | Year | Idea |
|---|---|---|
| Gradient scaling (AMP) | 2017 | Scale loss up to prevent underflow in FP16, then unscale gradients before clipping |
| Gradient accumulation | ~2018 | Accumulate gradients across micro-batches, clip the accumulated gradient |
| AGC (Adaptive Gradient Clipping) | 2021 | Clip based on the ratio of gradient norm to parameter norm (NFNet) |
| Gradient noise injection | 2015 | Add noise after clipping to help escape sharp minima |
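AGC from the NFNet paper (Brock et al., 2021) is particularly interesting. Instead of a fixed threshold, it clips when the gradient is "too large relative to the weight", specifically when ||g|| / ||w|| exceeds a threshold λ. In the paper this ratio is computed unit-wise (per output unit of each layer) rather than over the whole network. Because the test is relative to the weight norm, the clipping adapts to the scale of each layer's weights and reduces the need to tune τ separately for different layers.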
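```python
# Adaptive Gradient Clipping (AGC) from NFNet
import torch

@torch.no_grad()
def agc(parameters, clip_factor=0.01, eps=1e-3):
    """Clip each gradient based on its gradient-to-weight norm ratio.

    Simplified: norms are taken over each whole parameter tensor,
    whereas the NFNet paper computes them unit-wise.
    """
    for p in parameters:
        if p.grad is None:
            continue
        # Floor the weight norm so zero-initialized weights can still be updated
        p_norm = p.norm().clamp(min=eps)
        g_norm = p.grad.norm()
        max_norm = p_norm * clip_factor
        if g_norm > max_norm:
            p.grad.mul_(max_norm / g_norm)
```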
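The gradient accumulation entry in the extensions table says to clip the accumulated gradient, and the order matters. The plain-Python sketch below compares clipping once after averaging the micro-batch gradients with clipping each micro-batch separately. The helper names and the toy gradient values are made up for illustration.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip(vec, max_norm):
    """Rescale vec so its L2 norm is at most max_norm."""
    norm = l2_norm(vec)
    if norm <= max_norm:
        return list(vec)
    return [v * max_norm / norm for v in vec]

def mean_grad(grads):
    """Average a list of equal-length gradient vectors."""
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(grads[0]))]

# Three ordinary micro-batches and one that hits a "cliff".
micro_grads = [[0.3, 0.0], [0.3, 0.0], [0.3, 0.0], [30.0, 40.0]]

# Accumulate first, then clip once: equivalent to clipping one large batch.
accumulated = clip(mean_grad(micro_grads), max_norm=1.0)

# Clipping every micro-batch before averaging gives a different update.
mean_of_clipped = mean_grad([clip(g, 1.0) for g in micro_grads])

print("accumulate then clip:", accumulated, "norm:", l2_norm(accumulated))
print("clip then accumulate:", mean_of_clipped, "norm:", l2_norm(mean_of_clipped))
```

Clipping after accumulation keeps the result independent of how the batch was split into micro-batches, which is presumably why the table describes it that way.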
Looking back, it's remarkable that the solution to gradient explosion, one of the most important problems in deep learning, is so simple. Gradient clipping is literally: "if the gradient is too big, make it smaller." Three lines of code. No new mathematical framework, no complex theory, and little hyperparameter search (a τ between 1 and 10 is usually a workable starting point).
This is a pattern in deep learning: the most impactful techniques are often embarrassingly simple. Dropout (randomly zero out neurons), batch normalization (normalize activations), residual connections (add the input to the output), and gradient clipping (cap the gradient norm) — each is a one-line idea that transformed the field.
As of 2024, gradient clipping remains essential for training large language models. Here is a summary of how it's used in practice at scale:
```python
# Modern LLM training loop (simplified)
# Based on LLaMA / GPT training recipes
import torch
from torch.nn.utils import clip_grad_norm_
from torch.cuda.amp import GradScaler, autocast

# TransformerLM and dataloader are placeholders for your own model class and data pipeline
model = TransformerLM(
    vocab_size=32000, d_model=4096, n_layers=32,
    n_heads=32, d_ff=11008
)  # ~7B parameters (LLaMA-7B config)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4,
    weight_decay=0.1, betas=(0.9, 0.95)
)
scaler = GradScaler()  # for mixed precision

for step, batch in enumerate(dataloader):
    with autocast():
        loss = model(batch)

    # Scale loss for FP16 stability
    scaler.scale(loss).backward()
    # Unscale first so the clip threshold applies to the true gradient norm
    scaler.unscale_(optimizer)

    # Gradient clipping: still essential at 7B scale!
    grad_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```
Notice that even with Transformers (no recurrence!), gradient clipping at max_norm=1.0 is standard. The gradient dynamics are more stable than RNNs, but occasional spikes still occur — especially early in training, during learning rate warmup, or on unusual data batches. The clipping provides a safety net that costs nearly nothing and prevents rare but catastrophic training failures.
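One detail of mixed-precision training is easy to get wrong: gradients must be unscaled before they are clipped. The plain-Python sketch below shows why, using a hypothetical loss scale of 65536 and a toy gradient whose true norm (0.5) is below the threshold and should not be clipped at all. The helper names are illustrative.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip(vec, max_norm):
    """Rescale vec so its L2 norm is at most max_norm."""
    norm = l2_norm(vec)
    if norm <= max_norm:
        return list(vec)
    return [v * max_norm / norm for v in vec]

loss_scale = 65536.0                 # hypothetical FP16 loss scale
true_grad = [0.3, 0.4]               # true norm 0.5, under the threshold
scaled_grad = [g * loss_scale for g in true_grad]  # what backward() produces

# Correct order: unscale, then clip. The gradient passes through untouched.
unscale_then_clip = clip([g / loss_scale for g in scaled_grad], max_norm=1.0)

# Wrong order: clip the scaled gradient, then unscale. Every step gets clipped
# and the effective threshold shrinks to max_norm / loss_scale.
clip_then_unscale = [g / loss_scale for g in clip(scaled_grad, max_norm=1.0)]

print("unscale then clip norm:", l2_norm(unscale_then_clip))
print("clip then unscale norm:", l2_norm(clip_then_unscale))
```

With the wrong order the model would barely move, because the clipped norm ends up tens of thousands of times smaller than intended.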