You have a gradient estimate. Now: how far do you step? From naïve ascent to PPO — every idea that separates a good update from a catastrophic one.
Chapter 11 left us with an estimate of ∇U(θ) — the gradient of expected return with respect to policy parameters. The gradient tells us which direction improves the policy. Now we must decide: how far do we move?
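This sounds like a routine hyperparameter choice. It isn't. Policy gradient optimization has a brutal failure mode that makes step size the most consequential decision in the entire algorithm: the policy generates its own training data, so a single oversized update can wreck the policy, which then collects uninformative trajectories and may never recover.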
This isn't hypothetical. Early RL research was plagued by it. The algorithms in this chapter are engineering responses to this specific failure mode — each one adding a different layer of protection.
In supervised learning, a large gradient usually means you're far from a minimum, so a big step makes sense. In RL, a large gradient often means you're in a region where rewards vary wildly between trajectories — the gradient estimate has high variance. Taking a large step based on a noisy estimate is exactly wrong.
There's a deeper issue. The gradient estimate ∇U(θ) is only valid as a local approximation. The further θ' is from θ, the more the estimate drifts from reality. Crucially, we can't just collect new data to check — each gradient evaluation requires running episodes.
Figure: a 1D utility landscape and gradient ascent with different step sizes. When α is too large, the update overshoots the optimum and lands on a slope pointing away.
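The simplest approach: compute the gradient, step in that direction, repeat:
$$\theta \leftarrow \theta + \alpha \nabla U(\theta)$$
Here α > 0 is the step factor (learning rate). With a decaying step size that satisfies the Robbins-Monro conditions ($\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$), stochastic gradient ascent converges to a local optimum under standard regularity assumptions.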
The actual distance moved in parameter space is α·||∇U(θ)||. This depends on both α and the gradient magnitude. In RL, gradient magnitudes can vary wildly across training: early on when policies are random, gradients are small; later, when policies find high-reward trajectories, gradients can explode. A fixed α that worked fine in epoch 1 may cause catastrophic overshooting in epoch 50.
Two cheap tricks tame large gradients before the update:
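Scaling (norm clipping): If $\lVert \nabla U \rVert_2 > L_{\max}$, rescale so the L2-norm equals $L_{\max}$:
$$\nabla U \leftarrow L_{\max} \, \frac{\nabla U}{\lVert \nabla U \rVert_2}$$
Preserves gradient direction. Only shrinks when too large.
Clipping (per-component): Independently clamp each dimension to a fixed interval $[a, b]$:
$$(\nabla U)_i \leftarrow \min\bigl(\max\bigl((\nabla U)_i,\, a\bigr),\, b\bigr)$$
Simpler, but can change the direction when one component dominates.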
```python
import numpy as np

def gradient_ascent_step(theta, grad_U, alpha, L_max=None):
    """One step of (optionally norm-clipped) gradient ascent."""
    g = np.asarray(grad_U, dtype=float)  # gradient estimate from Ch 11
    if L_max is not None:
        norm = np.linalg.norm(g)
        if norm > L_max:
            g = g * (L_max / norm)  # scale: preserve direction
    return theta + alpha * g
```
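To see why per-component clipping can rotate the gradient while norm scaling cannot, a small NumPy comparison may help. The gradient values and the bound of 1.0 are made up for illustration, and `cosine` is a helper introduced only for this sketch.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g = np.array([10.0, 0.5])   # illustrative gradient: one component dominates
L_max = 1.0                 # illustrative bound

scaled = g * min(1.0, L_max / np.linalg.norm(g))   # norm scaling
clipped = np.clip(g, -L_max, L_max)                # per-component clipping

cos_scaled = cosine(g, scaled)     # stays at 1: direction preserved
cos_clipped = cosine(g, clipped)   # below 1: rotated toward the small component
```

Clipping turns `[10.0, 0.5]` into `[1.0, 0.5]`, so the small component gains relative weight and the update no longer points along the original gradient.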
Plain gradient ascent measures distance in parameter space: ||θ' − θ||. But what we really care about is how much the policy behavior changes. Two policies with nearby parameters can behave very differently; two policies with distant parameters can behave nearly identically. Parameter-space distance is the wrong ruler.
The natural gradient swaps in a better ruler: KL divergence between the action distributions.
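Start simpler. Instead of fixing α, directly constrain how far θ can move in parameter space. Maximize the linear approximation to U(θ) subject to a spherical constraint:
$$\max_{\theta'} \; U(\theta) + \nabla U(\theta)^\top (\theta' - \theta) \quad \text{subject to} \quad \tfrac{1}{2} \lVert \theta' - \theta \rVert_2^2 \le \epsilon$$
This has a clean closed-form solution. The Lagrangian tells us $\theta' = \theta + \lambda \nabla U(\theta)$. A linear objective is maximized on the boundary of the ball, so plugging into the constraint with equality gives $\lambda = \sqrt{2\epsilon / \lVert \nabla U(\theta) \rVert^2}$, so:
$$\theta' = \theta + \sqrt{2\epsilon}\, \frac{\nabla U(\theta)}{\lVert \nabla U(\theta) \rVert}$$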
The step length is always exactly √(2ε), regardless of the gradient magnitude. Only the direction of the gradient matters. This decouples direction from magnitude — a major improvement.
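A minimal sketch of this restricted step, assuming NumPy arrays for θ and the gradient. The function name `restricted_step` and the numbers are illustrative, not from the source.

```python
import numpy as np

def restricted_step(theta, grad_U, eps):
    """Move a fixed distance sqrt(2*eps) along the gradient direction."""
    g = np.asarray(grad_U, dtype=float)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return theta            # zero gradient: no direction to move in
    return theta + np.sqrt(2 * eps) * g / norm

theta = np.zeros(2)
small = restricted_step(theta, [1e-3, 0.0], eps=0.02)   # tiny gradient
large = restricted_step(theta, [1e+3, 0.0], eps=0.02)   # huge gradient
```

Both calls land on the same point, a distance √(2·0.02) = 0.2 from the start, even though the gradient magnitudes differ by six orders of magnitude.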
Suppose your policy has two parameters: θ1 controls a softmax temperature (highly sensitive — tiny changes drastically shift the action distribution) and θ2 controls a bias term (insensitive — large changes barely matter). A spherical constraint in parameter space treats both the same. You'll overstep θ1 and understep θ2.
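Solve the constrained problem with the Fisher metric:
$$\max_{\theta'} \; U(\theta) + \nabla U(\theta)^\top (\theta' - \theta) \quad \text{subject to} \quad \tfrac{1}{2} (\theta' - \theta)^\top F_\theta\, (\theta' - \theta) \le \epsilon$$
The solution: let $u = F_\theta^{-1} \nabla U(\theta)$ be the natural gradient. Then:
$$\theta' = \theta + u \sqrt{\frac{2\epsilon}{\nabla U(\theta)^\top u}}$$
The direction $u = F_\theta^{-1} \nabla U(\theta)$ rescales and rotates the standard gradient to compensate for parameter-space curvature. In parameter directions where the policy is insensitive (small Fisher eigenvalue), u is larger, taking a bigger step. In sensitive directions (large Fisher eigenvalue), u is smaller, being conservative.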
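The effect is easiest to see on a toy problem. The sketch below assumes the standard closed form for the Fisher-constrained step, θ' = θ + u·√(2ε / (∇Uᵀu)) with Fu = ∇U. The diagonal Fisher matrix and the gradient are invented for illustration.

```python
import numpy as np

def natural_gradient_step(theta, grad_U, F, eps):
    """theta' = theta + u * sqrt(2*eps / (grad_U . u)), where F u = grad_U."""
    u = np.linalg.solve(F, grad_U)      # natural gradient, no explicit inverse
    return theta + u * np.sqrt(2 * eps / (grad_U @ u))

# Toy numbers (assumed): parameter 0 is 100x more sensitive than parameter 1.
F = np.diag([100.0, 1.0])
grad_U = np.array([1.0, 1.0])
theta = np.zeros(2)
step = natural_gradient_step(theta, grad_U, F, eps=0.01) - theta
```

Although both gradient components are equal, the step along the sensitive parameter is 100 times smaller, and the step sits exactly on the boundary ½·stepᵀF·step = ε.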
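The Fisher information matrix $F_\theta$ is the expected outer product of the score function (gradient of the log trajectory probability):
$$F_\theta = \mathbb{E}_{\tau \sim p_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)^\top\right]$$
For stochastic policies, environment transition terms drop out because they do not depend on θ (same argument as the likelihood-ratio gradient), so we only need the policy's log-probability gradient summed over the trajectory:
$$\nabla_\theta \log p_\theta(\tau) = \sum_{k} \nabla_\theta \log \pi_\theta\bigl(a^{(k)} \mid s^{(k)}\bigr)$$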
We estimate this from the same rollout trajectories used for the gradient estimate. No extra episodes required.
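As a rough sketch of that estimate, the Fisher matrix can be approximated by averaging outer products of per-trajectory score vectors. The damping term is an assumption added here (a common practical safeguard, not something the text specifies), and the random scores stand in for real rollout data.

```python
import numpy as np

def estimate_fisher(scores, damping=1e-3):
    """Average outer product of per-trajectory score vectors, plus damping.

    scores: [m, n] array; row i is the summed grad-log-prob of trajectory i.
    damping: small multiple of the identity (assumed here) that keeps the
             estimate invertible when there are fewer trajectories than parameters.
    """
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape
    return scores.T @ scores / m + damping * np.eye(n)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 8))   # fake data: 5 trajectories, 8 parameters
F_hat = estimate_fisher(scores)
```

With 5 trajectories and 8 parameters the undamped estimate has rank at most 5, which is why some regularization is usually needed before solving with it.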
Fisher matrix properties:
• Symmetric, positive semi-definite
• Approximates the Hessian of KL divergence
• Invertible when estimated from enough data
• Encodes curvature of the policy manifold
Practical cost:
• $O(n^2)$ storage for n parameters (huge for networks)
• $O(n^3)$ for explicit inversion
• Conjugate gradient: solve Fu = ∇U without forming F
• Diagonal approximation: cheap but ignores correlations
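The conjugate gradient option can be sketched in a few lines. This version only needs a function that returns Fv for a vector v, so the n×n matrix is never built. The score matrix `G`, the damping of 1e-3, and the iteration limits are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(fvp, b, iters=50, tol=1e-10):
    """Solve F x = b using only products fvp(v) = F v (F symmetric positive definite)."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual b - F x, with x = 0
    p = r.copy()                 # current search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Fisher-vector product from score vectors G ([m, n]) without forming F:
# F v = G^T (G v) / m + damping * v
rng = np.random.default_rng(0)
G = rng.normal(size=(200, 10))
fvp = lambda v: G.T @ (G @ v) / len(G) + 1e-3 * v
g = rng.normal(size=10)
u = conjugate_gradient(fvp, g)   # approximate natural gradient
```

Each iteration costs two matrix-vector products with `G`, so memory stays linear in the number of parameters.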
Figure: objective $-(x+1)^2 - 10y^2$. The y-axis is 10× more sensitive. Standard gradient (orange) zigzags; natural gradient (blue) heads straight for the optimum.
Natural gradient gives us the right direction. TRPO adds a guarantee: it searches along that direction for the best step that provably keeps the new policy close to the old one.
Define the surrogate objective, an estimate of U(θ') computed from data collected under θ:
$$L(\theta') = \mathbb{E}_{s,\, a \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}\, Q_\theta(s, a)\right]$$
The ratio $\pi_{\theta'}/\pi_\theta$ is importance weighting: it corrects for the fact that actions were sampled from the old policy, not the new one. $Q_\theta(s,a)$ is the action-value function estimate. This quantity can be evaluated for any θ' without running new episodes.
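A small sketch of how such an importance-weighted estimate might be computed from stored samples. The function name `surrogate` and the three-sample batch are made up for illustration. The inputs are log-probabilities of the sampled actions under each policy.

```python
import numpy as np

def surrogate(logp_new, logp_old, q_values):
    """Importance-weighted surrogate estimate from old-policy samples."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi_new / pi_old
    return float(np.mean(ratio * np.asarray(q_values)))

# Made-up batch: log-probs of the sampled actions and their Q estimates.
logp_old = np.log([0.5, 0.25, 0.25])
q_values = np.array([1.0, -2.0, 0.5])

same = surrogate(logp_old, logp_old, q_values)                    # all ratios are 1
shifted = surrogate(np.log([0.6, 0.2, 0.2]), logp_old, q_values)  # candidate policy
```

The candidate policy puts more weight on the action with positive Q and less on the action with negative Q, so its surrogate value is higher, and no new episodes were needed to find that out.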
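TRPO solves a constrained optimization at each step:
$$\max_{\theta'} \; L(\theta') \quad \text{subject to} \quad \mathbb{E}_{s}\left[D_{\mathrm{KL}}\bigl(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta'}(\cdot \mid s)\bigr)\right] \le \delta$$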
The KL divergence constraint is in policy space, not parameter space. Two policy parameterizations that produce identical distributions are treated identically — TRPO is truly parameterization-invariant.
The line search evaluates L(θ') and the KL for each candidate by computing importance ratios πθ'/πθ on the stored rollout data. No new episodes needed. Typically only 10-20 function evaluations per step.
A natural alternative: add β·KL to the objective as a penalty. The problem: the right β varies wildly across environments and even across training stages of the same environment. TRPO's hard constraint δ has a consistent semantic — "never let the policies diverge more than this" — making it a robust hyperparameter choice.
```python
# Pseudocode: compute_policy_gradient, conjugate_gradient, fisher_vector_product,
# compute_kl, and compute_surrogate are helpers assumed to be defined elsewhere.
import numpy as np

def trpo_step(policy, rollouts, delta=0.01, max_backtracks=10):
    grad = compute_policy_gradient(rollouts)              # Ch 11
    u = conjugate_gradient(fisher_vector_product, grad)   # solve F u = grad
    step_size = np.sqrt(2 * delta / np.dot(u, fisher_vector_product(u)))
    full_step = step_size * u                             # natural gradient step
    theta_old = policy.get_params()
    surrogate_old = compute_surrogate(policy, rollouts)   # baseline at theta_old
    for k in range(max_backtracks):                       # backtracking line search
        theta_new = theta_old + (0.5 ** k) * full_step
        policy.set_params(theta_new)
        kl = compute_kl(policy, rollouts)
        improvement = compute_surrogate(policy, rollouts) - surrogate_old
        if kl <= delta and improvement > 0:
            return True                                   # accept step
    policy.set_params(theta_old)                          # reject all, keep old
    return False
```
TRPO works. But it's complicated: you need a conjugate gradient solver, a Fisher-vector-product computation, and a custom line search. In 2017, Schulman et al. asked: what if you could get most of TRPO's benefits with a single clipped objective that standard optimizers (Adam, SGD) can handle directly?
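The answer: PPO. Define the probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
$r_t = 1$ when $\theta = \theta_{\text{old}}$. $r_t > 1$ means θ now assigns more probability to this action than $\theta_{\text{old}}$ did. $r_t < 1$ means less probability.
The clipped surrogate objective is:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\, \hat{A}_t\right)\right]$$
where $\hat{A}_t$ is the advantage estimate at time t, and ε is typically 0.2.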
The clip function forces rt to stay in [1−ε, 1+ε]. The min takes the pessimistic bound between clipped and unclipped. Let's work through the two cases:
Case 1: At > 0 (good action)
We want to increase π(a|s), so we want rt to grow. But if rt > 1+ε, the clipped version caps at (1+ε)At < rtAt. The min picks the cap. Gradient goes to zero. No benefit to pushing r further right of the boundary.
Case 2: At < 0 (bad action)
We want to decrease π(a|s), so rt should shrink. But if rt < 1−ε, the clipped version stops at (1−ε)At < rtAt: both terms are negative, and the clipped one is more negative. The min picks the clipped term, which does not depend on θ, so the gradient again goes to zero. No benefit to pushing r further left of the boundary.
The orange curve is LCLIP(θ). The blue dashed line is the unclipped surrogate r·A. The teal band is the [1−ε, 1+ε] trust region. The vertical red line shows the current ratio rt — drag it with the slider. Watch how the gradient (slope of orange) goes to zero outside the band for each advantage sign.
With positive advantage: Set A=1.5. Move r from 0 to 2. Notice the orange curve rises linearly until r=1+ε, then goes flat. The optimizer gets no gradient signal to push r above 1+ε. That's the safety mechanism.
With negative advantage: Set A=−1.5. The objective becomes a flipped curve. Now the orange curve rises as r decreases, but flattens once r drops below 1−ε. No benefit to making r very small (overly suppressing the action).
With zero advantage: Set A=0. The objective is flat everywhere — if an action is neither good nor bad, PPO makes no update regardless of r. This is correct.
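The three settings can also be checked numerically. This sketch evaluates the per-sample clipped term on a handful of ratios with ε = 0.2. The function name `clipped_term` and the grid of ratios are illustrative choices.

```python
import numpy as np

def clipped_term(r, A, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return np.minimum(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

r = np.array([0.5, 0.8, 1.0, 1.2, 2.0])
pos = clipped_term(r, A=1.5)    # [0.75, 1.2, 1.5, 1.8, 1.8]: flat above r = 1.2
neg = clipped_term(r, A=-1.5)   # [-1.2, -1.2, -1.5, -1.8, -3.0]: flat below r = 0.8
zero = clipped_term(r, A=0.0)   # all zeros: flat everywhere
```

Note the asymmetry: with positive advantage the objective is still sloped for r < 1−ε, and with negative advantage it is still sloped for r > 1+ε. Those are the regions where the policy has moved in the wrong direction, so the gradient is kept to pull it back.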
```python
import torch

def ppo_loss(log_pi_new, log_pi_old, advantages, eps=0.2):
    """
    log_pi_new: [B] log probabilities under new policy
    log_pi_old: [B] log probabilities under old policy (fixed, no gradient)
    advantages: [B] advantage estimates (e.g. from GAE)
    Returns: scalar loss, the negated PPO clipped objective (to be minimized)
    """
    ratio = (log_pi_new - log_pi_old).exp()             # [B], r_t(theta)
    obj1 = ratio * advantages                           # [B], unclipped surrogate
    obj2 = ratio.clamp(1 - eps, 1 + eps) * advantages   # [B], clipped surrogate
    objective = torch.min(obj1, obj2).mean()            # pessimistic min, batch mean
    return -objective  # negate: PyTorch optimizers minimize, we want to maximize

# Training loop (one iteration). policy, optimizer, states, actions, rewards,
# values, gamma, lam, n_epochs, and compute_advantages are assumed to exist.
# Advantages and log_pi_old belong to the rollout policy: compute them once.
adv = compute_advantages(rewards, values, gamma, lam)
with torch.no_grad():
    log_pi_old = policy.log_prob(states, actions)
for epoch in range(n_epochs):                           # e.g. 10 epochs on same data
    log_pi = policy.log_prob(states, actions)
    loss = ppo_loss(log_pi, log_pi_old, adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Both TRPO and PPO are trying to solve the same problem: take the biggest safe step toward a better policy. They differ dramatically in how they define and enforce safety.
| Property | TRPO | PPO |
|---|---|---|
| Trust region | Hard KL constraint (≤δ) | Soft: flat gradient outside clip region |
| Optimization per step | Constrained optimization + line search | Standard unconstrained gradient steps |
| Fisher matrix | Required (via conjugate gradient) | Not needed |
| Epochs per batch | 1 (one natural gradient step) | 3–10 (multiple SGD passes) |
| Implementation complexity | High (CG, Hessian-vector products) | Low (~10 lines of PyTorch) |
| Memory overhead | High (conjugate gradient buffers) | Low (just old log-probs) |
| Parameterization invariant? | Yes (exact) | Approximately (via clipping) |
| Empirical performance | Excellent | Excellent (comparable) |
| Default choice today? | Rarely (too complex) | Yes (baseline for most RL) |
TRPO's hard KL constraint provides stronger guarantees in some situations. If you need:
• Monotonic improvement guarantees (safety-critical deployments)
• Invariance to policy parameterization (comparing across different network architectures)
• Understanding why a method fails (TRPO's diagnostics are cleaner)
Then TRPO is worth the extra implementation effort. In research, TRPO remains a useful baseline precisely because its guarantees are mathematically precise.
PPO's clipping is an approximation. There are edge cases where:
• The clip threshold ε is wrong for the current training stage (too tight early on, too loose later)
• Multiple epochs corrupt the old log-probs assumption (the importance ratio drifts)
• The advantage estimator has high variance, making the clip region the wrong shape
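Modern practice: monitor the fraction of timesteps where rt falls outside [1−ε, 1+ε] (the "clip fraction"). If it's >20-30%, the policy is moving too far on each batch: reduce the learning rate or the number of epochs. If it's near 0%, the clip never binds: either the learning rate is too small or ε is looser than the updates need.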
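Computing the clip fraction takes only the stored and current log-probabilities. A minimal sketch, with `clip_fraction` as an illustrative name and a made-up batch of four samples:

```python
import numpy as np

def clip_fraction(logp_new, logp_old, eps=0.2):
    """Fraction of samples whose ratio lies outside [1 - eps, 1 + eps]."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(np.abs(ratio - 1.0) > eps))

logp_old = np.log([0.5, 0.5, 0.5, 0.5])
logp_new = np.log([0.5, 0.55, 0.7, 0.3])   # ratios: 1.0, 1.1, 1.4, 0.6
frac = clip_fraction(logp_new, logp_old)   # two of four ratios are outside the band
```

Logging this value once per epoch is usually enough to spot a policy that is drifting too fast.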
Figure: TRPO enforces a KL-ball trust region in policy space (elliptical in parameter space). PPO's clipping creates a box-like region in ratio space whose width is set by ε. These are different geometric objects, and PPO's is an approximation.
The gap between understanding PPO conceptually and getting it to work is filled with engineering decisions that papers often omit. Here's what actually matters.
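The paper's actual objective combines three terms:
$$L_t^{\text{CLIP+VF+S}}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{\text{CLIP}}(\theta) - c_1 L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right]$$
Three components:
LCLIP: The clipped policy loss we derived. Its coefficient is implicitly 1.
LVF: Value function (critic) loss. Typically $(V_\theta(s_t) - V_t^{\text{targ}})^2$. Coefficient c1 ≈ 0.5.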
S[πθ]: Entropy bonus. Encourages exploration. Coefficient c2 ≈ 0.01.
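Putting the three terms together might look like the sketch below. It assumes the sign convention of the PPO paper (maximize LCLIP − c1·LVF + c2·S) and returns the negated value as a loss for a minimizing optimizer. The function name and the sample numbers are illustrative.

```python
import numpy as np

def ppo_total_loss(l_clip, values, value_targets, entropy, c1=0.5, c2=0.01):
    """Loss to minimize: -(L_CLIP - c1 * L_VF + c2 * S)."""
    l_vf = np.mean((np.asarray(values) - np.asarray(value_targets)) ** 2)
    return float(-(l_clip - c1 * l_vf + c2 * np.mean(entropy)))

loss = ppo_total_loss(
    l_clip=0.1,                  # clipped policy objective for the batch
    values=[1.0, 2.0],           # critic predictions
    value_targets=[1.5, 1.0],    # return targets
    entropy=[0.7, 0.6],          # per-state policy entropy
)
```

Here the value error (0.625 before weighting) dominates, which is typical early in training when the critic is still inaccurate.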
Why a shared network? The actor (policy) and critic (value function) share a body with separate heads. They co-train. The critic helps estimate advantages; the actor uses them to improve the policy. Sharing representations is more parameter-efficient.
Why entropy? Flat regions (A≈0) produce no policy gradient. The entropy bonus keeps the policy from collapsing to determinism prematurely, especially early in training.
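The advantage At appears in every term of LCLIP. PPO almost always uses Generalized Advantage Estimation (GAE) from Schulman et al. (2016):
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l}$$
where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error (here $r_t$ is the reward at time t, not the probability ratio). The λ parameter interpolates between: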
• λ=0: pure TD(0) — low variance, high bias
• λ=1: Monte Carlo returns — high variance, low bias
• λ=0.95: the sweet spot in most benchmarks
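The two endpoints can be verified on a tiny episode. This sketch uses the backward recursion A_t = δ_t + γλ·A_{t+1}. The rewards, values, and γ = 0.9 are made-up numbers, and `last_value` is an assumed argument for the value after the final step (0 for a terminal state).

```python
import numpy as np

def gae(rewards, values, gamma, lam, last_value=0.0):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards, values, gamma = [1.0, 0.0, 2.0], [0.5, 0.4, 1.0], 0.9
td_only = gae(rewards, values, gamma, lam=0.0)       # the one-step TD errors
monte_carlo = gae(rewards, values, gamma, lam=1.0)   # discounted returns minus values
```

With λ = 0 each advantage depends only on the critic's next-state estimate. With λ = 1 the critic's intermediate estimates cancel and only the sampled rewards remain.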
These details are rarely highlighted but often make or break PPO:
Observation normalization: Maintain running statistics of observations and normalize them. Many environments have wildly different observation ranges. A neural network trained on unnormalized observations can spend most of training learning the scale, not the task.
Reward normalization: Scale rewards by a running std estimate of the discounted return. Prevents the value function from having to represent astronomically different value scales.
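One way to maintain such running statistics is a batched mean and variance update, sketched below. The class name `RunningNorm` and the synthetic observations are assumptions for illustration. Real implementations often also clip the normalized values to a fixed range.

```python
import numpy as np

class RunningNorm:
    """Running mean/variance over batches (parallel-variance update)."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        batch = np.asarray(batch, dtype=float)
        b_mean, b_var, b_n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        total = self.count + b_n
        delta = b_mean - self.mean
        # Combine the two sums of squared deviations, then renormalize.
        m2 = self.var * self.count + b_var * b_n + delta**2 * self.count * b_n / total
        self.mean = self.mean + delta * b_n / total
        self.var = m2 / total
        self.count = total

    def normalize(self, x):
        return (np.asarray(x, dtype=float) - self.mean) / np.sqrt(self.var + self.eps)

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # synthetic observations
norm = RunningNorm(shape=3)
norm.update(data[:40])
norm.update(data[40:])
scaled = norm.normalize(data)   # roughly zero mean, unit variance per dimension
```

The same class can track the discounted return for reward scaling, in which case only the division by the standard deviation is normally applied.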
```python
import numpy as np

class PPOBuffer:
    """Collect rollouts, compute GAE advantages, normalize."""

    def __init__(self, rewards, values):
        self.rewards = np.asarray(rewards, dtype=float)  # [T] rewards of one episode
        self.values = np.asarray(values, dtype=float)    # [T] critic estimates V(s_t)

    def compute_advantages(self, gamma=0.99, lam=0.95):
        adv = np.zeros_like(self.rewards)
        last_gae = 0.0
        for t in reversed(range(len(self.rewards))):
            if t == len(self.rewards) - 1:
                next_val = 0.0  # assumes the rollout ends at a terminal state
            else:
                next_val = self.values[t + 1]
            delta = self.rewards[t] + gamma * next_val - self.values[t]
            last_gae = delta + gamma * lam * last_gae
            adv[t] = last_gae
        # Normalize advantages (critical!)
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        return adv
```
| Tensor | Shape | Notes |
|---|---|---|
| observations | [T, obs_dim] | T = rollout length (e.g. 2048) |
| actions | [T] or [T, act_dim] | discrete or continuous |
| log_pi_old | [T] | fixed; computed before any epoch |
| advantages | [T] | GAE, then normalized |
| returns | [T] | unnormalized advantages + values; target for critic |
| log_pi_new | [T] | recomputed each epoch as θ changes |
| ratio rt | [T] | exp(log_pi_new - log_pi_old) |
| LCLIP | scalar | mean over T, then backprop |
PPO is widely praised for robustness, but "robust" is relative. Understanding which hyperparameters matter — and why — separates practitioners who tune PPO correctly from those who cargo-cult settings from Atari papers onto robotics tasks.
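ε = 0.2 is the Schulman et al. default. It says: "give the optimizer no incentive to push the policy probability ratio more than 20% away from 1." The ratio can still overshoot, because clipping removes the gradient rather than enforcing a hard bound. In practice: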
ε too small (e.g. 0.05):
Policy changes too slowly. High clip fraction (most updates are clipped). Training takes many more iterations. Can be appropriate for safety-sensitive tasks or fine-tuning.
ε too large (e.g. 0.5):
The trust region is effectively gone. Training becomes as fragile as vanilla gradient ascent. Large policy shifts make old advantage estimates unreliable.
More epochs = more gradient updates from the same data = more sample efficiency. But:
• After each epoch, θ changes, making log_pi_old stale
• The importance ratio rt drifts away from 1
• Eventually the advantage estimates are wrong for the current policy
Empirical finding: 3-10 epochs is a good range. Beyond 10, you're usually hurting more than helping. Monitor the average KL between old and new policy per batch — if it exceeds ~0.01-0.05, reduce epochs.
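A simple way to implement that monitoring is a sample-based KL estimate with an early-stopping check, sketched here. The estimator mean(log π_old − log π_new) over old-policy samples is a standard approximation. The threshold of 0.02 is an assumed value inside the range quoted above.

```python
import numpy as np

def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(old || new) from actions drawn under the old policy."""
    return float(np.mean(np.asarray(logp_old) - np.asarray(logp_new)))

def should_stop(logp_old, logp_new, target_kl=0.02):
    """Stop further epochs on this batch once the estimate exceeds the target."""
    return approx_kl(logp_old, logp_new) > target_kl

logp_old = np.log([0.5, 0.5])
unchanged = should_stop(logp_old, np.log([0.5, 0.5]))    # identical policy
drifted = should_stop(logp_old, np.log([0.25, 0.4]))     # probabilities dropped a lot
```

Calling `should_stop` after each epoch and breaking out of the epoch loop when it fires gives a cheap guard against over-optimizing a single batch.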
The rollout length T determines how many environment steps you collect before each update. Tradeoffs:
| Parameter | Effect when increased | Typical range |
|---|---|---|
| Rollout length T | Better advantage estimates (less bias in GAE), slower updates | 256 – 4096 |
| Minibatch size | More stable gradients, less variance per step | 32 – 512 |
| n_epochs | More data efficiency, higher risk of policy divergence | 3 – 10 |
| γ (discount) | Longer effective horizon, more variance in returns | 0.99 (default) |
| λ (GAE) | Lower bias, higher variance (approaches Monte Carlo as λ → 1) | 0.95 (default) |
| LR (Adam) | Faster learning, higher instability | 3e-4 (default) |
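A subtle detail: some PPO implementations also clip the value function update. A common form of the clipped value loss is:
$$L_t^{\text{VF}} = \max\left[ \bigl(V_\theta(s_t) - V_t^{\text{targ}}\bigr)^2,\; \Bigl(\operatorname{clip}\bigl(V_\theta(s_t),\, V_{\theta_{\text{old}}}(s_t) - \epsilon,\, V_{\theta_{\text{old}}}(s_t) + \epsilon\bigr) - V_t^{\text{targ}}\Bigr)^2 \right]$$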
The idea: prevent the value function from changing too rapidly. In practice, value clipping is often omitted (and Andrychowicz et al. 2021 showed it can sometimes hurt). Use it if you observe value function instability, skip it otherwise.
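For reference, one common form of value clipping takes the larger of the plain squared error and the squared error of a prediction that is only allowed to move ε away from the old value. The sketch below follows that form. The function name and the single-sample example are illustrative.

```python
import numpy as np

def clipped_value_loss(v_new, v_old, targets, eps=0.2):
    """Max of the plain and clipped squared errors, averaged over the batch."""
    v_new, v_old, targets = (np.asarray(a, dtype=float) for a in (v_new, v_old, targets))
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)   # limit change from old value
    plain = (v_new - targets) ** 2
    clipped = (v_clipped - targets) ** 2
    return float(np.mean(np.maximum(plain, clipped)))

# The new prediction hits the target exactly, but it moved 1.0 from the old value.
loss = clipped_value_loss(v_new=[1.0], v_old=[0.0], targets=[1.0])
```

The loss is 0.64 rather than 0 even though the prediction is exact, which shows how the clipped term keeps penalizing large jumps. This is also a plausible reason it can slow learning when the critic starts far from the targets.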
PPO (2017) is now a foundation layer, not an endpoint. The field has built a series of extensions that address its remaining weaknesses.
Instead of clipping, add a KL penalty to the objective. Adaptively adjust the penalty coefficient β based on whether the current KL is above or below a target:
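```python
# Adaptive KL penalty. beta, target_kl, kl_mean, and surrogate (the unclipped
# importance-weighted objective) are assumed to be computed elsewhere.
if kl_mean < target_kl / 1.5:
    beta /= 2   # KL too small: relax penalty
elif kl_mean > target_kl * 1.5:
    beta *= 2   # KL too large: tighten penalty
loss = -surrogate + beta * kl_mean   # penalty replaces the clip
```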
This variant is preferred in some RLHF pipelines because it provides smoother gradients than hard clipping.
Recent work from Yu et al. (2025) identifies a subtle problem: in Group Relative Policy Optimization (GRPO, used in DeepSeek-R1), the entropy can collapse because the upper clip bound caps how much a low-probability token can gain, so exploratory tokens stop receiving gradient while already-likely tokens are barely constrained. DAPO fixes this by:
• Using separate clip thresholds εlow and εhigh for the lower and upper bounds
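• Raising εhigh above εlow ("Clip-Higher") so low-probability tokens have room to grow
• Averaging the policy-gradient loss over tokens rather than over sequences, so long responses are not down-weighted
• Dynamic sampling: dropping prompts whose sampled responses are all correct or all incorrect, since they yield zero advantage, and resampling to fill the batch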
For language model fine-tuning, RLOO simplifies the PPO update: for each prompt, sample K responses, use the mean of the other K−1 as a baseline. This gives an unbiased advantage estimate without a learned value function. Recent work (Ahmadian et al. 2024) shows RLOO matches PPO on RLHF benchmarks with less code and memory.
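The leave-one-out baseline is short enough to write out. This sketch handles one prompt with K sampled responses. The function name and the binary rewards are illustrative.

```python
import numpy as np

def rloo_advantages(rewards):
    """rewards: [K] rewards of K sampled responses to one prompt (K >= 2)."""
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    baseline = (rewards.sum() - rewards) / (k - 1)   # mean of the other K-1 rewards
    return rewards - baseline

adv = rloo_advantages([1.0, 0.0, 0.0, 1.0])   # [2/3, -2/3, -2/3, 2/3]
```

Because each response's baseline excludes its own reward, the baseline is independent of that response, which is what keeps the gradient estimate unbiased.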
TRPO's constrained optimization framework naturally extends to constrained RL: maximize expected return subject to safety constraints (e.g., collision rate ≤ 5%). CPO (Constrained Policy Optimization, Achiam et al. 2017) extends TRPO to handle multiple safety constraints simultaneously using a primal-dual approach. PPO-based safe RL methods use Lagrangian relaxation to add constraint costs to the objective.
A modern pattern: pre-train a policy with behavior cloning (supervised learning on demonstrations), then fine-tune with PPO. The clip mechanism is essential here — it prevents PPO from immediately forgetting the learned behaviors. The initial rt values are all near 1, and the clip keeps updates conservative, allowing the policy to improve while retaining good priors.
Policy gradient optimization doesn't exist in isolation. It's the optimization backbone that every modern RL system plugs into. Here's where Ch 12 sits in the larger picture.
Every method here takes ∇U(θ) as input — the policy gradient estimate. Chapter 11 covers the likelihood-ratio trick, REINFORCE, baseline subtraction, and actor-critic gradient estimators. The quality of ∇U(θ) directly limits what optimization can achieve: a noisy gradient with high variance will cause unstable training regardless of how sophisticated the optimizer is.
Chapter 13 addresses a major weakness: the advantage estimate At used in LCLIP. Using raw Monte Carlo returns has high variance. Actor-critic methods replace Monte Carlo with a learned value function (the critic), dramatically reducing variance. The critic is trained to minimize Bellman error while the actor uses PPO's clipped objective. This is the standard architecture for modern RL.
| Dimension | Policy Gradient (Ch 12) | Value-Based (Ch 7-9) |
|---|---|---|
| What's learned | πθ(a|s) directly | Q(s,a) or V(s) |
| Action spaces | Continuous or discrete | Typically discrete |
| Stochastic policies | Native (just sample) | Requires ε-greedy tricks |
| Sample efficiency | Lower (on-policy) | Higher (off-policy replay) |
| Stability | Better (with TRPO/PPO) | Historically fragile |
| Convergence theory | Local optimum guaranteed | Approximate (with function approx) |
RLHF pipelines use PPO as the optimization step, but with a reward model trained from human preferences instead of an environment reward. The policy is initialized from supervised fine-tuning (SFT), which is imitation learning (Ch 18). The clip mechanism bounds each individual update, while a KL penalty against the SFT policy (typically folded into the reward) keeps the policy from drifting too far from it and acts as a regularizer against reward hacking.
| Paper | What it contributes |
|---|---|
| Schulman et al. 2015 | TRPO — hard constraint + line search |
| Schulman et al. 2017 | PPO — clipped surrogate |
| Schulman et al. 2016 | GAE — generalized advantage estimation |
| Achiam et al. 2017 | CPO — constrained policy optimization |
| Andrychowicz et al. 2021 | What matters in on-policy RL (ablations) |
| Yu et al. 2025 | DAPO — entropy collapse fix for GRPO |
“To understand is to know what to do.”
— Wittgenstein, Zettel