Kochenderfer et al. — Chapter 12

Policy Gradient Optimization

You have a gradient estimate. Now: how far do you step? From naïve ascent to PPO — every idea that separates a good update from a catastrophic one.

Prerequisites: Chapter 11 (policy gradient estimation) + basic gradient descent. That's it.
10 chapters · 5 simulations · 10 quizzes

Chapter 0: Why Step Size Is Everything

Chapter 11 left us with an estimate of ∇U(θ) — the gradient of expected return with respect to policy parameters. The gradient tells us which direction improves the policy. Now we must decide: how far do we move?

This sounds like a routine hyperparameter choice. It isn't. Policy gradient optimization has a brutal failure mode that makes step size the most consequential decision in the entire algorithm.

The catastrophe scenario: You've trained a robot for 10,000 episodes. The policy is finally working. You take one gradient step that's slightly too large. The policy changes enough that the actions it learned to avoid now look attractive again. Performance collapses. You can't recover by taking a small step back — the gradient at the new parameters points somewhere completely different. Months of compute, gone.

This isn't hypothetical. Early RL research was plagued by it. The algorithms in this chapter are engineering responses to this specific failure mode — each one adding a different layer of protection.

Why gradient magnitude can't guide step size

In supervised learning, a large gradient usually means you're far from a minimum, so a big step makes sense. In RL, a large gradient often means you're in a region where rewards vary wildly between trajectories — the gradient estimate has high variance. Taking a large step based on a noisy estimate is exactly wrong.

There's a deeper issue. The gradient estimate ∇U(θ) is only valid as a local approximation. The further θ' is from θ, the more the estimate drifts from reality. Crucially, we can't just collect new data to check — each gradient evaluation requires running episodes.

The Overshoot Problem

A 1D utility landscape (curved surface) and gradient ascent with different step sizes. Watch what happens when α is too large: the update overshoots the optimum and lands on a slope pointing away. Click "Step" to see one gradient update.
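
The overshoot is easy to reproduce numerically. The sketch below assumes a toy concave utility U(θ) = −θ² (an illustrative choice, not the landscape used in the simulation) and runs plain gradient ascent with a small and a large step factor.

```python
def ascend(theta, alpha, steps=20):
    """Plain gradient ascent on the toy utility U(theta) = -theta**2."""
    for _ in range(steps):
        grad = -2.0 * theta              # dU/dtheta for the toy utility
        theta = theta + alpha * grad     # theta <- theta + alpha * grad
    return theta

small = ascend(2.5, alpha=0.15)   # each step multiplies theta by 0.7
large = ascend(2.5, alpha=1.10)   # each step multiplies theta by -1.2
print(abs(small), abs(large))
```

With the small step the iterate contracts toward the optimum at 0; with the large step every update jumps across the optimum and lands farther away than it started.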

Plain Gradient Ascent
θ ← θ + α∇U. Fast but fragile — step size is a gamble.
↓ constrain in parameter space
Natural Gradient
Rotate gradient by F⁻¹ to account for policy geometry. Parameterization-invariant.
↓ constrain in policy space + line search
TRPO
Hard KL constraint. Line search to find the best step within the trust region.
↓ simplify the constraint
PPO (Clipped Surrogate)
Clip the probability ratio. No Fisher matrix. No line search. Plain first-order gradient steps.
What is the fundamental reason step size is so dangerous in policy gradient methods?

Chapter 1: Gradient Ascent

The simplest approach: compute the gradient, step in that direction, repeat.

θ ← θ + α ∇U(θ)

Here α > 0 is the step factor (learning rate). With noisy gradient estimates, a decaying step size that satisfies the Robbins-Monro conditions (∑_k α_k = ∞, ∑_k α_k² < ∞) lets gradient ascent converge to a local optimum under standard regularity conditions.
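
As a rough illustration of the step-size conditions, the sketch below uses α_k = a/(k+1), which sums to infinity while its squares sum to a finite value. The quadratic utility, the noise level, and the constant a are all assumptions chosen for the example.

```python
import numpy as np

def noisy_ascent(theta0, n_steps=5000, a=0.5, noise_std=1.0, seed=0):
    """Gradient ascent on U(theta) = -(theta - 3)**2 with noisy gradients
    and decaying step sizes alpha_k = a / (k + 1)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    for k in range(n_steps):
        grad = -2.0 * (theta - 3.0) + noise_std * rng.normal()  # noisy gradient
        theta = theta + (a / (k + 1)) * grad
    return theta

theta_final = noisy_ascent(theta0=0.0)
print(theta_final)   # close to the optimum at 3 despite the noise
```

The decaying schedule averages the noise away over time, which a fixed α cannot do.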

The step length problem

The actual distance moved in parameter space is α·||∇U(θ)||. This depends on both α and the gradient magnitude. In RL, gradient magnitudes can vary wildly across training: early on when policies are random, gradients are small; later, when policies find high-reward trajectories, gradients can explode. A fixed α that worked fine in epoch 1 may cause catastrophic overshooting in epoch 50.

Local vs global: Gradient ascent finds a local optimum — a point where no infinitesimally small step improves things. Most policy spaces have many local optima (e.g., different walking gaits that all work but are qualitatively different). Which one you find depends heavily on initialization. This is accepted in practice; a good local optimum is often sufficient.

Gradient scaling and clipping

Two cheap tricks tame large gradients before the update:

Scaling (norm clipping): If ||∇U|| > L_max, rescale so the L2 norm equals L_max:

∇ ← L_max · ∇ / ||∇||

Preserves gradient direction. Only shrinks when too large.

Clipping (per-component): Independently clamp each dimension:

∇_i ← clamp(∇_i, −c, c)

Simpler, but can change the direction when one component dominates.

Which to use: Scaling is preferred when the gradient direction is reliable but the magnitude fluctuates. Clipping works when individual parameter gradients are untrustworthy. Both are cheap and widely used as preprocessing before any of the more sophisticated methods below.
python
import numpy as np

def gradient_ascent_step(theta, grad_U, alpha, L_max=None):
    """One step of (optionally norm-clipped) gradient ascent."""
    g = grad_U                              # gradient estimate from Ch 11
    if L_max is not None:
        norm = np.linalg.norm(g)
        if norm > L_max:
            g = g * (L_max / norm)            # scale: preserve direction
    return theta + alpha * g
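
To see how the two tricks differ, the following sketch applies both to a made-up gradient with one dominant component and compares directions with a cosine similarity. The helper names are illustrative, not from the book.

```python
import numpy as np

def scale_gradient(g, l_max):
    """Rescale g so its L2 norm is at most l_max (direction unchanged)."""
    norm = np.linalg.norm(g)
    return g * (l_max / norm) if norm > l_max else g

def clip_gradient(g, c):
    """Clamp every component of g to [-c, c] independently."""
    return np.clip(g, -c, c)

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g = np.array([10.0, 1.0])                 # one dominant component
g_scaled = scale_gradient(g, l_max=2.0)
g_clipped = clip_gradient(g, c=2.0)       # becomes [2, 1]
print(cosine(g, g_scaled), cosine(g, g_clipped))
```

Scaling keeps the cosine at 1, while per-component clipping tilts the update toward the smaller component.
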
You're training a robot with gradient ascent and suddenly its performance collapses after a good run. What is the most likely cause?

Chapter 2: The Natural Gradient

Plain gradient ascent measures distance in parameter space: ||θ' − θ||. But what we really care about is how much the policy behavior changes. Two policies with nearby parameters can behave very differently; two policies with distant parameters can behave nearly identically. Parameter-space distance is the wrong ruler.

The natural gradient swaps in a better ruler: KL divergence between the action distributions.

The restricted gradient: fix the step length

Start simpler. Instead of fixing α, directly constrain how far θ can move in parameter space. Maximize the linear approximation to U(θ) subject to a spherical constraint:

maximize   ∇U(θ)ᵀ(θ' − θ)    s.t.    ½||θ' − θ||² ≤ ε

This has a clean closed-form solution. The Lagrangian tells us θ' = θ + λ∇U. The constraint is tight at the optimum, so ½λ²||∇U||² = ε, which gives λ = √(2ε) / ||∇U||, so:

θ' = θ + √(2ε) · ∇U / ||∇U||

The step length is always exactly √(2ε), regardless of the gradient magnitude. Only the direction of the gradient matters. This decouples direction from magnitude — a major improvement.
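
A minimal sketch of this restricted step, checking that the distance moved does not depend on the gradient's magnitude. The function name and the test vectors are assumptions for illustration.

```python
import numpy as np

def restricted_step(theta, grad, eps):
    """Move a fixed distance sqrt(2 * eps) along the gradient direction."""
    return theta + np.sqrt(2.0 * eps) * grad / np.linalg.norm(grad)

theta = np.array([1.0, -2.0])
eps = 0.02
tiny = restricted_step(theta, np.array([1e-4, 2e-4]), eps)   # very small gradient
huge = restricted_step(theta, np.array([1e4, 2e4]), eps)     # same direction, huge magnitude
print(np.linalg.norm(tiny - theta), np.linalg.norm(huge - theta))
```

Both calls move the parameters by the same distance √(2ε) = 0.2 and end at the same point.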

The problem: parameter space is still the wrong metric

Suppose your policy has two parameters: θ_1 controls a softmax temperature (highly sensitive: tiny changes drastically shift the action distribution) and θ_2 controls a bias term (insensitive: large changes barely matter). A spherical constraint in parameter space treats both the same. You'll overstep θ_1 and understep θ_2.

The natural gradient fix: Replace the spherical constraint with an ellipsoid shaped by the Fisher information matrix F_θ. The ellipsoid is elongated along insensitive parameter directions (allowing larger steps) and compressed along sensitive ones (forcing smaller steps). To first order, this gives the same policy update regardless of how the policy is parameterized.

The natural gradient update

Solve the constrained problem with the Fisher metric:

maximize   ∇U(θ)ᵀ(θ' − θ)    s.t.    ½(θ' − θ)ᵀ F_θ (θ' − θ) ≤ ε

The solution: let u = F_θ⁻¹∇U(θ) be the natural gradient. Then:

θ' = θ + √(2ε / (uᵀ F_θ u)) · u

The direction u = F_θ⁻¹∇U rotates the standard gradient to compensate for parameter-space curvature. In parameter directions where the policy is insensitive (small Fisher eigenvalue), u is larger, taking a bigger step. In sensitive directions (large Fisher eigenvalue), u is smaller, being conservative.
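
The update can be sketched in a few lines for a small problem where the Fisher matrix is available explicitly. The diagonal Fisher matrix below is a made-up example in which the first parameter is 100 times more sensitive than the second.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, eps):
    """theta' = theta + sqrt(2 * eps / (u^T F u)) * u, with u = F^{-1} grad."""
    u = np.linalg.solve(fisher, grad)                 # natural gradient direction
    scale = np.sqrt(2.0 * eps / (u @ fisher @ u))     # lands on the constraint boundary
    return theta + scale * u

fisher = np.diag([100.0, 1.0])    # assumed: parameter 0 is the sensitive one
grad = np.array([1.0, 1.0])       # equal raw gradient in both directions
theta = np.zeros(2)
step = natural_gradient_step(theta, grad, fisher, eps=0.01) - theta
print(step, 0.5 * step @ fisher @ step)
```

Although the raw gradient is the same in both coordinates, the step along the sensitive parameter is 100 times smaller, and the quadratic constraint ½ΔᵀFΔ = ε holds with equality.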

The Fisher information matrix

The Fisher information matrix F_θ is the expected outer product of the score function (gradient of the log trajectory probability):

F_θ = E_τ[ ∇ log p(τ|θ) · ∇ log p(τ|θ)ᵀ ]

For stochastic policies, environment transition terms cancel out (same argument as the likelihood-ratio gradient), so we only need the policy's log-probability gradients along each trajectory. Cross terms between different time steps vanish in expectation, so the per-step outer products can be summed and then averaged over the m sampled trajectories:

F_θ ≈ (1/m) ∑_τ ∑_t ∇ log π_θ(a_t|s_t) ∇ log π_θ(a_t|s_t)ᵀ

We estimate this from the same rollout trajectories used for the gradient estimate. No extra episodes required.
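
As a small sanity check of the outer-product estimate, the sketch below assumes a single-state softmax policy whose parameters are the logits. For that policy the score is e_a − π and the exact Fisher matrix is diag(π) − ππᵀ, so the Monte Carlo estimate can be compared with a closed form.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def estimate_fisher(logits, n_samples=20000, seed=0):
    """Monte Carlo Fisher estimate for a single-state softmax policy."""
    rng = np.random.default_rng(seed)
    probs = softmax(logits)
    actions = rng.choice(len(probs), size=n_samples, p=probs)
    scores = np.eye(len(probs))[actions] - probs    # grad log pi(a) = e_a - pi
    return scores.T @ scores / n_samples            # average outer product

logits = np.array([0.5, 0.0, -1.0])
probs = softmax(logits)
fisher_hat = estimate_fisher(logits)
fisher_exact = np.diag(probs) - np.outer(probs, probs)
print(np.abs(fisher_hat - fisher_exact).max())
```

Note that this particular Fisher matrix is singular, because shifting all logits by a constant leaves the policy unchanged. That is one reason practical implementations often add a small damping term before solving with it.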

Why "natural"? Amari (1998) showed that the natural gradient is the steepest ascent direction in the Riemannian manifold of probability distributions under the Fisher metric. It's the gradient that the geometry of probability space wants you to follow. Standard gradient ascent ignores this geometry; the natural gradient embraces it.

Fisher matrix properties:

• Symmetric, positive semi-definite
• Approximates the Hessian of KL divergence
• Invertible when estimated from enough data
• Encodes curvature of the policy manifold

Practical cost:

• O(n²) storage for n parameters (huge for networks)
• O(n³) for explicit inversion
• Conjugate gradient: solve Fu = ∇U without forming F
• Diagonal approximation: cheap but ignores correlations
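
The conjugate gradient option can be sketched as follows. The solver only ever calls a function that returns Fv, so F never has to be inverted. Here a small random symmetric positive definite matrix stands in for the Fisher matrix, which is an assumption for the demo; in practice the product would be computed from rollouts.

```python
import numpy as np

def conjugate_gradient(fvp, b, n_iters=50, tol=1e-10):
    """Solve F x = b using only products fvp(v) = F v (F symmetric positive definite)."""
    x = np.zeros_like(b)
    r = b.copy()            # residual b - F x, with x starting at zero
    p = r.copy()            # current search direction
    rs_old = r @ r
    for _ in range(n_iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:    # squared residual norm small enough
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
F = A @ A.T + 0.1 * np.eye(5)     # made-up SPD stand-in for the Fisher matrix
g = rng.normal(size=5)
u = conjugate_gradient(lambda v: F @ v, g)
print(np.linalg.norm(F @ u - g))
```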

Natural vs Standard Gradient on a Curved Surface

Objective: −(x+1)² − 10y². The y-axis is 10× more sensitive. Standard gradient (orange) zigzags; natural gradient (blue) heads straight for the optimum. Click Step to iterate both.

Your policy has one parameter that's very sensitive (small change → big behavioral change) and one that's insensitive. What does the natural gradient do differently from the standard gradient?

Chapter 3: Trust Region Policy Optimization

Natural gradient gives us the right direction. TRPO adds a guarantee: it searches along that direction for the best step that provably keeps the new policy close to the old one.

Define the surrogate objective — an estimate of U(θ') computed from data collected under θ:

L(θ') = E_{s~b, a~π_θ}[ (π_θ'(a|s) / π_θ(a|s)) · Q_θ(s,a) ]

The ratio π_θ'/π_θ is importance weighting: it corrects for the fact that actions were sampled from the old policy, not the new one. Q_θ(s,a) is the action-value function estimate. This quantity can be evaluated for any θ' without running new episodes.

Why importance weighting works: If π_θ' assigns high probability to actions that π_θ also liked, the ratio is ~1 and we trust the Q estimate. If π_θ' drastically shifts to actions π_θ almost never took, the Q estimate for those actions is unreliable. The KL constraint prevents this: it keeps the distributions similar enough that importance weighting remains valid.
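
A minimal sketch of the surrogate for a single state with three discrete actions. The two policies and the Q values are hypothetical numbers. Because actions are drawn under the old policy and reweighted, the estimate should approach the expected Q value under the new policy.

```python
import numpy as np

def surrogate(pi_new, pi_old, q_values, actions):
    """Importance-weighted estimate of L(theta') from actions sampled under pi_old."""
    ratios = pi_new[actions] / pi_old[actions]
    return float(np.mean(ratios * q_values[actions]))

pi_old = np.array([0.5, 0.3, 0.2])      # hypothetical old policy
pi_new = np.array([0.4, 0.4, 0.2])      # hypothetical candidate policy
q_values = np.array([1.0, 2.0, 0.0])    # hypothetical Q estimates

rng = np.random.default_rng(0)
actions = rng.choice(3, size=50000, p=pi_old)    # data collected once, under pi_old
estimate = surrogate(pi_new, pi_old, q_values, actions)
exact = float(pi_new @ q_values)                 # expected Q under pi_new
print(estimate, exact)
```

No new samples are drawn for the candidate policy: only the ratios change.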

The optimization problem

TRPO solves a constrained optimization at each step:

maximize L(θ')    subject to    E_s[ D_KL(π_θ(·|s) || π_θ'(·|s)) ] ≤ δ

The KL divergence constraint is in policy space, not parameter space. Two policy parameterizations that produce identical distributions are treated identically — TRPO is truly parameterization-invariant.
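
For discrete actions the constraint is cheap to evaluate. The sketch below assumes the action distributions at the sampled states are stored as rows of a matrix with strictly positive entries; the example distributions and the value of δ are made up.

```python
import numpy as np

def mean_kl(p_old, p_new):
    """Average KL(p_old || p_new) over states; each row is an action distribution."""
    return float(np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1)))

p_old = np.array([[0.5, 0.5], [0.9, 0.1]])
p_close = np.array([[0.52, 0.48], [0.88, 0.12]])   # small behavioral change
p_far = np.array([[0.1, 0.9], [0.5, 0.5]])         # large behavioral change
delta = 0.01

print(mean_kl(p_old, p_close) <= delta)   # inside the trust region
print(mean_kl(p_old, p_far) <= delta)     # outside the trust region
```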

The algorithm: conjugate gradient + line search

1. Compute gradient ∇L(θ)
Standard policy gradient from rollouts.
2. Solve Fu = ∇L with conjugate gradient
Avoids forming F explicitly. Requires only matrix-vector products Fv, computable from rollouts.
3. Compute candidate: θ_candidate = θ + √(2δ / (uᵀFu)) · u
Natural gradient direction, scaled to hit the boundary of the quadratic approximation to the KL constraint.
4. Backtracking line search
With Δ = θ_candidate − θ and a shrink factor β ∈ (0, 1), try θ + Δ, then θ + βΔ, then θ + β²Δ, … until KL ≤ δ and L improves.

The line search evaluates L(θ') and the KL for each candidate by computing importance ratios π_θ'/π_θ on the stored rollout data. No new episodes needed. Typically only 10-20 function evaluations per step.

Why a hard constraint, not a penalty?

A natural alternative: add β·KL to the objective as a penalty. The problem: the right β varies wildly across environments and even across training stages of the same environment. TRPO's hard constraint δ has a consistent semantic — "never let the policies diverge more than this" — making it a robust hyperparameter choice.

TRPO's guarantees: Under regularity conditions, each TRPO step monotonically improves a lower bound on U(θ'). The bound tightens as KL shrinks. This is the closest thing to a guaranteed-safe update in RL — but it comes with high implementation complexity.
python (pseudocode)
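import numpy as np

# compute_policy_gradient, fisher_vector_product, conjugate_gradient,
# compute_kl and compute_surrogate are placeholders for rollout-based helpers.
def trpo_step(policy, rollouts, delta=0.01, max_backtracks=10, beta=0.5):
    grad = compute_policy_gradient(policy, rollouts)       # Ch 11

    def fvp(v):                                            # v -> F v, no explicit F
        return fisher_vector_product(policy, rollouts, v)

    u = conjugate_gradient(fvp, grad)                      # solve F u = grad
    step_size = np.sqrt(2 * delta / np.dot(u, fvp(u)))
    full_step = step_size * u                              # scaled natural gradient step

    theta_old = policy.get_params()
    surrogate_old = compute_surrogate(policy, rollouts)    # L evaluated at theta_old
    for k in range(max_backtracks):                        # backtracking line search
        policy.set_params(theta_old + (beta ** k) * full_step)
        kl = compute_kl(policy, rollouts)                  # KL(old || new) on stored data
        improvement = compute_surrogate(policy, rollouts) - surrogate_old
        if kl <= delta and improvement > 0:
            return True                                    # accept step
    policy.set_params(theta_old)                           # reject all, keep old
    return False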
TRPO's line search evaluates L(θ') at 10 candidate points. How many additional rollout episodes does this require?

Chapter 4: Proximal Policy Optimization

TRPO works. But it's complicated: you need a conjugate gradient solver, a Fisher-vector-product computation, and a custom line search. In 2017, Schulman et al. asked: what if you could get most of TRPO's benefits with a single clipped objective that standard optimizers (Adam, SGD) can handle directly?

The answer: PPO. Define the probability ratio:

r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)

r_t = 1 when θ = θ_old. r_t > 1 means θ now assigns more probability to this action than θ_old did. r_t < 1 means less probability.

The clipped surrogate objective is:

L^CLIP(θ) = E_t[ min( r_t·A_t,   clip(r_t, 1−ε, 1+ε)·A_t ) ]

where A_t is the advantage estimate at time t, and ε is typically 0.2.

What the clipping actually does

The clip function clamps the ratio used in the second term to [1−ε, 1+ε]; r_t itself is not constrained. The min then takes the more pessimistic of the clipped and unclipped terms. Let's work through the two cases:

Case 1: A_t > 0 (good action)

We want to increase π(a|s), so we want r_t to grow. But if r_t > 1+ε, the clipped term caps at (1+ε)A_t < r_t A_t. The min picks the cap, which does not depend on θ, so the gradient is zero. There is no benefit to pushing r_t further past the boundary.

Case 2: A_t < 0 (bad action)

We want to decrease π(a|s), so r_t should shrink. But if r_t < 1−ε, the clipped term is (1−ε)A_t, which is more negative than r_t A_t because A_t < 0. The min picks the clipped term again, and the gradient again goes to zero.

In the opposite situations (A_t > 0 with r_t < 1−ε, or A_t < 0 with r_t > 1+ε) the min picks the unclipped term, so the gradient stays active and can pull the ratio back toward 1.

The key insight: PPO creates a flat region in the objective beyond the clip boundary on the side where the update has already moved in the favorable direction. In that flat region, the gradient is exactly zero, so the optimizer has no incentive to push the ratio further outside [1−ε, 1+ε]. This acts like a soft constraint, without any constrained optimization machinery. L^CLIP is a pessimistic lower bound on the unclipped surrogate.
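
The case analysis can be checked numerically. The sketch below evaluates the per-sample objective and a finite-difference slope with respect to the ratio for all four combinations of advantage sign and clip side. The advantage values ±2 and the helper names are chosen only for illustration.

```python
import numpy as np

def clipped_objective(r, adv, eps=0.2):
    """Per-sample PPO objective min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return np.minimum(r * adv, np.clip(r, 1 - eps, 1 + eps) * adv)

def objective_slope(r, adv, eps=0.2, h=1e-6):
    """Central finite-difference slope of the objective with respect to r."""
    return (clipped_objective(r + h, adv, eps) - clipped_objective(r - h, adv, eps)) / (2 * h)

for adv, r in [(2.0, 1.5), (2.0, 0.5), (-2.0, 0.5), (-2.0, 1.5)]:
    print(f"A={adv:+.1f} r={r:.1f} "
          f"objective={clipped_objective(r, adv):+.2f} slope={objective_slope(r, adv):+.2f}")
```

The slope is zero for (A > 0, r > 1+ε) and (A < 0, r < 1−ε), and equal to A in the other two cases, where the gradient remains active.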

Interactive demo: PPO clipped surrogate

The orange curve is L^CLIP(θ). The blue dashed line is the unclipped surrogate r·A. The teal band is the [1−ε, 1+ε] trust region. The vertical red line shows the current ratio r_t, set with the slider. Watch how the gradient (slope of orange) goes to zero outside the band for each advantage sign. Controls and defaults: ε (clip) = 0.20, advantage A_t = +1.5, current ratio r_t = 1.00.

What to observe in the demo

With positive advantage: Set A=1.5. Move r from 0 to 2. Notice the orange curve rises linearly until r=1+ε, then goes flat. The optimizer gets no gradient signal to push r above 1+ε. That's the safety mechanism.

With negative advantage: Set A=−1.5. The objective becomes a flipped curve. Now the orange curve rises as r decreases, but flattens once r drops below 1−ε. No benefit to making r very small (overly suppressing the action).

With zero advantage: Set A=0. The objective is flat everywhere — if an action is neither good nor bad, PPO makes no update regardless of r. This is correct.

PPO in practice: Schulman et al. (2017) found ε=0.2 works well across diverse tasks (MuJoCo locomotion, Atari, robotics). The surrogate LCLIP is maximized for multiple epochs (3-10) on the same batch of rollouts, then new rollouts are collected. This "multiple SGD epochs per batch" is what makes PPO sample-efficient despite having no explicit Fisher matrix computation.
python
import torch

def ppo_loss(log_pi_new, log_pi_old, advantages, eps=0.2):
    """
    log_pi_new: [B] log probabilities under new policy
    log_pi_old: [B] log probabilities under old policy (fixed)
    advantages: [B] advantage estimates (e.g. from GAE)
    Returns: scalar PPO clipped objective (to be maximized)
    """
    ratio = (log_pi_new - log_pi_old).exp()  # [B] — r_t(theta)
    # Unclipped surrogate
    obj1 = ratio * advantages                  # [B]
    # Clipped surrogate
    obj2 = ratio.clamp(1 - eps, 1 + eps) * advantages  # [B]
    # Pessimistic min, then mean over batch
    loss = torch.min(obj1, obj2).mean()        # scalar
    return -loss  # negate: PyTorch minimizes, we want to maximize

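# Training loop (one iteration). policy, optimizer, states, actions, rewards,
# values, gamma, lam, n_epochs and compute_advantages come from the training setup.
with torch.no_grad():
    log_pi_old = policy.log_prob(states, actions)       # fixed across all epochs
adv = compute_advantages(rewards, values, gamma, lam)   # once per batch, not per epoch
for epoch in range(n_epochs):             # e.g. 10 epochs on same data
    log_pi = policy.log_prob(states, actions)
    loss = ppo_loss(log_pi, log_pi_old, adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()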
With advantage At = +2 and ε = 0.2, what happens to the PPO gradient when the current ratio rt = 1.5?

Chapter 5: TRPO vs PPO

Both TRPO and PPO are trying to solve the same problem: take the biggest safe step toward a better policy. They differ dramatically in how they define and enforce safety.

| Property | TRPO | PPO |
| --- | --- | --- |
| Trust region | Hard KL constraint (≤ δ) | Soft: flat gradient outside clip region |
| Optimization per step | Constrained optimization + line search | Standard unconstrained gradient steps |
| Fisher matrix | Required (via conjugate gradient) | Not needed |
| Epochs per batch | 1 (one natural gradient step) | 3–10 (multiple SGD passes) |
| Implementation complexity | High (CG, Hessian-vector products) | Low (~10 lines of PyTorch) |
| Memory overhead | High (conjugate gradient buffers) | Low (just old log-probs) |
| Parameterization invariant? | Yes (exact) | Approximately (via clipping) |
| Empirical performance | Excellent | Excellent (comparable) |
| Default choice today? | Rarely (too complex) | Yes (baseline for most RL) |

Why PPO won: Schulman et al. showed that PPO matches or exceeds TRPO on standard benchmarks (Atari, MuJoCo), at a fraction of the implementation cost. When two methods give the same answer, engineers choose the simpler one. The Fisher matrix and conjugate gradient that seemed essential for safety turned out to be replaceable by a five-line clipping operation.

When TRPO is still worth it

TRPO's hard KL constraint provides stronger guarantees in some situations. If you need:

• Monotonic improvement guarantees (safety-critical deployments)
• Invariance to policy parameterization (comparing across different network architectures)
• Understanding why a method fails (TRPO's diagnostics are cleaner)

Then TRPO is worth the extra implementation effort. In research, TRPO remains a useful baseline precisely because its guarantees are mathematically precise.

When PPO clips too aggressively

PPO's clipping is an approximation. There are edge cases where:

• The clip threshold ε is wrong for the current training stage (too tight early on, too loose later)
• Multiple epochs corrupt the old log-probs assumption (the importance ratio drifts)
• The advantage estimator has high variance, making the clip region the wrong shape

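Modern practice: monitor the fraction of timesteps where r_t is clipped (the "clip fraction"). If it's >20-30%, the policy is moving too far per batch: reduce the learning rate or the number of epochs. If it's near 0%, the clip is rarely active, so the updates may be too timid or ε too loose to have any effect.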
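
One common way to compute this diagnostic is sketched below: count the samples whose ratio lies outside [1−ε, 1+ε]. The log-probabilities are made-up values, and the function name is an assumption.

```python
import numpy as np

def clip_fraction(log_pi_new, log_pi_old, eps=0.2):
    """Fraction of samples whose ratio r_t lies outside [1 - eps, 1 + eps]."""
    ratio = np.exp(log_pi_new - log_pi_old)
    return float(np.mean(np.abs(ratio - 1.0) > eps))

log_pi_old = np.log(np.array([0.5, 0.2, 0.1, 0.4]))
log_pi_new = np.log(np.array([0.55, 0.3, 0.1, 0.2]))   # ratios 1.1, 1.5, 1.0, 0.5
frac = clip_fraction(log_pi_new, log_pi_old)
print(frac)
```

Here two of the four ratios (1.5 and 0.5) fall outside the band, so the clip fraction is 0.5.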

Conceptual Comparison: Trust Region Shape

TRPO enforces a KL-ball trust region in policy space (elliptical in parameter space). PPO's clipping creates a box-like region in ratio space. These are different geometric objects — PPO's is an approximation. The slider shows how ε changes PPO's clip box.

PPO runs 10 gradient epochs on the same batch of rollouts. What assumption does this violate and what problem can it cause?

Chapter 6: Implementation Details

The gap between understanding PPO conceptually and getting it to work is filled with engineering decisions that papers often omit. Here's what actually matters.

The full PPO objective

The paper's actual objective combines three terms:

L(θ) = E_t[ L^CLIP_t(θ) − c_1 L^VF_t(θ) + c_2 S[π_θ](s_t) ]

Three components:

L^CLIP: The clipped policy objective we derived. Its coefficient is implicitly 1.

L^VF: Value function (critic) loss. Typically (V_θ(s_t) − V_t^target)². Coefficient c_1 ≈ 0.5.

S[π_θ]: Entropy bonus. Encourages exploration. Coefficient c_2 ≈ 0.01.

Why a shared network? The actor (policy) and critic (value function) share a body with separate heads. They co-train. The critic helps estimate advantages; the actor uses them to improve the policy. Sharing representations is more parameter-efficient.

Why entropy? Flat regions (A≈0) produce no policy gradient. The entropy bonus keeps the policy from collapsing to determinism prematurely, especially early in training.
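
The three terms could be combined as in the sketch below, which assumes a discrete action space and takes the per-step clipped surrogate as given. Array names and shapes are assumptions for illustration.

```python
import numpy as np

def ppo_total_objective(l_clip, values, value_targets, action_probs, c1=0.5, c2=0.01):
    """Policy term minus value loss plus entropy bonus, averaged over time (to maximize).

    l_clip:        [T] per-step clipped surrogate values
    values:        [T] critic predictions
    value_targets: [T] regression targets for the critic
    action_probs:  [T, n_actions] policy distribution at each visited state
    """
    value_loss = (values - value_targets) ** 2
    entropy = -np.sum(action_probs * np.log(action_probs), axis=1)
    return float(np.mean(l_clip - c1 * value_loss + c2 * entropy))

total = ppo_total_objective(
    l_clip=np.array([1.0, 1.0]),
    values=np.array([0.0, 2.0]),
    value_targets=np.array([1.0, 2.0]),
    action_probs=np.full((2, 2), 0.5),     # uniform policy, entropy ln 2 per step
)
print(total)
```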

Advantage estimation: GAE

The advantage At appears in every term of LCLIP. PPO almost always uses Generalized Advantage Estimation (GAE) from Schulman et al. (2016):

A_t^GAE(λ) = ∑_{l=0}^{∞} (γλ)^l δ_{t+l}

where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD error (here r_t is the reward at time t, not the probability ratio). The λ parameter interpolates between:

• λ=0: pure TD(0) — low variance, high bias
• λ=1: Monte Carlo returns — high variance, low bias
• λ=0.95: the sweet spot in most benchmarks
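
The two limits can be verified on a tiny example. The sketch assumes a single three-step episode that ends in a terminal state, with made-up rewards and value estimates.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """GAE for one episode ending at a terminal state (value 0 after the last step)."""
    next_values = np.append(values[1:], 0.0)
    deltas = rewards + gamma * next_values - values     # TD errors
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running     # discounted sum of TD errors
        adv[t] = running
    return adv, deltas

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 1.0, 1.5])
gamma = 0.9
adv_td, deltas = gae(rewards, values, gamma, lam=0.0)
adv_mc, _ = gae(rewards, values, gamma, lam=1.0)
returns = np.array([1.0 + 0.9 * 0.0 + 0.81 * 2.0, 0.0 + 0.9 * 2.0, 2.0])  # discounted returns
print(adv_td, adv_mc)
```

With λ = 0 the advantage is exactly the one-step TD error; with λ = 1 it equals the Monte Carlo return minus the value estimate.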

Normalization: the hidden magic

These details are rarely highlighted but often make or break PPO:

Advantage normalization: Before computing LCLIP, subtract the batch mean and divide by the batch std: Anorm = (A − mean(A)) / (std(A) + 1e-8). This keeps gradient scales consistent across environments with different reward magnitudes. Without this, a task with large rewards (e.g. Atari scores in thousands) produces huge gradients vs. a task with unit rewards.

Observation normalization: Maintain running statistics of observations and normalize them. Many environments have wildly different observation ranges. A neural network trained on unnormalized observations can spend most of training learning the scale, not the task.

Reward normalization: Scale rewards by a running std estimate of the discounted return. Prevents the value function from having to represent astronomically different value scales.
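
Running statistics for observation normalization can be maintained with a batched Welford-style update, sketched below. The class name and interface are assumptions, not taken from any particular library.

```python
import numpy as np

class RunningNorm:
    """Running mean and variance for observation normalization."""
    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)      # running sum of squared deviations from the mean
        self.count = 0

    def update(self, batch):
        batch = np.atleast_2d(batch)
        n = batch.shape[0]
        b_mean = batch.mean(axis=0)
        b_m2 = ((batch - b_mean) ** 2).sum(axis=0)
        delta = b_mean - self.mean
        total = self.count + n
        # Merge the batch statistics into the running statistics
        self.mean = self.mean + delta * n / total
        self.m2 = self.m2 + b_m2 + delta ** 2 * self.count * n / total
        self.count = total

    def normalize(self, obs, eps=1e-8):
        std = np.sqrt(self.m2 / max(self.count, 1))
        return (obs - self.mean) / (std + eps)

rng = np.random.default_rng(0)
data = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(1000, 2))  # very different scales
norm = RunningNorm(2)
for chunk in np.array_split(data, 7):      # feed observations in batches
    norm.update(chunk)
z = norm.normalize(data)
print(z.mean(axis=0), z.std(axis=0))
```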

python
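import numpy as np

class PPOBuffer:
    """Collect rollouts, compute GAE advantages, normalize."""
    def __init__(self, rewards, values):
        # rewards[t] and critic values V(s_t) for one episode ending in a terminal state
        self.rewards = np.asarray(rewards, dtype=np.float64)
        self.values = np.asarray(values, dtype=np.float64)

    def compute_advantages(self, gamma=0.99, lam=0.95):
        adv = np.zeros_like(self.rewards)
        last_gae = 0.0
        for t in reversed(range(len(self.rewards))):
            if t == len(self.rewards) - 1:
                next_val = 0.0  # terminal: no value beyond the last step
            else:
                next_val = self.values[t + 1]
            delta = self.rewards[t] + gamma * next_val - self.values[t]  # TD error
            last_gae = delta + gamma * lam * last_gae
            adv[t] = last_gae
        # Normalize advantages (critical!)
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        return adv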

Data flow: shapes and types

| Tensor | Shape | Notes |
| --- | --- | --- |
| observations | [T, obs_dim] | T = rollout length (e.g. 2048) |
| actions | [T] or [T, act_dim] | discrete or continuous |
| log_pi_old | [T] | fixed; computed before any epoch |
| advantages | [T] | GAE, then normalized |
| returns | [T] | unnormalized advantages + values; target for critic |
| log_pi_new | [T] | recomputed each epoch as θ changes |
| ratio r_t | [T] | exp(log_pi_new - log_pi_old) |
| L^CLIP | scalar | mean over T, then backprop |

Why must log_pi_old be computed before any gradient epoch and kept fixed throughout all epochs?

Chapter 7: Hyperparameter Sensitivity

PPO is widely praised for robustness, but "robust" is relative. Understanding which hyperparameters matter — and why — separates practitioners who tune PPO correctly from those who cargo-cult settings from Atari papers onto robotics tasks.

The clip threshold ε

ε = 0.2 is the Schulman et al. default. It says: "stop rewarding changes once the policy probability ratio has moved more than 20% away from 1." In practice:

ε too small (e.g. 0.05):

Policy changes too slowly. High clip fraction (most updates are clipped). Training takes many more iterations. Can be appropriate for safety-sensitive tasks or fine-tuning.

ε too large (e.g. 0.5):

The trust region is effectively gone. Training becomes as fragile as vanilla gradient ascent. Large policy shifts make old advantage estimates unreliable.

Adaptive clipping: Some implementations decay ε over training (start large for exploration, shrink for stability at convergence). Others use KL early stopping: if the mean KL divergence between the old and current policy exceeds 1.5 times a target value (playing the role of TRPO's δ), stop the current set of epochs early. This approximates TRPO's trust region cheaply, though without its formal guarantee.
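
A sketch of the early-stopping check, assuming the KL is estimated from the stored log-probabilities of the sampled actions. The estimator below is the simple sample mean of log π_old − log π_new; the target value and function names are illustrative.

```python
import numpy as np

def approx_kl(log_pi_old, log_pi_new):
    """Sample estimate of KL(pi_old || pi_new) from actions drawn under pi_old."""
    return float(np.mean(log_pi_old - log_pi_new))

def should_stop(log_pi_old, log_pi_new, target_kl=0.01):
    """True when the estimated KL exceeds 1.5 times the target."""
    return approx_kl(log_pi_old, log_pi_new) > 1.5 * target_kl

log_pi_old = np.log(np.array([0.5, 0.25, 0.25]))
print(should_stop(log_pi_old, log_pi_old))          # identical policies: keep going
print(should_stop(log_pi_old, log_pi_old - 0.1))    # large drift: stop this batch's epochs
```

In a training loop the check would run after each epoch (or minibatch), breaking out of the epoch loop when it returns True.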

Number of epochs per rollout

More epochs = more gradient updates from the same data = more sample efficiency. But:

• After each epoch, θ changes, making log_pi_old stale
• The importance ratio rt drifts away from 1
• Eventually the advantage estimates are wrong for the current policy

Empirical finding: 3-10 epochs is a good range. Beyond 10, you're usually hurting more than helping. Monitor the average KL between old and new policy per batch — if it exceeds ~0.01-0.05, reduce epochs.

Rollout length T and batch size

The rollout length T determines how many environment steps you collect before each update. Tradeoffs:

| Parameter | Effect when increased | Typical range |
| --- | --- | --- |
| Rollout length T | Better advantage estimates (less bias in GAE), slower updates | 256 – 4096 |
| Minibatch size | More stable gradients, less variance per step | 32 – 512 |
| n_epochs | More data efficiency, higher risk of policy divergence | 3 – 10 |
| γ (discount) | Longer effective horizon, more variance in returns | 0.99 (default) |
| λ (GAE) | Lower bias, higher variance (λ < 1 trades some bias for less variance) | 0.95 (default) |
| LR (Adam) | Faster learning, higher instability | 3e-4 (default) |

Rule of thumb: The total number of environment steps per iteration (T × n_parallel_envs) should be large enough that GAE produces stable estimates. For locomotion: 2048 steps × 1 env. For Atari: 128 steps × 8 envs. Total batch size should be several hundred to a few thousand.

Value function clipping

A subtle detail: some PPO implementations also clip the value function update. The clipped value loss is:

L^VF = max[ (V_θ(s_t) − V_targ)², (clip(V_θ(s_t), V_old − ε, V_old + ε) − V_targ)² ]

The idea: prevent the value function from changing too rapidly. In practice, value clipping is often omitted (and Andrychowicz et al. 2021 showed it can sometimes hurt). Use it if you observe value function instability, skip it otherwise.
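
The clipped value loss can be sketched directly from the formula. The numbers in the example are made up to show a case where the clipped term dominates.

```python
import numpy as np

def clipped_value_loss(v_new, v_old, v_target, eps=0.2):
    """Mean of max(unclipped, clipped) squared error, as in the formula above."""
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)   # stay within eps of v_old
    unclipped = (v_new - v_target) ** 2
    clipped = (v_clipped - v_target) ** 2
    return float(np.mean(np.maximum(unclipped, clipped)))

# The prediction moved from 1.0 to 2.0 toward a target of 3.0.
# Unclipped error: (2 - 3)^2 = 1. Clipped prediction 1.2: (1.2 - 3)^2 = 3.24.
loss = clipped_value_loss(np.array([2.0]), np.array([1.0]), np.array([3.0]))
print(loss)
```

Because the max picks the clipped term once the prediction has moved more than ε toward the target, the loss stops rewarding further movement in that update.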

During training you notice that 35% of timesteps have their ratio rt clipped. What does this tell you and what should you do?

Chapter 8: Modern Extensions

PPO (2017) is now a foundation layer, not an endpoint. The field has built a series of extensions that address its remaining weaknesses.

PPO with KL penalty (PPO-KL)

Instead of clipping, add a KL penalty to the objective. Adaptively adjust the penalty coefficient β based on whether the current KL is above or below a target:

python (adaptive KL penalty)
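# After each policy update, adapt beta from the measured mean KL
if kl_mean < target_kl / 1.5:
    beta /= 2      # KL too small: relax penalty
elif kl_mean > target_kl * 1.5:
    beta *= 2      # KL too large: tighten penalty
# L_surrogate is the unclipped importance-weighted objective mean(r_t * A_t)
loss = -L_surrogate + beta * kl_mean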

This variant is preferred in some RLHF pipelines because it provides smoother gradients than hard clipping.

DAPO: Decoupled Clip and Dynamic Sampling (2025)

Recent work from Yu et al. (2025, ByteDance Seed) identifies a subtle problem: in Group Relative Policy Optimization (GRPO, used in DeepSeek-R1), the entropy can collapse because the upper clip bound leaves low-probability tokens little room to gain probability, which limits exploration. DAPO addresses this and related issues by:

• Using separate clip thresholds ε_low and ε_high for the lower and upper bounds, with a larger ε_high ("Clip-Higher")
• Dynamic sampling: oversampling prompts and filtering out those whose sampled responses all receive the same reward, since they yield zero advantage
• Using a token-level policy gradient loss, so tokens in long responses are not down-weighted
• Shaping the reward for overlong (truncated) responses to reduce reward noise

GRPO (Group Relative Policy Optimization): Used in DeepSeek-R1-Zero and many reasoning models. Instead of a learned value function, it estimates the advantage by comparing each sampled output's reward to the mean reward of its group. This eliminates the critic network but introduces high variance. DAPO targets the entropy collapse that GRPO can suffer.

REINFORCE Leave-One-Out (RLOO)

For language model fine-tuning, RLOO simplifies the PPO update: for each prompt, sample K responses and, for each response, use the mean reward of the other K−1 responses as its baseline. This keeps the gradient estimate unbiased without a learned value function. Recent work (Ahmadian et al. 2024) shows RLOO matches PPO on RLHF benchmarks with less code and memory.
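
Both critic-free estimators are short enough to sketch side by side. The GRPO version below also divides by the group standard deviation, which is how the method is commonly described but is an assumption relative to the text above. The rewards are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center by the group mean, scale by the group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rloo_advantages(rewards):
    """Leave-one-out advantages: baseline for sample i is the mean of the other K-1 rewards."""
    k = len(rewards)
    baselines = (rewards.sum() - rewards) / (k - 1)
    return rewards - baselines

rewards = np.array([1.0, 0.0, 0.0, 1.0])   # hypothetical rewards for K = 4 responses
print(grpo_advantages(rewards))
print(rloo_advantages(rewards))
```

In both cases the advantages within a group sum to zero, so a prompt whose responses all receive the same reward contributes no gradient.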

Safe RL and constrained policy optimization

TRPO's constrained optimization framework naturally extends to constrained RL: maximize expected return subject to safety constraints (e.g., collision rate ≤ 5%). CPO (Constrained Policy Optimization, Achiam et al. 2017) extends TRPO to handle multiple safety constraints simultaneously using a primal-dual approach. PPO-based safe RL methods use Lagrangian relaxation to add constraint costs to the objective.

Offline-to-online fine-tuning

A modern pattern: pre-train a policy with behavior cloning (supervised learning on demonstrations), then fine-tune with PPO. The clip mechanism is essential here — it prevents PPO from immediately forgetting the learned behaviors. The initial rt values are all near 1, and the clip keeps updates conservative, allowing the policy to improve while retaining good priors.

PPO's enduring relevance: As of 2024, PPO and its variants remain the dominant family of methods for RLHF (Reinforcement Learning from Human Feedback) in large language models (ChatGPT, Claude, Gemini). The simplicity of L^CLIP scales well to billion-parameter models. TRPO's Fisher matrix computation is impractical at that scale; PPO needs none.
Why is PPO specifically used in RLHF for LLMs rather than plain gradient ascent or TRPO?

Chapter 9: Connections and What Comes Next

Policy gradient optimization doesn't exist in isolation. It's the optimization backbone that every modern RL system plugs into. Here's where Ch 12 sits in the larger picture.

What Ch 12 assumes (from Ch 11)

Every method here takes ∇U(θ) as input — the policy gradient estimate. Chapter 11 covers the likelihood-ratio trick, REINFORCE, baseline subtraction, and actor-critic gradient estimators. The quality of ∇U(θ) directly limits what optimization can achieve: a noisy gradient with high variance will cause unstable training regardless of how sophisticated the optimizer is.

What plugs into Ch 12 (Ch 13 — Actor-Critic)

Chapter 13 addresses a major weakness: the advantage estimate At used in LCLIP. Using raw Monte Carlo returns has high variance. Actor-critic methods replace Monte Carlo with a learned value function (the critic), dramatically reducing variance. The critic is trained to minimize Bellman error while the actor uses PPO's clipped objective. This is the standard architecture for modern RL.

Comparison to value-based methods (Ch 7-9)

| Dimension | Policy Gradient (Ch 12) | Value-Based (Ch 7-9) |
| --- | --- | --- |
| What's learned | π_θ(a|s) directly | Q(s,a) or V(s) |
| Action spaces | Continuous or discrete | Typically discrete |
| Stochastic policies | Native (just sample) | Requires ε-greedy tricks |
| Sample efficiency | Lower (on-policy) | Higher (off-policy replay) |
| Stability | Better (with TRPO/PPO) | Historically fragile |
| Convergence theory | Local optimum guaranteed | Approximate (with function approx) |

Connection to RLHF (Ch 18 — Imitation Learning)

RLHF pipelines use PPO as the optimization step, but with a reward model trained from human preferences instead of an environment reward. The policy is initialized from supervised fine-tuning (SFT), which is imitation learning (Ch 18). The clip mechanism limits how far each update moves from the previous policy, and RLHF pipelines typically add an explicit KL penalty toward the SFT policy, which acts as a regularizer against reward hacking.

Recommended reading

| Paper | What it contributes |
| --- | --- |
| Schulman et al. 2015 | TRPO: hard constraint + line search |
| Schulman et al. 2017 | PPO: clipped surrogate |
| Schulman et al. 2016 | GAE: generalized advantage estimation |
| Achiam et al. 2017 | CPO: constrained policy optimization |
| Andrychowicz et al. 2021 | What matters in on-policy RL (ablations) |
| Yu et al. 2025 | DAPO: entropy collapse fix for GRPO |

The progression in one line: Fix step size manually → fix step length → fix the metric (Fisher) → add line search (TRPO) → approximate with clipping (PPO) → PPO everywhere. Each step was forced by a specific failure mode of the previous method. Knowing the failures explains why the solution looks the way it does.
“To understand is to know what to do.”
— Wittgenstein, Zettel
PPO is used in RLHF. The SFT policy acts as what kind of implicit regularizer?