Ch 12: Policy Gradient Optimization

Chapter 0: Why Step Size Is Everything

Chapter 11 left us with an estimate of ∇U(θ) — the gradient of expected return with respect to policy parameters. The gradient tells us which direction improves the policy. Now we must decide: how far do we move?

This sounds like a routine hyperparameter choice. It isn't. Policy gradient optimization has a brutal failure mode that makes step size the most consequential decision in the entire algorithm.

The catastrophe scenario: You've trained a robot for 10,000 episodes. The policy is finally working. You take one gradient step that's slightly too large. The policy changes enough that the actions it learned to avoid now look attractive again. Performance collapses. You can't recover by taking a small step back — the gradient at the new parameters points somewhere completely different. Months of compute, gone.

This isn't hypothetical. Early RL research was plagued by it. The algorithms in this chapter are engineering responses to this specific failure mode — each one adding a different layer of protection.

Why gradient magnitude can't guide step size

In supervised learning, a large gradient usually means you're far from a minimum, so a big step makes sense. In RL, a large gradient often means you're in a region where rewards vary wildly between trajectories — the gradient estimate has high variance. Taking a large step based on a noisy estimate is exactly wrong.

There's a deeper issue. The gradient estimate ∇U(θ) is only valid as a local approximation. The further θ' is from θ, the more the estimate drifts from reality. Crucially, we can't just collect new data to check — each gradient evaluation requires running episodes.

The Overshoot Problem

Figure (interactive demo): gradient ascent on a 1D utility landscape with different step sizes α. When α is too large, the update overshoots the optimum and lands on a slope pointing away.
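
The overshoot can be reproduced in a few lines. The sketch below uses a toy utility U(θ) = −θ², which is an assumption made for illustration and not the landscape from the demo. Each update multiplies θ by (1 − 2α), so a small α shrinks θ toward the optimum at 0 while a large α jumps across it and lands further away each time.

```python
def grad_U(theta):
    """Gradient of the toy utility U(theta) = -theta**2 (optimum at 0)."""
    return -2.0 * theta

def run_gradient_ascent(theta0, alpha, n_steps):
    """Return the iterates theta_0, ..., theta_n of plain gradient ascent."""
    thetas = [theta0]
    for _ in range(n_steps):
        thetas.append(thetas[-1] + alpha * grad_U(thetas[-1]))
    return thetas

small = run_gradient_ascent(2.5, alpha=0.1, n_steps=20)   # factor 0.8 per step
large = run_gradient_ascent(2.5, alpha=1.1, n_steps=20)   # factor -1.2 per step

print(abs(small[-1]), abs(large[-1]))
```

In a real RL problem the landscape is not quadratic and the gradient is noisy, so the failure is usually less regular than this, but the mechanism is the same.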

Plain Gradient Ascent

θ ← θ + α∇U. Fast but fragile — step size is a gamble.

↓ constrain in parameter space

Natural Gradient

Rotate gradient by F⁻¹ to account for policy geometry. Parameterization-invariant.

↓ constrain in policy space + line search

TRPO

Hard KL constraint. Line search to find the best step within the trust region.

↓ simplify the constraint

PPO (Clipped Surrogate)

Clip the probability ratio. No Fisher matrix. No line search. Plain first-order gradient steps.

What is the fundamental reason step size is so dangerous in policy gradient methods?

• Large step sizes always increase variance
• The gradient is always wrong
• The gradient estimate is only a local approximation, and a step too large moves you to a region where it's completely invalid, with no cheap way to recover

Chapter 1: Gradient Ascent

The simplest approach: compute the gradient, step in that direction, repeat.

θ ← θ + α ∇U(θ)

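Here α > 0 is the step factor (learning rate). With a decaying step size α_k that satisfies the Robbins-Monro conditions (∑α_k = ∞, ∑α_k² < ∞), and with unbiased gradient estimates of bounded variance, stochastic gradient ascent converges to a stationary point, which in practice is usually a local optimum.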

The step length problem

The actual distance moved in parameter space is α·||∇U(θ)||. This depends on both α and the gradient magnitude. In RL, gradient magnitudes can vary wildly across training: early on when policies are random, gradients are small; later, when policies find high-reward trajectories, gradients can explode. A fixed α that worked fine in epoch 1 may cause catastrophic overshooting in epoch 50.

Local vs global: Gradient ascent finds a local optimum — a point where no infinitesimally small step improves things. Most policy spaces have many local optima (e.g., different walking gaits that all work but are qualitatively different). Which one you find depends heavily on initialization. This is accepted in practice; a good local optimum is often sufficient.

Gradient scaling and clipping

Two cheap tricks tame large gradients before the update:

Scaling (norm clipping): If ||∇U|| > L_max, rescale so the L2-norm equals L_max:

∇ ← L_max · ∇ / ||∇||

Preserves gradient direction. Only shrinks when too large.

Clipping (per-component): Independently clamp each dimension:

∇_i ← clamp(∇_i, −c, c)

Simpler, but can change the direction when one component dominates.

Which to use: Scaling is preferred when the gradient direction is reliable but the magnitude fluctuates. Clipping works when individual parameter gradients are untrustworthy. Both are cheap and widely used as preprocessing before any of the more sophisticated methods below.

python
import numpy as np

def gradient_ascent_step(theta, grad_U, alpha, L_max=None):
    """One step of gradient ascent, optionally rescaling the gradient to norm L_max."""
    g = np.asarray(grad_U, dtype=float)     # gradient estimate from Ch 11
    if L_max is not None:
        norm = np.linalg.norm(g)
        if norm > L_max:
            g = g * (L_max / norm)          # scale: preserve direction
    return theta + alpha * g
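
To make the difference between the two tricks concrete, here is a small sketch that applies both to a gradient dominated by one component. The helper names and the numbers are made up for illustration. The cosine between the original and the processed gradient shows whether the direction survived.

```python
import numpy as np

def scale_gradient(g, l_max):
    """Rescale g to L2 norm l_max if it is longer; direction is unchanged."""
    norm = np.linalg.norm(g)
    return g * (l_max / norm) if norm > l_max else g

def clip_gradient(g, c):
    """Clamp each component to [-c, c]; the direction can change."""
    return np.clip(g, -c, c)

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g = np.array([10.0, 1.0])          # one component dominates
scaled = scale_gradient(g, 2.0)
clipped = clip_gradient(g, 2.0)    # becomes [2.0, 1.0]
cos_scaled = cosine(g, scaled)     # 1.0 up to rounding
cos_clipped = cosine(g, clipped)   # below 1: the direction has rotated
```

Per-component clipping shrinks only the dominant component, so the smaller component gains relative weight and the update direction rotates toward it.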

You're training a robot with gradient ascent and suddenly its performance collapses after a good run. What is the most likely cause?

• The step size was too large, causing the update to overshoot a good region of parameter space; the gradient at the new parameters pointed in a harmful direction
• The gradient estimate became negative
• The learning rate decayed to zero

Chapter 2: The Natural Gradient

Plain gradient ascent measures distance in parameter space: ||θ' − θ||. But what we really care about is how much the policy behavior changes. Two policies with nearby parameters can behave very differently; two policies with distant parameters can behave nearly identically. Parameter-space distance is the wrong ruler.

The natural gradient swaps in a better ruler: KL divergence between the action distributions.

The restricted gradient: fix the step length

Start simpler. Instead of fixing α, directly constrain how far θ can move in parameter space. Maximize the linear approximation to U(θ) subject to a spherical constraint:

maximize ∇U(θ)^T(θ' − θ) s.t. ½||θ' − θ||² ≤ ε

This has a clean closed-form solution. The Lagrangian tells us θ' = θ + λ∇U. Plugging into the constraint gives λ = √(2ε/||∇U||²), so:

θ' = θ + √(2ε) · ∇U / ||∇U||

The step length is always exactly √(2ε), regardless of the gradient magnitude. Only the direction of the gradient matters. This decouples direction from magnitude — a major improvement.
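
A quick numerical check of this claim, as a sketch with a hypothetical helper name: two gradients that point the same way but differ in magnitude by a factor of 100,000 produce the same step.

```python
import numpy as np

def restricted_step(theta, grad, eps):
    """theta' = theta + sqrt(2*eps) * grad / ||grad||."""
    grad = np.asarray(grad, dtype=float)
    return theta + np.sqrt(2.0 * eps) * grad / np.linalg.norm(grad)

theta = np.zeros(2)
eps = 0.02
step_small = restricted_step(theta, [0.003, 0.004], eps) - theta
step_large = restricted_step(theta, [300.0, 400.0], eps) - theta
# Both steps have length sqrt(2 * 0.02) = 0.2 and the same direction.
```

The sketch does not guard against a zero gradient, where the direction is undefined and the division fails.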

The problem: parameter space is still the wrong metric

Suppose your policy has two parameters: θ₁ controls a softmax temperature (highly sensitive — tiny changes drastically shift the action distribution) and θ₂ controls a bias term (insensitive — large changes barely matter). A spherical constraint in parameter space treats both the same. You'll overstep θ₁ and understep θ₂.

The natural gradient fix: Replace the spherical constraint with an ellipsoid shaped by the Fisher information matrix F_θ. The ellipsoid is elongated along insensitive parameter directions (allowing larger steps) and compressed along sensitive ones (forcing smaller steps). This gives the same update regardless of how the policy is parameterized.

The natural gradient update

Solve the constrained problem with the Fisher metric:

maximize ∇U^T(θ'−θ) s.t. ½(θ'−θ)^TF_θ(θ'−θ) ≤ ε

The solution: let u = F_θ⁻¹∇U(θ) be the natural gradient. Then:

θ' = θ + √(2ε / u^TF_θu) · u

The direction u = F⁻¹∇U rotates the standard gradient to compensate for parameter-space curvature. In parameter directions where the policy is insensitive (small Fisher eigenvalue), u is larger, taking a bigger step. In sensitive directions (large Fisher eigenvalue), u is smaller, being conservative.
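
The update can be sketched directly from the formula. The Fisher matrix below is a made-up diagonal example in which θ₁ is 100 times more sensitive than θ₂; a real F would be estimated from rollouts.

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, eps):
    """theta' = theta + sqrt(2*eps / (u^T F u)) * u, with u = F^{-1} grad."""
    u = np.linalg.solve(fisher, grad)          # avoids forming F^{-1} explicitly
    return theta + np.sqrt(2.0 * eps / (u @ fisher @ u)) * u

F = np.diag([100.0, 1.0])      # hypothetical: theta_1 is the sensitive direction
grad = np.array([1.0, 1.0])    # equal raw gradient in both directions
theta = np.zeros(2)
d = natural_gradient_step(theta, grad, F, eps=0.01) - theta
constraint_value = 0.5 * d @ F @ d   # lands on the boundary, i.e. equals eps
```

Although the raw gradient is the same in both coordinates, the step along θ₁ comes out 100 times smaller than the step along θ₂, and the constraint is met with equality.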

The Fisher information matrix

The Fisher information matrix F_θ is the expected outer product of the score function (gradient of log policy):

F_θ = E_τ[ ∇ log p(τ|θ) · ∇ log p(τ|θ)^T ]

For stochastic policies, the environment transition terms do not depend on θ and drop out of the score (same argument as the likelihood-ratio gradient), so we only need the policy's log-probability gradients summed over each trajectory:

F_θ ≈ (1/m) ∑_{i=1}^{m} g_i g_i^T, where g_i = ∑_t ∇ log π_θ(a_t|s_t) along trajectory i and m is the number of sampled trajectories

We estimate this from the same rollout trajectories used for the gradient estimate. No extra episodes required.
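
As a sanity check on the estimator, the sketch below uses one-step "trajectories" from a 3-action softmax policy, where the exact Fisher matrix with respect to the logits is known in closed form (diag(π) − ππ^T). The policy, sample size, and function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def estimate_fisher(scores):
    """Average outer product of per-trajectory score vectors; scores is [m, n]."""
    scores = np.asarray(scores, dtype=float)
    return scores.T @ scores / scores.shape[0]

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])             # logits of a 3-action softmax policy
pi = softmax(theta)
actions = rng.choice(3, size=20000, p=pi)      # one-step "trajectories"
scores = np.eye(3)[actions] - pi               # grad of log pi(a) w.r.t. the logits
F_hat = estimate_fisher(scores)
F_exact = np.diag(pi) - np.outer(pi, pi)       # closed form for this policy
```

This particular F is singular, because adding a constant to every logit leaves the policy unchanged. That is one reason practical implementations often add a small damping term before solving with F.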

Why "natural"? Amari (1998) showed that the natural gradient is the steepest ascent direction in the Riemannian manifold of probability distributions under the Fisher metric. It's the gradient that the geometry of probability space wants you to follow. Standard gradient ascent ignores this geometry; the natural gradient embraces it.

Fisher matrix properties:

• Symmetric, positive semi-definite
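• Equals the Hessian of the KL divergence at θ' = θ (the second-order term of its Taylor expansion)
• Usually invertible when estimated from enough data; a small damping term is often added otherwise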
• Encodes curvature of the policy manifold

Practical cost:

• O(n²) storage for n parameters (huge for networks)
• O(n³) for explicit inversion
• Conjugate gradient: solve Fu = ∇U without forming F
• Diagonal approximation: cheap but ignores correlations
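
The conjugate gradient option deserves a sketch, since it is what makes the natural gradient practical for large models. The routine below is the textbook method for symmetric positive definite systems; it touches F only through a function that returns Fv. The 2×2 matrix is a stand-in, not a real Fisher matrix.

```python
import numpy as np

def conjugate_gradient(fvp, b, n_iters=10, tol=1e-10):
    """Solve F u = b for symmetric positive definite F, given only v -> F v."""
    u = np.zeros_like(b)
    r = b.copy()              # residual b - F u (u starts at 0)
    p = r.copy()              # current search direction
    rs_old = r @ r
    for _ in range(n_iters):
        if rs_old < tol:      # residual is small enough: stop
            break
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        u += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return u

F = np.array([[4.0, 1.0], [1.0, 3.0]])   # stand-in for the Fisher matrix
g = np.array([1.0, 2.0])
u = conjugate_gradient(lambda v: F @ v, g)
```

With n parameters, each iteration costs one Fisher-vector product instead of O(n²) storage, and a handful of iterations usually gives a direction that is good enough for the update.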

Natural vs Standard Gradient on a Curved Surface

Figure (interactive demo): objective −(x+1)² − 10y², whose y-direction is 10× more sensitive. Standard gradient ascent (orange) zigzags; the natural gradient (blue) heads straight for the optimum.

Your policy has one parameter that's very sensitive (small change → big behavioral change) and one that's insensitive. What does the natural gradient do differently from the standard gradient?

• It takes a smaller step along the sensitive parameter direction and a larger step along the insensitive one, calibrating step size per direction based on Fisher curvature
• It ignores the insensitive parameter entirely
• It takes equal steps in all parameter directions

Chapter 3: Trust Region Policy Optimization

Natural gradient gives us the right direction. TRPO adds a guarantee: it searches along that direction for the best step that provably keeps the new policy close to the old one.

Define the surrogate objective — an estimate of U(θ') computed from data collected under θ:

L(θ') = E_{s~b, a~π_θ}[ (π_θ'(a|s) / π_θ(a|s)) · Q_θ(s,a) ]

The ratio π_θ'/π_θ is importance weighting: it corrects for the fact that actions were sampled from the old policy, not the new one. Q_θ(s,a) is the action-value function estimate. This quantity can be evaluated for any θ' without running new episodes.

Why importance weighting works: If π_θ' assigns high probability to actions that π_θ also liked, the ratio is ~1 and we trust the Q estimate. If π_θ' drastically shifts to actions π_θ almost never took, the Q estimate for those actions is unreliable. The KL constraint prevents this — it keeps the distributions similar enough that importance weighting remains valid.
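
A minimal sketch of the surrogate as a sample average, assuming the stored data gives one log-probability and one Q estimate per sampled state-action pair. All numbers are invented for illustration.

```python
import math

def surrogate(logp_new, logp_old, q_values):
    """Importance-weighted estimate of L(theta') from samples drawn under pi_theta."""
    terms = [math.exp(ln - lo) * q for ln, lo, q in zip(logp_new, logp_old, q_values)]
    return sum(terms) / len(terms)

logp_old = [math.log(0.5), math.log(0.25), math.log(0.25)]   # pi_theta(a|s) at the samples
q_values = [1.0, 0.0, 2.0]                                   # Q estimates for those samples
same = surrogate(logp_old, logp_old, q_values)               # all ratios equal 1
logp_new = [math.log(0.4), math.log(0.1), math.log(0.5)]     # shifts mass to the Q=2 action
shifted = surrogate(logp_new, logp_old, q_values)
```

With θ' = θ the surrogate is just the mean Q estimate (1.0 here). The shifted policy doubles the probability of the best sampled action and scores 1.6, without any new rollouts.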

The optimization problem

TRPO solves a constrained optimization at each step:

maximize L(θ') subject to E_s[ D_KL(π_θ(·|s) || π_θ'(·|s)) ] ≤ δ

The KL divergence constraint is in policy space, not parameter space. Two policy parameterizations that produce identical distributions are treated identically — TRPO is truly parameterization-invariant.
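
For a discrete action space the constraint can be evaluated exactly at each sampled state. The sketch below assumes categorical policies given as probability lists, with the new policy assigning nonzero probability wherever the old one does; the value of δ is chosen only for illustration.

```python
import math

def kl_categorical(p, q):
    """D_KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mean_kl(old_probs, new_probs):
    """Average of D_KL(pi_old(.|s) || pi_new(.|s)) over the sampled states."""
    kls = [kl_categorical(p, q) for p, q in zip(old_probs, new_probs)]
    return sum(kls) / len(kls)

old = [[0.5, 0.5], [0.9, 0.1]]          # pi_theta(.|s) at two sampled states
new = [[0.6, 0.4], [0.9, 0.1]]          # candidate pi_theta'(.|s)
kl = mean_kl(old, new)                  # roughly 0.0102
within_trust_region = kl <= 0.02        # delta = 0.02, an illustrative value
```

The expectation over states is what lets a policy change noticeably in one state as long as it stays put in others.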

The algorithm: conjugate gradient + line search

1. Compute gradient ∇L(θ)

Standard policy gradient from rollouts.

↓

2. Solve Fu = ∇L with conjugate gradient

Avoids forming F explicitly. Requires only matrix-vector products Fv, computable from rollouts.

↓

3. Compute candidate: θ_candidate = θ + √(2δ/u^TFu) · u

Normalized natural gradient direction, scaled to hit the KL boundary.

↓

4. Backtracking line search

Try the full step θ + (θ_candidate − θ), then θ + β(θ_candidate − θ), then θ + β²(θ_candidate − θ), … (with 0 < β < 1) until KL ≤ δ and L improves.

The line search evaluates L(θ') and the KL for each candidate by computing importance ratios π_θ'/π_θ on the stored rollout data. No new episodes needed. Typically only 10-20 function evaluations per step.

Why a hard constraint, not a penalty?

A natural alternative: add β·KL to the objective as a penalty. The problem: the right β varies wildly across environments and even across training stages of the same environment. TRPO's hard constraint δ has a consistent semantic — "never let the policies diverge more than this" — making it a robust hyperparameter choice.

TRPO's guarantees: Under regularity conditions, each TRPO step monotonically improves a lower bound on U(θ'). The bound tightens as KL shrinks. This is the closest thing to a guaranteed-safe update in RL — but it comes with high implementation complexity.

python
import numpy as np

def trpo_step(theta_old, grad, fvp, surrogate, mean_kl, solve_cg,
              delta=0.01, beta=0.5, max_backtracks=10):
    """One TRPO update on stored rollouts; returns the accepted parameters.

    The callables are supplied by the caller and evaluated on the stored data:
    fvp(v) -> F v, surrogate(theta) -> L(theta), mean_kl(theta) -> mean KL
    from the old policy, solve_cg(fvp, b) -> approximate solution of F u = b.
    """
    u = solve_cg(fvp, grad)                                 # Fu = grad
    full_step = np.sqrt(2 * delta / np.dot(u, fvp(u))) * u  # natural gradient step to the KL boundary
    surrogate_old = surrogate(theta_old)                    # computed once, reused below
    for k in range(max_backtracks):                         # backtracking line search
        theta_new = theta_old + (beta ** k) * full_step
        if mean_kl(theta_new) <= delta and surrogate(theta_new) > surrogate_old:
            return theta_new                                # accept step
    return theta_old                                        # reject all, keep old

TRPO's line search evaluates L(θ') at 10 candidate points. How many additional rollout episodes does this require?

• 10 full episodes per candidate = 100 extra episodes
• One episode per candidate = 10 extra episodes
• Zero: L(θ') is computed from stored rollout data using importance ratios; no new episodes needed

Chapter 4: Proximal Policy Optimization

TRPO works. But it's complicated: you need a conjugate gradient solver, a Fisher-vector-product computation, and a custom line search. In 2017, Schulman et al. asked: what if you could get most of TRPO's benefits with a single clipped objective that standard optimizers (Adam, SGD) can handle directly?

The answer: PPO. Define the probability ratio:

r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)

r_t = 1 when θ = θ_old. r_t > 1 means θ now assigns more probability to this action than θ_old did. r_t < 1 means less probability.

The clipped surrogate objective is:

L^CLIP(θ) = E_t[ min( r_t·A_t, clip(r_t, 1−ε, 1+ε)·A_t ) ]

where A_t is the advantage estimate at time t, and ε is typically 0.2.

What the clipping actually does

The clip function forces r_t to stay in [1−ε, 1+ε]. The min takes the pessimistic bound between clipped and unclipped. Let's work through the two cases:

Case 1: A_t > 0 (good action)

We want to increase π(a|s), so we want r_t to grow. But if r_t > 1+ε, the clipped version caps at (1+ε)A_t < r_tA_t. The min picks the cap. Gradient goes to zero. No benefit to pushing r further right of the boundary.

Case 2: A_t < 0 (bad action)

We want to decrease π(a|s), so r_t should shrink. But if r_t < 1−ε, the clipped version stops at (1−ε)A_t, which is smaller (more negative) than r_tA_t because A_t < 0. The min picks the clipped term, which is constant in θ. The gradient again goes to zero.

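The key insight: PPO creates a flat region in the objective once the ratio has moved past the clip boundary in the direction the advantage favors (r_t > 1+ε with A_t > 0, or r_t < 1−ε with A_t < 0). In that flat region, the gradient is exactly zero, so the optimizer has no incentive to push the ratio further outside [1−ε, 1+ε]. If the ratio has moved the wrong way (for example r_t < 1−ε with A_t > 0), the min keeps the unclipped term and the gradient pulls it back. This acts like a soft constraint, without any constrained optimization machinery. Because of the min, L^CLIP is a pessimistic lower bound on the unclipped surrogate.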
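
The two cases above, plus the two in which the ratio has moved against the advantage, can be checked numerically. The sketch evaluates the per-sample clipped term and estimates its slope with respect to r_t by finite differences; the function names are made up for this example.

```python
def clipped_term(r, adv, eps=0.2):
    """min(r*A, clip(r, 1-eps, 1+eps)*A) for a single sample."""
    r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
    return min(r * adv, r_clipped * adv)

def slope_wrt_ratio(r, adv, eps=0.2, h=1e-6):
    """Central finite-difference slope of the clipped term with respect to r."""
    return (clipped_term(r + h, adv, eps) - clipped_term(r - h, adv, eps)) / (2 * h)

cases = {
    "A>0, r above 1+eps": slope_wrt_ratio(1.5, +2.0),   # flat: slope 0
    "A>0, r below 1-eps": slope_wrt_ratio(0.5, +2.0),   # active: slope A
    "A<0, r below 1-eps": slope_wrt_ratio(0.5, -1.0),   # flat: slope 0
    "A<0, r above 1+eps": slope_wrt_ratio(1.5, -1.0),   # active: slope A
}
```

The slope vanishes only when the ratio has already gone too far in the direction the advantage rewards. In the other two cases the full gradient survives and pulls the ratio back toward the band.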

Figure (interactive demo): PPO clipped surrogate. The orange curve is L^CLIP as a function of the ratio r_t. The blue dashed line is the unclipped surrogate r·A. The teal band is the [1−ε, 1+ε] trust region, and a vertical red line marks the current ratio r_t. The gradient (slope of the orange curve) goes to zero outside the band on one side, depending on the sign of the advantage.

What to observe in the demo

With positive advantage: Set A=1.5. Move r from 0 to 2. Notice the orange curve rises linearly until r=1+ε, then goes flat. The optimizer gets no gradient signal to push r above 1+ε. That's the safety mechanism.

With negative advantage: Set A=−1.5. The objective becomes a flipped curve. Now the orange curve rises as r decreases (less probability on a bad action), but flattens below r=1−ε. No benefit to making r very small (overly suppressing the action).

With zero advantage: Set A=0. The objective is flat everywhere — if an action is neither good nor bad, PPO makes no update regardless of r. This is correct.

PPO in practice: Schulman et al. (2017) found ε=0.2 works well across diverse tasks (MuJoCo locomotion, Atari, robotics). The surrogate L^CLIP is maximized for multiple epochs (3-10) on the same batch of rollouts, then new rollouts are collected. This "multiple SGD epochs per batch" is what makes PPO sample-efficient despite having no explicit Fisher matrix computation.

python
import torch

def ppo_loss(log_pi_new, log_pi_old, advantages, eps=0.2):
    """
    log_pi_new: [B] log probabilities under new policy
    log_pi_old: [B] log probabilities under old policy (fixed)
    advantages: [B] advantage estimates (e.g. from GAE)
    Returns: scalar loss, the negated PPO clipped objective (to be minimized)
    """
    ratio = (log_pi_new - log_pi_old).exp()  # [B] r_t(theta)
    # Unclipped surrogate
    obj1 = ratio * advantages                  # [B]
    # Clipped surrogate
    obj2 = ratio.clamp(1 - eps, 1 + eps) * advantages  # [B]
    # Pessimistic min, then mean over batch
    objective = torch.min(obj1, obj2).mean()   # scalar
    return -objective  # negate: PyTorch minimizes, we want to maximize

# Training loop (one iteration). Assumes policy, optimizer, states, actions,
# rewards, values, gamma, lam, n_epochs and compute_advantages are defined.
with torch.no_grad():
    log_pi_old = policy.log_prob(states, actions)      # fixed before any epoch
adv = compute_advantages(rewards, values, gamma, lam)  # computed once per batch
for epoch in range(n_epochs):             # e.g. 10 epochs on same data
    log_pi = policy.log_prob(states, actions)          # recomputed as theta changes
    loss = ppo_loss(log_pi, log_pi_old, adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

With advantage A_t = +2 and ε = 0.2, what happens to the PPO gradient when the current ratio r_t = 1.5?

• The gradient is zero: r_t=1.5 is above the 1+ε=1.2 clip boundary, so the objective is flat there; no signal to push r further
• The gradient is positive and large, encouraging even more probability increase
• The objective is undefined outside [1-ε, 1+ε]

Chapter 5: TRPO vs PPO

Both TRPO and PPO are trying to solve the same problem: take the biggest safe step toward a better policy. They differ dramatically in how they define and enforce safety.

Property	TRPO	PPO
Trust region	Hard KL constraint (≤δ)	Soft: flat gradient outside clip region
Optimization per step	Constrained optimization + line search	Standard unconstrained gradient steps
Fisher matrix	Required (via conjugate gradient)	Not needed
Epochs per batch	1 (one natural gradient step)	3–10 (multiple SGD passes)
Implementation complexity	High (CG, Hessian-vector products)	Low (~10 lines of PyTorch)
Memory overhead	High (conjugate gradient buffers)	Low (just old log-probs)
Parameterization invariant?	Yes (exact)	Approximately (via clipping)
Empirical performance	Excellent	Excellent (comparable)
Default choice today?	Rarely (too complex)	Yes (baseline for most RL)

Why PPO won: Schulman et al. showed that PPO matches or exceeds TRPO on standard benchmarks (Atari, MuJoCo), at a fraction of the implementation cost. When two methods give the same answer, engineers choose the simpler one. The Fisher matrix and conjugate gradient that seemed essential for safety turned out to be replaceable by a five-line clipping operation.

When TRPO is still worth it

TRPO's hard KL constraint provides stronger guarantees in some situations. If you need:

• Monotonic improvement guarantees (safety-critical deployments)
• Invariance to policy parameterization (comparing across different network architectures)
• Understanding why a method fails (TRPO's diagnostics are cleaner)

Then TRPO is worth the extra implementation effort. In research, TRPO remains a useful baseline precisely because its guarantees are mathematically precise.

When PPO clips too aggressively

PPO's clipping is an approximation. There are edge cases where:

• The clip threshold ε is wrong for the current training stage (too tight early on, too loose later)
• Multiple epochs corrupt the old log-probs assumption (the importance ratio drifts)
• The advantage estimator has high variance, making the clip region the wrong shape

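Modern practice: monitor the fraction of timesteps where r_t is clipped (the "clip fraction"). If it stays above roughly 20-30%, the policy is moving too far per batch: lower the learning rate or the number of epochs. If it is near 0%, the updates are barely reaching the clip boundaries, so the learning rate may be too small or ε too loose to have any effect.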
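
The clip fraction is cheap to compute from the ratios already in memory. This sketch counts every ratio outside [1−ε, 1+ε], regardless of the advantage sign, which is one common convention; the ratios are made-up values.

```python
def clip_fraction(ratios, eps=0.2):
    """Fraction of samples whose ratio r_t lies outside [1 - eps, 1 + eps]."""
    outside = [abs(r - 1.0) > eps for r in ratios]
    return sum(outside) / len(outside)

ratios = [0.70, 0.95, 1.00, 1.10, 1.30, 1.05, 0.90, 1.25]
frac = clip_fraction(ratios)      # 3 of 8 ratios fall outside [0.8, 1.2]
```

Logging this value once per epoch makes it easy to see whether later epochs on the same batch are drifting further from the data-collecting policy.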

Conceptual Comparison: Trust Region Shape

Figure (interactive demo): TRPO enforces a KL-ball trust region in policy space (elliptical in parameter space). PPO's clipping creates a box-like region in ratio space whose size is set by ε. These are different geometric objects; PPO's is an approximation.

PPO runs 10 gradient epochs on the same batch of rollouts. What assumption does this violate and what problem can it cause?

• The data was collected under θ_old, and the ratio r_t is only a reliable correction while the current policy stays close to it; after many updates, r_t can drift far from 1, making the importance weighting (and the stored advantages) unreliable
• Running more epochs always improves performance
• The advantage estimates become negative

Chapter 6: Implementation Details

The gap between understanding PPO conceptually and getting it to work is filled with engineering decisions that papers often omit. Here's what actually matters.

The full PPO objective

The paper's actual objective combines three terms:

L(θ) = E_t[ L^CLIP_t(θ) − c₁L^VF_t(θ) + c₂S[π_θ](s_t) ]

Three components:

L^CLIP: The clipped policy objective we derived. Its coefficient is implicitly 1.

L^VF: Value function (critic) loss. Typically (V_θ(s_t) − V_t^target)². Coefficient c₁ ≈ 0.5.

S[π_θ]: Entropy bonus. Encourages exploration. Coefficient c₂ ≈ 0.01.
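
The signs are easy to get wrong when the three terms are turned into a single loss for a minimizing optimizer. A minimal sketch, assuming batch-averaged scalars for each term and a categorical policy for the entropy; the numbers are illustrative only.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_total_loss(l_clip, l_vf, mean_entropy, c1=0.5, c2=0.01):
    """Negated combined objective, for use with a minimizing optimizer."""
    return -(l_clip - c1 * l_vf + c2 * mean_entropy)

uniform = entropy([0.25, 0.25, 0.25, 0.25])     # log(4), the maximum for 4 actions
peaked = entropy([0.97, 0.01, 0.01, 0.01])      # nearly deterministic policy
loss_uniform = ppo_total_loss(l_clip=0.10, l_vf=0.40, mean_entropy=uniform)
loss_peaked = ppo_total_loss(l_clip=0.10, l_vf=0.40, mean_entropy=peaked)
```

With everything else equal, the higher-entropy policy gets the lower loss, which is how the bonus discourages premature collapse to a deterministic policy.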

Why a shared network? In many implementations the actor (policy) and critic (value function) share a body with separate heads and co-train, which is why the policy and value losses are combined into one objective. The critic helps estimate advantages; the actor uses them to improve the policy. Sharing representations is more parameter-efficient.

Why entropy? Flat regions (A≈0) produce no policy gradient. The entropy bonus keeps the policy from collapsing to determinism prematurely, especially early in training.

Advantage estimation: GAE

The advantage A_t appears in every term of L^CLIP. PPO almost always uses Generalized Advantage Estimation (GAE) from Schulman et al. (2016):

A_t^{GAE(λ)} = ∑_{l=0}^{∞} (γλ)^l δ_{t+l}

where δ_t = r_t + γV(s_{t+1}) − V(s_t) is the TD error (here r_t is the reward at step t, not the probability ratio). The λ parameter interpolates between:

• λ=0: pure TD(0) — low variance, high bias
• λ=1: Monte Carlo returns — high variance, low bias
• λ=0.95: the sweet spot in most benchmarks
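
The two extremes can be verified on a three-step episode. The sketch uses the backward recursion A_t = δ_t + γλ·A_{t+1}, which is equivalent to the sum above for a finite episode; the rewards and value estimates are made up, and γ = 1 keeps the arithmetic readable.

```python
def gae(rewards, values, gamma, lam, last_value=0.0):
    """Backward recursion A_t = delta_t + gamma*lam*A_{t+1} over one episode."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else last_value
        delta = rewards[t] + gamma * next_value - values[t]   # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]                       # critic estimates V(s_t), made up
td_only = gae(rewards, values, gamma=1.0, lam=0.0)       # [1.5, 0.5, 0.5]
monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)   # [2.5, 1.0, 0.5]
```

With λ=0 each advantage is just the one-step TD error. With λ=1 it equals the return-to-go minus V(s_t): for t=0 that is (1+0+2) − 0.5 = 2.5.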

Normalization: the hidden magic

These details are rarely highlighted but often make or break PPO:

Advantage normalization: Before computing L^CLIP, subtract the batch mean and divide by the batch std: A_norm = (A − mean(A)) / (std(A) + 1e-8). This keeps gradient scales consistent across environments with different reward magnitudes. Without this, a task with large rewards (e.g. Atari scores in thousands) produces huge gradients vs. a task with unit rewards.

Observation normalization: Maintain running statistics of observations and normalize them. Many environments have wildly different observation ranges. A neural network trained on unnormalized observations can spend most of training learning the scale, not the task.

Reward normalization: Scale rewards by a running std estimate of the discounted return. Prevents the value function from having to represent astronomically different value scales.
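
Observation normalization needs statistics that update as data arrives. One way to do this, sketched here with a hypothetical class name, is Welford's online algorithm for the running mean and variance.

```python
import numpy as np

class RunningNorm:
    """Running mean/variance of observations (Welford's online update)."""
    def __init__(self, dim):
        self.count = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)          # running sum of squared deviations

    def update(self, obs):
        obs = np.asarray(obs, dtype=float)
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs, eps=1e-8):
        var = self.m2 / max(self.count, 1)
        return (np.asarray(obs, dtype=float) - self.mean) / np.sqrt(var + eps)

norm = RunningNorm(dim=2)
data = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
for row in data:
    norm.update(row)
z = norm.normalize(data)          # each column now has mean ~0 and std ~1
```

The same statistics must be saved with the policy and reused at evaluation time, otherwise the network sees inputs on a different scale than it was trained on.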

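python
import numpy as np

class PPOBuffer:
    """Collect rollouts, compute GAE advantages, normalize."""
    def __init__(self, rewards, values):
        # Rewards r_t and critic estimates V(s_t) for one episode, in time order.
        self.rewards = np.asarray(rewards, dtype=np.float64)
        self.values = np.asarray(values, dtype=np.float64)

    def compute_advantages(self, gamma=0.99, lam=0.95, last_value=0.0):
        # last_value: V(s_T) to bootstrap from; 0.0 if the rollout ended in a terminal state.
        adv = np.zeros_like(self.rewards)
        last_gae = 0.0
        for t in reversed(range(len(self.rewards))):
            if t == len(self.rewards) - 1:
                next_val = last_value
            else:
                next_val = self.values[t + 1]
            delta = self.rewards[t] + gamma * next_val - self.values[t]  # TD error
            last_gae = delta + gamma * lam * last_gae                    # GAE recursion
            adv[t] = last_gae
        # Normalize advantages (critical!)
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        return adv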

Data flow: shapes and types

Tensor	Shape	Notes
observations	[T, obs_dim]	T = rollout length (e.g. 2048)
actions	[T] or [T, act_dim]	discrete or continuous
log_pi_old	[T]	fixed; computed before any epoch
advantages	[T]	GAE, then normalized
returns	[T]	unnormalized advantages + values; target for critic
log_pi_new	[T]	recomputed each epoch as θ changes
ratio r_t	[T]	exp(log_pi_new - log_pi_old)
L^CLIP	scalar	mean over T, then backprop

Why must log_pi_old be computed before any gradient epoch and kept fixed throughout all epochs?

• Because r_t = exp(log_pi_new - log_pi_old) measures how much the current policy has drifted from the policy that collected the data; if log_pi_old changes, the ratio loses its meaning as a correction factor
• Because it saves computation
• Because the optimizer requires fixed inputs

Chapter 7: Hyperparameter Sensitivity

PPO is widely praised for robustness, but "robust" is relative. Understanding which hyperparameters matter — and why — separates practitioners who tune PPO correctly from those who cargo-cult settings from Atari papers onto robotics tasks.

The clip threshold ε

ε = 0.2 is the Schulman et al. default. It says: "give the optimizer no reward for moving the probability ratio more than 20% away from 1." In practice:

ε too small (e.g. 0.05):

Policy changes too slowly. High clip fraction (most updates are clipped). Training takes many more iterations. Can be appropriate for safety-sensitive tasks or fine-tuning.

ε too large (e.g. 0.5):

The trust region is effectively gone. Training becomes as fragile as vanilla gradient ascent. Large policy shifts make old advantage estimates unreliable.

Adaptive clipping: Some implementations decay ε over training (start large for exploration, shrink for stability at convergence). Others use KL early stopping: if the mean KL divergence between the old and current policy exceeds 1.5 times a target value, stop the current set of epochs early. This approximates TRPO's trust region cheaply, though without its formal guarantee.
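
KL early stopping needs only the log-probabilities that PPO already computes. The sketch below uses the simple sample estimate mean(log π_old − log π_new) over actions drawn under the old policy; the target value, the factor, and the log-probabilities are illustrative assumptions.

```python
def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(pi_old || pi_new) from actions drawn under pi_old."""
    return sum(lo - ln for lo, ln in zip(logp_old, logp_new)) / len(logp_old)

def should_stop_early(logp_old, logp_new, target_kl=0.01, factor=1.5):
    """True once the estimated KL exceeds factor * target_kl."""
    return approx_kl(logp_old, logp_new) > factor * target_kl

logp_old = [-1.00, -0.50, -2.00, -1.20]
after_epoch_1 = [-0.99, -0.50, -2.01, -1.21]    # policy has barely moved
after_epoch_5 = [-0.90, -0.45, -2.20, -1.30]    # policy has drifted further
stop_1 = should_stop_early(logp_old, after_epoch_1)
stop_5 = should_stop_early(logp_old, after_epoch_5)
```

This estimate is noisy on small batches and can even come out negative, so it is best read as a coarse alarm.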

Number of epochs per rollout

More epochs = more gradient updates from the same data = more sample efficiency. But:

• After each epoch, θ moves further from θ_old, the policy that collected the data (log_pi_old itself stays fixed)
• The importance ratio r_t drifts away from 1
• Eventually the advantage estimates are wrong for the current policy

Empirical finding: 3-10 epochs is a good range. Beyond 10, you're usually hurting more than helping. Monitor the average KL between old and new policy per batch — if it exceeds ~0.01-0.05, reduce epochs.

Rollout length T and batch size

The rollout length T determines how many environment steps you collect before each update. Tradeoffs:

Parameter	Effect when increased	Typical range
Rollout length T	Better advantage estimates (less bias in GAE), slower updates	256 – 4096
Minibatch size	More stable gradients, less variance per step	32 – 512
n_epochs	More data efficiency, higher risk of policy divergence	3 – 10
γ (discount)	Longer effective horizon, more variance in returns	0.99 (default)
λ (GAE)	Higher variance, lower bias (values below 1 trade some bias for lower variance)	0.95 (default)
LR (Adam)	Faster learning, higher instability	3e-4 (default)

Rule of thumb: The total number of environment steps per iteration (T × n_parallel_envs) should be large enough that GAE produces stable estimates. For locomotion: 2048 steps × 1 env. For Atari: 128 steps × 8 envs. Total batch size should be several hundred to a few thousand.

Value function clipping

A subtle detail: some PPO implementations also clip the value function update. The clipped value loss is:

L^VF = max[ (V(θ) − V_targ)², (clip(V(θ), V_old−ε, V_old+ε) − V_targ)² ]

The idea: prevent the value function from changing too rapidly. In practice, value clipping is often omitted (and Andrychowicz et al. 2021 showed it can sometimes hurt). Use it if you observe value function instability, skip it otherwise.
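
A per-state sketch of the clipped value loss, following the formula above with made-up numbers:

```python
def clipped_value_loss(v_new, v_old, v_target, eps=0.2):
    """max of the unclipped and clipped squared errors for one state."""
    v_clipped = max(v_old - eps, min(v_new, v_old + eps))
    return max((v_new - v_target) ** 2, (v_clipped - v_target) ** 2)

# The new prediction moved 1.0 toward the target, but only 0.2 of that is credited.
loss_far = clipped_value_loss(v_new=2.0, v_old=1.0, v_target=3.0)    # (1.2 - 3)^2 = 3.24
# A small move inside the clip range is unaffected.
loss_near = clipped_value_loss(v_new=1.1, v_old=1.0, v_target=3.0)   # (1.1 - 3)^2 = 3.61
```

In the first case the max selects the clipped term, which does not depend on the new prediction, so that sample contributes no gradient. This mirrors the flat region of L^CLIP.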

During training you notice that 35% of timesteps have their ratio r_t clipped. What does this tell you and what should you do?

• The policy is changing too much per iteration: reduce the number of epochs or the learning rate so that more updates stay within the trust region
• Everything is fine; high clip fraction means the algorithm is working correctly
• Increase ε to allow more policy change

Chapter 8: Modern Extensions

PPO (2017) is now a foundation layer, not an endpoint. The field has built a series of extensions that address its remaining weaknesses.

PPO with KL penalty (PPO-KL)

Instead of clipping, add a KL penalty to the objective. Adaptively adjust the penalty coefficient β based on whether the current KL is above or below a target:

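python (adaptive KL penalty)
def adapt_beta(beta, kl_mean, target_kl):
    """Adjust the penalty coefficient after each policy update."""
    if kl_mean < target_kl / 1.5:
        beta /= 2      # KL too small: relax penalty
    elif kl_mean > target_kl * 1.5:
        beta *= 2      # KL too large: tighten penalty
    return beta

# surrogate = mean(r_t * A_t), unclipped: the KL penalty replaces the clip
loss = -surrogate + beta * kl_mean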

This variant is preferred in some RLHF pipelines because it provides smoother gradients than hard clipping.

DAPO: Decoupled Clip and Dynamic Sampling (2025)

Yu et al. (2025) identify a subtle problem in Group Relative Policy Optimization (GRPO, used in DeepSeek-R1): entropy can collapse because the upper clip bound leaves low-probability tokens little room to grow, which limits exploration. DAPO addresses this and related issues by:

• Using separate clip thresholds ε_low and ε_high for the lower and upper bounds, and raising ε_high ("clip-higher")
• Dynamic sampling: dropping prompts whose sampled group is all correct or all wrong (zero advantage, hence no gradient) and resampling to fill the batch
• Averaging the loss over tokens instead of over sequences
• Shaping the reward for overlong responses to reduce reward noise

GRPO (Group Relative Policy Optimization): Used in DeepSeek-R1-Zero and many reasoning models. Instead of a learned value function, it estimates the advantage by comparing a group of sampled outputs to their group mean reward. This eliminates the critic network but introduces high variance. DAPO fixes the entropy collapse that GRPO suffers.
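
A sketch of the group-relative advantage. Subtracting the group mean is what the text above describes; dividing by the group standard deviation is the normalization commonly paired with it in GRPO and is included here as an assumption. The rewards are made up.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its group's mean reward."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

group_rewards = [1.0, 0.0, 0.0, 1.0]      # e.g. correct / incorrect answers to one prompt
adv = group_relative_advantages(group_rewards)
all_same = group_relative_advantages([1.0, 1.0, 1.0])   # every output equally good
```

When every output in a group gets the same reward, all advantages are zero and the prompt contributes no gradient.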

REINFORCE Leave-One-Out (RLOO)

For language model fine-tuning, RLOO simplifies the PPO update: for each prompt, sample K responses, use the mean of the other K−1 as a baseline. This gives an unbiased advantage estimate without a learned value function. Recent work (Ahmadian et al. 2024) shows RLOO matches PPO on RLHF benchmarks with less code and memory.
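
The leave-one-out baseline is a one-liner once the group total is known. A minimal sketch with made-up rewards (it needs K ≥ 2):

```python
def rloo_advantages(rewards):
    """Each sample's reward minus the mean reward of the other K-1 samples."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

adv = rloo_advantages([1.0, 2.0, 3.0])   # baselines are 2.5, 2.0 and 1.5
```

Because each baseline excludes the sample it is applied to, it is independent of that sample's reward, which is what keeps the gradient estimate unbiased.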

Safe RL and constrained policy optimization

TRPO's constrained optimization framework naturally extends to constrained RL: maximize expected return subject to safety constraints (e.g., collision rate ≤ 5%). CPO (Constrained Policy Optimization, Achiam et al. 2017) extends TRPO to handle multiple safety constraints simultaneously using a primal-dual approach. PPO-based safe RL methods use Lagrangian relaxation to add constraint costs to the objective.

Offline-to-online fine-tuning

A modern pattern: pre-train a policy with behavior cloning (supervised learning on demonstrations), then fine-tune with PPO. The clip mechanism is essential here — it prevents PPO from immediately forgetting the learned behaviors. The initial r_t values are all near 1, and the clip keeps updates conservative, allowing the policy to improve while retaining good priors.

PPO's enduring relevance: In 2024, PPO or its variants are still the dominant method for RLHF (Reinforcement Learning from Human Feedback) in large language models (ChatGPT, Claude, Gemini). The simplicity of L^CLIP scales well to billion-parameter models. TRPO's Fisher-vector products and conjugate gradient solves are impractical at that scale; PPO needs neither.

Why is PPO specifically used in RLHF for LLMs rather than plain gradient ascent or TRPO?

• PPO's clipping keeps each update conservative and, together with a KL penalty toward the supervised policy, limits reward hacking and catastrophic forgetting, while being implementable without the Fisher matrix that TRPO requires at billion-parameter scale
• PPO is the only algorithm that supports language models
• TRPO cannot handle discrete action spaces like token prediction

Chapter 9: Connections and What Comes Next

Policy gradient optimization doesn't exist in isolation. It's the optimization backbone that every modern RL system plugs into. Here's where Ch 12 sits in the larger picture.

What Ch 12 assumes (from Ch 11)

Every method here takes ∇U(θ) as input — the policy gradient estimate. Chapter 11 covers the likelihood-ratio trick, REINFORCE, baseline subtraction, and actor-critic gradient estimators. The quality of ∇U(θ) directly limits what optimization can achieve: a noisy gradient with high variance will cause unstable training regardless of how sophisticated the optimizer is.

What plugs into Ch 12 (Ch 13 — Actor-Critic)

Chapter 13 addresses a major weakness: the advantage estimate A_t used in L^CLIP. Using raw Monte Carlo returns has high variance. Actor-critic methods replace Monte Carlo with a learned value function (the critic), dramatically reducing variance. The critic is trained to minimize Bellman error while the actor uses PPO's clipped objective. This is the standard architecture for modern RL.

Comparison to value-based methods (Ch 7-9)

Dimension	Policy Gradient (Ch 12)	Value-Based (Ch 7-9)
What's learned	π_θ(a|s) directly	Q(s,a) or V(s)
Action spaces	Continuous or discrete	Typically discrete
Stochastic policies	Native (just sample)	Requires ε-greedy tricks
Sample efficiency	Lower (on-policy)	Higher (off-policy replay)
Stability	Better (with TRPO/PPO)	Historically fragile
Convergence theory	Local optimum guaranteed	Approximate (with function approx)

Connection to RLHF (Ch 18 — Imitation Learning)

RLHF pipelines use PPO as the optimization step, but with a reward model trained from human preferences instead of an environment reward. The policy is initialized from supervised fine-tuning (SFT), which is imitation learning (Ch 18). The clip mechanism keeps each update close to the policy that generated the samples, and a KL penalty toward the SFT policy, typically added to the reward, keeps the model from drifting too far from it. Together they act as a regularizer against reward hacking.

Paper	What it contributes
Schulman et al. 2015	TRPO — hard constraint + line search
Schulman et al. 2017	PPO — clipped surrogate
Schulman et al. 2016	GAE — generalized advantage estimation
Achiam et al. 2017	CPO — constrained policy optimization
Andrychowicz et al. 2021	What matters in on-policy RL (ablations)
Yu et al. 2025	DAPO — entropy collapse fix for GRPO
