PPO — Veanors

Chapter 0: The Problem

By 2017, policy gradient methods had a frustrating dilemma. The vanilla approach — compute the gradient, take a step — was simple but wasteful: you collect a batch of data, make one gradient update, then throw the data away. Terrible sample efficiency.

Why can't you just reuse the data for multiple gradient steps? Because after one step, the policy has changed. The data was collected under the old policy, so further gradient steps on that data are computed under the wrong distribution. Take too many steps and the policy update becomes destructively large, and performance collapses.

TRPO (Schulman et al., 2015) solved this with constrained optimization: maximize a surrogate objective subject to a KL divergence constraint. It works reliably, but it requires second-order optimization (conjugate gradient + line search), which is:

Complex to implement correctly
Incompatible with parameter sharing between policy and value networks
Incompatible with architectures using dropout or noise
Hard to extend to shared auxiliary tasks

The wish list: We want an algorithm that (1) enables multiple epochs of minibatch updates on the same data, (2) prevents destructively large policy changes, (3) uses only first-order optimization (plain SGD/Adam), and (4) is simple enough to implement in ~50 lines of code. PPO delivers all four.

Why can't you simply perform multiple gradient steps on the same batch of data with vanilla policy gradients?

After the first step, the policy has changed; further steps use data from the wrong distribution, leading to destructively large policy updates
The data becomes stale
The GPU runs out of memory

Chapter 1: The Key Insight

PPO's insight: instead of constraining how far the policy moves (TRPO's approach), clip the objective function so that moving too far gives no additional benefit.

The clipped surrogate objective is:

L^CLIP(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)]

Where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the probability ratio between new and old policies, and ε ≈ 0.2.

The intuition in one sentence: If an action got a positive advantage (it was good), increase its probability, but stop rewarding the increase once the ratio r_t(θ) exceeds 1+ε. If it got a negative advantage (it was bad), decrease its probability, but stop rewarding the decrease once the ratio falls below 1−ε. The objective gives no credit for moving an action's probability by more than a fraction ε in the favored direction. This makes L^CLIP a "pessimistic" lower bound on the unclipped surrogate objective.
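One way to make the two cases concrete is to write the per-sample objective piecewise. This is an equivalent rewriting of the min/clip expression, not a formula quoted from the source; L^CLIP_t here denotes the term inside the expectation.

```latex
L^{CLIP}_t(\theta) =
\begin{cases}
\hat{A}_t \,\min\big(r_t(\theta),\, 1+\epsilon\big) & \text{if } \hat{A}_t \ge 0,\\[4pt]
\hat{A}_t \,\max\big(r_t(\theta),\, 1-\epsilon\big) & \text{if } \hat{A}_t < 0.
\end{cases}
```

For a positive advantage only the upper bound 1+ε ever matters, and for a negative advantage only the lower bound 1−ε does.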

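This single equation gives you the practical benefits TRPO was designed for (stable training and protection against destructively large updates) and also permits multiple epochs of updates on the same data, with nothing more than a min() and a clip(). It is a heuristic rather than a guarantee of monotonic improvement. No Hessian-vector products, no conjugate gradient, no line search.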

What does the clip operation in PPO prevent?

The probability ratio r_t(θ) from moving too far from 1, capping the policy change to prevent destructively large updates
The advantage from becoming too large
The gradient from exploding

Chapter 2: Policy Gradient Review

PPO builds on the standard policy gradient estimator. Let's review the chain of ideas.

The vanilla policy gradient

ĝ = Ê_t[∇_θ ln π_θ(a_t|s_t) Â_t]

This is REINFORCE with a baseline (Â is the advantage). We construct a loss whose gradient equals this:

L^PG(θ) = Ê_t[ln π_θ(a_t|s_t) Â_t]

Problem: you can only do one gradient step on this, then the data is stale.

The importance-sampled surrogate

To reuse data from π_old, we use importance sampling. The surrogate objective from TRPO/CPI:

L^CPI(θ) = Ê_t[r_t(θ) Â_t] where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)

The probability ratio r_t(θ) corrects for the distribution mismatch. When θ = θ_old, r = 1 and the gradient of L^CPI equals the gradient of L^PG. But as θ drifts from θ_old, the importance weights can blow up, making optimization unstable.
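The link between L^CPI and L^PG at θ = θ_old can be checked by differentiating the ratio. The short derivation below uses only the definitions above; note that the agreement is between the gradients, since the objective values themselves generally differ.

```latex
\nabla_\theta r_t(\theta)\Big|_{\theta_{\text{old}}}
= \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\Big|_{\theta_{\text{old}}}
= \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\Big|_{\theta_{\text{old}}}
\quad\Longrightarrow\quad
\nabla_\theta L^{CPI}(\theta)\Big|_{\theta_{\text{old}}}
= \hat{\mathbb{E}}_t\Big[\nabla_\theta \ln \pi_\theta(a_t \mid s_t)\,\hat{A}_t\Big]\Big|_{\theta_{\text{old}}}
= \nabla_\theta L^{PG}(\theta)\Big|_{\theta_{\text{old}}}
```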

The probability ratio r_t(θ): This is the key quantity in PPO. r = 1 means the new and old policies agree. r = 2 means the new policy is twice as likely to take this action. r = 0.5 means half as likely. PPO clips this ratio to [1−ε, 1+ε] inside the objective, i.e., the objective stops rewarding changes that scale an action's probability above a factor of 1+ε or below a factor of 1−ε relative to the old policy.
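In practice the ratio is usually computed from log-probabilities, since policies typically expose ln π(a|s). A minimal sketch, assuming made-up probabilities for a single sampled action; the function name probability_ratio is illustrative and not from the source.

```python
import math


def probability_ratio(logp_new: float, logp_old: float) -> float:
    """Return r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


# Hypothetical probability of the sampled action under the old policy.
p_old = 0.25

r_doubled = probability_ratio(math.log(0.50), math.log(p_old))   # new policy twice as likely
r_halved = probability_ratio(math.log(0.125), math.log(p_old))   # new policy half as likely
r_same = probability_ratio(math.log(p_old), math.log(p_old))     # policies agree

print(r_doubled, r_halved, r_same)
```

Subtracting log-probabilities before exponentiating avoids dividing two very small numbers, which matters when an action's probability is tiny.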

What does the probability ratio r_t(θ) = π_θ(a|s) / π_{θ_old}(a|s) measure?

How much more (or less) likely the new policy is to take the same action that the old policy took: r=1 means unchanged, r=2 means twice as likely
The reward at timestep t
The advantage function

Chapter 3: The Surrogate Objective

TRPO constrains the policy update using KL divergence: maximize L^CPI(θ) subject to KL[π_old, π_θ] ≤ δ. This requires solving a constrained optimization problem at each step.

The theory actually suggests using a penalty instead:

maximize Ê_t[r_t(θ) Â_t − β KL[π_old, π_θ]]

But choosing β is hard. Too small: policy changes too much, training destabilizes. Too large: policy barely moves, learning is slow. And the right β changes throughout training as the problem's characteristics evolve.

TRPO's fundamental tension: The KL constraint requires second-order optimization (conjugate gradient on Fisher-vector products, followed by a line search). A KL penalty would allow first-order methods but requires choosing β, and the paper reports that no single fixed value performs well across problems. PPO sidesteps both problems with a different mechanism entirely: clipping the objective so that large policy changes produce no gradient signal.

Why doesn't a fixed KL penalty coefficient β work well in practice?

The right β varies across problems and changes within a single problem as learning progresses, so no single fixed value works
KL divergence is too expensive to compute
The penalty term is always zero

Chapter 4: The Clipped Objective

This is the heart of PPO. The clipped surrogate objective:

L^CLIP(θ) = Ê_t[min(r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t)]

Unpacking this equation

There are two terms inside the min():

r_t(θ) Â_t — the unclipped surrogate (same as TRPO's L^CPI)
clip(r_t(θ), 1−ε, 1+ε) Â_t — the clipped version, where r_t is forced into [1−ε, 1+ε]

Taking the min of these two creates a pessimistic lower bound. Let's trace through the two cases:

Case 1: Positive advantage (Â > 0) — the action was GOOD

We want to increase r (make this action more likely). But clip(r, 1−ε, 1+ε) caps r at 1+ε. Once r reaches 1+ε, the clipped term stops growing. The min then selects the clipped term (since it's smaller), and the gradient becomes zero. Result: no incentive to increase r beyond 1+ε.

Case 2: Negative advantage (Â < 0) — the action was BAD

We want to decrease r (make this action less likely). But clip floors r at 1−ε. Once r drops below 1−ε, the clipped term (1−ε) Â stops changing, while the unclipped term r Â keeps getting less negative. The min picks whichever is worse (more negative), which is now the clipped term. Since the clipped term is constant in θ for r < 1−ε, the gradient vanishes. Result: no incentive to decrease r below 1−ε. The clipping is one-sided: if r moves the wrong way (r > 1+ε with Â < 0, or r < 1−ε with Â > 0), the min selects the unclipped term and the gradient still pushes r back.

The pessimistic bound: L^CLIP is a lower bound on L^CPI. It matches L^CPI near r=1 (where the optimization starts) but flattens out as r moves away from 1 in the direction the advantage favors. Maximizing this lower bound only credits improvement that the surrogate supports close to the old policy, and the flat regions prevent over-optimization. It's like training wheels that let you pedal freely but prevent you from going too fast.
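The two cases can be checked numerically. The sketch below evaluates the per-sample clipped objective and its derivative with respect to r at a few ratios; the function names l_clip and dl_clip_dr are illustrative, and the derivative at the exact kink (r equal to 1±ε) is arbitrarily taken from the unclipped side.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(x, hi))


def l_clip(r: float, adv: float, eps: float = 0.2) -> float:
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)


def dl_clip_dr(r: float, adv: float, eps: float = 0.2) -> float:
    """Derivative of l_clip with respect to r (zero on the flat regions)."""
    if adv > 0 and r > 1.0 + eps:
        return 0.0  # good action, ratio already above the upper bound
    if adv < 0 and r < 1.0 - eps:
        return 0.0  # bad action, ratio already below the lower bound
    return adv      # otherwise the unclipped term r*A is active


for adv in (+1.0, -1.0):
    for r in (0.5, 1.0, 1.5):
        print(f"A={adv:+.0f} r={r:.1f} L={l_clip(r, adv):+.2f} dL/dr={dl_clip_dr(r, adv):+.1f}")
```

Note that with A = −1 and r = 1.5 the derivative is still nonzero: the ratio moved the wrong way, so the objective keeps pulling it back.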

Figure: The clipped objective. L^CLIP (teal) against the unclipped surrogate (gray dashed) for positive and negative advantage, shown with ε = 0.20. The flat region has zero gradient, which prevents large policy changes.

When the advantage is positive (A > 0, the action was good), what happens once the probability ratio r exceeds 1+ε?

The gradient becomes zero: there's no incentive to increase the action's probability further, preventing an excessively large policy change
The objective increases linearly
Training stops

Chapter 5: Why Clipping Works

The clipped objective has several elegant properties that make it work so well in practice.

Property 1: It's a lower bound

L^CLIP ≤ L^CPI everywhere. Maximizing a lower bound means the optimizer cannot profit from regions where the unclipped surrogate is overly optimistic. This is the "pessimistic estimate" the paper refers to: you only get credit for improvements you can be confident about.

Property 2: First-order match

At r = 1 (where optimization starts), L^CLIP = L^CPI and their gradients match. So the first gradient step is the same one you would take on TRPO's surrogate objective. The clipping only kicks in as the policy moves away from θ_old.

Property 3: Automatic trust region

The clipping creates a "soft trust region" without any explicit KL constraint. Samples where r ∈ [1−ε, 1+ε] receive normal gradients. Samples where r has drifted outside this range in the direction the advantage favors receive zero gradient. This discourages, but does not strictly forbid, large policy changes: the trust region is implicit in the objective shape.

Property 4: Multiple epochs are safe

Because the gradient vanishes when r moves too far from 1, you can run multiple epochs of SGD on the same data. Early epochs move r away from 1 (learning); later epochs find r already clipped and receive zero gradient (stability). The algorithm self-regulates.

Worked example: Suppose ε = 0.2, action a in state s has Â = +3 (very good action). Initial r = 1. Epoch 1: gradient pushes r toward 1.1. Epoch 2: pushes r to 1.15. Epoch 3: r passes 1.2 = 1+ε, gradient drops to zero. Epochs 4-10: this sample contributes no further push on r. The clipping stops this sample's update at roughly the right point, regardless of how many epochs you run.
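The worked example can be imitated with a toy simulation that treats r itself as the quantity being updated by gradient ascent. This is a deliberate simplification (a real policy updates θ, and other samples also move r), and the step size 0.03 is a hypothetical choice, so the exact trajectory differs from the illustrative numbers above.

```python
def clipped_grad_wrt_r(r: float, adv: float, eps: float = 0.2) -> float:
    """Gradient of the per-sample clipped objective with respect to r."""
    if (adv > 0 and r > 1.0 + eps) or (adv < 0 and r < 1.0 - eps):
        return 0.0
    return adv


def run_epochs(adv: float = 3.0, eps: float = 0.2,
               step_size: float = 0.03, epochs: int = 10) -> list:
    """Gradient ascent directly on r for one sample; returns r after each epoch."""
    r = 1.0
    history = [r]
    for _ in range(epochs):
        r += step_size * clipped_grad_wrt_r(r, adv, eps)
        history.append(r)
    return history


history = run_epochs()
for epoch, r in enumerate(history):
    print(f"epoch {epoch:2d}: r = {r:.2f}")
```

With these settings r climbs in steps of 0.09, crosses 1+ε on the third epoch, and then stays put. The final value sits slightly above 1.2, which illustrates that clipping removes the incentive to move further but does not snap r back to the boundary.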

Why can PPO safely perform multiple epochs of SGD on the same batch of data?

The gradient vanishes when r moves outside [1−ε, 1+ε], so further epochs can't make destructively large changes; the algorithm self-regulates
The data is refreshed each epoch
The learning rate decays automatically

Chapter 6: The Full Algorithm

PPO combines the clipped policy objective with a value function loss and entropy bonus:

L^total(θ) = Ê_t[L^CLIP_t − c₁ L^VF_t + c₂ S[π_θ](s_t)]

Where:

L^CLIP — the clipped policy surrogate (maximize)
L^VF = (V_θ(s_t) − V^targ_t)² — value function MSE loss (minimize)
S[π] — entropy bonus for exploration (maximize)
c₁, c₂ — balancing coefficients
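As a rough sketch, the combined objective for a batch can be computed as below. The coefficient values c1 = 0.5 and c2 = 0.01 are assumptions chosen for illustration (the text above does not fix them), and the function names are hypothetical. The inputs are plain lists of per-timestep quantities.

```python
import math


def categorical_entropy(probs) -> float:
    """Entropy S[pi](s) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_total_objective(ratios, advantages, values, value_targets, entropies,
                        eps: float = 0.2, c1: float = 0.5, c2: float = 0.01) -> float:
    """Batch mean of L_CLIP - c1 * L_VF + c2 * S (a quantity to maximize)."""
    total = 0.0
    for r, adv, v, v_targ, ent in zip(ratios, advantages, values,
                                      value_targets, entropies):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        l_clip = min(r * adv, clipped_r * adv)   # clipped policy surrogate
        l_vf = (v - v_targ) ** 2                 # squared value error
        total += l_clip - c1 * l_vf + c2 * ent
    return total / len(ratios)


# One made-up timestep: ratio 1.5, advantage +2, value error of 1, uniform 2-action policy.
objective = ppo_total_objective(
    ratios=[1.5], advantages=[2.0], values=[1.0], value_targets=[0.0],
    entropies=[categorical_entropy([0.5, 0.5])],
)
print(objective)
```

Since optimizers usually minimize, an implementation would typically minimize the negative of this quantity.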

Advantage estimation

PPO uses Generalized Advantage Estimation (GAE):

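Â_t = δ_t + (γλ)δ_{t+1} + (γλ)²δ_{t+2} + ...

δ_t = r_t + γV(s_{t+1}) − V(s_t)

Here r_t is the reward at timestep t, not the probability ratio r_t(θ). With λ = 1, this is the discounted return over the rest of the collected segment, bootstrapped with the value estimate at its end, minus V(s_t). With λ < 1, it's an exponentially weighted average of n-step estimates: lower λ means more bias and less variance, higher λ means less bias and more variance.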
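The sum is usually computed with a single backward pass over the trajectory, using the recursion Â_t = δ_t + γλ Â_{t+1}. A minimal sketch, assuming one uninterrupted segment with no episode termination inside it; the function name compute_gae and the argument last_value (the bootstrap value for the state after the final step) are illustrative.

```python
def compute_gae(rewards, values, last_value, gamma: float = 0.99, lam: float = 0.95):
    """Generalized Advantage Estimation over one segment.

    rewards[t] is the reward at step t, values[t] is V(s_t), and
    last_value is V of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae                      # backward recursion
        advantages[t] = gae
        next_value = values[t]
    return advantages


# Toy segment: three steps of reward 1 and a value function that predicts 0 everywhere.
rewards, values = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
full_return = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
one_step = compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
print(full_return, one_step)
```

With λ = 1 each advantage is the remaining return minus V(s_t); with λ = 0 it collapses to the one-step TD residual.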

The training loop

Collect: N parallel actors run π_{θ_old} for T timesteps each → NT transitions
Compute: advantages Â_t using GAE and the current value function
Optimize: K epochs of minibatch SGD on L^total (typically K=3-10, minibatch size M ≤ NT)
Update: θ_old ← θ, repeat
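The four stages can be seen end to end on a toy problem. The sketch below runs a stripped-down PPO-style loop on a hypothetical two-armed bandit with a softmax policy. It is not the full algorithm: it uses full-batch updates instead of minibatches, a batch-mean baseline instead of GAE and a value network, no value loss or entropy bonus, and a hand-derived gradient instead of autodiff. All names and hyperparameter values are assumptions for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def ppo_bandit(iterations=50, batch_size=64, epochs=4, eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]  # hypothetical environment: arm 1 is better
    theta = [0.0, 0.0]        # policy logits
    for _ in range(iterations):
        # Collect: sample a batch of actions from the current (old) policy.
        pi_old = softmax(theta)
        actions = [rng.choices([0, 1], weights=pi_old)[0] for _ in range(batch_size)]
        rewards = [arm_rewards[a] for a in actions]
        # Compute: advantage = reward minus batch mean (stand-in for GAE).
        baseline = sum(rewards) / batch_size
        advs = [r - baseline for r in rewards]
        # Optimize: several epochs of gradient ascent on the clipped objective.
        for _ in range(epochs):
            pi = softmax(theta)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advs):
                ratio = pi[a] / pi_old[a]
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue  # clipped: this sample contributes zero gradient
                for k in range(2):
                    indicator = 1.0 if k == a else 0.0
                    # d(ratio)/d(theta_k) = ratio * (1[k == a] - pi[k]) for a softmax policy
                    grad[k] += adv * ratio * (indicator - pi[k]) / batch_size
            theta = [t + lr * g for t, g in zip(theta, grad)]
        # Update: the next iteration recomputes pi_old from the new theta.
    return softmax(theta)


final_probs = ppo_bandit()
print(final_probs)
```

Within each iteration the first epoch or two move the policy, after which most samples are clipped and the remaining epochs do little, which is the self-regulating behavior described earlier.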

What is the role of the entropy bonus S[π] in PPO's objective?

It encourages exploration by preventing the policy from becoming too deterministic too quickly
It regularizes the value function
It speeds up training

Chapter 7: Experiments

Continuous control (MuJoCo)

PPO with ε = 0.2 is compared against TRPO, A2C, A2C with trust region, CEM, and vanilla PG with adaptive stepsize on 7 MuJoCo locomotion tasks, and it outperforms them on almost all of the tasks. In the comparison of surrogate objectives, the clipped objective with ε = 0.2 scores 0.82 (normalized), vs. at best 0.74 for adaptive KL and 0.72 for fixed KL.

Hyperparameter comparison

Across all objective variants tested:

No clipping/penalty: −0.39 (catastrophically bad — diverges on HalfCheetah)
Clipping ε=0.2: 0.82 (best)
Adaptive KL: 0.68-0.74 (decent but worse than clipping)
Fixed KL: 0.62-0.72 (sensitive to β)

3D humanoid showcase

PPO trains a 3D humanoid to run, steer toward targets, and get up after being knocked down by cubes. The policy network outputs continuous actions for all joints. This demonstrates PPO's ability to handle high-dimensional continuous action spaces.

Atari

PPO matches ACER's performance on Atari while being much simpler. It significantly outperforms A2C in sample efficiency on most games.

Figure: PPO vs baselines on MuJoCo. Normalized scores across 7 continuous control tasks. PPO-Clip outperforms all alternatives.

The "no clipping" disaster: The normalized score is −0.39 without clipping, worse than the initial random policy. The average is dragged below zero by HalfCheetah, where unconstrained policy updates lead to a very negative score. This is exactly the "destructively large update" problem that PPO's clipping prevents.

What happens when PPO's surrogate objective is optimized WITHOUT clipping or any penalty?

Performance collapses: the policy diverges, scoring worse than random on some tasks (normalized score −0.39)
It works fine but is slightly slower
It performs the same as with clipping

Chapter 8: Adaptive KL Alternative

PPO also proposes an alternative to clipping: an adaptive KL penalty with automatic coefficient tuning.

L^KLPEN(θ) = Ê_t[r_t(θ) Â_t − β KL[π_{θ_old}, π_θ]]

After each update, measure the actual KL divergence d = Ê[KL]. Then adjust β:

If d < d_targ/1.5: β ← β/2 (too conservative, loosen up)
If d > d_targ×1.5: β ← β×2 (too aggressive, tighten up)

This adaptive scheme automatically finds the right β for each problem and training phase. The constants 1.5 and 2 are heuristic but the algorithm isn't sensitive to them.
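The update rule is small enough to write out directly. A minimal sketch, assuming discrete action distributions for the KL measurement; the function names and the example target d_targ = 0.01 are illustrative choices, not prescribed by the text.

```python
import math


def categorical_kl(p_old, p_new) -> float:
    """KL[p_old, p_new] for two discrete distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def update_beta(beta: float, measured_kl: float, kl_target: float) -> float:
    """Adapt the KL penalty coefficient after a policy update."""
    if measured_kl < kl_target / 1.5:
        return beta / 2.0   # policy moved too little: loosen the penalty
    if measured_kl > kl_target * 1.5:
        return beta * 2.0   # policy moved too much: tighten the penalty
    return beta             # within the tolerated band: leave beta unchanged


# Example: measure how far a hypothetical update moved the policy, then adapt beta.
d = categorical_kl([0.5, 0.5], [0.7, 0.3])
new_beta = update_beta(beta=1.0, measured_kl=d, kl_target=0.01)
print(d, new_beta)
```

The adjusted β is then used for the next policy update, so the coefficient tracks the target divergence over the course of training.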

Clipping vs KL penalty: The paper found clipping slightly outperforms the adaptive KL penalty (0.82 vs 0.74). However, the adaptive KL approach has its own advantages: it directly targets a specific KL divergence, which can be useful for theoretical analysis. In concurrent work, DeepMind used the adaptive KL variant for humanoid locomotion. Both are valid PPO implementations.

How does PPO's adaptive KL penalty automatically adjust β?

After each update, if actual KL is too small, halve β (allow bigger steps). If too large, double β (force smaller steps). This finds the right constraint strength automatically.
β follows a fixed schedule
β is set by cross-validation

Chapter 9: Connections

What PPO built on

REINFORCE (Williams, 1992): The foundational policy gradient. PPO uses the same log-probability gradient but with importance sampling and clipping for multi-epoch updates.

Policy Gradient Theorem (Sutton et al., 1999): Proved that policy gradients work with function approximation. PPO is a practical algorithm built on this theory.

TRPO (Schulman et al., 2015): The direct predecessor. PPO replaces TRPO's second-order constrained optimization with a first-order clipped objective that is simpler and, in the paper's experiments, performs at least as well.

GAE (Schulman et al., 2015): Generalized Advantage Estimation — the bias-variance tradeoff for advantage computation that PPO uses internally.

What PPO enabled

OpenAI Five (2018-2019): PPO trained Dota 2 agents that went on to beat the reigning world champions in 2019, demonstrating that the algorithm scales to extremely complex, long-horizon tasks.

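RLHF (Ouyang et al., 2022): PPO is the RL algorithm used to align language models with human preferences in InstructGPT/ChatGPT. The reward model provides the reward signal from which advantages are estimated. PPO's clipping limits how far each update moves the LLM policy from the policy that generated the samples, while a separate KL penalty against the supervised fine-tuned model keeps it from drifting too far from the base model.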

GRPO / DeepSeek-R1 (2024-25): Group Relative Policy Optimization — simplifies PPO further by using group-based advantage estimation, removing the need for a separate value network.

PPO's legacy: PPO became the default algorithm for deep RL, used in robotics (Diffusion Policy fine-tuning), game playing (OpenAI Five, hide-and-seek), language model alignment (ChatGPT, Claude), and scientific discovery. Its simplicity (the core is ~50 lines of code) and robustness (it often works well with near-default hyperparameters) made it the "ImageNet moment" of RL algorithms.

Cheat sheet

Core equation

L^CLIP = E[min(r Â, clip(r, 1−ε, 1+ε) Â)]

Key hyperparams

ε = 0.2, K = 3-10 epochs, γ = 0.99, λ = 0.95

Mechanism

Clipping creates an implicit trust region: zero gradient once r leaves [1−ε, 1+ε] in the direction the advantage favors

Advantage

First-order only (Adam), multi-epoch, simple, robust, general

Impact

Default RL algorithm: OpenAI Five, RLHF/ChatGPT, robotics

How is PPO used in RLHF for language model alignment?

A reward model scores generated text, and PPO maximizes this reward while clipping and a KL penalty keep the LLM from diverging too far from the base model
PPO is used for tokenization
PPO trains the reward model

Proximal Policy Optimization