Schulman, Wolski, Dhariwal, Radford, Klimov — 2017

Proximal Policy Optimization

The workhorse of modern RL — a simple clipped surrogate objective that enables multiple epochs of minibatch updates while preventing destructively large policy changes.

Prerequisites: Policy gradients + Advantage functions

Chapter 0: The Problem

By 2017, policy gradient methods had a frustrating dilemma. The vanilla approach — compute the gradient, take a step — was simple but wasteful: you collect a batch of data, make one gradient update, then throw the data away. Terrible sample efficiency.

Why can't you just reuse the data for multiple gradient steps? Because after one step, the policy has changed. The data was collected under the old policy, so further gradient steps on that data are computed under the wrong distribution. Take too many steps and the cumulative policy update can become catastrophically large, and performance collapses.

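TRPO (Schulman et al., 2015) solved this with constrained optimization: maximize a surrogate objective subject to a KL divergence constraint. It works reliably, but it requires second-order optimization (conjugate gradient + line search), which makes it relatively complicated to implement. The PPO paper also notes that TRPO is not compatible with architectures that include noise (such as dropout) or parameter sharing between the policy and value function.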

The wish list: We want an algorithm that (1) enables multiple epochs of minibatch updates on the same data, (2) prevents destructively large policy changes, (3) uses only first-order optimization (plain SGD/Adam), and (4) is simple enough to implement in ~50 lines of code. PPO delivers all four.
Why can't you simply perform multiple gradient steps on the same batch of data with vanilla policy gradients?

Chapter 1: The Key Insight

PPO's insight: instead of constraining how far the policy moves (TRPO's approach), clip the objective function so that moving too far gives no additional benefit.

The clipped surrogate objective is:

LCLIP(θ) = Êt[min(rt(θ) Ât, clip(rt(θ), 1−ε, 1+ε) Ât)]

Where rt(θ) = πθ(at|st) / πθold(at|st) is the probability ratio between new and old policies, and ε ≈ 0.2.
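To make the formula concrete, here is a minimal sketch of the clipped surrogate in plain Python. The function names (clip, clipped_surrogate) and the three sample values are illustrative assumptions, not code from the paper; a real implementation would operate on tensors inside an autodiff framework.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = [
        min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


# Three hypothetical samples: ratio inside, above, and below the clip range.
ratios = [1.0, 1.5, 0.5]
advantages = [2.0, 1.0, -1.0]

# Per-sample terms are 2.0 (untouched), 1.2 (capped at 1 + eps), -0.8 (capped at 1 - eps).
print(clipped_surrogate(ratios, advantages))
```

The second and third samples show the cap at work: pushing their ratios further from 1 would not change their contribution.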

The intuition: If an action got a positive advantage (it was good), increase its probability, but stop rewarding the increase once the ratio exceeds 1+ε. If it got a negative advantage (it was bad), decrease its probability, but stop rewarding the decrease once the ratio falls below 1−ε. The objective never gives credit for moving the ratio more than ε away from 1 in the favorable direction. This makes it a "pessimistic" lower bound on the unclipped surrogate.

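This single equation targets what TRPO was designed for (stable training without destructively large updates) and also makes multiple epochs of updates on the same batch practical, using nothing more than a min() and a clip(). No Hessian-vector products, no conjugate gradient, no line search. Unlike the theory behind TRPO, though, clipping is a heuristic and does not come with a monotonic improvement guarantee.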

What does the clip operation in PPO prevent?

Chapter 2: Policy Gradient Review

PPO builds on the standard policy gradient estimator. Let's review the chain of ideas.

The vanilla policy gradient

ĝ = Êt[∇θ ln πθ(at|st) Ât]

This is REINFORCE with a baseline (Â is the advantage). We construct a loss whose gradient equals this:

LPG(θ) = Êt[ln πθ(at|st) Ât]

Problem: you can only do one gradient step on this, then the data is stale.

The importance-sampled surrogate

To reuse data from πold, we use importance sampling. The surrogate objective from TRPO/CPI:

LCPI(θ) = Êt[rt(θ) Ât]   where   rt(θ) = πθ(at|st) / πθold(at|st)

The probability ratio rt(θ) corrects for the distribution mismatch. When θ = θold, r = 1 and the gradient of LCPI equals the gradient of LPG, so the two objectives agree to first order. But as θ drifts from θold, the importance weights can blow up, making optimization unstable.
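As a small illustration of the importance-sampled surrogate, the sketch below computes the ratio from log-probabilities (the usual numerically safer route) and evaluates the unclipped surrogate for an unchanged and a changed policy. The helper names and the probabilities are assumptions chosen for the example.

```python
import math


def prob_ratio(logp_new, logp_old):
    """r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


def cpi_surrogate(logps_new, logps_old, advantages):
    """Batch average of r_t * A_t (the unclipped surrogate)."""
    terms = [
        prob_ratio(n, o) * a
        for n, o, a in zip(logps_new, logps_old, advantages)
    ]
    return sum(terms) / len(terms)


# Hypothetical old-policy probabilities of the two sampled actions.
logps_old = [math.log(0.5), math.log(0.25)]
advantages = [1.0, -2.0]

# Unchanged policy: every ratio is 1, so the surrogate is the mean advantage.
same = cpi_surrogate(logps_old, logps_old, advantages)

# New policy doubles the second action's probability (0.25 -> 0.5), so r = 2 there.
logps_new = [math.log(0.5), math.log(0.5)]
moved = cpi_surrogate(logps_new, logps_old, advantages)

print(same, moved)
```

Nothing in this objective stops the ratio from growing: the bad action's weight simply scales with r, which is the instability that clipping later removes.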

The probability ratio rt(θ): This is the key quantity in PPO. r = 1 means the new and old policies agree. r = 2 means the new policy is twice as likely to take this action. r = 0.5 means half as likely. PPO clips this ratio to [1−ε, 1+ε] inside the objective, i.e., the new policy gets no credit for raising an action's probability above (1+ε) times, or lowering it below (1−ε) times, its probability under the old policy.
What does the probability ratio rt(θ) = πθ(a|s) / πθold(a|s) measure?

Chapter 3: The Surrogate Objective

TRPO constrains the policy update using KL divergence: maximize LCPI(θ) subject to KL[πold, πθ] ≤ δ. This requires solving a constrained optimization problem at each step.

The theory actually suggests using a penalty instead:

maximize Êt[rt(θ) Ât − β KL[πold, πθ]]

But choosing β is hard. Too small: policy changes too much, training destabilizes. Too large: policy barely moves, learning is slow. And the right β changes throughout training as the problem's characteristics evolve.
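The sensitivity to β can be seen on a single sample. The sketch below assumes a discrete (categorical) policy and hypothetical helper names; it evaluates the penalized term r Â − β KL for the same policy change under two values of β.

```python
import math


def kl_divergence(p_old, p_new):
    """KL[p_old, p_new] for two categorical distributions (p_new assumed > 0)."""
    return sum(
        po * math.log(po / pn)
        for po, pn in zip(p_old, p_new)
        if po > 0.0
    )


def kl_penalized_term(ratio, adv, p_old, p_new, beta):
    """One sample's contribution: r * A - beta * KL[p_old, p_new]."""
    return ratio * adv - beta * kl_divergence(p_old, p_new)


p_old = [0.5, 0.5]
p_new = [0.75, 0.25]          # action 0 became more likely
ratio = p_new[0] / p_old[0]   # r = 1.5 for the sampled action 0
adv = 1.0

small_beta = kl_penalized_term(ratio, adv, p_old, p_new, beta=1.0)
large_beta = kl_penalized_term(ratio, adv, p_old, p_new, beta=10.0)
print(small_beta, large_beta)
```

With β = 1 the change looks clearly worthwhile; with β = 10 the penalty cancels almost all of the gain. The same policy change is encouraged or suppressed depending entirely on a coefficient that has to be tuned per problem.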

TRPO's fundamental tension: The KL constraint requires second-order optimization (conjugate gradient on Fisher-vector products to approximate the natural gradient step, followed by a line search). A KL penalty would allow first-order methods but requires tuning β, and the paper reports that it is hard to choose a single fixed value that performs well across problems. PPO sidesteps both problems with a different mechanism entirely: clipping the objective so that large policy changes produce no gradient signal.
Why doesn't a fixed KL penalty coefficient β work well in practice?

Chapter 4: The Clipped Objective

This is the heart of PPO. The clipped surrogate objective:

LCLIP(θ) = Êt[min(rt(θ) Ât, clip(rt(θ), 1−ε, 1+ε) Ât)]

Unpacking this equation

There are two terms inside the min():

  1. rt(θ) Ât — the unclipped surrogate (same as TRPO's LCPI)
  2. clip(rt(θ), 1−ε, 1+ε) Ât — the clipped version, where rt is forced into [1−ε, 1+ε]

Taking the min of these two creates a pessimistic lower bound. Let's trace through the two cases:

Case 1: Positive advantage (Â > 0) — the action was GOOD

We want to increase r (make this action more likely). But clip(r, 1−ε, 1+ε) caps r at 1+ε. Once r reaches 1+ε, the clipped term stops growing. The min then selects the clipped term (since it's smaller), and the gradient becomes zero. Result: no incentive to increase r beyond 1+ε.

Case 2: Negative advantage (Â < 0) — the action was BAD

We want to decrease r (make this action less likely). Because Â is negative, a smaller r makes the term r Â less negative. Once r drops below 1−ε, the unclipped term r Â is less negative than the clipped term (1−ε) Â, so the min selects the clipped term. That term is constant in r, so the gradient becomes zero. Result: no incentive to decrease r below 1−ε. For example, with Â = −1 and ε = 0.2, r = 0.5 gives an unclipped term of −0.5 and a clipped term of −0.8; the min is −0.8 and does not change as r shrinks further. Note the asymmetry: if r moves the wrong way (r > 1+ε with Â < 0, or r < 1−ε with Â > 0), the min selects the unclipped term and the gradient stays nonzero, so the policy can still correct the mistake.
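The two cases can be checked mechanically. Below is a minimal sketch of the derivative of the per-sample clipped term with respect to r; the function name dclip_dr is an assumption for illustration, and at the exact boundary (where both terms coincide) it arbitrarily returns the unclipped slope.

```python
def dclip_dr(r, adv, eps=0.2):
    """Derivative of min(r * A, clip(r, 1 - eps, 1 + eps) * A) with respect to r."""
    clipped_r = max(1.0 - eps, min(1.0 + eps, r))
    unclipped_term = r * adv
    clipped_term = clipped_r * adv
    if unclipped_term <= clipped_term:
        # min() selects the unclipped term (or both coincide); its slope is A.
        return adv
    # min() selects the clipped term, which is constant in r.
    return 0.0


cases = [
    ("A > 0, r above 1 + eps", 1.5, 1.0),
    ("A > 0, r below 1 - eps", 0.5, 1.0),
    ("A < 0, r below 1 - eps", 0.5, -1.0),
    ("A < 0, r above 1 + eps", 1.5, -1.0),
    ("inside the clip range", 1.0, 3.0),
]
for label, r, adv in cases:
    print(label, "->", dclip_dr(r, adv))
```

Only the first and third cases give zero: the gradient vanishes when r has already moved past the clip range in the direction the advantage favors, and stays alive when r has moved the wrong way.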

The pessimistic bound: LCLIP is a lower bound on LCPI. It matches LCPI near r=1 (where the optimization starts) but flattens out once r moves past 1±ε in the direction the advantage favors. Maximizing this lower bound only rewards improvement that is achievable close to the old policy, and the flat regions remove the incentive to over-optimize. It's like training wheels that let you pedal freely but prevent you from going too fast.
Figure (interactive): The Clipped Objective. LCLIP (teal) plotted against the unclipped surrogate (gray dashed) as a function of r, for positive or negative advantage, with an adjustable clipping range ε (shown at 0.20). The flat region of LCLIP has zero gradient, which removes the incentive for large policy changes.
When the advantage is positive (A > 0, the action was good), what happens once the probability ratio r exceeds 1+ε?

Chapter 5: Why Clipping Works

The clipped objective has several elegant properties that make it work so well in practice.

Property 1: It's a lower bound

LCLIP ≤ LCPI everywhere, because a min can never exceed its first argument. This is the "pessimistic estimate" the paper refers to: a change in the ratio is ignored when it would make the objective look better, and counted when it makes the objective worse.

Property 2: First-order match

At r = 1 (where optimization starts), LCLIP = LCPI and their gradients match. So the first gradient step is identical to a step on the unclipped surrogate. The clipping only kicks in as the policy moves away from θold.
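Both properties so far are easy to verify numerically. The sketch below is an illustrative check, not a proof: it scans a grid of ratios and a few hypothetical advantage values, and compares finite-difference slopes at r = 1.

```python
def l_cpi(r, adv):
    """Unclipped surrogate for one sample."""
    return r * adv


def l_clip(r, adv, eps=0.2):
    """Clipped surrogate for one sample."""
    clipped_r = max(1.0 - eps, min(1.0 + eps, r))
    return min(r * adv, clipped_r * adv)


def slope_at_one(f, adv, h=1e-4):
    """Central finite-difference slope of f(r, adv) at r = 1."""
    return (f(1.0 + h, adv) - f(1.0 - h, adv)) / (2.0 * h)


ratios = [i / 100.0 for i in range(301)]  # 0.00, 0.01, ..., 3.00
advantages = [-2.0, -0.5, 0.5, 2.0]

# Property 1: the clipped value never exceeds the unclipped value.
is_lower_bound = all(
    l_clip(r, a) <= l_cpi(r, a) for r in ratios for a in advantages
)

# Property 2: both objectives have the same slope at r = 1.
slopes_match = all(
    abs(slope_at_one(l_clip, a) - slope_at_one(l_cpi, a)) < 1e-9
    for a in advantages
)
print(is_lower_bound, slopes_match)
```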

Property 3: Automatic trust region

The clipping creates a "soft trust region" without any explicit KL constraint. Samples where r ∈ [1−ε, 1+ε] receive normal gradients. Samples where r has moved outside this range in the direction the advantage favors receive zero gradient. This discourages the policy from changing too much, although it does not strictly bound the change: the trust region is implicit in the objective shape.

Property 4: Multiple epochs are safe

Because the gradient vanishes when r moves too far from 1, you can run multiple epochs of SGD on the same data. Early epochs move r away from 1 (learning); later epochs find r already clipped and receive zero gradient (stability). The algorithm self-regulates.

Worked example: Suppose ε = 0.2 and action a in state s has Â = +3 (very good action). Initial r = 1. Epoch 1: gradient pushes r toward 1.1. Epoch 2: pushes r to 1.15. Epoch 3: r passes 1.2 = 1+ε, and this sample's gradient drops to zero. Epochs 4-10: this (s, a) pair contributes no further gradient. The numbers are illustrative, and r can still drift a little through updates driven by other samples that share parameters, but the clipping removes this sample's own push no matter how many epochs you run.
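The self-regulating behavior can be reproduced on a toy problem. The sketch below is a deliberately tiny setup with assumptions throughout: a two-action policy with one parameter θ (probability of the sampled action is sigmoid(θ)), a single sample with Â = +3, plain gradient ascent, and an arbitrary learning rate. It is not how PPO is run in practice, only a way to watch the ratio saturate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run_epochs(num_epochs, adv=3.0, eps=0.2, lr=0.05, use_clip=True):
    """Gradient ascent on one sample; returns the ratio r after each epoch."""
    theta = 0.0
    p_old = sigmoid(theta)  # probability of the sampled action under the old policy
    history = []
    for _ in range(num_epochs):
        p = sigmoid(theta)
        r = p / p_old
        # Slope of the objective in r: zero once r has moved past the clip
        # range in the direction the advantage favors, otherwise A.
        if use_clip and adv > 0 and r > 1.0 + eps:
            dobj_dr = 0.0
        elif use_clip and adv < 0 and r < 1.0 - eps:
            dobj_dr = 0.0
        else:
            dobj_dr = adv
        dr_dtheta = p * (1.0 - p) / p_old  # chain rule through the sigmoid
        theta += lr * dobj_dr * dr_dtheta
        history.append(sigmoid(theta) / p_old)
    return history


clipped = run_epochs(10, use_clip=True)
unclipped = run_epochs(10, use_clip=False)
print([round(r, 3) for r in clipped])
print([round(r, 3) for r in unclipped])
```

In this setup the clipped run overshoots 1+ε by one step and then stops moving, while the unclipped run keeps climbing every epoch. The small overshoot is expected: the gradient only vanishes once r is already outside the range.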
Why can PPO safely perform multiple epochs of SGD on the same batch of data?

Chapter 6: The Full Algorithm

PPO combines the clipped policy objective with a value function loss and entropy bonus:

Ltotal(θ) = Êt[LCLIPt − c1 LVFt + c2 S[πθ](st)]

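Where LCLIPt is the clipped surrogate, LVFt = (Vθ(st) − Vtarg)² is a squared-error value function loss (Vtarg is the value target at timestep t), S[πθ](st) is the entropy of the policy at state st, and c1, c2 are coefficients. The value loss belongs in the same objective when the policy and value function share parameters.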
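A minimal sketch of how the three terms combine, assuming a discrete action space so the entropy can be computed directly from the action probabilities. The function names, argument layout, and the coefficient values c1 = 1.0 and c2 = 0.01 are placeholders for illustration; suitable values depend on the task.

```python
import math


def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_objective(ratios, advantages, values, value_targets, action_dists,
                    eps=0.2, c1=1.0, c2=0.01):
    """Batch average of L_CLIP - c1 * L_VF + c2 * entropy (to be maximized)."""
    total = 0.0
    for r, adv, v, v_targ, dist in zip(
        ratios, advantages, values, value_targets, action_dists
    ):
        clipped_r = max(1.0 - eps, min(1.0 + eps, r))
        l_clip = min(r * adv, clipped_r * adv)   # clipped policy term
        l_vf = (v - v_targ) ** 2                 # squared-error value loss
        total += l_clip - c1 * l_vf + c2 * entropy(dist)
    return total / len(ratios)


# One hypothetical sample: ratio 1, advantage 1, value error 1, uniform 2-action policy.
print(total_objective([1.0], [1.0], [0.0], [1.0], [[0.5, 0.5]]))
```

Because the combined objective is maximized, the value loss enters with a minus sign and the entropy with a plus sign; frameworks that minimize would negate the whole expression.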

Advantage estimation

PPO uses Generalized Advantage Estimation (GAE):

Ât = δt + (γλ)δt+1 + (γλ)²δt+2 + ...
δt = rt + γV(st+1) − V(st)

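Here rt in δt is the reward at timestep t, not the probability ratio rt(θ). With λ = 1, Ât is the discounted sum of rewards over the rest of the trajectory (plus a bootstrapped value at the end of a truncated rollout) minus V(st). With λ = 0, it reduces to the one-step TD error δt. Values in between trade bias against variance: lower λ means more bias but less variance, higher λ means less bias but more variance.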
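The GAE sum is usually computed with a backward recursion, Ât = δt + γλ Ât+1. The sketch below assumes a single rollout segment with no episode terminations inside it and a bootstrapped value for the state after the last step; compute_gae and its argument names are illustrative, not from the paper.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout segment.

    rewards[t] is the reward at step t, values[t] is V(s_t), and
    last_value is V(s_T) for the state reached after the final step.
    Assumes no episode boundary inside the segment.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running              # backward recursion
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 2.0]
values = [0.5, 1.0]
# TD errors are delta_0 = 1.0 and delta_1 = 2.0 with gamma = 0.5 and last_value = 2.0.
print(compute_gae(rewards, values, last_value=2.0, gamma=0.5, lam=0.5))
print(compute_gae(rewards, values, last_value=2.0, gamma=0.5, lam=0.0))
```

With lam=0.0 the result is just the list of TD errors; larger values of lam blend in more of the later TD errors.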

The training loop

  1. Collect: N parallel actors run πθold for T timesteps each → NT transitions
  2. Compute: Advantages Ât using GAE and current value function
  3. Optimize: K epochs of minibatch SGD on Ltotal (typically K=3-10, minibatch M ≤ NT)
  4. Update: θold ← θ, repeat
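The optimize step is where the data reuse happens: the same NT transitions are reshuffled and split into minibatches K times. A minimal sketch of that index schedule follows, with hypothetical names and sizes (4 actors, 8 timesteps, minibatches of 8, 3 epochs); the gradient update itself is left out.

```python
import random


def minibatch_schedule(num_samples, minibatch_size, num_epochs, seed=0):
    """Yield (epoch, indices) pairs: each epoch reshuffles and covers every sample once."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(num_epochs):
        rng.shuffle(indices)
        for start in range(0, num_samples, minibatch_size):
            yield epoch, indices[start:start + minibatch_size]


num_actors, horizon = 4, 8  # N and T, so NT = 32 transitions
schedule = list(
    minibatch_schedule(num_actors * horizon, minibatch_size=8, num_epochs=3)
)
print(len(schedule))  # number of gradient steps taken on this one batch
```

Each yielded index list would select the ratios, advantages, and value targets for one SGD or Adam step on the total objective.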
What is the role of the entropy bonus S[π] in PPO's objective?

Chapter 7: Experiments

Continuous control (MuJoCo)

On 7 MuJoCo locomotion tasks, PPO with ε = 0.2 outperforms the competing methods (TRPO, A2C, A2C with trust region, CEM, and vanilla PG with adaptive stepsize) on almost all environments. In the separate comparison of surrogate objectives, the clipped objective scores 0.82 (normalized), vs 0.74 for adaptive KL and 0.71 for fixed KL.

Hyperparameter comparison

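Normalized scores for the objective variants discussed in this guide:

  1. Clipping (ε = 0.2): 0.82
  2. Adaptive KL penalty: 0.74
  3. Fixed KL penalty: 0.71
  4. No clipping or penalty: −0.39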

3D humanoid showcase

PPO trains a 3D humanoid to run, steer toward targets, and get up after being knocked down by cubes. The policy network outputs continuous actions for all joints. This demonstrates PPO's ability to handle high-dimensional continuous action spaces.

Atari

PPO is competitive with ACER on Atari while being much simpler. It significantly outperforms A2C in sample efficiency on most games.

Figure: PPO vs baselines on MuJoCo. Normalized scores across 7 continuous control tasks. PPO-Clip outperforms all alternatives.

The "no clipping" disaster: The normalized score is −0.39 without clipping — worse than random. On HalfCheetah, unconstrained policy updates cause the policy to diverge to a configuration that actively accumulates negative reward. This is exactly the "destructively large update" problem that PPO's clipping prevents.
What happens when PPO's surrogate objective is optimized WITHOUT clipping or any penalty?

Chapter 8: Adaptive KL Alternative

PPO also proposes an alternative to clipping: an adaptive KL penalty with automatic coefficient tuning.

LKLPEN(θ) = Êt[rt(θ) Ât − β KL[πθold, πθ]]

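After each update, measure the actual KL divergence d = Ê[KL] and compare it with a target value dtarg. Then adjust β for the next update:

  1. If d < dtarg / 1.5, halve the penalty: β ← β / 2
  2. If d > dtarg × 1.5, double the penalty: β ← β × 2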

This adaptive scheme automatically finds the right β for each problem and training phase. The constants 1.5 and 2 are heuristic but the algorithm isn't sensitive to them.
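A minimal sketch of the β adjustment, assuming the rule given in the PPO paper: halve β when the measured KL falls below the target divided by 1.5, and double it when the measured KL exceeds 1.5 times the target. The names update_beta and kl_target, and the numbers in the usage lines, are illustrative.

```python
def update_beta(beta, kl, kl_target):
    """Adapt the KL penalty coefficient after one policy update."""
    if kl < kl_target / 1.5:
        return beta / 2.0   # policy moved too little: relax the penalty
    if kl > kl_target * 1.5:
        return beta * 2.0   # policy moved too much: tighten the penalty
    return beta             # within the tolerance band: leave beta alone


beta = 1.0
for measured_kl in [0.05, 0.03, 0.012, 0.002]:  # hypothetical KL measurements
    beta = update_beta(beta, measured_kl, kl_target=0.01)
    print(measured_kl, "->", beta)
```

Because β can change by a factor of 2 after every update, a poor initial value is corrected within a few iterations.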

Clipping vs KL penalty: The paper found clipping slightly outperforms the adaptive KL penalty (0.82 vs 0.74). However, the adaptive KL approach has its own advantages: it directly targets a specific KL divergence, which can be useful for theoretical analysis. In concurrent work, DeepMind used the adaptive KL variant for humanoid locomotion. Both are valid PPO implementations.
How does PPO's adaptive KL penalty automatically adjust β?

Chapter 9: Connections

What PPO built on

REINFORCE (Williams, 1992): The foundational policy gradient. PPO uses the same log-probability gradient but with importance sampling and clipping for multi-epoch updates.

Policy Gradient Theorem (Sutton et al., 1999): Proved that policy gradients work with function approximation. PPO is a practical algorithm built on this theory.

TRPO (Schulman et al., 2015): The direct predecessor. PPO replaces TRPO's second-order constrained optimization with a first-order clipped objective that's simpler and, in the paper's experiments, performs better.

GAE (Schulman et al., 2015): Generalized Advantage Estimation — the bias-variance tradeoff for advantage computation that PPO uses internally.

What PPO enabled

OpenAI Five (2018): PPO trained Dota 2 agents that went on to beat the reigning world champions in 2019, demonstrating that the algorithm scales to extremely complex, long-horizon tasks.

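RLHF (Ouyang et al., 2022): PPO is the RL algorithm used to align language models with human preferences in InstructGPT/ChatGPT. A learned reward model scores the model's outputs, and advantages are computed from those rewards. PPO's clipping limits how far each update moves the LLM policy, and a separate KL penalty against the supervised fine-tuned model keeps it from drifting too far from its starting point.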

GRPO / DeepSeek-R1 (2024-25): Group Relative Policy Optimization — simplifies PPO further by using group-based advantage estimation, removing the need for a separate value network.
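As a rough illustration of group-based advantage estimation, the sketch below normalizes the rewards of several responses sampled for the same prompt by the group's mean and standard deviation, so no value network is needed. The function name, the use of the population standard deviation, and the small stabilizing constant are assumptions made for this example.

```python
import statistics


def group_relative_advantages(rewards, tiny=1e-8):
    """Advantage of each response relative to its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + tiny) for r in rewards]


# Hypothetical rewards for four responses to one prompt.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that beat their group average get a positive advantage and the rest get a negative one; these values then play the role of Ât in a clipped objective.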

PPO's legacy: PPO became the default algorithm for deep RL — used in robotics (Diffusion Policy fine-tuning), game playing (OpenAI Five, hide-and-seek), language model alignment (ChatGPT, Claude), and scientific discovery. Its simplicity (the core is ~50 lines of code) and robustness (works with default hyperparameters on most problems) made it the "ImageNet moment" of RL algorithms.

Cheat sheet

Core equation
LCLIP = E[min(r Â, clip(r, 1−ε, 1+ε) Â)]
Key hyperparams
ε = 0.2, K = 3-10 epochs, γ = 0.99, λ = 0.95
Mechanism
Clipping creates an implicit trust region: zero gradient once r leaves [1−ε, 1+ε] in the direction the advantage favors
Advantage
First-order only (Adam), multi-epoch, simple, robust, general
Impact
Default RL algorithm: OpenAI Five, RLHF/ChatGPT, robotics
How is PPO used in RLHF for language model alignment?