The workhorse of modern RL — a simple clipped surrogate objective that enables multiple epochs of minibatch updates while preventing destructively large policy changes.
By 2017, policy gradient methods had a frustrating dilemma. The vanilla approach — compute the gradient, take a step — was simple but wasteful: you collect a batch of data, make one gradient update, then throw the data away. Terrible sample efficiency.
Why can't you just reuse the data for multiple gradient steps? Because after one step, the policy has changed. The data was collected under the old policy, so further gradient steps on that data are computed under the wrong distribution. Take too many steps and the policy update becomes catastrophically large, and performance collapses.
TRPO (Schulman et al., 2015) solved this with constrained optimization: maximize a surrogate objective subject to a KL divergence constraint. It works reliably, but it requires second-order optimization (conjugate gradient + line search), which makes it considerably more complicated to implement than a plain first-order gradient method.
PPO's insight: instead of constraining how far the policy moves (TRPO's approach), clip the objective function so that moving too far gives no additional benefit.
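The clipped surrogate objective is:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
Where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$ is the probability ratio between the new and old policies, $\hat{A}_t$ is the advantage estimate, and ε ≈ 0.2.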
This single equation delivers the main practical benefits of TRPO (stable training, and safe reuse of each batch for multiple epochs of updates) with nothing more than a min() and a clip(). No Hessian-vector products, no conjugate gradient, no line search. What it gives up is a formal monotonic-improvement guarantee: the stability is empirical.
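PPO builds on the standard policy gradient estimator. Let's review the chain of ideas.
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$
This is REINFORCE with a baseline ($\hat{A}_t$ is the advantage estimate). We construct a loss whose gradient equals this:
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$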
Problem: you can only do one gradient step on this, then the data is stale.
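To make the link between the loss and the estimator concrete, here is a minimal sketch that checks it numerically. It assumes a toy softmax policy whose logits do not depend on the state; the helper names (`softmax`, `pg_loss`, `pg_gradient`) and the sample data are illustrative and not from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pg_loss(logits, actions, advantages):
    """L^PG: batch mean of log pi(a_t) * A_t for a state-independent toy policy."""
    probs = softmax(logits)
    terms = [math.log(probs[a]) * adv for a, adv in zip(actions, advantages)]
    return sum(terms) / len(terms)

def pg_gradient(logits, actions, advantages):
    """Analytic gradient of L^PG w.r.t. the logits: mean of A_t * (onehot(a_t) - pi)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, adv in zip(actions, advantages):
        for k in range(len(logits)):
            grad[k] += adv * ((1.0 if k == a else 0.0) - probs[k])
    return [g / len(actions) for g in grad]

# Made-up batch: sampled action indices and their advantage estimates
logits = [0.1, -0.3, 0.5]
actions = [0, 2, 1, 2]
advantages = [1.0, -0.5, 0.2, 2.0]

grad = pg_gradient(logits, actions, advantages)

# Forward finite differences of the loss should agree with the analytic gradient
step = 1e-6
base = pg_loss(logits, actions, advantages)
numeric = []
for k in range(len(logits)):
    bumped = list(logits)
    bumped[k] += step
    numeric.append((pg_loss(bumped, actions, advantages) - base) / step)
```

The analytic gradient is exactly the estimator above specialized to a softmax policy, so agreement with the finite differences confirms that differentiating the loss recovers it.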
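To reuse data from $\pi_{\theta_{old}}$, we use importance sampling. The surrogate objective from TRPO/CPI:
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t\right]$$
The probability ratio $r_t(\theta)$ corrects for the distribution mismatch. When θ = θold, r = 1 and the gradient of $L^{CPI}$ equals the gradient of $L^{PG}$. But as θ drifts from θold, the importance weights can blow up, making optimization unstable.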
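A small sketch of how the ratio and the unclipped surrogate behave, assuming log-probabilities of the sampled actions are available under both policies. The function names and the numbers are made up for illustration.

```python
import math

def prob_ratio(logp_new, logp_old):
    """r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)

def cpi_surrogate(logps_new, logps_old, advantages):
    """Unclipped surrogate L^CPI: batch mean of r_t * A_t."""
    ratios = [prob_ratio(n, o) for n, o in zip(logps_new, logps_old)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

# Log-probabilities of three sampled actions (three different states) under the old policy
logps_old = [math.log(0.5), math.log(0.1), math.log(0.4)]
advantages = [1.0, 2.0, -1.0]

# Same policy: every ratio is 1, so the surrogate is just the mean advantage
same = cpi_surrogate(logps_old, logps_old, advantages)

# Drifted policy: the rare action went from probability 0.1 to 0.6 (ratio 6)
logps_new = [math.log(0.3), math.log(0.6), math.log(0.1)]
drifted = cpi_surrogate(logps_new, logps_old, advantages)
rare_ratio = prob_ratio(logps_new[1], logps_old[1])
```

A single sample whose probability was small under the old policy ends up dominating the batch average, which is the instability described above.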
TRPO constrains the policy update using KL divergence: maximize LCPI(θ) subject to KL[πold, πθ] ≤ δ. This requires solving a constrained optimization problem at each step.
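The theory actually suggests using a penalty instead:
$$\max_\theta\ \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t - \beta\,\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$$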
But choosing β is hard. Too small: policy changes too much, training destabilizes. Too large: policy barely moves, learning is slow. And the right β changes throughout training as the problem's characteristics evolve.
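This is the heart of PPO. The clipped surrogate objective:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
There are two terms inside the min(): the unclipped term $r_t(\theta)\,\hat{A}_t$, which is the ordinary importance-sampled surrogate, and the clipped term $\operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t$, in which the ratio is held inside [1−ε, 1+ε].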
Taking the min of these two creates a pessimistic lower bound. Let's trace through the two cases:
Case Â > 0 (the action was better than expected). We want to increase r (make this action more likely). But clip(r, 1−ε, 1+ε) caps r at 1+ε. Once r reaches 1+ε, the clipped term stops growing. The min then selects the clipped term (since it's smaller), and the gradient becomes zero. Result: no incentive to increase r beyond 1+ε.
Case Â < 0 (the action was worse than expected). We want to decrease r (make this action less likely). Once r drops below 1−ε, clip holds the ratio at 1−ε, so the clipped term (1−ε)Â is more negative than the unclipped term rÂ. The min selects the clipped term, which is constant in θ, so the gradient vanishes. Result: no incentive to decrease r below 1−ε. The other direction is not clipped: if r rises above 1+ε while Â < 0, the unclipped term is the more negative one, the min selects it, and the gradient still pushes r back down.
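The two cases can be traced numerically. The following is a minimal per-sample sketch: `clipped_surrogate` and `grad_wrt_ratio` are illustrative helper names, ε = 0.2 is assumed, and the gradient is estimated by finite differences with respect to r rather than θ.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_surrogate(r, adv, eps=0.2):
    """Per-sample L^CLIP = min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)

def grad_wrt_ratio(r, adv, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d L^CLIP / d r."""
    up = clipped_surrogate(r + h, adv, eps)
    down = clipped_surrogate(r - h, adv, eps)
    return (up - down) / (2 * h)

# Print the objective and its slope on both sides of the clip range
for adv in (+1.0, -1.0):
    for r in (0.5, 0.9, 1.0, 1.1, 1.5):
        value = clipped_surrogate(r, adv)
        slope = grad_wrt_ratio(r, adv)
        print(f"A={adv:+.0f}  r={r:.1f}  L={value:+.2f}  dL/dr={slope:+.2f}")
```

The slope is zero only where the ratio has already moved past the boundary in the direction the advantage favors. On the other side (for example r = 1.5 with a negative advantage) the unclipped term is selected and the slope is nonzero, so the policy can still be corrected.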
Figure: $L^{CLIP}$ (teal) versus the unclipped surrogate (gray dashed) as a function of r, shown for positive and negative advantage. The flat region beyond the clip boundary has zero gradient, which discourages large policy changes.
The clipped objective has several elegant properties that make it work so well in practice.
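$L^{CLIP} \le L^{CPI}$ everywhere, because taking the min can only lower the unclipped term. PPO therefore maximizes a pessimistic lower bound on the unclipped surrogate: improvements are credited only while the ratio stays inside the clip range, whereas changes that make the objective worse are counted in full. This is the "pessimistic estimate" the paper refers to. Note that it is a bound on the surrogate, not a formal guarantee of improvement in the true return.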
At r = 1 (where optimization starts), $L^{CLIP} = L^{CPI}$ and their gradients match, so the first gradient step is the same as for the unclipped surrogate. The clipping only kicks in as the policy moves away from θold.
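Both properties (the lower bound and the first-order match at r = 1) are easy to check per sample. This sketch assumes ε = 0.2 and uses an arbitrary grid of ratios and advantages chosen for illustration.

```python
def clipped(r, adv, eps=0.2):
    """Per-sample clipped surrogate."""
    r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
    return min(r * adv, r_clipped * adv)

def unclipped(r, adv):
    """Per-sample unclipped surrogate r * A."""
    return r * adv

# Lower bound: the clipped value never exceeds the unclipped one
grid = [i / 100 for i in range(1, 301)]  # r from 0.01 to 3.00
lower_bound_holds = all(
    clipped(r, adv) <= unclipped(r, adv)
    for r in grid
    for adv in (-2.0, -0.5, 0.5, 2.0)
)

# First-order match: at r = 1 both objectives have the same value and slope
h = 1e-6
adv = 0.7
slope_clipped = (clipped(1 + h, adv) - clipped(1 - h, adv)) / (2 * h)
slope_unclipped = (unclipped(1 + h, adv) - unclipped(1 - h, adv)) / (2 * h)
```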
The clipping creates a "soft trust region" without any explicit KL constraint. Samples with r ∈ [1−ε, 1+ε] receive normal gradients. Samples whose ratio has already moved past the boundary in the direction the advantage favors (r > 1+ε with Â > 0, or r < 1−ε with Â < 0) receive zero gradient. This discourages the policy from changing too much, although it does not strictly bound the change: the trust region is implicit in the objective shape.
Because the gradient vanishes when r moves too far from 1, you can run multiple epochs of SGD on the same data. Early epochs move r away from 1 (learning); later epochs find r already clipped and receive zero gradient (stability). The algorithm self-regulates.
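The self-regulating behavior can be seen in a deliberately tiny setting. The sketch below is a toy, not the paper's setup: it assumes a one-parameter Bernoulli policy π(a=1) = sigmoid(θ), a single stored sample (action 1, advantage +1) collected at θold = 0, and plain gradient ascent with a made-up learning rate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_epochs(num_steps, clip_eps=None, lr=0.1):
    """Repeated gradient ascent on one stored sample; returns the final ratio r.

    With clip_eps=None the unclipped surrogate r * A is optimized instead.
    """
    theta = 0.0
    p_old = sigmoid(0.0)  # probability of the sampled action under the old policy
    adv = 1.0
    for _ in range(num_steps):
        p = sigmoid(theta)
        r = p / p_old
        # d r / d theta for the sigmoid-parameterized Bernoulli policy
        dr_dtheta = p * (1.0 - p) / p_old
        # For A > 0 the clipped objective is flat once r exceeds 1 + eps
        if clip_eps is not None and r > 1.0 + clip_eps:
            grad = 0.0
        else:
            grad = adv * dr_dtheta
        theta += lr * grad
    return sigmoid(theta) / p_old

r_clipped = run_epochs(200, clip_eps=0.2)
r_unclipped = run_epochs(200)
```

With clipping, the ratio stops just past 1 + ε no matter how many more steps are taken on the same sample. Without it, the ratio keeps climbing toward its maximum of 2 (probability 1 for the sampled action).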
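PPO combines the clipped policy objective with a value function loss and entropy bonus:
$$L^{CLIP+VF+S}_t(\theta) = \hat{\mathbb{E}}_t\left[L^{CLIP}_t(\theta) - c_1\,L^{VF}_t(\theta) + c_2\,S[\pi_\theta](s_t)\right]$$
Where $L^{VF}_t(\theta) = (V_\theta(s_t) - V_t^{targ})^2$ is the squared-error value loss, $S[\pi_\theta](s_t)$ is the entropy of the policy at $s_t$ (a bonus that encourages exploration), and $c_1$, $c_2$ are weighting coefficients.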
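As a rough per-sample sketch of how the three terms combine, assuming a discrete action distribution for the entropy term. The function name and the default coefficients (c1 = 0.5, c2 = 0.01) are illustrative choices, not values prescribed by this text.

```python
import math

def ppo_total_objective(ratio, advantage, value_pred, value_target, action_probs,
                        clip_eps=0.2, c1=0.5, c2=0.01):
    """Per-sample L^CLIP - c1 * L^VF + c2 * S, a quantity to be maximized."""
    r_clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    l_clip = min(ratio * advantage, r_clipped * advantage)
    # Squared-error value loss
    l_vf = (value_pred - value_target) ** 2
    # Shannon entropy of the action distribution at this state
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0.0)
    return l_clip - c1 * l_vf + c2 * entropy

obj = ppo_total_objective(
    ratio=1.3,            # already outside the clip range for a positive advantage
    advantage=2.0,
    value_pred=1.0,
    value_target=1.5,
    action_probs=[0.5, 0.5],
)
```

In practice the negated batch mean of this quantity is typically handed to a gradient-descent optimizer as the loss.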
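PPO uses a truncated form of Generalized Advantage Estimation (GAE), computed over a trajectory segment of length T:
$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\,\delta_{T-1}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Here $r_t$ is the reward at step t, not the probability ratio. With λ = 1, this is the discounted return over the rest of the segment (bootstrapped from $V(s_T)$) minus $V(s_t)$. With λ = 0, it reduces to the one-step TD residual $\delta_t$. Values in between give an exponentially weighted average that trades bias for variance: lower λ means more bias and less variance, higher λ the reverse.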
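A minimal sketch of the backward recursion commonly used to compute these advantages, assuming a single trajectory segment with no episode termination inside it. The function name, argument layout, and default γ and λ are illustrative.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE over one segment.

    rewards[t] is the reward at step t and values[t] = V(s_t) for t = 0..T-1;
    last_value = V(s_T) is used to bootstrap the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual delta_t
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]
last_value = 0.2

td_only = compute_gae(rewards, values, last_value, gamma=1.0, lam=0.0)
full_return = compute_gae(rewards, values, last_value, gamma=1.0, lam=1.0)
```

With λ = 0 each advantage is just its own TD residual. With λ = 1 (and γ = 1 here for easy arithmetic) each advantage is the sum of remaining rewards plus the bootstrap value, minus V(s_t).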
PPO with ε = 0.2 outperforms the compared methods on almost all of 7 MuJoCo locomotion tasks: TRPO, A2C, A2C with trust region, CEM, and vanilla PG with adaptive stepsize. In the comparison of surrogate objectives, clipping with ε = 0.2 scores 0.82 (average normalized score), vs 0.74 for the best adaptive KL setting and 0.72 for the best fixed KL setting.
Across all objective variants tested, clipping with ε = 0.2 achieved the highest average normalized score.
PPO trains a 3D humanoid to run, steer toward targets, and get up after being knocked down by cubes. The policy network outputs continuous actions for all joints. This demonstrates PPO's ability to handle high-dimensional continuous action spaces.
PPO matches ACER's performance on Atari while being much simpler. It significantly outperforms A2C in sample efficiency on most games.
Normalized scores across 7 continuous control tasks. PPO-Clip outperforms all alternatives.
PPO also proposes an alternative to clipping: an adaptive KL penalty with automatic coefficient tuning.
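After each update, measure the actual KL divergence d = Ê[KL] and compare it with a target value d_targ. Then adjust β: if d < d_targ / 1.5, halve it (β ← β / 2); if d > d_targ × 1.5, double it (β ← β × 2); otherwise leave it unchanged.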
This adaptive scheme automatically finds the right β for each problem and training phase. The constants 1.5 and 2 are heuristic but the algorithm isn't sensitive to them.
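The adjustment rule is small enough to write out directly. This is a sketch with a hypothetical function name; the target KL passed in is a hyperparameter, and the values below are made up.

```python
def update_kl_coefficient(beta, measured_kl, target_kl):
    """Adapt the KL penalty coefficient after one policy update."""
    if measured_kl < target_kl / 1.5:
        # The policy moved less than intended: weaken the penalty
        return beta / 2.0
    if measured_kl > target_kl * 1.5:
        # The policy moved more than intended: strengthen the penalty
        return beta * 2.0
    return beta

beta = 1.0
history = []
for measured in (0.001, 0.002, 0.05, 0.012):
    beta = update_kl_coefficient(beta, measured, target_kl=0.01)
    history.append(beta)
```

The updated β is used for the next policy update, so a poor initial value is corrected within a few iterations.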
REINFORCE (Williams, 1992): The foundational policy gradient. PPO uses the same log-probability gradient but with importance sampling and clipping for multi-epoch updates.
Policy Gradient Theorem (Sutton et al., 1999): Proved that policy gradients work with function approximation. PPO is a practical algorithm built on this theory.
TRPO (Schulman et al., 2015): The direct predecessor. PPO replaces TRPO's second-order constrained optimization with a first-order clipped objective that is simpler to implement and, in the benchmarks reported here, performs better.
GAE (Schulman et al., 2015): Generalized Advantage Estimation — the bias-variance tradeoff for advantage computation that PPO uses internally.
OpenAI Five (2018-19): PPO trained Dota 2 agents that beat the reigning world champions in 2019, demonstrating that the algorithm scales to extremely complex, long-horizon tasks.
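RLHF (Ouyang et al., 2022): PPO is the RL algorithm used to align language models with human preferences in InstructGPT/ChatGPT. The reward model provides the reward signal, from which advantages are estimated. A KL penalty against the supervised fine-tuned model keeps the LLM policy from diverging too far from its starting point, while PPO's clipping limits how much the policy changes in each update.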
GRPO / DeepSeek-R1 (2024-25): Group Relative Policy Optimization — simplifies PPO further by using group-based advantage estimation, removing the need for a separate value network.
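As a rough illustration of group-based advantage estimation, the sketch below normalizes the rewards of several sampled responses to the same prompt by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the small stabilizing constant are assumptions made for this example.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sample relative to its group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect)
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean rather than a learned V(s), no value network is needed. The resulting advantages can be plugged into a clipped objective of the same form as PPO's.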