The two algorithms that power modern RL — from ChatGPT to walking robots. Every trick derived, every hyperparameter explained.
In the policy gradients lesson, we derived the REINFORCE algorithm: collect trajectories, compute a gradient, update the policy, throw away all data, recollect. In the actor-critic lesson, we added a learned value function to reduce variance. But the core loop stayed the same — one gradient step per batch of data.
That is brutally wasteful. A robot arm spends 30 minutes collecting 2,000 timesteps. You take one gradient step. Toss the data. Collect again. At this rate, learning a simple pick-and-place task takes weeks of real-world time.
The fix: take multiple gradient steps on the same batch. But the moment you update the policy even once, the data was collected by the old policy. You're training on someone else's experience. This is the off-policy problem.
Learning from data collected by a different policy than the one being optimized. The "behavior policy" (data collector) and the "target policy" (being improved) diverge. Requires corrections — otherwise gradients point in wrong directions.
More gradient steps per batch = more data-efficient. But each step moves the policy further from the data source = less reliable gradients. Every algorithm in this lesson is a different answer to this tradeoff. PPO stays close with clipping. SAC uses a Q-function that works with any data.
We already know the mathematical tool for off-policy correction: importance sampling. From the policy gradients lesson, we can reweight old data by multiplying each sample by the ratio πθ'(a|s) / πθ(a|s). If the new policy makes an action 2× more likely than the old one, that sample counts double. If 0.5× as likely, it counts half.
Old policy: πθ(jump|cliff_edge) = 0.1. New policy after some updates: πθ'(jump|cliff_edge) = 0.4. IS ratio = 0.4/0.1 = 4.0. This means: the new policy jumps 4× more often, so each observed jump in the old data gets 4× the weight. Makes sense — that sample is now more "representative" of the new policy's behavior.
The question is: how do we use IS to build an objective we can optimize with multiple gradient steps without the policy drifting into unreliable territory?
We need two things: (1) an objective function that accounts for the distribution mismatch, and (2) a mechanism to prevent the optimization from going too far. The surrogate objective gives us the first. PPO's clipping gives us the second.
In on-policy RL, we sample trajectories τ = (s₀, a₀, s₁, a₁, ...) from πθ. Now we want to evaluate the expected return under a new policy πθ' using data from πθ.
Your task: Derive the full trajectory importance sampling ratio p(τ | πθ') / p(τ | πθ), show why transition dynamics cancel, and explain why the product of per-step ratios has exponentially growing variance in T.
Full derivation:
p(τ | π) = p(s₀) · Πt=0T-1 π(at|st) · p(st+1|st,at)
The IS ratio for trajectories:
ρ(τ) = p(τ|πθ') / p(τ|πθ) = Πt=0T-1 [πθ'(at|st) / πθ(at|st)]
Dynamics p(st+1|st,at) cancel because they don't depend on the policy. Initial state p(s₀) also cancels.
For the variance: let rt = πθ'(at|st) / πθ(at|st). E[rt] = 1 (IS is unbiased). Var(rt) = E[rt²] - 1. The product Πrt has E[(Πrt)²] = ΠE[rt²] = (1 + Var(rt))T. Even if per-step variance is 0.1, after 100 steps: (1.1)100 ≈ 13,781. The IS estimate is essentially noise.
The key insight: This exponential variance is WHY we use per-step IS ratios (as in PPO) rather than full-trajectory ratios. PPO clips the per-step ratio to [0.8, 1.2], bounding variance at each step rather than letting it accumulate.
From actor-critic, we have the advantage function Âπ(s,a) = Qπ(s,a) − Vπ(s). It tells us: how much better is action a than what the policy would normally do in state s? Positive advantage = better than average. Negative = worse.
The surrogate objective lets us take multiple gradient steps on old data by importance-weighting each sample:
Here θ is the old policy (the one that collected data) and θ' is the new policy being optimized. The ratio πθ'/πθ is the importance sampling ratio — it corrects for the distribution mismatch between who collected the data and who we're optimizing.
Think of it as a voting correction. If the old policy took action "left" 30% of the time but the new policy would take it 60% of the time, each "left" sample in our data should count double — because it's twice as representative of the new policy's behavior.
When θ' = θ (first gradient step), the ratio is exactly 1.0 and the objective reduces to the standard advantage-weighted objective. As we take gradient steps, the ratio tells us: "How much more (or less) likely is this action under the new policy?" If the new policy is 2× more likely to take an action with positive advantage, the objective rewards that.
Nothing stops the optimization from going too far. If action a has positive advantage, gradient ascent increases πθ'(a|s). The IS ratio grows: 1.5, 3.0, 10.0. The objective keeps climbing. But now the IS correction is wildly inaccurate — the data came from a policy that rarely took this action, and we're pretending it's common.
Old policy: πθ(left|s) = 0.3. Advantage of "left" in state s: Â = +2.0.
After 5 gradient steps: πθ'(left|s) = 0.9. IS ratio = 0.9/0.3 = 3.0.
Contribution to objective: 3.0 × 2.0 = 6.0. Looks great! But we only observed "left" 30% of the time in our data. We have very few samples supporting this estimate. We're extrapolating wildly from sparse data.
After 10 more steps: πθ'(left|s) = 0.99. Ratio = 3.3. The policy has gone all-in on an action it barely tested.
Too many gradient steps on the same batch → the policy overfits to the particular trajectories it saw, not the true advantage landscape. It's the same pathology as training a supervised model for too many epochs on a small dataset. The fix isn't early stopping — it's PPO.
Proximal Policy Optimization (Schulman et al., 2017) solves the overfitting problem with two elegant tricks. Both are simple to implement and together they make PPO the most widely-used RL algorithm in the world. The name "proximal" means "nearby" — the policy stays proximal (close) to its old version.
An on-policy (with mild off-policy reuse) actor-critic algorithm that takes multiple gradient steps per batch using a clipped surrogate objective. Introduced by Schulman et al. (2017) at OpenAI. Default choice for stable, scalable policy optimization.
Instead of letting the IS ratio grow without bound, clip it to [1−ε, 1+ε]:
Once the ratio hits 1.2, the clipped objective goes flat. No matter how much more likely you make the action, the objective stops increasing. The policy has no incentive to deviate further.
IS ratio = 1.0: unclipped = 1.0 × 2 = 2.0. Clipped = 1.0 × 2 = 2.0. Same.
IS ratio = 1.1: unclipped = 1.1 × 2 = 2.2. Clipped = 1.1 × 2 = 2.2. Still within range.
IS ratio = 1.5: unclipped = 1.5 × 2 = 3.0. Clipped = 1.2 × 2 = 2.4. Clamped!
IS ratio = 3.0: unclipped = 3.0 × 2 = 6.0. Clipped = 1.2 × 2 = 2.4. Still 2.4. No incentive to go further.
Clipping alone has a subtle edge case. When the advantage is negative and we clip the ratio downward (to 0.8), the clipped objective can actually be better than the unclipped one. This would still let the policy deviate in harmful directions.
Fix: take the minimum of the clipped and unclipped objectives:
The min operator is pessimistic. For any (state, action) pair, it picks whichever objective is worse. If clipping makes things look better (rare edge case with negative advantages), the min reverts to the unclipped version. If clipping makes things look worse (the normal case, preventing over-deviation), the min keeps the clip. Either way, the policy never benefits from straying too far.
IS ratio = 0.7: unclipped = 0.7 × (−1.5) = −1.05. Clipped = 0.8 × (−1.5) = −1.2. min(−1.05, −1.2) = −1.2. Takes the worse (more negative) one — policy is discouraged from reducing the probability of this bad action too aggressively.
IS ratio = 1.3: unclipped = 1.3 × (−1.5) = −1.95. Clipped = 1.2 × (−1.5) = −1.8. min(−1.95, −1.8) = −1.95. Takes the unclipped version — doesn't let clipping soften the penalty.
Two lines of code — a clamp() and a min() — replace the complicated constrained optimization (KL penalties, trust regions) of earlier methods like TRPO. That's why PPO took over: same stability, fraction of the complexity.
We need Âπ(s,a) for the PPO objective. But which advantage estimate should we use? From actor-critic, we know several options with different bias-variance tradeoffs:
Use n steps of actual reward, then bootstrap from V̂ for the rest:
Short horizon (small n): low variance because we rely on the learned V̂ to fill in the future, but biased if V̂ is inaccurate (early in training, V̂ is terrible). Long horizon (large n): less bias because we use real observed rewards, but high variance because those rewards are noisy and path-dependent.
If we bootstrap too early (small n), our advantage estimates inherit all the errors in V̂. If we bootstrap too late (large n), we're back to the high-variance Monte Carlo estimates that made REINFORCE so noisy. Neither extreme is good. We need a blend.
GAE's insight: don't pick one horizon — use all of them, with exponentially decaying weights.
The parameter λ controls the bias-variance tradeoff:
λ = 0: w₁ = 1, all other weights = 0. Pure 1-step TD. Lowest variance, highest bias. GAE reduces to Â1 = rt + γV̂(st+1) − V̂(st).
λ = 1: equal weights on all horizons. Pure Monte Carlo return minus baseline. Lowest bias, highest variance.
λ = 0.95 (typical): mostly uses short horizons but sprinkles in longer ones. Practical sweet spot.
GAE has an elegant recursive form using the TD residual δt = rt + γV̂(st+1) − V̂(st):
Computed with a single backward pass through the trajectory. No need to compute each Ân separately. In code: start from the last timestep, accumulate gae = delta + gamma * lambda * gae.
Suppose we have 3 timesteps of data with V̂ predictions:
• t=0: r₀ = +1, V̂(s₀) = 5.0, V̂(s₁) = 5.5 → δ₀ = 1 + 0.99×5.5 − 5.0 = +1.445
• t=1: r₁ = +2, V̂(s₁) = 5.5, V̂(s₂) = 6.0 → δ₁ = 2 + 0.99×6.0 − 5.5 = +2.44
• t=2: r₂ = −1, V̂(s₂) = 6.0, V̂(s₃) = 4.0 → δ₂ = −1 + 0.99×4.0 − 6.0 = −3.04
GAE at t=0: δ₀ + (γλ)δ₁ + (γλ)²δ₂ = 1.445 + 0.9405×2.44 + 0.8845×(−3.04) = 1.445 + 2.295 − 2.689 = +1.051
Compare: pure 1-step (λ=0) gives Â₀ = δ₀ = +1.445. Pure MC gives a different number. GAE blends them.
If you've seen TD(λ) in RL textbooks, GAE is the same idea applied to advantage estimates instead of value estimates. The λ parameter is the eligibility trace decay. Schulman et al. (2016) showed this works exceptionally well with policy gradients.
GAE assumes on-policy data. For fully off-policy data (replay buffers), we need Retrace(λ) (Munos et al., 2016). The key idea: truncate IS ratios at 1 to prevent variance explosion while maintaining a fixed-point guarantee.
Your task: Starting from the off-policy Bellman operator, derive why truncating ct = min(1, π(at|st)/μ(at|st)) still converges to Qπ, and show the Retrace target formula.
Retrace(λ) target:
Qret(s0,a0) = Q(s0,a0) + Σt=0T-1 γt (Πi=1t ci) [rt + γ Ea~π[Q(st+1,a)] - Q(st,at)]
where ct = λ · min(1, π(at|st) / μ(at|st))
Why it converges: At the fixed point Q = Qπ, the TD error δt = rt + γ Vπ(st+1) - Qπ(st,at) = 0 by definition of Qπ. So Qret = Qπ regardless of the ct values. The truncation only affects the speed and stability of convergence, not the solution.
Variance bound: Since ct ≤ λ ≤ 1, the product Πici ≤ λt → 0 exponentially. This means distant rewards get exponentially down-weighted, keeping variance bounded. Compare to untruncated IS where Π(π/μ) can be 105 for a single trajectory.
The key insight: Retrace is "safe IS" — it never over-weights a transition (ratio capped at 1), so it converges reliably with any behavior policy. SAC implicitly avoids this issue by using single-step TD (no multi-step returns), but algorithms like R2D2 and IMPALA use Retrace variants for their replay buffers.
After many gradient steps, the policy moves far from the data-generating policy. The IS ratio πθ'/πθ deviates significantly from 1.0, meaning the gradient estimates become unreliable — we're extrapolating from sparse data about what the new policy would do. The clipped objective JPPO = min(r·A, clip(r, 1-ε, 1+ε)·A) creates a flat region beyond the clip boundary. Once the ratio exceeds 1+ε (or drops below 1-ε), the gradient is zero — there's no incentive to move further. This effectively creates a trust region where the policy can only move ε per iteration. With 10 epochs, the ratio might reach ~1.2 before clipping kicks in. With 1000 epochs, every action would be clipped on the first few steps, and subsequent steps would have zero gradient — wasting compute but not diverging.
Now we have all the pieces: a clipped surrogate objective that prevents the policy from drifting too far, and GAE for low-variance advantage estimates. PPO combines them with a learned value function and a specific training loop.
What follows is the full algorithm with real hyperparameters from practice. These aren't toy values — they're what people actually use in MuJoCo, Atari, and LLM alignment.
| Hyperparameter | Typical Value | What It Controls |
|---|---|---|
| ε (clip range) | 0.2 | How far policy can deviate per iteration |
| γ (discount) | 0.99 | How far ahead the agent plans |
| λ (GAE) | 0.95 | Bias-variance in advantage estimates |
| Batch size | 2,048 timesteps | Data per iteration (across parallel envs) |
| Epochs per batch | 10 | How many passes over the same batch |
| Minibatch size | 64 | SGD minibatch (batch / minibatch = steps per epoch) |
| Total iterations | ~500 | ~1M total environment timesteps |
| Learning rate | 3×10−4 | Adam step size |
Per iteration: 2,048 timesteps collected. 10 epochs × (2048/64) = 10 × 32 = 320 gradient steps per iteration. That's 320 gradient updates from one batch of experience!
Total training: 500 iterations × 2,048 = 1,024,000 total timesteps. ~320,000 gradient steps. A MuJoCo Humanoid can learn to walk in roughly this budget.
Compare to REINFORCE: 1 gradient step per 2,048 timesteps = only 500 gradient steps total. PPO is 640× more gradient-efficient per timestep.
Unnormalized advantages can have wildly different scales across environments. A locomotion task might have advantages in [−100, +100], while a manipulation task has [−0.5, +0.5]. Normalizing to zero mean and unit variance makes the same ε = 0.2 clip range work across all tasks. This is one of PPO's "invisible" tricks that makes it so plug-and-play.
The value function V̂φ must be trained on the current batch's returns, not stale data. Some implementations share a backbone between policy and value function — be careful that value loss gradients don't destroy the policy features. Many practitioners use separate networks or a value loss coefficient of 0.5.
PPO reuses each batch for ~10 epochs, then discards it. Can we go further? What if we kept all past experience in a giant buffer and sampled from it indefinitely?
A database storing past transitions (s, a, r, s'). Typically holds the last 1M transitions. Each training step samples a random minibatch from the buffer. New experience is added continuously; the oldest is evicted. Used in DQN, DDPG, TD3, SAC, and many others.
The appeal is obvious: instead of 2,000 timesteps per iteration, you train on millions of stored transitions. Data efficiency goes through the roof. A robot that took 2 hours to learn with PPO might learn in 15 minutes with a replay buffer. But there's a fundamental problem.
Imagine a circular buffer of 1M transitions. The robot has been training for an hour. The buffer contains: transitions 1–300K from a terrible random policy, 300K–600K from a mediocre policy, and 600K–1M from the current reasonable policy. When we sample a minibatch of 256, we get a random mix of all three. The Q-function must make sense of this temporal soup.
In PPO, we fit V̂π(s) — the value of state s under the current policy π. The data comes from the current policy, so the target returns are valid.
With a replay buffer, data comes from many past policies. When we fit V̂ on this mixed data, which policy's value function are we learning? Not the current one. Not any past one. Some incoherent average.
Buffer contains transitions from policies π₁ through π₁₀₀. Policy π₁ (early, random) visited state s and got reward 0. Policy π₅₀ (later, competent) visited the same state and got reward +5 (because it took better future actions). If we fit V(s) on both, we get V(s) ≈ 2.5 — which is wrong for every policy.
V(s) depends on what the policy does after visiting s. Old data reflects old policies' future behavior. The algorithm is broken.
Vπ(s) = expected future reward under policy π. Past data was generated under different policies with different future behaviors. Fitting V on this data creates a meaningless chimera. We need a fundamentally different approach.
V(s) answers: "How good is this state, assuming I follow my policy from here?" This depends on the policy's future actions — which are different for each policy. Q(s,a) answers: "How good is this state-action pair, assuming I follow my policy from s' onward?" The first action is fixed — it's the one in the data. Only future actions need to come from the current policy.
The fix turns out to be surprisingly elegant: stop learning V(s) and start learning Q(s,a).
The key insight: V(s) depends on what actions the policy takes in the future. Q(s,a) already specifies the first action. If we also sample the next action from the current policy, we can build valid targets even from old data.
The next action ā' is sampled from the current policy πθ, not from whatever old policy generated the data. The transition (s, a, s') comes from the buffer — any policy. Only ā' must be fresh.
The (s, a, r, s') transition is just physics — it doesn't matter which policy generated it. The environment dynamics are the same regardless of who took the action. The only place the policy matters is in predicting future behavior, and we handle that by sampling ā' from the current policy. The reward r(s,a) and next state s' are ground truth from the environment — they're valid forever.
With Q(s,a) instead of V(s), the policy update changes too. We don't need an explicit advantage baseline — Q already captures action quality:
Sample states from the buffer, sample fresh actions from the current policy, and push the policy toward actions that Q says are good. No IS ratios needed — the actions are from the current policy.
With V(s) we computed advantages  = Q − V, centering the signal around zero. Without V, raw Q-values can be all positive or all negative, making the gradient noisy (the same "all rewards positive" problem from REINFORCE). In practice, the replay buffer provides so much data that this extra variance is manageable. SAC also adds entropy regularization which further stabilizes training.
Without a baseline (V), the gradient has higher variance. But we're sampling from the entire replay buffer — millions of transitions. The sheer volume of data compensates. In practice, the current policy's actions are better than the old policy's (the policy is improving), so Q-values for current actions tend to be positive anyway.
Buffer transition: (s=standing, a=lean_left, r=−0.5, s'=tilting_left). Old policy π₃₀ collected this.
Current policy π₅₀ samples ā' = correct_right in state s'. Q̂(s', correct_right) = +3.0.
Target: y = −0.5 + 0.99 × 3.0 = +2.47.
We fit Q̂(standing, lean_left) → 2.47. The physics of "lean left leads to tilting left" is valid regardless of which policy discovered it. The future value uses the current policy's action — that's what makes it correct.
Soft Actor-Critic (SAC), introduced by Haarnoja et al. (2018), builds on the Q-function replay buffer approach with two additional ideas: entropy regularization and twin Q-networks. Together, these turn the basic replay-buffer actor-critic into a robust, sample-efficient algorithm that works on real robots.
Add a bonus for policy randomness: JSAC = 𝔼[ Σt r(st,at) + α H(π(·|st)) ]. The policy maximizes reward and entropy simultaneously. Prevents premature convergence to a single action. α controls the reward-exploration tradeoff (automatically tuned in practice).
The "soft" in Soft Actor-Critic refers to this entropy bonus — the policy stays "soft" (stochastic) rather than collapsing to a deterministic action. A deterministic policy that always takes the same action has zero entropy. A uniform-random policy has maximum entropy. SAC finds the sweet spot: as much reward as possible while staying as random as possible.
Train two independent Q-functions Q̂₁ and Q̂₂. Use min(Q̂₁, Q̂₂) for targets. Why? Q-learning tends to overestimate values (the max over noisy estimates is biased upward). Taking the minimum of two estimates counteracts this. Same trick as TD3 (Fujimoto et al. 2018).
SAC uses Gaussian policies: πθ(a|s) = 𝒩(μθ(s), σθ(s)). To backpropagate through the policy sample, we use the reparameterization trick:
Now ∇θ Q(s, a) flows through a back to θ via the chain rule. No need for log-probability tricks or IS ratios.
Replay buffer size: 1M transitions. Minibatch: 256. Learning rate: 3×10−4 for all networks. γ: 0.99. τ: 0.005 (target smoothing). α: auto-tuned (start ~0.2). Gradient steps per env step: 1 (but can go higher for data efficiency).
The entropy bonus has a deep practical effect: it prevents the policy from committing too early to one strategy. If two paths through a maze give similar reward, SAC keeps both options alive. This makes it much more robust to local optima and helps with exploration. It also means SAC policies transfer better — they've explored more of the state space during training.
The Q-targets use target networks Q̄ (updated slowly) rather than the current Q̂ (updated every step). Why? Without targets, training is unstable: Q̂ updates → targets change → Q̂ chases a moving target. The soft update Q̄ ← 0.005 · Q̂ + 0.995 · Q̄ keeps targets nearly stationary, giving the Q-function a stable regression target. Same trick from DQN (2015), refined for continuous actions.
SAC maximizes J = E[Σt r(st,at) + α H(π(·|st))]. The per-state objective is: maxπ Ea~π[Q(s,a)] + α H(π(·|s)), subject to Σa π(a|s) = 1.
Your task: Derive that the optimal policy has the form π*(a|s) = exp(Q(s,a)/α) / Z(s), i.e., a Boltzmann (softmax) distribution over Q-values scaled by temperature α.
Step 1: Objective per state: maxπ Σa π(a|s) Q(s,a) - α Σa π(a|s) log π(a|s)
Step 2: Lagrangian with constraint Σπ = 1:
L = Σa π(a|s)[Q(s,a) - α log π(a|s)] + λ[1 - Σa π(a|s)]
Step 3: ∂L/∂π(a|s) = Q(s,a) - α log π(a|s) - α - λ = 0
Step 4: Solve: log π*(a|s) = (Q(s,a) - α - λ) / α = Q(s,a)/α - 1 - λ/α
π*(a|s) = exp(Q(s,a)/α) · C, where C = exp(-1 - λ/α) is a constant.
Step 5: Normalize: Σa π*(a|s) = 1 ⇒ C = 1/Z(s) where Z(s) = Σa exp(Q(s,a)/α).
Final: π*(a|s) = exp(Q(s,a)/α) / Z(s)
The key insight: The optimal SAC policy is a Boltzmann distribution with temperature α. As α → 0, this approaches argmax (deterministic, pure exploitation). As α → ∞, it approaches uniform (maximum entropy, pure exploration). SAC auto-tunes α to maintain a target entropy, dynamically balancing this tradeoff throughout training.
These are the two dominant algorithms in modern deep RL. They represent fundamentally different philosophies — how much you trust your data vs. how much data you can afford to collect. Choosing between them is one of the first decisions in any RL project, and it shapes everything from your infrastructure to your debugging strategy.
| Dimension | PPO | SAC |
|---|---|---|
| Data paradigm | More on-policy: collect batch, train ~10 epochs, discard | Fully off-policy: replay buffer of all past data |
| Data efficiency | Low — discards data after each iteration | High — reuses all past experience |
| Stability | Very stable — clipping prevents catastrophic updates | Less stable — Q-function approximation errors can compound |
| Ease of use | "Plug and play" — default hyperparams often work | More tuning needed (buffer size, α, target update rate) |
| Action space | Discrete or continuous | Primarily continuous (discrete variants exist) |
| Markov assumption | Can relax with MC returns | Heavy reliance (Q-learning requires Markov property) |
| Parallelism | Naturally parallel (many envs collecting simultaneously) | Harder to parallelize (replay buffer bottleneck) |
| Wall-clock speed | Faster if simulation is cheap | Faster if simulation is expensive (fewer env steps needed) |
Simulation is cheap (video games, MuJoCo, LLM training): use PPO. You can generate data quickly, so data efficiency doesn't matter. Stability does.
Simulation is expensive (real robots, complex physics): use SAC. Every timestep costs real time or real money. You need to squeeze maximum learning from every sample.
RLHF for ChatGPT: PPO. Each "trajectory" is an LLM response (~cheap to generate with GPUs). Training stability is paramount — a collapsed policy means gibberish output.
Robot arm learning to insert a peg: SAC. Each attempt takes 10 seconds. With PPO, you'd need 500 iterations × 2,000 timesteps = 1M timesteps × 10s = 116 days. SAC can learn from ~100K timesteps — 11.5 days, or 2 hours with parallelism.
If you have expert demonstrations (teleoperation, human play, etc.), seed the replay buffer (SAC) or pretrain with behavioral cloning (PPO). Starting from scratch is always slower than starting from a rough approximation. This is not cheating — it's good engineering.
Real-world solution (Luo et al., 2024 — HIL-SERL):
1. Warm-start with demos: Collect 20-50 human demonstrations via teleoperation. Seed the replay buffer AND pre-train the policy with BC. This gives non-zero initial success rate (~10-20%).
2. Reward shaping: Add dense reward components: (a) negative distance to object, (b) grasp contact reward, (c) lift progress reward. These guide exploration toward grasping without requiring full sparse success.
3. Residual policy: Learn a residual on top of the BC policy: atotal = aBC + aresidual. The residual is small (bounded), so the hand stays near demonstrated behavior. SAC optimizes the residual only.
4. Safety layer: Action clipping at the joint level + velocity limits. A separate safety controller overrides any action that would cause self-collision (computed analytically from kinematics). The entropy bonus explores within the safe action space.
5. Prioritized replay: Successful episodes get 10x sampling priority. With only 5% success rate, this means ~33% of each training batch contains successful transitions. Also: Hindsight Experience Replay (HER) — relabel failed episodes with "the state you actually reached" as the goal.
6. Camera encoder: Pre-trained (frozen) ResNet-18, fine-tuned only the last layer. End-to-end from scratch would waste the limited data budget on visual features. The encoder outputs a 512-d vector concatenated with joint states.
Result: 90%+ success rate in under 2 hours of real training. Key insight: SAC's sample efficiency shines when you remove the cold-start problem with demonstrations.
PPO and SAC are not just textbook algorithms. They are production systems powering the most impressive real-world AI deployments. Understanding where each shines validates the theory from earlier chapters and gives you intuition for your own projects.
Haarnoja et al. (2018) trained a real quadruped robot (Minitaur) to walk from scratch in under 2 hours. No simulation. No human demonstration. The robot fell over thousands of times, and SAC's replay buffer let it learn from every fall. The entropy bonus was especially critical: it kept the robot trying diverse gaits rather than getting stuck repeating one bad strategy. This was a watershed moment — proof that RL could work on hardware within a human attention span.
Luo et al. (2024) developed HIL-SERL (Human-in-the-Loop Sample-Efficient RL) for precise peg insertion and cable routing. SAC with a small number of human corrections outperformed pure imitation learning — it was both faster to train and more accurate. The robot learned sub-millimeter precision that imitation alone couldn't capture. The human interventions seeded the replay buffer with high-quality transitions, giving SAC a warm start rather than learning from scratch.
Tan et al. (2018) at Google trained a quadruped in simulation with PPO, then transferred the policy to a real Laikago robot. The sim-to-real transfer worked because PPO's stability meant the policy didn't exploit simulator artifacts — it learned robust gaits that generalized to reality.
OpenAI (2019) used PPO to train a dexterous robot hand to solve a Rubik's cube. The policy was trained entirely in simulation with massive domain randomization (changing physics parameters, lighting, cube size) and transferred zero-shot to a real Shadow Hand. This required billions of timesteps across thousands of parallel workers — possible only because PPO scales effortlessly with parallel simulation. Each worker collects its own batch; batches are aggregated for one synchronized gradient step. No replay buffer coordination needed.
InstructGPT and ChatGPT use PPO for RLHF (Reinforcement Learning from Human Feedback). The "environment" is: generate a response, get a reward from a trained reward model. PPO's stability is critical — a single bad update could make the model produce gibberish or harmful content. The clip mechanism prevents catastrophic forgetting of language capabilities.
State: prompt + tokens generated so far. Action: next token (discrete, ~50K vocabulary). Reward: 0 at every step except the last, where a reward model scores the complete response. Episode length: response length (~100-500 tokens). PPO with a KL penalty against the supervised-fine-tuned model keeps the language model coherent while steering toward human preferences.
Cheap simulation + need stability: PPO (Rubik's cube, RLHF, sim-to-real). Expensive real data + need sample efficiency: SAC (real robots, precision tasks). The theory from this lesson directly predicts which algorithm wins in each scenario.
2015: TRPO (Schulman et al.) — trust regions for policy optimization.
2016: GAE (Schulman et al.) — the advantage estimation we use today.
2017: PPO (Schulman et al.) — replaced TRPO with clipping. Immediate adoption.
2018: SAC (Haarnoja et al.) — entropy-regularized off-policy actor-critic.
2019: Rubik's cube with PPO (OpenAI) — dexterous manipulation milestone.
2022: ChatGPT uses PPO for RLHF — PPO goes mainstream.
2024: HIL-SERL with SAC — state-of-the-art real-robot manipulation.
SAC is Q-learning where the hard max becomes a soft max (log-sum-exp). As α→0, the Boltzmann policy becomes argmax and SAC reduces to standard Q-learning. The entropy bonus is the "softening" that makes the max differentiable and the policy stochastic. This same Boltzmann-to-argmax spectrum appears everywhere: from simulated annealing to energy-based models.
Where else in ML do you see this pattern of "soften a discrete operation to make it differentiable"? (Hint: think about attention mechanisms.)
Both add a regularizer that prevents the policy from collapsing. SAC penalizes deviation from uniform (via entropy). RLHF penalizes deviation from the SFT model (via KL). The mathematical structure is identical: reward + regularization toward a prior. SAC's prior is uniform; RLHF's prior is the pre-trained model. Both use temperature/β to control the tradeoff.
If you replaced SAC's entropy bonus with a KL penalty toward the BC policy (KL(π||πBC)), what algorithm would you get? (Answer: something very close to AWAC / offline RL methods.)
def sac_actor_loss(policy_net, q_net1, q_net2, states, alpha):
# Step 1: Forward pass through policy
mean, log_std = policy_net(states)
std = log_std.exp()
# Step 2: Reparameterization trick
epsilon = torch.randn_like(mean)
z = mean + std * epsilon # pre-squash action
action = torch.tanh(z) # squashed to [-1, 1]
# Step 3: Log probability with tanh correction
log_prob = Normal(mean, std).log_prob(z).sum(-1)
# Jacobian correction for tanh squashing:
log_prob -= torch.log(1 - action.pow(2) + 1e-6).sum(-1)
# Step 4: Evaluate both Q-functions, take min (pessimistic)
q1 = q_net1(states, action).squeeze(-1)
q2 = q_net2(states, action).squeeze(-1)
min_q = torch.min(q1, q2)
# Step 5: Actor loss = E[alpha * log_pi - Q]
# We MINIMIZE this, which MAXIMIZES (Q - alpha * log_pi)
loss = (alpha * log_prob - min_q).mean()
return loss
You now have a complete understanding of the two dominant algorithms in modern deep RL. Let's consolidate everything into a reference sheet.
| Question | If Yes | If No |
|---|---|---|
| Is simulation cheap / fast? | PPO | SAC |
| Is stability critical (e.g., LLMs)? | PPO | Either |
| Need max sample efficiency? | SAC | PPO |
| Discrete actions? | PPO | Either (SAC prefers continuous) |
| Have expert demonstrations? | Pretrain/seed buffer, then either | |
| Lecture | Key Algorithm | Core Idea |
|---|---|---|
| L3: Policy Gradients | REINFORCE | Weight actions by reward, gradient ascent |
| L4: Actor-Critic | A2C / N-step AC | Replace reward-to-go with learned V̂ |
| L5: Off-Policy AC (this lesson) | PPO, SAC | Reuse data with IS clipping or Q-functions |
Each lecture built on the last: REINFORCE's high variance motivated the value function critic. The critic's data inefficiency motivated off-policy methods. PPO and SAC are the culmination — practical, scalable algorithms that are deployed worldwide.
| Trick | Problem It Solves | Used In |
|---|---|---|
| IS ratio clipping | Policy drifts too far from data | PPO |
| Min of clipped/unclipped | Edge case where clipping helps objective | PPO |
| GAE (λ) | Bias-variance tradeoff in advantages | PPO (and others) |
| Advantage normalization | Reward scale varies across tasks | PPO |
| Q instead of V | V is meaningless with mixed-policy data | SAC |
| Twin Q-networks | Q-value overestimation | SAC, TD3 |
| Entropy regularization | Premature policy collapse | SAC |
| Target networks | Unstable Q-learning targets | SAC, DQN, TD3 |
| Reparameterization | Backprop through stochastic sampling | SAC |
PPO clips importance-sampling ratios so you can reuse data safely. SAC fits a Q-function on a replay buffer so you can reuse all data. Everything else is implementation details.