CS224R — Off-Policy Actor-Critic: PPO & SAC

Roadmap

What You'll Master

01Why Off-Policy? 02Surrogate Objective & Its Flaw 03PPO: Clipping & Min Trick 04Generalized Advantage Estimation 05The Complete PPO Algorithm 06Replay Buffers 07Fitting Q Instead of V 08The Complete SAC Algorithm 09PPO vs. SAC 10Real-World Impact 11Summary & Cheat Sheet

Chapter 01

Why Off-Policy?

In the policy gradients lesson, we derived the REINFORCE algorithm: collect trajectories, compute a gradient, update the policy, throw away all data, recollect. In the actor-critic lesson, we added a learned value function to reduce variance. But the core loop stayed the same — one gradient step per batch of data.

That is brutally wasteful. A robot arm spends 30 minutes collecting 2,000 timesteps. You take one gradient step. Toss the data. Collect again. At this rate, learning a simple pick-and-place task takes weeks of real-world time.

The fix: take multiple gradient steps on the same batch. But the moment you update the policy even once, the data was collected by the old policy. You're training on someone else's experience. This is the off-policy problem.

Definition

Off-Policy Learning

Learning from data collected by a different policy than the one being optimized. The "behavior policy" (data collector) and the "target policy" (being improved) diverge. Requires corrections — otherwise gradients point in wrong directions.

The Central Tension

More gradient steps per batch = more data-efficient. But each step moves the policy further from the data source = less reliable gradients. Every algorithm in this lesson is a different answer to this tradeoff. PPO stays close with clipping. SAC uses a Q-function that works with any data.

Quick Recap: Importance Sampling

We already know the mathematical tool for off-policy correction: importance sampling. From the policy gradients lesson, we can reweight old data by multiplying each sample by the ratio π_θ'(a|s) / π_θ(a|s). If the new policy makes an action 2× more likely than the old one, that sample counts double. If 0.5× as likely, it counts half.

Quick IS Example

Old policy: π_θ(jump|cliff_edge) = 0.1. New policy after some updates: π_θ'(jump|cliff_edge) = 0.4. IS ratio = 0.4/0.1 = 4.0. This means: the new policy jumps 4× more often, so each observed jump in the old data gets 4× the weight. Makes sense — that sample is now more "representative" of the new policy's behavior.

The question is: how do we use IS to build an objective we can optimize with multiple gradient steps without the policy drifting into unreliable territory?

We need two things: (1) an objective function that accounts for the distribution mismatch, and (2) a mechanism to prevent the optimization from going too far. The surrogate objective gives us the first. PPO's clipping gives us the second.

🔨 Derivation Importance Sampling Ratio for Trajectories ▶ ✓ ATTEMPTED

In on-policy RL, we sample trajectories τ = (s₀, a₀, s₁, a₁, ...) from π_θ. Now we want to evaluate the expected return under a new policy π_θ' using data from π_θ.

Your task: Derive the full trajectory importance sampling ratio p(τ | π_θ') / p(τ | π_θ), show why transition dynamics cancel, and explain why the product of per-step ratios has exponentially growing variance in T.

p(τ | π) = p(s₀) · Π_t=0^T-1 [π(a_t|s_t) · p(s_t+1|s_t,a_t)]. The trajectory probability factorizes into initial state, policy probabilities, and transition probabilities.

The product Π_t [π_θ'(a_t|s_t) / π_θ(a_t|s_t)] is a product of T random variables. If each ratio has variance σ², the product has variance that grows exponentially in T: Var ~ (1 + σ²)^T. This makes full-trajectory IS useless for long horizons.

Full derivation:

p(τ | π) = p(s₀) · Π_t=0^T-1 π(a_t|s_t) · p(s_t+1|s_t,a_t)

The IS ratio for trajectories:

ρ(τ) = p(τ|π_θ') / p(τ|π_θ) = Π_t=0^T-1 [π_θ'(a_t|s_t) / π_θ(a_t|s_t)]

Dynamics p(s_t+1|s_t,a_t) cancel because they don't depend on the policy. Initial state p(s₀) also cancels.

For the variance: let r_t = π_θ'(a_t|s_t) / π_θ(a_t|s_t). E[r_t] = 1 (IS is unbiased). Var(r_t) = E[r_t²] - 1. The product Πr_t has E[(Πr_t)²] = ΠE[r_t²] = (1 + Var(r_t))^T. Even if per-step variance is 0.1, after 100 steps: (1.1)¹⁰⁰ ≈ 13,781. The IS estimate is essentially noise.

The key insight: This exponential variance is WHY we use per-step IS ratios (as in PPO) rather than full-trajectory ratios. PPO clips the per-step ratio to [0.8, 1.2], bounding variance at each step rather than letting it accumulate.

Chapter 02

The Surrogate Objective & Its Flaw

From actor-critic, we have the advantage function Â^π(s,a) = Q^π(s,a) − V^π(s). It tells us: how much better is action a than what the policy would normally do in state s? Positive advantage = better than average. Negative = worse.

The surrogate objective lets us take multiple gradient steps on old data by importance-weighting each sample:

Surrogate Objective (IS-Weighted) J̃(θ') = Σ_t,i [ π_θ'(a|s) / π_θ(a|s) ] · Â^π(s, a)

Here θ is the old policy (the one that collected data) and θ' is the new policy being optimized. The ratio π_θ'/π_θ is the importance sampling ratio — it corrects for the distribution mismatch between who collected the data and who we're optimizing.

Think of it as a voting correction. If the old policy took action "left" 30% of the time but the new policy would take it 60% of the time, each "left" sample in our data should count double — because it's twice as representative of the new policy's behavior.

Why this works

When θ' = θ (first gradient step), the ratio is exactly 1.0 and the objective reduces to the standard advantage-weighted objective. As we take gradient steps, the ratio tells us: "How much more (or less) likely is this action under the new policy?" If the new policy is 2× more likely to take an action with positive advantage, the objective rewards that.

The Flaw: Policy Collapse

Nothing stops the optimization from going too far. If action a has positive advantage, gradient ascent increases π_θ'(a|s). The IS ratio grows: 1.5, 3.0, 10.0. The objective keeps climbing. But now the IS correction is wildly inaccurate — the data came from a policy that rarely took this action, and we're pretending it's common.

Worked Example

Old policy: π_θ(left|s) = 0.3. Advantage of "left" in state s: Â = +2.0.

After 5 gradient steps: π_θ'(left|s) = 0.9. IS ratio = 0.9/0.3 = 3.0.

Contribution to objective: 3.0 × 2.0 = 6.0. Looks great! But we only observed "left" 30% of the time in our data. We have very few samples supporting this estimate. We're extrapolating wildly from sparse data.

After 10 more steps: π_θ'(left|s) = 0.99. Ratio = 3.3. The policy has gone all-in on an action it barely tested.

The overfitting trap

Too many gradient steps on the same batch → the policy overfits to the particular trajectories it saw, not the true advantage landscape. It's the same pathology as training a supervised model for too many epochs on a small dataset. The fix isn't early stopping — it's PPO.

Chapter 03

PPO: Clipping & the Min Trick

Proximal Policy Optimization (Schulman et al., 2017) solves the overfitting problem with two elegant tricks. Both are simple to implement and together they make PPO the most widely-used RL algorithm in the world. The name "proximal" means "nearby" — the policy stays proximal (close) to its old version.

Definition

PPO — Proximal Policy Optimization

An on-policy (with mild off-policy reuse) actor-critic algorithm that takes multiple gradient steps per batch using a clipped surrogate objective. Introduced by Schulman et al. (2017) at OpenAI. Default choice for stable, scalable policy optimization.

Trick #1: Clip the IS Ratio

Instead of letting the IS ratio grow without bound, clip it to [1−ε, 1+ε]:

PPO Trick #1 — Clipped IS Ratio clip( π_θ'(a|s) / π_θ(a|s), 1 − ε, 1 + ε ) · Â^π(s, a)

Typical ε = 0.2, so ratio is clamped to [0.8, 1.2]

Once the ratio hits 1.2, the clipped objective goes flat. No matter how much more likely you make the action, the objective stops increasing. The policy has no incentive to deviate further.

Numerical Example (ε = 0.2, Â = +2.0)

IS ratio = 1.0: unclipped = 1.0 × 2 = 2.0. Clipped = 1.0 × 2 = 2.0. Same.

IS ratio = 1.1: unclipped = 1.1 × 2 = 2.2. Clipped = 1.1 × 2 = 2.2. Still within range.

IS ratio = 1.5: unclipped = 1.5 × 2 = 3.0. Clipped = 1.2 × 2 = 2.4. Clamped!

IS ratio = 3.0: unclipped = 3.0 × 2 = 6.0. Clipped = 1.2 × 2 = 2.4. Still 2.4. No incentive to go further.

Trick #2: Take the Minimum

Clipping alone has a subtle edge case. When the advantage is negative and we clip the ratio downward (to 0.8), the clipped objective can actually be better than the unclipped one. This would still let the policy deviate in harmful directions.

Fix: take the minimum of the clipped and unclipped objectives:

The PPO Surrogate Objective (Final Form) J̃_PPO(θ') = Σ_t,i min(
r_t(θ') · Â_t ,
clip( r_t(θ'), 1−ε, 1+ε ) · Â_t
)

where r_t(θ') = π_θ'(a_t|s_t) / π_θ(a_t|s_t)

Why min works

The min operator is pessimistic. For any (state, action) pair, it picks whichever objective is worse. If clipping makes things look better (rare edge case with negative advantages), the min reverts to the unclipped version. If clipping makes things look worse (the normal case, preventing over-deviation), the min keeps the clip. Either way, the policy never benefits from straying too far.

Negative Advantage Case (ε = 0.2, Â = −1.5)

IS ratio = 0.7: unclipped = 0.7 × (−1.5) = −1.05. Clipped = 0.8 × (−1.5) = −1.2. min(−1.05, −1.2) = −1.2. Takes the worse (more negative) one — policy is discouraged from reducing the probability of this bad action too aggressively.

IS ratio = 1.3: unclipped = 1.3 × (−1.5) = −1.95. Clipped = 1.2 × (−1.5) = −1.8. min(−1.95, −1.8) = −1.95. Takes the unclipped version — doesn't let clipping soften the penalty.

The genius of PPO

Two lines of code — a clamp() and a min() — replace the complicated constrained optimization (KL penalties, trust regions) of earlier methods like TRPO. That's why PPO took over: same stability, fraction of the complexity.

Chapter 04

Generalized Advantage Estimation

We need Â^π(s,a) for the PPO objective. But which advantage estimate should we use? From actor-critic, we know several options with different bias-variance tradeoffs:

Definition

N-step Advantage Estimate — Â_n

Use n steps of actual reward, then bootstrap from V̂ for the rest:

Â₁ = r_t + γV̂(s_t+1) − V̂(s_t) ← low variance, high bias
Â₂ = r_t + γr_t+1 + γ²V̂(s_t+2) − V̂(s_t)
Â_N = Σ_k=0^N−1 γ^kr_t+k + γ^NV̂(s_t+N) − V̂(s_t) ← high variance, low bias

Short horizon (small n): low variance because we rely on the learned V̂ to fill in the future, but biased if V̂ is inaccurate (early in training, V̂ is terrible). Long horizon (large n): less bias because we use real observed rewards, but high variance because those rewards are noisy and path-dependent.

The bias-variance dilemma

If we bootstrap too early (small n), our advantage estimates inherit all the errors in V̂. If we bootstrap too late (large n), we're back to the high-variance Monte Carlo estimates that made REINFORCE so noisy. Neither extreme is good. We need a blend.

GAE's insight: don't pick one horizon — use all of them, with exponentially decaying weights.

Generalized Advantage Estimation (GAE) Â^GAE_t = Σ_n=1^N w_n · Â_n,t

w_n = λⁿ⁻¹(1 − λ) exponentially decaying weights, λ ∈ [0,1]

The parameter λ controls the bias-variance tradeoff:

What λ does

λ = 0: w₁ = 1, all other weights = 0. Pure 1-step TD. Lowest variance, highest bias. GAE reduces to Â₁ = r_t + γV̂(s_t+1) − V̂(s_t).

λ = 1: equal weights on all horizons. Pure Monte Carlo return minus baseline. Lowest bias, highest variance.

λ = 0.95 (typical): mostly uses short horizons but sprinkles in longer ones. Practical sweet spot.

Efficient computation

GAE has an elegant recursive form using the TD residual δ_t = r_t + γV̂(s_t+1) − V̂(s_t):

Â^GAE_t = Σ_k=0^N−1 (γλ)^k δ_t+k

Computed with a single backward pass through the trajectory. No need to compute each Â_n separately. In code: start from the last timestep, accumulate gae = delta + gamma * lambda * gae.

Numerical Walkthrough (γ = 0.99, λ = 0.95)

Suppose we have 3 timesteps of data with V̂ predictions:

• t=0: r₀ = +1, V̂(s₀) = 5.0, V̂(s₁) = 5.5 → δ₀ = 1 + 0.99×5.5 − 5.0 = +1.445

• t=1: r₁ = +2, V̂(s₁) = 5.5, V̂(s₂) = 6.0 → δ₁ = 2 + 0.99×6.0 − 5.5 = +2.44

• t=2: r₂ = −1, V̂(s₂) = 6.0, V̂(s₃) = 4.0 → δ₂ = −1 + 0.99×4.0 − 6.0 = −3.04

GAE at t=0: δ₀ + (γλ)δ₁ + (γλ)²δ₂ = 1.445 + 0.9405×2.44 + 0.8845×(−3.04) = 1.445 + 2.295 − 2.689 = +1.051

Compare: pure 1-step (λ=0) gives Â₀ = δ₀ = +1.445. Pure MC gives a different number. GAE blends them.

GAE = TD(λ) for advantages

If you've seen TD(λ) in RL textbooks, GAE is the same idea applied to advantage estimates instead of value estimates. The λ parameter is the eligibility trace decay. Schulman et al. (2016) showed this works exceptionally well with policy gradients.

🔨 Derivation Retrace(λ): Truncated Importance Sampling for Off-Policy Returns ▶ ✓ ATTEMPTED

GAE assumes on-policy data. For fully off-policy data (replay buffers), we need Retrace(λ) (Munos et al., 2016). The key idea: truncate IS ratios at 1 to prevent variance explosion while maintaining a fixed-point guarantee.

Your task: Starting from the off-policy Bellman operator, derive why truncating c_t = min(1, π(a_t|s_t)/μ(a_t|s_t)) still converges to Q^π, and show the Retrace target formula.

The general multi-step off-policy return is: Q_ret(s,a) = Q(s,a) + Σ_t γ^t (Π_i=1^t c_i) δ_t, where δ_t = r_t + γ E_π[Q(s_t+1,·)] - Q(s_t,a_t) is the expected TD error.

At the fixed point, Q = Q^π, so δ_t = 0 for all t regardless of the truncation coefficients c_t. The truncation only affects the path TO the fixed point (convergence speed), not the fixed point itself. Since min(1, π/μ) > 0 wherever μ has support, all transitions still contribute — convergence is guaranteed.

Without truncation, c_t = π/μ can be huge when π assigns much higher probability than μ. The product Πc_i has variance that grows exponentially. With truncation at 1, max(c_t) = 1, so Πc_i ≤ 1 always. Variance is bounded by definition — no exponential blowup.

Retrace(λ) target:

Q_ret(s₀,a₀) = Q(s₀,a₀) + Σ_t=0^T-1 γ^t (Π_i=1^t c_i) [r_t + γ E_a~π[Q(s_t+1,a)] - Q(s_t,a_t)]

where c_t = λ · min(1, π(a_t|s_t) / μ(a_t|s_t))

Why it converges: At the fixed point Q = Q^π, the TD error δ_t = r_t + γ V^π(s_t+1) - Q^π(s_t,a_t) = 0 by definition of Q^π. So Q_ret = Q^π regardless of the c_t values. The truncation only affects the speed and stability of convergence, not the solution.

Variance bound: Since c_t ≤ λ ≤ 1, the product Π_ic_i ≤ λ^t → 0 exponentially. This means distant rewards get exponentially down-weighted, keeping variance bounded. Compare to untruncated IS where Π(π/μ) can be 10⁵ for a single trajectory.

The key insight: Retrace is "safe IS" — it never over-weights a transition (ratio capped at 1), so it converges reliably with any behavior policy. SAC implicitly avoids this issue by using single-step TD (no multi-step returns), but algorithms like R2D2 and IMPALA use Retrace variants for their replay buffers.

Checkpoint — Before you move on

Explain in your own words: why can PPO reuse a batch for ~10 gradient steps, but not 1000? What specifically goes wrong, and how does the clip mechanism address it?

✓ Gate cleared

Model Answer

After many gradient steps, the policy moves far from the data-generating policy. The IS ratio π_θ'/π_θ deviates significantly from 1.0, meaning the gradient estimates become unreliable — we're extrapolating from sparse data about what the new policy would do. The clipped objective J_PPO = min(r·A, clip(r, 1-ε, 1+ε)·A) creates a flat region beyond the clip boundary. Once the ratio exceeds 1+ε (or drops below 1-ε), the gradient is zero — there's no incentive to move further. This effectively creates a trust region where the policy can only move ε per iteration. With 10 epochs, the ratio might reach ~1.2 before clipping kicks in. With 1000 epochs, every action would be clipped on the first few steps, and subsequent steps would have zero gradient — wasting compute but not diverging.

Chapter 05

The Complete PPO Algorithm

Now we have all the pieces: a clipped surrogate objective that prevents the policy from drifting too far, and GAE for low-variance advantage estimates. PPO combines them with a learned value function and a specific training loop.

What follows is the full algorithm with real hyperparameters from practice. These aren't toy values — they're what people actually use in MuJoCo, Atari, and LLM alignment.

Algorithm: Proximal Policy Optimization (PPO)

Initialize policy π_θ and value function V̂_φ (often shared backbone)
For iteration k = 1, 2, …, ~500:
1. Collect batch: Run π_θ in environment for ~2,000 timesteps across multiple parallel envs
2. Fit value function: Update V̂_φ to predict discounted returns using collected data
3. Compute GAE advantages: Â^GAE_t = Σ_k (γλ)^k δ_t+k for every timestep
4. Normalize advantages: Â ← (Â − μ) / σ (per-batch standardization)
5. For epoch e = 1, 2, …, ~10:
  1. Shuffle data into minibatches of ~64
  2. For each minibatch: compute ∇_θ J̃_PPO and update θ (~300 total gradient steps)
6. Set θ = new parameters, discard old data, go to (a)

The Hyperparameters That Matter

Hyperparameter	Typical Value	What It Controls
ε (clip range)	0.2	How far policy can deviate per iteration
γ (discount)	0.99	How far ahead the agent plans
λ (GAE)	0.95	Bias-variance in advantage estimates
Batch size	2,048 timesteps	Data per iteration (across parallel envs)
Epochs per batch	10	How many passes over the same batch
Minibatch size	64	SGD minibatch (batch / minibatch = steps per epoch)
Total iterations	~500	~1M total environment timesteps
Learning rate	3×10⁻⁴	Adam step size

Concrete Numbers

Per iteration: 2,048 timesteps collected. 10 epochs × (2048/64) = 10 × 32 = 320 gradient steps per iteration. That's 320 gradient updates from one batch of experience!

Total training: 500 iterations × 2,048 = 1,024,000 total timesteps. ~320,000 gradient steps. A MuJoCo Humanoid can learn to walk in roughly this budget.

Compare to REINFORCE: 1 gradient step per 2,048 timesteps = only 500 gradient steps total. PPO is 640× more gradient-efficient per timestep.

Why normalize advantages?

Unnormalized advantages can have wildly different scales across environments. A locomotion task might have advantages in [−100, +100], while a manipulation task has [−0.5, +0.5]. Normalizing to zero mean and unit variance makes the same ε = 0.2 clip range work across all tasks. This is one of PPO's "invisible" tricks that makes it so plug-and-play.

Common PPO pitfall

The value function V̂_φ must be trained on the current batch's returns, not stale data. Some implementations share a backbone between policy and value function — be careful that value loss gradients don't destroy the policy features. Many practitioners use separate networks or a value loss coefficient of 0.5.

Chapter 06

Replay Buffers

PPO reuses each batch for ~10 epochs, then discards it. Can we go further? What if we kept all past experience in a giant buffer and sampled from it indefinitely?

Definition

Replay Buffer

A database storing past transitions (s, a, r, s'). Typically holds the last 1M transitions. Each training step samples a random minibatch from the buffer. New experience is added continuously; the oldest is evicted. Used in DQN, DDPG, TD3, SAC, and many others.

The appeal is obvious: instead of 2,000 timesteps per iteration, you train on millions of stored transitions. Data efficiency goes through the roof. A robot that took 2 hours to learn with PPO might learn in 15 minutes with a replay buffer. But there's a fundamental problem.

Replay buffer in practice

Imagine a circular buffer of 1M transitions. The robot has been training for an hour. The buffer contains: transitions 1–300K from a terrible random policy, 300K–600K from a mediocre policy, and 600K–1M from the current reasonable policy. When we sample a minibatch of 256, we get a random mix of all three. The Q-function must make sense of this temporal soup.

Why Replay Breaks the Value Function

In PPO, we fit V̂^π(s) — the value of state s under the current policy π. The data comes from the current policy, so the target returns are valid.

With a replay buffer, data comes from many past policies. When we fit V̂ on this mixed data, which policy's value function are we learning? Not the current one. Not any past one. Some incoherent average.

The Problem Concretely

Buffer contains transitions from policies π₁ through π₁₀₀. Policy π₁ (early, random) visited state s and got reward 0. Policy π₅₀ (later, competent) visited the same state and got reward +5 (because it took better future actions). If we fit V(s) on both, we get V(s) ≈ 2.5 — which is wrong for every policy.

V(s) depends on what the policy does after visiting s. Old data reflects old policies' future behavior. The algorithm is broken.

Cannot naively combine replay + V(s)

V^π(s) = expected future reward under policy π. Past data was generated under different policies with different future behaviors. Fitting V on this data creates a meaningless chimera. We need a fundamentally different approach.

V vs Q: the key difference

V(s) answers: "How good is this state, assuming I follow my policy from here?" This depends on the policy's future actions — which are different for each policy. Q(s,a) answers: "How good is this state-action pair, assuming I follow my policy from s' onward?" The first action is fixed — it's the one in the data. Only future actions need to come from the current policy.

The fix turns out to be surprisingly elegant: stop learning V(s) and start learning Q(s,a).

Chapter 07

Fitting Q Instead of V

The key insight: V(s) depends on what actions the policy takes in the future. Q(s,a) already specifies the first action. If we also sample the next action from the current policy, we can build valid targets even from old data.

Definition

Q-function with Current Policy Actions

Q^π(s, a) = r(s, a) + 𝔼_{s' ~ p, ā' ~ π_θ}[ Q^π(s', ā') ]

The next action ā' is sampled from the current policy π_θ, not from whatever old policy generated the data. The transition (s, a, s') comes from the buffer — any policy. Only ā' must be fresh.

The Training Procedure

Q-Function Fitting with Replay Buffer

Sample minibatch of (s_i, a_i, r_i, s'_i) from replay buffer
Sample fresh action ā_i ~ π_θ(·|s'_i) from the current policy
Compute target: y_i = r_i + γ Q̂^π(s'_i, ā_i)
Fit Q̂ by minimizing: Σ_i ( Q̂(s_i, a_i) − y_i )²

Why this fixes the problem

The (s, a, r, s') transition is just physics — it doesn't matter which policy generated it. The environment dynamics are the same regardless of who took the action. The only place the policy matters is in predicting future behavior, and we handle that by sampling ā' from the current policy. The reward r(s,a) and next state s' are ground truth from the environment — they're valid forever.

Updating the Policy

With Q(s,a) instead of V(s), the policy update changes too. We don't need an explicit advantage baseline — Q already captures action quality:

Policy Update from Q θ ← θ + α · ∇_θ 𝔼_{s~buffer, a~π_θ}[ Q̂(s, a) ]

Sample states from the buffer, sample fresh actions from the current policy, and push the policy toward actions that Q says are good. No IS ratios needed — the actions are from the current policy.

No advantage baseline = higher variance

With V(s) we computed advantages Â = Q − V, centering the signal around zero. Without V, raw Q-values can be all positive or all negative, making the gradient noisy (the same "all rewards positive" problem from REINFORCE). In practice, the replay buffer provides so much data that this extra variance is manageable. SAC also adds entropy regularization which further stabilizes training.

The tradeoff

Without a baseline (V), the gradient has higher variance. But we're sampling from the entire replay buffer — millions of transitions. The sheer volume of data compensates. In practice, the current policy's actions are better than the old policy's (the policy is improving), so Q-values for current actions tend to be positive anyway.

Numerical Walkthrough

Buffer transition: (s=standing, a=lean_left, r=−0.5, s'=tilting_left). Old policy π₃₀ collected this.

Current policy π₅₀ samples ā' = correct_right in state s'. Q̂(s', correct_right) = +3.0.

Target: y = −0.5 + 0.99 × 3.0 = +2.47.

We fit Q̂(standing, lean_left) → 2.47. The physics of "lean left leads to tilting left" is valid regardless of which policy discovered it. The future value uses the current policy's action — that's what makes it correct.

Chapter 08

The Complete SAC Algorithm

Soft Actor-Critic (SAC), introduced by Haarnoja et al. (2018), builds on the Q-function replay buffer approach with two additional ideas: entropy regularization and twin Q-networks. Together, these turn the basic replay-buffer actor-critic into a robust, sample-efficient algorithm that works on real robots.

Definition

Entropy Regularization

Add a bonus for policy randomness: J_SAC = 𝔼[ Σ_t r(s_t,a_t) + α H(π(·|s_t)) ]. The policy maximizes reward and entropy simultaneously. Prevents premature convergence to a single action. α controls the reward-exploration tradeoff (automatically tuned in practice).

The "soft" in Soft Actor-Critic refers to this entropy bonus — the policy stays "soft" (stochastic) rather than collapsing to a deterministic action. A deterministic policy that always takes the same action has zero entropy. A uniform-random policy has maximum entropy. SAC finds the sweet spot: as much reward as possible while staying as random as possible.

Definition

Twin Q-Networks

Train two independent Q-functions Q̂₁ and Q̂₂. Use min(Q̂₁, Q̂₂) for targets. Why? Q-learning tends to overestimate values (the max over noisy estimates is biased upward). Taking the minimum of two estimates counteracts this. Same trick as TD3 (Fujimoto et al. 2018).

The Reparameterization Trick

SAC uses Gaussian policies: π_θ(a|s) = 𝒩(μ_θ(s), σ_θ(s)). To backpropagate through the policy sample, we use the reparameterization trick:

Reparameterized Sampling a = μ_θ(s) + σ_θ(s) · ξ, ξ ~ 𝒩(0, 1)

Noise ξ is sampled externally; μ and σ are differentiable w.r.t. θ

Now ∇_θ Q(s, a) flows through a back to θ via the chain rule. No need for log-probability tricks or IS ratios.

Algorithm: Soft Actor-Critic (SAC)

Initialize policy π_θ, two Q-functions Q̂_φ₁ and Q̂_φ₂, target networks Q̄_φ₁ and Q̄_φ₂, replay buffer 𝒟
For each environment step:
1. Act: a ~ π_θ(·|s), observe r, s'
2. Store (s, a, r, s') in 𝒟
For each gradient step: (can do 1 or more per env step)
1. Sample minibatch of (s, a, r, s') from 𝒟
2. Sample ā' ~ π_θ(·|s') for next-state actions
3. Q-targets: y = r + γ ( min(Q̄₁(s',ā'), Q̄₂(s',ā')) − α log π_θ(ā'|s') )
4. Update Q̂₁, Q̂₂ to minimize (Q̂ − y)²
5. Update π_θ to maximize 𝔼_a~π[ min(Q̂₁(s,a), Q̂₂(s,a)) − α log π_θ(a|s) ]
6. Soft-update targets: Q̄ ← τQ̂ + (1−τ)Q̄, typically τ = 0.005

SAC Hyperparameters

Replay buffer size: 1M transitions. Minibatch: 256. Learning rate: 3×10⁻⁴ for all networks. γ: 0.99. τ: 0.005 (target smoothing). α: auto-tuned (start ~0.2). Gradient steps per env step: 1 (but can go higher for data efficiency).

SAC = max reward + max entropy

The entropy bonus has a deep practical effect: it prevents the policy from committing too early to one strategy. If two paths through a maze give similar reward, SAC keeps both options alive. This makes it much more robust to local optima and helps with exploration. It also means SAC policies transfer better — they've explored more of the state space during training.

Target networks: why the soft update?

The Q-targets use target networks Q̄ (updated slowly) rather than the current Q̂ (updated every step). Why? Without targets, training is unstable: Q̂ updates → targets change → Q̂ chases a moving target. The soft update Q̄ ← 0.005 · Q̂ + 0.995 · Q̄ keeps targets nearly stationary, giving the Q-function a stable regression target. Same trick from DQN (2015), refined for continuous actions.

🔨 Derivation SAC Optimal Policy is Boltzmann (Derive from KKT Conditions) ▶ ✓ ATTEMPTED

SAC maximizes J = E[Σ_t r(s_t,a_t) + α H(π(·|s_t))]. The per-state objective is: max_π E_a~π[Q(s,a)] + α H(π(·|s)), subject to Σ_a π(a|s) = 1.

Your task: Derive that the optimal policy has the form π*(a|s) = exp(Q(s,a)/α) / Z(s), i.e., a Boltzmann (softmax) distribution over Q-values scaled by temperature α.

From Q(s,a) - α log π(a|s) - α + λ = 0, we get log π(a|s) = Q(s,a)/α + (λ - α)/α. So π(a|s) = exp(Q(s,a)/α) · exp((λ-α)/α). The normalization Σπ = 1 determines the constant: exp((λ-α)/α) = 1/Z(s) where Z(s) = Σ_a exp(Q(s,a)/α).

Step 1: Objective per state: max_π Σ_a π(a|s) Q(s,a) - α Σ_a π(a|s) log π(a|s)

Step 2: Lagrangian with constraint Σπ = 1:

L = Σ_a π(a|s)[Q(s,a) - α log π(a|s)] + λ[1 - Σ_a π(a|s)]

Step 3: ∂L/∂π(a|s) = Q(s,a) - α log π(a|s) - α - λ = 0

Step 4: Solve: log π*(a|s) = (Q(s,a) - α - λ) / α = Q(s,a)/α - 1 - λ/α

π*(a|s) = exp(Q(s,a)/α) · C, where C = exp(-1 - λ/α) is a constant.

Step 5: Normalize: Σ_a π*(a|s) = 1 ⇒ C = 1/Z(s) where Z(s) = Σ_a exp(Q(s,a)/α).

Final: π*(a|s) = exp(Q(s,a)/α) / Z(s)

The key insight: The optimal SAC policy is a Boltzmann distribution with temperature α. As α → 0, this approaches argmax (deterministic, pure exploitation). As α → ∞, it approaches uniform (maximum entropy, pure exploration). SAC auto-tunes α to maintain a target entropy, dynamically balancing this tradeoff throughout training.

💥 Break-It Lab What Dies When You Remove SAC Components? ▶ ✓ ATTEMPTED

SAC has three critical components beyond the basic actor-critic: entropy regularization, twin Q-networks, and target network soft updates. Each prevents a specific failure mode. Toggle them off to see what breaks.

High IS Ratios (stale replay data, ratio > 10) NORMAL

Failure mode: With very old data in the replay buffer, the implicit IS ratio between current policy and behavior policy becomes enormous. Q-targets based on old transitions have high variance. The Q-function oscillates wildly, and the policy follows these noisy signals into catastrophic actions. Fix: SAC avoids explicit IS by sampling fresh actions a'~π for targets, but stale (s,a,r,s') transitions still create issues when the environment dynamics have shifted. Use priority replay or limit buffer age.

Remove Entropy Regularization (α = 0) NORMAL

Failure mode: Without entropy, the policy quickly collapses to a near-deterministic action at each state. Once committed, it stops exploring alternatives. If the Q-function has a local maximum (common early in training), the policy locks onto it permanently. Training flatlines at a suboptimal return. The Boltzmann structure degenerates to argmax — no stochasticity to escape local optima.

Wrong Temperature (α = 10, way too high) NORMAL

Failure mode: With α too large, the entropy bonus dominates the reward. The policy approaches uniform random (π ≈ Uniform) because any deviation from maximum entropy is penalized more than any reward gain. The agent "explores" forever and never exploits. Training appears stable (no divergence) but reward never improves. It's the opposite failure of no entropy: instead of collapsing, the policy refuses to commit.

Chapter 09

PPO vs. SAC: When to Use Which

These are the two dominant algorithms in modern deep RL. They represent fundamentally different philosophies — how much you trust your data vs. how much data you can afford to collect. Choosing between them is one of the first decisions in any RL project, and it shapes everything from your infrastructure to your debugging strategy.

Dimension	PPO	SAC
Data paradigm	More on-policy: collect batch, train ~10 epochs, discard	Fully off-policy: replay buffer of all past data
Data efficiency	Low — discards data after each iteration	High — reuses all past experience
Stability	Very stable — clipping prevents catastrophic updates	Less stable — Q-function approximation errors can compound
Ease of use	"Plug and play" — default hyperparams often work	More tuning needed (buffer size, α, target update rate)
Action space	Discrete or continuous	Primarily continuous (discrete variants exist)
Markov assumption	Can relax with MC returns	Heavy reliance (Q-learning requires Markov property)
Parallelism	Naturally parallel (many envs collecting simultaneously)	Harder to parallelize (replay buffer bottleneck)
Wall-clock speed	Faster if simulation is cheap	Faster if simulation is expensive (fewer env steps needed)

The Decision Rule

Simulation is cheap (video games, MuJoCo, LLM training): use PPO. You can generate data quickly, so data efficiency doesn't matter. Stability does.

Simulation is expensive (real robots, complex physics): use SAC. Every timestep costs real time or real money. You need to squeeze maximum learning from every sample.

Real Decision Examples

RLHF for ChatGPT: PPO. Each "trajectory" is an LLM response (~cheap to generate with GPUs). Training stability is paramount — a collapsed policy means gibberish output.

Robot arm learning to insert a peg: SAC. Each attempt takes 10 seconds. With PPO, you'd need 500 iterations × 2,000 timesteps = 1M timesteps × 10s = 116 days. SAC can learn from ~100K timesteps — 11.5 days, or 2 hours with parallelism.

Both benefit from demonstrations

If you have expert demonstrations (teleoperation, human play, etc.), seed the replay buffer (SAC) or pretrain with behavioral cloning (PPO). Starting from scratch is always slower than starting from a rough approximation. This is not cheating — it's good engineering.

🏗 Design Challenge You're the Architect: SAC for a Dexterous Hand ▶ ✓ ATTEMPTED

You're deploying SAC on a real 24-DOF dexterous hand (Shadow Hand or Allegro). The task: pick up a small object from a table. Reward is sparse (1.0 on success, 0 otherwise). The robot is real — no simulator reset; after each attempt, a human must place the object back. Budget: 2 hours of real-world training time at 10 Hz control frequency = 72,000 environment steps maximum.

DOF

24 joint angles (continuous)

Observation

Joint positions + wrist-mounted camera (224x224)

Reward

Sparse: 0/1 (object lifted 5cm)

Budget

72K real steps, no simulator

Safety

No self-collision, joint limits respected

Reset

Human reset (~30s), so episode length matters

1. Sparse reward means almost no learning signal early on. How do you generate enough successful episodes to start learning? (Hint: think about initialization strategy and the entropy coefficient.)

2. 24-DOF action space is massive. Do you learn all joints from scratch, or decompose the problem? What about the camera observation — end-to-end or separate encoder?

3. With 72K steps and sparse reward, standard SAC will spend most of its budget failing. How do you incorporate the few human demonstrations you can collect (say, 20 teleoperated successes)?

4. Safety: how do you prevent the hand from damaging itself during exploration? The entropy bonus encourages random actions, which could slam joints into limits.

5. The replay buffer contains mostly failures. How do you ensure the rare successes aren't drowned out?

Real-world solution (Luo et al., 2024 — HIL-SERL):

1. Warm-start with demos: Collect 20-50 human demonstrations via teleoperation. Seed the replay buffer AND pre-train the policy with BC. This gives non-zero initial success rate (~10-20%).

2. Reward shaping: Add dense reward components: (a) negative distance to object, (b) grasp contact reward, (c) lift progress reward. These guide exploration toward grasping without requiring full sparse success.

3. Residual policy: Learn a residual on top of the BC policy: a_total = a_BC + a_residual. The residual is small (bounded), so the hand stays near demonstrated behavior. SAC optimizes the residual only.

4. Safety layer: Action clipping at the joint level + velocity limits. A separate safety controller overrides any action that would cause self-collision (computed analytically from kinematics). The entropy bonus explores within the safe action space.

5. Prioritized replay: Successful episodes get 10x sampling priority. With only 5% success rate, this means ~33% of each training batch contains successful transitions. Also: Hindsight Experience Replay (HER) — relabel failed episodes with "the state you actually reached" as the goal.

6. Camera encoder: Pre-trained (frozen) ResNet-18, fine-tuned only the last layer. End-to-end from scratch would waste the limited data budget on visual features. The encoder outputs a 512-d vector concatenated with joint states.

Result: 90%+ success rate in under 2 hours of real training. Key insight: SAC's sample efficiency shines when you remove the cold-start problem with demonstrations.

⚔ Adversarial: SAC with Wrong Temperature

You're training SAC on a manipulation task. The auto-tuned α stabilizes at 0.2. A colleague suggests manually setting α = 5.0 "to encourage more exploration in the early phase." After 100K steps with α = 5.0, you switch to auto-tuning. What happens?

The Q-function diverges during the high-α phase The high-α phase collects diverse data, giving a net benefit after switching The Q-function learned during high-α is trained to predict soft Q-values (including entropy bonuses) at the wrong temperature, making it initially misleading after the switch Nothing changes — the Q-function is independent of α

Chapter 10

Real-World Impact

PPO and SAC are not just textbook algorithms. They are production systems powering the most impressive real-world AI deployments. Understanding where each shines validates the theory from earlier chapters and gives you intuition for your own projects.

SAC: Real-Robot Locomotion

Haarnoja et al. (2018) trained a real quadruped robot (Minitaur) to walk from scratch in under 2 hours. No simulation. No human demonstration. The robot fell over thousands of times, and SAC's replay buffer let it learn from every fall. The entropy bonus was especially critical: it kept the robot trying diverse gaits rather than getting stuck repeating one bad strategy. This was a watershed moment — proof that RL could work on hardware within a human attention span.

SAC: Precise Robotic Assembly

Luo et al. (2024) developed HIL-SERL (Human-in-the-Loop Sample-Efficient RL) for precise peg insertion and cable routing. SAC with a small number of human corrections outperformed pure imitation learning — it was both faster to train and more accurate. The robot learned sub-millimeter precision that imitation alone couldn't capture. The human interventions seeded the replay buffer with high-quality transitions, giving SAC a warm start rather than learning from scratch.

PPO: Sim-to-Real Locomotion

Tan et al. (2018) at Google trained a quadruped in simulation with PPO, then transferred the policy to a real Laikago robot. The sim-to-real transfer worked because PPO's stability meant the policy didn't exploit simulator artifacts — it learned robust gaits that generalized to reality.

PPO: Solving the Rubik's Cube

OpenAI (2019) used PPO to train a dexterous robot hand to solve a Rubik's cube. The policy was trained entirely in simulation with massive domain randomization (changing physics parameters, lighting, cube size) and transferred zero-shot to a real Shadow Hand. This required billions of timesteps across thousands of parallel workers — possible only because PPO scales effortlessly with parallel simulation. Each worker collects its own batch; batches are aggregated for one synchronized gradient step. No replay buffer coordination needed.

PPO: Alignment of Large Language Models

InstructGPT and ChatGPT use PPO for RLHF (Reinforcement Learning from Human Feedback). The "environment" is: generate a response, get a reward from a trained reward model. PPO's stability is critical — a single bad update could make the model produce gibberish or harmful content. The clip mechanism prevents catastrophic forgetting of language capabilities.

RLHF as an MDP

State: prompt + tokens generated so far. Action: next token (discrete, ~50K vocabulary). Reward: 0 at every step except the last, where a reward model scores the complete response. Episode length: response length (~100-500 tokens). PPO with a KL penalty against the supervised-fine-tuned model keeps the language model coherent while steering toward human preferences.

The pattern

Cheap simulation + need stability: PPO (Rubik's cube, RLHF, sim-to-real). Expensive real data + need sample efficiency: SAC (real robots, precision tasks). The theory from this lesson directly predicts which algorithm wins in each scenario.

Timeline

Key Milestones

2015: TRPO (Schulman et al.) — trust regions for policy optimization.

2016: GAE (Schulman et al.) — the advantage estimation we use today.

2017: PPO (Schulman et al.) — replaced TRPO with clipping. Immediate adoption.

2018: SAC (Haarnoja et al.) — entropy-regularized off-policy actor-critic.

2019: Rubik's cube with PPO (OpenAI) — dexterous manipulation milestone.

2022: ChatGPT uses PPO for RLHF — PPO goes mainstream.

2024: HIL-SERL with SAC — state-of-the-art real-robot manipulation.

🔗 Pattern Recognition

SAC's Soft Q = Q-Learning with Entropy Regularization

This Lesson (SAC)

Q_soft(s,a) = r + γ E[Q(s',a') - α log π(a'|s')]
Policy: π*(a|s) ∝ exp(Q(s,a)/α)

Q-Learning (DQN)

Q(s,a) = r + γ max_a' Q(s',a')
Policy: π*(a|s) = argmax Q(s,a) → Q-Learning lesson

SAC is Q-learning where the hard max becomes a soft max (log-sum-exp). As α→0, the Boltzmann policy becomes argmax and SAC reduces to standard Q-learning. The entropy bonus is the "softening" that makes the max differentiable and the policy stochastic. This same Boltzmann-to-argmax spectrum appears everywhere: from simulated annealing to energy-based models.

Where else in ML do you see this pattern of "soften a discrete operation to make it differentiable"? (Hint: think about attention mechanisms.)

🔗 Pattern Recognition

SAC Entropy ≈ RLHF KL Penalty

SAC (This Lesson)

J = E[r(s,a)] + α H(π)
= E[r(s,a)] - α E[log π(a|s)]
Encourages policy to stay random

RLHF for LLMs

J = E[R_model(y|x)] - β KL(π_θ || π_ref)
= E[R(y|x)] - β E[log π_θ - log π_ref]
Encourages policy to stay near reference

Both add a regularizer that prevents the policy from collapsing. SAC penalizes deviation from uniform (via entropy). RLHF penalizes deviation from the SFT model (via KL). The mathematical structure is identical: reward + regularization toward a prior. SAC's prior is uniform; RLHF's prior is the pre-trained model. Both use temperature/β to control the tradeoff.

If you replaced SAC's entropy bonus with a KL penalty toward the BC policy (KL(π||π_BC)), what algorithm would you get? (Answer: something very close to AWAC / offline RL methods.)

💻 Build It Implement SAC Policy Update Step ▶ ✓ ATTEMPTED

Given a batch of states from the replay buffer, implement the SAC actor loss. The policy outputs mean and log_std of a Gaussian. You need to: (1) sample an action using reparameterization, (2) compute log probability, (3) evaluate Q, (4) compute the actor loss including entropy bonus.

signature def sac_actor_loss(policy_net, q_net1, q_net2, states, alpha): """ Compute SAC actor loss for a batch of states. Args: policy_net: network that outputs (mean, log_std) given states q_net1, q_net2: twin Q-networks states: tensor of shape (batch_size, state_dim) alpha: temperature parameter (scalar) Returns: loss: scalar tensor (to be minimized) """

Test case

With batch_size=256, state_dim=11 (HalfCheetah), alpha=0.2:
loss should be a scalar < 0 (since Q-values for good actions are positive and dominate the entropy term).

Sample: z = mean + std * epsilon (epsilon ~ N(0,1)). Then squash: a = tanh(z). The log_prob needs a correction for the tanh: log_pi = log_prob_gaussian(z) - sum(log(1 - tanh(z)^2)). This correction accounts for the change of variables from unbounded Gaussian to bounded [-1,1] actions.

def sac_actor_loss(policy_net, q_net1, q_net2, states, alpha):
    # Step 1: Forward pass through policy
    mean, log_std = policy_net(states)
    std = log_std.exp()

    # Step 2: Reparameterization trick
    epsilon = torch.randn_like(mean)
    z = mean + std * epsilon  # pre-squash action
    action = torch.tanh(z)    # squashed to [-1, 1]

    # Step 3: Log probability with tanh correction
    log_prob = Normal(mean, std).log_prob(z).sum(-1)
    # Jacobian correction for tanh squashing:
    log_prob -= torch.log(1 - action.pow(2) + 1e-6).sum(-1)

    # Step 4: Evaluate both Q-functions, take min (pessimistic)
    q1 = q_net1(states, action).squeeze(-1)
    q2 = q_net2(states, action).squeeze(-1)
    min_q = torch.min(q1, q2)

    # Step 5: Actor loss = E[alpha * log_pi - Q]
    # We MINIMIZE this, which MAXIMIZES (Q - alpha * log_pi)
    loss = (alpha * log_prob - min_q).mean()

    return loss

Bonus: Implement automatic temperature tuning. The target entropy is -dim(A) (negative action dimension). Update α to minimize: L(α) = E[-α (log π(a|s) + target_entropy)]. When entropy is too low, α increases to encourage exploration. When entropy is too high, α decreases.

Chapter 11

Full Summary & Cheat Sheet

You now have a complete understanding of the two dominant algorithms in modern deep RL. Let's consolidate everything into a reference sheet.

The Equation Progression

Step 1 — Surrogate Objective (from Actor-Critic)J̃(θ') = Σ_t,i [π_θ'(a|s) / π_θ(a|s)] · Â^π(s,a)

Step 2 — PPO Clipped ObjectiveJ̃_PPO = Σ_t,i min( r_t · Â_t, clip(r_t, 1−ε, 1+ε) · Â_t )

Step 3 — GAE Advantage EstimateÂ^GAE_t = Σ_k=0^N−1 (γλ)^k δ_t+k, δ_t = r_t + γV̂(s_t+1) − V̂(s_t)

Step 4 — Q-function for Replay (SAC path)Q^π(s,a) = r(s,a) + γ 𝔼_ā'~π[ Q^π(s', ā') ]

Step 5 — SAC ObjectiveJ_SAC = 𝔼[ Σ_t r(s_t,a_t) + α H(π(·|s_t)) ]

Decision Tree

Question	If Yes	If No
Is simulation cheap / fast?	PPO	SAC
Is stability critical (e.g., LLMs)?	PPO	Either
Need max sample efficiency?	SAC	PPO
Discrete actions?	PPO	Either (SAC prefers continuous)
Have expert demonstrations?	Pretrain/seed buffer, then either

From This Lecture Series

Lecture	Key Algorithm	Core Idea
L3: Policy Gradients	REINFORCE	Weight actions by reward, gradient ascent
L4: Actor-Critic	A2C / N-step AC	Replace reward-to-go with learned V̂
L5: Off-Policy AC (this lesson)	PPO, SAC	Reuse data with IS clipping or Q-functions

Each lecture built on the last: REINFORCE's high variance motivated the value function critic. The critic's data inefficiency motivated off-policy methods. PPO and SAC are the culmination — practical, scalable algorithms that are deployed worldwide.

Key Tricks Cheat Sheet

Trick	Problem It Solves	Used In
IS ratio clipping	Policy drifts too far from data	PPO
Min of clipped/unclipped	Edge case where clipping helps objective	PPO
GAE (λ)	Bias-variance tradeoff in advantages	PPO (and others)
Advantage normalization	Reward scale varies across tasks	PPO
Q instead of V	V is meaningless with mixed-policy data	SAC
Twin Q-networks	Q-value overestimation	SAC, TD3
Entropy regularization	Premature policy collapse	SAC
Target networks	Unstable Q-learning targets	SAC, DQN, TD3
Reparameterization	Backprop through stochastic sampling	SAC

The One Sentence

PPO clips importance-sampling ratios so you can reuse data safely. SAC fits a Q-function on a replay buffer so you can reuse all data. Everything else is implementation details.