← Gleams
CS 224R · CS 234 · CS 285 · Complete RL Theory

Deep Reinforcement Learning — The Complete Theory

Every algorithm, every derivation, every proof — from the Bellman equation to DPO. The definitive reference for deep RL theory.

Policy Gradients Actor-Critic TRPO & PPO SAC DQN Offline RL RLHF & DPO
Roadmap

What You'll Master

Chapter 01

Markov Decision Processes

Before we can solve anything, we need a language for "an agent acting in a world." That language is the Markov Decision Process (MDP) — a mathematical framework that captures states, actions, transitions, and rewards in one clean package.

Definition
MDP — The Tuple ⟨S, A, p, r, ρ0, γ, H⟩

S — state space (all possible situations the agent can be in, like positions on a board).
A — action space (all moves available, like up/down/left/right).
p(s' | s, a) — transition dynamics (if you're in state s and take action a, what's the probability of landing in s'? Think: physics of the world).
r(s, a) — reward function (immediate payoff for taking action a in state s — the score you get right now).
ρ0(s) — initial state distribution (where episodes start — like the starting position in a game).
γ ∈ [0, 1) — discount factor (how much you value future rewards vs. immediate ones; γ = 0.99 means you're patient, γ = 0.1 means you're impatient).
H — horizon (how many time steps per episode).

The Markov Property

The "Markov" in MDP means the future depends only on the current state, not on how you got there. Formally: P(st+1 | st, at, st−1, at−1, ...) = P(st+1 | st, at). The entire history is summarized in the current state. This is a modeling assumption — if you're playing chess, the board position tells you everything. If you're driving, your current speed and position matter more than where you drove an hour ago.

Understanding the Discount Factor γ

The discount factor γ is surprisingly subtle. It serves three purposes:

Three Roles of γ
Discount Factor γ ∈ [0, 1)

1. Time preference: γ = 0.99 means a reward one step from now is worth 0.99 of the same reward right now. A reward 100 steps from now is worth 0.99100 = 0.366 — we're patient but not infinitely so.

2. Mathematical convergence: Without discounting (γ = 1), the infinite sum ∑ γt rt might not converge. With γ < 1, it's always bounded by rmax/(1−γ).

3. Effective horizon: The "effective horizon" is approximately 1/(1−γ). With γ = 0.99, the agent effectively plans ~100 steps ahead. With γ = 0.999, ~1000 steps. This lets you control the planning horizon without explicitly setting H.

Worked Example — Discount Factor Effect

Two strategies for a robot: Strategy A gets reward +1 every step forever. Strategy B gets reward 0 for 10 steps, then +2 forever.

With γ = 0.9: V(A) = 1/(1−0.9) = 10. V(B) = 0.910 × 2/(1−0.9) = 0.349 × 20 = 6.97. A wins — the impatient agent prefers steady small rewards.

With γ = 0.99: V(A) = 1/(1−0.01) = 100. V(B) = 0.9910 × 200 = 0.904 × 200 = 180.8. B wins — the patient agent is willing to wait for double the reward.

A policy π(a | s) maps states to action probabilities. It's the agent's strategy: "when I see state s, here's how I pick an action." A deterministic policy always picks one action; a stochastic policy samples from a distribution.

An episode (or trajectory) is a sequence τ = (s0, a0, r0, s1, a1, r1, ..., sH). The agent starts at s0 ~ ρ0, picks actions via π, and the world responds via p. The whole game of RL is to find the policy that maximizes the expected total discounted reward:

The RL Objective maxπ   𝔼τ~π [ ∑t=0H γt r(st, at) ]
Find the policy that collects the most (discounted) reward on average.

The Value Function Vπ(s)

Suppose you're in state s and you follow policy π forever after. How much total reward do you expect? That's the value function:

Value Function Vπ(s) = 𝔼π [ ∑t=0 γt r(st, at)  |  s0 = s ]

Now for the most important recursive relationship in all of RL — the Bellman equation. The value of being in state s equals the immediate reward plus the discounted value of wherever you end up next:

Derivation — Bellman Equation for Vπ

Start from the definition:

Vπ(s) = 𝔼a~π(s) [ r(s, a) + γ 𝔼s'~p(s'|s,a)[ Vπ(s') ] ]

Step 1: Expand the expectation over the trajectory. The first reward is r(s0, a0), and everything after s1 is another trajectory starting from s1.

Step 2: By the Markov property (the future depends only on the current state, not the past), the expected future reward from s1 onward is exactly Vπ(s1).

Step 3: Combine: Vπ(s) = 𝔼a~π[r(s,a)] + γ 𝔼a~π[𝔼s'~p[Vπ(s')]]. This is the Bellman equation — it says value = immediate reward + discounted future value.

The Q-Function Qπ(s, a)

Sometimes we want to know the value of a specific action, not just the state. The action-value function Qπ(s, a) is the expected total reward if you take action a in state s, then follow π thereafter:

Q-Function Qπ(s, a) = r(s, a) + γ 𝔼s'~p(s'|s,a)[ Vπ(s') ]

The relationship between V and Q is clean: Vπ(s) = 𝔼a~π(a|s)[Qπ(s, a)]. The value of a state is just the expected Q-value over the actions your policy would take.

Optimal Values

The optimal value function V*(s) = maxπ Vπ(s) is the best achievable value from state s under any policy. Similarly, Q*(s, a) = maxπ Qπ(s, a). The optimal policy π*(a | s) is the one that achieves these values, and it satisfies π*(s) = argmaxa Q*(s, a).

The Central Theorem of RL

For any finite MDP, there exists a deterministic optimal policy π* that simultaneously maximizes Vπ(s) for every state s. This is remarkable: you might expect that optimizing for one starting state could hurt performance in another. But the Bellman optimality equation guarantees that the same policy is optimal everywhere. This is because the optimal action in state s depends only on Q*(s, ·), which already accounts for optimal behavior in all future states.

Bellman Optimality Equation — Derivation

Start from the Bellman equation for Qπ: Qπ(s, a) = r(s, a) + γ 𝔼s'[𝔼a'~π[Qπ(s', a')]].

For the optimal policy π*, the inner expectation becomes: 𝔼a'~π*[Q*(s', a')] = maxa' Q*(s', a'). This is because π* puts all probability on the action with the highest Q-value.

Substituting: Q*(s, a) = r(s, a) + γ 𝔼s'~p[maxa' Q*(s', a')]. This is the Bellman optimality equation — it characterizes Q* as the unique fixed point of the optimality operator.

Worked Example — Bellman Optimality in a 2-State MDP

States: {A, B}. Actions: {left, right}. From A, right gives r = +3 and goes to B. From A, left gives r = +1 and stays at A. From B, right gives r = +10 (terminal). From B, left gives r = 0 and goes to A. γ = 0.9.

Q*(B, right) = 10 (terminal, no future).

Q*(B, left) = 0 + 0.9 × max(Q*(A, left), Q*(A, right)).

Q*(A, right) = 3 + 0.9 × max(Q*(B, left), Q*(B, right)) = 3 + 0.9 × 10 = 12.

Q*(A, left) = 1 + 0.9 × max(Q*(A, left), Q*(A, right)) = 1 + 0.9 × 12 = 11.8.

Q*(B, left) = 0 + 0.9 × 12 = 10.8.

Optimal policy: π*(A) = right (12 > 11.8), π*(B) = left (10.8 > 10). Surprisingly, the optimal strategy from B is to go back to A rather than immediately collecting the +10! This is because looping A→B→A generates a stream of +3 rewards that, when discounted, exceeds the one-shot +10.

State Distributions

When we analyze RL algorithms, we often need to talk about which states a policy visits. The discounted state distribution (also called the discounted occupancy measure) under policy π is:

Discounted State Distribution dπ(s) = (1 − γ) ∑t=0 γt P(st = s | π)
A weighted average of how often π visits state s, with recent visits weighted more.

The factor (1 − γ) normalizes this to a proper probability distribution. This distribution is crucial for the Performance Difference Lemma in Chapter 5.

POMDPs — When You Can't See Everything

In a Partially Observable MDP (POMDP), the agent doesn't see the full state s — it only gets an observation o drawn from some distribution p(o | s). A self-driving car can see camera images but not the intentions of other drivers. The fix: the agent must maintain a belief — a probability distribution over possible states — or use recurrent networks that implicitly track history. Most deep RL treats problems as MDPs (assuming full observability) even when they're technically POMDPs, by feeding in a window of recent observations.

Interactive: MDP Grid World
Click any cell to see its Vπ value. Toggle between random and optimal policy to see how values change. The goal (gold cell) gives +10 reward; each step costs −1.
Worked Example — Computing Vπ

3-state chain: A → B → C. Policy: always go right. Rewards: r(A, right) = −1, r(B, right) = −1, r(C) = +10 (terminal). γ = 0.9.

Vπ(B) = −1 + 0.9 × 10 = 8.0

Vπ(A) = −1 + 0.9 × Vπ(B) = −1 + 0.9 × 8.0 = 6.2

Notice how value propagates backward from the reward. This is the essence of the Bellman equation: each state's value depends recursively on the next state's value.

🔨 Derivation Prove: The Bellman Optimality Operator Is a Contraction ✓ ATTEMPTED

The Bellman policy operator Tπ is a γ-contraction (proved in Chapter 8). But the Bellman optimality operator T* uses a max instead of an expectation over π:

(T*V)(s) = maxa [ r(s,a) + γ ∑s' p(s'|s,a) V(s') ]

Your task: Prove that ||T*V − T*V'|| ≤ γ ||V − V'||. The max makes this trickier than the policy case — you can't factor it out.

For any functions f, g: |maxa f(a) − maxa g(a)| ≤ maxa |f(a) − g(a)|. This is because the max of one side picks a specific action, and that action's difference is bounded by the max difference.
You get: |(T*V)(s) − (T*V')(s)| ≤ maxa |γ ∑s' p(s'|s,a) [V(s') − V'(s')]|. The r(s,a) terms cancel (they appear identically in both). Now you need to bound the weighted sum of differences.
s' p(s'|s,a) |V(s') − V'(s')| ≤ ||V − V'|| · ∑s' p(s'|s,a) = ||V − V'|| · 1. The probabilities sum to 1, and each |V(s') − V'(s')| is bounded by the sup-norm.

Full proof:

Step 1: |(T*V)(s) − (T*V')(s)| = |maxa[r(s,a) + γ∑s'p(s'|s,a)V(s')] − maxa[r(s,a) + γ∑s'p(s'|s,a)V'(s')]|

Step 2: Apply |max f − max g| ≤ max |f − g|:

≤ maxa |γ ∑s' p(s'|s,a) [V(s') − V'(s')]| (rewards cancel)

Step 3: Triangle inequality on the weighted sum:

≤ maxa γ ∑s' p(s'|s,a) |V(s') − V'(s')| ≤ γ ||V − V'||

Step 4: Since this holds for all s, take the sup: ||T*V − T*V'|| ≤ γ ||V − V'||. ■

The key insight: The max doesn't break contraction because |max f − max g| ≤ max |f − g|. This is why value iteration (which applies T* repeatedly) converges to V* regardless of initialization — and why γ < 1 is essential.

🔗 Pattern Recognition
RL Objective = Stochastic Optimization of a Non-Stationary Objective
This Lesson (RL)
maxπ J(π) = 𝔼τ~π[∑ γt rt]
Objective changes as π changes (non-stationary data distribution)
Standard Optimization
minθ 𝔼x~D[L(θ, x)]
Fixed data distribution D (i.i.d. assumption)

In supervised learning, the data distribution is fixed: your training set doesn't change as you update weights. In RL, the data distribution is the policy — update the policy and the data changes. This non-stationarity is why RL is fundamentally harder than supervised learning: you're optimizing an objective that shifts under your feet.

Where else do you see non-stationary optimization? (Hint: think GANs, self-play, curriculum learning.)

Checkpoint — Before you move on
Explain in your own words: why can't we just compute ∇θ J(θ) by differentiating through the environment dynamics? What does the Bellman equation tell us about the structure of the problem that policy gradients exploit differently than value iteration?
✓ Gate cleared
Model Answer

The environment dynamics p(s'|s,a) are unknown and non-differentiable (the real world doesn't have a gradient). Value iteration requires knowing p to compute the Bellman backup. Policy gradients bypass this entirely: the log-derivative trick converts the gradient into an expectation over trajectories, and the dynamics terms cancel. We never differentiate through the environment — we only need to sample from it. This is why policy gradients are model-free while value iteration is model-based. The tradeoff: value iteration gives exact solutions (when you know p) but policy gradients work with unknown dynamics at the cost of high variance.

Chapter 02

Policy Gradient Methods

Value-based methods like Q-learning learn Q* and derive π* as argmax. But what if we parameterize the policy directly as πθ(a | s) and optimize θ by gradient ascent on expected reward? That's the policy gradient approach — the foundation of PPO, SAC, and essentially all modern deep RL.

Why Not Just Use Q-Learning?

If Q-learning can find the optimal policy via argmax, why bother with policy gradients at all? Three reasons:

Three Reasons for Policy Gradients
Advantages Over Value-Based Methods

1. Continuous actions: Q-learning requires maxa Q(s, a). With discrete actions (4 moves), you evaluate all 4. With continuous actions (a ∈ R7), there are infinitely many — you can't enumerate them. Policy gradients handle continuous actions naturally.

2. Stochastic policies: Sometimes the optimal policy is stochastic (e.g., rock-paper-scissors: play uniformly at random). Q-learning always produces deterministic policies (argmax). Policy gradients can learn arbitrary distributions.

3. Better convergence: Policy gradients optimize the objective directly, while Q-learning tries to find a fixed point of the Bellman operator. In large-scale settings (millions of parameters), the direct optimization often converges faster and more stably.

Policy Parameterization

For discrete actions, πθ is typically a neural network that outputs a softmax distribution over actions. For continuous actions, πθ outputs the mean and standard deviation of a Gaussian policy:

Gaussian Policy πθ(a | s) = N(a; μθ(s), σθ(s)2)
The network outputs μ (where to center the action) and σ (how much to explore).

The Objective

Define the objective as the expected total reward under trajectories sampled from πθ:

Policy Gradient Objective V(θ) = ∑τ Pθ(τ) R(τ)
where R(τ) = ∑t=0T−1 r(st, at) and Pθ(τ) = ρ0(s0) ∏t=0T−1 πθ(at|st) p(st+1|st,at)

We want ∇θV(θ). But the sum is over all possible trajectories — an astronomically large set. And R(τ) doesn't depend on θ directly (rewards are given by the environment). How do we differentiate through sampling?

The Log-Derivative Trick

Full Derivation — Policy Gradient

Step 1 — Apply the definition of gradient:

θV(θ) = ∇θτ Pθ(τ) R(τ) = ∑τ R(τ) ∇θ Pθ(τ)

Why: R(τ) doesn't depend on θ, so it comes out of the gradient.

Step 2 — Log-derivative trick: We use the identity ∇θ Pθ(τ) = Pθ(τ) ∇θ log Pθ(τ). This is just the chain rule: d/dx log f(x) = f'(x)/f(x), rearranged.

θV(θ) = ∑τ R(τ) Pθ(τ) ∇θ log Pθ(τ)

= 𝔼τ~Pθ [ R(τ) ∇θ log Pθ(τ) ]

Why this matters: We turned a sum over all trajectories into an expectation we can estimate by sampling trajectories!

Step 3 — Expand log Pθ(τ):

log Pθ(τ) = log ρ0(s0) + ∑t=0T−1 [ log πθ(at|st) + log p(st+1|st,at) ]

Step 4 — Differentiate:θ log Pθ(τ) = ∑t=0T−1θ log πθ(at|st)

Why: ρ0(s0) doesn't depend on θ (the starting state is given by the environment). The transition dynamics p(st+1|st,at) don't depend on θ either (they're properties of the physics). Only the policy πθ depends on θ. So the transition terms vanish!

The Policy GradientθV(θ) = 𝔼τ~πθ [ (∑t=0T−1θ log πθ(at|st)) (∑t=0T−1 r(st, at)) ]
Increase the log-probability of actions in proportion to how much total reward the trajectory received.
Model-Free Magic

The transition dynamics p(s'|s,a) completely disappeared from the gradient! We don't need to know the physics of the world — we only need to be able to sample trajectories and evaluate ∇θ log πθ. This is why policy gradients are model-free.

The Policy Gradient Theorem

There's an alternative form that's often more useful. Instead of summing over entire trajectories, we can express the gradient in terms of the Q-function:

Policy Gradient Theorem — Full Proof

Step 1: Start with V(θ) = 𝔼s00[Vπ(s0)]. Write Vπ(s) = ∑a πθ(a|s) Qπ(s, a).

Step 2: Take the gradient:

θ Vπ(s) = ∑a [ ∇θπθ(a|s) Qπ(s,a) + πθ(a|s) ∇θQπ(s,a) ]

Product rule: both π and Q depend on θ.

Step 3: Expand ∇θQπ(s,a) = ∇θ[r(s,a) + γ ∑s' p(s'|s,a) Vπ(s')] = γ ∑s' p(s'|s,a) ∇θVπ(s').

The reward r and transition p don't depend on θ, so only Vπ(s') contributes.

Step 4: Substitute back: ∇θVπ(s) = ∑aθπ(a|s) Qπ(s,a) + ∑a π(a|s) γ ∑s' p(s'|s,a) ∇θVπ(s').

Step 5: This is recursive! ∇V(s) depends on ∇V(s'). Unroll it one more step:

∇V(s') = ∑a' ∇π(a'|s') Q(s',a') + ∑a' π(a'|s') γ ∑s'' p(s''|s',a') ∇V(s'') ...

Step 6: Unrolling T steps and collecting terms: ∇V(s0) = ∑t=0s P(s0→s, t, π) γta ∇π(a|s) Qπ(s,a), where P(s0→s, t, π) is the probability of being in state s at time t starting from s0 under π.

Step 7: Take expectation over s00 and recognize dπ(s) = ∑t γt P(st=s|π)/(1-γ)-1. Use the log-derivative trick: ∇π(a|s) = π(a|s) ∇ log π(a|s). Final result: ■

Policy Gradient TheoremθV(θ) = 𝔼s~dπ, a~πθ [ ∇θ log πθ(a|s) · Qπ(s, a) ]
Sum over state distribution dπ and policy π, weighting each action by its Q-value.
Interactive: Policy Gradient Direction
2D parameter space (θ1, θ2) with V(θ) shown as contours. The blue arrow shows the vanilla policy gradient; the green arrow shows the natural gradient (Fisher-adjusted). Drag the dot to explore different θ values.
Worked Example — Policy Gradient for a 1-Step Bandit

Two actions: left (θ1) and right (θ2). Policy: softmax π(left) = eθ1 / (eθ1 + eθ2). Rewards: r(left) = 1, r(right) = 3. Current θ = (0, 0), so π(left) = π(right) = 0.5.

Sample trajectory 1: action = right, reward = 3. Gradient: 3 × ∇θ log π(right) = 3 × (0.5, −0.5) × (−1, 1) = correction that increases θ2.

After many samples: θ2 grows, θ1 shrinks, and π(right) → 1. The gradient pushes probability mass toward higher-reward actions.

Chapter 03

REINFORCE & Variance Reduction

The policy gradient formula is beautiful in theory but horrifically noisy in practice. Imagine estimating a gradient by running one episode: the agent might get lucky (high reward) or unlucky (low reward) by chance, not because of the policy. The variance of the gradient estimate is so high that it takes millions of samples to get a useful signal. This chapter is about taming that variance.

How Bad Is the Variance?

Consider a simple environment where the optimal reward is 100. With vanilla REINFORCE, a single gradient estimate might range from −500 to +500 across different trajectory samples. The true gradient might be +2. You'd need to average ~62,500 samples to get a reliable estimate (variance/mean2). This is why raw REINFORCE is rarely used in practice — it's too expensive.

REINFORCE (Williams, 1992)
  1. Sample a trajectory τ = (s0, a0, r0, ..., sT) by running πθ in the environment.
  2. Compute return: R(τ) = ∑t=0T−1 γt r(st, at).
  3. Compute gradient estimate: ĝ = (∑t=0T−1θ log πθ(at|st)) · R(τ).
  4. Update: θ ← θ + α ĝ.
  5. Repeat.

The Causality Trick: Reward-to-Go

In the vanilla gradient, every action's log-probability gets multiplied by the total trajectory reward, including rewards that happened before that action was taken. But action a5 can't possibly affect rewards r0 through r4 — those already happened! Including past rewards adds pure noise.

Derivation — Reward-to-Go (Causality Trick)

In the full gradient, consider the term for time step t:

θ log πθ(at|st) · ∑t'=0T−1 r(st', at')

Split the sum into past (t' < t) and future (t' ≥ t):

= ∇θ log πθ(at|st) · [ ∑t'=0t−1 rt' + ∑t'=tT−1 rt' ]

Key insight: 𝔼[ ∇θ log πθ(at|st) · rt' ] = 0 for t' < t. Why? Because rt' is determined by st', at', which are independent of at given st. And 𝔼at[∇θ log πθ(at|st)] = ∇θa πθ(a|st) = ∇θ 1 = 0. So the past reward term vanishes in expectation.

This means we can replace the total reward with the reward-to-go without introducing any bias:

Reward-to-Got = ∑t'=tT−1 γt'−t r(st', at')

ĝ = ∑t=0T−1θ log πθ(at|st) · Ĝt
Each action is only credited for future rewards it could have influenced.

Baselines: Subtracting Without Bias

Even with reward-to-go, the gradient is noisy. The trick: subtract a baseline b(st) from the reward-to-go. If all rewards are positive (e.g., 10, 12, 11, 13), then every action gets reinforced, just some more than others. Subtracting the average (~11.5) means good actions get positive reinforcement and bad actions get negative reinforcement — much clearer signal.

Proof — Baselines Are Unbiased

We need to show that 𝔼[∇θ log πθ(a|s) · b(s)] = 0 for any function b(s).

Step 1: 𝔼a~πθ[∇θ log πθ(a|s) · b(s)]

= b(s) · 𝔼a~π[∇θ log πθ(a|s)]

Why: b(s) doesn't depend on a, so it factors out.

Step 2: = b(s) · ∑a πθ(a|s) · ∇θ log πθ(a|s)

Step 3: = b(s) · ∑a πθ(a|s) · ∇θπθ(a|s) / πθ(a|s)

= b(s) · ∑aθπθ(a|s)

= b(s) · ∇θa πθ(a|s)

= b(s) · ∇θ 1 = 0   ■

The probabilities sum to 1 regardless of θ, so their gradient is zero. Any baseline that only depends on the state (not the action) can be subtracted for free.

The most common baseline is b(s) = Vπ(s) — the average value of the state. With this baseline, we're reinforcing actions proportional to Ĝt − Vπ(st), which is an estimate of the advantage: how much better was this action than average?

Worked Example — Advantage as Surprise

State s: standing at a fork in a maze. Vπ(s) = 5 (on average, this state leads to reward 5).

Action: go left.t = 8. Advantage = 8 − 5 = +3. "Going left was 3 points better than expected. Reinforce it."

Action: go right.t = 2. Advantage = 2 − 5 = −3. "Going right was 3 points worse than expected. Discourage it."

Action: go straight.t = 5. Advantage = 5 − 5 = 0. "Going straight was exactly average. Don't change anything."

Without the baseline, all three actions would receive positive reinforcement (rewards 8, 2, 5 are all positive). The baseline centers the signal so the gradient points in the right direction.

The Optimal Baseline

Among all baselines, which minimizes variance the most? The answer:

Optimal Baseline b*(st) = 𝔼[ ||∇θ log πθ(at|st)||2 · Ĝt ]  /  𝔼[ ||∇θ log πθ(at|st)||2 ]
The gradient-magnitude-weighted average of the rewards. In practice, Vπ(s) is almost as good and much easier to compute.
Interactive: Variance Reduction
Compare gradient estimate variance with and without a baseline. Each bar is one sampled gradient estimate. The blue line shows the true gradient. Toggle the baseline to see variance collapse.
Worked Example — Baseline Effect

Two trajectories: τ1 has reward 100, τ2 has reward 102. Without baseline: both get positive reinforcement (100 and 102). Gradient: increase probability of both τ1 and τ2, just τ2 slightly more. Signal-to-noise ratio is terrible.

With baseline b = 101: τ1 gets −1 (decrease its probability slightly), τ2 gets +1 (increase it slightly). The signal is now purely about the relative quality. Much cleaner.

REINFORCE with Baseline
  1. Sample trajectory τ from πθ.
  2. For each t: compute Ĝt = ∑t'=tT γt'−t rt'.
  3. Compute advantage estimate:t = Ĝt − V̂φ(st).
  4. Update policy: θ ← θ + α ∑tθ log πθ(at|st) · Ât.
  5. Update baseline: φ ← φ − β ∇φt (V̂φ(st) − Ĝt)2.
  6. Repeat.
⚔ Adversarial: When Does the Baseline Hurt?
You're using REINFORCE with baseline b(s) = Vπ(s). Your value function is badly learned: V̂(s) overestimates by +50 for all states. The true V(s) ≈ 10. Your rewards are in [0, 20]. What happens to training?
Chapter 04

Actor-Critic Methods

REINFORCE with baseline has a problem: the reward-to-go Ĝt is a Monte Carlo estimate — you have to wait until the end of the episode to compute it. This means high variance (one trajectory can be lucky or unlucky) and no learning until the episode finishes. What if we replaced Ĝt with a learned estimate?

The Advantage Function

The advantage Aπ(s, a) = Qπ(s, a) − Vπ(s) measures how much better action a is compared to the average action in state s. If A > 0, the action is better than average; if A < 0, it's worse.

Why Advantage?

The advantage is the ideal quantity to multiply the policy gradient by. Positive advantage → increase the action's probability. Negative advantage → decrease it. Zero advantage → leave it alone. It's a direct signal of "is this action good relative to alternatives?"

One-Step Advantage Estimate

We don't know Qπ or Vπ exactly, but we can approximate the advantage using a single transition and a learned value function V̂φ:

One-Step Advantage Estimatet ≈ r(st, at) + γ V̂φ(st+1) − V̂φ(st)
This is the TD error δt. It estimates Qπ(st,at) as r + γV(s') and subtracts V(st).

Why does this work? Qπ(s, a) = r(s, a) + γ 𝔼[Vπ(s')]. So r + γV̂(s') is a one-sample estimate of Qπ(s, a). Subtract V̂(s) and you get an estimate of the advantage. This introduces bias (V̂φ might be wrong) but dramatically reduces variance (no need to wait for the entire trajectory).

Training the Critic

The critic V̂φ is trained to predict the expected return from each state. Two options:

MethodTarget for V̂φ(st)BiasVariance
Monte Carlot = ∑t'=tT γt'−t rt'NoneHigh
Bootstrapping (TD)rt + γ V̂φ(st+1)Yes (from V̂)Low

In practice, bootstrapping wins because the variance reduction outweighs the bias, especially when the critic becomes accurate.

One-Step Actor-Critic (A2C)
  1. Collect transition (s, a, r, s') using πθ.
  2. Compute TD error: δ = r + γ V̂φ(s') − V̂φ(s).
  3. Update actor: θ ← θ + α ∇θ log πθ(a|s) · δ.
  4. Update critic: φ ← φ − β ∇φ (V̂φ(s) − [r + γV̂φ(s')])2.
  5. Repeat.
Connection to REINFORCE

Actor-critic = REINFORCE with baseline + two changes: (1) replace Monte Carlo return Ĝt with bootstrapped TD target r + γV̂(s'), and (2) update after every step instead of waiting for the episode to end. The baseline is the critic. The advantage estimate is the TD error. Same math, different variance-bias tradeoff.

The Bias-Variance Spectrum

Different advantage estimators trade off bias and variance along a spectrum. Here's the full picture:

EstimatorFormulaBiasVariance
Full Monte Carlot − V̂(st)NoneVery high
N-step returnk=0n−1 γkrt+k + γnV̂(st+n) − V̂(st)MediumMedium
1-step TDrt + γV̂(st+1) − V̂(st)High (if V̂ wrong)Low
GAE(λ)l=0 (γλ)lδt+lTunable via λTunable via λ
Worked Example — Why Bias Matters

Suppose V̂(s') is way off: true V*(s') = 100, but V̂(s') = 10. The TD error δ = r + 0.9 × 10 − V̂(s) massively underestimates the advantage. The actor gets a misleading signal: "this action wasn't very good" when actually it led to a highly valuable state. With Monte Carlo, the full trajectory return would eventually reveal the truth — but at the cost of enormous variance across trajectories.

GAE with λ = 0.95 blends both: mostly uses the multi-step return (low bias) but with enough bootstrapping to reduce variance. This is why λ = 0.95 is the most common choice in practice.

Batch vs Online Actor-Critic

The algorithm above updates after every single transition. In practice, we collect a batch of transitions (e.g., 2048 steps across multiple parallel environments) and update once on the whole batch. This is called batch actor-critic or A2C (Advantage Actor-Critic). The parallelism provides diversity: transitions from 16 environments at once are less correlated than 16 consecutive transitions from one environment.

Worked Example — One Actor-Critic Update

State s = "near the goal," action a = "go right," reward r = +5, next state s' = "at the goal." V̂φ(s) = 3, V̂φ(s') = 10, γ = 0.9.

TD error: δ = 5 + 0.9 × 10 − 3 = 5 + 9 − 3 = +11.

Actor update: δ = +11 > 0, so we increase log π(right | near_goal). "Going right near the goal was much better than expected — do more of it."

Critic update: Target = r + γV̂(s') = 14. Current V̂(s) = 3. Error = 14 − 3 = 11. Push V̂(s) toward 14. "This state is more valuable than I thought."

Chapter 05

Trust Regions & the Performance Difference Lemma

Policy gradients tell you which direction to move in parameter space. But how far should you step? Too small and you waste samples. Too large and you destroy a good policy with one bad update. This chapter develops the mathematical machinery that leads to TRPO and PPO.

Lemma: Expectation Under State Distribution

Lemma 10 — Full Proof

Claim: 𝔼τ~π[ ∑t=0 γt f(st, at) ] = 1/(1−γ) · 𝔼s~dπ, a~π[ f(s, a) ]

Proof:

Step 1: Let Ptπ(s) = P(st = s | π) be the state distribution at time t.

𝔼τ[ ∑t γt f(st,at) ] = ∑t=0 γts Ptπ(s) ∑a π(a|s) f(s,a)

Step 2: Recall dπ(s) = (1−γ) ∑t γt Ptπ(s). So ∑t γt Ptπ(s) = dπ(s) / (1−γ).

Step 3: Substitute: = ∑s dπ(s)/(1−γ) ∑a π(a|s) f(s,a) = 1/(1−γ) · 𝔼s~dπ, a~π[f(s,a)]   ■

The Performance Difference Lemma

This is one of the most powerful results in RL theory. It expresses the performance gap between any two policies in terms of the advantage of one under the other:

Performance Difference Lemma — Full Proof

Claim: V(π) − V(π') = 1/(1−γ) · 𝔼s~dπ, a~π[ Aπ'(s, a) ]

Proof:

Step 1: V(π) = 𝔼τ~π[∑t γt r(st,at)] = 1/(1−γ) 𝔼s~dπ, a~π[r(s,a)] by Lemma 10.

Step 2: Add and subtract Vπ'(s): we can write

r(s,a) = Aπ'(s,a) + Vπ'(s) − γ 𝔼s'~p[Vπ'(s')]

Why: By definition, Aπ'(s,a) = Qπ'(s,a) − Vπ'(s) = r(s,a) + γ𝔼[Vπ'(s')] − Vπ'(s). Rearranging gives r(s,a) = Aπ'(s,a) + Vπ'(s) − γ𝔼[Vπ'(s')].

Step 3: Substitute into the expression for V(π). The Vπ'(st) and γVπ'(st+1) terms telescope across time steps, leaving only Vπ'(s0) at the boundary.

Step 4: After telescoping: V(π) = V(π') + 1/(1−γ) 𝔼s~dπ, a~π[Aπ'(s,a)]

Rearranging gives the result. ■

Performance Difference Lemma V(π) − V(π') = 1/(1−γ) · 𝔼s~dπ, a~π[ Aπ'(s, a) ]
The performance gap equals the expected advantage of π over π', weighted by π's state distribution.

The Local Approximation

The PDL involves dπ — the state distribution under the new policy π. But we only have data from the old policy π'. Define the local approximation:

Local Approximation Lπ'(π) = V(π') + 1/(1−γ) · 𝔼s~dπ', a~π[ Aπ'(s, a) ]
Same as the PDL but using dπ' instead of dπ. Exact when π = π', biased otherwise.

α-Coupled Policies and the TRPO Bound

Definition
α-Coupled Policies

Two policies π and π' are α-coupled if at each state, with probability 1 − α they choose the same action, and with probability α they choose independently. Intuitively, α measures how different the two policies are. If α = 0, they're identical; if α = 1, they're independent.

Using coupled policies, we can bound how far off the local approximation L is from the true performance V:

Bounding the Advantage (Lemma)

Claim: If π and π' are α-coupled, then |𝔼a~π[Aπ'(s,a)]| ≤ 2α maxs,a|Aπ'(s,a)|.

Proof: With probability 1−α, π picks the same action as π', so Aπ'(s, π'(s)) = 0 (by definition of Vπ'). With probability α, π picks independently, and the advantage is bounded by ε = max|Aπ'|. So |𝔼[A]| ≤ (1−α) · 0 + α · 2ε = 2αε. The factor of 2 accounts for the fact that the mean advantage under π' is zero, so the independent action could deviate by up to 2ε from 0. ■

TRPO Guarantee (Theorem 13) V(πnew) ≥ Lπoldnew) − 4εγ / (1−γ)2 · α2

where ε = maxs,a |Aπold(s, a)|, α = maxs DTVnew(s) || πold(s))
The true performance is at least the local approximation minus a penalty proportional to α².

This says: if we constrain the policy update to be small (small α, i.e., the new policy is close to the old one in total variation distance), then the local approximation L is a good lower bound on the true performance V. This is the theoretical foundation of Trust Region Policy Optimization (TRPO).

Natural/Covariant Policy Gradient

The vanilla policy gradient ∇θV(θ) depends on the parameterization of πθ. If you reparameterize (e.g., change from log-space to linear-space), the gradient direction changes. The natural policy gradient fixes this by pre-multiplying by the inverse Fisher information matrix:

Natural Policy Gradient θ ← θ + α F−1θV(θ)

where F = 𝔼s~dπ, a~π[ ∇θ log πθ(a|s) ∇θ log πθ(a|s)T ]
F−1 "un-warps" the parameter space so the gradient is invariant to reparameterization.
Trust Region = Constrained Natural Gradient

TRPO approximately solves: maxθ Lθold(θ) subject to DKLθ || πθold) ≤ δ. This is equivalent to taking a natural gradient step with step size determined by the KL constraint. The Fisher matrix defines the "geometry" of policy space — large steps in high-curvature directions are penalized.

Worked Example — The TRPO Bound in Numbers

Suppose ε = max|Aπ| = 10 (largest advantage magnitude), γ = 0.99, and we constrain α = DTVnew, πold) ≤ 0.01 (very small policy change).

Penalty term: 4 × 10 × 0.99 / (1 − 0.99)2 × 0.012 = 4 × 10 × 0.99 / 0.0001 × 0.0001 = 3.96

Interpretation: The true performance is at least L(πnew) − 3.96. If the local approximation predicts improvement of +5, the guarantee is at least +1.04. Small trust region gives tight bounds.

If we let α = 0.1 (large change): penalty = 4 × 10 × 0.99 / 0.0001 × 0.01 = 3960. The bound is useless! Large policy changes destroy the guarantee. This is why trust regions must be small.

TRPO Is Hard to Implement

TRPO requires: (1) computing the Fisher-vector product F∇, (2) solving a linear system via conjugate gradient (10-20 iterations), (3) a line search to find the largest step satisfying the KL constraint, and (4) handling the case where the constraint is violated. That's a lot of machinery compared to "just clip the ratio" (PPO). TRPO achieves better worst-case guarantees, but PPO achieves similar empirical performance with 1/10 the code.

Worked Example — Why Trust Regions Matter

Policy at iteration k: π(left | s) = 0.8, π(right | s) = 0.2. The gradient says "increase π(right)." Without trust region: learning rate 0.5 gives π(right) = 0.7. Huge change! But the gradient was computed assuming the old policy — with the new policy we'd visit different states, so the gradient estimate is stale.

With trust region (KL ≤ 0.01): π(right) can increase to at most ~0.25. Small step, but the gradient stays valid. Next iteration computes a fresh gradient. Slow but stable convergence.

🔨 Derivation Derive the Simulation Lemma from the Performance Difference Lemma ✓ ATTEMPTED

The Performance Difference Lemma states: V(π) − V(π') = 1/(1−γ) · 𝔼s~dπ, a~π[Aπ'(s,a)].

The Simulation Lemma bounds the performance gap using the maximum advantage: |V(π) − V(π')| ≤ 2εγ / (1−γ)2 · maxs DTV(π(s) || π'(s)), where ε = maxs,a|Aπ'(s,a)|.

Your task: Starting from the PDL, derive this bound. You'll need: (1) bound the advantage expectation using the TV distance, and (2) bound the state distribution shift.

For a fixed state s: |𝔼a~π[Aπ'(s,a)]| = |∑a(π(a|s) − π'(a|s))Aπ'(s,a)| ≤ 2 DTV(π(s)||π'(s)) · maxa|Aπ'(s,a)|. The factor of 2 comes from the definition of TV: DTV = (1/2)∑|p − q|, so ∑|p−q| = 2DTV.
The PDL uses dπ (the new policy's state distribution), which we can't evaluate under the old policy. The key insight: ||dπ − dπ'||1 ≤ 2γ/(1−γ) · maxs DTV(π(s)||π'(s)). This bounds how much the state distributions diverge based on per-step policy differences.
Substitute into the PDL: the advantage at each state is bounded by 2αε (Hint 1), and the state distribution shift adds another factor of γ/(1−γ) (Hint 2). Multiplying through the 1/(1−γ) prefactor gives the final bound.

Step 1: From PDL: V(π) − V(π') = 1/(1−γ) 𝔼s~dπ, a~π[Aπ'(s,a)]

Step 2: Bound advantage at each state: |𝔼a~π[Aπ'(s,a)]| ≤ 2αε where α = maxs DTV(π(s)||π'(s)) and ε = max|Aπ'|.

Step 3: Replace dπ with dπ' plus error: |𝔼s~dπ[f(s)] − 𝔼s~dπ'[f(s)]| ≤ ||dπ − dπ'||1 · max|f| ≤ 2γα/(1−γ) · 2αε

Step 4: Combining: |V(π) − V(π')| ≤ 1/(1−γ) [2αε + 2γα/(1−γ) · 2αε] = 2αε/(1−γ) [1 + 2γα/(1−γ)]

Step 5: For the standard form (bounding loosely): ≤ 4εγα2/(1−γ)2. This is the TRPO bound from Theorem 13. ■

The key insight: The bound is quadratic in α (the policy divergence) — this is why trust regions work. If you halve the allowed policy change, the error bound drops by 4x. The local approximation becomes increasingly accurate as the trust region shrinks.

Chapter 06

PPO: Proximal Policy Optimization

TRPO works brilliantly but is painful to implement. It requires computing the Fisher matrix, solving a constrained optimization problem via conjugate gradient, and doing a line search. PPO achieves nearly the same effect with a simple clipped objective that you can optimize with standard gradient descent.

The Probability Ratio

Define the probability ratio between the new and old policy:

Probability Ratio rt(θ) = πθ(at | st)  /  πθold(at | st)
If rt = 1, the new policy is identical to the old one at this (s, a) pair. rt > 1 means the new policy is more likely to take this action; rt < 1 means less likely.

The surrogate objective from TRPO can be written as:

Surrogate Objective LCPI(θ) = 𝔼t[ rt(θ) · Ât ]
CPI = Conservative Policy Iteration. Maximize this and you improve the policy.

The Clipped Surrogate

Without a constraint, maximizing LCPI can lead to rt(θ) exploding — the new policy becomes very different from the old one. PPO's solution: clip the ratio to stay in [1−ε, 1+ε].

PPO Clipped Objective LCLIP(θ) = 𝔼t[ min( rt(θ) Ât,  clip(rt(θ), 1−ε, 1+ε) Ât ) ]
ε = 0.2 typically. The min takes the more pessimistic estimate.
Why the Clip Works — Two Cases

Case 1: Ât > 0 (the action was good). We want to increase rt (make this action more likely). But the clip prevents rt from exceeding 1 + ε. So the objective becomes min(rt Â, (1+ε) Â). Once rt ≥ 1 + ε, the gradient is zero — no incentive to push further.

Case 2: Ât < 0 (the action was bad). We want to decrease rt. But the clip prevents rt from dropping below 1 − ε. So the objective becomes min(rt Â, (1−ε) Â). Once rt ≤ 1 − ε, the gradient is zero.

In both cases, the clip creates a "trust region" in probability ratio space. The policy can change, but not too much.

Interactive: PPO Clipped Objective
See how the clipped objective (gold) differs from the unclipped one (blue dashed). When advantage is positive, the clip prevents the ratio from going above 1+ε. When negative, it prevents going below 1−ε. Drag the slider to change ε.

Generalized Advantage Estimation (GAE)

How do we compute Ât? We could use the one-step TD error δt = rt + γV(st+1) − V(st) (low variance, high bias) or the full Monte Carlo return Gt − V(st) (no bias, high variance). GAE interpolates:

Generalized Advantage EstimationtGAE(γ,λ) = ∑l=0 (γλ)l δt+l

where δt = rt + γ V̂(st+1) − V̂(st)
λ = 0 gives one-step TD (low variance, high bias). λ = 1 gives Monte Carlo (no bias, high variance). λ = 0.95 is a popular middle ground.
Understanding GAE — Expanding the Sum

Let's expand ÂGAE for a few terms to see what it does:

t = δt + (γλ)δt+1 + (γλ)2δt+2 + ...

Substituting δt = rt + γV̂(st+1) − V̂(st):

When λ = 0: Ât = δt = rt + γV̂(st+1) − V̂(st). Just the one-step TD error.

When λ = 1: Ât = ∑l γl δt+l. After telescoping the V̂ terms: Ât = ∑l=0 γl rt+l − V̂(st) = Ĝt − V̂(st). The full Monte Carlo advantage!

So λ smoothly interpolates between 1-step TD and full Monte Carlo. The magic of GAE is that it does this in an exponentially-weighted way — nearby TD errors get more weight than distant ones.

Worked Example — Computing GAE

4-step trajectory: rewards [1, 2, 3, 0], V̂ values [5, 4, 6, 2, 0] (including the terminal state). γ = 0.99, λ = 0.95.

TD errors:

δ0 = 1 + 0.99 × 4 − 5 = 1 + 3.96 − 5 = −0.04

δ1 = 2 + 0.99 × 6 − 4 = 2 + 5.94 − 4 = 3.94

δ2 = 3 + 0.99 × 2 − 6 = 3 + 1.98 − 6 = −1.02

δ3 = 0 + 0.99 × 0 − 2 = −2.0

GAE (computed backwards):

3 = δ3 = −2.0

2 = δ2 + 0.99 × 0.95 × Â3 = −1.02 + 0.9405 × (−2.0) = −2.90

1 = δ1 + 0.9405 × Â2 = 3.94 + 0.9405 × (−2.90) = 1.21

0 = δ0 + 0.9405 × Â1 = −0.04 + 0.9405 × 1.21 = 1.10

Note the backward computation — this is how GAE is implemented in practice (loop from t = T−1 down to 0).

PPO Algorithm
  1. Collect a batch of trajectories using πθold.
  2. Compute advantages Ât using GAE with the current critic V̂φ.
  3. For K epochs (typically K = 3–10), on random minibatches of the collected data:
    1. Compute rt(θ) = πθ(at|st) / πθold(at|st).
    2. Compute LCLIP and do gradient ascent on θ.
    3. Update critic: φ ← φ − β ∇φ ∑ (V̂φ(st) − Ĝt)2.
  4. Set θold ← θ. Go to step 1.
Why PPO Dominates

PPO is the default algorithm for modern RL because: (1) it's simple — 50 lines of core code, (2) it's stable — the clip prevents catastrophic updates, (3) it's sample-efficient — you reuse each batch for K epochs, not just one gradient step, (4) it works for both continuous and discrete actions. OpenAI used PPO to train the policy for ChatGPT's RLHF stage.

Worked Example — PPO Hyperparameters in Practice

Setting: Training a locomotion policy for a simulated humanoid robot.

Batch size: 2048 steps across 8 parallel environments (256 steps each). Gives 2048 transitions per policy update.

ε = 0.2: The policy can change each action probability by at most 20% per update. Experiment: ε = 0.1 (too conservative, slow learning). ε = 0.5 (too aggressive, policy oscillates wildly).

K = 4 epochs: Each batch of 2048 transitions is used 4 times. More epochs = more sample efficiency, but the data becomes stale (the gradient is for the old policy, not the current one). K = 10 with ε = 0.2 usually works fine because the clip prevents too-large updates even with stale data.

λ = 0.95: GAE parameter. λ = 0.99 gave slightly better final performance but took 2x longer. λ = 0.8 converged fast but to a worse policy.

Worked Example — PPO Clip in Action

State s, action a. πθold(a|s) = 0.4. After some gradient steps, πθ(a|s) = 0.72. Ratio r = 0.72 / 0.4 = 1.8. With ε = 0.2, clipped ratio = clip(1.8, 0.8, 1.2) = 1.2.

If  = +3: unclipped objective = 1.8 × 3 = 5.4. Clipped = 1.2 × 3 = 3.6. PPO uses min(5.4, 3.6) = 3.6. The gradient is zero (we've hit the clip), so θ stops being pushed to increase π(a|s) further.

The policy wanted to make this action much more likely (1.8x), but PPO said "slow down, 1.2x is enough for now." Next batch of data will be collected under a policy closer to the current one, and we can push further if the advantage is still positive.

Chapter 07

SAC: Soft Actor-Critic

PPO is on-policy: it collects data, updates the policy, then throws the data away. This is wasteful. Soft Actor-Critic (SAC) is off-policy: it stores all experience in a replay buffer and reuses it many times. The key insight is adding an entropy bonus to the reward, encouraging exploration and enabling off-policy stability.

Maximum Entropy RL

Standard RL maximizes expected reward. Maximum entropy RL adds a bonus for acting randomly:

Maximum Entropy Objective maxπ 𝔼τ~π [ ∑t=0 γt ( r(st, at) + α H(π(·|st)) ) ]

where H(π(·|s)) = −𝔼a~π[log π(a|s)] is the entropy
α controls the trade-off: high α = more exploration, low α = more exploitation.

Why add entropy? Three reasons: (1) exploration — the agent tries diverse actions instead of collapsing to a single deterministic choice, (2) robustness — stochastic policies are more robust to perturbations, and (3) off-policy compatibility — the entropy regularization smooths the optimization landscape, making off-policy learning more stable.

Worked Example — Entropy as Exploration Bonus

Robot arm with 7 continuous action dimensions. Without entropy: the policy quickly collapses to always applying the same torques. If those happen to reach one target, great — but the robot never discovers alternative strategies.

With entropy bonus (α = 0.2): the policy maintains a standard deviation of ~0.3 in each dimension. The robot tries slightly different torques each time, occasionally discovering better solutions. As training progresses and α decreases (auto-tuned), the policy becomes more precise while retaining enough randomness to handle perturbations.

Entropy value: For a Gaussian with σ = 0.3 in 7 dims: H = 7 × (0.5 log(2πe × 0.09)) ≈ 7 × (−0.27) = −1.89 nats. The reward r + αH = r + 0.2 × (−1.89) = r − 0.38. Small entropy penalty compared to task reward, but enough to prevent premature convergence.

Soft Bellman Equation

Soft Value Functions Vπ(s) = 𝔼a~π[ Qπ(s, a) − α log π(a|s) ]

Qπ(s, a) = r(s, a) + γ 𝔼s'[ Vπ(s') ]
The "soft" value includes the entropy bonus. Higher entropy → higher V.
Derivation — Soft Bellman Equation

In standard RL, Vπ(s) = 𝔼a~π[Qπ(s,a)]. In maximum entropy RL, the value includes the entropy bonus:

Vπ(s) = 𝔼a~π[Qπ(s,a) − α log π(a|s)]

The soft Q-function becomes: Qπ(s,a) = r(s,a) + γ 𝔼s'[Vπ(s')]

The optimal soft policy has a closed form: π*(a|s) ∝ exp(Q*(s,a)/α). This is a Boltzmann (softmax) distribution over actions, weighted by Q-values. High α = uniform (maximum entropy). Low α = nearly deterministic (exploit the best action).

Substituting back: V*(s) = α log ∑a exp(Q*(s,a)/α) — the soft-max (log-sum-exp) of Q-values, which is a smooth approximation to the hard max.

SAC Components

SAC maintains three networks:

NetworkParametersWhat It Does
Actor πθθOutputs action distribution (Gaussian mean + std)
Critic Qφφ1, φ2Two Q-networks (clipped double-Q trick)
Target Q̄φ̄Exponential moving average of φ for stability

Temperature Auto-Tuning

Instead of manually setting α, SAC learns it by solving:

Temperature Objective minα 𝔼a~π[ −α log π(a|s) − α H̄ ]
H̄ is a target entropy (typically −dim(A) for continuous actions). If actual entropy > H̄, decrease α; if < H̄, increase α.

Connection to Deterministic Policy Gradient

When α → 0, SAC reduces to the Deterministic Policy Gradient (DPG). In DPG, the actor outputs a single action μθ(s) (no stochasticity) and the gradient is:

Deterministic Policy GradientθV = 𝔼s[ ∇aQ(s, a)|a=μθ(s) · ∇θμθ(s) ]
Backpropagate through the Q-network to get the policy gradient. No log-probabilities needed.
The Clipped Double-Q Trick

SAC uses two Q-networks Qφ1 and Qφ2 and takes the minimum of their predictions for the target. Why? To combat overestimation (the same problem as in DQN). If one Q-network overestimates, the other likely doesn't, so the min provides a more conservative estimate. This trick was introduced in TD3 (Twin Delayed DDPG) and is now standard in off-policy methods.

Worked Example — Temperature Auto-Tuning

Target entropy: H̄ = −dim(A) = −7 (for a 7-DOF robot arm). Current policy entropy: H(π) = −5. Since −5 > −7 (more entropy than target), the loss α(−(−5) − (−7)) = α(5 − 7) = −2α is negative. The gradient decreases α, reducing the entropy bonus. The policy becomes more deterministic.

If H(π) = −10 (less entropy than target): loss = α(10 − 7) = 3α > 0. The gradient increases α, adding more entropy bonus to encourage exploration.

Worked Example — SAC vs PPO

Robot arm reaching task. 7 continuous action dimensions (joint torques). PPO: collect 2048 steps, update policy, discard data. After 1M steps, it has done ~500 policy updates using 1M transitions total. SAC: collect 1 step, store in buffer (1M capacity), sample batch of 256, update all networks. After 1M steps, it has done 1M gradient updates, each using data from the full buffer. SAC uses data ~1000x more efficiently.

The trade-off: PPO is more stable (on-policy data is always "fresh"), SAC is more sample-efficient (reuses everything).

SAC Algorithm
  1. Initialize actor πθ, two critics Qφ1, Qφ2, target critics Q̄φ̄1, Q̄φ̄2, replay buffer D, temperature α.
  2. For each step: sample a ~ πθ(s), execute, observe (s, a, r, s'). Store in D.
  3. Sample minibatch from D. Compute target: y = r + γ (mini=1,2φ̄i(s', ā) − α log πθ(ā|s')) where ā ~ πθ(s').
  4. Update critics: minimize (Qφi(s,a) − y)2 for i = 1, 2.
  5. Update actor: minimize 𝔼a~π[α log πθ(a|s) − mini Qφi(s, a)].
  6. Update temperature: minimize α(−log πθ(a|s) − H̄).
  7. Soft update targets: φ̄ ← τφ + (1−τ)φ̄ with τ = 0.005.
Chapter 08

Tabular MDP Methods

Before neural networks entered the picture, RL was solved with tables. These methods have convergence guarantees that deep RL can only dream of. Understanding them reveals why algorithms work (or fail) when we scale up.

Definition
Tabular Representation

In tabular RL, V(s) and Q(s, a) are stored as literal tables (dictionaries/arrays) with one entry per state or state-action pair. For a grid world with 100 states and 4 actions, Q is a 100 × 4 table = 400 numbers. This is feasible when |S| and |A| are small (thousands), but impossible for images (|S| ~ 1030) or continuous states (|S| = ∞). The transition to function approximation (Chapter 9) is where tabular guarantees break down.

Policy Evaluation

Given a fixed policy π, compute Vπ(s) for all states by iteratively applying the Bellman equation:

Iterative Policy Evaluation
  1. Initialize V(s) = 0 for all s.
  2. For each iteration: for every state s:
    Vnew(s) = ∑a π(a|s) [ r(s,a) + γ ∑s' p(s'|s,a) Vold(s') ]
  3. Repeat until maxs |Vnew(s) − Vold(s)| < ε.

Policy Iteration

Policy Iteration
  1. Start with arbitrary policy π0.
  2. Evaluate: Compute Vπk by iterating the Bellman equation until convergence.
  3. Improve: πk+1(s) = argmaxa [ r(s,a) + γ ∑s' p(s'|s,a) Vπk(s') ] for all s.
  4. If πk+1 = πk, stop. Otherwise, go to step 2.

Value Iteration

Policy iteration has an expensive inner loop (full policy evaluation). Value iteration combines evaluation and improvement into a single step:

Value Iteration
  1. Initialize V(s) = 0 for all s.
  2. For each iteration: for every state s:
    Vnew(s) = maxa [ r(s,a) + γ ∑s' p(s'|s,a) Vold(s') ]
  3. Repeat until convergence.
  4. Extract policy: π*(s) = argmaxa [ r(s,a) + γ ∑s' p(s'|s,a) V*(s') ].

Bellman Operators as Contractions

Proof — Contraction Property

Claim: ||TπV − TπV'|| ≤ γ ||V − V'||

where (TπV)(s) = ∑a π(a|s) [r(s,a) + γ ∑s' p(s'|s,a) V(s')].

Proof:

Step 1: |(TπV)(s) − (TπV')(s)| = |∑a π(a|s) γ ∑s' p(s'|s,a) [V(s') − V'(s')]|

Why: the r(s,a) terms cancel.

Step 2: ≤ ∑a π(a|s) γ ∑s' p(s'|s,a) |V(s') − V'(s')|   (triangle inequality)

Step 3: ≤ ∑a π(a|s) γ ∑s' p(s'|s,a) ||V − V'||

Why: |V(s') − V'(s')| ≤ maxs |V(s) − V'(s)| = ||V − V'|| for any s'.

Step 4: = γ ||V − V'|| · ∑a π(a|s) ∑s' p(s'|s,a) = γ ||V − V'|| · 1

Why: probabilities sum to 1.

Since this holds for every s, it holds for the max over s. ■

Why Contraction Guarantees Convergence

A contraction mapping has a unique fixed point, and repeated application converges to it (Banach fixed-point theorem). Since γ < 1, the Bellman operator Tπ is a contraction with factor γ. After k iterations, the error is at most γk ||V0 − V*||, which → 0 exponentially fast. This is why value iteration is guaranteed to converge — and why γ < 1 is essential.

Worked Example — Contraction in Action

True V* = [10, 8]. Two initial guesses: Va = [0, 0] (pessimistic) and Vb = [20, 20] (optimistic). γ = 0.5.

Distance: ||Va − Vb|| = max(|0−20|, |0−20|) = 20.

After 1 Bellman backup: ||TVa − TVb|| ≤ 0.5 × 20 = 10. The two estimates are at most 10 apart.

After 5 backups: ≤ 0.55 × 20 = 0.625. Nearly identical!

After 10 backups: ≤ 0.510 × 20 = 0.0195. Converged for all practical purposes.

No matter how wrong your initialization is, the contraction property forces convergence. With γ = 0.99, convergence is much slower (0.99100 = 0.366, need ~460 iterations to reduce error by 100x).

Policy Iteration vs Value Iteration

PropertyPolicy IterationValue Iteration
Inner loopFull policy evaluation (solve linear system or iterate to convergence)None (single max operation per state)
Outer iterationsFew (often 3-10 for small MDPs)Many (proportional to 1/(1-γ))
Total computationOften faster for small MDPsOften faster for large MDPs
MemoryStores both V and πStores only V
ConvergenceFinite steps to exact π*Converges to V* in limit
Worked Example — Value Iteration

Two states: A and B. From A: action "go" gives r = 2, transitions to B. From B: action "stay" gives r = 1, stays in B. γ = 0.9.

Iteration 0: V(A) = 0, V(B) = 0.

Iteration 1: V(A) = 2 + 0.9 × 0 = 2. V(B) = 1 + 0.9 × 0 = 1.

Iteration 2: V(A) = 2 + 0.9 × 1 = 2.9. V(B) = 1 + 0.9 × 1 = 1.9.

Iteration 3: V(A) = 2 + 0.9 × 1.9 = 3.71. V(B) = 1 + 0.9 × 1.9 = 2.71.

Converges to: V*(B) = 1/(1−0.9) = 10, V*(A) = 2 + 0.9 × 10 = 11.

Chapter 09

DQN & Value Function Approximation

Tabular methods work when you can enumerate every state. But Atari has ~1030 possible screen images. You can't store a table that large. Value function approximation (VFA) replaces the table with a neural network V̂φ(s) or Q̂φ(s, a) that generalizes across similar states.

MC vs TD for VFA

Monte Carlo VFA: collect a trajectory, compute the return Gt, and regress V̂φ(st) toward Gt. Unbiased but high variance — one lucky trajectory can send the network in the wrong direction.

TD learning for VFA: use the bootstrapped target r + γV̂φ(s') instead of Gt. Lower variance (single-step estimate vs. entire trajectory) but biased (the target depends on our own potentially wrong estimates).

Worked Example — MC vs TD for Value Approximation

Mountain car environment. State = (position, velocity). We want V̂φ(s). Episode takes 150 steps with reward −1 per step (total = −150).

Monte Carlo: For every state visited in the episode, the target is the total return from that point. State at t = 50: target = −100 (100 remaining steps). State at t = 149: target = −1. All targets are noisy because different episodes take different numbers of steps (120 or 200 depending on initial conditions).

TD: For state at t = 50: target = −1 + 0.99 × V̂(s51). Even if V̂ is slightly wrong, the error from one step is small. But if V̂ is completely wrong early in training, TD propagates garbage: "my estimate of s51 is wrong, so my estimate of s50 is wrong, so my estimate of s49 is wrong..."

In practice, TD wins because the bias decreases as V̂ improves, while MC variance stays high forever.

Why Generalization Matters

The whole point of using a neural network instead of a table is generalization: if the agent learns that being near the goal is good, it should know this for all nearby states, even ones it hasn't visited. In a table, each entry is independent — learning Q(s7, up) = 10 tells you nothing about Q(s8, up). With a neural network, similar states get similar Q-values automatically.

The Double-Edged Sword of Generalization

Generalization is both the strength and weakness of deep RL. Strength: you can handle continuous state spaces with billions of possible states (like pixel observations). Weakness: an update to Q(s7) also changes Q(s8), Q(s9), etc. These unintended side effects can destabilize learning. This is why the deadly triad exists — it's specifically about how errors propagate through shared representations.

The Deadly Triad

The Deadly Triad

Combining three ingredients can cause divergence:

(1) Function approximation (neural nets instead of tables) — introduces generalization errors that propagate.

(2) Bootstrapping (using V̂(s') in the target) — creates a moving target that depends on the parameters we're updating.

(3) Off-policy learning (data from a different policy) — the data distribution doesn't match the policy we're evaluating.

Any two are fine. All three together can spiral out of control. DQN tames this with two tricks: replay buffers (partially fixes off-policy) and target networks (partially fixes bootstrapping).

Replay Buffer

Without replay, consecutive training samples are highly correlated (frames 1001, 1002, 1003 look almost identical). The network overfits to the current "neighborhood" of experience and forgets the rest. A replay buffer stores the last N transitions and samples uniformly at random, breaking temporal correlations and providing i.i.d.-like training data.

Worked Example — Why Correlation Is Deadly

Atari Pong. The agent is in the middle of a rally for frames 5000-5100. Without replay, the minibatch is [frame 5000, 5001, 5002, ..., 5031] — nearly identical images of the same rally. The network learns "this specific paddle position is good" and overwrites everything it knew about serving, recovering from corners, or other game situations. This is catastrophic forgetting.

With replay (buffer size 1M), the minibatch might contain: frame 5000 (mid-rally), frame 2300 (serving), frame 87000 (scoring), frame 45000 (edge recovery). Diverse experiences prevent forgetting and ensure the network learns a globally useful Q-function.

Definition
Prioritized Experience Replay

Instead of sampling uniformly from the buffer, prioritize transitions with high TD error |Q(s,a) − y|. These are the transitions where the Q-network is most wrong — and thus where it has the most to learn. This was introduced by Schaul et al. (2016) and is used in Rainbow DQN.

Fixed Q-Targets

The target y = r + γ maxa'θ(s', a') depends on θ. Every gradient step changes both the prediction and the target — like a dog chasing its tail. The fix: maintain a target networkθ' with frozen parameters, updated periodically by copying θ → θ' every N steps.

DQN Algorithm (Mnih et al. 2015)
  1. Initialize Q-network θ, target network θ' ← θ, replay buffer D.
  2. For each step: observe s, pick a via ε-greedy on Qθ.
  3. Execute a, observe r, s'. Store (s, a, r, s') in D.
  4. Sample minibatch of transitions from D.
  5. Compute targets: yi = ri + γ maxa'θ'(s'i, a').
  6. Gradient step: θ ← θ − α ∇θi (Q̂θ(si, ai) − yi)2.
  7. Every N steps: θ' ← θ.

The Deadly Triad in Detail

Worked Example — How Divergence Happens

Consider a simple MDP with 3 states. A neural net approximates Q with shared features. True Q*(A) = 5, Q*(B) = 5, Q*(C) = 5.

Step 1: From data, we update Q(A) toward target 5.2 (slightly noisy). Due to weight sharing, Q(B) moves to 5.1 and Q(C) moves to 5.15.

Step 2: Now max Q is used as target for another state. Target = r + γ × 5.2 (using inflated max). We train toward this inflated target.

Step 3: The update pushes Q even higher. Next iteration, max Q is 5.4. Then 5.8. Then 6.5. Then 8. Then 15. Then 100.

Without the table to anchor each entry independently, errors propagate through shared weights. Target networks and replay buffers slow this spiral but don't eliminate it completely.

Double DQN

Standard DQN overestimates Q-values because maxa' Q̂(s', a') preferentially selects actions with positive noise. Double DQN decouples action selection from evaluation:

Double DQN Target y = r + γ Q̂θ'(s', argmaxa'θ(s', a'))
Online network θ selects the best action. Target network θ' evaluates it. Since they have different noise, the overestimation bias is broken.
Interactive: Q-Value Overestimation
Watch how Q-values evolve during training. Standard DQN (red) overestimates because max picks the noisiest action. Double DQN (green) stays closer to the true values (white dashed). Click "Run" to animate the learning process.
Worked Example — Double DQN Fix

State s' has 3 actions. True Q*: [5, 5, 5]. Noisy estimates from θ: [7, 4, 6]. Noisy estimates from θ': [4, 7, 5].

Standard DQN: max from θ': max(4, 7, 5) = 7. Overestimate: 7 vs true 5.

Double DQN: θ selects action 0 (highest: 7). θ' evaluates action 0: Qθ'(s', 0) = 4. Estimate: 4. Underestimate, but no systematic upward bias.

Over many samples, Double DQN's estimates converge to the true values rather than drifting upward.

Checkpoint — Before you move on
DQN, PPO, and SAC all assume the agent can interact with the environment. Explain: why is offline RL (learning from a fixed dataset) fundamentally harder than online RL? What specific failure mode arises when you naively apply Q-learning to a fixed dataset?
✓ Gate cleared
Model Answer

The fundamental problem: Q-learning uses maxa' Q(s',a') in the target. In online RL, if Q overestimates some action, the agent eventually tries it and gets corrected by the real reward. Offline, the agent never acts — so overestimated actions are never corrected. Worse, the max operator preferentially selects the most overestimated action, and bootstrapping propagates this overestimate backward through the Bellman equation. The result: Q-values for unobserved actions hallucinate to arbitrarily high values, and the extracted policy confidently selects actions that were never in the data (and are likely catastrophic). This is why IQL avoids querying OOD actions entirely, and CQL explicitly pushes down Q-values for unseen actions.

💥 Break-It Lab What Dies When You Break RL's Stability Mechanisms? ✓ ATTEMPTED
A working DQN agent learns Q-values that converge to the true values (white dashed line). Toggle off each stability mechanism to see the specific failure mode it prevents.
Set γ = 1 (no discounting) ACTIVE
Failure mode: Without discounting, V(s) = ∑t=0 rt can diverge to ±∞. The Bellman operator is no longer a contraction (γ = 1 means the contraction factor is 1). Value iteration has no guarantee of convergence — estimates oscillate or explode. The agent cannot distinguish between strategies that defer reward indefinitely.
Remove target network (bootstrap from live Q) ACTIVE
Failure mode: Without a frozen target, the TD target y = r + γ max Qθ(s',a') changes with every gradient step. The optimization chases a moving target: overestimation in one state propagates instantly to all states that bootstrap from it, creating a positive feedback loop. Q-values spiral upward exponentially.
Set ε = 0 (no exploration) ACTIVE
Failure mode: Without exploration, the agent greedily exploits its current (imperfect) Q-function. It visits a narrow corridor of states, overfitting Q to that corridor while Q-values for unvisited states remain random. The agent gets stuck in a local optimum, never discovering that other actions lead to higher reward. The replay buffer fills with homogeneous data, causing catastrophic forgetting of early experience.
⚔ Adversarial: The Deadly Triad in Practice
You're training DQN on a continuous-state task (robot arm, 7D state). After 100K steps, the Q-values are stable around [3, 5, 2, 4] for the 4 actions. You then increase the replay buffer's priority exponent from 0.6 to 1.0 (more aggressive prioritization). Within 10K steps, Q-values explode to [150, 200, 180, 190]. What happened?
Chapter 10

Offline RL: IQL & CQL

In offline RL, you have a fixed dataset of transitions collected by some behavior policy πβ — and you can't collect any more data. Think: learning to drive from dashcam footage without ever touching a steering wheel. The challenge is terrifying: the Q-function can hallucinate high values for actions the dataset never saw.

The Overestimation Problem

Standard Q-learning updates Q(s, a) toward r + γ maxa' Q(s', a'). The max might select an action a' that never appears in the dataset. Since Q(s', a') was never corrected by real data, it could be arbitrarily wrong — and the max ensures it's wrong in the upward direction. These phantom high Q-values propagate backward through the Bellman equation, corrupting the entire Q-function.

Why Online Q-Learning Fails Offline

With online learning, the agent eventually tries the overestimated action and gets a correcting signal. Offline, it never tries anything — so the overestimation is never corrected. It's like a student who answers untested material with wild guesses that are always optimistic. Without exams (interaction), the guesses never get corrected.

Worked Example — Offline Overestimation

Dataset: 1000 transitions from a robot arm reaching targets on the left half of a table. The arm never reached right.

Q-learning iteration 1: Q(s, reach_left) trained on real data → Q = 8 (good). Q(s, reach_right) randomly initialized → Q = 0.

Iteration 100: Q(s, reach_left) = 8 (stable, supported by data). Q(s, reach_right) = 12 (inflated! No correcting data). Why? Some state s' in the buffer has maxa' Q(s', a') = 12 via an unseen action. This inflated target propagated to Q(s, reach_right) through bootstrapping.

Iteration 500: Q(s, reach_right) = 47. Completely fictional. The agent would confidently reach right and break.

Implicit Q-Learning (IQL)

IQL's insight: never query Q for actions outside the dataset. Instead of maxa' Q(s', a'), approximate the maximum using expectile regression — a statistical technique that estimates quantiles without explicit maximization.

Definition
Expectile Regression

The τ-expectile loss is Lτ2(u) = |τ − 1{u < 0}| · u2. When τ = 0.5, this is ordinary least squares. When τ → 1, the loss penalizes under-predictions much more than over-predictions, so the minimizer approaches the maximum of the distribution. Think of it as a smooth, differentiable approximation to max.

IQL uses three networks:

IQL Algorithm
  1. Learn Vψ(s) via expectile regression on Q values:
    Minimize 𝔼(s,a)~D[ Lτ2(Qθ(s, a) − Vψ(s)) ]
    For τ near 1, Vψ(s) ≈ maxa Q(s, a) without ever evaluating Q on unseen actions.
  2. Learn Qθ(s, a) via standard MSE with V-targets:
    Minimize 𝔼(s,a,r,s')~D[ (Qθ(s, a) − [r + γ Vψ(s')])2 ]
    Target uses V, not max Q — so out-of-distribution actions are never queried.
  3. Extract policy via Advantage Weighted Regression (AWR):
    π(a|s) ∝ πβ(a|s) · exp((Qθ(s,a) − Vψ(s)) / β)
    Reweight the behavior policy by exponentiated advantages. High-advantage actions get more probability.

Conservative Q-Learning (CQL)

CQL takes a different approach: explicitly push down Q-values for actions not in the dataset, making the Q-function a lower bound on the true values. The key idea: if you can't tell whether an action is good or bad (because you've never tried it), assume it's bad. This is conservative but safe.

Conservative = Safe

In safety-critical applications (healthcare, autonomous driving, robotics), overestimating an untested action's value can be catastrophic. CQL's conservatism is a feature, not a bug: the extracted policy will never choose an action whose value is only high because of hallucination. The policy might be suboptimal (it avoids potentially good actions it hasn't seen data for), but it won't be catastrophically bad.

CQL Regularizer minθ   α ( 𝔼s~D, a~μ[ Qθ(s, a) ] − 𝔼s~D, a~πβ[ Qθ(s, a) ] ) + standard TD loss
Minimize Q under some distribution μ (covering unseen actions), maximize Q under the behavior policy πβ (seen actions). Net effect: push down unseen actions, push up seen actions.

The optimal choice of μ that maximizes the regularizer is μ(a|s) ∝ exp(Qθ(s, a)) — the softmax of Q-values. This creates a minimax game: the regularizer tries to find the most overestimated actions and push them down.

Worked Example — CQL vs Standard Q-Learning

Dataset: transitions from a robot arm that only reaches the left side of a table. State: s = "object on the right." Standard Q-learning: Q(s, reach_right) might be very high because it was never corrected by failure data. The extracted policy reaches right and crashes.

CQL: The regularizer pushes down Q(s, reach_right) because reach_right never appears in the data for this state. Q(s, reach_left) stays high (it's in the data). The extracted policy safely reaches left — suboptimal but doesn't crash.

IQL vs CQL

IQL avoids querying out-of-distribution actions entirely (implicit approach). CQL explicitly penalizes them (conservative approach). IQL is simpler and often faster to train. CQL provides stronger theoretical guarantees (provable lower bound on Q). Both outperform naive offline Q-learning by a large margin.

Worked Example — Expectile Regression

Suppose Q(s, a) for the 4 actions in our dataset are: [2, 5, 3, 8]. With τ = 0.5 (standard mean), V(s) = (2+5+3+8)/4 = 4.5. With τ = 0.9 (high expectile), V(s) ≈ 7.2 — close to the maximum of 8, but not exactly 8. With τ = 0.99, V(s) ≈ 7.9 — even closer to max.

How the loss works: For τ = 0.9, errors where Q > V get weight τ = 0.9, while errors where Q < V get weight (1−τ) = 0.1. So the loss strongly penalizes underestimation of high Q-values, pushing V upward toward the max. This is how IQL approximates maxa Q(s,a) without ever evaluating Q on unseen actions.

Practical Considerations for Offline RL

FactorIQLCQL
Requires OOD query?No (expectile-based)Yes (softmax sampling)
Hyperparametersτ (expectile), β (AWR temp)α (CQL weight)
Theoretical guaranteeApproximation to optimalProvable lower bound on Q
Typical useRobot manipulationGame playing, navigation
ComputationFast (no sampling loop)Slower (logsumexp over actions)
Chapter 11

RLHF & DPO

How do you align a language model to human preferences when you can't write down a reward function? Reinforcement Learning from Human Feedback (RLHF) trains a reward model from human comparisons, then optimizes the LM against it. Direct Preference Optimization (DPO) skips the reward model entirely.

The RLHF Pipeline

Three Stages
RLHF Pipeline

Stage 1 — Supervised Fine-Tuning (SFT): Fine-tune the base LM on high-quality demonstrations. This gives you πref — a model that can write decent responses.

Stage 2 — Reward Modeling: Show humans pairs of responses (y1, y2) to the same prompt x. They choose which is better. Train a reward model rφ(x, y) to predict these preferences.

Stage 3 — RL Fine-Tuning: Optimize the LM policy πθ to maximize the learned reward, with a KL penalty to stay close to πref.

The Bradley-Terry Model

Given a prompt x and two responses y1, y2, the probability that a human prefers y1 over y2 is modeled as:

Bradley-Terry Preference Model p*(y1 ≻ y2 | x) = exp(r*(x, y1)) / (exp(r*(x, y1)) + exp(r*(x, y2)))
= σ(r*(x, y1) − r*(x, y2))
where σ is the sigmoid function. The response with higher reward is more likely to be preferred.

Reward Model Training

Reward Model Loss LRM(φ) = −𝔼(x, yw, yl)[ log σ(rφ(x, yw) − rφ(x, yl)) ]
yw = winner (preferred), yl = loser. Maximize the log-likelihood that the reward model assigns higher reward to the winner.

The RL Objective

RLHF Objective maxθ 𝔼x~D, y~πθ(y|x)[ rφ(x, y) ] − β DKLθ || πref)
Maximize reward while staying close to the reference policy. β controls the trade-off: too small and the model "reward hacks," too large and it ignores the reward signal.
Why This Is Hard

Language generation is non-differentiable — you can't backpropagate through token sampling (argmax or categorical sampling have zero or undefined gradients). So you can't just do gradient descent on the reward. You need RL (specifically PPO) to optimize through the non-differentiable sampling step. This is computationally expensive and unstable.

Worked Example — RLHF Reward Modeling

Prompt: "Write a haiku about rain." Two responses:

y1 (winner): "Soft drops on the leaves / Whispering to the still pond / Nature breathes again"

y2 (loser): "Rain is water that falls from clouds in the sky and makes things wet"

Current reward model: rφ(x, y1) = 2.3, rφ(x, y2) = 1.8. Difference: 0.5.

Loss: −log σ(2.3 − 1.8) = −log σ(0.5) = −log(0.622) = 0.475.

Gradient: Pushes rφ(x, y1) up and rφ(x, y2) down, increasing the gap. Over thousands of preference pairs, the reward model learns that well-structured creative responses are preferred over flat factual statements.

The KL Penalty in Practice

Without the KL term, the LM quickly discovers reward hacking — pathological outputs that score high on the reward model but are gibberish to humans. For example, a model trained to maximize a helpfulness reward might repeat "I'm so happy to help you! That's a great question!" endlessly. The KL penalty DKLθ || πref) prevents this by penalizing the model for deviating too far from the SFT baseline.

KL Penalty Per Token DKLθ || πref) = 𝔼y~πθ[ ∑t=1T log(πθ(yt|x, y<t) / πref(yt|x, y<t)) ]
Sum of per-token log-ratio. If πθ assigns much more probability to some token than πref, that token gets a large KL penalty.

DPO: Bypassing the Reward Model

DPO's key insight: the optimal policy under the KL-constrained objective has a closed-form solution, and we can reparameterize the reward in terms of the policy itself.

Full Derivation — DPO

Step 1 — Solve for the optimal policy. The RLHF objective with KL constraint is:

maxπ 𝔼y~π[r(x,y)] − β DKL(π || πref)

= maxπ 𝔼y~π[r(x,y) − β log(π(y|x)/πref(y|x))]

= maxπ 𝔼y~π[r(x,y) − β log π(y|x) + β log πref(y|x)]

Taking the functional derivative with respect to π(y|x) and setting to zero:

r(x,y) − β log π(y|x) − β + β log πref(y|x) = 0

Solving for π: π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x)

where Z(x) = ∑y πref(y|x) exp(r(x,y)/β) is the partition function. This is the optimal policy given reward r.

Step 2 — Reparameterize the reward. From the optimal policy expression, solve for r:

r(x, y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)

Step 3 — Substitute into Bradley-Terry. The preference probability becomes:

p*(yw ≻ yl) = σ(r(x,yw) − r(x,yl))

= σ(β log(π*(yw|x)/πref(yw|x)) − β log(π*(yl|x)/πref(yl|x)))

The Z(x) terms cancel! They appear in both r(x,yw) and r(x,yl) and subtract out.

Step 4 — Write the DPO loss. Replace π* with πθ (the policy we're training):

DPO Loss LDPO(θ) = −𝔼(x, yw, yl)[ log σ( β log(πθ(yw|x) / πref(yw|x)) − β log(πθ(yl|x) / πref(yl|x)) ) ]
No reward model, no RL, no sampling during training. Just a classification loss on preference pairs.
Why DPO Is Revolutionary

DPO reduces RLHF to supervised learning. No reward model to train. No PPO loop to run. No generation during training. Just: compute log-probabilities of the winning and losing responses under πθ and πref, plug into the loss, and do gradient descent. It's 10x simpler to implement and much more stable than PPO-based RLHF.

DPO Gradient Analysis

What does the DPO gradient actually do? Taking the gradient of LDPO with respect to θ:

DPO Gradient Intuition

θLDPO = −β 𝔼[ σ(−u) (∇ log πθ(yw|x) − ∇ log πθ(yl|x)) ]

where u = β(log πθ(yw)/πref(yw) − log πθ(yl)/πref(yl)).

The weighting σ(−u): When u is large and positive (the model already strongly prefers yw), σ(−u) is near 0 — almost no gradient. When u is negative (the model incorrectly prefers yl), σ(−u) is near 1 — strong gradient signal. The gradient focuses on examples the model gets wrong, ignoring ones it already handles correctly.

The direction: The gradient increases log π(yw) and decreases log π(yl). It simultaneously makes the winner more likely and the loser less likely.

RLHF vs DPO: Practical Comparison

PropertyRLHF (PPO)DPO
Training stages3 (SFT → RM → RL)2 (SFT → DPO)
Models needed4 (SFT, RM, policy, ref)2 (policy, ref)
Generates during trainingYes (PPO rollouts)No
GPU memoryVery high (4 models)Moderate (2 models)
StabilitySensitive to hyperparametersRobust
Reward hacking riskHigher (imperfect RM)Lower (no RM to exploit)
Iterative improvementNatural (generate → relabel)Requires new preference data
State of the artLlama 2, InstructGPTZephyr, Llama 3 (partially)
DPO Failure Modes

Distribution shift: DPO assumes the preference data was generated by the optimal policy under πref. If the preference pairs are from a very different policy, the DPO objective may not converge to the right solution.

β too small: The policy can diverge far from πref, potentially generating incoherent text. The log-ratios blow up.

β too large: The policy barely moves from πref. The preference signal is too weak to have an effect. The model doesn't improve.

Preference noise: If annotators frequently disagree (noisy preferences), DPO's loss becomes noisy too. RLHF's reward model can average over multiple annotations per pair; DPO uses each pair directly.

Worked Example — DPO Update

Prompt x: "Explain gravity." Winner yw: clear explanation. Loser yl: vague rambling. β = 0.1.

Current log-ratios: log(πθ(yw|x)/πref(yw|x)) = 0.3. log(πθ(yl|x)/πref(yl|x)) = 0.5.

DPO argument: β(0.3 − 0.5) = 0.1 × (−0.2) = −0.02.

σ(−0.02) ≈ 0.495. Loss: −log(0.495) ≈ 0.703.

The gradient will increase πθ(yw|x) relative to πref and decrease πθ(yl|x) relative to πref. Intuitively: make the winner more likely and the loser less likely, compared to the reference model.

Chapter 12

The Complete Cheat Sheet

Algorithm Taxonomy

AlgorithmTypeOn/Off-PolicyActionsKey Idea
REINFORCEPolicy gradientOn-policyAnyMonte Carlo policy gradient
A2CActor-criticOn-policyAnyLearned baseline + bootstrapping
PPOActor-criticOn-policyAnyClipped surrogate objective
SACActor-criticOff-policyContinuousMax entropy + replay buffer
DQNValue-basedOff-policyDiscreteTarget network + replay
IQLValue-basedOfflineAnyExpectile regression avoids OOD
CQLValue-basedOfflineAnyConservative Q lower bound
DPOPreferenceOfflineLM tokensReparameterize reward into policy

Key Equations

EquationOne-Line Explanation
Vπ(s) = 𝔼a~π[r + γVπ(s')]Bellman equation: value = reward + discounted future value
θV = 𝔼[∇ log π · Qπ]Policy gradient theorem: weight each action by its Q-value
Aπ(s,a) = Qπ(s,a) − Vπ(s)Advantage: how much better is action a than average?
δt = r + γV(s') − V(s)TD error: one-step advantage estimate
GAE = ∑ (γλ)l δt+lGAE: exponentially-weighted sum of TD errors
LCLIP = 𝔼[min(rÂ, clip(r)Â)]PPO: clip the ratio to prevent catastrophic updates
Q* = r + γ maxa' Q*Bellman optimality: optimal Q satisfies this fixed point
V(π)−V(π') = 1/(1−γ) 𝔼[Aπ']PDL: performance gap = expected advantage
LDPO = −log σ(β(log rw − log rl))DPO: preference learning as classification

When to Use What

Interactive: Algorithm Decision Tree
Answer questions about your problem to get an algorithm recommendation. Click the options at each decision point.

On-Policy vs Off-Policy: The Fundamental Trade-Off

Definition
On-Policy vs Off-Policy

On-policy (REINFORCE, PPO): The data used for each update was collected by the current policy. After each update, old data is discarded and new data is collected. Pro: the gradient is always accurate. Con: data is used once and thrown away — expensive.

Off-policy (DQN, SAC): Data from any policy can be reused. A replay buffer stores millions of transitions from past policies. Pro: much more sample-efficient (each transition is used dozens of times). Con: the data distribution doesn't match the current policy, introducing bias that must be corrected (importance sampling, target networks, etc.).

Worked Example — Data Efficiency Comparison

Training a robot arm. Budget: 1 million environment steps.

PPO (on-policy): Collects batches of 2048 steps, updates, discards. 1M / 2048 = ~488 policy updates. Each transition used 4 times (K=4 epochs). Total gradient steps: ~2000.

SAC (off-policy): After initial exploration, every environment step triggers one gradient update using a batch of 256 from the replay buffer. Total gradient steps: ~1M. Each transition is reused ~256 times over its lifetime in the buffer.

Result: SAC does ~500x more gradient updates with the same data. For tasks where environment interaction is expensive (real robots: $100/hour), off-policy wins decisively. For tasks where simulation is cheap (video games: 1000 fps), on-policy's simplicity often wins.

Historical Timeline

YearMilestoneKey Contribution
1989Watkins — Q-LearningOff-policy tabular value learning with convergence proof
1992Williams — REINFORCEPolicy gradient via the score function trick
1999Sutton et al. — PG TheoremFormal foundation for policy gradient methods
2000Kakade — Natural PGFisher information matrix for parameterization-invariant updates
2013Mnih et al. — DQNNeural networks + replay buffer + target networks for Atari
2015Schulman et al. — TRPOTrust region constraints for stable policy updates
2016Schulman et al. — GAEExponentially-weighted TD errors for advantage estimation
2017Schulman et al. — PPOClipped surrogate objective replaces TRPO constraints
2018Haarnoja et al. — SACMaximum entropy off-policy actor-critic for continuous control
2020Kumar et al. — CQLConservative Q-function lower bounds for offline RL
2022Kostrikov et al. — IQLExpectile regression avoids OOD action evaluation
2022Ouyang et al. — InstructGPTRLHF pipeline for language model alignment
2023Rafailov et al. — DPOReparameterize reward into policy for direct preference optimization

Connections

This lesson covered the complete landscape of deep RL theory. For implementation details, see the companion lessons:

TopicLesson
Tabular Q-learning with worked examplesQ-Learning: Value-Based RL
Imitation learning as an alternative to RLImitation Learning
Actor-critic methods in depthOff-Policy Actor-Critic
MDPs and value iteration from scratchMarkov Decision Processes

Common Hyperparameters

AlgorithmKey HyperparameterTypical RangeWhat It Controls
PPOε (clip range)0.1 – 0.3How far the policy can change per update
PPOλ (GAE)0.9 – 0.99Bias-variance trade-off in advantage estimation
PPOK (epochs)3 – 10How many passes over each batch of data
SACα (temperature)Auto-tunedExploration-exploitation balance via entropy
SACτ (target EMA)0.005How fast target networks track the online network
DQNε (exploration)1.0 → 0.01Random action probability (annealed)
DQNTarget update freq1K – 10K stepsHow often to freeze target network
IQLτ (expectile)0.7 – 0.99How close to max (higher = closer)
CQLα (CQL weight)0.5 – 10How conservative to be with OOD actions
DPOβ0.05 – 0.5How much to deviate from reference policy

The RL Algorithm Family Tree

How Everything Connects
The Genealogy of Deep RL

Root: The Bellman equation (1957) gives us the recursive structure of value.

Branch 1 — Value-based: Compute Q*, derive π* as argmax. Tabular Q-learning (1989) → DQN (2013) → Double DQN (2016) → Rainbow (2018). Limited to discrete actions.

Branch 2 — Policy gradient: Directly optimize πθ. REINFORCE (1992) → Actor-Critic (2000s) → TRPO (2015) → PPO (2017). Works with any action space.

Branch 3 — Off-policy actor-critic: Combine replay buffers with policy optimization. DDPG (2015) → TD3 (2018) → SAC (2018). Best sample efficiency for continuous control.

Branch 4 — Offline RL: Learn from fixed data without interaction. BCQ (2019) → CQL (2020) → IQL (2022). Critical for real-world deployment.

Branch 5 — Preference-based: Learn from human comparisons instead of numeric rewards. RLHF (2022) → DPO (2023). Powers modern LLM alignment.

Quick Reference: Which Paper to Read

Want To LearnRead This PaperPages
The foundations of policy gradientsSutton et al. (1999) "Policy Gradient Methods for RL with Function Approximation"8
Trust region methodsSchulman et al. (2015) "Trust Region Policy Optimization"16
PPO in practiceSchulman et al. (2017) "Proximal Policy Optimization Algorithms"12
Off-policy continuous controlHaarnoja et al. (2018) "Soft Actor-Critic"13
Deep Q-learning from pixelsMnih et al. (2015) "Human-level control through deep RL" (Nature)10
Offline RL fundamentalsLevine et al. (2020) "Offline RL: Tutorial, Review, and Perspectives"55
Aligning LMs to preferencesRafailov et al. (2023) "Direct Preference Optimization"29
The One Sentence

Deep RL is the art of converting the Bellman equation — value = reward + discounted future value — into practical algorithms that scale from gridworlds to language models. Everything is a variation on that one recursive idea.

"What I cannot create, I do not understand."

— Richard Feynman

🏗 Design Challenge You're the Architect: RL Agent for Continuous Control ✓ ATTEMPTED
You're deploying an RL agent to control a 12-DOF quadruped robot in the real world. The robot has joint position/velocity sensors (24-dim state) and outputs torques (12-dim continuous action). You have a simulator that's ~80% accurate and access to the real robot for 2 hours/day. The task: walk forward at 1 m/s over varied terrain.
State Space
24-dim continuous
Action Space
12-dim continuous
Sim Budget
Unlimited
Real Robot
2 hrs/day
Sim-to-Real Gap
~20% dynamics error
Safety
Must not fall (hardware damage)
1. Which algorithm class: on-policy (PPO), off-policy (SAC), or offline (IQL)? Justify considering data efficiency vs. stability vs. sim-to-real.
2. Function class for πθ: MLP, RNN, or diffusion policy? What hidden sizes, and fixed or learned σ?
3. How do you bridge the sim-to-real gap? Domain randomization? System identification? Sim-to-real fine-tuning?
4. Safety: how do you prevent the robot from executing actions that cause falling during real-world training?
5. Reward design: dense (tracking velocity) vs. sparse (reach waypoint)? Include stability terms?

Industry standard (2024): PPO in simulation with massive domain randomization, then zero-shot or few-shot transfer to real hardware.

Algorithm: PPO (not SAC) because on-policy is more stable under domain randomization. The sim-to-real gap means off-policy data from a slightly-wrong simulator can mislead SAC's Q-function. PPO's on-policy nature means it only trusts fresh rollouts.

Architecture: MLP (2-3 layers, 256-512 units). Fixed σ (log-std as learnable parameter, not state-dependent) for locomotion — state-dependent σ can collapse to zero prematurely. Unitree, Boston Dynamics, and ETH Zurich all use this setup.

Sim-to-real: Domain randomization (randomize friction, mass, motor delay, ground slope, sensor noise) during sim training. The policy learns to be robust to all variations. 2 hours of real-robot fine-tuning with conservative PPO (ε = 0.05) for final adaptation.

Safety: Action clipping to torque limits + safety constraints in reward (penalty for IMU angle > 30°) + early termination in sim when falling. Real robot: conservative action bounds initially, gradually relaxed.

Reward: Dense: r = vforward − 0.1||torques||2 − 5·1[fell] − 0.5|roll| − 0.5|pitch|. Multiple terms for velocity tracking, energy efficiency, and stability. Sparse rewards fail for locomotion (too many steps before any signal).

💻 Build It Implement REINFORCE with Baseline from Scratch ✓ ATTEMPTED
You've seen REINFORCE with baseline in Chapter 3. Now implement the full gradient computation. Given a batch of trajectories, compute the policy gradient using reward-to-go and a state-dependent baseline. The baseline is V̂(s) = average return from state s.
python def reinforce_gradient(log_probs, rewards, gamma=0.99): """ Compute REINFORCE policy gradient with reward-to-go and baseline. Args: log_probs: list of N trajectories, each a list of T log pi(a|s) values rewards: list of N trajectories, each a list of T reward values gamma: discount factor Returns: loss: scalar surrogate loss (negate for gradient ascent) """
Test case
log_probs = [[-0.5, -0.3, -0.8]], rewards = [[1, 2, 10]], gamma=0.99
reward_to_go[0] = [1 + 0.99*2 + 0.99^2*10, 2 + 0.99*10, 10] = [12.79, 11.90, 10.0]
baseline = [12.79, 11.90, 10.0] (only 1 trajectory, so baseline = reward_to_go)
advantages = [0, 0, 0] → loss = 0 (single trajectory can't distinguish good from bad)
Loop backwards: G[T-1] = r[T-1], then G[t] = r[t] + gamma * G[t+1]. This is O(T) per trajectory. The baseline at time t is the mean of G[t] across all N trajectories.
import torch

def reinforce_gradient(log_probs, rewards, gamma=0.99):
    N = len(log_probs)

    # Step 1: Compute discounted reward-to-go
    all_rtg = []
    for i in range(N):
        T = len(rewards[i])
        rtg = [0.0] * T
        rtg[T-1] = rewards[i][T-1]
        for t in range(T-2, -1, -1):
            rtg[t] = rewards[i][t] + gamma * rtg[t+1]
        all_rtg.append(rtg)

    # Step 2: Compute baseline (mean reward-to-go per timestep)
    T = len(all_rtg[0])  # assume equal length
    baseline = [0.0] * T
    for t in range(T):
        baseline[t] = sum(all_rtg[i][t] for i in range(N)) / N

    # Step 3: Compute advantages
    loss = 0.0
    for i in range(N):
        for t in range(T):
            advantage = all_rtg[i][t] - baseline[t]
            loss += -log_probs[i][t] * advantage

    return loss / (N * T)
Bonus challenge: Modify this to use GAE(λ=0.95) instead of full reward-to-go. You'll need a value function V̂(s) and TD errors δt = rt + γV̂(st+1) − V̂(st).
🔗 Pattern Recognition
Iterative Refinement: Value Iteration = Diffusion Denoising = EM
This Lesson (RL)
Vk+1(s) = T* Vk(s)
Repeatedly apply contraction until fixed point V*
Diffusion Models
xt-1 = denoise(xt)
Repeatedly refine noise into clean sample → Diffusion

Both processes start from an arbitrary initialization and converge to a fixed point by repeatedly applying a contractive operator. In RL, the contraction factor is γ. In diffusion, it's the noise schedule βt. The deeper pattern: any iterative map with contraction factor < 1 converges to its unique fixed point (Banach). You see this in EM (converges to local maximum of likelihood), power iteration (converges to dominant eigenvector), and even Newton's method (quadratic contraction near roots).

Can you identify the "contraction factor" in EM? (Hint: it relates to the fraction of missing information.)