Every algorithm, every derivation, every proof — from the Bellman equation to DPO. The definitive reference for deep RL theory.
Before we can solve anything, we need a language for "an agent acting in a world." That language is the Markov Decision Process (MDP) — a mathematical framework that captures states, actions, transitions, and rewards in one clean package.
S — state space (all possible situations the agent can be in, like positions on a board).
A — action space (all moves available, like up/down/left/right).
p(s' | s, a) — transition dynamics (if you're in state s and take action a, what's the probability of landing in s'? Think: physics of the world).
r(s, a) — reward function (immediate payoff for taking action a in state s — the score you get right now).
ρ0(s) — initial state distribution (where episodes start — like the starting position in a game).
γ ∈ [0, 1) — discount factor (how much you value future rewards vs. immediate ones; γ = 0.99 means you're patient, γ = 0.1 means you're impatient).
H — horizon (how many time steps per episode).
The "Markov" in MDP means the future depends only on the current state, not on how you got there. Formally: P(st+1 | st, at, st−1, at−1, ...) = P(st+1 | st, at). The entire history is summarized in the current state. This is a modeling assumption — if you're playing chess, the board position tells you everything. If you're driving, your current speed and position matter more than where you drove an hour ago.
The discount factor γ is surprisingly subtle. It serves three purposes:
1. Time preference: γ = 0.99 means a reward one step from now is worth 0.99 of the same reward right now. A reward 100 steps from now is worth 0.99100 = 0.366 — we're patient but not infinitely so.
2. Mathematical convergence: Without discounting (γ = 1), the infinite sum ∑ γt rt might not converge. With γ < 1, it's always bounded by rmax/(1−γ).
3. Effective horizon: The "effective horizon" is approximately 1/(1−γ). With γ = 0.99, the agent effectively plans ~100 steps ahead. With γ = 0.999, ~1000 steps. This lets you control the planning horizon without explicitly setting H.
Two strategies for a robot: Strategy A gets reward +1 every step forever. Strategy B gets reward 0 for 10 steps, then +2 forever.
With γ = 0.9: V(A) = 1/(1−0.9) = 10. V(B) = 0.910 × 2/(1−0.9) = 0.349 × 20 = 6.97. A wins — the impatient agent prefers steady small rewards.
With γ = 0.99: V(A) = 1/(1−0.01) = 100. V(B) = 0.9910 × 200 = 0.904 × 200 = 180.8. B wins — the patient agent is willing to wait for double the reward.
A policy π(a | s) maps states to action probabilities. It's the agent's strategy: "when I see state s, here's how I pick an action." A deterministic policy always picks one action; a stochastic policy samples from a distribution.
An episode (or trajectory) is a sequence τ = (s0, a0, r0, s1, a1, r1, ..., sH). The agent starts at s0 ~ ρ0, picks actions via π, and the world responds via p. The whole game of RL is to find the policy that maximizes the expected total discounted reward:
Suppose you're in state s and you follow policy π forever after. How much total reward do you expect? That's the value function:
Now for the most important recursive relationship in all of RL — the Bellman equation. The value of being in state s equals the immediate reward plus the discounted value of wherever you end up next:
Start from the definition:
Vπ(s) = 𝔼a~π(s) [ r(s, a) + γ 𝔼s'~p(s'|s,a)[ Vπ(s') ] ]
Step 1: Expand the expectation over the trajectory. The first reward is r(s0, a0), and everything after s1 is another trajectory starting from s1.
Step 2: By the Markov property (the future depends only on the current state, not the past), the expected future reward from s1 onward is exactly Vπ(s1).
Step 3: Combine: Vπ(s) = 𝔼a~π[r(s,a)] + γ 𝔼a~π[𝔼s'~p[Vπ(s')]]. This is the Bellman equation — it says value = immediate reward + discounted future value.
Sometimes we want to know the value of a specific action, not just the state. The action-value function Qπ(s, a) is the expected total reward if you take action a in state s, then follow π thereafter:
The relationship between V and Q is clean: Vπ(s) = 𝔼a~π(a|s)[Qπ(s, a)]. The value of a state is just the expected Q-value over the actions your policy would take.
The optimal value function V*(s) = maxπ Vπ(s) is the best achievable value from state s under any policy. Similarly, Q*(s, a) = maxπ Qπ(s, a). The optimal policy π*(a | s) is the one that achieves these values, and it satisfies π*(s) = argmaxa Q*(s, a).
For any finite MDP, there exists a deterministic optimal policy π* that simultaneously maximizes Vπ(s) for every state s. This is remarkable: you might expect that optimizing for one starting state could hurt performance in another. But the Bellman optimality equation guarantees that the same policy is optimal everywhere. This is because the optimal action in state s depends only on Q*(s, ·), which already accounts for optimal behavior in all future states.
Start from the Bellman equation for Qπ: Qπ(s, a) = r(s, a) + γ 𝔼s'[𝔼a'~π[Qπ(s', a')]].
For the optimal policy π*, the inner expectation becomes: 𝔼a'~π*[Q*(s', a')] = maxa' Q*(s', a'). This is because π* puts all probability on the action with the highest Q-value.
Substituting: Q*(s, a) = r(s, a) + γ 𝔼s'~p[maxa' Q*(s', a')]. This is the Bellman optimality equation — it characterizes Q* as the unique fixed point of the optimality operator.
States: {A, B}. Actions: {left, right}. From A, right gives r = +3 and goes to B. From A, left gives r = +1 and stays at A. From B, right gives r = +10 (terminal). From B, left gives r = 0 and goes to A. γ = 0.9.
Q*(B, right) = 10 (terminal, no future).
Q*(B, left) = 0 + 0.9 × max(Q*(A, left), Q*(A, right)).
Q*(A, right) = 3 + 0.9 × max(Q*(B, left), Q*(B, right)) = 3 + 0.9 × 10 = 12.
Q*(A, left) = 1 + 0.9 × max(Q*(A, left), Q*(A, right)) = 1 + 0.9 × 12 = 11.8.
Q*(B, left) = 0 + 0.9 × 12 = 10.8.
Optimal policy: π*(A) = right (12 > 11.8), π*(B) = left (10.8 > 10). Surprisingly, the optimal strategy from B is to go back to A rather than immediately collecting the +10! This is because looping A→B→A generates a stream of +3 rewards that, when discounted, exceeds the one-shot +10.
When we analyze RL algorithms, we often need to talk about which states a policy visits. The discounted state distribution (also called the discounted occupancy measure) under policy π is:
The factor (1 − γ) normalizes this to a proper probability distribution. This distribution is crucial for the Performance Difference Lemma in Chapter 5.
In a Partially Observable MDP (POMDP), the agent doesn't see the full state s — it only gets an observation o drawn from some distribution p(o | s). A self-driving car can see camera images but not the intentions of other drivers. The fix: the agent must maintain a belief — a probability distribution over possible states — or use recurrent networks that implicitly track history. Most deep RL treats problems as MDPs (assuming full observability) even when they're technically POMDPs, by feeding in a window of recent observations.
3-state chain: A → B → C. Policy: always go right. Rewards: r(A, right) = −1, r(B, right) = −1, r(C) = +10 (terminal). γ = 0.9.
Vπ(B) = −1 + 0.9 × 10 = 8.0
Vπ(A) = −1 + 0.9 × Vπ(B) = −1 + 0.9 × 8.0 = 6.2
Notice how value propagates backward from the reward. This is the essence of the Bellman equation: each state's value depends recursively on the next state's value.
The Bellman policy operator Tπ is a γ-contraction (proved in Chapter 8). But the Bellman optimality operator T* uses a max instead of an expectation over π:
(T*V)(s) = maxa [ r(s,a) + γ ∑s' p(s'|s,a) V(s') ]
Your task: Prove that ||T*V − T*V'||∞ ≤ γ ||V − V'||∞. The max makes this trickier than the policy case — you can't factor it out.
Full proof:
Step 1: |(T*V)(s) − (T*V')(s)| = |maxa[r(s,a) + γ∑s'p(s'|s,a)V(s')] − maxa[r(s,a) + γ∑s'p(s'|s,a)V'(s')]|
Step 2: Apply |max f − max g| ≤ max |f − g|:
≤ maxa |γ ∑s' p(s'|s,a) [V(s') − V'(s')]| (rewards cancel)
Step 3: Triangle inequality on the weighted sum:
≤ maxa γ ∑s' p(s'|s,a) |V(s') − V'(s')| ≤ γ ||V − V'||∞
Step 4: Since this holds for all s, take the sup: ||T*V − T*V'||∞ ≤ γ ||V − V'||∞. ■
The key insight: The max doesn't break contraction because |max f − max g| ≤ max |f − g|. This is why value iteration (which applies T* repeatedly) converges to V* regardless of initialization — and why γ < 1 is essential.
In supervised learning, the data distribution is fixed: your training set doesn't change as you update weights. In RL, the data distribution is the policy — update the policy and the data changes. This non-stationarity is why RL is fundamentally harder than supervised learning: you're optimizing an objective that shifts under your feet.
Where else do you see non-stationary optimization? (Hint: think GANs, self-play, curriculum learning.)
The environment dynamics p(s'|s,a) are unknown and non-differentiable (the real world doesn't have a gradient). Value iteration requires knowing p to compute the Bellman backup. Policy gradients bypass this entirely: the log-derivative trick converts the gradient into an expectation over trajectories, and the dynamics terms cancel. We never differentiate through the environment — we only need to sample from it. This is why policy gradients are model-free while value iteration is model-based. The tradeoff: value iteration gives exact solutions (when you know p) but policy gradients work with unknown dynamics at the cost of high variance.
Value-based methods like Q-learning learn Q* and derive π* as argmax. But what if we parameterize the policy directly as πθ(a | s) and optimize θ by gradient ascent on expected reward? That's the policy gradient approach — the foundation of PPO, SAC, and essentially all modern deep RL.
If Q-learning can find the optimal policy via argmax, why bother with policy gradients at all? Three reasons:
1. Continuous actions: Q-learning requires maxa Q(s, a). With discrete actions (4 moves), you evaluate all 4. With continuous actions (a ∈ R7), there are infinitely many — you can't enumerate them. Policy gradients handle continuous actions naturally.
2. Stochastic policies: Sometimes the optimal policy is stochastic (e.g., rock-paper-scissors: play uniformly at random). Q-learning always produces deterministic policies (argmax). Policy gradients can learn arbitrary distributions.
3. Better convergence: Policy gradients optimize the objective directly, while Q-learning tries to find a fixed point of the Bellman operator. In large-scale settings (millions of parameters), the direct optimization often converges faster and more stably.
For discrete actions, πθ is typically a neural network that outputs a softmax distribution over actions. For continuous actions, πθ outputs the mean and standard deviation of a Gaussian policy:
Define the objective as the expected total reward under trajectories sampled from πθ:
We want ∇θV(θ). But the sum is over all possible trajectories — an astronomically large set. And R(τ) doesn't depend on θ directly (rewards are given by the environment). How do we differentiate through sampling?
Step 1 — Apply the definition of gradient:
∇θV(θ) = ∇θ ∑τ Pθ(τ) R(τ) = ∑τ R(τ) ∇θ Pθ(τ)
Why: R(τ) doesn't depend on θ, so it comes out of the gradient.
Step 2 — Log-derivative trick: We use the identity ∇θ Pθ(τ) = Pθ(τ) ∇θ log Pθ(τ). This is just the chain rule: d/dx log f(x) = f'(x)/f(x), rearranged.
∇θV(θ) = ∑τ R(τ) Pθ(τ) ∇θ log Pθ(τ)
= 𝔼τ~Pθ [ R(τ) ∇θ log Pθ(τ) ]
Why this matters: We turned a sum over all trajectories into an expectation we can estimate by sampling trajectories!
Step 3 — Expand log Pθ(τ):
log Pθ(τ) = log ρ0(s0) + ∑t=0T−1 [ log πθ(at|st) + log p(st+1|st,at) ]
Step 4 — Differentiate: ∇θ log Pθ(τ) = ∑t=0T−1 ∇θ log πθ(at|st)
Why: ρ0(s0) doesn't depend on θ (the starting state is given by the environment). The transition dynamics p(st+1|st,at) don't depend on θ either (they're properties of the physics). Only the policy πθ depends on θ. So the transition terms vanish!
The transition dynamics p(s'|s,a) completely disappeared from the gradient! We don't need to know the physics of the world — we only need to be able to sample trajectories and evaluate ∇θ log πθ. This is why policy gradients are model-free.
There's an alternative form that's often more useful. Instead of summing over entire trajectories, we can express the gradient in terms of the Q-function:
Step 1: Start with V(θ) = 𝔼s0~ρ0[Vπ(s0)]. Write Vπ(s) = ∑a πθ(a|s) Qπ(s, a).
Step 2: Take the gradient:
∇θ Vπ(s) = ∑a [ ∇θπθ(a|s) Qπ(s,a) + πθ(a|s) ∇θQπ(s,a) ]
Product rule: both π and Q depend on θ.
Step 3: Expand ∇θQπ(s,a) = ∇θ[r(s,a) + γ ∑s' p(s'|s,a) Vπ(s')] = γ ∑s' p(s'|s,a) ∇θVπ(s').
The reward r and transition p don't depend on θ, so only Vπ(s') contributes.
Step 4: Substitute back: ∇θVπ(s) = ∑a ∇θπ(a|s) Qπ(s,a) + ∑a π(a|s) γ ∑s' p(s'|s,a) ∇θVπ(s').
Step 5: This is recursive! ∇V(s) depends on ∇V(s'). Unroll it one more step:
∇V(s') = ∑a' ∇π(a'|s') Q(s',a') + ∑a' π(a'|s') γ ∑s'' p(s''|s',a') ∇V(s'') ...
Step 6: Unrolling T steps and collecting terms: ∇V(s0) = ∑t=0∞ ∑s P(s0→s, t, π) γt ∑a ∇π(a|s) Qπ(s,a), where P(s0→s, t, π) is the probability of being in state s at time t starting from s0 under π.
Step 7: Take expectation over s0~ρ0 and recognize dπ(s) = ∑t γt P(st=s|π)/(1-γ)-1. Use the log-derivative trick: ∇π(a|s) = π(a|s) ∇ log π(a|s). Final result: ■
Two actions: left (θ1) and right (θ2). Policy: softmax π(left) = eθ1 / (eθ1 + eθ2). Rewards: r(left) = 1, r(right) = 3. Current θ = (0, 0), so π(left) = π(right) = 0.5.
Sample trajectory 1: action = right, reward = 3. Gradient: 3 × ∇θ log π(right) = 3 × (0.5, −0.5) × (−1, 1) = correction that increases θ2.
After many samples: θ2 grows, θ1 shrinks, and π(right) → 1. The gradient pushes probability mass toward higher-reward actions.
The policy gradient formula is beautiful in theory but horrifically noisy in practice. Imagine estimating a gradient by running one episode: the agent might get lucky (high reward) or unlucky (low reward) by chance, not because of the policy. The variance of the gradient estimate is so high that it takes millions of samples to get a useful signal. This chapter is about taming that variance.
Consider a simple environment where the optimal reward is 100. With vanilla REINFORCE, a single gradient estimate might range from −500 to +500 across different trajectory samples. The true gradient might be +2. You'd need to average ~62,500 samples to get a reliable estimate (variance/mean2). This is why raw REINFORCE is rarely used in practice — it's too expensive.
In the vanilla gradient, every action's log-probability gets multiplied by the total trajectory reward, including rewards that happened before that action was taken. But action a5 can't possibly affect rewards r0 through r4 — those already happened! Including past rewards adds pure noise.
In the full gradient, consider the term for time step t:
∇θ log πθ(at|st) · ∑t'=0T−1 r(st', at')
Split the sum into past (t' < t) and future (t' ≥ t):
= ∇θ log πθ(at|st) · [ ∑t'=0t−1 rt' + ∑t'=tT−1 rt' ]
Key insight: 𝔼[ ∇θ log πθ(at|st) · rt' ] = 0 for t' < t. Why? Because rt' is determined by st', at', which are independent of at given st. And 𝔼at~π[∇θ log πθ(at|st)] = ∇θ ∑a πθ(a|st) = ∇θ 1 = 0. So the past reward term vanishes in expectation.
This means we can replace the total reward with the reward-to-go without introducing any bias:
Even with reward-to-go, the gradient is noisy. The trick: subtract a baseline b(st) from the reward-to-go. If all rewards are positive (e.g., 10, 12, 11, 13), then every action gets reinforced, just some more than others. Subtracting the average (~11.5) means good actions get positive reinforcement and bad actions get negative reinforcement — much clearer signal.
We need to show that 𝔼[∇θ log πθ(a|s) · b(s)] = 0 for any function b(s).
Step 1: 𝔼a~πθ[∇θ log πθ(a|s) · b(s)]
= b(s) · 𝔼a~π[∇θ log πθ(a|s)]
Why: b(s) doesn't depend on a, so it factors out.
Step 2: = b(s) · ∑a πθ(a|s) · ∇θ log πθ(a|s)
Step 3: = b(s) · ∑a πθ(a|s) · ∇θπθ(a|s) / πθ(a|s)
= b(s) · ∑a ∇θπθ(a|s)
= b(s) · ∇θ ∑a πθ(a|s)
= b(s) · ∇θ 1 = 0 ■
The probabilities sum to 1 regardless of θ, so their gradient is zero. Any baseline that only depends on the state (not the action) can be subtracted for free.
The most common baseline is b(s) = Vπ(s) — the average value of the state. With this baseline, we're reinforcing actions proportional to Ĝt − Vπ(st), which is an estimate of the advantage: how much better was this action than average?
State s: standing at a fork in a maze. Vπ(s) = 5 (on average, this state leads to reward 5).
Action: go left. Ĝt = 8. Advantage = 8 − 5 = +3. "Going left was 3 points better than expected. Reinforce it."
Action: go right. Ĝt = 2. Advantage = 2 − 5 = −3. "Going right was 3 points worse than expected. Discourage it."
Action: go straight. Ĝt = 5. Advantage = 5 − 5 = 0. "Going straight was exactly average. Don't change anything."
Without the baseline, all three actions would receive positive reinforcement (rewards 8, 2, 5 are all positive). The baseline centers the signal so the gradient points in the right direction.
Among all baselines, which minimizes variance the most? The answer:
Two trajectories: τ1 has reward 100, τ2 has reward 102. Without baseline: both get positive reinforcement (100 and 102). Gradient: increase probability of both τ1 and τ2, just τ2 slightly more. Signal-to-noise ratio is terrible.
With baseline b = 101: τ1 gets −1 (decrease its probability slightly), τ2 gets +1 (increase it slightly). The signal is now purely about the relative quality. Much cleaner.
REINFORCE with baseline has a problem: the reward-to-go Ĝt is a Monte Carlo estimate — you have to wait until the end of the episode to compute it. This means high variance (one trajectory can be lucky or unlucky) and no learning until the episode finishes. What if we replaced Ĝt with a learned estimate?
The advantage Aπ(s, a) = Qπ(s, a) − Vπ(s) measures how much better action a is compared to the average action in state s. If A > 0, the action is better than average; if A < 0, it's worse.
The advantage is the ideal quantity to multiply the policy gradient by. Positive advantage → increase the action's probability. Negative advantage → decrease it. Zero advantage → leave it alone. It's a direct signal of "is this action good relative to alternatives?"
We don't know Qπ or Vπ exactly, but we can approximate the advantage using a single transition and a learned value function V̂φ:
Why does this work? Qπ(s, a) = r(s, a) + γ 𝔼[Vπ(s')]. So r + γV̂(s') is a one-sample estimate of Qπ(s, a). Subtract V̂(s) and you get an estimate of the advantage. This introduces bias (V̂φ might be wrong) but dramatically reduces variance (no need to wait for the entire trajectory).
The critic V̂φ is trained to predict the expected return from each state. Two options:
| Method | Target for V̂φ(st) | Bias | Variance |
|---|---|---|---|
| Monte Carlo | Ĝt = ∑t'=tT γt'−t rt' | None | High |
| Bootstrapping (TD) | rt + γ V̂φ(st+1) | Yes (from V̂) | Low |
In practice, bootstrapping wins because the variance reduction outweighs the bias, especially when the critic becomes accurate.
Actor-critic = REINFORCE with baseline + two changes: (1) replace Monte Carlo return Ĝt with bootstrapped TD target r + γV̂(s'), and (2) update after every step instead of waiting for the episode to end. The baseline is the critic. The advantage estimate is the TD error. Same math, different variance-bias tradeoff.
Different advantage estimators trade off bias and variance along a spectrum. Here's the full picture:
| Estimator | Formula | Bias | Variance |
|---|---|---|---|
| Full Monte Carlo | Ĝt − V̂(st) | None | Very high |
| N-step return | ∑k=0n−1 γkrt+k + γnV̂(st+n) − V̂(st) | Medium | Medium |
| 1-step TD | rt + γV̂(st+1) − V̂(st) | High (if V̂ wrong) | Low |
| GAE(λ) | ∑l=0∞ (γλ)lδt+l | Tunable via λ | Tunable via λ |
Suppose V̂(s') is way off: true V*(s') = 100, but V̂(s') = 10. The TD error δ = r + 0.9 × 10 − V̂(s) massively underestimates the advantage. The actor gets a misleading signal: "this action wasn't very good" when actually it led to a highly valuable state. With Monte Carlo, the full trajectory return would eventually reveal the truth — but at the cost of enormous variance across trajectories.
GAE with λ = 0.95 blends both: mostly uses the multi-step return (low bias) but with enough bootstrapping to reduce variance. This is why λ = 0.95 is the most common choice in practice.
The algorithm above updates after every single transition. In practice, we collect a batch of transitions (e.g., 2048 steps across multiple parallel environments) and update once on the whole batch. This is called batch actor-critic or A2C (Advantage Actor-Critic). The parallelism provides diversity: transitions from 16 environments at once are less correlated than 16 consecutive transitions from one environment.
State s = "near the goal," action a = "go right," reward r = +5, next state s' = "at the goal." V̂φ(s) = 3, V̂φ(s') = 10, γ = 0.9.
TD error: δ = 5 + 0.9 × 10 − 3 = 5 + 9 − 3 = +11.
Actor update: δ = +11 > 0, so we increase log π(right | near_goal). "Going right near the goal was much better than expected — do more of it."
Critic update: Target = r + γV̂(s') = 14. Current V̂(s) = 3. Error = 14 − 3 = 11. Push V̂(s) toward 14. "This state is more valuable than I thought."
Policy gradients tell you which direction to move in parameter space. But how far should you step? Too small and you waste samples. Too large and you destroy a good policy with one bad update. This chapter develops the mathematical machinery that leads to TRPO and PPO.
Claim: 𝔼τ~π[ ∑t=0∞ γt f(st, at) ] = 1/(1−γ) · 𝔼s~dπ, a~π[ f(s, a) ]
Proof:
Step 1: Let Ptπ(s) = P(st = s | π) be the state distribution at time t.
𝔼τ[ ∑t γt f(st,at) ] = ∑t=0∞ γt ∑s Ptπ(s) ∑a π(a|s) f(s,a)
Step 2: Recall dπ(s) = (1−γ) ∑t γt Ptπ(s). So ∑t γt Ptπ(s) = dπ(s) / (1−γ).
Step 3: Substitute: = ∑s dπ(s)/(1−γ) ∑a π(a|s) f(s,a) = 1/(1−γ) · 𝔼s~dπ, a~π[f(s,a)] ■
This is one of the most powerful results in RL theory. It expresses the performance gap between any two policies in terms of the advantage of one under the other:
Claim: V(π) − V(π') = 1/(1−γ) · 𝔼s~dπ, a~π[ Aπ'(s, a) ]
Proof:
Step 1: V(π) = 𝔼τ~π[∑t γt r(st,at)] = 1/(1−γ) 𝔼s~dπ, a~π[r(s,a)] by Lemma 10.
Step 2: Add and subtract Vπ'(s): we can write
r(s,a) = Aπ'(s,a) + Vπ'(s) − γ 𝔼s'~p[Vπ'(s')]
Why: By definition, Aπ'(s,a) = Qπ'(s,a) − Vπ'(s) = r(s,a) + γ𝔼[Vπ'(s')] − Vπ'(s). Rearranging gives r(s,a) = Aπ'(s,a) + Vπ'(s) − γ𝔼[Vπ'(s')].
Step 3: Substitute into the expression for V(π). The Vπ'(st) and γVπ'(st+1) terms telescope across time steps, leaving only Vπ'(s0) at the boundary.
Step 4: After telescoping: V(π) = V(π') + 1/(1−γ) 𝔼s~dπ, a~π[Aπ'(s,a)]
Rearranging gives the result. ■
The PDL involves dπ — the state distribution under the new policy π. But we only have data from the old policy π'. Define the local approximation:
Two policies π and π' are α-coupled if at each state, with probability 1 − α they choose the same action, and with probability α they choose independently. Intuitively, α measures how different the two policies are. If α = 0, they're identical; if α = 1, they're independent.
Using coupled policies, we can bound how far off the local approximation L is from the true performance V:
Claim: If π and π' are α-coupled, then |𝔼a~π[Aπ'(s,a)]| ≤ 2α maxs,a|Aπ'(s,a)|.
Proof: With probability 1−α, π picks the same action as π', so Aπ'(s, π'(s)) = 0 (by definition of Vπ'). With probability α, π picks independently, and the advantage is bounded by ε = max|Aπ'|. So |𝔼[A]| ≤ (1−α) · 0 + α · 2ε = 2αε. The factor of 2 accounts for the fact that the mean advantage under π' is zero, so the independent action could deviate by up to 2ε from 0. ■
This says: if we constrain the policy update to be small (small α, i.e., the new policy is close to the old one in total variation distance), then the local approximation L is a good lower bound on the true performance V. This is the theoretical foundation of Trust Region Policy Optimization (TRPO).
The vanilla policy gradient ∇θV(θ) depends on the parameterization of πθ. If you reparameterize (e.g., change from log-space to linear-space), the gradient direction changes. The natural policy gradient fixes this by pre-multiplying by the inverse Fisher information matrix:
TRPO approximately solves: maxθ Lθold(θ) subject to DKL(πθ || πθold) ≤ δ. This is equivalent to taking a natural gradient step with step size determined by the KL constraint. The Fisher matrix defines the "geometry" of policy space — large steps in high-curvature directions are penalized.
Suppose ε = max|Aπ| = 10 (largest advantage magnitude), γ = 0.99, and we constrain α = DTV(πnew, πold) ≤ 0.01 (very small policy change).
Penalty term: 4 × 10 × 0.99 / (1 − 0.99)2 × 0.012 = 4 × 10 × 0.99 / 0.0001 × 0.0001 = 3.96
Interpretation: The true performance is at least L(πnew) − 3.96. If the local approximation predicts improvement of +5, the guarantee is at least +1.04. Small trust region gives tight bounds.
If we let α = 0.1 (large change): penalty = 4 × 10 × 0.99 / 0.0001 × 0.01 = 3960. The bound is useless! Large policy changes destroy the guarantee. This is why trust regions must be small.
TRPO requires: (1) computing the Fisher-vector product F∇, (2) solving a linear system via conjugate gradient (10-20 iterations), (3) a line search to find the largest step satisfying the KL constraint, and (4) handling the case where the constraint is violated. That's a lot of machinery compared to "just clip the ratio" (PPO). TRPO achieves better worst-case guarantees, but PPO achieves similar empirical performance with 1/10 the code.
Policy at iteration k: π(left | s) = 0.8, π(right | s) = 0.2. The gradient says "increase π(right)." Without trust region: learning rate 0.5 gives π(right) = 0.7. Huge change! But the gradient was computed assuming the old policy — with the new policy we'd visit different states, so the gradient estimate is stale.
With trust region (KL ≤ 0.01): π(right) can increase to at most ~0.25. Small step, but the gradient stays valid. Next iteration computes a fresh gradient. Slow but stable convergence.
The Performance Difference Lemma states: V(π) − V(π') = 1/(1−γ) · 𝔼s~dπ, a~π[Aπ'(s,a)].
The Simulation Lemma bounds the performance gap using the maximum advantage: |V(π) − V(π')| ≤ 2εγ / (1−γ)2 · maxs DTV(π(s) || π'(s)), where ε = maxs,a|Aπ'(s,a)|.
Your task: Starting from the PDL, derive this bound. You'll need: (1) bound the advantage expectation using the TV distance, and (2) bound the state distribution shift.
Step 1: From PDL: V(π) − V(π') = 1/(1−γ) 𝔼s~dπ, a~π[Aπ'(s,a)]
Step 2: Bound advantage at each state: |𝔼a~π[Aπ'(s,a)]| ≤ 2αε where α = maxs DTV(π(s)||π'(s)) and ε = max|Aπ'|.
Step 3: Replace dπ with dπ' plus error: |𝔼s~dπ[f(s)] − 𝔼s~dπ'[f(s)]| ≤ ||dπ − dπ'||1 · max|f| ≤ 2γα/(1−γ) · 2αε
Step 4: Combining: |V(π) − V(π')| ≤ 1/(1−γ) [2αε + 2γα/(1−γ) · 2αε] = 2αε/(1−γ) [1 + 2γα/(1−γ)]
Step 5: For the standard form (bounding loosely): ≤ 4εγα2/(1−γ)2. This is the TRPO bound from Theorem 13. ■
The key insight: The bound is quadratic in α (the policy divergence) — this is why trust regions work. If you halve the allowed policy change, the error bound drops by 4x. The local approximation becomes increasingly accurate as the trust region shrinks.
TRPO works brilliantly but is painful to implement. It requires computing the Fisher matrix, solving a constrained optimization problem via conjugate gradient, and doing a line search. PPO achieves nearly the same effect with a simple clipped objective that you can optimize with standard gradient descent.
Define the probability ratio between the new and old policy:
The surrogate objective from TRPO can be written as:
Without a constraint, maximizing LCPI can lead to rt(θ) exploding — the new policy becomes very different from the old one. PPO's solution: clip the ratio to stay in [1−ε, 1+ε].
Case 1: Ât > 0 (the action was good). We want to increase rt (make this action more likely). But the clip prevents rt from exceeding 1 + ε. So the objective becomes min(rt Â, (1+ε) Â). Once rt ≥ 1 + ε, the gradient is zero — no incentive to push further.
Case 2: Ât < 0 (the action was bad). We want to decrease rt. But the clip prevents rt from dropping below 1 − ε. So the objective becomes min(rt Â, (1−ε) Â). Once rt ≤ 1 − ε, the gradient is zero.
In both cases, the clip creates a "trust region" in probability ratio space. The policy can change, but not too much.
How do we compute Ât? We could use the one-step TD error δt = rt + γV(st+1) − V(st) (low variance, high bias) or the full Monte Carlo return Gt − V(st) (no bias, high variance). GAE interpolates:
Let's expand ÂGAE for a few terms to see what it does:
Ât = δt + (γλ)δt+1 + (γλ)2δt+2 + ...
Substituting δt = rt + γV̂(st+1) − V̂(st):
When λ = 0: Ât = δt = rt + γV̂(st+1) − V̂(st). Just the one-step TD error.
When λ = 1: Ât = ∑l γl δt+l. After telescoping the V̂ terms: Ât = ∑l=0∞ γl rt+l − V̂(st) = Ĝt − V̂(st). The full Monte Carlo advantage!
So λ smoothly interpolates between 1-step TD and full Monte Carlo. The magic of GAE is that it does this in an exponentially-weighted way — nearby TD errors get more weight than distant ones.
4-step trajectory: rewards [1, 2, 3, 0], V̂ values [5, 4, 6, 2, 0] (including the terminal state). γ = 0.99, λ = 0.95.
TD errors:
δ0 = 1 + 0.99 × 4 − 5 = 1 + 3.96 − 5 = −0.04
δ1 = 2 + 0.99 × 6 − 4 = 2 + 5.94 − 4 = 3.94
δ2 = 3 + 0.99 × 2 − 6 = 3 + 1.98 − 6 = −1.02
δ3 = 0 + 0.99 × 0 − 2 = −2.0
GAE (computed backwards):
Â3 = δ3 = −2.0
Â2 = δ2 + 0.99 × 0.95 × Â3 = −1.02 + 0.9405 × (−2.0) = −2.90
Â1 = δ1 + 0.9405 × Â2 = 3.94 + 0.9405 × (−2.90) = 1.21
Â0 = δ0 + 0.9405 × Â1 = −0.04 + 0.9405 × 1.21 = 1.10
Note the backward computation — this is how GAE is implemented in practice (loop from t = T−1 down to 0).
PPO is the default algorithm for modern RL because: (1) it's simple — 50 lines of core code, (2) it's stable — the clip prevents catastrophic updates, (3) it's sample-efficient — you reuse each batch for K epochs, not just one gradient step, (4) it works for both continuous and discrete actions. OpenAI used PPO to train the policy for ChatGPT's RLHF stage.
Setting: Training a locomotion policy for a simulated humanoid robot.
Batch size: 2048 steps across 8 parallel environments (256 steps each). Gives 2048 transitions per policy update.
ε = 0.2: The policy can change each action probability by at most 20% per update. Experiment: ε = 0.1 (too conservative, slow learning). ε = 0.5 (too aggressive, policy oscillates wildly).
K = 4 epochs: Each batch of 2048 transitions is used 4 times. More epochs = more sample efficiency, but the data becomes stale (the gradient is for the old policy, not the current one). K = 10 with ε = 0.2 usually works fine because the clip prevents too-large updates even with stale data.
λ = 0.95: GAE parameter. λ = 0.99 gave slightly better final performance but took 2x longer. λ = 0.8 converged fast but to a worse policy.
State s, action a. πθold(a|s) = 0.4. After some gradient steps, πθ(a|s) = 0.72. Ratio r = 0.72 / 0.4 = 1.8. With ε = 0.2, clipped ratio = clip(1.8, 0.8, 1.2) = 1.2.
If  = +3: unclipped objective = 1.8 × 3 = 5.4. Clipped = 1.2 × 3 = 3.6. PPO uses min(5.4, 3.6) = 3.6. The gradient is zero (we've hit the clip), so θ stops being pushed to increase π(a|s) further.
The policy wanted to make this action much more likely (1.8x), but PPO said "slow down, 1.2x is enough for now." Next batch of data will be collected under a policy closer to the current one, and we can push further if the advantage is still positive.
PPO is on-policy: it collects data, updates the policy, then throws the data away. This is wasteful. Soft Actor-Critic (SAC) is off-policy: it stores all experience in a replay buffer and reuses it many times. The key insight is adding an entropy bonus to the reward, encouraging exploration and enabling off-policy stability.
Standard RL maximizes expected reward. Maximum entropy RL adds a bonus for acting randomly:
Why add entropy? Three reasons: (1) exploration — the agent tries diverse actions instead of collapsing to a single deterministic choice, (2) robustness — stochastic policies are more robust to perturbations, and (3) off-policy compatibility — the entropy regularization smooths the optimization landscape, making off-policy learning more stable.
Robot arm with 7 continuous action dimensions. Without entropy: the policy quickly collapses to always applying the same torques. If those happen to reach one target, great — but the robot never discovers alternative strategies.
With entropy bonus (α = 0.2): the policy maintains a standard deviation of ~0.3 in each dimension. The robot tries slightly different torques each time, occasionally discovering better solutions. As training progresses and α decreases (auto-tuned), the policy becomes more precise while retaining enough randomness to handle perturbations.
Entropy value: For a Gaussian with σ = 0.3 in 7 dims: H = 7 × (0.5 log(2πe × 0.09)) ≈ 7 × (−0.27) = −1.89 nats. The reward r + αH = r + 0.2 × (−1.89) = r − 0.38. Small entropy penalty compared to task reward, but enough to prevent premature convergence.
In standard RL, Vπ(s) = 𝔼a~π[Qπ(s,a)]. In maximum entropy RL, the value includes the entropy bonus:
Vπ(s) = 𝔼a~π[Qπ(s,a) − α log π(a|s)]
The soft Q-function becomes: Qπ(s,a) = r(s,a) + γ 𝔼s'[Vπ(s')]
The optimal soft policy has a closed form: π*(a|s) ∝ exp(Q*(s,a)/α). This is a Boltzmann (softmax) distribution over actions, weighted by Q-values. High α = uniform (maximum entropy). Low α = nearly deterministic (exploit the best action).
Substituting back: V*(s) = α log ∑a exp(Q*(s,a)/α) — the soft-max (log-sum-exp) of Q-values, which is a smooth approximation to the hard max.
SAC maintains three networks:
| Network | Parameters | What It Does |
|---|---|---|
| Actor πθ | θ | Outputs action distribution (Gaussian mean + std) |
| Critic Qφ | φ1, φ2 | Two Q-networks (clipped double-Q trick) |
| Target Q̄ | φ̄ | Exponential moving average of φ for stability |
Instead of manually setting α, SAC learns it by solving:
When α → 0, SAC reduces to the Deterministic Policy Gradient (DPG). In DPG, the actor outputs a single action μθ(s) (no stochasticity) and the gradient is:
SAC uses two Q-networks Qφ1 and Qφ2 and takes the minimum of their predictions for the target. Why? To combat overestimation (the same problem as in DQN). If one Q-network overestimates, the other likely doesn't, so the min provides a more conservative estimate. This trick was introduced in TD3 (Twin Delayed DDPG) and is now standard in off-policy methods.
Target entropy: H̄ = −dim(A) = −7 (for a 7-DOF robot arm). Current policy entropy: H(π) = −5. Since −5 > −7 (more entropy than target), the loss α(−(−5) − (−7)) = α(5 − 7) = −2α is negative. The gradient decreases α, reducing the entropy bonus. The policy becomes more deterministic.
If H(π) = −10 (less entropy than target): loss = α(10 − 7) = 3α > 0. The gradient increases α, adding more entropy bonus to encourage exploration.
Robot arm reaching task. 7 continuous action dimensions (joint torques). PPO: collect 2048 steps, update policy, discard data. After 1M steps, it has done ~500 policy updates using 1M transitions total. SAC: collect 1 step, store in buffer (1M capacity), sample batch of 256, update all networks. After 1M steps, it has done 1M gradient updates, each using data from the full buffer. SAC uses data ~1000x more efficiently.
The trade-off: PPO is more stable (on-policy data is always "fresh"), SAC is more sample-efficient (reuses everything).
Before neural networks entered the picture, RL was solved with tables. These methods have convergence guarantees that deep RL can only dream of. Understanding them reveals why algorithms work (or fail) when we scale up.
In tabular RL, V(s) and Q(s, a) are stored as literal tables (dictionaries/arrays) with one entry per state or state-action pair. For a grid world with 100 states and 4 actions, Q is a 100 × 4 table = 400 numbers. This is feasible when |S| and |A| are small (thousands), but impossible for images (|S| ~ 1030) or continuous states (|S| = ∞). The transition to function approximation (Chapter 9) is where tabular guarantees break down.
Given a fixed policy π, compute Vπ(s) for all states by iteratively applying the Bellman equation:
Policy iteration has an expensive inner loop (full policy evaluation). Value iteration combines evaluation and improvement into a single step:
Claim: ||TπV − TπV'||∞ ≤ γ ||V − V'||∞
where (TπV)(s) = ∑a π(a|s) [r(s,a) + γ ∑s' p(s'|s,a) V(s')].
Proof:
Step 1: |(TπV)(s) − (TπV')(s)| = |∑a π(a|s) γ ∑s' p(s'|s,a) [V(s') − V'(s')]|
Why: the r(s,a) terms cancel.
Step 2: ≤ ∑a π(a|s) γ ∑s' p(s'|s,a) |V(s') − V'(s')| (triangle inequality)
Step 3: ≤ ∑a π(a|s) γ ∑s' p(s'|s,a) ||V − V'||∞
Why: |V(s') − V'(s')| ≤ maxs |V(s) − V'(s)| = ||V − V'||∞ for any s'.
Step 4: = γ ||V − V'||∞ · ∑a π(a|s) ∑s' p(s'|s,a) = γ ||V − V'||∞ · 1
Why: probabilities sum to 1.
Since this holds for every s, it holds for the max over s. ■
A contraction mapping has a unique fixed point, and repeated application converges to it (Banach fixed-point theorem). Since γ < 1, the Bellman operator Tπ is a contraction with factor γ. After k iterations, the error is at most γk ||V0 − V*||∞, which → 0 exponentially fast. This is why value iteration is guaranteed to converge — and why γ < 1 is essential.
True V* = [10, 8]. Two initial guesses: Va = [0, 0] (pessimistic) and Vb = [20, 20] (optimistic). γ = 0.5.
Distance: ||Va − Vb|| = max(|0−20|, |0−20|) = 20.
After 1 Bellman backup: ||TVa − TVb|| ≤ 0.5 × 20 = 10. The two estimates are at most 10 apart.
After 5 backups: ≤ 0.55 × 20 = 0.625. Nearly identical!
After 10 backups: ≤ 0.510 × 20 = 0.0195. Converged for all practical purposes.
No matter how wrong your initialization is, the contraction property forces convergence. With γ = 0.99, convergence is much slower (0.99100 = 0.366, need ~460 iterations to reduce error by 100x).
| Property | Policy Iteration | Value Iteration |
|---|---|---|
| Inner loop | Full policy evaluation (solve linear system or iterate to convergence) | None (single max operation per state) |
| Outer iterations | Few (often 3-10 for small MDPs) | Many (proportional to 1/(1-γ)) |
| Total computation | Often faster for small MDPs | Often faster for large MDPs |
| Memory | Stores both V and π | Stores only V |
| Convergence | Finite steps to exact π* | Converges to V* in limit |
Two states: A and B. From A: action "go" gives r = 2, transitions to B. From B: action "stay" gives r = 1, stays in B. γ = 0.9.
Iteration 0: V(A) = 0, V(B) = 0.
Iteration 1: V(A) = 2 + 0.9 × 0 = 2. V(B) = 1 + 0.9 × 0 = 1.
Iteration 2: V(A) = 2 + 0.9 × 1 = 2.9. V(B) = 1 + 0.9 × 1 = 1.9.
Iteration 3: V(A) = 2 + 0.9 × 1.9 = 3.71. V(B) = 1 + 0.9 × 1.9 = 2.71.
Converges to: V*(B) = 1/(1−0.9) = 10, V*(A) = 2 + 0.9 × 10 = 11.
Tabular methods work when you can enumerate every state. But Atari has ~1030 possible screen images. You can't store a table that large. Value function approximation (VFA) replaces the table with a neural network V̂φ(s) or Q̂φ(s, a) that generalizes across similar states.
Monte Carlo VFA: collect a trajectory, compute the return Gt, and regress V̂φ(st) toward Gt. Unbiased but high variance — one lucky trajectory can send the network in the wrong direction.
TD learning for VFA: use the bootstrapped target r + γV̂φ(s') instead of Gt. Lower variance (single-step estimate vs. entire trajectory) but biased (the target depends on our own potentially wrong estimates).
Mountain car environment. State = (position, velocity). We want V̂φ(s). Episode takes 150 steps with reward −1 per step (total = −150).
Monte Carlo: For every state visited in the episode, the target is the total return from that point. State at t = 50: target = −100 (100 remaining steps). State at t = 149: target = −1. All targets are noisy because different episodes take different numbers of steps (120 or 200 depending on initial conditions).
TD: For state at t = 50: target = −1 + 0.99 × V̂(s51). Even if V̂ is slightly wrong, the error from one step is small. But if V̂ is completely wrong early in training, TD propagates garbage: "my estimate of s51 is wrong, so my estimate of s50 is wrong, so my estimate of s49 is wrong..."
In practice, TD wins because the bias decreases as V̂ improves, while MC variance stays high forever.
The whole point of using a neural network instead of a table is generalization: if the agent learns that being near the goal is good, it should know this for all nearby states, even ones it hasn't visited. In a table, each entry is independent — learning Q(s7, up) = 10 tells you nothing about Q(s8, up). With a neural network, similar states get similar Q-values automatically.
Generalization is both the strength and weakness of deep RL. Strength: you can handle continuous state spaces with billions of possible states (like pixel observations). Weakness: an update to Q(s7) also changes Q(s8), Q(s9), etc. These unintended side effects can destabilize learning. This is why the deadly triad exists — it's specifically about how errors propagate through shared representations.
Combining three ingredients can cause divergence:
(1) Function approximation (neural nets instead of tables) — introduces generalization errors that propagate.
(2) Bootstrapping (using V̂(s') in the target) — creates a moving target that depends on the parameters we're updating.
(3) Off-policy learning (data from a different policy) — the data distribution doesn't match the policy we're evaluating.
Any two are fine. All three together can spiral out of control. DQN tames this with two tricks: replay buffers (partially fixes off-policy) and target networks (partially fixes bootstrapping).
Without replay, consecutive training samples are highly correlated (frames 1001, 1002, 1003 look almost identical). The network overfits to the current "neighborhood" of experience and forgets the rest. A replay buffer stores the last N transitions and samples uniformly at random, breaking temporal correlations and providing i.i.d.-like training data.
Atari Pong. The agent is in the middle of a rally for frames 5000-5100. Without replay, the minibatch is [frame 5000, 5001, 5002, ..., 5031] — nearly identical images of the same rally. The network learns "this specific paddle position is good" and overwrites everything it knew about serving, recovering from corners, or other game situations. This is catastrophic forgetting.
With replay (buffer size 1M), the minibatch might contain: frame 5000 (mid-rally), frame 2300 (serving), frame 87000 (scoring), frame 45000 (edge recovery). Diverse experiences prevent forgetting and ensure the network learns a globally useful Q-function.
Instead of sampling uniformly from the buffer, prioritize transitions with high TD error |Q(s,a) − y|. These are the transitions where the Q-network is most wrong — and thus where it has the most to learn. This was introduced by Schaul et al. (2016) and is used in Rainbow DQN.
The target y = r + γ maxa' Q̂θ(s', a') depends on θ. Every gradient step changes both the prediction and the target — like a dog chasing its tail. The fix: maintain a target network Q̂θ' with frozen parameters, updated periodically by copying θ → θ' every N steps.
Consider a simple MDP with 3 states. A neural net approximates Q with shared features. True Q*(A) = 5, Q*(B) = 5, Q*(C) = 5.
Step 1: From data, we update Q(A) toward target 5.2 (slightly noisy). Due to weight sharing, Q(B) moves to 5.1 and Q(C) moves to 5.15.
Step 2: Now max Q is used as target for another state. Target = r + γ × 5.2 (using inflated max). We train toward this inflated target.
Step 3: The update pushes Q even higher. Next iteration, max Q is 5.4. Then 5.8. Then 6.5. Then 8. Then 15. Then 100.
Without the table to anchor each entry independently, errors propagate through shared weights. Target networks and replay buffers slow this spiral but don't eliminate it completely.
Standard DQN overestimates Q-values because maxa' Q̂(s', a') preferentially selects actions with positive noise. Double DQN decouples action selection from evaluation:
State s' has 3 actions. True Q*: [5, 5, 5]. Noisy estimates from θ: [7, 4, 6]. Noisy estimates from θ': [4, 7, 5].
Standard DQN: max from θ': max(4, 7, 5) = 7. Overestimate: 7 vs true 5.
Double DQN: θ selects action 0 (highest: 7). θ' evaluates action 0: Qθ'(s', 0) = 4. Estimate: 4. Underestimate, but no systematic upward bias.
Over many samples, Double DQN's estimates converge to the true values rather than drifting upward.
The fundamental problem: Q-learning uses maxa' Q(s',a') in the target. In online RL, if Q overestimates some action, the agent eventually tries it and gets corrected by the real reward. Offline, the agent never acts — so overestimated actions are never corrected. Worse, the max operator preferentially selects the most overestimated action, and bootstrapping propagates this overestimate backward through the Bellman equation. The result: Q-values for unobserved actions hallucinate to arbitrarily high values, and the extracted policy confidently selects actions that were never in the data (and are likely catastrophic). This is why IQL avoids querying OOD actions entirely, and CQL explicitly pushes down Q-values for unseen actions.
In offline RL, you have a fixed dataset of transitions collected by some behavior policy πβ — and you can't collect any more data. Think: learning to drive from dashcam footage without ever touching a steering wheel. The challenge is terrifying: the Q-function can hallucinate high values for actions the dataset never saw.
Standard Q-learning updates Q(s, a) toward r + γ maxa' Q(s', a'). The max might select an action a' that never appears in the dataset. Since Q(s', a') was never corrected by real data, it could be arbitrarily wrong — and the max ensures it's wrong in the upward direction. These phantom high Q-values propagate backward through the Bellman equation, corrupting the entire Q-function.
With online learning, the agent eventually tries the overestimated action and gets a correcting signal. Offline, it never tries anything — so the overestimation is never corrected. It's like a student who answers untested material with wild guesses that are always optimistic. Without exams (interaction), the guesses never get corrected.
Dataset: 1000 transitions from a robot arm reaching targets on the left half of a table. The arm never reached right.
Q-learning iteration 1: Q(s, reach_left) trained on real data → Q = 8 (good). Q(s, reach_right) randomly initialized → Q = 0.
Iteration 100: Q(s, reach_left) = 8 (stable, supported by data). Q(s, reach_right) = 12 (inflated! No correcting data). Why? Some state s' in the buffer has maxa' Q(s', a') = 12 via an unseen action. This inflated target propagated to Q(s, reach_right) through bootstrapping.
Iteration 500: Q(s, reach_right) = 47. Completely fictional. The agent would confidently reach right and break.
IQL's insight: never query Q for actions outside the dataset. Instead of maxa' Q(s', a'), approximate the maximum using expectile regression — a statistical technique that estimates quantiles without explicit maximization.
The τ-expectile loss is Lτ2(u) = |τ − 1{u < 0}| · u2. When τ = 0.5, this is ordinary least squares. When τ → 1, the loss penalizes under-predictions much more than over-predictions, so the minimizer approaches the maximum of the distribution. Think of it as a smooth, differentiable approximation to max.
IQL uses three networks:
CQL takes a different approach: explicitly push down Q-values for actions not in the dataset, making the Q-function a lower bound on the true values. The key idea: if you can't tell whether an action is good or bad (because you've never tried it), assume it's bad. This is conservative but safe.
In safety-critical applications (healthcare, autonomous driving, robotics), overestimating an untested action's value can be catastrophic. CQL's conservatism is a feature, not a bug: the extracted policy will never choose an action whose value is only high because of hallucination. The policy might be suboptimal (it avoids potentially good actions it hasn't seen data for), but it won't be catastrophically bad.
The optimal choice of μ that maximizes the regularizer is μ(a|s) ∝ exp(Qθ(s, a)) — the softmax of Q-values. This creates a minimax game: the regularizer tries to find the most overestimated actions and push them down.
Dataset: transitions from a robot arm that only reaches the left side of a table. State: s = "object on the right." Standard Q-learning: Q(s, reach_right) might be very high because it was never corrected by failure data. The extracted policy reaches right and crashes.
CQL: The regularizer pushes down Q(s, reach_right) because reach_right never appears in the data for this state. Q(s, reach_left) stays high (it's in the data). The extracted policy safely reaches left — suboptimal but doesn't crash.
IQL avoids querying out-of-distribution actions entirely (implicit approach). CQL explicitly penalizes them (conservative approach). IQL is simpler and often faster to train. CQL provides stronger theoretical guarantees (provable lower bound on Q). Both outperform naive offline Q-learning by a large margin.
Suppose Q(s, a) for the 4 actions in our dataset are: [2, 5, 3, 8]. With τ = 0.5 (standard mean), V(s) = (2+5+3+8)/4 = 4.5. With τ = 0.9 (high expectile), V(s) ≈ 7.2 — close to the maximum of 8, but not exactly 8. With τ = 0.99, V(s) ≈ 7.9 — even closer to max.
How the loss works: For τ = 0.9, errors where Q > V get weight τ = 0.9, while errors where Q < V get weight (1−τ) = 0.1. So the loss strongly penalizes underestimation of high Q-values, pushing V upward toward the max. This is how IQL approximates maxa Q(s,a) without ever evaluating Q on unseen actions.
| Factor | IQL | CQL |
|---|---|---|
| Requires OOD query? | No (expectile-based) | Yes (softmax sampling) |
| Hyperparameters | τ (expectile), β (AWR temp) | α (CQL weight) |
| Theoretical guarantee | Approximation to optimal | Provable lower bound on Q |
| Typical use | Robot manipulation | Game playing, navigation |
| Computation | Fast (no sampling loop) | Slower (logsumexp over actions) |
How do you align a language model to human preferences when you can't write down a reward function? Reinforcement Learning from Human Feedback (RLHF) trains a reward model from human comparisons, then optimizes the LM against it. Direct Preference Optimization (DPO) skips the reward model entirely.
Stage 1 — Supervised Fine-Tuning (SFT): Fine-tune the base LM on high-quality demonstrations. This gives you πref — a model that can write decent responses.
Stage 2 — Reward Modeling: Show humans pairs of responses (y1, y2) to the same prompt x. They choose which is better. Train a reward model rφ(x, y) to predict these preferences.
Stage 3 — RL Fine-Tuning: Optimize the LM policy πθ to maximize the learned reward, with a KL penalty to stay close to πref.
Given a prompt x and two responses y1, y2, the probability that a human prefers y1 over y2 is modeled as:
Language generation is non-differentiable — you can't backpropagate through token sampling (argmax or categorical sampling have zero or undefined gradients). So you can't just do gradient descent on the reward. You need RL (specifically PPO) to optimize through the non-differentiable sampling step. This is computationally expensive and unstable.
Prompt: "Write a haiku about rain." Two responses:
y1 (winner): "Soft drops on the leaves / Whispering to the still pond / Nature breathes again"
y2 (loser): "Rain is water that falls from clouds in the sky and makes things wet"
Current reward model: rφ(x, y1) = 2.3, rφ(x, y2) = 1.8. Difference: 0.5.
Loss: −log σ(2.3 − 1.8) = −log σ(0.5) = −log(0.622) = 0.475.
Gradient: Pushes rφ(x, y1) up and rφ(x, y2) down, increasing the gap. Over thousands of preference pairs, the reward model learns that well-structured creative responses are preferred over flat factual statements.
Without the KL term, the LM quickly discovers reward hacking — pathological outputs that score high on the reward model but are gibberish to humans. For example, a model trained to maximize a helpfulness reward might repeat "I'm so happy to help you! That's a great question!" endlessly. The KL penalty DKL(πθ || πref) prevents this by penalizing the model for deviating too far from the SFT baseline.
DPO's key insight: the optimal policy under the KL-constrained objective has a closed-form solution, and we can reparameterize the reward in terms of the policy itself.
Step 1 — Solve for the optimal policy. The RLHF objective with KL constraint is:
maxπ 𝔼y~π[r(x,y)] − β DKL(π || πref)
= maxπ 𝔼y~π[r(x,y) − β log(π(y|x)/πref(y|x))]
= maxπ 𝔼y~π[r(x,y) − β log π(y|x) + β log πref(y|x)]
Taking the functional derivative with respect to π(y|x) and setting to zero:
r(x,y) − β log π(y|x) − β + β log πref(y|x) = 0
Solving for π: π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x)
where Z(x) = ∑y πref(y|x) exp(r(x,y)/β) is the partition function. This is the optimal policy given reward r.
Step 2 — Reparameterize the reward. From the optimal policy expression, solve for r:
r(x, y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)
Step 3 — Substitute into Bradley-Terry. The preference probability becomes:
p*(yw ≻ yl) = σ(r(x,yw) − r(x,yl))
= σ(β log(π*(yw|x)/πref(yw|x)) − β log(π*(yl|x)/πref(yl|x)))
The Z(x) terms cancel! They appear in both r(x,yw) and r(x,yl) and subtract out.
Step 4 — Write the DPO loss. Replace π* with πθ (the policy we're training):
DPO reduces RLHF to supervised learning. No reward model to train. No PPO loop to run. No generation during training. Just: compute log-probabilities of the winning and losing responses under πθ and πref, plug into the loss, and do gradient descent. It's 10x simpler to implement and much more stable than PPO-based RLHF.
What does the DPO gradient actually do? Taking the gradient of LDPO with respect to θ:
∇θLDPO = −β 𝔼[ σ(−u) (∇ log πθ(yw|x) − ∇ log πθ(yl|x)) ]
where u = β(log πθ(yw)/πref(yw) − log πθ(yl)/πref(yl)).
The weighting σ(−u): When u is large and positive (the model already strongly prefers yw), σ(−u) is near 0 — almost no gradient. When u is negative (the model incorrectly prefers yl), σ(−u) is near 1 — strong gradient signal. The gradient focuses on examples the model gets wrong, ignoring ones it already handles correctly.
The direction: The gradient increases log π(yw) and decreases log π(yl). It simultaneously makes the winner more likely and the loser less likely.
| Property | RLHF (PPO) | DPO |
|---|---|---|
| Training stages | 3 (SFT → RM → RL) | 2 (SFT → DPO) |
| Models needed | 4 (SFT, RM, policy, ref) | 2 (policy, ref) |
| Generates during training | Yes (PPO rollouts) | No |
| GPU memory | Very high (4 models) | Moderate (2 models) |
| Stability | Sensitive to hyperparameters | Robust |
| Reward hacking risk | Higher (imperfect RM) | Lower (no RM to exploit) |
| Iterative improvement | Natural (generate → relabel) | Requires new preference data |
| State of the art | Llama 2, InstructGPT | Zephyr, Llama 3 (partially) |
Distribution shift: DPO assumes the preference data was generated by the optimal policy under πref. If the preference pairs are from a very different policy, the DPO objective may not converge to the right solution.
β too small: The policy can diverge far from πref, potentially generating incoherent text. The log-ratios blow up.
β too large: The policy barely moves from πref. The preference signal is too weak to have an effect. The model doesn't improve.
Preference noise: If annotators frequently disagree (noisy preferences), DPO's loss becomes noisy too. RLHF's reward model can average over multiple annotations per pair; DPO uses each pair directly.
Prompt x: "Explain gravity." Winner yw: clear explanation. Loser yl: vague rambling. β = 0.1.
Current log-ratios: log(πθ(yw|x)/πref(yw|x)) = 0.3. log(πθ(yl|x)/πref(yl|x)) = 0.5.
DPO argument: β(0.3 − 0.5) = 0.1 × (−0.2) = −0.02.
σ(−0.02) ≈ 0.495. Loss: −log(0.495) ≈ 0.703.
The gradient will increase πθ(yw|x) relative to πref and decrease πθ(yl|x) relative to πref. Intuitively: make the winner more likely and the loser less likely, compared to the reference model.
| Algorithm | Type | On/Off-Policy | Actions | Key Idea |
|---|---|---|---|---|
| REINFORCE | Policy gradient | On-policy | Any | Monte Carlo policy gradient |
| A2C | Actor-critic | On-policy | Any | Learned baseline + bootstrapping |
| PPO | Actor-critic | On-policy | Any | Clipped surrogate objective |
| SAC | Actor-critic | Off-policy | Continuous | Max entropy + replay buffer |
| DQN | Value-based | Off-policy | Discrete | Target network + replay |
| IQL | Value-based | Offline | Any | Expectile regression avoids OOD |
| CQL | Value-based | Offline | Any | Conservative Q lower bound |
| DPO | Preference | Offline | LM tokens | Reparameterize reward into policy |
| Equation | One-Line Explanation |
|---|---|
| Vπ(s) = 𝔼a~π[r + γVπ(s')] | Bellman equation: value = reward + discounted future value |
| ∇θV = 𝔼[∇ log π · Qπ] | Policy gradient theorem: weight each action by its Q-value |
| Aπ(s,a) = Qπ(s,a) − Vπ(s) | Advantage: how much better is action a than average? |
| δt = r + γV(s') − V(s) | TD error: one-step advantage estimate |
| ÂGAE = ∑ (γλ)l δt+l | GAE: exponentially-weighted sum of TD errors |
| LCLIP = 𝔼[min(rÂ, clip(r)Â)] | PPO: clip the ratio to prevent catastrophic updates |
| Q* = r + γ maxa' Q* | Bellman optimality: optimal Q satisfies this fixed point |
| V(π)−V(π') = 1/(1−γ) 𝔼[Aπ'] | PDL: performance gap = expected advantage |
| LDPO = −log σ(β(log rw − log rl)) | DPO: preference learning as classification |
On-policy (REINFORCE, PPO): The data used for each update was collected by the current policy. After each update, old data is discarded and new data is collected. Pro: the gradient is always accurate. Con: data is used once and thrown away — expensive.
Off-policy (DQN, SAC): Data from any policy can be reused. A replay buffer stores millions of transitions from past policies. Pro: much more sample-efficient (each transition is used dozens of times). Con: the data distribution doesn't match the current policy, introducing bias that must be corrected (importance sampling, target networks, etc.).
Training a robot arm. Budget: 1 million environment steps.
PPO (on-policy): Collects batches of 2048 steps, updates, discards. 1M / 2048 = ~488 policy updates. Each transition used 4 times (K=4 epochs). Total gradient steps: ~2000.
SAC (off-policy): After initial exploration, every environment step triggers one gradient update using a batch of 256 from the replay buffer. Total gradient steps: ~1M. Each transition is reused ~256 times over its lifetime in the buffer.
Result: SAC does ~500x more gradient updates with the same data. For tasks where environment interaction is expensive (real robots: $100/hour), off-policy wins decisively. For tasks where simulation is cheap (video games: 1000 fps), on-policy's simplicity often wins.
| Year | Milestone | Key Contribution |
|---|---|---|
| 1989 | Watkins — Q-Learning | Off-policy tabular value learning with convergence proof |
| 1992 | Williams — REINFORCE | Policy gradient via the score function trick |
| 1999 | Sutton et al. — PG Theorem | Formal foundation for policy gradient methods |
| 2000 | Kakade — Natural PG | Fisher information matrix for parameterization-invariant updates |
| 2013 | Mnih et al. — DQN | Neural networks + replay buffer + target networks for Atari |
| 2015 | Schulman et al. — TRPO | Trust region constraints for stable policy updates |
| 2016 | Schulman et al. — GAE | Exponentially-weighted TD errors for advantage estimation |
| 2017 | Schulman et al. — PPO | Clipped surrogate objective replaces TRPO constraints |
| 2018 | Haarnoja et al. — SAC | Maximum entropy off-policy actor-critic for continuous control |
| 2020 | Kumar et al. — CQL | Conservative Q-function lower bounds for offline RL |
| 2022 | Kostrikov et al. — IQL | Expectile regression avoids OOD action evaluation |
| 2022 | Ouyang et al. — InstructGPT | RLHF pipeline for language model alignment |
| 2023 | Rafailov et al. — DPO | Reparameterize reward into policy for direct preference optimization |
This lesson covered the complete landscape of deep RL theory. For implementation details, see the companion lessons:
| Topic | Lesson |
|---|---|
| Tabular Q-learning with worked examples | Q-Learning: Value-Based RL |
| Imitation learning as an alternative to RL | Imitation Learning |
| Actor-critic methods in depth | Off-Policy Actor-Critic |
| MDPs and value iteration from scratch | Markov Decision Processes |
| Algorithm | Key Hyperparameter | Typical Range | What It Controls |
|---|---|---|---|
| PPO | ε (clip range) | 0.1 – 0.3 | How far the policy can change per update |
| PPO | λ (GAE) | 0.9 – 0.99 | Bias-variance trade-off in advantage estimation |
| PPO | K (epochs) | 3 – 10 | How many passes over each batch of data |
| SAC | α (temperature) | Auto-tuned | Exploration-exploitation balance via entropy |
| SAC | τ (target EMA) | 0.005 | How fast target networks track the online network |
| DQN | ε (exploration) | 1.0 → 0.01 | Random action probability (annealed) |
| DQN | Target update freq | 1K – 10K steps | How often to freeze target network |
| IQL | τ (expectile) | 0.7 – 0.99 | How close to max (higher = closer) |
| CQL | α (CQL weight) | 0.5 – 10 | How conservative to be with OOD actions |
| DPO | β | 0.05 – 0.5 | How much to deviate from reference policy |
Root: The Bellman equation (1957) gives us the recursive structure of value.
Branch 1 — Value-based: Compute Q*, derive π* as argmax. Tabular Q-learning (1989) → DQN (2013) → Double DQN (2016) → Rainbow (2018). Limited to discrete actions.
Branch 2 — Policy gradient: Directly optimize πθ. REINFORCE (1992) → Actor-Critic (2000s) → TRPO (2015) → PPO (2017). Works with any action space.
Branch 3 — Off-policy actor-critic: Combine replay buffers with policy optimization. DDPG (2015) → TD3 (2018) → SAC (2018). Best sample efficiency for continuous control.
Branch 4 — Offline RL: Learn from fixed data without interaction. BCQ (2019) → CQL (2020) → IQL (2022). Critical for real-world deployment.
Branch 5 — Preference-based: Learn from human comparisons instead of numeric rewards. RLHF (2022) → DPO (2023). Powers modern LLM alignment.
| Want To Learn | Read This Paper | Pages |
|---|---|---|
| The foundations of policy gradients | Sutton et al. (1999) "Policy Gradient Methods for RL with Function Approximation" | 8 |
| Trust region methods | Schulman et al. (2015) "Trust Region Policy Optimization" | 16 |
| PPO in practice | Schulman et al. (2017) "Proximal Policy Optimization Algorithms" | 12 |
| Off-policy continuous control | Haarnoja et al. (2018) "Soft Actor-Critic" | 13 |
| Deep Q-learning from pixels | Mnih et al. (2015) "Human-level control through deep RL" (Nature) | 10 |
| Offline RL fundamentals | Levine et al. (2020) "Offline RL: Tutorial, Review, and Perspectives" | 55 |
| Aligning LMs to preferences | Rafailov et al. (2023) "Direct Preference Optimization" | 29 |
Deep RL is the art of converting the Bellman equation — value = reward + discounted future value — into practical algorithms that scale from gridworlds to language models. Everything is a variation on that one recursive idea.
"What I cannot create, I do not understand."
— Richard Feynman
Industry standard (2024): PPO in simulation with massive domain randomization, then zero-shot or few-shot transfer to real hardware.
Algorithm: PPO (not SAC) because on-policy is more stable under domain randomization. The sim-to-real gap means off-policy data from a slightly-wrong simulator can mislead SAC's Q-function. PPO's on-policy nature means it only trusts fresh rollouts.
Architecture: MLP (2-3 layers, 256-512 units). Fixed σ (log-std as learnable parameter, not state-dependent) for locomotion — state-dependent σ can collapse to zero prematurely. Unitree, Boston Dynamics, and ETH Zurich all use this setup.
Sim-to-real: Domain randomization (randomize friction, mass, motor delay, ground slope, sensor noise) during sim training. The policy learns to be robust to all variations. 2 hours of real-robot fine-tuning with conservative PPO (ε = 0.05) for final adaptation.
Safety: Action clipping to torque limits + safety constraints in reward (penalty for IMU angle > 30°) + early termination in sim when falling. Real robot: conservative action bounds initially, gradually relaxed.
Reward: Dense: r = vforward − 0.1||torques||2 − 5·1[fell] − 0.5|roll| − 0.5|pitch|. Multiple terms for velocity tracking, energy efficiency, and stability. Sparse rewards fail for locomotion (too many steps before any signal).
import torch
def reinforce_gradient(log_probs, rewards, gamma=0.99):
N = len(log_probs)
# Step 1: Compute discounted reward-to-go
all_rtg = []
for i in range(N):
T = len(rewards[i])
rtg = [0.0] * T
rtg[T-1] = rewards[i][T-1]
for t in range(T-2, -1, -1):
rtg[t] = rewards[i][t] + gamma * rtg[t+1]
all_rtg.append(rtg)
# Step 2: Compute baseline (mean reward-to-go per timestep)
T = len(all_rtg[0]) # assume equal length
baseline = [0.0] * T
for t in range(T):
baseline[t] = sum(all_rtg[i][t] for i in range(N)) / N
# Step 3: Compute advantages
loss = 0.0
for i in range(N):
for t in range(T):
advantage = all_rtg[i][t] - baseline[t]
loss += -log_probs[i][t] * advantage
return loss / (N * T)
Both processes start from an arbitrary initialization and converge to a fixed point by repeatedly applying a contractive operator. In RL, the contraction factor is γ. In diffusion, it's the noise schedule βt. The deeper pattern: any iterative map with contraction factor < 1 converges to its unique fixed point (Banach). You see this in EM (converges to local maximum of likelihood), power iteration (converges to dominant eigenvector), and even Newton's method (quadratic contraction near roots).
Can you identify the "contraction factor" in EM? (Hint: it relates to the fraction of missing information.)