Deep Reinforcement Learning — The Complete Theory

Roadmap

What You'll Master

01 Markov Decision Processes
02 Policy Gradient Methods
03 REINFORCE & Variance Reduction
04 Actor-Critic Methods
05 Trust Regions & the PDL
06 PPO: Proximal Policy Optimization
07 SAC: Soft Actor-Critic
08 Tabular MDP Methods
09 DQN & Value Approximation
10 Offline RL: IQL & CQL
11 RLHF & DPO
12 The Complete Cheat Sheet

Chapter 01

Markov Decision Processes

Before we can solve anything, we need a language for "an agent acting in a world." That language is the Markov Decision Process (MDP) — a mathematical framework that captures states, actions, transitions, and rewards in one clean package.

Definition

MDP — The Tuple ⟨S, A, p, r, ρ₀, γ, H⟩

S — state space (all possible situations the agent can be in, like positions on a board).
A — action space (all moves available, like up/down/left/right).
p(s' | s, a) — transition dynamics (if you're in state s and take action a, what's the probability of landing in s'? Think: physics of the world).
r(s, a) — reward function (immediate payoff for taking action a in state s — the score you get right now).
ρ₀(s) — initial state distribution (where episodes start — like the starting position in a game).
γ ∈ [0, 1) — discount factor (how much you value future rewards vs. immediate ones; γ = 0.99 means you're patient, γ = 0.1 means you're impatient).
H — horizon (how many time steps per episode).

The Markov Property

The "Markov" in MDP means the future depends only on the current state, not on how you got there. Formally: P(s_t+1 | s_t, a_t, s_t−1, a_t−1, ...) = P(s_t+1 | s_t, a_t). The entire history is summarized in the current state. This is a modeling assumption — if you're playing chess, the board position tells you everything. If you're driving, your current speed and position matter more than where you drove an hour ago.

Understanding the Discount Factor γ

The discount factor γ is surprisingly subtle. It serves three purposes:

Three Roles of γ

Discount Factor γ ∈ [0, 1)

1. Time preference: γ = 0.99 means a reward one step from now is worth 0.99 of the same reward right now. A reward 100 steps from now is worth 0.99¹⁰⁰ = 0.366 — we're patient but not infinitely so.

2. Mathematical convergence: Without discounting (γ = 1), the infinite sum ∑ γ^t r_t might not converge. With γ < 1, it's always bounded by r_max/(1−γ).

3. Effective horizon: The "effective horizon" is approximately 1/(1−γ). With γ = 0.99, the agent effectively plans ~100 steps ahead. With γ = 0.999, ~1000 steps. This lets you control the planning horizon without explicitly setting H.

Worked Example — Discount Factor Effect

Two strategies for a robot: Strategy A gets reward +1 every step forever. Strategy B gets reward 0 for 10 steps, then +2 forever.

With γ = 0.9: V(A) = 1/(1−0.9) = 10. V(B) = 0.9¹⁰ × 2/(1−0.9) = 0.349 × 20 = 6.97. A wins — the impatient agent prefers steady small rewards.

With γ = 0.99: V(A) = 1/(1−0.99) = 100. V(B) = 0.99¹⁰ × 2/(1−0.99) = 0.904 × 200 ≈ 180.9. B wins: the patient agent is willing to wait for double the reward.
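The two comparisons can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `value_constant_stream` is made up for illustration and simply evaluates the geometric series in closed form.

```python
def value_constant_stream(reward, gamma, delay=0):
    """Discounted value of receiving `reward` every step, starting after `delay` steps.

    Closed form of sum_{t >= delay} gamma^t * reward = gamma^delay * reward / (1 - gamma).
    """
    return gamma ** delay * reward / (1.0 - gamma)


for gamma in (0.9, 0.99):
    v_a = value_constant_stream(1.0, gamma)            # Strategy A: +1 every step
    v_b = value_constant_stream(2.0, gamma, delay=10)  # Strategy B: 0 for 10 steps, then +2
    winner = "A" if v_a > v_b else "B"
    print(f"gamma={gamma}: V(A)={v_a:.2f}, V(B)={v_b:.2f}, winner={winner}")
```

Changing `delay` or the reward sizes shows how the preferred strategy flips as γ moves toward 1.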

A policy π(a | s) maps states to action probabilities. It's the agent's strategy: "when I see state s, here's how I pick an action." A deterministic policy always picks one action; a stochastic policy samples from a distribution.

An episode (or trajectory) is a sequence τ = (s₀, a₀, r₀, s₁, a₁, r₁, ..., s_H). The agent starts at s₀ ~ ρ₀, picks actions via π, and the world responds via p. The whole game of RL is to find the policy that maximizes the expected total discounted reward:

The RL Objective max_π 𝔼_τ~π [ ∑_t=0^H γ^t r(s_t, a_t) ]
Find the policy that collects the most (discounted) reward on average.

The Value Function V^π(s)

Suppose you're in state s and you follow policy π forever after. How much total reward do you expect? That's the value function:

Value Function V^π(s) = 𝔼_π [ ∑_t=0^∞ γ^t r(s_t, a_t) | s₀ = s ]

Now for the most important recursive relationship in all of RL — the Bellman equation. The value of being in state s equals the immediate reward plus the discounted value of wherever you end up next:

Derivation — Bellman Equation for V^π

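Claim (the equation we want to derive):

V^π(s) = 𝔼_a~π(s) [ r(s, a) + γ 𝔼_s'~p(s'|s,a)[ V^π(s') ] ]

Start from the definition V^π(s) = 𝔼_π [ ∑_t=0^∞ γ^t r(s_t, a_t) | s₀ = s ] and proceed in three steps.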

Step 1: Expand the expectation over the trajectory. The first reward is r(s₀, a₀), and everything after s₁ is another trajectory starting from s₁.

Step 2: By the Markov property (the future depends only on the current state, not the past), the expected future reward from s₁ onward is exactly V^π(s₁).

Step 3: Combine: V^π(s) = 𝔼_a~π[r(s,a)] + γ 𝔼_a~π[𝔼_s'~p[V^π(s')]]. This is the Bellman equation — it says value = immediate reward + discounted future value.

The Q-Function Q^π(s, a)

Sometimes we want to know the value of a specific action, not just the state. The action-value function Q^π(s, a) is the expected total reward if you take action a in state s, then follow π thereafter:

Q-Function Q^π(s, a) = r(s, a) + γ 𝔼_s'~p(s'|s,a)[ V^π(s') ]

The relationship between V and Q is clean: V^π(s) = 𝔼_a~π(a|s)[Q^π(s, a)]. The value of a state is just the expected Q-value over the actions your policy would take.

Optimal Values

The optimal value function V*(s) = max_π V^π(s) is the best achievable value from state s under any policy. Similarly, Q*(s, a) = max_π Q^π(s, a). The optimal policy π*(a | s) is the one that achieves these values, and it satisfies π*(s) = argmax_a Q*(s, a).

The Central Theorem of RL

For any finite MDP, there exists a deterministic optimal policy π* that simultaneously maximizes V^π(s) for every state s. This is remarkable: you might expect that optimizing for one starting state could hurt performance in another. But the Bellman optimality equation guarantees that the same policy is optimal everywhere. This is because the optimal action in state s depends only on Q*(s, ·), which already accounts for optimal behavior in all future states.

Bellman Optimality Equation — Derivation

Start from the Bellman equation for Q^π: Q^π(s, a) = r(s, a) + γ 𝔼_s'[𝔼_a'~π[Q^π(s', a')]].

For the optimal policy π*, the inner expectation becomes: 𝔼_a'~π*[Q*(s', a')] = max_a' Q*(s', a'). This is because π* puts all probability on the action with the highest Q-value.

Substituting: Q*(s, a) = r(s, a) + γ 𝔼_s'~p[max_a' Q*(s', a')]. This is the Bellman optimality equation — it characterizes Q* as the unique fixed point of the optimality operator.

Worked Example — Bellman Optimality in a 2-State MDP

States: {A, B}. Actions: {left, right}. From A, right gives r = +3 and goes to B. From A, left gives r = +1 and stays at A. From B, right gives r = +10 (terminal). From B, left gives r = 0 and goes to A. γ = 0.9.

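Q*(B, right) = 10 (terminal, no future).

Q*(B, left) = 0 + 0.9 × V*(A), where V*(s) = max_a Q*(s, a).

Q*(A, right) = 3 + 0.9 × V*(B).

Q*(A, left) = 1 + 0.9 × V*(A).

These equations are coupled: V*(B) depends on V*(A) and vice versa, so we cannot simply plug in Q*(B, right) = 10 for V*(B). Instead, guess the greedy actions, solve, and check. Guess π*(A) = right and π*(B) = left. Then V*(B) = 0.9 × V*(A) and V*(A) = 3 + 0.9 × V*(B) = 3 + 0.81 × V*(A), so V*(A) = 3 / 0.19 ≈ 15.79 and V*(B) ≈ 14.21.

Check the guess: Q*(A, right) ≈ 15.79 > Q*(A, left) = 1 + 0.9 × 15.79 ≈ 15.21, and Q*(B, left) ≈ 14.21 > Q*(B, right) = 10. The guess is self-consistent, so these are the optimal values.

Optimal policy: π*(A) = right (15.79 > 15.21), π*(B) = left (14.21 > 10). Surprisingly, the optimal strategy from B is to go back to A rather than immediately collecting the +10! This is because looping A→B→A generates a stream of +3 rewards that, when discounted, exceeds the one-shot +10.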
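A quick way to sanity-check a small example like this is to iterate the Bellman optimality equation until it stops changing. The sketch below does that for the 2-state MDP; the dictionary encoding and the function name `q_value_iteration` are illustrative choices, not part of the lesson.

```python
GAMMA = 0.9

# (state, action) -> (reward, next_state); next_state None means the episode ends.
MDP = {
    ("A", "left"): (1.0, "A"),
    ("A", "right"): (3.0, "B"),
    ("B", "left"): (0.0, "A"),
    ("B", "right"): (10.0, None),
}


def q_value_iteration(mdp, gamma, sweeps=500):
    """Repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a') until convergence."""
    q = {sa: 0.0 for sa in mdp}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), (r, s_next) in mdp.items():
            if s_next is None:
                future = 0.0  # terminal transition: nothing follows
            else:
                future = max(q[(s2, b)] for (s2, b) in mdp if s2 == s_next)
            new_q[(s, a)] = r + gamma * future
        q = new_q
    return q


q_star = q_value_iteration(MDP, GAMMA)
greedy = {s: max(("left", "right"), key=lambda a: q_star[(s, a)]) for s in ("A", "B")}
for sa, value in sorted(q_star.items()):
    print(sa, round(value, 2))
print("greedy policy:", greedy)
```

The greedy policy it prints goes right in A and left in B, and the printed Q-values are the fixed point of the optimality equation for this MDP.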

State Distributions

When we analyze RL algorithms, we often need to talk about which states a policy visits. The discounted state distribution (also called the discounted occupancy measure) under policy π is:

Discounted State Distribution d^π(s) = (1 − γ) ∑_t=0^∞ γ^t P(s_t = s | π)
A weighted average of how often π visits state s, with earlier time steps weighted more heavily (a visit at time t gets weight γ^t).

The factor (1 − γ) normalizes this to a proper probability distribution. This distribution is crucial for the Performance Difference Lemma in Chapter 5.
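For a finite state space the infinite sum has a closed form, which makes the normalization easy to verify numerically. The sketch below assumes a made-up 3-state transition matrix `P_pi` (the state-to-state dynamics once the policy is fixed) and a deterministic start state; neither comes from the lesson.

```python
import numpy as np


def discounted_state_distribution(P_pi, rho0, gamma):
    """d^pi = (1 - gamma) * sum_t gamma^t * rho0 @ P_pi^t, in closed form.

    Uses sum_t (gamma * P_pi)^t = (I - gamma * P_pi)^{-1}, valid because gamma < 1.
    """
    n = P_pi.shape[0]
    return (1.0 - gamma) * rho0 @ np.linalg.inv(np.eye(n) - gamma * P_pi)


# Hypothetical state-to-state transition matrix under some fixed policy.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.2, 0.0, 0.8]])
rho0 = np.array([1.0, 0.0, 0.0])  # always start in state 0

d_pi = discounted_state_distribution(P_pi, rho0, gamma=0.9)
print(d_pi, d_pi.sum())
```

The entries are non-negative and sum to 1, which is exactly what the (1 − γ) factor is for.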

POMDPs — When You Can't See Everything

In a Partially Observable MDP (POMDP), the agent doesn't see the full state s — it only gets an observation o drawn from some distribution p(o | s). A self-driving car can see camera images but not the intentions of other drivers. The fix: the agent must maintain a belief — a probability distribution over possible states — or use recurrent networks that implicitly track history. Most deep RL treats problems as MDPs (assuming full observability) even when they're technically POMDPs, by feeding in a window of recent observations.

Interactive: MDP Grid World

Click any cell to see its V^π value. Toggle between random and optimal policy to see how values change. The goal (gold cell) gives +10 reward; each step costs −1.

Worked Example — Computing V^π

3-state chain: A → B → C. Policy: always go right. Rewards: r(A, right) = −1, r(B, right) = −1, r(C) = +10 (terminal). γ = 0.9.

V^π(B) = −1 + 0.9 × 10 = 8.0

V^π(A) = −1 + 0.9 × V^π(B) = −1 + 0.9 × 8.0 = 6.2

Notice how value propagates backward from the reward. This is the essence of the Bellman equation: each state's value depends recursively on the next state's value.

🔨 Derivation: Prove That the Bellman Optimality Operator Is a Contraction

The Bellman policy operator T^π is a γ-contraction (proved in Chapter 8). But the Bellman optimality operator T* uses a max instead of an expectation over π:

(T*V)(s) = max_a [ r(s,a) + γ ∑_s' p(s'|s,a) V(s') ]

Your task: Prove that ||T*V − T*V'||_∞ ≤ γ ||V − V'||_∞. The max makes this trickier than the policy case — you can't factor it out.

Hint 1: For any functions f, g: |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|. This is because the max of one side picks a specific action, and that action's difference is bounded by the max difference.

Hint 2: ∑_s' p(s'|s,a) |V(s') − V'(s')| ≤ ||V − V'||_∞ · ∑_s' p(s'|s,a) = ||V − V'||_∞ · 1. The probabilities sum to 1, and each |V(s') − V'(s')| is bounded by the sup-norm.

Full proof:

Step 1: Fix a state s and write out both operators:

|(T*V)(s) − (T*V')(s)| = |max_a [ r(s,a) + γ ∑_s' p(s'|s,a) V(s') ] − max_a [ r(s,a) + γ ∑_s' p(s'|s,a) V'(s') ]|

Step 2: Apply |max f − max g| ≤ max |f − g|:

≤ max_a |γ ∑_s' p(s'|s,a) [V(s') − V'(s')]| (rewards cancel)

Step 3: Triangle inequality on the weighted sum:

≤ max_a γ ∑_s' p(s'|s,a) |V(s') − V'(s')| ≤ γ ||V − V'||_∞

Step 4: Since this holds for all s, take the sup: ||T*V − T*V'||_∞ ≤ γ ||V − V'||_∞. ■

The key insight: The max doesn't break contraction because |max f − max g| ≤ max |f − g|. This is why value iteration (which applies T* repeatedly) converges to V* regardless of initialization — and why γ < 1 is essential.
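The inequality is easy to probe numerically. The sketch below builds a random MDP (sizes, seed, and reward distribution are arbitrary assumptions) and checks that applying T* never stretches the sup-norm distance between two value functions by more than γ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random MDP for illustration: P[s, a] is a distribution over next states.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions))


def bellman_optimality(V):
    """(T*V)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]."""
    return (R + gamma * P @ V).max(axis=1)


worst_ratio = 0.0
for _ in range(1000):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    gap_before = np.abs(V1 - V2).max()
    gap_after = np.abs(bellman_optimality(V1) - bellman_optimality(V2)).max()
    worst_ratio = max(worst_ratio, gap_after / gap_before)

print("largest observed ratio:", worst_ratio, "(gamma =", gamma, ")")
```

A numerical check like this is not a proof, but a ratio above γ would immediately flag a bug in an implementation of the backup.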

🔗 Pattern Recognition

RL Objective = Stochastic Optimization of a Non-Stationary Objective

This Lesson (RL)

max_π J(π) = 𝔼_τ~π[∑ γ^t r_t]
Objective changes as π changes (non-stationary data distribution)

Standard Optimization

min_θ 𝔼_x~D[L(θ, x)]
Fixed data distribution D (i.i.d. assumption)

In supervised learning, the data distribution is fixed: your training set doesn't change as you update weights. In RL, the data distribution is induced by the policy: update the policy and the states and rewards you observe change with it. This non-stationarity is a major reason RL is harder than supervised learning: you're optimizing an objective whose data shifts under your feet.

Where else do you see non-stationary optimization? (Hint: think GANs, self-play, curriculum learning.)

Checkpoint — Before you move on

Explain in your own words: why can't we just compute ∇_θ J(θ) by differentiating through the environment dynamics? What does the Bellman equation tell us about the structure of the problem that policy gradients exploit differently than value iteration?

Model Answer

The environment dynamics p(s'|s,a) are unknown and non-differentiable (the real world doesn't have a gradient). Value iteration requires knowing p to compute the Bellman backup. Policy gradients bypass this entirely: the log-derivative trick converts the gradient into an expectation over trajectories, and the dynamics terms cancel. We never differentiate through the environment — we only need to sample from it. This is why policy gradients are model-free while value iteration is model-based. The tradeoff: value iteration gives exact solutions (when you know p) but policy gradients work with unknown dynamics at the cost of high variance.

Chapter 02

Policy Gradient Methods

Value-based methods like Q-learning learn Q* and derive π* as argmax. But what if we parameterize the policy directly as π_θ(a | s) and optimize θ by gradient ascent on expected reward? That's the policy gradient approach, the foundation of PPO and SAC and of much of modern deep RL.

Why Not Just Use Q-Learning?

If Q-learning can find the optimal policy via argmax, why bother with policy gradients at all? Three reasons:

Three Reasons for Policy Gradients

Advantages Over Value-Based Methods

1. Continuous actions: Q-learning requires max_a Q(s, a). With discrete actions (4 moves), you evaluate all 4. With continuous actions (a ∈ R⁷), there are infinitely many — you can't enumerate them. Policy gradients handle continuous actions naturally.

2. Stochastic policies: Sometimes the optimal policy is stochastic (e.g., rock-paper-scissors: play uniformly at random). Q-learning always produces deterministic policies (argmax). Policy gradients can learn arbitrary distributions.

3. Better convergence: Policy gradients optimize the objective directly, while Q-learning tries to find a fixed point of the Bellman operator. In large-scale settings (millions of parameters), direct optimization is often more stable in practice, though this is an empirical tendency rather than a guarantee.

Policy Parameterization

For discrete actions, π_θ is typically a neural network that outputs a softmax distribution over actions. For continuous actions, π_θ outputs the mean and standard deviation of a Gaussian policy:

Gaussian Policy π_θ(a | s) = N(a; μ_θ(s), σ_θ(s)²)
The network outputs μ (where to center the action) and σ (how much to explore).
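Two operations on a Gaussian policy come up constantly: sampling an action and evaluating log π_θ(a | s). The sketch below shows both in NumPy, assuming a diagonal Gaussian over a 2-dimensional action; the `mean` and `std` arrays are stand-ins for what a network would output.

```python
import numpy as np


def gaussian_log_prob(action, mean, std):
    """log N(action; mean, std^2), summed over independent action dimensions."""
    z = (action - mean) / std
    return np.sum(-0.5 * z ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi))


rng = np.random.default_rng(0)
mean = np.array([0.2, -0.1])  # stand-in for mu_theta(s)
std = np.array([0.5, 0.3])    # stand-in for sigma_theta(s)

# Sample a ~ pi_theta(. | s) as mean + std * standard normal noise.
action = mean + std * rng.standard_normal(mean.shape)
log_prob = gaussian_log_prob(action, mean, std)
print(action, log_prob)
```

In practice many implementations have the network output log σ rather than σ so that the standard deviation stays positive; that detail is omitted here.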

The Objective

Define the objective as the expected total reward under trajectories sampled from π_θ:

Policy Gradient Objective V(θ) = ∑_τ P_θ(τ) R(τ)
where R(τ) = ∑_t=0^T−1 r(s_t, a_t) and P_θ(τ) = ρ₀(s₀) ∏_t=0^T−1 π_θ(a_t|s_t) p(s_t+1|s_t,a_t)

We want ∇_θV(θ). But the sum is over all possible trajectories — an astronomically large set. And R(τ) doesn't depend on θ directly (rewards are given by the environment). How do we differentiate through sampling?

The Log-Derivative Trick

Full Derivation — Policy Gradient

Step 1 — Apply the definition of gradient:

∇_θV(θ) = ∇_θ ∑_τ P_θ(τ) R(τ) = ∑_τ R(τ) ∇_θ P_θ(τ)

Why: R(τ) doesn't depend on θ, so it comes out of the gradient.

Step 2 — Log-derivative trick: We use the identity ∇_θ P_θ(τ) = P_θ(τ) ∇_θ log P_θ(τ). This is just the chain rule: d/dx log f(x) = f'(x)/f(x), rearranged.

∇_θV(θ) = ∑_τ R(τ) P_θ(τ) ∇_θ log P_θ(τ)

= 𝔼_{τ~P_θ} [ R(τ) ∇_θ log P_θ(τ) ]

Why this matters: We turned a sum over all trajectories into an expectation we can estimate by sampling trajectories!

Step 3 — Expand log P_θ(τ):

log P_θ(τ) = log ρ₀(s₀) + ∑_t=0^T−1 [ log π_θ(a_t|s_t) + log p(s_t+1|s_t,a_t) ]

Step 4 — Differentiate: ∇_θ log P_θ(τ) = ∑_t=0^T−1 ∇_θ log π_θ(a_t|s_t)

Why: ρ₀(s₀) doesn't depend on θ (the starting state is given by the environment). The transition dynamics p(s_t+1|s_t,a_t) don't depend on θ either (they're properties of the physics). Only the policy π_θ depends on θ. So the transition terms vanish!

The Policy Gradient ∇_θV(θ) = 𝔼_{τ~π_θ} [ (∑_t=0^T−1 ∇_θ log π_θ(a_t|s_t)) (∑_t=0^T−1 r(s_t, a_t)) ]
Increase the log-probability of actions in proportion to how much total reward the trajectory received.

Model-Free Magic

The transition dynamics p(s'|s,a) completely disappeared from the gradient! We don't need to know the physics of the world — we only need to be able to sample trajectories and evaluate ∇_θ log π_θ. This is why policy gradients are model-free.

The Policy Gradient Theorem

There's an alternative form that's often more useful. Instead of summing over entire trajectories, we can express the gradient in terms of the Q-function:

Policy Gradient Theorem — Full Proof

Step 1: Start with V(θ) = 𝔼_s₀~ρ₀[V^π(s₀)]. Write V^π(s) = ∑_a π_θ(a|s) Q^π(s, a).

Step 2: Take the gradient:

∇_θ V^π(s) = ∑_a [ ∇_θπ_θ(a|s) Q^π(s,a) + π_θ(a|s) ∇_θQ^π(s,a) ]

Product rule: both π and Q depend on θ.

Step 3: Expand ∇_θQ^π(s,a) = ∇_θ[r(s,a) + γ ∑_s' p(s'|s,a) V^π(s')] = γ ∑_s' p(s'|s,a) ∇_θV^π(s').

The reward r and transition p don't depend on θ, so only V^π(s') contributes.

Step 4: Substitute back: ∇_θV^π(s) = ∑_a ∇_θπ(a|s) Q^π(s,a) + ∑_a π(a|s) γ ∑_s' p(s'|s,a) ∇_θV^π(s').

Step 5: This is recursive! ∇V(s) depends on ∇V(s'). Unroll it one more step:

∇V(s') = ∑_a' ∇π(a'|s') Q(s',a') + ∑_a' π(a'|s') γ ∑_s'' p(s''|s',a') ∇V(s'') ...

Step 6: Unrolling indefinitely and collecting terms: ∇V^π(s₀) = ∑_t=0^∞ ∑_s P(s₀→s, t, π) γ^t ∑_a ∇π(a|s) Q^π(s,a), where P(s₀→s, t, π) is the probability of being in state s at time t starting from s₀ under π.

Step 7: Take expectation over s₀~ρ₀ and recognize the discounted state distribution d^π(s) = (1−γ) ∑_t γ^t P(s_t=s|π), so that ∑_t γ^t P(s_t=s|π) = d^π(s)/(1−γ). Use the log-derivative trick: ∇π(a|s) = π(a|s) ∇ log π(a|s). This gives the final result. ■

Policy Gradient Theorem ∇_θV(θ) = 1/(1−γ) · 𝔼_{s~d^π, a~π_θ} [ ∇_θ log π_θ(a|s) · Q^π(s, a) ]
Average over the state distribution d^π and the policy π, weighting each action's score by its Q-value. The constant 1/(1−γ) only rescales the gradient and is often absorbed into the learning rate.

Interactive: Policy Gradient Direction

2D parameter space (θ₁, θ₂) with V(θ) shown as contours. The blue arrow shows the vanilla policy gradient; the green arrow shows the natural gradient (Fisher-adjusted). Drag the dot to explore different θ values.

Worked Example — Policy Gradient for a 1-Step Bandit

Two actions: left (θ₁) and right (θ₂). Policy: softmax π(left) = e^θ₁ / (e^θ₁ + e^θ₂). Rewards: r(left) = 1, r(right) = 3. Current θ = (0, 0), so π(left) = π(right) = 0.5.

Sample trajectory 1: action = right, reward = 3. For a softmax policy, ∇_θ log π(right) = (0, 1) − (π(left), π(right)) = (−0.5, 0.5), so the gradient estimate is 3 × (−0.5, 0.5) = (−1.5, 1.5): decrease θ₁, increase θ₂.

After many samples: θ₂ grows, θ₁ shrinks, and π(right) → 1. The gradient pushes probability mass toward higher-reward actions.
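The bandit is small enough to compute the policy gradient exactly and compare it with a sampled estimate. The sketch below does both at θ = (0, 0); the sample count and seed are arbitrary choices.

```python
import numpy as np


def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()


def grad_log_pi(theta, action):
    """Gradient of log softmax(theta)[action]: one-hot(action) minus pi."""
    g = -softmax(theta)
    g[action] += 1.0
    return g


rewards = np.array([1.0, 3.0])  # index 0 = left, index 1 = right
theta = np.zeros(2)
pi = softmax(theta)

# Exact gradient: sum over actions of pi(a) * r(a) * grad log pi(a).
exact = sum(pi[a] * rewards[a] * grad_log_pi(theta, a) for a in range(2))

# Monte Carlo estimate: average r(a) * grad log pi(a) over sampled actions.
rng = np.random.default_rng(0)
samples = rng.choice(2, size=20000, p=pi)
estimate = np.mean([rewards[a] * grad_log_pi(theta, a) for a in samples], axis=0)

print("exact:", exact)
print("estimate:", estimate)
```

Both vectors point toward lowering θ₁ and raising θ₂, which is the direction that moves probability mass onto the higher-reward action.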

Chapter 03

REINFORCE & Variance Reduction

The policy gradient formula is beautiful in theory but horrifically noisy in practice. Imagine estimating a gradient by running one episode: the agent might get lucky (high reward) or unlucky (low reward) by chance, not because of the policy. The variance of the gradient estimate is so high that it takes millions of samples to get a useful signal. This chapter is about taming that variance.

How Bad Is the Variance?

Consider a simple environment where the optimal reward is 100. With vanilla REINFORCE, a single gradient estimate might range from −500 to +500 across different trajectory samples, while the true gradient is only +2. If the per-sample standard deviation is around 500, you need roughly variance/mean² = 500²/2² = 62,500 samples just to shrink the standard error down to the size of the true gradient. This is why raw REINFORCE is rarely used in practice: it's too expensive.

REINFORCE (Williams, 1992)

Sample a trajectory τ = (s₀, a₀, r₀, ..., s_T) by running π_θ in the environment.
Compute return: R(τ) = ∑_t=0^T−1 γ^t r(s_t, a_t).
Compute gradient estimate: ĝ = (∑_t=0^T−1 ∇_θ log π_θ(a_t|s_t)) · R(τ).
Update: θ ← θ + α ĝ.
Repeat.

The Causality Trick: Reward-to-Go

In the vanilla gradient, every action's log-probability gets multiplied by the total trajectory reward, including rewards that happened before that action was taken. But action a₅ can't possibly affect rewards r₀ through r₄ — those already happened! Including past rewards adds pure noise.

Derivation — Reward-to-Go (Causality Trick)

In the full gradient, consider the term for time step t:

∇_θ log π_θ(a_t|s_t) · ∑_t'=0^T−1 r(s_t', a_t')

Split the sum into past (t' < t) and future (t' ≥ t):

= ∇_θ log π_θ(a_t|s_t) · [ ∑_t'=0^t−1 r_t' + ∑_t'=t^T−1 r_t' ]

Key insight: 𝔼[ ∇_θ log π_θ(a_t|s_t) · r_t' ] = 0 for t' < t. Why? Because r_t' is determined by s_t', a_t', which are independent of a_t given s_t. And 𝔼_{a_t~π}[∇_θ log π_θ(a_t|s_t)] = ∇_θ ∑_a π_θ(a|s_t) = ∇_θ 1 = 0. So the past reward term vanishes in expectation.

This means we can replace the total reward with the reward-to-go without introducing any bias:

Reward-to-Go Ĝ_t = ∑_t'=t^T−1 γ^t'−t r(s_t', a_t')

ĝ = ∑_t=0^T−1 ∇_θ log π_θ(a_t|s_t) · Ĝ_t
Each action is only credited for future rewards it could have influenced.
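Computing Ĝ_t for every t naively costs O(T²), but a single backward pass does it in O(T) using Ĝ_t = r_t + γ Ĝ_t+1. A minimal sketch (the function name is illustrative):

```python
def rewards_to_go(rewards, gamma):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_t', computed in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns


example = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)
print(example)  # [1.5, 1.0, 2.0]
```

Here Ĝ_2 = 2, Ĝ_1 = 0 + 0.5 × 2 = 1, and Ĝ_0 = 1 + 0.5 × 1 = 1.5.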

Baselines: Subtracting Without Bias

Even with reward-to-go, the gradient is noisy. The trick: subtract a baseline b(s_t) from the reward-to-go. If all rewards are positive (e.g., 10, 12, 11, 13), then every action gets reinforced, just some more than others. Subtracting the average (~11.5) means good actions get positive reinforcement and bad actions get negative reinforcement — much clearer signal.

Proof — Baselines Are Unbiased

We need to show that 𝔼[∇_θ log π_θ(a|s) · b(s)] = 0 for any function b(s).

Step 1: 𝔼_{a~π_θ}[∇_θ log π_θ(a|s) · b(s)]

= b(s) · 𝔼_a~π[∇_θ log π_θ(a|s)]

Why: b(s) doesn't depend on a, so it factors out.

Step 2: = b(s) · ∑_a π_θ(a|s) · ∇_θ log π_θ(a|s)

Step 3: = b(s) · ∑_a π_θ(a|s) · ∇_θπ_θ(a|s) / π_θ(a|s)

= b(s) · ∑_a ∇_θπ_θ(a|s)

= b(s) · ∇_θ ∑_a π_θ(a|s)

= b(s) · ∇_θ 1 = 0 ■

The probabilities sum to 1 regardless of θ, so their gradient is zero. Any baseline that only depends on the state (not the action) can be subtracted for free.
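The same identity can be checked numerically for a softmax policy, where ∇_θ log π(a) has the simple form one-hot(a) − π. The logits and the baseline value below are arbitrary.

```python
import numpy as np

theta = np.array([0.3, -1.2, 0.8])        # arbitrary logits for one state
pi = np.exp(theta) / np.exp(theta).sum()  # softmax policy pi_theta(. | s)
baseline = 7.0                            # any number that ignores the action

# For a softmax, grad log pi(a) = one_hot(a) - pi; row a of this matrix.
score = np.eye(3) - pi

# E_{a ~ pi}[ grad log pi(a) * b ] = b * sum_a pi(a) * score[a]
expected_term = baseline * (pi[:, None] * score).sum(axis=0)
print(expected_term)  # numerically zero in every coordinate
```

Replacing `baseline` with a quantity that depends on the action breaks the cancellation, which is why the proof requires b to be a function of the state only.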

The most common baseline is b(s) = V^π(s) — the average value of the state. With this baseline, we're reinforcing actions proportional to Ĝ_t − V^π(s_t), which is an estimate of the advantage: how much better was this action than average?

Worked Example — Advantage as Surprise

State s: standing at a fork in a maze. V^π(s) = 5 (on average, this state leads to reward 5).

Action: go left. Ĝ_t = 8. Advantage = 8 − 5 = +3. "Going left was 3 points better than expected. Reinforce it."

Action: go right. Ĝ_t = 2. Advantage = 2 − 5 = −3. "Going right was 3 points worse than expected. Discourage it."

Action: go straight. Ĝ_t = 5. Advantage = 5 − 5 = 0. "Going straight was exactly average. Don't change anything."

Without the baseline, all three actions would receive positive reinforcement (rewards 8, 2, 5 are all positive). The baseline centers the signal so the gradient points in the right direction.

The Optimal Baseline

Among all baselines, which minimizes variance the most? The answer:

Optimal Baseline b*(s_t) = 𝔼[ ||∇_θ log π_θ(a_t|s_t)||² · Ĝ_t ] / 𝔼[ ||∇_θ log π_θ(a_t|s_t)||² ]
The gradient-magnitude-weighted average of the rewards. In practice, V^π(s) is almost as good and much easier to compute.

Interactive: Variance Reduction

Compare gradient estimate variance with and without a baseline. Each bar is one sampled gradient estimate. The blue line shows the true gradient. Toggle the baseline to see variance collapse.

Worked Example — Baseline Effect

Two trajectories: τ₁ has reward 100, τ₂ has reward 102. Without baseline: both get positive reinforcement (100 and 102). Gradient: increase probability of both τ₁ and τ₂, just τ₂ slightly more. Signal-to-noise ratio is terrible.

With baseline b = 101: τ₁ gets −1 (decrease its probability slightly), τ₂ gets +1 (increase it slightly). The signal is now purely about the relative quality. Much cleaner.
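To put numbers on this, the sketch below treats the two trajectories as the two outcomes of a uniform softmax policy (an assumption made purely for illustration) and computes the mean and variance of the single-sample gradient estimate exactly, with and without the baseline.

```python
import numpy as np

pi = np.array([0.5, 0.5])           # probability of tau_1 and tau_2 (assumed uniform)
returns = np.array([100.0, 102.0])  # R(tau_1), R(tau_2)
score = np.eye(2) - pi              # grad log pi for a softmax over the two outcomes


def estimator_stats(baseline):
    """Exact mean and per-coordinate variance of (R - b) * grad log pi."""
    samples = (returns - baseline)[:, None] * score  # one estimate per outcome
    mean = (pi[:, None] * samples).sum(axis=0)
    var = (pi[:, None] * (samples - mean) ** 2).sum(axis=0)
    return mean, var


mean_raw, var_raw = estimator_stats(0.0)
mean_base, var_base = estimator_stats(101.0)
print("no baseline:   mean", mean_raw, "variance", var_raw)
print("baseline 101:  mean", mean_base, "variance", var_base)
```

The mean is identical in both cases, so the baseline introduces no bias, while the variance drops from thousands to zero in this idealized two-outcome setting.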

REINFORCE with Baseline

Sample trajectory τ from π_θ.
For each t: compute Ĝ_t = ∑_t'=t^T−1 γ^t'−t r_t'.
Compute advantage estimate: Â_t = Ĝ_t − V̂_φ(s_t).
Update policy: θ ← θ + α ∑_t ∇_θ log π_θ(a_t|s_t) · Â_t.
Update baseline: φ ← φ − β ∇_φ ∑_t (V̂_φ(s_t) − Ĝ_t)².
Repeat.
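The loop above can be sketched end to end on a toy problem. Everything about the environment here is an assumption for illustration: a 4-state corridor where the agent starts at the left end, pays −1 per step, and receives +10 on reaching the right end. The policy is a tabular softmax and the baseline is a table of values; step sizes and episode count are untuned guesses, and the factor of 2 from differentiating the squared error is folded into β.

```python
import numpy as np

GOAL, GAMMA = 3, 0.9  # corridor states 0..3; reaching state 3 ends the episode


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def run_episode(theta, rng, max_steps=50):
    """Roll out the tabular softmax policy; returns a list of (s, a, r)."""
    s, trajectory = 0, []
    for _ in range(max_steps):
        a = rng.choice(2, p=softmax(theta[s]))  # 0 = left, 1 = right
        s_next = s + 1 if a == 1 else max(s - 1, 0)
        r = 10.0 if s_next == GOAL else -1.0
        trajectory.append((s, a, r))
        if s_next == GOAL:
            break
        s = s_next
    return trajectory


def train(episodes=2000, alpha=0.02, beta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((GOAL, 2))  # policy logits per non-terminal state
    v_hat = np.zeros(GOAL)       # tabular baseline V_phi(s)
    for _ in range(episodes):
        trajectory = run_episode(theta, rng)
        grad_theta, g, targets = np.zeros_like(theta), 0.0, []
        for s, a, r in reversed(trajectory):
            g = r + GAMMA * g                        # reward-to-go G_t
            score = -softmax(theta[s])
            score[a] += 1.0                          # grad log pi(a|s) w.r.t. theta[s]
            grad_theta[s] += score * (g - v_hat[s])  # advantage-weighted score
            targets.append((s, g))
        theta += alpha * grad_theta                  # policy gradient ascent step
        for s, g in targets:
            v_hat[s] += beta * (g - v_hat[s])        # regress baseline toward G_t
    return theta, v_hat


theta, v_hat = train()
p_right = np.array([softmax(row)[1] for row in theta])
print("P(right | s):", p_right.round(3))
print("baseline:", v_hat.round(2))
```

On this problem the probability of moving right should rise in the state next to the goal first, since that is where the +10 signal is strongest, and then propagate backward as the baseline improves.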

⚔ Adversarial: When Does the Baseline Hurt?

You're using REINFORCE with baseline b(s) = V^π(s). Your value function is badly learned: V̂(s) overestimates by +50 for all states. The true V(s) ≈ 10. Your rewards are in [0, 20]. What happens to training?

All advantages become negative (Ĝ_t − V̂(s_t) < 0), so the policy decreases the probability of ALL actions, collapsing to uniform randomness.
The gradient is biased because the bad V̂ violates the unbiasedness proof.
Nothing goes wrong because baselines are always unbiased regardless of quality.

Chapter 04

Actor-Critic Methods

REINFORCE with baseline has a problem: the reward-to-go Ĝ_t is a Monte Carlo estimate — you have to wait until the end of the episode to compute it. This means high variance (one trajectory can be lucky or unlucky) and no learning until the episode finishes. What if we replaced Ĝ_t with a learned estimate?

The Advantage Function

The advantage A^π(s, a) = Q^π(s, a) − V^π(s) measures how much better action a is compared to the average action in state s. If A > 0, the action is better than average; if A < 0, it's worse.

Why Advantage?

The advantage is the ideal quantity to multiply the policy gradient by. Positive advantage → increase the action's probability. Negative advantage → decrease it. Zero advantage → leave it alone. It's a direct signal of "is this action good relative to alternatives?"

One-Step Advantage Estimate

We don't know Q^π or V^π exactly, but we can approximate the advantage using a single transition and a learned value function V̂_φ:

One-Step Advantage Estimate Â_t ≈ r(s_t, a_t) + γ V̂_φ(s_t+1) − V̂_φ(s_t)
This is the TD error δ_t. It estimates Q^π(s_t,a_t) as r + γV(s') and subtracts V(s_t).

Why does this work? Q^π(s, a) = r(s, a) + γ 𝔼[V^π(s')]. So r + γV̂(s') is a one-sample estimate of Q^π(s, a). Subtract V̂(s) and you get an estimate of the advantage. This introduces bias (V̂_φ might be wrong) but dramatically reduces variance (no need to wait for the entire trajectory).

Training the Critic

The critic V̂_φ is trained to predict the expected return from each state. Two options:

Method	Target for V̂_φ(s_t)	Bias	Variance
Monte Carlo	Ĝ_t = ∑_t'=t^T−1 γ^t'−t r_t'	None	High
Bootstrapping (TD)	r_t + γ V̂_φ(s_t+1)	Yes (from V̂)	Low

In practice, bootstrapping wins because the variance reduction outweighs the bias, especially when the critic becomes accurate.

One-Step Actor-Critic (Online)

Collect transition (s, a, r, s') using π_θ.
Compute TD error: δ = r + γ V̂_φ(s') − V̂_φ(s).
Update actor: θ ← θ + α ∇_θ log π_θ(a|s) · δ.
Update critic: φ ← φ − β ∇_φ (V̂_φ(s) − [r + γV̂_φ(s')])², treating the target r + γV̂_φ(s') as a constant (no gradient flows through it).
Repeat.

Connection to REINFORCE

Actor-critic = REINFORCE with baseline + two changes: (1) replace Monte Carlo return Ĝ_t with bootstrapped TD target r + γV̂(s'), and (2) update after every step instead of waiting for the episode to end. The baseline is the critic. The advantage estimate is the TD error. Same math, different variance-bias tradeoff.

The Bias-Variance Spectrum

Different advantage estimators trade off bias and variance along a spectrum. Here's the full picture:

Estimator	Formula	Bias	Variance
Full Monte Carlo	Ĝ_t − V̂(s_t)	None	Very high
N-step return	∑_k=0^n−1 γ^k r_t+k + γ^n V̂(s_t+n) − V̂(s_t)	Medium	Medium
1-step TD	r_t + γV̂(s_t+1) − V̂(s_t)	High (if V̂ wrong)	Low
GAE(λ)	∑_l=0^∞ (γλ)^l δ_t+l	Tunable via λ	Tunable via λ

Worked Example — Why Bias Matters

Suppose V̂(s') is way off: the true V^π(s') = 100, but V̂(s') = 10. The TD error δ = r + 0.9 × 10 − V̂(s) massively underestimates the advantage. The actor gets a misleading signal: "this action wasn't very good" when actually it led to a highly valuable state. With Monte Carlo, the full trajectory return would eventually reveal the truth, but at the cost of enormous variance across trajectories.

GAE with λ = 0.95 blends both: it leans mostly on multi-step returns (low bias) while still bootstrapping enough to reduce variance. This is why λ = 0.95 is a common default in practice.
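The GAE(λ) sum from the table can be computed with the same backward recursion used for rewards-to-go: Â_t = δ_t + γλ Â_t+1. The sketch below assumes a single finite episode, with `values` holding one more entry than `rewards` so that the last entry is the value of the final state (0 if it is terminal).

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one episode.

    `values` holds V(s_0), ..., V(s_T): one more entry than `rewards`.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages


rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]  # made-up critic outputs; last state is terminal

print("lambda=0 (1-step TD):    ", gae(rewards, values, gamma=0.9, lam=0.0))
print("lambda=1 (Monte Carlo):  ", gae(rewards, values, gamma=0.9, lam=1.0))
print("lambda=0.95 (in between):", gae(rewards, values, gamma=0.9, lam=0.95))
```

Setting λ = 0 recovers the 1-step TD row of the table and λ = 1 recovers the Monte Carlo row, which is a useful unit test for any GAE implementation.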

Batch vs Online Actor-Critic

The algorithm above updates after every single transition. In practice, we collect a batch of transitions (e.g., 2048 steps across multiple parallel environments) and update once on the whole batch. This is called batch actor-critic or A2C (Advantage Actor-Critic). The parallelism provides diversity: transitions from 16 environments at once are less correlated than 16 consecutive transitions from one environment.

Worked Example — One Actor-Critic Update

State s = "near the goal," action a = "go right," reward r = +5, next state s' = "at the goal." V̂_φ(s) = 3, V̂_φ(s') = 10, γ = 0.9.

TD error: δ = 5 + 0.9 × 10 − 3 = 5 + 9 − 3 = +11.

Actor update: δ = +11 > 0, so we increase log π(right | near_goal). "Going right near the goal was much better than expected — do more of it."

Critic update: Target = r + γV̂(s') = 14. Current V̂(s) = 3. Error = 14 − 3 = 11. Push V̂(s) toward 14. "This state is more valuable than I thought."
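The same update in code, for a tabular critic and a two-action softmax actor. The step sizes and the initial logits are assumptions added for illustration; the example itself only fixes r, γ, V̂(s), and V̂(s').

```python
import numpy as np

gamma = 0.9
reward, v_s, v_next = 5.0, 3.0, 10.0

td_target = reward + gamma * v_next  # r + gamma * V(s')
td_error = td_target - v_s           # delta

# Critic: move V(s) a fraction of the way toward the target (assumed step size).
critic_lr = 0.1
v_s_updated = v_s + critic_lr * td_error

# Actor: tabular softmax over (left, right) with assumed logits of zero.
actor_lr = 0.01
logits = np.zeros(2)
pi = np.exp(logits) / np.exp(logits).sum()
score_right = np.array([0.0, 1.0]) - pi          # grad log pi(right | s)
logits_updated = logits + actor_lr * td_error * score_right

print("delta =", td_error)
print("V(s):", v_s, "->", v_s_updated)
print("logits:", logits, "->", logits_updated)
```

A positive δ raises the logit of the action that was taken and lowers the other, and pulls V̂(s) up toward the target.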

Chapter 05

Trust Regions & the Performance Difference Lemma

Policy gradients tell you which direction to move in parameter space. But how far should you step? Too small and you waste samples. Too large and you destroy a good policy with one bad update. This chapter develops the mathematical machinery that leads to TRPO and PPO.

Lemma: Expectation Under State Distribution

Lemma 10 — Full Proof

Claim: 𝔼_τ~π[ ∑_t=0^∞ γ^t f(s_t, a_t) ] = 1/(1−γ) · 𝔼_{s~d^π, a~π}[ f(s, a) ]

Proof:

Step 1: Let P_t^π(s) = P(s_t = s | π) be the state distribution at time t.

𝔼_τ[ ∑_t γ^t f(s_t,a_t) ] = ∑_t=0^∞ γ^t ∑_s P_t^π(s) ∑_a π(a|s) f(s,a)

Step 2: Recall d^π(s) = (1−γ) ∑_t γ^t P_t^π(s). So ∑_t γ^t P_t^π(s) = d^π(s) / (1−γ).

Step 3: Substitute: = ∑_s d^π(s)/(1−γ) ∑_a π(a|s) f(s,a) = 1/(1−γ) · 𝔼_{s~d^π, a~π}[f(s,a)] ■

The Performance Difference Lemma

This is one of the most powerful results in RL theory. It expresses the performance gap between any two policies in terms of the advantage of one under the other:

Performance Difference Lemma — Full Proof

Claim: V(π) − V(π') = 1/(1−γ) · 𝔼_{s~d^π, a~π}[ A^π'(s, a) ]

Proof:

Step 1: V(π) = 𝔼_τ~π[∑_t γ^t r(s_t,a_t)] = 1/(1−γ) 𝔼_{s~d^π, a~π}[r(s,a)] by Lemma 10.

Step 2: Add and subtract V^π'(s): we can write

r(s,a) = A^π'(s,a) + V^π'(s) − γ 𝔼_s'~p[V^π'(s')]

Why: By definition, A^π'(s,a) = Q^π'(s,a) − V^π'(s) = r(s,a) + γ𝔼[V^π'(s')] − V^π'(s). Rearranging gives r(s,a) = A^π'(s,a) + V^π'(s) − γ𝔼[V^π'(s')].

Step 3: Substitute into the expression for V(π). The V^π'(s_t) and γV^π'(s_t+1) terms telescope across time steps, leaving only V^π'(s₀) at the boundary.

Step 4: After telescoping: V(π) = V(π') + 1/(1−γ) 𝔼_{s~d^π, a~π}[A^π'(s,a)]

Rearranging gives the result. ■

Performance Difference Lemma V(π) − V(π') = 1/(1−γ) · 𝔼_{s~d^π, a~π}[ A^π'(s, a) ]
The performance gap equals the expected advantage of π over π', weighted by π's state distribution.
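Because the lemma is an exact identity, it can be verified numerically on any small MDP where values are computable in closed form. The sketch below uses a random MDP and two random policies (sizes and seed are arbitrary) and compares both sides.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)  # P[s, a] is a distribution over next states
R = rng.normal(size=(nS, nA))
rho0 = np.full(nS, 1.0 / nS)       # uniform initial state distribution


def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)


def evaluate(pi):
    """Solve the linear Bellman system for V^pi; also return Q^pi and P_pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    return V, Q, P_pi


pi_new, pi_old = random_policy(), random_policy()
V_new, _, P_new = evaluate(pi_new)
V_old, Q_old, _ = evaluate(pi_old)

# Discounted state distribution of the NEW policy and advantage of the OLD one.
d_new = (1 - gamma) * rho0 @ np.linalg.inv(np.eye(nS) - gamma * P_new)
A_old = Q_old - V_old[:, None]

lhs = rho0 @ V_new - rho0 @ V_old
rhs = (d_new[:, None] * pi_new * A_old).sum() / (1 - gamma)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point error. Swapping `d_new` for the old policy's state distribution gives the local approximation instead, which no longer matches exactly.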

The Local Approximation

The PDL involves d^π — the state distribution under the new policy π. But we only have data from the old policy π'. Define the local approximation:

Local Approximation L_π'(π) = V(π') + 1/(1−γ) · 𝔼_{s~d^π', a~π}[ A^π'(s, a) ]
Same as the PDL but using d^π' instead of d^π. Exact when π = π', biased otherwise.

α-Coupled Policies and the TRPO Bound

Definition

α-Coupled Policies

Two policies π and π' are α-coupled if at each state, with probability 1 − α they choose the same action, and with probability α they choose independently. Intuitively, α measures how different the two policies are. If α = 0, they're identical; if α = 1, they're independent.

Using coupled policies, we can bound how far off the local approximation L is from the true performance V:

Bounding the Advantage (Lemma)

Claim: If π and π' are α-coupled, then |𝔼_a~π[A^π'(s,a)]| ≤ 2α max_s,a|A^π'(s,a)|.

Proof: Draw the coupled pair (a, a') with a ~ π(·|s) and a' ~ π'(·|s). Since 𝔼_a'~π'[A^π'(s, a')] = 0 (by definition of V^π'), we can write 𝔼_a~π[A^π'(s, a)] = 𝔼[A^π'(s, a) − A^π'(s, a')]. With probability 1−α the two actions coincide and the difference is exactly 0. With probability α they differ, and the difference is at most 2ε in magnitude, where ε = max|A^π'|. So |𝔼[A]| ≤ (1−α) · 0 + α · 2ε = 2αε. ■

TRPO Guarantee (Theorem 13) V(π_new) ≥ L_{π_old}(π_new) − 4εγ / (1−γ)² · α²

where ε = max_s,a |A^π_old(s, a)|, α = max_s D_TV(π_new(s) || π_old(s))
The true performance is at least the local approximation minus a penalty proportional to α².

This says: if we constrain the policy update to be small (small α, i.e., the new policy is close to the old one in total variation distance), then the local approximation L stays close to the true performance V, and L minus the penalty is a guaranteed lower bound on V. This is the theoretical foundation of Trust Region Policy Optimization (TRPO).

Natural/Covariant Policy Gradient

The vanilla policy gradient ∇_θV(θ) depends on the parameterization of π_θ. If you reparameterize (e.g., change from log-space to linear-space), the gradient direction changes. The natural policy gradient fixes this by pre-multiplying by the inverse Fisher information matrix:

Natural Policy Gradient θ ← θ + α F⁻¹ ∇_θV(θ)

where F = 𝔼_{s~d^π, a~π}[ ∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)^T ]
F⁻¹ "un-warps" the parameter space so the gradient is invariant to reparameterization.
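To make the Fisher matrix concrete, here is a minimal NumPy sketch for a single-state softmax policy with logits θ. The function names, the example logits and advantages, and the small `damping` term are illustrative assumptions, not part of the text above. The damping is needed because the softmax Fisher matrix is singular: shifting all logits by a constant leaves the policy unchanged.

```python
import numpy as np


def softmax(theta):
    z = np.exp(theta - theta.max())   # subtract the max for numerical stability
    return z / z.sum()


def natural_gradient(theta, advantages, damping=1e-6):
    """Vanilla and natural gradient for a single-state softmax policy (illustrative)."""
    p = softmax(theta)
    # grad_theta log pi(a) = e_a - p, so F = E_a[(e_a - p)(e_a - p)^T] = diag(p) - p p^T.
    fisher = np.diag(p) - np.outer(p, p)
    # Policy gradient: sum_a p(a) * A(a) * (e_a - p).
    vanilla = p * advantages - p * np.dot(p, advantages)
    # Damped solve instead of an explicit inverse, since F is singular.
    natural = np.linalg.solve(fisher + damping * np.eye(len(p)), vanilla)
    return vanilla, natural, fisher


theta = np.array([2.0, 0.0, -1.0])
advantages = np.array([1.0, -1.0, 0.5])
vanilla, natural, fisher = natural_gradient(theta, advantages)
print("vanilla:", np.round(vanilla, 4))
print("natural:", np.round(natural, 4))
```

In this toy case the vanilla gradient is scaled by the action probabilities, while the natural gradient is approximately the mean-centered advantage vector, independent of how peaked the current policy is.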

Trust Region = Constrained Natural Gradient

TRPO approximately solves: max_θ L_{θ_old}(θ) subject to 𝔼_s[D_KL(π_θold(·|s) || π_θ(·|s))] ≤ δ. With a linear approximation of the objective and a quadratic approximation of the constraint, the solution is a natural gradient step whose step size is set by the KL constraint. The Fisher matrix defines the "geometry" of policy space: large steps in high-curvature directions are penalized.

Worked Example — The TRPO Bound in Numbers

Suppose ε = max|A^π| = 10 (largest advantage magnitude), γ = 0.99, and we constrain α = D_TV(π_new, π_old) ≤ 0.01 (very small policy change).

Penalty term: 4 × 10 × 0.99 / (1 − 0.99)² × 0.01² = 39.6 / 0.0001 × 0.0001 = 39.6

Interpretation: The true performance is at least L(π_new) − 39.6. If the local approximation predicts an improvement of +5, the guaranteed change is only ≥ −34.6, so with γ = 0.99 even this small step carries no guaranteed improvement. The penalty shrinks quadratically: α = 0.001 gives a penalty of 0.396, and a predicted +5 is then guaranteed to be at least +4.6.

If we let α = 0.1 (large change): penalty = 39.6 / 0.0001 × 0.01 = 3960, a hundred times larger. The bound is useless! Large policy changes destroy the guarantee. This is why trust regions must be small.
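The penalty term is easy to evaluate directly, which helps when checking how quickly the guarantee degrades with α. A minimal sketch, where `trpo_penalty` is a hypothetical helper name and the inputs are the values from this example:

```python
def trpo_penalty(eps: float, gamma: float, alpha: float) -> float:
    """Penalty term 4 * eps * gamma / (1 - gamma)^2 * alpha^2 from the TRPO bound."""
    return 4.0 * eps * gamma / (1.0 - gamma) ** 2 * alpha ** 2


# eps = 10 and gamma = 0.99 as in the worked example; alpha is the TV radius.
for alpha in (0.001, 0.01, 0.1):
    print(f"alpha={alpha}: penalty={trpo_penalty(10.0, 0.99, alpha):.3f}")
```

Because the penalty scales with α² / (1−γ)², each tenfold increase in α multiplies it by 100.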

TRPO Is Hard to Implement

TRPO requires: (1) computing Fisher-vector products Fv without forming F explicitly, (2) solving a linear system via conjugate gradient (10-20 iterations), (3) a line search to find the largest step satisfying the KL constraint, and (4) handling the case where the constraint is violated. That's a lot of machinery compared to "just clip the ratio" (PPO). TRPO comes with stronger worst-case guarantees, but PPO typically reaches similar empirical performance with far less code.

Worked Example — Why Trust Regions Matter

Policy at iteration k: π(left | s) = 0.8, π(right | s) = 0.2. The gradient says "increase π(right)." Without trust region: learning rate 0.5 gives π(right) = 0.7. Huge change! But the gradient was computed assuming the old policy — with the new policy we'd visit different states, so the gradient estimate is stale.

With trust region (KL ≤ 0.01): π(right) can increase to at most ~0.25. Small step, but the gradient stays valid. Next iteration computes a fresh gradient. Slow but stable convergence.
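The size of the allowed step can be checked numerically. The sketch below assumes the constraint is measured as KL(old || new) for the two-action policy in this example, and finds the largest π(right) inside the trust region by bisection. The helper names are illustrative.

```python
import math


def kl_bernoulli(p_old: float, p_new: float) -> float:
    """KL(old || new) for a two-action policy, where p is the probability of 'right'."""
    return (p_old * math.log(p_old / p_new)
            + (1.0 - p_old) * math.log((1.0 - p_old) / (1.0 - p_new)))


def max_prob_within_kl(p_old: float, delta: float, iters: int = 60) -> float:
    """Largest p_new >= p_old with KL(old || new) <= delta, found by bisection."""
    lo, hi = p_old, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p_old, mid) <= delta:
            lo = mid   # still inside the trust region, move right
        else:
            hi = mid   # outside the trust region, move left
    return lo


p_right_max = max_prob_within_kl(p_old=0.2, delta=0.01)
print(f"largest pi(right) inside the trust region: {p_right_max:.3f}")
print(f"KL for the unconstrained jump to 0.7: {kl_bernoulli(0.2, 0.7):.3f}")
```

The unconstrained jump to π(right) = 0.7 has a KL of roughly 0.5, about fifty times the budget.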

🔨 Derivation: Derive the Simulation Lemma from the Performance Difference Lemma

The Performance Difference Lemma states: V(π) − V(π') = 1/(1−γ) · 𝔼_{s~d^π, a~π}[A^π'(s,a)].

The Simulation Lemma bounds the performance gap using the maximum advantage: |V(π) − V(π')| ≤ 2αε/(1−γ) · [1 + 2γα/(1−γ)], where α = max_s D_TV(π(s) || π'(s)) and ε = max_s,a|A^π'(s,a)|.

Your task: Starting from the PDL, derive this bound. You'll need: (1) bound the advantage expectation using the TV distance, and (2) bound the state distribution shift.

Hint 1: For a fixed state s: |𝔼_a~π[A^π'(s,a)]| = |∑_a(π(a|s) − π'(a|s))A^π'(s,a)| ≤ 2 D_TV(π(s)||π'(s)) · max_a|A^π'(s,a)|. The equality uses ∑_a π'(a|s)A^π'(s,a) = 0. The factor of 2 comes from the definition of TV: D_TV = (1/2)∑|p − q|, so ∑|p−q| = 2D_TV.

Hint 2: The PDL uses d^π (the new policy's state distribution), which we can't estimate from the old policy's data. The key insight: ||d^π − d^π'||₁ ≤ 2γ/(1−γ) · max_s D_TV(π(s)||π'(s)). This bounds how much the state distributions diverge based on per-step policy differences.

Hint 3: Substitute into the PDL: the advantage at each state is bounded by 2αε (Hint 1), and the state distribution shift contributes an extra factor of 2γα/(1−γ) (Hint 2). Multiplying through the 1/(1−γ) prefactor gives the final bound.

Step 1: From PDL: V(π) − V(π') = 1/(1−γ) 𝔼_{s~d^π, a~π}[A^π'(s,a)]

Step 2: Bound advantage at each state: |𝔼_a~π[A^π'(s,a)]| ≤ 2αε where α = max_s D_TV(π(s)||π'(s)) and ε = max|A^π'|.

Step 3: Replace d^π with d^π' plus error: |𝔼_s~d^π[f(s)] − 𝔼_s~d^π'[f(s)]| ≤ ||d^π − d^π'||₁ · max|f| ≤ 2γα/(1−γ) · 2αε

Step 4: Combining: |V(π) − V(π')| ≤ 1/(1−γ) [2αε + 2γα/(1−γ) · 2αε] = 2αε/(1−γ) [1 + 2γα/(1−γ)]

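Step 5: The second term is the gap between the true performance and the local approximation: |V(π) − L_π'(π)| ≤ 1/(1−γ) · 2γα/(1−γ) · 2αε = 4εγα²/(1−γ)². This is the penalty in the TRPO bound from Theorem 13. The first term, 2αε/(1−γ), bounds |L_π'(π) − V(π')|. ■

The key insight: The gap between the local approximation and the true performance is quadratic in α (the policy divergence), and this is why trust regions work. If you halve the allowed policy change, this error bound drops by 4x. The local approximation becomes increasingly accurate as the trust region shrinks.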

Chapter 06

PPO: Proximal Policy Optimization

TRPO works brilliantly but is painful to implement. It requires computing Fisher-vector products, solving a constrained optimization problem via conjugate gradient, and doing a line search. PPO achieves nearly the same effect with a simple clipped objective that you can optimize with standard first-order optimizers such as SGD or Adam.

The Probability Ratio

Define the probability ratio between the new and old policy:

Probability Ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)
If r_t = 1, the new policy is identical to the old one at this (s, a) pair. r_t > 1 means the new policy is more likely to take this action; r_t < 1 means less likely.

The surrogate objective from TRPO can be written as:

Surrogate Objective L^CPI(θ) = 𝔼_t[ r_t(θ) · Â_t ]
CPI = Conservative Policy Iteration. Maximize this and you improve the policy.

The Clipped Surrogate

Without a constraint, maximizing L^CPI can lead to r_t(θ) exploding — the new policy becomes very different from the old one. PPO's solution: clip the ratio to stay in [1−ε, 1+ε].

PPO Clipped Objective L^CLIP(θ) = 𝔼_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]
ε = 0.2 typically. The min takes the more pessimistic estimate.

Why the Clip Works — Two Cases

Case 1: Â_t > 0 (the action was good). We want to increase r_t (make this action more likely). But the clip prevents r_t from exceeding 1 + ε. So the objective becomes min(r_t Â, (1+ε) Â). Once r_t ≥ 1 + ε, the gradient is zero — no incentive to push further.

Case 2: Â_t < 0 (the action was bad). We want to decrease r_t. But the clip prevents r_t from dropping below 1 − ε. So the objective becomes min(r_t Â, (1−ε) Â). Once r_t ≤ 1 − ε, the gradient is zero.

In both cases, the clip creates a "trust region" in probability ratio space. The policy can change, but not too much.
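The two cases can be checked with a few lines of NumPy. This is a minimal sketch of the per-sample clipped surrogate only (no policy network or gradient step). The function name and the sample ratios are illustrative.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)   # pessimistic of the two estimates


ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.8])
print("A = +3:", ppo_clip_objective(ratios, +3.0))
print("A = -3:", ppo_clip_objective(ratios, -3.0))
```

Note the asymmetry: with a negative advantage and r = 1.8, the min selects the unclipped value, so the objective still pushes the ratio back down. The clip only removes the incentive to move further in the direction the advantage favors.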

Interactive figure: PPO Clipped Objective. The clipped objective (gold) is plotted against the unclipped one (blue dashed) as a function of the ratio. When the advantage is positive, the clipped objective is flat above 1+ε; when it is negative, it is flat below 1−ε. Default ε = 0.20.

Generalized Advantage Estimation (GAE)

How do we compute Â_t? We could use the one-step TD error δ_t = r_t + γV(s_t+1) − V(s_t) (low variance, high bias) or the full Monte Carlo return G_t − V(s_t) (no bias, high variance). GAE interpolates:

Generalized Advantage Estimation Â_t^GAE(γ,λ) = ∑_l=0^∞ (γλ)^l δ_t+l

where δ_t = r_t + γ V̂(s_t+1) − V̂(s_t)
λ = 0 gives one-step TD (low variance, high bias). λ = 1 gives Monte Carlo (no bias, high variance). λ = 0.95 is a popular middle ground.

Understanding GAE — Expanding the Sum

Let's expand Â^GAE for a few terms to see what it does:

Â_t = δ_t + (γλ)δ_t+1 + (γλ)²δ_t+2 + ...

Substituting δ_t = r_t + γV̂(s_t+1) − V̂(s_t):

When λ = 0: Â_t = δ_t = r_t + γV̂(s_t+1) − V̂(s_t). Just the one-step TD error.

When λ = 1: Â_t = ∑_l γ^l δ_t+l. After telescoping the V̂ terms: Â_t = ∑_l=0^∞ γ^l r_t+l − V̂(s_t) = Ĝ_t − V̂(s_t). The full Monte Carlo advantage!

So λ smoothly interpolates between 1-step TD and full Monte Carlo. The magic of GAE is that it does this in an exponentially-weighted way — nearby TD errors get more weight than distant ones.

Worked Example — Computing GAE

4-step trajectory: rewards [1, 2, 3, 0], V̂ values [5, 4, 6, 2, 0] (including the terminal state). γ = 0.99, λ = 0.95.

TD errors:

δ₀ = 1 + 0.99 × 4 − 5 = 1 + 3.96 − 5 = −0.04

δ₁ = 2 + 0.99 × 6 − 4 = 2 + 5.94 − 4 = 3.94

δ₂ = 3 + 0.99 × 2 − 6 = 3 + 1.98 − 6 = −1.02

δ₃ = 0 + 0.99 × 0 − 2 = −2.0

GAE (computed backwards):

Â₃ = δ₃ = −2.0

Â₂ = δ₂ + 0.99 × 0.95 × Â₃ = −1.02 + 0.9405 × (−2.0) = −2.90

Â₁ = δ₁ + 0.9405 × Â₂ = 3.94 + 0.9405 × (−2.90) = 1.21

Â₀ = δ₀ + 0.9405 × Â₁ = −0.04 + 0.9405 × 1.21 = 1.10

Note the backward computation — this is how GAE is implemented in practice (loop from t = T−1 down to 0).
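The backward recursion from this example can be sketched in a few lines. It assumes a single trajectory that ends in a terminal state, so there is no masking for episode boundaries inside the batch. The function name `compute_gae` is illustrative.

```python
import numpy as np


def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE recursion; `values` has one extra entry for the final state."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages


# Same trajectory as the worked example.
adv = compute_gae([1, 2, 3, 0], [5, 4, 6, 2, 0])
print(np.round(adv, 2))
```

Setting `lam=0.0` returns the raw TD errors, and `lam=1.0` returns the discounted return minus the value estimate.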

PPO Algorithm

Collect a batch of trajectories using π_{θ_old}.
Compute advantages Â_t using GAE with the current critic V̂_φ.
For K epochs (typically K = 3–10), on random minibatches of the collected data:
1. Compute r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t).
2. Compute L^CLIP and do gradient ascent on θ.
3. Update critic: φ ← φ − β ∇_φ ∑ (V̂_φ(s_t) − Ĝ_t)².
Set θ_old ← θ. Go to step 1.

Why PPO Dominates

PPO is the default algorithm for much of modern RL because: (1) it's simple: the core update fits in roughly 50 lines of code, (2) it's stable: the clip prevents catastrophic updates, (3) it's more sample-efficient than vanilla policy gradient: you reuse each batch for K epochs, not just one gradient step, (4) it works for both continuous and discrete actions. OpenAI used PPO to train the policy for ChatGPT's RLHF stage.

Worked Example — PPO Hyperparameters in Practice

Setting: Training a locomotion policy for a simulated humanoid robot.

Batch size: 2048 steps across 8 parallel environments (256 steps each). Gives 2048 transitions per policy update.

ε = 0.2: The objective stops rewarding a change once an action's probability ratio has moved more than 20% away from 1, which keeps each update close to the old policy. Experiment: ε = 0.1 (too conservative, slow learning). ε = 0.5 (too aggressive, policy oscillates wildly).

K = 4 epochs: Each batch of 2048 transitions is used 4 times. More epochs = more sample efficiency, but the data becomes stale (the gradient is for the old policy, not the current one). K = 10 with ε = 0.2 usually works fine because the clip prevents too-large updates even with stale data.

λ = 0.95: GAE parameter. λ = 0.99 gave slightly better final performance but took 2x longer. λ = 0.8 converged fast but to a worse policy.

Worked Example — PPO Clip in Action

State s, action a. π_{θ_old}(a|s) = 0.4. After some gradient steps, π_θ(a|s) = 0.72. Ratio r = 0.72 / 0.4 = 1.8. With ε = 0.2, clipped ratio = clip(1.8, 0.8, 1.2) = 1.2.

If Â = +3: unclipped objective = 1.8 × 3 = 5.4. Clipped = 1.2 × 3 = 3.6. PPO uses min(5.4, 3.6) = 3.6. The gradient is zero (we've hit the clip), so θ stops being pushed to increase π(a|s) further.

The policy wanted to make this action much more likely (1.8x), but PPO said "slow down, 1.2x is enough for now." Next batch of data will be collected under a policy closer to the current one, and we can push further if the advantage is still positive.

Chapter 07

SAC: Soft Actor-Critic

PPO is on-policy: it collects data, updates the policy, then throws the data away. This is wasteful. Soft Actor-Critic (SAC) is off-policy: it stores all experience in a replay buffer and reuses it many times. The key insight is adding an entropy bonus to the reward, encouraging exploration and enabling off-policy stability.

Maximum Entropy RL

Standard RL maximizes expected reward. Maximum entropy RL adds a bonus for acting randomly:

Maximum Entropy Objective max_π 𝔼_τ~π [ ∑_t=0^∞ γ^t ( r(s_t, a_t) + α H(π(·|s_t)) ) ]

where H(π(·|s)) = −𝔼_a~π[log π(a|s)] is the entropy
α controls the trade-off: high α = more exploration, low α = more exploitation.

Why add entropy? Three reasons: (1) exploration — the agent tries diverse actions instead of collapsing to a single deterministic choice, (2) robustness — stochastic policies are more robust to perturbations, and (3) off-policy compatibility — the entropy regularization smooths the optimization landscape, making off-policy learning more stable.

Worked Example — Entropy as Exploration Bonus

Robot arm with 7 continuous action dimensions. Without entropy: the policy quickly collapses to always applying the same torques. If those happen to reach one target, great — but the robot never discovers alternative strategies.

With entropy bonus (α = 0.2): the policy maintains a standard deviation of ~0.3 in each dimension. The robot tries slightly different torques each time, occasionally discovering better solutions. As training progresses and α decreases (auto-tuned), the policy becomes more precise while retaining enough randomness to handle perturbations.

Entropy value: For a Gaussian with σ = 0.3 in 7 dims: H = 7 × (0.5 log(2πe × 0.09)) ≈ 7 × 0.215 = 1.50 nats. The reward r + αH = r + 0.2 × 1.50 = r + 0.30. This is a small entropy bonus compared to the task reward, but enough to prevent premature convergence. Note that differential entropy turns negative once σ < 1/√(2πe) ≈ 0.24, at which point the term acts as a penalty.
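The entropy of a diagonal Gaussian has a closed form, so the number is quick to verify. A minimal sketch, assuming the same standard deviation in every action dimension. The function name is illustrative.

```python
import math


def diag_gaussian_entropy(sigma: float, dims: int) -> float:
    """Differential entropy (nats) of a Gaussian with std `sigma` in each of `dims` dimensions."""
    return dims * 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


H = diag_gaussian_entropy(sigma=0.3, dims=7)
print(f"entropy = {H:.3f} nats, entropy term at alpha = 0.2: {0.2 * H:.3f}")
```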

Soft Bellman Equation

Soft Value Functions V^π(s) = 𝔼_a~π[ Q^π(s, a) − α log π(a|s) ]

Q^π(s, a) = r(s, a) + γ 𝔼_s'[ V^π(s') ]
The "soft" value includes the entropy bonus. Higher entropy → higher V.

Derivation — Soft Bellman Equation

In standard RL, V^π(s) = 𝔼_a~π[Q^π(s,a)]. In maximum entropy RL, the value includes the entropy bonus:

V^π(s) = 𝔼_a~π[Q^π(s,a) − α log π(a|s)]

The soft Q-function becomes: Q^π(s,a) = r(s,a) + γ 𝔼_s'[V^π(s')]

The optimal soft policy has a closed form: π*(a|s) ∝ exp(Q*(s,a)/α). This is a Boltzmann (softmax) distribution over actions, weighted by Q-values. High α = uniform (maximum entropy). Low α = nearly deterministic (exploit the best action).

Substituting back: V*(s) = α log ∑_a exp(Q*(s,a)/α) — the soft-max (log-sum-exp) of Q-values, which is a smooth approximation to the hard max.
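For a discrete action set, the Boltzmann policy and the log-sum-exp value can be computed directly. The sketch below uses made-up Q-values to show the two limits of α. The function name is illustrative.

```python
import numpy as np


def soft_policy_and_value(q_values, alpha):
    """Return pi(a) proportional to exp(Q(a)/alpha) and V = alpha * log sum_a exp(Q(a)/alpha)."""
    q = np.asarray(q_values, dtype=float)
    z = q / alpha
    z_max = z.max()                                   # subtract the max for numerical stability
    log_sum_exp = z_max + np.log(np.exp(z - z_max).sum())
    policy = np.exp(z - log_sum_exp)
    return policy, alpha * log_sum_exp


q = [1.0, 2.0, 3.0]   # hypothetical Q-values for three actions
for alpha in (10.0, 1.0, 0.01):
    pi, v = soft_policy_and_value(q, alpha)
    print(f"alpha={alpha}: pi={np.round(pi, 3)}, V={v:.3f}")
```

As α shrinks, the policy concentrates on the best action and V approaches max_a Q. As α grows, the policy flattens toward uniform and V rises above the hard max because of the entropy bonus.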

SAC Components

SAC maintains three networks:

Network	Parameters	What It Does
Actor π_θ	θ	Outputs action distribution (Gaussian mean + std)
Critic Q_φ	φ₁, φ₂	Two Q-networks (clipped double-Q trick)
Target Q̄	φ̄	Exponential moving average of φ for stability

Temperature Auto-Tuning

Instead of manually setting α, SAC learns it by solving:

Temperature Objective min_α 𝔼_a~π[ −α log π(a|s) − α H̄ ]
H̄ is a target entropy (typically −dim(A) for continuous actions). If actual entropy > H̄, decrease α; if < H̄, increase α.

Connection to Deterministic Policy Gradient

When α → 0, SAC reduces to the Deterministic Policy Gradient (DPG). In DPG, the actor outputs a single action μ_θ(s) (no stochasticity) and the gradient is:

Deterministic Policy Gradient ∇_θV = 𝔼_s[ ∇_aQ(s, a)|_{a=μ_θ(s)} · ∇_θμ_θ(s) ]
Backpropagate through the Q-network to get the policy gradient. No log-probabilities needed.

The Clipped Double-Q Trick

SAC uses two Q-networks Q_φ1 and Q_φ2 and takes the minimum of their predictions for the target. Why? To combat overestimation (the same problem as in DQN). If one Q-network overestimates, the other likely doesn't, so the min provides a more conservative estimate. This trick was introduced in TD3 (Twin Delayed DDPG) and is now standard in off-policy methods.

Worked Example — Temperature Auto-Tuning

Target entropy: H̄ = −dim(A) = −7 (for a 7-DOF robot arm). Current policy entropy: H(π) = −5. Since 𝔼[−log π] = H(π), the loss is J(α) = α(H(π) − H̄) = α(−5 − (−7)) = 2α. Its slope in α is +2 > 0, so gradient descent decreases α, reducing the entropy bonus. The policy becomes more deterministic.

If H(π) = −10 (less entropy than target): J(α) = α(−10 − (−7)) = −3α. The slope is −3 < 0, so gradient descent increases α, adding more entropy bonus to encourage exploration.
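The direction of the temperature update follows from the sign of the gradient of J(α) = α(H(π) − H̄). A minimal sketch of one gradient-descent step, assuming α is updated directly and clipped at zero (many implementations optimize log α instead). The function name and learning rate are illustrative.

```python
def temperature_step(alpha: float, policy_entropy: float, target_entropy: float,
                     lr: float = 0.01) -> float:
    """One gradient-descent step on J(alpha) = alpha * (H(pi) - H_target)."""
    grad = policy_entropy - target_entropy   # dJ/dalpha
    return max(alpha - lr * grad, 0.0)       # keep alpha non-negative


# More entropy than the target: alpha shrinks.
print(temperature_step(0.2, policy_entropy=-5.0, target_entropy=-7.0))
# Less entropy than the target: alpha grows.
print(temperature_step(0.2, policy_entropy=-10.0, target_entropy=-7.0))
```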

Worked Example — SAC vs PPO

Robot arm reaching task. 7 continuous action dimensions (joint torques). PPO: collect 2048 steps, update policy, discard data. After 1M steps, it has done ~500 policy updates using 1M transitions total. SAC: collect 1 step, store in buffer (1M capacity), sample batch of 256, update all networks. After 1M steps, it has done 1M gradient updates, each on a minibatch drawn from the whole buffer. That is 256M sampled transitions from 1M collected, so each transition is reused about 256 times on average, versus K (typically 3 to 10) times in PPO.

The trade-off: PPO is more stable (on-policy data is always "fresh"), SAC is more sample-efficient (reuses everything).

SAC Algorithm

Initialize actor π_θ, two critics Q_φ1, Q_φ2, target critics Q̄_φ̄1, Q̄_φ̄2, replay buffer D, temperature α.
For each step: sample a ~ π_θ(s), execute, observe (s, a, r, s'). Store in D.
Sample minibatch from D. Compute target: y = r + γ (min_i=1,2 Q̄_φ̄i(s', ā) − α log π_θ(ā|s')) where ā ~ π_θ(s').
Update critics: minimize (Q_φi(s,a) − y)² for i = 1, 2.
Update actor: minimize 𝔼_a~π[α log π_θ(a|s) − min_i Q_φi(s, a)].
Update temperature: minimize α(−log π_θ(a|s) − H̄).
Soft update targets: φ̄ ← τφ + (1−τ)φ̄ with τ = 0.005.
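The critic target and the soft target update from the steps above can be sketched on plain arrays. This is not a full SAC implementation: the network outputs are passed in as precomputed arrays, the example numbers are made up, and terminal-state masking is omitted to match the target as written above.

```python
import numpy as np


def sac_critic_target(reward, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')) for a batch."""
    min_q = np.minimum(q1_next, q2_next)           # clipped double-Q
    soft_value = min_q - alpha * log_prob_next     # entropy-regularized next-state value
    return reward + gamma * soft_value


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return tau * online_params + (1.0 - tau) * target_params


# Hypothetical batch of two transitions.
y = sac_critic_target(
    reward=np.array([1.0, 0.0]),
    q1_next=np.array([10.0, 4.0]),
    q2_next=np.array([9.0, 5.0]),
    log_prob_next=np.array([-1.0, -2.0]),
    alpha=0.2,
)
print(y)
```

With τ = 0.005 the target parameters move 0.5% of the way toward the online parameters per step, which is why the target changes slowly.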

Chapter 08

Tabular MDP Methods

Before neural networks entered the picture, RL was solved with tables. These methods have convergence guarantees that deep RL can only dream of. Understanding them reveals why algorithms work (or fail) when we scale up.

Definition

Tabular Representation

In tabular RL, V(s) and Q(s, a) are stored as literal tables (dictionaries/arrays) with one entry per state or state-action pair. For a grid world with 100 states and 4 actions, Q is a 100 × 4 table = 400 numbers. This is feasible when |S| and |A| are small (thousands), but impossible for images (|S| ~ 10³⁰) or continuous states (|S| = ∞). The transition to function approximation (Chapter 9) is where tabular guarantees break down.

Policy Evaluation

Given a fixed policy π, compute V^π(s) for all states by iteratively applying the Bellman equation:

Iterative Policy Evaluation

Initialize V(s) = 0 for all s.
For each iteration: for every state s:
V_new(s) = ∑_a π(a|s) [ r(s,a) + γ ∑_s' p(s'|s,a) V_old(s') ]
Repeat until max_s |V_new(s) − V_old(s)| < ε.

Policy Iteration

Start with arbitrary policy π₀.
Evaluate: Compute V^π_k by iterating the Bellman equation until convergence.
Improve: π_k+1(s) = argmax_a [ r(s,a) + γ ∑_s' p(s'|s,a) V^π_k(s') ] for all s.
If π_k+1 = π_k, stop. Otherwise, go to step 2.
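The evaluate/improve loop can be sketched for a small tabular MDP. The array layout (`P[s, a, s']` for transitions, `R[s, a]` for rewards) and the two-state example MDP are assumptions chosen for illustration.

```python
import numpy as np


def policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Iterate V <- R_pi + gamma * P_pi V for a deterministic policy (array of actions)."""
    idx = np.arange(R.shape[0])
    V = np.zeros(R.shape[0])
    while True:
        V_new = R[idx, policy] + gamma * P[idx, policy] @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


def policy_iteration(P, R, gamma):
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(R.shape[0], dtype=int)
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        Q = R + gamma * P @ V                  # Q[s, a]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy


# Hypothetical MDP: action 0 = stay, action 1 = switch state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0],    # state 0: stay pays 0, switch pays 1
              [2.0, 0.0]])   # state 1: stay pays 2, switch pays 0
policy, V = policy_iteration(P, R, gamma=0.9)
print(policy, np.round(V, 3))
```

Here the loop stops after a single improvement step: switch to state 1, then stay there.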

Value Iteration

Policy iteration has an expensive inner loop (full policy evaluation). Value iteration combines evaluation and improvement into a single step:

Value Iteration

Initialize V(s) = 0 for all s.
For each iteration: for every state s:
V_new(s) = max_a [ r(s,a) + γ ∑_s' p(s'|s,a) V_old(s') ]
Repeat until convergence.
Extract policy: π*(s) = argmax_a [ r(s,a) + γ ∑_s' p(s'|s,a) V*(s') ].

Bellman Operators as Contractions

Proof — Contraction Property

Claim: ||T^πV − T^πV'||_∞ ≤ γ ||V − V'||_∞

where (T^πV)(s) = ∑_a π(a|s) [r(s,a) + γ ∑_s' p(s'|s,a) V(s')].

Proof: Fix a state s.

Step 1: |(T^πV)(s) − (T^πV')(s)| = |∑_a π(a|s) γ ∑_s' p(s'|s,a) (V(s') − V'(s'))|

Why: the r(s,a) terms cancel.

Step 2: ≤ ∑_a π(a|s) γ ∑_s' p(s'|s,a) |V(s') − V'(s')| (triangle inequality)

Step 3: ≤ ∑_a π(a|s) γ ∑_s' p(s'|s,a) ||V − V'||_∞

Why: |V(s') − V'(s')| ≤ max_s |V(s) − V'(s)| = ||V − V'||_∞ for any s'.

Step 4: = γ ||V − V'||_∞ · ∑_a π(a|s) ∑_s' p(s'|s,a) = γ ||V − V'||_∞ · 1

Why: probabilities sum to 1.

Since this holds for every s, it holds for the max over s. ■

Why Contraction Guarantees Convergence

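A contraction mapping has a unique fixed point, and repeated application converges to it (Banach fixed-point theorem). Since γ < 1, the Bellman operator T^π is a contraction with factor γ, so iterative policy evaluation converges to V^π. The same argument applies to the Bellman optimality operator (with max_a in place of the average over π), because |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|. After k iterations of value iteration, the error is at most γ^k ||V₀ − V*||_∞, which → 0 exponentially fast. This is why value iteration is guaranteed to converge, and why γ < 1 is essential.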

Worked Example — Contraction in Action

True V* = [10, 8]. Two initial guesses: V_a = [0, 0] (pessimistic) and V_b = [20, 20] (optimistic). γ = 0.5.

Distance: ||V_a − V_b|| = max(|0−20|, |0−20|) = 20.

After 1 Bellman backup: ||TV_a − TV_b|| ≤ 0.5 × 20 = 10. The two estimates are at most 10 apart.

After 5 backups: ≤ 0.5⁵ × 20 = 0.625. Nearly identical!

After 10 backups: ≤ 0.5¹⁰ × 20 = 0.0195. Converged for all practical purposes.

No matter how wrong your initialization is, the contraction property forces convergence. With γ = 0.99, convergence is much slower (0.99¹⁰⁰ = 0.366, need ~460 iterations to reduce error by 100x).
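The shrinking distance can be observed on any MDP. The sketch below builds a small random MDP (purely hypothetical, seeded for reproducibility), applies the Bellman optimality backup to the pessimistic and optimistic initializations, and records the max-norm distance after each backup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.5

# Hypothetical random MDP used only to illustrate the contraction.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)           # normalize rows into probabilities
R = rng.random((n_states, n_actions))


def bellman_backup(V):
    """One application of the Bellman optimality operator."""
    return (R + gamma * P @ V).max(axis=1)


V_a, V_b = np.zeros(n_states), np.full(n_states, 20.0)
distances = [np.max(np.abs(V_a - V_b))]
for _ in range(10):
    V_a, V_b = bellman_backup(V_a), bellman_backup(V_b)
    distances.append(np.max(np.abs(V_a - V_b)))

print(np.round(distances, 4))
```

Each entry is at most γ times the previous one, regardless of the rewards or transitions that were sampled.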

Policy Iteration vs Value Iteration

Property	Policy Iteration	Value Iteration
Inner loop	Full policy evaluation (solve linear system or iterate to convergence)	None (single max operation per state)
Outer iterations	Few (often 3-10 for small MDPs)	Many (proportional to 1/(1-γ))
Total computation	Often faster for small MDPs	Often faster for large MDPs
Memory	Stores both V and π	Stores only V
Convergence	Finite steps to exact π*	Converges to V* in limit

Worked Example — Value Iteration

Two states: A and B. From A: action "go" gives r = 2, transitions to B. From B: action "stay" gives r = 1, stays in B. γ = 0.9.

Iteration 0: V(A) = 0, V(B) = 0.

Iteration 1: V(A) = 2 + 0.9 × 0 = 2. V(B) = 1 + 0.9 × 0 = 1.

Iteration 2: V(A) = 2 + 0.9 × 1 = 2.9. V(B) = 1 + 0.9 × 1 = 1.9.

Iteration 3: V(A) = 2 + 0.9 × 1.9 = 3.71. V(B) = 1 + 0.9 × 1.9 = 2.71.

Converges to: V*(B) = 1/(1−0.9) = 10, V*(A) = 2 + 0.9 × 10 = 11.
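The iterations above can be reproduced with a short value iteration loop. The array layout (`P[s, a, s']`, `R[s, a]`) and the stopping tolerance are assumptions. States A and B are indexed 0 and 1, each with a single action.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10):
    """Repeat V <- max_a [R + gamma * P V] until the update is smaller than `tol`."""
    V = np.zeros(R.shape[0])
    history = [V.copy()]
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)
        history.append(V_new)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, history
        V = V_new


P = np.zeros((2, 1, 2))
P[0, 0, 1] = 1.0            # A --go--> B
P[1, 0, 1] = 1.0            # B --stay--> B
R = np.array([[2.0], [1.0]])

V_star, history = value_iteration(P, R, gamma=0.9)
for k in range(4):
    print(f"iteration {k}: V(A)={history[k][0]:.2f}, V(B)={history[k][1]:.2f}")
print("converged:", np.round(V_star, 4))
```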

Chapter 09

DQN & Value Function Approximation

Tabular methods work when you can enumerate every state. But Atari has ~10³⁰ possible screen images. You can't store a table that large. Value function approximation (VFA) replaces the table with a neural network V̂_φ(s) or Q̂_φ(s, a) that generalizes across similar states.

MC vs TD for VFA

Monte Carlo VFA: collect a trajectory, compute the return G_t, and regress V̂_φ(s_t) toward G_t. Unbiased but high variance — one lucky trajectory can send the network in the wrong direction.

TD learning for VFA: use the bootstrapped target r + γV̂_φ(s') instead of G_t. Lower variance (single-step estimate vs. entire trajectory) but biased (the target depends on our own potentially wrong estimates).

Worked Example — MC vs TD for Value Approximation

Mountain car environment. State = (position, velocity). We want V̂_φ(s). Episode takes 150 steps with reward −1 per step (total = −150).

Monte Carlo: For every state visited in the episode, the target is the total return from that point. State at t = 50: target = −100 (100 remaining steps). State at t = 149: target = −1. All targets are noisy because different episodes take different numbers of steps (120 or 200 depending on initial conditions).

TD: For state at t = 50: target = −1 + 0.99 × V̂(s₅₁). Even if V̂ is slightly wrong, the error from one step is small. But if V̂ is completely wrong early in training, TD propagates garbage: "my estimate of s₅₁ is wrong, so my estimate of s₅₀ is wrong, so my estimate of s₄₉ is wrong..."

In practice, TD often wins because its bias shrinks as V̂ improves, while the variance of MC targets does not shrink with training.

Why Generalization Matters

The whole point of using a neural network instead of a table is generalization: if the agent learns that being near the goal is good, it should know this for all nearby states, even ones it hasn't visited. In a table, each entry is independent — learning Q(s₇, up) = 10 tells you nothing about Q(s₈, up). With a neural network, similar states get similar Q-values automatically.

The Double-Edged Sword of Generalization

Generalization is both the strength and weakness of deep RL. Strength: you can handle continuous state spaces with billions of possible states (like pixel observations). Weakness: an update to Q(s₇) also changes Q(s₈), Q(s₉), etc. These unintended side effects can destabilize learning. This is why the deadly triad exists — it's specifically about how errors propagate through shared representations.

The Deadly Triad

Combining three ingredients can cause divergence:

(1) Function approximation (neural nets instead of tables) — introduces generalization errors that propagate.

(2) Bootstrapping (using V̂(s') in the target) — creates a moving target that depends on the parameters we're updating.

(3) Off-policy learning (data from a different policy) — the data distribution doesn't match the policy we're evaluating.

Any two can usually be made to work. All three together can spiral out of control. DQN tames this with two tricks: replay buffers (decorrelate the training data and smooth out changes in the data distribution) and target networks (partially fixes bootstrapping by freezing the target).

Replay Buffer

Without replay, consecutive training samples are highly correlated (frames 1001, 1002, 1003 look almost identical). The network overfits to the current "neighborhood" of experience and forgets the rest. A replay buffer stores the last N transitions and samples uniformly at random, breaking temporal correlations and providing i.i.d.-like training data.
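A uniform replay buffer needs very little code. The sketch below is a minimal version built on `collections.deque`. The class name, method names, and the integer "states" in the usage example are illustrative.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity: int):
        self.storage = deque(maxlen=capacity)   # oldest transitions are dropped first

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Sampling without replacement from the whole buffer breaks temporal order.
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=50)
for t in range(100):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1)
batch = buffer.sample(batch_size=8)
print(len(buffer), [s for s, _, _, _ in batch])
```

Copying the deque into a list on every sample is fine for a sketch. A large buffer would more typically use preallocated arrays and sample indices.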

Worked Example — Why Correlation Is Deadly

Atari Pong. The agent is in the middle of a rally for frames 5000-5100. Without replay, the minibatch is [frame 5000, 5001, 5002, ..., 5031] — nearly identical images of the same rally. The network learns "this specific paddle position is good" and overwrites everything it knew about serving, recovering from corners, or other game situations. This is catastrophic forgetting.

With replay (buffer size 1M), the minibatch might contain: frame 5000 (mid-rally), frame 2300 (serving), frame 87000 (scoring), frame 45000 (edge recovery). Diverse experiences prevent forgetting and ensure the network learns a globally useful Q-function.

Definition

Prioritized Experience Replay

Instead of sampling uniformly from the buffer, prioritize transitions with high TD error |Q(s,a) − y|. These are the transitions where the Q-network is most wrong — and thus where it has the most to learn. This was introduced by Schaul et al. (2016) and is used in Rainbow DQN.
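A minimal sketch of the proportional variant: each transition is sampled with probability proportional to its absolute TD error raised to a priority exponent. The exponent default, the small `eps` offset, and the example TD errors are assumptions, and the importance-sampling correction used in the full method is omitted.

```python
import numpy as np


def priority_probabilities(td_errors, exponent=0.6, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps) ** exponent; exponent=0 recovers uniform sampling."""
    priorities = (np.abs(td_errors) + eps) ** exponent
    return priorities / priorities.sum()


td_errors = np.array([0.1, 0.1, 2.0, 5.0])   # hypothetical TD errors for four transitions
probs = priority_probabilities(td_errors)
rng = np.random.default_rng(0)
batch_indices = rng.choice(len(td_errors), size=2, p=probs)
print(np.round(probs, 3), batch_indices)
```

Raising the exponent toward 1 concentrates sampling on the few transitions with the largest errors.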

Fixed Q-Targets

The target y = r + γ max_a' Q̂_θ(s', a') depends on θ. Every gradient step changes both the prediction and the target — like a dog chasing its tail. The fix: maintain a target network Q̂_θ' with frozen parameters, updated periodically by copying θ → θ' every N steps.

DQN Algorithm (Mnih et al. 2015)

Initialize Q-network θ, target network θ' ← θ, replay buffer D.
For each step: observe s, pick a via ε-greedy on Q_θ.
Execute a, observe r, s'. Store (s, a, r, s') in D.
Sample minibatch of transitions from D.
Compute targets: y_i = r_i + γ max_a' Q̂_θ'(s'_i, a').
Gradient step: θ ← θ − α ∇_θ ∑_i (Q̂_θ(s_i, a_i) − y_i)².
Every N steps: θ' ← θ.

The Deadly Triad in Detail

Worked Example — How Divergence Happens

Consider a simple MDP with 3 states. A neural net approximates Q with shared features. True Q*(A) = 5, Q*(B) = 5, Q*(C) = 5.

Step 1: From data, we update Q(A) toward target 5.2 (slightly noisy). Due to weight sharing, Q(B) moves to 5.1 and Q(C) moves to 5.15.

Step 2: Now max Q is used as target for another state. Target = r + γ × 5.2 (using inflated max). We train toward this inflated target.

Step 3: The update pushes Q even higher. Next iteration, max Q is 5.4. Then 5.8. Then 6.5. Then 8. Then 15. Then 100.

Without the table to anchor each entry independently, errors propagate through shared weights. Target networks and replay buffers slow this spiral but don't eliminate it completely.

Double DQN

Standard DQN overestimates Q-values because max_a' Q̂(s', a') preferentially selects actions with positive noise. Double DQN decouples action selection from evaluation:

Double DQN Target y = r + γ Q̂_θ'(s', argmax_a' Q̂_θ(s', a'))
Online network θ selects the best action. Target network θ' evaluates it. Since they have different noise, the overestimation bias is broken.

Interactive figure: Q-Value Overestimation. Q-values over the course of training: standard DQN (red) overestimates because max picks the noisiest action, while Double DQN (green) stays closer to the true values (white dashed). Default noise level: 0.8.

Worked Example — Double DQN Fix

State s' has 3 actions. True Q*: [5, 5, 5]. Noisy estimates from θ: [7, 4, 6]. Noisy estimates from θ': [4, 7, 5].

Standard DQN: max from θ': max(4, 7, 5) = 7. Overestimate: 7 vs true 5.

Double DQN: θ selects action 0 (highest: 7). θ' evaluates action 0: Q_θ'(s', 0) = 4. Estimate: 4. Underestimate, but no systematic upward bias.

Over many samples, Double DQN's estimates average out close to the true values rather than drifting upward, because the selection noise in θ is largely independent of the evaluation noise in θ'.
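The two targets differ only in which network picks the action. A minimal sketch using the noisy estimates from this example, with reward 0 and γ = 1 chosen so that only the bootstrapped value is visible. The function names are illustrative.

```python
import numpy as np


def dqn_target(reward, q_target_next, gamma):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + gamma * np.max(q_target_next)


def double_dqn_target(reward, q_online_next, q_target_next, gamma):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = int(np.argmax(q_online_next))
    return reward + gamma * q_target_next[best_action]


q_online = np.array([7.0, 4.0, 6.0])   # noisy estimates from theta
q_target = np.array([4.0, 7.0, 5.0])   # noisy estimates from theta'

print(dqn_target(0.0, q_target, gamma=1.0))
print(double_dqn_target(0.0, q_online, q_target, gamma=1.0))
```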

Checkpoint — Before you move on

DQN, PPO, and SAC all assume the agent can interact with the environment. Explain: why is offline RL (learning from a fixed dataset) fundamentally harder than online RL? What specific failure mode arises when you naively apply Q-learning to a fixed dataset?


Model Answer

The fundamental problem: Q-learning uses max_a' Q(s',a') in the target. In online RL, if Q overestimates some action, the agent eventually tries it and gets corrected by the real reward. Offline, the agent never acts — so overestimated actions are never corrected. Worse, the max operator preferentially selects the most overestimated action, and bootstrapping propagates this overestimate backward through the Bellman equation. The result: Q-values for unobserved actions hallucinate to arbitrarily high values, and the extracted policy confidently selects actions that were never in the data (and are likely catastrophic). This is why IQL avoids querying OOD actions entirely, and CQL explicitly pushes down Q-values for unseen actions.

💥 Break-It Lab: What Dies When You Break RL's Stability Mechanisms?

A working DQN agent learns Q-values that converge to the true values (white dashed line). Toggle off each stability mechanism to see the specific failure mode it prevents.

Set γ = 1 (no discounting)

Failure mode: Without discounting, V(s) = ∑_t=0^∞ r_t can diverge to ±∞. The Bellman operator is no longer a contraction (γ = 1 means the contraction factor is 1). Value iteration has no guarantee of convergence — estimates oscillate or explode. The agent cannot distinguish between strategies that defer reward indefinitely.

Remove target network (bootstrap from live Q)

Failure mode: Without a frozen target, the TD target y = r + γ max Q_θ(s',a') changes with every gradient step. The optimization chases a moving target: overestimation in one state propagates instantly to all states that bootstrap from it, creating a positive feedback loop. Q-values spiral upward exponentially.

Set ε = 0 (no exploration)

Failure mode: Without exploration, the agent greedily exploits its current (imperfect) Q-function. It visits a narrow corridor of states, overfitting Q to that corridor while Q-values for unvisited states remain random. The agent gets stuck in a local optimum, never discovering that other actions lead to higher reward. The replay buffer fills with homogeneous data, causing catastrophic forgetting of early experience.

⚔ Adversarial: The Deadly Triad in Practice

You're training DQN on a continuous-state task (robot arm, 7D state). After 100K steps, the Q-values are stable around [3, 5, 2, 4] for the 4 actions. You then increase the replay buffer's priority exponent from 0.6 to 1.0 (more aggressive prioritization). Within 10K steps, Q-values explode to [150, 200, 180, 190]. What happened?

(1) The priority exponent increased the learning rate, causing overshooting

(2) Aggressive prioritization repeatedly samples high-TD-error transitions, amplifying overestimation via the max operator in a self-reinforcing loop

(3) The replay buffer ran out of diverse samples due to the new prioritization

Chapter 10

Offline RL: IQL & CQL

In offline RL, you have a fixed dataset of transitions collected by some behavior policy π_β — and you can't collect any more data. Think: learning to drive from dashcam footage without ever touching a steering wheel. The challenge is terrifying: the Q-function can hallucinate high values for actions the dataset never saw.

The Overestimation Problem

Standard Q-learning updates Q(s, a) toward r + γ max_a' Q(s', a'). The max might select an action a' that never appears in the dataset. Since Q(s', a') was never corrected by real data, it could be arbitrarily wrong — and the max ensures it's wrong in the upward direction. These phantom high Q-values propagate backward through the Bellman equation, corrupting the entire Q-function.

Why Online Q-Learning Fails Offline

With online learning, the agent eventually tries the overestimated action and gets a correcting signal. Offline, it never tries anything — so the overestimation is never corrected. It's like a student who answers untested material with wild guesses that are always optimistic. Without exams (interaction), the guesses never get corrected.

Worked Example — Offline Overestimation

Dataset: 1000 transitions from a robot arm reaching targets on the left half of a table. The arm never reached right.

Q-learning iteration 1: Q(s, reach_left) trained on real data → Q = 8 (good). Q(s, reach_right) randomly initialized → Q = 0.

Iteration 100: Q(s, reach_left) = 8 (stable, supported by data). Q(s, reach_right) = 12 (inflated! No correcting data). Why? Some state s' in the buffer has max_a' Q(s', a') = 12 via an unseen action. This inflated target propagated to Q(s, reach_right) through bootstrapping.

Iteration 500: Q(s, reach_right) = 47. Completely fictional. The agent would confidently reach right and break.
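The mechanism in this example can be reproduced in a few lines. The sketch below is a tabular toy, not the robot-arm setup: the two-state MDP, the dataset, and the optimistic initial value 15.0 for the unseen action are made-up assumptions chosen for illustration. It compares the standard max target with an in-sample target that only queries the action actually logged at s'.

```python
# Toy illustration (hypothetical MDP): the dataset only ever contains action 0.
# Each logged transition is (s, a, r, s_next, a_next).
DATASET = [
    (0, 0, 1.0, 1, 0),
    (1, 0, 1.0, 0, 0),
]

def offline_q_iteration(dataset, q_init, gamma=0.9, sweeps=200, in_sample=False):
    """Run repeated Bellman backups over a fixed dataset (learning rate 1)."""
    q = [row[:] for row in q_init]  # copy so the caller's table is untouched
    for _ in range(sweeps):
        for s, a, r, s_next, a_next in dataset:
            if in_sample:
                # SARSA-style target: only query the action seen in the data.
                bootstrap = q[s_next][a_next]
            else:
                # Q-learning target: the max may pick an action never seen.
                bootstrap = max(q[s_next])
            q[s][a] = r + gamma * bootstrap
    return q

# Action 1 never appears in DATASET, so its initial value 15.0 is never updated.
Q_INIT = [[0.0, 15.0], [0.0, 15.0]]

q_max = offline_q_iteration(DATASET, Q_INIT, in_sample=False)
q_in_sample = offline_q_iteration(DATASET, Q_INIT, in_sample=True)

print("max target:      ", q_max[0][0])
print("in-sample target:", q_in_sample[0][0])
```

With the max target, Q(0, 0) settles at 1 + 0.9 · 15 = 14.5 because the never-updated entry keeps winning the max. The in-sample target converges to the value the data actually supports, 1 / (1 − 0.9) = 10. In a tabular setting the unseen entry itself stays frozen; the runaway growth described above additionally needs function approximation.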

Implicit Q-Learning (IQL)

IQL's insight: never query Q for actions outside the dataset. Instead of max_a' Q(s', a'), approximate the maximum using expectile regression, a statistical technique that estimates expectiles (the squared-loss analogue of quantiles) without explicit maximization.

Definition

Expectile Regression

The τ-expectile loss is L_τ²(u) = |τ − 1{u < 0}| · u², where u = Q − V is the residual. When τ = 0.5, both signs of u get weight 0.5, so this is ordinary least squares up to a constant factor. When τ → 1, the loss penalizes under-predictions (u > 0) much more than over-predictions, so the minimizer approaches the maximum of the distribution. Think of it as a smooth, differentiable approximation to max.

IQL uses three networks (V_ψ, Q_θ, and the policy π):

IQL Algorithm

1. Learn V_ψ(s) via expectile regression on Q values:
Minimize 𝔼_(s,a)~D[ L_τ²(Q_θ(s, a) − V_ψ(s)) ]
For τ near 1, V_ψ(s) ≈ max_a Q(s, a) over the actions supported by the data, without ever evaluating Q on unseen actions.
2. Learn Q_θ(s, a) via standard MSE with V-targets:
Minimize 𝔼_(s,a,r,s')~D[ (Q_θ(s, a) − [r + γ V_ψ(s')])² ]
Target uses V, not max Q, so out-of-distribution actions are never queried.
3. Extract policy via Advantage Weighted Regression (AWR):
π(a|s) ∝ π_β(a|s) · exp((Q_θ(s,a) − V_ψ(s)) / β)
Reweight the behavior policy by exponentiated advantages. High-advantage actions get more probability.
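The three steps can be sketched as three small functions. This is a minimal illustration under stated assumptions, not the reference implementation: plain dictionaries `q` and `v` stand in for the networks Q_θ and V_ψ, the batch layout `(s, a, r, s_next)` and all function names are hypothetical, and the policy step only computes the per-sample AWR weights that a weighted behavior-cloning loss would use.

```python
import math

def expectile_loss(u, tau):
    """L_tau^2(u) = |tau - 1{u < 0}| * u^2."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u

def value_loss(batch, q, v, tau=0.9):
    """Step 1: fit V(s) to an upper expectile of Q(s, a) over dataset actions."""
    return sum(expectile_loss(q[(s, a)] - v[s], tau)
               for s, a, _, _ in batch) / len(batch)

def q_loss(batch, q, v, gamma=0.99):
    """Step 2: MSE between Q(s, a) and r + gamma * V(s'); no max over actions."""
    return sum((q[(s, a)] - (r + gamma * v[s_next])) ** 2
               for s, a, r, s_next in batch) / len(batch)

def awr_weights(batch, q, v, beta=1.0):
    """Step 3: weights exp((Q - V) / beta) applied to logged (s, a) pairs."""
    return [math.exp((q[(s, a)] - v[s]) / beta) for s, a, _, _ in batch]

# Hypothetical two-transition batch and table values, for illustration only.
batch = [("s0", "left", 1.0, "s1"), ("s0", "stay", 0.0, "s0")]
q = {("s0", "left"): 2.0, ("s0", "stay"): 0.5}
v = {"s0": 1.0, "s1": 0.0}

print("value loss:", value_loss(batch, q, v))
print("Q loss:    ", q_loss(batch, q, v))
print("AWR weights:", awr_weights(batch, q, v))
```

Every lookup into `q` uses an action that appears in the batch, which is the property that keeps out-of-distribution actions out of all three losses.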

Conservative Q-Learning (CQL)

CQL takes a different approach: explicitly push down Q-values for actions not in the dataset, making the Q-function a lower bound on the true values. The key idea: if you can't tell whether an action is good or bad (because you've never tried it), assume it's bad. This is conservative but safe.

Conservative = Safe

In safety-critical applications (healthcare, autonomous driving, robotics), overestimating an untested action's value can be catastrophic. CQL's conservatism is a feature, not a bug: the extracted policy is far less likely to choose an action whose value is high only because of hallucination. The policy might be suboptimal (it avoids potentially good actions it hasn't seen data for), but it is much less likely to be catastrophically bad.

CQL Regularizer min_θ α ( 𝔼_{s~D, a~μ}[ Q_θ(s, a) ] − 𝔼_{s~D, a~π_β}[ Q_θ(s, a) ] ) + standard TD loss
Minimize Q under some distribution μ (covering unseen actions), maximize Q under the behavior policy π_β (seen actions). Net effect: push down unseen actions, push up seen actions.

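If μ is chosen to maximize the regularizer and an entropy bonus on μ is included (the CQL(H) variant), the maximizer is μ(a|s) ∝ exp(Q_θ(s, a)), the softmax of Q-values, and the first term becomes log ∑_a exp(Q_θ(s, a)). This creates a minimax game: μ seeks out the most overestimated actions, and the Q-update pushes them down.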
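For a discrete action space, the softmax choice of μ is commonly realized by replacing the first expectation with a log-sum-exp over actions. The sketch below assumes that variant and a single state with a small list of Q-values; the function names and the example numbers are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_penalty(q_values, data_action):
    """log-sum-exp of Q over all actions minus Q of the logged action."""
    return logsumexp(q_values) - q_values[data_action]

def cql_loss(q_values, data_action, td_error, alpha=1.0):
    """alpha * penalty + squared TD error, for a single transition."""
    return alpha * cql_penalty(q_values, data_action) + td_error ** 2

# Logged action is index 0 in both cases.
overestimated_unseen = [1.0, 1.0, 5.0]   # an unseen action has the highest Q
data_action_is_best = [5.0, 1.0, 1.0]    # the logged action has the highest Q

print("penalty, unseen action inflated:", cql_penalty(overestimated_unseen, 0))
print("penalty, logged action on top:  ", cql_penalty(data_action_is_best, 0))
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value, and it is largest when an action other than the logged one dominates.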

Worked Example — CQL vs Standard Q-Learning

Dataset: transitions from a robot arm that only reaches the left side of a table. State: s = "object on the right." Standard Q-learning: Q(s, reach_right) might be very high because it was never corrected by failure data. The extracted policy reaches right and crashes.

CQL: The regularizer pushes down Q(s, reach_right) because reach_right never appears in the data for this state. Q(s, reach_left) stays high (it's in the data). The extracted policy safely reaches left — suboptimal but doesn't crash.

IQL vs CQL

IQL avoids querying out-of-distribution actions entirely (implicit approach). CQL explicitly penalizes them (conservative approach). IQL is simpler and often faster to train. CQL provides stronger theoretical guarantees (provable lower bound on Q). Both outperform naive offline Q-learning by a large margin.

Worked Example — Expectile Regression

Suppose the Q(s, a) values for the 4 actions in our dataset are [2, 5, 3, 8]. With τ = 0.5 (the ordinary mean), V(s) = (2+5+3+8)/4 = 4.5. With τ = 0.9 (high expectile), V(s) ≈ 6.8, well above the mean and moving toward the maximum of 8. With τ = 0.99, V(s) ≈ 7.9, even closer to the max.

How the loss works: For τ = 0.9, errors where Q > V get weight τ = 0.9, while errors where Q < V get weight (1−τ) = 0.1. So the loss strongly penalizes underestimation of high Q-values, pushing V upward toward the max. This is how IQL approximates max_a Q(s,a) without ever evaluating Q on unseen actions.
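The weighting rule can be turned into a small solver. The sketch below is one simple way to compute an expectile (the function name and the fixed-point scheme are illustrative choices, not something IQL prescribes): hold the asymmetric weights fixed, solve the resulting weighted mean, and repeat until the value stops moving.

```python
def expectile(values, tau, iters=100):
    """Return the tau-expectile of a list of numbers (0 < tau < 1).

    Minimizes sum_i |tau - 1{q_i < v}| * (q_i - v)^2 over v by repeatedly
    solving the weighted-mean condition with the weights held fixed.
    """
    v = sum(values) / len(values)  # the tau = 0.5 solution is the starting point
    for _ in range(iters):
        # Values above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if q > v else 1.0 - tau for q in values]
        new_v = sum(w * q for w, q in zip(weights, values)) / sum(weights)
        if abs(new_v - v) < 1e-12:
            break
        v = new_v
    return v

q_values = [2.0, 5.0, 3.0, 8.0]
for tau in (0.5, 0.9, 0.99):
    print(tau, round(expectile(q_values, tau), 3))
```

The result always stays strictly below the largest value for τ < 1, which is why IQL treats a high expectile as a soft stand-in for the max rather than an exact one.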

Practical Considerations for Offline RL

Factor	IQL	CQL
Requires OOD query?	No (expectile-based)	Yes (softmax sampling)
Hyperparameters	τ (expectile), β (AWR temp)	α (CQL weight)
Theoretical guarantee	Approximation to optimal	Provable lower bound on Q
Typical use	Robot manipulation	Game playing, navigation
Computation	Fast (no sampling loop)	Slower (logsumexp over actions)

Chapter 11

RLHF & DPO

How do you align a language model to human preferences when you can't write down a reward function? Reinforcement Learning from Human Feedback (RLHF) trains a reward model from human comparisons, then optimizes the LM against it. Direct Preference Optimization (DPO) skips the reward model entirely.

The RLHF Pipeline

Three Stages

RLHF Pipeline

Stage 1 — Supervised Fine-Tuning (SFT): Fine-tune the base LM on high-quality demonstrations. This gives you π_ref — a model that can write decent responses.

Stage 2 — Reward Modeling: Show humans pairs of responses (y₁, y₂) to the same prompt x. They choose which is better. Train a reward model r_φ(x, y) to predict these preferences.

Stage 3 — RL Fine-Tuning: Optimize the LM policy π_θ to maximize the learned reward, with a KL penalty to stay close to π_ref.

The Bradley-Terry Model

Given a prompt x and two responses y₁, y₂, the probability that a human prefers y₁ over y₂ is modeled as:

Bradley-Terry Preference Model p*(y₁ ≻ y₂ | x) = exp(r*(x, y₁)) / (exp(r*(x, y₁)) + exp(r*(x, y₂)))
= σ(r*(x, y₁) − r*(x, y₂))
where σ is the sigmoid function. The response with higher reward is more likely to be preferred.
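The two forms of the model are the same function, which a short numerical check makes concrete. The helper names below are hypothetical; the code only uses the formulas stated above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_softmax_form(r1, r2):
    """exp(r1) / (exp(r1) + exp(r2))."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

def bt_sigmoid_form(r1, r2):
    """sigma(r1 - r2)."""
    return sigmoid(r1 - r2)

r1, r2 = 2.3, 1.8
print("softmax form:", bt_softmax_form(r1, r2))
print("sigmoid form:", bt_sigmoid_form(r1, r2))
# Adding the same constant to both rewards leaves the preference unchanged.
print("shifted by 10:", bt_sigmoid_form(r1 + 10.0, r2 + 10.0))
```

Only the reward difference matters, so a reward model trained this way is identified only up to an additive constant per prompt.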

Reward Model Training

Reward Model Loss L_RM(φ) = −𝔼_{(x, y_w, y_l)}[ log σ(r_φ(x, y_w) − r_φ(x, y_l)) ]
y_w = winner (preferred), y_l = loser. Maximize the log-likelihood that the reward model assigns higher reward to the winner.

The RL Objective

RLHF Objective max_θ 𝔼_{x~D, y~π_θ(y|x)}[ r_φ(x, y) ] − β D_KL(π_θ || π_ref)
Maximize reward while staying close to the reference policy. β controls the trade-off: too small and the model "reward hacks," too large and it ignores the reward signal.

Why This Is Hard

Language generation is non-differentiable: you can't backpropagate through token sampling (argmax and categorical sampling have zero or undefined gradients). So you can't just do gradient descent on the reward. You need RL (typically PPO) to optimize through the non-differentiable sampling step. This is computationally expensive and can be unstable.

Worked Example — RLHF Reward Modeling

Prompt: "Write a haiku about rain." Two responses:

y₁ (winner): "Soft drops on the leaves / Whispering to the still pond / Nature breathes again"

y₂ (loser): "Rain is water that falls from clouds in the sky and makes things wet"

Current reward model: r_φ(x, y₁) = 2.3, r_φ(x, y₂) = 1.8. Difference: 0.5.

Loss: −log σ(2.3 − 1.8) = −log σ(0.5) = −log(0.6225) ≈ 0.474.

Gradient: Pushes r_φ(x, y₁) up and r_φ(x, y₂) down, increasing the gap. Over thousands of preference pairs, the reward model learns that well-structured creative responses are preferred over flat factual statements.
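The loss and the direction of its gradient for a single preference pair can be checked directly. This is a sketch with hypothetical function names; it treats the two reward-model outputs as free scalars and differentiates with respect to them, rather than with respect to the network parameters φ.

```python
import math

def rm_loss(r_w, r_l):
    """-log sigma(r_w - r_l), written as log(1 + exp(-(r_w - r_l)))."""
    return math.log1p(math.exp(-(r_w - r_l)))

def rm_loss_grads(r_w, r_l):
    """Gradients of the loss with respect to the two reward outputs."""
    p_wrong = 1.0 / (1.0 + math.exp(r_w - r_l))  # sigma(-(r_w - r_l))
    return -p_wrong, p_wrong  # (dL/dr_w, dL/dr_l)

# Rewards from the haiku example: winner 2.3, loser 1.8.
loss = rm_loss(2.3, 1.8)
grad_w, grad_l = rm_loss_grads(2.3, 1.8)
print("loss:", loss)
print("dL/dr_w:", grad_w, " dL/dr_l:", grad_l)
```

The gradient with respect to the winner's reward is negative and the one for the loser is positive, so a gradient-descent step raises the former and lowers the latter by equal amounts.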

The KL Penalty in Practice

Without the KL term, the LM quickly discovers reward hacking — pathological outputs that score high on the reward model but are gibberish to humans. For example, a model trained to maximize a helpfulness reward might repeat "I'm so happy to help you! That's a great question!" endlessly. The KL penalty D_KL(π_θ || π_ref) prevents this by penalizing the model for deviating too far from the SFT baseline.

KL Penalty Per Token D_KL(π_θ || π_ref) = 𝔼_{y~π_θ}[ ∑_{t=1}^{T} log(π_θ(y_t|x, y_{<t}) / π_ref(y_t|x, y_{<t})) ]
Sum of per-token log-ratio. If π_θ assigns much more probability to some token than π_ref, that token gets a large KL penalty.
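In practice the expectation is estimated from sampled responses. A minimal sketch for one sampled response, assuming the per-token log-probabilities under both models are already available as lists (the function names and the numbers are hypothetical):

```python
def sequence_kl_estimate(logp_policy, logp_ref):
    """Sum of per-token log-ratios for one sampled response y ~ pi_theta."""
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """r_phi(x, y) minus beta times the single-sample KL estimate."""
    return reward - beta * sequence_kl_estimate(logp_policy, logp_ref)

# Three tokens that the policy likes much more than the reference does.
logp_policy = [-0.2, -0.1, -0.3]
logp_ref = [-1.0, -0.9, -1.2]

print("KL estimate:", sequence_kl_estimate(logp_policy, logp_ref))
print("penalized reward:", kl_penalized_reward(2.0, logp_policy, logp_ref))
```

A single-sample estimate can be negative for an individual response even though the true KL divergence is never negative; only the average over samples from π_θ is guaranteed to be non-negative.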

DPO: Bypassing the Reward Model

DPO's key insight: the optimal policy under the KL-constrained objective has a closed-form solution, and we can reparameterize the reward in terms of the policy itself.

Full Derivation — DPO

Step 1 — Solve for the optimal policy. The RLHF objective with KL constraint is:

max_π 𝔼_y~π[r(x,y)] − β D_KL(π || π_ref)

= max_π 𝔼_y~π[r(x,y) − β log(π(y|x)/π_ref(y|x))]

= max_π 𝔼_y~π[r(x,y) − β log π(y|x) + β log π_ref(y|x)]

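Taking the functional derivative with respect to π(y|x), with a Lagrange multiplier λ(x) enforcing the normalization constraint ∑_y π(y|x) = 1, and setting it to zero:

r(x,y) − β log π(y|x) − β + β log π_ref(y|x) − λ(x) = 0

Solving for π: π*(y|x) = π_ref(y|x) · exp(r(x,y)/β) / Z(x)

where Z(x) = ∑_y π_ref(y|x) exp(r(x,y)/β) is the partition function; it absorbs the constants β and λ(x), which do not depend on y. This is the optimal policy given reward r.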
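The closed form can be sanity-checked on a tiny discrete response set. The sketch below assumes three candidate responses with made-up reference probabilities and rewards; it evaluates the KL-regularized objective at π* and at two other policies.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Return (pi*, Z) with pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    unnorm = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

def kl_objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) over a discrete response set."""
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, pi_ref, rewards) if p > 0)

# Hypothetical numbers for one prompt with three possible responses.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
candidates = {
    "pi_ref": pi_ref,
    "near-greedy": [0.01, 0.98, 0.01],
    "pi_star": pi_star,
}
for name, pi in candidates.items():
    print(f"{name:12s} objective = {kl_objective(pi, pi_ref, rewards, beta):.4f}")
print("beta * log Z =", round(beta * math.log(z), 4))
```

The objective at π* equals β log Z(x), because r(x,y) − β log(π*(y|x)/π_ref(y|x)) is the same constant for every y. Both the reference policy and the near-greedy policy score lower.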

Step 2 — Reparameterize the reward. From the optimal policy expression, solve for r:

r(x, y) = β log(π*(y|x) / π_ref(y|x)) + β log Z(x)

Step 3 — Substitute into Bradley-Terry. The preference probability becomes:

p*(y_w ≻ y_l) = σ(r(x,y_w) − r(x,y_l))

= σ(β log(π*(y_w|x)/π_ref(y_w|x)) − β log(π*(y_l|x)/π_ref(y_l|x)))

The Z(x) terms cancel! They appear in both r(x,y_w) and r(x,y_l) and subtract out.

Step 4 — Write the DPO loss. Replace π* with π_θ (the policy we're training):

DPO Loss L_DPO(θ) = −𝔼_{(x, y_w, y_l)}[ log σ( β log(π_θ(y_w|x) / π_ref(y_w|x)) − β log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]
No reward model, no RL, no sampling during training. Just a classification loss on preference pairs.
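For a single preference pair, the loss needs only four numbers: the summed token log-probabilities of the winner and the loser under π_θ and under π_ref. A minimal sketch (the function name and the example log-probabilities are hypothetical):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    u = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-u))  # equals -log(sigmoid(u))

# Sequence log-probabilities chosen so the log-ratios are 0.3 (winner)
# and 0.5 (loser).
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-12.3, ref_logp_l=-15.5))

# When the policy equals the reference, both log-ratios are 0 and the
# loss is log 2.
print(dpo_loss(logp_w=-12.3, logp_l=-15.5, ref_logp_w=-12.3, ref_logp_l=-15.5))
```

In a real training loop the reference log-probabilities are computed without gradients, and the loss is averaged over a batch of pairs.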

Why DPO Is Revolutionary

DPO reduces RLHF to supervised learning. No reward model to train. No PPO loop to run. No generation during training. Just: compute log-probabilities of the winning and losing responses under π_θ and π_ref, plug into the loss, and do gradient descent. It is considerably simpler to implement and, in practice, often more stable than PPO-based RLHF.

DPO Gradient Analysis

What does the DPO gradient actually do? Taking the gradient of L_DPO with respect to θ:

DPO Gradient Intuition

∇_θL_DPO = −β 𝔼[ σ(−u) (∇ log π_θ(y_w|x) − ∇ log π_θ(y_l|x)) ]

where u = β( log(π_θ(y_w|x)/π_ref(y_w|x)) − log(π_θ(y_l|x)/π_ref(y_l|x)) ).

The weighting σ(−u): When u is large and positive (the model already strongly prefers y_w), σ(−u) is near 0, so there is almost no gradient. When u is large and negative (the model incorrectly prefers y_l), σ(−u) is near 1, giving a strong gradient signal. At u = 0 the weight is exactly 0.5. The gradient focuses on examples the model gets wrong and largely ignores ones it already handles correctly.

The direction: The gradient increases log π(y_w) and decreases log π(y_l). It simultaneously makes the winner more likely and the loser less likely.
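The weighting can be verified numerically by treating the per-pair loss as a function of u alone. The sketch below (hypothetical helper names) compares a finite-difference slope of −log σ(u) with −σ(−u).

```python
import math

def pair_loss(u):
    """DPO loss for one pair as a function of its argument u: -log sigma(u)."""
    return math.log1p(math.exp(-u))

def gradient_weight(u):
    """sigma(-u), the factor that scales each pair's gradient."""
    return 1.0 / (1.0 + math.exp(u))

def numeric_slope(u, eps=1e-6):
    """Central finite difference of pair_loss with respect to u."""
    return (pair_loss(u + eps) - pair_loss(u - eps)) / (2 * eps)

for u in (-4.0, -0.02, 0.0, 4.0):
    print(f"u = {u:+.2f}  weight = {gradient_weight(u):.3f}  "
          f"dloss/du = {numeric_slope(u):+.3f}")
```

The slope matches −σ(−u) at every point, so pairs the model already ranks correctly by a wide margin contribute almost nothing, while badly ranked pairs contribute with weight close to 1.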

RLHF vs DPO: Practical Comparison

Property	RLHF (PPO)	DPO
Training stages	3 (SFT → RM → RL)	2 (SFT → DPO)
Models needed	4 (policy, ref, RM, critic)	2 (policy, ref)
Generates during training	Yes (PPO rollouts)	No
GPU memory	Very high (4 models)	Moderate (2 models)
Stability	Sensitive to hyperparameters	Robust
Reward hacking risk	Higher (imperfect RM)	Lower (no RM to exploit)
Iterative improvement	Natural (generate → relabel)	Requires new preference data
Notable examples	InstructGPT, Llama 2	Zephyr, Llama 3 (partially)

DPO Failure Modes

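Distribution shift: DPO implicitly assumes the responses in the preference pairs were sampled from a policy close to π_ref (ideally π_ref itself). If the pairs come from a very different policy, the implicit reward is trained on responses the policy rarely produces, and the DPO objective may not converge to the right solution.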

β too small: The policy can diverge far from π_ref, potentially generating incoherent text. The log-ratios blow up.

β too large: The policy barely moves from π_ref. The preference signal is too weak to have an effect. The model doesn't improve.

Preference noise: If annotators frequently disagree (noisy preferences), DPO's loss becomes noisy too. RLHF's reward model can average over multiple annotations per pair; DPO uses each pair directly.

Worked Example — DPO Update

Prompt x: "Explain gravity." Winner y_w: clear explanation. Loser y_l: vague rambling. β = 0.1.

Current log-ratios: log(π_θ(y_w|x)/π_ref(y_w|x)) = 0.3. log(π_θ(y_l|x)/π_ref(y_l|x)) = 0.5.

DPO argument: β(0.3 − 0.5) = 0.1 × (−0.2) = −0.02.

σ(−0.02) ≈ 0.495. Loss: −log(0.495) ≈ 0.703.

The gradient will increase π_θ(y_w|x) relative to π_ref and decrease π_θ(y_l|x) relative to π_ref. Intuitively: make the winner more likely and the loser less likely, compared to the reference model.

Chapter 12

The Complete Cheat Sheet

Algorithm Taxonomy

Algorithm	Type	On/Off-Policy	Actions	Key Idea
REINFORCE	Policy gradient	On-policy	Any	Monte Carlo policy gradient
A2C	Actor-critic	On-policy	Any	Learned baseline + bootstrapping
PPO	Actor-critic	On-policy	Any	Clipped surrogate objective
SAC	Actor-critic	Off-policy	Continuous	Max entropy + replay buffer
DQN	Value-based	Off-policy	Discrete	Target network + replay
IQL	Value-based	Offline	Any	Expectile regression avoids OOD
CQL	Value-based	Offline	Any	Conservative Q lower bound
DPO	Preference	Offline	LM tokens	Reparameterize reward into policy

Key Equations

Equation	One-Line Explanation
V^π(s) = 𝔼_a~π[r + γV^π(s')]	Bellman equation: value = reward + discounted future value
∇_θV = 𝔼[∇ log π · Q^π]	Policy gradient theorem: weight each action by its Q-value
A^π(s,a) = Q^π(s,a) − V^π(s)	Advantage: how much better is action a than average?
δ_t = r + γV(s') − V(s)	TD error: one-step advantage estimate
Â^GAE = ∑_l (γλ)^l δ_{t+l}	GAE: exponentially-weighted sum of TD errors
L^CLIP = 𝔼[min(rÂ, clip(r)Â)]	PPO: clip the ratio to prevent catastrophic updates
Q* = r + γ max_a' Q*	Bellman optimality: optimal Q satisfies this fixed point
V(π)−V(π') = 1/(1−γ) 𝔼[A^π']	PDL: performance gap = expected advantage
L_DPO = −log σ(β(log r_w − log r_l))	DPO: preference learning as classification (r_w, r_l are the π_θ/π_ref probability ratios of winner and loser, not rewards)

When to Use What

On-Policy vs Off-Policy: The Fundamental Trade-Off

Definition

On-Policy vs Off-Policy

On-policy (REINFORCE, PPO): The data used for each update was collected by the current policy. After each update, old data is discarded and new data is collected. Pro: the gradient estimate matches the policy being updated, so no off-policy correction is needed. Con: data is used once (or for a few epochs, as in PPO) and then thrown away, which is expensive.

Off-policy (DQN, SAC): Data from any policy can be reused. A replay buffer stores millions of transitions from past policies. Pro: much more sample-efficient (each transition is used dozens of times). Con: the data distribution doesn't match the current policy, introducing bias that must be corrected (importance sampling, target networks, etc.).

Worked Example — Data Efficiency Comparison

Training a robot arm. Budget: 1 million environment steps.

PPO (on-policy): Collects batches of 2048 steps, updates, discards. 1M / 2048 = ~488 policy updates. Each transition used 4 times (K=4 epochs). Total gradient steps: ~2000.

SAC (off-policy): After initial exploration, every environment step triggers one gradient update using a batch of 256 from the replay buffer. Total gradient steps: ~1M. Each transition is reused ~256 times over its lifetime in the buffer.

Result: SAC does ~500x more gradient updates with the same data. For tasks where environment interaction is expensive (real robots: $100/hour), off-policy wins decisively. For tasks where simulation is cheap (video games: 1000 fps), on-policy's simplicity often wins.
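The arithmetic in this comparison is easy to reproduce and to rerun with other settings. The sketch below bakes in the example's assumptions: one full-batch gradient step per PPO epoch, one SAC update per environment step, and a replay buffer that never evicts. The function names are hypothetical.

```python
def ppo_gradient_steps(env_steps, batch_size=2048, epochs=4):
    """Number of PPO gradient steps with one full-batch step per epoch."""
    policy_updates = env_steps // batch_size
    return policy_updates * epochs

def sac_gradient_steps(env_steps, updates_per_step=1):
    """Number of SAC gradient steps with a fixed update-to-data ratio."""
    return env_steps * updates_per_step

def sac_average_reuse(env_steps, minibatch=256, updates_per_step=1):
    """Average times each transition is sampled if the buffer keeps everything."""
    return sac_gradient_steps(env_steps, updates_per_step) * minibatch / env_steps

budget = 1_000_000
ppo = ppo_gradient_steps(budget)
sac = sac_gradient_steps(budget)
print("PPO gradient steps:", ppo)
print("SAC gradient steps:", sac)
print("ratio:", round(sac / ppo, 1))
print("SAC average reuse per transition:", sac_average_reuse(budget))
```

With these settings PPO performs 488 × 4 = 1952 gradient steps, which the example rounds to ~2000. Splitting each PPO epoch into minibatches would raise that count, so the ratio depends heavily on the assumed update schedule.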

Historical Timeline

Year	Milestone	Key Contribution
1989	Watkins — Q-Learning	Off-policy tabular value learning with convergence proof
1992	Williams — REINFORCE	Policy gradient via the score function trick
1999	Sutton et al. — PG Theorem	Formal foundation for policy gradient methods
2000	Kakade — Natural PG	Fisher information matrix for parameterization-invariant updates
2013	Mnih et al. — DQN	Neural networks + replay buffer + target networks for Atari
2015	Schulman et al. — TRPO	Trust region constraints for stable policy updates
2016	Schulman et al. — GAE	Exponentially-weighted TD errors for advantage estimation
2017	Schulman et al. — PPO	Clipped surrogate objective replaces TRPO constraints
2018	Haarnoja et al. — SAC	Maximum entropy off-policy actor-critic for continuous control
2020	Kumar et al. — CQL	Conservative Q-function lower bounds for offline RL
2022	Kostrikov et al. — IQL	Expectile regression avoids OOD action evaluation
2022	Ouyang et al. — InstructGPT	RLHF pipeline for language model alignment
2023	Rafailov et al. — DPO	Reparameterize reward into policy for direct preference optimization

Connections

This lesson covered the complete landscape of deep RL theory. For implementation details, see the companion lessons:

Topic	Lesson
Tabular Q-learning with worked examples	Q-Learning: Value-Based RL
Imitation learning as an alternative to RL	Imitation Learning
Actor-critic methods in depth	Off-Policy Actor-Critic
MDPs and value iteration from scratch	Markov Decision Processes

Common Hyperparameters

Algorithm	Key Hyperparameter	Typical Range	What It Controls
PPO	ε (clip range)	0.1 – 0.3	How far the policy can change per update
PPO	λ (GAE)	0.9 – 0.99	Bias-variance trade-off in advantage estimation
PPO	K (epochs)	3 – 10	How many passes over each batch of data
SAC	α (temperature)	Auto-tuned	Exploration-exploitation balance via entropy
SAC	τ (target EMA)	0.005	How fast target networks track the online network
DQN	ε (exploration)	1.0 → 0.01	Random action probability (annealed)
DQN	Target update freq	1K – 10K steps	How often the target network is synced to the online network
IQL	τ (expectile)	0.7 – 0.99	How close to max (higher = closer)
CQL	α (CQL weight)	0.5 – 10	How conservative to be with OOD actions
DPO	β	0.05 – 0.5	How strongly the policy is held near the reference (higher = closer)

The RL Algorithm Family Tree

How Everything Connects

The Genealogy of Deep RL

Root: The Bellman equation (1957) gives us the recursive structure of value.

Branch 1 — Value-based: Compute Q*, derive π* as argmax. Tabular Q-learning (1989) → DQN (2013) → Double DQN (2016) → Rainbow (2018). Limited to discrete actions.

Branch 2 — Policy gradient: Directly optimize π_θ. REINFORCE (1992) → Actor-Critic (2000s) → TRPO (2015) → PPO (2017). Works with any action space.

Branch 3 — Off-policy actor-critic: Combine replay buffers with policy optimization. DDPG (2015) → TD3 (2018) → SAC (2018). Best sample efficiency for continuous control.

Branch 4 — Offline RL: Learn from fixed data without interaction. BCQ (2019) → CQL (2020) → IQL (2022). Critical for real-world deployment.

Branch 5 — Preference-based: Learn from human comparisons instead of numeric rewards. RLHF (2022) → DPO (2023). Powers modern LLM alignment.

Quick Reference: Which Paper to Read

Want To Learn	Read This Paper	Pages
The foundations of policy gradients	Sutton et al. (1999) "Policy Gradient Methods for RL with Function Approximation"	8
Trust region methods	Schulman et al. (2015) "Trust Region Policy Optimization"	16
PPO in practice	Schulman et al. (2017) "Proximal Policy Optimization Algorithms"	12
Off-policy continuous control	Haarnoja et al. (2018) "Soft Actor-Critic"	13
Deep Q-learning from pixels	Mnih et al. (2015) "Human-level control through deep RL" (Nature)	10
Offline RL fundamentals	Levine et al. (2020) "Offline RL: Tutorial, Review, and Perspectives"	55
Aligning LMs to preferences	Rafailov et al. (2023) "Direct Preference Optimization"	29

The One Sentence

Deep RL is the art of converting the Bellman equation — value = reward + discounted future value — into practical algorithms that scale from gridworlds to language models. Everything is a variation on that one recursive idea.

"What I cannot create, I do not understand."

— Richard Feynman

🏗 Design Challenge You're the Architect: RL Agent for Continuous Control

You're deploying an RL agent to control a 12-DOF quadruped robot in the real world. The robot has joint position/velocity sensors (24-dim state) and outputs torques (12-dim continuous action). You have a simulator that's ~80% accurate and access to the real robot for 2 hours/day. The task: walk forward at 1 m/s over varied terrain.

State Space: 24-dim continuous
Action Space: 12-dim continuous
Sim Budget: Unlimited
Real Robot: 2 hrs/day
Sim-to-Real Gap: ~20% dynamics error
Safety: Must not fall (hardware damage)

1. Which algorithm class: on-policy (PPO), off-policy (SAC), or offline (IQL)? Justify considering data efficiency vs. stability vs. sim-to-real.

2. Function class for π_θ: MLP, RNN, or diffusion policy? What hidden sizes, and fixed or learned σ?

3. How do you bridge the sim-to-real gap? Domain randomization? System identification? Sim-to-real fine-tuning?

4. Safety: how do you prevent the robot from executing actions that cause falling during real-world training?

5. Reward design: dense (tracking velocity) vs. sparse (reach waypoint)? Include stability terms?

Industry standard (2024): PPO in simulation with massive domain randomization, then zero-shot or few-shot transfer to real hardware.

Algorithm: PPO (not SAC) because on-policy is more stable under domain randomization. The sim-to-real gap means off-policy data from a slightly-wrong simulator can mislead SAC's Q-function. PPO's on-policy nature means it only trusts fresh rollouts.

Architecture: MLP (2-3 layers, 256-512 units). State-independent σ (log-std as a learnable parameter, not an output of the network) for locomotion; a state-dependent σ can collapse to zero prematurely. Setups along these lines are common in legged locomotion, including work associated with Unitree, Boston Dynamics, and ETH Zurich.

Sim-to-real: Domain randomization (randomize friction, mass, motor delay, ground slope, sensor noise) during sim training. The policy learns to be robust to all variations. 2 hours of real-robot fine-tuning with conservative PPO (ε = 0.05) for final adaptation.

Safety: Action clipping to torque limits + safety constraints in reward (penalty for IMU angle > 30°) + early termination in sim when falling. Real robot: conservative action bounds initially, gradually relaxed.

Reward: Dense: r = v_forward − 0.1||torques||² − 5·1[fell] − 0.5|roll| − 0.5|pitch|. Multiple terms for velocity tracking, energy efficiency, and stability. Sparse rewards fail for locomotion (too many steps before any signal).
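The reward formula translates directly into a per-step function. This sketch assumes roll and pitch are in radians and that torques are already scaled to a comparable range; the function name, argument names, and example values are hypothetical, and the coefficients are the ones from the formula above.

```python
def locomotion_reward(v_forward, torques, roll, pitch, fell):
    """r = v_forward - 0.1*||torques||^2 - 5*1[fell] - 0.5*|roll| - 0.5*|pitch|."""
    torque_penalty = 0.1 * sum(t * t for t in torques)  # energy term
    fall_penalty = 5.0 if fell else 0.0                  # safety term
    tilt_penalty = 0.5 * abs(roll) + 0.5 * abs(pitch)    # stability term
    return v_forward - torque_penalty - fall_penalty - tilt_penalty

# One step of upright walking at 1 m/s with small torques on all 12 joints.
steady = locomotion_reward(1.0, [0.1] * 12, roll=0.05, pitch=-0.02, fell=False)
# The same step, but the robot has fallen.
fallen = locomotion_reward(1.0, [0.1] * 12, roll=0.05, pitch=-0.02, fell=True)
print("steady step:", steady)
print("fallen step:", fallen)
```

Because the first term rewards raw forward speed, this version encourages going as fast as possible; tracking the 1 m/s target would need a term such as a penalty on |v_forward − 1|, which is a design choice left open here.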

💻 Build It Implement REINFORCE with Baseline from Scratch

You've seen REINFORCE with baseline in Chapter 3. Now implement the full gradient computation. Given a batch of trajectories, compute the policy gradient using reward-to-go and a state-dependent baseline. The baseline is V̂(s) = average return from state s.

def reinforce_gradient(log_probs, rewards, gamma=0.99):
    """
    Compute REINFORCE policy gradient with reward-to-go and baseline.

    Args:
        log_probs: list of N trajectories, each a list of T log pi(a|s) values
        rewards: list of N trajectories, each a list of T reward values
        gamma: discount factor

    Returns:
        loss: scalar surrogate loss (minimize it to do gradient ascent on return)
    """

Test case

log_probs = [[-0.5, -0.3, -0.8]], rewards = [[1, 2, 10]], gamma=0.99
reward_to_go[0] = [1 + 0.99*2 + 0.99^2*10, 2 + 0.99*10, 10] = [12.78, 11.90, 10.0]
baseline = [12.78, 11.90, 10.0] (only 1 trajectory, so baseline = reward_to_go)
advantages = [0, 0, 0] → loss = 0 (single trajectory can't distinguish good from bad)

Loop backwards: G[T-1] = r[T-1], then G[t] = r[t] + gamma * G[t+1]. This is O(T) per trajectory. The baseline at time t is the mean of G[t] across all N trajectories.

import torch  # optional: if log_probs holds torch scalars, the returned loss is differentiable

def reinforce_gradient(log_probs, rewards, gamma=0.99):
    N = len(log_probs)

    # Step 1: discounted reward-to-go, filled in backwards (O(T) per trajectory)
    all_rtg = []
    for i in range(N):
        T = len(rewards[i])
        rtg = [0.0] * T
        rtg[T-1] = rewards[i][T-1]
        for t in range(T-2, -1, -1):
            rtg[t] = rewards[i][t] + gamma * rtg[t+1]
        all_rtg.append(rtg)

    # Step 2: baseline = mean reward-to-go at each timestep across trajectories
    T = len(all_rtg[0])  # assumes every trajectory has the same length
    baseline = [sum(all_rtg[i][t] for i in range(N)) / N for t in range(T)]

    # Step 3: surrogate loss = mean of -log pi(a|s) * advantage
    # (minimizing it performs gradient ascent on expected return)
    loss = 0.0
    for i in range(N):
        for t in range(T):
            advantage = all_rtg[i][t] - baseline[t]
            loss += -log_probs[i][t] * advantage

    return loss / (N * T)

Bonus challenge: Modify this to use GAE(λ=0.95) instead of full reward-to-go. You'll need a value function V̂(s) and TD errors δ_t = r_t + γV̂(s_{t+1}) − V̂(s_t).
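One possible starting point for the bonus challenge is the advantage computation on its own. The sketch below assumes a single finished trajectory and takes the value estimates as a plain list, so it does not cover how V̂ is trained; the function name and argument layout are hypothetical.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished trajectory.

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T+1 value estimates V(s_0) .. V(s_T); use V(s_T) = 0
             when s_T is terminal.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in range(T - 1, -1, -1):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running  # discounted sum of TD errors
        advantages[t] = running
    return advantages

rewards = [1.0, 2.0, 10.0]
zero_values = [0.0, 0.0, 0.0, 0.0]
# With lam = 1 and V = 0 this reduces to the discounted reward-to-go.
print(gae_advantages(rewards, zero_values, gamma=0.99, lam=1.0))
# With lam = 0 each advantage is just the one-step TD error.
print(gae_advantages(rewards, [1.0, 2.0, 3.0, 0.0], gamma=0.5, lam=0.0))
```

The two limiting cases bracket the bias-variance trade-off: λ = 1 recovers the Monte Carlo estimate, and λ = 0 recovers the one-step TD advantage.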

🔗 Pattern Recognition

Iterative Refinement: Value Iteration = Diffusion Denoising = EM

This Lesson (RL)

V_{k+1}(s) = (T* V_k)(s)
Repeatedly apply contraction until fixed point V*

Diffusion Models

x_{t−1} = denoise(x_t)
Repeatedly refine noise into clean sample

Both processes start from an arbitrary initialization and move toward a target by repeatedly applying a refinement operator. In RL, the operator T* is a contraction with factor γ in the sup norm, so by the Banach fixed-point theorem repeated application converges to the unique fixed point V* from any starting point. In diffusion, the noise schedule β_t plays a loosely analogous role by controlling how much noise each step removes, though reverse diffusion is stochastic and is not a contraction in the strict Banach sense. Related iterative patterns appear in EM (converges to a local maximum of the likelihood), power iteration (converges to the dominant eigenvector), and Newton's method (quadratic convergence near roots).
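The contraction property on the RL side can be observed directly. The sketch below uses a made-up 2-state, 2-action MDP (all transition probabilities and rewards are assumptions for illustration) and applies the Bellman optimality operator to two different value functions, printing how fast their sup-norm distance shrinks.

```python
# Hypothetical 2-state, 2-action MDP used only to illustrate the contraction.
GAMMA = 0.9
# P[s][a] = list of (probability, next_state); R[s][a] = immediate reward
P = [[[(1.0, 0)], [(0.5, 0), (0.5, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]

def bellman_optimality(v):
    """Apply T*: (T* v)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    return [max(R[s][a] + GAMMA * sum(p * v[s2] for p, s2 in P[s][a])
                for a in range(2))
            for s in range(2)]

def sup_distance(u, v):
    """Sup-norm distance between two value functions."""
    return max(abs(a - b) for a, b in zip(u, v))

u, v = [0.0, 0.0], [10.0, -5.0]
for k in range(5):
    before = sup_distance(u, v)
    u, v = bellman_optimality(u), bellman_optimality(v)
    # The shrink factor per application never exceeds GAMMA.
    print(k, round(sup_distance(u, v) / before, 3))
```

Because any two starting points are pulled together at rate γ or faster, both sequences end up at the same V*, regardless of initialization.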

Can you identify the "contraction factor" in EM? (Hint: it relates to the fraction of missing information.)
