← Gleams
Stanford CS 224R · Lecture 6 · Deep RL

The Complete Guide to Q-Learning

What if you could throw away the policy network entirely — and just learn a table of "how good is each action?" That's Q-learning.

Value-based RL Bellman optimality DQN & Atari Double Q-learning
Roadmap

What You'll Master

Chapter 01

From Actor-Critic to Critic Only

In actor-critic, you have two networks: an actor (the policy πθ) that picks actions, and a critic (the Q-function Q̂π) that evaluates them. The actor needs the critic for low-variance gradient estimates. But here's the question nobody asks at first: does the critic need the actor?

Suppose you had a perfect critic — an oracle that knows Q̂π(s, a) for every state-action pair. What would the best possible action be in state s? Just pick the action with the highest Q-value:

The Key Insight πnew(a | s) = argmaxaπ(s, a)

No gradient descent. No sampling. No log-probabilities. Just: look up every action's value, take the best one.

The Big Idea

The Policy Improvement Theorem guarantees that πnew defined this way is always at least as good as the original π. So if you can accurately estimate Q, you never need an explicit policy network at all. The policy is implicit: just pick argmax.

This is the conceptual leap from actor-critic to Q-learning: throw away the actor entirely. Invest all your modeling capacity into learning the best possible Q-function. Derive the policy for free via argmax.

Policy Improvement Theorem (Sketch)

Let Qπ(s, a) be the Q-function under policy π. Define π'(s) = argmaxa Qπ(s, a). Then for all states s: Vπ'(s) ≥ Vπ(s). Proof idea: Vπ(s) = Qπ(s, π(s)) ≤ maxa Qπ(s, a) = Qπ(s, π'(s)). The inequality holds because π might not pick the best action, but argmax always does. Iterating this argument across time steps gives Vπ'(s) ≥ Vπ(s) everywhere.

Definition
Q-Learning (informal)

A family of RL algorithms that learn a Q-function (state-action value) and derive the policy as argmax over actions. No explicit policy network is trained.

Worked Example — Why Argmax Improves

Imagine a 2-action problem. Current policy: π(left) = 0.6, π(right) = 0.4. True Q-values: Q(left) = 3, Q(right) = 7. Expected value under π: 0.6 × 3 + 0.4 × 7 = 4.6. Argmax policy: always go right. Expected value: 7.0. Improvement is guaranteed because argmax concentrates all probability on the best action.

The Trade-off

Actor-critic can handle continuous actions naturally (the actor outputs parameters of a Gaussian). Q-learning with argmax requires searching over all actions — easy for discrete actions (just try all 4 directions), hard for continuous ones (infinite choices). This is why Q-learning dominates in discrete-action settings like Atari, while actor-critic methods (like SAC) dominate in continuous control.

Argmax Requires Enumeration

argmaxa Q(s, a) means "evaluate Q for every possible action and pick the highest." With 4 discrete actions, that's 4 forward passes (or one forward pass outputting 4 values). With continuous actions, there are infinitely many — you'd need optimization inside each action selection, which is expensive. This is the fundamental limitation of pure Q-learning.

🔗 Pattern Recognition
Q-Learning = Sample-Based Value Iteration
Dynamic Programming
Q*(s,a) = r + γ Σs' p(s'|s,a) maxa' Q*(s',a')
Sweeps over ALL states. Requires full model.
Q-Learning
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') − Q(s,a)]
Uses ONE sample transition. Model-free.

Q-learning is value iteration with the expectation replaced by a single sample. Instead of sweeping every (s,a) with the known model, you observe one transition and update one entry. Over infinite visits, the sample average converges to the expectation — recovering the same fixed point as DP.

Where else do you see "replace expensive exact computation with cheap stochastic samples"? (Hint: SGD vs. full-batch gradient descent.)

Chapter 02

Policy Iteration

Before jumping to the modern algorithm, let's see the classical approach. Policy iteration alternates between two steps, like breathing in and out:

Policy Iteration
  1. Policy Evaluation: Given current policy π, compute Qπ(s, a) for all (s, a). This means: "if I follow π forever starting from (s, a), what's my expected total reward?"
  2. Policy Improvement: Set π'(a | s) = argmaxa Qπ(s, a). Replace π with π'.
  3. Repeat until π stops changing (convergence).
Guaranteed Convergence

Policy iteration is guaranteed to converge to the optimal policy π* in a finite number of steps (for finite state/action spaces). Each iteration can only improve or maintain the policy — never make it worse.

A Gridworld Example

Consider a 3×1 grid: states {A, B, C}. Actions: left or right. State C gives reward +10 and ends the episode. Every step costs −1. Discount γ = 0.9.

Worked Example — One Iteration

Start: Random policy π (50% left, 50% right everywhere).

Step 1 — Evaluate: From state A going right: expected reward under random walk is messy, but we can compute Qπ(A, right) ≈ 4.2, Qπ(A, left) ≈ −1.0 (hits wall). From B going right: Qπ(B, right) = −1 + 0.9 × 10 = 8.0. From B going left: Qπ(B, left) = −1 + 0.9 × Qπ(A, random) ≈ 1.9.

Step 2 — Improve: In every state, pick the action with higher Q. State A: right (4.2 > −1.0). State B: right (8.0 > 1.9). New policy: always go right.

Result: After just one iteration, we found the optimal policy.

The Problem with Policy Iteration

Step 1 is expensive. Computing Qπ exactly requires solving a system of linear equations (or running many rollouts) for the current policy. Then you improve the policy and have to evaluate the new policy all over again. Each evaluation is a full inner loop.

Bottleneck

In deep RL, states are images (millions of pixels). You can't compute Qπ(s, a) for every (s, a) — there are infinitely many. We need a shortcut that skips the expensive evaluation step entirely. That's what the Bellman optimality equation gives us.

Why Convergence Is Guaranteed

There are finitely many deterministic policies over finite state/action spaces. Each improvement step produces a policy that's strictly better or equal. A strictly increasing sequence over a finite set must terminate. At termination, argmaxa Qπ(s, a) = π(s) for all s — meaning no improvement is possible. That's the definition of optimality.

Chapter 03

The Bellman Optimality Equation

Here's the breakthrough. Instead of iterating between evaluate and improve, what if we directly solved for the optimal Q-function Q* — the one that already reflects the best possible policy?

Bellman Optimality Equation Q*(s, a) = r(s, a) + γ 𝔼s' ~ p(·|s,a)[ maxa' Q*(s', a') ]

Read it aloud: "The value of taking action a in state s, then acting optimally forever, equals the immediate reward plus the discounted value of the best action in whatever state you land in."

Definition
Q* — The Optimal Q-function

Q*(s, a) is the expected total discounted reward of taking action a in state s, then following the optimal policy π* forever after. It's the best you can possibly do.

Why "max" Instead of Expectation Under π?

Compare the two versions:

EquationUsesMeaning
Qπ(s,a) = r + γ 𝔼[Qπ(s', π(s'))]Expectation under πValue of (s,a) under policy π
Q*(s,a) = r + γ 𝔼[maxa' Q*(s',a')]Max over actionsValue of (s,a) under optimal policy

The max is the "improvement" step baked in. We don't need to iterate because we're directly asking: "what's the best action in s'?" The optimal Q-function already encodes the optimal policy.

Policy-Independence

Q* doesn't depend on any particular policy. It's a property of the environment alone — the best achievable value for every (s, a). Once you have Q*, the optimal policy falls out trivially: π*(s) = argmaxa Q*(s, a).

Worked Example — Bellman Backup

Two states: A, B. Two actions: left, right. From A, right leads to B with reward +2. From B, left gives +1 back to A, right gives +5 and ends. γ = 0.9.

Q*(B, right) = 5 (terminal, no future). Q*(B, left) = 1 + 0.9 × max(Q*(A, left), Q*(A, right)). Q*(A, right) = 2 + 0.9 × max(Q*(B, left), Q*(B, right)) = 2 + 0.9 × 5 = 6.5.

Now Q*(B, left) = 1 + 0.9 × 6.5 = 6.85. Optimal policy from A: go right. From B: go left (6.85 > 5)! The agent should loop A→B→A→B to accumulate reward, rather than ending immediately.

Derivation Sketch

Start from Qπ(s,a) = r(s,a) + γ 𝔼s'[𝔼a'~π[Qπ(s',a')]]. For the optimal policy, π*(a'|s') puts all mass on argmaxa' Q*(s',a'). So 𝔼a'~π*[Q*(s',a')] = maxa' Q*(s',a'). Substitute and you get the Bellman optimality equation.

Chapter 04

The Q-Learning Algorithm

The Bellman optimality equation is a fixed-point equation: Q* is the unique function that, when you plug it into both sides, makes the equation hold. Q-learning finds Q* by repeatedly nudging an estimate Q̂ toward satisfying this equation.

Q-Learning (One Step)
  1. Collect a transition (s, a, r, s') using some exploration policy.
  2. Compute target: y = r + γ maxa' Q̂(s', a')
  3. Update: Regress Q̂(s, a) toward y. In tabular form: Q̂(s,a) ← Q̂(s,a) + α(y − Q̂(s,a)). With a neural net: minimize (Q̂θ(s,a) − y)2.
  4. Repeat from step 1.

Step 2 is the Bellman backup: "what should Q(s,a) be, if the rest of my Q-values are correct?" Step 3 moves Q̂ a little bit toward that target. Over thousands of updates, Q̂ converges to Q*.

Worked Example — Tabular Q-Update

Gridworld. Current estimates: Q̂(A, right) = 0, Q̂(B, left) = 0, Q̂(B, right) = 0. Learning rate α = 0.5, γ = 0.9.

Transition: In state A, take action right, get reward +2, land in B.

Target: y = 2 + 0.9 × max(Q̂(B, left), Q̂(B, right)) = 2 + 0.9 × max(0, 0) = 2.0

Update: Q̂(A, right) ← 0 + 0.5 × (2.0 − 0) = 1.0

One transition, one update. Q̂(A, right) went from 0 to 1.0 — it learned that going right from A has value. Repeat thousands of times and Q̂ converges.

Off-Policy By Nature

The target uses maxa', not the action the exploration policy actually took. This means Q-learning learns about the optimal policy regardless of how the data was collected. You can explore randomly and still learn the best strategy. This is off-policy learning — and it's why Q-learning can use a replay buffer.

🔨 Derivation Q-Learning Convergence via Robbins-Monro ✓ ATTEMPTED

Tabular Q-learning converges to Q* with probability 1. The proof relies on the Robbins-Monro conditions for stochastic approximation. Given the update rule:

Q(s,a) ← Q(s,a) + αt[r + γ maxa' Q(s',a') − Q(s,a)]

Your task: Show why the learning rate schedule must satisfy Σαt = ∞ and Σαt2 < ∞, and explain why the max operator doesn't break convergence (unlike importance sampling, which would be needed for on-policy targets).

If Σαt < ∞, total step sizes are bounded — the algorithm might stop before reaching Q*. The infinite sum condition ensures we can always "get there" from any starting point, no matter how far.
This bounds the accumulated noise variance. Each update has stochastic error (from sampling one transition). If α doesn't decay, noise accumulates and Q oscillates forever. The squared-sum condition ensures noise averages out. Constant α = 0.01 violates this — tabular Q-learning with constant α only converges to a neighborhood of Q*, not Q* exactly.
Consider the Bellman operator T: (TQ)(s,a) = r(s,a) + γ E[max Q(s',a')]. This operator is a contraction in the max-norm with factor γ < 1. Q-learning's update is a noisy sample of TQ. The contraction ensures that even with the max, the expected update always moves Q closer to Q*. No importance sampling needed because we're learning Q* (which doesn't depend on any behavior policy) rather than Qπ.

Full derivation:

Write the Q-learning update as: Qt+1(s,a) = (1 − αt)Qt(s,a) + αt[r + γ maxa' Qt(s',a')].

Define the error et(s,a) = Qt(s,a) − Q*(s,a). Then:

et+1 = (1−αt)et + αt[r + γ max Qt(s',a') − Q*(s,a)]

The term in brackets = (TQt)(s,a) − Q*(s,a) + noise. Since T is a γ-contraction: |TQt − TQ*| ≤ γ|Qt − Q*|. And TQ* = Q* (fixed point).

So the expected error contracts by γ each "effective sweep." The Robbins-Monro conditions ensure: (1) Σα = ∞ means every (s,a) gets updated infinitely often, so we never get stuck; (2) Σα2 < ∞ means noise variance vanishes asymptotically.

Why no importance sampling? The target is r + γ max Q(s',a'). The max doesn't depend on which action was actually taken in s' — only on the next state s'. Since s' is drawn from p(s'|s,a) (the true dynamics regardless of behavior policy), and we take max over all a' (not the behavior policy's action), the target is an unbiased estimate of (TQ)(s,a) regardless of how we chose action a. Off-policy is free because the Bellman optimality operator doesn't reference any policy.

The key insight: Q-learning converges because (1) the Bellman optimality operator is a contraction, and (2) the max removes policy-dependence, making off-policy learning trivial. Any exploration policy works — you just need every (s,a) visited infinitely often.

The Full Tabular Q-Learning Loop

Worked Example — Three Consecutive Updates

α = 0.5, γ = 0.9. All Q-values start at 0. Grid: A↔B↔C (reward +10 at C).

Step 1: (B, right, +10, C-terminal). y = 10 + 0 = 10. Q(B,right) ← 0 + 0.5(10−0) = 5.0.

Step 2: (A, right, −1, B). y = −1 + 0.9 × max(Q(B,left), Q(B,right)) = −1 + 0.9 × 5.0 = 3.5. Q(A,right) ← 0 + 0.5(3.5) = 1.75.

Step 3: (B, right, +10, C-terminal) again. y = 10. Q(B,right) ← 5.0 + 0.5(10−5.0) = 7.5.

Information propagates backward: C's reward reaches B first (step 1), then A (step 2). Each revisit sharpens the estimate. After enough visits, Q converges.

Comparison: Actor-Critic vs. Q-Learning

PropertyOff-Policy Actor-CriticQ-Learning
LearnsQπ (current policy's Q)Q* (optimal Q)
PolicyExplicit πθ, updated by gradientImplicit: argmaxa
Targetr + γ Q(s', a'), a' ~ πr + γ maxa' Q(s', a')
ActionsContinuous or discreteBest for discrete
Data reuseReplay bufferReplay buffer
Chapter 05

Exploration: ε-Greedy

Q-learning has no policy network to sample from. So how does the agent decide what to do while collecting data? It needs an exploration strategy — a way to try new things while mostly exploiting what it already knows.

Definition
ε-Greedy Policy

With probability (1 − ε), take the greedy action argmaxa Q̂(s, a). With probability ε, take a random action uniformly. The parameter ε ∈ [0, 1] controls the exploration rate.

ε-Greedy a = { argmaxa Q̂(s, a) with prob 1 − ε
    random action         with prob ε
Worked Example — ε-Greedy Action Selection

State s has 4 actions: up, down, left, right. Q̂ values: up = 3.2, down = 1.0, left = 5.7, right = 2.4. ε = 0.1.

90% of the time: pick left (highest Q = 5.7).

10% of the time: pick uniformly at random. Each action has probability ε/4 = 0.025. So: P(up) = 0.025, P(down) = 0.025, P(left) = 0.9 + 0.025 = 0.925, P(right) = 0.025.

Even the best action gets a small exploration bonus. But mostly we exploit.

Annealing ε

Early in training, Q̂ is garbage — random numbers. Acting greedy on random numbers is no better than random. So we start with ε = 1.0 (pure exploration) and slowly anneal to something small like ε = 0.01 as Q̂ improves. Common schedule: linear decay over the first million steps.

Worked Example — ε Annealing Schedule

DQN on Atari uses: ε starts at 1.0, linearly decays to 0.1 over 1 million frames, then stays at 0.1 forever.

At frame 0: ε = 1.0. Pure random play. The agent mashes buttons randomly. Q̂ is learning from diverse experience.

At frame 500,000: ε = 0.55. Half the time greedy, half random. Q̂ is getting decent; greedy actions start to make sense.

At frame 1,000,000: ε = 0.1. Mostly greedy, occasional random exploration. Q̂ is good; the agent plays well but still discovers new strategies 10% of the time.

Explore-Exploit Trade-off

ε too high: the agent wastes time on random actions when it already knows what's good. ε too low: the agent never discovers better strategies — it gets stuck exploiting a mediocre early estimate. The art is in the schedule.

Why Not Boltzmann Exploration?

An alternative: sample actions proportional to exp(Q̂(s,a) / T) for temperature T. This is more nuanced than ε-greedy — it prefers higher-Q actions even during exploration. But in practice, ε-greedy works well and is simpler. DQN used ε-greedy.

Checkpoint — Before you move on
You now know the tabular Q-learning algorithm and ε-greedy exploration. Explain in your own words: why is Q-learning off-policy? What specific property of the Bellman optimality equation makes it independent of the behavior policy?
✓ Gate cleared
Model Answer

Q-learning is off-policy because the target y = r + γ maxa' Q(s',a') uses a max over all actions, not the action the behavior policy actually chose in s'. This max encodes the optimal policy's action selection. It doesn't matter which policy generated the (s, a, r, s') tuple — once we know s', the optimal next action is determined by argmax, not by whatever the explorer did. In contrast, SARSA uses Q(s', a'actual) where a'actual was the action actually taken — making it on-policy (the target depends on the behavior policy).

Chapter 06

Target Networks & DQN

Here's where Q-learning almost died — and was resurrected. When you use a neural network for Q̂θ, a subtle instability appears.

The Moving Target Problem

Look at the update again: minimize (Q̂θ(s,a) − y)2 where y = r + γ maxa'θ(s', a'). The target y depends on θ — the same parameters we're updating! Every gradient step changes both the prediction AND the target. It's like a dog chasing its own tail.

Instability

In supervised learning, labels don't move when you update weights. In Q-learning with function approximation, the "label" (target y) shifts with every gradient step. This causes oscillation, divergence, and catastrophic forgetting. For decades, people thought neural nets + Q-learning was fundamentally broken.

The Fix: Freeze the Target

DeepMind's insight (Mnih et al., 2013/2015): maintain two copies of the Q-network.

Definition
Target Network — Q̂φ'

A frozen copy of the Q-network, with parameters φ'. Used only to compute targets: y = r + γ maxa'φ'(s', a'). Updated periodically by copying θ → φ' every N steps (e.g., N = 10,000). Between copies, the target is stationary — regular supervised learning.

DQN Algorithm (Mnih et al. 2015)
  1. Initialize Q-network θ, target network φ' ← θ, replay buffer D.
  2. For each step: observe state s, pick action via ε-greedy on Qθ.
  3. Execute action a, observe reward r, next state s'. Store (s, a, r, s') in D.
  4. Sample a minibatch of transitions from D.
  5. Compute targets: yi = ri + γ maxa'φ'(s'i, a') for each sample.
  6. Gradient step: θ ← θ − α ∇θ Σi (Q̂θ(si, ai) − yi)2.
  7. Every N steps: copy φ' ← θ.
Worked Example — Why Freezing Helps

Without target network: Q̂(B, right) = 5. Target for (A, right): y = 2 + 0.9 × 5 = 6.5. You update Q̂(A, right) toward 6.5. But this update also slightly changes Q̂(B, right) to 5.3 (shared weights!). Next iteration, target becomes 2 + 0.9 × 5.3 = 6.77. The target crept up. After many steps, targets spiral upward — divergence.

With target network: Q̂φ'(B, right) = 5 and stays 5 for 10,000 steps. Every target is stable. You do 10,000 steps of supervised learning against fixed targets. Then copy and repeat. Stable!

DQN's Three Innovations

(1) Target networks for stable training. (2) Experience replay buffer to break temporal correlations in data (consecutive frames are nearly identical — random sampling fixes this). (3) Frame stacking — feed 4 consecutive frames as input so the network sees velocity, not just position. Together: first algorithm to learn directly from pixels, beating humans on most Atari games.

The Replay Buffer in Detail

Without replay, the network trains on consecutive transitions: frame 1001, frame 1002, frame 1003... These are nearly identical images. The network overfits to the current "corridor" of experience and forgets everything else. Experience replay stores the last million transitions in a ring buffer and samples uniformly at random for each minibatch.

Worked Example — Replay Buffer Diversity

Buffer holds 100,000 transitions from the last ~2 hours of play. Minibatch of 32: might include a death from level 1 (step 5,000), a power-up from level 3 (step 72,000), and an enemy dodge from level 2 (step 41,000). This diversity prevents catastrophic forgetting and provides i.i.d.-like training data.

DQN Architecture for Atari

Input: 84×84×4 grayscale image stack. Three convolutional layers (32, 64, 64 filters) with ReLU. One hidden fully-connected layer (512 units). Output: one Q-value per action (e.g., 18 for Atari). Total: ~1.7M parameters. Tiny by modern standards, revolutionary in 2015.

Why Convolutions?

A Q-function over raw pixels needs spatial understanding: "is the ball above or below the paddle?" Fully-connected layers treat each pixel independently. Convolutional layers detect spatial patterns (edges, objects, motion) and are translation-invariant. The same enemy-detection filter works whether the enemy is on the left or right side of the screen.

Practical Results

Mnih et al. (2015) tested DQN on 49 Atari games with identical hyperparameters — no per-game tuning. It surpassed human-level performance on 29 of 49 games. Breakout: the agent discovered the "tunnel through the side" strategy that even experienced humans miss. Q-learning later extended to robotics: Kalashnikov et al. (2018) trained a real robot arm to grasp arbitrary objects using QT-Opt, a Q-learning variant.

💥 Break-It Lab What Dies When You Remove DQN Components? ✓ ATTEMPTED
DQN has three critical components: target networks, epsilon decay, and a large replay buffer. Toggle each off and watch the training curve collapse in different ways.
Remove Target Network ACTIVE
Failure mode: Divergence. Without a frozen target, the target y = r + γ max Qθ(s',a') changes with every gradient step. The network chases a moving target — each increase in Q(s',a') raises the target for Q(s,a), which raises the target for some other state, creating a positive feedback loop. Q-values spiral to infinity. This is the "deadly triad" of function approximation + bootstrapping + off-policy.
Remove Epsilon Decay ACTIVE
Failure mode: Perpetual sub-optimality. With constant ε = 1.0, the agent takes random actions 100% of the time. It collects diverse data (good for the replay buffer) but never exploits what it learned. Even with a perfect Q-function, the policy is random. Performance plateaus at random-policy level. With constant ε = 0.1, the agent converges but 10% of the time takes suboptimal actions — scoring lower than the true optimum.
Small Replay Buffer (1000 instead of 1M) ACTIVE
Failure mode: Catastrophic forgetting. With only 1000 transitions, the buffer holds only the last ~30 seconds of play. If the agent moves to a new room, all data from the previous room is overwritten. The network overfits to recent experience and forgets how to handle earlier situations. You see periodic performance collapses as the agent "forgets" skills it previously mastered. Additionally, consecutive transitions are highly correlated (same screen region), violating the i.i.d. assumption of SGD.
Chapter 07

Q-Value Overestimation

DQN works. But it has a systematic bias: it thinks it's doing better than it actually is. Every Q-value is a little too optimistic. Where does this come from?

The Max-of-Noisy-Estimates Problem

The target uses maxa' Q̂(s', a'). Each Q̂(s', a') is an estimate with some random error. The max operation preferentially selects the estimate with the highest positive error.

Worked Example — Overestimation in Action

True Q-values: Q*(s', left) = 5, Q*(s', right) = 5. Both actions are equally good.

Noisy estimates on trial 1: Q̂(left) = 5 + 2 = 7, Q̂(right) = 5 − 1 = 4. max = 7.

Noisy estimates on trial 2: Q̂(left) = 5 − 3 = 2, Q̂(right) = 5 + 2 = 7. max = 7.

Noisy estimates on trial 3: Q̂(left) = 5 + 1 = 6, Q̂(right) = 5 + 1 = 6. max = 6.

Average of max: (7 + 7 + 6) / 3 = 6.67. True max: 5. Overestimated by 1.67!

The max always picks whichever action had the luckiest noise. Over many updates, this compounds.

Why It Compounds

Overestimated Q(s', a') makes the target y too high. The network trains toward this inflated target, making Q(s, a) too high. That inflated Q(s, a) is used as a target for some other state. The bias propagates backward through the entire Q-function. In deep networks with many states, this can cause Q-values to blow up to absurd magnitudes.

Mathematical Argument

For any random variables X1, ..., Xn with 𝔼[Xi] = μ, we have 𝔼[maxi Xi] ≥ maxi 𝔼[Xi] = μ. The expectation of the max is always at least as large as the max of the expectations. Equality only when there's zero noise. Since neural network Q-estimates always have noise, overestimation is guaranteed.

How Bad Is It in Practice?

Thrun & Schwartz (1993) showed this theoretically. Van Hasselt et al. (2016) measured it empirically on Atari: DQN's estimated Q-values were often 2–10x higher than the true values. The agent thinks it'll score 500 points but actually scores 50. Remarkably, DQN still works (the relative ordering of actions is often preserved), but it could work much better.

Worked Example — Compounding Over Depth

Chain of 3 states: A → B → C. Each has 2 actions. All true Q-values = 10. Noise ~ Uniform(−2, +2). At C: max of two estimates with mean 10. Expected overestimation: 𝔼[max] ≈ 10.67 (overestimate by 0.67). At B: target uses the overestimated C-values as input, adding another layer of max-noise. Expected overestimate at B: ~1.2. At A: ~1.6. The bias grows with the length of the chain — deeper environments suffer more.

🔨 Derivation Why Off-Policy Is Free (No Importance Sampling Needed) ✓ ATTEMPTED

In policy gradient methods (actor-critic), reusing off-policy data requires importance weights πnew(a|s)/πold(a|s) that can explode. In Q-learning, we freely reuse data from any behavior policy with no correction whatsoever.

Your task: Formally show why the Q-learning target y = r + γ maxa' Q(s',a') is an unbiased estimate of the Bellman optimality backup, regardless of which policy generated the transition (s, a, r, s').

s' ~ p(s'|s, a) — the environment's transition dynamics. This depends only on the state-action pair (s, a), not on which policy chose a. The physics are the same regardless of why you took action a. So s' is a valid sample from the correct distribution.
The Bellman optimality operator is: (T*Q)(s,a) = r(s,a) + γ Es'~p[maxa' Q(s',a')]. The max is over all actions — it doesn't sample an action from any policy. It simply evaluates every possible a' and takes the best. No policy appears in this formula. Compare to the Bellman expectation operator: (TπQ)(s,a) = r(s,a) + γ Es',a'~π[Q(s',a')] which depends on π.

Full argument:

The Bellman optimality operator: (T*Q)(s,a) = Es'~p(·|s,a)[r(s,a) + γ maxa' Q(s',a')]

Given a single transition (s, a, r, s') sampled from any behavior policy πb:

1. The behavior policy determined which (s, a) we observe. But we update Q(s, a) specifically — for this particular pair, we need s' ~ p(·|s,a), which is what we get regardless of πb.

2. The sample target r + γ maxa' Q(s',a') is an unbiased estimate of Es'[r + γ max Q(s',a')] because s' was drawn from p(·|s,a).

3. No importance weight is needed because: (a) The action a is already fixed — we're updating Q(s,a), not Q at some other action; (b) The max doesn't sample from πb — it evaluates all actions deterministically.

Contrast with SARSA: SARSA target = r + γ Q(s', a'actual) where a'actual ~ πb(·|s'). Here the target depends on πb's action choice! If πb ≠ π, you're estimating Qπb, not Qπ. Importance weights would be needed to correct this.

The key insight: The max operator is a deterministic function of Q-values — it doesn't sample anything. Therefore no distribution mismatch can occur. Q-learning is off-policy "for free" because optimality is a property of the MDP, not of any particular policy.

Chapter 08

Double Q-Learning

The fix is elegant: decouple action selection from action evaluation. The overestimation happens because the same noisy Q-function both picks the best action and evaluates it. If the noise makes action A look best, the same noise makes its evaluation high. Use a different evaluator and the bias breaks.

Standard Q-Learning Target y = r + γ maxa'θ(s', a')

↑ Same network selects AND evaluates
Double Q-Learning Target a* = argmaxa'θ(s', a')    ← Current network SELECTS
y = r + γ Q̂φ'(s', a*)         ← Target network EVALUATES

The current network picks which action looks best. The target network — an older, independent snapshot — evaluates that action. Since the two networks have different noise patterns (the target network was frozen thousands of steps ago), the selection bias is drastically reduced.

Why Two Independent Estimates?

If you select an action because network A says it's great, and then ask the same network A to evaluate it, you get a biased answer — of course A thinks it's great, A is the one who picked it! But if you ask network B (trained on different data, frozen at a different time), B's errors are uncorrelated with A's selection. The positive noise that tricked A into selecting this action won't trick B into overrating it.

Worked Example — Double Q Fixes the Bias

True Q-values: Q*(left) = 5, Q*(right) = 5.

Standard:θ(left) = 7, Q̂θ(right) = 4. max selects left, evaluates at 7. Target = r + 0.9 × 7 = r + 6.3.

Double:θ selects left (7 > 4). But Q̂φ'(left) = 4.8 (different noise!). Target = r + 0.9 × 4.8 = r + 4.32. Much closer to truth (r + 4.5).

The target network didn't have the same positive noise on "left," so the evaluation is more accurate.

Free Upgrade

Double Q-learning costs nothing extra in computation. You already have a target network in DQN. The only change: instead of maxa' Qφ'(s',a'), you compute a* = argmax Qθ(s',a') then look up Qφ'(s',a*). One line of code. Van Hasselt et al. (2016) showed this consistently improves DQN on Atari, sometimes dramatically.

Definition
Double DQN (DDQN)

DQN with the double Q-learning target. Uses the online network for action selection and the target network for action evaluation. Published by Van Hasselt, Guez, and Silver (2016). Standard in all modern DQN variants.

🔨 Derivation Double Q-Learning Bias Correction ✓ ATTEMPTED

Given n actions with true values Q*(a) = μ for all a, and noisy estimates Q̂(a) = μ + εa where εa ~ N(0, σ2) i.i.d.:

Your task: (1) Show that E[maxa Q̂(a)] > μ and derive the approximate magnitude of overestimation for large n. (2) Show that using an independent estimate Q̂B to evaluate the action selected by Q̂A removes this bias.

For n i.i.d. N(0,1) variables, E[max] ≈ σ √(2 ln n) for large n. This is from extreme value theory. So E[max Q̂(a)] ≈ μ + σ√(2 ln n). The overestimation grows logarithmically with the number of actions.
Let a* = argmaxaA(a). Now evaluate with Q̂B(a*). Since Q̂B's noise is independent of Q̂A's, the fact that Q̂A selected a* because it had positive noise doesn't mean Q̂B(a*) will also be inflated. E[Q̂B(a*)] = μ + E[εB,a*] = μ + 0 = μ. The bias vanishes!
The target network is a stale copy of the online network, not truly independent. So Double DQN doesn't eliminate overestimation entirely — it reduces it. Empirically, it can slightly underestimate because the target network lags behind. But slight underestimation is far less harmful than compounding overestimation.

Part 1 — Overestimation magnitude:

Let Xa = μ + εa, εa ~ N(0, σ2). Define M = maxa Xa.

E[M] = μ + E[maxa εa] = μ + σ · E[max of n standard normals].

For n i.i.d. N(0,1) draws, E[max] ≈ √(2 ln n) − ln(ln n) + ln(4π) / (2√(2 ln n)) for large n.

For Atari with n=18 actions: E[max] ≈ σ · √(2 ln 18) ≈ 2.4σ. If σ = 2 (common in early training), overestimation ≈ 4.8 per state — and this compounds across the temporal chain.

Part 2 — Double Q removes bias:

Let a* = argmaxaA(a). Now E[Q̂B(a*) | a*] = μ + E[εB,a*] = μ, because εB,a* is independent of the event that a* was selected by A. The selection was driven by εA,a* being large, but εB,a* has no knowledge of this — it's a fresh draw. So E[double target] = r + γμ = unbiased.

The key insight: Overestimation arises from using the same noisy function for both selecting and evaluating the best action. Decoupling these with independent estimates breaks the positive correlation between "I chose this because it looked good" and "it evaluates as good" — eliminating the systematic upward bias.

⚔ Adversarial: When does Double Q-Learning HURT?
You're training a DQN agent on a sparse-reward environment where only 1 of 100 actions in the terminal state gives reward +1, all others give 0. Your Q-network is well-trained but has small noise (σ ≈ 0.05). Standard DQN gives max Q̂ ≈ 0.12 (slight overestimate of true 0.01). Double DQN gives Q̂φ'(argmax Q̂θ) ≈ 0.003 (underestimate).
Chapter 09

N-Step Returns for Q-Learning

So far, our target looks one step ahead: y = r + γ max Q̂(s', a'). What if we looked multiple steps ahead, using actual observed rewards instead of estimated ones?

1-Step Target (Standard) y = rt + γ maxa' Q̂(st+1, a')
N-Step Target y = rt + γ rt+1 + γ2 rt+2 + … + γn−1 rt+n−1 + γn maxa' Q̂(st+n, a')

= Σk=0n−1 γk rt+k + γn maxa' Q̂(st+n, a')

Use n real rewards, then bootstrap from Q̂ only at step n. This reduces our dependence on Q̂, which is especially helpful early in training when Q̂ is inaccurate.

Worked Example — 1-Step vs 3-Step

Trajectory: s0 ⟶ r=1 ⟶ s1 ⟶ r=1 ⟶ s2 ⟶ r=10 ⟶ s3. γ = 0.9. True Q*(s0, a) = 1 + 0.9 + 0.81 × 10 = 10.0. Current Q̂(s1) = 2 (bad estimate), Q̂(s3) = 0.

1-step target: y = 1 + 0.9 × 2 = 2.8. Way off (true: 10.0). Bottlenecked by bad Q̂(s1).

3-step target: y = 1 + 0.9 × 1 + 0.81 × 10 + 0.729 × 0 = 10.0. Exact! We used 3 real rewards and only bootstrapped from s3 where Q̂ = 0 is close to truth (episode ended).

The Catch: On-Policy Data

The n-step return uses actions at+1, ..., at+n-1 that were chosen by the exploration policy at collection time. Q-learning is supposed to learn about the optimal policy. If the exploration policy took suboptimal actions in steps t+1 to t+n−1, those rewards don't reflect what the optimal policy would have earned.

Theoretical Impurity

N-step Q-learning is only correct when n=1 (fully off-policy) or when the data is on-policy (collected by the current greedy policy). For n > 1 with off-policy data, the target is biased. In practice, most implementations just ignore this and use n ∈ {3, 5, 10} anyway — it works because the benefits (lower reliance on Q̂) outweigh the bias.

The Bias-Variance Trade-off of n

Small n (n=1): high bias (depends heavily on Q̂'s accuracy), low variance (one reward, one bootstrap). Large n: low bias (many real rewards), high variance (rewards are stochastic, compounds over steps). Most practitioners find n = 3 to 5 is the sweet spot.

N-Step in Practice: Rainbow DQN

The Rainbow agent (Hessel et al., 2018) combined six DQN improvements into one system. When they measured each improvement's contribution, n-step returns (n=3) provided the single largest performance boost — more than double DQN, more than prioritized replay, more than any individual trick. The combination of all six was even better, but if you could only add one thing to vanilla DQN, multi-step returns are the best bang for your buck.

Worked Example — N-Step with Discount Stacking

n = 3, γ = 0.99. Rewards: r0 = 1, r1 = 0, r2 = 0. Q̂(s3, best) = 100.

3-step target: y = 1 + 0.99 × 0 + 0.992 × 0 + 0.993 × 100 = 1 + 0 + 0 + 97.03 = 98.03.

1-step target: y = 1 + 0.99 × Q̂(s1, best). If Q̂(s1) = 80 (inaccurate), y = 1 + 79.2 = 80.2. Much worse.

The 3-step target skipped over two zero-reward states and bootstrapped closer to the actual high-reward region. The 1-step target got stuck relying on a bad intermediate estimate.

💻 Build It Implement Q-Learning Update with Experience Replay ✓ ATTEMPTED
Implement the core DQN training step: sample a minibatch from the replay buffer, compute Double DQN targets, and return the loss. Assume Q-network and target network are PyTorch modules.
signature def dqn_update(q_net, target_net, replay_buffer, batch_size=32, gamma=0.99): """Sample minibatch, compute Double DQN targets, return MSE loss. Args: q_net: nn.Module with forward(states) -> Q-values [B, n_actions] target_net: frozen copy of q_net (same architecture) replay_buffer: object with .sample(n) -> (states, actions, rewards, next_states, dones) batch_size: number of transitions per update gamma: discount factor Returns: loss: scalar tensor (MSE between predicted Q and target Q) """
Test case
# With states shape [32, 4], n_actions = 2, rewards all 1.0, gamma = 0.99:
# If q_net predicts [[5.0, 3.0], ...] and target_net predicts [[4.8, 3.1], ...],
# and actions taken were all 0:
# predicted = 5.0, target = 1.0 + 0.99 * 4.8 = 5.752
# loss for that sample = (5.0 - 5.752)^2 = 0.566
To extract Q(s, a) for specific actions from the [B, n_actions] tensor: q_values.gather(1, actions.unsqueeze(1)).squeeze(1). This indexes into each row at the column specified by the action taken.
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, replay_buffer, batch_size=32, gamma=0.99):
    # 1. Sample minibatch
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

    # 2. Q(s, a) for actions taken
    q_values = q_net(states)  # [B, n_actions]
    q_selected = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)  # [B]

    # 3. Double DQN targets
    with torch.no_grad():
        # Online network SELECTS best action
        next_q = q_net(next_states)  # [B, n_actions]
        best_actions = next_q.argmax(dim=1)  # [B]

        # Target network EVALUATES that action
        target_q = target_net(next_states)  # [B, n_actions]
        next_q_values = target_q.gather(1, best_actions.unsqueeze(1)).squeeze(1)

        # y = r + gamma * Q_target(s', a*) * (1 - done)
        targets = rewards + gamma * next_q_values * (1.0 - dones.float())

    # 4. MSE loss
    loss = F.mse_loss(q_selected, targets)
    return loss
Bonus challenge: Replace MSE with Huber loss (smooth_l1_loss). Why does Huber loss help with outlier transitions that have very large TD errors? (Answer: it clips gradients for large errors, preventing one surprising transition from causing a catastrophic weight update.)
Chapter 10

The Online RL Landscape

Now that you've seen Q-learning, let's place it in the full family of online RL algorithms. Every method we've covered in this series has a slot in this table:

AlgorithmDataValue FunctionAction SelectionKey Feature
Vanilla PGOn-policyNoneSample from πθReward-to-go weighting
PPONear on-policyVπSample from πθGAE + clipped surrogate
SACReplay bufferQπSample a ~ πθEntropy bonus + off-policy
DQNReplay bufferQ*argmax Q̂Target networks + ε-greedy
The Spectrum

Moving down the table: less policy, more value. Vanilla PG has no value function at all. PPO adds a value baseline. SAC learns Qπ and uses it to train π. DQN drops π entirely and just uses Q*. Each step trades flexibility (continuous actions, stochastic policies) for sample efficiency (replay buffers, off-policy learning).

When to Use Which

Practical Guidance (Chelsea Finn's Advice)
Choosing an Algorithm

PPO — when you want stability and simplicity. Easy to tune, reliable convergence. Works for both discrete and continuous. The "safe default."

DQN / Double DQN — when actions are discrete (Atari, board games, token selection). Very sample-efficient with replay buffer. Can't handle continuous actions without discretization.

SAC — when you need data efficiency in continuous control (robotics, locomotion). Off-policy + entropy regularization = good exploration + data reuse.

The Key Differences That Matter

QuestionPolicy GradientQ-Learning
What do we learn?πθ(a|s) directlyQ*(s, a), derive π = argmax
Continuous actions?Natural (Gaussian π)Hard (argmax over infinite set)
Stochastic policy?Yes, inherentlyNo, deterministic argmax
Replay buffer?Not for vanilla PGYes, always
Sample efficiencyLow (on-policy)High (off-policy reuse)
StabilityMore stable (PPO)Tricky (need target nets)
Chapter 11

Summary & Cheat Sheet

You now understand the complete Q-learning pipeline, from the initial insight to the practical tricks that make it work at scale.

The Core Chain

The Logic in One Paragraph

If you have Qπ, argmax gives a better policy (policy improvement). Repeating evaluate-improve converges but is slow (policy iteration). Skip iteration by solving directly for Q* via the Bellman optimality equation (Q-learning). Use a neural net for Q̂ but freeze the target to avoid instability (DQN). Fix overestimation by decoupling selection and evaluation (Double DQN). Look multiple steps ahead for faster learning (N-step). The result: a family of algorithms that learn optimal behavior from raw pixels.

Key Equations

Bellman Optimality Q*(s, a) = r(s, a) + γ 𝔼s'[ maxa' Q*(s', a') ]
Q-Learning Target y = r + γ maxa'φ'(s', a')
Double Q-Learning Target a* = argmaxa'θ(s', a')
y = r + γ Q̂φ'(s', a*)
N-Step Target y = Σk=0n−1 γk rt+k + γn maxa' Q̂(st+n, a')

Strengths vs. Weaknesses

StrengthsWeaknesses
Very sample-efficient (replay buffer)Discrete actions only (without modification)
Off-policy: reuse all past dataQ-value overestimation (mitigated by Double DQN)
No explicit policy network neededRequires target networks for stability
Proven at scale (Atari, robotics)Brittle hyperparameters (ε schedule, update freq)
Conceptually clean (just learn a value table)Deterministic policy (no exploration without ε)
🏗 Design Challenge You're the Architect: DQN for Atari on a 4GB GPU ✓ ATTEMPTED
You need to train a DQN agent to play Atari Breakout. Your hardware: a single GPU with 4GB VRAM. You have 10 million environment frames of budget. Design the full system.
GPU VRAM
4 GB
Frame budget
10M frames
Input
210x160 RGB @ 60fps
Actions
4 (noop, fire, left, right)
Target
Superhuman (>400 avg score)
1. Frame stacking: how many frames? Grayscale or RGB? Resolution? How does this affect VRAM?
2. Network architecture: how many conv layers, filter sizes, FC layer width? Must fit in 4GB with batch + target net.
3. Replay buffer: 1M frames at 84x84x1 = 7GB RAM. How do you fit this? (uint8 storage? Smaller buffer? Compressed?)
4. Training schedule: when to start training? How often to update target? ε decay over what fraction of 10M frames?
5. Double DQN + n-step: which improvements give the most bang-per-buck under your memory constraints?

Frame preprocessing: Downsample 210x160 RGB → 84x84 grayscale. Stack 4 frames (velocity info). Frame skip = 4 (repeat each action 4 times, max-pool last 2 frames to avoid flickering). Input: 84x84x4 = 28KB per observation.

Network: 3 conv layers (8x8/32, 4x4/64, 3x3/64) + FC(512) + output(4). ~1.7M params = ~7MB. Two copies (online + target) = 14MB. Tiny by GPU standards.

Replay buffer: Store 1M frames as uint8 (1 byte per pixel). 84x84x1M = 7.06 GB in RAM (not VRAM). Only the minibatch of 32 lives on GPU at any time: 32x84x84x4 = 903KB. Fits easily.

Training schedule: Start training after 50K random frames (fill buffer). Update every 4 frames. Target network update every 10K steps. ε: 1.0 → 0.1 over first 1M frames. Total: 10M frames ≈ 50 hours on one GPU (2015 hardware).

Key memory insight: The buffer lives in CPU RAM, the networks live on GPU. Only the 32-sample minibatch is transferred per step. This is why DQN works on small GPUs despite having a massive replay buffer.

🔗 Pattern Recognition
From Q-Values to Policy Gradients via Advantage
Q-Learning (This Lesson)
π*(s) = argmaxa Q*(s,a)
Implicit policy. Hard for continuous actions.
Actor-Critic
A(s,a) = Q(s,a) − V(s)
Q becomes the advantage → weights policy gradient. → Actor-Critic

The connection: Q-learning learns Q* and extracts the policy via argmax. Actor-critic learns Qπ and uses it to compute the advantage A = Q − V, which weights the policy gradient. Both need an accurate Q-function — they just use it differently. Q-learning uses Q directly (argmax). Actor-critic uses Q indirectly (to improve a learned policy).

SAC (Soft Actor-Critic) combines both ideas: it learns Qπ with a replay buffer (like DQN) but uses it to train an explicit policy (like actor-critic). Why does this combination dominate continuous control?

What Comes Next

Offline RL: What if you can't interact with the environment at all? Learn from a fixed dataset of past trajectories. Key challenge: the agent can't try actions that aren't in the data. Q-learning is especially dangerous here — it will overestimate Q-values for out-of-distribution actions that were never tried.

Model-Based RL: Instead of learning Q-values from trial and error, learn a model of the environment (the transition dynamics p(s'|s,a)) and plan through it. Can be dramatically more sample-efficient because you can simulate thousands of trajectories in your head without touching the real environment.

Distributional RL: Instead of learning 𝔼[return], learn the full distribution of returns. C51 (Bellemare et al., 2017) and Rainbow (Hessel et al., 2018) combine distributional RL with double Q-learning, n-step returns, and prioritized replay for state-of-the-art Atari performance.

The Historical Arc

YearMilestoneKey Idea
1989Watkins introduces Q-learningTabular, convergence proof
1993Thrun & Schwartz: overestimationMax bias identified
2013DQN (Mnih et al.)Neural nets + target networks + replay
2015DQN in NatureSuperhuman Atari, 49 games
2016Double DQN (Van Hasselt et al.)Decouple select/evaluate
2016Dueling DQN, Prioritized ReplayArchitecture + sampling improvements
2018RainbowCombine 6 improvements, SOTA Atari
2018QT-Opt (Kalashnikov et al.)Q-learning for real robot grasping

Quick Reference: DQN Hyperparameters

ParameterTypical ValueWhat It Controls
Learning rate α2.5 × 10−4Step size for gradient descent
Discount γ0.99How much to value future rewards
ε start / end1.0 / 0.01Exploration rate schedule
ε decay frames1,000,000How fast to stop exploring
Replay buffer size1,000,000How much history to remember
Minibatch size32Transitions per gradient step
Target update freq10,000 stepsHow often to refresh target net
N-step3Multi-step return horizon
The One Sentence

Q-learning = learn how good every action is in every state, pick the best one, and never bother with an explicit policy. Everything else is stabilization.