What if you could throw away the policy network entirely — and just learn a table of "how good is each action?" That's Q-learning.
In actor-critic, you have two networks: an actor (the policy πθ) that picks actions, and a critic (the Q-function Q̂π) that evaluates them. The actor needs the critic for low-variance gradient estimates. But here's the question nobody asks at first: does the critic need the actor?
Suppose you had a perfect critic — an oracle that knows Q̂π(s, a) for every state-action pair. What would the best possible action be in state s? Just pick the action with the highest Q-value:
No gradient descent. No sampling. No log-probabilities. Just: look up every action's value, take the best one.
The Policy Improvement Theorem guarantees that πnew defined this way is always at least as good as the original π. So if you can accurately estimate Q, you never need an explicit policy network at all. The policy is implicit: just pick argmax.
This is the conceptual leap from actor-critic to Q-learning: throw away the actor entirely. Invest all your modeling capacity into learning the best possible Q-function. Derive the policy for free via argmax.
Let Qπ(s, a) be the Q-function under policy π. Define π'(s) = argmaxa Qπ(s, a). Then for all states s: Vπ'(s) ≥ Vπ(s). Proof idea: Vπ(s) = Qπ(s, π(s)) ≤ maxa Qπ(s, a) = Qπ(s, π'(s)). The inequality holds because π might not pick the best action, but argmax always does. Iterating this argument across time steps gives Vπ'(s) ≥ Vπ(s) everywhere.
A family of RL algorithms that learn a Q-function (state-action value) and derive the policy as argmax over actions. No explicit policy network is trained.
Imagine a 2-action problem. Current policy: π(left) = 0.6, π(right) = 0.4. True Q-values: Q(left) = 3, Q(right) = 7. Expected value under π: 0.6 × 3 + 0.4 × 7 = 4.6. Argmax policy: always go right. Expected value: 7.0. Improvement is guaranteed because argmax concentrates all probability on the best action.
Actor-critic can handle continuous actions naturally (the actor outputs parameters of a Gaussian). Q-learning with argmax requires searching over all actions — easy for discrete actions (just try all 4 directions), hard for continuous ones (infinite choices). This is why Q-learning dominates in discrete-action settings like Atari, while actor-critic methods (like SAC) dominate in continuous control.
argmaxa Q(s, a) means "evaluate Q for every possible action and pick the highest." With 4 discrete actions, that's 4 forward passes (or one forward pass outputting 4 values). With continuous actions, there are infinitely many — you'd need optimization inside each action selection, which is expensive. This is the fundamental limitation of pure Q-learning.
Q-learning is value iteration with the expectation replaced by a single sample. Instead of sweeping every (s,a) with the known model, you observe one transition and update one entry. Over infinite visits, the sample average converges to the expectation — recovering the same fixed point as DP.
Where else do you see "replace expensive exact computation with cheap stochastic samples"? (Hint: SGD vs. full-batch gradient descent.)
Before jumping to the modern algorithm, let's see the classical approach. Policy iteration alternates between two steps, like breathing in and out:
Policy iteration is guaranteed to converge to the optimal policy π* in a finite number of steps (for finite state/action spaces). Each iteration can only improve or maintain the policy — never make it worse.
Consider a 3×1 grid: states {A, B, C}. Actions: left or right. State C gives reward +10 and ends the episode. Every step costs −1. Discount γ = 0.9.
Start: Random policy π (50% left, 50% right everywhere).
Step 1 — Evaluate: From state A going right: expected reward under random walk is messy, but we can compute Qπ(A, right) ≈ 4.2, Qπ(A, left) ≈ −1.0 (hits wall). From B going right: Qπ(B, right) = −1 + 0.9 × 10 = 8.0. From B going left: Qπ(B, left) = −1 + 0.9 × Qπ(A, random) ≈ 1.9.
Step 2 — Improve: In every state, pick the action with higher Q. State A: right (4.2 > −1.0). State B: right (8.0 > 1.9). New policy: always go right.
Result: After just one iteration, we found the optimal policy.
Step 1 is expensive. Computing Qπ exactly requires solving a system of linear equations (or running many rollouts) for the current policy. Then you improve the policy and have to evaluate the new policy all over again. Each evaluation is a full inner loop.
In deep RL, states are images (millions of pixels). You can't compute Qπ(s, a) for every (s, a) — there are infinitely many. We need a shortcut that skips the expensive evaluation step entirely. That's what the Bellman optimality equation gives us.
There are finitely many deterministic policies over finite state/action spaces. Each improvement step produces a policy that's strictly better or equal. A strictly increasing sequence over a finite set must terminate. At termination, argmaxa Qπ(s, a) = π(s) for all s — meaning no improvement is possible. That's the definition of optimality.
Here's the breakthrough. Instead of iterating between evaluate and improve, what if we directly solved for the optimal Q-function Q* — the one that already reflects the best possible policy?
Read it aloud: "The value of taking action a in state s, then acting optimally forever, equals the immediate reward plus the discounted value of the best action in whatever state you land in."
Q*(s, a) is the expected total discounted reward of taking action a in state s, then following the optimal policy π* forever after. It's the best you can possibly do.
Compare the two versions:
| Equation | Uses | Meaning |
|---|---|---|
| Qπ(s,a) = r + γ 𝔼[Qπ(s', π(s'))] | Expectation under π | Value of (s,a) under policy π |
| Q*(s,a) = r + γ 𝔼[maxa' Q*(s',a')] | Max over actions | Value of (s,a) under optimal policy |
The max is the "improvement" step baked in. We don't need to iterate because we're directly asking: "what's the best action in s'?" The optimal Q-function already encodes the optimal policy.
Q* doesn't depend on any particular policy. It's a property of the environment alone — the best achievable value for every (s, a). Once you have Q*, the optimal policy falls out trivially: π*(s) = argmaxa Q*(s, a).
Two states: A, B. Two actions: left, right. From A, right leads to B with reward +2. From B, left gives +1 back to A, right gives +5 and ends. γ = 0.9.
Q*(B, right) = 5 (terminal, no future). Q*(B, left) = 1 + 0.9 × max(Q*(A, left), Q*(A, right)). Q*(A, right) = 2 + 0.9 × max(Q*(B, left), Q*(B, right)) = 2 + 0.9 × 5 = 6.5.
Now Q*(B, left) = 1 + 0.9 × 6.5 = 6.85. Optimal policy from A: go right. From B: go left (6.85 > 5)! The agent should loop A→B→A→B to accumulate reward, rather than ending immediately.
Start from Qπ(s,a) = r(s,a) + γ 𝔼s'[𝔼a'~π[Qπ(s',a')]]. For the optimal policy, π*(a'|s') puts all mass on argmaxa' Q*(s',a'). So 𝔼a'~π*[Q*(s',a')] = maxa' Q*(s',a'). Substitute and you get the Bellman optimality equation.
The Bellman optimality equation is a fixed-point equation: Q* is the unique function that, when you plug it into both sides, makes the equation hold. Q-learning finds Q* by repeatedly nudging an estimate Q̂ toward satisfying this equation.
Step 2 is the Bellman backup: "what should Q(s,a) be, if the rest of my Q-values are correct?" Step 3 moves Q̂ a little bit toward that target. Over thousands of updates, Q̂ converges to Q*.
Gridworld. Current estimates: Q̂(A, right) = 0, Q̂(B, left) = 0, Q̂(B, right) = 0. Learning rate α = 0.5, γ = 0.9.
Transition: In state A, take action right, get reward +2, land in B.
Target: y = 2 + 0.9 × max(Q̂(B, left), Q̂(B, right)) = 2 + 0.9 × max(0, 0) = 2.0
Update: Q̂(A, right) ← 0 + 0.5 × (2.0 − 0) = 1.0
One transition, one update. Q̂(A, right) went from 0 to 1.0 — it learned that going right from A has value. Repeat thousands of times and Q̂ converges.
The target uses maxa', not the action the exploration policy actually took. This means Q-learning learns about the optimal policy regardless of how the data was collected. You can explore randomly and still learn the best strategy. This is off-policy learning — and it's why Q-learning can use a replay buffer.
Tabular Q-learning converges to Q* with probability 1. The proof relies on the Robbins-Monro conditions for stochastic approximation. Given the update rule:
Q(s,a) ← Q(s,a) + αt[r + γ maxa' Q(s',a') − Q(s,a)]
Your task: Show why the learning rate schedule must satisfy Σαt = ∞ and Σαt2 < ∞, and explain why the max operator doesn't break convergence (unlike importance sampling, which would be needed for on-policy targets).
Full derivation:
Write the Q-learning update as: Qt+1(s,a) = (1 − αt)Qt(s,a) + αt[r + γ maxa' Qt(s',a')].
Define the error et(s,a) = Qt(s,a) − Q*(s,a). Then:
et+1 = (1−αt)et + αt[r + γ max Qt(s',a') − Q*(s,a)]
The term in brackets = (TQt)(s,a) − Q*(s,a) + noise. Since T is a γ-contraction: |TQt − TQ*|∞ ≤ γ|Qt − Q*|∞. And TQ* = Q* (fixed point).
So the expected error contracts by γ each "effective sweep." The Robbins-Monro conditions ensure: (1) Σα = ∞ means every (s,a) gets updated infinitely often, so we never get stuck; (2) Σα2 < ∞ means noise variance vanishes asymptotically.
Why no importance sampling? The target is r + γ max Q(s',a'). The max doesn't depend on which action was actually taken in s' — only on the next state s'. Since s' is drawn from p(s'|s,a) (the true dynamics regardless of behavior policy), and we take max over all a' (not the behavior policy's action), the target is an unbiased estimate of (TQ)(s,a) regardless of how we chose action a. Off-policy is free because the Bellman optimality operator doesn't reference any policy.
The key insight: Q-learning converges because (1) the Bellman optimality operator is a contraction, and (2) the max removes policy-dependence, making off-policy learning trivial. Any exploration policy works — you just need every (s,a) visited infinitely often.
α = 0.5, γ = 0.9. All Q-values start at 0. Grid: A↔B↔C (reward +10 at C).
Step 1: (B, right, +10, C-terminal). y = 10 + 0 = 10. Q(B,right) ← 0 + 0.5(10−0) = 5.0.
Step 2: (A, right, −1, B). y = −1 + 0.9 × max(Q(B,left), Q(B,right)) = −1 + 0.9 × 5.0 = 3.5. Q(A,right) ← 0 + 0.5(3.5) = 1.75.
Step 3: (B, right, +10, C-terminal) again. y = 10. Q(B,right) ← 5.0 + 0.5(10−5.0) = 7.5.
Information propagates backward: C's reward reaches B first (step 1), then A (step 2). Each revisit sharpens the estimate. After enough visits, Q converges.
| Property | Off-Policy Actor-Critic | Q-Learning |
|---|---|---|
| Learns | Qπ (current policy's Q) | Q* (optimal Q) |
| Policy | Explicit πθ, updated by gradient | Implicit: argmaxa Q̂ |
| Target | r + γ Q(s', a'), a' ~ π | r + γ maxa' Q(s', a') |
| Actions | Continuous or discrete | Best for discrete |
| Data reuse | Replay buffer | Replay buffer |
Q-learning has no policy network to sample from. So how does the agent decide what to do while collecting data? It needs an exploration strategy — a way to try new things while mostly exploiting what it already knows.
With probability (1 − ε), take the greedy action argmaxa Q̂(s, a). With probability ε, take a random action uniformly. The parameter ε ∈ [0, 1] controls the exploration rate.
State s has 4 actions: up, down, left, right. Q̂ values: up = 3.2, down = 1.0, left = 5.7, right = 2.4. ε = 0.1.
90% of the time: pick left (highest Q = 5.7).
10% of the time: pick uniformly at random. Each action has probability ε/4 = 0.025. So: P(up) = 0.025, P(down) = 0.025, P(left) = 0.9 + 0.025 = 0.925, P(right) = 0.025.
Even the best action gets a small exploration bonus. But mostly we exploit.
Early in training, Q̂ is garbage — random numbers. Acting greedy on random numbers is no better than random. So we start with ε = 1.0 (pure exploration) and slowly anneal to something small like ε = 0.01 as Q̂ improves. Common schedule: linear decay over the first million steps.
DQN on Atari uses: ε starts at 1.0, linearly decays to 0.1 over 1 million frames, then stays at 0.1 forever.
At frame 0: ε = 1.0. Pure random play. The agent mashes buttons randomly. Q̂ is learning from diverse experience.
At frame 500,000: ε = 0.55. Half the time greedy, half random. Q̂ is getting decent; greedy actions start to make sense.
At frame 1,000,000: ε = 0.1. Mostly greedy, occasional random exploration. Q̂ is good; the agent plays well but still discovers new strategies 10% of the time.
ε too high: the agent wastes time on random actions when it already knows what's good. ε too low: the agent never discovers better strategies — it gets stuck exploiting a mediocre early estimate. The art is in the schedule.
An alternative: sample actions proportional to exp(Q̂(s,a) / T) for temperature T. This is more nuanced than ε-greedy — it prefers higher-Q actions even during exploration. But in practice, ε-greedy works well and is simpler. DQN used ε-greedy.
Q-learning is off-policy because the target y = r + γ maxa' Q(s',a') uses a max over all actions, not the action the behavior policy actually chose in s'. This max encodes the optimal policy's action selection. It doesn't matter which policy generated the (s, a, r, s') tuple — once we know s', the optimal next action is determined by argmax, not by whatever the explorer did. In contrast, SARSA uses Q(s', a'actual) where a'actual was the action actually taken — making it on-policy (the target depends on the behavior policy).
Here's where Q-learning almost died — and was resurrected. When you use a neural network for Q̂θ, a subtle instability appears.
Look at the update again: minimize (Q̂θ(s,a) − y)2 where y = r + γ maxa' Q̂θ(s', a'). The target y depends on θ — the same parameters we're updating! Every gradient step changes both the prediction AND the target. It's like a dog chasing its own tail.
In supervised learning, labels don't move when you update weights. In Q-learning with function approximation, the "label" (target y) shifts with every gradient step. This causes oscillation, divergence, and catastrophic forgetting. For decades, people thought neural nets + Q-learning was fundamentally broken.
DeepMind's insight (Mnih et al., 2013/2015): maintain two copies of the Q-network.
A frozen copy of the Q-network, with parameters φ'. Used only to compute targets: y = r + γ maxa' Q̂φ'(s', a'). Updated periodically by copying θ → φ' every N steps (e.g., N = 10,000). Between copies, the target is stationary — regular supervised learning.
Without target network: Q̂(B, right) = 5. Target for (A, right): y = 2 + 0.9 × 5 = 6.5. You update Q̂(A, right) toward 6.5. But this update also slightly changes Q̂(B, right) to 5.3 (shared weights!). Next iteration, target becomes 2 + 0.9 × 5.3 = 6.77. The target crept up. After many steps, targets spiral upward — divergence.
With target network: Q̂φ'(B, right) = 5 and stays 5 for 10,000 steps. Every target is stable. You do 10,000 steps of supervised learning against fixed targets. Then copy and repeat. Stable!
(1) Target networks for stable training. (2) Experience replay buffer to break temporal correlations in data (consecutive frames are nearly identical — random sampling fixes this). (3) Frame stacking — feed 4 consecutive frames as input so the network sees velocity, not just position. Together: first algorithm to learn directly from pixels, beating humans on most Atari games.
Without replay, the network trains on consecutive transitions: frame 1001, frame 1002, frame 1003... These are nearly identical images. The network overfits to the current "corridor" of experience and forgets everything else. Experience replay stores the last million transitions in a ring buffer and samples uniformly at random for each minibatch.
Buffer holds 100,000 transitions from the last ~2 hours of play. Minibatch of 32: might include a death from level 1 (step 5,000), a power-up from level 3 (step 72,000), and an enemy dodge from level 2 (step 41,000). This diversity prevents catastrophic forgetting and provides i.i.d.-like training data.
Input: 84×84×4 grayscale image stack. Three convolutional layers (32, 64, 64 filters) with ReLU. One hidden fully-connected layer (512 units). Output: one Q-value per action (e.g., 18 for Atari). Total: ~1.7M parameters. Tiny by modern standards, revolutionary in 2015.
A Q-function over raw pixels needs spatial understanding: "is the ball above or below the paddle?" Fully-connected layers treat each pixel independently. Convolutional layers detect spatial patterns (edges, objects, motion) and are translation-invariant. The same enemy-detection filter works whether the enemy is on the left or right side of the screen.
Mnih et al. (2015) tested DQN on 49 Atari games with identical hyperparameters — no per-game tuning. It surpassed human-level performance on 29 of 49 games. Breakout: the agent discovered the "tunnel through the side" strategy that even experienced humans miss. Q-learning later extended to robotics: Kalashnikov et al. (2018) trained a real robot arm to grasp arbitrary objects using QT-Opt, a Q-learning variant.
DQN works. But it has a systematic bias: it thinks it's doing better than it actually is. Every Q-value is a little too optimistic. Where does this come from?
The target uses maxa' Q̂(s', a'). Each Q̂(s', a') is an estimate with some random error. The max operation preferentially selects the estimate with the highest positive error.
True Q-values: Q*(s', left) = 5, Q*(s', right) = 5. Both actions are equally good.
Noisy estimates on trial 1: Q̂(left) = 5 + 2 = 7, Q̂(right) = 5 − 1 = 4. max = 7.
Noisy estimates on trial 2: Q̂(left) = 5 − 3 = 2, Q̂(right) = 5 + 2 = 7. max = 7.
Noisy estimates on trial 3: Q̂(left) = 5 + 1 = 6, Q̂(right) = 5 + 1 = 6. max = 6.
Average of max: (7 + 7 + 6) / 3 = 6.67. True max: 5. Overestimated by 1.67!
The max always picks whichever action had the luckiest noise. Over many updates, this compounds.
Overestimated Q(s', a') makes the target y too high. The network trains toward this inflated target, making Q(s, a) too high. That inflated Q(s, a) is used as a target for some other state. The bias propagates backward through the entire Q-function. In deep networks with many states, this can cause Q-values to blow up to absurd magnitudes.
For any random variables X1, ..., Xn with 𝔼[Xi] = μ, we have 𝔼[maxi Xi] ≥ maxi 𝔼[Xi] = μ. The expectation of the max is always at least as large as the max of the expectations. Equality only when there's zero noise. Since neural network Q-estimates always have noise, overestimation is guaranteed.
Thrun & Schwartz (1993) showed this theoretically. Van Hasselt et al. (2016) measured it empirically on Atari: DQN's estimated Q-values were often 2–10x higher than the true values. The agent thinks it'll score 500 points but actually scores 50. Remarkably, DQN still works (the relative ordering of actions is often preserved), but it could work much better.
Chain of 3 states: A → B → C. Each has 2 actions. All true Q-values = 10. Noise ~ Uniform(−2, +2). At C: max of two estimates with mean 10. Expected overestimation: 𝔼[max] ≈ 10.67 (overestimate by 0.67). At B: target uses the overestimated C-values as input, adding another layer of max-noise. Expected overestimate at B: ~1.2. At A: ~1.6. The bias grows with the length of the chain — deeper environments suffer more.
In policy gradient methods (actor-critic), reusing off-policy data requires importance weights πnew(a|s)/πold(a|s) that can explode. In Q-learning, we freely reuse data from any behavior policy with no correction whatsoever.
Your task: Formally show why the Q-learning target y = r + γ maxa' Q(s',a') is an unbiased estimate of the Bellman optimality backup, regardless of which policy generated the transition (s, a, r, s').
Full argument:
The Bellman optimality operator: (T*Q)(s,a) = Es'~p(·|s,a)[r(s,a) + γ maxa' Q(s',a')]
Given a single transition (s, a, r, s') sampled from any behavior policy πb:
1. The behavior policy determined which (s, a) we observe. But we update Q(s, a) specifically — for this particular pair, we need s' ~ p(·|s,a), which is what we get regardless of πb.
2. The sample target r + γ maxa' Q(s',a') is an unbiased estimate of Es'[r + γ max Q(s',a')] because s' was drawn from p(·|s,a).
3. No importance weight is needed because: (a) The action a is already fixed — we're updating Q(s,a), not Q at some other action; (b) The max doesn't sample from πb — it evaluates all actions deterministically.
Contrast with SARSA: SARSA target = r + γ Q(s', a'actual) where a'actual ~ πb(·|s'). Here the target depends on πb's action choice! If πb ≠ π, you're estimating Qπb, not Qπ. Importance weights would be needed to correct this.
The key insight: The max operator is a deterministic function of Q-values — it doesn't sample anything. Therefore no distribution mismatch can occur. Q-learning is off-policy "for free" because optimality is a property of the MDP, not of any particular policy.
The fix is elegant: decouple action selection from action evaluation. The overestimation happens because the same noisy Q-function both picks the best action and evaluates it. If the noise makes action A look best, the same noise makes its evaluation high. Use a different evaluator and the bias breaks.
The current network picks which action looks best. The target network — an older, independent snapshot — evaluates that action. Since the two networks have different noise patterns (the target network was frozen thousands of steps ago), the selection bias is drastically reduced.
If you select an action because network A says it's great, and then ask the same network A to evaluate it, you get a biased answer — of course A thinks it's great, A is the one who picked it! But if you ask network B (trained on different data, frozen at a different time), B's errors are uncorrelated with A's selection. The positive noise that tricked A into selecting this action won't trick B into overrating it.
True Q-values: Q*(left) = 5, Q*(right) = 5.
Standard: Q̂θ(left) = 7, Q̂θ(right) = 4. max selects left, evaluates at 7. Target = r + 0.9 × 7 = r + 6.3.
Double: Q̂θ selects left (7 > 4). But Q̂φ'(left) = 4.8 (different noise!). Target = r + 0.9 × 4.8 = r + 4.32. Much closer to truth (r + 4.5).
The target network didn't have the same positive noise on "left," so the evaluation is more accurate.
Double Q-learning costs nothing extra in computation. You already have a target network in DQN. The only change: instead of maxa' Qφ'(s',a'), you compute a* = argmax Qθ(s',a') then look up Qφ'(s',a*). One line of code. Van Hasselt et al. (2016) showed this consistently improves DQN on Atari, sometimes dramatically.
DQN with the double Q-learning target. Uses the online network for action selection and the target network for action evaluation. Published by Van Hasselt, Guez, and Silver (2016). Standard in all modern DQN variants.
Given n actions with true values Q*(a) = μ for all a, and noisy estimates Q̂(a) = μ + εa where εa ~ N(0, σ2) i.i.d.:
Your task: (1) Show that E[maxa Q̂(a)] > μ and derive the approximate magnitude of overestimation for large n. (2) Show that using an independent estimate Q̂B to evaluate the action selected by Q̂A removes this bias.
Part 1 — Overestimation magnitude:
Let Xa = μ + εa, εa ~ N(0, σ2). Define M = maxa Xa.
E[M] = μ + E[maxa εa] = μ + σ · E[max of n standard normals].
For n i.i.d. N(0,1) draws, E[max] ≈ √(2 ln n) − ln(ln n) + ln(4π) / (2√(2 ln n)) for large n.
For Atari with n=18 actions: E[max] ≈ σ · √(2 ln 18) ≈ 2.4σ. If σ = 2 (common in early training), overestimation ≈ 4.8 per state — and this compounds across the temporal chain.
Part 2 — Double Q removes bias:
Let a* = argmaxa Q̂A(a). Now E[Q̂B(a*) | a*] = μ + E[εB,a*] = μ, because εB,a* is independent of the event that a* was selected by A. The selection was driven by εA,a* being large, but εB,a* has no knowledge of this — it's a fresh draw. So E[double target] = r + γμ = unbiased.
The key insight: Overestimation arises from using the same noisy function for both selecting and evaluating the best action. Decoupling these with independent estimates breaks the positive correlation between "I chose this because it looked good" and "it evaluates as good" — eliminating the systematic upward bias.
So far, our target looks one step ahead: y = r + γ max Q̂(s', a'). What if we looked multiple steps ahead, using actual observed rewards instead of estimated ones?
Use n real rewards, then bootstrap from Q̂ only at step n. This reduces our dependence on Q̂, which is especially helpful early in training when Q̂ is inaccurate.
Trajectory: s0 ⟶ r=1 ⟶ s1 ⟶ r=1 ⟶ s2 ⟶ r=10 ⟶ s3. γ = 0.9. True Q*(s0, a) = 1 + 0.9 + 0.81 × 10 = 10.0. Current Q̂(s1) = 2 (bad estimate), Q̂(s3) = 0.
1-step target: y = 1 + 0.9 × 2 = 2.8. Way off (true: 10.0). Bottlenecked by bad Q̂(s1).
3-step target: y = 1 + 0.9 × 1 + 0.81 × 10 + 0.729 × 0 = 10.0. Exact! We used 3 real rewards and only bootstrapped from s3 where Q̂ = 0 is close to truth (episode ended).
The n-step return uses actions at+1, ..., at+n-1 that were chosen by the exploration policy at collection time. Q-learning is supposed to learn about the optimal policy. If the exploration policy took suboptimal actions in steps t+1 to t+n−1, those rewards don't reflect what the optimal policy would have earned.
N-step Q-learning is only correct when n=1 (fully off-policy) or when the data is on-policy (collected by the current greedy policy). For n > 1 with off-policy data, the target is biased. In practice, most implementations just ignore this and use n ∈ {3, 5, 10} anyway — it works because the benefits (lower reliance on Q̂) outweigh the bias.
Small n (n=1): high bias (depends heavily on Q̂'s accuracy), low variance (one reward, one bootstrap). Large n: low bias (many real rewards), high variance (rewards are stochastic, compounds over steps). Most practitioners find n = 3 to 5 is the sweet spot.
The Rainbow agent (Hessel et al., 2018) combined six DQN improvements into one system. When they measured each improvement's contribution, n-step returns (n=3) provided the single largest performance boost — more than double DQN, more than prioritized replay, more than any individual trick. The combination of all six was even better, but if you could only add one thing to vanilla DQN, multi-step returns are the best bang for your buck.
n = 3, γ = 0.99. Rewards: r0 = 1, r1 = 0, r2 = 0. Q̂(s3, best) = 100.
3-step target: y = 1 + 0.99 × 0 + 0.992 × 0 + 0.993 × 100 = 1 + 0 + 0 + 97.03 = 98.03.
1-step target: y = 1 + 0.99 × Q̂(s1, best). If Q̂(s1) = 80 (inaccurate), y = 1 + 79.2 = 80.2. Much worse.
The 3-step target skipped over two zero-reward states and bootstrapped closer to the actual high-reward region. The 1-step target got stuck relying on a bad intermediate estimate.
q_values.gather(1, actions.unsqueeze(1)).squeeze(1). This indexes into each row at the column specified by the action taken.import torch import torch.nn.functional as F def dqn_update(q_net, target_net, replay_buffer, batch_size=32, gamma=0.99): # 1. Sample minibatch states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size) # 2. Q(s, a) for actions taken q_values = q_net(states) # [B, n_actions] q_selected = q_values.gather(1, actions.unsqueeze(1)).squeeze(1) # [B] # 3. Double DQN targets with torch.no_grad(): # Online network SELECTS best action next_q = q_net(next_states) # [B, n_actions] best_actions = next_q.argmax(dim=1) # [B] # Target network EVALUATES that action target_q = target_net(next_states) # [B, n_actions] next_q_values = target_q.gather(1, best_actions.unsqueeze(1)).squeeze(1) # y = r + gamma * Q_target(s', a*) * (1 - done) targets = rewards + gamma * next_q_values * (1.0 - dones.float()) # 4. MSE loss loss = F.mse_loss(q_selected, targets) return loss
Now that you've seen Q-learning, let's place it in the full family of online RL algorithms. Every method we've covered in this series has a slot in this table:
| Algorithm | Data | Value Function | Action Selection | Key Feature |
|---|---|---|---|---|
| Vanilla PG | On-policy | None | Sample from πθ | Reward-to-go weighting |
| PPO | Near on-policy | Vπ | Sample from πθ | GAE + clipped surrogate |
| SAC | Replay buffer | Qπ | Sample a ~ πθ | Entropy bonus + off-policy |
| DQN | Replay buffer | Q* | argmax Q̂ | Target networks + ε-greedy |
Moving down the table: less policy, more value. Vanilla PG has no value function at all. PPO adds a value baseline. SAC learns Qπ and uses it to train π. DQN drops π entirely and just uses Q*. Each step trades flexibility (continuous actions, stochastic policies) for sample efficiency (replay buffers, off-policy learning).
PPO — when you want stability and simplicity. Easy to tune, reliable convergence. Works for both discrete and continuous. The "safe default."
DQN / Double DQN — when actions are discrete (Atari, board games, token selection). Very sample-efficient with replay buffer. Can't handle continuous actions without discretization.
SAC — when you need data efficiency in continuous control (robotics, locomotion). Off-policy + entropy regularization = good exploration + data reuse.
| Question | Policy Gradient | Q-Learning |
|---|---|---|
| What do we learn? | πθ(a|s) directly | Q*(s, a), derive π = argmax |
| Continuous actions? | Natural (Gaussian π) | Hard (argmax over infinite set) |
| Stochastic policy? | Yes, inherently | No, deterministic argmax |
| Replay buffer? | Not for vanilla PG | Yes, always |
| Sample efficiency | Low (on-policy) | High (off-policy reuse) |
| Stability | More stable (PPO) | Tricky (need target nets) |
You now understand the complete Q-learning pipeline, from the initial insight to the practical tricks that make it work at scale.
If you have Qπ, argmax gives a better policy (policy improvement). Repeating evaluate-improve converges but is slow (policy iteration). Skip iteration by solving directly for Q* via the Bellman optimality equation (Q-learning). Use a neural net for Q̂ but freeze the target to avoid instability (DQN). Fix overestimation by decoupling selection and evaluation (Double DQN). Look multiple steps ahead for faster learning (N-step). The result: a family of algorithms that learn optimal behavior from raw pixels.
| Strengths | Weaknesses |
|---|---|
| Very sample-efficient (replay buffer) | Discrete actions only (without modification) |
| Off-policy: reuse all past data | Q-value overestimation (mitigated by Double DQN) |
| No explicit policy network needed | Requires target networks for stability |
| Proven at scale (Atari, robotics) | Brittle hyperparameters (ε schedule, update freq) |
| Conceptually clean (just learn a value table) | Deterministic policy (no exploration without ε) |
Frame preprocessing: Downsample 210x160 RGB → 84x84 grayscale. Stack 4 frames (velocity info). Frame skip = 4 (repeat each action 4 times, max-pool last 2 frames to avoid flickering). Input: 84x84x4 = 28KB per observation.
Network: 3 conv layers (8x8/32, 4x4/64, 3x3/64) + FC(512) + output(4). ~1.7M params = ~7MB. Two copies (online + target) = 14MB. Tiny by GPU standards.
Replay buffer: Store 1M frames as uint8 (1 byte per pixel). 84x84x1M = 7.06 GB in RAM (not VRAM). Only the minibatch of 32 lives on GPU at any time: 32x84x84x4 = 903KB. Fits easily.
Training schedule: Start training after 50K random frames (fill buffer). Update every 4 frames. Target network update every 10K steps. ε: 1.0 → 0.1 over first 1M frames. Total: 10M frames ≈ 50 hours on one GPU (2015 hardware).
Key memory insight: The buffer lives in CPU RAM, the networks live on GPU. Only the 32-sample minibatch is transferred per step. This is why DQN works on small GPUs despite having a massive replay buffer.
The connection: Q-learning learns Q* and extracts the policy via argmax. Actor-critic learns Qπ and uses it to compute the advantage A = Q − V, which weights the policy gradient. Both need an accurate Q-function — they just use it differently. Q-learning uses Q directly (argmax). Actor-critic uses Q indirectly (to improve a learned policy).
SAC (Soft Actor-Critic) combines both ideas: it learns Qπ with a replay buffer (like DQN) but uses it to train an explicit policy (like actor-critic). Why does this combination dominate continuous control?
Offline RL: What if you can't interact with the environment at all? Learn from a fixed dataset of past trajectories. Key challenge: the agent can't try actions that aren't in the data. Q-learning is especially dangerous here — it will overestimate Q-values for out-of-distribution actions that were never tried.
Model-Based RL: Instead of learning Q-values from trial and error, learn a model of the environment (the transition dynamics p(s'|s,a)) and plan through it. Can be dramatically more sample-efficient because you can simulate thousands of trajectories in your head without touching the real environment.
Distributional RL: Instead of learning 𝔼[return], learn the full distribution of returns. C51 (Bellemare et al., 2017) and Rainbow (Hessel et al., 2018) combine distributional RL with double Q-learning, n-step returns, and prioritized replay for state-of-the-art Atari performance.
| Year | Milestone | Key Idea |
|---|---|---|
| 1989 | Watkins introduces Q-learning | Tabular, convergence proof |
| 1993 | Thrun & Schwartz: overestimation | Max bias identified |
| 2013 | DQN (Mnih et al.) | Neural nets + target networks + replay |
| 2015 | DQN in Nature | Superhuman Atari, 49 games |
| 2016 | Double DQN (Van Hasselt et al.) | Decouple select/evaluate |
| 2016 | Dueling DQN, Prioritized Replay | Architecture + sampling improvements |
| 2018 | Rainbow | Combine 6 improvements, SOTA Atari |
| 2018 | QT-Opt (Kalashnikov et al.) | Q-learning for real robot grasping |
| Parameter | Typical Value | What It Controls |
|---|---|---|
| Learning rate α | 2.5 × 10−4 | Step size for gradient descent |
| Discount γ | 0.99 | How much to value future rewards |
| ε start / end | 1.0 / 0.01 | Exploration rate schedule |
| ε decay frames | 1,000,000 | How fast to stop exploring |
| Replay buffer size | 1,000,000 | How much history to remember |
| Minibatch size | 32 | Transitions per gradient step |
| Target update freq | 10,000 steps | How often to refresh target net |
| N-step | 3 | Multi-step return horizon |
Q-learning = learn how good every action is in every state, pick the best one, and never bother with an explicit policy. Everything else is stabilization.