Fixing Q-learning's overestimation bias by decoupling action selection from action evaluation — a simple modification that yields much better policies on Atari.
DQN achieved superhuman performance on Atari, but it had a hidden flaw: it systematically overestimates Q-values. Not by a little — by a lot. And these overestimations aren't harmless; they lead to worse policies.
The culprit is the max operator in Q-learning's target:
y = r + γ max_a′ Q(s′, a′; θ)
This max selects the highest-valued action. But when Q-values contain estimation errors (as they always do during learning), the max preferentially selects actions whose values are overestimated. The result: a systematic upward bias that grows with the number of actions.
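The overestimation happens because Q-learning uses the same network for two different jobs: selecting the next action (which a′ looks best in s′?) and evaluating it (how much is that action worth?).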
When the same noisy estimates are used for both, the noise in the selection step correlates with the noise in the evaluation step, creating a positive bias.
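The fix: decouple selection from evaluation. Use one set of parameters to select the best action, and a different set to evaluate it:
y = r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ′)
The online network θ picks the action (selection). A second set of parameters θ′ (in Double DQN, the target network) evaluates it (evaluation). When the errors in θ and θ′ are independent, an action that θ happens to overrate is not systematically overrated by θ′, so the upward bias from the max disappears.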
Let's see the overestimation mathematically. Suppose all actions in state s have the same true value V*(s), but our estimates have errors: Q(s, a) = V*(s) + ε_a, where the ε_a are zero-mean random errors.
The true optimal value is V*(s). But Q-learning estimates it as:
max_a Q(s, a) = V*(s) + max_a ε_a
The maximum of several zero-mean random variables has a non-negative expectation, and a strictly positive one unless the errors are all equal. So the estimate is biased upward by E[max_a ε_a].
Figure (interactive in the original): the true value is 0 for all actions and the errors are standard normal. The overestimation (orange) grows with the number of actions, while Double Q-learning (teal) stays unbiased.
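The experiment described above can be reproduced with a short Monte Carlo sketch. It assumes exactly the setup in the text (true value 0 for every action, independent standard normal errors); the function name max_bias and its parameters are made up for illustration.

```python
import numpy as np

def max_bias(num_actions, num_trials=20_000, seed=0):
    """Monte Carlo estimate of the bias of the single and double estimators.

    Assumed setup: every action has true value 0, and each estimate is
    the true value plus independent N(0, 1) noise.
    """
    rng = np.random.default_rng(seed)
    # Two independent sets of noisy estimates, shape (trials, actions).
    q_a = rng.standard_normal((num_trials, num_actions))
    q_b = rng.standard_normal((num_trials, num_actions))

    # Single estimator: select and evaluate with the same estimates.
    single = q_a.max(axis=1).mean()

    # Double estimator: select with q_a, evaluate with q_b.
    best = q_a.argmax(axis=1)
    double = q_b[np.arange(num_trials), best].mean()
    return single, double

if __name__ == "__main__":
    for m in (2, 4, 16, 64):
        single, double = max_bias(m)
        print(f"actions={m:3d}  single={single:+.3f}  double={double:+.3f}")
```

For two actions the single-estimator bias is 1/√π ≈ 0.56 in expectation, and it keeps growing as actions are added. The double estimator stays near zero because the evaluating noise is independent of the noise that picked the action.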
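The paper proves a tight lower bound on the overestimation (Theorem 1). Suppose all m ≥ 2 actions in state s share the same true value V*(s), and the estimates are unbiased on the whole, Σ_a (Q(s, a) − V*(s)) = 0, but not all correct, (1/m) Σ_a (Q(s, a) − V*(s))² = C for some C > 0. Then:
max_a Q(s, a) ≥ V*(s) + √(C / (m − 1))
Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero.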
What this means: any estimation error — from function approximation, noise, non-stationarity, or any other source — creates overestimation. The only way to avoid it is to not use the same estimates for selection and evaluation. Double Q-learning achieves exactly this.
Previous analyses assumed independent errors per action. Theorem 1 shows overestimation occurs even with arbitrary error correlations — as long as estimates aren't all exactly correct. Since estimates are never exactly correct during learning, overestimation is essentially guaranteed.
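Theorem 1 can be sanity-checked numerically. The sketch below assumes the bound takes the form max_a Q(s, a) ≥ V*(s) + √(C / (m − 1)), where m is the number of actions, the errors sum to zero, and C is their mean square. The helper names lower_bound, tight_errors and check_random are hypothetical.

```python
import numpy as np

def lower_bound(errors):
    """Assumed Theorem 1 bound sqrt(C / (m - 1)) for zero-sum errors."""
    m = errors.size
    c = np.mean(errors ** 2)
    return np.sqrt(c / (m - 1))

def tight_errors(m, c):
    """Error vector that attains the bound: m - 1 equal positive errors
    and one large negative error, summing to zero with mean square c."""
    eps = np.full(m, np.sqrt(c / (m - 1)))
    eps[-1] = -np.sqrt((m - 1) * c)
    return eps

def check_random(num_checks=1000, seed=0):
    """Check the bound on random, arbitrarily scaled zero-sum error vectors."""
    rng = np.random.default_rng(seed)
    for _ in range(num_checks):
        m = int(rng.integers(2, 20))
        eps = rng.standard_normal(m) * rng.uniform(0.1, 10.0)
        eps -= eps.mean()  # enforce the zero-sum condition
        if eps.max() < lower_bound(eps) - 1e-9:
            return False
    return True

if __name__ == "__main__":
    print("bound holds on random errors:", check_random())
    eps = tight_errors(m=5, c=2.0)
    print("tight case: max error", eps.max(), "bound", lower_bound(eps))
```

The random vectors here are only one way to produce zero-sum errors. The bound itself makes no independence assumption, which is the point of the theorem.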
The original Double Q-learning (van Hasselt, 2010) maintains two independent Q-functions with parameters θ and θ′. For each update, one function selects the action and the other evaluates it. When θ is the one being updated, its target is:
y = r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ′)
The roles of θ and θ′ are swapped randomly. Because the noise in θ and θ′ is independent, selecting the best action according to θ and evaluating it with θ′ removes the positive correlation that causes overestimation.
Think of it as a second opinion. Network 1 says "action A looks best." Network 2 says "let me check — action A is actually worth X." If network 1 overestimated action A (got lucky with noise), network 2's independent estimate won't share that luck. The overestimation is corrected.
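In the tabular case, the "second opinion" update is only a few lines. This is a minimal sketch with hypothetical names (double_q_update, q_a, q_b), assuming two NumPy tables of shape (states, actions) and a fair coin flip to decide which table gets updated.

```python
import numpy as np

def double_q_update(q_a, q_b, s, a, r, s_next, done, rng, alpha=0.1, gamma=0.99):
    """One tabular Double Q-learning update on tables of shape (S, A).

    A coin flip decides which table is updated. That table selects the
    next action, and the other table evaluates it.
    """
    if rng.random() < 0.5:
        select, evaluate = q_a, q_b
    else:
        select, evaluate = q_b, q_a

    best = int(np.argmax(select[s_next]))                 # selection
    bootstrap = 0.0 if done else evaluate[s_next, best]   # evaluation
    target = r + gamma * bootstrap
    select[s, a] += alpha * (target - select[s, a])       # in-place update
    return target

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_a = np.zeros((3, 2))
    q_b = np.zeros((3, 2))
    print(double_q_update(q_a, q_b, s=0, a=1, r=1.0, s_next=1, done=False, rng=rng))
```

Each table sees only about half of the updates, which is the price the original algorithm pays for keeping the two estimates independent.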
The original Double Q-learning requires two separately trained value functions. But DQN already maintains two sets of parameters: the online network θ (updated every step) and the target network θ− (copied from θ every τ steps).
Double DQN simply uses the existing target network for evaluation:
y = r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ−)
DQN computes target = r + γ * Q_target[max_action_target]. Double DQN computes target = r + γ * Q_target[max_action_online]. The action is selected by the online network but evaluated by the target network. That's it. Same architecture, same training loop, same memory, and nearly the same compute (one extra forward pass of the online network on the next state). One subscript changes.
The target network θ− is not fully independent of θ (it's a delayed copy), so Double DQN doesn't perfectly eliminate overestimation. But θ− is sufficiently different (lagging by up to τ steps) that it substantially reduces the bias in practice.
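The one-subscript change is easy to see side by side. The sketch below assumes the next-state Q-values from both networks have already been computed as arrays of shape (batch, actions). The function names and the dones mask are illustrative and not taken from any particular DQN codebase.

```python
import numpy as np

def dqn_targets(q_next_target, rewards, dones, gamma=0.99):
    """DQN: the target network both selects and evaluates the next action."""
    bootstrap = q_next_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * bootstrap

def double_dqn_targets(q_next_online, q_next_target, rewards, dones, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    best = q_next_online.argmax(axis=1)
    bootstrap = q_next_target[np.arange(len(best)), best]
    return rewards + gamma * (1.0 - dones) * bootstrap

if __name__ == "__main__":
    # Made-up Q-values for a batch of two transitions with three actions.
    q_online = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 0.3]])
    q_target = np.array([[3.0, 1.5, 0.0], [0.0, 0.4, 0.1]])
    rewards = np.array([1.0, 0.0])
    dones = np.array([0.0, 1.0])  # the second transition is terminal
    print(dqn_targets(q_target, rewards, dones, gamma=0.9))
    print(double_dqn_targets(q_online, q_target, rewards, dones, gamma=0.9))
```

For the same batch, the Double DQN target can never exceed the DQN target, because the target network's value at any chosen action is at most its own maximum.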
The paper shows that DQN overestimates Q-values substantially in practice. On several Atari games, the estimated Q-values are far higher than the actual discounted returns achieved by the learned policy.
For example, on Wizard of Wor, DQN estimates Q-values around 80, but the actual average return is only about 20. That's a 4× overestimation. On Asterix, the overestimation is even worse.
Double DQN's value estimates are much closer to the true returns. On most games, the estimated Q-values closely track the actual performance. This more accurate value estimation translates directly to better policies.
Double DQN is evaluated on 49 Atari games (the same set used to evaluate DQN), with the same architecture and hyperparameters as DQN. Only the target computation changes.
Figure: DQN (orange) overestimates Q-values far above actual returns. Double DQN (teal) estimates are close to the true value (dashed line).
The paper provides a deeper analysis of why reducing overestimation improves policies, beyond the obvious "more accurate values = better decisions."
In Q-learning, overestimated values propagate through the Bellman backup. If Q(s′, a′) is overestimated, then the target y = r + γ max_a′ Q(s′, a′) is too high, which pushes Q(s, a) higher, which in turn inflates the targets of the states that lead to s, and so on. Overestimation cascades through the entire value function.
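A toy simulation makes the cascade concrete. This is not an experiment from the paper. It is a hypothetical chain of states with zero reward everywhere (so every true value is 0), in which each backup adds fresh zero-mean noise per action before taking the max. The name chain_bias and all parameters are assumptions for illustration.

```python
import numpy as np

def chain_bias(depth, num_actions=4, gamma=0.9, noise=1.0, trials=5_000, seed=0):
    """Average max-backup value at each distance from the end of a chain.

    Toy setup: all rewards are 0, so every true value is 0. Each backup
    adds fresh zero-mean Gaussian noise per action, then takes a max.
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(trials)  # the value beyond the last state is exactly 0
    biases = []
    for _ in range(depth):
        eps = noise * rng.standard_normal((trials, num_actions))
        q = gamma * v[:, None] + eps  # noisy backup for every action
        v = q.max(axis=1)             # the max picks up the positive noise
        biases.append(float(v.mean()))
    return biases

if __name__ == "__main__":
    for k, b in enumerate(chain_bias(depth=6), start=1):
        print(f"{k} backups from the end: average estimate {b:+.3f} (true value 0)")
```

Each backup inherits the discounted bias of its successor and adds its own, so in this toy model the error k backups from the end is roughly E[max_a ε_a] · (1 + γ + ... + γ^(k−1)).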
With function approximation, overestimating Q(s, a) in one state can corrupt Q-values in nearby states (because the function approximator generalizes). This creates a feedback loop: overestimation in state s corrupts nearby states, which corrupts their neighbors, spreading across the state space.
DQN (Mnih et al., 2013/2015): The deep RL agent that Double DQN improves upon. Double DQN uses DQN's exact architecture and training procedure — changing only one line.
Double Q-learning (van Hasselt, 2010): The tabular algorithm that proposed decoupling selection and evaluation. Double DQN generalizes this to deep networks using the existing target network.
Dueling DQN (Wang et al., 2016): Separates Q into state-value and advantage streams. Combined with Double DQN for further improvements.
Prioritized Experience Replay (Schaul et al., 2016): Samples important transitions more often. Works synergistically with Double DQN.
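Rainbow (Hessel et al., 2017): Combines six DQN improvements including Double DQN. In its ablation study, prioritized replay and multi-step learning are the most important components. Removing Double Q-learning changes the median score only slightly, though it helps or hurts on individual games.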