DQN — Veanors

Chapter 0: The Problem

Before 2013, reinforcement learning and deep learning lived in separate worlds. RL researchers relied on tabular methods or on linear function approximators over hand-crafted features, neither of which scaled to high-dimensional inputs such as raw images. Deep learning researchers achieved breakthroughs in vision and speech, but only with supervised learning and massive labeled datasets.

Combining them seemed fundamentally problematic for three reasons:

Correlated data: RL generates sequential data where consecutive samples are highly correlated. Deep learning assumes i.i.d. data. Training on correlated sequences pushes each gradient step toward whatever the agent is currently seeing, so the network overfits to recent experience.
Non-stationary targets: In supervised learning, the target labels are fixed. In RL, the target (the Bellman backup r + γ max Q) changes as the network learns. It's like trying to hit a moving target that you're moving.
Proven divergence: Q-learning with non-linear function approximators was mathematically shown to diverge on simple problems. The RL community largely abandoned neural networks after this result.

The 20-year gap: TD-Gammon (1992) reached a superhuman level of backgammon play using a neural network trained with RL and self-play. But attempts to replicate this on chess, Go, or checkers were far less successful. The consensus was that TD-Gammon was a special case: backgammon's dice rolls provided natural exploration and smoothed the value function. For 20 years, RL research focused largely on linear methods. DQN broke the drought.

What are the three fundamental challenges of combining deep learning with reinforcement learning?

Correlated sequential data, non-stationary learning targets, and proven divergence of Q-learning with non-linear function approximators
Insufficient data, slow training, and large models
High compute cost, complex environments, and reward sparsity

Chapter 1: The Key Insight

DQN solves all three problems with one elegant mechanism: experience replay.

Instead of learning from consecutive game frames (correlated, non-stationary), store every transition (s, a, r, s′) in a large replay buffer. Then sample random minibatches from this buffer for training — just like supervised learning samples from a dataset.

This single change:

Breaks correlations: Random sampling gives approximately i.i.d. data, satisfying deep learning's assumption.
Stabilizes targets: The minibatch contains transitions from many different policies across training history, smoothing the target distribution.
Improves data efficiency: Each experience can be sampled multiple times, getting more learning out of each transition.

The deep RL recipe: Take a CNN (proven for vision), Q-learning (proven for RL), and experience replay (proven for stability). Each existed before 2013. DQN's contribution was showing that combining all three, with the right architecture and training procedure, produces agents that learn from raw pixels and beat human experts. The whole is much greater than the sum of its parts.

How does experience replay solve the correlated data problem in deep RL?

By storing transitions in a buffer and sampling random minibatches, it breaks the sequential correlations and provides approximately i.i.d. training data
By shuffling the game frames before training
By using a larger neural network

Chapter 2: Q-Learning Review

Q-learning estimates the optimal action-value function Q*(s, a) — the expected discounted return from state s, taking action a, then following the optimal policy.

The Bellman equation

Q*(s, a) = E_s′[r + γ max_a′ Q*(s′, a′) | s, a]

The optimal Q-value of taking action a in state s equals the immediate reward plus the discounted optimal Q-value of the best action in the next state. This recursive definition is the foundation of value-based RL.
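To make the recursion concrete, here is a minimal sketch that applies the Bellman backup repeatedly on a tiny deterministic MDP until the values stop changing. The two-state MDP, its rewards, and the names `TRANSITIONS`, `bellman_backup`, and `value_iteration` are all made up for illustration; nothing here comes from the DQN paper.

```python
# Hypothetical toy MDP: states 0 and 1, actions 0 ("stay") and 1 ("move").
GAMMA = 0.9

# TRANSITIONS[(state, action)] = (reward, next_state); deterministic for simplicity.
TRANSITIONS = {
    (0, 0): (0.0, 0),
    (0, 1): (1.0, 1),
    (1, 0): (2.0, 1),
    (1, 1): (0.0, 0),
}


def bellman_backup(q, gamma=GAMMA):
    """Apply Q(s, a) <- r + gamma * max_a' Q(s', a') to every pair once."""
    new_q = {}
    for (s, a), (r, s_next) in TRANSITIONS.items():
        best_next = max(q[(s_next, a_next)] for a_next in (0, 1))
        new_q[(s, a)] = r + gamma * best_next
    return new_q


def value_iteration(sweeps=500):
    """Start from all zeros and repeat the backup until it has converged."""
    q = {key: 0.0 for key in TRANSITIONS}
    for _ in range(sweeps):
        q = bellman_backup(q)
    return q


q_star = value_iteration()
```

In this toy problem, staying in state 1 earns 2 per step forever, so Q*(1, stay) converges to 2 / (1 − 0.9) = 20, and Q*(0, move) converges to 1 + 0.9 × 20 = 19. Applying the backup to the converged table leaves it unchanged, which is exactly what the Bellman equation states.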

The Q-learning update

We approximate Q* with a neural network Q(s, a; θ) and minimize:

L_i(θ_i) = E_s,a[(y_i − Q(s, a; θ_i))²]

y_i = r + γ max_a′ Q(s′, a′; θ_i−1)

The target y_i uses the previous parameters θ_i−1, held fixed during each optimization step. This is crucial — if we used current parameters, the target would move with every gradient step, destabilizing training.
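The role of the frozen parameters can be sketched with a linear stand-in for the network. This is an illustrative assumption, not the paper's model: `theta` holds one weight vector per action, and the helper names are hypothetical. The `terminal` flag, which drops the bootstrap term at the end of an episode, is also an addition to the formula above.

```python
def q_values(theta, features):
    """Linear stand-in for Q(s, .; theta): one weight vector per action."""
    return [sum(w * x for w, x in zip(weights, features)) for weights in theta]


def td_target(reward, next_features, theta_prev, gamma=0.99, terminal=False):
    """y = r + gamma * max_a' Q(s', a'; theta_prev), or just r at episode end."""
    if terminal:
        return reward
    return reward + gamma * max(q_values(theta_prev, next_features))


def squared_td_error(theta, theta_prev, transition, gamma=0.99):
    """(y - Q(s, a; theta))^2, with y computed from the frozen theta_prev."""
    features, action, reward, next_features, terminal = transition
    y = td_target(reward, next_features, theta_prev, gamma, terminal)
    return (y - q_values(theta, features)[action]) ** 2


# Example values chosen so the arithmetic is easy to follow by hand.
theta_prev = [[1.0, 0.0], [0.0, 2.0]]   # frozen parameters used for the target
theta = [[0.5, 0.0], [0.0, 1.0]]        # parameters being optimized
transition = ([1.0, 0.0], 0, 1.0, [0.0, 1.0], False)
loss = squared_td_error(theta, theta_prev, transition, gamma=0.5)
```

Here the target is 1 + 0.5 × 2 = 2 and the current estimate is 0.5, so the loss is 1.5² = 2.25. Only `theta` would receive gradients; `theta_prev` is treated as a constant.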

Model-free and off-policy: Q-learning is model-free (no environment model needed) and off-policy (learns about the greedy policy while following an exploratory ε-greedy policy). Being off-policy is what makes experience replay possible — we can learn from transitions generated by old policies.

Why does Q-learning use the PREVIOUS parameters θ_i-1 for the target y_i instead of the current θ_i?

Using current parameters would create a moving target: the target changes with every gradient step, destabilizing training
Previous parameters are more accurate
It's computationally cheaper

Chapter 3: Experience Replay

Experience replay is the key ingredient in DQN. The concept has existed since 1992 (Lin), but DQN showed that it makes deep Q-learning stable in practice.

The replay buffer

A circular buffer D of capacity N (DQN uses N = 1 million). At each timestep, the agent stores the transition e_t = (φ_t, a_t, r_t, φ_t+1) in D, overwriting the oldest transition when full.

For each learning step, sample a random minibatch of 32 transitions from D and perform a gradient descent step on the Q-learning loss.
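A circular buffer with uniform sampling can be sketched in a few lines. The class and method names below are hypothetical, and a plain Python list is only one way to store transitions. An implementation at the 1-million scale would likely use preallocated arrays.

```python
import random


class ReplayBuffer:
    """Fixed-capacity circular buffer of transitions with uniform sampling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0  # slot that the next transition will occupy

    def store(self, transition):
        """Append until full, then overwrite the oldest transition."""
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Draw batch_size distinct transitions uniformly at random."""
        return rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)
```

With `capacity=5`, storing the integers 0 through 7 leaves 3, 4, 5, 6, 7 in the buffer: the three oldest entries have been overwritten. DQN's settings correspond to `ReplayBuffer(1_000_000)` and `sample(32)`.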

Three benefits

Data efficiency: Each transition is potentially used in many gradient updates instead of being discarded after one use. This is critical when environment interaction is expensive.
Decorrelation: Consecutive frames in a game are nearly identical. Learning from them sequentially causes the network to overfit to the current region of state space. Random sampling breaks this correlation.
Distribution smoothing: Without replay, the training distribution shifts as the policy changes — if the agent starts going left, all training data comes from left-side states, creating a feedback loop. Replay averages over many past policies, stabilizing the distribution.

The feedback loop without replay: Imagine the agent discovers that going right gives reward. It starts going right more, so all new training data comes from right-side states. The value function becomes accurate for right-side states but forgets left-side states. If left-side states were actually better, the agent can't discover this because it's trapped in a self-reinforcing loop. Replay breaks this by mixing old experiences with new ones.

[Interactive figure: Experience Replay Buffer. Transitions are stored as they arrive, and random minibatches are sampled from the buffer for learning.]

Why does experience replay depend on Q-learning being off-policy?

Replay stores transitions generated by old policies, and only an off-policy algorithm can learn about the greedy policy from data that a different behavior policy collected
Replay makes Q-learning on-policy
Replay provides more data

Chapter 4: The Architecture

DQN's CNN takes raw game frames as input and outputs Q-values for all possible actions in a single forward pass.

Input processing

Raw Atari frames (210×160 RGB at 60Hz) are preprocessed: convert to grayscale, downsample to 110×84, crop an 84×84 region that roughly captures the playing area, and stack the last 4 frames. The 4-frame stack lets the network perceive velocity (ball direction in Pong, enemy movement in Space Invaders).
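The frame-stacking step can be sketched with a bounded deque. The `FrameStack` name and its methods are hypothetical. Padding the stack with copies of the first frame at the start of an episode is an assumption made here so the input shape is fixed from the first step.

```python
from collections import deque


class FrameStack:
    """Keeps the most recent num_frames preprocessed frames as one state."""

    def __init__(self, num_frames=4):
        # A deque with maxlen drops the oldest frame automatically.
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        """Start an episode by filling the stack with the first frame (assumption)."""
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        """Add the newest frame and return the updated stacked state."""
        self.frames.append(frame)
        return self.state()

    def state(self):
        """Oldest-to-newest tuple of frames, the network's input."""
        return tuple(self.frames)
```

Each frame would be an 84×84 grayscale image in practice; the class does not care what the frame objects are, so short labels are enough to check the ordering.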

Network architecture

Input: 84 × 84 × 4 (4 grayscale frames)
↓
Conv 1: 16 filters, 8×8, stride 4 + ReLU → 20×20×16
↓
Conv 2: 32 filters, 4×4, stride 2 + ReLU → 9×9×32
↓
Fully connected: 256 ReLU units
↓
Output: |A| linear units (one Q-value per action, 4-18 actions)
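The layer sizes above follow from the standard output-size formula for a convolution without padding, (input − kernel) / stride + 1. The sketch below only checks that arithmetic; it is not a network implementation, and the function names are hypothetical.

```python
def conv_output_size(input_size, kernel, stride):
    """Spatial size after a convolution with no padding."""
    return (input_size - kernel) // stride + 1


def dqn_layer_shapes(input_size=84, num_actions=18):
    """Return (name, shape) pairs for the layers listed above."""
    conv1 = conv_output_size(input_size, kernel=8, stride=4)
    conv2 = conv_output_size(conv1, kernel=4, stride=2)
    return [
        ("input", (input_size, input_size, 4)),
        ("conv1", (conv1, conv1, 16)),
        ("conv2", (conv2, conv2, 32)),
        ("flatten", (conv2 * conv2 * 32,)),
        ("fc", (256,)),
        ("output", (num_actions,)),
    ]


shapes = dict(dqn_layer_shapes())
```

The flattened input to the fully connected layer has 9 × 9 × 32 = 2592 values, so most of the network's weights sit in that 2592 → 256 layer.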

One forward pass for all actions: A naive approach would input (state, action) and output a single Q-value, requiring one forward pass per action. DQN instead inputs only the state and outputs Q-values for all actions simultaneously. With 18 possible actions, that is one forward pass instead of 18. The agent then selects a = argmax_a Q(s, a; θ).

The same architecture and hyperparameters are used for all 7 games — no game-specific tuning. This is a key contribution: a single algorithm that works across vastly different games without modification.

Why does DQN stack the last 4 frames as input instead of using a single frame?

A single frame can't capture velocity or direction of motion, so stacking 4 frames lets the network perceive temporal information like ball trajectory or enemy movement
To increase the batch size
To reduce the frame rate

Chapter 5: The Algorithm

The full DQN algorithm combines Q-learning with experience replay in a clean loop:

For each episode:

Initialize the game, get first frame x₁
For each timestep t:
- With probability ε: select a random action (explore)
- Otherwise: select a = argmax_a Q(φ(s_t), a; θ) (exploit)
- Execute action, observe reward r_t and next frame x_t+1
- Store transition (φ_t, a_t, r_t, φ_t+1) in replay buffer D
- Sample random minibatch of 32 transitions from D
- Compute targets y_j = r_j if φ_j+1 is terminal, otherwise y_j = r_j + γ max_a′ Q(φ_j+1, a′; θ)
- Gradient descent step on (y_j − Q(φ_j, a_j; θ))²
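The loop above can be sketched end to end on a toy problem. Everything specific here is an assumption for illustration: the 5-state chain environment, the Q-table standing in for the CNN, and the hyperparameters (γ = 0.9, learning rate 0.1, ε annealed over 500 steps). Only the structure mirrors the listed steps: act ε-greedily, store, sample, compute targets, update.

```python
import random

GAMMA, LEARNING_RATE, BATCH_SIZE = 0.9, 0.1, 32
N_STATES, GOAL = 5, 4   # hypothetical chain: states 0..4, reward on reaching 4
ACTIONS = (0, 1)        # 0 = left, 1 = right


def step(state, action):
    """Toy environment: returns (reward, next_state, terminal)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    terminal = next_state == GOAL
    return (1.0 if terminal else 0.0), next_state, terminal


def train(episodes=200, max_steps=50, buffer_capacity=1000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # table standing in for Q(s, a; theta)
    buffer, write_index, total_steps = [], 0, 0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Linearly anneal epsilon from 1.0 to 0.1 over the first 500 steps.
            epsilon = max(0.1, 1.0 - 0.9 * total_steps / 500)
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                       # explore
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])   # exploit
            reward, next_state, terminal = step(state, action)

            # Store the transition, overwriting the oldest one when full.
            transition = (state, action, reward, next_state, terminal)
            if len(buffer) < buffer_capacity:
                buffer.append(transition)
            else:
                buffer[write_index] = transition
            write_index = (write_index + 1) % buffer_capacity
            total_steps += 1

            # Learn from a random minibatch once enough transitions exist.
            if len(buffer) >= BATCH_SIZE:
                for s, a, r, s2, done in rng.sample(buffer, BATCH_SIZE):
                    y = r if done else r + GAMMA * max(q[s2])
                    # Gradient step on (y - Q)^2 / 2 with respect to the table entry.
                    q[s][a] += LEARNING_RATE * (y - q[s][a])

            state = next_state
            if terminal:
                break
    return q
```

Calling `train()` returns the learned table. In this chain, moving right should end up with the higher value in every non-terminal state. Replacing the table with a network changes the update line into a gradient step on the network's parameters, but the surrounding loop stays the same.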

ε-greedy exploration

ε is annealed linearly from 1.0 to 0.1 over the first million frames, then fixed at 0.1. This means the agent starts fully random (pure exploration) and gradually shifts to mostly exploiting its learned Q-values, while maintaining 10% random actions for continued exploration.
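The schedule and the action choice can be written as two small functions. The defaults follow the numbers in the text (1.0 to 0.1 over one million frames). The function names and the `rng` parameter are hypothetical conveniences for this sketch.

```python
import random


def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linear interpolation from start to end, then constant at end."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Halfway through annealing, at frame 500,000, `epsilon_at` gives 0.55, so a little over half of the agent's actions are still random at that point.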

Reward clipping: Since score scales vary wildly across games (Pong: -1 to +1, Breakout: 0 to hundreds), DQN clips all rewards to {-1, 0, +1}. This allows using the same learning rate across games, at the cost of not distinguishing between reward magnitudes. A pragmatic choice that enables the single-algorithm-fits-all approach.
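Clipping to {-1, 0, +1} amounts to taking the sign of the raw reward. A minimal sketch, with a hypothetical function name:

```python
def clip_reward(reward):
    """Map any positive reward to +1, any negative reward to -1, and 0 to 0."""
    return (reward > 0) - (reward < 0)
```

A 400-point bonus and a 1-point hit both become +1, which is the loss of magnitude information mentioned above.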

Why does DQN anneal ε from 1.0 to 0.1 over training?

Starting with ε=1 (fully random) ensures broad exploration of the state space. Gradually reducing ε shifts toward exploiting learned Q-values while keeping 10% randomness for continued exploration.
To save computation
To reduce the learning rate

Chapter 6: Training Tricks

Several practical details make DQN work that aren't obvious from the algorithm description.

Frame skipping

The agent only acts every k=4 frames (k=3 for Space Invaders, where k=4 makes the blinking lasers invisible). The last action is repeated on skipped frames. This means the agent effectively plays at 15Hz instead of 60Hz, which (1) cuts the number of network forward passes per game by roughly 4×, (2) gives actions time to have visible effects, and (3) lets the agent play roughly 4× more games in the same wall-clock time, since stepping the emulator is much cheaper than selecting an action.
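Frame skipping is usually implemented as a thin wrapper around the emulator's step function. In the sketch below, `env_step` is a hypothetical callable returning `(frame, reward, done)`. Summing the rewards from the skipped frames and stopping early at the end of an episode are assumptions of this sketch, not details stated in the text.

```python
def step_with_frame_skip(env_step, action, skip=4):
    """Repeat one action for `skip` emulator frames and accumulate the reward."""
    total_reward = 0.0
    for _ in range(skip):
        frame, reward, done = env_step(action)
        total_reward += reward
        if done:
            break  # do not step past the end of the episode
    return frame, total_reward, done
```

The agent sees only the last frame returned, so its decision rate drops by a factor of `skip` while the emulator still advances frame by frame.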

RMSProp optimization

DQN uses RMSProp with minibatches of size 32. RMSProp scales each parameter's step by a running average of its recent gradient magnitudes, which plausibly helps when Q-value gradients vary widely across games and states. (Adam had not yet been published in 2013.)
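The per-parameter scaling can be seen in a sketch of the common, uncentered form of the RMSProp update. The learning rate, decay, and epsilon values below are placeholder assumptions, since the text does not give DQN's settings, and the function name is hypothetical.

```python
def rmsprop_step(params, grads, mean_square, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update, applied in place to params and mean_square."""
    for i, g in enumerate(grads):
        # Running average of the squared gradient for this parameter.
        mean_square[i] = decay * mean_square[i] + (1 - decay) * g * g
        # Divide the step by the root of that average.
        params[i] -= lr * g / (mean_square[i] ** 0.5 + eps)
    return params, mean_square
```

On a first step from zero averages, gradients of 0.1 and 10.0 produce almost identical parameter changes, because each gradient is divided by its own magnitude estimate. That is the adaptation the paragraph above refers to.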

Stability metrics

The paper notes that average episode reward is too noisy to track training progress, because small weight changes can shift the distribution of states the policy visits. Instead, the authors track the average maximum predicted Q-value on a fixed set of states, collected by running a random policy before training starts. This metric rises far more smoothly than episode reward, even when episode rewards oscillate.
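The metric itself is a one-liner once a held-out set of states exists. In this sketch, `q_function` is a hypothetical callable that maps a state to a list of Q-values, one per action.

```python
def average_max_q(q_function, held_out_states):
    """Mean over held-out states of max_a Q(s, a)."""
    return sum(max(q_function(s)) for s in held_out_states) / len(held_out_states)
```

Because the states are fixed, changes in this number reflect changes in the network's predictions only, not changes in which states the agent happens to visit.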

No divergence observed: Despite theoretical concerns about Q-learning diverging with non-linear function approximators, DQN never diverged in any experiment. The paper credits experience replay for stabilizing training. This empirical observation gave the RL community confidence to pursue deep RL further — the theoretical concerns were real but manageable in practice.

Why does DQN track average max Q-value instead of average episode reward to monitor training?

Average Q-value is much smoother than episode reward: small weight changes cause large reward variance but smooth Q-value changes, making Q a more reliable progress metric
Q-values are easier to compute
Episode rewards aren't available during training

Chapter 7: Results

DQN is evaluated on 7 Atari games with the same architecture and hyperparameters. The results shattered expectations:

DQN Performance vs Baselines

DQN (teal) vs previous best RL method (warm) vs human expert (blue line). DQN surpasses humans on Breakout, Pong, and Enduro.

DQN outperforms all previous RL methods on 6 of 7 games and surpasses human experts on 3 (Breakout, Pong, Enduro). On Breakout, DQN discovers a highly effective strategy (tunneling through the side of the wall) that many human players do not find.

The Breakout discovery: After ~400 episodes, DQN discovers that tunneling a ball through the side of the brick wall and letting it bounce behind the wall clears rows extremely efficiently. This strategy was not programmed or demonstrated — the agent discovered it purely through trial-and-error optimization of future rewards. This was one of the first demonstrations of deep RL discovering genuinely novel strategies.

What was remarkable about DQN's Breakout strategy?

DQN discovered the "tunneling" strategy, sending the ball behind the wall, purely through reinforcement learning, without being shown the strategy
DQN played faster than any human
DQN used a different architecture for Breakout

Chapter 8: What the Network Learns

The most impressive aspect of DQN isn't the scores — it's what the network learns to represent internally.

Learned representations

Using t-SNE visualization of the last hidden layer, the authors show that DQN learns to group states by their semantic meaning. States where the ball is about to score cluster together. States where the ball is about to be lost cluster together. The network has learned game-relevant features directly from pixels — without any hand-engineering.

Q-value predictions

The predicted Q-values show that DQN tracks the game dynamics. In the paper's Seaquest example, the predicted value jumps when an enemy appears on screen, peaks as the agent's torpedo is about to hit it, and falls back to roughly its original level once the enemy is gone. The network has implicitly learned how on-screen events relate to future reward.

End-to-end learning from pixels: Before DQN, Atari RL methods extracted hand-crafted features (object positions, velocities) from the game state. DQN learns everything from raw pixels — the perception, the strategy, the value estimation. This end-to-end approach is what made DQN generalizable across games. The same CNN architecture discovers different features for different games.

What do t-SNE visualizations of DQN's hidden layer reveal?

The network groups states by semantic meaning: states with similar game-relevant properties cluster together, showing the network learned meaningful representations from raw pixels
Random noise patterns
The network memorizes specific frames

Chapter 9: Connections

What DQN built on

Q-learning (Watkins, 1989): The tabular RL algorithm that DQN extends with neural networks.

TD-Gammon (Tesauro, 1992): The first successful neural network + RL combination. DQN succeeded where TD-Gammon's followers failed by using experience replay and CNNs.

Experience replay (Lin, 1992): The concept existed for 20 years but was never combined with deep networks. DQN showed it was the critical missing ingredient.

What DQN enabled

Double DQN (2015): Fixes DQN's tendency to overestimate Q-values by using separate networks for action selection and evaluation.

Dueling DQN (2016): Separates the network into state-value and advantage streams — the network learns "how good is this state?" independently from "how much better is this action?"

Rainbow (2017): Combines 6 improvements (double, dueling, prioritized replay, distributional, n-step, noisy nets) into a single super-agent.

AlphaGo (2016): Used deep RL techniques pioneered by DQN (CNN + RL) to master Go, widely regarded as a landmark AI achievement.
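The Double DQN and Dueling DQN entries above each change one small piece of the computation. The sketch below shows both on plain lists of numbers: the Double DQN target picks the action with one set of Q-values and evaluates it with another, and the dueling head combines a state value with mean-centered advantages. The function names are hypothetical, and the formulas are written from the published descriptions of those methods, not from this document.

```python
def dqn_target(reward, target_next_q, gamma=0.99):
    """Standard target: select and evaluate with the same Q-values."""
    return reward + gamma * max(target_next_q)


def double_dqn_target(reward, online_next_q, target_next_q, gamma=0.99):
    """Select the action with the online values, evaluate it with the target values."""
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    return reward + gamma * target_next_q[best_action]


def dueling_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + adv - mean_advantage for adv in advantages]
```

When one action's value is badly overestimated by the evaluating Q-values, the standard max picks up that error directly, while the Double DQN target uses it only if the selecting Q-values also prefer that action.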

The legacy: DQN proved that deep neural networks and reinforcement learning could be combined successfully. This single result launched the field of deep RL: hundreds of labs, thousands of papers, and ultimately the techniques behind game-playing AIs (AlphaGo, AlphaStar, OpenAI Five), robotic control, and language model training (RLHF). The 2013 workshop paper and its 2015 Nature version are among the most cited papers in all of AI.

Cheat sheet

Core equation

L = E[(r + γ max_a′ Q(s′,a′;θ_i−1) − Q(s,a;θ_i))²]

Key innovations

Experience replay + CNN + Q-learning = stable deep RL from pixels

Architecture

84×84×4 → Conv(16,8,4) → Conv(32,4,2) → FC(256) → |A| outputs

Key hyperparams

Replay buffer: 1M, minibatch: 32, ε: 1.0→0.1, γ: 0.99

Impact

Launched deep RL; led to AlphaGo, RLHF, modern game AIs

Which major AI systems directly trace their lineage to DQN?

AlphaGo, AlphaStar, OpenAI Five, and RLHF: all use deep RL techniques pioneered by DQN
Only other Atari-playing agents
Only robotics applications

Playing Atari with Deep RL