Mnih, Kavukcuoglu, Silver et al. (DeepMind) — 2013

Playing Atari with Deep RL

The paper that launched deep RL — a CNN trained with Q-learning and experience replay that learns to play Atari games directly from raw pixels, surpassing human experts on three games.

Prerequisites: Q-learning + CNNs
10 chapters, 5+ simulations

Chapter 0: The Problem

Before 2013, reinforcement learning and deep learning lived in separate worlds. RL researchers used hand-crafted features with linear function approximators, or tabular methods that couldn't scale to large state spaces. Deep learning researchers achieved breakthroughs in vision and speech, but only with supervised learning and massive labeled datasets.

Combining them seemed fundamentally problematic for three reasons:

  1. Correlated data: RL generates sequential data where consecutive samples are highly correlated. Deep learning assumes i.i.d. data. Training on correlated sequences pulls the network toward whatever it saw most recently instead of the overall data distribution.
  2. Non-stationary targets: In supervised learning, the target labels are fixed. In RL, the target (the Bellman backup r + γ max Q) changes as the network learns. It's like trying to hit a moving target that you're moving.
  3. Proven divergence: Q-learning with non-linear function approximators was shown to diverge on some simple problems. Much of the RL community moved away from neural networks after this result.

The 20-year gap: TD-Gammon (1992) reached a super-human level of backgammon play using a neural network + RL. But attempts to replicate this on chess, Go, or checkers were far less successful. The consensus was that TD-Gammon was a special case: backgammon's dice rolls provided natural exploration and smoothed the value function. For 20 years, RL research focused largely on linear methods with better convergence guarantees. DQN broke the drought.

What are the three fundamental challenges of combining deep learning with reinforcement learning?

Chapter 1: The Key Insight

DQN tackles these problems mainly through one elegant mechanism: experience replay.

Instead of learning from consecutive game frames (correlated, non-stationary), store every transition (s, a, r, s′) in a large replay buffer. Then sample random minibatches from this buffer for training — just like supervised learning samples from a dataset.

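This single change breaks the correlation between consecutive samples, averages the training distribution over many past policies, and lets each transition be reused in many updates.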

The deep RL recipe: Take a CNN (proven for vision), Q-learning (proven for RL), and experience replay (proven for stability). Each existed before 2013. DQN's contribution was showing that combining all three, with the right architecture and training procedure, produces agents that learn from raw pixels and beat human experts. The whole is much greater than the sum of its parts.
How does experience replay solve the correlated data problem in deep RL?

Chapter 2: Q-Learning Review

Q-learning estimates the optimal action-value function Q*(s, a) — the expected discounted return from state s, taking action a, then following the optimal policy.

The Bellman equation

$$Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]$$

The optimal Q-value of taking action a in state s equals the immediate reward plus the discounted optimal Q-value of the best action in the next state. This recursive definition is the foundation of value-based RL.

The Q-learning update

We approximate Q* with a neural network Q(s, a; θ) and minimize:

$$L_i(\theta_i) = \mathbb{E}_{s,a}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$$
$$y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})$$

The target $y_i$ uses the previous parameters $\theta_{i-1}$, which are held fixed while optimizing $L_i(\theta_i)$. This matters: if the target were recomputed from the current parameters after every gradient step, it would shift with each update and destabilize training.
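The target and loss above can be illustrated with plain numbers. This is a minimal sketch, not the paper's code: the function names, the example Q-values, and the discount of 0.9 are illustrative assumptions, and the terminal-state branch follows the usual convention that a terminal transition's target is just its reward.

```python
def td_target(reward, next_q_values, gamma=0.99, terminal=False):
    """Bellman target y = r + gamma * max_a' Q(s', a'; theta_prev).

    next_q_values are Q-values of the next state computed with the
    previous (held-fixed) parameters. Terminal transitions use y = r.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(target, q_sa):
    """Per-sample loss (y - Q(s, a; theta))^2."""
    return (target - q_sa) ** 2


# Illustrative numbers (assumptions, not values from the paper).
next_q_prev = [0.5, 2.0, 1.0]   # Q(s', .; theta_prev) for three actions
y = td_target(reward=1.0, next_q_values=next_q_prev, gamma=0.9)
loss = squared_td_error(y, q_sa=2.5)
print(y, loss)
```

Because `next_q_prev` is treated as a constant, the gradient of the loss flows only through `q_sa`, which is the point of holding the previous parameters fixed.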

Model-free and off-policy: Q-learning is model-free (no environment model needed) and off-policy (learns about the greedy policy while following an exploratory ε-greedy policy). Being off-policy is what makes experience replay possible — we can learn from transitions generated by old policies.
Why does Q-learning use the PREVIOUS parameters $\theta_{i-1}$ for the target $y_i$ instead of the current $\theta_i$?

Chapter 3: Experience Replay

Experience replay is the ingredient DQN relies on most. The concept has existed since 1992 (Lin), but DQN showed it was the key to stable deep RL.

The replay buffer

A circular buffer D of capacity N (DQN uses N = 1 million). At each timestep, the agent stores the transition $e_t = (\phi_t, a_t, r_t, \phi_{t+1})$ in D, overwriting the oldest transition when full.

For each learning step, sample a random minibatch of 32 transitions from D and perform a gradient descent step on the Q-learning loss.
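A circular buffer with uniform sampling can be sketched in a few lines. This is a minimal illustration under assumptions: the class name `ReplayBuffer`, its method names, and the tiny capacity used in the demo are hypothetical, and transitions are stored as plain tuples.

```python
import random


class ReplayBuffer:
    """Fixed-capacity circular buffer of transitions (hypothetical sketch)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.next_slot = 0               # index of the slot to write next
        self.rng = random.Random(seed)

    def store(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest transition.
            self.storage[self.next_slot] = transition
        self.next_slot = (self.next_slot + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement within one minibatch.
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)


# Demo with a tiny capacity; DQN uses 1 million and minibatches of 32.
buffer = ReplayBuffer(capacity=5, seed=0)
for t in range(8):
    buffer.store((t, 0, 0.0, t + 1))     # (phi_t, a_t, r_t, phi_{t+1})
batch = buffer.sample(batch_size=3)
```

After eight stores into a buffer of capacity five, only the five most recent transitions remain, and each minibatch mixes transitions from different timesteps.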

Three benefits

  1. Data efficiency: Each transition is potentially used in many gradient updates instead of being discarded after one use. This is critical when environment interaction is expensive.
  2. Decorrelation: Consecutive frames in a game are nearly identical. Learning from them sequentially causes the network to overfit to the current region of state space. Random sampling breaks this correlation.
  3. Distribution smoothing: Without replay, the training distribution shifts as the policy changes — if the agent starts going left, all training data comes from left-side states, creating a feedback loop. Replay averages over many past policies, stabilizing the distribution.
The feedback loop without replay: Imagine the agent discovers that going right gives reward. It starts going right more, so all new training data comes from right-side states. The value function becomes accurate for right-side states but forgets left-side states. If left-side states were actually better, the agent can't discover this because it's trapped in a self-reinforcing loop. Replay breaks this by mixing old experiences with new ones.
[Interactive figure: Experience Replay Buffer. Transitions are stored as they arrive; random minibatches are sampled from the buffer for learning.]

Why does experience replay require an off-policy learning algorithm such as Q-learning?

Chapter 4: The Architecture

DQN's CNN takes raw game frames as input and outputs Q-values for all possible actions in a single forward pass.

Input processing

Raw Atari frames (210×160 RGB at 60Hz) are preprocessed: convert to grayscale, downsample to 110×84, crop to an 84×84 region that roughly covers the playing area, then stack the last 4 frames. The 4-frame stack lets the network perceive velocity (ball direction in Pong, enemy movement in Space Invaders).
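The frame-stacking step can be sketched independently of image processing. The class below is a hypothetical helper, not code from the paper; in particular, filling the stack with copies of the first frame at the start of an episode is an assumption made so that the state always contains four frames.

```python
from collections import deque


class FrameStack:
    """Keeps the most recent `num_frames` preprocessed frames (sketch)."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: pad the stack with copies of the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # The deque drops the oldest frame automatically when full.
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Oldest frame first, newest frame last.
        return tuple(self.frames)


# Frames are stand-in strings here; in practice they would be 84x84 arrays.
stack = FrameStack(num_frames=4)
initial_state = stack.reset("f0")
next_state = stack.push("f1")
```

Comparing the oldest and newest entries of a state is what lets a network infer motion that a single frame cannot show.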

Network architecture

Input: 84 × 84 × 4 (4 grayscale frames)
Conv 1: 16 filters, 8×8, stride 4 + ReLU → 20×20×16
Conv 2: 32 filters, 4×4, stride 2 + ReLU → 9×9×32
FC: 256 ReLU units
Output: |A| linear units (one Q-value per action, 4-18 actions)

One forward pass for all actions: A naive approach would input (state, action) and output a single Q-value, requiring one forward pass per action. DQN instead inputs only the state and outputs Q-values for all actions at once. For 18 possible actions, that is one forward pass instead of 18. The agent then selects $a = \arg\max_a Q(s, a; \theta)$.

The same architecture and hyperparameters are used for all 7 games — no game-specific tuning. This is a key contribution: a single algorithm that works across vastly different games without modification.
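The layer shapes in the architecture listing can be checked with a little arithmetic. The sketch below assumes unpadded ("valid") convolutions and one bias per filter or unit; both are assumptions about details the listing does not spell out, and the function names are illustrative.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of an unpadded convolution."""
    return (input_size - kernel_size) // stride + 1


size_after_conv1 = conv_output_size(84, kernel_size=8, stride=4)
size_after_conv2 = conv_output_size(size_after_conv1, kernel_size=4, stride=2)
flattened_units = size_after_conv2 * size_after_conv2 * 32


def count_parameters(num_actions):
    """Weights plus biases for the four learned layers (assumes biases)."""
    conv1 = 8 * 8 * 4 * 16 + 16          # 4 input channels, 16 filters
    conv2 = 4 * 4 * 16 * 32 + 32         # 16 input channels, 32 filters
    fc = flattened_units * 256 + 256
    output = 256 * num_actions + num_actions
    return conv1 + conv2 + fc + output


print(size_after_conv1, size_after_conv2, flattened_units)
print(count_parameters(num_actions=18))
```

Under these assumptions nearly all parameters sit in the fully connected layer, and changing the number of actions only changes the small output layer.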

Why does DQN stack the last 4 frames as input instead of using a single frame?

Chapter 5: The Algorithm

The full DQN algorithm combines Q-learning with experience replay in a clean loop:

For each episode:

  1. Initialize the game and get the first frame $x_1$
  2. For each timestep t:
    • With probability ε: select a random action $a_t$ (explore)
    • Otherwise: select $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$ (exploit)
    • Execute $a_t$, observe reward $r_t$ and next frame $x_{t+1}$
    • Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in replay buffer D
    • Sample a random minibatch of 32 transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from D
    • Compute targets $y_j = r_j$ if $\phi_{j+1}$ is terminal, otherwise $y_j = r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta)$
    • Take a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$
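The loop above can be exercised end to end on a toy problem. The sketch below is deliberately not DQN itself: a five-state chain stands in for the Atari emulator, a lookup table stands in for the network, and a tabular update with step size `ALPHA` stands in for the gradient step. All constants and names are illustrative assumptions; only the structure (ε-greedy action, store, sample, target, update) mirrors the algorithm.

```python
import random
from collections import deque

N_STATES, GOAL = 5, 4            # chain of states 0..4, reward on reaching 4
ACTIONS = (0, 1)                 # 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.1          # illustrative values, not the paper's
BATCH_SIZE, CAPACITY = 32, 500
TOTAL_STEPS, ANNEAL_STEPS = 3000, 1000


def env_step(state, action):
    """Toy deterministic chain standing in for the emulator."""
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def train(seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # table in place of Q(.; theta)
    replay = deque(maxlen=CAPACITY)
    state = 0
    for step in range(TOTAL_STEPS):
        # Epsilon-greedy action selection with linear annealing.
        epsilon = max(0.1, 1.0 - 0.9 * step / ANNEAL_STEPS)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])

        # Act, store the transition, restart the episode when it ends.
        next_state, reward, done = env_step(state, action)
        replay.append((state, action, reward, next_state, done))
        state = 0 if done else next_state

        # Sample a random minibatch and move Q toward the targets.
        batch = rng.sample(list(replay), min(BATCH_SIZE, len(replay)))
        for s, a, r, s_next, terminal in batch:
            target = r if terminal else r + GAMMA * max(q[s_next])
            q[s][a] += ALPHA * (target - q[s][a])
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

In a real DQN the inner `for` loop over the batch becomes a single gradient step on the network, and `list(replay)` would be replaced by a buffer that samples without copying.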

ε-greedy exploration

ε is annealed linearly from 1.0 to 0.1 over the first million frames, then fixed at 0.1. This means the agent starts fully random (pure exploration) and gradually shifts to mostly exploiting its learned Q-values, while maintaining 10% random actions for continued exploration.
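The schedule described above is a clamped linear interpolation. A minimal sketch, with the function and parameter names chosen here for illustration:

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(frame / anneal_frames, 1.0)
    return end + (1.0 - fraction) * (start - end)


for frame in (0, 250_000, 500_000, 1_000_000, 5_000_000):
    print(frame, round(epsilon_at(frame), 3))
```

Clamping `fraction` at 1.0 is what keeps ε fixed at its final value after the first million frames.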

Reward clipping: Since score scales vary wildly across games (Pong: -1 to +1, Breakout: 0 to hundreds), DQN clips all rewards to {-1, 0, +1}. This allows using the same learning rate across games, at the cost of not distinguishing between reward magnitudes. A pragmatic choice that enables the single-algorithm-fits-all approach.
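Reward clipping as described amounts to keeping only the sign of the reward. A small sketch (the function name is illustrative):

```python
def clip_reward(reward):
    """Map any positive reward to +1, any negative reward to -1, and 0 to 0."""
    if reward > 0:
        return 1.0
    if reward < 0:
        return -1.0
    return 0.0


clipped = [clip_reward(r) for r in (7, 0, -1, 200, -0.5)]
print(clipped)
```

A reward of 7 and a reward of 200 become indistinguishable after clipping, which is the loss of magnitude information mentioned above.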
Why does DQN anneal ε from 1.0 to 0.1 over training?

Chapter 6: Training Tricks

Several practical details make DQN work that aren't obvious from the algorithm description.

Frame skipping

The agent selects an action only every k=4 frames (k=3 for Space Invaders) and repeats its last action on the skipped frames. It therefore makes decisions at 15Hz instead of 60Hz, which (1) cuts the number of network evaluations per game by about 4×, (2) gives each action time to have a visible effect, and (3) lets the agent play roughly 4× more games without significantly increasing runtime, because stepping the emulator is much cheaper than running the network.
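Action repeat can be sketched as a thin wrapper around the emulator's step function. Everything here is hypothetical scaffolding: `CountingEmulator` is a stand-in for a real emulator, and summing the rewards of the skipped frames is an assumption about how they are combined.

```python
def step_with_frame_skip(emulator_step, action, skip=4):
    """Repeat `action` for up to `skip` emulator frames.

    Returns the last frame seen, the summed reward (an assumption),
    and whether the episode ended during the repeated frames.
    """
    total_reward = 0.0
    for _ in range(skip):
        frame, reward, done = emulator_step(action)
        total_reward += reward
        if done:
            break
    return frame, total_reward, done


class CountingEmulator:
    """Stand-in emulator: frame k is the integer k, every frame pays 1.0."""

    def __init__(self, episode_length=10):
        self.frame_index = 0
        self.episode_length = episode_length

    def step(self, action):
        self.frame_index += 1
        done = self.frame_index >= self.episode_length
        return self.frame_index, 1.0, done


emulator = CountingEmulator(episode_length=10)
first = step_with_frame_skip(emulator.step, action=0, skip=4)
second = step_with_frame_skip(emulator.step, action=0, skip=4)
third = step_with_frame_skip(emulator.step, action=0, skip=4)
```

The agent's network runs once per call to `step_with_frame_skip`, while the emulator advances up to four frames.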

RMSProp optimization

DQN uses RMSProp with minibatch size 32 rather than plain SGD. A plausible reason is that RMSProp's per-parameter step-size adaptation copes with gradient magnitudes that vary across games and states, although the paper does not elaborate on the choice.
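For readers unfamiliar with RMSProp, the update for a single scalar parameter looks like this. It is a sketch of the commonly stated form of the rule; the learning rate, decay, and epsilon values are illustrative assumptions, not the paper's settings.

```python
import math


def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update for a scalar parameter.

    `cache` is a running average of squared gradients; dividing by its
    square root rescales the step for this parameter.
    """
    cache = decay * cache + (1.0 - decay) * grad * grad
    param = param - lr * grad / (math.sqrt(cache) + eps)
    return param, cache


# Minimize f(x) = x^2 starting from x = 1 (gradient is 2x).
x, cache = 1.0, 0.0
for _ in range(1000):
    x, cache = rmsprop_step(x, 2.0 * x, cache)
print(x)
```

Because the step is normalized by recent gradient magnitude, parameters with large and small gradients move at comparable rates.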

Stability metrics

The paper notes that average episode reward is a noisy progress signal, because small weight changes can shift the distribution of states the policy visits. The authors therefore also track the average maximum predicted Q-value on a fixed set of states, collected with a random policy before training starts. This metric rises much more smoothly than episode reward, even when episode rewards oscillate.
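The metric is simple to compute once a held-out set of states exists. In this sketch `q_function` is any callable returning a list of Q-values for a state; the dictionary-backed function and the state labels are stand-ins for a real network and real frames.

```python
def average_max_q(q_function, held_out_states):
    """Mean over states of max_a Q(s, a), evaluated on a fixed state set."""
    total = sum(max(q_function(state)) for state in held_out_states)
    return total / len(held_out_states)


# Stand-in Q-values for three held-out states and two actions.
toy_q_values = {
    "s1": [0.0, 1.0],
    "s2": [2.0, 0.5],
    "s3": [3.0, 3.0],
}
metric = average_max_q(toy_q_values.get, ["s1", "s2", "s3"])
print(metric)
```

Evaluating on the same states at every checkpoint removes the variance that comes from the policy visiting different states in each evaluation episode.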

No divergence observed: Despite theoretical concerns about Q-learning diverging with non-linear function approximators, DQN never diverged in any experiment. The paper credits experience replay for stabilizing training. This empirical observation gave the RL community confidence to pursue deep RL further — the theoretical concerns were real but manageable in practice.
Why does DQN track average max Q-value instead of average episode reward to monitor training?

Chapter 7: Results

DQN is evaluated on 7 Atari games with the same architecture and hyperparameters. The results shattered expectations:

[Figure: DQN Performance vs Baselines. DQN compared with the previous best RL method and a human expert on each game. DQN surpasses humans on Breakout, Pong, and Enduro.]

DQN outperforms all previous RL methods on 6 of 7 games and surpasses a human expert on 3 (Breakout, Pong, Enduro). On Breakout, DQN discovers a highly effective strategy (tunneling through the side wall) that many human players don't find.

The Breakout discovery: After ~400 episodes, DQN discovers that tunneling a ball through the side of the brick wall and letting it bounce behind the wall clears rows extremely efficiently. This strategy was not programmed or demonstrated — the agent discovered it purely through trial-and-error optimization of future rewards. This was one of the first demonstrations of deep RL discovering genuinely novel strategies.
What was remarkable about DQN's Breakout strategy?

Chapter 8: What the Network Learns

The most impressive aspect of DQN isn't the scores — it's what the network learns to represent internally.

Learned representations

Using a t-SNE visualization of the last hidden layer (reported in the 2015 Nature version), the authors show that DQN learns to group states by their semantic meaning. States where the ball is about to score cluster together. States where the ball is about to be lost cluster together. The network has learned game-relevant features directly from pixels, without any hand-engineering.

Q-value predictions

The predicted Q-values track game events. In the paper's Seaquest example, the predicted value jumps when an enemy appears on screen, rises further when the agent fires a torpedo, peaks as the torpedo is about to hit, and falls back once the enemy disappears. The network has implicitly learned how events in the game relate to future reward.

End-to-end learning from pixels: Before DQN, Atari RL methods extracted hand-crafted features (object positions, velocities) from the game state. DQN learns everything from raw pixels — the perception, the strategy, the value estimation. This end-to-end approach is what made DQN generalizable across games. The same CNN architecture discovers different features for different games.
What do t-SNE visualizations of DQN's hidden layer reveal?

Chapter 9: Connections

What DQN built on

Q-learning (Watkins, 1989): The tabular RL algorithm that DQN extends with neural networks.

TD-Gammon (Tesauro, 1992): The first successful neural network + RL combination. DQN succeeded where TD-Gammon's followers failed by using experience replay and CNNs.

Experience replay (Lin, 1992): The concept existed for 20 years but was never combined with deep networks. DQN showed it was the critical missing ingredient.

What DQN enabled

Double DQN (2015): Reduces DQN's tendency to overestimate Q-values by using one network to select the next action and another to evaluate it.

Dueling DQN (2016): Separates the network into state-value and advantage streams — the network learns "how good is this state?" independently from "how much better is this action?"

Rainbow (2017): Combines 6 improvements (double, dueling, prioritized replay, distributional, n-step, noisy nets) into a single super-agent.

AlphaGo (2016): Combined CNNs with RL, the pairing DQN popularized, to master Go. It is widely regarded as a landmark AI achievement.
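Two of the extensions listed above are small enough to sketch with plain lists. These are illustrations of the ideas as summarized in this chapter, not code from those papers: the function names and example numbers are assumptions, and the mean-subtraction in the dueling combination is the commonly used aggregation.

```python
def dqn_target(reward, next_q_target, gamma=0.99):
    """Standard target: the same Q-values both pick and score the action."""
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network picks the action, the other scores it."""
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


def dueling_q_values(state_value, advantages):
    """Dueling combination: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


online_q = [1.0, 3.0, 2.0]      # illustrative Q(s', .) from the online network
target_q = [4.0, 0.5, 1.0]      # illustrative Q(s', .) from the second network
plain = dqn_target(0.0, target_q, gamma=0.9)
double = double_dqn_target(0.0, online_q, target_q, gamma=0.9)
dueling = dueling_q_values(state_value=2.0, advantages=[1.0, -1.0, 0.0])
print(plain, double, dueling)
```

In the example the plain target trusts the second network's largest estimate, while the double target scores the action the online network prefers, which yields a lower value here.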

The legacy: DQN proved that deep neural networks and reinforcement learning could be combined successfully. This result helped launch the field of deep RL: hundreds of labs, thousands of papers, and ultimately the techniques that power game-playing AIs (AlphaGo, AlphaStar, OpenAI Five), robotics, and language model training (RLHF). The 2013 workshop paper and its 2015 Nature version are among the most cited papers in AI.

Cheat sheet

Core equation: $L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$
Key innovations: Experience replay + CNN + Q-learning = stable deep RL from pixels
Architecture: 84×84×4 → Conv(16,8,4) → Conv(32,4,2) → FC(256) → |A| outputs
Key hyperparams: Replay buffer: 1M, minibatch: 32, ε: 1.0→0.1, γ: 0.99
Impact: Launched deep RL; led to AlphaGo, RLHF, modern game AIs
Which major AI systems directly trace their lineage to DQN?