Bellemare, Dabney, Munos (DeepMind) — 2017

A Distributional Perspective on
Reinforcement Learning

Instead of learning the expected return Q(s,a), learn the full distribution of returns Z(s,a) — yielding state-of-the-art Atari performance and fundamentally richer representations.

Prerequisites: DQN + Bellman equations + Probability distributions
10
Chapters
6+
Simulations

Chapter 0: The Problem

Standard RL learns Q(s, a) — the expected return. But an expectation is a lossy compression. Consider two scenarios:

Both have Q = 10. But they're fundamentally different situations. Scenario B has risk. A risk-averse agent should prefer A. A risk-seeking agent might prefer B. A standard Q-learning agent can't distinguish them — it sees the same number.

More practically: when a Q-network learns the expectation of a multimodal return distribution, the average can be a value that never actually occurs. The network tries to represent a number that's between two peaks — a phantom value that corrupts the learned representation.

Why expectations lie: In Space Invaders, an action might lead to survival (high return) or death (zero return). The expected return averages these into a moderate value that represents neither outcome. The distributional approach keeps both outcomes separate — the agent sees a bimodal distribution and can reason about each possibility distinctly.
Why is learning the expected return Q(s,a) insufficient for understanding the full nature of outcomes?

Chapter 1: The Key Insight

Instead of learning Q(s, a) = E[Z(s, a)], learn the full random variable Z(s, a) — the value distribution. Z is the random return: the actual sum of discounted rewards you'll receive, which varies depending on future stochasticity.

The distributional Bellman equation replaces the equality of expectations with an equality of distributions:

Z(x, a) D= R(x, a) + γZ(X′, A′)

The return distribution Z equals the reward R plus the discounted next-state return Z(X′, A′), in distribution. Three sources of randomness interact: (1) random reward R, (2) random transition to (X′, A′), and (3) the return distribution at the next state.

The distributional Bellman equation: Standard Bellman: Q = E[R] + γ E[Q′]. Distributional Bellman: Z D= R + γZ′. The distributional version preserves the full shape of the return — multimodality, skewness, heavy tails — while the standard version crushes it all into one number. The surprise: even in deterministic environments (like Atari), the distributions are rich and informative.

The practical algorithm — Categorical DQN (C51) — represents Z(s, a) as a discrete distribution over N=51 fixed atoms, and uses a projection step to map the Bellman backup onto this support. The result: state-of-the-art performance on Atari, with learned distributions that reveal the structure of each game.

What does the distributional Bellman equation Z D= R + γZ′ mean?

Chapter 2: Value Distributions

A value distribution Z(x, a) is a mapping from state-action pairs to probability distributions over returns. Where Q(x, a) ∈ ℝ gives one number, Z(x, a) gives a full distribution.

Sources of randomness

Even in "deterministic" environments, value distributions can be complex. The randomness comes from:

  1. Intrinsic randomness: Stochastic rewards and transitions in the environment
  2. State aliasing: Partially observing the state (the agent can't distinguish states that lead to different outcomes)
  3. Policy non-stationarity: The policy changes during learning, so the return distribution changes too
  4. Function approximation: The network can't perfectly represent every distribution, introducing approximation error
Value Distribution vs Expected Value

Three actions with the same expected value (10) but different return distributions. A Q-learning agent sees them as identical. A distributional agent sees the full picture.

Why do value distributions contain useful information even in deterministic environments like Atari?

Chapter 3: Distributional Bellman

The paper defines distributional Bellman operators analogous to the standard ones.

Policy evaluation operator Tπ

TπZ(x, a) D= R(x, a) + γPπZ(x, a)

Where PπZ(x, a) = Z(X′, A′) with X′ ~ P(·|x,a) and A′ ~ π(·|X′).

The contraction result

Tπ is a γ-contraction in the Wasserstein metric d̄p. This means repeated application Tπ, Tπ(TπZ), ... converges to the unique fixed point Zπ — the true value distribution. This is the distributional analog of the standard convergence result.

Critical subtlety — control is harder: For policy evaluation (fixed policy), Tπ is a contraction. But the optimality operator T (which includes greedy action selection) is NOT a contraction in any distributional metric. It converges in expectation (Lemma 4), but the distributions themselves may not converge neatly. This theoretical instability doesn't prevent practical success — but it explains why the algorithm needs careful design.

What metrics work and don't

Tπ is a contraction in Wasserstein distance — but NOT in total variation, KL divergence, or Kolmogorov distance. The Wasserstein metric is uniquely suited because it respects the geometry of the return space (nearby returns are "close").

In what metric is the distributional policy evaluation operator Tπ a contraction?

Chapter 4: The Wasserstein Metric

The Wasserstein distance between two distributions F and G measures the "cost" of transforming one into the other:

dp(F, G) = (∫01 |F−1(u) − G−1(u)|p du)1/p

Intuitively: line up the two distributions by their quantiles and measure how far each quantile has to move. A distribution of returns centered at 10 is "close" to one centered at 11 — but "far" from one centered at 100. Wasserstein respects this geometry; KL divergence doesn't (two non-overlapping distributions have infinite KL regardless of distance).

Key properties

These properties are why the Wasserstein metric gives the contraction result. The discount factor γ literally contracts the distance between distributions.

Why not use Wasserstein as the training loss? Despite Wasserstein being the "right" metric theoretically, the paper uses KL divergence / cross-entropy as the training loss. Why? Because Wasserstein can't be efficiently estimated from samples when using function approximation. Cross-entropy can — it's just multiclass classification. Theory says Wasserstein; practice says cross-entropy. The algorithm bridges both with a projection step.
Why does the discount factor γ provide the contraction in the distributional Bellman operator?

Chapter 5: The C51 Algorithm

C51 represents Z(x, a) as a discrete distribution over N = 51 fixed atoms:

zi = VMIN + i · Δz,   Δz = (VMAX − VMIN) / (N−1)

With VMIN = −10, VMAX = 10, and N = 51 atoms. The network outputs probabilities for each atom:

pi(x, a) = softmax(θi(x, a))

So the DQN architecture is modified to output N = 51 probabilities per action (instead of 1 Q-value). The expected Q-value is recovered as Q(x, a) = ∑i zi pi(x, a).

Why 51 atoms? The paper tests N = 2 (Bernoulli), 5, 11, 21, and 51. Performance monotonically improves with more atoms. N = 51 is where the paper stopped — not because of diminishing returns, but because they ran out of GPU memory. The name "C51" comes from "Categorical with 51 atoms."

Action selection

Actions are still selected greedily based on expected value: a* = argmaxa Q(x, a) = argmaxai zi pi(x, a). The full distribution is used for learning but not directly for action selection (though it could be, for risk-sensitive behavior).

C51: Discrete Return Distribution

51 atoms span [VMIN, VMAX]. The network outputs probabilities over these atoms. Click "Bimodal" or "Peaked" to see different distribution shapes.

In C51, what does the neural network output for each action?

Chapter 6: The Projection Step

Here's the key algorithmic challenge: after applying the Bellman update T̂zj = r + γzj, the resulting atoms generally don't land on our fixed support. We need to project the updated distribution back onto the 51 atoms.

The projection Φ

For each atom zj in the next-state distribution with probability pj:

  1. Compute the Bellman update: T̂zj = r + γzj
  2. Clip to [VMIN, VMAX]
  3. Find its position bj between atoms: bj = (T̂zj − VMIN) / Δz
  4. Split probability pj to the two nearest atoms proportionally to distance

The training loss is then the cross-entropy between this projected target distribution and the network's current prediction:

L = DKL(ΦT̂Zθ̃(x, a) ∥ Zθ(x, a))
Classification, not regression: C51 turns RL into a classification problem. Instead of predicting a single Q-value (regression), the network predicts a probability distribution over 51 return categories (classification via cross-entropy). This is why it works so well — cross-entropy has better gradient properties than MSE for this task, especially for multimodal distributions.
Why does C51 use cross-entropy loss instead of Wasserstein distance for training?

Chapter 7: Results

C51 achieves state-of-the-art performance on the Atari benchmark, outperforming DQN on the vast majority of games — often by huge margins.

Key findings

Performance vs Number of Atoms

More atoms = better performance. Even 2 atoms (Bernoulli) beats standard DQN on most games.

Computational overhead: C51 trains at about 75% of DQN's speed (for N=51 atoms). This modest slowdown yields dramatically better performance. The output layer is N× larger per action, but the convolutional backbone (the expensive part) is unchanged.
What happens to C51's performance as the number of atoms N increases?

Chapter 8: What Distributions Reveal

The most illuminating part of the paper: the learned distributions tell a story about the game that Q-values can't.

Space Invaders example

In one state, the agent has 6 possible actions. Three involve pressing the fire button (releasing a laser early). The distributions for these three actions show significant probability mass at 0 — the agent believes they lead to eventual game-over. The safe actions (left, right, noop) have distributions concentrated at higher returns.

A Q-learning agent would see different numbers for each action. A distributional agent sees why — the fire actions have a mode at death and a mode at survival, while safe actions have only the survival mode.

The multimodality preservation effect

When the standard Bellman operator averages a bimodal distribution, the average can be a value that never occurs. The distributional operator preserves both modes separately. This gives the network a fundamentally easier learning target — predict two peaks instead of one phantom mean.

Why distributions help learning, not just decisions: Even though C51 selects actions using the expected value (same as DQN), the distributional loss provides a richer gradient signal. Predicting 51 probabilities per action gives the network 51 learning signals instead of 1. This is likely why performance improves so much — it's not about making different decisions, it's about learning better representations through a denser supervisory signal.
Why might learning a full distribution help even when action selection is still based on expected value?

Chapter 9: Connections

What C51 built on

DQN (Mnih et al., 2013/2015): C51 uses the DQN architecture, just modifying the output layer from |A| scalar Q-values to |A|×51 atom probabilities.

Value distribution theory (Jaquette 1973, Sobel 1982): The theoretical study of return distributions predates this paper by decades — but it was always used for specific purposes (risk), never as the primary learning objective.

What C51 enabled

QR-DQN (Dabney et al., 2018): Instead of fixed atoms with learned probabilities, uses fixed probabilities (quantiles) with learned atom positions. More flexible and removes the need for VMIN/VMAX.

IQN (Dabney et al., 2018): Implicit Quantile Networks — can approximate any quantile on-the-fly. The most flexible distributional RL method.

Rainbow (Hessel et al., 2017): Combines C51 with 5 other DQN improvements. Ablation shows C51 is one of the most important components.

D4PG (Barth-Maron et al., 2018): Distributional DDPG — extends the distributional perspective to continuous action spaces, demonstrating it works beyond discrete-action domains.

The paradigm shift: C51 proved that the full return distribution is not a theoretical curiosity — it's a practical advantage. The distributional perspective has become a standard component of modern deep RL systems. The core insight — that predicting richer targets gives denser learning signals — has parallels in supervised learning (predicting distributions vs point estimates) and generative modeling.

Cheat sheet

Core equation
Z(x,a) D= R(x,a) + γZ(X′,A′) — distributional Bellman
Representation
51 atoms in [−10, 10], softmax probabilities per action
Training loss
Cross-entropy between projected Bellman target and predicted distribution
Key result
Tπ is a γ-contraction in Wasserstein metric (not KL, TV, or Kolmogorov)
Impact
SOTA on Atari; spawned QR-DQN, IQN, Rainbow, D4PG
How does QR-DQN improve upon C51?