Distributional RL (C51)

Chapter 0: The Problem

Standard RL learns Q(s, a) — the expected return. But an expectation is a lossy compression. Consider two scenarios:

Scenario A: Guaranteed return of 10
Scenario B: 50% chance of 0, 50% chance of 20

Both have Q = 10. But they're fundamentally different situations. Scenario B has risk. A risk-averse agent should prefer A. A risk-seeking agent might prefer B. A standard Q-learning agent can't distinguish them — it sees the same number.

More practically: when a Q-network learns the expectation of a multimodal return distribution, the average can be a value that never actually occurs. The network tries to represent a number that's between two peaks — a phantom value that corrupts the learned representation.

Why expectations lie: In Space Invaders, an action might lead to survival (high return) or death (zero return). The expected return averages these into a moderate value that represents neither outcome. The distributional approach keeps both outcomes separate — the agent sees a bimodal distribution and can reason about each possibility distinctly.

Why is learning the expected return Q(s,a) insufficient for understanding the full nature of outcomes?

The expectation collapses multimodal distributions into a single number that may correspond to NO actual outcome — losing information about risk, variance, and the shape of possible returns Expected returns are too large to store Expected returns are always inaccurate

Chapter 1: The Key Insight

Instead of learning Q(s, a) = E[Z(s, a)], learn the full random variable Z(s, a) — the value distribution. Z is the random return: the actual sum of discounted rewards you'll receive, which varies depending on future stochasticity.

The distributional Bellman equation replaces the equality of expectations with an equality of distributions:

Z(x, a) ^D= R(x, a) + γZ(X′, A′)

The return distribution Z equals the reward R plus the discounted next-state return Z(X′, A′), in distribution. Three sources of randomness interact: (1) random reward R, (2) random transition to (X′, A′), and (3) the return distribution at the next state.

The distributional Bellman equation: Standard Bellman: Q = E[R] + γ E[Q′]. Distributional Bellman: Z ^D= R + γZ′. The distributional version preserves the full shape of the return — multimodality, skewness, heavy tails — while the standard version crushes it all into one number. The surprise: even in deterministic environments (like Atari), the distributions are rich and informative.

The practical algorithm — Categorical DQN (C51) — represents Z(s, a) as a discrete distribution over N=51 fixed atoms, and uses a projection step to map the Bellman backup onto this support. The result: state-of-the-art performance on Atari, with learned distributions that reveal the structure of each game.

What does the distributional Bellman equation Z ^D= R + γZ′ mean?

The return distribution Z equals, in distribution, the reward R plus the discounted next-state return Z′ — preserving the full shape of the distribution, not just its mean Z is always equal to R plus gamma times Z prime The distributions are Gaussian

Chapter 2: Value Distributions

A value distribution Z(x, a) is a mapping from state-action pairs to probability distributions over returns. Where Q(x, a) ∈ ℝ gives one number, Z(x, a) gives a full distribution.

Sources of randomness

Even in "deterministic" environments, value distributions can be complex. The randomness comes from:

Intrinsic randomness: Stochastic rewards and transitions in the environment
State aliasing: Partially observing the state (the agent can't distinguish states that lead to different outcomes)
Policy non-stationarity: The policy changes during learning, so the return distribution changes too
Function approximation: The network can't perfectly represent every distribution, introducing approximation error

Value Distribution vs Expected Value

Three actions with the same expected value (10) but different return distributions. A Q-learning agent sees them as identical. A distributional agent sees the full picture.

Why do value distributions contain useful information even in deterministic environments like Atari?

State aliasing, policy non-stationarity, and function approximation all create effective randomness — the distributions capture the agent's uncertainty about future returns from its limited perspective Atari games are actually stochastic The distributions are always uniform

Chapter 3: Distributional Bellman

The paper defines distributional Bellman operators analogous to the standard ones.

Policy evaluation operator T^π

T^πZ(x, a) ^D= R(x, a) + γP^πZ(x, a)

Where P^πZ(x, a) = Z(X′, A′) with X′ ~ P(·|x,a) and A′ ~ π(·|X′).

The contraction result

T^π is a γ-contraction in the Wasserstein metric d̄_p. This means repeated application T^π, T^π(T^πZ), ... converges to the unique fixed point Z^π — the true value distribution. This is the distributional analog of the standard convergence result.

Critical subtlety — control is harder: For policy evaluation (fixed policy), T^π is a contraction. But the optimality operator T (which includes greedy action selection) is NOT a contraction in any distributional metric. It converges in expectation (Lemma 4), but the distributions themselves may not converge neatly. This theoretical instability doesn't prevent practical success — but it explains why the algorithm needs careful design.

What metrics work and don't

T^π is a contraction in Wasserstein distance — but NOT in total variation, KL divergence, or Kolmogorov distance. The Wasserstein metric is uniquely suited because it respects the geometry of the return space (nearby returns are "close").

In what metric is the distributional policy evaluation operator T^π a contraction?

The Wasserstein (Mallows/Kantorovich) metric — but NOT total variation, KL divergence, or Kolmogorov distance KL divergence All metrics

Chapter 4: The Wasserstein Metric

The Wasserstein distance between two distributions F and G measures the "cost" of transforming one into the other:

d_p(F, G) = (∫₀¹ |F⁻¹(u) − G⁻¹(u)|^p du)^1/p

Intuitively: line up the two distributions by their quantiles and measure how far each quantile has to move. A distribution of returns centered at 10 is "close" to one centered at 11 — but "far" from one centered at 100. Wasserstein respects this geometry; KL divergence doesn't (two non-overlapping distributions have infinite KL regardless of distance).

Key properties

Scaling: d_p(aU, aV) ≤ |a| d_p(U, V) — discounting by γ shrinks the Wasserstein distance by γ, giving the contraction
Translation: d_p(A+U, A+V) ≤ d_p(U, V) — adding reward doesn't increase distance

These properties are why the Wasserstein metric gives the contraction result. The discount factor γ literally contracts the distance between distributions.

Why not use Wasserstein as the training loss? Despite Wasserstein being the "right" metric theoretically, the paper uses KL divergence / cross-entropy as the training loss. Why? Because Wasserstein can't be efficiently estimated from samples when using function approximation. Cross-entropy can — it's just multiclass classification. Theory says Wasserstein; practice says cross-entropy. The algorithm bridges both with a projection step.

Why does the discount factor γ provide the contraction in the distributional Bellman operator?

Multiplying by γ < 1 scales the Wasserstein distance by γ — the "scaling" property d(aU, aV) ≤ |a| d(U, V) with a = γ directly gives the contraction γ makes all distributions narrower γ removes the mean from distributions

Chapter 5: The C51 Algorithm

C51 represents Z(x, a) as a discrete distribution over N = 51 fixed atoms:

z_i = V_MIN + i · Δz, Δz = (V_MAX − V_MIN) / (N−1)

With V_MIN = −10, V_MAX = 10, and N = 51 atoms. The network outputs probabilities for each atom:

p_i(x, a) = softmax(θ_i(x, a))

So the DQN architecture is modified to output N = 51 probabilities per action (instead of 1 Q-value). The expected Q-value is recovered as Q(x, a) = ∑_i z_i p_i(x, a).

Why 51 atoms? The paper tests N = 2 (Bernoulli), 5, 11, 21, and 51. Performance monotonically improves with more atoms. N = 51 is where the paper stopped — not because of diminishing returns, but because they ran out of GPU memory. The name "C51" comes from "Categorical with 51 atoms."

Action selection

Actions are still selected greedily based on expected value: a* = argmax_a Q(x, a) = argmax_a ∑_i z_i p_i(x, a). The full distribution is used for learning but not directly for action selection (though it could be, for risk-sensitive behavior).

C51: Discrete Return Distribution

51 atoms span [V_MIN, V_MAX]. The network outputs probabilities over these atoms. Click "Bimodal" or "Peaked" to see different distribution shapes.

In C51, what does the neural network output for each action?

A probability distribution over 51 fixed atoms (canonical returns) — instead of DQN's single Q-value per action 51 different Q-values The mean and variance of the return

Chapter 6: The Projection Step

Here's the key algorithmic challenge: after applying the Bellman update T̂z_j = r + γz_j, the resulting atoms generally don't land on our fixed support. We need to project the updated distribution back onto the 51 atoms.

The projection Φ

For each atom z_j in the next-state distribution with probability p_j:

Compute the Bellman update: T̂z_j = r + γz_j
Clip to [V_MIN, V_MAX]
Find its position b_j between atoms: b_j = (T̂z_j − V_MIN) / Δz
Split probability p_j to the two nearest atoms proportionally to distance

The training loss is then the cross-entropy between this projected target distribution and the network's current prediction:

L = D_KL(ΦT̂Z_θ̃(x, a) ∥ Z_θ(x, a))

Classification, not regression: C51 turns RL into a classification problem. Instead of predicting a single Q-value (regression), the network predicts a probability distribution over 51 return categories (classification via cross-entropy). This is why it works so well — cross-entropy has better gradient properties than MSE for this task, especially for multimodal distributions.

Why does C51 use cross-entropy loss instead of Wasserstein distance for training?

Wasserstein distance can't be efficiently estimated from samples with function approximation — cross-entropy can, reducing the problem to multiclass classification Cross-entropy is always a better loss function Wasserstein is too slow to compute

Chapter 7: Results

C51 achieves state-of-the-art performance on the Atari benchmark, outperforming DQN on the vast majority of games — often by huge margins.

Key findings

C51 outperforms DQN on all 5 training games, often dramatically (Seaquest: state-of-the-art)
Even N=2 (Bernoulli, single parameter) outperforms DQN on 3/5 training games
Performance monotonically improves with more atoms: N=51 >> N=21 >> N=11 >> N=5 >> N=2
The improvement is not just in raw scores but in learning stability and speed

Performance vs Number of Atoms

More atoms = better performance. Even 2 atoms (Bernoulli) beats standard DQN on most games.

Computational overhead: C51 trains at about 75% of DQN's speed (for N=51 atoms). This modest slowdown yields dramatically better performance. The output layer is N× larger per action, but the convolutional backbone (the expensive part) is unchanged.

What happens to C51's performance as the number of atoms N increases?

Performance monotonically improves — more atoms capture finer distributional structure, and this consistently translates to better policies Performance peaks at N=11 then decreases Number of atoms doesn't matter

Chapter 8: What Distributions Reveal

The most illuminating part of the paper: the learned distributions tell a story about the game that Q-values can't.

Space Invaders example

In one state, the agent has 6 possible actions. Three involve pressing the fire button (releasing a laser early). The distributions for these three actions show significant probability mass at 0 — the agent believes they lead to eventual game-over. The safe actions (left, right, noop) have distributions concentrated at higher returns.

A Q-learning agent would see different numbers for each action. A distributional agent sees why — the fire actions have a mode at death and a mode at survival, while safe actions have only the survival mode.

The multimodality preservation effect

When the standard Bellman operator averages a bimodal distribution, the average can be a value that never occurs. The distributional operator preserves both modes separately. This gives the network a fundamentally easier learning target — predict two peaks instead of one phantom mean.

Why distributions help learning, not just decisions: Even though C51 selects actions using the expected value (same as DQN), the distributional loss provides a richer gradient signal. Predicting 51 probabilities per action gives the network 51 learning signals instead of 1. This is likely why performance improves so much — it's not about making different decisions, it's about learning better representations through a denser supervisory signal.

Why might learning a full distribution help even when action selection is still based on expected value?

The distributional loss provides 51 learning signals per action instead of 1 — a much denser supervisory signal that leads to better internal representations The expected value is computed more accurately It doesn't help — the improvement comes from something else

Chapter 9: Connections

What C51 built on

DQN (Mnih et al., 2013/2015): C51 uses the DQN architecture, just modifying the output layer from |A| scalar Q-values to |A|×51 atom probabilities.

Value distribution theory (Jaquette 1973, Sobel 1982): The theoretical study of return distributions predates this paper by decades — but it was always used for specific purposes (risk), never as the primary learning objective.

What C51 enabled

QR-DQN (Dabney et al., 2018): Instead of fixed atoms with learned probabilities, uses fixed probabilities (quantiles) with learned atom positions. More flexible and removes the need for V_MIN/V_MAX.

IQN (Dabney et al., 2018): Implicit Quantile Networks — can approximate any quantile on-the-fly. The most flexible distributional RL method.

Rainbow (Hessel et al., 2017): Combines C51 with 5 other DQN improvements. Ablation shows C51 is one of the most important components.

D4PG (Barth-Maron et al., 2018): Distributional DDPG — extends the distributional perspective to continuous action spaces, demonstrating it works beyond discrete-action domains.

The paradigm shift: C51 proved that the full return distribution is not a theoretical curiosity — it's a practical advantage. The distributional perspective has become a standard component of modern deep RL systems. The core insight — that predicting richer targets gives denser learning signals — has parallels in supervised learning (predicting distributions vs point estimates) and generative modeling.

Cheat sheet

Core equation

Z(x,a) ^D= R(x,a) + γZ(X′,A′) — distributional Bellman

Representation

51 atoms in [−10, 10], softmax probabilities per action

Training loss

Cross-entropy between projected Bellman target and predicted distribution

Key result

T^π is a γ-contraction in Wasserstein metric (not KL, TV, or Kolmogorov)

Impact

SOTA on Atari; spawned QR-DQN, IQN, Rainbow, D4PG

How does QR-DQN improve upon C51?

QR-DQN uses fixed quantile probabilities with learned atom positions — the inverse of C51's fixed atoms with learned probabilities — removing the need for V_MIN and V_MAX hyperparameters QR-DQN uses more atoms QR-DQN doesn't use distributions

A Distributional Perspective onReinforcement Learning