Dopamine neurons compute TD errors. This is not a metaphor.
In the early 1990s, neuroscientist Wolfram Schultz was recording from dopamine neurons in monkey brains. He discovered something remarkable: these neurons didn't just fire when the monkey got a reward. They fired when the reward was unexpected. And when an expected reward was omitted, they went silent — below their baseline firing rate. The signal looked exactly like a prediction error.
At roughly the same time, Peter Dayan and Read Montague were studying TD learning. They noticed that the TD error — δ = R + γV(S') − V(S) — had exactly the properties Schultz was recording. Positive when reward is surprising. Zero when reward is expected. Negative when expected reward is missing. The correspondence was uncanny.
This chapter traces the evidence for this claim and its implications. We'll see how the brain implements something strikingly close to an actor-critic architecture, with dopamine as the broadcast error signal that drives both value learning (the critic) and policy learning (the actor).
Before diving into the neural implementation of RL, we need a minimal vocabulary. The brain is built from roughly 86 billion neurons — cells that communicate via electrical impulses. When a neuron fires (an action potential or "spike"), it releases chemical neurotransmitters across a tiny gap called a synapse to influence the next neuron.
Learning occurs through changes in synaptic strength — how effectively one neuron drives another. Strengthening a synapse is called long-term potentiation (LTP); weakening it is long-term depression (LTD). This is the biological equivalent of updating a weight in a neural network.
The basal ganglia are a group of subcortical structures that play a central role in action selection and reward-based learning. The striatum, the input structure of the basal ganglia, receives both cortical inputs (representing states) and dopamine signals (representing prediction errors). Because state information and the error signal converge here, the striatum is the leading candidate site for RL in the brain.
Schultz's experiments (1993, 1997) used a simple paradigm: a monkey receives juice (the reward) shortly after a visual cue, the conditioned stimulus (CS). A microelectrode records from individual dopamine neurons. The results are now among the most replicated and celebrated findings in systems neuroscience.
Three conditions tell the whole story. Before conditioning: the dopamine neuron bursts when unexpected juice arrives. After conditioning: the burst shifts to the CS (the cue that predicts juice), and there's no response at juice delivery. When the CS appears but juice is withheld: there's a burst at the CS, then a dip below baseline at the time juice should have arrived.
The quantitative match is striking. The timing of the dopamine burst shifts backward in time as learning progresses — exactly as TD theory predicts (Chapter 14, TD model). The magnitude of the dip at reward omission matches the expected prediction error. Even the time course within a trial matches step-by-step TD error computations.
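The backward shift of the error signal can be reproduced with a few lines of tabular TD(0). The sketch below is a minimal illustration, not Schultz's analysis: it assumes one state per time step after cue onset, a pre-cue state whose value is fixed at 0 (because the cue arrives unpredictably), γ = 1, and hypothetical names such as `run_trial`.

```python
def run_trial(values, reward, alpha=0.2, gamma=1.0, learn=True):
    """Run one cue -> reward trial and return the TD error at each step.

    values[i] is the learned value of the state i steps after cue onset.
    The pre-cue state is assumed to have a fixed value of 0 because the
    cue arrives at an unpredictable time.
    """
    n = len(values)
    # Step 0: cue onset, a transition from the pre-cue state into state 0.
    deltas = [gamma * values[0] - 0.0]
    for i in range(n):
        is_last = i == n - 1
        r = reward if is_last else 0.0            # juice arrives on the last step
        v_next = 0.0 if is_last else values[i + 1]
        delta = r + gamma * v_next - values[i]    # TD error for this transition
        if learn:
            values[i] += alpha * delta
        deltas.append(delta)
    return deltas


values = [0.0] * 5                                    # five time steps between cue and juice
before = run_trial(values, reward=1.0, learn=False)   # naive animal
for _ in range(500):                                  # conditioning
    run_trial(values, reward=1.0)
after = run_trial(values, reward=1.0, learn=False)    # trained, juice delivered
omitted = run_trial(values, reward=0.0, learn=False)  # trained, juice withheld

for name, d in [("before", before), ("after", after), ("omitted", omitted)]:
    print(name, [round(x, 2) for x in d])
```

The first entry of each list is the error at cue onset and the last entry is the error at the time of juice. Before conditioning the error sits at the juice; after conditioning it sits at the cue; when juice is withheld the final entry turns negative, mirroring the three recorded conditions.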
Let's make the correspondence precise. The TD error at time t is:
δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t)
Now compare this with what dopamine neurons do. The phasic (transient) component of dopamine firing has three regimes that map directly onto the three cases of δ_t:
| Condition | TD Error | Dopamine |
|---|---|---|
| Reward better than predicted | δ > 0 | Burst above baseline |
| Reward as predicted | δ = 0 | No change from baseline |
| Reward worse than predicted | δ < 0 | Dip below baseline |
The correspondence goes beyond qualitative matching. Bayer and Glimcher (2005) showed that dopamine response magnitude scales approximately linearly with the size of the prediction error, most clearly for positive errors. Tobler et al. (2005) showed that the response is sensitive to the distribution of rewards a cue predicts, not just to the reward actually delivered. This is not a vague analogy; it's a quantitative fit.
There's an important caveat: negative prediction errors are harder to encode because firing rates can't go below zero. The "pause" in dopamine firing is limited by the baseline rate. This means negative TD errors may be represented less precisely than positive ones — a biological constraint that RL algorithms don't face.
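The asymmetry is easy to see numerically. The sketch below assumes a simple linear mapping from TD error to firing rate with a floor at zero; the baseline of 5 spikes/s and the gain of 20 are illustrative values chosen for the example, not measurements.

```python
def firing_rate(delta, baseline=5.0, gain=20.0):
    """Map a TD error to a firing rate in spikes/s, clipped at zero.

    baseline and gain are illustrative assumptions, not fitted parameters.
    """
    return max(0.0, baseline + gain * delta)


for delta in (1.0, 0.5, 0.0, -0.5, -1.0):
    print(f"delta = {delta:+.1f} -> {firing_rate(delta):.1f} spikes/s")
```

With these numbers, errors of −0.5 and −1.0 both drive the rate to the floor and become indistinguishable from firing rate alone, while +0.5 and +1.0 remain clearly separated.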
In Chapter 13, we saw that actor-critic methods separate value estimation (the critic) from action selection (the actor), with the TD error connecting them. The brain appears to implement exactly this architecture, with a clear anatomical division of labor.
The Critic: Ventral Striatum
The ventral striatum (especially the nucleus accumbens) learns to predict future reward — it estimates the value function V(s). Neurons here respond to reward-predicting cues, and their responses track expected value. Lesions to this area impair value-based predictions without necessarily affecting action selection.
The Actor: Dorsal Striatum
The dorsal striatum (caudate and putamen) is involved in action selection and habit formation — it stores the policy π(a|s). Neurons here respond to specific actions, and lesions impair learned action sequences. This is the "actor" that selects actions based on accumulated reinforcement history.
How do synapses in the brain actually change? The classical rule is Hebbian learning: "neurons that fire together, wire together." If neuron A repeatedly causes neuron B to fire, the synapse from A to B strengthens. But Hebbian learning alone is not enough for RL — it has no notion of reward.
The brain uses two distinct learning rules for the critic and the actor, and they differ in a crucial way:
Two-Factor Rule (Critic)
Synaptic change depends on two factors: (1) presynaptic activity and (2) the dopamine signal δ. The weight update is Δw ∝ x · δ. This is semi-gradient TD — the state features x are multiplied by the prediction error δ. This is how the ventral striatum updates value predictions.
Three-Factor Rule (Actor)
Synaptic change depends on three factors: (1) presynaptic activity (state), (2) postsynaptic activity (action taken), and (3) the dopamine signal δ. The update is Δw ∝ x · a · δ. This is the policy gradient — the state-action pair is reinforced in proportion to δ.
Both rules are modulated by dopamine, but they operate on different information. The two-factor rule (critic) is essentially supervised learning with δ as the teaching signal. The three-factor rule (actor) is the biological implementation of REINFORCE-like policy gradient methods, where the action needs to be "tagged" so that only the synapses responsible for the chosen action are updated.
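The two rules can be put side by side in a toy setting. The sketch below assumes a single state with feature x = 1, two actions with fixed rewards, a softmax actor, and one-step episodes, so the TD error reduces to δ = R − V(S). The postsynaptic factor is a one-hot code for the chosen action, following the rule as stated above; a textbook policy gradient would also subtract the action probabilities. All names are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
rewards = [1.0, 0.0]      # action 0 pays off, action 1 does not (assumed task)
x = 1.0                   # presynaptic activity: the single state feature
critic_w = 0.0            # critic weight, so V(s) = critic_w * x
actor_w = [0.0, 0.0]      # one actor weight per action
alpha = 0.1

for _ in range(2000):
    probs = softmax([w * x for w in actor_w])
    action = 0 if rng.random() < probs[0] else 1
    delta = rewards[action] - critic_w * x          # TD error; the episode ends here

    critic_w += alpha * x * delta                   # two-factor: pre x dopamine

    post = [1.0 if a == action else 0.0 for a in range(2)]
    for a in range(2):
        actor_w[a] += alpha * x * post[a] * delta   # three-factor: pre x post x dopamine

final_probs = softmax([w * x for w in actor_w])
print("P(action 0) =", round(final_probs[0], 3), " V =", round(critic_w, 3))
```

Only the weight of the chosen action changes on each trial, which is the "tagging" role of the postsynaptic factor: the same δ reaches both actor weights, but the one-hot term gates which one is updated.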
There's a timing problem. When a rat presses a lever and gets food 5 seconds later, how does the brain know which synapses to credit? The dopamine arrives 5 seconds after the action. Many synapses were active at various times during those 5 seconds, yet only the synapses responsible for the lever press should be strengthened.
The solution is eligibility traces — temporary biochemical tags at synapses that mark them as "eligible for modification." When a synapse is active, it sets a molecular flag that decays over time. When dopamine arrives, only the flagged synapses are modified.
This mechanism has experimental support. Yagishita et al. (2014) showed in mouse striatum that synaptic potentiation requires dopamine to arrive within a critical time window (~2 seconds) of synaptic activity. If dopamine arrives too late, the trace has decayed and no learning occurs. This is the kind of decaying synaptic memory that eligibility traces provide in TD(λ).
The time constant of the biological trace (∼seconds) constrains the effective λ parameter in TD(λ). This may explain why animals struggle with very long credit assignment delays — the trace simply decays before the reward signal arrives.
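A back-of-the-envelope sketch makes the link between trace lifetime and λ concrete. It assumes an exponentially decaying trace with a time constant of 1 second, which is a simplification chosen for illustration rather than the measured shape of the biological window, and it uses the fact that a TD(λ) trace decays by a factor γλ per time step.

```python
import math


def weight_change(delay_s, delta=1.0, tau_s=1.0, alpha=0.5):
    """Weight change when dopamine (delta) arrives delay_s after synaptic activity.

    The trace is set to 1 at the moment of activity and then decays
    exponentially with time constant tau_s (an assumed shape).
    """
    trace = math.exp(-delay_s / tau_s)
    return alpha * trace * delta


def effective_lambda(dt_s, tau_s=1.0, gamma=1.0):
    """Lambda for which the per-step decay gamma * lambda equals exp(-dt / tau)."""
    return math.exp(-dt_s / tau_s) / gamma


for delay in (0.5, 2.0, 5.0):
    print(f"dopamine after {delay:.1f} s -> weight change {weight_change(delay):.4f}")
print("effective lambda for 100 ms steps:", round(effective_lambda(0.1), 3))
```

Under these assumptions, dopamine arriving 5 seconds late produces a weight change more than a hundred times smaller than dopamine arriving immediately, which is the sense in which a short trace limits long-delay credit assignment.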
Here's a puzzle. There are only about 400,000 dopamine neurons, but they broadcast to millions of synapses across the striatum. All those synapses receive roughly the same dopamine signal. How can a single, globally broadcast scalar (the TD error) drive useful learning across such a diverse population of neurons?
This is the structural credit assignment problem: if everyone gets the same reward signal, how does each unit know whether it personally contributed to success or failure? In RL terms, it's like running a team of a million agents who all receive the same reward.
The resolution comes from the local eligibility traces. Even though the dopamine signal is global, each synapse has its own eligibility trace based on its local activity. The update Δw = trace × δ is different for every synapse because the traces are different. Synapses that happen to be active when good things happen get strengthened; inactive synapses don't. Over many trials, this local-global interaction allows specialization despite the shared reward signal.
This is precisely the mechanism behind the REINFORCE algorithm — a global reward signal multiplied by a local log-probability gradient produces useful per-parameter updates. The brain's dopamine system implements this at biological scale.
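A small simulation shows how a shared scalar can still teach individual units. The sketch below assumes a team of four stochastic binary units, each with a single weight, that all receive the same reward (the fraction of units matching a hypothetical target pattern). Each unit multiplies the broadcast signal by its own local term a − p, which is the gradient of its log-probability with respect to its weight. The running-average baseline is a standard variance-reduction choice added for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
target = [1, 0, 1, 0]            # hypothetical pattern that maximizes reward
weights = [0.0] * len(target)    # one adjustable weight per unit
baseline = 0.0                   # running average of the reward
alpha = 0.5

for _ in range(5000):
    probs = [sigmoid(w) for w in weights]
    actions = [1 if rng.random() < p else 0 for p in probs]
    reward = sum(a == t for a, t in zip(actions, target)) / len(target)

    signal = reward - baseline                  # one scalar, broadcast to every unit
    for i, (a, p) in enumerate(zip(actions, probs)):
        local_trace = a - p                     # d log pi / d w for unit i
        weights[i] += alpha * signal * local_trace
    baseline += 0.1 * (reward - baseline)

learned = [round(sigmoid(w), 2) for w in weights]
print("learned firing probabilities:", learned)
```

No unit is ever told what its own target is. Each one recovers its contribution statistically, because its local term correlates with the shared signal only to the extent that its own action affected the reward.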
If the brain implements TD learning, what happens when the dopamine system is hijacked? Every drug of abuse — cocaine, heroin, amphetamine, nicotine, alcohol — increases dopamine in the striatum, either directly or indirectly. From the TD learning perspective, this creates a phantom positive prediction error: the system registers unexpected reward even when nothing good actually happened.
A. David Redish (2004) proposed a computational model of addiction based on destabilized TD learning. The key idea: addictive drugs create artificial δ > 0 signals that learning cannot cancel, which corrupts the learned value function and makes drug-associated states appear far more valuable than they actually are.
The model explains several puzzling features of addiction. Craving corresponds to the inflated value of drug-associated cues. Relapse occurs because the artificially high values are deeply ingrained. Tolerance maps to the value function partially adapting. And withdrawal corresponds to the negative prediction error when the inflated reward is suddenly absent.
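The core of the argument fits in a short simulation. The sketch below is a simplified, one-step version in the spirit of the model, not a reproduction of it: a single cue state is followed by an outcome, the natural TD error is R − V, and the drug adds a boost D that the value estimate cannot offset, written as δ = max(R − V + D, D). The function name and parameter values are hypothetical.

```python
def learn_value(reward, drug_boost=0.0, trials=200, alpha=0.1):
    """Learn the value of a cue that is followed by a reward.

    drug_boost is an assumed pharmacological addition to the dopamine
    signal that the learned value cannot cancel.
    """
    value = 0.0
    deltas = []
    for _ in range(trials):
        delta = reward - value                      # ordinary one-step TD error
        if drug_boost > 0.0:
            # The error can never fall below the drug-induced boost.
            delta = max(delta + drug_boost, drug_boost)
        value += alpha * delta
        deltas.append(delta)
    return value, deltas


natural_v, natural_d = learn_value(reward=1.0)
drug_v, drug_d = learn_value(reward=1.0, drug_boost=0.5)

print("natural: value", round(natural_v, 3), "final delta", round(natural_d[-1], 3))
print("drug:    value", round(drug_v, 3), "final delta", round(drug_d[-1], 3))
```

With a natural reward the error shrinks to zero and the value settles at the reward size. With the boost, the error stays pinned at D, so the value keeps climbing by α·D on every trial for as long as the drug is taken.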
This chapter presented one of the great success stories of computational neuroscience: the discovery that dopamine neurons compute TD reward prediction errors, and that the basal ganglia implement an actor-critic architecture. The correspondence is not metaphorical — it's quantitative and mechanistic.
| RL Concept | Neural Substrate | Evidence |
|---|---|---|
| TD error δ | Phasic dopamine signal | Schultz 1993, 1997 |
| Value function V(s) | Ventral striatum | fMRI, lesion studies |
| Policy π(a|s) | Dorsal striatum | Electrophysiology, lesions |
| Critic update (two-factor) | Ventral striatal plasticity | Synaptic recording |
| Actor update (three-factor) | Dorsal striatal plasticity | Synaptic recording |
| Eligibility traces | Molecular synaptic tags | Yagishita 2014 |
| Global broadcast of δ | Dopamine projection pattern | Anatomy, pharmacology |
What comes next: Chapter 14 showed that RL explains animal behavior. This chapter showed it explains the neural mechanisms. Chapter 16 returns to engineering — applying RL to create systems that play games, control hardware, and achieve superhuman performance in domains once thought to require human intuition.