Sutton & Barto, Chapter 15

RL and Neuroscience

Dopamine neurons compute TD errors. This is not a metaphor.

Prerequisites: Chapter 6 (TD learning) + Chapter 14 (psychology). That's it.
10 chapters · 3 simulations · 10 quizzes

Chapter 0: RL in the Brain

In the early 1990s, neuroscientist Wolfram Schultz was recording from dopamine neurons in monkey brains. He discovered something remarkable: these neurons didn't just fire when the monkey got a reward. They fired when the reward was unexpected. And when an expected reward was omitted, they went silent — below their baseline firing rate. The signal looked exactly like a prediction error.

At roughly the same time, Peter Dayan and Read Montague were studying TD learning. They noticed that the TD error — δ = R + γV(S') − V(S) — had exactly the properties Schultz was recording. Positive when reward is surprising. Zero when reward is expected. Negative when expected reward is missing. The correspondence was uncanny.

The landmark discovery: Dopamine neurons in the midbrain appear to compute the TD reward prediction error. This is not an analogy. The quantitative properties of dopamine signals match the mathematical properties of δ_t in TD learning. This is one of the most successful applications of computational theory to neuroscience.

This chapter traces the evidence for this claim and its implications. We'll see how the brain implements something strikingly close to an actor-critic architecture, with dopamine as the broadcast error signal that drives both value learning (the critic) and policy learning (the actor).

Check: What was the key observation about dopamine neurons?

Chapter 1: Neuroscience Basics

Before diving into the neural implementation of RL, we need a minimal vocabulary. The brain is built from roughly 86 billion neurons — cells that communicate via electrical impulses. When a neuron fires (an action potential or "spike"), it releases chemical neurotransmitters across a tiny gap called a synapse to influence the next neuron.

Learning occurs through changes in synaptic strength — how effectively one neuron drives another. Strengthening a synapse is called long-term potentiation (LTP); weakening it is long-term depression (LTD). This is the biological equivalent of updating a weight in a neural network.

Presynaptic neuron: fires an action potential → releases neurotransmitter
↓
Synapse: neurotransmitter crosses the gap
↓
Postsynaptic neuron: receives the signal, integrates it, and may fire

The key neurotransmitter for RL: Dopamine is produced by a small cluster of neurons in the midbrain (the ventral tegmental area, VTA, and substantia nigra pars compacta, SNc). Despite their small number (~400,000 in humans), these neurons broadcast dopamine widely across the brain, particularly to the striatum (part of the basal ganglia). This broadcast architecture is crucial for RL.

The basal ganglia are a group of subcortical structures that play a central role in action selection and reward-based learning. The striatum — the input structure of the basal ganglia — receives both cortical inputs (representing states) and dopamine signals (representing prediction errors). This is where RL happens in the brain.

Check: What is the role of dopamine in the brain's RL system?

Chapter 2: Reward Prediction Error

Schultz's experiments (1993, 1997) used a simple paradigm: a monkey receives juice (reward) after a visual cue (CS). A microelectrode records from individual dopamine neurons. The results are now among the most replicated and celebrated findings in systems neuroscience.

Three conditions tell the whole story. Before conditioning: the dopamine neuron bursts when unexpected juice arrives. After conditioning: the burst shifts to the CS (the cue that predicts juice), and there's no response at juice delivery. When the CS appears but juice is withheld: there's a burst at the CS, then a dip below baseline at the time juice should have arrived.

The three signatures: (1) Unexpected reward → burst. (2) Predicted reward → nothing at reward time, burst at predictor. (3) Omitted expected reward → dip. These are precisely the three cases of the TD error: δ > 0, δ = 0, and δ < 0.

[Interactive simulation: Schultz's Dopamine Experiments. Select a condition (before learning, after learning, reward omitted) to see the dopamine firing pattern. The signal shifts from reward time to the predictive cue, and dips when reward is omitted.]

The quantitative match is striking. The timing of the dopamine burst shifts backward in time as learning progresses — exactly as TD theory predicts (Chapter 14, TD model). The magnitude of the dip at reward omission matches the expected prediction error. Even the time course within a trial matches step-by-step TD error computations.
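The three conditions can be reproduced with a few lines of tabular TD(0). The sketch below is illustrative only, not the model fitted to Schultz's data: it assumes one state per time step after cue onset, a pre-cue state whose value stays at 0 (the cue arrives unpredictably), γ = 1, and a reward of 1. The function name `run_trial` and all parameter values are hypothetical.

```python
import numpy as np

def run_trial(V, reward, alpha=0.1, gamma=1.0, learn=True):
    """Return the TD error at each transition of one trial.

    V[0]  : pre-cue state, held at 0 because cue timing is unpredictable
    V[1]  : cue onset
    V[-1] : terminal state after reward time, held at 0
    """
    n = len(V)
    deltas = np.zeros(n - 1)
    for s in range(n - 1):
        r = reward if s == n - 2 else 0.0      # reward arrives on the last transition
        deltas[s] = r + gamma * V[s + 1] - V[s]
        if learn and s > 0:                    # the pre-cue value is never updated
            V[s] += alpha * deltas[s]
    return deltas

V = np.zeros(6)
before = run_trial(V, reward=1.0, learn=False)    # condition 1: unexpected reward
for _ in range(500):                              # conditioning trials
    run_trial(V, reward=1.0)
after = run_trial(V, reward=1.0, learn=False)     # condition 2: predicted reward
omitted = run_trial(V, reward=0.0, learn=False)   # condition 3: reward withheld

print("before :", np.round(before, 2))
print("after  :", np.round(after, 2))
print("omitted:", np.round(omitted, 2))
```

In each array the first entry is the transition into the cue and the last entry is reward time. Before learning the only nonzero error sits at reward time; after learning it sits at cue onset; on an omission trial the cue response is unchanged and reward time shows an error of about -1.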

Check: What happens to dopamine neuron firing when an expected reward is omitted?

Chapter 3: TD Error = Dopamine

Let's make the correspondence precise. The TD error at time t is:

δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)

Now compare this with what dopamine neurons do. The phasic (transient) component of dopamine firing has three regimes that map directly onto the three cases of δ_t:

| Condition | TD error | Dopamine |
|---|---|---|
| Reward better than predicted | δ > 0 | Burst above baseline |
| Reward as predicted | δ = 0 | No change from baseline |
| Reward worse than predicted | δ < 0 | Dip below baseline |

Baseline firing matters: Dopamine neurons have a tonic (constant) baseline firing rate of about 3-5 spikes per second. Positive prediction errors produce bursts above this baseline. Negative prediction errors produce pauses below it. The baseline represents δ = 0: no surprise. This gives the system a way to signal both positive and negative errors with a single non-negative signal (you can't fire fewer than zero spikes, but you can pause).

The correspondence goes beyond qualitative matching. Bayer and Glimcher (2005) showed that dopamine response magnitude scales linearly with the size of the prediction error. Tobler et al. (2005) showed that the response reflects the full probability distribution of expected rewards. This is not a vague analogy — it's a quantitative fit.

There's an important caveat: negative prediction errors are harder to encode because firing rates can't go below zero. The "pause" in dopamine firing is limited by the baseline rate. This means negative TD errors may be represented less precisely than positive ones — a biological constraint that RL algorithms don't face.
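The asymmetry is easy to see in a toy encoding. The sketch below assumes a linear mapping from δ to firing rate with a floor at zero. The baseline of 4 spikes/s is taken from the 3-5 range quoted above, but the gain of 10 spikes/s per unit of δ is an arbitrary illustrative value, and both function names are hypothetical.

```python
def firing_rate(delta, baseline=4.0, gain=10.0):
    """Map a TD error to a firing rate in spikes/s, floored at zero."""
    return max(0.0, baseline + gain * delta)

def decoded_error(rate, baseline=4.0, gain=10.0):
    """Invert the linear mapping; cannot recover errors lost to the floor."""
    return (rate - baseline) / gain

for delta in (1.0, 0.0, -0.2, -0.5, -1.0):
    rate = firing_rate(delta)
    print(f"delta={delta:+.1f}  rate={rate:4.1f}  decoded={decoded_error(rate):+.2f}")
```

Under these assumed numbers, every error below -0.4 produces the same complete pause, so δ = -0.5 and δ = -1.0 are indistinguishable downstream, while positive errors of any size remain distinguishable.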

Check: Why is the baseline firing rate of dopamine neurons important for encoding TD errors?

Chapter 4: The Neural Actor-Critic

In Chapter 13, we saw that actor-critic methods separate value estimation (the critic) from action selection (the actor), with the TD error connecting them. The brain appears to implement exactly this architecture, with a clear anatomical division of labor.

The Critic: Ventral Striatum

The ventral striatum (especially the nucleus accumbens) learns to predict future reward — it estimates the value function V(s). Neurons here respond to reward-predicting cues, and their responses track expected value. Lesions to this area impair value-based predictions without necessarily affecting action selection.

The Actor: Dorsal Striatum

The dorsal striatum (caudate and putamen) is involved in action selection and habit formation — it stores the policy π(a|s). Neurons here respond to specific actions, and lesions impair learned action sequences. This is the "actor" that selects actions based on accumulated reinforcement history.

Cortex (state representation): sends state features to both striatal regions
↓
Ventral striatum (critic): computes V(s) → contributes to the δ calculation
↓ δ
VTA/SNc (dopamine): broadcasts δ = R + γV(s') − V(s)
↓ δ broadcast
Dorsal striatum (actor): updates the policy, reinforcing the action if δ > 0

The dopamine broadcast: The same δ signal (dopamine) is sent to both the critic and the actor. The critic uses it to improve value predictions. The actor uses it to reinforce or punish the chosen action. This is exactly how the actor-critic algorithm works: one error signal drives both learning processes.
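
The division of labor can be sketched as a one-step actor-critic. This is a toy illustration, not a model of the striatum: it assumes two cue states, two actions, a reward of 1 when the action index matches the cue index, and episodes that end after one step (so V(s') = 0). All names and learning rates are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
V = np.zeros(n_states)                     # critic: value estimate per state
prefs = np.zeros((n_states, n_actions))    # actor: action preferences per state
alpha_v, alpha_pi = 0.1, 0.1

def policy(s):
    """Softmax over the actor's preferences in state s."""
    z = np.exp(prefs[s] - prefs[s].max())
    return z / z.sum()

for _ in range(4000):
    s = rng.integers(n_states)             # "cortex" presents a state
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)        # actor selects an action
    r = 1.0 if a == s else 0.0
    delta = r - V[s]                       # TD error; next state is terminal
    V[s] += alpha_v * delta                # critic update uses delta
    grad = -pi                             # gradient of log softmax w.r.t. prefs[s]
    grad[a] += 1.0
    prefs[s] += alpha_pi * delta * grad    # actor update uses the same delta

final_policy = np.array([policy(s) for s in range(n_states)])
print(np.round(final_policy, 2))
```

The single scalar `delta` appears in both update lines, which is the point of the broadcast: the critic and the actor differ only in what they multiply it by.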
Check: What brain region corresponds to the "critic" in actor-critic RL?

Chapter 5: Learning Rules

How do synapses in the brain actually change? The classical rule is Hebbian learning: "neurons that fire together, wire together." If neuron A repeatedly causes neuron B to fire, the synapse from A to B strengthens. But Hebbian learning alone is not enough for RL — it has no notion of reward.

The brain uses two distinct learning rules for the critic and the actor, and they differ in a crucial way:

Two-Factor Rule (Critic)

Synaptic change depends on two factors: (1) presynaptic activity and (2) the dopamine signal δ. The weight update is Δw ∝ x · δ. This is semi-gradient TD — the state features x are multiplied by the prediction error δ. This is how the ventral striatum updates value predictions.

Three-Factor Rule (Actor)

Synaptic change depends on three factors: (1) presynaptic activity (state), (2) postsynaptic activity (action taken), and (3) the dopamine signal δ. The update is Δw ∝ x · a · δ. This is the policy gradient — the state-action pair is reinforced in proportion to δ.

Why three factors for the actor? The critic just needs to know "this state was better/worse than expected" (state × error). The actor needs to know "this action in this state was better/worse than expected" (state × action × error). The third factor — postsynaptic activity representing the chosen action — is what makes the policy gradient work.

Both rules are modulated by dopamine, but they operate on different information. The two-factor rule (critic) is essentially supervised learning with δ as the teaching signal. The three-factor rule (actor) is the biological implementation of REINFORCE-like policy gradient methods, where the action needs to be "tagged" so that only the synapses responsible for the chosen action are updated.
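The two rules can be written side by side. The sketch below takes the proportionalities above literally (Δw ∝ x · δ and Δw ∝ x · a · δ), with the postsynaptic factor assumed to be a one-hot vector marking the chosen action. The function names and the learning rate are hypothetical, and a full policy-gradient update for a softmax policy would use the action indicator minus the action probabilities rather than the raw one-hot vector.

```python
import numpy as np

def critic_update(w, x, delta, lr=0.1):
    """Two-factor rule: presynaptic activity x times dopamine signal delta."""
    return w + lr * delta * x

def actor_update(W, x, a, delta, lr=0.1):
    """Three-factor rule: pre (x) times post (chosen action a) times delta."""
    post = np.zeros(W.shape[0])
    post[a] = 1.0                              # only the chosen action's unit is active
    return W + lr * delta * np.outer(post, x)

x = np.array([1.0, 0.0, 1.0])                  # active state features
w = critic_update(np.zeros(3), x, delta=0.5)
W = actor_update(np.zeros((2, 3)), x, a=1, delta=0.5)
print(w)
print(W)
```

In the actor's weight matrix only the row for action 1 changes, and within that row only the weights from active features change. That is the "tagging" described above: the same δ reaches every synapse, but the pre × post product is nonzero only where both sides were active.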

Check: What is the key difference between the two-factor and three-factor learning rules?

Chapter 6: Eligibility Traces in the Brain

There's a timing problem. When a rat presses a lever and gets food 5 seconds later, how does the brain know which synapses to credit? The dopamine arrives 5 seconds after the action. Most synapses were active at various times during those 5 seconds. Only the synapses responsible for the lever press should be strengthened.

The solution is eligibility traces — temporary biochemical tags at synapses that mark them as "eligible for modification." When a synapse is active, it sets a molecular flag that decays over time. When dopamine arrives, only the flagged synapses are modified.

Two types of traces: Neuroscientists distinguish non-contingent traces (set by presynaptic activity alone, like the two-factor rule) and contingent traces (set by coincidence of pre- and postsynaptic activity, like the three-factor rule). The non-contingent trace serves the critic; the contingent trace serves the actor.

1. Synapse fires: an eligibility trace is set (a molecular tag).
   ↓ trace decays over seconds
2. Dopamine arrives: the δ signal is broadcast from VTA/SNc.
3. Synapse modified: Δw ∝ trace × δ (only if the trace is still active).

This mechanism has been confirmed experimentally. Yagishita et al. (2014) showed in mouse striatum that synaptic potentiation requires dopamine to arrive within a critical time window (~2 seconds) of synaptic activity. If dopamine arrives too late, the trace has decayed and no learning occurs. This is exactly the eligibility trace mechanism from TD(λ).

The time constant of the biological trace (∼seconds) constrains the effective λ parameter in TD(λ). This may explain why animals struggle with very long credit assignment delays — the trace simply decays before the reward signal arrives.
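A small numerical sketch makes the time window and its link to λ concrete. It assumes the trace decays exponentially with a time constant of 1 second; the text above says only that it decays over seconds, so both the functional form and the value are illustrative assumptions. The link to λ uses the fact that a TD(λ) trace shrinks by a factor γλ per time step. All function names are hypothetical.

```python
import math

def trace_at(delay_s, tau_s=1.0):
    """Eligibility remaining delay_s seconds after synaptic activity."""
    return math.exp(-delay_s / tau_s)

def weight_change(delay_s, delta, lr=1.0, tau_s=1.0):
    """Delta-w = lr * trace * delta, with dopamine arriving after delay_s."""
    return lr * trace_at(delay_s, tau_s) * delta

def effective_lambda(dt_s, tau_s=1.0, gamma=1.0):
    """Lambda for which gamma * lambda matches the per-step trace decay."""
    return min(1.0, math.exp(-dt_s / tau_s) / gamma)

for delay in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(f"dopamine after {delay:3.1f} s -> weight change {weight_change(delay, 1.0):.4f}")
print("effective lambda at 100 ms steps:", round(effective_lambda(0.1), 3))
```

With these assumed values, dopamine arriving 5 seconds late changes the weight by less than 1% of what immediate dopamine would, which is the credit assignment limit described above.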

Check: What problem do eligibility traces solve in the brain?

Chapter 7: Collective Learning

Here's a puzzle. There are only about 400,000 dopamine neurons, but they broadcast to millions of synapses across the striatum. All those synapses receive roughly the same dopamine signal. How can a single, globally broadcast scalar (the TD error) drive useful learning across such a diverse population of neurons?

This is the structural credit assignment problem: if everyone gets the same reward signal, how does each unit know whether it personally contributed to success or failure? In RL terms, it's like running a team of a million agents who all receive the same reward.

The team problem: Imagine a thousand workers on a factory floor. The factory either meets its quota (reward = 1) or doesn't (reward = 0). Each worker sees only this global signal. Yet over time, the good workers get reinforced and the bad ones don't. How? Because each worker's local state features correlate differently with the global reward. Workers in the right place at the right time get consistently paired with positive rewards.

The resolution comes from the local eligibility traces. Even though the dopamine signal is global, each synapse has its own eligibility trace based on its local activity. The update Δw ∝ trace × δ is different for every synapse because the traces are different. Synapses that happen to be active when good things happen get strengthened; inactive synapses don't. Over many trials, this local-global interaction allows specialization despite the shared reward signal.

This is precisely the mechanism behind the REINFORCE algorithm — a global reward signal multiplied by a local log-probability gradient produces useful per-parameter updates. The brain's dopamine system implements this at biological scale.
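The team setting can be simulated with a handful of stochastic units that share one scalar reward. The sketch below is a toy construction rather than a model from the chapter: each unit emits 0 or 1 with probability given by a logistic function of its own weight, the shared reward is the fraction of units whose output matches a hidden target pattern, and a running average of reward serves as the baseline. The target pattern, unit count, and learning rates are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_trials, lr = 10, 20000, 0.5
target = rng.integers(0, 2, size=n_units)   # hidden "useful" output for each unit
w = np.zeros(n_units)                       # one weight per unit
baseline = 0.5                              # chance-level reward to start

for _ in range(n_trials):
    p = 1.0 / (1.0 + np.exp(-w))            # each unit's probability of emitting 1
    a = (rng.random(n_units) < p).astype(float)
    R = np.mean(a == target)                # one scalar, identical for every unit
    delta = R - baseline                    # global reinforcement signal
    w += lr * delta * (a - p)               # local factor (a - p) differs per unit
    baseline += 0.05 * (R - baseline)

p = 1.0 / (1.0 + np.exp(-w))
p_correct = np.where(target == 1, p, 1.0 - p)
print("mean probability of the useful output:", round(float(p_correct.mean()), 3))
```

No unit is ever told its own target. Each one only correlates its own deviation `a - p` with the shared `delta`, and because its output contributes a small but consistent share of the reward, that correlation pushes its weight in the useful direction on average.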

Check: How can a globally broadcast dopamine signal drive useful learning at individual synapses?

Chapter 8: Addiction

If the brain implements TD learning, what happens when the dopamine system is hijacked? Every drug of abuse — cocaine, heroin, amphetamine, nicotine, alcohol — increases dopamine in the striatum, either directly or indirectly. From the TD learning perspective, this creates a phantom positive prediction error: the system registers unexpected reward even when nothing good actually happened.

Andrew Redish (2004) proposed a computational model of addiction based on destabilized TD learning. The key idea: addictive drugs create artificial δ > 0 signals that corrupt the learned value function, making drug-associated states appear far more valuable than they actually are.

Addiction as corrupted TD learning: Normally, once the value function is accurate, δ ≈ 0 and learning stops. But drugs force δ > 0 every time, so V(drug states) keeps increasing without bound. The system never reaches equilibrium. The agent becomes "convinced" that drug-related states are the most valuable states in the world.

[Interactive simulation: Normal vs Addicted TD Learning. Compare value learning with natural rewards (prediction error converges to zero) vs drug rewards (prediction error never converges). Adjustable parameter: drug DA boost (default 0.50).]
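
The contrast between the two cases can be sketched in a few lines. This is a heavily simplified illustration in the spirit of the model described above, not a reimplementation of it: it assumes a single state followed by termination, a reward of 1, and a drug effect modeled as a floor on δ equal to the boost. The function name `learn_value` and the parameter values are hypothetical.

```python
def learn_value(reward, drug_boost=0.0, trials=200, alpha=0.1):
    """Return (value after each trial, final TD error) for one rewarded state."""
    V = 0.0
    history = []
    delta = 0.0
    for _ in range(trials):
        delta = reward - V                    # next state is terminal, so V(s') = 0
        if drug_boost > 0.0:
            # Assumed drug effect: delta can never fall below the boost
            delta = max(delta + drug_boost, drug_boost)
        V += alpha * delta
        history.append(V)
    return history, delta

natural, natural_delta = learn_value(reward=1.0)
drug, drug_delta = learn_value(reward=1.0, drug_boost=0.5)

print(f"natural: V = {natural[-1]:.3f}, final delta = {natural_delta:.3f}")
print(f"drug   : V = {drug[-1]:.3f}, final delta = {drug_delta:.3f}")
```

With a natural reward the value settles at the reward magnitude and δ shrinks to zero. With the boost, δ stays pinned at 0.5 once the value has passed the true reward, so the value keeps climbing by a fixed amount every trial for as long as the simulation runs.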

The model explains several puzzling features of addiction. Craving corresponds to the inflated value of drug-associated cues. Relapse occurs because the artificially high values are deeply ingrained. Tolerance maps to the value function partially adapting. And withdrawal corresponds to the negative prediction error when the inflated reward is suddenly absent.

Check: Why does the TD system never reach equilibrium with addictive drugs?

Chapter 9: Summary

This chapter presented one of the great success stories of computational neuroscience: the discovery that dopamine neurons compute TD reward prediction errors, and that the basal ganglia implement an actor-critic architecture. The correspondence is not metaphorical — it's quantitative and mechanistic.

| RL concept | Neural substrate | Evidence |
|---|---|---|
| TD error δ | Phasic dopamine signal | Schultz 1993, 1997 |
| Value function V(s) | Ventral striatum | fMRI, lesion studies |
| Policy π(a\|s) | Dorsal striatum | Electrophysiology, lesions |
| Critic update (two-factor) | Ventral striatal plasticity | Synaptic recording |
| Actor update (three-factor) | Dorsal striatal plasticity | Synaptic recording |
| Eligibility traces | Molecular synaptic tags | Yagishita 2014 |
| Global broadcast of δ | Dopamine projection pattern | Anatomy, pharmacology |

The deeper lesson: The brain didn't "implement" RL theory; it evolved these mechanisms over hundreds of millions of years. But the fact that RL theory independently arrived at the same computational principles validates both the theory and our understanding of the biology. Good computational theory can predict neuroscience, and neuroscience can inspire better algorithms.

What comes next: Chapter 14 showed that RL explains animal behavior. This chapter showed it explains the neural mechanisms. Chapter 16 returns to engineering — applying RL to create systems that play games, control hardware, and achieve superhuman performance in domains once thought to require human intuition.

"It is rare in science that a computational theory makes such precise, confirmed predictions about biology.
The dopamine-TD correspondence is one of those rare cases."
— Nathaniel Daw
Check: What is the fundamental claim of the dopamine-TD theory?