How reinforcement learning illuminates a century of discoveries about animal behavior.
A rat presses a lever and gets a food pellet. A dog hears a bell and starts salivating. A child touches a hot stove and never does it again. Long before computers, psychologists studied how animals learn from rewards and punishments. They discovered principles that look remarkably like RL algorithms.
This chapter explores the deep connections between RL and animal learning psychology. The correspondence is not superficial: TD learning was directly inspired by animal conditioning models, and modern RL offers some of the most precise computational accounts of many psychological phenomena.
Prediction → Classical Conditioning
Pavlov's dogs learned to predict food from a bell. This is learning a value function — associating stimuli with future reward. TD learning captures this beautifully.
Control → Instrumental Conditioning
Thorndike's cats learned what to do to escape a puzzle box. This is policy learning — selecting actions that maximize reward. Actor-critic methods capture this.
The connections are so deep that neuroscientists have found the actual biological mechanisms (Chapter 15). But first, let's see how a century of behavioral experiments maps onto the RL framework we've built throughout this book.
Around the turn of the twentieth century, Ivan Pavlov noticed something strange: his dogs started salivating not only when they saw food, but as soon as they heard the footsteps of the assistant who brought it. The dogs had learned to predict food from an earlier cue. This was the birth of classical conditioning, perhaps the most studied phenomenon in all of psychology.
The terminology is precise. The unconditioned stimulus (US) is the thing that naturally triggers a response — food triggers salivation automatically. The unconditioned response (UR) is that natural reaction. The conditioned stimulus (CS) is the neutral cue that gets paired with the US — the bell, the light, the footsteps. After repeated pairings, the CS alone triggers the conditioned response (CR) — the dog salivates at the bell.
Several key phenomena emerge from this framework. Acquisition is the gradual increase in CR strength as CS-US pairings accumulate. Extinction occurs when the CS is presented without the US — the CR gradually weakens. This is exactly what you'd expect: remove the reward, and the value estimate decays toward zero.
For decades after Pavlov, psychologists assumed that conditioning strength grew simply with the number of CS-US pairings. More pairings, more learning. Then in 1972, Rescorla and Wagner proposed a radical idea: learning is driven by surprise. If the US is fully predicted, no learning occurs — even if the CS and US are paired again.
The Rescorla-Wagner model says that the change in associative strength on each trial is proportional to the prediction error: the difference between what actually happened and what was expected.

$$\Delta V(\text{CS}) = \alpha \, \bigl(R - V(\text{CS})\bigr)$$
Here V(CS) is the current associative strength of the CS (its "value"), R is the reward (1 if US occurs, 0 if not), and α is a learning rate. This is the delta rule — the same update rule that appears throughout machine learning. When the US is surprising (R − V is large), learning is large. When the US is fully predicted (R − V ≈ 0), learning stops.
[Interactive figure: associative strength V grows during acquisition (CS+US), then decays during extinction (CS alone). The learning rate α is adjustable.]
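The acquisition and extinction curves can be reproduced with a few lines of code. The following is a minimal sketch of the delta rule for a single CS; the function name `rescorla_wagner`, the trial counts, and the value α = 0.2 are illustrative assumptions, not values from any particular experiment.

```python
def rescorla_wagner(rewards, alpha=0.2, v0=0.0):
    """Return the associative strength V after each trial.

    rewards: one R per trial, 1.0 if the US occurs and 0.0 if it does not.
    alpha and v0 are hypothetical illustration values.
    """
    v = v0
    history = []
    for r in rewards:
        v += alpha * (r - v)  # delta rule: move V toward the outcome
        history.append(v)
    return history


# 30 acquisition trials (CS+US) followed by 30 extinction trials (CS alone).
trials = [1.0] * 30 + [0.0] * 30
curve = rescorla_wagner(trials, alpha=0.2)

print(f"V after acquisition: {curve[29]:.4f}")
print(f"V after extinction:  {curve[59]:.4f}")
```

Because each update closes a fixed fraction α of the remaining gap, both curves are exponential: V approaches 1 during acquisition and approaches 0 during extinction. A larger α makes both phases faster.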
The Rescorla-Wagner model is powerful, but it has a key limitation: it treats each trial as a single moment. In reality, a trial unfolds over time. The CS appears, then there is a gap, then the US arrives. This temporal structure matters: if the bell rings ten seconds before the food, a well-trained dog salivates mostly just before the food arrives, not when the bell first rings.
The TD model of classical conditioning (Sutton & Barto, 1987) divides each trial into small time steps and applies TD learning at every step. This captures something Rescorla-Wagner cannot: the value prediction shifts backward in time from the US to the CS across trials. Early in training, the prediction error occurs at the US. With learning, it migrates to the CS onset.
[Interactive figure: value predictions (blue) and TD errors (orange) across conditioning trials. The prediction error shifts from the US time to the CS time as training proceeds.]
Notice what happens: early on, the TD error (orange spike) is largest at the US time. After many trials, the spike moves to the CS onset. The value function develops a ramp from CS to US, reflecting the growing anticipation of reward. This is exactly what happens in the brain — dopamine neurons show this same shift (Chapter 15).
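The backward shift of the prediction error can be sketched in code. The example below assumes a "complete serial compound" stimulus representation (one learnable weight per time step while the CS is active) and assumes the pre-CS period has a fixed value of zero, since the CS arrives unpredictably. The function name `td_conditioning` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np


def td_conditioning(n_trials=200, n_steps=20, cs_on=5, us_at=15,
                    alpha=0.3, gamma=0.98):
    """TD(0) over the time steps of a trial; returns weights and TD errors."""
    w = np.zeros(n_steps + 1)                  # one value weight per time step
    active = np.zeros(n_steps + 1, dtype=bool)
    active[cs_on:us_at] = True                 # CS trace spans CS onset to US
    deltas = np.zeros((n_trials, n_steps))     # deltas[k, t]: error on step t -> t+1

    for trial in range(n_trials):
        for t in range(n_steps):
            reward = 1.0 if t + 1 == us_at else 0.0   # US arrives on entering us_at
            v_now = w[t] if active[t] else 0.0
            v_next = w[t + 1] if active[t + 1] else 0.0
            delta = reward + gamma * v_next - v_now   # TD error
            deltas[trial, t] = delta
            if active[t]:
                w[t] += alpha * delta                 # only CS-driven states learn
    return w, deltas


w, deltas = td_conditioning()
print("largest TD error, first trial: step", int(np.argmax(deltas[0])))
print("largest TD error, last trial:  step", int(np.argmax(deltas[-1])))
```

On the first trial the only nonzero error is on the step into the US. After training, that error has vanished and the remaining error sits on the step into CS onset, while the learned weights form a rising ramp from CS to US.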
Here is one of the most important experiments in the history of psychology. In Phase 1, a rat learns that stimulus A predicts a shock. In Phase 2, a compound stimulus AB (A and B together) is followed by the same shock. Question: does the rat learn anything about B?
If conditioning were simply about pairings, B should gain associative strength — after all, B was paired with the shock. But Kamin (1969) showed that blocking occurs: the rat learns almost nothing about B. Why? Because A already fully predicts the shock. There is no prediction error left for B to absorb.
[Interactive figure: in Phase 1, A alone is paired with reward; in Phase 2, the AB compound is paired with reward. B's value is blocked because A already predicts the outcome.]
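Blocking is not just an academic curiosity. It illustrates a general principle of efficient learning: do not spend resources encoding redundant information. Error-driven learners get this for free, because a cue that adds nothing to an already accurate prediction generates no error and so earns no credit. Loosely related ideas appear in machine learning as regularization and feature selection. In RL, it is part of why TD learning is efficient: updates shrink to zero wherever predictions are already correct.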
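The blocking design can be simulated with the compound form of the Rescorla-Wagner rule, in which the prediction on a trial is the sum of the strengths of all stimuli present and every present stimulus shares the same error. This is a minimal sketch; the function name `rw_compound`, the trial counts, and α = 0.2 are assumptions for illustration.

```python
def rw_compound(trials, alpha=0.2):
    """Rescorla-Wagner with summed predictions.

    trials: list of (set_of_present_stimuli, reward) pairs.
    Returns a dict mapping each stimulus to its associative strength.
    """
    v = {}
    for present, r in trials:
        prediction = sum(v.get(s, 0.0) for s in present)  # compound prediction
        error = r - prediction                            # shared prediction error
        for s in present:
            v[s] = v.get(s, 0.0) + alpha * error
    return v


phase1 = [({"A"}, 1.0)] * 40        # A alone predicts the outcome
phase2 = [({"A", "B"}, 1.0)] * 40   # AB compound, same outcome

blocked = rw_compound(phase1 + phase2)
control = rw_compound(phase2)       # control group: no pretraining on A

print(f"blocking group: V(A)={blocked['A']:.3f}  V(B)={blocked['B']:.3f}")
print(f"control group:  V(A)={control['A']:.3f}  V(B)={control['B']:.3f}")
```

In the blocking group, A enters Phase 2 already predicting the outcome almost perfectly, so the shared error is close to zero and B gains almost nothing. In the control group, A and B split the available associative strength.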
Classical conditioning is about prediction — the animal learns what leads to what. But animals also learn what to do. In 1898, Edward Thorndike placed cats in puzzle boxes. To escape and reach food, a cat had to perform a specific action (pull a string, press a lever). At first, the cat flailed randomly. Over many trials, escape time dropped — the successful action was "stamped in."
Thorndike formulated this as the Law of Effect: actions followed by satisfaction are strengthened; actions followed by discomfort are weakened. This is the foundational principle of RL. Every policy gradient algorithm, every Q-learning update, is a mathematical implementation of the Law of Effect.
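A toy version of the puzzle box shows the Law of Effect at work. The sketch below is not Thorndike's procedure or any standard algorithm: it assumes eight possible responses, one of which opens the box, with softmax response selection and hand-picked strengthening and weakening steps (+1.0 and -0.1). All names and numbers are hypothetical.

```python
import math
import random


def puzzle_box(n_trials=50, n_actions=8, correct=3, seed=0):
    """Return final response strengths and the attempts needed on each trial."""
    rng = random.Random(seed)
    prefs = [0.0] * n_actions            # response strengths, initially equal
    attempts_per_trial = []
    for _ in range(n_trials):
        attempts = 0
        while True:
            top = max(prefs)
            weights = [math.exp(p - top) for p in prefs]  # stable softmax weights
            action = rng.choices(range(n_actions), weights=weights)[0]
            attempts += 1
            if action == correct:
                prefs[action] += 1.0     # satisfaction strengthens the response
                break
            prefs[action] -= 0.1         # discomfort weakens the response
        attempts_per_trial.append(attempts)
    return prefs, attempts_per_trial


prefs, attempts = puzzle_box()
print("attempts on first 5 trials:", attempts[:5])
print("attempts on last 5 trials: ", attempts[-5:])
```

Early trials involve trial and error across many responses. As the successful response is stamped in, the number of attempts per trial falls toward one, mirroring the drop in escape time that Thorndike observed.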
Classical Conditioning
• Learning predictions
• Stimulus → Stimulus associations
• Passive: animal doesn't choose
• RL analogue: value function learning
• TD learning, critic
Instrumental Conditioning
• Learning actions
• Action → Outcome associations
• Active: animal's behavior matters
• RL analogue: policy learning
• Policy gradient, actor
The parallel to actor-critic methods is striking. The critic (value function) learns predictions like classical conditioning. The actor (policy) learns actions like instrumental conditioning. Both use prediction errors to drive learning. The animal brain appears to implement something very close to actor-critic RL.
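The division of labor between critic and actor can be sketched on the simplest possible task: a single situation with three actions and fixed rewards. This is a deliberately stripped-down illustration, assuming one-step episodes so that the TD error reduces to reward minus predicted value; the function names, reward values, and step sizes are hypothetical.

```python
import math
import random


def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_bandit(rewards=(0.2, 1.0, 0.5), n_steps=3000,
                        alpha_actor=0.1, alpha_critic=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)   # actor: action preferences (the policy)
    value = 0.0                    # critic: predicted reward (the prediction)
    for _ in range(n_steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        delta = rewards[action] - value        # prediction error
        value += alpha_critic * delta          # critic learns to predict
        for a in range(len(rewards)):          # actor learns what to do
            indicator = 1.0 if a == action else 0.0
            prefs[a] += alpha_actor * delta * (indicator - probs[a])
    return softmax(prefs), value


probs, value = actor_critic_bandit()
print("action probabilities:", [round(p, 3) for p in probs])
print("critic's prediction: ", round(value, 3))
```

The same error signal `delta` drives both updates. The critic uses it to improve its prediction, as in classical conditioning, and the actor uses it to make better-than-expected actions more likely, as in instrumental conditioning.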
Have you ever driven home on autopilot, arriving without remembering the trip? That's habitual behavior. Now imagine the usual road is closed — you'd have to think, plan, and find an alternate route. That's goal-directed behavior. Psychologists have shown that animals exhibit both.
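The distinction maps closely onto RL: habitual behavior is model-free (cached stimulus-response associations), while goal-directed behavior is model-based (planning using an internal model). The experimental test is outcome devaluation. An animal is first trained to perform an action, such as pressing a lever, for a particular food. The food is then devalued away from the lever, for example by feeding the animal to satiety or by pairing the food with illness. If the animal immediately presses less when returned to the lever, with no chance to relearn, its behavior is goal-directed: it consulted a model of what the action leads to. If it keeps pressing at the old rate, as animals tend to do after extensive training, the behavior has become a habit.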
This explains a deep puzzle in RL: why have both model-free and model-based methods? Model-free is fast and cheap (no planning needed) but inflexible. Model-based is flexible (adapts instantly to changed goals) but expensive. Animals — and likely good AI systems — use both, shifting between them depending on experience and cognitive load.
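The contrast between the two systems under outcome devaluation can be sketched in a few lines. The example assumes a single action with a deterministic outcome, and the names (`train`, `press_lever`, `utility`) are hypothetical; it illustrates the logic of the test rather than modeling any specific experiment.

```python
def train(n_trials=100, alpha=0.1):
    """Train both systems on 'press lever -> food'."""
    utility = {"food": 1.0}        # how much the animal currently wants each outcome
    q_lever = 0.0                  # model-free: cached value of pressing
    model = {}                     # model-based: learned action -> outcome map
    for _ in range(n_trials):
        outcome = "food"
        q_lever += alpha * (utility[outcome] - q_lever)  # cache the experienced value
        model["press_lever"] = outcome                   # remember what pressing yields
    return q_lever, model, utility


q_lever, model, utility = train()

# Devaluation happens away from the lever, so the cached value is never updated.
utility["food"] = 0.0

habitual_value = q_lever                             # stale cache, still high
goal_directed_value = utility[model["press_lever"]]  # recomputed from the model

print(f"model-free value of pressing:  {habitual_value:.3f}")
print(f"model-based value of pressing: {goal_directed_value:.3f}")
```

The model-free value can only change through new experience of pressing and receiving the devalued food. The model-based value changes the moment the outcome's worth changes, at the cost of a lookup through the model each time a decision is made.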
In the 1930s, Edward Tolman challenged the dominant view that animals are mere stimulus-response machines. His experiments showed that rats form cognitive maps — internal representations of the environment's layout that allow flexible, model-based reasoning.
The key experiment: rats explored a maze without any reward (no food at the goal). They just wandered. When food was suddenly introduced, these rats reached the goal almost immediately — far faster than rats who had never explored. The exploring rats had built an internal model of the maze during their unrewarded exploration, and could use it instantly when motivation appeared.
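Latent learning can be sketched as two separate phases: building a model without reward, then planning once a goal appears. The maze layout, the function names, and the use of a systematic depth-first sweep as a stand-in for the rat's wandering are all assumptions made for illustration.

```python
from collections import deque

# '#' is a wall; S and G mark the start and the (initially unrewarded) goal.
MAZE = [
    "S.#.",
    ".##.",
    "...G",
]


def neighbors(cell):
    r, c = cell
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] != "#":
            yield (nr, nc)


def explore(start):
    """Unrewarded exploration: record which cells connect to which."""
    model, stack = {}, [start]
    while stack:
        cell = stack.pop()
        if cell not in model:
            model[cell] = list(neighbors(cell))
            stack.extend(model[cell])
    return model


def plan(model, start, goal):
    """Breadth-first search that consults only the learned model."""
    parent, queue = {start: None}, deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for nxt in model.get(cell, []):
            if nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


model = explore((0, 0))            # phase 1: no reward anywhere
path = plan(model, (0, 0), (2, 3)) # phase 2: food appears at G
print(path)
```

No reward is needed to build `model`, and no further trial and error is needed once the goal is known. A purely model-free learner would have nothing to show for the exploration phase, since no prediction errors occurred.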
Another striking experiment: rats trained to navigate a familiar maze to a goal were suddenly placed on a different path. Instead of backtracking along the old route, they took novel shortcuts, paths they had never been rewarded for. This is very hard to explain under pure stimulus-response (model-free) learning, but natural under model-based planning.
Tolman's work was controversial at the time, but it now provides some of the strongest behavioral evidence for model-based RL in animals. The hippocampus, whose place cells encode an animal's location, is widely thought to be a key biological substrate for these cognitive maps.
B.F. Skinner discovered that you can train animals to perform remarkably complex behaviors — pigeons playing ping-pong, rats running obstacle courses — by shaping: rewarding successive approximations to the desired behavior. You don't wait for the final behavior to appear; you reward anything close, then gradually raise the bar.
Want a pigeon to turn in a circle? First reward any head movement to the left. Then only reward a quarter turn. Then a half turn. Then a full circle. Each intermediate behavior is reinforced, creating a stepping-stone to the next. This is shaping, and it solves one of RL's hardest problems: sparse rewards.
Without Shaping
Reward only the final behavior. The animal (or agent) may never stumble upon it by chance. Learning is impossibly slow or never occurs.
With Shaping
Reward successive approximations. Each step is achievable. Complex behaviors emerge from a chain of simple ones. Learning is reliable and fast.
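The pigeon example can be turned into a small simulation of shaping by successive approximation. The sketch assumes the animal's turn angle on each trial is drawn from a normal distribution around its current typical turn, and that a rewarded turn pulls the typical turn halfway toward it. The function name `train_turn` and every number in it are hypothetical.

```python
import random


def train_turn(shaping, n_trials=500, target=360.0, spread=30.0, seed=0):
    """Return the animal's typical turn angle and the number of rewards earned."""
    rng = random.Random(seed)
    mean_turn = 0.0      # typical turn angle in degrees, initially none
    rewards = 0
    for _ in range(n_trials):
        # A full circle is the most that counts, so cap the turn at the target.
        turn = min(target, rng.gauss(mean_turn, spread))
        if shaping:
            criterion = min(target, mean_turn + 20.0)  # bar set just above current habit
        else:
            criterion = target                         # only the full circle is rewarded
        if turn >= criterion:
            rewards += 1
            mean_turn += 0.5 * (turn - mean_turn)      # reinforced turn is stamped in
    return mean_turn, rewards


shaped_mean, shaped_rewards = train_turn(shaping=True)
plain_mean, plain_rewards = train_turn(shaping=False)

print(f"with shaping:    typical turn {shaped_mean:6.1f} deg, {shaped_rewards} rewards")
print(f"without shaping: typical turn {plain_mean:6.1f} deg, {plain_rewards} rewards")
```

Without shaping, a full circle essentially never occurs by chance, so no reward is delivered and nothing is learned. With shaping, the criterion always sits within reach of current behavior and rises as behavior improves, which is the stepping-stone structure described above.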
Skinner also introduced the schedule of reinforcement: the pattern of when rewards are given. Continuous reinforcement (reward every time) produces fast learning but quick extinction. Intermittent reinforcement (reward sometimes) is slower to acquire but remarkably resistant to extinction. Gamblers know this well — the occasional win keeps them playing long after rational analysis says to stop.
This chapter revealed that RL is not just inspired by psychology — it provides the most precise computational account of animal learning phenomena. The correspondence runs deep: from the delta rule of Rescorla-Wagner, to the temporal dynamics of TD learning, to the model-free vs. model-based distinction in habitual vs. goal-directed behavior.
| Psychology | RL Concept | Algorithm |
|---|---|---|
| Classical conditioning | Value prediction | TD learning |
| Rescorla-Wagner model | Delta rule / prediction error | MC or single-step update |
| TD model of conditioning | Temporal prediction error | TD(λ) |
| Blocking | Zero error → zero learning | Any error-driven method |
| Instrumental conditioning | Policy improvement | Actor / policy gradient |
| Habitual behavior | Model-free RL | Q-learning, Sarsa |
| Goal-directed behavior | Model-based RL | Dyna, planning |
| Cognitive maps | Learned environment models | Model learning |
| Shaping | Reward shaping / curriculum | Intermediate rewards |
What comes next: Chapter 13 gave us the computational tools. This chapter showed that animals implement something strikingly similar. Chapter 15 will go one level deeper — into the actual neural circuits that implement RL in the brain. The prediction errors we've been computing turn out to be literally encoded in dopamine neuron firing.