Ch 14: Psychology — Sutton & Barto RL

Chapter 0: RL Meets Psychology

A rat presses a lever and gets a food pellet. A dog hears a bell and starts salivating. A child touches a hot stove and never does it again. Long before computers, psychologists studied how animals learn from rewards and punishments. They discovered principles that look remarkably like RL algorithms.

This chapter explores the connections between RL and the psychology of animal learning. The correspondence is not superficial: TD learning was directly inspired by models of animal conditioning, and RL in turn offers precise computational accounts of many psychological phenomena.

The big picture: Animal learning research splits into two streams: prediction (learning what leads to what) and control (learning what to do). RL has precise counterparts: prediction is value estimation, and control is policy improvement.

Prediction → Classical Conditioning

Pavlov's dogs learned to predict food from a bell. This is learning a value function — associating stimuli with future reward. TD learning captures this beautifully.

Control → Instrumental Conditioning

Thorndike's cats learned what to do to escape a puzzle box. This is policy learning — selecting actions that maximize reward. Actor-critic methods capture this.

The connections run deep enough that neuroscientists have identified candidate biological mechanisms (Chapter 15). But first, let's see how a century of behavioral experiments maps onto the RL framework we've built throughout this book.

Check: What are the two main streams of animal learning research?

• Supervised learning and unsupervised learning
• Prediction (classical conditioning) and control (instrumental conditioning)
• Exploration and exploitation

Chapter 1: Classical Conditioning

Around the turn of the twentieth century, while studying digestion, Ivan Pavlov noticed something strange: his dogs started salivating not when they saw food, but already when they heard the footsteps of the assistant who brought it. The dogs had learned to predict food from an earlier cue. This was the birth of classical conditioning, one of the most studied phenomena in all of psychology.

The terminology is precise. The unconditioned stimulus (US) is the thing that naturally triggers a response — food triggers salivation automatically. The unconditioned response (UR) is that natural reaction. The conditioned stimulus (CS) is the neutral cue that gets paired with the US — the bell, the light, the footsteps. After repeated pairings, the CS alone triggers the conditioned response (CR) — the dog salivates at the bell.

Before Conditioning

Bell (CS) → No response | Food (US) → Salivation (UR)

↓ repeated pairings

During Conditioning

Bell (CS) + Food (US) → Salivation (UR)

↓ learning occurs

After Conditioning

Bell (CS) alone → Salivation (CR)

The RL interpretation: The US is like a reward. The CS is a state that precedes the reward. Classical conditioning is learning a value function — the animal learns that the CS predicts future reward (the US). The CR is a behavioral manifestation of a high value estimate.

Several key phenomena emerge from this framework. Acquisition is the gradual increase in CR strength as CS-US pairings accumulate. Extinction occurs when the CS is presented without the US — the CR gradually weakens. This is exactly what you'd expect: remove the reward, and the value estimate decays toward zero.

Check: In Pavlov's experiment, what is the conditioned stimulus (CS)?

• The bell: an initially neutral cue that becomes associated with food
• The food: the natural reward
• The salivation: the response

Chapter 2: The Rescorla-Wagner Model

For decades after Pavlov, psychologists assumed that conditioning strength grew simply with the number of CS-US pairings. More pairings, more learning. Then in 1972, Rescorla and Wagner proposed a radical idea: learning is driven by surprise. If the US is fully predicted, no learning occurs — even if the CS and US are paired again.

The Rescorla-Wagner model says that the change in associative strength on each trial is proportional to the prediction error — the difference between what actually happened and what was expected.

ΔV(CS) = α · (R − V(CS))

Here V(CS) is the current associative strength of the CS (its "value"), R is the reward (1 if US occurs, 0 if not), and α is a learning rate. This is the delta rule — the same update rule that appears throughout machine learning. When the US is surprising (R − V is large), learning is large. When the US is fully predicted (R − V ≈ 0), learning stops.
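The update above can be traced by hand or in a few lines of code. The following is a minimal sketch, assuming a single CS, a reward of 1 on acquisition trials and 0 on extinction trials, and a hypothetical helper name `rescorla_wagner`; the trial counts are illustrative, not taken from any experiment.

```python
def rescorla_wagner(rewards, alpha=0.1, v0=0.0):
    """Return associative strength V before the first trial and after each trial."""
    v = v0
    history = [v]
    for r in rewards:
        v += alpha * (r - v)   # delta rule: move V toward the observed outcome
        history.append(v)
    return history

# 50 acquisition trials (CS followed by US), then 50 extinction trials (CS alone).
schedule = [1.0] * 50 + [0.0] * 50
trace = rescorla_wagner(schedule, alpha=0.1)

print(f"after acquisition: V = {trace[50]:.3f}")
print(f"after extinction:  V = {trace[100]:.3f}")
```

Because each step closes a fixed fraction α of the remaining gap, V rises quickly at first and then flattens as the US becomes expected, and it decays the same way once the US is withheld.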

Rescorla-Wagner Learning

Watch associative strength V grow during acquisition (CS+US), then decay during extinction (CS alone). Adjust learning rate α.

Learning rate α = 0.10

Why this matters: The Rescorla-Wagner model explained a zoo of conditioning phenomena with a single equation. It predicted that learning should stop when the outcome is fully expected — a prediction that was confirmed experimentally and that earlier theories couldn't explain.

Check: According to Rescorla-Wagner, when does learning stop?

• After a fixed number of trials
• When the outcome is fully predicted (prediction error is zero)
• When the animal is satiated

Chapter 3: The TD Model of Conditioning

The Rescorla-Wagner model is powerful, but it has a major limitation: it treats each trial as a single moment. In reality, a trial unfolds over time. The CS appears, then there's a gap, then the US arrives. The temporal structure matters enormously: if the bell rings ten seconds before the food, a well-trained dog tends to salivate most as the food's arrival approaches, not the instant the bell rings.

The TD model of classical conditioning (Sutton & Barto, 1987) divides each trial into small time steps and applies TD learning at every step. This captures something Rescorla-Wagner cannot: the value prediction shifts backward in time from the US to the CS across trials. Early in training, the prediction error occurs at the US. With learning, it migrates to the CS onset.
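For reference, the per-step update can be written in the same notation as the Rescorla-Wagner equation. This is the simplified one-step, tabular form, assuming S_t is the state at time step t, R_{t+1} the reward on the next step, and γ a discount factor; the published TD model additionally uses stimulus features and eligibility traces, which are omitted here.

```latex
\delta_t = R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t),
\qquad
V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t
```

Treating the whole trial as a single step with nothing following it (so the γ V(S_{t+1}) term is zero) recovers ΔV = α · (R − V), which is why the TD model is often described as a real-time extension of Rescorla-Wagner.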

The key insight: On the first trial, only the US is surprising. But once the CS predicts the US, the surprise shifts to the CS onset, because the CS is the earliest reliable predictor of reward and nothing predicts the arrival of the CS itself. This backward migration of the prediction error is a core prediction of the TD model, and it matches dopamine recordings in the brain (Chapter 15).

Interactive: TD Model of Conditioning

Run conditioning trials. Watch the value predictions (blue) and TD errors (orange) evolve. The prediction error shifts from the US time to the CS time across trials.

Learning rate α = 0.10

Discount γ = 0.95


Notice what happens: early on, the TD error (orange spike) is largest at the US time. After many trials, the spike moves to the CS onset. The value function develops a ramp from CS to US, reflecting the growing anticipation of reward. Dopamine neurons show a strikingly similar shift in their firing (Chapter 15).
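The same migration can be reproduced offline. The sketch below is a simplified tabular version, not the published TD model: it assumes one value per time step within the trial, holds the pre-CS values at zero (the CS arrives unpredictably), and uses illustrative step counts. The function name `run_td_conditioning` and all parameters are hypothetical.

```python
def run_td_conditioning(n_trials=200, n_steps=12, cs_step=3, us_step=9,
                        alpha=0.1, gamma=0.95):
    """One-step TD over the time steps of a trial; returns values and TD errors."""
    v = [0.0] * (n_steps + 1)          # v[n_steps] is the end of the trial, always 0
    errors_by_trial = []
    for _ in range(n_trials):
        deltas = []
        for t in range(n_steps):
            # The US (reward 1) is received on the transition into us_step.
            reward = 1.0 if t + 1 == us_step else 0.0
            delta = reward + gamma * v[t + 1] - v[t]
            if t >= cs_step:           # only steps with the CS present can learn
                v[t] += alpha * delta
            deltas.append(delta)       # deltas[t] is the error on leaving step t
        errors_by_trial.append(deltas)
    return v, errors_by_trial

values, errors = run_td_conditioning()
first, last = errors[0], errors[-1]
peak_first = max(range(len(first)), key=first.__getitem__)
peak_last = max(range(len(last)), key=last.__getitem__)
print("largest TD error, first trial: leaving step", peak_first)  # arrival of the US
print("largest TD error, last trial:  leaving step", peak_last)   # arrival of the CS
```

In this setup the first-trial error sits on the transition into the US, and after training it sits on the transition into the CS, while the learned values rise from CS onset toward the US.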

Check: What does the TD model capture that Rescorla-Wagner cannot?

• The temporal structure within a trial: the prediction error shifts backward in time from US to CS
• That learning requires surprise
• That stronger rewards produce faster learning

Chapter 4: Blocking

Here is one of the most influential experiments in the study of conditioning. In Phase 1, a rat learns that stimulus A predicts a shock. In Phase 2, a compound stimulus AB (A and B together) is followed by the same shock. Question: does the rat learn anything about B?

If conditioning were simply about pairings, B should gain associative strength — after all, B was paired with the shock. But Kamin (1969) showed that blocking occurs: the rat learns almost nothing about B. Why? Because A already fully predicts the shock. There is no prediction error left for B to absorb.

Phase 1

A → US (rat learns: V(A) ≈ 1)

↓

Phase 2

AB → US (but V(A) already predicts US!)

↓

Test

B alone → No CR (B was "blocked")

Rescorla-Wagner explains it perfectly: In Phase 2, the total prediction is V(A) + V(B). Since V(A) ≈ 1 and the actual reward is 1, the prediction error is R − (V(A) + V(B)) ≈ 0. With no prediction error, ΔV(B) ≈ 0. The redundant predictor simply doesn't learn. This was one of the Rescorla-Wagner model's greatest triumphs.
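The arithmetic in this explanation can be checked with a short simulation. The following is a minimal sketch, assuming the compound prediction is the sum of the strengths of the stimuli present and that every stimulus present shares the same error; the helper `train_compound`, the learning rate, and the trial counts are illustrative choices. A control group that skips Phase 1 is included for comparison.

```python
def train_compound(v, present, reward, alpha=0.2, n_trials=50):
    """Rescorla-Wagner with a summed prediction over the stimuli that are present."""
    for _ in range(n_trials):
        prediction = sum(v[s] for s in present)
        error = reward - prediction
        for s in present:              # every stimulus present shares the same error
            v[s] += alpha * error
    return v

# Blocking group: A alone first, then the AB compound.
blocked = {"A": 0.0, "B": 0.0}
train_compound(blocked, ["A"], reward=1.0)        # Phase 1
train_compound(blocked, ["A", "B"], reward=1.0)   # Phase 2

# Control group: AB compound only, with no Phase 1.
control = {"A": 0.0, "B": 0.0}
train_compound(control, ["A", "B"], reward=1.0)

print(f"blocked V(B) = {blocked['B']:.4f}")
print(f"control V(B) = {control['B']:.4f}")
```

With pretraining, V(B) stays near zero; without it, A and B split the available associative strength roughly equally.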

Blocking Experiment

Phase 1: A alone is paired with reward. Phase 2: AB compound is paired with reward. Watch how B's value is blocked because A already predicts the outcome.

Blocking is not just an academic curiosity. It illustrates a general principle of error-driven learning: a learner need not spend resources encoding redundant information. Loosely analogous ideas appear in machine learning as regularization and feature selection. TD learning shares the property: when predictions are already accurate, the TD error is near zero and the updates are correspondingly small.

Check: Why does stimulus B fail to gain associative strength in the blocking paradigm?

• B is too weak a stimulus to be perceived
• The rat forgets about B
• A already predicts the US, so the prediction error is near zero and nothing is left for B to learn from

Chapter 5: Instrumental Conditioning

Classical conditioning is about prediction — the animal learns what leads to what. But animals also learn what to do. In 1898, Edward Thorndike placed cats in puzzle boxes. To escape and reach food, a cat had to perform a specific action (pull a string, press a lever). At first, the cat flailed randomly. Over many trials, escape time dropped — the successful action was "stamped in."

Thorndike formulated this as the Law of Effect: actions followed by satisfaction are strengthened; actions followed by discomfort are weakened. This is a foundational principle of RL. Policy gradient updates and Q-learning updates can both be read as mathematical formalizations of the Law of Effect.

The Law of Effect (1911): "Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation." This is policy improvement in a single sentence.

Classical Conditioning

• Learning predictions
• Stimulus → Stimulus associations
• Passive: animal doesn't choose
• RL analogue: value function learning
• TD learning, critic

Instrumental Conditioning

• Learning actions
• Action → Outcome associations
• Active: animal's behavior matters
• RL analogue: policy learning
• Policy gradient, actor

The parallel to actor-critic methods is striking. The critic (value function) learns predictions like classical conditioning. The actor (policy) learns actions like instrumental conditioning. Both use prediction errors to drive learning. The animal brain appears to implement something very close to actor-critic RL.
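To make the division of labor concrete, here is a minimal actor-critic sketch for a one-step "puzzle box" with four possible actions, only one of which opens the door. Everything here is an illustrative assumption: the function name `puzzle_box`, the softmax actor, the step sizes, and the fixed random seed. It is not a model of Thorndike's data.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into choice probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def puzzle_box(n_trials=500, n_actions=4, correct=2,
               alpha_critic=0.1, alpha_actor=0.5, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * n_actions   # actor: one preference per action
    value = 0.0                 # critic: predicted reward for the box situation
    for _ in range(n_trials):
        probs = softmax(prefs)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = 1.0 if action == correct else 0.0
        delta = reward - value                  # prediction error for this trial
        value += alpha_critic * delta           # critic learns the prediction
        for a in range(n_actions):              # actor: strengthen or weaken actions
            chosen = 1.0 if a == action else 0.0
            prefs[a] += alpha_actor * delta * (chosen - probs[a])
    return softmax(prefs), value

probs, value = puzzle_box()
print("choice probabilities:", [round(p, 3) for p in probs])
print("critic's prediction: ", round(value, 3))
```

The same error signal does two jobs: it corrects the critic's prediction, and it raises the preference for an action that turned out better than predicted (the Law of Effect) while lowering it for one that turned out worse.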

Check: What does Thorndike's Law of Effect correspond to in RL?

• Policy improvement: actions leading to reward are reinforced
• Value iteration: computing exact state values
• Model learning: building a world model

Chapter 6: Habitual vs Goal-Directed

Have you ever driven home on autopilot, arriving without remembering the trip? That's habitual behavior. Now imagine the usual road is closed — you'd have to think, plan, and find an alternate route. That's goal-directed behavior. Psychologists have shown that animals exhibit both.

The distinction maps naturally onto RL: habitual behavior is model-free (cached stimulus-response associations), while goal-directed behavior is model-based (planning using an internal model). The experimental test is outcome devaluation.

Training

Rat learns: press lever → food pellet

↓

Devaluation

Food is paired with nausea (now worthless)

↓

Test: Does the rat still press?

If yes → habitual (model-free) | If no → goal-directed (model-based)

The critical experiment: With moderate training, rats stop pressing — they use a model to reason "pressing gives food, food makes me sick, don't press." With extensive overtraining, rats keep pressing — the behavior has become a cached habit, independent of the outcome's current value. The transition from goal-directed to habitual is the transition from model-based to model-free RL.

This explains a deep puzzle in RL: why have both model-free and model-based methods? Model-free is fast and cheap (no planning needed) but inflexible. Model-based is flexible (adapts instantly to changed goals) but expensive. Animals — and likely good AI systems — use both, shifting between them depending on experience and cognitive load.
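The logic of the devaluation test can be sketched with two toy decision rules that learn from the same training. This is an illustrative assumption, not a model of any specific experiment: the task has one lever, `cached_q` stands in for the habit system, and `model` plus `outcome_value` stand in for the goal-directed system. All names and numbers are hypothetical.

```python
# Hypothetical one-lever task: names and numbers are illustrative only.
ENVIRONMENT = {"press": "pellet", "wait": "nothing"}   # action -> outcome
outcome_value = {"pellet": 1.0, "nothing": 0.0}        # current worth of each outcome

cached_q = {"press": 0.0, "wait": 0.0}   # model-free: one cached number per action
model = {}                               # model-based: remembered action -> outcome

# Training: both systems learn from the same experience.
for _ in range(100):
    for action, outcome in ENVIRONMENT.items():
        reward = outcome_value[outcome]
        cached_q[action] += 0.1 * (reward - cached_q[action])
        model[action] = outcome

def habitual_choice():
    """Model-free: read the cached value, with no lookup of the outcome."""
    return max(cached_q, key=cached_q.get)

def goal_directed_choice():
    """Model-based: predict the outcome, then evaluate it at its current worth."""
    return max(model, key=lambda a: outcome_value[model[a]])

before = (habitual_choice(), goal_directed_choice())

# Devaluation happens away from the lever, so cached_q is never updated.
outcome_value["pellet"] = -1.0

after = (habitual_choice(), goal_directed_choice())
print("before devaluation:", before)
print("after devaluation: ", after)
```

The cached value can only change through further experience of pressing and receiving the outcome, which is why the habitual rule keeps pressing, whereas the model-based rule re-evaluates the outcome at decision time.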

Check: How does the outcome devaluation experiment distinguish habitual from goal-directed behavior?

• It measures reaction time
• If the animal still performs the action after the reward is devalued, the behavior is habitual (model-free); if it stops, it's goal-directed (model-based)
• It tests whether the animal can learn new tasks

Chapter 7: Cognitive Maps

In the 1930s, Edward Tolman challenged the dominant view that animals are mere stimulus-response machines. His experiments showed that rats form cognitive maps — internal representations of the environment's layout that allow flexible, model-based reasoning.

The key experiment: rats explored a maze without any reward (no food at the goal). They just wandered. When food was suddenly introduced, their performance improved abruptly: within a trial or two they were running to the goal about as well as rats that had been rewarded from the start. The exploring rats had built an internal model of the maze during their unrewarded exploration, and could use it as soon as motivation appeared.

Latent learning: The rats' wandering wasn't wasted. They were performing the RL equivalent of model learning without a reward signal. When reward finally appeared, they could immediately plan through their model. This is what Dyna-like architectures (Chapter 8 of Sutton & Barto) do: learn a model from experience, then plan through it.
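A toy version of latent learning separates the two phases explicitly. The sketch below assumes a small hypothetical maze given as a graph, records transitions during an unrewarded random walk, and plans with breadth-first search once a goal is named. Breadth-first search is a stand-in for planning in general and is not the specific procedure Dyna uses.

```python
import random
from collections import deque

MAZE = {  # hypothetical maze: location -> neighbouring locations
    "start": ["a", "b"],
    "a": ["start", "c"],
    "b": ["start", "dead_end"],
    "c": ["a", "goal"],
    "dead_end": ["b"],
    "goal": ["c"],
}

def explore(n_steps=1000, seed=0):
    """Wander without reward, recording which moves lead where."""
    rng = random.Random(seed)
    model = {}
    here = "start"
    for _ in range(n_steps):
        there = rng.choice(MAZE[here])
        model.setdefault(here, set()).add(there)
        here = there
    return model

def plan(model, start, goal):
    """Breadth-first search through the learned model, not the real maze."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in model.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

learned = explore()                       # no reward anywhere during this phase
route = plan(learned, "start", "goal")    # food appears: plan immediately
print(route)
```

No value was attached to any location during exploration; the route to the goal is computed the first time it is needed, which is the behavioral signature Tolman reported.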

Another striking experiment: rats trained to navigate a familiar maze to a goal were suddenly placed on a different path. Instead of backtracking along the old route, they took novel shortcuts — paths they had never been rewarded for. This is impossible under pure stimulus-response (model-free) learning, but natural under model-based planning.

Tolman's work was controversial at the time, but it now provides some of the strongest behavioral evidence for model-based RL in animals. The hippocampus, whose place cells encode an animal's location, is widely regarded as a key neural substrate for these cognitive maps.

Check: What does Tolman's latent learning experiment demonstrate?

• Animals can build internal models of the environment even without reward, and use them when reward appears
• Animals can only learn from reward signals
• Stimulus-response associations are all that animals need

Chapter 8: Shaping

B.F. Skinner discovered that you can train animals to perform remarkably complex behaviors — pigeons playing ping-pong, rats running obstacle courses — by shaping: rewarding successive approximations to the desired behavior. You don't wait for the final behavior to appear; you reward anything close, then gradually raise the bar.

Want a pigeon to turn in a circle? First reward any head movement to the left. Then only reward a quarter turn. Then a half turn. Then a full circle. Each intermediate behavior is reinforced, creating a stepping-stone to the next. This is shaping, and it solves one of RL's hardest problems: sparse rewards.

The RL connection: Shaping closely parallels reward shaping in RL, which provides intermediate rewards that guide the agent toward a goal it would be unlikely to discover through random exploration alone. It's also related to curriculum learning: start with easy tasks, gradually increase difficulty.

Without Shaping

Reward only the final behavior. The animal (or agent) may never stumble upon it by chance. Learning is impossibly slow or never occurs.

With Shaping

Reward successive approximations. Each step is achievable. Complex behaviors emerge from a chain of simple ones. Learning is reliable and fast.
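One common way to add intermediate rewards in RL is potential-based reward shaping, in which a bonus of γ·Φ(s') − Φ(s) is added to the environment's reward. This is a specific technique from the RL literature, swapped in here for illustration; it is not Skinner's procedure. The sketch assumes a hypothetical chain of states 0 to 10 with reward only at the end, and a made-up progress measure `potential`.

```python
GOAL = 10  # hypothetical chain of states 0..10 with reward only at the end

def env_reward(next_state):
    """Sparse reward: nothing until the final behavior appears."""
    return 1.0 if next_state == GOAL else 0.0

def potential(state):
    """Hypothetical progress measure: fraction of the way to the goal."""
    return state / GOAL

def shaped_reward(state, next_state, gamma=1.0):
    """Environment reward plus the potential-based bonus gamma*phi(s') - phi(s)."""
    bonus = gamma * potential(next_state) - potential(state)
    return env_reward(next_state) + bonus

forward = shaped_reward(3, 4)    # step toward the goal: small positive reward
backward = shaped_reward(4, 3)   # step away from the goal: small negative reward
total = sum(shaped_reward(s, s + 1) for s in range(GOAL))
print(forward, backward, total)
```

Every step toward the goal now earns something, so the learner gets feedback long before the final behavior appears. With γ = 1 the bonuses along any path telescope to Φ(end) − Φ(start), so the shaping guides learning without changing which overall behavior is best.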

Skinner also introduced the schedule of reinforcement: the pattern of when rewards are given. Continuous reinforcement (reward every time) produces fast learning but quick extinction. Intermittent reinforcement (reward sometimes) is slower to acquire but remarkably resistant to extinction. Gamblers know this well — the occasional win keeps them playing long after rational analysis says to stop.

Check: What problem does shaping solve in RL terms?

• The exploration-exploitation tradeoff
• Sparse rewards: the agent may never discover the rewarded behavior without intermediate guidance
• The credit assignment problem over long time horizons

Chapter 9: Summary

This chapter revealed that RL is not just inspired by psychology — it provides the most precise computational account of animal learning phenomena. The correspondence runs deep: from the delta rule of Rescorla-Wagner, to the temporal dynamics of TD learning, to the model-free vs. model-based distinction in habitual vs. goal-directed behavior.

Psychology	RL Concept	Algorithm
Classical conditioning	Value prediction	TD learning
Rescorla-Wagner model	Prediction-error learning	Delta rule (LMS), one update per trial
TD model of conditioning	Temporal prediction error	TD(λ)
Blocking	Zero error → zero learning	Any error-driven method
Instrumental conditioning	Policy improvement	Actor / policy gradient
Habitual behavior	Model-free RL	Q-learning, Sarsa
Goal-directed behavior	Model-based RL	Dyna, planning
Cognitive maps	Learned environment models	Model learning
Shaping	Reward shaping / curriculum	Intermediate rewards

The deepest insight: Animals are not stimulus-response machines (pure model-free), nor are they pure planners (pure model-based). They use both systems, trading off speed and flexibility. The best RL architectures do the same.

What comes next: Chapter 13 gave us the computational tools. This chapter showed that animals implement something strikingly similar. Chapter 15 will go one level deeper, into the neural circuits thought to implement RL in the brain. The prediction errors we've been computing closely match the firing of dopamine neurons.

"The brain is the ultimate reinforcement learning machine."
— Peter Dayan & Nathaniel Daw

Check: What is the fundamental correspondence between classical conditioning and RL?

• Both use neural networks
• Classical conditioning is learning a value function: predicting future reward from current stimuli using prediction errors
• Both require a teacher signal
