Model-Free Methods — Kochenderfer et al., Chapter 17

Chapter 0: Model-Free Reinforcement Learning

You are a robot in a building you have never seen. You need to find the exit. You don't have a map. You don't know which doors lead where. But you can walk, open doors, and feel when you get closer to fresh air.

Chapter 16 would say: walk around for a while, carefully track which doors go where, build an internal map, then plan with it. That's model-based learning. It's sensible — but building an accurate map takes a lot of effort, and the map has to be stored and maintained.

There's another way. Don't bother building a map. Instead, keep a running score for each (location, action) pair: "how good is it to go left from the hallway by the stairs?" Every time you try an action and feel what happens, update that score. Over time, your scores improve. You learn which actions are good without ever knowing why they lead where they do.

The key shift: Model-based methods learn T(s'|s,a) and R(s,a), then solve for a policy. Model-free methods skip the model entirely — they learn the action value function Q(s,a) directly from (state, action, reward, next-state) tuples. No T, no R required. Just experience.

This is attractive for three reasons. First, the model can be harder to learn than the value function — why learn more than you need? Second, model-free methods scale to very large or continuous state spaces where storing T explicitly is impossible. Third, they are conceptually simple: gather experience, update estimates, repeat.

This chapter covers the main model-free algorithms from Kochenderfer Ch. 17 (pp. 335–353): incremental mean estimation (the building block), Q-learning, Sarsa, eligibility traces, reward shaping, function approximation, and experience replay.

Interact

Take action a in state s, observe r and s'

↓

Update Q

Use (s, a, r, s') to nudge Q(s,a) closer to the true value

↓

Act

Choose next action via ε-greedy (explore vs exploit)

↻ repeat

What does a model-free method learn instead of T(s'|s,a) and R(s,a)?

The action value function Q(s,a) directly from experience A separate reward model, but not a transition model The optimal policy π* without ever using Q-values

Chapter 1: Incremental Mean Estimation

Before Q-learning, we need a building block. Suppose you are trying to estimate the average reward from a slot machine — you pull it many times and get a noisy sample each time. How do you maintain a running average without storing every past sample?

Let x̂_m be your estimate after m samples. The exact sample average satisfies:

x̂_m = x̂_m-1 + α(m) · (x^(m) − x̂_m-1)

With α(m) = 1/m this recovers the exact mean. The term (x^(m) − x̂_m-1) is your prediction error — how far the new sample was from what you expected. You nudge your estimate by a fraction α of that error.

This one-liner is the kernel of all TD learning. Q-learning and Sarsa are just this update, applied to action values, with a smarter notion of "what the sample is."

Constant α vs 1/m: With α(m) = 1/m you get the exact running mean — every sample counts equally. With a constant α (say, 0.1), older samples are downweighted exponentially: after k additional steps, a sample's influence decays as (1−α)^k. This is called exponential recency weighting and is essential for non-stationary environments where the true value drifts over time. The textbook calls this "experience replay" at the micro level.

Convergence conditions: Guaranteed convergence requires ∑ α(m) = ∞ (steps large enough to reach the truth) AND ∑ α(m)² < ∞ (steps shrink to prevent oscillation). Constant α violates the second condition — but converges to a region near the true value, which is often fine.

Worked example — rolling die (true mean = 3.5):
Samples: 2, 5, 1, 6, 4. Using α = 0.3:
After sample 1: x̂ = 2 (initialize to first sample)
After sample 2: x̂ = 2 + 0.3×(5 − 2) = 2.9
After sample 3: x̂ = 2.9 + 0.3×(1 − 2.9) = 2.33
After sample 4: x̂ = 2.33 + 0.3×(6 − 2.33) = 3.43
After sample 5: x̂ = 3.43 + 0.3×(4 − 3.43) = 3.60
With 1/m instead: after 5 samples, x̂ = (2+5+1+6+4)/5 = 3.6 — identical here, but 1/m would give 0.2 weight to sample 5, while constant α=0.3 gives 0.3×0.7⁰ = 0.3.

Learning Rate Comparison — Estimating a Die Mean

True mean = 3.5. Compare four learning rate schedules estimating from 100 die rolls. Note how a larger constant α is more reactive but noisier.

python
def incremental_mean(samples, alpha=None):
    """
    Compute running mean estimates.
    alpha=None uses 1/m (exact mean).
    alpha=float uses constant learning rate (exponential recency).
    """
    estimates = []
    est = samples[0]
    for m, x in enumerate(samples, start=1):
        lr = (1 / m) if alpha is None else alpha
        est += lr * (x - est)  # the fundamental update rule
        estimates.append(est)
    return estimates

# Every Q-learning / Sarsa update is this, dressed up with a smarter target.

What does the prediction error term (x^(m) − x̂_m-1) represent?

How far the new sample was from the current estimate — the signal that drives the update The learning rate α applied to the current sample The total accumulated reward up to step m

Chapter 2: Q-Learning — Off-Policy TD Control

We want to learn Q*(s,a) — the value of taking action a from state s, then acting optimally forever. What does optimal mean? The Bellman optimality equation tells us:

Q*(s,a) = E[ R(s,a) + γ · max_a' Q*(s',a') ]

Read this carefully. The right-hand side involves an expectation over the random next state s'. To compute it exactly you'd need to know T(s'|s,a) — the model. We want to avoid that.

Here's the key insight. We can't compute the expectation, but we can sample it. Every time the agent takes action a in state s, it observes r and s'. The quantity (r + γ max_a' Q(s',a')) is a single noisy sample of the right-hand side. Now apply the incremental mean update from Chapter 1:

Q(s,a) ← Q(s,a) + α · [r + γ max_a' Q(s',a') − Q(s,a)]

The bracketed term is the temporal difference (TD) error δ: how surprised the agent was by reality versus its prediction. Q-learning nudges Q(s,a) toward each new sample by a step of size α.

Why "off-policy"? Look at the target: r + γ max_a' Q(s',a'). It uses the best possible next action — even if the agent didn't take it. The agent might be exploring randomly (picking ε-greedy actions), but its Q-values learn about the greedy policy. The behavior policy (how you act) and the target policy (what you're learning about) are decoupled. This is off-policy learning. It's powerful: you can learn from any experience, including demonstrations or replayed data.

TD error as dopamine: When δ > 0, the outcome was better than predicted — a positive surprise. When δ < 0, things went worse than expected. When δ = 0, prediction was exact — nothing to learn. Neuroscientists discovered the same signal in dopamine neurons: they fire more for surprising rewards, less for surprising non-rewards, and at baseline when rewards match predictions. Q-learning rediscovered a mechanism evolution built into mammalian brains.

Q-Learning Live — Grid World Q-Table Visualizer

A 5×5 grid. Goal (+10) at top-right. Watch Q-values fill in as the agent explores. Cell brightness = max_aQ(s,a). Arrows = greedy policy. Adjust sliders and step to see the effect.

ε (exploration) 0.30

α (learning rate) 0.10

γ (discount) 0.90

Steps: 0 | Episode: 0

Algorithm 17.2 (Kochenderfer): The agent keeps a Q-table initialized to zero. In each step it picks an action (usually ε-greedy: random with probability ε, greedy otherwise), observes (r, s'), performs the Bellman update above, and continues. Episodes reset when the agent reaches the goal or hits a step limit.

python
import numpy as np

def q_learning(env, n_steps, alpha=0.1, gamma=0.9, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if np.random.random() < eps:
            a = np.random.randint(env.n_actions)    # explore
        else:
            a = np.argmax(Q[s])                      # exploit

        sp, r, done = env.step(a)

        # off-policy TD update: max over next actions
        td_error = r + gamma * np.max(Q[sp]) - Q[s, a]
        Q[s, a] += alpha * td_error

        s = env.reset() if done else sp

    return Q

Why is Q-learning called "off-policy"?

It learns Q* using max_a' Q(s',a') regardless of which action the agent actually took in s' It does not follow any policy — actions are always random It only works when the behavior policy is optimal

Chapter 3: Sarsa — On-Policy TD Control

Q-learning uses max_a' in its update — the best possible next action, whether or not the agent takes it. Sarsa uses a different target: the value of the action the agent actually takes next.

Q(s,a) ← Q(s,a) + α · [r + γ Q(s',a') − Q(s,a)]

The name comes from the quintuple (S_t, A_t, R_t+1, S_t+1, A_t+1) — the five quantities needed for one update. The agent must commit to the next action a' before it can apply the update. That next action is drawn from the same policy the agent is currently following, not the greedy policy.

Worked example — one Sarsa step:
State s = "hallway junction." Agent picks action "go right" (ε-greedy says greedy this time).
Transitions to s' = "dead end", reward r = −1.
Agent now samples the next action from ε-greedy: picks "go back" (a random exploratory action, since ε = 0.1 and the dice rolled unlucky).
Sarsa quintuple: (hallway, right, −1, dead-end, back).
Update: Q(hallway, right) ← Q(hallway, right) + α [−1 + γ · Q(dead-end, back) − Q(hallway, right)]

Q-learning would have used: max_a' Q(dead-end, a') — possibly a different, better action. Sarsa used the actual exploratory action. This is the only difference.

On-policy means Sarsa learns about itself: Because Sarsa uses the actual next action (from the current policy), the Q-values it learns reflect how well the current exploring policy performs — including its random exploratory moves. Q-learning, by contrast, always imagines acting greedily from s' onward, regardless of exploration. This distinction seems small but has big consequences in risky environments (Chapter 4).

	Q-Learning (Off-Policy)	Sarsa (On-Policy)
Update target	r + γ max_a' Q(s',a')	r + γ Q(s',a')
Policy learned	Optimal π* (greedy)	Current exploring policy π_ε
Next action	Best possible (never taken)	Actually sampled from policy
Complexity/step	O(\|A\|) for max	O(1)
Safe near hazards?	Ignores exploration risk	Accounts for exploration risk

python
def sarsa(env, n_steps, alpha=0.1, gamma=0.9, eps=0.1):
    Q = np.zeros((env.n_states, env.n_actions))
    s = env.reset()
    # must pick FIRST action before the loop
    a = eps_greedy(Q, s, eps)

    for _ in range(n_steps):
        sp, r, done = env.step(a)
        ap = eps_greedy(Q, sp, eps)   # next action sampled NOW (on-policy)

        # on-policy TD update: uses ap, not max
        td_error = r + gamma * Q[sp, ap] - Q[s, a]
        Q[s, a] += alpha * td_error

        s, a = (env.reset(), eps_greedy(Q, env.start, eps)) if done else (sp, ap)

    return Q

In Sarsa, what is the target r + γQ(s',a') estimating?

The expected return under the current policy, including its exploratory actions The optimal return from state s' under the greedy policy The immediate reward only, discounting the future

Chapter 4: On-Policy vs Off-Policy — The Cliff Walk

The difference between on-policy (Sarsa) and off-policy (Q-learning) is theoretical until you see it break something. The cliff walk is the canonical demonstration.

Picture a 4×8 grid. The agent starts at bottom-left. The goal is bottom-right. Along the bottom row between them is a cliff: stepping onto any cliff cell gives reward −100 and teleports the agent back to start. Every other step costs −1. The optimal path (minimum total cost) runs right along the cliff edge — it's the shortest route.

Q-learning's take: "The optimal policy is to hug the cliff edge — it's 8 steps vs. 14 steps around the top. My Q-values reflect the greedy policy. Yes, during training I sometimes fall off while exploring, but I don't let that affect my value estimates for the greedy policy."

Sarsa's take: "I'm following an ε-greedy policy. Near the cliff edge, my policy has a ~10% chance of randomly stepping off. That makes cliff-adjacent states genuinely dangerous for me right now. My Q-values reflect my actual behavior, so I learn the safe path along the top row."

Neither is wrong — they answer different questions. Q-learning asks: "What's the optimal policy if I stopped exploring?" Sarsa asks: "How well am I actually performing, exploration included?" Q-learning eventually converges to the optimal path as ε → 0. But during training with a fixed ε, Sarsa's cumulative reward is higher because it avoids catastrophic falls. In robotics, this matters: you often cannot afford to break things while learning.

Cliff Walk: Q-Learning vs Sarsa Side by Side

Train 500 episodes of each. The learned greedy path is shown with arrows. Q-learning risks the cliff edge; Sarsa takes the long safe route.

Click to train.

Convergence clarification: Both algorithms converge to Q* as α → 0 and ε → 0 (the GLIE condition: Greedy in the Limit with Infinite Exploration). The difference is during training with a fixed small ε. Sarsa converges to Q^πε (the value of the ε-greedy policy), which hugs the cliff less aggressively. Q-learning converges to Q* regardless of ε, but with higher variance because it ignores exploration risk in its targets.

In the cliff walk, why does Sarsa learn the safe top path while Q-learning learns the risky bottom path?

Sarsa's values reflect the current exploring policy, making cliff-adjacent states look worse; Q-learning imagines acting greedily from s' so exploration risk doesn't enter its estimates Q-learning has a higher learning rate Sarsa uses a larger discount factor γ

Chapter 5: Eligibility Traces — Sarsa(λ)

Standard Sarsa (and Q-learning) update only the most recent (s,a) pair. If the agent wanders through ten states before hitting a reward, only the last step learns. The other nine states contributed to reaching the reward, but their Q-values don't know it yet. This is called the credit assignment problem.

Eligibility traces fix this by maintaining a "recency counter" z(s,a) for every (s,a) pair visited. When the TD error δ arrives, it's distributed backward through all recently visited pairs, weighted by how recently they were visited.

The full Sarsa(λ) update (Algorithm 17.4):

δ = r + γ Q(s',a') − Q(s,a)

z(s_t, a_t) ← z(s_t, a_t) + 1 (increment the just-visited pair)

Q(s, a) ← Q(s, a) + α · δ · z(s, a) for all s, a

z(s, a) ← γλ · z(s, a) for all s, a (decay all traces)

The decay rate γλ means each step into the past reduces credit by a factor of γλ. After k steps, the credit assigned to a visit is (γλ)^k. When λ = 0, only the current step gets credit: plain Sarsa. When λ = 1, credit decays at rate γ only — like Monte Carlo but computed online.

The forward view and the backward view are the same thing.

Forward view: The ideal target is not the 1-step TD target, but a weighted blend of all n-step returns:

G_t^λ = (1−λ) ∑_n=1^∞ λⁿ⁻¹ G_t:t+n

This λ-return mixes 1-step returns (low variance, more bias from bootstrapping) with longer-horizon returns (higher variance, less bias). It's the ideal target but requires future information.

Backward view (eligibility traces): Instead of looking forward, keep a running trace of past visits. When δ arrives, spread it back proportionally. Sutton proved these two views produce exactly the same parameter updates online. The backward view makes the forward view computationally feasible.

Cost: Sarsa(λ) is O(|S||A|) per step (all traces must be decayed). Plain Sarsa is O(1). The payoff: much faster convergence in problems with long reward delays. λ is a bias-variance knob — small λ is conservative and stable, large λ is ambitious and data-hungry.

Credit Assignment — How λ Spreads the Reward Backward

An agent walks 8 steps and receives reward R=10 at the end. Bar height = eligibility credit assigned to each step. Drag λ to see how credit propagates.

λ 0.80

γ 0.90

What does Sarsa(λ) reduce to when λ = 0?

Standard one-step Sarsa — only the current (s,a) pair is updated Monte Carlo estimation using full episode returns Q-learning with the max operator

Chapter 6: Reward Shaping

Sparse rewards are brutal. An agent trying to navigate a 10×10 maze with +1 only at the goal gets zero feedback for all the dead ends, all the wrong turns. The TD error is zero everywhere except the single lucky transition into the goal. Learning is glacially slow because there's almost no signal to propagate backward.

Reward shaping adds an extra signal to guide the agent faster, without changing which policy is optimal. The shaped reward is:

R'(s, a, s') = R(s, a) + F(s, a, s')

where F is the shaping bonus. The key question: what constraints on F guarantee that the optimal policy under R' is the same as under R?

The answer (Ng et al., 1999): use potential-based shaping. Define a potential function Φ(s) over states, then set:

F(s, a, s') = γ Φ(s') − Φ(s)

This is called a potential difference. If you think of Φ(s) as the "goodness" of a state (e.g., proximity to the goal), the bonus rewards entering good states and penalizes leaving them.

Why does potential-based shaping preserve optimal policies? The cumulative shaping bonus along any trajectory telescopes:

∑_t γ^t F(s_t, a_t, s_t+1) = ∑_t γ^t(γΦ(s_t+1) − Φ(s_t)) = −Φ(s₀) + lim_t→∞ γ^tΦ(s_t)

The sum depends only on the start state — not on which path you took. So adding this bonus to R changes the value of every policy by the same constant. The relative ranking of policies is unchanged, which means the optimal policy is unchanged.

Arbitrary (non-potential-based) shaping can change the optimal policy entirely. A famous example: if you reward the agent for looking at the TV (entertainment) instead of cleaning up (the goal), it learns to stare at the TV forever. Potential-based shaping is the principled safeguard.

Designing Φ: Good potential functions encode domain knowledge. For navigation: Φ(s) = −distance(s, goal) (states closer to goal are better). For games: Φ(s) = material advantage (more pieces = better). The function must be bounded so the shaping doesn't dominate the true reward asymptotically. Empirically, even rough potentials dramatically accelerate convergence. The key is that F(s,a,s') must always take the form γΦ(s') − Φ(s).

python
def shaped_reward(s, a, sp, r, phi, gamma):
    """
    Potential-based reward shaping.
    phi: potential function (state -> scalar)
    Returns shaped reward preserving optimal policy.
    """
    shaping = gamma * phi(sp) - phi(s)
    return r + shaping  # plug-in replacement for R in any TD method

# Example: navigation maze. phi = negative Manhattan distance to goal.
def phi_nav(s, goal=(9, 9)):
    return -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))

Why must reward shaping use the form F(s,a,s') = γΦ(s') − Φ(s)?

This form telescopes across trajectories so all policies are shifted by the same constant — preserving the optimal policy It makes the shaping bonus proportional to the reward, which speeds up TD convergence Potential functions are always non-negative, avoiding negative rewards

Chapter 7: Function Approximation

A 5×5 grid needs 25 states × 4 actions = 100 Q-table entries. Manageable. A 20×20 grid needs 1600. Still fine. But a robot with a camera input? An Atari game (210×160×3 pixels)? That's roughly 10^33,000 possible states. No table can hold that.

Function approximation replaces the Q-table with a parameterized function Q_θ(s,a) that computes Q-values from features of the state. The function generalizes: updating parameters for state s also changes predictions for similar states s', whether or not s' was ever visited.

Linear approximation (Algorithm 17.5): Define feature vector β(s,a) ∈ ℝ^d encoding properties of the (state, action) pair. The approximation is:

Q_θ(s,a) = θ^T β(s,a)

Update rule (gradient descent on the TD error, keeping the target fixed):

θ ← θ + α · [r + γ max_a' Q_θ(s',a') − Q_θ(s,a)] · ∇_θ Q_θ(s,a)

For linear Q, the gradient is just the feature vector: ∇_θ Q_θ(s,a) = β(s,a). The update becomes a simple inner-product update — no backpropagation needed.

The deadly triad: Three individually harmless ingredients combine dangerously:

1. Function approximation: Updating Q(s,a) also changes Q for similar states. If an error is introduced, it propagates to unseen states.

2. Bootstrapping: The TD target r + γ max Q(s',a') depends on Q_θ itself. As θ changes, the target moves. You're chasing a moving target while the target chases you.

3. Off-policy learning: The data distribution (from the behavior policy) differs from the distribution of states you care about (from the greedy policy). The function approximator fits the wrong distribution.

Any two of these three are stable. All three together can cause divergence — Q-values spiral to ±∞. This is why DQN needed both experience replay and a target network to stabilize training.

Linear Function Approximation with RBF Features

A 1D state space. Teal = true Q-function. Orange = linear approximation with 5 RBF (Gaussian) basis functions. Watch it converge with more training steps.

Steps: 0

python
import numpy as np

# RBF (Gaussian) feature encoding
centers = np.linspace(0, 1, 5)
sigma = 0.15

def features(s):
    """Map state s to RBF feature vector."""
    return np.exp(-((s - centers) ** 2) / (2 * sigma**2))

theta = np.zeros(5)

def Q_approx(s):
    return theta @ features(s)    # linear: theta^T * phi(s)

def update(s, r, sp, alpha=0.01, gamma=0.9):
    target = r + gamma * Q_approx(sp)
    prediction = Q_approx(s)
    td_error = target - prediction
    # gradient w.r.t. theta is just features(s) for linear Q
    global theta
    theta += alpha * td_error * features(s)

What makes the "deadly triad" dangerous when all three components are combined?

Function approximation spreads errors, bootstrapping chases a moving target, and off-policy shifts the data distribution — together they can cause Q-values to diverge Function approximation is too slow to be practical with bootstrapping Off-policy learning requires too many samples for function approximation

Chapter 8: Experience Replay

You're training a neural network to approximate Q-values on an Atari game. The agent starts in the first room of the game and explores it for 1000 steps. Every consecutive training sample is: (first-room state, action, reward, another-first-room state). The network sees the same kind of input over and over and gradually forgets everything it learned about the rest of the game. This is catastrophic forgetting caused by temporal correlation.

Experience replay breaks this correlation. Every transition (s, a, r, s') is stored in a circular buffer called the replay buffer. Instead of training on the most recent transition, at each step a random minibatch of size k is sampled from the buffer and used for the gradient update.

Benefit	Mechanism	Without replay
Break correlations	Random sampling destroys temporal order	Network overfits to most recent states
Data efficiency	Each transition reused many times	Each transition used once, discarded
Smoother distribution	Buffer holds diverse past experience	Training distribution shifts as agent explores

Experience replay was the first key innovation that made Deep Q-Networks (DQN) (Mnih et al., 2015) work on Atari. The second was the target network.

Target network — solving the moving-target problem: The TD target r + γ max_a' Q_θ(s',a') depends on θ itself. Every gradient step moves θ, which moves the target, which requires another step... a feedback loop that can diverge. The fix: maintain a separate target network Q_θ⁻ with frozen weights. Use Q_θ⁻ to compute TD targets. Every C steps (DQN used C = 10,000), copy θ → θ⁻. Between copies, the target is stable. One detail changes the update:

θ ← θ + α · [r + γ max_a' Q_θ⁻(s',a') − Q_θ(s,a)] · ∇_θ Q_θ(s,a)

Prioritized experience replay: Not all transitions are equally informative. A transition where the TD error is large (the agent was very surprised) carries more learning signal than one where δ ≈ 0 (already well-predicted). Prioritized replay samples transition i with probability proportional to |δ_i|^α. Transitions you're confused about are replayed more often. Importance-sampling weights correct for the biased distribution. This doubles convergence speed on many Atari games.

python
from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, sp, done):
        self.buf.append((s, a, r, sp, done))

    def sample(self, k=32):
        return random.sample(self.buf, k)   # uniform random minibatch

# DQN training loop (simplified)
buf = ReplayBuffer()
s = env.reset()
for step in range(n_steps):
    a = eps_greedy(Q_main, s, eps)
    sp, r, done, _ = env.step(a)
    buf.push(s, a, r, sp, done)

    if len(buf.buf) >= 1000:         # wait until buffer has enough data
        batch = buf.sample(32)
        gradient_step(Q_main, Q_target, batch, alpha, gamma)

    if step % 10_000 == 0:
        Q_target.load_state_dict(Q_main.state_dict())   # sync target network

    s = env.reset() if done else sp

What specific problem does experience replay solve when training a neural network Q-function?

It breaks the temporal correlation between consecutive samples, preventing the network from overfitting to the most recently visited states It replaces the need for a target network It eliminates the need for an exploration strategy

Chapter 9: Connections & What Comes Next

You've now seen every major idea in model-free tabular RL. Here's a map of what was covered and where each piece leads.

Bad Φ can distort behavior

Method	Type	Key Feature	Limitation
Q-Learning	Off-policy	Learns Q* with max; decouples behavior from target	Dangerous near cliffs during training
Sarsa	On-policy	Learns Q^πε; safer in risky environments	Sub-optimal during exploration
Sarsa(λ)	On-policy + traces	Propagates credit backward; faster convergence	O(\|S\|\|A\|) per step
Reward Shaping	Enhancement	Accelerates learning with domain knowledge
Linear FA	Generalization	Scales to continuous states; convergence guarantees	Feature design is manual work
Neural Net FA	Deep generalization	Learns features end-to-end; handles raw pixels	Deadly triad requires stabilization tricks
Experience Replay	Stabilization	Breaks correlations; data efficiency	Requires large memory buffer
Target Network	Stabilization	Fixes moving-target problem in deep RL	Delayed updates slow convergence

The chain from Q-learning to DQN: Tabular Q-learning (1989, Watkins) → replace table with linear features → replace linear features with a neural network → add experience replay to break correlations → add target network to stabilize → DQN (Mnih et al., 2015). The core update rule never changed. The engineering around it made it work at scale.

What Chapter 17 does NOT cover (see later chapters):

Policy gradient methods (Ch. 12): don't learn Q at all — directly optimize the policy parameters θ by gradient ascent on E[G₀].
Actor-Critic (Ch. 13): combine a value function (critic) with a policy (actor) — the best of both worlds.
Imitation learning (Ch. 18): no reward signal at all — learn from expert demonstrations.
POMDPs (Ch. 19-23): the agent doesn't observe the full state s — must maintain a belief.

Related micro-lessons

Key references

Watkins & Dayan. "Q-learning." Machine Learning, 1992.
Rummery & Niranjan. "On-line Q-learning." Tech Report, 1994. (Sarsa)
Sutton. "Learning to predict." Machine Learning, 1988. (TD(λ))
Ng et al. "Policy invariance under reward transformations." ICML, 1999.
Mnih et al. "Human-level control through DRL." Nature, 2015. (DQN)

“In order to learn, you have to be willing to be confused and not know.” — Richard Feynman

← Ch 16: Model-Based Ch 18: Imitation Learning →