Learn action values directly from experience. No transition model needed. Q-learning, Sarsa, eligibility traces, reward shaping, function approximation, and the ideas behind DQN.
You are a robot in a building you have never seen. You need to find the exit. You don't have a map. You don't know which doors lead where. But you can walk, open doors, and feel when you get closer to fresh air.
Chapter 16 would say: walk around for a while, carefully track which doors go where, build an internal map, then plan with it. That's model-based learning. It's sensible — but building an accurate map takes a lot of effort, and the map has to be stored and maintained.
There's another way. Don't bother building a map. Instead, keep a running score for each (location, action) pair: "how good is it to go left from the hallway by the stairs?" Every time you try an action and feel what happens, update that score. Over time, your scores improve. You learn which actions are good without ever knowing why they lead where they do.
This is attractive for three reasons. First, the model can be harder to learn than the value function — why learn more than you need? Second, model-free methods scale to very large or continuous state spaces where storing T explicitly is impossible. Third, they are conceptually simple: gather experience, update estimates, repeat.
This chapter covers the main model-free algorithms from Kochenderfer Ch. 17 (pp. 335–353): incremental mean estimation (the building block), Q-learning, Sarsa, eligibility traces, reward shaping, function approximation, and experience replay.
Before Q-learning, we need a building block. Suppose you are trying to estimate the average reward from a slot machine — you pull it many times and get a noisy sample each time. How do you maintain a running average without storing every past sample?
Let x̂m be your estimate after m samples. The exact sample average satisfies:
With α(m) = 1/m this recovers the exact mean. The term (x(m) − x̂m-1) is your prediction error — how far the new sample was from what you expected. You nudge your estimate by a fraction α of that error.
This one-liner is the kernel of all TD learning. Q-learning and Sarsa are just this update, applied to action values, with a smarter notion of "what the sample is."
True mean = 3.5. Compare four learning rate schedules estimating from 100 die rolls. Note how a larger constant α is more reactive but noisier.
python def incremental_mean(samples, alpha=None): """ Compute running mean estimates. alpha=None uses 1/m (exact mean). alpha=float uses constant learning rate (exponential recency). """ estimates = [] est = samples[0] for m, x in enumerate(samples, start=1): lr = (1 / m) if alpha is None else alpha est += lr * (x - est) # the fundamental update rule estimates.append(est) return estimates # Every Q-learning / Sarsa update is this, dressed up with a smarter target.
We want to learn Q*(s,a) — the value of taking action a from state s, then acting optimally forever. What does optimal mean? The Bellman optimality equation tells us:
Read this carefully. The right-hand side involves an expectation over the random next state s'. To compute it exactly you'd need to know T(s'|s,a) — the model. We want to avoid that.
Here's the key insight. We can't compute the expectation, but we can sample it. Every time the agent takes action a in state s, it observes r and s'. The quantity (r + γ maxa' Q(s',a')) is a single noisy sample of the right-hand side. Now apply the incremental mean update from Chapter 1:
The bracketed term is the temporal difference (TD) error δ: how surprised the agent was by reality versus its prediction. Q-learning nudges Q(s,a) toward each new sample by a step of size α.
A 5×5 grid. Goal (+10) at top-right. Watch Q-values fill in as the agent explores. Cell brightness = maxaQ(s,a). Arrows = greedy policy. Adjust sliders and step to see the effect.
Algorithm 17.2 (Kochenderfer): The agent keeps a Q-table initialized to zero. In each step it picks an action (usually ε-greedy: random with probability ε, greedy otherwise), observes (r, s'), performs the Bellman update above, and continues. Episodes reset when the agent reaches the goal or hits a step limit.
python import numpy as np def q_learning(env, n_steps, alpha=0.1, gamma=0.9, eps=0.1): Q = np.zeros((env.n_states, env.n_actions)) s = env.reset() for _ in range(n_steps): # epsilon-greedy action selection if np.random.random() < eps: a = np.random.randint(env.n_actions) # explore else: a = np.argmax(Q[s]) # exploit sp, r, done = env.step(a) # off-policy TD update: max over next actions td_error = r + gamma * np.max(Q[sp]) - Q[s, a] Q[s, a] += alpha * td_error s = env.reset() if done else sp return Q
Q-learning uses maxa' in its update — the best possible next action, whether or not the agent takes it. Sarsa uses a different target: the value of the action the agent actually takes next.
The name comes from the quintuple (St, At, Rt+1, St+1, At+1) — the five quantities needed for one update. The agent must commit to the next action a' before it can apply the update. That next action is drawn from the same policy the agent is currently following, not the greedy policy.
| Q-Learning (Off-Policy) | Sarsa (On-Policy) | |
|---|---|---|
| Update target | r + γ maxa' Q(s',a') | r + γ Q(s',a') |
| Policy learned | Optimal π* (greedy) | Current exploring policy πε |
| Next action | Best possible (never taken) | Actually sampled from policy |
| Complexity/step | O(|A|) for max | O(1) |
| Safe near hazards? | Ignores exploration risk | Accounts for exploration risk |
python def sarsa(env, n_steps, alpha=0.1, gamma=0.9, eps=0.1): Q = np.zeros((env.n_states, env.n_actions)) s = env.reset() # must pick FIRST action before the loop a = eps_greedy(Q, s, eps) for _ in range(n_steps): sp, r, done = env.step(a) ap = eps_greedy(Q, sp, eps) # next action sampled NOW (on-policy) # on-policy TD update: uses ap, not max td_error = r + gamma * Q[sp, ap] - Q[s, a] Q[s, a] += alpha * td_error s, a = (env.reset(), eps_greedy(Q, env.start, eps)) if done else (sp, ap) return Q
The difference between on-policy (Sarsa) and off-policy (Q-learning) is theoretical until you see it break something. The cliff walk is the canonical demonstration.
Picture a 4×8 grid. The agent starts at bottom-left. The goal is bottom-right. Along the bottom row between them is a cliff: stepping onto any cliff cell gives reward −100 and teleports the agent back to start. Every other step costs −1. The optimal path (minimum total cost) runs right along the cliff edge — it's the shortest route.
Q-learning's take: "The optimal policy is to hug the cliff edge — it's 8 steps vs. 14 steps around the top. My Q-values reflect the greedy policy. Yes, during training I sometimes fall off while exploring, but I don't let that affect my value estimates for the greedy policy."
Sarsa's take: "I'm following an ε-greedy policy. Near the cliff edge, my policy has a ~10% chance of randomly stepping off. That makes cliff-adjacent states genuinely dangerous for me right now. My Q-values reflect my actual behavior, so I learn the safe path along the top row."
Train 500 episodes of each. The learned greedy path is shown with arrows. Q-learning risks the cliff edge; Sarsa takes the long safe route.
Standard Sarsa (and Q-learning) update only the most recent (s,a) pair. If the agent wanders through ten states before hitting a reward, only the last step learns. The other nine states contributed to reaching the reward, but their Q-values don't know it yet. This is called the credit assignment problem.
Eligibility traces fix this by maintaining a "recency counter" z(s,a) for every (s,a) pair visited. When the TD error δ arrives, it's distributed backward through all recently visited pairs, weighted by how recently they were visited.
The full Sarsa(λ) update (Algorithm 17.4):
The decay rate γλ means each step into the past reduces credit by a factor of γλ. After k steps, the credit assigned to a visit is (γλ)k. When λ = 0, only the current step gets credit: plain Sarsa. When λ = 1, credit decays at rate γ only — like Monte Carlo but computed online.
An agent walks 8 steps and receives reward R=10 at the end. Bar height = eligibility credit assigned to each step. Drag λ to see how credit propagates.
Sparse rewards are brutal. An agent trying to navigate a 10×10 maze with +1 only at the goal gets zero feedback for all the dead ends, all the wrong turns. The TD error is zero everywhere except the single lucky transition into the goal. Learning is glacially slow because there's almost no signal to propagate backward.
Reward shaping adds an extra signal to guide the agent faster, without changing which policy is optimal. The shaped reward is:
where F is the shaping bonus. The key question: what constraints on F guarantee that the optimal policy under R' is the same as under R?
The answer (Ng et al., 1999): use potential-based shaping. Define a potential function Φ(s) over states, then set:
This is called a potential difference. If you think of Φ(s) as the "goodness" of a state (e.g., proximity to the goal), the bonus rewards entering good states and penalizes leaving them.
python def shaped_reward(s, a, sp, r, phi, gamma): """ Potential-based reward shaping. phi: potential function (state -> scalar) Returns shaped reward preserving optimal policy. """ shaping = gamma * phi(sp) - phi(s) return r + shaping # plug-in replacement for R in any TD method # Example: navigation maze. phi = negative Manhattan distance to goal. def phi_nav(s, goal=(9, 9)): return -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
A 5×5 grid needs 25 states × 4 actions = 100 Q-table entries. Manageable. A 20×20 grid needs 1600. Still fine. But a robot with a camera input? An Atari game (210×160×3 pixels)? That's roughly 1033,000 possible states. No table can hold that.
Function approximation replaces the Q-table with a parameterized function Qθ(s,a) that computes Q-values from features of the state. The function generalizes: updating parameters for state s also changes predictions for similar states s', whether or not s' was ever visited.
Linear approximation (Algorithm 17.5): Define feature vector β(s,a) ∈ ℝd encoding properties of the (state, action) pair. The approximation is:
Update rule (gradient descent on the TD error, keeping the target fixed):
For linear Q, the gradient is just the feature vector: ∇θ Qθ(s,a) = β(s,a). The update becomes a simple inner-product update — no backpropagation needed.
A 1D state space. Teal = true Q-function. Orange = linear approximation with 5 RBF (Gaussian) basis functions. Watch it converge with more training steps.
python import numpy as np # RBF (Gaussian) feature encoding centers = np.linspace(0, 1, 5) sigma = 0.15 def features(s): """Map state s to RBF feature vector.""" return np.exp(-((s - centers) ** 2) / (2 * sigma**2)) theta = np.zeros(5) def Q_approx(s): return theta @ features(s) # linear: theta^T * phi(s) def update(s, r, sp, alpha=0.01, gamma=0.9): target = r + gamma * Q_approx(sp) prediction = Q_approx(s) td_error = target - prediction # gradient w.r.t. theta is just features(s) for linear Q global theta theta += alpha * td_error * features(s)
You're training a neural network to approximate Q-values on an Atari game. The agent starts in the first room of the game and explores it for 1000 steps. Every consecutive training sample is: (first-room state, action, reward, another-first-room state). The network sees the same kind of input over and over and gradually forgets everything it learned about the rest of the game. This is catastrophic forgetting caused by temporal correlation.
Experience replay breaks this correlation. Every transition (s, a, r, s') is stored in a circular buffer called the replay buffer. Instead of training on the most recent transition, at each step a random minibatch of size k is sampled from the buffer and used for the gradient update.
| Benefit | Mechanism | Without replay |
|---|---|---|
| Break correlations | Random sampling destroys temporal order | Network overfits to most recent states |
| Data efficiency | Each transition reused many times | Each transition used once, discarded |
| Smoother distribution | Buffer holds diverse past experience | Training distribution shifts as agent explores |
Experience replay was the first key innovation that made Deep Q-Networks (DQN) (Mnih et al., 2015) work on Atari. The second was the target network.
python from collections import deque import random class ReplayBuffer: def __init__(self, capacity=100_000): self.buf = deque(maxlen=capacity) def push(self, s, a, r, sp, done): self.buf.append((s, a, r, sp, done)) def sample(self, k=32): return random.sample(self.buf, k) # uniform random minibatch # DQN training loop (simplified) buf = ReplayBuffer() s = env.reset() for step in range(n_steps): a = eps_greedy(Q_main, s, eps) sp, r, done, _ = env.step(a) buf.push(s, a, r, sp, done) if len(buf.buf) >= 1000: # wait until buffer has enough data batch = buf.sample(32) gradient_step(Q_main, Q_target, batch, alpha, gamma) if step % 10_000 == 0: Q_target.load_state_dict(Q_main.state_dict()) # sync target network s = env.reset() if done else sp
You've now seen every major idea in model-free tabular RL. Here's a map of what was covered and where each piece leads.
| Method | Type | Key Feature | Limitation |
|---|---|---|---|
| Q-Learning | Off-policy | Learns Q* with max; decouples behavior from target | Dangerous near cliffs during training |
| Sarsa | On-policy | Learns Qπε; safer in risky environments | Sub-optimal during exploration |
| Sarsa(λ) | On-policy + traces | Propagates credit backward; faster convergence | O(|S||A|) per step |
| Reward Shaping | Enhancement | Accelerates learning with domain knowledge | Bad Φ can distort behavior |
| Linear FA | Generalization | Scales to continuous states; convergence guarantees | Feature design is manual work |
| Neural Net FA | Deep generalization | Learns features end-to-end; handles raw pixels | Deadly triad requires stabilization tricks |
| Experience Replay | Stabilization | Breaks correlations; data efficiency | Requires large memory buffer |
| Target Network | Stabilization | Fixes moving-target problem in deep RL | Delayed updates slow convergence |
“In order to learn, you have to be willing to be confused and not know.” — Richard Feynman