The paradigm that taught machines to play Atari, beat world champions at Go, and align large language models with human preferences.
Reinforcement learning is about an agent interacting with an environment. At each timestep, the agent observes the current state, takes an action, receives a reward, and transitions to a new state. The goal is deceptively simple: maximize cumulative reward over time.
This is different from supervised learning (where you have labeled examples) and unsupervised learning (where you find patterns). In RL, the agent must discover which actions lead to high rewards through trial and error. Nobody tells the agent the right answer — it must figure it out.
Watch the agent (green) navigate a simple world. It receives +1 for reaching the goal and -1 for walls. The cumulative reward tracks at the top.
Before we learn any algorithms, we must understand the fundamental dilemma: exploration vs exploitation. Should you go to your favourite restaurant (exploit what you know), or try a new place (explore to find something better)?
The multi-armed bandit problem distills this perfectly. You face N slot machines, each with an unknown payout probability. You want to maximize total winnings. If you always pull the lever that looks best so far, you might miss a better one. If you explore too much, you waste pulls on bad machines.
Click a lever to pull it, or let the ε-greedy agent play. Watch the regret (reward lost vs. always picking the best arm) accumulate.
The simplest approach to estimating how good a state is: play out full episodes, then average the returns you got. If you visited state S and ended up with a total return of 7, and another time you got 3, your estimate of V(S) is (7+3)/2 = 5. That's it. No fancy math.
There are two flavors: first-visit MC (only count the first time you visit a state in an episode) and every-visit MC (count every visit). Both converge to the true value as you play more episodes.
Run episodes to see MC value estimates converge. Each cell shows the estimated value of that state. More episodes = better estimates.
What if you could learn before an episode ends? Temporal Difference (TD) learning does exactly that. Instead of waiting for the full return G, it uses an estimate: the immediate reward plus the current guess of the next state's value.
The TD(0) update rule shifts the value of the current state toward a bootstrapped target: r + γV(s'). The difference between this target and our current estimate is the TD error δ.
| Symbol | Meaning |
|---|---|
| α | Learning rate (step size) |
| γ | Discount factor (how much we value future rewards) |
| δ | TD error: r + γV(s') − V(s) |
| V(s') | Current estimate of the next state's value |
Watch how TD(0) (teal) converges faster than MC (orange) on a random walk problem. TD learns from each step; MC waits until the end.
The value function V(s) is defined as the expected sum of discounted future rewards starting from state s under policy π:
Vπ(s) = Eπ[Gt | St = s] = Eπ[rt+1 + γrt+2 + γ²rt+3 + … | St = s]
Your task: Show that Vπ(s) can be decomposed into the immediate reward plus the discounted value of the next state — the Bellman equation:
Vπ(s) = ∑a π(a|s) ∑s' p(s'|s,a) [r(s,a,s') + γ Vπ(s')]
Full derivation:
1. Start with the definition: Vπ(s) = Eπ[Gt | St = s]
2. Decompose Gt: Vπ(s) = Eπ[rt+1 + γGt+1 | St = s]
3. Expand using law of total expectation over actions and next states:
= ∑a π(a|s) ∑s' p(s'|s,a) [r(s,a,s') + γ Eπ[Gt+1 | St+1=s']]
4. Recognize Eπ[Gt+1 | St+1=s'] = Vπ(s') by the Markov property:
= ∑a π(a|s) ∑s' p(s'|s,a) [r(s,a,s') + γ Vπ(s')]
The key insight: The Markov property is what makes this work. Because the future only depends on the current state (not how we got here), we can recursively define value in terms of itself. This self-referential structure is what TD learning exploits — we can update V(s) using our current estimate of V(s') without waiting for the full return.
Both are instances of the same deep pattern: recursive Bayesian estimation. You have a current belief, you get new information (a reward + next-state value, or a measurement), and you blend them using a weighting factor (α in TD, K in Kalman). The Bellman equation IS the Kalman predict step for value functions — "my best guess of the future given my model of how the world evolves."
Where else have you seen "new_estimate = old_estimate + step_size × (target − old_estimate)"? (Hint: think about how running averages work, and how gradient descent updates parameters.)
Instead of estimating how good a state is (V), what if we estimated how good a state-action pair is? That's Q-learning. Q(s, a) tells us the expected return of taking action a in state s and then acting optimally afterward. If we know Q perfectly, we can act optimally by always picking argmax_a Q(s, a).
Q-learning is off-policy: it learns the optimal policy regardless of what the agent actually does. The update uses the maximum Q-value of the next state, not the Q-value of the action the agent actually took.
For a gridworld with |S| = 100 states and |A| = 4 actions, Q is simply a [100, 4] array — 400 numbers, initialized to zeros. Each entry Q[s, a] estimates the expected return of taking action a in state s. The entire learning algorithm is one line of NumPy per step:
Python import numpy as np # Initialize the Q-table: 100 states × 4 actions = 400 entries Q = np.zeros((100, 4)) # After observing (s, a, r, s'): Q[s, a] += alpha * (r + gamma * np.max(Q[s_next, :]) - Q[s, a]) # └── TD target ──┘ └── old estimate ┘ # Extract greedy policy: best action per state policy = np.argmax(Q, axis=1) # shape [100]
That's the complete Q-learning implementation. The alpha * (target - old) pattern is the same as TD learning from Chapter 3, but now applied to state-action pairs instead of just states.
Watch the Q-table fill in as the agent explores. Green = goal (+10). Red = trap (-10). The arrows show the learned greedy policy. Cell colors show value estimates.
python import numpy as np def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1): Q = np.zeros((env.n_states, env.n_actions)) for ep in range(n_episodes): state = env.reset() done = False while not done: # Epsilon-greedy action selection if np.random.random() < epsilon: action = np.random.randint(env.n_actions) else: action = np.argmax(Q[state]) # Take action, observe transition next_state, reward, done = env.step(action) # Q-learning update (off-policy: uses max) if done: target = reward # No future value at terminal else: target = reward + gamma * np.max(Q[next_state]) Q[state, action] += alpha * (target - Q[state, action]) state = next_state return Q
V(s) tells you the expected return from state s under policy π. But to act, you need to compare actions. V(s) alone doesn't tell you which action leads to which next state (you'd need the transition model p(s'|s,a) to compute that). Q(s,a) directly gives you the value of each action, so you can pick the best one with argmax — no model needed. This is why Q-learning is "model-free": it learns Q directly without ever learning the environment dynamics. The relationship is: Vπ(s) = ∑a π(a|s) Qπ(s,a) — the value of a state is the weighted average of its action values under the policy.
SARSA stands for State, Action, Reward, State, Action — exactly what the update uses. Unlike Q-learning, SARSA updates toward the Q-value of the action the agent actually took next, not the maximum. This makes it on-policy: it evaluates and improves the policy it's currently following.
The practical difference? SARSA is more conservative. If the agent explores and sometimes steps into danger, SARSA's values reflect that danger. Q-learning's values assume you'll act optimally after, so they're more optimistic. Near cliffs and traps, SARSA learns to stay further away.
| Q-Learning | SARSA | |
|---|---|---|
| Policy | Off-policy | On-policy |
| Update target | max_a' Q(s',a') | Q(s', a') where a' is the actual next action |
| Behavior | Optimistic (assumes optimal future) | Realistic (accounts for exploration) |
| Near danger | Risky paths | Safer paths |
Compare the paths learned by Q-learning (teal) and SARSA (orange). The bottom row is a cliff — falling gives -100. Notice SARSA takes the safer upper path.
Q-learning with a table works for small state spaces. But what about Atari games with millions of possible screen configurations? You can't have a table entry for every pixel combination. The breakthrough idea from DeepMind (2015): replace the Q-table with a neural network.
The network takes a state (raw pixels) as input and outputs Q-values for each possible action. But training is tricky — naive approaches diverge because of two problems:
The neural network is the Q-table. Instead of a [|S|, |A|] array (impossible for pixel inputs), a CNN maps raw frames to Q-values:
Architecture # Input: 4 stacked grayscale frames (motion information) State: [4, 84, 84] # 4 frames × 84×84 pixels = 28,224 values ↓ Conv2D(32, 8×8, stride 4) # → [32, 20, 20] Conv2D(64, 4×4, stride 2) # → [64, 9, 9] Conv2D(64, 3×3, stride 1) # → [64, 7, 7] Flatten # → [3136] Dense(512) # → [512] (learned features) Dense(18) # → [18] (one Q-value per action) # Action = argmax over these 18 Q-values
The replay buffer stores transitions as tuples (s, a, r, s', done). For Atari: 1 million transitions, each containing two frame stacks of [4, 84, 84]. Naive storage: 1M × (28,224 + 1 + 1 + 28,224 + 1) bytes ≈ 56 GB. In practice, you store each unique frame once and reference them by index — dropping memory to about 7 GB. Training samples a random mini-batch (typically 32) from this buffer, breaking temporal correlations.
Watch transitions (s, a, r, s') flow into the replay buffer. Training samples are drawn randomly, breaking temporal correlations.
Everything so far learned a value function and derived a policy from it. But what if we optimized the policy directly? Policy gradient methods parameterize the policy as π_θ(a|s) and use gradient ascent to maximize expected reward.
The REINFORCE algorithm is the simplest policy gradient method. The key insight: the gradient of the expected return has a beautiful form called the score function estimator:
Translation: increase the probability of actions that led to high returns; decrease it for actions that led to low returns. It's intuitive, unbiased, but has high variance — returns can vary wildly between episodes, making gradients noisy.
| Q-Learning (DQN) | Policy Gradient (REINFORCE) | |
|---|---|---|
| Input | State s → network | State s → network |
| Output | Q-values [|A|] (one per action) | Action probabilities [|A|] (softmax) |
| Action selection | argmax Q(s, a) + ε-greedy | Sample from π(a|s) |
| Loss function | MSE: (r + γ max Q(s',a') − Q(s,a))² | −log π(a|s) · Gt |
| What's optimized | Q-values match Bellman target | Probability of good actions increases |
| Data reuse | Yes (replay buffer, off-policy) | No (on-policy, data discarded after update) |
| Action space | Discrete only (need argmax) | Discrete or continuous |
The policy π(a) is a distribution over actions. After a good episode, the distribution shifts toward the actions taken. After a bad episode, it shifts away.
The objective is to maximize expected return: J(θ) = Eτ~πθ[R(τ)] where τ is a trajectory (s0, a0, r0, s1, ...) and R(τ) = ∑rt.
The probability of a trajectory is: p(τ) = p(s0) ∏t πθ(at|st) p(st+1|st,at)
Your task: Show that ∇θJ(θ) = E[∑t ∇θ log πθ(at|st) · R(τ)]. Why do the environment dynamics p(s'|s,a) disappear from the gradient?
Full derivation:
1. ∇J(θ) = ∇ ∫ p(τ) R(τ) dτ
2. = ∫ ∇p(τ) R(τ) dτ (move gradient inside integral)
3. = ∫ p(τ) ∇log p(τ) R(τ) dτ (log-derivative trick: ∇p = p · ∇log p)
4. = Eτ[∇log p(τ) · R(τ)] (definition of expectation)
5. Now expand: ∇θ log p(τ) = ∇θ[log p(s0) + ∑ log πθ(at|st) + ∑ log p(s'|s,a)]
6. = 0 + ∑t ∇θ log πθ(at|st) + 0
7. Therefore: ∇J(θ) = Eτ[(∑t ∇ log πθ(at|st)) · R(τ)]
The key insight: The log-derivative trick converts a gradient of an integral (intractable) into an expectation of a gradient (estimable with samples). The environment dynamics vanish because they don't depend on θ. This means we can compute policy gradients without EVER knowing how the environment works — we just need to sample trajectories and compute ∇ log π. This is profoundly elegant: to improve your policy, you only need to know your own action probabilities.
REINFORCE uses the full return G_t as a signal, which is high variance. What if we replaced G_t with a learned value estimate? That's the actor-critic architecture: the actor is the policy (decides what to do), and the critic is the value function (evaluates how good the state is).
The critic provides a low-variance baseline, dramatically stabilizing training. The actor updates using the advantage A(s,a) = Q(s,a) - V(s), which tells us: "was this action better or worse than average?"
Architecture # Shared backbone (or separate networks) State s → [Encoder] → features [256] ↓ ↓ Actor head Critic head Dense(256) → [|A|] Dense(256) → [1] softmax → π(a|s) linear → V(s) # Actor loss: -log π(a|s) × advantage # Critic loss: (r + γ V(s') - V(s))² # Both trained jointly with a single optimizer
The advantage A(s,a) = r + γ V(s') − V(s) is the TD error from Chapter 3. Positive advantage means "this action was better than expected" — increase its probability. Negative means "worse than expected" — decrease it. The critic provides this signal, replacing the high-variance full return Gt from REINFORCE.
The actor and critic train together. Watch how the critic's value estimate (teal) stabilizes the actor's policy updates (orange), reducing variance compared to pure REINFORCE.
REINFORCE uses ∇J = E[∇ log π(a|s) · Gt]. Actor-critic replaces Gt with the advantage A(s,a) = Q(s,a) − V(s).
Your task: (1) Show that subtracting any baseline b(s) that depends only on state does NOT change the expected gradient (it's still unbiased). (2) Show that V(s) is the optimal baseline (minimizes variance).
Part 1: Baseline doesn't change expected gradient
Ea~π[∇ log π(a|s) · b(s)] = b(s) · ∑a π(a|s) · ∇ log π(a|s)
= b(s) · ∑a π(a|s) · ∇π(a|s)/π(a|s) = b(s) · ∑a ∇π(a|s)
= b(s) · ∇ ∑a π(a|s) = b(s) · ∇(1) = 0
So subtracting b(s) from the return leaves the expected gradient unchanged. The gradient is still unbiased.
Part 2: V(s) minimizes variance
Var[∇ log π · (Q − b)] is minimized when b = E[Q(s,a)] = V(s). This is the standard result that the optimal control variate equals the conditional mean. Intuitively: centering around the mean minimizes the squared magnitude of the signal.
The key insight: The advantage function doesn't just reduce variance — it gives the gradient a sign. Without it, REINFORCE says "do more of everything that worked." With it, actor-critic says "do MORE of above-average actions, LESS of below-average ones." This directional signal is why actor-critic converges so much faster in practice.
Actor-critic methods can be unstable: a bad gradient step can ruin the policy, and you can't undo it. PPO (Schulman et al., 2017) solves this with a brilliantly simple idea: clip the policy update so it can't change too much in one step.
PPO computes the probability ratio r(θ) = π_new(a|s) / π_old(a|s). If the new policy is too different from the old one (ratio far from 1), the objective is clipped, preventing the update from going further.
Drag the ratio slider to see how clipping constrains the objective. The teal region is the clipping zone [1−ε, 1+ε]. The orange curve is the actual objective used.
Python # PPO collects a batch of trajectories, then does multiple # gradient steps on the SAME data (unlike REINFORCE which # uses data once and throws it away) for iteration in range(1000): # 1. Collect 2048 steps across 8 parallel environments batch = collect_rollouts(envs, policy, n_steps=2048) # batch contains: states, actions, rewards, log_probs_old, values # 2. Compute advantages using GAE (Generalized Advantage Estimation) advantages = compute_gae(batch, gamma=0.99, lam=0.95) # 3. Multiple epochs on the same batch (the key PPO trick) for epoch in range(4): for mini_batch in split(batch, size=64): ratio = exp(log_prob_new - log_prob_old) clipped = clip(ratio, 1-0.2, 1+0.2) loss = -min(ratio * adv, clipped * adv) optimizer.step(loss)
What if the agent was rewarded not just for getting high returns, but also for being as random as possible? That's the idea behind maximum entropy RL. The SAC objective adds an entropy bonus to the reward: explore as much as you can while still maximizing return.
The temperature parameter α controls the trade-off. High α means "explore aggressively" (policy stays broad). Low α means "focus on reward" (policy becomes peaked). SAC can even learn α automatically.
Adjust the temperature α to see how the policy distribution changes. Low α = peaked (exploit). High α = broad (explore).
| Feature | SAC | PPO |
|---|---|---|
| Policy type | Off-policy (replay buffer) | On-policy (no replay) |
| Action space | Continuous (native) | Both |
| Sample efficiency | High (reuses data) | Lower (data discarded) |
| Exploration | Built-in via entropy | Via stochastic policy |
| Stability | Very stable | Stable (clipping) |
You now know the major families of RL algorithms. But when should you use which? The landscape splits along several axes: model-free vs model-based, on-policy vs off-policy, value-based vs policy-based.
| Algorithm | Type | Policy | Best for |
|---|---|---|---|
| Q-Learning | Value-based | Off-policy | Small discrete spaces |
| SARSA | Value-based | On-policy | Safety-sensitive tasks |
| DQN | Value-based | Off-policy | Large discrete spaces (Atari) |
| REINFORCE | Policy-based | On-policy | Simple continuous tasks |
| A2C/A3C | Actor-Critic | On-policy | Parallel training |
| PPO | Actor-Critic | On-policy | General purpose, RLHF |
| SAC | Actor-Critic | Off-policy | Continuous control, robotics |
The algorithms in this lesson vary by orders of magnitude in how much experience they need:
| Algorithm | Samples to solve CartPole | Samples for Atari (superhuman) |
|---|---|---|
| DQN (off-policy) | ~10K steps | ~50M frames (~40 hours of play) |
| PPO (on-policy) | ~30K steps | ~200M frames (~150 hours of play) |
| SAC (off-policy) | ~5K steps | N/A (continuous actions) |
| Dreamer (model-based) | ~2K steps | ~2M frames (~1.5 hours of play) |
Off-policy methods (DQN, SAC) reuse data from the replay buffer, so they need fewer environment interactions. On-policy methods (PPO) throw away data after each update, but their gradients are less biased. Model-based methods are the most sample-efficient because they can "imagine" experience — but building an accurate world model is itself a hard problem.
An interactive guide: answer questions about your problem to find the right algorithm.
Real-world approach (circa 2023-2024):
Algorithm: SAC — Off-policy (critical for data reuse with expensive real robot), native continuous actions, entropy bonus aids exploration. TD3 is also viable but SAC's auto-tuned temperature is one less hyperparameter.
Sparse reward: Hindsight Experience Replay (HER) — After a failed episode (object ends at position X instead of target Y), pretend X WAS the goal and store a "successful" transition. This converts 0% success rate into 100% self-generated curriculum. OpenAI showed this enables learning with purely sparse rewards.
Sim-to-real: Domain Randomization + fine-tuning — In sim, randomize friction (0.3-1.0), object mass (50-500g), observation noise (σ=0.01), and action delay (0-2 steps). Train to convergence in sim (~5M steps overnight). Then fine-tune on real robot with frozen early layers, reduced learning rate, and the 100 real episodes. The randomization makes the sim policy robust enough that fine-tuning mostly calibrates rather than re-learns.
Reward shaping: Add dense reward: -||gripper - object|| (reaching) + grasp_bonus + -||object - target|| (placing). Risk: the agent may learn to hover near the object (maximizes reaching reward) without grasping. Curriculum: start with reaching only, add grasping criterion, then placing.
The Decision Transformer (2021) reframes RL as conditional sequence modeling. Instead of learning a value function or policy gradient, it trains a Transformer on trajectories: (return-to-go, state, action, return-to-go, state, action, ...). At test time, you condition on a HIGH desired return and the model outputs actions that achieve it. No Bellman equation, no bootstrapping, no exploration — just next-token prediction. This works because Transformers are powerful enough to implicitly learn the "stitching" that value functions provide: combining sub-optimal trajectory segments into optimal behavior.
If RL can be reduced to sequence modeling, what does that suggest about the relationship between "understanding the world" (prediction) and "acting in the world" (control)?
You now understand the full arc from bandits to PPO. Every time an AI agent learns from interaction, these algorithms are at work.