The paradigm that taught machines to play Atari, beat world champions at Go, and align large language models with human preferences.
Reinforcement learning is about an agent interacting with an environment. At each timestep, the agent observes the current state, takes an action, receives a reward, and transitions to a new state. The goal is deceptively simple: maximize cumulative reward over time.
This is different from supervised learning (where you have labeled examples) and unsupervised learning (where you find patterns). In RL, the agent must discover which actions lead to high rewards through trial and error. Nobody tells the agent the right answer — it must figure it out.
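The observe–act–reward loop above can be sketched in a few lines. This is a minimal toy (the `CoinFlipEnv` class and its 0.7 payout probability are made up for illustration, not a real library), but the loop structure is exactly what every RL algorithm in this article lives inside:

```python
import random

random.seed(42)

class CoinFlipEnv:
    """Toy environment: action 1 pays +1 with probability 0.7, else 0."""
    def reset(self):
        return 0  # a single dummy state

    def step(self, action):
        reward = 1 if (action == 1 and random.random() < 0.7) else 0
        return 0, reward, False  # next_state, reward, done

env = CoinFlipEnv()
state = env.reset()
total_reward = 0
for t in range(100):
    action = random.choice([0, 1])         # a random policy, for now
    state, reward, done = env.step(action)
    total_reward += reward                 # the agent's goal: maximize this
```

Every algorithm that follows is just a smarter way of choosing `action` inside that loop.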
Watch the agent (green) navigate a simple world. It receives +1 for reaching the goal and -1 for hitting walls. The cumulative reward is tracked at the top.
Before we learn any algorithms, we must understand the fundamental dilemma: exploration vs. exploitation. Should you go to your favorite restaurant (exploit what you know), or try a new place (explore to find something better)?
The multi-armed bandit problem distills this perfectly. You face N slot machines, each with an unknown payout probability. You want to maximize total winnings. If you always pull the lever that looks best so far, you might miss a better one. If you explore too much, you waste pulls on bad machines.
Click a lever to pull it, or let the ε-greedy agent play. Watch the regret (reward lost vs. always picking the best arm) accumulate.
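The ε-greedy agent in the demo fits in a dozen lines. Here is a sketch with three arms whose payout probabilities (0.2, 0.5, 0.8) are made up for illustration; the agent explores with probability ε and otherwise exploits its current best estimate, updating estimates with an incremental average:

```python
import random

random.seed(0)
true_probs = [0.2, 0.5, 0.8]     # unknown to the agent
counts = [0, 0, 0]
estimates = [0.0, 0.0, 0.0]
epsilon = 0.1

for t in range(5000):
    if random.random() < epsilon:                    # explore: random arm
        arm = random.randrange(len(true_probs))
    else:                                            # exploit: best estimate
        arm = max(range(len(true_probs)), key=lambda a: estimates[a])
    reward = 1 if random.random() < true_probs[arm] else 0
    counts[arm] += 1
    # incremental average: estimate += (reward - estimate) / n
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

best_arm = max(range(3), key=lambda a: estimates[a])
```

After a few thousand pulls the estimates settle near the true probabilities and the greedy choice locks onto the best arm, while the ε fraction of random pulls is the price paid to discover it.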
The simplest approach to estimating how good a state is: play out full episodes, then average the returns you got. If you visited state S and ended up with a total return of 7, and another time you got 3, your estimate of V(S) is (7+3)/2 = 5. That's it. No fancy math.
There are two flavors: first-visit MC (only count the first time you visit a state in an episode) and every-visit MC (count every visit). Both converge to the true value as you play more episodes.
Run episodes to see MC value estimates converge. Each cell shows the estimated value of that state. More episodes = better estimates.
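First-visit MC really is just "average the returns." A sketch on a made-up 1-D walk (states 0 to 4, goal at 4, reward -1 per step, no discounting; the environment is illustrative, not from the demo): the return from a state is minus the number of steps taken after first visiting it, and we average those over episodes.

```python
import random

random.seed(1)
N_STATES = 5
returns = {s: [] for s in range(N_STATES)}   # observed returns per state

for _ in range(2000):
    s, trajectory = 0, []
    while s != N_STATES - 1:                 # state 4 is terminal
        trajectory.append(s)
        s += random.choice([0, 1])           # stay or move right
    T = len(trajectory)                      # total steps this episode
    seen = set()
    for i, s in enumerate(trajectory):
        if s not in seen:                    # first-visit: count once per episode
            seen.add(s)
            returns[s].append(-(T - i))      # reward -1 per remaining step

V = {s: sum(rs) / len(rs) for s, rs in returns.items() if rs}
```

Since each step moves right with probability 1/2, the expected number of steps from state s is 2·(4−s), so the estimates converge to V(3) ≈ −2, V(2) ≈ −4, and so on.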
What if you could learn before an episode ends? Temporal Difference (TD) learning does exactly that. Instead of waiting for the full return G, it uses an estimate: the immediate reward plus the current guess of the next state's value.
The TD(0) update rule shifts the value of the current state toward a bootstrapped target: r + γV(s'). The difference between this target and our current estimate is the TD error δ.
| Symbol | Meaning |
|---|---|
| α | Learning rate (step size) |
| γ | Discount factor (how much we value future rewards) |
| δ | TD error: r + γV(s') − V(s) |
| V(s') | Current estimate of the next state's value |
Watch how TD(0) (teal) converges faster than MC (orange) on a random walk problem. TD learns from each step; MC waits until the end.
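The TD(0) update is a one-liner inside the episode loop. Here is a sketch on the same random-walk setup as the demo (5 states, start in the middle, +1 for exiting right, 0 for exiting left; α and the episode count are illustrative choices):

```python
import random

random.seed(0)
alpha, gamma = 0.1, 1.0
V = [0.0] * 5                       # states 0 and 4 are terminal, V stays 0

for episode in range(2000):
    s = 2                           # start in the middle
    while True:
        s_next = s + random.choice([-1, 1])
        r = 1.0 if s_next == 4 else 0.0
        terminal = s_next in (0, 4)
        target = r if terminal else r + gamma * V[s_next]   # bootstrapped target
        V[s] += alpha * (target - V[s])   # shift V(s) by alpha * TD error
        if terminal:
            break
        s = s_next
```

For this walk the true values are 1/4, 1/2, 3/4 at states 1, 2, 3, and the estimates hover around them. Note the update fires at every step, which is exactly why TD can learn mid-episode while MC must wait for the end.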
Instead of estimating how good a state is (V), what if we estimated how good a state-action pair is? That's the idea behind Q-learning. Q(s, a) tells us the expected return of taking action a in state s and then acting optimally afterward. If we know Q perfectly, we can act optimally by always picking argmax_a Q(s, a).
Q-learning is off-policy: it learns the optimal policy regardless of what the agent actually does. The update uses the maximum Q-value of the next state, not the Q-value of the action the agent actually took.
Watch the Q-table fill in as the agent explores. Green = goal (+10). Red = trap (-10). The arrows show the learned greedy policy. Cell colors show value estimates.
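Tabular Q-learning needs only a dictionary and the update rule. A sketch on a made-up 1-D corridor (states 0 to 5, goal at 5 worth +10, trap at 0 worth -10; grid, rewards, and hyperparameters are illustrative, simpler than the 2-D demo):

```python
import random

random.seed(0)
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(6) for a in (-1, +1)}   # the Q-table

def step(s, a):
    s2 = s + a
    if s2 == 5: return s2, 10.0, True    # goal
    if s2 == 0: return s2, -10.0, True   # trap
    return s2, 0.0, False

for episode in range(500):
    s, done = 2, False
    while not done:
        if random.random() < epsilon:                       # explore
            a = random.choice([-1, 1])
        else:                                               # exploit
            a = max((-1, 1), key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # off-policy target: max over next actions, NOT the action we'll take
        best_next = 0.0 if done else max(Q[(s2, -1)], Q[(s2, +1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

greedy = {s: max((-1, 1), key=lambda a: Q[(s, a)]) for s in range(1, 5)}
```

After training, the greedy policy points right (toward the goal) in every non-terminal state, even though the agent spent some of its steps exploring leftward.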
SARSA stands for State, Action, Reward, State, Action — exactly what the update uses. Unlike Q-learning, SARSA updates toward the Q-value of the action the agent actually took next, not the maximum. This makes it on-policy: it evaluates and improves the policy it's currently following.
The practical difference? SARSA is more conservative. If the agent explores and sometimes steps into danger, SARSA's values reflect that danger. Q-learning's values assume the agent will act optimally afterward, so they're more optimistic. Near cliffs and traps, SARSA learns to stay further away.
| | Q-Learning | SARSA |
|---|---|---|
| Policy | Off-policy | On-policy |
| Update target | max_a' Q(s',a') | Q(s', a') where a' is the actual next action |
| Behavior | Optimistic (assumes optimal future) | Realistic (accounts for exploration) |
| Near danger | Risky paths | Safer paths |
Compare the paths learned by Q-learning (teal) and SARSA (orange). The bottom row is a cliff — falling gives -100. Notice SARSA takes the safer upper path.
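The two update rules differ in a single line. A minimal side-by-side sketch (generic helper functions for illustration, not tied to a particular environment):

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # bootstrap from the BEST next action, regardless of what the agent does
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # bootstrap from the action a2 the agent ACTUALLY takes next
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

When the agent's next action is exploratory (and possibly bad), SARSA's target includes that fact and Q-learning's does not, which is the entire source of the cliff-walking difference above.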
Q-learning with a table works for small state spaces. But what about Atari games with millions of possible screen configurations? You can't have a table entry for every pixel combination. The breakthrough idea from DeepMind (2015): replace the Q-table with a neural network.
The network takes a state (raw pixels) as input and outputs Q-values for each possible action. But training is tricky: naive approaches diverge because of two problems. First, consecutive transitions are highly correlated, which violates the independence assumption behind stochastic gradient descent. Second, the learning target r + γ max Q(s', a') depends on the very network being updated, so the target keeps moving. DQN addresses these with an experience replay buffer and a separate, periodically synced target network.
Watch transitions (s, a, r, s') flow into the replay buffer. Training samples are drawn randomly, breaking temporal correlations.
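A replay buffer is small enough to write out in full. This sketch stores (s, a, r, s', done) tuples in a bounded deque and samples uniformly at random; the capacity and batch size are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # uniform random draw: consecutive steps rarely land in one batch,
        # which breaks the temporal correlation that destabilizes training
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(500):
    buf.push(t, 0, 0.0, t + 1, False)          # dummy transitions
batch = buf.sample(32)
```

The bounded `maxlen` also means the buffer slowly forgets transitions generated by long-dead versions of the policy.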
Everything so far learned a value function and derived a policy from it. But what if we optimized the policy directly? Policy gradient methods parameterize the policy as π_θ(a|s) and use gradient ascent to maximize expected reward.
The REINFORCE algorithm is the simplest policy gradient method. The key insight: the gradient of the expected return has a beautiful form called the score function estimator: ∇θ J(θ) = E[G_t ∇θ log π_θ(a_t|s_t)].
Translation: increase the probability of actions that led to high returns; decrease it for actions that led to low returns. It's intuitive, unbiased, but has high variance — returns can vary wildly between episodes, making gradients noisy.
The policy π(a) is a distribution over actions. After a good episode, the distribution shifts toward the actions taken. After a bad episode, it shifts away.
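REINFORCE on a two-armed bandit makes the shift concrete. A sketch with a softmax policy over one logit per action; the payout probabilities, learning rate, and episode count are illustrative. For a softmax, ∇ log π(a) works out to 1 − π(i) for the taken action and −π(i) for the others, which is the gradient used below:

```python
import math, random

random.seed(0)
theta = [0.0, 0.0]                  # one logit per action
true_probs = [0.2, 0.8]             # action 1 pays off more often
lr = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    G = 1.0 if random.random() < true_probs[a] else 0.0    # episode return
    for i in range(2):
        grad = (1 - probs[i]) if i == a else -probs[i]     # grad of log pi
        theta[i] += lr * G * grad   # push probability toward high-return actions

final_probs = softmax(theta)
```

The policy drifts toward action 1, but notice the noise source: a single lucky pull of the worse arm still pushes probability toward it. That per-episode variance is exactly what the next section's critic is for.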
REINFORCE uses the full return G_t as a signal, which is high variance. What if we replaced G_t with a learned value estimate? That's the actor-critic architecture: the actor is the policy (decides what to do), and the critic is the value function (evaluates how good the state is).
The critic provides a low-variance baseline, dramatically stabilizing training. The actor updates using the advantage A(s,a) = Q(s,a) - V(s), which tells us: "was this action better or worse than average?"
The actor and critic train together. Watch how the critic's value estimate (teal) stabilizes the actor's policy updates (orange), reducing variance compared to pure REINFORCE.
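In the simplest one-state setting, the critic reduces to a single learned baseline, and "actor-critic" becomes REINFORCE with the advantage G − V in place of the raw return. A sketch (same illustrative bandit as before; both learning rates are made-up choices):

```python
import math, random

random.seed(0)
theta = [0.0, 0.0]                  # actor: one logit per action
V = 0.0                             # critic: learned value of the lone state
true_probs = [0.2, 0.8]
actor_lr, critic_lr = 0.1, 0.05

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return [e / sum(exps) for e in exps]

for episode in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    G = 1.0 if random.random() < true_probs[a] else 0.0
    advantage = G - V               # "was this better or worse than average?"
    for i in range(2):
        grad = (1 - probs[i]) if i == a else -probs[i]
        theta[i] += actor_lr * advantage * grad    # actor update
    V += critic_lr * (G - V)        # critic tracks the mean return
```

Once V settles near the average return, an ordinary outcome produces an advantage near zero and barely moves the actor; only genuinely better-or-worse-than-expected outcomes drive updates, which is the variance reduction the demo visualizes.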
Actor-critic methods can be unstable: a bad gradient step can ruin the policy, and you can't undo it. PPO (Schulman et al., 2017) solves this with a brilliantly simple idea: clip the policy update so it can't change too much in one step.
PPO computes the probability ratio r(θ) = π_new(a|s) / π_old(a|s). If the new policy is too different from the old one (ratio far from 1), the objective is clipped, preventing the update from going further.
Drag the ratio slider to see how clipping constrains the objective. The teal region is the clipping zone [1−ε, 1+ε]. The orange curve is the actual objective used.
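The clipped surrogate for one sample is a three-line function; the ε value and sample numbers below are illustrative:

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    clipped = max(1 - epsilon, min(1 + epsilon, ratio))
    # take the min: clipping can only make the objective smaller, so there
    # is no incentive to push the ratio far outside [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)

print(ppo_clip_objective(1.0, 2.0))   # 2.0  (ratio 1: no clipping)
print(ppo_clip_objective(1.5, 2.0))   # 2.4  (gain capped at 1.2 * 2.0)
print(ppo_clip_objective(0.5, -2.0))  # -1.6 (min picks the pessimistic clipped term)
```

Once the ratio leaves the clip zone in the direction the advantage favors, the objective becomes flat in the ratio, so its gradient is zero and the policy stops moving for that sample.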
What if the agent was rewarded not just for getting high returns, but also for being as random as possible? That's the idea behind maximum entropy RL. The SAC objective adds an entropy bonus to the reward: explore as much as you can while still maximizing return.
The temperature parameter α controls the trade-off. High α means "explore aggressively" (policy stays broad). Low α means "focus on reward" (policy becomes peaked). SAC can even learn α automatically.
Adjust the temperature α to see how the policy distribution changes. Low α = peaked (exploit). High α = broad (explore).
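The effect of α on the action distribution can be seen with a Boltzmann-style policy over a handful of Q-values (the Q-values here are made up for illustration; SAC's actual policy is a learned network, but the temperature's role is the same):

```python
import math

def policy(q_values, alpha):
    # pi(a) proportional to exp(Q(a) / alpha):
    # small alpha sharpens the distribution, large alpha flattens it
    scaled = [q / alpha for q in q_values]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 3.0]
p_low = policy(q, 0.1)    # low alpha: nearly deterministic on the best action
p_high = policy(q, 10.0)  # high alpha: nearly uniform, maximum exploration
```

Dividing by α before the softmax is why annealing α from high to low smoothly trades exploration for exploitation.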
| Feature | SAC | PPO |
|---|---|---|
| Policy type | Off-policy (replay buffer) | On-policy (no replay) |
| Action space | Continuous (native) | Both |
| Sample efficiency | High (reuses data) | Lower (data discarded) |
| Exploration | Built-in via entropy | Via stochastic policy |
| Stability | Very stable | Stable (clipping) |
You now know the major families of RL algorithms. But when should you use which? The landscape splits along several axes: model-free vs model-based, on-policy vs off-policy, value-based vs policy-based.
| Algorithm | Type | Policy | Best for |
|---|---|---|---|
| Q-Learning | Value-based | Off-policy | Small discrete spaces |
| SARSA | Value-based | On-policy | Safety-sensitive tasks |
| DQN | Value-based | Off-policy | Large discrete spaces (Atari) |
| REINFORCE | Policy-based | On-policy | Simple continuous tasks |
| A2C/A3C | Actor-Critic | On-policy | Parallel training |
| PPO | Actor-Critic | On-policy | General purpose, RLHF |
| SAC | Actor-Critic | Off-policy | Continuous control, robotics |
An interactive guide: answer questions about your problem to find the right algorithm.
You now understand the full arc from bandits to PPO. Every time an AI agent learns from interaction, these algorithms are at work.