The Complete Beginner's Path

Understand Reinforcement Learning

The paradigm that taught machines to play Atari, beat world champions at Go, and align large language models with human preferences.

Prerequisites: basic probability and an intuition for optimization. That's it.
12 Chapters · 12+ Simulations · 0 Assumed Knowledge

Chapter 0: The RL Problem

Reinforcement learning is about an agent interacting with an environment. At each timestep, the agent observes the current state, takes an action, receives a reward, and transitions to a new state. The goal is deceptively simple: maximize cumulative reward over time.

This is different from supervised learning (where you have labeled examples) and unsupervised learning (where you find patterns). In RL, the agent must discover which actions lead to high rewards through trial and error. Nobody tells the agent the right answer — it must figure it out.

The core loop: State → Action → Reward → New State → repeat. The agent's policy π(a|s) maps states to actions. The goal is to find the policy that maximizes the expected sum of future rewards.
Agent
Observes state s, chooses action a via policy π
Environment
Returns reward r and next state s'
Objective
Maximize G = r₁ + γr₂ + γ²r₃ + …
G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = ∑_{k=0}^{∞} γ^k r_{t+k+1}
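The loop and the return formula can be sketched in a few lines of Python. The environment here is a made-up five-state corridor (reward +1 for reaching the rightmost state) and the policy is purely random — a toy for illustration, not the grid simulation in this chapter:

```python
import random

def step(state, action):
    """Environment: return (next_state, reward, done) for a 5-state corridor."""
    next_state = max(0, min(4, state + action))   # walls at both ends
    reward = 1.0 if next_state == 4 else 0.0      # +1 only at the goal
    return next_state, reward, next_state == 4

def discounted_return(rewards, gamma=0.9):
    """G = r_1 + γ r_2 + γ² r_3 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

random.seed(0)
state, rewards, done = 0, [], False
while not done:
    action = random.choice([-1, 1])               # a random policy π
    state, reward, done = step(state, action)
    rewards.append(reward)

G = discounted_return(rewards)
```

Every reward is zero until the final step, so the discounted return is just γ raised to the number of steps taken before the goal — a concrete instance of why earlier rewards are worth more.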
The RL Loop

Watch the agent (green) navigate a simple world. It receives +1 for reaching the goal and -1 for walls. The cumulative reward tracks at the top.

Check: How does RL differ from supervised learning?

Chapter 1: Exploration vs Exploitation

Before we learn any algorithms, we must understand the fundamental dilemma: exploration vs exploitation. Should you go to your favorite restaurant (exploit what you know), or try a new place (explore to find something better)?

The multi-armed bandit problem distills this perfectly. You face N slot machines, each with an unknown payout probability. You want to maximize total winnings. If you always pull the lever that looks best so far, you might miss a better one. If you explore too much, you waste pulls on bad machines.

ε-greedy strategy: With probability ε pick a random lever (explore); with probability 1−ε pick the best-known lever (exploit). Simple, effective, and the foundation for algorithms we'll see later.
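Here is ε-greedy on a Bernoulli bandit, sketched in Python. The three payout probabilities are made up for illustration; the incremental-mean update keeps a running estimate of each arm's value without storing every reward:

```python
import random

def pull(arm, probs, rng):
    """Bernoulli bandit: the arm pays 1 with its hidden probability, else 0."""
    return 1.0 if rng.random() < probs[arm] else 0.0

def epsilon_greedy(probs, epsilon=0.1, pulls=5000, seed=0):
    rng = random.Random(seed)
    n = len(probs)
    counts = [0] * n        # times each arm was pulled
    values = [0.0] * n      # running mean reward per arm
    total = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                        # explore
        else:
            arm = max(range(n), key=lambda a: values[a])  # exploit
        r = pull(arm, probs, rng)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]    # incremental mean
        total += r
    return values, total

values, total = epsilon_greedy([0.2, 0.5, 0.8])   # hidden arm probabilities
```

After a few thousand pulls, the estimate for the best arm converges near its true payout probability, and the agent spends most of its pulls there.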
Multi-Armed Bandit

Click a lever to pull it, or let the ε-greedy agent play. Watch the regret (reward lost vs. always picking the best arm) accumulate.

Epsilon ε = 0.10
Key insight: Too little exploration (ε≈0) means you get stuck on a mediocre arm. Too much exploration (ε≈0.5) wastes half your pulls on random arms. The sweet spot for most problems is ε between 0.05 and 0.15.
Check: What does ε-greedy do with probability ε?

Chapter 2: Monte Carlo Methods

The simplest approach to estimating how good a state is: play out full episodes, then average the returns you got. If you visited state S and ended up with a total return of 7, and another time you got 3, your estimate of V(S) is (7+3)/2 = 5. That's it. No fancy math.

There are two flavors: first-visit MC (only count the first time you visit a state in an episode) and every-visit MC (count every visit). Both converge to the true value as you play more episodes.

V(s) ≈ (1/N) ∑_{i=1}^{N} G_i    (average the returns from the N episodes in which s was visited)
The catch: Monte Carlo requires complete episodes. If your task is continuous (a robot that runs forever) or episodes are very long, you have to wait until the end before learning anything. This is the main limitation.
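First-visit MC fits in a dozen lines. The episode generator below is a made-up two-state chain whose final return is randomly 3 or 7, matching the (7+3)/2 example above — averaging many episodes should recover a value near 5:

```python
import random

def run_episode(rng):
    """Toy episode: visit states A then B, end with a noisy return (3 or 7).
    With γ=1 and no intermediate rewards, both states see the same return."""
    final_return = rng.choice([3.0, 7.0])
    return [("A", final_return), ("B", final_return)]

def first_visit_mc(episodes=2000, seed=0):
    rng = random.Random(seed)
    totals, counts = {}, {}
    for _ in range(episodes):
        seen = set()
        for s, G in run_episode(rng):
            if s in seen:          # first-visit rule: count s once per episode
                continue
            seen.add(s)
            totals[s] = totals.get(s, 0.0) + G
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

V = first_visit_mc()
```

Note that learning happens only after `run_episode` returns — the code structurally cannot update mid-episode, which is exactly the limitation described above.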
Monte Carlo Value Estimation

Run episodes to see MC value estimates converge. Each cell shows the estimated value of that state. More episodes = better estimates.

Check: What is the main limitation of Monte Carlo methods?

Chapter 3: TD Learning — Learning from Incomplete Episodes

What if you could learn before an episode ends? Temporal Difference (TD) learning does exactly that. Instead of waiting for the full return G, it uses an estimate: the immediate reward plus the current guess of the next state's value.

The TD(0) update rule shifts the value of the current state toward a bootstrapped target: r + γV(s'). The difference between this target and our current estimate is the TD error δ.

V(s) ← V(s) + α [r + γV(s') − V(s)]
Bootstrapping means using your own estimates to improve your estimates. It's like learning to cook by tasting as you go, rather than waiting until the meal is done. It introduces some bias but dramatically reduces variance compared to MC.
Symbol | Meaning
α | Learning rate (step size)
γ | Discount factor (how much we value future rewards)
δ | TD error: r + γV(s') − V(s)
V(s') | Current estimate of the next state's value
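TD(0) in code, on the classic five-state random walk: start in the middle state, step left or right at random, exit right for +1. The true values are V(k) = k/6, which lets us check convergence. Note the update fires on every step, inside the episode:

```python
import random

def td0_random_walk(episodes=20000, alpha=0.02, seed=0):
    """TD(0) on a 5-state random walk (states 1..5, terminals at 0 and 6).
    Exiting right pays +1; the true values are V(k) = k/6."""
    rng = random.Random(seed)
    V = {s: 0.5 for s in range(1, 6)}   # neutral initial guess
    for _ in range(episodes):
        s = 3                            # start in the middle
        while True:
            s2 = s + rng.choice([-1, 1])
            r = 1.0 if s2 == 6 else 0.0
            v_next = 0.0 if s2 in (0, 6) else V[s2]   # terminal states have value 0
            V[s] += alpha * (r + v_next - V[s])       # δ = r + γV(s') − V(s), γ=1
            if s2 in (0, 6):
                break
            s = s2
    return V

V = td0_random_walk()
```

The learned values line up monotonically from left to right, as they must: states closer to the +1 exit are worth more.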
TD(0) vs Monte Carlo

Watch how TD(0) (teal) converges faster than MC (orange) on a random walk problem. TD learns from each step; MC waits until the end.

Learning rate α = 0.10
Check: What does "bootstrapping" mean in TD learning?

Chapter 4: Q-Learning — The Classic

Instead of estimating how good a state is (V), what if we estimated how good a state-action pair is? That's Q-learning. Q(s, a) tells us the expected return of taking action a in state s and then acting optimally afterward. If we know Q perfectly, we can act optimally by always picking argmax_a Q(s, a).

Q-learning is off-policy: it learns the optimal policy regardless of what the agent actually does. The update uses the maximum Q-value of the next state, not the Q-value of the action the agent actually took.

Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]
Off-policy magic: Even if the agent explores randomly, Q-learning converges to the optimal Q-values. The update always asks "what's the best I could do from here?" regardless of what the agent actually did. This separates behavior (exploration) from learning (the optimal policy).
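The full tabular algorithm, sketched on a toy five-state corridor with a +10 goal at the right end. The corridor is invented for brevity (the chapter's simulation uses a 2-D grid), but the update line is the real thing — note the max over next actions, regardless of which action the agent takes next:

```python
import random

def q_learning(episodes=500, alpha=0.2, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning on a corridor: states 0..4, actions -1/+1,
    reaching state 4 pays +10 and ends the episode."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        while s != 4:
            if rng.random() < epsilon:
                a = rng.choice([-1, 1])                    # explore
            else:
                a = max((1, -1), key=lambda x: Q[(s, x)])  # exploit (ties go right)
            s2 = max(0, min(4, s + a))
            r = 10.0 if s2 == 4 else 0.0
            best_next = 0.0 if s2 == 4 else max(Q[(s2, -1)], Q[(s2, 1)])
            # off-policy target: the best next action, not the action taken
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
greedy = {s: max((1, -1), key=lambda a: Q[(s, a)]) for s in range(4)}
```

The greedy policy points right everywhere, and the values decay geometrically with distance from the goal: Q(3, R) → 10, Q(2, R) → γ·10 = 9.5, and so on.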
Q-Learning Grid World

Watch the Q-table fill in as the agent explores. Green = goal (+10). Red = trap (-10). The arrows show the learned greedy policy. Cell colors show value estimates.

Epsilon ε = 0.15
Learning rate α = 0.20
Discount γ = 0.95
Experiment: Set ε very low (0.01) — the agent exploits but may never find the goal. Set it high (0.5) — it finds the goal quickly but the policy looks noisy. The sweet spot is around 0.1–0.2. Try changing γ too: low discount makes the agent short-sighted.
Check: Why is Q-learning called "off-policy"?

Chapter 5: SARSA — On-Policy Q-Learning

SARSA stands for State, Action, Reward, State, Action — exactly what the update uses. Unlike Q-learning, SARSA updates toward the Q-value of the action the agent actually took next, not the maximum. This makes it on-policy: it evaluates and improves the policy it's currently following.

Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)]

The practical difference? SARSA is more conservative. If the agent explores and sometimes steps into danger, SARSA's values reflect that danger. Q-learning's values assume you'll act optimally after, so they're more optimistic. Near cliffs and traps, SARSA learns to stay further away.

The Cliff Walk: Imagine a cliff edge. Q-learning finds the shortest path along the cliff (optimal but risky with ε-exploration). SARSA finds a safer path further from the edge, because it accounts for the chance of accidentally stepping off.
 | Q-Learning | SARSA
Policy | Off-policy | On-policy
Update target | max_a' Q(s',a') | Q(s',a'), where a' is the actual next action
Behavior | Optimistic (assumes optimal future) | Realistic (accounts for exploration)
Near danger | Risky paths | Safer paths
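The one-line difference between the two updates, shown on a single hypothetical transition near a cliff. The next-state Q-values and the "slipped" exploratory action are made up for illustration:

```python
def q_learning_target(r, gamma, q_next):
    """Off-policy: back up the best available next action."""
    return r + gamma * max(q_next.values())

def sarsa_target(r, gamma, q_next, a_next):
    """On-policy: back up the action the agent actually took next."""
    return r + gamma * q_next[a_next]

# Hypothetical next-state Q-values near a cliff edge: hugging the cliff
# ("down") looks best if nothing ever goes wrong.
q_next = {"up": 4.0, "down": 5.0}

ql = q_learning_target(r=0.0, gamma=0.9, q_next=q_next)           # backs up 5.0
sa = sarsa_target(r=0.0, gamma=0.9, q_next=q_next, a_next="up")   # backs up 4.0
```

Q-learning's target is larger because it assumes the optimal follow-up; SARSA's reflects what exploration actually did, which is why its values near danger end up lower and its paths safer.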
Q-Learning vs SARSA on a Cliff

Compare the paths learned by Q-learning (teal) and SARSA (orange). The bottom row is a cliff — falling gives -100. Notice SARSA takes the safer upper path.

Check: Why does SARSA learn safer paths than Q-learning near cliffs?

Chapter 6: Deep Q-Networks (DQN)

Q-learning with a table works for small state spaces. But what about Atari games with millions of possible screen configurations? You can't have a table entry for every pixel combination. The breakthrough idea from DeepMind (2015): replace the Q-table with a neural network.

The network takes a state (raw pixels) as input and outputs Q-values for each possible action. But training is tricky — naive approaches diverge because of two problems:

Problem 1: Correlated samples
Consecutive states are similar → biased gradient updates
↓ Solution
Experience Replay
Store transitions in a buffer, sample randomly for training
 
Problem 2: Moving target
The target Q-value changes as the network updates → instability
↓ Solution
Target Network
Use a frozen copy of the network for targets, update periodically
The Atari breakthrough: DQN played 49 Atari games using only raw pixels and scores as input, reaching superhuman performance on many. The same architecture, same hyperparameters — no game-specific engineering. This was the moment RL went from theory to headline news.
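The replay buffer itself is just a bounded queue with uniform random sampling. A minimal sketch (the class name and capacity are illustrative, not from the DQN paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions. Uniform sampling
    breaks the temporal correlation between consecutive states."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off
        self.rng = random.Random(seed)

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(250):                  # push a dummy 250-step trajectory
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)                # a decorrelated training batch
```

Because the deque is capped, only the most recent 100 transitions survive — old experience is gradually forgotten, which also keeps the buffer's contents roughly on-distribution for the current policy.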
Experience Replay Buffer

Watch transitions (s, a, r, s') flow into the replay buffer. Training samples are drawn randomly, breaking temporal correlations.

Check: Why does DQN use experience replay?

Chapter 7: Policy Gradients — REINFORCE

Everything so far learned a value function and derived a policy from it. But what if we optimized the policy directly? Policy gradient methods parameterize the policy as π_θ(a|s) and use gradient ascent to maximize expected reward.

The REINFORCE algorithm is the simplest policy gradient method. The key insight: the gradient of the expected return has a beautiful form called the score function estimator:

∇J(θ) = E[∑ ∇ log π_θ(a_t|s_t) · G_t]

Translation: increase the probability of actions that led to high returns; decrease it for actions that led to low returns. It's intuitive, unbiased, but has high variance — returns can vary wildly between episodes, making gradients noisy.

Why go direct? Value-based methods struggle with continuous action spaces (you can't take argmax over infinite actions). Policy gradients handle continuous actions naturally: the policy is a probability distribution (like a Gaussian) over the action space.
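REINFORCE with a running-mean baseline, sketched on a two-armed bandit treated as a one-step episode (the arm means 0.2 and 0.8 are made up). For a softmax policy, the score function has the closed form ∇_θk log π(a) = 1[a=k] − π(k), which is all the gradient machinery we need:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(steps=5000, lr=0.05, seed=0):
    """REINFORCE on a 2-armed Bernoulli bandit (means 0.2 and 0.8).
    Policy: π_θ = softmax(θ). Score: ∇_θk log π(a) = 1[a=k] − π(k)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    means = [0.2, 0.8]
    baseline = 0.0
    for t in range(steps):
        probs = softmax(theta)
        a = 0 if rng.random() < probs[0] else 1
        G = 1.0 if rng.random() < means[a] else 0.0   # one-step "return"
        baseline += (G - baseline) / (t + 1)          # running-mean baseline
        adv = G - baseline
        for k in range(2):                            # θ ← θ + α ∇logπ (G − b)
            grad_logp = (1.0 if a == k else 0.0) - probs[k]
            theta[k] += lr * grad_logp * adv
    return softmax(theta)

probs = reinforce_bandit()
```

The policy drifts toward the better arm without ever computing a value function — and the baseline subtraction here is exactly the seed of actor-critic methods mentioned below.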
Policy Gradient Intuition

The policy π(a) is a distribution over actions. After a good episode, the distribution shifts toward the actions taken. After a bad episode, it shifts away.

Variance problem: REINFORCE multiplies the gradient by the full return G_t. If G_t varies from -100 to +100 across episodes, your gradient estimates are very noisy. The solution: subtract a baseline (like the average return) to center the signal. This is the seed of actor-critic methods.
Check: What is the main drawback of vanilla REINFORCE?

Chapter 8: Actor-Critic — Best of Both Worlds

REINFORCE uses the full return G_t as a signal, which is high variance. What if we replaced G_t with a learned value estimate? That's the actor-critic architecture: the actor is the policy (decides what to do), and the critic is the value function (evaluates how good the state is).

The critic provides a low-variance baseline, dramatically stabilizing training. The actor updates using the advantage A(s,a) = Q(s,a) - V(s), which tells us: "was this action better or worse than average?"

∇J(θ) = E[∇ log π_θ(a|s) · A(s,a)]
Actor (Policy π_θ)
"What action should I take?" → outputs action probabilities
↓ advantage signal
Critic (Value V_w)
"How good is this state?" → outputs value estimate
A2C vs A3C: A2C (Advantage Actor-Critic) runs multiple environments synchronously. A3C (Asynchronous Advantage Actor-Critic) runs workers in parallel, each updating a shared model asynchronously. A3C was a breakthrough for scaling, but A2C is simpler and often just as good with GPU batching.
Actor-Critic Architecture

The actor and critic train together. Watch how the critic's value estimate (teal) stabilizes the actor's policy updates (orange), reducing variance compared to pure REINFORCE.

Check: What role does the "critic" play in actor-critic?

Chapter 9: PPO — Proximal Policy Optimization

Actor-critic methods can be unstable: a bad gradient step can ruin the policy, and you can't undo it. PPO (Schulman et al., 2017) solves this with a brilliantly simple idea: clip the policy update so it can't change too much in one step.

PPO computes the probability ratio r(θ) = π_new(a|s) / π_old(a|s). If the new policy is too different from the old one (ratio far from 1), the objective is clipped, preventing the update from going further.

L(θ) = E[min(r(θ) A, clip(r(θ), 1−ε, 1+ε) A)]
Why clipping works: If the advantage A is positive (good action), we want to increase the probability — but only up to 1+ε times the old probability. If A is negative (bad action), we want to decrease it — but only down to 1−ε. This creates a "trust region" around the current policy, preventing catastrophic updates.
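The clipped objective for a single sample is a few lines of code:

```python
def ppo_clip(ratio, advantage, eps=0.2):
    """L_clip = min(r·A, clip(r, 1−ε, 1+ε)·A) for one (state, action) sample."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0): the gain is capped once the ratio exceeds 1+ε.
capped = ppo_clip(ratio=2.0, advantage=1.0)    # 1.2·1.0, not 2.0·1.0
# Inside the trust region: clipping is inactive.
inside = ppo_clip(ratio=1.1, advantage=1.0)    # just 1.1
# Bad action (A < 0): the outer min keeps the WORSE (more negative) value,
# so a policy that already over-shot a bad action still gets pushed back.
penalized = ppo_clip(ratio=2.0, advantage=-1.0)  # -2.0, not -1.2
```

The asymmetry in the last case is the subtle part: clipping caps how much the objective can reward a change, but never hides how bad one was.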
PPO Clipped Objective

Drag the ratio slider to see how clipping constrains the objective. The teal region is the clipping zone [1−ε, 1+ε]. The orange curve is the actual objective used.

Ratio r(θ) = 1.00
Clip ε = 0.20
Advantage A = 1.0
L_clip = 1.00
PPO is everywhere: OpenAI used PPO to train ChatGPT via RLHF. It's the default algorithm for robotics (Boston Dynamics), game AI, and LLM alignment. Its popularity comes from being simple to implement, stable to train, and effective across a wide range of tasks.
Check: What does PPO's clipping prevent?

Chapter 10: SAC — Soft Actor-Critic

What if the agent was rewarded not just for getting high returns, but also for being as random as possible? That's the idea behind maximum entropy RL. The SAC objective adds an entropy bonus to the reward: explore as much as you can while still maximizing return.

J(π) = E[∑_t (r_t + α H(π(·|s_t)))]

The temperature parameter α controls the trade-off. High α means "explore aggressively" (policy stays broad). Low α means "focus on reward" (policy becomes peaked). SAC can even learn α automatically.

Why entropy? Encouraging randomness prevents the policy from collapsing to a single action too early. It makes the agent robust: it learns multiple ways to solve a task, so it can adapt when conditions change. Think of it as forced creativity.
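The peaked-vs-broad behavior can be illustrated with a softmax over made-up action preferences. (A simplification: in SAC, α weights the entropy bonus in the objective rather than acting as a literal softmax temperature, but the qualitative effect on the policy distribution is the same.)

```python
import math

def softmax(prefs, temperature):
    """Higher temperature → flatter, higher-entropy distribution."""
    scaled = [p / temperature for p in prefs]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]   # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """H(π) = −∑ π(a) log π(a)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

prefs = [1.0, 2.0, 4.0]                   # hypothetical action preferences
cold = softmax(prefs, temperature=0.1)    # peaked: nearly all mass on one action
hot = softmax(prefs, temperature=5.0)     # broad: close to uniform
```

At low temperature the policy collapses onto the best-looking action (exploitation); at high temperature it stays spread out (exploration) — exactly the trade-off the α slider controls.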
Entropy vs Reward Trade-off

Adjust the temperature α to see how the policy distribution changes. Low α = peaked (exploit). High α = broad (explore).

Temperature α = 0.50
Feature | SAC | PPO
Policy type | Off-policy (replay buffer) | On-policy (no replay)
Action space | Continuous (native) | Both
Sample efficiency | High (reuses data) | Lower (data discarded)
Exploration | Built-in via entropy | Via stochastic policy
Stability | Very stable | Stable (clipping)
Check: What does the temperature α control in SAC?

Chapter 11: The RL Landscape

You now know the major families of RL algorithms. But when should you use which? The landscape splits along several axes: model-free vs model-based, on-policy vs off-policy, value-based vs policy-based.

Algorithm | Type | Policy | Best for
Q-Learning | Value-based | Off-policy | Small discrete spaces
SARSA | Value-based | On-policy | Safety-sensitive tasks
DQN | Value-based | Off-policy | Large discrete spaces (Atari)
REINFORCE | Policy-based | On-policy | Simple continuous tasks
A2C/A3C | Actor-Critic | On-policy | Parallel training
PPO | Actor-Critic | On-policy | General purpose, RLHF
SAC | Actor-Critic | Off-policy | Continuous control, robotics
On-policy vs Off-policy: On-policy (SARSA, PPO) learns about the policy it's currently following — safer but less data-efficient. Off-policy (Q-learning, DQN, SAC) can learn from old data — more sample-efficient but harder to stabilize.
Model-free vs Model-based: Everything in this lesson is model-free: the agent learns from raw experience without building a model of the environment. Model-based RL (MuZero, Dreamer) learns a model of how the world works and plans ahead. More sample-efficient, but the model can be wrong.
Algorithm Decision Tree

An interactive guide: answer questions about your problem to find the right algorithm.

RLHF connection: Reinforcement Learning from Human Feedback uses PPO to fine-tune language models. The "reward model" (trained on human preferences) plays the role of the environment's reward signal. The LLM is the policy. PPO's stability makes it ideal for this delicate fine-tuning.
"Reinforcement learning is the closest thing we have to a computational theory of intelligence."
— Richard Sutton

You now understand the full arc from bandits to PPO. Every time an AI agent learns from interaction, these algorithms are at work.

Check: Which algorithm is most commonly used for RLHF in large language models?