microRL — From Bandits to PPO

Chapter 0: The RL Problem

Reinforcement learning is about an agent interacting with an environment. At each timestep, the agent observes the current state, takes an action, receives a reward, and transitions to a new state. The goal is deceptively simple: maximize cumulative reward over time.

This is different from supervised learning (where you have labeled examples) and unsupervised learning (where you find patterns). In RL, the agent must discover which actions lead to high rewards through trial and error. Nobody tells the agent the right answer — it must figure it out.

The core loop: State → Action → Reward → New State → repeat. The agent's policy π(a|s) maps states to actions. The goal is to find the policy that maximizes the expected sum of future rewards.

Agent

Observes state s, chooses action a via policy π

↓

Environment

Returns reward r and next state s'

↓

Objective

Maximize G = r₁ + γr₂ + γ²r₃ + …

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = ∑ γ^k r_{t+k+1}

The RL Loop

Watch the agent (green) navigate a simple world. It receives +1 for reaching the goal and -1 for walls. The cumulative reward tracks at the top.

Check: How does RL differ from supervised learning?

RL uses bigger datasets The agent must discover good actions through trial and error, without labeled examples RL only works for games

Chapter 1: Exploration vs Exploitation

Before we learn any algorithms, we must understand the fundamental dilemma: exploration vs exploitation. Should you go to your favourite restaurant (exploit what you know), or try a new place (explore to find something better)?

The multi-armed bandit problem distills this perfectly. You face N slot machines, each with an unknown payout probability. You want to maximize total winnings. If you always pull the lever that looks best so far, you might miss a better one. If you explore too much, you waste pulls on bad machines.

ε-greedy strategy: With probability ε pick a random lever (explore); with probability 1−ε pick the best-known lever (exploit). Simple, effective, and the foundation for algorithms we'll see later.

ε decay in practice: Real implementations don't use fixed ε. DQN starts with ε = 1.0 (100% random) and linearly decays to 0.01 over 1 million steps. Early on, you explore aggressively to discover the reward landscape. Later, you exploit what you've learned. The decay schedule is a crucial hyperparameter — decay too fast and you miss good strategies, too slow and you waste steps on random actions.

Multi-Armed Bandit

Click a lever to pull it, or let the ε-greedy agent play. Watch the regret (reward lost vs. always picking the best arm) accumulate.

Epsilon ε0.10

Pulls: 0 | Regret: 0.0

Key insight: Too little exploration (ε≈0) means you get stuck on a mediocre arm. Too much exploration (ε≈0.5) wastes half your pulls on random arms. The sweet spot for most problems is ε between 0.05 and 0.15.

Check: What does ε-greedy do with probability ε?

Always picks the best-known arm Picks a random arm (explores) Stops playing

Chapter 2: Monte Carlo Methods

The simplest approach to estimating how good a state is: play out full episodes, then average the returns you got. If you visited state S and ended up with a total return of 7, and another time you got 3, your estimate of V(S) is (7+3)/2 = 5. That's it. No fancy math.

There are two flavors: first-visit MC (only count the first time you visit a state in an episode) and every-visit MC (count every visit). Both converge to the true value as you play more episodes.

V(s) ≈ (1/N) ∑ G_t for all episodes where s was visited

The catch: Monte Carlo requires complete episodes. If your task is continuous (a robot that runs forever) or episodes are very long, you have to wait until the end before learning anything. This is the main limitation.

Monte Carlo Value Estimation

Run episodes to see MC value estimates converge. Each cell shows the estimated value of that state. More episodes = better estimates.

Episodes: 0

Check: What is the main limitation of Monte Carlo methods?

They require complete episodes before learning They only work for deterministic environments They need a neural network

Chapter 3: TD Learning — Learning from Incomplete Episodes

What if you could learn before an episode ends? Temporal Difference (TD) learning does exactly that. Instead of waiting for the full return G, it uses an estimate: the immediate reward plus the current guess of the next state's value.

The TD(0) update rule shifts the value of the current state toward a bootstrapped target: r + γV(s'). The difference between this target and our current estimate is the TD error δ.

V(s) ← V(s) + α [r + γV(s') − V(s)]

Bootstrapping means using your own estimates to improve your estimates. It's like learning to cook by tasting as you go, rather than waiting until the meal is done. It introduces some bias but dramatically reduces variance compared to MC.

Symbol	Meaning
α	Learning rate (step size)
γ	Discount factor (how much we value future rewards)
δ	TD error: r + γV(s') − V(s)
V(s')	Current estimate of the next state's value

TD(0) vs Monte Carlo

Watch how TD(0) (teal) converges faster than MC (orange) on a random walk problem. TD learns from each step; MC waits until the end.

Learning rate α0.10

Episodes: 0

Check: What does "bootstrapping" mean in TD learning?

Using your own estimates to improve your estimates Starting the algorithm from scratch Training on a bootstrap dataset

🔨 Derivation Derive the Bellman Equation from First Principles ▶ ✓ ATTEMPTED

The value function V(s) is defined as the expected sum of discounted future rewards starting from state s under policy π:

V^π(s) = E_π[G_t | S_t = s] = E_π[r_t+1 + γr_t+2 + γ²r_t+3 + … | S_t = s]

Your task: Show that V^π(s) can be decomposed into the immediate reward plus the discounted value of the next state — the Bellman equation:

V^π(s) = ∑_a π(a|s) ∑_s' p(s'|s,a) [r(s,a,s') + γ V^π(s')]

Start by splitting G_t = r_t+1 + γ(r_t+2 + γr_t+3 + …). Notice that the term in parentheses is itself G_t+1. So G_t = r_t+1 + γG_t+1.

E[G_t | S_t=s] = E[r_t+1 + γG_t+1 | S_t=s]. Use linearity of expectation and the law of total expectation: condition on the action taken and the next state reached.

Once you condition on S_t+1 = s', the remaining expectation E[G_t+1 | S_t+1=s'] is just V^π(s') by definition. This is what makes the equation recursive.

Full derivation:

1. Start with the definition: V^π(s) = E_π[G_t | S_t = s]

2. Decompose G_t: V^π(s) = E_π[r_t+1 + γG_t+1 | S_t = s]

3. Expand using law of total expectation over actions and next states:

= ∑_a π(a|s) ∑_s' p(s'|s,a) [r(s,a,s') + γ E_π[G_t+1 | S_t+1=s']]

4. Recognize E_π[G_t+1 | S_t+1=s'] = V^π(s') by the Markov property:

= ∑_a π(a|s) ∑_s' p(s'|s,a) [r(s,a,s') + γ V^π(s')]

The key insight: The Markov property is what makes this work. Because the future only depends on the current state (not how we got here), we can recursively define value in terms of itself. This self-referential structure is what TD learning exploits — we can update V(s) using our current estimate of V(s') without waiting for the full return.

🔗 Pattern Recognition

Bellman Equation = Recursive Estimation

This Lesson (TD Learning)

V(s) ← V(s) + α[r + γV(s') − V(s)]
"Update current estimate toward one-step-ahead estimate"

Kalman Filter

x̂ ← x̂_pred + K(z − H·x̂_pred)
"Update prediction toward measurement" → Kalman Filter lesson

Both are instances of the same deep pattern: recursive Bayesian estimation. You have a current belief, you get new information (a reward + next-state value, or a measurement), and you blend them using a weighting factor (α in TD, K in Kalman). The Bellman equation IS the Kalman predict step for value functions — "my best guess of the future given my model of how the world evolves."

Where else have you seen "new_estimate = old_estimate + step_size × (target − old_estimate)"? (Hint: think about how running averages work, and how gradient descent updates parameters.)

Chapter 4: Q-Learning — The Classic

Instead of estimating how good a state is (V), what if we estimated how good a state-action pair is? That's Q-learning. Q(s, a) tells us the expected return of taking action a in state s and then acting optimally afterward. If we know Q perfectly, we can act optimally by always picking argmax_a Q(s, a).

Q-learning is off-policy: it learns the optimal policy regardless of what the agent actually does. The update uses the maximum Q-value of the next state, not the Q-value of the action the agent actually took.

Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]

Off-policy magic: Even if the agent explores randomly, Q-learning converges to the optimal Q-values. The update always asks "what's the best I could do from here?" regardless of what the agent actually did. This separates behavior (exploration) from learning (the optimal policy).

The Q-Table, Concretely

For a gridworld with |S| = 100 states and |A| = 4 actions, Q is simply a [100, 4] array — 400 numbers, initialized to zeros. Each entry Q[s, a] estimates the expected return of taking action a in state s. The entire learning algorithm is one line of NumPy per step:

Python
import numpy as np

# Initialize the Q-table: 100 states × 4 actions = 400 entries
Q = np.zeros((100, 4))

# After observing (s, a, r, s'):
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next, :]) - Q[s, a])
#                   └── TD target ──┘   └── old estimate ┘

# Extract greedy policy: best action per state
policy = np.argmax(Q, axis=1)  # shape [100]

That's the complete Q-learning implementation. The alpha * (target - old) pattern is the same as TD learning from Chapter 3, but now applied to state-action pairs instead of just states.

Q-Learning Grid World

Watch the Q-table fill in as the agent explores. Green = goal (+10). Red = trap (-10). The arrows show the learned greedy policy. Cell colors show value estimates.

Epsilon ε0.15

Learning rate α0.20

Discount γ0.95

Episode: 0 | Steps: 0

Experiment: Set ε very low (0.01) — the agent exploits but may never find the goal. Set it high (0.5) — it finds the goal quickly but the policy looks noisy. The sweet spot is around 0.1–0.2. Try changing γ too: low discount makes the agent short-sighted.

Check: Why is Q-learning called "off-policy"?

It learns the optimal policy regardless of the agent's exploration behavior It doesn't use a policy at all It only works offline

⚔ Adversarial: Your agent's Q-values are growing without bound

You're training a Q-learning agent on a grid world. After 10,000 episodes, you notice that Q-values in some states have grown to 10,000+ even though the maximum per-step reward is +1. The agent's behavior looks reasonable, but the numbers keep climbing. What's happening?

The learning rate is too high, causing oscillations The discount factor γ is 1.0 (no discounting), so in cyclic states the max operator accumulates value without bound The epsilon is too low

💻 Build It Implement Q-Learning Update from Scratch ▶ ✓ ATTEMPTED

You've seen the one-line Q-learning update. Now implement the full training loop: given an environment with discrete states and actions, run Q-learning for N episodes and return the learned Q-table.

signature def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1): """ Train a Q-learning agent. Args: env: Environment with .reset() -> state, .step(action) -> (next_state, reward, done) env.n_states: int, env.n_actions: int n_episodes: Number of training episodes alpha: Learning rate gamma: Discount factor epsilon: Exploration rate Returns: Q: np.array of shape [n_states, n_actions] — the learned Q-table """

Test case

A 4x4 grid world with goal at (3,3) giving +1 reward. All other steps give -0.01.
After training: Q[start_state, right] should be positive (path toward goal is valuable).
np.argmax(Q, axis=1) should trace a path from (0,0) to (3,3).

To select an action: generate a random number. If it's less than epsilon, pick a random action (np.random.randint(n_actions)). Otherwise, pick the action with the highest Q-value for the current state (np.argmax(Q[state])). Handle the terminal state by NOT bootstrapping: when done=True, the target is just r (no γ*max Q(s',a')).

python
import numpy as np

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.n_states, env.n_actions))

    for ep in range(n_episodes):
        state = env.reset()
        done = False

        while not done:
            # Epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = np.random.randint(env.n_actions)
            else:
                action = np.argmax(Q[state])

            # Take action, observe transition
            next_state, reward, done = env.step(action)

            # Q-learning update (off-policy: uses max)
            if done:
                target = reward  # No future value at terminal
            else:
                target = reward + gamma * np.max(Q[next_state])

            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state

    return Q

Bonus challenge: Modify this to implement SARSA instead. What single line changes? (Answer: replace np.max(Q[next_state]) with Q[next_state, next_action] where next_action is chosen by the same epsilon-greedy policy.)

Checkpoint — Before you move on

Explain in your own words: What is the relationship between V(s), Q(s,a), and the policy π? Why can't we just use V(s) for control — why do we need Q(s,a)?

✓ Gate cleared

Model Answer

V(s) tells you the expected return from state s under policy π. But to act, you need to compare actions. V(s) alone doesn't tell you which action leads to which next state (you'd need the transition model p(s'|s,a) to compute that). Q(s,a) directly gives you the value of each action, so you can pick the best one with argmax — no model needed. This is why Q-learning is "model-free": it learns Q directly without ever learning the environment dynamics. The relationship is: V^π(s) = ∑_a π(a|s) Q^π(s,a) — the value of a state is the weighted average of its action values under the policy.

Chapter 5: SARSA — On-Policy Q-Learning

SARSA stands for State, Action, Reward, State, Action — exactly what the update uses. Unlike Q-learning, SARSA updates toward the Q-value of the action the agent actually took next, not the maximum. This makes it on-policy: it evaluates and improves the policy it's currently following.

Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)]

The practical difference? SARSA is more conservative. If the agent explores and sometimes steps into danger, SARSA's values reflect that danger. Q-learning's values assume you'll act optimally after, so they're more optimistic. Near cliffs and traps, SARSA learns to stay further away.

The Cliff Walk: Imagine a cliff edge. Q-learning finds the shortest path along the cliff (optimal but risky with ε-exploration). SARSA finds a safer path further from the edge, because it accounts for the chance of accidentally stepping off.

	Q-Learning	SARSA
Policy	Off-policy	On-policy
Update target	max_a' Q(s',a')	Q(s', a') where a' is the actual next action
Behavior	Optimistic (assumes optimal future)	Realistic (accounts for exploration)
Near danger	Risky paths	Safer paths

Q-Learning vs SARSA on a Cliff

Compare the paths learned by Q-learning (teal) and SARSA (orange). The bottom row is a cliff — falling gives -100. Notice SARSA takes the safer upper path.

Episodes: 0

Check: Why does SARSA learn safer paths than Q-learning near cliffs?

Because SARSA's values account for the exploration policy's chance of falling Because SARSA uses a larger learning rate Because SARSA has a lower discount factor

Chapter 6: Deep Q-Networks (DQN)

Q-learning with a table works for small state spaces. But what about Atari games with millions of possible screen configurations? You can't have a table entry for every pixel combination. The breakthrough idea from DeepMind (2015): replace the Q-table with a neural network.

The network takes a state (raw pixels) as input and outputs Q-values for each possible action. But training is tricky — naive approaches diverge because of two problems:

Problem 1: Correlated samples

Consecutive states are similar → biased gradient updates

↓ Solution

Experience Replay

Store transitions in a buffer, sample randomly for training

Problem 2: Moving target

The target Q-value changes as the network updates → instability

↓ Solution

Target Network

Use a frozen copy of the network for targets, update periodically

The Atari breakthrough: DQN played 49 Atari games using only raw pixels and scores as input, reaching superhuman performance on many. The same architecture, same hyperparameters — no game-specific engineering. This was the moment RL went from theory to headline news.

DQN Data Flow

The neural network is the Q-table. Instead of a [|S|, |A|] array (impossible for pixel inputs), a CNN maps raw frames to Q-values:

Architecture
# Input: 4 stacked grayscale frames (motion information)
State:  [4, 84, 84]        # 4 frames × 84×84 pixels = 28,224 values
         ↓
Conv2D(32, 8×8, stride 4)  # → [32, 20, 20]
Conv2D(64, 4×4, stride 2)  # → [64, 9, 9]
Conv2D(64, 3×3, stride 1)  # → [64, 7, 7]
Flatten                      # → [3136]
Dense(512)                  # → [512]  (learned features)
Dense(18)                   # → [18]   (one Q-value per action)

# Action = argmax over these 18 Q-values

Experience Replay, Concretely

The replay buffer stores transitions as tuples (s, a, r, s', done). For Atari: 1 million transitions, each containing two frame stacks of [4, 84, 84]. Naive storage: 1M × (28,224 + 1 + 1 + 28,224 + 1) bytes ≈ 56 GB. In practice, you store each unique frame once and reference them by index — dropping memory to about 7 GB. Training samples a random mini-batch (typically 32) from this buffer, breaking temporal correlations.

Experience Replay Buffer

Watch transitions (s, a, r, s') flow into the replay buffer. Training samples are drawn randomly, breaking temporal correlations.

Check: Why does DQN use experience replay?

To break correlations between consecutive training samples To make training faster by using GPUs To save memory

💥 Break-It Lab What Dies When You Remove DQN Components? ▶ ✓ ATTEMPTED

DQN has three critical components beyond vanilla Q-learning: experience replay, a target network, and frame stacking. Each prevents a specific failure mode. Toggle them off to see training curves collapse.

Remove Experience Replay ACTIVE

Failure mode: Without replay, the network trains on consecutive correlated samples (frame 1, frame 2, frame 3...). The gradients are biased toward whatever the agent is doing RIGHT NOW. If it's stuck in a corner, all gradients push toward "corner behavior." The network catastrophically forgets everything else it learned. This is called catastrophic forgetting — the same pathology that plagues continual learning. Replay breaks temporal correlations by sampling uniformly from past experience.

Remove Target Network ACTIVE

Failure mode: The Q-learning target is r + γ max Q(s', a'). If Q changes every gradient step, the target ALSO changes every step — you're chasing a moving target. This creates a positive feedback loop: Q(s) goes up → target for upstream states goes up → those Q-values go up → targets go up further. The Q-values oscillate wildly or diverge to infinity. The frozen target network (updated every 10K steps) provides a stable target to regress toward.

Remove Reward Clipping ACTIVE

Failure mode: Atari games have wildly different reward scales (Pong: ±1, Space Invaders: 5-200, Breakout: 1-7). Without clipping rewards to [-1, +1], the network's loss landscape varies by orders of magnitude between games. A single high-reward event (killing a boss for 1000 points) produces an enormous gradient that destabilizes all learned weights. Clipping normalizes the optimization landscape at the cost of losing relative reward magnitude.

Chapter 7: Policy Gradients — REINFORCE

Everything so far learned a value function and derived a policy from it. But what if we optimized the policy directly? Policy gradient methods parameterize the policy as π_θ(a|s) and use gradient ascent to maximize expected reward.

The REINFORCE algorithm is the simplest policy gradient method. The key insight: the gradient of the expected return has a beautiful form called the score function estimator:

∇J(θ) = E[∑ ∇ log π_θ(a_t|s_t) · G_t]

Translation: increase the probability of actions that led to high returns; decrease it for actions that led to low returns. It's intuitive, unbiased, but has high variance — returns can vary wildly between episodes, making gradients noisy.

Why go direct? Value-based methods struggle with continuous action spaces (you can't take argmax over infinite actions). Policy gradients handle continuous actions naturally: the policy is a probability distribution (like a Gaussian) over the action space.

Q-Learning vs Policy Gradient: Side by Side

	Q-Learning (DQN)	Policy Gradient (REINFORCE)
Input	State s → network	State s → network
Output	Q-values [\|A\|] (one per action)	Action probabilities [\|A\|] (softmax)
Action selection	argmax Q(s, a) + ε-greedy	Sample from π(a\|s)
Loss function	MSE: (r + γ max Q(s',a') − Q(s,a))²	−log π(a\|s) · G_t
What's optimized	Q-values match Bellman target	Probability of good actions increases
Data reuse	Yes (replay buffer, off-policy)	No (on-policy, data discarded after update)
Action space	Discrete only (need argmax)	Discrete or continuous

Policy Gradient Intuition

The policy π(a) is a distribution over actions. After a good episode, the distribution shifts toward the actions taken. After a bad episode, it shifts away.

Variance problem: REINFORCE multiplies the gradient by the full return G_t. If G_t varies from -100 to +100 across episodes, your gradient estimates are very noisy. The solution: subtract a baseline (like the average return) to center the signal. This is the seed of actor-critic methods.

Check: What is the main drawback of vanilla REINFORCE?

It only works for discrete actions High variance in gradient estimates It cannot learn from rewards

🔨 Derivation Derive the Policy Gradient Theorem (REINFORCE) ▶ ✓ ATTEMPTED

The objective is to maximize expected return: J(θ) = E_{τ~π_θ}[R(τ)] where τ is a trajectory (s₀, a₀, r₀, s₁, ...) and R(τ) = ∑r_t.

The probability of a trajectory is: p(τ) = p(s₀) ∏_t π_θ(a_t|s_t) p(s_t+1|s_t,a_t)

Your task: Show that ∇_θJ(θ) = E[∑_t ∇_θ log π_θ(a_t|s_t) · R(τ)]. Why do the environment dynamics p(s'|s,a) disappear from the gradient?

∇p(τ) = p(τ) · ∇ log p(τ). This is because ∇ log f = ∇f / f, so ∇f = f · ∇ log f. This converts ∇_θ ∫ p(τ)R(τ)dτ into E_p(τ)[∇ log p(τ) · R(τ)] — an expectation we can estimate with samples!

log p(τ) = log p(s₀) + ∑_t log π_θ(a_t|s_t) + ∑_t log p(s_t+1|s_t,a_t). Now take ∇_θ of this. Which terms depend on θ?

Only π_θ(a_t|s_t) depends on θ. The initial state distribution p(s₀) and transition dynamics p(s'|s,a) are properties of the environment — their gradients w.r.t. θ are zero. This is why policy gradients are model-free!

Full derivation:

1. ∇J(θ) = ∇ ∫ p(τ) R(τ) dτ

2. = ∫ ∇p(τ) R(τ) dτ (move gradient inside integral)

3. = ∫ p(τ) ∇log p(τ) R(τ) dτ (log-derivative trick: ∇p = p · ∇log p)

4. = E_τ[∇log p(τ) · R(τ)] (definition of expectation)

5. Now expand: ∇_θ log p(τ) = ∇_θ[log p(s₀) + ∑ log π_θ(a_t|s_t) + ∑ log p(s'|s,a)]

6. = 0 + ∑_t ∇_θ log π_θ(a_t|s_t) + 0

7. Therefore: ∇J(θ) = E_τ[(∑_t ∇ log π_θ(a_t|s_t)) · R(τ)]

The key insight: The log-derivative trick converts a gradient of an integral (intractable) into an expectation of a gradient (estimable with samples). The environment dynamics vanish because they don't depend on θ. This means we can compute policy gradients without EVER knowing how the environment works — we just need to sample trajectories and compute ∇ log π. This is profoundly elegant: to improve your policy, you only need to know your own action probabilities.

Chapter 8: Actor-Critic — Best of Both Worlds

REINFORCE uses the full return G_t as a signal, which is high variance. What if we replaced G_t with a learned value estimate? That's the actor-critic architecture: the actor is the policy (decides what to do), and the critic is the value function (evaluates how good the state is).

The critic provides a low-variance baseline, dramatically stabilizing training. The actor updates using the advantage A(s,a) = Q(s,a) - V(s), which tells us: "was this action better or worse than average?"

∇J(θ) = E[∇ log π_θ(a|s) · A(s,a)]

Actor (Policy π_θ)

"What action should I take?" → outputs action probabilities

↓ advantage signal

Critic (Value V_w)

"How good is this state?" → outputs value estimate

A2C vs A3C: A2C (Advantage Actor-Critic) runs multiple environments synchronously. A3C (Asynchronous A3C) runs workers in parallel, each updating a shared model asynchronously. A3C was a breakthrough for scaling, but A2C is simpler and often just as good with GPU batching.

Actor-Critic Data Flow

Architecture
# Shared backbone (or separate networks)
State s → [Encoder] → features [256]
                              ↓              ↓
                    Actor head            Critic head
                   Dense(256) → [|A|]   Dense(256) → [1]
                   softmax → π(a|s)    linear → V(s)

# Actor loss: -log π(a|s) × advantage
# Critic loss: (r + γ V(s') - V(s))²
# Both trained jointly with a single optimizer

The advantage A(s,a) = r + γ V(s') − V(s) is the TD error from Chapter 3. Positive advantage means "this action was better than expected" — increase its probability. Negative means "worse than expected" — decrease it. The critic provides this signal, replacing the high-variance full return G_t from REINFORCE.

Actor-Critic Architecture

The actor and critic train together. Watch how the critic's value estimate (teal) stabilizes the actor's policy updates (orange), reducing variance compared to pure REINFORCE.

Step: 0

Check: What role does the "critic" play in actor-critic?

It directly selects actions It estimates state values to provide a low-variance baseline for the actor It replaces the actor after training

🔨 Derivation Why A(s,a) = Q(s,a) − V(s) Reduces Variance ▶ ✓ ATTEMPTED

REINFORCE uses ∇J = E[∇ log π(a|s) · G_t]. Actor-critic replaces G_t with the advantage A(s,a) = Q(s,a) − V(s).

Your task: (1) Show that subtracting any baseline b(s) that depends only on state does NOT change the expected gradient (it's still unbiased). (2) Show that V(s) is the optimal baseline (minimizes variance).

Consider E_a~π[∇ log π(a|s) · b(s)]. Factor out b(s) since it doesn't depend on a. You get b(s) · E_a~π[∇ log π(a|s)]. Now use the fact that ∑_a π(a|s) = 1 for all θ. What's the gradient of a constant?

Without baseline: if all returns are positive (e.g., G ranges from +5 to +100), ALL actions get reinforced — just some more than others. The gradient is always pushing probabilities up, making learning slow. With baseline V(s): actions better than average get positive signal, worse-than-average get negative signal. The gradient now has a clear "good vs bad" signal centered around zero.

V(s) = E_a~π[Q(s,a)] — it's the average Q-value over all actions. So A(s,a) = Q(s,a) − V(s) literally means "how much better is this specific action than the average action in this state?" Positive advantage = better than average. Negative = worse.

Part 1: Baseline doesn't change expected gradient

E_a~π[∇ log π(a|s) · b(s)] = b(s) · ∑_a π(a|s) · ∇ log π(a|s)

= b(s) · ∑_a π(a|s) · ∇π(a|s)/π(a|s) = b(s) · ∑_a ∇π(a|s)

= b(s) · ∇ ∑_a π(a|s) = b(s) · ∇(1) = 0

So subtracting b(s) from the return leaves the expected gradient unchanged. The gradient is still unbiased.

Part 2: V(s) minimizes variance

Var[∇ log π · (Q − b)] is minimized when b = E[Q(s,a)] = V(s). This is the standard result that the optimal control variate equals the conditional mean. Intuitively: centering around the mean minimizes the squared magnitude of the signal.

The key insight: The advantage function doesn't just reduce variance — it gives the gradient a sign. Without it, REINFORCE says "do more of everything that worked." With it, actor-critic says "do MORE of above-average actions, LESS of below-average ones." This directional signal is why actor-critic converges so much faster in practice.

Chapter 9: PPO — Proximal Policy Optimization

Actor-critic methods can be unstable: a bad gradient step can ruin the policy, and you can't undo it. PPO (Schulman et al., 2017) solves this with a brilliantly simple idea: clip the policy update so it can't change too much in one step.

PPO computes the probability ratio r(θ) = π_new(a|s) / π_old(a|s). If the new policy is too different from the old one (ratio far from 1), the objective is clipped, preventing the update from going further.

L(θ) = E[min(r(θ) A, clip(r(θ), 1−ε, 1+ε) A)]

Why clipping works: If the advantage A is positive (good action), we want to increase the probability — but only up to 1+ε times the old probability. If A is negative (bad action), we want to decrease it — but only down to 1−ε. This creates a "trust region" around the current policy, preventing catastrophic updates.

PPO Clipped Objective

Drag the ratio slider to see how clipping constrains the objective. The teal region is the clipping zone [1−ε, 1+ε]. The orange curve is the actual objective used.

Ratio r(θ)1.00

Clip ε0.20

Advantage A1.0

      L_clip =
      1.00
    

PPO Training Loop, Concretely

Python
# PPO collects a batch of trajectories, then does multiple
# gradient steps on the SAME data (unlike REINFORCE which
# uses data once and throws it away)

for iteration in range(1000):
    # 1. Collect 2048 steps across 8 parallel environments
    batch = collect_rollouts(envs, policy, n_steps=2048)
    # batch contains: states, actions, rewards, log_probs_old, values

    # 2. Compute advantages using GAE (Generalized Advantage Estimation)
    advantages = compute_gae(batch, gamma=0.99, lam=0.95)

    # 3. Multiple epochs on the same batch (the key PPO trick)
    for epoch in range(4):
        for mini_batch in split(batch, size=64):
            ratio = exp(log_prob_new - log_prob_old)
            clipped = clip(ratio, 1-0.2, 1+0.2)
            loss = -min(ratio * adv, clipped * adv)
            optimizer.step(loss)

PPO is everywhere: OpenAI used PPO to train ChatGPT via RLHF. It's the default algorithm for robotics (Boston Dynamics), game AI, and LLM alignment. Its popularity comes from being simple to implement, stable to train, and effective across a wide range of tasks.

Check: What does PPO's clipping prevent?

Catastrophically large policy updates in a single step The agent from exploring The reward from being negative

⚔ Adversarial: PPO's ratio explodes on rare actions

You're training an LLM with PPO-based RLHF. After a policy update, you notice some tokens have probability ratios r(θ) of 50+ (the new policy is 50x more likely to output them than the old policy). PPO clips at ε=0.2, so r is clamped to [0.8, 1.2]. Is this safe?

Yes, clipping prevents any damage from large ratios No, large ratios mean the KL divergence is huge, indicating the policy has already drifted too far WITHIN a single epoch of updates No, clipping stops the gradient for those tokens but the damage is already done — other tokens in the same batch caused the drift. The clip prevents further movement but can't undo the step that caused r=50

Chapter 10: SAC — Soft Actor-Critic

What if the agent was rewarded not just for getting high returns, but also for being as random as possible? That's the idea behind maximum entropy RL. The SAC objective adds an entropy bonus to the reward: explore as much as you can while still maximizing return.

J(π) = E[∑ r_t + α H(π(·|s_t))]

The temperature parameter α controls the trade-off. High α means "explore aggressively" (policy stays broad). Low α means "focus on reward" (policy becomes peaked). SAC can even learn α automatically.

Why entropy? Encouraging randomness prevents the policy from collapsing to a single action too early. It makes the agent robust: it learns multiple ways to solve a task, so it can adapt when conditions change. Think of it as forced creativity.

Entropy vs Reward Trade-off

Adjust the temperature α to see how the policy distribution changes. Low α = peaked (exploit). High α = broad (explore).

Temperature α0.50

Feature	SAC	PPO
Policy type	Off-policy (replay buffer)	On-policy (no replay)
Action space	Continuous (native)	Both
Sample efficiency	High (reuses data)	Lower (data discarded)
Exploration	Built-in via entropy	Via stochastic policy
Stability	Very stable	Stable (clipping)

Check: What does the temperature α control in SAC?

The trade-off between reward maximization and policy entropy (exploration) The learning rate of the critic The size of the replay buffer

Chapter 11: The RL Landscape

You now know the major families of RL algorithms. But when should you use which? The landscape splits along several axes: model-free vs model-based, on-policy vs off-policy, value-based vs policy-based.

Algorithm	Type	Policy	Best for
Q-Learning	Value-based	Off-policy	Small discrete spaces
SARSA	Value-based	On-policy	Safety-sensitive tasks
DQN	Value-based	Off-policy	Large discrete spaces (Atari)
REINFORCE	Policy-based	On-policy	Simple continuous tasks
A2C/A3C	Actor-Critic	On-policy	Parallel training
PPO	Actor-Critic	On-policy	General purpose, RLHF
SAC	Actor-Critic	Off-policy	Continuous control, robotics

On-policy vs Off-policy: On-policy (SARSA, PPO) learns about the policy it's currently following — safer but less data-efficient. Off-policy (Q-learning, DQN, SAC) can learn from old data — more sample-efficient but harder to stabilize.

Model-free vs Model-based: Everything in this lesson is model-free: the agent learns from raw experience without building a model of the environment. Model-based RL (MuZero, Dreamer) learns a model of how the world works and plans ahead. More sample-efficient, but the model can be wrong.

Sample Efficiency: The Practical Divide

The algorithms in this lesson vary by orders of magnitude in how much experience they need:

Algorithm	Samples to solve CartPole	Samples for Atari (superhuman)
DQN (off-policy)	~10K steps	~50M frames (~40 hours of play)
PPO (on-policy)	~30K steps	~200M frames (~150 hours of play)
SAC (off-policy)	~5K steps	N/A (continuous actions)
Dreamer (model-based)	~2K steps	~2M frames (~1.5 hours of play)

Off-policy methods (DQN, SAC) reuse data from the replay buffer, so they need fewer environment interactions. On-policy methods (PPO) throw away data after each update, but their gradients are less biased. Model-based methods are the most sample-efficient because they can "imagine" experience — but building an accurate world model is itself a hard problem.

Algorithm Decision Tree

An interactive guide: answer questions about your problem to find the right algorithm.

Click to start

RLHF connection: Reinforcement Learning from Human Feedback uses PPO to fine-tune language models. The "reward model" (trained on human preferences) plays the role of the environment's reward signal. The LLM is the policy. PPO's stability makes it ideal for this delicate fine-tuning.

🏗 Design Challenge You're the Architect: RL for a 7-DOF Robotic Arm ▶ ✓ ATTEMPTED

You're building an RL system to train a 7-DOF robotic arm to pick and place objects. The reward is sparse: +1 only when the object reaches the target position (within 2cm), 0 otherwise. You have a real robot (expensive, 1 episode = 30 seconds) and a physics simulator (fast, 1000 env steps/sec, but sim-to-real gap exists).

Action space

7 continuous joint torques, range [-1, 1]

Observation

Joint angles (7) + joint velocities (7) + object pose (6) + target pose (6) = 26-dim

Episode length

200 steps at 20Hz = 10 seconds per attempt

Real robot budget

~100 real episodes total for fine-tuning (50 min wall clock)

Sim budget

Unlimited sim steps (overnight training OK)

Success rate target

>80% on real robot with novel object positions

1. Which algorithm (SAC, PPO, or TD3) and why? Consider: continuous actions, sample efficiency needs, off-policy data reuse.

2. How do you handle the sparse reward problem? (Hint: the agent might never accidentally grasp the object in 200 random steps.)

3. How do you bridge the sim-to-real gap with only 100 real episodes?

4. What reward shaping would you add? What are the risks of reward shaping?

Real-world approach (circa 2023-2024):

Algorithm: SAC — Off-policy (critical for data reuse with expensive real robot), native continuous actions, entropy bonus aids exploration. TD3 is also viable but SAC's auto-tuned temperature is one less hyperparameter.

Sparse reward: Hindsight Experience Replay (HER) — After a failed episode (object ends at position X instead of target Y), pretend X WAS the goal and store a "successful" transition. This converts 0% success rate into 100% self-generated curriculum. OpenAI showed this enables learning with purely sparse rewards.

Sim-to-real: Domain Randomization + fine-tuning — In sim, randomize friction (0.3-1.0), object mass (50-500g), observation noise (σ=0.01), and action delay (0-2 steps). Train to convergence in sim (~5M steps overnight). Then fine-tune on real robot with frozen early layers, reduced learning rate, and the 100 real episodes. The randomization makes the sim policy robust enough that fine-tuning mostly calibrates rather than re-learns.

Reward shaping: Add dense reward: -||gripper - object|| (reaching) + grasp_bonus + -||object - target|| (placing). Risk: the agent may learn to hover near the object (maximizes reaching reward) without grasping. Curriculum: start with reaching only, add grasping criterion, then placing.

🔗 Pattern Recognition

RL as Sequence Modeling (Decision Transformer)

This Lesson (RL)

Policy π(a|s) maximizes E[∑ γ^t r_t]
"Learn to act by trial and error"

Transformers

p(a_t | R, s₁, a₁, ..., s_t)
"Predict the next action given desired return" → Transformer lesson

The Decision Transformer (2021) reframes RL as conditional sequence modeling. Instead of learning a value function or policy gradient, it trains a Transformer on trajectories: (return-to-go, state, action, return-to-go, state, action, ...). At test time, you condition on a HIGH desired return and the model outputs actions that achieve it. No Bellman equation, no bootstrapping, no exploration — just next-token prediction. This works because Transformers are powerful enough to implicitly learn the "stitching" that value functions provide: combining sub-optimal trajectory segments into optimal behavior.

If RL can be reduced to sequence modeling, what does that suggest about the relationship between "understanding the world" (prediction) and "acting in the world" (control)?

"Reinforcement learning is the closest thing we have to a computational theory of intelligence."

— Richard Sutton

You now understand the full arc from bandits to PPO. Every time an AI agent learns from interaction, these algorithms are at work.

Check: Which algorithm is most commonly used for RLHF in large language models?

Q-Learning PPO Monte Carlo

Understand ReinforcementLearning

Chapter 0: The RL Problem

Chapter 1: Exploration vs Exploitation

Chapter 2: Monte Carlo Methods

Chapter 3: TD Learning — Learning from Incomplete Episodes

Chapter 4: Q-Learning — The Classic

The Q-Table, Concretely

Chapter 5: SARSA — On-Policy Q-Learning

Chapter 6: Deep Q-Networks (DQN)

DQN Data Flow

Experience Replay, Concretely

Chapter 7: Policy Gradients — REINFORCE

Q-Learning vs Policy Gradient: Side by Side

Chapter 8: Actor-Critic — Best of Both Worlds

Actor-Critic Data Flow

Chapter 9: PPO — Proximal Policy Optimization

PPO Training Loop, Concretely

Chapter 10: SAC — Soft Actor-Critic

Chapter 11: The RL Landscape

Sample Efficiency: The Practical Divide

Understand Reinforcement
Learning