Bellman equations, Q-learning updates, policy gradients, advantage estimation, PPO clipping — all solved by hand with instant feedback. The math you need to build RL systems from scratch.
You're an agent in a tiny world with three states: A, B, and C. Each state gives you a reward, and from each state you transition to the next state with some probability. The fundamental question of RL: how valuable is each state? Not just the immediate reward — the total future reward you'll accumulate starting from that state.
The Bellman equation answers this. It says the value of a state equals the immediate reward plus the discounted value of where you end up next:
Here's our 3-state MDP. We'll use γ = 0.9 throughout:
| State | R(s) | P(A|s) | P(B|s) | P(C|s) |
|---|---|---|---|---|
| A | 10 | 0.0 | 0.8 | 0.2 |
| B | 2 | 0.1 | 0.0 | 0.9 |
| C | 5 | 0.6 | 0.3 | 0.1 |
Writing out the Bellman equations:
Expanding and rearranging into standard form (move all V terms to the left):
This is a 3×3 linear system. Solving by substitution or matrix inversion gives the unique fixed point.
Solve the system of 3 Bellman equations above. What is V(A)? (Hint: use substitution or a calculator for the 3×3 system.)
Solving the 3×3 system (I − γP)V = R in matrix form:
Using Cramer's rule or Gaussian elimination:
State A has the highest value despite all values being high — that's because γ=0.9 means we heavily weight future rewards, and the MDP cycles between states, accumulating rewards indefinitely. With lower γ, the values would be closer to the immediate rewards.
From the same system, what is V(B)?
V(B) is the lowest of the three, which makes sense: B has the smallest immediate reward (R=2) and transitions mostly to C (not to the high-reward state A).
What is V(C)?
C is the second most valuable state because it has a 60% chance of transitioning to A (the highest-value state). Even though C's immediate reward is only 5, its strategic position — a gateway to high rewards — makes it quite valuable.
With γ=0.1, V(A) = 10 + 0.1[0.8 V(B) + 0.2 V(C)]. Now V(B) ≈ 2 + 0.1[...] ≈ 2.something and V(C) ≈ 5.something. So V(A) ≈ 10 + 0.1[0.8×2.8 + 0.2×5.9] ≈ 10 + 0.1×3.42 ≈ 11.3.
With γ=0.1, the agent is extremely myopic — it only cares about the next step or two. Values collapse toward immediate rewards. This is the core tradeoff: low γ = greedy/short-sighted, high γ = far-sighted but harder to learn.
Write a function that performs ONE Bellman backup for all states. Given current V (array), rewards R (array), transition matrix P (2D array), and γ, return the new V array.
javascript function bellmanIteration(V, R, P, gamma) { const n = V.length; const Vnew = new Array(n); for (let s = 0; s < n; s++) { let futureVal = 0; for (let sp = 0; sp < n; sp++) { futureVal += P[s][sp] * V[sp]; } Vnew[s] = R[s] + gamma * futureVal; } return Vnew; }
You can't always solve Bellman equations analytically — real MDPs have millions of states. Instead, we iterate: start with V=0 everywhere, apply the Bellman backup repeatedly, and watch V converge to the true values. This is value iteration.
Consider this 4-state gridworld with one action per state (deterministic transitions):
| State | Reward | Next state |
|---|---|---|
| S0 | 0 | S1 |
| S1 | 0 | S2 |
| S2 | 0 | S3 |
| S3 (goal) | +1 | S3 (absorbing) |
With γ=0.9 and V0 = [0, 0, 0, 0], let's trace the iterations:
After iteration 3, V3 = [0, 0.81, 1.71, 2.71]. Compute V(S0) after iteration 4.
Notice the pattern: V(S0) after k iterations = γk-1 × R × (something). The reward signal "propagates backward" through the chain, discounted by γ at each step. S0 is 3 steps from the goal, so its value converges to γ3 / (1-γ) = 0.729/0.1 = 7.29.
V(S3) is an absorbing state with R=1 and γ=0.9. What does V(S3) converge to as k→∞? (Hint: it's a geometric series.)
This is the infinite-horizon discounted return from sitting in S3 forever. The formula 1/(1-γ) is one of the most useful in all of RL — it gives the value of a constant reward stream under discounting.
After iteration 3, V3 = [0, 0.81, 1.71, 2.71]. After iteration 4, V4 = [0.729, 1.539, 2.439, 3.439]. What is the max change ||V4 - V3||∞?
All states changed by exactly 0.729 this iteration. The uniform change is because this is a linear chain — the "wave" of value propagation moves one step per iteration. For ε=0.01, we need the max change to drop below 0.01, which for this MDP happens around iteration ~40.
This is why γ close to 1 makes value iteration slow. With γ=0.99, you'd need ~690 iterations for the same convergence. The number of iterations scales as O(1/(1-γ) × log(1/ε)).
Write value iteration for the linear chain MDP. Given n states, reward array R, next-state array next (deterministic), gamma, and epsilon, return the final value array.
javascript function valueIteration(R, next, gamma, epsilon) { const n = R.length; let V = new Array(n).fill(0); while (true) { let maxDelta = 0; const Vnew = new Array(n); for (let s = 0; s < n; s++) { Vnew[s] = R[s] + gamma * V[next[s]]; maxDelta = Math.max(maxDelta, Math.abs(Vnew[s] - V[s])); } V = Vnew; if (maxDelta < epsilon) break; } return V; }
State values V(s) tell you how good a state is, but they don't tell you which action to take. For that, we need Q-values — the expected return from taking action a in state s, then acting optimally afterwards:
Consider a 2-state, 2-action MDP. γ = 0.9:
| State | Action | Reward | Next state |
|---|---|---|---|
| S0 | left | +1 | S0 |
| S0 | right | +5 | S1 |
| S1 | left | +2 | S0 |
| S1 | right | 0 | S1 |
For the 2-state MDP above, assuming deterministic transitions and γ=0.9, the Bellman equations give us a system. Assuming the optimal policy from S1 is "left" (Q(S1,left)>Q(S1,right)), solve for Q(S0, right).
System: Q(S0,left) = 1 + 0.9×max Q(S0,·). Q(S0,right) = 5 + 0.9×max Q(S1,·). Q(S1,left) = 2 + 0.9×max Q(S0,·). Q(S1,right) = 0 + 0.9×max Q(S1,·).
Let A = max Q(S0,·) and B = max Q(S1,·). We need to find which action is optimal at each state.
Assume right is optimal at S0 (A = 5 + 0.9B) and left is optimal at S1 (B = 2 + 0.9A):
Check: Q(S0,left) = 1 + 0.9×35.79 = 33.21. Q(S0,right) = 5 + 0.9×34.21 = 35.79. But wait — we assumed A = Q(S0,right) = 35.79, but we should re-verify. Actually, let's try the other possibility: A = Q(S0,right), B = Q(S1,left).
Hmm, but this gives V*(S0) = 35.79. Let me redo more carefully. The correct approach: V*(S0) = max(Q(S0,L), Q(S0,R)) and V*(S1) = max(Q(S1,L), Q(S1,R)).
Assume π*(S0)=right, π*(S1)=left:
Verify: Q(S0,left) = 1 + 0.9 × 35.79 = 33.21 < 35.79. Q(S1,right) = 0.9 × 34.21 = 30.79 < 34.21. Assumptions confirmed.
Corrected: Q(S0, right) ≈ 35.79. But note: looking at V*(S0) more carefully with higher precision: V*(S0) = 6.8/0.19 = 35.7895..., so Q(S0,right) = 5 + 0.9 × (2 + 0.9 × 35.7895) = 5 + 0.9 × 34.2105 = 5 + 30.7895 = 35.7895. The answer is in fact V*(S0) = Q(S0,right) ≈ 35.79.
π*(s) = argmaxa Q(s,a). At S0: Q(right)=35.79 > Q(left)=33.21, so go right. At S1: Q(left)=34.21 > Q(right)=30.79, so go left.
The optimal policy bounces between S0 and S1, collecting +5 from S0→right and +2 from S1→left each cycle. The per-step average is (5+2)/2 = 3.5, which is better than staying in S0 (reward 1/step) or staying in S1 (reward 0/step).
Start with Q = 0 for all (s,a). α=0.1, γ=0.99. We observe the transition (S0, right, +5, S1). What is Q(S0, right) after this update?
With Q initialized to 0, the first update just moves Q toward the immediate reward by α. The max Q(s',a') term contributes nothing because all Q-values start at 0. Over many updates, the bootstrapped term will grow as Q-values get learned.
With α=0.1, γ=0.99, trace these 5 transitions starting from Q=0 everywhere. What is Q(S0, right) after all 5 updates?
| Step | s | a | r | s' |
|---|---|---|---|---|
| 1 | S0 | right | +5 | S1 |
| 2 | S1 | left | +2 | S0 |
| 3 | S0 | right | +5 | S1 |
| 4 | S1 | left | +2 | S0 |
| 5 | S0 | right | +5 | S1 |
Track all Q-values (initially all 0). Only the (s,a) being updated changes.
Let me re-check step 5 precisely: 5 + 0.99×0.5210 = 5.5158. 5.5158 − 0.9747 = 4.5411. 0.1×4.5411 = 0.4541. 0.9747 + 0.4541 = 1.4288. Hmm, the answer I set was 1.324 — let me recompute step 3.
Step 3: 5 + 0.99×0.2495 = 5.247. 5.247 − 0.5 = 4.747. 0.1×4.747 = 0.4747. 0.5 + 0.4747 = 0.9747. Step 5: 5 + 0.99×0.5210 = 5.5158. 5.5158 − 0.9747 = 4.5411. 0.1 × 4.5411 = 0.4541. 0.9747 + 0.4541 = 1.4288.
Q(S0, right) after 5 steps ≈ 1.43. After many more steps, it would converge toward the true Q ≈ 35.79.
The TD error is positive: the target (12.2) is higher than our current estimate (10), so Q(S0, right) will be pushed upward. A positive TD error means "things turned out better than expected." Note: the trick answer in option C is there to make you actually compute it — the TD error is indeed positive (+2.2).
This Q-learning update has a subtle bug. The Q-values diverge instead of converging. Click the buggy line.
function qUpdate(Q, s, a, r, sPrime, alpha, gamma) { const actions = Object.keys(Q[sPrime]); let maxQ = -Infinity; for (const ap of actions) { if (Q[sPrime][ap] > maxQ) maxQ = Q[sPrime][ap]; } const target = r + gamma * maxQ; Q[s][a] = Q[s][a] + alpha * target; return Q; }
Line 8 is the bug. The update should be Q[s][a] + alpha * (target - Q[s][a]), not Q[s][a] + alpha * target. The - Q[s][a] part is the TD error subtraction — without it, you're always adding a positive quantity, so Q-values grow without bound.
The correct Q-learning update is: Q(s,a) ← Q(s,a) + α(r + γ max Q(s',a') − Q(s,a)). The subtracted Q(s,a) is what makes this a temporal difference — the difference between the bootstrapped target and the current estimate.
Value iteration finds the optimal value function. But sometimes you need to evaluate a given policy — one that might not be optimal. How good is this specific policy? This is policy evaluation, and it's the inner loop of policy iteration.
Consider a 3×3 gridworld. States 0-8 arranged in a grid. Actions: up, down, left, right (deterministic, walls bounce back). Terminal state at position 8 (bottom-right) with R=+1. All other transitions have R=0. γ=0.9.
| S0 | S1 | S2 |
| S3 | S4 | S5 |
| S6 | S7 | S8 (+1) |
Consider the uniform random policy: π(a|s) = 0.25 for all 4 actions, at every state. Under this policy, the agent stumbles around randomly. How valuable is each state?
Starting from V=0 everywhere. Under the uniform random policy, S7 can move up (to S4), down (wall, stays S7), left (to S6), or right (to S8, reward +1). Compute V1(S7).
Only the "right" action from S7 reaches the terminal state S8 and gets R=+1. With uniform random policy, we pick that action 25% of the time. Since V0=0 everywhere, the γ term contributes nothing in the first iteration. The reward signal only propagates to neighbors of the terminal state in iteration 1.
S5 can move up (S2), down (S8, reward +1), left (S4), right (wall, stays S5). Compute V1(S5).
Same as S7 — both are adjacent to the terminal state S8 with one action leading there. S4 (center) has no direct path to S8, so V1(S4) = 0. Value propagates outward from the goal like a wave.
After iteration 1: V1 = [0, 0, 0, 0, 0, 0.25, 0, 0.25, terminal]. S4 can go up(S1), down(S7), left(S3), right(S5). Compute V2(S4). Assume S8 is absorbing with V=10 (= 1/(1-0.9)).
S4's neighbors after iter 1: S1(0), S7(0.25), S3(0), S5(0.25). All transitions have R=0.
S4 is 2 steps from the goal, so value reaches it in iteration 2. The value is smaller because it's discounted through two random steps — the signal attenuates with distance from the reward. Corner states like S0 (4 steps away) won't get significant value until iteration 4+.
The optimal policy reaches S8 in exactly 4 steps from S0 (right, right, down, down or similar). The random walk takes on average ~16 steps from S0 to S8 (random walks on grids are slow). With γ=0.9, the effective discount after 16 steps is 0.916 ≈ 0.185, versus 0.94 = 0.656 for the optimal path. The random policy also sometimes wanders for 50+ steps, heavily discounting the eventual reward.
In the 3×3 gridworld, how many iterations of policy evaluation are needed before V(S0) first becomes non-zero (assuming exact arithmetic, starting from V=0)?
The reward signal propagates one step per iteration. S8 is the terminal state with R=+1. The shortest path from S0 to S8 is 4 steps (e.g., S0→S1→S2→S5→S8).
This "wave propagation" pattern is fundamental to iterative methods. The number of iterations before a state gets signal equals its shortest distance (in graph hops) from any reward source. This is why value iteration is slow for large state spaces with sparse rewards.
Q-learning builds a table of values and derives a policy from it. Policy gradient methods do the opposite: they directly parameterize the policy πθ and optimize it using gradient ascent on expected return. This is the foundation of modern RL — PPO, SAC, and every algorithm used to train ChatGPT.
Consider a simple 2-action softmax policy. State s has features [1], and the policy has weights θ = [θ1, θ2] for actions 0 and 1:
With θ = [2, 1], what is π(a=0|s)?
With θ1 > θ2, action 0 is more probable. The softmax converts raw scores (logits) into a probability distribution. The difference θ1 - θ2 = 1 gives about a 73/27 split. If the difference were 0, it would be 50/50.
For the softmax policy, ∇θ1 log π(a=0) = 1 − π(a=0). With θ=[2,1] and π(a=0)=0.731, what is ∇θ1 log π(a=0)?
The score function for the chosen action is always (1 − π(a)). When π(a) is small (rare action), the gradient is large — taking a rare action and getting positive reward creates a strong learning signal. When π(a) is close to 1 (very likely action), the gradient is small — we already do this action often, so there's less to learn.
A trajectory has rewards [1, 0, 1, 0, 1] at timesteps t=0,1,2,3,4. With γ=0.99, compute G0 (the return from step 0).
Only the non-zero rewards contribute. Since γ=0.99 is close to 1, there's minimal discounting — G0 is nearly the undiscounted sum of 3 ones = 3.
Same trajectory [1, 0, 1, 0, 1], γ=0.99. Compute G4 (the return from the last step).
At the last timestep, there are no future rewards, so GT = rT. This is why REINFORCE computes returns backward: Gt = rt + γ Gt+1. Starting from G4=1, we get G3 = 0 + 0.99×1 = 0.99, G2 = 1 + 0.99×0.99 = 1.9801, G1 = 0 + 0.99×1.9801 = 1.9603, G0 = 1 + 0.99×1.9603 = 2.9406.
Write a function that computes the return Gt for every timestep in a reward sequence, using the backward recursion Gt = rt + γ Gt+1.
javascript function computeReturns(rewards, gamma) { const n = rewards.length; const G = new Array(n); G[n - 1] = rewards[n - 1]; for (let t = n - 2; t >= 0; t--) { G[t] = rewards[t] + gamma * G[t + 1]; } return G; }
∇ log π(a=0) points in the direction that increases π(a=0). Multiplying by Gt = -2 flips the direction. So the gradient update θ ← θ + α ∇ log π × Gt decreases π(a=0).
This is the beauty of REINFORCE: actions followed by good outcomes (G>0) get reinforced, actions followed by bad outcomes (G<0) get suppressed. It's gradient ascent on expected return — no critic needed, just sampled trajectories.
REINFORCE works but has terrible variance. The return Gt includes all future randomness, making gradients noisy. The fix: subtract a baseline — typically V(s) — leaving only the advantage of taking action a versus the average:
Consider a 5-step trajectory with learned value estimates:
| t | rt | V(st) | V(st+1) |
|---|---|---|---|
| 0 | 1 | 10.0 | 9.5 |
| 1 | 0 | 9.5 | 8.0 |
| 2 | 2 | 8.0 | 7.0 |
| 3 | 0 | 7.0 | 6.5 |
| 4 | 1 | 6.5 | 0.0 |
Using γ = 0.99 throughout.
Compute the TD residual δ0 = r0 + γ V(s1) − V(s0).
Positive δ means "things went better than V predicted." We got reward 1 and ended up in a state worth 9.5, but V(s0) only predicted 10.0 total. The actual one-step return was 10.405, so the advantage is +0.405.
Compute δ4 (the last step). V(s5) = 0 (terminal).
Large negative δ! The value function predicted V(s4)=6.5 but the episode ended right after with only reward 1. The value estimate was way too optimistic. This TD error will push V(s4) downward in training.
With λ=0, GAE reduces to Ât = δt. The TD residuals are: δ0=0.405, δ1=-1.58, δ2=0.93, δ3=-0.565, δ4=-5.5. What is Â0 with λ=0?
With λ=0, the advantage is just the 1-step TD error. It only looks one step ahead, relying entirely on the value function V for long-term prediction. This has low variance (one step of randomness) but high bias (if V is inaccurate, the advantage is wrong).
With λ=1 and γ=0.99, compute Â0. Use δ = [0.405, -1.58, 0.93, -0.565, -5.5].
Â0 = δ0 + γδ1 + γ²δ2 + γ³δ3 + γ4δ4 (since γλ = 0.99×1 = 0.99)
With λ=1, GAE is equivalent to Gt - V(st) — the Monte Carlo advantage. The large negative value comes from δ4 = -5.5, which propagates backward. This is high variance (one bad terminal outcome dominates) but unbiased.
Now compute Â0 with λ=0.95. The decay factor is γλ = 0.99×0.95 = 0.9405.
Using the backward recursion Ât = δt + γλ Ât+1 with γλ = 0.9405:
The GAE with λ=0.95 gives a value between λ=0 (+0.405) and λ=1 (-6.06). The backward recursion naturally down-weights distant TD errors by (γλ)l. The large negative δ4 still dominates because λ=0.95 is close to 1, but it's attenuated compared to the pure Monte Carlo estimate.
Lower λ means the advantage estimate relies more on the value function V (which is a learned, smooth estimate) and less on sampled returns (which are noisy). At λ=0, the advantage is just δt = r + γV(s') - V(s), using only one step of real randomness. This dramatically reduces variance at the cost of bias if V is inaccurate.
Increasing γ would make things worse (longer effective horizon = more variance). Increasing learning rate doesn't reduce gradient variance. Removing the baseline V(s) would catastrophically increase variance — the baseline is the single biggest variance reduction technique in policy gradients.
The discount factor γ is the most underrated hyperparameter in RL. It controls how far into the future the agent looks, the effective planning horizon, and even whether the value function converges at all. Getting γ wrong can make an otherwise-correct algorithm fail completely.
Reward sequence: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] (ten 1's). Compute the discounted return G0 for γ=0.9.
Using the geometric series formula: ∑ γk from k=0 to n-1 = (1 − γn) / (1 − γ). With γ=0.9 and n=10, we get 6.513 out of a possible 10 (undiscounted). The last few rewards contribute very little.
Same reward sequence [1,1,...,1] (ten 1's), now with γ=0.99.
γ=0.99 preserves 95.6% of the undiscounted sum over 10 steps. Compared to γ=0.9 (65.1%), the higher discount factor lets the agent see much further into the future.
What is the effective horizon for γ=0.995?
The agent effectively plans 200 steps ahead. This is common for robotics tasks where episodes last a few hundred steps. For game-playing (thousands of steps), you might need γ=0.999 (H=1000) or even γ=0.9999 (H=10000).
With γ=1 and a continuing task (no terminal state), the return G = r0 + r1 + r2 + ... sums to infinity for any non-zero reward. V(s) = ∞ for all states — useless. Even with terminal states, γ=1 means the Bellman operator T is no longer a γ-contraction, so value iteration may not converge (or converges arbitrarily slowly). The average-reward formulation (R-learning) handles the γ=1 case by subtracting the average reward, but it requires special algorithms.
With γ=0.9 and a 40-move game, the opening move's reward is discounted by 0.940 ≈ 0.015 — only 1.5% of the terminal reward reaches the opening. The agent would learn to play well in the endgame but make random-looking opening moves. With γ=0.999, the discount after 100 moves is 0.999100 ≈ 0.905 — the opening move still gets 90% credit. For sparse terminal rewards, you want Heff much larger than the episode length.
You're at a restaurant with 10 dishes. You've tried 3 and found one you like. Do you order that one again (exploit) or try something new that might be better (explore)? This is the exploration-exploitation dilemma — the fundamental tension in all of RL.
ε=0.1, 4 actions, action 2 is the greedy action (highest Q). What is the probability of selecting action 2? What about action 0?
Answer with P(action 2).
The greedy action gets both the (1-ε) exploit probability AND a share of the ε explore probability (since random exploration might randomly pick it). Non-greedy actions each get ε/n_actions = 0.025 = 2.5% chance.
Total steps t=100, c=2. Three actions with Q-values and visit counts:
| Action | Q(a) | N(a) |
|---|---|---|
| 0 | 3.0 | 50 |
| 1 | 2.5 | 40 |
| 2 | 4.0 | 10 |
Compute UCB(action 2) = Q(2) + c √(ln(100)/N(2)).
Action 2 has the highest Q AND the highest UCB bonus (because N(2)=10 is low — it's under-explored). Compare: UCB(0) = 3.0 + 2√(4.605/50) = 3.0 + 0.607 = 3.607. UCB(1) = 2.5 + 2√(4.605/40) = 2.5 + 0.679 = 3.179. Action 2 wins decisively — both high reward and high exploration bonus.
Q-values: Q(0)=3, Q(1)=5, Q(2)=1. Temperature τ=2. Compute π(action 1) under Boltzmann exploration.
Correcting: with more precise values, exp(1.5)≈4.4817, exp(2.5)≈12.1825, exp(0.5)≈1.6487. Sum=18.313. π(1) = 12.1825/18.313 ≈ 0.665. The highest Q action gets most probability, but not all — temperature τ=2 keeps exploration alive. With τ→0, this approaches greedy. With τ→∞, this approaches uniform random.
exp(50) completely dominates. π(1) ≈ 1.0. At low temperature, Boltzmann exploration becomes nearly greedy — only the highest Q action gets selected. This is useful for deployment (exploit the learned policy) but terrible for training (no exploration).
This ε-greedy implementation has a bug that causes the agent to always explore (never exploit). Click the buggy line.
function epsilonGreedy(Q, epsilon) { const nActions = Q.length; if (Math.random() < epsilon) { return Math.floor(Math.random() * nActions); } let bestA = 0; for (let a = 0; a < nActions; a++) { if (Q[a] > Q[a]) bestA = a; } return bestA; }
Line 8 is the bug: Q[a] > Q[a] compares each element to itself, which is always false. It should be Q[a] > Q[bestA]. Because the comparison is always false, bestA stays 0 and the greedy branch always returns action 0 regardless of Q-values. Combined with the exploration branch (which returns random actions), the agent effectively has a random policy with a slight bias toward action 0.
REINFORCE wastes data — each trajectory is used once and discarded. Proximal Policy Optimization (PPO) reuses data by doing multiple gradient steps on the same batch, but prevents the policy from changing too much using a clever clipping mechanism. PPO is the workhorse algorithm behind ChatGPT's RLHF.
rt = 1.3, At = 0.5, ε = 0.2. Compute both terms of the PPO objective and identify which is the min.
Term 1: rt × At. Term 2: clip(rt, 0.8, 1.2) × At.
The policy wants to increase this action's probability (A>0 and r>1 means we're already moving that direction). But the clipping limits the credit: instead of getting 0.65 objective value from a 30% probability increase, we cap it at 0.60 from a 20% increase. This prevents the policy from changing too aggressively in a single update.
Now rt = 1.3, At = -0.5, ε = 0.2. What is LCLIP?
With negative advantage, the unclipped term is MORE negative (worse). The min takes the more negative value. Here clipping does NOT activate — the gradient is the full unclipped gradient. Why? Because the policy increased this action's probability (r=1.3) but the action was bad (A<0). PPO says: "You moved in the wrong direction? Full penalty — no clipping protection." Clipping only protects beneficial changes from being too large.
The policy decreased this action's probability (r=0.7) but A is positive (the action was good). This is moving in the wrong direction. The min takes 0.35 (the unclipped, more pessimistic value), so the gradient pushes r back up toward 1. Clipping technically activates (r is below lower bound) but the min already selects the unclipped term, so clipping has no effect on the gradient.
Complete this table. For each case, compute LCLIP. ε=0.2.
| rt | At | Term 1 | clip(r) | Term 2 | min |
|---|---|---|---|---|---|
| 1.3 | +0.5 | 0.65 | 1.2 | 0.60 | ? |
| 1.3 | −0.5 | −0.65 | 1.2 | −0.60 | ? |
| 0.7 | +0.5 | 0.35 | 0.8 | 0.40 | ? |
| 0.7 | −0.5 | −0.35 | 0.8 | −0.40 | ? |
What is LCLIP for the last row (r=0.7, A=-0.5)?
Here the policy decreased this action's probability (r=0.7) and the action was bad (A<0). This is a correct move — reducing a bad action. Clipping activates and takes the more pessimistic (more negative) clipped value. This limits how much credit the policy gets for correctly avoiding bad actions. The pattern: clipping always prevents the objective from being "too good" — it's pessimistic about beneficial changes in both directions.
When A>0 and r=1.8 > 1+ε=1.2, the objective is min(1.8A, 1.2A) = 1.2A. This is constant w.r.t. θ (the clip function has zero gradient), so the gradient contribution from this sample is exactly zero. The policy has already moved "enough" for this action and clipping prevents further movement. This is the core mechanism: once the ratio exits [0.8, 1.2] in the beneficial direction, that sample stops contributing gradients.
TRPO requires computing the Fisher information matrix (or its inverse via conjugate gradients) and doing a line search to satisfy the KL constraint. This is complex, slow, and hard to parallelize across GPUs. PPO replaces all of that with a simple clip operation — you just compute the clipped surrogate loss and do standard SGD. Same idea (prevent large policy updates), drastically simpler implementation. This simplicity is why PPO became the default algorithm for RLHF in LLMs.
You've learned the math. Now put it all together. You're designing an RL system for a real task: a robot arm that must learn to pick up objects from a bin. This capstone tests your ability to choose hyperparameters, estimate compute requirements, and design the training pipeline.
γ=0.99 gives Heff=100, matching the max episode length. With γ=0.99, the discount at step 100 is 0.99100 ≈ 0.366 — the agent still gets 37% of the grasp reward signal at the start of the episode. With γ=0.95, it would be 0.95100 ≈ 0.006 — practically zero. γ=0.9999 would work but makes value function learning harder (higher variance returns) and value iteration slower, with no benefit since episodes are only 100 steps.
Your replay buffer stores 1 million transitions. Each transition: state (128×128×3 image as uint8 + 6 float32 joint angles), action (6 float32), reward (1 float32), next_state (same as state), done flag (1 byte). How much memory does the buffer consume?
93.7 GB for 1M transitions — this is why image-based RL is memory-hungry. Common optimizations: store images as compressed JPEG (5-10x smaller), or store only the current state and reconstruct next_state by indexing (halves memory). With compression, 1M transitions might fit in ~10-20 GB.
Rule of thumb: simple robotics tasks need ~1M environment steps, complex manipulation needs ~10M steps. At 100 steps per episode and 80% failure rate initially, approximately how many episodes is 5M steps?
50K episodes. At 100ms per sim step (typical for a fast physics simulator like MuJoCo), total wall time = 5M × 0.1s = 500K seconds = ~5.8 days single-threaded. With 32 parallel environments, ~4.4 hours. This is why parallel env collection is essential for robotics RL.
A CNN for the image is standard — it provides translation-invariant features and dramatically reduces dimensionality. Flattening the raw image gives 49K dims, which is wasteful and loses spatial structure. A transformer would work but is overkill for 128×128. Ignoring the image throws away critical object location information. The standard architecture is: shared CNN encoder → concat with proprioception → separate MLP heads for policy and value. The CNN is often 3-4 conv layers with stride 2, reducing 128×128×3 to ~4×4×64 = 1024, then a linear to 256.
3e-4 with Adam is the PPO default that works across a wide range of tasks. It's small enough to be stable with the clipping mechanism but large enough to learn in reasonable time. 0.1 is far too large for policy gradient methods (the policy would collapse). 1e-6 would take millions of gradient steps to learn anything. The PPO paper (Schulman 2017) used 3e-4 for all continuous control benchmarks and it has become the de facto standard. For image-based tasks, some practitioners use a slightly lower rate (1e-4) for the CNN encoder.
Order these components into the correct PPO training loop. Drag the chips into the pipeline slots.
The PPO training loop:
Your CNN+MLP policy network has ~2M parameters. PPO does 4 epochs of gradient updates per rollout collection of 2048 steps. With 5M total environment steps, how many gradient updates does the policy network receive?
If counting full-batch updates: 2441 × 4 = 9766. If counting minibatch updates with batch size 64: 2441 × 128 = 312,448. The answer depends on what you count as an "update." With the standard interpretation (one optimizer.step() per minibatch): ~9766 if you use full-batch, or ~312K if minibatched.
For comparison, a supervised learning model might see ~100K gradient updates. PPO does fewer but each one is noisier (on-policy data, changing distribution). The 4 epochs of data reuse are the key efficiency gain over vanilla REINFORCE.
| Topic | Lesson |
|---|---|
| MDP fundamentals | Markov Decision Processes — From Absolute Zero |
| RL algorithms | RL Algorithms — From Absolute Zero |
| Reward alignment | Reward & Alignment — From Absolute Zero |
| POMDPs | POMDPs — From Absolute Zero |
| Transformer math | Transformer Math Workbook |