Beyond RL fundamentals: RLHF, DPO, offline RL, model-based methods, reward shaping, and multi-agent equilibria. Every derivation you need for alignment research and advanced decision-making, solvable in-browser with instant feedback.
You have a language model that generates text, but no explicit reward signal. Instead, you have human annotators who compare pairs of outputs and say "I prefer response A over response B." How do you turn pairwise preferences into a scalar reward function?
The Bradley-Terry model is the standard answer. It assumes each response y has a latent "quality" r(y), and the probability that a human prefers response a over response b follows a logistic model:

$$P(a \succ b) = \sigma\big(r(a) - r(b)\big) = \frac{1}{1 + \exp\big(-(r(a) - r(b))\big)}$$
Response A has reward r(A) = 2.0, response B has reward r(B) = 0.5. What is P(A ≻ B) under Bradley-Terry?
Compute σ(2.0 − 0.5) = 1 / (1 + exp(−1.5)).
A reward gap of 1.5 gives ~82% preference probability. In chess Elo terms, this is roughly a 260-point rating difference (400 · 1.5 / ln 10 ≈ 260).
You have 3 comparisons. Current reward model outputs: r(yw1)=1.0, r(yl1)=0.3; r(yw2)=2.5, r(yl2)=1.0; r(yw3)=0.2, r(yl3)=0.8. What is the total log-likelihood?
Sum log σ(rw − rl) for each comparison. Note comparison 3 has the "wrong" ordering (rw < rl).
Comparison 3 hurts the most: the reward model ranks the loser higher than the winner, contributing a large negative log-likelihood. Training will push r(yw3) up and r(yl3) down. With σ(0.7)≈0.668, σ(1.5)≈0.818, σ(−0.6)≈0.354, the three terms are log 0.668 ≈ −0.403, log 0.818 ≈ −0.201, and log 0.354 ≈ −1.037, for a total of about −1.64. The exact value depends on the precision of the intermediate values. Accept −1.46 to −1.65 range.
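The arithmetic for the three comparisons can be checked with a minimal sketch like the one below. The helper names `sigmoid` and `btLogLikelihood` are illustrative choices, not part of any library.

```javascript
// Sigmoid written so that Math.exp never receives a large positive argument.
function sigmoid(x) {
  if (x >= 0) return 1 / (1 + Math.exp(-x));
  const e = Math.exp(x);
  return e / (1 + e);
}

// Total Bradley-Terry log-likelihood: sum of log sigma(rWin - rLose).
// Each pair is [winnerReward, loserReward].
function btLogLikelihood(pairs) {
  let total = 0;
  for (const [rWin, rLose] of pairs) {
    total += Math.log(sigmoid(rWin - rLose));
  }
  return total;
}

const comparisons = [[1.0, 0.3], [2.5, 1.0], [0.2, 0.8]];
console.log(btLogLikelihood(comparisons).toFixed(3)); // -1.642
```

The third pair alone contributes about −1.04, which is why it dominates the total.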
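The gradient of the Bradley-Terry loss $\mathcal{L} = -\log \sigma\big(r(y_w) - r(y_l)\big)$ for one comparison (yw, yl) with respect to r(yw) is:

$$\frac{\partial \mathcal{L}}{\partial r(y_w)} = -\Big(1 - \sigma\big(r(y_w) - r(y_l)\big)\Big) = -\sigma\big(r(y_l) - r(y_w)\big)$$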
If r(yw) = 0.5, r(yl) = 1.5 (reward model is wrong), what is the gradient magnitude pushing r(yw) up?
When the model is confidently wrong (assigning higher reward to the loser), the gradient is large (~0.73). When the model is correct and confident, σ → 1 and the gradient → 0. This is the same self-correcting property as logistic regression.
Write a function that takes arrays of winner rewards and loser rewards, and returns the mean negative log-likelihood (the loss we minimize).
```javascript
function bradleyTerry(rWin, rLose) {
  let loss = 0;
  for (let i = 0; i < rWin.length; i++) {
    const diff = rWin[i] - rLose[i];
    loss += Math.log(1 + Math.exp(-diff)); // -log(sigma(diff))
  }
  return loss / rWin.length;
}
```
This reward model training loss is producing NaN after a few steps. Click the buggy line.
```javascript
function rewardLoss(rWin, rLose) {
  let loss = 0;
  for (let i = 0; i < rWin.length; i++) {
    const prob = 1 / (1 + Math.exp(-(rWin[i] - rLose[i])));
    loss += Math.log(prob);
  }
  return loss / rWin.length;
}
```
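Line 5 is the bug. When rWin[i] − rLose[i] is very negative, Math.exp overflows to Infinity, prob becomes exactly 0, and Math.log(0) returns -Infinity, which cascades to NaN in subsequent math. A first fix is to drop the division and accumulate loss -= Math.log(1 + Math.exp(-(rWin[i] - rLose[i]))), using the identity log σ(x) = −log(1 + exp(−x)). The argument of the log is then always ≥ 1, so log(0) can never occur. That form can still overflow to Infinity once x drops below roughly −709, so a fully stable version computes log σ(x) as Math.min(x, 0) - Math.log1p(Math.exp(-Math.abs(x))), which only ever exponentiates a non-positive number.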
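One way to write the loss so that it neither hits log(0) nor overflows is sketched below. It relies on the identity log σ(x) = min(x, 0) − log(1 + exp(−|x|)). The names `logSigmoid` and `stableRewardLoss` are illustrative, and unlike the buggy snippet this version returns the positive mean negative log-likelihood.

```javascript
// log(sigma(x)) computed without overflow or log(0).
// Math.exp only ever sees a non-positive argument here.
function logSigmoid(x) {
  return Math.min(x, 0) - Math.log1p(Math.exp(-Math.abs(x)));
}

// Mean negative log-likelihood over a batch of (winner, loser) rewards.
function stableRewardLoss(rWin, rLose) {
  let loss = 0;
  for (let i = 0; i < rWin.length; i++) {
    loss -= logSigmoid(rWin[i] - rLose[i]);
  }
  return loss / rWin.length;
}

console.log(stableRewardLoss([0.5], [1.5]).toFixed(4)); // 1.3133
console.log(stableRewardLoss([0], [1000]));             // 1000 (finite, no NaN)
```

For extreme gaps the loss grows linearly with the gap instead of blowing up, which keeps gradients finite.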
You've trained a reward model. Now you want to fine-tune your language model to produce high-reward outputs. But there's a catch: if you optimize the reward model too aggressively, the LM will find "adversarial" outputs — text that scores high on the reward model but is actually garbage. This is reward hacking.
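The RLHF objective adds a KL penalty to keep the fine-tuned policy π close to the original reference model πref:

$$J(\pi) = \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta \, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$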
Policy π has distribution [0.7, 0.2, 0.1] over 3 tokens. Reference πref has [0.4, 0.4, 0.2]. Compute KL(π || πref).
KL ≈ 0.184 nats. The policy concentrates mass on token 1 (0.7 vs 0.4), driving the positive first term. The other terms are negative because π assigns less mass than πref. Accept values in the range 0.18–0.23 depending on rounding.
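The same number can be reproduced with a short sketch. The function name `klDivergence` is illustrative, and the code assumes both arrays are valid distributions with q[i] > 0 wherever p[i] > 0.

```javascript
// KL(p || q) in nats for two discrete distributions over the same support.
function klDivergence(p, q) {
  let kl = 0;
  for (let i = 0; i < p.length; i++) {
    // Terms with p[i] = 0 contribute nothing (0 * log 0 is taken as 0).
    if (p[i] > 0) kl += p[i] * Math.log(p[i] / q[i]);
  }
  return kl;
}

console.log(klDivergence([0.7, 0.2, 0.1], [0.4, 0.4, 0.2]).toFixed(4)); // 0.1838
```

Swapping the arguments gives a different value, since KL divergence is not symmetric.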
A generated response gets reward r = 3.0. The total KL divergence between π and πref for this response is 8.0 nats. With β = 0.1, what is the RLHF objective value?
The KL penalty "costs" 0.8 reward units. If β were 0.5 instead, the cost would be 4.0 — making the objective negative (−1.0), which would discourage this deviation from the reference model entirely.
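The RLHF objective has a closed-form optimal policy:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\big(r(x, y) / \beta\big), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\big(r(x, y) / \beta\big)$$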
For a 3-token vocabulary with πref = [0.5, 0.3, 0.2] and rewards r = [1.0, 0.0, −1.0], β = 1.0, compute π*(token 1). First compute the unnormalized values, then the partition function Z.
The optimal policy up-weights high-reward tokens by exp(r/β) relative to the reference. Token 1 goes from 0.5 to 0.784. With smaller β, the shift would be even more extreme. Accept 0.67–0.79 depending on rounding.
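A small sketch of the closed form for a finite vocabulary follows. The function name `optimalPolicy` is illustrative. Subtracting the largest r/β before exponentiating is an added stability step that cancels in the normalization.

```javascript
// pi*(i) = piRef(i) * exp(r(i) / beta) / Z for a finite set of tokens.
function optimalPolicy(refProbs, rewards, beta) {
  const scaled = rewards.map((r) => r / beta);
  const shift = Math.max(...scaled); // cancels out after dividing by Z
  const unnormalized = refProbs.map((p, i) => p * Math.exp(scaled[i] - shift));
  const Z = unnormalized.reduce((a, b) => a + b, 0);
  return unnormalized.map((u) => u / Z);
}

const piStar = optimalPolicy([0.5, 0.3, 0.2], [1.0, 0.0, -1.0], 1.0);
console.log(piStar.map((p) => p.toFixed(3)).join(", ")); // 0.784, 0.173, 0.042
```

Lowering `beta` in this sketch concentrates even more mass on the highest-reward token.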
Write a function that computes the RLHF objective value given a reward, arrays of per-token log-probs from π and πref, and β.
```javascript
function rlhfObjective(reward, logProbs, refLogProbs, beta) {
  // Per-token log-ratios summed over the response estimate the KL term
  let kl = 0;
  for (let t = 0; t < logProbs.length; t++) {
    kl += logProbs[t] - refLogProbs[t];
  }
  return reward - beta * kl;
}
```
You know the objective: maximize reward minus KL penalty. But how do you actually optimize it? The standard approach is Proximal Policy Optimization (PPO), the same algorithm that trained the original InstructGPT and ChatGPT.
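PPO uses a clipped surrogate objective that prevents the policy from changing too much in a single update:

$$L^{\mathrm{CLIP}} = \mathbb{E}_t\Big[\min\big(r_t A_t,\; \mathrm{clip}(r_t,\, 1 - \epsilon,\, 1 + \epsilon)\, A_t\big)\Big], \qquad r_t = \frac{\pi_{\mathrm{new}}(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}$$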
Old policy log-prob: log πold(a|s) = −2.3. New policy log-prob: log πnew(a|s) = −1.8. What is the probability ratio rt?
Hint: rt = exp(log πnew − log πold).
The new policy assigns ~65% more probability to this action. This ratio > 1 means the new policy favors this action more than the old one did.
Given rt = 1.65, At = 2.0, ε = 0.2. Compute both the unclipped and clipped surrogate terms, then the final PPO loss.
The ratio 1.65 exceeds 1+ε=1.2, so the clip activates: the unclipped term is 1.65 × 2.0 = 3.3, the clipped term is 1.2 × 2.0 = 2.4, and the min selects 2.4. The effective gradient treats the ratio as 1.2, preventing a too-large update. Note: when At > 0 (good action), the clip caps the benefit. When At < 0 (bad action), the clip removes any further incentive once the ratio drops below 0.8.
Compute the PPO-Clip loss for 3 examples with ε=0.2:
| Example | log πnew | log πold | Advantage |
|---|---|---|---|
| 1 | −1.2 | −1.5 | +1.0 |
| 2 | −2.0 | −1.8 | −0.5 |
| 3 | −1.0 | −1.0 | +2.0 |
What is the mean PPO-Clip loss across the 3 examples?
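Example 1 gets clipped: its ratio (≈1.35) exceeds 1.2 with a positive advantage, so it contributes 1.2. Example 2 is within bounds: its ratio (≈0.819) sits just above the lower clip at 0.8, so the clip does not activate and it contributes ≈ −0.409. Example 3 is unchanged (ratio = 1.0) and contributes 2.0. The mean is about 0.93. Accept 0.93–1.0 range.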
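The table can be worked through with a minimal sketch of the clipped surrogate. The function name `ppoClipObjective` is illustrative. It returns the mean of min(ratio · A, clip(ratio) · A), which is the quantity the exercise asks for.

```javascript
// Mean clipped surrogate over a batch of (log pi_new, log pi_old, advantage).
function ppoClipObjective(logNew, logOld, advantages, epsilon) {
  let total = 0;
  for (let i = 0; i < logNew.length; i++) {
    const ratio = Math.exp(logNew[i] - logOld[i]);
    const clipped = Math.min(Math.max(ratio, 1 - epsilon), 1 + epsilon);
    total += Math.min(ratio * advantages[i], clipped * advantages[i]);
  }
  return total / logNew.length;
}

const value = ppoClipObjective(
  [-1.2, -2.0, -1.0], // log pi_new
  [-1.5, -1.8, -1.0], // log pi_old
  [1.0, -0.5, 2.0],   // advantages
  0.2
);
console.log(value.toFixed(3)); // 0.930
```

Calling it on the single earlier example (log-probs −1.8 and −2.3, advantage 2.0) returns 2.4, the clipped term.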
In RLHF, the advantage for a response is: A = r(x,y) − β·KL − V(x), where V(x) is the value function baseline. Given r=4.0, KL=5.0, β=0.1, V(x)=3.2, what is the advantage?
Positive advantage means this response was better than average. PPO will increase the probability of generating similar responses. If the advantage were negative, PPO would decrease it. The value baseline V(x) is crucial for reducing gradient variance.
PPO works but it's complicated: you need a reward model, a value model, careful hyperparameter tuning, and multiple training phases. Direct Preference Optimization (DPO) asks: can we skip the reward model entirely and go straight from preferences to policy?
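The key insight: the optimal RLHF policy has the form π*(y|x) ∝ πref(y|x) · exp(r(x,y)/β). Rearranging, the reward is implicitly defined by any policy-reference pair:

$$r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

The β log Z(x) term depends only on the prompt, so it cancels when two responses to the same prompt are compared.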
Given: log πθ(yw|x) = −5.0, log πref(yw|x) = −6.0, log πθ(yl|x) = −4.5, log πref(yl|x) = −4.0. Compute the implicit reward gap Δ = log(π(yw)/πref(yw)) − log(π(yl)/πref(yl)).
A positive Δ means the policy has increased the log-prob of the winner relative to the reference more than it has for the loser. This is exactly what we want. The DPO loss is −log σ(β · 1.5).
Using the Δ = 1.5 from above with β = 0.1. Compute LDPO = −log σ(β · Δ).
Loss is 0.62. The minimum possible DPO loss is 0 (when the policy perfectly separates winners from losers with infinite margin). A random policy that doesn't distinguish them gets loss = log(2) ≈ 0.693. We're slightly better than random. Accept 0.59–0.63.
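The DPO gradient with respect to θ is proportional to:

$$-\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = \beta \, \big(1 - \sigma(\beta \Delta)\big) \, \big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big]$$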
When σ(βΔ) is near 0 (model hasn't learned the preference yet), the weight (1 − σ) is near 1. What is the weight when σ(βΔ) = 0.95?
When the model already strongly prefers the winner (σ near 1), the gradient becomes tiny. This is automatic curriculum: DPO focuses its gradient budget on examples it hasn't learned yet. Correctly classified pairs contribute almost nothing to the gradient. This is the same property as logistic regression.
Write a function that computes the mean DPO loss over a batch of preference pairs.
```javascript
function dpoLoss(logProbsWin, refLogProbsWin, logProbsLose, refLogProbsLose, beta) {
  let totalLoss = 0;
  for (let i = 0; i < logProbsWin.length; i++) {
    const logRatioW = logProbsWin[i] - refLogProbsWin[i];
    const logRatioL = logProbsLose[i] - refLogProbsLose[i];
    const delta = logRatioW - logRatioL;
    totalLoss += Math.log(1 + Math.exp(-beta * delta)); // -log(sigma(beta * delta))
  }
  return totalLoss / logProbsWin.length;
}
```
This DPO implementation trains to a loss near 0, yet the model never learns to prefer winners over losers. Click the buggy line.

```javascript
function dpoLoss(lpW, refW, lpL, refL, beta) {
  let loss = 0;
  for (let i = 0; i < lpW.length; i++) {
    const ratioW = lpW[i] - refW[i];
    const ratioL = lpL[i] - refL[i];
    const delta = ratioW + ratioL;
    loss += Math.log(1 + Math.exp(-beta * delta));
  }
  return loss / lpW.length;
}
```

Line 6 is the bug. It uses ratioW + ratioL instead of ratioW - ratioL. The DPO loss needs the difference in log-ratios: how much more the policy favors the winner than the loser, relative to the reference. For an untrained model (π ≈ πref) both ratios are near 0, so the buggy and correct versions both start at loss = log(2) ≈ 0.693. With the sum, however, the loss falls whenever either log-ratio rises, so the gradient pushes up the probability of the winner and the loser together. The loss heads toward 0 while the model learns nothing about which response is preferred.
Standard RL lets the agent explore. But what if you can't explore? You have a fixed dataset of (state, action, reward, next_state) tuples collected by some behavior policy, and you need to learn from it without taking any new actions. This is offline RL.
The fundamental problem: Q-learning overestimates Q-values for out-of-distribution (OOD) actions — actions the behavior policy never took. The Q-network has never seen these state-action pairs during training, so its predictions are unreliable and often inflated.
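Conservative Q-Learning (CQL) adds a regularizer that explicitly pushes down Q-values for actions not in the dataset:

$$\mathcal{L}_{\mathrm{CQL}} = \Big(Q(s, a) - \big(r + \gamma \max_{a'} Q(s', a')\big)\Big)^2 + \alpha \Big(\mathbb{E}_{a \sim \mu}\big[Q(s, a)\big] - \mathbb{E}_{a \sim \mathcal{D}}\big[Q(s, a)\big]\Big)$$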
State s has 4 possible actions. The Q-table is Q(s,·) = [3.0, 5.0, 2.0, 4.0]. The dataset contains only actions a1 (Q=5.0) and a3 (Q=4.0). Compute the CQL regularizer: Eμ(uniform)[Q] − ED[Q].
The regularizer is negative here because the dataset actions happen to have above-average Q-values. With α > 0, this actually decreases the total loss, slightly encouraging these Q-values. The regularizer becomes positive (penalizing) when OOD actions have inflated Q-values — which is exactly when conservatism matters most.
Given: Q(s,a) = 5.0, reward r = 1.0, γ = 0.99, maxa' Q(s',a') = 4.0, α = 1.0. The CQL regularizer for this state is 2.0 (OOD actions are overestimated). Compute the total CQL loss.
LCQL = (Q(s,a) − (r + γ · max Q'))² + α · regularizer.
The CQL regularizer (2.0) dominates the TD error (0.0016). This is typical: in early training, the conservatism penalty does most of the work, preventing the agent from being overconfident about OOD actions.
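Both CQL exercises above can be reproduced with a small sketch. The function names are illustrative, and the dataset actions a1 and a3 are assumed to be zero-based indices 1 and 3 into the Q array.

```javascript
// Uniform-proposal regularizer: mean Q over all actions minus mean Q over dataset actions.
function uniformCqlRegularizer(qValues, datasetActionIndices) {
  const meanAll = qValues.reduce((a, b) => a + b, 0) / qValues.length;
  let meanData = 0;
  for (const idx of datasetActionIndices) meanData += qValues[idx];
  meanData /= datasetActionIndices.length;
  return meanAll - meanData;
}

// Squared TD error plus the alpha-weighted regularizer, for one transition.
function cqlLoss(q, reward, gamma, maxNextQ, alpha, regularizer) {
  const tdError = q - (reward + gamma * maxNextQ);
  return tdError * tdError + alpha * regularizer;
}

console.log(uniformCqlRegularizer([3.0, 5.0, 2.0, 4.0], [1, 3])); // -1
console.log(cqlLoss(5.0, 1.0, 0.99, 4.0, 1.0, 2.0).toFixed(4));   // 2.0016
```

The second call makes the imbalance visible: the squared TD error is 0.0016 and the regularizer supplies the remaining 2.0.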
In practice, CQL uses LogSumExp instead of a uniform average for Eμ[Q]. For Q(s,·) = [1.0, 3.0, 2.0], compute the LogSumExp: log(∑a exp(Q(s,a))).
LogSumExp ≈ 3.41, which is slightly above max(Q) = 3.0. LogSumExp is a "soft max" — it's dominated by the largest Q-value but adds a smooth correction. Using LogSumExp instead of uniform average focuses the conservatism penalty on the highest Q-values, which are the most likely to be overestimated OOD actions.
Write a function that computes the CQL regularizer using LogSumExp for a single state.
```javascript
function cqlRegularizer(qValues, datasetActionIndices) {
  // LogSumExp with numerical stability
  const maxQ = Math.max(...qValues);
  let sumExp = 0;
  for (const q of qValues) sumExp += Math.exp(q - maxQ);
  const lse = maxQ + Math.log(sumExp);

  // Mean Q for dataset actions
  let dataQ = 0;
  for (const idx of datasetActionIndices) dataQ += qValues[idx];
  dataQ /= datasetActionIndices.length;

  return lse - dataQ;
}
```
CQL works by penalizing Q-values. Implicit Q-Learning (IQL) takes a different approach: it avoids querying the Q-function on OOD actions entirely. The trick is expectile regression.
Standard regression minimizes the mean squared error, which is symmetric around the target. Expectile regression uses an asymmetric loss that weights positive and negative errors differently:

$$L_\tau(u) = \big|\tau - \mathbb{1}(u < 0)\big| \, u^2, \qquad u = Q(s, a) - V(s)$$
V(s) = 3.0. Two dataset transitions from state s: Q(s,a1) = 5.0 and Q(s,a2) = 1.0. With τ = 0.7, compute the expectile loss.
The underestimation (Q > V) gets weight 0.7, the overestimation gets 0.3. This asymmetry pushes V upward toward the higher Q-values. If τ = 0.5, both weights would be 0.5 and V would converge to the mean (3.0). With τ = 0.7, V converges to something between the mean and the max.
Given Q-values [2.0, 4.0, 6.0] with equal frequency in the dataset, τ = 0.9. The optimal V satisfies τ · ∑(errors for Q > V) = (1−τ) · ∑(errors for Q ≤ V). If V = 5.5, compute the left and right sides. Is V too high, too low, or about right?
Write a function that computes the mean expectile loss for a batch of (Q, V) pairs.
```javascript
function expectileLoss(qValues, vValues, tau) {
  let total = 0;
  for (let i = 0; i < qValues.length; i++) {
    const diff = qValues[i] - vValues[i];
    const w = diff > 0 ? tau : (1 - tau); // tau when V underestimates Q
    total += w * diff * diff;
  }
  return total / qValues.length;
}
```
For Q-values [1.0, 2.0, 3.0, 4.0, 10.0] (the 10.0 is an outlier), what is the expectile at τ = 0.5 versus τ = 0.9?
The τ-expectile is the value V that satisfies $\tau \sum_{Q > V} (Q - V) = (1 - \tau) \sum_{Q \le V} (V - Q)$. For τ=0.5, it's just the mean.
At τ=0.9, only the outlier lies above V, so the condition becomes 0.9 · (10 − V) = 0.1 · (4V − 10), giving V = 10/1.3 ≈ 7.69. This expectile is much closer to the max (10.0) than the mean (4.0) is. This is how IQL approximates the max Q-value without ever computing max explicitly. Higher τ = more optimistic = closer to the best actions in the dataset.
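To experiment with other values of τ, the balance condition can be solved numerically. The sketch below uses bisection, which is an illustrative choice: IQL itself fits V by gradient descent on the expectile loss rather than by solving this equation directly. The function name `expectile` is hypothetical.

```javascript
// Solve tau * sum(max(q - v, 0)) = (1 - tau) * sum(max(v - q, 0)) for v.
// The left side minus the right side decreases as v grows, so bisection works.
function expectile(values, tau) {
  const imbalance = (v) => {
    let s = 0;
    for (const q of values) {
      s += q > v ? tau * (q - v) : -(1 - tau) * (v - q);
    }
    return s;
  };
  let lo = Math.min(...values);
  let hi = Math.max(...values);
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (imbalance(mid) > 0) lo = mid; // v still too low
    else hi = mid;                    // v too high (or exact)
  }
  return (lo + hi) / 2;
}

const qs = [1.0, 2.0, 3.0, 4.0, 10.0];
console.log(expectile(qs, 0.5).toFixed(2)); // 4.00
console.log(expectile(qs, 0.9).toFixed(2)); // 7.69
```

Pushing τ toward 1 in this sketch moves the result toward the largest value in the array, which is the behavior IQL exploits.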
Model-free RL learns a policy or value function directly from experience. Model-based RL first learns a dynamics model — a function that predicts what happens next — then uses it to generate imaginary experience for planning or policy optimization.
A model-free agent needs 100,000 real environment steps to learn a good policy. A Dyna agent with k=5 model rollouts per real step achieves the same performance. Roughly how many real steps does the Dyna agent need?
A 6× reduction in real environment interaction. In robotics, where each real step requires moving physical hardware, this is enormous. The caveat: model accuracy matters. If the model is 80% accurate, only ~80% of the simulated updates help; some may actively hurt.
Your dynamics model has per-step prediction error ε = 0.02 (2% error in predicting the next state). If you roll out H = 20 steps, what is the approximate cumulative error (assuming linear compounding)?
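After 20 steps, even a model with only 2% per-step error has accumulated 40% error under linear compounding (20 × 0.02) and about 49% under multiplicative compounding (1.02^20 − 1). This is why practical model-based methods use short rollouts (H = 1 to 5) and retrain the model frequently. MBPO (Janner et al., 2019) keeps rollouts short by branching them from real states in the replay buffer.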
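The two compounding assumptions can be compared side by side with a tiny sketch. Both formulas are rough approximations rather than a model of any particular dynamics network, and the function names are illustrative.

```javascript
// Linear compounding: errors simply add up over the horizon.
function linearError(eps, horizon) {
  return eps * horizon;
}

// Multiplicative compounding: each step scales the accumulated error by (1 + eps).
function compoundedError(eps, horizon) {
  return Math.pow(1 + eps, horizon) - 1;
}

console.log(linearError(0.02, 20).toFixed(2));     // 0.40
console.log(compoundedError(0.02, 20).toFixed(2)); // 0.49
console.log(compoundedError(0.02, 5).toFixed(2));  // 0.10
```

At H = 5 the two estimates are nearly identical, which is one way to see why short rollouts are the safer regime.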
MBPO generates branched rollouts: starting from 10,000 real states in the replay buffer, rolling out H=3 steps with the model, producing 10,000 × 3 = 30,000 synthetic transitions. If each transition stores (s, a, r, s') with s ∈ R20, a ∈ R6, and uses FP32, how much memory for the synthetic buffer?
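Only about 5.6 MB: each transition stores 20 + 6 + 1 + 20 = 47 floats, so 47 × 4 bytes × 30,000 ≈ 5.64 MB. Synthetic data is cheap to store. The real cost is the forward passes through the dynamics model. Assuming a 5-member ensemble, that is 30,000 transitions × 5 ensemble members = 150,000 model forward passes per iteration. This is why model-based RL trades compute for sample efficiency.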
RL with sparse rewards (e.g., +1 for winning a game, 0 otherwise) is incredibly hard. The agent has to stumble upon success by chance before it can learn anything. Reward shaping adds intermediate rewards to guide exploration, but done wrong, it changes the optimal policy.
Potential-based reward shaping (Ng et al., 1999) is the one form of shaping guaranteed to preserve the optimal policy:

$$r'(s, a, s') = r(s, a, s') + F(s, s'), \qquad F(s, s') = \gamma \, \Phi(s') - \Phi(s)$$
Grid world. Φ(s) = −(Manhattan distance to goal). Agent at (2,3), moves to (2,2), goal at (0,0). γ = 0.99. Original reward r = 0 (no terminal). Compute the shaped reward r'.
Φ(2,3) = −(2+3) = −5. Φ(2,2) = −(2+2) = −4.
Moving closer to the goal earns a positive shaping bonus (~1.04). Moving away would earn a negative bonus. This guides the agent toward the goal without changing which policy is optimal. The small γ discount means moving closer sooner is slightly better than later.
The key insight is that potential-based shaping telescopes in the return. For a trajectory s0, s1, ..., sT, show that the total shaped return simplifies. Compute: $\sum_{t=0}^{2} \gamma^t F(s_t, s_{t+1})$ for a 3-step trajectory with Φ(s0)=5, Φ(s1)=3, Φ(s2)=1, Φ(s3)=0 (terminal), γ=1.0.
The total shaping bonus telescopes to $\gamma^{T+1} \Phi(s_{T+1}) - \Phi(s_0)$. With γ=1 and terminal Φ=0, it's just −Φ(s0) = −5. Since Φ(s0) is a constant (it depends only on the start state), it shifts all trajectory returns by the same amount, so the ranking of policies is unchanged. This is why potential-based shaping is "free."
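A short sketch can confirm both the single-step bonus from the grid-world exercise and the telescoping sum. The function names are illustrative. `totalShaping` takes the list of potentials along a trajectory and adds up the discounted bonuses term by term.

```javascript
// F(s, s') = gamma * Phi(s') - Phi(s)
function shapingBonus(phiS, phiNext, gamma) {
  return gamma * phiNext - phiS;
}

// Sum over t of gamma^t * F(s_t, s_{t+1}) for potentials [Phi(s_0), ..., Phi(s_{T+1})].
function totalShaping(potentials, gamma) {
  let total = 0;
  for (let t = 0; t + 1 < potentials.length; t++) {
    total += Math.pow(gamma, t) * shapingBonus(potentials[t], potentials[t + 1], gamma);
  }
  return total;
}

console.log(shapingBonus(-5, -4, 0.99).toFixed(2)); // 1.04
console.log(totalShaping([5, 3, 1, 0], 1.0));       // -5
```

With a discount below 1 and a non-zero final potential, the sum still matches the closed form: for potentials [5, 3, 1, 2] and γ = 0.9 both give −3.542.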
You're training an RL agent to navigate a maze from start S to goal G. Which potential function is best?
Option C is best. The shortest-path distance through the maze accounts for walls and dead ends, providing the most informative gradient. Euclidean distance (B) can mislead near walls. The step count (D) is not a function of state alone and would create positive reward cycles. No shaping (A) works but requires much more exploration.
So far, every RL setting has one agent against a passive environment. But what if other agents are also learning? In multi-agent RL (MARL), each agent's optimal strategy depends on what the other agents do. The right solution concept is no longer "optimal policy" — it's Nash equilibrium.
2×2 game with payoff matrix (each cell lists Player 1's payoff, then Player 2's):
| | P2: Left | P2: Right |
|---|---|---|
| P1: Up | 3, 1 | 0, 2 |
| P1: Down | 1, 0 | 2, 3 |
Check each cell: can either player improve by deviating? Find the pure-strategy Nash equilibrium. What is Player 1's payoff at equilibrium?
Check each cell:
- (Up, Left): P2 gains by switching to Right (2 > 1). Not an equilibrium.
- (Up, Right): P1 gains by switching to Down (2 > 0). Not an equilibrium.
- (Down, Left): P1 gains by switching to Up (3 > 1), and P2 gains by switching to Right (3 > 0). Not an equilibrium.
- (Down, Right): P1 would drop from 2 to 0 by switching to Up, and P2 would drop from 3 to 0 by switching to Left. Neither player deviates.

The unique pure-strategy Nash equilibrium is (Down, Right), where Player 1's payoff is 2. Right is a dominant strategy for P2 (2 > 1 and 3 > 0), so P1 simply best-responds to Right by playing Down. P1's favorite cell, (Up, Left) with payoff 3, is not stable because P2 would deviate.
Classic Prisoner's Dilemma payoff matrix:
| | P2: Cooperate | P2: Defect |
|---|---|---|
| P1: Cooperate | −1, −1 | −3, 0 |
| P1: Defect | 0, −3 | −2, −2 |
Find the Nash equilibrium. What is the total payoff (sum of both players) at the Nash equilibrium?
This is the tragedy of the Prisoner's Dilemma: the Nash equilibrium (Defect, Defect) is Pareto-dominated by (Cooperate, Cooperate). Individual rationality leads to collective irrationality. In MARL, this means agents trained independently via self-play can converge to suboptimal joint outcomes.
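The cell-by-cell deviation check generalizes to any small two-player matrix game. The sketch below is a brute-force search, with `payoffs[i][j]` holding `[P1 payoff, P2 payoff]` for row i and column j. The function name and data layout are illustrative choices.

```javascript
// Return every [row, col] where neither player gains by deviating unilaterally.
function pureNashEquilibria(payoffs) {
  const equilibria = [];
  const rows = payoffs.length;
  const cols = payoffs[0].length;
  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < cols; j++) {
      let stable = true;
      // P1 deviates by changing the row while the column stays fixed.
      for (let k = 0; k < rows; k++) {
        if (payoffs[k][j][0] > payoffs[i][j][0]) stable = false;
      }
      // P2 deviates by changing the column while the row stays fixed.
      for (let k = 0; k < cols; k++) {
        if (payoffs[i][k][1] > payoffs[i][j][1]) stable = false;
      }
      if (stable) equilibria.push([i, j]);
    }
  }
  return equilibria;
}

// Rows and columns: 0 = Cooperate, 1 = Defect.
const prisonersDilemma = [
  [[-1, -1], [-3, 0]],
  [[0, -3], [-2, -2]],
];
console.log(JSON.stringify(pureNashEquilibria(prisonersDilemma))); // [[1,1]]
```

The only stable cell is (Defect, Defect), with a total payoff of −4. A game with no pure equilibrium returns an empty array.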
Matching Pennies (zero-sum):
| | P2: Heads | P2: Tails |
|---|---|---|
| P1: Heads | +1, −1 | −1, +1 |
| P1: Tails | −1, +1 | +1, −1 |
No pure Nash exists. In the mixed Nash, P1 plays Heads with probability p. To make P2 indifferent: P2's expected payoff for Heads = P2's expected payoff for Tails. Solve for p.
The unique Nash is both players randomizing 50/50. Expected payoff for both: 0. Any deviation from 50/50 would be exploitable by the opponent. This is the minimax theorem in action (von Neumann, 1928).
In a game, P1 plays Up with probability 0.6 and Down with probability 0.4. P2 plays Left with probability 0.3 and Right with probability 0.7. Payoff matrix for P1:
| | Left | Right |
|---|---|---|
| Up | 4 | 1 |
| Down | 2 | 3 |
Compute E[payoff for P1].
P1's expected payoff is 2.22. Could P1 do better? If P1 played pure Up: E = 0.3×4 + 0.7×1 = 1.9. Pure Down: E = 0.3×2 + 0.7×3 = 2.7. So P1 should shift more toward Down against P2's current strategy. Accept 2.2–2.3.
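The expected payoff is a probability-weighted sum over all four cells, which the following sketch makes explicit. The function name `expectedPayoff` is illustrative.

```javascript
// E[payoff] = sum over i, j of P(row i) * P(col j) * payoff[i][j]
function expectedPayoff(payoff, rowProbs, colProbs) {
  let total = 0;
  for (let i = 0; i < rowProbs.length; i++) {
    for (let j = 0; j < colProbs.length; j++) {
      total += rowProbs[i] * colProbs[j] * payoff[i][j];
    }
  }
  return total;
}

const p1Payoff = [[4, 1], [2, 3]]; // rows: Up, Down; columns: Left, Right
const p2Mix = [0.3, 0.7];
console.log(expectedPayoff(p1Payoff, [0.6, 0.4], p2Mix).toFixed(2)); // 2.22
console.log(expectedPayoff(p1Payoff, [1, 0], p2Mix).toFixed(2));     // 1.90
console.log(expectedPayoff(p1Payoff, [0, 1], p2Mix).toFixed(2));     // 2.70
```

Because the payoff is linear in P1's mixing probability, the best response to a fixed opponent strategy is always one of the pure strategies, here Down.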
You've been hired to align a 7B-parameter language model using human preferences. You need to make every design decision: reward model architecture, alignment algorithm, KL budget, data requirements, and compute costs. This chapter tests everything.
You use the same 7B architecture for the reward model but replace the LM head with a scalar head (one linear layer: d → 1). The base model has d=4096, L=32, V=32000, SwiGLU dff=11008. How many parameters does the reward model have?
The reward model removes the LM head (V×d) and adds a reward head (d×1). Everything else is the same.
The reward model is almost the same size as the base model. The LM head was only ~130M parameters (2% of total). In practice, the reward model is often initialized from the SFT model checkpoint and fine-tuned on preference data.
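The count can be reproduced with a rough sketch. The breakdown below assumes a LLaMA-style decoder with an untied input embedding, no bias terms, full multi-head attention (four d × d projections), and a three-matrix SwiGLU MLP. Norm weights are ignored. None of these details are stated in the exercise beyond d, L, V, and dff, so treat them as assumptions.

```javascript
// Approximate parameter count for a decoder-only transformer.
// head = "lm" adds a vocab x d output matrix; head = "reward" adds a d x 1 scalar head.
function countParams({ d, layers, vocab, dff }, head) {
  const embedding = vocab * d;
  const attention = 4 * d * d; // Q, K, V and output projections
  const mlp = 3 * d * dff;     // SwiGLU: gate, up and down projections
  const body = embedding + layers * (attention + mlp);
  const headParams = head === "lm" ? vocab * d : d;
  return body + headParams;
}

const config = { d: 4096, layers: 32, vocab: 32000, dff: 11008 };
console.log((countParams(config, "lm") / 1e9).toFixed(2));     // 6.74
console.log((countParams(config, "reward") / 1e9).toFixed(2)); // 6.61
```

The two totals differ by V × d − d ≈ 131M parameters, which is the size of the LM head that was removed.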
Full RLHF requires 4 models in memory simultaneously: (1) Policy πθ (7B, trainable), (2) Reference πref (7B, frozen), (3) Reward model (6.6B, frozen), (4) Value function (7B, trainable). All in BF16. How much GPU memory just for model weights?
55.2 GB just for weights. Add Adam optimizer states for the two trainable models: each keeps two extra FP32 tensors (m and v), so 2 models × 2 states × 7B = 28B values × 4 bytes = 112 GB for optimizer states alone. Total: 55.2 + 112 ≈ 167 GB, before counting gradients and activations. This is why RLHF for even a 7B model needs multiple A100/H100 GPUs.
DPO only needs 2 models: policy πθ (7B, trainable) and reference πref (7B, frozen). No reward model, no value function. How much memory do you save vs RLHF (weights only, BF16)?
DPO saves ~27 GB (49%) on weights alone. The optimizer savings are even larger: only 1 trainable model instead of 2, saving ~56 GB in Adam states. Total DPO memory: 28 + 56 = 84 GB, vs RLHF ~167 GB. DPO fits on 2 A100-80GB; RLHF needs 3-4.
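The RLHF and DPO budgets can be compared with a small sketch that applies the same accounting: BF16 weights at 2 bytes per parameter, plus two FP32 Adam tensors for each trainable model. Gradients, activations, and any FP32 master weights are left out, as in the exercise. The function name and the `paramsB` field (parameters in billions) are illustrative.

```javascript
// Returns memory in GB (1e9 bytes) for a list of models.
function memoryGB(models) {
  let weights = 0;
  let optimizer = 0;
  for (const m of models) {
    weights += m.paramsB * 2;                        // BF16 weights
    if (m.trainable) optimizer += m.paramsB * 2 * 4; // Adam m and v in FP32
  }
  return { weights, optimizer, total: weights + optimizer };
}

const rlhf = memoryGB([
  { paramsB: 7, trainable: true },    // policy
  { paramsB: 7, trainable: false },   // reference
  { paramsB: 6.6, trainable: false }, // reward model
  { paramsB: 7, trainable: true },    // value function
]);
const dpo = memoryGB([
  { paramsB: 7, trainable: true },    // policy
  { paramsB: 7, trainable: false },   // reference
]);
console.log(rlhf.total.toFixed(1), dpo.total.toFixed(1)); // 167.2 84.0
```

Under this accounting, most of the gap comes from the second trainable model's optimizer state rather than from the extra weights.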
Order the RLHF pipeline stages correctly:
SFT → Collect Preferences → Train Reward Model → PPO Training
First, supervised fine-tuning creates a capable model. Then humans compare its outputs to generate preference data. The reward model is trained on these preferences. Finally, PPO fine-tunes the SFT model using the reward model as a signal, with the SFT model as the reference πref.
DPO training on 100K preference pairs, each with ~512 tokens for winner and loser. Batch size 4, A100 at ~300 TFLOPS effective. Rough FLOPs for a 7B model forward+backward pass per token: ~6 × 2 × 7B = 84 GFLOPs. How many A100-hours for 1 epoch?
Total tokens = 100K pairs × 2 responses × 512 tokens = 102.4M. Total FLOPs = tokens × 84 GFLOPs ≈ 8.6 × 10^18. Time = FLOPs / throughput ≈ 28,700 seconds.
About 8 A100-hours for 1 epoch. At ~$2/A100-hour, that's ~$16 per epoch. DPO typically runs 1-3 epochs, so total cost is $16-$48. This is remarkably cheap compared to pretraining, which takes several orders of magnitude more GPU-hours for a 7B model. DPO's simplicity translates to real cost savings.
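The estimate is a chain of multiplications, so it is easy to re-run with different assumptions. The sketch below uses the exercise's own figures. The function and field names are illustrative, and the 84 GFLOPs per token and 300 TFLOPS throughput are taken from the problem statement rather than measured.

```javascript
// Back-of-the-envelope cost of one DPO epoch.
function dpoEpochEstimate({ pairs, tokensPerResponse, flopsPerToken, flopsPerSecond, dollarsPerHour }) {
  const tokens = pairs * 2 * tokensPerResponse; // winner + loser for each pair
  const hours = (tokens * flopsPerToken) / flopsPerSecond / 3600;
  return { tokens, hours, dollars: hours * dollarsPerHour };
}

const estimate = dpoEpochEstimate({
  pairs: 100e3,
  tokensPerResponse: 512,
  flopsPerToken: 84e9,
  flopsPerSecond: 300e12,
  dollarsPerHour: 2,
});
console.log(estimate.hours.toFixed(2), estimate.dollars.toFixed(2)); // 7.96 15.93
```

Halving the assumed throughput doubles both the hours and the cost, so the utilization figure is the number most worth sanity-checking on real hardware.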
After DPO training, your aligned model has average per-token KL divergence of 0.15 nats from the reference model. For a 200-token response, what is the total KL? If the reward model assigns average reward 4.0, what β would make the RLHF objective break even (objective = 0)?
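The total KL is 0.15 × 200 = 30 nats, so break-even requires β = 4.0 / 30 ≈ 0.133. At that value the reward exactly balances the KL cost. With a higher β, the objective for this response turns negative, pulling the policy back toward the reference until it no longer deviates enough to be useful. With a lower β, the policy can drift further and reward hacking becomes a risk. Typical production values are β = 0.05 to 0.2, so 0.133 sits comfortably inside that range.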
This DPO training loop has a subtle but critical bug. The model's outputs get worse after training. Click the buggy line.
```javascript
// DPO training step
const lpW = model.logProb(winnerTokens);
const lpL = model.logProb(loserTokens);
const refW = refModel.logProb(winnerTokens);
const refL = refModel.logProb(loserTokens);
const delta = (lpW - refW) - (lpL - refL);
const loss = Math.log(1 + Math.exp(-beta * delta));
loss.backward();
refModel.updateWeights(lr);
```
Line 9 is the bug. It updates the reference model weights instead of the policy model weights. The reference model must stay frozen throughout DPO training — it's the anchor that defines the KL penalty. Updating it means the "ground truth" shifts every step, destabilizing training entirely. The correct line should be model.updateWeights(lr).
| Topic | Lesson |
|---|---|
| RL fundamentals (MDPs, Q-learning) | RL Algorithms — From Absolute Zero |
| Reward & alignment concepts | Reward & Alignment — From Absolute Zero |
| MDPs and value functions | MDPs — From Absolute Zero |
| Transformer math | Transformer Math Workbook |