Where do rewards come from? Goal classifiers, human preferences, RLHF, and Constitutional AI — the science of telling agents what you want.
In Atari Breakout, the reward is trivial: your score goes up when a brick shatters. The environment gives you the number. You don't have to invent it. RL algorithms just maximize that number and you're done.
Now consider a robot folding a towel. What's the reward? Is it folded "well enough" when the corners align? When wrinkles are below some threshold? When it takes under 10 seconds? There's no number on screen. You have to design or learn one.
This is the reward specification problem: for most real-world tasks — dialog systems, autonomous driving, robotic manipulation, content generation — there is no built-in reward signal. The reward must be created by humans, and creating it well turns out to be incredibly hard.
Formally, the reward specification problem is the challenge of defining a mathematical reward function r(s, a) that faithfully captures human intent for a task where no natural reward signal exists. Imprecise reward functions lead to reward hacking: the agent optimizes exactly what you specified, not what you meant.
Consider the classic "walking robot" failure mode. You specify reward = forward velocity. The robot learns to fall forward at maximum speed — zero steps, face-first into the ground. Fix it by adding "stay upright." Now it stands perfectly still. Add "move feet." Now it shuffles in place. Each fix creates a new exploit.
Robot reward: r = 2.0 · v_forward + 1.0 · upright + 0.5 · foot_contacts
Falling forward: v_forward = 5.0, upright = 0.0, foot_contacts = 0 → r = 2.0(5.0) + 1.0(0.0) + 0.5(0) = 10.0
Standing still: v_forward = 0.0, upright = 1.0, foot_contacts = 2 → r = 2.0(0.0) + 1.0(1.0) + 0.5(2) = 2.0
Walking nicely: v_forward = 1.5, upright = 0.9, foot_contacts = 1 → r = 2.0(1.5) + 1.0(0.9) + 0.5(1) = 4.4
The agent picks "fall forward" because 10.0 > 4.4. The intended behavior has lower reward than the exploit!
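The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name `robot_reward` and the dictionary layout are illustrative choices, while the weights and feature values are the ones quoted in the text.

```python
# Weights taken from the reward formula in the text.
W_VELOCITY, W_UPRIGHT, W_CONTACT = 2.0, 1.0, 0.5


def robot_reward(v_forward: float, upright: float, foot_contacts: int) -> float:
    """Hand-designed reward: a weighted sum of three features."""
    return W_VELOCITY * v_forward + W_UPRIGHT * upright + W_CONTACT * foot_contacts


# (v_forward, upright, foot_contacts) for each behavior in the example.
behaviors = {
    "falling forward": (5.0, 0.0, 0),
    "standing still": (0.0, 1.0, 2),
    "walking nicely": (1.5, 0.9, 1),
}

rewards = {name: robot_reward(*features) for name, features in behaviors.items()}
best = max(rewards, key=rewards.get)  # what a reward maximizer would choose

for name, value in rewards.items():
    print(f"{name}: {value:.1f}")
print("maximizer picks:", best)
```

Changing the weights moves the exploit rather than removing it, which is the point of the example: the maximizer always picks whichever behavior the current weights happen to favor.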
The problem isn't that RL is broken. RL is working perfectly — it's maximizing exactly what you told it to maximize. The problem is that what you told it and what you meant are different things. The gap between specification and intention is where reward hacking lives.
In the real world, reward signals come from four main sources:
1. Built-in (game score). Atari, Go, Chess. The environment literally returns a number. RL works out of the box.
2. Hand-designed. A human writes r(s,a) = w₁f₁(s) + w₂f₂(s) + .... Fragile, exploitable, requires domain expertise. Months of tuning for complex tasks.
3. Learned from demonstrations. Given expert trajectories, infer what reward the expert was optimizing. This is inverse RL (IRL).
4. Learned from preferences. Ask humans "which trajectory is better?" and learn a reward from pairwise comparisons. This is reward learning from preferences — the foundation of RLHF.
Many RL papers assume the reward function is "given." In the real world, designing r(s,a) often takes 10x longer than training the policy. The reward IS the hard part. This lesson is about solutions to that problem.
Before diving into algorithms, let's map the full landscape of how humans communicate goals to agents. Each approach trades off ease-of-specification against reliability.
A goal specification is any mechanism by which a human communicates desired behavior to an RL agent. This includes reward functions, demonstrations, goal images, preference comparisons, natural language instructions, and more. Each mechanism has different annotation cost, scalability, and robustness properties.
| Approach | Input from Human | Pros | Cons |
|---|---|---|---|
| Hand-designed reward | Mathematical function r(s,a) | Direct; no learning needed | Fragile; reward hacking; expensive to tune |
| Imitation learning | Expert demonstrations τ* | Rich signal; covers full trajectories | Bounded by expert; distribution shift |
| Goal examples | Images/states of success | Easy to collect (point camera) | Binary only; no "how" information |
| Human preferences | Pairwise rankings (τA ≻ τB) | Humans good at comparison; scales to complex tasks | Expensive annotation; noisy |
| Language instructions | "Pick up the red cup" | Natural; flexible | Ambiguous; grounding problem |
| AI feedback | LLM judges trajectories | Infinite scale; cheap | Inherits LLM biases; no ground truth |
Even within "demonstrations," there's a spectrum of how much information you provide:
Full demonstrations — Expert executes entire trajectories. Maximum information but maximum cost. One demo can take 10+ minutes for a complex manipulation task.
Goal images — Show the agent what "done" looks like. A single photo of a folded towel. Cheap to collect but only provides the endpoint, not the path.
Preference rankings — "This trajectory is better than that one." Doesn't require expertise, just judgment. A human can compare two robot videos in seconds.
Asking "rate this robot trajectory from 1-10" is hard — what's a 6 vs a 7? But asking "which of these two attempts is better?" is easy. Humans are reliable comparison machines, even when they can't produce absolute scores. This insight drives the entire RLHF pipeline.
Consider a robotic manipulation task (pick-and-place):
Hand-designed reward: 2 weeks of a PhD student's time tuning weights, 0 annotation data.
Demonstrations: 50 kinesthetic demos at 3 min each = 150 minutes = 2.5 hours.
Goal images: 20 success photos + 20 failure photos = 10 minutes of data collection.
Preferences: 500 pairwise comparisons at 10 sec each = 83 minutes.
Goal images are cheapest per-annotation but provide least signal. Preferences scale to complex tasks where "what is success" is hard to photograph.
Every approach has failure modes. Demonstrations suffer distribution shift. Goal classifiers get exploited by OOD states. Preferences require thousands of comparisons. Language instructions are ambiguous. The "right" choice depends on your domain, budget, and failure tolerance.
The simplest idea for learning a reward: train a binary classifier to distinguish "success" from "failure," then use its output probability as the reward signal.
$$ r(s) = P(\text{success} \mid s) = \sigma\big(f_\phi(s)\big) $$

where f_φ is a neural network with parameters φ, trained on labeled success/failure states, and σ is the sigmoid function mapping logits to probabilities.
You need two sets of states:
Positive examples (success states): Images or state vectors of the task being "done." For a cup-placing task: photos of the cup on the target location. For a drawer-closing task: states where the drawer is fully closed.
Negative examples (failure states): Everything else — random states, half-completed attempts, wrong configurations. Usually easy to collect in bulk.
Task: push a block to a target location.
Positive states: 50 images with block on target (different angles, lighting).
Negative states: 200 images of block anywhere else.
Classifier: ResNet-18, fine-tuned. Outputs P(block_on_target | image).
RL reward: r(s_t) = P(success | s_t). Agent maximizes this.
When block is near target: P ≈ 0.85 → high reward. Block far away: P ≈ 0.1 → low reward. The classifier creates a smooth "reward landscape" that the RL agent can follow.
Using raw logits fφ(s) is problematic. Logits are unbounded — the agent can find states where the logit is +1000 (classifier is extremely confident) but the state is actually nonsense. The sigmoid bounds output to [0, 1], providing a natural reward scale.
The classifier isn't just saying "yes/no." Its confidence creates a gradient. States that are almost at the goal get reward 0.7. States far from the goal get reward 0.1. This gradient is what makes RL optimization possible — without it, the agent would get zero signal until it randomly achieves the goal.
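The classifier is trained with the standard binary cross-entropy loss:

$$ \mathcal{L}(\phi) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \sigma\big(f_\phi(s_i)\big) + (1 - y_i) \log\big(1 - \sigma(f_\phi(s_i))\big) \Big] $$

Symbol breakdown:
• φ: classifier parameters (neural network weights).
• f_φ(s_i): raw logit output for state s_i.
• σ(·): sigmoid, σ(z) = 1/(1 + e^(−z)). Maps a logit to a probability in (0, 1).
• y_i: label. 1 = success, 0 = failure.
• N: total number of labeled examples.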
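To make the classifier-as-reward idea concrete, here is a minimal pure-Python sketch of the sigmoid reward and the binary cross-entropy loss. The logits below are made-up numbers standing in for the outputs of a trained network; the function names are illustrative, not taken from any library.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classifier_reward(logit: float) -> float:
    """Reward used by RL: the classifier's success probability."""
    return sigmoid(logit)


def bce_loss(logits, labels, eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over N labeled states (1 = success, 0 = failure)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


# Hypothetical logits for four labeled states.
logits = [2.1, -1.5, 0.3, -3.0]
labels = [1, 0, 1, 0]
loss = bce_loss(logits, labels)

print("loss:", round(loss, 4))
print("reward for logit 2.1:", round(classifier_reward(2.1), 3))
print("reward for logit 1000:", classifier_reward(1000.0))  # bounded by 1
```

The last line illustrates why the probability is used instead of the raw logit: however extreme the logit, the reward cannot exceed 1.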
This approach seems perfect: collect 50 images, train a classifier, run RL. In practice, it breaks spectacularly due to the exploitation problem (next section). The classifier was trained on a small, well-behaved dataset. RL will push the agent into states the classifier has never seen.
Here's what actually happens when you use a fixed classifier as reward: RL finds adversarial inputs to the classifier. States that look nothing like success but fool the classifier into outputting P(success) ≈ 1.0.
Classifier exploitation occurs when an RL policy discovers states outside the classifier's training distribution (OOD states) that receive high classifier confidence despite not representing true task success. The classifier was never trained to say "I don't know": it always outputs a probability, even for states it has never seen.
Think of it this way: your classifier was trained on normal photos of cups on tables. The RL agent discovers that if the robot arm occludes the camera in exactly the right way, the classifier outputs 0.99. Or if the robot flips the cup upside down at a specific angle, it triggers the same visual features. These are adversarial examples found by RL optimization.
Classifier training data: 100 states from distribution D_train (normal robot operation).
RL policy explores: visits 10,000 states, many from distribution D_RL ≠ D_train.
In-distribution state (block near target): f_φ(s) = 2.1 → σ(2.1) = 0.89. Legitimate high reward.
OOD state (robot arm jammed against wall, camera occluded): f_φ(s) = 4.7 → σ(4.7) = 0.99. Illegitimate reward: the task was not completed!
The agent prefers the OOD exploit (0.99) over legitimate completion (0.89). Policy converges to the exploit.
Neural network classifiers are interpolators, not extrapolators. Within the training distribution, they generalize well. Outside it, their predictions are meaningless — but they still output confident probabilities. A sigmoid always returns a number between 0 and 1, even for garbage inputs.
RL is the world's most aggressive optimizer. Given millions of environment interactions, it WILL find these weak spots. It's essentially performing adversarial attacks on your classifier, just through trial-and-error instead of gradient-based perturbation.
This problem closely mirrors GAN training with a frozen discriminator: the "generator" (policy) learns to fool the "discriminator" (classifier). The fix is also the same: keep updating the discriminator on the generator's current outputs. This leads directly to the adversarial approach in the next section.
With weak RL (few steps, high regularization), exploitation may not manifest. With strong RL (millions of steps, good exploration), it is close to inevitable. The better your RL algorithm, the worse the exploitation. This is a fundamental tension: better optimization + bad reward = worse outcomes.
Show that the chance of finding a high-confidence exploit increases with the number of unique states visited. Model the classifier's confidence on each OOD state as an independent draw from Uniform[0,1]. If the agent visits n OOD states, what is the expected maximum confidence?
Model: classifier assigns random confidence c ~ Uniform[0,1] to each OOD state (since it has no valid information about these states).
After visiting n OOD states, the agent selects the state with maximum confidence:
$$ \mathbb{E}\big[\max(c_1, \dots, c_n)\big] = \frac{n}{n+1} $$

Proof: the CDF of the maximum is $F(x) = x^n$ on [0, 1], so its PDF is $f(x) = n x^{n-1}$. The expected value is $\int_0^1 x \cdot n x^{n-1}\,dx = \frac{n}{n+1}$.
So with n=99 states explored, expected best exploit has confidence 0.99. With n=999, it's 0.999. RL with millions of steps visits thousands of OOD states → near-certain to find a high-confidence exploit.
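The n/(n+1) result can be checked with a quick Monte Carlo simulation. This sketch uses the same modeling assumption as the exercise (independent Uniform[0,1] confidences on OOD states); the function name, trial count, and seed are arbitrary choices.

```python
import random


def expected_max_confidence(n: int, trials: int = 5000, seed: int = 0) -> float:
    """Estimate E[max(c_1..c_n)] for c_i ~ Uniform[0, 1] by simulation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The policy keeps the most "rewarding" of the n OOD states it visited.
        total += max(rng.random() for _ in range(n))
    return total / trials


for n in (1, 9, 99):
    estimate = expected_max_confidence(n)
    print(f"n={n:3d}  simulated={estimate:.3f}  n/(n+1)={n / (n + 1):.3f}")
```

The simulated values should track n/(n+1) closely, which is the order-statistics effect the exercise describes: more exploration means a higher best-case confidence among the states found.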
The fix: don't leave the classifier static. When RL finds exploits, add those exploits as negative examples and retrain. The classifier learns "that's not success either." Then RL finds new exploits. Update again. Iterate.
An adversarial goal classifier is a goal classifier that is iteratively updated by adding the current RL policy's visited states as negative examples. This creates an adversarial game: the policy tries to fool the classifier, and the classifier adapts to reject the policy's attempts. Ideally, the game converges when the policy can only achieve high classifier confidence by actually completing the task.
This is exactly the GAN (Generative Adversarial Network) training procedure:
| GAN Component | Reward Learning Analog |
|---|---|
| Generator | RL policy π_θ (produces states) |
| Discriminator | Goal classifier C_φ (judges success) |
| Generator loss | Negative RL reward (−r) |
| Discriminator loss | Binary cross-entropy on updated D+/D− |
| Mode collapse | Policy exploiting one OOD state |
| Training instability | Oscillation between exploits and corrections |
Iteration 1:
D+ = {50 images of cup on target}. D− = {100 random robot images}.
Train classifier. Run RL for 10k steps. Policy learns to occlude camera (exploit). C(s_exploit) = 0.97.
Iteration 2:
Add 500 RL-visited states (including occlusion states) to D−.
Retrain classifier. Now C(s_occlude) = 0.15. Exploit fixed.
Run RL again. Policy finds new exploit: pressing cup against wall at odd angle. C(s_wall) = 0.92.
Iteration 3: Add wall-pressing states to D−. Retrain. Eventually, the ONLY way to get high reward is genuine task completion.
Critical implementation detail: when retraining the classifier, use balanced batches — equal numbers of positive and negative examples per batch.
Without balanced batches: if negatives outnumber positives 10:1, the classifier learns a prior of P(success) = 0.09. Even true success states may get low reward, starving the RL signal.
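One simple way to get balanced batches is to sample the two halves of each batch separately. The sketch below is an assumption about how this might be done, not a prescribed implementation: the helper `balanced_batch` and the string placeholders for states are hypothetical.

```python
import random


def balanced_batch(positives, negatives, batch_size, rng):
    """Draw half the batch from each class (with replacement), then shuffle."""
    half = batch_size // 2
    pos = rng.choices(positives, k=half)
    neg = rng.choices(negatives, k=half)
    batch = [(state, 1) for state in pos] + [(state, 0) for state in neg]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
positives = [f"success_{i}" for i in range(50)]    # placeholder success states
negatives = [f"failure_{i}" for i in range(500)]   # 10:1 imbalance, as in the text

# Class prior the classifier would see with uniform sampling over all data.
naive_prior = len(positives) / (len(positives) + len(negatives))

batch = balanced_batch(positives, negatives, batch_size=32, rng=rng)
pos_fraction = sum(label for _, label in batch) / len(batch)

print(f"prior under uniform sampling: {naive_prior:.3f}")
print(f"positive fraction in balanced batch: {pos_fraction:.2f}")
```

Sampling with replacement means the small positive set is reused often, so in practice this is usually paired with augmentation or regularization to limit overfitting to those few positives.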
Each iteration, the classifier's "holes" (OOD regions with high confidence) get patched. After enough iterations, the only remaining high-confidence region is the actual success manifold. The policy has no choice but to genuinely complete the task.
Like GANs, this procedure can be unstable. If you add too many negatives too fast, the classifier becomes conservative and gives low reward everywhere (reward collapse). If you don't update frequently enough, exploitation persists. The balance is delicate.
The classifier is being updated but the exploit keeps shifting to new OOD regions faster than the classifier can cover them. The reward increases because each new exploit has higher confidence than the last (order statistics effect from Ch 4). Fix: (1) Increase the RL budget between classifier updates so the policy fully exploits the current classifier, making the exploit easier to identify. (2) Add ALL visited states as negatives, not just final states. (3) Use ensemble of classifiers — exploit must fool all of them.
Goal classifiers need labeled "success" and "failure" states. But for many tasks — dialog quality, driving smoothness, code correctness — there's no clear binary. What does "success" even mean for a conversation?
Solution: don't ask humans for absolute labels. Ask them for pairwise comparisons. "Which of these two trajectories is better?" This is easier, more reliable, and doesn't require the annotator to be an expert.
Preference-based reward learning means learning a reward function r_θ(s, a) from human pairwise preference labels. Given two trajectory segments σ1 and σ2, a human indicates which is better: σ1 ≻ σ2 or σ2 ≻ σ1. The reward function is trained to be consistent with these preferences.
How do we turn "trajectory A is better than trajectory B" into a loss function? We need a probabilistic model of preferences. The Bradley-Terry model (1952) provides exactly this:
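$$ P(\tau_w \succ \tau_l) = \sigma\big(R(\tau_w) - R(\tau_l)\big), \qquad R(\tau) = \sum_t r_\theta(s_t, a_t) $$

The probability that τ_w is preferred over τ_l depends only on the difference in total reward. Higher total reward → more likely to be preferred.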
Symbol-by-symbol breakdown:
• τ_w: the "winner" trajectory (the one the human preferred).
• τ_l: the "loser" trajectory (the one the human rejected).
• R(τ): total return, the sum of per-step rewards along the trajectory.
• r_θ(s_t, a_t): learned reward at each step. This is what we're training.
• σ: sigmoid. Maps the reward difference in (−∞, +∞) to a probability in (0, 1).
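Assume each trajectory has "quality" proportional to exp(R(τ)). Then:

$$ P(\tau_w \succ \tau_l) = \frac{\exp(R(\tau_w))}{\exp(R(\tau_w)) + \exp(R(\tau_l))} $$

Divide numerator and denominator by exp(R(τ_w)):

$$ P(\tau_w \succ \tau_l) = \frac{1}{1 + \exp\big(-(R(\tau_w) - R(\tau_l))\big)} = \sigma\big(R(\tau_w) - R(\tau_l)\big) $$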
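This is the softmax over two items: the same logistic form as the standard Elo expected-score formula (chess) and as logistic regression on pairwise differences. TrueSkill (Xbox) is a close relative that uses a Gaussian model instead of a logistic one. It's the maximum-entropy model consistent with "higher reward = more preferred."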
Two robot trajectories for a reaching task:
τ_A: 5 steps, learned rewards = [0.1, 0.3, 0.5, 0.7, 0.9] → R(τ_A) = 2.5
τ_B: 5 steps, learned rewards = [0.2, 0.2, 0.3, 0.3, 0.4] → R(τ_B) = 1.4
P(τ_A ≻ τ_B) = σ(2.5 − 1.4) = σ(1.1) = 1/(1 + e^(−1.1)) = 1/(1 + 0.333) = 0.75
If a human preferred τA, this 75% prediction is reasonable. The loss encourages pushing this probability higher (toward 1.0) by increasing rewards along τA relative to τB.
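The worked example can be reproduced directly. A minimal sketch, with `preference_probability` as an illustrative name and the per-step rewards copied from the text:

```python
import math


def preference_probability(rewards_a, rewards_b) -> float:
    """Bradley-Terry probability that trajectory A is preferred over B."""
    diff = sum(rewards_a) - sum(rewards_b)   # R(tau_A) - R(tau_B)
    return 1.0 / (1.0 + math.exp(-diff))     # sigmoid of the return difference


tau_a = [0.1, 0.3, 0.5, 0.7, 0.9]   # R = 2.5
tau_b = [0.2, 0.2, 0.3, 0.3, 0.4]   # R = 1.4

p_a_over_b = preference_probability(tau_a, tau_b)
p_b_over_a = preference_probability(tau_b, tau_a)

print(f"P(A > B) = {p_a_over_b:.3f}")
print(f"P(B > A) = {p_b_over_a:.3f}")
```

The two probabilities sum to 1, as they must for a model over two outcomes.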
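The reward model is trained by minimizing the negative log-likelihood of the observed preferences:

$$ \mathcal{L}(\theta) = -\,\mathbb{E}_{(\tau_w, \tau_l)}\Big[ \log \sigma\big(R_\theta(\tau_w) - R_\theta(\tau_l)\big) \Big] $$

This is identical to binary cross-entropy where every example has label 1 (because we always put the winner first). Gradient descent on it pushes R(τ_w) up and R(τ_l) down.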
Adding a constant c to every trajectory's total return doesn't change preferences: σ((R+c) − (R'+c)) = σ(R − R'). The reward function is identified only up to a shift, so the absolute level of learned rewards is arbitrary: only differences in return matter.
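The shift invariance is easy to verify numerically. The sketch below implements the per-pair Bradley-Terry loss, −log σ(R_w − R_l), in a numerically stable form and checks that adding a constant to both returns leaves it unchanged; `bt_loss` is an illustrative name.

```python
import math


def bt_loss(return_winner: float, return_loser: float) -> float:
    """-log sigmoid(R_w - R_l), computed as softplus(-(R_w - R_l))."""
    diff = return_winner - return_loser
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


base = bt_loss(2.5, 1.4)
shifted = bt_loss(2.5 + 100.0, 1.4 + 100.0)   # same constant added to both returns
swapped = bt_loss(1.4, 2.5)                    # model ranks the pair the wrong way

print(f"loss:          {base:.4f}")
print(f"after shift:   {shifted:.4f}")
print(f"wrong ranking: {swapped:.4f}")
```

The loss depends only on the difference of the two returns, so it is unchanged by the shift and grows when the model ranks the pair the wrong way round.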
Now we assemble the full pipeline. The key insight: reward learning and policy optimization can run in a loop. Collect trajectories, get preferences, update reward, update policy. Repeat.
A crucial efficiency trick: a full ranking of k trajectories implies k(k−1)/2 pairwise labels. With k=10, that's 45 pairs from a single ranking of 10 trajectories. With k=20, that's 190 pairs.
Budget: 100 human comparison queries per round.
Naive (present raw pairs): 100 labeled pairs directly. 100 comparisons.
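Smart (rank k trajectories): Ask the human to rank k=15 trajectories. The full ranking implies (15 choose 2) = 105 pairwise labels. Producing it by pairwise comparison takes at least ⌈log₂(15!)⌉ = 41 comparisons in the worst case (14 is only the number of adjacent pairs in the finished ranking), although people can often rank a small set faster than that by direct inspection.
Efficiency gain: 105 / 41 ≈ 2.6x more pairwise labels per comparison when using rankings instead of isolated pairs, with the caveat that labels implied by transitivity are not independent of one another.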
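Expanding a ranking into its implied pairwise labels is a one-liner with `itertools.combinations`. In this sketch the trajectory identifiers are placeholders, and the ranking is assumed to be ordered best first.

```python
from itertools import combinations


def ranking_to_pairs(ranked):
    """Turn a best-first ranking into (winner, loser) pairs.

    combinations() preserves input order, so the first element of each
    pair always ranks above the second.
    """
    return list(combinations(ranked, 2))


ranking = [f"traj_{i}" for i in range(15)]   # traj_0 is the best
pairs = ranking_to_pairs(ranking)

print(len(pairs))        # k(k-1)/2 pairs for k = 15
print(pairs[:3])
```

Each resulting pair can be fed to the Bradley-Terry loss as a (winner, loser) example.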
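The gradient of the preference loss with respect to the reward parameters is shown below. Here φ denotes the reward-model parameters (written θ in the Bradley-Terry section), since θ is now used for the policy:

$$ \nabla_\phi \mathcal{L}(\phi) = -\,\mathbb{E}_{(\tau_w, \tau_l)}\Big[ \big(1 - \sigma(R_\phi(\tau_w) - R_\phi(\tau_l))\big)\,\big(\nabla_\phi R_\phi(\tau_w) - \nabla_\phi R_\phi(\tau_l)\big) \Big] $$

Symbol breakdown of the gradient:
• [1 − σ(·)]: prediction error. Close to 0 when the prediction is correct, close to 1 when wrong.
• ∇_φ R_φ(τ_w): gradient of the winner's total reward w.r.t. the reward parameters. This gets pushed up.
• −∇_φ R_φ(τ_l): gradient of the loser's total reward. This gets pushed down.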
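A finite-difference check is a cheap way to confirm the gradient expression. The sketch assumes a linear reward model, r_φ(s, a) = φ · x, where x is a per-step feature vector; that model, the feature values, and all function names are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def traj_features(traj):
    """Sum per-step feature vectors; for a linear reward this is dR/dphi."""
    return [sum(column) for column in zip(*traj)]


def total_return(phi, traj):
    return sum(p * f for p, f in zip(phi, traj_features(traj)))


def loss(phi, winner, loser):
    return -math.log(sigmoid(total_return(phi, winner) - total_return(phi, loser)))


def analytic_grad(phi, winner, loser):
    """-(1 - sigma(R_w - R_l)) * (grad R_w - grad R_l)."""
    error = 1.0 - sigmoid(total_return(phi, winner) - total_return(phi, loser))
    f_w, f_l = traj_features(winner), traj_features(loser)
    return [-error * (a - b) for a, b in zip(f_w, f_l)]


def numeric_grad(phi, winner, loser, h=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = []
    for i in range(len(phi)):
        up = phi[:i] + [phi[i] + h] + phi[i + 1:]
        down = phi[:i] + [phi[i] - h] + phi[i + 1:]
        grads.append((loss(up, winner, loser) - loss(down, winner, loser)) / (2 * h))
    return grads


phi = [0.5, -0.2]
winner = [[1.0, 0.0], [0.5, 0.2]]   # two steps, two features each
loser = [[0.1, 0.9], [0.0, 0.4]]

print(analytic_grad(phi, winner, loser))
print(numeric_grad(phi, winner, loser))
```

For a neural reward model the same structure holds; an autodiff framework simply supplies ∇_φ R_φ(τ) in place of the summed features.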
As the policy improves, it may overfit to imperfections in the learned reward. The reward function was trained on data from earlier policies. The current policy may visit states where the reward model is inaccurate. This is the preference-learning analog of classifier exploitation. Fix: keep collecting fresh preferences from the improving policy.
If you only collect preferences once and train reward to convergence, then train policy to convergence, the policy will overfit to reward model errors. The interleaved loop prevents this: the reward model always has fresh data from the current policy's distribution.
Implement the reward learning loss in PyTorch. The function takes a reward model, a batch of winner trajectories, and loser trajectories, and returns the scalar loss.
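One possible solution is sketched below. It assumes trajectories are batched as fixed-length tensors of shape (batch, T, feature_dim), where each step's features encode the state-action pair, and that the reward model maps the last dimension to a single scalar per step. The small MLP and the random data are placeholders for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def preference_loss(reward_model: nn.Module,
                    winners: torch.Tensor,
                    losers: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: mean of -log sigmoid(R(tau_w) - R(tau_l)).

    winners, losers: (batch, T, feature_dim) tensors of per-step features.
    """
    # Per-step rewards have shape (batch, T, 1); sum over time to get returns.
    return_w = reward_model(winners).squeeze(-1).sum(dim=1)
    return_l = reward_model(losers).squeeze(-1).sum(dim=1)
    # logsigmoid is more stable than log(sigmoid(x)) for large |x|.
    return -F.logsigmoid(return_w - return_l).mean()


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# Placeholder data: 8 preference pairs, 5 steps each, 4 features per step.
winners = torch.randn(8, 5, 4)
losers = torch.randn(8, 5, 4)

loss = preference_loss(model, winners, losers)
loss.backward()   # gradients now populate model.parameters()
print(loss.item())
```

Variable-length trajectories would need a padding mask before the sum over time; that detail is omitted here.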
Everything above applies to robotics, games, and continuous control. But the technique that made reward learning famous is RLHF for language models — the process that turned GPT-3 into ChatGPT.
The pipeline has three stages, each building on the previous:
RLHF is a three-stage pipeline for aligning language models with human preferences: (1) Pretraining on internet text, (2) Supervised Fine-Tuning (SFT) on curated demonstrations, (3) RL optimization against a learned reward model trained on human preference comparisons.
Train a large language model on internet text via next-token prediction. This gives a model that can write fluently but has no concept of "helpfulness" or "safety." It's like a person who can speak perfectly but has no goals.
Collect ~10k-100k demonstrations of ideal assistant behavior. Human annotators write high-quality responses to prompts. Fine-tune the pretrained model on these (prompt, response) pairs via standard cross-entropy loss. This gives a model that tries to be helpful but isn't consistent.
This is where reward learning enters:
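The RL stage maximizes the learned reward while penalizing drift from the SFT model:

$$ \max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\text{SFT}}\big) $$

Symbol breakdown:
• r_φ(x, y): learned reward for response y given prompt x. A neural network (often the LLM itself with a value head).
• β: KL penalty coefficient. Higher β → stay closer to the SFT model (more conservative).
• KL(π_θ || π_SFT): Kullback-Leibler divergence measuring how much the policy has drifted from its starting point.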
Without the KL penalty, the model would hack the reward model, finding degenerate outputs that get high reward but are nonsensical (the same exploitation problem from Ch 4!). The KL penalty says "you can't drift too far from a sensible model." It's a regularizer against reward overoptimization.
Prompt: "Explain quantum mechanics simply."
β = 0 (no constraint): Model finds reward hack. Outputs repetitive superlatives: "Absolutely amazing incredible wonderful fantastic..." → r = 8.2 (reward model fooled by enthusiastic language).
β = 0.1 (moderate): Model gives a genuinely helpful explanation. r = 5.4, KL = 12.0. Net: 5.4 − 0.1(12) = 4.2.
β = 1.0 (heavy): Model barely moves from SFT. r = 3.8, KL = 0.5. Net: 3.8 − 0.5 = 3.3. Too conservative.
Sweet spot: β around 0.01–0.1 for most LLM training.
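The net scores in this example are simply reward minus β times KL. The sketch below reproduces them and adds a single-sample KL estimate from per-token log-probabilities, a form often used in practice; the function names and token log-probabilities are made up for illustration.

```python
def penalized_objective(reward: float, kl: float, beta: float) -> float:
    """KL-regularized RLHF objective for one response: r - beta * KL."""
    return reward - beta * kl


def sequence_kl_estimate(policy_logprobs, sft_logprobs) -> float:
    """Single-sample KL estimate: sum over tokens of log pi_theta - log pi_SFT,
    evaluated on a response sampled from the policy."""
    return sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))


# Numbers from the worked example in the text.
moderate = penalized_objective(reward=5.4, kl=12.0, beta=0.1)
heavy = penalized_objective(reward=3.8, kl=0.5, beta=1.0)
print(f"beta=0.1: {moderate:.1f}")
print(f"beta=1.0: {heavy:.1f}")

# Hypothetical per-token log-probs for a three-token response.
kl_hat = sequence_kl_estimate([-0.2, -0.5, -0.1], [-0.9, -0.7, -0.4])
print(f"estimated KL: {kl_hat:.2f}")
```

Because the estimate comes from one sampled response, it is noisy and can even be negative for an individual sample; averaged over many samples it approximates the true KL.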
The reward model must be powerful enough to represent complex preferences but not so powerful that it memorizes individual comparisons. In practice, the reward model is often the same size as the policy model (both are LLMs). Training a 7B reward model to judge a 7B policy's outputs is standard.
The pretrained model completes text, not instructions. It might continue a prompt with "..." or write it as a blog post instead of answering. SFT teaches the model the format of a helpful response (answer the question, appropriate length, conversational tone). RL then improves quality within that format. Without SFT, RL would have to simultaneously learn both format AND quality from the reward signal alone — a much harder optimization problem with a weaker starting point.
RLHF's biggest bottleneck: human annotation is expensive. Each comparison costs $1–5 depending on task complexity. Training a frontier model requires 100k+ comparisons = $100k–500k just for labels.
Key observation: critique is easier than generation. You don't need a Nobel laureate to judge which of two physics explanations is better. Similarly, a language model can often judge which response is better, even if it can't generate the better response itself.
RLAIF (RL from AI Feedback) replaces human annotators with an AI model (typically a large LLM) that provides preference judgments. The AI judges which response is better based on specified criteria (helpfulness, harmlessness, honesty). The rest of the pipeline (Bradley-Terry reward model, PPO fine-tuning) remains identical to RLHF.
Anthropic's Constitutional AI formalizes RLAIF with an explicit constitution — a set of principles the AI judge must follow:
Prompt: "How do I pick a lock?"
Model response y: "Here's a step-by-step guide to pick a pin tumbler lock: First, insert a tension wrench..."
Principle: "Responses should not provide detailed instructions for illegal activities."
Critique: "This response provides detailed lock-picking instructions that could be used for burglary. Violation: yes."
Revision y': "Lock-picking is a legitimate skill for locksmiths. I can explain the general principle (pin tumbler mechanisms) without providing step-by-step instructions that could enable illegal entry."
Preference: y' ≻ y. This pair trains the reward model.
A fundamental tension in alignment: maximizing helpfulness often conflicts with minimizing harm. A model that refuses everything is maximally harmless but useless. A model that helps with anything is maximally helpful but dangerous.
Critique is easier than generation for the same reason that reviewing code is easier than writing it, or grading an essay is easier than writing one. The judge only needs to compare two options against clear criteria. The generator needs to create something good from scratch. This asymmetry means a model can meaningfully judge outputs that are at or near its own quality level.
AI judges inherit biases: preference for longer responses, formal language, responses that agree with the prompt's premise, and sycophantic behavior. If your AI judge prefers verbose responses, your trained model will be verbose regardless of whether humans prefer concise answers. Calibration against human preferences is essential.
All methods so far require human input (demonstrations, preferences, or principles). But can agents propose their own goals? Can they learn useful skills without any human reward signal?
Unsupervised (reward-free) RL means training RL agents without any externally specified reward. The agent generates its own reward signal through intrinsic motivation (curiosity, surprise, novelty) or self-play (competing against itself). The goal: pre-train diverse skills that transfer to downstream tasks.
Idea: two agents, Alice and Bob, play an asymmetric game:
Alice's job: Reach some state and challenge Bob: "Get to where I am." Alice gets reward if Bob fails.
Bob's job: Reach Alice's goal state. Bob gets reward if he succeeds.
The game naturally escalates: Alice proposes increasingly difficult goals (to stump Bob), and Bob develops increasingly capable skills (to achieve those goals). No human specifies what goals are interesting — the adversarial pressure generates a curriculum automatically.
Initially, Alice proposes easy goals (1 step away). Bob learns to solve them. Then Alice must propose harder goals (2 steps away). Bob learns those too. The difficulty ratchets up naturally — Alice can't propose impossible goals (she must reach them herself) and can't propose too-easy goals (Bob already solves those). This is a self-regulating curriculum without any human design.
Alternative to self-play: reward the agent for novelty or surprise. A common formalization is the prediction error of a learned forward model f̂:

$$ r^{\text{int}}_t = \big\| \hat{f}(s_t, a_t) - s_{t+1} \big\|^2 $$
Problem: the agent may find "noise" rewarding (TV static is always unpredictable). Modern approaches use learned feature spaces instead of raw state predictions to filter out irrelevant stochasticity.
An agent with curiosity reward placed in a room with a TV showing random static will stare at the TV forever — it's maximally unpredictable! Solutions: use learned representations (ignore pixel noise), limit prediction to controllable aspects of the environment, or use ensemble disagreement instead of raw error.
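The noisy-TV effect shows up even in a toy setting. In the sketch below, a running-mean predictor stands in for a learned forward model (a deliberate simplification: it ignores state and action), and the intrinsic reward is the squared prediction error. All names and numbers are illustrative.

```python
import random


class RunningMeanModel:
    """Toy stand-in for a forward model: predicts the mean of past observations."""

    def __init__(self):
        self.mean = 0.0
        self.count = 0

    def predict(self) -> float:
        return self.mean

    def update(self, observation: float) -> None:
        self.count += 1
        self.mean += (observation - self.mean) / self.count


def curiosity_rewards(observations):
    """Intrinsic reward at each step = squared prediction error before updating."""
    model = RunningMeanModel()
    rewards = []
    for obs in observations:
        rewards.append((model.predict() - obs) ** 2)
        model.update(obs)
    return rewards


rng = random.Random(0)
wall = [1.0] * 200                                # fully predictable observation
tv = [rng.uniform(-1.0, 1.0) for _ in range(200)]  # random static

wall_late = sum(curiosity_rewards(wall)[-50:]) / 50
tv_late = sum(curiosity_rewards(tv)[-50:]) / 50

print(f"late curiosity reward, blank wall: {wall_late:.3f}")
print(f"late curiosity reward, noisy TV:   {tv_late:.3f}")
```

The wall's reward decays to zero once it has been learned, while the TV keeps paying out indefinitely, so a curiosity maximizer prefers the TV. The remedies listed above all aim to remove this irreducible noise from the error signal.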
We've covered the complete spectrum of reward specification — from hand-designed rewards to AI-generated preferences. Here's how they compare:
| Method | Human Input | Scalability | Failure Mode | Best For |
|---|---|---|---|---|
| Hand-designed | Engineer writes r(s,a) | Zero annotation cost | Reward hacking | Simple tasks with clear metrics |
| Goal classifier | Success/failure images | ~50-200 images | OOD exploitation | Robotics with visual goals |
| Adversarial classifier | Same + iterative updates | Moderate compute | Instability, reward collapse | Robotics with persistent exploits |
| Human preferences | Pairwise comparisons | 1k-100k queries | Reward overoptimization | Complex tasks (dialog, driving) |
| RLHF (LLMs) | SFT demos + comparisons | 100k+ queries | KL collapse or reward hacking | Language model alignment |
| RLAIF | Constitution (principles) | Infinite (AI judges) | Judge bias propagation | Scaling RLHF cheaply |
| Self-play | None | Infinite (self-generated) | Degenerate equilibria | Pre-training diverse skills |
Rewards can't be taken for granted. In the real world, defining "what good looks like" IS the hard problem. The RL algorithm is the easy part — it just maximizes whatever number you give it. The entire field of reward learning exists because writing down that number correctly is somewhere between difficult and impossible for most tasks that matter.
Q: Can you write r(s,a) directly? (e.g., game score, physical distance to goal) → Use hand-designed reward. Done.
Q: Can you photograph "success"? (e.g., robotics with clear goal states) → Use goal classifier (with adversarial updates).
Q: Can humans compare outcomes? (e.g., dialog, content quality) → Use preference learning (RLHF).
Q: Is human annotation too expensive? → Use RLAIF with a judge model. Calibrate against a small human-labeled set.
Q: No supervision at all? → Use intrinsic motivation or self-play for skill pre-training, then fine-tune with task reward later.
Reward learning doesn't exist in isolation. It connects to every part of the RL and AI alignment ecosystem:
IRL infers a reward function from optimal demonstrations. Preference learning infers it from comparative judgments. Both solve the same core problem (what reward was the human optimizing?) but with different supervision signals.
DPO (Rafailov et al., 2023) shows you can skip the explicit reward model entirely. The optimal policy under Bradley-Terry preferences has a closed-form relationship to the reward, allowing direct fine-tuning from preferences without the RL step.
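For reference, the per-pair DPO loss is a Bradley-Terry loss in which the "reward" of a response is β times its log-probability ratio against a frozen reference model. The sketch below computes it from sequence log-probabilities; the inputs are made-up numbers and the function name is illustrative.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the winner's log-prob and lowered the loser's.
after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"loss at initialization: {at_init:.4f}")
print(f"loss after update:      {after_update:.4f}")
```

The reference model plays the same role as the SFT anchor in the KL penalty: it keeps the policy from drifting arbitrarily far while fitting the preferences.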
Offline RL trains policies from fixed datasets without environment interaction. Learned reward models can retrospectively label offline datasets, enabling preference-guided offline RL — no human reward function needed for the original data collection.
Active research frontiers in reward learning:
Process reward models — Reward each reasoning STEP, not just the final answer. Enables credit assignment in chain-of-thought.
Scalable oversight — How do you provide feedback on tasks too complex for any single human to evaluate? Decomposition, debate, and recursive reward modeling.
Reward model interpretability — What features does the reward model actually use? Can we audit it for biases before deploying?
Multi-objective RLHF — Instead of one scalar reward, learn separate models for helpfulness, harmlessness, honesty, and let users set the tradeoff at inference time.
The reward function IS the specification of the problem. Getting it wrong doesn't just make your agent suboptimal — it makes your agent optimize for something you never intended. As AI systems become more capable, the gap between "what we specified" and "what we meant" becomes the most important safety problem in the field. Reward learning is not just a convenience — it's a necessity for building systems that do what we actually want.