Your language model is secretly a preference classifier — extract P("Yes")/P("No") to rank responses, eliminate the reward model, and self-improve through iterative DPO.
Aligning language models with human preferences requires a way to tell good outputs from bad ones. The standard RLHF pipeline trains a separate reward model on human preference data, then uses it to score candidate responses during RL training. This works, but it is expensive: you need to collect preference data, train and serve a second large neural network, and keep it calibrated as the policy shifts.
An alternative — RLAIF (RL from AI Feedback) — replaces human annotators with a powerful external LLM like GPT-4 or Claude. But this introduces a dependency on a proprietary, expensive API. You cannot self-host it, you cannot fine-tune it, and you pay per token.
What if the model you are aligning could judge its own outputs? Not by generating a long critique (which is slow and hard to parse), but by producing a single, calibrated score from its own logits?
Three approaches to scoring responses for alignment.
Here is the core observation: when you ask a language model "Is the following response a good answer to the question? Answer Yes or No," the model does not just output text. Under the hood, it computes a probability distribution over its entire vocabulary for the next token. Two of those probabilities are especially informative: P("Yes") and P("No").
A model that has been instruction-tuned — even without any preference-specific training — has already internalized a notion of response quality from its training data. When it assigns high probability to "Yes," it is expressing confidence that the response is good. When it assigns high probability to "No," it is expressing the opposite.
The key equation is embarrassingly simple. Given the model's next-token probabilities for "Yes" and "No," compute the normalized preference score:
$$p'_{\text{yes}} = \frac{p_{\text{yes}}}{p_{\text{yes}} + p_{\text{no}}}$$
where $p_{\text{yes}}$ = P("Yes" | prompt, response) and $p_{\text{no}}$ = P("No" | prompt, response). This normalization restricts the score to [0, 1] and ignores probability mass on all other tokens, focusing purely on the binary preference signal.
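As a minimal sketch of this normalization, the function below takes the two probabilities as plain floats. The function name and the example numbers are illustrative assumptions, not values from the paper.

```python
def normalized_yes_score(p_yes: float, p_no: float) -> float:
    """Renormalize P("Yes") against P("No"), ignoring every other token."""
    total = p_yes + p_no
    if total <= 0.0:
        raise ValueError("p_yes + p_no must be positive")
    return p_yes / total


# Illustrative numbers only: 30% of the mass on "Yes", 10% on "No",
# and the remaining 60% spread over other tokens.
score = normalized_yes_score(0.30, 0.10)
print(round(score, 2))  # 0.75
```

Note that the 60% of probability mass on other tokens does not affect the score; only the ratio between the two tokens matters.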
This score can then be used to rank candidate responses. Given N responses to the same prompt, the one with the highest $p'_{\text{yes}}$ is the chosen response, and the one with the lowest is the rejected response. You now have a preference pair, which is exactly what DPO needs for training.
Let us be precise about how the preference classification works. You have a prompt x and a candidate response y. You construct an evaluation prompt:
Given the instruction: {x}
Is the following response correct and helpful?
Response: {y}
Answer with Yes or No.
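One way to fill in this template programmatically is sketched below. The names `EVAL_TEMPLATE` and `build_eval_prompt` are hypothetical; only the template wording comes from the text above.

```python
# Template wording copied from the evaluation prompt above;
# the constant and function names are illustrative.
EVAL_TEMPLATE = (
    "Given the instruction: {x}\n"
    "Is the following response correct and helpful?\n"
    "Response: {y}\n"
    "Answer with Yes or No."
)


def build_eval_prompt(instruction: str, response: str) -> str:
    """Substitute the prompt x and candidate response y into the template."""
    return EVAL_TEMPLATE.format(x=instruction, y=response)


print(build_eval_prompt("What is 2 + 2?", "4"))
```

Because `str.format` only parses the template itself, braces inside the instruction or the response are inserted verbatim.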
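You feed this to the LLM and extract the logits at the final token position. From the vocabulary, you look up exactly two logit values: $z_{\text{yes}}$ for the token "Yes" and $z_{\text{no}}$ for the token "No." Apply softmax to just these two:
$$p'_{\text{yes}} = \frac{e^{z_{\text{yes}}}}{e^{z_{\text{yes}}} + e^{z_{\text{no}}}}$$
This is equivalent to a sigmoid: $p'_{\text{yes}} = \sigma(z_{\text{yes}} - z_{\text{no}})$. The logit difference $z_{\text{yes}} - z_{\text{no}}$ is the model's raw preference signal.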
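The equivalence between the two-token softmax and the sigmoid of the logit difference can be checked numerically. The sketch below assumes the two logits have already been read out of the model as floats; the lookup itself depends on the tokenizer and is not shown. The logit values are made up.

```python
import math


def two_token_softmax(z_yes: float, z_no: float) -> float:
    """Softmax restricted to the "Yes" and "No" logits."""
    m = max(z_yes, z_no)  # subtract the max for numerical stability
    e_yes = math.exp(z_yes - m)
    e_no = math.exp(z_no - m)
    return e_yes / (e_yes + e_no)


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


# Made-up logits: both routes give the same score.
z_yes, z_no = 2.3, -0.4
assert abs(two_token_softmax(z_yes, z_no) - sigmoid(z_yes - z_no)) < 1e-12
```

Only the difference between the two logits matters, so adding the same constant to both leaves the score unchanged.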
Scoring a response by extracting its Yes/No logits.
To build a preference pair, generate N candidate responses to the same prompt (typically N = 4) and score each one. The response with the highest $p'_{\text{yes}}$ becomes $y_w$ (chosen/winner), and the response with the lowest $p'_{\text{yes}}$ becomes $y_l$ (rejected/loser). The pair $(x, y_w, y_l)$ is then fed to DPO.
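The pair construction described above can be sketched as follows. Here `score_fn` stands in for the Yes/No scoring step and is a hypothetical callable. Returning `None` when all candidates tie is an assumption of this sketch, not something the text specifies.

```python
from typing import Callable, List, Optional, Tuple


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> Optional[Tuple[str, str, str]]:
    """Return (x, y_w, y_l): the prompt plus its best and worst candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    scores = [score_fn(prompt, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    if best == worst:
        # Every candidate received the same score, so there is no signal.
        return None
    return prompt, candidates[best], candidates[worst]


# Toy scores standing in for the model's normalized Yes-probability.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.1}
pair = build_preference_pair("q", list(toy_scores), lambda x, y: toy_scores[y])
print(pair)  # ('q', 'b', 'd')
```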
Not all tasks are created equal. A response that is "good" for a coding question has different criteria than one for a math problem or a safety-sensitive query. IPO uses category-specific evaluation prompts that tailor the judging criteria to the domain.
The paper defines four categories, each with a specialized prompt template:
Why does this matter? Because a single generic prompt ("Is this a good response?") conflates many quality dimensions. A coding response might be technically correct but poorly documented. A math response might have the right answer but skip critical steps. Category-specific prompts let the model's internal preferences focus on the right criteria for each domain.
On RewardBench, the paper shows that category-specific prompts consistently outperform generic prompts, with the biggest gains in the safety category where evaluation criteria differ most from general helpfulness.
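To make the idea of category-specific prompts concrete, here is a minimal sketch. The category keys and the criteria wording are hypothetical placeholders written for this example; they are not the paper's actual templates.

```python
# Hypothetical criteria per category (NOT the paper's templates).
CATEGORY_CRITERIA = {
    "code": "correct, runnable, and adequately documented",
    "math": "correct in its final answer and complete in its intermediate steps",
    "safety": "safe and appropriate for the request",
    "general": "correct and helpful",
}


def build_category_prompt(category: str, instruction: str, response: str) -> str:
    """Build an evaluation prompt whose judging criteria depend on the category."""
    # Unknown categories fall back to the generic criteria.
    criteria = CATEGORY_CRITERIA.get(category, CATEGORY_CRITERIA["general"])
    return (
        f"Given the instruction: {instruction}\n"
        f"Is the following response {criteria}?\n"
        f"Response: {response}\n"
        "Answer with Yes or No."
    )


print(build_category_prompt("math", "Solve x + 1 = 2.", "x = 1"))
```

The scoring step stays the same; only the question put to the model changes with the category.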
This is the payoff. IPO's preference classification ability enables a complete self-improvement loop where a single model generates, judges, and learns — with no external model in the loop.
The pipeline has four stages, repeated iteratively: Generate, Judge, Pair, and DPO.
The IPO self-improvement pipeline, one stage at a time.
The paper uses the UltraFeedback dataset's prompts (60k instructions across diverse domains) and generates 4 responses per prompt. Each iteration produces ~60k preference pairs. They run 1-3 iterations of the loop, finding diminishing returns after iteration 2.
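The overall loop can be sketched as a skeleton in which generation, judging, and DPO training are passed in as callables. All three callables (`generate_fn`, `score_fn`, `train_fn`) are hypothetical stand-ins. Skipping prompts whose candidates all tie is an assumption of this sketch.

```python
from typing import Any, Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def run_self_improvement(
    model: Any,
    prompts: List[str],
    generate_fn: Callable[[Any, str], str],
    score_fn: Callable[[Any, str, str], float],
    train_fn: Callable[[Any, List[Pair]], Any],
    n_candidates: int = 4,
    n_iterations: int = 2,
) -> Any:
    """Generate, judge, pair, and train with the same model, repeatedly."""
    for _ in range(n_iterations):
        pairs: List[Pair] = []
        for x in prompts:
            # Generate: sample several candidates from the current model.
            candidates = [generate_fn(model, x) for _ in range(n_candidates)]
            # Judge: score each candidate with the same model.
            scores = [score_fn(model, x, y) for y in candidates]
            hi, lo = max(scores), min(scores)
            if hi == lo:
                continue  # no preference signal for this prompt
            # Pair: highest score is chosen, lowest is rejected.
            pairs.append((x, candidates[scores.index(hi)], candidates[scores.index(lo)]))
        # DPO: train_fn is expected to return the updated model.
        model = train_fn(model, pairs)
    return model
```

The updated model returned by `train_fn` is used as both generator and judge in the next iteration, which is what makes the loop self-contained.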
RewardBench is a standardized benchmark for evaluating reward models and preference classifiers. It contains prompt-chosen-rejected triples across four categories: Chat, Chat-Hard, Safety, and Reasoning. A model's score is the accuracy of preferring the chosen response over the rejected one.
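The accuracy metric described above can be sketched in a few lines. `score_fn` is a hypothetical scoring callable. Counting ties as incorrect is an assumption of this sketch, not a documented RewardBench rule.

```python
from typing import Callable, Iterable, Tuple


def preference_accuracy(
    triples: Iterable[Tuple[str, str, str]],
    score_fn: Callable[[str, str], float],
) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    triples = list(triples)
    if not triples:
        raise ValueError("no triples to evaluate")
    correct = sum(
        1
        for prompt, chosen, rejected in triples
        if score_fn(prompt, chosen) > score_fn(prompt, rejected)  # ties count as wrong
    )
    return correct / len(triples)


# Toy data: the stand-in scorer prefers the string "good".
toy = [("q1", "good", "bad"), ("q2", "good", "bad"), ("q3", "bad", "good")]
acc = preference_accuracy(toy, lambda p, r: 1.0 if r == "good" else 0.0)
print(round(acc, 3))  # 0.667
```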
IPO evaluates the preference classification ability of models across multiple families and sizes:
Accuracy (%) on RewardBench. Higher is better. Trained reward models shown for comparison.
Key findings from RewardBench evaluation:
One of the paper's most striking findings: instruction tuning dramatically improves preference classification ability. Base models — even large ones — are poor preference classifiers. Their P("Yes")/P("No") distributions are nearly uniform and uncalibrated.
Why? Because base models have never been trained to follow the specific instruction format "Answer with Yes or No." They assign probability mass across many continuation tokens. Instruction-tuned models, in contrast, have learned to follow formatting constraints, which concentrates probability mass on the requested tokens.
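One way to picture this concentration effect is to measure how much of the full next-token distribution lands on "Yes" and "No" together. The sketch below uses made-up toy logits over a four-token vocabulary. It is an illustrative diagnostic, not a metric reported in the paper.

```python
import math
from typing import Dict


def yes_no_mass(logits: Dict[str, float]) -> float:
    """Share of full-vocabulary softmax mass assigned to "Yes" plus "No"."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(z - m) for tok, z in logits.items()}
    total = sum(exps.values())
    return (exps["Yes"] + exps["No"]) / total


# Toy logits (invented): a base-like model spreads mass evenly,
# an instruction-tuned-like model concentrates it on the requested tokens.
base_like = {"Yes": 0.0, "No": 0.0, "The": 0.0, "I": 0.0}
tuned_like = {"Yes": 5.0, "No": 3.0, "The": 0.0, "I": 0.0}

print(yes_no_mass(base_like))        # 0.5
print(yes_no_mass(tuned_like) > 0.95)  # True
```

When this mass is small, the normalized score is computed from two tiny probabilities, so small fluctuations in either one can swing the result.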
Same model architecture, different training — the effect of instruction tuning on preference classification.
The gaps are dramatic:
Not all model specializations transfer to preference classification equally. The paper reveals a surprising asymmetry: code-specialized models are good preference classifiers, but math-specialized models are not.
Why this asymmetry? The paper hypothesizes two factors:
Specific results on RewardBench Reasoning category:
The ultimate test: does the self-improvement pipeline actually work? The paper runs the Generate → Judge → Pair → DPO loop on several base models and evaluates on standard benchmarks including AlpacaEval 2.0 and MT-Bench.
Key results:
On AlpacaEval 2.0 (length-controlled win rate against GPT-4-Turbo):
IPO sits at the intersection of several research threads that have been converging toward self-supervised alignment: