Garg, Singh, Singh, Chopra — 2025

Implicit Preference Optimization

Your language model is secretly a preference classifier — extract P("Yes")/P("No") to rank responses, eliminate the reward model, and self-improve through iterative DPO.

Prerequisites: DPO + RLHF pipeline + LLM token probabilities
10 Chapters · 5+ Simulations

Chapter 0: The Problem

Aligning language models with human preferences requires a way to tell good outputs from bad ones. The standard RLHF pipeline trains a separate reward model on human preference data, then uses it to score candidate responses during RL training. This works, but it is expensive: you need to collect preference data, train and serve a second large neural network, and keep it calibrated as the policy shifts.

An alternative — RLAIF (RL from AI Feedback) — replaces human annotators with a powerful external LLM like GPT-4 or Claude. But this introduces a dependency on a proprietary, expensive API. You cannot self-host it, you cannot fine-tune it, and you pay per token.

What if the model you are aligning could judge its own outputs? Not by generating a long critique (which is slow and hard to parse), but by producing a single, calibrated score from its own logits?

[Interactive figure: The Alignment Bottleneck. Compares three approaches to scoring responses for alignment.]

The wish list: Can we score candidate responses without training a reward model, without calling an external API, and without generating verbose critiques? Can the model we are training serve as its own preference classifier? IPO says yes — by reading the probabilities the model already assigns to "Yes" and "No" tokens.
What is the core cost bottleneck in standard RLHF and RLAIF?

Chapter 1: The Key Insight

Here is the core observation: when you ask a language model "Is the following response a good answer to the question? Answer Yes or No," the model does not just output text. Under the hood, it computes a probability distribution over its entire vocabulary for the next token. Two of those probabilities are especially informative: P("Yes") and P("No").

A model that has been instruction-tuned — even without any preference-specific training — has already internalized a notion of response quality from its training data. When it assigns high probability to "Yes," it is expressing confidence that the response is good. When it assigns high probability to "No," it is expressing the opposite.

The key equation is embarrassingly simple. Given the next-token probabilities the model assigns to "Yes" and "No," compute the normalized preference score:

p'_yes = p_yes / (p_yes + p_no)

where p_yes = P("Yes" | prompt, response) and p_no = P("No" | prompt, response). This normalization restricts the score to [0, 1] and ignores probability mass on all other tokens, focusing purely on the binary preference signal.

Why normalization matters: Raw P("Yes") alone is unreliable because models allocate probability mass across many tokens ("Sure," "Absolutely," "Certainly"). By normalizing over just Yes and No, we extract a clean binary signal that behaves like a calibrated confidence score. Think of it as asking the model to commit to one of two positions and measuring how strongly it leans.

This score can then be used to rank candidate responses. Given N responses to the same prompt, the one with the highest p'_yes is the chosen response, and the one with the lowest is the rejected response. You now have a preference pair, which is exactly what DPO needs for training.
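
A small numeric sketch of the normalization and of how it can change a ranking. The probabilities are made up for illustration, and `normalized_yes_score` is a hypothetical helper name rather than anything defined by the paper.

```python
def normalized_yes_score(p_yes: float, p_no: float) -> float:
    """Return p'_yes = p_yes / (p_yes + p_no)."""
    total = p_yes + p_no
    if total <= 0.0:
        raise ValueError("P(Yes) + P(No) must be positive")
    return p_yes / total


# Hypothetical next-token probabilities for two responses to the same prompt.
# Response A: most remaining mass sits on other tokens ("Sure", "Absolutely", ...).
# Response B: the model is evenly split between "Yes" and "No".
response_a = {"p_yes": 0.20, "p_no": 0.02}
response_b = {"p_yes": 0.30, "p_no": 0.30}

score_a = normalized_yes_score(**response_a)
score_b = normalized_yes_score(**response_b)

# Raw P("Yes") would prefer B (0.30 > 0.20); the normalized score prefers A.
print(f"A: raw={response_a['p_yes']:.2f} normalized={score_a:.3f}")
print(f"B: raw={response_b['p_yes']:.2f} normalized={score_b:.3f}")
```

In this toy case the raw and normalized scores disagree, which is the situation the normalization is meant to handle.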

Why do we normalize P("Yes") by dividing by P("Yes") + P("No") rather than using P("Yes") directly?

Chapter 2: LLM as Preference Classifier

Let us be precise about how the preference classification works. You have a prompt x and a candidate response y. You construct an evaluation prompt:

Evaluation template:
Given the instruction: {x}
Is the following response correct and helpful?
Response: {y}
Answer with Yes or No.
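
As a rough sketch, the template above can be filled in with plain string formatting. The wording is copied from the template; the names `EVAL_TEMPLATE` and `build_eval_prompt` are illustrative assumptions, and a real pipeline would likely also wrap the result in the model's chat format.

```python
EVAL_TEMPLATE = (
    "Given the instruction: {x}\n"
    "Is the following response correct and helpful?\n"
    "Response: {y}\n"
    "Answer with Yes or No."
)


def build_eval_prompt(instruction: str, response: str) -> str:
    """Fill the evaluation template with a prompt x and a candidate response y."""
    return EVAL_TEMPLATE.format(x=instruction, y=response)


eval_prompt = build_eval_prompt("What is 2 + 2?", "2 + 2 equals 4.")
print(eval_prompt)
```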

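You feed this to the LLM and extract the logits at the final token position, the position that predicts the next token. From the vocabulary, you look up exactly two logit values: z_yes for the token "Yes" and z_no for the token "No." Apply softmax to just these two:

p'_yes = exp(z_yes) / (exp(z_yes) + exp(z_no))

This is the normalized score p'_yes from Chapter 1, because the full-vocabulary softmax denominator cancels in the ratio p_yes / (p_yes + p_no). It is also equivalent to a sigmoid: p'_yes = σ(z_yes − z_no). The logit difference z_yes − z_no is the model's raw preference signal.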
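
The equivalence can be checked numerically. The sketch below uses a made-up five-token vocabulary and made-up logits, and compares three routes to the same score: normalizing full-vocabulary probabilities, a two-way softmax over the Yes/No logits, and a sigmoid of the logit difference.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical toy vocabulary and final-position logits (illustrative only).
vocab = ["Yes", "No", "Sure", "Maybe", "The"]
logits = [2.1, 0.4, 1.7, -0.3, 0.9]
yes_id, no_id = vocab.index("Yes"), vocab.index("No")

probs = softmax(logits)  # full-vocabulary probabilities
via_full_softmax = probs[yes_id] / (probs[yes_id] + probs[no_id])
via_two_way_softmax = softmax([logits[yes_id], logits[no_id]])[0]
via_sigmoid = sigmoid(logits[yes_id] - logits[no_id])

print(via_full_softmax, via_two_way_softmax, via_sigmoid)
```

All three values agree up to floating-point error, so in practice only the two logits are needed.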

[Interactive figure: Logit Extraction Pipeline. Shows how a response is scored by extracting Yes/No logits, and how the score changes with response quality.]

To build a preference pair, you generate N candidate responses to the same prompt (typically N = 4) and score each one. The response with the highest p'_yes becomes y_w (chosen/winner), and the response with the lowest p'_yes becomes y_l (rejected/loser). The triple (x, y_w, y_l) is then fed to DPO.
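
A minimal sketch of the pairing step, assuming the scores have already been computed. `select_preference_pair` is a hypothetical helper, and returning `None` when every candidate ties is an assumption of this sketch, not something the source specifies.

```python
from typing import List, Optional, Tuple


def select_preference_pair(
    responses: List[str], scores: List[float]
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) by highest and lowest score, or None on a tie."""
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need one score per response and at least two responses")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        return None  # assumption: skip prompts where all candidates tie
    return responses[best], responses[worst]


# Illustrative scores for N = 4 candidates.
candidates = ["resp_a", "resp_b", "resp_c", "resp_d"]
p_yes_scores = [0.62, 0.91, 0.18, 0.75]
pair = select_preference_pair(candidates, p_yes_scores)
print(pair)
```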

No generation needed for judging: Scoring a response requires only a single forward pass through the model — no autoregressive generation. This makes it dramatically faster than LLM-as-judge approaches that generate multi-paragraph critiques. For N candidates, you need N forward passes, not N expensive generation calls.
How does IPO construct a preference pair from N candidate responses?

Chapter 3: Category-Specific Prompts

Not all tasks are created equal. A response that is "good" for a coding question has different criteria than one for a math problem or a safety-sensitive query. IPO uses category-specific evaluation prompts that tailor the judging criteria to the domain.

The paper defines four categories, each with a specialized prompt template:

1. Chat: "Is the response helpful, relevant, and well-written?"
2. Code: "Is the code correct, efficient, and well-documented?"
3. Math: "Is the mathematical reasoning correct and the final answer right?"
4. Safety: "Does the response refuse harmful requests appropriately?"

Why does this matter? Because a single generic prompt ("Is this a good response?") conflates many quality dimensions. A coding response might be technically correct but poorly documented. A math response might have the right answer but skip critical steps. Category-specific prompts let the model's internal preferences focus on the right criteria for each domain.

On RewardBench, the paper shows that category-specific prompts consistently outperform generic prompts, with the biggest gains in the safety category where evaluation criteria differ most from general helpfulness.

Implementation detail: The category is determined from the dataset metadata — RewardBench labels each example as Chat, Chat-Hard, Safety, or Reasoning. In a production self-improvement pipeline, you would use a simple classifier or keyword matching to route prompts to the appropriate template.
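
The keyword-matching option could look roughly like the sketch below. The four questions are the ones listed in this chapter; the keyword lists, the priority order, and the fallback to "chat" are illustrative assumptions only.

```python
CATEGORY_QUESTIONS = {
    "chat": "Is the response helpful, relevant, and well-written?",
    "code": "Is the code correct, efficient, and well-documented?",
    "math": "Is the mathematical reasoning correct and the final answer right?",
    "safety": "Does the response refuse harmful requests appropriately?",
}

# Hypothetical keyword lists; a real router would need far better coverage.
KEYWORDS = {
    "safety": ("weapon", "explosive", "poison"),
    "code": ("python", "function", "compile", "bug", "code"),
    "math": ("solve", "equation", "integral", "prove", "calculate"),
}


def route_category(prompt: str) -> str:
    """Pick a category by simple substring matching, falling back to chat."""
    text = prompt.lower()
    for category in ("safety", "code", "math"):  # assumed priority order
        if any(keyword in text for keyword in KEYWORDS[category]):
            return category
    return "chat"


for p in ("Write a Python function to reverse a list.",
          "Solve the equation x^2 = 4.",
          "What is a good name for my cat?"):
    category = route_category(p)
    print(category, "->", CATEGORY_QUESTIONS[category])
```

Substring matching is brittle (for example, "function" also appears in math prompts), which is why a small classifier may be the better choice.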
Why do category-specific evaluation prompts outperform a single generic prompt?

Chapter 4: The Self-Improving Pipeline

This is the payoff. IPO's preference classification ability enables a complete self-improvement loop where a single model generates, judges, and learns — with no external model in the loop.

The pipeline has four stages, repeating iteratively:

1. Generate: For each prompt x in your dataset, sample N = 4 diverse responses {y_1, ..., y_4} from the current policy π_θ.
2. Judge: Use the same π_θ as a preference classifier. For each response, compute p'_yes = P("Yes") / (P("Yes") + P("No")).
3. Pair: From each set of N responses, take the one with the highest p'_yes as y_w (chosen) and the lowest as y_l (rejected).
4. Train: Run DPO on the constructed preference pairs. The resulting model becomes the new π_θ for the next iteration.

[Interactive figure: Self-Improvement Loop. Steps through the four stages of one IPO iteration.]

No external dependencies: Unlike RLAIF (which calls GPT-4 for judging) or standard RLHF (which needs a separate reward model), IPO's loop is entirely self-contained. The same model generates, judges, and learns. Each iteration makes the model both a better generator and a better judge — a virtuous cycle.

The paper uses the UltraFeedback dataset's prompts (60k instructions across diverse domains) and generates 4 responses per prompt. Each iteration produces ~60k preference pairs. They run 1-3 iterations of the loop, finding diminishing returns after iteration 2.
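
The four stages can be sketched as a loop over pluggable functions. Everything here is a stand-in: `make_generate`, `make_score`, and `train_dpo` are hypothetical hooks that a real pipeline would back with sampling, logit extraction, and a DPO trainer, and the toy "policy" is just an integer version number. Dropping prompts whose candidates all tie is an assumption of this sketch.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt x, chosen y_w, rejected y_l)


def ipo_iteration(prompts: List[str], generate: Callable, score: Callable,
                  n: int = 4) -> List[Pair]:
    """Generate n candidates per prompt, judge them, and build preference pairs."""
    pairs = []
    for x in prompts:
        candidates = generate(x, n)                      # stage 1: Generate
        scores = [score(x, y) for y in candidates]       # stage 2: Judge
        order = range(len(candidates))
        best = max(order, key=scores.__getitem__)        # stage 3: Pair
        worst = min(order, key=scores.__getitem__)
        if scores[best] > scores[worst]:                 # skip ties (assumption)
            pairs.append((x, candidates[best], candidates[worst]))
    return pairs


def self_improve(policy, prompts, make_generate, make_score, train_dpo,
                 iterations: int = 2):
    """Repeat the loop; the same policy generates and judges each round."""
    for _ in range(iterations):
        pairs = ipo_iteration(prompts, make_generate(policy), make_score(policy))
        policy = train_dpo(policy, pairs)                # stage 4: Train
    return policy


# Toy stand-ins so the sketch runs end to end.
def make_generate(policy):
    return lambda x, n: [f"{x}/v{policy}/cand{i}" for i in range(n)]

def make_score(policy):
    return lambda x, y: int(y[-1]) / 10.0  # toy rule: later candidates score higher

def train_dpo(policy, pairs):
    return policy + 1  # pretend training yields the next policy version

pairs = ipo_iteration(["q1", "q2"], make_generate(0), make_score(0))
final_policy = self_improve(0, ["q1", "q2"], make_generate, make_score, train_dpo)
print(pairs[0], final_policy)
```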

What makes IPO's self-improvement loop different from RLAIF?

Chapter 5: RewardBench Evaluation

RewardBench is a standardized benchmark for evaluating reward models and preference classifiers. It contains prompt-chosen-rejected triples across four categories: Chat, Chat-Hard, Safety, and Reasoning. A model's score is the accuracy of preferring the chosen response over the rejected one.
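
The accuracy metric described above can be sketched as follows. `preference_accuracy` is a hypothetical helper, the length-based scorer is a toy placeholder for a real p'_yes scorer, and counting ties as incorrect is an assumption of this sketch.

```python
from typing import Callable, List, Tuple


def preference_accuracy(
    triples: List[Tuple[str, str, str]],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    if not triples:
        raise ValueError("need at least one triple")
    correct = sum(
        1 for prompt, chosen, rejected in triples
        if score(prompt, chosen) > score(prompt, rejected)  # ties count as wrong
    )
    return correct / len(triples)


# Toy scorer: longer responses score higher (stand-in for a real p'_yes scorer).
def toy_score(prompt: str, response: str) -> float:
    return float(len(response))

toy_triples = [
    ("p1", "long answer", "short"),
    ("p2", "ok", "a longer wrong one"),
    ("p3", "detailed", "x"),
]
accuracy = preference_accuracy(toy_triples, toy_score)
print(f"{accuracy:.3f}")
```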

IPO evaluates the preference classification ability of models across multiple families and sizes:

[Figure: RewardBench Scores Across Model Families. Accuracy (%) on RewardBench; higher is better. Trained reward models shown for comparison.]

Key findings from RewardBench evaluation:

What is the most consistent finding across model families on RewardBench?

Chapter 6: Base vs Instruction-Tuned

One of the paper's most striking findings: instruction tuning dramatically improves preference classification ability. Base models — even large ones — are poor preference classifiers. Their P("Yes")/P("No") distributions are nearly uniform and uncalibrated.

Why? Because base models have never been trained to follow the specific instruction format "Answer with Yes or No." They assign probability mass across many continuation tokens. Instruction-tuned models, in contrast, have learned to follow formatting constraints, which concentrates probability mass on the requested tokens.

[Figure: Base vs Instruct: Preference Accuracy. Same model architecture, different training: the effect of instruction tuning on preference classification.]

The gaps are dramatic:

The calibration insight: Instruction tuning does not just teach the model to follow instructions — it calibrates the model's internal preferences. A base model might "know" that response A is better than response B (in the sense that it assigns higher likelihood to A), but it cannot express this judgment through the Yes/No format because it has never been trained on that interaction pattern.
Why are base models poor preference classifiers compared to instruction-tuned models?

Chapter 7: Domain Analysis

Not all model specializations transfer to preference classification equally. The paper reveals a surprising asymmetry: code-specialized models are good preference classifiers, but math-specialized models are not.

Why this asymmetry? The paper hypothesizes two factors:

The takeaway for practitioners: If you want to use a model as its own preference classifier, choose an instruction-tuned general or code model. Do not use a math-specialized model — even if your evaluation task involves mathematical reasoning. The preference classification ability comes from instruction-following ability, not from domain expertise.

Specific results on RewardBench Reasoning category:

Why do math-specialized models fail at preference classification while code models succeed?

Chapter 8: Self-Improvement Results

The ultimate test: does the self-improvement pipeline actually work? The paper runs the Generate → Judge → Pair → Train (DPO) loop on several starting models and evaluates the results on standard benchmarks, including AlpacaEval 2.0 and MT-Bench.

Key results:

IPO matches external judges: Models self-improved with IPO achieve comparable performance to models trained with GPT-4 as judge (RLAIF), despite using no external model.
Iteration helps (with diminishing returns): Iteration 1 gives the biggest jump. Iteration 2 adds modest gains. By iteration 3, improvements plateau.
Consistent across sizes: The pipeline works for 7B, 14B, and 70B models. Larger models benefit more because they are better self-judges.

On AlpacaEval 2.0 (length-controlled win rate against GPT-4-Turbo):

The gap is small: IPO's self-improvement achieves ~97% of the performance of using GPT-4 as an external judge while making no external API calls. For organizations without access to state-of-the-art proprietary models, this is a practical path to alignment.
How does IPO self-improvement compare to RLAIF with GPT-4 as judge?

Chapter 9: Connections

IPO sits at the intersection of several research threads that have been converging toward self-supervised alignment:

DPO (Rafailov et al., 2023): IPO uses DPO as its training objective. DPO showed that preference learning can be reduced to supervised classification on log-ratio differences. IPO provides a new way to generate the preference pairs that DPO trains on.
RLAIF (Bai et al., 2022; Lee et al., 2023): RLAIF replaced human annotators with LLMs for generating preference labels. IPO takes this further by eliminating the need for a separate LLM — the model judges itself.
Self-Rewarding LMs (Yuan et al., 2024): Also uses the model to judge itself, but via generation (producing scores in text). IPO is more efficient because it uses logit extraction instead of generation.
LLM-as-Judge (Zheng et al., 2023): Showed that LLMs can judge response quality. IPO's innovation is using token probabilities instead of generated text for scoring, which is faster and more calibrated.
Future direction — Online IPO: Instead of batch iterations, continuously generate, judge, and train in an online fashion. Combine with rejection sampling for curriculum learning.
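
For reference, the "classification on log-ratio differences" in the DPO entry above can be sketched for a single preference pair. This follows the standard DPO objective, -log σ(β[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]); the function name, the example log-probabilities, and the default β = 0.1 are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (x, y_w, y_l) pair from sequence log-probabilities."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raises the chosen response and lowers the rejected one: loss drops.
loss_after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)

print(f"{loss_at_start:.4f} {loss_after_update:.4f}")
```

In the IPO setting, y_w and y_l would be the highest- and lowest-scoring candidates selected by the model's own Yes/No score.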

Summary Card

Core equation
p'_yes = p_yes / (p_yes + p_no) = σ(z_yes − z_no)
Key insight
Instruction-tuned LLMs implicitly encode preference judgments in their Yes/No token probabilities
Mechanism
Single forward pass per response, normalize Yes/No logits, rank responses, feed to DPO
Advantage
No reward model, no external LLM, self-contained self-improvement loop
Impact
Enables alignment without proprietary APIs; ~97% of RLAIF performance at zero external cost
What is IPO's key advantage over Self-Rewarding LMs for self-improvement?