IPO — Veanors

Chapter 0: The Problem

Aligning language models with human preferences requires a way to tell good outputs from bad ones. The standard RLHF pipeline trains a separate reward model on human preference data, then uses it to score candidate responses during RL training. This works, but it is expensive: you need to collect preference data, train and serve a second large neural network, and keep it calibrated as the policy shifts.

An alternative — RLAIF (RL from AI Feedback) — replaces human annotators with a powerful external LLM like GPT-4 or Claude. But this introduces a dependency on a proprietary, expensive API. You cannot self-host it, you cannot fine-tune it, and you pay per token.

What if the model you are aligning could judge its own outputs? Not by generating a long critique (which is slow and hard to parse), but by producing a single, calibrated score from its own logits?

The Alignment Bottleneck

Three approaches to scoring responses for alignment.

The wish list: Can we score candidate responses without training a reward model, without calling an external API, and without generating verbose critiques? Can the model we are training serve as its own preference classifier? IPO says yes — by reading the probabilities the model already assigns to "Yes" and "No" tokens.

What is the core cost bottleneck in standard RLHF and RLAIF?

RLHF requires a separate trained reward model; RLAIF requires an expensive external LLM API. Both add significant cost and complexity.
The SFT stage is too slow
The model runs out of memory during inference

Chapter 1: The Key Insight

Here is the core observation: when you ask a language model "Is the following response a good answer to the question? Answer Yes or No," the model does not just output text. Under the hood, it computes a probability distribution over its entire vocabulary for the next token. Two of those probabilities are especially informative: P("Yes") and P("No").

A model that has been instruction-tuned — even without any preference-specific training — has already internalized a notion of response quality from its training data. When it assigns high probability to "Yes," it is expressing confidence that the response is good. When it assigns high probability to "No," it is expressing the opposite.

The key equation is embarrassingly simple. Given the probabilities the model assigns to "Yes" and "No" as the next token, compute the normalized preference score:

p'_yes = p_yes / (p_yes + p_no)

Where p_yes = P("Yes" | prompt, response) and p_no = P("No" | prompt, response). This normalization restricts the score to [0, 1] and ignores probability mass on all other tokens — focusing purely on the binary preference signal.

Why normalization matters: Raw P("Yes") alone is unreliable because models allocate probability mass across many tokens ("Sure," "Absolutely," "Certainly"). By normalizing over just Yes and No, we extract a clean binary signal that behaves like a calibrated confidence score. Think of it as asking the model to commit to one of two positions and measuring how strongly it leans.
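A small worked example, using made-up probabilities chosen only for illustration, shows how the raw value can mislead. Suppose responses A and B both receive P("Yes") = 0.30, but A receives P("No") = 0.10 while B receives P("No") = 0.30, with the remaining mass on other tokens:

```latex
\[
p'_{\text{yes}}(A) = \frac{0.30}{0.30 + 0.10} = 0.75,
\qquad
p'_{\text{yes}}(B) = \frac{0.30}{0.30 + 0.30} = 0.50
\]
```

Raw P("Yes") would call these two responses a tie; the normalized score ranks A above B.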

This score can then be used to rank candidate responses. Given N responses to the same prompt, the one with the highest p'_yes is the chosen response, and the one with the lowest is the rejected response. You now have a preference pair — exactly what DPO needs for training.

Why do we normalize P("Yes") by dividing by P("Yes") + P("No") rather than using P("Yes") directly?

Because raw P("Yes") is diluted by probability mass on synonyms like "Sure" and "Certainly"; normalizing over just Yes/No extracts a clean binary preference signal
To make the probability sum to 1
To prevent numerical overflow

Chapter 2: LLM as Preference Classifier

Let us be precise about how the preference classification works. You have a prompt x and a candidate response y. You construct an evaluation prompt:

Evaluation template:

    Given the instruction: {x}

    Is the following response correct and helpful?

    Response: {y}

    Answer with Yes or No.

You feed this to the LLM and extract the logits at the final token position. From the vocabulary, you look up exactly two logit values: the one for token "Yes" and the one for token "No." Apply softmax to just these two:

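p'_yes = exp(z_yes) / (exp(z_yes) + exp(z_no))

This is equivalent to a sigmoid: p'_yes = σ(z_yes − z_no). It is also the same quantity as the normalized score p_yes / (p_yes + p_no) from Chapter 1, because the full-vocabulary softmax denominator cancels in the ratio. The logit difference z_yes − z_no is the model's raw preference signal.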
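The following is a minimal sketch of the scoring step, not the paper's code. It assumes you already have the final-position logits as a plain sequence of floats; the toy vocabulary and the ids standing in for "Yes" and "No" are hypothetical. It also cross-checks that the two-logit form matches normalizing full-vocabulary probabilities.

```python
import math
from typing import List, Sequence


def yes_no_score(logits: Sequence[float], yes_id: int, no_id: int) -> float:
    """Normalized preference score: sigmoid(z_yes - z_no)."""
    diff = logits[yes_id] - logits[no_id]
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def full_softmax(logits: Sequence[float]) -> List[float]:
    """Softmax over the whole (toy) vocabulary, used only for the cross-check."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 5-token vocabulary; ids 1 and 3 stand in for "Yes" and "No" (assumed).
toy_logits = [0.2, 2.0, 1.5, 0.5, -1.0]
YES_ID, NO_ID = 1, 3

probs = full_softmax(toy_logits)
via_probs = probs[YES_ID] / (probs[YES_ID] + probs[NO_ID])
via_logits = yes_no_score(toy_logits, YES_ID, NO_ID)
print(round(via_probs, 4), round(via_logits, 4))  # the two values agree
```

With a real model, the logits would come from one forward pass over the evaluation prompt, and the two ids from the tokenizer. Tokenizers may encode "Yes" with or without a leading space, so the ids are worth checking by hand.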

Logit Extraction Pipeline

How a response is scored by extracting Yes/No logits.

Now to build a preference pair, you generate N candidate responses to the same prompt (typically N = 4). Score each one. The response with the highest p'_yes becomes y_w (chosen/winner), and the response with the lowest p'_yes becomes y_l (rejected/loser). The pair (x, y_w, y_l) is then fed to DPO.
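The pair construction described above can be sketched in a few lines. The `PreferencePair` type and `build_pair` helper are hypothetical names, and returning `None` when every candidate ties is an assumption of this sketch; the source does not say how ties are handled.

```python
from typing import NamedTuple, Optional, Sequence


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str    # y_w, highest normalized Yes score
    rejected: str  # y_l, lowest normalized Yes score


def build_pair(
    prompt: str, responses: Sequence[str], scores: Sequence[float]
) -> Optional[PreferencePair]:
    if len(responses) != len(scores) or len(responses) < 2:
        raise ValueError("need at least two responses, with one score each")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        return None  # all candidates tied: no preference signal (assumed policy)
    return PreferencePair(prompt, responses[best], responses[worst])


# Illustrative scores only; real ones would come from the Yes/No logits.
pair = build_pair(
    "What is 2 + 2?",
    ["4", "5", "It is 4.", "22"],
    [0.91, 0.12, 0.88, 0.05],
)
print(pair)
```

Only the extremes are used, so the two middle candidates are discarded once the pair has been picked.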

No generation needed for judging: Scoring a response requires only a single forward pass through the model — no autoregressive generation. This makes it dramatically faster than LLM-as-judge approaches that generate multi-paragraph critiques. For N candidates, you need N forward passes, not N expensive generation calls.

How does IPO construct a preference pair from N candidate responses?

Score each response's normalized P("Yes"), then take the highest-scored as chosen and lowest-scored as rejected
Generate critiques for each response and parse them
Compare responses pairwise using BLEU scores

Chapter 3: Category-Specific Prompts

Not all tasks are created equal. A response that is "good" for a coding question has different criteria than one for a math problem or a safety-sensitive query. IPO uses category-specific evaluation prompts that tailor the judging criteria to the domain.

The paper defines four categories, each with a specialized prompt template:

Chat: "Is the response helpful, relevant, and well-written?"

Code: "Is the code correct, efficient, and well-documented?"

Math: "Is the mathematical reasoning correct and the final answer right?"

Safety: "Does the response refuse harmful requests appropriately?"

Why does this matter? Because a single generic prompt ("Is this a good response?") conflates many quality dimensions. A coding response might be technically correct but poorly documented. A math response might have the right answer but skip critical steps. Category-specific prompts let the model's internal preferences focus on the right criteria for each domain.

On RewardBench, the paper shows that category-specific prompts consistently outperform generic prompts, with the biggest gains in the safety category where evaluation criteria differ most from general helpfulness.

Implementation detail: The category is determined from the dataset metadata — RewardBench labels each example as Chat, Chat-Hard, Safety, or Reasoning. In a production self-improvement pipeline, you would use a simple classifier or keyword matching to route prompts to the appropriate template.
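One way the keyword-matching route could look is sketched below. The one-line criteria quoted earlier stand in for the full templates, and the keyword lists and function names are hypothetical placeholders rather than anything taken from the paper.

```python
CATEGORY_QUESTIONS = {
    "chat": "Is the response helpful, relevant, and well-written?",
    "code": "Is the code correct, efficient, and well-documented?",
    "math": "Is the mathematical reasoning correct and the final answer right?",
    "safety": "Does the response refuse harmful requests appropriately?",
}

# Hypothetical keyword lists, checked in insertion order (safety first).
# Substring matching is crude; a real router would need tuning or a classifier.
KEYWORDS = {
    "safety": ("weapon", "exploit", "poison"),
    "code": ("python", "function", "compile", "bug"),
    "math": ("solve", "integral", "equation", "prove"),
}


def route_category(prompt: str) -> str:
    """Pick a category by keyword match, falling back to chat."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "chat"


def build_eval_prompt(prompt: str, response: str) -> str:
    """Fill the evaluation template with the category-specific question."""
    question = CATEGORY_QUESTIONS[route_category(prompt)]
    return (
        f"Given the instruction: {prompt}\n\n"
        f"{question}\n\n"
        f"Response: {response}\n\n"
        "Answer with Yes or No."
    )


print(build_eval_prompt("Fix this Python bug", "print(1)"))
```

Checking safety keywords first is a deliberate choice in this sketch: a prompt that is both code-related and potentially harmful gets judged against the safety criterion.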

Why do category-specific evaluation prompts outperform a single generic prompt?

They focus the model's preference signal on domain-relevant quality criteria rather than conflating helpfulness, correctness, safety, and code quality
They use more tokens, which gives the model more context
They bypass the tokenizer

Chapter 4: The Self-Improving Pipeline

This is the payoff. IPO's preference classification ability enables a complete self-improvement loop where a single model generates, judges, and learns — with no external model in the loop.

The pipeline has four stages, repeating iteratively:

Generate: For each prompt x in your dataset, sample N = 4 diverse responses {y₁, ..., y₄} from the current policy π_θ.

Judge: Use the same π_θ as a preference classifier. For each response, compute p'_yes = P("Yes") / (P("Yes") + P("No")).

Pair: From each set of N responses, take the one with the highest p'_yes as y_w (chosen) and the lowest as y_l (rejected).

Train: Run DPO on the constructed preference pairs. The resulting model becomes the new π_θ for the next iteration.
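The four stages above can be sketched as a loop. Generation, judging, and DPO training are passed in as callables, so the skeleton stays independent of any particular library. All function and parameter names are hypothetical, and skipping prompts whose candidates all tie is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str, str]  # (x, y_w, y_l)


def ipo_iteration(
    prompts: Sequence[str],
    generate: Callable[[str, int], List[str]],  # samples n responses from the policy
    judge: Callable[[str, str], float],         # normalized Yes score for (x, y)
    n: int = 4,
) -> List[Pair]:
    pairs: List[Pair] = []
    for x in prompts:
        candidates = generate(x, n)                      # Generate
        scored = [(judge(x, y), y) for y in candidates]  # Judge
        best = max(scored, key=lambda s: s[0])
        worst = min(scored, key=lambda s: s[0])
        if best[0] > worst[0]:                           # Pair (ties skipped)
            pairs.append((x, best[1], worst[1]))
    return pairs


def self_improve(policy, prompts, make_generate, make_judge, train_dpo, iterations=2):
    """Repeat Generate, Judge, Pair, Train; the same policy fills both roles."""
    for _ in range(iterations):
        pairs = ipo_iteration(prompts, make_generate(policy), make_judge(policy))
        policy = train_dpo(policy, pairs)                # Train
    return policy
```

Both `make_generate` and `make_judge` receive the current policy, which is the point of the method: after each DPO step the updated model is both the new generator and the new judge.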

Self-Improvement Loop


No external dependencies: Unlike RLAIF (which calls GPT-4 for judging) or standard RLHF (which needs a separate reward model), IPO's loop is entirely self-contained. The same model generates, judges, and learns. Each iteration makes the model both a better generator and a better judge — a virtuous cycle.

The paper uses the UltraFeedback dataset's prompts (60k instructions across diverse domains) and generates 4 responses per prompt. Each iteration produces ~60k preference pairs. They run 1-3 iterations of the loop, finding diminishing returns after iteration 2.

What makes IPO's self-improvement loop different from RLAIF?

IPO uses the same model for both generation and judging, with no external LLM required, while RLAIF depends on a separate, typically larger model as judge
IPO uses more training data
IPO uses reinforcement learning instead of DPO

Chapter 5: RewardBench Evaluation

RewardBench is a standardized benchmark for evaluating reward models and preference classifiers. It contains prompt-chosen-rejected triples across four categories: Chat, Chat-Hard, Safety, and Reasoning. A model's score is the accuracy of preferring the chosen response over the rejected one.
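The accuracy metric described above reduces to a pairwise comparison. The sketch below assumes a `score(prompt, response)` callable that returns the normalized Yes score, and it counts ties as errors. Both are assumptions of this example, and any per-category weighting the benchmark applies is left out.

```python
from typing import Callable, Iterable, Tuple

Triple = Tuple[str, str, str]  # (prompt, chosen, rejected)


def pairwise_accuracy(
    triples: Iterable[Triple], score: Callable[[str, str], float]
) -> float:
    """Fraction of triples where the chosen response outscores the rejected one."""
    correct = 0
    total = 0
    for prompt, chosen, rejected in triples:
        total += 1
        if score(prompt, chosen) > score(prompt, rejected):
            correct += 1  # ties count as incorrect (assumed)
    if total == 0:
        raise ValueError("no triples to evaluate")
    return correct / total


# Toy scorer for illustration only: longer responses get higher scores.
toy_triples = [
    ("q1", "a detailed answer", "no"),
    ("q2", "ok", "a long but wrong answer"),
]
print(pairwise_accuracy(toy_triples, lambda p, r: len(r) / 100))
```

A classifier that scores at random would land near 0.5 on this metric, which is the reference point for the "near random" results quoted in later chapters.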

IPO evaluates the preference classification ability of models across multiple families and sizes:

RewardBench Scores Across Model Families

Accuracy (%) on RewardBench. Higher is better. Trained reward models shown for comparison.

Key findings from RewardBench evaluation:

Competitive with trained RMs: The best IPO models (Qwen2.5-72B-Instruct) achieve scores competitive with purpose-trained reward models, despite never being explicitly trained to judge preferences.
Scale helps: Larger models are consistently better preference classifiers. Qwen2.5-72B > 32B > 14B > 7B.
Family matters: Qwen and LLaMA families show the strongest preference classification. GPT models work but are slightly weaker.
Chat-Hard is hard for everyone: This category, which requires distinguishing between subtly different response qualities, is the most challenging across all models.

What is the most consistent finding across model families on RewardBench?

Larger models are better preference classifiers, and Chat-Hard is the most challenging category across all families
All models score above 90%
GPT models are always the best

Chapter 6: Base vs Instruction-Tuned

One of the paper's most striking findings: instruction tuning dramatically improves preference classification ability. Base models — even large ones — are poor preference classifiers. Their P("Yes")/P("No") distributions are nearly uniform and uncalibrated.

Why? Because base models have never been trained to follow the specific instruction format "Answer with Yes or No." They assign probability mass across many continuation tokens. Instruction-tuned models, in contrast, have learned to follow formatting constraints, which concentrates probability mass on the requested tokens.
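One way to see the spread described above is to measure how much next-token probability lands on "Yes" or "No" at all. This diagnostic is an illustration added here, not something the source reports. The function name, the toy 8-token vocabulary, and the logit values are all made up.

```python
import math
from typing import Sequence


def yes_no_mass(logits: Sequence[float], yes_id: int, no_id: int) -> float:
    """Fraction of next-token probability that falls on "Yes" or "No"."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return (exps[yes_id] + exps[no_id]) / sum(exps)


# Toy vocabulary of 8 tokens; ids 0 and 1 stand in for "Yes" and "No".
flat = yes_no_mass([0.0] * 8, 0, 1)                 # mass spread evenly
peaked = yes_no_mass([8.0, 6.0] + [0.0] * 6, 0, 1)  # mass concentrated on Yes/No
print(flat, round(peaked, 3))
```

A low value means the model is mostly predicting something other than the requested answer, so the Yes/No ratio is computed from a small slice of the distribution.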

Base vs Instruct: Preference Accuracy

Same model architecture, different training — the effect of instruction tuning on preference classification.

The gaps are dramatic:

LLaMA-3.1-8B base: ~52% accuracy (near random)
LLaMA-3.1-8B-Instruct: ~72% accuracy
Qwen2.5-7B base: ~55% accuracy
Qwen2.5-7B-Instruct: ~78% accuracy

The calibration insight: Instruction tuning does not just teach the model to follow instructions — it calibrates the model's internal preferences. A base model might "know" that response A is better than response B (in the sense that it assigns higher likelihood to A), but it cannot express this judgment through the Yes/No format because it has never been trained on that interaction pattern.

Why are base models poor preference classifiers compared to instruction-tuned models?

Base models have never been trained to follow "Answer Yes or No" format, so their probability mass is spread across many tokens rather than concentrated on Yes/No
Base models have fewer parameters
Base models were not trained on enough data

Chapter 7: Domain Analysis

Not all model specializations transfer to preference classification equally. The paper reveals a surprising asymmetry: code-specialized models are good preference classifiers, but math-specialized models are not.

Why this asymmetry? The paper hypothesizes two factors:

Code models are trained with extensive instruction-following data (documentation, Stack Overflow, coding tutorials). This format is close to the preference evaluation format. Code correctness is also relatively binary — code works or it doesn't — which aligns well with Yes/No classification.
Math models are often trained with chain-of-thought reasoning that prioritizes step-by-step derivation. Their training distributions are heavily skewed toward generating mathematical notation and reasoning chains, not toward binary judgments. The Yes/No format is far outside their fine-tuning distribution.

The takeaway for practitioners: If you want to use a model as its own preference classifier, choose an instruction-tuned general or code model. Do not use a math-specialized model — even if your evaluation task involves mathematical reasoning. The preference classification ability comes from instruction-following ability, not from domain expertise.

Specific results on RewardBench Reasoning category:

Qwen2.5-Coder-7B-Instruct: ~74% (good)
Qwen2.5-Math-7B-Instruct: ~51% (near random)
DeepSeek-Coder-6.7B-Instruct: ~71% (good)
DeepSeek-Math-7B-Instruct: ~53% (near random)

Why do math-specialized models fail at preference classification while code models succeed?

Math models are fine-tuned for chain-of-thought reasoning, not binary judgments; the Yes/No format is outside their training distribution, while code models have extensive instruction-following training
Math problems are harder than coding problems
Code models have more parameters

Chapter 8: Self-Improvement Results

The ultimate test: does the self-improvement pipeline actually work? The paper runs the Generate → Judge → Pair → DPO loop on several starting models and evaluates on standard benchmarks including AlpacaEval 2.0 and MT-Bench.

Key results:

IPO matches external judges: Models self-improved with IPO achieve comparable performance to models trained with GPT-4 as judge (RLAIF), despite using no external model.

Iteration helps (with diminishing returns): Iteration 1 gives the biggest jump. Iteration 2 adds modest gains. By iteration 3, improvements plateau.

Consistent across sizes: The pipeline works for 7B, 14B, and 70B models. Larger models benefit more because they are better self-judges.

On AlpacaEval 2.0 (length-controlled win rate against GPT-4-Turbo):

Qwen2.5-7B-Instruct baseline: ~23%
+ IPO iteration 1: ~31%
+ IPO iteration 2: ~33%
+ RLAIF (GPT-4 judge): ~34%

The gap is small: IPO's self-improvement achieves ~97% of the performance of using GPT-4 as an external judge, while costing zero API calls. For organizations without access to state-of-the-art proprietary models, this is a practical path to alignment.
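The ~97% figure follows directly from the approximate win rates listed above. The second ratio below, the share of the improvement over the baseline that IPO recovers, is an additional reading of the same numbers and is not stated in the source:

```latex
\[
\frac{\text{IPO iteration 2}}{\text{RLAIF}} \approx \frac{33}{34} \approx 0.97,
\qquad
\frac{33 - 23}{34 - 23} = \frac{10}{11} \approx 0.91
\]
```

Since the inputs are themselves rounded, both ratios should be read as rough estimates.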

How does IPO self-improvement compare to RLAIF with GPT-4 as judge?

IPO achieves ~97% of RLAIF's performance while requiring zero external API calls, a practical alternative when proprietary models are unavailable
IPO consistently outperforms RLAIF
IPO is twice as slow

Chapter 9: Connections

IPO sits at the intersection of several research threads that have been converging toward self-supervised alignment:

DPO (Rafailov et al., 2023): IPO uses DPO as its training objective. DPO showed that preference learning can be reduced to supervised classification on log-ratio differences. IPO provides a new way to generate the preference pairs that DPO trains on.

RLAIF (Bai et al., 2022; Lee et al., 2023): RLAIF replaced human annotators with LLMs for generating preference labels. IPO takes this further by eliminating the need for a separate LLM — the model judges itself.

Self-Rewarding LMs (Yuan et al., 2024): Also uses the model to judge itself, but via generation (producing scores in text). IPO is more efficient because it uses logit extraction instead of generation.

LLM-as-Judge (Zheng et al., 2023): Showed that LLMs can judge response quality. IPO's innovation is using token probabilities instead of generated text for scoring, which is faster and more calibrated.

Future direction — Online IPO: Instead of batch iterations, continuously generate, judge, and train in an online fashion. Combine with rejection sampling for curriculum learning.
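For readers who want the DPO connection above made concrete, here is a sketch of the standard per-pair DPO loss from Rafailov et al. It takes the summed log-probabilities of the chosen (w) and rejected (l) responses under the policy and under the frozen reference model. The function name and the default `beta=0.1` are assumptions of this example, not values taken from the IPO paper.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,  # assumed default; tune for your setup
) -> float:
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected))."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy already prefers the chosen response more than the reference does.
print(dpo_loss(-9.0, -15.0, -10.0, -12.0, beta=1.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference faster than the rejected one's, which is why the quality of the self-generated (y_w, y_l) pairs matters so much.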

Summary Card

Core equation

p'_yes = p_yes / (p_yes + p_no) = σ(z_yes − z_no)

Key insight

Instruction-tuned LLMs implicitly encode preference judgments in their Yes/No token probabilities

Mechanism

Single forward pass per response, normalize Yes/No logits, rank responses, feed to DPO

Advantage

No reward model, no external LLM, self-contained self-improvement loop

Impact

Enables alignment without proprietary APIs; ~97% of RLAIF performance at zero external cost

What is IPO's key advantage over Self-Rewarding LMs for self-improvement?

IPO uses logit extraction (a single forward pass) instead of text generation for scoring, making it faster and more calibrated
IPO uses a larger model
IPO uses reinforcement learning

Implicit Preference Optimization