Introduction

In 2020, GPT-3 stunned the world with its ability to generate coherent text on virtually any topic. But it had a problem. Ask it a question, and it might answer — or it might continue generating as if it were writing a forum post, a news article, or a fictional dialogue. It was trained to predict the next token, not to be helpful. The gap between "can generate any text" and "reliably generates the text a user actually wants" turned out to be one of the most important problems in modern AI.

This gap is what alignment addresses. The term encompasses a broad set of techniques for steering language models to follow instructions, be truthful, avoid harmful outputs, and generally behave in ways that align with human values and intentions. At the technical level, it is a post-training optimization problem: given a model that already understands language, how do we adjust its behavior without destroying that understanding?

The dominant framework for the past several years has been Reinforcement Learning from Human Feedback (RLHF), which trains a separate reward model from human preference data and then uses reinforcement learning to optimize the policy against that reward. It works. It also involves training three separate models, is unstable, and is expensive. In 2023, Direct Preference Optimization (DPO) showed that you could skip the reward model entirely and optimize the policy directly on preference data, using a simple classification loss. This was not just a simplification — it was a conceptual shift in how we think about alignment.

ℹ What this article covers
We start with why alignment is necessary, build RLHF from first principles (reward modeling, the Bradley-Terry model, PPO optimization, KL penalties), then derive DPO step by step from the same foundations. We cover the beta parameter, reference policies, practical tradeoffs, and the growing family of DPO variants (IPO, KTO, ORPO, SimPO). Four interactive visualizations let you explore these concepts hands-on.

The Alignment Gap

What pretraining gives you

Pretraining via next-token prediction on trillions of tokens produces a model with remarkable capabilities: factual knowledge, reasoning, code generation, multilingual fluency, and a deep understanding of linguistic structure. But the training objective — minimize cross-entropy loss on the next token — is fundamentally about imitation, not assistance.

A pretrained model is a distribution over text continuations. Given "The capital of France is", it assigns high probability to "Paris" — but also to "not well known to many" (from a quiz context), "a topic for another day" (from a blog post), or "what we'll discuss next" (from a lecture transcript). It faithfully represents all the patterns in its training data, including toxic content, misinformation, and confidential information. It has no concept of "the user is asking me a question and I should answer it directly."

What users actually want

Users want a model that: follows instructions accurately, admits uncertainty rather than hallucinating, refuses genuinely harmful requests, provides balanced perspectives on controversial topics, and maintains a consistent, helpful persona. None of these behaviors emerge naturally from next-token prediction. They must be taught through post-training.

The standard post-training pipeline has two stages. Supervised Fine-Tuning (SFT) teaches the model the format of helpful responses — that it should answer questions, follow instructions, and produce structured output. This gets you most of the way there. But SFT alone tends to produce models that are helpful but uncalibrated: they'll generate confident-sounding but incorrect answers, follow harmful instructions, or produce outputs that are technically correct but not what the user wanted. The second stage — alignment via human preferences — teaches the model which of its possible outputs humans actually prefer.

💡 The core tension
Alignment is fundamentally a constrained optimization: make the model more helpful and safe without making it less capable. Too little alignment and the model is dangerous. Too much and you get a model that refuses everything or gives vacuous, hedge-filled responses. The KL divergence penalty in both RLHF and DPO is the mathematical expression of this constraint — it bounds how far the aligned model can drift from its capable pretrained state.

RLHF from First Principles

Reinforcement Learning from Human Feedback was popularized by the InstructGPT paper (Ouyang et al., 2022) and became the standard alignment technique for GPT-3.5, GPT-4, Claude, and other frontier models. The basic idea is elegant: since we can't write down a mathematical formula for "what makes a good response," we instead train a model to predict what humans would prefer, then use that prediction as a reward signal for reinforcement learning.

RLHF Pipeline Interactive

Click each stage to see details. The animation shows data flow through the three-phase RLHF pipeline.


RLHF proceeds in three stages, each building on the last:

Stage 1: Reward model training

The reward model is a neural network (typically initialized from the same pretrained model) that takes a prompt-response pair and outputs a scalar score indicating quality. Training data comes from human annotators who compare pairs of model responses to the same prompt and indicate which they prefer.

For each prompt x, the SFT model generates multiple candidate responses. Human annotators see pairs of responses and choose the one they prefer. This produces a dataset of preference tuples: (x, yw, yl) where yw is the preferred (winning) response and yl is the rejected (losing) response.

The Bradley-Terry model

The preference data is modeled using the Bradley-Terry model, a classical framework for paired-comparison data from the early 1950s (the same family of models underpins chess rating systems). The model assumes that the probability of response yw being preferred over yl depends only on their respective "strengths" (reward scores):

P(yw ≻ yl | x) = σ(r(x, yw) - r(x, yl))

where σ is the sigmoid function and r(x, y) is the reward model's score. The reward model is trained to maximize the log-likelihood of the observed preferences:

LRM = -E(x, yw, yl) [ log σ(r(x, yw) - r(x, yl)) ]

This is simply binary cross-entropy: the reward model learns to assign higher scores to preferred responses. The key insight is that only the difference in rewards matters — the absolute scale is arbitrary, which is why we can add any constant to all rewards without changing the preference ordering.
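As a concrete check, the Bradley-Terry probability and its loss are a few lines of Python. This toy sketch uses made-up scalar rewards (no model involved) and verifies the shift-invariance noted above:

```python
import math

def bt_preference_prob(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) = sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

def reward_model_loss(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of one observed preference (binary cross-entropy)."""
    return -math.log(bt_preference_prob(r_w, r_l))

# Only the difference matters: shifting both rewards by a constant
# leaves the preference probability unchanged.
p1 = bt_preference_prob(1.2, 0.4)
p2 = bt_preference_prob(101.2, 100.4)  # same difference, same probability
```

A larger reward gap makes the observed preference more likely and the loss smaller, which is exactly the binary cross-entropy behavior described above.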

Stage 2: PPO optimization

With a trained reward model in hand, we use Proximal Policy Optimization (PPO) to fine-tune the language model (the "policy") to maximize the predicted reward while staying close to the original SFT model. The optimization objective is:

maxπ Ex~D, y~π(y|x) [ r(x, y) ] - β DKL(π || πref)

The first term says "generate responses that the reward model rates highly." The second term — the KL divergence penalty — says "but don't deviate too far from the reference policy (typically the SFT model)." The hyperparameter β controls the tradeoff. Without the KL penalty, the model quickly learns to exploit quirks in the reward model rather than genuinely improve.

PPO is an actor-critic RL algorithm. At each step: (1) sample a batch of prompts, (2) generate responses from the current policy, (3) score them with the reward model, (4) compute the advantage using a value function, (5) update the policy using the clipped surrogate objective. The clipping mechanism prevents catastrophically large policy updates — updates that would move the policy ratio outside [1-ε, 1+ε] are clipped.

LPPO(θ) = Et [ min(rt(θ) At, clip(rt(θ), 1-ε, 1+ε) At) ]

Here rt(θ) is the ratio of the new policy's probability to the old policy's probability for the generated token, and At is the advantage estimate. In practice, this requires maintaining four models simultaneously: the policy being trained, the reference policy (frozen), the reward model (frozen), and the value function (trained alongside the policy).
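The clipping logic can be sketched for a single token with scalar values. This is illustrative only; real implementations operate on batched tensors, with advantages estimated by the value function:

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO objective: take the more pessimistic of the
    unclipped and clipped policy-ratio terms."""
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(unclipped, clipped)

# A large policy update (ratio 1.5) on a positive-advantage token is
# capped at 1.2 * A; a large downward move on a negative-advantage
# token is similarly bounded. Either way the step size is limited.
```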

Problems with RLHF

RLHF works, but it has significant practical challenges:

Reward hacking. The policy learns to exploit the reward model rather than genuinely improve. The reward model is an imperfect proxy for human preferences — it has systematic biases, and the RL process will find and exploit them. Longer responses often score higher regardless of quality. Confident-sounding responses score higher even when wrong. Specific phrases or formatting patterns that correlate with high reward in the training data get amplified. The KL penalty mitigates this but doesn't eliminate it.

Mode collapse. The policy converges to a narrow range of "safe" outputs, losing the diversity and creativity of the base model. When the reward model penalizes certain response types, the policy learns to avoid entire categories of output, even when they would be appropriate.

Training instability. PPO with language models is notoriously difficult to stabilize. Small hyperparameter changes can cause training to diverge. The interaction between four models (policy, reference, reward, value) creates complex dynamics. Practitioners report spending weeks tuning learning rates, KL coefficients, batch sizes, and clipping parameters.

Computational cost. Training requires four full-sized models in memory simultaneously. For a 70B parameter model, this means managing 280B parameters total, plus the compute for generating responses, scoring them, and running PPO updates. This is 2–4× more expensive than supervised fine-tuning.

∑ The RLHF cost equation
Memory ≈ 4 × model_size (policy + reference + reward + value)
Compute per step ≈ generation + reward_scoring + PPO_update
Wall-clock time ≈ 5–10× SFT for equivalent steps

Direct Preference Optimization

DPO (Rafailov et al., 2023) asks a simple but powerful question: if the goal of RLHF is to optimize preferences subject to a KL constraint, can we do that without ever training a reward model or running RL? The answer is yes, and the derivation is surprisingly clean.

The key insight: the implicit reward model

The optimal solution to the KL-constrained reward maximization problem has a closed-form. For the objective:

maxπ Ex, y [ r(x, y) ] - β DKL(π || πref)

the optimal policy is:

π*(y|x) = (1/Z(x)) πref(y|x) exp(r(x,y) / β)

where Z(x) is a partition function that normalizes the distribution. This is a Boltzmann distribution — responses with higher reward get exponentially upweighted relative to the reference policy.
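For intuition, the closed form can be evaluated exactly on a toy discrete response set. Everything here is made up (three candidate responses, hypothetical rewards and reference probabilities); the point is how β trades off reward maximization against staying near the reference:

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized by Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    Z = sum(weights)  # the partition function
    return [w / Z for w in weights]

ref = [0.5, 0.3, 0.2]      # hypothetical reference probabilities
rewards = [0.0, 1.0, 2.0]  # hypothetical reward scores

sharp = optimal_policy(ref, rewards, beta=0.5)   # low beta: reward dominates
soft = optimal_policy(ref, rewards, beta=50.0)   # high beta: stays near ref
```

With β = 0.5, the highest-reward response jumps from 20% to roughly 80% probability; with β = 50, the optimal policy is nearly indistinguishable from the reference.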

The crucial step: we can solve for the reward in terms of the policy:

r(x, y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)

This says the reward is implicitly defined by the log-ratio of the optimal policy to the reference policy. The partition function Z(x) depends only on the prompt, not the response. When we substitute this into the Bradley-Terry preference model, the partition functions cancel:

Loss function derivation

Substituting the implicit reward into the Bradley-Terry model:

P(yw ≻ yl) = σ(r(x, yw) - r(x, yl))
= σ(β log(π(yw|x) / πref(yw|x)) - β log(π(yl|x) / πref(yl|x)))

The β log Z(x) terms cancel because they appear in both the winning and losing reward. The DPO loss is the negative log-likelihood of this preference probability:

LDPO(θ) = -E(x, yw, yl) [ log σ(β (log(πθ(yw|x) / πref(yw|x)) - log(πθ(yl|x) / πref(yl|x)))) ]

This is a remarkably simple loss function. It operates directly on the log-probabilities that any language model already computes. No reward model, no value function, no RL loop. Just a classification-style loss applied to preference pairs.

To understand the gradient dynamics: the loss decreases when the model increases the probability of preferred responses relative to the reference and decreases the probability of rejected responses relative to the reference. The implicit reward margin between preferred and rejected outputs grows.

DPO Loss Landscape Interactive

Hover over the surface to explore how DPO loss varies with the log-probability ratios for preferred (y-axis) and rejected (x-axis) responses. Lower loss is darker.


The beta parameter

The hyperparameter β controls how conservative the optimization is. It appears in both the RLHF objective (as the KL penalty weight) and the DPO loss (scaling the log-ratios), and it plays the same role: governing the tradeoff between reward maximization and staying close to the reference policy.

High β (e.g., 0.5): The model is strongly penalized for deviating from the reference policy. Optimization is conservative — the model makes small, cautious adjustments to its preferences. This reduces the risk of reward hacking and mode collapse but may also limit how much the model improves. The sigmoid saturates once the log-ratio margin exceeds roughly 1/β, so the learning signal fades after only small deviations from the reference.

Low β (e.g., 0.01): The model is free to deviate significantly from the reference policy. Optimization is aggressive — the model can make large jumps in behavior. This risks overfitting to the preference data, reward hacking, and catastrophic forgetting of general capabilities. The loss surface has steep gradients and training can be unstable.

In practice, β values between 0.1 and 0.5 work well for most applications. The original DPO paper used β = 0.1 as the default. Models with strong SFT starting points can tolerate lower β because the reference policy is already close to the desired behavior.
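The effect of β shows up directly in the gradient of the DPO loss: differentiating -log σ(β·margin) with respect to the margin gives a per-example weight of β·σ(-β·margin). A small sketch of that weight (derived from the loss above, not taken from any reference implementation):

```python
import math

def dpo_grad_weight(margin: float, beta: float) -> float:
    """|d/d(margin) of -log sigmoid(beta * margin)|
    = beta * sigmoid(-beta * margin)."""
    return beta / (1.0 + math.exp(beta * margin))

# At margin 0 the signal scales with beta; with high beta it then
# saturates much sooner as the margin grows.
w_low_start = dpo_grad_weight(0.0, beta=0.01)   # 0.005
w_high_start = dpo_grad_weight(0.0, beta=0.5)   # 0.25
w_high_late = dpo_grad_weight(20.0, beta=0.5)   # near zero: example "solved"
```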

Beta Parameter Effect Interactive

Adjust β to see how it changes the implicit reward landscape and the strength of the preference signal. Higher β compresses the reward range; lower β amplifies it.


Reference policy and KL divergence

The reference policy πref is typically the SFT model — the supervised fine-tuned model before any preference optimization. It serves as an anchor: the aligned model should not stray too far from this known-good starting point.

In the DPO loss, the reference policy appears through the log-ratios log(πθ(y|x) / πref(y|x)). When the trained policy assigns a response the same probability as the reference, the log-ratio is zero and contributes nothing to the implicit reward. The model only "earns" implicit reward by deviating from the reference — and the DPO loss ensures it deviates in the direction of preferred outputs and away from rejected ones.

A subtle but important point: the reference policy must be frozen during training. If you update it (e.g., using the current policy as the reference), you change the optimization landscape in ways that can cause instability. Some recent work explores iterative DPO with periodic reference updates, but this requires careful handling.

DPO vs RLHF: When to Use Which

DPO and RLHF optimize the same theoretical objective — KL-constrained reward maximization under the Bradley-Terry preference model. They differ in implementation, and these differences have practical consequences.

DPO Advantages
  • No reward model to train or maintain
  • No RL loop — standard supervised training
  • 2 models in memory (policy + reference) vs 4
  • Stable gradients — no PPO clipping needed
  • Reproducible: same data = same result
  • Simple to implement (<50 lines of loss code)
RLHF Advantages
  • Reward model can generalize beyond training pairs
  • Can explore: generate new responses during training
  • Online learning: iteratively improve from fresh data
  • Better for very large distributional shifts
  • Explicit reward scores aid debugging and monitoring
  • Proven at the largest scales (GPT-4, Claude)

The most important distinction is online vs offline. RLHF is an online algorithm: it generates new responses during training and scores them, allowing the model to explore and discover behaviors not present in the training data. DPO is offline: it operates only on the fixed dataset of preference pairs collected before training. This means DPO can only learn from responses that were generated by the SFT model (or whatever model produced the preference data). If the optimal behavior requires responses very different from anything in the dataset, DPO may struggle.

In practice, DPO has proven remarkably effective for most alignment tasks, often matching or exceeding RLHF performance when given sufficient high-quality preference data. The simplicity advantage is enormous for smaller teams: you can implement DPO training in an afternoon; a production RLHF pipeline takes weeks.

The frontier labs (OpenAI, Anthropic, Google DeepMind) still use RLHF variants for their largest models, often in combination with DPO-like methods. The trend is toward hybrid approaches: use DPO for the initial alignment pass, then use online methods for further refinement.

Modern Variants

DPO opened the floodgates for a family of preference optimization methods, each addressing specific limitations. Here are the most significant:

IPO: Identity Preference Optimization

Azar et al. (2023) identified a problem with DPO: when preferences in the data are nearly deterministic (one response clearly wins every comparison), the Bradley-Terry model implies an unbounded optimal reward margin. DPO therefore keeps pushing the log-ratio margin between chosen and rejected responses toward infinity; the sigmoid saturates along the way, so the KL regularization loses its grip and the policy can overfit the preference data.

IPO replaces the log-sigmoid loss with a squared loss on the preference margin:

LIPO = E(x, yw, yl) [ (log(π(yw|x) / πref(yw|x)) - log(π(yl|x) / πref(yl|x)) - 1/(2β))² ]

This penalizes the model for making the margin too large as well as too small, preventing overfitting and maintaining bounded gradients throughout training.
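A per-pair sketch of the IPO loss, using the target margin 1/(2β) and scalar log-probabilities for illustration (real implementations work on batched tensors):

```python
def ipo_loss(pi_logp_w: float, ref_logp_w: float,
             pi_logp_l: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Squared loss pulling the log-ratio margin toward a finite
    target, 1/(2*beta), rather than toward infinity."""
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    target = 1.0 / (2.0 * beta)
    return (margin - target) ** 2
```

Unlike the DPO loss, this is minimized at a finite margin, so an already-large margin is actively pulled back rather than merely yielding a vanishing gradient.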

KTO: Kahneman-Tversky Optimization

Ethayarajh et al. (2024) observed that collecting paired preferences (comparing two responses to the same prompt) is expensive and unnatural. Much easier to collect: binary feedback on individual responses — thumbs up or thumbs down. KTO is designed for this unpaired setting.

Inspired by prospect theory from behavioral economics, KTO applies different loss functions to desirable and undesirable outputs, with the loss for undesirable outputs weighted more heavily (reflecting loss aversion). The key advantage is that you don't need preference pairs — you just need a label for each response indicating whether it's good or bad.
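A heavily simplified sketch of the idea (not the paper's exact loss, which estimates the reference point z_ref from a batch-level KL term; the λ weights and default values here are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_style_loss(log_ratio: float, desirable: bool, beta: float = 0.1,
                   z_ref: float = 0.0, lam_d: float = 1.0,
                   lam_u: float = 1.5) -> float:
    """Loss for one unpaired example. Desirable outputs are rewarded for
    an implicit reward above the reference point z_ref; undesirable ones
    for falling below it, weighted more heavily (loss aversion)."""
    r = beta * log_ratio  # implicit reward, as in DPO
    if desirable:
        return lam_d * (1.0 - sigmoid(r - z_ref))
    return lam_u * (1.0 - sigmoid(z_ref - r))
```

Note that each example needs only a thumbs-up/thumbs-down label, not a paired comparison.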

ORPO and SimPO

ORPO (Hong et al., 2024) combines SFT and preference optimization into a single training stage. Instead of first doing SFT and then DPO, ORPO adds a preference-aware penalty to the standard language modeling loss. This eliminates the need for a separate reference model entirely, since the SFT model is being trained simultaneously.

SimPO (Meng et al., 2024) simplifies DPO further by using the average log-probability of a response (rather than the total log-probability) as the implicit reward. This length-normalized reward avoids the bias toward longer responses that can plague standard DPO, and it removes the need for a reference model. The loss becomes:

LSimPO = -log σ(β (avg_log_p(yw) - avg_log_p(yl)) - γ)

where γ is a target margin that ensures a minimum gap between preferred and rejected responses.
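A scalar sketch of the SimPO loss; the β and γ defaults below are illustrative placeholders (SimPO's β acts on average log-probabilities, which have a much smaller scale than summed ones, so useful values differ from DPO's):

```python
import math

def simpo_loss(avg_logp_w: float, avg_logp_l: float,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """Length-normalized, reference-free preference loss with a
    target margin gamma between chosen and rejected responses."""
    margin = beta * (avg_logp_w - avg_logp_l) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid
```

No reference model appears anywhere: the implicit reward is just the policy's own average log-probability.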

Constitutional AI and Self-Alignment

Anthropic's Constitutional AI (CAI) (Bai et al., 2022) introduced a different approach to collecting preference data: instead of using human annotators, use the model itself. The process has two phases:

Phase 1: Self-critique and revision. Given a harmful prompt, the model generates a response. Then it is asked to critique its own response according to a set of principles (the "constitution") and generate a revised response. This produces (original, revised) pairs where the revised response is generally preferred.

Phase 2: RLAIF (RL from AI Feedback). The model is used to label preference pairs according to the constitutional principles. These AI-generated labels replace human labels in the standard RLHF pipeline. This dramatically reduces the need for human annotation, which is expensive, slow, and can itself introduce biases.

The constitution is a set of natural-language principles: "Choose the response that is most helpful while being harmless," "Choose the response that is least likely to encourage illegal activity," etc. Different principles can be weighted differently, allowing fine-grained control over the model's values.

CAI can be combined with DPO instead of RLHF — generate AI preference labels, then train with the DPO loss. This is sometimes called DPO with AI feedback or self-play preference optimization.

Practical Alignment

The theoretical elegance of DPO is appealing, but the quality of alignment depends overwhelmingly on the quality of the preference data. Here are the practical considerations that matter most:

Preference data construction. The standard approach is: (1) collect a diverse set of prompts spanning the target use cases, (2) generate multiple responses per prompt using the SFT model (sampling with temperature), (3) have human annotators rank or compare the responses. The prompts must cover the full distribution of expected inputs — including edge cases, adversarial inputs, and multilingual content.

Annotator quality and disagreement. Human preferences are noisy and subjective. Two annotators will disagree 20–30% of the time on which response is better. This noise is fundamental, not a bug — the Bradley-Terry model explicitly accounts for it through the sigmoid function. However, systematic annotator biases (preferring longer responses, preferring specific writing styles) can distort the learned policy.

Preference Learning Dynamics Interactive

Select a preference pair to see how DPO adjusts the model's log-probabilities. Watch the implicit reward margin grow as training progresses.

Chosen (y_w)
The capital of France is Paris, which has served as the country's capital since the late 10th century.
Rejected (y_l)
France's capital is probably Lyon or Paris, I'm not entirely sure but I think it might be Paris.

Data diversity matters more than volume. A dataset of 10,000 high-quality, diverse preference pairs typically outperforms 100,000 low-quality pairs. The most impactful examples are those where the model's current behavior is wrong — where the SFT model already generates good responses, preference training has little to teach.

Iterative alignment. The best results come from iterating: train with DPO, deploy, collect new preference data on the improved model's outputs, retrain. Each iteration addresses the current model's specific failure modes rather than the SFT model's. This is sometimes called "online DPO" or "iterative DPO" and partially closes the gap with online RLHF methods.

Evaluation. Alignment quality is notoriously difficult to measure. Automated benchmarks (MT-Bench, AlpacaEval, Arena-Hard) provide rough signals but are gameable. Human evaluation is the gold standard but is expensive and slow. The best practice is a combination: automated benchmarks for rapid iteration, human evaluation for major milestones, and red-teaming for safety.

Chain-of-Thought & Prompting Strategies

Aligned models unlock emergent capabilities through prompting. Chain-of-thought reasoning, in-context learning, and structured prompting are all capabilities that emerge from — or are significantly enhanced by — alignment training (instruction tuning, RLHF, DPO). A base model can technically do few-shot learning, but an aligned model follows instructions zero-shot, reasons through multi-step problems, and respects output format constraints. Understanding prompting strategies is therefore inseparable from understanding alignment.

The Prompting Hierarchy
Prompting strategies form a rough hierarchy of increasing complexity and cost: zero-shot (cheapest, single pass) → few-shot / ICL (moderate, longer context) → chain-of-thought (longer output) → self-consistency (multiple samples) → tree-of-thought (branching search). Each level trades compute for accuracy. Choose the simplest strategy that meets your quality bar.

Zero-shot & Few-shot Prompting

Zero-shot prompting gives the model a task description with no examples. This works because instruction-tuned models already understand common task formats from their alignment training. A prompt like "Translate the following English text to French: ..." requires no exemplars — the model knows what translation looks like.

Few-shot / in-context learning (ICL) provides 2–5 examples in the prompt. The model infers the pattern and applies it to a new input — with no gradient updates. This is pure inference-time learning. Few-shot ICL was first demonstrated at scale by GPT-3 (Brown et al., 2020) and remains one of the most remarkable properties of large language models.

The mystery: why does ICL work? The model was never explicitly trained to learn from in-context examples. Current theories include: task vectors (the examples shift the model's internal representation toward a task-specific subspace), induction heads (attention heads that copy patterns from context), and implicit Bayesian inference (the model marginalizes over possible tasks given the examples).

python
# Zero-shot prompting — relies on instruction tuning
zero_shot_prompt = """Classify the sentiment of the following review as
positive, negative, or neutral.

Review: "The battery life is incredible but the screen is dim."
Sentiment:"""

# Few-shot / in-context learning — provide exemplars
few_shot_prompt = """Classify the sentiment of each review.

Review: "Absolutely love this product, best purchase ever!"
Sentiment: positive

Review: "Broke after two days. Total waste of money."
Sentiment: negative

Review: "It's okay, nothing special but does the job."
Sentiment: neutral

Review: "The battery life is incredible but the screen is dim."
Sentiment:"""

Chain-of-Thought (CoT)

Chain-of-thought prompting (Wei et al., 2022) dramatically improves performance on math, logic, and multi-step reasoning tasks by asking the model to show its work. There are two flavors:

  • Zero-shot CoT: Simply append "Let's think step by step" to the prompt. Surprisingly effective.
  • Few-shot CoT: Provide exemplars that include full reasoning chains before the final answer.
Why CoT works: the serial computation hypothesis.
A transformer processes each token with roughly the same amount of computation (one forward pass through all layers). When a model tries to jump directly to an answer for a complex problem, it must compress all reasoning into that fixed compute budget. Chain-of-thought forces the model to allocate serial computation to intermediate steps — each generated token effectively gives the model another forward pass to think. The reasoning chain acts as an external "scratchpad" that extends the model's effective computation depth beyond what a single forward pass allows.
python
# Zero-shot Chain-of-Thought
zero_shot_cot = """Q: If a store has 45 apples and sells 3/5 of them,
then receives a shipment of 20 more, how many apples are there?

Let's think step by step."""

# Few-shot Chain-of-Thought (with exemplar reasoning chain)
few_shot_cot = """Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 balls. How many does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 2 * 3 = 6
balls. 5 + 6 = 11. The answer is 11.

Q: If a store has 45 apples and sells 3/5 of them, then receives a
shipment of 20 more, how many apples are there?
A:"""

Advanced Prompting Strategies

Self-Consistency (Wang et al., 2022) samples multiple chain-of-thought reasoning paths from the model (using temperature > 0), then takes a majority vote on the final answer. This improves over single CoT by ~5–15% on reasoning benchmarks. The trade-off: it requires multiple forward passes (typically 5–40 samples), increasing latency and cost proportionally.
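The aggregation step itself is trivial; a sketch of the majority vote (sampling the chains and extracting each final answer is assumed to have already happened):

```python
from collections import Counter

def self_consistency_vote(final_answers):
    """Majority vote over final answers parsed from independently
    sampled chain-of-thought completions."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. five sampled chains for the apple problem might end in:
answers = ["11", "11", "12", "11", "9"]
best = self_consistency_vote(answers)  # "11"
```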

Tree-of-Thought (ToT) (Yao et al., 2023) goes further by exploring multiple reasoning branches, evaluating partial solutions, and backtracking when a path looks unpromising. The LLM serves as both generator (proposing next steps) and evaluator (scoring partial solutions). ToT is more expensive but excels at complex planning and search problems where a single chain is unlikely to find the optimal path.

Structured prompting patterns — system prompts, role assignment, output format specification (JSON mode), and constrained generation — are alignment-adjacent techniques. The model's ability to reliably follow structured formats is a direct product of instruction tuning and RLHF. Without alignment, models struggle to consistently respect format constraints.

| Strategy | Cost | When to Use | Accuracy Gain |
|---|---|---|---|
| Zero-shot | Lowest | Simple, well-defined tasks (classification, translation) | Baseline |
| Few-shot ICL | Low (longer context) | Ambiguous tasks, custom formats, rare domains | +5–20% |
| Chain-of-Thought | Medium (longer output) | Math, logic, multi-step reasoning | +10–40% |
| Self-Consistency | High (N× samples) | High-stakes reasoning where accuracy matters most | +5–15% over CoT |
| Tree-of-Thought | Highest (branching search) | Complex planning, puzzles, creative problem-solving | Task-dependent |

Code Examples

Here is a complete DPO training setup using the trl library from Hugging Face, covering data loading, model setup, and training.

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

# Load the SFT model (serves as both policy and reference)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load preference dataset
# Expected format: {"prompt": str, "chosen": str, "rejected": str}
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences")

# DPO training configuration
training_args = DPOConfig(
    output_dir="./dpo-llama-8b",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,               # KL penalty strength
    loss_type="sigmoid",     # Standard DPO loss
    max_length=1024,
    max_prompt_length=512,
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=200,
    warmup_ratio=0.1,
    gradient_checkpointing=True,
)

# Initialize the DPO trainer
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
)

# Train
trainer.train()

The DPO loss itself is remarkably simple. Here it is from scratch in PyTorch, stripped of the library abstractions:

python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log P_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log P_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log P_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log P_ref(y_l | x)
    beta: float = 0.1,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Compute the DPO loss for a batch of preference pairs.

    Returns (loss, reward_margin, accuracy). The loss increases the
    implicit reward margin between chosen and rejected responses,
    scaled by beta.
    """
    # Log-ratios: how much has the policy shifted vs reference?
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The implicit reward margin
    logits = beta * (chosen_logratios - rejected_logratios)

    # Negative log-sigmoid = binary cross-entropy with label=1
    loss = -F.logsigmoid(logits).mean()

    # Useful metrics for monitoring
    with torch.no_grad():
        chosen_rewards = beta * chosen_logratios
        rejected_rewards = beta * rejected_logratios
        reward_margin = (chosen_rewards - rejected_rewards).mean()
        accuracy = (logits > 0).float().mean()

    return loss, reward_margin, accuracy

Computing the log-probabilities for a full sequence is a matter of summing per-token log-probs over the response tokens (excluding the prompt):

python
def get_sequence_logprobs(model, input_ids, attention_mask, labels):
    """
    Compute total log-probability of a sequence under the model.
    Labels should have -100 for prompt tokens (ignored positions).
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits[:, :-1, :]  # Shift: predict next token
    labels = labels[:, 1:]              # Shift: target is next token

    # Clamp the -100 prompt labels to a valid index before gather
    # (gather raises an error on negative indices); their
    # contribution is masked out below
    mask = (labels != -100).float()
    safe_labels = labels.clamp(min=0)

    # Per-token log-probabilities of the target tokens
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)

    # Sum over response tokens only (prompt positions zeroed by the mask)
    sequence_log_probs = (token_log_probs * mask).sum(dim=-1)

    return sequence_log_probs

🧭 What comes next

With alignment techniques in hand, we've covered the full lifecycle of a language model: from tokenization and embeddings (Article 01), through attention and transformer blocks (Article 02), to training and scaling (Articles 03–04), and now post-training alignment. In Article 06: Inference & Serving, we'll explore how these models actually run in production — KV caching, speculative decoding, quantization, and the systems engineering that turns a trained model into a responsive API.

References

Seminal papers and key works referenced in this article.

  1. Ouyang et al. "Training language models to follow instructions with human feedback." NeurIPS, 2022. arXiv
  2. Rafailov et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS, 2023. arXiv
  3. Schulman et al. "Proximal Policy Optimization Algorithms." 2017. arXiv
  4. Bai et al. "Constitutional AI: Harmlessness from AI Feedback." 2022. arXiv
  5. Christiano et al. "Deep Reinforcement Learning from Human Preferences." NeurIPS, 2017. arXiv