Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, et al. (Stanford) — NeurIPS 2023

AlpacaFarm: Simulating Human Feedback

A simulation framework for methods that learn from human feedback — use API LLMs as simulated annotators to rapidly iterate on RLHF algorithms at 45x lower cost.

Prerequisites: RLHF pipeline + Reward modeling + PPO basics. That's it.

Chapter 0: The Feedback Bottleneck

You've built a language model, instruction-tuned it, and now you want to align it with human preferences through RLHF. The recipe sounds simple: collect human preference data (which response is better?), train a reward model, and optimize with PPO. But there's a massive practical bottleneck: human annotation is slow, expensive, and noisy.

Consider the numbers. InstructGPT used ~35,000 human comparisons for its reward model. At a typical cost of $0.50-$2.00 per comparison (depending on quality requirements), that's $17,500-$70,000 just for the preference data — for a single experiment. If you want to iterate on your RLHF algorithm (try different reward model architectures, different PPO hyperparameters, different data mixtures), each iteration requires a fresh round of annotation.

The research iteration problem: Academic labs can't afford to run 10 RLHF experiments at $50K per experiment. This means most RLHF research is done by well-funded companies (OpenAI, Anthropic, Google), creating a knowledge gap: the people who can afford to experiment with RLHF are the ones least likely to publish their findings. AlpacaFarm breaks this bottleneck by making RLHF research affordable.

| Bottleneck | Impact |
|---|---|
| Cost | $50K+ per experiment for human annotations |
| Time | Weeks to collect and verify human preferences |
| Noise | ~75% inter-annotator agreement (humans often disagree) |
| Scale | Can't easily scale to 100K+ comparisons |
| Reproducibility | Different annotator pools give different results |

[Interactive figure: The RLHF Bottleneck. Compares the time and cost of one RLHF iteration with human annotators versus LLM annotators.]

AlpacaFarm's solution: replace human annotators with LLM-based simulated annotators. Use GPT-4 (or other API LLMs) to generate preference comparisons that approximate human judgments. In this chapter's estimates, simulated preferences cost roughly $1 per 1,000 comparisons, against $0.50 to $5 for a single human comparison. The headline figure is a 45x cost reduction, which is conservative next to these illustrative numbers, and the simulated preferences correlate highly with actual human preferences.

The insight behind this approach is that GPT-4, being a highly capable language model trained with RLHF itself, has already internalized many human preferences. When you show GPT-4 two responses and ask which is better, it can make judgments that align surprisingly well with human annotators. It's not perfect — individual judgments disagree with humans ~34% of the time — but averaged over many comparisons, the ranking of models converges to match human rankings almost exactly.

python
# The AlpacaFarm cost calculation
# Human RLHF pipeline:
#   10,000 comparisons × $5/comparison = $50,000
#   + 2-4 weeks calendar time for annotation
#   + $5,000 for quality control and evaluation
#   Total: ~$55,000 per experiment

# AlpacaFarm pipeline:
#   10,000 comparisons × 2 (double query) = 20,000 API calls
#   × ~$0.0005/call (GPT-4 @ ~200 tokens) = $10
#   + ~20 hours compute on 4 A100s = ~$5/hr × 20 = $100
#   Total: ~$110 per experiment
#   But most cost is GPU, not API → API is just $10

# At $55K/experiment, a lab with $100K budget → 1-2 experiments
# At $110/experiment, same budget → ~900 experiments
What is the core bottleneck that AlpacaFarm addresses in RLHF research?

Chapter 1: Simulated Preferences

The key technical contribution: using API LLMs as simulated annotators that produce preference judgments mimicking human behavior. The setup is straightforward but the details matter enormously.

How simulated preferences work

Given a prompt x and two responses y1, y2, the simulated annotator receives:

prompt to GPT-4 (simulated annotator)
"Below are two responses to an instruction. Which is better?"

"Instruction: {x}"

"Response 1: {y₁}"
"Response 2: {y₂}"

"Which response is better? Output only '1' or '2'."

To reduce position bias (LLMs tend to prefer the first option), AlpacaFarm queries each pair twice, once with y1 first and once with y2 first, and keeps a preference only when both orderings pick the same response. If the two orderings disagree, the pair is treated as a tie.
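As a minimal sketch of the mechanics, the prompt template above can be filled in and the one-character reply parsed as follows. The helper names `build_prompt` and `parse_verdict` are hypothetical and not part of AlpacaFarm; the template text simply mirrors the prompt shown above.

```python
# Hypothetical helpers (not AlpacaFarm's API); the template mirrors the prompt above.
TEMPLATE = (
    "Below are two responses to an instruction. Which is better?\n\n"
    "Instruction: {instruction}\n\n"
    "Response 1: {first}\n"
    "Response 2: {second}\n\n"
    "Which response is better? Output only '1' or '2'."
)


def build_prompt(instruction: str, first: str, second: str) -> str:
    """Fill the annotator template; `first` is shown in position 1."""
    return TEMPLATE.format(instruction=instruction, first=first, second=second)


def parse_verdict(reply: str):
    """Return '1' or '2', or None if the reply is not a clean verdict."""
    cleaned = reply.strip().strip("'\"")
    return cleaned if cleaned in ("1", "2") else None
```

Returning `None` for anything other than a clean '1' or '2' lets the caller decide whether to retry or to count the pair as a tie.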

Position bias is a real and significant problem. Without debiasing, GPT-4 prefers the first response ~60% of the time regardless of quality. The double-query trick reduces this to ~52%, which is close to the ~50% you'd expect from an unbiased annotator. This same debiasing technique should be applied whenever using LLMs as evaluators.
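One way to check for position bias on your own annotator is to randomize which response is shown first and count how often position 1 wins. The sketch below assumes verdicts are the positional strings '1' and '2'; the function names are illustrative, not from the AlpacaFarm codebase.

```python
def first_position_rate(verdicts):
    """Fraction of verdicts that picked whichever response was shown first."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v == "1" for v in verdicts) / len(verdicts)


def consistent_rate(double_queries):
    """Fraction of (q1, q2) pairs where both orderings chose the same response.

    q1 is the verdict with A shown first, q2 the verdict with B shown first.
    Verdicts are positional, so the same response wins when q1 != q2.
    """
    if not double_queries:
        raise ValueError("need at least one double query")
    return sum(q1 != q2 for q1, q2 in double_queries) / len(double_queries)


# Example: 60 of 100 verdicts favour position 1
print(first_position_rate(["1"] * 60 + ["2"] * 40))  # 0.6
```

With randomized ordering, a rate well above 0.5 suggests the annotator is reacting to position rather than content.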

Calibrating simulated preferences to humans

AlpacaFarm validates the simulated preferences against 2,500 real human annotations. Agreement is measured at three levels:

| Metric | Human vs Human | Sim vs Human |
|---|---|---|
| Pairwise agreement | ~76% | ~66% |
| Ranking correlation (τ) | ~0.71 | ~0.54 |
| Win rate (Elo) correlation | ~0.85 | ~0.97 |

Individual pairwise agreement (66%) is lower than human-human agreement (76%). But the Elo correlation (0.97) is nearly perfect. This means: on any single comparison, the simulator might disagree with humans ~34% of the time. But averaged over many comparisons, the ranking of models is nearly identical. For RLHF research (where you care about which method is best, not individual annotations), this is sufficient.

Think of it like a noisy thermometer. If your thermometer is off by ±2°C on each reading, you can't trust any single reading. But if you take 100 readings and average them, the average is very close to the true temperature. Similarly, the simulated annotator's individual judgments are noisy, but the average win rate over 805 test prompts is highly reliable.
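The averaging argument can be made concrete with the standard error of a win rate. This is a rough sketch that assumes the 805 comparisons are independent, which real prompts only approximately satisfy.

```python
import math


def win_rate_standard_error(p: float, n: int) -> float:
    """Binomial standard error of a win rate p estimated from n comparisons."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return math.sqrt(p * (1.0 - p) / n)


# Worst case (p = 0.5) over the 805 test prompts
print(f"{win_rate_standard_error(0.5, 805):.4f}")  # 0.0176
```

So a win rate measured over 805 prompts carries roughly ±1.8 percentage points of sampling noise. Averaging only shrinks random noise; systematic biases such as a preference for longer answers do not average out.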

python
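# Simulated preference annotation (with position debiasing)
from typing import Callable

# A judge takes (prompt, first_response, second_response) and returns
# '1' or '2': the position of the response it prefers. In practice this
# wraps a call to GPT-4 with the prompt template shown earlier.
Judge = Callable[[str, str, str], str]

def simulated_preference(prompt: str, response_a: str, response_b: str,
                         judge: Judge) -> str:
    """Ask the judge which response is better, querying both orderings."""
    # Query 1: A shown first, B second
    q1 = judge(prompt, response_a, response_b)
    # Query 2: B shown first, A second (swap to reduce position bias)
    q2 = judge(prompt, response_b, response_a)

    # Verdicts are positional, so a consistent preference for A is
    # '1' in the first query and '2' in the swapped query.
    if q1 == '1' and q2 == '2':  # both orderings say A is better
        return 'A'
    if q1 == '2' and q2 == '1':  # both orderings say B is better
        return 'B'
    return 'tie'  # conflicting signals → uncertain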
[Interactive figure: Simulated vs Human Preferences. Scatter plot with one dot per model; the x-axis is the win rate as judged by humans and the y-axis is the win rate as judged by the simulator. Points near the diagonal indicate agreement. Reported correlation: 0.97.]

What the simulator gets wrong

The simulated annotator has systematic biases that differ from humans:

| Bias | Direction | Mitigation |
|---|---|---|
| Length bias | Prefers longer responses more than humans | Normalize by length in evaluation |
| Position bias | Prefers first response | Double query; keep only verdicts that agree across orderings |
| Style bias | Prefers formal, structured responses | Acknowledged; not fully mitigated |
| Self-preference | GPT-4 prefers GPT-4-like responses | Use diverse annotator models |

How does AlpacaFarm mitigate position bias in LLM-based preference simulation?

Chapter 2: The AlpacaFarm Pipeline

AlpacaFarm is not just simulated preferences — it's a complete research framework for RLHF, providing standardized datasets, evaluation protocols, and reference implementations of multiple alignment algorithms.

The complete pipeline

Step 1: SFT Model
Start with LLaMA 7B fine-tuned on 52K Alpaca instructions (instruction-following baseline)
Step 2: Generate responses
For 10K prompts, generate pairs of responses from the SFT model (with different random seeds)
Step 3: Simulated preferences
GPT-4 judges which response is better for each pair (~10K preferences, cost: ~$10)
Step 4: Train reward model
Train a reward model on the simulated preferences (same architecture as SFT model)
Step 5: Optimize with RLHF/DPO/etc
Apply alignment algorithm to improve the policy model using the reward model
Step 6: Evaluate
Automated evaluation on 805 test prompts + human evaluation for calibration

The per-experiment pipeline (steps 2-6) takes about 13.5 hours on 4 A100 GPUs (about 20 hours once the one-time SFT step is included) and costs about $10 in API fees. Compare this to a full human RLHF pipeline: weeks of calendar time and $50K+ in annotation costs. Cutting annotation cost by orders of magnitude is what enables rapid iteration.

Let's break down the compute time for each step. Step 1 (SFT) is done once and shared across all experiments: about 8 hours on 4 A100s. Step 2 (generate response pairs) takes ~2 hours, since you need 2 responses per prompt for 10K prompts. Step 3 (simulated preferences) takes ~1 hour of API calls, during which GPT-4 answers 20K queries (10K pairs × 2 orderings for debiasing). Step 4 (reward model training) takes ~4 hours. Step 5 (PPO/DPO) takes ~6 hours. Step 6 (evaluation) takes ~30 minutes.

python
# AlpacaFarm pipeline timing breakdown (hours on 4×A100)
pipeline = {
    "SFT baseline":          (8.0, "one-time"),
    "Generate pairs":        (2.0, "per experiment"),
    "Simulated preferences": (1.0, "per experiment"),  # plus ~$10 in API fees
    "Train reward model":    (4.0, "per experiment"),
    "Run PPO/DPO":           (6.0, "per experiment"),
    "Evaluate":              (0.5, "per experiment"),
}
per_experiment_hours = sum(
    hours for hours, scope in pipeline.values() if scope == "per experiment"
)
print(per_experiment_hours)  # 13.5
# Total per experiment (after SFT): 13.5 hours + ~$10
# Just under 2 experiments per day on one 4×A100 machine (24 / 13.5 ≈ 1.8)
# In one month: ~50 experiments, enough to explore much of
# the alignment algorithm design space
[Interactive figure: AlpacaFarm Pipeline. Steps through the six pipeline stages, showing the inputs, outputs, and cost of each.]

Reference implementations

AlpacaFarm provides standardized implementations of five alignment algorithms, all using the same base model, data, and evaluation:

| Method | Approach | Complexity |
|---|---|---|
| PPO | RLHF with proximal policy optimization | Highest (4 models: policy, ref, reward, value) |
| Best-of-n | Generate n responses, select highest reward | Low (just reward model + sampling) |
| Expert iteration | SFT on best-of-n outputs (iteratively) | Medium (reward model + SFT) |
| Quark | Quantile-based policy gradient | Medium |
| Direct reward | Binary feedback without reward model | Low |

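To illustrate how expert iteration relates to best-of-n, here is a minimal sketch of its data-collection step: sample n responses per prompt, keep the highest-scoring one, and use the resulting pairs as SFT targets. The `sample` and `score` callables are hypothetical stand-ins for the policy and the reward model; this is not AlpacaFarm's reference implementation.

```python
from typing import Callable, List, Tuple


def best_of_n_dataset(
    prompts: List[str],
    sample: Callable[[str], str],        # hypothetical: draws one response from the policy
    score: Callable[[str, str], float],  # hypothetical: reward model score for (prompt, response)
    n: int = 16,
) -> List[Tuple[str, str]]:
    """Build (prompt, best response) pairs to fine-tune on."""
    dataset = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(n)]
        best = max(candidates, key=lambda response: score(prompt, response))
        dataset.append((prompt, best))
    return dataset
```

Fine-tuning on the returned pairs, then repeating with the updated policy, gives the iterative loop the table describes.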
What is the total cost and time to run one complete AlpacaFarm RLHF experiment?

Chapter 3: RLHF vs Alternatives

With all five methods implemented under the same framework, AlpacaFarm provides the first truly apples-to-apples comparison. The results challenge some conventional wisdom.

Win rates vs SFT baseline

| Method | Win rate vs SFT (sim) | Win rate vs SFT (human) | Compute cost |
|---|---|---|---|
| PPO (RLHF) | 57.2% | 56.1% | Highest (4 models) |
| Best-of-16 | 60.1% | 59.3% | Medium (inference) |
| Expert iteration | 55.8% | 54.9% | Medium |
| Quark | 53.4% | 52.8% | Medium |
| SFT (baseline) | 50.0% | 50.0% | Lowest |

The surprising finding: Best-of-n sampling — the simplest method — outperforms PPO (RLHF). Generate 16 responses, pick the highest-scoring one according to the reward model. No policy gradient training, no value function, no clipping — just inference. This challenges the assumption that PPO's complexity is justified.

Why does best-of-n win? Three reasons:

1. No distribution shift. PPO changes the policy during optimization, which can drift away from the reward model's training distribution. Best-of-n samples from the original SFT policy, so the reward model's scores are well-calibrated. This is perhaps the biggest advantage — reward model overoptimization (Goodhart's law) is PPO's Achilles heel.

2. Exploration. With n=16 samples, best-of-n explores a much wider range of responses than PPO (which typically generates 1-2 responses per prompt during training). More samples means higher probability of finding a genuinely excellent response.

3. Simplicity. PPO involves 4 models running simultaneously (policy, reference, reward, value), careful hyperparameter tuning, and complex training dynamics. Best-of-n requires only the SFT model and the reward model, with no training at all. Fewer moving parts means fewer things can go wrong.

python
# Best-of-n sampling: surprisingly powerful
import numpy as np

def best_of_n(model, reward_model, prompt, n=16):
    """Generate n responses, return the highest-scoring one"""
    # Assumes model.generate samples stochastically, so the n responses differ
    responses = [model.generate(prompt) for _ in range(n)]
    scores = [reward_model.score(prompt, r) for r in responses]
    return responses[int(np.argmax(scores))]

# Compare with PPO:
# PPO: train for 10K steps, then generate 1 response → ~57% win rate
# BoN-16: no training, generate 16 responses, pick best → ~60% win rate
# BoN wins on quality, but costs 16x more at inference time

# The math: for Gaussian rewards with standard deviation σ, the expected
# maximum of n samples grows roughly like σ·√(2 ln n) (an upper bound)
# n=16:  √(2 ln 16)  ≈ 2.4σ above the mean reward
# n=256: √(2 ln 256) ≈ 3.3σ, only ~1.4x more for 16x the samples
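The diminishing returns from larger n can be checked numerically. The sketch below assumes reward scores are standard normal, which is an idealization rather than a property of any real reward model, and compares the simulated expected maximum with the √(2 ln n) bound.

```python
import math
import random


def mean_max_of_n(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of n standard normal draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials


for n in (1, 16, 256):
    bound = math.sqrt(2 * math.log(n))
    print(f"n={n:>3}  simulated={mean_max_of_n(n):.2f}  sqrt(2 ln n)={bound:.2f}")
```

The simulated values sit below the bound but show the same shape: most of the gain arrives in the first few doublings of n.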
The DPO alternative: After AlpacaFarm's publication, Direct Preference Optimization (DPO) emerged as a compelling alternative to both PPO and best-of-n. DPO directly optimizes the policy from preference data without training a separate reward model — eliminating the reward model overoptimization problem that plagues PPO, while having the same fixed inference cost as PPO. In hindsight, DPO fills the gap that AlpacaFarm identified: a method that combines PPO's low inference cost with best-of-n's stability.
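For reference, the per-pair DPO objective can be written in a few lines. This is a sketch of the published DPO loss for a single preference pair, taking precomputed sequence log-probabilities as inputs; the default `beta=0.1` is an illustrative choice, not a value taken from AlpacaFarm.

```python
import math


def dpo_loss(
    logp_chosen: float,        # log-prob of preferred response under the policy
    logp_rejected: float,      # log-prob of rejected response under the policy
    ref_logp_chosen: float,    # same quantities under the frozen reference model
    ref_logp_rejected: float,
    beta: float = 0.1,         # illustrative temperature (assumption)
) -> float:
    """-log sigmoid(beta * (policy log-ratio gap minus reference log-ratio gap))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2. It falls as the policy raises the preferred response's likelihood relative to the rejected one.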

The caveat: best-of-n is expensive at inference time (you need 16x more generation per query). PPO pays the cost at training time instead. For production systems serving millions of queries, PPO's lower inference cost wins. For research and quality-sensitive applications, best-of-n is the better choice.

[Interactive figure: Method Comparison. Plots the five alignment methods by win rate (quality) against cost, with per-method tradeoffs.]

Why does best-of-n sampling outperform PPO in AlpacaFarm's experiments?

Chapter 4: Evaluation

Evaluation is the hardest part of alignment research. How do you measure whether a model is "better aligned"? AlpacaFarm uses a multi-layer evaluation strategy.

Automated evaluation

For rapid iteration, AlpacaFarm uses the simulated annotator (GPT-4) to evaluate model outputs on 805 held-out test prompts. The evaluation computes pairwise win rates: for each prompt, the model's output is compared against a reference (the SFT baseline), and GPT-4 judges which is better.

Win rate = |{x : model wins}| / |{x : total comparisons}|
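A small sketch of the win-rate computation follows. The formula above does not say how ties are handled; counting a tie as half a win is an assumption made here for illustration.

```python
def win_rate(outcomes):
    """Win rate over per-prompt outcomes, each 'win', 'loss', or 'tie'.

    Ties count as half a win (an assumption, not stated in the formula).
    """
    counts = {"win": 0, "loss": 0, "tie": 0}
    for outcome in outcomes:
        if outcome not in counts:
            raise ValueError(f"unexpected outcome: {outcome!r}")
        counts[outcome] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one comparison")
    return (counts["win"] + 0.5 * counts["tie"]) / total


print(win_rate(["win", "win", "win", "loss"]))  # 0.75
```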

Human evaluation

To validate the automated evaluation, AlpacaFarm also collects human judgments on a subset. The key finding: simulated evaluation rankings match human rankings almost perfectly (Spearman correlation 0.97). This means you can trust the simulated evaluation for method comparison, even though individual judgments may differ.
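As an illustration of how such a rank correlation is computed, the sketch below implements Spearman's ρ for lists without ties and applies it to the five win rates from the Chapter 3 table. The helper names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Win rates vs SFT from the Chapter 3 table: PPO, Best-of-16, Expert iteration, Quark, SFT
simulated = [57.2, 60.1, 55.8, 53.4, 50.0]
human = [56.1, 59.3, 54.9, 52.8, 50.0]
print(spearman(simulated, human))  # 1.0
```

These five rows are ordered identically by both annotators, so they give exactly 1.0; the 0.97 quoted above is the document's figure and is not reproduced by this snippet.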

[Interactive figure: Evaluation Pipeline. Shows models being evaluated on the 805 test prompts, one comparison per prompt, with win rates computed from the results.]

What evaluation reveals about alignment methods

| Finding | Implication |
|---|---|
| All methods improve over SFT | Alignment training consistently helps, regardless of method |
| Best-of-n ≈ PPO > Expert Iter > Quark | Simpler methods can match or beat complex ones |
| Human and sim rankings agree | LLM-based evaluation is reliable for method comparison |
| Improvements plateau after ~5K preferences | More data helps, but with diminishing returns |

What is the Spearman correlation between AlpacaFarm's simulated evaluation rankings and human evaluation rankings?

Chapter 5: Cost Analysis

The economic argument for AlpacaFarm is compelling. Let's break down the numbers.

| Component | Human RLHF | AlpacaFarm | Savings |
|---|---|---|---|
| Preference data | $50,000 (10K comparisons × $5 each) | $10 (GPT-4 API calls) | 5000x |
| Calendar time | 2-4 weeks | ~20 hours | ~24x |
| Evaluation | $5,000 (human eval) | $5 (GPT-4 eval) | 1000x |
| Total per experiment | ~$55,000 + 3 weeks | ~$15 + 1 day | ~3,600x |

The research implication: At $15 per experiment, a PhD student with a $1,000 budget can run 66 RLHF experiments. At $55,000 per experiment, that same budget doesn't even cover one. This isn't just a quantitative difference — it's qualitative. It changes who can do RLHF research from "well-funded labs" to "anyone with a few GPUs."
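The budget arithmetic is simple enough to script. This sketch uses the per-experiment costs from the table above and counts only whole experiments.

```python
def experiments_within_budget(budget: float, cost_per_experiment: float) -> int:
    """Number of whole experiments a budget covers."""
    if cost_per_experiment <= 0:
        raise ValueError("cost per experiment must be positive")
    return int(budget // cost_per_experiment)


print(experiments_within_budget(1_000, 15))       # 66
print(experiments_within_budget(10_000, 15))      # 666
print(experiments_within_budget(10_000, 55_000))  # 0
```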
[Interactive figure: Cost Comparison Calculator. For a chosen research budget (default $10,000), shows how many RLHF experiments fit with human annotation versus AlpacaFarm simulation.]

When simulation isn't enough

AlpacaFarm acknowledges limitations where simulated feedback falls short:

| Scenario | Why simulation may fail |
|---|---|
| Safety-critical tasks | LLM simulators may not capture subtle safety concerns that trained human annotators catch |
| Novel/creative tasks | LLMs may have different aesthetic preferences than humans |
| Cultural sensitivity | LLMs trained on Western data may not represent diverse cultural preferences |
| Edge cases | LLMs may disagree with humans on unusual or ambiguous prompts |

The recommended approach: use AlpacaFarm for rapid prototyping and method comparison, then validate the final method with a smaller round of real human evaluation.

This "simulate then validate" paradigm has become standard in alignment research. It's analogous to how pharmaceutical companies use computer simulations to screen drug candidates before expensive clinical trials. You run hundreds of cheap simulated experiments, identify the 2-3 most promising approaches, and then invest in expensive human evaluation only for those finalists.

python
# The "simulate then validate" workflow
# Phase 1: Exploration (AlpacaFarm)
#   - Try 50 hyperparameter configurations for PPO
#   - Try 20 reward model architectures
#   - Try 10 dataset compositions
#   - Cost: 80 experiments × $15 = $1,200
#   - Time: 2-3 weeks
#   - Output: top-3 configurations

# Phase 2: Validation (Human eval)
#   - Re-evaluate top-3 configs with real human evaluation (~$5,000 each)
#   - Cost: 3 experiments × $5,000 = $15,000
#   - Time: 2-3 weeks
#   - Output: confirmed best configuration

# Total: $16,200 for a thorough exploration
# Without AlpacaFarm: would need $55K × 80 = $4.4M to explore
# AlpacaFarm reduces exploration cost by ~270×
At $15 per AlpacaFarm experiment vs $55,000 for human RLHF, how many experiments can a researcher with a $10,000 budget run with each approach?

Chapter 6: Alignment Explorer

Let's bring the full AlpacaFarm results together. This interactive explorer lets you compare all five alignment methods, adjust the number of preference samples, and see how the quality-cost tradeoff shifts.

[Interactive figure: Alignment Method Explorer. Places each alignment method in quality (win rate vs SFT) versus cost (training + inference) space, with a slider for n in best-of-n (default 16).]

The Pareto frontier

When you plot all methods in quality-cost space, a clear Pareto frontier emerges:

| If your priority is... | Best method | Why |
|---|---|---|
| Maximum quality | Best-of-64 | Highest win rate, but 64x inference cost |
| Best quality/cost | Best-of-16 or PPO | PPO amortizes cost during training; BoN-16 is simpler |
| Minimum complexity | Expert iteration | Just SFT on best-of-n outputs, no RL needed |
| Production deployment | PPO or DPO | Fixed cost per query after training |

The meta-lesson: There's no single "best" alignment method. The right choice depends on your constraints — research budget, inference budget, implementation complexity, and quality requirements. AlpacaFarm's framework lets you evaluate these tradeoffs empirically instead of relying on intuition.
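A Pareto frontier like the one described above can be computed directly once each method has a quality and a cost. In this sketch the win rates come from the Chapter 3 table, while the cost numbers are hypothetical relative units invented purely for illustration.

```python
def pareto_frontier(methods):
    """Return names of methods that no other method dominates.

    methods maps name -> (quality, cost). A method is dominated if another
    is at least as good on both axes and strictly better on one.
    """
    frontier = []
    for name, (quality, cost) in methods.items():
        dominated = any(
            q2 >= quality and c2 <= cost and (q2 > quality or c2 < cost)
            for other, (q2, c2) in methods.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (simulated win rate vs SFT in %, hypothetical relative cost)
methods = {
    "SFT": (50.0, 1.0),
    "Quark": (53.4, 3.0),
    "Expert iteration": (55.8, 3.0),
    "PPO": (57.2, 5.0),
    "Best-of-16": (60.1, 16.0),
}
print(pareto_frontier(methods))
# ['SFT', 'Expert iteration', 'PPO', 'Best-of-16']
```

Under these made-up costs Quark drops off the frontier because expert iteration is better at the same cost. Different cost assumptions, for example separating training from inference cost, would change which methods survive.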
What is the key tradeoff between PPO and best-of-n alignment methods?

Chapter 7: Connections

What AlpacaFarm builds on

| Foundation | Contribution |
|---|---|
| Alpaca (Taori 2023) | Self-instruct: generate training data with GPT-3.5, fine-tune LLaMA |
| RLHF (Ouyang 2022) | The full pipeline (reward model + PPO) that AlpacaFarm makes accessible |
| DPO (Rafailov 2023) | Alternative alignment method compared in AlpacaFarm |
| Constitutional AI (Bai 2022) | Using AI feedback (RLAIF); AlpacaFarm extends this to preference simulation |

What came after

| Successor | Advance |
|---|---|
| Camels (Tülu) | Extends the open comparison to instruction-tuning datasets |
| LMSys Chatbot Arena | Crowdsourced human evaluation at scale; validates LLM-as-judge |
| Llama 3 | Production-scale iterative RLHF/DPO; applies lessons from AlpacaFarm |
| Starling | Extends AlpacaFarm's approach to build a better reward model |

The broader significance of AlpacaFarm extends beyond RLHF research. It established a precedent for using LLMs as evaluators and annotators — a practice that has become ubiquitous in AI research. Today, LLM-as-judge is used not just for preference annotation but for code review, essay grading, safety evaluation, and many other tasks where human annotation is expensive. AlpacaFarm was one of the first papers to rigorously validate this approach.

The paper also contributed to the growing understanding that alignment is not a single technique but a design space. There's no single "best" alignment method — the right choice depends on your constraints. PPO is best for high-volume production systems. Best-of-n is best for quality-sensitive applications. DPO (which emerged shortly after) is best for simplicity and stability. AlpacaFarm's standardized comparison framework made it possible to reason about these tradeoffs quantitatively rather than anecdotally.

AlpacaFarm's lasting impact:
1. LLM-as-judge — established that LLMs can reliably replace human evaluators for method comparison (0.97 correlation).
2. Democratized RLHF research — made alignment research accessible to academic labs.
3. Best-of-n as a strong baseline — showed that the simplest method is often competitive, raising the bar for new methods.
4. Standardized comparison — provided the first fair apples-to-apples comparison of alignment algorithms.

"AlpacaFarm provides a realistic simulation of the RLHF process... enabling the research community to iterate more quickly on alignment methods."
— Dubois et al., 2023

What is AlpacaFarm's most important contribution to the field of AI alignment?