Saad-Falcon, Buchanan, Chen et al. — Stanford, 2025

Weaver: Weak Verifiers, Strong Selection

Combine multiple imperfect verifiers into one strong verifier for test-time repeated sampling, with no labels and no fine-tuning. Llama 3.3 70B + Weaver reaches 87.7% average accuracy, on par with o3-mini (86.7%).

Prerequisites: Language model inference + Basic probability
10 Chapters · 4+ Simulations

Chapter 0: The Problem

You have a language model that can solve math problems. Not every time — maybe 70% of the time. But if you ask it the same question 100 times with some randomness (non-zero temperature), at least one of those 100 answers is almost certainly correct. For Llama 3.3 70B on MATH500, Pass@100 — the probability that at least one of 100 samples is correct — is 98.6%.
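To make the Pass@K idea concrete, here is a minimal sketch under a simplifying assumption: every sample for a given query is an independent draw with the same per-sample success probability. The probabilities below are illustrative, not taken from the paper, and the benchmark-level 98.6% averages over queries of very different difficulty, so this toy formula is not expected to reproduce it.

```python
def pass_at_k(p_correct: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


# Hypothetical per-sample success rates for three queries (assumed values).
for p in (0.7, 0.1, 0.02):
    curve = [round(pass_at_k(p, k), 3) for k in (1, 10, 100)]
    print(f"p={p}: Pass@1, Pass@10, Pass@100 = {curve}")
```

Even a query the model solves only 2% of the time is very likely to have a correct sample somewhere among 100 attempts, which is why the hard part shifts to selection.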

So the generation side is solved. The model can produce a correct answer. The problem is: which one?

You have 100 candidate responses. Some are correct, the rest are wrong, and nothing marks which is which. How do you pick a right one? This is the verification problem: scoring and ranking responses to select the best candidate.

The core tension: Generation quality scales beautifully with more samples. Verification quality does not. Majority voting plateaus quickly. Single reward models have high false-positive rates. LM judges are biased and noisy. The model generates the answer — but we fail to find it. This is the generation-verification gap.

If you had a perfect verifier — an oracle that always identifies the correct response — you could close this gap entirely. But perfect verifiers only exist for narrow domains (Lean for formal proofs, unit tests for code). For open-ended math, science, and reasoning, all we have are weak verifiers: reward models and LM judges that are noisy, biased, and inconsistent.

Weaver's question: can you combine many weak verifiers into one strong verifier? Not by averaging (that assumes they're equally good, and they're not). Not by supervised learning (that requires labels you don't have). But through weak supervision — estimating each verifier's accuracy from the data alone, then weighting them accordingly.

The Selection Bottleneck

[Interactive simulation: Pass@K (the chance a correct answer exists) and the success rate of a weak verifier, plotted against the number of samples K. Pass@K soars as K grows while the weak verifier's success rate barely moves; the gap between them is the generation-verification gap.]

Why does generating more candidate responses NOT automatically improve final accuracy?

Chapter 1: The Key Insight

Individual verifiers are weak. A single reward model might have 55% accuracy at picking the correct response from a pair. A single LM judge might score 60%. Neither is reliable enough to justify generating 100 candidates — you'd pick wrong almost half the time.

But here's the thing: they're wrong in different ways. A reward model trained on preference data might be biased toward longer responses. An LM judge might struggle with formal math but excel at reasoning chains. A different reward model might be great at code but poor at science. Their errors are correlated with different features of the problem.

The Weaver insight: If verifiers make independent errors, combining them amplifies signal and cancels noise — the same principle behind ensemble learning. But you can't just average them (they have different accuracies) and you can't supervise them (you don't have labels). Weaver uses weak supervision to estimate each verifier's accuracy from agreement patterns alone, then weights them accordingly. No ground truth needed.

Step 1: Generate
Sample K=100 responses from an LM (e.g., Llama 3.3 70B) at non-zero temperature. Many will be wrong, but at least one is likely correct (Pass@100 is 98.6% on MATH500).
Step 2: Score
Run every response through M verifiers (33 reward models + LM judges in the paper). Each verifier outputs a score: continuous for RMs, binary for judges.
Step 3: Normalize & Filter
Binarize all scores (correct/incorrect). Discard verifiers whose accuracy, estimated from dataset statistics, is worse than random.
Step 4: Estimate Weights
Use a Dawid-Skene-style latent variable model to estimate each verifier's accuracy WITHOUT labels. Verifiers that agree with accurate verifiers get higher weight.
Step 5: Select
Combine weighted verifier scores via Bayesian posterior. Pick the response with the highest P(correct | all verifier votes). Done.

The result: Llama 3.3 70B Instruct with Weaver verification achieves 87.7% average accuracy across MATH500, GPQA Diamond, MMLU College, and MMLU Pro. That matches o3-mini (86.7%) — a model that required extensive fine-tuning and post-training. Weaver gets there with zero parameter updates, using only off-the-shelf verifiers.

What makes Weaver different from simply averaging all verifier scores?

Chapter 2: The Generation-Verification Gap

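Let's quantify the problem. For a given task and model, we can measure two things: Pass@K, the probability that at least one of K samples is correct, and the success rate, the probability that the response the verifier selects is correct. Their difference is the gap: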

Generation-Verification Gap = Pass@K − Success Rate

A large gap means the model generates correct answers but the verifier fails to find them. An oracle verifier (perfect accuracy) closes the gap to zero.

How bad is the gap in practice? Consider Llama 3.3 70B on GPQA Diamond with K=100:

| Method | Success Rate | Gap to Pass@100 |
| --- | --- | --- |
| Pass@1 (no verification) | 42.9% | 38.1% |
| Majority Voting | 47.4% | 33.6% |
| Best Single RM | 49.7% | 31.3% |
| Naive Ensemble (top-10 RMs) | 41.3% | 39.7% |
| Multi-Agent Verification | 47.8% | 33.2% |
| Weaver | 72.1% | 8.9% |
| Oracle (Pass@100) | 81.0% | 0% |

The striking result: On GPQA Diamond, majority voting barely improves over Pass@1 (47.4% vs. 42.9%). The naive ensemble of top-10 reward models actually hurts performance (41.3%). But Weaver jumps to 72.1%, closing about 77% of the gap between Pass@1 and the oracle (29.2 of 38.1 points). Verification strategy matters enormously.
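
As a quick check on the gap column, the definition can be applied directly to the GPQA Diamond success rates listed in the table. This is only a small arithmetic sketch; the function and variable names are illustrative.

```python
def generation_verification_gap(pass_at_k: float, success_rate: float) -> float:
    """Gap, in percentage points, between what was generated and what was selected."""
    return pass_at_k - success_rate


PASS_AT_100 = 81.0  # GPQA Diamond oracle (Pass@100), from the table

SUCCESS_RATES = {
    "Pass@1 (no verification)": 42.9,
    "Majority Voting": 47.4,
    "Best Single RM": 49.7,
    "Naive Ensemble (top-10 RMs)": 41.3,
    "Multi-Agent Verification": 47.8,
    "Weaver": 72.1,
}

gaps = {
    method: round(generation_verification_gap(PASS_AT_100, rate), 1)
    for method, rate in SUCCESS_RATES.items()
}
for method, gap in gaps.items():
    print(f"{method:<28} gap = {gap:4.1f} points")
```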

Why does naive ensembling fail? Because it treats all verifiers equally. If 7 of your 10 reward models are bad at GPQA but okay at MATH, the naive average is dominated by noise on GPQA. Weighting is essential.

Why does majority voting plateau? Because it ignores verifier scores entirely — it just picks the most common final answer. Once you have ~8-16 samples, the distribution of final answers stabilizes and more samples don't help. Verification, on the other hand, can keep scaling by using richer signals from the verifier scores.
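
For reference, majority voting over final answers fits in a few lines. This sketch assumes the final answer has already been extracted from each response as a string; the sample answers are made up.

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most frequent final answer; verifier scores are never consulted."""
    counts = Counter(final_answers)
    answer, _count = counts.most_common(1)[0]
    return answer


# Hypothetical extracted answers for six sampled responses to one query.
samples = ["42", "41", "42", "7", "42", "41"]
print(majority_vote(samples))
```

Because the rule depends only on the empirical distribution of final answers, it can never recover a correct answer that the model produces less often than some wrong one, no matter how many samples are drawn.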

Why can a naive ensemble of reward models sometimes perform WORSE than a single good verifier?

Chapter 3: Weighted Ensembles

If naive averaging fails because it ignores accuracy differences, the natural fix is: learn weights. Give more weight to verifiers that are more accurate. The paper tests this with oracle data first (to see the ceiling), then without labels (Weaver).

Oracle Weighted Ensembles

Given ground-truth labels for all query-response pairs, you can train a logistic regression or Naive Bayes classifier to estimate P(correct | all verifier scores). This is an "oracle" approach because it uses labels you wouldn't have in practice.

The result is dramatic: weighted ensembles outperform naive ensembles by up to 11.2 percentage points. On GPQA Diamond specifically, the oracle weighted ensemble reaches 68.2% vs. the naive ensemble's 41.3% — a 26.9 point gap. The weights reveal that verifier accuracy varies wildly: a 37.5% spread in individual verifier success rates across the 33 models tested.
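
A toy calculation shows why weights matter. The sketch below assumes five conditionally independent binary verifiers with known, symmetric accuracies (one strong, four barely better than chance) and computes, by exact enumeration, how often the ensemble labels a single response correctly. The accuracies are invented for illustration, and the log-odds weights are the standard optimal choice under these assumptions rather than the paper's learned classifier.

```python
import math
from itertools import product


def ensemble_accuracy(accuracies, weights):
    """Exact accuracy of a weighted vote over independent binary verifiers."""
    total = 0.0
    for correct in product((0, 1), repeat=len(accuracies)):
        prob = 1.0
        score = 0.0
        for acc, w, ok in zip(accuracies, weights, correct):
            prob *= acc if ok else 1.0 - acc
            score += w if ok else -w  # a correct vote pushes toward the truth
        if score > 0:
            total += prob
        elif score == 0:
            total += 0.5 * prob  # break ties with a coin flip
    return total


accuracies = [0.90, 0.55, 0.55, 0.55, 0.55]  # assumed values
uniform = [1.0] * len(accuracies)
log_odds = [math.log(a / (1.0 - a)) for a in accuracies]

unweighted = ensemble_accuracy(accuracies, uniform)   # about 0.722
weighted = ensemble_accuracy(accuracies, log_odds)    # about 0.900
print(f"unweighted vote: {unweighted:.3f}, log-odds weighted vote: {weighted:.3f}")
```

In this setup the unweighted vote is worse than simply trusting the single strong verifier, because four noisy voters can outvote it. The weighted vote recovers the strong verifier's accuracy.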

The Dawid-Skene Model

But we don't have labels. So how do we estimate weights? Weaver adapts the Dawid-Skene model from the weak supervision literature. The core idea: if we assume each verifier makes errors independently (conditional on the true label), we can estimate their individual accuracies by analyzing their agreement patterns.

Think of it like this. You have 33 annotators labeling data, but no answer key. If annotators A and B agree 90% of the time, and annotators A and C agree 85% of the time, and B and C agree 82% of the time — then A is probably the most accurate. You can back out individual accuracies from pairwise agreement rates, even without knowing the true labels.
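
The annotator example can be worked through numerically. The sketch below is a classic three-way ("triplet") calculation, not Weaver's actual estimator, and it rests on assumptions the prose does not state: each annotator has a single accuracy that is the same on both classes, errors are conditionally independent, and every annotator is better than chance. Under those assumptions, 2 × agreement − 1 for a pair equals the product of the two annotators' (2 × accuracy − 1) terms, which can be solved for each annotator.

```python
import math


def accuracies_from_agreement(agree_ab: float, agree_ac: float, agree_bc: float):
    """Recover three accuracies from three pairwise agreement rates."""
    # Convert agreement rates into correlations of +/-1 votes.
    e_ab, e_ac, e_bc = (2.0 * x - 1.0 for x in (agree_ab, agree_ac, agree_bc))
    # Each correlation is a product c_i * c_j with c_i = 2 * accuracy_i - 1.
    c_a = math.sqrt(e_ab * e_ac / e_bc)
    c_b = math.sqrt(e_ab * e_bc / e_ac)
    c_c = math.sqrt(e_ac * e_bc / e_ab)
    return tuple((1.0 + c) / 2.0 for c in (c_a, c_b, c_c))


acc_a, acc_b, acc_c = accuracies_from_agreement(0.90, 0.85, 0.82)
print(f"A: {acc_a:.3f}, B: {acc_b:.3f}, C: {acc_c:.3f}")
```

With the agreement rates from the example, A comes out at roughly 0.97, B at roughly 0.93, and C at roughly 0.87, so A is indeed the most accurate, and no labels were used.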

Formal setup: Let $Y$ be the unknown true label (correct/incorrect). Let $S_k$ be verifier $k$'s binary vote. Assume conditional independence: $S_i \perp S_j \mid Y$. Then the joint probability factors as

$$\Pr(S_i, S_j) = \Pr(S_i \mid Y=1)\,\Pr(S_j \mid Y=1)\,\Pr(Y=1) + \Pr(S_i \mid Y=0)\,\Pr(S_j \mid Y=0)\,\Pr(Y=0)$$

The left side is observable (count co-occurrences). $\Pr(Y=1)$ is estimated from a tiny dev set (~5-10 samples). The accuracy parameters $\Pr(S_k=1 \mid Y=1)$ are what we solve for.

Weighted vs. Unweighted Ensembles

[Interactive simulation: each bar is a verifier with a different accuracy; a weighted ensemble score (orange line) is compared against the unweighted average (gray line) as more verifiers are included. Poor verifiers drag the unweighted average down.]

How does the Dawid-Skene model estimate verifier accuracy without ground-truth labels?

Chapter 4: The Weaver Framework

Directly applying weak supervision to verifier ensembles doesn't work out of the box. Verifiers produce wildly different output formats (continuous scores, binary labels, Likert scales), some verifiers are worse than random on certain tasks, and the accuracy estimation is sensitive to these issues. Weaver solves each with a specific engineering decision.

Step 1: Binarization

Weak supervision models expect binary votes. But reward models output continuous scores (e.g., 0.73). Weaver binarizes every verifier's output: a response is "correct" (1) if its score is above the verifier's median score across all responses for that query, and "incorrect" (0) otherwise. The median threshold is computed per-query, not globally — this handles the fact that some queries produce universally low or high scores.
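
A minimal sketch of per-query median binarization follows. The two score lists are invented to mimic reward models on very different scales, and treating a score exactly equal to the median as "incorrect" is an assumption of this sketch; the text does not say how ties are handled.

```python
from statistics import median


def binarize_per_query(scores):
    """Turn one verifier's raw scores for the K responses to ONE query into 0/1 votes."""
    threshold = median(scores)
    # Strictly above the per-query median counts as a "correct" vote.
    return [1 if s > threshold else 0 for s in scores]


# Hypothetical raw scores from two reward models for the same four responses.
model_a_scores = [0.30, 0.90, 0.50, 0.70]
model_b_scores = [0.01, 0.15, 0.05, 0.12]

votes_a = binarize_per_query(model_a_scores)
votes_b = binarize_per_query(model_b_scores)
print(votes_a, votes_b)
```

Both models end up casting the same votes even though their raw score ranges barely overlap, which is the point of thresholding within each query rather than globally.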

Step 2: Verifier Filtering

Some verifiers are genuinely harmful. A reward model that was trained on a different distribution might have negative correlation with correctness on a given task. Including it in the ensemble makes things worse, not better.

Weaver estimates each verifier's accuracy using the weak supervision model, then discards any verifier whose estimated accuracy is below 50% (i.e., worse than a coin flip). This is conservative but effective: on GPQA Diamond, the filter removes 8 of 33 verifiers, and the remaining 25 are all genuinely informative.

Step 3: Accuracy Estimation

Using the Dawid-Skene objective (Equation 5 from the paper), Weaver estimates the accuracy parameter for each surviving verifier. These become the ensemble weights.

Step 4: Weighted Combination

Given binary votes $\bar{s}_1, \ldots, \bar{s}_m$ from the filtered verifiers, the ensemble score for each query-response pair is the posterior probability:

$$\Pr(Y=1 \mid S_1=\bar{s}_1, \ldots, S_m=\bar{s}_m) = \frac{\Pr(Y=1)\,\prod_{i=1}^{m} \Pr(S_i=\bar{s}_i \mid Y=1)}{\Pr(S_1=\bar{s}_1, \ldots, S_m=\bar{s}_m)}$$

The response with the highest posterior is selected. That's it — the entire pipeline.
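
The posterior and the selection step can be sketched as a small naive Bayes computation. The denominator is obtained by summing the joint probability over both values of Y. The parameter names `tpr` (Pr(S_i=1 | Y=1)) and `fpr` (Pr(S_i=1 | Y=0)), and all numeric values, are assumptions made for this illustration; in Weaver these quantities would come from the estimation step.

```python
def posterior_correct(votes, tpr, fpr, prior):
    """Pr(Y=1 | votes) under conditional independence of the verifiers."""
    joint_pos = prior          # accumulates Pr(Y=1) * prod_i Pr(S_i = s_i | Y=1)
    joint_neg = 1.0 - prior    # accumulates Pr(Y=0) * prod_i Pr(S_i = s_i | Y=0)
    for s, t, f in zip(votes, tpr, fpr):
        joint_pos *= t if s == 1 else 1.0 - t
        joint_neg *= f if s == 1 else 1.0 - f
    return joint_pos / (joint_pos + joint_neg)


def select_response(vote_matrix, tpr, fpr, prior):
    """Score every candidate response and return (best index, all scores)."""
    scores = [posterior_correct(v, tpr, fpr, prior) for v in vote_matrix]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Three hypothetical verifiers (first is strongest) voting on three responses.
tpr = [0.8, 0.6, 0.7]
fpr = [0.2, 0.4, 0.3]
vote_matrix = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

best, scores = select_response(vote_matrix, tpr, fpr, prior=0.5)
print(best, [round(s, 3) for s in scores])
```

Note that the first response, backed only by the strongest verifier, scores higher than the second, which collected two votes from weaker verifiers. A plain vote count would rank them the other way.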

Why binarization helps: Different reward models have radically different score distributions. Model A might output scores in [0.3, 0.9] while Model B outputs [0.01, 0.15]. Averaging raw scores makes Model A dominate. Binarizing to "above or below median" puts every verifier on the same footing. The paper finds that binarization outperforms alternative calibration strategies (z-score normalization, min-max scaling) for downstream selection performance.

Weaver Pipeline

[Interactive simulation: raw verifier scores are binarized, filtered, weighted by estimated accuracy, and combined into a posterior. Orange = correct response, gray = incorrect; a star marks Weaver's selection.]

Why does Weaver binarize verifier scores using per-query medians rather than using raw continuous scores?

Chapter 5: Weak Supervision Details

Let's look inside the math that estimates verifier accuracies. The weak supervision model treats the true label Y as a latent variable — something we can't observe directly but can infer from the pattern of verifier votes.

The Graphical Model

Weaver assumes a simple generative story: for each query-response pair, nature first decides if the response is correct (Y=1) or not (Y=0). Then each verifier independently "votes" based on the true label, with some probability of getting it right. Formally:

$$\Pr(Y, S_1, \ldots, S_m) = \Pr(Y)\,\prod_{i=1}^{m} \Pr(S_i \mid Y)$$

The Observable Statistics

We can compute from data: Pr(Si=1) for each verifier (its "yes" rate), and Pr(Si=a, Sj=b) for each pair of verifiers (their joint agreement table). These are just counting exercises over the nK query-response pairs.
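
These counting exercises can be sketched directly. The layout below (one row of binary votes per query-response pair) and the tiny vote matrix are assumptions chosen for illustration.

```python
from itertools import combinations


def observable_statistics(votes):
    """votes[r][i] is verifier i's binary vote on query-response pair r."""
    n = len(votes)
    m = len(votes[0])
    # Marginal "yes" rate Pr(S_i = 1) for every verifier.
    yes_rate = [sum(row[i] for row in votes) / n for i in range(m)]
    # 2x2 joint table Pr(S_i = a, S_j = b) for every pair of verifiers.
    joint = {}
    for i, j in combinations(range(m), 2):
        table = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
        for row in votes:
            table[(row[i], row[j])] += 1.0 / n
        joint[(i, j)] = table
    return yes_rate, joint


# Four hypothetical query-response pairs scored by three verifiers.
votes = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 1]]
yes_rate, joint = observable_statistics(votes)
print(yes_rate)
print(joint[(0, 1)])
```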

The Estimation Objective

Using the conditional independence assumption, the joint probability of any verifier pair factors as:

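$$\Pr(S_i, S_j) = \Pr(S_i \mid Y=1)\,\Pr(S_j \mid Y=1)\,p + \Pr(S_i \mid Y=0)\,\Pr(S_j \mid Y=0)\,(1-p)$$

where $p = \Pr(Y=1)$ is the class prior. The left side is known. The right side depends on the accuracy parameters $\alpha_k$, which the objective below collects into the matrix $\mu$. Weaver constructs a matrix objective that minimizes the difference between observed and predicted pairwise statistics, plus a consistency constraint that marginals match:

$$\min_{\mu}\ \left\| O_{\text{off-diag}} - \left(\mu P \mu^{\top}\right)_{\text{off-diag}} \right\|^{2} + \left\| \operatorname{diag}(O) - \mu P \mathbf{1}^{\top} \right\|^{2}$$

where $\mu$ is the matrix of accuracy parameters, $P$ is the diagonal matrix of class priors, and $O$ is the matrix of observed joint probabilities. This is optimized via gradient descent.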
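
To give a feel for the optimization, here is a deliberately simplified analogue, not the paper's Equation 5. It assumes each verifier has one symmetric accuracy, drops the marginal-consistency term, and fits only the off-diagonal (pairwise) statistics by plain gradient descent. The inputs are the vote correlations implied by the 90% / 85% / 82% agreement rates of the annotator example in Chapter 3 (correlation = 2 × agreement − 1). The learning rate, step count, and initialization are arbitrary choices for this sketch.

```python
def fit_symmetric_accuracies(pairwise, m, lr=0.1, steps=5000, init=0.5):
    """Minimize sum over pairs of (c_i * c_j - E_ij)^2, where c_i = 2 * accuracy_i - 1.

    pairwise[(i, j)] holds the observed correlation E_ij of +/-1 votes for i < j.
    """
    c = [init] * m  # a positive start encodes "verifiers are better than chance"
    for _ in range(steps):
        grad = [0.0] * m
        for (i, j), e_ij in pairwise.items():
            resid = c[i] * c[j] - e_ij
            grad[i] += 2.0 * resid * c[j]
            grad[j] += 2.0 * resid * c[i]
        c = [ci - lr * g for ci, g in zip(c, grad)]
    return [(1.0 + ci) / 2.0 for ci in c]


# Correlations implied by pairwise agreement rates of 0.90, 0.85 and 0.82.
observed = {(0, 1): 0.80, (0, 2): 0.70, (1, 2): 0.64}
estimated = fit_symmetric_accuracies(observed, m=3)
print([round(a, 3) for a in estimated])
```

The positive initialization plays a role loosely similar to the base-rate information discussed in this chapter: the pairwise statistics alone cannot distinguish a set of accurate verifiers from its mirror image, so something has to pick a side.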

Difficulty partitioning: Weaver also clusters queries by empirical difficulty (the ratio of correct to incorrect responses, estimated from verifier agreement). Easier queries have higher base rates, harder ones lower. A separate Weaver model is fit within each difficulty bucket. This improves accuracy estimation because verifier behavior can differ dramatically between easy and hard problems — a verifier that's great on easy problems might be useless on hard ones.

Why this works without labels: The key insight is that agreement between verifiers is informative about their individual accuracies. If verifier A agrees with verifiers B, C, D, and E almost all the time, and those verifiers disagree with F, then either A-E are all right and F is wrong, or F is right and A-E are all wrong. The base rate Pr(Y=1), estimated from the tiny dev set, breaks this symmetry. If correct responses are rare, the "minority" cluster is more likely correct. If correct responses are common, the majority cluster is. You need surprisingly few labeled samples (~5-10) to estimate this base rate.

What is the role of the tiny dev set (~5-10 labeled samples) in Weaver's weak supervision?

Chapter 6: The Key Result

Here's the headline number. Using Llama 3.3 70B Instruct as the generator, K=100 repeated samples, and an ensemble of 33 verifiers (7B-72B reward models and LM judges), Weaver achieves:

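| Method | MATH500 | GPQA | MMLU Col. | MMLU Pro | Average |
| --- | --- | --- | --- | --- | --- |
| Pass@1 | 78.0% | 42.9% | 82.6% | 69.9% | 68.4% |
| Majority Voting | 83.0% | 47.4% | 84.1% | 74.4% | 72.2% |
| Best Single RM | 78.2% | 49.7% | 86.0% | 77.0% | 72.7% |
| Multi-Agent Verif. | 81.3% | 47.8% | 84.1% | 72.6% | 71.6% |
| Weaver | 93.4% | 72.1% | 94.9% | 90.2% | 87.7% |
| o3-mini | 94.4% | 74.0% | 92.2% | 86.0% | 86.7% |
| Oracle (Pass@100) | 98.6% | 81.0% | 96.0% | 92.0% | 91.9% |
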
The $0 fine-tuning path to reasoning: Weaver with Llama 70B (87.7%) matches o3-mini (86.7%) — a model that required extensive reinforcement learning and post-training. Weaver requires zero parameter updates to the generator. The improvement comes entirely from better selection among existing responses. This is a 19.3 point jump from Pass@1 (68.4% → 87.7%), comparable to the GPT-4o → o3-mini jump (69.0% → 86.7%).

Weak-to-Strong Verification

Even at the 8B scale (Llama 3.1 8B generator + 8B verifiers), Weaver achieves 70.0% average, within 1.6 points of Multi-Agent Verification (71.6%) and 2.2 points of majority voting (72.2%) at the 70B scale. And Weaver at 70B (87.7%) surpasses o3-mini by 1.0 point. This is a weak-to-strong phenomenon: small verifiers, when properly combined, punch far above their weight class.

Results Comparison

[Interactive simulation: average accuracy across the four benchmarks by generator size, comparing Weaver (orange) to baselines.]

What is remarkable about Weaver achieving o3-mini-level accuracy with Llama 70B?

Chapter 7: The Distilled Verifier

There's a catch. Running 33 verifiers (some 70B+) on every candidate response is expensive. For K=100 samples with 33 verifiers, you need 3,300 forward passes per query. That's 35.35 exaFLOPs per query set — 10x more than generation alone. The verification cost dominates.

Weaver's solution: distillation. Train a tiny model to mimic the full ensemble.

How It Works

Take a ModernBERT-Large model (396M parameters, a cross-encoder). Feed it concatenated (query, response) pairs. The training label for each pair is Weaver's posterior probability Pr(Y=1 | all verifier votes) — a soft score from 0 to 1. Train with a regression loss (MSE or cross-entropy) on this soft label.
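
The two candidate losses behave similarly on a soft label, which a short sketch can show without any deep learning framework. The pseudolabel value 0.8 and the candidate predictions are invented; a real training loop would apply one of these losses to the cross-encoder's output for each (query, response) pair.

```python
import math


def soft_cross_entropy(pred: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy against a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)  # keep the logarithms finite
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def mse(pred: float, target: float) -> float:
    """Squared error against the same soft target."""
    return (pred - target) ** 2


pseudolabel = 0.8  # hypothetical Weaver posterior for one (query, response) pair
for pred in (0.5, 0.8, 0.95):
    print(pred, round(soft_cross_entropy(pred, pseudolabel), 4), round(mse(pred, pseudolabel), 4))
```

Both losses are smallest when the prediction equals the pseudolabel itself, so the student is pushed to reproduce the ensemble's graded confidence rather than a hard 0/1 decision.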

Step 1: Generate Training Data
Run full Weaver on a corpus of query-response pairs. For each pair, store the Weaver posterior score as a pseudolabel. This is a one-time offline cost.
Step 2: Train Cross-Encoder
Fine-tune ModernBERT-Large (396M) as a cross-encoder. Input: (query, response) concatenated. Output: scalar score matching the Weaver pseudolabel. Standard supervised learning.
Step 3: Deploy
At inference, replace the full 33-verifier ensemble with the single 400M cross-encoder. One forward pass per candidate response instead of 33.

The numbers: The distilled cross-encoder retains 98.7% of Weaver's accuracy gains while reducing verification compute by up to 99.97%. For the 100-sample query set here, full Weaver costs 35.35 exaFLOPs (100 samples × 33 verifiers) and the 400M cross-encoder costs 1.01 exaFLOPs, a 35x reduction (about 97%). And it only needs a single A100 GPU with 32GB memory, instead of an 8-GPU node per 70B verifier.

On GPQA Diamond, the distilled model outperforms majority voting by 23.2% while adding only 0.57% inference cost over generation alone. The distilled verifier captures the combined intelligence of all 33 verifiers in a 396M-parameter model, roughly 175x smaller than a 70B verifier.

Why distillation works: The Weaver ensemble score is a smooth, well-calibrated probability. It compresses the diverse signals from 33 verifiers into a single number per response. A cross-encoder is a powerful function class for mapping (query, response) pairs to scalars — it can attend to fine-grained relationships between the question and the answer. The pseudolabels from Weaver are higher quality than any individual verifier's scores, so the distilled model learns from cleaner supervision than any single teacher.

What are the inputs and outputs of the distilled Weaver cross-encoder?

Chapter 8: Analysis

The paper explores several scaling dimensions. Each reveals something about when and why Weaver works.

Scaling Generations

As K increases from 2 to 128, majority voting plateaus around K=8-16. The naive ensemble plateaus slightly later. Weaver keeps improving all the way to K=128, consistently narrowing the gap to Pass@K. The effect is most dramatic on hard tasks (GPQA Diamond) where individual verifiers are least reliable and ensembling helps most.

Scaling Verifier Count

Adding more verifiers helps, but with diminishing returns. Going from 1 to 5 verifiers yields the biggest jump (+8.5% on some tasks). From 5 to 15, gains taper off as new verifiers contribute increasingly correlated signals. Weaver consistently outperforms naive averaging at every verifier count, with the gap widest at 5-10 verifiers (+2.4% to +10.1%).

Which Verifiers Contribute?

The paper uses 33 verifiers from RewardBench and Chatbot Arena: a mix of reward models (8B to 72B) and LM judges. On average, Weaver filters out 5-10 of the 33 on any given task. The filtered verifiers tend to be models trained on very different distributions than the target task. Interestingly, no single verifier is best across all tasks — the optimal ensemble composition changes per task, which is exactly why adaptive weighting matters.

Compute-Accuracy Trade-off

Majority voting is the cheapest approach: zero verification cost. Weaver is the most expensive (33 verifier forward passes per sample). But Weaver achieves the highest accuracy and keeps scaling with more compute, while majority voting saturates early. The distilled Weaver cross-encoder sits in the sweet spot: nearly all of Weaver's accuracy at 99.97% less verification compute.

The key scaling insight: Generation and verification are complementary scaling axes. More generations give you more chances to produce a correct answer. Better verification lets you find it. But verification quality matters much more than generation quantity once Pass@K is high. Going from K=16 to K=128 improves Pass@K by ~5%, but going from naive ensemble to Weaver improves success rate by ~15%. Invest in verification.

Why does Weaver show the biggest gains on the hardest tasks (like GPQA Diamond)?

Chapter 9: Connections

Weaver sits at the intersection of several active research threads in inference-time compute and verification.

| Paper | Relationship to Weaver |
| --- | --- |
| Large Language Monkeys (Brown et al., 2024) | Established that repeated sampling scales well: Pass@K grows reliably. Weaver addresses the follow-up question: once you have many samples, how do you select? |
| Scaling Test-Time Compute (Snell et al., 2024) | Showed that allocating more compute at inference (via search, verification, or revision) can outperform scaling model parameters. Weaver extends this by scaling the verification axis specifically. |
| Multi-Agent Verification (Lifshitz et al., 2025) | Uses multiple verifier calls but with a single verifier model (varying prompts/temperatures). Weaver uses multiple different verifier models with adaptive weighting, and outperforms Multi-Agent Verification by 16.1 points on average. |
| Let's Verify Step by Step (Lightman et al., 2023) | Introduced process reward models (PRMs) that verify intermediate reasoning steps. Weaver uses outcome-level verification but could potentially incorporate PRMs as additional weak verifiers. |
| ARCHON (Saad-Falcon et al., 2024) | Architecture search for combining LM inference techniques (judges, rankers, generators). Weaver focuses specifically on the verification/selection component, providing a principled aggregation method that ARCHON could incorporate. |
| Inference Scaling Flaws (Stroebl et al., 2024) | Showed that imperfect verifiers limit the benefits of repeated sampling. Weaver directly addresses this limitation by making verifiers less imperfect through weighted ensembling. |

Open questions: (1) Can process-level reward models (step-by-step verification) be integrated as verifiers in Weaver? (2) Can the distilled cross-encoder generalize across tasks, or does it need task-specific training? (3) How does Weaver interact with chain-of-thought reasoning and self-correction? (4) Can the weak supervision approach work for multimodal verification (images, code execution)? The authors suggest all of these as future directions.

Weaver demonstrates a powerful principle: you don't need a strong verifier if you have many weak ones. The combination strategy — normalize, filter, weight, aggregate — is general enough to apply wherever multiple imperfect signals exist. This connects to a deep idea in machine learning: ensembles of weak learners can be strong, whether the learners are decision stumps (boosting), annotators (crowdsourcing), or verifiers (Weaver).

What fundamental principle from ensemble learning does Weaver apply to the verification problem?