Weaver — Veanors

Chapter 0: The Problem

You have a language model that can solve math problems. Not every time — maybe 70% of the time. But if you ask it the same question 100 times with some randomness (non-zero temperature), at least one of those 100 answers is almost certainly correct. For Llama 3.3 70B on MATH500, Pass@100 — the probability that at least one of 100 samples is correct — is 98.6%.
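
A back-of-the-envelope sketch shows why repeated sampling works so well. It assumes, as a simplification not taken from the paper, that the K samples for one question are independent and each is correct with the same probability p. The function name `pass_at_k_iid` is illustrative.

```python
def pass_at_k_iid(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct,
    assuming every sample is correct with the same probability p."""
    return 1.0 - (1.0 - p) ** k

# Even a question the model solves only 5% of the time is very likely
# to have at least one correct answer among 100 samples.
for k in (1, 10, 100):
    print(k, round(pass_at_k_iid(0.05, k), 3))
```

A benchmark-level Pass@K averages this quantity over questions of very different difficulty, so real curves will not follow this formula exactly.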

So the generation side is solved. The model can produce a correct answer. The problem is: which one?

You have 100 candidate responses. In the hardest case, one is correct and ninety-nine are wrong. How do you pick the right one? This is the verification problem: scoring and ranking responses to select the best candidate.

The core tension: Generation quality scales beautifully with more samples. Verification quality does not. Majority voting plateaus quickly. Single reward models have high false-positive rates. LM judges are biased and noisy. The model generates the answer — but we fail to find it. This is the generation-verification gap.

If you had a perfect verifier — an oracle that always identifies the correct response — you could close this gap entirely. But perfect verifiers only exist for narrow domains (Lean for formal proofs, unit tests for code). For open-ended math, science, and reasoning, all we have are weak verifiers: reward models and LM judges that are noisy, biased, and inconsistent.

Weaver's question: can you combine many weak verifiers into one strong verifier? Not by averaging (that assumes they're equally good, and they're not). Not by supervised learning (that requires labels you don't have). But through weak supervision — estimating each verifier's accuracy from the data alone, then weighting them accordingly.

The Selection Bottleneck

Figure (interactive in the original): as the number of candidate responses K grows, Pass@K (the chance a correct answer exists) soars, while the success rate of a weak verifier barely moves. The gap between them is the generation-verification gap.

Why does generating more candidate responses NOT automatically improve final accuracy?

Because more responses means more correct answers exist among the candidates, but a weak verifier cannot reliably identify which one is correct, so the selection accuracy plateaus even as Pass@K keeps climbing
Because generating more responses uses more compute
Because the model runs out of diverse solutions after a few samples

Chapter 1: The Key Insight

Individual verifiers are weak. A single reward model might have 55% accuracy at picking the correct response from a pair. A single LM judge might score 60%. Neither is reliable enough to justify generating 100 candidates — you'd pick wrong almost half the time.

But here's the thing: they're wrong in different ways. A reward model trained on preference data might be biased toward longer responses. An LM judge might struggle with formal math but excel at reasoning chains. A different reward model might be great at code but poor at science. Their errors are correlated with different features of the problem.

The Weaver insight: If verifiers make independent errors, combining them amplifies signal and cancels noise — the same principle behind ensemble learning. But you can't just average them (they have different accuracies) and you can't supervise them (you don't have labels). Weaver uses weak supervision to estimate each verifier's accuracy from agreement patterns alone, then weights them accordingly. No ground truth needed.

Step 1: Generate

Sample K=100 responses from an LM (e.g., Llama 3.3 70B) at non-zero temperature. Many will be wrong, but at least one is likely correct (Pass@100 ≈ 98%).

↓

Step 2: Score

Run every response through M verifiers (33 reward models + LM judges in the paper). Each verifier outputs a score: continuous for RMs, binary for judges.

↓

Step 3: Normalize & Filter

Binarize all scores (correct/incorrect). Discard verifiers that are worse than random — their accuracy is estimated from dataset statistics.

↓

Step 4: Estimate Weights

Use a Dawid-Skene-style latent variable model to estimate each verifier's accuracy WITHOUT labels. Verifiers that agree with accurate verifiers get higher weight.

↓

Step 5: Select

Combine weighted verifier scores via Bayesian posterior. Pick the response with the highest P(correct | all verifier votes). Done.

The result: Llama 3.3 70B Instruct with Weaver verification achieves 87.7% average accuracy across MATH500, GPQA Diamond, MMLU College, and MMLU Pro. That matches o3-mini (86.7%) — a model that required extensive fine-tuning and post-training. Weaver gets there with zero parameter updates, using only off-the-shelf verifiers.

What makes Weaver different from simply averaging all verifier scores?

Weaver uses more verifiers
Weaver estimates each verifier's accuracy without labels and uses those accuracies as weights, so reliable verifiers count more and poor verifiers are down-weighted or discarded
Weaver fine-tunes the verifiers on the target task

Chapter 2: The Generation-Verification Gap

Let's quantify the problem. For a given task and model, we can measure two things:

Pass@K: The probability that at least one of K generated responses is correct. This measures generation quality. It only depends on the model, temperature, and K.
Success rate: The probability that the selected response is correct, given a verification strategy f. This is bounded above by Pass@K — you can't pick a correct answer if none was generated.

Generation-Verification Gap = Pass@K − Success Rate

A large gap means the model generates correct answers but the verifier fails to find them. An oracle verifier (perfect accuracy) closes the gap to zero.
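
The two quantities and the gap can be computed directly once you know which responses are correct. The sketch below uses made-up toy data; the function names and the `selected` indices are illustrative, not from the paper.

```python
def pass_at_k(correct):
    """correct[q][i] is True if response i to query q is correct."""
    return sum(any(row) for row in correct) / len(correct)

def success_rate(correct, selected):
    """selected[q] is the index of the response the verifier picked for query q."""
    return sum(row[i] for row, i in zip(correct, selected)) / len(correct)

# Toy data (made up): 4 queries, 3 sampled responses each.
correct = [
    [False, True, False],
    [True, True, False],
    [False, False, False],
    [False, False, True],
]
selected = [1, 2, 0, 0]  # hypothetical verifier picks, one index per query

p_at_k = pass_at_k(correct)             # 3 of 4 queries contain a correct response
rate = success_rate(correct, selected)  # only the pick for query 0 is correct
gap = p_at_k - rate
print(p_at_k, rate, gap)  # prints 0.75 0.25 0.5
```

Query 2 shows why success rate is bounded by Pass@K: no selection strategy can succeed when no correct response was generated.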

How bad is the gap in practice? Consider Llama 3.3 70B on GPQA Diamond with K=100:

Method	Success Rate	Gap to Pass@100
Pass@1 (no verification)	42.9%	38.1%
Majority Voting	47.4%	33.6%
Best Single RM	49.7%	31.3%
Naive Ensemble (top-10 RMs)	41.3%	39.7%
Multi-Agent Verification	47.8%	33.2%
Weaver	72.1%	8.9%
Oracle (Pass@100)	81.0%	0%

The striking result: On GPQA Diamond, majority voting barely improves over Pass@1 (47.4% vs. 42.9%). The naive ensemble of top-10 reward models actually hurts performance (41.3%). But Weaver jumps to 72.1%, closing about 77% of the gap between Pass@1 and the oracle (29.2 of 38.1 points). Verification strategy matters enormously.

Why does naive ensembling fail? Because it treats all verifiers equally. If 7 of your 10 reward models are bad at GPQA but okay at MATH, the naive average is dominated by noise on GPQA. Weighting is essential.

Why does majority voting plateau? Because it ignores verifier scores entirely — it just picks the most common final answer. Once you have ~8-16 samples, the distribution of final answers stabilizes and more samples don't help. Verification, on the other hand, can keep scaling by using richer signals from the verifier scores.
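
Majority voting is simple enough to write in a few lines, which also makes its blind spot visible. This is a minimal sketch with made-up answers; `majority_vote` is an illustrative name.

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most common final answer among the sampled responses."""
    return Counter(final_answers).most_common(1)[0][0]

# Made-up final answers extracted from six sampled responses.
answers = ["12", "15", "12", "7", "12", "15"]
print(majority_vote(answers))  # prints 12
```

If the correct answer here were "15", it would be in the candidate pool but never selected. No verifier score enters the decision, so extra samples only sharpen the same answer distribution.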

Why can a naive ensemble of reward models sometimes perform WORSE than a single good verifier?

Because it treats all verifiers equally, so low-accuracy verifiers dominate the average and drown out the signal from good verifiers; the ensemble inherits the average quality, not the best quality
Because reward models are always inaccurate
Because ensembles require more compute than single models

Chapter 3: Weighted Ensembles

If naive averaging fails because it ignores accuracy differences, the natural fix is: learn weights. Give more weight to verifiers that are more accurate. The paper tests this with oracle data first (to see the ceiling), then without labels (Weaver).

Oracle Weighted Ensembles

Given ground-truth labels for all query-response pairs, you can train a logistic regression or Naive Bayes classifier to estimate P(correct | all verifier scores). This is an "oracle" approach because it uses labels you wouldn't have in practice.
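
As a rough illustration of the oracle setting, here is a small Naive Bayes classifier over binary verifier votes, fitted with labels. The data is made up, the Laplace smoothing is a choice of this sketch, and the paper's actual implementation may differ.

```python
def fit_naive_bayes(votes, labels):
    """votes[n][k] in {0, 1}: verifier k's vote on pair n; labels[n] in {0, 1}."""
    m = len(votes[0])
    prior = sum(labels) / len(labels)
    rates = {}
    for y in (0, 1):
        rows = [v for v, lab in zip(votes, labels) if lab == y]
        # Laplace smoothing keeps the estimated rates away from 0 and 1.
        rates[y] = [(sum(r[k] for r in rows) + 1) / (len(rows) + 2) for k in range(m)]
    return prior, rates

def posterior(vote_row, prior, rates):
    """Pr(correct | votes), treating verifiers as independent given the label."""
    like = {1: prior, 0: 1 - prior}
    for y in (0, 1):
        for k, s in enumerate(vote_row):
            p = rates[y][k]  # estimated Pr(S_k=1 | Y=y)
            like[y] *= p if s == 1 else 1 - p
    return like[1] / (like[0] + like[1])

# Toy labeled data: verifier 0 tracks the label, verifiers 1 and 2 are noise.
votes = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [1, 1, 1, 0, 0, 0]
prior, rates = fit_naive_bayes(votes, labels)

print(round(posterior([1, 0, 0], prior, rates), 3))  # only the good verifier says yes
print(round(posterior([0, 1, 1], prior, rates), 3))  # only the noisy verifiers say yes
```

A plain average of the votes would rank the second response (2 of 3 votes) above the first (1 of 3). The fitted model ranks them the other way because it has learned which verifier carries signal.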

The result is dramatic: weighted ensembles outperform naive ensembles by up to 11.2 percentage points. On GPQA Diamond specifically, the oracle weighted ensemble reaches 68.2% vs. the naive ensemble's 41.3% — a 26.9 point gap. The weights reveal that verifier accuracy varies wildly: a 37.5% spread in individual verifier success rates across the 33 models tested.

The Dawid-Skene Model

But we don't have labels. So how do we estimate weights? Weaver adapts the Dawid-Skene model from the weak supervision literature. The core idea: if we assume each verifier makes errors independently (conditional on the true label), we can estimate their individual accuracies by analyzing their agreement patterns.

Think of it like this. You have 33 annotators labeling data, but no answer key. If annotators A and B agree 90% of the time, and annotators A and C agree 85% of the time, and B and C agree 82% of the time — then A is probably the most accurate. You can back out individual accuracies from pairwise agreement rates, even without knowing the true labels.
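
The annotator example can be worked through numerically. The sketch below makes two assumptions beyond the text: each annotator is equally accurate on both classes, and each is better than chance. Under those assumptions, plus conditional independence, agreement rates factor into per-annotator terms that can be solved in closed form. This is a simplified triplet-style calculation, not necessarily the estimator Weaver uses.

```python
import math

def accuracies_from_agreement(agree_ab, agree_ac, agree_bc):
    # Convert agreement rates into correlations of +/-1 votes: c_ij = 2 * agree_ij - 1.
    c_ab, c_ac, c_bc = (2 * a - 1 for a in (agree_ab, agree_ac, agree_bc))
    # Under conditional independence, c_ij = c_i * c_j with c_i = 2 * acc_i - 1,
    # so each c_i can be isolated from the three pairwise products.
    c_a = math.sqrt(c_ab * c_ac / c_bc)
    c_b = math.sqrt(c_ab * c_bc / c_ac)
    c_c = math.sqrt(c_ac * c_bc / c_ab)
    return tuple((1 + c) / 2 for c in (c_a, c_b, c_c))

acc_a, acc_b, acc_c = accuracies_from_agreement(0.90, 0.85, 0.82)
print(round(acc_a, 3), round(acc_b, 3), round(acc_c, 3))  # prints 0.968 0.928 0.874
```

No answer key was used, yet the calculation ranks A above B above C, matching the intuition in the paragraph above.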

Formal setup: Let Y be the unknown true label (correct/incorrect). Let S_k be verifier k's binary vote. Assume conditional independence: S_i ⊥ S_j | Y. Then the joint probability Pr(S_i, S_j) factors as Pr(S_i|Y=1)Pr(S_j|Y=1)Pr(Y=1) + Pr(S_i|Y=0)Pr(S_j|Y=0)Pr(Y=0). The left side is observable (count co-occurrences). Pr(Y=1) is estimated from a tiny dev set (~5-10 samples). The accuracy parameters Pr(S_k=1|Y=1) are what we solve for.
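
The factorization can be sanity-checked by simulation. The sketch below picks arbitrary parameter values (all hypothetical), computes Pr(S_i=a, S_j=b) from the right-hand side, and compares it with the frequency observed when votes are sampled from the same model.

```python
import random

def predicted_joint(a, b, pos_i, neg_i, pos_j, neg_j, p):
    """Pr(S_i=a, S_j=b) from the factorization.

    pos_* = Pr(S=1 | Y=1), neg_* = Pr(S=1 | Y=0), p = Pr(Y=1).
    """
    def lik(vote, rate):
        return rate if vote == 1 else 1 - rate
    return (lik(a, pos_i) * lik(b, pos_j) * p
            + lik(a, neg_i) * lik(b, neg_j) * (1 - p))

def simulated_joint(a, b, pos_i, neg_i, pos_j, neg_j, p, n=200_000, seed=0):
    """Empirical frequency of (S_i=a, S_j=b) when votes are drawn given Y."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        y = rng.random() < p
        s_i = int(rng.random() < (pos_i if y else neg_i))
        s_j = int(rng.random() < (pos_j if y else neg_j))
        hits += (s_i == a and s_j == b)
    return hits / n

# Arbitrary illustrative parameters, not values from the paper.
params = dict(pos_i=0.9, neg_i=0.2, pos_j=0.7, neg_j=0.4, p=0.3)
print(predicted_joint(1, 1, **params), simulated_joint(1, 1, **params))
```

Estimation runs this logic in reverse: the co-occurrence frequencies are what you observe, and the per-verifier rates are the unknowns.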

Weighted vs. Unweighted Ensembles

Figure (interactive in the original): each bar represents a verifier with a different accuracy. The figure compares the weighted ensemble score with the unweighted average as more verifiers are included. Poor verifiers drag the unweighted average down.

How does the Dawid-Skene model estimate verifier accuracy without ground-truth labels?

By analyzing pairwise agreement rates between verifiers: verifiers that frequently agree with other accurate verifiers are estimated as more accurate, using a latent variable model where the true label is unobserved
By running each verifier on a held-out test set
By comparing verifier outputs to the majority vote

Chapter 4: The Weaver Framework

Directly applying weak supervision to verifier ensembles doesn't work out of the box. Verifiers produce wildly different output formats (continuous scores, binary labels, Likert scales), some verifiers are worse than random on certain tasks, and the accuracy estimation is sensitive to these issues. Weaver solves each with a specific engineering decision.

Step 1: Binarization

Weak supervision models expect binary votes. But reward models output continuous scores (e.g., 0.73). Weaver binarizes every verifier's output: a response is "correct" (1) if its score is above the verifier's median score across all responses for that query, and "incorrect" (0) otherwise. The median threshold is computed per-query, not globally — this handles the fact that some queries produce universally low or high scores.
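
Per-query median binarization is easy to sketch. The scores below are made up, and the tie rule (a score equal to the median counts as 0) follows the "above the median" wording in the text rather than any detail confirmed by the paper.

```python
from statistics import median

def binarize_per_query(scores):
    """scores: one verifier's raw scores for the K responses to a single query."""
    threshold = median(scores)
    return [1 if s > threshold else 0 for s in scores]

# Two hypothetical reward models scoring the same five responses.
model_a = [0.31, 0.88, 0.52, 0.47, 0.90]  # scores spread over roughly [0.3, 0.9]
model_b = [0.02, 0.14, 0.05, 0.03, 0.15]  # scores squeezed into [0.01, 0.15]

print(binarize_per_query(model_a))  # prints [0, 1, 0, 0, 1]
print(binarize_per_query(model_b))  # prints [0, 1, 0, 0, 1]
```

The two models use very different score ranges but rank the responses the same way, and after binarization their votes are identical.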

Step 2: Verifier Filtering

Some verifiers are genuinely harmful. A reward model that was trained on a different distribution might have negative correlation with correctness on a given task. Including it in the ensemble makes things worse, not better.

Weaver estimates each verifier's accuracy using the weak supervision model, then discards any verifier whose estimated accuracy is below 50% (i.e., worse than a coin flip). This is conservative but effective: on GPQA Diamond, the filter removes 8 of 33 verifiers, and the remaining 25 are all genuinely informative.
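
The filter itself is a one-line threshold once accuracy estimates exist. In this sketch the verifier names and accuracy values are invented, and keeping a verifier that sits exactly at 50% is a choice made here.

```python
def filter_verifiers(estimated_accuracy, threshold=0.5):
    """Keep the names of verifiers whose estimated accuracy is not below the threshold."""
    return [name for name, acc in estimated_accuracy.items() if acc >= threshold]

# Hypothetical estimates produced by the weak supervision model.
estimates = {"rm_a": 0.71, "rm_b": 0.46, "judge_c": 0.58, "rm_d": 0.50}
print(filter_verifiers(estimates))  # prints ['rm_a', 'judge_c', 'rm_d']
```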

Step 3: Accuracy Estimation

Using the Dawid-Skene objective (Equation 5 from the paper), Weaver estimates the accuracy parameter for each surviving verifier. These become the ensemble weights.

Step 4: Weighted Combination

Given binary votes s̄₁, ..., s̄_m from the filtered verifiers, the ensemble score for each query-response pair is the posterior probability:

Pr(Y=1 | S₁=s̄₁, ..., S_m=s̄_m) = ∏_i Pr(S_i=s̄_i | Y=1) · Pr(Y=1) / Pr(S₁=s̄₁, ..., S_m=s̄_m)

The response with the highest posterior is selected. That's it — the entire pipeline.
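
The posterior and the final selection can be sketched as follows. The code works in log-odds, which is algebraically equivalent to the formula above once the denominator is expanded over Y=1 and Y=0. All parameter values are made up, and both function names are illustrative.

```python
import math

def posterior_correct(votes, acc_pos, acc_neg, prior):
    """Pr(Y=1 | votes) under conditional independence.

    acc_pos[k] = Pr(S_k=1 | Y=1), acc_neg[k] = Pr(S_k=1 | Y=0), prior = Pr(Y=1).
    """
    log_odds = math.log(prior / (1 - prior))
    for s, a1, a0 in zip(votes, acc_pos, acc_neg):
        if s == 1:
            log_odds += math.log(a1 / a0)
        else:
            log_odds += math.log((1 - a1) / (1 - a0))
    return 1 / (1 + math.exp(-log_odds))

def select_response(vote_matrix, acc_pos, acc_neg, prior):
    """Return the index of the highest-posterior response and all posteriors."""
    scores = [posterior_correct(v, acc_pos, acc_neg, prior) for v in vote_matrix]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical parameters: verifier 0 is strong, verifiers 1 and 2 are nearly random.
acc_pos = [0.90, 0.60, 0.55]
acc_neg = [0.20, 0.50, 0.50]
votes = [[0, 1, 1], [1, 0, 0], [1, 1, 0]]  # one row of votes per candidate response
best, scores = select_response(votes, acc_pos, acc_neg, prior=0.3)
print(best, [round(s, 3) for s in scores])
```

Each verifier's contribution is a log-likelihood ratio, so a nearly random verifier adds almost nothing while a strong one can outweigh several weak ones.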

Why binarization helps: Different reward models have radically different score distributions. Model A might output scores in [0.3, 0.9] while Model B outputs [0.01, 0.15]. Averaging raw scores makes Model A dominate. Binarizing to "above or below median" puts every verifier on the same footing. The paper finds that binarization outperforms alternative calibration strategies (z-score normalization, min-max scaling) for downstream selection performance.

Weaver Pipeline

Figure (interactive in the original): the full pipeline applied to a batch of responses. Raw verifier scores are binarized, filtered, weighted by estimated accuracy, and combined into a posterior, and Weaver selects the response with the highest posterior.

Why does Weaver binarize verifier scores using per-query medians rather than using raw continuous scores?

Because continuous scores are too slow to compute
Because different verifiers have wildly different score distributions, and binarizing to "above/below median" normalizes them to a common scale; this lets the weak supervision model compare verifiers on equal footing without one dominating due to its score range
Because binary scores are required by the LM judges

Chapter 5: Weak Supervision Details

Let's look inside the math that estimates verifier accuracies. The weak supervision model treats the true label Y as a latent variable — something we can't observe directly but can infer from the pattern of verifier votes.

The Graphical Model

Weaver assumes a simple generative story: for each query-response pair, nature first decides if the response is correct (Y=1) or not (Y=0). Then each verifier independently "votes" based on the true label, with some probability of getting it right. Formally:

Y ~ Bernoulli(p), where p = Pr(Y=1) is the base rate of correct responses. Estimated from a tiny dev set (~5-10 labeled samples).
S_k | Y=1 ~ Bernoulli(α_k), where α_k = Pr(S_k=1|Y=1) is verifier k's accuracy parameter. Votes on incorrect responses follow the analogous rate Pr(S_k=1|Y=0).
Conditional independence: S_i ⊥ S_j | Y for all i ≠ j.
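
The generative story translates directly into a sampler. In this sketch, `acc_pos[k]` stands for Pr(S_k=1|Y=1) and `acc_neg[k]` for Pr(S_k=1|Y=0). Those names and every numeric value are assumptions made for illustration.

```python
import random

def sample_votes(n, prior, acc_pos, acc_neg, seed=0):
    """Draw n (label, votes) pairs from the conditionally independent model."""
    rng = random.Random(seed)
    labels, votes = [], []
    for _ in range(n):
        y = 1 if rng.random() < prior else 0    # nature decides correctness
        rates = acc_pos if y == 1 else acc_neg  # each verifier's chance of voting 1
        labels.append(y)
        votes.append([1 if rng.random() < r else 0 for r in rates])
    return labels, votes

labels, votes = sample_votes(20_000, prior=0.3, acc_pos=[0.9, 0.7], acc_neg=[0.2, 0.4])
print(sum(labels) / len(labels))  # close to the base rate of 0.3
```

In practice only the votes are observed; the labels are the latent variable that the estimation procedure has to reason about.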

The Observable Statistics

We can compute from data: Pr(S_i=1) for each verifier (its "yes" rate), and Pr(S_i=a, S_j=b) for each pair of verifiers (their joint agreement table). These are just counting exercises over the nK query-response pairs.
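
These counting exercises look like the following in code. The vote matrix is a made-up toy with four query-response pairs and two verifiers.

```python
def yes_rates(votes):
    """Empirical Pr(S_k=1) for every verifier k."""
    n, m = len(votes), len(votes[0])
    return [sum(row[k] for row in votes) / n for k in range(m)]

def joint_table(votes, i, j):
    """Return table[a][b] = empirical Pr(S_i=a, S_j=b)."""
    table = [[0.0, 0.0], [0.0, 0.0]]
    for row in votes:
        table[row[i]][row[j]] += 1 / len(votes)
    return table

# votes[n][k]: binary vote of verifier k on query-response pair n (toy data).
votes = [[1, 1], [1, 0], [0, 0], [1, 1]]
print(yes_rates(votes))          # prints [0.75, 0.5]
print(joint_table(votes, 0, 1))  # prints [[0.25, 0.0], [0.25, 0.5]]
```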

The Estimation Objective

Using the conditional independence assumption, the joint probability of any verifier pair factors as:

Pr(S_i, S_j) = Pr(S_i|Y=1)Pr(S_j|Y=1)p + Pr(S_i|Y=0)Pr(S_j|Y=0)(1−p)

The left side is known. The right side depends on the accuracy parameters α_k. Weaver constructs a matrix objective that minimizes the difference between observed and predicted pairwise statistics, plus a consistency constraint that marginals match:

min_μ ||O_off-diag − (μPμ^T)_off-diag||² + ||diag(O) − μP1^T||²

where μ is the matrix of accuracy parameters, P is the diagonal matrix of class priors, and O is the matrix of observed joint probabilities. This is optimized via gradient descent.
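
To make the objective concrete, here is one possible reading of it in plain Python. The layout is an assumption of this sketch, not something the text specifies: row 2k+a of μ holds Pr(S_k=a|Y=0) and Pr(S_k=a|Y=1), "off-diagonal" means entries that pair two different verifiers, and the diagonal term compares marginal vote rates. It only evaluates the loss; the gradient-descent loop is omitted.

```python
def build_mu(acc_pos, acc_neg):
    """Row 2k+a holds [Pr(S_k=a | Y=0), Pr(S_k=a | Y=1)] (assumed layout)."""
    mu = []
    for a1, a0 in zip(acc_pos, acc_neg):
        mu.append([1 - a0, 1 - a1])  # vote a = 0
        mu.append([a0, a1])          # vote a = 1
    return mu

def second_moments(mu, prior):
    """(mu P mu^T)[r][c] with P = diag(1 - prior, prior)."""
    p = [1 - prior, prior]
    return [[sum(r[y] * p[y] * c[y] for y in (0, 1)) for c in mu] for r in mu]

def marginals(mu, prior):
    """(mu P 1)[r], the marginal probability Pr(S_k=a) for row r = 2k+a."""
    return [r[0] * (1 - prior) + r[1] * prior for r in mu]

def objective(mu, prior, observed, observed_marginals):
    pred = second_moments(mu, prior)
    loss = 0.0
    for r in range(len(mu)):
        for c in range(len(mu)):
            if r // 2 != c // 2:  # only entries pairing two different verifiers
                loss += (observed[r][c] - pred[r][c]) ** 2
    pred_marg = marginals(mu, prior)
    loss += sum((o - q) ** 2 for o, q in zip(observed_marginals, pred_marg))
    return loss

# Population statistics from hypothetical "true" parameters stand in for counts.
true_mu = build_mu([0.9, 0.7, 0.6], [0.2, 0.4, 0.5])
observed = second_moments(true_mu, prior=0.3)
observed_marg = marginals(true_mu, prior=0.3)

wrong_mu = build_mu([0.6, 0.6, 0.6], [0.4, 0.4, 0.4])
print(objective(true_mu, 0.3, observed, observed_marg))   # zero at the true parameters
print(objective(wrong_mu, 0.3, observed, observed_marg))  # positive elsewhere
```

The loss vanishes at the parameters that generated the statistics and grows as the candidate μ moves away from them, which is what lets gradient descent recover the accuracies from observable quantities.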

Difficulty partitioning: Weaver also clusters queries by empirical difficulty (the ratio of correct to incorrect responses, estimated from verifier agreement). Easier queries have higher base rates, harder ones lower. A separate Weaver model is fit within each difficulty bucket. This improves accuracy estimation because verifier behavior can differ dramatically between easy and hard problems — a verifier that's great on easy problems might be useless on hard ones.

Why this works without labels: The key insight is that agreement between verifiers is informative about their individual accuracies. If verifier A agrees with verifiers B, C, D, and E almost all the time, and those verifiers disagree with F, then either A-E are all right and F is wrong, or F is right and A-E are all wrong. The base rate Pr(Y=1) — estimated from the tiny dev set — breaks this symmetry. If correct responses are rare, the "minority" cluster is more likely correct. If correct responses are common, the majority cluster is. You need surprisingly few labeled samples (∼5-10) to estimate this base rate.

What is the role of the tiny dev set (~5-10 labeled samples) in Weaver's weak supervision?

It is used to train the verifier weights
It is used to select which verifiers to include
It estimates the base rate Pr(Y=1), the probability that a response is correct, which breaks the symmetry in the latent variable model and allows accuracy estimation to proceed

Chapter 6: The Key Result

Here's the headline number. Using Llama 3.3 70B Instruct as the generator, K=100 repeated samples, and an ensemble of 33 verifiers (7B-72B reward models and LM judges), Weaver achieves:

Method	MATH500	GPQA	MMLU Col.	MMLU Pro	Average
Pass@1	78.0%	42.9%	82.6%	69.9%	68.4%
Majority Voting	83.0%	47.4%	84.1%	74.4%	72.2%
Best Single RM	78.2%	49.7%	86.0%	77.0%	72.7%
Multi-Agent Verif.	81.3%	47.8%	84.1%	72.6%	71.6%
Weaver	93.4%	72.1%	94.9%	90.2%	87.7%
o3-mini	94.4%	74.0%	92.2%	86.0%	86.7%
Oracle (Pass@100)	98.6%	81.0%	96.0%	92.0%	91.9%

The $0 fine-tuning path to reasoning: Weaver with Llama 70B (87.7%) matches o3-mini (86.7%) — a model that required extensive reinforcement learning and post-training. Weaver requires zero parameter updates to the generator. The improvement comes entirely from better selection among existing responses. This is a 19.3 point jump from Pass@1 (68.4% → 87.7%), comparable to the GPT-4o → o3-mini jump (69.0% → 86.7%).

Weak-to-Strong Verification

Even at the 8B scale (Llama 3.1 8B generator + 8B verifiers), Weaver achieves 70.0% average, within 1.6 points of Multi-Agent Verification at the 70B scale (71.6%). And Weaver at 70B (87.7%) surpasses o3-mini by 1.0 point. This is a weak-to-strong phenomenon: small verifiers, when properly combined, punch far above their weight class.

Results Comparison

Figure (interactive in the original): average accuracy across the four benchmarks as a function of generator size, comparing Weaver to the baselines.

What is remarkable about Weaver achieving o3-mini-level accuracy with Llama 70B?

It shows that better selection among existing responses (through weighted verifier ensembles) can match the gains from expensive fine-tuning and post-training; the improvement comes entirely from verification, with zero parameter updates to the generator
It shows that Llama 70B is a better model than o3-mini
It shows that reward models are unnecessary for verification

Chapter 7: The Distilled Verifier

There's a catch. Running 33 verifiers (some 70B+) on every candidate response is expensive. For K=100 samples with 33 verifiers, you need 3,300 forward passes per query. That's 35.35 exaFLOPs per query set — 10x more than generation alone. The verification cost dominates.

Weaver's solution: distillation. Train a tiny model to mimic the full ensemble.

How It Works

Take a ModernBERT-Large model (396M parameters, a cross-encoder). Feed it concatenated (query, response) pairs. The training label for each pair is Weaver's posterior probability Pr(Y=1 | all verifier votes) — a soft score from 0 to 1. Train with a regression loss (MSE or cross-entropy) on this soft label.
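
The two candidate losses on a soft label can be written out in a few lines. This sketch shows only the per-example loss functions, not a training loop or the paper's training code. The clipping constant `eps` is an implementation choice made here.

```python
import math

def soft_bce(pred, target, eps=1e-12):
    """Binary cross-entropy against a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

def mse(pred, target):
    """Squared error between the predicted score and the soft target."""
    return (pred - target) ** 2

# Hypothetical Weaver posterior used as the pseudolabel for one (query, response) pair.
target = 0.8
for pred in (0.6, 0.8, 0.95):
    print(pred, round(soft_bce(pred, target), 3), round(mse(pred, target), 4))
```

Both losses are smallest when the prediction equals the soft label, so the student is pushed to reproduce Weaver's confidence rather than a hard 0/1 decision.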

Step 1: Generate Training Data

Run full Weaver on a corpus of query-response pairs. For each pair, store the Weaver posterior score as a pseudolabel. This is a one-time offline cost.

↓

Step 2: Train Cross-Encoder

Fine-tune ModernBERT-Large (396M) as a cross-encoder. Input: (query, response) concatenated. Output: scalar score matching the Weaver pseudolabel. Standard supervised learning.

↓

Step 3: Deploy

At inference, replace the full 33-verifier ensemble with the single 400M cross-encoder. One forward pass per candidate response instead of 33.

The numbers: The distilled cross-encoder retains 98.7% of Weaver's accuracy gains while reducing verification compute by 99.97%. Full Weaver costs 35.35 exaFLOPs per query set (100 samples × 33 verifiers). The 400M cross-encoder costs 1.01 exaFLOPs for the same 100 samples. That's a 35x reduction. And it only needs a single A100 GPU with 32GB memory, instead of an 8-GPU node per 70B verifier.

On GPQA Diamond, the distilled model outperforms majority voting by 23.2% while adding only 0.57% inference cost over generation alone. The distilled verifier captures the combined intelligence of all 33 verifiers in a 400M model, about 175x smaller than a 70B verifier.

Why distillation works: The Weaver ensemble score is a smooth, well-calibrated probability. It compresses the diverse signals from 33 verifiers into a single number per response. A cross-encoder is a powerful function class for mapping (query, response) pairs to scalars — it can attend to fine-grained relationships between the question and the answer. The pseudolabels from Weaver are higher quality than any individual verifier's scores, so the distilled model learns from cleaner supervision than any single teacher.

What are the inputs and outputs of the distilled Weaver cross-encoder?

Input: a concatenated (query, response) pair. Output: a scalar score predicting the Weaver ensemble's posterior probability that the response is correct. It replaces 33 verifiers with a single 400M model at 99.97% compute savings.
Input: verifier scores. Output: a weighted average.
Input: the query alone. Output: the correct answer.

Chapter 8: Analysis

The paper explores several scaling dimensions. Each reveals something about when and why Weaver works.

Scaling Generations

As K increases from 2 to 128, majority voting plateaus around K=8-16. The naive ensemble plateaus slightly later. Weaver keeps improving all the way to K=128, consistently narrowing the gap to Pass@K. The effect is most dramatic on hard tasks (GPQA Diamond) where individual verifiers are least reliable and ensembling helps most.

Scaling Verifier Count

Adding more verifiers helps, but with diminishing returns. Going from 1 to 5 verifiers yields the biggest jump (+8.5% on some tasks). From 5 to 15, gains taper off as new verifiers contribute increasingly correlated signals. Weaver consistently outperforms naive averaging at every verifier count, with the gap widest at 5-10 verifiers (+2.4% to +10.1%).

Which Verifiers Contribute?

The paper uses 33 verifiers from RewardBench and Chatbot Arena: a mix of reward models (8B to 72B) and LM judges. On average, Weaver filters out 5-10 of the 33 on any given task. The filtered verifiers tend to be models trained on very different distributions than the target task. Interestingly, no single verifier is best across all tasks — the optimal ensemble composition changes per task, which is exactly why adaptive weighting matters.

Compute-Accuracy Trade-off

Majority voting is the cheapest approach: zero verification cost. Weaver is the most expensive (33 verifier forward passes per sample). But Weaver achieves the highest accuracy and keeps scaling with more compute, while majority voting saturates early. The distilled Weaver cross-encoder sits in the sweet spot: nearly all of Weaver's accuracy at 99.97% less verification compute.

The key scaling insight: Generation and verification are complementary scaling axes. More generations give you more chances to produce a correct answer. Better verification lets you find it. But verification quality matters much more than generation quantity once Pass@K is high. Going from K=16 to K=128 improves Pass@K by ~5%, but going from naive ensemble to Weaver improves success rate by ~15%. Invest in verification.

Why does Weaver show the biggest gains on the hardest tasks (like GPQA Diamond)?

Because hard tasks have low individual verifier accuracy, so the gap between naive averaging and weighted ensembling is largest; the adaptive weights matter most when verifiers disagree frequently and their relative quality varies
Because hard tasks have more correct responses in the candidate pool
Because Weaver uses more verifiers for hard tasks

Chapter 9: Connections

Weaver sits at the intersection of several active research threads in inference-time compute and verification.

Paper	Relationship to Weaver
Large Language Monkeys (Brown et al., 2024)	Established that repeated sampling scales well: Pass@K grows reliably. Weaver addresses the follow-up question: once you have many samples, how do you select?
Scaling Test-Time Compute (Snell et al., 2024)	Showed that allocating more compute at inference (via search, verification, or revision) can outperform scaling model parameters. Weaver extends this by scaling the verification axis specifically.
Multi-Agent Verification (Lifshitz et al., 2025)	Uses multiple verifier calls but with a single verifier model (varying prompts/temperatures). Weaver uses multiple different verifier models with adaptive weighting, and outperforms Multi-Agent Verification by 16.1 points on average (87.7% vs. 71.6%).
Let's Verify Step by Step (Lightman et al., 2023)	Introduced process reward models (PRMs) that verify intermediate reasoning steps. Weaver uses outcome-level verification but could potentially incorporate PRMs as additional weak verifiers.
ARCHON (Saad-Falcon et al., 2024)	Architecture search for combining LM inference techniques (judges, rankers, generators). Weaver focuses specifically on the verification/selection component, providing a principled aggregation method that ARCHON could incorporate.
Inference Scaling Flaws (Stroebl et al., 2024)	Showed that imperfect verifiers limit the benefits of repeated sampling. Weaver directly addresses this limitation by making verifiers less imperfect through weighted ensembling.

Open questions: (1) Can process-level reward models (step-by-step verification) be integrated as verifiers in Weaver? (2) Can the distilled cross-encoder generalize across tasks, or does it need task-specific training? (3) How does Weaver interact with chain-of-thought reasoning and self-correction? (4) Can the weak supervision approach work for multimodal verification (images, code execution)? The authors suggest all of these as future directions.

Weaver demonstrates a powerful principle: you don't need a strong verifier if you have many weak ones. The combination strategy — normalize, filter, weight, aggregate — is general enough to apply wherever multiple imperfect signals exist. This connects to a deep idea in machine learning: ensembles of weak learners can be strong, whether the learners are decision stumps (boosting), annotators (crowdsourcing), or verifiers (Weaver).

What fundamental principle from ensemble learning does Weaver apply to the verification problem?

That bigger models are always better
That multiple weak learners (verifiers) making independent errors can be combined with adaptive weighting to produce a strong learner, the same principle behind boosting, crowdsourcing, and the Dawid-Skene model
That verification is always cheaper than generation

Weaver: Weak Verifiers, Strong Selection