Combine multiple imperfect verifiers into one strong verifier for test-time repeated sampling, with no labels and no fine-tuning. Llama 3.3 70B + Weaver reaches 87.7% average accuracy, on par with o3-mini (86.7%).
You have a language model that can solve math problems. Not every time — maybe 70% of the time. But if you ask it the same question 100 times with some randomness (non-zero temperature), at least one of those 100 answers is almost certainly correct. For Llama 3.3 70B on MATH500, Pass@100 — the probability that at least one of 100 samples is correct — is 98.6%.
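As a minimal sketch of how Pass@K can be estimated for one query, the snippet below uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k), where c of the n drawn samples are correct. The function name and the example counts are illustrative assumptions, not taken from the Weaver paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pr(at least one of k samples is correct),
    given that c of the n samples drawn for this query are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical hard query: 100 samples, only 3 of them correct.
print(pass_at_k(100, 3, 1))    # about 0.03: a single draw is rarely right
print(pass_at_k(100, 3, 100))  # 1.0: with all 100 samples, a correct one exists
```

A benchmark-level Pass@K is then the mean of this quantity over all queries.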
So the generation side is solved. The model can produce a correct answer. The problem is: which one?
You have 100 candidate responses. On a hard problem, only a handful may be correct, and nothing tells you which. How do you pick the right one? This is the verification problem: scoring and ranking responses to select the best candidate.
If you had a perfect verifier, an oracle that always identifies the correct response, you would pick a correct answer whenever one exists among the candidates. But perfect verifiers only exist for narrow domains (Lean for formal proofs, unit tests for code). For open-ended math, science, and reasoning, all we have are weak verifiers: reward models and LM judges that are noisy, biased, and inconsistent.
Weaver's question: can you combine many weak verifiers into one strong verifier? Not by averaging (that assumes they're equally good, and they're not). Not by supervised learning (that requires labels you don't have). But through weak supervision — estimating each verifier's accuracy from the data alone, then weighting them accordingly.
Interactive figure: Pass@K versus weak-verifier success rate as the number of candidate responses K increases. Pass@K (the chance that a correct answer exists) soars, while the success rate of a weak verifier barely moves. The gap between them is the generation-verification gap.
Individual verifiers are weak. A single reward model might have 55% accuracy at picking the correct response from a pair. A single LM judge might score 60%. Neither is reliable enough to justify generating 100 candidates — you'd pick wrong almost half the time.
But here's the thing: they're wrong in different ways. A reward model trained on preference data might be biased toward longer responses. An LM judge might struggle with formal math but excel at reasoning chains. A different reward model might be great at code but poor at science. Their errors are correlated with different features of the problem.
The result: Llama 3.3 70B Instruct with Weaver verification achieves 87.7% average accuracy across MATH500, GPQA Diamond, MMLU College, and MMLU Pro. That matches o3-mini (86.7%) — a model that required extensive fine-tuning and post-training. Weaver gets there with zero parameter updates, using only off-the-shelf verifiers.
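Let's quantify the problem. For a given task and model, we can measure two things: Pass@K, the probability that at least one of K sampled responses is correct, and the success rate, the probability that the response the verifier actually selects is correct. The generation-verification gap is the difference between the two.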
A large gap means the model generates correct answers but the verifier fails to find them. An oracle verifier (perfect accuracy) closes the gap to zero.
How bad is the gap in practice? Consider Llama 3.3 70B on GPQA Diamond with K=100:
| Method | Success Rate | Gap to Pass@100 |
|---|---|---|
| Pass@1 (no verification) | 42.9% | 38.1% |
| Majority Voting | 47.4% | 33.6% |
| Best Single RM | 49.7% | 31.3% |
| Naive Ensemble (top-10 RMs) | 41.3% | 39.7% |
| Multi-Agent Verification | 47.8% | 33.2% |
| Weaver | 72.1% | 8.9% |
| Oracle (Pass@100) | 81.0% | 0% |
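
The gap column is just Pass@100 minus each method's success rate. A small sketch that recomputes it from the table's own numbers (the function and variable names are illustrative):

```python
# Success rates (%) for Llama 3.3 70B on GPQA Diamond at K=100, from the table above.
PASS_AT_100 = 81.0
success_rate = {
    "Pass@1 (no verification)": 42.9,
    "Majority Voting": 47.4,
    "Best Single RM": 49.7,
    "Naive Ensemble (top-10 RMs)": 41.3,
    "Multi-Agent Verification": 47.8,
    "Weaver": 72.1,
}

def generation_verification_gap(pass_at_k: float, success: float) -> float:
    """Gap in percentage points between what was generated and what was selected."""
    return round(pass_at_k - success, 1)

gaps = {m: generation_verification_gap(PASS_AT_100, s) for m, s in success_rate.items()}
for method, gap in gaps.items():
    print(f"{method}: {gap} points below Pass@100")
```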
Why does naive ensembling fail? Because it treats all verifiers equally. If 7 of your 10 reward models are bad at GPQA but okay at MATH, the naive average is dominated by noise on GPQA. Weighting is essential.
Why does majority voting plateau? Because it ignores verifier scores entirely — it just picks the most common final answer. Once you have ~8-16 samples, the distribution of final answers stabilizes and more samples don't help. Verification, on the other hand, can keep scaling by using richer signals from the verifier scores.
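For reference, majority voting needs nothing but the extracted final answers. A minimal sketch (the answer strings are made up for illustration):

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Return the most common final answer; verifier scores are never consulted.
    Ties go to the answer that appeared first."""
    return Counter(final_answers).most_common(1)[0][0]

# The most frequent answer wins.
print(majority_vote(["12", "15", "12", "7", "12", "15"]))  # 12

# If the correct answer is "42" but the model usually says "41",
# extra samples only reinforce the wrong mode.
print(majority_vote(["41", "41", "41", "42", "40"]))  # 41
```

The second call shows why more samples stop helping: once the answer distribution stabilizes, the selected answer no longer changes, even when a correct response is sitting among the candidates.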
If naive averaging fails because it ignores accuracy differences, the natural fix is: learn weights. Give more weight to verifiers that are more accurate. The paper tests this with oracle data first (to see the ceiling), then without labels (Weaver).
Given ground-truth labels for all query-response pairs, you can train a logistic regression or Naive Bayes classifier to estimate P(correct | all verifier scores). This is an "oracle" approach because it uses labels you wouldn't have in practice.
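A minimal pure-Python sketch of the Naive Bayes variant of this oracle baseline, assuming binary verifier votes and a tiny made-up labeled set. The function names, the Laplace smoothing, and the toy data are assumptions for illustration, not the paper's implementation.

```python
import math

def fit_naive_bayes(votes, labels):
    """votes[i][j] in {0, 1} is verifier j's vote on pair i; labels[i] in {0, 1}.
    Assumes both classes occur in labels."""
    m = len(votes[0])
    pos = [v for v, y in zip(votes, labels) if y == 1]
    neg = [v for v, y in zip(votes, labels) if y == 0]
    prior = len(pos) / len(votes)
    # Laplace smoothing (+1 / +2) keeps every rate strictly inside (0, 1).
    tpr = [(sum(v[j] for v in pos) + 1) / (len(pos) + 2) for j in range(m)]
    fpr = [(sum(v[j] for v in neg) + 1) / (len(neg) + 2) for j in range(m)]
    return prior, tpr, fpr

def posterior_correct(vote_row, prior, tpr, fpr):
    """P(correct | all verifier votes) under conditional independence."""
    log_odds = math.log(prior / (1 - prior))
    for s, t, f in zip(vote_row, tpr, fpr):
        log_odds += math.log(t / f) if s == 1 else math.log((1 - t) / (1 - f))
    return 1 / (1 + math.exp(-log_odds))

# Toy data: verifier 0 tracks the label, verifier 1 votes "yes" regardless.
votes = [[1, 1], [1, 0], [1, 1], [0, 1], [0, 0], [0, 1]]
labels = [1, 1, 1, 0, 0, 0]
prior, tpr, fpr = fit_naive_bayes(votes, labels)
print(posterior_correct([1, 0], prior, tpr, fpr))  # high: the informative verifier said yes
print(posterior_correct([0, 1], prior, tpr, fpr))  # low: only the uninformative one said yes
```

In this toy set the uninformative verifier ends up with identical true-positive and false-positive rates, so its votes contribute zero log-odds. That is the behavior a naive average cannot reproduce.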
The result is dramatic: weighted ensembles outperform naive ensembles by up to 11.2 percentage points. On GPQA Diamond specifically, the oracle weighted ensemble reaches 68.2% vs. the naive ensemble's 41.3% — a 26.9 point gap. The weights reveal that verifier accuracy varies wildly: a 37.5% spread in individual verifier success rates across the 33 models tested.
But we don't have labels. So how do we estimate weights? Weaver adapts the Dawid-Skene model from the weak supervision literature. The core idea: if we assume each verifier makes errors independently (conditional on the true label), we can estimate their individual accuracies by analyzing their agreement patterns.
Think of it like this. You have 33 annotators labeling data, but no answer key. If annotators A and B agree 90% of the time, and annotators A and C agree 85% of the time, and B and C agree 82% of the time — then A is probably the most accurate. You can back out individual accuracies from pairwise agreement rates, even without knowing the true labels.
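The annotator example can be worked through numerically. The sketch below assumes a simplified model: each annotator has one symmetric accuracy, is better than chance, and errs independently given the true label. Under those assumptions, 2·agree(i, j) − 1 = (2·acc_i − 1)(2·acc_j − 1), and three pairwise agreement rates pin down three accuracies. This is an illustration of the idea, not Weaver's actual estimator.

```python
import math

def accuracies_from_agreement(agree_ab: float, agree_ac: float, agree_bc: float):
    """Recover three annotators' accuracies from their pairwise agreement rates.

    Uses 2 * agree_ij - 1 == (2 * acc_i - 1) * (2 * acc_j - 1), which holds for
    symmetric annotators whose errors are independent given the true label.
    """
    e_ab, e_ac, e_bc = (2 * r - 1 for r in (agree_ab, agree_ac, agree_bc))
    edge_a = math.sqrt(e_ab * e_ac / e_bc)  # edge = 2 * accuracy - 1
    edge_b = math.sqrt(e_ab * e_bc / e_ac)
    edge_c = math.sqrt(e_ac * e_bc / e_ab)
    return tuple((1 + e) / 2 for e in (edge_a, edge_b, edge_c))

# Agreement rates from the example above: A-B 90%, A-C 85%, B-C 82%.
acc_a, acc_b, acc_c = accuracies_from_agreement(0.90, 0.85, 0.82)
print(f"A={acc_a:.3f}  B={acc_b:.3f}  C={acc_c:.3f}")  # roughly A=0.968, B=0.928, C=0.874
```

No labels were used, yet A comes out as the most accurate annotator, matching the intuition in the text.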
Interactive figure: each bar is a verifier with a different accuracy; the number of verifiers included is adjustable. The weighted ensemble score (orange line) is compared with the unweighted average (gray line). Poor verifiers drag the unweighted average down.
Directly applying weak supervision to verifier ensembles doesn't work out of the box. Verifiers produce wildly different output formats (continuous scores, binary labels, Likert scales), some verifiers are worse than random on certain tasks, and the accuracy estimation is sensitive to these issues. Weaver solves each with a specific engineering decision.
Weak supervision models expect binary votes. But reward models output continuous scores (e.g., 0.73). Weaver binarizes every verifier's output: a response is "correct" (1) if its score is above the verifier's median score across all responses for that query, and "incorrect" (0) otherwise. The median threshold is computed per-query, not globally — this handles the fact that some queries produce universally low or high scores.
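A minimal sketch of the per-query median binarization for a single verifier. The scores are made up, and the tie handling (a score exactly at the median maps to 0) is an assumption that follows the "above the median" wording.

```python
from statistics import median

def binarize_per_query(scores: list[float]) -> list[int]:
    """Turn one verifier's scores for the K responses to a single query into
    binary votes: 1 if strictly above that query's median score, else 0."""
    threshold = median(scores)
    return [1 if s > threshold else 0 for s in scores]

# A query where the verifier is generous overall...
print(binarize_per_query([0.91, 0.73, 0.40, 0.88, 0.15]))  # [1, 0, 0, 1, 0]
# ...and one where every score is low; the per-query median still splits them.
print(binarize_per_query([0.12, 0.05, 0.09, 0.11]))        # [1, 0, 0, 1]
```

A single global threshold such as 0.5 would have marked every response in the second query as incorrect, which is the failure the per-query median avoids.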
Some verifiers are genuinely harmful. A reward model that was trained on a different distribution might have negative correlation with correctness on a given task. Including it in the ensemble makes things worse, not better.
Weaver estimates each verifier's accuracy using the weak supervision model, then discards any verifier whose estimated accuracy is below 50% (i.e., worse than a coin flip). This is conservative but effective: on GPQA Diamond, the filter removes 8 of 33 verifiers, and the remaining 25 are all genuinely informative.
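The filtering rule itself is a one-liner once accuracy estimates exist. In this sketch the verifier names and accuracy values are hypothetical placeholders:

```python
def drop_adversarial(estimated_accuracy: dict[str, float],
                     threshold: float = 0.5) -> dict[str, float]:
    """Keep only verifiers whose estimated accuracy is at least the threshold."""
    return {name: acc for name, acc in estimated_accuracy.items() if acc >= threshold}

# Hypothetical label-free accuracy estimates for four verifiers on one task.
estimates = {"rm_a": 0.71, "rm_b": 0.46, "judge_c": 0.58, "rm_d": 0.39}
kept = drop_adversarial(estimates)
print(sorted(kept))  # ['judge_c', 'rm_a']
```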
Using the Dawid-Skene objective (Equation 5 from the paper), Weaver estimates the accuracy parameter for each surviving verifier. These become the ensemble weights.
Given binary votes s̄1, ..., s̄m from the filtered verifiers, the ensemble score for each query-response pair is the posterior probability:
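Written out with Bayes' rule under the conditional-independence assumption (a reconstruction in this article's notation, where Pr(S_i | Y) are the estimated per-verifier accuracy parameters):

```latex
\Pr(Y = 1 \mid \bar{s}_1, \ldots, \bar{s}_m)
  = \frac{\Pr(Y = 1) \prod_{i=1}^{m} \Pr(S_i = \bar{s}_i \mid Y = 1)}
         {\sum_{y \in \{0, 1\}} \Pr(Y = y) \prod_{i=1}^{m} \Pr(S_i = \bar{s}_i \mid Y = y)}
```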
The response with the highest posterior is selected. That's it — the entire pipeline.
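A minimal sketch of the selection step, under a simplifying assumption that each verifier has a single symmetric accuracy a_i. In that case each vote shifts the posterior log-odds by ±log(a_i / (1 − a_i)), which is why accuracies act as ensemble weights. The function names, accuracies, and votes are illustrative, not from the paper.

```python
import math

def weighted_posterior(votes, accuracies, prior=0.5):
    """Pr(Y=1 | votes), assuming symmetric accuracies and independent verifiers."""
    log_odds = math.log(prior / (1 - prior))
    for s, a in zip(votes, accuracies):
        weight = math.log(a / (1 - a))  # more accurate verifier, larger weight
        log_odds += weight if s == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))

def select_response(vote_matrix, accuracies, prior=0.5):
    """Return the index of the response with the highest posterior, plus all scores."""
    scores = [weighted_posterior(v, accuracies, prior) for v in vote_matrix]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

accuracies = [0.85, 0.60, 0.55]   # hypothetical estimated accuracies
vote_matrix = [
    [0, 1, 1],  # response 0: two weak verifiers approve
    [1, 0, 0],  # response 1: only the strong verifier approves
    [1, 1, 0],  # response 2: the strong verifier and one weak verifier approve
]
best, scores = select_response(vote_matrix, accuracies)
print(best, [round(s, 3) for s in scores])
```

Response 1 outscores response 0 even though it has fewer "yes" votes, because its single vote comes from the most accurate verifier. An unweighted average would rank them the other way.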
Interactive figure: the full pipeline on a batch of responses. Raw verifier scores are binarized, filtered, weighted by estimated accuracy, and combined into a posterior. Orange = correct response, gray = incorrect. The star marks Weaver's selection.
Let's look inside the math that estimates verifier accuracies. The weak supervision model treats the true label Y as a latent variable — something we can't observe directly but can infer from the pattern of verifier votes.
Weaver assumes a simple generative story: for each query-response pair, nature first decides if the response is correct (Y=1) or not (Y=0). Then each verifier independently "votes" based on the true label, with some probability of getting it right. Formally:
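In symbols, a sketch of that generative story (the prior symbol π = Pr(Y = 1) is notation introduced here for convenience):

```latex
Y \sim \mathrm{Bernoulli}(\pi), \qquad
\Pr(S_1, \ldots, S_m, Y) = \Pr(Y) \prod_{i=1}^{m} \Pr(S_i \mid Y)
```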
We can compute from data: Pr(Si=1) for each verifier (its "yes" rate), and Pr(Si=a, Sj=b) for each pair of verifiers (their joint agreement table). These are just counting exercises over the nK query-response pairs.
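A sketch of those counting exercises over a matrix of binary votes, with one row per query-response pair and one column per verifier. The helper names and the four-row example are illustrative.

```python
def yes_rate(votes: list[list[int]], i: int) -> float:
    """Empirical Pr(S_i = 1): the fraction of pairs verifier i votes 'correct' on."""
    return sum(row[i] for row in votes) / len(votes)

def joint_table(votes: list[list[int]], i: int, j: int) -> list[list[float]]:
    """2x2 table T with T[a][b] = empirical Pr(S_i = a, S_j = b)."""
    counts = [[0, 0], [0, 0]]
    for row in votes:
        counts[row[i]][row[j]] += 1
    n = len(votes)
    return [[c / n for c in r] for r in counts]

# Four query-response pairs, three verifiers.
votes = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 1],
         [1, 1, 1]]
print(yes_rate(votes, 0))        # 0.75
print(joint_table(votes, 0, 1))  # [[0.25, 0.0], [0.25, 0.5]]
```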
Using the conditional independence assumption, the joint probability of any verifier pair factors as:
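Spelled out in this article's notation (a reconstruction that follows directly from conditional independence, for a, b ∈ {0, 1}):

```latex
\Pr(S_i = a, S_j = b)
  = \sum_{y \in \{0, 1\}} \Pr(Y = y) \, \Pr(S_i = a \mid Y = y) \, \Pr(S_j = b \mid Y = y)
```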
The left side is known. The right side depends on the accuracy parameters αk. Weaver constructs a matrix objective that minimizes the difference between observed and predicted pairwise statistics, plus a consistency constraint that marginals match:
where μ is the matrix of accuracy parameters, P is the diagonal matrix of class priors, and O is the matrix of observed joint probabilities. This is optimized via gradient descent.
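One way to write an objective that fits this description is sketched below. It is an assumption, not a transcription of the paper's Equation 5. Here μ_i is the 2×2 block with entries Pr(S_i = a | Y = y), P = diag(Pr(Y = 0), Pr(Y = 1)), O_ij is the observed 2×2 joint table for verifiers i and j, and O_i is the observed marginal vector for verifier i:

```latex
\min_{\mu} \;
  \sum_{i \neq j} \left\lVert O_{ij} - \mu_i P \mu_j^{\top} \right\rVert_F^2
  \; + \;
  \sum_{i} \left\lVert O_i - \mu_i P \mathbf{1} \right\rVert_2^2
```

The (a, b) entry of μ_i P μ_j^T is Σ_y Pr(S_i = a | Y = y) Pr(Y = y) Pr(S_j = b | Y = y), which is the pairwise factorization, and μ_i P 1 gives the predicted marginals Pr(S_i = a).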
Here's the headline number. Using Llama 3.3 70B Instruct as the generator, K=100 repeated samples, and an ensemble of 33 verifiers (7B-72B reward models and LM judges), Weaver achieves:
| Method | MATH500 | GPQA | MMLU Col. | MMLU Pro | Average |
|---|---|---|---|---|---|
| Pass@1 | 78.0% | 42.9% | 82.6% | 69.9% | 68.4% |
| Majority Voting | 83.0% | 47.4% | 84.1% | 74.4% | 72.2% |
| Best Single RM | 78.2% | 49.7% | 86.0% | 77.0% | 72.7% |
| Multi-Agent Verif. | 81.3% | 47.8% | 84.1% | 72.6% | 71.6% |
| Weaver | 93.4% | 72.1% | 94.9% | 90.2% | 87.7% |
| o3-mini | 94.4% | 74.0% | 92.2% | 86.0% | 86.7% |
| Oracle (Pass@100) | 98.6% | 81.0% | 96.0% | 92.0% | 91.9% |
Even at the 8B scale (Llama 3.1 8B generator + 8B verifiers), Weaver achieves a 70.0% average, within 1.6 points of Multi-Agent Verification (71.6%) and 2.2 points of majority voting (72.2%) at the 70B scale. And Weaver at 70B (87.7%) surpasses o3-mini (86.7%) by 1.0 point. This is a weak-to-strong phenomenon: small verifiers, when properly combined, punch far above their weight class.
Interactive figure: average accuracy across the four benchmarks as a function of generation model size, comparing Weaver (orange) to the baselines.
There's a catch. Running 33 verifiers (some 70B+) on every candidate response is expensive. For K=100 samples with 33 verifiers, you need 3,300 forward passes per query. That's 35.35 exaFLOPs per query set — 10x more than generation alone. The verification cost dominates.
Weaver's solution: distillation. Train a tiny model to mimic the full ensemble.
Take a ModernBERT-Large model (396M parameters, a cross-encoder). Feed it concatenated (query, response) pairs. The training label for each pair is Weaver's posterior probability Pr(Y=1 | all verifier votes) — a soft score from 0 to 1. Train with a regression loss (MSE or cross-entropy) on this soft label.
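The two candidate losses on a soft label can be sketched in a few lines. This shows only the per-example loss, not a training loop for the cross-encoder. The function names and the example target of 0.8 are illustrative assumptions.

```python
import math

def soft_cross_entropy(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy against a soft target in [0, 1]."""
    p = min(max(pred, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def mse(pred: float, target: float) -> float:
    """Squared error against the same soft target."""
    return (pred - target) ** 2

# Suppose Weaver's posterior for some (query, response) pair is 0.8.
target = 0.8
for pred in (0.5, 0.8, 0.95):
    print(pred, round(soft_cross_entropy(pred, target), 4), round(mse(pred, target), 4))
```

Both losses are minimized when the student's prediction equals the soft target, so the student is pushed to reproduce the ensemble's confidence rather than a hard 0/1 decision.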
On GPQA Diamond, the distilled model outperforms majority voting by 23.2% while adding only 0.57% inference cost over generation alone. The distilled verifier captures much of the combined signal of all 33 verifiers in a 396M-parameter model: roughly 175x smaller than the 70B-class verifiers in the ensemble, and about 20x smaller than the 8B ones.
The paper explores several scaling dimensions. Each reveals something about when and why Weaver works.
As K increases from 2 to 128, majority voting plateaus around K=8-16. The naive ensemble plateaus slightly later. Weaver keeps improving all the way to K=128, consistently narrowing the gap to Pass@K. The effect is most dramatic on hard tasks (GPQA Diamond) where individual verifiers are least reliable and ensembling helps most.
Adding more verifiers helps, but with diminishing returns. Going from 1 to 5 verifiers yields the biggest jump (+8.5% on some tasks). From 5 to 15, gains taper off as new verifiers contribute increasingly correlated signals. Weaver consistently outperforms naive averaging at every verifier count, with the gap widest at 5-10 verifiers (+2.4% to +10.1%).
The paper uses 33 verifiers from RewardBench and Chatbot Arena: a mix of reward models (8B to 72B) and LM judges. On average, Weaver filters out 5-10 of the 33 on any given task. The filtered verifiers tend to be models trained on very different distributions than the target task. Interestingly, no single verifier is best across all tasks — the optimal ensemble composition changes per task, which is exactly why adaptive weighting matters.
Majority voting is the cheapest approach: zero verification cost. Weaver is the most expensive (33 verifier forward passes per sample). But Weaver achieves the highest accuracy and keeps scaling with more compute, while majority voting saturates early. The distilled Weaver cross-encoder sits in the sweet spot: nearly all of Weaver's accuracy at 99.97% less verification compute.
Weaver sits at the intersection of several active research threads in inference-time compute and verification.
| Paper | Relationship to Weaver |
|---|---|
| Large Language Monkeys (Brown et al., 2024) | Established that repeated sampling scales well: Pass@K grows reliably. Weaver addresses the follow-up question: once you have many samples, how do you select? |
| Scaling Test-Time Compute (Snell et al., 2024) | Showed that allocating more compute at inference (via search, verification, or revision) can outperform scaling model parameters. Weaver extends this by scaling the verification axis specifically. |
| Multi-Agent Verification (Lifshitz et al., 2025) | Uses multiple verifier calls but with a single verifier model (varying prompts/temperatures). Weaver uses multiple different verifier models with adaptive weighting — and outperforms Multi-Agent Verification by 16.1% on average. |
| Let's Verify Step by Step (Lightman et al., 2023) | Introduced process reward models (PRMs) that verify intermediate reasoning steps. Weaver uses outcome-level verification but could potentially incorporate PRMs as additional weak verifiers. |
| ARCHON (Saad-Falcon et al., 2024) | Architecture search for combining LM inference techniques (judges, rankers, generators). Weaver focuses specifically on the verification/selection component, providing a principled aggregation method that ARCHON could incorporate. |
| Inference Scaling Flaws (Stroebl et al., 2024) | Showed that imperfect verifiers limit the benefits of repeated sampling. Weaver directly addresses this limitation by making verifiers less imperfect through weighted ensembling. |
Weaver demonstrates a powerful principle: you don't need a strong verifier if you have many weak ones. The combination strategy — normalize, filter, weight, aggregate — is general enough to apply wherever multiple imperfect signals exist. This connects to a deep idea in machine learning: ensembles of weak learners can be strong, whether the learners are decision stumps (boosting), annotators (crowdsourcing), or verifiers (Weaver).