PRM800K needed 800,000 human step-level labels to train a process reward model. Math-Shepherd gets them for free — sample completions from each reasoning step, check if any reach the right answer, and you have automatic step-level supervision. Mistral-7B jumps from 77.9% to 84.1% on GSM8K.
An LLM writes a 6-step solution to a math problem. Step 4 is wrong — it divides instead of multiplying. But the final answer happens to be correct anyway (the errors cancel). An outcome reward model (ORM) sees the correct answer and says "good solution." A process reward model (PRM) catches the error at step 4 and says "bad step here."
PRMs are more reliable verifiers. They catch lucky-but-wrong solutions, and they give more useful training signal for reinforcement learning. The problem? Training a PRM requires a human to label every single step of every single solution as correct or incorrect.
The outcome reward model is easy to train automatically: generate solutions, check final answers against ground truth, label the whole solution correct or incorrect. But the PRM needs step-level labels. Where do those come from without humans?
A 5-step solution with an error at step 3. The ORM only sees the final answer. The PRM evaluates each step. Click steps to see the difference.
Imagine you are playing chess and you reach a particular board position after move 12. Is move 12 good or bad? One way to evaluate: play out the rest of the game many times from this position. If many continuations lead to a win, the position (and hence the move that led to it) is probably good. If most continuations lead to a loss, the move was probably bad.
This is exactly the idea behind Monte Carlo Tree Search (MCTS), and Math-Shepherd applies it to reasoning steps. To evaluate step k of a math solution:
This is cheap. The completer is just a fine-tuned LLM you already have. You run it N=8 times per step. No human in the loop. The labels are noisy — some correct steps might get unlucky completions, some incorrect steps might get lucky ones — but on average, the signal is remarkably good.
Math-Shepherd proposes two methods to convert completion results into step labels. Both start the same way: given step si, sample N completions, get N final answers {a1, ..., aN}, and compare each to the gold answer a*.
Binary: is there any completion that reaches the correct answer?
If even one out of N completions gets the right answer, label the step as "has potential" (correct). Otherwise, "no potential" (incorrect). This is aggressive — it gives the step the benefit of the doubt.
Continuous: what fraction of completions reach the correct answer?
This gives a probability-like score between 0 and 1. A step where 7/8 completions succeed gets 0.875; one where 1/8 succeeds gets 0.125.
Drag the slider to change how many of N=8 completions reach the correct answer. Compare the HE and SE labels.
Problem: "What is 15% of 80?" Gold answer: 12.
Step 1: "15% means 15/100 = 0.15." We sample 8 completions from step 1 onward. Results: 12, 12, 12, 12, 12, 12, 12, 12. All correct. HE = 1, SE = 1.0.
Step 2: "So 0.15 × 80 = 15." (This is wrong — should be 12.) We sample 8 completions from step 2 onward. Results: 15, 15, 15, 15, 15, 15, 12, 15. One completion somehow self-corrected. HE = 1 (benefit of the doubt!), SE = 0.125.
Step 3: "The answer is 15." Results from step 3: 15, 15, 15, 15, 15, 15, 15, 15. None correct. HE = 0, SE = 0.0.
This is the full pipeline that replaces 800K human labels with zero human effort. Let us walk through every stage with concrete numbers from the paper.
Take a base model (e.g., LLemma-7B). Fine-tune it for 3 epochs on MetaMATH, a dataset of math problems with chain-of-thought solutions. This model now serves two roles: it generates candidate solutions, and it completes partial solutions for step annotation.
For each problem in the GSM8K and MATH training sets, sample 15 solutions from each of a 7B and 13B generator. Remove duplicates. This yields roughly 170K solutions for GSM8K and 270K solutions for MATH.
For each step si in each solution, run the completer (LLemma-7B) N=8 times from step i onward. Check if each completion's final answer matches the gold answer. Apply hard estimation (HE) to assign a binary label to every step.
Each labeled solution becomes a training example. The step labels are encoded as special tokens appended after each step: a "+" token for "has potential" and a "−" token for "no potential."
The full pipeline from a single math problem to labeled PRM training data. Click "Run Pipeline" to animate each stage.
Here is what a labeled training example looks like in the paper's format:
Problem: What is 15% of 80? Step 1: 15% means 15/100 = 0.15 + # 8/8 completions correct Step 2: 0.15 × 80 = 12 + # 8/8 completions correct Step 3: The answer is 12. + # 8/8 completions correct
Problem: What is 15% of 80? Step 1: 15% means 15/100 = 0.15 + # 7/8 completions correct Step 2: 0.15 × 80 = 15 − # 0/8 completions correct (error!) Step 3: The answer is 15. − # 0/8 completions correct
With automatically labeled data in hand, training the PRM is straightforward. It shares the same architecture as an ORM — a language model with a classification head — but is trained on step-level labels instead of solution-level ones.
The PRM is trained with binary cross-entropy loss at each step:
where K is the number of steps, ysi is the label (1 = has potential, 0 = no potential), and rsi is the sigmoid output of the reward model at step i.
Compare this to the ORM loss, which is identical but with a single label for the entire solution:
The reward model is trained for 1 epoch with learning rate 1e-6. For verification tasks, the paper trains on LLaMA2-70B (GSM8K) and LLemma-34B (MATH) as base reward models. For RL, Mistral-7B is used. Max sequence length is 512 tokens.
Consider a 6-step solution where the answer is correct. The ORM gets 1 training signal: the whole solution is "correct." The PRM gets 6 training signals, one per step. Now consider a solution where the answer is wrong because of an error at step 4. The ORM says "incorrect" for the whole thing. The PRM says "steps 1-3 are fine, steps 4-6 are where things went wrong." This is far more informative.
The first application of Math-Shepherd: best-of-N verification. Generate N candidate solutions to a problem, score each with the PRM, and pick the highest-scoring one.
The PRM gives a score rsi to each step. To get a single score for the entire solution, Math-Shepherd takes the minimum across all steps:
Why the minimum? A chain of reasoning is only as strong as its weakest link. If any single step has low confidence, the whole solution is suspect.
Math-Shepherd also explores combining PRM scores with majority voting. Group the N solutions by their final answer, then score each group:
This weights each vote by the reward model's confidence. A solution that the PRM trusts counts more than one it does not.
Four candidate solutions to one problem. Each has step-level PRM scores. The solution with the highest minimum score is selected. Click "Score Solutions" to see the verification in action.
The second application: using Math-Shepherd as a reward model during reinforcement learning. Standard RLHF with an ORM gives a single reward at the end of the response. Math-Shepherd gives a reward at the end of every step.
In standard ORM-PPO, the LLM generates a complete solution, and the ORM gives one reward signal for the whole thing. The credit assignment problem is severe: if the solution is wrong, which step caused the failure? PPO has to figure that out from a single scalar.
In step-by-step PPO with Math-Shepherd, the PRM scores each step independently. The reward at step i is the PRM's prediction rsi. This is dense reward shaping — every step gets direct feedback about whether it is on track.
Consider two solutions that both get the wrong final answer. Solution A goes wrong at step 1. Solution B is correct through step 5 and only fails at step 6. With ORM-PPO, both get the same reward: 0. The model cannot distinguish them. With step-PPO, Solution B gets positive rewards for steps 1-5 and a negative reward only at step 6. The model learns that steps 1-5 were good moves, even though the overall solution failed.
Two solutions that both get the wrong answer. Toggle between ORM (sparse) and PRM (dense) reward to see the difference in training signal.
Math-Shepherd is evaluated across two benchmarks (GSM8K, MATH), two use cases (verification, RL), and models from 7B to 70B. The results are strong across the board.
The headline numbers for Mistral-7B, fine-tuned on MetaMATH:
Step-by-step PPO beats ORM-PPO by 2.3 points on GSM8K and 1.7 points on MATH. Both beat RFT (rejection sampling fine-tuning) by a wide margin.
Using Math-Shepherd as a verifier to rerank 256 candidate solutions:
The 93.3% on GSM8K with DeepSeek-67B approaches GPT-4 (94.4%). On MATH, 48.1% approaches GPT-4 (56.2%) with an open-source model.
On MATH with LLaMA2-70B, Math-Shepherd (automatically labeled) outperforms PRM800K (human-labeled) across all values of N. Why? Two reasons: (1) the data is 4x larger, and (2) PRM800K was labeled on GPT-4 outputs, creating a distribution mismatch when applied to open-source model outputs.
Accuracy on GSM8K and MATH across methods. Drag to compare benchmarks.
How does the quality of automatic process annotations depend on the number of completions N and the completer model size?
The paper manually annotated 160 steps from GSM8K training data and measured how well automatic labels match. With LLaMA2-70B as completer at N=4: 86% accuracy (hard estimation). As N increases beyond 4, accuracy drops for HE because more completions increase false positives — a wrong step is more likely to have at least one lucky completion.
Larger completers produce higher-quality labels. At N=4, the 70B completer achieves 86% accuracy while the 7B completer is lower. However, the paper uses the 7B completer in practice (LLemma-7B) for cost efficiency — the small accuracy gap does not significantly impact downstream PRM quality.
Measuring cross-entropy loss between automatic and human labels: SE progressively aligns closer to the human distribution as N increases. HE does not improve with larger N. Despite this, the final PRM quality is similar whether trained on SE or HE — suggesting the PRM is robust to moderate label noise.
For 440K solutions with ~6 steps each at N=8 completions per step:
Each completion is a few hundred tokens from a 7B model. On modern GPUs, this is a few hundred GPU-hours — orders of magnitude cheaper than hiring expert annotators for 800K steps.
How HE accuracy and SE calibration change as N (completions per step) increases. Drag to explore.
Math-Shepherd sits at a key intersection: it connects process reward models, Monte Carlo methods, and scalable RL for reasoning. Here is how it links to the broader landscape.