Peking University & DeepSeek-AI — 2024

Math-Shepherd: Automatic Process Rewards

PRM800K needed 800,000 human step-level labels to train a process reward model. Math-Shepherd gets them for free — sample completions from each reasoning step, check if any reach the right answer, and you have automatic step-level supervision. Mistral-7B jumps from 77.9% to 84.1% on GSM8K.

Prerequisites: Reward models (ORM/PRM) + PPO basics + Chain-of-thought reasoning
10
Chapters
5+
Simulations

Chapter 0: The Problem

An LLM writes a 6-step solution to a math problem. Step 4 is wrong — it divides instead of multiplying. But the final answer happens to be correct anyway (the errors cancel). An outcome reward model (ORM) sees the correct answer and says "good solution." A process reward model (PRM) catches the error at step 4 and says "bad step here."

PRMs are more reliable verifiers. They catch lucky-but-wrong solutions, and they give more useful training signal for reinforcement learning. The problem? Training a PRM requires a human to label every single step of every single solution as correct or incorrect.

The PRM800K bottleneck: Lightman et al. (2023) hired expert annotators to label 800,000 individual reasoning steps. Each step was classified as good, neutral, or bad. This is extraordinarily expensive — you need annotators who understand multi-step math, and labeling one solution with 8 steps takes 8 judgment calls. It does not scale to new domains, new models, or new difficulty levels.

The outcome reward model is easy to train automatically: generate solutions, check final answers against ground truth, label the whole solution correct or incorrect. But the PRM needs step-level labels. Where do those come from without humans?

ORM vs. PRM Supervision

A 5-step solution with an error at step 3. The ORM only sees the final answer. The PRM evaluates each step. Click steps to see the difference.

Why is an ORM insufficient for catching reasoning errors in math solutions?

Chapter 1: The Key Insight

Imagine you are playing chess and you reach a particular board position after move 12. Is move 12 good or bad? One way to evaluate: play out the rest of the game many times from this position. If many continuations lead to a win, the position (and hence the move that led to it) is probably good. If most continuations lead to a loss, the move was probably bad.

This is exactly the idea behind Monte Carlo Tree Search (MCTS), and Math-Shepherd applies it to reasoning steps. To evaluate step k of a math solution:

1. Freeze Steps 1 through k
Take the problem statement plus the first k steps of the solution exactly as written.
2. Sample N Completions
Use a fine-tuned LLM (the "completer") to generate N independent continuations from step k onward, each producing a final answer.
3. Check Final Answers
Compare each completion's final answer to the known ground-truth answer.
4. Estimate Step Quality
If enough completions reach the correct answer, step k is labeled as "correct." Otherwise, "incorrect."
The core realization: A correct reasoning step preserves the potential to reach the right answer. An incorrect step destroys that potential. You do not need a human to judge correctness directly — you just measure what happens downstream. If no continuation from step k can reach the right answer, step k almost certainly introduced an error.

This is cheap. The completer is just a fine-tuned LLM you already have. You run it N=8 times per step. No human in the loop. The labels are noisy — some correct steps might get unlucky completions, some incorrect steps might get lucky ones — but on average, the signal is remarkably good.

What does Math-Shepherd use as a proxy signal for whether a reasoning step is correct?

Chapter 2: Complement Step Estimation

Math-Shepherd proposes two methods to convert completion results into step labels. Both start the same way: given step si, sample N completions, get N final answers {a1, ..., aN}, and compare each to the gold answer a*.

Hard Estimation (HE)

Binary: is there any completion that reaches the correct answer?

ysiHE = 1   if ∃ aj ∈ A such that aj = a*   ;   0 otherwise

If even one out of N completions gets the right answer, label the step as "has potential" (correct). Otherwise, "no potential" (incorrect). This is aggressive — it gives the step the benefit of the doubt.

Soft Estimation (SE)

Continuous: what fraction of completions reach the correct answer?

ysiSE = (1/N) ∑j=1N I(aj = a*)

This gives a probability-like score between 0 and 1. A step where 7/8 completions succeed gets 0.875; one where 1/8 succeeds gets 0.125.

Which is better? Interestingly, the paper finds that HE and SE produce similar PRM quality in practice. HE is easier to implement — you just need two special tokens ("has potential" and "no potential") and a standard language modeling pipeline. So Math-Shepherd uses HE for convenience.
Hard vs. Soft Estimation

Drag the slider to change how many of N=8 completions reach the correct answer. Compare the HE and SE labels.

Correct completions (of 8) 3

Worked Example

Problem: "What is 15% of 80?" Gold answer: 12.

Step 1: "15% means 15/100 = 0.15." We sample 8 completions from step 1 onward. Results: 12, 12, 12, 12, 12, 12, 12, 12. All correct. HE = 1, SE = 1.0.

Step 2: "So 0.15 × 80 = 15." (This is wrong — should be 12.) We sample 8 completions from step 2 onward. Results: 15, 15, 15, 15, 15, 15, 12, 15. One completion somehow self-corrected. HE = 1 (benefit of the doubt!), SE = 0.125.

Step 3: "The answer is 15." Results from step 3: 15, 15, 15, 15, 15, 15, 15, 15. None correct. HE = 0, SE = 0.0.

Notice the noise: Step 2 is genuinely wrong, but HE labels it "correct" because one lucky completion self-corrected. This is the false positive problem with HE. SE gives a more calibrated 0.125. In practice, with enough training data, the noise averages out and HE still trains a good PRM.
If 0 out of 8 completions from step k reach the correct answer, what are the HE and SE labels?

Chapter 3: Automatic Data Construction

This is the full pipeline that replaces 800K human labels with zero human effort. Let us walk through every stage with concrete numbers from the paper.

Stage 1: Train a Generator + Completer

Take a base model (e.g., LLemma-7B). Fine-tune it for 3 epochs on MetaMATH, a dataset of math problems with chain-of-thought solutions. This model now serves two roles: it generates candidate solutions, and it completes partial solutions for step annotation.

Stage 2: Generate Candidate Solutions

For each problem in the GSM8K and MATH training sets, sample 15 solutions from each of a 7B and 13B generator. Remove duplicates. This yields roughly 170K solutions for GSM8K and 270K solutions for MATH.

Stage 3: Annotate Every Step

For each step si in each solution, run the completer (LLemma-7B) N=8 times from step i onward. Check if each completion's final answer matches the gold answer. Apply hard estimation (HE) to assign a binary label to every step.

Stage 4: Format Training Data

Each labeled solution becomes a training example. The step labels are encoded as special tokens appended after each step: a "+" token for "has potential" and a "−" token for "no potential."

Data Construction Pipeline

The full pipeline from a single math problem to labeled PRM training data. Click "Run Pipeline" to animate each stage.

Concrete Label Format

Here is what a labeled training example looks like in the paper's format:

Problem: What is 15% of 80?

Step 1: 15% means 15/100 = 0.15 +  # 8/8 completions correct
Step 2: 0.15 × 80 = 12           +  # 8/8 completions correct
Step 3: The answer is 12.         +  # 8/8 completions correct
Problem: What is 15% of 80?

Step 1: 15% means 15/100 = 0.15 +  # 7/8 completions correct
Step 2: 0.15 × 80 = 15             # 0/8 completions correct (error!)
Step 3: The answer is 15.           # 0/8 completions correct
Scale comparison: PRM800K required expert human annotators to label 800K steps across 75K solutions generated by GPT-4. Math-Shepherd labels ~440K solutions (170K GSM8K + 270K MATH) with a 7B completer running N=8 completions per step. The only cost is GPU time for the completions — no humans at all.
In Math-Shepherd's data pipeline, what determines whether a step receives a "+" or "−" label?

Chapter 4: Training the PRM

With automatically labeled data in hand, training the PRM is straightforward. It shares the same architecture as an ORM — a language model with a classification head — but is trained on step-level labels instead of solution-level ones.

The Loss Function

The PRM is trained with binary cross-entropy loss at each step:

LPRM = ∑i=1K [ ysi log rsi + (1 − ysi) log(1 − rsi) ]

where K is the number of steps, ysi is the label (1 = has potential, 0 = no potential), and rsi is the sigmoid output of the reward model at step i.

Compare this to the ORM loss, which is identical but with a single label for the entire solution:

LORM = ys log rs + (1 − ys) log(1 − rs)
Implementation trick: Because Math-Shepherd uses hard estimation (binary labels), the PRM can be trained with a standard language modeling pipeline. Two special tokens are added to the vocabulary — one for "has potential" (+) and one for "no potential" (−). The model is trained to predict the correct token after each step. No custom loss function or model modification needed.

Training Details

The reward model is trained for 1 epoch with learning rate 1e-6. For verification tasks, the paper trains on LLaMA2-70B (GSM8K) and LLemma-34B (MATH) as base reward models. For RL, Mistral-7B is used. Max sequence length is 512 tokens.

ORM vs. PRM: Where Does the Extra Signal Come From?

Consider a 6-step solution where the answer is correct. The ORM gets 1 training signal: the whole solution is "correct." The PRM gets 6 training signals, one per step. Now consider a solution where the answer is wrong because of an error at step 4. The ORM says "incorrect" for the whole thing. The PRM says "steps 1-3 are fine, steps 4-6 are where things went wrong." This is far more informative.

How does Math-Shepherd avoid needing a custom model architecture or loss function for PRM training?

Chapter 5: Verification (Reranking)

The first application of Math-Shepherd: best-of-N verification. Generate N candidate solutions to a problem, score each with the PRM, and pick the highest-scoring one.

Scoring a Solution

The PRM gives a score rsi to each step. To get a single score for the entire solution, Math-Shepherd takes the minimum across all steps:

score(S) = mini=1..K rsi

Why the minimum? A chain of reasoning is only as strong as its weakest link. If any single step has low confidence, the whole solution is suspect.

Combining with Self-Consistency

Math-Shepherd also explores combining PRM scores with majority voting. Group the N solutions by their final answer, then score each group:

asc+rm = argmaxai=1N I(ai = a) · RM(p, Si)

This weights each vote by the reward model's confidence. A solution that the PRM trusts counts more than one it does not.

Best-of-N Verification

Four candidate solutions to one problem. Each has step-level PRM scores. The solution with the highest minimum score is selected. Click "Score Solutions" to see the verification in action.

N = 256 in practice: The paper generates 256 candidate solutions per problem for verification. This sounds expensive, but generation is parallelizable, and the PRM scoring is a single forward pass per solution. The accuracy boost from 1 to 256 candidates is substantial — up to 15+ percentage points on MATH.
Why does Math-Shepherd use the minimum step score (not the average) to rank solutions?

Chapter 6: Step-Level PPO

The second application: using Math-Shepherd as a reward model during reinforcement learning. Standard RLHF with an ORM gives a single reward at the end of the response. Math-Shepherd gives a reward at the end of every step.

ORM-PPO vs. Step-by-Step PPO

In standard ORM-PPO, the LLM generates a complete solution, and the ORM gives one reward signal for the whole thing. The credit assignment problem is severe: if the solution is wrong, which step caused the failure? PPO has to figure that out from a single scalar.

In step-by-step PPO with Math-Shepherd, the PRM scores each step independently. The reward at step i is the PRM's prediction rsi. This is dense reward shaping — every step gets direct feedback about whether it is on track.

ORM-PPO: Sparse Reward
Step 1 → Step 2 → ... → Step K → [reward]. The entire solution gets a single score. PPO must propagate this back through all K steps.
vs.
Step-PPO: Dense Reward
Step 1 [reward1] → Step 2 [reward2] → ... → Step K [rewardK]. Each step gets its own score. Credit assignment is solved.
Training setup: Learning rate 4e-7 (LLaMA2-7B) and 1e-7 (Mistral-7B). KL coefficient 0.04 to prevent the policy from drifting too far from the SFT model. Cosine schedule with minimum LR 1e-8. Training uses questions from MetaMATH for PPO rollouts.

Why Dense Rewards Help

Consider two solutions that both get the wrong final answer. Solution A goes wrong at step 1. Solution B is correct through step 5 and only fails at step 6. With ORM-PPO, both get the same reward: 0. The model cannot distinguish them. With step-PPO, Solution B gets positive rewards for steps 1-5 and a negative reward only at step 6. The model learns that steps 1-5 were good moves, even though the overall solution failed.

Sparse vs. Dense Reward

Two solutions that both get the wrong answer. Toggle between ORM (sparse) and PRM (dense) reward to see the difference in training signal.

Reward type ORM (sparse)
What is the main advantage of step-level PPO over standard ORM-PPO?

Chapter 7: Results

Math-Shepherd is evaluated across two benchmarks (GSM8K, MATH), two use cases (verification, RL), and models from 7B to 70B. The results are strong across the board.

Step-by-Step PPO (Greedy Decoding)

The headline numbers for Mistral-7B, fine-tuned on MetaMATH:

Mistral-7B results:
GSM8K: 77.9% (SFT) → 81.8% (+ORM-PPO) → 84.1% (+Math-Shepherd step-PPO)
MATH: 28.6% (SFT) → 31.3% (+ORM-PPO) → 33.0% (+Math-Shepherd step-PPO)

Step-by-step PPO beats ORM-PPO by 2.3 points on GSM8K and 1.7 points on MATH. Both beat RFT (rejection sampling fine-tuning) by a wide margin.

Verification (Best-of-256)

Using Math-Shepherd as a verifier to rerank 256 candidate solutions:

Best verification results:
DeepSeek-67B + Math-Shepherd: 93.3% GSM8K, 48.1% MATH
Mistral-7B + step-PPO + verification: 89.1% GSM8K, 43.5% MATH

The 93.3% on GSM8K with DeepSeek-67B approaches GPT-4 (94.4%). On MATH, 48.1% approaches GPT-4 (56.2%) with an open-source model.

Math-Shepherd vs. PRM800K

On MATH with LLaMA2-70B, Math-Shepherd (automatically labeled) outperforms PRM800K (human-labeled) across all values of N. Why? Two reasons: (1) the data is 4x larger, and (2) PRM800K was labeled on GPT-4 outputs, creating a distribution mismatch when applied to open-source model outputs.

Results Comparison

Accuracy on GSM8K and MATH across methods. Drag to compare benchmarks.

Benchmark GSM8K
Why does Math-Shepherd (automatic labels) outperform PRM800K (human labels) on MATH?

Chapter 8: Scaling & Analysis

How does the quality of automatic process annotations depend on the number of completions N and the completer model size?

Label Accuracy vs. N

The paper manually annotated 160 steps from GSM8K training data and measured how well automatic labels match. With LLaMA2-70B as completer at N=4: 86% accuracy (hard estimation). As N increases beyond 4, accuracy drops for HE because more completions increase false positives — a wrong step is more likely to have at least one lucky completion.

The false positive tradeoff: Increasing N helps soft estimation (SE becomes better calibrated as N grows) but hurts hard estimation (HE gets more false positives). Since HE needs only one correct completion to label "positive," larger N means more chances to get lucky. The sweet spot for HE is around N=4-8.

Completer Model Size

Larger completers produce higher-quality labels. At N=4, the 70B completer achieves 86% accuracy while the 7B completer is lower. However, the paper uses the 7B completer in practice (LLemma-7B) for cost efficiency — the small accuracy gap does not significantly impact downstream PRM quality.

SE vs. HE: Cross-Entropy Analysis

Measuring cross-entropy loss between automatic and human labels: SE progressively aligns closer to the human distribution as N increases. HE does not improve with larger N. Despite this, the final PRM quality is similar whether trained on SE or HE — suggesting the PRM is robust to moderate label noise.

Cost Analysis

For 440K solutions with ~6 steps each at N=8 completions per step:

Total completions ≈ 440K × 6 × 8 = 21.1M

Each completion is a few hundred tokens from a 7B model. On modern GPUs, this is a few hundred GPU-hours — orders of magnitude cheaper than hiring expert annotators for 800K steps.

Label Quality vs. N

How HE accuracy and SE calibration change as N (completions per step) increases. Drag to explore.

N (completions per step) 8
Why does increasing N hurt hard estimation (HE) accuracy?

Chapter 9: Connections

Math-Shepherd sits at a key intersection: it connects process reward models, Monte Carlo methods, and scalable RL for reasoning. Here is how it links to the broader landscape.

Let's Verify Step by Step (Lightman et al., 2023)
Introduced PRM800K with human-labeled step-level rewards. Showed PRMs beat ORMs on MATH. Math-Shepherd achieves the same quality automatically — and with 4x more data, even surpasses PRM800K on open-source models.
Scaling LLM Test-Time Compute
Math-Shepherd's verification (best-of-N with PRM reranking) is an early form of test-time compute scaling. Generate more candidates, score them better, get better answers. Later work like DeepSeek-R1 and OpenAI o1 push this further.
RLHF & PPO
Math-Shepherd extends the standard RLHF pipeline (reward model + PPO) with step-level rewards. This is the same idea as dense reward shaping in classic RL — instead of a sparse terminal reward, give intermediate feedback.
DeepSeekMath (Shao et al., 2024)
Builds directly on Math-Shepherd. Uses process rewards combined with GRPO (Group Relative Policy Optimization) instead of PPO. Pushes Mistral-7B to even higher math accuracy.
Monte Carlo Tree Search
Math-Shepherd's completion sampling is essentially MCTS rollouts applied to reasoning chains. Each step is a node, completions are rollouts, and the success rate is the value estimate. This connection inspired later work on tree-of-thought and search-augmented reasoning.
The lasting impact: Math-Shepherd demonstrated that process supervision does not need to be a luxury requiring expensive human labor. Automatic process annotation has become a standard tool — used in DeepSeek-R1, Qwen2-Math, and many others. The idea that "you can estimate step quality by sampling completions" is now foundational infrastructure for training reasoning models.
What is the core methodological connection between Math-Shepherd and Monte Carlo Tree Search?