Process reward models, which score each reasoning step rather than just the final answer, solve 78.2% of MATH problems versus 72.4% for outcome-only reward models under best-of-1860 selection. In this setting, step-level supervision is the more effective training signal for mathematical reasoning.
A language model generates a step-by-step solution to a math problem. The final answer is correct. But one of the intermediate steps contains a logical error that happens to cancel out later. Should we trust this solution?
This is the verification problem: how do you evaluate the quality of a reasoning chain? The obvious approach — check the final answer — misses a critical dimension. A solution with correct intermediate steps is more reliable than one that arrives at the right answer by luck.
```text
Solution A (correct reasoning):
  Step 1: x² - 5x + 6 = 0                 ✓ correct
  Step 2: (x-2)(x-3) = 0                  ✓ correct
  Step 3: x = 2 or x = 3                  ✓ correct
  Answer: {2, 3}                          ✓ correct answer, reliable

Solution B (wrong reasoning, lucky answer):
  Step 1: x² - 5x + 6 = 0                 ✓ correct
  Step 2: (x+2)(x+3) = 0                  ✗ wrong factoring!
  Step 3: x = 2 or x = 3 (by guessing)    ✗ lucky guess
  Answer: {2, 3}                          ✓ correct answer, UNRELIABLE
```
An outcome reward model (ORM) scores only the final answer. It gives both solutions the same score: correct. But Solution B is dangerous — its reasoning is wrong, and on a slightly different problem, the same approach would fail.
This paper from OpenAI, "Let's Verify Step by Step" (Lightman et al., 2023), compares process supervision (scoring each step) with outcome supervision (scoring only the final answer) head to head. Process supervision wins in their experiments, and the PRM800K dataset they release lets the community build on the work.
The paper frames the comparison between two types of reward models with precise definitions. Understanding this distinction is key to the entire contribution.
An ORM is trained on (solution, correctness) pairs. For each complete solution, it predicts whether the final answer is correct. Training data is easy to collect: generate solutions, check answers against ground truth, label as correct/incorrect.
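The outcome-labeling recipe above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the helper `extract_final_answer` and the `Answer:` line convention are assumptions made for the example.

```python
def extract_final_answer(solution: str) -> str:
    """Hypothetical helper: return the text after the last 'Answer:' marker."""
    marker = "Answer:"
    idx = solution.rfind(marker)
    if idx == -1:
        return ""  # no final answer found
    return solution[idx + len(marker):].strip()


def label_for_orm(solutions: list[str], ground_truth: str) -> list[tuple[str, bool]]:
    """Build (solution, correctness) pairs by checking only the final answer."""
    return [(s, extract_final_answer(s) == ground_truth) for s in solutions]


# Toy model outputs (made up for illustration)
samples = [
    "Step 1: factor the quadratic.\nAnswer: {2, 3}",
    "Step 1: guess.\nAnswer: {1, 6}",
]
orm_pairs = label_for_orm(samples, ground_truth="{2, 3}")
for solution, is_correct in orm_pairs:
    print(is_correct, "|", solution.splitlines()[-1])
```

Note that nothing in this labeling looks at the intermediate steps, which is why the labels are cheap and also why they cannot separate sound reasoning from a lucky answer.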
A PRM is trained on (solution, step-level labels) pairs. For each step in the solution, it predicts whether that step is correct. Training data requires human experts to verify each step individually — much more expensive than outcome labels.
Both reward models are used with best-of-N selection: generate N solutions for each problem, score them with the reward model, and select the highest-scoring solution.
| Property | ORM | PRM |
|---|---|---|
| Supervision granularity | Per-solution | Per-step |
| Training data cost | Cheap (auto-verify answers) | Expensive (human step labeling) |
| What it scores | P(correct answer) | P(each step correct) |
| Solution ranking | By final answer probability | By product of step probabilities |
| Can detect lucky answers | No | Yes |
```python
from math import prod

# orm_model, prm_model, and split_into_steps are placeholders for trained
# reward models and a step splitter; they are not defined here.

# ORM scoring: one number per solution
def orm_score(solution):
    return orm_model(solution)  # scalar: P(correct answer)

# PRM scoring: one number per step, multiplied together
def prm_score(solution):
    steps = split_into_steps(solution)
    step_scores = [prm_model(solution, i) for i in range(len(steps))]
    # Product of step probabilities = P(all steps correct)
    return prod(step_scores)

# Best-of-N selection with either model
def best_of_n(model, problem, N=100, score_fn=orm_score):
    solutions = [model.generate(problem) for _ in range(N)]
    # Return the candidate with the highest reward-model score
    return max(solutions, key=score_fn)
```
When using a PRM to rank solutions, the paper uses the product of all step scores as the overall solution score. If each step score is read as the probability that the step is correct given the steps before it, the product is the probability that every step is correct.
A solution with 10 steps, each scored 0.95, gets an overall score of 0.95¹⁰ ≈ 0.60. A solution with one bad step (scored 0.3) drops to 0.95⁹ × 0.3 ≈ 0.19. The product is extremely sensitive to individual bad steps, which is exactly what we want.
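The arithmetic is easy to check directly. The snippet below recomputes both products; the step scores are the illustrative values from the paragraph above, not model outputs.

```python
from math import prod

all_good = [0.95] * 10           # ten steps, each scored 0.95
one_bad = [0.95] * 9 + [0.3]     # nine good steps plus one step scored 0.3

all_good_score = prod(all_good)
one_bad_score = prod(one_bad)

print(f"all good: {all_good_score:.2f}")  # 0.60
print(f"one bad:  {one_bad_score:.2f}")   # 0.19
```

A single low step score cuts the solution score to roughly a third of its clean value, even though the other nine steps are unchanged.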
The PRM is architecturally identical to the base language model. It's a GPT-4-class model fine-tuned to predict step correctness. No special heads, no separate modules — just the same transformer with a classification output.
The PRM is trained on solutions where each step has been labeled by a human as one of three categories:
| Label | Meaning | Example |
|---|---|---|
| Positive | This step is correct and makes progress | "Factor: (x-2)(x-3) = 0" ✓ |
| Negative | This step contains an error | "Factor: (x+2)(x+3) = 0" ✗ |
| Neutral | This step is neither clearly correct nor incorrect | "Let me try a different approach" |
The model is trained with a cross-entropy loss to predict these labels at each step boundary. At inference, it outputs a probability that each step is correct.
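```python
import torch
import torch.nn.functional as F

# PRM training (simplified). GPT4Base stands in for the base language model
# class and is not defined here; PyTorch is assumed for the tensor ops.
class ProcessRewardModel(GPT4Base):
    def forward(self, solution_tokens, step_boundaries):
        # Run the full solution through the transformer
        hidden = self.transformer(solution_tokens)  # [seq_len, hidden_dim]
        # At each step boundary, predict correctness
        step_logits = []
        for boundary in step_boundaries:
            h = hidden[boundary]        # [hidden_dim]
            logit = self.classifier(h)  # [3] = pos/neg/neutral
            step_logits.append(logit)
        # One row of logits per step; a softmax turns each row into label probabilities
        return torch.stack(step_logits)  # [num_steps, 3]

# Loss: cross-entropy at each step boundary
# step_labels holds integer class ids with shape [num_steps]
step_logits = model(solution_tokens, step_boundaries)
loss = F.cross_entropy(step_logits, step_labels)
```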
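To rank solutions, the three-way prediction at each step has to be reduced to a single number. One simple option, sketched here, is a softmax followed by reading off the probability of the positive class. The `[positive, negative, neutral]` ordering and the `neutral_counts_as_correct` switch are assumptions made for this example; how neutral steps should be counted is a design choice that this sketch leaves open.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step_correct_prob(logits, neutral_counts_as_correct=False):
    """Collapse [positive, negative, neutral] logits (assumed order) to one score."""
    p_pos, p_neg, p_neu = softmax(logits)
    return p_pos + p_neu if neutral_counts_as_correct else p_pos


strict = step_correct_prob([2.0, 0.0, 0.0])
lenient = step_correct_prob([2.0, 0.0, 0.0], neutral_counts_as_correct=True)
print(f"strict={strict:.3f} lenient={lenient:.3f}")
```

The lenient variant never scores a step lower than the strict one, so it penalizes exploratory steps less when scores are multiplied across a solution.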
A key engineering detail: how do you split a solution into steps? The paper uses newlines as step boundaries. Each line in the solution is one step. This is simple but effective for mathematical solutions, which naturally separate into one-line-per-step format.
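The newline convention takes only a few lines to implement. The function name and the choice to drop blank lines are assumptions for this sketch; the text above only says that each line is one step.

```python
def split_into_steps(solution: str) -> list[str]:
    """Treat each non-empty line of the solution as one reasoning step."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


solution = "x^2 - 5x + 6 = 0\n(x-2)(x-3) = 0\n\nx = 2 or x = 3\n"
steps = split_into_steps(solution)
for i, step in enumerate(steps, start=1):
    print(f"Step {i}: {step}")
```

A splitter like this depends on the generator writing one step per line, so multi-line derivations would be cut into several "steps" and scored separately.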
The ORM is much simpler to train: generate solutions, check final answers, label as correct/incorrect. No human step labeling needed. The paper trains both on the same base model to ensure a fair comparison.
The paper's most enduring contribution may be PRM800K: a dataset of 800,000 step-level human annotations on 75,000 mathematical solutions. At release, it was among the largest public datasets of step-level human judgments on mathematical reasoning.
| Statistic | Value |
|---|---|
| Solutions annotated | 75,000 |
| Total step labels | 800,000 |
| Average steps per solution | ~10.7 |
| Problems sourced from | MATH dataset (12,500 problems) |
| Label set | Positive / Negative / Neutral |
| Annotator pool | Trained math-proficient contractors |
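Human annotators with strong math backgrounds reviewed each step of model-generated solutions. For each step, they judged whether it was positive (correct and making progress), negative (containing an error), or neutral (neither clearly correct nor incorrect).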
The annotation took approximately 4,000 person-hours. At typical contractor rates, this costs ~$100,000-$200,000. This is expensive but one-time: the resulting dataset enables training PRMs without additional human annotation.
Inter-annotator agreement (measured on a subset) was ~90%, indicating high consistency. Math has the advantage of being objectively verifiable — unlike sentiment or toxicity, "is this algebra step correct?" has a clear answer.
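One common way to quantify consistency between two annotators is raw percent agreement: the fraction of steps on which they chose the same label. The text does not say which metric produced the ~90% figure, so the sketch below is only an illustration of the idea, and the labels are toy data.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("need at least one labeled item")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy labels for ten steps (made up for illustration)
annotator_1 = ["positive"] * 6 + ["negative", "neutral", "positive", "negative"]
annotator_2 = ["positive"] * 6 + ["negative", "positive", "positive", "negative"]

agreement = percent_agreement(annotator_1, annotator_2)
print(f"agreement: {agreement:.0%}")
```

Raw agreement does not correct for chance, which matters when one label (here, positive) dominates.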
```python
# PRM800K data format
sample = {
    "problem": "Solve x^2 - 5x + 6 = 0",
    "solution": [
        {"step": "We need to factor the quadratic.", "label": "positive"},
        {"step": "Looking for (x-a)(x-b) where ab=6, a+b=5.", "label": "positive"},
        {"step": "(x-2)(x-3) = 0", "label": "positive"},
        {"step": "x = 2 or x = 3", "label": "positive"},
    ],
    "answer": "{2, 3}",
    "correct": True,
}
```
In PRM800K, most step labels are positive, and errors typically appear in the middle of solutions.
Both ORMs and PRMs are used with best-of-N sampling: generate N candidate solutions for each problem, score them, and submit the highest-scoring one. The paper systematically compares how ORMs and PRMs perform as N increases.
Generate N solutions from the same model (GPT-4) with temperature > 0. Score each with the reward model. Pick the best. This is identical to self-consistency's sampling step, but instead of majority voting on the answer, we use a learned scorer to pick the best solution.
More samples = higher chance that at least one solution is correct AND that the reward model can identify it. The paper evaluates at N = 1, 10, 50, 100, 400, and 1860.
| N | ORM best-of-N | PRM best-of-N | PRM advantage (points) |
|---|---|---|---|
| 1 | 50.0% | 50.0% | 0.0 |
| 10 | 60.2% | 63.1% | +2.9 |
| 100 | 68.5% | 73.0% | +4.5 |
| 1860 | 72.4% | 78.2% | +5.8 |
At N = 1 (no selection), both are identical — it's just the base model. As N grows, both improve, but the PRM improves faster. The PRM's advantage grows with N, reaching +5.8 percentage points at N = 1860.
The paper also reports the "oracle" best-of-N: what accuracy would you get if you had a perfect reward model? At N = 1860, the oracle achieves 96.3% — meaning 96.3% of the time, at least one of the 1860 solutions is correct. The gap between the oracle (96.3%) and the PRM (78.2%) represents the reward model's selection error.
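The difference between oracle accuracy and reward-model accuracy can be made concrete with a toy example. The correctness flags and scores below are invented for illustration; only the two definitions (oracle = any sample correct, selection = top-scored sample correct) come from the text.

```python
def oracle_accuracy(correct):
    """correct[p][i] is True if sample i for problem p has the right answer."""
    return sum(any(row) for row in correct) / len(correct)


def selection_accuracy(correct, scores):
    """Accuracy when the highest-scoring sample is submitted for each problem."""
    hits = 0
    for row, row_scores in zip(correct, scores):
        best = max(range(len(row)), key=lambda i: row_scores[i])
        hits += row[best]
    return hits / len(correct)


# Toy data: 3 problems, 3 samples each
correct = [
    [False, True, False],
    [True, False, False],
    [False, False, False],
]
scores = [
    [0.2, 0.9, 0.4],  # top score lands on the correct sample
    [0.3, 0.8, 0.1],  # top score lands on a wrong sample: selection error
    [0.5, 0.6, 0.7],  # no correct sample exists: not even the oracle can win
]

oracle = oracle_accuracy(correct)
selected = selection_accuracy(correct, scores)
print(f"oracle={oracle:.2f} selected={selected:.2f} gap={oracle - selected:.2f}")
```

The second problem is the case the oracle gap measures: a correct solution was sampled, but the reward model ranked a wrong one above it.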
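```python
def split_steps(solution):
    # One step per non-empty line, following the newline convention described earlier
    return [line for line in solution.splitlines() if line.strip()]

# Best-of-N with PRM selection
# generator and prm are placeholder objects exposing generate() and score_steps()
def best_of_n_prm(generator, prm, problem, N=100):
    solutions = [generator.generate(problem, temp=0.7) for _ in range(N)]
    best_score = -1.0
    best_sol = None
    for sol in solutions:
        steps = split_steps(sol)
        step_scores = prm.score_steps(problem, steps)
        # Product of step scores = solution score
        sol_score = 1.0
        for ss in step_scores:
            sol_score *= ss
            if sol_score < 0.01:
                break  # early stop: bad step detected, the score can only shrink further
        if sol_score > best_score:
            best_score = sol_score
            best_sol = sol
    return best_sol, best_score
```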
The paper's main result is unambiguous: process supervision outperforms outcome supervision on MATH, and the advantage is robust across problem difficulties and domains.
| Method | MATH accuracy | Training signal |
|---|---|---|
| Base GPT-4 (greedy) | 50.0% | None |
| Majority voting (N=1860) | 62.9% | None |
| ORM best-of-1860 | 72.4% | Outcome labels |
| PRM best-of-1860 | 78.2% | Step labels (PRM800K) |
| Oracle best-of-1860 | 96.3% | Perfect selection |
The PRM beats the ORM by +5.8 percentage points at N = 1860. It also substantially beats majority voting (+15.3 points), confirming that learned selection outperforms naive voting for mathematical reasoning.
MATH problems are categorized into 5 difficulty levels. The PRM advantage is largest on the hardest problems:
| Difficulty | ORM best-of-100 | PRM best-of-100 | PRM advantage (points) |
|---|---|---|---|
| Level 1 (easy) | 92% | 93% | +1 |
| Level 3 (medium) | 72% | 76% | +4 |
| Level 5 (hard) | 45% | 52% | +7 |
On easy problems, the base model almost always gets the right answer, so selection barely matters. On hard problems, there are many candidate solutions with correct answers but flawed reasoning — this is exactly where the PRM's ability to check individual steps provides the most value.
Self-consistency (majority voting) reaches 62.9% at N = 1860. The PRM reaches 78.2%. The 15-point gap shows that selection quality matters more than selection quantity. A trained PRM is much better at picking the right solution than counting votes.
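The two selection rules can disagree on the same pool of samples. The sketch below contrasts them on toy data; the answers and scores are invented, and `best_by_score` stands in for selection by any learned scorer.

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def best_by_score(answers, scores):
    """Reward-model selection: return the answer of the top-scored sample."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Five sampled solutions: the wrong answer is more common, but the scorer
# (hypothetical values) rates the correct solutions much higher.
answers = ["{1, 6}", "{1, 6}", "{2, 3}", "{1, 6}", "{2, 3}"]
scores = [0.10, 0.15, 0.92, 0.08, 0.85]

voted = majority_vote(answers)
picked = best_by_score(answers, scores)
print("majority vote:", voted)
print("best by score:", picked)
```

Voting fails whenever the most common answer is wrong, while a scorer can still recover a correct minority answer, provided its scores are reliable.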
Process reward models sit at a critical junction in the reasoning model lineage. They formalize the idea that HOW you reason matters as much as WHAT you conclude.
| Method | Year | Relationship to PRM |
|---|---|---|
| Self-Consistency | 2023 | Naive voting. PRM replaces voting with learned selection. |
| Let's Verify (this paper) | 2023 | Proves step-level supervision beats outcome supervision. |
| Math-Shepherd | 2024 | Automatically generates step labels using Monte Carlo estimation. |
| Scaling Test-Time Compute | 2024 | Uses PRMs as value functions in test-time search. |
| DeepSeek-R1 | 2025 | Shows outcome-only RL can match PRMs — challenging their necessity. |
The fundamental result. Process supervision > outcome supervision on MATH. Later work on reasoning verification has repeatedly built on and extended this finding.
The dataset. PRM800K enabled the community to train PRMs without OpenAI's annotation budget.
Cost of labeling. 800K step labels cost $100K+. Can we generate step labels automatically? Math-Shepherd (2024) showed yes, using Monte Carlo estimation.
RL training signal. The paper uses PRMs only for best-of-N selection. Using PRM scores as RL rewards could be even more powerful. The "Scaling Test-Time Compute" paper explores a neighboring direction, using PRMs to guide test-time search.
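The Monte Carlo idea mentioned under labeling cost can be sketched as follows: for each step, complete the solution several times from that prefix and use the fraction of completions that reach the correct answer as the step's estimated quality. This is a loose illustration in the spirit of Math-Shepherd, not its implementation; `rollout`, `is_correct`, and the toy stand-ins are all hypothetical.

```python
import random


def mc_step_scores(steps, rollout, is_correct, num_rollouts=8):
    """Score each step by how often completions from its prefix end correctly."""
    scores = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(is_correct(rollout(prefix)) for _ in range(num_rollouts))
        scores.append(wins / num_rollouts)
    return scores


# Toy stand-ins for a real generator and answer checker (assumptions)
rng = random.Random(0)


def toy_rollout(prefix):
    # A prefix containing the bad step never recovers; others succeed 75% of the time
    if "BAD" in prefix:
        return "wrong"
    return "right" if rng.random() < 0.75 else "wrong"


def toy_is_correct(answer):
    return answer == "right"


scores = mc_step_scores(["ok", "ok", "BAD", "ok"], toy_rollout, toy_is_correct, num_rollouts=200)
print([round(s, 2) for s in scores])
```

In this toy setup the estimated score drops to zero at the first bad step, which is the signal an automatically labeled PRM dataset would be built from. The cost moves from human labeling to extra sampling: every step needs its own batch of rollouts.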
Self-Consistency: naive voting that PRMs improve upon.
Scaling Test-Time Compute: uses PRMs for compute-optimal test-time search.
DeepSeek-R1: shows outcome-only RL can match PRM-guided models.