Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe — OpenAI, 2023

Let's Verify Step by Step

Process reward models, which score each reasoning step rather than just the final answer, solve 78.2% of problems from a representative subset of the MATH test set, versus 72.4% for outcome-only reward models (best-of-1860 selection). In this comparison, step-level supervision is clearly more effective for mathematical reasoning.

Prerequisites: Reward models + Chain-of-thought reasoning + Best-of-N sampling. That's it.

Chapter 0: The Verification Problem

A language model generates a step-by-step solution to a math problem. The final answer is correct. But one of the intermediate steps contains a logical error that happens to cancel out later. Should we trust this solution?

This is the verification problem: how do you evaluate the quality of a reasoning chain? The obvious approach — check the final answer — misses a critical dimension. A solution with correct intermediate steps is more reliable than one that arrives at the right answer by luck.

text
Solution A (correct reasoning):
Step 1: x² - 5x + 6 = 0             ✓ correct
Step 2: (x-2)(x-3) = 0              ✓ correct
Step 3: x = 2 or x = 3              ✓ correct
Answer: {2, 3}                        ✓ correct answer, reliable

Solution B (wrong reasoning, lucky answer):
Step 1: x² - 5x + 6 = 0             ✓ correct
Step 2: (x+2)(x+3) = 0              ✗ wrong factoring (sign error)
Step 3: x = 2 or x = 3              ✗ second sign error cancels the first
Answer: {2, 3}                        ✓ correct answer, UNRELIABLE

An outcome reward model (ORM) scores only the final answer. It gives both solutions the same score: correct. But Solution B is dangerous — its reasoning is wrong, and on a slightly different problem, the same approach would fail.

The core insight: A process reward model (PRM) scores every step individually. It would flag Solution B's Step 2 as incorrect, preferring Solution A even though both reach the right answer. By evaluating the reasoning process, not just the outcome, PRMs select solutions that are correct for the right reasons — making them more reliable and generalizable.
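
The contrast between the two views can be sketched on Solutions A and B above. This is a toy illustration, not the paper's models: the per-step verdicts are written by hand as stand-ins for what a trained PRM would predict, and `outcome_score`, `process_score`, and `first_error` are hypothetical names.

```python
# Toy comparison of outcome vs. process scoring on Solutions A and B.
# Assumption: step verdicts are hand-assigned; a real PRM would predict them.

GROUND_TRUTH = {2, 3}

solution_a = {"step_ok": [True, True, True], "answer": {2, 3}}
solution_b = {"step_ok": [True, False, False], "answer": {2, 3}}


def outcome_score(solution):
    """Outcome view: 1.0 if the final answer matches the ground truth, else 0.0."""
    return 1.0 if solution["answer"] == GROUND_TRUTH else 0.0


def first_error(solution):
    """1-based index of the first flagged step, or None if every step passes."""
    for i, ok in enumerate(solution["step_ok"], start=1):
        if not ok:
            return i
    return None


def process_score(solution):
    """Process view: 1.0 only if no step is flagged."""
    return 1.0 if first_error(solution) is None else 0.0


for name, sol in [("A", solution_a), ("B", solution_b)]:
    print(name, outcome_score(sol), process_score(sol), first_error(sol))
```

Both solutions get the same outcome score, while the process view separates them and also reports where Solution B first goes wrong.
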

This paper from OpenAI is the definitive comparison: process supervision (scoring each step) vs. outcome supervision (scoring only the final answer). The result is unambiguous: process supervision wins, and the PRM800K dataset they release enables the community to build on their work.

Outcome vs Process Scoring

See how an ORM (left) and PRM (right) evaluate the same solution. The ORM only checks the final answer. The PRM checks every step. Click "Show Error" to reveal the hidden mistake.

Why is scoring only the final answer (outcome supervision) insufficient for evaluating mathematical reasoning?

Chapter 1: Outcome vs Process

The paper frames the comparison between two types of reward models with precise definitions. Understanding this distinction is key to the entire contribution.

Outcome Reward Model (ORM)

An ORM is trained on (solution, correctness) pairs. For each complete solution, it predicts whether the final answer is correct. Training data is easy to collect: generate solutions, check answers against ground truth, label as correct/incorrect.

ORM(solution) → P(final answer is correct)
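
The automatic labeling described above can be sketched as follows. The answer normalization is a deliberately crude assumption, and `label_for_orm` is a hypothetical helper, not code from the paper.

```python
# Sketch: building (solution, correctness) pairs for ORM training by
# comparing each sampled solution's final answer with the ground truth.

def normalize(answer: str) -> str:
    """Crude normalization (assumption): drop all whitespace and lowercase."""
    return "".join(answer.split()).lower()


def label_for_orm(samples, ground_truth):
    """samples: list of (solution_text, final_answer). Returns (solution_text, 0 or 1)."""
    target = normalize(ground_truth)
    return [(text, int(normalize(final) == target)) for text, final in samples]


samples = [
    ("... so x = 2 or x = 3", "{2, 3}"),
    ("... so x = 1 or x = 6", "{1, 6}"),
    ("... so x = 2 or x = 3", "{2,3}"),
]
pairs = label_for_orm(samples, "{2, 3}")
print([label for _, label in pairs])
```

Note that a solution with flawed reasoning and a correct final answer still receives label 1 under this scheme, which is exactly the weakness Chapter 0 describes.
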

Process Reward Model (PRM)

A PRM is trained on (solution, step-level labels) pairs. For each step in the solution, it predicts whether that step is correct. Training data requires human experts to verify each step individually — much more expensive than outcome labels.

$\mathrm{PRM}(\text{solution}, \text{step}_i) \to P(\text{step}_i \text{ is correct} \mid \text{steps}_{1..i-1} \text{ are correct})$

The key difference in training signal: ORM gets one bit of supervision per solution (correct/incorrect). PRM gets one bit per STEP (typically 5-15 steps per solution). PRM has ~10x more training signal per solution. This higher density of supervision is what makes process supervision more effective.

How each is used for selection

Both reward models are used with best-of-N selection: generate N solutions for each problem, score them with the reward model, and select the highest-scoring solution.

| Property | ORM | PRM |
|---|---|---|
| Supervision granularity | Per-solution | Per-step |
| Training data cost | Cheap (auto-verify answers) | Expensive (human step labeling) |
| What it scores | P(correct answer) | P(each step correct) |
| Solution ranking | By final answer probability | By product of step probabilities |
| Can detect lucky answers | No | Yes |

python
# Assumed to exist: orm_model and prm_model (trained scorers) and split_into_steps.
from math import prod

# ORM scoring: one number per solution
def orm_score(solution):
    return orm_model(solution)  # scalar: P(correct answer)

# PRM scoring: one number per step, multiplied together
def prm_score(solution):
    steps = split_into_steps(solution)
    step_scores = [prm_model(solution, i) for i in range(len(steps))]
    # Product of step probabilities = P(all steps correct)
    return prod(step_scores)

# Best-of-N selection with either scoring function
def best_of_n(model, problem, N=100, score_fn=orm_score):
    solutions = [model.generate(problem) for _ in range(N)]
    # Return the candidate with the highest reward-model score
    return max(solutions, key=score_fn)

The product rule for PRMs

When using a PRM to rank solutions, the paper uses the product of all step scores as the overall solution score. Because each step score is the probability that the step is correct given that the earlier steps are correct, the chain rule makes this product the probability that every step is correct.

$\mathrm{PRM\_score}(\text{sol}) = \prod_{i=1}^{n} P(\text{step}_i \text{ correct} \mid \text{steps}_{1..i-1} \text{ correct})$

A solution with 10 steps, each scored 0.95, gets an overall score of $0.95^{10} \approx 0.60$. A solution with one bad step (scored 0.3) drops to $0.95^{9} \times 0.3 \approx 0.19$. The product is extremely sensitive to individual bad steps, which is exactly what we want.
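
The arithmetic above can be checked directly. The log-space variant is an addition of this sketch (not something the lesson attributes to the paper): it gives the same ranking and avoids underflow on very long solutions.

```python
import math


def product_score(step_probs):
    """Solution score as the product of per-step correctness probabilities."""
    return math.prod(step_probs)


def log_product_score(step_probs):
    """Same ranking as product_score, computed as a sum of logs."""
    return sum(math.log(p) for p in step_probs)


clean = [0.95] * 10            # ten steps, all scored 0.95
one_bad = [0.95] * 9 + [0.3]   # nine good steps and one scored 0.3

print(round(product_score(clean), 2))    # 0.6
print(round(product_score(one_bad), 2))  # 0.19
```
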

Step Score Product

See how individual step scores combine into a solution score. Drag the slider to set one bad step's score and watch the product collapse. The product is very sensitive to even one low score.

How does a PRM rank a complete solution?

Chapter 2: PRM Architecture

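The PRM shares its architecture with the base language model: the large-scale PRM is fine-tuned from the base GPT-4 model to predict step correctness. There are no separate modules. In the paper, the prediction takes the form of a single token emitted after the last token of each step, so the PRM trains in a standard language-model pipeline; the simplified code in this chapter stands in for that with a small classification layer.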

Training the PRM

The PRM is trained on solutions where each step has been labeled by a human as one of three categories:

| Label | Meaning | Example |
|---|---|---|
| Positive | This step is correct and makes progress | "Factor: (x-2)(x-3) = 0" ✓ |
| Negative | This step contains an error | "Factor x² - 5x + 6 as (x+2)(x+3)" ✗ |
| Neutral | This step is neither clearly correct nor incorrect | "Let me try a different approach" |

The model is trained with a cross-entropy loss to predict these labels at each step boundary. At inference, it outputs a probability that each step is correct.

python
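# PRM training (simplified sketch; `transformer` stands in for the fine-tuned base model)
import torch.nn as nn
import torch.nn.functional as F

class ProcessRewardModel(nn.Module):
    def __init__(self, transformer, hidden_dim):
        super().__init__()
        self.transformer = transformer              # tokens -> [seq_len, hidden_dim]
        self.classifier = nn.Linear(hidden_dim, 3)  # positive / negative / neutral

    def forward(self, solution_tokens, step_boundaries):
        # Run the full solution through the transformer
        hidden = self.transformer(solution_tokens)  # [seq_len, hidden_dim]
        # Read the hidden state at the last token of each step
        h = hidden[step_boundaries]                 # [num_steps, hidden_dim]
        # One logit vector per step; softmax turns these into label probabilities
        return self.classifier(h)                   # [num_steps, 3]

# Loss: cross-entropy at each step boundary
def prm_loss(model, solution_tokens, step_boundaries, step_labels):
    step_logits = model(solution_tokens, step_boundaries)
    # step_labels: [num_steps] tensor of class ids (0, 1, or 2)
    return F.cross_entropy(step_logits, step_labels)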
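
At inference, the three-way prediction has to be collapsed into a single "step is correct" probability. A minimal sketch follows; the label order and the choice to count Neutral as correct are assumptions of this example, and `step_correct_prob` is a hypothetical helper.

```python
import math

# Assumption: logits are ordered (positive, negative, neutral).
LABELS = ("positive", "negative", "neutral")


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step_correct_prob(logits, neutral_counts_as_correct=True):
    """Collapse three-way label probabilities into one correctness probability."""
    probs = dict(zip(LABELS, softmax(logits)))
    if neutral_counts_as_correct:
        return probs["positive"] + probs["neutral"]
    return probs["positive"]


logits = [2.0, 0.0, 1.0]
print(step_correct_prob(logits))
print(step_correct_prob(logits, neutral_counts_as_correct=False))
```

Whether Neutral counts as correct changes every step score, so it is worth treating as a tunable choice rather than a fixed rule.
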

Step boundary detection

A key engineering detail: how do you split a solution into steps? The paper uses newlines as step boundaries. Each line in the solution is one step. This is simple but effective for mathematical solutions, which naturally separate into one-line-per-step format.
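
The newline rule can be sketched in a few lines. Dropping blank lines and pairing each step with its preceding context are choices of this sketch, and `step_prefixes` is a hypothetical helper.

```python
def split_into_steps(solution: str) -> list[str]:
    """Split on newlines, dropping blank lines and surrounding whitespace."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def step_prefixes(steps):
    """Yield (context, step) pairs: each step together with all earlier steps."""
    for i, step in enumerate(steps):
        yield steps[:i], step


solution = "x^2 - 5x + 6 = 0\n(x-2)(x-3) = 0\n\nx = 2 or x = 3\n"
steps = split_into_steps(solution)
for context, step in step_prefixes(steps):
    print(len(context), step)
```
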

The PRM sees the entire solution. Unlike a step-level verifier that checks each step independently, the PRM sees all previous steps when scoring step i. This means it can detect errors that are wrong only in context — a step that would be correct in isolation might be wrong given what came before it.

Comparison to training an ORM

The ORM is much simpler to train: generate solutions, check final answers, label as correct/incorrect. No human step labeling needed. The paper trains both on the same base model to ensure a fair comparison.

PRM Step Scoring

See how the PRM scores each step in a solution. Green = high confidence correct. Red = likely error. Click steps to see the PRM's reasoning.

What advantage does the PRM have from seeing all previous steps when scoring step i?

Chapter 3: PRM800K Dataset

The paper's most enduring contribution may be PRM800K: a dataset of 800,000 step-level human annotations on 75,000 mathematical solutions. It is a large, openly released dataset of step-level human judgments on mathematical reasoning.

Dataset statistics

| Statistic | Value |
|---|---|
| Solutions annotated | 75,000 |
| Total step labels | 800,000 |
| Average steps per solution | ~10.7 |
| Problems sourced from | MATH dataset (12,500 problems) |
| Label set | Positive / Negative / Neutral |
| Annotator pool | Trained math-proficient contractors |

Annotation process

Human annotators with strong math backgrounds reviewed each step of model-generated solutions. For each step, they judged:

1. Read step in context. The annotator reads the step given all previous steps. Context matters: a step may be locally correct but inconsistent with prior steps.
2. Assign label. Positive (correct and productive), Negative (contains an error), or Neutral (restating, hedging, or ambiguous).
3. Stop at first error. Once a step is labeled Negative, labeling stops. The paper supervises solutions only up to the first incorrect step, so later steps are left unlabeled.

"Stop at first error" saves annotation cost. Once a solution goes wrong at step 5, steps 6-10 build on a faulty foundation, so there is little value in verifying them. The annotator labels step 5 as Negative and stops. This cuts annotation time by ~40% compared to labeling every step independently.

Quality and cost

The annotation took approximately 4,000 person-hours. At typical contractor rates, this costs ~$100,000-$200,000. This is expensive but one-time: the resulting dataset enables training PRMs without additional human annotation.

Inter-annotator agreement (measured on a subset) was ~90%, indicating high consistency. Math has the advantage of being objectively verifiable — unlike sentiment or toxicity, "is this algebra step correct?" has a clear answer.

python
# Simplified illustration of a step-labeled sample (the released PRM800K files use a different, richer schema)
sample = {
    "problem": "Solve x^2 - 5x + 6 = 0",
    "solution": [
        {"step": "We need to factor the quadratic.",
         "label": "positive"},
        {"step": "Looking for (x-a)(x-b) where ab=6, a+b=5.",
         "label": "positive"},
        {"step": "(x-2)(x-3) = 0",
         "label": "positive"},
        {"step": "x = 2 or x = 3",
         "label": "positive"}
    ],
    "answer": "{2, 3}",
    "correct": True
}
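
The stop-at-first-error protocol from the annotation process can be sketched on steps in the simplified format above. `annotated_prefix` is a hypothetical helper that returns the steps an annotator actually has to judge; using `None` for unjudged steps is an assumption of this example.

```python
def annotated_prefix(labeled_steps):
    """Return the steps up to and including the first 'negative' label."""
    kept = []
    for item in labeled_steps:
        kept.append(item)
        if item["label"] == "negative":
            break  # annotation stops at the first error
    return kept


steps = [
    {"step": "Set up the equation.", "label": "positive"},
    {"step": "Let me restate the goal.", "label": "neutral"},
    {"step": "A step containing an error.", "label": "negative"},
    {"step": "A later step.", "label": None},        # never judged
    {"step": "Another later step.", "label": None},  # never judged
]

prefix = annotated_prefix(steps)
print(len(prefix), "of", len(steps), "steps judged")
```

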
PRM800K Statistics

Explore the distribution of step labels in PRM800K. Most steps are correct (positive), with errors typically appearing in the middle of solutions.

Why does the PRM800K annotation protocol stop labeling after the first error?

Chapter 4: Best-of-N Selection

Both ORMs and PRMs are used with best-of-N sampling: generate N candidate solutions for each problem, score them, and submit the highest-scoring one. The paper systematically compares how ORMs and PRMs perform as N increases.

The best-of-N framework

Generate N solutions from the same model (GPT-4) with temperature > 0. Score each with the reward model. Pick the best. This is identical to self-consistency's sampling step, but instead of majority voting on the answer, we use a learned scorer to pick the best solution.

$\text{answer} = \arg\max_{s \in \{s_1, \ldots, s_N\}} \mathrm{RM}(s)$
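
The difference between majority voting and reward-model selection can be seen on a handful of candidates. The answers and scores below are invented for illustration.

```python
from collections import Counter


def majority_vote(candidates):
    """Self-consistency: return the most common final answer."""
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]


def best_of_n(candidates):
    """Reward-model selection: return the answer of the top-scoring candidate."""
    answer, _ = max(candidates, key=lambda c: c[1])
    return answer


# (final_answer, reward_model_score) pairs; made-up numbers.
candidates = [("12", 0.20), ("12", 0.15), ("7", 0.91), ("12", 0.30), ("9", 0.05)]

print(majority_vote(candidates))  # 12
print(best_of_n(candidates))      # 7
```

Voting follows the crowd, while the learned scorer can pick a minority answer when it is confident in that solution.
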

How N affects accuracy

More samples = higher chance that at least one solution is correct AND that the reward model can identify it. The paper evaluates at N = 1, 10, 50, 100, 400, and 1860.

| N | ORM best-of-N | PRM best-of-N | PRM advantage |
|---|---|---|---|
| 1 | 50.0% | 50.0% | 0% |
| 10 | 60.2% | 63.1% | +2.9% |
| 100 | 68.5% | 73.0% | +4.5% |
| 1860 | 72.4% | 78.2% | +5.8% |

At N = 1 (no selection), both are identical — it's just the base model. As N grows, both improve, but the PRM improves faster. The PRM's advantage grows with N, reaching +5.8 percentage points at N = 1860.

Why the PRM's advantage grows with N: With more candidates, the base model is more likely to generate both correct and flawed solutions. The ORM struggles to distinguish "correct answer via correct reasoning" from "correct answer via lucky error." The PRM can make this distinction by checking each step, so it more reliably picks the truly correct solution.

Oracle best-of-N

The paper also reports the "oracle" best-of-N: what accuracy would you get if you had a perfect reward model? At N = 1860, the oracle achieves 96.3% — meaning 96.3% of the time, at least one of the 1860 solutions is correct. The gap between the oracle (96.3%) and the PRM (78.2%) represents the reward model's selection error.
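
The oracle number and the selection gap can be computed from a table of per-sample correctness. This is a sketch on invented data; `oracle_accuracy` and `selected_accuracy` are hypothetical helper names.

```python
def oracle_accuracy(correct):
    """correct[p][k] is True if sample k for problem p has the right answer.
    The oracle solves a problem whenever any sample is correct."""
    return sum(any(row) for row in correct) / len(correct)


def selected_accuracy(correct, scores):
    """Accuracy when the top-scoring sample for each problem is submitted."""
    hits = 0
    for row, row_scores in zip(correct, scores):
        best = max(range(len(row)), key=row_scores.__getitem__)
        hits += row[best]
    return hits / len(correct)


# 4 problems, 3 samples each; values are made up.
correct = [[False, True, False],
           [False, False, False],
           [True, True, False],
           [False, False, True]]
scores = [[0.2, 0.9, 0.1],
          [0.5, 0.4, 0.3],
          [0.3, 0.8, 0.6],
          [0.7, 0.2, 0.6]]

oracle = oracle_accuracy(correct)
selected = selected_accuracy(correct, scores)
print(oracle, selected, oracle - selected)  # 0.75 0.5 0.25
```

The difference between the two numbers is the selection error: problems where a correct sample existed but the scorer ranked something else higher.
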

python
# Best-of-N with PRM selection
def best_of_n_prm(generator, prm, problem, N=100):
    solutions = [generator.generate(problem, temp=0.7)
                 for _ in range(N)]

    best_score = -1
    best_sol = None

    for sol in solutions:
        steps = split_steps(sol)
        step_scores = prm.score_steps(problem, steps)
        # Product of step scores = solution score
        sol_score = 1.0
        for ss in step_scores:
            sol_score *= ss
            if sol_score < 0.01:
                break  # early stop: bad step detected

        if sol_score > best_score:
            best_score = sol_score
            best_sol = sol

    return best_sol, best_score
Best-of-N Accuracy Curves

See how accuracy improves with N for ORM, PRM, and the oracle. The PRM's advantage over the ORM grows steadily as N increases.

Why does the PRM's advantage over the ORM grow as N (number of samples) increases?

Chapter 5: Results

The paper's main result is unambiguous: process supervision outperforms outcome supervision on MATH, and the advantage is robust across problem difficulties and domains.

Headline numbers

| Method | MATH (best-of-1860) | Training signal |
|---|---|---|
| Base GPT-4 (greedy) | 50.0% | None |
| Majority voting (N=1860) | 69.6% | None |
| ORM best-of-N | 72.4% | Outcome labels |
| PRM best-of-N | 78.2% | Step labels (PRM800K) |
| Oracle best-of-N | 96.3% | Perfect selection |

The PRM beats the ORM by +5.8 percentage points at N = 1860. It also beats majority voting, which the paper reports at 69.6%, by +8.6 points, confirming that learned selection outperforms naive voting for mathematical reasoning.

Process supervision is fundamentally better. The PRM advantage is not just statistical — it reflects a deeper truth: knowing WHERE a solution goes wrong is more informative than knowing WHETHER the final answer is right. This extra information enables better selection, and the advantage grows with the number of candidates.

Performance by difficulty

MATH problems are categorized into 5 difficulty levels. The PRM advantage is largest on the hardest problems:

| Difficulty | ORM best-of-100 | PRM best-of-100 | PRM advantage |
|---|---|---|---|
| Level 1 (easy) | 92% | 93% | +1% |
| Level 3 (medium) | 72% | 76% | +4% |
| Level 5 (hard) | 45% | 52% | +7% |

On easy problems, the base model almost always gets the right answer, so selection barely matters. On hard problems, there are many candidate solutions with correct answers but flawed reasoning — this is exactly where the PRM's ability to check individual steps provides the most value.

Comparison to self-consistency

Self-consistency (majority voting) reaches 69.6% at N = 1860. The PRM reaches 78.2%. The 8.6-point gap shows that how you select matters as much as how many samples you draw: a trained PRM is better at picking the right solution than counting votes.

Results by Difficulty

Compare ORM vs PRM accuracy across difficulty levels. Notice the PRM advantage is largest on hard problems where flawed-but-lucky solutions are most common.

On which difficulty level is the PRM's advantage over the ORM largest, and why?

Chapter 6: Step Scoring Simulator

Let's see process verification in action. This simulator generates candidate solutions for a math problem. Each solution has steps, each step has a PRM score. Watch how the PRM identifies the solution with the best reasoning — and how it catches solutions that get the right answer for the wrong reasons.

PRM Selection Simulator

Generate N candidate solutions. The PRM scores each step. Watch it select the solution with the best step-by-step reasoning. Red steps = errors. Green steps = correct. The product score determines the winner.

The PRM's superpower: catching lucky answers. In the simulator, look for solutions where the final answer is correct (green answer) but one or more steps are red (errors). The ORM would score these highly. The PRM catches them by detecting the bad step, giving them a low product score. This is the key advantage.
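
The pattern described above can be reproduced without the simulator. In this sketch all step scores are invented, and `is_lucky` with its 0.5 threshold is a hypothetical rule for flagging right-answer, bad-step candidates.

```python
import math

# Each candidate: per-step PRM scores plus whether the final answer is right.
# All numbers are invented for illustration.
candidates = [
    {"id": "clean", "step_scores": [0.97, 0.95, 0.96], "answer_correct": True},
    {"id": "lucky", "step_scores": [0.96, 0.20, 0.90], "answer_correct": True},
    {"id": "wrong", "step_scores": [0.94, 0.35, 0.40], "answer_correct": False},
]


def prm_score(candidate):
    """Product of step scores, as in the product rule."""
    return math.prod(candidate["step_scores"])


def is_lucky(candidate, threshold=0.5):
    """Right final answer, but at least one step scored below the threshold."""
    return candidate["answer_correct"] and min(candidate["step_scores"]) < threshold


ranked = sorted(candidates, key=prm_score, reverse=True)
lucky = [c["id"] for c in candidates if is_lucky(c)]

print([c["id"] for c in ranked])  # ['clean', 'lucky', 'wrong']
print(lucky)                      # ['lucky']
```

An answer-only check cannot tell "clean" from "lucky", while the product score puts a wide gap between them.

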
In the simulator, what pattern should you look for to see the PRM's advantage over an ORM?

Chapter 7: Connections

Process reward models sit at a critical junction in the reasoning model lineage. They formalize the idea that HOW you reason matters as much as WHAT you conclude.

| Method | Year | Relationship to PRM |
|---|---|---|
| Self-Consistency | 2023 | Naive voting. PRM replaces voting with learned selection. |
| Let's Verify (this paper) | 2023 | Proves step-level supervision beats outcome supervision. |
| Math-Shepherd | 2024 | Automatically generates step labels using Monte Carlo estimation. |
| DeepSeek-R1 | 2025 | Shows outcome-only RL can match PRMs, challenging their necessity. |
| Scaling Test-Time Compute | 2024 | Uses PRMs as value functions in test-time search. |

What this paper got right

The fundamental result. Process supervision > outcome supervision. Later work on reasoning verification has repeatedly built on and extended this result.

The dataset. PRM800K enabled the community to train PRMs without OpenAI's annotation budget.

What this paper left open

Cost of labeling. 800K step labels cost $100K+. Can we generate step labels automatically? Math-Shepherd (2024) showed yes, using Monte Carlo estimation.
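
The Monte Carlo idea can be sketched as follows: to label a step, continue the solution from that step several times and record how often the continuations reach the correct answer. This is written in the spirit of Math-Shepherd, not as its actual implementation. The rollout function is a deterministic toy stand-in for sampling from a language model, and all names are hypothetical.

```python
from itertools import cycle


def mc_step_label(prefix_steps, rollout, is_correct, num_rollouts=8):
    """Soft label for the last step of prefix_steps: the fraction of rollouts
    continued from this prefix that end in a correct final answer."""
    hits = sum(bool(is_correct(rollout(prefix_steps))) for _ in range(num_rollouts))
    return hits / num_rollouts


def make_toy_rollout():
    """Toy completer (assumption): succeeds 3 times in 4 from a clean prefix
    and 1 time in 4 once the prefix contains a bad step."""
    good = cycle(["right", "right", "right", "wrong"])
    bad = cycle(["wrong", "wrong", "wrong", "right"])

    def rollout(prefix_steps):
        return next(bad) if "BAD STEP" in prefix_steps else next(good)

    return rollout


rollout = make_toy_rollout()
is_correct = lambda answer: answer == "right"

clean_label = mc_step_label(["step 1", "step 2"], rollout, is_correct)
flawed_label = mc_step_label(["step 1", "BAD STEP"], rollout, is_correct)
print(clean_label, flawed_label)  # 0.75 0.25
```

No human judges any step here: the label comes entirely from outcome checks on the continuations, which is what removes the annotation cost.
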

RL training signal. The paper uses PRMs only for best-of-N selection. Using PRM scores as RL rewards is a natural next step that this paper does not explore. The "Scaling Test-Time Compute" paper takes a related direction, using PRMs to guide test-time search.

The debate R1 reignited. DeepSeek-R1 showed that outcome-only RL (with GRPO) can produce strong reasoning without PRMs. This challenges the necessity of process supervision. The current consensus: PRMs help, but they're not strictly necessary if you have enough RL compute. The optimal approach likely combines both.
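
For intuition about what outcome-only RL with GRPO uses as its signal, here is a sketch of the group-relative advantage: each sampled solution's reward is standardized against the other samples for the same problem. Using the population standard deviation and a small `eps` are choices of this sketch, not details taken from the lesson.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize outcome rewards within one group of samples for a problem."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Outcome-only rewards: 1.0 if the final answer is correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Every token of a sampled solution shares the same advantage, so no step-level judgment is involved. That is the contrast with process supervision.
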

Self-Consistency: naive voting that PRMs improve upon.

Scaling Test-Time Compute: uses PRMs for compute-optimal test-time search.

DeepSeek-R1: shows outcome-only RL can match PRM-guided models.

Verification Methods Timeline

See how reasoning verification evolved from voting to learned process models.

How did DeepSeek-R1 challenge the necessity of process reward models?