Let's Verify Step by Step (Lightman et al., 2023)

Chapter 0: The Verification Problem

A language model generates a step-by-step solution to a math problem. The final answer is correct. But one of the intermediate steps contains a logical error that happens to cancel out later. Should we trust this solution?

This is the verification problem: how do you evaluate the quality of a reasoning chain? The obvious approach — check the final answer — misses a critical dimension. A solution with correct intermediate steps is more reliable than one that arrives at the right answer by luck.

text
Solution A (correct reasoning):
Step 1: x² - 5x + 6 = 0             ✓ correct
Step 2: (x-2)(x-3) = 0              ✓ correct
Step 3: x = 2 or x = 3              ✓ correct
Answer: {2, 3}                        ✓ correct answer, reliable

Solution B (wrong reasoning, lucky answer):
Step 1: x² - 5x + 6 = 0             ✓ correct
Step 2: (x+2)(x+3) = 0              ✗ wrong factoring (sign error)
Step 3: x = 2 or x = 3              ✗ second sign slip cancels the first
Answer: {2, 3}                        ✓ correct answer, UNRELIABLE

An outcome reward model (ORM) is supervised only on whether the final answer is correct. Both solutions earn the same label, so nothing in its training signal separates them. But Solution B is dangerous: its reasoning is wrong, and on a slightly different problem the same approach would fail.

The core insight: A process reward model (PRM) scores every step individually. It would flag Solution B's Step 2 as incorrect, preferring Solution A even though both reach the right answer. By evaluating the reasoning process, not just the outcome, PRMs select solutions that are correct for the right reasons — making them more reliable and generalizable.
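
To make the contrast concrete, here is a minimal sketch that scores Solutions A and B both ways. The step probabilities are made-up illustrative numbers, not outputs of any real model, and `outcome_label` and `process_score` are hypothetical helper names.

```python
from math import prod

# Hand-assigned, illustrative step probabilities (not real model outputs).
solution_a = {"answer_correct": True, "step_probs": [0.99, 0.98, 0.99]}
solution_b = {"answer_correct": True, "step_probs": [0.99, 0.05, 0.10]}

def outcome_label(solution):
    """Outcome supervision only records whether the final answer matched."""
    return 1.0 if solution["answer_correct"] else 0.0

def process_score(solution):
    """Process scoring multiplies the per-step correctness probabilities."""
    return prod(solution["step_probs"])

# Same outcome label for both, very different process scores.
print(outcome_label(solution_a), outcome_label(solution_b))
print(round(process_score(solution_a), 3), round(process_score(solution_b), 3))
```

The outcome labels are identical, while the process score of Solution B collapses because of its two low-probability steps.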

This paper from OpenAI compares the two approaches head to head at scale: process supervision (scoring each step) vs. outcome supervision (scoring only the final answer). In its experiments on MATH, process supervision wins, and the PRM800K dataset the authors release lets the community build on their work.

[Interactive demo: Outcome vs Process Scoring. An ORM and a PRM evaluate the same solution; the ORM checks only the final answer, the PRM checks every step.]

Why is scoring only the final answer (outcome supervision) insufficient for evaluating mathematical reasoning?

A) Because a solution can arrive at the correct answer through flawed reasoning (lucky cancellation, guessing): outcome-only scoring can't distinguish reliable solutions from unreliable ones, while step-level scoring identifies exactly where reasoning goes wrong
B) Because final answers are hard to check
C) Because math problems always have multiple correct answers

Chapter 1: Outcome vs Process

The paper frames the comparison between two types of reward models with precise definitions. Understanding this distinction is key to the entire contribution.

Outcome Reward Model (ORM)

An ORM is trained on (solution, correctness) pairs. For each complete solution, it predicts whether the final answer is correct. Training data is easy to collect: generate solutions, check answers against ground truth, label as correct/incorrect.

ORM(solution) → P(final answer is correct)

Process Reward Model (PRM)

A PRM is trained on (solution, step-level labels) pairs. For each step in the solution, it predicts whether that step is correct. Training data requires human experts to verify each step individually — much more expensive than outcome labels.

PRM(solution, step_i) → P(step_i is correct | steps_1..i-1 are correct)

The key difference in training signal: ORM gets one bit of supervision per solution (correct/incorrect). PRM gets one bit per STEP (typically 5-15 steps per solution). PRM has ~10x more training signal per solution. This higher density of supervision is what makes process supervision more effective.
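
The difference in supervision density can be counted directly. This is a small sketch on a hypothetical batch of three labeled solutions; the labels are invented for illustration.

```python
# Hypothetical batch: each inner list holds the step labels of one solution.
solutions = [
    ["positive"] * 9 + ["negative"],
    ["positive"] * 12,
    ["positive", "neutral", "positive", "negative"],
]

orm_labels = len(solutions)                           # one outcome label per solution
prm_labels = sum(len(steps) for steps in solutions)   # one label per step

print(orm_labels, prm_labels, round(prm_labels / orm_labels, 1))
```

The ratio depends entirely on how long the solutions are, so the "~10x" figure should be read as an order of magnitude rather than a constant.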

How each is used for selection

Both reward models are used with best-of-N selection: generate N solutions for each problem, score them with the reward model, and select the highest-scoring solution.

Property	ORM	PRM
Supervision granularity	Per-solution	Per-step
Training data cost	Cheap (auto-verify answers)	Expensive (human step labeling)
What it scores	P(correct answer)	P(each step correct)
Solution ranking	By final answer probability	By product of step probabilities
Can detect lucky answers	No	Yes

python
from math import prod

# orm_model, prm_model and split_into_steps are placeholders for trained
# models and a step splitter; they are not defined here.

# ORM scoring: one number per solution
def orm_score(solution):
    return orm_model(solution)  # scalar: P(correct answer)

# PRM scoring: one number per step, multiplied together
def prm_score(solution):
    steps = split_into_steps(solution)
    step_scores = [prm_model(solution, i) for i in range(len(steps))]
    # Product of step probabilities = P(all steps correct)
    return prod(step_scores)

# Best-of-N selection with either scoring function
def best_of_n(model, problem, N=100, score_fn=orm_score):
    solutions = [model.generate(problem) for _ in range(N)]
    # max with a key replaces the undefined argmax helper
    return max(solutions, key=score_fn)

The product rule for PRMs

When using a PRM to rank solutions, the paper uses the product of all step scores as the overall solution score. Because each step score is conditioned on the preceding steps being correct, the chain rule makes this product equal to P(all steps correct); no independence assumption is needed.

PRM_score(sol) = ∏_{i=1}^{n} P(step_i correct | steps_1..i-1 correct)

A solution with 10 steps, each scored 0.95, gets an overall score of 0.95¹⁰ = 0.60. A solution with one bad step (scored 0.3) drops to 0.95⁹ × 0.3 = 0.19. The product is extremely sensitive to individual bad steps — which is exactly what we want.
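
The arithmetic above is easy to check, and it is instructive to compare the product with other ways of combining step scores. The sketch below contrasts product, mean, and minimum on the same two score lists; the mean and minimum are shown only for comparison and are not presented here as the paper's method.

```python
from math import prod
from statistics import mean

clean = [0.95] * 10            # ten confident steps
flawed = [0.95] * 9 + [0.30]   # same solution, but one suspicious step

for name, scores in [("clean", clean), ("flawed", flawed)]:
    print(name, round(prod(scores), 2), round(mean(scores), 3), min(scores))
```

The mean barely moves when one step goes bad, while the product and the minimum both react strongly. That is why an average would be a poor choice for catching a single wrong step.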

[Interactive demo: Step Score Product. A slider sets one bad step's score and shows how the product of step scores collapses.]

How does a PRM rank a complete solution?

A) By computing the product of per-step correctness scores, P(all steps correct), which is extremely sensitive to individual bad steps, since one low step score (e.g., 0.3) dramatically reduces the product even if all other steps score 0.95
B) By averaging all step scores
C) By checking only the last step

Chapter 2: PRM Architecture

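The PRM has the same architecture as the base language model: it is a GPT-4-class model fine-tuned to predict step correctness. In the paper the prediction is a single token emitted after the last token of each step and trained with the usual log-likelihood objective, so no separate module is required. The simplified code later in this chapter uses a small classification layer instead, purely to keep the example readable.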

Training the PRM

The PRM is trained on solutions where each step has been labeled by a human as one of three categories:

Label	Meaning	Example
Positive	This step is correct and makes progress	"Factor: (x-2)(x-3) = 0" ✓
Negative	This step contains an error	"(x+2)(x+3) = 0" ✗
Neutral	This step is neither clearly correct nor incorrect	"Let me try a different approach"

The model is trained with a cross-entropy loss to predict these labels at each step boundary. At inference, it outputs a probability that each step is correct.
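
A three-way prediction still has to be turned into a single "this step is acceptable" probability before step scores can be multiplied. The sketch below shows one way to do that. The class ordering and the choice to count Neutral as acceptable are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def p_step_ok(logits, neutral_counts_as_ok=True):
    """logits are ordered [positive, negative, neutral] (an assumed ordering)."""
    p_pos, p_neg, p_neu = softmax(logits)
    return p_pos + p_neu if neutral_counts_as_ok else p_pos

print(round(p_step_ok([2.0, 0.0, 1.0]), 3))
print(round(p_step_ok([2.0, 0.0, 1.0], neutral_counts_as_ok=False), 3))
```

Whether Neutral is folded into the positive class changes every step score, so the convention should be fixed before comparing solution scores.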

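python
# PRM training (simplified sketch)
import torch.nn as nn
import torch.nn.functional as F

class ProcessRewardModel(nn.Module):
    def __init__(self, transformer, hidden_dim):
        super().__init__()
        # transformer: any module mapping tokens to [seq_len, hidden_dim]
        self.transformer = transformer
        self.classifier = nn.Linear(hidden_dim, 3)  # pos/neg/neutral

    def forward(self, solution_tokens, step_boundaries):
        # Run the full solution through the transformer
        hidden = self.transformer(solution_tokens)  # [seq_len, hidden_dim]
        # At each step boundary (last token of a step), predict correctness
        h = hidden[step_boundaries]                 # [num_steps, hidden_dim]
        return self.classifier(h)                   # [num_steps, 3] logits

# Loss: cross-entropy at each step boundary
# step_labels: LongTensor [num_steps] with values 0 (pos), 1 (neg), 2 (neutral)
def prm_loss(model, solution_tokens, step_boundaries, step_labels):
    step_logits = model(solution_tokens, step_boundaries)
    return F.cross_entropy(step_logits, step_labels)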

Step boundary detection

A key engineering detail: how do you split a solution into steps? The paper uses newlines as step boundaries. Each line in the solution is one step. This is simple but effective for mathematical solutions, which naturally separate into one-line-per-step format.

The PRM scores each step in context. Unlike a verifier that checks each step in isolation, the PRM sees all previous steps when scoring step i. This means it can detect errors that are wrong only in context: a step that would be correct in isolation might be wrong given what came before it.
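
The newline convention and the idea of scoring each step in context can be sketched in a few lines. `split_into_steps` and `prefixes` are hypothetical helper names, and dropping empty lines is an assumption, not something the source specifies.

```python
def split_into_steps(solution_text):
    """Split on newlines, dropping empty lines."""
    return [line.strip() for line in solution_text.split("\n") if line.strip()]

def prefixes(steps):
    """For step i, the text a context-aware scorer would see: steps 0..i."""
    return ["\n".join(steps[: i + 1]) for i in range(len(steps))]

solution = "x^2 - 5x + 6 = 0\n(x-2)(x-3) = 0\n\nx = 2 or x = 3\n"
steps = split_into_steps(solution)

for i, prefix in enumerate(prefixes(steps)):
    print(f"step {i}: scorer sees {len(prefix.splitlines())} line(s)")
```

Each prefix ends with the step being judged, so a scorer given the prefix can check that step against everything written before it.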

Comparison to training an ORM

The ORM is much simpler to train: generate solutions, check final answers, label as correct/incorrect. No human step labeling needed. The paper trains both on the same base model to ensure a fair comparison.

[Interactive demo: PRM Step Scoring. The PRM scores each step of a solution; green marks steps it is confident are correct, red marks likely errors.]

What advantage does the PRM have from seeing all previous steps when scoring step i?

A) It can detect errors that are wrong only in context: a step that seems correct in isolation might be wrong given what came before, like applying a theorem whose conditions were violated in an earlier step
B) It's faster to compute
C) It uses less memory

Chapter 3: PRM800K Dataset

The paper's most enduring contribution may be PRM800K: a dataset of 800,000 step-level human annotations on 75,000 model-generated solutions to MATH problems. At release it was an unusually large public collection of step-level human judgments of mathematical reasoning.

Dataset statistics

Statistic	Value
Solutions annotated	75,000
Total step labels	800,000
Average steps per solution	~10.7
Problems sourced from	MATH dataset (12,500 problems)
Label set	Positive / Negative / Neutral
Annotator pool	Trained math-proficient contractors

Annotation process

Human annotators with strong math backgrounds reviewed model-generated solutions one step at a time. For each solution they followed three steps:

1. Read step in context. The annotator reads the step given all previous steps. Context matters: a step may be locally correct but inconsistent with prior steps.

2. Assign label. Positive (correct and productive), Negative (contains error), or Neutral (restating, hedging, or ambiguous).

3. Stop at first error. Once a step is labeled Negative, annotation of that solution ends and later steps are left unlabeled.

"Stop at first error" saves annotation cost. Once a solution goes wrong at step 5, steps 6-10 build on a faulty foundation, so labeling them adds little. The annotator labels step 5 as Negative and stops. This cuts annotation time by ~40% compared to labeling every step independently.

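The stop-at-first-error rule amounts to truncating a label sequence at its first Negative. A minimal sketch, with `truncate_at_first_error` as a hypothetical helper name:

```python
def truncate_at_first_error(step_labels):
    """Keep labels up to and including the first 'negative'; drop the rest."""
    kept = []
    for label in step_labels:
        kept.append(label)
        if label == "negative":
            break
    return kept

full = ["positive", "neutral", "negative", "positive", "positive"]
print(truncate_at_first_error(full))
```

A solution with no Negative label passes through unchanged, so the rule only shortens solutions that actually contain an error.
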
Quality and cost

The annotation took approximately 4,000 person-hours. At typical contractor rates, this costs ~$100,000-$200,000. This is expensive but one-time: the resulting dataset enables training PRMs without additional human annotation.

Inter-annotator agreement (measured on a subset) was ~90%, indicating high consistency. Math has the advantage of being objectively verifiable — unlike sentiment or toxicity, "is this algebra step correct?" has a clear answer.
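
The text does not say which agreement statistic was used. Assuming plain percent agreement (the share of steps on which two annotators chose the same label), the computation would look like this sketch; the label lists are invented.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of positions where two annotators assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

annotator_1 = ["positive"] * 9 + ["negative"]
annotator_2 = ["positive"] * 10
print(percent_agreement(annotator_1, annotator_2))
```

Percent agreement does not correct for agreement by chance, which matters when one label dominates, so a chance-corrected statistic may be more informative on skewed data.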

python
# Simplified illustration of a step-labeled sample (not the exact released schema)
sample = {
    "problem": "Solve x^2 - 5x + 6 = 0",
    "solution": [
        {"step": "We need to factor the quadratic.",
         "label": "positive"},
        {"step": "Looking for (x-a)(x-b) where ab=6, a+b=5.",
         "label": "positive"},
        {"step": "(x-2)(x-3) = 0",
         "label": "positive"},
        {"step": "x = 2 or x = 3",
         "label": "positive"}
    ],
    "answer": "{2, 3}",
    "correct": True
}
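
Given records in the simplified shape shown above, two operations come up repeatedly: locating the first error and flattening a solution into per-step training examples. The sketch below assumes that simplified shape. The field names are this lesson's illustration, not the released file format.

```python
sample = {
    "problem": "Solve x^2 - 5x + 6 = 0",
    "solution": [
        {"step": "We need to factor the quadratic.", "label": "positive"},
        {"step": "(x+2)(x+3) = 0", "label": "negative"},
    ],
}

def first_error_index(sample):
    """Index of the first step labeled 'negative', or None if there is none."""
    for i, item in enumerate(sample["solution"]):
        if item["label"] == "negative":
            return i
    return None

def to_training_examples(sample):
    """One (context, step, label) triple per labeled step."""
    examples = []
    previous = []
    for item in sample["solution"]:
        context = "\n".join([sample["problem"]] + previous)
        examples.append((context, item["step"], item["label"]))
        previous.append(item["step"])
    return examples

print(first_error_index(sample))
print(len(to_training_examples(sample)))
```

Each training example carries the problem and all earlier steps as context, matching the idea that a step is judged given what came before it.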

[Interactive figure: PRM800K Statistics. Distribution of step labels in PRM800K. Most steps are correct (positive), with errors typically appearing in the middle of solutions.]

Why does the PRM800K annotation protocol stop labeling after the first error?

A) Because once a step is wrong, every later step builds on a faulty foundation, so labeling it adds little; stopping saves ~40% annotation time
B) Because later steps are harder to evaluate
C) Because annotators get tired

Chapter 4: Best-of-N Selection

Both ORMs and PRMs are used with best-of-N sampling: generate N candidate solutions for each problem, score them, and submit the highest-scoring one. The paper systematically compares how ORMs and PRMs perform as N increases.

The best-of-N framework

Generate N solutions from the same model (GPT-4) with temperature > 0. Score each with the reward model. Pick the best. This is identical to self-consistency's sampling step, but instead of majority voting on the answer, we use a learned scorer to pick the best solution.

answer = argmax_{s ∈ {s₁,...,s_N}} RM(s)
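
The difference between majority voting and reward-model selection shows up when the most common answer is not the best-scored one. The sketch below uses five hypothetical candidates with invented scores.

```python
from collections import Counter

# Hypothetical candidates: (final_answer, reward_model_score)
candidates = [("7", 0.20), ("7", 0.25), ("7", 0.30), ("12", 0.90), ("12", 0.85)]

def majority_vote(cands):
    """Self-consistency style: return the most frequent final answer."""
    counts = Counter(answer for answer, _ in cands)
    return counts.most_common(1)[0][0]

def best_of_n(cands):
    """Reward-model style: return the answer of the highest-scored candidate."""
    answer, _ = max(cands, key=lambda c: c[1])
    return answer

print(majority_vote(candidates), best_of_n(candidates))
```

Voting follows the crowd of samples, while best-of-N follows the scorer, so best-of-N is only as good as the reward model behind it.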

How N affects accuracy

More samples = higher chance that at least one solution is correct AND that the reward model can identify it. The paper evaluates at N = 1, 10, 50, 100, 400, and 1860.

N	ORM best-of-N	PRM best-of-N	PRM advantage
1	50.0%	50.0%	0%
10	60.2%	63.1%	+2.9%
100	68.5%	73.0%	+4.5%
1860	72.4%	78.2%	+5.8%

At N = 1 (no selection), both are identical — it's just the base model. As N grows, both improve, but the PRM improves faster. The PRM's advantage grows with N, reaching +5.8 percentage points at N = 1860.

Why the PRM's advantage grows with N: With more candidates, the base model is more likely to generate both correct and flawed solutions. The ORM struggles to distinguish "correct answer via correct reasoning" from "correct answer via lucky error." The PRM can make this distinction by checking each step, so it more reliably picks the truly correct solution.

Oracle best-of-N

The paper also reports the "oracle" best-of-N: what accuracy would you get if you had a perfect reward model? At N = 1860, the oracle achieves 96.3% — meaning 96.3% of the time, at least one of the 1860 solutions is correct. The gap between the oracle (96.3%) and the PRM (78.2%) represents the reward model's selection error.
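
The oracle number and the selection gap can be computed from per-sample correctness and reward-model scores. This sketch uses four hypothetical problems with two samples each; all values are invented.

```python
# Hypothetical results: per problem, a list of (is_correct, rm_score) per sample.
results = [
    [(True, 0.9), (False, 0.4)],   # correct sample exists and is picked
    [(True, 0.3), (False, 0.8)],   # correct sample exists but is not picked
    [(False, 0.6), (False, 0.2)],  # no correct sample at all
    [(True, 0.7), (True, 0.5)],
]

def oracle_accuracy(results):
    """Share of problems with at least one correct sample."""
    return sum(any(ok for ok, _ in samples) for samples in results) / len(results)

def selected_accuracy(results):
    """Share of problems where the top-scored sample is correct."""
    picks = [max(samples, key=lambda s: s[1]) for samples in results]
    return sum(ok for ok, _ in picks) / len(results)

gap = oracle_accuracy(results) - selected_accuracy(results)
print(oracle_accuracy(results), selected_accuracy(results), gap)
```

The gap counts only problems where a correct sample existed but was ranked below an incorrect one; problems with no correct sample lower both numbers equally.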

python
# Best-of-N with PRM selection
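from math import prod

def split_steps(solution):
    # One step per non-empty line, matching the newline convention
    return [line for line in solution.split("\n") if line.strip()]

# Best-of-N with PRM selection.
# generator.generate and prm.score_steps are placeholder interfaces.
def best_of_n_prm(generator, prm, problem, N=100):
    solutions = [generator.generate(problem, temp=0.7)
                 for _ in range(N)]

    best_score = -1.0
    best_sol = None

    for sol in solutions:
        steps = split_steps(sol)
        step_scores = prm.score_steps(problem, steps)
        # Product of step scores = solution score.
        # Use the full product: stopping early would leave a partial
        # product that overstates the score of a bad solution.
        sol_score = prod(step_scores)

        if sol_score > best_score:
            best_score = sol_score
            best_sol = sol

    return best_sol, best_score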

[Interactive figure: Best-of-N Accuracy Curves. Accuracy as a function of N for the ORM, the PRM, and the oracle; the PRM's advantage over the ORM grows as N increases.]

Why does the PRM's advantage over the ORM grow as N (number of samples) increases?

A) With more candidates, the base model generates both correct and flawed solutions. The ORM can't distinguish "correct answer via correct reasoning" from "correct answer via lucky error," but the PRM can by checking each step, making its selection more reliable
B) Because PRMs are faster at scoring many solutions
C) Because the ORM gets confused by more solutions

Chapter 5: Results

The paper's main result is unambiguous: process supervision outperforms outcome supervision on MATH, and the advantage is robust across problem difficulties and domains.

Headline numbers

Method	MATH (best-of-1860)	Training Signal
Base GPT-4 (greedy)	50.0%	None
Majority voting (N=1860)	69.6%	None
ORM best-of-N	72.4%	Outcome labels
PRM best-of-N	78.2%	Step labels (PRM800K)
Oracle best-of-N	96.3%	Perfect selection

The PRM beats the ORM by +5.8 percentage points at N = 1860. It also beats majority voting by +8.6 points, confirming that learned selection outperforms naive voting for mathematical reasoning.

Why process supervision helps. The PRM advantage has a plausible mechanism: knowing WHERE a solution goes wrong is more informative than knowing WHETHER the final answer is right. This extra information enables better selection, and in these results the advantage grows with the number of candidates.

Performance by difficulty

MATH problems are categorized into 5 difficulty levels. The PRM advantage is largest on the hardest problems:

Difficulty	ORM best-of-100	PRM best-of-100	PRM advantage
Level 1 (easy)	92%	93%	+1%
Level 3 (medium)	72%	76%	+4%
Level 5 (hard)	45%	52%	+7%

On easy problems, the base model almost always gets the right answer, so selection barely matters. On hard problems, there are many candidate solutions with correct answers but flawed reasoning — this is exactly where the PRM's ability to check individual steps provides the most value.

Comparison to self-consistency

Self-consistency (majority voting) reaches 69.6% at N = 1860. The PRM reaches 78.2%. The 8.6-point gap shows that, on the same pool of samples, a trained PRM picks the right solution more often than counting votes does.

[Interactive figure: Results by Difficulty. ORM vs PRM accuracy across difficulty levels; the PRM advantage is largest on hard problems, where flawed-but-lucky solutions are most common.]

On which difficulty level is the PRM's advantage over the ORM largest, and why?

A) On the hardest problems (Level 5: +7 points), because hard problems produce more solutions that reach the correct answer via flawed reasoning (lucky cancellations, guessing), and the PRM's step-level checking detects these flaws while the ORM can't
B) On easy problems, because there are more solutions to choose from
C) Equally across all difficulty levels

Chapter 6: Step Scoring Simulator

Let's see process verification in action. This simulator generates candidate solutions for a math problem. Each solution has steps, each step has a PRM score. Watch how the PRM identifies the solution with the best reasoning — and how it catches solutions that get the right answer for the wrong reasons.

[Interactive demo: PRM Selection Simulator. Generates N candidate solutions, scores each step with the PRM (red = error, green = correct), and selects the solution with the highest product score.]

The PRM's superpower: catching lucky answers. In the simulator, look for solutions where the final answer is correct (green answer) but one or more steps are red (errors). The ORM would score these highly. The PRM catches them by detecting the bad step, giving them a low product score. This is the key advantage.
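
The pattern described here, a correct final answer resting on at least one doubtful step, can be written as a simple filter. The threshold of 0.5 and the candidate scores are assumptions chosen for illustration.

```python
def is_lucky(solution, threshold=0.5):
    """Correct final answer, but at least one step scored below the threshold."""
    return solution["answer_correct"] and min(solution["step_scores"]) < threshold

# Hypothetical candidates with invented step scores.
candidates = [
    {"id": "A", "answer_correct": True, "step_scores": [0.97, 0.95, 0.98]},
    {"id": "B", "answer_correct": True, "step_scores": [0.96, 0.12, 0.40]},
    {"id": "C", "answer_correct": False, "step_scores": [0.90, 0.20, 0.30]},
]

lucky_ids = [c["id"] for c in candidates if is_lucky(c)]
print(lucky_ids)
```

Candidate C also has low step scores, but it is not "lucky" because its final answer is wrong, so outcome checking alone would already reject it.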

In the simulator, what pattern should you look for to see the PRM's advantage over an ORM?

A) Solutions where the final answer is correct but intermediate steps have errors: the PRM catches these via low step scores while the ORM would rank them highly based on the correct answer alone
B) Solutions with many steps
C) Solutions with the highest average score

Chapter 7: Connections

Process reward models sit at a critical junction in the reasoning model lineage. They formalize the idea that HOW you reason matters as much as WHAT you conclude.

Method	Year	Relationship to PRM
Self-Consistency	2023	Naive voting. PRM replaces voting with learned selection.
Let's Verify (this paper)	2023	Proves step-level supervision beats outcome supervision.
Math-Shepherd	2024	Automatically generates step labels using Monte Carlo estimation.
Scaling Test-Time Compute	2024	Uses PRMs as value functions in test-time search.
DeepSeek-R1	2025	Shows outcome-only RL can match PRMs — challenging their necessity.

What this paper got right

The fundamental result. Process supervision > outcome supervision for selecting solutions. Later work on reasoning verification has built on this result.

The dataset. PRM800K enabled the community to train PRMs without OpenAI's annotation budget.

What this paper left open

Cost of labeling. 800K step labels cost $100K+. Can we generate step labels automatically? Math-Shepherd (2024) showed yes, using Monte Carlo estimation.
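
The Monte Carlo idea can be sketched as follows: for each prefix of a solution, sample several completions and use the fraction that reach the correct final answer as a soft label for the last step of that prefix. This is a loose sketch of the general idea, not Math-Shepherd's implementation. `rollout` is a hypothetical callable standing in for a generator plus an answer checker, and `toy_rollout` is a stand-in so the example runs.

```python
import random

def mc_step_labels(steps, rollout, n_rollouts=8, rng=None):
    """Soft label per step: share of rollouts from steps[:i+1] that end correct."""
    rng = rng or random.Random(0)
    labels = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        hits = sum(rollout(prefix, rng) for _ in range(n_rollouts))
        labels.append(hits / n_rollouts)
    return labels

def toy_rollout(prefix, rng):
    """Toy stand-in: a prefix containing a 'BAD' step never recovers."""
    if any("BAD" in step for step in prefix):
        return False
    return rng.random() < 0.75

labels = mc_step_labels(["ok step", "ok step", "BAD step", "ok step"], toy_rollout)
print(labels)
```

A sharp drop in the estimated label from one step to the next points at the step where the solution most likely went wrong, without any human annotation.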

RL training signal. The paper uses PRMs only for best-of-N selection. Using PRM scores as RL rewards could be even more powerful, but the paper does not test it. The "Scaling Test-Time Compute" paper takes a related direction, using PRMs to guide search at test time.

The debate R1 reignited. DeepSeek-R1 showed that outcome-only RL (with GRPO) can produce strong reasoning without PRMs. This challenges the necessity of process supervision. The current consensus: PRMs help, but they're not strictly necessary if you have enough RL compute. The optimal approach likely combines both.
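
For readers unfamiliar with GRPO, its core step is to score each sampled solution relative to the other samples for the same problem, using only outcome rewards. The sketch below shows that normalization under the assumption of binary rewards and population standard deviation; the `eps` term is an added safeguard against division by zero.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled solutions to one problem: 1.0 = correct final answer, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

When every sample in a group gets the same reward, all advantages are zero, so such groups contribute no learning signal.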

[Interactive figure: Verification Methods Timeline. How reasoning verification evolved from voting to learned process models.]

How did DeepSeek-R1 challenge the necessity of process reward models?

A) R1 showed that outcome-only RL (GRPO with binary accuracy rewards) can produce strong reasoning without PRMs, suggesting that with enough RL compute, the model can learn to self-verify rather than relying on an external process reward model
B) R1 proved PRMs are useless
C) R1 used a better PRM