Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou — Google Research / Brain — ICLR 2023

Self-Consistency Improves Chain of Thought Reasoning

Sample multiple reasoning paths, then vote. The majority answer wins. A simple idea that boosts GSM8K from 56% to 74% and costs nothing but extra inference compute.

Prerequisites: Chain-of-thought prompting + Sampling from LLMs + Basic probability. That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The Fragility Problem

You ask a language model: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder for $2 each. How much money does she make per day?"

With chain-of-thought prompting, the model reasons step by step:

text
Path 1 (correct):
16 eggs - 3 breakfast - 4 muffins = 9 eggs remaining
9 eggs × $2 = $18 per day
Answer: $18

But ask the same question again — same prompt, same model — and sometimes you get:

text
Path 2 (wrong):
16 eggs - 3 breakfast = 13 eggs
13 eggs × $2 = $26 per day
Answer: $26
(Forgot to subtract the 4 muffin eggs!)

Standard chain-of-thought uses greedy decoding — it takes the single most probable token at every step. This produces exactly one reasoning path. If that path makes an error at any step, the final answer is wrong. There's no safety net.

The fragility of single-path reasoning: Chain-of-thought improved LLM reasoning dramatically, but it has a critical weakness: it produces only ONE reasoning path. If the model makes a single arithmetic mistake, a wrong variable assignment, or a logical slip anywhere in the chain, the entire answer is wrong. Greedy decoding gives you one shot. Self-consistency gives you many shots and picks the most popular answer.

Think of it like an exam. If you solve a hard math problem once and submit your answer, you might make a careless error. But if you solve it five different ways and get the same answer four times, you can be much more confident it's correct. The one outlier was probably a computational mistake. Self-consistency applies this exact intuition to language models.

Single Path vs Multiple Paths

Click "Sample Path" to generate a reasoning path. One path might be wrong (red). Sample several and see how majority voting finds the correct answer even when individual paths make errors.

What is the fundamental weakness of standard chain-of-thought prompting that self-consistency addresses?

Chapter 1: Core Idea

Self-consistency is breathtakingly simple. It has three steps:

Step 1: Prompt with CoT
Use the same chain-of-thought prompt as before. Nothing changes here.
Step 2: Sample k paths
Instead of greedy decoding (1 path), sample k different reasoning paths with temperature > 0. Each path may use different reasoning strategies and reach different answers.
Step 3: Majority vote
Extract the final answer from each path. The answer that appears most frequently wins. Ignore the reasoning — only the final answer matters for voting.

That's it. No training. No fine-tuning. No special prompts. Just sample multiple times and vote. The elegance is that the reasoning paths naturally serve as a diversity mechanism: the model explores different approaches to the same problem, and correct approaches tend to converge on the same answer while incorrect approaches scatter across different wrong answers.

Why majority voting works: For a given problem, there is typically one correct answer but many possible wrong answers. If the model has, say, a 60% chance of getting the right answer on any single path, then after 10 paths, the correct answer will appear ~6 times while wrong answers split across the remaining ~4. The correct answer wins the vote easily. Wrong answers are diverse; the right answer is singular.
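
The argument above can be checked with a small simulation. This is a minimal sketch under stated assumptions, not the paper's experiment: paths are treated as independent, each is correct with probability `p_correct`, and wrong paths pick uniformly among `n_wrong` distinct wrong answers. The function name and all parameters are illustrative.

```python
import random
from collections import Counter

def simulate_vote(p_correct=0.6, k=10, n_wrong=5, trials=2000, seed=0):
    """Estimate how often plurality voting recovers the correct answer.

    Assumptions (hypothetical): independent paths, each correct with
    probability p_correct; wrong paths choose uniformly among n_wrong
    distinct wrong answers.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        answers = []
        for _ in range(k):
            if rng.random() < p_correct:
                answers.append("correct")
            else:
                answers.append(f"wrong_{rng.randrange(n_wrong)}")
        counts = Counter(answers)
        top = max(counts.values())
        winners = [a for a, c in counts.items() if c == top]
        # Break ties at random so the correct answer gets no free advantage.
        if rng.choice(winners) == "correct":
            wins += 1
    return wins / trials

single = simulate_vote(k=1)
voted = simulate_vote(k=10)
print(f"single path: {single:.2f}, 10-path vote: {voted:.2f}")
```

Raising `n_wrong` spreads the wrong votes more thinly and makes the vote more reliable; setting `n_wrong=1` shows the harder case where every wrong path agrees.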

The key assumption: correct answers agree

Self-consistency relies on a critical assumption: different correct reasoning paths produce the same final answer. For math problems, this is clearly true — whether you solve 2x + 3 = 7 by subtraction-then-division or by algebraic manipulation, you get x = 2 either way. For factual questions, the correct answer is unique. For open-ended generation (writing, creative tasks), self-consistency doesn't apply because there's no single "correct" output.

python
# Self-consistency in about 15 lines
from collections import Counter

def self_consistency(model, prompt, k=40, temperature=0.7):
    """Sample k reasoning paths and return the majority answer.

    `model.generate` and `extract_answer` are placeholders: supply your own
    LLM client and an answer parser that returns a hashable value.
    """
    answers = []
    for _ in range(k):
        # Sample one reasoning path (non-greedy)
        response = model.generate(prompt,
                                  temperature=temperature,
                                  max_tokens=256)
        # Extract just the final answer (ignore reasoning)
        answers.append(extract_answer(response))

    # Majority vote: most_common(1) returns [(answer, count)]
    vote_counts = Counter(answers)
    return vote_counts.most_common(1)[0][0]

# Example:
# answers = ["$18", "$26", "$18", "$18", "$14", "$18", "$26", "$18"]
# Counter: {"$18": 5, "$26": 2, "$14": 1}
# Returns: "$18" (correct!)
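
The voting code depends on an `extract_answer` helper that the lesson does not define. One possible version is sketched below; it assumes the final answer is the last number in the text, which holds for the Janet example but not for every task. Normalizing the string matters because "$18", "18" and "18.00" must count as the same vote.

```python
import re
from collections import Counter

def extract_answer(response: str):
    """Return the last number in a reasoning path as a normalized string.

    Hypothetical helper: assumes the final answer is the last numeric
    value in the text.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    if not matches:
        return None  # no parsable answer; the caller can skip this path
    value = float(matches[-1].replace(",", ""))
    # Format whole numbers without a trailing ".0" so "18" and "18.00" match.
    return str(int(value)) if value.is_integer() else str(value)

paths = [
    "16 - 3 - 4 = 9 eggs. 9 x $2 = $18. Answer: $18",
    "She sells 9 eggs at $2 each, so she makes 18 dollars.",
    "16 - 3 = 13 eggs. 13 x $2 = $26. Answer: $26",
    "I am not sure.",
]
answers = [a for a in (extract_answer(p) for p in paths) if a is not None]
print(Counter(answers).most_common(1)[0][0])
```

Paths with no parsable answer are dropped before voting here; counting them as a separate "no answer" vote is an equally defensible choice.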

Comparison to ensembles

Self-consistency resembles model ensembles (train multiple models, average their predictions) but is fundamentally cheaper. Ensembles require training N separate models. Self-consistency uses ONE model, sampled N times. The diversity comes from stochastic decoding, not separate training runs.

Self-Consistency Pipeline

Watch the self-consistency pipeline in action. The model samples multiple reasoning paths, extracts answers, and takes a majority vote. Click "Run" to see it step by step.

Why does majority voting tend to select the correct answer even when individual paths make errors?

Chapter 2: Sampling Diverse Paths

Self-consistency only works if the sampled paths are diverse. If every path follows the exact same reasoning steps and makes the same error, voting won't help — you'll just get k copies of the same wrong answer. The key is generating paths that explore different reasoning strategies.

Where does diversity come from?

When you sample from a language model with temperature > 0, the model doesn't always pick the most probable next token. It samples from the probability distribution, which means at each step it might take a different path. Over a multi-step reasoning chain, these small deviations accumulate into fundamentally different reasoning strategies:

text
Path 1 (direct calculation):
"Janet has 16 eggs. She uses 3 + 4 = 7 eggs.
 Remaining: 16 - 7 = 9 eggs. Revenue: 9 × $2 = $18"

Path 2 (step-by-step subtraction):
"Start with 16 eggs. After breakfast: 16 - 3 = 13 eggs.
 After muffins: 13 - 4 = 9 eggs. Sells 9 at $2: $18"

Path 3 (algebra):
"Let x = eggs sold. x = 16 - 3 - 4 = 9.
 Money = 2x = 2(9) = $18"

Path 4 (error — forgot muffins):
"16 eggs minus 3 for breakfast = 13 remaining.
 13 × $2 = $26"

Paths 1-3 use different reasoning strategies but all arrive at $18. Path 4 makes an error and gets $26. In a majority vote of these four paths, $18 wins 3-to-1.

Temperature controls diversity

Temperature scales the logits before softmax. Higher temperature = flatter probability distribution = more diverse sampling. Lower temperature = sharper distribution = paths more similar to greedy decoding.

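$$P(\text{token}_i) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

where $z_i$ are the logits and $T$ is the temperature. The formula is undefined at exactly T=0, but as T approaches 0 all probability mass moves to the largest logit, which is greedy decoding (argmax). At T=1, this is the standard softmax. At T>1, the distribution flattens and rare tokens get sampled more often.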
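
The effect of temperature is easy to see numerically. The sketch below implements the formula above with the standard library only; the logits are made-up values for three candidate tokens, and the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three candidate tokens
for t in (0.1, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T=0.1 nearly all of the mass sits on the first token; at T=2.0 the three probabilities move much closer together.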

The diversity-accuracy trade-off. Too low temperature (T < 0.3): paths are too similar, voting adds little value. Too high temperature (T > 1.5): paths become incoherent, adding noise instead of diverse perspectives. The paper finds T = 0.5 to 0.7 is the sweet spot — diverse enough to explore different strategies, coherent enough that each path is individually reasonable.

Why random reasoning helps

This is counterintuitive: adding randomness to reasoning improves accuracy. The key is that randomness introduces diversity at the strategy level while maintaining coherence at the step level. A slightly random model might try algebra instead of direct computation, and both approaches reach the same answer if correct. Sampling does introduce occasional errors (as in Path 4 above), but those errors tend to land on different wrong answers, while the alternative valid approaches agree.

Temperature and Path Diversity

Adjust temperature to see how it affects path diversity. Low temperature: paths cluster together. High temperature: paths spread out. Medium temperature: diverse but coherent.

Why does sampling with temperature > 0 improve self-consistency over greedy decoding?

Chapter 3: Marginalization

The paper frames self-consistency not just as "voting" but as marginalizing over reasoning paths. This mathematical formulation connects the method to principled probabilistic reasoning and opens the door to weighted voting.

The probabilistic view

Given a question $Q$, the model generates a reasoning chain $r_i$ and final answer $a_i$. We want the most probable answer $a^*$, marginalizing over all possible reasoning paths:

$$a^* = \arg\max_a \sum_{r_i} P(r_i, a \mid Q)$$

In plain English: the best answer is the one that has the highest total probability across all reasoning paths that lead to it. Some reasoning paths are more probable than others (more natural, more common strategies), and a correct answer supported by many high-probability paths should beat a wrong answer supported by one fluke path.

Approximation via sampling

We can't sum over all possible reasoning paths — there are infinitely many. Instead, we approximate with k samples:

$$a^* \approx \arg\max_a \sum_{i=1}^{k} \mathbb{1}[a_i = a]$$

This is just majority voting: count how many of the k sampled paths produce each answer, and pick the most common one. Each sample is treated equally (unweighted voting).
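
The step from the marginal to the vote count is a Monte Carlo estimate. A short derivation, assuming the $k$ pairs $(r_i, a_i)$ are drawn independently from the model's distribution given $Q$:

```latex
P(a \mid Q)
  = \sum_{r} P(r, a \mid Q)
  = \mathbb{E}_{(r', a') \sim P(\cdot \mid Q)} \big[ \mathbb{1}[a' = a] \big]
  \approx \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[a_i = a]
```

The factor $1/k$ does not depend on $a$, so the answer that maximizes the estimate is the one with the highest raw count. Strictly, this holds for samples drawn at T=1; at other temperatures the votes estimate the marginal of the temperature-adjusted distribution.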

Weighted voting (optional improvement)

You can also weight each path by its probability under the model (the exponential of its sequence log-probability). Paths the model considers more likely get more influence on the vote:

$$a^* \approx \arg\max_a \sum_{i:\, a_i = a} P(r_i \mid Q)$$

python
# Weighted self-consistency
from math import exp

def weighted_self_consistency(model, prompt, k=40, temp=0.7):
    """Sum path probabilities per answer and return the heaviest answer.

    `model.generate` and `extract_answer` are placeholders; `log_prob` is
    the path's total sequence log-probability.
    """
    answer_weights = {}
    for _ in range(k):
        response, log_prob = model.generate(prompt,
                                            temperature=temp,
                                            return_log_probs=True)
        answer = extract_answer(response)
        weight = exp(log_prob)  # convert log-prob to probability

        # Accumulate the weight of every path that reaches this answer
        answer_weights[answer] = answer_weights.get(answer, 0.0) + weight

    # Pick answer with highest total weight
    return max(answer_weights, key=answer_weights.get)

# Example: 3 paths say "$18" with weights 0.3, 0.4, 0.25 = 0.95 total
#          1 path says "$26" with weight 0.05 = 0.05 total
# Weighted vote: "$18" wins by 0.95 vs 0.05

Unweighted vs. weighted voting: The paper finds that simple unweighted majority voting performs nearly as well as weighted voting. Why? Because the log-probabilities of reasoning chains are not well-calibrated: a long chain might have a low probability simply because it's longer, not because it's less likely to be correct. In practice, just count votes.
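
The length effect is easy to see with numbers. The sketch below uses made-up per-token log-probabilities and a hypothetical helper `path_weight`. Dividing the total log-probability by the number of tokens is one common normalization (the paper reports a length-normalized variant as well), not the only option.

```python
import math

def path_weight(token_log_probs, normalize=True):
    """Weight for one sampled path from its per-token log-probabilities."""
    total = sum(token_log_probs)
    if normalize:
        total /= len(token_log_probs)  # average log-prob per token
    return math.exp(total)

# Hypothetical per-token log-probs: both paths are equally confident per
# token, but the second path is four times longer.
short_path = [-0.2] * 10
long_path = [-0.2] * 40

# Raw sequence probability: the long path is penalized heavily.
print(path_weight(short_path, normalize=False), path_weight(long_path, normalize=False))
# Length-normalized: both paths receive the same weight.
print(path_weight(short_path), path_weight(long_path))
```

Without normalization the long path's weight is several hundred times smaller, even though nothing about its per-token confidence differs.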

Marginalization Visualizer

See how probability mass accumulates for each answer as more paths are sampled. The correct answer (green) accumulates mass from multiple convergent paths. Wrong answers (red) get isolated contributions.

Why does the paper find that unweighted majority voting works nearly as well as probability-weighted voting?

Chapter 4: Temperature & k

Self-consistency has two hyperparameters: temperature T and number of samples k. Both matter, and the paper studies them carefully.

How many samples (k)?

More samples = better accuracy, but with diminishing returns. The paper finds:

| Samples (k) | GSM8K Accuracy | Marginal Gain |
|---|---|---|
| 1 (greedy) | 56.5% | - |
| 5 | 65.2% | +8.7% |
| 10 | 69.8% | +4.6% |
| 20 | 72.1% | +2.3% |
| 40 | 74.4% | +2.3% |
| 100 | 75.1% | +0.7% |

The biggest jump is from k=1 to k=5: five samples already recover about half of the gain available at k=40 (8.7 of 17.9 points). After k=40, gains are minimal. The paper uses k=40 as the default, balancing accuracy against inference cost.

The cost of self-consistency. Self-consistency with k=40 costs 40x the inference compute of greedy decoding. For a problem that takes 1 second with greedy, drawing the samples one after another takes about 40 seconds; running them in parallel reduces latency but not total compute. This is acceptable for benchmarks but may be too expensive for real-time applications. The paper argues that k=5 to k=10 gives most of the benefit at manageable cost.
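
The diminishing returns become clearer when the gain is divided by the number of extra samples it costs. The snippet below does only that arithmetic, using the accuracy figures from the table above; the variable names are illustrative.

```python
# Accuracy figures copied from the table above (GSM8K, accuracy in percent).
accuracy_by_k = {1: 56.5, 5: 65.2, 10: 69.8, 20: 72.1, 40: 74.4, 100: 75.1}

ks = sorted(accuracy_by_k)
gain_per_sample = {}
for prev, cur in zip(ks, ks[1:]):
    gain = accuracy_by_k[cur] - accuracy_by_k[prev]
    # Points of accuracy bought by each additional sample in this interval.
    gain_per_sample[cur] = gain / (cur - prev)
    print(f"k={prev:>3} -> {cur:>3}: +{gain:.1f} points, "
          f"{gain_per_sample[cur]:.3f} points per extra sample")
```

Each of the first four extra samples buys roughly two points of accuracy, while each sample between k=40 and k=100 buys about a hundredth of a point.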

Optimal temperature

The paper tests temperatures from 0.1 to 1.0:

| Temperature | GSM8K (k=40) | Character |
|---|---|---|
| 0.1 | 68.2% | Near-greedy, low diversity |
| 0.3 | 71.5% | Moderate diversity |
| 0.5 | 73.8% | Good balance |
| 0.7 | 74.4% | Sweet spot |
| 1.0 | 72.1% | Too much noise |

The sweet spot is T ≈ 0.5-0.7. Below that, paths are too similar. Above that, individual paths become incoherent.

Scaling behavior

An important finding: self-consistency provides larger gains on harder problems. On easy problems (where greedy already gets 90%+), voting adds little. On hard problems (where greedy gets 30-50%), voting can add 15-20 percentage points. This makes sense: hard problems have more room for diverse correct approaches, and greedy decoding is more likely to make errors on hard problems.

k and Temperature Explorer

Adjust the number of samples (k) and temperature to see how they affect accuracy. Watch the diminishing returns as k increases and the sweet spot for temperature.

Where does self-consistency provide the largest accuracy gains?

Chapter 5: Results

Self-consistency was evaluated on a wide range of reasoning benchmarks with multiple language models. The results are consistently positive — it improves accuracy across every benchmark and every model tested.

Arithmetic reasoning (GSM8K)

| Method | PaLM 540B | GPT-3 175B | Codex |
|---|---|---|---|
| Standard prompting | 43.0% | 34.0% | 45.0% |
| CoT (greedy) | 56.5% | 46.9% | 63.1% |
| CoT + Self-Consistency | 74.4% | 58.0% | 78.0% |

On GSM8K (grade school math), self-consistency adds +17.9 points to PaLM, +11.1 points to GPT-3, and +14.9 points to Codex over greedy CoT. These are massive improvements from a method that requires zero training.

Commonsense reasoning

| Benchmark | CoT (greedy) | CoT + SC (k=40) | Gain |
|---|---|---|---|
| StrategyQA | 73.4% | 81.6% | +8.2% |
| ARC-Challenge | 85.2% | 90.7% | +5.5% |
| CommonsenseQA | 79.3% | 82.4% | +3.1% |

Gains on commonsense reasoning are smaller (3-8 points) because these tasks are easier for large models, and self-consistency helps most where greedy decoding is weakest.

Symbolic reasoning

| Task | CoT (greedy) | CoT + SC | Gain |
|---|---|---|---|
| Coin flip (4 steps) | 99.6% | 100.0% | +0.4% |
| Last letter concat (4) | 55.0% | 67.2% | +12.2% |
| SVAMP (arithmetic word problems) | 79.0% | 86.6% | +7.6% |

Consistent improvement, everywhere. Self-consistency improves accuracy on every benchmark and every model in the tables above. Few inference-time techniques are this broadly applicable. The gains are largest on the hardest tasks (GSM8K: +17.9 points) and smallest on the easiest (coin flip: +0.4 points). This is the signature of a fundamentally sound method.

Comparison to other inference-time methods

How does self-consistency compare to other ways of spending more inference compute?

| Method | GSM8K | Cost |
|---|---|---|
| CoT greedy | 56.5% | 1x |
| CoT + sample-and-rank (by log-prob) | 62.3% | 40x |
| CoT + self-consistency (majority vote) | 74.4% | 40x |
| CoT + verifier (trained) | 77.0% | 40x + training |

Self-consistency dramatically outperforms sample-and-rank (which picks the path with the highest log-probability). This confirms that the model's own probability assessment of reasoning chains is unreliable — voting is more robust than trusting the model's confidence.
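
The difference between the two selection rules fits in a few lines. The sampled answers and log-probabilities below are invented for illustration; they are arranged so that a short, fluent but wrong path has the highest log-probability, which is the failure mode sample-and-rank is exposed to.

```python
from collections import Counter

# Hypothetical sampled paths: (extracted answer, sequence log-probability).
samples = [
    ("26", -4.1),  # wrong, but the single most probable path
    ("18", -6.3),
    ("18", -7.0),
    ("18", -6.8),
    ("14", -9.5),
]

def sample_and_rank(samples):
    """Return the answer of the single most probable path."""
    return max(samples, key=lambda s: s[1])[0]

def majority_vote(samples):
    """Return the most frequent answer, ignoring log-probabilities."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]

print("sample-and-rank:", sample_and_rank(samples))
print("majority vote:  ", majority_vote(samples))
```

Sample-and-rank trusts one path's score, while the vote pools evidence from every path that reaches the same answer.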

Results Dashboard

Compare self-consistency gains across benchmarks. Select a benchmark to see the improvement from greedy CoT to self-consistency.

Why does self-consistency outperform sample-and-rank (picking the path with the highest log-probability)?

Chapter 6: Voting Simulator

Let's see self-consistency in action on a simulated problem. You control the model's per-path accuracy, the number of samples, and the temperature. Watch how majority voting produces a more reliable final answer than any single path.

Self-Consistency Voting Simulator

Set the model's per-path accuracy (how often a single path is correct), then click "Run k Paths" to sample and vote. Run multiple rounds to see how self-consistency's accuracy exceeds any individual path. The chart shows the vote distribution.


Understanding the math

If each path is independently correct with probability p, then the probability that majority voting (with k paths) gives the correct answer follows the binomial distribution:

$$P(\text{correct} \mid k, p) = \sum_{i=\lceil k/2 \rceil}^{k} \binom{k}{i}\, p^{i} (1-p)^{k-i}$$
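
The sum can be evaluated directly with the standard library. This is a sketch of the formula as written, under its own assumptions: paths are independent and each is simply correct or incorrect. The function name is illustrative. For even k the sum starts at k/2, so an exact tie counts as correct.

```python
from math import ceil, comb

def majority_vote_accuracy(p, k):
    """Evaluate the binomial sum: P(at least ceil(k/2) of k paths are correct)."""
    start = ceil(k / 2)
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(start, k + 1))

for p in (0.55, 0.6, 0.7):
    for k in (1, 5, 40):
        print(f"p={p}, k={k}: {majority_vote_accuracy(p, k):.3f}")
```

Changing `start` to `k // 2 + 1` gives the stricter version in which the correct answer must win an outright majority.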

For even k the sum starts at k/2, so an exact tie counts as correct. For p = 0.55 and k = 40: P(correct) ≈ 0.79 (≈ 0.68 if a strict majority of at least 21 votes is required). For p = 0.6 and k = 40: P(correct) ≈ 0.93 (≈ 0.87 with a strict majority). Even a modest per-path accuracy is amplified by voting, and the effect strengthens as k grows.

The amplification effect. Self-consistency amplifies weak accuracy into strong accuracy. Under this two-outcome model, a model that gets the right answer 55% of the time on a single try reaches roughly 79% with 40 samples and voting, and a model at 60% reaches roughly 93%. When wrong answers scatter across several different values, the correct answer only needs a plurality, so real gains can be larger. As long as p > 0.5, majority voting converges to near-certainty as k grows. This is the same principle behind ensemble methods and the Condorcet jury theorem.

If a model has a 55% chance of getting the right answer on any single reasoning path, what happens with 40 paths and majority voting?

Chapter 7: Connections

Self-consistency is one of the foundational inference-time compute scaling methods. It spawned a family of techniques that all share the idea: spend more compute at inference to get better answers.

The inference-time scaling family

| Method | Year | How It Differs from Self-Consistency |
|---|---|---|
| Self-Consistency | 2023 | Sample + majority vote. No training needed. |
| Verifiers / ORMs | 2021 | Train a model to score solutions. Use best-of-N instead of voting. |
| Process Reward Models | 2023 | Score each STEP, not just the final answer. More granular than voting. |
| Tree of Thoughts | 2023 | Structured tree search over reasoning steps. BFS/DFS instead of sampling. |
| MCTS reasoning | 2024 | Monte Carlo tree search with value function. Most sophisticated but most expensive. |
| Scaling Test-Time Compute | 2024 | Formalizes the compute-optimal choice between sampling more vs. revising answers. |

What self-consistency got right

Simplicity. No training, no special prompts, no architecture changes. Just sample more and vote. This makes it the most widely applicable inference-time technique.

The core insight. Correct answers converge; wrong answers scatter. This insight underlies all subsequent work on inference-time scaling.

What self-consistency got wrong

Uniform treatment of paths. All paths get equal votes regardless of quality. Process reward models improve on this by scoring each reasoning step.

Independence assumption. The method assumes paths are independent, but they're generated by the same model with the same biases. If the model consistently makes the same type of error, voting won't help.
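
The failure case can be made concrete with the same binomial reasoning used for the voting math. The sketch below assumes a hypothetical worst case: a shared bias pushes per-path accuracy down to 40%, and every wrong path agrees on the same wrong answer. Under those assumptions, more samples make the vote more confidently wrong.

```python
from math import comb

def strict_majority_accuracy(p, k):
    """P(more than half of k independent paths are correct), for odd k."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# Hypothetical case: per-path accuracy below one half, one shared wrong answer.
for k in (1, 5, 41):
    print(k, round(strict_majority_accuracy(0.4, k), 3))
```

Voting amplifies whichever answer the model favors, so it helps only when the favored answer is the correct one.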

From voting to verification. Self-consistency asks "what answer do most paths agree on?" Process reward models ask "which path has the best reasoning at each step?" The evolution from consensus-based to quality-based selection is the main arc of inference-time compute research from 2023-2025.

Chain of Thought — The base technique that self-consistency builds on. Read the CoT lesson →

Let's Verify Step by Step — Process reward models that score each step instead of voting. Read the PRM lesson →

Scaling Test-Time Compute — Formalizing when to sample more vs. revise. Read the TTC lesson →

Inference-Time Methods Timeline

See how inference-time compute methods evolved from simple voting to sophisticated tree search.

What is the main limitation of self-consistency that process reward models address?