Sample multiple reasoning paths, then vote. The majority answer wins. A simple idea that boosts GSM8K from 56% to 74% and costs nothing but extra inference compute.
You ask a language model: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder for $2 each. How much money does she make per day?"
With chain-of-thought prompting, the model reasons step by step:
```text
Path 1 (correct):
16 eggs - 3 breakfast - 4 muffins = 9 eggs remaining
9 eggs × $2 = $18 per day
Answer: $18
```
But ask the same question again — same prompt, same model — and sometimes you get:
```text
Path 2 (wrong):
16 eggs - 3 breakfast = 13 eggs
13 eggs × $2 = $26 per day
Answer: $26
(Forgot to subtract the 4 muffin eggs!)
```
Standard chain-of-thought uses greedy decoding — it takes the single most probable token at every step. This produces exactly one reasoning path. If that path makes an error at any step, the final answer is wrong. There's no safety net.
Think of it like an exam. If you solve a hard math problem once and submit your answer, you might make a careless error. But if you solve it five different ways and get the same answer four times, you can be much more confident it's correct. The one outlier was probably a computational mistake. Self-consistency applies this exact intuition to language models.
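Self-consistency is breathtakingly simple. It has three steps:

1. Prompt the model with chain-of-thought, as usual.
2. Sample k reasoning paths with temperature > 0 instead of decoding greedily.
3. Extract the final answer from each path and return the most common one.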
That's it. No training. No fine-tuning. No special prompts. Just sample multiple times and vote. The elegance is that the reasoning paths naturally serve as a diversity mechanism: the model explores different approaches to the same problem, and correct approaches tend to converge on the same answer while incorrect approaches scatter across different wrong answers.
Self-consistency relies on a critical assumption: different correct reasoning paths produce the same final answer. For math problems, this is clearly true — whether you solve 2x + 3 = 7 by subtraction-then-division or by algebraic manipulation, you get x = 2 either way. For factual questions, the correct answer is unique. For open-ended generation (writing, creative tasks), self-consistency doesn't apply because there's no single "correct" output.
```python
# Self-consistency in 15 lines
from collections import Counter

def self_consistency(model, prompt, k=40, temperature=0.7):
    """Sample k reasoning paths and return majority answer."""
    answers = []
    for _ in range(k):
        # Sample one reasoning path (non-greedy)
        response = model.generate(prompt, temperature=temperature, max_tokens=256)
        # Extract just the final answer (ignore reasoning)
        answer = extract_answer(response)
        answers.append(answer)

    # Majority vote
    vote_counts = Counter(answers)
    return vote_counts.most_common(1)[0][0]

# Example:
# answers = ["$18", "$26", "$18", "$18", "$14", "$18", "$26", "$18"]
# Counter: {"$18": 5, "$26": 2, "$14": 1}
# Returns: "$18" (correct!)
```
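The snippet above calls `extract_answer` without defining it. One possible sketch follows, assuming every response contains a line of the form `Answer: $18`, as in the example paths. The function body, the regex, and the normalization rule are illustrative assumptions, not part of the method itself.

```python
import re
from typing import Optional

# Hypothetical pattern: assumes the path ends with "Answer: <number>".
_ANSWER_RE = re.compile(r"Answer:\s*(\$?-?\d[\d,]*(?:\.\d+)?)")

def extract_answer(response: str) -> Optional[str]:
    """Return the last 'Answer: ...' value as a normalized string, or None."""
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return None  # the caller can decide to drop unparseable paths
    # Strip currency symbols and thousands separators before comparing.
    value = float(matches[-1].replace("$", "").replace(",", ""))
    # Normalize so "$18", "18" and "18.0" land in the same vote bucket.
    return str(int(value)) if value.is_integer() else str(value)
```

Normalizing before the vote matters because majority voting compares strings: without it, `"$18"` and `"18.0"` would split the vote for the same answer. Note that this sketch returns `"18"` rather than `"$18"`.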
Self-consistency resembles model ensembles (train multiple models, average their predictions) but is fundamentally cheaper. Ensembles require training N separate models. Self-consistency uses ONE model, sampled N times. The diversity comes from stochastic decoding, not separate training runs.
Self-consistency only works if the sampled paths are diverse. If every path follows the exact same reasoning steps and makes the same error, voting won't help — you'll just get k copies of the same wrong answer. The key is generating paths that explore different reasoning strategies.
When you sample from a language model with temperature > 0, the model doesn't always pick the most probable next token. It samples from the probability distribution, which means at each step it might take a different path. Over a multi-step reasoning chain, these small deviations accumulate into fundamentally different reasoning strategies:
```text
Path 1 (direct calculation):
"Janet has 16 eggs. She uses 3 + 4 = 7 eggs. Remaining: 16 - 7 = 9 eggs. Revenue: 9 × $2 = $18"

Path 2 (step-by-step subtraction):
"Start with 16 eggs. After breakfast: 16 - 3 = 13 eggs. After muffins: 13 - 4 = 9 eggs. Sells 9 at $2: $18"

Path 3 (algebra):
"Let x = eggs sold. x = 16 - 3 - 4 = 9. Money = 2x = 2(9) = $18"

Path 4 (error — forgot muffins):
"16 eggs minus 3 for breakfast = 13 remaining. 13 × $2 = $26"
```
Paths 1-3 use different reasoning strategies but all arrive at $18. Path 4 makes an error and gets $26. In a majority vote of these four paths, $18 wins 3-to-1.
Temperature scales the logits before softmax. Higher temperature = flatter probability distribution = more diverse sampling. Lower temperature = sharper distribution = paths more similar to greedy decoding.
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where z_i are the logits and T is the temperature. As T → 0, sampling approaches greedy decoding (argmax). At T = 1, this is the standard softmax. At T > 1, the distribution flattens and rare tokens get sampled more often.
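The temperature-scaled softmax described here can be sketched in a few lines of plain Python. The function name and the example logits are made up for illustration. The sketch rejects T = 0, because greedy decoding is the limit of this formula rather than a value of it.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T -> 0 is the greedy limit")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for three candidate tokens
for t in (0.1, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running the loop shows the top token's probability shrinking as T grows, which is the flattening the text refers to.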
This is counterintuitive: adding randomness to reasoning improves accuracy. The key is that moderate randomness introduces diversity at the strategy level while maintaining coherence at the step level. A slightly random model might try algebra instead of direct computation, and both approaches reach the same answer if correct. Sampling does let some erroneous paths through, but their wrong answers tend to disagree with each other, so the vote filters them out.
The paper frames self-consistency not just as "voting" but as marginalizing over reasoning paths. This mathematical formulation connects the method to principled probabilistic reasoning and opens the door to weighted voting.
Given a question Q, the model generates a reasoning chain r_i and a final answer a_i. We want the most probable answer a*, marginalizing over all possible reasoning paths:

$$a^* = \arg\max_a \sum_{r} P(r, a \mid Q)$$
In plain English: the best answer is the one that has the highest total probability across all reasoning paths that lead to it. Some reasoning paths are more probable than others (more natural, more common strategies), and a correct answer supported by many high-probability paths should beat a wrong answer supported by one fluke path.
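We can't sum over all possible reasoning paths; there are far too many to enumerate. Instead, we approximate with k samples (r_i, a_i) drawn from the model:

$$a^* \approx \arg\max_a \sum_{i=1}^{k} \mathbb{1}[a_i = a]$$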
This is just majority voting: count how many of the k sampled paths produce each answer, and pick the most common one. Each sample is treated equally (unweighted voting).
You can also weight each path by its probability under the model. Paths the model considers more likely get more influence on the vote:

$$a^* \approx \arg\max_a \sum_{i=1}^{k} P(r_i, a_i \mid Q)\, \mathbb{1}[a_i = a]$$
```python
# Weighted self-consistency
from math import exp

def weighted_self_consistency(model, prompt, k=40, temp=0.7):
    answer_weights = {}
    for _ in range(k):
        response, log_prob = model.generate(prompt, temperature=temp, return_log_probs=True)
        answer = extract_answer(response)
        weight = exp(log_prob)  # convert log-prob to probability
        if answer not in answer_weights:
            answer_weights[answer] = 0
        answer_weights[answer] += weight

    # Pick answer with highest total weight
    return max(answer_weights, key=answer_weights.get)

# Example: 3 paths say "$18" with weights 0.3, 0.4, 0.25 = 0.95 total
#          1 path says "$26" with weight 0.05 = 0.05 total
# Weighted vote: "$18" wins by 0.95 vs 0.05
```
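One practical caveat with the weighted version: the total log-probability of a long reasoning path is a large negative number, and `exp` of it can underflow to zero. A hedged sketch of one workaround is to accumulate the weights in log space with log-sum-exp. The `weighted_vote` function, its `(answer, log_prob, num_tokens)` input format, and the optional per-token normalization are all assumptions made for illustration, not an API from the source.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

def logsumexp(values: List[float]) -> float:
    """Stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def weighted_vote(samples: List[Tuple[str, float, int]], normalize: bool = False) -> str:
    """samples holds (answer, total log-prob of the path, number of generated tokens)."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for answer, log_prob, num_tokens in samples:
        # Assumed variant: divide by length so long paths are not penalized.
        score = log_prob / num_tokens if normalize else log_prob
        buckets[answer].append(score)
    # Total weight per answer, kept in log space to avoid underflow.
    totals = {a: logsumexp(scores) for a, scores in buckets.items()}
    return max(totals, key=totals.get)

# Made-up log-probs, chosen so that exp() underflows to exactly 0.0.
samples = [("18", -1000.0, 50), ("18", -1001.0, 50), ("26", -1003.0, 50)]
print(math.exp(-1000.0))       # naive weight: 0.0
print(weighted_vote(samples))  # 18
```

With the naive `exp(log_prob)` weights, every answer here would tie at zero. The log-space totals still rank them.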
Self-consistency has two hyperparameters: temperature T and number of samples k. Both matter, and the paper studies them carefully.
More samples = better accuracy, but with diminishing returns. The paper finds:
| Samples (k) | GSM8K Accuracy | Marginal Gain |
|---|---|---|
| 1 (greedy) | 56.5% | — |
| 5 | 65.2% | +8.7% |
| 10 | 69.8% | +4.6% |
| 20 | 72.1% | +2.3% |
| 40 | 74.4% | +2.3% |
| 100 | 75.1% | +0.7% |
The biggest jump is from k=1 to k=5: five samples already capture 8.7 of the 18.6 points gained by k=100, nearly half of the total benefit. After k=40, gains are minimal. The paper uses k=40 as the default, balancing accuracy against inference cost.
The paper tests temperatures from 0.1 to 1.0:
| Temperature | GSM8K (k=40) | Character |
|---|---|---|
| 0.1 | 68.2% | Near-greedy, low diversity |
| 0.3 | 71.5% | Moderate diversity |
| 0.5 | 73.8% | Good balance |
| 0.7 | 74.4% | Sweet spot |
| 1.0 | 72.1% | Too much noise |
The sweet spot is T ≈ 0.5-0.7. Below that, paths are too similar. Above that, individual paths become incoherent.
An important finding: self-consistency provides larger gains on harder problems. On easy problems (where greedy already gets 90%+), voting adds little. On hard problems (where greedy gets 30-50%), voting can add 15-20 percentage points. This makes sense: hard problems have more room for diverse correct approaches, and greedy decoding is more likely to make errors on hard problems.
Self-consistency was evaluated on a wide range of reasoning benchmarks with multiple language models. The results are consistently positive — it improves accuracy across every benchmark and every model tested.
| Method | PaLM 540B | GPT-3 175B | Codex |
|---|---|---|---|
| Standard prompting | 43.0% | 34.0% | 45.0% |
| CoT (greedy) | 56.5% | 46.9% | 63.1% |
| CoT + Self-Consistency | 74.4% | 58.0% | 78.0% |
On GSM8K (grade school math), self-consistency adds +17.9 points to PaLM, +11.1 points to GPT-3, and +14.9 points to Codex over greedy CoT. These are massive improvements from a method that requires zero training.
| Benchmark | CoT (greedy) | CoT + SC (k=40) | Gain |
|---|---|---|---|
| StrategyQA | 73.4% | 81.6% | +8.2% |
| ARC-Challenge | 85.2% | 90.7% | +5.5% |
| CommonsenseQA | 79.3% | 82.4% | +3.1% |
Gains on commonsense reasoning are smaller (3-8 points) because these tasks are easier for large models, and self-consistency helps most where greedy decoding is weakest.
| Task | CoT (greedy) | CoT + SC | Gain |
|---|---|---|---|
| Coin flip (4 steps) | 99.6% | 100.0% | +0.4% |
| Last letter concat (4) | 55.0% | 67.2% | +12.2% |
| SVAMP | 79.0% | 86.6% | +7.6% |
How does self-consistency compare to other ways of spending more inference compute?
| Method | GSM8K | Cost |
|---|---|---|
| CoT greedy | 56.5% | 1x |
| CoT + sample-and-rank (by log-prob) | 62.3% | 40x |
| CoT + self-consistency (majority vote) | 74.4% | 40x |
| CoT + verifier (trained) | 77.0% | 40x + training |
Self-consistency dramatically outperforms sample-and-rank (which picks the path with the highest log-probability). This confirms that the model's own probability assessment of reasoning chains is unreliable — voting is more robust than trusting the model's confidence.
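The difference between the two selection rules can be seen on a toy set of samples. This is a minimal sketch: the `(answer, log_prob)` pairs below are made-up numbers, and both function names are illustrative.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[str, float]  # (extracted answer, total log-prob of the path)

def sample_and_rank(samples: List[Sample]) -> str:
    """Return the answer of the single highest log-prob path."""
    return max(samples, key=lambda s: s[1])[0]

def majority_vote(samples: List[Sample]) -> str:
    """Return the most frequent answer, ignoring log-probs entirely."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]

# Made-up samples: the wrong path "26" happens to be the most probable one.
samples = [("18", -41.2), ("18", -43.0), ("26", -38.5), ("18", -44.1), ("14", -47.9)]
print(sample_and_rank(samples))  # 26
print(majority_vote(samples))    # 18
```

Both rules cost the same number of samples. They differ only in how the final answer is selected from them.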
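If each path is independently correct with probability p, and all wrong paths agree on a single wrong answer (the worst case for voting), then the probability that a strict majority of k paths is correct is a binomial tail:

$$P(\text{correct}) = \sum_{j=\lfloor k/2 \rfloor + 1}^{k} \binom{k}{j}\, p^{j} (1-p)^{k-j}$$

For p = 0.55 and k = 40: P(correct) ≈ 0.68. For p = 0.6 and k = 40: P(correct) ≈ 0.87. A per-path accuracy only modestly above one half is amplified by voting, and the amplification grows with k. In practice wrong answers scatter across several values, so the correct answer needs only a plurality and the vote does better than this bound suggests.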
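For readers who want to evaluate the binomial model themselves, here is a small sketch. It assumes independent paths and a two-way strict majority, with ties counted as wrong. The function name is illustrative.

```python
from math import comb

def majority_vote_accuracy(p: float, k: int) -> float:
    """P(strictly more than k/2 of k independent paths are correct)."""
    threshold = k // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(k, j) * p**j * (1 - p) ** (k - j)
        for j in range(threshold, k + 1)
    )

for p in (0.55, 0.6, 0.7):
    for k in (5, 40):
        print(f"p={p}, k={k}: {majority_vote_accuracy(p, k):.3f}")
```

Because wrong answers usually scatter rather than agree, this two-way model is a conservative estimate of what plurality voting achieves.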
Self-consistency is one of the foundational inference-time compute scaling methods. It belongs to a family of techniques, some earlier and some later, that all share one idea: spend more compute at inference to get better answers.
| Method | Year | How It Differs from Self-Consistency |
|---|---|---|
| Self-Consistency | 2023 | Sample + majority vote. No training needed. |
| Verifiers / ORMs | 2021 | Train a model to score solutions. Use best-of-N instead of voting. |
| Process Reward Models | 2023 | Score each STEP, not just the final answer. More granular than voting. |
| Tree of Thoughts | 2023 | Structured tree search over reasoning steps. BFS/DFS instead of sampling. |
| MCTS reasoning | 2024 | Monte Carlo tree search with value function. Most sophisticated but most expensive. |
| Scaling Test-Time Compute | 2024 | Formalizes the compute-optimal choice between sampling more vs. revising answers. |
Simplicity. No training, no special prompts, no architecture changes. Just sample more and vote. This makes it the most widely applicable inference-time technique.
The core insight. Correct answers converge; wrong answers scatter. This insight underlies all subsequent work on inference-time scaling.
Uniform treatment of paths. All paths get equal votes regardless of quality. Process reward models improve on this by scoring each reasoning step.
Independence assumption. The method assumes paths are independent, but they're generated by the same model with the same biases. If the model consistently makes the same type of error, voting won't help.
Chain of Thought — The base technique that self-consistency builds on. Read the CoT lesson →
Let's Verify Step by Step — Process reward models that score each step instead of voting. Read the PRM lesson →
Scaling Test-Time Compute — Formalizing when to sample more vs. revise. Read the TTC lesson →