Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — adding "Let's think step by step" or showing reasoning traces in prompts dramatically improves math and logic performance.
Here's a simple word problem: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"
You probably solved it instantly: 5 + (2 × 3) = 11. But if you ask GPT-3 (175B parameters) with a standard prompt, it often says "9" or "8" — jumping to an answer without doing the intermediate arithmetic.
Why? Because standard prompting asks the model to go directly from question to answer in a single step. The model must compose multiple operations — parse the problem, identify the quantities, determine the operations, execute the arithmetic, and combine the results — all within a single token prediction. That's asking a lot from one forward pass.
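Consider what happens when you solve this problem yourself. You don't jump to "11". Your internal monologue goes something like this:

1. Roger starts with 5 balls.
2. He buys 2 cans of 3 balls each: 2 × 3 = 6 balls.
3. 5 + 6 = 11 balls in total.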
Each step is simple — a single arithmetic operation. The difficulty isn't any individual step; it's the composition of multiple steps. What if we could make the model "think out loud" — generating these intermediate steps as text before producing the final answer?
Click "Standard" to see how a normal prompt leads to a wrong answer, then click "CoT" to see how adding reasoning steps fixes it. The key difference: the model gets to compute intermediate results as tokens.
This is exactly what Wei et al. discovered. By simply including step-by-step reasoning in the few-shot examples (or, as Kojima et al. later showed, just adding "Let's think step by step" to the prompt), you unlock dramatically better reasoning from the exact same model with the exact same weights.
The insight is embarrassingly simple: instead of prompting the model with input-output pairs, prompt it with input-reasoning-output triples. Show the model examples where the intermediate thinking is explicit.
Here's the difference side by side:
Standard prompting

Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 and bought 6 more, how many do they have?
A: ← model must jump to "9" in one step

Chain-of-thought prompting

Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: Roger started with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 and bought 6 more, how many do they have?
A: ← model generates reasoning THEN answer
That's it. No new model architecture. No additional training. No fine-tuning. Just a different prompt format that shows reasoning steps.
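To make the contrast concrete, here is a minimal sketch that renders the same exemplar in both formats. The helper name `format_exemplar` and its arguments are illustrative assumptions, not something taken from the paper.

```python
def format_exemplar(question, answer, chain=None):
    """Render one few-shot exemplar (hypothetical helper).

    With chain=None this is a standard input-output pair; with a chain it
    becomes an input-reasoning-output triple.
    """
    reasoning = f"{chain} " if chain else ""
    return f"Q: {question}\nA: {reasoning}The answer is {answer}."


question = "Roger has 5 tennis balls. He buys 2 cans of 3. How many now?"
chain = "Roger started with 5 balls. 2 cans × 3 balls = 6 balls. 5 + 6 = 11."

standard = format_exemplar(question, 11)      # input-output pair
cot = format_exemplar(question, 11, chain)    # input-reasoning-output triple

print(standard)
print()
print(cot)
```

The only difference between the two strings is the reasoning text placed before "The answer is 11."; everything else about the prompt stays the same.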
Think of it this way: standard prompting is like asking someone to multiply 47 × 83 in their head and just say the answer. Chain-of-thought is like giving them scratch paper. The same brain, the same abilities — but the scratch paper makes hard problems tractable.
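In standard prompting, the model estimates the answer distribution directly:

$$p(\text{answer} \mid \text{question})$$

In chain-of-thought prompting, the model instead estimates:

$$p(\text{answer} \mid \text{question}) = \sum_{c} p(c \mid \text{question}) \, p(\text{answer} \mid \text{question}, c)$$

where c is a chain of thought: a sequence of intermediate reasoning tokens. In principle the answer distribution marginalizes over all possible chains; in practice, the model generates one chain autoregressively and conditions on it.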
The decomposition is powerful because each conditional probability is simpler:
- p(c | question) — generate a reasoning chain (straightforward text generation)
- p(answer | question, c) — extract the answer from the chain (often trivial: read the last number)
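The sum over chains can be illustrated with a toy calculation. The sketch below uses a made-up distribution over three reasoning chains (all probabilities are invented for illustration) and combines p(c | question) with p(answer | question, c) to get the marginal answer distribution.

```python
from collections import defaultdict

# Hypothetical toy distribution for one question. Each entry maps a chain to
# (p(c | question), {answer: p(answer | question, c)}). Numbers are invented.
chains = {
    "2 x 3 = 6; 5 + 6 = 11": (0.7, {"11": 0.99, "9": 0.01}),
    "5 + 2 = 7; 7 + 3 = 10": (0.2, {"10": 0.98, "11": 0.02}),
    "5 + 2 + 3 = 10": (0.1, {"10": 1.0}),
}


def marginal_answer_probs(chains):
    """Sum p(c | question) * p(answer | question, c) over all chains c."""
    totals = defaultdict(float)
    for p_chain, answer_probs in chains.values():
        for answer, p_answer in answer_probs.items():
            totals[answer] += p_chain * p_answer
    return dict(totals)


probs = marginal_answer_probs(chains)
best = max(probs, key=probs.get)
print(best, probs)
```

A real model never enumerates chains like this. Greedy decoding follows a single chain, which is why a bad early step can derail the final answer.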
Watch how the model generates reasoning tokens one at a time. Each new token becomes context for the next. Click "Step" to advance one token, or "Auto" to watch the full chain unfold. Notice how intermediate results become available for later computations.
The original Chain-of-Thought paper (Wei et al., 2022) focused on few-shot CoT: providing a handful of manually-written examples where each example includes the reasoning chain. The model learns the pattern from these demonstrations and generates chains for new problems.
Wei et al. hand-wrote a single set of 8 chain-of-thought exemplars and reused it across most of the arithmetic benchmarks, including GSM8K. Here is one of those exemplars:
Few-shot CoT exemplar

Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
A: Shawn started with 5 toys. He then got 2 toys from his mom and 2 from his dad. So he got 2 + 2 = 4 more toys. Now he has 5 + 4 = 9 toys. The answer is 9.
The results were dramatic. On the GSM8K benchmark (grade-school math word problems):
| Method | Model | GSM8K Accuracy |
|---|---|---|
| Standard few-shot | PaLM 540B | 17.9% |
| CoT few-shot | PaLM 540B | 56.9% |
| Standard few-shot | GPT-3 175B | ~15% |
| CoT few-shot | GPT-3 175B | ~46% |
| Fine-tuned SOTA | GPT-3 + verifier | 55% |
Not all chains-of-thought are equally effective. Wei et al. found several patterns in what makes good exemplars:
1. Decompose into atomic steps. Each step should perform exactly one operation. "2 cans × 3 balls = 6" is one step. Don't combine multiple operations.
2. Use natural language, not equations. "He got 2 from mom and 2 from dad, so 2 + 2 = 4 more" works better than just "2 + 2 = 4". The words provide semantic grounding.
3. End with a clear answer format. "The answer is X" provides a consistent extraction point.
4. Match the difficulty level. Exemplars should be similar in complexity to the test problems. Too-simple exemplars don't demonstrate multi-step reasoning; too-complex exemplars confuse the model.
```python
# Building a few-shot CoT prompt programmatically
def build_cot_prompt(exemplars, test_question):
    """
    exemplars: list of (question, chain, answer) tuples
    """
    prompt = ""
    for q, chain, ans in exemplars:
        prompt += f"Q: {q}\nA: {chain} The answer is {ans}.\n\n"
    prompt += f"Q: {test_question}\nA:"
    return prompt

# The model sees the pattern: Q → reasoning → "The answer is X"
# and follows it for the new question
```
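The fixed "The answer is X" ending is what makes automatic scoring possible. As a rough sketch (the regular expression and the name `extract_answer` are assumptions, and real evaluation scripts may parse answers differently), the final number can be pulled out of a completion like this:

```python
import re

# Matches an optionally negative number, allowing thousands separators and a
# decimal part, right after the fixed phrase "The answer is".
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?[\d,]*\.?\d+)")


def extract_answer(completion):
    """Return the last number following 'The answer is', or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


completion = (
    "Shawn started with 5 toys. He then got 2 toys from his mom and 2 from "
    "his dad. So he got 2 + 2 = 4 more toys. Now he has 5 + 4 = 9 toys. "
    "The answer is 9."
)
print(extract_answer(completion))
```

Taking the last match rather than the first guards against completions that continue into another question and restate the phrase.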
Add exemplars to the prompt and watch how the model's accuracy improves. Each exemplar includes a reasoning chain. The bar chart shows accuracy with standard prompting (blue) vs CoT (orange) across different numbers of exemplars.
A follow-up discovery by Kojima et al. (2022) was even more surprising: you don't need hand-crafted exemplars at all. Simply appending "Let's think step by step" to the prompt triggers chain-of-thought reasoning.
Zero-shot CoT

Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step. ← That's it. This one phrase changes everything.

Model output: The juggler has 16 balls total. Half of 16 = 8 golf balls. Half of 8 = 4 blue golf balls. The answer is 4.
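This is a two-stage process:

1. Reasoning extraction: append the trigger phrase to the question and let the model generate a reasoning chain.
2. Answer extraction: feed the question, the trigger phrase, and the generated chain back to the model, followed by an answer prompt such as "Therefore, the answer is", and read off the final answer.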
The results:
| Method | MultiArith | GSM8K | SVAMP |
|---|---|---|---|
| Zero-shot standard | 17.7% | 12.5% | 58.8% |
| Zero-shot CoT | 78.7% | 40.7% | 69.2% |
| Few-shot CoT (8 exemplars) | 91.7% | 56.9% | 79.0% |
Kojima et al. tested many trigger phrases:
| Trigger Phrase | MultiArith Accuracy |
|---|---|
| "Let's think step by step" | 78.7% |
| "Let's think about this logically" | 73.2% |
| "Let's solve this problem by splitting it into steps" | 72.2% |
| "First," | 67.8% |
| "Let's think" | 57.5% |
| (no trigger) | 17.7% |
"Let's think step by step" won, but many similar phrases also work. The key ingredient isn't the exact words — it's triggering the model into a step-by-step generation mode.
```python
# Zero-shot CoT implementation
def zero_shot_cot(question, model):
    """Two-stage zero-shot CoT.

    `model` is any object exposing generate(prompt, max_tokens=...) -> str;
    adapt the calls to whatever LLM client you use.
    """
    # Stage 1: Generate reasoning
    prompt1 = f"Q: {question}\nA: Let's think step by step."
    reasoning = model.generate(prompt1, max_tokens=256)

    # Stage 2: Extract answer, conditioning on the generated reasoning
    prompt2 = prompt1 + reasoning + "\nTherefore, the answer is"
    answer = model.generate(prompt2, max_tokens=32)
    return answer.strip()

# Two LLM calls, but no exemplars needed
# Works across tasks without task-specific prompts
```
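A trigger-phrase comparison like the one in the table above boils down to a small evaluation loop. The sketch below is one way it might look. `accuracy_by_trigger` is a hypothetical helper, and `fake_generate` is a rigged stand-in for a real model so that the example runs on its own; its behavior says nothing about how an actual LLM responds.

```python
def accuracy_by_trigger(problems, triggers, generate, extract):
    """Score each trigger phrase on a list of (question, gold_answer) pairs.

    generate(prompt) -> str and extract(text) -> answer are caller-supplied
    callables wrapping whatever model and answer parser you use.
    """
    scores = {}
    for trigger in triggers:
        correct = 0
        for question, gold in problems:
            prompt = f"Q: {question}\nA: {trigger}".rstrip()
            if extract(generate(prompt)) == gold:
                correct += 1
        scores[trigger] = correct / len(problems)
    return scores


# Stub model for demonstration only: it produces a correct chain only when the
# prompt contains the word "step". A real run would call an actual LLM here.
def fake_generate(prompt):
    if "step" in prompt:
        return "Half of 16 is 8. Half of 8 is 4. The answer is 4."
    return "The answer is 8."


def fake_extract(text):
    # Naive parser: take the last whitespace-separated token as an integer.
    return int(text.rstrip(".").split()[-1])


problems = [
    ("A juggler can juggle 16 balls. Half of the balls are golf balls, and "
     "half of the golf balls are blue. How many blue golf balls are there?", 4),
]
triggers = ["Let's think step by step.", ""]
scores = accuracy_by_trigger(problems, triggers, fake_generate, fake_extract)
print(scores)
```

Passing `generate` and `extract` as arguments keeps the loop independent of any particular model API, so the same harness can score any set of phrases.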
Compare different trigger phrases for zero-shot CoT. Each bar shows accuracy on a math benchmark. "Let's think step by step" consistently wins, but many phrases help. The baseline (no trigger) is shown in gray.
Chain-of-thought doesn't improve everything. Wei et al. carefully mapped out where it helps, where it's neutral, and where it actually hurts.
| Task Type | Standard | CoT | Gain |
|---|---|---|---|
| Multi-step arithmetic (GSM8K) | 17.9% | 56.9% | +39pp |
| Symbolic reasoning (last letter concat) | 15.3% | 58.6% | +43pp |
| Commonsense reasoning (StrategyQA) | 65.4% | 73.4% | +8pp |
| Simple classification (sentiment) | 94.1% | 93.8% | -0.3pp |
Perhaps the most important finding: CoT only helps at sufficient model scale. Below roughly 100B parameters, chain-of-thought gives little benefit and can even hurt performance.
| Model Size | Standard (GSM8K) | CoT (GSM8K) | Effect |
|---|---|---|---|
| ~8B (LaMDA) | 4.5% | 2.2% | Worse! |
| ~62B (PaLM) | 7.2% | 11.3% | Small gain |
| ~175B (GPT-3) | ~15% | ~46% | Large gain |
| ~540B (PaLM) | 17.9% | 56.9% | Huge gain |
At small scales, the model generates reasoning chains that are grammatically correct but logically wrong — it produces plausible-sounding but incorrect intermediate steps, leading to worse answers than just guessing directly. Only models large enough to generate correct reasoning chains benefit from CoT.
CoT has a sweet spot of task difficulty:
Too easy (1-step problems): The model can already solve these directly. CoT adds tokens but no benefit.
Just right (2-5 step problems): The model struggles with direct answers but can generate correct chains. Maximum benefit.
Too hard (10+ step problems): Even with chains, errors compound across steps. Each step has some error probability, and errors propagate: with 95% accuracy per step, 10-step accuracy is only 0.95¹⁰ ≈ 60%.
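If each step is correct independently with probability p, an n-step chain is fully correct with probability

$$P = p^{n}$$

For p = 0.95 and n = 10: P ≈ 0.60. For n = 20: P ≈ 0.36. Longer chains are exponentially less reliable.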
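The compounding-error arithmetic is easy to check numerically. This is a small sketch under the simplifying assumption that step errors are independent and never self-correct; both function names are illustrative.

```python
import math


def chain_success_probability(p_step, n_steps):
    """Probability that every one of n_steps is correct (independent errors)."""
    return p_step ** n_steps


def max_reliable_steps(p_step, target):
    """Longest chain whose success probability stays at or above target.

    Assumes 0 < p_step < 1 and 0 < target < 1.
    """
    return math.floor(math.log(target) / math.log(p_step))


for n in (1, 5, 10, 20):
    print(n, round(chain_success_probability(0.95, n), 2))

print(max_reliable_steps(0.95, 0.5))
```

With 95% per-step accuracy, the chain drops below a coin flip after 13 steps, which is one way to see why very long problems sit outside the sweet spot.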
Drag the X slider to change model size and Y slider to change task complexity. The heatmap shows when CoT helps (green), is neutral (gray), or hurts (red). CoT is most beneficial for large models on medium-complexity tasks.
Understanding why chain-of-thought works is crucial. There are several complementary hypotheses, each supported by evidence.
A Transformer performs a fixed amount of computation per output token (one forward pass through L layers). By generating intermediate tokens, CoT effectively gives the model more "compute time": a 10-token chain of thought adds 10 forward passes before the answer, each able to build on the results of the previous ones.
CoT decomposes a complex problem into simpler sub-problems. Instead of solving "multi-step arithmetic" in one step, the model solves a sequence of "single-step arithmetic" problems. Each individual step is within the model's capability; the chain structure handles the composition.
```python
# Without CoT: must solve compound problem in one pass
# p("11" | "5 + 2×3 = ?")  → hard composite computation

# With CoT: each step is simple
# p("2×3 = 6" | problem)               → easy single operation
# p("5 + 6 = 11" | problem, "2×3 = 6") → easy, with "6" in context
# p("11" | chain)                      → trivial read-off
```
Are the generated chains actually the "reasoning" the model is doing, or just post-hoc rationalizations? Evidence suggests they're at least partially faithful:
Evidence for faithfulness:
1. When the chain contains an arithmetic error, the final answer is usually consistent with the error (not the correct answer). This means the model is actually using the chain.
2. Perturbing the chain (inserting wrong intermediate values) consistently changes the final answer in the predicted direction.
Evidence against faithfulness:
1. Models sometimes generate correct chains for wrong reasons (correct answer, but the reasoning doesn't match the actual computation path).
2. The model may be doing "smart-sounding rationalization" rather than genuine step-by-step reasoning.
The training data contains many examples of step-by-step explanations — math textbooks, StackOverflow answers, tutorial blog posts. CoT triggers the model to sample from this "explanation" distribution rather than the "direct answer" distribution. Within the explanation distribution, accurate reasoning is more likely because the training examples in that distribution tend to be correct (people write correct explanations).
Standard prompting uses 1 forward pass to go from question to answer. CoT uses N forward passes (one per reasoning token). This visualization shows the computation "depth" for each approach. Click "Step" to advance and see how each approach processes the same problem.
Wang et al. (2022) proposed Self-Consistency: generate multiple chains-of-thought (using sampling with temperature > 0), then take a majority vote on the final answer. Different chains may make different errors, but the correct answer is most likely to appear across multiple chains.
Self-consistency with K=40 chains improved GSM8K from 56.9% (single CoT) to 74.4% — a 17.5pp gain from just generating more chains and voting.
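The voting step itself is only a few lines. Here is a minimal sketch of the idea, not the authors' implementation: `sample_chain` and `extract` are hypothetical callables, and the demo replaces the model with a fixed cycle of canned chains so that it runs deterministically.

```python
import itertools
from collections import Counter


def self_consistency(question, sample_chain, extract, k=40):
    """Sample k chains and return (majority_answer, vote_share).

    sample_chain(question) -> str should draw one chain with temperature > 0;
    extract(text) parses the final answer or returns None. Both are supplied
    by the caller.
    """
    answers = [extract(sample_chain(question)) for _ in range(k)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / k


# Demo stand-in for a sampling model: cycles through three canned chains,
# one of which makes a reasoning error.
canned = itertools.cycle([
    "2 x 3 = 6. 5 + 6 = 11. The answer is 11.",
    "5 + 2 = 7. 7 + 3 = 10. The answer is 10.",
    "Two cans give 6 balls. 5 + 6 = 11. The answer is 11.",
])

answer, share = self_consistency(
    "Roger has 5 tennis balls. He buys 2 cans of 3. How many now?",
    sample_chain=lambda q: next(canned),
    extract=lambda text: text.rstrip(".").split()[-1],
    k=6,
)
print(answer, share)
```

The vote share doubles as a rough confidence signal: a narrow majority suggests the chains disagree and the answer deserves less trust.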
Time to see chain-of-thought in action. This interactive explorer lets you compare standard prompting vs CoT on different problem types, model sizes, and chain lengths.
Select a problem type and watch the model solve it step by step. In "Standard" mode, the model jumps directly to an answer. In "CoT" mode, it generates intermediate reasoning. Toggle between modes to see the accuracy difference. The error probability per step is shown — watch how errors compound in long chains.
Generate multiple chains and watch majority voting improve accuracy. Each chain may have errors, but the correct answer tends to win the vote. Click "Sample Chain" to generate a new reasoning path. The tally board shows the running vote count.
Chain-of-thought prompting opened the floodgates for a new subfield: reasoning elicitation. Understanding its connections reveals both its roots and its impact.
| Paper | Contribution | Relationship to CoT |
|---|---|---|
| GPT-3 (2020) | In-context learning | CoT is a specific form of in-context learning — demonstrations include reasoning |
| Scratchpad (Nye et al., 2021) | Let models write intermediate computation | Predecessor idea — CoT generalized it to natural language |
| Program Induction by Rationale Generation (Ling et al., 2017) | Train models to generate solution steps | Required training on annotated rationales; CoT achieves this with prompting alone |

| Paper | How It Extended CoT |
|---|---|
| Self-Consistency (Wang 2022) | Sample multiple chains, majority vote → +17.5pp on GSM8K |
| Tree of Thoughts (Yao 2023) | Branch and search over reasoning paths, not just one chain |
| Least-to-Most (Zhou 2022) | Decompose into sub-questions first, solve bottom-up |
| ReAct (Yao 2023) | Interleave reasoning (CoT) with actions (tool use) |
| Let's Verify Step by Step (Lightman 2023) | Train verifiers to check each reasoning step, not just the answer |
| o1 (OpenAI 2024) | Train models to generate internal chains-of-thought via RL |
CoT revealed something profound about language models: they have more capability than standard prompting extracts. The model "knows" how to reason — it's seen reasoning in its training data — but standard prompting doesn't give it the chance. CoT is a way to elicit latent capability.
This insight, that prompting technique can matter as much as raw model capability, helped turn prompt engineering into a field of its own and eventually led to approaches where reasoning is trained directly into the model (o1, R1), not just elicited via prompting.
CoT era (2022-2023):

- Reasoning elicited via prompts.
- Chain quality depends on prompt.
- Emergent at ~100B+ scale.

Post-CoT era (2024+):

- Reasoning trained via RL.
- Models generate chains internally.
- Works at smaller scales.
"The limits of my language mean the limits of my world." — Ludwig Wittgenstein