Chain-of-Thought (Wei 2022)

Chapter 0: Why Reasoning Fails

Here's a simple word problem: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?"

You probably solved it instantly: 5 + (2 × 3) = 11. But if you ask GPT-3 (175B parameters) with a standard prompt, it often says "9" or "8" — jumping to an answer without doing the intermediate arithmetic.

Why? Because standard prompting asks the model to go directly from question to answer in a single step. The model must compose multiple operations — parse the problem, identify the quantities, determine the operations, execute the arithmetic, and combine the results — all within a single token prediction. That's asking a lot from one forward pass.

The core problem: LLMs perform one "unit of computation" per token generated. Complex reasoning requires multiple computational steps. If the model can only output the final answer, it must do all the reasoning internally in a single forward pass — and there simply aren't enough layers to perform complex multi-step reasoning within the fixed-depth computation graph.

Consider what happens when you solve this problem yourself. You don't jump to "11" — your internal monologue goes something like:

Step 1: Roger starts with 5 balls

Step 2: He buys 2 cans × 3 balls/can = 6 new balls

Step 3: 5 + 6 = 11 total balls

Each step is simple — a single arithmetic operation. The difficulty isn't any individual step; it's the composition of multiple steps. What if we could make the model "think out loud" — generating these intermediate steps as text before producing the final answer?
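The three steps above can be written out as a tiny Python sketch. The variable names are illustrative only; the point is that each line performs one operation and stores its result where the next line can read it, much like a written-out reasoning step.

```python
# Worked version of the tennis-ball problem, one operation per step.
# Variable names are illustrative, not taken from the paper.
starting_balls = 5   # Step 1: Roger starts with 5 balls
cans_bought = 2
balls_per_can = 3

new_balls = cans_bought * balls_per_can    # Step 2: 2 cans x 3 balls per can
total_balls = starting_balls + new_balls   # Step 3: add the new balls to the original 5

print(f"new balls: {new_balls}")
print(f"total balls: {total_balls}")
```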

Standard vs Chain-of-Thought

Click "Standard" to see how a normal prompt leads to a wrong answer, then click "CoT" to see how adding reasoning steps fixes it. The key difference: the model gets to compute intermediate results as tokens.


This is essentially what Wei et al. found. Simply including step-by-step reasoning in the few-shot examples unlocks dramatically better reasoning from the exact same model with the exact same weights. A follow-up by Kojima et al. showed that even just adding "Let's think step by step" to the prompt can have a similar effect.

Why do LLMs struggle with multi-step reasoning in standard prompting?

- Because the model must compose all reasoning steps in a single forward pass when going directly from question to answer; there aren't enough computation steps for complex multi-step reasoning within one token prediction

- Because LLMs can't do arithmetic at all

- Because the training data doesn't contain math problems

Chapter 1: The Core Idea

The insight is embarrassingly simple: instead of prompting the model with input-output pairs, prompt it with input-reasoning-output triples. Show the model examples where the intermediate thinking is explicit.

Here's the difference side by side:

Standard prompting
Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 and bought 6 more,
   how many do they have?
A: ← model must jump to "9" in one step

Chain-of-thought prompting
Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: Roger started with 5 balls. 2 cans × 3 balls = 6 balls.
   5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 and bought 6 more,
   how many do they have?
A: ← model generates reasoning THEN answer

That's it. No new model architecture. No additional training. No fine-tuning. Just a different prompt format that shows reasoning steps.

Why does this work? Each intermediate token the model generates becomes part of the context for subsequent tokens. When the model writes "2 cans × 3 balls = 6 balls", the "6" is now in its context window. When it then needs to compute "5 + 6", both numbers are explicitly available — the model doesn't have to hold them in its internal representations. The chain-of-thought externalizes working memory.
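The feedback loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular model is served: `generate_chain` and `scripted_model` are hypothetical names, and `scripted_model` is a hard-coded stand-in for an LLM that simply looks at what is already in the context to decide what to write next.

```python
from typing import Callable, List

def generate_chain(question: str,
                   next_step: Callable[[str], str],
                   max_steps: int = 10) -> List[str]:
    """Ask `next_step` for one more line at a time, feeding back everything written so far."""
    context = f"Q: {question}\nA:"
    steps: List[str] = []
    for _ in range(max_steps):
        step = next_step(context)
        steps.append(step)
        context += " " + step  # the new step is now visible to every later call
        if step.startswith("The answer is"):
            break
    return steps

def scripted_model(context: str) -> str:
    """Hypothetical stand-in for an LLM: chooses the next line from what the context already holds."""
    if "= 6 balls" not in context:
        return "2 cans x 3 balls = 6 balls."
    if "5 + 6 = 11" not in context:
        return "5 + 6 = 11."   # only possible because the "6" is already in the context
    return "The answer is 11."

steps = generate_chain(
    "Roger has 5 tennis balls. He buys 2 cans of 3. How many now?",
    scripted_model,
)
for line in steps:
    print(line)
```

Each call sees a longer context than the one before it, which is the sense in which the chain acts as external working memory.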

Think of it this way: standard prompting is like asking someone to multiply 47 × 83 in their head and just say the answer. Chain-of-thought is like giving them scratch paper. The same brain, the same abilities — but the scratch paper makes hard problems tractable.

The math

In standard prompting, the model estimates:

p(answer | question)

In chain-of-thought prompting, the model instead estimates:

p(answer | question) = ∑_c p(answer | question, c) · p(c | question)

Where c is a chain-of-thought — a sequence of intermediate reasoning tokens. The model marginalizes over possible chains, but in practice, it generates one chain autoregressively and conditions on it.

The decomposition is powerful because each conditional probability is simpler:

- p(c | question) — generate a reasoning chain (straightforward text generation)

- p(answer | question, c) — extract the answer from the chain (often trivial: read the last number)
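To make the marginalization concrete, here is a toy numeric sketch. The three chains and all probabilities are invented for illustration; a real model never enumerates chains like this, it samples one and conditions on it.

```python
# Toy distribution over three hypothetical chains for one question.
# Every number below is made up for illustration.
p_chain = {"chain_a": 0.6, "chain_b": 0.3, "chain_c": 0.1}   # p(c | question)
p_answer_given_chain = {                                      # p(answer | question, c)
    "chain_a": {"11": 0.95, "9": 0.05},
    "chain_b": {"11": 0.90, "9": 0.10},
    "chain_c": {"11": 0.10, "9": 0.90},
}

def marginal(answer: str) -> float:
    """p(answer | question) = sum over c of p(answer | question, c) * p(c | question)."""
    return sum(
        p_chain[c] * p_answer_given_chain[c].get(answer, 0.0)
        for c in p_chain
    )

print(f"p('11' | question) = {marginal('11'):.2f}")
print(f"p('9'  | question) = {marginal('9'):.2f}")

# In practice the model commits to a single sampled chain, for example chain_a,
# and the answer distribution is then just p(answer | question, chain_a).
print(f"p('11' | question, chain_a) = {p_answer_given_chain['chain_a']['11']:.2f}")
```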

Token-by-Token Reasoning

Watch how the model generates reasoning tokens one at a time. Each new token becomes context for the next. Click "Step" to advance one token, or "Auto" to watch the full chain unfold. Notice how intermediate results become available for later computations.

How does chain-of-thought prompting improve reasoning without changing the model's weights?

- By training the model on more math data during a special fine-tuning phase

- By using a larger model with more parameters

- By making the model generate intermediate reasoning steps as tokens: each step becomes part of the context for subsequent steps, effectively externalizing working memory into the token sequence

Chapter 2: Few-Shot CoT

The original Chain-of-Thought paper (Wei et al., 2022) focused on few-shot CoT: providing a handful of manually-written examples where each example includes the reasoning chain. The model learns the pattern from these demonstrations and generates chains for new problems.

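Wei et al. hand-wrote a set of 8 chain-of-thought exemplars and reused it across most of the arithmetic benchmarks. Here's an exemplar in the style used for GSM8K (the exemplars are written by the authors, not drawn from the benchmark's test set):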

Few-shot CoT exemplar
Q: Shawn has five toys. For Christmas, he got two toys
   each from his mom and dad. How many toys does he
   have now?
A: Shawn started with 5 toys. He then got 2 toys from
   his mom and 2 from his dad. So he got 2 + 2 = 4 more
   toys. Now he has 5 + 4 = 9 toys. The answer is 9.

The results were dramatic. On the GSM8K benchmark (grade-school math word problems):

Method	Model	GSM8K Accuracy
Standard few-shot	PaLM 540B	17.9%
CoT few-shot	PaLM 540B	56.9%
Standard few-shot	GPT-3 175B	~15%
CoT few-shot	GPT-3 175B	~46%
Fine-tuned SOTA	GPT-3 + verifier	55%

3x improvement from just changing the prompt. PaLM 540B went from 17.9% to 56.9%, more than tripling its score, with zero additional training. The CoT version even edged past the previous state of the art (55%), which required specialized fine-tuning and a separate verifier model.

Exemplar design matters

Not all chains of thought are equally effective. Wei et al.'s ablations, together with later practice, point to several patterns in what makes good exemplars:

1. Decompose into atomic steps. Each step should perform exactly one operation. "2 cans × 3 balls = 6" is one step. Don't combine multiple operations.

2. Use natural language, not equations. "He got 2 from mom and 2 from dad, so 2 + 2 = 4 more" works better than just "2 + 2 = 4". The words provide semantic grounding.

3. End with a clear answer format. "The answer is X" provides a consistent extraction point.

4. Match the difficulty level. Exemplars should be similar in complexity to the test problems. Too-simple exemplars don't demonstrate multi-step reasoning; too-complex exemplars confuse the model.

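```python
# Building a few-shot CoT prompt programmatically
def build_cot_prompt(exemplars, test_question):
    """
    Assemble a few-shot chain-of-thought prompt.

    exemplars: list of (question, chain, answer) tuples
    test_question: the new question the model should answer
    """
    parts = []
    for q, chain, ans in exemplars:
        # Each exemplar shows the full pattern: question, reasoning, final answer
        parts.append(f"Q: {q}\nA: {chain} The answer is {ans}.\n\n")
    # End with an open "A:" so the model continues with its own chain
    parts.append(f"Q: {test_question}\nA:")
    return "".join(parts)

# The model sees the pattern: Q → reasoning → "The answer is X"
# and follows it for the new question
```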
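Because every exemplar ends with "The answer is X", the final answer can be pulled out of a completion with a simple pattern match. The following is a minimal sketch under that assumption; `extract_answer` and the regular expression are illustrative, not the evaluation code used in the paper.

```python
import re
from typing import Optional

# Matches "The answer is 9", "The answer is $1,234.50", "The answer is -3", ...
ANSWER_PATTERN = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")

def extract_answer(completion: str) -> Optional[str]:
    """Return the first numeric answer in the completion, or None if the format is missing.

    The first match is used because a model may keep going and invent
    further Q/A pairs after answering the actual question.
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # drop thousands separators

print(extract_answer("Shawn started with 5 toys. 5 + 4 = 9 toys. The answer is 9."))
print(extract_answer("I am not sure."))
```

Returning `None` for a missing answer keeps formatting failures visible instead of silently counting them as a wrong number.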

Few-Shot CoT Prompt Builder

Add exemplars to the prompt and watch how the model's accuracy improves. Each exemplar includes a reasoning chain. The bar chart shows accuracy with standard prompting (blue) vs CoT (orange) across different numbers of exemplars.


On GSM8K, PaLM 540B with chain-of-thought prompting scored 56.9% vs 17.9% with standard prompting. What changes between these two setups?

- The model architecture is modified to include a reasoning module

- Only the prompt format changes: the few-shot examples include step-by-step reasoning chains instead of just question-answer pairs. Same model, same weights, no additional training

- The model is fine-tuned on GSM8K training data with reasoning annotations

Chapter 3: Zero-Shot CoT

A follow-up discovery by Kojima et al. (2022) was even more surprising: you don't need hand-crafted exemplars at all. Simply appending "Let's think step by step" to the prompt triggers chain-of-thought reasoning.

Zero-shot CoT
Q: A juggler can juggle 16 balls. Half of the balls are
   golf balls, and half of the golf balls are blue. How
   many blue golf balls are there?

A: Let's think step by step.
← That's it. This one phrase changes everything.

Model output:
The juggler has 16 balls total.
Half of 16 = 8 golf balls.
Half of 8 = 4 blue golf balls.
The answer is 4.

This is a two-stage process:

Stage 1: Reasoning Extraction

Append "Let's think step by step" and let the model generate a reasoning chain. Don't extract the answer yet.

↓

Stage 2: Answer Extraction

Append "Therefore, the answer is" to the generated chain, and let the model complete with the final answer.

The results:

Method	MultiArith	GSM8K	SVAMP
Zero-shot standard	17.7%	10.4%	58.8%
Zero-shot CoT	78.7%	40.7%	62.1%
Few-shot CoT (8 exemplars)	91.7%	56.9%	79.0%

"Let's think step by step" may be the most impactful five words in prompting history. This single phrase, requiring zero examples and zero domain expertise, more than quadrupled accuracy on MultiArith (17.7% → 78.7%). A plausible explanation is that the phrase nudges the model into a "reasoning persona": it has seen a great many step-by-step explanations in its training data, and this phrase activates that mode.

Why "Let's think step by step" specifically?

Kojima et al. tested many trigger phrases:

Trigger Phrase	MultiArith Accuracy
"Let's think step by step"	78.7%
"First,"	77.3%
"Let's think about this logically"	74.5%
"Let's solve this problem by splitting it into steps"	72.2%
"Let's think"	57.5%
(no trigger)	17.7%

"Let's think step by step" won, but many similar phrases also work. The key ingredient isn't the exact words — it's triggering the model into a step-by-step generation mode.

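```python
# Zero-shot CoT implementation
def zero_shot_cot(question, model):
    """
    Two-stage zero-shot CoT.

    model: any object with a generate(prompt, max_tokens=...) method that
           returns the completion as a string (a placeholder interface,
           not a specific library's API)
    """
    # Stage 1: Generate reasoning
    prompt1 = f"Q: {question}\nA: Let's think step by step."
    reasoning = model.generate(prompt1, max_tokens=256)

    # Stage 2: Extract answer by conditioning on the generated chain
    prompt2 = f"{prompt1} {reasoning.strip()}\nTherefore, the answer is"
    answer = model.generate(prompt2, max_tokens=32)

    return answer.strip()

# Two LLM calls, but no exemplars needed
# Works across tasks without task-specific prompts
```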
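A comparison like the trigger-phrase table above can be reproduced with a small evaluation harness. The sketch below is an assumption-laden illustration: `trigger_accuracy` is a hypothetical helper, and `fake_complete` is a hard-coded stand-in for a real LLM call that "reasons" only when the prompt contains "step by step", so the numbers it produces mean nothing beyond showing the mechanics.

```python
from typing import Callable, Dict, List, Tuple

def trigger_accuracy(trigger: str,
                     dataset: List[Tuple[str, str]],
                     complete: Callable[[str], str]) -> float:
    """Fraction of questions whose completion ends with the gold answer."""
    correct = 0
    for question, gold in dataset:
        prompt = f"Q: {question}\nA: {trigger}".rstrip()
        output = complete(prompt)
        if output.strip().endswith(f"The answer is {gold}."):
            correct += 1
    return correct / len(dataset)

def fake_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with a real client."""
    if "step by step" in prompt:
        return "Half of 16 is 8. Half of 8 is 4. The answer is 4."
    return "The answer is 8."

dataset = [
    ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
     "and half of the golf balls are blue. How many blue golf balls are there?", "4"),
]

results: Dict[str, float] = {
    trigger: trigger_accuracy(trigger, dataset, fake_complete)
    for trigger in ["Let's think step by step.", "Let's think.", ""]
}
for trigger, acc in results.items():
    print(f"{trigger!r}: {acc:.0%}")
```

Swapping `fake_complete` for a real model call and `dataset` for a benchmark split is what turns this into an actual measurement.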

Trigger Phrase Comparison

Compare different trigger phrases for zero-shot CoT. Each bar shows accuracy on a math benchmark. "Let's think step by step" consistently wins, but many phrases help. The baseline (no trigger) is shown in gray.

In zero-shot CoT, why does adding "Let's think step by step" to the prompt trigger reasoning behavior?

- The phrase activates a "reasoning persona": the model has seen millions of step-by-step explanations in training data associated with similar phrases, so it shifts into a generation mode that produces intermediate reasoning steps

- The phrase adds new knowledge to the model about how to solve math problems

- The phrase triggers a special reasoning algorithm built into the model architecture

Chapter 4: When It Works

Chain-of-thought doesn't improve everything. Wei et al. carefully mapped out where it helps, where it's neutral, and where it actually hurts.

Where CoT helps most

Task Type	Standard	CoT	Gain
Multi-step arithmetic (GSM8K)	17.9%	56.9%	+39pp
Symbolic reasoning (last letter concat)	15.3%	58.6%	+43pp
Commonsense reasoning (StrategyQA)	65.4%	73.4%	+8pp
Simple classification (sentiment)	94.1%	93.8%	-0.3pp

CoT helps when the task requires multi-step reasoning. If the answer can be obtained in a single step (sentiment = just check tone), CoT adds unnecessary verbosity and may even slightly hurt. The benefits appear precisely when the problem requires composing multiple sub-computations — arithmetic, logical deduction, causal reasoning, multi-hop QA.

Scale matters critically

Perhaps the most important finding: CoT only helps at sufficient model scale. Below roughly 100B parameters, chain-of-thought yields little gain and can even hurt performance.

Model Size	Standard (GSM8K)	CoT (GSM8K)	Effect
~8B (LaMDA)	4.5%	2.2%	Worse!
~62B (PaLM)	7.2%	11.3%	Small gain
~175B (GPT-3)	~15%	~46%	Large gain
~540B (PaLM)	17.9%	56.9%	Huge gain

At small scales, the model generates reasoning chains that are grammatically correct but logically wrong — it produces plausible-sounding but incorrect intermediate steps, leading to worse answers than just guessing directly. Only models large enough to generate correct reasoning chains benefit from CoT.

CoT is an emergent capability. Like few-shot learning in GPT-3, chain-of-thought reasoning appears to be an emergent ability that only materializes at sufficient scale. Below the threshold, the model can mimic the format (write things that look like reasoning) but not the substance (produce correct reasoning). This is plausibly part of why CoT prompting took off in 2022 rather than 2020: most earlier models could not yet produce reliable chains.

Task complexity sweet spot

CoT has a sweet spot of task difficulty:

Too easy (1-step problems): The model can already solve these directly. CoT adds tokens but no benefit.

Just right (2-5 step problems): The model struggles with direct answers but can generate correct chains. Maximum benefit.

Too hard (10+ step problems): Even with chains, errors compound across steps. Each step has some error probability, and errors propagate: with 95% accuracy per step, 10-step accuracy is only 0.95¹⁰ ≈ 60%.

P(correct chain) = ∏_{i=1}^{n} P(step i correct) = pⁿ, assuming every step is correct independently with the same probability p.

For p = 0.95 and n = 10: P ≈ 0.60. For n = 20: P ≈ 0.36. Longer chains are exponentially less reliable.
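The compounding effect is easy to check numerically. This sketch assumes, as in the formula above, that every step succeeds independently with the same probability; the function names are illustrative.

```python
import math

def chain_success_probability(p_step: float, n_steps: int) -> float:
    """Probability that all n steps are correct, assuming independent steps."""
    return p_step ** n_steps

def max_steps_for_target(p_step: float, target: float) -> int:
    """Largest chain length n for which p_step**n is still at least `target`."""
    return math.floor(math.log(target) / math.log(p_step))

for n in (1, 5, 10, 20):
    print(f"n = {n:2d}: P = {chain_success_probability(0.95, n):.2f}")

# How long can a chain get before it is more likely wrong than right?
print(max_steps_for_target(0.95, 0.5))
```

Under these assumptions a 95%-reliable step supports chains of about 13 steps before the whole chain drops below even odds.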

CoT Benefit by Scale and Complexity

Drag the X slider to change model size and Y slider to change task complexity. The heatmap shows when CoT helps (green), is neutral (gray), or hurts (red). CoT is most beneficial for large models on medium-complexity tasks.


Why does chain-of-thought prompting actually HURT performance on small models (~8B parameters)?

- Small models generate chains that are grammatically plausible but logically incorrect: they mimic the format of reasoning without actually reasoning correctly, leading to confidently wrong answers that are worse than direct guessing

- Small models don't have enough context window to fit the chain-of-thought

- Small models can't generate text at all when prompted with CoT

Chapter 5: Why It Works

Understanding why chain-of-thought works is crucial. There are several complementary hypotheses, each supported by evidence.

Hypothesis 1: Extended computation

A Transformer performs a fixed number of computation steps per output token (one forward pass through L layers). By generating intermediate tokens, CoT effectively gives the model more "compute time." A 10-token chain-of-thought adds 10 forward passes before the answer, each building on the previous results.

Tokens are compute. For a fixed Transformer at inference time, the most direct way to do more computation is to generate more tokens. Each token requires a full forward pass through all layers. So a 20-token reasoning chain spends 20 forward passes before the answer, where standard prompting with a one-token answer spends just 1. This is one reason CoT helps with complex reasoning: it buys the model more computation steps to work through the problem.
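A back-of-the-envelope sketch makes the compute argument concrete. It counts one forward pass per generated token (chain tokens plus the answer token) and uses the common rule of thumb of roughly 2 FLOPs per parameter per generated token for a dense Transformer. That rule of thumb is an assumption here: it ignores attention cost over the context and the cost of reading the prompt, and the function names are illustrative.

```python
def forward_passes(chain_tokens: int, answer_tokens: int = 1) -> int:
    """One forward pass per generated token: the chain plus the final answer."""
    return chain_tokens + answer_tokens

def approx_generation_flops(n_params: float, generated_tokens: int) -> float:
    """Rough estimate, assuming ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params * generated_tokens

N_PARAMS = 540e9  # a PaLM-540B-sized model, used only as an example

standard = forward_passes(chain_tokens=0)   # answer directly
with_cot = forward_passes(chain_tokens=20)  # 20 reasoning tokens, then the answer

print(f"standard: {standard} pass, ~{approx_generation_flops(N_PARAMS, standard):.2e} FLOPs")
print(f"CoT:      {with_cot} passes, ~{approx_generation_flops(N_PARAMS, with_cot):.2e} FLOPs")
```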

Hypothesis 2: Decomposition

CoT decomposes a complex problem into simpler sub-problems. Instead of solving "multi-step arithmetic" in one step, the model solves a sequence of "single-step arithmetic" problems. Each individual step is within the model's capability; the chain structure handles the composition.

```python
# Without CoT: must solve compound problem in one pass
# p("11" | "5 + 2×3 = ?") → hard composite computation

# With CoT: each step is simple
# p("2×3 = 6"  | problem) → easy single operation
# p("5 + 6 = 11" | problem, "2×3 = 6") → easy, with "6" in context
# p("11" | chain) → trivial read-off
```

Hypothesis 3: Faithful reasoning traces

Are the generated chains actually the "reasoning" the model is doing, or just post-hoc rationalizations? Evidence suggests they're at least partially faithful:

Evidence for faithfulness:

1. When the chain contains an arithmetic error, the final answer is usually consistent with the error (not the correct answer). This means the model is actually using the chain.

2. Perturbing the chain (inserting wrong intermediate values) consistently changes the final answer in the predicted direction.

Evidence against faithfulness:

1. Models sometimes reach the correct answer with a chain that does not match how the answer was actually computed, so the result is right for the wrong reasons.

2. The model may be doing "smart-sounding rationalization" rather than genuine step-by-step reasoning.
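The perturbation test mentioned under the evidence for faithfulness can be sketched as follows. Everything here is hypothetical scaffolding: `chain_reader` and `chain_ignorer` are toy stand-ins for two extreme kinds of model, one that reads its answer off the chain and one that ignores the chain entirely. A real experiment would replace them with calls to an actual model.

```python
import re
from typing import Callable

RESULT = re.compile(r"=\s*(-?\d+)")

def perturb_last_result(chain: str, wrong_value: int) -> str:
    """Replace the final '= <number>' in the chain with a deliberately wrong value."""
    matches = list(RESULT.finditer(chain))
    if not matches:
        raise ValueError("chain contains no '= <number>' result to perturb")
    last = matches[-1]
    return chain[:last.start(1)] + str(wrong_value) + chain[last.end(1):]

def answer_tracks_chain(answer_from: Callable[[str], str],
                        chain: str,
                        wrong_value: int) -> bool:
    """True if corrupting the chain drags the final answer along with it."""
    original = answer_from(chain)
    perturbed = answer_from(perturb_last_result(chain, wrong_value))
    return perturbed == str(wrong_value) and perturbed != original

def chain_reader(chain: str) -> str:
    """Toy model that reads its answer off the last result in the chain."""
    return RESULT.findall(chain)[-1]

def chain_ignorer(chain: str) -> str:
    """Toy model that ignores the chain and always answers 11."""
    return "11"

chain = "2 cans x 3 balls = 6 balls. 5 + 6 = 11."
print(perturb_last_result(chain, 12))
print(answer_tracks_chain(chain_reader, chain, 12))   # answer follows the corrupted chain
print(answer_tracks_chain(chain_ignorer, chain, 12))  # answer is unaffected by the chain
```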

Hypothesis 4: Distributional shift

The training data contains many examples of step-by-step explanations: math textbooks, StackOverflow answers, tutorial blog posts. CoT triggers the model to sample from this "explanation" distribution rather than the "direct answer" distribution. Within the explanation distribution, accurate reasoning is more likely because the training examples in that distribution tend to be correct (published explanations are mostly right).

Computation Depth Comparison

Standard prompting uses 1 forward pass to go from question to answer. CoT uses N forward passes (one per reasoning token). This visualization shows the computation "depth" for each approach. Click "Step" to advance and see how each approach processes the same problem.


Self-consistency: a natural extension

Wang et al. (2022) proposed Self-Consistency: generate multiple chains-of-thought (using sampling with temperature > 0), then take a majority vote on the final answer. Different chains may make different errors, but the correct answer is most likely to appear across multiple chains.

answer = argmax_a ∑_{i=1}^{K} 1[chain_i → a]

Self-consistency with K=40 chains improved GSM8K from 56.9% (single CoT) to 74.4% — a 17.5pp gain from just generating more chains and voting.
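The voting rule itself is only a few lines. This is a minimal sketch: it assumes the final answer has already been extracted from each sampled chain (with `None` marking chains where no answer could be parsed), and `majority_vote` is an illustrative name.

```python
from collections import Counter
from typing import List, Optional

def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent answer, ignoring chains with no parsed answer."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    # most_common(1) returns [(answer, count)] for the top-voted answer
    return votes.most_common(1)[0][0]

# Final answers from six hypothetical sampled chains
sampled_answers = ["11", "9", "11", None, "11", "8"]
print(majority_vote(sampled_answers))
```

Ties go to whichever tied answer was seen first, which is a simplification; a real pipeline might break ties by model likelihood instead.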

Why does generating more tokens (intermediate reasoning steps) help the model solve harder problems?

- Each generated token requires a full forward pass through all layers, giving the model more total computation. Additionally, intermediate results become explicit tokens in the context, serving as external working memory that the model can reference for subsequent steps

- More tokens make the output look more professional and authoritative

- The extra tokens allow the model to access a larger vocabulary

Chapter 6: CoT Explorer

Time to see chain-of-thought in action. This interactive explorer lets you compare standard prompting vs CoT on different problem types, model sizes, and chain lengths.

Chain-of-Thought Reasoning Simulator

Select a problem type and watch the model solve it step by step. In "Standard" mode, the model jumps directly to an answer. In "CoT" mode, it generates intermediate reasoning. Toggle between modes to see the accuracy difference. The error probability per step is shown — watch how errors compound in long chains.


Self-Consistency Voting

Generate multiple chains and watch majority voting improve accuracy. Each chain may have errors, but the correct answer tends to win the vote. Click "Sample Chain" to generate a new reasoning path. The tally board shows the running vote count.
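The same effect can be reproduced offline with a small Monte Carlo sketch. It rests on a toy assumption that is not from the paper: each chain independently lands on the correct answer with probability `p_correct`, and otherwise on one of several equally likely wrong answers. The function name and parameters are illustrative.

```python
import random
from collections import Counter

def simulate_vote_accuracy(p_correct: float,
                           n_wrong_answers: int,
                           k_chains: int,
                           trials: int = 2000,
                           seed: int = 0) -> float:
    """Estimate how often the majority vote over k sampled chains is correct."""
    rng = random.Random(seed)
    answers = ["correct"] + [f"wrong_{i}" for i in range(n_wrong_answers)]
    # Wrong chains scatter across distinct wrong answers, so they rarely agree
    weights = [p_correct] + [(1 - p_correct) / n_wrong_answers] * n_wrong_answers
    wins = 0
    for _ in range(trials):
        sampled = rng.choices(answers, weights=weights, k=k_chains)
        winner = Counter(sampled).most_common(1)[0][0]
        wins += winner == "correct"
    return wins / trials

single = simulate_vote_accuracy(p_correct=0.5, n_wrong_answers=4, k_chains=1)
voted = simulate_vote_accuracy(p_correct=0.5, n_wrong_answers=4, k_chains=15)
print(f"1 chain:   {single:.2f}")
print(f"15 chains: {voted:.2f}")
```

The gain comes from wrong answers being spread thin while correct chains pile up on one value; if all wrong chains agreed on the same wrong answer, voting would help far less.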


The key insight from this explorer: CoT doesn't make the model smarter — it gives the model scratch paper. The same model with the same weights produces dramatically different results depending on whether it's allowed to "think out loud." The chain-of-thought is the interface between the model's capability and the problem's complexity.

In self-consistency, why does majority voting over multiple chains improve accuracy beyond a single chain?

- Different chains may make different errors (since sampling is stochastic), but the correct answer is statistically more likely to appear across multiple chains. Majority voting filters out random errors while preserving the correct signal

- More chains use more compute, so the model gets smarter with each chain

- The model learns from its previous chains and improves

Chapter 7: Connections

Chain-of-thought prompting opened the floodgates for a new subfield: reasoning elicitation. Understanding its connections reveals both its roots and its impact.

What came before

Paper	Contribution	Relationship to CoT
GPT-3 (2020)	In-context learning	CoT is a specific form of in-context learning — demonstrations include reasoning
Scratchpad (Nye et al., 2021)	Let models write intermediate computation	Predecessor idea — CoT generalized it to natural language
Rationale generation (Ling et al., 2017)	Train models to generate solution steps	Required training on rationale-annotated data; CoT achieves this with prompting alone

What came after

Paper	How It Extended CoT
Self-Consistency (Wang 2022)	Sample multiple chains, majority vote → +17.5pp on GSM8K
Tree of Thoughts (Yao 2023)	Branch and search over reasoning paths, not just one chain
Least-to-Most (Zhou 2022)	Decompose into sub-questions first, solve bottom-up
ReAct (Yao 2023)	Interleave reasoning (CoT) with actions (tool use)
Let's Verify Step by Step (Lightman 2023)	Train verifiers to check each reasoning step, not just the answer
o1 (OpenAI 2024)	Train models to generate internal chains-of-thought via RL

The bigger picture

CoT revealed something profound about language models: they have more capability than standard prompting extracts. The model "knows" how to reason — it's seen reasoning in its training data — but standard prompting doesn't give it the chance. CoT is a way to elicit latent capability.

This insight, that prompting technique can matter as much as raw model capability, energized the practice of prompt engineering and eventually led to approaches where reasoning is trained directly into the model (o1, R1), not just elicited via prompting.

CoT era (2022-2023):

- Reasoning elicited via prompts.

- Chain quality depends on prompt.

- Emergent at ~100B+ scale.

Post-CoT era (2024+):

- Reasoning trained via RL.

- Models generate chains internally.

- Works at smaller scales.

The CoT legacy: Chain-of-thought prompting showed that language models can carry out multi-step reasoning when given the right interface. This was the bridge between "LLMs as fancy autocomplete" and "LLMs as reasoning engines." Much of the later progress in LLM reasoning, from self-consistency to tree-of-thought to o1, builds on the foundation CoT established.

"The limits of my language mean the limits of my world." — Ludwig Wittgenstein

What is the most important general insight from chain-of-thought prompting research?

- That language models have more reasoning capability than standard prompting extracts: the right prompting technique can elicit latent abilities without changing the model's weights

- That larger models are always better at every task

- That we should always use chain-of-thought for every prompt
