Zelikman, Wu, Mu, Goodman — Stanford / Google Research, 2022

Self-Taught Reasoner

Let the model teach itself to reason: generate rationales, keep the ones that work, learn from them, repeat.

Prerequisites: Chain-of-thought prompting + Fine-tuning + Few-shot learning

Chapter 0: The Problem

You have a language model that can answer questions. Sometimes it gets the answer right, sometimes wrong. You notice something: when the model "thinks out loud" — writing down intermediate reasoning steps before answering — it does better. This is chain-of-thought (CoT) prompting.

But there's a catch. To get a model to reliably produce chain-of-thought reasoning, you have two options, and both are painful:

  1. Build a massive dataset of human-written rationales. Expensive, slow, and impossible to scale to every domain.
  2. Use few-shot prompting — show the model a handful of examples with rationales and hope it generalizes. Cheap, but significantly less accurate than fine-tuning.

In 2022, the numbers told a stark story. On CommonsenseQA with GPT-J (6B parameters):

| Method | Accuracy | Cost |
| --- | --- | --- |
| Few-shot (no CoT) | 20.9% | ~Free |
| Few-shot with CoT | 36.6% | ~Free |
| Fine-tune on answers only | 60.0% | Need full dataset |
| ??? | 72.5% | 10 rationale examples |

Few-shot CoT gets you to 36.6%. Fine-tuning on answers alone gets you to 60%. But that last row — 72.5% from just 10 hand-written rationales — seems impossible. How do you bridge the gap between "10 examples" and "state-of-the-art performance"?

The Rationale Gap

Compare methods: few-shot gets cheap but weak signal, fine-tuning needs expensive data. STaR bridges the gap.

The core question: Can we start with just a few rationale examples and bootstrap our way to a large, high-quality rationale dataset — using the model itself as the source? Can the model teach itself to reason?
What is the fundamental problem STaR tries to solve?

Chapter 1: The Key Insight

Here is the idea that makes STaR work, in one sentence:

If a rationale leads to the correct answer, it is probably a good rationale. You don't need humans to verify every reasoning chain. The final answer is the filter. Generate thousands of rationales, keep only the ones that produce correct answers, and fine-tune on those.

This is a powerful simplification. Think about how a student learns math. They don't wait for a teacher to check every line of their work. They try a problem, check the answer in the back of the book, and if they got it right, they trust that their approach was reasonable. If wrong, they try again.

STaR does exactly this, but at scale:

Generate
Model writes rationales for thousands of questions (few-shot prompted)
Filter
Keep ONLY rationales where the final answer matches the ground truth
Fine-tune
Train the model on the filtered rationale dataset
↻ Repeat with the improved model

Each round, the model gets better at generating rationales, which means more correct answers, which means a larger and better training set, which means an even better model. It's a virtuous cycle.

But there's a subtlety. A question the model gets wrong produces no training example, so the loop gets no direct signal from exactly the problems it most needs to learn. The fine-tuned model may pick up some of them by generalizing, but progress eventually stalls. This is where the second insight comes in: rationalization, which we'll cover in Chapter 3.

What serves as the "quality signal" for filtering rationales in STaR?

Chapter 2: The STaR Loop

Let's look at the actual algorithm. STaR maintains two things: a pretrained base model M, and a prompt set P of just ~10 examples with hand-written rationales. Here is what happens at each iteration:

Step 1: Rationale Generation

For every question x_i in the training set, prepend the few-shot prompt P and ask the model to generate a rationale r̂_i followed by an answer ŷ_i. The prompt format looks like:

prompt format
Q: What can be used to carry a small dog?
Answer Choices: (a) swimming pool (b) basket (c) dog show
              (d) backyard (e) own home
A: The answer must be something that can be used to carry
a small dog. Baskets are designed to hold things.
Therefore, the answer is basket (b).

Q: [new question here]
Answer Choices: ...
A: [model generates rationale + answer]
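
The prompt assembly above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Example` class and the `format_question` and `build_prompt` helpers are hypothetical names, and the exact whitespace of the real prompts may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    choices: List[str]
    rationale: str  # hand-written reasoning that ends with the answer sentence


def format_question(question: str, choices: List[str]) -> str:
    """Render a question with lettered answer choices, ending at 'A:'."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    rendered = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return f"Q: {question}\nAnswer Choices: {rendered}\nA:"


def build_prompt(few_shot: List[Example], question: str, choices: List[str]) -> str:
    """Prepend the few-shot examples P to a new, unanswered question."""
    blocks = [f"{format_question(e.question, e.choices)} {e.rationale}" for e in few_shot]
    blocks.append(format_question(question, choices))
    return "\n\n".join(blocks)


demo = Example(
    question="What can be used to carry a small dog?",
    choices=["swimming pool", "basket", "dog show", "backyard", "own home"],
    rationale=(
        "The answer must be something that can be used to carry a small dog. "
        "Baskets are designed to hold things. Therefore, the answer is basket (b)."
    ),
)
prompt = build_prompt(
    [demo],
    "Where do you put your grapes just before checking out?",
    ["mouth", "grocery cart", "super market", "fruit basket", "fruit market"],
)
print(prompt)
```

The prompt ends at an open "A:" so the model continues with its own rationale and answer.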

Step 2: Filter

Compare each generated answer ŷ_i to the ground truth y_i. Keep only the (x_i, r̂_i, y_i) triples where ŷ_i = y_i.
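
As a rough sketch, the filter is an answer parser plus a list comprehension. The regular expression below assumes generations end with a sentence like "Therefore, the answer is basket (b)." as in the prompt format above; the `Sample` type and both function names are made up for this example.

```python
import re
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    question: str
    generation: str  # rationale followed by the answer sentence
    gold: str        # ground-truth choice letter, e.g. "b"


# Assumed answer pattern: "... the answer is <text> (x)" with x in a-e.
ANSWER_RE = re.compile(r"the\s+answer\s+is\b.*?\(([a-e])\)", re.IGNORECASE | re.DOTALL)


def extract_answer(generation: str) -> Optional[str]:
    """Return the last choice letter stated as the answer, or None if absent."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1].lower() if matches else None


def filter_correct(samples: List[Sample]) -> List[Sample]:
    """Keep only samples whose parsed answer equals the ground truth."""
    return [s for s in samples if extract_answer(s.generation) == s.gold]


samples = [
    Sample("q1", "Baskets are designed to hold things. Therefore, the answer is basket (b).", "b"),
    Sample("q2", "Dogs play outside. Therefore, the answer is backyard (d).", "b"),
    Sample("q3", "I am not sure.", "a"),
]
kept = filter_correct(samples)
print([s.question for s in kept])
```

A generation with no parseable answer is treated as wrong, which errs on the side of a cleaner training set.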

Step 3: Rationalize (for wrong answers)

For questions the model got wrong, provide the correct answer as a hint and ask the model to generate a rationale again. Filter these too. (More on this in Chapter 3.)

Step 4: Fine-tune

Combine the filtered rationales from Steps 2 and 3. Fine-tune the original pretrained model M on this combined dataset. Not the model from the previous iteration — always start fresh from M to avoid overfitting.

Step 5: Repeat

Use the newly fine-tuned model as M_n and go back to Step 1. Continue until performance plateaus.
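
To see how the five steps fit together, here is a toy simulation of the outer loop. Everything model-related is a stand-in: `ToyModel` is not a language model (it simply answers correctly with probability `skill`), and the hint bonus of 0.3 and the `fine_tune` rule are invented numbers chosen only to make the loop runnable. The control flow is the point: generate, rationalize failures, filter, and fine-tune from the base model each round.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (question, rationale, answer)


class ToyModel:
    """Stand-in for a language model; NOT a real LM.

    It produces the gold answer with probability `skill`. A hint
    (rationalization) raises that probability by a made-up bonus.
    """

    def __init__(self, skill: float):
        self.skill = skill

    def generate(self, question: str, gold: str, rng: random.Random,
                 hint: bool = False) -> Tuple[str, str]:
        p_correct = min(1.0, self.skill + (0.3 if hint else 0.0))
        answer = gold if rng.random() < p_correct else "wrong"
        return f"rationale for: {question}", answer


def fine_tune(base: ToyModel, data: List[Triple], n_questions: int) -> ToyModel:
    """Toy fine-tuning: always starts from `base`; skill grows with coverage."""
    coverage = len(data) / n_questions
    return ToyModel(min(0.95, base.skill + 0.6 * coverage))


def star(base: ToyModel, dataset: List[Tuple[str, str]], n_iters: int,
         seed: int = 0) -> Tuple[ToyModel, List[int]]:
    rng = random.Random(seed)
    model, sizes = base, []
    for _ in range(n_iters):
        kept: List[Triple] = []
        for question, gold in dataset:
            # Step 1: rationale generation with the current model.
            rationale, answer = model.generate(question, gold, rng)
            if answer != gold:
                # Step 3: rationalization, a second attempt with the answer as hint.
                rationale, answer = model.generate(question, gold, rng, hint=True)
            if answer == gold:
                # Step 2: filter. Only the rationale is stored, never the hint.
                kept.append((question, rationale, gold))
        # Step 4: fine-tune from the ORIGINAL base model, not from `model`.
        model = fine_tune(base, kept, len(dataset))
        sizes.append(len(kept))
    return model, sizes


dataset = [(f"q{i}", "b") for i in range(200)]
final_model, sizes = star(ToyModel(skill=0.35), dataset, n_iters=4)
print(sizes, round(final_model.skill, 2))
```

Note that `fine_tune` receives `base` on every iteration; only the training data changes between rounds.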

Simulation: The STaR Algorithm. Steps through one iteration of STaR at a time, showing the dataset grow and the model improve.

A critical detail: At each iteration, STaR fine-tunes from the original pretrained model, not from the previous iteration's fine-tuned model. The authors do this to avoid overfitting. The training data gets better each round, but the starting point stays the same.

Why does STaR fine-tune from the original pretrained model at each iteration, rather than continuing from the previous iteration's model?

Chapter 3: Rationalization

Here's the problem with the basic STaR loop: the model can only learn from questions it already answers correctly. If it gets a question wrong, that question produces no training signal. The dataset of correct rationales can only include questions the model can already solve.

This means the loop eventually stagnates. Suppose the model solves 60% of questions: it generates training data for those 60%, fine-tunes, and gains little on the rest, because the remaining 40% never contribute a direct training signal.

Rationalization is the trick that fixes this. For every question the model got wrong, you give it the correct answer as a hint and ask it to generate a rationale as if it knew the answer all along. Think of it as saying: "OK, the answer is (b) grocery cart. Now explain why."

Here's what the prompt looks like, with the hint marked by an arrow:

rationalization prompt
Q: Where do you put your grapes just before
   checking out?
Answer Choices: (a) mouth (b) grocery cart
   (c) super market (d) fruit basket (e) fruit market
A: (b) grocery cart.  ← hint provided
The answer should be the place where grocery
items are placed before checking out. Of the
above choices, grocery cart makes the most sense
for holding grocery items. Therefore, the answer
is grocery cart (b).

The crucial part: when adding this rationalized example to the training set, the hint is removed. The model is trained as if it produced this rationale on its own, without seeing the answer. The training example becomes:

training example (hint stripped)
Q: Where do you put your grapes just before
   checking out?
Answer Choices: (a) mouth (b) grocery cart ...
A: The answer should be the place where grocery
items are placed before checking out. Of the
above choices, grocery cart makes the most sense
for holding grocery items. Therefore, the answer
is grocery cart (b).
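
One simple way to guarantee the hint never leaks into training is to rebuild the training text from the question and the rationale alone. The sketch below follows the prompt layout shown above; the helper names are hypothetical, and in the full loop the rationalized output would still have to pass the answer filter before being kept.

```python
QUESTION = (
    "Q: Where do you put your grapes just before checking out?\n"
    "Answer Choices: (a) mouth (b) grocery cart (c) super market "
    "(d) fruit basket (e) fruit market"
)


def rationalization_prompt(question_block: str, hint: str) -> str:
    """Second-attempt prompt: the correct answer is shown right after 'A:'."""
    return f"{question_block}\nA: {hint}.\n"


def training_example(question_block: str, rationale: str) -> str:
    """Training text built from question + rationale only, so no hint appears."""
    return f"{question_block}\nA: {rationale}"


hint = "(b) grocery cart"
prompt = rationalization_prompt(QUESTION, hint)

# Pretend this is what the model wrote after seeing the hint.
rationale = (
    "The answer should be the place where grocery items are placed before "
    "checking out. Of the above choices, grocery cart makes the most sense "
    "for holding grocery items. Therefore, the answer is grocery cart (b)."
)
example = training_example(QUESTION, rationale)
print(example)
```

Because `training_example` never receives the hint, there is nothing to strip afterwards.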

Simulation: Rationalization: Hindsight Reasoning. The model attempts a question, fails, then succeeds with a hint; the hint is stripped before training.

Why does this work? Rationalization gives the model a "backwards" path to the answer. Given the correct answer, it's much easier to construct a plausible reasoning chain. Mathematically, we're sampling from p(rationale | question, answer) instead of p(rationale | question). This is a much easier distribution. The model can reverse-engineer a solution from the answer, and then learn to produce that reasoning forward.

On CommonsenseQA, direct rationale generation solved 78.2% of training examples. Rationalization rescued an additional 8.5%. Together, STaR trained on 86.7% of the dataset — all generated by the model itself, starting from just 10 hand-written rationales.

What happens to the "hint" (correct answer) used during rationalization when the example is added to the training set?

Chapter 4: Why Filtering Works

At first glance, filtering rationales by correctness seems crude. A model could get the right answer for the wrong reasons — especially on multiple-choice where there's a 20% chance of guessing correctly. Why does this simple filter produce good training data?

The authors show that STaR is actually an approximation to a policy gradient reinforcement learning objective. Here's the connection:

Think of the language model as a policy π that produces rationale-answer pairs (r̂, ŷ) given a question x. Define a binary reward: 1 if the answer is correct, 0 otherwise. The expected reward across the dataset is:

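$$J(M) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)} \left[ \mathbb{1}(\hat{y}_i = y_i) \right]$$

The policy gradient of this objective is:

$$\nabla J(M) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)} \left[ \mathbb{1}(\hat{y}_i = y_i) \cdot \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right]$$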

Notice: the indicator function 1(ŷi = yi) zeros out the gradient for all incorrect answers. This is exactly what STaR's filtering does — it discards rationales that led to wrong answers. The fine-tuning step then takes gradient steps to increase the probability of the surviving (correct) rationales.
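
You can check this identity numerically on a tiny example. The sketch below simplifies heavily: the "policy" is a softmax over five discrete answers for a single question, with no rationale at all, and the logits are arbitrary. It compares the indicator-weighted score-function gradient against a finite-difference gradient of the expected reward.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits: List[float], gold: int) -> float:
    """J = sum_a p(a) * 1(a == gold), which is just p(gold)."""
    return softmax(logits)[gold]


def filtered_score_gradient(logits: List[float], gold: int) -> List[float]:
    """sum_a p(a) * 1(a == gold) * grad log p(a); only the a == gold term survives."""
    p = softmax(logits)
    # d log p(gold) / d logit_k = 1(k == gold) - p[k]
    return [p[gold] * ((1.0 if k == gold else 0.0) - p[k]) for k in range(len(logits))]


def finite_difference_gradient(logits: List[float], gold: int,
                               eps: float = 1e-6) -> List[float]:
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_reward(up, gold) - expected_reward(down, gold)) / (2 * eps))
    return grad


logits = [0.5, -1.0, 2.0, 0.0, 1.5]  # arbitrary toy policy
gold = 2
analytic = filtered_score_gradient(logits, gold)
numeric = finite_difference_gradient(logits, gold)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))
```

The two gradients agree up to numerical error, which is the sense in which "train only on correct samples" follows the gradient of expected accuracy.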

STaR as approximate RL: Filtering by correctness is an implicit reward signal. STaR approximates policy gradient optimization with two simplifications: (1) greedy decoding instead of sampling reduces variance but limits exploration, and (2) multiple gradient steps on the same batch rather than single-step updates. These make STaR easy to implement with standard fine-tuning tools.

Of course, the 20% random-correct problem is real. On CommonsenseQA (5-way multiple choice), some wrong rationales will survive the filter. But as training progresses, the model generates higher-quality rationales, and the fraction of "lucky guesses" decreases relative to genuine reasoning. The signal-to-noise ratio improves with each iteration.
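
A back-of-the-envelope model makes the trend concrete. Assume (purely for illustration) that the model either reasons correctly with probability `p_know` or otherwise guesses uniformly among the choices. The function below, a hypothetical helper, gives the share of filter-passing rationales that were lucky guesses.

```python
def lucky_fraction(p_know: float, n_choices: int) -> float:
    """Share of rationales passing the filter that were lucky guesses.

    Toy assumption: with probability p_know the model reasons correctly;
    otherwise it guesses uniformly among n_choices options.
    """
    lucky = (1.0 - p_know) / n_choices
    return lucky / (p_know + lucky)


for p_know in (0.2, 0.4, 0.6, 0.8):
    print(p_know, round(lucky_fraction(p_know, n_choices=5), 3))
```

Under this toy model the lucky share falls as `p_know` rises, and for a fixed `p_know` it is higher when there are fewer answer choices.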

Simulation: Signal vs. Noise in Filtering. Shows how the ratio of genuine reasoning to lucky guesses shifts across iterations.

How does STaR's filtering connect to reinforcement learning?

Chapter 5: Iterative Improvement

The magic of STaR is that it gets better each round. Let's trace through what actually happens across iterations.

Iteration 1: The model starts with few-shot prompting. On CommonsenseQA, few-shot CoT gets about 36.6% accuracy. If the training set behaves similarly, roughly 3,600 of the 9,741 training questions produce correct rationales. Add perhaps 800 from rationalization, and you fine-tune on ~4,400 examples. (These per-iteration counts are rough estimates for illustration.)

Iteration 2: The fine-tuned model is now better. Say it correctly solves ~55% of training questions, producing ~5,400 correct rationales, plus ~1,200 from rationalization. The dataset is larger and higher quality.

Iterations 3-6: Each round, accuracy climbs. By the final iteration, 78.2% of training questions are solved by direct rationale generation, plus 8.5% from rationalization. The model has bootstrapped from 10 examples to 86.7% coverage of the dataset.

The virtuous cycle: Better model → more correct rationales → larger training set → better model → even more correct rationales → ... This is why the paper calls it "bootstrapping reasoning with reasoning."

A key practical detail: the authors increased the number of fine-tuning steps by 20% each iteration. Early iterations use fewer steps (40 at the start) to prevent overfitting on the small initial dataset. As the training set grows, more steps are appropriate. They also found that training more slowly at the beginning ultimately benefits final performance.
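
The schedule is easy to write down. In this sketch the function name is made up, and compounding from the initial 40 steps with rounding to the nearest integer is an assumption about how the 20% increase is applied.

```python
def fine_tune_steps(iteration: int, initial_steps: int = 40, growth: float = 1.2) -> int:
    """Fine-tuning steps for a 0-based outer-loop iteration.

    Assumes the 20% increase compounds and is rounded to the nearest step.
    """
    return round(initial_steps * growth ** iteration)


schedule = [fine_tune_steps(i) for i in range(6)]
print(schedule)
```

Starting small and growing the step count keeps early rounds, which train on the smallest datasets, from overfitting.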

For arithmetic, the improvement is particularly dramatic. Few-shot accuracy on 2-digit addition starts below 1%. After just one iteration of STaR, it jumps to 32%. By iteration 16, overall arithmetic accuracy reaches 89.5%, compared to 76.3% for a model fine-tuned directly on 10,000 answer-only examples.

Simulation: STaR Iteration Progress. Shows accuracy and dataset coverage growing across iterations for CommonsenseQA.

Why does STaR increase fine-tuning steps by 20% each iteration?

Chapter 6: Results

STaR was evaluated on three domains: arithmetic, commonsense reasoning (CommonsenseQA), and grade school math (GSM8K). The results are striking.

CommonsenseQA

This is the headline result. GPT-J (6B parameters) with STaR comes within half a point of GPT-3 (175B parameters) fine-tuned directly on answers:

| Method | Model | CQA Dev Acc | Train Data Used |
| --- | --- | --- | --- |
| Few-shot direct | GPT-J 6B | 20.9% | ~0% |
| Few-shot CoT | GPT-J 6B | 36.6% | ~0% |
| Few-shot CoT | LaMDA 137B | 55.6% | ~0% |
| Direct fine-tune | GPT-J 6B | 60.0% | 100% |
| STaR (without rationalization) | GPT-J 6B | 68.8% | 69.7% |
| STaR (with rationalization) | GPT-J 6B | 72.5% | 86.7% |
| Direct fine-tune | GPT-3 175B | 73.0% | 100% |

The punchline: STaR on a 6B model (72.5%) nearly matches a roughly 30x larger 175B model (73.0%), starting from just 10 hand-written rationales. The model taught itself to reason at a level comparable to a vastly larger model that was directly fine-tuned.

Arithmetic

On multi-digit addition with scratchpad reasoning, STaR reaches 89.5% after 16 iterations, versus 76.3% for direct fine-tuning on 10,000 examples. Without rationalization, the model learns digits one at a time (master 1-digit, then 2-digit, etc.). With rationalization, multiple digit lengths improve simultaneously.

GSM8K (Grade School Math)

| Method | GSM8K Test Acc | Train Data Used |
| --- | --- | --- |
| Few-shot direct | 3.0% | ~0% |
| Few-shot CoT | 3.1% | ~0% |
| Direct fine-tune | 5.8% | 100% |
| STaR (without rationalization) | 10.1% | 25.0% |
| STaR (with rationalization) | 10.7% | 28.7% |

The absolute numbers are lower (GSM8K is hard for 6B models), but STaR nearly doubles the performance of direct fine-tuning while using less than 30% of the training data.

Human Evaluation

Crowdworkers ranked STaR-generated rationales against few-shot rationales and human-written rationales on questions both methods answered correctly. Participants were 30% more likely to rank STaR rationales above few-shot rationales (p = .039) and 74% more likely to prefer them over human-written rationales (p < .001). The authors caution that this does not demonstrate human-level rationale writing; it more likely reflects how hard it is to collect high-quality crowdsourced rationales.

Results Comparison

CommonsenseQA accuracy across methods. STaR bridges the gap between few-shot and massive-model fine-tuning.

How does STaR (6B params) compare to GPT-3 (175B params) fine-tuned directly on CommonsenseQA?

Chapter 7: Convergence

Every iterative algorithm has to stop somewhere. When does STaR stop improving?

There are several forces at play:

Diminishing returns. Each iteration, the model solves more problems. But the remaining unsolved problems are genuinely harder. The low-hanging fruit is picked first. Eventually the model plateaus — it cannot generate correct rationales for the hardest questions, and rationalization cannot fully bridge the gap.

Dataset ceiling. Without rationalization, the training data can only contain problems the model already solves. This creates a hard ceiling: once the model stops solving new problems, no new training signal appears. Rationalization pushes this ceiling higher but doesn't eliminate it — the model still needs to produce a correct-answer rationale even with the hint.

Noise accumulation. On multiple-choice tasks, some wrong rationales sneak through the filter by luck (correct answer despite bad reasoning). Over many iterations, these noisy examples can accumulate. Restarting every fine-tuning run from the original pretrained model, rather than from a possibly corrupted checkpoint, helps limit how much of this noise compounds.

Temperature matters. The authors found that increasing temperature to generate more diverse rationales actually hurts. Higher temperature produces more "lucky" correct answers with bad reasoning, which poisons the training data. Rationalization is a much cleaner way to expand coverage than temperature-based exploration.
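
To see why temperature trades reliability for diversity, look at what it does to a next-token distribution. The sketch below is generic softmax-with-temperature math, not code from the paper, and the logits are arbitrary example values.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.5, 0.0]  # arbitrary example; index 0 is the model's top choice
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top choice to less likely continuations, so sampled rationales drift further from what the model considers its best reasoning while the answer filter stays just as easy to pass by luck.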

Simulation: Convergence Dynamics. Shows how accuracy, dataset size, and noise evolve, with and without rationalization, to illustrate its effect on the ceiling.

When to stop: In practice, the authors run STaR until dev-set performance saturates. For CQA this was about 6-8 outer loop iterations. For arithmetic, 16+ iterations were needed, since the model starts from near-zero few-shot accuracy and has to work up through longer digit lengths.
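
A saturation check like this can be as simple as a patience rule on dev accuracy. The sketch below is one possible stopping rule, not the authors' procedure: the function name, `patience`, `min_delta`, and the accuracy values are all assumptions for illustration.

```python
from typing import List


def has_plateaued(dev_acc: List[float], patience: int = 2, min_delta: float = 0.002) -> bool:
    """True if the last `patience` iterations failed to beat the earlier best by min_delta."""
    if len(dev_acc) <= patience:
        return False  # not enough history to judge
    best_before = max(dev_acc[:-patience])
    return max(dev_acc[-patience:]) < best_before + min_delta


still_improving = [0.50, 0.60, 0.70]          # made-up dev accuracies
flattened = [0.60, 0.70, 0.701, 0.7005]       # made-up dev accuracies
print(has_plateaued(still_improving), has_plateaued(flattened))
```

You would call this once per outer-loop iteration and stop when it returns True.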
Why does increasing sampling temperature hurt STaR's performance?

Chapter 8: Limitations

STaR is elegant, but it has clear boundaries. Understanding these helps you know when to use it and when to reach for something else.

Verifiable answers required

STaR's filter needs ground truth answers. The entire loop depends on checking whether a generated answer matches the correct one. For open-ended tasks — writing, summarization, creative tasks — there is no single correct answer to filter against. STaR works best on tasks with clear right/wrong answers: math, multiple choice, code (test cases), formal proofs.

Minimum model capability

The bootstrap needs a spark. If the model's few-shot performance is at chance level, there are no correct rationales to filter and the loop never gets going. The authors found that GPT-2 could not bootstrap even on arithmetic, while GPT-J (6B) could; it is the smallest model shown to work in the paper. The model must be large enough to occasionally produce correct reasoning in the few-shot setting.

Binary decisions are problematic

On tasks with only two choices (yes/no, true/false), a pure guess is right 50% of the time. A large share of the rationales that pass the filter can therefore be wrong reasoning paired with a lucky answer, and the signal-to-noise ratio is too low for the filter to work well.

Bias amplification

STaR trains on the model's own outputs. Any biases in the base model — gender, cultural, factual — get baked into the generated rationales and then amplified by fine-tuning on them. The CommonsenseQA dataset itself contains known biases, and the authors acknowledge that STaR may reinforce these.

Compute cost

Each outer loop iteration requires generating rationales for the entire training set (expensive inference) plus a full fine-tuning run. With 6+ iterations, the total compute is substantial. The authors used GPT-J specifically because it was affordable; applying STaR to larger models multiplies the cost.

| Limitation | Impact | Mitigation |
| --- | --- | --- |
| Needs verifiable answers | Cannot apply to open-ended tasks | Use reward models (leads to RLHF/DPO) |
| Minimum model size | Too-small models can't bootstrap | Use larger base model or more few-shot examples |
| Binary task noise | 50% lucky guesses poison data | Need better filtering (e.g., verify rationale steps) |
| Bias amplification | Self-training amplifies existing biases | Curate few-shot prompts, add debiasing steps |
| Compute cost | Multiple iterations of generate + fine-tune | Use smaller training subsets per iteration |

Why does STaR struggle with binary (yes/no) tasks?

Chapter 9: Connections

STaR (2022) is a pivotal paper. It sits at the intersection of several ideas and directly influenced the reasoning models that dominate today.

Predecessors

Expert Iteration (ExIt): STaR's loop mirrors Expert Iteration (Anthony et al., 2017), a scheme closely related to AlphaZero-style training: the current "apprentice" proposes solutions, an "expert" improves on or checks them, and the apprentice is then trained on the result. In STaR, the "expert" is the answer key, and the "apprentice" is the model. The key difference: STaR works in natural language, not game trees.

Chain-of-thought prompting (Wei et al., 2022): Showed that few-shot rationales improve reasoning. STaR turns this prompt-only technique into an iterative training procedure.

Descendants

Rejection sampling fine-tuning (RFT): Sample many solutions, keep correct ones, fine-tune. This is essentially one iteration of STaR without the loop or rationalization.

OpenAI o1 / o3: The "reasoning models" that produce a long chain of thought before answering. The exact training is proprietary, but the public descriptions (large-scale reinforcement learning on chains of thought) are broadly consistent with STaR-style iterative self-improvement on verifiable reasoning tasks.

DeepSeek-R1: Openly described as using RL on reasoning traces with verified answers — the same core principle as STaR, scaled up with RL instead of supervised fine-tuning.

Parallel ideas

Constitutional AI: Anthropic's method for self-improvement through self-critique. Where STaR uses answer correctness as the signal, Constitutional AI uses the model's own judgment guided by principles.

Self-play: Games like Go use self-play to improve without human data. STaR is "self-play for reasoning" — the model plays against the answer key instead of against itself.

The bigger picture: STaR demonstrated an idea that now underpins many of the most capable AI systems: a model can improve its own reasoning by generating, filtering, and learning from its own outputs. The model is both the student and (with rationalization) the teacher. This "bootstrapping reasoning with reasoning" is a conceptual ancestor of many modern reasoning-focused training approaches.

| Method | Signal Source | Iterative? | Domain |
| --- | --- | --- | --- |
| STaR | Answer correctness | Yes | Verifiable tasks |
| RLHF | Human preferences | No (single RL run) | Open-ended |
| DPO | Preference pairs | No | Open-ended |
| Constitutional AI | Self-critique + principles | Yes | Open-ended |
| Expert Iteration | Search (MCTS) | Yes | Games, formal proofs |
| DeepSeek-R1 | Answer correctness + RL | Yes | Verifiable tasks |

What core principle does STaR share with modern reasoning models like o1 and DeepSeek-R1?