STaR — Veanors

Chapter 0: The Problem

You have a language model that can answer questions. Sometimes it gets the answer right, sometimes wrong. You notice something: when the model "thinks out loud" — writing down intermediate reasoning steps before answering — it does better. This is chain-of-thought (CoT) prompting.

But there's a catch. To get a model to reliably produce chain-of-thought reasoning, you have two options, and both are painful:

Build a massive dataset of human-written rationales. Expensive, slow, and impossible to scale to every domain.
Use few-shot prompting — show the model a handful of examples with rationales and hope it generalizes. Cheap, but significantly less accurate than fine-tuning.

In 2022, the numbers told a stark story. On CommonsenseQA with GPT-J (6B parameters):

Method	Accuracy	Cost
Few-shot (no CoT)	20.9%	~Free
Few-shot with CoT	36.6%	~Free
Fine-tune on answers only	60.0%	Need full dataset
???	72.5%	10 rationale examples

Few-shot CoT gets you to 36.6%. Fine-tuning on answers alone gets you to 60%. But that last row — 72.5% from just 10 hand-written rationales — seems impossible. How do you bridge the gap between "10 examples" and "state-of-the-art performance"?

[Interactive figure: The Rationale Gap. Few-shot prompting is cheap but gives a weak signal, fine-tuning needs expensive data, and STaR bridges the gap.]

The core question: Can we start with just a few rationale examples and bootstrap our way to a large, high-quality rationale dataset — using the model itself as the source? Can the model teach itself to reason?

What is the fundamental problem STaR tries to solve?

Chain-of-thought improves reasoning, but getting a large dataset of rationales is either expensive (human annotation) or inaccurate (few-shot only), so we need a way to bootstrap rationales from a small seed set
Language models are too slow at inference time
Few-shot prompting uses too many tokens in the context window

Chapter 1: The Key Insight

Here is the idea that makes STaR work, in one sentence:

If a rationale leads to the correct answer, it is probably a good rationale. You don't need humans to verify every reasoning chain. The final answer is the filter. Generate thousands of rationales, keep only the ones that produce correct answers, and fine-tune on those.

This is a powerful simplification. Think about how a student learns math. They don't wait for a teacher to check every line of their work. They try a problem, check the answer in the back of the book, and if they got it right, they trust that their approach was reasonable. If wrong, they try again.

STaR does exactly this, but at scale:

Generate

Model writes rationales for thousands of questions (few-shot prompted)

↓

Filter

Keep ONLY rationales where the final answer matches the ground truth

↓

Fine-tune

Train the model on the filtered rationale dataset

↻ Repeat with the improved model

Each round, the model gets better at generating rationales, which means more correct answers, which means a larger and better training set, which means an even better model. It's a virtuous cycle.
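The filter step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the regular expression assumes generations end with a phrase like "the answer is basket (b)", and the names `extract_choice` and `filter_by_answer` are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed closing phrase, e.g. "Therefore, the answer is basket (b)."
ANSWER_RE = re.compile(
    r"the\s+answer\s+is\s+.*?\(([a-e])\)", re.IGNORECASE | re.DOTALL
)


def extract_choice(generation: str) -> Optional[str]:
    """Return the last answer letter stated in the text, or None if absent."""
    letters = ANSWER_RE.findall(generation)
    return letters[-1].lower() if letters else None


def filter_by_answer(
    samples: Iterable[Tuple[str, str, str]]
) -> List[Tuple[str, str]]:
    """Keep (question, generation) pairs whose stated answer equals the label.

    Each sample is (question, generation, gold_letter).
    """
    return [
        (question, generation)
        for question, generation, gold in samples
        if extract_choice(generation) == gold
    ]
```

Note that nothing here inspects the reasoning itself: a generation with no parsable answer, or with the wrong letter, is simply dropped.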

But there's a subtlety. If we only keep rationales for problems the model already solves, how does it learn to solve new problems? A failed problem contributes nothing, so the training set is capped at what the model can already do. This is where the second insight comes in: rationalization, which we'll cover in Chapter 3.

What serves as the "quality signal" for filtering rationales in STaR?

A separate reward model scores each rationale
Whether the final answer produced by the rationale matches the known correct answer
Human annotators rate each rationale on a 1-5 scale

Chapter 2: The STaR Loop

Let's look at the actual algorithm. STaR maintains two things: a pretrained base model M, and a prompt set P of just ~10 examples with hand-written rationales. Here is what happens at each iteration:

Step 1: Rationale Generation

For every question x_i in the training set, prepend the few-shot prompt P and ask the model to generate a rationale r̂_i followed by an answer ŷ_i. The prompt format looks like:

prompt format
Q: What can be used to carry a small dog?
Answer Choices: (a) swimming pool (b) basket (c) dog show
              (d) backyard (e) own home
A: The answer must be something that can be used to carry
a small dog. Baskets are designed to hold things.
Therefore, the answer is basket (b).

Q: [new question here]
Answer Choices: ...
A: [model generates rationale + answer]
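One way the prompt above might be assembled in code is sketched here. The helper names `format_question` and `build_prompt` are hypothetical, and the exact whitespace is an assumption; the layout simply mirrors the Q / Answer Choices / A pattern shown.

```python
from typing import List, Sequence, Tuple


def format_question(question: str, choices: Sequence[str]) -> str:
    """Render a question and its lettered choices, ending with an open 'A:'."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    rendered = " ".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return f"Q: {question}\nAnswer Choices: {rendered}\nA:"


def build_prompt(
    few_shot: List[Tuple[str, Sequence[str], str]],
    question: str,
    choices: Sequence[str],
) -> str:
    """Prepend solved (question, choices, rationale) examples to a new question."""
    blocks = [
        f"{format_question(q, ch)} {rationale}" for q, ch, rationale in few_shot
    ]
    # The final block leaves 'A:' open for the model to complete.
    blocks.append(format_question(question, choices))
    return "\n\n".join(blocks)
```

The same few-shot examples are reused for every training question; only the final block changes.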

Step 2: Filter

Compare each generated answer ŷ_i to the ground truth y_i. Keep only the (x_i, r̂_i, y_i) triples where ŷ_i = y_i.

Step 3: Rationalize (for wrong answers)

For questions the model got wrong, provide the correct answer as a hint and ask the model to generate a rationale again. Filter these too. (More on this in Chapter 3.)

Step 4: Fine-tune

Combine the filtered rationales from Steps 2 and 3. Fine-tune the original pretrained model M on this combined dataset. Not the model from the previous iteration — always start fresh from M to avoid overfitting.

Step 5: Repeat

Use the newly fine-tuned model as M_n and go back to Step 1. Continue until performance plateaus.
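The five steps can be put together as a short outer loop. This is a sketch under stated assumptions rather than the authors' implementation: `generate`, `rationalize`, and `fine_tune` are hypothetical callables supplied by the caller, and the loop runs for a fixed number of iterations instead of checking for a plateau.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[str, str]       # (question, gold answer)
Triple = Tuple[str, str, str]   # (question, rationale, gold answer)


def star_loop(
    base_model: Any,
    dataset: List[Example],
    generate: Callable[[Any, str], Tuple[str, str]],
    rationalize: Callable[[Any, str, str], Tuple[str, str]],
    fine_tune: Callable[[Any, List[Triple]], Any],
    n_iterations: int,
) -> Any:
    """Run n_iterations of generate, filter, rationalize, fine-tune."""
    model = base_model
    for _ in range(n_iterations):
        train: List[Triple] = []
        for question, gold in dataset:
            # Step 1: rationale generation with the current model.
            rationale, answer = generate(model, question)
            if answer != gold:
                # Step 3: retry with the correct answer given as a hint.
                rationale, answer = rationalize(model, question, gold)
            # Step 2 (and the filter in Step 3): keep only correct answers.
            if answer == gold:
                train.append((question, rationale, gold))
        # Step 4: always fine-tune from the original base model.
        model = fine_tune(base_model, train)
    return model
```

The design choice worth noticing is the last line of the loop body: the current model is used only to produce data, while every fine-tuning run starts from `base_model`.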

[Interactive figure: The STaR Algorithm. Steps through one iteration of STaR, showing the dataset grow and the model improve.]

A critical detail: At each iteration, STaR fine-tunes from the original pretrained model, not from the previous iteration's fine-tuned model. This prevents catastrophic overfitting. The training data gets better each round, but the starting point stays the same.

Why does STaR fine-tune from the original pretrained model at each iteration, rather than continuing from the previous iteration's model?

To avoid overfitting: continually training one model would cause it to overfit to the generated rationales, while restarting from the base model lets the improved dataset do the teaching
Because the original model is faster to train
Because the previous model's weights are deleted to save memory

Chapter 3: Rationalization

Here's the problem with the basic STaR loop: the model can only learn from questions it already answers correctly. If it gets a question wrong, that question produces no training signal. The dataset of correct rationales can only include questions the model can already solve.

This means the loop eventually stagnates. The model solves 60% of questions, generates training data for those 60%, fine-tunes, and... still solves roughly 60%. The remaining 40% never contribute to learning.

Rationalization is the trick that fixes this. For every question the model got wrong, you give it the correct answer as a hint and ask it to generate a rationale as if it knew the answer all along. Think of it as saying: "OK, the answer is (b) grocery cart. Now explain why."

Here's exactly what the prompt looks like, with the hint marked by an arrow:

rationalization prompt
Q: Where do you put your grapes just before
   checking out?
Answer Choices: (a) mouth (b) grocery cart
   (c) super market (d) fruit basket (e) fruit market
A: (b) grocery cart.  ← hint provided
The answer should be the place where grocery
items are placed before checking out. Of the
above choices, grocery cart makes the most sense
for holding grocery items. Therefore, the answer
is grocery cart (b).

The crucial part: when adding this rationalized example to the training set, the hint is removed. The model is trained as if it produced this rationale on its own, without seeing the answer. The training example becomes:

training example (hint stripped)
Q: Where do you put your grapes just before
   checking out?
Answer Choices: (a) mouth (b) grocery cart ...
A: The answer should be the place where grocery
items are placed before checking out. Of the
above choices, grocery cart makes the most sense
for holding grocery items. Therefore, the answer
is grocery cart (b).
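The two text formats above differ only in whether the hint follows "A:". A minimal sketch of that difference, with hypothetical helper names and an assumed string layout:

```python
def rationalization_prompt(question_block: str, gold_answer: str) -> str:
    """Prompt used at generation time: the correct answer follows 'A:' as a hint."""
    return f"{question_block}\nA: {gold_answer}."


def training_example(question_block: str, rationale: str) -> str:
    """Text used for fine-tuning: the rationale follows 'A:' with no hint."""
    return f"{question_block}\nA: {rationale}"
```

For example, `rationalization_prompt(block, "(b) grocery cart")` ends with `A: (b) grocery cart.`, while `training_example(block, rationale)` contains the same question followed directly by the rationale, so the fine-tuned model never sees the hint.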

[Interactive figure: Rationalization: Hindsight Reasoning. The model attempts a question, fails, then succeeds with a hint; the hint is stripped before training.]

Why does this work? Rationalization gives the model a "backwards" path to the answer. Given the correct answer, it's much easier to construct a plausible reasoning chain. Mathematically, we're sampling from p(rationale | question, answer) instead of p(rationale | question). This is a much easier distribution. The model can reverse-engineer a solution from the answer, and then learn to produce that reasoning forward.

On CommonsenseQA, direct rationale generation solved 78.2% of training examples. Rationalization rescued an additional 8.5%. Together, STaR trained on 86.7% of the dataset — all generated by the model itself, starting from just 10 hand-written rationales.

What happens to the "hint" (correct answer) used during rationalization when the example is added to the training set?

The hint is kept as part of the training example so the model learns to use hints
The hint is stripped: the model is trained as if it produced the rationale on its own, without ever seeing the answer
The hint is replaced with a special [MASK] token

Chapter 4: Why Filtering Works

At first glance, filtering rationales by correctness seems crude. A model could get the right answer for the wrong reasons — especially on multiple-choice where there's a 20% chance of guessing correctly. Why does this simple filter produce good training data?

The authors show that STaR is actually an approximation to a policy gradient reinforcement learning objective. Here's the connection:

Think of the language model as a policy π that produces rationale-answer pairs (r̂, ŷ) given a question x. Define a binary reward: 1 if the answer is correct, 0 otherwise. The expected reward across the dataset is:

J(M) = ∑_i E_{r̂_i, ŷ_i ~ p_M(·|x_i)} [1(ŷ_i = y_i)]

The policy gradient of this objective is:

∇J = ∑_i E_{r̂_i, ŷ_i ~ p_M} [1(ŷ_i = y_i) · ∇ log p_M(ŷ_i, r̂_i | x_i)]

Notice: the indicator function 1(ŷ_i = y_i) zeros out the gradient for all incorrect answers. This is exactly what STaR's filtering does — it discards rationales that led to wrong answers. The fine-tuning step then takes gradient steps to increase the probability of the surviving (correct) rationales.
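The identity can be checked numerically on a toy case. The sketch below is a simplification and an assumption, not the paper's setup: the "policy" is a softmax over five answer options with no rationale tokens, so the expectation can be enumerated exactly and compared with a finite-difference gradient of J = p(correct).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits: List[float], correct: int) -> List[float]:
    """E_a[ 1(a == correct) * grad log p(a) ], computed by exact enumeration."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        reward = 1.0 if a == correct else 0.0  # the indicator function
        # For a softmax, grad of log p(a) w.r.t. logit k is onehot(a)[k] - p[k].
        for k, p_k in enumerate(probs):
            grad[k] += p_a * reward * ((1.0 if k == a else 0.0) - p_k)
    return grad


def numeric_gradient(logits: List[float], correct: int, eps: float = 1e-6) -> List[float]:
    """Central finite differences of J = p(correct), for comparison."""
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        grad.append((softmax(up)[correct] - softmax(down)[correct]) / (2 * eps))
    return grad
```

Only the term with `a == correct` survives the sum, which is the enumeration-level analogue of discarding incorrect samples before fine-tuning.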

STaR as approximate RL: Filtering by correctness is an implicit reward signal. STaR approximates policy gradient optimization with two simplifications: (1) greedy decoding instead of sampling reduces variance but limits exploration, and (2) multiple gradient steps on the same batch rather than single-step updates. These make STaR easy to implement with standard fine-tuning tools.

Of course, the 20% random-correct problem is real. On CommonsenseQA (5-way multiple choice), some wrong rationales will survive the filter. But as training progresses, the model generates higher-quality rationales, and the fraction of "lucky guesses" decreases relative to genuine reasoning. The signal-to-noise ratio improves with each iteration.

[Interactive figure: Signal vs. Noise in Filtering. Shows how the ratio of genuine reasoning to lucky guesses shifts over training iterations.]

How does STaR's filtering connect to reinforcement learning?

Filtering by answer correctness acts as a binary reward signal, making STaR an approximation of policy gradient optimization where the indicator function zeros out gradients for incorrect answers
STaR uses PPO with a clipped objective
STaR trains a separate reward model on the rationales

Chapter 5: Iterative Improvement

The magic of STaR is that it gets better each round. Let's trace through what actually happens across iterations.

Iteration 1: The model starts with few-shot prompting. On CommonsenseQA, few-shot CoT gets about 36.6% accuracy, so roughly 3,600 of the 9,741 training questions produce correct rationales. Suppose rationalization adds another 800 or so; the model is then fine-tuned on about 4,400 examples. (The per-iteration counts here are illustrative.)

Iteration 2: The fine-tuned model is now better. Say it correctly solves ~55% of training questions, producing ~5,400 correct rationales, plus ~1,200 from rationalization. The dataset is larger and higher quality.

Iterations 3-6: Each round, accuracy climbs. By the final iteration, 78.2% of training questions are solved by direct rationale generation, plus 8.5% from rationalization. The model has bootstrapped from 10 examples to 86.7% coverage of the dataset.

The virtuous cycle: Better model → more correct rationales → larger training set → better model → even more correct rationales → ... This is why the paper calls it "bootstrapping reasoning with reasoning."

A key practical detail: the authors increased the number of fine-tuning steps by 20% each iteration. Early iterations use fewer steps (40 at the start) to prevent overfitting on the small initial dataset. As the training set grows, more steps are appropriate. They also found that training more slowly at the beginning ultimately benefits final performance.
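The step schedule described above is a simple geometric progression. A minimal sketch, assuming the count is rounded to the nearest integer (the rounding rule and the function name are assumptions):

```python
def fine_tune_steps(iteration: int, initial_steps: int = 40, growth: float = 1.2) -> int:
    """Fine-tuning steps for a 0-indexed outer-loop iteration, growing 20% per loop."""
    return round(initial_steps * growth ** iteration)
```

Under this rule the first three iterations use 40, 48, and 58 steps, so early rounds on the small dataset stay short while later rounds get progressively more training.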

For arithmetic, the improvement is particularly dramatic. Few-shot accuracy on 2-digit addition starts below 1%. After just one iteration of STaR, it jumps to 32%. By iteration 16, overall arithmetic accuracy reaches 89.5%, compared to 76.3% for a model fine-tuned directly on 10,000 answer-only examples.

[Interactive figure: STaR Iteration Progress. Accuracy and dataset coverage across iterations for CommonsenseQA.]

Why does STaR increase fine-tuning steps by 20% each iteration?

To make training faster
To use a different learning rate schedule
Because the training dataset grows with each iteration, so more steps are needed to learn from the larger set, while fewer steps early on prevent overfitting on the smaller initial dataset

Chapter 6: Results

STaR was evaluated on three domains: arithmetic, commonsense reasoning (CommonsenseQA), and grade school math (GSM8K). The results are striking.

CommonsenseQA

This is the headline result. GPT-J (6B parameters) with STaR matches GPT-3 (175B parameters) fine-tuned directly on answers:

Method	Model	CQA Dev Acc	Train Data
Few-shot direct	GPT-J 6B	20.9%	~0%
Few-shot CoT	GPT-J 6B	36.6%	~0%
Few-shot CoT	LaMDA 137B	55.6%	~0%
Direct fine-tune	GPT-J 6B	60.0%	100%
STaR (no rational.)	GPT-J 6B	68.8%	69.7%
STaR (with rational.)	GPT-J 6B	72.5%	86.7%
Direct fine-tune	GPT-3 175B	73.0%	100%

The punchline: STaR on a 6B model (72.5%) matches a 30x larger 175B model (73.0%), starting from just 10 hand-written rationales. The model taught itself to reason at a level comparable to a vastly larger model that was directly fine-tuned.

Arithmetic

On multi-digit addition with scratchpad reasoning, STaR reaches 89.5% after 16 iterations, versus 76.3% for direct fine-tuning on 10,000 examples. Without rationalization, the model learns digits one at a time (master 1-digit, then 2-digit, etc.). With rationalization, multiple digit lengths improve simultaneously.

GSM8K (Grade School Math)

Method	GSM8K Test Acc	Train Data Used
Few-shot direct	3.0%	~0%
Few-shot CoT	3.1%	~0%
Direct fine-tune	5.8%	100%
STaR (no rational.)	10.1%	25.0%
STaR (with rational.)	10.7%	28.7%

The absolute numbers are lower (GSM8K is hard for 6B models), but STaR nearly doubles the performance of direct fine-tuning while using less than 30% of the training data.

Human Evaluation

Crowdworkers ranked STaR-generated rationales against few-shot rationales and human-written rationales on questions both methods answered correctly. STaR rationales were preferred 30% more often over few-shot rationales (p = .039) and 74% more often over human-written rationales (p < .001). In other words, raters preferred the model's explanations to the human-written rationales in the dataset. This is a statement about rater preference on this dataset, not evidence of human-level reasoning.

[Figure: Results Comparison. CommonsenseQA accuracy across methods. STaR bridges the gap between few-shot and massive-model fine-tuning.]

How does STaR (6B params) compare to GPT-3 (175B params) fine-tuned directly on CommonsenseQA?

STaR on 6B achieves 72.5% vs. GPT-3's 73.0%, essentially matching a 30x larger model, using self-generated rationales from only 10 initial examples
STaR significantly outperforms GPT-3
STaR is about 20% worse than GPT-3

Chapter 7: Convergence

Every iterative algorithm has to stop somewhere. When does STaR stop improving?

There are several forces at play:

Diminishing returns. Each iteration, the model solves more problems. But the remaining unsolved problems are genuinely harder. The low-hanging fruit is picked first. Eventually the model plateaus — it cannot generate correct rationales for the hardest questions, and rationalization cannot fully bridge the gap.

Dataset ceiling. Without rationalization, the training data can only contain problems the model already solves. This creates a hard ceiling: once the model stops solving new problems, no new training signal appears. Rationalization pushes this ceiling higher but doesn't eliminate it — the model still needs to produce a correct-answer rationale even with the hint.

Noise accumulation. On multiple-choice tasks, some wrong rationales sneak through the filter by luck (correct answer despite bad reasoning). Over many iterations, these noisy examples can accumulate. The authors mitigate this by always fine-tuning from the original pretrained model rather than building on a potentially corrupted model.

Temperature matters. The authors found that increasing temperature to generate more diverse rationales actually hurts. Higher temperature produces more "lucky" correct answers with bad reasoning, which poisons the training data. Rationalization is a much cleaner way to expand coverage than temperature-based exploration.
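To see why temperature changes what gets sampled, here is a minimal sketch of temperature-scaled softmax. It is a generic illustration of the sampling mechanism, not code from the paper, and the function name is hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling it on the same logits with temperatures 0.5 and 2.0 shows the top option's probability shrinking as temperature rises, so less likely tokens, and with them less coherent rationales, are drawn more often.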

[Interactive figure: Convergence Dynamics. Accuracy, dataset size, and noise over iterations, with rationalization toggled on or off to show its effect on the ceiling.]

When to stop: In practice, the authors run STaR until dev-set performance saturates. For CQA this was about 6-8 outer loop iterations. For arithmetic, 16+ iterations were needed since the task has a cleaner signal (no multiple-choice noise).
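The text only says to stop when dev-set performance saturates; one common way to make that concrete is a patience rule. The sketch below is an assumption, not the authors' criterion, and `should_stop`, `patience`, and `min_delta` are hypothetical names.

```python
from typing import Sequence


def should_stop(dev_accuracies: Sequence[float], patience: int = 2, min_delta: float = 0.0) -> bool:
    """True if the last `patience` iterations failed to beat the earlier best."""
    if len(dev_accuracies) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(dev_accuracies[:-patience])
    recent_best = max(dev_accuracies[-patience:])
    return recent_best <= best_before + min_delta
```

A noisier task would typically call for a larger `patience`, since a single flat iteration is weaker evidence of a plateau.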

Why does increasing sampling temperature hurt STaR's performance?

Higher temperature substantially increases the chance of correct answers despite incorrect reasoning, polluting the training data with bad rationales that pass the filter
Higher temperature makes the model produce shorter outputs
Higher temperature uses more compute

Chapter 8: Limitations

STaR is elegant, but it has clear boundaries. Understanding these helps you know when to use it and when to reach for something else.

Verifiable answers required

STaR's filter needs ground truth answers. The entire loop depends on checking whether a generated answer matches the correct one. For open-ended tasks — writing, summarization, creative tasks — there is no single correct answer to filter against. STaR works best on tasks with clear right/wrong answers: math, multiple choice, code (test cases), formal proofs.

Minimum model capability

The bootstrap needs a spark. If the model's few-shot performance is at chance level, the answers that pass the filter are mostly lucky guesses and the loop never gets going. The authors found that GPT-2 could not bootstrap even on arithmetic, while GPT-J (6B) could. The model must be large enough to occasionally produce correct reasoning in the few-shot setting.

Binary decisions are problematic

On tasks with only two choices (yes/no, true/false), a model that guesses gets the right answer 50% of the time. A large share of the rationales that pass the filter can then be wrong reasoning with a lucky correct answer, and the signal-to-noise ratio is too low for the filter to work well.
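A back-of-the-envelope calculation makes the noise problem concrete. The sketch uses a toy model that is purely an assumption: with probability `p_solve` the model reasons its way to the right answer, and otherwise it picks uniformly among the choices.

```python
def lucky_fraction(p_solve: float, n_choices: int) -> float:
    """Share of filter-passing samples that are lucky guesses.

    Toy model (an assumption): with probability p_solve the model reasons
    correctly; otherwise it picks one of n_choices uniformly at random.
    """
    lucky = (1.0 - p_solve) / n_choices
    return lucky / (p_solve + lucky)
```

With `p_solve = 0.5`, one third of the surviving rationales are lucky guesses on a binary task, against one sixth on a 5-way task; with `p_solve = 0` on a binary task, all of them are.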

Bias amplification

STaR trains on the model's own outputs. Any biases in the base model — gender, cultural, factual — get baked into the generated rationales and then amplified by fine-tuning on them. The CommonsenseQA dataset itself contains known biases, and the authors acknowledge that STaR may reinforce these.

Compute cost

Each outer loop iteration requires generating rationales for the entire training set (expensive inference) plus a full fine-tuning run. With 6+ iterations, the total compute is substantial. The authors worked with GPT-J, a 6B model with a publicly available checkpoint; applying STaR to larger models multiplies the cost.

Limitation	Impact	Mitigation
Needs verifiable answers	Cannot apply to open-ended tasks	Use reward models (leads to RLHF/DPO)
Minimum model size	Too-small models can't bootstrap	Use larger base model or more few-shot examples
Binary task noise	50% lucky guesses poison data	Need better filtering (e.g., verify rationale steps)
Bias amplification	Self-training amplifies existing biases	Curate few-shot prompts, add debiasing steps
Compute cost	Multiple iterations of generate + fine-tune	Use smaller training subsets per iteration

Why does STaR struggle with binary (yes/no) tasks?

With 50% chance of a correct answer by luck, half the filtered training data consists of bad rationales that happened to reach the right answer, making the signal-to-noise ratio too low
Binary tasks are too easy for language models
The rationalization prompt doesn't work for binary questions

Chapter 9: Connections

STaR (2022) is a pivotal paper. It sits at the intersection of several ideas and directly influenced the reasoning models that dominate today.

Predecessors

Expert Iteration (ExIt): STaR's loop mirrors Expert Iteration, the scheme used in AlphaZero-style game-playing systems: generate solutions with the current "apprentice," improve via feedback from an "expert," then upgrade the apprentice. In STaR, the "expert" is the answer key, and the "apprentice" is the model. The key difference: STaR works in natural language, not game trees.

Chain-of-thought prompting (Wei et al., 2022): Showed that few-shot rationales improve reasoning. STaR turns this prompt-only technique into an iterative training procedure.

Descendants

Rejection sampling fine-tuning (RFT): Sample many solutions, keep correct ones, fine-tune. This is essentially one iteration of STaR without the loop or rationalization.

OpenAI o1 / o3: The "reasoning models" that think for minutes before answering. While the exact training is proprietary, the public descriptions align closely with STaR-style iterative self-improvement on verifiable reasoning tasks.

DeepSeek-R1: Openly described as using RL on reasoning traces with verified answers — the same core principle as STaR, scaled up with RL instead of supervised fine-tuning.
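The rejection-sampling step mentioned above ("sample many solutions, keep correct ones") can be sketched as follows. The `sample` callable is hypothetical, and dropping duplicate rationales is an added assumption rather than something the text specifies.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    question: str,
    gold: str,
    sample: Callable[[str], Tuple[str, str]],
    k: int,
) -> List[str]:
    """Draw k (rationale, answer) samples; keep distinct rationales with correct answers."""
    kept: List[str] = []
    seen = set()
    for _ in range(k):
        rationale, answer = sample(question)
        if answer == gold and rationale not in seen:
            seen.add(rationale)
            kept.append(rationale)
    return kept
```

Compared with the STaR loop, there is no hint-based retry and no second round: the kept rationales go straight into a single fine-tuning run.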

Parallel ideas

Constitutional AI: Anthropic's method for self-improvement through self-critique. Where STaR uses answer correctness as the signal, Constitutional AI uses the model's own judgment guided by principles.

Self-play: Games like Go use self-play to improve without human data. STaR is "self-play for reasoning" — the model plays against the answer key instead of against itself.

The bigger picture: STaR demonstrated an idea that now underpins many of the most capable AI systems: a model can improve its own reasoning by generating, filtering, and learning from its own outputs. The model is both the student and (with rationalization) the teacher. This "bootstrapping reasoning with reasoning" is a conceptual ancestor of many modern reasoning-focused training approaches.

Method	Signal Source	Iterative?	Domain
STaR	Answer correctness	Yes	Verifiable tasks
RLHF	Human preferences	No (single RL run)	Open-ended
DPO	Preference pairs	No	Open-ended
Constitutional AI	Self-critique + principles	Yes	Open-ended
Expert Iteration	Search (MCTS)	Yes	Games, formal proofs
DeepSeek-R1	Answer correctness + RL	Yes	Verifiable tasks

What core principle does STaR share with modern reasoning models like o1 and DeepSeek-R1?

The model iteratively improves its own reasoning ability by generating reasoning traces, filtering by verified correctness, and training on the successful ones: bootstrapping reasoning with reasoning
They all use PPO for reinforcement learning
They all require human-written rationale datasets

Self-Taught Reasoner