Let the model teach itself to reason: generate rationales, keep the ones that work, learn from them, repeat.
You have a language model that can answer questions. Sometimes it gets the answer right, sometimes wrong. You notice something: when the model "thinks out loud" — writing down intermediate reasoning steps before answering — it does better. This is chain-of-thought (CoT) prompting.
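But there's a catch. To get a model to reliably produce chain-of-thought reasoning, you have two options, and both are painful: prompt it with a handful of worked examples, which is cheap but gives a weak signal, or fine-tune it on a large dataset of human-written rationales, which is expensive to build.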
In 2022, the numbers told a stark story. On CommonsenseQA with GPT-J (6B parameters):
| Method | Accuracy | Cost |
|---|---|---|
| Few-shot (no CoT) | 20.9% | ~Free |
| Few-shot with CoT | 36.6% | ~Free |
| Fine-tune on answers only | 60.0% | Need full dataset |
| ??? | 72.5% | 10 rationale examples |
Few-shot CoT gets you to 36.6%. Fine-tuning on answers alone gets you to 60%. But that last row, 72.5% from just 10 hand-written rationales, seems impossible. How do you bridge the gap between "10 examples" and accuracy above full-dataset fine-tuning?
Few-shot prompting is cheap but gives a weak signal; fine-tuning needs expensive data. STaR bridges the gap.
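Here is the idea that makes STaR work, in one sentence: if a generated rationale leads to the correct answer, treat that rationale as good enough to learn from.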
This is a powerful simplification. Think about how a student learns math. They don't wait for a teacher to check every line of their work. They try a problem, check the answer in the back of the book, and if they got it right, they trust that their approach was reasonable. If wrong, they try again.
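STaR does exactly this, but at scale: the model generates a rationale and an answer for every training question, keeps only the rationales whose answers match the ground truth, fine-tunes on those, and repeats.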
Each round, the model gets better at generating rationales, which means more correct answers, which means a larger and better training set, which means an even better model. It's a virtuous cycle.
But there's a subtlety. If we only keep rationales for problems the model already solves, how does it learn to solve new problems? The filtered training set never contains a question the model currently gets wrong, so those questions provide no signal. This is where the second insight comes in: rationalization, which we'll cover in Chapter 3.
Let's look at the actual algorithm. STaR maintains two things: a pretrained base model M, and a prompt set P of just ~10 examples with hand-written rationales. Here is what happens at each iteration:
Step 1: Generate. For every question xi in the training set, prepend the few-shot prompt P and ask the model to generate a rationale r̂i followed by an answer ŷi. The prompt format looks like:
Prompt format:

Q: What can be used to carry a small dog?
Answer Choices:
(a) swimming pool
(b) basket
(c) dog show
(d) backyard
(e) own home
A: The answer must be something that can be used to carry a small dog. Baskets are designed to hold things. Therefore, the answer is basket (b).

Q: [new question here]
Answer Choices: ...
A: [model generates rationale + answer]
Step 2: Filter. Compare each generated answer ŷi to the ground truth yi. Keep only the (xi, r̂i, yi) triples where ŷi = yi.
Step 3: Rationalize. For questions the model got wrong, provide the correct answer as a hint and ask the model to generate a rationale again. Filter these too. (More on this in Chapter 3.)
Step 4: Fine-tune. Combine the filtered rationales from Steps 2 and 3. Fine-tune the original pretrained model M on this combined dataset, not the model from the previous iteration: always start fresh from M to avoid overfitting.
Step 5: Repeat. Use the newly fine-tuned model as Mn and go back to Step 1. Continue until performance plateaus.
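The five steps above can be sketched as a short loop. This is a minimal illustration, not the authors' code: the `Model` interface (a callable taking a question and an optional hint), the `fine_tune` callback, and the fixed `n_iterations` stopping rule are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]        # (question, ground-truth answer)
Triple = Tuple[str, str, str]    # (question, rationale, answer)
# Hypothetical model interface: (question, hint) -> (rationale, answer).
# hint is None for normal generation and the gold answer for rationalization.
Model = Callable[[str, Optional[str]], Tuple[str, str]]


def collect_rationales(model: Model, dataset: List[Example]) -> List[Triple]:
    """Steps 1-3: generate, filter by answer, rationalize the failures."""
    kept: List[Triple] = []
    for question, gold in dataset:
        rationale, answer = model(question, None)      # Step 1: generate
        if answer != gold:
            rationale, answer = model(question, gold)  # Step 3: retry with hint
        if answer == gold:                             # Steps 2 and 3: filter
            kept.append((question, rationale, gold))   # the hint is not stored
    return kept


def star(base_model: Model,
         fine_tune: Callable[[Model, List[Triple]], Model],
         dataset: List[Example],
         n_iterations: int) -> Model:
    """Outer loop: each round fine-tunes from base_model, not the last model."""
    model = base_model
    for _ in range(n_iterations):
        train_set = collect_rationales(model, dataset)  # uses the latest model
        model = fine_tune(base_model, train_set)        # Step 4: restart from M
    return model
```

Note the asymmetry in `star`: the latest model generates the data, but the weights being trained are always those of the base model.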
Here's the problem with the basic STaR loop: the model can only learn from questions it already answers correctly. If it gets a question wrong, that question produces no training signal. The dataset of correct rationales can only include questions the model can already solve.
This means the loop eventually stagnates. The model solves 60% of questions, generates training data for those 60%, fine-tunes, and... still solves roughly 60%. The remaining 40% never contribute to learning.
Rationalization is the trick that fixes this. For every question the model got wrong, you give it the correct answer as a hint and ask it to generate a rationale as if it knew the answer all along. Think of it as saying: "OK, the answer is (b) grocery cart. Now explain why."
Here's exactly what the prompt looks like, with the hint marked by an arrow:
Rationalization prompt:

Q: Where do you put your grapes just before checking out?
Answer Choices:
(a) mouth
(b) grocery cart
(c) super market
(d) fruit basket
(e) fruit market
A: (b) grocery cart. ← hint provided
The answer should be the place where grocery items are placed before checking out. Of the above choices, grocery cart makes the most sense for holding grocery items. Therefore, the answer is grocery cart (b).
The crucial part: when adding this rationalized example to the training set, the hint is removed. The model is trained as if it produced this rationale on its own, without seeing the answer. The training example becomes:
Training example (hint stripped):

Q: Where do you put your grapes just before checking out?
Answer Choices: (a) mouth (b) grocery cart ...
A: The answer should be the place where grocery items are placed before checking out. Of the above choices, grocery cart makes the most sense for holding grocery items. Therefore, the answer is grocery cart (b).
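The difference between the two texts can be made concrete with a pair of string builders. This is a sketch under assumptions: the function names and the exact whitespace are hypothetical, chosen only to mirror the grocery-cart example above.

```python
def rationalization_prompt(question: str, choices: str, hint: str) -> str:
    """Prompt for a failed question: the gold answer appears before the rationale."""
    return f"Q: {question}\nAnswer Choices: {choices}\nA: {hint}.\n"


def training_example(question: str, choices: str, rationale: str) -> str:
    """Training text for the same question, rebuilt without the hint."""
    return f"Q: {question}\nAnswer Choices: {choices}\nA: {rationale}"


question = "Where do you put your grapes just before checking out?"
choices = ("(a) mouth (b) grocery cart (c) super market "
           "(d) fruit basket (e) fruit market")
hint = "(b) grocery cart"
rationale = ("The answer should be the place where grocery items are placed "
             "before checking out. Of the above choices, grocery cart makes "
             "the most sense for holding grocery items. "
             "Therefore, the answer is grocery cart (b).")

prompt = rationalization_prompt(question, choices, hint)   # what the model sees
example = training_example(question, choices, rationale)   # what it is trained on
```

Rebuilding the training text from its parts, rather than editing the prompt string, makes it harder for the hint to leak into the training set by accident.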
On CommonsenseQA, direct rationale generation solved 78.2% of training examples. Rationalization rescued an additional 8.5%. Together, STaR trained on 86.7% of the dataset — all generated by the model itself, starting from just 10 hand-written rationales.
At first glance, filtering rationales by correctness seems crude. A model could get the right answer for the wrong reasons — especially on multiple-choice where there's a 20% chance of guessing correctly. Why does this simple filter produce good training data?
The authors show that STaR is actually an approximation to a policy gradient reinforcement learning objective. Here's the connection:
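Think of the language model as a policy π that produces rationale-answer pairs (r̂, ŷ) given a question x. Define a binary reward: 1 if the answer is correct, 0 otherwise. The expected reward across the dataset is:

$$J(\pi, X, Y) = \sum_i \mathbb{E}_{(\hat{r}_i, \hat{y}_i) \sim \pi(\cdot \mid x_i)} \left[ \mathbb{1}(\hat{y}_i = y_i) \right]$$

The policy gradient of this objective is:

$$\nabla J(\pi, X, Y) = \sum_i \mathbb{E}_{(\hat{r}_i, \hat{y}_i) \sim \pi(\cdot \mid x_i)} \left[ \mathbb{1}(\hat{y}_i = y_i) \cdot \nabla \log \pi(\hat{y}_i, \hat{r}_i \mid x_i) \right]$$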
Notice: the indicator function 1(ŷi = yi) zeros out the gradient for all incorrect answers. This is exactly what STaR's filtering does — it discards rationales that led to wrong answers. The fine-tuning step then takes gradient steps to increase the probability of the surviving (correct) rationales.
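The claim that the indicator-weighted gradient is the gradient of expected reward can be checked numerically on a toy policy. The sketch below is an assumption-heavy simplification: the "policy" is a softmax over five answer choices with no rationale at all, and the function names are hypothetical. It compares the filtered estimator, computed exactly by summing over all answers, against a finite-difference gradient of the probability of the correct answer.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs: List[float], k: int) -> List[float]:
    """Gradient of log pi(k) w.r.t. the logits of a softmax policy."""
    return [(1.0 if j == k else 0.0) - p for j, p in enumerate(probs)]


def filtered_gradient(logits: List[float], gold: int) -> List[float]:
    """E[ 1(answer == gold) * grad log pi(answer) ], summed exactly."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for k, p_k in enumerate(probs):
        reward = 1.0 if k == gold else 0.0   # wrong answers contribute nothing
        for j, g in enumerate(grad_log_prob(probs, k)):
            grad[j] += p_k * reward * g
    return grad


def reward_gradient(logits: List[float], gold: int, eps: float = 1e-6) -> List[float]:
    """Central finite-difference gradient of J = pi(gold)."""
    grad = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += eps
        down[j] -= eps
        grad.append((softmax(up)[gold] - softmax(down)[gold]) / (2 * eps))
    return grad
```

In this toy setting the two gradients agree to numerical precision, which is the sense in which discarding wrong answers and raising the likelihood of the rest follows the reward gradient.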
Of course, the 20% random-correct problem is real. On CommonsenseQA (5-way multiple choice), some wrong rationales will survive the filter. But as training progresses, the model generates higher-quality rationales, and the fraction of "lucky guesses" decreases relative to genuine reasoning. The signal-to-noise ratio improves with each iteration.
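A back-of-the-envelope model shows why the share of lucky guesses falls as the model improves. The model here is an assumption introduced for illustration, not something from the paper: with probability `p_reason` the model reasons soundly and answers correctly, and otherwise its answer is a uniform guess over `n_choices` options.

```python
def lucky_fraction(p_reason: float, n_choices: int) -> float:
    """Share of filter-passing examples that are lucky guesses (toy model)."""
    lucky = (1.0 - p_reason) / n_choices   # guessed, but happened to be right
    kept = p_reason + lucky                # everything that passes the filter
    return lucky / kept if kept > 0 else 0.0


# Fraction of kept examples that are lucky, for a 5-way task,
# as the share of sound reasoning rises.
five_way = [lucky_fraction(p, 5) for p in (0.2, 0.4, 0.6, 0.8)]
```

Under the same toy model, fewer answer choices means a larger lucky share at any given `p_reason`, since each guess is more likely to land on the right option.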
The magic of STaR is that it gets better each round. Let's trace through what actually happens across iterations.
Iteration 1: The model starts with few-shot prompting. On CommonsenseQA, few-shot CoT gets about 36.6% accuracy. So roughly 3,600 of the 9,741 training questions produce correct rationales. Add maybe 800 from rationalization. Fine-tune on ~4,400 examples.
Iteration 2: The fine-tuned model is now better. It correctly solves ~55% of training questions, producing ~5,400 correct rationales. Plus ~1,200 from rationalization. The dataset is larger and higher quality.
Iterations 3-6: Each round, accuracy climbs. By the final iteration, 78.2% of training questions are solved by direct rationale generation, plus 8.5% from rationalization. The model has bootstrapped from 10 examples to 86.7% coverage of the dataset.
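To turn the percentages in this walkthrough into example counts, multiply by the 9,741 training questions. The helper below is a trivial sketch; its name and the use of rounding are assumptions.

```python
TRAIN_SIZE = 9741  # CommonsenseQA training questions, as quoted above


def examples_at(fraction: float) -> int:
    """Number of training questions corresponding to a coverage fraction."""
    return round(TRAIN_SIZE * fraction)


first_round_direct = examples_at(0.366)    # few-shot CoT accuracy at the start
final_direct = examples_at(0.782)          # solved by direct generation
final_rationalized = examples_at(0.085)    # rescued by rationalization
final_total = examples_at(0.867)           # combined coverage
```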
A key practical detail: the authors increased the number of fine-tuning steps by 20% each iteration. Early iterations use fewer steps (40 at the start) to prevent overfitting on the small initial dataset. As the training set grows, more steps are appropriate. They also found that training more slowly at the beginning ultimately benefits final performance.
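The step schedule described above is a simple geometric growth. In this sketch the 40 starting steps and the 20% growth come from the text; the zero-based iteration index and the rounding to a whole number of steps are assumptions.

```python
def fine_tune_steps(iteration: int, initial_steps: int = 40,
                    growth: float = 0.20) -> int:
    """Fine-tuning steps for a given outer-loop iteration (0-indexed)."""
    return round(initial_steps * (1.0 + growth) ** iteration)


schedule = [fine_tune_steps(i) for i in range(6)]
```

Because each round restarts from the base model, the schedule sets the total training budget for that round rather than an increment on top of earlier rounds.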
For arithmetic, the improvement is particularly dramatic. Few-shot accuracy on 2-digit addition starts below 1%. After just one iteration of STaR, it jumps to 32%. By iteration 16, overall arithmetic accuracy reaches 89.5%, compared to 76.3% for a model fine-tuned directly on 10,000 answer-only examples.
STaR was evaluated on three domains: arithmetic, commonsense reasoning (CommonsenseQA), and grade school math (GSM8K). The results are striking.
This is the headline result. GPT-J (6B parameters) with STaR comes within half a point of GPT-3 (175B parameters) fine-tuned directly on answers (72.5% vs. 73.0%):
| Method | Model | CQA Dev Acc | Train Data |
|---|---|---|---|
| Few-shot direct | GPT-J 6B | 20.9% | ~0% |
| Few-shot CoT | GPT-J 6B | 36.6% | ~0% |
| Few-shot CoT | LaMDA 137B | 55.6% | ~0% |
| Direct fine-tune | GPT-J 6B | 60.0% | 100% |
| STaR (no rational.) | GPT-J 6B | 68.8% | 69.7% |
| STaR (with rational.) | GPT-J 6B | 72.5% | 86.7% |
| Direct fine-tune | GPT-3 175B | 73.0% | 100% |
On multi-digit addition with scratchpad reasoning, STaR reaches 89.5% after 16 iterations, versus 76.3% for direct fine-tuning on 10,000 examples. Without rationalization, the model learns digits one at a time (master 1-digit, then 2-digit, etc.). With rationalization, multiple digit lengths improve simultaneously.
| Method | GSM8K Test Acc | Train Data Used |
|---|---|---|
| Few-shot direct | 3.0% | ~0% |
| Few-shot CoT | 3.1% | ~0% |
| Direct fine-tune | 5.8% | 100% |
| STaR (no rational.) | 10.1% | 25.0% |
| STaR (with rational.) | 10.7% | 28.7% |
The absolute numbers are lower (GSM8K is hard for 6B models), but STaR nearly doubles the performance of direct fine-tuning while using less than 30% of the training data.
Crowdworkers ranked STaR-generated rationales against few-shot rationales and human-written rationales on questions both methods answered correctly. They were 30% more likely to prefer STaR rationales over few-shot rationales (p = .039) and 74% more likely to prefer them over human-written rationales (p < .001). This should not be read as human-level rationale writing; it more plausibly reflects how hard it is to collect high-quality rationales from human annotators.
Figure: CommonsenseQA accuracy across methods. STaR bridges the gap between few-shot prompting and fine-tuning a much larger model.
Every iterative algorithm has to stop somewhere. When does STaR stop improving?
There are several forces at play:
Diminishing returns. Each iteration, the model solves more problems. But the remaining unsolved problems are genuinely harder. The low-hanging fruit is picked first. Eventually the model plateaus — it cannot generate correct rationales for the hardest questions, and rationalization cannot fully bridge the gap.
Dataset ceiling. Without rationalization, the training data can only contain problems the model already solves. This creates a hard ceiling: once the model stops solving new problems, no new training signal appears. Rationalization pushes this ceiling higher but doesn't eliminate it — the model still needs to produce a correct-answer rationale even with the hint.
Noise accumulation. On multiple-choice tasks, some wrong rationales sneak through the filter by luck (correct answer despite bad reasoning). Over many iterations, these noisy examples can accumulate. The authors mitigate this by always fine-tuning from the original pretrained model rather than building on a potentially corrupted model.
Temperature matters. The authors found that increasing temperature to generate more diverse rationales actually hurts. Higher temperature produces more "lucky" correct answers with bad reasoning, which poisons the training data. Rationalization is a much cleaner way to expand coverage than temperature-based exploration.
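For readers unfamiliar with sampling temperature, the sketch below shows the standard mechanism: logits are divided by the temperature before the softmax, so higher values flatten the distribution and shift probability toward unlikely tokens. The function name and the example logits are illustrative assumptions; nothing here reproduces the authors' sampling setup.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)   # less mass on the top option
```

The flatter distribution is what produces more diverse rationales, and, per the finding above, more correct answers reached through poor reasoning.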
STaR is elegant, but it has clear boundaries. Understanding these helps you know when to use it and when to reach for something else.
STaR's filter needs ground truth answers. The entire loop depends on checking whether a generated answer matches the correct one. For open-ended tasks — writing, summarization, creative tasks — there is no single correct answer to filter against. STaR works best on tasks with clear right/wrong answers: math, multiple choice, code (test cases), formal proofs.
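What "verifiable" means in practice is that the filter can be written as a function from a predicted answer and a reference answer to true or false. The two checkers below are hypothetical examples for multiple-choice and numeric answers; the names, the normalization rules, and the tolerance are assumptions for illustration.

```python
def multiple_choice_correct(predicted: str, gold: str) -> bool:
    """Exact match on the choice label, ignoring case and surrounding spaces."""
    return predicted.strip().lower() == gold.strip().lower()


def numeric_correct(predicted: str, gold: str, tol: float = 1e-9) -> bool:
    """Numeric match that tolerates thousands separators and float formatting."""
    try:
        p = float(predicted.replace(",", ""))
        g = float(gold.replace(",", ""))
    except ValueError:
        return False          # unparseable answers never pass the filter
    return abs(p - g) <= tol
```

For open-ended tasks no comparably simple function exists, which is the limitation this section describes.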
The bootstrap needs a spark. If the model's few-shot performance is at chance level, there are no correct rationales to filter and the loop never gets going. The authors found that GPT-2 could not bootstrap even on arithmetic, while GPT-J (6B) could. The model must be large enough to occasionally produce correct reasoning in the few-shot setting.
On tasks with only two choices (yes/no, true/false), a guess is right 50% of the time, so a large share of the filtered training data can consist of wrong rationales with lucky correct answers. The signal-to-noise ratio is too low for the filter to work well.
STaR trains on the model's own outputs. Any biases in the base model — gender, cultural, factual — get baked into the generated rationales and then amplified by fine-tuning on them. The CommonsenseQA dataset itself contains known biases, and the authors acknowledge that STaR may reinforce these.
Each outer loop iteration requires generating rationales for the entire training set (expensive inference) plus a full fine-tuning run. With 6+ iterations, the total compute is substantial. The authors used GPT-J specifically because it was affordable; applying STaR to larger models multiplies the cost.
| Limitation | Impact | Mitigation |
|---|---|---|
| Needs verifiable answers | Cannot apply to open-ended tasks | Use reward models (leads to RLHF/DPO) |
| Minimum model size | Too-small models can't bootstrap | Use larger base model or more few-shot examples |
| Binary task noise | 50% lucky guesses poison data | Need better filtering (e.g., verify rationale steps) |
| Bias amplification | Self-training amplifies existing biases | Curate few-shot prompts, add debiasing steps |
| Compute cost | Multiple iterations of generate + fine-tune | Use smaller training subsets per iteration |
STaR (2022) is a pivotal paper. It sits at the intersection of several ideas and directly influenced the reasoning models that dominate today.
Expert Iteration (ExIt): STaR's loop mirrors ExIt from AlphaGo/AlphaZero: generate solutions with the current "apprentice," improve via feedback from an "expert," then upgrade the apprentice. In STaR, the "expert" is the answer key, and the "apprentice" is the model. The key difference: STaR works in natural language, not game trees.
Chain-of-thought prompting (Wei et al., 2022): Showed that few-shot rationales improve reasoning. STaR turns this inference-time prompting technique into an iterative training procedure.
Rejection sampling fine-tuning (RFT): Sample many solutions, keep correct ones, fine-tune. This is essentially one iteration of STaR without the loop or rationalization.
OpenAI o1 / o3: The "reasoning models" that think for minutes before answering. While the exact training is proprietary, the public descriptions align closely with STaR-style iterative self-improvement on verifiable reasoning tasks.
DeepSeek-R1: Openly described as using RL on reasoning traces with verified answers — the same core principle as STaR, scaled up with RL instead of supervised fine-tuning.
Constitutional AI: Anthropic's method for self-improvement through self-critique. Where STaR uses answer correctness as the signal, Constitutional AI uses the model's own judgment guided by principles.
Self-play: Games like Go use self-play to improve without human data. STaR is "self-play for reasoning" — the model plays against the answer key instead of against itself.
| Method | Signal Source | Iterative? | Domain |
|---|---|---|---|
| STaR | Answer correctness | Yes | Verifiable tasks |
| RLHF | Human preferences | No (single RL run) | Open-ended |
| DPO | Preference pairs | No | Open-ended |
| Constitutional AI | Self-critique + principles | Yes | Open-ended |
| Expert Iteration | Search (MCTS) | Yes | Games, formal proofs |
| DeepSeek-R1 | Answer correctness + RL | Yes | Verifiable tasks |