Flan-PaLM and Flan-T5: instruction tuning at scale. More tasks, more diverse prompts, chain-of-thought finetuning — each ingredient independently improves zero-shot and few-shot performance.
You've pre-trained a massive language model: 540 billion parameters, trained on hundreds of billions of tokens. It can complete any text prompt with remarkable fluency. But ask it "What is the capital of France?" and instead of answering "Paris," it might continue: "What is the capital of Germany? What is the capital of Italy?..." It generates more questions instead of answers.
The problem is that pre-trained language models are text completers, not instruction followers. They've learned to predict what text comes next, which is different from understanding what a human wants when they write a question. The pre-training distribution is dominated by documents, articles, and books — not question-answer pairs or task instructions.
Instruction tuning (also called instruction fine-tuning) is the process of fine-tuning a pre-trained model on a diverse mixture of tasks phrased as natural language instructions. Instead of task-specific fine-tuning (one model per task), instruction tuning teaches one model to follow instructions for any task.
The prior work (FLAN, 2021; T0, 2021) had shown that instruction tuning works. But critical questions remained unanswered: How does performance scale with the number of tasks? Does it help to include chain-of-thought examples? What's the effect of model size? This paper answers all three questions systematically.
Why do these questions matter practically? Because the answers tell you how to allocate your training budget. If more tasks always help, you should invest in collecting and formatting more tasks. If CoT helps, you should include reasoning traces. If larger models benefit more, you should save your instruction-tuning effort for your largest model. The paper provides data-driven answers to all of these allocation decisions.
The paper also introduces Flan-T5 and Flan-PaLM — instruction-tuned versions of Google's T5 and PaLM models. Flan-T5 in particular became one of the most widely used open-source models, serving as the default instruction-following model for the research community until LLaMA-based models took over in mid-2023.
```python
# The instruction tuning pipeline (simplified)
from transformers import T5ForConditionalGeneration

# 1. Start with a pre-trained model.
#    "t5-11b" is the Hugging Face Hub ID of the XXL-sized T5 checkpoint.
model = T5ForConditionalGeneration.from_pretrained("t5-11b")

# 2. Prepare instruction data: (instruction + input) → output
#    1,836 tasks × multiple templates = millions of examples

# 3. Fine-tune with standard seq2seq loss
#    30K-40K steps, batch size 128, lr=1e-3

# Result: T5-XXL → Flan-T5-XXL
# Key insight: only 30K-40K steps, about 1% of pre-training compute
# This tiny investment yields 20+ point improvements on benchmarks
```
| Previous Work | Tasks | Model | Limitation |
|---|---|---|---|
| FLAN (2021) | 62 tasks | 137B | Small task collection, no CoT |
| T0 (2021) | 35 datasets | 11B | Smaller model, no scaling analysis |
| Flan-PaLM (this paper) | 1,836 tasks | 540B | Comprehensive scaling study |
The paper's first contribution is a massive instruction tuning dataset: 1,836 tasks collected from four sources, each phrased with multiple prompt templates.
| Source | Tasks | Description |
|---|---|---|
| Muffin (Flan 2021 plus additions) | 80 | Original FLAN tasks (NLI, QA, sentiment, translation, etc.) plus tasks added for this paper |
| T0-SF (P3) | 193 | PromptSource community templates with diverse prompt formats |
| Natural Instructions v2 (Super-Natural Instructions) | 1,554 | Crowdsourced tasks with detailed instructions and definitions |
| Chain-of-Thought collection | 9 | Tasks requiring step-by-step reasoning (math, logic, commonsense) |
Each task has up to 10 different prompt templates (instructions phrased differently). For example, a sentiment analysis task might have:
```text
# Template 1:
"Classify the sentiment of the following review: {review}\nSentiment:"

# Template 2:
"Is the following review positive or negative?\n{review}\nAnswer:"

# Template 3:
"Read the review and determine if the author liked the product.\nReview: {review}\nLiked:"

# Template 4:
"What is the sentiment (positive/negative) of this text?\n{review}"
```
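As a rough illustration of how several templates turn one raw record into varied training inputs, here is a minimal sketch. The `TEMPLATES` list reuses the wording above; `render_example` is a hypothetical helper, not something from the paper or any library, and a real pipeline would also map the label to each template's answer format (for example yes/no for the third template).

```python
import random

# Templates copied from the examples above; the list name is an assumption.
TEMPLATES = [
    "Classify the sentiment of the following review: {review}\nSentiment:",
    "Is the following review positive or negative?\n{review}\nAnswer:",
    "Read the review and determine if the author liked the product.\nReview: {review}\nLiked:",
    "What is the sentiment (positive/negative) of this text?\n{review}",
]

def render_example(review, label, rng):
    """Pick one template at random and return an (input, target) text pair."""
    template = rng.choice(TEMPLATES)
    return template.format(review=review), label

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for _ in range(3):
    model_input, target = render_example("Great battery life.", "positive", rng)
    print(repr(model_input), "->", target)
```

Sampling a template per example, rather than fixing one per task, is what exposes the model to several phrasings of the same underlying task.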
With 1,836 tasks of vastly different sizes (some have 100 examples, others have millions), how do you mix them? The paper uses a temperature-based sampling strategy:
$$p_i = \frac{\min(n_i, K)^{1/T}}{\sum_j \min(n_j, K)^{1/T}}$$

where $n_i$ is the number of examples in task $i$, $K$ is a cap (preventing very large tasks from dominating), and $T$ is a temperature parameter. $T=1$ samples proportionally to (capped) size; higher $T$ flattens the distribution; $T \to \infty$ gives uniform sampling across tasks.
Each training example is a (instruction, input, output) triple, formatted as a text-to-text problem:
```text
# Input to model:
"Classify the sentiment of the following movie review.\n"
"Review: This movie was absolutely terrible. The acting was wooden,\n"
"the plot made no sense, and I wanted to leave after 20 minutes.\n"
"Sentiment:"

# Target output:
"Negative"
```
The model is trained with standard language modeling loss on the target output, conditioned on the instruction + input. This is the same as supervised fine-tuning (SFT) — the only difference is the diversity of tasks.
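Let's be precise about the training loss. For an encoder-decoder model such as T5, the input (instruction + input text) is encoded and the decoder generates the output autoregressively. For a decoder-only model such as PaLM, the instruction and input form the prefix of a single sequence. In both cases the loss is computed only on the output tokens, not the input:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$$

where $x$ is the instruction + input and $y_1, \dots, y_T$ are the target output tokens (here $T$ is the target length, not the sampling temperature). This means the model isn't penalized for failing to predict the instruction tokens; it is only penalized for generating the wrong response. The instruction serves as conditioning context.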
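To make "loss only on the output tokens" concrete, here is a minimal sketch for the decoder-only view, where prompt and target share one sequence and a mask selects the target positions. The function name, the mask layout, and the per-token probabilities are all illustrative assumptions, not values from the paper.

```python
import math

def target_only_nll(token_log_probs, is_target):
    """Sum of negative log-likelihoods over target positions only.

    token_log_probs[t] is log p(token_t | earlier tokens);
    is_target[t] is True for output tokens and False for prompt tokens.
    """
    return sum(-lp for lp, keep in zip(token_log_probs, is_target) if keep)

# Hypothetical probabilities: 3 prompt tokens followed by 2 target tokens.
log_probs = [math.log(p) for p in (0.10, 0.20, 0.05, 0.90, 0.80)]
mask = [False, False, False, True, True]

loss = target_only_nll(log_probs, mask)
print(round(loss, 4))
```

Changing the probabilities at the masked prompt positions leaves `loss` unchanged, which is exactly the sense in which the instruction acts only as conditioning context. Many implementations divide by the number of target tokens; the sum is used here to match the formula.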
```python
# Task sampling with temperature
import numpy as np

def sample_tasks(task_sizes, K=3000, T=3.0):
    """Temperature-based task sampling to balance task sizes."""
    capped = np.minimum(task_sizes, K)   # cap at K examples
    weights = capped ** (1.0 / T)        # temperature smoothing
    probs = weights / weights.sum()      # normalize to probabilities
    return probs

# T=1: proportional to size (big tasks dominate)
# T=3: smoothed (standard choice, balanced)
# T=∞: uniform (all tasks equally likely)
```
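A small worked example may help show what the cap and the temperature each do. This is a dependency-free sketch of the same weighting rule; the three task sizes are made up for illustration, and `sampling_probs` is a hypothetical name.

```python
def sampling_probs(task_sizes, cap=3000, temperature=3.0):
    """p_i proportional to min(n_i, cap) ** (1 / temperature)."""
    weights = [min(n, cap) ** (1.0 / temperature) for n in task_sizes]
    total = sum(weights)
    return [w / total for w in weights]

sizes = [100, 10_000, 1_000_000]  # hypothetical task sizes
for t in (1.0, 3.0, 100.0):
    probs = sampling_probs(sizes, cap=3000, temperature=t)
    print(f"T={t:>5}: {[round(p, 3) for p in probs]}")
```

Because both larger tasks exceed the cap, they end up with identical probabilities at every temperature, and raising the temperature moves the smallest task's share toward one third.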
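The paper's central question: does performance keep improving as you add more instruction-tuning tasks? The answer is yes, though with diminishing returns: held-out performance is still rising at 1,836 tasks, but the paper reports that most of the gain comes from the first few hundred tasks.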
The authors ran a systematic ablation: start with a subset of tasks, gradually add more, and measure performance on held-out benchmarks. The key insight is that they evaluate on held-out tasks — tasks NOT included in the instruction tuning mixture. This tests generalization: does the model learn to follow novel instructions, or just memorize the specific tasks it was trained on?
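The results show a roughly log-linear relationship: held-out performance grows approximately linearly in the logarithm of the number of tasks, $\text{score} \approx a + b \log N_{\text{tasks}}$, where $a$ and $b$ are fitted constants.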
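A log-linear trend of this kind can be fitted with ordinary least squares on the logarithm of the task count. The sketch below assumes a base-10 logarithm and uses synthetic points generated from a known line; they are not results from the paper, and `fit_log_linear` is a hypothetical helper.

```python
import math

def fit_log_linear(task_counts, scores):
    """Least-squares fit of score ≈ a + b * log10(task_count)."""
    xs = [math.log10(n) for n in task_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = mean_y - b * mean_x
    return a, b

# Synthetic points generated from score = 40 + 5 * log10(n); NOT paper results.
counts = [10, 100, 1000]
scores = [45.0, 50.0, 55.0]
a, b = fit_log_linear(counts, scores)
gain_per_doubling = b * math.log10(2)
print(f"a={a:.1f}, b={b:.1f}, gain per doubling={gain_per_doubling:.2f}")
```

Under a fit like this, every doubling of the task count adds the same fixed number of points, which is why each individual added task contributes less and less.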
Instruction tuning was applied to PaLM at three scales (8B, 62B, 540B) and T5 at five scales (80M, 250M, 780M, 3B, 11B). The results reveal an interaction between model size and instruction tuning:
| Model | Before IT | After IT | Gain |
|---|---|---|---|
| PaLM 8B | 52.1 | 59.3 | +7.2 |
| PaLM 62B | 68.4 | 73.8 | +5.4 |
| PaLM 540B | 78.3 | 81.4 | +3.1 |
| T5 11B | 33.7 | 57.2 | +23.5 |
Three patterns emerge:
1. Instruction tuning helps all sizes. Even the 540B model improves meaningfully (+3.1 on average). This was not obvious — you might expect that a 540B model is already so capable that instruction tuning would be redundant. The fact that it still helps shows that even the largest models don't automatically learn to follow instructions from pre-training alone.
2. Relative improvement is larger for smaller models. T5-11B gains 23.5 points — instruction tuning can partially compensate for limited model capacity. The smaller model has the knowledge but needs more "steering" to use it effectively.
3. Absolute improvement is larger for larger models on hard tasks. While relative improvement decreases with size, the absolute gap on difficult reasoning tasks is larger for bigger models. This makes sense: reasoning ability scales with model size, and instruction tuning "unlocks" whatever reasoning capability the model already has.
```python
# The scaling interaction: model size × instruction tuning
# Think of it as: Knowledge × Interface = Performance
#
# PaLM 8B:   Limited knowledge × Great interface → Moderate performance
#            52.1 × IT → 59.3 (relative gain: +14%)
#
# PaLM 540B: Vast knowledge × Great interface → Excellent performance
#            78.3 × IT → 81.4 (relative gain: +4%)
#
# The interface improvement is similar in both cases.
# But the smaller model benefits more because its knowledge/interface
# ratio was more imbalanced: it "knew" more than it could express.
```
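The absolute and relative gains can be recomputed directly from the before/after table above. This is plain arithmetic on the numbers as given in the table; the `gains` helper is just an illustrative name.

```python
# (before, after) scores copied from the table above.
results = {
    "PaLM 8B": (52.1, 59.3),
    "PaLM 62B": (68.4, 73.8),
    "PaLM 540B": (78.3, 81.4),
    "T5 11B": (33.7, 57.2),
}

def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before

for name, (before, after) in results.items():
    absolute, relative = gains(before, after)
    print(f"{name:10s} +{absolute:.1f} points  (+{relative:.1f}%)")
```

Within the PaLM family, both the absolute and the relative gain shrink as the base model gets stronger, and the T5 11B row stands out because its starting point is so much lower.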
Why does instruction tuning help smaller models more in relative terms? One explanation, consistent with later work such as the "Camels" paper (Wang et al., 2023), is that instruction tuning is primarily a "style transfer" operation that teaches the model how to present its knowledge, not what to know. Smaller models have a larger gap between their latent knowledge and their ability to express it in instruction-following format. Instruction tuning closes this gap more dramatically for smaller models because their "interface deficit" is larger.
The paper also runs an important ablation on the number of prompt templates per task. Using 10 templates per task (instead of 1) improves performance by 2-3 points on held-out benchmarks. Each additional template provides a slightly different phrasing of the same task, teaching the model that the same instruction can be expressed many ways — a crucial skill for following novel instructions that use unfamiliar phrasings.
Perhaps the most impactful finding in the paper: including chain-of-thought (CoT) examples in the instruction tuning mixture dramatically improves reasoning on held-out tasks.
Chain-of-thought prompting (Wei et al. 2022) asks the model to show its work — to generate intermediate reasoning steps before the final answer. For example:
```text
# Without CoT
"Q: Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have?"
"A: 11"

# With CoT
"Q: Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have?"
"A: Roger starts with 5 balls. He buys 2 cans of 3 balls each."
"That's 2 × 3 = 6 new balls. Total = 5 + 6 = 11. The answer is 11."
```
The paper includes 9 CoT datasets (math, commonsense, logic) in the instruction tuning mix, representing only ~1% of total examples. Despite this tiny fraction, the impact is dramatic:
| Config | CoT benchmarks | Non-CoT benchmarks |
|---|---|---|
| No CoT in tuning | 56.1 | 73.5 |
| CoT only in tuning | 70.2 | 66.8 |
| Mix (non-CoT + CoT) | 73.8 | 73.0 |
Using CoT data alone hurts non-reasoning tasks (73.5 → 66.8). This makes sense: if you only train on step-by-step reasoning, the model may try to "reason" when a simple direct answer is needed. The mixed approach teaches the model when to reason step-by-step and when to answer directly.
The mechanism is subtle. With CoT-only training, the model learns a "default mode" of verbose step-by-step reasoning. When asked "What is the capital of France?", it might respond: "Let me think step by step. France is a country in Western Europe. Countries have capital cities. The capital of France is Paris. The answer is Paris." This is correct but unnecessarily verbose. The mixed approach teaches the model that simple factual questions deserve direct answers ("Paris"), while complex reasoning questions deserve step-by-step solutions.
```python
# CoT mixing: the key is teaching WHEN to reason

# Standard task (99% of mix):
#   Q: "What is the capital of France?"
#   A: "Paris"   ← direct answer

# CoT task (1% of mix):
#   Q: "If John has 3 apples and buys 2 bags of 4, how many?"
#   A: "John starts with 3 apples. He buys 2 bags with 4 each.
#       That's 2 × 4 = 8 new apples. Total = 3 + 8 = 11."

# After mixed training, the model learns:
#   - Simple questions → direct answer
#   - Multi-step reasoning → show work
# This is implicit mode selection from the training distribution
```
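One simple way to realize a roughly 1% CoT share is to choose the pool for each training example with a biased coin flip. The sketch below is an assumption about how such a mixture could be built, not the paper's actual data pipeline; `build_mixture` and the two tiny example pools are hypothetical.

```python
import random

def build_mixture(direct_examples, cot_examples, cot_fraction, size, rng):
    """Sample `size` training examples with roughly `cot_fraction` CoT examples."""
    mixture = []
    for _ in range(size):
        pool = cot_examples if rng.random() < cot_fraction else direct_examples
        mixture.append(rng.choice(pool))
    return mixture

direct = [("What is the capital of France?", "Paris")]
cot = [(
    "If John has 3 apples and buys 2 bags of 4, how many does he have?",
    "John starts with 3 apples. 2 bags of 4 is 2 x 4 = 8. Total = 3 + 8 = 11.",
)]

mix = build_mixture(direct, cot, cot_fraction=0.01, size=10_000, rng=random.Random(0))
n_cot = sum(1 for example in mix if example in cot)
print(f"CoT share: {n_cot / len(mix):.2%}")
```

Setting `cot_fraction` to 0 or 1 reproduces the "no CoT" and "CoT only" configurations from the table, which makes this a convenient knob for ablations.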
Flan-PaLM 540B — PaLM instruction-tuned with the Flan collection — achieves state-of-the-art results on a broad range of benchmarks:
| Benchmark | PaLM 540B | Flan-PaLM 540B | GPT-3.5 (text-davinci-002) |
|---|---|---|---|
| MMLU (5-shot) | 69.3 | 73.6 | 68.0 |
| BBH (CoT) | 62.0 | 66.3 | 64.3 |
| TyDiQA | 55.2 | 72.6 | — |
| MGSM (multilingual math) | 45.9 | 72.0 | — |
While Flan-PaLM 540B is impressive, it's also impractical for most users. Flan-T5 is the practical variant — instruction-tuning Google's T5 models. The results are remarkable:
| Model | Params | MMLU (0-shot) | BBH (CoT) |
|---|---|---|---|
| T5-XL (no IT) | 3B | 28.3 | 24.1 |
| Flan-T5-XL | 3B | 52.4 | 43.6 |
| T5-XXL (no IT) | 11B | 33.7 | 28.5 |
| Flan-T5-XXL | 11B | 57.2 | 49.0 |
Flan-T5-XL (3B) with instruction tuning outperforms the raw T5-XXL (11B) on both benchmarks. Instruction tuning is worth more than a 3.7x increase in model size.
This result has enormous practical implications. Running a 3B model is ~4x cheaper than running an 11B model (in both memory and compute). If instruction tuning a 3B model gives better results than running an 11B model without instruction tuning, then instruction tuning is by far the most cost-effective intervention available. It's cheaper than buying bigger hardware, cheaper than collecting more pre-training data, and cheaper than any architectural improvement.
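The size comparison is easy to check with back-of-the-envelope arithmetic. The sketch below uses the nominal parameter counts (3B and 11B) and assumes 2 bytes per parameter (fp16 or bf16); it counts weights only and ignores activations, caches, and framework overhead.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9

small, large = 3e9, 11e9  # nominal sizes of Flan-T5-XL and T5-XXL
size_ratio = large / small

print(f"3B weights:  {weight_memory_gb(small):.0f} GB")
print(f"11B weights: {weight_memory_gb(large):.0f} GB")
print(f"size ratio:  {size_ratio:.2f}x")
```

Under these assumptions the smaller model's weights fit comfortably on a single mid-range accelerator, while the larger model's do not, which is where much of the practical cost difference comes from.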
The paper also compares against GPT-3.5 (text-davinci-002), which was the strongest publicly available model at the time. Flan-PaLM 540B outperforms GPT-3.5 on MMLU (73.6 vs 68.0) and is competitive on BBH (66.3 vs 64.3). This was the first time an openly described instruction-tuning recipe produced a model competitive with OpenAI's best — a significant milestone for open research.
A particularly interesting finding: instruction tuning dramatically improves zero-shot performance (no examples in the prompt) while maintaining or slightly improving few-shot performance. This matters because zero-shot is the most user-friendly mode — you just ask a question without providing examples. Before instruction tuning, zero-shot performance was significantly worse than few-shot. After instruction tuning, the gap narrows or disappears.
```python
# Zero-shot vs few-shot improvement
# PaLM 540B on MMLU:
#   0-shot: 69.3 → 73.6 (Flan) = +4.3
#   5-shot: 73.5 → 74.1 (Flan) = +0.6

# Zero-shot gains >> few-shot gains
# Why? Few-shot examples already provide an "interface":
# they show the model what format to use.
# Instruction tuning provides this interface permanently,
# making the few-shot examples redundant.
```
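The difference between the two evaluation modes is only in how the prompt is assembled. Here is a minimal sketch; the `Q:`/`A:` layout and the `build_prompt` helper are illustrative assumptions rather than the exact format used in the paper's evaluations.

```python
def build_prompt(question, exemplars=()):
    """Zero-shot when `exemplars` is empty, k-shot otherwise."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

zero_shot = build_prompt("What is the capital of France?")
two_shot = build_prompt(
    "What is the capital of France?",
    exemplars=[
        ("What is the capital of Japan?", "Tokyo"),
        ("What is the capital of Peru?", "Lima"),
    ],
)

print(zero_shot)
print("---")
print(two_shot)
```

The few-shot prompt ends with exactly the zero-shot prompt; the exemplars in front of it only demonstrate the expected answer format, which is the "interface" that instruction tuning builds into the model.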
The paper distills its findings into concrete guidelines for practitioners who want to instruction-tune their own models. These guidelines have proven remarkably durable.
```python
# Input inversion examples

# Original task:
#   Input:  "Classify: 'This movie is great' → ?"
#   Output: "Positive"

# Inverted task:
#   Input:  "Generate a positive movie review."
#   Output: "This movie is great, with excellent acting."

# Why this helps: the model learns BOTH directions of the mapping
#   Original: text → label (recognition)
#   Inverted: label → text (generation)
#   Together: deep understanding of what "positive" means
```
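The inversion shown above can be sketched as a small function that swaps the roles of text and label. The instruction wording and the `invert_example` helper are assumptions made for illustration.

```python
def invert_example(text, label, domain="movie review"):
    """Turn a (text -> label) classification pair into a (label -> text) generation pair."""
    instruction = f"Generate a {label.lower()} {domain}."
    return instruction, text

original = ("This movie is great, with excellent acting.", "Positive")
inverted_input, inverted_output = invert_example(*original)

print(inverted_input)   # the new instruction
print(inverted_output)  # the original text becomes the target
```

No new annotation is needed: every labeled example yields a second training pair for free.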
The recommendation of 30K-40K gradient steps comes from a careful analysis of the training curve. Performance on instruction-tuning tasks improves steadily throughout training. But performance on held-out tasks (tasks not seen during instruction tuning) peaks around 30K-40K steps and then starts to decline. This is catastrophic forgetting — the model overwrites its pre-trained knowledge with instruction-specific patterns.
One way to picture this is as the sum of two terms: the first represents pre-trained knowledge decaying roughly exponentially with instruction-tuning steps, and the second represents growing instruction-following ability. The optimal number of steps balances these two terms. Too few steps and the instruction-following ability is weak. Too many steps and the pre-trained knowledge is lost.
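To see how two such terms produce an interior optimum, here is a toy model. Everything in it is an assumption chosen for illustration: the exponential decay, the saturating growth curve, and all four constants are hypothetical and are not fitted to anything in the paper.

```python
import math

def held_out_score(steps, k0=60.0, decay=5e-6, gain=30.0, rate=5e-5):
    """Toy model: decaying pre-trained knowledge plus saturating instruction-following."""
    knowledge = k0 * math.exp(-decay * steps)           # first term: slow exponential decay
    following = gain * (1.0 - math.exp(-rate * steps))  # second term: fast, saturating growth
    return knowledge + following

# Grid search over step counts, in increments of 1,000 steps.
candidates = range(0, 200_001, 1_000)
best_steps = max(candidates, key=held_out_score)

print(f"best step count on the grid: {best_steps}")
print(f"score at 0 steps:    {held_out_score(0):.1f}")
print(f"score at best steps: {held_out_score(best_steps):.1f}")
print(f"score at 200K steps: {held_out_score(200_000):.1f}")
```

The peak appears because the instruction-following term saturates quickly while the knowledge term keeps decaying; with different constants the optimum moves, so the step count should be tuned against held-out tasks rather than read off a formula.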
| Mistake | Why it hurts |
|---|---|
| Too few tasks | Model overfits to seen instruction patterns, fails on novel ones |
| Only one template per task | Model learns template matching, not instruction understanding |
| Too much training | Pre-trained knowledge is overwritten (catastrophic forgetting) |
| Only CoT data | Model tries to reason on simple factual questions |
| Size-proportional task sampling | Tiny tasks get drowned by massive ones |
Let's bring the paper's three scaling dimensions together: the number of tasks, the model size, and the presence of CoT data. Each axis contributes to performance in its own way.
| Axis | Scaling Behavior | Saturation? |
|---|---|---|
| Number of tasks | Log-linear improvement: 2x tasks ≈ +1-2% accuracy | Not observed at 1,836 |
| Model size | Power-law improvement: 10x params ≈ +5-10% accuracy | Not observed at 540B |
| CoT inclusion | Step function: including ANY CoT gives +5-15% on reasoning | Mostly saturated at ~1% mix |
The paper's ablations, which vary the number of tasks, the model size, and the inclusion of CoT data, suggest that the three axes compound rather than substitute for one another. The gains from each axis are roughly additive in log-space, which means they're multiplicative in raw performance. This is an important finding because it tells practitioners that they should invest in all three axes: skipping any one of them leaves significant performance on the table.
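```python
# The three scaling dimensions: a rough, illustrative formula
import math

def toy_score(tasks, params, cot):
    """Base + 8*log10(tasks) + 6.6*log10(params / 250M) + 8 if CoT is included."""
    return 50 + 8 * math.log10(tasks) + 6.6 * math.log10(params / 250e6) + 8 * cot

# Worst case: 20 tasks, 250M params, no CoT
worst = toy_score(20, 250e6, cot=False)   # ≈ 50 + 10.4 + 0 + 0 = 60.4

# Best case: 1836 tasks, 540B params, with CoT
best = toy_score(1836, 540e9, cot=True)   # ≈ 50 + 26.1 + 22.0 + 8 = 106.1 (capped at ~95)

# Marginal value of each axis at best settings:
#   Adding tasks (20 → 1836):   +15.7 points
#   Adding scale (250M → 540B): +22.0 points
#   Adding CoT:                 +8.0 points
```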
This paper sits at the intersection of two major trends: instruction tuning and scaling laws. It showed that both trends apply to the post-training phase, not just pre-training. More importantly, it provided the first comprehensive recipe for instruction tuning — a recipe that the community followed for over a year.
The paper's timing was critical. It was published in late 2022, just weeks before ChatGPT's release. While ChatGPT demonstrated the power of instruction tuning + RLHF to the public, the Flan paper provided the scientific foundation that the research community needed to replicate and extend these results. Without this paper's systematic analysis of what makes instruction tuning work, the open-source response to ChatGPT (Alpaca, Vicuna, etc.) would have been much more haphazard.
| Foundation | Contribution |
|---|---|
| FLAN (Wei 2021) | Original instruction tuning on 62 tasks — showed the concept works |
| T0/P3 (Sanh 2021) | Community prompt templates + zero-shot generalization study |
| Super-Natural Instructions | Large crowdsourced task collection with detailed definitions |
| CoT Prompting (Wei 2022) | Chain-of-thought prompting at inference time |
| PaLM (Chowdhery 2022) | The 540B base model that Flan-PaLM is built on |
| Successor | Advance |
|---|---|
| AlpacaFarm | Simulated human feedback for faster instruction tuning research |
| Camels (Tülu) | Systematic comparison of open instruction-tuning datasets |
| Alpaca | Self-instruct: generate instruction data from GPT-3.5, tune LLaMA |
| Orca | Explanation tuning: include GPT-4's reasoning traces in the fine-tuning data |
| Llama 3 | Scaled instruction tuning + DPO to 405B parameters |
The CoT finding deserves special emphasis. Before this paper, chain-of-thought was largely an inference-time technique: you added "Let's think step by step" to your prompt and hoped the model would reason better. This paper showed that including CoT in the training data is far more effective. This insight influenced much of the instruction-tuning work that followed, where reasoning traces became a routine part of the fine-tuning data.
The Flan-T5 model family (small, base, large, xl, xxl) also deserves recognition for its practical impact. For most of a year (late 2022 to mid-2023), Flan-T5 was the default instruction-following model for the research community. It was used in hundreds of papers, thousands of projects, and countless applications. Its encoder-decoder architecture made it particularly good at structured tasks (classification, extraction, translation) where the output format is well-defined.
```python
# Using Flan-T5, the most practical output of this paper
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def ask(prompt, max_new_tokens=64):
    """Generate a completion for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Zero-shot: no examples needed
print(ask("Classify the sentiment: 'This movie was incredible'\nSentiment:", max_new_tokens=10))
# e.g. "positive"

# CoT reasoning also works zero-shot after Flan tuning
print(ask("Q: If a train travels 60mph for 2.5 hours, how far does it go?"))
# Illustrative output: "The train travels 60 miles per hour for 2.5 hours.
# Distance = speed × time = 60 × 2.5 = 150 miles."
```
"Instruction tuning is a simple method that, across many settings, substantially improves the performance of pretrained language models."
— Chung et al., 2022