Scaling Instruction-Finetuned LMs (Chung 2022)

Chapter 0: The Instruction Gap

You've pre-trained a massive language model: 540 billion parameters, trained on hundreds of billions of tokens. It can complete text prompts with remarkable fluency. But ask it "What is the capital of France?" and instead of answering "Paris," it might continue: "What is the capital of Germany? What is the capital of Italy?..." It generates more questions instead of an answer.

The problem is that pre-trained language models are text completers, not instruction followers. They've learned to predict what text comes next, which is different from understanding what a human wants when they write a question. The pre-training distribution is dominated by documents, articles, and books — not question-answer pairs or task instructions.

The instruction gap: Pre-trained models have the knowledge to answer questions, translate languages, and solve problems — all this knowledge is embedded in their weights. But they lack the interface — the ability to understand that "Translate this to French: Hello" means "produce a French translation" rather than "continue this document about translation." Instruction tuning bridges this gap.

Instruction tuning (also called instruction fine-tuning) is the process of fine-tuning a pre-trained model on a diverse mixture of tasks phrased as natural language instructions. Instead of task-specific fine-tuning (one model per task), instruction tuning teaches one model to follow instructions for any task.

Figure (interactive): Pre-trained vs Instruction-Tuned. The same prompt handled by a base model (text completer) and by an instruction-tuned model (instruction follower).

The prior work (FLAN, 2021; T0, 2021) had shown that instruction tuning works. But critical questions remained unanswered: How does performance scale with the number of tasks? Does it help to include chain-of-thought examples? What's the effect of model size? This paper answers all three questions systematically.

Why do these questions matter practically? Because the answers tell you how to allocate your training budget. If more tasks always help, you should invest in collecting and formatting more tasks. If CoT helps, you should include reasoning traces. If larger models benefit more, you should save your instruction-tuning effort for your largest model. The paper provides data-driven answers to all of these allocation decisions.

The paper also introduces Flan-T5 and Flan-PaLM — instruction-tuned versions of Google's T5 and PaLM models. Flan-T5 in particular became one of the most widely used open-source models, serving as the default instruction-following model for the research community until LLaMA-based models took over in mid-2023.

python
# The instruction tuning pipeline (simplified)
from transformers import T5ForConditionalGeneration

# 1. Start with pre-trained model
model = T5ForConditionalGeneration.from_pretrained("google/t5-xxl-lm-adapt")  # LM-adapted T5-XXL checkpoint

# 2. Prepare instruction data: (instruction + input) → output
#    1,836 tasks × multiple templates = millions of examples

# 3. Fine-tune with standard seq2seq loss
#    30K-40K steps, batch size 128, lr=1e-3
#    Result: T5-XXL → Flan-T5-XXL

# Key insight: only 30K-40K steps, a small fraction of pre-training compute
# (about 0.2% for PaLM 540B). This tiny investment yields 20+ point improvements on benchmarks

Previous Work	Tasks	Model	Limitation
FLAN (2021)	62 tasks	137B	Small task collection, no CoT
T0 (2021)	35 datasets	11B	Smaller model, no scaling analysis
Flan-PaLM (this paper)	1,836 tasks	540B	Comprehensive scaling study

What is the "instruction gap" in pre-trained language models?

Pre-trained models have the knowledge to answer questions and perform tasks (it's in their weights from pre-training), but they lack the ability to understand that a prompt is an instruction to be followed, not text to be continued: they're text completers, not instruction followers
Pre-trained models don't have enough parameters to follow instructions
Pre-trained models can only process instructions in English

Chapter 1: The Flan Task Collection

The paper's first contribution is a massive instruction tuning dataset: 1,836 tasks collected from four sources, each phrased with multiple prompt templates.

Source	Tasks	Description
Muffin (extends Flan 2021)	80	The original FLAN tasks (NLI, QA, sentiment, translation, etc.) plus newly added tasks such as dialog and program synthesis
T0-SF (from P3)	193	PromptSource community templates with diverse prompt formats
Natural Instructions v2 (Super-Natural Instructions)	1,554	Crowdsourced tasks with detailed instructions and definitions
Chain-of-Thought collection	9	Datasets with step-by-step reasoning annotations (math, logic, commonsense)

Each task has up to 10 different prompt templates (instructions phrased differently). For example, a sentiment analysis task might have:

prompt templates
# Template 1:
"Classify the sentiment of the following review: {review}\nSentiment:"

# Template 2:
"Is the following review positive or negative?\n{review}\nAnswer:"

# Template 3:
"Read the review and determine if the author liked the product.\nReview: {review}\nLiked:"

# Template 4:
"What is the sentiment (positive/negative) of this text?\n{review}"

Why multiple templates matter: If all sentiment tasks use the same template, the model might learn a shallow pattern ("when I see 'classify the sentiment,' output positive/negative") rather than understanding the task. Multiple diverse templates force the model to understand the meaning of instructions, not just pattern-match on specific phrasings.
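A minimal sketch of how one raw example can be rendered through several templates. The template strings are the four shown above; the helper name `render_example` and the seeded random choice are illustrative assumptions, not the paper's data pipeline.

```python
import random

# The four sentiment templates shown above.
TEMPLATES = [
    "Classify the sentiment of the following review: {review}\nSentiment:",
    "Is the following review positive or negative?\n{review}\nAnswer:",
    "Read the review and determine if the author liked the product.\nReview: {review}\nLiked:",
    "What is the sentiment (positive/negative) of this text?\n{review}",
]

def render_example(review, label, rng):
    """Pick one template at random and return an (input, target) pair."""
    template = rng.choice(TEMPLATES)
    return template.format(review=review), label

rng = random.Random(0)  # seeded so the sketch is reproducible
pairs = [render_example("Great film, would watch again.", "positive", rng)
         for _ in range(4)]
for model_input, target in pairs:
    print(repr(model_input), "->", target)
```

A real pipeline would probably also map the label to each template's answer vocabulary (for example "yes" for the "Liked:" template); that step is left out here to keep the sketch short.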

Task mixing

With 1,836 tasks of vastly different sizes (some have 100 examples, others have millions), how do you mix them? The paper uses a temperature-based sampling strategy:

$p(\text{task}_i) \propto \min(n_i, K)^{1/T}$

Where n_i is the number of examples in task i, K is a cap (preventing very large tasks from dominating), and T is a temperature parameter. T=1 samples proportionally to size; higher T flattens the distribution; T→∞ gives uniform sampling across tasks.

Figure (interactive): Task Collection Overview. The 1,836 tasks organized by source and category; the size of each block represents the number of examples.

Input and output format

Each training example is a (instruction, input, output) triple, formatted as a text-to-text problem:

example
# Input to model:
"Classify the sentiment of the following movie review.\n"
"Review: This movie was absolutely terrible. The acting was wooden,\n"
"the plot made no sense, and I wanted to leave after 20 minutes.\n"
"Sentiment:"

# Target output:
"Negative"

The model is trained with standard language modeling loss on the target output, conditioned on the instruction + input. This is the same as supervised fine-tuning (SFT) — the only difference is the diversity of tasks.

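Let's be precise about the training loss. T5 is an encoder-decoder model: the input (instruction + input text) is encoded, and the decoder generates the output autoregressively. PaLM is decoder-only, so the instruction and input instead form a prefix that the model conditions on. In the setup described here, the loss is computed only on the output tokens, not the input: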

$L = -\sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}, x)$

Where x is the encoded instruction+input, and y₁,...,y_T are the target output tokens. This means the model isn't penalized for generating the wrong instruction — it's only penalized for generating the wrong response. The instruction serves as conditioning context.
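The target-only loss can be sketched with a mask over token positions. This assumes a single-sequence layout in which instruction and target tokens sit side by side (as in a decoder-only model); in an encoder-decoder model the input never appears in the decoder sequence, so the mask is implicit. The function name and the per-token probabilities are hypothetical.

```python
import numpy as np

def target_only_nll(token_log_probs, is_target):
    """Sum of -log p over target positions only; input positions are masked out."""
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    mask = np.asarray(is_target, dtype=float)
    return float(-(token_log_probs * mask).sum())

# Hypothetical per-token log-probabilities for a 5-token sequence:
# three instruction/input tokens followed by two target tokens.
log_probs = np.log([0.10, 0.20, 0.05, 0.90, 0.80])
is_target = [0, 0, 0, 1, 1]

loss = target_only_nll(log_probs, is_target)
print(round(loss, 4))  # equals -(log 0.9 + log 0.8)
```

The low probabilities on the first three positions do not affect the loss, which is the point: the instruction only serves as conditioning context.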

python
# Task sampling with temperature
import numpy as np

def sample_tasks(task_sizes, K=3000, T=3.0):
    """Temperature-based mixing probabilities that balance task sizes.

    T=1 is proportional to the capped size (big tasks dominate),
    T=3 is the smoothed, balanced setting used as the default here,
    and T→∞ approaches uniform (all tasks equally likely).
    """
    sizes = np.asarray(task_sizes, dtype=float)  # accept lists as well as arrays
    capped = np.minimum(sizes, K)                # cap each task at K examples
    weights = capped ** (1.0 / T)                # temperature smoothing
    return weights / weights.sum()               # normalize to probabilities
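To make the effect of the temperature concrete, here is a small worked sketch that prints the mixing probabilities for three tasks at several temperatures and then draws the task index for each example in one batch. The helper name `mixing_probs`, the three task sizes, and the batch size of 8 are assumptions chosen for illustration.

```python
import numpy as np

def mixing_probs(task_sizes, K=3000, T=3.0):
    """p_i proportional to min(n_i, K) ** (1/T), following the formula above."""
    capped = np.minimum(np.asarray(task_sizes, dtype=float), K)
    weights = capped ** (1.0 / T)
    return weights / weights.sum()

sizes = [100, 10_000, 1_000_000]  # hypothetical task sizes
for T in (1.0, 3.0, 1000.0):
    print(T, np.round(mixing_probs(sizes, T=T), 3))

# Draw which task each of the 8 examples in a batch comes from.
rng = np.random.default_rng(0)
batch_tasks = rng.choice(len(sizes), size=8, p=mixing_probs(sizes))
print(batch_tasks)
```

Because of the cap K, the two largest tasks receive the same weight at every temperature, and a very large T pushes all three probabilities toward 1/3.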

Why does the Flan collection use multiple prompt templates per task instead of one?

Multiple templates force the model to understand the meaning of instructions rather than pattern-matching on specific phrasings: if all sentiment tasks say "classify the sentiment," the model might learn a shallow shortcut. Diverse templates teach genuine instruction understanding.
Multiple templates make the dataset larger, which always helps
Multiple templates are needed because different languages require different phrasings

Chapter 2: Scaling Effects

The paper's central question: does performance keep improving as you add more instruction-tuning tasks? The answer is a resounding yes, and the scaling curve hasn't saturated at 1,836 tasks.

Scaling the number of tasks

The authors ran a systematic ablation: start with a subset of tasks, gradually add more, and measure performance on held-out benchmarks. The key insight is that they evaluate on held-out tasks — tasks NOT included in the instruction tuning mixture. This tests generalization: does the model learn to follow novel instructions, or just memorize the specific tasks it was trained on?
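The held-out protocol amounts to partitioning tasks so that evaluation tasks never enter the tuning mixture. A minimal sketch, with hypothetical task names and a hypothetical helper `split_tasks`:

```python
def split_tasks(task_names, held_out):
    """Partition tasks so that evaluation tasks never appear in the tuning mixture."""
    held_out = set(held_out)
    train = [t for t in task_names if t not in held_out]
    evaluation = [t for t in task_names if t in held_out]
    assert not set(train) & set(evaluation)  # no leakage between the two sets
    return train, evaluation

# Hypothetical task names, for illustration only.
tasks = ["sentiment_imdb", "nli_rte", "qa_squad", "mmlu", "bbh"]
train, evaluation = split_tasks(tasks, held_out=["mmlu", "bbh"])
print(train)       # ['sentiment_imdb', 'nli_rte', 'qa_squad']
print(evaluation)  # ['mmlu', 'bbh']
```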

The results show a clear log-linear relationship:

The key finding: Performance improves roughly linearly with the log of the number of tasks. Going from 282 → 1,836 tasks improves zero-shot performance by 2-5% across benchmarks. A logarithmic trend means diminishing returns per added task: the paper itself notes that most of the improvement arrives by 282 tasks, with smaller gains beyond that. Still, the curve hasn't fully flattened, suggesting that adding more tasks would continue to help. This behaves like a scaling law for instruction tuning, analogous to the scaling laws for pre-training.
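What "linear in the log of the number of tasks" means can be shown with a small fit. The points below are synthetic, generated from an assumed line rather than taken from the paper; the intercept, slope, and task counts are illustrative only.

```python
import numpy as np

# Synthetic points generated from a known line, NOT results from the paper.
num_tasks = np.array([9, 89, 282, 1836], dtype=float)  # counts chosen for illustration
true_intercept, true_slope = 40.0, 5.0                 # assumed values
score = true_intercept + true_slope * np.log10(num_tasks)

# Fit score = intercept + slope * log10(num_tasks).
slope, intercept = np.polyfit(np.log10(num_tasks), score, deg=1)
print(round(slope, 3), round(intercept, 3))

# Under a log-linear law, every doubling of the task count adds the same increment.
gain_per_doubling = slope * np.log10(2)
print(round(gain_per_doubling, 3))
```

The constant gain per doubling is why a log-linear trend implies diminishing returns per individual task: each doubling costs twice as many new tasks as the previous one.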

Figure (interactive): Task Scaling Curve. Benchmark performance versus the number of instruction-tuning tasks, with a log-scaled x-axis.

Scaling model size

Instruction tuning was applied to PaLM at three scales (8B, 62B, 540B) and T5 at five scales (80M, 250M, 780M, 3B, 11B). The results reveal an interaction between model size and instruction tuning:

Model	Before IT	After IT	Gain
PaLM 8B	52.1	59.3	+7.2
PaLM 62B	68.4	73.8	+5.4
PaLM 540B	78.3	81.4	+3.1
T5 11B	33.7	57.2	+23.5

Three patterns emerge:

1. Instruction tuning helps all sizes. Even the 540B model improves meaningfully (+3.1 on average). This was not obvious — you might expect that a 540B model is already so capable that instruction tuning would be redundant. The fact that it still helps shows that even the largest models don't automatically learn to follow instructions from pre-training alone.

2. Relative improvement is larger for smaller models. T5-11B gains 23.5 points — instruction tuning can partially compensate for limited model capacity. The smaller model has the knowledge but needs more "steering" to use it effectively.

3. Absolute improvement is larger for larger models on hard tasks. While relative improvement decreases with size, the absolute gap on difficult reasoning tasks is larger for bigger models. This makes sense: reasoning ability scales with model size, and instruction tuning "unlocks" whatever reasoning capability the model already has.

python
# The scaling interaction: model size × instruction tuning
# Think of it as: Knowledge × Interface = Performance
#
# PaLM 8B:   Limited knowledge × Great interface → Moderate performance
#            52.1 × IT → 59.3 (relative gain: +14%)
#
# PaLM 540B: Vast knowledge × Great interface → Excellent performance
#            78.3 × IT → 81.4 (relative gain: +4%)
#
# The interface improvement is similar in both cases.
# But the smaller model benefits more because its knowledge/interface
# ratio was more imbalanced — it "knew" more than it could express.

Practical implication: If you can't afford a 540B model, instruction tuning a smaller model gets you surprisingly far. Flan-T5 11B (instruction-tuned) outperforms the raw PaLM 62B (not instruction-tuned) on many benchmarks, so a model roughly 6x smaller beats the larger one when properly tuned. This makes instruction tuning one of the most cost-effective interventions in LLM training.

Why does instruction tuning help smaller models more in relative terms? Later analyses of open instruction tuning, including the "Camels" (Tülu) paper by Wang et al., are consistent with one explanation: instruction tuning is primarily a "style transfer" operation that teaches the model how to present its knowledge, not what to know. Smaller models have a larger gap between their latent knowledge and their ability to express it in instruction-following format. Instruction tuning closes this gap more dramatically for smaller models because their "interface deficit" is larger.

The paper also runs an important ablation on the number of prompt templates per task. Using 10 templates per task (instead of 1) improves performance by 2-3 points on held-out benchmarks. Each additional template provides a slightly different phrasing of the same task, teaching the model that the same instruction can be expressed many ways — a crucial skill for following novel instructions that use unfamiliar phrasings.

How does the benefit of instruction tuning change with model size?

Instruction tuning improves all model sizes, but the relative improvement is larger for smaller models: a smaller model benefits more from the "steering" effect because it has knowledge from pre-training but needs more help using it effectively. This means instruction tuning partially compensates for limited model capacity.
Only very large models benefit from instruction tuning
Instruction tuning hurts smaller models and helps larger ones

Chapter 3: Chain-of-Thought Fine-Tuning

Perhaps the most impactful finding in the paper: including chain-of-thought (CoT) examples in the instruction tuning mixture dramatically improves reasoning on held-out tasks.

Chain-of-thought prompting (Wei et al. 2022) asks the model to show its work — to generate intermediate reasoning steps before the final answer. For example:

without CoT
"Q: Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have?"
"A: 11"

with CoT
"Q: Roger has 5 tennis balls. He buys 2 more cans of 3. How many does he have?"
"A: Roger starts with 5 balls. He buys 2 cans of 3 balls each."
"That's 2 × 3 = 6 new balls. Total = 5 + 6 = 11. The answer is 11."

The paper includes 9 CoT datasets (math, commonsense, logic) in the instruction tuning mix, representing only ~1% of total examples. Despite this tiny fraction, the impact is dramatic:

Config	CoT benchmarks	Non-CoT benchmarks
No CoT in tuning	56.1	73.5
CoT only in tuning	70.2	66.8
Mix (non-CoT + CoT)	73.8	73.0

The surprise: Including CoT examples doesn't just improve reasoning tasks; it also leaves non-reasoning performance essentially unchanged (73.5 → 73.0 in the table above). And CoT fine-tuning is far more effective than CoT prompting alone. When the model has been fine-tuned on CoT examples, it internalizes the "show your work" pattern and applies it to novel tasks, including tasks not seen during training.

Why mixing helps

Using CoT data alone hurts non-reasoning tasks (73.5 → 66.8). This makes sense: if you only train on step-by-step reasoning, the model may try to "reason" when a simple direct answer is needed. The mixed approach teaches the model when to reason step-by-step and when to answer directly.

The mechanism is subtle. With CoT-only training, the model learns a "default mode" of verbose step-by-step reasoning. When asked "What is the capital of France?", it might respond: "Let me think step by step. France is a country in Western Europe. Countries have capital cities. The capital of France is Paris. The answer is Paris." This is correct but unnecessarily verbose. The mixed approach teaches the model that simple factual questions deserve direct answers ("Paris"), while complex reasoning questions deserve step-by-step solutions.

python
# CoT mixing: the key is teaching WHEN to reason

# Standard task (99% of mix):
# Q: "What is the capital of France?"
# A: "Paris"  ← direct answer

# CoT task (1% of mix):
# Q: "If John has 3 apples and buys 2 bags of 4, how many?"
# A: "John starts with 3 apples. He buys 2 bags with 4 each.
#     That's 2 × 4 = 8 new apples. Total = 3 + 8 = 11."

# After mixed training, the model learns:
# - Simple questions → direct answer
# - Multi-step reasoning → show work
# This is implicit mode selection from the training distribution
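A minimal sketch of assembling such a mixture, assuming each training example is drawn from the CoT pool with probability `cot_fraction` and from the standard pool otherwise. The helper `build_mixture` and the toy pools are hypothetical; the paper's actual mixing code is not shown in this text.

```python
import random

def build_mixture(standard, cot, n_examples, cot_fraction=0.01, seed=0):
    """Sample a training mixture in which about cot_fraction of examples are CoT."""
    rng = random.Random(seed)
    mixture = []
    for _ in range(n_examples):
        pool = cot if rng.random() < cot_fraction else standard
        mixture.append(rng.choice(pool))
    return mixture

# Hypothetical toy pools; each example is (kind, prompt, target).
standard = [("direct", "What is the capital of France?", "Paris")]
cot = [("cot", "If John has 3 apples and buys 2 bags of 4, how many?",
        "John starts with 3 apples. 2 bags of 4 is 8. Total = 3 + 8 = 11.")]

mixture = build_mixture(standard, cot, n_examples=10_000)
cot_share = sum(kind == "cot" for kind, _, _ in mixture) / len(mixture)
print(f"CoT share: {cot_share:.3%}")
```

The realized share fluctuates around 1% from sample to sample, which is fine for this purpose: the model only needs to see both answer styles often enough to learn when each applies.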

CoT fine-tuning vs CoT prompting: CoT prompting (Wei et al., 2022) supplies few-shot exemplars with worked reasoning in the prompt at inference time; the zero-shot variant (Kojima et al., 2022) appends "Let's think step by step." CoT fine-tuning includes step-by-step examples in the training data. The paper shows that fine-tuning is far more effective: it internalizes the reasoning pattern rather than relying on exemplars in the prompt. After CoT fine-tuning, Flan-PaLM can produce chain-of-thought reasoning zero-shot, triggered by a short phrase such as "let's think step by step," which the paper reports the untuned PaLM does not do reliably.

Figure (interactive): Chain-of-Thought Impact. Performance with and without CoT data in the instruction-tuning mixture.

Why is the "mixed" CoT approach (both CoT and non-CoT examples) better than CoT-only fine-tuning?

CoT-only fine-tuning improves reasoning tasks but hurts non-reasoning tasks (the model tries to reason when a direct answer is needed). The mixed approach teaches the model WHEN to reason step-by-step and when to answer directly, giving the best of both worlds.
Because the mixed approach uses more total training data
Because CoT examples are lower quality than non-CoT examples

Chapter 4: Results

Flan-PaLM 540B — PaLM instruction-tuned with the Flan collection — achieves state-of-the-art results on a broad range of benchmarks:

Benchmark	PaLM 540B	Flan-PaLM 540B	GPT-3.5 (text-davinci-002)
MMLU (5-shot)	69.3	73.6	68.0
BBH (CoT)	62.0	66.3	64.3
TyDiQA	55.2	72.6	—
MGSM (multilingual math)	45.9	72.0	—

The most impressive result is on multilingual tasks. Flan-PaLM improves TyDiQA from 55.2 → 72.6 (+17.4!) and MGSM from 45.9 → 72.0 (+26.1!). This suggests that instruction tuning doesn't just teach the model to follow instructions — it also improves its ability to transfer knowledge across languages, likely because the instruction tuning data includes diverse languages and tasks.

Flan-T5: the practical winner

While Flan-PaLM 540B is impressive, it's also impractical for most users. Flan-T5 is the practical variant — instruction-tuning Google's T5 models. The results are remarkable:

Model	Params	MMLU (0-shot)	BBH (CoT)
T5-XL (no IT)	3B	28.3	24.1
Flan-T5-XL	3B	52.4	43.6
T5-XXL (no IT)	11B	33.7	28.5
Flan-T5-XXL	11B	57.2	49.0

Flan-T5-XL (3B) with instruction tuning outperforms the raw T5-XXL (11B) on both benchmarks. Instruction tuning is worth more than a 3.7x increase in model size.

This result has enormous practical implications. Running a 3B model is ~4x cheaper than running an 11B model (in both memory and compute). If instruction tuning a 3B model gives better results than running an 11B model without instruction tuning, then instruction tuning is by far the most cost-effective intervention available. It's cheaper than buying bigger hardware, cheaper than collecting more pre-training data, and cheaper than any architectural improvement.
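A back-of-envelope check of the memory side of that claim, counting weights only. The 2 bytes per parameter figure assumes bf16 or fp16 storage, the parameter counts are the nominal 3B and 11B from the table, and activations and caches are ignored; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (10^9 bytes); 2 bytes/param assumes bf16 or fp16."""
    return n_params * bytes_per_param / 1e9

small = weight_memory_gb(3e9)    # Flan-T5-XL, nominal 3B parameters
large = weight_memory_gb(11e9)   # T5-XXL, nominal 11B parameters
print(small, large, round(large / small, 2))  # 6.0 22.0 3.67
```

Under these assumptions the ratio is about 3.7x, in line with the "~4x cheaper" estimate.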

The "Flan tax" is trivially small: Instruction-tuning PaLM 540B takes about 0.2% of the pre-training compute. For T5-11B, it's about 0.5%, and that buys a 20+ point improvement on zero-shot MMLU for less than 1% additional compute. Few single interventions in ML offer a cost-benefit ratio this favorable.

The paper also compares against GPT-3.5 (text-davinci-002), which was the strongest publicly available model at the time. Flan-PaLM 540B outperforms GPT-3.5 on MMLU (73.6 vs 68.0) and is competitive on BBH (66.3 vs 64.3). This was the first time an openly described instruction-tuning recipe produced a model competitive with OpenAI's best — a significant milestone for open research.

Zero-shot vs few-shot

A particularly interesting finding: instruction tuning dramatically improves zero-shot performance (no examples in the prompt) while maintaining or slightly improving few-shot performance. This matters because zero-shot is the most user-friendly mode — you just ask a question without providing examples. Before instruction tuning, zero-shot performance was significantly worse than few-shot. After instruction tuning, the gap narrows or disappears.

python
# Zero-shot vs few-shot improvement
# PaLM 540B on MMLU:
#   0-shot: 69.3 → 73.6 (Flan) = +4.3
#   5-shot: 73.5 → 74.1 (Flan) = +0.6

# Zero-shot gains >> few-shot gains
# Why? Few-shot examples already provide an "interface" —
# they show the model what format to use.
# Instruction tuning provides this interface permanently,
# making the few-shot examples redundant.
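The difference between the two prompting modes is easiest to see in the prompt text itself. A minimal sketch, assuming a simple "Q:/A:" layout; the helper `build_prompt` and the exemplars are hypothetical.

```python
def build_prompt(question, exemplars=()):
    """Zero-shot when exemplars is empty; otherwise prepend solved Q/A pairs."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

zero_shot = build_prompt("What is the capital of France?")
few_shot = build_prompt(
    "What is the capital of France?",
    exemplars=[("What is the capital of Japan?", "Tokyo"),
               ("What is the capital of Kenya?", "Nairobi")],
)
print(zero_shot)
print("---")
print(few_shot)
```

The few-shot prompt ends with exactly the zero-shot prompt; the exemplars in front of it are what show a base model the expected format.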

Figure (interactive): Results Comparison. Base models versus instruction-tuned versions across benchmarks, for the PaLM and T5 families.

What is the most surprising finding from the Flan benchmark results?

Instruction tuning provides disproportionately large gains on multilingual tasks (TyDiQA: +17.4, MGSM: +26.1), and smaller instruction-tuned models (Flan-T5-XL 3B) can outperform much larger base models (T5-XXL 11B): instruction tuning is worth more than 3.7x the model size
Flan-PaLM matches GPT-4 on all benchmarks
Instruction tuning only helps English tasks

Chapter 5: Practical Guidelines

The paper distills its findings into concrete guidelines for practitioners who want to instruction-tune their own models. These guidelines have proven remarkably durable.

Recipe

1. Maximize task diversity. More diverse tasks (from different datasets, with different prompt templates) consistently improve results. The curve hasn't saturated at 1,836 tasks.

2. Include chain-of-thought. Mix ~1% CoT data with 99% standard data. This is the single most impactful addition: it unlocks reasoning capabilities at near-zero cost.

3. Use input inversion. For classification tasks, randomly swap input/output roles: "What sentiment does 'great movie' have?" → "Give an example of a positive review."

4. Balance task sizes. Use temperature sampling to prevent large tasks from dominating. T=3 works well as a default.

5. Short fine-tuning. Only 30K-40K gradient steps. Over-training on instructions degrades pre-trained capabilities (catastrophic forgetting).

Input inversion is an underappreciated trick. When you ask a model "classify the sentiment of X," it learns to classify. But when you also ask "generate a text with positive sentiment," it learns to produce that category. This bidirectional understanding makes the model more robust — it learns what a category is, not just how to label it.

python
# Input inversion examples

# Original task:
# Input:  "Classify: 'This movie is great' → ?"
# Output: "Positive"

# Inverted task:
# Input:  "Generate a positive movie review."
# Output: "This movie is great, with excellent acting."

# Why this helps: the model learns BOTH directions of the mapping
# Original: text → label (recognition)
# Inverted: label → text (generation)
# Together: deep understanding of what "positive" means
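As a rough sketch, inversion is a small transformation applied to an existing classification pair. The helper `invert_example` and its prompt wording are assumptions for illustration, not the paper's templates.

```python
def invert_example(text, label, domain="movie review"):
    """Turn a (text -> label) classification pair into a (label -> text) generation pair."""
    prompt = f"Generate a {label.lower()} {domain}."
    return prompt, text

original = ("Classify: 'This movie is great'", "Positive")
inverted = invert_example("This movie is great", "Positive")
print(original)
print(inverted)  # ('Generate a positive movie review.', 'This movie is great')
```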

Training duration: the forgetting cliff

The recommendation of 30K-40K gradient steps comes from a careful analysis of the training curve. Performance on instruction-tuning tasks improves steadily throughout training. But performance on held-out tasks (tasks not seen during instruction tuning) peaks around 30K-40K steps and then starts to decline. This is catastrophic forgetting — the model overwrites its pre-trained knowledge with instruction-specific patterns.

Performance(steps) = Knowledge(pre-train) · e^{-λ · steps} + Instruction(steps)

Where the first term represents pre-trained knowledge decaying exponentially with instruction-tuning steps, and the second term represents growing instruction-following ability. The optimal number of steps balances these two terms. Too few steps and the instruction-following ability is weak. Too many steps and the pre-trained knowledge is lost.
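The trade-off can be explored numerically. The sketch below assumes a saturating form for the instruction term, I(s) = I_max · (1 − e^{−μ·s}), and picks every constant by hand so that the peak lands in the tens of thousands of steps; none of these values come from the paper.

```python
import numpy as np

# All constants below are assumed for illustration; none come from the paper.
K_PRE = 60.0   # pre-trained knowledge contribution at step 0
LAM = 5e-6     # forgetting rate (lambda) per step
I_MAX = 25.0   # ceiling of the instruction-following contribution
MU = 4e-5      # how quickly instruction-following saturates

def performance(steps):
    knowledge = K_PRE * np.exp(-LAM * steps)             # decaying first term
    instruction = I_MAX * (1.0 - np.exp(-MU * steps))    # growing second term (assumed form)
    return knowledge + instruction

steps = np.arange(0, 200_001, 1_000)
best = int(steps[np.argmax(performance(steps))])
print(best, round(float(performance(best)), 1))
```

With these assumed constants the curve rises, peaks, and then declines as the decay term takes over, which is the qualitative shape the formula describes. Changing LAM or MU moves the peak, so the location of the optimum is an empirical question for each model and mixture.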

What NOT to do

Mistake	Why it hurts
Too few tasks	Model overfits to seen instruction patterns, fails on novel ones
Only one template per task	Model learns template matching, not instruction understanding
Too much training	Pre-trained knowledge is overwritten (catastrophic forgetting)
Only CoT data	Model tries to reason on simple factual questions
Size-proportional task sampling	Tiny tasks get drowned by massive ones

Figure (interactive): Training Recipe Simulator. The effect on performance of switching individual recipe components on or off.

Why does the paper recommend only 30K-40K gradient steps for instruction tuning, not more?

Because over-training on instruction data causes catastrophic forgetting: the model loses the broad knowledge acquired during pre-training and starts to only perform well on the instruction-tuning task formats, degrading on novel tasks
Because the learning rate becomes too small after 40K steps
Because hardware constraints limit training to 40K steps

Chapter 6: Scaling Explorer

This chapter brings the paper's three scaling dimensions together: the number of tasks, the model size, and the presence of CoT data.

Figure (interactive): Three-Axis Scaling Explorer. Performance as a function of number of tasks, model size, and CoT inclusion, with each axis contributing independently.

The three scaling laws of instruction tuning

Axis	Scaling Behavior	Saturation?
Number of tasks	Log-linear improvement: 2x tasks ≈ +1-2% accuracy	Not observed at 1,836
Model size	Power-law improvement: 10x params ≈ +5-10% accuracy	Not observed at 540B
CoT inclusion	Step function: including ANY CoT gives +5-15% on reasoning	Mostly saturated at ~1% mix

The multiplicative benefit: The three axes interact multiplicatively. More tasks helps; bigger models help; CoT helps. But combining all three gives results far better than any single axis alone. This is why Flan-PaLM 540B (large model + many tasks + CoT) is so dominant — it benefits from all three scaling dimensions simultaneously.

The paper provides evidence for this multiplicativity by running a full factorial experiment: every combination of (few tasks / many tasks) × (small model / large model) × (with CoT / without CoT). The gains from each axis are roughly additive in log-space, which means they're multiplicative in raw performance. This is an important finding because it tells practitioners that they should invest in all three axes — skipping any one of them leaves significant performance on the table.

python
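# The three scaling dimensions: an illustrative toy formula, not a fit from the paper
# Performance ≈ Base + 8*log10(tasks) + 3*log10(params / 250M) + 8*CoT
import math

def toy_score(tasks, params, cot, base=50.0):
    task_term = 8 * math.log10(tasks)
    scale_term = 3 * math.log10(params / 250e6)   # 0 for the smallest (250M) model
    cot_term = 8.0 if cot else 0.0
    return base + task_term + scale_term + cot_term

# Worst case: 20 tasks, 250M params, no CoT
print(round(toy_score(20, 250e6, False), 1))     # 50 + 10.4 + 0 + 0 = 60.4

# Best case: 1836 tasks, 540B params, with CoT
print(round(toy_score(1836, 540e9, True), 1))    # 50 + 26.1 + 10.0 + 8 = 94.1

# Marginal value of each axis between the two cases:
# Adding tasks (20→1836):   +15.7 points
# Adding scale (250M→540B): +10.0 points
# Adding CoT:               +8.0 points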

Which of the three scaling axes (tasks, model size, CoT) shows the most "step function" behavior — providing most of its benefit at minimal inclusion?

Chain-of-thought: including just ~1% CoT data in the mixture provides most of the reasoning improvement. Unlike task count and model size (which scale gradually), CoT acts more like a switch: you either include it or you don't, and including it unlocks a qualitatively different capability (step-by-step reasoning).
Number of tasks: even a small number of tasks provides most of the benefit
Model size: even small models capture most of the instruction-tuning benefit

Chapter 7: Connections

This paper sits at the intersection of two major trends: instruction tuning and scaling laws. It showed that both trends apply to the post-training phase, not just pre-training. More importantly, it provided the first comprehensive recipe for instruction tuning — a recipe that the community followed for over a year.

The paper's timing was critical. It was published in late 2022, just weeks before ChatGPT's release. While ChatGPT demonstrated the power of instruction tuning + RLHF to the public, the Flan paper provided the scientific foundation that the research community needed to replicate and extend these results. Without this paper's systematic analysis of what makes instruction tuning work, the open-source response to ChatGPT (Alpaca, Vicuna, etc.) would have been much more haphazard.

What this paper built on

Foundation	Contribution
FLAN (Wei 2021)	Original instruction tuning on 62 tasks — showed the concept works
T0/P3 (Sanh 2021)	Community prompt templates + zero-shot generalization study
Super-Natural Instructions	Large crowdsourced task collection with detailed definitions
CoT Prompting (Wei 2022)	Chain-of-thought prompting at inference time
PaLM (Chowdhery 2022)	The 540B base model that Flan-PaLM is built on

What came after

Successor	Advance
AlpacaFarm	Simulated human feedback for faster instruction tuning research
Camels (Tülu)	Systematic comparison of open instruction-tuning datasets
Alpaca	Self-instruct: generate instruction data from GPT-3.5, tune LLaMA
Orca	Explanation tuning: include GPT-4's reasoning traces in the fine-tuning data
Llama 3	Scaled instruction tuning + DPO to 405B parameters

This paper's lasting impact:
1. More tasks always help — became a design principle for instruction-tuning datasets.
2. CoT in fine-tuning — now standard practice. Every major model includes reasoning traces in its training data.
3. Flan-T5 — remained the go-to open instruction-tuned model for over a year, until Llama-based models superseded it.
4. The recipe — the practical guidelines (temperature sampling, input inversion, short training) are still followed in 2024.

The CoT finding deserves special emphasis. Before this paper, chain-of-thought was primarily an inference-time technique: you added worked exemplars or a trigger phrase such as "Let's think step by step" to your prompt and hoped the model would reason better. This paper showed that including CoT in the training data is far more effective. This insight influenced many subsequent instruction-tuning efforts, and reasoning traces are now a common ingredient of fine-tuning data.

The Flan-T5 model family (small, base, large, xl, xxl) also deserves recognition for its practical impact. From late 2022 to mid-2023, Flan-T5 was the default instruction-following model for much of the research community. It was used in hundreds of papers, thousands of projects, and countless applications. Its encoder-decoder architecture made it particularly good at structured tasks (classification, extraction, translation) where the output format is well-defined.

python
# Using Flan-T5, the most practical output of this paper
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate(prompt, max_new_tokens=10):
    """Tokenize the prompt, generate, and decode the result to text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Zero-shot: no examples needed
prompt = "Classify the sentiment: 'This movie was incredible'\nSentiment:"
print(generate(prompt))  # expected to be a positive label, e.g. "positive"

# CoT reasoning can also be requested zero-shot after Flan tuning
prompt2 = ("Q: If a train travels 60mph for 2.5 hours, how far does it go?\n"
           "A: Let's think step by step.")
print(generate(prompt2, max_new_tokens=64))
# The model should reason along the lines of:
# distance = speed × time = 60 × 2.5 = 150 miles.

"Instruction tuning is a simple method that, across many settings, substantially improves the performance of pretrained language models."
— Chung et al., 2022

What is the most important practical takeaway from the Flan scaling paper?

Instruction tuning with diverse tasks (1,836+), chain-of-thought data (~1% of mix), and proper recipe (temperature sampling, short training, input inversion) is the single most cost-effective way to improve a pre-trained language model: a small instruction-tuned model can outperform a much larger base model
That you need at least 540B parameters for instruction tuning to work
That chain-of-thought should be the only data used in instruction tuning

Scaling Instruction-Finetuned Language Models