How Far Can Camels Go? (Wang 2023)

Chapter 0: The Open Instruction Landscape

It's mid-2023. ChatGPT has been out for six months, and the open-source community is scrambling to catch up. Everyone wants to build their own instruction-following model, and the recipe seems simple: take a pre-trained base model (like LLaMA), fine-tune it on instruction-response pairs, and ship it. But which instruction data should you use? Which base model? How much data do you need? Nobody knows — because nobody has done a systematic comparison.

The landscape is fragmented. Stanford Alpaca fine-tunes LLaMA-7B on 52K GPT-generated instructions. Databricks Dolly collects 15K human-written instructions from employees. FLAN aggregates 1,836 academic tasks into 15M examples. ShareGPT scrapes real ChatGPT conversations. Each group claims their approach works, but they all use different base models, different evaluation benchmarks, and different training recipes. It's impossible to know what's actually driving performance.

To see why this matters, consider a concrete scenario. You're an engineer at a startup. Your boss says: "Build me an AI assistant that can answer customer questions about our product." You know you need to instruction-tune a base model. But you're immediately faced with decisions that nobody has answered rigorously:

Decision	Options	What You'd Like to Know
Base model	LLaMA-7B, Pythia-6.9B, OPT-6.7B	Which gives the best starting point?
Dataset	FLAN, Alpaca, ShareGPT, Dolly, or mix?	Which produces the best assistant?
Data size	10K, 50K, 100K, 1M examples?	Is more always better?
Data source	LLM-generated vs human-written?	Does the source matter?

Before this paper, the honest answer to all of these questions was "we don't know." After this paper, we have empirically grounded answers for each one.

The fundamental question: When you fine-tune a base model into an instruction follower, what matters more — the dataset, the base model, or the training recipe? And within datasets, does more data always mean better performance, or does quality trump quantity? This paper answers these questions with controlled experiments that isolate each variable.

The stakes are high. In mid-2023, dozens of startups and research labs were investing millions of dollars in instruction tuning, often making choices based on vibes and blog posts rather than rigorous evidence. Some teams spent months collecting massive datasets of 1M+ instruction examples, only to find that their models weren't much better than simpler approaches. Others used expensive GPT-4-generated data when cheaper alternatives would have worked equally well. This paper provides the evidence base to make these decisions rationally.

The authors at Allen AI (AI2) name their project Tulu (after a breed of camel) and systematically investigate the open instruction-tuning landscape. They train dozens of models, varying one factor at a time, and evaluate them on both automatic benchmarks and human preference judgments. The result is the most comprehensive apples-to-apples comparison of open instruction tuning ever conducted.

Here's the setup: they take the same base model (LLaMA), fine-tune it on each of 7 different instruction datasets under identical training conditions, and evaluate on the same benchmarks. Then they hold the dataset fixed and vary the base model. Then they explore mixing datasets and varying data quantity. Every experiment isolates exactly one variable.

The methodology is critical. In machine learning research, confounding variables are everywhere. If you train LLaMA on FLAN and GPT-NeoX on Alpaca, and LLaMA wins, you can't tell if it's because of the model or the data. The paper's strength is its relentless variable isolation. When comparing datasets, the base model (LLaMA-7B), training hyperparameters (lr=2e-5, epochs=2, batch=128), and evaluation suite (MMLU, BBH, TydiQA, AlpacaEval) are all held constant. When comparing base models, the dataset (Tulu mix) and all other factors are held constant. This is basic experimental design, but it's surprisingly rare in ML papers.
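
The one-variable-at-a-time design can be sketched as a small experiment grid. This is an illustration rather than the authors' code: `BASELINE` and `one_factor_sweep` are hypothetical names, and the option lists are shortened versions of the ones in the decision table above.

```python
# Hypothetical sketch of a one-factor-at-a-time experiment grid.
# BASELINE and one_factor_sweep are illustrative names, not from the paper.
BASELINE = {
    "base_model": "LLaMA-7B",
    "dataset": "Tulu mix",
    "lr": 2e-5,
    "epochs": 2,
    "batch": 128,
}

def one_factor_sweep(baseline, factor, options):
    """Return one config per option, changing only `factor`."""
    return [{**baseline, factor: option} for option in options]

dataset_runs = one_factor_sweep(BASELINE, "dataset", ["FLAN", "Alpaca", "ShareGPT", "Dolly"])
model_runs = one_factor_sweep(BASELINE, "base_model", ["LLaMA-7B", "Pythia-6.9B", "OPT-6.7B"])

for run in dataset_runs + model_runs:
    changed = [key for key in run if run[key] != BASELINE[key]]
    # At most one factor differs from the baseline in any run.
    assert len(changed) <= 1
    print(changed or ["(baseline)"], run["base_model"], run["dataset"])
```

Because every run differs from the baseline in at most one key, any score difference between two runs in the same sweep can be attributed to that key.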

Let's walk through the training setup concretely. Each experiment uses the same hardware (8x A100 80GB GPUs), the same training framework (PyTorch + Hugging Face Transformers), the same optimizer (AdamW), and the same format template for instructions:

python
# Instruction format template used across all experiments
template = """<|user|>
{instruction}
<|assistant|>
{response}"""

# Example formatted training instance:
# <|user|>
# What are three benefits of regular exercise?
# <|assistant|>
# Regular exercise provides numerous benefits:
# 1. Cardiovascular health: ...
# 2. Mental well-being: ...
# 3. Weight management: ...

# Loss is computed ONLY on assistant tokens
# This teaches the model to generate responses, not to parrot instructions
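
To make the "loss only on assistant tokens" idea concrete, here is a minimal sketch of how the labels could be built. It assumes a toy whitespace tokenizer (`encode`) in place of a real subword tokenizer, and `build_example` is a hypothetical helper. The one piece borrowed from a real library is the value `-100`, which PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def encode(text, vocab):
    """Toy whitespace tokenizer that assigns ids on first sight (illustrative only)."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def build_example(instruction, response, vocab):
    """Return (input_ids, labels) with the prompt positions masked out of the loss."""
    prompt_ids = encode(f"<|user|>\n{instruction}\n<|assistant|>\n", vocab)
    response_ids = encode(response, vocab)
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE_INDEX; response positions keep their token ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

vocab = {}
input_ids, labels = build_example(
    "Name one benefit of exercise.", "Better cardiovascular health.", vocab
)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(f"{supervised} of {len(labels)} positions contribute to the loss")
```

The model still reads the full sequence as input. Only the gradient signal is restricted to the response positions.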

The Open Instruction-Tuning Landscape

[Interactive figure: a map of the open instruction-tuning resources available by mid-2023, showing each resource's data source, size, and approach.]

The numbers tell the story of fragmentation. At the time of writing, the open-source community had produced at least 7 major instruction datasets, 4 widely-used base model families, and dozens of instruction-tuned model variants — with no controlled comparison between them. Papers would fine-tune LLaMA on their new dataset, evaluate on their preferred benchmark, and claim a new state-of-the-art. But without controlling for the base model, training recipe, and evaluation suite, these comparisons were meaningless. It was like comparing cars when everyone uses different roads, different weather conditions, and different speedometers.

The paper's contributions are threefold:

Contribution 1: Controlled comparison of 7 instruction datasets on the same base model and evaluation suite

Contribution 2: Controlled comparison of 4 base model families (LLaMA, Pythia, OPT, GPT-NeoX) with the same data

Contribution 3: Data quantity experiments, from 1K to 1M examples, revealing the quality-vs-quantity tradeoff

Why "Camels"? The project is called Tulu after the Tulu breed of camel. In the instruction-tuning zoo, LLaMA is a llama, Alpaca is an alpaca, Vicuna is a vicuna — all camelids. The paper asks: how far can these camels go with different training data? The answer, it turns out, depends less on how much hay you give them and more on the quality of the hay.

Before we dive into the experiments, let's understand the instruction-tuning pipeline concretely. Here's what happens when you instruction-tune a base model:

python
# The instruction-tuning pipeline (a sketch: SFTTrainer argument names vary across TRL versions)
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# 1. Load base model (pre-trained, but can't follow instructions)
# "meta-llama/Llama-7b" is a placeholder ID; point it at your own converted LLaMA-7B weights
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-7b")

# 2. Format instruction data as prompt-response pairs
# Input:  "<|user|>\nSummarize the following article...\n<|assistant|>\n"
# Output: "The article discusses three key points..."
# The model learns to predict the output given the input
# instruction_data is assumed to be a dataset already formatted this way

# 3. Fine-tune with standard language modeling loss
# The recipe computes loss ONLY on the response tokens, not the instruction.
# Depending on the TRL version and dataset format, SFTTrainer may train on all
# tokens, so check that prompt masking is enabled (not shown here)
trainer = SFTTrainer(
    model=model,
    train_dataset=instruction_data,
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="tulu-sft",  # illustrative output path
        learning_rate=2e-5,
        num_train_epochs=2,
        per_device_train_batch_size=4,
        # effective batch = 4 x 32 = 128 on one GPU; use 4 steps across 8 GPUs
        gradient_accumulation_steps=32,
    )
)
trainer.train()
# After ~2 epochs: base model → instruction follower

The key insight is that instruction tuning is cheap compared to pre-training. Pre-training LLaMA-7B takes ~82,000 GPU-hours on A100s. Instruction tuning takes ~40 GPU-hours — 2,000x less compute. This asymmetry is why the question "which instruction data should I use?" matters so much: you have budget for dozens of instruction-tuning experiments but zero budget for re-doing pre-training.

Cost comparison: Pre-training LLaMA-7B costs ~$200,000 in cloud compute. Instruction-tuning the same model costs ~$100-500 depending on dataset size and hardware. This 400-2000x cost difference means you can run dozens of instruction-tuning experiments for the price of one pre-training run. The Camels paper leverages this asymmetry to run 40+ experiments that would be impossible at pre-training scale.
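
The ratios quoted above are easy to check. The short calculation below uses only the figures from the two preceding paragraphs. The variable names are illustrative, and the dollar amounts are the text's rough estimates rather than measured prices.

```python
# Back-of-the-envelope check of the asymmetry, using only the figures quoted above.
pretrain_gpu_hours = 82_000
sft_gpu_hours = 40
pretrain_cost_usd = 200_000
sft_cost_usd_range = (100, 500)

compute_ratio = pretrain_gpu_hours / sft_gpu_hours
cost_ratio_low = pretrain_cost_usd / sft_cost_usd_range[1]   # most expensive tuning run
cost_ratio_high = pretrain_cost_usd / sft_cost_usd_range[0]  # cheapest tuning run

print(f"compute ratio: ~{compute_ratio:,.0f}x")
print(f"cost ratio: {cost_ratio_low:,.0f}x to {cost_ratio_high:,.0f}x")
```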

What is the fundamental methodological contribution of the Camels/Tulu paper?

Controlled experiments that isolate each variable (dataset, base model, data quantity) while holding everything else constant, enabling apples-to-apples comparison

Introducing a new instruction-tuning dataset that outperforms all previous datasets

Training the largest open-source instruction-following model available at the time

Chapter 1: Dataset Comparison

The paper evaluates seven instruction-tuning datasets that were publicly available by mid-2023. These datasets differ dramatically in size, source, format, and quality. Understanding their characteristics is essential before comparing their effects on model performance.

Let's walk through each one:

FLAN V2 is the elephant in the room. Google's collection aggregates 1,836 academic NLP tasks (sentiment analysis, NLI, QA, summarization, translation, etc.) into a unified instruction format. Each task gets multiple prompt templates, creating massive diversity. The authors subsample to ~100K examples for tractability, but the full set contains over 15 million. FLAN's strength is breadth: it covers virtually every NLP task type. Its weakness is that the tasks feel "academic" — they don't resemble the open-ended questions real users ask.

Here's a typical FLAN instruction: "Read the following passage and answer the question. Passage: 'The capital of France is Paris...' Question: 'What is the capital of France?' Answer:". The response is just "Paris." Compare this to a typical ShareGPT instruction: "What are some fun things to do in Paris?" with a 200-word response covering restaurants, museums, and neighborhoods. Both teach the model something — but the ShareGPT example teaches it to be a helpful assistant, while the FLAN example teaches it to extract information from passages.

Stanford Alpaca (CoT) consists of 52K instruction-response pairs generated by OpenAI's text-davinci-003, a GPT-3.5-series model. Stanford researchers started from 175 seed instructions and asked the model to generate more via self-instruct. The "CoT" variant adds chain-of-thought reasoning to a subset. Quality is moderate: GPT-3.5 sometimes generates plausible-sounding but incorrect answers, and the instructions tend to be simple. But the data is cheap (total cost was ~$500) and covers a reasonable range of tasks.

The $500 price tag made Alpaca a sensation. It showed that you could build a passable instruction-following model for less than the cost of a nice dinner. However, the quality ceiling is real: GPT-3.5's errors propagate to the student model. If GPT-3.5 gives a wrong math answer, the student learns that wrong answer. This "error amplification" problem is a fundamental limitation of synthetic data from imperfect teachers.

ShareGPT is scraped from ShareGPT.com, where users voluntarily shared their ChatGPT conversations. This gives you ~90K real user-ChatGPT dialogues — authentic queries that people actually care about. The advantage: these conversations reflect real user needs, not academic benchmarks or synthetic prompts. The disadvantage: they can be noisy (incomplete conversations, screenshots described in text, non-English content mixed in).

ShareGPT's unique value is that it captures the distribution of real user queries. People ask ChatGPT to write code, explain concepts, brainstorm ideas, draft emails, translate text, analyze data, and dozens of other things. This distribution is very different from academic NLP tasks (which are dominated by classification and extraction) and from synthetic datasets (which tend toward generic questions). Training on ShareGPT teaches the model to handle the queries real users actually send.

Databricks Dolly contains 15K instructions written by Databricks employees. This is fully human-authored — no LLM generation involved. Employees were given categories (brainstorming, classification, QA, summarization, etc.) and wrote both instructions and responses. Quality is high because the writers are technically skilled, but the dataset is small and the instructions tend toward business use cases. The human-authored nature means responses don't have the "ChatGPT voice" that synthetic datasets inherit — they're more direct and often more accurate, though sometimes less polished in formatting.

Open Assistant (OASST1) is a crowdsourced dataset of ~10K conversation trees, where volunteers both wrote user prompts and assistant responses. Multiple responses per prompt were collected and ranked by other volunteers, creating a preference signal. The conversations are often multi-turn and cover a wide range of topics. Quality is variable — some contributions are excellent, others are superficial.

OASST is unique in having a built-in quality signal: each response was rated by 3-5 volunteers on helpfulness, truthfulness, and harmlessness. The paper uses the top-ranked responses only, effectively getting human-curated quality for free. This lets OASST punch above its weight: 10K curated examples from OASST often outperform 50K+ uncurated examples from other sources.
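
Keeping only the top-ranked reply per prompt is a simple filter. The sketch below assumes a simplified flat record format with hypothetical field names (`prompt_id`, `text`, `rank`), where rank 0 is the best reply. The real dataset stores full conversation trees, so treat this as an illustration of the idea only.

```python
# Sketch: keep only the top-ranked assistant reply for each prompt.
# Assumption: simplified flat records; rank 0 means "ranked best by volunteers".
replies = [
    {"prompt_id": "p1", "text": "Short answer.", "rank": 1},
    {"prompt_id": "p1", "text": "Detailed, accurate answer.", "rank": 0},
    {"prompt_id": "p2", "text": "Only reply.", "rank": 0},
]

def top_ranked(replies):
    """Map each prompt_id to its best-ranked reply."""
    best = {}
    for reply in replies:
        current = best.get(reply["prompt_id"])
        if current is None or reply["rank"] < current["rank"]:
            best[reply["prompt_id"]] = reply
    return best

selected = top_ranked(replies)
for prompt_id, reply in selected.items():
    print(prompt_id, "->", reply["text"])
```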

Self-Instruct contains ~82K instruction-response pairs generated by the self-instruct pipeline, similar to Alpaca but using GPT-3 (davinci) rather than GPT-3.5. Instructions are generated from a seed set of 175 examples, filtered for quality, and then responses are generated. The older model produces somewhat lower-quality outputs than Alpaca's GPT-3.5 source. The pipeline works iteratively: (1) sample from the seed set, (2) prompt GPT-3 to generate a new instruction, (3) filter for novelty and quality, (4) generate a response for the instruction.
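
The four-step loop can be sketched with stubs. Everything here is a stand-in: `fake_llm` replaces the GPT-3 calls, the candidate instructions are supplied directly instead of being generated, and a simple word-overlap threshold (`max_overlap`, an assumed value) plays the role of the novelty filter.

```python
import random

def fake_llm(prompt):
    """Stub standing in for a GPT-3 API call (hypothetical)."""
    return f"(model response to: {prompt})"

def word_overlap(a, b):
    """Jaccard overlap of lowercase word sets, a stand-in similarity measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def self_instruct(seed_tasks, proposals, max_overlap=0.7, seed=0):
    rng = random.Random(seed)
    pool = list(seed_tasks)
    dataset = []
    for proposal in proposals:
        # (1) sample in-context examples from the current pool
        shots = rng.sample(pool, k=min(2, len(pool)))
        # (2) a real pipeline would prompt the model with `shots` to get a new
        #     instruction; here the proposals are supplied directly
        # (3) drop proposals too similar to anything already in the pool
        if any(word_overlap(proposal, task) > max_overlap for task in pool):
            continue
        # (4) generate a response for the accepted instruction
        dataset.append({"instruction": proposal, "response": fake_llm(proposal)})
        pool.append(proposal)
    return dataset

generated = self_instruct(
    seed_tasks=["Summarize this paragraph."],
    proposals=[
        "Translate this sentence to French.",
        "Translate this sentence to French.",  # duplicate, filtered out in step 3
        "Write a haiku about rain.",
    ],
)
print(len(generated), "instructions kept")
```

Accepted instructions join the pool, so later proposals are checked against both the seeds and everything generated so far.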

Unnatural Instructions is another synthetic dataset (~68K examples) generated by asking GPT-3 to produce both instructions and responses, with an emphasis on creative and unusual task formulations. The name comes from the "unnatural" prompt engineering used to elicit diverse task types. The dataset intentionally pushes beyond standard NLP task formats — you'll find instructions like "Write a limerick about quantum physics" or "Explain why the sky is blue using only words a pirate would use." This diversity is intended to teach the model flexibility, though the quality of individual responses varies significantly.

A pattern emerges when you look across all seven datasets: there's a fundamental tradeoff between breadth (how many different task types are covered) and depth (how detailed and helpful each individual response is). FLAN has maximum breadth but minimal depth. ShareGPT has high depth but uncontrolled breadth. The question is which dimension matters more for downstream performance — and the answer, as we'll see, depends on what you measure.

Dataset	Size	Source	Style
FLAN V2	~100K (subsampled)	1,836 academic NLP tasks	Structured, templated
Alpaca (CoT)	52K	GPT-3.5 generated	Short instructions, mixed quality
ShareGPT	~90K	Real ChatGPT conversations	Conversational, multi-turn
Dolly	15K	Human-written (Databricks employees)	Business-oriented, careful
OASST1	~10K	Crowdsourced conversation trees	Multi-turn, preference-ranked
Self-Instruct	82K	GPT-3 (davinci) generated	Synthetic, varied
Unnatural Instr.	68K	GPT-3 generated	Creative, unusual tasks

The key variable: These datasets differ along multiple axes — size (10K to 100K+), source (human vs LLM-generated), format (single-turn vs multi-turn), and domain coverage. The paper trains LLaMA-7B on each dataset separately, using identical hyperparameters, to isolate which dataset characteristics actually matter for downstream performance.

The training setup is uniform across all experiments: LLaMA-7B base model, 2 epochs of fine-tuning, learning rate 2e-5 with linear warmup and cosine decay, batch size 128, max sequence length 2048. This ensures any performance differences come from the data, not the training recipe.

Why exactly 2 epochs? The paper experiments with 1, 2, 3, and 5 epochs and finds that 2 epochs is the sweet spot for most datasets. At 1 epoch, the model hasn't fully absorbed the instruction format. At 3+ epochs, it starts to overfit — memorizing specific responses rather than learning the general instruction-following skill. Overfitting manifests as the model reproducing training responses verbatim or generating outputs in the exact style of a specific data source, reducing generalization. The 2-epoch finding has become standard in the community.

The max sequence length of 2048 is also significant. Some ShareGPT conversations exceed this length and are truncated. The paper notes that increasing to 4096 tokens slightly improves performance on conversation-heavy benchmarks but doubles memory requirements. For most experiments, 2048 is a practical compromise. Later models (Llama 2, Mistral) support longer contexts and benefit from longer training sequences.

python
# Standardized training config for all dataset comparisons
config = {
    "model": "llama-7b",
    "epochs": 2,
    "lr": 2e-5,
    "warmup_ratio": 0.03,
    "scheduler": "cosine",
    "batch_size": 128,
    "max_seq_len": 2048,
    "weight_decay": 0.0,
    "gradient_checkpointing": True,
}
# For each dataset D in [FLAN, Alpaca, ShareGPT, Dolly, OASST, ...]:
#   model = finetune(llama_7b, D, config)
#   scores = evaluate(model, [MMLU, BBH, TydiQA, AlpacaEval, ...])
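
With a fixed epoch count and batch size, the number of optimizer steps depends only on dataset size. The sketch below works that out for three of the datasets, assuming one training example per sequence (no packing). `training_steps` is a hypothetical helper, and the warmup rounding is one reasonable choice rather than a documented one.

```python
import math

def training_steps(num_examples, epochs=2, batch_size=128, warmup_ratio=0.03):
    """Return (total optimizer steps, warmup steps) for the shared config."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total = steps_per_epoch * epochs
    warmup = int(total * warmup_ratio)
    return total, warmup

for name, size in [("Dolly", 15_000), ("Alpaca", 52_000), ("FLAN (subsampled)", 100_000)]:
    total, warmup = training_steps(size)
    print(f"{name}: {total} optimizer steps, {warmup} warmup steps")
```

Small datasets such as Dolly get only a few hundred updates under this recipe, which is worth keeping in mind when comparing them against the larger datasets.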

Dataset Characteristic Comparison

[Interactive figure: bar and radar charts comparing the seven instruction datasets on size, diversity, naturalness, and response quality.]

The critical question: when you train LLaMA-7B on each dataset independently, which one produces the best instruction-following model? The answer might surprise you — it's not the largest dataset, and it's not the most expensive to create.

Let's look at an example from each dataset to see how they differ in style and quality:

python
# Example instruction-response pairs from each dataset

# FLAN: academic task format
flan_example = {
    "instruction": "Classify the sentiment of the following review as "
                   "positive or negative: 'The movie was a waste of time.'",
    "response": "negative"  # ~1 token response
}

# ShareGPT: natural conversation
sharegpt_example = {
    "instruction": "Explain quantum entanglement like I'm 10 years old",
    "response": "Imagine you have two magic coins. You flip them both..."
    # ~200 token response with analogies, structure
}

# Dolly: human-written, structured
dolly_example = {
    "instruction": "What are the main differences between Python and Java?",
    "response": "Python and Java differ in several key ways:\n"
                 "1. Type system: Python is dynamically typed...\n"
                 "2. Syntax: Python uses indentation...\n"
    # ~150 tokens, well-structured
}

# Key difference: FLAN responses average ~15 tokens
# ShareGPT responses average ~350 tokens
# This response length difference drives the style transfer effect

Response length as a proxy for quality: The average response length varies dramatically across datasets. FLAN: ~15 tokens (short labels and classifications). Alpaca: ~65 tokens. Dolly: ~110 tokens. ShareGPT: ~350 tokens. Models trained on longer responses learn to be more thorough and explanatory — a key factor in human preference evaluations where users prefer detailed, well-structured answers.
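
Measuring this yourself takes a few lines. The sketch below uses toy responses and whitespace word counts as a stand-in for tokenizer token counts, so its numbers are not comparable to the averages quoted above. `mean_response_length` is a hypothetical helper.

```python
from statistics import mean

# Toy examples; whitespace word counts stand in for tokenizer token counts.
samples = {
    "FLAN": ["negative", "Paris"],
    "ShareGPT": ["Imagine you have two magic coins. You flip them both and they always match."],
}

def mean_response_length(responses):
    """Average number of whitespace-separated words per response."""
    return mean(len(response.split()) for response in responses)

lengths = {name: mean_response_length(responses) for name, responses in samples.items()}
for name, length in lengths.items():
    print(f"{name}: {length} words per response on average")
```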

Which characteristic most distinguishes FLAN V2 from ShareGPT as an instruction-tuning dataset?

FLAN consists of academic NLP tasks reformatted as instructions, while ShareGPT contains real user conversations, making them cover very different skill distributions

FLAN is generated by GPT-4 while ShareGPT is generated by GPT-3.5

FLAN is smaller and higher quality while ShareGPT is larger but noisier

Chapter 2: Base Model Comparison

Dataset choice is only half the equation. The other half is the base model — the pre-trained language model you start with before fine-tuning. In 2023, several open base model families were available, and the community hadn't established which one was the best foundation for instruction tuning.

The paper compares four base model families, spanning different sizes, training data, and architectural choices:

LLaMA (Meta, 2023): The clear frontrunner. LLaMA models are trained on 1-1.4 trillion tokens of publicly available text, using a standard Transformer decoder architecture with RoPE positional embeddings and SwiGLU activations. Sizes: 7B, 13B, 30B, 65B. LLaMA's distinguishing feature is its aggressive training budget: the 7B model sees more data per parameter than any previous open model, deliberately training well past the "Chinchilla-optimal" token count so that a small model ends up stronger at inference time.

Pythia (EleutherAI, 2023): A family of models from 70M to 12B parameters, trained on The Pile (300B tokens). Pythia's unique contribution is scientific reproducibility — every model checkpoint is released, training data is deduplicated, and the full training pipeline is documented. The downside: Pythia is trained on significantly less data than LLaMA (300B vs 1T tokens), and The Pile is smaller and older.

OPT (Meta, 2022): Open Pre-trained Transformers, available from 125M to 175B parameters. OPT was trained on a mix of public datasets (The Pile, BookCorpus, etc.) totaling ~180B tokens. OPT is older than LLaMA and was trained with less data and lower quality filtering. Performance is generally below LLaMA at equivalent sizes.

GPT-NeoX (EleutherAI, 2022): A 20B parameter model trained on The Pile. Uses rotary positional embeddings and parallel attention+FFN computation (computing attention and FFN in parallel, as in GPT-J, rather than sequentially). Single size only, making scaling analysis impossible within this family. Despite its 20B parameters, it often loses to LLaMA-13B because of the pre-training data disadvantage.

What makes a good base model for instruction tuning? The paper's results suggest three key factors: (1) Pre-training data volume — more tokens seen during pre-training means a richer knowledge base to build on. (2) Data quality and filtering — LLaMA's aggressive data filtering removes low-quality web text that would pollute the model's representations. (3) Training efficiency — LLaMA trains the 7B model on 1T tokens (5x the Chinchilla-optimal amount), creating a model that "knows" far more than its parameter count would suggest.

Model Family	Sizes Available	Training Tokens	Key Feature
LLaMA	7B, 13B, 30B, 65B	1-1.4T	Data/params ratio well beyond Chinchilla-optimal
Pythia	70M-12B	300B	Full checkpoint reproducibility
OPT	125M-175B	180B	First large open model release
GPT-NeoX	20B only	300B (Pile)	Parallel attention/FFN
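
The data-per-parameter gap is easy to quantify from the token counts above. The sketch below assumes roughly 20 tokens per parameter as a Chinchilla-style rule of thumb. That constant is an assumption brought in for illustration, and the exact multiple over "optimal" depends on which estimate of the optimum you use.

```python
# Tokens seen per parameter, from the figures quoted in this chapter.
# Assumption: ~20 tokens per parameter as a rough Chinchilla-style rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20

models = {
    "LLaMA-7B": (7e9, 1.0e12),
    "Pythia-6.9B": (6.9e9, 300e9),
    "OPT-6.7B": (6.7e9, 180e9),
}

ratios = {}
for name, (params, tokens) in models.items():
    per_param = tokens / params
    ratios[name] = per_param
    multiple = per_param / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {per_param:.0f} tokens/param, {multiple:.1f}x the rule-of-thumb budget")
```

At roughly the same parameter count, LLaMA-7B has seen more than three times as many tokens per parameter as Pythia-6.9B, which is the gap the rest of this chapter keeps returning to.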

Pre-training data matters enormously. The paper finds that LLaMA consistently outperforms other base models at the same parameter count, and the gap is large. LLaMA-7B instruction-tuned often outperforms OPT-30B and Pythia-12B instruction-tuned. The difference comes down to pre-training data volume and quality. LLaMA sees 3-5x more tokens than competitors, with better data filtering. This means the foundation matters as much as the fine-tuning.

The experimental setup is clean: take the best-performing dataset from the dataset comparison (a mix of FLAN + ShareGPT), and train each base model with identical hyperparameters. Then evaluate on the same benchmark suite. This isolates the effect of the base model.

python
# Base model comparison experiment
best_data = load_dataset("tulu_mix")  # FLAN + ShareGPT mix

base_models = {
    "llama-7b":   load("meta-llama/Llama-7b"),     # 1T tokens pre-training
    "pythia-6.9b":load("EleutherAI/pythia-6.9b"), # 300B tokens pre-training
    "opt-6.7b":   load("facebook/opt-6.7b"),      # 180B tokens pre-training
}

# Same data, same hyperparameters, different base
for name, model in base_models.items():
    tuned = finetune(model, best_data, config)
    scores = evaluate(tuned, benchmark_suite)
    # Result: LLaMA-7B >> Pythia-6.9B > OPT-6.7B
    # Gap is 5-15% on most benchmarks

The results are unambiguous: LLaMA dominates at every size. Instruction-tuned LLaMA-7B outperforms instruction-tuned Pythia-12B on most benchmarks, despite having 40% fewer parameters. The pre-training foundation — specifically the volume and quality of pre-training data — is a stronger predictor of downstream performance than the instruction-tuning data or training recipe.

The magnitude of the gap is startling. On AlpacaEval, LLaMA-7B + Tulu mix scores ~58%. Pythia-6.9B + same Tulu mix scores ~38%. That's a 20-point gap from the same instruction data and training recipe. On MMLU, the gap is ~14 points (46% vs 32%). These are massive differences that dwarf the effect of switching between instruction datasets (which typically changes scores by 5-10 points).

What's happening under the hood? LLaMA has seen more pre-training data (1T vs 300B tokens), which means it has richer internal representations. When you instruction-tune it, you're surfacing knowledge that's already there. When you instruction-tune Pythia, the knowledge simply isn't there to surface. You can teach Pythia to respond in a helpful style, but it can't answer questions about topics it never learned during pre-training.

This has a practical implication that the community learned quickly: if you're building an open instruction-following model, start with the best base model you can get. No amount of clever instruction tuning can compensate for a weak foundation. By the time this paper was published, LLaMA had become the default base model for almost every open-source instruction-tuning project.

The architecture differences between these base models are relatively minor — they all use standard Transformer decoder blocks with autoregressive training. The key differences are in pre-training data processing:

python
# Why LLaMA's pre-training matters
pretraining_comparison = {
    "LLaMA-7B": {
        "tokens": "1.0T",           # 3-5x more than competitors
        "data_sources": [
            "CommonCrawl (filtered)",    # 67% — heavily filtered web
            "C4",                         # 15% — cleaned CommonCrawl
            "GitHub",                     # 4.5% — code helps reasoning
            "Wikipedia",                  # 4.5% — high-quality facts
            "Books",                      # 4.5% — long-form text
            "ArXiv",                      # 2.5% — scientific text
            "StackExchange",              # 2% — Q&A format
        ],
        "key_choice": "Train 7B model on 1T tokens (5x Chinchilla-optimal)",
    },
    "Pythia-6.9B": {
        "tokens": "300B",           # less than 1/3 of LLaMA
        "data_sources": ["The Pile (deduplicated)"],
        "key_choice": "Prioritize reproducibility over performance",
    },
}
# LLaMA's bet: overtrain a small model on high-quality data
# This bet paid off — LLaMA-7B matches Pythia-12B

Base Model Performance Comparison

[Interactive figure: downstream task performance after instruction tuning, with the same instruction data applied to different base models.]

The scaling tax: Within a model family, scaling up parameters always helps: LLaMA-13B beats LLaMA-7B, and so on. But LLaMA's per-parameter efficiency is so high that jumping to a model several times larger from a different family often doesn't help. In instruction-following quality, LLaMA-7B (1T pre-training tokens) ≈ OPT-30B (180B tokens). Pre-training data is the great equalizer.

The paper also tests a within-family scaling experiment using LLaMA at 7B, 13B, 30B, and 65B parameters. With the same instruction data (Tulu mix), performance scales smoothly with model size. This experiment is critical because it answers the question: "If I have budget for a bigger model OR better instruction data, which should I invest in?"

The answer depends on where you are on the scaling curve. If you're at 7B → 13B, the base model upgrade is worth more. If you're already at 65B and have poor instruction data, investing in data quality has higher ROI. The crossover point is roughly around 30B parameters — below that, invest in the base model; above that, invest in instruction data and alignment.

Performance ≈ α · log(N_params) + β · Quality(D_instruct)

Where N_params is the number of parameters, D_instruct is the instruction dataset, and α and β are empirical weights. This schematic formula summarizes the paper's central finding: both the base model (through pre-training scale) and the instruction data (through quality) contribute independently to downstream performance, but the base model contribution has a log-relationship with parameters while the data quality contribution has a more linear relationship.

python
# Within-family scaling: LLaMA sizes with same data
llama_scaling = {
    "7B":  {"mmlu": 45.9, "alpaca": 58.6},
    "13B": {"mmlu": 52.3, "alpaca": 65.4},
    "30B": {"mmlu": 58.7, "alpaca": 71.8},
    "65B": {"mmlu": 63.5, "alpaca": 76.2},
}
# Each size jump gives ~5-6 MMLU points (6.4, 6.4, 4.8)
# 65B with Tulu mix is competitive with ChatGPT on some tasks
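
As a rough check on the log-linear picture, you can fit the α term to the four MMLU points above with ordinary least squares. This is a sketch under assumptions: the instruction data is held fixed, so the β · Quality term is folded into the intercept `c`, and four points are far too few for the fit to be more than indicative.

```python
import math

# Least-squares fit of MMLU = alpha * ln(N_params) + c on the four points above.
# Assumption: instruction data is fixed, so beta * Quality is absorbed into c.
sizes_in_billions = [7, 13, 30, 65]
mmlu = [45.9, 52.3, 58.7, 63.5]

x = [math.log(n * 1e9) for n in sizes_in_billions]
x_mean = sum(x) / len(x)
y_mean = sum(mmlu) / len(mmlu)

covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, mmlu))
variance = sum((xi - x_mean) ** 2 for xi in x)
alpha = covariance / variance
c = y_mean - alpha * x_mean

print(f"alpha = {alpha:.2f} MMLU points per e-fold of parameters")
print(f"gain per doubling of parameters = {alpha * math.log(2):.2f} points")
```

The per-doubling gain is the practical number: it tells you roughly what a 2x larger model from the same family buys on MMLU under this fit.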

Why does instruction-tuned LLaMA-7B often outperform instruction-tuned Pythia-12B despite having fewer parameters?

Because LLaMA uses a newer Transformer architecture with more efficient attention

Because LLaMA was pre-trained on 3-5x more tokens (1T vs 300B) with better data filtering, giving it a stronger knowledge foundation that instruction tuning builds upon

Because LLaMA's instruction tuning uses a better optimizer and learning rate schedule

Chapter 3: Quality vs Quantity

This is the chapter that made the paper famous. The question is deceptively simple: if you have a fixed compute budget for instruction tuning, should you collect more data or better data? The intuition from pre-training suggests "more data is always better." But instruction tuning is different — and the results are striking.

The paper designs a clean experiment to answer this. They take the FLAN dataset (which has millions of examples available) and train LLaMA-7B on different-sized subsets: 1K, 5K, 10K, 25K, 50K, and 100K examples. They also compare against the much smaller but higher-quality datasets like Dolly (15K) and OASST (10K).

Why is this experiment informative? Because it holds quality constant (all FLAN data is the same quality) and varies only quantity. This gives us a clean learning curve showing the marginal value of additional data. Then, comparing the FLAN scaling curve against high-quality datasets at fixed sizes tells us the value of quality — if 10K OASST beats 100K FLAN, we know quality is worth at least 10x in data efficiency.

The training details matter for reproducibility: each model is trained for exactly 2 epochs regardless of dataset size, so every model sees each of its training examples twice and compute scales linearly with dataset size: 1K takes ~15 minutes, 100K takes ~25 hours on 8x A100s. This linear scaling makes the diminishing returns finding especially stark: you pay 100x more compute for a barely perceptible improvement.
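
The linear cost model implied by those two figures can be written down directly. The sketch below takes the 15-minutes-per-1K rate from the paragraph above and applies it to each subset size in the experiment. `wall_clock_hours` is a hypothetical helper.

```python
# Linear compute model implied by the figures above:
# 15 minutes of wall-clock time per 1K examples at 2 epochs on the same hardware.
MINUTES_PER_1K = 15

def wall_clock_hours(num_examples):
    """Estimated training time in hours, assuming cost is linear in dataset size."""
    return num_examples / 1_000 * MINUTES_PER_1K / 60

subset_sizes = [1_000, 5_000, 10_000, 25_000, 50_000, 100_000]
for size in subset_sizes:
    print(f"{size:>7,} examples: {wall_clock_hours(size):6.2f} h")

cost_multiple = wall_clock_hours(100_000) / wall_clock_hours(1_000)
print(f"100K costs {cost_multiple:.0f}x the compute of 1K")
```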

The headline finding: Training on 10-15K high-quality examples (OASST, Dolly) often matches or exceeds training on 100K lower-quality examples (FLAN subsampled, Self-Instruct). Data quality, defined as response helpfulness, accuracy, and naturalness, matters more than data quantity for instruction following. This ran counter to the prevailing assumption that more instruction data was always better.

What does "quality" mean concretely? The paper identifies several dimensions:

Response Helpfulness: Does the response actually answer the user's question in a useful way? ChatGPT-style responses are more helpful than template-filled academic task outputs.

Response Length: Longer, more detailed responses teach the model to be thorough. ShareGPT responses average ~350 tokens; FLAN responses average ~15 tokens.

Instruction Naturalness: Real user queries (ShareGPT) are open-ended and varied. Academic tasks (FLAN) are formulaic: "Classify the sentiment of the following review..."

Task Diversity: A dataset that covers many different task types teaches the model to generalize. FLAN covers 1,836 task types; Dolly covers ~7 categories.

The quality-vs-quantity tradeoff isn't uniform across all evaluations. On traditional NLP benchmarks (MMLU, BBH), quantity helps: FLAN's massive task diversity gives better benchmark coverage. But on preference evaluations (AlpacaEval, where GPT-4 stands in for a human judge and picks the better of two responses), quality dominates. ShareGPT with 90K real conversations crushes FLAN with 100K academic examples on AlpacaEval.

Why? Because preference judges, whether human or GPT-4, reward the things real users care about: is the response helpful, detailed, well-structured, and natural-sounding? FLAN teaches the model to format NLI labels correctly. ShareGPT teaches the model to have a genuine conversation.

Consider a concrete example. Ask both models: "What's the best way to learn Python?" A FLAN-tuned model might respond: "Take online courses, read documentation, and practice coding." (8 words, technically correct, unhelpful.) A ShareGPT-tuned model might respond: "Great question! Here's a roadmap I'd recommend: 1. Start with Python basics: variables, loops, functions. Try Codecademy or Python.org's official tutorial. 2. Build small projects early: a calculator, a to-do list, a web scraper. The best learning is by doing. 3. Read other people's code on GitHub..." (100+ words, structured, actionable, conversational.) Both models have the same knowledge (from LLaMA pre-training). The difference is entirely in how they present it.

The "ChatGPT style" effect: ShareGPT data comes from real ChatGPT conversations. So models trained on ShareGPT learn to respond like ChatGPT — with numbered lists, bold headers, friendly tone, and thorough explanations. This is a form of knowledge distillation: the instruction data transfers ChatGPT's response style to the open model, even though the open model has different underlying knowledge.

python
# Quality vs quantity experiment
import numpy as np

# Scaling experiment: subsample FLAN to different sizes
flan_sizes = [1000, 5000, 10000, 25000, 50000, 100000]
flan_scores = {
    "MMLU":      [38.2, 41.5, 43.1, 44.8, 45.3, 45.9],   # scales with size
    "AlpacaEval":[25.1, 32.4, 38.7, 42.1, 43.5, 44.2],   # diminishing returns
}

# Compare with small but high-quality datasets
quality_datasets = {
    "OASST (10K)":   {"MMLU": 40.1, "AlpacaEval": 52.3},  # fewer examples, better responses
    "ShareGPT (90K)": {"MMLU": 43.7, "AlpacaEval": 58.6},  # real conversations win
    "Dolly (15K)":   {"MMLU": 39.8, "AlpacaEval": 47.1},  # human-written, small but good
}
# Key insight: OASST with 10K examples beats FLAN with 100K on AlpacaEval
# because human evaluators prefer natural, detailed responses
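
One way to read the two tables above together is to ask how many FLAN examples it would take to match a high-quality dataset on each benchmark. The sketch below interpolates the FLAN curve linearly in log10 of dataset size; the interpolation scheme and the helper name `flan_size_to_match` are assumptions made for illustration, not part of the paper's analysis.

```python
import math

# Numbers copied from the block above.
flan_sizes = [1_000, 5_000, 10_000, 25_000, 50_000, 100_000]
flan_scores = {
    "MMLU":       [38.2, 41.5, 43.1, 44.8, 45.3, 45.9],
    "AlpacaEval": [25.1, 32.4, 38.7, 42.1, 43.5, 44.2],
}
oasst_10k = {"MMLU": 40.1, "AlpacaEval": 52.3}

def flan_size_to_match(target, sizes, scores):
    """Estimate the FLAN size at which the curve reaches `target`.

    Returns None when the target is above every measured FLAN score,
    i.e. the curve never gets there within the tested range.
    """
    if target <= scores[0]:
        return float(sizes[0])
    for i in range(len(sizes) - 1):
        lo, hi = scores[i], scores[i + 1]
        if lo <= target <= hi and hi > lo:
            frac = (target - lo) / (hi - lo)
            log_n = math.log10(sizes[i]) + frac * (
                math.log10(sizes[i + 1]) - math.log10(sizes[i]))
            return 10 ** log_n
    return None

matches = {
    bench: flan_size_to_match(oasst_10k[bench], flan_sizes, flan_scores[bench])
    for bench in flan_scores
}
for bench, n in matches.items():
    print(bench, "never reached up to 100K" if n is None else f"~{n:,.0f} FLAN examples")
```

On MMLU a few thousand FLAN examples are enough to match OASST, while on AlpacaEval no tested FLAN size gets there, which is the asymmetry the rest of the chapter explains.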

There's a nuance here that's important. The paper doesn't find that quality always beats quantity in every setting. The relationship depends on what you're measuring:

Evaluation Type	What Wins?	Why?
NLP Benchmarks (MMLU, BBH)	Quantity helps	More task types = better coverage of benchmark skills
Human Preference (AlpacaEval)	Quality wins	Judges prefer natural, detailed, helpful responses
Safety / Toxicity (ToxiGen)	Neither dominates	Safety requires explicit safety data, not just more data
Multilingual (TydiQA)	Data source matters	Only FLAN has significant multilingual coverage

[Interactive figure: Quality vs Quantity scaling curves. A slider sets the number of FLAN training examples and the plot shows the resulting benchmark scores; horizontal dashed lines mark the scores of the smaller but higher-quality datasets.]

The practical takeaway: If you're building a chatbot and care about user satisfaction, invest in 10K excellent instruction-response pairs rather than scraping 1M mediocre ones. If you're building a benchmark champion, go big on diverse task data. Most real applications care more about the former — which is why the "quality over quantity" finding changed how the community approached instruction tuning.

Let's quantify the diminishing returns concretely. The FLAN scaling curve looks like this:

python
# FLAN data scaling — AlpacaEval win rate
sizes  = [1000,  5000,  10000, 25000, 50000,  100000]
scores = [25.1,  32.4,  38.7,  42.1,  43.5,   44.2]

# Gain between consecutive checkpoints (total points, not normalized per 10K):
#   1K → 10K:   +13.6  (huge gain from first 10K)
#   10K → 25K:  +3.4   (decent gain)
#   25K → 50K:  +1.4   (diminishing)
#   50K → 100K: +0.7   (almost flat)

# Now compare: OASST with just 10K examples scores 52.3
# That's higher than FLAN at ANY scale tested!
# Quality advantage: 52.3 - 44.2 = 8.1 points on AlpacaEval
# Even 10x more FLAN data can't close this gap
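
The gains between checkpoints can be recomputed directly from the numbers above, and normalizing them per 10K added examples makes the flattening easier to see. This is a small sketch; the helper name `marginal_gains` is illustrative.

```python
sizes  = [1_000, 5_000, 10_000, 25_000, 50_000, 100_000]
scores = [25.1, 32.4, 38.7, 42.1, 43.5, 44.2]

def marginal_gains(sizes, scores):
    """Return (start, end, total gain, gain per 10K added examples) for each step."""
    rows = []
    for (n0, s0), (n1, s1) in zip(zip(sizes, scores), zip(sizes[1:], scores[1:])):
        gain = s1 - s0
        per_10k = gain / ((n1 - n0) / 10_000)
        rows.append((n0, n1, round(gain, 2), round(per_10k, 2)))
    return rows

gain_rows = marginal_gains(sizes, scores)
for n0, n1, gain, per_10k in gain_rows:
    print(f"{n0:>7,} -> {n1:>7,}: +{gain:.1f} total, +{per_10k:.2f} per 10K")
```

Normalized this way, the value of an extra 10K examples falls by roughly two orders of magnitude between the first step and the last.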

Why does quality win on human evaluation? The mechanism is surprisingly simple. Models learn to imitate their training data. FLAN teaches the model: "When asked a question, output a short label or classification." ShareGPT/OASST teach: "When asked a question, explain thoroughly with examples, analogies, and structure." Human evaluators overwhelmingly prefer the second style, even when both answers are factually correct. The model isn't learning different knowledge — it's learning a different response format.

The paper also finds that mixing datasets helps. Their best model (Tulu) is trained on a carefully weighted mixture of FLAN, ShareGPT, and OASST — combining FLAN's task diversity with ShareGPT's naturalness and OASST's conversational quality. This mixture outperforms any single dataset alone, suggesting that different datasets contribute complementary skills.

The mixing strategy isn't just "throw everything together." The paper experiments with different mixture ratios and finds that the optimal balance depends on your target evaluation. For general-purpose assistants (the most common use case), the sweet spot is roughly 30% FLAN (breadth) + 35% ShareGPT (style) + 15% OASST (multi-turn) + 20% specialized data (CoT reasoning, code, etc.).

The authors test several mixing strategies: (1) equal weights (each dataset contributes equally by example count), (2) proportional to size (larger datasets contribute more), (3) inverse proportional (smaller, higher-quality datasets are oversampled), and (4) manually tuned weights based on dev set performance. Strategy (4) works best, but strategy (3) is a good heuristic when you don't have a dev set — oversample the highest-quality data.

python
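# Dataset mixing strategies (datasets: dict mapping name -> list of examples)
import random

def create_mixture(datasets, strategy="quality_weighted", seed=0):
    rng = random.Random(seed)
    if strategy == "equal":
        # Each dataset contributes same number of examples
        n_per = 10000
        counts = {name: n_per for name in datasets}
    elif strategy == "quality_weighted":
        # Oversample high-quality, undersample low-quality
        weights = {"sharegpt": 0.35, "oasst": 0.15,
                   "flan": 0.30, "cot": 0.20}
        counts = {name: round(80000 * w) for name, w in weights.items()}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Sample with replacement so small datasets can be oversampled
    mixture = [ex for name, n in counts.items()
               for ex in rng.choices(datasets[name], k=n)]
    rng.shuffle(mixture)
    return mixture
# Quality-weighted consistently outperforms equal mixing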
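
The first three strategies described above differ only in how dataset sizes are turned into sampling weights. The sketch below makes that explicit; `mixing_weights` is a hypothetical helper, and the sizes are the ones quoted in this chapter (with FLAN subsampled to 100K), not the paper's actual configuration.

```python
def mixing_weights(sizes, strategy):
    """Turn dataset sizes into sampling weights that sum to 1."""
    if strategy == "equal":
        raw = {name: 1.0 for name in sizes}
    elif strategy == "proportional":
        raw = {name: float(n) for name, n in sizes.items()}
    elif strategy == "inverse":
        # Smaller datasets get larger weights, i.e. they are oversampled.
        raw = {name: 1.0 / n for name, n in sizes.items()}
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# Illustrative sizes taken from this chapter.
dataset_sizes = {"flan": 100_000, "sharegpt": 90_000, "dolly": 15_000, "oasst": 10_000}

weights_by_strategy = {
    s: mixing_weights(dataset_sizes, s) for s in ("equal", "proportional", "inverse")
}
for strategy, weights in weights_by_strategy.items():
    print(strategy, {name: round(w, 3) for name, w in weights.items()})
```

Strategy (4), manually tuned weights, has no closed form: it would replace the computed dictionary with values chosen by dev-set performance.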

Corroborating evidence: The quality-over-quantity result is consistent with other work from the same period. LIMA (Zhou et al., 2023) showed that just 1,000 carefully curated examples can produce a competitive model. Orca (Microsoft, 2023) reported that training on detailed explanation traces from stronger teacher models teaches more than imitating short answers alone. The pattern shows up across different base models, different datasets, and different evaluation suites.

On AlpacaEval (GPT-4-judged, a proxy for human preference), 10K OASST examples outperform 100K FLAN examples. Why?

Because OASST is generated by GPT-4, which is a better teacher than FLAN's academic sources
Because OASST contains natural, detailed, conversational responses that match what human evaluators prefer, while FLAN's academic-style responses are formulaic and brief
Because OASST examples are longer and take more compute to train on, effectively giving more gradient updates per example

Chapter 4: Evaluation

One of the paper's most important contributions is its multi-faceted evaluation framework. Instead of relying on a single benchmark, the authors evaluate every model on a diverse suite that captures different aspects of instruction-following ability. This matters because, as they show, the ranking of models changes dramatically depending on which evaluation you use.

The evaluation suite covers four categories:

Factual knowledge: MMLU (Massive Multitask Language Understanding) tests 57 subjects from elementary math to professional law. It measures whether the model has learned and can recall factual knowledge. MMLU scores correlate strongly with pre-training data volume — bigger models with more pre-training data score higher, regardless of instruction tuning. This is a multiple-choice benchmark, so it's purely testing recall and reasoning — the model's response style doesn't matter.

MMLU is the benchmark where instruction tuning helps least — you might gain 2-5 points over a good few-shot prompt with the base model. The knowledge tested by MMLU is almost entirely learned during pre-training. Instruction tuning just teaches the model to format its answer as a multiple-choice selection. This is why base model choice dominates on MMLU.

Reasoning: BIG-Bench Hard (BBH) is a collection of 23 challenging reasoning tasks from the BIG-Bench benchmark, requiring multi-step reasoning, logical deduction, and mathematical thinking. BBH is evaluated with chain-of-thought prompting, where the model must show its reasoning step by step. This benchmark rewards instruction datasets that include reasoning traces (like Alpaca-CoT). Interestingly, this is the benchmark where the gap between instruction-tuning approaches is smallest — reasoning ability seems to depend more on model scale than on instruction data quality.

Multilinguality: TydiQA tests question answering in 11 typologically diverse languages (Arabic, Bengali, English, Finnish, Indonesian, Japanese, Kiswahili, Korean, Russian, Telugu, Thai). Most instruction datasets are English-only, so TydiQA reveals whether instruction tuning transfers cross-lingually. FLAN is the only dataset with significant multilingual content, and it shows: FLAN-tuned models perform best on TydiQA.

The multilingual results reveal an important limitation: instruction tuning on English-only data does not meaningfully improve multilingual performance. The base LLaMA model has some multilingual ability from pre-training (its data includes non-English text), but English-only instruction tuning can actually hurt multilingual performance by pushing the model toward English-only response patterns. If multilingual ability matters, you need multilingual instruction data.

Open-ended instruction following: AlpacaEval uses GPT-4 as an automatic judge to rate model responses against a reference (Davinci-003). This is the closest proxy for "does a real user find this response helpful?" AlpacaEval correlates well with human judgments and is the evaluation where dataset quality matters most. It tests 805 open-ended prompts spanning coding, creative writing, advice, analysis, and more — the kind of queries real users actually send to chatbots.

AlpacaEval has a known bias: it tends to prefer longer responses. A model that produces verbose but mediocre answers can score higher than a model that gives concise but correct answers. The paper acknowledges this limitation and supplements AlpacaEval with human evaluation to check for this failure mode.
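
A simple way to probe for this length bias is to split the judge's verdicts by whether the candidate response was longer than the reference. The records below are made-up placeholders and the tuple layout is hypothetical; this is not AlpacaEval's actual output format, only a sketch of the check.

```python
# Hypothetical judgment records: (candidate_tokens, reference_tokens, candidate_won).
# Placeholder values for illustration only, not results from the paper.
judgments = [
    (320, 90, True), (410, 120, True), (280, 150, True), (350, 100, False),
    (60, 110, False), (45, 95, False), (80, 130, True), (70, 140, False),
]

def win_rate(records):
    """Fraction of records in which the candidate was preferred."""
    if not records:
        return float("nan")
    return sum(won for _, _, won in records) / len(records)

longer  = [r for r in judgments if r[0] > r[1]]   # candidate longer than reference
shorter = [r for r in judgments if r[0] <= r[1]]  # candidate same length or shorter

overall_wr = win_rate(judgments)
longer_wr  = win_rate(longer)
shorter_wr = win_rate(shorter)

print(f"overall: {overall_wr:.2f}, when longer: {longer_wr:.2f}, when shorter: {shorter_wr:.2f}")
```

A large gap between the two conditional win rates is a hint of length bias rather than proof, since longer answers may also simply be better; that is why a human check is still needed.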

There's also a toxicity evaluation using ToxiGen, which tests whether the model generates harmful content when prompted. This evaluation is important because instruction tuning can inadvertently make models more willing to follow harmful instructions. The paper finds that most instruction datasets don't explicitly address safety — only OASST includes safety-aware training data, and OASST-tuned models show the lowest toxicity rates.

The evaluation trap: If you only evaluate on MMLU, you'd conclude that FLAN is the best dataset (highest factual recall). If you only evaluate on AlpacaEval, you'd conclude that ShareGPT is the best dataset (most natural responses). Both conclusions are incomplete. The paper's contribution is showing that no single evaluation captures instruction-following quality, and that you need a multi-faceted suite to understand model behavior.
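
The trap is easy to reproduce with the handful of scores quoted earlier in this write-up (the MMLU and AlpacaEval numbers from the quality-vs-quantity chapter, with FLAN at 100K examples). The helper name `rank_by` is just for illustration.

```python
# Scores as quoted in the quality-vs-quantity chapter of this write-up.
scores = {
    "FLAN (100K)":    {"MMLU": 45.9, "AlpacaEval": 44.2},
    "ShareGPT (90K)": {"MMLU": 43.7, "AlpacaEval": 58.6},
    "OASST (10K)":    {"MMLU": 40.1, "AlpacaEval": 52.3},
    "Dolly (15K)":    {"MMLU": 39.8, "AlpacaEval": 47.1},
}

def rank_by(benchmark):
    """Dataset names sorted from best to worst on one benchmark."""
    return sorted(scores, key=lambda name: scores[name][benchmark], reverse=True)

mmlu_ranking = rank_by("MMLU")
alpaca_ranking = rank_by("AlpacaEval")

print("By MMLU:      ", mmlu_ranking)
print("By AlpacaEval:", alpaca_ranking)
```

With these numbers FLAN is first on one list and last on the other, so a paper reporting only one of the two benchmarks would reach the opposite conclusion from a paper reporting the other.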

The authors also conduct human evaluation as a complement to automatic metrics. They recruit human annotators to compare model outputs pairwise, asking "Which response is more helpful?" Human evaluations generally agree with AlpacaEval rankings but sometimes diverge — particularly when models produce lengthy but unhelpful responses that GPT-4 rates highly but humans find frustrating.

Benchmark	What It Measures	Which Dataset Wins	Why
MMLU	Factual knowledge	FLAN	1,836 task types cover more knowledge
BBH	Reasoning	Alpaca-CoT	Chain-of-thought examples teach reasoning
TydiQA	Multilingual QA	FLAN	Only FLAN has multilingual training data
AlpacaEval	Open-ended helpfulness	ShareGPT	Real conversations teach natural responses
Toxicity	Safety	OASST	Has explicit safety-aware training data

python
# Evaluation pipeline (simplified)
def evaluate_model(model, tokenizer):
    results = {}

    # 1. MMLU: 57-subject multiple choice
    results["mmlu"] = run_mmlu(model, tokenizer, n_shot=5)
    # → accuracy across subjects, e.g., 45.3%

    # 2. BBH: chain-of-thought reasoning
    results["bbh"] = run_bbh(model, tokenizer, cot=True)
    # → exact match on 23 reasoning tasks, e.g., 38.7%

    # 3. TydiQA: multilingual question answering
    results["tydiqa"] = run_tydiqa(model, tokenizer, n_shot=1)
    # → F1 score across 11 languages, e.g., 35.2%

    # 4. AlpacaEval: GPT-4 as judge
    results["alpaca_eval"] = run_alpaca_eval(model, tokenizer)
    # → win rate vs text-davinci-003, e.g., 58.6%

    return results

[Interactive figure: Evaluation Radar, Dataset × Benchmark. Performance profile of each dataset across the four evaluation axes.] No single dataset wins everywhere, which is why the Tulu mixture was created.

The case for mixture models: Since no single dataset excels on all evaluations, the optimal strategy is a weighted mixture. Tulu combines FLAN (for task coverage), ShareGPT (for naturalness), and OASST (for conversational quality) — and outperforms any single dataset on the aggregate evaluation suite. This mixture approach became standard practice: later models like Orca, OpenHermes, and Tulu 2 all use carefully curated dataset mixtures.

There's a deeper lesson here about evaluation design in AI research. Before this paper, most instruction-tuning papers evaluated on 1-2 benchmarks and claimed superiority. The Camels paper showed that this is fundamentally misleading — a model's ranking depends entirely on which benchmark you choose. This drove the field toward multi-benchmark evaluation suites (like the Open LLM Leaderboard) and composite scores that aggregate across diverse evaluations.

The human evaluation component deserves special attention. The authors recruit evaluators from Mechanical Turk and from graduate students, finding interesting disagreements. On most model comparisons, the two groups agree. But on some edge cases — particularly where one model gives a short, correct answer and another gives a long, partially incorrect answer — the groups diverge. Graduate students prefer the concise correct answer; Mechanical Turk workers prefer the longer response. This highlights a general challenge: "helpfulness" is subjective and audience-dependent.

The paper also contributes to the growing literature on evaluation contamination. Some instruction datasets (particularly FLAN) contain tasks that overlap with benchmark test sets. The authors check for this and find minimal contamination, but flag it as a concern for future work. This issue would become much more prominent in 2024, when researchers found that many open datasets were contaminated with benchmark answers.

The paper also highlighted an important tension in AI evaluation: automatic metrics vs human judgment. MMLU is automatic, cheap, and reproducible — but it doesn't measure what users care about. AlpacaEval uses GPT-4 as a proxy for human judgment — cheaper than real humans but subject to the judge model's biases (GPT-4 tends to prefer longer responses). Real human evaluation is the gold standard but expensive and slow. The paper uses all three, showing where they agree and disagree:

python
# Evaluation cost and reliability tradeoffs
eval_methods = {
    "MMLU (auto)": {
        "cost": "$0 (runs locally)",
        "time": "~30 min",
        "measures": "factual recall",
        "bias": "none (multiple choice)",
    },
    "AlpacaEval (GPT-4)": {
        "cost": "~$50 per model",
        "time": "~2 hours",
        "measures": "helpfulness, coherence",
        "bias": "prefers verbose responses",
    },
    "Human eval": {
        "cost": "$500-2000 per model",
        "time": "1-2 weeks",
        "measures": "actual user satisfaction",
        "bias": "annotators agree only ~75% of the time",
    },
}
# Paper uses all three to triangulate true quality

Why do different instruction datasets produce models that rank differently on different benchmarks?

Because each dataset teaches different skills (FLAN teaches factual breadth, ShareGPT teaches conversational naturalness, Alpaca-CoT teaches reasoning) and different benchmarks test different skills
Because the benchmarks are poorly designed and produce random rankings
Because the models are trained with different random seeds, causing variance

Chapter 5: Key Findings

Let's consolidate the paper's findings into actionable lessons. These results were derived from dozens of controlled experiments, each isolating a single variable. They represent the most rigorous understanding of open instruction tuning available at the time of publication.

Finding 1: Base model quality is the strongest predictor of downstream performance.

No amount of instruction tuning can compensate for a weak base model. LLaMA-7B instruction-tuned on any dataset consistently outperforms OPT-13B and Pythia-12B instruction-tuned on the same or better data. The pre-training data volume and quality set a ceiling that instruction tuning cannot exceed. Practically: always start with the best available base model.

The magnitude of this effect is remarkable. Switching from OPT-6.7B to LLaMA-7B (same instruction data, same training) improves AlpacaEval by 25+ points. Switching from the worst instruction dataset to the best (same base model) improves it by ~15 points. The base model effect is therefore more than 1.5x as large as the dataset effect. This means that if you're choosing between "better base model + mediocre data" vs "worse base model + excellent data," choose the better base model every time.

Finding 2: Data quality matters more than data quantity for instruction following.

10K high-quality examples (OASST, Dolly) often match or exceed 100K lower-quality examples (FLAN, Self-Instruct) on human preference evaluations. The definition of "quality" here is nuanced: it means responses that are helpful, detailed, natural-sounding, and accurate, not just technically correct. This finding drove the community toward smaller, curated datasets rather than massive synthetic ones.

Think of it this way: if you're teaching someone to write good emails, showing them 100 mediocre emails teaches them to write mediocre emails. Showing them 10 excellent emails teaches them to write excellent emails. The model imitates its training data — so the quality of what you show it is the quality you get back.

Finding 3: Different datasets teach different skills.

No single dataset dominates all benchmarks. FLAN excels at factual tasks; ShareGPT excels at open-ended conversation; Alpaca-CoT excels at reasoning; OASST excels at safety. The optimal strategy is a weighted mixture that combines complementary strengths. This is perhaps the most practically useful finding — it tells you that building a good instruction-tuned model isn't about finding the "one perfect dataset." It's about combining datasets that are strong in different areas, like assembling a team with complementary skills.

Finding 4: Response style matters as much as response content.

Models fine-tuned on ShareGPT produce longer, more structured responses (with headers, bullet points, step-by-step explanations) compared to FLAN-tuned models. Human evaluators strongly prefer this style, even when the factual content is equivalent. This suggests that instruction tuning primarily teaches how to respond rather than what to know — the knowledge comes from pre-training.

The response style finding has a surprising corollary: you can predict which dataset a model was trained on just by looking at its response style. FLAN-tuned models give terse, classification-style responses. ShareGPT-tuned models use numbered lists and friendly language. Alpaca-tuned models tend toward academic-sounding explanations. OASST-tuned models are conversational and ask clarifying questions. The instruction data's "personality" transfers directly to the model.

The style hypothesis: Instruction tuning is primarily a style transfer operation, not a knowledge injection operation. The base model already has the knowledge from pre-training. Instruction tuning teaches the model how to format that knowledge into helpful, well-structured responses. This explains why a small amount of high-quality data suffices — you don't need millions of examples to teach style, but you need every example to demonstrate good style.

This hypothesis has been tested in follow-up work. If instruction tuning is mostly style transfer, then the model shouldn't learn new facts from instruction data — and indeed, studies show that factual accuracy on knowledge benchmarks (like MMLU) barely changes with instruction tuning. What does change dramatically is the model's ability to present its knowledge in a user-friendly format. A base LLaMA-7B "knows" the capital of France but can't tell you unless you prompt it just right. Instruction-tuned LLaMA-7B tells you "Paris" when asked directly, because it's learned the instruction-following interface.

Finding 5: Scaling instruction data shows diminishing returns.

Going from 1K to 10K FLAN examples produces a large jump in performance. Going from 10K to 100K produces a modest improvement. Extrapolating beyond the tested range, going from 100K to 1M would be expected to add very little. The learning curve has a steep initial slope that quickly flattens, roughly logarithmic in the number of examples. This means that even with unlimited data, there's a practical ceiling to what instruction tuning can achieve.

The practical implication: if you're budgeting for instruction tuning, invest in the first 50K high-quality examples. Beyond that, the marginal value drops below the cost of data collection and compute. Your budget is better spent on preference optimization (DPO/RLHF) or on improving the base model.

python
# Diminishing returns model
import numpy as np

def instruction_tuning_gain(n_examples, quality_factor=1.0):
    """
    Approximate performance gain from instruction tuning.
    n_examples: number of training examples
    quality_factor: 0-1, quality of instruction data

    Returns: approximate benchmark improvement (%)
    """
    # Log scaling: most learning happens in first 10K examples
    base_gain = 10 * np.log10(n_examples / 100 + 1)
    # Quality multiplier: 10K high-quality ≈ 100K low-quality
    return base_gain * quality_factor

# Example: quality_factor=1.0 (ShareGPT) vs 0.6 (Self-Instruct)
# instruction_tuning_gain(10000, 1.0) ≈ 20.0
# instruction_tuning_gain(100000, 0.6) ≈ 18.0
# 10K excellent examples ≈ 100K mediocre examples
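
The toy model above can also be inverted to ask the question the other way around: how many lower-quality examples would it take to match 10K high-quality ones? The inverse function `examples_needed` is an illustrative addition, and the result only reflects the toy formula, not a measurement from the paper.

```python
import numpy as np

def instruction_tuning_gain(n_examples, quality_factor=1.0):
    """Toy log-scaling model from the block above."""
    return 10 * np.log10(n_examples / 100 + 1) * quality_factor

def examples_needed(target_gain, quality_factor):
    """Invert the toy model: examples required to reach `target_gain` at this quality."""
    return 100 * (10 ** (target_gain / (10 * quality_factor)) - 1)

# Gain from 10K examples at quality 1.0, then the size needed at quality 0.6.
target = instruction_tuning_gain(10_000, quality_factor=1.0)
n_low_quality = examples_needed(target, quality_factor=0.6)

print(f"target gain: {target:.1f}")
print(f"examples needed at quality 0.6: {n_low_quality:,.0f}")
```

Because quality multiplies a logarithm, a modest drop in the quality factor has to be paid for with a multiplicative increase in data, here on the order of 200K examples to match 10K.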

Finding 6: The gap between open and proprietary models is closing, but not closed.

The best Tulu models (LLaMA-65B with the optimal dataset mixture) are competitive with ChatGPT on some benchmarks but still trail on others, particularly complex reasoning and multilingual tasks. The gap is larger on hard tasks and smaller on simple tasks, suggesting that scaling the base model (not just the instruction data) is needed to fully close the gap.

Let's quantify the gap:

Benchmark	Tulu-65B	ChatGPT	Gap
MMLU	63.5	70.0	-6.5
BBH	55.2	68.1	-12.9
AlpacaEval	76.2	89.4	-13.2
TydiQA	48.7	62.3	-13.6

The gap is smallest on MMLU (6.5 points) and roughly twice as large on BBH, AlpacaEval, and TydiQA (about 13 points each), with multilingual QA trailing the most. This suggests that the main advantage of proprietary models comes from (1) larger pre-training scale (GPT-3.5 is estimated at 175B parameters with more training data), (2) RLHF alignment (which Tulu doesn't use), and (3) better multilingual pre-training data. Subsequent work (Tulu 2, Zephyr) would address item (2) with DPO, closing the gap further.
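
The gaps in the table can be recomputed and ranked in a few lines. The sketch below also adds a relative view (Tulu-65B's score as a percentage of ChatGPT's), which is an extra framing not used in the text above.

```python
# Scores from the table above.
tulu_65b = {"MMLU": 63.5, "BBH": 55.2, "AlpacaEval": 76.2, "TydiQA": 48.7}
chatgpt  = {"MMLU": 70.0, "BBH": 68.1, "AlpacaEval": 89.4, "TydiQA": 62.3}

# Absolute gap in points (negative means Tulu-65B trails ChatGPT).
gaps = {b: round(tulu_65b[b] - chatgpt[b], 1) for b in tulu_65b}

# Relative view: Tulu-65B's score as a percentage of ChatGPT's.
relative = {b: round(100 * tulu_65b[b] / chatgpt[b], 1) for b in tulu_65b}

# Benchmarks ordered from largest deficit to smallest.
by_gap = sorted(gaps, key=gaps.get)

for b in by_gap:
    print(f"{b:<11} gap {gaps[b]:+.1f}  ({relative[b]:.1f}% of ChatGPT)")
```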

The open-source trajectory: In the six months after this paper, the gap narrowed rapidly. Mistral-7B (stronger base model) + DPO alignment closed much of the remaining gap. By early 2024, open models like Llama 2-70B Chat and Mixtral-8x7B-Instruct were competitive with GPT-3.5 on most benchmarks. The Camels paper correctly identified the path: better base models + quality data + preference optimization.

[Interactive figure: Key Findings Dashboard. Bar charts for each finding, comparing controlled experiments that isolate the relevant variable.]

Validation against human judgments: The paper's findings are validated by human evaluations, not just automatic metrics. Human annotators rank model pairs and the rankings correlate with AlpacaEval but sometimes diverge from MMLU. This gives confidence that the quality-over-quantity finding reflects genuine usefulness, not benchmark gaming.

Let's build a mental model of how these findings connect. Think of instruction tuning as stacking two skills: knowledge (knowing facts, reasoning, multilingual ability) and interface (formatting that knowledge as a helpful response). The base model determines the knowledge ceiling. The instruction data determines the interface quality. More instruction data slightly improves knowledge coverage, but primarily improves the interface.

Model Quality = Knowledge(base model) + Interface(instruction data)

Where Knowledge scales with pre-training compute (log-relationship with parameters and tokens), and Interface scales with instruction data quality (fast saturation — 10K good examples ~ 100K mediocre examples). This decomposition explains all six findings:

Finding	Explanation via Knowledge + Interface
Base model matters most	Knowledge ceiling is the dominant term
Quality > quantity	Interface saturates quickly with good examples
Different datasets teach different skills	Datasets differ in both knowledge coverage and interface style
Style matters as much as content	Interface = style; content = knowledge from pre-training
Diminishing returns	Interface learning curve is steep then flat
Gap to proprietary models	Knowledge gap (more pre-training) + Interface gap (RLHF)

python
# Mental model: instruction tuning decomposes into Knowledge + Interface
import numpy as np

def model_quality(base_tokens_T, instruct_examples, data_quality):
    """
    base_tokens_T: pre-training tokens in trillions (e.g., 1.0 for LLaMA)
    instruct_examples: number of instruction examples (e.g., 50000)
    data_quality: quality score 0-1 (e.g., 0.9 for ShareGPT)
    """
    # Knowledge: log-scaled with pre-training
    knowledge = 25 * np.log2(base_tokens_T * 10 + 1)

    # Interface: saturating with instruction data
    interface = 20 * data_quality * (1 - np.exp(-instruct_examples / 15000))

    return knowledge + interface

# LLaMA-7B + 10K ShareGPT (quality=0.9)
# model_quality(1.0, 10000, 0.9) ≈ 25*3.46 + 20*0.9*0.49 ≈ 95.2

# OPT-6.7B + 100K FLAN (quality=0.5)
# model_quality(0.18, 100000, 0.5) ≈ 25*1.49 + 20*0.5*0.999 ≈ 47.1

# Even 10x more data can't overcome the base model gap!
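
The saturation built into the interface term can be tabulated on its own. The sketch below isolates the `1 - exp(-n / 15000)` factor from the toy model above; the helper name `interface_fraction` is illustrative, and the 15,000 scale is the toy model's constant rather than a fitted value.

```python
import numpy as np

def interface_fraction(instruct_examples, scale=15_000):
    """Share of the interface ceiling reached, per the toy model above."""
    return 1 - np.exp(-instruct_examples / scale)

saturation = {
    n: float(interface_fraction(n)) for n in (1_000, 10_000, 50_000, 100_000)
}
for n, frac in saturation.items():
    print(f"{n:>7,} examples: {100 * frac:5.1f}% of the interface ceiling")
```

In this model about half of the interface benefit arrives by 10K examples and nearly all of it by 50K, which is the same plateau the diminishing-returns finding describes.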

The paper argues that instruction tuning is primarily a "style transfer" operation. What does this mean?

It means instruction tuning changes the visual style of the model's outputs (fonts, formatting, etc.)
It means instruction tuning transfers knowledge from the instruction data to the model weights
It means the base model already has knowledge from pre-training, and instruction tuning mainly teaches HOW to present that knowledge in helpful, well-structured responses, which explains why a small amount of high-quality data suffices

Chapter 6: Tuning Explorer

Let's put all the paper's findings together into one interactive simulation. This explorer lets you design your own instruction-tuning experiment: choose a base model, pick a dataset (or mix datasets), set the data size, and see predicted performance across all four benchmarks.

The predictions are based on the paper's experimental results, interpolated to cover combinations the paper didn't test directly. Use this to build intuition about the tradeoffs in instruction tuning — which choices matter most, and where the diminishing returns kick in.

Think of this as a "lab" where you can run virtual experiments that would cost thousands of dollars in GPU time. Each combination represents a real experimental condition: a base model initialized with specific pre-trained weights, fine-tuned on a specific instruction dataset for 2 epochs, and evaluated on a standardized benchmark suite. The radar chart shows the predicted performance profile across all four evaluation dimensions.

Try these experiments: (1) Fix the dataset and change the base model — watch how LLaMA dominates. (2) Fix the base model to LLaMA-7B and toggle between datasets — see how different datasets excel on different benchmarks. (3) Pick ShareGPT and drag the data size slider — notice how performance plateaus after ~50K examples. (4) Create a mixture and compare to single datasets.

[Interactive figure: Instruction Tuning Experiment Designer. Controls for base model, dataset, and data size; a radar chart shows the predicted benchmark performance for the selected combination.]

The best aggregate score you can achieve with this explorer is LLaMA-7B + Tulu Mix at 100K examples. But notice that going from 50K to 100K barely changes the score — confirming the diminishing returns finding. And notice that the single biggest improvement comes from switching the base model from OPT to LLaMA, not from any dataset change.

Notice the patterns as you explore:

Experiment	What You'll See
Change base model only	LLaMA consistently 5-15% ahead of Pythia/OPT at same size
Change dataset only	Rankings differ by benchmark — no single dataset wins all
Increase data size	Sharp gains up to 10K, then diminishing returns
Use Tulu Mix	Best aggregate performance — mixture beats any single source

The explorer also shows something subtle: the interaction between base model and dataset. Some datasets pair better with some models. For example, reasoning-focused datasets (Alpaca-CoT) show a larger gap between LLaMA and Pythia than factual datasets (FLAN), suggesting that the base model's capacity is more important for reasoning than for factual recall.

Here's a concrete thought experiment to build intuition. Imagine you have a $1,000 budget. How should you allocate it?

Strategy	What You Get	Expected Quality
Option A	$0 data (use free FLAN) + $1000 compute → LLaMA-13B + FLAN	Good benchmarks, poor chat
Option B	$500 curating 10K examples + $500 compute → LLaMA-7B + quality mix	Great chat, decent benchmarks
Option C	$200 scraping 100K examples + $800 compute → LLaMA-7B + noisy data	Mediocre everything

The paper's results suggest Option B is almost always the right choice for real applications. The 10K curated examples produce better user-facing quality than 100K scraped examples, and because training on 10K examples needs far less compute than training on 100K, the leftover compute budget can go toward a larger base model (or more training runs to tune hyperparameters).

python
# The Tulu mixture recipe that won
# Weighting based on paper experiments
tulu_mix = {
    "flan_v2": {
        "weight": 0.30,       # broadest task coverage
        "samples": 25000,     # subsampled for efficiency
        "strengths": ["MMLU", "TydiQA"],
    },
    "sharegpt": {
        "weight": 0.35,       # best chat-style responses
        "samples": 30000,     # real user conversations
        "strengths": ["AlpacaEval", "human_pref"],
    },
    "oasst1": {
        "weight": 0.15,       # multi-turn + safety
        "samples": 10000,     # crowd-ranked quality
        "strengths": ["safety", "multi_turn"],
    },
    "alpaca_cot": {
        "weight": 0.15,       # reasoning chains
        "samples": 10000,     # GPT-3.5 + CoT
        "strengths": ["BBH", "reasoning"],
    },
    "dolly": {
        "weight": 0.05,       # human-written supplement
        "samples": 5000,      # highest per-example quality
        "strengths": ["accuracy"],
    },
}
# Total: ~80K examples from 5 complementary sources
# This mixture became the template for Tulu 2, OpenHermes, etc.
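
A quick sanity check on the dictionary above: the stated weights sum to 1, but they are not identical to the proportions implied by the sample counts. The sketch below compares the two, assuming that `weight` is meant as a sampling fraction (the source does not say so explicitly). The names `weights`, `samples`, and `counts_from_weights` are introduced here for illustration.

```python
# Values copied from the tulu_mix dictionary above.
weights = {"flan_v2": 0.30, "sharegpt": 0.35, "oasst1": 0.15,
           "alpaca_cot": 0.15, "dolly": 0.05}
samples = {"flan_v2": 25000, "sharegpt": 30000, "oasst1": 10000,
           "alpaca_cot": 10000, "dolly": 5000}

total = sum(samples.values())

# Fraction of the mixture each source actually contributes by example count.
implied = {name: count / total for name, count in samples.items()}


def counts_from_weights(weights, budget):
    """Split an example budget across sources in proportion to their weights."""
    return {name: round(w * budget) for name, w in weights.items()}


for name in weights:
    print(f"{name:11s} weight={weights[name]:.2f} implied={implied[name]:.4f}")

print("total examples:", total)
print("counts if weights were exact:", counts_from_weights(weights, total))
```

If the weights were applied exactly to an 80K budget, ShareGPT would contribute 28,000 examples rather than 30,000, so the listed counts are best read as rounded targets rather than a strict function of the weights.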

The recipe that emerged: From this paper and subsequent work, the open-source community converged on a standard recipe: (1) Start with the best available base model (LLaMA at the time, later Llama 2, Mistral). (2) Fine-tune on a mixture of high-quality datasets (~50-100K examples total). (3) Add chain-of-thought examples for reasoning. (4) Optionally, do a second stage with preference optimization (DPO/PPO). This recipe, first articulated here, became the foundation for Tulu 2, OpenHermes, and dozens of other open models.

Based on the paper's findings, if you had to choose between (A) LLaMA-7B trained on 100K FLAN examples, or (B) LLaMA-7B trained on a 50K mixture of FLAN+ShareGPT+OASST, which would perform better on aggregate evaluation?

(A): more data is always better
(B): the mixture combines complementary strengths from different datasets, and quality/diversity matters more than raw quantity
They would perform roughly the same since both use LLaMA-7B as the base

Chapter 7: Connections

The Camels paper was published at NeurIPS 2023, and its findings shaped the trajectory of open instruction tuning. Let's trace its influence and connections to both predecessors and successors.

Predecessors:

The paper builds directly on the instruction tuning paradigm introduced by FLAN (Wei et al., 2022) and the self-instruct method (Wang et al., 2022). FLAN showed that fine-tuning on many tasks phrased as instructions improves zero-shot generalization. Self-Instruct showed that LLMs can generate their own training data. Stanford Alpaca combined these ideas: use Self-Instruct with GPT-3.5 to generate data, then fine-tune LLaMA. The Camels paper asks: of all these approaches, which actually works best?

The connection to InstructGPT (Ouyang et al., 2022) is also important. InstructGPT showed that instruction tuning + RLHF produces dramatic improvements in human preference, but it was proprietary. The Camels paper can be seen as the open-source community's attempt to replicate the SFT portion of InstructGPT's recipe, systematically exploring which open data best substitutes for OpenAI's proprietary instruction data. (The RLHF portion would come later with Tulu 2's DPO.)

The paper also connects to the scaling laws literature (Kaplan et al., Hoffmann et al./Chinchilla). Pre-training follows scaling laws: loss falls as a smooth power law in compute, parameters, and data, so each doubling of data keeps buying a roughly constant relative reduction in reducible loss. Instruction tuning also improves with more data, but with a different shape: the gains show much steeper diminishing returns, saturating around 50-100K examples.

The scaling laws connection is worth elaborating. Chinchilla (Hoffmann et al., 2022) showed that for compute-optimal pre-training, training tokens and parameters should be scaled in roughly equal proportion. The Camels results suggest that instruction tuning scales in a fundamentally different way. In pre-training, doubling the data has kept helping at the scales tested so far. In instruction tuning, doubling the data helps early but hits diminishing returns quickly. A plausible explanation is what is being learned: pre-training learns knowledge (which benefits from more data), while instruction tuning learns an interface (which saturates quickly).
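
The contrast between the two regimes can be pictured with two toy curves. This is a sketch under stated assumptions, not a fit to any data: the power-law exponent (0.3), the saturation scale `n0` (10,000 examples), and both function forms are chosen here purely for illustration.

```python
import math


def power_law_loss(n, scale=1.0, exponent=0.3):
    """Toy pre-training curve: reducible loss falls as a power of data size."""
    return scale * n ** (-exponent)


def saturating_score(n, ceiling=1.0, n0=10_000):
    """Toy instruction-tuning curve: score approaches a fixed ceiling."""
    return ceiling * (1.0 - math.exp(-n / n0))


def doubling_gain_power_law(n):
    """Fraction of reducible loss removed by going from n to 2n examples."""
    return 1.0 - power_law_loss(2 * n) / power_law_loss(n)


def doubling_gain_saturating(n):
    """Absolute score gained by going from n to 2n examples."""
    return saturating_score(2 * n) - saturating_score(n)


for n in (1_000, 10_000, 50_000, 100_000):
    print(n,
          round(doubling_gain_power_law(n), 4),
          round(doubling_gain_saturating(n), 4))
```

Under the power law, doubling always removes the same fraction of the remaining loss, whatever the starting size. Under the saturating curve, the gain from doubling collapses once `n` is several times `n0`. The two columns measure different quantities (relative loss reduction versus absolute score gain), so only their shapes are comparable.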

Successors:

The Tulu project continued with Tulu 2 (Ivison et al., 2023), which extended the analysis to Llama 2, added DPO preference optimization as a second training stage, and achieved state-of-the-art open-model performance. Tulu 2 validated the quality-over-quantity finding with more data and added the insight that DPO adds a complementary "alignment" dimension on top of instruction tuning.

The broader community arrived at the quality-over-quantity conclusion around the same time. LIMA (Zhou et al., 2023), released in parallel, pushed it to the extreme: just 1,000 carefully curated examples (selected by the authors themselves) produced a model whose responses were judged equivalent or preferable to GPT-4's in 43% of cases in the authors' human study. LIMA's slogan, "Less Is More for Alignment," points in the same direction as the Camels finding. LIMA's authors manually selected each of the 1,000 examples from diverse sources (StackExchange, wikiHow, Reddit, and expert-written examples), ensuring every single one was an exemplar of a helpful, well-structured response. This curation was labor-intensive, but the resulting model was preferred over models trained on far larger auto-generated datasets such as Alpaca's 52K examples.

The Orca papers (Microsoft, 2023) took a different approach to the quality question. Instead of curating real data, they generated synthetic data using GPT-4 — but with a twist: they asked GPT-4 to explain its reasoning step-by-step. These "reasoning traces" proved far more valuable as training data than simple input-output pairs. A single Orca-style example with detailed reasoning is worth many simple examples, because it teaches the model how to think, not just what to say.

Paper	Relationship	Key Extension
FLAN (2022)	Predecessor	Introduced instruction tuning at scale
Self-Instruct (2022)	Predecessor	LLM-generated training data pipeline
Alpaca (2023)	Predecessor	Combined Self-Instruct + LLaMA fine-tuning
LIMA (2023)	Parallel work	1K examples can be enough — quality is paramount
Tulu 2 (2023)	Direct sequel	Added DPO, scaled to Llama 2, new benchmarks
Orca (2023)	Parallel work	High-quality synthetic data via GPT-4 reasoning traces
OpenHermes (2023)	Influenced by	Curated dataset mixture, community-driven

The lasting impact: Before this paper, the open-source instruction tuning community was in a "more data is better" mindset — every project tried to collect the largest possible instruction dataset. After this paper, the mindset shifted to curation: what's the smallest, highest-quality dataset that maximizes downstream performance? This shift is visible in every major open model release after mid-2023.

The paper also has methodological impact. Its controlled-experiment framework — vary one factor at a time, evaluate on multiple benchmarks — became the standard for instruction-tuning research. Before Camels, papers would introduce a new dataset and show it works on one benchmark with one model. After Camels, reviewers expected ablations across models, datasets, and evaluation suites.

The open-source nature of the project amplified its impact. All code, all model weights, all evaluation scripts, and all intermediate results are publicly available on the AI2 GitHub. This means any researcher can reproduce the experiments, extend them with new datasets or models, or build on the findings for their own work. The Tulu model weights were downloaded thousands of times in the first month alone, and the evaluation framework was adopted by multiple independent projects.

Finally, the paper's timing was perfect. It was published just as the open-source LLM community was transitioning from "can we make any instruction-following model at all?" to "how do we make the best one?" The Camels paper provided the roadmap at exactly the moment it was needed, making it one of the most influential papers of the open instruction-tuning era.

Looking forward, the quality-over-quantity insight plausibly extends beyond instruction tuning to other alignment techniques. RLHF with high-quality human preferences (fewer but more expert annotators) tends to outperform RLHF with massive but noisy crowdsourced preferences. DPO with carefully chosen preference pairs tends to outperform DPO with indiscriminately generated pairs. The pattern is consistent: in the fine-tuning regime, data quality matters more than data quantity.

The paper's practical impact can be summarized in a recipe that the community adopted:

Step 1: Base Model

Start with the best available open base model (at the time: LLaMA; now: Llama 3, Mistral, Qwen)

↓

Step 2: Data Mixture

Curate a mixture of 50-100K high-quality instruction examples from diverse sources

↓

Step 3: SFT

Supervised fine-tuning for 2-3 epochs with standard hyperparameters

↓

Step 4: Preference Optimization

(Added by Tulu 2) DPO or PPO with preference data for final alignment

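python
# The Tulu recipe, now standard for open instruction-tuned models.
# The four helpers below are placeholder stubs (not a real library API);
# they only record each stage so the sketch runs end to end.
def load_model(name):
    return {"base": name, "stages": []}

def weighted_mix(sources):
    return dict(sources)  # dataset name -> sampling weight

def sft_train(model, data, epochs, lr):
    return {**model, "stages": model["stages"] + ["sft"]}

def dpo_train(model, preference_data, beta):
    return {**model, "stages": model["stages"] + ["dpo"]}

# Step 1: Base model selection (biggest impact on ceiling)
base = load_model("meta-llama/Llama-2-7b-hf")

# Step 2: Data mixture (quality > quantity)
data = weighted_mix([
    ("flan_v2", 0.3),      # task diversity + benchmarks
    ("sharegpt", 0.35),    # natural conversation style
    ("oasst", 0.15),       # multi-turn + safety
    ("cot_data", 0.1),     # reasoning chains
    ("code_data", 0.1),    # coding ability
])  # Total: ~75K examples

# Step 3: SFT
sft_model = sft_train(base, data, epochs=2, lr=2e-5)

# Step 4: DPO (Tulu 2 addition)
preference_data = []  # placeholder for (prompt, chosen, rejected) triples
final_model = dpo_train(sft_model, preference_data, beta=0.1)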
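
The `beta=0.1` argument in Step 4 is easier to interpret next to the DPO objective itself. Below is a minimal sketch of the per-pair DPO loss as defined by Rafailov et al. (2023), using only the standard library. The function name, argument names, and the log-probability values in the usage lines are illustrative assumptions. A real trainer would compute these log-probabilities from the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference,
    # scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Invented log-probabilities, for illustration only.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy equals reference
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)  # policy favors the chosen reply
print(at_init, improved)
```

When the policy equals the reference, the margin is zero and the loss is log 2. The loss drops as the policy raises the chosen response's probability relative to the rejected one. A larger `beta` makes the same shift in log-probabilities count for more, and a smaller `beta` lets the policy drift further from the reference before the loss flattens out.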

From Tulu to Tulu 3: The lineage continues. Tulu 2 (late 2023) added DPO and moved to Llama 2. Tulu 3 (2024) incorporated synthetic data generation and reinforcement learning with verifiable rewards (RLVR) on top of SFT and DPO. Each iteration builds on the Camels finding: start with the best base, curate the data carefully, and add alignment layers. The original insight, quality over quantity, remains the foundation.

Let's look at how the open-model landscape evolved in the year after this paper:

Date	Model	Innovation
Jun 2023	Orca (Microsoft)	GPT-4 reasoning traces as training data — quality-first
Jul 2023	Llama 2 + Chat	Meta's RLHF pipeline, open-weight alignment
Oct 2023	Zephyr (HF)	SFT on UltraChat, DPO on UltraFeedback: distilled alignment
Oct 2023	OpenHermes 2.5	Community-curated dataset mixture, Mistral base
Nov 2023	Tulu 2	Paper's own sequel — DPO added, Llama 2 base
Dec 2023	Mistral-Instruct	Better base model + quality instruction data
Feb 2024	Gemma-Instruct	Google's entry — quality data, modern base

Each of these models fits the template laid out in the Camels paper, whether or not it drew on the paper directly: (1) start with the best available base model, (2) curate high-quality instruction data (emphasizing quality over quantity), (3) train with standard SFT, (4) optionally add preference optimization. The template works. It's become the de facto recipe for open instruction-tuned models.

Perhaps the most lasting contribution is methodological. Before this paper, instruction-tuning papers were hard to compare because everyone used different setups. After this paper, the community moved toward a common evaluation pattern: train on your data, evaluate on MMLU + BBH + AlpacaEval + at least one other metric. The Open LLM Leaderboard, launched in 2023, reflects the same multi-benchmark philosophy. It became the standard for comparing open models, and its design rests on the point the Camels paper demonstrated: single-benchmark evaluation is misleading.
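
A small sketch shows why single-benchmark evaluation can mislead. The scores below are made-up numbers for illustration only (they are not results from the paper), the dataset names are placeholders, and the unweighted mean is an assumed aggregate rather than the paper's exact scoring scheme.

```python
from statistics import mean

# Made-up scores for illustration only; NOT results from the paper.
scores = {
    "dataset_x": {"MMLU": 48.0, "BBH": 36.0, "AlpacaEval": 20.0},
    "dataset_y": {"MMLU": 42.0, "BBH": 34.0, "AlpacaEval": 60.0},
    "mixture":   {"MMLU": 47.0, "BBH": 37.0, "AlpacaEval": 55.0},
}


def winner_per_benchmark(scores):
    """Return the top-scoring training set for each benchmark."""
    benchmarks = next(iter(scores.values())).keys()
    return {b: max(scores, key=lambda name: scores[name][b]) for b in benchmarks}


def aggregate_ranking(scores):
    """Rank training sets by their unweighted mean score across benchmarks."""
    averages = {name: mean(vals.values()) for name, vals in scores.items()}
    return sorted(averages, key=averages.get, reverse=True)


print(winner_per_benchmark(scores))
print(aggregate_ranking(scores))
```

With these placeholder numbers, each benchmark crowns a different winner, while the mixture ranks first on the aggregate. Reporting any one column alone would give a different answer to "which dataset is best."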

The enduring lesson: In a world where compute is expensive and data is cheap to generate, the natural instinct is to generate massive instruction datasets. This paper shows that instinct is wrong. The right approach is to invest human effort in curating smaller, higher-quality datasets. This lesson has held up through multiple model generations and remains the foundation of open instruction tuning in 2024 and beyond.

How did the Camels paper change the open-source instruction tuning community's approach?

It shifted the community from a "collect the largest possible instruction dataset" mindset to a "curate the highest-quality dataset mixture" mindset, backed by controlled experiments showing quality trumps quantity
It proved that only proprietary datasets from OpenAI can produce good instruction-following models
It introduced RLHF as the standard alignment technique for open-source models