Exploring the State of Instruction Tuning on Open Resources — a systematic comparison of open-source instruction-tuning datasets, base models, and training recipes. The headline finding: data quality trumps data quantity.
It's mid-2023. ChatGPT has been out for six months, and the open-source community is scrambling to catch up. Everyone wants to build their own instruction-following model, and the recipe seems simple: take a pre-trained base model (like LLaMA), fine-tune it on instruction-response pairs, and ship it. But which instruction data should you use? Which base model? How much data do you need? Nobody knows — because nobody has done a systematic comparison.
The landscape is fragmented. Stanford Alpaca fine-tunes LLaMA-7B on 52K GPT-generated instructions. Databricks Dolly collects 15K human-written instructions from employees. FLAN aggregates 1,836 academic tasks into 15M examples. ShareGPT scrapes real ChatGPT conversations. Each group claims their approach works, but they all use different base models, different evaluation benchmarks, and different training recipes. It's impossible to know what's actually driving performance.
To see why this matters, consider a concrete scenario. You're an engineer at a startup. Your boss says: "Build me an AI assistant that can answer customer questions about our product." You know you need to instruction-tune a base model. But you're immediately faced with decisions that nobody has answered rigorously:
| Decision | Options | What You'd Like to Know |
|---|---|---|
| Base model | LLaMA-7B, Pythia-6.9B, OPT-6.7B | Which gives the best starting point? |
| Dataset | FLAN, Alpaca, ShareGPT, Dolly, or mix? | Which produces the best assistant? |
| Data size | 10K, 50K, 100K, 1M examples? | Is more always better? |
| Data source | LLM-generated vs human-written? | Does the source matter? |
Before this paper, the honest answer to all of these questions was "we don't know." After this paper, we have empirically grounded answers for each one.
The stakes are high. In mid-2023, dozens of startups and research labs were investing millions of dollars in instruction tuning, often making choices based on vibes and blog posts rather than rigorous evidence. Some teams spent months collecting massive datasets of 1M+ instruction examples, only to find that their models weren't much better than simpler approaches. Others used expensive GPT-4-generated data when cheaper alternatives would have worked equally well. This paper provides the evidence base to make these decisions rationally.
The authors at Allen AI (AI2) name their project Tulu (after a breed of camel) and systematically investigate the open instruction-tuning landscape. They train dozens of models, varying one factor at a time, and evaluate them on both automatic benchmarks and human preference judgments. The result was, at the time, the most comprehensive apples-to-apples comparison of open instruction tuning.
Here's the setup: they take the same base model (LLaMA), fine-tune it on each of 7 different instruction datasets under identical training conditions, and evaluate on the same benchmarks. Then they hold the dataset fixed and vary the base model. Then they explore mixing datasets and varying data quantity. Every experiment isolates exactly one variable.
The methodology is critical. In machine learning research, confounding variables are everywhere. If you train LLaMA on FLAN and GPT-NeoX on Alpaca, and LLaMA wins, you can't tell if it's because of the model or the data. The paper's strength is its relentless variable isolation. When comparing datasets, the base model (LLaMA-7B), training hyperparameters (lr=2e-5, epochs=2, batch=128), and evaluation suite (MMLU, BBH, TydiQA, AlpacaEval) are all held constant. When comparing base models, the dataset (Tulu mix) and all other factors are held constant. This is basic experimental design, but it's surprisingly rare in ML papers.
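The one-variable-at-a-time design described above can be sketched as a small experiment grid. This is an illustrative sketch rather than the authors' code: `BASELINE`, `one_factor_grid`, and `differing_keys` are hypothetical names, and the values come from the setup described in this article.

```python
# Hypothetical sketch of a one-factor-at-a-time experiment grid.
BASELINE = {"base_model": "llama-7b", "dataset": "tulu_mix",
            "lr": 2e-5, "epochs": 2, "batch_size": 128}

def one_factor_grid(baseline, factor, values):
    """Return one config per value, changing only `factor`."""
    if factor not in baseline:
        raise KeyError(f"unknown factor: {factor}")
    return [{**baseline, factor: v} for v in values]

def differing_keys(a, b):
    """Keys on which two configs disagree."""
    return {k for k in a if a[k] != b[k]}

# Dataset comparison: only the dataset changes.
dataset_runs = one_factor_grid(BASELINE, "dataset",
                               ["flan_v2", "alpaca", "sharegpt", "dolly"])
# Base model comparison: only the base model changes.
model_runs = one_factor_grid(BASELINE, "base_model",
                             ["llama-7b", "pythia-6.9b", "opt-6.7b"])
```

Checking `differing_keys` against the baseline is a cheap way to confirm that a run differs in exactly one factor before spending any compute on it.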
Let's walk through the training setup concretely. Each experiment uses the same hardware (8x A100 80GB GPUs), the same training framework (PyTorch + Hugging Face Transformers), the same optimizer (AdamW), and the same format template for instructions:
```python
# Instruction format template used across all experiments
template = """<|user|>
{instruction}
<|assistant|>
{response}"""

# Example formatted training instance:
# <|user|>
# What are three benefits of regular exercise?
# <|assistant|>
# Regular exercise provides numerous benefits:
# 1. Cardiovascular health: ...
# 2. Mental well-being: ...
# 3. Weight management: ...

# Loss is computed ONLY on assistant tokens
# This teaches the model to generate responses, not to parrot instructions
```
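Computing loss only on assistant tokens is usually done by masking the prompt positions in the label sequence. The sketch below is a minimal illustration, assuming the common convention of a label value of -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`); the `build_labels` helper and the toy token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy token ids standing in for a real tokenizer's output (hypothetical values).
prompt_ids = [1, 15, 27, 42]   # "<|user|> ... <|assistant|>"
response_ids = [88, 93, 2]     # assistant reply plus end-of-sequence
input_ids, labels = build_labels(prompt_ids, response_ids)
```

The model still reads the full sequence as input; only the positions with real labels contribute to the loss, so gradients come from the response alone.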
[Interactive figure: a map of the open instruction-tuning resources available by mid-2023, showing each resource's data source, size, and approach.]
The numbers tell the story of fragmentation. At the time of writing, the open-source community had produced at least 7 major instruction datasets, 4 widely-used base model families, and dozens of instruction-tuned model variants — with no controlled comparison between them. Papers would fine-tune LLaMA on their new dataset, evaluate on their preferred benchmark, and claim a new state-of-the-art. But without controlling for the base model, training recipe, and evaluation suite, these comparisons were meaningless. It was like comparing cars when everyone uses different roads, different weather conditions, and different speedometers.
The paper's contributions are threefold:
Before we dive into the experiments, let's understand the instruction-tuning pipeline concretely. Here's what happens when you instruction-tune a base model:
```python
# The instruction-tuning pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# 1. Load base model (pre-trained, but can't follow instructions)
#    "meta-llama/Llama-7b" is a placeholder ID; point it at the LLaMA-7B checkpoint you have access to
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-7b")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-7b")

# 2. Format instruction data as prompt-response pairs
#    Input:  "### Instruction: Summarize the following article..."
#    Output: "The article discusses three key points..."
#    The model learns to predict the output given the input

# 3. Fine-tune with standard language modeling loss
#    But ONLY compute loss on the response tokens, not the instruction
#    (whether prompt tokens are masked, and some argument names, depend on the TRL version)
trainer = SFTTrainer(
    model=model,
    train_dataset=instruction_data,  # assumed: the formatted dataset prepared in step 2
    max_seq_length=2048,
    args=TrainingArguments(
        learning_rate=2e-5,
        num_train_epochs=2,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=32,  # effective batch = 4 * 32 = 128 on one GPU; reduce on multi-GPU
    ),
)
trainer.train()
# After ~2 epochs: base model → instruction follower
```
The key insight is that instruction tuning is cheap compared to pre-training. Pre-training LLaMA-7B takes ~82,000 GPU-hours on A100s. Instruction tuning takes ~40 GPU-hours — 2,000x less compute. This asymmetry is why the question "which instruction data should I use?" matters so much: you have budget for dozens of instruction-tuning experiments but zero budget for re-doing pre-training.
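The compute asymmetry is easy to check with the figures quoted above. The short calculation below uses only those two numbers; `runs_per_1pct` is a hypothetical quantity added for illustration, showing how many instruction-tuning runs fit in 1% of the pre-training budget.

```python
pretrain_gpu_hours = 82_000  # LLaMA-7B pre-training, as quoted above
finetune_gpu_hours = 40      # one instruction-tuning run, as quoted above

# How many times cheaper one fine-tuning run is than pre-training.
ratio = pretrain_gpu_hours / finetune_gpu_hours

# Number of fine-tuning runs that fit in 1% of the pre-training budget.
runs_per_1pct = (0.01 * pretrain_gpu_hours) / finetune_gpu_hours

print(f"pre-training / fine-tuning compute: {ratio:.0f}x")
print(f"fine-tuning runs per 1% of pre-training budget: {runs_per_1pct:.1f}")
```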
The paper evaluates seven instruction-tuning datasets that were publicly available by mid-2023. These datasets differ dramatically in size, source, format, and quality. Understanding their characteristics is essential before comparing their effects on model performance.
Let's walk through each one:
FLAN V2 is the elephant in the room. Google's collection aggregates 1,836 academic NLP tasks (sentiment analysis, NLI, QA, summarization, translation, etc.) into a unified instruction format. Each task gets multiple prompt templates, creating massive diversity. The authors subsample to ~100K examples for tractability, but the full set contains over 15 million. FLAN's strength is breadth: it covers virtually every NLP task type. Its weakness is that the tasks feel "academic" — they don't resemble the open-ended questions real users ask.
Here's a typical FLAN instruction: "Read the following passage and answer the question. Passage: 'The capital of France is Paris...' Question: 'What is the capital of France?' Answer:". The response is just "Paris." Compare this to a typical ShareGPT instruction: "What are some fun things to do in Paris?" with a 200-word response covering restaurants, museums, and neighborhoods. Both teach the model something — but the ShareGPT example teaches it to be a helpful assistant, while the FLAN example teaches it to extract information from passages.
Stanford Alpaca (CoT) consists of 52K instruction-response pairs generated by OpenAI's text-davinci-003, a GPT-3.5-series model. Stanford researchers used 175 seed instructions and asked the model to generate more via self-instruct. The "CoT" variant adds chain-of-thought reasoning to a subset. Quality is moderate: GPT-3.5 sometimes generates plausible-sounding but incorrect answers, and the instructions tend to be simple. But the data is cheap (total cost was ~$500) and covers a reasonable range of tasks.
The $500 price tag made Alpaca a sensation. It showed that you could build a passable instruction-following model for less than the cost of a nice dinner. However, the quality ceiling is real: GPT-3.5's errors propagate to the student model. If GPT-3.5 gives a wrong math answer, the student learns that wrong answer. This "error amplification" problem is a fundamental limitation of synthetic data from imperfect teachers.
ShareGPT is scraped from ShareGPT.com, where users voluntarily shared their ChatGPT conversations. This gives you ~90K real user-ChatGPT dialogues — authentic queries that people actually care about. The advantage: these conversations reflect real user needs, not academic benchmarks or synthetic prompts. The disadvantage: they can be noisy (incomplete conversations, screenshots described in text, non-English content mixed in).
ShareGPT's unique value is that it captures the distribution of real user queries. People ask ChatGPT to write code, explain concepts, brainstorm ideas, draft emails, translate text, analyze data, and dozens of other things. This distribution is very different from academic NLP tasks (which are dominated by classification and extraction) and from synthetic datasets (which tend toward generic questions). Training on ShareGPT teaches the model to handle the queries real users actually send.
Databricks Dolly contains 15K instructions written by Databricks employees. This is fully human-authored — no LLM generation involved. Employees were given categories (brainstorming, classification, QA, summarization, etc.) and wrote both instructions and responses. Quality is high because the writers are technically skilled, but the dataset is small and the instructions tend toward business use cases. The human-authored nature means responses don't have the "ChatGPT voice" that synthetic datasets inherit — they're more direct and often more accurate, though sometimes less polished in formatting.
Open Assistant (OASST1) is a crowdsourced dataset of ~10K conversation trees, where volunteers both wrote user prompts and assistant responses. Multiple responses per prompt were collected and ranked by other volunteers, creating a preference signal. The conversations are often multi-turn and cover a wide range of topics. Quality is variable — some contributions are excellent, others are superficial.
OASST is unique in having a built-in quality signal: each response was rated by 3-5 volunteers on helpfulness, truthfulness, and harmlessness. The paper uses the top-ranked responses only, effectively getting human-curated quality for free. This makes OASST punch above its weight: 10K curated examples from OASST often outperform 50K+ uncurated examples from other sources.
Self-Instruct contains ~82K instruction-response pairs generated by the self-instruct pipeline, similar to Alpaca but using GPT-3 (davinci) rather than GPT-3.5. Instructions are generated from a seed set of 175 examples, filtered for quality, and then responses are generated. The older model produces somewhat lower-quality outputs than Alpaca's GPT-3.5 source. The pipeline iterates over four steps: (1) sample from the seed set, (2) prompt GPT-3 to generate a new instruction, (3) filter for novelty and quality, (4) generate a response for the instruction.
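The four-step loop can be sketched as follows. This is a simplified illustration, not the original pipeline: `generate_instruction` and `generate_response` are hypothetical callables standing in for LLM calls, and word-level Jaccard similarity is an assumed stand-in for whatever novelty filter a real pipeline would use.

```python
import random

def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def self_instruct(seeds, generate_instruction, generate_response,
                  rounds=10, max_sim=0.7, n_demos=3, seed=0):
    """Grow a pool of instructions from seeds, keeping only novel candidates."""
    rng = random.Random(seed)
    pool = list(seeds)   # every instruction accepted so far, seeds included
    pairs = []           # accepted (instruction, response) pairs
    for _ in range(rounds):
        demos = rng.sample(pool, min(n_demos, len(pool)))   # (1) sample
        candidate = generate_instruction(demos)             # (2) generate
        too_similar = any(jaccard(candidate, p) > max_sim for p in pool)
        if not candidate or too_similar:                    # (3) filter
            continue
        pool.append(candidate)
        pairs.append((candidate, generate_response(candidate)))  # (4) respond
    return pairs
```

Accepted instructions are added back to the pool, so later rounds can draw on generated instructions as demonstrations and are filtered against everything produced so far.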
Unnatural Instructions is another synthetic dataset (~68K examples) generated by asking GPT-3 to produce both instructions and responses, with an emphasis on creative and unusual task formulations. The name comes from the "unnatural" prompt engineering used to elicit diverse task types. The dataset intentionally pushes beyond standard NLP task formats — you'll find instructions like "Write a limerick about quantum physics" or "Explain why the sky is blue using only words a pirate would use." This diversity is intended to teach the model flexibility, though the quality of individual responses varies significantly.
A pattern emerges when you look across all seven datasets: there's a fundamental tradeoff between breadth (how many different task types are covered) and depth (how detailed and helpful each individual response is). FLAN has maximum breadth but minimal depth. ShareGPT has high depth but uncontrolled breadth. The question is which dimension matters more for downstream performance — and the answer, as we'll see, depends on what you measure.
| Dataset | Size | Source | Style |
|---|---|---|---|
| FLAN V2 | ~100K (subsampled) | 1,836 academic NLP tasks | Structured, templated |
| Alpaca (CoT) | 52K | GPT-3.5 generated | Short instructions, mixed quality |
| ShareGPT | ~90K | Real ChatGPT conversations | Conversational, multi-turn |
| Dolly | 15K | Human-written (Databricks employees) | Business-oriented, careful |
| OASST1 | ~10K | Crowdsourced conversation trees | Multi-turn, preference-ranked |
| Self-Instruct | 82K | GPT-3 (davinci) generated | Synthetic, varied |
| Unnatural Instr. | 68K | GPT-3 generated | Creative, unusual tasks |
The training setup is uniform across all experiments: LLaMA-7B base model, 2 epochs of fine-tuning, learning rate 2e-5 with linear warmup and cosine decay, batch size 128, max sequence length 2048. This ensures any performance differences come from the data, not the training recipe.
Why exactly 2 epochs? The paper experiments with 1, 2, 3, and 5 epochs and finds that 2 epochs is the sweet spot for most datasets. At 1 epoch, the model hasn't fully absorbed the instruction format. At 3+ epochs, it starts to overfit — memorizing specific responses rather than learning the general instruction-following skill. Overfitting manifests as the model reproducing training responses verbatim or generating outputs in the exact style of a specific data source, reducing generalization. The 2-epoch finding has become standard in the community.
The max sequence length of 2048 is also significant. Some ShareGPT conversations exceed this length and are truncated. The paper notes that increasing to 4096 tokens slightly improves performance on conversation-heavy benchmarks but doubles memory requirements. For most experiments, 2048 is a practical compromise. Later models (Llama 2, Mistral) support longer contexts and benefit from longer training sequences.
```python
# Standardized training config for all dataset comparisons
config = {
    "model": "llama-7b",
    "epochs": 2,
    "lr": 2e-5,
    "warmup_ratio": 0.03,
    "scheduler": "cosine",
    "batch_size": 128,
    "max_seq_len": 2048,
    "weight_decay": 0.0,
    "gradient_checkpointing": True,
}

# For each dataset D in [FLAN, Alpaca, ShareGPT, Dolly, OASST, ...]:
#     model = finetune(llama_7b, D, config)
#     scores = evaluate(model, [MMLU, BBH, TydiQA, AlpacaEval, ...])
```
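To make the schedule concrete, here is a minimal sketch of linear warmup followed by cosine decay, using the peak learning rate and warmup ratio from the config. The `lr_at` helper is hypothetical, and two details are assumptions rather than facts from the paper: the rate decays all the way to zero, and warmup reaches the peak on the last warmup step.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: 100K examples, batch size 128, 2 epochs.
steps_per_epoch = math.ceil(100_000 / 128)
total_steps = steps_per_epoch * 2

for step in (0, 46, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With a 3% warmup ratio, only a few dozen of the roughly 1.5K optimizer steps are spent warming up; the rest of training runs on the decaying part of the curve.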
[Interactive figure: comparison of the seven instruction datasets, with a radar chart scoring each on size, diversity, naturalness, and response quality.]
The critical question: when you train LLaMA-7B on each dataset independently, which one produces the best instruction-following model? The answer might surprise you — it's not the largest dataset, and it's not the most expensive to create.
Let's look at an example from each dataset to see how they differ in style and quality:
```python
# Example instruction-response pairs from each dataset

# FLAN: academic task format
flan_example = {
    "instruction": "Classify the sentiment of the following review as "
                   "positive or negative: 'The movie was a waste of time.'",
    "response": "negative"  # ~1 token response
}

# ShareGPT: natural conversation
sharegpt_example = {
    "instruction": "Explain quantum entanglement like I'm 10 years old",
    "response": "Imagine you have two magic coins. You flip them both..."
    # ~200 token response with analogies, structure
}

# Dolly: human-written, structured
dolly_example = {
    "instruction": "What are the main differences between Python and Java?",
    "response": "Python and Java differ in several key ways:\n"
                "1. Type system: Python is dynamically typed...\n"
                "2. Syntax: Python uses indentation...\n"
    # ~150 tokens, well-structured
}

# Key difference: FLAN responses average ~15 tokens
# ShareGPT responses average ~350 tokens
# This response length difference drives the style transfer effect
```
Dataset choice is only half the equation. The other half is the base model — the pre-trained language model you start with before fine-tuning. In 2023, several open base model families were available, and the community hadn't established which one was the best foundation for instruction tuning.
The paper compares four base model families, spanning different sizes, training data, and architectural choices:
LLaMA (Meta, 2023): The clear frontrunner. LLaMA models are trained on 1-1.4 trillion tokens of publicly available text, using a standard Transformer decoder architecture with RoPE positional embeddings and SwiGLU activations. Sizes: 7B, 13B, 30B, 65B. LLaMA's distinguishing feature is its aggressive training budget: the 7B model sees far more data per parameter than earlier open models, deliberately training well past the "Chinchilla-optimal" token count so that a small model performs better at inference time.
Pythia (EleutherAI, 2023): A family of models from 70M to 12B parameters, trained on The Pile (300B tokens). Pythia's unique contribution is scientific reproducibility: every model checkpoint is released, training data is deduplicated, and the full training pipeline is documented. The downside: Pythia is trained on significantly less data than LLaMA (300B vs 1T tokens), and The Pile is smaller and older than LLaMA's training corpus.
OPT (Meta, 2022): Open Pre-trained Transformers, available from 125M to 175B parameters. OPT was trained on a mix of public datasets (The Pile, BookCorpus, etc.) totaling ~180B tokens. OPT is older than LLaMA and was trained with less data and lower quality filtering. Performance is generally below LLaMA at equivalent sizes.
GPT-NeoX (EleutherAI, 2022): A 20B parameter model trained on The Pile. It uses rotary positional embeddings and parallel attention+FFN computation (the attention and feed-forward sublayers are computed in parallel rather than sequentially, a design it shares with GPT-J). It comes in a single size only, making scaling analysis impossible within this family. Despite being the largest model in the comparison (20B), it often loses to LLaMA-13B because of the pre-training data disadvantage.
| Model Family | Sizes Available | Training Tokens | Key Feature |
|---|---|---|---|
| LLaMA | 7B, 13B, 30B, 65B | 1-1.4T | Trained well past the Chinchilla-optimal data/params ratio |
| Pythia | 70M-12B | 300B | Full checkpoint reproducibility |
| OPT | 125M-175B | 180B | First large open model release |
| GPT-NeoX | 20B only | 300B (Pile) | Parallel attention/FFN |
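The training-token column is easier to compare when normalized by model size. The sketch below computes tokens per parameter for the three ~7B models, using the token counts from the table and the widely cited rule of thumb of roughly 20 tokens per parameter for compute-optimal training. That constant is an approximation, and the `tokens_per_param` helper is a hypothetical name.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rough heuristic, not an exact constant

models = {
    "LLaMA-7B":    {"params": 7.0e9, "tokens": 1.0e12},
    "Pythia-6.9B": {"params": 6.9e9, "tokens": 300e9},
    "OPT-6.7B":    {"params": 6.7e9, "tokens": 180e9},
}

def tokens_per_param(model):
    """Pre-training tokens seen per model parameter."""
    return model["tokens"] / model["params"]

for name, m in models.items():
    tpp = tokens_per_param(m)
    print(f"{name}: {tpp:.0f} tokens/param, "
          f"{tpp / CHINCHILLA_TOKENS_PER_PARAM:.1f}x the heuristic")
```

By this measure LLaMA-7B is trained at roughly 143 tokens per parameter, several times more than Pythia or OPT at a similar size.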
The experimental setup is clean: take the best-performing dataset from the dataset comparison (a mix of FLAN + ShareGPT), and train each base model with identical hyperparameters. Then evaluate on the same benchmark suite. This isolates the effect of the base model.
```python
# Base model comparison experiment
# Pseudocode: load, load_dataset, finetune, evaluate, config, and benchmark_suite
# stand in for the shared training and evaluation harness.
best_data = load_dataset("tulu_mix")  # FLAN + ShareGPT mix
base_models = {
    "llama-7b": load("meta-llama/Llama-7b"),        # 1T tokens pre-training
    "pythia-6.9b": load("EleutherAI/pythia-6.9b"),  # 300B tokens pre-training
    "opt-6.7b": load("facebook/opt-6.7b"),          # 180B tokens pre-training
}

# Same data, same hyperparameters, different base
for name, model in base_models.items():
    tuned = finetune(model, best_data, config)
    scores = evaluate(tuned, benchmark_suite)

# Result: LLaMA-7B >> Pythia-6.9B > OPT-6.7B
# Gap is 5-15% on most benchmarks
```
The results are unambiguous: LLaMA dominates at every size. Instruction-tuned LLaMA-7B outperforms instruction-tuned Pythia-12B on most benchmarks, despite having 40% fewer parameters. The pre-training foundation — specifically the volume and quality of pre-training data — is a stronger predictor of downstream performance than the instruction-tuning data or training recipe.
The magnitude of the gap is startling. On AlpacaEval, LLaMA-7B + Tulu mix scores ~58%. Pythia-6.9B + same Tulu mix scores ~38%. That's a 20-point gap from the same instruction data and training recipe. On MMLU, the gap is ~14 points (46% vs 32%). These are massive differences that dwarf the effect of switching between instruction datasets (which typically changes scores by 5-10 points).
What's happening under the hood? LLaMA has seen more pre-training data (1T vs 300B tokens), which means it has richer internal representations. When you instruction-tune it, you're surfacing knowledge that's already there. When you instruction-tune Pythia, the knowledge simply isn't there to surface. You can teach Pythia to respond in a helpful style, but it can't answer questions about topics it never learned during pre-training.
This has a practical implication that the community learned quickly: if you're building an open instruction-following model, start with the best base model you can get. No amount of clever instruction tuning can compensate for a weak foundation. By the time this paper was published, LLaMA had become the default base model for almost every open-source instruction-tuning project.
The architecture differences between these base models are relatively minor — they all use standard Transformer decoder blocks with autoregressive training. The key differences are in pre-training data processing:
```python
# Why LLaMA's pre-training matters
pretraining_comparison = {
    "LLaMA-7B": {
        "tokens": "1.0T",  # 3-5x more than competitors
        "data_sources": [
            "CommonCrawl (filtered)",  # 67%: heavily filtered web
            "C4",                      # 15%: cleaned CommonCrawl
            "GitHub",                  # 4.5%: code helps reasoning
            "Wikipedia",               # 4.5%: high-quality facts
            "Books",                   # 4.5%: long-form text
            "ArXiv",                   # 2.5%: scientific text
            "StackExchange",           # 2%: Q&A format
        ],
        # ~20 tokens/param is the usual Chinchilla heuristic, so 1T tokens is roughly 7x that for 7B
        "key_choice": "Train 7B model on 1T tokens (~7x Chinchilla-optimal)",
    },
    "Pythia-6.9B": {
        "tokens": "300B",  # less than 1/3 of LLaMA
        "data_sources": ["The Pile (deduplicated)"],
        "key_choice": "Prioritize reproducibility over performance",
    },
}

# LLaMA's bet: overtrain a small model on high-quality data
# This bet paid off: LLaMA-7B matches Pythia-12B
```
[Interactive figure: downstream task performance of different base models after instruction tuning on the same data.]
The paper also tests a within-family scaling experiment using LLaMA at 7B, 13B, 30B, and 65B parameters. With the same instruction data (Tulu mix), performance scales smoothly with model size. This experiment is critical because it answers the question: "If I have budget for a bigger model OR better instruction data, which should I invest in?"
The answer depends on where you are on the scaling curve. If you're at 7B → 13B, the base model upgrade is worth more. If you're already at 65B and have poor instruction data, investing in data quality has higher ROI. The crossover point is around 30B parameters: below that, invest in the base model; above that, invest in instruction data and alignment.
The relationship can be summarized schematically as:

$$\text{Performance} \approx \alpha \cdot \log(N_{\text{params}}) + \beta \cdot \text{Quality}(D_{\text{instruct}})$$

Here $N_{\text{params}}$ is the number of parameters, $D_{\text{instruct}}$ is the instruction dataset, and $\alpha$ and $\beta$ are scaling constants. This formula captures the paper's central finding: both the base model (through pre-training scale) and the instruction data (through quality) contribute independently to downstream performance, but the base model contribution has a log-relationship with parameters while the data quality contribution has a more linear relationship.
```python
# Within-family scaling: LLaMA sizes with same data
llama_scaling = {
    "7B":  {"mmlu": 45.9, "alpaca": 58.6},
    "13B": {"mmlu": 52.3, "alpaca": 65.4},
    "30B": {"mmlu": 58.7, "alpaca": 71.8},
    "65B": {"mmlu": 63.5, "alpaca": 76.2},
}

# Each size jump gives a ~5-6 point MMLU improvement (+6.4, +6.4, +4.8)
# 65B with Tulu mix is competitive with ChatGPT on some tasks
```
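Because the size jumps are not equal in relative terms (7B to 13B is less than a doubling, 13B to 30B is more), it helps to normalize the gains per doubling of parameters. The sketch below does this for the MMLU scores listed above; `gain_per_doubling` is a hypothetical helper, and the parameter counts are the nominal sizes rather than exact counts.

```python
import math

# Nominal size in billions of parameters -> MMLU score (values from the article).
llama_mmlu = {7: 45.9, 13: 52.3, 30: 58.7, 65: 63.5}

def gain_per_doubling(scores):
    """MMLU points gained per doubling of parameters, for adjacent sizes."""
    sizes = sorted(scores)
    gains = {}
    for small, large in zip(sizes, sizes[1:]):
        doublings = math.log2(large / small)
        gains[(small, large)] = (scores[large] - scores[small]) / doublings
    return gains

for (small, large), g in gain_per_doubling(llama_mmlu).items():
    print(f"{small}B -> {large}B: {g:.1f} MMLU points per doubling")
```

Measured this way, the return per doubling shrinks steadily as the model grows, consistent with the roughly logarithmic relationship between parameters and performance.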
This is the chapter that made the paper famous. The question is deceptively simple: if you have a fixed compute budget for instruction tuning, should you collect more data or better data? The intuition from pre-training suggests "more data is always better." But instruction tuning is different — and the results are striking.
The paper designs a clean experiment to answer this. They take the FLAN dataset (which has millions of examples available) and train LLaMA-7B on different-sized subsets: 1K, 5K, 10K, 25K, 50K, and 100K examples. They also compare against the much smaller but higher-quality datasets like Dolly (15K) and OASST (10K).
Why is this experiment informative? Because it holds quality constant (all FLAN data is the same quality) and varies only quantity. This gives us a clean learning curve showing the marginal value of additional data. Then, comparing the FLAN scaling curve against high-quality datasets at fixed sizes tells us the value of quality — if 10K OASST beats 100K FLAN, we know quality is worth at least 10x in data efficiency.
The training details matter for reproducibility: each model is trained for exactly 2 epochs regardless of dataset size, so every model sees each of its examples twice, whether the subset has 1K or 100K examples. The compute scales linearly with dataset size: 1K takes ~15 minutes, 100K takes ~25 hours on 8x A100s. This linear scaling makes the diminishing returns finding especially stark: late in the curve, doubling the data from 50K to 100K doubles the compute for a barely perceptible improvement.
What does "quality" mean concretely? The paper identifies several dimensions:
The quality-vs-quantity tradeoff isn't uniform across all evaluations. On traditional NLP benchmarks (MMLU, BBH), quantity helps: FLAN's massive task diversity gives better benchmark coverage. But on preference evaluations (AlpacaEval, where a GPT-4 judge picks the better of two models' responses as a proxy for human preference), quality dominates. ShareGPT with 90K real conversations crushes FLAN with 100K academic examples on preference win rate.
Why? Because preference judges reward the same things real users care about: is the response helpful, detailed, well-structured, and natural-sounding? FLAN teaches the model to format NLI labels correctly. ShareGPT teaches the model to have a genuine conversation.
Consider a concrete example. Ask both models: "What's the best way to learn Python?" A FLAN-tuned model might respond: "Take online courses, read documentation, and practice coding." (Fewer than 10 words, technically correct, unhelpful.) A ShareGPT-tuned model might respond: "Great question! Here's a roadmap I'd recommend: 1. Start with Python basics — variables, loops, functions. Try Codecademy or Python.org's official tutorial. 2. Build small projects early — a calculator, a to-do list, a web scraper. The best learning is by doing. 3. Read other people's code on GitHub..." (100+ words, structured, actionable, conversational.) Both models have the same knowledge (from LLaMA pre-training). The difference is entirely in how they present it.
```python
# Quality vs quantity experiment

# Scaling experiment: subsample FLAN to different sizes
flan_sizes = [1000, 5000, 10000, 25000, 50000, 100000]
flan_scores = {
    "MMLU":       [38.2, 41.5, 43.1, 44.8, 45.3, 45.9],  # scales with size
    "AlpacaEval": [25.1, 32.4, 38.7, 42.1, 43.5, 44.2],  # diminishing returns
}

# Compare with small but high-quality datasets
quality_datasets = {
    "OASST (10K)":    {"MMLU": 40.1, "AlpacaEval": 52.3},  # fewer examples, better responses
    "ShareGPT (90K)": {"MMLU": 43.7, "AlpacaEval": 58.6},  # real conversations win
    "Dolly (15K)":    {"MMLU": 39.8, "AlpacaEval": 47.1},  # human-written, small but good
}

# Key insight: OASST with 10K examples beats FLAN with 100K on AlpacaEval
# because the preference judge favors natural, detailed responses
```
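One way to read these numbers is to ask how much FLAN data it takes to match each high-quality dataset. The sketch below looks up the smallest tested FLAN subset that reaches a target score, using the figures listed above; `flan_size_to_match` is a hypothetical helper, and it only considers the six tested sizes without interpolating between them.

```python
flan_sizes = [1000, 5000, 10000, 25000, 50000, 100000]
flan_mmlu = [38.2, 41.5, 43.1, 44.8, 45.3, 45.9]
flan_alpaca = [25.1, 32.4, 38.7, 42.1, 43.5, 44.2]

def flan_size_to_match(target, sizes, scores):
    """Smallest tested FLAN subset whose score reaches `target`, else None."""
    for size, score in zip(sizes, scores):
        if score >= target:
            return size
    return None

oasst = {"MMLU": 40.1, "AlpacaEval": 52.3}  # 10K examples

mmlu_match = flan_size_to_match(oasst["MMLU"], flan_sizes, flan_mmlu)
alpaca_match = flan_size_to_match(oasst["AlpacaEval"], flan_sizes, flan_alpaca)
print("FLAN size matching OASST on MMLU:", mmlu_match)
print("FLAN size matching OASST on AlpacaEval:", alpaca_match)
```

On MMLU a small FLAN subset already matches OASST, while on AlpacaEval no tested FLAN size does, which is the benchmark-dependent pattern the section describes.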
There's a nuance here that's important. The paper doesn't find that quality always beats quantity in every setting. The relationship depends on what you're measuring:
| Evaluation Type | What Wins? | Why? |
|---|---|---|
| NLP Benchmarks (MMLU, BBH) | Quantity helps | More task types = better coverage of benchmark skills |
| Preference (AlpacaEval) | Quality wins | Judges prefer natural, detailed, helpful responses |
| Safety / Toxicity (ToxiGen) | Neither dominates | Safety requires explicit safety data, not just more data |
| Multilingual (TydiQA) | Data source matters | Only FLAN has significant multilingual coverage |
[Interactive figure: benchmark scores as a function of the number of FLAN training examples, with horizontal dashed lines marking the scores of smaller, higher-quality datasets.]
Let's quantify the diminishing returns concretely. The FLAN scaling curve looks like this:
```python
# FLAN data scaling: AlpacaEval win rate
sizes = [1000, 5000, 10000, 25000, 50000, 100000]
scores = [25.1, 32.4, 38.7, 42.1, 43.5, 44.2]

# Total gain over each interval:
#   1K  → 10K:  +13.6 (huge gain from first 10K)
#   10K → 25K:  +3.4  (decent gain)
#   25K → 50K:  +1.4  (diminishing)
#   50K → 100K: +0.7  (almost flat)

# Now compare: OASST with just 10K examples scores 52.3
# That's higher than FLAN at ANY scale tested!
# Quality advantage: 52.3 - 44.2 = 8.1 points on AlpacaEval
# Even 10x more FLAN data can't close this gap
```
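The intervals above differ in width, so the diminishing returns are even clearer once each gain is normalized to points per 10K added examples. The sketch below performs that normalization on the same numbers; `gain_per_10k` is a hypothetical helper.

```python
sizes = [1000, 5000, 10000, 25000, 50000, 100000]
scores = [25.1, 32.4, 38.7, 42.1, 43.5, 44.2]

def gain_per_10k(sizes, scores):
    """AlpacaEval points gained per 10K added examples, for each interval."""
    points = list(zip(sizes, scores))
    return [
        (lo, hi, (s_hi - s_lo) / ((hi - lo) / 10_000))
        for (lo, s_lo), (hi, s_hi) in zip(points, points[1:])
    ]

for lo, hi, rate in gain_per_10k(sizes, scores):
    print(f"{lo:>6} -> {hi:>6}: {rate:6.2f} points per 10K examples")
```

The per-10K rate falls by roughly two orders of magnitude between the first interval and the last, a much steeper decline than the raw interval gains suggest.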
The paper also finds that mixing datasets helps. Their best model (Tulu) is trained on a carefully weighted mixture of FLAN, ShareGPT, and OASST — combining FLAN's task diversity with ShareGPT's naturalness and OASST's conversational quality. This mixture outperforms any single dataset alone, suggesting that different datasets contribute complementary skills.
The mixing strategy isn't just "throw everything together." The paper experiments with different mixture ratios and finds that the optimal balance depends on your target evaluation. For general-purpose assistants (the most common use case), the sweet spot is roughly 30% FLAN (breadth) + 35% ShareGPT (style) + 15% OASST (multi-turn) + 20% specialized data (CoT reasoning, code, etc.).
The authors test several mixing strategies: (1) equal weights (each dataset contributes equally by example count), (2) proportional to size (larger datasets contribute more), (3) inverse proportional (smaller, higher-quality datasets are oversampled), and (4) manually tuned weights based on dev set performance. Strategy (4) works best, but strategy (3) is a good heuristic when you don't have a dev set — oversample the highest-quality data.
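```python
# Dataset mixing strategies
import random

def create_mixture(datasets, strategy="quality_weighted", total=80000, seed=0):
    """Build a training mixture. `datasets` maps dataset name -> list of examples."""
    rng = random.Random(seed)
    if strategy == "equal":
        # Each dataset contributes the same number of examples
        n_per = 10000
        return [ex for d in datasets.values()
                for ex in rng.sample(d, min(n_per, len(d)))]
    elif strategy == "quality_weighted":
        # Oversample high-quality, undersample low-quality
        weights = {"sharegpt": 0.35, "oasst": 0.15, "flan": 0.30, "cot": 0.20}
        mixture = []
        for name, w in weights.items():
            # Sample with replacement so small datasets can be oversampled
            mixture.extend(rng.choices(datasets[name], k=round(w * total)))
        return mixture
    raise ValueError(f"unknown strategy: {strategy}")

# Quality-weighted consistently outperforms equal mixing
```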
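Strategy (3), inverse-proportional weighting, can be sketched in a few lines. This is one plausible reading of the idea, assuming each dataset's share of the mixture is proportional to the reciprocal of its size; `inverse_size_weights` is a hypothetical helper, and the sizes are the approximate counts from the dataset table.

```python
def inverse_size_weights(sizes):
    """Sampling weights proportional to 1/size, normalized to sum to 1."""
    inv = {name: 1.0 / n for name, n in sizes.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}

dataset_sizes = {"flan": 100_000, "sharegpt": 90_000,
                 "dolly": 15_000, "oasst": 10_000}
weights = inverse_size_weights(dataset_sizes)

for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:>9}: {w:.1%} of the mixture")
```

A pure 1/size rule is aggressive: the two smallest datasets end up dominating the mixture. In practice it can be softened, for example by raising sizes to a power below one, which is a design choice this sketch leaves out.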
One of the paper's most important contributions is its multi-faceted evaluation framework. Instead of relying on a single benchmark, the authors evaluate every model on a diverse suite that captures different aspects of instruction-following ability. This matters because, as they show, the ranking of models changes dramatically depending on which evaluation you use.
The evaluation suite covers four categories:
Factual knowledge: MMLU (Massive Multitask Language Understanding) tests 57 subjects from elementary math to professional law. It measures whether the model has learned and can recall factual knowledge. MMLU scores correlate strongly with pre-training data volume — bigger models with more pre-training data score higher, regardless of instruction tuning. This is a multiple-choice benchmark, so it's purely testing recall and reasoning — the model's response style doesn't matter.
MMLU is the benchmark where instruction tuning helps least — you might gain 2-5 points over a good few-shot prompt with the base model. The knowledge tested by MMLU is almost entirely learned during pre-training. Instruction tuning just teaches the model to format its answer as a multiple-choice selection. This is why base model choice dominates on MMLU.
Reasoning: BIG-Bench Hard (BBH) is a collection of 23 challenging reasoning tasks from the BIG-Bench benchmark, requiring multi-step reasoning, logical deduction, and mathematical thinking. BBH is evaluated with chain-of-thought prompting, where the model must show its reasoning step by step. This benchmark rewards instruction datasets that include reasoning traces (like Alpaca-CoT). Interestingly, this is the benchmark where the gap between instruction-tuning approaches is smallest — reasoning ability seems to depend more on model scale than on instruction data quality.
Multilinguality: TydiQA tests question answering in 11 typologically diverse languages (Arabic, Bengali, English, Finnish, Indonesian, Japanese, Korean, Russian, Swahili, Telugu, Thai). Most instruction datasets are English-only, so TydiQA reveals whether instruction tuning transfers cross-lingually. FLAN is the only dataset with significant multilingual content, and it shows: FLAN-tuned models perform best on TydiQA.
The multilingual results reveal an important limitation: instruction tuning on English-only data does not meaningfully improve multilingual performance. The base LLaMA model has some multilingual ability from pre-training (its data includes non-English text), but English-only instruction tuning can actually hurt multilingual performance by pushing the model toward English-only response patterns. If multilingual ability matters, you need multilingual instruction data.
Open-ended instruction following: AlpacaEval uses GPT-4 as an automatic judge to rate model responses against a reference (Davinci-003). This is the closest proxy for "does a real user find this response helpful?" AlpacaEval correlates well with human judgments and is the evaluation where dataset quality matters most. It tests 805 open-ended prompts spanning coding, creative writing, advice, analysis, and more — the kind of queries real users actually send to chatbots.
AlpacaEval has a known bias: it tends to prefer longer responses. A model that produces verbose but mediocre answers can score higher than a model that gives concise but correct answers. The paper acknowledges this limitation and supplements AlpacaEval with human evaluation to check for this failure mode.
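To make the length bias concrete, here is a rough sketch of how you might compute a win rate from judge decisions and then split it by whether the model's answer was longer than the reference. The record fields (`model_wins`, `model_len`, `ref_len`) and the four judgments are hypothetical. This is not AlpacaEval's actual output format, and the numbers are not results from the paper.

```python
def win_rate(judgments):
    """Percentage of prompts on which the judge preferred the model."""
    return 100.0 * sum(j["model_wins"] for j in judgments) / len(judgments)

def length_bias_check(judgments):
    """Compare win rates when the model answer is longer vs not longer."""
    longer = [j for j in judgments if j["model_len"] > j["ref_len"]]
    shorter = [j for j in judgments if j["model_len"] <= j["ref_len"]]
    return {
        "win_rate_when_longer": win_rate(longer) if longer else None,
        "win_rate_when_shorter": win_rate(shorter) if shorter else None,
    }

# Hypothetical judge outputs, for illustration only
judgments = [
    {"model_wins": True, "model_len": 220, "ref_len": 90},
    {"model_wins": True, "model_len": 310, "ref_len": 120},
    {"model_wins": False, "model_len": 60, "ref_len": 150},
    {"model_wins": True, "model_len": 80, "ref_len": 140},
]
print(win_rate(judgments))
print(length_bias_check(judgments))
```

A large difference between the two win rates doesn't prove bias on its own (longer answers may simply be better), but it tells you where to look with human evaluation.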
There's also a toxicity evaluation using ToxiGen, which tests whether the model generates harmful content when prompted. This evaluation is important because instruction tuning can inadvertently make models more willing to follow harmful instructions. The paper finds that most instruction datasets don't explicitly address safety — only OASST includes safety-aware training data, and OASST-tuned models show the lowest toxicity rates.
The authors also conduct human evaluation as a complement to automatic metrics. They recruit human annotators to compare model outputs pairwise, asking "Which response is more helpful?" Human evaluations generally agree with AlpacaEval rankings but sometimes diverge — particularly when models produce lengthy but unhelpful responses that GPT-4 rates highly but humans find frustrating.
| Benchmark | What It Measures | Which Dataset Wins | Why |
|---|---|---|---|
| MMLU | Factual knowledge | FLAN | 1,836 task types cover more knowledge |
| BBH | Reasoning | Alpaca-CoT | Chain-of-thought examples teach reasoning |
| TydiQA | Multilingual QA | FLAN | Only FLAN has multilingual training data |
| AlpacaEval | Open-ended helpfulness | ShareGPT | Real conversations teach natural responses |
| Toxicity | Safety | OASST | Has explicit safety-aware training data |
python # Evaluation pipeline (simplified) def evaluate_model(model, tokenizer): results = {} # 1. MMLU: 57-subject multiple choice results["mmlu"] = run_mmlu(model, tokenizer, n_shot=5) # → accuracy across subjects, e.g., 45.3% # 2. BBH: chain-of-thought reasoning results["bbh"] = run_bbh(model, tokenizer, cot=True) # → exact match on 23 reasoning tasks, e.g., 38.7% # 3. TydiQA: multilingual question answering results["tydiqa"] = run_tydiqa(model, tokenizer, n_shot=1) # → F1 score across 11 languages, e.g., 35.2% # 4. AlpacaEval: GPT-4 as judge results["alpaca_eval"] = run_alpaca_eval(model, tokenizer) # → win rate vs text-davinci-003, e.g., 58.6% return results
No single dataset wins everywhere, which is why the Tulu mixture was created.
There's a deeper lesson here about evaluation design in AI research. Before this paper, most instruction-tuning papers evaluated on 1-2 benchmarks and claimed superiority. The Camels paper showed that this is misleading: a model's ranking can flip depending on which benchmark you choose. This reinforced the field's move toward multi-benchmark evaluation suites (like the Open LLM Leaderboard) and composite scores that aggregate across diverse evaluations.
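A tiny sketch shows how rankings can flip by benchmark and how a composite score resolves the disagreement. The two models and their scores are placeholders invented for this example, not numbers from the paper, and the unweighted mean is just one possible way to aggregate.

```python
def rank_by(scores, benchmark):
    """Model names sorted from best to worst on one benchmark."""
    return sorted(scores, key=lambda m: scores[m][benchmark], reverse=True)

def aggregate(scores):
    """Unweighted mean across benchmarks for each model."""
    return {m: sum(s.values()) / len(s) for m, s in scores.items()}

# Placeholder scores for two hypothetical models
scores = {
    "model_a": {"mmlu": 48.0, "alpaca_eval": 30.0},
    "model_b": {"mmlu": 44.0, "alpaca_eval": 60.0},
}

print(rank_by(scores, "mmlu"))         # model_a ranks first
print(rank_by(scores, "alpaca_eval"))  # model_b ranks first
print(aggregate(scores))
```

An unweighted mean implicitly treats a point of MMLU as equal to a point of AlpacaEval, which is itself a design choice worth stating whenever you report a composite.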
The human evaluation component deserves special attention. The authors recruit evaluators from Mechanical Turk and from graduate students, finding interesting disagreements. On most model comparisons, the two groups agree. But on some edge cases — particularly where one model gives a short, correct answer and another gives a long, partially incorrect answer — the groups diverge. Graduate students prefer the concise correct answer; Mechanical Turk workers prefer the longer response. This highlights a general challenge: "helpfulness" is subjective and audience-dependent.
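Disagreement between annotator groups can be quantified. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) between two lists of pairwise preferences. The labels are hypothetical, and the source doesn't say which agreement statistic the authors used, so treat this as one reasonable option.

```python
from collections import Counter

def agreement_rate(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    p_o = agreement_rate(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled independently
    p_e = sum((count_a[lab] / n) * (count_b[lab] / n)
              for lab in set(a) | set(b))
    if p_e == 1.0:
        return 1.0  # both annotators always use the same single label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical preferences ("A" or "B" = which response was more helpful)
grad_students = ["A", "A", "B", "A", "B", "B", "A", "B"]
crowd_workers = ["A", "B", "B", "A", "B", "A", "A", "B"]

print(agreement_rate(grad_students, crowd_workers))
print(cohens_kappa(grad_students, crowd_workers))
```

Raw agreement looks flattering when one label dominates, which is why a chance-corrected statistic is usually reported alongside it.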
The paper also contributes to the growing literature on evaluation contamination. Some instruction datasets (particularly FLAN) contain tasks that overlap with benchmark test sets. The authors check for this and find minimal contamination, but flag it as a concern for future work. This issue would become much more prominent in 2024, when researchers found that many open datasets were contaminated with benchmark answers.
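The text above doesn't describe how the authors ran their contamination check. One common approach is n-gram overlap: flag any test item that shares a long enough word sequence with the training data. The sketch below assumes whitespace tokenization and an arbitrary window size, and the example strings are invented.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(train_texts, test_texts, n=8):
    """Indices of test items sharing at least one n-gram with training data."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [i for i, text in enumerate(test_texts)
            if ngrams(text, n) & train_grams]

# Invented examples; n=5 is an arbitrary choice for this short demo
train = ["What is the capital of France? The capital of France is Paris."]
test = [
    "Question: what is the capital of France? Choose one.",
    "Name the largest planet in the solar system.",
]
print(contaminated(train, test, n=5))
```

Exact n-gram matching misses paraphrased leaks, so a clean result here is evidence of low contamination, not proof.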
The paper also highlighted an important tension in AI evaluation: automatic metrics vs human judgment. MMLU is automatic, cheap, and reproducible — but it doesn't measure what users care about. AlpacaEval uses GPT-4 as a proxy for human judgment — cheaper than real humans but subject to the judge model's biases (GPT-4 tends to prefer longer responses). Real human evaluation is the gold standard but expensive and slow. The paper uses all three, showing where they agree and disagree:
```python
# Evaluation cost and reliability tradeoffs
eval_methods = {
    "MMLU (auto)": {
        "cost": "$0 (runs locally)",
        "time": "~30 min",
        "measures": "factual recall",
        "bias": "none (multiple choice)",
    },
    "AlpacaEval (GPT-4)": {
        "cost": "~$50 per model",
        "time": "~2 hours",
        "measures": "helpfulness, coherence",
        "bias": "prefers verbose responses",
    },
    "Human eval": {
        "cost": "$500-2000 per model",
        "time": "1-2 weeks",
        "measures": "actual user satisfaction",
        "bias": "inter-annotator disagreement ~75%",
    },
}
# Paper uses all three to triangulate true quality
```
Let's consolidate the paper's findings into actionable lessons. These results were derived from dozens of controlled experiments, each isolating a single variable. They represent the most rigorous understanding of open instruction tuning available at the time of publication.
Finding 1: Base model quality is the strongest predictor of downstream performance.
No amount of instruction tuning can compensate for a weak base model. LLaMA-7B instruction-tuned on any dataset consistently outperforms OPT-13B and Pythia-12B instruction-tuned on the same or better data. The pre-training data volume and quality set a ceiling that instruction tuning cannot exceed. Practically: always start with the best available base model.
The magnitude of this effect is remarkable. Switching from OPT-6.7B to LLaMA-7B (same instruction data, same training) improves AlpacaEval by 25+ points. Switching from the worst instruction dataset to the best (same base model) improves it by ~15 points. The base model effect is roughly 1.7x larger than the dataset effect. This means that if you're choosing between "better base model + mediocre data" vs "worse base model + excellent data," choose the better base model every time.
Finding 2: Data quality matters more than data quantity for instruction following.
10K high-quality examples (OASST, ShareGPT) often match or exceed 100K lower-quality examples (FLAN, Self-Instruct) on human preference evaluations. The definition of "quality" here is nuanced: it means responses that are helpful, detailed, natural-sounding, and accurate — not just technically correct. This finding drove the community toward smaller, curated datasets rather than massive synthetic ones.
Think of it this way: if you're teaching someone to write good emails, showing them 100 mediocre emails teaches them to write mediocre emails. Showing them 10 excellent emails teaches them to write excellent emails. The model imitates its training data — so the quality of what you show it is the quality you get back.
Finding 3: Different datasets teach different skills.
No single dataset dominates all benchmarks. FLAN excels at factual tasks; ShareGPT excels at open-ended conversation; Alpaca-CoT excels at reasoning; OASST excels at safety. The optimal strategy is a weighted mixture that combines complementary strengths. This is perhaps the most practically useful finding — it tells you that building a good instruction-tuned model isn't about finding the "one perfect dataset." It's about combining datasets that are strong in different areas, like assembling a team with complementary skills.
Finding 4: Response style matters as much as response content.
Models fine-tuned on ShareGPT produce longer, more structured responses (with headers, bullet points, step-by-step explanations) compared to FLAN-tuned models. Human evaluators strongly prefer this style, even when the factual content is equivalent. This suggests that instruction tuning primarily teaches how to respond rather than what to know — the knowledge comes from pre-training.
The response style finding has a surprising corollary: you can predict which dataset a model was trained on just by looking at its response style. FLAN-tuned models give terse, classification-style responses. ShareGPT-tuned models use numbered lists and friendly language. Alpaca-tuned models tend toward academic-sounding explanations. OASST-tuned models are conversational and ask clarifying questions. The instruction data's "personality" transfers directly to the model.
The style-transfer hypothesis behind Finding 4 has been tested in follow-up work. If instruction tuning is mostly style transfer, then the model shouldn't learn new facts from instruction data, and indeed studies show that factual accuracy on knowledge benchmarks (like MMLU) barely changes with instruction tuning. What does change dramatically is the model's ability to present its knowledge in a user-friendly format. A base LLaMA-7B "knows" the capital of France but can't tell you unless you prompt it just right. Instruction-tuned LLaMA-7B tells you "Paris" when asked directly, because it has learned the instruction-following interface.
Finding 5: Scaling instruction data shows diminishing returns.
Going from 1K to 10K FLAN examples produces a large jump in performance. Going from 10K to 100K produces a modest improvement. Going from 100K to 1M produces minimal improvement. The learning curve follows a power law with a steep initial slope that quickly flattens. This means that even with unlimited data, there's a practical ceiling to what instruction tuning can achieve.
The practical implication: if you're budgeting for instruction tuning, invest in the first 50K high-quality examples. Beyond that, the marginal value drops below the cost of data collection and compute. Your budget is better spent on preference optimization (DPO/RLHF) or on improving the base model.
```python
# Diminishing returns model
import numpy as np

def instruction_tuning_gain(n_examples, quality_factor=1.0):
    """
    Approximate performance gain from instruction tuning.

    n_examples: number of training examples
    quality_factor: 0-1, quality of instruction data
    Returns: approximate benchmark improvement (%)
    """
    # Log scaling: most learning happens in first 10K examples
    base_gain = 10 * np.log10(n_examples / 100 + 1)
    # Quality multiplier: 10K high-quality ≈ 100K low-quality
    return base_gain * quality_factor

# Example: quality_factor=1.0 (ShareGPT) vs 0.6 (Self-Instruct)
# instruction_tuning_gain(10000, 1.0)  ≈ 20.0
# instruction_tuning_gain(100000, 0.6) ≈ 18.0
# 10K excellent examples ≈ 100K mediocre examples
```
Finding 6: The gap between open and proprietary models is closing, but not closed.
The best Tulu models (LLaMA-65B with the optimal dataset mixture) are competitive with ChatGPT on some benchmarks but still trail on others, particularly complex reasoning and multilingual tasks. The gap is larger on hard tasks and smaller on simple tasks, suggesting that scaling the base model (not just the instruction data) is needed to fully close the gap.
Let's quantify the gap:
| Benchmark | Tulu-65B | ChatGPT | Gap |
|---|---|---|---|
| MMLU | 63.5 | 70.0 | -6.5 |
| BBH | 55.2 | 68.1 | -12.9 |
| AlpacaEval | 76.2 | 89.4 | -13.2 |
| TydiQA | 48.7 | 62.3 | -13.6 |
The gaps on TydiQA (multilingual), AlpacaEval (open-ended helpfulness), and BBH (reasoning) are all about 13 points, roughly twice the MMLU gap. This suggests that the main advantage of proprietary models comes from (1) larger pre-training scale (GPT-3.5 is estimated at 175B parameters with more training data), (2) RLHF alignment (which Tulu doesn't use), and (3) better multilingual pre-training data. Subsequent work (Tulu 2, Zephyr) would address item (2) with DPO, closing the gap further.
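Recomputing the Gap column from the table above and sorting it makes the ordering easy to check. The scores are copied from the table, and only the variable names are new.

```python
# (Tulu-65B, ChatGPT) scores copied from the table above
scores = {
    "MMLU": (63.5, 70.0),
    "BBH": (55.2, 68.1),
    "AlpacaEval": (76.2, 89.4),
    "TydiQA": (48.7, 62.3),
}

# Negative gap = Tulu trails ChatGPT
gaps = {name: round(tulu - chatgpt, 1)
        for name, (tulu, chatgpt) in scores.items()}

# Print from largest deficit to smallest
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<10} {gap:+.1f}")
```

Three of the four gaps cluster within a point of each other, so the real outlier is MMLU, where Tulu is closest to ChatGPT.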
Let's build a mental model of how these findings connect. Think of instruction tuning as stacking two skills: knowledge (knowing facts, reasoning, multilingual ability) and interface (formatting that knowledge as a helpful response). The base model determines the knowledge ceiling. The instruction data determines the interface quality. More instruction data slightly improves knowledge coverage, but primarily improves the interface.
Put compactly: Quality ≈ Knowledge + Interface. Knowledge scales with pre-training compute (a logarithmic relationship with parameters and tokens), and Interface scales with instruction data quality (fast saturation: 10K good examples ≈ 100K mediocre examples). This decomposition explains all six findings:
| Finding | Explanation via Knowledge + Interface |
|---|---|
| Base model matters most | Knowledge ceiling is the dominant term |
| Quality > quantity | Interface saturates quickly with good examples |
| Different datasets teach different skills | Datasets differ in both knowledge coverage and interface style |
| Style matters as much as content | Interface = style; content = knowledge from pre-training |
| Diminishing returns | Interface learning curve is steep then flat |
| Gap to proprietary models | Knowledge gap (more pre-training) + Interface gap (RLHF) |
```python
# Mental model: instruction tuning decomposes into Knowledge + Interface
import numpy as np

def model_quality(base_tokens_T, instruct_examples, data_quality):
    """
    base_tokens_T: pre-training tokens in trillions (e.g., 1.0 for LLaMA)
    instruct_examples: number of instruction examples (e.g., 50000)
    data_quality: quality score 0-1 (e.g., 0.9 for ShareGPT)
    """
    # Knowledge: log-scaled with pre-training
    knowledge = 25 * np.log2(base_tokens_T * 10 + 1)
    # Interface: saturating with instruction data
    interface = 20 * data_quality * (1 - np.exp(-instruct_examples / 15000))
    return knowledge + interface

# LLaMA-7B + 10K ShareGPT (quality=0.9)
# model_quality(1.0, 10000, 0.9) ≈ 25*3.459 + 20*0.9*0.487 ≈ 95.2
# OPT-6.7B + 100K FLAN (quality=0.5)
# model_quality(0.18, 100000, 0.5) ≈ 25*1.485 + 20*0.5*0.999 ≈ 47.1
# Even 10x more data can't overcome the base model gap!
```
Let's put all the paper's findings together into one interactive simulation. This explorer lets you design your own instruction-tuning experiment: choose a base model, pick a dataset (or mix datasets), set the data size, and see predicted performance across all four benchmarks.
The predictions are based on the paper's experimental results, interpolated to cover combinations the paper didn't test directly. Use this to build intuition about the tradeoffs in instruction tuning — which choices matter most, and where the diminishing returns kick in.
Think of this as a "lab" where you can run virtual experiments that would cost thousands of dollars in GPU time. Each combination represents a real experimental condition: a base model initialized with specific pre-trained weights, fine-tuned on a specific instruction dataset for 2 epochs, and evaluated on a standardized benchmark suite. The radar chart shows the predicted performance profile across all four evaluation dimensions.
The best aggregate score you can achieve with this explorer is LLaMA-7B + Tulu Mix at 100K examples. But notice that going from 50K to 100K barely changes the score — confirming the diminishing returns finding. And notice that the single biggest improvement comes from switching the base model from OPT to LLaMA, not from any dataset change.
Notice the patterns as you explore:
| Experiment | What You'll See |
|---|---|
| Change base model only | LLaMA consistently 5-15% ahead of Pythia/OPT at same size |
| Change dataset only | Rankings differ by benchmark — no single dataset wins all |
| Increase data size | Sharp gains up to 10K, then diminishing returns |
| Use Tulu Mix | Best aggregate performance — mixture beats any single source |
The explorer also shows something subtle: the interaction between base model and dataset. Some datasets pair better with some models. For example, reasoning-focused datasets (Alpaca-CoT) show a larger gap between LLaMA and Pythia than factual datasets (FLAN), suggesting that the base model's capacity is more important for reasoning than for factual recall.
Here's a concrete thought experiment to build intuition. Imagine you have a $1,000 budget. How should you allocate it?
| Strategy | What You Get | Expected Quality |
|---|---|---|
| Option A | $0 data (use free FLAN) + $1000 compute → LLaMA-13B + FLAN | Good benchmarks, poor chat |
| Option B | $500 curating 10K examples + $500 compute → LLaMA-7B + quality mix | Great chat, decent benchmarks |
| Option C | $200 scraping 100K examples + $800 compute → LLaMA-7B + noisy data | Mediocre everything |
The paper's results suggest Option B is almost always the right choice for real applications. The 10K curated examples produce better user-facing quality than 100K scraped examples, and because a small dataset is cheap to train on, the remaining compute budget can go toward a larger base model (or more training runs to tune hyperparameters).
```python
# The Tulu mixture recipe that won
# Weighting based on paper experiments
tulu_mix = {
    "flan_v2": {
        "weight": 0.30,    # broadest task coverage
        "samples": 25000,  # subsampled for efficiency
        "strengths": ["MMLU", "TydiQA"],
    },
    "sharegpt": {
        "weight": 0.35,    # best chat-style responses
        "samples": 30000,  # real user conversations
        "strengths": ["AlpacaEval", "human_pref"],
    },
    "oasst1": {
        "weight": 0.15,    # multi-turn + safety
        "samples": 10000,  # crowd-ranked quality
        "strengths": ["safety", "multi_turn"],
    },
    "alpaca_cot": {
        "weight": 0.15,    # reasoning chains
        "samples": 10000,  # GPT-3.5 + CoT
        "strengths": ["BBH", "reasoning"],
    },
    "dolly": {
        "weight": 0.05,    # human-written supplement
        "samples": 5000,   # highest per-example quality
        "strengths": ["accuracy"],
    },
}
# Total: ~80K examples from 5 complementary sources
# This mixture became the template for Tulu 2, OpenHermes, etc.
```
The Camels paper was published at NeurIPS 2023, and its findings shaped the trajectory of open instruction tuning. Let's trace its influence and connections to both predecessors and successors.
Predecessors:
The paper builds directly on the instruction tuning paradigm introduced by FLAN (Wei et al., 2022) and the self-instruct method (Wang et al., 2022). FLAN showed that fine-tuning on many tasks phrased as instructions improves zero-shot generalization. Self-Instruct showed that LLMs can generate their own training data. Stanford Alpaca combined these ideas: use Self-Instruct with GPT-3.5 to generate data, then fine-tune LLaMA. The Camels paper asks: of all these approaches, which actually works best?
The connection to InstructGPT (Ouyang et al., 2022) is also important. InstructGPT showed that instruction tuning + RLHF produces dramatic improvements in human preference, but it was proprietary. The Camels paper can be seen as the open-source community's attempt to replicate the SFT portion of InstructGPT's recipe, systematically exploring which open data best substitutes for OpenAI's proprietary instruction data. (The RLHF portion would come later with Tulu 2's DPO.)
The paper also connects to the scaling laws literature (Kaplan et al., Hoffmann et al./Chinchilla). Just as pre-training follows scaling laws (more compute = better performance), instruction tuning also follows scaling laws — but with a different shape. Pre-training benefits from more data roughly log-linearly. Instruction tuning shows much steeper diminishing returns, saturating around 50-100K examples.
The scaling laws connection is worth elaborating. Chinchilla (Hoffmann et al., 2022) showed that for pre-training, the optimal strategy is to scale data and parameters equally. The Camels paper shows that instruction tuning follows a fundamentally different scaling law. In pre-training, doubling the data always helps (at least up to current scales). In instruction tuning, doubling the data helps early but hits diminishing returns quickly. The difference comes from what's being learned: pre-training learns knowledge (which benefits from more data), while instruction tuning learns an interface (which saturates quickly).
Successors:
The Tulu project continued with Tulu 2 (Ivison et al., 2023), which extended the analysis to Llama 2, added DPO preference optimization as a second training stage, and achieved state-of-the-art open-model performance. Tulu 2 validated the quality-over-quantity finding with more data and added the insight that DPO adds a complementary "alignment" dimension on top of instruction tuning.
The broader community picked up the quality-over-quantity finding quickly. LIMA (Zhou et al., 2023) pushed this to the extreme: just 1,000 carefully curated examples (selected by the authors themselves) produced a model that was competitive with GPT-4 on many tasks. LIMA's slogan, "Less Is More for Alignment," closely parallels the Camels finding, and the two papers appeared within weeks of each other. LIMA's authors manually selected each of the 1,000 examples from diverse sources (StackExchange, wikiHow, Reddit, and expert-written examples), ensuring every single one was an exemplar of a helpful, well-structured response. This curation effort took weeks but produced better results than millions of auto-generated examples.
The Orca papers (Microsoft, 2023) took a different approach to the quality question. Instead of curating real data, they generated synthetic data using GPT-4 — but with a twist: they asked GPT-4 to explain its reasoning step-by-step. These "reasoning traces" proved far more valuable as training data than simple input-output pairs. A single Orca-style example with detailed reasoning is worth many simple examples, because it teaches the model how to think, not just what to say.
| Paper | Relationship | Key Extension |
|---|---|---|
| FLAN (2022) | Predecessor | Introduced instruction tuning at scale |
| Self-Instruct (2022) | Predecessor | LLM-generated training data pipeline |
| Alpaca (2023) | Predecessor | Combined Self-Instruct + LLaMA fine-tuning |
| LIMA (2023) | Parallel work | 1K examples can be enough — quality is paramount |
| Tulu 2 (2023) | Direct sequel | Added DPO, scaled to Llama 2, new benchmarks |
| Orca (2023) | Influenced by | High-quality synthetic data via GPT-4 reasoning traces |
| OpenHermes (2023) | Influenced by | Curated dataset mixture, community-driven |
The paper also has methodological impact. Its controlled-experiment framework — vary one factor at a time, evaluate on multiple benchmarks — became the standard for instruction-tuning research. Before Camels, papers would introduce a new dataset and show it works on one benchmark with one model. After Camels, reviewers expected ablations across models, datasets, and evaluation suites.
The open-source nature of the project amplified its impact. All code, all model weights, all evaluation scripts, and all intermediate results are publicly available on the AI2 GitHub. This means any researcher can reproduce the experiments, extend them with new datasets or models, or build on the findings for their own work. The Tulu model weights were downloaded thousands of times in the first month alone, and the evaluation framework was adopted by multiple independent projects.
Finally, the paper's timing was perfect. It was published just as the open-source LLM community was transitioning from "can we make any instruction-following model at all?" to "how do we make the best one?" The Camels paper provided the roadmap at exactly the moment it was needed, making it one of the most influential papers of the open instruction-tuning era.
Looking forward, the quality-over-quantity insight extends beyond instruction tuning to other alignment techniques. RLHF with high-quality human preferences (fewer but more expert annotators) outperforms RLHF with massive but noisy crowdsourced preferences. DPO with carefully chosen preference pairs outperforms DPO with automatically generated pairs. The pattern is consistent: in the fine-tuning regime, data quality dominates data quantity.
The paper's practical impact can be summarized in a recipe that the community adopted:
```python
# The Tulu recipe, now standard for open instruction-tuned models.
# load_model, weighted_mix, sft_train, dpo_train, and preference_data are
# placeholders for your own training stack, not a specific library's API.

# Step 1: Base model selection (biggest impact on ceiling)
base = load_model("meta-llama/Llama-2-7b-hf")

# Step 2: Data mixture (quality > quantity)
data = weighted_mix([
    ("flan_v2", 0.3),     # task diversity + benchmarks
    ("sharegpt", 0.35),   # natural conversation style
    ("oasst", 0.15),      # multi-turn + safety
    ("cot_data", 0.1),    # reasoning chains
    ("code_data", 0.1),   # coding ability
])  # Total: ~75K examples

# Step 3: SFT
sft_model = sft_train(base, data, epochs=2, lr=2e-5)

# Step 4: DPO (Tulu 2 addition)
final_model = dpo_train(sft_model, preference_data, beta=0.1)
```
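Step 4 is the least self-explanatory part of the recipe, so here is a minimal sketch of the per-pair DPO loss (Rafailov et al., 2023) that a `dpo_train` step would minimize. It assumes you already have summed sequence log-probabilities for the chosen and rejected responses under both the policy and the frozen reference (SFT) model. The function name and the example numbers are illustrative, and this is not the Tulu 2 training code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == softplus(-margin), computed stably
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: no preference learned yet
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))

# Policy raises the chosen response and lowers the rejected one
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss starts at log 2 when the policy equals the reference and falls as the policy assigns relatively more probability to the chosen response. `beta` controls how strongly deviations from the reference are rewarded.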
Let's look at how the open-model landscape evolved in the year after this paper:
| Date | Model | Innovation (inspired by Camels) |
|---|---|---|
| Jun 2023 | Orca (Microsoft) | GPT-4 reasoning traces as training data — quality-first |
| Jul 2023 | Llama 2 + Chat | Meta's RLHF pipeline, open-weight alignment |
| Oct 2023 | Zephyr (HF) | SFT on UltraChat, then DPO on UltraFeedback (distilled alignment) |
| Oct 2023 | OpenHermes 2.5 | Community-curated dataset mixture, Mistral base |
| Nov 2023 | Tulu 2 | Paper's own sequel — DPO added, Llama 2 base |
| Dec 2023 | Mistral-Instruct | Better base model + quality instruction data |
| Feb 2024 | Gemma-Instruct | Google's entry — quality data, modern base |
Every one of these models follows the template established by the Camels paper: (1) start with the best available base model, (2) curate high-quality instruction data (emphasizing quality over quantity), (3) train with standard SFT, (4) optionally add preference optimization. The template works. It's become the de facto recipe for open instruction-tuned models.
Perhaps the most lasting contribution is methodological. Before this paper, instruction-tuning papers were hard to compare because everyone used different setups. After this paper, the community converged on a standard evaluation pattern: train on your data, evaluate on MMLU + BBH + AlpacaEval + at least one other metric. The Open LLM Leaderboard, launched in 2023, reflects the same multi-benchmark philosophy. It became the standard for comparing open models, and the Camels paper's demonstration that single-benchmark evaluation is misleading helps explain why that design caught on.