Simula: Reasoning-Driven Synthetic Data

0. The Data Bottleneck

You want to build a model that classifies cybersecurity threats. Or answers Swiss law exam questions. Or solves math problems in Nepali. In all three cases, you face the same wall: the data you need doesn't exist.

Internet data built the generalists — GPT, Gemini, Claude. But the specialized AI that actually ships into production needs specialized data, and specialized data is inherently scarce, expensive to collect, and often locked behind privacy constraints.

The obvious fix: generate it synthetically. Use a large model to create training data for a smaller model. But the obvious fix has obvious failure modes:

The three synthetic data traps
1. Mode collapse. Prompt an LLM with "generate a math problem" 10,000 times and you get 10,000 variations of the same 50 problems it knows best. No coverage of edge cases.
2. No quality signal. The LLM that generated the data can't reliably tell you which outputs are wrong. Sycophancy bias makes it agree with its own mistakes.
3. No control. You can't say "make 40% of the data hard, cover all 12 subtopics, and ensure geographic diversity." Existing methods treat the generation process as a black box.

Most existing approaches try to solve one of these at a time. Evolutionary prompt methods (like PromptBreeder) evolve prompts to increase diversity but lose explainability. Manual prompt engineering gives control but doesn't scale. Seed-data methods (like LLM2LLM) start from real examples but require the very data you're trying to replace.

The question this paper asks: can we design a synthetic data generation mechanism — not just better prompts, but a full system — that simultaneously controls diversity, complexity, and quality, at scale, without seed data?

Check: Why is "just generate more data" insufficient for synthetic data?

It's too expensive to generate lots of data

More data from the same distribution doesn't add new information

LLMs can't generate text of sufficient quality

Synthetic data always hurts model performance

1. The Key Insight: Reason First, Generate Second

Here's the core idea behind Simula: don't start by generating data. Start by reasoning about what data should exist.

Think about how a domain expert would design a dataset. They wouldn't immediately write 10,000 examples. They'd first ask: "What are the important axes of variation? What subtopics must we cover? What difficulty levels do we need?" They'd build a map of the space before filling it in.

Simula makes an LLM do exactly this. The framework has three stages:

Stage 1: Map

Build taxonomies of the concept space. Decompose "cybersecurity questions" into factors like threat-type, attack-vector, difficulty-level — then expand each into a hierarchical tree.

Stage 2: Synthesize

Sample from the taxonomy to create "mixes" of requirements. Convert each mix into a meta-prompt. Apply complexity augmentation. Generate data proposals with agentic refinement.

Stage 3: Evaluate

Double-critic filtering for quality. Calibrated batch scoring for complexity. Taxonomy-based coverage metrics for diversity. Each data point has a full audit trail.

Why "reasoning-first" is future-proof
Because every component of Simula relies on reasoning capabilities, the system automatically improves as the underlying LLM improves. Better reasoning → better taxonomies → better coverage → better data. No component needs to be re-engineered when the model gets smarter. This is the opposite of hand-tuned prompt engineering, which becomes obsolete with every model generation.

The paper calls this mechanism design — and argues it's an underexplored research axis. Most synthetic data work asks "what is good data?" Simula asks "how do you generate good data?" The mechanism determines the properties; the properties determine the downstream performance.

Let's build up each stage.

Check: What does Simula do BEFORE generating any data points?

Fine-tunes the teacher model on existing benchmarks

Builds taxonomies that map the concept space to ensure coverage

Collects seed data from the target domain

Runs evolutionary prompt optimization

2. Taxonomies: Mapping the Concept Space

Suppose your dataset description is: "A dataset of stories about cats." This is hopelessly underspecified. The space of all datasets matching this description is infinite. How do you ensure you cover it?

Simula's answer: decompose the space into factors of variation, then expand each factor into a hierarchical taxonomy.

An LLM proposes factors for our cat-story dataset: "cat type," "story format," "intended audience." Each factor becomes a tree. "Cat type" branches into "domestic" → "shorthair" → "British Shorthair." "Story format" branches into "poem" → "haiku," "prose" → "short story," etc. The product of these trees defines a discrete, sample-able approximation of the concept space.

From equation (1) in the paper
Given user instructions y, optional sample data S, and factor specifications (d_i, f_i) where d_i is the desired depth:

$M^3(y, S, (d_0, f_0), \ldots, (d_K, f_K)) = \{T_i\}_{i=0}^{K} = T_y$

Each T_i is a hierarchical tree where root = broad factor, leaves = specific instantiations. More factors and deeper trees = sharper coverage control.
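To make the "product of trees" concrete, here is a minimal sketch. The toy taxonomies, the nested-dict representation, and the helper names `nodes_at_depth` and `node_sets` are illustrative assumptions, not the paper's data structures.

```python
from itertools import product

# Hypothetical toy taxonomies for the cat-story example. Each tree is a nested
# dict: keys are node names, and an empty dict marks a leaf.
taxonomies = {
    "cat type": {
        "domestic": {
            "shorthair": {"British Shorthair": {}},
            "longhair": {"Maine Coon": {}},
        },
        "wild": {"lynx": {}},
    },
    "story format": {"poem": {"haiku": {}}, "prose": {"short story": {}}},
    "intended audience": {"children": {}, "adults": {}},
}

def nodes_at_depth(tree, depth):
    """Return root-to-node paths at `depth` (1 = top level).

    Branches that end before `depth` contribute their leaf instead.
    """
    paths = []

    def walk(subtree, path):
        for name, children in subtree.items():
            new_path = path + (name,)
            if len(new_path) == depth or not children:
                paths.append(new_path)
            else:
                walk(children, new_path)

    walk(tree, ())
    return paths

def node_sets(taxonomies, depth):
    """Cartesian product of the per-factor nodes: one tuple per node-set."""
    per_factor = [nodes_at_depth(tree, depth) for tree in taxonomies.values()]
    return list(product(*per_factor))

shallow = node_sets(taxonomies, depth=1)  # top-level nodes only
deep = node_sets(taxonomies, depth=3)     # most specific nodes available
print(len(shallow), len(deep))
```

Sampling at a deeper level yields more, and more specific, node-sets from the same trees, which is the sense in which depth sharpens coverage control.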

But there's a catch: as taxonomies grow deeper, you risk progressive coverage loss. The LLM might generate "domestic" → "shorthair" but miss "domestic" → "longhair" entirely. To mitigate this, Simula uses a three-step expansion algorithm:

1. Best-of-N proposals: For each node, prompt the LLM N times to propose children. Pooling the N responses widens the set of candidate children and catches edge cases that a single call would miss.
2. Critic refinement: A separate LLM call reviews the proposals, adding missing nodes, removing redundancies, and improving specificity. This exploits the generator-critic gap: models are often better at evaluating completeness than generating it in one shot.
3. Level planning: After generating all nodes at one level, the LLM generates a "plan" for the next level, ensuring consistent granularity across branches.
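The three steps above can be sketched as a small loop. This is a sketch under stated assumptions: `propose`, `critic`, and `plan` stand in for LLM calls, and the `fake_*` functions are hypothetical stubs that return canned answers so the example runs without a model.

```python
import itertools

def expand_node(node, propose, critic, n=3):
    """Best-of-N: pool children from n proposal calls, then run one critic pass."""
    pooled = []
    for _ in range(n):
        for child in propose(node):
            if child not in pooled:  # keep first occurrence, drop exact repeats
                pooled.append(child)
    return critic(node, pooled)

def expand_level(nodes, propose, critic, plan, n=3):
    """Expand every node at one level, then request a plan for the next level."""
    children = {node: expand_node(node, propose, critic, n) for node in nodes}
    flat = [child for kids in children.values() for child in kids]
    return children, plan(flat)

# --- Hypothetical stubs standing in for LLM calls ---
_canned = itertools.cycle([["shorthair"], ["shorthair", "hairless"], ["shorthair cats"]])

def fake_propose(node):
    return next(_canned)

def fake_critic(node, proposals):
    # Stand-in critic: remove a redundant near-duplicate, add a missing sibling.
    kept = [p for p in proposals if p != "shorthair cats"]
    if "longhair" not in kept:
        kept.append("longhair")
    return kept

def fake_plan(children):
    return f"Expand each of the {len(children)} nodes into specific breeds."

children, next_plan = expand_level(["domestic"], fake_propose, fake_critic, fake_plan)
print(children)
print(next_plan)
```

In this toy run the critic is what recovers "longhair", the branch that none of the three proposals contained.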

Each level adds granularity, but it also adds the risk of missing branches. The balance between coverage and granularity is a design choice that Simula leaves to the user.

Taxonomy quality: the empirical result
Table 2 in the paper shows that Simula-generated taxonomies achieve 0.78 completeness and 0.97 soundness on conceptual taxonomies, measured against human-expert taxonomies. The generator-critic loop raised completeness from 0.52 (0-shot) to 0.78, a 50% relative improvement. Simula also achieved a novelty score of 0.94 (relevant nodes that the experts missed), giving 1.72x total coverage compared with expert-only taxonomies.

Check: Why does Simula use Best-of-N proposals + critic refinement instead of a single generation call?

To make the process more expensive and thus higher quality

Because models are better at evaluating completeness than generating it in one shot

To prevent the model from memorizing the taxonomy

Because single calls always produce empty outputs

3. Controlled Synthesis: From Taxonomy to Data

We have our taxonomies. Now we need to turn tree nodes into actual data points. This happens in two phases: taxonomic sampling and agentic refinement.

Phase 1: Taxonomic Sampling

The LLM first formulates sampling strategies — rules for which taxonomies can be combined and with what weight. This prevents nonsensical combinations (e.g., "mature horror novel" + "intended audience: toddlers"). Each strategy defines a compatible subset of taxonomies.

Given a strategy, the framework samples nodes from the corresponding taxonomies. These sampled nodes become data point requirements. For example:

Worked example
Sampled nodes: {house cat, poem, travel enthusiast}
+ Dataset instructions: "A dataset of stories about cats"
↓ Meta-prompt: "Compose an exciting haiku about a house cat who goes on an adventure."
↓ Generated output: "Paws on foreign soil / whiskers catch the monsoon wind / home is where I roam"
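A rough sketch of strategy-constrained sampling follows. The `leaves` and `strategies` tables, the weights, and the wording of the request string are all made-up illustrations; in Simula the strategies are formulated by the LLM, not hand-written.

```python
import random

# Hypothetical leaf nodes per factor (flattened for brevity).
leaves = {
    "cat type": ["house cat", "lynx"],
    "story format": ["poem", "short story"],
    "audience": ["travel enthusiast", "toddler"],
    "genre": ["mature horror", "adventure"],
}

# Hypothetical strategies: each lists factors that may be combined, plus a weight.
# "genre" and "audience" never share a strategy, so "mature horror" can never
# be paired with "toddler".
strategies = [
    {"factors": ["cat type", "story format", "audience"], "weight": 0.7},
    {"factors": ["cat type", "genre"], "weight": 0.3},
]

def sample_mix(rng):
    """Pick a strategy by weight, then one node per factor in that strategy."""
    weights = [s["weight"] for s in strategies]
    strategy = rng.choices(strategies, weights=weights, k=1)[0]
    return {factor: rng.choice(leaves[factor]) for factor in strategy["factors"]}

def meta_prompt_request(instructions, mix):
    """Build the text that would be sent to the LLM to obtain a meta-prompt."""
    requirements = "; ".join(f"{factor} = {node}" for factor, node in mix.items())
    return (
        f"Dataset: {instructions}\n"
        f"Requirements: {requirements}\n"
        "Write one prompt that asks for a data point meeting every requirement."
    )

rng = random.Random(0)
mix = sample_mix(rng)
print(meta_prompt_request("A dataset of stories about cats", mix))
```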

Phase 2: Optimizing Local Diversity and Complexity

This is where things get interesting. Consider the math:

You want N = 100 data points. Your taxonomy produces V = 200 unique node-sets. Since N < V, you can cover at most 100/200 = 50% of the space. This is your global coverage ratio.

Now flip it: N = 800, V = 200. You can generate 4 meta-prompts per node-set and cover everything. But as N/V grows, independently generating meta-prompts from the same requirements leads to mode collapse — increasingly similar outputs.
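The arithmetic in the two scenarios above, as a small helper. `coverage_plan` is an illustrative name, not something from the paper.

```python
def coverage_plan(n_points, n_node_sets):
    """Return (max global coverage ratio, data points per node-set)."""
    ratio = n_points / n_node_sets
    max_coverage = min(1.0, ratio)  # coverage cannot exceed 100%
    return max_coverage, ratio

print(coverage_plan(100, 200))  # N < V: coverage-limited
print(coverage_plan(800, 200))  # N > V: several points per node-set
```

When the second value is below 1, the budget is spent on coverage. When it is above 1, each node-set is reused and local diversity becomes the concern.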

Simula mitigates mode collapse in two ways:

Batch meta-prompting: Generate multiple meta-prompts simultaneously (so the model sees them all in context and can diversify), then sub-sample.
Complexification: A user-defined fraction c (default 0.5) of meta-prompts are passed through a "complexification" step where the LLM increases their difficulty while maintaining the original requirements.
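A minimal sketch of the two mitigations, assuming `generate_batch` and `complexify` are LLM calls. The stubs below are hypothetical and only tag strings, so the control flow can be run without a model.

```python
import random

def batch_meta_prompts(mix, batch_size, keep, generate_batch, rng):
    """Ask for `batch_size` meta-prompts in ONE call, then sub-sample `keep` of them."""
    candidates = generate_batch(mix, batch_size)
    return rng.sample(candidates, keep)

def complexify_fraction(prompts, c, complexify, rng):
    """Pass a fraction `c` of the meta-prompts through the complexification step."""
    n_hard = round(c * len(prompts))
    chosen = set(rng.sample(range(len(prompts)), n_hard))
    return [complexify(p) if i in chosen else p for i, p in enumerate(prompts)]

# --- Hypothetical stubs standing in for LLM calls ---
def fake_generate_batch(mix, batch_size):
    return [f"{mix} (variant {i})" for i in range(batch_size)]

def fake_complexify(prompt):
    return prompt + " [harder]"

rng = random.Random(0)
prompts = batch_meta_prompts("house cat + poem", 16, 8, fake_generate_batch, rng)
final = complexify_fraction(prompts, c=0.5, complexify=fake_complexify, rng=rng)
print(sum(p.endswith("[harder]") for p in final), "of", len(final), "complexified")
```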

Global vs. Local diversification
Global: Use deep taxonomy levels for sampling. Node "math → algebra → quadratic equations" is more specific than just "math." This controls what topics appear.
Local: Use meta-prompting and complexification to generate diverse interpretations of the same topic. "Different takes on a quadratic equation problem."

The paper's key finding: Global and Local are additive. Each captures a different type of diversity. Using both simultaneously always matches or beats either alone.

Check: When N/V is very large (many data points per node-set), what problem arises and how does Simula mitigate it?

The model runs out of memory; use smaller batches

Mode collapse from repeatedly generating from the same requirements; batch meta-prompting + complexification

The taxonomy becomes invalid; regenerate from scratch

The data becomes too simple; increase model temperature

4. The Double Critic: Quality Through Skepticism

You've generated thousands of data points. Some are wrong. A math problem with an incorrect answer. A cybersecurity question where the "correct" choice is actually wrong. A legal exam question that contradicts the source material. How do you catch these?

The obvious approach: ask the same LLM to verify its outputs. But there's a well-documented problem: sycophancy bias. When you show a model its own output and ask "is this correct?", it's biased toward saying yes. It generated the answer; of course it thinks it's right.

Simula's fix is elegant: the double critic. Instead of asking one question ("is this correct?"), ask two independent questions:

Critic A

"Here is a question and answer. Is the answer CORRECT? Explain your reasoning, then give a binary verdict."

Critic B

"Here is a question and answer. Is the answer INCORRECT? Explain your reasoning, then give a binary verdict."

A sample is accepted only if Critic A says "correct" AND Critic B says "not incorrect." This dual query breaks the sycophancy pattern: asking "is this incorrect?" forces the model to look for flaws rather than confirm its prior.
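The acceptance rule can be written in a few lines. In this sketch `critic_a` and `critic_b` return the binary verdicts of the two prompts; the stub critics and the `reference` table are hypothetical stand-ins for LLM calls.

```python
def double_critic_accept(question, answer, critic_a, critic_b):
    """Accept only if Critic A says 'correct' AND Critic B says 'not incorrect'."""
    says_correct = critic_a(question, answer)    # verdict for "Is the answer CORRECT?"
    says_incorrect = critic_b(question, answer)  # verdict for "Is the answer INCORRECT?"
    return says_correct and not says_incorrect

def filter_samples(samples, critic_a, critic_b):
    """Rejection sampling: keep only the (question, answer) pairs that pass both critics."""
    return [(q, a) for q, a in samples if double_critic_accept(q, a, critic_a, critic_b)]

# --- Hypothetical stubs ---
reference = {"2 + 2": "4"}

def sycophantic_critic_a(question, answer):
    return True  # always confirms, mimicking sycophancy bias

def flaw_finding_critic_b(question, answer):
    return reference[question] != answer  # True means "yes, this is incorrect"

samples = [("2 + 2", "4"), ("2 + 2", "5")]
accepted = filter_samples(samples, sycophantic_critic_a, flaw_finding_critic_b)
print(accepted)
```

In this toy setup Critic A alone would accept both samples; the second query is what rejects the wrong answer.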

The paper evaluates double-critic rejection sampling on the MATH dataset. The key numbers:

Double-critic results on MATH (Figure 3)
Controlled setting: Given a mix of correct and corrupted answers, the critic achieves consistent "lift" — accepted accuracy is always higher than baseline generation accuracy. At complexity level 1: baseline 96% → post-critic 99%. At level 5: baseline 68% → post-critic 78%.

The cost curve: Higher complexity requires higher rejection rates to maintain the accuracy gain. At level 5, about 27% of samples are rejected to achieve a 10-point accuracy lift. At level 1, only 5% are rejected for a 3-point lift.

Empirical setting: The critic transfers to the model's own outputs (not just controlled corruptions), but with reduced effectiveness — the accuracy lift is smaller because the model's own errors are subtler than deliberately corrupted answers.

Beyond answer correctness, the critic also performs semantic verification: does the generated output actually match the taxonomy requirements? If the meta-prompt required a haiku about a house cat adventure, the critic checks: Is this a haiku? Is there a house cat? Is there an adventure? Point-wise requirement checks ensure the data fits the taxonomy specification.

The generator-critic gap
Why can a model critique its own outputs effectively? Because generation and verification are different tasks with different difficulty. Generating a correct proof is harder than checking one. Writing a subtle incorrect answer is harder than spotting it (especially when framed as "find the flaw"). The double-critic exploits this asymmetry. It's the same principle behind Best-of-N sampling but applied to quality, not diversity.

Check: Why does the double critic ask "is this INCORRECT?" separately instead of just asking "is this correct?"

To increase the cost and thoroughness of evaluation

To break sycophancy bias by forcing the model to actively look for flaws

Because the model can only answer binary yes/no questions

To ensure the model reads the question twice

5. Evaluation: Measuring What You Made

You've generated a synthetic dataset. How do you know it's good? Simula proposes evaluation tools for all three axes: diversity, complexity, and quality.

Measuring Diversity

Embedding-based metrics are the standard approach: embed all data points, compute pairwise cosine distances. Higher average distance = more diverse. Simula computes this both dataset-wide (global diversity) and among k=10 nearest neighbors (local diversity).
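A pure-Python sketch of the two embedding metrics. The 2-D vectors below are a made-up toy (real embeddings have hundreds of dimensions), and averaging the k nearest-neighbor distances per point is one reasonable reading of "local diversity", not necessarily the paper's exact formula.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def global_diversity(embeddings):
    """Mean cosine distance over all pairs in the dataset."""
    dists = [cosine_distance(u, v) for u, v in combinations(embeddings, 2)]
    return sum(dists) / len(dists)

def local_diversity(embeddings, k=10):
    """Mean cosine distance from each point to its k nearest neighbors."""
    per_point = []
    for i, u in enumerate(embeddings):
        dists = sorted(
            cosine_distance(u, v) for j, v in enumerate(embeddings) if j != i
        )
        nearest = dists[:k]
        per_point.append(sum(nearest) / len(nearest))
    return sum(per_point) / len(per_point)

# Toy data: two tight clusters that are far apart from each other.
toy = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
print(global_diversity(toy), local_diversity(toy, k=1))
```

The toy set scores high globally (the clusters are far apart) but low locally (each point has a near-duplicate), which is the pattern mode collapse within a topic would produce.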

But embedding distances are coarse — they tell you "this dataset is more spread out in embedding space" without saying what's missing. So Simula adds taxonomy-based coverage: for each data point, an LLM assigns it to the most relevant taxonomy node. You can then compute what fraction of nodes at each taxonomy level are covered. This gives a fine-grained, actionable map of gaps.
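Taxonomy-based coverage reduces to set arithmetic once each data point has been assigned a node. In this sketch the assignments are hard-coded; in Simula they would come from an LLM. The function and variable names are illustrative.

```python
def level_coverage(taxonomy_levels, assignments):
    """Fraction of taxonomy nodes covered at each level.

    taxonomy_levels: {level: set of node paths (tuples) at that level}
    assignments: one node path per data point
    """
    coverage = {}
    for level, nodes in taxonomy_levels.items():
        hit = {path[:level] for path in assignments if len(path) >= level}
        coverage[level] = len(hit & nodes) / len(nodes)
    return coverage

def missing_nodes(taxonomy_levels, assignments, level):
    """The actionable part: which nodes at `level` have no data points yet."""
    hit = {path[:level] for path in assignments if len(path) >= level}
    return taxonomy_levels[level] - hit

# Hypothetical two-level taxonomy and three assigned data points.
levels = {
    1: {("domestic",), ("wild",)},
    2: {("domestic", "shorthair"), ("domestic", "longhair"), ("wild", "lynx")},
}
assigned = [("domestic", "shorthair"), ("domestic", "shorthair"), ("wild", "lynx")]

coverage = level_coverage(levels, assigned)
gaps = missing_nodes(levels, assigned, level=2)
print(coverage, gaps)
```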

Measuring Complexity

How complex is a single data point? This is tricky because:

Synthetic data generation is unsupervised (no labels)
Real data rarely has complexity annotations
"Complexity" is relative — hard for whom?

Simula's solution: calibrated batch scoring with Elo ratings.

The scoring algorithm
1. Sample batches of data points, each point appearing K times across batches.
2. For each batch, the LLM assigns relative complexity scores (calibrated against other batch members).
3. Convert batch scores into pairwise comparisons.
4. Compute Elo ratings from the pairwise comparisons.

Why batches? Per-sample scoring is noisy (the model overcommits to each individual judgment). Batch scoring forces calibration — the model must rank items relative to each other, producing more stable and comparable scores.
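Steps 3 and 4 of the scoring algorithm can be sketched as follows. This uses the textbook sequential Elo update with K-factor 32 and a starting rating of 1000; the paper's exact Elo computation and constants are not given here, so treat those as assumptions. The batch scores are made up.

```python
from itertools import combinations

def batch_to_pairs(batch_scores):
    """Turn one batch of relative scores into (more_complex, less_complex) pairs."""
    pairs = []
    for a, b in combinations(batch_scores, 2):
        if batch_scores[a] > batch_scores[b]:
            pairs.append((a, b))
        elif batch_scores[b] > batch_scores[a]:
            pairs.append((b, a))
        # ties produce no comparison
    return pairs

def elo_ratings(pairs, k=32.0, base=1000.0):
    """Standard sequential Elo update; the 'winner' is the more complex item."""
    ratings = {}
    for winner, loser in pairs:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        expected_win = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings

# Hypothetical LLM scores for two overlapping batches (p2 and p3 appear twice).
batches = [
    {"p1": 1, "p2": 3, "p3": 5},
    {"p2": 2, "p3": 4, "p4": 1},
]
pairs = [pair for batch in batches for pair in batch_to_pairs(batch)]
ratings = elo_ratings(pairs)
print(sorted(ratings, key=ratings.get))
```

The raw scores are not comparable across batches (p2 is scored 3 in one and 2 in the other), but the pairwise outcomes are, which is why the ratings land on a common scale.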

The Elo scores enable cross-dataset comparison: you can place synthetic and real data points on the same complexity scale and see where the distributions overlap or diverge.

Validation: Do Complexity Scores Mean Anything?

The paper validates on MATH (which has human complexity labels 1–5): model-assigned Elo scores align with human ratings. Furthermore, stratified by human complexity, rejected samples consistently have higher Elo scores than accepted ones. The critic is systematically harder on complex items — exactly what you'd expect if the scoring is meaningful.

Check: Why does Simula use batch-wise Elo scoring instead of per-sample complexity scores?

Because Elo was originally designed for language models

Per-sample scoring is noisy; batch comparison forces calibration and produces more stable rankings

To reduce the number of LLM calls needed

Because individual data points have no inherent complexity

6. Experimental Setup

The paper tests Simula with a carefully designed ablation study across five diverse datasets.

System Versions

To isolate the contribution of each component, five configurations are evaluated:

Version	Taxonomy Depth	Meta-Prompting	Critic
Baseline	1 (top-level only)	No	No
+Local	1	Yes + complexify	No
+Global	All levels	No	No
+Local+Global	All levels	Yes + complexify	No
Full System	All levels	Yes + complexify	Yes

Datasets

Chosen to span niche (where real data is scarce) and popular (where benchmarks exist) domains:

Dataset	Domain	Task	Test Size
CTI-MCQ	Cybersecurity	Multiple-choice (4 options)	2,500
CTI-RCM	Cybersecurity	Open-ended CWE generation	1,000
LEXam	Legal (Swiss/EU/Intl law)	Multiple-choice	1,660
GSM8k	Math	Multi-step word problems	1,320
Global MMLU	Math/CS/Physics	Multiple-choice (3 langs)	1,440/lang

The downstream setup
Teacher: Gemini 2.5 Flash (non-thinking) generates all synthetic data.
Student: Gemma 3 4B is fine-tuned on the synthetic data via LoRA.
Evaluation: Student accuracy on the real benchmark test set.
Scale: 512k unique data points per dataset, tested at sizes 4k → 512k.
Rigor: 10 seeds per configuration, 95% CI via standard error of mean.
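For reference, a confidence interval from the standard error of the mean can be computed as below. The 1.96 multiplier is the normal approximation and is an assumption here; with only 10 seeds a Student-t multiplier (about 2.26) would give a wider interval. The accuracies are invented placeholders, not results from the paper.

```python
import statistics

def mean_ci95(values, z=1.96):
    """Return (mean, half-width) of an approximate 95% CI via the standard error."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5  # sample std / sqrt(n)
    return mean, z * sem

# Hypothetical accuracies from 10 seeds of one configuration.
seed_accuracies = [0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.62, 0.63, 0.60, 0.62]
mean, half_width = mean_ci95(seed_accuracies)
print(f"{mean:.3f} +/- {half_width:.3f}")
```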

The niche datasets (CTI, LEXam) use the instruction-tuned student; the popular ones (GSM8k, MMLU) use the pre-trained student. This tests whether Simula works both for "teaching new skills" and "improving existing ones."

Check: Why does the paper test on both niche (cybersecurity, law) and popular (math, MMLU) domains?

To show Simula only works on niche domains

To test whether the framework generalizes across domains where real data is scarce AND domains with existing benchmarks

Because the authors work in cybersecurity and math

To increase the page count of the paper

7. Results: What Actually Works

The results tell a nuanced story. There is no silver bullet. But there are clear patterns.

Finding 1: Full system is always best

Across all datasets and all data sizes, the full Simula system (Local + Global + Critique) matches or beats every ablated version. Simultaneously optimizing all data axes never hurts. If you don't have strong priors about your domain, use everything.

Finding 2: The Baseline scales worst

The Baseline (top-level taxonomy, no meta-prompting, no critic) scales the worst with data size. Adding more data of the same quality doesn't help much. This is the paper's strongest evidence that the data scaling law is a function of data properties, not data size alone.

Finding 3: Global and Local are additive

On every dataset, combining Global and Local diversification outperforms either alone. But their individual contributions vary:

CTI-MCQ: Local plateaus early; Global gives the long-term gains.
GSM8k: Global scales slightly worse than Baseline alone; Local is essential.
LEXam: Global alone doesn't beat Baseline; the combination does.

Why the asymmetry?
Global diversification controls what topics appear; Local controls how they're presented. Math (GSM8k) has a fixed set of operations — deeper taxonomy levels don't add new math, they add needless specificity. But variations in problem framing (Local) teach the model different approaches. Conversely, cybersecurity (CTI-MCQ) has a sprawling topic space — shallow sampling misses entire threat categories. The right mix depends on the domain.

Finding 4: Teacher-student gap impacts scaling

The student model's performance saturates when it approaches the teacher's capability on that task:

CTI-RCM: Teacher accuracy = 70%, student saturates at ~65% (bridging 83% of the gap). More data beyond 128k doesn't help.
GSM8k: Teacher accuracy = 88%, student reaches 75% and still climbing. Room to grow.

This has a practical implication: if your teacher is weak on the target task, scaling synthetic data has diminishing returns faster. The teacher's ceiling becomes the student's ceiling.
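One way to read "fraction of the gap bridged" is the student's gain divided by the distance between the untrained student and the teacher. That definition is an assumption, and the `student_before` value below is a made-up placeholder, not a number from the paper.

```python
def gap_bridged(student_before, student_after, teacher):
    """Share of the teacher-student gap closed by fine-tuning (assumed definition)."""
    return (student_after - student_before) / (teacher - student_before)

# student_before=40.0 is a hypothetical placeholder; 65 and 70 are from the text.
example = gap_bridged(student_before=40.0, student_after=65.0, teacher=70.0)
print(f"{example:.0%}")
```

Under this reading the metric reaches 100% exactly when the student matches the teacher, which is why the teacher's accuracy acts as the ceiling.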

Finding 5: Critiquing is dataset-dependent

Adding the critic step never hurts, but its impact varies wildly:

Dataset	Rejection Rate	Impact
CTI-MCQ	2%	Negligible
CTI-RCM	9%	Small improvement
GSM8k	9%	Moderate improvement
LEXam	61%	Negligible (teacher too weak)
Global MMLU	3%	Significant improvement

LEXam's 61% rejection rate is striking: the teacher model (Gemini 2.5 Flash) only achieves 57% accuracy on LEXam. When the teacher answers more than four in ten questions incorrectly, it can't reliably generate or critique the data.

Check: Why does the Baseline scale worst with data size?

It uses a smaller model for generation

More data from a shallow, undiversified generation process adds redundancy, not information

It doesn't use GPUs for generation

The Baseline generates data too slowly

8. The Complexity Paradox

Here's the most surprising finding in the paper. Common wisdom says: "harder training data makes better models." The paper shows this is only true when the teacher is competent.

To test this, the authors split the full system's output into three subsets using the Elo complexity scores: Low (bottom 40%), High (top 40%), and All (random subsample). They train separate student models on each split and measure accuracy.
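The split itself is simple once every data point has an Elo score. In this sketch the "All" subset is drawn at the same size as the other two so the comparison is size-matched; that choice, the function name, and the toy scores are assumptions.

```python
import random

def split_by_complexity(elo, frac=0.4, rng=None):
    """Return (low, high, all_random) subsets of item ids, each of equal size."""
    rng = rng or random.Random(0)
    ranked = sorted(elo, key=elo.get)  # ascending complexity
    cut = int(len(ranked) * frac)
    low = ranked[:cut]                 # bottom 40%
    high = ranked[-cut:]               # top 40%
    all_random = rng.sample(ranked, cut)
    return low, high, all_random

# Hypothetical Elo scores for ten data points.
elo = {f"x{i}": 1000 + 10 * i for i in range(10)}
low, high, all_random = split_by_complexity(elo, rng=random.Random(1))
print(low, high, all_random)
```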

The paradox
GSM8k: High Complexity gives a 10-point accuracy gain over Low Complexity at 64k data items. Harder math problems = better math model. Expected.

CTI-MCQ: Same pattern — Low Complexity shows no scaling with data size. Easy cybersecurity questions don't teach anything new.

LEXam: Reversed. Only Low Complexity improves with scale. High Complexity hurts. The teacher (57% accuracy) generates incorrect answers on hard questions. Training on confidently-wrong complex examples teaches the student wrong patterns.

This has a direct practical implication: before cranking up complexity, check your teacher's performance on the target domain. If the teacher is weak, more complex data doesn't teach better — it teaches wrong, confidently. In those cases, start with easy data where the teacher is reliable, or use a stronger teacher for the hard subset.

The lifecycle cost argument
The Baseline uses ~5x fewer inference calls per data point than the full system. So couldn't you just generate 5x more Baseline data for the same cost? The paper's data shows: no. The full system at 64k often beats the Baseline at 512k. Since training costs (GPU hours for fine-tuning) far exceed generation costs (inference calls), a smaller but higher-quality dataset is cheaper across the full lifecycle. You'd rather train on 64k good points than 512k mediocre ones.
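A back-of-the-envelope check of the argument above, measuring generation cost in units of one Baseline data point. It assumes the ratio is exactly 5x and that training cost grows linearly with dataset size; both are simplifications.

```python
CALLS_RATIO = 5  # full system: ~5x the inference calls per data point (from the text)

def generation_cost(n_points, calls_per_point):
    """Generation cost in units of one Baseline data point's worth of inference calls."""
    return n_points * calls_per_point

baseline_512k = generation_cost(512_000, 1)
full_64k = generation_cost(64_000, CALLS_RATIO)
training_ratio = 512_000 // 64_000  # how many times more examples the Baseline run trains on

print(full_64k, baseline_512k, training_ratio)
```

Under these assumptions the 64k full-system dataset needs fewer total inference calls than the 512k Baseline dataset, and the student then trains on 8x fewer examples.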

Check: Under what condition does high-complexity synthetic data HURT downstream performance?

When the dataset is too small

When the teacher model is weak on the target domain, generating confidently wrong answers on hard questions

When the taxonomy is too shallow

High complexity never hurts performance

9. Connections & Cheat Sheet

How Simula Connects to Other Work

Self-Refine (Madaan et al., 2024)

Uses iterative LLM self-feedback to improve outputs. Simula's critic loop is similar but structurally specialized — the double critic breaks sycophancy by asking for correctness and incorrectness independently.

LLM2LLM (Lee et al., 2024)

Iteratively generates data focused on the student model's mistakes. Requires seed data from the target distribution. Simula is seedless — it generates the coverage map from scratch.

Scaling Laws (Kaplan et al., 2020; Hoffmann et al., 2022)

Show that more data = better models. Simula adds a qualifier: more diverse, complex, high-quality data scales; more homogeneous data plateaus. The scaling law is a function of data properties, not size.

Constitutional AI (Bai et al., 2022)

Uses AI feedback to align models. Simula's double-critic is conceptually similar — using model judgment to filter outputs — but applied to data quality rather than alignment.

Cheat Sheet

Concept	What It Is	When to Use
Taxonomy T_i	Hierarchical tree of a factor of variation	Always — defines coverage space
Mix / Node-set	Sampled nodes from multiple taxonomies	Each becomes one data point's "spec"
Meta-prompt	LLM-generated prompt from a mix	Local diversification
Complexification	LLM rewrites to increase difficulty	When teacher is strong on the domain
Double critic	Correct? + Incorrect? dual query	Always; especially for MCQ
Elo complexity	Batch-calibrated difficulty score	To compare distributions, split by difficulty
Level Ratio Coverage	Fraction of taxonomy nodes covered	To find gaps in dataset coverage
N/V ratio	Data points per unique node-set	<1: prioritize coverage. >1: prioritize diversity

Related Lessons

Scaling Test-Time Compute — the inference-time analogue: more compute at test time vs. more data at train time
Constitutional AI — AI-feedback-driven alignment, a conceptual cousin of the double critic
Large Language Monkeys — scaling inference with repeated sampling (Best-of-N), the same principle Simula uses for taxonomy proposals
LLM Power Laws — scaling laws that Simula's results extend to synthetic data

The one-sentence summary
Design the mechanism that generates your synthetic data with the same care you'd design the model that consumes it — because the scaling law is a function of data properties, not data volume.