A seedless, agentic framework that generates controllable synthetic datasets at scale by reasoning about coverage, complexity, and quality — without seed data or manual prompts.
You want to build a model that classifies cybersecurity threats. Or answers Swiss law exam questions. Or solves math problems in Nepali. In all three cases, you face the same wall: the data you need doesn't exist.
Internet data built the generalists — GPT, Gemini, Claude. But the specialized AI that actually ships into production needs specialized data, and specialized data is inherently scarce, expensive to collect, and often locked behind privacy constraints.
The obvious fix: generate it synthetically. Use a large model to create training data for a smaller model. But the obvious fix has obvious failure modes:
The three synthetic data traps
1. Mode collapse. Prompt an LLM with "generate a math problem" 10,000 times and you get 10,000 variations of the same 50 problems it knows best. No coverage of edge cases.
2. No quality signal. The LLM that generated the data can't reliably tell you which outputs are wrong. Sycophancy bias makes it agree with its own mistakes.
3. No control. You can't say "make 40% of the data hard, cover all 12 subtopics, and ensure geographic diversity." Existing methods treat the generation process as a black box.
Most existing approaches try to solve one of these at a time. Evolutionary prompt methods (like PromptBreeder) evolve prompts to increase diversity but lose explainability. Manual prompt engineering gives control but doesn't scale. Seed-data methods (like LLM2LLM) start from real examples but require the very data you're trying to replace.
The question this paper asks: can we design a synthetic data generation mechanism — not just better prompts, but a full system — that simultaneously controls diversity, complexity, and quality, at scale, without seed data?
Check: Why is "just generate more data" insufficient for synthetic data?
It's too expensive to generate lots of data
More data from the same distribution doesn't add new information
LLMs can't generate text of sufficient quality
Synthetic data always hurts model performance
1. The Key Insight: Reason First, Generate Second
Here's the core idea behind Simula: don't start by generating data. Start by reasoning about what data should exist.
Think about how a domain expert would design a dataset. They wouldn't immediately write 10,000 examples. They'd first ask: "What are the important axes of variation? What subtopics must we cover? What difficulty levels do we need?" They'd build a map of the space before filling it in.
Simula makes an LLM do exactly this. The framework has three stages:
Stage 1: Map
Build taxonomies of the concept space. Decompose "cybersecurity questions" into factors like threat-type, attack-vector, difficulty-level — then expand each into a hierarchical tree.
Stage 2: Synthesize
Sample from the taxonomy to create "mixes" of requirements. Convert each mix into a meta-prompt. Apply complexity augmentation. Generate data proposals with agentic refinement.
Stage 3: Evaluate
Double-critic filtering for quality. Calibrated batch scoring for complexity. Taxonomy-based coverage metrics for diversity. Each data point has a full audit trail.
Why "reasoning-first" is future-proof
Because every component of Simula relies on reasoning capabilities, the system automatically improves as the underlying LLM improves. Better reasoning → better taxonomies → better coverage → better data. No component needs to be re-engineered when the model gets smarter. This is the opposite of hand-tuned prompt engineering, which becomes obsolete with every model generation.
The paper calls this mechanism design — and argues it's an underexplored research axis. Most synthetic data work asks "what is good data?" Simula asks "how do you generate good data?" The mechanism determines the properties; the properties determine the downstream performance.
Let's build up each stage.
Check: What does Simula do BEFORE generating any data points?
Fine-tunes the teacher model on existing benchmarks
Builds taxonomies that map the concept space to ensure coverage
Collects seed data from the target domain
Runs evolutionary prompt optimization
2. Taxonomies: Mapping the Concept Space
Suppose your dataset description is: "A dataset of stories about cats." This is hopelessly underspecified. The space of all datasets matching this description is infinite. How do you ensure you cover it?
Simula's answer: decompose the space into factors of variation, then expand each factor into a hierarchical taxonomy.
An LLM proposes factors for our cat-story dataset: "cat type," "story format," "intended audience." Each factor becomes a tree. "Cat type" branches into "domestic" → "shorthair" → "British Shorthair." "Story format" branches into "poem" → "haiku," "prose" → "short story," etc. The product of these trees defines a discrete, sample-able approximation of the concept space.
From equation (1) in the paper
Given user instructions $y$, optional sample data $S$, and factor specifications $(d_i, f_i)$ where $d_i$ is the desired depth:
$$\mathrm{M3}\big(y, S, (d_0, f_0), \dots, (d_K, f_K)\big) = \{T_i\}_{i=0}^{K} = T_y$$
Each $T_i$ is a hierarchical tree where root = broad factor, leaves = specific instantiations. More factors and deeper trees = sharper coverage control.
But there's a catch: as taxonomies grow deeper, you risk progressive coverage loss. The LLM might generate "domestic" → "shorthair" but miss "domestic" → "longhair" entirely. To mitigate this, Simula uses a three-step expansion algorithm:
Best-of-N proposals: For each node, prompt the LLM N times to propose children. This broadens the pool of proposals and catches edge cases.
Critic refinement: A separate LLM call reviews the proposals, adding missing nodes, removing redundancies, and improving specificity. This exploits the generator-critic gap — models are often better at evaluating completeness than generating it in one shot.
Level planning: After generating all nodes at one level, the LLM generates a "plan" for the next level, ensuring consistent granularity across branches.
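The three steps above can be sketched as a level-by-level loop. This is a minimal illustration, not the paper's implementation: `propose`, `critique`, and `plan_level` are hypothetical callables standing in for LLM calls, and the toy stubs below exist only so the sketch runs without a model. It also assumes node names are unique within a tree.

```python
from typing import Callable, Dict, List


def expand_taxonomy(
    root: str,
    depth: int,
    n_proposals: int,
    propose: Callable[[str, str], List[str]],        # (node, plan) -> proposed children
    critique: Callable[[str, List[str]], List[str]],  # (node, candidates) -> refined children
    plan_level: Callable[[List[str]], str],           # (nodes of a level) -> plan text
) -> Dict[str, List[str]]:
    """Expand one factor level by level and return {parent: [children]}."""
    tree: Dict[str, List[str]] = {}
    frontier = [root]
    plan = ""
    for _level in range(depth):
        next_frontier: List[str] = []
        for node in frontier:
            # Best-of-N: pool N independent proposals, dropping exact duplicates.
            pooled: List[str] = []
            for _attempt in range(n_proposals):
                for child in propose(node, plan):
                    if child not in pooled:
                        pooled.append(child)
            # Critic refinement: a separate call may add, remove, or sharpen nodes.
            children = critique(node, pooled)
            tree[node] = children
            next_frontier.extend(children)
        if not next_frontier:
            break
        # Level planning: one plan per finished level guides the next level.
        plan = plan_level(next_frontier)
        frontier = next_frontier
    return tree


# Toy stand-ins (assumptions for illustration only).
def toy_propose(node: str, plan: str) -> List[str]:
    return {"cat type": ["domestic", "wild"], "domestic": ["shorthair"]}.get(node, [])


def toy_critique(node: str, candidates: List[str]) -> List[str]:
    # Pretend the critic notices that "longhair" is missing under "domestic".
    extra = {"domestic": ["longhair"]}.get(node, [])
    return candidates + [c for c in extra if c not in candidates]


def toy_plan(nodes: List[str]) -> str:
    return "Keep children at the same granularity as: " + ", ".join(nodes)


tree = expand_taxonomy("cat type", 2, 3, toy_propose, toy_critique, toy_plan)
print(tree)
```

In the toy run the proposer never mentions "longhair", and the critic step is what restores it, which mirrors the coverage-loss scenario described above.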
Taxonomies decompose a concept space level by level. Each level adds granularity, but also adds the risk of missing branches. The balance between coverage and granularity is a design choice that Simula leaves to the user.
Taxonomy quality: the empirical result
Table 2 in the paper shows that Simula-generated taxonomies achieve 0.78 completeness and 0.97 soundness on conceptual taxonomies, measured against human experts. The generator-critic loop increased completeness from 0.52 (0-shot) to 0.78, a 50% relative improvement. Additionally, Simula achieved a novelty score of 0.94 (relevant nodes that the experts missed), giving 0.78 + 0.94 = 1.72x total coverage relative to expert-only taxonomies.
Check: Why does Simula use Best-of-N proposals + critic refinement instead of a single generation call?
To make the process more expensive and thus higher quality
Because models are better at evaluating completeness than generating it in one shot
To prevent the model from memorizing the taxonomy
Because single calls always produce empty outputs
3. Controlled Synthesis: From Taxonomy to Data
We have our taxonomies. Now we need to turn tree nodes into actual data points. This happens in two phases: taxonomic sampling and agentic refinement.
Phase 1: Taxonomic Sampling
The LLM first formulates sampling strategies — rules for which taxonomies can be combined and with what weight. This prevents nonsensical combinations (e.g., "mature horror novel" + "intended audience: toddlers"). Each strategy defines a compatible subset of taxonomies.
Given a strategy, the framework samples nodes from the corresponding taxonomies. These sampled nodes become data point requirements. For example:
Worked example
Sampled nodes: {house cat, poem, travel enthusiast}
+ Dataset instructions: "A dataset of stories about cats"
↓ Meta-prompt: "Compose an exciting haiku about a house cat who goes on an adventure."
↓ Generated output: "Paws on foreign soil / whiskers catch the monsoon wind / home is where I roam"
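A rough sketch of the sampling step in the worked example, under loud assumptions: the `STRATEGY` format (a list of factors plus forbidden node pairs) is invented here for illustration, the taxonomy leaves are toy values, and sampling weights are left out. In Simula the LLM formulates the strategies and writes the meta-prompt; this code only builds the request that would be sent.

```python
import random

# Leaf nodes of three toy taxonomies (illustrative values only).
TAXONOMIES = {
    "cat type": ["house cat", "British Shorthair", "lion"],
    "story format": ["poem", "short story", "horror novel"],
    "intended audience": ["toddlers", "travel enthusiast", "adults"],
}

# Hypothetical strategy format: which factors to combine, plus node pairs
# that must never appear together.
STRATEGY = {
    "factors": ["cat type", "story format", "intended audience"],
    "forbidden": [frozenset({"horror novel", "toddlers"})],
}


def sample_mix(taxonomies, strategy, rng):
    """Draw one node per factor, redrawing until no forbidden pair appears."""
    while True:
        mix = {f: rng.choice(taxonomies[f]) for f in strategy["factors"]}
        chosen = set(mix.values())
        if not any(pair <= chosen for pair in strategy["forbidden"]):
            return mix


def meta_prompt_request(instructions, mix):
    """Build the text that asks an LLM to write a meta-prompt for this mix."""
    requirements = "; ".join(f"{factor}: {node}" for factor, node in mix.items())
    return (
        f"Dataset instructions: {instructions}\n"
        f"Requirements: {requirements}\n"
        "Write one prompt that would produce a data point meeting all requirements."
    )


rng = random.Random(0)
mix = sample_mix(TAXONOMIES, STRATEGY, rng)
print(meta_prompt_request("A dataset of stories about cats", mix))
```

Rejection sampling is the simplest way to honor the compatibility rules; it is adequate as long as most combinations are allowed.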
Phase 2: Optimizing Local Diversity and Complexity
This is where things get interesting. Consider the math:
You want N = 100 data points. Your taxonomy produces V = 200 unique node-sets. Since N < V, you can cover at most 100/200 = 50% of the space. This is your global coverage ratio.
Now flip it: N = 800, V = 200. You can generate 4 meta-prompts per node-set and cover everything. But as N/V grows, independently generating meta-prompts from the same requirements leads to mode collapse — increasingly similar outputs.
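The arithmetic in the two cases above fits in a few lines. The function names are made up for this sketch; the numbers are the ones from the text.

```python
def global_coverage_ratio(n_points: int, n_node_sets: int) -> float:
    """Upper bound on the fraction of node-sets that n_points samples can cover."""
    return min(1.0, n_points / n_node_sets)


def points_per_node_set(n_points: int, n_node_sets: int) -> float:
    """Average number of data points drawn from each node-set (the N/V ratio)."""
    return n_points / n_node_sets


for n, v in [(100, 200), (800, 200)]:
    print(n, v, global_coverage_ratio(n, v), points_per_node_set(n, v))
```

When the N/V ratio is below 1 the bottleneck is coverage; once it exceeds 1 every node-set can be covered and the risk shifts to near-duplicate outputs within a node-set.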
Simula mitigates mode collapse two ways:
Batch meta-prompting: Generate multiple meta-prompts simultaneously (so the model sees them all in context and can diversify), then sub-sample.
Complexification: A user-defined fraction c (default 0.5) of meta-prompts are passed through a "complexification" step where the LLM increases their difficulty while maintaining the original requirements.
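A minimal sketch of how the two mitigations could be wired together. `generate_batch` and `complexify` are hypothetical stand-ins for LLM calls, and the toy versions below only manipulate strings so that the example runs; the default c = 0.5 is the one quoted above.

```python
import random


def batch_meta_prompts(mix, generate_batch, batch_size, keep, rng):
    """Request batch_size meta-prompts in a single call, then sub-sample keep of them."""
    candidates = generate_batch(mix, batch_size)
    return rng.sample(candidates, min(keep, len(candidates)))


def complexify_fraction(prompts, complexify, rng, c=0.5):
    """Pass a fraction c of the prompts through complexify; leave the rest unchanged."""
    n_hard = round(c * len(prompts))
    chosen = set(rng.sample(range(len(prompts)), n_hard))
    return [complexify(p) if i in chosen else p for i, p in enumerate(prompts)]


# Toy stand-ins (assumptions for illustration only).
def toy_generate_batch(mix, k):
    return [f"Take {i + 1} on: {', '.join(mix)}" for i in range(k)]


def toy_complexify(prompt):
    return prompt + " [harder, same requirements]"


rng = random.Random(0)
prompts = batch_meta_prompts(
    ["house cat", "poem"], toy_generate_batch, batch_size=8, keep=4, rng=rng
)
final = complexify_fraction(prompts, toy_complexify, rng, c=0.5)
for p in final:
    print(p)
```

Generating the whole batch in one call is the point of the first function: a real model would see its earlier candidates in context and could steer away from them, which independent calls cannot do.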
Global vs. Local diversification
Global: Use deep taxonomy levels for sampling. Node "math → algebra → quadratic equations" is more specific than just "math." This controls what topics appear.
Local: Use meta-prompting and complexification to generate diverse interpretations of the same topic. "Different takes on a quadratic equation problem."
The paper's key finding: Global and Local are additive. Each captures a different type of diversity. Using both simultaneously always matches or beats either alone.
Check: When N/V is very large (many data points per node-set), what problem arises and how does Simula mitigate it?
The model runs out of memory; use smaller batches
Mode collapse from repeatedly generating from the same requirements; batch meta-prompting + complexification
The taxonomy becomes invalid; regenerate from scratch
The data becomes too simple; increase model temperature
4. The Double Critic: Quality Through Skepticism
You've generated thousands of data points. Some are wrong. A math problem with an incorrect answer. A cybersecurity question where the "correct" choice is actually wrong. A legal exam question that contradicts the source material. How do you catch these?
The obvious approach: ask the same LLM to verify its outputs. But there's a well-documented problem: sycophancy bias. When you show a model its own output and ask "is this correct?", it's biased toward saying yes. It generated the answer; of course it thinks it's right.
Simula's fix is elegant: the double critic. Instead of asking one question ("is this correct?"), ask two independent questions:
Critic A
"Here is a question and answer. Is the answer CORRECT? Explain your reasoning, then give a binary verdict."
Critic B
"Here is a question and answer. Is the answer INCORRECT? Explain your reasoning, then give a binary verdict."
A sample is accepted only if Critic A says "correct" AND Critic B says "not incorrect." This dual query breaks the sycophancy pattern: asking "is this incorrect?" forces the model to look for flaws rather than confirm its prior.
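The acceptance rule can be written down directly. In this sketch `ask` is a hypothetical wrapper that sends one prompt to a model and returns its binary verdict; the toy judge below is deliberately sycophantic on the "CORRECT?" framing to show why the second query matters.

```python
def double_critic_accept(question, answer, ask):
    """Accept only if the 'correct?' critic says yes AND the 'incorrect?' critic says no."""
    says_correct = ask(f"Is the answer CORRECT?\nQ: {question}\nA: {answer}")
    says_incorrect = ask(f"Is the answer INCORRECT?\nQ: {question}\nA: {answer}")
    return says_correct and not says_incorrect


# Toy judge (an assumption for illustration): it confirms anything when asked
# "CORRECT?", but actually checks the answer when asked to find a flaw.
TRUTH = {"2 + 2": "4", "3 * 3": "9"}


def toy_ask(prompt):
    header, q_line, a_line = prompt.splitlines()
    question, answer = q_line[3:], a_line[3:]
    if header.startswith("Is the answer INCORRECT"):
        return TRUTH[question] != answer
    return True  # sycophantic "yes, looks correct"


samples = [("2 + 2", "4"), ("2 + 2", "5"), ("3 * 3", "9")]
accepted = [(q, a) for q, a in samples if double_critic_accept(q, a, toy_ask)]
print(accepted)
```

A single "is this correct?" query to the toy judge would have accepted all three samples; the second framing is what rejects the wrong one.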
The paper evaluates double-critic rejection sampling on the MATH dataset. The key numbers:
Double-critic results on MATH (Figure 3)
Controlled setting: Given a mix of correct and corrupted answers, the critic achieves consistent "lift": accepted accuracy is always higher than baseline generation accuracy. At complexity level 1: baseline 96% → post-critic 99%. At level 5: baseline 68% → post-critic 78%.
The cost curve: Higher complexity requires higher rejection rates to maintain the accuracy gain. At level 5, about 27% of samples are rejected to achieve a 10-point accuracy lift. At level 1, only 5% are rejected for a 3-point lift.
Empirical setting: The critic transfers to the model's own outputs (not just controlled corruptions), but with reduced effectiveness — the accuracy lift is smaller because the model's own errors are subtler than deliberately corrupted answers.
Beyond answer correctness, the critic also performs semantic verification: does the generated output actually match the taxonomy requirements? If the meta-prompt required a haiku about a house cat adventure, the critic checks: Is this a haiku? Is there a house cat? Is there an adventure? Point-wise requirement checks ensure the data fits the taxonomy specification.
The generator-critic gap
Why can a model critique its own outputs effectively? Because generation and verification are different tasks with different difficulty. Generating a correct proof is harder than checking one. Writing a subtle incorrect answer is harder than spotting it (especially when framed as "find the flaw"). The double-critic exploits this asymmetry. It's the same principle behind Best-of-N sampling but applied to quality, not diversity.
Check: Why does the double critic ask "is this INCORRECT?" separately instead of just asking "is this correct?"
To increase the cost and thoroughness of evaluation
To break sycophancy bias by forcing the model to actively look for flaws
Because the model can only answer binary yes/no questions
To ensure the model reads the question twice
5. Evaluation: Measuring What You Made
You've generated a synthetic dataset. How do you know it's good? Simula proposes evaluation tools for all three axes: diversity, complexity, and quality.
Measuring Diversity
Embedding-based metrics are the standard approach: embed all data points, compute pairwise cosine distances. Higher average distance = more diverse. Simula computes this both dataset-wide (global diversity) and among k=10 nearest neighbors (local diversity).
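As a sketch of the two embedding metrics, assuming embeddings are plain lists of floats with non-zero norm, and assuming "local diversity" means the mean distance from each point to its k nearest neighbours (the text does not spell out the exact aggregation):

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def global_diversity(embeddings):
    """Mean cosine distance over all unordered pairs in the dataset."""
    n = len(embeddings)
    dists = [
        cosine_distance(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return sum(dists) / len(dists)


def local_diversity(embeddings, k=10):
    """Mean cosine distance from each point to its k nearest neighbours."""
    n = len(embeddings)
    k = min(k, n - 1)
    per_point = []
    for i in range(n):
        dists = sorted(
            cosine_distance(embeddings[i], embeddings[j]) for j in range(n) if j != i
        )
        per_point.append(sum(dists[:k]) / k)
    return sum(per_point) / n


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # two identical points, one orthogonal
print(global_diversity(toy), local_diversity(toy, k=1))
```

The toy input shows why both views are useful: the dataset looks fairly spread out globally, while the local score reveals that two of the three points are duplicates. The quadratic pairwise loop is fine for a sketch; a real dataset would need a vectorized or approximate nearest-neighbour approach.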
But embedding distances are coarse — they tell you "this dataset is more spread out in embedding space" without saying what's missing. So Simula adds taxonomy-based coverage: for each data point, an LLM assigns it to the most relevant taxonomy node. You can then compute what fraction of nodes at each taxonomy level are covered. This gives a fine-grained, actionable map of gaps.
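Once each data point has been assigned to a taxonomy path, the coverage computation itself is simple set arithmetic. In this sketch the assignments are given as input (in Simula an LLM produces them), and the data layout (`nodes_by_level`, tuples for paths) is an assumption made for illustration.

```python
def level_coverage(nodes_by_level, assigned_paths):
    """Return {level: fraction of that level's nodes hit by at least one data point}.

    nodes_by_level: {1: {...}, 2: {...}} with the node names at each depth.
    assigned_paths: one tuple per data point, e.g. ("domestic", "shorthair").
    """
    covered = {level: set() for level in nodes_by_level}
    for path in assigned_paths:
        for level, node in enumerate(path, start=1):
            if level in covered:
                covered[level].add(node)
    return {
        level: len(covered[level] & nodes) / len(nodes)
        for level, nodes in nodes_by_level.items()
    }


def missing_nodes(nodes_by_level, assigned_paths):
    """List the uncovered nodes per level, i.e. the gaps to fill next."""
    hit = {
        (level, node)
        for path in assigned_paths
        for level, node in enumerate(path, start=1)
    }
    return {
        level: sorted(n for n in nodes if (level, n) not in hit)
        for level, nodes in nodes_by_level.items()
    }


levels = {
    1: {"domestic", "wild"},
    2: {"shorthair", "longhair", "lion", "tiger"},
}
paths = [("domestic", "shorthair"), ("domestic", "longhair"), ("wild", "lion")]
print(level_coverage(levels, paths))
print(missing_nodes(levels, paths))
```

Unlike an embedding distance, the second function names the gap directly: here the dataset has no data point under "tiger".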
Measuring Complexity
How complex is a single data point? This is tricky because:
Synthetic data generation is unsupervised (no labels)
Real data rarely has complexity annotations
"Complexity" is relative — hard for whom?
Simula's solution: calibrated batch scoring with Elo ratings.
The scoring algorithm
1. Sample batches of data points, each point appearing K times across batches.
2. For each batch, the LLM assigns relative complexity scores (calibrated against other batch members).
3. Convert batch scores into pairwise comparisons.
4. Compute Elo ratings from the pairwise comparisons.
Why batches? Per-sample scoring is noisy (the model overcommits to each individual judgment). Batch scoring forces calibration — the model must rank items relative to each other, producing more stable and comparable scores.
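Steps 3 and 4 of the scoring algorithm can be sketched with the standard Elo update. The K-factor, the starting rating, and the number of passes over the comparisons are assumptions chosen for illustration; the batch scores themselves would come from the LLM and are hard-coded here.

```python
import itertools


def batch_to_pairs(batch_scores):
    """Turn one batch {item: score} into (more_complex, less_complex) pairs; ties are skipped."""
    pairs = []
    for (a, score_a), (b, score_b) in itertools.combinations(batch_scores.items(), 2):
        if score_a > score_b:
            pairs.append((a, b))
        elif score_b > score_a:
            pairs.append((b, a))
    return pairs


def elo_ratings(pairs, k=32.0, base=1000.0, epochs=10):
    """Standard Elo updates, treating the more complex item as the 'winner'."""
    ratings = {}
    for winner, loser in pairs:
        ratings.setdefault(winner, base)
        ratings.setdefault(loser, base)
    for _ in range(epochs):
        for winner, loser in pairs:
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
            delta = k * (1.0 - expected)
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


# Two overlapping toy batches with made-up relative complexity scores.
batches = [
    {"p1": 1, "p2": 3, "p3": 5},
    {"p2": 2, "p3": 4, "p4": 1},
]
pairs = [pair for batch in batches for pair in batch_to_pairs(batch)]
ratings = elo_ratings(pairs)
print(sorted(ratings, key=ratings.get, reverse=True))
```

Note that "p1" and "p4" never share a batch, yet both end up on the same scale through their comparisons with "p2" and "p3". That overlap is why each point should appear in several batches.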
The Elo scores enable cross-dataset comparison: you can place synthetic and real data points on the same complexity scale and see where the distributions overlap or diverge.
Validation: Do Complexity Scores Mean Anything?
The paper validates on MATH (which has human complexity labels 1–5): model-assigned Elo scores align with human ratings. Furthermore, stratified by human complexity, rejected samples consistently have higher Elo scores than accepted ones. The critic is systematically harder on complex items — exactly what you'd expect if the scoring is meaningful.
Check: Why does Simula use batch-wise Elo scoring instead of per-sample complexity scores?
Because Elo was originally designed for language models
Per-sample scoring is noisy; batch comparison forces calibration and produces more stable rankings
To reduce the number of LLM calls needed
Because individual data points have no inherent complexity
6. Experimental Setup
The paper tests Simula with a carefully designed ablation study across five diverse datasets.
System Versions
To isolate the contribution of each component, five configurations are evaluated:
| Version | Taxonomy Depth | Meta-Prompting | Critic |
| --- | --- | --- | --- |
| Baseline | 1 (top-level only) | No | No |
| +Local | 1 | Yes + complexify | No |
| +Global | All levels | No | No |
| +Local+Global | All levels | Yes + complexify | No |
| Full System | All levels | Yes + complexify | Yes |
Datasets
Chosen to span niche (where real data is scarce) and popular (where benchmarks exist) domains:
| Dataset | Domain | Task | Test Size |
| --- | --- | --- | --- |
| CTI-MCQ | Cybersecurity | Multiple-choice (4 options) | 2,500 |
| CTI-RCM | Cybersecurity | Open-ended CWE generation | 1,000 |
| LEXam | Legal (Swiss/EU/Intl law) | Multiple-choice | 1,660 |
| GSM8k | Math | Multi-step word problems | 1,320 |
| Global MMLU | Math/CS/Physics | Multiple-choice (3 langs) | 1,440/lang |
The downstream setup
Teacher: Gemini 2.5 Flash (non-thinking) generates all synthetic data.
Student: Gemma 3 4B is fine-tuned on the synthetic data via LoRA.
Evaluation: Student accuracy on the real benchmark test set.
Scale: 512k unique data points per dataset, tested at sizes 4k → 512k.
Rigor: 10 seeds per configuration, 95% CI via standard error of mean.
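For readers who want to reproduce the "95% CI via standard error of mean" step, here is one common way to compute it. The normal-approximation multiplier z = 1.96 is an assumption; the text does not say whether the paper uses it or a t-multiplier, which would give a slightly wider interval for 10 seeds. The accuracies are made-up numbers, not results from the paper.

```python
import math
import statistics


def mean_and_ci95(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # sample std / sqrt(n)
    return mean, z * sem


# Made-up accuracies for 10 seeds, purely to show the call.
seed_accuracies = [0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.62, 0.63, 0.60, 0.62]
mean, half_width = mean_and_ci95(seed_accuracies)
print(f"{mean:.3f} ± {half_width:.3f}")
```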
The niche datasets (CTI, LEXam) use the instruction-tuned student; the popular ones (GSM8k, MMLU) use the pre-trained student. This tests whether Simula works both for "teaching new skills" and "improving existing ones."
Check: Why does the paper test on both niche (cybersecurity, law) and popular (math, MMLU) domains?
To show Simula only works on niche domains
To test whether the framework generalizes across domains where real data is scarce AND domains with existing benchmarks
Because the authors work in cybersecurity and math
To increase the page count of the paper
7. Results: What Actually Works
The results tell a nuanced story. There is no silver bullet. But there are clear patterns.
Finding 1: The full system is never worse
Across all datasets and all data sizes, the full Simula system (Local + Global + Critique) matches or beats every ablated version. Simultaneously optimizing all data axes never hurts. If you don't have strong priors about your domain, use everything.
Finding 2: The Baseline scales worst
The Baseline (top-level taxonomy, no meta-prompting, no critic) scales the worst with data size. Adding more data of the same quality doesn't help much. This is the paper's strongest evidence that the data scaling law is a function of data properties, not data size alone.
Finding 3: Global and Local are additive
On every dataset, combining Global and Local diversification outperforms either alone. But their individual contributions vary:
CTI-MCQ: Local plateaus early; Global gives the long-term gains.
GSM8k: Global scales slightly worse than Baseline alone; Local is essential.
LEXam: Global alone doesn't beat Baseline; the combination does.
Why the asymmetry?
Global diversification controls what topics appear; Local controls how they're presented. Math (GSM8k) has a fixed set of operations — deeper taxonomy levels don't add new math, they add needless specificity. But variations in problem framing (Local) teach the model different approaches. Conversely, cybersecurity (CTI-MCQ) has a sprawling topic space — shallow sampling misses entire threat categories. The right mix depends on the domain.
Finding 4: Teacher-student gap impacts scaling
The student model's performance saturates when it approaches the teacher's capability on that task:
CTI-RCM: Teacher accuracy = 70%, student saturates at ~65% (bridging 83% of the gap). More data beyond 128k doesn't help.
GSM8k: Teacher accuracy = 88%, student reaches 75% and still climbing. Room to grow.
This has a practical implication: if your teacher is weak on the target task, the returns from scaling synthetic data diminish sooner. The teacher's ceiling becomes the student's ceiling.
Finding 5: Critiquing is dataset-dependent
Adding the critic step never hurts, but its impact varies wildly:
| Dataset | Rejection Rate | Impact |
| --- | --- | --- |
| CTI-MCQ | 2% | Negligible |
| CTI-RCM | 9% | Small improvement |
| GSM8k | 9% | Moderate improvement |
| LEXam | 61% | Negligible (teacher too weak) |
| Global MMLU | 3% | Significant improvement |
LEXam's 61% rejection rate is striking: the teacher model (Gemini 2.5 Flash) achieves only 57% accuracy on LEXam. When the teacher gets more than four in ten answers wrong on the task, it can't reliably generate or critique the data.
Check: Why does the Baseline scale worst with data size?
It uses a smaller model for generation
More data from a shallow, undiversified generation process adds redundancy, not information
It doesn't use GPUs for generation
The Baseline generates data too slowly
8. The Complexity Paradox
Here's the most surprising finding in the paper. Common wisdom says: "harder training data makes better models." The paper shows this is only true when the teacher is competent.
To test this, the authors split the full system's output into three subsets using the Elo complexity scores: Low (bottom 40%), High (top 40%), and All (random subsample). They train separate student models on each split and measure accuracy.
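The split itself is easy to sketch once Elo scores exist. Two assumptions here: the "All" subset is drawn at the same size as the other two so that the comparison is size-matched, and the scores are toy values.

```python
import random


def split_by_complexity(elo, frac=0.4, rng=None):
    """Split item ids into low (bottom frac), high (top frac), and a random subset of equal size."""
    rng = rng or random.Random(0)
    ranked = sorted(elo, key=elo.get)  # ascending Elo, i.e. easiest first
    n = int(frac * len(ranked))
    return {
        "low": ranked[:n],
        "high": ranked[-n:] if n else [],
        "all": rng.sample(ranked, n),
    }


toy_elo = {f"x{i}": 1000 + 10 * i for i in range(10)}
splits = split_by_complexity(toy_elo)
print(splits)
```

With a 40% cut on each side, the middle 20% of items appears in neither the Low nor the High subset, which keeps the two groups clearly separated.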
The paradox
GSM8k: High Complexity gives a 10-point accuracy gain over Low Complexity at 64k data items. Harder math problems = better math model. Expected.
CTI-MCQ: Same pattern — Low Complexity shows no scaling with data size. Easy cybersecurity questions don't teach anything new.
LEXam: Reversed. Only Low Complexity improves with scale. High Complexity hurts. The teacher (57% accuracy) generates incorrect answers on hard questions. Training on confidently-wrong complex examples teaches the student wrong patterns.
This has a direct practical implication: before cranking up complexity, check your teacher's performance on the target domain. If the teacher is weak, more complex data doesn't teach better — it teaches wrong, confidently. In those cases, start with easy data where the teacher is reliable, or use a stronger teacher for the hard subset.
The lifecycle cost argument
The Baseline uses ~5x fewer inference calls per data point than the full system. So couldn't you just generate 5x more Baseline data for the same cost? The paper's data shows: no. The full system at 64k often beats the Baseline at 512k. Since training costs (GPU hours for fine-tuning) far exceed generation costs (inference calls), a smaller but higher-quality dataset is cheaper across the full lifecycle. You'd rather train on 64k good points than 512k mediocre ones.
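A back-of-the-envelope version of this argument, with loudly hypothetical unit costs: one inference call costs 1 unit, training on one data point costs 10 units, and "64k" and "512k" are rounded to 64,000 and 512,000. Only the ~5x call ratio and the two dataset sizes come from the text.

```python
def lifecycle_cost(n_points, calls_per_point, cost_per_call, train_cost_per_point):
    """Total cost = generation (inference calls) + training (proportional to dataset size)."""
    generation = n_points * calls_per_point * cost_per_call
    training = n_points * train_cost_per_point
    return generation + training


# Hypothetical unit costs; only the 5:1 call ratio and the sizes are from the text.
full_64k = lifecycle_cost(64_000, calls_per_point=5, cost_per_call=1.0, train_cost_per_point=10.0)
baseline_512k = lifecycle_cost(512_000, calls_per_point=1, cost_per_call=1.0, train_cost_per_point=10.0)
print(full_64k, baseline_512k)
```

Under these assumptions the full system at 64k needs 5 × 64,000 = 320,000 calls, fewer than the 512,000 calls for the Baseline at 512k, so it is cheaper even before training cost is counted. Whether it also performs better at that size is the empirical claim made above, not something this arithmetic shows.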
Check: Under what condition does high-complexity synthetic data HURT downstream performance?
When the dataset is too small
When the teacher model is weak on the target domain, generating confidently wrong answers on hard questions
When the taxonomy is too shallow
High complexity never hurts performance
9. Connections & Cheat Sheet
How Simula Connects to Other Work
Self-Refine (Madaan et al., 2024)
Uses iterative LLM self-feedback to improve outputs. Simula's critic loop is similar but structurally specialized — the double critic breaks sycophancy by asking for correctness and incorrectness independently.
LLM2LLM (Lee et al., 2024)
Iteratively generates data focused on the student model's mistakes. Requires seed data from the target distribution. Simula is seedless — it generates the coverage map from scratch.
Scaling Laws (Kaplan et al., 2020; Hoffmann et al., 2022)
Show that more data = better models. Simula adds a qualifier: more diverse, complex, high-quality data scales; more homogeneous data plateaus. The scaling law is a function of data properties, not size.
Constitutional AI (Bai et al., 2022)
Uses AI feedback to align models. Simula's double-critic is conceptually similar — using model judgment to filter outputs — but applied to data quality rather than alignment.
Cheat Sheet
| Concept | What It Is | When to Use |
| --- | --- | --- |
| Taxonomy $T_i$ | Hierarchical tree of a factor of variation | Always: defines coverage space |
| Mix / Node-set | Sampled nodes from multiple taxonomies | Each becomes one data point's "spec" |
| Meta-prompt | LLM-generated prompt from a mix | Local diversification |
| Complexification | LLM rewrites to increase difficulty | When teacher is strong on the domain |
| Double critic | Correct? + Incorrect? dual query | Always; especially for MCQ |
| Elo complexity | Batch-calibrated difficulty score | To compare distributions, split by difficulty |
| Level Ratio Coverage | Fraction of taxonomy nodes covered | To find gaps in dataset coverage |
| N/V ratio | Data points per unique node-set | <1: prioritize coverage. >1: prioritize diversity |
Related Lessons
Scaling Test-Time Compute — the inference-time analogue: more compute at test time vs. more data at train time
Constitutional AI — AI-feedback-driven alignment, a conceptual cousin of the double critic
Large Language Monkeys — scaling inference with repeated sampling (Best-of-N), the same principle Simula uses for taxonomy proposals
LLM Power Laws — scaling laws that Simula's results extend to synthetic data
The one-sentence summary
Design the mechanism that generates your synthetic data with the same care you'd design the model that consumes it — because the scaling law is a function of data properties, not data volume.