Yamada, Lange, Lu et al. — Sakana AI / UBC / Vector / Oxford, 2025

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery

Eliminate human-authored code templates. Manage experiments via agentic tree search. Refine figures with a VLM. Result: the first fully AI-generated paper to survive peer review at an ICLR workshop.

Prerequisites: LLM agents + Tree search basics + Scientific publishing workflow

Chapter 0: The Problem

In 2024, Lu et al. released The AI Scientist — the first system to automate the full scientific discovery pipeline end-to-end. Give it a research topic and it would generate hypotheses, write code, run experiments, produce figures, and draft a complete manuscript. Impressive, but deeply limited in practice.

The core limitation: human-authored code templates. For every new research domain, a human researcher had to hand-write a baseline codebase. The AI Scientist v1 could only make incremental modifications to this template — swap a loss function, tweak a hyperparameter, add a regularization term. It couldn't build an experiment from scratch.

This made v1 brittle. Want to study compositional generalization? Someone first writes a training loop, a dataset loader, an evaluation harness. Want to study vision transformers? A completely different template. The "automation" was really just "incremental code editing given a scaffold."

The deeper issue: Beyond templates, v1's experimentation was strictly linear. Each code change built directly on the previous one, like a single chain of edits. Real science doesn't work this way. Researchers explore branching hypotheses, backtrack from dead ends, and pursue multiple directions simultaneously. v1's linear approach meant it was perpetually myopic — it could never backtrack from a bad decision or explore an alternative path.

The papers v1 produced reflected these limitations. Internal evaluation found them "below the level of top ML venues." The manuscripts lacked depth, had superficial experiment designs, and the figures were often poorly formatted (since no vision model ever looked at them). No v1 paper was submitted to peer review, because the authors judged none were good enough.

Three specific failure modes defined v1:

  1. Template dependency. Each new domain required manual setup, killing the dream of autonomous science.
  2. Linear experimentation. No branching, no backtracking, no parallel exploration of hypotheses.
  3. No visual review. Generated figures were never checked by a vision model. Mislabeled axes, missing legends, and garbled plots went unnoticed.

There was also a subtler issue: v1's manuscript writing used Aider to make incremental LaTeX edits, section by section. This often produced internally inconsistent papers where the introduction promised one thing and the experiments delivered another. The system had no "big picture" view of the document.

The AI Scientist-v2 addresses all of these. The question is: how?

What was the most fundamental limitation of The AI Scientist v1?

Chapter 1: The Key Insight

The AI Scientist-v2 rests on one central idea: treat scientific experimentation as tree search, not linear editing.

In v1, the experiment flow looked like a chain:

Template code (human-written baseline)
  → Edit 1 (tweak learning rate)
  → Edit 2 (change loss function)
  → Edit 3 (add regularizer)
  → Write paper (based on final code state)

Every edit depended on the one before it. If Edit 2 was a mistake, Edit 3 inherited the damage. There was no way to go back.

In v2, the flow is a tree. Each experiment is a node. Promising nodes get expanded with refinements. Buggy nodes get debugging children. The system can explore multiple hypotheses in parallel and select the best branch at each stage. This is managed by a dedicated Experiment Manager agent that coordinates four distinct research stages.

Three innovations work together to make this possible:

  1. No templates. The system generates code from scratch, starting from a research idea and a topic description. This eliminates the human bottleneck and lets the system generalize across ML domains.
  2. Progressive agentic tree search. An Experiment Manager agent guides experimentation through four stages (feasibility, tuning, research execution, ablations), using tree search within each stage to explore the hypothesis space.
  3. VLM figure refinement. A vision-language model reviews every generated figure for clarity, correctness, and aesthetics. Figures that fail the VLM check cause their experiment node to be marked as "buggy" and re-attempted.
Why this combination matters: Any one of these improvements alone would help. But they're synergistic. Template-free operation means the tree search explores a much larger space (not just variations on a fixed codebase). The VLM feedback prevents the tree from wasting branches on experiments whose results can't be clearly communicated. And the staged Experiment Manager prevents the tree from growing aimlessly — it imposes the structure of real scientific methodology.

The payoff: one of the three manuscripts v2 generated and submitted to ICLR's ICBINB workshop received scores of 6, 7, and 6 — an average of 6.33, placing it in roughly the top 45% of submissions. It would have been accepted had it been human-authored. This is the first time a fully AI-generated paper has passed peer review.

What is the central architectural shift from v1 to v2?

Chapter 2: v1 vs v2 Comparison

Let's lay the differences out side by side. These aren't incremental tweaks — nearly every component was redesigned.

Feature | AI Scientist v1 | AI Scientist v2
Codebase | Topic-specific templates (human-authored) | Generated from scratch (domain-general)
Idea generation | Conditioned on existing template code | Open-ended, with Semantic Scholar in the loop
Experiment planning | Linear: each edit builds on the last | Tree-based: branch, backtrack, explore in parallel
Experiment stages | Single pass | 4 stages: feasibility → tuning → execution → ablation
Parallel execution | No | Yes: multiple nodes expanded concurrently
Figure review | None (text-only models) | VLM checks every figure for clarity and correctness
Manuscript writing | Incremental edits via Aider | Single-pass generation + reflection with o1
Reviewer | Text-only LLM | VLM-augmented (sees figures + text together)
Human evaluation | Not submitted to peer review | One paper accepted at ICLR workshop (6.33 avg)
Dataset handling | Bundled with templates | Hugging Face Hub auto-download

The workflow pipeline

The v2 pipeline proceeds through five major phases:

  1. Idea Generation: Open-ended brainstorming with literature search (Semantic Scholar). ~20 ideas generated, best selected by researchers.
  2. Experimentation: 4-stage tree search managed by the Experiment Manager. Code generated from scratch, executed, VLM-reviewed.
  3. Figure Aggregation: Best experiment nodes produce final figures. VLM iteratively refines aesthetics and accuracy.
  4. Manuscript Writing: Single-pass LaTeX generation using experiment results, figures, and summaries. No incremental editing.
  5. Reflection & Review: Reasoning model (o1) reviews the manuscript. VLM checks figure-caption alignment. Iterative refinement.

Models used: The primary LLM is Claude 3.5 Sonnet for experiment code generation and management. GPT-4o serves as the VLM for figure review. The manuscript reflection stage uses o1 for deeper reasoning about scientific coherence. The AI reviewer uses GPT-4o for VLM-augmented review.

Cost and compute

Each complete run (idea → manuscript) costs approximately $20-50 in API calls and takes several hours of wall-clock time. Most of the cost comes from the experiment tree search phase, where many LLM calls generate and evaluate code. Experiment execution itself uses standard academic compute (single GPU). For the ICBINB submission, multiple seeds were run per idea, so the total cost per submitted paper was higher, but still orders of magnitude cheaper than the equivalent human researcher time.


Chapter 3: The Experiment Manager

This is the heart of v2. The Experiment Manager is a dedicated LLM agent that orchestrates the entire experimental process through four stages, using tree search within each stage to explore the hypothesis space systematically.

The four stages

Real scientific research follows a natural progression: first check if an idea is feasible, then tune the setup, then run the core experiments, then validate with ablations. The Experiment Manager enforces this structure:

Stage 1: Preliminary Investigation. Establish feasibility. Build a minimal working prototype from the research idea. Stops when basic code runs successfully.
  ↓ Best node carried forward
Stage 2: Hyperparameter Tuning. Refine the prototype. Optimize learning rate, epochs, batch size. Stops when training curves converge across at least 2 datasets.
  ↓ Best node carried forward
Stage 3: Research Agenda Execution. Run the core experiments. Test the main hypotheses. If experiments finish too fast, the system suggests increasing complexity. Stops when compute budget is exhausted.
  ↓ Best node carried forward
Stage 4: Ablation Studies. Validate findings. Remove components one at a time. Run replications with different seeds for statistical significance (mean ± std).
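
The staged hand-off above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `run_stages`, `search_stage`, and `pick_best` are hypothetical names, and the toy functions stand in for the per-stage tree search and the LLM evaluator.

```python
from typing import Callable, List

STAGES = [
    "preliminary_investigation",
    "hyperparameter_tuning",
    "research_agenda_execution",
    "ablation_studies",
]

def run_stages(
    seed_idea: str,
    search_stage: Callable[[str, str], List[dict]],
    pick_best: Callable[[List[dict]], dict],
) -> List[dict]:
    """Run each stage's search and carry the best node into the next stage."""
    carried = {"stage": None, "plan": seed_idea, "score": 0.0}
    winners = []
    for stage in STAGES:
        # Each stage starts from the node selected at the end of the previous one.
        candidates = search_stage(stage, carried["plan"])
        carried = pick_best(candidates)
        winners.append(carried)
    return winners

# Toy stand-ins (assumptions): three candidate nodes per stage, scored 0..2.
def toy_search(stage: str, parent_plan: str) -> List[dict]:
    return [
        {"stage": stage, "plan": f"{parent_plan} > {stage}#{i}", "score": float(i)}
        for i in range(3)
    ]

def toy_pick_best(nodes: List[dict]) -> dict:
    return max(nodes, key=lambda n: n["score"])

winners = run_stages("idea", toy_search, toy_pick_best)
for w in winners:
    print(w["stage"], "->", w["plan"])
```

Passing the search and selection steps in as functions keeps the stage ordering separate from how each stage explores, which mirrors the split between the Experiment Manager and the tree search it coordinates.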

How tree search works within each stage

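Within each stage, the system builds a tree of experiment nodes. Each node records an experiment plan, the Python code implementing it, the execution outcome (saved metrics or an error trace), the generated plots, the VLM feedback on those plots, and a buggy or non-buggy status.

At each iteration, the system selects nodes to expand, and the selection strategy depends on node status. With some probability a buggy node is chosen and given a debugging child; otherwise an LLM evaluator ranks the non-buggy nodes and the best one is refined.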

New child nodes are created and executed in parallel, dramatically accelerating exploration compared to v1's sequential approach.
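
A node and the selection rule might look roughly like the sketch below. The field names, the `debug_prob` parameter, and `rank_key` (a plain function standing in for the LLM evaluator) are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Node:
    plan: str                                   # natural-language experiment plan
    code: str = ""                              # generated Python script
    metrics: dict = field(default_factory=dict)
    vlm_feedback: str = ""
    error_trace: str = ""
    is_buggy: bool = False
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def select_node(
    nodes: List[Node],
    debug_prob: float,
    rank_key: Callable[[Node], float],
    rng: random.Random,
) -> Tuple[Optional[Node], str]:
    """Pick a node to expand and say whether it gets a debug or refine child."""
    buggy = [n for n in nodes if n.is_buggy]
    healthy = [n for n in nodes if not n.is_buggy]
    # With probability debug_prob (or if nothing else exists), debug a buggy node.
    if buggy and (not healthy or rng.random() < debug_prob):
        return rng.choice(buggy), "debug"
    if healthy:
        # rank_key replaces the LLM evaluator's ranking in this sketch.
        return max(healthy, key=rank_key), "refine"
    return None, "draft"  # empty tree: nothing to expand yet

tree = [
    Node(plan="baseline LSTM", metrics={"val_acc": 0.72}),
    Node(plan="wider LSTM", metrics={"val_acc": 0.68}),
    Node(plan="attention variant", is_buggy=True, error_trace="CUDA out of memory"),
]
val_acc = lambda n: n.metrics.get("val_acc", 0.0)
node, action = select_node(tree, debug_prob=0.3, rank_key=val_acc, rng=random.Random(0))
print(action, "->", node.plan)
```

Ranking by a single metric is a simplification here; the evaluator described in this chapter weighs several factors rather than one number.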

The node execution cycle

Each node, once created, follows a fixed execution cycle:

  1. Plan + Code Generation: LLM writes an experiment plan (natural language) and Python code implementing it.
  2. Code Execution: Python script runs in an interpreter. Metrics saved to .npy files. If error: node marked buggy, error trace recorded.
  3. Plot Generation: System reads saved .npy files and generates visualizations summarizing results.
  4. VLM Review: Vision model checks plots for clarity. Issues flagged → node marked buggy with VLM feedback recorded.
  5. Status Assignment: Node is classified as non-buggy (passed all checks) or buggy (code error or VLM rejection).
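
The cycle above reduces to a short control flow: run, plot, review, then assign a status. The sketch below assumes three injected functions (`run_code`, `make_plots`, `vlm_review`), all hypothetical, so the logic can be followed without any real model or interpreter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CycleResult:
    status: str                   # "buggy" or "non-buggy"
    reason: Optional[str] = None  # error trace or VLM feedback when buggy

def execute_node(
    code: str,
    run_code: Callable[[str], List[str]],
    make_plots: Callable[[List[str]], List[dict]],
    vlm_review: Callable[[List[dict]], List[str]],
) -> CycleResult:
    """Run one node through execution, plotting, and VLM review."""
    try:
        metric_files = run_code(code)
    except Exception as exc:
        # A crash marks the node buggy and keeps the error for the debugging child.
        return CycleResult("buggy", f"execution error: {exc!r}")
    plots = make_plots(metric_files)
    issues = vlm_review(plots)
    if issues:
        return CycleResult("buggy", "VLM feedback: " + "; ".join(issues))
    return CycleResult("non-buggy")

# Toy stand-ins (assumptions) for the interpreter, the plotter, and the VLM.
def toy_run(code: str) -> List[str]:
    if "crash" in code:
        raise RuntimeError("CUDA out of memory")
    return ["train_loss.npy"]

def unlabeled_plots(metric_files: List[str]) -> List[dict]:
    return [{"file": "loss.png", "axis_labels": False}]

def labeled_plots(metric_files: List[str]) -> List[dict]:
    return [{"file": "loss.png", "axis_labels": True}]

def toy_vlm(plots: List[dict]) -> List[str]:
    return [f"{p['file']}: missing axis labels" for p in plots if not p["axis_labels"]]

print(execute_node("crash()", toy_run, labeled_plots, toy_vlm))
print(execute_node("train()", toy_run, unlabeled_plots, toy_vlm))
print(execute_node("train()", toy_run, labeled_plots, toy_vlm))
```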

Specialized node types

Beyond basic experiment nodes, the system creates specialized variants:

Node Type | Stage | Purpose
Hyperparameter | 2 | Systematically explore alternative hyperparameters. Tracks previously tested configs to avoid redundancy.
Ablation | 4 | Remove one component at a time. Tracks previously tested ablation conditions.
Replication | 3, 4 | Re-run parent experiment with different random seeds. Enables mean ± std reporting.
Aggregation | 4 | Collect results from replication nodes. Produce combined figures with error bars. No new experiments.
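
What an aggregation node computes from its replication nodes can be illustrated with the standard library alone. The function name and the metric values below are made up for the example; only the mean ± std idea comes from the text.

```python
import statistics
from typing import Dict, List, Tuple

def aggregate_replications(runs: List[Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Combine per-seed metrics into (mean, sample std) per metric name."""
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        # Sample standard deviation needs at least two seeds.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[name] = (statistics.mean(values), std)
    return summary

# Hypothetical results from three seeds of the same parent experiment.
seed_runs = [{"val_acc": 0.70}, {"val_acc": 0.72}, {"val_acc": 0.74}]
summary = aggregate_replications(seed_runs)
mean, std = summary["val_acc"]
print(f"val_acc = {mean:.3f} ± {std:.3f}")
```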
The key difference from AIDE: The AI Scientist-v2's tree search is inspired by AIDE (Jiang et al., 2025), which uses tree search for ML engineering tasks. But AIDE operates with a single scalar score per node (e.g., validation accuracy). The AI Scientist-v2 uses an LLM evaluator that considers multiple factors — not just accuracy, but training dynamics, figure quality, VLM feedback, and scientific coherence. This is necessary because "good science" isn't a single number.

A worked example: one iteration of tree expansion

Suppose we're in Stage 3 (Research Agenda Execution). The tree currently has three non-buggy nodes and one buggy node. Here's what happens in one iteration:

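  1. The system selects several nodes to expand in parallel; assume two here. For each selection it draws a random number: with probability p it picks the buggy node for debugging, and otherwise it asks the LLM evaluator to rank the three non-buggy nodes. Suppose one draw lands on debugging and the other on refinement.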
  2. The evaluator receives each node's metrics (train/val loss curves, accuracy), its experiment plan, and VLM feedback on its figures. It returns a ranking. The top-ranked node is selected for refinement.
  3. The LLM generates a new experiment plan and Python code for the child node. For a refinement: "The parent achieved 72% accuracy using a 2-layer LSTM. I'll try adding attention over hidden states and increasing hidden dim from 128 to 256." For debugging: "The parent crashed with OOM. I'll reduce batch size from 256 to 64 and add gradient checkpointing."
  4. Both new child nodes are executed in parallel. Each runs its Python script, saves metrics to .npy files, generates plots, and submits plots to the VLM.
  5. Results are recorded. The tree grows by two nodes. The next iteration begins.

This cycle repeats until the compute budget for the current stage is exhausted. Then the Experiment Manager selects the single best node (via the LLM evaluator) to seed the next stage.
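
The parallel step of one iteration can be sketched with a thread pool. `expand_in_parallel`, `make_child`, and `run_child` are hypothetical names, and `run_child` is a placeholder for the real work of running a script, saving metrics, plotting, and VLM review.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

def expand_in_parallel(
    selected: List[Tuple[dict, str]],
    make_child: Callable[[dict, str], dict],
    run_child: Callable[[dict], str],
    max_workers: int = 2,
) -> List[Tuple[dict, str]]:
    """Create one child per selected (parent, action) pair and run them concurrently."""
    children = [make_child(parent, action) for parent, action in selected]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        statuses = list(pool.map(run_child, children))
    return list(zip(children, statuses))

def make_child(parent: dict, action: str) -> dict:
    return {"parent": parent["plan"], "action": action}

def run_child(child: dict) -> str:
    # Placeholder: a real system would execute code and review plots here.
    return "non-buggy"

selected = [
    ({"plan": "2-layer LSTM, 72% accuracy"}, "refine"),
    ({"plan": "batch size 256 (crashed with OOM)"}, "debug"),
]
expanded = expand_in_parallel(selected, make_child, run_child)
for child, status in expanded:
    print(child["action"], "child of", child["parent"], "->", status)
```

Threads are used only to keep the sketch short; experiments that train models would more plausibly run as separate processes or jobs.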

What happens when the tree search encounters a buggy node (one whose code crashed)?

Chapter 4: Template-Free Operation

This is where v2 becomes truly autonomous. Instead of starting from a human-written codebase, v2 starts from nothing but a research idea.

The process begins in the idea generation phase. The system is prompted to brainstorm open-ended research directions for a given topic — something like "negative results in deep learning" (the ICBINB workshop theme). It generates roughly twenty candidate ideas, each described as a title and short hypothesis. Critically, the system has access to Semantic Scholar during this phase, so it can check whether an idea is novel and identify relevant prior work.

Once an idea is selected, the Experiment Manager takes over. Here's how Stage 1 (Preliminary Investigation) works without a template:

  1. The LLM receives the research idea as a natural language description (title + hypothesis + experimental plan).
  2. It generates a complete Python experiment script from scratch — dataset loading (via Hugging Face Hub), model definition, training loop, evaluation, and metric logging.
  3. The script is executed in a Python interpreter. If it crashes, the error trace is recorded and the node is marked buggy. A debugging child is spawned.
  4. If it succeeds, the system saves metrics to numpy files and generates plots.
  5. The VLM reviews the plots. If they're unclear or incorrect, the node is marked buggy.
  6. Non-buggy nodes proceed to refinement — the LLM generates improved versions.
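
Step 3 above, running a generated script and recording the error trace on failure, could be approximated as follows. This is a sketch under assumptions: the function name, the timeout, and the returned dictionary shape are invented here, and a production system would also need sandboxing.

```python
import os
import subprocess
import sys
import tempfile

def run_script(source: str, timeout_s: float = 60.0) -> dict:
    """Execute generated Python source in a fresh interpreter and capture the outcome."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "experiment.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(source)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"is_buggy": True, "stdout": "", "error_trace": f"timed out after {timeout_s}s"}
    if proc.returncode != 0:
        # stderr holds the Python traceback that a debugging child would receive.
        return {"is_buggy": True, "stdout": proc.stdout, "error_trace": proc.stderr}
    return {"is_buggy": False, "stdout": proc.stdout, "error_trace": ""}

ok = run_script("print('metrics saved')")
bad = run_script("raise ValueError('boom')")
print(ok["is_buggy"], bad["is_buggy"])
```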
Why Hugging Face Hub matters: A key enabler of template-free operation is standardized dataset access. v2 prompts the LLM to use datasets.load_dataset() from Hugging Face whenever possible. This provides a consistent API for downloading hundreds of ML datasets with predefined train/val/test splits. Without this, the LLM would have to write custom data loading code for every experiment — a major source of bugs.

Manuscript writing: single-pass + reflection

The manuscript writing phase also changed fundamentally. v1 used Aider to iteratively edit LaTeX files section by section, which was slow and often produced inconsistent text (early sections didn't know what later sections would say). v2 uses a single-pass generation: the LLM writes the entire manuscript at once, given the experiment results, figures, and summaries from the best experiment nodes.

After the initial draft, a separate reflection stage uses a reasoning model (o1) to review the manuscript holistically. The reflection stage also receives the target page limit (e.g., 4 pages for the ICBINB workshop) alongside the current PDF length, allowing it to suggest cuts or expansions as needed. This two-phase approach (draft + reflect) is cleaner and more reliable than incremental editing, and the separation of concerns means the drafting model can focus on content while the reflection model focuses on quality.

The prompt structure

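What does the LLM actually receive when generating experiment code? The prompt includes the research idea in natural language (title, hypothesis, and experimental plan), along with guidance such as preferring datasets.load_dataset() from Hugging Face for data access.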

The system doesn't receive any starter code, model architecture, or training recipe. Everything is generated from the idea description alone. This is what makes v2 domain-general: the same pipeline that studies compositional generalization in sequence models could equally study data augmentation in vision or reward shaping in RL.

What the plot aggregation prompt looks like

At the end of experimentation, the system generates a final plot aggregation script. The actual prompt instructs the LLM to:

From the real prompt: "Combine relevant existing plotting code. Create a complete set of final scientific plots, stored in figures/ only. Use existing .npy data — do NOT hallucinate data. Only create plots where the data is best presented as a figure and not as a table. Put each plot in a separate try-catch block so one failure doesn't affect others. Create only plots that are unique and needed for the final paper."

This is a small but important detail. The system explicitly separates experiment execution (which saves raw data as .npy arrays) from figure generation (which reads those arrays and produces publication-quality plots). This separation prevents a common failure mode: plots that look right but are generated from hallucinated data rather than actual experiment outputs.
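
The quoted instruction to give each plot its own try-catch block is easy to picture in code. In the sketch below the plotting functions are toy callables (assumptions) so the isolation pattern is visible without any plotting library; `build_final_figures` is a hypothetical name.

```python
from typing import Callable, Dict, List, Tuple

def build_final_figures(
    plot_fns: Dict[str, Callable[[str], None]],
    out_dir: str = "figures",
) -> Tuple[List[str], Dict[str, str]]:
    """Run every plot function in isolation so one failure cannot block the rest."""
    saved, failed = [], {}
    for name, fn in plot_fns.items():
        try:
            fn(f"{out_dir}/{name}.png")
            saved.append(name)
        except Exception as exc:
            # Record the failure and move on to the next plot.
            failed[name] = repr(exc)
    return saved, failed

rendered = []

def plot_loss(path: str) -> None:
    # Stand-in for: load train_loss.npy, draw the curve, save to `path`.
    rendered.append(path)

def plot_accuracy(path: str) -> None:
    # Simulates a missing data file; the plot is skipped rather than invented.
    raise FileNotFoundError("val_acc.npy")

saved, failed = build_final_figures({"loss_curve": plot_loss, "accuracy": plot_accuracy})
print("saved:", saved)
print("failed:", failed)
```

Because a plot whose data file is missing simply fails, nothing in this flow can substitute invented numbers for absent results.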

How does v2 handle dataset loading without human-written template code?

Chapter 5: VLM Figure Refinement

v1 had a blind spot: it never looked at its own figures. The language models generating plots couldn't verify whether the output was clear, correctly labeled, or even legible. This led to manuscripts with garbled axes, missing legends, and misleading visualizations — the kind of issues a human reviewer would catch immediately.

v2 integrates Vision-Language Models (VLMs) at two critical points in the pipeline.

Point 1: During experimentation (tree search)

After each experiment node generates its figures, the VLM (GPT-4o) reviews them. It checks for problems such as missing or mislabeled axes, missing legends, and garbled or misleading plots.

If the VLM flags any issues, the node is marked buggy and the feedback is recorded. When a debugging child is spawned from this node, the LLM receives the VLM's critique alongside the code, enabling targeted fixes. This means figure quality improves through the same tree search mechanism that improves experiment quality.

Point 2: During manuscript reflection

After the manuscript is drafted, the system extracts each figure along with its caption and the text that references it (identified by searching for "Figure X" in the manuscript). These figure-caption-context triples are sent to the VLM for a second round of checks, this time on whether each figure, its caption, and the surrounding text agree.

The feedback loop in practice: During tree search, a node generates a training curve with no axis labels. The VLM flags this. A debugging child receives: "VLM feedback: Missing axis labels on training loss plot. Add xlabel('Epoch') and ylabel('Loss')." The LLM fixes the plotting code. The new figure passes VLM review. The node is marked non-buggy and can now be expanded via refinement instead of further debugging.
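
For the manuscript-reflection step, collecting the text that mentions each figure by searching for "Figure X" might look like the sketch below. The regular expression, the naive sentence splitting, and the output shape are assumptions for illustration only.

```python
import re
from typing import Dict, List

FIGURE_REF = re.compile(r"\bFigure\s+(\d+)\b")

def figure_contexts(manuscript_text: str, captions: Dict[int, str]) -> List[dict]:
    """Pair each figure's caption with the sentences that reference it."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", manuscript_text.strip())
    triples = []
    for number, caption in sorted(captions.items()):
        context = [
            s for s in sentences
            if number in {int(m) for m in FIGURE_REF.findall(s)}
        ]
        triples.append({"figure": number, "caption": caption, "context": context})
    return triples

text = (
    "Training loss falls steadily, as shown in Figure 1. "
    "Figure 2 compares both models. "
    "Regularization hurts accuracy (Figure 2)."
)
captions = {1: "Training loss per epoch.", 2: "Accuracy with and without regularization."}
for triple in figure_contexts(text, captions):
    print(triple["figure"], len(triple["context"]), "referencing sentence(s)")
```

Each resulting record, together with the rendered image, is the kind of bundle a VLM could be asked to check for agreement.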

This two-stage VLM integration is one of the simplest innovations in the paper, but one of the most impactful for manuscript quality. Human reviewers at the ICBINB workshop still flagged some figure issues in the submitted papers, though the VLM filtering likely removed many more before submission.

What happens when the VLM flags issues with a figure during tree search?

Chapter 6: Enhanced Reviewer

v1 included an automated AI reviewer — an LLM that scored generated manuscripts on a 1-10 scale across several criteria. But it had a fundamental limitation: it was text-only. The reviewer read the LaTeX source but never saw the rendered figures. This meant it couldn't assess whether figures supported the claims in the text.

v2 upgrades the reviewer with VLM capabilities. The enhanced reviewer:

  1. Reads the text of the manuscript, evaluating clarity, novelty, methodology, and significance.
  2. Views every figure alongside its caption, checking alignment between visual evidence and written claims.
  3. Assesses experimental rigor — are error bars present? Are baselines fair? Are the datasets appropriate for the claims?
  4. Produces a structured review with a numerical score and actionable feedback.

Correlation with human reviewers

The paper evaluates the enhanced reviewer against human peer review scores from the ICBINB workshop. The VLM-augmented reviewer shows higher correlation with human scores than the text-only reviewer from v1. This is expected: a reviewer that can actually see the figures makes more holistic judgments.

Paper | Human Avg Score | AI Review (v2, VLM) | Outcome
Compositional Regularization | 6.33 | Highest ranked | Accepted (withdrawn post-review)
Paper 2 | Below threshold | Middle ranked | Rejected
Paper 3 | Below threshold | Lowest ranked | Rejected
The ranking matched: The internal AI reviewer ranked the three submitted manuscripts in the same order as the human peer reviewers. The manuscript that scored highest with humans (the one that passed peer review) also scored highest with the AI reviewer. The two rejected manuscripts scored lower on both human and AI evaluations. While this is a small sample (n=3), it's encouraging evidence that VLM-augmented review captures something real about paper quality.
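
Rank agreement of this kind is often summarized with Spearman's rank correlation. The sketch below uses the no-ties formula and encodes the orderings described above as ranks 1 to 3; with only three papers the result shows the calculation, not statistical evidence.

```python
from typing import Sequence

def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Ranks for: Compositional Regularization, Paper 2, Paper 3 (1 = best).
human_rank = [1, 2, 3]
ai_rank = [1, 2, 3]
print(spearman_rho(human_rank, ai_rank))
```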

What the AI reviewer misses

Even the enhanced reviewer has blind spots. The human reviewers at ICBINB flagged issues the AI reviewer did not:

These are higher-level scientific judgment calls that current AI reviewers struggle with. They require understanding what makes a research contribution convincing to the community, not just whether the text and figures are internally consistent.

What key capability does v2's reviewer add over v1's text-only reviewer?

Chapter 7: The Accepted Paper

Three manuscripts were submitted to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop. One was accepted. Let's dissect what happened.

The evaluation protocol

The evaluation was carefully designed with ethical safeguards:

  1. Three fully autonomous manuscripts were included among 43 total workshop submissions.
  2. Reviewers were told in advance that some submissions might be AI-generated, but not which ones. They could opt out of reviewing potentially AI-generated work.
  3. After review, the accepted manuscript was withdrawn before publication — the authors did not want to set a precedent for AI-generated papers entering the scientific record without broader community discussion.
  4. Reviewers learned which submissions were AI-generated only after peer review was complete.
  5. The study had IRB approval from the University of British Columbia.

The accepted paper: compositional regularization

The workshop-accepted paper investigated whether adding a compositional regularization term to the training loss could improve neural network generalization. The idea: penalize large deviations between embeddings of successive time steps in an LSTM-based sequence model, hypothesizing this encourages compositional representations.
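
One plausible reading of that penalty is sketched below. The exact form used in the generated paper is not given here, so the squared L2 distance, the averaging over steps, and the weight `lam` are all assumptions.

```python
from typing import List, Sequence

def compositional_penalty(hidden_states: Sequence[Sequence[float]]) -> float:
    """Mean squared L2 distance between hidden states at successive time steps."""
    if len(hidden_states) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(hidden_states) - 1)

def total_loss(task_loss: float, hidden_states: Sequence[Sequence[float]], lam: float) -> float:
    """Task loss plus the weighted penalty; lam is a hypothetical weight."""
    return task_loss + lam * compositional_penalty(hidden_states)

# Three time steps of a 2-dimensional hidden state (made-up values).
states: List[List[float]] = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
print(compositional_penalty(states))
print(total_loss(1.0, states, lam=0.1))
```

With `lam` set to zero this reduces to the unregularized baseline, which is the comparison the experiment relies on.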

The experimental setup: train LSTM models on synthetic arithmetic expression datasets (tasks like evaluating "3 + (2 * 4)"), with and without the proposed regularization term. Evaluate whether the regularized model generalizes better to unseen combinations of operations.
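
A synthetic dataset of this flavor can be generated with a short recursive function. This is an illustrative sketch, not the dataset from the paper: the operator set, digit range, and nesting scheme are assumptions.

```python
import operator
import random
from typing import Tuple

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_expression(depth: int, rng: random.Random) -> Tuple[str, int]:
    """Return (text, value) for a random, fully parenthesized arithmetic expression."""
    if depth == 0:
        n = rng.randint(0, 9)
        return str(n), n
    left_text, left_val = make_expression(depth - 1, rng)
    # The right operand gets a random, shallower depth to vary the tree shape.
    right_text, right_val = make_expression(rng.randint(0, depth - 1), rng)
    symbol = rng.choice(sorted(OPS))
    value = OPS[symbol](left_val, right_val)
    return f"({left_text} {symbol} {right_text})", value

rng = random.Random(0)
for _ in range(3):
    text, value = make_expression(2, rng)
    print(text, "=", value)
```

Raising `depth` is one way to produce the more complex expressions on which, per the paper's finding, generalization degrades.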

The result was a negative finding — compositional regularization did not significantly improve generalization performance, and in some cases actually harmed training. Furthermore, increasing the complexity of arithmetic expressions made generalization worse regardless of regularization. The paper concludes that explicitly enforcing compositional structure via regularization alone may not be sufficient.

This aligned well with the ICBINB workshop theme (unexpected failures and negative results). The paper received scores of 6 (weak accept), 7 (accept), and 6 (weak accept), averaging 6.33 — placing it roughly in the top 45% of the 43 total submissions.

What reviewers praised

What reviewers criticized

The initial idea (from the system's own generation)

Verbatim from the AI-generated idea: "Introducing a compositional regularization term during training can encourage neural networks to develop compositional representations, thereby improving their ability to generalize to novel combinations of known components."

The system started from this one-sentence hypothesis, generated all code from scratch, ran experiments on synthetic arithmetic datasets, discovered the regularization didn't work, and wrote a paper honestly reporting the negative result. The ICBINB workshop specifically values such negative findings, so the output fit the venue well. That fit was partly by design: idea generation was prompted with the workshop's theme.

The other two submissions

Both were rejected. The authors' internal analysis agrees with this outcome: they judged only one of the three to be workshop-quality. The system was run with multiple random seeds per idea, and the best manuscript from each idea's seed runs was selected for submission — similar to a professor selecting the best work from multiple students.

Milestone, not mastery: The authors are clear that this is not "AI replacing scientists." The accepted paper is workshop-level, not conference-level. Their internal review found issues the workshop reviewers missed — potential dataset overlap, unclear descriptions of what was being regularized. And the human involvement (selecting ideas, choosing the best seed, setting compute budgets) is non-trivial. But as a proof of concept, it establishes that the gap between AI-generated and human-authored research is narrowing to the point where peer reviewers can't reliably distinguish them at the workshop level.

What was the accepted paper's main finding?

Chapter 8: Safety & Ethics

Systems that autonomously generate scientific manuscripts raise serious concerns. The authors devote considerable space to this, and it deserves careful attention.

Dual-use risks

The most obvious concern: AI Scientist-v2 could be used to flood conferences and journals with AI-generated submissions. If the system can produce workshop-level papers at scale, it could overwhelm the peer review system — which already struggles with the volume of human submissions.

More subtle: a system that generates plausible-looking research could be used to produce fake but convincing papers that support a desired conclusion. This is especially dangerous in politically sensitive areas (climate science, drug efficacy) where manufactured evidence could influence policy.

Safeguards implemented

The broader question

Should AI-generated papers be published in the scientific record? The authors deliberately avoid a strong stance. They argue for "transparent experimentation" — conducting controlled studies like the ICBINB evaluation to understand AI capabilities and limitations, rather than either banning AI-generated work outright or allowing it without guardrails.

The citation hallucination problem: Like all LLM-based systems, AI Scientist-v2 occasionally hallucinates citations — referencing papers that don't exist or attributing findings to the wrong authors. This is a practical concern for scientific integrity that no current safeguard fully addresses. The system's use of Semantic Scholar helps, but doesn't eliminate the problem entirely.

The review integrity question

Perhaps the most immediate concern: if AI-generated papers become common, peer review itself changes. Reviewers may become suspicious of all submissions, demanding proof of human authorship. This creates overhead for legitimate researchers and erodes trust in the review process. The ICBINB experiment was conducted transparently, but not all future uses will be.

The authors note that both v1 and v2 always include an explicit disclosure that manuscripts are AI-generated. But this is a voluntary safeguard. There's nothing preventing someone from removing that disclosure and submitting the output as their own work.

The economic argument

If autonomous systems can produce workshop-level research at low cost ($20-50 per paper run), what happens to early-career researchers whose workshop publications are crucial for career progression? The authors don't address this directly, but it's a consequence worth considering. Science has historically been a human endeavor where the process of doing research — learning to formulate hypotheses, design experiments, interpret results — is as valuable as the output.

On the other hand, if these systems are used as tools rather than replacements — helping researchers explore more hypotheses, catch figure errors, draft initial manuscripts — they could democratize research by reducing the labor barrier to entry. The dual-use nature is real and unresolved.

What key safeguard did the authors implement for the ICBINB workshop submission?

Chapter 9: Connections

The AI Scientist-v2 sits at the intersection of several active research threads. Here's how it connects to the broader landscape.

Predecessors and inspirations

Concurrent and related systems

The key architectural pattern

Stepping back, AI Scientist-v2 exemplifies a pattern appearing across many recent systems: LLM + tree search + tool use + multi-modal feedback. The LLM provides the reasoning and code generation. Tree search provides structured exploration. Tool use (Python interpreter, Hugging Face, Semantic Scholar) grounds the system in reality. And VLM feedback closes the loop by evaluating outputs the text-only LLM can't assess. This same pattern shows up in AlphaEvolve (LLM + evolutionary search + code execution) and in concurrent work on agentic coding (LLM + tree search + test execution).

Open questions

The trajectory: In August 2024, AI Scientist v1 produced papers judged "below workshop level." By April 2025, v2 produced a paper that passed workshop peer review. If this trajectory continues, conference-level papers from autonomous systems may arrive within 1-2 years. Whether that's exciting or alarming depends on your perspective, and on the safeguards the community builds.

What system inspired v2's tree-search approach to experimentation?