AI Scientist v2 — Veanors

Chapter 0: The Problem

In 2024, Lu et al. released The AI Scientist — the first system to automate the full scientific discovery pipeline end-to-end. Give it a research topic and it would generate hypotheses, write code, run experiments, produce figures, and draft a complete manuscript. Impressive, but deeply limited in practice.

The core limitation: human-authored code templates. For every new research domain, a human researcher had to hand-write a baseline codebase. The AI Scientist v1 could only make incremental modifications to this template — swap a loss function, tweak a hyperparameter, add a regularization term. It couldn't build an experiment from scratch.

This made v1 brittle. Want to study compositional generalization? Someone first writes a training loop, a dataset loader, an evaluation harness. Want to study vision transformers? A completely different template. The "automation" was really just "incremental code editing given a scaffold."

The deeper issue: Beyond templates, v1's experimentation was strictly linear. Each code change built directly on the previous one, like a single chain of edits. Real science doesn't work this way. Researchers explore branching hypotheses, backtrack from dead ends, and pursue multiple directions simultaneously. v1's linear approach meant it was perpetually myopic — it could never backtrack from a bad decision or explore an alternative path.

The papers v1 produced reflected these limitations. Internal evaluation found them "below the level of top ML venues." The manuscripts lacked depth, had superficial experiment designs, and the figures were often poorly formatted (since no vision model ever looked at them). No v1 paper was submitted to peer review, because the authors judged none were good enough.

Three specific failure modes defined v1:

Template dependency. Each new domain required manual setup, killing the dream of autonomous science.
Linear experimentation. No branching, no backtracking, no parallel exploration of hypotheses.
No visual review. Generated figures were never checked by a vision model. Mislabeled axes, missing legends, and garbled plots went unnoticed.

There was also a subtler issue: v1's manuscript writing used Aider to make incremental LaTeX edits, section by section. This often produced internally inconsistent papers where the introduction promised one thing and the experiments delivered another. The system had no "big picture" view of the document.

The AI Scientist-v2 addresses all of these. The question is: how?

What was the most fundamental limitation of The AI Scientist v1?

It required human-authored code templates for each domain, limiting autonomy and generalizability
It couldn't generate text at all
It only worked with vision models, not language models

Chapter 1: The Key Insight

The AI Scientist-v2 rests on one central idea: treat scientific experimentation as tree search, not linear editing.

In v1, the experiment flow looked like a chain:

Template code: human-written baseline
↓
Edit 1: tweak learning rate
↓
Edit 2: change loss function
↓
Edit 3: add regularizer
↓
Write paper: based on final code state

Every edit depended on the one before it. If Edit 2 was a mistake, Edit 3 inherited the damage. There was no way to go back.

In v2, the flow is a tree. Each experiment is a node. Promising nodes get expanded with refinements. Buggy nodes get debugging children. The system can explore multiple hypotheses in parallel and select the best branch at each stage. This is managed by a dedicated Experiment Manager agent that coordinates four distinct research stages.

Three innovations work together to make this possible:

No templates. The system generates code from scratch, starting from a research idea and a topic description. This eliminates the human bottleneck and lets the system generalize across ML domains.
Progressive agentic tree search. An Experiment Manager agent guides experimentation through four stages (feasibility, tuning, research execution, ablations), using tree search within each stage to explore the hypothesis space.
VLM figure refinement. A vision-language model reviews every generated figure for clarity, correctness, and aesthetics. Figures that fail the VLM check cause their experiment node to be marked as "buggy" and re-attempted.

Why this combination matters: Any one of these improvements alone would help. But they're synergistic. Template-free operation means the tree search explores a much larger space (not just variations on a fixed codebase). The VLM feedback prevents the tree from wasting branches on experiments whose results can't be clearly communicated. And the staged Experiment Manager prevents the tree from growing aimlessly — it imposes the structure of real scientific methodology.

The payoff: one of the three manuscripts v2 generated and submitted to ICLR's ICBINB workshop received scores of 6, 7, and 6, an average of 6.33, placing it in roughly the top 45% of submissions. That cleared the workshop's acceptance threshold, although the paper was withdrawn after review rather than published. According to the authors, this is the first time a fully AI-generated paper has passed peer review.

What is the central architectural shift from v1 to v2?

Replacing linear sequential experimentation with agentic tree search over a branching hypothesis space
Using a bigger language model
Adding more human reviewers to the loop

Chapter 2: v1 vs v2 Comparison

Let's lay the differences out side by side. These aren't incremental tweaks — nearly every component was redesigned.

Feature	AI Scientist v1	AI Scientist v2
Codebase	Topic-specific templates (human-authored)	Generated from scratch (domain-general)
Idea generation	Conditioned on existing template code	Open-ended, with Semantic Scholar in the loop
Experiment planning	Linear — each edit builds on the last	Tree-based — branch, backtrack, explore in parallel
Experiment stages	Single pass	4 stages: feasibility → tuning → execution → ablation
Parallel execution	No	Yes — multiple nodes expanded concurrently
Figure review	None (text-only models)	VLM checks every figure for clarity and correctness
Manuscript writing	Incremental edits via Aider	Single-pass generation + reflection with o1
Reviewer	Text-only LLM	VLM-augmented (sees figures + text together)
Human evaluation	Not submitted to peer review	One paper accepted at ICLR workshop (6.33 avg)
Dataset handling	Bundled with templates	Hugging Face Hub auto-download

The workflow pipeline

The v2 pipeline proceeds through five major phases:

1. Idea Generation

Open-ended brainstorming with literature search (Semantic Scholar). ~20 ideas generated, best selected by researchers.

↓

2. Experimentation

4-stage tree search managed by the Experiment Manager. Code generated from scratch, executed, VLM-reviewed.

↓

3. Figure Aggregation

Best experiment nodes produce final figures. VLM iteratively refines aesthetics and accuracy.

↓

4. Manuscript Writing

Single-pass LaTeX generation using experiment results, figures, and summaries. No incremental editing.

↓

5. Reflection & Review

Reasoning model (o1) reviews the manuscript. VLM checks figure-caption alignment. Iterative refinement.

Models used: The primary LLM is Claude 3.5 Sonnet for experiment code generation and management. GPT-4o serves as the VLM for figure review. The manuscript reflection stage uses o1 for deeper reasoning about scientific coherence. The AI reviewer uses GPT-4o for VLM-augmented review.

Cost and compute

Each complete run (idea → manuscript) costs approximately $20-50 in API calls and takes several hours of wall-clock time. The majority of cost comes from the experiment tree search phase, where many LLM calls generate and evaluate code. Experiment execution itself uses standard academic compute (single GPU). For the ICBINB submission, multiple seeds were run per idea, so the total cost per submitted paper was higher — but still orders of magnitude cheaper than a human researcher's salary.

Which of the following is NOT a change from v1 to v2?

Template-free code generation
Tree-based experimentation
Using reinforcement learning to optimize the experiment pipeline

Chapter 3: The Experiment Manager

This is the heart of v2. The Experiment Manager is a dedicated LLM agent that orchestrates the entire experimental process through four stages, using tree search within each stage to explore the hypothesis space systematically.

The four stages

Real scientific research follows a natural progression: first check if an idea is feasible, then tune the setup, then run the core experiments, then validate with ablations. The Experiment Manager enforces this structure:

Stage 1: Preliminary Investigation

Establish feasibility. Build a minimal working prototype from the research idea. Stops when basic code runs successfully.

↓ Best node carried forward

Stage 2: Hyperparameter Tuning

Refine the prototype. Optimize learning rate, epochs, batch size. Stops when training curves converge across at least 2 datasets.

↓ Best node carried forward

Stage 3: Research Agenda Execution

Run the core experiments. Test the main hypotheses. If experiments finish too fast, the system suggests increasing complexity. Stops when compute budget is exhausted.

↓ Best node carried forward

Stage 4: Ablation Studies

Validate findings. Remove components one at a time. Run replications with different seeds for statistical significance (mean ± std).
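
The stage progression above can be sketched as a small state machine. This is only an illustration under stated assumptions: the names `Stage`, `StageState`, `stage_complete`, and `next_stage` are hypothetical, and treating an exhausted budget as ending any stage is an assumption rather than something the paper's code is known to do.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    PRELIMINARY_INVESTIGATION = 1
    HYPERPARAMETER_TUNING = 2
    RESEARCH_AGENDA_EXECUTION = 3
    ABLATION_STUDIES = 4


@dataclass
class StageState:
    """Hypothetical summary of what the manager knows about the current stage."""
    has_working_prototype: bool = False
    converged_datasets: int = 0
    budget_exhausted: bool = False


def stage_complete(stage: Stage, state: StageState) -> bool:
    """Apply the stopping criteria described for each stage."""
    if state.budget_exhausted:
        # Assumption: running out of budget ends whichever stage is active.
        return True
    if stage is Stage.PRELIMINARY_INVESTIGATION:
        return state.has_working_prototype
    if stage is Stage.HYPERPARAMETER_TUNING:
        # Training curves must converge on at least two datasets.
        return state.converged_datasets >= 2
    # Stages 3 and 4 run until their budget is used up.
    return False


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the following stage, or None after the ablation stage."""
    if stage < Stage.ABLATION_STUDIES:
        return Stage(stage + 1)
    return None
```

In this sketch the best node from a finished stage would be handed to whatever `next_stage` returns, and a return value of `None` would signal that experimentation is over.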

How tree search works within each stage

Within each stage, the system builds a tree of experiment nodes. Each node contains:

An experiment script (Python code)
A textual plan describing what this experiment tests
An execution trace (error messages if it crashed)
Performance metrics (loss, accuracy, etc.)
Generated figures and VLM feedback on them
A status flag: buggy or non-buggy
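
As a rough picture of what such a node might look like in code, here is a minimal sketch. The class name `ExperimentNode` and all of its field names are hypothetical; they mirror the list above, not the actual data structure in the released codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExperimentNode:
    """One experiment in the search tree (illustrative field names)."""
    code: str                      # the Python experiment script
    plan: str                      # what this experiment is meant to test
    execution_trace: str = ""      # stdout, or the error message on a crash
    metrics: dict = field(default_factory=dict)
    figure_paths: list = field(default_factory=list)
    vlm_feedback: list = field(default_factory=list)
    is_buggy: bool = False
    # Excluded from repr and comparison to avoid parent <-> child recursion.
    parent: Optional["ExperimentNode"] = field(
        default=None, repr=False, compare=False
    )
    children: list = field(default_factory=list)

    def add_child(self, child: "ExperimentNode") -> "ExperimentNode":
        """Attach a refinement or debugging child and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Number of edges between this node and the root."""
        steps, node = 0, self
        while node.parent is not None:
            steps, node = steps + 1, node.parent
        return steps
```

Keeping a `parent` link is what makes backtracking cheap in a design like this: the search can return to any earlier node and grow a different child from it.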

At each iteration, the system selects nodes to expand. The selection strategy depends on the node type:

Non-buggy nodes are expanded via refinement — the LLM generates an improved version of the experiment code based on results so far. Selection uses best-first search, guided by an LLM evaluator that considers metrics, training dynamics, and figure quality.
Buggy nodes are expanded via debugging: the LLM receives the error trace and attempts to fix the code. With a predefined probability, the search picks a buggy node to debug instead of refining a working one, so a promising branch is not abandoned just because one of its experiments crashed.

New child nodes are created and executed in parallel, dramatically accelerating exploration compared to v1's sequential approach.
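
The selection rule can be sketched in a few lines. This is a simplified stand-in, not the system's implementation: `Candidate`, `select_node`, and `debug_prob` are hypothetical names, and the numeric `score` replaces the LLM evaluator's multi-factor judgment purely so the example can run.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    is_buggy: bool
    score: float = 0.0  # stand-in for the LLM evaluator's ranking


def select_node(nodes, debug_prob, rng):
    """Pick one node to expand and say how it should be expanded."""
    buggy = [n for n in nodes if n.is_buggy]
    healthy = [n for n in nodes if not n.is_buggy]

    # With probability debug_prob (or when nothing else exists), debug.
    if buggy and (not healthy or rng.random() < debug_prob):
        return rng.choice(buggy), "debug"

    # Otherwise best-first: refine the highest-ranked working node.
    if healthy:
        return max(healthy, key=lambda n: n.score), "refine"

    raise ValueError("no nodes available to expand")
```

For example, `select_node(nodes, debug_prob=0.3, rng=random.Random(0))` would debug a buggy node in roughly 30% of calls and refine the best working node the rest of the time.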

The node execution cycle

Each node, once created, follows a fixed execution cycle:

1. Plan + Code Generation

LLM writes an experiment plan (natural language) and Python code implementing it

↓

2. Code Execution

Python script runs in an interpreter. Metrics saved to .npy files. If error: node marked buggy, error trace recorded.

↓

3. Plot Generation

System reads saved .npy files and generates visualizations summarizing results

↓

4. VLM Review

Vision model checks plots for clarity. Issues flagged → node marked buggy with VLM feedback recorded.

↓

5. Status Assignment

Node is classified as non-buggy (passed all checks) or buggy (code error or VLM rejection)
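
A minimal sketch of this cycle, assuming the experiment script writes its own figures as PNG files (so steps 2 and 3 are collapsed into one run) and that the VLM is represented by a plain callable. `run_node` and `review_figures` are hypothetical names, and the result dictionary is illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_node(code, review_figures, timeout_s=60):
    """Execute one experiment script and classify it as buggy or not.

    review_figures stands in for the VLM: it receives the figure file
    names and returns a list of issues (empty when the figures pass).
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "experiment.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"buggy": True, "trace": "timed out", "vlm_feedback": []}

        if proc.returncode != 0:
            # Code error: keep the traceback for a later debugging child.
            return {"buggy": True, "trace": proc.stderr, "vlm_feedback": []}

        figures = sorted(p.name for p in Path(workdir).glob("*.png"))
        issues = review_figures(figures)
        # A VLM rejection also marks the node as buggy.
        return {"buggy": bool(issues), "trace": proc.stdout, "vlm_feedback": issues}
```

Running each script in a separate interpreter process means a crash or hang in generated code cannot take down the search loop itself, which is one plausible reason to structure the cycle this way.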

Specialized node types

Beyond basic experiment nodes, the system creates specialized variants:

Node Type	Stage	Purpose
Hyperparameter	2	Systematically explore alternative hyperparameters. Tracks previously tested configs to avoid redundancy.
Ablation	4	Remove one component at a time. Tracks previously tested ablation conditions.
Replication	3, 4	Re-run parent experiment with different random seeds. Enables mean ± std reporting.
Aggregation	4	Collect results from replication nodes. Produce combined figures with error bars. No new experiments.
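
To make the replication and aggregation roles concrete, here is a small sketch of turning per-seed results into mean ± std values. The function names and the dictionary layout are assumptions for illustration, and the sample standard deviation is used here although the source does not say which variant the system reports.

```python
from statistics import mean, stdev


def aggregate_replications(runs):
    """Combine per-seed metric dicts into {metric: (mean, std)}.

    runs is a list with one dict per random seed, for example
    [{"accuracy": 0.70}, {"accuracy": 0.72}].
    """
    if len(runs) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    summary = {}
    for name in sorted(runs[0]):
        values = [run[name] for run in runs]
        summary[name] = (mean(values), stdev(values))  # sample std
    return summary


def format_mean_std(avg, spread, digits=2):
    """Render a value the way it would appear in a results table."""
    return f"{avg:.{digits}f} ± {spread:.{digits}f}"
```

An aggregation node in this picture runs no new experiments; it only reads what the replication nodes saved and summarizes it.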

The key difference from AIDE: The AI Scientist-v2's tree search is inspired by AIDE (Jiang et al., 2025), which uses tree search for ML engineering tasks. But AIDE operates with a single scalar score per node (e.g., validation accuracy). The AI Scientist-v2 uses an LLM evaluator that considers multiple factors — not just accuracy, but training dynamics, figure quality, VLM feedback, and scientific coherence. This is necessary because "good science" isn't a single number.

A worked example: one iteration of tree expansion

Suppose we're in Stage 3 (Research Agenda Execution). The tree currently has three non-buggy nodes and one buggy node. Here's what happens in one iteration:

In this example the system expands two nodes per iteration. For each expansion it rolls a random number: with probability p, it selects the buggy node for debugging. Otherwise, it asks the LLM evaluator to rank the three non-buggy nodes. Suppose one roll picks the buggy node and the other picks a refinement.
The evaluator receives each node's metrics (train/val loss curves, accuracy), its experiment plan, and VLM feedback on its figures. It returns a ranking. The top-ranked node is selected for refinement.
The LLM generates a new experiment plan and Python code for each child node. For the refinement: "The parent achieved 72% accuracy using a 2-layer LSTM. I'll try adding attention over hidden states and increasing hidden dim from 128 to 256." For the debugging child: "The parent crashed with OOM. I'll reduce batch size from 256 to 64 and add gradient checkpointing."
Both new child nodes are executed in parallel. Each runs its Python script, saves metrics to .npy files, generates plots, and submits plots to the VLM.
Results are recorded. The tree grows by two nodes. The next iteration begins.

This cycle repeats until the compute budget for the current stage is exhausted. Then the Experiment Manager selects the single best node (via the LLM evaluator) to seed the next stage.
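
The loop described in this example can be sketched as follows. Everything here is an assumption made for illustration: the function names, the use of a thread pool, and measuring the budget as a number of node executions. The selection, code generation, and execution steps are passed in as plain callables.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_in_parallel(parents, make_child, run_child, max_workers=2):
    """Create one child per selected parent and execute them concurrently."""
    children = [make_child(parent) for parent in parents]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        results = list(pool.map(run_child, children))
    return list(zip(children, results))


def run_stage(tree, select, make_child, run_child, budget, workers=2):
    """Grow the tree until the stage budget (in node runs) is spent."""
    while budget > 0:
        batch = min(workers, budget)
        parents = [select(tree) for _ in range(batch)]
        for child, result in expand_in_parallel(
            parents, make_child, run_child, max_workers=batch
        ):
            tree.append((child, result))
        budget -= batch
    return tree
```

With `workers=2` each pass through the loop adds two nodes, matching the example above; once the budget reaches zero, the caller would ask the evaluator for the single best node to seed the next stage.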

What happens when the tree search encounters a buggy node (one whose code crashed)?

It creates a debugging child node: the LLM receives the error trace and attempts to fix the code
It deletes the node and starts over from the root
It stops the entire experiment and asks a human for help

Chapter 4: Template-Free Operation

This is where v2 becomes truly autonomous. Instead of starting from a human-written codebase, v2 starts from nothing but a research idea.

The process begins in the idea generation phase. The system is prompted to brainstorm open-ended research directions for a given topic — something like "negative results in deep learning" (the ICBINB workshop theme). It generates roughly twenty candidate ideas, each described as a title and short hypothesis. Critically, the system has access to Semantic Scholar during this phase, so it can check whether an idea is novel and identify relevant prior work.

Once an idea is selected, the Experiment Manager takes over. Here's how Stage 1 (Preliminary Investigation) works without a template:

The LLM receives the research idea as a natural language description (title + hypothesis + experimental plan).
It generates a complete Python experiment script from scratch — dataset loading (via Hugging Face Hub), model definition, training loop, evaluation, and metric logging.
The script is executed in a Python interpreter. If it crashes, the error trace is recorded and the node is marked buggy. A debugging child is spawned.
If it succeeds, the system saves metrics to numpy files and generates plots.
The VLM reviews the plots. If they're unclear or incorrect, the node is marked buggy.
Non-buggy nodes proceed to refinement — the LLM generates improved versions.

Why Hugging Face Hub matters: A key enabler of template-free operation is standardized dataset access. v2 prompts the LLM to use datasets.load_dataset() from Hugging Face whenever possible. This provides a consistent API for downloading hundreds of ML datasets with predefined train/val/test splits. Without this, the LLM would have to write custom data loading code for every experiment — a major source of bugs.
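
A short sketch of what that one-line access looks like in generated code. `load_dataset` is the real entry point of the Hugging Face `datasets` package; the wrapper `load_splits` and its `loader` parameter are hypothetical additions that make the function testable without network access, and not every Hub dataset ships all three splits.

```python
def load_splits(dataset_name, loader=None):
    """Return a plain dict mapping split name to data for one dataset.

    By default this calls datasets.load_dataset, which downloads the
    dataset from the Hugging Face Hub on first use. A different loader
    can be injected, which is handy for offline tests.
    """
    if loader is None:
        # Imported lazily so the module works without `datasets` installed.
        from datasets import load_dataset
        loader = load_dataset

    dataset = loader(dataset_name)
    # load_dataset without a split argument returns a dict-like object
    # keyed by split name (for example "train" and "test").
    return {split: dataset[split] for split in dataset}
```

A generated experiment script would then call something like `load_splits("imdb")` and index the result by split name, instead of writing dataset-specific download and parsing code.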

Manuscript writing: single-pass + reflection

The manuscript writing phase also changed fundamentally. v1 used Aider to iteratively edit LaTeX files section by section, which was slow and often produced inconsistent text (early sections didn't know what later sections would say). v2 uses a single-pass generation: the LLM writes the entire manuscript at once, given the experiment results, figures, and summaries from the best experiment nodes.

After the initial draft, a separate reflection stage uses a reasoning model (o1) to review the manuscript holistically. The reflection stage also receives the target page limit (e.g., 4 pages for the ICBINB workshop) alongside the current PDF length, allowing it to suggest cuts or expansions as needed. This two-phase approach (draft + reflect) is cleaner and more reliable than incremental editing, and the separation of concerns means the drafting model can focus on content while the reflection model focuses on quality.

The prompt structure

What does the LLM actually receive when generating experiment code? The prompt includes:

The research idea (title, hypothesis, expected outcomes)
The current experiment stage (1-4) and its goals
Results from the best parent node (if not Stage 1)
Error traces (if debugging a buggy node)
Instructions to save metrics as .npy files and generate plots
Instructions to use Hugging Face Hub for datasets

The system doesn't receive any starter code, model architecture, or training recipe. Everything is generated from the idea description alone. This is what makes v2 domain-general: the same pipeline that studies compositional generalization in sequence models could equally study data augmentation in vision or reward shaping in RL.
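
To show how those ingredients might be assembled, here is a hedged sketch of a prompt builder. The function name, the argument names, and all of the wording inside the strings are illustrative assumptions; the system's real prompts are not reproduced here.

```python
def build_experiment_prompt(idea, stage, stage_goal,
                            parent_results=None, error_trace=None):
    """Assemble the text handed to the code-generating LLM.

    idea is a dict with "title", "hypothesis", and "expected_outcomes".
    """
    sections = [
        f"Research idea: {idea['title']}",
        f"Hypothesis: {idea['hypothesis']}",
        f"Expected outcomes: {idea['expected_outcomes']}",
        f"Current stage: {stage} of 4. Goal: {stage_goal}",
    ]
    if stage > 1 and parent_results:
        # Later stages build on the best node from the previous stage.
        sections.append(f"Parent results: {parent_results}")
    if error_trace:
        # Only present when the child is debugging a buggy parent.
        sections.append(f"Error trace: {error_trace}")
    sections.append("Save all metrics as .npy files and generate plots from them.")
    sections.append("Load datasets from the Hugging Face Hub where possible.")
    return "\n\n".join(sections)
```

Note what is absent: there is no argument for starter code, a model architecture, or a training recipe, which is the point of template-free operation.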

What the plot aggregation prompt looks like

At the end of experimentation, the system generates a final plot aggregation script. The actual prompt instructs the LLM to:

From the real prompt: "Combine relevant existing plotting code. Create a complete set of final scientific plots, stored in figures/ only. Use existing .npy data — do NOT hallucinate data. Only create plots where the data is best presented as a figure and not as a table. Put each plot in a separate try-catch block so one failure doesn't affect others. Create only plots that are unique and needed for the final paper."

This is a small but important detail. The system explicitly separates experiment execution (which saves raw data as .npy arrays) from figure generation (which reads those arrays and produces publication-quality plots). This separation prevents a common failure mode: plots that look right but are generated from hallucinated data rather than actual experiment outputs.
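
The instruction to isolate each plot so that one failure does not affect the others can be sketched like this. `render_all` and the job dictionary are hypothetical; in a real aggregation script each job would load a saved `.npy` array and write one image under `figures/`, which is left out here to keep the example free of third-party dependencies.

```python
def render_all(plot_jobs):
    """Run every plotting job in isolation and report what succeeded.

    plot_jobs maps a plot name to a zero-argument callable that creates
    the figure and returns the path it wrote.
    """
    saved, failed = [], {}
    for name, make_plot in plot_jobs.items():
        try:
            saved.append(make_plot())
        except Exception as exc:
            # One broken plot (say, a missing data file) is recorded
            # and skipped instead of aborting the remaining plots.
            failed[name] = f"{type(exc).__name__}: {exc}"
    return saved, failed
```

Because every job reads from files that the experiment run saved earlier, a failure here points to missing or malformed data rather than tempting the model to invent numbers to plot.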

How does v2 handle dataset loading without human-written template code?

It prompts the LLM to use Hugging Face Hub's datasets.load_dataset() for standardized, one-line data access
It downloads all possible datasets in advance
It only works with built-in PyTorch datasets

Chapter 5: VLM Figure Refinement

v1 had a blind spot: it never looked at its own figures. The language models generating plots couldn't verify whether the output was clear, correctly labeled, or even legible. This led to manuscripts with garbled axes, missing legends, and misleading visualizations — the kind of issues a human reviewer would catch immediately.

v2 integrates Vision-Language Models (VLMs) at two critical points in the pipeline.

Point 1: During experimentation (tree search)

After each experiment node generates its figures, the VLM (GPT-4o) reviews them. It checks for:

Label clarity: Are axes labeled? Is there a legend?
Visual accuracy: Does the figure match the data description?
Readability: Are colors distinguishable? Is text legible?
Misleading elements: Truncated axes, cherry-picked ranges, ambiguous scales?

If the VLM flags any issues, the node is marked buggy and the feedback is recorded. When a debugging child is spawned from this node, the LLM receives the VLM's critique alongside the code, enabling targeted fixes. This means figure quality improves through the same tree search mechanism that improves experiment quality.

Point 2: During manuscript reflection

After the manuscript is drafted, the system extracts each figure along with its caption and the text that references it (identified by searching for "Figure X" in the manuscript). These figure-caption-context triples are sent to the VLM for a second round of checks:

Figure-caption alignment: Does the caption accurately describe what's shown?
Figure-text alignment: Does the surrounding text correctly interpret the figure?
Duplication detection: Are the same figures repeated between main text and appendix?
Aesthetic quality: Layout, spacing, color choices for a polished paper.
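
The extraction of figure-caption-context triples can be approximated with a simple text search. This sketch assumes the manuscript is available as plain text with blank lines between paragraphs and that captions are already known by figure number; `figure_contexts` is a hypothetical name and the real system may locate references differently.

```python
import re


def figure_contexts(manuscript_text, captions):
    """Pair each figure's caption with the paragraphs that mention it.

    captions maps a figure number to its caption string.
    """
    paragraphs = [p.strip() for p in manuscript_text.split("\n\n") if p.strip()]
    triples = []
    for number, caption in sorted(captions.items()):
        # The word boundaries stop "Figure 1" from matching "Figure 12".
        pattern = re.compile(rf"\bFigure\s+{number}\b")
        mentions = [p for p in paragraphs if pattern.search(p)]
        triples.append({"figure": number, "caption": caption, "context": mentions})
    return triples
```

Each resulting entry, together with the rendered image, is what would be sent to the VLM for the alignment checks listed above. A figure whose `context` list comes back empty is itself a useful signal, since it is never discussed in the text.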

The feedback loop in practice: During tree search, a node generates a training curve with no axis labels. The VLM flags this. A debugging child receives: "VLM feedback: Missing axis labels on training loss plot. Add xlabel('Epoch') and ylabel('Loss')." The LLM fixes the plotting code. The new figure passes VLM review. The node is marked non-buggy and can now be expanded via refinement instead of further debugging.

This two-stage VLM integration is one of the simplest innovations in the paper, but one of the most impactful for manuscript quality. It is not a complete fix: human reviewers at the ICBINB workshop still flagged some figure issues in the submitted papers, including caption inaccuracies.

What happens when the VLM flags issues with a figure during tree search?

The experiment node is marked buggy, and when a debugging child is spawned it receives the VLM critique to fix the plotting code
The figure is deleted and no replacement is generated
A human is asked to fix the figure manually

Chapter 6: Enhanced Reviewer

v1 included an automated AI reviewer — an LLM that scored generated manuscripts on a 1-10 scale across several criteria. But it had a fundamental limitation: it was text-only. The reviewer read the LaTeX source but never saw the rendered figures. This meant it couldn't assess whether figures supported the claims in the text.

v2 upgrades the reviewer with VLM capabilities. The enhanced reviewer:

Reads the text of the manuscript, evaluating clarity, novelty, methodology, and significance.
Views every figure alongside its caption, checking alignment between visual evidence and written claims.
Assesses experimental rigor — are error bars present? Are baselines fair? Are the datasets appropriate for the claims?
Produces a structured review with a numerical score and actionable feedback.

Correlation with human reviewers

The paper evaluates the enhanced reviewer against human peer review scores from the ICBINB workshop. The VLM-augmented reviewer shows higher correlation with human scores than the text-only reviewer from v1. This is expected: a reviewer that can actually see the figures makes more holistic judgments.

Paper	Human Avg Score	AI Review (v2, VLM)	Outcome
Compositional Regularization	6.33	Highest ranked	Accepted (withdrawn post-review)
Paper 2	Below threshold	Middle ranked	Rejected
Paper 3	Below threshold	Lowest ranked	Rejected

The ranking matched: The internal AI reviewer ranked the three submitted manuscripts in the same order as the human peer reviewers. The manuscript that scored highest with humans (the one that passed peer review) also scored highest with the AI reviewer. The two rejected manuscripts scored lower on both human and AI evaluations. While this is a small sample (n=3), it's encouraging evidence that VLM-augmented review captures something real about paper quality.

What the AI reviewer misses

Even the enhanced reviewer has blind spots. The human reviewers at ICBINB flagged issues the AI reviewer did not:

Insufficient intuitive explanation — the paper didn't explain why the proposed regularization should work.
Narrow experimental scope — only synthetic datasets, no real-world tasks.
Missing alternative approaches — the paper didn't compare against other compositional methods.

These are higher-level scientific judgment calls that current AI reviewers struggle with. They require understanding what makes a research contribution convincing to the community, not just whether the text and figures are internally consistent.

What key capability does v2's reviewer add over v1's text-only reviewer?

VLM integration that lets the reviewer see and evaluate figures alongside text and captions
The ability to run the paper's code and verify results
Access to a database of all published papers for novelty checking

Chapter 7: The Accepted Paper

Three manuscripts were submitted to the ICLR 2025 "I Can't Believe It's Not Better" (ICBINB) workshop. One was accepted. Let's dissect what happened.

The evaluation protocol

The evaluation was carefully designed with ethical safeguards:

Three fully autonomous manuscripts were included among 43 total workshop submissions.
Reviewers were told in advance that some submissions might be AI-generated, but not which ones. They could opt out of reviewing potentially AI-generated work.
After review, the accepted manuscript was withdrawn before publication — the authors did not want to set a precedent for AI-generated papers entering the scientific record without broader community discussion.
Reviewers were informed of the experiment only after peer review was complete.
The study had IRB approval from the University of British Columbia.

The accepted paper: compositional regularization

The workshop-accepted paper investigated whether adding a compositional regularization term to the training loss could improve neural network generalization. The idea: penalize large deviations between embeddings of successive time steps in an LSTM-based sequence model, hypothesizing this encourages compositional representations.
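
One way to read "penalize large deviations between embeddings of successive time steps" is as a smoothness penalty added to the task loss. The sketch below is an assumption about the general shape of such a term, not the formula from the generated paper: the squared Euclidean distance, the averaging over steps, and the names `smoothness_penalty`, `total_loss`, and `weight` are all illustrative.

```python
def smoothness_penalty(hidden_states):
    """Mean squared change between consecutive hidden-state vectors.

    hidden_states is a list of equal-length lists of floats, one per
    time step of the sequence model.
    """
    if len(hidden_states) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(hidden_states, hidden_states[1:]):
        # Squared Euclidean distance between step t and step t + 1.
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(hidden_states) - 1)


def total_loss(task_loss, hidden_states, weight):
    """Task loss plus the weighted regularization term."""
    return task_loss + weight * smoothness_penalty(hidden_states)
```

Setting `weight` to zero recovers the unregularized baseline, which is the comparison the experiments are built around.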

The experimental setup: train LSTM models on synthetic arithmetic expression datasets (tasks like evaluating "3 + (2 * 4)"), with and without the proposed regularization term. Evaluate whether the regularized model generalizes better to unseen combinations of operations.

The result was a negative finding — compositional regularization did not significantly improve generalization performance, and in some cases actually harmed training. Furthermore, increasing the complexity of arithmetic expressions made generalization worse regardless of regularization. The paper concludes that explicitly enforcing compositional structure via regularization alone may not be sufficient.

This aligned well with the ICBINB workshop theme (unexpected failures and negative results). The paper received scores of 6 (weak accept), 7 (accept), and 6 (weak accept), averaging 6.33 — placing it roughly in the top 45% of the 43 total submissions.

What reviewers praised

Clearly identified the challenge of compositional regularization
Honest reporting of negative results
Technically sound experimental setup

What reviewers criticized

Insufficient justification for why the chosen regularization method should enhance compositionality
Limited to synthetic arithmetic datasets — no real-world evaluation
Some figure caption inaccuracies
Missing comparisons with alternative compositional methods

The initial idea (from the system's own generation)

Verbatim from the AI-generated idea: "Introducing a compositional regularization term during training can encourage neural networks to develop compositional representations, thereby improving their ability to generalize to novel combinations of known components."

The system started from this one-sentence hypothesis, generated all code from scratch, ran experiments on synthetic arithmetic datasets, discovered the regularization didn't work, and wrote a paper honestly reporting the negative result. The ICBINB workshop specifically values such negative findings — a lucky alignment between the system's output and the venue's theme.

The other two submissions

Both were rejected. The authors' internal analysis agrees with this outcome: they judged only one of the three to be workshop-quality. The system was run with multiple random seeds per idea, and the best manuscript from each idea's seed runs was selected for submission — similar to a professor selecting the best work from multiple students.

Milestone, not mastery: The authors are clear that this is not "AI replacing scientists." The accepted paper is workshop-level, not conference-level. Their internal review found issues the workshop reviewers missed — potential dataset overlap, unclear descriptions of what was being regularized. And the human involvement (selecting ideas, choosing the best seed, setting compute budgets) is non-trivial. But as a proof of concept, it establishes that the gap between AI-generated and human-authored research is narrowing to the point where peer reviewers can't reliably distinguish them at the workshop level.

What was the accepted paper's main finding?

A negative result: compositional regularization did not improve generalization and sometimes harmed it
A positive result showing regularization dramatically improves compositional generalization
A theoretical proof that compositional regularization is impossible

Chapter 8: Safety & Ethics

Systems that autonomously generate scientific manuscripts raise serious concerns. The authors devote considerable space to this, and it deserves careful attention.

Dual-use risks

The most obvious concern: AI Scientist-v2 could be used to flood conferences and journals with AI-generated submissions. If the system can produce workshop-level papers at scale, it could overwhelm the peer review system — which already struggles with the volume of human submissions.

More subtle: a system that generates plausible-looking research could be used to produce fake but convincing papers that support a desired conclusion. This is especially dangerous in politically sensitive areas (climate science, drug efficacy) where manufactured evidence could influence policy.

Safeguards implemented

Mandatory labeling: All AI Scientist manuscripts explicitly label themselves as AI-generated. The system always includes a disclosure statement.
Post-review withdrawal: The accepted ICBINB paper was withdrawn before publication to avoid setting a precedent.
IRB approval: The peer review experiment was approved by UBC's institutional review board.
Reviewer consent: Reviewers were told AI-generated papers might be present and could opt out.
Open-source release: The codebase is public, enabling the community to study and improve detection methods.

The broader question

Should AI-generated papers be published in the scientific record? The authors deliberately avoid a strong stance. They argue for "transparent experimentation" — conducting controlled studies like the ICBINB evaluation to understand AI capabilities and limitations, rather than either banning AI-generated work outright or allowing it without guardrails.

The citation hallucination problem: Like all LLM-based systems, AI Scientist-v2 occasionally hallucinates citations — referencing papers that don't exist or attributing findings to the wrong authors. This is a practical concern for scientific integrity that no current safeguard fully addresses. The system's use of Semantic Scholar helps, but doesn't eliminate the problem entirely.

The review integrity question

Perhaps the most immediate concern: if AI-generated papers become common, peer review itself changes. Reviewers may become suspicious of all submissions, demanding proof of human authorship. This creates overhead for legitimate researchers and erodes trust in the review process. The ICBINB experiment was conducted transparently, but not all future uses will be.

The authors note that both v1 and v2 always include an explicit disclosure that manuscripts are AI-generated. But this is a voluntary safeguard. There's nothing preventing someone from removing that disclosure and submitting the output as their own work.

The economic argument

If autonomous systems can produce workshop-level research at low cost ($20-50 per paper run), what happens to early-career researchers whose workshop publications are crucial for career progression? The authors don't address this directly, but it's a consequence worth considering. Science has historically been a human endeavor where the process of doing research — learning to formulate hypotheses, design experiments, interpret results — is as valuable as the output.

On the other hand, if these systems are used as tools rather than replacements — helping researchers explore more hypotheses, catch figure errors, draft initial manuscripts — they could democratize research by reducing the labor barrier to entry. The dual-use nature is real and unresolved.

What key safeguard did the authors implement for the ICBINB workshop submission?

The accepted AI-generated paper was withdrawn before publication to avoid setting a precedent
They didn't submit any papers at all
They replaced the AI text with human-written versions before submission

Chapter 9: Connections

The AI Scientist-v2 sits at the intersection of several active research threads. Here's how it connects to the broader landscape.

Predecessors and inspirations

The AI Scientist v1 (Lu et al., 2024) — The direct predecessor. Established the end-to-end automated science pipeline but relied on templates and linear experimentation. v2 generalizes and deepens every component.
AIDE (Jiang et al., 2025) — Tree search for ML engineering tasks. Achieved state-of-the-art on MLEBench, a benchmark for ML engineering competitions. Inspired v2's tree-based experimentation, but AIDE uses scalar scores (e.g., validation accuracy) while v2 uses LLM-based multi-criteria evaluation. AIDE targets Kaggle-style competitions with clear metrics; v2 targets open-ended science where "quality" is multidimensional.
Reflexion (Shinn et al., 2024) — Iterative self-reflection for LLM agents. Models review their own outputs and improve through repeated attempts. The Experiment Manager's stage-wise progression echoes this iterate-and-improve loop, but adds the branching structure of tree search rather than Reflexion's linear retry chain.
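
The contrast between scalar scoring and multi-criteria evaluation can be made concrete with a small sketch. Everything here is illustrative: the Node structure, the criterion names (performance, plot_quality, soundness), and the plain average standing in for an LLM judge are assumptions, not the interface of AIDE or of AI Scientist-v2.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One candidate experiment in the search tree (hypothetical structure)."""
    name: str
    val_accuracy: float  # the single scalar a metric-driven search would use
    # Stand-in for per-criterion judgments in [0, 1]; a real system would
    # obtain these from an LLM evaluator rather than hard-coding them.
    criteria: Dict[str, float] = field(default_factory=dict)


def select_scalar(nodes: List[Node]) -> Node:
    """Scalar selection: keep the node with the best single metric."""
    return max(nodes, key=lambda n: n.val_accuracy)


def select_multi_criteria(nodes: List[Node]) -> Node:
    """Multi-criteria selection, approximated here by a plain average."""
    def mean_score(n: Node) -> float:
        return sum(n.criteria.values()) / len(n.criteria)
    return max(nodes, key=mean_score)


nodes = [
    # Slightly higher accuracy, but weak figures and questionable soundness.
    Node("a", 0.91, {"performance": 0.90, "plot_quality": 0.2, "soundness": 0.4}),
    # Slightly lower accuracy, but solid on every other criterion.
    Node("b", 0.88, {"performance": 0.85, "plot_quality": 0.9, "soundness": 0.8}),
]

print(select_scalar(nodes).name)          # picks "a"
print(select_multi_criteria(nodes).name)  # picks "b"
```

The two selectors disagree on the same pair of nodes, which is the point: when "quality" has several dimensions, the node with the best headline metric is not necessarily the one worth expanding.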

Concurrent and related systems

Darwin Gödel Machine — Self-improving AI using open-ended evolutionary search. Both Darwin and v2 use tree-like exploration of hypothesis spaces, but Darwin targets self-improvement while v2 targets scientific discovery.
AlphaEvolve (Google DeepMind) — Evolutionary code generation for algorithm discovery. Shares the template-free, code-from-scratch philosophy with v2. Both systems use LLMs to generate code and evaluate results, but AlphaEvolve uses evolutionary pressure (populations of solutions with mutation and selection) rather than tree search, and focuses on discovering optimal algorithms rather than producing manuscripts.
Paper2Agent — Generates code implementations from published papers. Complementary to v2: Paper2Agent reproduces existing work, while v2 generates new work. Together they could form a pipeline where Paper2Agent turns prior papers into implementations, and AI Scientist v2 builds on them.
Sakana's broader agenda — AI Scientist is part of Sakana AI's research program on autonomous AI researchers. The Darwin Gödel Machine, also from Sakana, explores a different approach — self-modifying agents that improve their own code via evolutionary search. Both share the philosophy that AI systems should do science, but differ on whether the output is papers (AI Scientist) or self-improving algorithms (Darwin).

The key architectural pattern

Stepping back, AI Scientist-v2 exemplifies a pattern appearing across many recent systems: LLM + tree search + tool use + multi-modal feedback. The LLM provides the reasoning and code generation. Tree search provides structured exploration. Tool use (Python interpreter, Hugging Face, Semantic Scholar) grounds the system in reality. And VLM feedback closes the loop by evaluating outputs the text-only LLM can't assess. This same pattern shows up in AlphaEvolve (LLM + evolutionary search + code execution) and in concurrent work on agentic coding (LLM + tree search + test execution).
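
As a rough illustration of how these four ingredients fit together, here is a minimal best-first search loop. It is a sketch under stated assumptions, not the system's actual implementation: propose, execute, and critique are hypothetical callables standing in for the LLM, the code-execution tool, and the VLM, and the toy stubs at the bottom exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """A node in the experiment tree (hypothetical structure)."""
    code: str
    score: float = 0.0
    buggy: bool = False
    parent: Optional["Node"] = None


def tree_search(
    propose: Callable[[Optional[Node]], str],  # LLM: write or refine code
    execute: Callable[[str], float],           # tool use: run code, return a metric
    critique: Callable[[str], bool],           # VLM: True if the output looks acceptable
    steps: int,
) -> Node:
    def evaluate(code: str, parent: Optional[Node]) -> Node:
        return Node(code, execute(code), not critique(code), parent)

    def best(nodes: List[Node]) -> Node:
        # Prefer non-buggy nodes; fall back to all nodes if none are healthy.
        healthy = [n for n in nodes if not n.buggy] or nodes
        return max(healthy, key=lambda n: n.score)

    nodes = [evaluate(propose(None), None)]
    for _ in range(steps):
        parent = best(nodes)  # structured exploration: expand the best node so far
        nodes.append(evaluate(propose(parent), parent))
    return best(nodes)


# Toy stubs (assumptions): each refinement appends "+", the metric is the
# code length, and the critic rejects anything longer than four characters.
result = tree_search(
    propose=lambda parent: "v0" if parent is None else parent.code + "+",
    execute=lambda code: float(len(code)),
    critique=lambda code: len(code) <= 4,
    steps=4,
)
print(result.code, result.score)
```

With these stubs the search climbs to "v0++", has the next refinement rejected by the critic, and returns to the last healthy node instead of continuing down the rejected branch.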

Open questions

Scaling: What happens with more compute budget per paper? Currently each run costs $20-50. Would 10x more produce conference-level papers?
Beyond ML: The system currently works within ML domains (leveraging Hugging Face, Python, standard ML evaluation). Could it generalize to wet-lab sciences, mathematics, or social sciences?
Acceptance rate: One of the three submitted manuscripts passed review, but the submissions were drawn from multiple seeds per idea. What is the true hit rate per seed? Per idea? This matters for understanding reliability.
Detection: Can reviewers learn to identify AI-generated submissions? What features distinguish them from human work?
Self-improvement: Could the system use its own reviewer to filter outputs before submission, creating an inner loop of generate-review-revise that converges on higher quality?
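
To see why the acceptance-rate question is hard to answer from the current evidence, it helps to put an uncertainty range on "1 out of 3". The sketch below uses a Wilson score interval, which is a choice made here for illustration and not an analysis from the authors. It also assumes the three submissions can be treated as independent trials with a common success probability, which the selection across seeds makes doubtful.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# One accepted paper out of three submissions.
low, high = wilson_interval(successes=1, trials=3)
print(f"{low:.2f} to {high:.2f}")
```

The interval spans roughly 6% to 79%, so three submissions say very little about the underlying hit rate, and nothing about the rate per seed, since unselected seeds are not counted at all.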

The trajectory: In August 2024, AI Scientist v1 produced papers that its authors judged not good enough to submit for peer review. By April 2025, v2 produced a paper that passed workshop peer review. If this trajectory continues, conference-level papers from autonomous systems may arrive within 1-2 years. Whether that's exciting or alarming depends on your perspective, and on the safeguards the community builds.

What system inspired v2's tree-search approach to experimentation?

AIDE, which uses tree search for ML engineering tasks on benchmarks like MLEBench
AlphaGo, which uses Monte Carlo tree search for game playing
GPT-4, which uses beam search for text generation
