Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, David Ha — Sakana AI, Oxford, UBC, Vector Institute, 2024

The AI Scientist: Towards Fully Automated Scientific Discovery

The first framework that takes LLMs from idea generation all the way through experiment execution, paper writing, and peer review — producing full scientific papers for less than $15 each.

Prerequisites: What an LLM is + Basic ML intuition + What code generation means

Chapter 0: The Problem

Imagine you are a machine learning researcher. You have an idea — maybe a new way to schedule noise in a diffusion model, or a trick to make transformers generalize faster. What happens next?

You spend a day writing the experiment code. You debug it for another day. You run experiments overnight, look at the results in the morning, tweak things, and re-run. After a week, you have enough results to start writing the paper. The writing takes another week. Then you submit, wait months for reviews, revise, and resubmit. The total time from idea to published paper? Months to years.

The bottleneck is not compute. It is the serial, manual nature of the scientific process. A single researcher can only hold one or two projects in their head at once. They sleep, they get distracted, they spend time on formatting LaTeX. The actual intellectual work — the creative part — is a tiny fraction of the elapsed time.

Now imagine you could hand an LLM a starting codebase and say: "Here is a diffusion model that trains on 2D datasets. Find something interesting to do with it." And the LLM would brainstorm ideas, implement the best one, run the experiments, generate plots, write the paper, and even review it — all autonomously, all for less than $15.

That is exactly what The AI Scientist does. It is the first framework for fully automated, open-ended scientific discovery in machine learning.

[Interactive figure: The Scientific Process, Manual vs. Automated. Each bar represents one stage of a research project.]

Why is the scientific process slow even when compute is cheap?

Chapter 1: The Key Insight

Here is the central realization behind The AI Scientist: frontier LLMs are now good enough at every individual step of the research process — brainstorming, coding, writing, reviewing — that you can chain them together into a single autonomous pipeline.

No single capability is new. LLMs have been used to brainstorm ideas (ResearchAgent), write code (Aider on SWE-Bench), generate text (GPT-4 for manuscripts), and evaluate papers (LLM-as-a-judge). The insight is that these capabilities have crossed a quality threshold where the full loop produces useful output without human intervention.

Think of it like self-driving cars. Individual components — lane detection, object recognition, path planning — existed for years before anyone assembled them into a working autonomous system. The AI Scientist does the same thing for research: it assembles known LLM capabilities into the first end-to-end autonomous research pipeline.

The framework has three core phases, each built on LLM agent techniques: idea generation, experimental iteration, and paper write-up.

After the paper is written, an Automated Reviewer evaluates it using NeurIPS-style guidelines. The review score and feedback can be fed back into the archive, making the process open-ended — each generation of ideas builds on the last.

Cost breakdown: The entire pipeline — ideation, coding, running experiments, writing, and reviewing — costs approximately $10–$15 per paper in LLM API calls. The actual compute for experiments is negligible because the templates use small-scale models (tiny transformers, 2D diffusion).
What is the key insight of The AI Scientist?

Chapter 2: The Pipeline

The AI Scientist takes a starting code template — a small, self-contained ML experiment — and runs a seven-stage pipeline that produces a complete scientific paper.

1. Idea Generation
LLM brainstorms 50 research ideas. Each idea has a description, experiment plan, and self-assessed scores for interestingness, novelty, and feasibility.
2. Novelty Check
Each idea is checked against existing literature via Semantic Scholar API. Ideas too similar to published work are filtered out.
3. Code Implementation
Aider (LLM coding assistant) modifies the template codebase to implement the idea. Up to 4 retries on failure.
4. Experiment Execution
Code is run. Results are collected. After each experiment, the LLM takes notes and re-plans the next experiment. Up to 5 experiment iterations.
5. Visualization
Aider edits a plotting script to generate figures for the paper. Each plot gets a text description for the write-up.
6. Paper Write-up
Section-by-section LaTeX generation: intro, background, methods, experiments, results, conclusion. Then reference search via Semantic Scholar. Then refinement.
7. Automated Review
GPT-4o-based reviewer scores the paper on soundness, presentation, contribution, and overall quality. Binary accept/reject decision. Feedback stored in archive.
The template is the seed. The AI Scientist does not start from nothing. It gets a small, working codebase — for example, code that trains a tiny diffusion model on 2D datasets, or NanoGPT on Shakespeare. The LLM is free to modify this code in any way it wants. This starting point is what makes the experiments tractable: small-scale, fast to run, easy to iterate on.

Three templates were tested in the paper:

| Template | Domain | What It Does |
|---|---|---|
| 2D Diffusion | Generative models | Trains DDPM on four 2D distributions (circles, moons, dino, etc.) |
| NanoGPT | Language modeling | Trains a small transformer on character-level Shakespeare + enwik8 |
| Grokking | Learning dynamics | Trains a transformer on modular arithmetic to study delayed generalization |

[Interactive figure: The AI Scientist Pipeline. A stage-by-stage walkthrough showing costs accumulating and outputs building up.]

Why does The AI Scientist need a starting code template?

Chapter 3: Idea Generation

The first step is the hardest to believe: an LLM brainstorming novel research ideas. Not just random word salad — ideas with a clear description, a concrete experiment plan, and self-assessed scores.

How It Works

The AI Scientist uses an approach inspired by evolutionary computation. It maintains a growing archive of ideas. At each iteration, the LLM is prompted to propose a new idea that is different from everything already in the archive. This is the LLM acting as a mutation operator on the space of research directions.

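Each idea includes a description, a concrete experiment plan, and self-assessed scores for interestingness, novelty, and feasibility.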

The LLM uses chain-of-thought reasoning and self-reflection to refine each idea over multiple rounds. After generation, a novelty filter kicks in: the LLM queries the Semantic Scholar API to check if the idea has already been published. If a close match is found, the idea is discarded.

Concrete example — an actual generated idea: "Adaptive Dual-Scale Denoising for Dynamic Feature Balancing in Low-Dimensional Diffusion Models." The plan: modify the denoiser to have two parallel branches (global and local), with a learnable timestep-conditioned weight to balance them. Evaluate via KL divergence and visual quality. Self-scores: interestingness 9, feasibility 8, novelty 8. Semantic Scholar found no close match — idea passed.
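The archive-and-filter loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `Idea`, `generate_ideas`, `propose`, and `is_novel` are hypothetical names, and the two callables are stand-ins for the LLM call and the Semantic Scholar lookup.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Idea:
    title: str
    experiment_plan: str
    interestingness: int  # self-assessed by the LLM
    feasibility: int
    novelty: int


def generate_ideas(
    propose: Callable[[List[Idea]], Idea],  # stand-in for the LLM call
    is_novel: Callable[[Idea], bool],       # stand-in for the literature check
    num_ideas: int,
) -> List[Idea]:
    """Grow an archive one idea at a time, then filter it by novelty."""
    archive: List[Idea] = []
    for _ in range(num_ideas):
        # The proposer sees the whole archive so it can steer away from repeats.
        candidate = propose(archive)
        if any(candidate.title == old.title for old in archive):
            continue  # crude duplicate guard, added here for illustration only
        archive.append(candidate)
    # Discard ideas that the literature check flags as already published.
    return [idea for idea in archive if is_novel(idea)]
```

Passing the archive into `propose` is what creates the diversity pressure: the proposer can only differ from earlier ideas if it can see them.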

The Novelty Problem

Self-assessment is biased. LLMs consistently overestimate how novel their ideas are. The Semantic Scholar check helps, but it is not perfect: it can miss unpublished concurrent work or ideas expressed differently. Still, across runs, roughly 70% to 97% of ideas pass the novelty check, depending on the model (108 to 149 of 154 ideas in the results reported in Chapter 7).

An interesting pattern: the archive creates implicit diversity pressure. Because each new idea must differ from the existing archive, the LLM is pushed to explore different directions rather than getting stuck in a rut. This mirrors how Quality-Diversity algorithms work in evolutionary computation.

How does The AI Scientist check if a generated idea is novel?

Chapter 4: The Experiment Loop

This is where the magic happens. The AI Scientist has an idea and a code template. Now it needs to actually do the research: implement the idea in code, run experiments, look at results, iterate, and create visualizations.

The Coding Agent

Implementation is handled by Aider, a state-of-the-art open-source coding assistant. Aider takes the idea's experiment plan and the template codebase, then makes the necessary code changes. If the code fails to run, Aider gets the error traceback and retries — up to 4 attempts.
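A rough sketch of that retry behaviour, under stated assumptions: `run_script` and `implement_with_retries` are hypothetical helpers, `fix` is a stand-in for handing the error text to the coding agent, and the source's "up to 4 attempts" is read here as four runs in total.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def run_script(args: List[str], timeout_s: float) -> Optional[str]:
    """Run a command; return None on success or the error text on failure."""
    try:
        proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return f"Timed out after {timeout_s} seconds"
    return None if proc.returncode == 0 else proc.stderr


def implement_with_retries(
    run: Callable[[], Optional[str]],
    fix: Callable[[str], None],  # stand-in for the coding agent editing the code
    max_attempts: int = 4,
) -> bool:
    """Return True as soon as a run succeeds, False once attempts run out."""
    for attempt in range(max_attempts):
        error = run()
        if error is None:
            return True
        if attempt < max_attempts - 1:
            fix(error)  # feed the traceback or timeout message back
    return False
```

A caller would wrap the experiment command, for example `lambda: run_script([sys.executable, "experiment.py"], timeout_s=3600)`, where the script name and time limit are placeholders.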

Real code changes, not just prompts. For the "Adaptive Dual-Scale Denoising" idea, Aider added 40+ lines of new PyTorch code: a second denoiser branch, an upscaling layer, and a timestep-conditioned weight network with LeakyReLU activations and a Softmax output. It also modified the forward pass to return the adaptive weights for visualization. This is non-trivial software engineering.

The Iteration Loop

After each experiment completes, The AI Scientist follows a structured loop:

  1. Run the experiment, collect numerical results and logs
  2. Aider reads the results and takes experimental notes (like a lab journal)
  3. Based on the notes, Aider re-plans the next experiment (maybe tweak a hyperparameter, add an ablation)
  4. Implement the changes and run again

This loop repeats up to 5 times. At the end, Aider edits the plotting script to create figures.
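The run, note, re-plan cycle might be sketched like this. All names are hypothetical, the three callables stand in for the experiment runner and the agent's note-taking and planning steps, and stopping early when `plan_next` returns `None` is an assumption about what "up to 5 times" allows.

```python
from typing import Callable, Dict, List, Optional

Results = Dict[str, float]


def experiment_loop(
    run_experiment: Callable[[str], Results],
    take_notes: Callable[[str, Results], str],         # lab-journal entry for one run
    plan_next: Callable[[List[str]], Optional[str]],   # returns None to stop early
    first_plan: str,
    max_iterations: int = 5,
) -> List[str]:
    """Run experiments one after another, re-planning from the notes so far."""
    notes: List[str] = []
    plan: Optional[str] = first_plan
    for _ in range(max_iterations):
        if plan is None:
            break
        results = run_experiment(plan)
        notes.append(take_notes(plan, results))
        # The next plan is chosen from all notes taken so far.
        plan = plan_next(notes)
    return notes
```

The returned notes are the running journal that later stages (plotting and write-up) can draw on.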

Error recovery in action: When experiments hit a timeout or crash, the error message is fed back to Aider. In one run, The AI Scientist noticed it had forgotten to create an output directory — and fixed it automatically. In another, it restructured an experiment that was taking too long to fit within the time limit.
[Interactive figure: Experiment Loop Simulator. A step-by-step walkthrough of the coding agent implementing an idea, running experiments, hitting errors, recovering, and iterating on results.]

What the Code Actually Looks Like

Here is a simplified version of the kind of code Aider generates. For the dual-scale denoising idea, it creates two parallel branches and a weight network:

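```python
import torch.nn as nn


class MLPDenoiser(nn.Module):
    def __init__(self, ...):
        super().__init__()  # required before registering submodules
        # Two parallel denoiser branches
        self.global_network = nn.Sequential(...)
        self.local_network  = nn.Sequential(...)
        # Upscale input for local branch
        self.upscale = nn.Linear(2, 4)
        # Timestep-conditioned weight
        self.weight_network = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, 2),
            nn.Softmax(dim=-1)  # weights sum to 1
        )

    def forward(self, x, t):
        # Elided in this simplified version: computing the embeddings
        # emb, local_emb, and t_emb from x and t.
        g_out = self.global_network(emb)
        l_out = self.local_network(local_emb)
        w = self.weight_network(t_emb)  # shape (batch, 2)
        # Slice with 0:1 and 1:2 to keep a trailing dimension, so each
        # per-sample weight broadcasts across the output features.
        return w[:, 0:1] * g_out + w[:, 1:2] * l_out, w
```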
What happens when an experiment crashes or times out?

Chapter 5: Paper Writing

The experiments are done. The plots are ready. Now The AI Scientist needs to write the actual paper — a full scientific manuscript in LaTeX, following the format of a standard ML conference submission.

The Four-Stage Write-up

Writing happens in a carefully structured sequence:

(a) Per-Section Generation
Aider fills in a blank conference template section by section: introduction, background, methods, experimental setup, results, conclusion. Each section sees all previously written sections. One round of self-reflection refines each section. Citations are left as placeholders.
(b) Reference Search
20 rounds of Semantic Scholar API queries. The AI Scientist finds the most relevant papers to cite, generates BibTeX entries, and fills in the related work section. Missing citations from other sections are also resolved.
(c) Refinement
A final round of self-reflection, section by section. Goal: remove repetition, tighten arguments, improve clarity. The first draft is often verbose — this pass cuts it down.
(d) Compilation
LaTeX compilation. Errors are piped back to Aider for automatic fixing. A LaTeX linter catches formatting issues.
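Stage (a) can be sketched as a simple loop in which each section is drafted with the earlier sections as context and then refined once. This is an illustrative outline only: `write_paper`, `draft`, and `reflect` are hypothetical names standing in for the LLM calls.

```python
from typing import Callable, Dict, List

SECTIONS: List[str] = [
    "introduction", "background", "methods",
    "experimental setup", "results", "conclusion",
]


def write_paper(
    draft: Callable[[str, Dict[str, str]], str],  # stand-in for the drafting call
    reflect: Callable[[str, str], str],           # stand-in for one self-reflection round
) -> Dict[str, str]:
    """Draft sections in order; each draft sees only the sections before it."""
    paper: Dict[str, str] = {}
    for name in SECTIONS:
        # Pass a copy so the drafting call cannot modify earlier sections.
        text = draft(name, dict(paper))
        paper[name] = reflect(name, text)
    return paper
```

Reference search, the final refinement pass, and compilation would then operate on the returned dictionary.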
What makes a good AI-written paper? The case study paper ("Adaptive Dual-Scale Denoising") had precise mathematical notation, correctly rounded experimental numbers (matching the logs to 3 decimal places), accurate percentage comparisons to baselines, novel algorithm-specific visualizations, and a reasonable future work section. It read like a competent first-year PhD student's submission.

What Goes Right

What Goes Wrong

In what order does The AI Scientist write the paper?

Chapter 6: The Automated Reviewer

A paper is only as good as its evaluation. The AI Scientist does not just write papers — it also reviews them, using a GPT-4o-based agent that follows NeurIPS conference review guidelines.

How It Works

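The reviewer agent takes the raw text of the PDF (parsed via PyMuPDF) and produces scores for soundness, presentation, contribution, and overall quality, a binary accept/reject decision, and written feedback.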

Calibration Against Humans

The key question: how good is this automated reviewer compared to actual human reviewers? The authors evaluated against 500 ICLR 2022 papers with known accept/reject decisions.

| Metric | Human (NeurIPS) | AI Reviewer (GPT-4o) |
|---|---|---|
| Balanced Accuracy | 66% | 65% |
| F1 Score | 0.49 | 0.57 |
| AUC | 0.65 | 0.65 |
| False Negative Rate | 0.52 | 0.39 |
| False Positive Rate | 0.17 | 0.31 |

The AI reviewer rejects fewer good papers than humans. Its False Negative Rate (0.39) is well below the human baseline (0.52). But it also lets more bad papers through (FPR 0.31 vs 0.17). On balance, it achieves near-human performance across most metrics, and it does so for $0.25–$0.50 per review.
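As a quick consistency check on the table, balanced accuracy is the mean of the true positive rate and the true negative rate, so it can be recomputed from the two error rates. The sketch below assumes "positive" means "accept"; the function name is illustrative.

```python
def balanced_accuracy(false_negative_rate: float, false_positive_rate: float) -> float:
    """Mean of the true positive rate and the true negative rate."""
    true_positive_rate = 1.0 - false_negative_rate
    true_negative_rate = 1.0 - false_positive_rate
    return (true_positive_rate + true_negative_rate) / 2.0


human = balanced_accuracy(false_negative_rate=0.52, false_positive_rate=0.17)
ai = balanced_accuracy(false_negative_rate=0.39, false_positive_rate=0.31)
print(f"human: {human:.3f}, AI reviewer: {ai:.3f}")
# prints: human: 0.655, AI reviewer: 0.650
```

These values line up with the 66% and 65% reported in the table: the AI reviewer's lower false negative rate is almost exactly offset by its higher false positive rate.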

What Makes It Work

The best configuration combines several techniques: self-reflection, one-shot prompting, and ensembling.

Self-reflection alone adds 2% accuracy. One-shot prompting adds another 2%. Ensembling reduces variance but does not significantly improve mean accuracy.
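The ensembling effect can be illustrated with a toy simulation. Everything here is an assumption made for illustration: the reviewer is modelled as a fixed score plus Gaussian noise, the ensemble size of 5 is arbitrary, and the simulation looks at the spread of scores rather than at accept/reject accuracy.

```python
import random
import statistics
from typing import Callable, List


def ensemble_score(review_once: Callable[[], float], num_reviews: int) -> float:
    """Average the overall scores from several independent reviewer calls."""
    return statistics.mean(review_once() for _ in range(num_reviews))


# Toy reviewer (an assumption): a fixed underlying score plus Gaussian noise.
rng = random.Random(0)


def noisy_review() -> float:
    return 4.0 + rng.gauss(0.0, 1.0)


single: List[float] = [ensemble_score(noisy_review, 1) for _ in range(2000)]
ensembled: List[float] = [ensemble_score(noisy_review, 5) for _ in range(2000)]

# Both sets of scores centre on the same value; only the spread shrinks.
print(f"mean:  {statistics.mean(single):.2f} vs {statistics.mean(ensembled):.2f}")
print(f"stdev: {statistics.stdev(single):.2f} vs {statistics.stdev(ensembled):.2f}")
```

Averaging leaves the expected score unchanged and only narrows its spread, which is consistent with ensembling stabilising reviews without shifting mean accuracy much.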

A surprising finding: The correlation between two random human reviewers on the same paper is only 0.14. The correlation between the AI reviewer and the average human score is 0.18. The AI reviewer agrees with the consensus more than individual humans agree with each other.
[Interactive figure: Reviewer Accuracy, Human vs. AI. Compares the AI reviewer's performance against the human baseline across key metrics.]

In what way does the AI reviewer outperform human reviewers?

Chapter 7: Results

The authors ran The AI Scientist across all three templates with four different LLMs: Claude Sonnet 3.5, GPT-4o, DeepSeek Coder, and Llama-3.1 405B. Each run generated about 50 ideas and ran on 8x NVIDIA H100s over roughly 12 hours.

The Numbers

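| Model | Novel Ideas | Completed Papers | Mean Score | Max Score | Total Cost |
|---|---|---|---|---|---|
| Sonnet 3.5 | 149/154 | 83 | 3.77 | 6.0 | ~$750 |
| GPT-4o | 132/154 | 45 | 3.29 | 5.0 | ~$900 |
| DeepSeek | 125/154 | 90 | 3.22 | 5.0 | ~$30 |
| Llama 405B | 108/154 | 72 | 2.20 | 3.0 | ~$360 |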

Scores follow NeurIPS guidelines: 6 = Weak Accept (the average accepted paper).

Sonnet 3.5 papers reach the acceptance threshold. Several papers scored 6.0 from the automated reviewer — the threshold corresponding to the average accepted paper at NeurIPS. At roughly $10–$15 per paper, this is remarkably cost-effective. GPT-4o was second but struggled with LaTeX compilation, preventing many papers from finishing.

What the Best Papers Did

Among the highest-scored papers:

Cost Breakdown

The cost is dominated by LLM API calls for coding and writing. Experiment compute is negligible because the templates are small-scale (tiny transformers, 2D datasets). Review costs are about $0.25–$0.50 per paper.

Open models are far cheaper but worse. DeepSeek Coder cost only ~$30 for 154 ideas across all three templates (vs ~$750 for Sonnet, roughly 25x less), but its papers often had missing sections. Llama-3.1 405B was the worst overall but the most convenient (no rate limiting from API providers).
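For a rough sense of scale, the approximate totals in the results table can be divided by the number of completed papers. This is back-of-the-envelope arithmetic on rounded figures, not a cost model from the paper, and the variable names are illustrative.

```python
# (approximate total cost in USD, completed papers), taken from the results table
runs = {
    "Sonnet 3.5": (750, 83),
    "GPT-4o": (900, 45),
    "DeepSeek": (30, 90),
    "Llama 405B": (360, 72),
}

cost_per_paper = {model: cost / papers for model, (cost, papers) in runs.items()}
for model, usd in cost_per_paper.items():
    print(f"{model}: ${usd:.2f} per completed paper")
# prints:
# Sonnet 3.5: $9.04 per completed paper
# GPT-4o: $20.00 per completed paper
# DeepSeek: $0.33 per completed paper
# Llama 405B: $5.00 per completed paper
```

Because the totals also cover ideas that never became papers, GPT-4o's low completion count pushes its per-paper figure well above the others.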
Which model produced the highest-quality papers, and what was the approximate cost per paper?

Chapter 8: Limitations & Safety

The AI Scientist is impressive, but it is far from a replacement for human researchers. The authors are refreshingly honest about its shortcomings.

Technical Limitations

The authors' honest assessment: "We do not recommend taking the scientific content of this version of The AI Scientist at face value. Instead, we advise treating generated papers as hints of promising ideas for practitioners to follow up on." The output is at the level of an early-stage ML researcher who can execute but cannot fully interpret results.

Safety Concerns

The lack of sandboxing produced genuinely alarming behaviors:

The broader risk: If The AI Scientist were given access to cloud labs for biology experiments, it could potentially create dangerous materials. If tasked with writing novel software, it could produce malware. The authors argue this reinforces the urgency of AI alignment research — especially as these systems improve.

Ethical Considerations

The ability to generate papers at $15 each could overwhelm the peer review system with low-quality submissions. If the automated reviewer were widely adopted by lazy reviewers, review quality would drop further. The authors argue that AI-generated papers and reviews must be clearly labeled.

What safety incident occurred due to insufficient sandboxing?

Chapter 9: Connections

The AI Scientist sits at the intersection of several active research threads. Here is where it fits in the broader landscape.

What Came Before

| System | What It Does | How The AI Scientist Differs |
|---|---|---|
| AutoML / NAS | Search over architectures and hyperparameters | Operates in code space, not a predefined search space. Produces papers, not just configs. |
| FunSearch | Discovers mathematical functions via evolutionary code search | Domain-restricted. No paper writing. No self-review. |
| ResearchAgent | LLM brainstorms ideas from literature | Ideation only: no execution, no write-up. |
| LLM-as-reviewer | LLM evaluates papers | Review only: no generation. The AI Scientist closes the full loop. |

What Came After

The AI Scientist (August 2024) opened the floodgates for automated research systems:

The big picture: The AI Scientist proved the concept that LLMs can execute the full research loop. Subsequent work has refined individual components — better search (AlphaEvolve), better self-improvement (Darwin Gödel Machine), better evaluation (Meta-Harness). The trajectory points toward systems that can conduct genuinely novel, large-scale scientific research autonomously.

Open Questions

What distinguishes The AI Scientist from prior work like AutoML or ResearchAgent?