Saad-Falcon, Gamarra Lafuente et al. — Stanford, 2025

ARCHON: Architecture Search for Inference-Time Techniques

Given a compute budget and a set of LLMs, automatically discover the optimal combination of generators, critics, rankers, fusers, and verifiers — outperforming o1, GPT-4o, and Claude 3.5 Sonnet by 15.1% average.

Prerequisites: LLM inference basics + Concept of ensembling. That's it.

Chapter 0: The Problem

You have access to GPT-4o, Claude 3.5 Sonnet, Llama 3.1 405B, and a handful of 70B open-source models. You want to build a system that answers questions better than any single one of them. What do you do?

The obvious first step: sample from multiple models and pick the best answer. But how do you pick? Majority vote? A ranker model? And should you fuse the good answers together into one super-response? What about having a critic point out flaws first? Should you verify reasoning chains? Run unit tests on code?

Each of these strategies — repeated sampling, voting, fusion, revision, ranking, verification — works well in isolation for specific tasks. But combining them is ad hoc. The number of possible configurations is enormous: which models to use, how many samples, what order to apply techniques, how many layers of fusion. Practitioners resort to trial and error.

The core tension: Inference-time techniques are powerful individually. But the design space for combining them is combinatorial — 9,576 valid configurations in ARCHON's search space alone. No human can explore this by hand, and the best combination changes with every task and compute budget.

The Mixture-of-Agents (MoA) approach samples from many models and fuses their outputs. It works, but it uses a fixed architecture for every task. ADAS optimizes prompts for a single model. AFlow chains operations but doesn't explore the full space of technique combinations. None of them ask the fundamental question: given my compute budget and available models, what is the optimal way to combine inference-time techniques?

This is what ARCHON solves. It treats inference-time system design as architecture search — the same idea that revolutionized neural network design (NAS), now applied to the meta-level of how we orchestrate LLMs at inference time.

A concrete example of the problem

Suppose you're building a system for competitive programming. Should you:

  1. Sample GPT-4o 1000 times and pick the one that passes unit tests? (Large Language Monkeys approach)
  2. Sample 5 models once each, fuse their answers, then verify? (Mixture-of-Agents approach)
  3. Sample 3 models, critique each response, fuse the best, generate unit tests, filter, then fuse again? (What approach?)

Option 3 is what ARCHON discovers for coding tasks — a configuration no human would think to try because it involves 6 layers of inference-time techniques. And it achieves 41.4% Pass@1 on CodeContests, versus 18.1% for a single GPT-4o call and 31.5% for o1.


Chapter 1: The Key Insight

ARCHON's insight is to reframe inference-time system design as a problem that looks exactly like neural architecture search. In NAS, you search over configurations of layers, activation functions, and connections. In ARCHON, you search over configurations of LLM components — generators, critics, rankers, fusers, verifiers — arranged in layers.

The analogy runs deep. In a neural network, each layer transforms a tensor. In ARCHON, each layer transforms a list of strings. Generator layers produce candidate responses. Fuser layers combine multiple candidates into fewer, better ones. Ranker and verifier layers act like non-linearities — they filter the list, keeping only the good candidates.

Input: Instruction prompt (a single string)
↓
Generator Layer: Produce N candidate responses from K models
↓
Critic / Ranker / Verifier: Filter, rank, or annotate candidates (non-linearity)
↓
Fuser Layer: Combine surviving candidates into fewer, higher-quality outputs
↓
Output: Best single response

Like neural networks, ARCHON architectures are deployed off-the-shelf — no weights are learned between components. The "training" happens during architecture search on a small dev set (20% of target data). Once the optimal configuration is found, you run it on new inputs without further tuning.
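
The "each layer transforms a list of strings" idea can be sketched in a few lines. This is a minimal illustration, not ARCHON's implementation: the three layer functions are hypothetical stand-ins that return canned strings instead of calling LLMs, and the names `Layer` and `run_architecture` are assumptions made for this example.

```python
from typing import Callable, List

# A layer maps (prompt, current candidates) to a new list of candidate strings.
Layer = Callable[[str, List[str]], List[str]]

def generator_layer(prompt: str, candidates: List[str]) -> List[str]:
    # Stand-in for sampling several models; a real system would call LLM APIs here.
    return [f"model-{i} answer to: {prompt}" for i in range(3)]

def ranker_layer(prompt: str, candidates: List[str]) -> List[str]:
    # Toy filter: keep the two longest candidates (placeholder for an LLM-based ranking).
    return sorted(candidates, key=len, reverse=True)[:2]

def fuser_layer(prompt: str, candidates: List[str]) -> List[str]:
    # Toy fusion: merge the surviving candidates into a single string.
    return [" | ".join(candidates)]

def run_architecture(prompt: str, layers: List[Layer]) -> List[str]:
    candidates: List[str] = []  # generators start from an empty candidate list
    for layer in layers:
        candidates = layer(prompt, candidates)
    return candidates

output = run_architecture("What is 2 + 2?", [generator_layer, ranker_layer, fuser_layer])
print(output)
```

The list shrinks from three candidates to two to one, which is the same narrowing behavior the layer diagram describes.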

Why this framing is powerful: By casting the problem as architecture search, ARCHON imports decades of optimization techniques from NAS. Specifically, it uses Bayesian optimization with Gaussian processes to explore the 9,576-configuration search space. This finds the optimal architecture in 88.5% fewer evaluations than greedy search and 90.4% fewer than random search.

The framework also reveals a key finding that manual system design misses: the best combination depends on the task. Instruction-following benefits from multiple layers of critiquing and fusing. Math tasks want broad generation followed by quick filtering. Coding tasks need repeated generation with unit-test verification. No single architecture dominates, which is exactly why you need automated search.

What makes this different from prompt engineering?

Frameworks like DSPy optimize what you say to a single model. ARCHON optimizes how you orchestrate multiple models and techniques. The difference matters: even the perfect prompt to GPT-4o can't match the accuracy of a 10-model ensemble with iterative critique-and-fuse, because a single model has a fixed error distribution. Multiple models have different error distributions, and combining them reduces errors that no single prompt can fix.

Think of it this way: prompt engineering is like tuning the knobs on one instrument. ARCHON is like composing for an orchestra — choosing which instruments to include, what each one plays, and how to layer the parts. The search algorithm is the automated composer.


Chapter 2: The Design Space

ARCHON defines six types of LLM components. Each one is a pluggable module that performs a text-to-text operation. Think of them as the "layer types" in ARCHON's neural-network analogy.

1. Generator

Takes the instruction prompt, outputs a candidate response. Can be called in parallel across multiple models (generation ensembling) and each model can be sampled multiple times (repeated sampling). Temperature is set to 0.7 for all components to ensure response diversity. Generators are always the first layer — they are the only component that doesn't require prior candidates as input.

The paper finds that sampling from additional models gives nearly double the benefit of additional samples from the same model (18.5% vs 9.3% improvement). Different models make different errors, so ensembling captures a wider range of correct strategies than temperature-based diversity from a single model.

2. Fuser

Takes the instruction prompt and a set of candidate responses, and produces one or more higher-quality fused responses by combining their best elements. Fusion averages an 8.9% improvement across all benchmarks tested. Fusers can be stacked in multiple layers for iterative refinement: MT-Bench gains 10-15 points from 3-4 fusion layers.

3. Critic

Takes the prompt and candidates, and produces a list of strengths and weaknesses for each response. These annotations are passed to the next layer (typically a Ranker or Fuser) to inform its decisions. The Critic adds 11.5 percentage points on average when combined with a Generator ensemble and a Fuser.

4. Ranker

Takes the prompt and candidates, ranks them by quality using pairwise comparisons focused on style and prompt adherence, and keeps the top-K. Ranking improves output quality by 10.8% over random selection and performs within 2.7% of oracle selection.

5. Verifier

Two-stage process: (1) generates reasoning for why a candidate is correct, (2) takes in the original prompt, candidate, and generated reasoning, then produces a binary verdict — [Correct] or [Incorrect]. Only verified responses pass to the next layer. The two-stage design prevents the verifier from rubber-stamping answers: stage 2 can catch flaws in stage 1's reasoning.

Most effective for reasoning tasks, improving MixEval/MATH by 8.4%. However, when used alone (without other techniques), the Verifier lags behind ensemble+fuser by 1.5%, suggesting verification works best as a complement to other techniques, not a replacement.
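
The two-stage verification flow can be sketched as follows. This is an illustrative sketch only: the prompt wording, the `LLM` callable type, and the offline `fake_llm` stub are assumptions for the example, not the prompts or API used by ARCHON.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: takes a prompt string, returns completion text

def verify(prompt: str, candidates: List[str], llm: LLM) -> List[str]:
    """Keep only the candidates that receive a [Correct] verdict."""
    kept = []
    for cand in candidates:
        # Stage 1: ask for reasoning about why the candidate is correct.
        reasoning = llm(
            f"Question: {prompt}\nAnswer: {cand}\nExplain why this answer is correct."
        )
        # Stage 2: judge prompt + candidate + reasoning and emit a binary verdict.
        verdict = llm(
            f"Question: {prompt}\nAnswer: {cand}\nReasoning: {reasoning}\n"
            "Reply with [Correct] or [Incorrect]."
        )
        if "[Correct]" in verdict:
            kept.append(cand)
    return kept

def fake_llm(text: str) -> str:
    # Deterministic stub so the example runs offline.
    if "Reply with" in text:
        return "[Correct]" if "Answer: 4" in text else "[Incorrect]"
    return "Adding two and two gives four."

verified = verify("What is 2 + 2?", ["4", "5"], fake_llm)
print(verified)  # ['4']
```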

6. Unit Test Generator + Evaluator

A complementary pair: the Unit Test Generator produces 5-10 test statements for the instruction, and the Evaluator scores candidates against those tests. The pair is highly effective for coding: it raised CodeContests Pass@1 from 17.9% to 29.3%, a gain of 11.4 percentage points (about 64% in relative terms).

Every module is text-in, text-out

Module | Input | Output | Cardinality change
Generator | Prompt | N candidate strings | 1 → N
Fuser | Prompt + K candidates | M fused strings (M ≤ K) | K → M
Critic | Prompt + K candidates | K annotated strings (strengths/weaknesses) | K → K (enriched)
Ranker | Prompt + K candidates | Top-J candidates | K → J (filtered)
Verifier | Prompt + K candidates | Verified subset | K → ≤K (filtered)
Unit Test | Prompt + K candidates | Ranked by test passage | K → K (reranked)

This uniform interface is what makes ARCHON composable. Any module's output can feed into any other module's input (subject to placement rules). The state flowing through the architecture is always the same type: instruction prompt + list of candidate strings.

Construction rules (from ablation studies across 7 benchmarks):
1. Only one type of component per layer.
2. Generators can only be in the first layer.
3. Critics must come before a Ranker or Fuser — otherwise the strengths/weaknesses can't be incorporated.
4. A Ranker, Critic, Verifier, or Unit Test component must be the only component in its layer.
5. Multiple Fusers can share a layer (they run in parallel on the same inputs).
6. Unit Test Generator and Evaluator must be in consecutive layers, Generator first.
These constraints were determined empirically. Alternative orderings were tested but performed worse — for example, placing a Ranker before a Critic loses the critic's annotations that could inform ranking.
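
The six construction rules lend themselves to a mechanical check. The sketch below is one possible reading of them, not code from the ARCHON repository: an architecture is represented as a list of layers, each layer a list of component-type strings, and rule 3 is interpreted (as an assumption) to mean that the layer immediately after a Critic must be a Ranker or Fuser.

```python
from typing import List

SINGLETONS = {"ranker", "critic", "verifier", "unit_test_generator", "unit_test_evaluator"}

def violations(arch: List[List[str]]) -> List[str]:
    """Return a list of rule violations; an empty list means the architecture is valid."""
    problems = []
    for i, layer in enumerate(arch):
        if len(set(layer)) != 1:
            problems.append(f"layer {i}: must contain exactly one component type")  # rule 1
            continue
        kind = layer[0]
        nxt = arch[i + 1][0] if i + 1 < len(arch) and arch[i + 1] else None
        prev = arch[i - 1][0] if i > 0 and arch[i - 1] else None
        if kind == "generator" and i != 0:
            problems.append(f"layer {i}: generators are only allowed in layer 0")  # rule 2
        if kind == "critic" and nxt not in {"ranker", "fuser"}:
            problems.append(f"layer {i}: critic must precede a ranker or fuser")  # rule 3
        if kind in SINGLETONS and len(layer) > 1:
            problems.append(f"layer {i}: {kind} must be alone in its layer")  # rule 4
        # Rule 5 (several fusers per layer) needs no check: it is simply allowed.
        if kind == "unit_test_generator" and nxt != "unit_test_evaluator":
            problems.append(f"layer {i}: unit test generator needs an evaluator next")  # rule 6
        if kind == "unit_test_evaluator" and prev != "unit_test_generator":
            problems.append(f"layer {i}: unit test evaluator needs a generator before it")  # rule 6
    return problems

funnel = [["generator"] * 10, ["critic"], ["fuser"] * 8, ["critic"], ["fuser"] * 6,
          ["critic"], ["fuser"] * 4, ["fuser"]]
bad = [["generator"], ["fuser"], ["generator"]]
print(violations(funnel))  # []
print(violations(bad))
```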

Chapter 3: Search Strategy

The design space has 9,576 valid configurations. Exhaustive search is impractical because each evaluation requires running the full architecture on a dev set — dozens of LLM calls per query. ARCHON needs to find the optimum in as few evaluations as possible.

The six search hyperparameters

Axis | Range | What it controls
Top-K Generators | 1 – 10 | How many models in the initial ensemble
Samples per Generator | 1 – 5 (up to 1000 for coding) | Repeated sampling from each model
Fusion Layers | 1 – 4 | Depth of iterative refinement
Fusers per Layer | 2 – 10 (step 2) | Breadth of each fusion stage
Critic + Ranker | On / Off before each fuser layer | Whether to annotate and filter before fusing
Evaluation Layer | Verifier / Unit Test / None | Final filtering before output

Where 9,576 comes from

The combinatorial math: 10 choices for top-K generators x 5 for samples per generator x 4 for fusion layers x 5 for fusers per layer x 2 for critic/ranker on/off x 3 for evaluation layer = 6,000. This simple product treats every axis as a single global choice, so it does not by itself yield 9,576. Two adjustments move the count: the critic/ranker toggle is set before each fuser layer, which adds configurations for multi-layer architectures, and some configurations are invalid and get pruned, for example those where the total number of initial generations exceeds the context window of the fusers. The reported count after these adjustments is 9,576. This is large enough that random search is wasteful, but small enough that Bayesian optimization can efficiently navigate it.
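
The per-axis product quoted above can be checked directly. The axis names below are labels chosen for this example; the ranges come from the hyperparameter table.

```python
from math import prod

axis_sizes = {
    "top_k_generators": len(range(1, 11)),      # 1..10
    "samples_per_generator": len(range(1, 6)),  # 1..5
    "fusion_layers": len(range(1, 5)),          # 1..4
    "fusers_per_layer": len(range(2, 11, 2)),   # 2, 4, 6, 8, 10
    "critic_ranker": 2,                         # on / off
    "evaluation_layer": 3,                      # verifier / unit test / none
}

naive_total = prod(axis_sizes.values())
print(naive_total)  # 6000
```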

Three search methods compared

Random search samples architectures uniformly at random. Simple but wasteful — most of the budget explores bad configurations.

Greedy search optimizes one hyperparameter at a time, holding the rest fixed. Better, but gets stuck in local optima because it never jointly considers interactions between hyperparameters.

Bayesian optimization (ARCHON's default) maintains a Gaussian process surrogate model of the objective function. It starts by evaluating ~230 random architectures to calibrate, then uses an acquisition function to propose the most promising configurations. Each evaluation refines the surrogate, guiding the search toward the global optimum.

The numbers: Bayesian optimization found the best architecture in 96.0% of searches. It required 88.5% fewer evaluations than greedy search and 90.4% fewer than random search. For budgets under 20 inference calls, greedy search is competitive. Above that, Bayesian optimization dominates.
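
Why greedy search can stall is easiest to see on a toy problem. The sketch below is purely illustrative: the two axes `x` and `y` and the score function are made up so that the best configuration requires changing both axes at once, which a one-axis-at-a-time search never tries.

```python
from itertools import product
from typing import Callable, Dict, List

Config = Dict[str, int]

def greedy_search(space: Dict[str, List[int]], score: Callable[[Config], float]) -> Config:
    # Start from the first value on every axis, then tune one axis at a time.
    config = {name: values[0] for name, values in space.items()}
    for name, values in space.items():
        config[name] = max(values, key=lambda v: score({**config, name: v}))
    return config

def exhaustive_search(space: Dict[str, List[int]], score: Callable[[Config], float]) -> Config:
    names = list(space)
    configs = [dict(zip(names, combo)) for combo in product(*space.values())]
    return max(configs, key=score)

def toy_score(c: Config) -> float:
    # The global optimum needs x and y to change together.
    if c["x"] == 2 and c["y"] == 2:
        return 1.0
    if c["x"] == 0 and c["y"] == 0:
        return 0.5
    return 0.0

space = {"x": [0, 1, 2], "y": [0, 1, 2]}
greedy_result = greedy_search(space, toy_score)
best_overall = exhaustive_search(space, toy_score)
print(greedy_result, toy_score(greedy_result))  # {'x': 0, 'y': 0} 0.5
print(best_overall, toy_score(best_overall))    # {'x': 2, 'y': 2} 1.0
```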

To impose compute constraints, ARCHON simply excludes any architecture exceeding the inference call, input token, or output token budget from the search space. This happens before Bayesian optimization even considers them — invalid configurations are never evaluated.
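
Pruning by budget before the search starts might look like the sketch below. The `Config` fields and especially the `estimated_calls` cost model are assumptions made for illustration (one call per sample, per fuser, and per critic pass); the paper's actual accounting of inference calls and tokens is not reproduced here.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

@dataclass(frozen=True)
class Config:
    generators: int
    samples: int
    fusion_layers: int
    fusers_per_layer: int
    use_critic: bool

def estimated_calls(c: Config) -> int:
    # Assumed cost model (not from the paper): one call per generated sample,
    # plus one call per fuser and one per critic pass in every fusion layer.
    per_layer = c.fusers_per_layer + (1 if c.use_critic else 0)
    return c.generators * c.samples + c.fusion_layers * per_layer

def within_budget(space: List[Config], max_calls: int) -> List[Config]:
    # Over-budget configurations are dropped before any evaluation happens.
    return [c for c in space if estimated_calls(c) <= max_calls]

space = [
    Config(g, s, layers, f, critic)
    for g, s, layers, f, critic in product(
        range(1, 11), range(1, 6), range(1, 5), range(2, 11, 2), (False, True)
    )
]
feasible = within_budget(space, max_calls=20)
print(len(space), len(feasible))
```

The optimizer would then be handed `feasible` instead of `space`, so it never spends an evaluation on an architecture that cannot be deployed under the budget.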

The Bayesian optimization loop in detail

1. Random Initialization: Evaluate ~230 random architectures on the 20% dev set. Record (config, accuracy) pairs.
2. Fit Surrogate Model: Train a Gaussian process on the (config, accuracy) pairs. This gives a mean prediction + uncertainty for every untested configuration.
3. Acquisition Function: Select the configuration that maximizes expected improvement (EI): high predicted accuracy OR high uncertainty (explore vs exploit).
4. Evaluate: Run the proposed architecture on the dev set. Add the result to the dataset.
   ↻ Repeat steps 2-4 until the budget is exhausted.
5. Return Best: Output the architecture with the highest dev-set accuracy.

The key advantage: after calibration, the Gaussian process can predict which configurations are promising without actually running them. This is what makes Bayesian optimization roughly 10x more sample-efficient than random search (90.4% fewer evaluations): it concentrates evaluations on the promising region of the search space.
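
The loop above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not ARCHON's search code: the search space is a made-up 40-point grid of (generators, fusion layers), `toy_accuracy` stands in for dev-set evaluation, the Gaussian process uses a fixed RBF kernel with hand-picked length scale, and only 5 random initial points are used instead of ~230.

```python
import math
import random
from itertools import product

def rbf(a, b, length=2.0):
    return math.exp(-sum((x - y) ** 2 for x, y in zip(a, b)) / (2 * length ** 2))

def solve(A, b):
    # Gaussian elimination with partial pivoting; solves A x = b.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(X, y, x_new, noise=1e-4):
    # Posterior mean and standard deviation of a GP with constant mean and RBF kernel.
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k = [rbf(a, x_new) for a in X]
    alpha = solve(K, [v - mean_y for v in y])
    v = solve(K, k)
    mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(1.0 + noise - sum(ki * vi for ki, vi in zip(k, v)), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def toy_accuracy(cfg):
    # Stand-in for running an architecture on the dev set; peak is 0.9 at (7, 3).
    g, f = cfg
    return 0.9 - 0.01 * (g - 7) ** 2 - 0.02 * (f - 3) ** 2

def bayes_opt(space, objective, n_init=5, n_iter=10, seed=0):
    rng = random.Random(seed)
    tried = rng.sample(space, n_init)            # step 1: random initialization
    scores = [objective(c) for c in tried]
    for _ in range(n_iter):
        untested = [c for c in space if c not in tried]
        if not untested:
            break
        best = max(scores)
        # steps 2-3: fit the surrogate and pick the candidate with the highest EI
        nxt = max(untested,
                  key=lambda c: expected_improvement(*gp_posterior(tried, scores, c), best))
        tried.append(nxt)                        # step 4: evaluate and record
        scores.append(objective(nxt))
    i = max(range(len(scores)), key=scores.__getitem__)
    return tried[i], scores[i]                   # step 5: return the best seen

space = list(product(range(1, 11), range(1, 5)))
best_cfg, best_score = bayes_opt(space, toy_accuracy)
print(best_cfg, round(best_score, 3))
```

Refitting the surrogate for every candidate is wasteful but keeps the sketch short; a practical version would factor the kernel matrix once per iteration or use an existing Gaussian process library.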


Chapter 4: Architecture Examples

What do discovered ARCHON architectures actually look like? The paper provides detailed specifications for both general-purpose and task-specific configurations. The differences are revealing.

General-purpose all-source architecture

The best general-purpose architecture starts with a broad initial layer of 10 generators (the top performers from Chatbot Arena). Then four successive layers of critique and fusion, using Qwen2-72B for critiquing and Claude 3.5 Sonnet for fusing. Each layer has progressively fewer fusers — 8, then 6, then 4, then 1 — creating a "funneling" effect that distills many candidates into a single refined output.

Task-specific architectures differ dramatically

Instruction-following (MT-Bench, AlpacaEval): Multiple layers of critiquing and fusing with diverse LLMs. The diversity of perspectives matters more than raw generation count. Architecture uses 3-4 Critic+Fuser layers with a mix of open and closed-source models.

Math (MATH benchmark): Broad initial generation (many samples), then quick reduction. The architecture favors many candidates from strong math models with a verifier to filter incorrect reasoning before a single fusion step.

Coding (CodeContests): Repeated generation from a single strong model (GPT-4o, up to 1000 samples), unit test generation and evaluation, then fusion. The architecture is narrow and deep — multiple iterations of generate-critique-fuse over a single response thread, not broad ensembling. When unit-test generation is added with high sampling, CodeContests Pass@1 jumps from 17.9% to 29.3% — and with the full task-specific architecture, it reaches 41.4%.

The "funneling" effect in numbers

The general-purpose architecture starts with 10 generators (each sampled once = 10 candidates). The first fusion layer uses 8 fusers, each processing the full set. The second layer uses 6 fusers operating on the fused outputs. Then 4 fusers, then a single final fuser. At each stage, the number of candidate responses shrinks while the quality increases:

Layer 1: 10 Generators → 10 candidates (one per model)
↓ Critic (Qwen2-72B) annotates all 10
Layer 3: 8 Fusers → 8 fused candidates
↓ Critic annotates all 8
Layer 5: 6 Fusers → 6 fused candidates
↓ Critic annotates all 6
Layer 7: 4 Fusers → 4 fused candidates
↓
Layer 8: 1 Final Fuser → 1 output

The funneling pattern: The general-purpose architecture starts wide (10 generators producing 10+ candidates) and progressively narrows (8 fusers → 6 → 4 → 1). This mirrors how ensemble methods work in classical ML — diversity at the base, refinement at the top. The task-specific architectures break this pattern in task-appropriate ways: coding goes narrow-deep, math goes wide-then-filter.

Chapter 5: Pareto Frontier

The fundamental question in inference-time compute is: how much accuracy do you get per token spent? ARCHON pushes the Pareto frontier — achieving higher accuracy at every budget level compared to fixed architectures.

How ARCHON sweeps the frontier

By adding token budget constraints to the architecture search, ARCHON discovers different optimal architectures for different budgets. At low budgets, it might use a single strong model with one fusion layer. At high budgets, it deploys the full funneling architecture with 10 generators and 4 fusion layers. The key insight: the architecture itself adapts to the budget, not just the number of samples.

Concrete numbers from the paper

Approach | Avg Inference Calls | Avg Input Tokens | Avg Output Tokens
Single LLM (GPT-4o) | 1 | [input and output values fused in source as "549"]
MoA | 19 | 25,109 | 17,422
ADAS | 52 | 72,804 | 44,872
AFlow | 48 | 68,596 | 41,748
ARCHON (general, all-source) | 35 | 50,427 | 30,461
ARCHON (task-specific, all-source) | 39 | 58,250 | 36,114

ARCHON's general-purpose architecture uses about 31% fewer tokens than the best baseline (ADAS) while achieving 6.4% better performance. Even the task-specific architectures, which use more tokens, still use 15.1% fewer input tokens and 13.5% fewer output tokens than AFlow, and roughly 20% fewer than ADAS.
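
The token-savings percentages can be recomputed from the table's counts. The snippet below only does arithmetic on the numbers listed above; the helper name `percent_fewer` is made up for this example.

```python
# (input tokens, output tokens) per query, copied from the table above
tokens = {
    "ADAS": (72_804, 44_872),
    "AFlow": (68_596, 41_748),
    "ARCHON general": (50_427, 30_461),
    "ARCHON task-specific": (58_250, 36_114),
}

def percent_fewer(ours: int, theirs: int) -> float:
    """Percentage reduction of `ours` relative to `theirs`, rounded to one decimal."""
    return round(100 * (1 - ours / theirs), 1)

for baseline in ("ADAS", "AFlow"):
    for system in ("ARCHON general", "ARCHON task-specific"):
        inp = percent_fewer(tokens[system][0], tokens[baseline][0])
        out = percent_fewer(tokens[system][1], tokens[baseline][1])
        print(f"{system} vs {baseline}: {inp}% fewer input, {out}% fewer output")
```

On these counts, the general architecture comes out at 30.7% fewer input and 32.1% fewer output tokens than ADAS. The 15.1% and 13.5% figures match the task-specific architecture measured against AFlow; against ADAS the same architecture saves about 20.0% of input and 19.5% of output tokens.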

How budget constraints reshape architectures

At low budgets (<20 inference calls), ARCHON discovers architectures that look like enhanced single-model systems: one strong generator with a Critic and a single Fuser. The architecture search degenerates to model selection plus minimal post-processing.

At medium budgets (20-40 calls), the architectures start incorporating 3-5 generators and 2 fusion layers. The search discovers that diversity at the generation layer is the most efficient way to spend the extra calls.

At high budgets (40+ calls), the full funneling architecture emerges: 10 generators, Critic, Ranker, 3-4 Fuser layers, and a Verifier. The marginal return on each additional layer is still positive but decreasing — the first Fuser layer gives the biggest jump.

Budget-optimal architecture shape: At every budget level tested, ARCHON's discovered architectures outperform the static baselines (MoA, ADAS, AFlow). The static architectures waste budget because they don't adapt: MoA always uses 19 inference calls whether the budget is 20 or 200. ARCHON allocates every call to where it helps most for the given task and budget.
The Pareto argument: At every FLOP budget tested, ARCHON architectures achieve higher performance than MoA, ADAS, AFlow, and o1. This means there is no budget level where a fixed architecture is a better deal. The architecture search overhead (running on a 20% dev set) pays for itself by finding configurations that are both more accurate AND more efficient than hand-designed systems.

Chapter 6: Results

ARCHON was evaluated across seven benchmarks spanning three task categories. Here are the headline numbers from Table 1 of the paper, showing the best all-source task-specific architectures.

Instruction-following

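Benchmark | GPT-4o | Claude 3.5 | o1 | ARCHON
MT-Bench (win rate) | 44.2% | [one cell missing in source; 56.3% belongs to either Claude 3.5 or o1] | 79.5%
AlpacaEval 2.0 (LC win rate) | 57.8% | 52.7% | 59.3% | 69.0%
Arena-Hard-Auto (win rate) | 80.6% | 81.4% | 81.7% | 92.5%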

Reasoning

Benchmark | GPT-4o | Llama 405B | o1 | ARCHON
MixEval (accuracy) | 87.5% | 88.2% | 87.5% | 89.7%
MixEval-Hard (accuracy) | 63.4% | 66.0% | 72.0% | 72.7%
MATH (Pass@1) | 83.5% | 85.0% | 92.7% | 93.5%

Coding

Benchmark | GPT-4o | Llama 405B | o1 | ARCHON
CodeContests (Pass@1) | 18.1% | 20.4% | 31.5% | 41.4%

The CodeContests result is remarkable: ARCHON's task-specific architecture more than doubles GPT-4o's Pass@1 (18.1% to 41.4%) by using high-sample generation with unit-test filtering. Even compared to o1 — which has its own internal chain-of-thought compute — ARCHON achieves 9.9 percentage points higher.

Open-source ARCHON is competitive

Even when restricted to only open-source models (no GPT-4o, no Claude), ARCHON's task-specific architectures achieve a 11.2% average improvement over the best individual open-source LLM. On MT-Bench, the open-source ARCHON reaches 71.1% win rate — beating single-call GPT-4o (44.2%) and approaching the closed-source ARCHON (77.0%). This demonstrates that orchestration can compensate for individual model quality.

Head-to-head: model sources compared

ARCHON variant | MT-Bench | Arena-Hard | MixEval | MATH | CodeContests
Open-source (task-specific) | 71.1% | 89.6% | 88.8% | 89.5% | 28.9%
Closed-source (task-specific) | 77.0% | 90.5% | 89.5% | 92.1% | 25.1%
All-source (task-specific) | 79.5% | 92.5% | 89.7% | 93.5% | 41.4%

Notably, open-source ARCHON beats closed-source ARCHON on CodeContests (28.9% vs 25.1%). This is because the top open-source coding models (DeepSeek, CodeLlama variants) generate more diverse solution strategies. When both pools are available, the all-source architecture can pick the best of both worlds — open-source generators for diversity, closed-source models for critiquing and fusing.

Generalization to unseen tasks

The general-purpose ARCHON architecture was tested on three benchmarks it was never searched on: GPQA (graduate-level questions), MMLU, and MMLU-Pro. Results:

System | GPQA | MMLU | MMLU-Pro | % of task-specific
AFlow (generalized) | 37.1% | 53.0% | 43.4% | ~71%
ADAS (generalized) | 39.8% | 53.5% | 44.1% | ~71%
ARCHON (generalized) | 56.1% | 76.5% | 71.0% | ~93%

ARCHON's general-purpose architecture retains 91-94% of the task-specific performance on unseen tasks, while ADAS and AFlow retain only 67-74%. The funneling architecture — broad generation plus iterative critique-and-fuse — is a surprisingly robust general strategy.

Generalization: The general-purpose ARCHON architecture (searched on the 7 benchmarks) was tested on 3 unseen tasks: GPQA, MMLU, and MMLU-Pro. It captured 91-94% of the task-specific architecture's performance on those unseen tasks — far better than ADAS (67-73%) and AFlow (67-74%). The architecture generalizes.

Chapter 7: Technique Interactions

The most scientifically interesting part of the paper isn't the final numbers — it's the analysis of how inference-time techniques interact. The authors identify four key trends from exhaustive ablations across seven benchmarks.

T1: Scaling generation always helps

Both repeated sampling (more samples per model) and model ensembling (more models) produce substantial gains: 9.3% and 18.5% respectively. Ensembling from additional models gives nearly double the benefit of additional samples from the same model, because diverse models make different errors.

T2: Stacking technique layers compounds gains

Adding layers of inference-time techniques — particularly Fuser layers — significantly improves performance. Adding a single Fuser as the last layer is always beneficial. The marginal gain varies by task: MT-Bench and AlpacaEval gain 10-15 points from 3-4 fusion layers, while MixEval gains only 1-2 points.

T3: Diversity of techniques matters

Using different types of techniques (Critic before Ranker before Fuser) outperforms using only one type. Critics before Fusers are particularly effective because the critique annotations give the Fuser explicit guidance on what to keep and what to fix. Generation ensembling + critique + fusion improved performance by 18.8% on average over single-model generation.

T4: Verification filters out reasoning errors

For reasoning and coding tasks, adding Verifiers and Unit Test Generators after generation but before fusion filters out flawed responses. This is critical because fusion can propagate errors — if a wrong answer sounds confident, the Fuser might incorporate it. Verification prevents this. On CodeContests, adding unit tests to a high-sample GPT-4o setup boosted Pass@1 from 17.9% to 29.3%.

Concrete interaction examples from the ablations

Configuration | MT-Bench | MixEval | CodeContests
10-model ensemble only | 51.6% | 86.9% | 15.1%
Ensemble + Fuser | 65.0% | 87.0% | 15.1%
Ensemble + Critic + Fuser | 72.7% | 87.2% | -
Ensemble + Verifier + Fuser | - | 88.4% | -
Ensemble + Unit Tests (high sample) | - | - | 29.3%
Full ARCHON (task-specific) | 79.5% | 89.7% | 41.4%

Notice the pattern: for instruction-following (MT-Bench), adding a Fuser to the ensemble gives the largest jump (+13.4 points), and the Critic adds a further 7.7 points on top of ensemble+fuser. For reasoning (MixEval), the Verifier matters more than the Critic. For coding, unit tests with high sampling are transformative. The "full ARCHON" row uses the task-specific architecture from the search, which combines the right techniques for each task.
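
The step-by-step MT-Bench gains can be read off the ablation table with a few lines of arithmetic. The values are copied from the table; the variable names are chosen for this example.

```python
# MT-Bench win rates from the ablation table, in order of added techniques
mt_bench = [
    ("10-model ensemble only", 51.6),
    ("Ensemble + Fuser", 65.0),
    ("Ensemble + Critic + Fuser", 72.7),
    ("Full ARCHON (task-specific)", 79.5),
]

# Gain of each configuration over the one listed before it, in percentage points
steps = [(later[0], round(later[1] - earlier[1], 1))
         for earlier, later in zip(mt_bench, mt_bench[1:])]

for name, gain in steps:
    print(f"{name}: +{gain} points")
```

On these numbers the Fuser contributes 13.4 points, the Critic 7.7, and the remaining task-specific choices 6.8.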

When interactions go wrong

Not all combinations help. The Verifier module, when used alone after generation (without a Fuser), actually hurts performance on instruction-following tasks. Why? The Verifier is prompted to check factual correctness and reasoning chains, but instruction-following quality depends on style, tone, and completeness, which are dimensions the Verifier doesn't assess. It aggressively filters responses that are stylistically excellent but contain minor reasoning imprecisions, leaving behind dry, technically correct but uninspiring outputs.

Similarly, stacking too many fusion layers on MixEval yields diminishing returns (1-2 points after the first layer). The benchmark's questions are relatively unambiguous, so the first fusion pass captures most of the benefit. Additional layers just paraphrase without improving substance.

These negative and diminishing interactions are exactly why manual system design fails. You can't intuit which combinations help on which tasks without running the experiments — or letting a search algorithm do it for you.

The key finding: Technique combinations are not additive; they can be synergistic. Using a Critic together with a Fuser yields a larger gain than the sum of the gains from each used on its own, because the critic gives the fuser explicit repair instructions. But interactions can also be negative: a Verifier used alone after generation sometimes hurts instruction-following tasks by filtering too aggressively. The right combination depends on the task, which is why automated search matters.

Chapter 8: Cost Analysis

ARCHON's improvements come at a price: multiple LLM API calls per query. The paper is transparent about the costs, giving exact token counts and dollar estimates.

Wall-clock time and dollar costs

ARCHON architectures make multiple successive LLM calls for different operations. The paper estimates approximately 5x more time and money than a single LLM call, although the per-query figures in the table below imply a larger dollar multiple. For the general all-source architecture (35 inference calls, ~50K input tokens, ~30K output tokens per query), the cost is roughly $0.32 per query at 2024 API prices, against about $0.01 for a single GPT-4o call.

When the cost is justified

The paper argues ARCHON makes economic sense for high-stakes domains: scientific research, competitive programming, complex agentic tasks, medical diagnosis. In these domains, the cost of a wrong answer far exceeds a few cents per query. For casual chat, single-model inference is fine.

Model size matters

ARCHON is most effective with 70B+ parameter models. With 7B models, the framework boosts performance by 7.5% over the best individual 7B model, but the gains are smaller and 7B models are weaker at critiquing and fusing. The Critic and Fuser roles require strong instruction-following capability that smaller models lack.

Interestingly, 7B models can still serve as effective Rankers — pairwise comparison is an easier task than open-ended critique or synthesis. This suggests a practical deployment pattern: use cheap 7B models for ranking and filtering, reserve expensive 70B+ models for generation and fusion. ARCHON's architecture search can discover these heterogeneous allocations automatically.

Latency breakdown

The critical path through an ARCHON architecture determines wall-clock latency. In the general-purpose architecture with 4 fusion layers, the critical path is: Generator (parallel, ~2s) + Critic (~3s) + Fuser Layer 1 (parallel, ~3s) + Critic + Fuser Layer 2 + ... The total is roughly 5x a single LLM call. However, because Generator and Fuser layers run their constituent models in parallel, the latency scales with the number of sequential layers, not the total inference calls. A 35-call architecture with 6 sequential layers takes about the same time as 6 sequential single-model calls.
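
The "latency follows sequential depth, not call count" point can be made concrete with a small calculation. This is a simplified model that assumes perfect parallelism inside each layer; the per-call latencies are the illustrative ~2s and ~3s figures from the text, and only the first three layers are modeled.

```python
from typing import List

def critical_path_seconds(layers: List[List[float]]) -> float:
    # Components inside a layer run in parallel, so a layer costs as much as its slowest call.
    return sum(max(layer) for layer in layers)

layers = [
    [2.0] * 10,  # 10 generators in parallel (~2s each)
    [3.0],       # critic (~3s)
    [3.0] * 8,   # 8 fusers in parallel (~3s each)
]

total_calls = sum(len(layer) for layer in layers)
latency = critical_path_seconds(layers)
print(total_calls, latency)  # 19 8.0
```

Nineteen calls cost only three layers' worth of waiting in this model, which is why adding fusers to a layer is cheap in latency while adding a layer is not.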

Configuration | Avg PFLOPs/query | $/query | Avg improvement
Single LLM (GPT-4o) | 0.6 | $0.01 | baseline
MoA | 15.3 | $0.06 | +5.7%
ADAS | 58.8 | $0.63 | +9.4%
o1 | ~0.5 | $0.52 | +7.8%
ARCHON (general) | 27.8 | $0.32 | +12.3%
ARCHON (task-specific) | 33.7 | $0.37 | +15.1%

The cost-quality tradeoff curve

ARCHON provides an explicit knob for this tradeoff via the budget constraint in architecture search. Need results in under $0.10/query? The search discovers a lean 3-call architecture. Have $0.50/query to spend? It discovers a 40-call architecture that squeezes maximum accuracy from that budget. This explicitness is a major advantage over systems like o1, where the compute allocation is opaque and non-configurable.

Future cost reduction: LLM inference costs are dropping rapidly — GPT-4o-level quality at a fraction of the 2024 price is expected within a year. As costs fall, the economic case for ARCHON-style multi-call systems strengthens. Additionally, future work could use distillation to compress the aggregate knowledge of an ARCHON architecture into a single smaller model, eliminating the multi-call overhead entirely while retaining much of the quality gain.

Chapter 9: Connections

ARCHON sits at the intersection of several research threads in inference-time compute scaling. Here's how it connects to the broader landscape.

Related approaches

System | Focus | Limitation ARCHON addresses
Mixture-of-Agents (MoA) | Fixed multi-model fusion | Fixed architecture: same pipeline for all tasks and budgets
ADAS | Prompt optimization for single LM | Single-model, doesn't combine multiple techniques
AFlow | Chained operations | Limited search over technique combinations
DSPy | Prompt engineering + tool use | Single LM, prompt-centric rather than architecture-centric
Large Language Monkeys | Repeated sampling scaling laws | Single technique (sampling), ARCHON combines many
OpenAI o1 | Internal chain-of-thought | Opaque, fixed compute allocation; ARCHON makes the compute allocation explicit and searchable

The broader picture

ARCHON represents a shift from model-centric to system-centric scaling. Instead of making a single model bigger (training-time compute), you invest compute at inference time to orchestrate multiple models. This is complementary: ARCHON gets better as the individual models get stronger, and training-time improvements compound with inference-time orchestration.

Open questions and future work

Dynamic architectures. ARCHON currently discovers a fixed architecture per task or a single general-purpose architecture. A natural extension is per-query routing — using a lightweight classifier to select the right architecture for each input based on difficulty, domain, or estimated quality of initial generations.

Distillation. Once an ARCHON architecture produces high-quality outputs on a dataset, those outputs could be used as training data to distill the aggregate knowledge into a single model. This eliminates the multi-call overhead at deployment while retaining much of the quality gain.

New technique modules. The framework is open-source and extensible. New inference-time techniques can be added as modules, new models plugged in, and the search re-run. As the field develops new techniques (tree search, debate, reward models, process reward verification), ARCHON's architecture search can automatically discover how they interact with existing methods.

Latency optimization. The current search maximizes accuracy given a token budget. Adding latency as an explicit objective would discover architectures that parallelize more aggressively — for example, running multiple Fusers in parallel rather than sequential layers, or using faster models for intermediate Critic steps.

Scaling to more techniques. The current search space includes 6 technique types. As the field develops new methods — process reward models for step-by-step verification, debate between models, tree-of-thought search, or tool-augmented generation — the search space grows exponentially. Efficient search algorithms that scale to larger spaces (perhaps hierarchical Bayesian optimization or evolutionary strategies) will become essential.

Meta-learning across tasks. Currently, architecture search runs independently for each task. A meta-learning approach could learn priors over which architecture shapes work for which task types, dramatically reducing the search budget for new tasks. The paper's finding that instruction-following tasks prefer wide-then-fuse while coding prefers narrow-deep is a hint that such priors exist and are learnable.

Looking forward: The promise of ARCHON is that inference-time system design becomes a solved optimization problem. Instead of asking "how should I combine these LLMs?", you ask the search algorithm. The answer will be different for every task, every budget, and every set of available models — and that's the point. The best system is the one that adapts to your constraints, not the one with the cleverest fixed design.

ARCHON's code is available at github.com/ScalingIntelligence/Archon. The framework is implemented in Python, with each LLM component as a modular class. Adding a new technique requires implementing a single process(prompt, candidates) -> candidates method. The architecture search runs on a 20% dev set and typically completes in a few hours depending on the evaluation budget.
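
As a rough idea of what a new technique module could look like, here is a sketch built around the `process(prompt, candidates) -> candidates` shape mentioned above. The class name `DeduplicateComponent` is hypothetical, and the example does not show whatever base class, constructor arguments, or registration step the actual repository may require.

```python
from typing import List

class DeduplicateComponent:
    """Hypothetical component: drops candidates that repeat after case/whitespace normalization."""

    def process(self, prompt: str, candidates: List[str]) -> List[str]:
        seen = set()
        unique = []
        for cand in candidates:
            key = " ".join(cand.lower().split())  # normalize case and whitespace
            if key not in seen:
                seen.add(key)
                unique.append(cand)  # keep the first occurrence, original formatting intact
        return unique

deduped = DeduplicateComponent().process("What is 2 + 2?", ["The answer is 4", "the  answer is 4", "5"])
print(deduped)  # ['The answer is 4', '5']
```

Because the component is text-in, text-out like every other module, it could sit between a Generator layer and a Fuser layer to avoid spending fuser context on repeated candidates.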
