AB-MCTS — Veanors

Chapter 0: The Problem

You give an LLM a hard coding problem. It gets it wrong. What do you do?

Strategy A: Sample more. Ask the LLM 128 times at non-zero temperature, hope that one answer is correct. This is repeated sampling (best-of-n). It works surprisingly well — Large Language Monkeys showed that even problems no single sample solves can be cracked by sampling thousands of times. But there is a fatal flaw: each attempt starts from scratch. If sample #47 got 90% of the way there but failed one edge case, sample #48 doesn't know that. No feedback, no refinement, no learning from near-misses.

Strategy B: Refine one answer. Generate one solution, run the tests, feed the error messages back to the LLM, ask it to fix. Repeat. This is sequential refinement. It uses feedback, but it is locked into a single trajectory. If the initial approach was fundamentally wrong, no amount of patching will save it. You are deep but narrow.

Strategy C: Standard MCTS. Build a search tree with a fixed branching factor (say 5 children per node). Select the most promising node, expand it, propagate scores. This balances exploration and exploitation — in theory. In practice, fixing the branching factor at 5 means you never generate the diverse burst of 50 candidates that repeated sampling benefits from. The fixed width strangles the LLM's natural diversity.

The core tension: Repeated sampling has diversity but no feedback. Sequential refinement has feedback but no diversity. Standard MCTS tries to compromise but fixes the width, preventing the LLM from exploring its full output space. We need a method that dynamically decides: should I generate more diverse attempts here, or should I drill deeper into a promising one?

This is the problem AB-MCTS solves. At every node in the search tree, it asks: "Is it more valuable to go wider (sample a brand new solution from this prompt) or go deeper (take the best child and refine it using feedback)?" The answer changes from node to node and from moment to moment, driven by Bayesian statistics over the scores observed so far.

Why can't repeated sampling improve a near-correct solution?

Because each sample is generated independently from the same prompt — there is no mechanism to feed test results or error messages from one attempt back into the next Because repeated sampling uses a lower temperature Because the LLM cannot generate correct code

Chapter 1: The Key Insight

AB-MCTS's central idea is beautifully simple: let the search tree decide its own shape.

Repeated Sampling = Flat Tree

All 128 samples are children of the root. Width = 128, depth = 1. Maximum diversity, zero refinement. This is a tree, just a very wide, very shallow one.

↓

Sequential Refinement = Tall Tree

One root child, refined 127 times. Width = 1, depth = 128. Maximum refinement, zero diversity. Also a tree — a tall, thin one.

↓

Standard MCTS = Fixed Shape

Every node gets exactly k children (e.g., k=5). The tree shape is determined before you see any results. Rigid.

↓

AB-MCTS = Adaptive Shape

At each node, dynamically choose: add another child (go wider) or descend into an existing child and refine (go deeper). The tree grows organically based on what the scores tell you.

Why this matters: Different problems need different tree shapes. A problem where the LLM frequently generates correct solutions from scratch benefits from width — just sample a lot and pick the best. A problem where solutions are close but have subtle bugs benefits from depth — refine the best attempt with targeted feedback. AB-MCTS adapts to both without you choosing in advance.

The mechanism is a special node called the GEN node. Every node in the tree has a GEN node as a virtual child. When the search selects a GEN node, a brand new LLM response is generated from that node's prompt — widening the tree. When the search selects an existing child instead, it descends deeper. The choice between GEN and existing children is made by Thompson sampling over Bayesian posterior distributions of expected scores.

This means the branching factor is unbounded. A node might get 1 child or 50 children, depending on whether new samples keep producing promising results. Traditional MCTS cannot handle this because UCB/UCT requires knowing the number of arms in advance. Thompson sampling naturally accommodates new arms appearing dynamically.

What makes AB-MCTS a strict generalization of repeated sampling?

AB-MCTS uses a higher temperature If AB-MCTS always selects the GEN node at the root and never descends, it produces a flat, depth-1 tree identical to repeated sampling. Repeated sampling is just one possible tree shape AB-MCTS can produce. AB-MCTS uses the same scoring function

Chapter 2: Repeated Sampling as a Flat Tree

Before we build AB-MCTS, let's understand exactly why repeated sampling works and where it breaks.

You have a coding problem. You call the LLM 128 times with the same prompt at temperature 0.6. Each call returns a complete solution. You run the tests on each, pick the one that passes the most. This is best-of-n.

Why does this work? Because LLMs at non-zero temperature explore a vast output space. Even if the probability of any single sample being correct is only 2%, with 128 samples the probability that at least one is correct is 1 − 0.98¹²⁸ ≈ 92%. Large Language Monkeys formalized this: the number of correct samples grows roughly as a coverage function — more samples, more coverage of the solution space.

Where it fails

Consider a problem where the LLM's approach is fundamentally correct but it keeps making a subtle off-by-one error in the boundary condition. Sampling 128 times might produce 128 solutions that all fail the same edge case in slightly different ways. What you actually need is to take the best attempt, show the LLM the failing test output, and ask it to fix that specific line. Repeated sampling cannot do this — it has no feedback loop.

In tree terms, repeated sampling is a tree of depth 1 and width n. Every node is a direct child of the root. The tree is maximally wide and minimally deep.

Formally: Given a budget of n LLM calls, repeated sampling uses all n calls to generate children of the root node. The final answer is selected by: arg max_i R(solution_i), where R is the scoring function (e.g., fraction of tests passed).

Flat Tree: Repeated Sampling

Each circle is an LLM-generated solution. Color indicates test score (red = 0%, yellow = 50%, green = 100%). Click "Sample" to add a new independent sample. Notice how samples don't learn from each other.

0 samples, best: —

With a per-sample success rate of 5%, how many independent samples do you need for a 95% chance of at least one success?

About 59 samples (since 1 − 0.95⁵⁹ ≈ 0.95) About 20 samples About 200 samples

Chapter 3: Standard MCTS Limitations

Standard Monte Carlo Tree Search, as used in LATS (Language Agent Tree Search), applies the classic MCTS algorithm to LLM-generated solutions. Here is how it works:

Selection: Start at the root. At each node, use UCT (Upper Confidence bound for Trees) to select a child, balancing exploitation (high-scoring children) and exploration (rarely-visited children). Descend until you reach a leaf.
Expansion: At the leaf, generate exactly k new children by calling the LLM k times. Each child gets the parent's solution plus feedback (e.g., test errors) as context.
Evaluation: Score each new child by running tests.
Backpropagation: Propagate the scores up the tree to update node statistics.

The critical limitation is the fixed branching factor k. In LATS, k = 5. Every node gets exactly 5 children, no more, no less. This creates two problems:

Problem 1: Width starvation. If the root node has a 10% per-sample success rate and you only generate 5 children, you have a 1 − 0.9⁵ = 41% chance of getting at least one good solution at the root level. Repeated sampling with 128 tries would give you 1 − 0.9¹²⁸ > 99.99%. The fixed width throws away the LLM's natural diversity.

Problem 2: Wasted depth. If a node already has a perfect solution among its children, standard MCTS still forces the search deeper, expanding nodes that don't need expansion. Budget is wasted generating refinements of solutions that were already correct.

The fundamental issue is that the tree shape is a hyperparameter, chosen before seeing any data. But the optimal shape depends on the problem: easy problems want width, hard-but-refinable problems want depth, and most problems want a mix. Standard MCTS cannot adapt.

UCT and the fixed-arm assumption

UCT assumes a fixed number of arms (children). The formula is:

UCT(N_j) = Q̄(N_j) + c · √(ln N(parent) / N(N_j))

where Q̄ is the average reward and N is the visit count. This breaks when the number of children can grow unboundedly — the exploration bonus for a "new child that doesn't exist yet" is undefined. AB-MCTS replaces UCT with Thompson sampling, which handles dynamic arm creation naturally.

Why is UCT unsuitable for adaptive branching?

UCT requires a fixed set of arms (children) at each node to compute the exploration bonus. A "generate new child" action has no visit count or average reward, so the UCT formula cannot score it. UCT is too slow to compute UCT only works for games, not for coding problems

Chapter 4: The AB-MCTS Algorithm

Now we build the full algorithm. AB-MCTS runs for n_nodes iterations. Each iteration adds exactly one new node to the tree. The process has three steps:

Step 1: Selection

Start at the root. At each node, choose among its children plus a GEN node. The GEN node represents the option of generating a brand new child right here. Use Thompson sampling (not UCT) to choose. If you select an existing child, descend into it and repeat. If you select the GEN node, stop — this node will be expanded.

Step 2: Expansion

Generate a new LLM response. If the selected node is the root, the prompt is just the task description. If it is a non-root node, the prompt includes the node's existing answer plus external feedback (compiler errors, test results). The new response becomes a new child node.

Step 3: Score Backup

Run the scoring function on the new answer (e.g., execute the code, count passing tests). Propagate this score up the tree to update the Bayesian posteriors at each ancestor.

The GEN node is the key. It is a virtual child that always exists at every node. Selecting it means "I think a brand new sample from this node will score higher than refining any existing child." As more children accumulate and one proves promising, the GEN node's appeal decreases — the posterior shifts probability toward refinement. But if all children score poorly, the GEN node stays attractive — the search keeps widening.

The algorithm in pseudocode

function AB-MCTS(n_nodes):
    T = initialize_tree()
    for n = 1 to n_nodes:
        N = select_expansion_target(T)   # Step 1
        N_new = expand(N, T)              # Step 2
        score_backup(N_new, T)             # Step 3
    return select_best(T)

function select_expansion_target(T):
    N = root
    while not is_leaf(N):
        N_next = thompson_select_child(N)  # includes GEN
        if N_next is GEN_NODE:
            break                          # expand here
        N = N_next
    return N

AB-MCTS: Build a Search Tree

Click "Step" to run one iteration of AB-MCTS. Watch the tree grow adaptively — sometimes going wider (new child at a node), sometimes going deeper (refining an existing child). Node color shows score. The highlighted path shows the current selection.

0 nodes, best: —

In AB-MCTS, what happens when the GEN node is selected at a node N?

A brand new LLM response is generated from N's context (task + N's answer + feedback), creating a new child node. This widens the tree at N. The existing children of N are re-scored Node N is deleted from the tree

Chapter 5: The Branching Decision

The heart of AB-MCTS is deciding: wider or deeper? This decision is made at every node during selection. The authors propose two approaches, both using Thompson sampling over Bayesian posteriors.

AB-MCTS-M: The Mixed Model

At node N with children N₁, ..., N_k, model the score of a future expansion as:

r = α_j + σ_yε, α_j = μ_α + σ_αε_j

Each existing child j has a "group-level intercept" α_j capturing how good that branch is. The GEN node (action a₀) gets no direct observations, but its α₀ is drawn from the same shared distribution — so the model infers what a new branch might score based on what existing branches scored.

If existing branches score well with low variance, the GEN node's expected score is moderate with high variance — Thompson sampling will sometimes still pick it (exploration). If existing branches score poorly, the GEN node looks relatively more promising — the tree widens.

This uses MCMC (Markov Chain Monte Carlo) to sample from the posterior. Computationally heavier, but statistically principled.

AB-MCTS-A: Node Aggregation

A lighter-weight alternative. Instead of a hierarchical mixed model, AB-MCTS-A introduces a CONT node (continue) alongside the GEN node. At each node N:

GEN node: represents "generate a brand new child." Its posterior is updated with scores of nodes created via GEN at this level.
CONT node: represents "continue into an existing child." Its posterior is updated with scores of all descendant refinements.

Thompson sampling draws from both posteriors. If the GEN draw is higher, widen. If the CONT draw is higher, descend into the best existing child and repeat.

AB-MCTS-A uses conjugate priors (Beta or Gaussian), so posterior updates are analytical — no MCMC needed. Much faster per iteration.

Two variants of AB-MCTS-A: (1) Gaussian variant uses a normal-inverse-χ² prior, suitable for unbounded scores. (2) Beta variant uses a Beta prior for scores in [0, 1], which is natural for "fraction of tests passed." Both are closed-form, making them fast.

Thompson Sampling (intuition)

Thompson sampling works by maintaining a probability distribution over each arm's expected reward. To choose an arm, you draw one random sample from each arm's distribution and pick the arm with the highest draw. Arms with high expected reward get picked often (exploitation). Arms with high uncertainty occasionally produce very high draws (exploration). This naturally balances both — and it handles the GEN node elegantly because a new arm naturally has high uncertainty.

Why does Thompson sampling favor the GEN node when all existing children have mediocre scores?

It doesn't — it always prefers existing children The GEN node has a wide posterior (high uncertainty) because it represents an unseen action. When existing children cluster around a mediocre score, their posteriors are tight and centered on that mediocre value, while the GEN node's wide distribution frequently produces high draws that beat them. Thompson sampling uses a random number generator that favors GEN

Chapter 6: External Feedback Integration

AB-MCTS's power comes from combining diverse LLM generation with external feedback signals — information the LLM cannot generate on its own but that tells it exactly what went wrong.

Feedback in coding tasks

When an LLM generates a Python solution for a competitive programming problem, the system can:

Compile the code — catching syntax errors and import failures immediately.
Run public test cases — revealing which inputs produce wrong outputs, runtime errors, or timeouts.
Report the score — fraction of tests passed, giving a continuous signal in [0, 1].

This feedback serves two roles in AB-MCTS:

Role 1: Scoring function R. The fraction of tests passed (or a similar metric) is the reward signal that drives tree search. It updates the Bayesian posteriors, steering Thompson sampling toward promising branches.

Role 2: Refinement context. When going deeper (refining an existing solution), the LLM receives its previous code plus the error messages, failing test cases, and compiler output. This targeted feedback lets it fix specific bugs rather than regenerating from scratch.

Feedback for other domains

The framework generalizes beyond coding:

MLE-Bench (Kaggle competitions): The score is the competition metric (accuracy, F1, RMSE). Feedback includes validation set performance and error logs from training.
ARC-AGI (pattern puzzles): The score is whether the predicted output grid matches the expected one. Feedback is binary (correct/incorrect per example).
Any task with verifiable outputs: Mathematical proofs (checked by a proof assistant), SQL queries (checked against expected results), API interactions (checked against test harnesses).

The key requirement is a surrogate evaluator R available at search time. This need not be the final hidden-test evaluator — even a partial signal (public tests only) is enough to guide the search. The paper acknowledges that developing such evaluators is itself a challenge for some domains.

What two roles does external feedback play in AB-MCTS?

It provides the reward signal R for tree search (steering Thompson sampling via Bayesian posteriors) AND the refinement context for "go deeper" steps (giving the LLM specific error information to fix) It trains the model and generates prompts It selects the temperature and picks the final answer

Chapter 7: Results

AB-MCTS was evaluated on four benchmarks using GPT-4o and DeepSeek-V3, with a generation budget of 128 LLM calls (2⁷).

Main results (Table 1 from the paper)

Method	LiveCodeBench	CodeContest	ARC-AGI	Avg. Rank
Repeated Sampling	37.8 / 40.7	37.9 / 43.2	15.0 / 18.6	3.5
Sequential Refinement	37.8 / 41.6	30.1 / 41.6	8.7 / 10.0	5.5
Standard MCTS	36.7 / 43.2	37.5 / 43.8	9.0 / 14.0	4.2
AB-MCTS-M	38.9 / 43.0	40.6 / 44.6	12.3 / 16.0	2.3
AB-MCTS-A (Gauss.)	39.1 / 42.5	40.2 / 43.4	13.0 / 18.3	2.7
AB-MCTS-A (Beta)	38.7 / 42.3	40.4 / 44.8	14.0 / 16.6	2.7

Values: GPT-4o / DeepSeek-V3. Higher is better. Budget = 128 calls.

Key takeaway: AB-MCTS variants achieve the best average rank (2.3–2.7) across all benchmarks and models. No single baseline wins everywhere, but AB-MCTS consistently performs near the top. The average rank of 2.3 for AB-MCTS-M means it is almost always the best or second-best method.

Benchmark-specific insights

LiveCodeBench & CodeContest (competitive programming): AB-MCTS pulls ahead at budgets as small as 8 LLM calls. On CodeContest, AB-MCTS-A (Beta) with DeepSeek-V3 achieves 44.8 — the single highest score in the table. These problems benefit from a mix of diverse approaches (width) and debugging specific solutions (depth).

ARC-AGI (pattern puzzles): Repeated sampling is strongest here, reflecting ARC's nature — the LLM either "gets" the pattern or doesn't, and feedback (correct/incorrect) is too coarse for targeted refinement. Still, AB-MCTS is competitive and never catastrophically worse.

MLE-Bench (Kaggle): AB-MCTS-M achieves the best average rank (1.3) across three ML competitions. These tasks especially benefit from iterative refinement — adjusting hyperparameters, fixing data processing bugs — making depth particularly valuable.

Performance vs. Generation Budget

Drag the slider to change the generation budget. Compare how repeated sampling (wide only) vs. AB-MCTS (adaptive) scale. AB-MCTS pulls ahead as budget grows because it can refine promising solutions.

Budget (2^k) 128

On which benchmark does repeated sampling perform most competitively against AB-MCTS, and why?

ARC-AGI, because the feedback signal is binary (correct/incorrect) and too coarse for targeted refinement — width (diverse sampling) matters more than depth (iterative fixing), which is exactly what repeated sampling maximizes LiveCodeBench, because coding is easy MLE-Bench, because Kaggle competitions are simple

Chapter 8: Analysis

The paper provides a fascinating analysis of tree shapes that emerge from AB-MCTS compared to baselines.

Width vs. depth in practice

The authors measure each algorithm's tree shape using the log-ratio of mean depth to mean width. This single number captures the tree's character: negative values mean wider (more exploration), positive values mean deeper (more refinement).

Finding: AB-MCTS trees tend to be wider than standard MCTS trees but narrower than pure repeated sampling. The tree shape adapts per problem — easy problems get wider trees (the search finds solutions quickly and keeps exploring), while hard problems get deeper trees (the search finds a promising direction and refines it).

When does depth help most?

Depth is most valuable when:

The LLM can generate a near-correct solution but struggles with specific details (off-by-one errors, edge cases, import issues).
External feedback is rich and specific — compiler errors with line numbers, failing test cases with expected vs. actual output. This gives the LLM actionable information for refinement.
The solution space is structured such that small edits can fix a mostly-correct solution (typical of coding tasks).

When does width help most?

Width is most valuable when:

The problem requires a fundamentally different approach that incremental edits cannot reach. Sampling many diverse starting points is the only way to discover it.
External feedback is coarse or binary (correct/incorrect, like ARC-AGI). With no detailed error signal, refinement is essentially blind rewriting.
The per-sample success rate is low but nonzero — the coverage argument from Large Language Monkeys applies.

Scaling to large budgets

On ARC-AGI with budgets up to 512, AB-MCTS continues improving substantially between 200–500 samples, while repeated sampling begins to plateau. This is remarkable: even on a task that favors width, AB-MCTS's adaptive depth provides gains at scale that pure width cannot.

Qualitative tree analysis

The paper visualizes a real search tree from MLE-Bench (Random Acts of Pizza). The tree shows clearly adaptive branching: the root has many children (wide exploration), but certain promising branches develop significant depth (focused refinement). Failed branches (grey nodes with no score) are abandoned quickly. This is exactly the behavior the algorithm was designed to produce.

Why does AB-MCTS continue to improve at budgets of 200-500 on ARC-AGI while repeated sampling plateaus?

AB-MCTS uses a different LLM Even on tasks favoring width, AB-MCTS can selectively deepen promising branches, extracting additional value from near-misses that repeated sampling discards. At high budgets, the marginal return of yet another independent sample diminishes, while targeted refinement of the best candidates still yields gains. AB-MCTS has a higher temperature setting

Chapter 9: Connections

AB-MCTS sits at the intersection of several important research threads in inference-time compute scaling. Let's map where each connection leads.

LATS (Language Agent Tree Search)

Zhou et al. (2024) applied standard MCTS to LLM-generated solutions, using UCT for selection and a fixed branching factor. AB-MCTS directly extends LATS by replacing UCT with Thompson sampling and introducing the GEN node for adaptive branching. LATS corresponds to the "Standard MCTS" baseline in the experiments.

Large Language Monkeys

Brown et al. (2024) formalized the power of repeated sampling: if you sample enough times, even hard problems get solved. Their "coverage" analysis shows that success probability grows as a function of the number of samples. AB-MCTS generalizes this by treating repeated sampling as a special case (flat tree) and showing that adaptive depth can improve upon pure width.

Scaling Test-Time Compute

Snell et al. (2025) analyzed how to optimally allocate inference-time compute between different strategies (revision, verification, search). AB-MCTS provides a concrete algorithm for this allocation: the Bayesian branching decision automatically distributes budget between exploration (width) and exploitation (depth) based on observed scores.

Self-Refine and Reflexion

Madaan et al. (Self-Refine, 2024) and Shinn et al. (Reflexion, 2024) pioneered iterative self-refinement with feedback. These correspond to the "Sequential Refinement" baseline — depth-only search. AB-MCTS incorporates their insight (feedback is valuable) while adding width (diversity is also valuable).

Progressive Widening

Coulom (2007) and Chaslot et al. (2008) proposed increasing the number of actions per node over time in MCTS. AB-MCTS differs because all new branches come from the same LLM (homogeneous generation), enabling principled statistical comparison between "sample new" and "refine existing." Progressive widening was designed for games with unique action semantics.

Cheat Sheet

Aspect	AB-MCTS
Core mechanism	GEN node for adaptive branching + Thompson sampling
Selection policy	Thompson sampling over Bayesian posteriors (not UCT)
Variants	AB-MCTS-M (mixed model, MCMC) and AB-MCTS-A (conjugate priors, analytical)
Branching factor	Unbounded, dynamically decided per node
Generalizes	Repeated sampling (flat tree), sequential refinement (tall tree)
Feedback	Any external evaluator R: test results, compiler output, metrics
Budget	n_nodes LLM calls total, allocated adaptively
Models tested	GPT-4o, DeepSeek-V3
Benchmarks	LiveCodeBench, CodeContest, ARC-AGI, MLE-Bench
Best avg. rank	2.3 (AB-MCTS-M) across all settings
Code	github.com/SakanaAI/treequest

The bigger picture: AB-MCTS demonstrates that inference-time compute scaling is not just about sampling more — it's about spending your compute budget wisely. The same 128 LLM calls produce better results when dynamically allocated between exploration and refinement. As inference budgets grow (o1-style reasoning, agent loops), adaptive allocation will matter more, not less.

How does AB-MCTS relate to repeated sampling and sequential refinement?

They are unrelated methods AB-MCTS strictly generalizes both: repeated sampling is the special case where only GEN nodes are selected (flat tree, depth 1), and sequential refinement is the special case where only one branch is ever deepened (tall tree, width 1). AB-MCTS finds the optimal mix. AB-MCTS replaces repeated sampling but is different from sequential refinement

AB-MCTS: Wider or Deeper?