Adaptive Branching Monte Carlo Tree Search dynamically decides whether to generate new candidates (go wider) or refine existing ones (go deeper), generalizing repeated sampling for inference-time compute scaling.
You give an LLM a hard coding problem. It gets it wrong. What do you do?
Strategy A: Sample more. Ask the LLM 128 times at non-zero temperature, hope that one answer is correct. This is repeated sampling (best-of-n). It works surprisingly well — Large Language Monkeys showed that even problems no single sample solves can be cracked by sampling thousands of times. But there is a fatal flaw: each attempt starts from scratch. If sample #47 got 90% of the way there but failed one edge case, sample #48 doesn't know that. No feedback, no refinement, no learning from near-misses.
Strategy B: Refine one answer. Generate one solution, run the tests, feed the error messages back to the LLM, ask it to fix. Repeat. This is sequential refinement. It uses feedback, but it is locked into a single trajectory. If the initial approach was fundamentally wrong, no amount of patching will save it. You are deep but narrow.
Strategy C: Standard MCTS. Build a search tree with a fixed branching factor (say 5 children per node). Select the most promising node, expand it, propagate scores. This balances exploration and exploitation — in theory. In practice, fixing the branching factor at 5 means you never generate the diverse burst of 50 candidates that repeated sampling benefits from. The fixed width strangles the LLM's natural diversity.
This is the problem AB-MCTS solves. At every node in the search tree, it asks: "Is it more valuable to go wider (sample a brand new solution from this prompt) or go deeper (take the best child and refine it using feedback)?" The answer changes from node to node and from moment to moment, driven by Bayesian statistics over the scores observed so far.
AB-MCTS's central idea is beautifully simple: let the search tree decide its own shape.
The mechanism is a special node called the GEN node. Every node in the tree has a GEN node as a virtual child. When the search selects a GEN node, a brand new LLM response is generated from that node's prompt — widening the tree. When the search selects an existing child instead, it descends deeper. The choice between GEN and existing children is made by Thompson sampling over Bayesian posterior distributions of expected scores.
This means the branching factor is unbounded. A node might get 1 child or 50 children, depending on whether new samples keep producing promising results. Traditional MCTS cannot handle this because UCB/UCT requires knowing the number of arms in advance. Thompson sampling naturally accommodates new arms appearing dynamically.
Before we build AB-MCTS, let's understand exactly why repeated sampling works and where it breaks.
You have a coding problem. You call the LLM 128 times with the same prompt at temperature 0.6. Each call returns a complete solution. You run the tests on each, pick the one that passes the most. This is best-of-n.
Why does this work? Because LLMs at non-zero temperature explore a vast output space. Even if the probability of any single sample being correct is only 2%, with 128 samples the probability that at least one is correct is 1 − 0.98128 ≈ 92%. Large Language Monkeys formalized this: the number of correct samples grows roughly as a coverage function — more samples, more coverage of the solution space.
Consider a problem where the LLM's approach is fundamentally correct but it keeps making a subtle off-by-one error in the boundary condition. Sampling 128 times might produce 128 solutions that all fail the same edge case in slightly different ways. What you actually need is to take the best attempt, show the LLM the failing test output, and ask it to fix that specific line. Repeated sampling cannot do this — it has no feedback loop.
In tree terms, repeated sampling is a tree of depth 1 and width n. Every node is a direct child of the root. The tree is maximally wide and minimally deep.
Each circle is an LLM-generated solution. Color indicates test score (red = 0%, yellow = 50%, green = 100%). Click "Sample" to add a new independent sample. Notice how samples don't learn from each other.
Standard Monte Carlo Tree Search, as used in LATS (Language Agent Tree Search), applies the classic MCTS algorithm to LLM-generated solutions. Here is how it works:
The critical limitation is the fixed branching factor k. In LATS, k = 5. Every node gets exactly 5 children, no more, no less. This creates two problems:
The fundamental issue is that the tree shape is a hyperparameter, chosen before seeing any data. But the optimal shape depends on the problem: easy problems want width, hard-but-refinable problems want depth, and most problems want a mix. Standard MCTS cannot adapt.
UCT assumes a fixed number of arms (children). The formula is:
where Q̄ is the average reward and N is the visit count. This breaks when the number of children can grow unboundedly — the exploration bonus for a "new child that doesn't exist yet" is undefined. AB-MCTS replaces UCT with Thompson sampling, which handles dynamic arm creation naturally.
Now we build the full algorithm. AB-MCTS runs for nnodes iterations. Each iteration adds exactly one new node to the tree. The process has three steps:
Start at the root. At each node, choose among its children plus a GEN node. The GEN node represents the option of generating a brand new child right here. Use Thompson sampling (not UCT) to choose. If you select an existing child, descend into it and repeat. If you select the GEN node, stop — this node will be expanded.
Generate a new LLM response. If the selected node is the root, the prompt is just the task description. If it is a non-root node, the prompt includes the node's existing answer plus external feedback (compiler errors, test results). The new response becomes a new child node.
Run the scoring function on the new answer (e.g., execute the code, count passing tests). Propagate this score up the tree to update the Bayesian posteriors at each ancestor.
function AB-MCTS(n_nodes): T = initialize_tree() for n = 1 to n_nodes: N = select_expansion_target(T) # Step 1 N_new = expand(N, T) # Step 2 score_backup(N_new, T) # Step 3 return select_best(T) function select_expansion_target(T): N = root while not is_leaf(N): N_next = thompson_select_child(N) # includes GEN if N_next is GEN_NODE: break # expand here N = N_next return N
Click "Step" to run one iteration of AB-MCTS. Watch the tree grow adaptively — sometimes going wider (new child at a node), sometimes going deeper (refining an existing child). Node color shows score. The highlighted path shows the current selection.
The heart of AB-MCTS is deciding: wider or deeper? This decision is made at every node during selection. The authors propose two approaches, both using Thompson sampling over Bayesian posteriors.
At node N with children N1, ..., Nk, model the score of a future expansion as:
Each existing child j has a "group-level intercept" αj capturing how good that branch is. The GEN node (action a0) gets no direct observations, but its α0 is drawn from the same shared distribution — so the model infers what a new branch might score based on what existing branches scored.
If existing branches score well with low variance, the GEN node's expected score is moderate with high variance — Thompson sampling will sometimes still pick it (exploration). If existing branches score poorly, the GEN node looks relatively more promising — the tree widens.
This uses MCMC (Markov Chain Monte Carlo) to sample from the posterior. Computationally heavier, but statistically principled.
A lighter-weight alternative. Instead of a hierarchical mixed model, AB-MCTS-A introduces a CONT node (continue) alongside the GEN node. At each node N:
Thompson sampling draws from both posteriors. If the GEN draw is higher, widen. If the CONT draw is higher, descend into the best existing child and repeat.
AB-MCTS-A uses conjugate priors (Beta or Gaussian), so posterior updates are analytical — no MCMC needed. Much faster per iteration.
Thompson sampling works by maintaining a probability distribution over each arm's expected reward. To choose an arm, you draw one random sample from each arm's distribution and pick the arm with the highest draw. Arms with high expected reward get picked often (exploitation). Arms with high uncertainty occasionally produce very high draws (exploration). This naturally balances both — and it handles the GEN node elegantly because a new arm naturally has high uncertainty.
AB-MCTS's power comes from combining diverse LLM generation with external feedback signals — information the LLM cannot generate on its own but that tells it exactly what went wrong.
When an LLM generates a Python solution for a competitive programming problem, the system can:
This feedback serves two roles in AB-MCTS:
The framework generalizes beyond coding:
The key requirement is a surrogate evaluator R available at search time. This need not be the final hidden-test evaluator — even a partial signal (public tests only) is enough to guide the search. The paper acknowledges that developing such evaluators is itself a challenge for some domains.
AB-MCTS was evaluated on four benchmarks using GPT-4o and DeepSeek-V3, with a generation budget of 128 LLM calls (27).
| Method | LiveCodeBench | CodeContest | ARC-AGI | Avg. Rank |
|---|---|---|---|---|
| Repeated Sampling | 37.8 / 40.7 | 37.9 / 43.2 | 15.0 / 18.6 | 3.5 |
| Sequential Refinement | 37.8 / 41.6 | 30.1 / 41.6 | 8.7 / 10.0 | 5.5 |
| Standard MCTS | 36.7 / 43.2 | 37.5 / 43.8 | 9.0 / 14.0 | 4.2 |
| AB-MCTS-M | 38.9 / 43.0 | 40.6 / 44.6 | 12.3 / 16.0 | 2.3 |
| AB-MCTS-A (Gauss.) | 39.1 / 42.5 | 40.2 / 43.4 | 13.0 / 18.3 | 2.7 |
| AB-MCTS-A (Beta) | 38.7 / 42.3 | 40.4 / 44.8 | 14.0 / 16.6 | 2.7 |
Values: GPT-4o / DeepSeek-V3. Higher is better. Budget = 128 calls.
LiveCodeBench & CodeContest (competitive programming): AB-MCTS pulls ahead at budgets as small as 8 LLM calls. On CodeContest, AB-MCTS-A (Beta) with DeepSeek-V3 achieves 44.8 — the single highest score in the table. These problems benefit from a mix of diverse approaches (width) and debugging specific solutions (depth).
ARC-AGI (pattern puzzles): Repeated sampling is strongest here, reflecting ARC's nature — the LLM either "gets" the pattern or doesn't, and feedback (correct/incorrect) is too coarse for targeted refinement. Still, AB-MCTS is competitive and never catastrophically worse.
MLE-Bench (Kaggle): AB-MCTS-M achieves the best average rank (1.3) across three ML competitions. These tasks especially benefit from iterative refinement — adjusting hyperparameters, fixing data processing bugs — making depth particularly valuable.
Drag the slider to change the generation budget. Compare how repeated sampling (wide only) vs. AB-MCTS (adaptive) scale. AB-MCTS pulls ahead as budget grows because it can refine promising solutions.
The paper provides a fascinating analysis of tree shapes that emerge from AB-MCTS compared to baselines.
The authors measure each algorithm's tree shape using the log-ratio of mean depth to mean width. This single number captures the tree's character: negative values mean wider (more exploration), positive values mean deeper (more refinement).
Depth is most valuable when:
Width is most valuable when:
On ARC-AGI with budgets up to 512, AB-MCTS continues improving substantially between 200–500 samples, while repeated sampling begins to plateau. This is remarkable: even on a task that favors width, AB-MCTS's adaptive depth provides gains at scale that pure width cannot.
The paper visualizes a real search tree from MLE-Bench (Random Acts of Pizza). The tree shows clearly adaptive branching: the root has many children (wide exploration), but certain promising branches develop significant depth (focused refinement). Failed branches (grey nodes with no score) are abandoned quickly. This is exactly the behavior the algorithm was designed to produce.
AB-MCTS sits at the intersection of several important research threads in inference-time compute scaling. Let's map where each connection leads.
Zhou et al. (2024) applied standard MCTS to LLM-generated solutions, using UCT for selection and a fixed branching factor. AB-MCTS directly extends LATS by replacing UCT with Thompson sampling and introducing the GEN node for adaptive branching. LATS corresponds to the "Standard MCTS" baseline in the experiments.
Brown et al. (2024) formalized the power of repeated sampling: if you sample enough times, even hard problems get solved. Their "coverage" analysis shows that success probability grows as a function of the number of samples. AB-MCTS generalizes this by treating repeated sampling as a special case (flat tree) and showing that adaptive depth can improve upon pure width.
Snell et al. (2025) analyzed how to optimally allocate inference-time compute between different strategies (revision, verification, search). AB-MCTS provides a concrete algorithm for this allocation: the Bayesian branching decision automatically distributes budget between exploration (width) and exploitation (depth) based on observed scores.
Madaan et al. (Self-Refine, 2024) and Shinn et al. (Reflexion, 2024) pioneered iterative self-refinement with feedback. These correspond to the "Sequential Refinement" baseline — depth-only search. AB-MCTS incorporates their insight (feedback is valuable) while adding width (diversity is also valuable).
Coulom (2007) and Chaslot et al. (2008) proposed increasing the number of actions per node over time in MCTS. AB-MCTS differs because all new branches come from the same LLM (homogeneous generation), enabling principled statistical comparison between "sample new" and "refine existing." Progressive widening was designed for games with unique action semantics.
| Aspect | AB-MCTS |
|---|---|
| Core mechanism | GEN node for adaptive branching + Thompson sampling |
| Selection policy | Thompson sampling over Bayesian posteriors (not UCT) |
| Variants | AB-MCTS-M (mixed model, MCMC) and AB-MCTS-A (conjugate priors, analytical) |
| Branching factor | Unbounded, dynamically decided per node |
| Generalizes | Repeated sampling (flat tree), sequential refinement (tall tree) |
| Feedback | Any external evaluator R: test results, compiler output, metrics |
| Budget | nnodes LLM calls total, allocated adaptively |
| Models tested | GPT-4o, DeepSeek-V3 |
| Benchmarks | LiveCodeBench, CodeContest, ARC-AGI, MLE-Bench |
| Best avg. rank | 2.3 (AB-MCTS-M) across all settings |
| Code | github.com/SakanaAI/treequest |