Monte Carlo Tree Search meets language model agents. LATS uses the LM as action generator, value function, AND self-reflector — achieving 92.7% pass@1 on HumanEval with GPT-4 and 75.9 on WebShop with GPT-3.5.
You give GPT-4 a coding task. It generates a solution. The solution fails a test case. What does GPT-4 do?
Nothing. It already committed to that path. The token sequence is emitted, the die is cast. Even with ReAct-style prompting, where the model can observe test results and try again, it only refines a single trajectory. It never considers: "What if I had taken a completely different approach back at step 3?"
This is the core limitation. Current LM agents are greedy. They make one choice at each step and never backtrack. Chain-of-thought (CoT) reasons in a single line. ReAct acts in a single line. Reflexion reflects on a single trajectory and tries again, but it still explores one path at a time. None of them plan.
Think about how a human programmer approaches a hard problem. You don't just write code top-to-bottom and hope it works. You consider multiple approaches, mentally simulate each one, test the most promising, backtrack when something fails, and use failure information to guide your next attempt. You're doing tree search — exploring multiple branches of a decision tree, using evaluation and backtracking to find the best path.
The game of Go was cracked by exactly this kind of planning. AlphaGo used Monte Carlo Tree Search to explore millions of possible move sequences, guided by a value network that estimated how good each board position was. What if we could do the same thing for language model agents?
That's LATS. Take MCTS — the algorithm behind AlphaGo — and adapt it for LM agents. Instead of Go moves, the agent generates reasoning steps and actions. Instead of a trained value network, the LM itself evaluates how promising each path looks. Instead of just internal simulation, the agent actually executes actions in the real environment and gets real feedback. And when a path fails, the LM reflects on why it failed, storing that wisdom for future attempts.
LATS's central idea is surprisingly elegant: a pretrained LM can serve triple duty as the agent (action generator), the value function (state evaluator), and the reflector (failure analyzer). This means you can run full MCTS using nothing but a single language model plus an environment.
MCTS normally has a major requirement: you need a world model that can simulate forward from any state, and you need to be able to revert to any previous state. In chess or Go, this is easy — just copy the board. In Atari, you save and restore the emulator state. But for general decision-making, building a world model is hard.
LM tasks have a special property that makes this trivial. The "state" in an LM task is just a text sequence: the original prompt, the actions taken so far, and the observations received. To revert to any previous state, you literally copy-paste the earlier text context. No world model needed. No state restoration needed. You just trim the conversation and branch from there.
The value function deserves special attention. In standard MCTS (like AlphaGo), the value function is a trained neural network. In LATS, the LM evaluates a state by reasoning about it after receiving environment feedback. This is the key difference from Tree-of-Thought, which evaluates states using only the LM's internal knowledge. Environment feedback grounds the evaluation in reality, avoiding hallucinated assessments.
Self-reflection is the other differentiator. When a complete trajectory fails (e.g., the code doesn't pass tests, or the answer is wrong), the LM is prompted to analyze the failure and suggest improvements. These reflections are stored in memory and added to future prompts, giving the agent a "semantic gradient signal" — richer than a scalar reward, because it explains what went wrong in natural language.
Before diving into LATS, let's build solid intuition for Monte Carlo Tree Search. MCTS is the engine under the hood — understanding it is non-negotiable.
Imagine every decision your agent could make as a tree. The root is the starting state. Each edge is an action. Each node is the state you reach after taking that action. Leaves are terminal states where the task is done (success or failure). The goal: find the path from root to a successful leaf.
The problem: the tree is enormous. A coding task might have thousands of possible next actions at each step. You can't explore everything. You need a strategy that focuses on the most promising branches while still occasionally trying new things. This is the exploration-exploitation tradeoff.
MCTS handles this tradeoff through four phases, repeated for each episode:
The selection step uses UCT (Upper Confidence bounds applied to Trees) to balance exploration and exploitation:
Where:
The second term is the exploration bonus. A node that has been visited few times relative to its parent gets a high bonus — "we haven't tried this much, let's explore." A node visited many times gets a low bonus — "we know what this branch does." The beautiful thing about UCT is that it's provably optimal in the limit: given enough episodes, it will find the best path.
When a simulation reaches a terminal state with reward r, every node along the path updates:
This is just a running average. Early on, a node's value is noisy. After many visits, it converges to the true expected reward from that state. The key: backpropagation is what makes the tree learn. Each simulation makes the tree's value estimates more accurate, which makes selection smarter, which makes the next simulation more informative.
Now we wire MCTS into a language model agent. This is the showcase chapter — the full LATS pipeline with all six operations, showing how the LM powers every component.
LATS extends standard MCTS with two additional operations (evaluation and reflection), for a total of six performed in succession:
As agent (expansion): At each node, the LM is prompted with the current state (original input + action history + observations) and generates n candidate next actions. For ReAct-based tasks, actions include both reasoning traces ("I should search for X because...") and environment actions ("search[X]"). The diversity of n samples is crucial — it's what creates the branching factor of the tree.
As value function (evaluation): For each new child node, the LM is prompted to reason about the state and assign a scalar score. The key distinction from Tree-of-Thought: this evaluation happens after receiving environment feedback (the observation from executing the action). This grounds the score in reality. The full value function combines two signals:
Where LM(s) is the LM's self-generated score after reasoning about the state with environment context, and SC(s) is a self-consistency bonus — actions that are sampled multiple times at the same state are more likely correct, so they get a higher score. Lambda balances the two.
As reflector: When a trajectory reaches a terminal failure, the LM is prompted with the full trajectory and the final reward. It generates a verbal reflection: "The search failed because the initial query was too specific. A broader search term like X would have found the relevant passage." These reflections are stored in memory and injected into future prompts.
Each node in the LATS tree stores: the original input x, the full action sequence a1...i, the observation sequence o1...i, the value V(s), and the visit count N(s). Failed trajectories and their reflections are stored separately. The entire structure lives as text in external memory — nothing is trained, nothing is gradient-updated. It's all in-context learning.
Let's trace through the first two operations in detail with a concrete example.
Suppose we're on iteration 3 of a HotPotQA question. The tree already has several expanded nodes from previous iterations. Selection starts at the root (the original question) and walks down, at each level choosing the child with the highest UCT score.
The walk ends when it reaches a leaf node — a node that hasn't been expanded yet (or was expanded but has promising unexplored children). This is the node we'll expand next.
The exploration weight w is a hyperparameter. A higher w makes the agent try more novel branches; a lower w makes it focus on branches that have already shown good results. In practice, w is tuned per domain — web navigation benefits from higher exploration than programming, where correct solutions tend to be concentrated.
At the selected leaf, the LM generates n candidate actions. For ReAct-based tasks, the prompt looks like:
# Context provided to the LM: Question: Which school was founded first, Whitgift or ## Charterhouse? Thought 1: I need to search for Whitgift and Charterhouse founding dates. Action 1: search[Whitgift School] Observation 1: Whitgift School is a school in Croydon, founded in 1596... # The LM now generates n=5 different next actions: Candidate A: search[Charterhouse School] Candidate B: search[Charterhouse School founding year] Candidate C: search[Charterhouse history] Candidate D: think[Whitgift was 1596. I need Charterhouse's date.] Candidate E: search[oldest schools England Charterhouse]
Each candidate is sent to the environment. The environment returns an observation (the Wikipedia search result, the compiler output, the web page content). This creates n new child nodes in the tree, each with its own state: the original context + the new action + the new observation.
LATS inherits from ReAct: the action space is the union of environment actions A and reasoning traces Z. Actions like search[...] and click[...] affect the environment and produce observations. Thoughts like think[...] are internal reasoning that help the agent plan. Both are treated as "actions" in the tree, which means the tree captures not just what the agent did but also what it thought.
After expansion creates new child nodes, LATS evaluates them and then simulates the most promising one to completion.
Each new child node gets a value score. The LM is prompted with the node's full state (input + actions + observations so far) and asked to reason about progress and end with a numerical score. Crucially, this happens after the environment returns the observation — so the LM is evaluating reality, not hallucinating about what might happen.
For example, after the agent searches for "Charterhouse School" and gets a Wikipedia snippet, the LM might reason: "The observation confirms Charterhouse was founded in 1611. We already know Whitgift was 1596. We have both dates and can answer the question. Score: 0.9."
The self-consistency component adds a second signal. If 3 out of 5 candidate actions at a node are variations of "search[Charterhouse School]", that action gets a self-consistency bonus. The intuition: if the LM independently converges on the same action from multiple samples, that action is likely correct. The combined value:
The simulation phase takes the highest-valued child node and expands it further until reaching a terminal state. At each depth level, the same expand-and-evaluate cycle runs, always prioritizing the child with the best value.
A terminal state is determined by the environment: the agent submits a final answer (HotPotQA), the code passes all tests (HumanEval), the agent completes a purchase (WebShop), or a depth limit is reached.
If the terminal state is a success, LATS stops immediately — it found a working trajectory. If it's a failure, LATS continues to backpropagation and reflection.
For HumanEval and MBPP, each "action" is a complete solution (a full Python function). The environment compiles it and runs a synthetic test suite. The observation is the compiler output and which tests pass. Since each action is already a terminal solution, LATS skips the simulation phase entirely and uses the fraction of passed tests as the backpropagated reward.
A trajectory has reached a terminal state. If it failed, two things happen: value updates flow backward through the tree, and the LM generates a verbal autopsy of what went wrong.
For every node si on the path from root to the terminal node, LATS updates:
Where r is the terminal reward (e.g., 1.0 for correct answer, 0.0 for wrong, or the fraction of passed tests for programming). This is a running average: after many trajectories pass through a node, its value converges to the true expected reward from that point.
These updated values feed directly into the UCT formula for the next iteration's selection phase. A node on a path that led to success gets a higher V(s), making it more likely to be selected again. A node on a failed path gets a lower V(s), steering future exploration elsewhere.
After a failed trajectory, the LM is prompted with the full trajectory (all actions and observations) plus the final reward, and asked to reflect. The prompt is something like:
# Reflection prompt (simplified) You are an advanced reasoning agent reflecting on a failed attempt. Question: [original question] Trajectory: [full action/observation sequence] Reward: 0.0 (incorrect answer) Analyze what went wrong and suggest a better strategy. # LM generates: Reflection: The agent searched for "quantum entanglement" but the answer was about Bell's theorem, discussed in a different article. The agent should have first searched for "Bell test experiments" to find the specific physicist mentioned in the question. The key mistake was using too narrow a search query.
This reflection is stored in memory alongside the failed trajectory. In subsequent iterations, both are injected into the agent and value function prompts as additional context. This means the agent literally learns from its mistakes within the episode, without any gradient updates.
LATS repeats this six-operation cycle (selection → expansion → evaluation → simulation → backpropagation → reflection) for up to k trajectories. With each iteration, the tree gets more informed: values are more accurate, reflections provide richer guidance, and UCT steers toward increasingly promising branches. If any trajectory succeeds, LATS terminates immediately.
LATS is evaluated across four domains: programming, multi-hop QA, web navigation, and mathematical reasoning. The results are consistently strong.
| Method | Model | Pass@1 |
|---|---|---|
| CoT | GPT-3.5 | 46.9% |
| ReAct | GPT-3.5 | 56.9% |
| Reflexion | GPT-3.5 | 68.1% |
| RAP | GPT-3.5 | 63.1% |
| LATS | GPT-3.5 | 83.8% |
| Base LM | GPT-4 | 80.1% |
| Reflexion | GPT-4 | 91.0% |
| LATS | GPT-4 | 92.7% |
LATS with GPT-3.5 (83.8%) outperforms base GPT-4 (80.1%). With GPT-4, LATS achieves state-of-the-art 92.7% pass@1. The key: each "action" is a complete solution, the environment runs synthetic tests, and LATS uses test results to guide the search. This is search over solutions with real feedback, not just LM self-evaluation.
| Method | EM (Exact Match) |
|---|---|
| CoT | 0.34 |
| ReAct | 0.32 |
| Reflexion | 0.51 |
| ToT (ReAct) | 0.39 |
| RAP (ReAct) | 0.54 |
| LATS (ReAct) | 0.63 |
| LATS (CoT + ReAct) | 0.71 |
LATS doubles ReAct's performance. The CoT+ReAct variant is telling: it first tries internal reasoning (cheap), and only switches to tool-augmented search when internal reasoning fails. This mirrors how humans approach questions — think first, look things up only when you need to.
| Method | Score | Success Rate |
|---|---|---|
| ReAct | 53.8 | 28.0% |
| Reflexion | 64.2 | 35.0% |
| LATS | 75.9 | 38.0% |
| IL + RL (fine-tuned) | 62.4 | 28.7% |
| Fine-tuning (Furuta et al.) | 67.5 | 45.0% |
LATS with GPT-3.5 (no fine-tuning) scores 75.9, beating even fine-tuned models on average score. This is a gradient-free method competing with gradient-based training. The performance comes entirely from better search strategy, not better parameters.
| Method | Success Rate |
|---|---|
| CoT | 8% |
| Reflexion | 12% |
| ToT | 20% |
| RAP | 40% |
| LATS | 44% |
Even on purely internal reasoning tasks with no environment interaction, LATS outperforms specialized reasoning methods. The self-consistency component of the value function is what makes the difference here.
LATS sits at a unique intersection. Let's map exactly what it adds over each prior method.
| Method | Reasoning | Acting | Planning | Reflection | Ext. Memory |
|---|---|---|---|---|---|
| CoT | Yes | — | — | — | — |
| ReAct | Yes | Yes | — | — | — |
| ToT | Yes | — | Yes | Yes | Yes |
| RAP | Yes | — | Yes | — | Yes |
| Reflexion | Yes | Yes | — | Yes | Yes |
| LATS | Yes | Yes | Yes | Yes | Yes |
CoT generates a single reasoning chain autoregressively. No backtracking, no environment interaction. LATS adds: tree search over multiple reasoning paths, real environment feedback, and reflection. On HotPotQA, LATS (0.71) nearly doubles CoT (0.38).
ReAct adds environment interaction to CoT but still follows a single greedy trajectory. If the first search query returns unhelpful results, ReAct has to recover within the same trajectory. LATS adds: tree search (try different queries in parallel branches), principled selection (UCT), and backtracking. On HotPotQA, LATS (0.63) doubles ReAct (0.32).
ToT does tree search over reasoning paths with LM-based evaluation. But it has two blind spots: no environment interaction (the LM evaluates states internally, risking hallucinated assessments) and uses DFS/BFS instead of UCT (less principled exploration). LATS adds: environment grounding and UCT-guided search. The ablation confirms this: replacing MCTS with DFS drops HotPotQA from 0.63 to 0.42.
RAP also uses MCTS for LM reasoning, but it requires the LM to act as a world model — predicting what observations would result from actions, without actually executing them. This limits RAP to tasks where the LM can accurately predict outcomes. LATS executes actions in the real environment, getting actual observations. On HotPotQA with ReAct prompting, LATS (0.63) beats RAP (0.54).
Reflexion uses reflection and external memory but follows a single trajectory per trial. It refines the same approach iteratively rather than exploring fundamentally different approaches. LATS subsumes Reflexion's reflection mechanism and adds tree search. The combination is more powerful: on HumanEval, LATS (83.8%) beats Reflexion (68.1%) with GPT-3.5.
LATS sits at the intersection of classical AI planning, modern LM reasoning, and test-time compute scaling. Let's trace the threads.
LATS's deepest ancestor is AlphaGo (Silver et al., 2016), which used MCTS with learned value and policy networks to master Go. LATS replaces the trained networks with LM prompting — the "policy network" is the LM generating actions, the "value network" is the LM evaluating states. The key difference: AlphaGo required millions of self-play games to train its networks. LATS uses zero-shot in-context learning. This makes LATS instantly applicable to any new domain without training.
LATS is an early example of a broader trend: spending more compute at inference time (rather than training time) to improve performance. The idea that a weaker model with more search can outperform a stronger model with less search (LATS GPT-3.5 beating base GPT-4 on HumanEval) foreshadows systems like OpenAI's o1, which also invest heavily in test-time reasoning. LATS makes the trade-off explicit: you're exchanging more LM calls during inference for better task completion.
While LATS searches over action trajectories within a single task, ARCHON (Saad-Falcon et al., 2025) searches over system architectures across tasks. They're complementary: ARCHON could discover that LATS-style tree search is the optimal strategy for certain domains, configuring it automatically. In fact, LATS-like components (generation, evaluation, reflection) map directly onto ARCHON's module types (Generator, Verifier, Critic).
LATS was a catalyst for work on systematically evaluating LM agent frameworks. Meta-Harness (the agent evaluation framework) tests precisely the kind of claims LATS makes: that search improves over greedy methods, that reflection helps, that environment feedback is crucial. The ablation methodology in LATS — removing one component at a time — became a template for rigorous agent evaluation.
Computational cost: LATS uses O(k · n) LM calls per task (k trajectories, n branches per expansion). For k=50 and n=5, that's up to 250 calls. On complex tasks, this can mean 100,000+ tokens. The paper shows LATS expands fewer nodes on average than ToT or RAP (66.65 vs. 84.05 at k=50), but it's still expensive compared to single-trajectory methods.
Revertibility assumption: LATS assumes you can revert to any prior state. This works for text-based environments (just truncate the context) but fails for irreversible real-world actions (you can't un-send an email, un-purchase a product). This limits LATS to environments with built-in rollback or simulation.
Where it leads: The combination of tree search + LM evaluation + reflection has become a building block for increasingly capable agent systems. Future directions include multi-agent LATS, learned value functions that improve across tasks, and integration with more complex environment simulators. The core insight — that pretrained LMs already have the capabilities to do planning if you structure the search right — remains one of the most influential ideas in LM agent research.