Zhou, Yan, Shlapentokh-Rothman et al. — UIUC, 2024

Language Agent Tree Search Unifies Reasoning, Acting, and Planning

Monte Carlo Tree Search meets language model agents. LATS uses the LM as action generator, value function, AND self-reflector — achieving 92.7% pass@1 on HumanEval with GPT-4 and 75.9 on WebShop with GPT-3.5.

Prerequisites: Chain-of-thought prompting + ReAct basics + Basic tree search intuition. That's it.
10
Chapters
4+
Simulations

Chapter 0: The Problem

You give GPT-4 a coding task. It generates a solution. The solution fails a test case. What does GPT-4 do?

Nothing. It already committed to that path. The token sequence is emitted, the die is cast. Even with ReAct-style prompting, where the model can observe test results and try again, it only refines a single trajectory. It never considers: "What if I had taken a completely different approach back at step 3?"

This is the core limitation. Current LM agents are greedy. They make one choice at each step and never backtrack. Chain-of-thought (CoT) reasons in a single line. ReAct acts in a single line. Reflexion reflects on a single trajectory and tries again, but it still explores one path at a time. None of them plan.

The gap in existing methods: CoT and ReAct generate single trajectories autoregressively. Tree-of-Thought (ToT) and RAP explore multiple paths, but they rely entirely on the LM's internal knowledge — no real environment feedback. Reflexion uses feedback but only along one trajectory. No method combines tree search, real environment interaction, AND self-reflection.

Think about how a human programmer approaches a hard problem. You don't just write code top-to-bottom and hope it works. You consider multiple approaches, mentally simulate each one, test the most promising, backtrack when something fails, and use failure information to guide your next attempt. You're doing tree search — exploring multiple branches of a decision tree, using evaluation and backtracking to find the best path.

The game of Go was cracked by exactly this kind of planning. AlphaGo used Monte Carlo Tree Search to explore millions of possible move sequences, guided by a value network that estimated how good each board position was. What if we could do the same thing for language model agents?

That's LATS. Take MCTS — the algorithm behind AlphaGo — and adapt it for LM agents. Instead of Go moves, the agent generates reasoning steps and actions. Instead of a trained value network, the LM itself evaluates how promising each path looks. Instead of just internal simulation, the agent actually executes actions in the real environment and gets real feedback. And when a path fails, the LM reflects on why it failed, storing that wisdom for future attempts.

What is the fundamental limitation of ReAct-style LM agents that LATS addresses?

Chapter 1: The Key Insight

LATS's central idea is surprisingly elegant: a pretrained LM can serve triple duty as the agent (action generator), the value function (state evaluator), and the reflector (failure analyzer). This means you can run full MCTS using nothing but a single language model plus an environment.

Why this works for LMs but not for traditional RL

MCTS normally has a major requirement: you need a world model that can simulate forward from any state, and you need to be able to revert to any previous state. In chess or Go, this is easy — just copy the board. In Atari, you save and restore the emulator state. But for general decision-making, building a world model is hard.

LM tasks have a special property that makes this trivial. The "state" in an LM task is just a text sequence: the original prompt, the actions taken so far, and the observations received. To revert to any previous state, you literally copy-paste the earlier text context. No world model needed. No state restoration needed. You just trim the conversation and branch from there.

The key enabling property: In most LM tasks, you can revert to any previous state by setting the input to the historical text context. This makes tree search essentially free — branching is just prompt manipulation. This observation, combined with the LM's ability to generate actions, evaluate states, AND reflect on failures, is what makes LATS possible.

Three roles, one model

Role 1: Agent
The LM generates candidate actions (reasoning + acting) at each node
Role 2: Value Function
The LM evaluates how promising each state is (with environment feedback)
Role 3: Reflector
When a trajectory fails, the LM analyzes why and generates a verbal reflection

The value function deserves special attention. In standard MCTS (like AlphaGo), the value function is a trained neural network. In LATS, the LM evaluates a state by reasoning about it after receiving environment feedback. This is the key difference from Tree-of-Thought, which evaluates states using only the LM's internal knowledge. Environment feedback grounds the evaluation in reality, avoiding hallucinated assessments.

Self-reflection is the other differentiator. When a complete trajectory fails (e.g., the code doesn't pass tests, or the answer is wrong), the LM is prompted to analyze the failure and suggest improvements. These reflections are stored in memory and added to future prompts, giving the agent a "semantic gradient signal" — richer than a scalar reward, because it explains what went wrong in natural language.

Why can LATS run MCTS without training a separate world model?

Chapter 2: MCTS Background

Before diving into LATS, let's build solid intuition for Monte Carlo Tree Search. MCTS is the engine under the hood — understanding it is non-negotiable.

The decision tree

Imagine every decision your agent could make as a tree. The root is the starting state. Each edge is an action. Each node is the state you reach after taking that action. Leaves are terminal states where the task is done (success or failure). The goal: find the path from root to a successful leaf.

The problem: the tree is enormous. A coding task might have thousands of possible next actions at each step. You can't explore everything. You need a strategy that focuses on the most promising branches while still occasionally trying new things. This is the exploration-exploitation tradeoff.

The four phases of MCTS

MCTS handles this tradeoff through four phases, repeated for each episode:

1. Selection
Walk down the tree from root, at each node choosing the child with highest UCT score
2. Expansion
At the selected leaf, generate new child nodes by trying different actions
3. Simulation
From the new node, play out to a terminal state to get a reward signal
4. Backpropagation
Propagate the reward back up the path, updating every node's value

The UCT formula

The selection step uses UCT (Upper Confidence bounds applied to Trees) to balance exploration and exploitation:

UCT(s) = V(s) + w ⋅ √(ln N(p) / N(s))

Where:

The second term is the exploration bonus. A node that has been visited few times relative to its parent gets a high bonus — "we haven't tried this much, let's explore." A node visited many times gets a low bonus — "we know what this branch does." The beautiful thing about UCT is that it's provably optimal in the limit: given enough episodes, it will find the best path.

Intuition for the exploration term: Think of a restaurant you've visited 50 times (high N(s)) versus one you've been to once (low N(s)). Even if the first restaurant is great (high V(s)), the formula says: "Try the new place — you might discover something better." The ln in the numerator ensures that exploration grows slowly while exploitation stays proportional to visits.

Backpropagation updates

When a simulation reaches a terminal state with reward r, every node along the path updates:

V(s) = (Vold(s) ⋅ (N(s) − 1) + r) / N(s)

This is just a running average. Early on, a node's value is noisy. After many visits, it converges to the true expected reward from that state. The key: backpropagation is what makes the tree learn. Each simulation makes the tree's value estimates more accurate, which makes selection smarter, which makes the next simulation more informative.

In the UCT formula, what happens to a node's exploration bonus as it gets visited more times?

Chapter 3: LATS Architecture

Now we wire MCTS into a language model agent. This is the showcase chapter — the full LATS pipeline with all six operations, showing how the LM powers every component.

The six operations of LATS

LATS extends standard MCTS with two additional operations (evaluation and reflection), for a total of six performed in succession:

1. Selection
UCT-guided walk from root to best leaf
2. Expansion
LM generates n candidate actions; environment returns observations
3. Evaluation
LM scores each new node (with environment feedback) + self-consistency
4. Simulation
Expand the highest-value node until a terminal state
5. Backpropagation
Update values along the path from terminal node to root
6. Reflection
If failed: LM analyzes why. Reflection stored for future iterations

How the LM plays three roles

As agent (expansion): At each node, the LM is prompted with the current state (original input + action history + observations) and generates n candidate next actions. For ReAct-based tasks, actions include both reasoning traces ("I should search for X because...") and environment actions ("search[X]"). The diversity of n samples is crucial — it's what creates the branching factor of the tree.

As value function (evaluation): For each new child node, the LM is prompted to reason about the state and assign a scalar score. The key distinction from Tree-of-Thought: this evaluation happens after receiving environment feedback (the observation from executing the action). This grounds the score in reality. The full value function combines two signals:

V(s) = λ ⋅ LM(s) + (1 − λ) ⋅ SC(s)

Where LM(s) is the LM's self-generated score after reasoning about the state with environment context, and SC(s) is a self-consistency bonus — actions that are sampled multiple times at the same state are more likely correct, so they get a higher score. Lambda balances the two.

As reflector: When a trajectory reaches a terminal failure, the LM is prompted with the full trajectory and the final reward. It generates a verbal reflection: "The search failed because the initial query was too specific. A broader search term like X would have found the relevant passage." These reflections are stored in memory and injected into future prompts.

Why reflection is more than a scalar reward: A scalar reward says "this trajectory scored 0.3." A reflection says "you searched for 'quantum entanglement EPR paradox' but should have searched for 'Bell's theorem' because the answer was in a physics history article." The reflection gives the agent a semantic gradient — actionable information about what to change, not just how bad things went.

The tree in memory

Each node in the LATS tree stores: the original input x, the full action sequence a1...i, the observation sequence o1...i, the value V(s), and the visit count N(s). Failed trajectories and their reflections are stored separately. The entire structure lives as text in external memory — nothing is trained, nothing is gradient-updated. It's all in-context learning.

What does the self-consistency component SC(s) in LATS's value function capture?

Chapter 4: Selection & Expansion

Let's trace through the first two operations in detail with a concrete example.

Selection: walking the tree

Suppose we're on iteration 3 of a HotPotQA question. The tree already has several expanded nodes from previous iterations. Selection starts at the root (the original question) and walks down, at each level choosing the child with the highest UCT score.

The walk ends when it reaches a leaf node — a node that hasn't been expanded yet (or was expanded but has promising unexplored children). This is the node we'll expand next.

The exploration weight w is a hyperparameter. A higher w makes the agent try more novel branches; a lower w makes it focus on branches that have already shown good results. In practice, w is tuned per domain — web navigation benefits from higher exploration than programming, where correct solutions tend to be concentrated.

Expansion: the LM generates candidates

At the selected leaf, the LM generates n candidate actions. For ReAct-based tasks, the prompt looks like:

# Context provided to the LM:
Question: Which school was founded first, Whitgift or ## Charterhouse?
Thought 1: I need to search for Whitgift and Charterhouse founding dates.
Action 1: search[Whitgift School]
Observation 1: Whitgift School is a school in Croydon, founded in 1596...

# The LM now generates n=5 different next actions:
Candidate A: search[Charterhouse School]
Candidate B: search[Charterhouse School founding year]
Candidate C: search[Charterhouse history]
Candidate D: think[Whitgift was 1596. I need Charterhouse's date.]
Candidate E: search[oldest schools England Charterhouse]

Each candidate is sent to the environment. The environment returns an observation (the Wikipedia search result, the compiler output, the web page content). This creates n new child nodes in the tree, each with its own state: the original context + the new action + the new observation.

Why n candidates instead of 1: Sampling n actions from the LM creates the branching factor that makes tree search work. With n=1, LATS degenerates to a single trajectory (like ReAct). With n=5, each expansion creates 5 branches, and UCT can select the most promising one later. The paper finds n=5 is the sweet spot — enough diversity for effective search, manageable cost.

The action space

LATS inherits from ReAct: the action space is the union of environment actions A and reasoning traces Z. Actions like search[...] and click[...] affect the environment and produce observations. Thoughts like think[...] are internal reasoning that help the agent plan. Both are treated as "actions" in the tree, which means the tree captures not just what the agent did but also what it thought.

In LATS expansion, why does the LM generate n=5 candidate actions instead of just 1?

Chapter 5: Simulation & Evaluation

After expansion creates new child nodes, LATS evaluates them and then simulates the most promising one to completion.

Evaluation: LM scoring + self-consistency

Each new child node gets a value score. The LM is prompted with the node's full state (input + actions + observations so far) and asked to reason about progress and end with a numerical score. Crucially, this happens after the environment returns the observation — so the LM is evaluating reality, not hallucinating about what might happen.

For example, after the agent searches for "Charterhouse School" and gets a Wikipedia snippet, the LM might reason: "The observation confirms Charterhouse was founded in 1611. We already know Whitgift was 1596. We have both dates and can answer the question. Score: 0.9."

The self-consistency component adds a second signal. If 3 out of 5 candidate actions at a node are variations of "search[Charterhouse School]", that action gets a self-consistency bonus. The intuition: if the LM independently converges on the same action from multiple samples, that action is likely correct. The combined value:

V(s) = λ ⋅ LM(s) + (1 − λ) ⋅ SC(s)
Why evaluate after environment feedback, not before: Tree-of-Thought evaluates states using only the LM's internal knowledge. This means a state where the agent wrote buggy code could still get a high score if the code "looks reasonable" to the LM. LATS evaluates after the code is actually compiled and tested. A state with failing tests gets a low score regardless of how elegant the code looks. This is what makes LATS scale to harder environments.

Simulation: play out to the end

The simulation phase takes the highest-valued child node and expands it further until reaching a terminal state. At each depth level, the same expand-and-evaluate cycle runs, always prioritizing the child with the best value.

A terminal state is determined by the environment: the agent submits a final answer (HotPotQA), the code passes all tests (HumanEval), the agent completes a purchase (WebShop), or a depth limit is reached.

If the terminal state is a success, LATS stops immediately — it found a working trajectory. If it's a failure, LATS continues to backpropagation and reflection.

Programming tasks: a special case

For HumanEval and MBPP, each "action" is a complete solution (a full Python function). The environment compiles it and runs a synthetic test suite. The observation is the compiler output and which tests pass. Since each action is already a terminal solution, LATS skips the simulation phase entirely and uses the fraction of passed tests as the backpropagated reward.

How does LATS's evaluation differ from Tree-of-Thought's evaluation?

Chapter 6: Backpropagation & Reflection

A trajectory has reached a terminal state. If it failed, two things happen: value updates flow backward through the tree, and the LM generates a verbal autopsy of what went wrong.

Backpropagation: updating the tree

For every node si on the path from root to the terminal node, LATS updates:

N(si) = N(si) + 1
V(si) = (Vold(si) ⋅ (N(si) − 1) + r) / N(si)

Where r is the terminal reward (e.g., 1.0 for correct answer, 0.0 for wrong, or the fraction of passed tests for programming). This is a running average: after many trajectories pass through a node, its value converges to the true expected reward from that point.

These updated values feed directly into the UCT formula for the next iteration's selection phase. A node on a path that led to success gets a higher V(s), making it more likely to be selected again. A node on a failed path gets a lower V(s), steering future exploration elsewhere.

Reflection: the semantic gradient

After a failed trajectory, the LM is prompted with the full trajectory (all actions and observations) plus the final reward, and asked to reflect. The prompt is something like:

# Reflection prompt (simplified)
You are an advanced reasoning agent reflecting on a failed attempt.
Question: [original question]
Trajectory: [full action/observation sequence]
Reward: 0.0 (incorrect answer)

Analyze what went wrong and suggest a better strategy.

# LM generates:
Reflection: The agent searched for "quantum entanglement"
but the answer was about Bell's theorem, discussed in a
different article. The agent should have first searched for
"Bell test experiments" to find the specific physicist
mentioned in the question. The key mistake was using too
narrow a search query.

This reflection is stored in memory alongside the failed trajectory. In subsequent iterations, both are injected into the agent and value function prompts as additional context. This means the agent literally learns from its mistakes within the episode, without any gradient updates.

Reflection vs. scalar reward — a concrete comparison:
Scalar only: "This trajectory scored 0.0." The agent knows the path was bad, but not why.
With reflection: "The search query was too narrow. Next time, use broader terms." The agent knows what to change. The ablation study confirms this: removing reflection drops HotPotQA performance by 0.05 EM (from 0.63 to 0.58).

The full loop

LATS repeats this six-operation cycle (selection → expansion → evaluation → simulation → backpropagation → reflection) for up to k trajectories. With each iteration, the tree gets more informed: values are more accurate, reflections provide richer guidance, and UCT steers toward increasingly promising branches. If any trajectory succeeds, LATS terminates immediately.

Why is LM-generated reflection more useful than a scalar reward signal?

Chapter 7: Results

LATS is evaluated across four domains: programming, multi-hop QA, web navigation, and mathematical reasoning. The results are consistently strong.

HumanEval (programming)

MethodModelPass@1
CoTGPT-3.546.9%
ReActGPT-3.556.9%
ReflexionGPT-3.568.1%
RAPGPT-3.563.1%
LATSGPT-3.583.8%
Base LMGPT-480.1%
ReflexionGPT-491.0%
LATSGPT-492.7%

LATS with GPT-3.5 (83.8%) outperforms base GPT-4 (80.1%). With GPT-4, LATS achieves state-of-the-art 92.7% pass@1. The key: each "action" is a complete solution, the environment runs synthetic tests, and LATS uses test results to guide the search. This is search over solutions with real feedback, not just LM self-evaluation.

HotPotQA (multi-hop QA)

MethodEM (Exact Match)
CoT0.34
ReAct0.32
Reflexion0.51
ToT (ReAct)0.39
RAP (ReAct)0.54
LATS (ReAct)0.63
LATS (CoT + ReAct)0.71

LATS doubles ReAct's performance. The CoT+ReAct variant is telling: it first tries internal reasoning (cheap), and only switches to tool-augmented search when internal reasoning fails. This mirrors how humans approach questions — think first, look things up only when you need to.

WebShop (web navigation)

MethodScoreSuccess Rate
ReAct53.828.0%
Reflexion64.235.0%
LATS75.938.0%
IL + RL (fine-tuned)62.428.7%
Fine-tuning (Furuta et al.)67.545.0%

LATS with GPT-3.5 (no fine-tuning) scores 75.9, beating even fine-tuned models on average score. This is a gradient-free method competing with gradient-based training. The performance comes entirely from better search strategy, not better parameters.

Game of 24 (mathematical reasoning)

MethodSuccess Rate
CoT8%
Reflexion12%
ToT20%
RAP40%
LATS44%

Even on purely internal reasoning tasks with no environment interaction, LATS outperforms specialized reasoning methods. The self-consistency component of the value function is what makes the difference here.

What is the most striking result from LATS's evaluation?

Chapter 8: Comparison

LATS sits at a unique intersection. Let's map exactly what it adds over each prior method.

The capability matrix

MethodReasoningActingPlanningReflectionExt. Memory
CoTYes
ReActYesYes
ToTYesYesYesYes
RAPYesYesYes
ReflexionYesYesYesYes
LATSYesYesYesYesYes

vs. Chain-of-Thought (CoT)

CoT generates a single reasoning chain autoregressively. No backtracking, no environment interaction. LATS adds: tree search over multiple reasoning paths, real environment feedback, and reflection. On HotPotQA, LATS (0.71) nearly doubles CoT (0.38).

vs. ReAct

ReAct adds environment interaction to CoT but still follows a single greedy trajectory. If the first search query returns unhelpful results, ReAct has to recover within the same trajectory. LATS adds: tree search (try different queries in parallel branches), principled selection (UCT), and backtracking. On HotPotQA, LATS (0.63) doubles ReAct (0.32).

vs. Tree-of-Thought (ToT)

ToT does tree search over reasoning paths with LM-based evaluation. But it has two blind spots: no environment interaction (the LM evaluates states internally, risking hallucinated assessments) and uses DFS/BFS instead of UCT (less principled exploration). LATS adds: environment grounding and UCT-guided search. The ablation confirms this: replacing MCTS with DFS drops HotPotQA from 0.63 to 0.42.

vs. RAP (Reasoning via Planning)

RAP also uses MCTS for LM reasoning, but it requires the LM to act as a world model — predicting what observations would result from actions, without actually executing them. This limits RAP to tasks where the LM can accurately predict outcomes. LATS executes actions in the real environment, getting actual observations. On HotPotQA with ReAct prompting, LATS (0.63) beats RAP (0.54).

vs. Reflexion

Reflexion uses reflection and external memory but follows a single trajectory per trial. It refines the same approach iteratively rather than exploring fundamentally different approaches. LATS subsumes Reflexion's reflection mechanism and adds tree search. The combination is more powerful: on HumanEval, LATS (83.8%) beats Reflexion (68.1%) with GPT-3.5.

The ablation story: Every component matters. Remove the LM heuristic: −0.26 EM. Switch MCTS to DFS: −0.21 EM. Remove reflection: −0.05 EM. Remove external feedback by using ToT/RAP with ReAct: barely above baseline. LATS's performance comes from the synergy of all components, not any single one.
What does LATS add over RAP (Reasoning via Planning)?

Chapter 9: Connections

LATS sits at the intersection of classical AI planning, modern LM reasoning, and test-time compute scaling. Let's trace the threads.

AlphaGo and MCTS

LATS's deepest ancestor is AlphaGo (Silver et al., 2016), which used MCTS with learned value and policy networks to master Go. LATS replaces the trained networks with LM prompting — the "policy network" is the LM generating actions, the "value network" is the LM evaluating states. The key difference: AlphaGo required millions of self-play games to train its networks. LATS uses zero-shot in-context learning. This makes LATS instantly applicable to any new domain without training.

Scaling test-time compute

LATS is an early example of a broader trend: spending more compute at inference time (rather than training time) to improve performance. The idea that a weaker model with more search can outperform a stronger model with less search (LATS GPT-3.5 beating base GPT-4 on HumanEval) foreshadows systems like OpenAI's o1, which also invest heavily in test-time reasoning. LATS makes the trade-off explicit: you're exchanging more LM calls during inference for better task completion.

ARCHON and meta-search

While LATS searches over action trajectories within a single task, ARCHON (Saad-Falcon et al., 2025) searches over system architectures across tasks. They're complementary: ARCHON could discover that LATS-style tree search is the optimal strategy for certain domains, configuring it automatically. In fact, LATS-like components (generation, evaluation, reflection) map directly onto ARCHON's module types (Generator, Verifier, Critic).

Meta-Harness and agent evaluation

LATS was a catalyst for work on systematically evaluating LM agent frameworks. Meta-Harness (the agent evaluation framework) tests precisely the kind of claims LATS makes: that search improves over greedy methods, that reflection helps, that environment feedback is crucial. The ablation methodology in LATS — removing one component at a time — became a template for rigorous agent evaluation.

Limitations to keep in mind

Computational cost: LATS uses O(k · n) LM calls per task (k trajectories, n branches per expansion). For k=50 and n=5, that's up to 250 calls. On complex tasks, this can mean 100,000+ tokens. The paper shows LATS expands fewer nodes on average than ToT or RAP (66.65 vs. 84.05 at k=50), but it's still expensive compared to single-trajectory methods.

Revertibility assumption: LATS assumes you can revert to any prior state. This works for text-based environments (just truncate the context) but fails for irreversible real-world actions (you can't un-send an email, un-purchase a product). This limits LATS to environments with built-in rollback or simulation.

Where it leads: The combination of tree search + LM evaluation + reflection has become a building block for increasingly capable agent systems. Future directions include multi-agent LATS, learned value functions that improve across tasks, and integration with more complex environment simulators. The core insight — that pretrained LMs already have the capabilities to do planning if you structure the search right — remains one of the most influential ideas in LM agent research.

How does LATS relate to the broader trend of scaling test-time compute?