ADAPT — Veanors

Chapter 0: The Problem

A robot in an unfamiliar kitchen is told: "Put a clean mug on the desk." An iterative executor like ReAct starts exploring: go to countertop 1, go to cabinet 12, open cabinet 12... After dozens of steps, it has lost track of its original goal amid a mountain of action-observation pairs. It gives up. Task failed.

Alright, try the other approach. A plan-and-execute system decomposes the task upfront: Step 1: Find and take a mug. Step 2: Clean the mug. Step 3: Put the clean mug on the desk. Looks reasonable. But Step 1 — "find and take a mug" — turns out to be hard. The mug is hidden in an obscure cabinet. The executor fails at Step 1, and the entire plan collapses. No recovery mechanism. No fallback.

The core tension: Iterative executors (like ReAct) can adapt to the environment step-by-step but drown in long trajectories. Plan-and-execute methods keep the big picture but are brittle — one failed sub-task kills the whole plan. Neither approach handles the fundamental reality that some sub-tasks are hard and some are easy, and you can’t always tell which is which in advance.

The problem gets worse as tasks become more complex. Consider crafting a beehive in Minecraft. You need planks, which need logs. You need honeycomb, which needs... more sub-components. The compositional depth can be 2, 3, or 4 levels. A flat plan can’t anticipate all of this. And an iterative executor can’t compose multi-step sub-tasks reliably without losing track.

What we need is an approach that combines the best of both worlds: the adaptability of iterative execution with the structure of hierarchical planning. And critically, it should only invest effort in planning when planning is actually needed.

Why does a plan-and-execute approach fail when one sub-task turns out to be unexpectedly complex?

The plan is generated once upfront and has no mechanism to further decompose or adapt when a sub-task fails, so one failure cascades to task failure
Plan-and-execute systems can’t use LLMs
The planner always generates too many steps

Chapter 1: The Key Insight

ADAPT’s insight is almost embarrassingly simple: try first, decompose only when stuck.

Given a task, hand it to the executor LLM. Let it attempt the entire thing. If it succeeds — great, you’re done. No planning overhead, no unnecessary decomposition. If it fails — now call the planner to break the task into sub-tasks. Then try each sub-task. If a sub-task fails, decompose that one too. Recursively.

Why this works so well: Not all tasks need decomposition. A strong LLM might handle "clean the mug" in one shot — it’s an atomic skill. But "find and take a mug from somewhere in the house" might be beyond its single-trajectory capability. By attempting first and decomposing only on failure, ADAPT automatically calibrates to both the task’s actual difficulty and the LLM’s actual capability. Strong models decompose less. Weak models decompose more. Complex tasks get more levels. Simple tasks get none.

This is the opposite of how most planning systems work. Traditional plan-and-execute assumes you need a plan for everything. ADAPT assumes you need a plan for nothing — until proven otherwise. It’s lazy decomposition, and that laziness is a feature.

The recursive structure

The recursion is the second key idea. Plan-and-execute decomposes once. If a sub-task is still too complex after one round of decomposition, tough luck. ADAPT decomposes again, and again, up to a maximum depth d_max. Each level of recursion produces shorter, simpler sub-tasks until the executor can handle them.

ADAPT(task, k) = try executor(task); if fail and k < d_max: plan(task) → [sub₁, sub₂, ...] → ADAPT(sub_i, k+1)

Here k is the current recursion depth, starting at 1 for the original task, and d_max is the maximum allowed depth (typically 3 or 4).
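The recursion can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the paper's implementation: a task is reduced to an integer "size", the hypothetical executor from `make_executor` succeeds only when that size fits its capability, the hypothetical `split_planner` halves the task, and all sub-tasks are treated as required. The point is only to show that the same `adapt` function calls the planner less often when the executor is stronger.

```python
from typing import Callable, List

Task = int  # toy stand-in: a task is just its "size"


def adapt(task: Task, k: int, d_max: int,
          executor: Callable[[Task], bool],
          planner: Callable[[Task], List[Task]],
          stats: dict) -> bool:
    """Try the executor first; call the planner only on failure and while k < d_max."""
    stats["executor_calls"] += 1
    if executor(task):
        return True            # solved directly, no planning overhead
    if k >= d_max:
        return False           # depth budget exhausted
    stats["planner_calls"] += 1
    subtasks = planner(task)
    # Toy assumption: every sub-task must succeed; all() stops at the first failure.
    return all(adapt(sub, k + 1, d_max, executor, planner, stats) for sub in subtasks)


def make_executor(capability: int) -> Callable[[Task], bool]:
    """Hypothetical executor: succeeds only on tasks no larger than its capability."""
    return lambda task: task <= capability


def split_planner(task: Task) -> List[Task]:
    """Hypothetical planner: split a task into two roughly equal halves."""
    return [task // 2, task - task // 2]


def run(capability: int, task: Task = 8, d_max: int = 4) -> dict:
    stats = {"executor_calls": 0, "planner_calls": 0, "success": False}
    stats["success"] = adapt(task, 1, d_max, make_executor(capability),
                             split_planner, stats)
    return stats


if __name__ == "__main__":
    for capability in (8, 4, 2):
        print(capability, run(capability))
```

In this toy setting an executor with capability 8 never triggers the planner, capability 4 triggers it once, and capability 2 triggers it three times, all on the same size-8 task.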

What makes ADAPT "as-needed" compared to standard plan-and-execute?

It only invokes the planner when the executor fails at a sub-task, rather than generating a complete plan upfront for every task
It uses a smaller LLM
It runs faster because it skips planning entirely

Chapter 2: The ADAPT Algorithm

ADAPT has three modules: an executor, a planner, and a controller that orchestrates them. Let’s walk through each one.

The executor

The executor is a ReAct-style LLM agent. Given a task description, it iteratively generates thoughts and actions, receives observations from the environment, and continues until it either succeeds or hits a maximum iteration limit. It’s equipped with atomic skills specific to the environment — things like "put object on receptacle" or "clean object with sinkbasin" in ALFWorld.

Critically, the executor is also responsible for self-assessing success. Its prompt instructs it to output "task completed" when it believes it has succeeded, or "task failed" when it cannot proceed further. This self-generated heuristic drives the as-needed decomposition.

The planner

When the executor fails, the planner takes over. It receives the failed task and generates a short plan of 3–5 sub-tasks. The plans are intentionally kept short and abstract, because detailed upfront plans in unexplored environments lead to cascading errors. The sub-tasks in a plan are joined by one of two logical operators:

AND — sub-tasks must be executed sequentially. All must succeed. Example: "find mug AND clean mug AND place mug on desk."
OR — sub-tasks are alternatives. Any one success suffices. Example: "find mug on countertop OR find mug in cabinet OR find mug on shelf."
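One way to picture the two operators is as lazy evaluation over sub-task attempts. The sketch below is illustrative only: `Plan`, `run_plan`, and the toy `attempt` function are hypothetical names, and the "world" where the mug sits in a cabinet is an assumption made for the example. Feeding a generator to Python's built-in `all` and `any` gives the short-circuit behaviour: AND stops at the first failure, OR stops at the first success.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    """Hypothetical container for a planner's output."""
    logic: str              # "AND" or "OR"
    subtasks: List[str]


def run_plan(plan: Plan, attempt: Callable[[str], bool]) -> bool:
    """Attempt sub-tasks in order and combine results with the plan's operator."""
    results = (attempt(sub) for sub in plan.subtasks)  # lazy: nothing runs yet
    if plan.logic == "AND":
        return all(results)   # stops at the first failed sub-task
    if plan.logic == "OR":
        return any(results)   # stops at the first successful sub-task
    raise ValueError(f"unknown logic operator: {plan.logic!r}")


# Toy world (assumption): only the cabinet search can succeed.
attempted: List[str] = []


def attempt(subtask: str) -> bool:
    attempted.append(subtask)
    return subtask == "find mug in cabinet"


search = Plan("OR", ["find mug on countertop",
                     "find mug in cabinet",
                     "find mug on shelf"])
found = run_plan(search, attempt)
print(found, attempted)
```

Here the shelf is never searched, because the OR plan is already satisfied once the cabinet search succeeds.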

The controller

The controller is a deterministic recursive algorithm — no LLM involved. It orchestrates executor and planner calls according to Algorithm 1:

Input: task description, current depth k, max depth d_max.
Execute: run executor(task). Did it succeed?
Success?: if yes → return True. If no and k ≥ d_max → return False.
Decompose: call planner(task) → [sub₁, ..., sub_n] with logic (AND/OR).
Recurse: for each sub_i, call ADAPT(sub_i, k+1). Combine with AND/OR logic.
Return: AND: all must succeed. OR: any success stops iteration.

The planner generates short plans on purpose. Expecting a 10-step plan to "put a clean mug on a desk" without knowing where the mug is would produce cascading errors from incorrect assumptions. Short, abstract plans delegate the hard decisions to deeper recursive calls where the executor has more context.

ADAPT Algorithm Flow

[Interactive figure: the controller orchestrating executor and planner calls, step by step, on the task "Put a clean mug on desk."]

In ADAPT, what happens when the executor fails and the current depth k equals d_max?

The controller returns failure: no further decomposition is attempted because the maximum recursion depth has been reached
The planner is called anyway to try once more
The system switches to a different LLM

Chapter 3: Recursive Decomposition

This is the showcase. Let’s trace ADAPT on the ALFWorld task "Put a clean mug on desk" and watch the recursion tree unfold as-needed.

The executor receives the full task. It tries to explore the house, find a mug, clean it, and place it — all in one trajectory. After 15 actions, it’s lost. "Task failed." The controller triggers decomposition.

The planner breaks it into three AND-linked sub-tasks: (1) Find and take a mug, (2) Clean the mug, (3) Put the clean mug on the desk.

ADAPT recurses on sub-task 1: "Find and take a mug." The executor tries. It checks a few locations, doesn’t find one. "Task failed." Another decomposition. The planner generates OR-linked alternatives: (1a) Find mug on countertops, OR (1b) Find mug in cabinets, OR (1c) Find mug on shelves.

The executor tries 1a. No mug on countertops. Fails. Tries 1b. Opens cabinet 3 — there’s the mug! Takes it. Success. The OR short-circuits — 1c is never attempted.

Back up one level. Sub-task 2: "Clean the mug." The executor already knows how to do this (it’s an atomic skill). Goes to the sinkbasin, cleans the mug. Success on the first try — no decomposition needed.

Sub-task 3: "Put the clean mug on the desk." Another atomic skill. Success. Task complete.

The as-needed principle in action: Sub-tasks 2 and 3 were never decomposed because the executor handled them directly. Only sub-task 1 was hard enough to need further breakdown. And within sub-task 1, the OR logic meant we stopped searching as soon as we found the mug. This is vastly more efficient than pre-planning all possible locations or retrying the entire task from scratch.
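The walk-through above can be replayed with scripted stand-ins for the two LLM modules. Everything in this sketch is an assumption made for illustration: `EXECUTOR_OUTCOME` hard-codes whether a single executor attempt at each task succeeds, `PLANS` hard-codes what the planner would return, and d_max is set to 3. Only the control flow is meant to mirror the trace.

```python
from typing import Dict, List, Tuple

# Scripted executor results (assumption): True means one attempt succeeds.
EXECUTOR_OUTCOME: Dict[str, bool] = {
    "put a clean mug on desk": False,
    "find and take a mug": False,
    "find mug on countertops": False,
    "find mug in cabinets": True,
    "find mug on shelves": False,
    "clean the mug": True,
    "put the clean mug on the desk": True,
}

# Scripted planner output (assumption): task -> (logic, sub-tasks).
PLANS: Dict[str, Tuple[str, List[str]]] = {
    "put a clean mug on desk": ("AND", ["find and take a mug",
                                        "clean the mug",
                                        "put the clean mug on the desk"]),
    "find and take a mug": ("OR", ["find mug on countertops",
                                   "find mug in cabinets",
                                   "find mug on shelves"]),
}

executed: List[str] = []     # tasks handed to the executor, in order
decomposed: List[str] = []   # tasks handed to the planner, in order


def adapt(task: str, k: int, d_max: int) -> bool:
    executed.append(task)
    ok = EXECUTOR_OUTCOME[task]
    print(f"{'  ' * (k - 1)}depth {k}: {task} -> {'ok' if ok else 'failed'}")
    if ok:
        return True
    if k >= d_max or task not in PLANS:
        return False
    decomposed.append(task)
    logic, subtasks = PLANS[task]
    results = (adapt(sub, k + 1, d_max) for sub in subtasks)  # lazy evaluation
    return all(results) if logic == "AND" else any(results)


success = adapt("put a clean mug on desk", k=1, d_max=3)
print("success:", success)
```

Running it shows six executor attempts and two planner calls. The shelf search is never attempted, and the cleaning and placing sub-tasks are never decomposed.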

ADAPT Recursion Tree (Interactive)

[Interactive figure: the recursion tree for this task. Red nodes trigger decomposition; green nodes succeed directly. Only the hard sub-tasks get decomposed.]

In the ALFWorld example above, why were sub-tasks 2 and 3 never decomposed?

The executor succeeded at them directly: cleaning a carried mug and placing it are atomic skills the LLM can perform without further planning
The planner decided they were too simple
The maximum depth was already reached

Chapter 4: Detecting Failure

The entire ADAPT algorithm hinges on one critical question: how does the system know when the executor has failed? If failure detection is unreliable, the whole recursive structure falls apart. A failure reported for a task that actually succeeded would trigger unnecessary decomposition, and a success reported for a task that actually failed would let the real failure propagate.

Self-generated success heuristic

ADAPT takes a surprisingly simple approach. The executor prompt includes an instruction: if you believe the task is done, say "task completed." If you’re stuck and can’t proceed, say "task failed." The LLM itself judges its own success.

This is not as crazy as it sounds. In interactive environments, the LLM receives textual feedback after every action. If it tries to pick up a mug and the environment says "Nothing happens" or it has checked 10 locations without finding anything, the LLM can reasonably conclude it has failed. The signal is in the trajectory.

Why self-assessment works here: Interactive decision-making environments provide rich textual feedback. The LLM doesn’t need to assess the quality of abstract reasoning (where self-evaluation is unreliable). It needs to assess whether concrete actions in an environment succeeded — "Did I find the mug? Is the mug clean? Did the environment confirm my action?" This is much easier for LLMs to judge correctly.

How reliable is it?

The paper validates the self-generated heuristic against the gold environment rewards. For ALFWorld, the LLM’s success assessment closely matches the true task completion signal. Specifically, there are very few false positives (claiming success when the task isn’t done) and a moderate rate of false negatives (claiming failure when the task is actually complete, which is a conservative and safe error mode).

Failure triggers

There are two ways the executor signals failure:

Explicit declaration: The LLM outputs "task failed" when it determines it cannot proceed.
Iteration limit: If the executor reaches the maximum number of iterations without declaring success, the controller treats it as a failure.

Both signals hand control back to the controller, which then decides whether to invoke the planner (if depth budget remains) or return failure to the parent call.
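A minimal sketch of an executor loop with both failure triggers might look like this. The interfaces are assumptions for illustration: `policy` stands in for the LLM (it maps the action-observation history to the next output), `env_step` stands in for the environment, and the default `max_iters` value is arbitrary rather than taken from the paper.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]          # (action, observation) pairs
Policy = Callable[[History], str]        # hypothetical stand-in for the LLM
EnvStep = Callable[[str], str]           # hypothetical stand-in for the environment


def run_executor(policy: Policy, env_step: EnvStep,
                 max_iters: int = 20) -> Tuple[bool, str]:
    """Return (success, reason); failure is explicit or an exhausted budget."""
    history: History = []
    for _ in range(max_iters):
        output = policy(history)
        text = output.strip().lower()
        if text == "task completed":
            return True, "declared success"
        if text == "task failed":
            return False, "explicit declaration"
        history.append((output, env_step(output)))  # act, then record feedback
    return False, "iteration limit"


def nothing_happens(action: str) -> str:
    return "Nothing happens."


def gives_up(history: History) -> str:
    # Declares failure after three fruitless actions.
    return "task failed" if len(history) >= 3 else f"go to cabinet {len(history) + 1}"


def never_stops(history: History) -> str:
    return "look"


def succeeds(history: History) -> str:
    return "task completed" if history else "take mug 1 from cabinet 3"


print(run_executor(gives_up, nothing_happens))
print(run_executor(never_stops, nothing_happens, max_iters=5))
print(run_executor(succeeds, nothing_happens))
```

In both failure cases the caller receives the same `False` result, so a controller built on top of this loop does not need to know which trigger fired.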

Why is the self-generated success heuristic more reliable in interactive environments than in abstract reasoning tasks?

Interactive environments provide concrete textual feedback after each action, so the LLM judges observable outcomes rather than its own abstract reasoning quality
The LLM is fine-tuned specifically for these environments
Interactive environments are simpler than reasoning tasks

Chapter 5: The TextCraft Benchmark

The authors introduce TextCraft, a text-based environment inspired by Minecraft crafting recipes. It’s specifically designed to test compositional task decomposition — exactly what ADAPT is built for.

How TextCraft works

The agent must craft a target item by following crafting recipes. Each recipe requires ingredients, and those ingredients may themselves require crafting from other ingredients. This creates a natural recipe tree with varying depths.

For example, crafting a beehive requires 6 planks and 3 honeycomb. Planks require logs. So the recipe tree has depth 2. More complex items have depth 3 or 4, requiring sub-sub-ingredients that must be crafted in the right order.

Why this is the perfect test

TextCraft has several properties that make it ideal for evaluating ADAPT:

Natural compositional structure: Tasks are inherently decomposable into sub-tasks of varying difficulty.
Varying complexity: Recipe depth 2 is manageable; depth 4 requires multiple levels of planning.
Linguistic knowledge needed: Recipes use categories (e.g., "planks") but the agent must select a specific item (e.g., "oak planks"). The LLM’s world knowledge helps.
Clear ground truth: You either crafted the item or you didn’t.

The TextCraft recipe tree for a beehive: the root is "craft beehive" (needs planks + honeycomb). One branch leads to "craft planks" (needs logs). Another to "get honeycomb." Each leaf is either a craftable or fetchable item. ADAPT’s recursion tree naturally mirrors this recipe tree.

The three actions

The agent has three types of actions: craft (combine ingredients into an item), fetch (get a raw material from the environment), and inventory (check what you have). Simple, but the compositional depth makes tasks challenging.
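The recipe-tree idea can be made concrete with a small data structure. This sketch is not the TextCraft environment itself: `RECIPES`, `recipe_depth`, and `crafting_order` are hypothetical names, the beehive ingredient counts come from the example above, and the number of logs per plank recipe is a placeholder assumption. Quantities are ignored when computing depth and order.

```python
from typing import Dict, List

# Hypothetical recipe table: item -> {ingredient: count}.
# Items without an entry are raw materials that must be fetched.
RECIPES: Dict[str, Dict[str, int]] = {
    "beehive": {"planks": 6, "honeycomb": 3},   # counts from the text
    "planks": {"logs": 1},                      # ratio is an assumption
}


def recipe_depth(item: str) -> int:
    """Depth of the recipe tree: 0 for raw materials, else 1 + deepest ingredient."""
    if item not in RECIPES:
        return 0
    return 1 + max(recipe_depth(ingredient) for ingredient in RECIPES[item])


def crafting_order(item: str) -> List[str]:
    """Post-order walk: each ingredient is obtained before the item that needs it."""
    steps: List[str] = []
    for ingredient in RECIPES.get(item, {}):
        steps.extend(crafting_order(ingredient))
    action = "craft" if item in RECIPES else "fetch"
    steps.append(f"{action} {item}")
    return steps


print(recipe_depth("beehive"))
print(crafting_order("beehive"))
```

Under these assumptions the beehive tree has depth 2, and the post-order walk yields a valid sequence of fetch and craft actions.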

TextCraft Recipe Tree

[Interactive figure: recipe trees of different depths. Each node is a crafting step; leaf nodes are raw materials.]

Why does TextCraft naturally expose the need for recursive decomposition?

Its recipe trees have varying depths: complex items require crafting sub-components that themselves require crafting, creating a natural hierarchy that mirrors recursive decomposition
It uses Minecraft’s actual game engine
It requires visual understanding

Chapter 6: Results

ADAPT with GPT-3.5 substantially outperforms all baselines across three diverse benchmarks. The numbers are striking.

ALFWorld

ADAPT achieves 71.6% overall success rate, compared to ReAct’s 43.3% (+28.3 points). Even more impressive: on the hardest task type ("pick2," which requires composing two pick-style tasks with a long action history), ADAPT scores 52.9% while all baselines score under 12%. That’s a 4x improvement on the hardest tasks.

WebShop

ADAPT achieves 44% success rate. ReAct: 32%. Plan-and-Execute: 17%. Reflexion: 35%. LATS: 38%. ADAPT outperforms the strongest baseline (LATS) by 6 points, without requiring the expensive tree search that LATS uses.

TextCraft

ADAPT achieves 52%. ReAct: 19%. Plan-and-Execute: 27%. Reflexion: 32%. A +33 point improvement over ReAct and +20 over Reflexion. The compositional nature of TextCraft is where ADAPT’s recursive decomposition truly shines.

The headline numbers: +28.3 points on ALFWorld, +12 on WebShop, and +33 on TextCraft over ReAct (the +27 gap on WebShop is over Plan-and-Execute). Against Reflexion (which gets multiple trials with memory): +14.1, +9, and +20 points. ADAPT outperforms every baseline on every benchmark, often by large margins.
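The gaps can be recomputed directly from the success rates quoted in the per-benchmark sections. The snippet below is plain arithmetic over those figures; the `RESULTS` table and `gains` helper are names introduced here, and the ALFWorld entry includes only the baseline score stated above.

```python
# Success rates (%) as quoted in the ALFWorld, WebShop, and TextCraft sections.
RESULTS = {
    "ALFWorld": {"ADAPT": 71.6, "ReAct": 43.3},
    "WebShop": {"ADAPT": 44.0, "ReAct": 32.0, "Plan-and-Execute": 17.0,
                "Reflexion": 35.0, "LATS": 38.0},
    "TextCraft": {"ADAPT": 52.0, "ReAct": 19.0, "Plan-and-Execute": 27.0,
                  "Reflexion": 32.0},
}


def gains(benchmark: str) -> dict:
    """Absolute gain of ADAPT over each baseline, in percentage points."""
    scores = RESULTS[benchmark]
    return {name: round(scores["ADAPT"] - score, 1)
            for name, score in scores.items() if name != "ADAPT"}


for benchmark in RESULTS:
    print(benchmark, gains(benchmark))
```

On WebShop this gives 12 points over ReAct, 9 over Reflexion, 6 over LATS, and 27 over Plan-and-Execute.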

Results Across Benchmarks

[Figure: success rates (%) with GPT-3.5, with ADAPT in a warm color and the baselines in grey.]

On which benchmark does ADAPT show the largest absolute improvement over ReAct?

TextCraft (+33 points), because its compositional recipe structure perfectly matches ADAPT’s recursive decomposition
ALFWorld (+28.3 points)
WebShop (+12 points)

Chapter 7: Adaptation Analysis

The most revealing analysis in the paper isn’t about raw performance — it’s about how ADAPT dynamically adjusts its behavior based on the executor’s capability and the task’s complexity.

Scaling with depth

Performance increases with d_max across all datasets. Moving from d_max=1 (executor only, equivalent to ReAct) to d_max=2 (one level of decomposition) gives the biggest jump. Moving from d_max=2 to d_max=3 gives a further boost, validating that some sub-tasks genuinely need multi-level decomposition.

Adapting to executor capability

The paper tests three executor settings on ALFWorld: (1) task-specific gold trajectories in the prompt (strong), (2) a hybrid with some gold trajectories (medium), and (3) only atomic skills (weak). ADAPT improves all three — but the weak executor benefits the most dramatically, jumping from 3.3% to 41.7%. The stronger the executor, the less decomposition happens. ADAPT adapts.

Adapting to task complexity

On TextCraft, they measure the maximum decomposition depth actually used (k_max) against the recipe tree depth (task complexity). The results are elegant: for depth-2 recipes, k_max averages 1.9. For depth-3 recipes, k_max averages 2.8. ADAPT automatically invests more decomposition effort in harder tasks.

This is the key result: ADAPT isn’t just a better algorithm — it’s an adaptive algorithm. It doesn’t have a fixed decomposition strategy. Its behavior emerges from the interaction between the executor’s capability, the task’s actual difficulty, and the environment’s feedback. The same algorithm, same code, same prompts — but the recursion tree shapes itself differently for every task.

Different LLMs

ADAPT improves GPT-3.5, GPT-4, LLaMA-2 70B, and Lemur 70B across all benchmarks. GPT-4 (the strongest) gets up to +37% improvement. LLaMA (the weakest) gets up to +15%. You can even mix LLMs: use a strong model (GPT-3.5) as the planner and a weak model (LLaMA) as the executor. The planner is called sparingly, so costs stay low.

Decomposition Depth vs Task Complexity

[Interactive figure: on TextCraft, ADAPT’s actual decomposition depth (k_max) tracks recipe complexity. Two views: LLM capability and task complexity.]

When ADAPT is given a maximum depth budget of d_max=4, what determines how much of that budget it actually uses?

The interaction between task complexity and executor capability: harder tasks and weaker executors cause more decomposition, while simpler tasks or stronger executors use less depth
It always uses the full budget
A random number generator

Chapter 8: Comparison

Let’s place ADAPT in the landscape of LLM agent methods and understand what each approach does differently.

vs ReAct

ReAct is an iterative executor — it generates thoughts and actions one step at a time. No explicit planning. It must keep the entire task context implicitly in the action-observation trajectory. As tasks get complex, the trajectory grows long, distractors accumulate, and the LLM loses track of the goal. ADAPT uses ReAct as its executor but wraps it in recursive structure, keeping each executor call focused on a manageable sub-task.

vs Plan-and-Execute

Plan-and-Execute generates a complete plan upfront, then hands each step to the executor. No adaptation. If step 3 of 5 fails, the whole thing fails. ADAPT differs in two ways: it plans only when needed (not upfront) and it plans recursively (failed sub-tasks get decomposed further). ADAPT with d_max=2 differs from Plan-and-Execute because it only decomposes tasks the executor actually fails at.

vs Reflexion

Reflexion addresses failure differently: after a full task attempt fails, it reflects on what went wrong and retries the entire task with that feedback in memory. This wastes effort re-executing sub-tasks that already succeeded. ADAPT localizes failure: only the failing sub-task gets decomposed further, while successful sub-tasks are never repeated.

vs AdaPlanner / LATS

AdaPlanner refines plans based on environment feedback but doesn’t recursively decompose. LATS (Language Agent Tree Search) uses Monte Carlo tree search to explore multiple trajectories, which is powerful but computationally expensive (requires environment rollback). ADAPT achieves comparable or better results with simpler recursive decomposition and no search.

The conceptual distinction: ReAct adapts at the action level. Reflexion adapts at the trial level. LATS adapts at the trajectory level. ADAPT adapts at the sub-task level — the most natural granularity for compositional tasks. It decomposes exactly where difficulty lies, no more, no less.

Method Comparison

How each method responds to a failed sub-task within a complex task.

What is the key advantage of ADAPT over Reflexion when handling a complex task where only one sub-task is difficult?

ADAPT decomposes only the failing sub-task, while Reflexion retries the entire task from scratch, wasting effort re-executing sub-tasks that already succeeded
ADAPT uses a stronger LLM
Reflexion cannot use GPT-3.5

Chapter 9: Connections

LATS — search over trajectories

LATS uses tree search (MCTS) to explore multiple complete trajectories, using environment feedback and LLM self-reflection to guide the search. ADAPT is complementary: LATS could be used within ADAPT’s executor module to strengthen individual sub-task execution. Where LATS searches broadly across trajectories, ADAPT decomposes vertically through sub-task hierarchy.

Hierarchical task networks

ADAPT draws from the classical AI tradition of Hierarchical Task Networks (HTNs), where complex tasks are decomposed into sub-tasks using pre-defined recipes. ADAPT replaces the hand-crafted decomposition library with an LLM planner, making decomposition flexible and knowledge-driven rather than hand-engineered.

Hierarchical RL and options

In reinforcement learning, the options framework defines temporally extended actions (options) that abstract over primitive actions. ADAPT’s recursive structure mirrors this: the planner creates "options" (sub-task descriptions) and the executor performs them. The key difference is that ADAPT’s options are generated dynamically by the LLM, not pre-trained or pre-defined.

Robotics task decomposition

Systems like SayCan and Inner Monologue use LLMs for robot task planning. ADAPT’s as-needed decomposition could strengthen these systems: instead of generating a flat plan and hoping each step works, attempt execution first and decompose only when the robot fails. This is especially useful in novel environments where the difficulty of sub-tasks is unpredictable.

Limitations

ADAPT’s self-evaluation assumption is the main limitation. The LLM must accurately judge whether it has succeeded or failed. In interactive environments with clear feedback, this works well. But in domains where success is ambiguous (creative writing, open-ended reasoning), the heuristic may break down. Future work could incorporate external verifiers or calibrated confidence scores.

ADAPT also assumes sub-task independence within AND/OR trees. In reality, failed sub-tasks may leave the environment in a bad state that affects subsequent sub-tasks. The paper doesn’t address environment state corruption from failed execution attempts.

The big picture: ADAPT represents a shift from "plan, then execute" to "execute, then plan if needed." This principle — lazy evaluation of planning effort — is broadly applicable beyond LLM agents. Anywhere you have hierarchical tasks with uncertain sub-task difficulty, trying first and decomposing on failure is a powerful strategy.

How could ADAPT and LATS be combined?

Use LATS as the executor within ADAPT: ADAPT decomposes tasks hierarchically while LATS uses tree search to execute each sub-task more reliably
Replace ADAPT’s planner with LATS
They are fundamentally incompatible

Try First, Decompose Later