As-Needed Decomposition and Planning with Language Models
Don’t plan everything upfront. Attempt the task, and only decompose when the LLM gets stuck. ADAPT recursively breaks down sub-tasks on failure, adapting to both task complexity and model capability.
A robot in an unfamiliar kitchen is told: "Put a clean mug on the desk." An iterative executor like ReAct starts exploring: go to countertop 1, go to cabinet 12, open cabinet 12... After dozens of steps, it has lost track of its original goal amid a mountain of action-observation pairs. It gives up. Task failed.
Alright, try the other approach. A plan-and-execute system decomposes the task upfront: Step 1: Find and take a mug. Step 2: Clean the mug. Step 3: Put the clean mug on the desk. Looks reasonable. But Step 1 — "find and take a mug" — turns out to be hard. The mug is hidden in an obscure cabinet. The executor fails at Step 1, and the entire plan collapses. No recovery mechanism. No fallback.
The problem gets worse as tasks become more complex. Consider crafting a beehive in Minecraft. You need planks, which need logs. You need honeycomb, which needs... more sub-components. The compositional depth can be 2, 3, or 4 levels. A flat plan can’t anticipate all of this. And an iterative executor can’t compose multi-step sub-tasks reliably without losing track.
What we need is an approach that combines the best of both worlds: the adaptability of iterative execution with the structure of hierarchical planning. And critically, it should only invest effort in planning when planning is actually needed.
ADAPT’s insight is almost embarrassingly simple: try first, decompose only when stuck.
Given a task, hand it to the executor LLM. Let it attempt the entire thing. If it succeeds — great, you’re done. No planning overhead, no unnecessary decomposition. If it fails — now call the planner to break the task into sub-tasks. Then try each sub-task. If a sub-task fails, decompose that one too. Recursively.
This is the opposite of how most planning systems work. Traditional plan-and-execute assumes you need a plan for everything. ADAPT assumes you need a plan for nothing — until proven otherwise. It’s lazy decomposition, and that laziness is a feature.
The recursion is the second key idea. Plan-and-execute decomposes once. If a sub-task is still too complex after one round of decomposition, tough luck. ADAPT decomposes again, and again, up to a maximum depth dmax. Each level of recursion produces shorter, simpler sub-tasks until the executor can handle them.
Here, k denotes the current recursion depth and dmax the maximum allowed depth (typically 3 or 4).
ADAPT has three modules: an executor, a planner, and a controller that orchestrates them. Let’s walk through each one.
The executor is a ReAct-style LLM agent. Given a task description, it iteratively generates thoughts and actions, receives observations from the environment, and continues until it either succeeds or hits a maximum iteration limit. It’s equipped with atomic skills specific to the environment — things like "put object on receptacle" or "clean object with sinkbasin" in ALFWorld.
Critically, the executor is also responsible for self-assessing success. Its prompt instructs it to output "task completed" when it believes it has succeeded, or "task failed" when it cannot proceed further. This self-generated heuristic drives the as-needed decomposition.
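The executor loop described above can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: `run_executor`, `ExecutorResult`, the `policy` and `env_step` callables, and the default limit of 20 iterations are all hypothetical names and values chosen for this example, and a scripted policy stands in for the LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces: `Policy` stands in for the LLM call and
# `EnvStep` for the environment. Neither is an API from the paper.
Step = Tuple[str, str]                      # (action, observation)
Policy = Callable[[str, List[Step]], str]
EnvStep = Callable[[str], str]


@dataclass
class ExecutorResult:
    success: bool
    reason: str                             # how the attempt ended
    trajectory: List[Step] = field(default_factory=list)


def run_executor(task: str, policy: Policy, env_step: EnvStep,
                 max_iters: int = 20) -> ExecutorResult:
    """Act step by step until the policy self-reports or the budget runs out."""
    trajectory: List[Step] = []
    for _ in range(max_iters):
        output = policy(task, trajectory)
        verdict = output.strip().lower()
        if verdict == "task completed":     # self-assessed success
            return ExecutorResult(True, "task completed", trajectory)
        if verdict == "task failed":        # self-assessed failure
            return ExecutorResult(False, "task failed", trajectory)
        trajectory.append((output, env_step(output)))
    return ExecutorResult(False, "iteration limit", trajectory)


def scripted_policy(script: List[str]) -> Policy:
    """Replay a fixed list of outputs, then give up (stand-in for an LLM)."""
    steps = iter(script)
    return lambda task, trajectory: next(steps, "task failed")


ok = run_executor(
    "Clean the mug",
    scripted_policy(["go to sinkbasin 1",
                     "clean mug 1 with sinkbasin 1",
                     "task completed"]),
    env_step=lambda action: "OK",
)
lost = run_executor(
    "Find and take a mug",
    policy=lambda task, trajectory: "go to cabinet 1",   # never terminates
    env_step=lambda action: "Nothing happens",
    max_iters=3,
)
print(ok.success, ok.reason, len(ok.trajectory))
print(lost.success, lost.reason, len(lost.trajectory))
```

The `reason` field keeps an explicit "task failed" apart from a run that simply exhausted its iterations; a caller can treat both as failure.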
When the executor fails, the planner takes over. It receives the failed task and generates a short plan of 3–5 sub-tasks. The plans are intentionally kept short and abstract, because detailed upfront plans in unexplored environments lead to cascading errors. The sub-tasks are joined by a logical operator: AND when every sub-task must succeed, or OR when any one alternative is enough.
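The controller is a deterministic recursive algorithm with no LLM involved. It orchestrates executor and planner calls (Algorithm 1 in the paper): first hand the task to the executor; if the executor fails and depth budget remains, call the planner, then recurse on each sub-task and combine the outcomes with the plan’s logical operator.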
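A minimal sketch of that recursion, under stated assumptions: the `Plan` dataclass, the `adapt` function, and the `trace` argument are hypothetical names, and the executor and planner are plain Python stubs rather than LLM calls. The sub-task wording for the mug task is illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Plan:
    logic: str              # "AND" or "OR"
    subtasks: List[str]


Executor = Callable[[str], bool]    # True when the executor reports success
Planner = Callable[[str], Plan]
Trace = List[Tuple[int, str]]       # (depth, task) for every executor call


def adapt(task: str, executor: Executor, planner: Planner,
          k: int = 1, d_max: int = 3, trace: Optional[Trace] = None) -> bool:
    """Try the task first; decompose and recurse only when the attempt fails."""
    if trace is not None:
        trace.append((k, task))
    if executor(task):
        return True
    if k >= d_max:                  # no depth budget left: report failure upward
        return False
    plan = planner(task)
    # Generator + all()/any() gives lazy, short-circuit evaluation:
    # AND stops at the first failure, OR stops at the first success.
    outcomes = (adapt(sub, executor, planner, k + 1, d_max, trace)
                for sub in plan.subtasks)
    return all(outcomes) if plan.logic == "AND" else any(outcomes)


# Stubbed run on the mug task (assumed sub-task names).
ROOT = "Put a clean mug on the desk"
PLANS = {
    ROOT: Plan("AND", ["Find and take a mug", "Clean the mug",
                       "Put the clean mug on the desk"]),
    "Find and take a mug": Plan("OR", ["Find mug on countertops",
                                       "Find mug in cabinets",
                                       "Find mug on shelves"]),
}
EASY = {"Find mug in cabinets", "Clean the mug", "Put the clean mug on the desk"}

trace: Trace = []
solved = adapt(ROOT, executor=lambda t: t in EASY, planner=PLANS.__getitem__,
               d_max=3, trace=trace)
print(solved)
for depth, name in trace:
    print("  " * (depth - 1) + name)

# With d_max=1 the controller reduces to a single executor attempt.
executor_only = adapt(ROOT, lambda t: t in EASY, PLANS.__getitem__, d_max=1)
```

In this stubbed run the planner is called only twice, and the "shelves" alternative is never attempted because the OR plan stops at its first success.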
This is the showcase. Let’s trace ADAPT on the ALFWorld task "Put a clean mug on desk" and watch the recursion tree unfold as-needed.
The executor receives the full task. It tries to explore the house, find a mug, clean it, and place it — all in one trajectory. After 15 actions, it’s lost. "Task failed." The controller triggers decomposition.
The planner breaks it into three AND-linked sub-tasks: (1) Find and take a mug, (2) Clean the mug, (3) Put the clean mug on the desk.
ADAPT recurses on sub-task 1: "Find and take a mug." The executor tries. It checks a few locations, doesn’t find one. "Task failed." Another decomposition. The planner generates OR-linked alternatives: (1a) Find mug on countertops, OR (1b) Find mug in cabinets, OR (1c) Find mug on shelves.
The executor tries 1a. No mug on countertops. Fails. Tries 1b. Opens cabinet 3 — there’s the mug! Takes it. Success. The OR short-circuits — 1c is never attempted.
Back up one level. Sub-task 2: "Clean the mug." The executor already knows how to do this (it’s an atomic skill). Goes to the sinkbasin, cleans the mug. Success on the first try — no decomposition needed.
Sub-task 3: "Put the clean mug on the desk." Another atomic skill. Success. Task complete.
In the resulting recursion tree, only the hard sub-tasks are decomposed; the others succeed directly.
The entire ADAPT algorithm hinges on one critical question: how does the system know when the executor has failed? If failure detection is unreliable, the whole recursive structure falls apart — false positives would trigger unnecessary decomposition, and false negatives would let real failures propagate.
ADAPT takes a surprisingly simple approach. The executor prompt includes an instruction: if you believe the task is done, say "task completed." If you’re stuck and can’t proceed, say "task failed." The LLM itself judges its own success.
This is not as crazy as it sounds. In interactive environments, the LLM receives textual feedback after every action. If it tries to pick up a mug and the environment says "Nothing happens" or it has checked 10 locations without finding anything, the LLM can reasonably conclude it has failed. The signal is in the trajectory.
The paper validates the self-generated heuristic against the gold environment rewards. For ALFWorld, the LLM’s success assessment closely matches the true task completion signal. Specifically, there are very few false positives (claiming success when the task isn’t done) and a moderate rate of false negatives (claiming failure when the task is actually complete), which is the conservative and safer error mode.
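One way to run this kind of check is to compare each self-reported outcome with the environment’s gold reward and count the two error types. The sketch below assumes a hypothetical record format of `(claimed_success, gold_success)` pairs; the sample records are invented toy values, not results from the paper.

```python
from typing import Dict, Iterable, Tuple


def self_assessment_errors(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compare self-reported outcomes with gold rewards.

    Each record is (claimed_success, gold_success).
    A false positive claims success on a task that actually failed.
    A false negative claims failure on a task that actually succeeded.
    """
    false_pos = false_neg = gold_pos = gold_neg = 0
    for claimed, gold in records:
        if gold:
            gold_pos += 1
            if not claimed:
                false_neg += 1
        else:
            gold_neg += 1
            if claimed:
                false_pos += 1
    return {
        "false_positive_rate": false_pos / gold_neg if gold_neg else 0.0,
        "false_negative_rate": false_neg / gold_pos if gold_pos else 0.0,
    }


# Toy records for illustration only.
toy_records = [
    (True, True), (True, True), (True, True), (False, True),      # one false negative
    (False, False), (False, False), (False, False), (False, False),
]
rates = self_assessment_errors(toy_records)
print(rates)
```

A false negative here only costs an extra decomposition, whereas a false positive would let an unfinished sub-task pass as done, which is why the second rate matters more for ADAPT.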
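There are two ways the executor signals failure: it explicitly outputs "task failed," or it reaches its maximum iteration limit without reporting completion.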
Both signals hand control back to the controller, which then decides whether to invoke the planner (if depth budget remains) or return failure to the parent call.
The authors introduce TextCraft, a text-based environment inspired by Minecraft crafting recipes. It’s specifically designed to test compositional task decomposition — exactly what ADAPT is built for.
The agent must craft a target item by following crafting recipes. Each recipe requires ingredients, and those ingredients may themselves require crafting from other ingredients. This creates a natural recipe tree with varying depths.
For example, crafting a beehive requires 6 planks and 3 honeycomb. Planks require logs. So the recipe tree has depth 2. More complex items have depth 3 or 4, requiring sub-sub-ingredients that must be crafted in the right order.
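The depth of a recipe tree can be computed with a short recursion. The sketch below encodes the beehive example as a toy recipe table; the `RECIPES` layout and the `recipe_depth` helper are hypothetical, the one-log-per-plank quantity is a placeholder, and treating logs and honeycomb as raw materials is an assumption made to match the depth-2 example.

```python
from typing import Dict

# Toy recipe table: item -> {ingredient: quantity}.
# Items without an entry are treated as raw materials (an assumption).
RECIPES: Dict[str, Dict[str, int]] = {
    "beehive": {"planks": 6, "honeycomb": 3},
    "planks": {"logs": 1},          # quantity is a placeholder
}


def recipe_depth(item: str, recipes: Dict[str, Dict[str, int]]) -> int:
    """Number of nested crafting levels needed to produce `item`."""
    if item not in recipes:         # raw material: nothing to craft
        return 0
    return 1 + max(recipe_depth(ing, recipes) for ing in recipes[item])


for name in ("logs", "planks", "beehive"):
    print(name, recipe_depth(name, RECIPES))
```

Under this definition a raw material has depth 0, planks have depth 1, and the beehive has depth 2.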
TextCraft has several properties that make it ideal for evaluating ADAPT:
The agent has three types of actions: craft (combine ingredients into an item), fetch (get a raw material from the environment), and inventory (check what you have). Simple, but the compositional depth makes tasks challenging.
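To make the three action types concrete, here is a toy environment with `craft`, `fetch`, and `inventory` methods. It is a sketch for illustration and not the TextCraft API: the class name `ToyCraftEnv`, the observation strings, the one-item-per-craft rule, and the recipe quantities for planks are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


class ToyCraftEnv:
    """Toy crafting environment (hypothetical, not the real TextCraft)."""

    def __init__(self, recipes: Dict[str, Dict[str, int]], raw: Iterable[str]):
        self.recipes = recipes
        self.raw = set(raw)
        self.items: Counter = Counter()

    def fetch(self, item: str, count: int = 1) -> str:
        """Get a raw material from the environment."""
        if item not in self.raw:
            return f"Could not find {item}"
        self.items[item] += count
        return f"Got {count} {item}"

    def craft(self, item: str) -> str:
        """Consume the ingredients of one recipe and produce one item."""
        needs = self.recipes.get(item)
        if needs is None:
            return f"No recipe for {item}"
        if any(self.items[ing] < qty for ing, qty in needs.items()):
            return f"Could not craft {item}: missing ingredients"
        for ing, qty in needs.items():
            self.items[ing] -= qty
        self.items[item] += 1
        return f"Crafted 1 {item}"

    def inventory(self) -> Dict[str, int]:
        """Report what the agent currently holds."""
        return {name: n for name, n in self.items.items() if n > 0}


env = ToyCraftEnv(
    recipes={"beehive": {"planks": 6, "honeycomb": 3}, "planks": {"logs": 1}},
    raw={"logs", "honeycomb"},
)
too_early = env.craft("beehive")            # fails: nothing in the inventory yet
env.fetch("logs", 6)
for _ in range(6):
    env.craft("planks")                     # sub-ingredient must come first
env.fetch("honeycomb", 3)
final = env.craft("beehive")
print(too_early)
print(final, env.inventory())
```

The failed first attempt is exactly the kind of textual feedback an executor can use to decide that a sub-task needs to be broken down further.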
In a recipe tree, each node is a crafting step and the leaf nodes are raw materials.
ADAPT with GPT-3.5 substantially outperforms all baselines across three diverse benchmarks. The numbers are striking.
On ALFWorld, ADAPT achieves a 71.6% overall success rate, compared to ReAct’s 43.3% (+28.3 points). Even more impressive: on the hardest task type ("pick2," which requires composing two pick-style tasks with a long action history), ADAPT scores 52.9% while all baselines score under 12%. That’s more than a 4x improvement on the hardest tasks.
On WebShop, ADAPT achieves a 44% success rate. ReAct: 32%. Plan-and-Execute: 17%. Reflexion: 35%. LATS: 38%. ADAPT outperforms the strongest baseline (LATS) by 6 points, without requiring the expensive tree search that LATS uses.
On TextCraft, ADAPT achieves 52%. ReAct: 19%. Plan-and-Execute: 27%. Reflexion: 32%. A +33 point improvement over ReAct and +20 over Reflexion. The compositional nature of TextCraft is where ADAPT’s recursive decomposition truly shines.
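The point gaps quoted in these three result paragraphs can be re-derived from the reported success rates. The snippet below only re-does that arithmetic; the `RESULTS` layout and the `gains` helper are hypothetical, and the numbers are the ones stated in the text (ALFWorld, WebShop, and TextCraft, in that order).

```python
from typing import Dict

# Success rates (%) with GPT-3.5, copied from the text above.
RESULTS: Dict[str, Dict[str, float]] = {
    "ALFWorld": {"ADAPT": 71.6, "ReAct": 43.3},
    "WebShop": {"ADAPT": 44.0, "ReAct": 32.0, "Plan-and-Execute": 17.0,
                "Reflexion": 35.0, "LATS": 38.0},
    "TextCraft": {"ADAPT": 52.0, "ReAct": 19.0, "Plan-and-Execute": 27.0,
                  "Reflexion": 32.0},
}


def gains(scores: Dict[str, float], method: str = "ADAPT") -> Dict[str, float]:
    """Absolute gain of `method` over every baseline, in percentage points."""
    return {name: round(scores[method] - value, 1)
            for name, value in scores.items() if name != method}


for benchmark, scores in RESULTS.items():
    print(benchmark, gains(scores))
```

These are differences in percentage points, not relative improvements.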
Success rates (%) with GPT-3.5: ADAPT vs. baselines.
The most revealing analysis in the paper isn’t about raw performance — it’s about how ADAPT dynamically adjusts its behavior based on the executor’s capability and the task’s complexity.
Performance increases with dmax across all datasets. Moving from dmax=1 (executor only, equivalent to ReAct) to dmax=2 (one level of decomposition) gives the biggest jump. Moving from dmax=2 to dmax=3 gives a further boost, validating that some sub-tasks genuinely need multi-level decomposition.
The paper tests three executor settings on ALFWorld: (1) task-specific gold trajectories in the prompt (strong), (2) a hybrid with some gold trajectories (medium), and (3) only atomic skills (weak). ADAPT improves all three — but the weak executor benefits the most dramatically, jumping from 3.3% to 41.7%. The stronger the executor, the less decomposition happens. ADAPT adapts.
On TextCraft, they measure the maximum decomposition depth actually used (kmax) against the recipe tree depth (task complexity). The results are elegant: for depth-2 recipes, kmax averages 1.9. For depth-3 recipes, kmax averages 2.8. ADAPT automatically invests more decomposition effort in harder tasks.
ADAPT improves GPT-3.5, GPT-4, LLaMA-2 70B, and Lemur 70B across all benchmarks. GPT-4 (the strongest) gets up to +37% improvement. LLaMA (the weakest) gets up to +15%. You can even mix LLMs: use a strong model (GPT-3.5) as the planner and a weak model (LLaMA) as the executor. The planner is called sparingly, so costs stay low.
TextCraft: ADAPT’s actual decomposition depth (kmax) tracks recipe complexity.
Let’s place ADAPT in the landscape of LLM agent methods and understand what each approach does differently.
ReAct is an iterative executor — it generates thoughts and actions one step at a time. No explicit planning. It must keep the entire task context implicitly in the action-observation trajectory. As tasks get complex, the trajectory grows long, distractors accumulate, and the LLM loses track of the goal. ADAPT uses ReAct as its executor but wraps it in recursive structure, keeping each executor call focused on a manageable sub-task.
Plan-and-Execute generates a complete plan upfront, then hands each step to the executor. No adaptation. If step 3 of 5 fails, the whole thing fails. ADAPT differs in two ways: it plans only when needed (not upfront) and it plans recursively (failed sub-tasks get decomposed further). Even with dmax=2, which like Plan-and-Execute allows only one round of decomposition, ADAPT differs because it decomposes only the tasks the executor actually fails at.
Reflexion addresses failure differently: after a full task attempt fails, it reflects on what went wrong and retries the entire task with that feedback in memory. This wastes effort re-executing sub-tasks that already succeeded. ADAPT localizes failure: only the failing sub-task gets decomposed further, while successful sub-tasks are never repeated.
AdaPlanner refines plans based on environment feedback but doesn’t recursively decompose. LATS (Language Agent Tree Search) uses Monte Carlo tree search to explore multiple trajectories, which is powerful but computationally expensive (requires environment rollback). ADAPT achieves comparable or better results with simpler recursive decomposition and no search.
How each method responds to a failed sub-task within a complex task.
LATS uses tree search (MCTS) to explore multiple complete trajectories, using environment feedback and LLM self-reflection to guide the search. ADAPT is complementary: LATS could be used within ADAPT’s executor module to strengthen individual sub-task execution. Where LATS searches broadly across trajectories, ADAPT decomposes vertically through sub-task hierarchy.
ADAPT draws from the classical AI tradition of Hierarchical Task Networks (HTNs), where complex tasks are decomposed into sub-tasks using pre-defined recipes. ADAPT replaces the hand-crafted decomposition library with an LLM planner, making decomposition flexible and knowledge-driven rather than hand-engineered.
In reinforcement learning, the options framework defines temporally extended actions (options) that abstract over primitive actions. ADAPT’s recursive structure mirrors this: the planner creates "options" (sub-task descriptions) and the executor performs them. The key difference is that ADAPT’s options are generated dynamically by the LLM, not pre-trained or pre-defined.
Systems like SayCan and Inner Monologue use LLMs for robot task planning. ADAPT’s as-needed decomposition could strengthen these systems: instead of generating a flat plan and hoping each step works, attempt execution first and decompose only when the robot fails. This is especially useful in novel environments where the difficulty of sub-tasks is unpredictable.
ADAPT’s self-evaluation assumption is the main limitation. The LLM must accurately judge whether it has succeeded or failed. In interactive environments with clear feedback, this works well. But in domains where success is ambiguous (creative writing, open-ended reasoning), the heuristic may break down. Future work could incorporate external verifiers or calibrated confidence scores.
ADAPT also assumes sub-task independence within AND/OR trees. In reality, failed sub-tasks may leave the environment in a bad state that affects subsequent sub-tasks. The paper doesn’t address environment state corruption from failed execution attempts.