An agent fails at customer service tasks. But why? A trajectory-level reward says "wrong" — it never says "you forgot to verify preconditions before issuing a refund." TRACE automatically discovers the 3–4 specific capabilities the agent lacks, builds synthetic training environments for each one, trains a separate LoRA adapter per capability, and routes each new task to the right adapter. Result: +14.1 points on τ²-Bench with only 5.3% of parameters updated.
You build an LLM-powered customer service agent. It can look up orders, check inventory, process refunds, transfer calls. You deploy it on 100 real customer interactions and it succeeds on 58 of them. The other 42 fail — wrong refund amounts, missing verification steps, hallucinated product details.
You want to improve this agent. The standard 2025–2026 approach is reinforcement learning: run the agent on many tasks, collect trajectory-level rewards (success = 1, fail = 0), and update the model weights via policy gradient methods like GRPO.
There is a fundamental problem with this approach. Trajectory-level rewards treat the agent as a monolithic black box. The reward says "this trajectory failed" but never says why. Consider two failure trajectories:
These are completely different failure modes that need completely different fixes. But a scalar reward of 0 treats them identically. Training with GRPO on all failures simultaneously conflates these signals. The model gets pulled in multiple directions at once, learning a muddled compromise that partially fixes all problems and fully fixes none.
There is a second problem too. Suppose you collect 500 failure trajectories and train on all of them with GRPO directly. The target environment is complex — each customer service task involves 5–15 tool calls, multiple conditional branches, and noisy real-world data. The reward signal is sparse (you only know at the very end whether the task succeeded). The RL optimization landscape is rugged.
What if instead of training on the complex target environment, you could isolate the missing capability and train on a simpler synthetic environment that tests only that capability? The synthetic environment is easier to learn from — shorter trajectories, denser rewards, no confounding variables. This is the second key idea in TRACE.
Each dot is a failure trajectory. Color = the actual missing capability (revealed by TRACE). A scalar reward of 0 cannot distinguish them.
Here is the core observation that makes TRACE work: not all failures are created equal. If you look at 100 failure trajectories, you will find that the same small set of capability deficits explains the vast majority of them.
The authors ran TRACE's capability identification step 10 independent times on the same dataset (Figure 2a in the paper). Each run uses a different random seed, different trajectory samples, different LLM prompting. And yet, across all 10 runs, the same 4 capabilities emerged every time:
This is not a coincidence. These capabilities are real bottlenecks in the agent's performance. The model already knows how to do many things well — it can hold conversations, follow instructions, generate coherent text. But these 4 specific capabilities are where it falls short on this particular task distribution.
Let's make this concrete. Suppose you analyze 50 success trajectories and 50 failure trajectories for the "precondition verification" capability:
A gap of 0.54 is enormous. It means precondition verification errors are 4.4x more common in failures than in successes. This capability is a major bottleneck.
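The arithmetic behind these two figures can be checked with a short sketch. The counts below are hypothetical: they are chosen only so that the error rates reproduce the 0.54 gap and the 4.4x ratio quoted above, and `error_rate` is an illustrative helper rather than anything from the paper.

```python
def error_rate(n_lacking: int, n_relevant: int) -> float:
    """Fraction of relevant trajectories in which the capability is lacking."""
    if n_relevant == 0:
        raise ValueError("capability is not relevant to any trajectory")
    return n_lacking / n_relevant


# Hypothetical counts for "precondition verification" (not from the paper).
er_success = error_rate(n_lacking=8, n_relevant=50)    # error rate in successes
er_failure = error_rate(n_lacking=35, n_relevant=50)   # error rate in failures

gap = er_failure - er_success      # contrastive gap
ratio = er_failure / er_success    # how much more common errors are in failures

print(f"ER+ = {er_success:.2f}, ER- = {er_failure:.2f}")
print(f"gap = {gap:.2f}, ratio = {ratio:.1f}x")
```

With these counts the error rate is 16% among successes and 70% among failures, which is what makes the capability stand out.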
Now compare with "greeting the customer politely":
The model already greets customers fine. Training on this capability would be wasted effort.
Left: success trajectories with capabilities present (green). Right: failure trajectories with capabilities lacking (red). The contrastive gap is the difference in error rates.
TRACE works with any agentic environment defined as a tuple (X, P, R, y) where:
A trajectory τ is the full sequence of agent actions and environment responses: τ = (a1, o1, a2, o2, ..., aT, oT) where at is the agent's action at step t (a text response, a tool call) and ot is the environment's observation (tool output, user reply).
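One way to hold such a trajectory in code is a list of (action, observation) steps plus the trajectory-level reward. This is a minimal sketch: the class names `Step` and `Trajectory` and the example strings are assumptions made for illustration, not structures taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str        # a_t: a text response or a serialized tool call
    observation: str   # o_t: tool output or user reply


@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    reward: int = 0    # trajectory-level reward, 0 or 1

    def add(self, action: str, observation: str) -> None:
        """Append one (action, observation) pair to the trajectory."""
        self.steps.append(Step(action, observation))

    def __len__(self) -> int:
        return len(self.steps)


# A two-step example: one tool call, then one text reply.
traj = Trajectory()
traj.add('get_order_details("ORD-1")', '{"status": "delivered"}')
traj.add("Your order was delivered.", "Thanks!")
traj.reward = 1
```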
A capability c is a natural-language description of a specific skill the agent needs. Examples:
Critically, capabilities are NOT predefined by the user. TRACE discovers them automatically from the trajectories. The user provides only the environment and a dataset of tasks.
For each identified capability deficit c, TRACE synthesizes a new environment Ec = (Xc, Pc, Rc, yc) with three key properties:
Why decompose at all? Consider the alternative: train one adapter on the full target environment. The RL loss mixes gradients from all capability deficits. A batch of 64 trajectories might contain 20 failures due to structured data reasoning, 15 due to precondition verification, 10 due to tool-calling precision, and 19 due to other causes. The policy gradient averages over all of these, producing a noisy signal that partially addresses all capabilities and fully addresses none.
In the mixed-capability setting, Â(τ) conflates advantages from different capabilities. Two trajectories that fail for different reasons get the same negative advantage. The gradient points in a direction that compromises between fixing structured data reasoning and fixing precondition verification — and that compromise may not be optimal for either.
With TRACE's decomposition, each adapter θi trains on Eci where all failures are about capability ci. The advantage function Â(τ) is now pure signal: "did you do ci correctly?" The gradient is clean and converges faster.
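A toy calculation can make the conflation argument tangible. The sketch below assumes two hypothetical unit-length gradients in a 2-D parameter space, one per capability, and compares how far an averaged step moves along each capability's own direction. The vectors are invented for illustration and say nothing about the real loss landscape.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Hypothetical unit gradients (assumed values, partially conflicting).
grad_structured = [1.0, 0.0]       # would fix structured data reasoning
grad_precondition = [-0.6, 0.8]    # would fix precondition verification

# A mixed batch averages the two signals into one update direction.
mixed = mean_vector([grad_structured, grad_precondition])

# Progress per unit step along each capability's own gradient direction.
progress_structured = dot(mixed, grad_structured)
progress_precondition = dot(mixed, grad_precondition)

print(mixed, progress_structured, progress_precondition)
```

A dedicated adapter stepping along its own unit gradient makes progress 1.0 on its capability. In this toy setup the averaged step makes progress of about 0.2 on each, which is the "muddled compromise" in miniature.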
This is the first step of TRACE and arguably the most elegant. The goal: given a set of trajectories, automatically discover which capabilities the agent lacks. No human annotation. No predefined capability taxonomy. The capabilities emerge from the data.
Run the base model on a dataset D of tasks. Each task produces a trajectory τ with reward R(τ) ∈ {0,1}. Split the trajectories into D+ (successes, R(τ) = 1) and D− (failures, R(τ) = 0).
In the paper's τ²-Bench experiments, the base model (Llama-3.1-8B-Instruct) achieves a 24.3% success rate on 680 tasks. That gives D+ = 165 successes and D− = 515 failures.
Sample a batch of ~50 failure trajectories from D−. Feed them to an LLM (GPT-4.1 in the paper) with a prompt that says:
The LLM produces a capability dictionary — a list of (name, description) pairs. In the paper, this typically yields 5–8 candidate capabilities. The discovery is run multiple times with different trajectory samples, and all discovered capabilities are pooled.
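Now comes the scoring. For every trajectory τ in D+ ∪ D− and every candidate capability c, an LLM assigns one of three labels: the capability was exercised correctly, the capability was LACKING, or the capability was not applicable (NA).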
This is the most computationally expensive step. If you have 680 trajectories and 7 candidate capabilities, that is 4,760 labeling calls. The paper uses GPT-4.1 for this, which costs money but gives reliable labels.
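The call count follows from labeling every (trajectory, capability) pair once. The sketch below shows that loop with a stub in place of the LLM. `label_all` and `stub_labeler` are hypothetical names, and the stub always answers "NA" so the example runs without any API access.

```python
from itertools import product


def label_all(trajectory_ids, capabilities, label_fn):
    """Call label_fn once per (trajectory, capability) pair."""
    labels = {}
    for traj_id, cap in product(trajectory_ids, capabilities):
        labels[(traj_id, cap)] = label_fn(traj_id, cap)
    return labels


def stub_labeler(traj_id, cap):
    # Placeholder for the LLM labeling call; a real labeler would read the
    # trajectory text and the capability description.
    return "NA"


trajectory_ids = range(680)
capabilities = [f"capability_{i}" for i in range(7)]

labels = label_all(trajectory_ids, capabilities, stub_labeler)
print(len(labels))  # 4760
```

Because the cost is linear in both the number of trajectories and the number of candidates, pruning weak candidates before this step is the obvious lever for reducing spend.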
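For each capability c, compute two error rates. Let N+(c) = number of successes where c is relevant (not NA), and E+(c) = number of those where c is LACKING, so that ER+(c) = E+(c) / N+(c). Define N−(c) and E−(c) the same way over the failures, giving ER−(c) = E−(c) / N−(c).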
Think of ER+(c) as the "background noise" — how often this capability fails even in trajectories that ultimately succeed. Maybe the agent makes a precondition error 16% of the time but recovers from it (corrects itself on the next turn). ER−(c) is the "failure signal" — how often this capability fails in trajectories that ultimately fail.
The contrastive gap is the key statistic: Δ̂(c) = ER−(c) − ER+(c).
A high Δ̂(c) means: "when this capability fails, the whole trajectory fails." A low Δ̂(c) means: "this capability fails at similar rates regardless of outcome — it is not the bottleneck."
The coverage measures what fraction of failures involve this capability:
Coverage prevents TRACE from selecting capabilities that have a high contrastive gap but only affect 3% of failures. A capability that perfectly separates 2 failures from successes is not useful if you have 500 other failures to fix.
TRACE retains capability c if its contrastive gap Δ̂(c) clears the threshold δ and its coverage clears the threshold ρ.
Let's walk through a concrete worked example with real-ish numbers from τ²-Bench:
Which capabilities get retained depends on the contrastive gap threshold (δ) and the coverage threshold (ρ). The paper uses δ=0.20 and ρ=0.10.
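The selection rule can be sketched in a few lines. Only δ=0.20 and ρ=0.10 come from the text. The counts are hypothetical, the `>=` comparisons are an assumption, and coverage is taken here to mean the share of all failures in which the capability is labeled LACKING, which is one plausible reading rather than the paper's stated definition.

```python
from dataclasses import dataclass


@dataclass
class CapabilityStats:
    name: str
    lacking_success: int    # E+(c)
    relevant_success: int   # N+(c)
    lacking_failure: int    # E-(c)
    relevant_failure: int   # N-(c)


def select(stats, n_failures, delta=0.20, rho=0.10):
    """Keep capabilities whose gap and coverage both clear their thresholds."""
    kept = []
    for s in stats:
        er_pos = s.lacking_success / s.relevant_success
        er_neg = s.lacking_failure / s.relevant_failure
        gap = er_neg - er_pos
        coverage = s.lacking_failure / n_failures  # assumed definition
        if gap >= delta and coverage >= rho:
            kept.append(s.name)
    return kept


# Hypothetical counts (not from the paper).
stats = [
    CapabilityStats("precondition verification", 16, 100, 210, 300),
    CapabilityStats("polite greeting", 5, 150, 20, 500),
    CapabilityStats("rare edge case", 0, 10, 12, 12),
]

print(select(stats, n_failures=515))  # ['precondition verification']
```

The third row shows why coverage matters: its gap is a perfect 1.0, but it touches only 12 of 515 failures, so it is dropped.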
Now we assemble all four steps into the full TRACE pipeline, from raw trajectories to inference-time routing.
Let's trace through each stage with actual numbers from the τ²-Bench experiments.
The base model (Llama-3.1-8B-Instruct) achieves 24.3% on τ²-Bench. TRACE rolls it out on 680 tasks: 165 successes, 515 failures. After contrastive analysis, 4 capabilities emerge with gaps ranging from 0.38 to 0.54.
For each of the 4 capabilities, TRACE prompts an LLM to generate a synthetic environment. The synthesis prompt includes: the capability description, the target environment's tool schemas, and example trajectories. The LLM produces Python code that implements a seeded task generator, transition logic, and an evaluation function. Each synthetic env produces unlimited training tasks on demand.
Each capability gets its own LoRA adapter (rank 32, ~5.3% of parameters). Training uses GRPO with G=4 groups, K=4 rollouts per group per seed, AdamW optimizer (lr=1e−5), for 40 iterations on 4–8 A100 GPUs. Total: 640 rollouts per adapter (40 iterations × 4 groups × 4 rollouts).
At inference, a new task comes in. The base model is shown the task description, the 4 capability descriptions, and a few example trajectories for each capability. It is asked: "Which capability is most relevant to this task? Choose one, or 'none' if the base model suffices." TRACE takes the argmax of the logit scores for each capability label token. The task is routed to the corresponding adapter (or the base model if "none" wins).
After training 4 LoRA adapters (one per capability), TRACE achieves 38.4% on τ²-Bench, up from 24.3% — a +14.1 point improvement. For comparison:
TRACE more than doubles the improvement of any baseline. And it does this with only 4 adapters, each updating 5.3% of parameters.
This is the second stage of TRACE and the most creative. Given a capability description (e.g., "structured data reasoning") and the target environment's interface, TRACE must generate a complete, verifiable training environment that isolates that capability.
The synthesis prompt includes:
The LLM generates Python code implementing three components:
A seeded task generator generate_task(seed: int) → Task that creates a new task instance testing the target capability. Seeded for reproducibility. Can generate unlimited tasks by varying the seed. Each task includes the initial state, the correct answer, and any metadata needed for evaluation.
Transition logic that answers the agent's tool calls. When the agent calls a tool such as get_order_details(order_id), the synthetic env returns a realistic-looking JSON response constructed from the task's seed data.
An evaluation function evaluate(trajectory, task) → {0, 1} that checks whether the agent correctly exercised the target capability. NOT just "did the agent get the right final answer" but "did the agent correctly perform the specific skill being tested."
Suppose the identified capability is "structured data reasoning." The synthesis LLM might produce:
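```python
import random
from dataclasses import dataclass
from datetime import date, timedelta

PRODUCTS = ["laptop", "headphones", "keyboard", "monitor"]  # placeholder catalog

@dataclass
class Task:  # placeholder container for one task instance
    orders: list
    question: object
    answer: float

def random_date(rng: random.Random) -> str:
    """Return a pseudo-random ISO date in 2024 (placeholder helper)."""
    return (date(2024, 1, 1) + timedelta(days=rng.randrange(366))).isoformat()

def generate_task(seed: int) -> Task:
    rng = random.Random(seed)  # all randomness flows from the seed
    # Create a customer with 3-5 orders
    n_orders = rng.randint(3, 5)
    orders = []
    for i in range(n_orders):
        orders.append({
            "order_id": f"ORD-{seed}-{i}",
            "product": rng.choice(PRODUCTS),
            "price": round(rng.uniform(10, 500), 2),
            "status": rng.choice(["delivered", "shipped", "returned"]),
            "date": random_date(rng),
        })
    # Task: "What is the total spent on delivered orders?"
    target = round(sum(o["price"] for o in orders if o["status"] == "delivered"), 2)
    return Task(orders=orders, question=..., answer=target)
```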
The key is that this environment isolates structured data reasoning. The task is simple (look up orders, filter by status, sum prices) but requires correctly parsing JSON, filtering on a field, and computing an aggregate. There is no precondition verification, no complex multi-step planning. Just the one capability being trained.
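The other two components, transition logic and evaluation, might look like the following for the same task. This is a sketch under stated assumptions: the `Task` fields, the dict-shaped trajectory with `looked_up` and `final_answer` keys, and the extra `task` argument on `get_order_details` are all simplifications invented here (a real environment would bind the task as internal state so the agent-facing signature stays `get_order_details(order_id)`).

```python
import json
from dataclasses import dataclass


@dataclass
class Task:
    orders: list
    answer: float


def get_order_details(task: Task, order_id: str) -> str:
    """Transition logic: answer a tool call from the task's own data."""
    for order in task.orders:
        if order["order_id"] == order_id:
            return json.dumps(order)
    return json.dumps({"error": f"unknown order {order_id}"})


def evaluate(trajectory: dict, task: Task, tol: float = 0.01) -> int:
    """Return 1 only if the agent inspected every delivered order AND
    reported the correct total, so a lucky guess does not count."""
    delivered = {o["order_id"] for o in task.orders if o["status"] == "delivered"}
    inspected_all = delivered <= set(trajectory["looked_up"])
    correct_total = abs(trajectory["final_answer"] - task.answer) <= tol
    return int(inspected_all and correct_total)


task = Task(
    orders=[
        {"order_id": "ORD-7-0", "price": 20.00, "status": "delivered"},
        {"order_id": "ORD-7-1", "price": 35.50, "status": "returned"},
        {"order_id": "ORD-7-2", "price": 12.25, "status": "delivered"},
    ],
    answer=32.25,
)

careful = {"looked_up": ["ORD-7-0", "ORD-7-1", "ORD-7-2"], "final_answer": 32.25}
lucky = {"looked_up": [], "final_answer": 32.25}
print(evaluate(careful, task), evaluate(lucky, task))  # 1 0
```

Checking the lookups as well as the total is one way to reward the skill itself rather than the outcome alone.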
If the target environment exposes tools like get_order_details(order_id) and process_refund(order_id, amount), the synthetic environment must expose the SAME tools with the SAME signatures. The agent should not be able to distinguish whether it is talking to the real environment or the synthetic one.
Why? Because LoRA adapters modify the model's behavior within the same input distribution. If the synthetic env uses different tool names or parameter formats, the adapter learns to use those different names — and that skill does not transfer to the real environment. Interface preservation is what makes transfer possible.
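Interface preservation is easy to check mechanically before any training starts. The sketch below compares tool names and call signatures with Python's standard `inspect.signature`. The helper `same_interface` and the stand-in tool functions are hypothetical, and a real check would likely also compare the JSON schemas shown to the model.

```python
import inspect


def same_interface(target_tools: dict, synthetic_tools: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the interfaces agree."""
    problems = []
    for name, target_fn in target_tools.items():
        if name not in synthetic_tools:
            problems.append(f"missing tool: {name}")
            continue
        target_sig = inspect.signature(target_fn)
        synthetic_sig = inspect.signature(synthetic_tools[name])
        if target_sig != synthetic_sig:
            problems.append(f"signature differs: {name}{target_sig} vs {name}{synthetic_sig}")
    for name in sorted(synthetic_tools.keys() - target_tools.keys()):
        problems.append(f"extra tool: {name}")
    return problems


# Stand-in tool implementations; the bodies are irrelevant to the check.
def get_order_details(order_id): ...
def process_refund(order_id, amount): ...
def process_refund_renamed(order, amount): ...


target = {"get_order_details": get_order_details, "process_refund": process_refund}
good = {"get_order_details": get_order_details, "process_refund": process_refund}
bad = {"get_order_details": get_order_details, "process_refund": process_refund_renamed}

print(same_interface(target, good))
print(same_interface(target, bad))
```

The second call reports the renamed `order` parameter, which is exactly the kind of drift that would teach an adapter a call format the real environment rejects.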
Once we have synthetic environments, we need to train adapters on them. TRACE uses GRPO (Group Relative Policy Optimization) — a variant of policy gradient methods designed for language model fine-tuning. Let's derive how GRPO works and why it is used here.
Standard REINFORCE computes the policy gradient as ∇θ J(θ) = E[R(τ) ∇θ log πθ(τ)], with the expectation taken over trajectories τ sampled from πθ,
where R(τ) is the trajectory reward and πθ(τ) = ∏t πθ(at|st) is the probability of the trajectory under policy θ. The problem: R(τ) has high variance. A reward of 1 on one trajectory does not tell you whether that trajectory was lucky or genuinely good.
GRPO reduces variance by computing relative advantages within a group of rollouts from the same seed. For each seed (task instance), GRPO runs K rollouts and normalizes each rollout's reward Ri within the group: Âi = (Ri − μgroup) / σgroup,
where μgroup and σgroup are the mean and standard deviation of rewards within the group. This makes the advantage relative: "how much better was this rollout than the average rollout on the same task?"
Let's work through a concrete example. Suppose we have one task with K=4 rollouts:
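As an illustrative version of that calculation, the sketch below normalizes a hypothetical reward vector in which one of the four rollouts succeeds. The rewards are invented, and the choice of population standard deviation plus a small `eps` for numerical safety is an assumption rather than a detail confirmed by the text.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (R_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumed)
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1, 0, 0, 0]  # hypothetical: one success out of K=4 rollouts
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.732, -0.577, -0.577, -0.577]

# A group with identical rewards carries no learning signal.
print(group_advantages([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]
```

The lone success gets a large positive advantage because it is rare within its group, while the three failures share a mild negative one. When every rollout in a group gets the same reward, all advantages are zero and the group contributes nothing to the gradient.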
GRPO uses a clipped objective similar to PPO, but without a separate value function: J(θ) = (1/K) Σi min(ri(θ) Âi, clip(ri(θ), 1 − ε, 1 + ε) Âi) − β KL(πθ ‖ πref),
where ri(θ) = πθ(τi) / πold(τi) is the importance ratio, ε is the clip range (typically 0.2), β weights the KL penalty, and the KL term prevents the model from drifting too far from the reference policy πref (the base model).
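The clipping behavior is easiest to see on concrete numbers. This sketch evaluates the clipped surrogate for trajectory-level ratios, following the definition of the importance ratio above. The example ratios and advantages are hypothetical, and `beta` is an assumed KL coefficient with a made-up default, since the text does not state one.

```python
def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean over the group of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


def grpo_objective(ratios, advantages, kl, clip_eps=0.2, beta=0.01):
    # beta is a hypothetical KL weight chosen for illustration only.
    return clipped_surrogate(ratios, advantages, clip_eps) - beta * kl


# Hypothetical group of two rollouts.
ratios = [1.5, 0.5]        # policy moved toward rollout 1, away from rollout 2
advantages = [1.0, -1.0]   # rollout 1 was better than average, rollout 2 worse

print(round(clipped_surrogate(ratios, advantages), 3))  # 0.2
```

For the first rollout the ratio 1.5 is clipped to 1.2, so pushing further in that direction earns no extra credit. For the second, the ratio 0.5 is clipped to 0.8 and the `min` keeps the more pessimistic value of -0.8.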
The paper uses these specific hyperparameters:
The total training budget per adapter is 640 rollouts. With 4 adapters, that is 2,560 rollouts total. Compare with GRPO-on-target which uses the same 2,560 rollouts but trains a single adapter on the mixed-capability target environment — and achieves worse results.
Table 3 in the paper tests "merged" training (one adapter trained on all 4 synthetic environments mixed together). The merged adapter achieves only +8.2 points vs TRACE's +14.1 points. Why?
Because merged training suffers from the same gradient conflation as GRPO-on-target. Within a single training batch, some rollouts are about structured data reasoning, others about precondition verification. The gradient is an average over these different skills. The model learns a compromise that partially addresses all capabilities but excels at none.
Training curves: each LoRA adapter's reward over 40 training iterations. The dashed line shows the base model's performance.
After training one adapter per identified capability, TRACE has N specialized models plus the base model (N = 4 in the τ²-Bench experiments; N is used here to avoid a clash with K, the number of rollouts per group). At inference time, a new task arrives and TRACE must decide: which adapter (if any) should handle this task?
The obvious alternative is to merge all adapters into one model. LoRA adapter merging is well-studied: you can add the adapter weight matrices. But Table 3 in the paper shows this performs worse than routing:
TRACE uses a surprisingly simple routing approach — it reuses the base model itself as the router. Here is how:
Not every task requires a specialized adapter. Some tasks are straightforward enough that the base model handles them correctly. TRACE includes a "none" option in the routing prompt. If the base model assigns the highest logit to "none," the base model handles the task directly.
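The routing decision itself reduces to an argmax over label scores with a fallback. The sketch below assumes the logits for each label have already been read off the base model. The logit values, the adapter handles, and the function names `route` and `pick_model` are all hypothetical.

```python
def route(label_logits: dict[str, float]) -> str:
    """Return the label with the highest logit; 'none' means use the base model."""
    return max(label_logits, key=label_logits.get)


def pick_model(label_logits, adapters, base_model):
    """Map the routing decision to an adapter, or fall back to the base model."""
    choice = route(label_logits)
    return base_model if choice == "none" else adapters[choice]


# Placeholder handles standing in for loaded LoRA adapters.
adapters = {
    "structured data reasoning": "adapter_sdr",
    "precondition verification": "adapter_pv",
}

hard_task = {"structured data reasoning": 1.3, "precondition verification": 2.7, "none": 0.4}
easy_task = {"structured data reasoning": -1.0, "precondition verification": -0.5, "none": 0.1}

print(pick_model(hard_task, adapters, "base"))  # adapter_pv
print(pick_model(easy_task, adapters, "base"))  # base
```

Since the router is the base model itself, this design adds no new trainable component: only the prompt and the set of label tokens change when a capability is added.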
This is important for two reasons:
The paper does not report routing accuracy directly, but the +14.1 point improvement implies the router is quite accurate. If routing were random, you would expect much smaller gains (each adapter only helps on its specific capability subset).
An informal analysis: if TRACE routes 60% of tasks correctly to the best adapter, and each adapter improves its capability by ~35% on relevant tasks, the expected gain is 0.60 × 0.35 × fraction_of_tasks_per_cap — which, summed over 4 capabilities covering ~80% of failures, gives roughly the observed +14 point gain.
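The back-of-envelope estimate can be reproduced directly. The 60%, 35%, and 80% figures are the informal assumptions from the paragraph above, and the failure share comes from the 515 failures out of 680 tasks reported earlier. Treating the 35% as an absolute lift in success probability, and scaling by the failure share, are interpretive assumptions made for this sketch.

```python
routing_accuracy = 0.60     # share of tasks routed to the best adapter
adapter_lift = 0.35         # assumed absolute lift on relevant tasks
failure_coverage = 0.80     # share of failures explained by the 4 capabilities
failure_share = 515 / 680   # share of all tasks that the base model fails

# Summing 0.60 * 0.35 * fraction over capabilities whose fractions total 0.80.
gain_over_failures = routing_accuracy * adapter_lift * failure_coverage

# Rescaled from "fraction of failures" to "fraction of all tasks".
gain_over_all_tasks = gain_over_failures * failure_share

print(f"{100 * gain_over_failures:.1f} points if fractions are of all tasks")
print(f"{100 * gain_over_all_tasks:.1f} points if fractions are of failures")
```

The two readings give about 16.8 and 12.7 points, which bracket the observed +14.1. That is consistent with the informal claim, though it is too loose to count as evidence of any specific routing accuracy.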
τ²-Bench is a customer service benchmark with 680 tasks across 5 domains (retail, airline, hotel, insurance, telecom). Each task requires multi-turn interaction with tool calls. The metric is task success rate (binary: did the agent achieve the correct outcome?).
TRACE more than doubles the gain of the next best method (GRPO at +6.1). That gap is remarkable. Let's understand why.
ToolSandbox tests tool-calling ability across 8 diverse scenarios (calendar management, file operations, email, etc.). The metric counts "perfect scores" — tasks where the agent gets everything right.
Figures 3 and 4 in the paper show how TRACE and baselines scale with the number of training rollouts. This is one of the most important results:
The scaling curves cross around 640 rollouts — below that, GRPO-on-target is competitive (since TRACE spends some of its budget on capability identification). Above 640, TRACE pulls away rapidly.
One concern with LLM-based capability identification: is it reproducible? Or does each run produce random capabilities? The paper runs the identification step 10 times independently. Result: the same 4 core capabilities are identified in every single run, though their exact names and descriptions vary slightly. The contrastive gaps are consistent to within ±0.05.
This stability is a strong signal. It means the capability deficits are real, objective properties of the model-environment pair, not artifacts of the LLM's wording choices.
TRACE pulls ahead after ~640 rollouts and the gap widens with more budget.
The paper ablates each component:
Every component matters. The full pipeline is greater than the sum of its parts.
TRACE represents a shift from "train harder" to "train smarter." Instead of throwing more compute at the same optimization problem, it decomposes the problem into subproblems that are individually easier. This is the same insight that powers mixture-of-experts models, modular neural networks, and curriculum learning — but applied to the meta-level problem of figuring out what to train on.
The most surprising finding is how few capabilities matter. Across 680 diverse customer service tasks, just 4 capabilities explain the vast majority of failures. This suggests that LLMs are closer to being great agents than raw benchmark numbers imply — they are not failing broadly, they are failing specifically, and those specific deficits can be surgically corrected.