TRACE — Veanors

Chapter 0: The Problem

You build an LLM-powered customer service agent. It can look up orders, check inventory, process refunds, transfer calls. You deploy it on 100 real customer interactions and it succeeds on 58 of them. The other 42 fail — wrong refund amounts, missing verification steps, hallucinated product details.

You want to improve this agent. The standard 2025–2026 approach is reinforcement learning: run the agent on many tasks, collect trajectory-level rewards (success = 1, fail = 0), and update the model weights via policy gradient methods like GRPO.

There is a fundamental problem with this approach. Trajectory-level rewards treat the agent as a monolithic black box. The reward says "this trajectory failed" but never says why. Consider two failure trajectories:

Failure A: The agent issued a refund without first checking the order status. The order had already been refunded. Result: double refund. The missing capability: precondition verification.
Failure B: The agent looked up the customer's account, found three recent orders, and tried to process a return on the wrong one because it confused the order IDs. The missing capability: structured data reasoning.

These are completely different failure modes that need completely different fixes. But a scalar reward of 0 treats them identically. Training with GRPO on all failures simultaneously conflates these signals. The model gets pulled in multiple directions at once, learning a muddled compromise that partially fixes all problems and fully fixes none.

The credit assignment gap: Standard RL tells you THAT the agent failed. It never tells you WHICH specific capability was missing. And without knowing which capability to fix, you cannot fix it efficiently. This is the problem TRACE solves: it decomposes "this agent is bad" into "this agent lacks capabilities C₁, C₂, C₃" and then trains each one independently.

There is a second problem too. Suppose you collect 500 failure trajectories and train on all of them with GRPO directly. The target environment is complex — each customer service task involves 5–15 tool calls, multiple conditional branches, and noisy real-world data. The reward signal is sparse (you only know at the very end whether the task succeeded). The RL optimization landscape is rugged.

What if instead of training on the complex target environment, you could isolate the missing capability and train on a simpler synthetic environment that tests only that capability? The synthetic environment is easier to learn from — shorter trajectories, denser rewards, no confounding variables. This is the second key idea in TRACE.

Why Trajectory Reward Fails at Credit Assignment

Each dot is a failure trajectory. Color = the actual missing capability (revealed by TRACE). A scalar reward of 0 cannot distinguish them.

Why is training directly with trajectory-level RL rewards on the target environment suboptimal for agentic tasks?

The scalar reward conflates failures from different missing capabilities into one signal, so the model gets pulled in multiple directions at once and learns a muddled compromise that fully fixes nothing
RL rewards are always noisy and cannot be trusted for fine-tuning
The agent cannot receive reward signals because it uses tool calling

Chapter 1: The Key Insight

Here is the core observation that makes TRACE work: not all failures are created equal. If you look at 100 failure trajectories, you will find that the same small set of capability deficits explains the vast majority of them.

The authors ran TRACE's capability identification step 10 independent times on the same dataset (Figure 2a in the paper). Each run uses a different random seed, different trajectory samples, different LLM prompting. And yet, across all 10 runs, the same 4 capabilities emerged every time:

Structured data reasoning — correctly extracting, comparing, and manipulating data from database lookups and API responses
Multi-step task completion — executing a sequence of dependent actions without dropping steps or losing context
Precondition verification — checking that required conditions are met before performing an action (e.g., verifying an order exists before refunding it)
Tool-calling precision — using the correct tool with the correct arguments in the correct format

This is not a coincidence. These capabilities are real bottlenecks in the agent's performance. The model already knows how to do many things well — it can hold conversations, follow instructions, generate coherent text. But these 4 specific capabilities are where it falls short on this particular task distribution.

The contrastive trick: How do you identify which capabilities matter? You compare successful trajectories with failed trajectories. If a capability is equally present (or absent) in both successes and failures, it is not the bottleneck — it is either already mastered or irrelevant. But if a capability is consistently present in successes and absent in failures, that is your signal. TRACE computes this as a contrastive gap: the error rate on failures minus the error rate on successes. A high contrastive gap means "this capability separates the winners from the losers."

Let's make this concrete. Suppose you analyze 50 success trajectories and 50 failure trajectories for the "precondition verification" capability:

In the 50 successes, 8 had precondition errors → ER⁺ = 8/50 = 0.16
In the 50 failures, 35 had precondition errors → ER⁻ = 35/50 = 0.70
Contrastive gap: Δ̂ = 0.70 − 0.16 = 0.54

A gap of 0.54 is enormous. It means precondition verification errors are 4.4x more common in failures than in successes. This capability is a major bottleneck.

Now compare with "greeting the customer politely":

ER⁺ = 0.02, ER⁻ = 0.03
Δ̂ = 0.01 — essentially zero

The model already greets customers fine. Training on this capability would be wasted effort.

Why this matters for training efficiency: Instead of training on all 100 trajectories with a single RL objective, TRACE identifies 3–4 specific capabilities and trains a separate adapter for each. Each adapter only needs to learn ONE thing. The synthetic training environment for each capability is simpler, the reward signal is denser, and the RL converges faster. This is why TRACE improves on the base model by +14.1 points on τ²-Bench, versus +6.1 for GRPO-on-target, even though each adapter updates only about 5.3% of parameters.

Contrastive Analysis: Success vs Failure

Left: success trajectories with capabilities present (green). Right: failure trajectories with capabilities lacking (red). The contrastive gap is the difference in error rates.

If a capability has ER⁺ = 0.45 and ER⁻ = 0.50, what does the contrastive gap of 0.05 tell you?

This capability is NOT a meaningful bottleneck: it fails at similar rates in both successes and failures, so fixing it would not significantly improve overall performance
This is a critical capability that needs immediate training
The model has already mastered this capability perfectly

Chapter 2: Formal Setup

The Agentic Environment

TRACE works with any agentic environment defined as a tuple (X, P, R, y) where:

X = the task space. Each x ∈ X is a specific task instance (e.g., "Customer wants to return order #4521")
P = the protocol. The rules of engagement — which tools are available, what actions the agent can take, the conversation format, system instructions
R = the reward function. Maps a trajectory τ to a scalar R(τ) ∈ {0, 1}. Binary: did the agent solve the task?
y = the target variable (ground truth answer or state the agent must achieve)

A trajectory τ is the full sequence of agent actions and environment responses: τ = (a₁, o₁, a₂, o₂, ..., a_T, o_T) where a_t is the agent's action at step t (a text response, a tool call) and o_t is the environment's observation (tool output, user reply).
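As a rough illustration, the tuple (X, P, R, y) and the trajectory definition could be represented with a few Python dataclasses. The class names, field names, tool names, and the `exact_match_reward` function are all assumptions made for this sketch and do not come from the paper. In particular, the reward here takes the target as an explicit second argument for convenience.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    action: str       # a_t: a text response or a tool call
    observation: str  # o_t: a tool output or a user reply


@dataclass
class Trajectory:
    task: str                                   # the task instance x
    steps: List[Step] = field(default_factory=list)


@dataclass
class Environment:
    tasks: List[str]                            # X: a finite sample of the task space
    protocol: Dict[str, list]                   # P: tools and interaction rules
    reward: Callable[[Trajectory, str], int]    # R: maps a trajectory to 0 or 1
    targets: Dict[str, str]                     # y: target state per task


def exact_match_reward(traj: Trajectory, target: str) -> int:
    # Hypothetical binary reward: 1 if the last observation equals the target state
    return int(bool(traj.steps) and traj.steps[-1].observation == target)


task = "Customer wants to return order #4521"
env = Environment(
    tasks=[task],
    protocol={"tools": ["get_order_details", "process_return"]},
    reward=exact_match_reward,
    targets={task: "order #4521 returned"},
)
traj = Trajectory(task=task, steps=[
    Step("get_order_details(4521)", "status: delivered"),
    Step("process_return(4521)", "order #4521 returned"),
])
score = env.reward(traj, env.targets[traj.task])
```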

What Is a Capability?

A capability c is a natural-language description of a specific skill the agent needs. Examples:

Structured Data Reasoning

The ability to correctly extract, compare, and manipulate data from structured API responses (JSON objects, database records, nested dictionaries). Includes correctly indexing into lists, comparing dates, and computing aggregates.

Precondition Verification

The ability to verify that all necessary conditions are satisfied before performing an irreversible action. Example: checking that an order has status "delivered" before processing a return, or confirming the customer's identity before revealing account details.

Tool-Calling Precision

The ability to call the correct API endpoint with correctly formatted arguments. Includes using the right parameter names, correct data types, and proper nesting of arguments.

Critically, capabilities are NOT predefined by the user. TRACE discovers them automatically from the trajectories. The user provides only the environment and a dataset of tasks.

The Capability-Targeted Synthetic Environment

For each identified capability deficit c, TRACE synthesizes a new environment E_c = (X_c, P_c, R_c, y_c) with three key properties:

Capability isolation: E_c primarily tests capability c. Success requires c and little else. This means RL training on E_c produces a clean gradient signal for learning c.
Interface preservation: P_c uses the same tool schemas, action space, and conversation format as the target environment P. The agent interacts with E_c the same way it interacts with the real environment. This ensures the learned capability transfers.
Verifiability: R_c can be computed automatically. No human evaluation needed. This is essential for RL training which needs thousands of reward evaluations.

The three subproblems of TRACE: Given a target environment E and a base model θ_base, TRACE decomposes the problem into: (1) Identify the set of capabilities {c₁, ..., c_K} the model lacks for E. (2) Synthesize environments {E_c1, ..., E_cK} that isolate each capability. (3) Train the model on each E_ci to produce K specialized adapters {θ₁, ..., θ_K}, then route each task to the best adapter at inference.

The Decomposition Advantage

Why decompose at all? Consider the alternative: train one adapter on the full target environment. The RL loss mixes gradients from all capability deficits. A batch of 64 trajectories might contain 20 failures due to structured data reasoning, 15 due to precondition verification, 10 due to tool-calling precision, and 19 due to other causes. The policy gradient averages over all of these, producing a noisy signal that partially addresses all capabilities and fully addresses none.

∇_θ J(θ) = E_τ~πθ [ Σ_t ∇_θ log π_θ(a_t|s_t) · Â(τ) ]

In the mixed-capability setting, Â(τ) conflates advantages from different capabilities. Two trajectories that fail for different reasons get the same negative advantage. The gradient points in a direction that compromises between fixing structured data reasoning and fixing precondition verification — and that compromise may not be optimal for either.

With TRACE's decomposition, each adapter θ_i trains on E_ci where all failures are about capability c_i. The advantage function Â(τ) is now pure signal: "did you do c_i correctly?" The gradient is clean and converges faster.

What are the three required properties of a capability-targeted synthetic environment E_c?

Capability isolation (tests mainly c), interface preservation (same tool schemas as target env), and verifiability (automatic reward computation)
Diversity (many task types), difficulty (harder than target), and novelty (unseen scenarios)
Simplicity (short trajectories), realism (identical to target), and human-evaluated rewards

Chapter 3: Contrastive Capability Identification

This is the first step of TRACE and arguably the most elegant. The goal: given a set of trajectories, automatically discover which capabilities the agent lacks. No human annotation. No predefined capability taxonomy. The capabilities emerge from the data.

Step 1: Split Trajectories

Run the base model on a dataset D of tasks. Each task produces a trajectory τ with reward R(τ) ∈ {0,1}. Split into:

D⁺ = successful trajectories (R = 1)
D⁻ = failed trajectories (R = 0)

In the paper's τ²-Bench experiments, the base model (Llama-3.1-8B-Instruct) achieves about 24% success rate on 680 tasks. That gives roughly D⁺ ≈ 163 successes and D⁻ ≈ 517 failures.

Step 2: Discovery Phase

Sample a batch of ~50 failure trajectories from D⁻. Feed them to an LLM (GPT-4.1 in the paper) with a prompt that says:

Discovery prompt (paraphrased): "Here are 50 trajectories where an agent failed at customer service tasks. Analyze these trajectories and identify 5–8 distinct, recurring capabilities that the agent lacks. Each capability should be: (a) specific enough to be testable, (b) general enough to appear across multiple tasks, (c) a skill deficit, not a one-off error. Return a list of capability names with descriptions."

The LLM produces a capability dictionary — a list of (name, description) pairs. In the paper, this typically yields 5–8 candidate capabilities. The discovery is run multiple times with different trajectory samples, and all discovered capabilities are pooled.

Step 3: Labeling Phase

Now comes the scoring. For every trajectory τ in D⁺ ∪ D⁻ and every candidate capability c, an LLM labels the trajectory as one of:

NA — this capability is not relevant to this trajectory (e.g., a task that never requires precondition checking)
PRESENT — the capability is relevant and the agent demonstrated it correctly
LACKING — the capability is relevant and the agent failed to demonstrate it

This is the most computationally expensive step. If you have 680 trajectories and 7 candidate capabilities, that is 4,760 labeling calls. The paper uses GPT-4.1 for this, which costs money but gives reliable labels.

Step 4: Compute Error Rates

For each capability c, compute two error rates. Let N⁺(c) = number of successes where c is relevant (not NA), and E⁺(c) = number of those where c is LACKING. Define N⁻(c) and E⁻(c) the same way over the failures:

ER⁺(c) = E⁺(c) / N⁺(c) (error rate on successes)

ER⁻(c) = E⁻(c) / N⁻(c) (error rate on failures)

Think of ER⁺(c) as the "background noise" — how often this capability fails even in trajectories that ultimately succeed. Maybe the agent makes a precondition error 16% of the time but recovers from it (corrects itself on the next turn). ER⁻(c) is the "failure signal" — how often this capability fails in trajectories that ultimately fail.
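A minimal sketch of how the three labels turn into error rates, assuming the labels for one capability are already available as a list per outcome group. The `Label` enum and `error_rate` function are illustrative names, and the toy counts reuse the precondition verification numbers from Chapter 1 with a few NA labels added to show that they are excluded.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    NA = "NA"
    PRESENT = "PRESENT"
    LACKING = "LACKING"


def error_rate(labels):
    """E(c) / N(c): LACKING count over the relevant (non-NA) count."""
    counts = Counter(labels)
    relevant = counts[Label.PRESENT] + counts[Label.LACKING]
    if relevant == 0:
        return None  # capability never relevant, so the rate is undefined
    return counts[Label.LACKING] / relevant


# Toy labels for one capability (hypothetical data)
success_labels = [Label.PRESENT] * 42 + [Label.LACKING] * 8 + [Label.NA] * 5
failure_labels = [Label.PRESENT] * 15 + [Label.LACKING] * 35 + [Label.NA] * 3

er_pos = error_rate(success_labels)  # background noise on successes
er_neg = error_rate(failure_labels)  # failure signal
gap = er_neg - er_pos
```

Returning `None` when a capability is never relevant is a design choice of this sketch; it avoids a division by zero and forces the caller to decide how to treat such candidates.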

Step 5: Contrastive Gap and Coverage

The contrastive gap is the key statistic:

Δ̂(c) = ER⁻(c) − ER⁺(c)

A high Δ̂(c) means: "when this capability fails, the whole trajectory fails." A low Δ̂(c) means: "this capability fails at similar rates regardless of outcome — it is not the bottleneck."

The coverage measures what fraction of failures involve this capability:

Ĉov(c) = N⁻(c) / |D⁻|

Coverage prevents TRACE from selecting capabilities that have a high contrastive gap but only affect 3% of failures. A capability that perfectly separates 2 failures from successes is not useful if you have 500 other failures to fix.

Step 6: Threshold and Select

TRACE retains capability c if:

Δ̂(c) ≥ δ = 0.20 AND Ĉov(c) ≥ ρ = 0.10

Let's walk through a concrete worked example with real-ish numbers from τ²-Bench:

Worked example — 5 candidate capabilities:

1. Structured data reasoning: N⁺=140, E⁺=21, N⁻=480, E⁻=312
ER⁺ = 21/140 = 0.15, ER⁻ = 312/480 = 0.65, Δ̂ = 0.50, Cov = 480/517 = 0.93 ✓

2. Multi-step task completion: N⁺=120, E⁺=18, N⁻=400, E⁻=220
ER⁺ = 0.15, ER⁻ = 0.55, Δ̂ = 0.40, Cov = 400/517 = 0.77 ✓

3. Precondition verification: N⁺=80, E⁺=13, N⁻=350, E⁻=245
ER⁺ = 0.16, ER⁻ = 0.70, Δ̂ = 0.54, Cov = 350/517 = 0.68 ✓

4. Tool-calling precision: N⁺=100, E⁺=12, N⁻=300, E⁻=150
ER⁺ = 0.12, ER⁻ = 0.50, Δ̂ = 0.38, Cov = 300/517 = 0.58 ✓

5. Polite tone maintenance: N⁺=150, E⁺=6, N⁻=490, E⁻=24
ER⁺ = 0.04, ER⁻ = 0.05, Δ̂ = 0.01, Cov = 490/517 = 0.95 ✓ on coverage but FAILS on δ

Result: Capabilities 1–4 selected. Capability 5 rejected. The model is already polite — training on politeness would waste budget.
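The worked example can be checked with a few lines of Python. This is only a sketch of the selection rule as described above, using the counts from the example and |D⁻| = 517; the function and variable names are assumptions.

```python
DELTA = 0.20      # contrastive gap threshold
RHO = 0.10        # coverage threshold
N_FAILURES = 517  # |D-| in the worked example

# name -> (N+, E+, N-, E-)
candidates = {
    "Structured data reasoning": (140, 21, 480, 312),
    "Multi-step task completion": (120, 18, 400, 220),
    "Precondition verification": (80, 13, 350, 245),
    "Tool-calling precision": (100, 12, 300, 150),
    "Polite tone maintenance": (150, 6, 490, 24),
}


def gap_and_coverage(n_pos, e_pos, n_neg, e_neg, n_failures=N_FAILURES):
    er_pos = e_pos / n_pos          # error rate on successes
    er_neg = e_neg / n_neg          # error rate on failures
    return er_neg - er_pos, n_neg / n_failures


scores = {name: gap_and_coverage(*counts) for name, counts in candidates.items()}
selected = [
    name for name, (gap, cov) in scores.items()
    if gap >= DELTA and cov >= RHO
]

for name, (gap, cov) in scores.items():
    status = "keep" if name in selected else "drop"
    print(f"{name}: gap={gap:.2f} cov={cov:.2f} -> {status}")
```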

Interactive Contrastive Gap Computation

Which capabilities are retained depends on the contrastive gap threshold (δ) and the coverage threshold (ρ). The paper uses δ=0.20 and ρ=0.10.

A candidate capability has ER⁻ = 0.72, ER⁺ = 0.68, and Cov = 0.85. Should TRACE select it?

No: the contrastive gap is only 0.04 (well below δ=0.20). Despite high coverage, this capability fails at nearly the same rate in successes and failures, so it is not a discriminating bottleneck
Yes: both ER⁻ and coverage are high, so this is clearly a major failure mode
Need more information: we must also check the number of trajectories

Chapter 4: The TRACE Pipeline

Now we assemble the pieces into the full TRACE pipeline, from raw trajectories to inference-time routing.

The four stages:
Stage 1 — Roll out base model, split into success/failure, discover capabilities via contrastive analysis.
Stage 2 — For each identified capability deficit, synthesize a verifiable training environment.
Stage 3 — Train a separate LoRA adapter on each synthetic environment using GRPO.
Stage 4 — At inference, route each new task to the most relevant adapter (or use the base model if no adapter applies).

Let's trace through each stage with actual numbers from the τ²-Bench experiments.

Stage 1: Identify

The base model (Llama-3.1-8B-Instruct) achieves 24.3% on τ²-Bench. TRACE rolls it out on 680 tasks: 165 successes, 515 failures. After contrastive analysis, 4 capabilities emerge with gaps ranging from 0.38 to 0.54.

Stage 2: Synthesize

For each of the 4 capabilities, TRACE prompts an LLM to generate a synthetic environment. The synthesis prompt includes: the capability description, the target environment's tool schemas, and example trajectories. The LLM produces Python code that implements a seeded task generator, transition logic, and an evaluation function. Each synthetic env produces unlimited training tasks on demand.

Stage 3: Train

Each capability gets its own LoRA adapter (rank 32, ~5.3% of parameters). Training uses GRPO with G=4 groups, K=4 rollouts per group per seed, AdamW optimizer (lr=1e−5), for 40 iterations on 4–8 A100 GPUs. Total: 640 rollouts per adapter (40 iterations × 4 groups × 4 rollouts).

Stage 4: Route

At inference, a new task comes in. The base model is shown the task description, the 4 capability descriptions, and a few example trajectories for each capability. It is asked: "Which capability is most relevant to this task? Choose one, or 'none' if the base model suffices." TRACE takes the argmax of the logit scores for each capability label token. The task is routed to the corresponding adapter (or the base model if "none" wins).

The Full TRACE Pipeline

The 4 stages of TRACE in order: trajectories get classified, capabilities identified, environments synthesized, adapters trained, and tasks routed.

End-to-End Numbers

After training 4 LoRA adapters (one per capability), TRACE achieves 38.4% on τ²-Bench, up from 24.3% — a +14.1 point improvement. For comparison:

GRPO on target environment: 30.4% (+6.1)
AWM (Agentic Workflow Memory): 28.1% (+3.8)
ADP (Agentic Data Pipeline): 27.9% (+3.6)
GEPA (prompt evolution): 26.8% (+2.5)

TRACE more than doubles the improvement of any baseline. And it does this with only 4 adapters, each updating 5.3% of parameters.

Why does TRACE train separate LoRA adapters per capability instead of one adapter on all capabilities?

Each adapter receives a clean gradient signal from an environment that isolates one capability, which avoids the conflated gradients from mixing multiple capability deficits in a single training run
Because LoRA adapters cannot hold more than one capability at a time
To save GPU memory by training smaller adapters sequentially

Chapter 5: Environment Synthesis

This is the second stage of TRACE and the most creative. Given a capability description (e.g., "structured data reasoning") and the target environment's interface, TRACE must generate a complete, verifiable training environment that isolates that capability.

What the Synthesis LLM Receives

The synthesis prompt includes:

The capability description — what skill to test, from the contrastive identification step
The target environment's tool schemas — the exact API signatures (function names, parameter types, return types) so the synthetic env uses the same interface
Example failure trajectories — 3–5 trajectories from D⁻ where this specific capability was labeled LACKING, so the LLM can see what failure looks like in context
Example success trajectories — 2–3 from D⁺ where the capability was PRESENT, showing what correct behavior looks like

What the Synthesis LLM Produces

The LLM generates Python code implementing three components:

Seeded Task Generator

A function generate_task(seed: int) → Task that creates a new task instance testing the target capability. Seeded for reproducibility. Can generate unlimited tasks by varying the seed. Each task includes the initial state, the correct answer, and any metadata needed for evaluation.

↓

Transition Logic

The environment dynamics: how the state changes in response to agent actions. Implements the same tool APIs as the target env (same function names, same parameter types). When the agent calls get_order_details(order_id), the synthetic env returns a realistic-looking JSON response constructed from the task's seed data.

↓

Evaluation Function

A deterministic function evaluate(trajectory, task) → {0, 1} that checks whether the agent correctly exercised the target capability. NOT just "did the agent get the right final answer" but "did the agent correctly perform the specific skill being tested."

Concrete Example: Structured Data Reasoning

Suppose the identified capability is "structured data reasoning." The synthesis LLM might produce:

python
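import random
from dataclasses import dataclass
from datetime import date, timedelta

PRODUCTS = ["laptop", "headphones", "monitor", "keyboard"]  # illustrative stand-in

@dataclass
class Task:  # illustrative container; the original snippet leaves it undefined
    orders: list
    question: object
    answer: float

def random_date(rng: random.Random) -> str:  # illustrative helper: ISO date in 2024
    return (date(2024, 1, 1) + timedelta(days=rng.randrange(366))).isoformat()

def generate_task(seed: int) -> Task:
    rng = random.Random(seed)  # same seed always yields the same task
    # Create a customer with 3-5 orders
    n_orders = rng.randint(3, 5)
    orders = []
    for i in range(n_orders):
        orders.append({
            "order_id": f"ORD-{seed}-{i}",
            "product": rng.choice(PRODUCTS),
            "price": round(rng.uniform(10, 500), 2),
            "status": rng.choice(["delivered", "shipped", "returned"]),
            "date": random_date(rng),
        })
    # Task: "What is the total spent on delivered orders?" (rounded to cents)
    target = round(sum(o["price"] for o in orders if o["status"] == "delivered"), 2)
    return Task(orders=orders, question=..., answer=target)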

The key is that this environment isolates structured data reasoning. The task is simple (look up orders, filter by status, sum prices) but requires correctly parsing JSON, filtering on a field, and computing an aggregate. There is no precondition verification, no complex multi-step planning. Just the one capability being trained.
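To pair with a generator like this, an evaluation function might look roughly as follows. This is a hedged sketch, not the paper's code: the trajectory format (a list of dicts with `tool` or `final_answer` keys), the lookup requirement, and the tolerance are all assumptions chosen to illustrate checking the skill rather than only the final number.

```python
def evaluate(trajectory, orders, tolerance=0.01):
    """Return 1 if the agent looked up the orders and reported the correct
    total spent on delivered orders, else 0."""
    expected = sum(o["price"] for o in orders if o["status"] == "delivered")

    final_answer = None
    looked_up_before_answer = False
    seen_lookup = False
    for step in trajectory:
        if step.get("tool") == "get_order_details":
            seen_lookup = True
        if "final_answer" in step:
            final_answer = step["final_answer"]
            looked_up_before_answer = seen_lookup

    if final_answer is None or not looked_up_before_answer:
        return 0  # no answer, or an answer given without consulting the data
    return int(abs(final_answer - expected) <= tolerance)


# Hypothetical task data and trajectories
orders = [
    {"price": 100.0, "status": "delivered"},
    {"price": 50.0, "status": "delivered"},
    {"price": 30.0, "status": "returned"},
]
correct = [{"tool": "get_order_details"}, {"final_answer": 150.0}]
summed_everything = [{"tool": "get_order_details"}, {"final_answer": 180.0}]
guessed = [{"final_answer": 150.0}]
```

Here `summed_everything` fails because the agent did not filter on status, and `guessed` fails even with the right number because no lookup preceded the answer.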

Interface Preservation Is Critical

If the target environment exposes tools like get_order_details(order_id) and process_refund(order_id, amount), the synthetic environment must expose the SAME tools with the SAME signatures. The agent should not be able to distinguish whether it is talking to the real environment or the synthetic one.

Why? Because LoRA adapters modify the model's behavior within the same input distribution. If the synthetic env uses different tool names or parameter formats, the adapter learns to use those different names — and that skill does not transfer to the real environment. Interface preservation is what makes transfer possible.

Quality control: Synthesized environments are not always correct on the first try. The paper uses a validation step: run the base model on 50 generated tasks. If the pass rate is 0% or 100%, the environment is too hard or too easy. The sweet spot is 20–60% base model pass rate, which provides both positive and negative examples for RL.
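The validation check can be sketched as a small function. The thresholds come from the text above; how to treat pass rates that are neither degenerate nor inside the 20–60% band is not stated there, so the fourth outcome below is an assumption, as is the function name.

```python
def classify_environment(pass_count, n_tasks=50, low=0.20, high=0.60):
    """Classify a synthesized environment by the base model's pass rate."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    rate = pass_count / n_tasks
    if rate == 0.0:
        return "too hard"       # no positive examples for RL
    if rate == 1.0:
        return "too easy"       # no negative examples for RL
    if low <= rate <= high:
        return "sweet spot"
    return "outside sweet spot"  # assumption: flagged for review, not rejected


results = {passes: classify_environment(passes) for passes in (0, 20, 45, 50)}
```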

Why must the synthetic environment preserve the target environment's tool schemas (same function names, parameters, return types)?

Because LoRA adapters modify behavior within the same input distribution: if the synthetic env uses different tool names, the learned skill will not transfer to the real environment
To make the synthetic environment cheaper to create
Because the LLM cannot learn new tool interfaces

Chapter 6: RL Training with GRPO

Once we have synthetic environments, we need to train adapters on them. TRACE uses GRPO (Group Relative Policy Optimization) — a variant of policy gradient methods designed for language model fine-tuning. Let's derive how GRPO works and why it is used here.

The REINFORCE Foundation

Standard REINFORCE computes the policy gradient as:

∇_θ J = E_τ~πθ [ R(τ) · ∇_θ log π_θ(τ) ]

where R(τ) is the trajectory reward and π_θ(τ) = ∏_t π_θ(a_t|s_t) is the probability of the trajectory under policy θ. The problem: this gradient estimate has high variance, because the raw reward is not compared against any baseline. A reward of 1 on one trajectory does not tell you whether that trajectory was lucky or genuinely good.

GRPO's Solution: Within-Group Normalization

GRPO reduces variance by computing relative advantages within a group of rollouts from the same seed. For each seed (task instance), GRPO runs K rollouts and normalizes rewards within the group:

Â_i = (R_i − μ_group) / σ_group

where μ_group and σ_group are the mean and standard deviation of rewards within the group. This makes the advantage relative: "how much better was this rollout than the average rollout on the same task?"

Let's work through a concrete example. Suppose we have one task with K=4 rollouts:

Worked example — within-group normalization:

Task seed 42: "Customer wants to return order ORD-42-2"
Rollout 1: Correctly checks order status, processes return. R₁ = 1
Rollout 2: Forgets to check status, processes return on cancelled order. R₂ = 0
Rollout 3: Checks status, but uses wrong refund amount. R₃ = 0
Rollout 4: Perfect execution. R₄ = 1

Rewards: [1, 0, 0, 1] → μ = 0.50, σ = 0.50
Advantages: Â = [(1-0.5)/0.5, (0-0.5)/0.5, (0-0.5)/0.5, (1-0.5)/0.5] = [+1.0, -1.0, -1.0, +1.0]

Rollouts 1 and 4 get positive advantage (reinforce these behaviors).
Rollouts 2 and 3 get negative advantage (suppress these behaviors).
The advantage is relative to THE SAME TASK, not across different tasks.
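The normalization in the worked example can be reproduced in a few lines. This sketch uses the population standard deviation, which is what gives σ = 0.50 for rewards [1, 0, 0, 1]. Returning zero advantages when every rollout in a group has the same reward is an assumption of this sketch, not something the text specifies.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts from the same task seed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, matching the worked example
    if sigma < eps:
        # All rollouts scored the same: the group carries no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


advantages = group_advantages([1, 0, 0, 1])
no_signal = group_advantages([0, 0, 0, 0])
```

The zero-variance case matters in practice: a task the model always solves or always fails contributes nothing to the update, which is one reason environments with intermediate pass rates are useful for training.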

The Full GRPO Objective

GRPO uses a clipped objective similar to PPO, but without a separate value function:

L_GRPO = − E_group [ Σ_i=1..K min( r_i(θ) · Â_i, clip(r_i(θ), 1−ε, 1+ε) · Â_i ) ] + β · KL(π_θ || π_ref)

where r_i(θ) = π_θ(τ_i) / π_old(τ_i) is the importance ratio, ε is the clip range (typically 0.2), and the KL term prevents the model from drifting too far from the reference policy (the base model).
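A minimal numeric sketch of the clipped objective for a single group, written in plain Python so the clipping behavior is easy to inspect. It takes trajectory log-probabilities and a precomputed KL estimate as inputs; how those are obtained from a model is outside the sketch, and the function names are assumptions.

```python
import math


def clipped_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one rollout."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_group_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.0, kl=0.0):
    """Loss for one group: negative sum of clipped terms plus the KL penalty."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio r_i(theta)
        total += clipped_term(ratio, adv, eps)
    return -total + beta * kl


# With a positive advantage, the gain from raising the ratio is capped at 1 + eps
capped_gain = clipped_term(1.5, 1.0)
# With a negative advantage, the unclipped (more pessimistic) term is kept
uncapped_penalty = clipped_term(1.5, -1.0)
# Identical old and new policies with advantages that sum to zero give zero loss
loss_at_start = grpo_group_loss([-2.0] * 4, [-2.0] * 4, [1.0, -1.0, -1.0, 1.0])
```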

TRACE's Training Configuration

The paper uses these specific hyperparameters:

LoRA Config

Rank 32, applied to attention layers. ~5.3% of parameters updated per adapter. One adapter per capability.

GRPO Config

G = 4 groups per iteration, K = 4 rollouts per group. That is 16 rollouts per iteration. 40 iterations total = 640 rollouts per adapter.

Optimizer

AdamW, learning rate 1e−5, linear warmup over 5 iterations. 4–8 A100 GPUs.

The total training budget per adapter is 640 rollouts. With 4 adapters, that is 2,560 rollouts total. Compare with GRPO-on-target which uses the same 2,560 rollouts but trains a single adapter on the mixed-capability target environment — and achieves worse results.
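The budget arithmetic, spelled out as a quick check (variable names are illustrative):

```python
ITERATIONS = 40
GROUPS_PER_ITERATION = 4
ROLLOUTS_PER_GROUP = 4
NUM_ADAPTERS = 4

rollouts_per_iteration = GROUPS_PER_ITERATION * ROLLOUTS_PER_GROUP
rollouts_per_adapter = ITERATIONS * rollouts_per_iteration
total_rollouts = NUM_ADAPTERS * rollouts_per_adapter

print(rollouts_per_iteration, rollouts_per_adapter, total_rollouts)
```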

Why LoRA Per Capability Beats Multi-Capability Training

Table 3 in the paper tests "merged" training (one adapter trained on all 4 synthetic environments mixed together). The merged adapter achieves only +8.4 points vs TRACE's +14.1 points. Why?

Because merged training suffers from the same gradient conflation as GRPO-on-target. Within a single training batch, some rollouts are about structured data reasoning, others about precondition verification. The gradient is an average over these different skills. The model learns a compromise that partially addresses all capabilities but excels at none.

The analogy: Imagine training for a triathlon by doing all three sports simultaneously — swimming while cycling while running (somehow). You would be terrible. It is much better to train swimming, cycling, and running separately, each with focused practice, and then switch between them on race day. TRACE's per-capability adapters + routing is exactly this: specialize in each capability, then select the right specialist for each task.

GRPO Training Curves per Capability

Each curve shows one LoRA adapter's reward over 40 training iterations. The dashed line shows the base model's performance.

In GRPO, why are advantages computed within groups of rollouts from the same task seed, rather than across all tasks?

Within-group normalization makes the advantage relative to the same task ("how much better than average on THIS specific task"), which reduces variance from task difficulty differences
To reduce the number of rollouts needed per training step
Because different tasks have incompatible reward scales

Chapter 7: Routing & Composition

After training K adapters (one per identified capability), TRACE has K specialized models plus the base model. At inference time, a new task arrives and TRACE must decide: which adapter (if any) should handle this task?

Why Routing Beats Merging

The obvious alternative is to merge all adapters into one model. LoRA adapter merging is well-studied: you can add the adapter weight matrices. But Table 3 in the paper shows this performs worse than routing:

Merge All Adapters

τ²-Bench: 32.7% (+8.4 over base)
The merged model is a compromise. Adapter weights that improve structured data reasoning may interfere with adapter weights that improve precondition verification. They occupy overlapping subspaces.

Route to Best Adapter

τ²-Bench: 38.4% (+14.1 over base)
Each adapter is used in isolation on the tasks where its capability is most relevant. No interference between capabilities. +5.7 points over merging.

The Routing Mechanism

TRACE uses a surprisingly simple routing approach — it reuses the base model itself as the router. Here is how:

Construct a prompt that includes:
- The incoming task description
- Each capability's name and description
- 2–3 example trajectories per capability (showing what that capability looks like in practice)
- A "none" option for tasks the base model can already handle
Feed this prompt to the base model (not any adapter)
Look at the logit scores for the label tokens ("A", "B", "C", "D", "none")
Take the argmax — route to the adapter with the highest logit
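The final two steps can be sketched as follows, assuming the label-token logits have already been read from the base model's single forward pass. Extracting them depends on the inference stack and is not shown. The mapping from label letters to adapters is hypothetical.

```python
# Hypothetical mapping from label tokens to trained adapters
ADAPTERS = {
    "A": "structured_data_reasoning",
    "B": "multi_step_task_completion",
    "C": "precondition_verification",
    "D": "tool_calling_precision",
}


def route(label_logits, adapters=ADAPTERS):
    """Pick the label with the highest logit.

    Returns the adapter name, or None when "none" wins, meaning the base
    model should handle the task directly.
    """
    best_label = max(label_logits, key=label_logits.get)
    return adapters.get(best_label)


needs_adapter = route({"A": 1.2, "B": 0.3, "C": 2.7, "D": -0.5, "none": 0.9})
base_is_enough = route({"A": 0.1, "B": 0.2, "C": 0.0, "D": -1.0, "none": 1.5})
```

Because only the argmax matters, the logits do not need to be converted to probabilities before routing.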

Why use the base model for routing? The base model has broad knowledge about what capabilities entail — it understands what "structured data reasoning" means and can assess whether a given task requires it. The adapted models are specialists: they are better at doing structured data reasoning but not necessarily better at recognizing when it is needed. The base model is the generalist that triages; the adapters are the specialists that treat.

The "None" Option

Not every task requires a specialized adapter. Some tasks are straightforward enough that the base model handles them correctly. TRACE includes a "none" option in the routing prompt. If the base model assigns the highest logit to "none," the base model handles the task directly.

This is important for two reasons:

Avoiding regression: An adapter trained for structured data reasoning might perform worse on tasks that require creative conversation (because the LoRA weights slightly modify the model's conversational ability). Routing to "none" prevents this.
Efficiency: Why load and run an adapter if the base model can already handle the task?

Routing Accuracy

The paper does not report routing accuracy directly, but the +14.1 point improvement implies the router is quite accurate. If routing were random, you would expect much smaller gains (each adapter only helps on its specific capability subset).

An informal analysis: suppose TRACE routes 60% of tasks to the best adapter, and each adapter improves its capability by ~35% on relevant tasks. The expected gain from one capability is then 0.60 × 0.35 × (fraction of tasks that need that capability). Summed over the 4 capabilities, which together cover ~80% of failures, this lands in the neighborhood of the observed +14.1 point gain.
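
To make the back-of-envelope estimate concrete, the sketch below plugs in the numbers. It rests on assumptions the text leaves open: the ~35% lift is read as an absolute change in success rate, and the ~80% coverage is tried both as a share of all tasks and as a share of the tasks the base model fails (1 − 0.243, using the 24.3% base success rate reported in Chapter 8). `expected_gain` is a hypothetical helper.

```python
def expected_gain(routing_accuracy: float, adapter_lift: float,
                  coverage: float, pool: float = 1.0) -> float:
    """Expected gain in overall success rate, as a fraction of all tasks.

    pool is the share of all tasks that `coverage` is measured against.
    """
    return routing_accuracy * adapter_lift * coverage * pool


# Reading 1 (assumption): the 4 capabilities cover ~80% of ALL tasks.
upper = expected_gain(0.60, 0.35, 0.80)

# Reading 2 (assumption): they cover ~80% of FAILED tasks only,
# and the base model fails 1 - 0.243 of tasks.
lower = expected_gain(0.60, 0.35, 0.80, pool=1 - 0.243)

print(f"{100 * lower:.1f} to {100 * upper:.1f} points")
```

Under either reading the estimate is the same order of magnitude as the reported +14.1, which is all the informal argument claims.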

Logit routing vs. generation routing: TRACE does NOT generate a full response from the base model to decide the route (that would double inference cost). Instead, it feeds a classification prompt and reads the raw logits for a fixed set of label tokens. This is a single forward pass — fast and cheap. The classification prompt with examples gives the model enough context to make a good decision without needing chain-of-thought.

Why does TRACE include a "none" option in the routing mechanism?

- To avoid regression on tasks the base model already handles well: specialized adapters can hurt performance on tasks outside their capability because LoRA weights slightly alter general conversational ability
- Because the routing model cannot always determine which capability is needed
- To save money on API calls when the task is easy

Chapter 8: Results & Scaling

Main Results: τ²-Bench

τ²-Bench is a customer service benchmark with 680 tasks across 5 domains (retail, airline, hotel, insurance, telecom). Each task requires multi-turn interaction with tool calls. The metric is task success rate (binary: did the agent achieve the correct outcome?).

τ²-Bench results (task success rate, gain over base in parentheses):
Base Model (Llama-3.1-8B-Instruct): 24.3%
ADP (Agentic Data Pipeline): 27.9% (+3.6)
AWM (Agentic Workflow Memory): 28.1% (+3.8)
GRPO on Target: 30.4% (+6.1)
GEPA (prompt evolution): 26.8% (+2.5)
TRACE (this paper): 38.4% (+14.1)

TRACE more than doubles the gain of the next best method (GRPO at +6.1). That gap is remarkable. Let's understand why.
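
As a quick check of the "more than doubles" claim, the gains can be recomputed from the success rates above. The snippet only uses the reported percentages; the variable names are illustrative.

```python
BASE = 24.3  # base model success rate on tau^2-Bench (%)

scores = {
    "ADP": 27.9,
    "AWM": 28.1,
    "GRPO on target": 30.4,
    "GEPA": 26.8,
    "TRACE": 38.4,
}

# Gain over the base model, in percentage points.
gains = {name: round(score - BASE, 1) for name, score in scores.items()}

# Strongest baseline, i.e. the best method other than TRACE.
best_baseline = max((n for n in gains if n != "TRACE"), key=gains.get)

# How many times larger TRACE's gain is than the best baseline's gain.
ratio = gains["TRACE"] / gains[best_baseline]
print(best_baseline, gains[best_baseline], f"{ratio:.2f}x")
```

The ratio comes out a little above 2.3, so TRACE's gain is more than twice that of GRPO on target.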

Main Results: ToolSandbox

ToolSandbox tests tool-calling ability across 8 diverse scenarios (calendar management, file operations, email, etc.). The metric counts "perfect scores" — tasks where the agent gets everything right.

ToolSandbox results:
Base model: 0 perfect scores (out of ~60 tasks)
GRPO on target: +3 perfect scores
TRACE: +7 perfect scores

TRACE also achieves the highest milestone completions (partial credit) with +16 over base.

Scaling Behavior

Figures 3 and 4 in the paper show how TRACE and baselines scale with the number of training rollouts. This is one of the most important results:

GRPO on target shows diminishing returns. Going from 1,280 to 2,560 rollouts adds only ~1.5 points. The mixed-capability gradients hit a ceiling.
TRACE shows increasing returns in the same range. Going from 1,280 to 2,560 rollouts adds ~4 points. Each additional rollout on a focused capability env produces more useful learning signal.

The scaling curves cross around 640 rollouts — below that, GRPO-on-target is competitive (since TRACE spends some of its budget on capability identification). Above 640, TRACE pulls away rapidly.

Stability Analysis (Figure 2a)

One concern with LLM-based capability identification: is it reproducible? Or does each run produce random capabilities? The paper runs the identification step 10 times independently. Result: the same 4 core capabilities are identified in every single run, though their exact names and descriptions vary slightly. The contrastive gaps are consistent to within ±0.05.

This stability is a strong signal. It means the capability deficits are real, objective properties of the model-environment pair, not artifacts of the LLM's wording choices.

Ablation: What Happens Without Each Stage?

The paper ablates each component:

Without contrastive identification (use manually specified capabilities): −4.2 points. Human-specified capabilities miss subtle bottlenecks.
Without synthetic envs (train adapters on target env directly, one per manually identified capability): −6.8 points. Target env is too complex for focused learning.
Without routing (merge adapters instead): −5.7 points. Adapter interference degrades performance.
Without per-capability adapters (a single adapter trained on all synthetic envs together): −5.9 points. Gradient conflation returns.

Every component matters: removing any one of them costs between 4.2 and 6.8 points, so no single stage accounts for the gain on its own.
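
The ablation deltas can be turned into implied scores with a little arithmetic. This sketch assumes each delta is measured on τ²-Bench relative to the full pipeline's 38.4%, which the list above does not state explicitly; the dictionary keys are shorthand labels, not the paper's names.

```python
TRACE_FULL = 38.4      # full pipeline success rate (%)
GRPO_ON_TARGET = 30.4  # strongest baseline (%)

# Drops reported in the ablation list (percentage points).
drops = {
    "no contrastive identification": 4.2,
    "no synthetic envs": 6.8,
    "no routing (merged adapters)": 5.7,
    "no per-capability adapters": 5.9,
}

# ASSUMPTION: every drop is relative to TRACE_FULL on the same benchmark.
ablated = {name: round(TRACE_FULL - drop, 1) for name, drop in drops.items()}

for name, score in ablated.items():
    margin = round(score - GRPO_ON_TARGET, 1)
    print(f"{name}: {score}% ({margin:+} vs GRPO on target)")
```

If that assumption holds, every ablated variant still lands above GRPO on target, with the synthetic environments accounting for the largest single drop.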

Why does TRACE show increasing returns with more rollouts while GRPO-on-target shows diminishing returns?

- Each additional rollout in TRACE's focused capability environment produces a clean learning signal for one specific skill, while GRPO-on-target's additional rollouts face mixed-capability gradients that hit a ceiling as the easy improvements are exhausted
- Because TRACE uses a more efficient optimizer than GRPO
- Because GRPO on target runs out of training data sooner

Chapter 9: Connections

Related Work

ReAct (Yao et al., 2023)

Interleaves reasoning and acting in LLM agents. TRACE assumes agents already use ReAct-style interaction and focuses on improving the underlying model capabilities, not the prompting strategy.

LATS (Zhou et al., 2024)

Language Agent Tree Search — uses MCTS-style search at inference time. TRACE improves the base model so search is less necessary. They are complementary: you could apply LATS on top of a TRACE-trained model.

τ²-Bench (Barres et al., 2025)

The customer service benchmark TRACE evaluates on. TRACE is the first method to achieve >35% on this benchmark with an 8B model.

LoRA (Hu et al., 2022)

Low-rank adaptation — the parameter-efficient fine-tuning method TRACE uses. TRACE's innovation is using separate LoRA adapters per capability and routing between them, rather than one adapter for all capabilities.

GRPO (Shao et al., 2024)

Group Relative Policy Optimization — the RL algorithm TRACE uses for training. TRACE does not modify GRPO itself; it improves what GRPO trains on (focused synthetic envs vs. noisy target env).

GEPA (Agrawal et al., 2026)

Reflective prompt evolution — optimizes prompts via trace analysis. TRACE trains weights instead of prompts. On τ²-Bench, TRACE (+14.1) significantly outperforms GEPA (+2.5), suggesting weight-level adaptation is more powerful for capability deficits.

Limitations

LLM-dependent capability identification: The quality of discovered capabilities depends on the LLM used for analysis (GPT-4.1). A weaker model might miss subtle capabilities or hallucinate non-existent ones.
Synthetic environment fidelity: Synthesized environments may not perfectly capture the complexity of the target environment. The paper shows transfer works well, but the gap between synthetic and real performance is not zero.
Binary rewards only: TRACE currently works with binary success/failure. Extending to continuous or multi-dimensional rewards is future work.
Routing overhead: Each inference requires a routing step (one additional forward pass). For latency-critical applications, this adds delay.
Tested on 8B models only: The paper uses Llama-3.1-8B-Instruct. Whether the same approach helps 70B+ models (which may already have these capabilities) is unknown.

Cheat Sheet

Contrastive gap: Δ̂(c) = ER⁻(c) − ER⁺(c)

Coverage: Ĉov(c) = N⁻(c) / |D⁻|

Selection thresholds: Δ̂(c) ≥ 0.20 AND Ĉov(c) ≥ 0.10

GRPO advantage: Â_i = (R_i − μ_group) / σ_group

Training budget: G=4 groups × K=4 rollouts × 40 iterations = 640 rollouts/adapter

LoRA rank: 32 (~5.3% of 8B params)

Routing: argmax of base model logits over capability label tokens

Key result: +14.1 on τ²-Bench, +7 perfect on ToolSandbox
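
The cheat-sheet formulas translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the inputs ER⁻(c), ER⁺(c), N⁻(c), and |D⁻| are taken as already measured, and the use of the population standard deviation plus a small `eps` for all-equal groups is an assumption rather than a detail confirmed by the text.

```python
from statistics import mean, pstdev
from typing import List

GAP_THRESHOLD = 0.20       # minimum contrastive gap
COVERAGE_THRESHOLD = 0.10  # minimum coverage


def contrastive_gap(er_neg: float, er_pos: float) -> float:
    """Delta(c) = ER^-(c) - ER^+(c)."""
    return er_neg - er_pos


def coverage(n_neg: int, num_failures: int) -> float:
    """Cov(c) = N^-(c) / |D^-|."""
    return n_neg / num_failures


def is_selected(er_neg: float, er_pos: float, n_neg: int, num_failures: int) -> bool:
    """A capability is kept only if it clears BOTH thresholds."""
    return (contrastive_gap(er_neg, er_pos) >= GAP_THRESHOLD
            and coverage(n_neg, num_failures) >= COVERAGE_THRESHOLD)


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """A_i = (R_i - mu_group) / sigma_group for one group of rollouts.

    ASSUMPTION: population std, with eps guarding groups whose rewards are all equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Training budget: G groups x K rollouts per group x iterations.
ROLLOUTS_PER_ADAPTER = 4 * 4 * 40
TOTAL_ROLLOUTS = 4 * ROLLOUTS_PER_ADAPTER  # 4 adapters, one per capability
```

With binary rewards, a group in which every rollout succeeds or every rollout fails yields zero advantage for all members, so only mixed groups contribute a learning signal.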

The Big Picture

TRACE represents a shift from "train harder" to "train smarter." Instead of throwing more compute at the same optimization problem, it decomposes the problem into subproblems that are individually easier. This is the same insight that powers mixture-of-experts models, modular neural networks, and curriculum learning — but applied to the meta-level problem of figuring out what to train on.

The most surprising finding is how few capabilities matter. Across 680 diverse customer service tasks, just 4 capabilities explain the vast majority of failures. This suggests that LLMs are closer to being great agents than raw benchmark numbers imply — they are not failing broadly, they are failing specifically, and those specific deficits can be surgically corrected.

What is the most surprising empirical finding of TRACE, and what does it imply about LLM agent capabilities?

- The same 4 core capabilities are consistently identified across 10 independent runs, implying that LLM agents fail specifically (not broadly) and these few specific deficits can be surgically corrected
- That GRPO is fundamentally broken and should be replaced entirely
- That 8B models are too small for agentic tasks

TRACE: Capability-Targeted Agentic Training