τ²-Bench — Veanors

Chapter 0: The Problem

You call your phone company because your mobile data stopped working. The support agent looks up your account, sees everything is fine on their end, and asks: "Could you check if airplane mode is on?" You open Settings, toggle airplane mode off, and report back. The agent then asks you to toggle mobile data. You do. It works.

Notice what happened: both of you took actions. The agent queried your account in their CRM system. You modified your phone's settings. Neither of you could do the other's job. You were collaborating on a shared problem where the environment (your phone service) depended on both of your actions.

The blind spot in every agent benchmark: Existing benchmarks like τ-bench, WebArena, and SWE-Bench test agents in single-control environments. Only the agent uses tools. The user just types messages — a passive information provider. But real customer support, IT troubleshooting, and technical guidance all require the user to actively do things: restart devices, change settings, run commands. No benchmark tested this — until τ²-bench.

The gap is not just theoretical. When you give an agent all the tools (single-control), it scores 52% on telecom tasks. When you force it to guide a user who holds half the tools (dual-control), it drops to 34%. That 18-point gap is pure coordination and communication failure — the agent knows the answer but cannot get the user to execute it.

Single-Control vs. Dual-Control

In single-control, the agent does everything. In dual-control, agent and user each have their own tools acting on a shared environment.

Concept: Real-world agent tasks often involve dual-control — both the agent and the user can modify shared state through their own tools. Realization: τ²-bench formalizes this as a Dec-POMDP (Decentralized Partially Observable Markov Decision Process) and builds a telecom support domain where the agent has backend CRM tools, the user has phone settings tools, and both must coordinate through natural language to solve problems.

Why do existing agent benchmarks fail to capture the difficulty of real customer support interactions?

Because they are single-control: only the agent uses tools while the user passively provides information, so they never test whether the agent can guide a user who must actively modify shared state
Because agents cannot use tools at all
Because the tasks are too easy

Chapter 1: The Dec-POMDP Insight

Why is guiding a user so much harder than doing the task yourself? The answer lies in a concept from multi-agent decision theory: the Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

In a regular POMDP, one agent acts on a partially observable world. In a Dec-POMDP, two agents act on the same world, but each sees only their own slice of it. Neither can observe the other's tools, databases, or internal state. They can only communicate through messages.

Here is the critical asymmetry in τ²-bench's Dec-POMDP:

Property	Agent	User
Tools	Backend CRM: read/write customer records, enable services, check line details	Phone settings: toggle airplane mode, toggle data, check status bar, restart phone
Database	Customer profiles, subscription plans, line configs	Device state: SIM status, airplane mode, battery, data enabled, signal
Observability	Sees CRM data + user messages	Sees phone screen + agent messages
Planning	Must diagnose root cause and orchestrate solution	Reacts to agent instructions (does not plan independently)

Key insight: The user is not an adversary and not a co-equal planner. The user is a reactive tool executor — they follow instructions but cannot diagnose the problem themselves. This creates complexity asymmetry: the agent must think and communicate, while the user just follows and reports. The hard part is not solving the problem — it is explaining the solution steps to someone who cannot see your side of the world.
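The observability split in the table above can be sketched in a few lines. This is a toy model, not the benchmark's code: the `World` class, its field names, and the `observe` helper are all hypothetical, chosen only to show that each player sees its own database plus the shared message log.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    # Hypothetical toy state: one private slice per player, one shared log.
    crm: dict = field(default_factory=lambda: {"line_status": "active", "roaming_enabled": False})
    phone: dict = field(default_factory=lambda: {"airplane_mode": True, "data_enabled": True})
    messages: list = field(default_factory=list)


def observe(world: World, player: str) -> dict:
    """Return only what `player` can see: its own database and the messages."""
    if player == "agent":
        own = world.crm
    elif player == "user":
        own = world.phone
    else:
        raise ValueError(f"unknown player: {player}")
    return {"own_state": dict(own), "messages": list(world.messages)}


world = World()
world.messages.append(("user", "My mobile data isn't working."))

agent_view = observe(world, "agent")
user_view = observe(world, "user")

print("airplane_mode" in agent_view["own_state"])  # False: the agent cannot see the phone
print("airplane_mode" in user_view["own_state"])   # True: the user can
```

The agent can only learn that airplane mode is on by asking, which is why the message log is the one thing both views share.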

Dec-POMDP Information Flow

Both players act on shared state but observe different slices. The only bridge is natural language.

The Dec-POMDP framing gives τ²-bench a formal structure for measuring exactly where agents fail. If the agent succeeds in no-user mode (where it controls all tools) but fails in dual-control mode, the failure is purely in communication and coordination — not reasoning.

In τ²-bench's Dec-POMDP, the user is modeled as a reactive tool executor rather than an independent planner. Why is this design choice critical?

Because it makes the benchmark easier
It preserves complexity asymmetry: the agent must both solve the problem AND communicate the solution to a user who cannot diagnose independently, which is what makes real tech support hard
Because users cannot use tools in real life

Chapter 2: The Telecom Domain

The paper introduces a new telecom technical support domain. A customer calls in with a phone problem — mobile data not working, MMS failing, service disconnected. The agent has backend access to the customer's account. The user has a mocked phone with settings they can toggle.

Here are the actual tool schemas, split by who controls them:

Agent tools (13 total: 7 write, 6 read):

python
# READ tools — agent can inspect the backend
get_customer_by_id(customer_id: str) → Customer
get_customer_by_name(full_name: str) → Customer
search_customers(query: str) → List[Customer]
get_line_details(line_id: str) → LineDetails
get_plan_details(plan_id: str) → PlanDetails
get_available_plans() → List[Plan]

# WRITE tools — agent can modify the backend
enable_roaming(customer_id, line_id) → str
disable_roaming(customer_id, line_id) → str
activate_line(customer_id, line_id) → str
suspend_line(customer_id, line_id) → str
change_plan(customer_id, line_id, plan_id) → str
add_service(customer_id, line_id, service) → str
transfer_to_human(reason: str) → str

User tools (30 total: 15 write, 15 read):

python
# READ tools — user checks their phone
get_status_bar() → str     # "📶 Excellent | 5G | Data ON | 🔋 80%"
check_airplane_mode() → str
check_wifi_status() → str
check_data_status() → str
check_signal_strength() → str
open_browser() → str     # tests if internet works
send_test_mms() → str    # tests if MMS works
... (8 more read tools)

# WRITE tools — user modifies phone settings
toggle_airplane_mode() → str
toggle_data() → str
toggle_wifi() → str
restart_phone() → str
reset_network_settings() → str
toggle_mms() → str
... (9 more write tools)

The crucial design: User tools return human-readable strings, not structured data. When the user calls get_status_bar(), they get a status bar emoji string like a real phone screen — not a JSON object. This constrains the user simulator to behave like a real person reporting what they see, not like an API client parsing structured data.

The agent's database stores structured customer records:

toml
[[customers]]
customer_id = "C1001"
full_name = "John Smith"
date_of_birth = "1985-06-15"
phone_number = "555-123-2002"

[[lines]]
line_id = "L1002"
customer_id = "C1001"
plan_id = "PLAN_PREMIUM"
roaming_enabled = false
status = "active"

The user's database is a mocked phone device:

toml
[device]
sim_card_status = "active"
airplane_mode = false
battery_level = 80
data_enabled = true
wifi_enabled = true
signal_strength = "excellent"
mms_enabled = true

Concept: The telecom domain creates a natural split: backend operations belong to the agent, device operations belong to the user. Realization: This split is not arbitrary — it mirrors the real world. A support agent cannot reach through the phone and toggle your airplane mode. They must ask you to do it, then verify via their backend that the issue resolved. The database schemas above are the actual state representation; toggle_airplane_mode() flips the boolean, and get_status_bar() reads it into a human-readable string.
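As a rough sketch of that last sentence, here is how a write tool and a read tool might share one device state. The function bodies and the exact status-bar format are assumptions for illustration; only the tool names and the state fields come from the schemas above.

```python
def toggle_airplane_mode(device: dict) -> str:
    """Write tool: flip the boolean and describe the result in plain language."""
    device["airplane_mode"] = not device["airplane_mode"]
    state = "ON" if device["airplane_mode"] else "OFF"
    return f"Airplane mode is now {state}."


def get_status_bar(device: dict) -> str:
    """Read tool: render the current state as a string, not as structured data."""
    battery = f"🔋 {device['battery_level']}%"
    if device["airplane_mode"]:
        return f"✈️ Airplane mode | {battery}"
    data = "Data ON" if device["data_enabled"] else "Data OFF"
    return f"📶 {device['signal_strength'].capitalize()} | {data} | {battery}"


# Hypothetical device state using a subset of the fields shown above.
device = {
    "airplane_mode": False,
    "data_enabled": True,
    "signal_strength": "excellent",
    "battery_level": 80,
}

print(get_status_bar(device))        # 📶 Excellent | Data ON | 🔋 80%
print(toggle_airplane_mode(device))  # Airplane mode is now ON.
print(get_status_bar(device))        # ✈️ Airplane mode | 🔋 80%
```

Because the status bar is computed from the state at call time, whatever the user reports after a toggle reflects the real device rather than a remembered description.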

Why do user tools in τ²-bench return human-readable strings (like emoji status bars) instead of structured JSON?

To constrain the user simulator to behave like a real person reporting what they see on screen, rather than an API client parsing structured data. This makes the simulation more realistic and prevents the user LLM from extracting information beyond what a human would
Because JSON is too complex for users
To save bandwidth

Chapter 3: Compositional Task Generation

How do you create thousands of verifiable test scenarios without hand-writing each one? τ²-bench uses a compositional task generator that builds complex tasks from atomic building blocks.

Each atomic subtask t represents one specific issue (e.g., "airplane mode is on, causing no data"). It is defined by three function sets:

init functions f^init_t,k

Set up the broken state. Example: set_airplane_mode(True) puts the phone in airplane mode before the conversation starts.

↓

solution functions f^sol_t,k

The tool calls that fix the issue. Example: toggle_airplane_mode() (user tool). These must be available to either the agent or user.

↓

assertion functions f^assert_t,k

Verify the fix. Example: assert_service_status("connected") checks the final state. If all assertions pass, the task is solved.

Key insight: Because each subtask has programmatic init, solution, and assertion functions, correctness is provably verifiable. You run the init functions, apply the solutions, check the assertions — if they pass, the task is solvable. No ambiguity, no subjective judgment.
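The init, solution, and assertion triple described above can be sketched as follows. The `Subtask` class and `check_subtask` helper are hypothetical names, and the extra check that the init functions really break something is an addition made here for illustration, not something the text states.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Phone = Dict[str, bool]  # hypothetical minimal device state


@dataclass
class Subtask:
    name: str
    init: List[Callable[[Phone], None]]       # set up the broken state
    solution: List[Callable[[Phone], None]]   # tool calls that fix it
    assertions: List[Callable[[Phone], bool]] # checks on the final state


def set_airplane_mode_on(phone: Phone) -> None:
    phone["airplane_mode"] = True


def toggle_airplane_mode(phone: Phone) -> None:
    phone["airplane_mode"] = not phone["airplane_mode"]


def assert_data_connected(phone: Phone) -> bool:
    return phone["data_enabled"] and not phone["airplane_mode"]


def check_subtask(subtask: Subtask, healthy: Phone) -> Tuple[bool, bool]:
    """Return (broken_after_init, fixed_after_solution) on a copy of the state."""
    state = dict(healthy)
    for f in subtask.init:
        f(state)
    broken = not all(check(state) for check in subtask.assertions)
    for f in subtask.solution:
        f(state)
    fixed = all(check(state) for check in subtask.assertions)
    return broken, fixed


airplane_on = Subtask(
    name="airplane_mode_on",
    init=[set_airplane_mode_on],
    solution=[toggle_airplane_mode],
    assertions=[assert_data_connected],
)
healthy_phone = {"airplane_mode": False, "data_enabled": True}
print(check_subtask(airplane_on, healthy_phone))  # (True, True)
```

A subtask whose result is not `(True, True)` is either not broken to begin with or not fixable by its own solution, so it can be rejected before any agent ever sees it.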

Atomic subtasks are organized into groups of mutually exclusive alternatives. A composite task picks at most one subtask from each group and concatenates their functions. The telecom domain has 15 atomic subtask groups across three user intents of increasing difficulty:

Intent	Difficulty	Example subtask	Why harder
service_issue	Easiest	Line suspended → activate it	Agent-side fix only, straightforward
mobile_data_issue	Medium	Airplane mode ON + roaming disabled	Requires checking service issues first, then user + agent coordination
mms_issue	Hardest	MMS disabled + data off + no roaming	Must resolve data issues first (which may require service fixes), multi-stage chain

Combining 15 subtask groups programmatically yields 2,285 total tasks. The paper subsamples 114 tasks balanced across intents and difficulty levels (1-9 subtasks per task). The number of subtasks directly controls difficulty — more subtasks mean more diagnostic steps, more user interactions, and more state transitions to track.
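The selection step ("at most one subtask from each group") is easy to sketch with a Cartesian product. The group and subtask names below are hypothetical, and this toy does not attempt to reproduce the 2,285 figure, since the real generator presumably applies domain constraints that are not described here.

```python
from itertools import product

# Hypothetical groups; subtasks inside a group are mutually exclusive alternatives.
groups = {
    "airplane": ["airplane_mode_on"],
    "roaming": ["roaming_disabled_in_crm", "roaming_off_on_phone"],
    "data": ["mobile_data_off", "data_saver_on"],
}


def compose(groups: dict):
    """Yield every non-empty choice of at most one subtask per group."""
    # None stands for "skip this group".
    options = [[None] + subtasks for subtasks in groups.values()]
    for pick in product(*options):
        chosen = tuple(s for s in pick if s is not None)
        if chosen:
            yield chosen


tasks = list(compose(groups))
print(len(tasks))                 # 17, i.e. (1+1) * (1+2) * (1+2) - 1
print(max(len(t) for t in tasks)) # 3 subtasks in the hardest composite task
```

A full task would then concatenate the init, solution, and assertion functions of the chosen subtasks, which is why the subtask count is a direct difficulty knob.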

Task Composition Explorer

Each added atomic subtask extends the init, solution, and assertion chains, contributing both diagnostic and resolution steps.

What makes τ²-bench's task generation "compositional" rather than just "programmatic"?

Tasks are generated randomly
Complex tasks are built by selecting and combining atomic subtasks from different groups, concatenating their init/solution/assertion functions. This creates verifiable multi-issue scenarios from a small set of building blocks
Each task is hand-written by the authors

Chapter 4: The User Simulator

The biggest weakness of conversational agent benchmarks is the user simulator. In τ-bench's original retail and airline domains, the user is an LLM with a natural language description of their goal. The problem: LLMs hallucinate. They fabricate information, contradict their stated preferences, and behave inconsistently. The retail domain has a 40% user error rate with 12% critical errors that make tasks unsolvable.

τ²-bench's insight: constrain the user simulator not just with prompting, but with tools and environment state.

The breakthrough: Instead of telling the user simulator "you see that your data is disabled" (which it might forget or contradict), you give it the tool get_status_bar() that actually reads the device state. The user cannot lie about what they see because their observation is grounded in the real environment. This drops the error rate from 40% to 16%, and critical errors from 12% to 6%.

Three design principles make the user simulator reliable:

Tool-grounded observations. The user does not imagine what they see — they call read tools that return the actual device state. If the agent says "check your status bar," the user calls get_status_bar() and reports the real output.
Reactive behavior. The user does not plan independently. They only call tools when the agent asks them to. This limits the action space and prevents the user simulator from "getting ahead" of the conversation.
Human-readable outputs. Tool returns are strings a human would see, not structured data. The user reports "I see four bars and 5G" rather than parsing {"signal": "excellent", "network": "5G"}.

Domain	Conversations	Critical Errors	Benign Errors	Total Error Rate
airline (τ-bench)	100	13%	34%	47%
retail (τ-bench)	50	12%	28%	40%
telecom (τ²-bench)	50	6%	10%	16%

User Simulator: Prompted vs. Tool-Grounded

Two user simulators handle the same scenario. The prompted user relies on its memory of the initial instructions. The tool-grounded user calls actual device tools.

Concept: User simulator reliability is the Achilles' heel of conversational benchmarks. Realization: τ²-bench shows that giving the user actual tools that read real environment state is far more effective than elaborate prompting. The environment constrains behavior more reliably than instructions do. This is the same principle behind grounding LLMs with retrieval — connect to reality, don't just describe it.

How does τ²-bench reduce user simulator error rates from 40% to 16%?

By using a more powerful LLM for the user
By writing more detailed prompts
By grounding user observations in actual tool calls that read the real environment state, rather than relying on the user LLM to remember and correctly report information from its instructions

Chapter 5: Dec-POMDP Formalization

Now let us write down the math. The entire τ²-bench interaction is formally a tuple:

(S, {A_i}, {O_i}, T, R, U, M) where i ∈ {agent, user}

Let us unpack each component with concrete telecom examples.

Message space M: All possible natural language messages. User: "My data isn't working." Agent: "Could you check if airplane mode is on?"

State space S: The global state decomposes as:

S = S_world ⊗ S_history

S_world = S_db,agent ⊗ S_db,user

S_db,agent is the CRM (customer profiles, line configs). S_db,user is the phone device state (airplane mode, data enabled, signal). S_history logs every action, observation, and message in order.

Action spaces A_i: Player i either calls a tool or sends a message. Only one player acts per turn.

a_i ∈ A_i = A_i,tool ∪ M

Agent tool actions: get_customer_by_id("C1001"), enable_roaming("C1001", "L1002"). User tool actions: toggle_airplane_mode(), get_status_bar().

Observation spaces O_i: Player i sees either a tool return or a message from the other player.

o_i ∈ O_i = O_i,tool ∪ M

Transition function T: Given current state s and action a, yields new state s' and observation o:

T : S × A → S × O

Calling enable_roaming("C1001", "L1002") changes S_db,agent (roaming flag flips to true) AND affects S_db,user (the phone can now access roaming networks). This cross-database effect is what makes the environment shared.

Reward function R: A function R : S → [0, 1] that checks whether all assertion functions pass on the final state. Binary: 1 if the task is solved, 0 otherwise.

Key insight: The transition function T can create cross-database effects. When the agent calls enable_roaming(), it changes the agent's database (roaming_enabled = true) AND the user's phone environment (the device can now connect to roaming networks). This is what makes it a genuinely shared environment, not just two independent systems.
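A minimal sketch of such a transition function, under loud assumptions: the `abroad` flag, the `check_service` tool, and the rule inside `service_status` are all invented here to make the coupling visible. Only `enable_roaming` and `toggle_airplane_mode` are tool names from the text.

```python
from typing import Tuple


def service_status(state: dict) -> str:
    """Hypothetical rule: what the phone experiences depends on BOTH databases."""
    phone, crm = state["phone"], state["crm"]
    if phone["airplane_mode"] or not phone["data_enabled"]:
        return "no_service"
    if phone["abroad"] and not crm["roaming_enabled"]:
        return "no_service"
    return "connected"


def transition(state: dict, player: str, action: str) -> Tuple[dict, str]:
    """T : S x A -> S x O, returning a new state and the acting player's observation."""
    new = {"crm": dict(state["crm"]), "phone": dict(state["phone"])}
    if (player, action) == ("agent", "enable_roaming"):
        new["crm"]["roaming_enabled"] = True
        obs = "Roaming enabled."
    elif (player, action) == ("user", "toggle_airplane_mode"):
        new["phone"]["airplane_mode"] = not new["phone"]["airplane_mode"]
        obs = "Airplane mode is now " + ("ON" if new["phone"]["airplane_mode"] else "OFF")
    elif (player, action) == ("user", "check_service"):
        obs = service_status(new)
    else:
        raise ValueError(f"{player} cannot perform {action}")
    return new, obs


s0 = {
    "crm": {"roaming_enabled": False},
    "phone": {"airplane_mode": False, "data_enabled": True, "abroad": True},
}
_, before = transition(s0, "user", "check_service")
s1, _ = transition(s0, "agent", "enable_roaming")
_, after = transition(s1, "user", "check_service")
print(before, "->", after)  # no_service -> connected
```

The user never touched the phone between the two checks, yet their observation changed. That is the coupling in miniature.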

Instruction space U: Defines the user's scenario (what problem they have) and the agent's domain policy (troubleshooting procedures). The user sees: "Your mobile data is not working. You want to fix it." The agent sees: "Follow these diagnostic steps for data issues: first check service status, then check airplane mode..."

Why is the cross-database transition effect (e.g., agent's enable_roaming() affecting the user's device state) essential to modeling the dual-control environment?

It creates a genuinely shared environment where one player's actions change what the other player observes. Without this coupling, the agent and user would be operating on independent systems and coordination would be trivial
It makes the math simpler
It is required by the Dec-POMDP definition

Chapter 6: Evaluation Protocol

How do you know if the agent actually solved the problem? τ²-bench uses multiple evaluation criteria that go beyond simple pass/fail.

The pass^k metric: This is the core reliability metric from τ-bench. Run the same task k times independently. pass^k is the probability that all k runs succeed, averaged over tasks. Higher k is stricter: it measures consistency, not just one-shot accuracy.

pass^k = E_task[ ∏_{j=1}^{k} success_j ]

If an agent succeeds 70% of the time on a task, pass^1 = 0.70, but pass^4 = 0.70⁴ = 0.24. The exponential decay penalizes inconsistency.
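The arithmetic above, plus one way to estimate pass^k from a finite number of runs, can be sketched as follows. The combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials is an assumption brought in here, since this chapter does not state how the metric is estimated. The function names are hypothetical.

```python
from math import comb


def pass_k_from_rate(p: float, k: int) -> float:
    """pass^k for one task with true success rate p, assuming independent runs."""
    return p ** k


def pass_k_estimate(num_trials: int, num_successes: int, k: int) -> float:
    """Chance that k runs drawn without replacement from the trials all succeeded."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    return comb(num_successes, k) / comb(num_trials, k)


print(round(pass_k_from_rate(0.70, 4), 4))  # 0.2401
print(pass_k_estimate(4, 3, 1))             # 0.75
print(pass_k_estimate(4, 3, 4))             # 0.0
print(pass_k_estimate(4, 4, 4))             # 1.0
```

With only 4 trials per task, a single failed run already sends that task's pass^4 estimate to zero, which is exactly the strictness the metric is meant to have.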

Five evaluation criteria (a task can use a subset):

Criterion	What it checks	Example
DB check	Agent database matches expected state	Customer's roaming flag is true
Status assertions	User device state meets conditions	`assert_service_status("connected")`
NL assertions	History contains expected communication	"Agent diagnosed the cause"
Comm info check	Agent communicated required info	Told user the roaming cost
Action matching	All solution tool calls appear in trajectory	`toggle_airplane_mode` was called

For the telecom domain, only assertion functions are used. This is cleaner than checking communication content — you verify the end state, not how the agent got there. The phone either works or it does not.

Concept: pass^k exponentially penalizes inconsistency — an agent that solves a task 70% of the time scores only 24% on pass^4. Realization: This matters for deployment. A customer support system that works 7 out of 10 times is unreliable. pass^k captures this by measuring the probability that the agent succeeds on ALL k independent attempts, which is the real-world reliability you care about.

Three evaluation modes for ablation analysis:

Default (dual-control)

Agent and user collaborate. Agent has backend tools, user has phone tools. Normal mode — tests everything.

↓ remove user

No-User

Agent gets a ticket summarizing the problem. Agent controls ALL tools (both backend and phone). Tests pure reasoning — no communication needed.

↓ give answer

Oracle Plan

Agent receives the exact sequence of tool calls needed. Must coordinate with user to execute them. Tests pure communication — no reasoning needed.

The gap between No-User and Default isolates communication failure. The gap between Oracle Plan and Default isolates reasoning failure. Together they decompose agent performance into its constituent skills.

If an agent has pass^1 = 0.50 on a task, what is its pass^4?

0.50
0.0625, because pass^4 = 0.50^4. Each independent trial must succeed, so the probabilities multiply: four consecutive successes at 50% each is only 6.25%
0.25

Chapter 7: Results

The experiments evaluate four models: gpt-4.1, gpt-4.1-mini, o4-mini, and claude-3.7-sonnet. Each task runs 4 times at temperature 0. The user simulator is always gpt-4.1. Here are the headline findings.

Finding 1: Telecom is the hardest domain for three of the four models.

Model	Retail pass^1	Airline pass^1	Telecom pass^1
gpt-4.1	74%	56%	34%
gpt-4.1-mini	59%	46%	52%
o4-mini	66%	53%	42%
claude-3.7-sonnet	79%	50%	49%

Remarkably, gpt-4.1, the second-strongest model on retail (74%, behind claude-3.7-sonnet at 79%), is the weakest on telecom (34%). The mini model outperforms it. This suggests that raw reasoning power does not translate directly to coordination ability.

Finding 2: The dual-control gap. When agents switch from no-user (they control everything) to dual-control (they must guide the user), performance drops dramatically. gpt-4.1: 52% → 34% (−18 points). o4-mini: 67% → 42% (−25 points). This gap is pure communication and coordination failure — the agent can solve the problem when it has all the tools, but cannot get the user to do the right things.
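The gap is plain subtraction, but writing it out makes the comparison explicit. The scores below are the pass^1 percentages quoted in Finding 2. The dictionary layout and the function name are just illustrative.

```python
# pass^1 on telecom, in percent, as quoted in Finding 2.
results = {
    "gpt-4.1": {"no_user": 52, "default": 34},
    "o4-mini": {"no_user": 67, "default": 42},
}


def dual_control_gap(scores: dict) -> int:
    """Points lost when the agent must guide a user instead of acting alone."""
    return scores["default"] - scores["no_user"]


for model, scores in results.items():
    print(f"{model}: {dual_control_gap(scores):+d} points")
# gpt-4.1: -18 points
# o4-mini: -25 points
```

Note that o4-mini is the stronger solo problem-solver here (67% vs. 52%) yet loses more when a user enters the loop.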

Finding 3: Performance collapses with task complexity. As the number of required actions increases, pass^1 drops toward zero. For tasks requiring 7+ actions in dual-control mode, both gpt-4.1 and o4-mini score near 0%. Even in no-user mode, performance degrades — but the gap between modes narrows, suggesting that long-horizon tasks are hard for reasoning too, not just communication.

Dual-Control Performance Dashboard

Views: the dual-control gap (Default vs. No-User vs. Oracle Plan), performance by issue type, and the complexity scaling curve.

Finding 4: Issue type matters. service_issue tasks are easiest (agent-side fixes). mobile_data_issue and mms_issue require multi-stage coordination and score much lower. For gpt-4.1: service_issue pass^1 = 52%, mobile_data_issue = 30%, mms_issue = 22%.

Finding 5: User persona affects success. Tasks with "Hard" personas (low-tech users) are harder than "Easy" personas (tech-savvy users). Surprisingly, "None" (no persona) often performs as badly as "Hard," suggesting that well-defined personas actually help the simulator behave more consistently.

gpt-4.1 scores 74% on retail but only 34% on telecom, while gpt-4.1-mini scores 59% on retail and 52% on telecom. What does this reversal reveal?

Raw reasoning power (where gpt-4.1 excels) does not translate to coordination ability. Dual-control tasks require a different skill (guiding users through natural language) that is not strongly correlated with single-control reasoning performance
gpt-4.1-mini is a better model overall
The telecom domain is easier

Chapter 8: Failure Analysis

Where exactly do agents fail in dual-control? The three evaluation modes (Default, No-User, Oracle Plan) decompose failures into precise categories.

Reasoning failures (visible in No-User mode): The agent has all tools but still fails. This includes wrong diagnoses (checking the wrong thing first), incorrect tool calls (wrong parameters), and policy violations (not following troubleshooting procedures).

Communication failures (the gap between No-User and Default): The agent knows the answer but cannot get the user to carry it out through conversation. Three specific failure modes:

Vague instructions: "Check your settings" instead of "Go to Settings, tap Network, and tell me if Mobile Data is toggled on."
Missing verification: Agent tells user to toggle something but never asks them to confirm it worked.
State tracking failure: Agent loses track of what the user has already done, asks them to repeat steps, or skips steps that were not completed.

Coordination failures (unique to dual-control): The agent and user's actions must happen in the right order. If the agent enables roaming after asking the user to toggle data, the data toggle alone does not fix the problem. Sequencing matters.

The workflow policy experiment: The authors created an alternate policy document with more specific step-by-step workflows (instead of general guidelines). The workflow policy slightly improves Default and No-User performance but hurts Oracle Plan performance. Why? When the agent already has the answer (Oracle Plan), a detailed workflow creates confusion: the agent tries to follow both the ground truth and the workflow, and they conflict. This shows that good policies and good reasoning can actually interfere with each other.

Here is a concrete example of a failed dual-control interaction:

conversation
# Issue: mobile data not working (airplane mode ON + roaming disabled)

User: "My mobile data is not working."
Agent: calls get_customer_by_id("C1001")
Agent: calls get_line_details("L1002")
       # Sees roaming_enabled=false. Correct diagnosis so far.

Agent: "I see that your roaming is not enabled.
        I'll enable it for you."
Agent: calls enable_roaming("C1001", "L1002")

# FAILURE: Agent never checked airplane mode!
# Even with roaming enabled, airplane mode blocks all connectivity.
# Agent needed to ask user: "Can you check if airplane mode is on?"

Agent: "I've enabled roaming. Your data should work now.
        Is there anything else?"
User: "No, it's still not working."
# Agent has now lost context and retries the same fix...

Concept: Agent failures in dual-control decompose into reasoning (wrong diagnosis), communication (vague instructions, missing verification), and coordination (wrong action ordering). Realization: The ablation modes make this decomposition precise: Default − No-User = communication cost. Default − Oracle Plan = reasoning cost. For gpt-4.1, the communication cost (−18 points) exceeds the reasoning cost, confirming that guiding users is the harder problem.
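The failed interaction above can also be checked mechanically, in the spirit of the action-matching criterion from Chapter 6. This is only a sketch: the required-solution list and the trajectory encoding are assumptions reconstructed from that example, and `missing_actions` is a hypothetical helper.

```python
# Solution tool calls for "airplane mode ON + roaming disabled" (assumed from the example).
required_solution = ["enable_roaming", "toggle_airplane_mode"]

# Tool calls that actually appear in the failed conversation, as (player, tool) pairs.
trajectory = [
    ("agent", "get_customer_by_id"),
    ("agent", "get_line_details"),
    ("agent", "enable_roaming"),
]


def missing_actions(required: list, trajectory: list) -> list:
    """Return the required tool calls that never appear in the trajectory."""
    called = {tool for _, tool in trajectory}
    return [tool for tool in required if tool not in called]


print(missing_actions(required_solution, trajectory))  # ['toggle_airplane_mode']
```

The missing call is a user tool, so the only way to get it into the trajectory is for the agent to ask. The backend half of the fix is done correctly and the task still fails.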

Why does the workflow-based policy hurt performance in Oracle Plan mode?

When the agent already has the ground truth sequence of actions, a detailed workflow creates interference: the agent tries to follow both the known solution and the prescribed workflow, and the two can conflict, causing confusion that degrades execution
The workflow policy has bugs
Oracle Plan does not use the policy

Chapter 9: Connections

τ²-bench sits at the intersection of several research threads:

Connection	Relationship
τ-bench (Yao et al., 2024)	The direct predecessor. τ²-bench extends it from single-control to dual-control, adding the telecom domain and compositional task generation. The retail and airline domains are carried over.
Dec-POMDPs	τ²-bench formalizes dual-control as a Dec-POMDP. Classic framework for multi-agent partial observability. The complexity asymmetry (agent plans, user reacts) is a specific instance of asymmetric Dec-POMDPs.
Agent evaluation survey (Yehudai et al., 2025)	Comprehensive survey of how LLM agents are evaluated. τ²-bench contributes a new evaluation paradigm (dual-control) that fills a gap the survey identifies: benchmarks that test coordination.
IntellAgent (Levi and Kadar, 2025)	Programmatic benchmark generation from policy graphs. A complementary approach: IntellAgent generates synthetic proxies, τ²-bench generates verifiable compositional tasks.
ToolSandbox (Lu et al., 2024)	Stateful tool evaluation. τ²-bench adds the twist that tools are split between two players who must coordinate through language.
Task-oriented dialogue	The legacy of MultiWOZ and similar benchmarks. τ²-bench goes beyond information-seeking dialogue to action-oriented collaboration.

Limitations acknowledged by the authors:

No expert-novice gap modeling. The benchmark does not explicitly test whether agents can adapt explanations to the user's technical level. A real support agent simplifies for novices and uses jargon with experts.
Domain creation still manual. Building a new domain (database schemas, tools, policies, atomic subtasks) requires significant human expert effort. Automating this is an open problem.
User simulator still imperfect. 6% critical error rate is much better than 12%, but not zero. Some tasks may still fail due to user simulator bugs rather than agent failures.
Only one dual-control domain. The telecom domain is a proof of concept. Extending the tool-grounded user approach to retail and airline domains remains future work.

The big picture: τ²-bench demonstrates that the hardest part of building useful agents is not reasoning — it is coordination. When you give the agent all the tools, it succeeds half the time. When it must work through a user, it drops by 18-25 points. This gap — the coordination gap — is the frontier for conversational AI. Closing it requires agents that can explain clearly, verify comprehension, track shared state, and adapt to users who make mistakes. That is a fundamentally different skill than answering questions or calling APIs.

What is the single most important finding of τ²-bench?

That LLMs cannot use tools
That the performance drop from single-control to dual-control (18-25 points) reveals coordination and communication as a critical bottleneck: agents can reason about solutions but struggle to guide users who share control of the environment
That the telecom domain is hard

τ²-Bench: Evaluating Agents in Dual-Control Environments