When the user has tools too. The first benchmark that tests whether an AI agent can guide a human who actively modifies shared state — modeled as a Dec-POMDP over a telecom support domain.
You call your phone company because your mobile data stopped working. The support agent looks up your account, sees everything is fine on their end, and asks: "Could you check if airplane mode is on?" You open Settings, toggle airplane mode off, and report back. The agent then asks you to toggle mobile data. You do. It works.
Notice what happened: both of you took actions. The agent queried your account in their CRM system. You changed your phone's settings. Neither of you could do the other's job. You were collaborating on a shared problem where the environment, your phone service, depended on both of your actions.
The gap is not just theoretical. When you give an agent all the tools (single-control), it scores 52% on telecom tasks. When you force it to guide a user who holds half the tools (dual-control), it drops to 34%. That 18-point gap is pure coordination and communication failure — the agent knows the answer but cannot get the user to execute it.
In single-control, the agent does everything. In dual-control, agent and user each have their own tools acting on a shared environment.
Why is guiding a user so much harder than doing the task yourself? The answer lies in a concept from multi-agent decision theory: the Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
In a regular POMDP, one agent acts on a partially observable world. In a Dec-POMDP, two agents act on the same world, but each sees only their own slice of it. Neither can observe the other's tools, databases, or internal state. They can only communicate through messages.
Here is the critical asymmetry in τ²-bench's Dec-POMDP:
| Property | Agent | User |
|---|---|---|
| Tools | Backend CRM: read/write customer records, enable services, check line details | Phone settings: toggle airplane mode, toggle data, check status bar, restart phone |
| Database | Customer profiles, subscription plans, line configs | Device state: SIM status, airplane mode, battery, data enabled, signal |
| Observability | Sees CRM data + user messages | Sees phone screen + agent messages |
| Planning | Must diagnose root cause and orchestrate solution | Reacts to agent instructions (does not plan independently) |
Both players act on shared state but observe different slices. The only bridge is natural language.
The Dec-POMDP framing gives τ²-bench a formal structure for measuring exactly where agents fail. If the agent succeeds in no-user mode (where it controls all tools) but fails in dual-control mode, the failure is purely in communication and coordination — not reasoning.
The paper introduces a new telecom technical support domain. A customer calls in with a phone problem — mobile data not working, MMS failing, service disconnected. The agent has backend access to the customer's account. The user has a mocked phone with settings they can toggle.
Here are the actual tool schemas, split by who controls them:
Agent tools (13 total: 7 write, 6 read):
```python
# READ tools: agent can inspect the backend
get_customer_by_id(customer_id: str) → Customer
get_customer_by_name(full_name: str) → Customer
search_customers(query: str) → List[Customer]
get_line_details(line_id: str) → LineDetails
get_plan_details(plan_id: str) → PlanDetails
get_available_plans() → List[Plan]

# WRITE tools: agent can modify the backend
enable_roaming(customer_id, line_id) → str
disable_roaming(customer_id, line_id) → str
activate_line(customer_id, line_id) → str
suspend_line(customer_id, line_id) → str
change_plan(customer_id, line_id, plan_id) → str
add_service(customer_id, line_id, service) → str
transfer_to_human(reason: str) → str
```
User tools (30 total: 15 write, 15 read):
```python
# READ tools: user checks their phone
get_status_bar() → str          # "📶 Excellent | 5G | Data ON | 🔋 80%"
check_airplane_mode() → str
check_wifi_status() → str
check_data_status() → str
check_signal_strength() → str
open_browser() → str            # tests if internet works
send_test_mms() → str           # tests if MMS works
... (8 more read tools)

# WRITE tools: user modifies phone settings
toggle_airplane_mode() → str
toggle_data() → str
toggle_wifi() → str
restart_phone() → str
reset_network_settings() → str
toggle_mms() → str
... (9 more write tools)
```
When the user calls get_status_bar(), they get a status bar emoji string like a real phone screen, not a JSON object. This constrains the user simulator to behave like a real person reporting what they see, not like an API client parsing structured data.

The agent's database stores structured customer records:
```toml
[[customers]]
customer_id = "C1001"
full_name = "John Smith"
date_of_birth = "1985-06-15"
phone_number = "555-123-2002"

[[lines]]
line_id = "L1002"
customer_id = "C1001"
plan_id = "PLAN_PREMIUM"
roaming_enabled = false
status = "active"
```
The user's database is a mocked phone device:
```toml
[device]
sim_card_status = "active"
airplane_mode = false
battery_level = 80
data_enabled = true
wifi_enabled = true
signal_strength = "excellent"
mms_enabled = true
```
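To make the link between device state and user tools concrete, here is a minimal sketch of a mocked phone. It assumes a `Device` dataclass that mirrors a subset of the fields above, and it passes the device explicitly to each tool; both choices are illustrative and not the benchmark's actual implementation. The status bar layout is also a simplified guess (it omits the network type).

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical mirror of a few fields from the [device] table."""
    sim_card_status: str = "active"
    airplane_mode: bool = False
    battery_level: int = 80
    data_enabled: bool = True
    signal_strength: str = "excellent"


def toggle_airplane_mode(device: Device) -> str:
    """WRITE tool: flip the flag and describe the new setting."""
    device.airplane_mode = not device.airplane_mode
    return f"Airplane mode is now {'ON' if device.airplane_mode else 'OFF'}"


def get_status_bar(device: Device) -> str:
    """READ tool: render the state as a person would see it on screen."""
    if device.airplane_mode:
        radio = "✈ Airplane mode"
    else:
        data = "Data ON" if device.data_enabled else "Data OFF"
        radio = f"📶 {device.signal_strength.capitalize()} | {data}"
    return f"{radio} | 🔋 {device.battery_level}%"


phone = Device()
print(get_status_bar(phone))
print(toggle_airplane_mode(phone))
print(get_status_bar(phone))
```

The read tool never returns the raw fields, so whatever the user simulator reports has to come from a rendered string, which is the constraint the text describes.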
Each user tool acts directly on this state: toggle_airplane_mode() flips the boolean, and get_status_bar() reads it into a human-readable string.

How do you create thousands of verifiable test scenarios without hand-writing each one? τ²-bench uses a compositional task generator that builds complex tasks from atomic building blocks.
Each atomic subtask t represents one specific issue (e.g., "airplane mode is on, causing no data"). It is defined by three function sets:
- Initialization functions set up the broken state. For example, set_airplane_mode(True) puts the phone in airplane mode before the conversation starts.
- Solution functions are the tool calls that fix the issue, here toggle_airplane_mode() (user tool). These must be available to either the agent or the user.
- Assertion functions verify the outcome. For example, assert_service_status("connected") checks the final state. If all assertions pass, the task is solved.

Atomic subtasks are organized into groups of mutually exclusive alternatives. A composite task picks at most one subtask from each group and concatenates their functions. The telecom domain has 15 atomic subtask groups across three user intents of increasing difficulty:
| Intent | Difficulty | Example subtask | Why harder |
|---|---|---|---|
| service_issue | Easiest | Line suspended → activate it | Agent-side fix only, straightforward |
| mobile_data_issue | Medium | Airplane mode ON + roaming disabled | Requires checking service issues first, then user + agent coordination |
| mms_issue | Hardest | MMS disabled + data off + no roaming | Must resolve data issues first (which may require service fixes), multi-stage chain |
Combining 15 subtask groups programmatically yields 2,285 total tasks. The paper subsamples 114 tasks balanced across intents and difficulty levels (1-9 subtasks per task). The number of subtasks directly controls difficulty — more subtasks mean more diagnostic steps, more user interactions, and more state transitions to track.
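The composition rule (at most one subtask per group, functions concatenated) can be sketched as follows. This is only an illustration: the `Subtask` and `Task` classes, the three single-option groups, and every init or assertion name other than `set_airplane_mode(True)` and `assert_service_status("connected")` are hypothetical placeholders, and function calls are stored as plain strings.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator


@dataclass(frozen=True)
class Subtask:
    name: str
    init: tuple[str, ...]        # calls that break the environment
    solution: tuple[str, ...]    # calls that repair it
    assertions: tuple[str, ...]  # checks on the final state


@dataclass(frozen=True)
class Task:
    subtasks: tuple[str, ...]
    init: tuple[str, ...]
    solution: tuple[str, ...]
    assertions: tuple[str, ...]


# Each inner list is one group of mutually exclusive alternatives.
GROUPS: list[list[Subtask]] = [
    [Subtask("airplane_mode_on", ("set_airplane_mode(True)",),
             ("toggle_airplane_mode()",),
             ('assert_service_status("connected")',))],
    [Subtask("data_off", ("set_data_enabled(False)",),
             ("toggle_data()",), ("assert_data_enabled(True)",))],
    [Subtask("line_suspended", ('set_line_status("suspended")',),
             ("activate_line(customer_id, line_id)",),
             ('assert_line_status("active")',))],
]


def compose(groups: list[list[Subtask]]) -> Iterator[Task]:
    """Yield every non-empty task that uses at most one subtask per group."""
    options = [[None, *group] for group in groups]  # None means "skip group"
    for choice in product(*options):
        picked = [s for s in choice if s is not None]
        if not picked:
            continue
        yield Task(
            subtasks=tuple(s.name for s in picked),
            init=sum((s.init for s in picked), ()),
            solution=sum((s.solution for s in picked), ()),
            assertions=sum((s.assertions for s in picked), ()),
        )


tasks = list(compose(GROUPS))
print(len(tasks))
print(max(tasks, key=lambda t: len(t.subtasks)).solution)
```

With three groups of one alternative each, the generator yields 2³ − 1 = 7 tasks. A group with several alternatives simply contributes more options to the product, which is how a small set of groups can expand into a large task pool.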
Each added subtask contributes both diagnostic and resolution steps, so the init, solution, and assertion chains grow with the number of subtasks.
The biggest weakness of conversational agent benchmarks is the user simulator. In τ-bench's original retail and airline domains, the user is an LLM with a natural language description of their goal. The problem: LLMs hallucinate. They fabricate information, contradict their stated preferences, and behave inconsistently. The retail domain has a 40% user error rate with 12% critical errors that make tasks unsolvable.
τ²-bench's insight: constrain the user simulator not just with prompting, but with tools and environment state.
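The user simulator is given tools such as get_status_bar() that actually read the device state. The user cannot lie about what they see because their observation is grounded in the real environment. Relative to τ-bench's retail domain, this drops the error rate from 40% to 16%, and critical errors from 12% to 6%.

The design principles that make the user simulator reliable include:

- Observations come from tool calls rather than memory: the user calls get_status_bar() and reports the real output.
- Tool outputs read like a phone screen rather than structured data such as {"signal": "excellent", "network": "5G"}.

| Domain | Conversations | Critical Errors | Benign Errors | Total Error Rate |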
|---|---|---|---|---|
| airline (τ-bench) | 100 | 13% | 34% | 47% |
| retail (τ-bench) | 50 | 12% | 28% | 40% |
| telecom (τ²-bench) | 50 | 6% | 10% | 16% |
The difference shows up when two user simulators handle the same scenario: the prompted user relies on its memory of the initial instructions, while the tool-grounded user calls actual device tools.
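The contrast between the two simulators can be illustrated with a deliberately simplified sketch. No LLM is involved here: `PromptedUser` stands in for a simulator that answers from its initial scenario text, and `GroundedUser` for one that calls a device tool. Both classes and the `Phone` mock are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    airplane_mode: bool = True

    def check_airplane_mode(self) -> str:
        return f"Airplane mode is {'ON' if self.airplane_mode else 'OFF'}"

    def toggle_airplane_mode(self) -> str:
        self.airplane_mode = not self.airplane_mode
        return self.check_airplane_mode()


class PromptedUser:
    """Answers from the scenario text it was given at the start."""

    def __init__(self, scenario: str) -> None:
        self.scenario = scenario

    def report_airplane_mode(self, phone: Phone) -> str:
        return self.scenario  # never looks at the phone


class GroundedUser:
    """Answers by calling the read tool on the current device state."""

    def report_airplane_mode(self, phone: Phone) -> str:
        return phone.check_airplane_mode()


phone = Phone()
prompted = PromptedUser("Airplane mode is ON")
grounded = GroundedUser()

phone.toggle_airplane_mode()  # the agent asked the user to switch it off
print(prompted.report_airplane_mode(phone))
print(grounded.report_airplane_mode(phone))
```

After the toggle, the prompted stand-in still repeats its initial description while the grounded one reflects the new state. A real prompted LLM would not fail this mechanically, but the sketch shows why grounding removes one source of inconsistency.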
Now let us write down the math. The entire τ²-bench interaction is formally a tuple:

$$(S, \{A_i\}, \{O_i\}, T, R, U, M), \quad i \in \{\text{agent}, \text{user}\}$$
Let us unpack each component with concrete telecom examples.
Message space M: All possible natural language messages. User: "My data isn't working." Agent: "Could you check if airplane mode is on?"
State space S: The global state decomposes as:

$$S = S_{\text{db,agent}} \times S_{\text{db,user}} \times S_{\text{history}}$$

$S_{\text{db,agent}}$ is the CRM (customer profiles, line configs). $S_{\text{db,user}}$ is the phone device state (airplane mode, data enabled, signal). $S_{\text{history}}$ logs every action, observation, and message in order.
Action spaces $A_i$: Player $i$ either calls a tool or sends a message. Only one player acts per turn.
Agent tool actions: get_customer_by_id("C1001"), enable_roaming("C1001", "L1002"). User tool actions: toggle_airplane_mode(), get_status_bar().
Observation spaces $O_i$: Player $i$ sees either a tool return or a message from the other player.
Transition function T: Given current state s and action a, yields new state s' and observation o:

$$T : (s, a) \mapsto (s', o)$$
Calling enable_roaming("C1001", "L1002") changes $S_{\text{db,agent}}$ (the roaming flag flips to true) and also affects $S_{\text{db,user}}$ (the phone can now access roaming networks). This cross-database effect is what makes the environment shared.
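A toy transition function makes the cross-database effect easier to see. The sketch below is an assumption-laden illustration: the `State` layout, the `abroad` flag, the connectivity rule in `has_mobile_data`, and the observation strings are all invented for the example, and the state is mutated in place for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    db_agent: dict = field(default_factory=lambda: {"roaming_enabled": False})
    db_user: dict = field(default_factory=lambda: {
        "abroad": True, "data_enabled": True, "airplane_mode": False})
    history: list = field(default_factory=list)


def has_mobile_data(state: State) -> bool:
    """Connectivity depends on BOTH databases (hypothetical rule)."""
    user, agent = state.db_user, state.db_agent
    roaming_ok = agent["roaming_enabled"] or not user["abroad"]
    return user["data_enabled"] and not user["airplane_mode"] and roaming_ok


def transition(state: State, player: str, action: str) -> tuple[State, str]:
    """T: (s, a) -> (s', o) for a handful of tool actions."""
    if (player, action) == ("agent", "enable_roaming"):
        state.db_agent["roaming_enabled"] = True
        obs = "Roaming enabled"
    elif (player, action) == ("user", "toggle_airplane_mode"):
        state.db_user["airplane_mode"] = not state.db_user["airplane_mode"]
        obs = "Airplane mode toggled"
    elif (player, action) == ("user", "open_browser"):
        obs = "Page loaded" if has_mobile_data(state) else "No internet connection"
    else:
        raise ValueError(f"{player} cannot perform {action}")
    state.history.append((player, action, obs))
    return state, obs


s = State()
print(transition(s, "user", "open_browser")[1])
print(transition(s, "agent", "enable_roaming")[1])
print(transition(s, "user", "open_browser")[1])
```

The user's second observation changes even though the user's own database was never written to, which is the sense in which the two players share one environment.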
Reward function R: A function R : S → [0, 1] that checks whether all assertion functions pass on the final state. Binary: 1 if the task is solved, 0 otherwise.
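As a minimal sketch of this binary reward, assume the final state is a plain dictionary and each assertion is a predicate over it. The `Assertion` alias, the dictionary keys, and the factory-style `assert_service_status` are illustrative assumptions, not the benchmark's code.

```python
from typing import Callable, Iterable

Assertion = Callable[[dict], bool]


def assert_service_status(expected: str) -> Assertion:
    """Build a check on the (hypothetical) 'service_status' field."""
    return lambda state: state.get("service_status") == expected


def reward(final_state: dict, assertions: Iterable[Assertion]) -> int:
    """Return 1 only if every assertion holds on the final state, else 0."""
    return int(all(check(final_state) for check in assertions))


checks = [assert_service_status("connected")]
print(reward({"service_status": "connected"}, checks))
print(reward({"service_status": "no_service"}, checks))
```

Because the reward looks only at the final state, any sequence of agent and user actions that reaches a working phone earns the same score.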
When the agent calls enable_roaming(), it changes the agent's database (roaming_enabled = true) and the user's phone environment (the device can now connect to roaming networks). This is what makes it a genuinely shared environment, not just two independent systems.

Instruction space U: Defines the user's scenario (what problem they have) and the agent's domain policy (troubleshooting procedures). The user sees: "Your mobile data is not working. You want to fix it." The agent sees: "Follow these diagnostic steps for data issues: first check service status, then check airplane mode..."
How do you know if the agent actually solved the problem? τ²-bench uses multiple evaluation criteria that go beyond simple pass/fail.
The pass^k metric: This is the core reliability metric from τ-bench. Run the same task k times independently; pass^k is the probability that all k runs succeed, averaged over tasks. Higher k is stricter: it measures consistency, not just one-shot accuracy.
If an agent succeeds 70% of the time on a task, pass^1 = 0.70, but pass^4 = 0.70^4 ≈ 0.24. The exponential decay penalizes inconsistency.
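When only n recorded runs per task are available, pass^k has to be estimated from them. The sketch below assumes the combinatorial estimator C(c, k) / C(n, k), where c is the number of successful runs out of n, averaged over tasks; treat that choice and the function names as assumptions rather than the benchmark's exact code.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k runs drawn without replacement from `trials` all passed."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in results]
    return sum(per_task) / len(per_task)


# Idealized case from the text: independent runs with success rate 0.70.
print(round(0.70 ** 4, 2))

# Two hypothetical tasks with four recorded runs each.
runs = [[True, True, True, True], [True, True, False, False]]
for k in (1, 2, 4):
    print(k, benchmark_pass_hat_k(runs, k))
```

A task that fails even once in four runs contributes zero to pass^4, which is why the metric drops so quickly for inconsistent agents.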
Five evaluation criteria (a task can use a subset):
| Criterion | What it checks | Example |
|---|---|---|
| DB check | Agent database matches expected state | Customer's roaming flag is true |
| Status assertions | User device state meets conditions | assert_service_status("connected") |
| NL assertions | History contains expected communication | "Agent diagnosed the cause" |
| Comm info check | Agent communicated required info | Told user the roaming cost |
| Action matching | All solution tool calls appear in trajectory | toggle_airplane_mode was called |
For the telecom domain, only assertion functions are used. This is cleaner than checking communication content — you verify the end state, not how the agent got there. The phone either works or it does not.
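Three evaluation modes for ablation analysis:

- Default: the dual-control setting, where the agent must guide a user who holds the device tools.
- No-User: the agent controls all tools itself, with no user in the loop.
- Oracle Plan: the agent is given the solution plan in advance, so it no longer has to work out the diagnosis on its own.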
The gap between No-User and Default isolates communication failure. The gap between Oracle Plan and Default isolates reasoning failure. Together they decompose agent performance into its constituent skills.
The experiments evaluate four models: gpt-4.1, gpt-4.1-mini, o4-mini, and claude-3.7-sonnet. Each task runs 4 times at temperature 0. The user simulator is always gpt-4.1. Here are the headline findings.
Finding 1: Telecom is the hardest domain.
| Model | Retail pass^1 | Airline pass^1 | Telecom pass^1 |
|---|---|---|---|
| gpt-4.1 | 74% | 56% | 34% |
| gpt-4.1-mini | 59% | 46% | 52% |
| o4-mini | 66% | 53% | 42% |
| claude-3.7-sonnet | 79% | 50% | 49% |
Remarkably, gpt-4.1, which leads on airline (56%) and is second only to claude-3.7-sonnet on retail (74%), is the weakest on telecom (34%). The mini model outperforms it. This suggests that raw reasoning power does not translate directly to coordination ability.
Finding 3: Performance collapses with task complexity. As the number of required actions increases, pass^1 drops toward zero. For tasks requiring 7+ actions in dual-control mode, both gpt-4.1 and o4-mini score near 0%. Even in no-user mode, performance degrades — but the gap between modes narrows, suggesting that long-horizon tasks are hard for reasoning too, not just communication.
Finding 4: Issue type matters. service_issue tasks are easiest (agent-side fixes). mobile_data_issue and mms_issue require multi-stage coordination and score much lower. For gpt-4.1: service_issue pass^1 = 52%, mobile_data_issue = 30%, mms_issue = 22%.
Finding 5: User persona affects success. Tasks with "Hard" personas (low-tech users) are harder than "Easy" personas (tech-savvy users). Surprisingly, "None" (no persona) often performs as badly as "Hard," suggesting that well-defined personas actually help the simulator behave more consistently.
Where exactly do agents fail in dual-control? The three evaluation modes (Default, No-User, Oracle Plan) decompose failures into precise categories.
Reasoning failures (visible in No-User mode): The agent has all tools but still fails. This includes wrong diagnoses (checking the wrong thing first), incorrect tool calls (wrong parameters), and policy violations (not following troubleshooting procedures).
Communication failures (the gap between No-User and Default): The agent knows the answer but cannot extract it through conversation. Three specific failure modes:
Coordination failures (unique to dual-control): The agent and user's actions must happen in the right order. If the agent enables roaming after asking the user to toggle data, the data toggle alone does not fix the problem. Sequencing matters.
Here is a concrete example of a failed dual-control interaction:
```text
# Issue: mobile data not working (airplane mode ON + roaming disabled)
User:  "My mobile data is not working."
Agent: calls get_customer_by_id("C1001")
Agent: calls get_line_details("L1002")
# Sees roaming_enabled=false. Correct diagnosis so far.
Agent: "I see that your roaming is not enabled. I'll enable it for you."
Agent: calls enable_roaming("C1001", "L1002")
# FAILURE: Agent never checked airplane mode!
# Even with roaming enabled, airplane mode blocks all connectivity.
# Agent needed to ask user: "Can you check if airplane mode is on?"
Agent: "I've enabled roaming. Your data should work now. Is there anything else?"
User:  "No, it's still not working."
# Agent has now lost context and retries the same fix...
```
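An action-matching style check makes the gap in this trajectory explicit. The sketch below assumes a trajectory is recorded as (player, tool) pairs and that the required solution for this issue is one user call and one agent call; the data layout and the `missing_actions` helper are hypothetical.

```python
def missing_actions(
    trajectory: list[tuple[str, str]],
    required: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return the required (player, tool) calls that never happened."""
    called = set(trajectory)
    return [step for step in required if step not in called]


# Tool calls from the failed conversation above.
failed_run = [
    ("agent", "get_customer_by_id"),
    ("agent", "get_line_details"),
    ("agent", "enable_roaming"),
]

# Solution for "airplane mode ON + roaming disabled".
required = [
    ("user", "toggle_airplane_mode"),
    ("agent", "enable_roaming"),
]

print(missing_actions(failed_run, required))
```

The agent-side half of the fix is present, but the user-side call is absent because the agent never asked for it, which is the coordination failure the transcript shows.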
τ²-bench sits at the intersection of several research threads:
| Connection | Relationship |
|---|---|
| τ-bench (Yao et al., 2024) | The direct predecessor. τ²-bench extends it from single-control to dual-control, adding the telecom domain and compositional task generation. The retail and airline domains are carried over. |
| Dec-POMDPs | τ²-bench formalizes dual-control as a Dec-POMDP. Classic framework for multi-agent partial observability. The complexity asymmetry (agent plans, user reacts) is a specific instance of asymmetric Dec-POMDPs. |
| Agent evaluation survey (Yehudai et al., 2025) | Comprehensive survey of how LLM agents are evaluated. τ²-bench contributes a new evaluation paradigm (dual-control) that fills a gap the survey identifies: benchmarks that test coordination. |
| IntellAgent (Levi & Kadar, 2025) | Programmatic benchmark generation from policy graphs. Complementary approach: IntellAgent generates synthetic proxies, τ²-bench generates verifiable compositional tasks. |
| ToolSandbox (Lu et al., 2024) | Stateful tool evaluation. τ²-bench adds the twist that tools are split between two players who must coordinate through language. |
| Task-oriented dialogue | The legacy of MultiWOZ and similar benchmarks. τ²-bench goes beyond information-seeking dialogue to action-oriented collaboration. |
Limitations acknowledged by the authors: