Barres, Dong, Ray, Si, Narasimhan — Sierra / U of Toronto / Vector Institute — 2025

τ²-Bench: Evaluating Agents in Dual-Control Environments

When the user has tools too. The first benchmark that tests whether an AI agent can guide a human who actively modifies shared state — modeled as a Dec-POMDP over a telecom support domain.

Prerequisites: LLM tool use basics + Agent evaluation concepts
10 Chapters · 5+ Simulations

Chapter 0: The Problem

You call your phone company because your mobile data stopped working. The support agent looks up your account, sees everything is fine on their end, and asks: "Could you check if airplane mode is on?" You open Settings, toggle airplane mode off, and report back. The agent then asks you to toggle mobile data. You do. It works.

Notice what happened: both of you took actions. The agent modified your account in their CRM system. You modified your phone's settings. Neither of you could do the other's job. You were collaborating on a shared problem where the environment — your phone service — depended on both of your actions.

The blind spot in every agent benchmark: Existing benchmarks like τ-bench, WebArena, and SWE-Bench test agents in single-control environments. Only the agent uses tools. The user just types messages — a passive information provider. But real customer support, IT troubleshooting, and technical guidance all require the user to actively do things: restart devices, change settings, run commands. No benchmark tested this — until τ²-bench.

The gap is not just theoretical. When gpt-4.1 is given all the tools (single-control), it scores 52% on telecom tasks. When it must guide a user who holds the device-side tools (dual-control), it drops to 34%. That 18-point gap reflects coordination and communication failure: the agent can solve the problem on its own but cannot get the user to execute the solution.

Figure: Single-Control vs. Dual-Control. In single-control, the agent does everything. In dual-control, agent and user each have their own tools acting on a shared environment.

Concept: Real-world agent tasks often involve dual-control — both the agent and the user can modify shared state through their own tools. Realization: τ²-bench formalizes this as a Dec-POMDP (Decentralized Partially Observable Markov Decision Process) and builds a telecom support domain where the agent has backend CRM tools, the user has phone settings tools, and both must coordinate through natural language to solve problems.
Why do existing agent benchmarks fail to capture the difficulty of real customer support interactions?

Chapter 1: The Dec-POMDP Insight

Why is guiding a user so much harder than doing the task yourself? The answer lies in a concept from multi-agent decision theory: the Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

In a regular POMDP, one agent acts on a partially observable world. In a Dec-POMDP, two agents act on the same world, but each sees only their own slice of it. Neither can observe the other's tools, databases, or internal state. They can only communicate through messages.

Here is the critical asymmetry in τ²-bench's Dec-POMDP:

| Property | Agent | User |
|---|---|---|
| Tools | Backend CRM: read/write customer records, enable services, check line details | Phone settings: toggle airplane mode, toggle data, check status bar, restart phone |
| Database | Customer profiles, subscription plans, line configs | Device state: SIM status, airplane mode, battery, data enabled, signal |
| Observability | Sees CRM data + user messages | Sees phone screen + agent messages |
| Planning | Must diagnose root cause and orchestrate solution | Reacts to agent instructions (does not plan independently) |

Key insight: The user is not an adversary and not a co-equal planner. The user is a reactive tool executor — they follow instructions but cannot diagnose the problem themselves. This creates complexity asymmetry: the agent must think and communicate, while the user just follows and reports. The hard part is not solving the problem — it is explaining the solution steps to someone who cannot see your side of the world.
Figure: Dec-POMDP Information Flow. Both players act on shared state but observe different slices. The only bridge is natural language. The interactive original walks through a troubleshooting trajectory where both agent and user take actions.

The Dec-POMDP framing gives τ²-bench a formal structure for measuring exactly where agents fail. If the agent succeeds in no-user mode (where it controls all tools) but fails in dual-control mode, the failure is purely in communication and coordination — not reasoning.

In τ²-bench's Dec-POMDP, the user is modeled as a reactive tool executor rather than an independent planner. Why is this design choice critical?

Chapter 2: The Telecom Domain

The paper introduces a new telecom technical support domain. A customer calls in with a phone problem — mobile data not working, MMS failing, service disconnected. The agent has backend access to the customer's account. The user has a mocked phone with settings they can toggle.

Here are the actual tool schemas, split by who controls them:

Agent tools (13 total: 7 write, 6 read):

python
# READ tools — agent can inspect the backend
get_customer_by_id(customer_id: str) → Customer
get_customer_by_name(full_name: str) → Customer
search_customers(query: str) → List[Customer]
get_line_details(line_id: str) → LineDetails
get_plan_details(plan_id: str) → PlanDetails
get_available_plans() → List[Plan]

# WRITE tools — agent can modify the backend
enable_roaming(customer_id, line_id) → str
disable_roaming(customer_id, line_id) → str
activate_line(customer_id, line_id) → str
suspend_line(customer_id, line_id) → str
change_plan(customer_id, line_id, plan_id) → str
add_service(customer_id, line_id, service) → str
transfer_to_human(reason: str) → str

User tools (30 total: 15 write, 15 read):

python
# READ tools — user checks their phone
get_status_bar() → str     # "📶 Excellent | 5G | Data ON | 🔋 80%"
check_airplane_mode() → str
check_wifi_status() → str
check_data_status() → str
check_signal_strength() → str
open_browser() → str     # tests if internet works
send_test_mms() → str    # tests if MMS works
... (8 more read tools)

# WRITE tools — user modifies phone settings
toggle_airplane_mode() → str
toggle_data() → str
toggle_wifi() → str
restart_phone() → str
reset_network_settings() → str
toggle_mms() → str
... (9 more write tools)
The crucial design: User tools return human-readable strings, not structured data. When the user calls get_status_bar(), they get a status bar emoji string like a real phone screen — not a JSON object. This constrains the user simulator to behave like a real person reporting what they see, not like an API client parsing structured data.

The agent's database stores structured customer records:

toml
[[customers]]
customer_id = "C1001"
full_name = "John Smith"
date_of_birth = "1985-06-15"
phone_number = "555-123-2002"

[[lines]]
line_id = "L1002"
customer_id = "C1001"
plan_id = "PLAN_PREMIUM"
roaming_enabled = false
status = "active"

The user's database is a mocked phone device:

toml
[device]
sim_card_status = "active"
airplane_mode = false
battery_level = 80
data_enabled = true
wifi_enabled = true
signal_strength = "excellent"
mms_enabled = true
Concept: The telecom domain creates a natural split: backend operations belong to the agent, device operations belong to the user. Realization: This split is not arbitrary — it mirrors the real world. A support agent cannot reach through the phone and toggle your airplane mode. They must ask you to do it, then verify via their backend that the issue resolved. The database schemas above are the actual state representation; toggle_airplane_mode() flips the boolean, and get_status_bar() reads it into a human-readable string.
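The relationship between the device state and the two kinds of user tools can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the `Device` class, the exact wording of the returned strings, and the status bar layout are assumptions made for the example; only the tool names and the state fields come from the schemas above.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Mocked phone state, mirroring a few of the [device] fields above."""
    airplane_mode: bool = False
    data_enabled: bool = True
    signal_strength: str = "excellent"
    battery_level: int = 80


def toggle_airplane_mode(device: Device) -> str:
    """WRITE tool: flip the boolean and describe the result in plain words."""
    device.airplane_mode = not device.airplane_mode
    return f"Airplane mode is now {'ON' if device.airplane_mode else 'OFF'}."


def get_status_bar(device: Device) -> str:
    """READ tool: render the state as a string a person would see on screen."""
    if device.airplane_mode:
        return f"✈️ Airplane mode | 🔋 {device.battery_level}%"
    data = "Data ON" if device.data_enabled else "Data OFF"
    signal = device.signal_strength.capitalize()
    return f"📶 {signal} | {data} | 🔋 {device.battery_level}%"


phone = Device(airplane_mode=True)
print(get_status_bar(phone))        # status bar while airplane mode is on
print(toggle_airplane_mode(phone))  # Airplane mode is now OFF.
print(get_status_bar(phone))        # 📶 Excellent | Data ON | 🔋 80%
```

Because the read tool returns a rendered string, the user simulator can only relay what a person would see, which is the constraint the chapter describes.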
Why do user tools in τ²-bench return human-readable strings (like emoji status bars) instead of structured JSON?

Chapter 3: Compositional Task Generation

How do you create thousands of verifiable test scenarios without hand-writing each one? τ²-bench uses a compositional task generator that builds complex tasks from atomic building blocks.

Each atomic subtask t represents one specific issue (e.g., "airplane mode is on, causing no data"). It is defined by three function sets:

init functions $f^{\text{init}}_{t,k}$
Set up the broken state. Example: set_airplane_mode(True) puts the phone in airplane mode before the conversation starts.
solution functions $f^{\text{sol}}_{t,k}$
The tool calls that fix the issue. Example: toggle_airplane_mode() (user tool). These must be available to either the agent or user.
assertion functions $f^{\text{assert}}_{t,k}$
Verify the fix. Example: assert_service_status("connected") checks the final state. If all assertions pass, the task is solved.
Key insight: Because each subtask has programmatic init, solution, and assertion functions, correctness is programmatically verifiable. You run the init functions, apply the solutions, check the assertions. If they pass, the task is solvable. No ambiguity, no subjective judgment.
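The init / solution / assertion check described above can be sketched as follows. This is a toy model under stated assumptions: the `Subtask` class, the `is_solvable` helper, and the dictionary-based state are hypothetical names invented for the example, and the extra check that the init functions really break the state is an addition of this sketch rather than something the source describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, bool]          # toy stand-in for device and CRM state
Fn = Callable[[State], None]     # init and solution functions mutate the state
Check = Callable[[State], bool]  # assertion functions only read it


@dataclass
class Subtask:
    name: str
    init: List[Fn] = field(default_factory=list)
    solution: List[Fn] = field(default_factory=list)
    assertions: List[Check] = field(default_factory=list)


def is_solvable(subtask: Subtask, healthy: State) -> bool:
    """Break the state, apply the reference fix, then check every assertion."""
    state = dict(healthy)
    for f in subtask.init:
        f(state)
    # Sanity check added by this sketch: init must actually break something.
    broken = not all(check(state) for check in subtask.assertions)
    for f in subtask.solution:
        f(state)
    return broken and all(check(state) for check in subtask.assertions)


def set_airplane_mode_on(s: State) -> None:
    s["airplane_mode"] = True


def toggle_airplane_mode(s: State) -> None:
    s["airplane_mode"] = not s["airplane_mode"]


def assert_connected(s: State) -> bool:
    return not s["airplane_mode"]


airplane = Subtask(
    "airplane_mode_on",
    init=[set_airplane_mode_on],
    solution=[toggle_airplane_mode],
    assertions=[assert_connected],
)
print(is_solvable(airplane, {"airplane_mode": False}))  # True
```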

Atomic subtasks are organized into groups of mutually exclusive alternatives. A composite task picks at most one subtask from each group and concatenates their functions. The telecom domain has 15 atomic subtask groups across three user intents of increasing difficulty:

| Intent | Difficulty | Example subtask | Why harder |
|---|---|---|---|
| service_issue | Easiest | Line suspended → activate it | Agent-side fix only, straightforward |
| mobile_data_issue | Medium | Airplane mode ON + roaming disabled | Requires checking service issues first, then user + agent coordination |
| mms_issue | Hardest | MMS disabled + data off + no roaming | Must resolve data issues first (which may require service fixes), multi-stage chain |

Combining 15 subtask groups programmatically yields 2,285 total tasks. The paper subsamples 114 tasks balanced across intents and difficulty levels (1-9 subtasks per task). The number of subtasks directly controls difficulty — more subtasks mean more diagnostic steps, more user interactions, and more state transitions to track.
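The "at most one subtask per group" rule can be sketched with a Cartesian product. The group and subtask names below are hypothetical, and the sketch ignores any additional constraints the real generator applies, so it is not expected to reproduce the 2,285 figure; it only shows where the combinatorial growth comes from.

```python
from itertools import product
from math import prod

# Hypothetical groups; each maps to its mutually exclusive alternatives.
groups = {
    "airplane_mode": ["airplane_mode_on"],
    "roaming": ["roaming_disabled_agent_side", "roaming_off_on_device"],
    "line_status": ["line_suspended"],
}


def compose(groups):
    """Yield every non-empty task that takes at most one subtask per group."""
    # None means "skip this group".
    options = [[None, *subtasks] for subtasks in groups.values()]
    for choice in product(*options):
        task = tuple(s for s in choice if s is not None)
        if task:
            yield task


tasks = list(compose(groups))
# Each group contributes (number of alternatives + 1) options; drop the empty task.
expected = prod(len(s) + 1 for s in groups.values()) - 1
print(len(tasks), expected)  # 11 11
```

With three small groups the count is already 11; adding groups multiplies the number of composite tasks, which is how a modest set of atomic subtasks yields thousands of tasks.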

Figure: Task Composition Explorer. Atomic subtasks compose into a full task; as subtasks are added, the init/solution/assertion chains grow. Each subtask adds both diagnostic and resolution steps.

What makes τ²-bench's task generation "compositional" rather than just "programmatic"?

Chapter 4: The User Simulator

The biggest weakness of conversational agent benchmarks is the user simulator. In τ-bench's original retail and airline domains, the user is an LLM with a natural language description of their goal. The problem: LLMs hallucinate. They fabricate information, contradict their stated preferences, and behave inconsistently. The retail domain has a 40% user error rate with 12% critical errors that make tasks unsolvable.

τ²-bench's insight: constrain the user simulator not just with prompting, but with tools and environment state.

The breakthrough: Instead of telling the user simulator "you see that your data is disabled" (which it might forget or contradict), you give it the tool get_status_bar() that actually reads the device state. The user cannot lie about what they see because their observation is grounded in the real environment. The telecom domain's user error rate is 16%, versus 40% in the prompt-only retail domain, and critical errors fall from 12% to 6%.

Three design principles make the user simulator reliable:

  1. Tool-grounded observations. The user does not imagine what they see — they call read tools that return the actual device state. If the agent says "check your status bar," the user calls get_status_bar() and reports the real output.
  2. Reactive behavior. The user does not plan independently. They only call tools when the agent asks them to. This limits the action space and prevents the user simulator from "getting ahead" of the conversation.
  3. Human-readable outputs. Tool returns are strings a human would see, not structured data. The user reports "I see four bars and 5G" rather than parsing {"signal": "excellent", "network": "5G"}.

| Domain | Conversations | Critical Errors | Benign Errors | Total Error Rate |
|---|---|---|---|---|
| airline (τ-bench) | 100 | 13% | 34% | 47% |
| retail (τ-bench) | 50 | 12% | 28% | 40% |
| telecom (τ²-bench) | 50 | 6% | 10% | 16% |

Figure: User Simulator, Prompted vs. Tool-Grounded. Two user simulators handle the same scenario. The prompted user relies on memory of initial instructions. The tool-grounded user calls actual device tools. The interactive original steps through the conversation to show where the prompted user makes errors.

Concept: User simulator reliability is the Achilles' heel of conversational benchmarks. Realization: τ²-bench shows that giving the user actual tools that read real environment state is far more effective than elaborate prompting. The environment constrains behavior more reliably than instructions do. This is the same principle behind grounding LLMs with retrieval — connect to reality, don't just describe it.
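The difference between a prompted and a tool-grounded observation can be caricatured in code. This is a deliberately simplified sketch: an LLM that forgets or contradicts its instructions is modeled here as a stale snapshot, and the class names `PromptedUser` and `GroundedUser` are hypothetical. Only the tool name `check_data_status` comes from the tool list earlier in the document.

```python
def check_data_status(device: dict) -> str:
    """READ tool: report the live device state as a plain string."""
    return "Data is ON" if device["data_enabled"] else "Data is OFF"


class PromptedUser:
    """Relies on what the scenario description said; never re-reads the phone."""

    def __init__(self, described_state: dict):
        self.belief = dict(described_state)  # snapshot taken at the start

    def report_data_status(self, device: dict) -> str:
        # The live device is ignored: the answer comes from memory.
        return "Data is ON" if self.belief["data_enabled"] else "Data is OFF"


class GroundedUser:
    """Answers by calling a read tool on the live device state."""

    def report_data_status(self, device: dict) -> str:
        return check_data_status(device)


device = {"data_enabled": False}
prompted, grounded = PromptedUser(device), GroundedUser()

device["data_enabled"] = True  # a toggle happens mid-conversation

print(prompted.report_data_status(device))  # Data is OFF (stale belief)
print(grounded.report_data_status(device))  # Data is ON
```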
How does τ²-bench reduce user simulator error rates from 40% to 16%?

Chapter 5: Dec-POMDP Formalization

Now let us write down the math. The entire τ²-bench interaction is formally a tuple:

$(S, \{A_i\}, \{O_i\}, T, R, U, M)$    where $i \in \{\text{agent}, \text{user}\}$

Let us unpack each component with concrete telecom examples.

Message space M: All possible natural language messages. User: "My data isn't working." Agent: "Could you check if airplane mode is on?"

State space $S$: The global state decomposes as:

$S = S_{\text{world}} \otimes S_{\text{history}}$
$S_{\text{world}} = S_{\text{db,agent}} \otimes S_{\text{db,user}}$

$S_{\text{db,agent}}$ is the CRM (customer profiles, line configs). $S_{\text{db,user}}$ is the phone device state (airplane mode, data enabled, signal). $S_{\text{history}}$ logs every action, observation, and message in order.

Action spaces $A_i$: Player $i$ either calls a tool or sends a message. Only one player acts per turn.

$a_i \in A_i = A_{i,\text{tool}} \cup M$

Agent tool actions: get_customer_by_id("C1001"), enable_roaming("C1001", "L1002"). User tool actions: toggle_airplane_mode(), get_status_bar().

Observation spaces $O_i$: Player $i$ sees either a tool return or a message from the other player.

$o_i \in O_i = O_{i,\text{tool}} \cup M$

Transition function $T$: Given current state $s$ and action $a$, yields new state $s'$ and observation $o$:

$T : S \times A \to S \times O$

Calling enable_roaming("C1001", "L1002") changes $S_{\text{db,agent}}$ (roaming flag flips to true) AND affects $S_{\text{db,user}}$ (the phone can now access roaming networks). This cross-database effect is what makes the environment shared.

Reward function $R$: A function $R : S \to [0, 1]$ that checks whether all assertion functions pass on the final state. Binary: 1 if the task is solved, 0 otherwise.

Key insight: The transition function T can create cross-database effects. When the agent calls enable_roaming(), it changes the agent's database (roaming_enabled = true) AND the user's phone environment (the device can now connect to roaming networks). This is what makes it a genuinely shared environment, not just two independent systems.
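The pieces of the tuple can be made concrete with a toy environment. This is a sketch under explicit assumptions: the `World` class, the `step` and `reward` functions, the observation strings, and the connectivity rule (a roaming scenario where service needs an active line, roaming enabled, airplane mode off, and data on) are all invented for illustration. Only the tool names come from the document.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    """S_world (two databases) plus S_history (the interaction log)."""
    agent_db: dict = field(
        default_factory=lambda: {"roaming_enabled": False, "line_active": True}
    )
    user_db: dict = field(
        default_factory=lambda: {"airplane_mode": True, "data_enabled": True}
    )
    history: list = field(default_factory=list)


def service_connected(w: World) -> bool:
    """Assumed rule: connectivity depends on BOTH databases."""
    return (
        w.agent_db["line_active"]
        and w.agent_db["roaming_enabled"]
        and not w.user_db["airplane_mode"]
        and w.user_db["data_enabled"]
    )


def step(w: World, player: str, action: str) -> str:
    """T: apply one tool call, log it, return the acting player's observation."""
    if player == "agent" and action == "enable_roaming":
        w.agent_db["roaming_enabled"] = True
        obs = "Roaming enabled."
    elif player == "user" and action == "toggle_airplane_mode":
        w.user_db["airplane_mode"] = not w.user_db["airplane_mode"]
        obs = "Airplane mode ON" if w.user_db["airplane_mode"] else "Airplane mode OFF"
    elif player == "user" and action == "open_browser":
        # The user's observation reflects the agent's earlier backend change.
        obs = "Page loaded" if service_connected(w) else "No internet connection"
    else:
        raise ValueError(f"{player} has no tool named {action}")
    w.history.append((player, action, obs))
    return obs


def reward(w: World) -> int:
    """R: 1 if the final state passes the assertion, 0 otherwise."""
    return int(service_connected(w))


w = World()
step(w, "agent", "enable_roaming")
print(step(w, "user", "open_browser"), reward(w))  # No internet connection 0
step(w, "user", "toggle_airplane_mode")
print(step(w, "user", "open_browser"), reward(w))  # Page loaded 1
```

Neither player can reach reward 1 alone in this toy world: the agent cannot call `toggle_airplane_mode`, and the user cannot call `enable_roaming`.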

Instruction space U: Defines the user's scenario (what problem they have) and the agent's domain policy (troubleshooting procedures). The user sees: "Your mobile data is not working. You want to fix it." The agent sees: "Follow these diagnostic steps for data issues: first check service status, then check airplane mode..."

Why is the cross-database transition effect (e.g., agent's enable_roaming() affecting the user's device state) essential to modeling the dual-control environment?

Chapter 6: Evaluation Protocol

How do you know if the agent actually solved the problem? τ²-bench uses multiple evaluation criteria that go beyond simple pass/fail.

The pass^k metric: This is the core reliability metric from τ-bench. Run the same task k times independently. pass^k is the probability, averaged over tasks, that all k runs succeed. Higher k is stricter: it measures consistency, not just one-shot accuracy.

$\text{pass}^k = \mathbb{E}_{\text{task}}\left[\prod_{j=1}^{k} \text{success}_j\right]$

If an agent succeeds 70% of the time on a task and the runs are independent, pass^1 = 0.70, but pass^4 = $0.70^4 \approx 0.24$. The exponential decay penalizes inconsistency.
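The arithmetic can be checked with a short sketch. It follows the product form given above and assumes each task has a fixed per-run success probability with independent runs; the function name `pass_k` and the example success rates are illustrative, not taken from the paper's evaluation code.

```python
from statistics import mean


def pass_k(success_rates, k):
    """Average over tasks of p**k, assuming the k runs of a task are independent."""
    return mean(p ** k for p in success_rates)


# One task solved 70% of the time.
print(round(pass_k([0.70], 1), 2))  # 0.7
print(round(pass_k([0.70], 4), 2))  # 0.24

# Two tasks with the same average pass^1 (0.70) but uneven reliability.
print(round(pass_k([1.0, 0.4], 1), 2))  # 0.7
print(round(pass_k([1.0, 0.4], 4), 3))  # 0.513
```

The second pair shows that pass^k depends on how success is distributed across tasks, not only on the average: a benchmark score of 70% can hide very different reliability profiles.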

Five evaluation criteria (a task can use a subset):

| Criterion | What it checks | Example |
|---|---|---|
| DB check | Agent database matches expected state | Customer's roaming flag is true |
| Status assertions | User device state meets conditions | assert_service_status("connected") |
| NL assertions | History contains expected communication | "Agent diagnosed the cause" |
| Comm info check | Agent communicated required info | Told user the roaming cost |
| Action matching | All solution tool calls appear in trajectory | toggle_airplane_mode was called |

For the telecom domain, only assertion functions are used. This is cleaner than checking communication content — you verify the end state, not how the agent got there. The phone either works or it does not.
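One way to see why checking the end state is cleaner than checking actions is to compare the two criteria on the same trajectories. The sketch below is illustrative only: `action_match`, `replay_airplane_mode`, and `status_assertion` are hypothetical helpers, and the device is reduced to a single airplane-mode flag that starts ON.

```python
def action_match(trajectory, required_calls):
    """Action matching: every required tool call appears in the trajectory."""
    return all(call in trajectory for call in required_calls)


def replay_airplane_mode(trajectory, start=True):
    """Toy device: each toggle_airplane_mode call flips the flag."""
    state = start
    for call in trajectory:
        if call == "toggle_airplane_mode":
            state = not state
    return state


def status_assertion(trajectory):
    """Status assertion: passes only if airplane mode is OFF at the end."""
    return replay_airplane_mode(trajectory) is False


good = ["get_line_details", "toggle_airplane_mode"]
flaky = ["get_line_details", "toggle_airplane_mode", "toggle_airplane_mode"]

for name, traj in [("good", good), ("flaky", flaky)]:
    print(name, action_match(traj, ["toggle_airplane_mode"]), status_assertion(traj))
# good True True
# flaky True False
```

The second trajectory contains the required call, so action matching passes, yet the phone ends up back in airplane mode. Only the status assertion catches that.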

Concept: pass^k exponentially penalizes inconsistency — an agent that solves a task 70% of the time scores only 24% on pass^4. Realization: This matters for deployment. A customer support system that works 7 out of 10 times is unreliable. pass^k captures this by measuring the probability that the agent succeeds on ALL k independent attempts, which is the real-world reliability you care about.

Three evaluation modes for ablation analysis:

Default (dual-control)
Agent and user collaborate. Agent has backend tools, user has phone tools. Normal mode — tests everything.
↓ remove user
No-User
Agent gets a ticket summarizing the problem. Agent controls ALL tools (both backend and phone). Tests pure reasoning — no communication needed.
↓ give answer
Oracle Plan
Agent receives the exact sequence of tool calls needed. Must coordinate with user to execute them. Tests pure communication — no reasoning needed.

The gap between No-User and Default isolates communication failure. The gap between Oracle Plan and Default isolates reasoning failure. Together they decompose agent performance into its constituent skills.

If an agent has pass^1 = 0.50 on a task, what is its pass^4?

Chapter 7: Results

The experiments evaluate four models: gpt-4.1, gpt-4.1-mini, o4-mini, and claude-3.7-sonnet. Each task runs 4 times at temperature 0. The user simulator is always gpt-4.1. Here are the headline findings.

Finding 1: Telecom is the hardest domain.

| Model | Retail pass^1 | Airline pass^1 | Telecom pass^1 |
|---|---|---|---|
| gpt-4.1 | 74% | 56% | 34% |
| gpt-4.1-mini | 59% | 46% | 52% |
| o4-mini | 66% | 53% | 42% |
| claude-3.7-sonnet | 79% | 50% | 49% |

Remarkably, gpt-4.1, among the strongest models on retail (74%, second only to claude-3.7-sonnet at 79%), is the weakest on telecom (34%). The mini model outperforms it. This suggests that raw reasoning power does not translate directly to coordination ability.

Finding 2: The dual-control gap. When agents switch from no-user (they control everything) to dual-control (they must guide the user), performance drops dramatically. gpt-4.1: 52% → 34% (−18 points). o4-mini: 67% → 42% (−25 points). This gap is pure communication and coordination failure — the agent can solve the problem when it has all the tools, but cannot get the user to do the right things.
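The gap in Finding 2 is a simple difference between two modes, which the following sketch computes from the numbers quoted above. The dictionary layout and the helper name `communication_gap` are assumptions made for the example.

```python
# Telecom pass^1 in percent, as quoted in Finding 2.
telecom_pass1 = {
    "gpt-4.1": {"no_user": 52, "default": 34},
    "o4-mini": {"no_user": 67, "default": 42},
}


def communication_gap(scores: dict) -> int:
    """Default minus No-User: what guiding a user costs, in points."""
    return scores["default"] - scores["no_user"]


gaps = {model: communication_gap(s) for model, s in telecom_pass1.items()}
print(gaps)  # {'gpt-4.1': -18, 'o4-mini': -25}
```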

Finding 3: Performance collapses with task complexity. As the number of required actions increases, pass^1 drops toward zero. For tasks requiring 7+ actions in dual-control mode, both gpt-4.1 and o4-mini score near 0%. Even in no-user mode, performance degrades — but the gap between modes narrows, suggesting that long-horizon tasks are hard for reasoning too, not just communication.

Figure: Dual-Control Performance Dashboard. Views in the interactive original: the dual-control gap (Default vs No-User vs Oracle Plan), performance by issue type, and the complexity scaling curve.

Finding 4: Issue type matters. service_issue tasks are easiest (agent-side fixes). mobile_data_issue and mms_issue require multi-stage coordination and score much lower. For gpt-4.1: service_issue pass^1 = 52%, mobile_data_issue = 30%, mms_issue = 22%.

Finding 5: User persona affects success. Tasks with "Hard" personas (low-tech users) are harder than "Easy" personas (tech-savvy users). Surprisingly, "None" (no persona) often performs as badly as "Hard," suggesting that well-defined personas actually help the simulator behave more consistently.

gpt-4.1 scores 74% on retail but only 34% on telecom, while gpt-4.1-mini scores 59% on retail and 52% on telecom. What does this reversal reveal?

Chapter 8: Failure Analysis

Where exactly do agents fail in dual-control? The three evaluation modes (Default, No-User, Oracle Plan) decompose failures into precise categories.

Reasoning failures (visible in No-User mode): The agent has all tools but still fails. This includes wrong diagnoses (checking the wrong thing first), incorrect tool calls (wrong parameters), and policy violations (not following troubleshooting procedures).

Communication failures (the gap between No-User and Default): The agent knows the answer but cannot get the user to carry it out through conversation, for example by giving vague instructions or skipping verification of what the user did.

Coordination failures (unique to dual-control): The agent and user's actions must happen in the right order. If the agent enables roaming after asking the user to toggle data, the data toggle alone does not fix the problem. Sequencing matters.

The workflow policy experiment: The authors created an alternate policy document with more specific step-by-step workflows (instead of general guidelines). It slightly improves Default and No-User performance but hurts Oracle Plan performance. Why? When the agent already has the answer (Oracle Plan), a detailed workflow creates confusion: the agent tries to follow both the ground-truth plan and the workflow, and they conflict. This shows that a detailed policy and a known-correct plan can interfere with each other.

Here is a concrete example of a failed dual-control interaction:

conversation
# Issue: mobile data not working (airplane mode ON + roaming disabled)

User: "My mobile data is not working."
Agent: calls get_customer_by_id("C1001")
Agent: calls get_line_details("L1002")
       # Sees roaming_enabled=false. Correct diagnosis so far.

Agent: "I see that your roaming is not enabled.
        I'll enable it for you."
Agent: calls enable_roaming("C1001", "L1002")

# FAILURE: Agent never checked airplane mode!
# Even with roaming enabled, airplane mode blocks all connectivity.
# Agent needed to ask user: "Can you check if airplane mode is on?"

Agent: "I've enabled roaming. Your data should work now.
        Is there anything else?"
User: "No, it's still not working."
# Agent has now lost context and retries the same fix...
Concept: Agent failures in dual-control decompose into reasoning (wrong diagnosis), communication (vague instructions, missing verification), and coordination (wrong action ordering). Realization: The ablation modes make this decomposition precise: Default − No-User = communication cost. Default − Oracle Plan = reasoning cost. For gpt-4.1, the communication cost (−18 points) exceeds the reasoning cost, confirming that guiding users is the harder problem.
Why does the workflow-based policy hurt performance in Oracle Plan mode?

Chapter 9: Connections

τ²-bench sits at the intersection of several research threads:

| Connection | Relationship |
|---|---|
| τ-bench (Yao et al., 2024) | The direct predecessor. τ²-bench extends it from single-control to dual-control, adding the telecom domain and compositional task generation. The retail and airline domains are carried over. |
| Dec-POMDPs | τ²-bench formalizes dual-control as a Dec-POMDP. Classic framework for multi-agent partial observability. The complexity asymmetry (agent plans, user reacts) is a specific instance of asymmetric Dec-POMDPs. |
| Agent evaluation survey (Yehudai et al., 2025) | Comprehensive survey of how LLM agents are evaluated. τ²-bench contributes a new evaluation paradigm (dual-control) that fills a gap the survey identifies: benchmarks that test coordination. |
| IntellAgent (Levi and Kadar, 2025) | Programmatic benchmark generation from policy graphs. Complementary approach: IntellAgent generates synthetic proxies, τ²-bench generates verifiable compositional tasks. |
| ToolSandbox (Lu et al., 2024) | Stateful tool evaluation. τ²-bench adds the twist that tools are split between two players who must coordinate through language. |
| Task-oriented dialogue | The legacy of MultiWOZ and similar benchmarks. τ²-bench goes beyond information-seeking dialogue to action-oriented collaboration. |

Limitations acknowledged by the authors:

The big picture: τ²-bench demonstrates that the hardest part of building useful agents is not reasoning — it is coordination. When you give the agent all the tools, it succeeds half the time. When it must work through a user, it drops by 18-25 points. This gap — the coordination gap — is the frontier for conversational AI. Closing it requires agents that can explain clearly, verify comprehension, track shared state, and adapt to users who make mistakes. That is a fundamentally different skill than answering questions or calling APIs.
What is the single most important finding of τ²-bench?