Your agent passes your test cases. Then it fails spectacularly on real users. Learn to build evaluation systems that catch failures before deployment.
You just built an agent. It can call tools, search the web, write code, and interact with databases. You test it on ten carefully crafted scenarios. It aces all ten. You deploy it. Within an hour, a user asks it to cancel an order, and it cancels the wrong order. Another user asks for a refund, and the agent enters an infinite loop calling the same API over and over.
What went wrong? You tested it. It worked. But you did what every engineer does at first: you ran a vibe check — a handful of manual tests, eyeballed the outputs, and called it good. Vibe checks feel productive. They are not evaluation.
The problem is that agents are fundamentally different from traditional software. A function that adds two numbers will always return the same result. An agent might take a different path through its reasoning every single time you run it. It might call different tools, in different orders, with different arguments. It interacts with external environments that have state. And it operates over long horizons — a single task might involve twenty tool calls and ten reasoning steps.
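Here's a concrete illustration. Imagine an agent that handles customer service for an airline. A user says "I need to change my flight." The agent must work through a chain of steps, such as finding the right booking, checking the applicable policy, and calling the right API to make the change.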
At each step, the agent could fail differently. It could hallucinate a policy that doesn't exist. It could call the wrong API. It could change the wrong booking. And because LLM sampling involves randomness, running the exact same scenario twice might produce two completely different failure modes.
This is why evaluation — systematic, repeatable measurement of agent performance — matters so much. Without it, you're deploying a system you don't understand into situations you haven't anticipated. The rest of this lesson teaches you how to build evaluations that actually work.
Let's be precise about what's wrong with vibe checks (informal testing) and what real evaluation looks like:
| Property | Vibe Check | Real Evaluation |
|---|---|---|
| Test cases | 5-10, hand-picked | 50-200+, systematically designed |
| Runs per test | 1 (maybe 2) | 5-20 (for statistical significance) |
| Scoring | "Looks good to me" | Automated graders + human review |
| Regression detection | None | CI pipeline blocks regressions |
| Coverage | Happy path only | Edge cases, errors, adversarial inputs |
| Reproducibility | Cannot reproduce | Deterministic setup, saved transcripts |
| Time to run | 5 minutes | 30 minutes to 2 hours |
The difference isn't just rigor — it's confidence. After a vibe check, you hope your agent works. After a real evaluation, you know your agent works on X% of tasks with Y% reliability. You can make data-driven decisions about whether to ship.
Here's the uncomfortable truth: many agent teams in production today are still running vibe checks. They demo the agent to their manager, it works on the demo scenario, and they ship it. Three weeks later, support tickets pile up because the agent hallucinates refund policies, enters infinite loops on edge cases, or cancels the wrong orders. The cost of fixing these failures in production (user trust lost, engineering time burned, revenue impacted) usually exceeds the cost of building proper evaluation upfront.
This lesson gives you the tools to do better. By the end, you'll know how to build an evaluation system that catches these failures before your users do.
Here's a simple but powerful way to understand why long-horizon tasks are so hard. Suppose each step in the agent's reasoning has a 95% chance of being correct. That sounds great — almost always right. But over multiple steps, errors compound:
| Steps | P(all correct) | P(at least one error) |
|---|---|---|
| 1 | 95.0% | 5.0% |
| 3 | 85.7% | 14.3% |
| 5 | 77.4% | 22.6% |
| 10 | 59.9% | 40.1% |
| 20 | 35.8% | 64.2% |
At 20 steps, even with 95% per-step accuracy, you have a 64% chance of at least one error. And in an agent, one error can cascade — calling the wrong tool means getting wrong data, which means making a wrong decision, which means taking a wrong action. One bad step can doom the entire task.
This is why evaluation must test long-horizon tasks, not just short ones. A 3-step task at 95% per-step gives you 86% success. A 20-step task gives you 36%. If you only test 3-step tasks, you'll think your agent is great. Deploy it on 20-step tasks and watch it fail two-thirds of the time.
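The compounding numbers above can be reproduced in a few lines. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability; the function name `p_all_correct` is illustrative, not part of any library.

```python
def p_all_correct(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


for steps in (1, 3, 5, 10, 20):
    ok = p_all_correct(0.95, steps)
    print(f"{steps:>2} steps: P(all correct) = {ok:.1%}, P(at least one error) = {1 - ok:.1%}")
```

Real agents rarely have independent, equally reliable steps, so treat this as a planning model rather than a prediction.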
Before we can evaluate agents, we need to agree on what one is. An agent is an LLM running in a loop, equipped with tools and instructions, that can take actions in an environment. Three components define it: the model, its tools, and its instructions.
The agent harness is the software system that enables a model to act as an agent. It manages the loop: send the context to the model, parse its output for tool calls, execute those tools, feed results back, repeat. When we evaluate an agent, we're evaluating the harness AND the model working together — not the model in isolation.
Modern LLMs are trained to emit special structured output when they want to call a tool. The pattern looks like this:
```json
// 1. You give the model a list of available tools:
{
  "tools": [
    {
      "name": "lookup_order",
      "description": "Look up order details by order ID",
      "parameters": {
        "order_id": { "type": "string" }
      }
    }
  ]
}

// 2. The model responds with a tool call:
{
  "tool_call": {
    "name": "lookup_order",
    "arguments": { "order_id": "ORD-12345" }
  }
}

// 3. Your harness executes the function, returns result:
{
  "tool_result": {
    "status": "shipped",
    "item": "Blue Widget",
    "tracking": "1Z999AA10123456784"
  }
}

// 4. Model sees result, decides next action or responds to user
```
This parse-execute-continue cycle is the heartbeat of every agent. The harness parses the model's output for tool calls, executes them in the real environment, and feeds results back into the context. The cycle continues until the model produces a final response (no tool call) or hits a maximum iteration limit.
Not everything that uses an LLM is an agent. There's a spectrum from simple to fully autonomous:
| Level | What It Does | Example |
|---|---|---|
| Single-turn LLM | One prompt in, one response out | ChatGPT answering a question |
| Chain / Pipeline | Fixed sequence of LLM calls | Summarize → Translate → Format |
| Router | LLM picks which path to take | Classify intent, route to handler |
| Tool-using Agent | LLM calls tools in a loop | Research agent with search + browse |
| Multi-agent System | Multiple agents coordinating | Manager delegates to specialists |
As you move down this spectrum, evaluation gets harder. A single-turn LLM can be evaluated with a simple input-output dataset. A multi-agent system requires evaluating coordination, delegation, and emergent behaviors across multiple interacting components.
Let's look at a concrete harness implementation. This is the basic control flow behind most agent harnesses:

```python
import json


class MaxIterationsError(RuntimeError):
    """Raised when the agent exhausts its iteration budget without finishing."""


def run_agent(user_message, system_prompt, tools, max_iters=25):
    """The core agent loop. Every agent harness is a variant of this.

    `llm` and `execute_tool` are assumed to be provided by your harness.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

    for _ in range(max_iters):
        # Step 1: Send context to model
        response = llm.generate(
            messages=messages,
            tools=tools,
            temperature=0.0,  # lower = more deterministic
        )

        # Step 2: Check if model wants to call a tool
        if response.has_tool_call:
            # Step 3: Parse and execute the tool call
            tool_name = response.tool_call.name
            tool_args = response.tool_call.arguments
            result = execute_tool(tool_name, tool_args)

            # Step 4: Feed result back into context
            messages.append(response.to_message())
            messages.append({"role": "tool", "content": json.dumps(result)})
            continue  # back to Step 1

        # No tool call = final response to user
        messages.append(response.to_message())
        return messages  # the full transcript

    # Safety: hit max iterations without finishing
    raise MaxIterationsError("Agent stuck in loop")
```
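Notice the key design decisions embedded in this code: an iteration cap (`max_iters`) that stops runaway loops, a temperature of 0.0 to make runs more deterministic, tool results serialized as JSON before they re-enter the context, and the full transcript returned so it can be graded later.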
The quality of tool definitions dramatically affects agent performance. Compare these two definitions for the same tool:
Bad tool definition:

```json
{
  "name": "search",
  "description": "Search",
  "parameters": {
    "q": { "type": "string" }
  }
}
```

Good tool definition:

```json
{
  "name": "search_orders",
  "description": "Search orders by customer email, order ID, or date range. Returns up to 10 matching orders.",
  "parameters": {
    "query": {
      "type": "string",
      "description": "Email, order ID, or YYYY-MM-DD"
    }
  }
}
```
The good definition tells the model exactly what the tool does, what inputs it accepts, and what format to use. When evaluating agents, changing tool definitions can shift pass rates by 10-20%. This is part of why we evaluate the harness as a whole, not just the model.
Sometimes a single agent isn't enough. When your system prompt grows to thousands of lines, or you have dozens of overlapping tools, or different tasks require fundamentally different reasoning strategies — you need multiple agents working together. But multi-agent systems are harder to build, debug, and evaluate. So the first rule is: start with a single agent. Only add complexity when the single agent demonstrably fails.
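Three signals tell you a single agent is struggling: a system prompt that has grown to thousands of lines, dozens of overlapping tools, and tasks that demand fundamentally different reasoning strategies.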
There are two main patterns for multi-agent systems:
Manager pattern: A central "manager" agent receives the user's request, decides which specialist to delegate to, and synthesizes the final response. Clean control flow, easy to debug, but the manager becomes a bottleneck.
Decentralized pattern: Agents hand off to each other in a chain or graph. No central controller. More flexible for complex workflows, but harder to reason about and debug — "who's in charge?" becomes ambiguous.
A critical benefit of multi-agent systems is context protection. When a sub-agent handles a tool-heavy subtask, all those tool calls and results fill up the sub-agent's context window, not the main agent's. The main agent only sees the sub-agent's final summary. This keeps the main agent's context clean and focused.
Think of it like delegation at a company. The CEO doesn't sit in on every engineering standup. They delegate to a VP, who reports back a summary. The CEO's "context window" stays clear for strategic decisions.
```python
# Manager agent with specialized sub-agents.
# LLM, Agent, and the *_tools / *_policy names are assumed to come from your harness.
class ManagerAgent:
    def __init__(self):
        self.router = LLM(system_prompt="""You are a routing agent. Given a user request,
determine which specialist to delegate to:
- 'returns': for return/refund requests
- 'booking': for flight/hotel changes
- 'billing': for payment/invoice questions
Respond with ONLY the specialist name.""")
        self.specialists = {
            "returns": Agent(tools=return_tools, prompt=return_policy),
            "booking": Agent(tools=booking_tools, prompt=booking_policy),
            "billing": Agent(tools=billing_tools, prompt=billing_policy),
        }

    def handle(self, user_message):
        # Step 1: Route to specialist (normalize the raw model output first)
        specialist = self.router.generate(user_message).strip().strip("'\"").lower()
        if specialist not in self.specialists:
            # Fail loudly on unexpected router output instead of raising KeyError
            raise ValueError(f"Router returned unknown specialist: {specialist!r}")

        # Step 2: Delegate (sub-agent gets its own clean context)
        result = self.specialists[specialist].run(user_message)

        # Step 3: Manager only sees the summary, not all tool calls
        return result.final_response
```
Notice the key property: the sub-agent runs in its own context. If the returns specialist makes 8 tool calls consuming 4,000 tokens, the manager never sees those tokens. It only gets the final response — maybe 200 tokens. This is the context protection in action.
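When evaluating a multi-agent system, you need to test at three levels: (1) each specialist agent on its own, (2) the router's decisions, and (3) the full system end to end.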
A common mistake is skipping level 2 (routing evaluation). Teams test each specialist and test the full system, but never specifically evaluate whether the router makes correct decisions. A routing error sends the user to the wrong specialist, which wastes time and often fails spectacularly because the specialist doesn't have the right tools for the task.
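Routing evaluation can be as simple as a labeled set of messages and an accuracy count. The sketch below assumes the router is any callable that maps a message to a specialist name; `routing_accuracy` and `keyword_router` are hypothetical names, and the keyword router is only a stand-in so the example runs without a model.

```python
from typing import Callable


def routing_accuracy(route: Callable[[str], str],
                     labeled: list[tuple[str, str]]):
    """Return (accuracy, misroutes) for a router over (message, expected) pairs."""
    misroutes = []
    for message, expected in labeled:
        got = route(message).strip().lower()
        if got != expected:
            misroutes.append((message, expected, got))
    return 1 - len(misroutes) / len(labeled), misroutes


def keyword_router(message: str) -> str:
    """Stand-in for the LLM router (assumption: real routing would call a model)."""
    text = message.lower()
    if "refund" in text or "return" in text:
        return "returns"
    if "flight" in text or "hotel" in text:
        return "booking"
    return "billing"


cases = [
    ("I want a refund for my order", "returns"),
    ("Please move my flight to Friday", "booking"),
    ("Why was I charged twice?", "billing"),
    ("Can I return my hotel voucher?", "returns"),
]
accuracy, misroutes = routing_accuracy(keyword_router, cases)
print(f"routing accuracy: {accuracy:.0%}; misroutes: {misroutes}")
```

Keeping the misrouted examples, not just the accuracy number, makes it easier to see which intents the router confuses.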
An agent's context window is its working memory. Everything the agent knows — the system prompt, conversation history, tool results, retrieved documents — lives in this finite window. And here's the problem: context rots.
Context rot is what happens when an agent runs for many steps. Each tool call adds its result to the context. Each reasoning step adds tokens. After twenty iterations, the context is stuffed with old tool results, irrelevant intermediate reasoning, and stale information. The model's attention is spread across thousands of tokens of noise, and the signal — the user's original request, the current state of the task — gets buried.
There are two strategies for getting information into the context window:
| Strategy | How It Works | Tradeoff |
|---|---|---|
| Static (RAG) | Pre-load relevant documents into the context before the agent starts | Wastes tokens if docs aren't needed; stale if task evolves |
| Dynamic (Tool-based) | Agent discovers context by calling tools as needed | Uses tokens only when needed; agent decides what's relevant |
The modern best practice is progressive disclosure: give the agent tools to discover context rather than pre-loading everything. Let the agent search a knowledge base, look up a policy document, or query a database when it needs information. This way, only relevant information enters the context, and the agent stays in control of what it knows.
Even with progressive disclosure, context eventually fills up. Three strategies fight context rot:
Each strategy involves a tradeoff: compaction loses detail but reclaims space. The art is knowing when to compact and how much to keep. Compact too aggressively and the agent forgets important context. Compact too little and you hit the token limit.
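To make the compaction tradeoff concrete, here is a minimal sketch. It assumes messages are dicts with `role` and `content` keys, that the first two messages (system prompt and original request) must always survive, and that a crude truncation stands in for a real summarizer, which would more likely be a model call. The names `compact`, `keep_recent`, and `max_chars` are illustrative.

```python
def compact(messages: list[dict], keep_recent: int = 4, max_chars: int = 80) -> list[dict]:
    """Collapse older messages into one summary, keeping the first two and the most recent."""
    head, body = messages[:2], messages[2:]  # system prompt + original user request
    split = len(body) - keep_recent
    if split <= 0:
        return list(messages)  # nothing old enough to compact
    old, recent = body[:split], body[split:]
    # Placeholder summarizer: keep only the first `max_chars` characters of each old message.
    lines = [f"- {m['role']}: {m['content'][:max_chars]}" for m in old]
    summary = {"role": "system", "content": "Summary of earlier steps:\n" + "\n".join(lines)}
    return head + [summary] + recent


def total_chars(messages: list[dict]) -> int:
    return sum(len(m["content"]) for m in messages)


history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "Cancel ORD-100 and check the status of ORD-200."},
] + [{"role": "tool", "content": f"result {i}: " + "x" * 500} for i in range(10)]

compacted = compact(history)
print(len(history), "->", len(compacted), "messages;",
      total_chars(history), "->", total_chars(compacted), "characters")
```

Raising `keep_recent` or `max_chars` keeps more detail at the cost of space, which is exactly the dial the paragraph above describes.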
Let's do the math for a typical agent task. Say you're using a model with a 128K token context window:
| Component | Tokens | Percentage |
|---|---|---|
| System prompt + policies | 3,000 | 2.3% |
| Tool definitions (15 tools) | 2,500 | 2.0% |
| User message | 200 | 0.2% |
| Average tool result | 500 | 0.4% each |
| Model reasoning per step | 300 | 0.2% each |
| After 20 tool calls | ~22,000 | 17% |
| After 50 tool calls | ~46,000 | 36% |
At 50 tool calls, you've consumed 36% of the context window. That sounds manageable, but here's the catch: the attention quality of a Transformer degrades well before you hit the limit. Research shows that models struggle to attend to information in the middle of long contexts (the "lost in the middle" problem). By the time you're at 50% utilization, the model may already be failing to recall its original instructions.
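The budget arithmetic in the table can be written out as a small calculator. This sketch uses the table's own figures as constants; the helper names are illustrative, and real token counts vary by task and tokenizer.

```python
WINDOW = 128_000                 # context window size in tokens
FIXED = 3_000 + 2_500 + 200      # system prompt + tool definitions + user message
PER_CALL = 500 + 300             # average tool result + model reasoning per step


def context_tokens(tool_calls: int) -> int:
    """Estimated tokens in context after a given number of tool calls."""
    return FIXED + tool_calls * PER_CALL


def calls_until(fraction: float) -> int:
    """Largest number of tool calls that keeps usage at or below `fraction` of the window."""
    return (int(fraction * WINDOW) - FIXED) // PER_CALL


for calls in (20, 50):
    used = context_tokens(calls)
    print(f"{calls} tool calls: {used:,} tokens ({used / WINDOW:.0%} of window)")
print("tool calls before passing 50% utilization:", calls_until(0.5))
```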
Here's a concrete example of progressive disclosure vs static context. Imagine an agent that helps with HR policies:
```python
# BAD: static context (dump everything into the system prompt)
system_prompt = """You are an HR assistant. Here are ALL company policies:
[3,000 tokens of vacation policy]
[2,000 tokens of leave policy]
[4,000 tokens of benefits policy]
[2,500 tokens of expense policy]
...
"""
# Result: 15,000+ tokens wasted if user only asks about vacation

# GOOD: progressive disclosure (agent discovers what it needs)
system_prompt = """You are an HR assistant. Use the search_policy tool
to look up relevant policies before answering any question."""

tools = [{
    "name": "search_policy",
    "description": "Search company HR policies by topic",
    "parameters": {"query": {"type": "string"}},
}]
# Result: only loads the 800 tokens about vacation when needed
```
The static approach consumes 15,000 tokens upfront for policies the user may never ask about. The progressive approach uses 800 tokens only when needed. Over a 20-step conversation, that 14,200-token savings compounds — leaving more room for actual tool results and reasoning.
Here's a real failure pattern caused by context rot. An agent is handling a complex customer request that requires multiple tool calls:
Transcript (simplified):

```text
Turn 1: User: "I have two orders. Cancel ORD-100 and check status of ORD-200."
Turn 2: Agent calls lookup_order("ORD-100")          # 500 tokens of result
Turn 3: Agent calls check_cancel_policy("ORD-100")   # 400 tokens of result
Turn 4: Agent calls cancel_order("ORD-100")          # 300 tokens of result
Turn 5: Agent calls lookup_order("ORD-200")          # 500 tokens of result
Turn 6: Agent calls get_tracking("ORD-200")          # 600 tokens of result

# By Turn 6, context has ~2,300 tokens of tool results.
# Agent's response:
Turn 7: "I've cancelled ORD-200 and your order ORD-100 is in transit."
# WRONG! The response mixes up the two order IDs.
```

The agent correctly executed all the tool calls but then confused the two orders in its final response. It cancelled ORD-100 (correct) but then told the user it had cancelled ORD-200 (wrong). This is classic context rot: the model's attention, spread across roughly 2,300 tokens of tool results, failed to correctly attribute which results belonged to which order.
This failure would be caught by a code grader that checks the response text for correct order IDs. But it would NOT be caught by a pure outcome grader that only checks database state (the cancellation was correct). You need both.
Now we get to the core. Every agent evaluation system, no matter how sophisticated, boils down to five components. Learn these five, and you can evaluate anything.
A task is a predefined input paired with success criteria. "Given this user message and this database state, the agent should do X." A task includes:
Good tasks are specific, solvable, and have clear success criteria. "Handle a customer complaint" is a bad task. "User wants to return order ORD-789. Policy allows returns within 30 days. Order was placed 15 days ago. Agent should process the return and confirm to the user" is a good task.
A trial is one attempt at completing a task. Because agents are non-deterministic, running the same task twice produces different trials. This is why we need multiple trials per task — a single trial tells you almost nothing about reliability.
The transcript is the complete record of everything the agent did during a trial: every message, every tool call, every tool result, every reasoning step. This is your forensic evidence when things go wrong. A good evaluation system always saves the full transcript.
```text
# Example transcript structure:
[
  {"role": "user", "content": "I want to return order ORD-789"},
  {"role": "assistant", "tool_call": "lookup_order(ORD-789)"},
  {"role": "tool", "result": "{status: delivered, days_ago: 15}"},
  {"role": "assistant", "tool_call": "check_return_policy(ORD-789)"},
  {"role": "tool", "result": "{eligible: true, window: 30 days}"},
  {"role": "assistant", "tool_call": "process_return(ORD-789)"},
  {"role": "tool", "result": "{return_id: RET-456, refund: $49.99}"},
  {"role": "assistant", "content": "Your return has been processed..."}
]
```
The outcome is the final environment state after the agent finishes. Did the database change correctly? Was the right API called? Does the final response contain the right information? Outcomes can be checked by inspecting the environment, the agent's final output, or both.
The grader takes a transcript and/or outcome and produces a score. "Did this trial succeed?" Graders can be human reviewers, automated code checks, or LLMs acting as judges. We'll dive deep into grader types in the next chapter.
Together, the five components form a pipeline: task → agent runs (producing a transcript) → grader scores → result stored.
A subtle but critical requirement: each trial must run in a clean, isolated environment. If Trial 1 cancels an order in the database, Trial 2 must start with the order still active. Otherwise, Trial 2 might fail not because the agent is broken, but because the order was already cancelled by Trial 1.
Three isolation strategies:
| Strategy | How It Works | Cost |
|---|---|---|
| Database snapshot/restore | Save DB state before each trial, restore after | Low (fast for small DBs) |
| Docker containers | Fresh container per trial with pre-loaded data | Medium (container startup time) |
| Mock environments | In-memory mock of all tools and state | Low (fastest, but may miss integration bugs) |
Docker containers are the most robust approach — they guarantee complete isolation. Mock environments are fastest but risk missing bugs that only appear in real tool interactions. Most production eval systems use a hybrid: mocked tools for fast iteration, Docker containers for full regression runs.
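For an in-memory or mocked database, the snapshot idea reduces to handing every trial its own deep copy. The sketch below assumes the environment state fits in a Python dict; `isolated`, `seed_db`, and `fake_trial` are hypothetical names, and `fake_trial` stands in for a real agent run.

```python
import copy
from contextlib import contextmanager


@contextmanager
def isolated(db: dict):
    """Yield a private deep copy of `db` so a trial can never mutate the seed state."""
    yield copy.deepcopy(db)


seed_db = {"ORD-123": {"status": "processing"}}


def fake_trial(db: dict) -> str:
    """Stand-in for an agent run that cancels the order."""
    db["ORD-123"]["status"] = "cancelled"
    return db["ORD-123"]["status"]


results = []
for trial in range(3):
    with isolated(seed_db) as db:
        # Every trial starts from the same clean state.
        assert db["ORD-123"]["status"] == "processing"
        results.append(fake_trial(db))

print(results, "| seed state:", seed_db["ORD-123"]["status"])
```

A real database or a containerized environment needs heavier machinery, but the contract is the same: no state leaks from one trial into the next.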
Every trial should save the complete transcript in a structured format. This is your debugging lifeline. When pass rates drop, you read the transcripts to understand why. A good transcript record includes:
```json
{
  "task_id": "cancel-order-01",
  "trial_id": "trial-007",
  "timestamp": "2026-05-18T14:30:00Z",
  "model": "claude-opus-4-20250514",
  "temperature": 0.0,
  "messages": [/* full message sequence */],
  "tool_calls": [
    {"name": "lookup_order", "args": {"id": "ORD-123"}, "latency_ms": 45},
    {"name": "cancel_order", "args": {"id": "ORD-123"}, "latency_ms": 120}
  ],
  "total_tokens": 3847,
  "total_latency_ms": 4521,
  "db_state_before": {/* snapshot */},
  "db_state_after": {/* snapshot */},
  "grading": {
    "code_grader": {"passed": true, "checks": ["db_state", "tool_calls"]},
    "model_grader": {"score": 4, "reasoning": "..."}
  },
  "result": "PASS"
}
```
With this structure, you can query your evaluation database: "Show me all failing trials for task 'cancel-order-01' where the agent called the wrong tool." This turns debugging from guesswork into data analysis.
Let's put it all together with a worked example:
| Component | Concrete Value |
|---|---|
| Task | "Cancel order ORD-123. Policy: cancellation allowed if status = 'processing'." |
| Trial 1 | Agent calls lookup_order → cancel_order → responds "Done." (3 steps) |
| Trial 2 | Agent calls lookup_order → check_policy → cancel_order → responds "Cancelled." (4 steps) |
| Transcript | Full sequence of messages and tool calls for each trial |
| Outcome | Database: ORD-123 status changed from "processing" to "cancelled" |
| Grader | Code check: assert db["ORD-123"]["status"] == "cancelled" |
Notice that Trial 1 and Trial 2 took different paths (3 steps vs 4 steps) but both arrived at the correct outcome. This is normal for agents — the path varies, but we care about the destination.
This brings us to a fundamental design choice: do you grade the trajectory (how the agent got there) or the outcome (where it ended up)?
Trajectory-based: Check that the agent called the right tools in the right order. "Did it call lookup_order before cancel_order?"
Pro: Catches unsafe intermediate steps.
Con: Overly prescriptive — rejects valid alternative paths.
Outcome-based: Only check the final result. "Is the order cancelled? Is the response correct?"
Pro: Accepts any valid solution strategy.
Con: Might miss dangerous intermediate actions (e.g., agent deleted other data before fixing it).
In practice, the best approach combines both: outcome-based grading for "did it work?" plus lightweight trajectory checks for safety constraints: "did the agent avoid calling the delete_all_data() tool?" "did it confirm with the user before making changes?"
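A combined grader might look like the sketch below. It assumes tool calls are stored as strings such as `cancel_order(ORD-123)` and that the database is a plain dict; `grade_trial` and its parameters are illustrative names, not an established API.

```python
def grade_trial(transcript, db, order_id, forbidden_tools=("delete_all_data",)):
    """Pass only if the outcome is correct AND no forbidden tool was called."""
    # Extract tool names from call strings like "cancel_order(ORD-123)".
    called = [m["tool_call"].split("(", 1)[0] for m in transcript if "tool_call" in m]
    outcome_ok = db.get(order_id, {}).get("status") == "cancelled"
    violations = [name for name in called if name in forbidden_tools]
    return {
        "passed": outcome_ok and not violations,
        "outcome_ok": outcome_ok,
        "violations": violations,
    }


db_after = {"ORD-123": {"status": "cancelled"}}
short_path = [
    {"role": "assistant", "tool_call": "lookup_order(ORD-123)"},
    {"role": "assistant", "tool_call": "cancel_order(ORD-123)"},
    {"role": "assistant", "content": "Done."},
]
unsafe_path = [{"role": "assistant", "tool_call": "delete_all_data()"}] + short_path

print(grade_trial(short_path, db_after, "ORD-123"))
print(grade_trial(unsafe_path, db_after, "ORD-123"))
```

The outcome check accepts any valid path, while the short forbidden-tool list catches the unsafe one even though the final database state is identical.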
How many trials do you need per task? This depends on how precise you need your estimates to be. Here's a rule of thumb:
| Trials per Task | Precision | Use Case |
|---|---|---|
| 5 | ±20% | Quick directional check — is pass rate above 50%? |
| 10 | ±15% | Standard evaluation — reliable pass@1 and pass^3 |
| 20 | ±10% | Precise comparison — is Agent A better than Agent B? |
| 50+ | ±5% | High-stakes deployment decision |
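These are approximate: the figures are close to one standard error for a pass rate near 50%, and a 95% confidence interval is roughly twice as wide. The actual interval also depends on the true pass rate. As a planning heuristic, 10 trials per task is the sweet spot for most iterative development. Use 20 or more for version comparisons and final deployment decisions, and 50 or more when the stakes are high.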
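To see how wide the uncertainty really is for a specific result, you can compute an interval directly. The sketch below uses the Wilson score interval, one standard choice for binomial proportions; it is offered as an illustration and is not necessarily how the table's rule-of-thumb figures were produced.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% confidence)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


for successes, trials in ((4, 5), (8, 10), (16, 20), (40, 50)):
    lo, hi = wilson_interval(successes, trials)
    print(f"{successes}/{trials} passed: 95% interval is about {lo:.0%} to {hi:.0%}")
```

The same 80% observed pass rate supports very different conclusions at 5 trials and at 50, which is the reason to budget trials before comparing agents.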
You have a transcript. Now you need to decide: did the agent succeed? This is the job of the grader. There are three families of graders, each with distinct strengths and weaknesses. The best evaluation systems use all three.
Human graders are the gold standard. A subject-matter expert reads the transcript, evaluates the outcome, and assigns a score. Nothing beats a human who deeply understands the domain.
But human grading has problems. It's slow (minutes per transcript), expensive (expert time), and doesn't scale. You can't have a human review 10,000 trials overnight. And humans disagree with each other — inter-annotator agreement (how often two humans give the same score) is often surprisingly low, especially for subjective quality judgments.
Variants of human grading:
Code-based graders are deterministic functions that check specific properties of the outcome or transcript. They're fast, cheap, and perfectly reproducible. But they're brittle — they can only check things you can express as code.
```python
# Example code-based graders. Transcripts follow the structure shown earlier:
# tool calls are stored as strings such as "lookup_order(ORD-789)".

def check_string_match(transcript, expected):
    """Did the final response contain the expected string?"""
    final_msg = transcript[-1].get("content", "")
    return expected.lower() in final_msg.lower()


def check_tool_called(transcript, tool_name):
    """Was the required tool called at least once?"""
    # Match "name(" so that "cancel" does not also match "cancel_order".
    return any(
        msg.get("tool_call", "").startswith(f"{tool_name}(")
        for msg in transcript
    )


def check_database_state(db, order_id, expected_status):
    """Did the database end up in the correct state?"""
    return db[order_id]["status"] == expected_status


def check_no_forbidden_tool(transcript, forbidden):
    """Did the agent avoid calling a forbidden tool?"""
    return not check_tool_called(transcript, forbidden)
```
Code-based graders are perfect for objective criteria: "Was the database updated?" "Did the response include the order ID?" "Was the forbidden tool avoided?" They fail for subjective criteria: "Was the tone professional?" "Did the explanation make sense?"
Model-based graders use another LLM to evaluate the agent's output. You provide a rubric and the transcript, and the judge LLM scores it. This combines the flexibility of human judgment with the scalability of automation.
```python
import json


# LLM-as-Judge grader (simplified). `llm` is assumed to be your model client.
def llm_judge(transcript, rubric):
    prompt = f"""You are an expert evaluator.
Given this agent transcript:
{transcript}

Score the agent on the following rubric:
{rubric}

Respond with a JSON object: {{"score": 1-5, "reasoning": "..."}}"""
    response = llm.generate(prompt)
    # Assumes the judge returns bare JSON; add error handling for malformed output.
    return json.loads(response)
```
Three common patterns for model-based grading:
| Pattern | How It Works | Best For |
|---|---|---|
| Rubric scoring | Judge scores on a 1-5 scale using criteria | Absolute quality assessment |
| Pairwise comparison | Judge picks which of two outputs is better | Comparing agent versions (A/B) |
| Reference-guided | Judge compares output to a gold-standard reference | Tasks with known correct answers |
The catch: model-based graders are themselves non-deterministic. Run the same judgment twice and you might get different scores. They also have biases — they tend to prefer longer outputs, outputs that look like their own writing style, and outputs listed first in pairwise comparisons. You need to calibrate model-based graders against human judgments before trusting them.
How do you know your LLM judge is actually reliable? The standard approach is calibration against human labels:
A well-calibrated LLM judge typically agrees with human experts 80-90% of the time on binary (pass/fail) judgments. For 5-point Likert scales, expect within-1-point agreement about 70-80% of the time. If you're below these numbers, the judge is too noisy to trust.
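Both agreement numbers are simple to compute once you have paired labels. A minimal sketch, assuming human and judge labels are aligned lists over the same trials; the function names are illustrative.

```python
def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of trials where the judge's pass/fail label matches the human's."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def within_one_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of trials where Likert scores differ by at most one point."""
    if not human or len(human) != len(judge):
        raise ValueError("score lists must be non-empty and the same length")
    return sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / len(human)


human_pass = [True, True, False, False, True]
judge_pass = [True, False, False, False, True]
print("binary agreement:", agreement_rate(human_pass, judge_pass))

human_scores = [5, 3, 1, 4]
judge_scores = [4, 3, 3, 4]
print("within-1 agreement:", within_one_rate(human_scores, judge_scores))
```

Raw agreement is the measure the thresholds above refer to; chance-corrected statistics such as Cohen's kappa are a common complement when one label dominates.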
| Bias | What Happens | Mitigation |
|---|---|---|
| Length bias | Prefers longer, more verbose outputs | Add "conciseness" to rubric; penalize unnecessary length |
| Position bias | In pairwise comparisons, prefers the first option | Run each comparison twice with swapped order; average results |
| Self-preference | Prefers outputs that sound like itself | Use a different model family as judge than as agent |
| Anchoring | If given a reference answer, rates similar outputs higher | Score without reference first, then use reference as secondary check |
| Leniency | Avoids giving low scores; clusters around 3-4 out of 5 | Use binary pass/fail when possible; clearer signal than Likert scales |
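The position-bias mitigation in the table can be sketched as follows. It assumes a pairwise judge exposed as a callable that returns `"first"` or `"second"`; instead of averaging numeric scores, this variant declares a tie when the two orderings disagree. All names here are hypothetical, and the two toy judges exist only to make the example runnable.

```python
from typing import Callable


def debiased_preference(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Run the comparison in both orders; return "a", "b", or "tie"."""
    first_run = judge(a, b)    # a is shown first
    second_run = judge(b, a)   # b is shown first
    a_wins = (first_run == "first") + (second_run == "second")
    if a_wins == 2:
        return "a"
    if a_wins == 0:
        return "b"
    return "tie"  # the verdict flipped with position, so it is not trustworthy


def always_first(x: str, y: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(x: str, y: str) -> str:
    """Toy judge with a consistent (length-based) preference."""
    return "first" if len(x) >= len(y) else "second"


print(debiased_preference(always_first, "short answer", "a much longer answer"))
print(debiased_preference(prefers_longer, "short answer", "a much longer answer"))
```

A high tie rate across a dataset is itself a useful signal that the judge's pairwise verdicts depend on ordering.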
You've run your agent on 50 tasks, 10 trials each, and collected 500 scores. Now what? You need metrics that capture both capability (can the agent solve the task at all?) and reliability (can it solve it consistently?). Two metrics do this: Pass@K and Pass^K.
Pass@K measures the probability that at least one out of K attempts succeeds. If you give the agent K tries, what are the odds it gets it right at least once?
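$$\text{pass@}K = 1 - \frac{C(n - c,\, K)}{C(n,\, K)}$$

Here n is the total number of trials you ran, c is the number of successful trials, and C(a, b) is the binomial coefficient "a choose b." Let's unpack this with a worked example. With n = 10 trials, c = 3 successes, and K = 2, there are C(7, 2) = 21 all-failure pairs out of C(10, 2) = 45 possible pairs, so pass@2 = 1 - 21/45 ≈ 0.533.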
Pass^K flips the question: do ALL K attempts succeed? This measures reliability: can you trust the agent to get it right every time, not just sometimes?

$$\text{pass}^{K} = \frac{C(c,\, K)}{C(n,\, K)}$$
The gap between pass@K and pass^K reveals a critical truth about your agent. High pass@K means the agent can solve the task. Low pass^K means it does so unreliably. For production systems, reliability matters more than capability. A user doesn't get 3 tries — they get one.
Even for agents with respectable per-trial accuracy, pass^K drops dramatically as K increases. Let's compute for an agent with 80% per-trial success (c = 8 out of n = 10):
| K | Pass@K | Pass^K | Gap |
|---|---|---|---|
| 1 | 0.800 | 0.800 | 0.000 |
| 2 | 0.978 | 0.622 | 0.356 |
| 3 | 1.000 | 0.467 | 0.533 |
| 5 | 1.000 | 0.222 | 0.778 |
At K = 5, pass@5 is exactly 1.0: with only 2 failures among the 10 trials, any 5 sampled trials must include at least one success. But pass^5 is only 0.222, meaning the agent succeeds on all 5 attempts about 22% of the time. For an 80% accurate agent! This is why pass^K is the metric that matters for production deployment.
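Both metrics take only a few lines with Python's standard library. This sketch implements the exact sampling-without-replacement definitions, where K trials are drawn from the n you ran; the function names `pass_at_k` and `pass_hat_k` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k trials drawn without replacement from n is a success)."""
    return 1 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all k trials drawn without replacement from n are successes)."""
    return comb(c, k) / comb(n, k)


n, c = 10, 8
for k in (1, 2, 3, 5):
    at_k, hat_k = pass_at_k(n, c, k), pass_hat_k(n, c, k)
    print(f"K={k}: pass@K={at_k:.3f}  pass^K={hat_k:.3f}  gap={at_k - hat_k:.3f}")
```

`math.comb` returns 0 when K exceeds the number of available items, so pass@K becomes exactly 1 once K is larger than the number of failures.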
Where do these formulas come from? Let's derive them from first principles.
Pass@K derivation: We have n total trials, c of which succeeded. We want to pick K trials at random. What's the probability that at least one is a success?
It's easier to compute the complement: the probability that all K picks are failures. There are (n - c) failures total. The number of ways to pick K failures from (n - c) is C(n - c, K). The total ways to pick K trials from n is C(n, K). So:

$$\text{pass@}K = 1 - P(\text{all } K \text{ picks fail}) = 1 - \frac{C(n - c,\, K)}{C(n,\, K)}$$

Pass^K derivation: Same setup, but now we want the probability that ALL K picks are successes. The number of ways to pick K successes from c is C(c, K). So:

$$\text{pass}^{K} = \frac{C(c,\, K)}{C(n,\, K)}$$
Both formulas use the hypergeometric distribution — sampling without replacement from a finite population. This is important: we're not assuming each trial is independent (which would give a simpler binomial model). The hypergeometric approach accounts for the fact that we ran a fixed number of trials and observed a fixed number of successes.
If you assume trials are independent with success probability p = c/n, you get simpler approximations:

$$\text{pass@}K \approx 1 - (1 - p)^{K}, \qquad \text{pass}^{K} \approx p^{K}$$
These are good approximations when n is large relative to K. For small n (say n = 10, K = 5), use the exact hypergeometric formulas. For large n (n = 100), the approximations are close enough.
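A quick comparison shows how the approximation error shrinks as n grows. This sketch assumes the same 80% observed pass rate at two sample sizes and K = 5; the helper names are illustrative.

```python
from math import comb


def exact(n: int, c: int, k: int) -> tuple[float, float]:
    """(pass@K, pass^K) from the exact sampling-without-replacement formulas."""
    return 1 - comb(n - c, k) / comb(n, k), comb(c, k) / comb(n, k)


def approx(n: int, c: int, k: int) -> tuple[float, float]:
    """(pass@K, pass^K) assuming independent trials with p = c / n."""
    p = c / n
    return 1 - (1 - p) ** k, p ** k


k = 5
for n, c in ((10, 8), (100, 80)):
    (_, exact_hat), (_, approx_hat) = exact(n, c, k), approx(n, c, k)
    print(f"n={n:>3}: exact pass^{k}={exact_hat:.3f}  approx={approx_hat:.3f}  "
          f"error={abs(exact_hat - approx_hat):.3f}")
```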
Agent A achieves 70% per-trial success (7/10). Agent B achieves 50% per-trial success (5/10). But Agent B uses a retry mechanism that auto-retries on failure. Which is better for production?
| Metric | Agent A (p=0.7) | Agent B (p=0.5, 2 retries) |
|---|---|---|
| pass@1 | 0.700 | 0.500 |
| pass@3 (effective) | 0.992 | 0.917 (3 tries total) |
| pass^1 | 0.700 | 0.500 |
| pass^3 | 0.292 | 0.083 |
| Cost per task | 1x | ~1.75x (expected attempts: 1 + 0.5 + 0.25) |
Agent A is better on every metric. The retry mechanism helps Agent B's pass@3 (0.917 is decent), but its pass^3 (0.083) means it reliably succeeds on 3 consecutive attempts only 8% of the time. Retries mask capability problems — they don't fix them.
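The cost of a retry mechanism can be estimated with a short calculation. This sketch assumes attempts are independent with a fixed success probability p, that the agent stops at the first success, and that every attempt costs the same. The names `expected_attempts` and `success_within` are illustrative.

```python
def expected_attempts(p: float, max_tries: int) -> float:
    """Expected number of attempts when retrying until success, capped at max_tries."""
    # Attempt i (1-based) only happens if the previous i - 1 attempts all failed.
    return sum((1.0 - p) ** (i - 1) for i in range(1, max_tries + 1))


def success_within(p: float, max_tries: int) -> float:
    """P(at least one success within max_tries independent attempts)."""
    return 1.0 - (1.0 - p) ** max_tries


if __name__ == "__main__":
    # Agent B from the comparison: p = 0.5 with up to 3 tries in total.
    print(f"expected attempts: {expected_attempts(0.5, 3):.2f}")
    print(f"success within 3:  {success_within(0.5, 3):.3f}")
```

Under these assumptions Agent B averages 1.75 attempts per task. Real costs can be higher if failed attempts run longer than successful ones, for example when a failing run burns through its full turn budget.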
Theory is great, but what does a real agent evaluation look like? Let's study τ-bench (tau-bench), one of the most well-designed agent benchmarks. It evaluates agents on dynamic, multi-turn customer service conversations in two domains: retail and airline.
The magic of τ-bench is that the user is also an LLM. This means conversations are dynamic — the simulated user reacts to the agent's responses, asks follow-up questions, provides clarifications, and can even get confused or frustrated. No two conversations are identical, even for the same task.
Each task specifies a scenario. For example: "User wants to change a flight from economy to business class. The fare difference is $350. The user has $200 in credit. Policy: credit can be applied to upgrades." The user simulator drives the conversation, and the agent must navigate the tools and policies to resolve it correctly.
```python
# Simplified τ-bench agentic loop
# execute_tool, tool_result, and grade are harness helpers that are not shown here.
import copy


def run_tau_bench_trial(task, agent, user_sim, db, tools, policies):
    # Initialize conversation with user's opening message
    user_msg = user_sim.generate_opening(task.scenario)
    conversation = [user_msg]
    db_snapshot = copy.deepcopy(db)  # snapshot of the initial state, kept for grading

    for turn in range(25):  # max 25 turns
        # Agent sees: system prompt + policies + conversation
        agent_response = agent.generate(
            system=policies, tools=tools, messages=conversation
        )

        # If agent wants to call a tool:
        if agent_response.has_tool_call:
            result = execute_tool(agent_response.tool_call, db)
            conversation.append(agent_response)
            conversation.append(tool_result(result))
            continue  # agent gets to see result and decide next

        # If agent responds to user:
        conversation.append(agent_response)

        # Check if conversation is done
        if user_sim.is_satisfied(conversation, task):
            break

        # User simulator responds
        user_reply = user_sim.generate_reply(conversation, task)
        conversation.append(user_reply)

    # Grade: check database + output against ground truth
    return grade(db, task.expected_db_state, conversation)
```
τ-bench grades outcomes in two ways: checking that the database was modified correctly (e.g., flight was actually changed, credit was applied) AND checking that the agent's text responses contain the right information (e.g., confirming the new flight details to the user). Both must pass for a trial to succeed.
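The two-part rule can be sketched as a single function. This is a simplified illustration, not τ-bench's actual grading code. It assumes the database can be compared as a plain dictionary and that "right information" means a list of required substrings. The name `grade_trial` and its parameters are hypothetical.

```python
def grade_trial(
    final_db: dict,
    expected_db: dict,
    agent_messages: list[str],
    required_outputs: list[str],
) -> dict:
    """Pass only if the database matches AND every required string was communicated."""
    # Check 1: the environment ended in the expected state.
    db_ok = final_db == expected_db

    # Check 2: each required piece of information appears in some agent message
    # (case-insensitive substring match, an assumption made for this sketch).
    transcript_text = " ".join(agent_messages).lower()
    missing = [s for s in required_outputs if s.lower() not in transcript_text]

    return {
        "db_ok": db_ok,
        "output_ok": not missing,
        "missing": missing,
        "passed": db_ok and not missing,
    }
```

Requiring both checks stops an agent from passing by making the right database change silently, or by telling the user the change happened when it did not.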
There are two sources of randomness in τ-bench. The agent samples from its LLM, producing different reasoning and tool call sequences. And the user simulator also samples, producing different follow-up questions and phrasings. This means every trial of the same task generates a genuinely different conversation. Pass@K and pass^K become essential because a single trial is statistically meaningless.
The most challenging aspect of τ-bench isn't the tool calling — it's policy compliance. The agent receives policy documents like this:
```text
## Return Policy
- Electronics: 14-day return window, 15% restocking fee
- Clothing: 30-day return window, no restocking fee
- Sale items: final sale, no returns
- Defective items: 90-day window, full refund, no restocking fee

## Exceptions
- If customer has Gold status: waive restocking fee
- If order was delayed >7 days: extend return window by 14 days
- Items over $500: require manager approval for return
```
The agent must navigate these rules while conversing with a user who may not know the policies. When a Gold-status customer asks to return a $600 electronic item purchased 16 days ago, the agent needs to: (1) recognize the item is past the 14-day window, (2) check whether the order was delayed by more than 7 days, which would extend the window to 28 days, (3) note the item is over $500 and therefore needs manager approval, and (4) check Gold status to waive the restocking fee. Missing any one of these conditions produces a wrong outcome.
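Writing the policy down as code makes the interacting conditions explicit and gives you a reference implementation to grade against. The sketch below encodes only the rules in the excerpt. The order in which rules take precedence (for example, whether the defective-item rule overrides "final sale") is not stated in the excerpt and is an assumption here. `ReturnRequest` and `evaluate_return` are hypothetical names.

```python
from dataclasses import dataclass

# Category rules from the policy excerpt: (return window in days, restocking fee rate).
CATEGORY_RULES = {"electronics": (14, 0.15), "clothing": (30, 0.0)}


@dataclass
class ReturnRequest:
    category: str              # "electronics" or "clothing"
    price: float
    days_since_purchase: int
    on_sale: bool = False
    defective: bool = False
    gold_status: bool = False
    delay_days: int = 0        # how many days late the order arrived


def evaluate_return(req: ReturnRequest) -> dict:
    """Apply the excerpt's rules and exceptions to one return request."""
    if req.defective:
        # ASSUMPTION: the defective-item rule overrides category and sale rules.
        window, fee_rate = 90, 0.0
    elif req.on_sale:
        return {"allowed": False, "reason": "final sale"}
    else:
        window, fee_rate = CATEGORY_RULES[req.category]

    if req.delay_days > 7:     # exception: delayed orders get 14 extra days
        window += 14
    if req.days_since_purchase > window:
        return {"allowed": False, "reason": f"outside {window}-day window"}
    if req.gold_status:        # exception: Gold status waives the restocking fee
        fee_rate = 0.0

    return {
        "allowed": True,
        "restocking_fee": round(req.price * fee_rate, 2),
        "needs_manager_approval": req.price > 500,
    }
```

For the Gold customer with the $600 item at day 16, the result depends entirely on `delay_days`. With no delay the return is refused. With an 8-day delay it is allowed, fee-free, and flagged for manager approval.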
This is where agents typically fail. The model might apply the wrong policy, miss an exception clause, or hallucinate a policy that doesn't exist. Policy compliance failures account for roughly 40% of all errors in τ-bench.
τ-bench spawned two important extensions:
| Benchmark | What It Adds | Why It Matters |
|---|---|---|
| τ²-bench | A dual-control setting in which the simulated user can also call tools and change the shared environment (introduced with a telecom domain) | Tests whether agents can guide and coordinate with a user who acts on the same state, rather than only acting alone |
| τ³-bench | Longer, more complex conversations with multiple user goals per session | Tests context management and multi-goal reasoning under context pressure |
Several design choices make τ-bench particularly well-designed:
While τ-bench evaluates conversational agents, Terminal-Bench tackles a different challenge: agents that solve real tasks in real terminal environments. No simulated users. No conversation. Just a Docker container, a task instruction, and a set of automated tests to verify the result.
Each Terminal-Bench task is packaged as a Harbor — a self-contained evaluation unit with four components:
| Component | What It Is | Example |
|---|---|---|
| Instruction | Natural language task description | "Set up a PostgreSQL database with tables for users and orders, create a backup script" |
| Dockerfile | The starting environment | Ubuntu 22.04 with Python 3.11, Node.js 18, PostgreSQL installed |
| Tests | Automated verification scripts | Check that tables exist, backup script runs, data is preserved |
| Oracle Solution | A known-working solution (for validation) | The exact commands and files that solve the task |
The key design principle is outcome-oriented evaluation. Terminal-Bench doesn't care HOW the agent solves the task — it only checks the final state of the Docker container. Did the right files get created? Does the script work? Are the tests passing? This means the agent can use any approach: write a script, install packages, copy files, whatever works.
Not all tasks are good tasks. Terminal-Bench enforces quality through a rigorous 7-step audit process for every task:
Step 3 — integrity checking — is especially important. An agent might learn to exploit the test suite rather than solve the task. For example, if the test checks "does file output.txt contain the word 'success'?", a lazy agent could just write "success" to the file without doing any real work. The audit process includes adversarial exploit detection: a human tries to find shortcuts that pass the tests without solving the task, then fixes the tests to block those shortcuts.
Let's walk through a concrete integrity failure and how to fix it. Original task and test:
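```python
# TASK: "Write a Python script that sorts a CSV file by the 'date' column"
# TEST (naive):
import csv


def test_csv_sorted():
    # Read the agent's output and collect the 'date' column in file order.
    # String comparison matches date order only for ISO-format dates (YYYY-MM-DD).
    with open("output.csv", newline="") as f:
        dates = [row["date"] for row in csv.DictReader(f)]
    assert dates == sorted(dates)  # just checks output is sorted
```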
An agent could exploit this test by simply writing a hardcoded sorted CSV file without implementing any sorting logic. Or it could read the test file, figure out the expected output, and write it directly. To fix this, the audit adds:
```python
# TEST (robust):
import csv
import os
import subprocess
import sys


def read_dates(path):
    """Return the 'date' column of a CSV file in file order."""
    with open(path, newline="") as f:
        return [row["date"] for row in csv.DictReader(f)]


def test_csv_sorted_robust():
    # 1. Check output file exists and is valid CSV
    dates = read_dates("output.csv")
    assert dates == sorted(dates)

    # 2. Check a Python script exists (not just a data file)
    assert os.path.exists("sort_csv.py")

    # 3. Run the script on a DIFFERENT input to verify generality
    #    (create_random_csv is a test-harness helper, not shown here)
    create_random_csv("test_input.csv", n=100)
    subprocess.run(
        [sys.executable, "sort_csv.py", "test_input.csv", "test_output.csv"],
        check=True,  # fail the test if the script itself crashes
    )
    test_dates = read_dates("test_output.csv")
    assert len(test_dates) == 100  # an empty output must not pass
    assert test_dates == sorted(test_dates)

    # 4. Crude heuristic: verify the script doesn't just copy the expected output
    with open("sort_csv.py") as f:
        assert "hardcoded" not in f.read().lower()
```
The robust test runs the agent's solution on a new, random input that didn't exist during the agent's execution. This is the gold standard for integrity checking: the test should verify that the agent created a general solution, not a task-specific shortcut.
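The robust test relies on a `create_random_csv` helper to produce an input the agent has never seen. One possible version is sketched below. Only the `date` column comes from the task description. The `id` and `amount` columns, the date range, and the `seed` parameter are assumptions made for illustration.

```python
import csv
import random
from datetime import date, timedelta
from typing import Optional


def create_random_csv(path: str, n: int = 100, seed: Optional[int] = None) -> None:
    """Write n rows with randomly ordered ISO-format dates to `path`."""
    rng = random.Random(seed)  # leave seed=None in real tests so inputs are unpredictable
    start = date(2020, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "date", "amount"])
        writer.writeheader()
        for i in range(n):
            day = start + timedelta(days=rng.randrange(0, 2000))
            writer.writerow(
                {
                    "id": i,
                    # ISO dates sort correctly as plain strings.
                    "date": day.isoformat(),
                    "amount": round(rng.uniform(1, 500), 2),
                }
            )
```

Generating the file at test time, after the agent has finished, is what makes the check hard to game. There is no expected output on disk for the agent to find and copy.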
Terminal-Bench contains 89 tasks across 15+ categories: software engineering, data processing, system administration, scientific computing, networking, databases, and more. Each task takes a capable agent 5-30 minutes to attempt, making the full benchmark a substantial test of real-world terminal skills.
| Category | Tasks | Examples |
|---|---|---|
| Software Engineering | 18 | Build systems, testing, debugging |
| Data Processing | 12 | CSV, JSON, database manipulation |
| System Administration | 14 | Service configuration, permissions, networking |
| Scientific Computing | 8 | Numerical methods, plotting, data analysis |
| DevOps | 10 | Docker, CI/CD, monitoring |
| Security | 6 | Encryption, auth, vulnerability scanning |
| Other | 21 | Misc. terminal tasks |
Performance on Terminal-Bench varies dramatically by model and agent framework. Some findings:
| Finding | Implication |
|---|---|
| Best agents achieve ~60-70% pass@1 | Terminal tasks remain challenging even for frontier models |
| Pass^5 drops to ~15-30% for top agents | Reliability is still a major unsolved problem |
| Some categories (data processing) are much easier than others (system administration) | Aggregate scores hide category-level weaknesses |
| Agent frameworks matter as much as model choice | Harness design (retry logic, error handling, tool definitions) is a major performance lever |
| Longer time budgets don't always help | Giving agents more time can lead to more tool calls, which increases context rot |
The last finding is counterintuitive: you'd expect more time to help, but agents with generous time budgets sometimes perform worse because they make more tool calls, fill up the context, and start making confusion errors. This connects directly back to context engineering — a well-designed agent knows when to stop exploring and commit to an answer.
If you're building a Terminal-Bench-style evaluation for your own domain, follow these principles:
No single evaluation method catches everything. Human reviewers miss things. Code-based graders are brittle. Model-based graders have biases. Automated benchmarks have blind spots. If you rely on any one method alone, failures will slip through.
The solution is the Swiss cheese model, borrowed from safety engineering. Imagine each evaluation method as a slice of Swiss cheese — it blocks most failures, but has holes. Stack multiple slices together, and the holes in one slice are covered by solid cheese in the next. Enough layers, and almost nothing gets through.
| Failure Type | Layer 1 (Auto) | Layer 2 (Manual) | Layer 3 (Prod) |
|---|---|---|---|
| Regression from code change | Catches | Slow to notice | Too late |
| Wrong tone / unprofessional | Misses | Catches | User complaints |
| Novel edge case from real user | Not in test set | Not in sample | Catches |
| Hallucinated policy | Maybe | Catches | Catches |
| Infinite tool-call loop | Catches | Catches | Catches |
| Slow performance degradation | Misses | Misses | Catches |
Notice the pattern: no single layer catches every failure type. But together, they cover everything. This is why you need all three, not just your favorite.
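The coverage claim can be checked mechanically. The sketch below encodes only the cells that read "Catches" in the table and treats every other cell ("Maybe", "Slow to notice", "Too late", and so on) as a miss, which is a deliberately strict reading. The layer and failure labels are shortened for illustration.

```python
from itertools import combinations

# Which layers "catch" each failure type, taken from the table above.
CATCHES = {
    "regression from code change": {"auto"},
    "wrong tone": {"manual"},
    "novel edge case": {"prod"},
    "hallucinated policy": {"manual", "prod"},
    "infinite tool-call loop": {"auto", "manual", "prod"},
    "slow degradation": {"prod"},
}
LAYERS = ("auto", "manual", "prod")


def escapes(active_layers: set[str]) -> list[str]:
    """Failure types that none of the active layers catches."""
    return sorted(f for f, layers in CATCHES.items() if not layers & active_layers)


if __name__ == "__main__":
    # Try every non-empty combination of layers and list what slips through.
    for size in range(1, len(LAYERS) + 1):
        for combo in combinations(LAYERS, size):
            print(f"{' + '.join(combo):22s} -> escapes: {escapes(set(combo)) or 'none'}")
```

With this encoding, every one-layer and two-layer combination leaves at least one failure type uncovered. Only the full stack lets nothing through.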
Layer 1 — Automated Evals: A CI pipeline that runs your benchmark suite on every pull request. Takes 10-30 minutes. Reports pass@1 and pass^3. Blocks merge if pass@1 drops by more than 5% from baseline. This is your first line of defense.
```yaml
# .github/workflows/agent-eval.yml (simplified)
name: Agent Evaluation
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python run_eval.py --tasks eval/tasks/*.json --trials-per-task 5 --output results.json
      - run: python check_regression.py --current results.json --baseline eval/baseline.json --max-regression 0.05
```
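The workflow calls a `check_regression.py` script whose contents are not shown. A minimal sketch of what it might contain follows. The JSON layout (a top-level `pass_at_1` key in both files) is an assumption, and the threshold is treated as an absolute drop, so 0.05 means five percentage points.

```python
import argparse
import json


def regression_exceeded(current: float, baseline: float, max_regression: float) -> bool:
    """True when the metric fell by more than max_regression (absolute difference)."""
    # Small tolerance so a drop of exactly max_regression is not flagged
    # because of floating-point rounding.
    return (baseline - current) > max_regression + 1e-9


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail CI on a pass@1 regression.")
    parser.add_argument("--current", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--max-regression", type=float, default=0.05)
    args = parser.parse_args(argv)

    # ASSUMPTION: both files store the aggregate metric under "pass_at_1".
    with open(args.current) as f:
        current = json.load(f)["pass_at_1"]
    with open(args.baseline) as f:
        baseline = json.load(f)["pass_at_1"]

    if regression_exceeded(current, baseline, args.max_regression):
        print(f"FAIL: pass@1 fell from {baseline:.3f} to {current:.3f}")
        return 1
    print(f"OK: pass@1 {current:.3f} (baseline {baseline:.3f})")
    return 0
```

To use it as a script, end the file with `raise SystemExit(main())` under an `if __name__ == "__main__":` guard. A non-zero exit code is what makes the CI step fail and block the merge.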
Layer 2 — Manual Review: Weekly, a domain expert reviews 20-30 randomly sampled transcripts from the eval suite. They score on a rubric and flag any new failure modes. This catches "the agent technically passed the code grader but gave bad advice" situations.
Layer 3 — Production Monitoring: In production, track these signals:
You now have all the concepts. Let's put them together into a practical, step-by-step roadmap for building your own agent evaluation system. Seven steps, from scratch to production-ready.
Before writing a single test, answer: what does "good" look like? Be specific. Not "the agent should help the user" but "the agent should resolve the user's issue, update the database correctly, and respond in a professional tone within 30 seconds." Your success criteria become the inputs to your graders.
Start with 10-20 tasks that cover your most common scenarios. Don't try to be comprehensive yet. Include a mix of easy tasks (single tool call), medium tasks (3-5 tool calls), and hard tasks (ambiguous instructions, policy edge cases). Real user conversations from logs are the best source.
Good tasks are specific (unambiguous instructions), solvable (the agent has the tools and information to succeed), and representative (they reflect real usage patterns). Include the full context: user message, initial database state, available tools, and applicable policies.
For each task, define the expected outcome. This might be: expected database state after completion, required tool calls, reference output text, or a combination. The more specific your ground truth, the easier grading becomes.
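One way to keep tasks and their ground truth together is a small pair of dataclasses with a sanity check. This is a sketch of one possible schema, not a standard format. Every field name here is an assumption chosen to mirror the context and ground-truth items described above.

```python
from dataclasses import dataclass, field
from typing import Any

DIFFICULTIES = {"easy", "medium", "hard"}


@dataclass
class GroundTruth:
    expected_db: dict[str, Any]                      # database state after a correct run
    required_tools: list[str] = field(default_factory=list)
    forbidden_tools: list[str] = field(default_factory=list)
    reference_output: str = ""                       # optional reference text


@dataclass
class EvalTask:
    task_id: str
    difficulty: str
    user_message: str
    initial_db: dict[str, Any]
    tools: list[str]                                 # tool names available to the agent
    policies: str
    ground_truth: GroundTruth

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the task looks well-formed."""
        gt = self.ground_truth
        problems = []
        if self.difficulty not in DIFFICULTIES:
            problems.append(f"unknown difficulty: {self.difficulty!r}")
        if not self.user_message.strip():
            problems.append("user_message is empty")
        unavailable = set(gt.required_tools) - set(self.tools)
        if unavailable:  # the task would be unsolvable
            problems.append(f"required tools not available: {sorted(unavailable)}")
        conflict = set(gt.required_tools) & set(gt.forbidden_tools)
        if conflict:
            problems.append(f"tools both required and forbidden: {sorted(conflict)}")
        return problems
```

Running `validate()` on every task before an eval run catches unsolvable or contradictory tasks early, before they show up as confusing agent "failures".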
Set up a layered grading system. Start with code-based graders for objective checks (database state, tool calls). Add a model-based grader for subjective quality (tone, helpfulness, accuracy of explanations). Schedule periodic human review to calibrate the model-based grader.
```python
# Example: composite grader
def composite_grade(transcript, outcome, ground_truth):
    scores = {}

    # Code graders (fast, deterministic)
    scores["db_correct"] = check_database_state(
        outcome.db, ground_truth.expected_db
    )
    scores["tools_correct"] = check_required_tools(
        transcript, ground_truth.required_tools
    )
    scores["no_forbidden"] = check_no_forbidden_tool(
        transcript, ground_truth.forbidden_tools
    )

    # Model grader (flexible, non-deterministic)
    scores["quality"] = llm_judge(
        transcript,
        rubric="Rate 1-5: professional tone, accurate info, concise",
    )

    # Overall: code graders must pass, quality ≥ 3
    passed = (
        all([scores["db_correct"], scores["tools_correct"], scores["no_forbidden"]])
        and scores["quality"] >= 3
    )
    return {"passed": passed, "scores": scores}
```
The evaluation harness is the software that runs tasks, collects transcripts, applies graders, and stores results. It needs to handle:
```python
# Skeleton evaluation harness
# compute_pass_at_k and compute_pass_up_k are the exact pass@K and pass^K
# formulas from earlier in this lesson (defined elsewhere).
import asyncio
import copy
import json
from datetime import datetime


class EvalHarness:
    def __init__(self, agent, graders, trials_per_task=10):
        self.agent = agent
        self.graders = graders
        self.trials_per_task = trials_per_task
        self.results = []

    async def run_task(self, task):
        """Run N trials for one task."""
        trial_results = []
        for i in range(self.trials_per_task):
            # Fresh environment per trial
            env = copy.deepcopy(task.initial_env)

            # Run agent
            transcript = await self.agent.run(
                user_message=task.user_message, tools=task.tools, env=env
            )

            # Grade
            scores = {}
            for name, grader in self.graders.items():
                scores[name] = grader.grade(transcript, env, task.ground_truth)

            trial_results.append({
                "trial": i,
                "transcript": transcript,
                "scores": scores,
                "passed": all(s["passed"] for s in scores.values()),
            })

        # Compute metrics (K = 3 requires trials_per_task >= 3)
        c = sum(1 for t in trial_results if t["passed"])
        n = len(trial_results)
        summary = {
            "task_id": task.id,
            "pass_at_1": c / n,
            "pass_at_3": compute_pass_at_k(n, c, 3),
            "pass_up_3": compute_pass_up_k(n, c, 3),
            "trials": trial_results,
        }
        self.results.append(summary)  # store for reporting
        return summary
```
This is a minimal but functional harness. Production systems add logging, progress bars, error handling, cost tracking, and result dashboards — but the core loop is always the same: run, grade, store, compute metrics.
When presenting evaluation results, include enough detail for informed decision-making. A good eval report looks like this:
```text
========================================
Agent Evaluation Report - v2.4.1
Date: 2026-05-18 | Model: claude-opus-4
Tasks: 50 | Trials/task: 10 | Total runs: 500
========================================

METRICS (all tasks):
  pass@1: 0.820 (410/500 trials passed)
  pass@3: 0.978
  pass^3: 0.574
  pass^5: 0.389

METRICS (by difficulty):
  Easy   (20 tasks): pass@1=0.940  pass^3=0.821
  Medium (20 tasks): pass@1=0.815  pass^3=0.520
  Hard   (10 tasks): pass@1=0.640  pass^3=0.267

METRICS (by category):
  Order lookup:      pass@1=0.960
  Cancellation:      pass@1=0.880
  Returns:           pass@1=0.780
  Policy edge cases: pass@1=0.600

TOP FAILURE MODES:
  1. Policy hallucination (12 failures)
  2. Wrong order ID in response (8 failures)
  3. Premature action without confirmation (6 failures)

VS BASELINE (v2.3.0):
  pass@1: 0.820 vs 0.790 (+3.0%) ✓
  pass^3: 0.574 vs 0.510 (+6.4%) ✓
  No regressions in any category.

SAFE TO SHIP.
========================================
```
This report gives you everything needed to decide whether to ship: overall metrics, difficulty breakdown, category breakdown, top failure modes, and comparison to baseline. The category breakdown is especially important: an 82% aggregate can hide a 60% pass rate on policy edge cases, which might be unacceptable for your use case.
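A per-category breakdown is straightforward to compute from raw trial records. The sketch below assumes each trial is a dict with a `category` string and a `passed` boolean. That record shape and the function names are illustrative.

```python
from collections import defaultdict


def pass_rate_by_category(trials: list[dict]) -> dict[str, float]:
    """Per-category pass@1: passed trials divided by total trials in each category."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for trial in trials:
        totals[trial["category"]] += 1
        passes[trial["category"]] += bool(trial["passed"])
    return {cat: passes[cat] / totals[cat] for cat in totals}


def weakest_categories(rates: dict[str, float], threshold: float) -> list[str]:
    """Categories whose pass rate falls below the threshold, sorted by name."""
    return sorted(cat for cat, rate in rates.items() if rate < threshold)
```

Gating a release on `weakest_categories(rates, threshold)` being empty, rather than on the aggregate alone, prevents a strong category from hiding a weak one.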
Run your evaluation. Read the failing transcripts. Understand why the agent failed. Fix the agent (better prompts, better tools, better instructions). Run again. This cycle never ends. The best agent evaluations evolve continuously through new failure cases and ongoing maintenance.
Here's what a small but well-designed task set looks like for a customer service agent:
| # | Task | Difficulty | Tests |
|---|---|---|---|
| 1 | Simple order status lookup | Easy | Single tool call, correct response |
| 2 | Cancel a cancellable order | Easy | DB write, confirmation message |
| 3 | Cancel a non-cancellable order | Medium | Policy compliance — agent should refuse |
| 4 | Return with restocking fee | Medium | Fee calculation, user confirmation |
| 5 | Return for Gold member (fee waived) | Medium | Exception handling |
| 6 | Ambiguous user request | Hard | Clarification before action |
| 7 | Multi-step: return + reorder different item | Hard | Multiple DB writes, correct sequence |
| 8 | Policy edge case: item expired but defective | Hard | Exception vs standard policy conflict |
| 9 | User provides wrong order ID | Medium | Error handling, helpful recovery |
| 10 | Transfer to human agent | Medium | Knows when to escalate |
Notice the distribution: 2 easy, 5 medium, and 3 hard, including one explicit policy edge case. This gives you a nuanced view of agent capability. An agent that passes 7/10 might still be failing all the hard tasks, so the average masks the weakness.
After running hundreds of evaluation trials, certain failure patterns appear again and again. Knowing these helps you write better tasks and graders:
For each pattern, add specific test cases to your eval suite. The infinite loop pattern, for example, is caught by a simple grader that checks whether the same tool was called 3+ times with identical arguments.
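The loop check described above fits in a few lines. This sketch assumes each tool call in the transcript is a dict with `name` and JSON-serializable `arguments` keys. That shape, and the function name `detect_tool_loop`, are assumptions to adapt to your own transcript format.

```python
import json
from collections import Counter


def detect_tool_loop(tool_calls: list[dict], threshold: int = 3) -> bool:
    """True if any (tool name, arguments) pair occurs `threshold` or more times."""
    counts = Counter(
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as identical.
        (call["name"], json.dumps(call["arguments"], sort_keys=True))
        for call in tool_calls
    )
    return any(n >= threshold for n in counts.values())
```

This counts repeats anywhere in the transcript, not only back-to-back ones. Legitimate polling (for example, checking a job status) may trip it, so you may want a higher threshold for tools that are expected to be called repeatedly.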
Here's the lay of the land — every major agent benchmark, what it tests, and how it works:
| Benchmark | Domain | Agent Type | Grading |
|---|---|---|---|
| τ-bench | Retail & Airline | Conversational + tools | DB state + output strings |
| τ²-bench | Telecom (plus retail/airline) | Conversational + tools, with a user who can also act on the environment | DB state + coordination |
| τ³-bench | Complex multi-turn retail | Extended conversation | DB state + policy compliance |
| Terminal-Bench | Terminal / DevOps | Command execution | Docker container state tests |
| SWE-bench | Software engineering | Code generation + editing | Test suite pass/fail |
| GAIA | General assistant | Multi-tool reasoning | Factual correctness |
| AgentCompany | Business operations | Multi-app interaction | Task completion + quality |
| MT-Bench | Open-ended conversation | Multi-turn chat LLM (no tools) | LLM-as-Judge pairwise |
No single benchmark is comprehensive. Choose based on your agent's domain:
| If your agent does... | Use these benchmarks | Why |
|---|---|---|
| Customer service with tools | τ-bench, τ²-bench | Closest to real CS workflows with policy compliance |
| Code generation / editing | SWE-bench, Terminal-Bench | Verifiable outcomes in real codebases and terminal environments |
| General-purpose assistance | GAIA, MT-Bench | Multi-tool reasoning and open-ended quality |
| Business workflow automation | AgentCompany | Multi-application interaction patterns |
| Your own domain | Build your own + one public benchmark | Domain-specific tasks catch failures no public benchmark will |
No benchmark is perfect. Be aware of these common limitations:
The Five Components
Three Grader Types
Two Key Metrics
Three Evaluation Layers
Use this checklist before deploying an agent to production:
| Check | Threshold | Priority |
|---|---|---|
| pass@1 on core tasks | ≥ 80% | Must pass |
| pass^3 on core tasks | ≥ 50% | Must pass |
| pass@1 on edge cases | ≥ 60% | Should pass |
| No infinite loop failures | 0 occurrences | Must pass |
| No data corruption failures | 0 occurrences | Must pass |
| Human review sample (20 transcripts) | ≥ 90% acceptable | Must pass |
| LLM judge calibrated against humans | ≥ 80% agreement | Should pass |
| A/B test vs current system (if exists) | Non-inferior | Should pass |
| Latency p95 | < 30 seconds | Should pass |
| Cost per conversation | Within budget | Must pass |
"Must pass" items are blockers — if any fail, do not ship. "Should pass" items are strong recommendations — shipping without them adds risk that should be acknowledged.
Every formula from this lesson in one place:
| Formula | What It Computes | When to Use |
|---|---|---|
| pass@K = 1 − C(n−c, K) / C(n, K) | P(at least 1 of K succeeds) | Measuring capability |
| pass^K = C(c, K) / C(n, K) | P(all K succeed) | Measuring reliability |
| pass@K ≈ 1 − (1−p)^K | Approximate pass@K | Large n, quick estimate |
| pass^K ≈ p^K | Approximate pass^K | Large n, quick estimate |
| P(all steps correct) = p^steps | Error compounding over trajectory | Estimating long-horizon failure rate |
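
The compounding formula in the last row is worth trying with real numbers. This sketch assumes every step succeeds independently with the same probability, which is a simplification. The function names are illustrative.

```python
def trajectory_success(p_step: float, steps: int) -> float:
    """P(all steps correct) when each step succeeds independently with p_step."""
    return p_step ** steps


def required_step_accuracy(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach `target` success over `steps` steps."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    for steps in (5, 10, 20):
        print(f"{steps:2d} steps at 0.95 per step -> {trajectory_success(0.95, steps):.3f}")
    print(f"per-step accuracy for 0.90 over 20 steps: {required_step_accuracy(0.90, 20):.4f}")
```

A step that is right 95% of the time sounds reliable, but over a 20-step trajectory the whole task succeeds only about 36% of the time under these assumptions.
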
| Term | Definition |
|---|---|
| Agent | LLM + tools + instructions running in a loop via a harness |
| Agent harness | The software system (loop, tool execution, context management) that enables a model to act as an agent |
| Context rot | Degradation of agent performance as the context window fills with irrelevant information |
| Compaction | Strategies to reduce context size (summarization, clearing old tool results, note-taking) |
| Task | A predefined input + success criteria for evaluation |
| Trial | One attempt at completing a task |
| Transcript | Complete record of messages, tool calls, and results during a trial |
| Outcome | Final environment state after agent completes (or fails) a task |
| Grader | A function (human, code, or model) that scores a trial |
| Pass@K | Probability of at least one success in K attempts (capability) |
| Pass^K | Probability of all K attempts succeeding (reliability) |
| LLM-as-Judge | Using another LLM to evaluate agent outputs against a rubric |
| Inter-annotator agreement | How often two human graders give the same score |
| Swiss cheese model | Layering multiple evaluation methods so each covers the others' blind spots |
| Evaluation flywheel | Production failures → new test cases → better evals → better agents → repeat |
| Harbor | Terminal-Bench's self-contained task package (instruction + Docker + tests + oracle) |
| Progressive disclosure | Agent discovers context via tools rather than having it pre-loaded |
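
The agreement figures in the glossary (inter-annotator agreement, and an LLM judge's agreement with humans) can be computed from two parallel lists of labels. The sketch below shows plain percent agreement plus Cohen's kappa. Kappa is a standard chance-corrected statistic added here as an optional refinement, not something the checklist requires.

```python
from collections import Counter


def percent_agreement(a: list, b: list) -> float:
    """Fraction of items on which two graders gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list, b: list) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both graders pick the same label independently.
    p_chance = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if p_chance == 1.0:  # both graders used a single identical label throughout
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)
```

Percent agreement is the number to compare against a threshold such as 80%. Kappa is useful as a second look when one label dominates, because high raw agreement can then arise by chance.
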
The field is evolving rapidly. A few trends to watch:
You now understand how to evaluate agents rigorously. Go build evals that make your agents truly reliable.