Agent Evaluation — From Vibe Checks to Rigorous Testing

Chapter 0: Why Agent Evaluation is Hard

You just built an agent. It can call tools, search the web, write code, and interact with databases. You test it on ten carefully crafted scenarios. It aces all ten. You deploy it. Within an hour, a user asks it to cancel an order, and it cancels the wrong order. Another user asks for a refund, and the agent enters an infinite loop calling the same API over and over.

What went wrong? You tested it. It worked. But you did what every engineer does at first: you ran a vibe check. You tried a handful of manual tests, eyeballed the outputs, and called it good. Vibe checks feel productive. They are not evaluation.

The problem is that agents are fundamentally different from traditional software. A function that adds two numbers will always return the same result. An agent might take a different path through its reasoning every single time you run it. It might call different tools, in different orders, with different arguments. It interacts with external environments that have state. And it operates over long horizons — a single task might involve twenty tool calls and ten reasoning steps.

The core problem: Agents are non-deterministic, environment-interacting, long-horizon systems. Traditional software testing (unit tests, integration tests) was designed for deterministic functions. We need something new.

Here's a concrete illustration. Imagine an agent that handles customer service for an airline. A user says "I need to change my flight." The agent must:

Look up the user's booking
Check available flights
Verify the change policy allows modifications
Calculate any fare difference
Execute the change in the database
Confirm with the user

At each step, the agent could fail differently. It could hallucinate a policy that doesn't exist. It could call the wrong API. It could change the wrong booking. And because LLM sampling involves randomness, running the exact same scenario twice might produce two completely different failure modes.

The Agentic Loop

Interactive figure: stepping through one cycle of the agent loop. Each step depends on the previous result, so errors compound.

This is why evaluation — systematic, repeatable measurement of agent performance — matters so much. Without it, you're deploying a system you don't understand into situations you haven't anticipated. The rest of this lesson teaches you how to build evaluations that actually work.

Vibe Checks vs Real Evaluation

Let's be precise about what's wrong with vibe checks (informal testing) and what real evaluation looks like:

Property	Vibe Check	Real Evaluation
Test cases	5-10, hand-picked	50-200+, systematically designed
Runs per test	1 (maybe 2)	5-20 (for statistical significance)
Scoring	"Looks good to me"	Automated graders + human review
Regression detection	None	CI pipeline blocks regressions
Coverage	Happy path only	Edge cases, errors, adversarial inputs
Reproducibility	Cannot reproduce	Deterministic setup, saved transcripts
Time to run	5 minutes	30 minutes to 2 hours

The difference isn't just rigor — it's confidence. After a vibe check, you hope your agent works. After a real evaluation, you know your agent works on X% of tasks with Y% reliability. You can make data-driven decisions about whether to ship.

Here's the uncomfortable truth: most agent teams in production today are still running vibe checks. They demo the agent to their manager, it works on the demo scenario, and they ship it. Three weeks later, support tickets pile up because the agent hallucinates refund policies, enters infinite loops on edge cases, or cancels the wrong orders. The cost of fixing these failures in production (user trust lost, engineering time burned, revenue impacted) usually far exceeds the cost of building proper evaluation upfront.

This lesson gives you the tools to do better. By the end, you'll know how to build an evaluation system that catches these failures before your users do.

The three properties that make agents hard to evaluate:
1. Non-determinism: Same input, different outputs each run (LLM sampling temperature).
2. Environment interaction: Agents change the world (databases, APIs, files) — side effects matter.
3. Long horizons: A single task involves many steps — errors compound across the trajectory.

Quantifying the Problem: Error Compounding

Here's a simple but powerful way to understand why long-horizon tasks are so hard. Suppose each step in the agent's reasoning has a 95% chance of being correct. That sounds great — almost always right. But over multiple steps, errors compound:

Steps	P(all correct)	P(at least one error)
1	95.0%	5.0%
3	85.7%	14.3%
5	77.4%	22.6%
10	59.9%	40.1%
20	35.8%	64.2%

At 20 steps, even with 95% per-step accuracy, you have a 64% chance of at least one error. And in an agent, one error can cascade — calling the wrong tool means getting wrong data, which means making a wrong decision, which means taking a wrong action. One bad step can doom the entire task.

This is why evaluation must test long-horizon tasks, not just short ones. A 3-step task at 95% per-step gives you 86% success. A 20-step task gives you 36%. If you only test 3-step tasks, you'll think your agent is great. Deploy it on 20-step tasks and watch it fail two-thirds of the time.
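
The numbers in the table can be reproduced in a few lines. This is a minimal sketch that assumes every step succeeds independently with the same probability, which real agents rarely satisfy exactly; the function name `p_all_correct` is illustrative and not taken from any library.

```python
def p_all_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps is correct."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    for steps in (1, 3, 5, 10, 20):
        p_ok = p_all_correct(0.95, steps)
        print(f"{steps:>2} steps: P(all correct)={p_ok:.1%}  P(>=1 error)={1 - p_ok:.1%}")
```

Changing the per-step accuracy is a quick way to see how much reliability a long task demands: under the same independence assumption, a 20-step task needs roughly 99.5% per-step accuracy to finish cleanly about 90% of the time.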

Check: Why don't traditional unit tests work for evaluating agents?

Agents are too slow to test
Agents don't have functions
Agents are non-deterministic, interact with environments, and operate over long horizons

Chapter 1: What IS an Agent?

Before we can evaluate agents, we need to agree on what one is. An agent is an LLM running in a loop, equipped with tools and instructions, that can take actions in an environment. Three components define it:

Model

The LLM that reasons and decides (GPT-4, Claude, Llama)

↓

Tools

Functions the agent can call (search, database, calculator, APIs)

↓

Instructions

System prompt defining behavior, policies, and constraints

↓

Agent Harness

The loop that ties model + tools + instructions together

The agent harness is the software system that enables a model to act as an agent. It manages the loop: send the context to the model, parse its output for tool calls, execute those tools, feed results back, repeat. When we evaluate an agent, we're evaluating the harness AND the model working together — not the model in isolation.

How Tool Calling Works

Modern LLMs are trained to emit special structured output when they want to call a tool. The pattern looks like this:

json
// 1. You give the model a list of available tools:
{
  "tools": [
    {
      "name": "lookup_order",
      "description": "Look up order details by order ID",
      "parameters": {
        "order_id": { "type": "string" }
      }
    }
  ]
}

// 2. The model responds with a tool call:
{
  "tool_call": {
    "name": "lookup_order",
    "arguments": { "order_id": "ORD-12345" }
  }
}

// 3. Your harness executes the function, returns result:
{
  "tool_result": {
    "status": "shipped",
    "item": "Blue Widget",
    "tracking": "1Z999AA10123456784"
  }
}

// 4. Model sees result, decides next action or responds to user

This parse-execute-continue cycle is the heartbeat of every agent. The harness parses the model's output for tool calls, executes them in the real environment, and feeds results back into the context. The cycle continues until the model produces a final response (no tool call) or hits a maximum iteration limit.

The Autonomy Spectrum

Not everything that uses an LLM is an agent. There's a spectrum from simple to fully autonomous:

Level	What It Does	Example
Single-turn LLM	One prompt in, one response out	ChatGPT answering a question
Chain / Pipeline	Fixed sequence of LLM calls	Summarize → Translate → Format
Router	LLM picks which path to take	Classify intent, route to handler
Tool-using Agent	LLM calls tools in a loop	Research agent with search + browse
Multi-agent System	Multiple agents coordinating	Manager delegates to specialists

As you move down this spectrum, evaluation gets harder. A single-turn LLM can be evaluated with a simple input-output dataset. A multi-agent system requires evaluating coordination, delegation, and emergent behaviors across multiple interacting components.

The Harness Loop in Detail

Let's look at a concrete harness implementation. This is the actual control flow that runs every agent:

python
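import json


class MaxIterationsError(RuntimeError):
    """Raised when the agent does not finish within max_iters."""


# `llm` and `execute_tool` are assumed to be provided by your own stack.
def run_agent(user_message, system_prompt, tools, max_iters=25):
    """The core agent loop. Every agent harness is a variant of this."""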
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    for i in range(max_iters):
        # Step 1: Send context to model
        response = llm.generate(
            messages=messages,
            tools=tools,
            temperature=0.0  # lower = more deterministic
        )

        # Step 2: Check if model wants to call a tool
        if response.has_tool_call:
            # Step 3: Parse and execute the tool call
            tool_name = response.tool_call.name
            tool_args = response.tool_call.arguments
            result = execute_tool(tool_name, tool_args)

            # Step 4: Feed result back into context
            messages.append(response.to_message())
            messages.append({
                "role": "tool",
                "content": json.dumps(result)
            })
            continue  # back to Step 1

        # No tool call = final response to user
        messages.append(response.to_message())
        return messages  # the full transcript

    # Safety: hit max iterations without finishing
    raise MaxIterationsError("Agent stuck in loop")

Notice the key design decisions embedded in this code:

Max iterations: Without this, a confused agent could loop forever, calling tools that don't help. The iteration limit is a safety valve.
Temperature: Lower temperature means more deterministic output, but reduces the agent's ability to explore alternative strategies. This directly affects evaluation — higher temperature means more variance across trials.
Context accumulation: Every tool call and result is appended to the messages list. This is why context management matters — after 20 tool calls, you've added 40+ messages to the context.
No retry logic: If a tool call fails, the raw error is fed back to the model. The model decides whether to retry, try a different tool, or give up. This is a harness design choice that affects evaluation results.
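
The "no retry logic" choice depends on how tool execution reports failures. As a hedged sketch, here is one way an `execute_tool` helper could turn exceptions into data that the loop feeds back to the model. The registry, the `register_tool` decorator, and the `ok`/`error` result shape are assumptions made for illustration, not part of any specific framework.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to plain Python functions.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Add a function to the registry under its own name."""
    TOOL_REGISTRY[func.__name__] = func
    return func


@register_tool
def lookup_order(order_id: str) -> dict:
    # Stand-in data; a real tool would query an order system.
    orders = {"ORD-12345": {"status": "shipped", "item": "Blue Widget"}}
    if order_id not in orders:
        raise KeyError(f"unknown order {order_id}")
    return orders[order_id]


def execute_tool(name: str, arguments: dict) -> dict:
    """Run a tool and always return a JSON-serializable dict.

    Failures are returned as data instead of raised, so the harness can
    append them to the context and let the model choose the next step.
    """
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": tool(**arguments)}
    except Exception as exc:  # deliberately broad: any tool failure becomes data
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


if __name__ == "__main__":
    print(json.dumps(execute_tool("lookup_order", {"order_id": "ORD-12345"})))
    print(json.dumps(execute_tool("lookup_order", {"order_id": "ORD-0"})))
```

Because errors come back as ordinary results, the evaluation transcript records them too, which makes it possible to grade how the agent recovers from a failed call.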

What Makes a Good Tool Definition?

The quality of tool definitions dramatically affects agent performance. Compare these two definitions for the same tool:

bad tool def
{
  "name": "search",
  "description": "Search",
  "parameters": {
    "q": {"type": "string"}
  }
}

good tool def
{
  "name": "search_orders",
  "description": "Search orders by customer email, order ID, or date range. Returns up to 10 matching orders.",
  "parameters": {
    "query": {
      "type": "string",
      "description": "Email, order ID, or YYYY-MM-DD"
    }
  }
}

The good definition tells the model exactly what the tool does, what inputs it accepts, and what format to use. When evaluating agents, changing tool definitions can shift pass rates by 10-20%. This is part of why we evaluate the harness as a whole, not just the model.

Key insight: When we evaluate an agent, we're evaluating the harness and the model working together. Swapping the model changes performance. Changing the system prompt changes performance. Modifying the tool definitions changes performance. The evaluation must capture this entire system, not just the LLM.

Interactive Agent Loop

Interactive figure: the agent loop advances phase by phase, and the context window fills with tool results as the agent works through a task.

Check: What three components define an agent?

Model + Tools + Instructions (running in a loop via a harness)
Database + API + Frontend
Prompt + Temperature + Max tokens

Chapter 2: Multi-Agent Systems

Sometimes a single agent isn't enough. When your system prompt grows to thousands of lines, or you have dozens of overlapping tools, or different tasks require fundamentally different reasoning strategies — you need multiple agents working together. But multi-agent systems are harder to build, debug, and evaluate. So the first rule is: start with a single agent. Only add complexity when the single agent demonstrably fails.

When to Go Multi-Agent

Three signals tell you a single agent is struggling:

1. Bloated instructions. Your system prompt is 5,000+ words and the agent keeps forgetting rules buried on page three. Long instructions degrade performance because the model attends to everything but remembers nothing perfectly.

2. Too many overlapping tools. You have 30+ tools and the agent keeps picking the wrong one. "search_orders" vs "lookup_order" vs "find_order_by_email" — the model confuses similar tools when there are too many.

3. Divergent reasoning needs. One task requires careful, step-by-step analysis (auditing a financial report). Another requires fast, shallow lookup (checking a tracking number). A single prompt can't optimize for both.

Manager vs Decentralized

There are two main patterns for multi-agent systems:

Manager pattern: A central "manager" agent receives the user's request, decides which specialist to delegate to, and synthesizes the final response. Clean control flow, easy to debug, but the manager becomes a bottleneck.

Decentralized pattern: Agents hand off to each other in a chain or graph. No central controller. More flexible for complex workflows, but harder to reason about and debug — "who's in charge?" becomes ambiguous.

Sub-agents and Context Protection

A critical benefit of multi-agent systems is context protection. When a sub-agent handles a tool-heavy subtask, all those tool calls and results fill up the sub-agent's context window, not the main agent's. The main agent only sees the sub-agent's final summary. This keeps the main agent's context clean and focused.

Think of it like delegation at a company. The CEO doesn't sit in on every engineering standup. They delegate to a VP, who reports back a summary. The CEO's "context window" stays clear for strategic decisions.

The Manager Pattern in Code

python
# Manager agent with specialized sub-agents

class ManagerAgent:
    def __init__(self):
        self.router = LLM(system_prompt="""You are a routing agent.
        Given a user request, determine which specialist to delegate to:
        - 'returns': for return/refund requests
        - 'booking': for flight/hotel changes
        - 'billing': for payment/invoice questions
        Respond with ONLY the specialist name.""")

        self.specialists = {
            "returns": Agent(tools=return_tools, prompt=return_policy),
            "booking": Agent(tools=booking_tools, prompt=booking_policy),
            "billing": Agent(tools=billing_tools, prompt=billing_policy),
        }

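    def handle(self, user_message):
        # Step 1: Route to specialist (normalize the router's free-text reply)
        specialist = self.router.generate(user_message).strip().strip("'\"").lower()
        if specialist not in self.specialists:
            # The router replied with something other than a known name.
            raise ValueError(f"Router returned unknown specialist: {specialist!r}")

        # Step 2: Delegate (sub-agent gets its own clean context)
        result = self.specialists[specialist].run(user_message)

        # Step 3: Manager only sees the summary, not all tool calls
        return result.final_response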

Notice the key property: the sub-agent runs in its own context. If the returns specialist makes 8 tool calls consuming 4,000 tokens, the manager never sees those tokens. It only gets the final response — maybe 200 tokens. This is the context protection in action.

Evaluating Multi-Agent Systems

When evaluating a multi-agent system, you need to test at three levels:

Individual agent evaluation: Does each specialist work correctly in isolation? Test the returns agent on return tasks, the booking agent on booking tasks.
Routing evaluation: Does the manager route to the correct specialist? Test with clear-cut cases AND ambiguous cases (e.g., "I want to return an item I booked" — returns or booking?).
End-to-end evaluation: Does the full system work? This catches integration failures: the manager routes correctly, the specialist handles it correctly, but the response doesn't get properly relayed back.

A common mistake is skipping level 2 (routing evaluation). Teams test each specialist and test the full system, but never specifically evaluate whether the router makes correct decisions. A routing error sends the user to the wrong specialist, which wastes time and often fails spectacularly because the specialist doesn't have the right tools for the task.

Routing accuracy target: Aim for ≥95% routing accuracy on clear-cut cases and ≥80% on ambiguous cases. If routing accuracy drops below these thresholds, the multi-agent system will underperform a well-designed single agent. At that point, simplify: go back to one agent with all the tools and a better system prompt.
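
Routing accuracy can be measured separately for clear-cut and ambiguous cases. The sketch below is illustrative only: `keyword_router` is a toy stand-in for the LLM router so the example runs without a model, and the expected labels for the ambiguous cases are assumptions a real team would have to agree on.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (user message, expected specialist, case kind); illustrative data only.
RoutingCase = Tuple[str, str, str]


def routing_accuracy(route: Callable[[str], str],
                     cases: List[RoutingCase]) -> Dict[str, float]:
    """Return the fraction of correctly routed cases for each case kind."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for message, expected, kind in cases:
        total[kind] += 1
        if route(message).strip().lower() == expected:
            correct[kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


def keyword_router(message: str) -> str:
    """Toy stand-in for the LLM router."""
    text = message.lower()
    if "refund" in text or "return" in text:
        return "returns"
    if "flight" in text or "hotel" in text:
        return "booking"
    return "billing"


if __name__ == "__main__":
    cases = [
        ("I want a refund for my order", "returns", "clear"),
        ("Move my flight to Friday", "booking", "clear"),
        ("Why was I charged twice?", "billing", "clear"),
        ("I want to return an item I booked", "returns", "ambiguous"),
        ("Refund the hotel night I never used", "booking", "ambiguous"),
    ]
    print(routing_accuracy(keyword_router, cases))
```

Reporting the two kinds separately matters because a single blended number can hide a router that is perfect on easy cases and poor on ambiguous ones.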

Manager / Worker Architecture

Interactive figure: the manager routes each task to the right specialist agent while the main context stays clean.

Key insight: Single agents are easier to evaluate and maintain. Multi-agent systems multiply the evaluation surface — you need to evaluate each agent individually AND their interactions. Start simple. Add agents only when a single agent demonstrably fails at the task.

Check: What is the main benefit of using sub-agents?

Sub-agents are faster than a single agent
Sub-agents protect the main agent's context from being overloaded with tool results
Sub-agents are cheaper to run

Chapter 3: Context Engineering

An agent's context window is its working memory. Everything the agent knows — the system prompt, conversation history, tool results, retrieved documents — lives in this finite window. And here's the problem: context rots.

Context rot is what happens when an agent runs for many steps. Each tool call adds its result to the context. Each reasoning step adds tokens. After twenty iterations, the context is stuffed with old tool results, irrelevant intermediate reasoning, and stale information. The model's attention is spread across thousands of tokens of noise, and the signal — the user's original request, the current state of the task — gets buried.

Analogy: Context rot is like a desk that never gets cleaned. Every document you've ever looked at is still spread across it. When you need to find the one thing that matters right now, you're digging through piles of irrelevant paper. Eventually you start making mistakes because you grab the wrong sheet.

Static vs Dynamic Context

There are two strategies for getting information into the context window:

Strategy	How It Works	Tradeoff
Static (RAG)	Pre-load relevant documents into the context before the agent starts	Wastes tokens if docs aren't needed; stale if task evolves
Dynamic (Tool-based)	Agent discovers context by calling tools as needed	Uses tokens only when needed; agent decides what's relevant

The modern best practice is progressive disclosure: give the agent tools to discover context rather than pre-loading everything. Let the agent search a knowledge base, look up a policy document, or query a database when it needs information. This way, only relevant information enters the context, and the agent stays in control of what it knows.

Compaction Strategies

Even with progressive disclosure, context eventually fills up. Three strategies fight context rot:

Summarization

Periodically compress the conversation so far into a shorter summary. Replace the full history with the summary + recent messages.

↓

Tool Result Clearing

After the agent has used a tool result, truncate or remove it. The agent already extracted what it needed — keeping the raw result wastes tokens.

↓

Note-taking

Give the agent a "scratchpad" tool to save key findings. Then compact old context, knowing important facts are preserved in notes.

Each strategy involves a tradeoff: compaction loses detail but reclaims space. The art is knowing when to compact and how much to keep. Compact too aggressively and the agent forgets important context. Compact too little and you hit the token limit.
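
Of the three strategies, tool result clearing is the simplest to sketch. The example below assumes messages are plain dicts with `role` and `content` keys, as in the harness loop earlier; the function name, the `keep_last` parameter, and the placeholder text are hypothetical choices.

```python
from typing import Dict, List

Message = Dict[str, str]


def clear_old_tool_results(messages: List[Message], keep_last: int = 2,
                           placeholder: str = "[tool result cleared]") -> List[Message]:
    """Return a copy of `messages` in which all but the newest `keep_last`
    tool results have their content replaced by a short placeholder."""
    tool_indices = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    # A slice ending at -0 would be empty, so handle keep_last == 0 explicitly.
    to_clear = set(tool_indices[:-keep_last]) if keep_last > 0 else set(tool_indices)
    return [
        {**m, "content": placeholder} if i in to_clear else dict(m)
        for i, m in enumerate(messages)
    ]


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Where is my order?"},
        {"role": "tool", "content": "x" * 500},   # old, already used
        {"role": "tool", "content": "y" * 500},
        {"role": "tool", "content": "z" * 500},
    ]
    compacted = clear_old_tool_results(history, keep_last=2)
    print(sum(len(m["content"]) for m in history),
          sum(len(m["content"]) for m in compacted))
```

The function returns a new list and leaves the original untouched, so the full transcript can still be saved for evaluation even when the model sees a compacted view.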

A Concrete Context Budget

Let's do the math for a typical agent task. Say you're using a model with a 128K token context window:

Component	Tokens	Percentage
System prompt + policies	3,000	2.3%
Tool definitions (15 tools)	2,500	2.0%
User message	200	0.2%
Average tool result	500	0.4% each
Model reasoning per step	300	0.2% each
After 20 tool calls	~22,000	17%
After 50 tool calls	~46,000	36%

At 50 tool calls, you've consumed 36% of the context window. That sounds manageable, but here's the catch: the attention quality of a Transformer degrades well before you hit the limit. Research shows that models struggle to attend to information in the middle of long contexts (the "lost in the middle" problem). By the time you're at 50% utilization, the model may already be failing to recall its original instructions.
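
The budget table is simple arithmetic, and writing it as code makes it easy to plug in other numbers. This sketch uses the averages from the table (500 tokens per tool result, 300 tokens of reasoning per step); the constant and function names are illustrative.

```python
CONTEXT_WINDOW = 128_000             # tokens, as in the table
FIXED_TOKENS = 3_000 + 2_500 + 200   # system prompt + tool definitions + user message
TOKENS_PER_TOOL_CALL = 500 + 300     # average tool result + model reasoning


def context_used(tool_calls: int) -> int:
    """Total tokens in the context after the given number of tool calls."""
    return FIXED_TOKENS + tool_calls * TOKENS_PER_TOOL_CALL


def calls_until(fraction: float) -> int:
    """Largest number of tool calls that keeps usage at or below
    `fraction` of the context window."""
    budget = int(CONTEXT_WINDOW * fraction) - FIXED_TOKENS
    return max(budget // TOKENS_PER_TOOL_CALL, 0)


if __name__ == "__main__":
    for calls in (20, 50):
        used = context_used(calls)
        print(f"{calls} tool calls: {used:,} tokens ({used / CONTEXT_WINDOW:.0%})")
    print("tool calls before 50% utilization:", calls_until(0.5))
```

With these averages the agent reaches half the window after about 72 tool calls, so any compaction trigger would need to fire well before that point.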

Progressive Disclosure in Practice

Here's a concrete example of progressive disclosure vs static context. Imagine an agent that helps with HR policies:

python
# BAD: Static context — dump everything into system prompt
system_prompt = """You are an HR assistant.
Here are ALL company policies:

[3,000 tokens of vacation policy]
[2,000 tokens of leave policy]
[4,000 tokens of benefits policy]
[2,500 tokens of expense policy]
...
"""
# Result: 15,000+ tokens wasted if user only asks about vacation

# GOOD: Progressive disclosure — agent discovers what it needs
system_prompt = """You are an HR assistant.
Use the search_policy tool to look up relevant policies before
answering any question."""

tools = [{
    "name": "search_policy",
    "description": "Search company HR policies by topic",
    "parameters": {"query": {"type": "string"}}
}]
# Result: only loads the 800 tokens about vacation when needed

The static approach consumes 15,000 tokens upfront for policies the user may never ask about. The progressive approach uses 800 tokens only when needed. Over a 20-step conversation, that 14,200-token saving is not a one-time gain: the context is re-sent to the model on every step, so the smaller prompt is cheaper each time and leaves more room for actual tool results and reasoning.

Context Window Filling Up

Interactive figure: the context window fills as the agent makes tool calls, and compaction (summarization) reclaims space. An orange line shows attention quality degrading as the context fills.

A Context Rot Failure: Worked Example

Here's a real failure pattern caused by context rot. An agent is handling a complex customer request that requires multiple tool calls:

transcript (simplified)
Turn 1: User: "I have two orders. Cancel ORD-100 and check status of ORD-200."
Turn 2: Agent calls lookup_order("ORD-100") # 500 tokens of result
Turn 3: Agent calls check_cancel_policy("ORD-100") # 400 tokens of result
Turn 4: Agent calls cancel_order("ORD-100") # 300 tokens of result
Turn 5: Agent calls lookup_order("ORD-200") # 500 tokens of result
Turn 6: Agent calls get_tracking("ORD-200") # 600 tokens of result

# By Turn 6, context has ~2,300 tokens of tool results.
# Agent's response:
Turn 7: "I've cancelled ORD-200 and your order ORD-100 is in transit."
# ^^^^^^^^ WRONG! Mixed up the two order IDs!

The agent correctly executed all the tool calls but then confused the two orders in its final response. It cancelled ORD-100 in the database (correct) but told the user it had cancelled ORD-200 (wrong). This is classic context rot: the model's attention, spread across roughly 2,300 tokens of tool results, failed to correctly attribute which results belonged to which order.

This failure would be caught by a code grader that checks the response text for correct order IDs. But it would NOT be caught by a pure outcome grader that only checks database state (the cancellation was correct). You need both.
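
To make the "you need both" point concrete, here is a hedged sketch of the two graders for this task. The database shape and the regular expression are assumptions for illustration; the response check is deliberately narrow and would need extending for other phrasings, which is the usual brittleness of code graders.

```python
import re
from typing import Dict, Set


def grade_outcome(db: Dict[str, dict]) -> bool:
    """Outcome check: ORD-100 is cancelled and ORD-200 is not."""
    return (db["ORD-100"]["status"] == "cancelled"
            and db["ORD-200"]["status"] != "cancelled")


def cancelled_ids_in_response(response: str) -> Set[str]:
    """Order IDs that directly follow a form of the word 'cancel'."""
    pattern = r"cancel\w*\s+(?:order\s+)?(ORD-\d+)"
    return set(re.findall(pattern, response, flags=re.IGNORECASE))


def grade_response(response: str) -> bool:
    """Response check: the agent must claim to have cancelled exactly ORD-100."""
    return cancelled_ids_in_response(response) == {"ORD-100"}


if __name__ == "__main__":
    db_after = {"ORD-100": {"status": "cancelled"},
                "ORD-200": {"status": "in_transit"}}
    reply = "I've cancelled ORD-200 and your order ORD-100 is in transit."
    print("outcome grader:", grade_outcome(db_after))
    print("response grader:", grade_response(reply))
```

On the failing transcript above, the outcome grader passes while the response grader fails, which is exactly the disagreement that exposes the mix-up.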

Why context engineering matters for evaluation: An agent that works perfectly on short tasks might fail completely on long tasks because of context rot. Your evaluation must include long-horizon tasks that stress-test context management. If you only test 3-step tasks, you'll never catch the failure that happens on step 15.

Check: What is "context rot"?

The degradation of agent performance as the context window fills with irrelevant old information
When the model forgets its system prompt
When the API key expires

Chapter 4: The Evaluation Framework

Now we get to the core. Every agent evaluation system, no matter how sophisticated, boils down to five components. Learn these five, and you can evaluate anything.

The five components of evaluation: Task, Trial, Transcript, Outcome, Grader. Every eval system is built from these primitives. No exceptions.

1. Task

A task is a predefined input paired with success criteria. "Given this user message and this database state, the agent should do X." A task includes:

The input: user message, conversation history, initial environment state
The success criteria: what counts as a correct outcome
Optional ground truth: the expected final state, expected tool calls, or reference output

Good tasks are specific, solvable, and have clear success criteria. "Handle a customer complaint" is a bad task. "User wants to return order ORD-789. Policy allows returns within 30 days. Order was placed 15 days ago. Agent should process the return and confirm to the user" is a good task.

2. Trial

A trial is one attempt at completing a task. Because agents are non-deterministic, running the same task twice produces different trials. This is why we need multiple trials per task — a single trial tells you almost nothing about reliability.

3. Transcript (Trajectory)

The transcript is the complete record of everything the agent did during a trial: every message, every tool call, every tool result, every reasoning step. This is your forensic evidence when things go wrong. A good evaluation system always saves the full transcript.

transcript
# Example transcript structure:
[
  {"role": "user",     "content": "I want to return order ORD-789"},
  {"role": "assistant", "tool_call": "lookup_order(ORD-789)"},
  {"role": "tool",      "result": "{status: delivered, days_ago: 15}"},
  {"role": "assistant", "tool_call": "check_return_policy(ORD-789)"},
  {"role": "tool",      "result": "{eligible: true, window: 30 days}"},
  {"role": "assistant", "tool_call": "process_return(ORD-789)"},
  {"role": "tool",      "result": "{return_id: RET-456, refund: $49.99}"},
  {"role": "assistant", "content": "Your return has been processed..."}
]

4. Outcome

The outcome is the final environment state after the agent finishes. Did the database change correctly? Was the right API called? Does the final response contain the right information? Outcomes can be checked by inspecting the environment, the agent's final output, or both.

5. Grader

The grader takes a transcript and/or outcome and produces a score. "Did this trial succeed?" Graders can be human reviewers, automated code checks, or LLMs acting as judges. We'll dive deep into grader types in the next chapter.

The Evaluation Pipeline

Interactive figure: a task flows through the evaluation pipeline: task → agent runs (producing a transcript) → grader scores → result stored.

The flow: You define tasks. For each task, you run N trials. Each trial produces a transcript. Each transcript is scored by graders. You aggregate scores to get metrics. These metrics tell you whether your agent is improving or regressing.
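
The flow above can be sketched as a small loop. Everything here is a simplified assumption: `Task`, `run_eval`, and `make_flaky_agent` are hypothetical names, and the toy agent only simulates non-determinism with a seeded random number generator instead of calling a model.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

Transcript = List[dict]


@dataclass
class Task:
    task_id: str
    user_message: str
    grader: Callable[[Transcript], bool]   # returns True when the trial passes


def run_eval(tasks: List[Task], agent: Callable[[str], Transcript],
             n_trials: int) -> Dict[str, float]:
    """Run every task n_trials times and return the pass rate per task."""
    pass_rates = {}
    for task in tasks:
        passes = 0
        for _ in range(n_trials):
            transcript = agent(task.user_message)   # one trial -> one transcript
            passes += task.grader(transcript)       # grader scores the transcript
        pass_rates[task.task_id] = passes / n_trials
    return pass_rates


def make_flaky_agent(success_rate: float, seed: int = 0) -> Callable[[str], Transcript]:
    """Toy agent that answers correctly with probability `success_rate`."""
    rng = random.Random(seed)

    def agent(user_message: str) -> Transcript:
        ok = rng.random() < success_rate
        reply = "Order cancelled." if ok else "Sorry, I can't help."
        return [{"role": "user", "content": user_message},
                {"role": "assistant", "content": reply}]

    return agent


if __name__ == "__main__":
    tasks = [Task("cancel-order-01", "Cancel order ORD-123",
                  lambda t: "cancelled" in t[-1]["content"].lower())]
    print(run_eval(tasks, make_flaky_agent(0.8), n_trials=10))
```

A production version would also persist each transcript and reset the environment between trials; this sketch keeps only the task, trial, transcript, and grader steps.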

Environment Isolation

A subtle but critical requirement: each trial must run in a clean, isolated environment. If Trial 1 cancels an order in the database, Trial 2 must start with the order still active. Otherwise, Trial 2 might fail not because the agent is broken, but because the order was already cancelled by Trial 1.

Three isolation strategies:

Strategy	How It Works	Cost
Database snapshot/restore	Save DB state before each trial, restore after	Low (fast for small DBs)
Docker containers	Fresh container per trial with pre-loaded data	Medium (container startup time)
Mock environments	In-memory mock of all tools and state	Low (fastest, but may miss integration bugs)

Docker containers are the most robust approach — they guarantee complete isolation. Mock environments are fastest but risk missing bugs that only appear in real tool interactions. Most production eval systems use a hybrid: mocked tools for fast iteration, Docker containers for full regression runs.
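
For the mock-environment strategy, isolation can be as simple as giving every trial a fresh deep copy of the initial state. This is a minimal sketch under that assumption; `INITIAL_DB`, `run_isolated_trials`, and `cancel_if_processing` are hypothetical names, and the trial function stands in for a real agent run.

```python
import copy
from typing import Callable, Dict, List

# Hypothetical seed state shared by all trials of one task.
INITIAL_DB: Dict[str, dict] = {"ORD-123": {"status": "processing"}}


def run_isolated_trials(trial: Callable[[Dict[str, dict]], None],
                        n_trials: int) -> List[Dict[str, dict]]:
    """Run `trial` n_trials times, each against a fresh deep copy of INITIAL_DB."""
    final_states = []
    for _ in range(n_trials):
        db = copy.deepcopy(INITIAL_DB)   # fresh state; the seed is never mutated
        trial(db)
        final_states.append(db)
    return final_states


def cancel_if_processing(db: Dict[str, dict]) -> None:
    """Stand-in for an agent run that cancels the order when policy allows."""
    if db["ORD-123"]["status"] == "processing":
        db["ORD-123"]["status"] = "cancelled"


if __name__ == "__main__":
    states = run_isolated_trials(cancel_if_processing, n_trials=3)
    print([s["ORD-123"]["status"] for s in states])
    print("seed state untouched:", INITIAL_DB["ORD-123"]["status"])
```

Without the deep copy, the second trial would find the order already cancelled and take a different path, which is the cross-trial contamination described above.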

Saving and Analyzing Transcripts

Every trial should save the complete transcript in a structured format. This is your debugging lifeline. When pass rates drop, you read the transcripts to understand why. A good transcript record includes:

json
{
  "task_id": "cancel-order-01",
  "trial_id": "trial-007",
  "timestamp": "2026-05-18T14:30:00Z",
  "model": "claude-opus-4-20250514",
  "temperature": 0.0,
  "messages": [/* full message sequence */],
  "tool_calls": [
    {"name": "lookup_order", "args": {"id": "ORD-123"}, "latency_ms": 45},
    {"name": "cancel_order", "args": {"id": "ORD-123"}, "latency_ms": 120}
  ],
  "total_tokens": 3847,
  "total_latency_ms": 4521,
  "db_state_before": {/* snapshot */},
  "db_state_after": {/* snapshot */},
  "grading": {
    "code_grader": {"passed": true, "checks": ["db_state", "tool_calls"]},
    "model_grader": {"score": 4, "reasoning": "..."}
  },
  "result": "PASS"
}

With this structure, you can query your evaluation database: "Show me all failing trials for task 'cancel-order-01' where the agent called the wrong tool." This turns debugging from guesswork into data analysis.
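
As a rough illustration of that kind of query, the sketch below filters a list of records shaped like the JSON above. It assumes records are already loaded as Python dicts; the function name and the `refund_order` tool in the sample data are hypothetical.

```python
from typing import Iterable, List


def failing_trials_with_tool(records: Iterable[dict], task_id: str,
                             tool_name: str) -> List[str]:
    """Return trial IDs that failed the given task and called `tool_name`."""
    return [
        r["trial_id"]
        for r in records
        if r["task_id"] == task_id
        and r["result"] == "FAIL"
        and any(call["name"] == tool_name for call in r["tool_calls"])
    ]


if __name__ == "__main__":
    records = [
        {"task_id": "cancel-order-01", "trial_id": "trial-001", "result": "PASS",
         "tool_calls": [{"name": "lookup_order"}, {"name": "cancel_order"}]},
        {"task_id": "cancel-order-01", "trial_id": "trial-002", "result": "FAIL",
         "tool_calls": [{"name": "lookup_order"}, {"name": "refund_order"}]},
    ]
    print(failing_trials_with_tool(records, "cancel-order-01", "refund_order"))
```

The same filter could be expressed as a SQL query once records live in a database; the point is that structured transcripts make such questions answerable at all.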

A Concrete Example

Let's put it all together with a worked example:

Component	Concrete Value
Task	"Cancel order ORD-123. Policy: cancellation allowed if status = 'processing'."
Trial 1	Agent calls lookup_order → cancel_order → responds "Done." (3 steps)
Trial 2	Agent calls lookup_order → check_policy → cancel_order → responds "Cancelled." (4 steps)
Transcript	Full sequence of messages and tool calls for each trial
Outcome	Database: ORD-123 status changed from "processing" to "cancelled"
Grader	Code check: assert db["ORD-123"]["status"] == "cancelled"

Notice that Trial 1 and Trial 2 took different paths (3 steps vs 4 steps) but both arrived at the correct outcome. This is normal for agents — the path varies, but we care about the destination.

Trajectory-Based vs Outcome-Based Evaluation

This brings us to a fundamental design choice: do you grade the trajectory (how the agent got there) or the outcome (where it ended up)?

Trajectory-based: Check that the agent called the right tools in the right order. "Did it call lookup_order before cancel_order?"

Pro: Catches unsafe intermediate steps.

Con: Overly prescriptive — rejects valid alternative paths.

Outcome-based: Only check the final result. "Is the order cancelled? Is the response correct?"

Pro: Accepts any valid solution strategy.

Con: Might miss dangerous intermediate actions (e.g., agent deleted other data before fixing it).

In practice, the best approach combines both: outcome-based grading for "did it work?" plus lightweight trajectory checks for safety constraints: "did the agent avoid calling the delete_all_data() tool?" "did it confirm with the user before making changes?"
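
A combined grader might look like the following sketch. It assumes each tool-call message in the transcript stores the tool's name under a `tool_name` key, which is a simplification of the transcript format shown earlier, and it reuses the `delete_all_data` example as the only forbidden tool.

```python
from typing import Dict, List, Set

FORBIDDEN_TOOLS: Set[str] = {"delete_all_data"}


def tool_names(transcript: List[dict]) -> List[str]:
    """Names of all tools called, in order (assumes a "tool_name" key)."""
    return [m["tool_name"] for m in transcript if "tool_name" in m]


def grade_trial(transcript: List[dict], db: Dict[str, dict]) -> Dict[str, bool]:
    """Outcome check plus a lightweight trajectory safety check."""
    outcome_ok = db["ORD-123"]["status"] == "cancelled"
    safe = not FORBIDDEN_TOOLS.intersection(tool_names(transcript))
    return {"outcome_ok": outcome_ok, "safe": safe, "passed": outcome_ok and safe}


if __name__ == "__main__":
    db_after = {"ORD-123": {"status": "cancelled"}}
    risky = [
        {"role": "assistant", "tool_name": "lookup_order"},
        {"role": "assistant", "tool_name": "delete_all_data"},
        {"role": "assistant", "tool_name": "cancel_order"},
    ]
    print(grade_trial(risky, db_after))
```

The trajectory check only forbids specific actions and does not prescribe an order of calls, so both the 3-step and 4-step trials from the worked example would still pass.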

Statistical Significance

How many trials do you need per task? This depends on how precise you need your estimates to be. Here's a rule of thumb:

Trials per Task	Precision	Use Case
5	±20%	Quick directional check — is pass rate above 50%?
10	±15%	Standard evaluation — reliable pass@1 and pass^3
20	±10%	Precise comparison — is Agent A better than Agent B?
50+	±5%	High-stakes deployment decision

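These are approximate. The figures are close to one standard error of the pass-rate estimate, sqrt(p(1-p)/n), when the true pass rate p is near 50%. A 95% confidence interval is roughly twice as wide, and both shrink as the true pass rate moves toward 0% or 100%. As a planning heuristic: 10 trials per task is the sweet spot for most iterative development. Use 20+ for final deployment decisions.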
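
The precision column can be sanity-checked with the standard error of a proportion. This sketch uses the normal approximation, which is rough for small trial counts or pass rates near 0% or 100%; the function names are illustrative.

```python
import math


def standard_error(pass_rate: float, n_trials: int) -> float:
    """Standard error of a pass rate estimated from n independent trials."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_trials)


def approx_95_half_width(pass_rate: float, n_trials: int) -> float:
    """Normal-approximation half-width of a 95% confidence interval."""
    return 1.96 * standard_error(pass_rate, n_trials)


if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        se = standard_error(0.5, n)
        hw = approx_95_half_width(0.5, n)
        print(f"n={n:>2}: standard error ±{se:.0%}, approx. 95% interval ±{hw:.0%}")
```

Running it at a 50% pass rate shows how slowly precision improves: because the error shrinks with the square root of the trial count, four times as many trials are needed to halve it.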

Cost calculation: If you have 50 tasks at 10 trials each, that's 500 agent runs. Each run might involve 5-10 LLM calls plus tool executions. At $0.01-0.10 per LLM call, your eval costs $25-500 per full run. This is cheap compared to the cost of shipping a broken agent to real users.
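
The same cost estimate can be written as a tiny helper for planning different eval sizes. The default ranges are the ones quoted above; the function name and return shape are assumptions for illustration.

```python
from typing import Tuple


def eval_cost_range(tasks: int, trials_per_task: int,
                    calls_per_run: Tuple[int, int] = (5, 10),
                    cost_per_call: Tuple[float, float] = (0.01, 0.10)
                    ) -> Tuple[int, float, float]:
    """Return (total agent runs, low cost estimate, high cost estimate) in dollars."""
    runs = tasks * trials_per_task
    low = runs * calls_per_run[0] * cost_per_call[0]
    high = runs * calls_per_run[1] * cost_per_call[1]
    return runs, low, high


if __name__ == "__main__":
    runs, low, high = eval_cost_range(tasks=50, trials_per_task=10)
    print(f"{runs} runs, roughly ${low:,.0f} to ${high:,.0f} per full eval")
```

The estimate ignores tool execution and grader costs, so treat it as a lower bound when model-based graders are part of the pipeline.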

Check: Why do we need multiple trials per task?

To make the evaluation take longer
Because agents are non-deterministic: a single trial tells you almost nothing about reliability
Because the LLM needs to warm up

Chapter 5: Types of Graders

You have a transcript. Now you need to decide: did the agent succeed? This is the job of the grader. There are three families of graders, each with distinct strengths and weaknesses. The best evaluation systems use all three.

1. Human Graders

Human graders are the gold standard. A subject-matter expert reads the transcript, evaluates the outcome, and assigns a score. Nothing beats a human who deeply understands the domain.

But human grading has problems. It's slow (minutes per transcript), expensive (expert time), and doesn't scale. You can't have a human review 10,000 trials overnight. And humans disagree with each other — inter-annotator agreement (how often two humans give the same score) is often surprisingly low, especially for subjective quality judgments.

Variants of human grading:

SME review: Domain experts review transcripts. Highest quality, most expensive.
Crowdsourced: Many non-expert reviewers. Cheaper, but noisier. Need redundancy (3+ reviewers per item) and agreement filters.
A/B testing: Show human reviewers outputs from two agent versions side by side. "Which is better?" Relative judgments are easier and more consistent than absolute scores.
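
Inter-annotator agreement is usually reported with a chance-corrected statistic rather than raw percent agreement. A minimal sketch of Cohen's kappa for two annotators follows; the function name and label values are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled at random with their own label rates
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

A kappa of 1.0 means perfect agreement and 0.0 means agreement no better than chance, which makes low agreement on subjective rubrics easy to spot.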

2. Code-Based Graders

Code-based graders are deterministic functions that check specific properties of the outcome or transcript. They're fast, cheap, and perfectly reproducible. But they're brittle — they can only check things you can express as code.

python
# Example code-based graders:

def tool_name(msg):
    """Name of the tool a message calls, or "" if it calls none.

    Assumes tool calls are stored as strings such as 'cancel_order(order_id=42)'.
    """
    call = msg.get("tool_call") or ""
    return call.split("(", 1)[0].strip()

def check_string_match(transcript, expected):
    """Did the final response contain the expected string (case-insensitive)?"""
    final_msg = transcript[-1].get("content") or ""
    return expected.lower() in final_msg.lower()

def check_tool_called(transcript, tool):
    """Was the required tool called at least once?"""
    # Exact name match: a prefix test would also accept e.g. cancel_order_item
    return any(tool_name(msg) == tool for msg in transcript)

def check_database_state(db, order_id, expected_status):
    """Did the database end up in the correct state?"""
    return db.get(order_id, {}).get("status") == expected_status

def check_no_forbidden_tool(transcript, forbidden):
    """Did the agent avoid calling a forbidden tool?"""
    return not check_tool_called(transcript, forbidden)

Code-based graders are perfect for objective criteria: "Was the database updated?" "Did the response include the order ID?" "Was the forbidden tool avoided?" They fail for subjective criteria: "Was the tone professional?" "Did the explanation make sense?"

3. Model-Based Graders (LLM-as-Judge)

Model-based graders use another LLM to evaluate the agent's output. You provide a rubric and the transcript, and the judge LLM scores it. This combines the flexibility of human judgment with the scalability of automation.

python
# LLM-as-Judge grader (simplified)
import json

def llm_judge(transcript, rubric):
    # `llm` stands in for whatever model client you use; it is not defined here.
    prompt = f"""You are an expert evaluator.

Given this agent transcript:
{transcript}

Score the agent on the following rubric:
{rubric}

Respond with a JSON object:
{{"score": <integer from 1 to 5>, "reasoning": "..."}}"""

    response = llm.generate(prompt)
    return json.loads(response)  # raises ValueError if the judge returns non-JSON

Three common patterns for model-based grading:

Pattern	How It Works	Best For
Rubric scoring	Judge scores on a 1-5 scale using criteria	Absolute quality assessment
Pairwise comparison	Judge picks which of two outputs is better	Comparing agent versions (A/B)
Reference-guided	Judge compares output to a gold-standard reference	Tasks with known correct answers

The catch: model-based graders are themselves non-deterministic. Run the same judgment twice and you might get different scores. They also have biases — they tend to prefer longer outputs, outputs that look like their own writing style, and outputs listed first in pairwise comparisons. You need to calibrate model-based graders against human judgments before trusting them.

Calibrating LLM Judges

How do you know your LLM judge is actually reliable? The standard approach is calibration against human labels:

Have human experts score 50-100 transcripts on your rubric.
Run the LLM judge on the same transcripts.
Compute agreement: what percentage of the time do human and model agree?
If agreement is <80%, revise your rubric — it's probably ambiguous.
Check for systematic bias: does the model consistently score higher or lower than humans?
Re-calibrate periodically as you update the agent (the distribution of outputs changes).

A well-calibrated LLM judge typically agrees with human experts 80-90% of the time on binary (pass/fail) judgments. For 5-point Likert scales, expect within-1-point agreement about 70-80% of the time. If you're below these numbers, the judge is too noisy to trust.
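
The agreement and bias checks from the calibration steps can be sketched for binary labels as follows. The function name and the returned field names are illustrative; the 0.80 threshold is the one quoted above.

```python
def calibration_report(human, judge):
    """Compare binary pass/fail labels (True/False) from humans and an LLM judge."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    agreement = sum(h == j for h, j in zip(human, judge)) / n
    # Positive bias: the judge passes more transcripts than the humans did
    bias = (sum(judge) - sum(human)) / n
    return {"agreement": agreement, "bias": bias, "trustworthy": agreement >= 0.80}
```

A judge can clear the agreement threshold and still show a consistent positive or negative bias, so it is worth reading both numbers.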

Known Biases in LLM Judges

Bias	What Happens	Mitigation
Length bias	Prefers longer, more verbose outputs	Add "conciseness" to rubric; penalize unnecessary length
Position bias	In pairwise comparisons, prefers the first option	Run each comparison twice with swapped order; average results
Self-preference	Prefers outputs that sound like itself	Use a different model family as judge than as agent
Anchoring	If given a reference answer, rates similar outputs higher	Score without reference first, then use reference as secondary check
Leniency	Avoids giving low scores; clusters around 3-4 out of 5	Use binary pass/fail when possible; clearer signal than Likert scales
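
The position-bias mitigation from the table can be sketched like this. `judge(first, second)` is a hypothetical callable that returns `"first"` or `"second"`. This version reports a tie when the verdict flips with the ordering, which is one simple way to combine the two runs.

```python
def debiased_pairwise(judge, output_a, output_b):
    """Ask a pairwise judge twice with the order swapped.

    Returns "A", "B", or "tie" when the two orderings disagree.
    """
    forward = judge(output_a, output_b)    # A shown first
    backward = judge(output_b, output_a)   # B shown first
    a_wins = (forward == "first") + (backward == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # verdict changed with position: treat as no preference
```

A judge that always picks whichever output it sees first produces a tie on every pair here, which makes the bias visible instead of silently favoring one agent.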

Practical tip: For production agent evaluation, binary grading (pass/fail) is almost always better than rubric scoring (1-5). Binary judgments have higher inter-rater agreement (both human and model), are easier to aggregate into metrics, and produce clearer signal for decision-making ("ship or don't ship"). Save rubric scoring for deep-dive analysis of specific failure modes.

The grading triangle: Human graders are accurate but slow. Code-based graders are fast but brittle. Model-based graders are flexible but noisy. Use all three. Code graders for objective checks, model graders for subjective quality, human graders to calibrate and audit the other two.

Check: What is the main weakness of model-based (LLM-as-Judge) graders?

They are too slow
They are non-deterministic and have biases (length preference, position bias) that require calibration
They can only check objective criteria

Chapter 6: Metrics — Pass@K and Pass^K

You've run your agent on 50 tasks, 10 trials each, and collected 500 scores. Now what? You need metrics that capture both capability (can the agent solve the task at all?) and reliability (can it solve it consistently?). Two metrics do this: Pass@K and Pass^K.

Pass@K: Can It Succeed At Least Once?

Pass@K measures the probability that at least one out of K attempts succeeds. If you give the agent K tries, what are the odds it gets it right at least once?

pass@K = E[ 1 − C(n − c, K) / C(n, K) ]

Where n is the total number of trials you ran, c is the number of successful trials, and C(a, b) is the binomial coefficient "a choose b." Let's unpack this with a worked example.

Worked example: You run n = 10 trials. c = 7 succeed. What is pass@K for K = 3?

pass@3 = 1 − C(10 − 7, 3) / C(10, 3)
= 1 − C(3, 3) / C(10, 3)
= 1 − 1 / 120
= 1 − 0.00833
= 0.992

With 3 tries and a 70% per-trial success rate, you have a 99.2% chance of succeeding at least once. Sounds great, right? But wait...

Pass^K: Does It Succeed EVERY Time?

Pass^K flips the question: do ALL K attempts succeed? This measures reliability — can you trust the agent to get it right every time, not just sometimes?

pass^K = E[ C(c, K) / C(n, K) ]

Same example continued: n = 10, c = 7, K = 3.

pass^3 = C(7, 3) / C(10, 3)
= 35 / 120
= 0.292

Only a 29.2% chance that ALL 3 attempts succeed. The same agent that looks great on pass@3 (99.2%) looks terrible on pass^3 (29.2%). This is the reliability gap.

The gap between pass@K and pass^K reveals a critical truth about your agent. High pass@K means the agent can solve the task. Low pass^K means it does so unreliably. For production systems, reliability matters more than capability. A user doesn't get 3 tries — they get one.
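
Both formulas translate directly into code with `math.comb` (Python 3.8+). The function names below are illustrative; `benchmark_metric` averages a per-task value over tasks, which is the role of the E[...] in the formulas.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k trials drawn from n (c successes) passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Probability that all k trials drawn from n (c successes) pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)

def benchmark_metric(per_task_counts, k, metric):
    """Average a per-task metric over a list of (n, c) pairs."""
    return sum(metric(n, c, k) for n, c in per_task_counts) / len(per_task_counts)
```

For the worked example, `pass_at_k(10, 7, 3)` evaluates to about 0.992 and `pass_hat_k(10, 7, 3)` to about 0.292.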

How Pass^K Drops With K

Even for agents with respectable per-trial accuracy, pass^K drops dramatically as K increases. Let's compute for an agent with 80% per-trial success (c = 8 out of n = 10):

K	Pass@K	Pass^K	Gap
1	0.800	0.800	0.000
2	0.978	0.622	0.356
3	1.000	0.467	0.533
5	1.000	0.222	0.778

From K = 3 upward, pass@K is exactly 1.000: with only two failures among the ten trials, any three picks must include a success. But pass^5 is only 0.222, meaning all 5 attempts succeed about 22% of the time. For an 80% accurate agent! This is why pass^K is the metric that matters for production deployment.

Deriving the Formulas

Where do these formulas come from? Let's derive them from first principles.

Pass@K derivation: We have n total trials, c of which succeeded. We want to pick K trials at random. What's the probability that at least one is a success?

It's easier to compute the complement: the probability that all K picks are failures. There are (n - c) failures total. The number of ways to pick K failures from (n - c) is C(n - c, K). The total ways to pick K trials from n is C(n, K). So:

P(all K fail) = C(n − c, K) / C(n, K)

pass@K = 1 − P(all K fail) = 1 − C(n − c, K) / C(n, K)

Pass^K derivation: Same setup, but now we want the probability that ALL K picks are successes. The number of ways to pick K successes from c is C(c, K). So:

pass^K = C(c, K) / C(n, K)

Both formulas come from the hypergeometric distribution: sampling without replacement from a finite population. This matters because we never plug an estimated success probability into a power; we count subsets of the n trials we actually ran. For K ≤ n, the resulting values are unbiased estimates of pass@K and pass^K, whereas the simpler plug-in (binomial) versions below are biased when n is small.

When Approximate Counts Are Enough

If you assume trials are independent with success probability p = c/n, you get simpler approximations:

pass@K ≈ 1 − (1 − p)^K

pass^K ≈ p^K

These are good approximations when n is large relative to K. For small n (say n = 10, K = 5), use the exact hypergeometric formulas. For large n (n = 100), the approximations are close enough.

Quick mental math: For the approximate formulas, pass^K = p^K. At p = 0.9 (90% per-trial): pass^3 ≈ 0.729, pass^5 ≈ 0.590, pass^10 ≈ 0.349. Even a 90% agent has only about a 35% chance of getting through 10 consecutive attempts without a single failure. That's the reliability reality check.
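
A quick way to check when the approximation is safe is to compare it with the exact value at a fixed 80% pass rate and growing n. This is a sketch with illustrative function names.

```python
from math import comb

def exact_pass_hat_k(n, c, k):
    """Exact (hypergeometric) pass^K from n trials with c successes."""
    return comb(c, k) / comb(n, k)

def approx_pass_hat_k(p, k):
    """Independent-trials approximation p^K."""
    return p ** k

k = 5
for n in (10, 100, 1000):
    c = n * 8 // 10  # 80% observed pass rate
    exact = exact_pass_hat_k(n, c, k)
    approx = approx_pass_hat_k(c / n, k)
    print(f"n={n:>4}: exact={exact:.3f} approx={approx:.3f} diff={abs(exact - approx):.3f}")
```

The difference is large at n = 10 and shrinks as n grows, which is the behavior described above.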

Worked Example: Comparing Two Agents

Agent A achieves 70% per-trial success (7/10). Agent B achieves 50% per-trial success (5/10). But Agent B uses a retry mechanism that auto-retries on failure. Which is better for production?

Metric	Agent A (p=0.7)	Agent B (p=0.5, 2 retries)
pass@1	0.700	0.500
pass@3 (effective)	0.992	0.917 (3 tries total)
pass^1	0.700	0.500
pass^3	0.292	0.083
Cost per task	1x	~1.75x (expected attempts: 1 + 0.5 + 0.25)

Agent A is better on every metric. The retry mechanism helps Agent B's pass@3 (0.917 is decent), but its pass^3 (0.083) means it reliably succeeds on 3 consecutive attempts only 8% of the time. Retries mask capability problems — they don't fix them.

Key insight: Most agents perform poorly on pass^K even for small K. When someone reports "our agent achieves 85% accuracy," ask: "Is that pass@1? pass@3? What's your pass^K?" The difference reveals whether the agent is capable but unreliable or truly production-ready.

Check: An agent with 90% per-trial success — is pass^5 likely to be high or low?

High: 90% is very good
Low: even 90% per-trial drops fast when you require ALL 5 to succeed (0.9⁵ ≈ 0.59)
It depends on the grader

Chapter 7: Case Study — τ-bench

Theory is great, but what does a real agent evaluation look like? Let's study τ-bench (tau-bench), one of the most well-designed agent benchmarks. It evaluates agents on dynamic, multi-turn customer service conversations in two domains: retail and airline.

The Four Components of τ-bench

1. Databases (JSON)

Product catalogs, user profiles, order histories — the "world state" the agent reads and writes to.

↓

2. APIs (Python tools)

Functions like get_order_details(), cancel_order(), transfer_to_human() that the agent can call.

↓

3. Policy Documents (Markdown)

Rules the agent must follow: "Returns allowed within 30 days." "Refunds only for defective items."

↓

4. User Simulator (LLM)

Another LLM plays the customer, following a script of what the user wants to achieve.

The magic of τ-bench is that the user is also an LLM. This means conversations are dynamic — the simulated user reacts to the agent's responses, asks follow-up questions, provides clarifications, and can even get confused or frustrated. No two conversations are identical, even for the same task.

How a τ-bench Task Works

Each task specifies a scenario. For example: "User wants to change a flight from economy to business class. The fare difference is $350. The user has $200 in credit. Policy: credit can be applied to upgrades." The user simulator drives the conversation, and the agent must navigate the tools and policies to resolve it correctly.
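
One way to picture such a task is as a small record holding the scenario, the starting world state, and the ground truth. The dataclass below is an illustrative sketch: the field names and values are assumptions for this example, not τ-bench's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Illustrative task record (hypothetical fields, not the benchmark's format)."""
    task_id: str
    scenario: str                # instructions given to the user simulator
    initial_db: dict             # world state before the conversation
    expected_db_state: dict      # world state a correct agent leaves behind
    required_outputs: list = field(default_factory=list)  # facts the agent must state

upgrade_task = Task(
    task_id="airline-upgrade-001",
    scenario="Upgrade my flight from economy to business. Use my credit if possible.",
    initial_db={"booking": {"cabin": "economy"}, "user": {"credit": 200}},
    expected_db_state={"booking": {"cabin": "business"}, "user": {"credit": 0}},
    required_outputs=["150"],  # remaining charge: 350 fare difference minus 200 credit
)
```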

python
# Simplified τ-bench agentic loop
import copy

def run_tau_bench_trial(task, agent, user_sim, db, tools, policies):
    # Work on a private copy so tool calls never leak state into other trials
    trial_db = copy.deepcopy(db)

    # Initialize conversation with user's opening message
    user_msg = user_sim.generate_opening(task.scenario)
    conversation = [user_msg]

    for turn in range(25):  # max 25 agent steps (tool calls count as steps)
        # Agent sees: system prompt + policies + conversation
        agent_response = agent.generate(
            system=policies,
            tools=tools,
            messages=conversation
        )

        # If agent wants to call a tool:
        if agent_response.has_tool_call:
            result = execute_tool(agent_response.tool_call, trial_db)
            conversation.append(agent_response)
            conversation.append(tool_result(result))
            continue  # agent gets to see result and decide next

        # If agent responds to user:
        conversation.append(agent_response)

        # Check if conversation is done
        if user_sim.is_satisfied(conversation, task):
            break

        # User simulator responds
        user_reply = user_sim.generate_reply(conversation, task)
        conversation.append(user_reply)

    # Grade: check the trial's database + output against ground truth
    return grade(trial_db, task.expected_db_state, conversation)

Outcome Verification

τ-bench grades outcomes in two ways: checking that the database was modified correctly (e.g., flight was actually changed, credit was applied) AND checking that the agent's text responses contain the right information (e.g., confirming the new flight details to the user). Both must pass for a trial to succeed.
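
A minimal sketch of this two-part check, assuming messages are dicts with `role` and `content` keys and that required facts can be matched as substrings. Both are simplifications made for illustration, and the `required_outputs` parameter is hypothetical.

```python
def grade(final_db, expected_db_state, conversation, required_outputs=()):
    """Pass only if the database matches AND the agent stated every required fact."""
    db_ok = final_db == expected_db_state
    agent_text = " ".join(
        str(m.get("content") or "")
        for m in conversation
        if m.get("role") == "assistant"
    ).lower()
    outputs_ok = all(str(item).lower() in agent_text for item in required_outputs)
    return db_ok and outputs_ok
```

Requiring both signals means an agent that announces a change without making it fails, and so does one that makes the change without telling the user.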

Why Non-Determinism Matters Here

There are two sources of randomness in τ-bench. The agent samples from its LLM, producing different reasoning and tool call sequences. And the user simulator also samples, producing different follow-up questions and phrasings. This means every trial of the same task generates a genuinely different conversation. Pass@K and pass^K become essential because a single trial is statistically meaningless.

Policy Compliance: The Hard Part

The most challenging aspect of τ-bench isn't the tool calling — it's policy compliance. The agent receives policy documents like this:

policy excerpt
## Return Policy
- Electronics: 14-day return window, 15% restocking fee
- Clothing: 30-day return window, no restocking fee
- Sale items: final sale, no returns
- Defective items: 90-day window, full refund, no restocking fee

## Exceptions
- If customer has Gold status: waive restocking fee
- If order was delayed >7 days: extend return window by 14 days
- Items over $500: require manager approval for return

The agent must navigate these rules while conversing with a user who may not know the policies. When a Gold-status customer asks to return a $600 electronic item purchased 16 days ago, the agent needs to: (1) recognize the item is past the 14-day window, (2) check if the order was delayed >7 days, (3) note the item is over $500, (4) check Gold status to waive restocking. Missing any one of these conditions produces a wrong outcome.
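
To make the interaction between rules concrete, here is the excerpt expressed as a decision function. This is a sketch: the excerpt does not say how the rules take precedence, so the ordering below (defective overrides final sale, and the delay extension applies to every window) is an assumption, as are all names.

```python
WINDOWS = {"electronics": 14, "clothing": 30}            # days
RESTOCKING_FEE = {"electronics": 0.15, "clothing": 0.0}  # fraction of price

def evaluate_return(category, price, days_since_purchase, *, gold=False,
                    delay_days=0, on_sale=False, defective=False):
    """Apply the excerpted return rules under the precedence assumed above."""
    if defective:
        window, fee_rate = 90, 0.0
    elif on_sale:
        return {"allowed": False, "reason": "final sale"}
    else:
        window, fee_rate = WINDOWS[category], RESTOCKING_FEE[category]
    if delay_days > 7:
        window += 14  # delayed orders get an extended window
    if days_since_purchase > window:
        return {"allowed": False, "reason": f"outside {window}-day window"}
    if gold:
        fee_rate = 0.0  # Gold status waives the restocking fee
    return {
        "allowed": True,
        "restocking_fee": round(price * fee_rate, 2),
        "needs_manager_approval": price > 500,
    }
```

For the Gold customer with a $600 electronic item bought 16 days ago, the result hinges on the delay: without a delay of more than 7 days the return is refused, and with one it is allowed with no fee but still needs manager approval.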

This is where agents typically fail. The model might apply the wrong policy, miss an exception clause, or hallucinate a policy that doesn't exist. Policy compliance failures account for roughly 40% of all errors in τ-bench.

The Extensions: τ²-bench and τ³-bench

τ-bench spawned two important extensions:

Benchmark	What It Adds	Why It Matters
τ²-bench	Dual-control tasks (telecom domain) in which the simulated user also has tools and acts on the shared environment	Tests whether the agent can guide a user through actions they must perform themselves, not just act on its own
τ³-bench	Longer, more complex conversations with multiple user goals per session	Tests context management and multi-goal reasoning under context pressure

Results that surprised people: When τ-bench was first released, state-of-the-art agents achieved around 50-70% pass@1 on the retail domain. But pass^5 dropped to 10-30%. The agents could solve tasks sometimes, but almost never reliably. This was a wake-up call for the industry.

What Makes τ-bench a Good Benchmark?

Several design choices make τ-bench particularly well-designed:

Realistic complexity: Tasks mirror real customer service scenarios, not toy problems.
Dynamic conversations: The user simulator creates genuine back-and-forth, not scripted exchanges.
Multi-signal grading: Checking both database state and output text catches agents that say the right thing but do the wrong thing (or vice versa).
Built-in non-determinism: Two sources of randomness (agent + user simulator) force the use of multiple trials and proper statistical metrics.
Domain specificity: Retail and airline policies are complex enough to be challenging but well-defined enough to have clear ground truth.

Check: What makes τ-bench conversations dynamic rather than scripted?

The tasks are randomly generated
The database changes between runs
Both the agent and the user simulator are LLMs that sample non-deterministically, producing different conversations each trial

Chapter 8: Case Study — Terminal-Bench

While τ-bench evaluates conversational agents, Terminal-Bench tackles a different challenge: agents that solve real tasks in real terminal environments. No simulated users. No conversation. Just a Docker container, a task instruction, and a set of automated tests to verify the result.

The Harbor Framework

Each Terminal-Bench task is packaged as a Harbor — a self-contained evaluation unit with four components:

Component	What It Is	Example
Instruction	Natural language task description	"Set up a PostgreSQL database with tables for users and orders, create a backup script"
Dockerfile	The starting environment	Ubuntu 22.04 with Python 3.11, Node.js 18, PostgreSQL installed
Tests	Automated verification scripts	Check that tables exist, backup script runs, data is preserved
Oracle Solution	A known-working solution (for validation)	The exact commands and files that solve the task

The key design principle is outcome-oriented evaluation. Terminal-Bench doesn't care HOW the agent solves the task — it only checks the final state of the Docker container. Did the right files get created? Does the script work? Are the tests passing? This means the agent can use any approach: write a script, install packages, copy files, whatever works.

Task Quality: The 7-Step Audit

Not all tasks are good tasks. Terminal-Bench enforces quality through a rigorous 7-step audit process for every task:

1. Specificity

Is the task unambiguous? Can it be solved without guessing what the author meant?

↓

2. Solvability

Can the task actually be completed in the given Docker environment? Does the oracle solution work?

↓

3. Integrity

Can the agent cheat? Can it pass the tests without actually solving the task? Adversarial exploit detection.

↓

4. Independence

Does the task depend on external services (internet, APIs) that might fail?

↓

5. Determinism

Do the tests produce the same result every time? No race conditions, no timing dependencies.

↓

6. Difficulty Calibration

Is the task appropriately challenging? Not trivially easy, not impossibly hard.

↓

7. Coverage

Does the task test something meaningfully different from existing tasks?

Step 3 — integrity checking — is especially important. An agent might learn to exploit the test suite rather than solve the task. For example, if the test checks "does file output.txt contain the word 'success'?", a lazy agent could just write "success" to the file without doing any real work. The audit process includes adversarial exploit detection: a human tries to find shortcuts that pass the tests without solving the task, then fixes the tests to block those shortcuts.

Adversarial Exploit Detection: A Deep Dive

Let's walk through a concrete integrity failure and how to fix it. Original task and test:

example
# TASK: "Write a Python script that sorts a CSV file by the 'date' column"

# TEST (naive):
import csv

def test_csv_sorted():
    with open("output.csv", newline="") as f:
        dates = [row["date"] for row in csv.DictReader(f)]
    assert dates == sorted(dates)  # just checks output is sorted

An agent could exploit this test by simply writing a hardcoded sorted CSV file without implementing any sorting logic. Or it could read the test file, figure out the expected output, and write it directly. To fix this, the audit adds:

example
# TEST (robust):
import csv, os, subprocess, sys

def read_dates(path):
    """Return the 'date' column of a CSV file as a list of strings."""
    with open(path, newline="") as f:
        return [row["date"] for row in csv.DictReader(f)]

def test_csv_sorted_robust():
    # 1. Check output file exists, is valid CSV, is non-empty, and is sorted
    #    (string comparison assumes ISO 8601 dates such as 2024-03-01)
    dates = read_dates("output.csv")
    assert dates and dates == sorted(dates)

    # 2. Check a Python script exists (not just a data file)
    assert os.path.exists("sort_csv.py")

    # 3. Run the script on a DIFFERENT input to verify generality
    create_random_csv("test_input.csv", n=100)  # helper supplied by the test suite
    subprocess.run(
        [sys.executable, "sort_csv.py", "test_input.csv", "test_output.csv"],
        check=True,
    )

    # 4. The new output must contain the new input's dates in sorted order,
    #    so an empty or hardcoded file cannot pass
    assert read_dates("test_output.csv") == sorted(read_dates("test_input.csv"))

The robust test runs the agent's solution on a new, random input that didn't exist during the agent's execution. This is the gold standard for integrity checking: the test should verify that the agent created a general solution, not a task-specific shortcut.
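
The robust test relies on a `create_random_csv` helper that is not shown. One possible implementation is sketched below; the column names other than `date`, the date range, and the `seed` parameter are all assumptions made for illustration.

```python
import csv
import random
from datetime import date, timedelta

def create_random_csv(path, n=100, seed=None):
    """Write n rows with random ISO-format dates, so string order equals date order."""
    rng = random.Random(seed)
    start = date(2020, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "date", "amount"])
        writer.writeheader()
        for i in range(n):
            day = start + timedelta(days=rng.randrange(0, 2000))
            writer.writerow({"id": i, "date": day.isoformat(), "amount": rng.randint(1, 500)})
```

Leaving `seed` unset gives a fresh input on every test run, which is what prevents an agent from memorizing the expected output.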

Scale and Coverage

Terminal-Bench contains 89 tasks across 15+ categories: software engineering, data processing, system administration, scientific computing, networking, databases, and more. Each task takes a capable agent 5-30 minutes to attempt, making the full benchmark a substantial test of real-world terminal skills.

Task Category Breakdown

Category	Tasks	Examples
Software Engineering	18	Build systems, testing, debugging
Data Processing	12	CSV, JSON, database manipulation
System Administration	14	Service configuration, permissions, networking
Scientific Computing	8	Numerical methods, plotting, data analysis
DevOps	10	Docker, CI/CD, monitoring
Security	6	Encryption, auth, vulnerability scanning
Other	21	Misc. terminal tasks

Key design lesson from Terminal-Bench: Outcome-oriented evaluation is more robust than trajectory-based evaluation. If you check "did the agent run the right commands?", you're testing one path. If you check "is the final system state correct?", you're testing all possible paths. The agent might solve it in a way you never anticipated — and that's fine, as long as it works.

Terminal-Bench Results Across Models

Performance on Terminal-Bench varies dramatically by model and agent framework. Some findings:

Finding	Implication
Best agents achieve ~60-70% pass@1	Terminal tasks remain challenging even for frontier models
Pass^5 drops to ~15-30% for top agents	Reliability is still a major unsolved problem
Some categories (data processing) are much easier than others (system administration)	Aggregate scores hide category-level weaknesses
Agent frameworks matter as much as model choice	Harness design (retry logic, error handling, tool definitions) is a major performance lever
Longer time budgets don't always help	Giving agents more time can lead to more tool calls, which increases context rot

The last finding is counterintuitive: you'd expect more time to help, but agents with generous time budgets sometimes perform worse because they make more tool calls, fill up the context, and start making confusion errors. This connects directly back to context engineering — a well-designed agent knows when to stop exploring and commit to an answer.
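
Because aggregate scores hide category-level weaknesses, it helps to report both. The sketch below assumes results arrive as `(category, passed)` pairs, one per trial; the function name is illustrative, and the usage example uses invented numbers.

```python
from collections import defaultdict

def pass_rate_by_category(results):
    """results: iterable of (category, passed) pairs, one per trial."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += bool(passed)
    per_category = {c: passes[c] / totals[c] for c in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_category
```

With made-up data of 9/10 passes in data processing and 3/10 in system administration, the overall rate is 0.6, a number that describes neither category well.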

Designing Your Own Task Harbors

If you're building a Terminal-Bench-style evaluation for your own domain, follow these principles:

Dockerfile first: Define the exact environment. Pin package versions. Make the setup reproducible.
Oracle solution second: Solve the task yourself before asking the agent. If you can't solve it, it's not a fair task.
Tests third: Write tests that verify the outcome, not the path. Test on new inputs when possible.
Adversarial audit fourth: Try to break your own tests. Can you pass them without solving the task? Fix any exploits.
Difficulty calibration last: Estimate task difficulty based on number of required commands, conceptual complexity, and domain specificity.

Check: Why is "integrity checking" important in benchmark design?

To prevent agents from passing tests without actually solving the task (exploiting shortcuts)
To make tasks harder
To reduce the number of tasks needed

Chapter 9: The Swiss Cheese Model

No single evaluation method catches everything. Human reviewers miss things. Code-based graders are brittle. Model-based graders have biases. Automated benchmarks have blind spots. If you rely on any one method alone, failures will slip through.

The solution is the Swiss cheese model, borrowed from safety engineering. Imagine each evaluation method as a slice of Swiss cheese — it blocks most failures, but has holes. Stack multiple slices together, and the holes in one slice are covered by solid cheese in the next. Enough layers, and almost nothing gets through.

The three layers of evaluation:
Layer 1 — Automated evals (pre-deploy): Run your benchmark suite on every code change. Catches regressions before they ship. Fast, cheap, but only tests what you've thought to test.

Layer 2 — Manual transcript review (pre-deploy): Humans read a sample of transcripts. Catches nuanced failures that code can't detect: wrong tone, hallucinated policies, subtle logical errors. Slow, but catches things automation misses.

Layer 3 — Production monitoring + A/B testing (post-deploy): Monitor real user interactions. Track resolution rates, escalation rates, user satisfaction. Catches edge cases and distribution shifts that no benchmark anticipated.

What Each Layer Catches

Failure Type	Layer 1 (Auto)	Layer 2 (Manual)	Layer 3 (Prod)
Regression from code change	Catches	Slow to notice	Too late
Wrong tone / unprofessional	Misses	Catches	User complaints
Novel edge case from real user	Not in test set	Not in sample	Catches
Hallucinated policy	Maybe	Catches	Catches
Infinite tool-call loop	Catches	Catches	Catches
Slow performance degradation	Misses	Misses	Catches

Notice the pattern: no single layer catches every failure type. But together, they cover everything. This is why you need all three, not just your favorite.

Implementation: What Each Layer Looks Like

Layer 1 — Automated Evals: A CI pipeline that runs your benchmark suite on every pull request. Takes 10-30 minutes. Reports pass@1 and pass^3. Blocks merge if pass@1 drops by more than 5% from baseline. This is your first line of defense.

yaml
# .github/workflows/agent-eval.yml (simplified)
name: Agent Evaluation
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: |
          python run_eval.py \
            --tasks eval/tasks/*.json \
            --trials-per-task 5 \
            --output results.json
      - run: |
          python check_regression.py \
            --current results.json \
            --baseline eval/baseline.json \
            --max-regression 0.05
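
The workflow calls a `check_regression.py` script that is not shown. A minimal sketch of what it might contain follows, assuming each results file maps a task id to `{"n": trials, "c": successes}`. That file format and the function names are hypothetical.

```python
import argparse
import json

def mean_pass_at_1(path):
    """Average pass@1 across tasks in a results file (assumed format: id -> {"n", "c"})."""
    with open(path) as f:
        results = json.load(f)
    return sum(r["c"] / r["n"] for r in results.values()) / len(results)

def regression(current, baseline, max_regression):
    """True when pass@1 dropped by more than the allowed margin."""
    return (baseline - current) > max_regression

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--current", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--max-regression", type=float, default=0.05)
    args = parser.parse_args(argv)
    current = mean_pass_at_1(args.current)
    baseline = mean_pass_at_1(args.baseline)
    print(f"pass@1: baseline {baseline:.3f}, current {current:.3f}")
    return 1 if regression(current, baseline, args.max_regression) else 0

# As a script, finish with: sys.exit(main())
```

A nonzero exit code fails the CI step, which is what blocks the merge.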

Layer 2 — Manual Review: Weekly, a domain expert reviews 20-30 randomly sampled transcripts from the eval suite. They score on a rubric and flag any new failure modes. This catches "the agent technically passed the code grader but gave bad advice" situations.

Layer 3 — Production Monitoring: In production, track these signals:

Resolution rate: What percentage of conversations result in the user's issue being resolved?
Escalation rate: How often does the agent hand off to a human? Rising escalation rate = agent is struggling.
User satisfaction: Post-conversation thumbs up/down or CSAT score.
Latency: How long does the agent take? Long conversations may indicate the agent is stuck or confused.
Error rate: Tool call failures, API errors, max-iteration hits.
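
Most of these signals can be computed from conversation logs with a few lines of code. The sketch below assumes each log record is a dict with boolean `resolved`, `escalated`, and `errored` fields and a numeric `latency_s`. These field names are hypothetical, and the 95th-percentile latency uses a simple rank-based estimate.

```python
def monitoring_summary(conversations):
    """Summarize production signals from a list of conversation log records."""
    n = len(conversations)
    if n == 0:
        raise ValueError("no conversations to summarize")
    latencies = sorted(c["latency_s"] for c in conversations)
    return {
        "resolution_rate": sum(c["resolved"] for c in conversations) / n,
        "escalation_rate": sum(c["escalated"] for c in conversations) / n,
        "error_rate": sum(c["errored"] for c in conversations) / n,
        "p95_latency_s": latencies[min(n - 1, int(0.95 * n))],
    }
```

Tracking these per day or per release makes a rising escalation rate or a slow drift in latency visible before users complain.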

The feedback loop: Production monitoring (Layer 3) feeds back into automated evals (Layer 1). When you discover a new failure mode in production, you add it as a test case in your benchmark. Over time, your automated evals get better because they're informed by real-world failures. This is the evaluation flywheel.

Check: Why is the Swiss cheese model necessary for agent evaluation?

Because no single evaluation method catches all failure types: layering multiple methods covers each other's blind spots
Because Swiss cheese is a good metaphor
Because automated tests are unreliable

Chapter 10: Build Your Own Agent Eval

You now have all the concepts. Let's put them together into a practical, step-by-step roadmap for building your own agent evaluation system. Seven steps, from scratch to production-ready.

Step 1: Define Success Criteria

Before writing a single test, answer: what does "good" look like? Be specific. Not "the agent should help the user" but "the agent should resolve the user's issue, update the database correctly, and respond in a professional tone within 30 seconds." Your success criteria become the inputs to your graders.

Step 2: Collect a Small Task Set

Start with 10-20 tasks that cover your most common scenarios. Don't try to be comprehensive yet. Include a mix of easy tasks (single tool call), medium tasks (3-5 tool calls), and hard tasks (ambiguous instructions, policy edge cases). Real user conversations from logs are the best source.

Step 3: Create Useful Tasks

Good tasks are specific (unambiguous instructions), solvable (the agent has the tools and information to succeed), and representative (they reflect real usage patterns). Include the full context: user message, initial database state, available tools, and applicable policies.

Step 4: Provide Ground Truth

For each task, define the expected outcome. This might be: expected database state after completion, required tool calls, reference output text, or a combination. The more specific your ground truth, the easier grading becomes.
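
As a minimal sketch of Steps 3 and 4, a task and its ground truth can live together in one record. The class and field names below (`Task`, `GroundTruth`, `initial_env`, `policies`, `expected_db`, and so on) are illustrative assumptions rather than a fixed schema, and the order data is made up.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class GroundTruth:
    """Expected result of a task (Step 4). All field names are hypothetical."""
    expected_db: dict[str, Any]                               # DB state after a correct run
    required_tools: list[str] = field(default_factory=list)   # tools that must be called
    forbidden_tools: list[str] = field(default_factory=list)  # tools that must not be called


@dataclass
class Task:
    """One evaluation task with its full context (Step 3)."""
    id: str
    difficulty: str                # "easy", "medium", or "hard"
    user_message: str
    initial_env: dict[str, Any]    # initial database state
    tools: list[str]               # names of the tools the agent may use
    policies: list[str]            # policy text given to the agent
    ground_truth: GroundTruth


# Example: a "cancel a non-cancellable order" task. The correct behavior is to
# look the order up, refuse, and leave the database unchanged.
cancel_shipped = Task(
    id="cancel-non-cancellable",
    difficulty="medium",
    user_message="Please cancel order A-1001.",
    initial_env={"orders": {"A-1001": {"status": "shipped"}}},
    tools=["lookup_order", "cancel_order"],
    policies=["Orders that have shipped cannot be cancelled."],
    ground_truth=GroundTruth(
        expected_db={"orders": {"A-1001": {"status": "shipped"}}},
        required_tools=["lookup_order"],
        forbidden_tools=["cancel_order"],
    ),
)
```

Keeping the expected database state next to the initial state makes refusal tasks easy to express: the two are simply equal.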

Step 5: Configure Graders

Set up a layered grading system. Start with code-based graders for objective checks (database state, tool calls). Add a model-based grader for subjective quality (tone, helpfulness, accuracy of explanations). Schedule periodic human review to calibrate the model-based grader.

python
# Example: composite grader

def composite_grade(transcript, outcome, ground_truth):
    scores = {}

    # Code graders (fast, deterministic)
    scores["db_correct"] = check_database_state(
        outcome.db, ground_truth.expected_db
    )
    scores["tools_correct"] = check_required_tools(
        transcript, ground_truth.required_tools
    )
    scores["no_forbidden"] = check_no_forbidden_tool(
        transcript, ground_truth.forbidden_tools
    )

    # Model grader (flexible, non-deterministic)
    scores["quality"] = llm_judge(
        transcript,
        rubric="Rate 1-5: professional tone, accurate info, concise"
    )

    # Overall: code graders must pass, quality ≥ 3
    passed = (all([
        scores["db_correct"],
        scores["tools_correct"],
        scores["no_forbidden"]
    ]) and scores["quality"] >= 3)

    return {"passed": passed, "scores": scores}
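
The composite grader above relies on three code-based helpers that it does not define. One possible sketch follows; it assumes the transcript is a list of event dicts in which tool calls look like `{"type": "tool_call", "name": ..., "arguments": {...}}`. That event format is a hypothetical choice for illustration, not one prescribed by any framework.

```python
def called_tools(transcript):
    """Names of all tools called, in order (hypothetical event format)."""
    return [e["name"] for e in transcript if e.get("type") == "tool_call"]


def check_database_state(actual_db, expected_db):
    """Pass only if the final database matches the expected state exactly."""
    return actual_db == expected_db


def check_required_tools(transcript, required_tools):
    """Pass if every required tool was called at least once."""
    return set(required_tools) <= set(called_tools(transcript))


def check_no_forbidden_tool(transcript, forbidden_tools):
    """Pass if no forbidden tool was ever called."""
    return set(forbidden_tools).isdisjoint(called_tools(transcript))


transcript = [
    {"type": "message", "role": "user", "content": "Where is order A-1001?"},
    {"type": "tool_call", "name": "lookup_order", "arguments": {"order_id": "A-1001"}},
    {"type": "message", "role": "assistant", "content": "It shipped yesterday."},
]
print(check_required_tools(transcript, ["lookup_order"]))     # True
print(check_no_forbidden_tool(transcript, ["cancel_order"]))  # True
```

Exact equality on the database is the strictest possible check. If your environment contains timestamps or generated IDs, you would likely compare only the fields that matter.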

Step 6: Build the Evaluation Harness

The evaluation harness is the software that runs tasks, collects transcripts, applies graders, and stores results. It needs to handle:

Running N trials per task (for pass@K / pass^K computation)
Isolating each trial (fresh database state, clean context)
Saving full transcripts for debugging
Parallelization (running multiple trials concurrently)
Result storage (database or structured files)

python
# Skeleton evaluation harness

import asyncio  # callers need it, e.g. asyncio.run(harness.run_task(task))
import copy

class EvalHarness:
    def __init__(self, agent, graders, trials_per_task=10):
        self.agent = agent
        self.graders = graders  # dict: grader name -> object with a .grade() method
        self.trials_per_task = trials_per_task
        self.results = []

    async def run_task(self, task):
        """Run N trials for one task."""
        trial_results = []
        for i in range(self.trials_per_task):
            # Fresh environment per trial
            env = copy.deepcopy(task.initial_env)

            # Run agent (the agent mutates env through its tool calls)
            transcript = await self.agent.run(
                user_message=task.user_message,
                tools=task.tools,
                env=env
            )

            # Grade: each grader returns a dict with at least a boolean "passed"
            scores = {}
            for name, grader in self.graders.items():
                scores[name] = grader.grade(transcript, env, task.ground_truth)

            trial_results.append({
                "trial": i,
                "transcript": transcript,
                "scores": scores,
                "passed": all(s["passed"] for s in scores.values())
            })

        # Compute metrics: c passing trials out of n.
        # compute_pass_at_k / compute_pass_up_k implement the pass@K and pass^K formulas.
        c = sum(1 for t in trial_results if t["passed"])
        n = len(trial_results)
        summary = {
            "task_id": task.id,
            "pass_at_1": c / n,
            "pass_at_3": compute_pass_at_k(n, c, 3),
            "pass_up_3": compute_pass_up_k(n, c, 3),
            "trials": trial_results
        }
        self.results.append(summary)  # keep per-task summaries for reporting
        return summary

This is a minimal but functional harness. Production systems add logging, progress bars, error handling, cost tracking, and result dashboards — but the core loop is always the same: run, grade, store, compute metrics.
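
The skeleton runs its trials one after another. Two of the requirements listed earlier, parallelization and result storage, can be sketched with the standard library alone. The helper names `run_trials_concurrently` and `save_results_jsonl`, and the `run_one_trial` callable, are assumptions made for this example; `fake_trial` stands in for a real agent run.

```python
import asyncio
import json


async def run_trials_concurrently(run_one_trial, n_trials, max_concurrency=4):
    """Run run_one_trial(i) for i in range(n_trials), at most max_concurrency at once.

    run_one_trial is a hypothetical async callable that builds a fresh
    environment, runs the agent, grades the result, and returns a dict.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(i):
        async with semaphore:
            return await run_one_trial(i)

    # gather returns results in the order the coroutines were passed in
    return await asyncio.gather(*(guarded(i) for i in range(n_trials)))


def save_results_jsonl(results, path):
    """Append one JSON object per line so runs can be inspected or diffed later."""
    with open(path, "a", encoding="utf-8") as f:
        for record in results:
            f.write(json.dumps(record) + "\n")


async def fake_trial(i):
    """Stand-in for a real agent run: even-numbered trials pass."""
    await asyncio.sleep(0)
    return {"trial": i, "passed": i % 2 == 0}


results = asyncio.run(run_trials_concurrently(fake_trial, n_trials=6))
print(sum(r["passed"] for r in results))  # 3
```

The semaphore caps how many trials are in flight at once, which matters when the agent calls a rate-limited model API. Each trial still needs its own copy of the environment, or concurrent trials will corrupt each other's state.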

Reporting Results

When presenting evaluation results, include enough detail for informed decision-making. A good eval report looks like this:

text
========================================
Agent Evaluation Report — v2.4.1
Date: 2026-05-18 | Model: claude-opus-4
Tasks: 50 | Trials/task: 10 | Total runs: 500
========================================

METRICS (all tasks):
  pass@1:  0.820  (410/500 trials passed)
  pass@3:  0.978
  pass^3:  0.574
  pass^5:  0.389

METRICS (by difficulty):
  Easy   (20 tasks):  pass@1=0.940  pass^3=0.821
  Medium (20 tasks):  pass@1=0.790  pass^3=0.480
  Hard   (10 tasks):  pass@1=0.640  pass^3=0.267

METRICS (by category):
  Order lookup:     pass@1=0.960
  Cancellation:     pass@1=0.880
  Returns:          pass@1=0.780
  Policy edge cases: pass@1=0.600

TOP FAILURE MODES:
  1. Policy hallucination (12 failures)
  2. Wrong order ID in response (8 failures)
  3. Premature action without confirmation (6 failures)

VS BASELINE (v2.3.0):
  pass@1: 0.820 vs 0.790 (+0.030) ✓
  pass^3: 0.574 vs 0.510 (+0.064) ✓
  No regressions in any category. SAFE TO SHIP.
========================================

This report gives you everything needed to decide whether to ship: overall metrics, difficulty breakdown, category breakdown, top failure modes, and comparison to baseline. The category breakdown is especially important: an 82% aggregate can hide a 60% pass rate on policy edge cases, which might be unacceptable for your use case.
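
The per-category numbers in a report like this come from grouping trial results before averaging. A small sketch, assuming each trial record carries a `category` label and a boolean `passed` field (both hypothetical field names), with made-up data:

```python
from collections import defaultdict


def pass_at_1_by_group(trials, key):
    """Return {group: fraction of trials passed}, grouping by trial[key]."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        total[t[key]] += 1
        passed[t[key]] += int(t["passed"])
    return {g: passed[g] / total[g] for g in total}


def regressions(current, baseline, tolerance=0.0):
    """Groups whose pass rate dropped by more than tolerance versus the baseline."""
    return sorted(
        g for g in current
        if g in baseline and current[g] < baseline[g] - tolerance
    )


trials = [
    {"category": "order_lookup", "passed": True},
    {"category": "order_lookup", "passed": True},
    {"category": "policy_edge", "passed": True},
    {"category": "policy_edge", "passed": False},
]
by_cat = pass_at_1_by_group(trials, "category")
print(by_cat)  # {'order_lookup': 1.0, 'policy_edge': 0.5}
print(regressions(by_cat, {"order_lookup": 1.0, "policy_edge": 0.75}))  # ['policy_edge']
```

Checking regressions per group rather than on the aggregate is what backs a "no regressions in any category" line: an overall improvement can otherwise hide a drop in one category.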

Step 7: Inspect, Iterate, Maintain

Run your evaluation. Read the failing transcripts. Understand why the agent failed. Fix the agent (better prompts, better tools, better instructions). Run again. This cycle never ends. The best agent evaluations evolve continuously through new failure cases and ongoing maintenance.

A Concrete Task Set Example

Here's what a small but well-designed task set looks like for a customer service agent:

#	Task	Difficulty	Tests
1	Simple order status lookup	Easy	Single tool call, correct response
2	Cancel a cancellable order	Easy	DB write, confirmation message
3	Cancel a non-cancellable order	Medium	Policy compliance — agent should refuse
4	Return with restocking fee	Medium	Fee calculation, user confirmation
5	Return for Gold member (fee waived)	Medium	Exception handling
6	Ambiguous user request	Hard	Clarification before action
7	Multi-step: return + reorder different item	Hard	Multiple DB writes, correct sequence
8	Policy edge case: item expired but defective	Hard	Exception vs standard policy conflict
9	User provides wrong order ID	Medium	Error handling, helpful recovery
10	Transfer to human agent	Medium	Knows when to escalate

Notice the distribution: 2 easy, 5 medium, and 3 hard, with task 8 covering a policy edge case. This spread gives you a nuanced view of agent capability. An agent that passes 7/10 might still be failing all three hard tasks, so the average masks the weakness.

Common Failure Patterns

After running hundreds of evaluation trials, certain failure patterns appear again and again. Knowing these helps you write better tasks and graders:

Hallucinated Policy: Agent invents a rule that doesn't exist. "Our policy allows 60-day returns" when the real policy is 30 days.
Wrong Tool Selection: Agent calls search_products() when it should call lookup_order(). Often caused by ambiguous tool names.
Premature Action: Agent acts before confirming with the user. Cancels an order without asking "Are you sure?"
Infinite Loop: Agent keeps calling the same tool with the same arguments, expecting a different result.
Context Confusion: After many tool calls, agent confuses details from different steps. Applies one order's policy to another order.

For each pattern, add specific test cases to your eval suite. The infinite loop pattern, for example, is caught by a simple grader that checks whether the same tool was called 3+ times with identical arguments.
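
A loop grader of that kind might look like the sketch below. It assumes tool-call events of the form `{"type": "tool_call", "name": ..., "arguments": {...}}` with JSON-serializable arguments; the event format and the function name `has_repeated_call` are assumptions for illustration.

```python
import json
from collections import Counter


def has_repeated_call(transcript, threshold=3):
    """True if any (tool name, arguments) pair occurs `threshold` or more times."""
    counts = Counter(
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as identical
        (e["name"], json.dumps(e["arguments"], sort_keys=True))
        for e in transcript
        if e.get("type") == "tool_call"
    )
    return any(n >= threshold for n in counts.values())


looping = [{"type": "tool_call", "name": "get_refund", "arguments": {"id": 7}}] * 3
healthy = [
    {"type": "tool_call", "name": "get_refund", "arguments": {"id": 7}},
    {"type": "tool_call", "name": "get_refund", "arguments": {"id": 8}},
]
print(has_repeated_call(looping))  # True
print(has_repeated_call(healthy))  # False
```

This version counts repeats anywhere in the transcript, not only consecutive ones. Agents that legitimately poll a status endpoint would trip it, so the threshold is best treated as a tunable parameter.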

The evaluation flywheel: Evaluate → Identify failures → Fix agent → Add failure as regression test → Evaluate again. Every failure you fix makes your evaluation suite stronger. Every new failure you discover in production gets added back as a test case, so the eval suite improves with each cycle.

Check: What should you do when you discover a new failure in production?

Ignore it: the benchmark didn't test for it
Just fix the agent and move on
Fix the agent AND add the failure as a regression test to your benchmark

Chapter 11: Connections & Cheat Sheet

Benchmark Landscape

Here's the lay of the land. For each major agent benchmark, the table lists what it tests and how it is graded:

Benchmark	Domain	Agent Type	Grading
τ-bench	Retail & Airline	Conversational + tools	DB state + output strings
τ²-bench	Retail, Airline & Telecom	Conversational + tools, dual control (the user can also act on the environment)	Environment/DB state
τ³-bench	Complex multi-turn retail	Extended conversation	DB state + policy compliance
Terminal-Bench	Terminal / DevOps	Command execution	Docker container state tests
SWE-bench	Software engineering	Code generation + editing	Test suite pass/fail
GAIA	General assistant	Multi-tool reasoning	Factual correctness
TheAgentCompany	Business operations	Multi-app interaction	Task completion + quality
MT-Bench	Open-ended conversation	Multi-turn chat LLM (no tools)	LLM-as-Judge (single-answer scores or pairwise)

Choosing the Right Benchmark

No single benchmark is comprehensive. Choose based on your agent's domain:

If your agent does...	Use these benchmarks	Why
Customer service with tools	τ-bench, τ²-bench	Closest to real CS workflows with policy compliance
Code generation / editing	SWE-bench, Terminal-Bench	Verifiable outcomes in real codebases and terminal environments
General-purpose assistance	GAIA, MT-Bench	Multi-tool reasoning and open-ended quality
Business workflow automation	TheAgentCompany	Multi-application interaction patterns
Your own domain	Build your own + one public benchmark	Domain-specific tasks catch failures no public benchmark will

The hybrid approach: Run at least one public benchmark (for comparison with other systems) PLUS your own domain-specific evaluation (for catching failures that matter to your users). Public benchmarks give you a baseline; custom evals give you confidence.

Benchmark Limitations

No benchmark is perfect. Be aware of these common limitations:

Contamination: If the benchmark tasks leaked into the model's training data, scores are artificially inflated. Always check for contamination when using public benchmarks.
Distribution mismatch: Benchmark tasks may not reflect your actual usage patterns. An agent that scores 90% on τ-bench might score 60% on your real customer service tasks.
Ceiling effects: As agents improve, benchmarks saturate. When every agent scores 95%+ on a benchmark, it stops being useful for comparison.
Gaming: Systems can be specifically tuned to perform well on a known benchmark without generalizing. This is why custom evals matter.
Static nature: Most benchmarks are fixed task sets. Real users bring novel, ever-changing requests that no fixed benchmark anticipates.

Evaluation Cheat Sheet

The Five Components

Task = input + success criteria
Trial = one attempt at a task
Transcript = full record of actions
Outcome = final environment state
Grader = scores the result

Three Grader Types

Human = gold standard, slow, expensive
Code = fast, deterministic, brittle
Model (LLM-as-Judge) = flexible, noisy

Two Key Metrics

Pass@K = at least 1 of K succeeds (capability)
Pass^K = all K succeed (reliability)

Three Evaluation Layers

Automated evals (pre-deploy regression)
Manual review (pre-deploy quality)
Production monitoring (post-deploy)

The 7-Step Roadmap

Define success criteria

↓

Collect small task set (10-20 tasks)

↓

Create useful tasks (specific, solvable, representative)

↓

Provide ground truth

↓

Configure graders (code + model + human)

↓

Build evaluation harness

↓

Inspect, iterate, maintain — forever

Decision Checklist: When to Ship

Use this checklist before deploying an agent to production:

Check	Threshold	Priority
pass@1 on core tasks	≥ 80%	Must pass
pass^3 on core tasks	≥ 50%	Must pass
pass@1 on edge cases	≥ 60%	Should pass
No infinite loop failures	0 occurrences	Must pass
No data corruption failures	0 occurrences	Must pass
Human review sample (20 transcripts)	≥ 90% acceptable	Must pass
LLM judge calibrated against humans	≥ 80% agreement	Should pass
A/B test vs current system (if exists)	Non-inferior	Should pass
Latency p95	< 30 seconds	Should pass
Cost per conversation	Within budget	Must pass

"Must pass" items are blockers — if any fail, do not ship. "Should pass" items are strong recommendations — shipping without them adds risk that should be acknowledged.

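The checklist can be applied mechanically once the metrics are computed. The sketch below covers four of the ten checks, and the metric names and values are illustrative assumptions; the thresholds come from the table above.

```python
def ship_decision(checks):
    """checks: list of (name, passed, priority), priority being "must" or "should".

    Returns (ok_to_ship, blockers, acknowledged_risks).
    """
    blockers = [name for name, passed, prio in checks if prio == "must" and not passed]
    risks = [name for name, passed, prio in checks if prio == "should" and not passed]
    return (not blockers, blockers, risks)


# Illustrative numbers, not real results
metrics = {"pass_at_1_core": 0.82, "pass_up_3_core": 0.57,
           "pass_at_1_edge": 0.55, "loop_failures": 0}

checks = [
    ("pass@1 core >= 80%", metrics["pass_at_1_core"] >= 0.80, "must"),
    ("pass^3 core >= 50%", metrics["pass_up_3_core"] >= 0.50, "must"),
    ("pass@1 edge >= 60%", metrics["pass_at_1_edge"] >= 0.60, "should"),
    ("no infinite loops", metrics["loop_failures"] == 0, "must"),
]
ok, blockers, risks = ship_decision(checks)
print(ok, blockers, risks)  # True [] ['pass@1 edge >= 60%']
```

Returning the failed "should" items separately keeps them visible as acknowledged risks instead of silently passing the gate.
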
Formula Reference

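Every formula from this lesson in one place. Here n is the number of trials run for a task, c is the number of those trials that passed, and p is the per-trial success rate (estimated as c/n); in the last row, p is the per-step success rate: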

Formula	What It Computes	When to Use
pass@K = 1 − C(n−c, K) / C(n, K)	P(at least 1 of K succeeds)	Measuring capability
pass^K = C(c, K) / C(n, K)	P(all K succeed)	Measuring reliability
pass@K ≈ 1 − (1−p)^K	Approximate pass@K	Large n, quick estimate
pass^K ≈ p^K	Approximate pass^K	Large n, quick estimate
P(all steps correct) = p^steps	Error compounding over trajectory	Estimating long-horizon failure rate
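
The exact formulas translate directly into Python with `math.comb`. The sketch below is one possible implementation of the `compute_pass_at_k` and `compute_pass_up_k` helpers that the harness skeleton calls, using the same (n, c, k) argument order; the input validation is an added assumption.

```python
from math import comb


def compute_pass_at_k(n, c, k):
    """P(at least 1 of k sampled trials passed), given c passes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def compute_pass_up_k(n, c, k):
    """P(all k sampled trials passed), given c passes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


n, c, k = 10, 8, 3
p = c / n
print(round(compute_pass_at_k(n, c, k), 3))  # 1.0   (only 2 failures, so 3 draws must include a pass)
print(round(compute_pass_up_k(n, c, k), 3))  # 0.467
print(round(1 - (1 - p) ** k, 3))            # 0.992 (approximate pass@3)
print(round(p ** k, 3))                      # 0.512 (approximate pass^3)
print(round(0.95 ** 20, 3))                  # 0.358 (20 steps at 95% per-step success)
```

With only 10 trials the approximations are visibly off (0.512 against an exact 0.467 for pass^3), which is why they are recommended for large n. The last line shows error compounding: a 95% per-step success rate leaves roughly a 36% chance of a flawless 20-step trajectory.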

Glossary of Key Terms

Term	Definition
Agent	LLM + tools + instructions running in a loop via a harness
Agent harness	The software system (loop, tool execution, context management) that enables a model to act as an agent
Context rot	Degradation of agent performance as the context window fills with irrelevant information
Compaction	Strategies to reduce context size (summarization, clearing old tool results, note-taking)
Task	A predefined input + success criteria for evaluation
Trial	One attempt at completing a task
Transcript	Complete record of messages, tool calls, and results during a trial
Outcome	Final environment state after agent completes (or fails) a task
Grader	A function (human, code, or model) that scores a trial
Pass@K	Probability of at least one success in K attempts (capability)
Pass^K	Probability of all K attempts succeeding (reliability)
LLM-as-Judge	Using another LLM to evaluate agent outputs against a rubric
Inter-annotator agreement	How often two human graders give the same score
Swiss cheese model	Layering multiple evaluation methods so each covers the others' blind spots
Evaluation flywheel	Production failures → new test cases → better evals → better agents → repeat
Harbor	Terminal-Bench's self-contained task package (instruction + Docker + tests + oracle)
Progressive disclosure	Agent discovers context via tools rather than having it pre-loaded

The Evolution of Agent Evaluation

The field is evolving rapidly. A few trends to watch:

Dynamic benchmarks: Benchmarks that regenerate tasks to prevent memorization/contamination.
Agentic evaluation: Using agents to evaluate other agents (recursive, but effective when calibrated).
Reward-model graders: Training specialized reward models as graders, combining the speed of code with the flexibility of LLMs.
Real-time monitoring: Evaluating agents in production with live grading, not just offline benchmarks.
Multi-turn evaluation: Benchmarks specifically designed for long conversations (50+ turns) where context management becomes critical.

Related Lessons

Continue your learning path:

Reward & Alignment — RLHF, DPO, and making agents do what you want
microGPT — Understand the underlying model that powers agents
Transformer — The attention mechanism behind every agent's brain
CS224N Lec 10: Agents, Tools & RAG — ReAct loops and tool calling in depth
CS224N Lec 11: Benchmarking & Evaluation — LLM evaluation fundamentals

References

Anthropic. "Demystifying evals for AI agents." 2025.
Anthropic. "Effective context engineering for agents." 2025.
OpenAI. "A practical guide to building agents." 2025.
Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. arXiv
Zheng et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv
Yao et al. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." 2024. arXiv
Barres et al. "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment." 2025. arXiv
Shen et al. "Terminal-Bench: Evaluating Terminal Automation Agents." 2025. arXiv
Wolfe, Cameron R. "Agent Evaluation: A Detailed Guide." Deep Learning Focus, May 2026.

"The best evaluation is the one that embarrasses your agent before your users do."

— Adapted from the software testing proverb

You now understand how to evaluate agents rigorously. Go build evals that make your agents truly reliable.

Check: What is the most important thing to do after discovering a production failure?

Increase the temperature of the LLM
Fix the bug AND add it as a regression test case to your eval suite (the evaluation flywheel)
Switch to a bigger model
