The Complete Beginner's Path

Evaluate Your Agents Rigorously

Your agent passes your test cases. Then it fails spectacularly on real users. Learn to build evaluation systems that catch failures before deployment.

Prerequisites: Basic programming + Familiarity with LLMs. That's it.
12 Chapters · 12+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Agent Evaluation is Hard

You just built an agent. It can call tools, search the web, write code, and interact with databases. You test it on ten carefully crafted scenarios. It aces all ten. You deploy it. Within an hour, a user asks it to cancel an order, and it cancels the wrong order. Another user asks for a refund, and the agent enters an infinite loop calling the same API over and over.

What went wrong? You tested it. It worked. But you did what every engineer does at first: you ran a vibe check — a handful of manual tests, eyeballed the outputs, and called it good. Vibe checks feel productive. They are not evaluation.

The problem is that agents are fundamentally different from traditional software. A function that adds two numbers will always return the same result. An agent might take a different path through its reasoning every single time you run it. It might call different tools, in different orders, with different arguments. It interacts with external environments that have state. And it operates over long horizons — a single task might involve twenty tool calls and ten reasoning steps.

The core problem: Agents are non-deterministic, environment-interacting, long-horizon systems. Traditional software testing (unit tests, integration tests) was designed for deterministic functions. We need something new.

Here's a concrete illustration. Imagine an agent that handles customer service for an airline. A user says "I need to change my flight." The agent must:

  1. Look up the user's booking
  2. Check available flights
  3. Verify the change policy allows modifications
  4. Calculate any fare difference
  5. Execute the change in the database
  6. Confirm with the user

At each step, the agent could fail differently. It could hallucinate a policy that doesn't exist. It could call the wrong API. It could change the wrong booking. And because LLM sampling involves randomness, running the exact same scenario twice might produce two completely different failure modes.

[Interactive simulation: The Agentic Loop. Steps through one cycle of the agent loop; each step depends on the previous result, so errors compound.]

This is why evaluation — systematic, repeatable measurement of agent performance — matters so much. Without it, you're deploying a system you don't understand into situations you haven't anticipated. The rest of this lesson teaches you how to build evaluations that actually work.

Vibe Checks vs Real Evaluation

Let's be precise about what's wrong with vibe checks (informal testing) and what real evaluation looks like:

| Property | Vibe Check | Real Evaluation |
| --- | --- | --- |
| Test cases | 5-10, hand-picked | 50-200+, systematically designed |
| Runs per test | 1 (maybe 2) | 5-20 (for statistical significance) |
| Scoring | "Looks good to me" | Automated graders + human review |
| Regression detection | None | CI pipeline blocks regressions |
| Coverage | Happy path only | Edge cases, errors, adversarial inputs |
| Reproducibility | Cannot reproduce | Deterministic setup, saved transcripts |
| Time to run | 5 minutes | 30 minutes to 2 hours |

The difference isn't just rigor — it's confidence. After a vibe check, you hope your agent works. After a real evaluation, you know your agent works on X% of tasks with Y% reliability. You can make data-driven decisions about whether to ship.

Here's the uncomfortable truth: most agent teams in production today are still running vibe checks. They demo the agent to their manager, it works on the demo scenario, and they ship it. Three weeks later, support tickets pile up because the agent hallucinates refund policies, enters infinite loops on edge cases, or cancels the wrong orders. The cost of fixing these failures in production (user trust lost, engineering time burned, revenue impacted) typically far exceeds the cost of building proper evaluation upfront.

This lesson gives you the tools to do better. By the end, you'll know how to build an evaluation system that catches these failures before your users do.

The three properties that make agents hard to evaluate:
1. Non-determinism: Same input, different outputs each run (LLM sampling temperature).
2. Environment interaction: Agents change the world (databases, APIs, files) — side effects matter.
3. Long horizons: A single task involves many steps — errors compound across the trajectory.

Quantifying the Problem: Error Compounding

Here's a simple but powerful way to understand why long-horizon tasks are so hard. Suppose each step in the agent's reasoning has a 95% chance of being correct. That sounds great — almost always right. But over multiple steps, errors compound:

| Steps | P(all correct) | P(at least one error) |
| --- | --- | --- |
| 1 | 95.0% | 5.0% |
| 3 | 85.7% | 14.3% |
| 5 | 77.4% | 22.6% |
| 10 | 59.9% | 40.1% |
| 20 | 35.8% | 64.2% |

At 20 steps, even with 95% per-step accuracy, you have a 64% chance of at least one error. And in an agent, one error can cascade — calling the wrong tool means getting wrong data, which means making a wrong decision, which means taking a wrong action. One bad step can doom the entire task.

This is why evaluation must test long-horizon tasks, not just short ones. A 3-step task at 95% per-step gives you 86% success. A 20-step task gives you 36%. If you only test 3-step tasks, you'll think your agent is great. Deploy it on 20-step tasks and watch it fail two-thirds of the time.
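The numbers in the table can be reproduced with a few lines. This is a minimal sketch that assumes every step succeeds independently with the same probability, which is a simplification: real agent steps are correlated, and some errors are recoverable. The function name `p_all_correct` is illustrative.

```python
def p_all_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step_accuracy ** steps


for steps in (1, 3, 5, 10, 20):
    p = p_all_correct(0.95, steps)
    print(f"{steps:>2} steps: P(all correct) = {p:.1%}, "
          f"P(at least one error) = {1 - p:.1%}")
```

Try raising the per-step accuracy to 0.99: a 20-step task then succeeds about 82% of the time, which shows how sensitive long tasks are to small per-step gains.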

Check: Why don't traditional unit tests work for evaluating agents?

Chapter 1: What IS an Agent?

Before we can evaluate agents, we need to agree on what one is. An agent is an LLM running in a loop, equipped with tools and instructions, that can take actions in an environment. Three components define it:

Model: The LLM that reasons and decides (GPT-4, Claude, Llama)
Tools: Functions the agent can call (search, database, calculator, APIs)
Instructions: System prompt defining behavior, policies, and constraints
Agent Harness: The loop that ties model + tools + instructions together

The agent harness is the software system that enables a model to act as an agent. It manages the loop: send the context to the model, parse its output for tool calls, execute those tools, feed results back, repeat. When we evaluate an agent, we're evaluating the harness AND the model working together — not the model in isolation.

How Tool Calling Works

Modern LLMs are trained to emit special structured output when they want to call a tool. The pattern looks like this:

json
// 1. You give the model a list of available tools:
{
  "tools": [
    {
      "name": "lookup_order",
      "description": "Look up order details by order ID",
      "parameters": {
        "order_id": { "type": "string" }
      }
    }
  ]
}

// 2. The model responds with a tool call:
{
  "tool_call": {
    "name": "lookup_order",
    "arguments": { "order_id": "ORD-12345" }
  }
}

// 3. Your harness executes the function, returns result:
{
  "tool_result": {
    "status": "shipped",
    "item": "Blue Widget",
    "tracking": "1Z999AA10123456784"
  }
}

// 4. Model sees result, decides next action or responds to user

This parse-execute-continue cycle is the heartbeat of every agent. The harness parses the model's output for tool calls, executes them in the real environment, and feeds results back into the context. The cycle continues until the model produces a final response (no tool call) or hits a maximum iteration limit.
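The "execute" half of that cycle is usually a lookup from tool name to a function. Here is a minimal sketch of one way to do it, assuming tool calls arrive as the `name` plus `arguments` shape shown above. The `TOOLS` registry, the `execute_tool` helper, and the in-memory order data are all hypothetical stand-ins, not any particular framework's API.

```python
import json


def lookup_order(order_id: str) -> dict:
    """Stand-in tool: a real one would query an order database."""
    orders = {"ORD-12345": {"status": "shipped", "item": "Blue Widget"}}
    return orders.get(order_id, {"error": f"unknown order {order_id}"})


# Registry mapping tool names (as the model sees them) to Python callables.
TOOLS = {"lookup_order": lookup_order}


def execute_tool(name: str, arguments: dict) -> dict:
    """Run one tool call; return errors as data so the model can react to them."""
    if name not in TOOLS:
        return {"error": f"unknown tool {name}"}
    try:
        return TOOLS[name](**arguments)
    except TypeError as exc:  # wrong or missing argument names
        return {"error": str(exc)}


tool_call = json.loads(
    '{"name": "lookup_order", "arguments": {"order_id": "ORD-12345"}}'
)
result = execute_tool(tool_call["name"], tool_call["arguments"])
print(json.dumps(result))
```

Returning errors as ordinary results, rather than raising, is a design choice: it lets the model see that a call failed and try something else, and it leaves the failure visible in the transcript for later grading.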

The Autonomy Spectrum

Not everything that uses an LLM is an agent. There's a spectrum from simple to fully autonomous:

| Level | What It Does | Example |
| --- | --- | --- |
| Single-turn LLM | One prompt in, one response out | ChatGPT answering a question |
| Chain / Pipeline | Fixed sequence of LLM calls | Summarize → Translate → Format |
| Router | LLM picks which path to take | Classify intent, route to handler |
| Tool-using Agent | LLM calls tools in a loop | Research agent with search + browse |
| Multi-agent System | Multiple agents coordinating | Manager delegates to specialists |

As you move down this spectrum, evaluation gets harder. A single-turn LLM can be evaluated with a simple input-output dataset. A multi-agent system requires evaluating coordination, delegation, and emergent behaviors across multiple interacting components.

The Harness Loop in Detail

Let's look at a concrete harness implementation. This is a minimal version of the control flow that most agent harnesses share:

python
import json


class MaxIterationsError(RuntimeError):
    """Raised when the agent never produces a final response."""


def run_agent(user_message, system_prompt, tools, max_iters=25):
    """The core agent loop. Every agent harness is a variant of this.

    `llm` (a model client) and `execute_tool` (a tool dispatcher) are
    assumed to be supplied by the surrounding harness.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    for _ in range(max_iters):
        # Step 1: Send context to model
        response = llm.generate(
            messages=messages,
            tools=tools,
            temperature=0.0  # lower = more deterministic
        )

        # Step 2: Check if model wants to call a tool
        if response.has_tool_call:
            # Step 3: Parse and execute the tool call
            tool_name = response.tool_call.name
            tool_args = response.tool_call.arguments
            result = execute_tool(tool_name, tool_args)

            # Step 4: Feed result back into context
            messages.append(response.to_message())
            messages.append({
                "role": "tool",
                "content": json.dumps(result)
            })
            continue  # back to Step 1

        # No tool call = final response to user
        messages.append(response.to_message())
        return messages  # the full transcript

    # Safety: hit max iterations without finishing
    raise MaxIterationsError(f"Agent did not finish within {max_iters} iterations")

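Notice the key design decisions embedded in this code: the loop is capped at max_iters so a stuck agent fails loudly instead of running forever, temperature is set to 0.0 to reduce run-to-run variation, and the function returns the full transcript rather than only the final answer.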

What Makes a Good Tool Definition?

The quality of tool definitions dramatically affects agent performance. Compare these two definitions for the same tool:

bad tool def
{
  "name": "search",
  "description": "Search",
  "parameters": {
    "q": {"type": "string"}
  }
}

good tool def
{
  "name": "search_orders",
  "description": "Search orders by customer email, order ID, or date range. Returns up to 10 matching orders.",
  "parameters": {
    "query": {
      "type": "string",
      "description": "Email, order ID, or YYYY-MM-DD"
    }
  }
}

The good definition tells the model exactly what the tool does, what inputs it accepts, and what format to use. When evaluating agents, changing tool definitions can shift pass rates by 10-20%. This is part of why we evaluate the harness as a whole, not just the model.

Key insight: When we evaluate an agent, we're evaluating the harness and the model working together. Swapping the model changes performance. Changing the system prompt changes performance. Modifying the tool definitions changes performance. The evaluation must capture this entire system, not just the LLM.
[Interactive simulation: Interactive Agent Loop. The context window fills with tool results as the agent works through a task.]

Check: What three components define an agent?

Chapter 2: Multi-Agent Systems

Sometimes a single agent isn't enough. When your system prompt grows to thousands of lines, or you have dozens of overlapping tools, or different tasks require fundamentally different reasoning strategies — you need multiple agents working together. But multi-agent systems are harder to build, debug, and evaluate. So the first rule is: start with a single agent. Only add complexity when the single agent demonstrably fails.

When to Go Multi-Agent

Three signals tell you a single agent is struggling:

1. Bloated instructions. Your system prompt is 5,000+ words and the agent keeps forgetting rules buried on page three. Long instructions degrade performance because the model attends to everything but remembers nothing perfectly.
2. Too many overlapping tools. You have 30+ tools and the agent keeps picking the wrong one. "search_orders" vs "lookup_order" vs "find_order_by_email" — the model confuses similar tools when there are too many.
3. Divergent reasoning needs. One task requires careful, step-by-step analysis (auditing a financial report). Another requires fast, shallow lookup (checking a tracking number). A single prompt can't optimize for both.

Manager vs Decentralized

There are two main patterns for multi-agent systems:

Manager pattern: A central "manager" agent receives the user's request, decides which specialist to delegate to, and synthesizes the final response. Clean control flow, easy to debug, but the manager becomes a bottleneck.

Decentralized pattern: Agents hand off to each other in a chain or graph. No central controller. More flexible for complex workflows, but harder to reason about and debug — "who's in charge?" becomes ambiguous.

Sub-agents and Context Protection

A critical benefit of multi-agent systems is context protection. When a sub-agent handles a tool-heavy subtask, all those tool calls and results fill up the sub-agent's context window, not the main agent's. The main agent only sees the sub-agent's final summary. This keeps the main agent's context clean and focused.

Think of it like delegation at a company. The CEO doesn't sit in on every engineering standup. They delegate to a VP, who reports back a summary. The CEO's "context window" stays clear for strategic decisions.

The Manager Pattern in Code

python
# Manager agent with specialized sub-agents
# LLM, Agent, the *_tools lists, and the *_policy prompts are assumed
# to be defined elsewhere in your harness.

class ManagerAgent:
    def __init__(self):
        self.router = LLM(system_prompt="""You are a routing agent.
        Given a user request, determine which specialist to delegate to:
        - 'returns': for return/refund requests
        - 'booking': for flight/hotel changes
        - 'billing': for payment/invoice questions
        Respond with ONLY the specialist name.""")

        self.specialists = {
            "returns": Agent(tools=return_tools, prompt=return_policy),
            "booking": Agent(tools=booking_tools, prompt=booking_policy),
            "billing": Agent(tools=billing_tools, prompt=billing_policy),
        }

    def handle(self, user_message):
        # Step 1: Route to specialist. The router returns free text, so
        # normalize it and reject anything that is not a known name.
        specialist = self.router.generate(user_message).strip().strip("'\"").lower()
        if specialist not in self.specialists:
            raise ValueError(f"Router returned unknown specialist: {specialist!r}")

        # Step 2: Delegate (sub-agent gets its own clean context)
        result = self.specialists[specialist].run(user_message)

        # Step 3: Manager only sees the summary, not all tool calls
        return result.final_response

Notice the key property: the sub-agent runs in its own context. If the returns specialist makes 8 tool calls consuming 4,000 tokens, the manager never sees those tokens. It only gets the final response — maybe 200 tokens. This is the context protection in action.

Evaluating Multi-Agent Systems

When evaluating a multi-agent system, you need to test at three levels:

  1. Individual agent evaluation: Does each specialist work correctly in isolation? Test the returns agent on return tasks, the booking agent on booking tasks.
  2. Routing evaluation: Does the manager route to the correct specialist? Test with clear-cut cases AND ambiguous cases (e.g., "I want to return an item I booked" — returns or booking?).
  3. End-to-end evaluation: Does the full system work? This catches integration failures: the manager routes correctly, the specialist handles it correctly, but the response doesn't get properly relayed back.

A common mistake is skipping level 2 (routing evaluation). Teams test each specialist and test the full system, but never specifically evaluate whether the router makes correct decisions. A routing error sends the user to the wrong specialist, which wastes time and often fails spectacularly because the specialist doesn't have the right tools for the task.

Routing accuracy target: Aim for ≥95% routing accuracy on clear-cut cases and ≥80% on ambiguous cases. If routing accuracy drops below these thresholds, the multi-agent system will underperform a well-designed single agent. At that point, simplify: go back to one agent with all the tools and a better system prompt.
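Routing evaluation can be as simple as a labeled list of messages and a per-category accuracy. The sketch below assumes the router is a callable that maps a message to a specialist name. `keyword_router` is a toy stand-in for the LLM router, and the test cases and their expected labels are hypothetical.

```python
from collections import defaultdict


def routing_accuracy(router, cases):
    """Accuracy per case kind; each case is (message, expected, kind)."""
    correct, total = defaultdict(int), defaultdict(int)
    for message, expected, kind in cases:
        total[kind] += 1
        if router(message) == expected:
            correct[kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


def keyword_router(message: str) -> str:
    """Toy stand-in for the LLM router, only here to make the sketch runnable."""
    text = message.lower()
    if "refund" in text or "return" in text:
        return "returns"
    if "flight" in text or "hotel" in text:
        return "booking"
    return "billing"


cases = [
    ("I want a refund for my order", "returns", "clear"),
    ("Change my flight to Tuesday", "booking", "clear"),
    ("Why was I charged twice?", "billing", "clear"),
    ("I want to return an item I booked", "returns", "ambiguous"),
    ("Refund the hotel I cancelled", "booking", "ambiguous"),
]
THRESHOLDS = {"clear": 0.95, "ambiguous": 0.80}

scores = routing_accuracy(keyword_router, cases)
for kind, score in scores.items():
    status = "ok" if score >= THRESHOLDS[kind] else "below target"
    print(f"{kind}: {score:.0%} ({status})")
```

Five cases are far too few for a real measurement. The point is the shape: keep clear-cut and ambiguous cases in separate buckets so a strong score on easy messages cannot hide weak routing on hard ones.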
[Interactive simulation: Manager / Worker Architecture. The manager routes each task to the right specialist agent while the main context stays clean.]

Key insight: Single agents are easier to evaluate and maintain. Multi-agent systems multiply the evaluation surface — you need to evaluate each agent individually AND their interactions. Start simple. Add agents only when a single agent demonstrably fails at the task.
Check: What is the main benefit of using sub-agents?

Chapter 3: Context Engineering

An agent's context window is its working memory. Everything the agent knows — the system prompt, conversation history, tool results, retrieved documents — lives in this finite window. And here's the problem: context rots.

Context rot is what happens when an agent runs for many steps. Each tool call adds its result to the context. Each reasoning step adds tokens. After twenty iterations, the context is stuffed with old tool results, irrelevant intermediate reasoning, and stale information. The model's attention is spread across thousands of tokens of noise, and the signal — the user's original request, the current state of the task — gets buried.

Analogy: Context rot is like a desk that never gets cleaned. Every document you've ever looked at is still spread across it. When you need to find the one thing that matters right now, you're digging through piles of irrelevant paper. Eventually you start making mistakes because you grab the wrong sheet.

Static vs Dynamic Context

There are two strategies for getting information into the context window:

| Strategy | How It Works | Tradeoff |
| --- | --- | --- |
| Static (RAG) | Pre-load relevant documents into the context before the agent starts | Wastes tokens if docs aren't needed; stale if task evolves |
| Dynamic (Tool-based) | Agent discovers context by calling tools as needed | Uses tokens only when needed; agent decides what's relevant |

The modern best practice is progressive disclosure: give the agent tools to discover context rather than pre-loading everything. Let the agent search a knowledge base, look up a policy document, or query a database when it needs information. This way, only relevant information enters the context, and the agent stays in control of what it knows.

Compaction Strategies

Even with progressive disclosure, context eventually fills up. Three strategies fight context rot:

Summarization: Periodically compress the conversation so far into a shorter summary. Replace the full history with the summary + recent messages.
Tool Result Clearing: After the agent has used a tool result, truncate or remove it. The agent already extracted what it needed, so keeping the raw result wastes tokens.
Note-taking: Give the agent a "scratchpad" tool to save key findings. Then compact old context, knowing important facts are preserved in notes.

Each strategy involves a tradeoff: compaction loses detail but reclaims space. The art is knowing when to compact and how much to keep. Compact too aggressively and the agent forgets important context. Compact too little and you hit the token limit.
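Tool result clearing is the easiest of the three to sketch. The version below assumes messages are dicts with `role` and `content` keys, as in the harness loop earlier. The function name, the `keep_last` parameter, and the placeholder text are illustrative choices.

```python
def clear_old_tool_results(messages, keep_last=2,
                           placeholder="[tool result cleared]"):
    """Return a copy of `messages` with all but the newest tool results replaced."""
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    if keep_last > 0:
        to_clear = set(tool_indexes[:-keep_last])
    else:
        to_clear = set(tool_indexes)
    return [
        {**m, "content": placeholder} if i in to_clear else m
        for i, m in enumerate(messages)
    ]


messages = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "Cancel ORD-100 and check ORD-200."},
    {"role": "tool", "content": "ORD-100: processing, 2 items, placed Monday"},
    {"role": "assistant", "content": "ORD-100 can be cancelled."},
    {"role": "tool", "content": "ORD-100 cancelled"},
    {"role": "tool", "content": "ORD-200: in transit"},
]
compacted = clear_old_tool_results(messages, keep_last=2)
for message in compacted:
    print(message["role"], "|", message["content"])
```

The function returns a new list instead of editing in place, so the full transcript can still be saved for grading while the model sees only the compacted view.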

A Concrete Context Budget

Let's do the math for a typical agent task. Say you're using a model with a 128K token context window:

| Component | Tokens | Percentage |
| --- | --- | --- |
| System prompt + policies | 3,000 | 2.3% |
| Tool definitions (15 tools) | 2,500 | 2.0% |
| User message | 200 | 0.2% |
| Average tool result | 500 | 0.4% each |
| Model reasoning per step | 300 | 0.2% each |
| After 20 tool calls | ~22,000 | 17% |
| After 50 tool calls | ~46,000 | 36% |

At 50 tool calls, you've consumed 36% of the context window. That sounds manageable, but here's the catch: the attention quality of a Transformer degrades well before you hit the limit. Research shows that models struggle to attend to information in the middle of long contexts (the "lost in the middle" problem). By the time you're at 50% utilization, the model may already be failing to recall its original instructions.
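You can turn the budget table into a small calculator and plug in your own numbers. This sketch uses the figures from the table and treats 128K as 128,000 tokens. The constant and function names are illustrative, and real token counts vary per call.

```python
CONTEXT_WINDOW = 128_000       # tokens
FIXED = 3_000 + 2_500 + 200    # system prompt + tool definitions + user message
PER_CALL = 500 + 300           # average tool result + reasoning per step


def context_used(tool_calls: int) -> int:
    """Estimated tokens in context after a number of tool calls."""
    return FIXED + PER_CALL * tool_calls


def calls_until(fraction: float) -> int:
    """How many tool calls fit before usage exceeds `fraction` of the window."""
    return (int(fraction * CONTEXT_WINDOW) - FIXED) // PER_CALL


for n in (20, 50):
    used = context_used(n)
    print(f"{n} tool calls: {used:,} tokens ({used / CONTEXT_WINDOW:.0%} of window)")
print("Tool calls before 50% utilization:", calls_until(0.5))
```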

Progressive Disclosure in Practice

Here's a concrete example of progressive disclosure vs static context. Imagine an agent that helps with HR policies:

python
# BAD: Static context — dump everything into system prompt
system_prompt = """You are an HR assistant.
Here are ALL company policies:

[3,000 tokens of vacation policy]
[2,000 tokens of leave policy]
[4,000 tokens of benefits policy]
[2,500 tokens of expense policy]
...
"""
# Result: 15,000+ tokens wasted if user only asks about vacation

# GOOD: Progressive disclosure — agent discovers what it needs
system_prompt = """You are an HR assistant.
Use the search_policy tool to look up relevant policies before
answering any question."""

tools = [{
    "name": "search_policy",
    "description": "Search company HR policies by topic",
    "parameters": {"query": {"type": "string"}}
}]
# Result: only loads the 800 tokens about vacation when needed

The static approach consumes 15,000 tokens upfront for policies the user may never ask about. The progressive approach uses 800 tokens only when needed. Over a 20-step conversation, that 14,200-token savings compounds — leaving more room for actual tool results and reasoning.

[Interactive simulation: Context Window Filling Up. The context window fills as the agent makes tool calls, compaction via summarization reclaims space, and an orange line shows attention quality degrading as context fills.]

A Context Rot Failure: Worked Example

Here's a real failure pattern caused by context rot. An agent is handling a complex customer request that requires multiple tool calls:

transcript (simplified)
Turn 1: User: "I have two orders. Cancel ORD-100 and check status of ORD-200."
Turn 2: Agent calls lookup_order("ORD-100") # 500 tokens of result
Turn 3: Agent calls check_cancel_policy("ORD-100") # 400 tokens of result
Turn 4: Agent calls cancel_order("ORD-100") # 300 tokens of result
Turn 5: Agent calls lookup_order("ORD-200") # 500 tokens of result
Turn 6: Agent calls get_tracking("ORD-200") # 600 tokens of result

# By Turn 6, context has ~2,300 tokens of tool results.
# Agent's response:
Turn 7: "I've cancelled ORD-200 and your order ORD-100 is in transit."
# ^^^^^^^^ WRONG! Mixed up the two order IDs!

The agent correctly executed all the tool calls but then confused the two orders in its final response. It cancelled ORD-100 (correct) but then said it cancelled ORD-200 (wrong). This is classic context rot: the model's attention, spread across roughly 2,300 tokens of tool results (500 + 400 + 300 + 500 + 600), failed to correctly attribute which results belonged to which order.

This failure would be caught by a code grader that checks the response text for correct order IDs. But it would NOT be caught by a pure outcome grader that only checks database state (the cancellation was correct). You need both.
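Here is one way the two graders for this example might look side by side. Treat it as a rough sketch: the database shape is assumed, and the text check is a crude clause-splitting heuristic that would need hardening (or a model-based grader) for free-form responses.

```python
import re


def grade_outcome(db: dict) -> bool:
    """Outcome check: ORD-100 is cancelled and ORD-200 is not."""
    return (db["ORD-100"]["status"] == "cancelled"
            and db["ORD-200"]["status"] != "cancelled")


def grade_response(response: str) -> bool:
    """Text check: the response must attribute the cancellation to ORD-100 only."""
    # Split into clauses so each order ID is judged against the words next to it.
    clauses = re.split(r"[.;]|\band\b", response.lower())
    cancelled_ids = {
        order_id
        for clause in clauses if "cancel" in clause
        for order_id in re.findall(r"ord-\d+", clause)
    }
    return cancelled_ids == {"ord-100"}


db_after = {"ORD-100": {"status": "cancelled"}, "ORD-200": {"status": "in_transit"}}
bad_response = "I've cancelled ORD-200 and your order ORD-100 is in transit."

print("outcome grader passed: ", grade_outcome(db_after))
print("response grader passed:", grade_response(bad_response))
```

On the failing transcript the outcome grader passes and the response grader fails, which is exactly the disagreement you want surfaced.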

Why context engineering matters for evaluation: An agent that works perfectly on short tasks might fail completely on long tasks because of context rot. Your evaluation must include long-horizon tasks that stress-test context management. If you only test 3-step tasks, you'll never catch the failure that happens on step 15.
Check: What is "context rot"?

Chapter 4: The Evaluation Framework

Now we get to the core. Every agent evaluation system, no matter how sophisticated, boils down to five components. Learn these five, and you can evaluate anything.

The five components of evaluation: Task, Trial, Transcript, Outcome, Grader. Every eval system is built from these primitives. No exceptions.

1. Task

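A task is a predefined input paired with success criteria. "Given this user message and this database state, the agent should do X." A task includes three things: the input (the user message), the initial environment state (such as the database contents), and the success criteria.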

Good tasks are specific, solvable, and have clear success criteria. "Handle a customer complaint" is a bad task. "User wants to return order ORD-789. Policy allows returns within 30 days. Order was placed 15 days ago. Agent should process the return and confirm to the user" is a good task.
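In code, a task can be a small record. The sketch below is one possible shape, not a standard schema: the field names, the `return_processed` status value, and the tags are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One evaluation task: input, starting state, and a success check."""
    task_id: str
    user_message: str
    initial_state: dict                 # environment state before the trial
    success: Callable[[dict], bool]     # inspects the final environment state
    tags: tuple = ()


return_task = Task(
    task_id="return-order-789",
    user_message="I want to return order ORD-789",
    initial_state={"ORD-789": {"status": "delivered", "days_since_order": 15}},
    success=lambda state: state["ORD-789"]["status"] == "return_processed",
    tags=("returns", "happy-path"),
)

print(return_task.task_id, "starts passing?",
      return_task.success(return_task.initial_state))
```

Checking that the success function fails on the initial state is a cheap sanity test: if a task passes before the agent does anything, the task is broken.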

2. Trial

A trial is one attempt at completing a task. Because agents are non-deterministic, running the same task twice produces different trials. This is why we need multiple trials per task — a single trial tells you almost nothing about reliability.

3. Transcript (Trajectory)

The transcript is the complete record of everything the agent did during a trial: every message, every tool call, every tool result, every reasoning step. This is your forensic evidence when things go wrong. A good evaluation system always saves the full transcript.

transcript
# Example transcript structure:
[
  {"role": "user",     "content": "I want to return order ORD-789"},
  {"role": "assistant", "tool_call": "lookup_order(ORD-789)"},
  {"role": "tool",      "result": "{status: delivered, days_ago: 15}"},
  {"role": "assistant", "tool_call": "check_return_policy(ORD-789)"},
  {"role": "tool",      "result": "{eligible: true, window: 30 days}"},
  {"role": "assistant", "tool_call": "process_return(ORD-789)"},
  {"role": "tool",      "result": "{return_id: RET-456, refund: $49.99}"},
  {"role": "assistant", "content": "Your return has been processed..."}
]

4. Outcome

The outcome is the final environment state after the agent finishes. Did the database change correctly? Was the right API called? Does the final response contain the right information? Outcomes can be checked by inspecting the environment, the agent's final output, or both.

5. Grader

The grader takes a transcript and/or outcome and produces a score. "Did this trial succeed?" Graders can be human reviewers, automated code checks, or LLMs acting as judges. We'll dive deep into grader types in the next chapter.

[Interactive simulation: The Evaluation Pipeline. A task flows through the pipeline: task → agent runs (producing a transcript) → grader scores → result stored.]

The flow: You define tasks. For each task, you run N trials. Each trial produces a transcript. Each transcript is scored by graders. You aggregate scores to get metrics. These metrics tell you whether your agent is improving or regressing.
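That flow fits in a dozen lines. The sketch below assumes tasks are plain dicts, the agent is a callable that mutates an environment dict and returns a transcript, and the grader returns True or False. `flaky_agent` is a random stand-in for a real agent, so its pass rate here is meaningless beyond showing the mechanics.

```python
import copy
import random


def run_eval(tasks, agent, grader, n_trials=10):
    """Run every task n_trials times and return the pass rate per task."""
    results = {}
    for task in tasks:
        passes = 0
        for _ in range(n_trials):
            env = copy.deepcopy(task["initial_state"])      # fresh state per trial
            transcript = agent(task["user_message"], env)   # agent may mutate env
            passes += bool(grader(task, transcript, env))
        results[task["id"]] = passes / n_trials
    return results


rng = random.Random(0)


def flaky_agent(user_message, env):
    """Stand-in agent: cancels the order most of the time, otherwise gives up."""
    transcript = [{"role": "user", "content": user_message}]
    if rng.random() < 0.8:
        env["ORD-123"]["status"] = "cancelled"
        transcript.append({"role": "assistant", "content": "Cancelled ORD-123."})
    else:
        transcript.append({"role": "assistant", "content": "I could not find that order."})
    return transcript


def grade_cancel(task, transcript, env):
    """Outcome grader: did the order end in the expected status?"""
    return env["ORD-123"]["status"] == task["expected_status"]


tasks = [{
    "id": "cancel-order-01",
    "user_message": "Cancel order ORD-123",
    "initial_state": {"ORD-123": {"status": "processing"}},
    "expected_status": "cancelled",
}]
print(run_eval(tasks, flaky_agent, grade_cancel, n_trials=10))
```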

Environment Isolation

A subtle but critical requirement: each trial must run in a clean, isolated environment. If Trial 1 cancels an order in the database, Trial 2 must start with the order still active. Otherwise, Trial 2 might fail not because the agent is broken, but because the order was already cancelled by Trial 1.

Three isolation strategies:

| Strategy | How It Works | Cost |
| --- | --- | --- |
| Database snapshot/restore | Save DB state before each trial, restore after | Low (fast for small DBs) |
| Docker containers | Fresh container per trial with pre-loaded data | Medium (container startup time) |
| Mock environments | In-memory mock of all tools and state | Low (fastest, but may miss integration bugs) |

Docker containers are the most robust of the three: each trial gets a fresh filesystem and process space, although any external service the agent calls still needs its own reset. Mock environments are fastest but risk missing bugs that only appear in real tool interactions. A common setup is a hybrid: mocked tools for fast iteration, Docker containers for full regression runs.
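For the mock end of that range, snapshot/restore can be a deep copy. The class below is a hypothetical in-memory stand-in for an order database, meant only to show the reset pattern between trials.

```python
import copy


class MockOrderEnv:
    """In-memory stand-in for an order database with snapshot/restore."""

    def __init__(self, orders: dict):
        self.orders = copy.deepcopy(orders)
        self._snapshot = copy.deepcopy(orders)

    def cancel_order(self, order_id: str) -> None:
        self.orders[order_id]["status"] = "cancelled"

    def restore(self) -> None:
        """Reset to the state captured at construction time."""
        self.orders = copy.deepcopy(self._snapshot)


env = MockOrderEnv({"ORD-123": {"status": "processing"}})

env.cancel_order("ORD-123")              # Trial 1 mutates the environment
after_trial_1 = env.orders["ORD-123"]["status"]

env.restore()                            # Trial 2 starts clean
before_trial_2 = env.orders["ORD-123"]["status"]

print(after_trial_1, "->", before_trial_2)
```

Deep copies matter here: a shallow copy would share the nested order dicts, and Trial 1's cancellation would leak into the snapshot.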

Saving and Analyzing Transcripts

Every trial should save the complete transcript in a structured format. This is your debugging lifeline. When pass rates drop, you read the transcripts to understand why. A good transcript record includes:

json
{
  "task_id": "cancel-order-01",
  "trial_id": "trial-007",
  "timestamp": "2026-05-18T14:30:00Z",
  "model": "claude-opus-4-20250514",
  "temperature": 0.0,
  "messages": [/* full message sequence */],
  "tool_calls": [
    {"name": "lookup_order", "args": {"id": "ORD-123"}, "latency_ms": 45},
    {"name": "cancel_order", "args": {"id": "ORD-123"}, "latency_ms": 120}
  ],
  "total_tokens": 3847,
  "total_latency_ms": 4521,
  "db_state_before": {/* snapshot */},
  "db_state_after": {/* snapshot */},
  "grading": {
    "code_grader": {"passed": true, "checks": ["db_state", "tool_calls"]},
    "model_grader": {"score": 4, "reasoning": "..."}
  },
  "result": "PASS"
}

With this structure, you can query your evaluation database: "Show me all failing trials for task 'cancel-order-01' where the agent called the wrong tool." This turns debugging from guesswork into data analysis.
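If your records are loaded as dicts in the shape shown above, that query is a short filter. This is a sketch over an in-memory list; the sample records and the `cancel_subscription` tool name are made up for illustration, and a real system would more likely query a database.

```python
def failing_trials(records, task_id, wrong_tool=None):
    """Trial IDs of failing runs for one task, optionally only those that called `wrong_tool`."""
    selected = []
    for record in records:
        if record["task_id"] != task_id or record["result"] != "FAIL":
            continue
        called = {call["name"] for call in record["tool_calls"]}
        if wrong_tool is None or wrong_tool in called:
            selected.append(record["trial_id"])
    return selected


records = [
    {"task_id": "cancel-order-01", "trial_id": "trial-001", "result": "PASS",
     "tool_calls": [{"name": "lookup_order"}, {"name": "cancel_order"}]},
    {"task_id": "cancel-order-01", "trial_id": "trial-002", "result": "FAIL",
     "tool_calls": [{"name": "lookup_order"}]},
    {"task_id": "cancel-order-01", "trial_id": "trial-003", "result": "FAIL",
     "tool_calls": [{"name": "lookup_order"}, {"name": "cancel_subscription"}]},
]

print(failing_trials(records, "cancel-order-01"))
print(failing_trials(records, "cancel-order-01", wrong_tool="cancel_subscription"))
```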

A Concrete Example

Let's put it all together with a worked example:

| Component | Concrete Value |
| --- | --- |
| Task | "Cancel order ORD-123. Policy: cancellation allowed if status = 'processing'." |
| Trial 1 | Agent calls lookup_order → cancel_order → responds "Done." (3 steps) |
| Trial 2 | Agent calls lookup_order → check_policy → cancel_order → responds "Cancelled." (4 steps) |
| Transcript | Full sequence of messages and tool calls for each trial |
| Outcome | Database: ORD-123 status changed from "processing" to "cancelled" |
| Grader | Code check: assert db["ORD-123"]["status"] == "cancelled" |

Notice that Trial 1 and Trial 2 took different paths (3 steps vs 4 steps) but both arrived at the correct outcome. This is normal for agents — the path varies, but we care about the destination.

Trajectory-Based vs Outcome-Based Evaluation

This brings us to a fundamental design choice: do you grade the trajectory (how the agent got there) or the outcome (where it ended up)?

Trajectory-based: Check that the agent called the right tools in the right order. "Did it call lookup_order before cancel_order?"

Pro: Catches unsafe intermediate steps.

Con: Overly prescriptive — rejects valid alternative paths.

Outcome-based: Only check the final result. "Is the order cancelled? Is the response correct?"

Pro: Accepts any valid solution strategy.

Con: Might miss dangerous intermediate actions (e.g., agent deleted other data before fixing it).

In practice, the best approach combines both: outcome-based grading for "did it work?" plus lightweight trajectory checks for safety constraints: "did the agent avoid calling the delete_all_data() tool?" "did it confirm with the user before making changes?"
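A combined grader might look like the sketch below. It assumes tool calls are stored as `name(args)` strings, as in the transcript example earlier, and it hard-codes the ORD-123 task for brevity. The specific safety rules (no `delete_all_data`, lookup before cancel) are examples, not a complete policy.

```python
def tool_names(transcript):
    """Extract called tool names from 'name(args)' strings in the transcript."""
    return [m["tool_call"].split("(", 1)[0] for m in transcript if "tool_call" in m]


def grade_trial(transcript, db, forbidden=("delete_all_data",)):
    """Outcome check plus two lightweight trajectory constraints."""
    calls = tool_names(transcript)
    outcome_ok = db["ORD-123"]["status"] == "cancelled"
    no_forbidden = not any(name in forbidden for name in calls)
    # Ordering constraint: a lookup must precede the first cancellation.
    looked_up_first = (
        "cancel_order" not in calls
        or "lookup_order" in calls[: calls.index("cancel_order")]
    )
    safety_ok = no_forbidden and looked_up_first
    return {"outcome": outcome_ok, "safety": safety_ok,
            "passed": outcome_ok and safety_ok}


db = {"ORD-123": {"status": "cancelled"}}
trial_1 = [
    {"role": "assistant", "tool_call": "lookup_order(ORD-123)"},
    {"role": "assistant", "tool_call": "cancel_order(ORD-123)"},
    {"role": "assistant", "content": "Done."},
]
trial_2 = [
    {"role": "assistant", "tool_call": "lookup_order(ORD-123)"},
    {"role": "assistant", "tool_call": "check_policy(ORD-123)"},
    {"role": "assistant", "tool_call": "cancel_order(ORD-123)"},
    {"role": "assistant", "content": "Cancelled."},
]
print(grade_trial(trial_1, db)["passed"], grade_trial(trial_2, db)["passed"])
```

Both paths pass because the trajectory checks only rule out unsafe behavior. They do not demand one exact sequence of calls.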

Statistical Significance

How many trials do you need per task? This depends on how precise you need your estimates to be. Here's a rule of thumb:

| Trials per Task | Precision | Use Case |
| --- | --- | --- |
| 5 | ±20% | Quick directional check: is pass rate above 50%? |
| 10 | ±15% | Standard evaluation: reliable pass@1 and pass^3 |
| 20 | ±10% | Precise comparison: is Agent A better than Agent B? |
| 50+ | ±5% | High-stakes deployment decision |

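These are approximate: they are close to one standard error for a single task whose true pass rate is near 50%, and a 95% confidence interval is roughly twice as wide. The actual interval also depends on the true pass rate. As a planning heuristic, 10 trials per task is the sweet spot for most iterative development. Use 20+ for final deployment decisions.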
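To size your own trial counts, you can compute the uncertainty directly. This sketch uses the normal approximation to the binomial, which is rough for small trial counts (a Wilson score interval behaves better there). The function name `half_width` is illustrative.

```python
import math


def half_width(pass_rate: float, trials: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval around a pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / trials)


for trials in (5, 10, 20, 50):
    # A true pass rate of 50% is the worst case for interval width.
    one_se = half_width(0.5, trials, z=1.0)
    ci_95 = half_width(0.5, trials)
    print(f"{trials:>2} trials: ±{one_se:.0%} (1 standard error), "
          f"±{ci_95:.0%} (95% interval)")
```

These figures are per task. Pooling results across many tasks tightens the estimate of the overall pass rate, because the total number of runs is what drives the width.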

Cost calculation: If you have 50 tasks at 10 trials each, that's 500 agent runs. Each run might involve 5-10 LLM calls plus tool executions. At $0.01-0.10 per LLM call, your eval costs $25-500 per full run. This is cheap compared to the cost of shipping a broken agent to real users.
Check: Why do we need multiple trials per task?

Chapter 5: Types of Graders

You have a transcript. Now you need to decide: did the agent succeed? This is the job of the grader. There are three families of graders, each with distinct strengths and weaknesses. The best evaluation systems use all three.

1. Human Graders

Human graders are the gold standard. A subject-matter expert reads the transcript, evaluates the outcome, and assigns a score. Nothing beats a human who deeply understands the domain.

But human grading has problems. It's slow (minutes per transcript), expensive (expert time), and doesn't scale. You can't have a human review 10,000 trials overnight. And humans disagree with each other — inter-annotator agreement (how often two humans give the same score) is often surprisingly low, especially for subjective quality judgments.

Variants of human grading:

2. Code-Based Graders

Code-based graders are deterministic functions that check specific properties of the outcome or transcript. They're fast, cheap, and perfectly reproducible. But they're brittle — they can only check things you can express as code.

python
# Example code-based graders.
# Transcripts follow the structure shown earlier: tool calls are stored
# as "name(args)" strings under the "tool_call" key.

def _is_call_to(msg, tool_name):
    """True if this message calls exactly `tool_name` (not a longer name sharing the prefix)."""
    return msg.get("tool_call", "").startswith(tool_name + "(")

def check_string_match(transcript, expected):
    """Did the final response contain the expected string?"""
    final_msg = transcript[-1].get("content", "")
    return expected.lower() in final_msg.lower()

def check_tool_called(transcript, tool_name):
    """Was the required tool called at least once?"""
    return any(_is_call_to(msg, tool_name) for msg in transcript)

def check_database_state(db, order_id, expected_status):
    """Did the database end up in the correct state?"""
    return db[order_id]["status"] == expected_status

def check_no_forbidden_tool(transcript, forbidden):
    """Did the agent avoid calling a forbidden tool?"""
    return not any(_is_call_to(msg, forbidden) for msg in transcript)

Code-based graders are perfect for objective criteria: "Was the database updated?" "Did the response include the order ID?" "Was the forbidden tool avoided?" They fail for subjective criteria: "Was the tone professional?" "Did the explanation make sense?"

3. Model-Based Graders (LLM-as-Judge)

Model-based graders use another LLM to evaluate the agent's output. You provide a rubric and the transcript, and the judge LLM scores it. This combines the flexibility of human judgment with the scalability of automation.

python
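# LLM-as-Judge grader (simplified)

import json

def llm_judge(transcript, rubric):
    # `llm` stands in for whatever model client you use; it is not defined here.
    prompt = f"""You are an expert evaluator.

Given this agent transcript:
{transcript}

Score the agent on the following rubric:
{rubric}

Respond with a JSON object:
{{"score": 1-5, "reasoning": "..."}}"""

    response = llm.generate(prompt)
    # Assumes the judge returns bare JSON; real responses may need cleanup first.
    return json.loads(response)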

Three common patterns for model-based grading:

Pattern | How It Works | Best For
Rubric scoring | Judge scores on a 1-5 scale using criteria | Absolute quality assessment
Pairwise comparison | Judge picks which of two outputs is better | Comparing agent versions (A/B)
Reference-guided | Judge compares output to a gold-standard reference | Tasks with known correct answers

The catch: model-based graders are themselves non-deterministic. Run the same judgment twice and you might get different scores. They also have biases — they tend to prefer longer outputs, outputs that look like their own writing style, and outputs listed first in pairwise comparisons. You need to calibrate model-based graders against human judgments before trusting them.
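A common way to reduce the position bias mentioned above is to run each pairwise comparison in both orders and only accept a verdict that survives the swap. The sketch below assumes a hypothetical `judge(first, second)` callable that returns the string "first" or "second"; in practice it would wrap an LLM call.

```python
from typing import Callable

def debiased_pairwise(judge: Callable[[str, str], str], out_a: str, out_b: str) -> str:
    """Compare two outputs in both presentation orders.

    Returns "A" or "B" when both orderings agree, otherwise "tie".
    """
    first_pass = judge(out_a, out_b)   # A is shown first
    second_pass = judge(out_b, out_a)  # B is shown first

    # Count how many of the two runs favored A.
    a_wins = (first_pass == "first") + (second_pass == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the verdict flipped with the order, so it is not trustworthy
```

A judge that always picks whatever is listed first produces "tie" for every pair, which makes the bias visible instead of silently rewarding one agent.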

Calibrating LLM Judges

How do you know your LLM judge is actually reliable? The standard approach is calibration against human labels:

  1. Have human experts score 50-100 transcripts on your rubric.
  2. Run the LLM judge on the same transcripts.
  3. Compute agreement: what percentage of the time do human and model agree?
  4. If agreement is <80%, revise your rubric — it's probably ambiguous.
  5. Check for systematic bias: does the model consistently score higher or lower than humans?
  6. Re-calibrate periodically as you update the agent (the distribution of outputs changes).

A well-calibrated LLM judge typically agrees with human experts 80-90% of the time on binary (pass/fail) judgments. For 5-point Likert scales, expect within-1-point agreement about 70-80% of the time. If you're below these numbers, the judge is too noisy to trust.
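The calibration steps above reduce to a few summary statistics. The following is a minimal sketch assuming human and judge scores are parallel lists of integers on the same scale; the function and key names are illustrative.

```python
def judge_agreement(human_scores, judge_scores):
    """Summarize how closely an LLM judge tracks human labels."""
    if not human_scores or len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be non-empty and the same length")

    n = len(human_scores)
    pairs = list(zip(human_scores, judge_scores))
    return {
        # Fraction of transcripts where both gave the identical score.
        "exact": sum(h == j for h, j in pairs) / n,
        # Fraction where the scores differ by at most one point.
        "within_one": sum(abs(h - j) <= 1 for h, j in pairs) / n,
        # Positive means the judge scores higher than humans on average.
        "mean_bias": sum(j - h for h, j in pairs) / n,
    }

print(judge_agreement([5, 4, 3, 2, 1], [5, 5, 3, 4, 2]))
```

For binary pass/fail labels encoded as 0 and 1, the "exact" value is the agreement rate to compare against the 80% threshold; a clearly positive "mean_bias" suggests a lenient judge.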

Known Biases in LLM Judges

Bias | What Happens | Mitigation
Length bias | Prefers longer, more verbose outputs | Add "conciseness" to rubric; penalize unnecessary length
Position bias | In pairwise comparisons, prefers the first option | Run each comparison twice with swapped order; average results
Self-preference | Prefers outputs that sound like itself | Use a different model family as judge than as agent
Anchoring | If given a reference answer, rates similar outputs higher | Score without reference first, then use reference as secondary check
Leniency | Avoids giving low scores; clusters around 3-4 out of 5 | Use binary pass/fail when possible; clearer signal than Likert scales

Practical tip: For production agent evaluation, binary grading (pass/fail) is almost always better than rubric scoring (1-5). Binary judgments have higher inter-rater agreement (both human and model), are easier to aggregate into metrics, and produce clearer signal for decision-making ("ship or don't ship"). Save rubric scoring for deep-dive analysis of specific failure modes.

The grading triangle: Human graders are accurate but slow. Code-based graders are fast but brittle. Model-based graders are flexible but noisy. Use all three. Code graders for objective checks, model graders for subjective quality, human graders to calibrate and audit the other two.

Check: What is the main weakness of model-based (LLM-as-Judge) graders?

Chapter 6: Metrics — Pass@K and Pass^K

You've run your agent on 50 tasks, 10 trials each, and collected 500 scores. Now what? You need metrics that capture both capability (can the agent solve the task at all?) and reliability (can it solve it consistently?). Two metrics do this: Pass@K and Pass^K.

Pass@K: Can It Succeed At Least Once?

Pass@K measures the probability that at least one out of K attempts succeeds. If you give the agent K tries, what are the odds it gets it right at least once?

pass@K = E[ 1 − C(n − c, K) / C(n, K) ]

Where n is the total number of trials you ran, c is the number of successful trials, and C(a, b) is the binomial coefficient "a choose b." Let's unpack this with a worked example.

Worked example: You run n = 10 trials. c = 7 succeed. What is pass@K for K = 3?

pass@3 = 1 − C(10 − 7, 3) / C(10, 3)
= 1 − C(3, 3) / C(10, 3)
= 1 − 1 / 120
= 1 − 0.00833
= 0.992

With 3 tries and a 70% per-trial success rate, you have a 99.2% chance of succeeding at least once. Sounds great, right? But wait...

Pass^K: Does It Succeed EVERY Time?

Pass^K flips the question: do ALL K attempts succeed? This measures reliability — can you trust the agent to get it right every time, not just sometimes?

pass^K = E[ C(c, K) / C(n, K) ]
Same example continued: n = 10, c = 7, K = 3.

pass^3 = C(7, 3) / C(10, 3)
= 35 / 120
= 0.292

Only a 29.2% chance that ALL 3 attempts succeed. The same agent that looks great on pass@3 (99.2%) looks terrible on pass^3 (29.2%). This is the reliability gap.

The gap between pass@K and pass^K reveals a critical truth about your agent. High pass@K means the agent can solve the task. Low pass^K means it does so unreliably. For production systems, reliability matters more than capability. A user doesn't get 3 tries — they get one.

How Pass^K Drops With K

Even for agents with respectable per-trial accuracy, pass^K drops dramatically as K increases. Let's compute for an agent with 80% per-trial success (c = 8 out of n = 10):

K | Pass@K | Pass^K | Gap
1 | 0.800 | 0.800 | 0.000
2 | 0.978 | 0.622 | 0.356
3 | 1.000 | 0.467 | 0.533
5 | 1.000 | 0.222 | 0.778

At K = 5, pass@5 is essentially 1.0 (the agent can almost certainly solve it in 5 tries). But pass^5 is only 0.222 — it only succeeds on all 5 attempts about 22% of the time. For an 80% accurate agent! This is why pass^K is the metric that matters for production deployment.

Deriving the Formulas

Where do these formulas come from? Let's derive them from first principles.

Pass@K derivation: We have n total trials, c of which succeeded. We want to pick K trials at random. What's the probability that at least one is a success?

It's easier to compute the complement: the probability that all K picks are failures. There are (n - c) failures total. The number of ways to pick K failures from (n - c) is C(n - c, K). The total ways to pick K trials from n is C(n, K). So:

P(all K fail) = C(n − c, K) / C(n, K)
pass@K = 1 − P(all K fail) = 1 − C(n − c, K) / C(n, K)

Pass^K derivation: Same setup, but now we want the probability that ALL K picks are successes. The number of ways to pick K successes from c is C(c, K). So:

pass^K = C(c, K) / C(n, K)

Both formulas use the hypergeometric distribution — sampling without replacement from a finite population. This is important: we're not assuming each trial is independent (which would give a simpler binomial model). The hypergeometric approach accounts for the fact that we ran a fixed number of trials and observed a fixed number of successes.
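Both formulas translate directly into code with Python's `math.comb`. This is a minimal sketch; the function names mirror the `compute_pass_at_k` and `compute_pass_up_k` calls used in the harness later in this guide, and the input validation is an added assumption about how you would want bad arguments handled.

```python
from math import comb

def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")

def compute_pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n is a success."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when there are fewer than k failures to draw.
    return 1.0 - comb(n - c, k) / comb(n, k)

def compute_pass_up_k(n: int, c: int, k: int) -> float:
    """Probability that all k trials drawn from n are successes (pass^K)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)

# Worked example from the text: n = 10 trials, c = 7 successes, K = 3.
print(round(compute_pass_at_k(10, 7, 3), 3))  # 0.992
print(round(compute_pass_up_k(10, 7, 3), 3))  # 0.292
```

These are per-task values; a benchmark-level number is usually the mean of the per-task values across all tasks.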

When Approximate Counts Are Enough

If you assume trials are independent with success probability p = c/n, you get simpler approximations:

pass@K ≈ 1 − (1 − p)^K
pass^K ≈ p^K

These are good approximations when n is large relative to K. For small n (say n = 10, K = 5), use the exact hypergeometric formulas. For large n (n = 100), the approximations are close enough.

Quick mental math: For the approximate formulas, pass^K ≈ p^K. At p = 0.9 (90% per-trial): pass^3 ≈ 0.729, pass^5 ≈ 0.590, pass^10 ≈ 0.349. Even a 90% agent has only about a 35% chance of getting through 10 consecutive attempts without a single failure. That's the reliability reality check.
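To see when the approximation is safe, it helps to compare it with the exact value at the same pass rate for a small and a large n. The sketch below fixes p = 0.8 and K = 5; the helper names are illustrative.

```python
from math import comb

def exact_pass_up_k(n: int, c: int, k: int) -> float:
    """Exact (hypergeometric) pass^K from n trials with c successes."""
    return comb(c, k) / comb(n, k)

def approx_pass_up_k(p: float, k: int) -> float:
    """Independent-trials approximation of pass^K."""
    return p ** k

K = 5
for n in (10, 100):
    c = 8 * n // 10  # 80% of trials succeed
    exact = exact_pass_up_k(n, c, K)
    approx = approx_pass_up_k(c / n, K)
    print(f"n={n:>3}  exact={exact:.3f}  approx={approx:.3f}  diff={approx - exact:.3f}")
```

With n = 10 the two values differ by about 0.1, while with n = 100 the difference drops below 0.01, which matches the guidance above.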

Worked Example: Comparing Two Agents

Agent A achieves 70% per-trial success (7/10). Agent B achieves 50% per-trial success (5/10). But Agent B uses a retry mechanism that auto-retries on failure. Which is better for production?

Metric | Agent A (p=0.7) | Agent B (p=0.5, 2 retries)
pass@1 | 0.700 | 0.500
pass@3 (effective) | 0.992 | 0.917 (3 tries total)
pass^1 | 0.700 | 0.500
pass^3 | 0.292 | 0.083
Cost per task | 1x | ~1.75x (expected attempts when stopping at the first success)

Agent A is better on every metric. The retry mechanism helps Agent B's pass@3 (0.917 is decent), but its pass^3 (0.083) means it reliably succeeds on 3 consecutive attempts only 8% of the time. Retries mask capability problems — they don't fix them.
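The effect of a retry wrapper can be estimated with two short formulas. The sketch below assumes every attempt is independent with the same success probability p and that retries stop at the first success; because it uses the independent-trials model, its success figure differs slightly from the finite-sample values in the table. The function name is hypothetical.

```python
def retry_stats(p: float, max_attempts: int):
    """Return (overall_success_probability, expected_attempts) for a retry loop."""
    # The loop fails only if every attempt fails.
    success = 1 - (1 - p) ** max_attempts
    # Attempt i + 1 happens only when the first i attempts all failed.
    expected_attempts = sum((1 - p) ** i for i in range(max_attempts))
    return success, expected_attempts

success, attempts = retry_stats(p=0.5, max_attempts=3)
print(f"success={success:.3f}, expected attempts={attempts:.2f}")
```

Retries raise the chance of eventually succeeding while multiplying cost and latency, and they leave the per-attempt success rate untouched, which is why pass^K on the underlying agent stays low.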

Key insight: Most agents perform poorly on pass^K even for small K. When someone reports "our agent achieves 85% accuracy," ask: "Is that pass@1? pass@3? What's your pass^K?" The difference reveals whether the agent is capable but unreliable or truly production-ready.
Check: An agent with 90% per-trial success — is pass^5 likely to be high or low?

Chapter 7: Case Study — τ-bench

Theory is great, but what does a real agent evaluation look like? Let's study τ-bench (tau-bench), one of the most well-designed agent benchmarks. It evaluates agents on dynamic, multi-turn customer service conversations in two domains: retail and airline.

The Four Components of τ-bench

1. Databases (JSON)
Product catalogs, user profiles, order histories — the "world state" the agent reads and writes to.
2. APIs (Python tools)
Functions like get_order_details(), cancel_order(), transfer_to_human() that the agent can call.
3. Policy Documents (Markdown)
Rules the agent must follow: "Returns allowed within 30 days." "Refunds only for defective items."
4. User Simulator (LLM)
Another LLM plays the customer, following a script of what the user wants to achieve.

The magic of τ-bench is that the user is also an LLM. This means conversations are dynamic — the simulated user reacts to the agent's responses, asks follow-up questions, provides clarifications, and can even get confused or frustrated. No two conversations are identical, even for the same task.

How a τ-bench Task Works

Each task specifies a scenario. For example: "User wants to change a flight from economy to business class. The fare difference is $350. The user has $200 in credit. Policy: credit can be applied to upgrades." The user simulator drives the conversation, and the agent must navigate the tools and policies to resolve it correctly.

python
# Simplified τ-bench agentic loop
# (execute_tool, tool_result, and grade are placeholders for harness helpers)

import copy

def run_tau_bench_trial(task, agent, user_sim, db, tools, policies):
    # Work on a private copy so one trial's writes never leak into the next
    db = copy.deepcopy(db)

    # Initialize conversation with user's opening message
    user_msg = user_sim.generate_opening(task.scenario)
    conversation = [user_msg]

    for turn in range(25):  # max 25 turns (tool calls count as turns)
        # Agent sees: system prompt + policies + conversation
        agent_response = agent.generate(
            system=policies,
            tools=tools,
            messages=conversation
        )

        # If agent wants to call a tool:
        if agent_response.has_tool_call:
            result = execute_tool(agent_response.tool_call, db)
            conversation.append(agent_response)
            conversation.append(tool_result(result))
            continue  # agent gets to see result and decide next

        # If agent responds to user:
        conversation.append(agent_response)

        # Check if conversation is done
        if user_sim.is_satisfied(conversation, task):
            break

        # User simulator responds
        user_reply = user_sim.generate_reply(conversation, task)
        conversation.append(user_reply)

    # Grade: check final database + output against ground truth
    return grade(db, task.expected_db_state, conversation)

Outcome Verification

τ-bench grades outcomes in two ways: checking that the database was modified correctly (e.g., flight was actually changed, credit was applied) AND checking that the agent's text responses contain the right information (e.g., confirming the new flight details to the user). Both must pass for a trial to succeed.
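The two-part check can be sketched as a single function. This is an illustration of the idea, not τ-bench's actual grader: it assumes the database is a plain dict, that messages are dicts with "role" and "content" keys, and that each task lists the strings the agent must communicate (`required_outputs` is a hypothetical parameter).

```python
def grade(db, expected_db_state, conversation, required_outputs=()):
    """Return True only if both the database check and the output check pass."""
    # Check 1: the final database must match the annotated goal state exactly.
    db_ok = db == expected_db_state

    # Check 2: every required piece of information must appear somewhere
    # in the agent's messages (case-insensitive substring match).
    agent_text = " ".join(
        msg["content"]
        for msg in conversation
        if msg.get("role") == "assistant" and msg.get("content")
    ).lower()
    outputs_ok = all(item.lower() in agent_text for item in required_outputs)

    return db_ok and outputs_ok
```

Requiring both checks means an agent that tells the user the right thing but writes the wrong record, or the reverse, still fails the trial.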

Why Non-Determinism Matters Here

There are two sources of randomness in τ-bench. The agent samples from its LLM, producing different reasoning and tool call sequences. And the user simulator also samples, producing different follow-up questions and phrasings. This means every trial of the same task generates a genuinely different conversation. Pass@K and pass^K become essential because a single trial is statistically meaningless.

Policy Compliance: The Hard Part

The most challenging aspect of τ-bench isn't the tool calling — it's policy compliance. The agent receives policy documents like this:

policy excerpt
## Return Policy
- Electronics: 14-day return window, 15% restocking fee
- Clothing: 30-day return window, no restocking fee
- Sale items: final sale, no returns
- Defective items: 90-day window, full refund, no restocking fee

## Exceptions
- If customer has Gold status: waive restocking fee
- If order was delayed >7 days: extend return window by 14 days
- Items over $500: require manager approval for return

The agent must navigate these rules while conversing with a user who may not know the policies. When a Gold-status customer asks to return a $600 electronic item purchased 16 days ago, the agent needs to: (1) recognize the item is past the 14-day window, (2) check if the order was delayed >7 days, (3) note the item is over $500, (4) check Gold status to waive restocking. Missing any one of these conditions produces a wrong outcome.
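Writing the policy excerpt as code shows how many conditions interact in that one request. The sketch below is one possible reading of the rules: it assumes a defective item overrides the final-sale rule and that the delay extension applies to every window, neither of which the excerpt states. All names are hypothetical.

```python
from dataclasses import dataclass

RETURN_WINDOW_DAYS = {"electronics": 14, "clothing": 30}
RESTOCKING_FEE = {"electronics": 0.15, "clothing": 0.0}

@dataclass
class ReturnRequest:
    category: str
    price: float
    days_since_purchase: int
    gold_status: bool = False
    delivery_delay_days: int = 0
    defective: bool = False
    sale_item: bool = False

def evaluate_return(req: ReturnRequest) -> dict:
    if req.defective:
        window, fee_rate = 90, 0.0           # defective: 90 days, full refund
    elif req.sale_item:
        return {"eligible": False, "reason": "final sale"}
    else:
        window = RETURN_WINDOW_DAYS[req.category]
        fee_rate = RESTOCKING_FEE[req.category]

    if req.delivery_delay_days > 7:          # exception: delayed order
        window += 14
    if req.days_since_purchase > window:
        return {"eligible": False, "reason": "outside return window"}
    if req.gold_status:                      # exception: Gold waives the fee
        fee_rate = 0.0

    return {
        "eligible": True,
        "restocking_fee": round(req.price * fee_rate, 2),
        "needs_manager_approval": req.price > 500,
    }
```

For the Gold customer with the $600 item at day 16, this returns ineligible unless the order was delayed more than 7 days, in which case the return is allowed with no fee but still needs manager approval. A function like this is also a convenient way to generate ground truth for policy tasks.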

This is where agents typically fail. The model might apply the wrong policy, miss an exception clause, or hallucinate a policy that doesn't exist. Policy compliance failures account for roughly 40% of all errors in τ-bench.

The Extensions: τ²-bench and τ³-bench

τ-bench spawned two important extensions:

Benchmark | What It Adds | Why It Matters
τ²-bench | Dual-control tasks (telecom domain) in which the simulated user also has tools and must act on the shared environment | Tests whether agents can guide and coordinate with a user, not just act on the user's behalf
τ³-bench | Longer, more complex conversations with multiple user goals per session | Tests context management and multi-goal reasoning under context pressure

Results that surprised people: When τ-bench was first released, state-of-the-art agents achieved around 50-70% pass@1 on the retail domain. But pass^5 dropped to 10-30%. The agents could solve tasks sometimes, but almost never reliably. This was a wake-up call for the industry.

What Makes τ-bench a Good Benchmark?

Several design choices make τ-bench particularly well-designed:

Check: What makes τ-bench conversations dynamic rather than scripted?

Chapter 8: Case Study — Terminal-Bench

While τ-bench evaluates conversational agents, Terminal-Bench tackles a different challenge: agents that solve real tasks in real terminal environments. No simulated users. No conversation. Just a Docker container, a task instruction, and a set of automated tests to verify the result.

The Harbor Framework

Each Terminal-Bench task is packaged as a Harbor — a self-contained evaluation unit with four components:

Component | What It Is | Example
Instruction | Natural language task description | "Set up a PostgreSQL database with tables for users and orders, create a backup script"
Dockerfile | The starting environment | Ubuntu 22.04 with Python 3.11, Node.js 18, PostgreSQL installed
Tests | Automated verification scripts | Check that tables exist, backup script runs, data is preserved
Oracle Solution | A known-working solution (for validation) | The exact commands and files that solve the task

The key design principle is outcome-oriented evaluation. Terminal-Bench doesn't care HOW the agent solves the task — it only checks the final state of the Docker container. Did the right files get created? Does the script work? Are the tests passing? This means the agent can use any approach: write a script, install packages, copy files, whatever works.

Task Quality: The 7-Step Audit

Not all tasks are good tasks. Terminal-Bench enforces quality through a rigorous 7-step audit process for every task:

1. Specificity
Is the task unambiguous? Can it be solved without guessing what the author meant?
2. Solvability
Can the task actually be completed in the given Docker environment? Does the oracle solution work?
3. Integrity
Can the agent cheat? Can it pass the tests without actually solving the task? Adversarial exploit detection.
4. Independence
Does the task depend on external services (internet, APIs) that might fail?
5. Determinism
Do the tests produce the same result every time? No race conditions, no timing dependencies.
6. Difficulty Calibration
Is the task appropriately challenging? Not trivially easy, not impossibly hard.
7. Coverage
Does the task test something meaningfully different from existing tasks?

Step 3 — integrity checking — is especially important. An agent might learn to exploit the test suite rather than solve the task. For example, if the test checks "does file output.txt contain the word 'success'?", a lazy agent could just write "success" to the file without doing any real work. The audit process includes adversarial exploit detection: a human tries to find shortcuts that pass the tests without solving the task, then fixes the tests to block those shortcuts.

Adversarial Exploit Detection: A Deep Dive

Let's walk through a concrete integrity failure and how to fix it. Original task and test:

example
# TASK: "Write a Python script that sorts a CSV file by the 'date' column"

# TEST (naive):
import csv

def test_csv_sorted():
    with open("output.csv", newline="") as f:
        dates = [row["date"] for row in csv.DictReader(f)]
    # just checks output is sorted (string order, so dates must be ISO-8601)
    assert dates == sorted(dates)

An agent could exploit this test by simply writing a hardcoded sorted CSV file without implementing any sorting logic. Or it could read the test file, figure out the expected output, and write it directly. To fix this, the audit adds:

example
# TEST (robust):
import csv, os, subprocess, sys

def read_rows(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def test_csv_sorted_robust():
    # 1. Check output file exists and is valid CSV, sorted by date
    dates = [row["date"] for row in read_rows("output.csv")]
    assert dates == sorted(dates)

    # 2. Check a Python script exists (not just a data file)
    assert os.path.exists("sort_csv.py")

    # 3. Run the script on a DIFFERENT input to verify generality
    #    (create_random_csv is a test helper assumed to exist)
    create_random_csv("test_input.csv", n=100)
    subprocess.run(
        [sys.executable, "sort_csv.py", "test_input.csv", "test_output.csv"],
        check=True,
    )
    test_rows = read_rows("test_output.csv")
    test_dates = [row["date"] for row in test_rows]
    assert test_dates == sorted(test_dates)

    # 4. Verify the script reordered the input rows instead of inventing or dropping any
    key = lambda row: sorted(row.items())
    assert sorted(test_rows, key=key) == sorted(read_rows("test_input.csv"), key=key)

The robust test runs the agent's solution on a new, random input that didn't exist during the agent's execution. This is the gold standard for integrity checking: the test should verify that the agent created a general solution, not a task-specific shortcut.
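The robust test depends on a `create_random_csv` helper that is not shown. A minimal sketch of what it might look like follows; the column names other than "date", the date range, and the optional seed are assumptions made for illustration.

```python
import csv
import random
from datetime import date, timedelta

def create_random_csv(path, n=100, seed=None):
    """Write n rows with random ISO-8601 dates, in unsorted order."""
    rng = random.Random(seed)  # pass a seed to reproduce a failing input
    start = date(2020, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "date", "amount"])
        writer.writeheader()
        for i in range(n):
            day = start + timedelta(days=rng.randrange(2000))
            writer.writerow(
                {"id": i, "date": day.isoformat(), "amount": rng.randint(1, 500)}
            )
```

Generating the file inside the test, after the agent has finished, is what prevents the agent from seeing the input in advance. ISO-formatted dates keep plain string sorting equivalent to chronological sorting.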

Scale and Coverage

Terminal-Bench contains 89 tasks across 15+ categories: software engineering, data processing, system administration, scientific computing, networking, databases, and more. Each task takes a capable agent 5-30 minutes to attempt, making the full benchmark a substantial test of real-world terminal skills.

Task Category Breakdown

Category | Tasks | Examples
Software Engineering | 18 | Build systems, testing, debugging
Data Processing | 12 | CSV, JSON, database manipulation
System Administration | 14 | Service configuration, permissions, networking
Scientific Computing | 8 | Numerical methods, plotting, data analysis
DevOps | 10 | Docker, CI/CD, monitoring
Security | 6 | Encryption, auth, vulnerability scanning
Other | 21 | Misc. terminal tasks

Key design lesson from Terminal-Bench: Outcome-oriented evaluation is more robust than trajectory-based evaluation. If you check "did the agent run the right commands?", you're testing one path. If you check "is the final system state correct?", you're testing all possible paths. The agent might solve it in a way you never anticipated, and that's fine, as long as it works.

Terminal-Bench Results Across Models

Performance on Terminal-Bench varies dramatically by model and agent framework. Some findings:

Finding | Implication
Best agents achieve ~60-70% pass@1 | Terminal tasks remain challenging even for frontier models
Pass^5 drops to ~15-30% for top agents | Reliability is still a major unsolved problem
Some categories (data processing) are much easier than others (system administration) | Aggregate scores hide category-level weaknesses
Agent frameworks matter as much as model choice | Harness design (retry logic, error handling, tool definitions) is a major performance lever
Longer time budgets don't always help | Giving agents more time can lead to more tool calls, which increases context rot

The last finding is counterintuitive: you'd expect more time to help, but agents with generous time budgets sometimes perform worse because they make more tool calls, fill up the context, and start making confusion errors. This connects directly back to context engineering — a well-designed agent knows when to stop exploring and commit to an answer.

Designing Your Own Task Harbors

If you're building a Terminal-Bench-style evaluation for your own domain, follow these principles:

  1. Dockerfile first: Define the exact environment. Pin package versions. Make the setup reproducible.
  2. Oracle solution second: Solve the task yourself before asking the agent. If you can't solve it, it's not a fair task.
  3. Tests third: Write tests that verify the outcome, not the path. Test on new inputs when possible.
  4. Adversarial audit fourth: Try to break your own tests. Can you pass them without solving the task? Fix any exploits.
  5. Difficulty calibration last: Estimate task difficulty based on number of required commands, conceptual complexity, and domain specificity.
Check: Why is "integrity checking" important in benchmark design?

Chapter 9: The Swiss Cheese Model

No single evaluation method catches everything. Human reviewers miss things. Code-based graders are brittle. Model-based graders have biases. Automated benchmarks have blind spots. If you rely on any one method alone, failures will slip through.

The solution is the Swiss cheese model, borrowed from safety engineering. Imagine each evaluation method as a slice of Swiss cheese — it blocks most failures, but has holes. Stack multiple slices together, and the holes in one slice are covered by solid cheese in the next. Enough layers, and almost nothing gets through.

The three layers of evaluation:
Layer 1 — Automated evals (pre-deploy): Run your benchmark suite on every code change. Catches regressions before they ship. Fast, cheap, but only tests what you've thought to test.

Layer 2 — Manual transcript review (pre-deploy): Humans read a sample of transcripts. Catches nuanced failures that code can't detect: wrong tone, hallucinated policies, subtle logical errors. Slow, but catches things automation misses.

Layer 3 — Production monitoring + A/B testing (post-deploy): Monitor real user interactions. Track resolution rates, escalation rates, user satisfaction. Catches edge cases and distribution shifts that no benchmark anticipated.

What Each Layer Catches

Failure Type | Layer 1 (Auto) | Layer 2 (Manual) | Layer 3 (Prod)
Regression from code change | Catches | Slow to notice | Too late
Wrong tone / unprofessional | Misses | Catches | User complaints
Novel edge case from real user | Not in test set | Not in sample | Catches
Hallucinated policy | Maybe | Catches | Catches
Infinite tool-call loop | Catches | Catches | Catches
Slow performance degradation | Misses | Misses | Catches

Notice the pattern: no single layer catches every failure type in time. But together, they cover everything. This is why you need all three, not just your favorite.
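The table can be encoded as data to check which failure types escape for any combination of layers. In this sketch only cells that read "Catches" count as coverage ("Maybe", "Too late", and "User complaints" do not), which is a deliberately strict assumption; the layer keys are illustrative.

```python
# Which layers reliably catch each failure type, per the table above.
CATCHES = {
    "Regression from code change": {"auto"},
    "Wrong tone / unprofessional": {"manual"},
    "Novel edge case from real user": {"prod"},
    "Hallucinated policy": {"manual", "prod"},
    "Infinite tool-call loop": {"auto", "manual", "prod"},
    "Slow performance degradation": {"prod"},
}

def escaped_failures(active_layers):
    """Return the failure types that no active layer catches."""
    active = set(active_layers)
    return [failure for failure, layers in CATCHES.items() if not layers & active]

print(escaped_failures({"auto"}))                    # four failure types escape
print(escaped_failures({"auto", "manual", "prod"}))  # nothing escapes
```

Running this for each single layer shows that every layer on its own lets at least one failure type through.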


Implementation: What Each Layer Looks Like

Layer 1 — Automated Evals: A CI pipeline that runs your benchmark suite on every pull request. Takes 10-30 minutes. Reports pass@1 and pass^3. Blocks merge if pass@1 drops by more than 5% from baseline. This is your first line of defense.

yaml
# .github/workflows/agent-eval.yml (simplified)
name: Agent Evaluation
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python run_eval.py
               --tasks eval/tasks/*.json
               --trials-per-task 5
               --output results.json
      - run: python check_regression.py
               --current results.json
               --baseline eval/baseline.json
               --max-regression 0.05
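The workflow calls a `check_regression.py` script that is not shown. A minimal sketch of how it might work follows, assuming both result files are JSON objects with a top-level "pass_at_1" field (a hypothetical schema) and that the threshold is an absolute drop in pass rate.

```python
import argparse
import json

def has_regressed(current: float, baseline: float, max_regression: float) -> bool:
    """True when the pass rate dropped by more than the allowed margin."""
    return (baseline - current) > max_regression

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail CI on pass@1 regression")
    parser.add_argument("--current", required=True)
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--max-regression", type=float, default=0.05)
    args = parser.parse_args(argv)

    with open(args.current) as f:
        current = json.load(f)["pass_at_1"]
    with open(args.baseline) as f:
        baseline = json.load(f)["pass_at_1"]

    if has_regressed(current, baseline, args.max_regression):
        print(f"REGRESSION: pass@1 {baseline:.3f} -> {current:.3f}")
        return 1  # non-zero exit code fails the CI step
    print(f"OK: pass@1 {baseline:.3f} -> {current:.3f}")
    return 0
```

As a standalone script the file would end with `raise SystemExit(main())` so the return value becomes the process exit code. With only 5 trials per task, small drops can be noise, so a gate like this is best at catching large regressions.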

Layer 2 — Manual Review: Weekly, a domain expert reviews 20-30 randomly sampled transcripts from the eval suite. They score on a rubric and flag any new failure modes. This catches "the agent technically passed the code grader but gave bad advice" situations.

Layer 3 — Production Monitoring: In production, track these signals:

The feedback loop: Production monitoring (Layer 3) feeds back into automated evals (Layer 1). When you discover a new failure mode in production, you add it as a test case in your benchmark. Over time, your automated evals get better because they're informed by real-world failures. This is the evaluation flywheel.
Check: Why is the Swiss cheese model necessary for agent evaluation?

Chapter 10: Build Your Own Agent Eval

You now have all the concepts. Let's put them together into a practical, step-by-step roadmap for building your own agent evaluation system. Seven steps, from scratch to production-ready.

Step 1: Define Success Criteria

Before writing a single test, answer: what does "good" look like? Be specific. Not "the agent should help the user" but "the agent should resolve the user's issue, update the database correctly, and respond in a professional tone within 30 seconds." Your success criteria become the inputs to your graders.

Step 2: Collect a Small Task Set

Start with 10-20 tasks that cover your most common scenarios. Don't try to be comprehensive yet. Include a mix of easy tasks (single tool call), medium tasks (3-5 tool calls), and hard tasks (ambiguous instructions, policy edge cases). Real user conversations from logs are the best source.

Step 3: Create Useful Tasks

Good tasks are specific (unambiguous instructions), solvable (the agent has the tools and information to succeed), and representative (they reflect real usage patterns). Include the full context: user message, initial database state, available tools, and applicable policies.

Step 4: Provide Ground Truth

For each task, define the expected outcome. This might be: expected database state after completion, required tool calls, reference output text, or a combination. The more specific your ground truth, the easier grading becomes.
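Steps 3 and 4 together describe a task record. One possible shape is sketched below; the field names follow the attributes the harness later in this chapter reads (`id`, `user_message`, `initial_env`, `tools`, `ground_truth`), while the remaining fields and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvalTask:
    id: str
    user_message: str
    initial_env: dict[str, Any]          # database state before the agent acts
    tools: list[str]                     # names of tools the agent may call
    policies: str                        # policy text given to the agent
    ground_truth: dict[str, Any] = field(default_factory=dict)
    difficulty: str = "medium"

task = EvalTask(
    id="cancel-001",
    user_message="Please cancel order A100.",
    initial_env={"orders": {"A100": {"status": "processing"}}},
    tools=["get_order_details", "cancel_order"],
    policies="Orders can be cancelled until they ship.",
    ground_truth={
        "expected_db": {"orders": {"A100": {"status": "cancelled"}}},
        "required_tools": ["cancel_order"],
        "required_outputs": ["A100"],
    },
    difficulty="easy",
)
```

Keeping the initial state and the expected state in the same record makes each task self-contained, so it can be reviewed, versioned, and replayed on its own.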

Step 5: Configure Graders

Set up a layered grading system. Start with code-based graders for objective checks (database state, tool calls). Add a model-based grader for subjective quality (tone, helpfulness, accuracy of explanations). Schedule periodic human review to calibrate the model-based grader.

python
# Example: composite grader
# (the check_* helpers here are task-level graders that take the whole
#  ground truth; check_required_tools is assumed to exist)

def composite_grade(transcript, outcome, ground_truth):
    scores = {}

    # Code graders (fast, deterministic)
    scores["db_correct"] = check_database_state(
        outcome.db, ground_truth.expected_db
    )
    scores["tools_correct"] = check_required_tools(
        transcript, ground_truth.required_tools
    )
    scores["no_forbidden"] = check_no_forbidden_tool(
        transcript, ground_truth.forbidden_tools
    )

    # Model grader (flexible, non-deterministic); llm_judge returns
    # {"score": ..., "reasoning": ...}, so keep only the numeric score
    scores["quality"] = llm_judge(
        transcript,
        rubric="Rate 1-5: professional tone, accurate info, concise"
    )["score"]

    # Overall: code graders must pass, quality ≥ 3
    passed = (all([
        scores["db_correct"],
        scores["tools_correct"],
        scores["no_forbidden"]
    ]) and scores["quality"] >= 3)

    return {"passed": passed, "scores": scores}

Step 6: Build the Evaluation Harness

The evaluation harness is the software that runs tasks, collects transcripts, applies graders, and stores results. It needs to handle:

python
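# Skeleton evaluation harness
# compute_pass_at_k and compute_pass_up_k are assumed helpers that implement
# the Chapter 6 formulas; each grader is assumed to return {"passed": bool, ...}.

import asyncio, json, copy
from datetime import datetime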

class EvalHarness:
    def __init__(self, agent, graders, trials_per_task=10):
        self.agent = agent
        self.graders = graders
        self.trials_per_task = trials_per_task
        self.results = []

    async def run_task(self, task):
        """Run N trials for one task."""
        trial_results = []
        for i in range(self.trials_per_task):
            # Fresh environment per trial
            env = copy.deepcopy(task.initial_env)

            # Run agent
            transcript = await self.agent.run(
                user_message=task.user_message,
                tools=task.tools,
                env=env
            )

            # Grade
            scores = {}
            for name, grader in self.graders.items():
                scores[name] = grader.grade(transcript, env, task.ground_truth)

            trial_results.append({
                "trial": i,
                "transcript": transcript,
                "scores": scores,
                "passed": all(s["passed"] for s in scores.values())
            })

        # Compute metrics
        c = sum(1 for t in trial_results if t["passed"])
        n = len(trial_results)
        return {
            "task_id": task.id,
            "pass_at_1": c / n,
            "pass_at_3": compute_pass_at_k(n, c, 3),
            "pass_up_3": compute_pass_up_k(n, c, 3),
            "trials": trial_results
        }

This is a minimal but functional harness. Production systems add logging, progress bars, error handling, cost tracking, and result dashboards — but the core loop is always the same: run, grade, store, compute metrics.

Reporting Results

When presenting evaluation results, include enough detail for informed decision-making. A good eval report looks like this:

text
========================================
Agent Evaluation Report — v2.4.1
Date: 2026-05-18 | Model: claude-opus-4
Tasks: 50 | Trials/task: 10 | Total runs: 500
========================================

METRICS (all tasks):
  pass@1:  0.820  (410/500 trials passed)
  pass@3:  0.978
  pass^3:  0.574
  pass^5:  0.389

METRICS (by difficulty):
  Easy   (20 tasks):  pass@1=0.940  pass^3=0.821
  Medium (20 tasks):  pass@1=0.815  pass^3=0.520
  Hard   (10 tasks):  pass@1=0.640  pass^3=0.267

METRICS (by category):
  Order lookup:     pass@1=0.960
  Cancellation:     pass@1=0.880
  Returns:          pass@1=0.780
  Policy edge cases: pass@1=0.600

TOP FAILURE MODES:
  1. Policy hallucination (12 failures)
  2. Wrong order ID in response (8 failures)
  3. Premature action without confirmation (6 failures)

VS BASELINE (v2.3.0):
  pass@1: 0.820 vs 0.790 (+3.0 pp) ✓
  pass^3: 0.574 vs 0.510 (+6.4 pp) ✓
  No regressions in any category. SAFE TO SHIP.
========================================

This report gives you everything needed to decide whether to ship: overall metrics, difficulty breakdown, category breakdown, top failure modes, and comparison to baseline. The category breakdown is especially important: an 82% aggregate can hide a 60% pass rate on policy edge cases, which might be unacceptable for your use case.
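
To make the category breakdown concrete, here is a minimal sketch of how per-category pass rates and a baseline comparison might be computed. The function names, the (category, passed) tuple shape, and the sample numbers are all assumptions for illustration, not values from the report.

```python
from collections import defaultdict


def pass_rate_by_category(trials):
    """trials: iterable of (category, passed) pairs -> {category: pass@1}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, ok in trials:
        total[category] += 1
        passed[category] += int(ok)
    return {cat: passed[cat] / total[cat] for cat in total}


def find_regressions(current, baseline, tolerance=0.0):
    """Categories whose pass rate dropped by more than `tolerance`."""
    return sorted(
        cat for cat, rate in current.items()
        if cat in baseline and rate < baseline[cat] - tolerance
    )


# Hypothetical trial outcomes: 10 trials in each of two categories.
trials = (
    [("lookup", True)] * 9 + [("lookup", False)]
    + [("policy", True)] * 6 + [("policy", False)] * 4
)
rates = pass_rate_by_category(trials)
print(rates)  # {'lookup': 0.9, 'policy': 0.6}
print(find_regressions(rates, {"lookup": 0.85, "policy": 0.7}))  # ['policy']
```

Here the aggregate is 15/20 = 75%, which looks tolerable until the breakdown shows that the policy category regressed against its baseline.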

Step 7: Inspect, Iterate, Maintain

Run your evaluation. Read the failing transcripts. Understand why the agent failed. Fix the agent (better prompts, better tools, better instructions). Run again. This cycle never ends. The best agent evaluations evolve continuously through new failure cases and ongoing maintenance.

A Concrete Task Set Example

Here's what a small but well-designed task set looks like for a customer service agent:

| # | Task | Difficulty | Tests |
| --- | --- | --- | --- |
| 1 | Simple order status lookup | Easy | Single tool call, correct response |
| 2 | Cancel a cancellable order | Easy | DB write, confirmation message |
| 3 | Cancel a non-cancellable order | Medium | Policy compliance: agent should refuse |
| 4 | Return with restocking fee | Medium | Fee calculation, user confirmation |
| 5 | Return for Gold member (fee waived) | Medium | Exception handling |
| 6 | Ambiguous user request | Hard | Clarification before action |
| 7 | Multi-step: return + reorder different item | Hard | Multiple DB writes, correct sequence |
| 8 | Policy edge case: item expired but defective | Hard | Exception vs standard policy conflict |
| 9 | User provides wrong order ID | Medium | Error handling, helpful recovery |
| 10 | Transfer to human agent | Medium | Knows when to escalate |

Notice the distribution: 2 easy, 5 medium, and 3 hard, including one policy edge case (task 8). This spread gives you a nuanced view of agent capability. An agent that passes 7/10 might still be failing all three hard tasks; the average masks the weakness.
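
The masking effect is easy to reproduce. The sketch below uses the difficulty labels from the table and a hypothetical outcome in which the agent passes everything except the hard tasks; the pass/fail values are invented for illustration.

```python
from collections import Counter

# Difficulty labels for tasks 1-10, copied from the table above.
difficulty = ["Easy", "Easy", "Medium", "Medium", "Medium",
              "Hard", "Hard", "Hard", "Medium", "Medium"]
# Hypothetical outcome: every task passes except the hard ones.
passed = [d != "Hard" for d in difficulty]

totals = Counter(difficulty)
wins = Counter(d for d, ok in zip(difficulty, passed) if ok)

print(f"overall: {sum(passed)}/{len(passed)}")  # overall: 7/10
for level in ("Easy", "Medium", "Hard"):
    print(f"{level}: {wins[level]}/{totals[level]}")
# Easy: 2/2
# Medium: 5/5
# Hard: 0/3
```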

Common Failure Patterns

After running hundreds of evaluation trials, certain failure patterns appear again and again. Knowing these helps you write better tasks and graders:

Hallucinated Policy
Agent invents a rule that doesn't exist. "Our policy allows 60-day returns" when the real policy is 30 days.
Wrong Tool Selection
Agent calls search_products() when it should call lookup_order(). Often caused by ambiguous tool names.
Premature Action
Agent acts before confirming with the user. Cancels an order without asking "Are you sure?"
Infinite Loop
Agent keeps calling the same tool with the same arguments, expecting a different result.
Context Confusion
After many tool calls, agent confuses details from different steps. Applies one order's policy to another order.

For each pattern, add specific test cases to your eval suite. The infinite loop pattern, for example, is caught by a simple grader that checks whether the same tool was called 3+ times with identical arguments.
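
A loop-detection grader along those lines could look like the sketch below. The shape of each tool call (a dict with "name" and "arguments") and the function name detect_loop are assumptions; adapt them to whatever your transcripts actually record.

```python
import json
from collections import Counter


def detect_loop(tool_calls, threshold=3):
    """Fail a trial when one (tool, arguments) pair occurs `threshold`+ times.

    tool_calls: list of {"name": str, "arguments": dict} (an assumed shape).
    """
    counts = Counter(
        # Serialize arguments with sorted keys so key order does not matter.
        (call["name"], json.dumps(call["arguments"], sort_keys=True))
        for call in tool_calls
    )
    repeated = sorted({name for (name, _), n in counts.items() if n >= threshold})
    return {"passed": not repeated, "repeated_tools": repeated}


calls = [{"name": "get_refund_status", "arguments": {"order_id": "A1"}}] * 3
calls.append({"name": "lookup_order", "arguments": {"order_id": "A1"}})
result = detect_loop(calls)
print(result)  # {'passed': False, 'repeated_tools': ['get_refund_status']}
```

This version counts total repeats rather than consecutive ones, so an agent that legitimately polls a status endpoint may trip it. Raise the threshold or exempt specific tools if that matters for your domain.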

The evaluation flywheel: Evaluate → Identify failures → Fix agent → Add failure as regression test → Evaluate again. Every failure you fix makes your evaluation suite stronger. Every new failure you discover in production gets added back as a test case. The eval gets better over time, automatically.

Chapter 11: Connections & Cheat Sheet

Benchmark Landscape

Here's the lay of the land: the benchmarks referenced in this lesson, what each one tests, and how it is graded.

| Benchmark | Domain | Agent Type | Grading |
| --- | --- | --- | --- |
| τ-bench | Retail & Airline | Conversational + tools | DB state + output strings |
| τ²-bench | Retail, airline & telecom | Dual-control: agent and user both use tools | DB state + agent-user coordination |
| τ³-bench | Complex multi-turn retail | Extended conversation | DB state + policy compliance |
| Terminal-Bench | Terminal / DevOps | Command execution | Docker container state tests |
| SWE-bench | Software engineering | Code generation + editing | Test suite pass/fail |
| GAIA | General assistant | Multi-tool reasoning | Factual correctness |
| TheAgentCompany | Business operations | Multi-app interaction | Task completion + quality |
| MT-Bench | Open-ended conversation | Multi-turn chat LLM (no tools) | LLM-as-Judge pairwise |

Choosing the Right Benchmark

No single benchmark is comprehensive. Choose based on your agent's domain:

| If your agent does... | Use these benchmarks | Why |
| --- | --- | --- |
| Customer service with tools | τ-bench, τ²-bench | Closest to real CS workflows with policy compliance |
| Code generation / editing | SWE-bench, Terminal-Bench | Verifiable outcomes in real codebases and terminal environments |
| General-purpose assistance | GAIA, MT-Bench | Multi-tool reasoning and open-ended quality |
| Business workflow automation | TheAgentCompany | Multi-application interaction patterns |
| Your own domain | Build your own + one public benchmark | Domain-specific tasks catch failures no public benchmark will |

The hybrid approach: Run at least one public benchmark (for comparison with other systems) PLUS your own domain-specific evaluation (for catching failures that matter to your users). Public benchmarks give you a baseline; custom evals give you confidence.

Benchmark Limitations

No benchmark is perfect. Be aware of these common limitations:

Evaluation Cheat Sheet

The Five Components

  1. Task = input + success criteria
  2. Trial = one attempt at a task
  3. Transcript = full record of actions
  4. Outcome = final environment state
  5. Grader = scores the result

Three Grader Types

  1. Human = gold standard, slow, expensive
  2. Code = fast, deterministic, brittle
  3. Model (LLM-as-Judge) = flexible, noisy

Two Key Metrics

  • Pass@K = at least 1 of K succeeds (capability)
  • Pass^K = all K succeed (reliability)

Three Evaluation Layers

  1. Automated evals (pre-deploy regression)
  2. Manual review (pre-deploy quality)
  3. Production monitoring (post-deploy)

The 7-Step Roadmap

  1. Define success criteria
  2. Collect small task set (10-20 tasks)
  3. Create useful tasks (specific, solvable, representative)
  4. Provide ground truth
  5. Configure graders (code + model + human)
  6. Build evaluation harness
  7. Inspect, iterate, maintain (forever)

Decision Checklist: When to Ship

Use this checklist before deploying an agent to production:

| Check | Threshold | Status |
| --- | --- | --- |
| pass@1 on core tasks | ≥ 80% | Must pass |
| pass^3 on core tasks | ≥ 50% | Must pass |
| pass@1 on edge cases | ≥ 60% | Should pass |
| No infinite loop failures | 0 occurrences | Must pass |
| No data corruption failures | 0 occurrences | Must pass |
| Human review sample (20 transcripts) | ≥ 90% acceptable | Must pass |
| LLM judge calibrated against humans | ≥ 80% agreement | Should pass |
| A/B test vs current system (if exists) | Non-inferior | Should pass |
| Latency p95 | < 30 seconds | Should pass |
| Cost per conversation | Within budget | Must pass |
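
One way to enforce a checklist like this is an automated gate that separates the two status levels in the last column. The sketch below covers four of the rows; the function name ship_decision, the metric keys, and the sample values are assumptions for illustration only.

```python
def ship_decision(checks):
    """checks: list of (name, passed, level), where level is 'must' or 'should'."""
    blockers = [name for name, ok, level in checks if not ok and level == "must"]
    risks = [name for name, ok, level in checks if not ok and level == "should"]
    return {"ship": not blockers, "blockers": blockers, "acknowledged_risks": risks}


# Hypothetical metrics from one evaluation run.
metrics = {"pass_at_1_core": 0.82, "pass_up_3_core": 0.574,
           "pass_at_1_edge": 0.55, "infinite_loops": 0}
checks = [
    ("pass@1 core >= 80%", metrics["pass_at_1_core"] >= 0.80, "must"),
    ("pass^3 core >= 50%", metrics["pass_up_3_core"] >= 0.50, "must"),
    ("pass@1 edge >= 60%", metrics["pass_at_1_edge"] >= 0.60, "should"),
    ("no infinite loops", metrics["infinite_loops"] == 0, "must"),
]
decision = ship_decision(checks)
print(decision["ship"])                # True
print(decision["acknowledged_risks"])  # ['pass@1 edge >= 60%']
```

A failed "should" check does not block the release here, but it is surfaced by name so that someone has to acknowledge the risk explicitly.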

"Must pass" items are blockers — if any fail, do not ship. "Should pass" items are strong recommendations — shipping without them adds risk that should be acknowledged.

Formula Reference

Every formula from this lesson in one place:

| Formula | What It Computes | When to Use |
| --- | --- | --- |
| pass@K = 1 − C(n−c, K) / C(n, K) | P(at least 1 of K succeeds) | Measuring capability |
| pass^K = C(c, K) / C(n, K) | P(all K succeed) | Measuring reliability |
| pass@K ≈ 1 − (1−p)^K | Approximate pass@K | Large n, quick estimate |
| pass^K ≈ p^K | Approximate pass^K | Large n, quick estimate |
| P(all steps correct) = p^steps | Error compounding over trajectory | Estimating long-horizon failure rate |
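
The formulas in the table translate directly into a few lines of Python. The sketch below reuses the helper names that the harness calls (compute_pass_at_k and compute_pass_up_k), but the bodies are an illustrative reading of the table, and the sample values n=10, c=8, K=3 are made up.

```python
from math import comb


def compute_pass_at_k(n, c, k):
    """P(at least one of k sampled trials passed), given c passes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def compute_pass_up_k(n, c, k):
    """P(all k sampled trials passed), given c passes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


n, c, k = 10, 8, 3
p = c / n  # per-trial pass rate used by the approximations
print(round(compute_pass_at_k(n, c, k), 3))  # 1.0   (exact pass@3)
print(round(compute_pass_up_k(n, c, k), 3))  # 0.467 (exact pass^3)
print(round(1 - (1 - p) ** k, 3))            # 0.992 (approximate pass@3)
print(round(p ** k, 3))                      # 0.512 (approximate pass^3)
print(round(0.95 ** 20, 3))                  # 0.358 (20 steps at 95% each)
```

With only 10 trials, the exact and approximate values differ noticeably, which is why the table reserves the approximations for large n. The last line shows error compounding: 95% per-step accuracy over 20 steps leaves roughly a 36% chance that every step is correct.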

Glossary of Key Terms

| Term | Definition |
| --- | --- |
| Agent | LLM + tools + instructions running in a loop via a harness |
| Agent harness | The software system (loop, tool execution, context management) that enables a model to act as an agent |
| Context rot | Degradation of agent performance as the context window fills with irrelevant information |
| Compaction | Strategies to reduce context size (summarization, clearing old tool results, note-taking) |
| Task | A predefined input + success criteria for evaluation |
| Trial | One attempt at completing a task |
| Transcript | Complete record of messages, tool calls, and results during a trial |
| Outcome | Final environment state after agent completes (or fails) a task |
| Grader | A function (human, code, or model) that scores a trial |
| Pass@K | Probability of at least one success in K attempts (capability) |
| Pass^K | Probability of all K attempts succeeding (reliability) |
| LLM-as-Judge | Using another LLM to evaluate agent outputs against a rubric |
| Inter-annotator agreement | How often two human graders give the same score |
| Swiss cheese model | Layering multiple evaluation methods so each covers the others' blind spots |
| Evaluation flywheel | Production failures → new test cases → better evals → better agents → repeat |
| Harbor | Terminal-Bench's self-contained task package (instruction + Docker + tests + oracle) |
| Progressive disclosure | Agent discovers context via tools rather than having it pre-loaded |

The Evolution of Agent Evaluation

The field is evolving rapidly. A few trends to watch:

References

  1. Anthropic. "Demystifying evals for AI agents." 2025.
  2. Anthropic. "Effective context engineering for agents." 2025.
  3. OpenAI. "A practical guide to building agents." 2025.
  4. Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. arXiv
  5. Zheng et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv
  6. Yao et al. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." 2024. arXiv
  7. Barres et al. "τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment." 2025. arXiv
  8. Shen et al. "Terminal-Bench: Evaluating Terminal Automation Agents." 2025. arXiv
  9. Wolfe, Cameron R. "Agent Evaluation: A Detailed Guide." Deep Learning Focus, May 2026.

"The best evaluation is the one that embarrasses your agent before your users do."
(Adapted from the software testing proverb)

You now understand how to evaluate agents rigorously. Go build evals that make your agents truly reliable.