The first comprehensive survey of how we measure what agents can actually do. Four evaluation dimensions — fundamental capabilities, application-specific benchmarks, generalist benchmarks, and evaluation frameworks — and the critical gaps nobody has filled yet.
You build an LLM agent. It can browse the web, write code, call APIs, and reason across multiple steps. It even reflects on its mistakes and retries. It feels impressive in a demo.
But how do you know it works?
Traditional LLM benchmarks — MMLU, GSM8K, HumanEval — test single-turn, text-in text-out performance. An agent is fundamentally different. It operates in a loop: it observes the environment, decides on an action, executes it, observes the result, and repeats. Each step can succeed or fail. The environment changes between steps. The agent might need to call tools, navigate websites, write and debug code, or hold multi-turn conversations.
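The loop described above can be made concrete with a toy example. This is a minimal sketch, not any framework's API: `CounterEnv`, `policy`, and `run_episode` are hypothetical names, and the "LLM" is replaced by a hard-coded rule so the code runs on its own.

```python
from dataclasses import dataclass


@dataclass
class CounterEnv:
    """Toy environment: the task is to raise a counter to a target value."""
    target: int = 3
    value: int = 0

    def observe(self) -> int:
        return self.value

    def step(self, action: str) -> None:
        # Each action mutates the environment, so the next observation differs.
        if action == "increment":
            self.value += 1

    def done(self) -> bool:
        return self.value >= self.target


def policy(observation: int, target: int) -> str:
    """Stand-in for the LLM's decision step (hypothetical)."""
    return "increment" if observation < target else "stop"


def run_episode(env: CounterEnv, max_steps: int = 10) -> list[tuple[int, str]]:
    """Run the observe-decide-act loop and record the trajectory."""
    trajectory = []
    for _ in range(max_steps):
        if env.done():
            break
        obs = env.observe()                # observe
        action = policy(obs, env.target)   # decide
        env.step(action)                   # act
        trajectory.append((obs, action))
    return trajectory


trajectory = run_episode(CounterEnv(target=3))
print(trajectory)
```

Even in this toy, the thing to evaluate is the recorded trajectory rather than a single output, which is the difference from single-turn benchmarks.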
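This survey by Yehudai et al. (2025) is the first to comprehensively map how the field evaluates LLM-based agents. The authors organize the landscape along four dimensions: fundamental capabilities, application-specific benchmarks, generalist benchmarks, and evaluation frameworks.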
Along the way, they identify critical gaps: almost nobody measures cost-efficiency, safety, or robustness. Most benchmarks use coarse "did it succeed?" metrics that hide where agents actually fail. And static benchmarks get saturated quickly as models improve.
A model takes one input and produces one output. An agent loops through observe-decide-act cycles, and each step changes the environment state. This is what makes agent evaluation fundamentally harder.
Before we can evaluate agents, we need a map. The survey organizes the entire evaluation landscape into four dimensions. Think of these as concentric circles: from atomic skills outward to complete systems.
Each layer also has its own evaluation paradigm:
| Layer | What you measure | How you measure it | Examples |
|---|---|---|---|
| Capabilities | Can it plan? Use tools? Reflect? | Isolated tasks, controlled inputs | PlanBench, BFCL, Reflection-Bench |
| Applications | Can it do this specific job? | Simulated or real environments | WebArena, SWE-Bench, tau-Bench |
| Generalist | Can it handle anything? | Diverse, multi-skill test suites | GAIA, AgentBench, TheAgentCompany |
| Frameworks | How well does the eval pipeline work? | Dev tools, monitoring, A/B testing | LangSmith, Galileo, Vertex AI |
The four layers of agent evaluation, from atomic capabilities to evaluation infrastructure.
Planning is the backbone of agency. Without it, an agent is just a fancy autocomplete that happens to call tools. Planning means decomposing a complex task into subtasks, tracking which subtasks are done, maintaining a belief about the current state, and adapting when things go wrong.
The survey identifies five essential planning abilities:
| Ability | What it means | Example benchmark |
|---|---|---|
| Task decomposition | Breaking "book a trip" into "find flights" + "find hotels" + "check budget" | Natural Plan |
| State tracking | Knowing what has been done and what remains | PlanBench, ALFWorld |
| Self-correction | Detecting errors and recovering — not just continuing blindly | MINT, ToolEmu |
| Causal understanding | Predicting what an action will do before executing it | ACPBench, FlowBench |
| Meta-planning | Choosing and refining the planning strategy itself | AutoPlanBench |
The benchmarks evolved in waves. First came repurposed reasoning datasets — GSM8K, HotpotQA, MATH — used as proxies for multi-step reasoning. These test the reasoning engine but not the agent loop. Then came agent-specific benchmarks:
As the number of required planning steps increases, agent success rates drop sharply, a pattern reported across PlanBench, MINT, and Natural Plan.
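One simple way to see why long horizons hurt is to assume every step succeeds independently with the same probability. That assumption is ours, not the survey's, and real agents can recover from errors or fail in correlated ways, so treat this as a back-of-the-envelope model rather than a fit to any benchmark.

```python
def success_at_horizon(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative per-step success rate of 95% (an assumed number).
for n in (1, 5, 10, 20):
    print(n, round(success_at_horizon(0.95, n), 3))
```

Under this assumption, an agent that gets 95% of individual steps right completes a 20-step task a little over a third of the time.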
An agent that cannot use tools is just an LLM with a loop. Tool use — calling APIs, executing functions, interacting with databases — is what transforms language models into agents that act in the world.
The survey decomposes tool use into a pipeline of sub-tasks:
Benchmarks for tool use evolved in three generations:
Tool use evaluation scenarios range from simple (single-call) to chained (sequential) to nested (dependent outputs).
| Benchmark | Year | Focus | Key innovation |
|---|---|---|---|
| BFCL | 2024 | Function calling leaderboard | Live, versioned (v1-v3), multi-turn |
| ToolBench | 2023 | 16K real APIs | Scale and diversity |
| ToolSandbox | 2024 | Stateful execution | State dependencies between calls |
| ComplexFuncBench | 2025 | Complex real-world calls | Implicit params, constraints |
| NESTFUL | 2024 | Nested API sequences | Output-as-input chaining |
| StableToolBench | 2024 | API stability | Virtual server with caching |
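To make the difference between single-call and nested scenarios concrete, here is a minimal sketch of scoring a predicted call sequence against a reference. The call format, the `"$0"` placeholder for "output of call 0", and the function names are all hypothetical; the benchmarks in the table each define their own formats and matching rules.

```python
from typing import Any

ToolCall = dict[str, Any]  # {"name": str, "args": dict}


def call_matches(predicted: ToolCall, expected: ToolCall) -> bool:
    """A call is correct only if the function name and every argument match."""
    return (predicted["name"] == expected["name"]
            and predicted["args"] == expected["args"])


def sequence_score(predicted: list[ToolCall],
                   expected: list[ToolCall]) -> dict[str, float]:
    """Return the full-sequence match and the fraction of reference steps matched."""
    step_hits = sum(call_matches(p, e) for p, e in zip(predicted, expected))
    full = len(predicted) == len(expected) and step_hits == len(expected)
    return {
        "full_match": 1.0 if full else 0.0,
        "step_accuracy": step_hits / len(expected) if expected else 0.0,
    }


# Nested scenario: the second call must consume the first call's output.
expected = [
    {"name": "get_user_id", "args": {"email": "a@example.com"}},
    {"name": "get_orders", "args": {"user_id": "$0"}},
]
predicted = [
    {"name": "get_user_id", "args": {"email": "a@example.com"}},
    {"name": "get_orders", "args": {"user_id": "a@example.com"}},  # failed to chain
]
result = sequence_score(predicted, expected)
print(result)
```

The agent here picks the right functions but passes the wrong value into the second call, so a name-only check would miss the failure while argument-level matching catches it.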
This is where agent evaluation gets real. Web agents navigate actual websites. SWE agents fix actual bugs in actual codebases. The stakes are high: these are the benchmarks that most closely predict whether an agent can replace human labor.
Web agent benchmarks evolved from toy to terrifyingly real:
SWE-bench changed everything. Instead of synthetic coding puzzles (HumanEval, MBPP), it uses real GitHub issues from real repositories. The agent must read the issue, understand the codebase, find the bug, write a fix, and pass the test suite.
| Benchmark | Year | What it tests | Difficulty |
|---|---|---|---|
| HumanEval | 2021 | Self-contained coding problems | Entry-level |
| SWE-bench | 2023 | Real GitHub issues, full repos | Professional |
| SWE-bench Verified | 2024 | Curated subset with strong tests | Professional |
| SWE-bench+ | 2024 | Fixed leakage and weak tests | Professional |
| SWE-bench Multimodal | 2024 | Visual software (JS, UI elements) | Professional+ |
| SWELancer | 2025 | Freelance coding tasks with dollar values | Real-world |
| ITBench | 2025 | IT automation tasks | Enterprise |
The SWE-bench family keeps getting refined because the original had problems: some solutions leaked through issue descriptions, test cases were too weak (accepting wrong answers), and tasks varied wildly in difficulty. Each variant (Lite, Verified, +, Multimodal) addresses a specific flaw.
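The "pass the test suite" criterion can be sketched as follows. This is modeled on the split SWE-bench makes between tests the fix should turn green (FAIL_TO_PASS) and tests that must stay green (PASS_TO_PASS); the function name and the status strings are illustrative, not the official harness.

```python
def is_resolved(test_status: dict[str, str],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """True if every targeted test now passes and no previously passing test broke."""
    fixed = all(test_status.get(t) == "passed" for t in fail_to_pass)
    kept = all(test_status.get(t) == "passed" for t in pass_to_pass)
    # An instance with no targeted tests cannot demonstrate a fix.
    return bool(fail_to_pass) and fixed and kept


# Hypothetical test results after applying the agent's patch.
status = {
    "test_parse_empty": "passed",   # previously failing, now fixed
    "test_parse_basic": "passed",   # still passing
    "test_parse_unicode": "failed", # regression introduced by the patch
}
print(is_resolved(status, ["test_parse_empty"], ["test_parse_basic"]))
print(is_resolved(status, ["test_parse_empty"],
                  ["test_parse_basic", "test_parse_unicode"]))
```

The second call shows why weak test sets matter: the same patch counts as resolved or not depending on whether the regression test is part of the check.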
The progression from simple simulations to real-world environments, with each benchmark placed by realism level and sized by difficulty.
Scientific agent evaluation has grown from "can you answer a science question?" to "can you do actual science?" The survey traces this evolution across four stages of the scientific process:
| Stage | What the agent does | Benchmark |
|---|---|---|
| Ideation | Generate novel, expert-level research ideas | Si et al. (2025) |
| Experiment design | Plan experiments with proper methodology | AAAR-1.0 |
| Code execution | Write accurate, executable scientific code | SciCode, ScienceAgentBench, CORE-Bench |
| Peer review | Generate substantive, accurate feedback | Chamoun et al. (2024) |
Two integrated frameworks stand out:
Customer service agents are a deceptively hard evaluation target. The agent must hold a multi-turn conversation, follow company policies, call the right functions in the right order, and communicate accurately. A single wrong action — refunding the wrong order, violating a return policy — has real consequences.
The benchmarks here span two approaches:
IntellAgent (Levi & Kadar, 2025) goes further: given a database schema and a company policy document, it automatically generates evaluation scenarios. It constructs a policy graph, samples policy combinations, creates events, simulates dialogues, and has a critique agent analyze performance. This is evaluation-as-code: no manual test creation needed.
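The "sample policy combinations" step can be illustrated with a weighted random walk over a small graph. This is a simplified sketch under our own assumptions, not IntellAgent's implementation: the graph, the weights, and `sample_policies` are all hypothetical.

```python
import random

# Hypothetical policy graph: edge weights say how plausibly two policies co-occur.
POLICY_GRAPH = {
    "verify_identity": {"refund_window": 0.6, "loyalty_discount": 0.4},
    "refund_window": {"verify_identity": 0.5, "damaged_item": 0.5},
    "loyalty_discount": {"verify_identity": 1.0},
    "damaged_item": {"refund_window": 1.0},
}


def sample_policies(graph: dict[str, dict[str, float]],
                    num_policies: int,
                    rng: random.Random) -> list[str]:
    """Weighted random walk that collects up to `num_policies` distinct policies."""
    current = rng.choice(sorted(graph))
    chosen = [current]
    while len(chosen) < num_policies:
        candidates = {n: w for n, w in graph[current].items() if n not in chosen}
        if not candidates:
            break  # dead end: return a shorter combination
        names = list(candidates)
        weights = [candidates[n] for n in names]
        current = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(current)
    return chosen


scenario = sample_policies(POLICY_GRAPH, num_policies=3, rng=random.Random(0))
print(scenario)
```

Each sampled combination would then seed one test scenario; asking for more policies per sample is one way to raise scenario complexity.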
Application-specific benchmarks tell you if your agent can do one job well. But the promise of LLM agents is generality — one system that can handle web search today, database queries tomorrow, and code review next week. Generalist benchmarks test this breadth.
Three benchmark families dominate:
GAIA comprises 466 human-crafted, real-world questions that require reasoning, multi-modal understanding, web navigation, and tool use, often all at once. A single GAIA question might require searching the web, extracting data from a PDF, computing something with Python, and synthesizing the answer. These are the kinds of tasks a capable human assistant handles routinely but that expose every weakness in an agent.
AgentBench is a suite of eight interactive environments: operating system commands, SQL databases, digital games, knowledge graphs, household tasks, web browsing, and more. The key insight is testing the same agent across wildly different domains. An agent that aces web browsing but fails at OS commands reveals brittle, domain-dependent strategies rather than general intelligence.
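A per-domain breakdown makes this kind of brittleness visible where a single average would hide it. The sketch below uses made-up scores and hypothetical names (`domain_report`, the domain keys); it is not how any particular benchmark aggregates results.

```python
from statistics import mean


def domain_report(scores: dict[str, float]) -> dict[str, object]:
    """Summarize per-domain success rates: average, weakest domain, and spread."""
    if not scores:
        raise ValueError("scores must not be empty")
    worst = min(scores, key=scores.get)
    return {
        "mean": mean(scores.values()),
        "worst_domain": worst,
        "spread": max(scores.values()) - min(scores.values()),
    }


# Illustrative (invented) success rates for one agent across three domains.
report = domain_report({
    "web_browsing": 0.60,
    "os_commands": 0.10,
    "sql_database": 0.50,
})
print(report)
```

A large spread with a respectable mean is the signature of an agent whose strategies work in some domains and do not carry over to others.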
The newest generation pushes even further:
Scope and best-known agent performance across major generalist benchmarks. Lower scores indicate harder benchmarks where current agents still struggle.
Benchmarks tell you what to measure. Frameworks tell you how. They are the infrastructure that makes evaluation repeatable, scalable, and actionable. The survey identifies three levels of granularity at which frameworks operate:
Did the agent produce the right answer? This is the simplest level. Most frameworks use LLM-based judges to evaluate agent responses against predefined criteria. Some (Databricks Mosaic, PatronusAI) offer proprietary judge models trained specifically for this purpose.
Was each individual action correct? This level assesses tool selection, parameter extraction, and execution output at every step. Galileo introduces an action advancement metric: did this step move the agent closer to the goal? This is subtler than binary pass/fail — a step might be "correct" (valid tool call, correct parameters) but pointless (didn't advance the task).
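The distinction between a correct step and a useful step can be sketched with two flags per step. The `Step` record and `stepwise_summary` are hypothetical and are not Galileo's API; in practice the `advanced` flag would come from a judge model or a human annotator rather than being hand-set.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    valid_call: bool  # right tool, well-formed parameters
    advanced: bool    # moved the task closer to the goal


def stepwise_summary(steps: list[Step]) -> dict[str, float]:
    """Fraction of steps that were valid, that advanced the task, or were valid but pointless."""
    if not steps:
        return {"valid_rate": 0.0, "advancement_rate": 0.0, "valid_but_pointless": 0.0}
    n = len(steps)
    return {
        "valid_rate": sum(s.valid_call for s in steps) / n,
        "advancement_rate": sum(s.advanced for s in steps) / n,
        "valid_but_pointless": sum(s.valid_call and not s.advanced for s in steps) / n,
    }


steps = [
    Step("search_orders", valid_call=True, advanced=True),
    Step("search_orders", valid_call=True, advanced=False),  # repeated the same query
    Step("issue_refund", valid_call=False, advanced=False),  # malformed parameters
    Step("issue_refund", valid_call=True, advanced=True),
]
summary = stepwise_summary(steps)
print(summary)
```

The gap between the valid rate and the advancement rate is the share of steps that a pass/fail check on tool calls alone would score as fine.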
Was the overall path good? Google Vertex AI and LangSmith analyze the full sequence of steps against an expected optimal path. AgentEvals supports LLM-as-judge evaluation of trajectories with or without reference trajectories. For LangGraph-based agents, it can assess whether the agent followed the expected workflow graph.
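Comparing a trajectory against a reference path can be done at different strictness levels. The three checks below are a minimal sketch over lists of tool names; the function names are our own and are not the API of Vertex AI, LangSmith, or AgentEvals.

```python
from collections import Counter


def exact_match(actual: list[str], reference: list[str]) -> bool:
    """Same steps, same order, nothing extra."""
    return actual == reference


def in_order_match(actual: list[str], reference: list[str]) -> bool:
    """Reference steps appear in order within `actual`; extra steps are allowed."""
    remaining = iter(actual)
    # `step in remaining` consumes the iterator up to the first match.
    return all(step in remaining for step in reference)


def any_order_match(actual: list[str], reference: list[str]) -> bool:
    """Every reference step occurs in `actual`, regardless of order."""
    need, have = Counter(reference), Counter(actual)
    return all(have[step] >= count for step, count in need.items())


reference = ["search", "open_page", "extract"]
detour = ["search", "scroll", "open_page", "extract"]
shuffled = ["open_page", "search", "extract"]

for name, actual in (("detour", detour), ("shuffled", shuffled)):
    print(name,
          exact_match(actual, reference),
          in_order_match(actual, reference),
          any_order_match(actual, reference))
```

Which level is appropriate depends on the task: a harmless detour fails the exact check but passes the in-order one, while a reordered trajectory passes only the loosest check.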
| Framework | Stepwise | Trajectory | Monitoring | A/B Testing | Synth. Data |
|---|---|---|---|---|---|
| LangSmith | Yes | Yes | Yes | Yes | No |
| Langfuse | Yes | No | Yes | Yes | No |
| Google Vertex AI | Yes | Yes | Yes | Yes | No |
| Galileo | Yes | No | Yes | Yes | Yes |
| PatronusAI | Yes | No | Yes | Yes | Yes |
| Mosaic AI | Yes | No | Yes | Yes | Yes |
| AgentEvals | No | Yes | No | No | No |
Beyond monitoring and assessment, the survey highlights Gym-like environments — inspired by OpenAI Gym — that provide standardized, interactive settings for training and evaluating agents:
These provide the controlled, repeatable environments that passive monitoring cannot: you can replay trajectories, inject failures, and test edge cases systematically.
The survey doesn't just catalogue what exists. It identifies what is missing. Four critical gaps threaten to make current evaluation frameworks inadequate as agents move toward real-world deployment.
Most benchmarks report a single number: success rate. Did the agent complete the task? Yes or no. This tells you nothing about where it failed. Was it a bad plan? A wrong tool call? A misunderstood user intent? Without step-level and trajectory-level metrics, you cannot diagnose or fix agent failures.
The fix: Standardized, fine-grained evaluation metrics that capture each decision point. WebCanvas's navigational node completion rates are a step in this direction, as are LangSmith's trajectory analysis and Galileo's stepwise action advancement metric. But there is no standard yet.
Current evaluations almost exclusively measure accuracy. But a web agent that succeeds 80% of the time at $0.50 per task is often more valuable than one that succeeds 85% of the time at $5.00 per task. As Kapoor et al. (2024) observed, ignoring cost "inadvertently drives the development of highly capable but resource-intensive agents, limiting their practical deployment."
The fix: Integrate cost as a core metric alongside accuracy. Track token usage, API expenses, inference time, and total resource consumption. SWELancer is the first benchmark to link performance to monetary value — more benchmarks should follow.
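One way to put cost and accuracy on the same scale is expected cost per successful task. The sketch below assumes failed attempts are still paid for and carry no further penalty, which is a simplification; `cost_per_success` is a hypothetical helper, and the inputs are the two agents from the example above.

```python
def cost_per_success(success_rate: float, cost_per_task: float) -> float:
    """Expected spend per solved task, counting the cost of failed attempts."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    if cost_per_task < 0:
        raise ValueError("cost_per_task must be non-negative")
    return cost_per_task / success_rate


cheap = cost_per_success(0.80, 0.50)
pricey = cost_per_success(0.85, 5.00)
print(f"cheap agent:  ${cheap:.2f} per solved task")
print(f"pricey agent: ${pricey:.2f} per solved task")
print(f"ratio: {pricey / cheap:.1f}x")
```

By this measure the cheaper agent costs about $0.63 per solved task against about $5.88, roughly nine times less, in exchange for five points of accuracy. Whether that trade is worth it depends on how expensive a failure is, which this metric does not capture.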
Almost no benchmark comprehensively tests safety. Can the agent be jailbroken into harmful actions? Does it follow organizational policies? Does it handle adversarial inputs gracefully? ST-WebAgentBench and AgentHarm have started, but the coverage is minimal compared to the risk.
The fix: Multi-dimensional safety benchmarks that simulate adversarial scenarios, test policy compliance, and evaluate behavior in multi-agent settings where emergent risks arise.
Static benchmarks get saturated. Models improve, the benchmark becomes easy, and the leaderboard stops being informative. Human-annotated benchmarks are expensive to refresh. The field needs automated benchmark generation and live benchmarks that evolve.
The fix: IntellAgent's automatic benchmark generation, BFCL's live versioning, and Agent-as-a-Judge (using LLM agents to evaluate other agents) all point toward scalable evaluation. But these approaches need validation — automated judges can have their own biases.
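A basic validation step for an automated judge is to measure its agreement with human labels beyond what chance would produce. The sketch below computes Cohen's kappa from scratch on invented labels; it is one common agreement statistic, not a method the survey prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels: the judge is more lenient than the human annotators.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 2))
```

Here raw agreement is 75%, but kappa is 0.5 once chance agreement is removed, and both disagreements are cases where the judge passed something a human failed. That kind of one-directional error is the bias a judge needs to be checked for before its scores are trusted.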
How well do current benchmarks cover each critical dimension? The shaded area shows the current state of coverage.
This survey sits at the center of a rapidly expanding web of agent research. Here is how it connects to other key work:
| Topic | Connection |
|---|---|
| Meta-Harness | If you want to build your own evaluation harness, start here. Meta's framework for evaluating coding agents directly implements the trajectory-based assessment discussed in this survey. |
| Dive into Claude Code | A concrete example of an SWE agent in the wild. Claude Code's architecture — tool use, planning, self-reflection, memory — maps directly onto the four fundamental capabilities this survey evaluates. |
| RLHF / DPO / RLEF | How do you train agents to get better? RLHF and DPO optimize for human preferences; RLEF (RL from Execution Feedback) optimizes for task completion. The evaluation methods surveyed here become the reward signals for these training algorithms. |
| SayCan / RT-2 / pi0 | Embodied agents face the same evaluation challenges but in the physical world. The survey's framework extends naturally: capabilities (manipulation, navigation), applications (household, industrial), and generalist (diverse tasks in simulation). |
| Kimi-K2 / DeepSeek-V3 | The underlying LLMs that power agents. Better base models (larger context, better reasoning) directly improve agent performance on these benchmarks — but the evaluation still requires agent-specific methodology. |
| Dimension | Key Benchmarks | What they test | Current best % |
|---|---|---|---|
| Planning | PlanBench, MINT, Natural Plan | Multi-step decomposition, state tracking | <30% (high complexity) |
| Tool Use | BFCL, ToolSandbox, NESTFUL | Function calling, chaining, statefulness | ~70-80% (simple), ~40% (complex) |
| Web Agents | WebArena, WorkArena++ | Dynamic web navigation, enterprise tasks | ~15-35% |
| SWE Agents | SWE-bench Verified, SWELancer | Real GitHub bugs, freelance tasks | ~50-72% (Verified) |
| Scientific | DiscoveryWorld, MLGym, CORE-Bench | Full research cycles, reproducibility | ~20-50% |
| Conversational | tau-Bench, IntellAgent | Policy compliance, multi-turn dialogue | ~40-65% |
| Generalist | GAIA, TheAgentCompany, HAL | Diverse, multi-skill tasks | ~2-55% (varies by level) |