Yehudai et al. — Hebrew University / IBM Research / Yale — 2025

Survey on Evaluation of LLM-based Agents

The first comprehensive survey of how we measure what agents can actually do. Four evaluation dimensions — fundamental capabilities, application-specific benchmarks, generalist benchmarks, and evaluation frameworks — and the critical gaps nobody has filled yet.

Prerequisites: LLM basics + Agents (tool use, planning)
10 chapters, 5+ simulations

Chapter 0: The Problem

You build an LLM agent. It can browse the web, write code, call APIs, and reason across multiple steps. It even reflects on its mistakes and retries. It feels impressive in a demo.

But how do you know it works?

Traditional LLM benchmarks — MMLU, GSM8K, HumanEval — test single-turn, text-in text-out performance. An agent is fundamentally different. It operates in a loop: it observes the environment, decides on an action, executes it, observes the result, and repeats. Each step can succeed or fail. The environment changes between steps. The agent might need to call tools, navigate websites, write and debug code, or hold multi-turn conversations.

The core tension: Evaluating an agent is not like evaluating a model. A model produces one answer. An agent produces a trajectory — a sequence of observations, decisions, and actions. Two agents might reach the same correct answer via completely different paths. One might be efficient, the other wasteful. One might be safe, the other dangerous. Measuring just "did it get the right answer?" misses almost everything that matters.
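To make the trajectory idea concrete, here is a minimal sketch of how two agents with the same final answer can look very different once the path is recorded. The `Step` and `Trajectory` types, the field names, and the per-step costs are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str            # e.g. "search", "click", "call_api" (illustrative labels)
    ok: bool               # did the action execute without error?
    cost_usd: float = 0.0  # assumed per-step cost

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

def summarize(traj: Trajectory, expected: str) -> dict:
    """Report the outcome plus path-level signals a final-answer check hides."""
    failed = sum(1 for s in traj.steps if not s.ok)
    return {
        "correct": traj.final_answer == expected,
        "num_steps": len(traj.steps),
        "failed_steps": failed,
        "total_cost_usd": round(sum(s.cost_usd for s in traj.steps), 4),
    }

# Two hypothetical agents reach the same answer by different paths.
efficient = Trajectory([Step("search", True, 0.01), Step("answer", True, 0.01)], "42")
wasteful = Trajectory(
    [Step("search", False, 0.01)] * 6
    + [Step("search", True, 0.01), Step("answer", True, 0.01)],
    "42",
)

print(summarize(efficient, "42"))
print(summarize(wasteful, "42"))
```

Both summaries report `correct: True`, but only the second shows six failed steps and four times the cost, which is exactly the information an answer-only metric discards.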

This survey by Yehudai et al. (2025) is the first to comprehensively map how the field evaluates LLM-based agents. They organize the landscape along four dimensions:

  1. Fundamental capabilities — planning, tool use, self-reflection, memory
  2. Application-specific benchmarks — web agents, SWE agents, scientific agents, conversational agents
  3. Generalist agent benchmarks — GAIA, AgentBench, TheAgentCompany
  4. Evaluation frameworks — LangSmith, Vertex AI, Galileo, and the infrastructure for building evals

Along the way, they identify critical gaps: almost nobody measures cost-efficiency, safety, or robustness. Most benchmarks use coarse "did it succeed?" metrics that hide where agents actually fail. And static benchmarks get saturated quickly as models improve.

Agent vs. Model Evaluation

A model takes one input and produces one output. An agent loops through observe-decide-act cycles. Click "Step" to advance the agent through its trajectory. Notice how each step changes the environment state — this is what makes agent evaluation fundamentally harder.

Concept: An agent evaluation must assess not just final outcomes but the entire trajectory — each decision, each tool call, each recovery from error. Realization: This is why you cannot simply reuse LLM benchmarks for agents. You need environments that simulate real-world dynamics, metrics that capture intermediate steps, and frameworks that track cost and safety alongside accuracy.
Why can't traditional LLM benchmarks (like MMLU or GSM8K) adequately evaluate an LLM-based agent?

Chapter 1: The Taxonomy

Before we can evaluate agents, we need a map. The survey organizes the entire evaluation landscape into four dimensions. Think of these as concentric circles: from atomic skills outward to complete systems.

Layer 1: Fundamental Capabilities
Can the agent plan, use tools, reflect on mistakes, and maintain memory? These are the atomic building blocks. You test them in isolation: give the agent a planning problem, see if it decomposes it correctly. Give it an API schema, see if it calls the right function.
Layer 2: Application-Specific Benchmarks
Can the agent do a real job? Web navigation, software engineering, scientific research, customer service. Each domain combines capabilities differently and adds domain-specific challenges (e.g., a web agent must parse HTML; a SWE agent must understand codebases).
Layer 3: Generalist Benchmarks
Can the agent handle diverse tasks it has never seen before? GAIA asks 466 real-world questions requiring web search, coding, and multi-modal reasoning together. AgentBench spans OS commands, databases, games, and household tasks. TheAgentCompany simulates an entire software company.
Layer 4: Evaluation Frameworks
How do developers actually build, run, and analyze evals? LangSmith, Vertex AI, Galileo provide monitoring, trajectory analysis, stepwise assessment, and LLM-as-judge pipelines. These are the infrastructure that makes evaluation repeatable and scalable.
Concept: The four layers mirror how agents are built: capabilities compose into applications, applications compose into general intelligence, and frameworks make evaluation systematic. Realization: Most teams only evaluate at layer 2 or 3 (application or generalist). But failures often originate at layer 1 (a broken tool call, a flawed plan). Without capability-level evaluation, you see that the agent failed but not why.

Each layer also has its own evaluation paradigm:

| Layer | What you measure | How you measure it | Example benchmarks |
| --- | --- | --- | --- |
| Capabilities | Can it plan? Use tools? Reflect? | Isolated tasks, controlled inputs | PlanBench, BFCL, Reflection-Bench |
| Applications | Can it do this specific job? | Simulated or real environments | WebArena, SWE-bench, tau-Bench |
| Generalist | Can it handle anything? | Diverse, multi-skill test suites | GAIA, AgentBench, TheAgentCompany |
| Frameworks | How well does the eval pipeline work? | Dev tools, monitoring, A/B testing | LangSmith, Galileo, Vertex AI |

Evaluation Taxonomy Map

The four layers of agent evaluation, from atomic capabilities to evaluation infrastructure. Hover over each layer to see its key benchmarks and what it measures.

An agent correctly answers a complex question by calling five APIs in sequence, but it takes 47 API calls to get there (42 failed attempts). Which evaluation layer would catch this inefficiency?

Chapter 2: Planning & Reasoning

Planning is the backbone of agency. Without it, an agent is just a fancy autocomplete that happens to call tools. Planning means decomposing a complex task into subtasks, tracking which subtasks are done, maintaining a belief about the current state, and adapting when things go wrong.

The survey identifies five essential planning abilities:

| Ability | What it means | Example benchmark |
| --- | --- | --- |
| Task decomposition | Breaking "book a trip" into "find flights" + "find hotels" + "check budget" | Natural Plan |
| State tracking | Knowing what has been done and what remains | PlanBench, ALFWorld |
| Self-correction | Detecting errors and recovering, not just continuing blindly | MINT, ToolEmu |
| Causal understanding | Predicting what an action will do before executing it | ACPBench, FlowBench |
| Meta-planning | Choosing and refining the planning strategy itself | AutoPlanBench |

Concept: Current LLMs are decent at short-term tactical planning (2-5 steps) but struggle dramatically with strategic long-horizon planning (10+ steps). Realization: AutoPlanBench shows that even state-of-the-art LLM agents lag behind classical symbolic planners on problems requiring 10+ steps. The gap isn't in reasoning per se — it's in maintaining consistent state over many steps and recovering from cascading errors.
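One simple way to see why long horizons hurt is to treat each step as succeeding independently with the same probability. That independence assumption is a simplification introduced here, not something the survey claims; cascading errors in real agents are correlated and usually make things worse.

```python
def trajectory_success_prob(per_step_success: float, num_steps: int) -> float:
    """Probability that every step succeeds, ASSUMING independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps

# How end-to-end success decays with horizon at 90% per-step reliability.
for n in (2, 5, 10, 20):
    print(n, round(trajectory_success_prob(0.9, n), 3))
```

Under this toy model, an agent that is right 90% of the time at each step completes a 10-step task only about 35% of the time.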

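The benchmarks evolved in waves. First came repurposed reasoning datasets (GSM8K, HotpotQA, MATH) used as proxies for multi-step reasoning. These test the reasoning engine but not the agent loop. Then came benchmarks built specifically for planning, such as PlanBench, MINT, and Natural Plan, which target planning abilities directly rather than through a proxy.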

Planning Horizon vs. Success Rate

As the number of required planning steps increases, agent success rates drop dramatically. Drag the horizon slider to see how quickly performance degrades. The curve is based on patterns reported across PlanBench, MINT, and Natural Plan.

The key finding: Planning evaluation has progressed from testing whether an agent can reason through math problems (a proxy) to testing whether it can decompose, track, correct, and adapt across real interactive environments. But the hardest problems — long-horizon, open-ended, real-world planning — remain largely unsolved. The best agents score below 30% on benchmarks like Natural Plan with high-complexity tasks.
According to the survey, what is the biggest gap between current LLM agents and classical symbolic planners?

Chapter 3: Tool Use

An agent that cannot use tools is just an LLM with a loop. Tool use — calling APIs, executing functions, interacting with databases — is what transforms language models into agents that act in the world.

The survey decomposes tool use into a pipeline of sub-tasks:

1. Intent Recognition
Does the user's request require a function call? "What's the weather in Tokyo?" → yes. "Tell me a joke" → probably not.
2. Function Selection
Which tool is appropriate? If the user asks about weather, choose the weather API, not the calculator.
3. Parameter Extraction
Map conversation context to function parameters. "Tokyo" → location="Tokyo". Some parameters may be implicit.
4. Execution
Call the function. Handle errors. Parse the output.
5. Response Generation
Incorporate the tool's output into a natural language response for the user.
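The five sub-tasks above can be sketched as a tiny pipeline. Everything here is a stand-in: the `get_weather` stub, the keyword matching, and the regex-based parameter extraction are hypothetical simplifications of what an LLM would do, meant only to show where each stage sits and where it can fail.

```python
import re

def get_weather(location: str) -> str:
    return f"Sunny in {location}"  # stubbed result; a real tool would call an API

# Hypothetical tool registry: name -> (trigger keywords, callable).
TOOLS = {"get_weather": (("weather", "forecast"), get_weather)}

def select_tool(message: str):
    """Steps 1-2: decide whether a call is needed and, if so, which tool."""
    lowered = message.lower()
    for name, (keywords, _) in TOOLS.items():
        if any(k in lowered for k in keywords):
            return name
    return None  # no function call required

def extract_location(message: str):
    """Step 3: naive parameter extraction, looking for 'in <Place>'."""
    match = re.search(r"\bin ([A-Z][a-zA-Z]+)", message)
    return match.group(1) if match else None

def handle(message: str) -> str:
    tool = select_tool(message)
    if tool is None:
        return "No tool needed; answer directly."
    location = extract_location(message)
    if location is None:
        return "Which location do you mean?"  # missing parameter: ask, do not guess
    try:
        result = TOOLS[tool][1](location)      # Step 4: execute and handle errors
    except Exception as exc:
        return f"Tool call failed: {exc}"
    return f"Here is what I found: {result}."  # Step 5: respond in natural language

print(handle("What's the weather in Tokyo?"))
print(handle("Tell me a joke"))
```

Splitting the stages this way is also what makes stage-level scoring possible: a benchmark can check the selected tool and the extracted parameters separately from the final reply.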

Benchmarks for tool use evolved in three generations:

Generation 1: Simple, single-step. ToolAlpaca, APIBench, and BFCL v1 test one-shot function calls with explicitly provided parameters. These are necessary baselines but miss real-world complexity.
Generation 2: Multi-step and stateful. ToolSandbox adds stateful tool execution where one call changes the state for the next. Seal-Tools tests nested tool calls where outputs feed into subsequent calls. NESTFUL evaluates sequences where API outputs chain together.
Generation 3: Realistic and dynamic. ComplexFuncBench tests implicit parameter inference, user-defined constraints, and long-context processing. BFCL v3 incorporates multi-turn, multi-step evaluation with live updates. StableToolBench introduces virtual API servers with caching to handle API instability.
Concept: Tool use evaluation has shifted from "can you fill in function parameters?" to "can you chain tools across a stateful environment where each call changes the world?" Realization: This mirrors the difference between calling a function in isolation vs. writing a real program. The hardest tool use problems involve implicit parameters (not stated in the prompt), dependent outputs (tool B needs tool A's output), and error recovery (tool C fails, now what?).
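A small sketch of what "stateful" and "dependent outputs" mean in practice. The `Sandbox` class, its tools, the contact data, and the `"$prev"` placeholder convention are all invented for illustration; they are loosely in the spirit of the stateful benchmarks described above and are not taken from any of them.

```python
class Sandbox:
    """Tiny stateful tool environment: sending requires cellular service to be on."""

    def __init__(self):
        self.state = {"cellular": False, "sent": []}

    def set_cellular(self, on: bool) -> str:
        self.state["cellular"] = on
        return "ok"

    def lookup_contact(self, name: str) -> str:
        contacts = {"Ada": "+1-555-0100"}  # made-up data
        return contacts[name]              # raises KeyError for unknown names

    def send_message(self, number: str, text: str) -> str:
        if not self.state["cellular"]:
            raise RuntimeError("cellular service is off")
        self.state["sent"].append((number, text))
        return "sent"

def run(calls):
    """Execute (tool_name, kwargs) pairs; the value '$prev' is replaced by the last output."""
    box, prev, errors = Sandbox(), None, 0
    for name, kwargs in calls:
        kwargs = {k: (prev if v == "$prev" else v) for k, v in kwargs.items()}
        try:
            prev = getattr(box, name)(**kwargs)
        except (RuntimeError, KeyError):
            errors += 1
    return box.state, errors

good = [
    ("set_cellular", {"on": True}),                       # implicit prerequisite
    ("lookup_contact", {"name": "Ada"}),
    ("send_message", {"number": "$prev", "text": "hi"}),  # depends on previous output
]
bad = good[1:]  # same plan, but the prerequisite is skipped

print(run(good))
print(run(bad))
```

Scoring on the final state rather than on individual call syntax is what catches the second plan: every call in it is well-formed, yet the message is never sent.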
Tool Call Complexity

Explore how tool use benchmarks have evolved. Click to toggle between simple (single-call), chained (sequential), and nested (dependent outputs) tool evaluation scenarios.

| Benchmark | Year | Focus | Key innovation |
| --- | --- | --- | --- |
| BFCL | 2024 | Function calling leaderboard | Live, versioned (v1-v3), multi-turn |
| ToolBench | 2023 | 16K real APIs | Scale and diversity |
| ToolSandbox | 2024 | Stateful execution | State dependencies between calls |
| ComplexFuncBench | 2025 | Complex real-world calls | Implicit params, constraints |
| NESTFUL | 2024 | Nested API sequences | Output-as-input chaining |
| StableToolBench | 2024 | API stability | Virtual server with caching |

What makes NESTFUL and ToolSandbox fundamentally harder than early benchmarks like APIBench?

Chapter 4: Web & Software Engineering Agents

This is where agent evaluation gets real. Web agents navigate actual websites. SWE agents fix actual bugs in actual codebases. The stakes are high: these benchmarks come closest to measuring whether an agent can do work that people are currently paid to do.

Web Agents

Web agent benchmarks evolved from toy to terrifyingly real:

MiniWoB / MiniWoB++ (2017-2018)
Basic simulated web elements. Click buttons, fill forms. Early proof of concept. Simple, static, reproducible.
WebShop / Mind2Web (2022-2023)
Simulated online shopping, broader web interactions. Static datasets enable reproducible evaluation. More realistic but still offline.
WebArena / VisualWebArena (2023-2024)
Dynamic, online environments with realistic UI elements. Agents must interpret visual information, adapt to changing pages, and complete multi-step workflows. The real deal.
WorkArena++ / ST-WebAgentBench (2024-2025)
Enterprise environments: multi-step office tasks, policy compliance, safety protocols. Not just "did it work?" but "did it follow the rules?"
Concept: Web agent evaluation has shifted from "can it click the right button?" to "can it complete a complex enterprise workflow while respecting policies and safety constraints?" Realization: Even state-of-the-art agents achieve only ~15-35% success on WebArena. The gap between demos and reliable deployment is enormous. And almost no benchmark tests policy compliance or safety — the very things that matter most for real deployment.

Software Engineering Agents

SWE-bench changed everything. Instead of synthetic coding puzzles (HumanEval, MBPP), it uses real GitHub issues from real repositories. The agent must read the issue, understand the codebase, find the bug, write a fix, and pass the test suite.
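"Pass the test suite" is usually read in two parts: the tests that exposed the bug must now pass, and the tests that already passed must not break. The sketch below encodes that check. The function and parameter names are illustrative and this is not the benchmark's actual harness.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """True if the patch fixes the target tests without breaking the others.

    results maps a test id to whether it passed after applying the agent's patch.
    A test missing from results is treated as failed.
    """
    fixed = all(results.get(t, False) for t in fail_to_pass)
    unbroken = all(results.get(t, False) for t in pass_to_pass)
    return fixed and unbroken

# Hypothetical test ids for one issue.
after_patch = {"test_bug_repro": True, "test_existing_a": True, "test_existing_b": False}
print(is_resolved(after_patch, ["test_bug_repro"], ["test_existing_a"]))
print(is_resolved(after_patch, ["test_bug_repro"], ["test_existing_a", "test_existing_b"]))
```

The second call returns `False` even though the bug is fixed, because the patch introduced a regression.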

| Benchmark | Year | What it tests | Difficulty |
| --- | --- | --- | --- |
| HumanEval | 2021 | Self-contained coding problems | Entry-level |
| SWE-bench | 2023 | Real GitHub issues, full repos | Professional |
| SWE-bench Verified | 2024 | Curated subset with strong tests | Professional |
| SWE-bench+ | 2024 | Fixed leakage and weak tests | Professional |
| SWE-bench Multimodal | 2024 | Visual software (JS, UI elements) | Professional+ |
| SWELancer | 2025 | Freelance coding tasks, valued in dollars | Real-world |
| ITBench | 2025 | IT automation tasks | Enterprise |

The SWE-bench family keeps getting refined because the original had problems: some solutions leaked through issue descriptions, test cases were too weak (accepting wrong answers), and tasks varied wildly in difficulty. Each variant (Lite, Verified, +, Multimodal) addresses a specific flaw.

Web & SWE Benchmark Evolution

The progression from simple simulations to real-world environments. Each row shows a benchmark's realism level (x-axis) and difficulty (circle size). Drag the timeline to see how evaluation evolved.

The SWELancer insight: By connecting agent performance to monetary value (how much the freelance task pays), SWELancer lets you directly measure the economic viability of agent deployment. If your agent can solve a $500 task but costs $200 in API calls, that is useful. If it costs $600, it is not. This is the first benchmark to make this explicit.
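The arithmetic behind that comparison can be written down directly. This is a back-of-the-envelope sketch: the function names are hypothetical, and it assumes the cost is paid on every attempt while the payout is earned only on success.

```python
def net_value(payout_usd: float, success_rate: float, cost_per_attempt_usd: float) -> float:
    """Expected profit per attempted task under the assumptions stated above."""
    return payout_usd * success_rate - cost_per_attempt_usd

def break_even_success_rate(payout_usd: float, cost_per_attempt_usd: float) -> float:
    """Minimum success rate at which attempting the task stops losing money."""
    return cost_per_attempt_usd / payout_usd

print(net_value(500, 1.0, 200))           # the "useful" case from the text
print(net_value(500, 1.0, 600))           # the "not useful" case
print(break_even_success_rate(500, 200))
```

With a $500 payout and $200 in API costs, the agent nets $300 when it always succeeds, and it needs to succeed at least 40% of the time to break even.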
Why does the SWE-bench family keep producing new variants (Lite, Verified, +, Multimodal)?

Chapter 5: Scientific & Conversational Agents

Scientific Agents

Scientific agent evaluation has grown from "can you answer a science question?" to "can you do actual science?" The survey traces this evolution across four stages of the scientific process:

| Stage | What the agent does | Benchmark |
| --- | --- | --- |
| Ideation | Generate novel, expert-level research ideas | Si et al. (2025) |
| Experiment design | Plan experiments with proper methodology | AAAR-1.0 |
| Code execution | Write accurate, executable scientific code | SciCode, ScienceAgentBench, CORE-Bench |
| Peer review | Generate substantive, accurate feedback | Chamoun et al. (2024) |

Concept: Testing whether an agent can answer science questions (ScienceQA) is categorically different from testing whether it can do science (DiscoveryWorld). Realization: DiscoveryWorld simulates complete scientific discovery cycles — hypothesis formation, experimentation, result interpretation — across 120 diverse tasks. This is the difference between testing if a student can pass an exam vs. testing if they can run a lab.

Two integrated frameworks stand out:

Conversational Agents

Customer service agents are a deceptively hard evaluation target. The agent must hold a multi-turn conversation, follow company policies, call the right functions in the right order, and communicate accurately. A single wrong action — refunding the wrong order, violating a return policy — has real consequences.

The benchmarks here span two approaches:

tau-Bench (Yao et al., 2024) is the standout: it emulates conversations between an agent and an LLM-simulated user in airline and retail domains. The benchmark includes databases, APIs, and domain policies. Each task has ground truth for both the expected database write and the correct user response. With 115 retail and 50 airline tasks, it tests whether agents can do the job without violating any rules.
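A minimal sketch of the two-part check described above: the final database must match the expected state, and the agent's replies must contain the required information. The data shapes and the substring test are assumptions made for illustration, not the benchmark's real implementation.

```python
def task_passed(
    final_db: dict,
    expected_db: dict,
    agent_replies: list[str],
    required_info: list[str],
) -> bool:
    """Pass only if the database write AND the communicated facts are both right."""
    db_ok = final_db == expected_db
    transcript = " ".join(agent_replies).lower()
    info_ok = all(item.lower() in transcript for item in required_info)
    return db_ok and info_ok

# Hypothetical retail task: refund one order and tell the user the amount.
expected = {"order_17": "refunded"}
replies = ["Your refund of $25 has been issued."]

print(task_passed({"order_17": "refunded"}, expected, replies, ["$25"]))
print(task_passed({"order_18": "refunded"}, expected, replies, ["$25"]))
```

The second call fails because the agent refunded the wrong order, even though its reply to the user reads perfectly well.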

IntellAgent (Levi & Kadar, 2025) goes further: given a database schema and company policies document, it automatically generates evaluation scenarios. It constructs a policy graph, samples policy combinations, creates events, simulates dialogues, and has a critique agent analyze performance. This is evaluation-as-code: no manual test creation needed.
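To give a feel for the "policy graph, then sample policy combinations" step, here is a toy weighted random walk over a hand-written graph. The policy names, the edge weights, and the sampling rule are all hypothetical; this is not IntellAgent's actual algorithm, only one plausible shape such a sampler could take.

```python
import random

# Hypothetical policy graph: an edge weight says how plausibly two policies co-occur.
POLICY_GRAPH = {
    "refund_window": {"id_verification": 0.9, "damaged_item": 0.6},
    "id_verification": {"refund_window": 0.9, "address_change": 0.7},
    "damaged_item": {"refund_window": 0.6},
    "address_change": {"id_verification": 0.7},
}

def sample_policies(graph: dict, size: int, rng: random.Random) -> list[str]:
    """Weighted random walk collecting up to `size` distinct policies."""
    current = rng.choice(sorted(graph))
    chosen = [current]
    while len(chosen) < size:
        options = {p: w for p, w in graph[current].items() if p not in chosen}
        if not options:
            break  # dead end: return a smaller combination
        names = list(options)
        current = rng.choices(names, weights=[options[n] for n in names], k=1)[0]
        chosen.append(current)
    return chosen

scenario = sample_policies(POLICY_GRAPH, 3, random.Random(0))
print(scenario)
```

Each sampled combination would then seed one generated scenario, so that test cases exercise policies that realistically interact rather than arbitrary sets.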

What makes DiscoveryWorld fundamentally different from earlier scientific agent benchmarks like ScienceQA?

Chapter 6: Generalist Benchmarks

Application-specific benchmarks tell you if your agent can do one job well. But the promise of LLM agents is generality — one system that can handle web search today, database queries tomorrow, and code review next week. Generalist benchmarks test this breadth.

Three benchmark families dominate:

GAIA (Mialon et al., 2023)

466 human-crafted, real-world questions that require reasoning, multi-modal understanding, web navigation, and tool use — often all at once. A single GAIA question might require searching the web, extracting data from a PDF, computing something with Python, and synthesizing the answer. These are the kinds of tasks a capable human assistant handles routinely but that expose every weakness in an agent.

AgentBench (Liu et al., 2023)

A suite of eight interactive environments: operating system commands, SQL databases, digital games, knowledge graphs, household tasks, web browsing, and more. The key insight is testing the same agent across wildly different domains. An agent that aces web browsing but fails at OS commands reveals brittle, domain-dependent strategies rather than general intelligence.

Real-World Professional Benchmarks

The newest generation, including TheAgentCompany, pushes even further by simulating professional work environments.

Concept: Generalist benchmarks expose a brutal truth: agents that appear competent in demos often score below 25% on these comprehensive tests. TheAgentCompany's best agent scores ~24%. GAIA's hardest level (Level 3) sees agents below 10%. Realization: The gap between "impressive demo" and "reliable deployment" is not incremental — it is an order of magnitude. Generalist benchmarks are the reality check.
Generalist Benchmark Comparison

Compare the scope and best-known agent performance across major generalist benchmarks. Lower bars = harder benchmarks where current agents still struggle.

Why are generalist benchmarks important even if you only plan to deploy your agent for one specific application?

Chapter 7: Evaluation Frameworks

Benchmarks tell you what to measure. Frameworks tell you how. They are the infrastructure that makes evaluation repeatable, scalable, and actionable. The survey identifies three levels of granularity at which frameworks operate:

Level 1: Final Response Evaluation

Did the agent produce the right answer? This is the simplest level. Most frameworks use LLM-based judges to evaluate agent responses against predefined criteria. Some (Databricks Mosaic, PatronusAI) offer proprietary judge models trained specifically for this purpose.

Level 2: Stepwise Evaluation

Was each individual action correct? This level assesses tool selection, parameter extraction, and execution output at every step. Galileo introduces an action advancement metric: did this step move the agent closer to the goal? This is subtler than binary pass/fail — a step might be "correct" (valid tool call, correct parameters) but pointless (didn't advance the task).
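The distinction between a valid step and an advancing step can be sketched with a set of outstanding subgoals. In practice such a metric is typically judged by an LLM; the rule-based version below, including the `valid` and `achieves` fields, is a hypothetical simplification and not Galileo's implementation.

```python
def score_steps(steps: list[dict], required_subgoals: list[str]):
    """Label each step as valid and/or advancing.

    Each step is a dict with 'valid' (bool) and 'achieves' (a subgoal name or None).
    A step advances only if it is valid AND completes a subgoal still outstanding.
    """
    remaining = set(required_subgoals)
    labels = []
    for step in steps:
        advanced = step["valid"] and step["achieves"] in remaining
        if advanced:
            remaining.discard(step["achieves"])
        labels.append({"valid": step["valid"], "advanced": advanced})
    rate = sum(l["advanced"] for l in labels) / len(labels) if labels else 0.0
    return labels, rate

steps = [
    {"valid": True, "achieves": "find_order"},
    {"valid": True, "achieves": "find_order"},   # correct call, but redundant
    {"valid": True, "achieves": "issue_refund"},
]
labels, rate = score_steps(steps, ["find_order", "issue_refund"])
print(labels, rate)
```

All three steps are valid, yet only two of them advance the task; a binary per-step pass/fail would score this trajectory as perfect.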

Level 3: Trajectory-Based Assessment

Was the overall path good? Google Vertex AI and LangSmith analyze the full sequence of steps against an expected optimal path. AgentEvals supports LLM-as-judge evaluation of trajectories with or without reference trajectories. For LangGraph-based agents, it can assess whether the agent followed the expected workflow graph.
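Comparing an actual trajectory with a reference one can be done at several strictness levels. The functions below are a plain-Python sketch of the kinds of comparisons trajectory evaluators offer; the names are illustrative and are not any framework's API.

```python
def exact_match(actual: list[str], reference: list[str]) -> bool:
    """Same steps, same order, nothing extra."""
    return actual == reference

def in_order_match(actual: list[str], reference: list[str]) -> bool:
    """Every reference step appears in `actual` in order; extra steps are allowed."""
    remaining = iter(actual)
    return all(step in remaining for step in reference)  # consumes the iterator

def precision(actual: list[str], reference: list[str]) -> float:
    """Fraction of the agent's steps that belong to the reference path."""
    if not actual:
        return 0.0
    ref = set(reference)
    return sum(step in ref for step in actual) / len(actual)

def recall(actual: list[str], reference: list[str]) -> float:
    """Fraction of reference steps that the agent actually took."""
    if not reference:
        return 0.0
    taken = set(actual)
    return sum(step in taken for step in reference) / len(reference)

reference = ["search", "buy"]
actual = ["search", "retry", "buy"]
print(exact_match(actual, reference), in_order_match(actual, reference))
print(precision(actual, reference), recall(actual, reference))
```

Here the agent covers the whole reference path (recall 1.0) but wastes one step (precision 2/3), a distinction that exact match alone collapses into a single failure.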

Concept: The three levels form a hierarchy: final response is the cheapest but least informative, trajectory is the richest but hardest to judge. Realization: Most teams default to Level 1 (final response) because it is easy. But the actionable insights — "the agent chose the wrong tool at step 3" or "the agent took 47 steps when 5 would suffice" — only come from Levels 2 and 3. The overhead of fine-grained evaluation is the price of understanding your agent.
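
| Framework | Stepwise | Trajectory | Monitoring | A/B Testing | Synthetic Data |
| --- | --- | --- | --- | --- | --- |
| LangSmith | Yes | Yes | Yes | Yes | No |
| Langfuse | Yes | No | Yes | Yes | No |
| Google Vertex AI | Yes | Yes | Yes | Yes | No |
| Galileo | Yes | No | Yes | Yes | Yes |
| PatronusAI | Yes | No | Yes | Yes | Yes |
| Mosaic AI | Yes | No | Yes | Yes | Yes |
| AgentEvals | No | Yes | No | No | No |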

Beyond monitoring and assessment, the survey highlights Gym-like environments, inspired by OpenAI Gym, that provide standardized, interactive settings for training and evaluating agents.

These provide the controlled, repeatable environments that passive monitoring cannot: you can replay trajectories, inject failures, and test edge cases systematically.
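A toy environment shows what "replay" and "inject failures" look like in code. This sketch only borrows the reset/step shape of Gym-style APIs in simplified form (three return values, no external dependency); the `FlakyApiEnv` task and the scripted policy are invented for illustration.

```python
import random

class FlakyApiEnv:
    """Gym-style episode: the agent must 'fetch' successfully, then 'submit'."""

    def __init__(self, failure_rate: float = 0.0, seed: int = 0, max_steps: int = 10):
        self.failure_rate = failure_rate  # injected probability that a fetch fails
        self.seed = seed
        self.max_steps = max_steps

    def reset(self) -> dict:
        self.rng = random.Random(self.seed)  # same seed gives an identical replay
        self.fetched = False
        self.t = 0
        return {"fetched": self.fetched}

    def step(self, action: str):
        self.t += 1
        reward, done = 0.0, False
        if action == "fetch":
            if self.rng.random() >= self.failure_rate:
                self.fetched = True
        elif action == "submit" and self.fetched:
            reward, done = 1.0, True
        if self.t >= self.max_steps:
            done = True
        return {"fetched": self.fetched}, reward, done

def run_episode(env: FlakyApiEnv):
    """Scripted policy: retry fetch until it works, then submit."""
    obs, total, steps, done = env.reset(), 0.0, 0, False
    while not done:
        action = "submit" if obs["fetched"] else "fetch"
        obs, reward, done = env.step(action)
        total += reward
        steps += 1
    return total, steps

print(run_episode(FlakyApiEnv(failure_rate=0.0)))
print(run_episode(FlakyApiEnv(failure_rate=1.0)))
```

Raising `failure_rate` turns the same task into a robustness test, and fixing `seed` makes any failing episode reproducible for debugging.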

What advantage does Galileo's "action advancement metric" have over standard stepwise evaluation?

Chapter 8: Critical Gaps

The survey doesn't just catalogue what exists. It identifies what is missing. Four critical gaps threaten to make current evaluation frameworks inadequate as agents move toward real-world deployment.

Gap 1: Granular Evaluation

Most benchmarks report a single number: success rate. Did the agent complete the task? Yes or no. This tells you nothing about where it failed. Was it a bad plan? A wrong tool call? A misunderstood user intent? Without step-level and trajectory-level metrics, you cannot diagnose or fix agent failures.

The fix: Standardized, fine-grained evaluation metrics that capture each decision point. WebCanvas's navigational node completion rates are a step in this direction. LangSmith's trajectory analysis and Galileo's stepwise action advancement metric are another. But there is no standard yet.
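As a sketch of what finer-grained reporting could look like, the snippet below attributes each failed run to the first stage that went wrong, instead of reporting only a success rate. The stage names and the run format are hypothetical.

```python
from collections import Counter

def first_failure(run: list[tuple[str, bool]]):
    """Return the first stage that failed in one run, or None if every stage passed."""
    for stage, ok in run:
        if not ok:
            return stage
    return None

def failure_breakdown(runs: list[list[tuple[str, bool]]]):
    """Success rate plus a count of where failed runs first went wrong."""
    causes = Counter()
    successes = 0
    for run in runs:
        stage = first_failure(run)
        if stage is None:
            successes += 1
        else:
            causes[stage] += 1
    rate = successes / len(runs) if runs else 0.0
    return rate, dict(causes)

runs = [
    [("intent", True), ("plan", True), ("tool_call", True)],
    [("intent", True), ("plan", True), ("tool_call", False)],
    [("intent", True), ("plan", False), ("tool_call", False)],
    [("intent", True), ("plan", True), ("tool_call", False)],
]
print(failure_breakdown(runs))
```

A bare "25% success" becomes "two failures at the tool call, one at planning", which points directly at what to fix first.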

Gap 2: Cost and Efficiency

Current evaluations almost exclusively measure accuracy. But a web agent that succeeds 80% of the time at $0.50 per task is usually more valuable than one that succeeds 85% of the time at $5.00 per task. As Kapoor et al. (2024) observed, ignoring cost "inadvertently drives the development of highly capable but resource-intensive agents, limiting their practical deployment."

The fix: Integrate cost as a core metric alongside accuracy. Track token usage, API expenses, inference time, and total resource consumption. SWELancer is the first benchmark to link performance to monetary value — more benchmarks should follow.
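One simple cost-aware metric is cost per successful task. The sketch below applies it to a comparison of an 80%-success agent at $0.50 per task and an 85%-success agent at $5.00 per task. The function name is hypothetical, and it assumes failed attempts cost the same as successful ones.

```python
def cost_per_success(cost_per_task_usd: float, success_rate: float) -> float:
    """Expected spend to obtain one successful task, assuming equal cost per attempt."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_task_usd / success_rate

cheap = cost_per_success(0.50, 0.80)
pricey = cost_per_success(5.00, 0.85)
print(round(cheap, 3), round(pricey, 3), round(pricey / cheap, 1))
```

The more accurate agent costs roughly nine times as much per solved task (about $5.88 versus about $0.63), a trade-off that an accuracy-only leaderboard cannot show.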

Gap 3: Safety and Compliance

Almost no benchmark comprehensively tests safety. Can the agent be jailbroken into harmful actions? Does it follow organizational policies? Does it handle adversarial inputs gracefully? ST-WebAgentBench and AgentHarm have started, but the coverage is minimal compared to the risk.

The fix: Multi-dimensional safety benchmarks that simulate adversarial scenarios, test policy compliance, and evaluate behavior in multi-agent settings where emergent risks arise.

Gap 4: Scalable, Continuously Updated Benchmarks

Static benchmarks get saturated. Models improve, the benchmark becomes easy, and the leaderboard stops being informative. Human-annotated benchmarks are expensive to refresh. The field needs automated benchmark generation and live benchmarks that evolve.

The fix: IntellAgent's automatic benchmark generation, BFCL's live versioning, and Agent-as-a-Judge (using LLM agents to evaluate other agents) all point toward scalable evaluation. But these approaches need validation — automated judges can have their own biases.

Concept: The four gaps — granularity, cost, safety, and freshness — are not independent. They compound. Realization: An agent that appears safe in a static benchmark might be unsafe in a dynamic one. An agent that appears efficient on a coarse metric might be wasteful at the step level. And a benchmark that was valid last month might be saturated this month. Solving each gap in isolation is not enough — the field needs evaluation systems that address all four simultaneously.
Evaluation Gap Radar

How well do current benchmarks cover each critical dimension? The shaded area shows the current state of coverage. Click a gap label to learn what is missing.

A company deploys an agent that scores 90% on SWE-bench Verified. In production, it costs $12 per task (3x human cost), occasionally violates data privacy policies, and its performance drops to 70% on new repositories not in the benchmark. Which gaps does this scenario illustrate?

Chapter 9: Connections

This survey sits at the center of a rapidly expanding web of agent research. Here is how it connects to other key work:

| Topic | Connection |
| --- | --- |
| Meta-Harness | If you want to build your own evaluation harness, start here. Meta's framework for evaluating coding agents directly implements the trajectory-based assessment discussed in this survey. |
| Dive into Claude Code | A concrete example of an SWE agent in the wild. Claude Code's architecture (tool use, planning, self-reflection, memory) maps directly onto the four fundamental capabilities this survey evaluates. |
| RLHF / DPO / RLEF | How do you train agents to get better? RLHF and DPO optimize for human preferences; RLEF (RL from Execution Feedback) optimizes for task completion. The evaluation methods surveyed here become the reward signals for these training algorithms. |
| SayCan / RT-2 / pi0 | Embodied agents face the same evaluation challenges but in the physical world. The survey's framework extends naturally: capabilities (manipulation, navigation), applications (household, industrial), and generalist (diverse tasks in simulation). |
| Kimi-K2 / DeepSeek-V3 | The underlying LLMs that power agents. Better base models (larger context, better reasoning) directly improve agent performance on these benchmarks, but the evaluation still requires agent-specific methodology. |

Cheat Sheet

| Dimension | Key benchmarks | What they test | Current best % |
| --- | --- | --- | --- |
| Planning | PlanBench, MINT, Natural Plan | Multi-step decomposition, state tracking | <30% (high complexity) |
| Tool Use | BFCL, ToolSandbox, NESTFUL | Function calling, chaining, statefulness | ~70-80% (simple), ~40% (complex) |
| Web Agents | WebArena, WorkArena++ | Dynamic web navigation, enterprise tasks | ~15-35% |
| SWE Agents | SWE-bench Verified, SWELancer | Real GitHub bugs, freelance tasks | ~50-72% (Verified) |
| Scientific | DiscoveryWorld, MLGym, CORE-Bench | Full research cycles, reproducibility | ~20-50% |
| Conversational | tau-Bench, IntellAgent | Policy compliance, multi-turn dialogue | ~40-65% |
| Generalist | GAIA, TheAgentCompany, HAL | Diverse, multi-skill tasks | ~2-55% (varies by level) |

The meta-insight of this survey: Agent evaluation is still in its early days. The field knows how to measure whether an agent gets the right answer, but barely knows how to measure whether it got there safely, efficiently, or for the right reasons. As agents become more autonomous and take more consequential actions, closing these evaluation gaps becomes not just a research problem but a safety imperative.
If you were designing an evaluation framework for a new LLM-based coding agent, which combination of elements from this survey would give you the most comprehensive assessment?